
8-bit Character support on architectures where the smallest addressable unit size is 64-bit in Clang and LLVM

FOSDEM 2022

Clang and LLVM have a long history of supporting a wide variety of CPUs, from 8-bit to 64-bit, on the assumption that the smallest addressable unit is always an 8-bit word. Although many types and their alignment can be defined with the “target datalayout” string, the “character” and “short” types are hard-coded in Clang and LLVM. When you compile with Clang you will get, for example:

@.str = private unnamed_addr constant [6 x i8] c"Hallo\00", align 8
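For reference, C input of roughly the following shape produces a constant like the one above. This is an assumed example, since the abstract does not show the source; the names `greeting` and `greeting_chars` are hypothetical. The six `i8` elements correspond to the five letters plus the terminating NUL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical source line: a string literal like this is the kind of
   input that ends up as a private constant such as @.str. */
const char *greeting = "Hallo";

/* Number of chars in the literal, including the terminating NUL. */
size_t greeting_chars(void)
{
    return sizeof("Hallo");
}
```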

Several proposals exist that offer solutions to this problem (e.g. FOSDEM 2012: “Adding 16-bit Character Support in LLVM”, or https://lists.llvm.org/pipermail/llvm-dev/2019-May/132080.html: “On removing magic numbers assuming 8-bit bytes”). Following these ideas, however, requires changes to over 120 files (Clang and LLVM v12.0.0), which makes maintaining a patch set nearly impossible.

Looking for a simpler approach, we explored a couple of alternatives. Two design goals have to be satisfied:

don’t change CHAR_BIT
keep CharWidth at 8 bits

Only the character alignment may be modified, to 64 bits. By modifying only 8 files (some of them merely to deal with character assertions), we end up with the desired result:

@.str = private unnamed_addr constant [6 x i64] [i64 72, i64 97, i64 108, i64 108, i64 111, i64 0], align 8
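The effect of this layout can be imitated on an ordinary host in plain C. The sketch below is only an illustration and not the patch itself: the names `unit_t` and `widen_string` are made up, and `uint64_t` stands in for the 64-bit addressable unit. Each character keeps its 8-bit value but occupies a whole unit.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: uint64_t models one 64-bit addressable unit of the target. */
typedef uint64_t unit_t;

/* Design goal from the text: CHAR_BIT itself stays at 8. */
_Static_assert(CHAR_BIT == 8, "sketch assumes 8-bit char on the host");

/* Copy n chars (including any terminating NUL) into n units,
   one character value per unit. Returns the number of units written. */
size_t widen_string(const char *src, size_t n, unit_t *dst)
{
    for (size_t i = 0; i < n; ++i) {
        /* Go through unsigned char so values above 127 are not sign-extended. */
        dst[i] = (unit_t)(unsigned char)src[i];
    }
    return n;
}
```

Calling `widen_string("Hallo", 6, out)` fills `out` with 72, 97, 108, 108, 111, 0, the same element values as in the `[6 x i64]` constant.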

This solution can also easily be adapted to machines with a minimal addressable unit of 16 or 32 bits. “WChar” can also be handled with minimal changes.
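To get a feel for what other unit sizes would cost, the storage footprint can be worked out with a few lines of C. This is a rough sketch under the assumption that, as in the 64-bit case, every 8-bit character occupies one whole addressable unit; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: every 8-bit character occupies one whole addressable unit,
   so the footprint grows linearly with the unit width. */
size_t string_footprint_bits(size_t n_chars, unsigned unit_bits)
{
    return n_chars * (size_t)unit_bits;
}

/* Bits per character that carry no data for a given unit width. */
unsigned padding_bits_per_char(unsigned unit_bits)
{
    const unsigned char_width = 8; /* CharWidth stays at 8 bits */
    assert(unit_bits >= char_width);
    return unit_bits - char_width;
}
```

For the six characters of "Hallo" including the terminator, this gives 96, 192 and 384 bits for 16-, 32- and 64-bit units respectively.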

As this solution is still under testing, the number of files changed might be further reduced, which should allow for a small and simple patch set.

Speakers: Thomas Pietsch