Over the years I’ve had several conversations with people about static integer types in programming languages. Given the number of subtleties in the design, and pitfalls in the use, of integer types it’s not surprising that most programmers are unaware of at least some of them. However, what is perhaps more surprising is that most programming language designers and implementers aren’t fully aware of some of the subtleties and pitfalls either. That leads to languages and their implementations baking in unintended, and sometimes unfortunate, design decisions. At the end of conversations on this topic, I’ve sometimes said “this stuff is hard and someone should think more about this and write it up”, but that’s an unsatisfying conclusion. So even though I don’t have all the answers – indeed, I’m fairly sure that I don’t know all the questions – I’m going to jot down my current thinking on this subject.
The basics
In this post I’m considering the design of fixed-width integer types in statically typed languages. By “statically typed” I mean types which are enforced at compile-time [1], whether they are explicitly spelt out in source code or inferred by the compiler. By “fixed width” I mean integers which fit within a number of bits determined at compile-time. These come in two flavours: those where the number of bits is the same on all platforms (e.g. u32 typically means “an unsigned 32-bit integer”); and those where the bits are parameterised by the platform (e.g. Rust’s usize might be 16-bits, 32-bits, 64-bits, or bigger depending on the underlying platform). I won’t be considering floating point numbers, arbitrarily-sized integers (i.e. those whose storage size varies at run-time depending on the number they are storing), or the like.
An obvious question – and one that I imagine most new programmers have – is why do we have more than one integer type? Why, on a 64-bit machine, do I not always use 64-bit integers? There are various possible explanations, but to me two stand out. First, we often have to interact with things (from external libraries to the hardware we’re running on) that require integers of certain widths. Second, even though our computers often have large quantities of DRAM, memory is not infinite, and even when there’s plenty of free memory, using it wastefully can slow programs down. For example, imagine that I’m writing a system which records students’ marks out of 10: I only need 4-bits to represent all the valid marks, meaning that I need 16 times less memory relative to a 64-bit integer. If my program records lots of students’ marks, then the memory saving between 4-bits and 64-bits will probably have a noticeable effect on the program’s performance, because the caches in modern CPUs are tiny relative to DRAM. For example, my desktop computer has 32GiB DRAM but the L1 cache (split into two equal-sized sub-parts for instructions and data) is a mere 64KiB (about 0.0002% of the size of DRAM): saving memory is an effective way of improving program performance.
In practice, few programming languages offer the user direct access to 4-bit integers [2]. Since truncating the range of numbers we need to deal with is nearly always a terrible idea, we therefore have to round up our ideal integer width to the next highest representable integer type. I would thus probably represent students’ marks as an 8-bit integer, but if I needed to represent the distance between two points on earth in miles I’d probably use a 16-bit number (although the maximum possible distance, 8000 miles, can be represented in 13-bits). Interestingly, although it’s invisible to us as normal programmers, compilers and CPUs also sometimes round integer types up even further than we realise. On a 64-bit machine we might want to store a 32-bit integer in our program, but there’s nothing to stop the compiler or the CPU treating it as a 64-bit value for arithmetic operations, provided whenever we can observe the value it’s 32-bits wide.
Signed and unsigned integers
Although I have been clear about the maximum value I wanted to represent in those examples, I’ve been deliberately vague about the minimum value. Although the minimum integer could be any value, we only really need to worry about whether the minimum is zero or less than zero (i.e. a negative value).
If the minimum value is zero, we can use unsigned integer types: their minimum value is 0 and their maximum value 2^n - 1 (both inclusive), where n is the number of bits for that integer width (i.e. an 8-bit integer can represent 0..(2^8 - 1) = 0..255 inclusive).
If the minimum value is less than zero, we need to use signed integers. Unlike unsigned integer types, whose representation is obvious (the minimum value has no bits set; the maximum value has all bits set), things are a bit trickier with signed integer types. The most common representation is two’s complement such that the minimum value is -(2^(n-1)) and the maximum value is 2^(n-1) - 1. The resulting asymmetry – an 8-bit integer can represent -128..127 – can feel a bit icky, but two’s complement has the advantage that arithmetic operations are identical no matter the signedness of the integer. Since CPUs have universally used two’s complement since the late 1960s (I know of only one, very niche, exception [3]), I think it’s safe to assume that two’s complement is the only representation any of us will ever encounter.
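A tiny Rust snippet, added here for illustration, makes these ranges and the two’s complement asymmetry concrete:

fn main() {
    println!("{}..{}", u8::MIN, u8::MAX); // 0..255
    println!("{}..{}", i8::MIN, i8::MAX); // -128..127
    // Two's complement: the same all-bits-set byte is 255 as a u8 but -1 as an i8.
    println!("{} {}", 0xFFu8, 0xFFu8 as i8); // 255 -1
}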
Why do we have both signed and unsigned integers? Many people think that a good reason to have both is because we often use computers to count things that are naturally non-negative integers, for example the number of apples you’ve put in your online shopping basket. Using unsigned types in such cases feels like a sensible idea but there are downsides. For example, what’s the value of 2 - 3 as an unsigned integer? And what happens if you later discover that you need to represent negative quantities, for example the number of rotten apples that you need to refund the customer? You often end up in the wonderful situation where you have both unsigned and signed versions of your counts at different points of your program, not to mention the difficulties in dealing with edge cases when converting between the two types. I still have nightmares about the time I forgot to account for the asymmetric range of signed integer types when checking to see if an unsigned integer could fit in a signed integer — debugging that took quite a while.
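To make both pitfalls concrete, here is a small Rust snippet added for illustration (wrapping_sub is used so the subtraction wraps rather than panicking in debug builds):

use std::convert::TryFrom;

fn main() {
    // "2 - 3" on an unsigned 32-bit integer wraps around to a huge number.
    println!("{}", 2u32.wrapping_sub(3)); // 4294967295

    // The asymmetric range trap: this u32 value fits comfortably in 32 bits,
    // but not in an i32, whose maximum is 2147483647.
    let x: u32 = 2_147_483_648;
    println!("{}", i32::try_from(x).is_ok()); // false
}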
Such problems would go away entirely if we only ever used signed integers. However, signed integers don’t work very well when we’re representing things to do with memory. For example, if, on a machine with a 32-bit address range, we used signed integers to represent the size of an array of bytes, we could only easily represent sizes of at most 2GiB — despite having an address space of 4GiB. That issue is less of a problem on 64-bit machines (63-bits represents a lot of memory) but what would a signed memory address mean and what happens if/when we want to use all 64-bits to represent memory addresses [4]? While high-level languages can easily forego unsigned integer types, they seem to be a necessary part of lower-level programming languages.
Native word width
The chances are that the computer (be it a desktop, laptop, phone, or otherwise) you’re reading this article on is a 64-bit machine, by which we mean its native word width is 64-bits. We use this term so often that, in my experience, we rarely take a step back to think about what this really means. A typical approximation is that the word width defines both the “main” integer width as well as the width of addresses: we expect operations upon those to be particularly efficient (implying, for example, that most CPU registers are word width). On x86, or amd64/x86-64 [5], hardware – the hardware that defines most software developers’ view of hardware – this is a safe approximation to reality.
Old duffers like me remember the days of not just 32-bit machines, but 16 bit and 8-bit machines: we have direct experience that the world is not just about 64-bit word widths. Surprisingly few of us realise that, even today, many “embedded” devices have a native word width of 8-bits. Indeed, it was only in the 1970s that the notion of multiple-of-8 word widths became common: a quick delve into computing history turns up an astonishing variety of word widths. I can remember my literal astonishment when I saw a decimal (i.e. non binary) computer at The National Museum of Computing — I had read of such things but did not know that specimens still existed. However, until I looked into things a bit deeper for this article, I had no idea of other oddities like 27-bit machines!
The more you look into the notion of “word width”, the fuzzier the concept becomes. One obvious reason is backwards compatibility. On an amd64/x86-64 machine, for example, the terminology is compatible with its 16-bit predecessors: a WORD is 16-bits (due to the 8086), a DWORD is 32-bits (due to the 386), and a QWORD 64-bits.
There are deeper reasons for this fuzziness. The first computer my family acquired was the (now largely forgotten) Dragon 32 — an 8-bit computer named for its 32KiB of memory. That should immediately ring alarm bells: assuming each address is 8 bits wide, and addresses reference bytes, then an 8-bit address can only reference 256 bytes of memory. In practice, memory is addressed by combining more than one 8-bit value into a wider address.
We thus need to differentiate between a platform’s data width (i.e. the width of an integer; e.g. 8-bits on the Dragon 32) and address width (16-bits on the Dragon 32). The problem [6] is then how does one combine (smaller) data-width values to reference (larger) addresses? Various schemes have been devised to bridge this gap, from building addresses from two or more registers, to bank switching, and beyond.
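To make the simplest such scheme concrete, here is a tiny Rust sketch added for illustration: two 8-bit (data-width) values are combined into a single 16-bit (address-width) value.

fn main() {
    // Two 8-bit (data width) values...
    let hi: u8 = 0x7F;
    let lo: u8 = 0xFF;
    // ...combined into one 16-bit (address width) value: hi becomes the top
    // 8 bits, lo the bottom 8 bits.
    let addr: u16 = ((hi as u16) << 8) | (lo as u16);
    println!("{:#06x}", addr); // 0x7fff, the highest address in a 32KiB range starting at 0
}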
The future of address and data widths
The reason we need to care about address and data widths is that, as soon as one programs in a language higher-level than assembly language, one sometimes wants to treat pointers as integers and vice versa. I most often need to do this when mucking around with pointer tagging (i.e. using the always-set-to-zero bits of an aligned pointer to store extra information), where I need to treat the pointer as an integer, set or unset the bits I care about, then treat the integer as a pointer again. In order to do this, I need to know precisely what the rules are for treating pointers as integers and vice versa. The rules are trivial if the data and address widths are the same, but I hope the previous section convinced you that, historically, this has not always been the case.
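As a concrete illustration, here is a minimal pointer-tagging sketch in Rust, added for illustration and assuming an ordinary 64-bit platform where data and address widths match; it relies on the bottom bits of an aligned pointer always being zero.

fn main() {
    // A Box<u64> is (at least) 8-byte aligned, so the bottom 3 bits of its
    // address are always zero: we can stash a flag in the lowest bit.
    let p: *mut u64 = Box::into_raw(Box::new(42u64));

    let tagged: usize = (p as usize) | 0b1; // pointer -> integer, set tag
    let flag = tagged & 0b1;                // read the tag back
    let q = (tagged & !0b1) as *mut u64;    // clear tag, integer -> pointer

    println!("{} {}", flag, unsafe { *q }); // prints "1 42"
    unsafe { drop(Box::from_raw(q)) };      // free the allocation
}

The casts to and from usize are exactly the pointer/integer round-trips whose rules we need to understand precisely.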
You might be wondering whether it’s worth worrying about such issues when it’s rather unlikely that anyone will ever again produce a 27-bit machine. However, I believe it’s quite possible that future hardware will, once again, have different data and address widths. There are several possible reasons why one might do so, but I want to use an example that I know at least a little about.
CHERI is an extension to instruction sets that allows the enforcement of hardware capabilities. On current mainstream platforms, different processes can be mutually distrusting, but within a process every piece of code has access to every part of the process’s memory. CHERI adds hardware permissions to pointers, allowing different parts of the same process to be mutually distrusting, for example handing out pointers that allow access to only limited portions of the process’s memory. The potential for increasing security without suffering the overhead of multiple processes is significant. Those with long memories will know that the idea of capabilities isn’t new (for example, the E language is a fascinating exploration of this idea), but what’s meaningfully different about CHERI is that it’s enforced in hardware and adaptable to modern instruction sets. For example, in 2022 there will be physical, fairly modern, Armv8 processors extended with CHERI available for early adopters to play with.
Amongst CHERI’s clever tricks is how permissions are carried around: instead of a pointer being just a memory address, under CHERI it becomes a memory address plus a set of permissions (plus a secret bit stored somewhere hidden so that the hardware can tell if you’ve forged permissions). Although the size of these permissions varies in different CHERI-enabled instruction sets, the most common encoding on 64-bit machines is that a pointer is a 64-bit memory address and a 64-bit set of permissions — 128-bits wide in total. In other words, a typical CHERI CPU has different data and address widths [7]. History, it seems, has not yet ended.
Integer types and pointers
How do systems programming languages deal with platforms that have different data and address widths?
Let’s start with Rust, which offers the usize type, defining it as “The pointer-sized unsigned integer type. The size of this primitive is how many bytes it takes to reference any location in memory.” That definition is based on the assumption that a machine’s address and data widths are the same. How should one adapt Rust to work under CHERI given this mismatch in assumptions? Well, one could devise various clever schemes to try and force Rust to acknowledge this difference, but doing so will break a lot of Rust code that has no choice but to rely on usize’s existing guarantee of identical address and data widths (including some that I’ve written), mostly in ways that won’t be statically discoverable. The only pragmatic scheme I can think of is to define usize to be the wider of the data and address width (i.e. 128 bits on most CHERI devices). This will allow most Rust code to run correctly without change (other than being recompiled). This doesn’t entirely guarantee correctness: for example, code that uses the size of a usize as a proxy for the data width will behave incorrectly. However, I think it’s reasonable to assume that this scheme will work well enough to be acceptable. It is, though, inherently wasteful: most usizes only ever store the size of an object in memory, meaning that 8 bytes of their storage space will never be used. That might not sound much, but I have written Rust programs where removing 8 bytes from a single, commonly used struct has increased performance by 5%. There is no doubt in my mind that doubling the size of usize will, therefore, lead to a noticeable slowdown (perhaps in the region of 5-10%) for many programs.
Other languages have different approaches to integer types, with an obvious alternative being to remove the notion of parameterised width types entirely. D (which I don’t claim to know well) is an example of a language where all integer types have an explicit, fixed width. This suggests that we no longer have to worry about Rust’s conflation of address and data widths into usize, but fixed width integer types don’t allow one to adapt to different word widths. D therefore defines size_t to be an alias for the underlying platform’s word width. I will leave to the theologians whether this means that D really only has fixed width integer types, but either way, the end result is that D’s size_t makes the same assumptions as Rust’s usize.
Rust and D’s conflation of address and data widths suggests that modern systems programming languages are in an awkward spot when it comes to address and data widths. However, there are too many systems programming languages in the wild for me to claim knowledge of all, or even most, of them. That means that I know that my next statement is almost certainly wrong for at least one language out there. However, as far as I know, the only language which is sufficiently flexible is C — at least, C as of the C99 version.
Let’s start with C’s uintptr_t [8], an integer type that’s guaranteed to be wide enough to store a pointer safely. At first glance this might look like Rust’s usize in disguise, but C’s guarantees are subtly different: simplifying heavily, uintptr_t will be at least as wide as the address width but may be wider or narrower than the data width. The formal definition is rather more subtle (simplified only slightly): C guarantees that the integer type uintptr_t is wide enough to represent every void* pointer. The reason for the use of void* is that pointer types in C do not all have to be the same width, but void* is guaranteed to be wide enough that any other pointer type can safely be cast to void* — thus uintptr_t is implicitly wide enough for any pointer.
Another of C’s integer types, size_t, represents the size of an object in memory. That might look like it’s related to the concept of the address width, but it’s subtly different. On segmented architectures like the 8086, size_t will represent a smaller integer type (16 bits in this case) than the address width (20 bits). In other words, size_t allows us to separate out the notion of “how big can any object be in memory” from “what is the address of this memory.” As this suggests, we should have been talking about address, data, and object widths (and not just address and data widths) all along!
On that note, you might think int or long represents the data width, but neither does, and indeed C has no integer type that corresponds to the data width. In part this is because of a mess of backwards compatibility: if you’re running OpenBSD on an amd64/x86-64 machine, int will be 32-bits and long will be 64-bits, but, on the same hardware, Windows defines both integer types to be 32-bits [9].
At this point (and even allowing for the simplifications in my explanation of C’s integer types), anyone who doesn’t consider themselves a C language lawyer might find their eyes glazing over. Fortunately there are two useful lessons for us.
We can see the first lesson in C’s definition of uintptr_t, which isn’t defined in terms of data or address widths, or even in some other notion of “address” — but, rather, the more abstract notion of a “pointer”. By deliberately abstracting away from the hardware, C is more adaptable than languages that make assumptions about hardware. This turns out to be very useful for CHERI. In contrast, Rust conflates the notion of address, data, and object widths into usize, making it hard to adapt naturally to a platform like CHERI.
The second lesson is, to some extent, a consequence of the first. Because C defines several integer types that abstract away from the hardware, their definition is invariably “at least wide enough to store X” rather than “exactly the right width to store X”. Similarly, few, if any, guarantees are made about the relationship between integer types. Thus, while C can easily adapt to a wide range of platforms, it does so in part by placing a considerable burden on the programmer to check whether they can, for example, safely store a size_t into a uintptr_t or vice versa. Using C’s integer types correctly over all possible platforms is, at the very least, exceedingly difficult. In contrast, Rust’s stronger guarantees are easier to work with.
Casting between integer types
Programmers often need to cast one integer type into another. For example, in C I can write this code:
#include <stdio.h>
int main(void) {
    int x = 0;
    long y = x;
    printf("%d %ld\n", x, y);
}
On my OpenBSD/amd64 machine, int is 32-bits and long 64-bits. Using clang --std=c99 -Wall to compile the above, there are no warnings or errors, and the resulting binary prints out 0 0 when run. Notice that on line 4, the 32-bit value stored in x was stored into a 64-bit variable y and that I didn’t have to do anything to make that “conversion” happen: C performed an implicit cast on my behalf. Fortunately, C guarantees that long is always at least as wide as int.
Consider instead this code:
#include <stdio.h>
#include <sys/limits.h>
int main(void) {
    long y = LONG_MAX;
    int z = y;
    printf("%ld %d\n", y, z);
}
This code also compiles without warning or error, printing out 9223372036854775807 -1 on my machine. If you dive into the C specification, you’ll discover that the result of the conversion I tried to perform is implementation-defined [10]: the standard does not pin down a single result, so different compilers and platforms can legitimately behave differently. In practice, the behaviour I observed can be explained very simply. y has the rightmost 63-bits set to 1; the implicit cast to the 32-bit int type on line 5 throws away the left-most 32-bits, but keeps the right-most 32-bits, which are still all set to 1; and because I’m running on a two’s complement platform, a 32-bit signed integer with all bits set represents the value -1.
The culprit here is the implicit cast from long to int which does something dangerous without me having explicitly asked for it. In my opinion, implicit casts are a terrible idea, and the program above should have caused a compile-time error. I believe that users should have to spell out explicit casts.
C even has syntax for explicit casts: I can change line 5 to int z = (int) y. Let’s briefly pretend that we can remove implicit casts from C: what do I gain? For me the killer example is that if I change the type of an attribute in a struct from int to long, I will get a compile-time error at every point in my program that tries to treat the attribute as an int. This makes tracking down all the code affected by my change trivial.
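Rust already works this way, and a small sketch, added here for illustration with a hypothetical Student struct, shows what the gain looks like: change the field’s type and every remaining assumption about the old type becomes a compile-time error.

use std::convert::TryFrom;

struct Student {
    mark: i64, // this field used to be an i32
}

fn main() {
    let s = Student { mark: 7 };
    // Every line like the following now fails to compile:
    //   let m: i32 = s.mark; // error[E0308]: mismatched types
    // forcing me to decide, case by case, how to handle the wider type:
    let m: i32 = i32::try_from(s.mark).unwrap();
    println!("{}", m);
}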
However, C’s explicit casts aren’t a sufficient safety feature on their own — indeed, in a sense, they deliberately override safety. In the example above where we cast long to int on OpenBSD/amd64, we get a dangerous result whether the cast is explicit or implicit. One might argue that such an example is such a blatant mistake that it’s not worth worrying about, but people make the same underlying mistake in situations where they have not quite understood the guarantees made about the relationship between integer types. For example, I might write the following code:
#include <stdio.h>
#include <stdint.h>
int main(void) {
    intptr_t y = ...;
    long z = (long) y;
    printf("%ld %ld\n", y, z);
}
I know that on my desktop machine both long and intptr_t are 64-bits and that casting between them never truncates data. However, on CHERI, intptr_t will be (at least) 128-bits while long is probably still 64-bits: the above cast will silently truncate data, which I almost certainly don’t want. There are several options for preventing this sort of truncation.
First, one can statically check the width of various integer types (e.g. in a configure script) and refuse to compile code on platforms where the assumptions about the relationship between integer types are invalid. This is good in the sense that it stops the program from so much as being compiled on a platform where it would run incorrectly, but bad in the sense that it does nothing to help you find and fix the problematic cases.
Secondly, one can dynamically check for truncation in a cast and raise an error if it occurs. For example, I can define a function longtoint_dyn to represent both a cast and a truncation check:
#include <err.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/limits.h>

int longtoint_dyn(long x) {
    if ((x < (long) INT_MIN) || (x > (long) INT_MAX)) {
        err(1, "%ld cannot be represented as an int", x);
    }
    return x;
}

int main(void) {
    int x = longtoint_dyn(10);
    printf("%d\n", x);
    int y = longtoint_dyn(LONG_MAX);
    printf("%d\n", y);
}
When run on my desktop machine this prints:
10
a.out: 9223372036854775807 cannot be represented as an int: Undefined error: 0
Third, one can make explicit casts fail if they will potentially truncate on the platform the program is being compiled for. Defining this in C is rather awkward, but you can do it semi-portably in C99 with something like:
#if INT_MAX < LONG_MAX
#define longtoint_static(x) _Pragma("GCC error \"unsafe cast\"")
#else
#define longtoint_static(x) ((int) x)
#endif
In essence, if the cast is always safe, longtoint_static compiles to a simple cast. If the cast might not be safe, longtoint_static causes a compile-time error [11].
I am extremely fond of these sorts of casts, as they allow me to write code that works well for the platforms I currently care about without compromising the potential to reliably adapt the code to other platforms I might encounter in the future. In essence, casts of this sort both document and enforce every simplifying assumption about integer types I make, and they do so without imposing a run-time performance penalty. If I do encounter a platform where one or more of my assumptions is invalid, I’m notified of each potentially problematic place in my code. Fixing this is a tedious task, but in essence I’ve built my own static analysis ahead of time.
A reasonable question is whether jumping through these hoops is solely a result of C being an old language [12]. Let’s take Rust as a good example of a modern language. Very sensibly, it forbids implicit casts between integer types entirely. However, it does have explicit casts with the as keyword. When I started writing Rust code, I was somewhat liberal in my use of as with integer types, until I realised that doing so can lead to surprising results, just as casts in C do.
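A couple of examples of the surprising results as can produce, added here for illustration; each line compiles without any warning:

fn main() {
    println!("{}", 300i32 as u8);    // 44: silently truncated modulo 256
    println!("{}", -1i32 as u32);    // 4294967295: reinterpreted, not an error
    println!("{}", u64::MAX as i32); // -1: the top 32 bits are simply dropped
}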
Fortunately, Rust has two other ways of casting between integer types. First are From casts. For example, usize::from(0u16) is a statically checked cast from a 16-bit number to a platform-parameterised usize: it always compiles successfully and is guaranteed never to truncate. However, the expression usize::from(0u32) fails to compile on my 64-bit desktop computer, even though the 32-bit to 64-bit conversion will never truncate. The reason for this discrepancy is two-fold: Rust guarantees that usize is at least 16-bits wide; and it only defines From casts for integer types that will succeed on every platform. Since casting a 32-bit integer to a usize would fail on a 16-bit platform, I’m not ever allowed to compile it — even on a 64-bit platform where such a cast would always succeed. In other words, From casts are like a less flexible version of the longtoint_static function above.
Second, Rust has TryFrom casts which allow one to check for possible truncation at run-time. For example, usize::try_from(0u32).unwrap() casts a 32-bit number to a usize and aborts the program (with the unwrap call) if the cast would truncate the input, rather like longtoint_dyn above. Looking at the generated machine code for usize::try_from(0u32).unwrap() on a 64-bit machine is interesting: when unoptimised (Rust’s ‘debug’ mode) the machine code really does check for truncation, even though truncation can never possibly happen; but when optimised (Rust’s ‘release’ mode), the machine code contains no truncation checks at all.
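A short snippet, added here for illustration, showing the two kinds of cast side by side:

use std::convert::TryFrom;

fn main() {
    // From: only provided for conversions that can never fail on any platform.
    let a = usize::from(0u16);
    // usize::from(0u32) would not compile: it could fail on a 16-bit platform.

    // TryFrom: checked at run-time, even when (as on a 64-bit machine) the
    // check can never actually fail.
    let b = usize::try_from(0u32).unwrap();
    println!("{} {}", a, b); // 0 0
}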
TryFrom casts are very useful, but there are two things missing. First, and most obviously, they can’t do what longtoint_static does and statically rule out conversions that are potentially unsafe on a given platform. I’ve sometimes defined an equivalent of longtoint_static using the static_assertions crate but, of course, the resulting casts are only available to whatever code I’m writing. Second, there are no guarantees that the checks in a TryFrom are optimised away, even if they can never fail [13]. I have seen diligent programmers writing performance critical code deliberately avoid using TryFrom because they are worried that the checks won’t be optimised away. That’s deeply unfortunate, but it’s difficult to criticise them for doing so.
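For the curious, here is roughly the shape such a helper can take: a minimal sketch of my own, assuming the static_assertions crate (and its const_assert! macro) is available as a dependency, and using a hypothetical u64_to_usize_static function name.

// Refuse to compile at all on platforms where usize is narrower than u64;
// on every platform that does compile, the cast below can never truncate.
static_assertions::const_assert!(
    std::mem::size_of::<usize>() >= std::mem::size_of::<u64>()
);

fn u64_to_usize_static(x: u64) -> usize {
    x as usize
}

fn main() {
    println!("{}", u64_to_usize_static(42)); // 42
}

This is the moral equivalent of longtoint_static above: the simplifying assumption is documented and enforced once, at compile-time, and the cast itself costs nothing at run-time.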
Although Rust’s casting features have weaknesses, I think that they provide clear pointers for a better design. I hope it’s uncontroversial that, like Rust, languages should not allow implicit casts between integer types at all. However, in my opinion languages should support three kinds of explicit casts for integer types:
- Those which always succeed. Rust’s as is a good example of what to do here: when casting to a narrower integer type, integers should be truncated; when casting to a wider integer type they should zero extend for unsigned types or sign extend for signed types.
- Those which dynamically check whether the input would be truncated and return an error (like TryFrom in Rust). If, on a given platform, the cast can be statically proven never to truncate, then the dynamic checks must be guaranteed to be fully optimised away (which Rust does not do [14]).
- Those which fail to compile if the underlying platform might truncate the input (which no language I know of has direct support for).
Programmers can then choose amongst these three kinds of casts as appropriate for different portions of their program: I’ve written programs (e.g. grmtools) where I use each of these kinds of explicit cast at different points.
Overflow
Consider the following C code:
#include <stdio.h>
#include <stdint.h>
#include <sys/limits.h>
int main(void) {
    int x = INT_MAX;
    int y = x + 1;
    printf("%d %d\n", x, y);
}
It compiles without error on my desktop machine but when run prints out 2147483647 -2147483648. Because INT_MAX is the maximum value a (signed) 32-bit int can hold, adding 1 to it causes overflow. On my machine, the particular behaviour I observed was that the integer “incremented” into the negative half of the two’s complement representation. This is a rich source of bugs, but in C there is no mechanism for detecting overflow after it has occurred — one has to check for it before it occurs by manually comparing values to the maximum they could be, which is surprisingly awkward to do.
Rust takes a different approach. Consider this program:
fn main() {
    println!("{} {}", i32::max_value(), i32::max_value() + 1);
}
If I compile that in debug mode and run it I get the following run-time error:
thread 'main' panicked at 'attempt to add with overflow', t.rs:2:20
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
In other words, Rust noticed that the addition would overflow, realised that would be a bad thing, and aborted the program, telling me exactly why and where it did so.
However, if I turn on rustc’s optimisations with -O and rerun the program, it not only doesn’t cause an error, but it prints 2147483647 -2147483648, just as C does.
There is one slight improvement in the situation relative to C: integer overflow is not undefined behaviour [15], and is guaranteed to perform two’s complement wrapping.
The reason that Rust checks for overflow in debug mode but does not do so in release mode is performance: when I asked a knowledgeable Rust person about this a couple of years ago, I was told that measurements made a few years ago showed that overflow checks made benchmarks about 5% slower on average. That’s a big enough difference that it was thought better to turn off overflow checks in release mode. Rust users can turn overflow checking back on even in release mode — though I doubt that many people know they can do so.
Fortunately, Rust gives users the ability to explicitly check for overflow during arithmetic operations with the checked_* family of methods, such as usize::checked_add. If adding the two values would lead to overflow, an error value is returned, otherwise the result of adding the two values is returned. This is a nice API and makes dealing with overflow much easier than in C. Even better, there is also the saturating_* family of methods which allow one to clamp the result at the maximum or minimum value (i.e. preventing wrapping). For example, i32::max_value().saturating_add(1) == i32::max_value().
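A small snippet, added here for illustration, showing both families in action:

fn main() {
    // checked_add returns None on overflow instead of wrapping or aborting.
    println!("{:?}", i32::MAX.checked_add(1)); // None
    println!("{:?}", 2i32.checked_add(3));     // Some(5)

    // saturating_add clamps at the type's maximum (or minimum) value.
    println!("{}", i32::MAX.saturating_add(1)); // 2147483647
}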
Rust’s approach is so much nicer than C’s that it took me a while to realise that an even better design is possible! My guess is that the checked_* family were added because they’re a direct mapping of a feature (checking for overflow) that most CPUs have direct support for. However, Rust forces you to opt into checked_* even though it’s the “safe” case — the default, at least in release mode, remains (dangerous) wrapping. It would be better, in my opinion, to invert this: I would like all additions to be checked unless I explicitly mark them as unchecked. Personally, the only reason I ever want unchecked arithmetic is in performance critical code where I can either guarantee that wrapping can’t occur or I’m willing to take the risk that it won’t. Very few portions of code are so performance critical that I would ever bother using unchecked arithmetic: at a rough guess, I’d suspect a sensible ratio of unchecked:checked arithmetic in an average chunk of code might be something like 1:1000.
The good news is that this approach is within our grasp with the unchecked_* family of methods, such as usize::unchecked_add. The bad news is that they’re currently only available in Rust nightly (and they’ve even disappeared from Rust nightly at some points), but hopefully they’ll make their way into Rust stable soon, and more people will feel able to turn on overflow checks in release mode — perhaps it might even become the default!
Other integer types
Although they’re little used, it’s worth briefly mentioning two other classes of integer types that C99 defines: the fast and least integer types. These have subtly different meanings, although the basic idea is the same: they allow you to express the idea that you want an integer type that’s at least n bits wide, but you don’t mind if the integer type the compiler selects for you has more than n bits. It’s easiest to see by example the difference between the two classes of integer types.
uint_fast32_t will choose the “fastest for this platform” integer type that’s at least 32-bits. For example, on a 64-bit machine it might be faster to perform 64-bit operations rather than 32-bit operations, at which point uint_fast32_t would be 64-bits wide. That’s generally not true on amd64/x86-64, for example, so one would expect uint_fast32_t to define a 32-bit integer on such platforms. OpenBSD does just that but, for reasons I can’t comprehend, Linux defines it as a 64-bit integer. In other words, on one of the most important hardware / OS combinations, using uint_fast32_t will generally slow your code down a bit. Perhaps unsurprisingly, I’ve never seen the fast types used in real code.
uint_least16_t defines an integer type that is at least 16-bits wide but may be wider. The description of these integer types in the C spec (including in C17) is particularly inscrutable, and I’m not entirely confident that I’ve understood it. However, I believe the intention is that the compiler will round a least type up to the smallest integer type it offers that is at least that wide: in other words, it won’t define it to be unnecessarily big. That also defines an implicit ordering: it’s always the case that the width of uint_leastM_t will be smaller than or equal to the width of uint_leastN_t if M < N.
As far as I can see, the fast types are a failure, and the least types are unused, possibly because no-one can understand exactly what the latter mean. I think the least types are a useful idea though, and if I was designing a new language, I’d consider adding support for it, although I wouldn’t consider it crucial.
Summary
There are other things I could talk about (e.g. pointer provenance or non-integral pointer types), but I think I’ve written enough on integer types for one lifetime.
What I hope I’ve done in this article is: show that real hardware has a more complex notion of “word width” than a narrow amd64/x86-64 perspective suggests; examine how well, or not, integer types in existing systems programming languages support this variety of hardware; and consider what this suggests for the design of integer types in a mythical future systems programming language. I’m not claiming that I’ve thought of every possible issue in this area, nor that the design suggestions I’ve made are the best possible, but they do seem to be an improvement on what existing languages offer.
Let me finish by summarising what I think the major design lessons for systems programming languages are:
- Both signed and unsigned integer types have their place in systems programming languages.
- It’s reasonable to assume that all arithmetic will be on a two’s complement representation.
- Real hardware does not have a single “word width”: we need separate notions of address, data, and object widths.
- The stronger the guarantees one makes about the relationship between integer types, the harder it is to adapt to different platforms.
- The weaker the guarantees one makes about the relationship between integer types, the harder they are to use correctly.
- Implicit casts are a terrible idea.
- There are three useful kinds of explicit casts: always succeed (truncating or extending as appropriate); dynamically check for truncation and fail if so; cause a compile-time error if the cast can potentially lead to truncation at run-time.
- Arithmetic operations should by default return an error at run-time on overflow; if users don’t want a particular arithmetic operation to fail, they should have to explicitly mark that operation as unchecked.
- Undefined behaviour on integer types is a terrible idea (though unspecified behaviour might have a place).
The final design lesson is that integer types are not an exciting aspect of language design and as a language designer it’s tempting to put off thinking about them in favour of more interesting features — certainly, this was the approach I took for many years. However, since programmers quickly embed assumptions about integer types throughout their programs, it becomes more difficult to change anything meaningful about them relative to many other language features. In my opinion, therefore, integer types are one of the things languages need to think about earlier rather than later.
Updated: (2021-06-23) Travis Downs pointed out that I’d misclassified the 8088. (2021-12-15) Alan Wu pointed out that the C main functions should have a void parameter list.
Acknowledgements: Thanks to Jacob Bramley, Edd Barrett, Aleks Bonin, and Jacob Hughes for comments.
Footnotes
[1] There are many interesting design challenges around numbers and dynamically typed languages, but that’s a very different topic!
[2] The only modern language I’ve come across which allows this is Zig, which allows integers to be any bit width between (I think) 1 and 65535. C99, perhaps inevitably, gives explicit flexibility for such integers to be defined, but no platform I know of actually does so.
[3] The UNISYS ClearPath Dorado – the last machine in this series was released in 2015, but it’s compatible with hardware going back to 1962!
[4] Most x86 CPUs, for example, currently only use 48-bits of the possible 64-bit address range. There are already some mainstream CPUs which use 57-bits (to represent 5 level page tables) and it would be a brave person who would bet against more bits being used at some future date.
[5] The naming of this architecture is a mess: AMD created it initially, hence “amd64”, but once Intel also adopted the instruction set, that name was always going to be contentious. Thus we’ve ended up in the confusing situation where for many, but not all, operating systems we tell people “download the x86-64 version called amd64”.
[6] I’m not aware of any machines whose data width is larger than their address width — if I had to guess, I’d suggest that no such machine has ever existed. UPDATE (2021-06-22): When will I learn not to guess?! Mark Bessey points out that System/360 machines have a 32-bit data width but a 24-bit address width.
[7] At this point the terminology becomes torturous: it might be better to say that CHERI’s pointer width is different than its address width. For the sake of consistency, I’ll stick with the existing terminology.
[8] For reasons that are unclear to me, uintptr_t is an optional type in C99. However, I don’t know of any platforms which support C99 but don’t define it.
[9] There is of course the wonderful long long integer type. Although I don’t know for sure, I like to think that at some point soon after long long was introduced, people realised that this approach wasn’t going to end well: would a new platform have come along that required a long long long integer type?!
[10] Undefined behaviour is, in my opinion, a terrible thing for everyone except people who write compiler optimisers. I talked a bit about that as part of another blog post.
[11] It’s possible to make this prettier with information like the line of the cast and so on, but it doesn’t fundamentally change things.
[12] C++ improves upon C: it has static_cast which roughly corresponds to longtoint_static, though C++ has no built-in equivalent of longtoint_dyn.
[13] That said, whenever I’ve actually looked at the machine code resulting from a TryFrom in Rust’s release mode, the checks are optimised away.
[14] I’m aware that our current compiler infrastructure – where we tend to separate the compiler front-end (e.g. rustc) from the optimiser and machine code generator (e.g. LLVM) – makes it difficult, and perhaps impossible, to guarantee optimisation outcomes. That is not a reason to say that we should never have such guarantees: it suggests we need to fix our compiler infrastructure to make it possible to make and honour such guarantees.
[15] To be specific, unsigned integer overflow in C has defined behaviour but signed integer overflow does not.