Home > Blog e-mail: laurie@tratt.net   twitter: laurencetratt   twitter: laurencetratt
email updates:
  |   RSS feed

Static Integer Types

June 22 2021

Blog archive

 
Last 10 posts
Static Integer Types
Automatic Video Editing
The Evolution of a Research Paper
Automatic Syntax Error Recovery
Stick or Twist?
Which Parsing Approach?
Alternative Sources of Advice
Minimum Times Tend to Mislead When Benchmarking
A Quick Look at Trait Objects in Rust
Why Aren’t More Users More Happy With Our VMs? Part 2
 
 
Over several years I’ve had several conversations with people about static integer types in programming languages. Given the number of subtleties in the design, and pitfalls in the use, of integer types it’s not surprising that most programmers are unaware of at least some of them. However, what is perhaps more surprising is that most programming language designers and implementers aren’t fully aware of some of the subtleties and pitfalls either. That leads to languages and their implementations baking in unintended, and sometimes unfortunate, design decisions. At the end of conversations on this topic, I’ve sometimes said “this stuff is hard and someone should think more about this and write it up” but it’s an unsatisfying conclusion. So even though I don’t have all the answers – indeed, I’m fairly sure that I don’t know all the questions – I’m going to jot down my current thinking on this subject.

The basics

In this post I’m considering the design of fixed-width integer types in statically typed languages. By “statically typed” I mean types which are enforced at compile-time [1] whether they are explicitly spelt out in source code or inferred by the compiler. By “fixed width” I mean integers which fit within a number of bits determined at compile-time. These come in two flavours: those where the number of bits is the same on all platforms (e.g. u32 typically means “an unsigned 32-bit integer); and those where the bits are parameterised by the platform (e.g. Rust’s usize might be 16-bits, 32-bits, 64-bits or bigger depending on the underlying platform). I won’t be considering floating point numbers, arbitrarily-sized integers (i.e. those whose storage size varies at run-time depending on the number they are storing), or the like.

An obvious question – and one that I imagine most new programmers have – is why do we have more than one integer type? Why, on a 64-bit machine, do I not always use 64-bit integers? There are various possible explanations, but to me two stand out. First, we often have to interact with things (from external libraries to the hardware we’re running on) that require integers of certain widths. Second, even though our computers often have large quantities of DRAM, memory is not infinite, and even when there’s plenty of free memory, using it wastefully can slow programs down. For example, imagine that I’m writing a system which records students’ marks out of 10: I only need 4-bits to represent all the valid marks, meaning that I need 16 times less memory relative to a 64-bit integer. If my program records lots of students' marks, then the memory saving between 4-bits and 64-bits will probably have a noticeable effect on the program’s performance, because the caches in modern CPUs are tiny relative to DRAM. For example, my desktop computer has 32GiB DRAM but the L1 cache (split into two equal-sized sub-parts for instructions and data) is a mere 64KiB (about 0.000002% the size of DRAM): saving memory is an effective way of improving program performance.

In practice, few programming languages offer the user direct access to 4-bit integers [2]. Since truncating the range of numbers we need to deal with is nearly always a terrible idea, we therefore have to round up our ideal integer width to the next highest representable integer type. I would thus probably represent students’ marks as an 8-bit integer, but if I needed to represent the distance between two points on earth in miles I’d probably use a 16 bit number (although the maximum possible distance, 8000 miles, can be represented in 13-bits). Interestingly, although it’s invisible to us as normal programmers, compilers and CPUs also sometimes round integer types up even further than we realise. On a 64-bit machine we might want to store a 32-bit integer in our program, but there’s nothing to stop the compiler or the CPU treating it as a 64-bit value for arithmetic operations, provided whenever we can observe the value it’s 32-bits wide.

Signed and unsigned integers

Although I have been clear about the maximum value I wanted to represent in those examples, I’ve been deliberately vague about the minimum value. Although the minimum integer could be any value, we only really need to worry about whether the minimum is zero or less than zero (i.e. a negative value).

If the minimum value is zero, we can use unsigned integer types: their minimum value is 0 and their maximum value 2n-1 (both inclusive) where n is the number of bits for that integer width (i.e. an 8-bit integer can represent 0..(2n-1) = 0..255 inclusive).

If the minimum value is less than zero, we need to use signed integers. Unlike unsigned integer types, whose representation is obvious (the minimum value has no bits set; the maximum value has all bits set), things are a bit trickier with signed integer types. The most common representation is two’s complement such that the minimum value is -(2n-1) and the maximum value is 2n-1-1. The resulting asymmetry – an 8 bit integer can represent -128..127 – can feel a bit icky, but two’s complement has the advantage that arithmetic operations are identical no matter the signedness of the integer. Since CPUs have universally used two’s complement since the late 1960s (I know of only one, very niche, exception [3]), I think it’s safe to assume that two’s complement is the only representation any of us will ever encounter.

Why do we have both signed and unsigned integers? Many people think that a good reason to have both is because we often use computers to count things that are naturally non-negative integers, for example the number of apples you’ve put in your online shopping basket. Using unsigned types in such cases feels like a sensible idea but there are downsides. For example, what’s the value of 2 - 3 as an unsigned integer? And what happens if you later discover that you need to represent negative quantities, for example the number of rotten apples that you need to refund the customer? You often end up in the wonderful situation where you have both unsigned and signed versions of your counts at different points of your program, not to mention the difficulties in dealing with edge cases when converting between the two types. I still have nightmares about the time I forgot to account for the asymmetric range of signed integer types when checking to see if an unsigned integer could fit in a signed integer — debugging that took quite a while.

Such problems would go away entirely if we only ever used signed integers. However, signed integers don’t work very well when we’re representing things to do with memory. For example, if, on a machine with a 32-bit address range, we used signed integers to represent the size of an array of bytes, we could only easily represent sizes of at most 2GiB — despite having an address space of 4GiB. That issue is less of a problem on 64-bit machines (63-bits represents a lot of memory) but what would a signed memory address mean and what happens if/when we want to use all 64-bits to represent memory addresses [4]? While high-level languages can easily forego unsigned integer types, they seem to be a necessary part of lower-level programming languages.

Native word width

The chances are that the computer (be it a desktop, laptop, phone, or otherwise) you’re reading this article on is a 64-bit machine, by which we mean its native word width is 64-bits. We use this term so often that, in my experience, we rarely take a step back to think about what this really means. A typical approximation is that the word width defines both the “main” integer width as well as the width of addresses: we expect operations upon those to be particularly efficient (implying, for example, that most CPU registers are word width). On x86, or amd64/x86-64 [5], hardware – the hardware that defines most software developers’ view of hardware – this is a safe approximation to reality.

Old duffers like me remember the days of not just 32-bit machines, but 16 bit and 8-bit machines: we have direct experience that the world is not just about 64-bit word widths. Surprisingly few of us realise that, even today, many “embedded” devices have a native word width of 8-bits. Indeed, it was only in the 1970s that the notion of multiple-of-8 word widths became common: a quick delve into computing history turns up an astonishing variety of word widths. I can remember my literal astonishment when I saw a decimal (i.e. non binary) computer at The National Museum of Computing — I had read of such things but did not know that specimens still existed. However, until I looked into things a bit deeper for this article, I had no idea of other oddities like 27-bit machines!

The more you look into the notion of “word width”, the fuzzier the concept becomes. One obvious reason is backwards compatibility. On an amd64/x86-64 machine, for example, the terminology is compatible with their 16-bit predecessors: a WORD is 16-bits (due to the 8086), a DWORD is 32-bits (due to the 386), and a QWORD 64-bits.

There are deeper reasons for this fuzziness. The first computer my family acquired was the (now largely forgotten) Dragon 32 — an 8-bit computer named for its 32KiB of memory. That should immediately ring alarm bells: assuming each address is 8 bits wide, and addresses reference bytes, then an 8-bit address can only reference 256 bytes of memory. In practise, memory can be addressed by combining more than one 8-bit value to reference an address.

We thus need to differentiate between a platform’s data width (i.e. the width of an integer; e.g. 8-bits on the Dragon 32) and address width (16-bits on the Dragon 32). The problem [6] is then how does one combine (smaller) data-width values to reference (larger) addresses? Various schemes have been devised to bridge this gap, from building addresses from two or more registers, to bank switching, and beyond.

The future of address and data widths

The reason we need to care about address and data widths is that, as soon as one programs in a language higher-level than assembly language, one sometimes wants to treat pointers as integers and vice versa. I most often need to do this when mucking around with pointer tagging (i.e. using the always-set-to-zero bits of an aligned pointer to store extra information), where I need to treat the pointer as an integer, set or unset the bits I care about, then treat the integer as a pointer again. In order to do this, I need to know precisely what the rules are for treating pointers as integers and vice versa. The rules are trivial if the data and address widths are the same, but I hope the previous section convinced you that, historically, this has not always been the case.

You might be wondering whether it’s worth worrying about such issues when it’s rather unlikely that anyone will ever again produce a 27-bit machine. However, I believe it’s quite possible that future hardware will, once again, have different data and address widths. There are several possible reasons why one might do so, but I want to use an example that I know at least a little about.

CHERI is an extension to instruction sets that allows the enforcement of hardware capabilities. On current mainstream platforms, different processes can be mutually distrusting, but within a process every piece of code has access to every part of the process’s memory. CHERI adds hardware permissions to pointers, allowing different parts of the same process to be mutually distrusting, for example handing out pointers that allow access to only limited portions of the process’s memory. The potential for increasing security without suffering the overhead of multiple processes is significant. Those with long memories will know that the idea of capabilities isn’t new (for example, the E language is a fascinating exploration of this idea), but what’s meaningfully different about CHERI is that it’s enforced in hardware and adaptable to modern instruction sets. For example, in 2022 there will be physical, fairly modern, Armv8 processors extended with CHERI available for early adopters to play with.

Amongst CHERI’s clever tricks is how permissions are carried around: instead of a pointer being just a memory address, under CHERI it becomes a memory address plus a set of permissions (plus a secret bit stored somewhere hidden so that the hardware can tell if you’ve forged permissions). Although the size of these permissions varies in different CHERI-enabled instruction sets, the most common encoding on 64-bit machines is that a pointer is a 64-bit memory address and a 64-bit set of permissions — 128-bits wide in total. In other words, a typical CHERI CPU has different data and address widths [7]. History, it seems, has not yet ended.

Integer types and pointers

How do systems programming languages deal with platforms that have different data and address widths?

Let’s start with Rust, which offers the usize type, defining it as “The pointer-sized unsigned integer type. The size of this primitive is how many bytes it takes to reference any location in memory.” That definition is based on the assumption that a machine’s address and data widths are the same. How should one adapt Rust to work under CHERI given this mismatch in assumptions? Well, one could devise various clever schemes to try and force Rust to acknowledge this difference, but doing so will break a lot of Rust code that has no choice but to rely on usize’s existing guarantee of identical address and data widths (including some that I’ve written), mostly in ways that won’t be statically discoverable.

The only pragmatic scheme I can think of will be to define usize to be the wider of the data and address width (i.e. 128 bits on most CHERI devices). This will allow most Rust code to run correctly without change (other than being recompiled). This doesn’t entirely guarantee correctness: for example, code that uses the size of a usize as a proxy for the data width will behave incorrectly. However, I think it’s reasonable to assume that this scheme will work well enough to be acceptable. However, it’s inherently wasteful: most usizes only ever store the size of an object in memory, meaning that 8 bytes of their storage space will never be used. That might not sound much, but I have written Rust programs where removing 8 bytes from a single, commonly used struct has increased performance by 5%. There is no doubt in my mind that doubling the size of usize will, therefore, lead to a noticeable slowdown (perhaps in the region of 5-10%) for many programs.

Other languages have different approaches to integer types, with an obvious alternative being to remove the notion of parameterised width types entirely. D (which I don’t claim to know well) is an example of a language where all integer types have an explicit, fixed width. This suggests that we no longer have to worry about Rust’s conflation of address and data widths into usize, but fixed width integer types don’t allow one to adapt to different word widths. D therefore defines size_t to be an alias for the underlying platform’s word width. I will leave to the theologians whether this means that D really only has fixed width integer types, but either way, the end result is that D’s size_t makes the same assumptions as Rust’s usize.

Rust and D’s conflation of address and data widths suggests that modern systems programming languages are in an awkward spot when it comes to address and data widths. However, there are too many systems programming languages in the wild for me to claim knowledge of all, or even most, of them. That means that I know that my next statement is almost certainly wrong for at least one language out there. However, as far as I know, the only language which is sufficiently flexible is C — at least, C as of the C99 version.

Let’s start with C’s uintptr_t [8], an integer type that’s guaranteed to be wide enough to store a pointer safely. At first glance this might look like Rust’s usize in disguise, but C’s guarantees are subtly different: simplifying heavily, uintptr_t will be at least as wide as the address width but may be wider or narrower than the data width. The formal definition is rather more subtle (simplified only slightly): C guarantees that the integer type uintptr_t is wide enough to represent every void* pointer. The reason for the use of void* is that pointer types in C do not all have to be the same width, but void* is guaranteed to be wide enough that any other pointer type can safely be cast to void* — thus uintptr_t is implicitly wide enough for any pointer.

Another of C’s integer types size_t represents the size of an object in memory. That might look like it’s related to the concept of the address width, but it’s subtly different. On segmented architectures like the 8086, size_t will represent a smaller integer type (16 bits in this case) than the address width (20 bits). In other words, size_t allows us to separate out the notion of “how big can any object be in memory” from “what is the address of this memory.” As this suggests, we should have been talking about address, data, and object widths (and not just address and data widths) all along!

On that note, you might think int or long represents the data width, but it doesn’t, and indeed C has no other integer type that corresponds to the data width. In part this is because of a mess of backwards compatibility: if you’re running OpenBSD on an amd64/x86-64 machine, int will be 32-bits and long will be-64 bits, but, on the same hardware, Windows defines both integer types to be 32-bits [9].

At this point (and even allowing for the simplifications in my explanation of C’s integer types), anyone who doesn’t consider themselves a C language lawyer might find their eyes glazing over. Fortunately there are two useful lessons for us.

We can see the first lesson in C’s definition of uintptr_t, which isn’t defined in terms of data or address widths, or even in some other notion of “address” — but, rather, the more abstract notion of a “pointer”. By deliberately abstracting away from the hardware, C is more adaptable than languages who make assumptions about hardware. This turns out to be very useful for CHERI. In contrast, Rust conflates the notion of address, data, and object widths into usize, making it hard to adapt naturally to a platform like CHERI.

The second lesson is, to some extent, a consequence of the first. Because C defines several integer types that abstract away from the hardware, their definition is invariably “at least wide enough to store X” rather than “exactly the right width to store X”. Similarly, few, if any, guarantees are made about the relationship between integer types. Thus, while C can easily adapt to a wide range of platforms, it does so in part by placing a considerable burden on the programmer to check whether they can, for example, safely store a size_t into a uintptr_t or vice versa. Using C’s integer types correctly over all possible platforms is, at the very least, exceedingly difficult. In contrast, Rust’s stronger guarantees are easier to work with.

Casting between integer types

Programmers often need to cast one integer type into another. For example, in C I can write this code:
#include <stdio.h>
int main() {
  int x = 0;
  long y = x;
  printf("%d %ld\n", x, y);
}
On my OpenBSD/amd64 machine, int is 32-bits and long 64-bits. Using clang --std=c99 -Wall to compile the above, there are no warnings or errors, and the resulting binary prints out 0 0 when run. Notice that on line 4, the 32-bit value stored in x was stored into a 64-bit variable y and that I didn’t have to do anything to make that “conversion” happen: C performed an implicit cast on my behalf. Fortunately, C guarantees that long is always at least as wide as int.

Consider instead this code:

#include <stdio.h>
#include <sys/limits.h>
int main() {
  long y = LONG_MAX;
  int z = y;
  printf("%ld %d\n", y, z);
}
This code also compiles without warning or error, printing out 9223372036854775807 -1 on my machine. If you dive into the C specification, you'll discover that the values I tried to cast lead to undefined behaviour [10] which means that my program could have done anything. In practice, the behaviour I observed can be explained very simply. y has the rightmost 63-bits set to 1; the implicit cast to the 32-bit int type on line 5 throws away the left-most 32-bits, but keeps the right-most 32-bits, which are still all set to 1; and because I’m running on a two’s complement platform, a 32-bit signed integer with all bits set represents the value -1.

The culprit here is the implicit cast from long to int which does something dangerous without me having explicitly asked for it. In my opinion, implicit casts are a terrible idea, and the program above should have caused a compile-time error.

I believe that users should have to spell out explicit casts. C even has syntax for explicit casts: I can change line 5 to int z = (int) y. Let’s briefly pretend that we can remove implicit casts from C: what do I gain? For me the killer example is that if I change the type of an attribute in a struct from int to long, I will get a compile-time error at every point in my program that tries to treat the attribute as an int. This makes tracking down all the code affected by my change trivial.

However C’s explicit casts aren’t a sufficient safety feature on their own — indeed, in a sense, they deliberately override safety. In the example above where we cast long to int on OpenBSD/amd64, we get a dangerous result whether or not the cast is explicit or implicit. One might argue that such an example is such a blatant mistake that it's not worth worrying about, but people make the same underlying mistake in situations where they have not quite understood the guarantees made about the relationship between integer types. For example, I might write the following code:

#include <stdio.h>
#include <stdint.h>
int main() {
  intptr_t y = ...;
  long z = (long) y;
  printf("%ld %ld\n", y, z);
}
I know that on my desktop machine both long and intptr_t are 64-bits and that casting between them never truncates data. However, on CHERI, intptr_t will be (at least) 128-bits while long is probably still 64-bits: the above cast will silently truncate data, which I almost certainly don’t want. There are several options for preventing this sort of truncation.

First, one can check statically the width of various integer types (e.g. in a configure script) and refuse to compile code on platforms where the assumptions about the relationship between integer types is invalid. This is good in the sense that it stops the program so much as being compiled on a platform where it would run incorrectly, but bad in the sense that it does nothing to help you find and fix the problematic cases.

Secondly, one can dynamically check for truncation in a cast and raise an error if it occurs. For example, I can define a function longtoint_dyn to represent both a cast and a truncation check:

#include <err.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/limits.h>
int longtoint_dyn(long x) {
  if ((x < (long) INT_MIN) || (x > (long) INT_MAX)) {
    err(1, "%ld cannot be represented as an int", x);
  }
  return x;
}
int main() {
  int x = longtoint_dyn(10);
  printf("%d\n", x);
  int y = longtoint_dyn(LONG_MAX);
  printf("%d\n", y);
}
When run on my desktop machine this prints:
10
a.out: 9223372036854775807 cannot be represented as an int: Undefined error: 0

Third, one can make explicit casts fail if they will potentially truncate on the platform the program is being compiled for. Defining this in C is rather awkward, but you can do it semi-portably in C99 with something like:

#if INT_MAX < LONG_MAX
#define longtoint_static(x) _Pragma("GCC error \"unsafe cast\"")
#else
#define longtoint_static(x) ((int) x)
#endif
In essence, if the cast is always safe, longtoint_static compiles to a simple cast. If the cast might not be safe, longtoint_static causes a compile-time error [11].

I am extremely fond of these sorts of casts, as they allow me to write code that works well for the platforms I currently care about without compromising the potential to reliably adapt the code to other platforms I might encounter in the future. In essence, casts of this sort both document and enforce every simplifying assumption about integer types I make, and they do so without imposing a run-time performance penalty. If I do encounter a platform where one or more of my assumptions is invalid, I’m notified of each potentially problematic place in my code. Fixing this is a tedious task, but in essence I’ve built my own static analysis ahead of time.

A reasonable question is whether jumping through these hoops is solely a result of C being an old language [12]. Let’s take Rust as a good example of a modern language. Very sensibly, it forbids implicit casts between integer types entirely. However, it does have explicit casts with the as keyword. When I started writing Rust code, I was somewhat liberal in my use of as with integer types, until I realised that doing so can lead to surprising results, just as casts in C do. Fortunately, Rust has two other ways of casting between integer types.

First are From casts. For example, usize::from(0u16) is a statically checked cast from a 16-bit number to a platform-parameterised usize: it always compiles successfully and is guaranteed never to truncate. However, the expression usize::from(0u32) fails to compile on my 64-bit desktop computer, even though the 32-bit to 64-bit conversion will never truncate. The reason for this discrepancy is two-fold: Rust guarantees that usize is at least 16-bits wide; and it only defines From casts for integer types that will succeed on every platform. Since casting a 32-bit integer to a usize would fail on a 16-bit platform, I’m not ever allowed to compile — even on a 64-bit platform where such a cast would always succeed. In other words, From casts are like a less flexible version of the longtoint_static function above.

Second, Rust has TryFrom casts which allow one to check for possible truncation at run-time. For example, usize::try_from(0u32).unwrap() casts a 32-bit number to a usize and aborts the program (with the unwrap call) if the cast would truncate the input, rather like longtoint_dyn above. Looking at the generated machine code for usize::try_from(0u32).unwrap() on a 64-bit machine is interesting: when unoptimised (Rust’s ‘debug’ mode) the machine code really does check for truncation, even though truncation can never possibly happen; but when optimised (Rust’s ‘release’ mode), the machine code contains no truncation checks at all.

TryFrom casts are very useful, but there are two things missing. First, and most obviously, they can’t do what longtoint_static does and statically rule out conversions that are potentially unsafe on a given platform. I’ve sometimes defined an equivalent of longtoint_static using the static_assertions crate but, of course, the resulting casts are only available to whatever code I’m writing. Second, there are no guarantees that the checks in a TryFrom are optimised away, even if they can never fail [13]. I have seen diligent programmers writing performance critical code deliberately avoid using TryFrom because they are worried that the checks won’t be optimised away. That’s deeply unfortunate, but it’s difficult to criticise them for doing so.

Although Rust’s casting features have weaknesses, I think that they provide clear pointers for a better design. I hope it’s uncontroversial that, like Rust, languages should not allow implicit casts between integer types at all. However, in my opinion languages should support three kinds of explicit casts for integer types:

  1. Those which always succeed. Rust's as is a good example of what to do here: when casting to a narrower integer type, integers should be truncated; when casting to a wider integer type they should zero extend for unsigned types or sign extend for signed types).
  2. Those which dynamically check whether the input would be truncated and return an error (like TryFrom in Rust). If, on a given platform, the cast can be statically proven never to truncate, then the dynamic checks must be guaranteed to be fully optimised away (which Rust does not do [14]).
  3. Those which fail to compile if the underlying platform might truncate the input (which no language I know of has direct support for).
Programmers can then choose amongst these three kinds of casts as appropriate for different portions of their program: I’ve written programs (e.g. grmtools) where I use each of these kinds of explicit cast at different points.

Overflow

Consider the following C code:
#include <stdio.h>
#include <stdint.h>
#include <sys/limits.h>
int main() {
  int x = INT_MAX;
  int y = x + 1;
  printf("%d %d\n", x, y);
}
It compiles without error on my desktop machine but when run prints out 2147483647 -2147483648. Because INT_MAX is the maximum 32-bit integer a (signed) int can hold, adding 1 more to it causes overflow. On my machine, the particular behaviour I observed was that the integer “incremented” into the negative half of the two’s complement representation. This is a rich source of bugs, but in C there is no mechanism for detecting overflow after it has occurred — one has to check for it before it occurs by manually comparing values to the maximum they could be, which is surprisingly awkward to do.

Rust takes a different approach. Consider this program:

fn main() {
  println!("{} {}", i32::max_value(), i32::max_value() + 1);
}
If I compile that in debug mode and run it I get the following run-time error:
thread 'main' panicked at 'attempt to add with overflow', t.rs:2:20
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
In other words, Rust noticed that the addition would overflow, realised that would be a bad thing, and aborted the program, telling me exactly why and where it did so.

However, if I turn on rustc’s optimisations with -O and rerun the program, it not only doesn’t cause an error, but it prints 2147483647 -2147483648, just as C does. There is one slight improvement in the situation relative to C: integer overflow is not undefined behaviour [15], and is guaranteed to perform two’s complement wrapping.

The reason that Rust checks for overflow in debug mode but does not do so in release mode is performance: when I asked a knowledgeable Rust person about this a couple of years ago, I was told that measurements made a few years ago showed that overflow checks made benchmarks about 5% slower on average. That’s a big enough difference that it was thought better to turn off overflow checks in release mode. Rust users can turn overflow checking back on even in release mode — though I doubt that many people know they can do so.

Fortunately, Rust gives users the ability to explicitly check for overflow during arithmetic operations with the checked_* family of methods, such as usize::checked_add. If adding the two values would lead to overflow, an error value is returned, otherwise the result of adding the two values is returned. This is a nice API and makes dealing with overflow much easier than in C. Even better, there is also the saturating_* family of methods which allow one to clamp the result at the maximum or minimum value (i.e. preventing wrapping). For example, i32::max_value().saturating_add(1) == i32::max_value().

Rust’s approach is so much nicer than C’s that it took me a while to realise that an even better design is possible! My guess is that the checked_* family were added because they’re a direct mapping of a feature (checking for overflow) that most CPUs have direct support for. However, Rust forces you to opt into checked_* even though it’s the “safe” case — the default, at least in release mode, remains (dangerous) wrapping. It would be better, in my opinion, to invert this: I would like all additions to be checked unless I explicitly mark them as unchecked. Personally the only reason I ever want unchecked arithmetic is in performance critical code where I can either guarantee that wrapping can’t occur or I’m willing to take the risk that it won’t. Very few portions of code are so performance critical that I would ever bother using unchecked arithmetic: at a rough guess I’d suspect a sensible ratio of unchecked:checked arithmetic in an average chunk of code might be something like 1:1000.

The good news is that this approach is within our grasp with the unchecked_* family of methods such as usize::unchecked_add family of methods. The bad news is that they’re currently only available in Rust nightly (and they’ve even disappeared from Rust nightly at some points), but hopefully they’ll make their way into Rust stable soon, and more people will feel able to turn on overflow checks in release mode — perhaps it might even become the default!

Other integer types

Although they’re little used, it’s worth briefly mentioning two other classes of integer types that C99 defines: the fast and least integer types. These have subtly different meanings, although the basic idea is the same: they allow you to express the idea that you want an integer type that’s at least n bits wide, but you don’t mind if the integer type the compiler selects for you has more than n bits. It’s easiest to see by example the difference between the two classes of integer types.

uint_fast32_t will choose the “fastest for this platform” integer type that’s at least 32-bits. For example, on a 64-bit machine it might be faster to perform 64-bit operations rather than 32-bit operations, at which point uint_fast32_t would be 64-bits wide. That’s generally not true on amd64/x86-64, for example, so one would expect uint_fast32_t to define a 32-bit integer on such platforms. OpenBSD does just that but, for reasons I can’t comprehend, Linux defines it as a 64-bit integer. In other words, on one of the most important hardware / OS combinations, using uint_fast32_t will generally slow your code down a bit. Perhaps unsurprisingly, I’ve never seen the fast types used in real code.

uint_least16_t defines an integer type that is at least 16-bits wide but may be wider. The description of these integer types in the C spec (including in C17) is particularly inscrutable, and I’m not entirely confident that I’ve understood it. However, I believe the intention is that the compiler will round a least type up to the next smallest integer type it offers: in other words, it won't define it to be unnecessarily big. That also defines an implicit ordering: it's always the case that the width of uint_leastM_t will be smaller than or equal to the width of uint_leastN_t if M < N.

As far as I can see, the fast types are a failure, and the least types are unused, possibly because no-one can understand exactly what the latter mean. I think the least types are a useful idea though, and if I was designing a new language, I’d consider adding support for this idea in, although I wouldn’t consider it crucial.

Summary

There are other things I could talk about (e.g. pointer provenance or non-integral pointer types), but I think I’ve written enough on integer types for one lifetime.

What I hope I’ve done in this article is: show that real hardware has a more complex notion of “word width” than a narrow amd64/x86-64 perspective suggests; how well, or not, integer types in existing systems programming languages support this variety of hardware; and what this suggests for the design of integer types in a mythical future systems programming language. I’m not claiming that I’ve thought of every possible issue in this area, nor that the design suggestions I’ve made are the best possible, but they do seem to be an improvement on what existing languages offer.

Let me finish by summarising what I think the major design lessons for systems programming languages are:

  • Both signed and unsigned integer types have their place in systems programming languages.
  • It’s reasonable to assume that all arithmetic will be on a two’s complement representation.
  • Real hardware does not have a single “word width”: we need separate notions of address, data, and object widths.
  • The stronger the guarantees one makes about the relationship between integer types, the harder it is to adapt to different platforms.
  • The weaker the guarantees one makes about the relationship between integer types, the harder they are to use correctly.
  • Implicit casts are a terrible idea.
  • There are three useful kinds of explicit casts: always succeed (truncating or extending as appropriate); dynamically check for truncation and fail if so; cause a compile-time error if the cast can potentially lead to truncation at run-time.
  • Arithmetic operations should by default return an error at run-time on overflow; if users don’t want a particular arithmetic operation to fail they should have to explicitly request that a specific arithmetic operation is unchecked.
  • Undefined behaviour on integer types is a terrible idea (though unspecified behaviour might have a place).

The final design lesson is that integer types are not an exciting aspect of language design and as a language designer it’s tempting to put off thinking about them in favour of more interesting features — certainly, this was the approach I took for many years. However, since programmers quickly embed assumptions about integer types throughout their programs, it becomes more difficult to change anything meaningful about them relative to many other language features. In my opinion, therefore, integer types are one of the things languages needs to think about earlier rather than later.

Updated: (2021-06-23) Travis Downs pointed out that I’d misclassified the 8088.

Acknowledgements: Thanks to Jacob Bramley, Edd Barrett, Aleks Bonin, and Jacob Hughes for comments.

If you’d like updates on new blog entries, follow me on Twitter,
or subscribe to email updates:

Footnotes

[1] There are many interesting design challenges around numbers and dynamically typed languages, but that’s a very different topic!
[2] The only modern language I’ve come across which allows this is Zig, which allows integers to be any bit width between (I think) 1 and 65535. C99, perhaps inevitably, gives explicit flexibility for such integers to be defined, but no platform I know of actually does so.
[3] The UNISYS ClearPath Dorado – the last machine in this series was released in 2015, but it’s compatible with hardware going back to 1962!
[4] Most x86 CPUs, for example, currently only use 48-bits of the possible 64-bit address range. There are already some mainstream CPUs which use 57-bits (to represent 5 level page tables) and it would be a brave person who would bet against more bits being used at some future date.
[5] The naming of this architecture is a mess: AMD created it initially, hence “amd64”, but once Intel also adopted the instruction set, that name was always going to be contentious. Thus we’ve ended up in the confusing situation where for many, but not all, operating systems we tell people “download the x86-64 version called amd64”.
[6] I’m not aware of any machines whose data width is larger than their address width — if I had to guess, I’d suggest that no such machine has ever existed. UPDATE (2021-06-22): When will I learn not to guess?! Mark Bessey points out that System/360 machine have a 32-bit data width but a 24-bit address width.
[7] At this point the terminology becomes torturous: it might be better to say that CHERI’s pointer width is different than its address width. For the sake of consistency, I’ll stick with the existing terminology.
[8] For reasons that are unclear to me, uintptr_t is an optional type in C99. However, I don’t know of any platforms which support C99 but don’t define it.
[9] There is of course the wonderful long long integer type. Although I don’t know for sure, I like to think that at some point soon after long long was introduced, people realised that this approach wasn’t going to end well: would a new platform have come along that required a long long long integer type?!
[10] Undefined behaviour is, in my opinion, a terrible thing for everyone except people who write compiler optimisers. I talked a bit about that as part of another blog post.
[11] It’s possible to make this prettier with information like the line of the cast and so on, but it doesn’t fundamentally change things.
[12] C++ improves upon C: it has static_cast which roughly corresponds to longtoint_static, though C++ has no built-in equivalent of longtoint_dyn.
[13] That said, whenever I’ve actually looked at the machine code resulting from a TryFrom in Rust’s release mode, the checks are optimised away.
[14] I’m aware that our current compiler infrastructure – where we tend to separate out a compiler front-end (e.g. rustc) as a separate thing from the optimiser and machine code generator (e.g. LLVM) makes it difficult, and perhaps impossible, to guarantee optimisation outcomes. That is not a reason to say that we should never have such guarantees: it suggests we need to fix our compiler infrastructure to make it possible to make and honour such guarantees.
[15] To be specific, unsigned integer overflow in C has defined behaviour but signed integer overflow does not.
There are many interesting design challenges around numbers and dynamically typed languages, but that’s a very different topic!
The only modern language I’ve come across which allows this is Zig, which allows integers to be any bit width between (I think) 1 and 65535. C99, perhaps inevitably, gives explicit flexibility for such integers to be defined, but no platform I know of actually does so.
The UNISYS ClearPath Dorado – the last machine in this series was released in 2015, but it’s compatible with hardware going back to 1962!
Most x86 CPUs, for example, currently only use 48-bits of the possible 64-bit address range. There are already some mainstream CPUs which use 57-bits (to represent 5 level page tables) and it would be a brave person who would bet against more bits being used at some future date.
The naming of this architecture is a mess: AMD created it initially, hence “amd64”, but once Intel also adopted the instruction set, that name was always going to be contentious. Thus we’ve ended up in the confusing situation where for many, but not all, operating systems we tell people “download the x86-64 version called amd64”.
I’m not aware of any machines whose data width is larger than their address width — if I had to guess, I’d suggest that no such machine has ever existed. UPDATE (2021-06-22): When will I learn not to guess?! Mark Bessey points out that System/360 machine have a 32-bit data width but a 24-bit address width.
At this point the terminology becomes torturous: it might be better to say that CHERI’s pointer width is different than its address width. For the sake of consistency, I’ll stick with the existing terminology.
For reasons that are unclear to me, uintptr_t is an optional type in C99. However, I don’t know of any platforms which support C99 but don’t define it.
There is of course the wonderful long long integer type. Although I don’t know for sure, I like to think that at some point soon after long long was introduced, people realised that this approach wasn’t going to end well: would a new platform have come along that required a long long long integer type?!
Undefined behaviour is, in my opinion, a terrible thing for everyone except people who write compiler optimisers. I talked a bit about that as part of another blog post.
It’s possible to make this prettier with information like the line of the cast and so on, but it doesn’t fundamentally change things.
C++ improves upon C: it has static_cast which roughly corresponds to longtoint_static, though C++ has no built-in equivalent of longtoint_dyn.
That said, whenever I’ve actually looked at the machine code resulting from a TryFrom in Rust’s release mode, the checks are optimised away.
I’m aware that our current compiler infrastructure – where we tend to separate out a compiler front-end (e.g. rustc) as a separate thing from the optimiser and machine code generator (e.g. LLVM) makes it difficult, and perhaps impossible, to guarantee optimisation outcomes. That is not a reason to say that we should never have such guarantees: it suggests we need to fix our compiler infrastructure to make it possible to make and honour such guarantees.
To be specific, unsigned integer overflow in C has defined behaviour but signed integer overflow does not.
Home > Blog e-mail: laurie@tratt.net   twitter: laurencetratt twitter: laurencetratt