
How Hard is it to Adapt a Memory Allocator to CHERI?

September 18 2023

CHERI is a set of things (adapted CPUs, adapted operating systems, adapted libraries) that, collectively, aim to stop software bugs becoming security flaws. If you'd like to know more, I gave some of my thinking on CHERI in Two Stories for "What is CHERI?".

In this post I'm going to flesh out an observation I made in that post, which is that some software needs thoughtful adaption to CHERI if we want to get the security advantages we hope for. Exactly what that thoughtful adaption might look like will vary, probably substantially, between different pieces of software. What, for instance, might it look like for critical, widely used, components? In this post I'm going to look at how memory allocators (henceforth "allocators"), one of software's most fundamental building blocks, can be adapted to CHERI. If you find this interesting, but want greater depth than I'll go into here, you might be interested in the paper Picking a CHERI Allocator: Security and Performance Considerations that this post is based upon.

A Simple Allocator

It is a truism that virtually every program needs to dynamically allocate memory. Our collective folklore tells us that allocators like dlmalloc or jemalloc are impressive pieces of software that improve on their predecessors, but very few of us can really explain why. We call malloc, realloc, and free and magically chunks of memory are allocated, resized, or freed on our behalf.

As is often the case, one can get a useful insight into allocators by stripping away as much of the cleverness as possible. It turns out that we can write an allocator sufficient to run some real programs in just 25 lines of C:

#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

static char *heap_start;
static char *heap;
static size_t HEAP_SIZE = 1024 * 1024 * 1024;

void *malloc(size_t sz) {
  if (!heap) heap = heap_start = mmap(NULL, HEAP_SIZE,
    PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON,-1,0);
  sz = __builtin_align_up(sz, _Alignof(max_align_t));
  if (heap + sz > heap_start + HEAP_SIZE) return NULL;
  heap += sz;
  return heap - sz;
}

void free(void *ptr) { }

void *realloc(void *ptr, size_t sz) {
  void *new_ptr = malloc(sz);
  if (ptr && new_ptr) memmove(new_ptr, ptr, sz);
  return new_ptr;
}
What we've implemented is a "bump allocator": every time we want to allocate or reallocate a block of memory, we "bump" (i.e. increment) the pointer stored in the variable heap.

The first time that we call malloc, no heap has been allocated, so we use the operating system primitive mmap to reserve 1GiB of RAM for our process (lines 8, 11, and 12). Because many systems become unhappy if we return unaligned memory (e.g. memory which starts on addresses ending in an odd number), we round the size requested up (on my machine to the next multiple of 16) to avoid such unhappiness (line 13). We then check if there is enough space left in the heap for the allocation, returning NULL if there is not (line 14). If there is enough space in the heap we then return a pointer to the beginning of the block (line 16) and "bump" the heap pointer (line 15).

Unsurprisingly, I have cut a few corners to keep things simple (e.g. let's hope that mmap doesn't fail!). Most obviously, free doesn't "return" memory to the heap: this is semantically correct, though it will cause programs to run out of memory rather quickly! You may also find the definition of realloc surprising, because it copies however many bytes are now requested from the old to the new block. This works because our heap is contiguous so, if there's enough space for the new block, by definition we can copy at least that many bytes from the old block.

So, does this really work? Let's start with the classic binary trees benchmark which does a lot of memory allocation. What we want to do is to compare a normal allocator against our bump allocator. There are various ways we could do this, but the simplest is to put our allocator in a shared object and use the "LD_PRELOAD trick" [1]. Setting the LD_PRELOAD environment variable to point at our shared object allows us to override the system allocator on a per-execution basis.
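Concretely, building the allocator into a shared object and preloading it looks something like the following (the file and benchmark names here are just placeholders for whatever you use):

$ cc -O2 -fPIC -shared -o bump_malloc.so bump_malloc.c
$ LD_PRELOAD=$(pwd)/bump_malloc.so ./binarytrees 21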

In the following video you can see me (in order): run the benchmark on my normal desktop computer with OpenBSD's default system allocator; compile the bump allocator; and then use LD_PRELOAD to override the system allocator when I run the benchmark for a second time. Not only does it run, but it's three times faster than when run with the system allocator! Before you get too excited, let's remember that the bump allocator is, to put it mildly, less general than the system allocator — this is not, in general, a sensible performance comparison [2]. Still, it does give us a good degree of confidence that we're really running the bump allocator in the second run of the benchmark.

As that suggests, it's notoriously easy to fool yourself into thinking you're testing your new allocator ("wow, I write software without bugs!") when you're actually running the system allocator. We won't always be able to tell from timing alone which allocator we're running so, to give us some assurance for later examples in this post – and also because it's fun – let's add this little chunk of code to our allocator:

#include <stdio.h>

__attribute__((destructor))
static void malloc_exit() {
  // Runs automatically just before the program exits.
  fprintf(stderr, "heap used %lu\n", heap - heap_start);
}
Now, when a program uses our bump allocator, it will print out how many bytes of heap were used during execution just before the program exits. Have you ever wanted to know how many bytes grep or vi allocate during execution? Now's your chance to find out! I had to be a little careful in my choice of programs to run, because Unix has, over time, slowly added more functions to the allocator API (most of which are rarely used). Fortunately, we can handle most of those (again with minor corner cutting) fairly easily (a sketch of two of them follows below). It's quite good fun to see which software runs with this extended allocator — assuming you've got enough RAM!
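For example, here is a minimal sketch, under the same corner-cutting assumptions as before, of how two of those extra functions (calloc and posix_memalign) might be layered on top of our bump allocator; it assumes the code lives in the same file as the malloc above:

#include <errno.h>
#include <stdint.h>

void *calloc(size_t nmemb, size_t sz) {
  // Guard against nmemb * sz overflowing.
  if (sz != 0 && nmemb > SIZE_MAX / sz) return NULL;
  // mmap'd memory is zero-filled and we never reuse blocks, so the
  // memory malloc hands back is already zeroed.
  return malloc(nmemb * sz);
}

int posix_memalign(void **memptr, size_t align, size_t sz) {
  // Corner cutting: we don't validate align, and we waste up to align
  // bytes by over-allocating and rounding the pointer up.
  char *ptr = malloc(sz + align);
  if (!ptr) return ENOMEM;
  *memptr = __builtin_align_up(ptr, align);
  return 0;
}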

The Bump Allocator and CHERI purecap

What does it mean to run an allocator under CHERI? Let's start by assuming we're using "pure capability" (henceforth "purecap") CHERI. For our purposes, that means that 64-bit "pointers" become 128-bit capabilities, where the additional 64-bits contain various permissions. The main permissions we're interested in are bounds, which record the range of memory that a capability – and, vitally, those capabilities derived from it – are allowed to read/write from. When we compile a C program for CHERI we're actually compiling a "CHERI C" program. For our purposes, the semantics of CHERI C can be thought of as mostly the same as normal C's semantics, though "pointer types" are now "capability types" (and hence double the width they are in normal C).

Perhaps surprisingly, our bump allocator not only compiles out of the box on purecap CHERI, but both binary trees and vi run successfully. That sounds good — until we start digging a bit. In order for CHERI to do something useful, capabilities need to have meaningfully restricted permissions. What bounds does the capability returned by our bump allocator's malloc have? The easiest way to see this is to allocate some memory and then use the CHERI API to print out the capability's lower bound, and the length of the bounds. Let's allocate two blocks of memory and see what the result is:

#include <cheriintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  void *c1 = malloc(4);
  void *c2 = malloc(4);
  printf("c1 address 0x%lx lower 0x%lx length 0x%lx\n",
    cheri_address_get(c1), cheri_base_get(c1),
    cheri_length_get(c1));
  printf("c2 address 0x%lx lower 0x%lx length 0x%lx\n",
    cheri_address_get(c2), cheri_base_get(c2),
    cheri_length_get(c2));
}
For each of the two capabilities returned by malloc this will print: the capability's address (equivalent to the address a "pointer" normally stores); the lower bound of memory the capability can address; and the length of memory (relative to the lower bound) that the capability can read/write from. I can derive other (valid!) capabilities from these that have different addresses, provided that those addresses stay within the bounds.

So that you can play along at home, I'm going to run an emulator of Arm's Morello, built with cheribuild: you can build your own with ./cheribuild.py run-morello-purecap -d. I'll warn you now that the emulator isn't fast, but it is more than usable for our purposes. If I run the code snippet above with, first, the default allocator and, second, our bump allocator, I see the following. To make things a bit easier to follow, here's the critical part of the video as text:

root@cheribsd-morello-purecap:~/cheri_malloc_blog # ./print_bounds
c1 address 0x40c0e010 lower 0x40c0e010 length 0x4
c2 address 0x40c0e018 lower 0x40c0e018 length 0x4
root@cheribsd-morello-purecap:~/cheri_malloc_blog # LD_PRELOAD=$(pwd)/malloc2.so ./print_bounds
c1 address 0x41400000 lower 0x41400000 length 0x40000000
c2 address 0x41400010 lower 0x41400000 length 0x40000000
heap used 4128

Both allocators return sensible looking addresses, but the bounds are very different: the default allocator returns capabilities whose bounds are restricted to the 4 bytes requested; but our bump allocator returns capabilities that can read/write from 1GiB of memory!

More formally, with the default allocator, neither c1 nor c2 can be derived from the other, but with the bump allocator either capability can be derived from the other. While this isn't exactly wrong, it means that we haven't gained much from CHERI: code with access to one of those two capabilities can read/write from the same memory as the other capability. Indeed, it turns out that every block of memory our bump allocator returns will share the same bounds!

The clue to where those bounds come from is that they span 1GiB of memory — exactly the same quantity of memory our bump allocator requested from mmap. As that suggests, mmap in CheriBSD returns a capability whose bounds are at least as big as the quantity of memory that you requested, and our bump allocator has simply copied those bounds over unchanged to the caller of malloc.
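We can check that claim directly with a short program that asks mmap for 1GiB and prints the bounds of the capability it hands back (a sketch; the exact numbers you see will vary):

#include <cheriintrin.h>
#include <stdio.h>
#include <sys/mman.h>

int main() {
  void *m = mmap(NULL, 1024 * 1024 * 1024,
    PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
  // On CheriBSD the returned capability's bounds cover (at least) the
  // whole mapping: this is what our naive malloc passed on to its callers.
  printf("mmap lower 0x%lx length 0x%lx\n",
    cheri_base_get(m), cheri_length_get(m));
}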

What Should an Allocator do on CHERI purecap?

Our bump allocator works without changes on CHERI purecap, but gives little meaningful security improvement relative to a non-CHERI system. We thus need to pause for a brief philosophical interlude: what should an allocator do on CHERI purecap?

Naively we might think we want "maximum security" but, even if we could define what that means, we probably couldn't achieve it. Like most things in life, software nearly always requires trade-offs: if we want more of one thing (e.g. security) we'll have to accept less of another (e.g. performance or ease of use). Although I'm not going to dwell on it in this post, there is unlikely to be a "one true secure allocator" that works well in all cases.

Let's start with what security folk call "threat models". In our case, let's assume that our threat model is that external attackers might be able to take over one part of our program and then "upgrade" the attack to read all the program's heap data. Given this threat model, we can then think about techniques and technologies that might help mitigate the threat.

In our case, we have a promising technology (CHERI purecap) and we may then decide that having the program store capabilities with the most restrictive bounds possible is the right way to mitigate the "upgrade" attack. That implies that allocators should return capabilities with bounds restricted to just the memory allocated. Doing so will clearly make an attacker's life harder, even though it might not make it impossible [3].

Adjusting the Bump Allocator

Let's adapt the bump allocator's malloc so that it returns capabilities whose bounds only allow access to the requested range of memory:
#include "cheriintrin.h"

void *malloc(size_t sz) {
  if (!heap) heap = heap_start = mmap(NULL, HEAP_SIZE,
    PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON,-1,0);

  char *new_ptr = __builtin_align_up(
    heap, -cheri_representable_alignment_mask(sz));
  size_t bounds = cheri_representable_length(sz);
  sz = __builtin_align_up(sz, _Alignof(max_align_t));

  if (new_ptr + sz > heap_start + HEAP_SIZE)
    return NULL;
  heap = new_ptr + sz;
  return cheri_bounds_set_exact(new_ptr, bounds);
}
The main changes we've made are lines 7-10 and 15, where we perform an odd-looking dance with CHERI's APIs to create a pointer with the appropriate bounds. In essence, because capabilities don't have enough spare bits to represent all possible bounds exactly, we might have to align the heap pointer upwards (lines 7-8), and we may also have to round the requested length upwards for similar reasons (line 9). We then make sure we always align future pointers to max_align_t as in the normal malloc (line 10). Finally we return a capability with the heap pointer we want, with the bounds set to those we've calculated (line 15).
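If you want to see that imprecision for yourself, this little sketch (the sizes are arbitrary examples) prints the length a capability's bounds would be rounded up to for a given request:

#include <cheriintrin.h>
#include <stdio.h>

int main() {
  // Small sizes are exactly representable; large "awkward" sizes have to
  // be rounded up so that they fit the compressed bounds encoding.
  size_t sizes[] = {4, 16384, 0x12345678};
  for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
    printf("requested 0x%zx representable 0x%zx\n",
      sizes[i], cheri_representable_length(sizes[i]));
}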

If I compile our CHERI-fied malloc and run our little bounds printing program again, I can see that I really am getting capabilities whose bounds mean that they cannot be derived from each other:

$ LD_PRELOAD=$(pwd)/malloc4.so ./print_bounds
c1 address 0x41400000 lower 0x41400000 length 0x4
c2 address 0x41400010 lower 0x41400010 length 0x4
heap used 4128
That all looks good! But if I write this little program:
#include <cheriintrin.h>
#include <stdlib.h>

int main() {
  void *c1 = malloc(4);
  c1 = realloc(c1, 8);
  return 0;
}
and run it, the program is terminated with a SIGPROT (the CHERI equivalent of a SEGFAULT):
$ LD_PRELOAD=$(pwd)/malloc4.so ./print_bounds
In-address space security exception (core dumped)
How can such a simple program cause a problem? The problem is in our original realloc. Earlier I wrote:
You may also find the definition of realloc surprising, because it copies however many bytes are now requested from the old to the new block. This works because our heap is contiguous so, if there's enough space for the new block, by definition we can copy at least that many bytes from the old block.
My corner cutting has come back to bite us! The problem is that when we pass to realloc a capability whose bounds allow access to 4 bytes and then we try and copy 8 bytes from it, CHERI – quite rightly! – considers this a violation of its rules and stops the program.

The fix for this is obvious: we have to know how big the block of memory we want to reallocate currently is. However, our simple bump allocator doesn't store that length. Adjusting it to do so wouldn't be rocket science, but it would be annoying [4]. Fortunately, since the input capability records the length of its bounds, we can use that information to make sure that we don't copy more from the input capability than we should:

void *realloc(void *ptr, size_t sz) {
  void *new_ptr = malloc(sz);
  if (ptr && new_ptr) memmove(new_ptr, ptr,
    cheri_length_get(ptr) < sz
    ? cheri_length_get(ptr) : sz);
  return new_ptr;
}
Note that we only copy the minimum of the input capability's length or the requested size (lines 4 and 5). With this modification, our allocator now works as expected:
$ LD_PRELOAD=$(pwd)/malloc4.so ./print_bounds
heap used 32

What Have We Actually Done?

Amongst all the detail, it's easy to miss that our manual adaptions of the bump allocator to CHERI have meaningfully improved its security. If an attacker gets access to one capability returned by malloc, they have no way of turning it into any other capability returned by malloc. In other words, we have given the user a new tool for "compartmentalising" their heap allocated memory. Consider this simple program:
#include <cheriintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  char *c1 = malloc(4);
  char *c2 = malloc(4);
  printf("c1 address 0x%lx lower 0x%lx length 0x%lx\n",
    cheri_address_get(c1), cheri_base_get(c1),
    cheri_length_get(c1));
  printf("c2 address 0x%lx lower 0x%lx length 0x%lx\n",
    cheri_address_get(c2), cheri_base_get(c2),
    cheri_length_get(c2));
  c1[0] = 'a';
  printf("c1[0] %c\n", c1[0]);
  printf("(c2 - 0x10)[0] %c\n", (c2 - 0x10)[0]);
}
Because we – as might a cunning attacker – know that the bump allocator places consecutively allocated blocks at predictable locations, we can try rederiving c1 from c2 (line 16). This code executes successfully (i.e. insecurely!) with our original bump allocator on CHERI purecap, but is terminated with a SIGPROT with our adapted bump allocator.

This is not to say that our adjusted allocator magically makes programs impervious to attackers. Most obviously, the user still has to think carefully about which parts of the system should have access to which capabilities returned by the allocator: handing out capabilities without sufficient thought tends to undermine the security properties the user was hoping for. Less obviously, even our adjusted allocator can be tricked into doing things we might not expect. For example, I can take a capability returned by the allocator, narrow its bounds or downgrade its permissions, and hand it off to a less trusted part of my system. An attacker can then use realloc to obtain a more powerful capability than they started with.
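Here's a sketch of that last scenario using our adapted bump allocator (the sizes are arbitrary; cheri_bounds_set is the CHERI API call for narrowing a capability's bounds):

#include <cheriintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  char *c1 = malloc(16);
  // Narrow the capability to a single byte before handing it to a less
  // trusted part of the program...
  char *narrow = cheri_bounds_set(c1, 1);
  // ...which can simply ask realloc for a bigger block and receive a
  // capability whose bounds span 16 bytes again.
  char *widened = realloc(narrow, 16);
  printf("narrow length 0x%lx widened length 0x%lx\n",
    cheri_length_get(narrow), cheri_length_get(widened));
}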

This isn't merely a theoretical concern. In Picking a CHERI Allocator we introduced 5 simple "attacks" which, in essence, model situations where an attacker has gained initial access to a system and is trying to widen the scope of the attack. We then tried these attacks on a number of different allocators that have been ported to CHERI purecap. Only one allocator was invulnerable to these simple attacks — and the default CheriBSD allocator, a jemalloc fork, was vulnerable to 3 of them! We didn't even try that hard to find such attacks — it seems likely to me that more attacks on CHERI purecap allocators could be found if someone has the time to investigate further.

Summary

If nothing else, I hope this post has made allocators seem a little less magical to you. On top of that, I hope that you believe that while CHERI (purecap or hybrid) can help you improve a program's security, it isn't magic: just because something runs correctly on CHERI doesn't necessarily mean it's any more secure than running it on a traditional system.

The adjustments we made to the bump allocator to take advantage of what CHERI has to offer are not huge, but they aren't merely mechanical either. As that suggests, to take full advantage of CHERI you have to think not just about the code itself, but how that code is expected to be used.

Acknowledgements: my co-authors on Picking a CHERI Allocator (Jacob Bramley, Dejice Jacob, Andrei Lascu, and Jeremy Singer) did all the hard work and generously offered comments on this post. In focussing on one aspect of that paper for this post, I've probably introduced various mistakes, for which I am solely responsible!


Footnotes

[1]LD_PRELOAD isn't a trick but using it is trickier than people often realise. First, you must give it an absolute path to a shared object. Second, it's astonishingly easy to create an input shared object that doesn't contain all of the functions you need, or expected, to override.
[2]There are a handful of use cases for making free a no-op. First, some people really need their programs to run faster and are prepared to buy a lot of RAM to make that happen — yes, this really does happen! Second, sometimes we want to know what the relative proportion of time we spend in an allocator is — our simple bump allocator gives us a good idea of the "intrinsic" costs of a program with the allocator factored out. That said, modern hardware, with its many layers of caches, increasingly requires some caution in interpreting these numbers, because fragmentation (or lack of it) can have noticeable effects.
[3]For example, unless we explicitly protect the "super capability" returned by mmap, that is an obvious weakness in our mitigation. CHERI gives us tools which can make such protection plausible, but I won't cover them in this post.
[4]It is sometimes considered realloc's original sin that it doesn't require the caller to track the input block's current size. Every allocator then has to record (or otherwise derive) the size of a block, which can be a costly operation (in terms of memory use or processor time). That said, it can be rather annoying to use an allocator where you have to track the size of all allocations yourself. Swings and roundabouts...

"Programming" and "Programmers" Mean Different Things to Different People

August 23 2023

Late last year on my blog I published "How Might Generative AI Change Programming?". Although I'm slightly biased, I'm still comfortable with that post's high-level message. However, there is one important aspect that I think I failed to bring out as explicitly as I should have: different communities' programmers will be affected in different ways.

One reason I made this mistake is that I had overlooked how different communities' assumptions about "programming" and "programmer" [1] have diverged. As with most instances of incommensurability, the people involved (e.g. me) often don't realise that they're talking about slightly different things.

As a concrete example, a couple of weeks back, the Economist – whose coverage of areas I know about I find to be rather good – published an article with the following claim:

An AI "co-pilot" on GitHub, a Microsoft-owned platform for open-source programs, improves coders’ productivity by 30%.
My initial reaction was incredulity. AI tools are not currently even remotely close to speeding up my "day job" programming – which involves careful development of programming language tools like compilers – by anywhere near 30%.

But, after a little thought, I realised that this statement is true for some of the programming I have to do. Recently I had to do a little bit of JavaScript programming, a language which I barely know, and which I am therefore rather unproductive in. After a quick query using ChatGPT I had gained enough insight to find a webpage that let me quickly do the task I needed. I expect that, overall, I was able to complete this task in half the time I would have been able to before ChatGPT came along.

What this highlighted to me is that there are at least two very different styles of programming that I engage in. In one style, AI/ML has, so far, had virtually no effect on me; in the other style, it can be a noticeable productivity boon.

It didn't take me long to realise that this was a specific instance of something more general: different people increasingly mean different things when they talk about "programming" and "programmers". Some of this is due to the gradual, and inevitably incomplete, diffusion of understanding about software throughout wider society. But even within the communities of people whose livelihoods involve staring at screens, there are important differences of understanding that can mean we end up talking past each other.

In this post I'm going to define what seem to me to be the three major categories of "programming" and "programmers" I come across, moving from the narrowest to the broadest. While I have my own preferences as to which categories I personally attach the terms "programming" and "programmers", I'm not interested in dying in a ditch over terminology. Rather, it's already helped me better understand what other people mean when I'm talking to them.

The "narrow" category

When I have to write a short description of what I do for work, I use two words: programmer, researcher. The ordering is deliberate: I was a programmer before I was a researcher [2] and I think of programming as my craft.

My self-image as a programmer is reflected in the way I approach programming. Most obviously, I am willing to attempt – though I may not always succeed at! – hard programming tasks (e.g. writing low-level and/or highly optimised code). Less obviously, I think of programs and programming itself as something worth careful and continual consideration. I put deliberate effort into practising programming and I spend a lot of time trying to make existing programs better. In many ways, how I program, and the quality of the result, is more important to me than what I am programming.

Although I am nervous of suggesting I am representative of anything other than my own odd self, I believe that many of the people I meet professionally have a similar outlook. We call ourselves "programmers" not just because that's how we spend most of our time, but because it is the activity we care most about. We therefore tend to put a lot of effort into thinking about programming, and how to improve the programs we're working on. We are willing to move between organisations who want us to program very different things, because we want to program and we're often less worried about what we program.

The "midpoint" category

You will probably not be surprised to learn that the "narrow" definition of "programmer" and "programming" is the one that I implicitly used for much of my programming journey.

The first time I realised that my definition didn't work well was when I interacted more closely with system administrators. These are the folk who keep computing services such as servers running. When I started my PhD, my department had a small but gifted group of system administrators. I enjoyed the company of the "sysadmins" (as we abbreviated their title) a great deal, in no small part because they were the most technical people in the entire department [3]! I learnt a great deal from this group because a good proportion of the tasks they undertook clearly involved programming. Their programming wasn't the same as mine – it was smaller scale, with many of the programs deliberately "throwaway" – but it was similar enough that we could talk productively to one another.

Since then, I've encountered ever more groups of people who are using computers in a way that, at the very least, can be thought of as "programming lite". One obvious group are the most sophisticated of spreadsheet users who create complex expressions that, in conjunction with a spreadsheet's inherently reactive mode of operation, make my head hurt. More generally, there are a surprising number of users of "domain specific languages" (of which spreadsheet expressions are arguably a special, albeit wide-spread, case) who are commanding a computer in ways that are often like full-blown programming.

It seems to me that two things unite "programmers" and "programming" in this category.

First, the people involved rarely call themselves "programmers". They are "system administrators" or "scientists" or "accountants" and so on. In other words, they invariably have a non-programming role that programming just so happens to help them with.

Second, "programming" is both a help and a hindrance to them. If you're trying to crunch a load of data from astronomy experiments, you probably need to program a computer to deal with the sheer quantity of data. However, the time spent programming is a cost to be born in pursuit of a bigger task: the less time you spend programming, the more time you spend on the thing you care most about.

These two factors combine in an interesting way. In general (though there are exceptions), the programs that programmers in this category write tend to be smaller than those in the "narrow" category. Typically, the people involved also put less effort into learning exotic tooling [4] and writing "better" programs than in the "narrow" category. This has some interesting knock-on effects. For example, those of us in the "narrow" category are often obsessed with making software as reusable as possible; in the "midpoint" category, there tends to be a lot more "copy, paste, and tweak". It's easy for people like me to dismiss this as sloppy workmanship, but that misses the point: people in this category necessarily have to reserve much (probably most) of their time and mental bandwidth for the non-programming aspects of their role.

The "broad" definition

I can still remember the first time that I saw a CV that claimed one of the programming languages this person knew was HTML. I scoffed: HTML isn't a programming language! After all, all one is doing with it is writing down how text and images should be formatted for humans to read — that's a long way from doing the sort of fiddly control flow manipulation that I associate with programming!

I've now lost track of how many people I've seen claim HTML as one of the programming languages they know. I've also encountered a growing number of people who talk about "programming" HTML (by which they mean HTML and CSS, though not JavaScript unless stated explicitly). The latter group are interesting as virtually none of them would claim that they could program non-HTML/CSS software and, indeed, I don't think I've ever heard any of them call themselves a "programmer" even when they say they're "programming".

Before I start to sound too snobby, it's now clear to me that creating a good modern web page is a difficult task, which is why my own website is both basic and ugly. Whenever I want a webpage to do something even moderately complex, such as display nicely on a mobile device, I either spend inordinate time on the task, or plead with someone for help. For some reason, I struggle to make sense of the way that CSS elements interact with each other, and it's not like I've only spent 5 minutes trying to understand it. Clearly this stuff is more complex [5] than I once gave it credit for!

It seems to me that what makes programming in this category (which one also sees in some "domain specific languages") different to the previous two categories is that it has little or no notion of "time". In most of the programming I do, I have to think about "if X happens then Y happens, which means that Z can then happen". Put formally, the system continually changes state (in the sense of "state machine" state), and I have to reason about each state individually. In contrast, one can generally think of HTML as occupying a single state. That means that the reasoning programmers in this group have to do about the system state is generally much more restricted in nature than those in the "narrow" definition from earlier. This isn't a bad thing — anyone who's tried to write a GUI in a fully-fledged programming language will know that the flexibility this offers comes with costs that can be borne repeatedly over time!

Summary

A pithy summary of the three categories, in order, are:
  • Programmers who always call themselves programmers, call the activity they engage in programming, and who write arbitrarily complicated programs.
  • Programmers who generally don't call themselves programmers, generally call the activity they engage in programming, and who tend to write slightly less complicated programs.
  • Non-programmers who don't call themselves programmers but who sometimes call the activity they engage in programming.
These categories are, inevitably, too simplistic to capture the messiness of the real world. The categories are better thought of as highlighting portions of a sliding spectrum rather than as discrete buckets. They are also not mutually exclusive: any given individual might find themselves, at different points, inhabiting different categories. Still, despite these limitations, I hope the high-level point is useful.

Exactly where you consider the cut-off point for a "programmer" or for "programming" is an individual choice. I have my own preferences, but it doesn't worry me if you use a different definition. I also hope that I'll now be aware enough to realise how you're using the term, so that you and I will communicate more effectively!

Acknowledgements: thanks to Martin Berger, Lukas Diekmann, and Hillel Wayne for comments.


Footnotes

[1]"Programmer" often doesn't sound grand enough as a job title, so one finds "software developer" or "software engineers" or sometimes just "engineer" used instead. Interestingly, I find that when people with these job titles refer to the craft element of their job, they frequently revert back to using the term "programmer".
[2]It would not surprise me at all if, in the future, I am a programmer after I'm no longer a researcher.
[3]The head of the group, Neil Faulks, even created his own Linux distribution "Neilix" (a pun on Linus Torvald's naming of "Linux" after himself). In contrast, the academic group I was part of consisted almost entirely of theoreticians. I learnt a lot from them too, but not about programming.
[4]For example, until recently, I believe that the majority of the people I came across in this category did not use version control software (e.g. git). Interestingly, that seems to have changed rapidly in the last 3 or 4 years, although I'm not sure why.
[5]CSS is in fact Turing complete but I think it's fair to say that very few of us get to the point that we realise that.
"Programmer" often doesn't sound grand enough as a job title, so one finds "software developer" or "software engineers" or sometimes just "engineer" used instead. Interestingly, I find that when people with these job titles refer to the craft element of their job, they frequently revert back to using the term "programmer".
It would not surprise me at all if, in the future, I am a programmer after I'm no longer a researcher.
The head of the group, Neil Faulks, even created his own Linux distribution "Neilix" (a pun on Linus Torvald's naming of "Linux" after himself). In contrast, the academic group I was part of consisted almost entirely of theoreticians. I learnt a lot from them too, but not about programming.
For example, until recently, I believe that the majority of the people I came across in this category did not use version control software (e.g. git). Interestingly, that seems to have changed rapidly in the last 3 or 4 years, although I'm not sure why.
CSS is in fact Turing complete but I think it's fair to say that very few of us get to the point that we realise that.

pizauth: First Stable Release

August 13 2023

In pizauth: dump and restore I said:
However, there was one feature I couldn't find a good design for: persisting state to disk. This missing feature had been nagging away at me (and a small, though growing, number of other people) for quite a while: recently I finally alighted upon a design that I thought would work well. pizauth-0.2.2 is the first release to support this. Perhaps this will be the last release before 1.0.0!
I was a little too optimistic. One of pizauth's configuration option names had long bothered me, and addressing that caused a backwards-incompatible change that forced a 0.3.0 release.

Since then, I've added only one feature to pizauth: an info sub-command. This was explicitly inspired by Chris Siebenmann's "Everything that uses configuration files should report where they're located" post. pizauth info [-j] informs the user where pizauth is looking for configuration files and so on. For example:

$ pizauth info
pizauth version 1.0.0:
  cache directory: /home/ltratt/.cache/pizauth
  config file: /home/ltratt/.config/pizauth.conf
Adding -j outputs the same information in JSON format [1] for integration with external tools. The info_format_version field is an integer value specifying the version of the JSON output: if incompatible changes are made, this integer will be monotonically increased.

It's just under a year since pizauth's first alpha release. While I have made three breaking changes since that first release, I did so early on. Two of the three changes are fairly trivial renamings of configuration options; the third is only marginally more "breaking". All in all, pizauth's initial interface seems to have held up fairly well, and while I can think of some possible additions I might make one day, I can't think of any breaking changes.

It thus seems that the time is right for pizauth's first stable release: 1.0.0 is out! By "stable" I mean that, as far as possible, future releases will not change (though I might add to) the interface — and if such changes are necessary, the major version number will be bumped [2]. In one sense, the 1.0.0 version number means very little, but in another it means a great deal. When releasing software, it's easy to jump too early to "stable" or to be scared to ever do so. 11 months after pizauth's first release, and with a growing number of users, feels about the right time to me to commit to a stable release. Enjoy!


Footnotes

[1]I am not especially fond of JSON per se, but it is now a widely accepted interchange format. The older I get, the more I tend to prefer standards, even if they're not the standard that I would most have preferred.
[2]Yes, I will be following Semantic Versioning — another standard that I'm not especially fond of, but which is better than nothing.

The Need to Explain

July 18 2023

One of my in-built quirks – present, I'm told, from a very young age – is spotting patterns. Whenever I see X, I also see Y; there are 3 As for every B; if this happens, then that is more likely to happen afterwards; and so on. It's not something I do particularly consciously, and few of my observations even reach the lofty heights of banal. Fortunately, it's not all wasted effort, and a small number of these observations end up making my life a little better. My train home always leaves from platforms 1 or 8. If I arrive at the station as my train is due to leave, my best bet is not to check the departure boards but to run past platform 8 to see if my train is there and, if it isn't, run straight to platform 1.

As I've grown older, I've learnt to keep more of the observations I've made to myself. This isn't because my observations very often offend, but because some people view them as creating a void which must be filled, as soon as possible, by an explanation. The problem with spontaneous explanations on topics beyond our purview is that they are often obviously wrong. One of my other quirks is that I find it difficult not to rebut obviously wrong explanations. This can then cause an endless loop of further explanations and rebuttals, which is rarely useful or enjoyable.

To my surprise, I've realised that further attempts at spontaneous explanation become less convincing more often than they become more convincing. One obvious lesson to draw from this is that it is easier to disprove incorrect explanations for a phenomenon than it is to generate a correct explanation. As a researcher, I have learnt that lesson the hard way more than once!

But, over time, I've come to be fascinated by the meta-observation: many of us seem to feel a need to find explanations for everything. I certainly felt that need when I was younger, due to a combination of insecurity and egotism. But, at some point, I realised that while I really want to find out the explanations for a small handful of observations, I'm generally comfortable with not having an explanation.

Why does my train leave from platforms 1 or 8? I probably could find out the explanation for this if I really wanted to, but it doesn't seem like a good use of my time. In many cases – for example, pretty much anything involving physics – I'm probably not capable of uncovering a compelling explanation even if I devote the remainder of my days to the investigation!

Of course, I sometimes can't help myself idly speculating as to possible explanations for an observation. Perhaps this suggests that we have a deep need to feel that we live in a world governed by well-defined laws? Perhaps our willingness to spontaneously think of explanations is the enabler of progress? Honestly, I don't have a clue. The observation alone has been enough to change my behaviour, and make my life a little easier.


Two Stories for "What is CHERI?"

July 5 2023

Some readers might have heard of CHERI. Of those who've heard of it, most have probably got the rough idea that it's a "more secure" CPU — but, after that, I find that most people start to get rather confused when they try and nail down "what is CHERI?".

Having grappled with this myself, and having helped others grapple with it, I think that the major reason is that the high-level story requires understanding quite a bit of detail — hardware, Operating Systems (OSs), programming languages (and some end user programs) all have to be adapted for CHERI [1]. Most people tend to focus at first solely on the CPU, but one can't form a coherent story from the CPU alone. Those who persist and piece together the other parts of CHERI can then get swept up in the novelty of the idea, and assume that CHERI magically secures everything it touches. However, CHERI is just a tool and, like all tools, it's better at some things than others — and, even then, only when used thoughtfully.

In this blog post I'm going to explain just enough of CHERI to put together two coherent stories for what it is and where it might be used. I'm not going to try and cover all the possibilities, since, as you'll see, this post is long enough as it is! One reason for that is that CHERI contains overlapping concepts that can be combined in more than one way. Because of that, I'll warn you in advance that there is a fairly hard switch between the two stories in this post — sorry in advance! I'm also not going to be afraid to diverge from CHERI's terminology, which has a number of historical quirks that I find add to the confusion.

The rough idea

Let's start with a quote from the main CHERI site:

CHERI extends conventional hardware Instruction-Set Architectures (ISAs) with new architectural features to enable fine-grained memory protection and highly scalable software compartmentalization. The CHERI memory-protection features allow historically memory-unsafe programming languages such as C and C++ to be adapted to provide strong, compatible, and efficient protection against many currently widely exploited vulnerabilities.
In simpler terms, a CHERI CPU is one that supports extra security features, most notably "capabilities" [2] – pointers with extra information – that can help catch flaws such as buffer overruns before they become security problems. Programs written in languages such as C can be recompiled, mostly with few source-level changes, so that they can make use of capabilities when run.

Let's imagine I have this seemingly innocuous C code:

char *buf = malloc(4);
strcpy(buf, "abcd");
printf("%s\n", buf);
At first glance, you could be forgiven for thinking that strcpy copies 4 bytes into buf. However, strcpy adds a NULL byte to the end of the string it copies, so 5 bytes are written into buf! This is a classic buffer overrun, of the sort that has led to a great many security flaws in real software.

When compiled for a normal CPU and OS, the chances are that this snippet will appear to run correctly, though it may corrupt the program's internal state and cause security problems. If I compile it with a CHERI C compiler, for a CHERI CPU, and run the resulting executable on a CHERI-aware OS, the incorrect write of the fifth byte will be detected by the CPU, and the program terminated with a SIGPROT (the capability equivalent of a SEGFAULT). We can see that happening in the video below, where in the split tmux we have my normal OpenBSD desktop running in the top and an Arm CHERI machine with a CHERI-aware OS in the bottom. The video shows OpenBSD statically warning me that strcpy is dangerous, which is clearly true — my buggy program ran to completion on what is often considered the most security-conscious Unix around! But on the bottom, with a CHERI-aware OS running on CHERI hardware, the buffer overrun was detected and the program stopped before it could become a security issue. Running gdb shows that the buffer overrun ("Capability bounds fault") is detected while strcpy is running.

I've been deliberately explicit that there are three "CHERI" factors involved in this: a variant of C ("CHERI C") that knows that CHERI capabilities are pointers that carry extra information about the size of a block, and a compiler capable of dealing with this; an operating system (as in the video above; currently that typically means CheriBSD, a FreeBSD variant, although some CHERI machines now have a Linux available) whose malloc [3] returns capabilities that record the size of the block it has allocated; and a CHERI CPU which detects the write past the end of the block.

We thus immediately see an obvious point of confusion: when we say "CHERI" do we mean the language, OS, CPU, an abstract definition of what CHERI is capable of, a model of capabilities, or some combination of all of these factors? In informal conversation, I've heard all of these possibilities occur — frequently, the two speakers don't even realise that they're using the term to refer to different things! I don't think it's possible to nail down a single meaning for "CHERI". Instead, I use the unqualified term "CHERI" to refer to the very high-level idea and concepts and then, when talking about a specific component, to qualify it accordingly (e.g. "CHERI CPU" or "CHERI OS" and so on).

CHERI CPUs

You might have noticed above that I said "a" CHERI CPU, not "the" CHERI CPU. We thus see another point of confusion: CHERI features can be added to any CPU. Indeed, there are at least three different CHERI CPUs available. Chronologically, the first was based on a MIPS processor, though I believe that's no longer supported. Arm's experimental Morello CPU is a mostly normal, fairly modern, AArch64 ARMv8-A processor, extended with CHERI features (they're not available commercially, but are/were available to researchers such as myself). There are also at least two RISC-V CHERI processors (desktop and microcontroller class) under design: I'm not sure if physical versions are available yet, but there is at least simulator/emulator support.

A CHERI CPU can be thought of as a combination of the traditional things that make up a CPU, plus the things necessary to support the capability features that CHERI defines. For example, Morello supports the normal AArch64 instruction set "A64" as well as a second "pure capability" instruction set "C64": programs can switch between the two on-the-fly (though this is normally not exposed to the high-level programmer). This might sound sci-fi, but it's not unprecedented: for example, some CPUs can switch between different instruction sets on-the-fly [4].

Practically speaking, even when taking advantage of CHERI-related features that the CPU offers, one can write "CHERI software" that is largely ignorant of the specific CHERI CPU it's running on. There are of course some visible differences between the different CHERI CPUs, but it's rather like the situation with normal CPUs, where I can write code that's largely ignorant of whether it's running on, say, AArch64 or x86_64.

Capabilities

It's now time to define a capability. Semi-formally, I think of a capability as a token that a program can pass to the CPU to perform actions on its behalf. Unfortunately, I find that that definition only really makes sense when you already know what a capability is! To try and break that cycle, I'll explain things using a series of gradually less inaccurate approximations.

For our first approximation, let's say that a capability is a pointer that comes with permissions. Amongst a capability's permissions are traditional things like "can this pointer read and/or write to the address pointed to?" If I try and use a capability in a situation for which it doesn't have sufficient permissions, my program will be terminated with a SIGPROT. New capabilities can be derived from existing capabilities. The simplest derivation is to simply copy the capability unchanged. I can also derive a new capability that has fewer permissions: for example, I can take a capability which can read and write to an address and derive a new capability which can read but not write to the address. However, I cannot add to a capability's permissions — this property is crucial to CHERI's security. There are some interesting implications to this. First, if a program destroys (e.g. by writing NULL) the only copy it has of a powerful permission, it can never recover those permissions. Second, a CHERI system must start with a "root" or "super" capability that has the maximum set of permissions that will ever be needed.
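As a concrete sketch of deriving a less powerful capability, the following assumes cheriintrin.h's cheri_perms_get / cheri_perms_and functions and the CHERI_PERM_STORE permission bit:

#include <cheriintrin.h>
#include <stdlib.h>

int main() {
  int *rw = malloc(sizeof(int));
  *rw = 42; // Fine: this capability can read and write.
  // Derive a capability with the store (write) permission removed.
  int *ro = cheri_perms_and(rw, cheri_perms_get(rw) & ~CHERI_PERM_STORE);
  int x = *ro; // Fine: reading is still permitted.
  *ro = 7;     // SIGPROT: writing is not.
  return x;
}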

To understand how one can be prevented from extending a capability's permissions, we need a slightly better approximation. On a normal CPU, pointers are interchangeable with integers representing memory addresses [5] — I can magic an integer out of thin air and load a value from the address in memory that matches that integer. On a CHERI CPU, capabilities are not interchangeable with integers / addresses. I cannot magic a new capability out of thin air — only the CPU can create new capabilities, and it will only let me do so by deriving them from existing capabilities. This allows the CPU to enforce the rules about how an (authentic) capability can be derived from an (authentic) input capability.

We can improve our approximation further by realising that every location in memory that can store a capability has an authenticity bit [6] associated with it [7]. If a program asks the CPU to perform a capability-related operation on an inauthentic capability, it will be terminated with a SIGPROT. It might seem odd that one can have inauthentic capabilities, but amongst their uses are that they are needed to represent arbitrary binary data (e.g. a JPEG file in memory). I can always derive an inauthentic capability from an authentic capability, but never vice versa.
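A quick way to see the authenticity bit in action is cheriintrin.h's cheri_tag_get (CHERI documentation calls this the "tag" bit), here applied to a real capability and to one conjured up from a plain integer:

#include <cheriintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  void *real = malloc(4);
  // A "capability" conjured from a plain integer has no provenance, so
  // its authenticity (tag) bit is clear; dereferencing it would SIGPROT.
  void *forged = (void *)(uintptr_t)cheri_address_get(real);
  printf("real %d forged %d\n",
    (int)cheri_tag_get(real), (int)cheri_tag_get(forged));
}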

We can now answer the seemingly simple question: how big is a capability? Let's first refine our approximation by saying that a capability contains a memory address and a set of permissions. The address is the same size as normal — on a 64-bit CPU like Morello or CHERI RISC-V, that's 64 bits. On those CPUs the set of permissions is also 64 bits in size [8]. Depending on how you look at it, the size of a capability can be thought of as 128 bits (64 and 64, after all!) or 129 bits if you include the authenticity bit — however, since the authenticity bit can only be read separately from the other bits, we typically refer to a capability as 128 bits in size.
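You can see this directly in purecap CHERI C, where a pointer is now twice the width of an address:

#include <stdio.h>

int main() {
  // On a 64-bit purecap system this prints 16 and 8: a "pointer" is a
  // 128-bit capability, while an address / size is still 64 bits.
  printf("sizeof(void *) %zu, sizeof(size_t) %zu\n",
    sizeof(void *), sizeof(size_t));
}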

Putting a CHERI story together

At this point you can be forgiven a degree of confusion: I've given you lots of low-level detail, but no high-level story. Fortunately, with one more bit of detail, we can start to see a CHERI story come together.

Earlier I used read/write as an example of capability permissions. However, by far the most commonly used permissions are bounds, which record the range of memory that a capability can read/write from. We saw earlier an example of a capability-aware malloc returning a capability that encoded how big the block was. For example, malloc(4) might return a capability with an address of (say) 0x200, a lower bound address of 0x200, and an upper bound address of 0x204 [9]. If I create a capability with an address outside of the bounds 0x200-0x204 [10] and try to read/write from it, my program will terminate with a SIGPROT.

Not coincidentally, capability bounds work very well with pointer arithmetic as found in C. Consider the following snippet:

char *buf = malloc(4);
char *c1 = buf + 1; // In-bounds
char *c2 = buf + 2; // In-bounds
char *c3 = c1 + 1;  // In-bounds (and equivalent to c2)
char *c4 = buf - 1; // Out-of-bounds
char *c5 = buf + 5; // Out-of-bounds
I can read/write to any of buf, c1, c2, or c3 without issue: but trying to read or write from c4 or c5 will terminate my program with a SIGPROT.

We now have enough detail to see a high-level CHERI story come together for the first time:

  • Programming language: CHERI C treats what are normally pointers as capabilities, but does so in a way that allows the vast majority of normal C to be compiled unchanged with CHERI C.
  • Operating system: malloc returns capabilities with appropriately restricted bounds.
  • CPU: programs that do things that are potential or actual security problems (which includes, but is not limited to, code with undefined behaviour) tend to be terminated by a SIGPROT.
Put another way: these three parts of CHERI are a pragmatic way of reducing the security flaws that result from us using 1940s ("von Neumann") architectures, and 1970s languages (C) and operating systems (Unix). For example, software running on CHERI would not have been subject to the classic Heartbleed security bug, since it relied on a buffer overread that would have caused the program to be terminated by a SIGPROT — that might have caused a denial-of-service, but that's a lot better than giving an attacker access to your system.

What does CHERI protect us from?

Because CHERI can transparently protect us from many classic security flaws in C programs, it can end up seeming like magic. Indeed, I haven't even talked about the other protections we gain almost for free, such as preventing a program from executing code at arbitrary addresses. Part of the reason that CHERI can do so much for C (or, indeed, C++) programs is because C/C++ make little effort to protect the programmer from such mistakes, even though the necessary information to do so is already present in the source code. However, as soon as we want to enforce security properties that are not directly inferable from source code, we realise that we have to use CHERI explicitly rather than implicitly.

For example, I've assumed in this post that malloc returns a capability whose bounds are restricted to the size of the block I requested. This seemingly obvious property cannot, in the innards of any realistic malloc, be inferred from the source code: the malloc author will have to explicitly use the C CHERI API (e.g. cheri_bounds_set) to ensure that the bounds are restricted to the appropriate memory range.

In some cases, such as malloc, the bounds-related security properties we want are fairly obvious and only require small tweaks to small sections of code [11]. However, many security properties are harder to define and/or require many sections of code to coordinate correctly in order to enforce them.

For example, my program may have a secret that I want to hide away from most of the program so that it can't accidentally leak to the outside world. Capabilities seem perfect for this, because one of the fundamental properties of a capability system is that it restricts the reachable set of capabilities [12] at a given point in time (unlike a normal system where I can access nearly any part of virtual memory at any point). Indeed, CHERI gives me a number of exotic features such as sentries that allow me to enforce much more complex security properties.

While capabilities clearly have potential to make programs more secure in such cases, I'm sceptical that it will be something people want to do a great deal of. Capabilities are a dynamic (i.e. run-time) concept and humans are notoriously bad at reasoning about the dynamic (or "temporal") behaviour of programs. My guess is that programmers will often make mistakes and give overly-powerful capabilities to parts of a program that should not have access to them, undermining the expected security guarantees.

In a sense, the problem is more general: the more complex my use of capabilities, the less confident I am that they will enforce the security properties I care about. In most cases, capabilities are used to stop bad things happening: a program being terminated by a SIGPROT generally indicates a bug, which will then surely be fixed. That means that I'm left hoping that capabilities will catch the bad cases that I haven't thought of. History suggests that if a programmer hasn't thought of, or hasn't tested, something then it will not work as the programmer would hope.

A related question is: what happens if I'm using a language that offers me more protection than C/C++? For example, if I'm writing code in Python, or (non-unsafe) Rust, I can't be subject to buffer overruns [13]. Are bounds checks useful in such cases? As belt and braces, yes, but capabilities are not cost-free — they take up more memory than traditional pointers (e.g. placing more pressure on caches) and checking their validity must surely have some sort of performance impact on the CPU too. I might be willing to pay this cost on the small parts of a Rust program that use unsafe, but probably not elsewhere.

In summary, C/C++ programs have enough to gain from using capabilities everywhere that many users might choose to do so. However, for other languages, the situation is more complex: bounds checks are much less useful, but still impose a performance cost that people are probably unwilling to pay; and more complex uses of capabilities are hard to reason about. Interestingly, though, CHERI has another trick up its sleeve that I think is of much wider interest than is currently thought.

The hybrid CHERI story

Up until this point in this post I have explained CHERI as if it requires explicitly using 128-bit capabilities everywhere. CHERI has a second "mode" of operation where one can use normal 64-bit pointers and 128-bit capabilities alongside each other. Conventionally, when a program uses capabilities everywhere it's said to be a purecap program; when it mixes capabilities and normal pointers it's said to be a hybrid program.

To many people's surprise, many current practical CHERI systems use hybrid mode in, at least, their OS kernels, because converting a large OS kernel to purecap is often too much work to contemplate. When the kernel interacts with a purecap user program, it must bridge between the hybrid and purecap worlds [14].

What's interesting about hybrid mode is that 64-bit pointer accesses are implicitly checked against global 128-bit capabilities. The global capability I'll use as an example is the Default Data Capability (DDC). One way of thinking about how this works is as follows: if I'm on Morello, executing normal A64 (not C64!) code, and I execute a load via a 64-bit pointer, the CPU derives a new capability from the DDC with the pointer's address and then performs all its normal capability checks.

Showing how this works is a bit awkward, but I hope the following slightly simplified code helps:

// Get two normal 64-bit pointers to chunks of memory
char *b1 = malloc(4);
char *b2 = malloc(4);

// Restrict the (128-bit) DDC to only cover b1's block of memory
void *__capability new_ddc = cheri_ddc_get();
new_ddc = cheri_address_set(new_ddc, (ptraddr_t)b1);
new_ddc = cheri_bounds_set(new_ddc, 4);
cheri_ddc_set(new_ddc);

printf("%c\n", b1[0]); // Succeeds
printf("%c\n", b2[0]); // SIGPROTs
This chunk is written in "hybrid" CHERI C, which is most easily thought of as "normal" C with 64-bit pointers, with an additional __capability qualifier which denotes a 128-bit capability. We allocate two normal blocks of memory (b1 and b2), receiving back two normal 64-bit pointers. We then read the DDC (which, by default, has bounds from 0 to the maximum possible virtual memory address) into a capability, restrict that capability to only cover b1, and then set the DDC to that new capability [15]. We can then read and write via normal 64-bit pointers to b1 successfully, but as soon as we try to read from b2 our program is terminated with a SIGPROT [16].

Hybrid mode is considered passé in the CHERI world, but I see it as potentially interesting because by restricting the DDC, we implicitly define a sort-of "subprocess" within a process. Unix-esque processes offer simple, highly effective isolation but they are very heavyweight. Various sandboxing techniques have been developed over the years, but most are awkward to use and, often, easily bypassed. Hybrid "subprocesses" seem to me to be a new point in the design space.

Splitting programs up into multiple processes for security isn't a new idea. A handful of existing programs, most notably OpenSSH, already do so ("privilege separation"): an attacker who takes over an individual process has a hard time exploiting the system beyond that process. However, inter-process communication is either comically slow (pipes) or difficult to use reliably (shared memory). CHERI's hybrid mode can still use capabilities, and those are not checked relative to the DDC: thus capabilities can be used to communicate between compartments with very little performance penalty, as the sketch below suggests.
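
As a hedged sketch of what that could look like in hybrid CHERI C (setting aside, as footnote 16 does for the earlier snippet, the question of the call stack), a function running under a narrowed DDC can still be handed an explicit capability to a shared buffer, because loads and stores through a __capability pointer are checked against that capability's own bounds rather than against the DDC. The function name compartment is made up for illustration:

#include <stddef.h>

// Hybrid CHERI C. `shared` is an explicit 128-bit capability: the store below
// is checked against shared's own bounds, not against the (narrowed) DDC, so
// two compartments can exchange data through it at essentially pointer speed.
void compartment(char *__capability shared, size_t len) {
  if (len > 0)
    shared[0] = '!';
}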

Of course, hybrid "subprocesses" are not magic — they clearly provide less isolation than processes, and CHERI's hybrid support has received much less attention than purecap. Still, at the very least hybrid mode offers a very interesting alternative to a purecap world. I think it's fairly easy to imagine a future where software uses both processes and subprocesses to gain different security/performance trade-offs.

Summing up

There is a lot more I could have said: CHERI contains many, many details I haven't even hinted at; and there are certainly other ways of using CHERI than the two stories I've outlined. I think, though, that this post is long enough already!

In terms of the two CHERI stories I've presented, it's worth being explicit that the purecap story is very much the standard CHERI story at the moment, with the hybrid story being more speculative. The very different implications each story has for today's and tomorrow's software might give you some idea that CHERI is best thought of as a toolbox – and many tools within that toolbox have more than one possible use. Exactly which use(s) of CHERI will end up being most useful – which, to my mind, probably means "provides the best balance between security and performance" – remains unclear. And, of course, there are no guarantees that CHERI will enter the mainstream! But, whatever happens to CHERI, I think it is extremely helpful in opening our eyes to possibilities that most of us have previously not even considered.

Update (2023-07-05): I originally stated that all CHERI OS kernels were hybrid. I have been reliably informed that is no longer the case — there is a CheriBSD purecap kernel. CHERI Linux, though, is a hybrid-only kernel.

Acknowledgements: thanks to Jacob Bramley, Andrei Lascu, and Myoung Jin Nam for comments.

Footnotes

[1]Although it's difficult to say with absolute certainty, due to a lack of evidence, I think that any CHERI-esque system would require a similarly wide range of changes.
[2]Capabilities, and capability aware hardware, are not new ideas (see e.g. the CAP or E). While it is possible that these ideas didn't hit the mainstream because they are flawed, it seems more likely to me that random chance meant they were not developed to the point that we could meaningfully evaluate them. Put another way, capabilities were, until CHERI, a "what if?".
[3]For simplicity's sake, I'm conflating "kernel" and "userland libraries" into "operating system". In reality – as is the case in CheriBSD today – I need the kernel to return an appropriate capability from mmap which the malloc I'm using can then restrict further.
[4]Indeed, Morello reuses the same BX mnemonic that previous Arm processors used to switch to / from normal ("A32") / Thumb ("T32") modes. Similarly, like Thumb, Morello can also switch between instruction sets based on a bit set/unset in branch instructions.
[5]Well, mostly. As I've previously written about, real, though mostly historic, hardware has done all sorts of odd things with addresses and pointers.
[6]CHERI mostly calls this the "tag", or occasionally the "validity", bit.
[7]The physical location of the authenticity bit is a hardware-dependent detail. An excellent analogy I've heard is to think of the authenticity bit as being similar to ECC memory, where there are extra bits somewhere helping ensure the integrity of the memory but you don't know (and can't find out) where those bits are actually stored.
[8]On CHERI MIPS the address was 64 bits and the permissions 192 bits!
[9]Since a capability's permissions are only 64 bits, the bounds cannot precisely represent every possible range of memory. Modern CHERI systems thus use a clever encoding scheme that allows small bounds to be precisely represented, with larger bounds becoming progressively less precise. I can see this in action with the following code:
char *b1 = malloc(4);
printf("%lu\n", cheri_length_get(b1));
char *b2 = malloc(4097);
printf("%lu\n", cheri_length_get(b2));
char *b3 = malloc(16385);
printf("%lu\n", cheri_length_get(b2));
cheri_length_get is a CHERI C intrinsic (i.e. "builtin CHERI C function") which asks the CPU to report how many bytes the capability can refer to above its lower bound. [Note, since I've used cheri_length_get my program can now only be compiled by CHERI C: it is no longer a valid "normal C" program.] On Morello this prints 4, 4097, 16392 — note that that last number has been rounded up from 16385! Not only are bounds imprecise, but different CHERI CPUs round things up at slightly different points — on CHERI-RISC-V the code prints out 4, 4104 (note: rounded up!), and 16392.

The reason I'm going into this in detail is because newcomers often check small examples, see that their bounds are precisely represented, and understandably, but incorrectly, generalise from there. Instead, CHERI C code must account for the imprecision of capability bounds, which I think of as a kind of alignment requirement. For example, memory allocators will probably want to leave "blank" memory after a block if its bounds would otherwise overlap with the next block.
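The CHERI C API can tell you how much padding is needed: cheri_representable_length takes a requested length and reports the smallest length at or above it whose bounds can be represented exactly. A hedged sketch of how an allocator might use it (padded_size is a made-up helper name):
#include <cheriintrin.h>
#include <stddef.h>

// Round a requested size up to the nearest length whose bounds the hardware
// can represent precisely; reserving this much means the capability we hand
// out cannot overlap the next block even after its bounds are rounded up.
size_t padded_size(size_t sz) {
  return cheri_representable_length(sz);
}
(A real allocator also has to align the block's start suitably — cheri_representable_alignment_mask exists for that — but I won't go into it here.)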
[10]Perhaps surprisingly, an out-of-bounds capability is not necessarily inauthentic. From the perspective of this blog post, we don't need to worry about this detail other than to know that it's at least in part to cope with C, which allows a pointer to point to the first byte past the end of a block of memory.
[11]Though it turns out to be much more fiddly than one might expect. More on that in a subsequent post.
[12]Capabilities form a graph, so if I have access to a capability C1 that points to C2 then my reachable set is {C1, C2}. If a capability C3 points to C1 but I still only have a reference to C1, then my reachable set remains {C1, C2}.
[13]Such languages are often called "memory safe", though I find that a rather vague term myself. For example, Python might be a "memory safe" language, but CPython is written in the "memory unsafe" language C: bugs in the low-level code can undermine high-level guarantees. We stand on the shoulders of footguns.
[14]As this might suggest, "purecap" and "hybrid" modes are separate compile-time and run-time concepts. For this post, I'm going to ignore the difference.
[15]Oddly there isn't a cheri_ddc_set in the CHERI API, but on Morello we can define it thus:

void cheri_ddc_set(void *__capability cap) {
    asm volatile("MSR DDC, %[cap]"
      : : [cap] "C"(cap) : "memory");
}

[16]This fragment elides one important detail: the call stack isn't included in our new DDC, so as soon as we read or write to it, the program will SIGPROT. In this example, calling printf will fail. I'll soon explain, at a high level, how to avoid this happening.