Laurence Tratt: How Hard is it to Adapt a Memory Allocator to CHERI?

CHERI is a set of things (adapted CPUs, adapted operating systems, adapted libraries) that, collectively, aim to stop software bugs becoming security flaws. If you’d like to know more, I gave some of my thinking on CHERI in Two Stories for “What is CHERI?”.

In this post I’m going to flesh out an observation I made in that post, which is that some software needs thoughtful adaption to CHERI if we want to get the security advantages we hope for. Exactly what that thoughtful adaption might look like will vary, probably substantially, between different pieces of software. What, for instance, might it look like for critical, widely used, components? In this post I’m going to look at how memory allocators (henceforth “allocators”), one of software’s most fundamental building blocks, can be adapted to CHERI. If you find this interesting, but want greater depth than I’ll go into here, you might be interested in the paper Picking a CHERI Allocator: Security and Performance Considerations that this post is based upon.

A Simple Allocator

It is a truism that virtually every program needs to dynamically allocate memory. Our collective folklore tells us that allocators like dlmalloc or jemalloc are impressive pieces of software that improve on their predecessors, but very few of us can really explain why. We call malloc, realloc, and free and magically chunks of memory are allocated, resized, or freed on our behalf.

As is often the case, one can get a useful insight into allocators by stripping away as much of the cleverness as possible. It turns out that we can write an allocator sufficient to run some real programs in just 25 lines of C:

#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

static char *heap_start;
static char *heap;
static size_t HEAP_SIZE = 1024 * 1024 * 1024;

void *malloc(size_t sz) {
  if (!heap) heap = heap_start = mmap(NULL, HEAP_SIZE,
    PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON,-1,0);
  sz = __builtin_align_up(sz, _Alignof(max_align_t));
  if (heap + sz > heap_start + HEAP_SIZE) return NULL;
  heap += sz;
  return heap - sz;
}

void free(void *ptr) { }

void *realloc(void *ptr, size_t sz) {
  void *new_ptr = malloc(sz);
  if (ptr && new_ptr) memmove(new_ptr, ptr, sz);
  return new_ptr;
}

What we’ve implemented is a “bump allocator”: every time we want to allocate or reallocate a block of memory, we “bump” (i.e. increment) the pointer stored in the variable heap.

The first time that we call malloc, no heap has been allocated, so we use the operating system primitive mmap to reserve 1GiB of RAM for our process (lines 8, 11, and 12). Because many systems become unhappy if we return unaligned memory (e.g. memory which starts on addresses ending in an odd number), we round the size requested up (on my machine to the next multiple of 16) to avoid such unhappiness (line 13). We then check if there is enough space left in the heap for the allocation, returning NULL if there is not (line 14). If there is enough space in the heap we then return a pointer to the beginning of the block (line 16) and “bump” the heap pointer (line 15).

Unsurprisingly, I have cut a few corners to keep things simple (e.g. let’s hope that mmap doesn’t fail!). Most obviously, free doesn’t “return” memory to the heap: this is semantically correct, though it will cause programs to run out of memory rather quickly! You may also find the definition of realloc surprising, because it copies however many bytes are now requested from the old to the new block. This works because our heap is contiguous so, if there’s enough space for the new block, by definition we can copy at least that many bytes from the old block.

So, does this really work? Let’s start with the classic binary trees benchmark which does a lot of memory allocation. What we want to do is to compare a normal allocator against our bump allocator. There are various ways we could do this, but the simplest is to put our allocator in a shared object and use the “ LD_PRELOAD trick“1. Setting the LD_PRELOAD environment variable to point at our shared object allows us to override the system allocator on a per-execution basis.

In the following video you can see me (in order): run the benchmark on my normal desktop computer with OpenBSD’s default system allocator; compile the bump allocator; and then use LD_PRELOAD to override the system allocator when I run the benchmark for a second time.

Not only does it run, but it’s three times faster than when run with the system allocator! Before you get too excited, let’s remember that the bump allocator is, to put it mildly, less general than the system allocator — this is not, in general, a sensible performance comparison 2. Still, it does give us a good degree of confidence that we’re really running the bump allocator in the second run of the benchmark.

As that suggests, it’s notoriously easy to fool yourself into thinking you’re testing your new allocator (“wow, I write software without bugs!”) when you’re actually running the system allocator. We won’t always be able to tell easily from time which allocator we’re running so, to give us some assurance for later examples in this post – and also because it’s fun – let’s add this little chunk of code to our allocator:

__attribute__((destructor))
static void malloc_exit() {
  fprintf(stderr, "heap used %lu\n", heap-heap_start);
}

Now, when a program uses our bump allocator, it will print out how many bytes of heap were used during execution just before the program exits. Have you ever wanted to know how many bytes grep or vi allocate during execution? Now’s your chance to find out:

I had to be a little careful in the choice of programs I run above, because Unix has, over time, slowly added more functions to the allocator API (most of which are rarely used). Fortunately, we can handle most of those (again with minor corner cutting) fairly easily. It’s quite good fun to see which software runs with this extended allocator — assuming you’ve got enough RAM!

The Bump Allocator and CHERI purecap

What does it mean to run an allocator under CHERI? Let’s start by assuming we’re using “pure capability” (henceforth “purecap”) CHERI. For our purposes, that means that 64-bit “pointers” become 128-bit capabilities, where the additional 64-bits contain various permissions. The main permissions we’re interested in are bounds, which record the range of memory that a capability – and, vitally, those capabilities derived from it – are allowed to read/write from. When we compile a C program for CHERI we’re actually compiling a “CHERI C” program. For our purposes, the semantics of CHERI C can be thought of as mostly the same as normal C’s semantics, though “pointer types” are now “capability types” (and hence double the width they are in normal C).

Perhaps surprisingly, our bump allocator not only compiles out of the box on purecap CHERI, but both binary trees and vi run successfully. That sounds good — until we start digging a bit. In order for CHERI to do something useful, capabilities need to have “non global” permissions. What bounds does the capability returned by our bump allocator’s malloc have? The easiest way to see this is to allocate some memory and then use the CHERI API to print out the capability’s lower bound, and the length of the bounds. Let’s allocate two blocks of memory and see what the result is:

#include <cheriintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  void *c1 = malloc(4);
  void *c2 = malloc(4);
  printf("c1 address 0x%lx lower 0x%lx length 0x%lx\n",
    cheri_address_get(c1), cheri_base_get(c1),
    cheri_length_get(c1));
  printf("c2 address 0x%lx lower 0x%lx length 0x%lx\n",
    cheri_address_get(c2), cheri_base_get(c2),
    cheri_length_get(c2));
}

For each of the two capabilities returned by malloc this will print: the capability’s address (equivalent to the address a “pointer” normally stores); the lower bound of memory the capability can address; and the length of memory (relative to the lower bound) that the capability can read/write from. I can derive other (valid!) capabilities from these that have different addresses, provided that those addresses stay within the bounds.

So that you can play along at home, I’m going to run an emulator of Arm’s Morello, built with cheribuild: you can build your own with ./cheribuild.py run-morello-purecap -d. I’ll warn you now that the emulator isn’t fast, but it is more than usable for our purposes. If I run the code snippet above with, first, the default allocator and, second, our bump allocator I see the following:

To make things a bit easier to follow, here’s the critical part of the video as text:

root@cheribsd-morello-purecap:~/cheri_malloc_blog # ./print_bounds
c1 address 0x40c0e010 lower 0x40c0e010 length 0x4
c2 address 0x40c0e018 lower 0x40c0e018 length 0x4
root@cheribsd-morello-purecap:~/cheri_malloc_blog # LD_PRELOAD=$(pwd)/malloc2.so ./print_bounds
c1 address 0x41400000 lower 0x41400000 length 0x40000000
c2 address 0x41400010 lower 0x41400000 length 0x40000000
heap used 4128

Both allocators return sensible looking addresses, but the bounds are very different: the default allocator returns capabilities whose bounds are restricted to the 4 bytes requested; but our bump allocator returns capabilities that can read/write from 1GiB of memory!

More formally, with the default allocator, neither c1 or c2 can be derived from the other, but with the bump allocator either capability can be derived from the other. While this isn’t exactly wrong, it means that we haven’t gained much from CHERI: code with access to one of those two capabilities can read/write from the same memory as the other capability. Indeed, it turns out that every block of memory our bump allocator returns will share the same bounds!

The clue to where those bounds comes from is that they span 1GiB of memory — exactly the same quantity of memory our bump allocator requested from mmap. As that suggests, mmap in CheriBSD returns a capability whose bounds are at least as big as the quantity of memory that you requested, and our bump allocator has simply copied those bounds over unchanged to the caller of malloc.

What Should an Allocator do on CHERI purecap?

Our bump allocator works without changes on CHERI purecap, but gives little meaningful security improvement relative to a non-CHERI system. We thus need to pause for a brief philosophical interlude: what should an allocator do on CHERI purecap?

Naively we might think we want “maximum security” but, even if we could define what that means, we probably couldn’t achieve it. Like most things in life, software nearly always requires trade-offs: if we want more of one thing (e.g. security) we’ll have to accept less of another (e.g. performance or ease of use). Although I’m not going to dwell on it in this post, there is unlikely to be a “one true secure allocator” that works well in all cases.

Let’s start with what security folk call “threat models”. In our case, let’s assume that our threat model is that external attackers might be able to take over one part of our program and then “upgrade” the attack to read all the program’s heap data. Given this threat model, we can then think about techniques and technologies that might help mitigate the threat.

In our case, we have a promising technology (CHERI purecap) and we may then decide that having the program store capabilities with the most restrictive bounds possible is the right way to mitigate the “upgrade” attack. That implies that allocators should return capabilities with bounds restricted to just the memory allocated. Doing so will clearly make an attacker’s life harder, even though it might not make it impossible 3.

Adjusting the Bump Allocator

Let’s adapt the bump allocator’s malloc so that it returns capabilities whose bounds only allow access to the requested range of memory:

#include "cheriintrin.h"

void *malloc(size_t sz) {
  if (!heap) heap = heap_start = mmap(NULL, HEAP_SIZE,
    PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON,-1,0);

  char *new_ptr = __builtin_align_up(
    heap, -cheri_representable_alignment_mask(sz));
  size_t bounds = cheri_representable_length(sz);
  sz = __builtin_align_up(sz, _Alignof(max_align_t));

  if (new_ptr + sz > heap_start + HEAP_SIZE)
    return NULL;
  heap = new_ptr + sz;
  return cheri_bounds_set_exact(new_ptr, bounds);
}

The main changes we’ve made are lines 7-10 and 15 where we perform an odd-looking dance with CHERI’s APIs to create a pointer with the appropriate bounds. In essence, because CHERI permissions don’t have enough spare bits to represent all possible lower bounds, we might have to align the heap pointer to an arbitrary amount (lines 7-8), and we may also align the requested length upwards for similar reasons (line 9). We then make sure we always align future pointers to max_align_t as in the normal malloc (line 10). Finally we return a capability with the heap pointer we want, with the bounds set to those we’ve calculated (line 15).

If I compile our CHERI-fied malloc and run our little bounds printing program again, I can see that I really am getting capabilities whose bounds mean that they cannot be derived from each other:

$ LD_PRELOAD=$(pwd)/malloc4.so ./print_bounds
c1 address 0x41400000 lower 0x41400000 length 0x4
c2 address 0x41400010 lower 0x41400010 length 0x4
heap used 4128

That all looks good! But if I write this this little program:

#include <cheriintrin.h>
#include <stdlib.h>

int main() {
  void *c1 = malloc(4);
  c1 = realloc(c1, 8);
  return 0;
}

and run it, the program is terminated with a SIGPROT (the CHERI equivalent of a SEGFAULT):

$ LD_PRELOAD=$(pwd)/malloc4.so ./print_bounds
In-address space security exception (core dumped)

How can such a simple program cause a problem? The problem is in our original realloc. Earlier I wrote:

You may also find the definition of realloc surprising, because it copies however many bytes are now requested from the old to the new block. This works because our heap is contiguous so, if there’s enough space for the new block, by definition we can copy at least that many bytes from the old block.

My corner cutting has come back to bite us! The problem is that when we pass to realloc a capability whose bounds allow access to 4 bytes and then we try and copy 8 bytes from it, CHERI – quite rightly! – considers this a violation of its rules and stops the program.

The fix for this is obvious: we have to know how big the block of memory we want to reallocate currently is. However, our simple bump allocator doesn’t store that length. Adjusting it to do so wouldn’t be rocket science, but it would be annoying 4. Fortunately, since the input capability records the length of its bounds, we can use that information to make sure that we don’t copy more from the input capability than we should:

void *realloc(void *ptr, size_t sz) {
  void *new_ptr = malloc(sz);
  if (ptr && new_ptr) memmove(new_ptr, ptr,
    cheri_length_get(ptr) < sz
    ? cheri_length_get(ptr) : sz);
  return new_ptr;
}

Note that we only copy the minimum of the input capability’s length or the requested size (lines 4 and 5). With this modification, our allocator now works as expected:

$ LD_PRELOAD=$(pwd)/malloc4.so ./print_bounds
heap used 32

What Have We Actually Done?

Amongst all the detail, it’s easy to miss that our manual adaptions of the bump allocator to CHERI have meaningfully improved its security. If an attacker gets access to one capability returned by malloc, they have no way of turning it into any other capability returned by malloc. In other words, we have given the user a new tool for “compartmentalising” their heap allocated memory. Consider this simple program:

#include <cheriintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  void *c1 = malloc(4);
  void *c2 = malloc(4);
  printf("c1 address 0x%lx lower 0x%lx length 0x%lx\n",
    cheri_address_get(c1), cheri_base_get(c1),
    cheri_length_get(c1));
  printf("c2 address 0x%lx lower 0x%lx length 0x%lx\n",
    cheri_address_get(c2), cheri_base_get(c2),
    cheri_length_get(c2));
  c1[0] = 'a';
  printf("c1[0] %c\n", c1[0]);
  printf("(c2 - 0x10)[0] %c\n", (c2 - 0x10)[0]);
}

Because we – as might a cunning attacker – know that the bump allocator places consecutively allocated blocks at predictable locations, we can try rederiving c1 from c2 (line 16). This code executes successfully (i.e. insecurely!) with our original bump allocator on CHERI purecap, but is terminated with a SIGPROT with our adapted bump allocator.

This is not to say that our adjusted allocator magically makes programs impervious to attackers. Most obviously, the user still has to think carefully about which parts of the system should have access to which capabilities returned by the allocator: handing out capabilities without sufficient thought tends to undermine the security properties the user was hoping for. Less obviously, even our adjusted allocator can be tricked into doing things we might not expect. For example, I can take a capability returned by the allocator, narrow its bounds or downgrade its permissions, and hand it off to a less trusted part of my system. An attacker can then use realloc to obtain a more powerful capability than they started with.

This isn’t merely a theoretical concern. In Picking a CHERI Allocator we introduced 5 simple “attacks” which, in essence, model situations where an attacker has gained initial access to a system and is trying to widen the scope of the attack. We then tried these attacks on a number of different allocators that have been ported to CHERI purecap. Only one allocator was invulnerable to these simple attacks — and the default CheriBSD allocator, a jemalloc fork, was vulnerable to 3 of them! We didn’t even try that hard to find such attacks — it seems likely to me that more attacks on CHERI purecap allocators could be found if someone has the time to investigate further.

Summary

If nothing else, I hope this post has made allocators seem a little less magical to you. On top of that, I hope that you believe that while CHERI (purecap or hybrid) can help you improve a program’s security, it isn’t magic: just because something runs correctly on CHERI doesn’t necessarily mean it’s any more secure than running it on a traditional system.

The adjustments we made to the bump allocator to take advantage of what CHERI has to offer are not huge, but they aren’t merely mechanical either. As that suggests, to take full advantage of CHERI you have to think not just about the code itself, but how that code is expected to be used.

Acknowledgements: my co-authors on Picking a CHERI Allocator (Jacob Bramley, Dejice Jacob, Andrei Lascu, and Jeremy Singer) did all the hard work and generously offered comments on this post. In focussing on one aspect of that paper for this post, I’ve probably introduced various mistakes, for which I am solely responsible!

Newer 2023-09-18 10:30 Older

If you’d like updates on new blog posts: follow me on Mastodon or Twitter; or subscribe to the RSS feed; or subscribe to email updates:

Footnotes

LD_PRELOAD isn’t a trick but using it is trickier than people often realise. First, you must give it an absolute path to a shared object. Second, it’s astonishingly easy to create an input shared object that doesn’t contain all of the functions you need, or expected, to override.

☒

There are a handful of use cases for making free a no-op. First, some people really need their programs to run faster and are prepared to buy a lot of RAM to make that happen — yes, this really does happen! Second, sometimes we want to know what the relative proportion of time we spend in an allocator is — our simple bump allocator gives us a good idea of the “intrinsic” costs of a program with the allocator factored out. That said, modern hardware with its many layers of caches, increasingly requires some caution in interpreting these numbers, because fragmentation (or lack of it) can have noticeable effects.

☒

For example, unless we explicitly protect the “super capability” returned by mmap, that is an obvious weakness in our mitigation. CHERI gives us tools which can make such protection plausible, but I won’t cover them in this post.

☒

It is sometimes considered realloc’s original sin that it doesn’t require the caller to track the input block’s current size. Every allocator then has to record (or otherwise derive) the size of a block, which can be a costly operation (in terms of memory use or processor time). That said, it can be rather annoying to use an allocator where you have to track the size of all allocations yourself. Swings and roundabouts…

☒

How Hard is it to Adapt a Memory Allocator to CHERI?