A little while back I wrote about what I considered to be the four main kinds of optimisation:
- Use a better algorithm.
- Use a better data-structure.
- Use a lower-level system.
- Accept a less precise solution.
In the same way that there are 10 “fifth Beatles”, I could have added many other kinds of optimisation, but I wanted to focus on what my experience has shown me to be the main contenders. In particular, I got a bit of pushback for “accept a less precise solution” but I think that the last 18 months of LLM advances have conclusively shown how many situations there are where this kind of optimisation is applicable.
However, none of this excuses the fact that I missed off a technique that, even when I wrote that post, I use frequently:
- Use parallelisation.
This post is not just a mea culpa, but attempts to give some retrospective justification for my stupidity.
Is parallelisation really useful?
When I migrated this website to a Rust-based markdown system a couple of years ago, my software at first processed one page at a time. Irritatingly regularly, I would save a page, which automatically rebuilt the site, switch to my browser, press reload, and find that… nothing had changed. I’d managed to do all this before the site had rebuilt in the background, confusing the life out of me!
A quick measurement showed that rebuilding the site took around 0.6s to run in “quick” mode. Unfortunately, profiling didn’t show any stand-out hot spots that I could try optimising.
Fortunately, I had a trick up my sleeve: I made my website builder process pages in parallel using multi-threading. The “quick” build time more than halved (to below 0.3s) and suddenly I could no longer refresh my browser before the site had rebuilt. Multi-threading benefits “deploy” mode – the output of which you’re currently reading – even more, making it more than 3x faster.
Using multiple threads to parallelise my website builder wasn’t about improving performance for the sake of it: the better performance I gained from parallelisation increased my effectiveness.
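To give a flavour of what this looks like, here is a minimal sketch (not the site builder's actual code: page_paths and build_page are hypothetical stand-ins) of the core pattern:

```rust
use std::thread;

// Hypothetical stand-ins for the real site builder's functions.
fn page_paths() -> Vec<String> {
    vec!["index.md".into(), "blog.md".into()]
}

fn build_page(path: &str) {
    println!("building {path}");
}

fn main() {
    let paths = page_paths();
    // Scoped threads (std::thread::scope) let each worker borrow a path
    // from `paths`; the scope guarantees every page has been built before
    // execution continues past it.
    thread::scope(|s| {
        for path in &paths {
            s.spawn(move || build_page(path));
        }
    });
}
```

In a real program one would probably cap the number of threads or use a work-stealing pool (e.g. the rayon crate), but even the naive version above captures the essential idea.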
Indeed, parallelisation is so well suited to some problems that if I have to use a non-parallelised version it hurts. For example, all of us know that testing software is a good thing — but it’s common, for all sorts of reasons [1], for testing frameworks to run tests serially (i.e. one after another). That isn’t just inefficient in terms of the use of computing resources, but it also makes it less likely that we’ll write as many tests as we should, as we will start to worry that the test suite takes too long to execute.
When I wrote lang_tester – a testing framework that’s mostly meant for testing compilers, VMs, and the like – I very deliberately made it run tests in parallel from the outset. I’m fortunate to have access to a 72 core server for development and lang_tester’s parallelisation makes one test suite I care deeply about run in 2.5s rather than 37s [2]. That’s fast enough that I am able to test even tiny changes without worrying that my pea brain will drift away from the problem I’m working on.
At this point I hope you also agree with me that parallelisation can be a powerful optimisation. Why did I not think to mention it in my original blog post?
There are, I think, two main reasons why I got this wrong:
- For a long time, our hardware didn’t offer enough parallelisation potential to make much effort worthwhile. These days, affordable desktop computers come with many CPU cores — the machine I’m writing this on has 16 physical cores. Servers that are, by the standards of such things, relatively affordable are starting to come with CPU core counts in the triple figures.
- Even when our hardware did start to offer enough parallelisation opportunities, it was difficult to use that parallelisation safely in our programming languages.
The parallelisation potential in modern hardware
Operating systems have long given the illusion of parallel computation: multiple processes can run on a single CPU core, each getting a little “slice” of time before the next gets its turn. This was a foundational advance in computing for two main reasons: it allows multiple people to use a single machine without having to be aware of other users; and when a program is waiting on an external resource (e.g. storage or network), another program can make progress [3].
However, if you have a CPU bound program – i.e. a program that needs the CPU to do a lot of work – none of this helps, because at no point are calculations happening in parallel. Over time this has become known as concurrency to make clear [4] that the “parallelism” is an illusion: the CPU is simply frequently flipping between different calculations, suspending and resuming them as necessary.
For the purposes of this post, we will instead say that “true” parallelism can only occur when a CPU has multiple cores [5] that can execute two or more independent calculations [6] at the same time [7].
For a long time, computers with more than one CPU / CPU core were exotic curiosities. Consumer-grade multicore CPUs are relatively recent (circa 2005) and, when they first arrived, somewhat underwhelming. I remember getting a 2 core CPU, and realising that while it made my operating system a little bit faster – my web browser could render a page while my compiler ran in a terminal – no individual program benefited directly. I thus assumed that “operating system parallelisation” of this kind was the main way that we would benefit from multicores.
The problem with this kind of parallelisation is that it’s difficult for a single human to make use of — it forces us to multi-task, endlessly firing off jobs in the background. This disrupts our flow, even if it does make efficient use of the multiple CPU cores available to us. Some calculations can be split up into multiple processes, but most cannot (as anyone who’s used Python’s multiprocessing module can attest).
The obvious alternative is to allow multi-threaded programs. It’s worth being clear what we mean by a “thread”, because that term has had more than one meaning over time [8]. In essence, a Unix process consists of one or more threads. Each thread is a semi-independent execution of the program: different threads can execute different parts of the program, each running on its own stack; and all threads in a process share the same heap. Crucially, threads can execute in parallel: an n-core CPU can execute n threads simultaneously.
Multi-threading promises many benefits, but poses many problems. From the perspective of hardware, one of the problems with writing multi-threaded programs was that software didn’t know what guarantees hardware gave it, and hardware wasn’t sure what guarantees to give software.
Although many of these problems had been explored in the server space in the early 1990s, consumer CPUs had to learn – and, to the extent I understand the history, extend – those lessons. Some of the consequences are still with us: most obviously, x86 has a “strong memory model” while Arm has a “weak memory model”. I find these terms are often misunderstood, but a good intuition is: weak memory models require more explicit synchronisation. One would expect weak memory architectures to be able to perform better, with the downside that they’re more likely to highlight problems due to sloppy programming.
That there are differences between x86 and Arm won’t surprise anyone. However, you may find it more surprising that even in 2010, AMD’s and Intel’s x86 memory models were neither compatible with each other, nor had all the edge cases been thought through. The seminal paper that pointed the way forward contains this wonderfully dry sentence:
We review several recent Intel and AMD specifications, showing that all contain serious ambiguities, some are arguably too weak to program above, and some are simply unsound with respect to actual hardware.
At the time this was published, I didn’t really appreciate the advances that were being made: I was still reading about the various ways that seemingly reasonable programs didn’t work as expected on real CPUs.
I don’t think it was until I got my hands on an 8 core [9] CPU (the first of which was only released in 2014!) that I realised that my previous thinking was becoming outdated. Most obviously, 8 cores is a lot of parallelisation potential! But also I gradually came to understand that CPUs had sorted out most of the major issues that had earlier worried me (and others who were much better informed than me).
The reason I’m giving this brief history lesson is that, like so many of us, I now take multicore CPUs for granted but it’s worth remembering:
- How recent multicore CPUs are. Multicore CPUs still haven’t been around for half of my professional career!
- That the introduction of multicore CPUs didn’t immediately lead to reliably programmable multicore CPUs.
Parallelisation and programming languages
At this point, let’s assume that all of us have available reliable, well thought through, and carefully documented multicore CPUs. How do we make our programs take advantage of them and do so reliably? Although there are a few possible ways of programming such devices, I’m going to stick to multi-threading as the chief opportunity and challenge.
Earlier I slipped this short phrase into my definition of multi-threading: “all threads in a process share the same heap”. That, it turns out, is the cause of most multi-threading woes. Most threads in most programs need to communicate with other threads, which in practice means that multiple threads need to load and store to the same part of the shared heap. For this to be done reliably, the threads need to synchronise their loads and stores.
At first glance, synchronisation between threads seems to be no more challenging than shared mutation in normal programming. However, the further you look into it, the more surprises you encounter. Let’s consider the following C code:
```c
void inc(long *p) {
    long x = *p;
    x = x + 1;
    *p = x;
}
```
Given a pointer to an integer on the heap, inc will load an integer value from the heap, increment it by one, and store it back to the heap.
Imagine that two different threads both call this function at the same time, both with the same pointer P. If the integer value stored at P is initially 0, I might imagine the end result of those two parallel calls as leaving the value 2 stored in P. However, if both threads load the value 0 simultaneously, they’ll both increment 0 to 1 and store 1 back (twice!) into P. Multi-threading has caused us to lose an integer increment!
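The standard fix is to make the whole load-increment-store a single atomic operation. Here is a rough sketch in Rust rather than C (C offers the same idea via stdatomic.h’s atomic_fetch_add):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// With an atomic read-modify-write, two threads incrementing the same
// counter can no longer lose an update.
static COUNTER: AtomicU64 = AtomicU64::new(0);

fn main() {
    thread::scope(|s| {
        for _ in 0..2 {
            s.spawn(|| {
                for _ in 0..1_000_000 {
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // Both threads have been joined by the end of the scope, so no
    // increments can be missing.
    assert_eq!(COUNTER.load(Ordering::Relaxed), 2_000_000);
}
```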
The more you look into this, the more ways you realise there are for synchronisation to cause surprises. If I run that program on a 32-bit system where long is 64 bits, I can get “tearing”, where the two 32-bit portions of the 64-bit integer are read and written separately, with potentially random results. This isn’t just a problem with low-level languages like C: I believe tearing of this kind can still occur with Java.
Lost writes and tearing might sound like the hardware letting us down, but a different example shows that compilers can also cause confusion. Recently I saw a program which, with irrelevant details elided, contained this snippet:
```c
bzero(p, 1024);
pthread_exit(0);
```
The idea was to zero a portion of the shared heap just before a thread exited. It was known that another thread could not read that memory until well after pthread_exit was called. However, it turned out that, no matter how long one waited, other threads did not see the shared memory being zeroed.

The problem here is that C knows something about memory synchronisation, and the program above didn’t inform the compiler that it wanted to do any form of “synchronise with other threads” write. The compiler, realising that the writes are therefore implicitly local to the thread, removed the bzero call. In other words, the CPU played no part in this seemingly “incorrect synchronisation” — it was a purely compile-time decision.
To inform the compiler that the bzero call cannot be removed, one would need to insert an explicit call to a directive such as atomic_signal_fence before pthread_exit. However, atomic_signal_fence doesn’t lead to any run-time action: the bzero call won’t be removed, but the writes from bzero might only be “seen” by the current thread. If we want other threads to pick up (at some point…) those writes, we’d need to use a directive such as atomic_thread_fence.
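To make the idea of a “synchronise with other threads” write concrete, here is a rough sketch in Rust, whose atomics follow the same C11-derived model: the Release store of the flag is what guarantees that the earlier write to the data is visible to the thread performing the matching Acquire load.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static FLAG: AtomicBool = AtomicBool::new(false);

fn main() {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // The Release store "synchronises with" the Acquire load in the
        // reader: once the reader sees FLAG as true, it is guaranteed to
        // also see the store to DATA above.
        FLAG.store(true, Ordering::Release);
    });
    let reader = thread::spawn(|| {
        while !FLAG.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    writer.join().unwrap();
    reader.join().unwrap();
}
```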
If, at this point, you have the sense that I could keep going much further into detail you’d be right: multi-threaded programming is hard. There are many details above and beyond non-multi-threaded programming; many surprises; and most of the surprises occur non-deterministically. This makes debugging, and thus understanding, a nightmare: you might only hit the surprising result in one execution out of thousands [10] and the very act of attempting to debug the bug can cause it to disappear!
My experience is that programmers tend to focus their attention on understanding how the hardware works. In a sense, that’s reasonable, but if your programming language is sufficiently well specified, you only need to know what its synchronisation semantics are: it should abstract over the vagaries of differing hardware.
To many people’s surprise, however, for a long time after programming languages introduced support for multi-threading, they did not completely and accurately specify how programs would work in a multi-threaded context.
Java had support for multithreading [11] from 1995, but the problems started mounting up: a big step forward was taken with JSR 133 in 2004. But that wasn’t the end of it: more problems were discovered (such as this), and more adjustments made.
That meant that for many years, Java programmers who thought they were writing correct multi-threaded programs were not necessarily doing so. This should not be taken as a slight against Java: C only got a memory model roughly equivalent in quality to Java’s in 2011 [12].
In other words: many years after plebs like me could buy multicore CPUs, we still couldn’t program them reliably because our languages had not yet gained sufficient support for reliable multi-threading.
Which languages are good for multi-threading?
Even after programming languages worked out all the details that allow a sufficiently motivated and skillful programmer to write correct multi-threaded programs, I mostly chose not to. Frankly, I quickly lost faith in my ability to reason about multi-threading after writing some C code which parallelised a problem I cared about, and finding that I’d slowed it down, and introduced bugs which, at the time, I didn’t understand at all.
The good news is that other languages make multi-threaded programming easy by largely side-stepping the synchronisation problem. Most obviously, if your language only supports immutable datatypes, then many synchronisation worries, such as the “lost write” problem from earlier, simply can’t happen. Alternatively, if your language requires other threads to operate in a semi-isolated manner (e.g. many actor languages) where communication solely involves immutable datatypes, many synchronisation worries also disappear.
However, there are problems with both classes of languages: they can make expressing some kinds of software rather awkward; and, often, their performance is not great. At the risk of offending many, compared to C, your average Haskell or Erlang program is going to be at least a bit, possibly a lot, and often “too much”, slower — both in single and multi-threaded contexts. I have seen [13] more than one attempt to rewrite a single-threaded C program into another language with more congenial support for multi-threading lead to worse overall performance.
So, for many years, I felt stuck: I wanted to do multi-threading; I didn’t trust myself to write multi-threaded C; and other safer-for-multi-threading languages (including Java) were not appropriate for what I needed.
Not only did I grow to think this depressing situation inevitable, but I stopped thinking of multi-threading as a potential solution for performance problems.
There are alternatives
I started writing Rust code at the very end of 2014, and soon realised that I’d found a language well suited to many of the programming tasks I cared about. What I didn’t realise then was that I had also bought into a language that would make multi-threaded programming fairly easy, rather safe, and very fast.
To cut a long story short, multi-threaded programming in Rust works well because of (1) the Send and Sync traits and (2) Rust’s normal ownership rules. Send denotes types whose instances can be moved (sent) to another thread; Sync denotes types whose instances can be safely shared between multiple threads. These traits are (mostly) automatically implemented and they make it virtually impossible to use types in incorrect ways: try to do something with, say, Rc<T> (which is neither Send nor Sync) in a multi-threaded context and you’ll get a compile-time error.
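For example (a minimal sketch; the error text is abbreviated), trying to move an Rc into a new thread is rejected at compile time, whereas the atomically reference-counted Arc works:

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // `std::rc::Rc` is neither `Send` nor `Sync`, so this is rejected:
    //
    //     let shared = std::rc::Rc::new(42);
    //     thread::spawn(move || println!("{shared}"));
    //     // error[E0277]: `Rc<i32>` cannot be sent between threads safely
    //
    // `Arc` implements both, so the equivalent code compiles and runs.
    let shared = Arc::new(42);
    let handle = thread::spawn(move || println!("{shared}"));
    handle.join().unwrap();
}
```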
I have now written a lot of multi-threaded Rust code and several of the “classic” multi-threading mistakes I would have made (and did make!) in C have simply not happened in Rust. Notably, I haven’t had any data races — not one! This is so impressive that I’m astonished that more people don’t shout about it [14]: we now have an imperative language, with good performance, that makes reliable multi-threaded programming possible even for non-geniuses like me!
Now, that’s not to say that multi-threading in Rust is perfect. To keep the borrow checker happy, one often has to use clones when moving data to a thread — these clutter up code in a slightly frustrating way. Temporary lifetime extension interacts badly with mutexes and has caused confusing deadlocks in my code [15].
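As a sketch of the kind of pitfall I mean (not necessarily the exact code that bit me): the temporary MutexGuard created in a match scrutinee lives until the end of the whole match, so taking the lock again inside an arm goes wrong.

```rust
use std::sync::Mutex;

fn main() {
    let m = Mutex::new(vec![1, 2, 3]);
    // The temporary `MutexGuard` produced in the scrutinee is not dropped
    // until the end of the whole `match`...
    match m.lock().unwrap().first().copied() {
        Some(_) => {
            // ...so this second `lock()` on the same thread deadlocks (or
            // panics, depending on the platform).
            m.lock().unwrap().push(4);
        }
        None => (),
    }
}
```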
There are also weird platform differences that Rust doesn’t – and, probably, mostly can’t – protect me from. For example, I like to use condition variables [16] to avoid busy polling, but different platforms interpret “wait with timeout” differently if a suspend/resume cycle happens. Multi-threaded programming still has more rough edges than “normal” programming!
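For the curious, here is a minimal sketch of the basic condition-variable pattern, using Rust’s std::sync::Condvar: the waiting thread sleeps until it is notified (or a timeout expires), rather than busily polling the flag.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // A shared flag plus a condition variable: the consumer sleeps until
    // the producer signals it, instead of busy polling.
    let pair = Arc::new((Mutex::new(false), Condvar::new()));

    let pair2 = Arc::clone(&pair);
    thread::spawn(move || {
        let (lock, cvar) = &*pair2;
        *lock.lock().unwrap() = true;
        cvar.notify_one();
    });

    let (lock, cvar) = &*pair;
    let mut ready = lock.lock().unwrap();
    // Loop to cope with spurious wakeups; the timeout bounds how long a
    // single wait can take if a notification is missed.
    while !*ready {
        let (guard, _timed_out) = cvar
            .wait_timeout(ready, Duration::from_millis(100))
            .unwrap();
        ready = guard;
    }
    println!("flag set");
}
```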
Summary
Parallelism has long held promise, but for a long time we couldn’t rely on hardware or programming languages when it came to parallel programming. I hope I’ve explained why I – and, I think, many other programmers – have consequently been afraid of exploiting parallelisation.
Nearly all those problems are now in the past and – finally! – we are starting to see somewhat widespread use of software that exploits the parallelisation potential of our hardware. Personally, I now often think of how I will parallelise my software from the very outset.
I hope that, despite my forgetfulness (or stupidity) in missing parallelisation from my original post, you will consider it a viable optimisation tool too!
Acknowledgements: thanks to Davin McCall for comments.
Footnotes
[1] These range from “my programming language implementation isn’t properly multi-threaded” (e.g. CPython) to “my testing framework didn’t consider multi-threading” to “my tests aren’t thread safe”.
[2] As is common with parallelisation, the speedup isn’t fully linear with respect to the number of cores. In this case, there are quite a few constant factors that I don’t control (e.g. setting various tests up) and which tend to be particularly noticeable on short-running tests: the longer that tests run for, the greater lang_tester’s perceived speedup will be.
[3] The benefits of this are somewhat forgotten by most people, but those of us who used cooperative multitasking operating systems know how much of a difference that can make. A program which doesn’t cooperate is a huge pain.
[4] “Clear” is a relative term. I do not find “concurrency” vs. “parallelism” to intuitively mean this, and I have to think carefully each time I use them.
[5] Or multiple CPUs on the same motherboard. From this post’s perspective, this is equivalent, and I’ll use “multicore” to cover both variants.
[6] As distinct from the somewhat dependent calculations that modern single core CPUs can somewhat parallelise. “Out-of-order” execution is a wonder, but not what I’m covering in this post.
[7] I’m only going to consider parallelism inside general purpose CPUs. I’m not going to consider vector instructions, GPUs, or other similar modes of computation: they do enable true parallelism, but of a different kind.
[8] For example, “green threads” were once a big thing, and still crop up from time to time — even though they don’t allow parallel computation! They are best thought of as a form of cooperative multitasking.
[9] At this point in the history, one needs to start differentiating “physically distinct” cores from “hyperthreads”. The latter turn one physical core into two “virtual” cores. In the early days, the performance benefits of hyperthreading were minor at best – and sometimes negative! – and they are more prone to side-channel attacks. I believe their performance benefits are probably now positive, albeit not sufficiently large for me to feel like I need to bother with them.
[10] To add insult to injury, some types of problems can only occur on certain kinds of hardware. In particular, weak memory architectures such as Arm can highlight certain mistakes that go unnoticed on strong memory architectures such as x64.
[11] I strongly suspect this reflects its genesis in Sun, which was selling servers with multiple CPUs long before they became available in consumer devices.
[12] The C11 memory model is the basis for most other languages’ memory model semantics (e.g. Rust’s).
[13] And heard some very expensive stories.
[14] Surprisingly, I don’t see Rust’s strengths in multi-threaded programming mentioned anywhere near as often as async. I say “surprisingly”, because multi-threaded Rust requires learning no new language features, and thus has no impact on libraries. In contrast, async requires learning lots of new features, some of them very complex, and has a significant impact on libraries. There are some very clear use cases for async but, personally, I’ve not yet found a need for them.
[15] Though this is at least partially fixed as of a month ago.
[16] Frustratingly, condition variables seem hugely underused. When I first tried to use them, I found the documentation – in any language! – confusing. Perhaps because of that, a sizeable proportion of the online examples were wrong. I’m not quite sure if the situation has improved, but I suspect not.