
Minimum Times Tend to Mislead When Benchmarking

August 15 2019

Most of us want, or need, to benchmark software at some point, either because we’re proactively trying to avoid performance problems, or because we’re trying to narrow down the cause of a specific performance problem. Benchmarking seems like it should be simple, but there are countless ways to go wrong: indeed, as things stand, none of us really knows how to do it well, at least not in general. In this short post, I want to make a simple argument against using the minimum time of a benchmark as a proxy for that benchmark’s performance. Doubtless I am recasting other people’s arguments, poorly, but I’ve seen a resurgence in people advocating for the use of the minimum time recently, so it seems like a good time to make the opposing case.

The basics

There are all sorts of things that one might wish to measure about a program: how much memory it consumes, the number of files it opens, and so on. By far the most common measure is “wall-clock time” [1] – i.e. how much time elapses in the real world while the program is running – and that's what I'm going to concentrate on exclusively here.

Once upon a time, measuring wall-clock time was fairly simple. You read the system clock, ran a program, read the system clock again, and reported the delta between the two clock measurements. There were some minor annoyances to do with clocks [2] but measurements tended to be fairly predictable and repeatable. The weasel word in that statement is “fairly”. Noise has always been an issue in benchmarking: except in rare circumstances, there are always external factors (e.g. other processes; even the kernel itself) that can interfere with your benchmark’s execution which, collectively, we tend to refer to as noise.

The only way to get some idea of the impact of noise is to execute multiple runs of a benchmark and produce statistics over those runs. The problem then becomes: what statistics? If you’re anything like me, statistics is an incredibly daunting area. It’s not so much that the underlying maths is complex (though it often is), but that nearly every statistical method comes with a set of preconditions that can be hard to interpret (e.g. “the data must be independent” – even when you’ve understood what “independent” means in a semi-formal sense, understanding whether your data is or isn't independent can be difficult). If you apply a statistical method to data that violates the preconditions, you’ll still get a number out, but it might not mean what you expected.

Fortunately, in traditional settings, one could fairly safely make the following simplifying assumption: noise can only slow a program down. In other words, if I run a program twice, and it produces wall-clock readings of 1.1s and 1.2s, then the 1.1s reading is closer to the “true” performance of the program. Getting a good performance reading then consists of running the program multiple times and taking the minimum time from those runs since, by definition, that is the reading closest to the “true” performance, having been least affected by noise.
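In code, that traditional recipe looks something like the following sketch (the benchmark binary’s path is a made-up stand-in for whatever program you want to measure):
use std::process::Command;
use std::time::Instant;

fn main() {
  // Run the benchmark several times, recording each wall-clock reading.
  let mut timings = Vec::new();
  for _ in 0..10 {
    let before = Instant::now();
    // Hypothetical benchmark binary: substitute whatever you are measuring.
    Command::new("target/release/bench").status().expect("benchmark failed");
    let d = Instant::now() - before;
    timings.push(d.as_secs() as f64 + d.subsec_nanos() as f64 * 1e-9);
  }
  // The traditional assumption: the minimum reading has been least affected
  // by noise, so report it as the program's performance.
  let min = timings.iter().cloned().fold(f64::INFINITY, f64::min);
  println!("{:.3}s", min);
}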

An example of performance nondeterminism

Imagine I’m writing a program which repeatedly needs to find an English word which matches a given sha256 value. Since this calculation is fairly slow, I might choose to use a hashmap to cache the sha256 values. Here’s some simple Rust code which creates a HashMap from sha256 values to normal strings:
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

use sha2::{Digest, Sha256};

fn main() {
  let mut words = HashMap::new();
  let f = File::open("/usr/share/dict/words").unwrap();
  for l in BufReader::new(f).lines().map(|l| l.unwrap()) {
    words.insert(Sha256::digest(l.as_bytes()), l);
  }
}
I can easily call words.get(h) to find a string which matches the sha256 value h and know that I’ll get average lookup performance of O(1). However, let’s assume that on rare occasions I need to do the opposite calculation, that is I need to find the sha256 value which matches a given string. We can add this code to our main function (in this case seeing if the word “zygote”, which happened to be the last word in /usr/share/dict/words that I recognised, really is found in words):
for (k, v) in &words {
  if v == "zygote" {
    for e in k {
      print!("{:x?}", e);
    }
    println!();
    break;
  }
}
If I run the resulting program it duly prints out:
d8be86c985bdd2938cc6cdc9b43039273be1fd97f3e4daed329bade585dd6ef
as the matching value.

Because I have to do a linear scan over the hashmap, this program has O(n) average lookup time. Let’s assume that I think the newly inserted bit of code is a performance bottleneck. I might then try timing the program and seeing what happens. However, because I think I know quite a bit about benchmarking, I assume that the time taken to create the initial hashmap will dominate a single subsequent lookup. To compensate, I put the code of interest into a loop so that it executes enough that I'll get a measurable signal. I also remove the print calls, because I know they can interfere with timings. I thus end up with the following:

for _ in 0..1000 {
  for (_, v) in &words {
    if v == "zygote" {
      break;
    }
  }
}
OK, so let’s build it and run it 10 times so that I can calculate the minimum running time and determine if the performance of this program is good enough:
$ cargo build --release && repeat 10 time target/release/count
    Finished release [optimized] target(s) in 0.01s
target/release/count  0.50s user 0.00s system 99% cpu 0.501 total
target/release/count  0.42s user 0.00s system 99% cpu 0.426 total
target/release/count  0.26s user 0.01s system 99% cpu 0.267 total
target/release/count  0.18s user 0.00s system 98% cpu 0.179 total
target/release/count  0.06s user 0.02s system 99% cpu 0.080 total
target/release/count  0.13s user 0.00s system 97% cpu 0.131 total
target/release/count  0.42s user 0.01s system 99% cpu 0.425 total
target/release/count  0.20s user 0.01s system 98% cpu 0.212 total
target/release/count  0.42s user 0.01s system 99% cpu 0.429 total
target/release/count  0.23s user 0.00s system 98% cpu 0.235 total
At this point I might be a bit surprised: the minimum wall-clock time (0.080s) is over 6 times faster than the maximum (0.501s). I’m quite happy with the minimum time, but decidedly unhappy with the maximum time. What happened? Well, I was running that on a fast Linux server that can be used by lots of other people, so maybe my readings were subject to a lot of noise. Let’s try running it on my trusty OpenBSD desktop:
$ cargo build --release && repeat 10 time target/release/count
    Finished release [optimized] target(s) in 0.02s
target/release/count  0.45s user 0.06s system 98% cpu 0.517 total
target/release/count  0.27s user 0.05s system 104% cpu 0.307 total
target/release/count  1.73s user 0.05s system 99% cpu 1.783 total
target/release/count  1.82s user 0.05s system 99% cpu 1.876 total
target/release/count  2.10s user 0.06s system 99% cpu 2.179 total
target/release/count  1.66s user 0.08s system 99% cpu 1.742 total
target/release/count  0.92s user 0.08s system 99% cpu 1.007 total
target/release/count  0.28s user 0.05s system 96% cpu 0.340 total
target/release/count  2.22s user 0.08s system 100% cpu 2.292 total
target/release/count  0.51s user 0.03s system 101% cpu 0.533 total
Unsurprisingly the timings are now much slower [3], but the ratio between the minimum and maximum times is now well over a factor of 7! What’s going on?

What we are witnessing here is what I call performance non-determinism for want of a better term. “Nondeterministic” is one of those words that academics bandy around willy-nilly but which can be quite confusing. In our case, it means that a program can run differently from one execution to the next. The most obvious way to write a nondeterministic program is to depend at some point on an unknown value. Sometimes this is done deliberately (e.g. an AI player in a game), although it is often accidental (e.g. a data race or reading from an incorrect place in memory).

In the program I’ve written, however, I’m not deliberately reading a random variable nor (I hope!) are there any bugs. However, there is nondeterminism at play, though it’s not directly caused by the user: the order that Rust’s HashMap stores items internally depends on a seed that is randomly chosen each time the program starts (making it harder for attackers to game the hashmap). This is largely invisible when doing a normal lookup, but becomes painfully obvious when you iterate over all the elements: on one run the element of interest might be the first I encounter, and on the next run it might be the last. Altering my program to see how many times I go round the loop before finding a match highlights this:

let mut c = 0;
for _ in 0..1000 {
  for (_, v) in &words {
    c += 1;
    if v == "zygote" {
      break;
    }
  }
}
println!("{}", c);
There's now a clear correlation between that count and the execution time:
$ cargo build --release && repeat 10 time target/release/count
    Finished release [optimized] target(s) in 0.02s
50434000
target/release/count  0.41s user 0.07s system 99% cpu 0.481 total
23732000
target/release/count  0.28s user 0.07s system 96% cpu 0.362 total
80556000
target/release/count  0.65s user 0.04s system 99% cpu 0.696 total
98799000
target/release/count  0.78s user 0.05s system 98% cpu 0.843 total
222337000
target/release/count  1.90s user 0.06s system 100% cpu 1.957 total
82766000
target/release/count  0.63s user 0.08s system 99% cpu 0.713 total
222532000
target/release/count  1.92s user 0.04s system 99% cpu 1.963 total
17949000
target/release/count  0.29s user 0.05s system 100% cpu 0.339 total
18610000
target/release/count  0.30s user 0.04s system 99% cpu 0.341 total
87260000
target/release/count  0.68s user 0.06s system 98% cpu 0.750 total
Notice that the iteration count varies by a factor of more than 10, even in those 10 runs, which largely explains the factor of 6 difference in wall-clock time (it’s likely that the remaining discrepancy is accounted for by the time needed to create the initial HashMap). The reason I call this performance nondeterminism is because it has no impact on correctness – and thus is often forgotten – but can have a significant impact on performance.

So what does all this have to do with using the minimum wall-clock time when benchmarking? In essence it shows that on modern systems, it’s not safe to use the minimum value as a measure of performance because programs such as this have a set of performance points. That seems like a very simple thing to say, but it makes a profound difference to the way one views program performance. In this case, roughly speaking, that set has as many distinct performance points as there are words in /usr/share/dict/words. Indeed, running my program 100,000 times on an unloaded server (an E3-1240 v6 3.70GHz processor, with turbo boost and hyperthreading turned off, all cores set to performance, and running Debian 9) shows a largely flat distribution [4]:

[Figure: histogram of the wall-clock times of 100,000 runs, showing a largely flat distribution.]

As this histogram shows, using the minimum time as a proxy for the program’s performance would be deeply misleading: it would lead us to hugely underestimate how long this program would take to run in most cases. Indeed, in this case, even the mean on its own would be misleading! While it’s generally not practical to give people a histogram of performance for every possible benchmark, it’s absolutely vital to report something about the distribution as a whole. In general, confidence intervals are a good technique for doing this, and, in this case, they’d be infinitely better than merely reporting the minimum.
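As a minimal sketch of what reporting a mean and a 99% confidence interval might look like (using the 10 wall-clock totals from the Linux server run above, and assuming a simple normal approximation is acceptable):
fn mean_and_ci99(samples: &[f64]) -> (f64, f64) {
  // Mean and a 99% confidence interval via a normal approximation: a
  // simplification, but it at least says something about the distribution.
  let n = samples.len() as f64;
  let mean = samples.iter().sum::<f64>() / n;
  let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
  let ci = 2.576 * (var / n).sqrt(); // 2.576 is the z-value for a 99% interval.
  (mean, ci)
}

fn main() {
  // The ten wall-clock totals from the Linux server run earlier in this post.
  let samples = [0.501, 0.426, 0.267, 0.179, 0.080, 0.131, 0.425, 0.212, 0.429, 0.235];
  let (mean, ci) = mean_and_ci99(&samples);
  println!("{:.3} +/- {:.3}", mean, ci);
}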

Is this an isolated example?

When you look around, there are a surprising number of places where performance nondeterminism can occur – and it’s not just caused by software! Caches and predictors in processors clearly cause changes in performance. Indeed, I now regularly hear processor caches and the branch predictor being blamed for unexplained performance variations. Although my cynical side thinks that this is sometimes an excuse to avoid understanding what’s really going on, they clearly have surprising and unpredictable effects. Software has its share of complementary features, one of the best known being ASLR. As an added bonus, these various features interact in ways that we don’t yet really understand.

A couple of pointers might help convince you that the problem is a deep one. I'll start with the work which has had the most influence on my way of thinking in this regard: Stabilizer. In essence, Stabilizer takes a program and deliberately changes its layout (e.g. the order of functions) so that the possible space of layouts is, statistically speaking, equally covered: when this experiment was performed in 2013, LLVM’s -O2 optimisation level was found to be statistically indistinguishable from -O3 on a well known benchmark suite. At the risk of stating the obvious, this is highly surprising: everyone knows that -O3 is “better” than -O2! But that might just be because no-one had taken into account performance nondeterminism before... What Stabilizer clearly showed me is that one must, in general, consider programs as having a set of distinct performance values (as we saw in our hashmap searching example). When I was then involved in work on measuring JIT (Just-In-Time) VM performance, we found that over 50% of well known benchmarks suffered from performance nondeterminism – including some of the C benchmarks we’d included as a control because we assumed they wouldn’t be subject to such problems!

The effects on comparison

It turns out that performance nondeterminism isn’t the only reason why the minimum time of a program is a poor measure of a program’s performance. Let’s imagine a program similar to our previous example, but this time let’s search for a random word in a list of strings. A simple version of this looks as follows:
use std::fs::File;
use std::io::{BufRead, BufReader};

use rand::thread_rng;
use rand::seq::SliceRandom;

fn main() {
  let mut words = Vec::new();
  let f = File::open("/usr/share/dict/words").unwrap();
  for l in BufReader::new(f).lines() {
    words.push(l.unwrap())
  }

  let choice = words.choose(&mut thread_rng()).unwrap();
  for _ in 0..10000 {
    assert!(words.iter().any(|w| w == choice));
  }
}
When I run that 100,000 times I get the following histogram:

[Figure: histogram of the wall-clock times of 100,000 runs of the linear search version.]

Even without looking at the code, some readers might have guessed from that histogram that I’m using a linear search over the list. Wouldn’t it be better to replace it with a binary search? Let’s give it a go:
words.sort();
let choice = words.choose(&mut thread_rng()).unwrap();
for _ in 0..10000 {
  assert!(words.binary_search(&choice).is_ok());
}
Note that I first have to sort the words, because Rust's idea of ordering is different from that of /usr/share/dict/words. With those small changes I then get the following histogram:

[Figure: histogram of the wall-clock times of 100,000 runs of the binary search version.]

Unfortunately it looks like my attempted optimisation has failed: the minimum running time of the linear search (0.009s) is lower than that of the binary search (0.010s) [5]. As we all know, an optimisation that fails to make a program faster should never be included, so it seems that all my efforts have been for nought.

However, if we look at the distributions of both versions of the program alongside each other, a rather different story emerges:

[Figure: the distributions of the linear search and binary search versions plotted alongside each other.]

As this shows, the binary search is always fast, while the linear search is often comically slow. In other words, given this data, it would be penny wise and pound foolish for me to use the linear search: I’d make the program very marginally faster for a small number of cases but much slower in most cases.

In my experience, optimisations rarely make all possible cases faster: there are trade-offs involved, and one often has to accept that some inputs will become a little slower in order that most cases can become a lot faster. The problem with using the minimum time is that it is entirely oblivious to trade-offs such as this.

Conclusions

The examples I’ve shown above might look silly to you, but they’re simplifications of problems I’ve seen in real programs: in each case, they were pretty much the first thing I tried in order to show the problem with minimum times [6]. That said, a perfectly reasonable question is whether I’ve carefully picked from amongst the small handful of examples that show such problems. Alas, particularly with regards to performance nondeterminism, this blogpost has merely highlighted the tip of the iceberg – and, as a community, we don’t really know how deep the iceberg goes.

However, performance nondeterminism is only one of the reasons why using the minimum time of a benchmark is rarely a good idea. As the optimisation example (linear vs. binary search) showed, the fundamental problem with the minimum time of a benchmark is that it gives you false confidence that you can avoid thinking about a benchmark's overall performance distribution. Sometimes you can get away with this, but doing so can often lead to you making the performance of your program worse. That said, performance nondeterminism increasingly worries me, because even a cursory glance at computing history over the last 20 years suggests that both hardware (mostly for performance) and software (mostly for security) will gradually increase the levels of performance nondeterminism over time. In other words, using the minimum time of a benchmark is likely to become more inaccurate and misleading in the future...

If you want to replicate my experiments, please have a look at the minimum_times repository. You can also download the data from the run of the experiment I used for this blog post if you want to investigate it and/or compare it to the data you get from running the experiment on your own system.

Acknowledgements: My thanks to Edd Barrett for comments on this blog post.


Footnotes

[1] It’s not uncommon for people to argue that wall-clock time is a bad measure because it’s less predictable than processor counters. The reason wall-clock time appears less predictable is because it takes more things into account (e.g. the amount of time waiting for I/O) and the more things you measure, the more chance there is for variability. There are definitely cases where focussing purely on processor counters is the right thing to do (e.g. tiny loops that are a few instructions long) but in most cases wall-clock time is the only thing that users care about. Put bluntly, as an end user I don't care if a program executes slightly fewer instructions if it does so by doing a lot more I/O: I care purely about elapsed time in the real world.

As a footnote to a footnote, it also transpires that processor counters are not as accurate as one might expect – see the classic 2008 paper by Weaver and McKee as well as this 2019 paper which confirms that the situation has not improved in the interim.

[2] From my own personal experience, system clocks often had poor resolution (i.e. they couldn't accurately tell you small differences in performance) and/or were erratic (for example, I remember a PC I did work on in the mid/late 90s that gained about 90 seconds a day).
[3] Partly because OpenBSD’s /usr/share/dict/words is well over twice as big as Debian’s equivalent.
[4] I’ve sometimes heard it said that the central limit theorem saves us when benchmarking. In my experience, this hope is based on an over simplification of the theorem, which is that if you run a program enough times any randomness (including noise) will converge towards a normal distribution. That is only true if the various random variables are close to being independent and identically distributed. As some of the examples in this blogpost show, we sometimes encounter situations where these requirements are not met.
[5] In this case, the cost of the sort is probably responsible for most of the measurable difference, though interestingly there really are cases where linear search is faster than binary search. For example, on modern machines, if you have very small lists (roughly speaking, around 10 elements or fewer), linear search can win: there are parts of JIT compiling VMs, for example, where people care about the difference in speed!
[6] That does not mean that the data is not without surprises. Why does the linear search histogram, for example, have a noticeable peak in the first histogram bin? I can speculate as to possible causes, but I don’t know for sure.

A Quick Look at Trait Objects in Rust

February 12 2019

Rust is a genuinely interesting programming language: it has a number of features which are without precedent in mainstream languages, and those features combine in surprising and interesting ways. In many cases, it's a plausible replacement for C [1]: it leads to fairly fast code; and because it doesn't have a garbage collector, its memory use is more predictable, which can be useful for some programs. I've been doing an increasing amount of programming in Rust for the last 3 or 4 years, the biggest thing being the grmtools set of libraries (a good starting place for those is the quickstart guide). However, there isn’t any doubt in my mind that Rust isn’t the easiest language to learn. That’s partly because it’s a fairly big language: in my opinion, language complexity grows exponentially with language size. But it’s also partly because a lot of Rust is unfamiliar. Indeed, my first couple of years in Rust reminded me of my early fumblings with programming: a lot of activity rewarded with only limited progress. However, it’s been worth it: Rust has proven to be a good match for many of the things I want to do.

One of the things that baffled me for quite a long time is Rust’s “trait objects”: they felt like an odd part of the language and I was never quite sure whether I was using them or not, even when I wanted to be. Since I’ve recently had cause to look into them in more detail, I thought it might be helpful to write a few things down, in case anyone else finds my explanation useful. The first part of this blog post covers the basics and the second part takes a look at the performance implications of the way trait objects are implemented in Rust.

The basics

In general, Rust has a very strong preference for static dispatch of function calls, which is where the function matching a call is determined at compile-time. In other words, when I write a function call f(), the compiler statically works out which function f I’m referring to and makes my function call point directly to that function f. This stands in contrast to dynamic dispatch where the function matching a call is only determined at run-time. Dynamic dispatch is common in languages such as Java. Consider a class C which defines a method m which one or more subclasses override. If we have an object o of type C and a function call o.m(), then we have to wait until run-time to determine whether we should call C's implementation of m or one of its subclasses. Simplifying slightly, static dispatch leads to faster performance [2] while dynamic dispatch gives one more flexibility when structuring a program. These features can combine in a variety of ways: some languages only offer static dispatch; some only offer dynamic dispatch; and some require users to opt-in to dynamic dispatch.

It can be easy to think Rust’s traits alone imply dynamic dispatch, but that is often not the case. Consider this code:

trait T {
  fn m(&self) -> u64;
}

struct S {
  i: u64
}

impl T for S {
  fn m(&self) -> u64 { self.i }
}

fn main() {
  let s = S{i : 100};
  println!("{}", s.m());
}
Here the trait T looks a bit like it's a Java interface, requiring any class/struct which implements it to have a method m which returns an integer: indeed, the calling syntax s.m() looks like a method call on an object which we might well expect to be dynamically dispatched. However, the Rust compiler statically resolves the call m() to S's implementation of T::m. This is possible because the variable s is statically determined to have type S and, since S implements the T trait which has an m method, that function is the only possible match. As this suggests, Rust’s traits aren’t really like Java interfaces, and structs aren’t really like classes.

However, Rust does allow dynamic dispatch, although at first it can feel like magic to make it happen. Let's assume we want to make a function which can take in any struct which implements the trait T and call its m function. We might try and write this:

fn f(x: T) {
  println!("{}", x.m())
}
However compiling that leads to the following scary error:
error[E0277]: the size for values of type `(dyn T + 'static)` cannot be known at compilation time
  --> src/main.rs:21:6
   |
21 | fn f(x: T) {
   |      ^ doesn't have a size known at compile-time
   |
   = help: the trait `std::marker::Sized` is not implemented for `(dyn T + 'static)`
   = note: to learn more, visit <https://doc.rust-lang.org/book/second-edition/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
   = note: all local variables must have a statically known size
   = help: unsized locals are gated as an unstable feature
The length of that error is intimidating, but it actually says the same thing in three different ways [3]. By specifying that we'll take in an argument of type T we're moving the run-time value into f. However – and I prefer to think of this in terms of running code – we don't know how big any given struct that implements T will be, so we can’t generate a single chunk of machine code that handles it (should the machine code expect a struct of 8 bytes or 128 bytes or ...? how big should the stack the function requests be?).

One sort-of solution to the problem is to change f to the following:

fn f<X: T>(x: X) {
  println!("{}", x.m())
}
This now compiles, but does so by duplicating code such that static dispatch is still possible. The code between angle brackets <X: T> defines a type parameter X [4]: the type passed (possibly implicitly) for that parameter must implement the trait T. This looks a bit like it's a long-winded way of saying fn(x: T) but the type parameter means that monomorphisation kicks in: a specialised version of f is generated for every distinct type we pass to X, allowing Rust to make everything statically dispatched. Sometimes that's what we want, but it's not always sensible: monomorphisation means that code bloat can be a real concern (we can end up with many separate machine code versions of f); and it can limit our ability to structure our program in a sensible fashion.
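To make the duplication concrete, here’s a small self-contained sketch (it repeats the trait definition, and borrows the S1 and S2 structs which appear in the next listing) in which each concrete type used with f causes a separate, statically dispatched copy of f to be compiled:
trait T {
  fn m(&self) -> u64;
}

struct S1 { i: u64 }
struct S2 { j: u64 }

impl T for S1 { fn m(&self) -> u64 { self.i * 2 } }
impl T for S2 { fn m(&self) -> u64 { self.j * 4 } }

fn f<X: T>(x: X) {
  println!("{}", x.m())
}

fn main() {
  f(S1{i: 100}); // Compiled as a specialised f::<S1>.
  f(S2{j: 100}); // Compiled as a second, separate f::<S2>.
}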

Fortunately, there is an alternative solution [5] which does enable dynamic dispatch, as shown in the following code:

trait T {
  fn m(&self) -> u64;
}

struct S1 {
  i: u64
}

impl T for S1 {
  fn m(&self) -> u64 { self.i * 2 }
}

struct S2 {
  j: u64
}

impl T for S2 {
  fn m(&self) -> u64 { self.j * 4 }
}

fn f(x: &T) {
  println!("{}", x.m())
}

fn main() {
  let s1 = S1{i : 100};
  f(&s1);
  let s2 = S2{j : 100};
  f(&s2);
}
Running this program prints out 200 and then 400 and that was achieved by dynamic dispatch! Hooray! But why did it work?

The only real difference with our previous version is that we changed f to take a reference to an object of type T. Although we don't know how big a struct that implements T might be, every reference to an object that implements T will be the same size, so we only need a single machine code version of f to be generated. It's at this point that the first bit of magic happens: in main we passed a reference to an object of type S1 to f, but f expects a reference to an object of type T. Why is that valid? In such cases, the compiler implicitly coerces &S1 to &T, because it knows that the struct S1 implements the trait T [6]. Importantly, this coercion magically attaches some extra information (I’ll explain more below) so that the run-time system knows it needs to call S1's methods on that object (and not, for example, S2's methods).

That such an implicit coercion is possible is, in my experience, very surprising to those new to Rust. If it’s any consolation, even experienced Rust programmers can fail to spot these coercions: nothing in f's signature tells you that such a coercion will happen, unless you happen to know that T is a trait, and not a struct. To that end, recent versions of Rust let you add a syntactic signpost to make it clear:

fn f(x: &dyn T) {
  println!("{}", x.m())
}
The extra dyn keyword has no semantic effect, but you might feel it makes it a little more obvious that a coercion to a trait object is about to happen. Unfortunately, because the use of that keyword is currently a bit haphazard, one can never take its absence as a guarantee that dynamic dispatch won’t occur.

Interestingly, there is another way to perform this same coercion, but without using references: we can put trait objects in boxes (i.e. put them on the heap). That way, no matter how big the struct stored inside the box, the size of the box we see is always the same. Here's a simple example:

fn f2(x: Box<T>) {
  println!("{}", x.m())
}

fn main() {
  let b: Box<S1> = Box::new(S1{i: 100});
  f2(b);
}
In main we have a variable of type Box<S1> but passing it to f2 automatically coerces it to Box<T>. In a sense, this is just a variant of the reference coercion: in both cases we’ve turned an unsized thing (a trait object) into a sized thing (a reference or a box).
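As a small sketch of why this is useful (reusing the T, S1, and S2 definitions from the listing above), boxing lets a single vector store differently sized structs, because each element is a uniformly sized Box<dyn T>:
fn main() {
  let v: Vec<Box<dyn T>> = vec![Box::new(S1{i: 100}), Box::new(S2{j: 100})];
  for x in &v {
    // Each call is dynamically dispatched via the element's vtable.
    println!("{}", x.m());
  }
}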

Fat pointers vs. inner vpointers

I deliberately omitted something in my earlier explanation: while it's true that all references to an object of our trait type T are of the same size, it's not necessarily true that references to objects of different types are the same size. An easy way to see that is in this code, which executes without errors:
use std::mem::size_of;

trait T { }

fn main() {
    assert_eq!(size_of::<&bool>(), size_of::<&u128>());
    assert_eq!(size_of::<&bool>(), size_of::<usize>());
    assert_eq!(size_of::<&dyn T>(), size_of::<usize>() * 2);
}
What this says is that the size of a reference to a bool is the same as the size of a reference to a u128, and both of those are a machine word big. This isn't surprising: references are encoded as pointers. What might be surprising is that the size of a reference to a trait object is two machine words big. What's going on?

Rust uses fat pointers extensively, including when trait objects are used. A fat pointer is simply a pointer-plus-some-other-stuff, so is at least two machine words big. In the case of a reference to a trait object, the first machine word is a pointer to the object in memory, and the second machine word is a pointer to that object’s vtable (which, itself, is a list of pointers to a struct’s dynamically dispatched functions). Although it’s not universally used terminology, let’s call a pointer to a vtable a vpointer. We can now make sense of the ‘magic’ part of the coercion from a struct to a trait object I mentioned earlier: the coercion from a pointer to a fat pointer adds the struct’s vpointer to the resulting fat pointer. In other words, any trait object coerced from an S1 struct will have a fat pointer whose vpointer points to S1’s vtable (call it v1), and any trait object coerced from an S2 struct will have a vpointer to S2’s vtable (v2). v1 will, conceptually, have a single entry pointing to S1::m and v2 a single entry pointing to S2::m. If you want, using unsafe code, you can easily tease the object pointer and vpointer apart.

If you’re a Haskell or Go programmer, this use of fat pointers is probably what you expect. Personally I’m used to vpointers living alongside objects, not alongside object pointers: as far as I know the former technique doesn’t have a name so let’s call it inner vpointers. For example, in a typical object orientated language, every object is dynamically dispatched, so every object carries around its own vpointer. In other words, the choices are between adding an extra machine word to pointers (fat pointers) or an extra machine word to objects (inner vpointers).

Why might Rust have plumped for fat pointers instead of inner vpointers? Considering only performance as a criterion, the downside to inner vpointers is that every object grows in size. If every function call uses an object’s vpointer, this doesn’t matter. However, as I showed earlier, Rust goes out of its way to encourage you to use static dispatch: if it used inner vpointers, the vpointers would probably go unused most of the time [7]. Fat pointers thus have the virtue of only imposing extra costs in the particular program locations where you want to use dynamic dispatch.

What are the performance trade-offs of fat pointers vs. inner vpointers?

Most Rust programs will use dynamic dispatch sparingly – any performance differences between fat pointers and inner vpointers are likely to be irrelevant. However, I want to write programs (language VMs) which use it extensively (for every user-language-level object). It’s therefore of some interest to me as to whether there’s any difference in performance between the two schemes. Unfortunately I haven’t been able to turn up any relevant performance comparisons [8]: the nearest I’ve seen are papers in the security world, where fat pointers are used to encode certain security properties (e.g. the second part of a fat pointer might carry around a memory block’s length, so that all array accesses can be checked for safety), and thus clearly make performance worse. Our setting is quite different.

In order to make a compelling comparison, I'd need to take real programs of note and measure them rigorously using both schemes, but that’s a bit difficult because I don’t have any such programs yet, and won’t for some time. So I’m going to have to make do with a few micro-benchmarks and the inevitable caveat: one should never assume that anything these micro-benchmarks tell us will translate to larger programs. I'm also not going to go to quite the extremes I have in the past to measure performance: I'm looking for a rough indication rather than perfection.

In order to keep things tractable, I made three assumptions about the sorts of program I'm interested in:

  1. Such programs will create trait objects occasionally but call methods in them frequently. I therefore care a lot more about calling costs than I do about creation costs.
  2. I care more about methods which read from the self object than those that don't. Are there any performance differences between the two?
  3. In general – and unlike most Rust programs – I'm likely to have a reasonable amount of aliasing of objects. Does that cause any performance differences when calling functions?

In order to model this, I created a trait GetVal which contains a single method which returns an integer. I then created two structs which implement that trait: SNoRead returns a fixed integer (i.e. it doesn't read from self); and SWithRead returns an integer stored in the struct (i.e. it reads from self). Both structs are a machine word big, so they should have the same effects on the memory allocator (even though SNoRead doesn't ever read the integer stored within it). Eliding a few details, the code looks like this:

trait GetVal {
  fn val(&self) -> usize;
}

struct SNoRead {
  i: usize
}

impl GetVal for SNoRead {
  fn val(&self) -> usize { 0 }
}

struct SWithRead {
  i: usize
}

impl GetVal for SWithRead {
  fn val(&self) -> usize { self.i }
}
To keep things simple, our first two benchmarks will stick with fat pointers, and we'll just measure the difference between calling SNoRead::val and SWithRead::val. Eliding a number of details, our first benchmark looks like this:
fn time<F>(mut f: F) where F: FnMut() {
  let before = Instant::now();
  for _ in 0..ITERS {
    f();
  }
  let d = Instant::now() - before;
  println!("{:?}", d.as_secs() as f64 + d.subsec_nanos() as f64 * 1e-9);
}

fn bench_fat_no_read() {
  let mut v: Vec<Box<dyn GetVal>> = Vec::with_capacity(VEC_SIZE);
  for _ in 0..VEC_SIZE {
    v.push(Box::new(SNoRead{i:0}));
  }
  time(|| {
    for e in &v {
      assert_eq!(e.val(), 0);
    }
  });
}
In essence, bench_fat_no_read creates a vector with VEC_SIZE elements, each of which contains a boxed trait object. The time function then measures how long it takes to iterate over the vector, calling the val method on each element. Notice that we don’t measure the time it takes to create the vector and that we repeat iterating over the vector ITERS times to make the benchmark run long enough. Although I don’t show it above, in order to make the resulting measurements more reliable, each benchmark is compiled into its own binary, and each binary is rerun 30 times. Thus the numbers I’m reporting are the mean of the benchmark run across 30 process executions, and I also report 99% confidence intervals calculated from the standard deviation.

Only having one benchmark makes comparisons a little hard, so let's do the easiest variant first: a version of the above benchmark which uses SWithRead instead of SNoRead. This simply requires duplicating bench_fat_no_read, renaming it to bench_fat_with_read, and replacing SNoRead with SWithRead inside the function.
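For concreteness, the with_read variant might look something like this (a sketch; the code in the benchmark repository may differ in its details):
fn bench_fat_with_read() {
  let mut v: Vec<Box<dyn GetVal>> = Vec::with_capacity(VEC_SIZE);
  for _ in 0..VEC_SIZE {
    // SWithRead::val reads self.i, which we initialise to 0, so the assert
    // below still holds.
    v.push(Box::new(SWithRead{i: 0}));
  }
  time(|| {
    for e in &v {
      assert_eq!(e.val(), 0);
    }
  });
}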

We then have to decide how big to make the vector, and how many times we repeat iterating over it. I like to make benchmarks run for at least one second if possible, because that tends to lessen the effects of temporary jitter. As an initial measurement, I set VEC_SIZE to 1000000, and ITERS to 1000. Here are the results for the two benchmarks we’ve created thus far:

bench_fat_no_read: 1.708 +/- 0.0138
bench_fat_with_read: 2.152 +/- 0.0103
This isn't too surprising: there's an inevitable fixed cost to iterating over the vector and jumping to another function which is shared by both benchmarks. However, bench_fat_with_read also measures the cost to read self.i in the SWithRead::m function which slows things down by over 25%.

Now we can move onto the hard bit: creating inner vpointer variants of both benchmarks. This is a little bit fiddly, in part because we need to use unsafe Rust code [9]. The basic technique we need to know is that a fat pointer can be split into its constituent parts as follows:

let x: &dyn T = ...;
let (ptr, vtable) = unsafe { mem::transmute::<_, (usize, usize)>(x) };
You can look at transmute in several different ways, but I tend to think of it as a way of copying bits in memory and giving the copied bits an arbitrary type: in other words, it’s a way of completely and utterly bypassing Rust’s static type system. In this case, we take in a fat pointer which is two machine words big, and split it into two machine word-sized pointers (which I’ve encoded as usizes, because it slightly reduces the amount of code I have to enter later).
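As a tiny, contrived illustration of that “copy the bits, give them an arbitrary type” view (nothing to do with fat pointers yet):
use std::mem;

fn main() {
  let x: u32 = 0x12345678;
  // The same 4 bytes reinterpreted as an array of u8s; the order you see
  // depends on the machine's endianness.
  let bytes: [u8; 4] = unsafe { mem::transmute(x) };
  println!("{:x?}", bytes);
}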

What we need to do first is create a vector of thin (i.e. one machine word big) pointers to “vpointer + object” blocks on the heap. Eliding a few annoying details, here’s code which does just that:

fn vec_vtable() -> Vec<*mut ()> {
  let mut v = Vec::with_capacity(VEC_SIZE);
  for _ in 0..VEC_SIZE {
    let s = SNoRead{i: 0};
    let b = unsafe {
      let (_, vtable) = transmute::<&dyn GetVal, (usize, usize)>(&s);
      let b: *mut usize = alloc(...) as *mut usize;
      b.copy_from(&vtable, 1);
      (b.add(1) as *mut SNoRead).copy_from(&s, 1);
      b as *mut ()
    };
    v.push(b);
  }
  v
}
The type *mut () is Rust’s rough equivalent of C’s void * pointer. Every time we make a new SNoRead object, we create a trait object reference to it and pull out its vpointer (note that this will be the same value for every element in the vector). We then allocate memory sufficient to store the vpointer followed by the object itself: on a 64-bit machine, the vpointer will be stored at offset 0, and the object will be stored starting at offset 8. Notice that a significant portion of this function is unsafe code: that’s inevitable when fiddling around with low-level pointers like this in Rust.

With that done, we can then create a benchmark:

pub fn bench_innervtable_no_read() {
  let v = vec_vtable();
  time(|| {
    for &e in &v {
      let vtable = unsafe { *(e as *const usize) };
      let obj = unsafe { (e as *const usize).add(1) };
      let b: *const dyn GetVal = unsafe { transmute((obj, vtable)) };
      assert_eq!(unsafe { (&*b).val() }, 0);
    }
  });
}
There are a couple of subtleties here. Notice how we recreate a fat pointer by recombining the object pointer and vpointer: the *const dyn GetVal type annotation is vital here, as otherwise Rust won't know which trait we're trying to make a fat pointer for. In order to turn a raw fat pointer into a normal reference, we have to use the &* operator [10]. With that done, we can then call the val method.

Using the same settings as for our earlier run, our new benchmark (and its with_read variant) performs as follows:

bench_innervtable_no_read: 2.111 +/- 0.0054
bench_innervtable_with_read: 2.128 +/- 0.0090
Unsurprisingly, bench_innervtable_no_read is much slower than its fat pointer cousin bench_fat_no_read: the former has to do a memory read on each iteration of the loop to read the vpointer, whereas the latter has that available in its fat pointer. bench_fat_no_read is thus very cache friendly because all it's doing is iterating linearly through memory (the vector) and then calling the same function (SNoRead::val) repeatedly.

However, bench_innervtable_with_read is only just (taking into account the confidence intervals) slower than bench_innervtable_no_read. Why might this be? Well, reading the vpointer from memory will nearly always bring the object into the processor’s cache too, making the read in SWithRead::val much cheaper afterwards. Put another way, the first read of a random point in memory is often quite slow, as the processor waits for the value to be fetched from RAM; but reading another point almost immediately afterwards is quick, because an entire cache line (a chunk of memory: 64 bytes on x86) is brought into the processor’s cache in one go. If you look carefully, bench_innervtable_with_read is ever so slightly faster than bench_fat_with_read, although not by enough for me to consider it particularly significant.

Is any of this interesting? Depending on your use case, yes, it might be. Imagine you're implementing a GUI framework, which is a classic use case for dynamic dispatch. A lot of the methods that are called will be empty, because the user doesn't need to handle the respective actions. The numbers above show that inner vpointers would slow you down in such a case: fat pointers are clearly the right choice.

Let's make our final class of benchmark: what happens if, instead of our vector pointing to VEC_SIZE distinct objects, each element points to the same underlying object? In other words, what are the performance implications when aliasing gives us faster access to inner vpointers? First let's create our vector:

fn vec_multialias_vtable() -> Vec<*mut ()> {
  let ptr = {
    let s = SNoRead{i: 0};
    unsafe {
      let (_, vtable) = transmute::<&dyn GetVal, (usize, usize)>(&s);
      let b = alloc(...) as *mut usize;
      b.copy_from(&vtable, 1);
      (b.add(1) as *mut SNoRead).copy_from(&s, 1);
      b as *mut ()
    }
  };
  vec![ptr; VEC_SIZE]
}
Note that we only create a single pointer to a “vpointer + object” block, duplicating that pointer (but not the object!) VEC_SIZE times via vec![ptr; VEC_SIZE].

By now, I suspect you've got the hang of the main benchmarks, so I won't repeat those. Let's go straight to the numbers:

bench_fat_multialias_no_read: 1.709 +/- 0.0104
bench_fat_multialias_with_read: 1.709 +/- 0.0099

bench_innervtable_multialias_no_read: 1.641 +/- 0.0007
bench_innervtable_multialias_with_read: 1.644 +/- 0.0115
Interestingly, there is now a small, but noticeable difference: the inner vpointer approach is definitely faster. Why is this? Well, it's probably because the fat pointer version of this benchmark consumes more memory. Each pointer consumes 2 machine words, so, at an abstract level, the total memory consumption of the program is (roughly) VEC_SIZE * 2 machine words big (I think we can ignore the single machine word needed for the one object). In contrast, the inner vpointer version consumes only VEC_SIZE machine words. It's thus a little more cache friendly, which probably explains the 4% performance difference.
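A quick back-of-the-envelope check of that claim (a sketch which only counts the pointers stored in the vectors, not the boxed data or the vectors’ own bookkeeping):
use std::mem::size_of;

trait GetVal {
  fn val(&self) -> usize;
}

fn main() {
  const VEC_SIZE: usize = 1000000;
  // On a 64-bit machine this prints 16000000 vs 8000000 bytes: fat pointers
  // are two machine words per element, thin pointers one.
  println!("fat:  {} bytes", size_of::<&dyn GetVal>() * VEC_SIZE);
  println!("thin: {} bytes", size_of::<*mut ()>() * VEC_SIZE);
}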

An important question is whether, even for these very limited benchmarks, anything changes if the vector size changes. Yes, it does slightly: it seems that the smaller the vector, the less the difference between the two approaches (as you can see in the performance numbers for a number of variants).

Conclusions

To my simple mind, trait objects in Rust are confusing because of the way they magically appear through implicit coercions. Maybe my explanation will help someone get to grips with them a little sooner than I managed to.

With regards to the performance comparison, what does it mean for most people? I suggest it shows that Rust’s choice of fat pointers is probably the right one: if you don’t use trait objects, you don’t pay any costs; and if you do use trait objects, the performance is often the best anyway.

What does this mean for me? Well, for VMs, the situation is a little different. In particular, in many VMs, the first thing a method on a VM-level object does is to read from self (e.g. due to biased locking). In such cases, and assuming a sensible object layout, the costs between fat pointers and inner vpointers are probably roughly comparable. Because most languages that aren't Rust allow aliasing, it's also likely that the run-time will see some aliasing, at which point inner vpointers might become a touch quicker; and, even if there's no performance difference, they will use slightly less memory.

Of course, all of this is highly speculative, based on tiny benchmarks, and I'll probably try and keep my options open when implementing my first VM or two in Rust to see if one approach is better in practice than the other. Don't hold your breath for new results any time soon though!

If you want to run, or fiddle with, the benchmarks shown in this blog post, they're available as vtable_bench.

Acknowledgements: My thanks to Edd Barrett and Jacob Hughes for comments on this blog post.


Footnotes

[1] And, presumably, C++, though I don’t know that language well enough to say for sure.
[2] Static dispatch makes inlining trivial. In my opinion, inlining is the mother of all optimisations: the more of it you can do, the faster your program will run. Note that inlining doesn’t have to be done statically (though it’s easier that way), but doing it dynamically tends to cause noticeable warmup.
[3] Although this error is unnecessarily verbose, in general modern rustc gives very good quality error messages: older versions of rustc were quite a lot worse in this regard. Indeed, I've noticed that the better error messages make it a lot easier for those learning Rust today compared to 2015 or so.
[4] It’s tempting to call these things “generics”, making them sound similar to the feature of that name found in languages such as Java. However, when you get a bit further into Rust, you start to realise that the “parameter” part of the “type parameter” name is crucial, because you can pass type parameters around statically in a way that resembles “normal” parameters.
[5] There’s another possible solution which I’m not going to talk about here, involving the CoerceUnsized trait. First, this has been unstable for years, and the lack of activity on the relevant issue suggests it will be unstable for a while to come. Second, it’s really hard to explain what it does.
[6] There are some restrictions on what traits can contain and still be converted into trait objects. I won’t worry about those here.
[7] To some extent, one could statically optimise away some of the costs of inner vpointers by noticing that some structs are never used with dynamic dispatch. That will still allow some wastage, because some structs are only sometimes used with dynamic dispatch. It’s hard to imagine an automatic analysis doing a great job of optimising such cases effectively.
[8] Although it’s not quite the same thing, Gui Andrade's description of dynstack is well worth a read.
[9] Unfortunately what unsafe code is allowed to do and what it's not allowed to do is largely unknown in Rust. This is a big problem for anyone who needs to drop out of safe Rust. There is some progress towards addressing this, but it’s not going to happen quickly.
[10] I must admit that I found this operator's syntax confusing at first. &*x looks like it’s saying “dereference x and get a reference to where x is stored”, but it doesn’t dereference x: it just reinterprets a raw pointer as a Rust reference, which has no run-time effect.

Why Aren’t More Users More Happy With Our VMs? Part 2

September 19 2018

In the first part of this two-part blog, I showed results from an experiment on VM performance that I’d been part of. Although it wasn’t our original intention, we eventually ended up trying to find out how often mainstream programming language VMs warm up as per traditional expectations. If you consider only individual process executions, then about 70% warmup; if you consider (VM, benchmark) pairs, then about 40% warmup consistently; and if you consider (VM, benchmark, machine) triples, then about 12% warmup consistently.

However you choose to look at it, these numbers aren’t pretty. In trying to make sense of them, I’m going to try and provide some answers to the following three questions. First, are there simple explanations for the performance problems we encountered? Second, how did we get into this situation? Third, how can we do better in the future? To answer the first question, I'm going to draw on our paper. In providing suggested answers to the second and third questions, I'm going to make extensive use of my personal opinions and experiences, so please don’t blame anyone else for them!

Are there simple explanations for poor performance?

There are all sorts of factors that might be responsible for odd performance, but, beyond those covered in the first part of this blog post, there are two that seem both particularly likely and relatively easy to test for: garbage collection and JIT compilation. For example, garbage collection is, to state the obvious, memory intensive: it could conceivably have all sorts of odd effects on processor caches and thus performance. Similarly, JIT compilation can happen at any point, and there's no guarantee that it makes things faster.

We therefore took two of the VMs under examination – HotSpot and PyPy [1] – and performed a second experiment, separate from the first [2], where we also recorded how long each in-process iteration spent in garbage collection activities and JIT compilation. To be clear, the aim of this second experiment wasn’t to find causations (i.e. to find definitive reasons for odd performance), but to see if there are plausible correlations (i.e. good candidates for further investigation).

Here’s an example plot from this second experiment:

[Figure: fasta on PyPy – three sub-plots sharing an x-axis of in-process iterations: time per iteration spent in garbage collection and JIT compilation (top two), and wall-clock time per iteration (bottom).]

There are three sub-plots in this plot. All share the same x-axis of in-process iterations (i.e. the benchmark running in a for loop). The “main” sub-plot (the bottom of the 3 sub-plots) is the same as in the first part of this blog, with its y-axis showing the wall-clock time that each in-process iteration takes. In the example above, we can see something odd going on in fasta on PyPy: there are 3 main “sections”, but instead of each “section” having fairly consistent performance for each in-process iteration, they seem to get gradually slower over time.

The top two sub-plots show the time spent in garbage collection and JIT compilation for each in-process iteration [3]. These clearly show that JIT compilation only happens right at the very beginning of the benchmark. However, the time spent in garbage collection is almost perfectly correlated with the overall performance of the benchmark. That suggests first that this benchmark is spending most of its time in garbage collection, and second that there may well be a minor memory leak (in the VM or the benchmark). Again, to be clear, I’m only claiming a correlation, not a causation. However, if I wanted to fix this, I know that I’d start looking for memory allocation sites that don’t seem to be freeing up memory.
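
For readers who want to go beyond eyeballing plots, a correlation like this can be quantified with something as simple as Pearson’s r. Here’s a minimal sketch in Python; the numbers are made up purely for illustration, and real per-iteration data would come from the VM’s profiling hooks described above:

from statistics import correlation  # Pearson's r; requires Python >= 3.10

# Illustrative per-iteration figures only: seconds of wall-clock time and
# seconds spent in garbage collection for six consecutive in-process iterations.
wallclock_times = [0.210, 0.214, 0.221, 0.228, 0.236, 0.245]
gc_times = [0.050, 0.054, 0.060, 0.067, 0.075, 0.083]

print(correlation(wallclock_times, gc_times))  # close to 1.0, i.e. strongly correlated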


Here’s my old friend, Richards running on HotSpot: this benchmark gets 5% slower at about iteration 200, ending up in a steady state of non-peak performance. We can see that we’re regularly spending time garbage collecting [4], but there’s no correlation between garbage collection spikes and the main sub-plot. However, the JIT compilation sub-plot is more interesting: just before the benchmark slows down, there are two long-ish JIT compilation events. It therefore seems plausible that the VM decided to recompile parts of the benchmark (probably based on extra profiling data it had collected), but that the recompilation led to slower machine code being generated. Again, it’s not a guaranteed causation, but it does give a pointer for where to look. However, it's a less useful pointer than the previous example. Since JIT compilers are, to put it mildly, rather complex, fingering the JIT compiler as a culprit still means considering a huge amount of code as the likely cause of the problem.


This third example shows a more troubling case. This is the shallow square-wave example we saw in the first part of this blog (you can see the original plot if you want a reminder). It shows two things. First, turning on profiling changes the nature of most process executions: instead of appearing almost immediately, the square wave doesn’t appear until the thousandth iteration; and the “steady-ish state performance” is better with profiling on (roughly 0.34-0.36s with profiling on vs. 0.36-0.40s with profiling off). Second, no matter how much you look, there is no correlation between garbage collection (indeed, no garbage collection occurs at all!) or JIT compilation and the square wave. You might get excited by the JIT compilation event around iteration 1375, which seems to correlate with the third trough, but there’s no such correlation at the start of any of the other troughs. I honestly don’t have a clue what’s causing the square wave, and there’s nothing here which gives me the slightest pointer as to what I should look at first.

Unfortunately, a lot of examples look like the second or third cases: the correlation only points us to a still-very-large subset of the VM; or doesn’t point us anywhere at all. Indeed, more recently I was involved in a LuaJIT performance project where we also extracted profiling information about garbage collection and JIT compilation. While this information was helpful sometimes, I wouldn’t say that it was a decisive advantage. In some cases, despite the plots not showing anything obvious, we later traced problems back to the garbage collector or the JIT. There are clearly several possible causes for this, but my guess is that the most common is that the consequences of performance problems are often spread around somewhat evenly, thus not showing up as a visible spike at a particular point in time.

There seem, therefore, few simple answers to the odd performance I pointed out in the first part of this post. How did we get into such a situation? Realistically, the answer is almost surely “for lots of reasons, many of which we’ll probably never know.” However, I have a couple of war stories that I think might provide a partial explanation, and that look at things in a slightly different way than I’ve heard others talk about in the past.

Are benchmark suites complete guides to performance?

At the risk of stating the obvious, most of us use benchmarks to guide our optimisations. In other words, if we think we’ve made a compiler faster, we have one or more programs which we measure before and after the change to see if it really has made things faster or not (and, if it has, by how much). In general, people use multiple such programs, at which point we bestow upon them the grand title of a “benchmark suite”. Optimisations which make our benchmark suite run faster are good; those that make our benchmark suite run slower are bad. But do our benchmark suites tell us about the performance of programs in general?

Some of you may remember that a few years back Dropbox started a new Python VM project called Pyston [5]. I was a bit surprised at the time, because VMs are notoriously a lot more work than people think, and PyPy was already a pretty good Python VM by that point. Still, I didn’t think much more about it, until I happened to bump into Kevin Modzelewski, one of Pyston’s creators, at a conference a year or two later. I had a very interesting conversation with him, as part of which I asked him why Dropbox had decided to create a brand new Python VM. Simplifying things a bit, he said that PyPy performed badly on some common Dropbox idioms, and that he’d distilled the worst offender down to a benchmark where PyPy (a clever JIT compiler) was over 4 times slower than CPython (a simple interpreter).

Looking at the benchmark, the problem became clear very quickly: it was recursive. To my surprise, RPython (the language/framework in which PyPy is written) was unrolling [6] all recursive calls. Unrolling is a powerful, but dangerous, optimisation: it can make programs faster when used judiciously, but used incautiously it causes code bloat and degraded performance. Since unrolling all recursive calls is akin to unrolling all loops, this seemed unlikely to be a good idea. I quickly confirmed with the core developers that this wasn't an intended outcome, but rather a side-effect of tracing's naturally aggressive tendency to inline.

I thought this would be a fun problem to debug, as it involved a part of RPython that I had not previously looked at. It took 2 or 3 days before I had found the right part of RPython, worked out what was going on, and found a fix. The core of the resulting diff is fairly simple:

diff --git a/rpython/jit/metainterp/pyjitpl.py b/rpython/jit/metainterp/pyjitpl.py
--- a/rpython/jit/metainterp/pyjitpl.py
+++ b/rpython/jit/metainterp/pyjitpl.py
@@ -951,9 +951,31 @@
         if warmrunnerstate.inlining:
             if warmrunnerstate.can_inline_callable(greenboxes):
+                # We've found a potentially inlinable function; now we need to
+                # see if it's already on the stack. In other words: are we about
+                # to enter recursion? If so, we don't want to inline the
+                # recursion, which would be equivalent to unrolling a while
+                # loop.
                 portal_code = targetjitdriver_sd.mainjitcode
-                return self.metainterp.perform_call(portal_code, allboxes,
-                                                    greenkey=greenboxes)
+                inline = True
+                if self.metainterp.is_main_jitcode(portal_code):
+                    for gk, _ in self.metainterp.portal_trace_positions:
+                        if gk is None:
+                            continue
+                        assert len(gk) == len(greenboxes)
+                        i = 0
+                        for i in range(len(gk)):
+                            if not gk[i].same_constant(greenboxes[i]):
+                                break
+                        else:
+                            # The greenkey of a trace position on the stack
+                            # matches what we have, which means we're definitely
+                            # about to recurse.
+                            inline = False
+                            break
+                if inline:
+                    return self.metainterp.perform_call(portal_code, allboxes,
+                                greenkey=greenboxes)

In essence, what this diff does is the following. When we're tracing (roughly speaking, “JIT compiling”) and about to trace into a function F (implicitly identified here by the “greens”), we check the tracer's call-stack to see if F is present. If it is, then inlining F again would be akin to unrolling, and we stop tracing.

Without worrying too much about the details, I hope you’ll agree that it’s a fairly small diff. What's more, it's effective: performance on PyPy improved by 13.5x (meaning that PyPy went from 4.4x slower than CPython to over 3x faster). At this point, I was patting myself on the back for a job well done, and indulging my ego by wondering if Dropbox would have created a new Python VM if this fix had been present.

However, there was a problem. Yes, I'd made one benchmark 13.5x faster, but PyPy has its own benchmarking suite. When I ran the above fix on PyPy’s benchmark suite of 20-odd benchmarks, 5 became slower (up to 4x slower). I knew that the core PyPy developers would, quite reasonably, not accept a performance “fix” that made performance of the main benchmark suite worse. First, I added a mechanism to control how many levels of recursion are unrolled. Second, I set about uncovering how many unrollings of recursive calls would recover the performance of the main PyPy benchmarks.

I thus did some basic benchmarking, putting this table in the email I sent to the pypy-dev mailing list:

     #unrollings |  1   |  2   |  3   |  5   |  7   |  10  |
-----------------+------+------+------+------+------+------+
hexiom2          | 1.3  | 1.4  | 1.1  | 1.0  | 1.0  | 1.0  |
raytrace-simple  | 3.3  | 3.1  | 2.8  | 1.4  | 1.0  | 1.0  |
spectral-norm    | 3.3  | 1.0  | 1.0  | 1.0  | 1.0  | 1.0  |
sympy_str        | 1.5  | 1.0  | 1.0  | 1.0  | 1.0  | 1.0  |
telco            | 4    | 2.5  | 2.0  | 1.0  | 1.0  | 1.0  |
-----------------+------+------+------+------+------+------+
polymorphism     | 0.07 | 0.07 | 0.07 | 0.07 | 0.08 | 0.09 |

Looking back, I’m slightly embarrassed by the quality of benchmarking this represents, but I hope you’ll forgive me for that. The top half of the table (from hexiom2 to telco) represents the subset of PyPy benchmarks that my fix had slowed down. For example, with 1 unrolling of recursion (i.e. equivalent to inlining the function once but performing no further “unrolling”), raytrace-simple ran 3.3x slower than under “normal” PyPy; with 2 levels of unrolling it ran 3.1x slower; and by 7 levels of unrolling my branch and normal PyPy ran the benchmark at the same speed. On the bottom row is Kevin’s benchmark (which, for reasons I can no longer fathom, has timings in seconds): it’s fastest at 1 unrolling, but even at 7 unrollings it’s only lost a bit of performance.

The table above clearly shows that RPython should unroll recursive function calls 7 levels deep. On that basis, my email to the PyPy list proposed that number; that was the number that was merged into RPython; and that's the number that's still used today.

However, I doubted at the time that 7 was the right value, and today I’m almost positive that it isn’t. I’m fairly sure the right value is much closer to 1 — indeed, probably 1 itself. There’s no doubt that simple recursive function calls (of the sort found in many of the PyPy benchmarks) benefit from unrolling. However, if unrolling every loop were a good idea, we’d expect every static compiler to do it — but they don’t [7]. Recursive functions are like loops, but worse: function bodies tend to be bigger than loop bodies. In the context of tracing, my reasoning is that we care far more about the worst case of unrolling than about the best case. My major concern is that recursive functions are often heavily data dependent. A common example is the tree walker found in compilers, which walks over an abstract syntax tree and compiles the program. In such cases, the recursive function looks at a node, does something based on it, and then visits the node’s children. The challenge is that the path the recursive function takes depends entirely on the structure of the input tree. This means that the amount of code executed between each call of the recursive function can vary hugely: we frequently end up tracing paths which are never used in full again. That matters because meta-tracing is a very expensive activity, temporarily slowing execution down by a couple of orders of magnitude: tracing things that aren’t used again can lead to significant performance losses. Even worse, these losses are often invisible, in the sense that they aren’t really a flaw in the user’s program, but an often inevitable mismatch with the dynamic compilation approach.
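
To make the tree walker point concrete, here’s a toy sketch in Python (the node encoding and the function are invented for illustration): the amount of work between one recursive call and the next depends entirely on the shape of the input tree, so a tracer that aggressively unrolls the recursion can easily record paths it will never see in full again.

def compile_node(node):
    kind = node[0]
    if kind == "num":
        return [("push", node[1])]
    elif kind == "add":
        # How much work happens between this call and the next recursive call
        # depends entirely on what the children turn out to be.
        return compile_node(node[1]) + compile_node(node[2]) + [("add",)]
    elif kind == "if":
        then_code = compile_node(node[2])
        else_code = compile_node(node[3])
        return (compile_node(node[1]) + [("branch", len(then_code))]
                + then_code + else_code)

# Two inputs, two completely different paths through compile_node:
print(compile_node(("add", ("num", 1), ("num", 2))))
print(compile_node(("if", ("num", 1), ("num", 2), ("add", ("num", 3), ("num", 4)))))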

Realistically, no benchmark suite can ever be complete, in the sense that it can claim to represent the performance behaviour of every possible program. A better question is whether the benchmark suite is representative of a fairly large subset of programs. The above example shows that PyPy’s standard benchmark suite doesn't cover some fairly common idioms. How many other common idioms aren’t covered? My strong intuition is that the answer is “many”. This isn’t to knock PyPy’s benchmark suite: indeed, it’s better than most other VMs’ benchmark suites. However, we often derive a false sense of comfort from benchmark suites: too often the phrase “this optimisation sped up the benchmark suite by 3%” is used interchangeably with “this optimisation speeds up programs by 3%”. The harsh reality is that our benchmark suites reject some optimisations that would make the vast majority of programs run faster, and accept some optimisations that make the vast majority of programs run slower. How often we get unrepresentative answers from our benchmark suites is anyone’s guess, but I fear it has happened more in the past than we imagine.

Are benchmark suites correct guides to performance?

The first war story might make you wonder whether or not benchmark suites tell us much about performance of programs in general. But a second issue with benchmark suites is whether the answer we receive is even correct with respect to the benchmark suite itself. Put another way, if a benchmark suite tells us that an attempted optimisation makes the benchmark suite faster or slower, is that answer reliable?

Alongside the main experiment I reported in the previous blog post, we also (separately) benchmarked SpiderMonkey and V8 against the Octane benchmark suite. Octane consists of 17 benchmarks, originally from V8; it fairly quickly became one of the most important JavaScript benchmark suites. Despite not knowing JavaScript at that point, I took on the task (“how hard can it be?”) of getting Octane into shape [8]. In essence, all I did was put in a for loop that ran each benchmark for 2000 in-process iterations [9]. That done, I ran my slightly altered Octane benchmark suite on d8 (the command-line way of running programs on V8) and got the following output:

$ d8 run.js
Richards
DeltaBlue
Encrypt
Decrypt
RayTrace
Earley
Boyer
RegExp
Splay
NavierStokes
PdfJS

<--- Last few GCs --->

14907865 ms: Mark-sweep 1093.9 (1434.4) -> 1093.4 (1434.4) MB, 274.8 / 0.0 ms [allocation failure] [GC in old space requested].
14908140 ms: Mark-sweep 1093.4 (1434.4) -> 1093.3 (1434.4) MB, 274.4 / 0.0 ms [allocation failure] [GC in old space requested].
14908421 ms: Mark-sweep 1093.3 (1434.4) -> 1100.5 (1418.4) MB, 280.9 / 0.0 ms [last resort gc].
14908703 ms: Mark-sweep 1100.5 (1418.4) -> 1107.8 (1418.4) MB, 282.1 / 0.0 ms [last resort gc].


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x20d333ad3ba9 
    2: extractFontProgram(aka Type1Parser_extractFontProgram) [pdfjs.js:17004] [pc=0x3a13b275421b] (this=0x3de358283581 ,stream=0x4603fbdc4e1 )
    3: new Type1Font [pdfjs.js:17216] [pc=0x3a13b2752078] (this=0x4603fbdaea9 ,name=0x4603fbd9c09 ,file=0x4603fb...


#
# Fatal error in CALL_AND_RETRY_LAST
# Allocation failed - process out of memory
#

zsh: illegal hardware instruction  d8 run.js

I can very distinctly remember having a sinking feeling when I saw that. Fortunately, when I looked at it more closely, it became clear that the situation wasn’t as bad as I had first feared. Starting from the top of that output, you can see that a number of Octane's benchmarks (from Richards to NavierStokes) ran successfully, before PdfJS caused d8 to fail. A cursory look at the error message shows that PdfJS failed because V8 had run out of memory. I decided to have a quick look at whatever timing outputs I had got (bearing in mind that this was done without Krun, so it's so-so benchmarking) before the crash and plotted them:


That graph looks interesting, and there’s a clear blow-up towards the end, but there’s not quite enough data to see if there’s anything meaningful going on before that. Fortunately, the way that I had made in-process iterations was to bundle several of Octane’s ‘inner’ iterations into longer ‘outer’ in-process iterations. I therefore reran things, this time recording how long each ‘inner’ iteration took to run:


Now this is starting to look interesting, but the huge spike at the end makes it hard to see the denser detail elsewhere in the plot. I therefore chopped the last 80 or so in-process iterations off and replotted:


At this point the likely source of the error became clear to me. The first half of this (sub-)plot has regular spikes, which grow roughly linearly: I guessed that they were the garbage collector running, which was probably having to do a little more work each time it was run. The second half of this plot shows a complete change in the timing data. I guessed that the garbage collector was probably going into some sort of low-memory mode, where it trades off memory for time, which causes it to run much more often, and to run much more slowly. It continually has to do more work, until – just to the right of this final plot – it explodes with an out of memory exception. If ever I’ve seen a plot which screams “memory leak”, this is it.

At this point, I had a very strong suspicion what the cause was, and a guess that this was more likely to be a bug in PdfJS than V8. But how likely was it that I would find a bug in a benchmark that was 33,053 lines of code in a language which I didn't know? Frankly, it sounded like a lost cause. Still, I thought it would be worth having a look, so I loaded pdfjs.js into my trusty text editor:

To my surprise, the second line of code after the licence was var canvas_logs = [];, which looked a lot to me like an assignment of a mutable array to a global variable. That’s an obvious candidate for problems, but surely far too obvious to be the bug. Still, I thought I’d see where canvas_logs is used, just in case. I scrolled to the first use of the variable in the file:

At line 50, it looked to me like content was being appended to the array, but nowhere in that function was anything removed from it. Indeed, there are 3 other references to this variable in the file, 2 of which check the array’s length, and 1 of which accesses elements in the array. Bingo! Within about 90 seconds, I had found the source of the memory leak. Emphasising my cluelessness with JavaScript, it then took me about 10 minutes to work out how to empty an array of its contents. Programming is a humbling job.

The fix to PdfJS’s memory leak is thus simply to add the line canvas_logs.length = 0 to the beginning of the runPdfJs function. However, to my regret, the fix has not been merged into Octane because, some time after I raised the PR, Octane was retired.
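
For those who don’t read JavaScript, the same pattern can be sketched in Python (the function names and workload below are invented): a global, mutable list that every in-process iteration appends to, and which nothing ever empties.

canvas_logs = []  # global, mutable, and never emptied: the leak

def do_some_work(data):
    return sum(data)  # stand-in for the real benchmark body

def run_benchmark_iteration(data):
    result = do_some_work(data)
    canvas_logs.append(result)  # grows without bound, iteration after iteration
    return result

# The fix is the moral equivalent of `canvas_logs.length = 0` in JavaScript:
# empty the list at the start of each run.
def run_benchmark_iteration_fixed(data):
    canvas_logs.clear()
    result = do_some_work(data)
    canvas_logs.append(result)
    return result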

A reasonable question to ask is whether the memory leak in PdfJS is really a problem. Presumably not many other people have run the benchmark long enough to see the crash I saw. Besides, a benchmark’s job is to measure a VM doing some work and, whether there’s a memory leak or not, the PdfJS benchmark is certainly doing some work. This latter point is alluring but, I believe, dangerous. Let me give a couple of scenarios to illustrate this point.

First, it is common to add benchmarks to a benchmark suite when one perceives that they tell us something new about the compiler in question. For example, let’s imagine that I realise that my benchmark suite is light on benchmarks which measure changes to my VM’s code generator. I have a look around and, having profiled a couple of its in-process iterations, decide that PdfJS is the perfect way of addressing this weakness in the benchmark suite. Later on, I make a change to the code-generator which I hope optimises programs such as PdfJS. I run PdfJS for a while (after all, I know that I need to think about peak performance) and find out that the performance hasn’t changed very much, causing me to discard my changes to the code-generator. The problem here is that PdfJS isn’t really a test of a code-generator, rather it is a test of the garbage collector in low-memory situations. That’s not an unreasonable thing to benchmark, but it’s not what I expected PdfJS to be benchmarking [10]. My change to the code-generator might really have been an improvement, but it might have been overwhelmed by the time spent in garbage collection by this particular benchmark.

Second, many benchmark suites – including Octane – have a mode which says “run the benchmark for n seconds and then report how many in-process iterations completed in that time”. Let’s assume that PdfJS manages to complete 100 in-process iterations in 10 seconds. I now introduce a new, complex, optimisation into the VM, which makes my benchmark complete 101 in-process iterations in 10 seconds. It looks like I’ve got a 1% speed-up, which might not be enough to justify the additional complexity in the VM. However, if each in-process iteration does more work than its predecessor, the real improvement is higher — potentially much, much higher (see the last few in-process iterations of the graphs earlier in this post).
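
A toy calculation shows how misleading this scoring mode can be when per-iteration costs grow. The sketch below is purely illustrative (it has nothing to do with Octane’s actual scoring, and the cost model is invented): a genuine 1.2x per-iteration speed-up shows up as a much smaller increase in completed iterations.

def iterations_in_budget(budget_secs, base_cost, growth, speedup=1.0):
    """How many in-process iterations complete within `budget_secs` if
    iteration i costs (base_cost + growth * i) / speedup seconds?"""
    elapsed, i = 0.0, 0
    while True:
        cost = (base_cost + growth * i) / speedup
        if elapsed + cost > budget_secs:
            return i
        elapsed += cost
        i += 1

before = iterations_in_budget(10.0, base_cost=0.01, growth=0.002)
after = iterations_in_budget(10.0, base_cost=0.01, growth=0.002, speedup=1.2)
print(before, after, after / before)  # about 95, 105, 1.11: a real 1.2x reads as ~1.1x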

In both the above examples, a faulty benchmark like PdfJS can cause us to systematically underappreciate, or exclude, effective optimisations. Although a little less likely, it's also possible to imagine it having the opposite effect (i.e. causing us to include ineffective, or even deleterious, optimisations).

Still, you might well think that I’m banging on a bit too much about a single faulty benchmark. All code has bugs, and it’s not surprising that some benchmarks have bugs too. It’s not unreasonable to suggest that the odd bad benchmark probably won’t lead us too far astray. However, when I looked further at Octane, I found that CodeLoadClosure also has a memory leak [11], and that zlib complains that it “cannot enlarge memory arrays in asm.js” (which might be a memory leak, although I didn’t investigate further). In other words, a non-trivial portion of Octane’s benchmarks appear to be misleading, and it's reasonable to assume that, on occasion, they will have misled VM developers. Does this mean that Octane is an unusually terrible benchmark suite? Probably not, but it's certainly made me even more cautious about benchmark suites than I was before.

Summary

Returning to this post's title: why aren't more users more happy with our VMs? The conclusion I've been forced to is that our benchmarking and our benchmarks have misled us. In a perfect example of Goodhart's Law, we've too often optimised for metrics that aren't representative of what we really care about.

Fortunately, I think it's possible to do a little better without undue effort. Here are my simple suggestions:

  1. Running benchmarks for longer uncovers a surprising number of latent problems. Ultimately, all the odd performance effects we saw in part one of this blog post resulted from running benchmarks for longer than anyone had done before.
  2. We must accept that neither peak performance nor a steady state is guaranteed to occur. Benchmarking methodologies that assume one or both of these concepts are fundamentally broken.
  3. I think it’s intellectually dishonest not to report warmup time. There’s no doubt that warmup time is annoying and that it can paint VMs in an unflattering light. But by pretending it’s a non-issue we’re misleading users about what performance they can expect, and storing up long-term resentment.
  4. We might be able to learn from the machine-learning community and separate out training benchmark suites (which we use to derive the values which drive the various heuristics in VMs) from validation suites (on which we report actual benchmarking numbers).
  5. We need far, far more benchmarks than we currently have, and we have to keep adding benchmarks. In nearly all cases, the more benchmarks we have, the less likely we are to mislead ourselves [12]. My vague guess is that our current benchmark suites are perhaps an order of magnitude too small. But even much larger benchmark suites won’t solve the problem alone: fixed-size benchmark suites become gradually less useful as a VM matures, because the VM becomes more heavily trained towards (or, in a sense, “inoculated against”) the same old benchmarks.
  6. I suspect most users would be happier if we worried more about predictable performance (both making sure that most programs run quite fast, and that programs run at more-or-less the same speed when you rerun them) than we did about peak performance (which currently dominates our thoughts). Put another way, we need to accept that optimisations often come with trade-offs: some appealing optimisations, for example, are highly non-deterministic and may make performance worse almost as often as they make it better.

A big question is whether fixing existing VMs is practical or not. I must admit that the current signs (including some work that I've been involved in) aren't hugely positive in that regard. Most VMs are large (typically hundreds of thousands of lines of code), mature (many people have worked on them, over many years), and complex (squeezing every last bit of juice out of programming languages and hardware). Subtle assumptions, both explicit and implicit, are spread throughout and across the codebase — and many of those assumptions have been forgotten (generally, though not exclusively, because people have moved on). Hunting down performance problems thus often devolves into wild stabs in the dark, looking for clues almost at random throughout large, complex codebases. Perhaps unsurprisingly, such hunts fail more often than they succeed.

However, I remain optimistic. First, I think we will start to see a gradual improvement in current VMs. Second, even despite the various problems I’ve detailed, many users are already fairly happy with their VMs: any improvements we make can only make them happier (and, if we’re lucky, make a few unhappy people happy). Speaking very personally, I still think that we don’t really know what the performance ceiling of dynamic compilation is. It is possible that we’re very close to the best performance possible, although my gut feeling is that there’s significant untapped potential. To that end, I’m now working on a new VM research project. It’s far too early to say anything useful about that work yet, and who knows what will happen with funding and the like, but maybe one day it’ll help nudge the field forward a little bit further.

Finally, if you want to repeat the warmup experiment, use Krun, or produce performance stats, then you can: everything’s under an open-source licence, so enjoy!

Acknowledgements: My thanks to Edd Barrett and Carl Friedrich Bolz-Tereick for comments on this blog post. This research was funded by the EPSRC Cooler (EP/K01790X/1) grant and Lecture (EP/L02344X/1) fellowship, and gifts from Cloudflare and Oracle Labs.


Footnotes

[1] There are two main reasons for restricting things to these two VMs: not all VMs make such profiling information available; and the rest tend to have scant, or missing, information on how to access such features. HotSpot is an honourable exception (it’s well documented in this regard), but we only managed to get the numbers we needed out of PyPy because we had someone from the core team who we could prod for advice.
[2] The cost of extra profiling information to record garbage collection data and JIT compilation can be quite significant — certainly significant enough that it can change the apparent performance of the main benchmark. It’s therefore vital that this extra information is recorded in a separate experiment, and doesn’t affect the main experiment.
[3] We couldn't work out what the units for the PyPy profiling information are (they're not documented, and it's not obvious from the source code either): as we’ll soon see, that doesn’t matter much in this case.
[4] Unlike PyPy, the units for the garbage collection and JIT compilation sub-plots in HotSpot are clear and simple: seconds.
[5] Sometimes it can feel like everyone, and in one or two cases even their dog, has tried making a Python implementation. Here's an incomplete list of Python implementations (with thanks to Carl Friedrich Bolz-Tereick for pointing out several that I didn’t know about): CPython; IronPython; Jython; Nuitka; Psyco; PyPy; Pyston; Shed Skin; Stackless; Starkiller; TrufflePython; Unladen Swallow; WPython; Zippy.
[6] For those unfamiliar with this optimisation, consider this (contrived) code:
i = 0
while i < 2:
  f()
  i += 1
This can be “unrolled” (i.e. the loop can be expanded) to the equivalent:
f()
f()
In this case, the unrolled version is smaller and faster (since it has no branches). In more complex examples, unrolling can uncover optimisations that the loop would otherwise obscure.
[7] The situation in a tracer is a little different: when you trace a loop, you effectively copy the trace (i.e. unroll it), with the first iteration “setting things up” for the second iteration. That’s not really equivalent to unrolling in the normal sense, though.
[8] Our aim here was to show that you could use warmup_stats without having to use Krun. We thus deliberately stuck to Octane’s behaviour – shared with many other benchmark suites – of running multiple benchmarks in a single process execution. It’s highly likely that benchmark ordering will have some impact on one’s perception of performance, but I have no idea to what extent. That’s for someone else to look into!
[9] In the end, I had to do a bit more than this, because the default Octane runner is written in a style that I found hard to follow. I ended up writing a new runner that dropped a couple of features, but was massively simpler.
[10] Even worse, PdfJS isn’t even a very good way of benchmarking how a VM performs in a low memory situation, because it doesn't hit that point until quite a long way in.
[11] Since I quickly exceeded the 90 second budget that PdfJS had led me to expect was reasonable for finding such bugs, I soon gave up trying to track down the cause of the leak.
[12] The only quibble here is that if, somehow, you manage to have a large number of benchmarks which are either identical or extremely similar, you might end up with a skew towards a subset of a VM’s behaviour. Still, relative to where we are today, it’s a problem I’d be willing to risk.

Why Aren’t More Users More Happy With Our VMs? Part 1

September 5 2018

Programming language performance is something that nearly everyone cares about at some point, whether they realise it or not. If you’re a programmer, you sometimes have to make your programs run fast; if you’re a user, you might be frustrated at a program that runs slowly, without knowing that it’s caused by a poor programming language implementation. Those of us who work on, or around, programming language Virtual Machines (VMs) tell a good story about performance, but a surprising number of users seem unhappy with the performance of their programs. Sometimes, yes, they’re unrealistic, but are they always so? In this first blog post (based on this paper) of two, I’m going to show that programs running on VMs often don’t follow the simple performance patterns that nearly all of us expected. Perhaps that’s why users aren’t as happy as VM developers think they should be?

A typical claim

Every so often there is a spate of blog posts along the lines of “Language XYZ is faster than C”. [1] Language XYZ varies, of course, whilst C is always... C. Love it or hate it [2], C is, overall, probably the fastest language there is. At the very least, it’s the bar by which all other languages measure themselves.

For example, a claim made about HotSpot (the standard Java VM, often called just “the JVM”, though there are other Java VMs) is that it makes Java programs as fast as C programs compiled with gcc -O2 [3]. At first glance this statement seems reasonable enough: gcc -O2 produces fast code, as, most people agree, does HotSpot. But what’s actually being compared here? A static (or “Ahead of Time”) compiler such as GCC compiles a program once; the resulting executable/binary can be run as many times as you want, even on another machine without GCC. In contrast, VMs such as HotSpot observe a program running and dynamically compile it into machine code. Dynamic compilers are thus subject to a warmup cost: it takes programs running on them a while to reach a steady state of peak performance. VM developers are very keen on the steady state of peak performance and would rather forget that warmup exists: VM performance is universally reported relative to the steady state of peak performance. Thus the claim at the beginning of this paragraph entirely ignores warmup costs.

A simple representation of this performance model can be seen in the following:


Let’s assume that we’re running a simple benchmark in a for loop: we show each in-process iteration (i.e. each iteration of the loop; the reason for this slightly long-winded name will become apparent soon) of the loop on the x-axis; and the y-axis shows how long each in-process iteration takes to run. When a program starts running in a VM, it’s executed by a profiling interpreter which is a simple, general, but slow implementation of a programming language. As the program executes, the interpreter performs simple profiling to help it determine which parts of the program are frequently executed (e.g. recording how often a function is called or a loop executed). After a little while, the “hot” parts of a program are identified, causing those parts to be compiled into machine code: it’s possible that execution will slow down temporarily while compilation occurs. Once compilation has completed, we reach nirvana: the “steady state” (i.e. it never changes) of “peak performance” (i.e. the program runs faster than it did before). The time from the beginning of the first in-process iteration to the point that the steady state of peak performance is reached is the “warmup” period. [4]

Of course, a real system will be somewhat noisier than the above suggests: we might expect occasional timing blips while garbage collection occurs; compilation might happen in multiple stages; and so on. However, it captures the basic notions well enough that I’ll use it as the basis for the rest of this blog post.

Measuring warmup

I’ve done a reasonable amount of work on, and with, VMs in recent years (as a couple of examples, consider storage strategies or language composition) which, inevitably, means I’ve spent quite a bit of time measuring performance. The first time that I put a bit of thought into benchmarking, I realised that benchmarking without considering warmup was unsatisfactory, although I didn’t manage to come up with a very sophisticated way of addressing this problem. I also remember being baffled at one or two of the results, and convinced that I’d done something wrong: why did CPython running the Mandelbrot benchmark have a running time of 134.419s and an incredibly wide confidence interval of ± 32.347s? I had no idea, so, in the best tradition of VM measurements, I buried my head in the sand and hoped that no-one would say anything. They didn’t.

The first approach I saw that really tried to do something sophisticated with VM measurement was the work of Kalibera and Jones. [5] The methodology they developed tries hard to work out if/when VM warmup has completed, so that accurate figures can be given for the steady state of peak performance. I can’t say enough good things about this work, which has never received the attention it deserved. There are a few possible reasons for this: the methodology is fairly labour intensive (humans have to look at performance graphs and feed information into a measurement algorithm, which does the rest); some of the maths involved is on the scary looking side; and for a long time there were no publicly available implementations. We (well, mostly Carl Friedrich Bolz and Edd Barrett) eventually put together an implementation that largely solves the last two problems, and we used the resulting methodology in a couple of papers (1 and 2). However, the first problem seemed insoluble: the approach requires quite a lot of a skilled human’s time.

In the process of using the Kalibera and Jones methodology, we noticed quite a lot of variation in the warmup time of different VMs and cases where VMs didn’t seem to warmup at all. This was surprising because pretty much every paper we’d read until that point had assumed – and, in many cases, explicitly stated – that warmup was a quick, consistent, thing. On that basis, it seemed interesting to see how the warmup time of different VMs compared. In May 2015, I asked Edd if he’d knock together a quick experiment in this vein, estimating that it would take a couple of weeks. After a couple of weeks we duly had data to look at but, to put it mildly, it wasn’t what we had expected: it showed all sorts of odd effects. My first reaction was that if we showed this data to anyone else without checking it thoroughly, we’d be in danger of becoming a laughing stock. It was tempting to bury my head in the sand again, but this time it seemed like it would be worth digging deeper to see where we’d gone wrong.

We thus embarked on what we soon called the “warmup experiment”. We quickly realised that, fundamentally, we needed to test a simple scientific hypothesis: that VMs reach a steady state of peak performance. In other words, we wanted to find out whether the fundamental assumption on which previous VM benchmarking has relied holds true. This seems like it should be a simple thing to do, but we soon realised that there are all sorts of things that could interfere with our benchmarking — what are formally called “confounding variables”. When I’d previously benchmarked things, I was happy to run things on a new-ish Linux server installation, and report the results. [6] However, even a seemingly “bare bones” installation can still do many unexpected things. We thus continually iterated on the experiment, at every stage asking ourselves: “what if XYZ coloured our results? How can we stop, or at least reduce the chances of, this happening?”

Thinking in such an adversarial manner isn't natural for most of us but it can be done. For example, early on we realised that if benchmarks run overnight there’s a good chance that cron is likely to run at least one background task [7]. Similarly, if you’re running sshd (which, on a Unix server, you almost certainly are), a botnet [8] can try using that to log in vast numbers of times: if you’ve ever been subject to such an attack, you’ll know that it can consume a noticeable amount of CPU. But there are many other potential problems. If one benchmark causes a machine to go into swap, subsequent benchmarks will be much slower, simply because some parts of the system will have to be paged back in from disk. Similarly, VMs can leave cache files around to speed up subsequent runs: if that happens, the first run might execute much more code than subsequent runs. After a while of doing this, we shared our initial attempts with other people, who suggested a few more things for us to take a look at, notably whether CPU throttling (due to overheating) was occurring.

After a couple of years of working on this, we had put together a pretty extensive list of potential problems and, in nearly all cases, we’d managed to do something to control them. Because there were so many of these, we wrote a new benchmark runner, Krun [9]. In some senses, Krun’s job is very simple: it just has to run benchmarks in a predictable way and record the results. However, there’s a lot of detail hiding just below the surface. I’m not going to go into every detail of Krun, because we’d be here all day (the paper has more details but even that elides some of the nitty-gritties), but let me give you an idea of some of the things Krun does:

  • Turns off CPU turbo boost (which unpredictably interferes with the CPU’s running speed [10]).
  • Turns off the network card (so botnets and the like can’t cause the kernel, or userland, to do extra work).
  • Makes sure the CPU is running in full performance mode (and, on Linux, in tickless mode).
  • Shuts down daemons like smtpd and cron.
  • Reboots the machine before running the first process execution (which is our fancy way of saying "shut the VM down and start it again from scratch") and after every subsequent process execution (so that if one process execution does get the machine into a nasty state (e.g. into swap), the machine’s state will be reset to something fairly predictable before the next process execution is run).
  • Before every process execution, the Unix account used to run benchmarks is deleted (including its home directory) and recreated (so cache files can’t persist between process executions).
  • Runs every process execution with the same stack / heap limits.
  • Before the first process execution, all the machine's temperature sensors are read; subsequent process executions aren’t run until the temperature sensors are more-or-less back to those readings (because we have no idea how non-CPU components are affected by temperature).
  • Checks whether the CPU clocked down and reruns any benchmarks affected (this happens perhaps 0.5% of the time, so it’s rare, but it does happen).

And so on and so forth.
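
To give a flavour of what a couple of the checks in the list above might look like in practice, here is a rough sketch, assuming a Linux machine that exposes CPU frequency scaling and thermal sensors via sysfs. This is purely illustrative: the paths and tolerances vary by machine, and Krun itself does considerably more, and does it more carefully.

import glob
import time

def cpu_governors():
    paths = glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")
    return {p: open(p).read().strip() for p in paths}

def temperatures():
    # Values are in milli-degrees Celsius.
    paths = glob.glob("/sys/class/thermal/thermal_zone*/temp")
    return {p: int(open(p).read()) for p in paths}

def wait_until_cool(baseline, tolerance=1000, poll_secs=30):
    """Block until every sensor is within `tolerance` milli-degrees C of the
    readings taken before the first process execution."""
    while any(temp > baseline.get(sensor, temp) + tolerance
              for sensor, temp in temperatures().items()):
        time.sleep(poll_secs)

if not all(g == "performance" for g in cpu_governors().values()):
    raise SystemExit("CPU is not running in full performance mode")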

There are a couple of things about Krun which surprise people. First, Krun doesn’t use CPU pinning. We experimented extensively with this, but found that it had odd effects. For example, a seemingly good idea on a 4 core machine is to reserve 1 core for the OS and 3 cores for things being benchmarked. However, we found at least 1 VM which – on seemingly single-threaded benchmarks – behaved poorly with such a setup. We suspect this is because the VM asks the OS how many cores the machine has (receiving the answer “4”), and then uses that many compilation and/or GC threads; those 4 threads then get pinned to 3 cores, contending unpredictably with each other. [11]

Second, Krun does not disable ASLR (Address Space Layout Randomisation). This introduces obvious non-determinism into proceedings and is thus often frowned upon. Indeed, initially we turned ASLR off for this reason. However, I ended up being deeply influenced by Stabilizer, which convincingly shows that changing a program’s layout (most obviously through ASLR) means that programs no longer have a single performance point — rather, they have a set of performance points. In other words, it doesn’t make sense to say “this program runs at speed X” by picking one layout for the program: one has to have some sense of the spread of program speeds with different layouts, because real users will encounter those different layouts, each of which is a valid version of the program. Unfortunately, Stabilizer doesn’t build on modern systems, so we can’t be quite as statistically rigorous in randomising things like linking order. However, keeping ASLR on gives us an approximation of the effects of randomisation on the VM.

What to benchmark

There are two hard questions in program benchmarking: what to benchmark and how long to run things for. I’m going to tackle them in reverse order.

VM benchmarking generally involves running a program in a loop for a small number of in-process iterations: 5 is common; 30 is considered extremely large; and the largest I’ve ever heard of is 200. However, I knew from experience that some of the counters inside VMs that cause performance changes have thresholds of 1000 or 1024. To make sure that we could see if these counters had an impact on performance, we therefore needed to run benchmarks for more than 1024 in-process iterations. We quickly settled on 2000 in-process iterations as a reasonable number. We also wanted to see what level of variability there was between different process executions (i.e. stopping the VM and running it again from scratch). We had no idea what a good number was for this: we used 10 for a long time, before eventually settling on 30. To be clear, there was some thought behind the choice of 2000 in-process iterations, but 30 process executions was little more than a wild guess.
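
To make the terminology concrete, here is a minimal sketch of the shape of the experiment. This is not Krun, and the `run_vm` command-line convention is invented: each process execution starts the VM from scratch and runs 2000 in-process iterations, timing each one, and 30 such process executions are performed per (VM, benchmark) pair.

import json
import subprocess

IN_PROCESS_ITERS = 2000
PROCESS_EXECS = 30

def run_vm(vm_cmd, benchmark):
    """Start the VM afresh; the in-VM runner is assumed to print one
    wall-clock time per in-process iteration as a JSON list on stdout."""
    out = subprocess.run(vm_cmd + [benchmark, str(IN_PROCESS_ITERS)],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def experiment(vm_cmd, benchmark):
    # One list of 2000 per-iteration times per process execution.
    return [run_vm(vm_cmd, benchmark) for _ in range(PROCESS_EXECS)]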

Choosing what benchmarks to run is difficult, because one often hopes that benchmarks are representative of real programs. Different VM teams often disagree profoundly over benchmark selection because of this. Personally, I am sceptical that a small-ish set of programs (no matter how big the individual programs are) tells us that much about the complete set of possible programs. Fortunately for us, we had slightly different goals. We wanted to see how VMs perform on benchmarks that are well known, thus removing the “it’s just a weird program” retort from the debate. That then meant that we could use the well known Computer Language Benchmarks Game (CLBG) benchmarks. Although one can reasonably argue that these benchmarks are even less representative of general performance than other benchmarks, they have the virtue that every VM team out there has made sure that they run well on their VM.

We performed a few simple checks, and made several alterations, to the CLBG benchmarks. First, we wanted to make sure that – to the extent that a normal end user can tell – the benchmarks are deterministic, because then we could reasonably expect different process executions to behave similarly. We put prints at every if statement, while loop, and the like, and checked the results from multiple runs. In our case, we had to alter fasta so that the same random number was used for every in-process iteration. We also noticed that Java programs sometimes load classes in a different order in such a way that an end-user can observe. Fixing this in general is neither possible nor desirable, but we wanted to make sure our benchmark runner didn’t make things worse, so we performed some simple static games to avoid this issue.
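
As an illustration of the fasta-style change (the workload below is invented, and this is only a sketch of the idea), re-seeding the pseudo-random number generator identically at the start of every in-process iteration makes each iteration do exactly the same work:

import random

def one_in_process_iteration():
    rng = random.Random(1234)  # identical seed for every in-process iteration
    return sum(rng.random() for _ in range(100000))

# Every iteration now produces exactly the same result, and does the same work.
assert one_in_process_iteration() == one_in_process_iteration()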

We then dealt with a more subtle issue. Compilers can often notice that large chunks of benchmarks are computing precisely the same number over and over again, allowing them to replace such code with a constant. [12] This can quickly cause one to measure something quite different from what one expected. To avoid this, many benchmarks write intermediate results to stdout, which makes it harder for compilers to remove code. However, I/O calls are expensive (due to context switches and the like): one can quickly end up in a situation where one is measuring the number of I/O calls more than anything else. We therefore removed I/O from the core parts of all our benchmarks, replacing it with checksum calculations (with the same checksum across all language variants of a given benchmark) that we then (conditionally) write to stdout at the end of the benchmark. This doesn’t guarantee that compilers aren’t removing important parts of a benchmark, but it raises the bar substantially.
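
The resulting pattern looks something like the following sketch (the workload is invented): intermediate results are folded into a checksum inside the loop, and only the checksum is printed at the end, so a compiler can’t easily prove the work is dead without actually doing it.

def benchmark(n):
    checksum = 0
    for i in range(n):
        result = (i * i) ^ (i >> 3)  # stand-in for the real per-iteration work
        checksum = (checksum + result) % (2 ** 32)
    return checksum

if __name__ == "__main__":
    # The expected checksum is fixed across every language variant of the
    # benchmark, so a wrong value flags an incorrect (or over-optimised) run.
    print(benchmark(1000000))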

With that done, it was relatively simple to choose the systems to benchmark: we used a number of mainstream VMs (Graal 0.22; HHVM 3.19.1; TruffleRuby 20170502; HotSpot8u121b13; LuaJIT 2.04; PyPy 5.7.1; and V8 5.8.283.32) and (as a base for comparison) GCC 4.9.4.

Results

At this point, I’ve hopefully convinced you that we designed a thoughtful, careful experiment. I could go into more detail, but it’s probably useful to look at some data. I’m going to use data from our v1.5 run (the data in the paper is from a slightly earlier run; the differences are minor). Let’s start with a simple example:


Starting at the top, the plot title tells us that this is the Fannkuch Redux benchmark, running on LuaJIT, on an OpenBSD Core i7 machine, and is the 14th of 30 process executions. Along the x-axis, we have the in-process iterations, and on the y-axis the time of each iteration. The inset helps us see the first n in-process iterations (in this case, 129) in a bit more detail. The first thing we can see is that this benchmark started slow (in-process iteration 1 took about 0.57233 seconds); quickly got somewhat faster for a while; and then, at about in-process iteration 190, reached the steady state of peak performance. This is almost exactly what VMs are supposed to do, although it took much longer to reach the steady state of peak performance than we might expect and, indeed, much longer than anyone had ever bothered benchmarking before. Nevertheless, this benchmark clearly warmed up.

The challenge then becomes working out where it warmed up. At first we had no idea how to work this out, other than using the Kalibera and Jones method — but we have 3,600 of the above plots to look at per machine, which is far too many to look at manually. Eventually (thanks to help from Vince Knight), we were put in touch with Rebecca Killick, a statistician who specialises in changepoint analysis. Intuitively, changepoint analysis automatically tells us when there are statistically significant shifts in our data. The vertical red dashed line in the plot above at in-process iteration 190-ish is a changepoint: the horizontal dashed line between changepoints (there are implicit changepoints at in-process iterations 1 and 2000) is a changepoint segment. In other words, changepoint analysis allows us to split the data in the plot above up, produce statistics on it, and classify it. We therefore developed a classification algorithm, which says (simplifying a bit) “if the last changepoint segment is the fastest changepoint segment in the plot, then this is a warmup”. [13] You can see the classification “warmup” at the end of the title. As an additional visual aid, in-process iterations before the steady state are coloured in grey.

The next type of plot we see regularly is the following:


This is the n-body benchmark running on PyPy on a Linux Core i7 machine. The first thing that you might notice is the big red blob around in-process iteration 1230: this is an outlier, an in-process iteration which is noticeably different in nature (i.e. much faster or slower) than its neighbours. It’s common in these sorts of analyses to ignore outliers if there’s good cause: in our case, we assume that they might be a garbage collection phase, or a context switch into the kernel, or something else that shouldn’t cloud our judgement about the benchmark. The second thing you might notice about this plot is that it only has a single changepoint segment. We classify such plots as “flat”. Because we’re giving the benefit of the doubt to VMs, we assume that warmup happened so quickly in such cases that we can’t observe it happening, meaning that the “flat” classification is morally equivalent to “warmup”.
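
For the curious, a crude way of flagging such outliers is sketched below. This is purely illustrative and is not the method we actually used: an iteration is flagged if its timing is far outside the spread of a window of its neighbours.

from statistics import median

def outliers(times, window=200, k=5.0):
    """Indices of iterations more than k scaled-MADs away from the median of a
    window of neighbouring iterations."""
    flagged = []
    for i, t in enumerate(times):
        lo = max(0, i - window // 2)
        neighbours = times[lo:lo + window]
        m = median(neighbours)
        mad = median(abs(x - m) for x in neighbours)
        if mad > 0 and abs(t - m) > k * 1.4826 * mad:
            flagged.append(i)
    return flagged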

Here’s another example of warmup, this time on V8:


The good news is that the benchmark did warm up and hit a steady state of peak performance, though the fact that it took just under 600 in-process iterations is surprising. It’s also interesting that there are several earlier changepoints, perhaps while various rounds of compilation were occurring.

We can also deal with more complex examples of warmup, this time on Graal:


The surprise here isn’t the number of outliers (perhaps they’re the garbage collector), but that it took over 900 in-process iterations to hit a steady state of peak performance. Still, the fact that it did so is surely a good thing.

The unexpected

At this point, things become a little more surprising. Consider this plot from HotSpot:


Notice that it started slow (you might need to look at the inset to see this), then got fast for 200 in-process iterations, and then became somewhat slower again. That’s not supposed to happen! It means that we need to adjust our classification algorithm: if the last changepoint segment in a plot isn’t the fastest, we classify it as “slowdown”.

In this case, the slowdown reduces execution speed by 5% which, to state the obvious, is fairly significant (in mature VMs, speed-ups of 5% often require months or years of work). But here’s why you should be really surprised. The benchmark in this plot is Richards: it’s one of the benchmarks that HotSpot has used to benchmark itself since before the VM was even called HotSpot. In other words, this VM is highly optimised for this benchmark — but no-one ever ran it long enough before to notice that performance gets much worse over time.

It turns out that slowdowns are common. Here’s an example from V8:

[Plot: a benchmark on V8, classified as slowdown]

But there’s one final type of plot left — and it’s the most troubling of all. Here’s an example:

[Plot: a benchmark classified as no steady state]

Notice here that the benchmark’s performance keeps shifting, never settling for long. Of course, perhaps it would if we ran it for even longer, but ultimately we have to set a cut-off somewhere. This now raises a tricky question: what should our criterion be for saying that a benchmark never stabilises? Our choice was to say that if there are changepoints in the last 500 in-process iterations, then the plot is classified as “no steady state”.
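
Putting the pieces together, a simplified version of our per-process-execution classifier looks something like the following sketch (the Segment type is repeated in cut-down form so the sketch stands alone; the equivalence tolerances of footnote 13 are ignored; and the 500 in-process iteration cut-off is the one just described):
#[derive(Debug)]
enum Classification {
    Flat,
    Warmup,
    Slowdown,
    NoSteadyState,
}

struct Segment {
    start_iter: usize, // first in-process iteration covered by this segment
    mean_secs: f64,    // mean wall-clock time per in-process iteration
}

// Simplified classifier for one process execution, given its changepoint
// segments in order and the total number of in-process iterations
// (2000 in our experiment).
fn classify(segments: &[Segment], total_iters: usize) -> Classification {
    let last = segments.last().expect("at least one segment");
    // "No steady state": a changepoint falls within the last 500 iterations.
    if last.start_iter > total_iters - 500 {
        return Classification::NoSteadyState;
    }
    // "Flat": a single changepoint segment covering the whole plot.
    if segments.len() == 1 {
        return Classification::Flat;
    }
    // Otherwise: "warmup" if the final segment is the fastest, else "slowdown".
    if segments.iter().all(|s| last.mean_secs <= s.mean_secs) {
        Classification::Warmup
    } else {
        Classification::Slowdown
    }
}

fn main() {
    // Fast for a while, then worse again: a slowdown.
    let run = vec![
        Segment { start_iter: 1, mean_secs: 0.33 },
        Segment { start_iter: 400, mean_secs: 0.31 },
        Segment { start_iter: 600, mean_secs: 0.36 },
    ];
    println!("{:?}", classify(&run, 2000)); // prints "Slowdown"
}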

For clarity, “flat” and “warmup” are equally good; “slowdown” and “no steady state” are bad (in slightly different ways). The latter two are not supposed to happen: when they do, they invalidate nearly all previous benchmarking methodologies.

Edge cases

Changepoint analysis turned out to be the vital thing we needed to make this work useful. When we first tried using changepoint analysis on our data, we sometimes thought it had misclassified things, but in reality it’s just more rigorous (and far less prone to boredom) than us humans. However, there are always going to be some edge cases where the classification could go either way. Here’s my favourite example of this:

[Plot: an edge case, classified as warmup]

Can you see that the timing data keeps shifting, at predictable intervals, above and below the changepoint segment? The changes are small enough that changepoint analysis doesn’t consider these shifts as changepoints, so the plot is classified “warmup”. Interestingly, in some older versions of HotSpot, this benchmark displays a more pronounced square-wave pattern (i.e. the highs are higher and the lows lower); each peak and trough becomes a changepoint segment; and the plot as a whole becomes “no steady state”. In my opinion, for this example, either classification would be acceptable: but, in general, there are very few edge cases such as this.

Inconsistency

Earlier I went into some detail about how we ensured that the benchmarks we’re running are deterministic. We expected that this would lead to consistent performance profiles for a given (benchmark, VM, machine) triple (in other words, we expected consistent performance from process executions of the same benchmark on the same VM on the same machine). We were wrong. Consider these two plots, both of the same benchmark, on the same VM, on the same physical machine:

[Plots: two process executions of the same benchmark, VM, and machine; one warms up, the other is no steady state]

Notice that one process execution warms up, and the other is a no steady state. To add insult to injury, the no steady state process execution’s performance is often 5% worse than the warmup process execution’s performance. This sort of inconsistency happens frequently.

Perhaps less surprisingly, there’s often substantial inconsistency across operating systems. For example, consider the binary trees benchmark on Linux and OpenBSD in C:

[Plots: the binary trees benchmark in C, on Linux and on OpenBSD]

There are two things to note here. First, the performance of this benchmark on Linux resembles a bouncing ball. This, to belabour a point, isn’t a VM: it’s a C program! We suspect (though we have never quite got around to investigating it) that the cause is the malloc used on Debian: it’s probably incrementally growing, then incrementally compacting, its memory structures. This raises a subtle point which I hadn’t considered when we first started this work: even C programs have a runtime (libc), which can have unpredictable effects on performance. Unsurprisingly, C isn’t as prone to odd effects as the VMs in our experiment, but it’s more prone than I would have expected. Second, notice the substantial difference in performance between the two systems. Admittedly, in this case we’re comparing a Xeon Linux machine against a Core i7 OpenBSD machine, but still, I would be happy to bet money that Linux will always beat OpenBSD in such a benchmark. Why? Simply because OpenBSD’s malloc prioritises security over performance, and jumps through several hoops that almost certainly slow things down substantially.

Benchmarks in aggregate

Looking at the plots above is very useful, because it gives us a way of understanding what unexpected performance behaviour occurs. However, our experiment has thousands of these plots, so it’s hard to put together any sort of bigger picture from them. Fortunately, changepoint analysis allows us to automatically put together various summaries of the data (with Sarah Mount doing most of the hard work to generate the astonishing LaTeX tables below). Here’s an example of the summary data from one of our machines:

[Table: summary data from one machine]

There’s a bit too much data there for us to easily read it in this setting, so let’s zoom in on the bottom right part of that table:

[Table: the spectralnorm section of the summary table]

This is the data from the spectralnorm benchmark. There’s still quite a lot of detail here, so let’s look at a few rows to unpick what’s going on.

I’ll start with the HotSpot row. The squiggle in the “Class.” (i.e. “classification”) column means that all 30 of spectralnorm’s process executions warmed up on HotSpot. The “Steady iter (#)” column tells us that the median number of in-process iterations for the steady state to be reached was 7 (the numbers in brackets underneath are inter-quartile ranges; think of them a bit like confidence intervals, giving you an indication of the spread of the data). The histogram to the right of that number shows the distribution of the data, with the bucket containing the median value highlighted in red. In this case, nearly all the data is in the median bucket, with a couple of outliers. The “Steady iter (s)” column tells us that the median wall-clock time to reach the steady state was 1.91 seconds. The “Steady perf (s)” column is close to the “traditional” measure of VM performance [14], telling us that the performance of the benchmark in the steady state was 0.31472 seconds per in-process iteration. Notice though that the histogram tells us that there are a number of process executions where the steady state performance is worse than the median, and the very wide 99% confidence interval (±0.169143s) confirms this. This goes some way to justifying all the data you’ve just waded through: yes, the benchmark consistently warms up on HotSpot (which is good), it warms up fairly fast (which is also good), but the steady state performance can vary hugely between process executions (which is bad). Depending on what aspect of performance you care about, you may find different parts of this information more or less important than others.
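
To give a feel for how the numbers in a single cell are put together, here is a small sketch computing the median and inter-quartile range of 30 hypothetical “steady iter (#)” values (the values are made up, and the nearest-rank percentile below is cruder than the bootstrapped statistics we use in the paper):
// Median and inter-quartile range of made-up per-process-execution values.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let idx = (p * (sorted.len() - 1) as f64).round() as usize;
    sorted[idx]
}

fn main() {
    // Hypothetical "steady iter (#)" values for 30 process executions:
    // mostly 7, with a few stragglers.
    let mut steady_iters = vec![
        7.0, 7.0, 7.0, 8.0, 7.0, 7.0, 6.0, 7.0, 7.0, 7.0,
        7.0, 7.0, 9.0, 7.0, 7.0, 7.0, 7.0, 30.0, 7.0, 7.0,
        7.0, 7.0, 7.0, 7.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0,
    ];
    steady_iters.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = percentile(&steady_iters, 0.5);
    let iqr = (percentile(&steady_iters, 0.25), percentile(&steady_iters, 0.75));
    println!("median: {}, IQR: {:?}", median, iqr);
}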

The “—” classification for LuaJIT means that all 30 of its process executions were flat. By definition that means it warmed up at in-process iteration 1 and took 0 seconds to do so: thus the steady iter (#) and (s) columns are empty. The final “pure” classification can be seen for Graal, all 30 of whose process executions slowed down (note that, though this doesn’t have a steady state of peak performance, it does still have a steady state, hence our ability to report such numbers here).

The “═” classification for PyPy is “good inconsistent”: 27 of the 30 process executions were flat, and 3 were warmups. Since we consider flat and warmup equally good, “good inconsistent” is morally equivalent to all 30 process executions being flat or warmup.

In contrast, the “═ with a cross through it” classification for TruffleRuby is “bad inconsistent” i.e. at least 1 process execution was a slowdown or a no steady state. In this case, 25 of its process executions slowed down and 5 were no steady state.
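
The aggregation from 30 per-process-execution classifications to a single (VM, benchmark) classification can be sketched as follows (the enum and function names are mine, not the paper’s):
#[derive(Clone, PartialEq)]
enum Classification { Flat, Warmup, Slowdown, NoSteadyState }

#[derive(Debug)]
enum Summary {
    AllFlat,          // the “—” classification
    AllWarmup,        // the “squiggle” classification
    GoodInconsistent, // “═”: a mix of flat and warmup only
    AllSlowdown,
    AllNoSteadyState,
    BadInconsistent,  // at least one slowdown or no steady state
}

fn summarise(runs: &[Classification]) -> Summary {
    use Classification::*;
    if runs.iter().all(|c| *c == Flat) { return Summary::AllFlat; }
    if runs.iter().all(|c| *c == Warmup) { return Summary::AllWarmup; }
    if runs.iter().all(|c| *c == Slowdown) { return Summary::AllSlowdown; }
    if runs.iter().all(|c| *c == NoSteadyState) { return Summary::AllNoSteadyState; }
    if runs.iter().all(|c| *c == Flat || *c == Warmup) {
        Summary::GoodInconsistent
    } else {
        Summary::BadInconsistent
    }
}

fn main() {
    // The PyPy example from the text: 27 flat and 3 warmup process executions.
    let mut runs = vec![Classification::Flat; 27];
    runs.extend(vec![Classification::Warmup; 3]);
    println!("{:?}", summarise(&runs)); // prints "GoodInconsistent"
}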

The bigger picture

Summarising the results gives a clearer picture of how VM performance can surprise:

[Table: aggregate classification summary across machines]

To keep things simple, I’ll only consider the Linux4790 column above (the OpenBSD machine runs a subset of benchmarks, so requires some care when interpreting). It’s easiest to start with the bottom half of the table first. The way to read this is that 22.0% of all process executions (i.e. not worrying about which VM or which benchmark was involved) were classified as flat and 48.3% as warmup; in other words, just over two thirds of process executions warmed up in the expected fashion. 20.1% of process executions slowed down and 9.6% were no steady state.

Those figures are pretty depressing: almost a third of process executions aren’t doing what we expected! But life gets quite a lot worse in the top of the table. The way to read this is that 8.9% of (VM, benchmark) pairs were always classified as flat, 20.0% as warmup, and 11.1% as good inconsistent. In other words, only 40.0% of (VM, benchmark) pairs consistently did the expected thing for all 30 of their process executions. Surprisingly, 4.4% consistently slowed down and 4.4% were consistently no steady state. But 51.1% of all (VM, benchmark) pairs were bad inconsistent: that is, they contain at least one slowdown or no steady state process execution. At the risk of stating the obvious, that means that just over half of the time VMs are not reliably optimising benchmarks!

Things get even worse if you consider all machines together: only 12.5% of (VM, benchmark, machine) triples perform as expected.

Conclusions

When we set out to look at how long VMs take to warm up, we didn’t expect to discover that they often don’t warm up. But, alas, the evidence that they frequently don’t warm up is hard to argue with. Of course, in some cases the effect on performance is small enough that one can live with it, but it’s often bad enough to be a problem. We fairly frequently see performance get 5% or more worse over time in a single process execution. 5% might not sound like much, but it’s a huge figure when you consider that many VM optimisations aim to speed things up by 1% at most. It means that many optimisations that VM developers have slaved away on may have been incorrectly judged to speed things up or slow things down, because the optimisation’s effect is well within the variance that VMs exhibit. The variance across process executions is often even bigger, which is even more troubling: it means that it’s not safe to run a program once, measure its performance, and assume you’ll always get that level of performance in the future.

Before the doom and gloom gets too much, I’m going to end on an optimistic note. While I think our work raises big questions about the efficacy of current VMs, I think it also offers a note of salvation. People are often very happy with VM performance, even though it probably isn’t working as well as it could. In particular, my guess is that large programs are subject to many small performance problems which, in conjunction with each other, are dismissed as noise. If that is the case, then, if we can produce VMs that don’t suffer from the problems seen above, we should be able to run programs even faster in the future.

In the second part of this blog post, I’m going to: look at whether there are any simple answers to the performance problems seen above; give some possible explanations as to how we might have got to this point; and suggest some simple pointers to make things a bit less worse in the future. In the meantime, those of you who want more details can read the main research paper.

Acknowledgements: Edd Barrett, Carl Friedrich Bolz, Rebecca Killick, and Sarah Mount did most of the work I’ve reported on above. Vince Knight helped put us in touch with Rebecca Killick: he really deserved more credit on this paper than he received, but he gracefully refused my attempts to add him in! The work of Tomas Kalibera and Richard Jones was a huge influence for all of us. Mario Wolczko gave crucial feedback and encouragement at important points. We also received vital feedback, and questions, from a number of other people. This research was funded by the EPSRC Cooler (EP/K01790X/1) grant and Lecture (EP/L02344X/1) fellowship, and gifts from Cloudflare and Oracle Labs.

Footnotes

[1] Unfortunately, many such claims are either wrong (e.g. the thing being measured isn’t the thing that’s claimed to be measured) or misleading (e.g. making grand comments that derive from generalising a single workload / benchmark).
[2] I used to love C, but these days I’m more ambivalent, because I no longer think I can remember all the ways that I may trigger undefined behaviour.
[3] p11 of Cliff Click’s slides “A JVM does that?”. To be clear, I’m not criticising Cliff who I not only know and like, but whose contributions to our field are little short of jaw-dropping.
[4] The terms “warmup” and “startup” are often used interchangeably, whereas they refer to two different things: startup is the time for the VM beginning to execute the first bit of the user’s code; warmup is the time for the VM to reach the steady state of peak performance.
[5] There are two relevant Kalibera and Jones papers: a complete but dense description; and a higher-level overview. I definitely recommend starting with the high-level overview (even though it was published later).
[6] In my defence, I can only say that I’ve seen a lot of people benchmark even more incompetently than I’ve described here. My personal favourite was a paper that benchmarked on a laptop (laptops are prone to overheating, and thus likely to throttle the CPU), with turbo boost left on (making overheating even more likely), running a full user desktop (how much CPU is your web browser consuming right now? One core of mine is running at 40% while a website animates something). The resulting numbers might have been correct, but, if they were, that would surely be more due to luck than judgement.

That said, the most common form of benchmarking (which I’ve certainly been very guilty of in the past) is far worse: it’s best summarised as “run it until I like it”. In this form of benchmarking, if you run a benchmark and get a number you don’t like (too fast or too slow), you run things again and again until you get an answer that matches your expectations / prejudices. In practice, the non-determinism of our hardware and software means that one can often find an appealing answer. Of course, the implied variation between runs should be an alarm bell!

[7] On OpenBSD, for example, by default root runs a cron job every day (/etc/daily) and every week (/etc/weekly). The former does several security-related tasks which, one might argue, aren’t likely to impact anything else; however, the latter rebuilds the locate database (which is very IO-intensive, and which can thus cause very unpredictable side effects). There is also a monthly cron job but, in practice, that does very little by default (I had entirely forgotten that it existed until I checked my facts for this footnote).
[8] It’s sometimes thought that botnets can only attack you from external networks. However, all it takes is one machine to become compromised (e.g. by someone running something unfortunate on their machine), and then a botnet can exist on an internal network. I’ve heard of this happening at least twice at places that I know well, so it’s a real threat.
[9] The “K” in Krun is a tip of the hat to the Kalibera and Jones work. For those who care, most of us pronounce it as one syllable (i.e. “Krun” not “K-run”).
[10] Turbo boost is less predictable than most of us realise. For example, we had some machines with a stated base frequency of 3.6GHz and a turbo boost of 4GHz. If you ran a process which utilised 1 core fully, you could reliably measure it as running at almost exactly 4GHz; fully utilising 2 and 3 cores gave the same result; but as soon as a fourth core was fully utilised measurements fell to almost exactly 3.8GHz. If memory serves this mode is undocumented in Intel’s manuals, but I saw this behaviour on 2 separate machines.
[11] At the time we experimented with CPU pinning on Linux, there were two mechanisms available: one was buggy and unreliable; and the other (which, of course, we tried second) worked OK. I don’t know if this is still the case, but it certainly makes me wonder if everyone I’ve heard of using CPU pinning is getting the benchmarking environment they expect.
[12] I’m talking here about compilers optimising away code as part of the compiler’s standard optimisations. There’s also a less pleasant case where compilers are customised specifically for one benchmark (e.g. explicitly recognising a benchmark’s source code, and then changing the compilation strategy). One of the better known examples of this was Sun’s special treatment of the 179.art benchmark in SPEC. It’s now difficult to find a concise write-up of this, but a couple of articles provide reasonable pointers.
[13] The real algorithm has two main additional features. First, it treats changepoint segments which differ only by a very small amount of time as being equivalent. Second, two (or more) changepoint segments can be considered equivalent even if they are separated by a non-equivalent changepoint segment.
[14] The details of how we calculate steady performance are quite a lot more involved than is traditional: influenced by Kalibera and Jones, we use bootstrapping and changepoint segments to produce a slightly more accurate number than most traditional methodologies.
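
To give a flavour of the bootstrapping (and no more than a flavour), here is a toy Rust sketch: resample the steady-state timings with replacement many times, recompute the mean of each resample, and read confidence bounds off the resampled means. The timing data is invented, and the linear congruential generator is there purely to keep the sketch dependency-free; the real calculation is considerably more involved:
// Toy bootstrap of a mean. Not suitable for real statistical work.
fn lcg(state: &mut u64) -> u64 {
    *state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    *state
}

fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn main() {
    // Invented steady-state per-iteration times from one process execution.
    let times = vec![0.314, 0.316, 0.313, 0.320, 0.315, 0.317, 0.314, 0.330];
    let mut state = 42u64;
    let mut boot_means = Vec::with_capacity(10_000);
    for _ in 0..10_000 {
        let resample: Vec<f64> = (0..times.len())
            .map(|_| times[(lcg(&mut state) % times.len() as u64) as usize])
            .collect();
        boot_means.push(mean(&resample));
    }
    boot_means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = boot_means[(0.005 * boot_means.len() as f64) as usize];
    let hi = boot_means[(0.995 * boot_means.len() as f64) as usize];
    println!("mean: {:.4}, 99% bootstrap CI: [{:.4}, {:.4}]", mean(&times), lo, hi);
}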

What I’ve Learnt So Far About Writing Research Papers

January 10 2018

I long flattered myself that I'm only good at one thing: programming. Experience has taught me that I'm not as good at it as I thought – indeed, that there are people much better at it than I – but it's fair to say that I'm not terrible at it. The problem is that, as a researcher, programming can only ever be part of my job: it's not enough just to build something interesting, I need to explain it to people too. And explaining it to those further away than my voice can carry requires writing a paper. [1]

There are many people who dismiss paper writing as a distraction from building real systems. I sometimes wondered if it was worthwhile myself but, when you've been knocking around as long as I have, the importance of writing things down becomes clear. I’ve lost track of how many conversations I’ve had along these lines: “We saw that problem in System B, and we had a really interesting solution.” “Where can I read about it?” “We never wrote it up.” “Where can I download it then?” “We never released the source code, the binaries are lost, and it runs on hardware that no longer exists.” I encountered this a lot when I was investigating existing language composition systems, with tantalising references ranging from Lisp systems with different macro expansion semantics to novel text editors. Even when the person I was talking to was passionate and knowledgeable about the system in question, I was rarely able to gain the insights I suspect I could have if I'd had a paper to pore over. Doubtless, many of these systems weren't good enough to worry about them slipping from our memory; but the loss of the best, which may have involved tens or even hundreds of skilled person-years of effort, is a tragedy. Writing things up does not guarantee that people in the future will be able to understand everything about the work [2], but at least they stand a sporting chance. Besides, even if people of the future can't understand things, writing things down increases the number of people who can understand the work today.

How should one go about writing a paper? When I started my PhD, I had no idea of how one might go about this task, and it took me many years to find an approach that worked for me. In the rest of this blog post, I'm going to try and drill down into some of the things I’ve learnt about writing papers, and I'm going to give some examples of where I’ve learnt from my mistakes. However, I’m deliberately not phrasing anything herein as “advice” — indeed, several parts of my approach seem to run contrary to most of the advice I’ve seen. I don’t see this as a problem: people work in different ways, and if you’re unlucky enough to have a mind similar to mine, you might find something useful in here.

Before starting to write

Before I start writing, there are two crucial decisions that I need to make. First, what am I going to write about? Second, what is the profile of the intended reader?

The first question nearly always has an easy answer which is that I’ve done some research which is interesting and complete enough to be worth telling other people about. If I don’t have a good answer to that first question, it means that I’m probably not ready to start writing.

The second question is more tricky. In general, I assume that the intended reader is someone intelligent, interested, and well-versed in the general area. My job is to fill in the things that they don’t know. One can go mad trying to guess precisely what people do and don’t know, so my starting assumption is that the reader knows roughly what I knew when I started doing the research in question. However, this doesn’t work for all types of writing, particularly when a paper brings together two subjects that have previously been studied by disjoint communities. On any points where I’m unsure about what knowledge I can take for granted, I err on the side of caution, and assume the reader knows marginally less than I did. This should not be mistaken for an assumption that the reader is stupid, but simply an acknowledgement that none of us can know everything.

I also make a fundamental assumption that the reader has good intentions, and that my job is to be upfront and honest with them. This allows me to write clearly about the weaknesses and limitations of my work without worrying that I will be punished for doing so. This may sound like an odd thing to be explicit about, but people in my position are largely evaluated by the quantity and perceived quality of our peer reviewed papers. It is therefore tempting to adjust one’s writing to maximise the chances of acceptance by peer reviewers (who are often rushed; occasionally incompetent; or, though rarely, malevolent). I did that in the past, but I found the implicit cynicism corrosive: I decided, perhaps somewhat pompously, that I’d rather “fail” peer review honourably than “succeed” dishonourably. This has led to one or two more rejections than I might otherwise have received, but these are more than worth the longer-term satisfaction with the papers I write.

Starting to write

It is a truism of writing that the hardest thing is a blank page: building sufficient momentum to begin writing is no easy thing and I've done my fair share of hoping in vain for inspiration to strike. Of course, the only possible solution is to write something, anything, just to get started. The most common suggestion that I've heard is to start writing the bits of a paper that come most easily, without worrying about the overall order. While this seems to work well for many people, I've never found this a satisfying approach.

Instead, the first thing I do is to try and work out what the paper’s overall message is going to be, which means starting with the paper’s abstract. [3] This gives me a frame of reference for all subsequent writing and enables me to be ruthless in answering questions like “does the reader need to know about this particular aspect?” and “is this section in the right place in the paper?”. I first reread suggestions for writing an abstract (the fourth point in the link on paper writing advice) because it reminds me to be concise about what problem I solved and to motivate why my solution is worth someone’s time reading. In well established fields, where one is tackling a well-known problem, little motivation is needed. More experimental or unusual research, however, needs to be clearly motivated. This is often harder to do than it seems.

Let me give a concrete example based on the paper fine-grained language composition: a case study. We started looking at language composition because previous solutions were crude, because we felt we had a new approach worth trying, and because it looked like it would be fun. I had a few explanations about why it was worth doing (enough to convince people to fund us to do it), but they were vague. After developing a language composition editor we realised we needed to be able to run composed programs as well. We started with an extended warmup exercise, which gave us the confidence to try tackling something bigger.

We eventually settled on composing PHP and Python and started looking at the problem in detail. About 6-12 months later, we had created PyHyp, which was the first ever fine-grained language composition of “big” languages: certainly, we'd been able to get much further, and with less effort, than I had expected at the outset. The problem was that we had also outstripped our ability to explain why we'd done what we'd done. The published abstract that I wrote (and I say “I” because I’m almost solely to blame for it) looks as follows:

Although run-time language composition is common, it normally takes the form of a crude Foreign Function Interface (FFI). While useful, such compositions tend to be coarse-grained and slow. In this paper we introduce a novel fine-grained syntactic composition of PHP and Python which allows users to embed each language inside the other, including referencing variables across languages. This composition raises novel design and implementation challenges. We show that good solutions can be found to the design challenges; and that the resulting implementation imposes an acceptable performance overhead of, at most, 2.6x.
The good thing about this abstract is that it's simple to understand: the bad thing is that it doesn't say why the problem we tackled is worth tackling. Since language composition has been largely ignored as a research challenge, the abstract relies on readers spontaneously realising why language composition is useful and why our research is worth reading — which is far too much to ask.

The frustrating thing is that, while carrying out the research, one of our collaborators had suggested a plausible “killer app”: system migration. The idea is simple. At the moment we have lots of systems written in old languages: we'd like to rid ourselves of the old languages but not the systems. We can either rewrite the systems from scratch (which is expensive and error prone) or use an automatic translator (which produces code that no human wants to maintain). Language composition offers us the potential to compose an old and a new language together and gradually migrate parts of the system piecemeal, at all times having a runnable system. This idea is tucked away in a paragraph near the end of the paper and is easily overlooked. This is an unforced error on my part, because papers are not meant to be like crime novels, with unexpected twists: a good paper should make clear upfront what the paper is about, and gradually fill in detail. Indeed, I struggled to find a good structure for this paper, perhaps in part because of my failure to set out the most compelling case possible in the abstract.

By the time I came to write a blog post on this subject I had realised my mistake and put the migration idea (in expanded form) near the beginning. Alas, anyone who read only the paper would not have seen this implicit mea culpa.

As this example hopefully shows, finding the right research message is often tricky. Fortunately, it is possible to do a decent job: looking back, I think both the storage strategies for collections in dynamically typed languages and Eco: a language composition editor papers do a much better job in their abstracts.

With the abstract out of the way, I draft an introduction. If I've got my high-level thinking in the abstract right, then the introduction is a much easier task because, to some extent, it's simply an expanded abstract. I add more background, including some context about the research's antecedents, and explain why we've done something useful. Sometimes I will explicitly list the paper's contributions, especially if they might otherwise not pop out (thus the storage strategies paper was clear enough without such a list whereas I felt the fine-grained paper would not be understood without one).

With drafts of the abstract and introduction complete (I invariably come back and, at least, tweak them; sometimes I change them extensively), I then find it possible to move on to the paper's main content.

Writing and editing

Perhaps unsurprisingly, I tend to write papers more-or-less linearly from front to back, finishing one section before moving on to the next. I find this helpful to give me the same sense of flow as the eventual reader: it means that I’m forced to think about seemingly simple questions like “have I introduced this concept earlier, or does it need to be introduced here?” It’s easy to overlook the high-level structure of a paper, and I’ve read many papers with all the right pieces but in the wrong order. [4] Similarly I've read many papers which use too many unconnected examples: each feels sensible in its own right, but each example takes time to understand, destroying the reader's flow. A single example which can be built up in stages as the paper progresses nearly always makes the paper easier to read, and is most easily done if writing is done semi-linearly.

That said, I spend the majority of my time at a much lower level: I’m trying to get all the facts that need to be in the paper in there, explained clearly, and in the right order. That means that I spend most of my time thinking about whether a sentence says what I need it to say, without saying anything it shouldn’t say, and whether it does so as concisely as possible. I’m not indulging in false modesty when I say that I don’t think that my first attempt at a sentence has ever satisfied all three. As soon as I’ve typed a sentence in, I’ll reread it, and if it’s really bad (which it normally is), I’ll move it a couple of lines down on the screen and try writing another version immediately. For some complex thoughts I can end up with multiple versions scrolling gradually off the bottom of the screen before I eventually get to something that isn’t obviously flawed. But I'm realistic: I probably won’t get those complex sentences right on the first day, so I try to move on fairly swiftly.

When I start a new day, I go back over what I wrote the previous day and look at it in a fresh light. As well as catching out innumerable typos, I always end up rephrasing and reordering sentences. That’s because, if there’s one thing I’ve learnt about good writing, it’s that it’s the result of good editing. And, as far as I can tell, good editing means extensive editing. It means being able to look at what I wrote and — ignoring my ego, which inevitably says “it’s good enough already” — trying to neutrally judge whether it can be improved. This is easy to say but, at least for me, it was an incredibly difficult skill to learn. I only learnt it when I had to write a grant proposal and my first draft was 11 pages, just a little over the 6 page limit. I had no idea what to do, and my first attempts at cutting things down were, frankly, pathetic. Over about 10 weeks, I accidentally crushed my ego enough to learn how to evaluate my own writing, and from that much followed. The grant was rejected, but I got so much out of the experience of writing it that I still judge it a success.

Order and brevity

Let's look at a concrete example that shows how important it is to think about the order in which things are presented and how shorter writing is often better. The example is something that I remember spending a long time grappling with: an overview of meta-tracing. This was first needed for the approaches to interpreter composition paper, where we wanted to give a rough idea of meta-tracing to unfamiliar readers (who would otherwise find much of the paper tough to understand). Overview sections are often tricky: they have to summarise and simplify related work in a way that doesn’t overwhelm unfamiliar readers, while being accurate enough not to offend people who know the area in depth. This sort of writing ruthlessly exposes holes in one’s understanding and writing abilities.

I'm going to use the first paragraph from Section 2 of the “approaches” paper as my example. For the points I’m about to make, it hopefully doesn’t matter too much if you understand meta-tracing or not. Here's the first draft I wrote (minus citations):

Meta-tracing takes an interpreter and creates a VM which contains both the interpreter and a tracing JIT compiler At run-time, user programs running in the VM begin their execution in the interpreter. When a 'hot loop' in the user program is encountered, the actions of the interpreter are traced (i.e. recorded), optimised, and converted to machine code. Subsequent executions of the loop then use the fast machine code version rather than the slow interpreter. Guards are left behind in the machine code so that if execution needs to diverge from the path recorded by the trace, execution can safely fall back to the interpreter.

I have long since forgotten how long that took to write, but 10-15 minutes is a reasonable guess. At first glance, this paragraph may seem to do a reasonable job of explaining things, once one ignores minor issues like the missing full stop before “at run-time”. However, there’s one glaring problem which took me or someone else (I'm no longer sure who) some time to realise: not only will most readers be unfamiliar with “meta-tracing”, they’ll be unfamiliar with “tracing”, which the above completely fails to mention. That means that this explanation almost entirely fails to achieve its main purpose. After various minor edits to the section as a whole, eventually I inserted a reference to (non-meta) tracing. Between the first draft above and the final published version below, 24 commits [5] were made to the section as a whole (mostly by me, but with some by Edd Barrett):

This section briefly introduces the concept of meta-tracing. Meta-tracing takes as input an interpreter, and from it creates a VM containing both the interpreter and a tracing JIT compiler. Although tracing is not a new idea, it traditionally required manually implementing both interpreter and trace compiler. Meta-tracing, in contrast, automatically generates a trace compiler from an interpreter. At run-time, user programs running in the VM begin their execution in the interpreter. When a 'hot loop' in the user program is encountered, the actions of the interpreter are traced (i.e. recorded), optimised, and converted to machine code. Subsequent executions of the loop then use the fast machine code version rather than the slow interpreter. Guards are left behind in the machine code so that execution can revert back to the interpreter when the path recorded by the trace differs from that required.
The paragraph is now quite a bit longer, but at least it explains tracing. However, it does so poorly (which, as the git history clearly shows, is entirely my fault and not Edd's). Consider the sentences:
Meta-tracing takes as input an interpreter, and from it creates a VM containing both the interpreter and a tracing JIT compiler. Although tracing is not a new idea, it traditionally required manually implementing both interpreter and trace compiler. Meta-tracing, in contrast, automatically generates a trace compiler from an interpreter.
In order, these three sentences: define meta-tracing; reference tracing without explaining why it’s relevant; and somewhat obliquely explain why meta-tracing is different from tracing. To add insult to injury, the third of those sentences partly repeats the content of the first sentence. The problem with this jumble of concepts isn’t just that it’s poor style: by forcing readers to keep switching the concepts they’re reasoning about, it makes it hard for them to concentrate on the crucial aspects. Unfortunately, I either didn’t notice this problem, or couldn’t work out how to fix it, before publication.

Fortunately for us, the story doesn’t end there. Some time later we wrote a paper called fine-grained language composition: a case study which also needed an overview of meta-tracing. I first copied in the paragraph from the “approaches” paper unchanged [6]. Very soon afterwards (probably because of the benefits of reading the paragraph with a fresh mind) I started addressing the problems noted above. With substantial help from Carl Friedrich Bolz and Edd Barrett, 28 commits to the section as a whole later, the final published version looks as follows:

Tracing JIT compilers record hot loops ('traces') in an interpreted program, optimise those traces, and then compile them into machine code. An individual trace is thus a record of one particular path through a program's control flow graph. Subsequent executions which follow that same path can use the machine code generated from the trace instead of the (slow) interpreter. To ensure that the path followed really is the same, 'guards' are left in the machine code at every possible point of divergence. If a guard fails, execution then reverts back to the interpreter.

Meta-tracing JITs have the same basic model, but replace the manually written tracer and machine code generator with equivalents automatically generated from the interpreter itself.

The explanation is now split into two paragraphs. The first has a concise explanation of tracing; the second briefly explains what's different about meta-tracing. Not only is this a much easier flow for the reader, it also has a better explanation of guard failure and, as a pleasant bonus, it's 10% shorter than the “approaches” version. I'm quite proud of this version but notice how long it took us to get there: around 50 commits to the section as a whole (most of which touched the paragraph above). Simplicity and clarity are not easily obtained, but they are always worth striving for.

Writing as part of a team

It's been a long time since I've written a paper on my own. Outside of this blog, nearly all my writing is done in teams of 3–5 people. If my experience is anything to go by, the standard model of collaborating in team writing is for most people to write up the technical bits of a paper they were responsible for, with one person handling the “other” bits (e.g. the introduction and conclusions). Some teams make this model of writing work, but a lot (including some I've been part of) don't, for two different reasons: different people often have jarringly different writing styles [7]; and vital information is often missing, because no individual feels solely responsible for it. It's worth looking at both reasons in more detail.

Sometimes academic writing advice seems to suggest that there is one true writing style to which we should all aspire. This is nonsense, especially with a language as flexible as English: there are often multiple ways of saying the same thing that are all roughly as good as each other. Which of those alternatives one chooses is largely a matter of personal style. For example, my personal style tends towards fairly short sentences and lots of punctuation. Some people I know prefer long sentences and the complete absence of the colon and semi-colon. Obviously I prefer my own style, but that doesn't generally mean the alternative is wrong — just different. [8] The problem comes because readers of technical papers are very alert to minor changes in use of language, and such changes are much more common when different authors fail to homogenise their style. For example, referring to “dynamic compilation” in one part of a paper and “adaptive compilation” elsewhere makes a thoughtful reader wonder if the different phrasings denote a real difference or not. Jolting a reader out of their reading flow in order to ponder such a question is not just a waste of time: the more often it happens, the more likely it is that the reader will lose sight of the big picture.

The problem of missing information is both less obvious and more serious. I frequently read papers which, while seemingly sensible at the level of individual sections, don’t make much sense as a whole. In rare cases this is because the paper is technically flawed. More often it’s because there’s a missing part of the explanation that would help me make sense of what I’m reading. Occasionally I’m able to deduce precisely what the missing information must have been but, more often than not, I’m simply left confused.

Missing information is sometimes simply due to an oversight, or because the authors considered it unworthy of inclusion (generally because “everyone knows that”). However, my experience is that, when writing in a team, a third reason is equally common: no individual feels responsible for that piece of information, implicitly assuming that it’ll appear elsewhere in the paper. Inevitably, therefore, the information is never included anywhere. I’ve occasionally seen people try to solve this by cramming huge numbers of definitions into a paper’s introduction. This then makes the start of the paper both a depressing, as well as an obfuscatory, read: the start of the paper should outline the story, not try and fill in every blank.

The rewriter approach to team writing

The approach I use to avoid standard team writing problems is for one person (I’ll call them the “rewriter”) to take overall responsibility for the paper’s writing, ensuring that the paper’s content and style are consistent and complete. The only way I’ve found to make this happen is for the rewriter to examine, and in general rewrite, every contribution to the paper. Most obviously this means that the paper ends up written in a single style. Less obviously it means that one person is forced to understand everything in the paper, lessening the chances of authors telling an inconsistent or incomplete story. The rewriter doesn’t need to be a mini-dictator — indeed, it’s best if they try and work by consensus when possible — but they do need to be unflagging in trying to maintain a consistently high standard across the paper.

I now apply the rewriter idea to nearly all the papers I'm involved with (though I'm not always the rewriter!) and I've learnt a few things since my first experience with this way of writing [9]. First, it relies on individuals being willing to forego their individual writing style and to trust that the rewriter will do a good job. Second, it requires people to be much more open than normal to being questioned about what they’ve written, and to be willing to answer each and every question.

Sometimes the rewriter will read something, realise that they don’t understand the explanation, and ask for clarification before rewriting can commence. Sometimes the rewriter will not have the same assumptions as the original author, misunderstand the original text, and then rewrite it incorrectly. The good news is that the rewritten version is nearly always clearly incorrect: the original author then needs to request changes until the rewritten text becomes accurate. [10] And, more often than you’d expect, the original author realises that there was an aspect of the research that they hadn’t previously considered, and they need to do more technical work before answering the query.

Let me give you a simple example which I quickly plucked out of the VM warmup paper. Carl Friedrich drafted the following text (intentionally quickly, as the style we had got used to was “draft something quickly, run it past the rewriter, then take another pass at it”) [11]:

[Screenshot: Carl Friedrich's draft paragraph]

In this case, I knew that there was an additional factor not mentioned in this text that needed including: we had discovered something about lazy loading interfering with our measurements, although I wasn’t very sure what the “something” was. I thus rewrote the text based on that additional factor but, since I didn’t know much about it, I also added a couple of questions at the same time:

[Screenshot: my rewritten version, with two inline questions]

Carl Friedrich then replied:

[Screenshot: Carl Friedrich's reply, with one comment and one edit]

There are two changes in this version. Most obviously, Carl Friedrich responded to my first question with a comment (and notice that both of us are trying to understand what is going on rather than trying to “win” a debate). Less obviously, he responded to my second question by editing the text and deleting the comment. I’ve always encouraged this way of editing: there’s no point in people playing endless comment ping-pong simply because they’re too afraid to take decisive action. In this case, it was my job to review the edit that had been made (I simply looked at the relevant diff in gitk); if I had been unhappy with it, I would have reinstated the text with an additional comment (perhaps “I’m not sure my original question made sense, so let me try again”). As is generally the case, I had no problem with this particular edit.

There were several rounds of question-and-answer (see e.g. 1, 2, 3, or my personal favourite 4) before I eventually felt I understood what had gone on enough to rewrite the text to the following:

[Screenshot: the final rewritten paragraph]

That text, somewhat simplified, is clearly recognisable as that in the final paper. Notice that neither Carl Friedrich nor I would have been able to get the text to that quality on our own: it required working together, being open to questions, and not giving up until we were both satisfied that readers would understand the points we were trying to make.

How long does it take to write a paper?

I remember being told early in my career that, once the research is complete, writing a paper takes about a week. I was tortured by this for several years: I rarely got anything out of the door in under a month. And I’ve got much, much worse as my standards have been raised: I reckon that our major papers (for want of a better term) take at least 8-12 person-weeks to write. [12] More complex papers take even longer. As a rough proxy, the VM warmup paper took 999 commits from start to finish (and, as can be seen on arXiv, 7 public releases). Many of those commits (particularly towards the end of writing) are small; some (particularly nearer the beginning of writing) are big. But, either way you look at it, a lot of effort went into the writing. It’s one reason why I’m not a prolific writer of papers: much to the distress of most of the bosses I’ve had, I’d rather try and write 1 good paper a year and fail than succeed at writing 20 mediocre ones.

And while it’s hardest to build the momentum to write the first few words, it’s not trivial to build and maintain the daily momentum needed to complete a paper. On a typical day writing ‘fresh’ prose, I aim to write somewhere around 800-1000 words (I don’t count precisely). Sometimes those words come easily, but sometimes they have to be forced out. I do make sure that I never leave a day empty-handed, though.

One of the hardest things to know is when a paper is finished. One way is to use an external deadline – e.g. the submission date for a conference – as a proxy for this. In general, this tends to lead to a mad rush in the last few days before the deadline, hugely increasing the chances of mistakes creeping in. I prefer to let papers mature for a lot longer than this, because it continues to amaze me how many mistakes become apparent after a good night’s sleep or two. However, taken too far, this is a recipe for never finishing: I think it’s vital to feel that one is continually moving, even if gradually, towards an end goal. To that end, I try and solidify aspects of the paper as I’m going along, so that I’m not tempted to end up in an endless loop of modifying the same things. Typically, I first satisfy myself that the paper’s structure is stable; then that all the necessary content is in; then that the prose is clear and consistent; and finally that there are no typos left. The movement between those stages is almost exclusively one way: when I’m looking for typos, for example, I won’t suddenly decide to move sections around.

At some point the decision has to be taken that the paper is finished and can be released to the world. If I’ve done my job well, the paper will hopefully find an interested audience. But these days I’m realistic enough to know that the paper won’t be perfect: it won’t be as clear as it should be and it will contain mistakes.

Several years ago, I wrote an overview paper on dynamically typed languages where I tried to collect together a lot of things I’d learnt on the topic. Because I knew that this was an area about which many people have strong opinions, I realised that I would have to do a better job than I'd done on any of my previous papers. I tried to buff the paper to the shiniest of sheens and, when it was published, I patted myself on the back for a job well done. A little while after publication I got the following email [13]:

I'm just reading your article on dynamically typed languages from 2009, and I'm stunned. It's fantastically written, neutral and smart.
At this point, my ego – never the most fragile of objects – had swollen to roughly the size of a small planet. Then I read the next sentence:
So how on earth could you misspell "growth" as "grown" in the abstract?!
I checked, and, indeed, I had made a typo in the first sentence of the abstract – in other words, the 13th word in the paper was wrong. My ego shrank back down to a more appropriate size. I fixed the mistake in the HTML version but the PDF retains the error, as a reminder to my future self of my own fallibility.

Wrapping up

There are many other things I could say but if I had to boil everything I’ve learnt so far about paper writing into one sentence it would be this: you don’t need to be hugely talented with words (I’m not), but you have to be willing to continually reread, rethink, and reword everything you’ve written. When writing in a team, “you” particularly means “the rewriter”, if you choose that model. Ultimately the goal is simple: will the reader be able to understand the points you wanted to get across, without undue effort? Bad writing, I suspect, comes from people who think their first effort is good enough. Good writing comes from people who know that every section, every paragraph, and every sentence will have to be pored over multiple times: and, just when they think they’ve perfected the writing, getting external feedback might cause them to reevaluate everything. Sometimes this effort can feel like treading in Sisyphus’ footsteps but, in my experience, it’s effort well spent.

Acknowledgements: My thanks to Edd Barrett, Carl Friedrich Bolz-Tereick, Lukas Diekmann, and Sarah Mount for thoughtful comments. Any errors and infelicities are my own.

Footnotes

[1] Blogs like this one have their purpose, but I feel they're solving a different problem to papers. Put crudely, I use my blog to explain things to people who want a general overview, whereas papers contain all the nitty-gritty details.
[2] Unless you're already familiar with Kuhn's work, Dick Gabriel's wonderful paper about incommensurability in programming languages is, in my opinion, required reading. In short: when we read others’ work, particularly if it’s a little old, we often lack the context that all contemporary readers shared; we are therefore liable to misunderstand or dismiss important parts of the explanation. A simple example is the word “computer”. Before the 1940s this referred to people who performed calculations; currently it refers to machines. Unless you’re aware of that fact, you’ll get a completely incorrect impression when reading pre-1940s material that uses the term “computer”. As a collection of syllables, ‘incommensurability’ is an annoying mouthful: but the idea is simple, powerful, and describes a concept that is surprisingly frequently encountered. Dick brings it to life with a very interesting concrete example.
[3] The common advice is to write the abstract last, once you've got the rest of the paper written. This clearly works well for many people, but whenever I've tried doing so, I've found that my writing is aimless and disjointed.
[4] When I’m reviewing papers, I write a series of small comments in the order I come across them. It’s quite common for me to write “On p2 I don’t understand X” and then to update it later with “Ah, this is explained on p5”.
[5] I used git blame -L to find the relevant commits.
[6] This is sometimes referred to as "self-plagiarism" and some authors will write such sections anew each time to ensure they can't be accused of it. Since background material is generally short and doesn't really form part of a paper's contribution, I don't have a problem with people reusing small amounts of such material: the problem, in my opinion, is when people copy supposedly new technical contributions between papers. In this case I even made clear to myself and my co-authors what was going on in the commit message:
Integrate a brief description of meta-tracing.

The first and last paragraphs are directly swiped from 'approaches to interpreter composition'. The middle paragraph is new, as I suspect we will need to talk about hints to explain why cross-language optimisations perform well.
[7] This effect is often exaggerated when different co-authors have different mother tongues: it's generally not difficult to tell a French author writing in English versus a German author, for example. If this sounds like Anglo-snobbery, I really don't mean it to be: I'm still struggling with one language, and am in awe of people who can express themselves in multiple languages.
[8] That said, there are a few things I consider universally bad style for technical papers. The chief style crime I see amongst junior authors is a tendency to think that the following idiotic trope (repeated by seemingly every school English teacher) is true: “never repeat a word or phrase you’ve used recently.” Perhaps this is good advice for fiction, but it’s terrible advice in technical papers, where it’s vital to use the same word(s) to refer to the same concept throughout a paper.
[9] Personally I stumbled across the "rewriter" way of working entirely by accident. I was part of a research project that ultimately turned out to not really need my skills. When it came to writing up the resulting research, I felt like a spare limb, so to make myself feel useful I offered my services as a copy-editor. This was accepted by the rest of the team without any of us realising what we'd signed up to. Whenever someone added a new part to the paper, I would read it very carefully. If I couldn't make any sense of it, I would add questions inline asking for clarification. If I could make sense of most or all of it, I rewrote the whole lot in my style. I viewed it as my job to make sure that the paper flowed, stylistically and technically, from start to finish, and that everything within was adequately explained.

The downside of this approach was that I almost drove my co-authors to murder, because none of them had encountered such repeated questioning before. However, after a bit of time had passed (and the paper had done reasonably well), we all agreed it was better written than if we'd taken a traditional writing approach. I hasten to add that this is not because the paper used my style, but simply because it used one person's style consistently, and that person had also ensured that all the information the paper needed was contained therein.

[10] My personal tactic is to turn seemingly ambiguous things into definite statements. For example, “This is smaller than some other objects” might become “This is smaller than most other objects” or “This is larger than most other objects”, depending on my intuition about which way the ambiguity is best resolved. In the best case, I will have rewritten things more clearly than they were before. In the worst case, I will have made it more likely both that the original author will spot the problem and that they will feel the need to correct it.
[11] The “screenshots” below are SVGs extracted from the PDF generated from LaTeX. The inline comments are a macro that Roel Wuyts used on a shared article we wrote many moons ago. I've shamelessly used it ever since. I can't remember who added colour to the comment macros, but it really helps them stand out from the normal text. These days I use “how much red is there on a page” as a good proxy for “how close is this text to being finished?”
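To give a flavour of the general idea, a comment macro in this spirit might look something like the following. I should stress that this is a minimal sketch of my own, purely for illustration: the name \inlinecomment, the initials, and the exact formatting are invented, and are not the macro Roel actually wrote.

\documentclass{article}
\usepackage{xcolor}
% A coloured inline comment macro: the first argument is the
% commenter's initials, the second the comment itself.
\newcommand{\inlinecomment}[2]{\textcolor{red}{\textbf{#1:} #2}}

\begin{document}
Our approach works well in practice.\inlinecomment{AB}{Is this claim backed up by our data?}
\end{document}

The colour makes unresolved comments jump out when skimming a PDF, which is what makes the “how much red is there on a page” heuristic work so well.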
[12] How long did it take to write this blog post? Spread over about 8 months, it took about 7 days (partly because I wrote some material that external comments convinced me didn't really fit in). It's one reason why I'm not a prolific blogger, although the more significant reason is that I generally lack anything worth my time writing and your time reading.
[13] I wish I could write emails as succinct and amusing as this one. While I hugely admire laconic humour in others, I’m not generally capable of editing my thoughts down quite so ruthlessly.