Laurence Tratt: Minimum Times Tend to Mislead When Benchmarking

Most of us want, or need, to benchmark software at some point, either because we’re proactively trying to avoid performance problems, or because we’re trying to narrow down the cause of a specific performance problem. Benchmarking seems like it should be simple, but there are countless ways to go wrong: indeed, as things stand, none of us really knows how to do it well, at least not in general. In this short post, I want to make a simple argument against using the minimum time of a benchmark as a proxy for that benchmark’s performance. Doubtless I am recasting other people’s arguments, poorly, but I’ve seen a resurgence in people advocating for the use of the minimum time recently, so it seems like a good time to make the opposing case.

The basics

There are all sorts of things that one might wish to measure about a program: how much memory it consumes, the number of files it opens, and so on. By far the most common measure is “wall-clock time”1 – i.e. how much time elapses in the real world while the program is running – and that’s what I’m going to concentrate on exclusively here.

Once upon a time, measuring wall-clock time was fairly simple. You read the system clock, ran a program, read the system clock again, and reported the delta between the two clock measurements. There were some minor annoyances to do with clocks 2 but measurements tended to be fairly predictable and repeatable. The weasel word in that statement is “fairly”. Noise has always been an issue in benchmarking: except in rare circumstances, there are always external factors (e.g. other processes; even the kernel itself) that can interfere with your benchmark’s execution which, collectively, we tend to refer to as noise.

The only way to get some idea of the impact of noise is to execute multiple runs of a benchmark and produce statistics over those runs. The problem then becomes: what statistics? If you’re anything like me, statistics is an incredibly daunting area. It’s not so much that the underlying maths is complex (though it often is), but that nearly every statistical method comes with a set of preconditions that can be hard to interpret (e.g. “the data must be independent” – even when you’ve understood what “independent” means in a semi-formal sense, understanding whether your data is or isn’t independent can be difficult). If you apply a statistical method to data that violates the preconditions, you’ll still get a number out, but it might not mean what you expected.

Fortunately, in traditional settings, one could fairly safely make the following simplifying assumption: noise can only slow a program down. In other words, if I run a program twice, and it produces wall-clock readings of 1.1s and 1.2s, then the 1.1s reading is closer to the “true” performance of the program. Getting a good performance reading then consists of running the program multiple times and taking the minimum time from those since, by definition, that is the closest to the “true” performance, since it has been the least affected by noise.

An example of performance nondeterminism

Imagine I’m writing a program which repeatedly needs to find an English word which matches a given sha256 value. Since this calculation is fairly slow, I might choose to use a hashmap to cache the sha256 values. Here’s some simple Rust code which creates a HashMap from sha256 values to normal strings:

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

use sha2::{Digest, Sha256};

fn main() {
  let mut words = HashMap::new();
  let f = File::open("/usr/share/dict/words").unwrap();
  for l in BufReader::new(f).lines().map(|l| l.unwrap()) {
    words.insert(Sha256::digest(l.as_bytes()), l);
  }
}

I can easily call words.get(h) to find a string which matches the sha256 value h and know that I’ll get average lookup performance of O(1). However, let’s assume that on rare occasions I need to do the opposite calculation, that is I need to find what string matches a given sha256 value. We can add this code to our main function (in this case seeing if the word “zygote”, which happened to be the last word in /usr/share/dict/words that I recognised, really is found in words):

for (k, v) in &words {
  if v == "zygote" {
    for e in k {
      print!("{:x?}", e);
    }
    println!();
    break;
  }
}

If I run the resulting program it duly prints out:

d8be86c985bdd2938cc6cdc9b43039273be1fd97f3e4daed329bade585dd6ef

as the matching value.

Because I have to do a linear scan over the hashmap, this program has O(n) average lookup time. Let’s assume that I think the newly inserted bit of code is a performance bottleneck. I might then try timing the program and seeing what happens. However, because I think I know quite a bit about benchmarking, I assume that the time taken to create the initial hashmap will dominate a single subsequent lookup. To compensate, I put the code of interest into a loop so that it executes enough that I’ll get a measurable signal. I also remove the print calls, because I know they can interfere with timings. I thus end up with the following:

for _ in 0..1000 {
  for (_, v) in &words {
    if v == "zygote" {
      break;
    }
  }
}

OK, so let’s build it and run it 10 times so that I can calculate the minimum running time and determine if the performance of this program is good enough:

$ cargo build --release && repeat 10 time target/release/count
    Finished release [optimized] target(s) in 0.01s
target/release/count  0.50s user 0.00s system 99% cpu 0.501 total
target/release/count  0.42s user 0.00s system 99% cpu 0.426 total
target/release/count  0.26s user 0.01s system 99% cpu 0.267 total
target/release/count  0.18s user 0.00s system 98% cpu 0.179 total
target/release/count  0.06s user 0.02s system 99% cpu 0.080 total
target/release/count  0.13s user 0.00s system 97% cpu 0.131 total
target/release/count  0.42s user 0.01s system 99% cpu 0.425 total
target/release/count  0.20s user 0.01s system 98% cpu 0.212 total
target/release/count  0.42s user 0.01s system 99% cpu 0.429 total
target/release/count  0.23s user 0.00s system 98% cpu 0.235 total

At this point I might be a bit surprised: the minimum wall-clock time (0.080s) is over 6 times faster than the maximum (0.501s). I’m quite happy with the minimum time, but decidedly unhappy with the maximum time. What happened? Well, I was running that on a fast Linux server that can be used by lots of other people, so maybe my readings were subject to a lot of noise. Let’s try running it on my trusty OpenBSD desktop:

$ cargo build --release && repeat 10 time target/release/count
    Finished release [optimized] target(s) in 0.02s
target/release/count  0.45s user 0.06s system 98% cpu 0.517 total
target/release/count  0.27s user 0.05s system 104% cpu 0.307 total
target/release/count  1.73s user 0.05s system 99% cpu 1.783 total
target/release/count  1.82s user 0.05s system 99% cpu 1.876 total
target/release/count  2.10s user 0.06s system 99% cpu 2.179 total
target/release/count  1.66s user 0.08s system 99% cpu 1.742 total
target/release/count  0.92s user 0.08s system 99% cpu 1.007 total
target/release/count  0.28s user 0.05s system 96% cpu 0.340 total
target/release/count  2.22s user 0.08s system 100% cpu 2.292 total
target/release/count  0.51s user 0.03s system 101% cpu 0.533 total

Unsurprisingly the timings are now much slower 3, but the ratio between the minimum and maximum times is now well over a factor of 7! What’s going on?

What we are witnessing here is what I call performance non-determinism for want of a better term. “Nondeterministic” is one of those words that academics bandy around willy-nilly but which can be quite confusing. In our case, it means that a program can run differently from one execution to the next. The most obvious way to write a nondeterministic program is to depend at some point on an unknown value. Sometimes this is done deliberately (e.g. an AI player in a game), although it is often accidental (e.g. a data race or reading from an incorrect place in memory).

In the program I’ve written, however, I’m not deliberately reading a random variable nor (I hope!) are there any bugs. However, there is nondeterminism at play, though it’s not directly caused by the user: the order that Rust’s HashMap stores items internally depends on a seed that is randomly chosen each time the program starts (making it harder for attackers to game the hashmap). This is largely invisible when doing a normal lookup, but becomes painfully obvious when you iterate over all the elements: on one run the element of interest might be the first I encounter, and on the next run it might be the last. Altering my program to see how many times I go round the loop before finding a match highlights this:

let mut c = 0;
for _ in 0..1000 {
  for (_, v) in &words {
    c += 1;
    if v == "zygote" {
      break;
    }
  }
}
println!("{}", c);

There’s now a clear correlation between that count and the execution time:

$ cargo build --release && repeat 10 time target/release/count
    Finished release [optimized] target(s) in 0.02s
50434000
target/release/count  0.41s user 0.07s system 99% cpu 0.481 total
23732000
target/release/count  0.28s user 0.07s system 96% cpu 0.362 total
80556000
target/release/count  0.65s user 0.04s system 99% cpu 0.696 total
98799000
target/release/count  0.78s user 0.05s system 98% cpu 0.843 total
222337000
target/release/count  1.90s user 0.06s system 100% cpu 1.957 total
82766000
target/release/count  0.63s user 0.08s system 99% cpu 0.713 total
222532000
target/release/count  1.92s user 0.04s system 99% cpu 1.963 total
17949000
target/release/count  0.29s user 0.05s system 100% cpu 0.339 total
18610000
target/release/count  0.30s user 0.04s system 99% cpu 0.341 total
87260000
target/release/count  0.68s user 0.06s system 98% cpu 0.750 total

Notice that the iteration count varies by a factor of more than 10, even in those 10 runs, which largely explains the factor of 6 difference in wall-clock time (it’s likely that the remaining discrepancy is accounted for by the time needed to create the initial HashMap). The reason I call this performance nondeterminism is because it has no impact on correctness – and thus is often forgotten – but can have a significant impact on performance.

So what does all this have to do with using the minimum wall-clock time when benchmarking? In essence it shows that on modern systems, it’s not safe to use the minimum value as a measure of performance because programs such as this have a set of performance points. That seems like a very simple thing to say, but it has a profound difference on the way one views program performance. In this case, roughly speaking, that set has as many distinct performance points as there are words in /use/share/dict/words. Indeed, running my program 100,000 times on an unloaded server (an E3-1240 v6 3.70GHz processor, with turbo boost and hyperthreading turned off, all cores set to performance, and running Debian 9) shows a largely flat distribution 4:

As this histogram shows, using the minimum time as a proxy for the program’s performance would be deeply misleading: it would lead us to hugely underestimate the running time of this program would take in most cases. Indeed, in this case, even the mean on its own would be misleading! While it’s generally not practical to give people a histogram of performance for every possible benchmark, it’s absolutely vital to report something about the distribution as a whole. In general, confidence intervals are a good technique for doing this, and, in this case, they’d be infinitely better than merely reporting the minimum.

Is this an isolated example?

When you look around, there are a surprising number of places where performance nondeterminism can occur – and it’s not just caused by software! Caches and predictors in processors clearly cause changes in performance. Indeed, I now regularly hear processor caches and the branch predictor being blamed for unexplained performance variations. Although my cynical side thinks that this is sometimes an excuse to avoid understanding what’s really going on, they clearly have surprising and unpredictable effects. Software has its share of complementary features, one of the best known being ASLR. As an added bonus, these various features interact in ways that we don’t yet really understand.

A couple of pointers might help convince you that the problem is a deep one. I’ll start with the work which has had the most influence on my way of thinking in this regard: Stabilizer. In essence, Stabilizer takes a program and deliberately changes its layout (e.g. the order of functions) so that the possible space of layouts is, statistically speaking, equally covered: when this experiment was performed in 2013, LLVM’s -O2 optimisation level was found to be statistically indistinguishable from -O3 on a well known benchmark suite. At the risk of stating the obvious, this is highly surprising: everyone knows that -O3 is “better” than -O2! But that might just be because no-one had taken into account performance nondeterminism before… What Stabilizer clearly showed me is that one must, in general, consider programs as having a set of distinct performance values (as we saw in our hashmap searching example). When I was then involved in work on measuring JIT (Just-In-Time) VM performance, we found that over 50% of well known benchmarks suffered from performance nondeterminism – including some of the C benchmarks we’d included as a control because we assumed they wouldn’t be subject to such problems!

The effects on comparison

It turns out that performance nondeterminism isn’t the only reason why the minimum time of a program is a poor measure of a program’s performance. Let’s imagine a program similar to our previous example, but this time let’s search for a random word in a list of strings. A simple version of this looks as follows:

use std::fs::File;
use std::io::{BufRead, BufReader};

use rand::thread_rng;
use rand::seq::SliceRandom;

fn main() {
  let mut words = Vec::new();
  let f = File::open("/usr/share/dict/words").unwrap();
  for l in BufReader::new(f).lines() {
    words.push(l.unwrap())
  }

  let choice = words.choose(&mut thread_rng()).unwrap();
  for _ in 0..10000 {
    assert!(words.iter().any(|w| w == choice));
  }
}

When I run that for 100,000 times I get the following histogram:

Even without looking at the code, some readers might have guessed from that histogram that I’m using a linear search over the list. Wouldn’t it be better to replace it with a binary search? Let’s give it a go:

words.sort();
let choice = words.choose(&mut thread_rng()).unwrap();
for _ in 0..10000 {
  assert!(words.binary_search(&choice).is_ok());
}

Note that I first have to sort the words, because Rust’s idea of ordering is different than /usr/share/dict/words. With those small changes I then get the following histogram:

Unfortunately it looks my attempted optimisation has failed: the minimum running time of the linear search (0.009s) is lower than that of the binary search (0.010s)5. As we all know, an optimisation that fails to make a program faster should never be included, so it seems that all my efforts have been for nought.

However, if we look at the distributions of both versions of the program alongside each other, a rather different story emerges:

As this shows, the binary search is always fast, while the linear search is often comically slow. In other words, given this data, it would be penny wise and pound foolish for me to use the linear search: I’d make the program very marginally faster for a small number of cases but much slower in most cases.

In my experience, optimisations rarely make all possible cases faster: there are trade-offs involved, and one often has to accept that some inputs will become a little slower in order that most cases can become a lot faster. The problem with using the minimum time is that it is entirely oblivious to trade-offs such as this.

Conclusions

The examples I’ve shown above might look silly to you, but they’re simplifications of problems I’ve seen in real programs: in each case, they were pretty much the first thing I tried to show the problem of minimum times 6. That said, a perfectly reasonable question is whether I’ve managed to choose from amongst a small handful of examples that show such problems. Alas, particularly with regards to performance nondeterminism, this blogpost has merely highlighted the tip of the iceberg – and, as a community, we don’t really know how deep the iceberg goes.

However, performance nondeterminism is only one of the reasons why using the minimum time of a benchmark is rarely a good idea. As the optimisation example (linear vs. binary search) showed, the fundamental problem with the minimum time of a benchmark is that it gives you false confidence that you can avoid thinking about a benchmark’s overall performance distribution. Sometimes you can get away with this, but doing so can often lead to you making the performance of your program worse. That said, performance nondeterminism increasingly worries me, because even a cursory glance at computing history over the last 20 years suggests that both hardware (mostly for performance) and software (mostly for security) will gradually increase the levels of performance nondeterminism over time. In other words, using the minimum time of a benchmark is likely to become more inaccurate and misleading in the future…

If you want to replicate my experiments, please have a look at the minimum_times repository. You can also download the data from the run of the experiment I used for this blog post if you want to investigate it and/or compare it to the data you get from running the experiment on your own system.

Acknowledgements: My thanks to Edd Barrett for for comments on this blog post.

Newer 2019-08-15 08:00 Older

If you’d like updates on new blog posts: follow me on Mastodon or Twitter; or subscribe to the RSS feed; or subscribe to email updates:

Footnotes

It’s not uncommon for people to argue that wall-clock time is a bad measure because it’s less predictable than processor counters. The reason wall-clock time appears less predictable is because it takes more things into account (e.g. the amount of time waiting for I/O) and the more things you measure, the more chance there is for variability. There are definitely cases where focussing purely on processor counters is the right thing to do (e.g. tiny loops that are a few instructions long) but in most cases wall-clock time is the only thing that users care about. Put bluntly, as an end user I don’t care if a program executes slightly fewer instructions if it does so by doing a lot more I/O: I care purely about elapsed time in the real world.

As a footnote to a footnote, it also transpires that processor counters are not as accurate as one might expect – see the classic paper 2008 paper by Weaver and McKee as well as this 2019 paper which confirms that the situation has not improved in the interim.

☒

From my own personal experience, system clocks often had poor resolution (i.e. they couldn’t accurately tell you small differences in performance) and/or were erratic (for example, I remember a PC I did work on in the mid/late 90s that gained about 90 seconds a day).

☒

Partly because OpenBSD’s /usr/share/dict/words is well over twice as big as Debian’s equivalent.

☒

Partly because OpenBSD’s /usr/share/dict/words is well over twice as big as Debian’s equivalent.

I’ve sometimes heard it said that the central limit theorem saves us when benchmarking. In my experience, this hope is based on an over simplification of the theorem, which is that if you run a program enough times any randomness (including noise) will converge towards a normal distribution. That is only true if the various random variables are close to being independent and identically distributed. As some of the examples in this blogpost show, we sometimes encounter situations where these requirements are not met.

☒

In this case, the cost of the sort is probably responsible for most of the measurable difference, though interestingly there really are cases where linear search is faster than binary search. For example, on modern machines, if you have very small lists (roughly speaking, around 10 elements or fewer), linear search can win: there are parts of JIT compiling VMs, for example, where people care about the difference in speed!

☒

That does not mean that the data is not without surprises. Why does the linear search histogram, for example, have a noticeable peak in the first histogram bin? I can speculate as to possible causes, but I don’t know for sure.

☒

Minimum Times Tend to Mislead When Benchmarking