What Metric to Use When Benchmarking?

What is the right metric to use when measuring a program’s performance? There are various possibilities, from memory usage to branch prediction hit rates, but I’m going to pick on two that I see widely used: CPU instructions executed (what modern CPUs call “instructions retired”) and wall-clock time (i.e. “how much time has elapsed in the real world?”). In this post, I’m going to try and compare both, showing why each has fundamental weaknesses, before explaining why I use each in different circumstances.

Measurement instability

Let’s start by benchmarking a very simple program that runs on every Unix OS I have access to:

awk 'function fib(n) { \
  return n <= 1 ? 1: fib(n-1) + fib(n-2) \
} BEGIN { fib(30) }'

If I run our two “contenders” – CPU instructions executed [1] and wall-clock time – 30 times each I get the following data:

Instructions (#)    Time (s)
12,229,905,226      7.651
12,229,942,508      6.879
12,229,929,314      7.162
12,229,945,391      6.919
12,229,951,916      6.985
12,229,909,754      7.649
12,229,928,446      7.509
12,229,925,539      7.365
12,229,941,396      7.249
12,229,975,397      6.732
12,229,947,025      6.804
12,229,901,833      6.686
12,229,935,551      7.043
12,229,897,108      6.937
12,229,946,862      8.190
12,229,944,198      6.861
12,229,917,993      7.228
12,229,946,915      7.384
12,229,945,102      7.536
12,229,911,153      7.285
12,229,932,090      8.367
12,229,946,230      7.198
12,229,916,826      7.049
12,229,915,111      6.972
12,229,903,371      8.191
12,229,921,457      7.721
12,229,932,604      8.160
12,229,945,384      7.702
12,229,943,755      8.929
12,229,914,036      7.507

That’s a lot of data: how should I interpret it? I could provide just a mean (what we colloquially call the “average”), but that wouldn’t give us a good sense of the distribution of our data. I could plot a histogram, which gives me an excellent sense of the distribution, but I then find it hard to compare runs across more than a couple of benchmarks. Instead, I’m going to use confidence intervals which, for a given probability (I’ll use 99%), tell us something about the spread of data around our mean. For the numbers above we have the following means and 99% confidence intervals:

Instructions (#): 12,229,930,650 ± 8,701
Time (s): 7.391 ± 0.260

Confidence intervals have a surprisingly subtle definition, though a simplification is adequate for this post. Taking the CPU instructions executed row as an example, a reasonable first approximation is to say that we’re fairly sure that the “true” mean of the experiment lies in the range 12,229,930,650 ± 8701 (i.e. from 12,229,921,949 to 12,229,939,351) [2].
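
To make the arithmetic concrete, here is a minimal sketch of how such an interval can be computed, assuming the samples are roughly normally distributed and using the z-value 2.576 for 99% (with only 30 samples a t-distribution would be marginally more accurate):

fn mean_and_ci99(samples: &[f64]) -> (f64, f64) {
  let n = samples.len() as f64;
  let mean = samples.iter().sum::<f64>() / n;
  // Sample variance with Bessel's correction.
  let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
  // 99% confidence interval: z-value multiplied by the standard error of the mean.
  (mean, 2.576 * (var / n).sqrt())
}

fn main() {
  // The first few wall-clock times from the table above; in practice all 30 go here.
  let times = [7.651, 6.879, 7.162, 6.919, 6.985];
  let (mean, ci) = mean_and_ci99(&times);
  println!("{:.3} ± {:.3}", mean, ci);
}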

What the confidence intervals clearly show is that the wall-clock measurements have a much wider distribution of data: as a simple proxy for that, as a proportion of the mean, the confidence interval for wall-clock time is about 4 orders of magnitude bigger than it is for CPU instructions executed. In other words, in this example, CPU instructions executed are a far more stable and predictable metric than wall-clock time; or, looked at from the opposite direction, wall-clock time is much noisier than CPU instructions executed.

Noisy, or less stable, data causes us all sorts of problems. For example, imagine we were trying to implement a new optimisation in Awk and wanted to understand if, and to what extent, our attempted optimisation had improved performance. If our wall-clock measurements remained as widely distributed as above, it’s less likely that we would be able to tell with confidence whether we’d improved performance or not. In contrast, with CPU instructions executed, we’d stand a much better chance of reliably identifying even fairly small differences in performance.

To some extent, we can use statistics to try and dig wall-clock time back out of the hole it finds itself in. Since it seems likely that our simple awk benchmark executes largely deterministically, we might consider using the minimum wall-clock time value (or, indeed, CPU instructions executed) as the best approximation of the program’s “true” performance. However, modern software and hardware have so many layers of performance-impacting non-determinism (what I call “performance non-determinism”) that for anything but the smallest benchmarks it can be very misleading to use the minimum time [3].

If we want to reduce the noise in wall-clock time, we have to be careful in how we measure it. Readers who have experience of benchmarking have probably been grinding their teeth for several paragraphs by now as I haven’t told you anything about my measurement setup. I recorded CPU instructions executed on a Debian Xeon server [4]. I recorded wall-clock time on my two-physical-core OpenBSD Core i7 laptop, without mains power, with hyper-threading and turbo boost enabled, and while I was using other applications including a web browser. Simply plugging the laptop into the mains and SIGSTOPing the web browser reduces both the mean and confidence intervals, which become 6.561 ± 0.0520 — in other words, as a proportion of the mean, that confidence interval is an order of magnitude narrower than my first attempt [5].
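
For what it’s worth, collecting wall-clock data like this needs nothing more sophisticated than a loop around a child process. The sketch below (not the harness I actually used, with the awk benchmark from earlier inlined as a shell command) times 30 runs using Rust’s monotonic clock:

use std::process::Command;
use std::time::Instant;

fn main() {
  // The awk benchmark from earlier, run via the shell so the quoting survives intact.
  let bench = "awk 'function fib(n) { return n <= 1 ? 1: fib(n-1) + fib(n-2) } BEGIN { fib(30) }'";
  for _ in 0..30 {
    let start = Instant::now();
    // unwrap() only catches a failure to spawn; a real harness would also check the exit status.
    Command::new("sh").arg("-c").arg(bench).status().unwrap();
    println!("{:.3}", start.elapsed().as_secs_f64());
  }
}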

As this data shows, wall-clock time is heavily affected by external factors, both hardware and software. It might seem that I’m being deliberately silly by benchmarking wall-clock time on a laptop running a full desktop environment, but you would be astonished at how many research papers do just that. This becomes even more of a problem if the benchmarks take a long time to run. Laptops, in particular, struggle to dissipate heat [6], and after a while the hardware tends to move temporarily to a low-performance mode to get temperatures back under control. As you can imagine, this can give an incredibly misleading sense of software performance. Even if we go out of our way to control external factors (which we did in our VM warmup paper), wall-clock time measurements are never likely to approach the stability of CPU instructions executed.

CPU instructions are not a proxy for wall-clock time

Since measuring wall-clock time is so difficult, many people have turned to measuring CPU instructions since they are much less sensitive to external factors [7].

Let’s look at a very different example: let’s imagine I want to write a simple grep clone to search for a slightly implausible regular expression in LLVM’s source code. In Rust I might write:

use ignore::Walk;
use regex::Regex;
use std::fs;

fn main() {
  let re = Regex::new("a[b-f]+g*hi*jkl*m+n*o").unwrap();
  for result in Walk::new("llvm") {
    if let Ok(e) = result {
      if e.path().is_file() {
        if let Ok(s) = fs::read_to_string(e.path()) {
          if re.is_match(&s) {
            println!("{}", e.path().to_str().unwrap());
          }
        }
      }
    }
  }
}

When I run that on the same Debian server as before I get (raw data for instructions and times):

Instructions (#): 11,068,644,420 ± 9,632,134
Time (s): 2.847 ± 0.004

Because I’m an impatient person, I’m unhappy at my search taking nearly three whole seconds. My intuition is that my program often has to wait for files to be read in from disk: what happens if I modify the program to keep executing the regular expression while waiting for the next data to load? I could use select and friends, but that’s a lot of complexity for my pea brain. Instead, it’s easier for me to split my program into two threads: one which walks the directory structure and loads files from disk; and another which performs regex searches. By setting a channel up between the two threads, I can “send” file data to the regex-searching thread. My code now looks as follows:

use ignore::Walk;
use regex::Regex;
use std::{fs, sync::mpsc::channel, thread};

fn main() {
  let re = Regex::new("a[b-f]+g*hi*jkl*m+n*o").unwrap();
  let (send, recv) = channel();

  let h = thread::spawn(move || {
    for result in Walk::new("llvm") {
      if let Ok(e) = result {
        if e.path().is_file() {
          if let Ok(s) = fs::read_to_string(e.path()) {
            send.send((e, s)).ok();
          }
        }
      }
    }
  });

  for (e, s) in recv.iter() {
    if re.is_match(&s) {
      println!("{}", e.path().to_str().unwrap());
    }
  }

  h.join().unwrap();
}

My new measurements – and note that the CPU instructions executed are the cumulative count across my two threads – are now (raw data for instructions and times):

Instructions (#): 11,654,988,513 ± 7,693,766
Time (s): 2.646 ± 0.014

The good news is that by checking that the means and confidence intervals for the two versions of my search don’t overlap, I know that I’ve got a statistically significant performance difference on my hands. The bad news is that, roughly speaking [8], I’ve reduced wall-clock time by about 7% but increased CPU instructions by about 5% [9]!
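
As a rough sketch of the check being made here (using the means and 99% confidence intervals reported above, and glossing over the statistical subtleties of footnote [8]):

// Do two confidence intervals overlap, i.e. might the "true" means be the same?
fn overlap(m1: f64, c1: f64, m2: f64, c2: f64) -> bool {
  (m1 - c1).max(m2 - c2) <= (m1 + c1).min(m2 + c2)
}

fn main() {
  // (mean, 99% CI) pairs for the single-threaded and multi-threaded versions.
  let time = ((2.847, 0.004), (2.646, 0.014));
  let instrs = ((11_068_644_420.0, 9_632_134.0), (11_654_988_513.0, 7_693_766.0));
  for (name, ((m1, c1), (m2, c2))) in [("Time", time), ("Instructions", instrs)] {
    let change = (m2 - m1) / m1 * 100.0;
    println!("{}: overlap? {}; change {:+.1}%", name, overlap(m1, c1, m2, c2), change);
  }
}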

How can these two seemingly related metrics go in opposite directions after my “optimisation”? Well, my use of two threads means that from the wall-clock perspective, we really were able to do things in parallel: some regex searching occurred while the other thread was paused reading in, or processing, file data. However, the two threads have to interact, and those interactions require executing additional code over the non-threaded version (channels aren’t free!), hence the additional CPU instructions [10].

Should I throw away my multi-threaded version because it executes more CPU instructions? Of course not — I was trying to reduce the metric that I as a human care about, which is wall-clock time! What we’re observing here are the dangers of using one metric (CPU instructions executed) as a proxy for another (wall-clock time): there is no guarantee that the two metrics always correlate. In other words, if I care about wall-clock time, that’s what I need to measure: CPU instructions executed measures something slightly different and, in this case, we can really see those differences.

When to use each metric

Summarising what we’ve seen above is fairly simple. Wall-clock time is difficult to measure accurately, but it’s generally the metric that we as humans actually care about. CPU instructions are easy to measure accurately but, especially for larger or more complex programs, can be a misleading proxy for wall-clock time.

Personally, since I mostly care about performance that humans perceive when running fairly large programs, I generally use wall-clock time, but there are situations in which I use CPU instructions executed. In particular, as the parts of a program I’m measuring become smaller, it becomes increasingly difficult to measure wall-clock time accurately. My rule of thumb is that when I’m looking at the performance of an individual function, CPU instructions executed are probably more appropriate. In particular, in many cases I can only realistically benchmark a function as part of a larger program — in other words, I can’t extract the function and run it in a loop until it’s run long enough for accurate wall-clock time measurements. In such cases, I normally use cachegrind (which measures simulated CPU instructions) rather than Linux’s perf (which, when used to give per-function statistics, intermittently samples performance counters, leading to noisy measurements for small functions). It’s still possible for me to be misled by the resulting figures (e.g. because of performance non-determinism I’m unaware of), but the smaller the fragment I’m measuring, the less likely that is to happen.

I’m not dogmatic enough to think that the story is quite as simple as “only small things should use CPU instructions” — there will always be certain cases where something else makes sense. For example, Nicholas Nethercote suggests that CPU instructions can be meaningful if you’re comparing two versions of a big program to which only small changes have been made. I’ve also made use of this same intuition on occasion, though I’m aware that it’s quite difficult to know exactly what constitutes a small, and hence “safe”, change.

Fundamentally, I think there’s a bigger lesson to be drawn from the particular comparison I’m making in this post: just because a measurement is easy to make doesn’t mean that it’s measuring the thing we truly care about [11]. For example, I’ve seen CPU instructions executed (or CPU cycles) used as a proxy for energy consumption: the two are clearly going to correlate to some extent, but there are almost certainly going to be some very surprising divergences (e.g. for exotic instructions such as those dealing with vectors). I’ve also seen the number of objects allocated as a proxy for memory usage, which ignores the often significant differences between small and large object sizes. I’m sure you’ve seen many similar cases! In many situations, a single metric on its own can’t capture everything we truly care about, and then we need to make multiple measurements and decide what the right trade-offs might be. Benchmarking is hard precisely because there are so many superficially similar situations that, when carefully examined, turn out to require very different measurement techniques.

Acknowledgements: thanks to Edd Barrett, Daniel Lemire, and Davin McCall for comments.

Footnotes

[1]

Using repeat 30 { perf stat awk '...' 2>&1 | grep instructions }

[2]

A more precise definition is to say that if we were to continually rerun the experiment, and calculate 99% confidence intervals for each new experiment, then over time we would expect 99% of the resulting confidence intervals to contain the “true” mean. Even that definition can be finessed a little.

[3]

There remain situations where, in my opinion, the minimum time remains arguably the best metric, for example when benchmarking small, deterministic sequences of instructions. I am a big fan of Daniel Lemire’s work, which often involves him benchmarking small, highly optimised sequences of instructions, where the minimum value is generally the most meaningful measure.

[4]

The measurements remain stable whether or not the server is highly loaded but, as we’ll soon see, bitter experience has taught me to benchmark on otherwise idle machines.

[5]

If I benchmark wall-clock time on the same (mostly idle) Debian server I measured CPU instructions executed on I get 0.950 ± 0.002.

[6]

Performance throttling due to overheating happens on servers too, though typically much less often. When we were in the early days of benchmarking VM warmup, we had to remove one desktop machine with “high performance” water cooling from our experiment as it overheated fairly often. Much later on we implemented a more careful check for performance throttling and found that even on servers in an air-conditioned basement it did occasionally happen (about 0.004% of the time).

[7]

Though even on the Awk example above, there is some variation from run to run. Whether that’s measurement error, Awk not being deterministic, or something else entirely, I have no idea!

[8]

I’m using “roughly speaking” very deliberately here: good quality statistics with confidence intervals are rather involved, and I’m too lazy to remind myself how to do them here.

[9]

As my use of the Regex crate implies, I’ve basically written a tiny version of ripgrep. Indeed, if I use ripgrep itself, then on my 36-core server, moving from 1 thread to 36 threads reduces wall-clock time by about 8.5x while increasing CPU instructions executed by about 2%.

[10]

There are also some additional, though probably fairly constant, factors such as the additional cost of setting up a thread.

[11]

See also Goodhart’s Law:

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
