Why Aren’t More Users More Happy With Our VMs? Part 1

Programming language performance is something that nearly everyone cares about at some point, whether they realise it or not. If you’re a programmer, you sometimes have to make your programs run fast; if you’re a user, you might be frustrated at a program that runs slowly, without knowing that it’s caused by a poor programming language implementation. Those of us who work on, or around, programming language Virtual Machines (VMs) tell a good story about performance, but a surprising number of users seem unhappy with the performance of their programs. Sometimes, yes, they’re unrealistic, but are they always so? In this first of two blog posts (based on this paper), I’m going to show that programs running on VMs often don’t follow the simple performance patterns that nearly all of us expected. Perhaps that’s why users aren’t as happy as VM developers think they should be?

A typical claim

Every so often there is a spate of blog posts along the lines of “Language XYZ is faster than C”. [1] Language XYZ varies, of course, whilst C is always… C. Love it or hate it [2], C is, overall, probably the fastest language there is. At the very least, it’s the bar by which all other languages measure themselves.

For example, a claim made about HotSpot (the standard Java VM, often called just “the JVM”, though there are other Java VMs) is that it makes Java programs as fast as C programs compiled with gcc -O2 [3]. At first glance this statement seems reasonable enough: gcc -O2 produces fast code, as, most people agree, does HotSpot. But what’s actually being compared here? A static (or “Ahead of Time”) compiler such as GCC compiles a program once; the resulting executable/binary can be run as many times as you want, even on another machine without GCC. In contrast, VMs such as HotSpot observe a program running and dynamically compile it into machine code. Dynamic compilers are thus subject to a warmup cost: it takes programs running on them a while to reach a steady state of peak performance. VM developers are very keen on the steady state of peak performance and would rather forget that warmup exists: VM performance is universally reported relative to the steady state of peak performance. Thus the claim at the beginning of this paragraph entirely ignores warmup costs.

A simple representation of this performance model can be seen in the following:

Let’s assume that we’re running a simple benchmark in a for loop: we show each in-process iteration (i.e. each iteration of the loop; the reason for this slightly long-winded name will become apparent soon) of the loop on the x-axis; and the y-axis shows how long each in-process iteration takes to run. When a program starts running in a VM, it’s executed by a profiling interpreter which is a simple, general, but slow implementation of a programming language. As the program executes, the interpreter performs simple profiling to help it determine which parts of the program are frequently executed (e.g. recording how often a function is called or a loop executed). After a little while, the “hot” parts of a program are identified, causing those parts to be compiled into machine code: it’s possible that execution will slow down temporarily while compilation occurs. Once compilation has completed, we reach nirvana: the “steady state” (i.e. it never changes) of “peak performance” (i.e. the program runs faster than it did before). The time from the beginning of the first in-process iteration to the point that the steady state of peak performance is reached is the “warmup” period. [4]
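
In code terms, the per-iteration timings plotted on the y-axis come from something like the following minimal sketch (the bench function here is a stand-in workload I’ve invented for illustration, not one of the benchmarks used in the experiment):

    import time

    def bench():
        # Stand-in workload for one in-process iteration of a benchmark
        # (purely illustrative; not from the paper).
        return sum(i * i for i in range(100000))

    def run_in_process_iterations(n):
        # Time each in-process iteration separately: these per-iteration
        # timings are what the plots later in this post show.
        timings = []
        for _ in range(n):
            before = time.monotonic()
            bench()
            timings.append(time.monotonic() - before)
        return timings

    if __name__ == "__main__":
        print(run_in_process_iterations(10))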

Of course, a real system will be somewhat noisier than the above suggests: we might expect occasional timing blips while garbage collection occurs; compilation might happen in multiple stages; and so on. However, it captures the basic notions well enough that I’ll use it as the basis for the rest of this blog post.

Measuring warmup

I’ve done a reasonable amount of work on, and with, VMs in recent years (as a couple of examples, consider storage strategies or language composition) which, inevitably, means I’ve spent quite a bit of time measuring performance. The first time that I put a bit of thought into benchmarking, I realised that benchmarking without considering warmup was unsatisfactory, although I didn’t manage to come up with a very sophisticated way of addressing this problem. I also remember being baffled at one or two of the results, and convinced that I’d done something wrong: why did CPython running the Mandelbrot benchmark have a running time of 134.419s and an incredibly wide confidence interval of ± 32.347s? I had no idea, so, in the best tradition of VM measurements, I buried my head in the sand and hoped that no-one would say anything. They didn’t.

The first approach I saw that really tried to do something sophisticated with VM measurement was the work of Kalibera and Jones. [5] The methodology they developed tries hard to work out if/when VM warmup has completed, so that accurate figures can be given for the steady state of peak performance. I can’t say enough good things about this work, which has never received the attention it deserved. There are a few possible reasons for this: the methodology is fairly labour intensive (humans have to look at performance graphs and feed information into a measurement algorithm, which does the rest); some of the maths involved is on the scary looking side; and for a long time there were no publicly available implementations. We (well, mostly Carl Friedrich Bolz and Edd Barrett) eventually put together an implementation that largely solves the last two problems, and we used the resulting methodology in a couple of papers (1 and 2). However, the first problem seemed insoluble: the approach requires quite a lot of a skilled human’s time.

In the process of using the Kalibera and Jones methodology, we noticed quite a lot of variation in the warmup time of different VMs and cases where VMs didn’t seem to warm up at all. This was surprising because pretty much every paper we’d read until that point had assumed – and, in many cases, explicitly stated – that warmup was a quick, consistent thing. On that basis, it seemed interesting to see how the warmup time of different VMs compared. In May 2015, I asked Edd if he’d knock together a quick experiment in this vein, estimating that it would take a couple of weeks. After a couple of weeks we duly had data to look at but, to put it mildly, it wasn’t what we had expected: it showed all sorts of odd effects. My first reaction was that if we showed this data to anyone else without checking it thoroughly, we’d be in danger of becoming a laughing stock. It was tempting to bury my head in the sand again, but this time it seemed like it would be worth digging deeper to see where we’d gone wrong.

We thus embarked on what we soon called the “warmup experiment”. We quickly realised that, fundamentally, we needed to test a simple scientific hypothesis: that VMs reach a steady state of peak performance. In other words, we wanted to find out whether the fundamental assumption on which previous VM benchmarking has relied holds true. This seems like it should be a simple thing to do, but we soon realised that there are all sorts of things that could interfere with our benchmarking — what are formally called “confounding variables”. When I’d previously benchmarked things, I was happy to run things on a new-ish Linux server installation, and report the results. [6] However, even a seemingly “bare bones” installation can still do many unexpected things. We thus continually iterated on the experiment, at every stage asking ourselves: “what if XYZ coloured our results? How can we stop, or at least reduce the chances of, this happening?”

Thinking in such an adversarial manner isn’t natural for most of us but it can be done. For example, early on we realised that if benchmarks run overnight there’s a good chance that cron is likely to run at least one background task [7]. Similarly, if you’re running sshd (which, on a Unix server, you almost certainly are), a botnet [8] can try using that to log in vast numbers of times: if you’ve ever been subject to such an attack, you’ll know that it can consume a noticeable amount of CPU. But there are many other potential problems. If one benchmark causes a machine to go into swap, subsequent benchmarks will be much slower, simply because some parts of the system will have to be paged back in from disk. Similarly, VMs can leave cache files around to speed up subsequent runs: if that happens, the first run might execute much more code than subsequent runs. After a while of doing this, we shared our initial attempts with other people, who suggested a few more things for us to take a look at, notably whether CPU throttling (due to overheating) was occurring.

After a couple of years of working on this, we had put together a pretty extensive list of potential problems and, in nearly all cases, we’d managed to do something to control them. Because there were so many of these, we wrote a new benchmark runner, Krun [9]. In some senses, Krun’s job is very simple: it just has to run benchmarks in a predictable way and record the results. However, there’s a lot of detail hiding just below the surface. I’m not going to go into every detail of Krun, because we’d be here all day (the paper has more details but even that elides some of the nitty-gritty details), but let me give you an idea of some of the things Krun does:

  • Turns off CPU turbo boost (which unpredictably interferes with the CPU’s running speed [10]).
  • Turns off the network card (so botnets and the like can’t cause the kernel, or userland, to do extra work).
  • Makes sure the CPU is running in full performance mode (and, on Linux, in tickless mode).
  • Shuts down daemons like smtpd and cron.
  • Reboots the machine before running the first process execution (which is our fancy way of saying “shut the VM down and start it again from scratch”) and after every subsequent process execution (so that if one process execution does get the machine into a nasty state (e.g. into swap), the machine’s state will be reset to something fairly predictable before the next process execution is run).
  • Before every process execution, the Unix account used to run benchmarks is deleted (including its home directory) and recreated (so cache files can’t persist between process executions).
  • Runs every process execution with the same stack / heap limits.
  • Before the first process execution, all the machine’s temperature sensors are read; subsequent process executions aren’t run until the temperature sensors are more-or-less back to those readings (because we have no idea how non-CPU components are affected by temperature; a rough sketch of this check follows the list).
  • Checks whether the CPU clocked down and reruns any benchmarks affected (this happens perhaps 0.5% of the time, so it’s rare, but it does happen).

And so on and so forth.
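
To make one of these concrete, here is a rough sketch of the temperature check mentioned above, assuming a Linux-style /sys/class/thermal interface; the paths, tolerance, and polling interval are illustrative, and Krun’s real implementation is more involved (and also copes with OpenBSD’s sensors):

    import glob
    import time

    TOLERANCE = 1000   # millidegrees Celsius; an illustrative threshold
    POLL_SECS = 60     # how long to wait between re-reading the sensors

    def read_temps():
        # Read every thermal zone the (Linux) kernel exposes.
        temps = {}
        for path in glob.glob("/sys/class/thermal/thermal_zone*/temp"):
            with open(path) as f:
                temps[path] = int(f.read().strip())
        return temps

    def wait_until_cool(baseline):
        # Block until every sensor is back within TOLERANCE of the
        # readings taken before the first process execution.
        while True:
            now = read_temps()
            if all(abs(now[p] - baseline[p]) <= TOLERANCE for p in baseline):
                return
            time.sleep(POLL_SECS)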

There are a couple of things about Krun which surprise people. First, Krun doesn’t use CPU pinning. We experimented extensively with this, but found that it had odd effects. For example, a seemingly good idea on a 4 core machine is to reserve 1 core for the OS and 3 cores for things being benchmarked. However, we found at least 1 VM which – on seemingly single-threaded benchmarks – behaved poorly with such a setup. We suspect this is because the VM asks the OS how many cores the machine has (receiving the answer “4”), and then uses that many compilation and/or GC threads; those 4 threads then get pinned to 3 cores, contending unpredictably with each other. [11]

Second, Krun does not disable ASLR (Address Space Layout Randomisation). This introduces obvious non-determinism into proceedings and is thus often frowned upon. Indeed, initially we turned ASLR off for this reason. However, I ended up being deeply influenced by Stabilizer, which convincingly shows that changing a program’s layout (most obviously through ASLR) means that programs no longer have a single performance point — rather, they have a set of performance points. In other words, it doesn’t make sense to say “this program runs at speed X” by picking one layout for the program: one has to have some sense of the spread of program speeds with different layouts, because real users will encounter those different layouts, each of which is a valid version of the program. Unfortunately, Stabilizer doesn’t build on modern systems, so we can’t be quite as statistically rigorous in randomising things like linking order. However, keeping ASLR on gives us an approximation of the effects of randomisation on the VM.

What to benchmark

There are two hard questions in program benchmarking: what to benchmark and how long to run things for. I’m going to tackle them in reverse order.

VM benchmarking generally involves running a program in a loop for a small number of in-process iterations: 5 is common; 30 is considered extremely large; and the largest I’ve ever heard of is 200. However, I knew from experience that some of the counters inside VMs that cause performance changes have thresholds of 1000 or 1024. To make sure that we could see if these counters had an impact on performance, we therefore needed to run benchmarks for more than 1024 in-process iterations. We quickly settled on 2000 in-process iterations as a reasonable number. We also wanted to see what level of variability there was between different process executions (i.e. stopping the VM and running it again from scratch). We had no idea what a good number was for this: we used 10 for a long time, before eventually settling on 30. To be clear, there was some thought behind the choice of 2000 in-process iterations, but 30 process executions was little more than a wild guess.
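
Put roughly, the overall shape of the experiment is then as follows. This is a sketch only: the run_vm helper is hypothetical, standing in for Krun starting a fresh VM process, and the real runner also reboots the machine between process executions:

    import subprocess

    IN_PROCESS_ITERS = 2000   # comfortably above the 1000/1024 thresholds of VM counters
    PROCESS_EXECS = 30        # fresh VM processes per (VM, benchmark) pair

    def run_vm(vm, benchmark, iters):
        # Hypothetical helper: start a fresh VM process which runs the
        # benchmark `iters` times and prints one timing per line.
        out = subprocess.run([vm, benchmark, str(iters)],
                             capture_output=True, check=True, text=True)
        return [float(line) for line in out.stdout.splitlines()]

    def run_experiment(vm, benchmark):
        # Each inner list holds one process execution's 2000 in-process
        # iteration timings.
        return [run_vm(vm, benchmark, IN_PROCESS_ITERS)
                for _ in range(PROCESS_EXECS)]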

Choosing what benchmarks to run is difficult, because one often hopes that benchmarks are representative of real programs. Different VM teams often disagree profoundly over benchmark selection because of this. Personally, I am sceptical that a small-ish set of programs (no matter how big the individual programs are) tells us that much about the complete set of possible programs. Fortunately for us, we had slightly different goals. We wanted to see how VMs perform on benchmarks that are well known, thus removing the “it’s just a weird program” retort from the debate. That then meant that we could use the well known Computer Language Benchmarks Game (CLBG) benchmarks. Although one can reasonably argue that these benchmarks are even less representative of general performance than other benchmarks, they have the virtue that every VM team out there has made sure that they run well on their VM.

We performed a few simple checks, and made several alterations, to the CLBG benchmarks. First, we wanted to make sure that – to the extent that a normal end user can tell – the benchmarks are deterministic, because then we could reasonably expect different process executions to behave similarly. We put prints at every if statement, while loop, and the like, and checked the results from multiple runs. In our case, we had to alter fasta so that the same random number was used for every in-process iteration. We also noticed that Java programs sometimes load classes in a different order in such a way that an end-user can observe. Fixing this in general is neither possible nor desirable, but we wanted to make sure our benchmark runner didn’t make things worse, so we performed some simple static games to avoid this issue.

We then dealt with a more subtle issue. Compilers can often notice that large chunks of benchmarks are computing precisely the same number over and over again, allowing them to replace such code with a constant. [12] This can quickly cause one to measure a much different thing than one expected. To avoid this, many benchmarks write intermediate results to stdout, which makes it harder for compilers to remove code. However, I/O calls are expensive (due to context switches and the like): one can quickly end up in a situation where one is measuring the number of I/O calls more than anything else. We therefore removed I/O from the core parts of all our benchmarks, replacing it with checksum calculations (with the same checksum across all language variants of a given benchmark) that we then (conditionally) write to stdout at the end of the benchmark. This doesn’t guarantee that compilers aren’t removing important parts of a benchmark, but it raises the bar substantially.
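
As a simplified illustration of that change: instead of printing an intermediate result on every in-process iteration, each iteration folds its result into a running checksum which is (conditionally) written to stdout once, at the end. The checksum function and do_work below are illustrative, not the ones used in the actual benchmarks:

    import sys

    def fold(checksum, value):
        # Illustrative checksum: it keeps every iteration's result "live",
        # so a compiler cannot simply discard the computation producing it.
        return (checksum * 31 + int(value)) % (2 ** 32)

    def do_work(i):
        # Placeholder for a benchmark's real per-iteration computation.
        return i * i

    def run(iters, print_checksum=False):
        checksum = 0
        for i in range(iters):
            checksum = fold(checksum, do_work(i))  # instead of print(do_work(i))
        if print_checksum:
            sys.stdout.write("%d\n" % checksum)    # a single write, at the end
        return checksum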

With that done, it was relatively simple to choose the systems to benchmark: we used a number of mainstream VMs (Graal 0.22; HHVM 3.19.1; TruffleRuby 20170502; HotSpot 8u121b13; LuaJIT 2.04; PyPy 5.7.1; and V8 5.8.283.32) and (as a base for comparison) GCC 4.9.4.

Results

At this point, I’ve hopefully convinced you that we designed a thoughtful, careful experiment. I could go into more detail, but it’s probably useful to look at some data. I’m going to use data from our v1.5 run (the data in the paper is from a slightly earlier run; the differences are minor). Let’s start with a simple example:

Starting at the top, the plot title tells us that this is the Fannkuch Redux benchmark, running on LuaJIT, on an OpenBSD Core i7 machine, and is the 14th of 30 process executions. Along the x-axis, we have the in-process iterations, and on the y-axis the time of each iteration. The inset helps us see the first n in-process iterations (in this case, 129) in a bit more detail. The first thing we can see is that this benchmark started slow (in-process iteration 1 took about 0.57233 seconds); quickly got somewhat faster for a while; and then, at about in-process iteration 190, reached the steady state of peak performance. This is almost exactly what VMs are supposed to do, albeit taking much longer to reach the steady state of peak performance than we might expect and, indeed, much longer than anyone had ever bothered benchmarking before. Nevertheless, this benchmark clearly warmed up.

The challenge then becomes working out where it warmed up. At first we had no idea how to work this out, other than using the Kalibera and Jones method — but we have 3,600 of the above plots per machine, which is far too many to look at manually. Eventually (thanks to help from Vince Knight), we were put in touch with Rebecca Killick, a statistician who specialises in changepoint analysis. Intuitively, changepoint analysis automatically tells us when there are statistically significant shifts in our data. The vertical red dashed line in the plot above at in-process iteration 190-ish is a changepoint: the horizontal dashed line between changepoints (there are implicit changepoints at in-process iterations 1 and 2000) is a changepoint segment. In other words, changepoint analysis allows us to split the data in the plot above up, produce statistics on it, and classify it. We therefore developed a classification algorithm, which says (simplifying a bit) “if the last changepoint segment is the fastest changepoint segment in the plot, then this is a warmup”. [13] You can see the classification “warmup” at the end of the title. As an additional visual aid, in-process iterations before the steady state are coloured in grey.
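
The real analysis uses the R changepoint package; purely to illustrate the idea, here is a sketch of the same kind of detection using the Python ruptures library’s PELT implementation (the penalty value is illustrative, and choosing it well matters a great deal in practice):

    import numpy as np
    import ruptures as rpt  # third-party library; a stand-in here for the
                            # R changepoint package used in the real analysis

    def changepoints(timings, penalty=3.0):
        # PELT picks the segmentation minimising within-segment cost plus
        # a per-changepoint penalty. The returned list contains the index
        # at which each segment ends; the final entry is len(timings).
        signal = np.asarray(timings, dtype=float)
        return rpt.Pelt(model="l2").fit(signal).predict(pen=penalty)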

The next type of plot we see regularly is the following:

This is the n-body benchmark running on PyPy on a Linux Core i7 machine. The first thing that you might notice is the big red blob around in-process iteration 1230: this is an outlier, an in-process iteration which is noticeably different in nature (i.e. much faster or slower) than its neighbours. It’s common in these sorts of analyses to ignore outliers if there’s good cause: in our case, we assume that they might be a garbage collection phase, or a context switch into the kernel, or something else that shouldn’t cloud our judgement about the benchmark. The second thing you might notice about this plot is that it only has a single changepoint segment. We classify such plots as “flat”. Because we’re giving the benefit of the doubt to VMs, we assume that warmup happened so quickly in such cases that we can’t observe it happening, meaning that the “flat” classification is morally equivalent to “warmup”.

Here’s another example of warmup, this time on V8:

The good news is that the benchmark did warm up and hit a steady state of peak performance, though the fact that it took just under 600 in-process iterations is surprising. It’s also interesting that there are several earlier changepoints, perhaps while various rounds of compilation were occurring.

We can also deal with more complex examples of warmup, this time on Graal:

The surprise here isn’t the number of outliers (perhaps they’re the garbage collector), but that it took over 900 in-process iterations to hit a steady state of peak performance. Still, the fact that it did so is surely a good thing.

The unexpected

At this point, things become a little more surprising. Consider this plot from HotSpot:

Notice that it started slow (you might need to look at the inset to see this), then got fast for 200 in-process iterations, and then became somewhat slower again. That’s not supposed to happen! It means that we need to adjust our classification algorithm: if the last changepoint segment in a plot isn’t the fastest, we classify it as “slowdown”.

In this case, the slowdown reduces execution speed by 5% which, to state the obvious, is fairly significant (in mature VMs, speed-ups of 5% often require months or years of work). But here’s why you should be really surprised. The benchmark in this plot is Richards: it’s one of the benchmarks that HotSpot has used to benchmark itself since before the VM was even called HotSpot. In other words, this VM is highly optimised for this benchmark — but no-one ever ran it long enough before to notice that performance gets much worse over time.

It turns out that slowdowns are common. Here’s an example from V8:

But there’s one final type of plot left — and it’s the most troubling of all. Here’s an example:

Notice here that the benchmark’s performance keeps shifting, never settling for long. Of course, perhaps it would if we ran it for even longer, but ultimately we have to set a cut-off somewhere. This now raises a tricky question: what should our criteria be for when a benchmark is said never to stabilise? Our choice was to say that if there are changepoints in the last 500 in-process iterations then the plot is classified as “no steady state”.

For clarity, “flat” and “warmup” are equally good; “slowdown” and “no steady state” are bad (in slightly different ways). The latter two are not supposed to happen: when they do, they invalidate nearly all previous benchmarking methodologies.
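
Pulling those rules together, the core of our classification algorithm looks roughly like the following sketch (heavily simplified; footnote 13 describes what the real algorithm adds):

    def classify(segment_means, changepoint_iters, n_iters=2000, cutoff=500):
        # segment_means: the mean time of each changepoint segment, in order.
        # changepoint_iters: the in-process iteration at which each detected
        # changepoint occurs (empty if the run is a single segment).
        if any(cp > n_iters - cutoff for cp in changepoint_iters):
            return "no steady state"   # still shifting in the last 500 iterations
        if len(segment_means) == 1:
            return "flat"              # no observable warmup at all
        if segment_means[-1] == min(segment_means):
            return "warmup"            # the final segment is the fastest
        return "slowdown"              # the final segment is slower than an earlier one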

Edge cases

Changepoint analysis turned out to be the vital thing we needed to make this work useful. When we first tried using changepoint analysis on our data, we sometimes thought it had misclassified things, but in reality it’s just more rigorous (and far less prone to boredom) than us humans. However, there are always going to be some edge cases where the classification could go either way. Here’s my favourite example of this:

Can you see that the timing data keeps shifting, at predictable intervals, above and below the changepoint segment? The changes are small enough that changepoint analysis doesn’t consider these shifts as changepoints, so the plot is classified “warmup”. Interestingly, in some older versions of HotSpot, this benchmark displays a more pronounced square-wave pattern (i.e. the highs are higher and the lows lower); each peak and trough becomes a changepoint segment; and the plot as a whole becomes “no steady state”. In my opinion, for this example, either classification would be acceptable: but, in general, there are very few edge cases such as this.

Inconsistency

Earlier I went into some detail about how we ensured that the benchmarks we’re running are deterministic. We expected that this would lead to consistent performance profiles for a given (benchmark, VM, machine) triple (in other words, we expected consistent performance from process executions of the same benchmark on the same VM on the same machine). We were wrong. Consider these two plots, both of the same benchmark, on the same VM, on the same physical machine:

Notice that one process execution warms up, and the other is a no steady state. To add insult to injury, the no steady state process execution’s performance is often 5% worse than the warmup process execution’s performance. This sort of inconsistency happens frequently.

Perhaps less surprisingly, there’s often substantial inconsistency across operating systems. For example, consider the binary trees benchmark on Linux and OpenBSD in C:

There are two things to note here. First, the performance of this benchmark on Linux resembles a bouncing ball. This, to belabour a point, isn’t a VM: it’s a C program! We suspect (but somehow have never quite managed to get around to investigating) that this is probably the malloc used on Debian: it’s probably incrementally growing, then incrementally compacting, its memory structures. This raises a subtle point which I hadn’t considered when we first started this work: even C programs have a runtime (libc), which can have unpredictable effects on performance. Unsurprisingly, C isn’t as prone to odd effects as the VMs in our experiment, but it’s more prone than I would have expected. Second, notice the substantial difference in performance between the two systems. Admittedly, in this case we’re comparing a Xeon Linux machine against a Core i7 OpenBSD machine, but still, I would be happy to bet money that Linux will always beat OpenBSD in such a benchmark. Why? Simply because OpenBSD’s malloc prioritises security over performance, and jumps through several hoops that almost certainly slow things down substantially.

Benchmarks in aggregate

Looking at the plots above is very useful, because it gives us a way of understanding what unexpected performance behaviour occurs. However, our experiment has thousands of these plots, so it’s hard to put together any sort of bigger picture from them. Fortunately, changepoint analysis allows us to automatically put together various summaries of the data (with Sarah Mount doing most of the hard work to generate the astonishing LaTeX tables below). Here’s an example of the summary data from one of our machines:

There’s a bit too much data there for us to easily read it in this setting, so let’s zoom in on the bottom right part of that table:

This is the data from the spectralnorm benchmark. There’s still quite a lot of detail here, so let’s look at a few rows to unpick what’s going on.

I’ll start with the HotSpot row. The squiggle in the “Class.” (i.e. “classification”) column means that all 30 of spectralnorm’s process executions warmed up on HotSpot. The “Steady iter (#)” column tells us that the median number of in-process iterations for the steady state to be reached was 7 (the numbers in brackets underneath are inter-quartile ranges; think of them as being a bit like confidence intervals, giving you an indication of the spread of data). The histogram to the right of that number shows you the spread of data, with the bucket the median value is part of highlighted in red. In this case, nearly all the data is in the median bucket, with a couple of outliers. The “Steady iter (s)” column tells us that the median wall-clock time to reach the steady state was 1.91 seconds. The “Steady perf (s)” column is close to the “traditional” measure of VM performance [14], telling us that the performance of the benchmark in the steady state was 0.31472 seconds per in-process iteration. Notice though that the histogram tells us that there are a number of process executions where the steady state performance is worse than the median, and the very wide 99% confidence interval (±0.169143s) confirms this. This goes some way to justifying all the data you’ve just waded through: yes, the benchmark consistently warms up on HotSpot (which is good), it warms up fairly fast (which is also good), but the steady state performance can vary hugely between process executions (which is bad). Depending on what aspect of performance you care about, you may find different parts of this information more or less important than others.
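
For the curious, here is a rough sketch of how a steady-state number and a bootstrapped 99% confidence interval can be computed from the post-warmup timings. This is a simple percentile bootstrap over pooled timings, not the more involved calculation we actually use (see footnote 14):

    import random
    import statistics

    def steady_perf(steady_timings, boots=10000, alpha=0.01):
        # steady_timings: per-iteration times taken after the steady state
        # has been reached, pooled across process executions.
        medians = []
        for _ in range(boots):
            sample = random.choices(steady_timings, k=len(steady_timings))
            medians.append(statistics.median(sample))
        medians.sort()
        lo = medians[int(boots * (alpha / 2))]
        hi = medians[int(boots * (1 - alpha / 2)) - 1]
        return statistics.median(steady_timings), (lo, hi)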

The “—” classification for LuaJIT means that all 30 of its process executions were flat. By definition that means it warmed up at in-process iteration 1 and took 0 seconds to do so: thus the steady iter (#) and (s) columns are empty. The final “pure” classification can be seen for Graal, all 30 of whose process executions slowed down (note that, though this doesn’t have a steady state of peak performance, it does still have a steady state, hence our ability to report such numbers here).

The “═” classification for PyPy is “good inconsistent”: 27 of the 30 process executions were flat, and 3 were warmups. Since we consider these two classifications morally equivalent, “good inconsistent” is morally equivalent to all 30 process executions being flat or warmup.

In contrast, the “═ with a cross through it” classification for TruffleRuby is “bad inconsistent” i.e. at least 1 process execution was a slowdown or a no steady state. In this case, 25 of its process executions slowed down and 5 were no steady state.

The bigger picture

Summarising the results gives a clearer picture of how VM performance can surprise:

To keep things simple, I’ll only consider the Linux4790 column above (the OpenBSD machine runs a subset of benchmarks, so requires some care when interpreting). It’s easiest to start with the bottom half of the table first. The way to read this is that 22.0% of all process executions (i.e. not worrying about which VM or which benchmark was involved) were classified as flat and 48.3% as warmup; in other words, just over two thirds of process executions warmed up in the expected fashion. 20.1% of process executions slowed down and 9.6% were no steady state.

Those figures are pretty depressing: almost a third of process executions aren’t doing what we expected! But life gets quite a lot worse in the top half of the table. The way to read this is that 8.9% of (VM, benchmark) pairs were always classified as flat, 20.0% as warmup, and 11.1% as good inconsistent. In other words, only 40.0% of (VM, benchmark) pairs consistently did the expected thing for all 30 of their process executions. Surprisingly, 4.4% consistently slowed down and 4.4% were consistently no steady state. But 51.1% of all (VM, benchmark) pairs were bad inconsistent: that is, they contain at least one slowdown or no steady state process execution. At the risk of stating the obvious, that means that just over half of the time VMs are not reliably optimising benchmarks!

Things get even worse if you consider all machines together: only 12.5% of (VM, benchmark, machine) triples perform as expected.

Conclusions

When we set out to look at how long VMs take to warm up, we didn’t expect to discover that they often don’t warm up. But, alas, the evidence that they frequently don’t warm up is hard to argue with. Of course, in some cases the difference to performance is small enough that one can live with it, but it’s often bad enough to be a problem. We fairly frequently see performance get 5% or more worse over time in a single process execution. 5% might not sound like much, but it’s a huge figure when you consider that many VM optimisations aim to speed things up by 1% at most. It means that many optimisations that VM developers have slaved away on may have been incorrectly judged to speed things up or slow things down, because the optimisation is well within the variance that VMs exhibit. The variance across process executions is often even bigger, which is even more troubling: it means that it’s not safe to run a program once, measure its performance, and assume you’ll always get that level of performance in the future.

Before the doom and gloom gets too much, I’m going to end on an optimistic note. While I think our work raises big questions about the efficacy of current VMs, I think it also offers a note of salvation. People are often very happy with VM performance, even though it probably isn’t working as well as it could. In particular, my guess is that large programs are subject to many small performance problems which, in conjunction with each other, are dismissed as noise. If that is the case, then, if we can produce VMs that don’t suffer from the problems seen above, we should be able to run programs even faster in the future.

In the second part of this blog post, I’m going to: look at whether there are any simple answers to the performance problems seen above; give some possible explanations as to how we might have got to this point; and suggest some simple pointers to make things a little less bad in the future. In the meantime, those of you who want more details can read the main research paper.

Acknowledgements: Edd Barrett, Carl Friedrich Bolz, Rebecca Killick, and Sarah Mount did most of the work I’ve reported on above. Vince Knight helped put us in touch with Rebecca Killick: he really deserved more credit on this paper than he received, but he gracefully refused my attempts to add him in! The work of Tomas Kalibera and Richard Jones was a huge influence for all of us. Mario Wolczko gave crucial feedback and encouragement at important points. We also received vital feedback, and questions, from a number of other people. This research was funded by the EPSRC Cooler (EP/K01790X/1) grant and Lecture (EP/L02344X/1) fellowship, and gifts from Cloudflare and Oracle Labs.

Footnotes

[1]

Unfortunately, many such claims are either wrong (e.g. the thing being measured isn’t the thing that’s claimed to be measured) or misleading (e.g. making grand comments that derive from generalising a single workload / benchmark).

[2]

I used to love C, but these days I’m more ambivalent, because I no longer think I can remember all the ways that I may trigger undefined behaviour.

[3]

p11 of Cliff Click’s slides “A JVM does that?”. To be clear, I’m not criticising Cliff who I not only know and like, but whose contributions to our field are little short of jaw-dropping.

[4]

The terms “warmup” and “startup” are often used interchangeably, whereas they refer to two different things: startup is the time for the VM to begin executing the first bit of the user’s code; warmup is the time for the VM to reach the steady state of peak performance.

[5]

There are two relevant Kalibera and Jones papers: a complete but dense description; and a higher-level overview. I definitely recommend starting with the high-level overview (even though it was published later).

[6]

In my defence, I can only say that I’ve seen a lot of people benchmark even more incompetently than I’ve described here. My personal favourite was a paper that benchmarked on a laptop (laptops are prone to overheating, and thus likely to throttle the CPU), with turbo boost left on (making overheating even more likely), running a full user desktop (how much CPU is your web browser consuming right now? One core of mine is running at 40% while a website animates something). The resulting numbers might have been correct, but, if they were, that would surely be more due to luck than judgement.

That said, the most common form of benchmarking (which I’ve certainly been very guilty of in the past) is far worse: it’s best summarised as “run it until I like it”. In this form of benchmarking, if you run a benchmark and get a number you don’t like (too fast or too slow), you run things again and again until you get an answer that matches your expectations / prejudices. In practice, the non-determinism of our hardware and software means that one can often find an appealing answer. Of course, the implied variation between runs should be an alarm bell!

[7]

On OpenBSD, for example, by default root runs a cron job every day (/etc/daily) and every week (/etc/weekly). The former does several security related tasks which, one might argue, aren’t likely to impact anything else; however, the latter rebuilds the locate database (which is very IO intensive, and which can thus cause very unpredictable side effects). There is also a monthly cron job but, in practice, that does very little by default (I had entirely forgotten that it existed until I checked my facts for this footnote).

[8]

It’s sometimes thought that botnets can only attack you from external networks. However, all it takes is one machine to become compromised (e.g. by someone running something unfortunate on their machine), and then a botnet can exist on an internal network. I’ve heard of this happening at least twice at places that I know well, so it’s a real threat.

[9]

The “K” in Krun is a tip of the hat to the Kalibera and Jones work. For those who care, most of us pronounce it as one syllable (i.e. “Krun” not “K-run”).

[10]

Turbo boost is less predictable than most of us realise. For example, we had some machines with a stated base frequency of 3.6GHz and a turbo boost of 4GHz. If you ran a process which utilised 1 core fully, you could reliably measure it as running at almost exactly 4GHz; fully utilising 2 and 3 cores gave the same result; but as soon as a fourth core was fully utilised measurements fell to almost exactly 3.8GHz. If memory serves this mode is undocumented in Intel’s manuals, but I saw this behaviour on 2 separate machines.

[11]

At the time we experimented with CPU pinning on Linux, there were two mechanisms available: one was buggy and unreliable; and the other (which, of course, we tried second) worked OK. I don’t know if this is still the case, but it certainly makes me wonder if everyone I’ve heard of using CPU pinning is getting the benchmarking environment they expect.

[12]

I’m talking here about compilers optimising away code as part of the compiler’s standard optimisations. There’s also a less pleasant case where compilers are customised specifically for one benchmark (e.g. explicitly recognising a benchmark’s source code, and then changing the compilation strategy). One of the better known examples of this was Sun’s special treatment of the 179.art benchmark in SPEC. It’s now difficult to find a concise write-up of this, but a couple of articles provide reasonable pointers.

[13]

The real algorithm has two main additional features. First, it treats changepoint segments which differ only by a very small amount of time as being equivalent. Second, two (or more) changepoint segments can be considered equivalent even if they are separated by a non-equivalent changepoint segment.

[14]

The details of how we calculate steady performance are quite a lot more involved than is traditional: influenced by Kalibera and Jones, we use bootstrapping and changepoint segments to produce a slightly more accurate number than most traditional methodologies.
