Why Aren’t More Users More Happy With Our VMs? Part 1
September 5 2018
Programming language performance is something that nearly everyone cares about
at some point, whether they realise it or not. If you’re a programmer, you
sometimes have to make your programs run fast; if you’re a user, you might be
frustrated at a program that runs slowly, without knowing that it’s caused by a
poor programming language implementation. Those of us who work on, or around,
programming language Virtual Machines (VMs) tell a good story about performance,
but a surprising number of users seem unhappy with the performance of their
programs. Sometimes, yes, they’re unrealistic, but are they always so? In this
first blog post (based on
this paper)
of two, I’m going to show that programs running on VMs often don’t
follow the simple performance patterns that nearly all of us expected.
Perhaps that’s why users aren’t as
happy as VM developers think they should be?
A typical claim
Every so often there is a spate of blog posts along the lines of “Language XYZ is
faster than C”. [1] Language XYZ varies, of course, whilst C is always... C. Love it
or hate it [2], C is, overall, probably the fastest
language there is. At the very least, it’s the bar by which all other languages
measure themselves.
For example, a claim made about HotSpot (the standard Java VM, often called
just “the JVM”, though there are other Java VMs) is that it makes Java
programs as fast as C programs compiled with gcc -O2 [3]. At first glance this statement seems reasonable enough:
gcc -O2 produces fast code, as, most people agree, does HotSpot.
But what’s actually being compared here? A static (or
“Ahead of Time”) compiler such as GCC compiles a program once; the resulting
executable/binary can be run as many times as you want, even on another machine
without GCC. In contrast, VMs such as HotSpot observe a program running and
dynamically compile it into machine code. Dynamic compilers are thus subject to
a warmup cost: it takes programs running on them a while to reach a
steady state of peak performance. VM developers are very keen on the
steady state of peak performance and would rather forget that warmup exists: VM
performance is universally reported relative to the steady state of peak
performance. Thus the claim at the beginning of this paragraph entirely
ignores warmup costs.
A simple representation of this performance model can be seen in the following:
Let’s assume that we’re running a simple benchmark in a for loop: we show
each in-process iteration (i.e. each iteration of the loop; the reason
for this slightly long-winded name will become apparent soon) of the loop on the
x-axis; and the y-axis shows how long
each in-process iteration takes to run. When a program starts running in a VM, it’s executed by a
profiling interpreter which is a simple, general, but slow
implementation of a programming language. As the program executes, the
interpreter performs simple profiling to help it determine which parts of the program are
frequently executed (e.g. recording how often a function is called or a loop
executed). After a little while, the “hot” parts of a program are identified,
causing those parts to be compiled into machine code: it’s possible
that execution will slow down temporarily while compilation occurs. Once
compilation has completed, we reach nirvana: the “steady state” (i.e. it never
changes) of “peak performance” (i.e. the program runs faster than it did
before). The time from the beginning of the first in-process iteration to the
point that the steady state of peak performance is reached is the “warmup” period.
[4]
Of course, a real system will be somewhat noisier than the above suggests: we
might expect occasional timing blips while garbage collection occurs;
compilation might happen in multiple stages; and so on. However, it captures the
basic notions well enough that I’ll use it as the basis for the rest of this
blog post.
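To make the terminology concrete, here is a minimal, illustrative sketch in Python of the measurement loop implied by this model (this is not any real harness; `bench` is a placeholder for a benchmark’s workhorse function):

```python
import time

def run_in_process_iterations(bench, iters):
    """Time each in-process iteration of `bench` within a single VM process.

    `bench` is a placeholder for a benchmark's workhorse function; `iters` is
    the number of in-process iterations (i.e. iterations of the outer loop).
    """
    timings = []
    for _ in range(iters):
        start = time.monotonic()       # monotonic wall-clock time
        bench()
        timings.append(time.monotonic() - start)
    return timings                     # one time, in seconds, per iteration
```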
Measuring warmup
I’ve done a reasonable amount of work on, and with, VMs in recent years (as a
couple of examples, consider storage
strategies or language
composition) which, inevitably, means I’ve spent quite a bit of time
measuring performance. The
first time that I put a bit of thought into benchmarking, I realised that benchmarking without considering
warmup was unsatisfactory, although I didn’t manage to come up with a very
sophisticated way of addressing this problem. I also remember being baffled at
one or two of the results, and convinced that I’d done something wrong: why did
CPython running the Mandelbrot benchmark have a running time of 134.419s and an incredibly
wide confidence interval of ± 32.347s? I had no idea, so, in the best tradition
of VM measurements, I buried my head in the sand and hoped that no-one
would say anything. They didn’t.
The first approach I saw that really tried to do something sophisticated with
VM measurement was the work of Kalibera and Jones. [5] The methodology they developed tries hard
to work out if/when VM warmup has
completed, so that accurate figures can be given for the steady
state of peak performance. I can’t say enough good things about this
work, which has never received the attention it deserved. There are a
few possible reasons for this: the methodology is fairly labour
intensive (humans have to look at performance graphs and feed
information into a measurement algorithm, which does the rest); some
of the maths involved is on the scary looking side; and for a long time there were no
publicly available implementations. We (well, mostly Carl Friedrich
Bolz and Edd Barrett) eventually put together an
implementation that largely solves the last two problems, and we
used the resulting methodology in a couple of papers
(1
and
2).
However, the first problem seemed insoluble: the approach requires
quite a lot of a skilled human’s time.
In the process of using the Kalibera and Jones methodology, we noticed quite
a lot of variation in the warmup time of different VMs and cases where VMs
didn’t seem to warm up at all. This was surprising because pretty much every
paper we’d read until that point had assumed – and, in many cases,
explicitly stated – that warmup was a quick, consistent thing. On
that basis, it seemed interesting to see how the warmup time
of different VMs compared. In May 2015, I asked Edd if he’d knock together a quick experiment in this
vein, estimating that it would take a couple of weeks. After a couple of weeks
we duly had data to look at but, to put it mildly, it wasn’t what we had
expected: it showed all sorts of odd effects. My first reaction was that if we
showed this data to anyone else without checking it thoroughly, we’d be
in danger of becoming a laughing stock. It was tempting to bury my head in the
sand again, but this time it seemed like it would be worth digging deeper to see
where we’d gone wrong.
We thus embarked on what we soon called the “warmup experiment”. We
quickly realised that, fundamentally, we needed to test a simple scientific hypothesis:
that VMs reach a steady state of peak performance. In other words, we wanted to
find out whether the fundamental assumption on which previous VM benchmarking
has relied holds true. This seems like it should be a simple thing to do, but we
soon realised that there are all sorts of things that could interfere with our
benchmarking — what are formally called “confounding variables”. When I’d
previously benchmarked things, I was happy to run things on a new-ish Linux
server installation, and report the results. [6]
However, even a seemingly “bare bones” installation can still do many unexpected
things. We thus continually iterated on the
experiment, at every stage asking ourselves: “what if XYZ coloured
our results? How can we stop, or at least reduce the chances of, this
happening?”
Thinking in such an adversarial manner isn't natural for most of us but
it can be done. For example,
early on we realised that if benchmarks run overnight there’s a good chance that
cron will run at least one background task [7]. Similarly, if you’re running sshd (which, on a Unix
server, you almost certainly are), a botnet [8] can try
using that to log in vast numbers of times: if you’ve ever been subject to such an
attack, you’ll know that it can consume a noticeable amount of CPU. But there are
many other potential problems. If one benchmark causes a machine to
go into swap, subsequent benchmarks will be much slower, simply because
some parts of the system will have to be paged back in from disk. Similarly,
VMs can leave cache files around to speed up subsequent runs: if that happens,
the first run might execute much more code than subsequent runs. After
a while of doing this, we shared our initial attempts with other people, who
suggested a few more things for us to take a look at, notably whether CPU
throttling (due to overheating) was occurring.
After a couple of years of working on this, we had put
together a pretty extensive list of potential problems and, in nearly all cases,
we’d managed to do something to control them. Because there were so many of
these, we wrote a new benchmark runner called Krun [9]. In some senses, Krun’s job is very simple: it just has
to run benchmarks in a predictable way and record the results. However, there’s
a lot of detail hiding just below the surface. I’m not going to go into every
detail of Krun, because we’d be here all day
(the
paper has more details but even that elides some of the nitty-gritties), but
let me give you an idea of some of the things Krun does:
Turns off CPU turbo boost (which unpredictably interferes with the CPU’s
running speed [10]).
Turns off the network card (so botnets and the like can’t cause the kernel,
or userland, to do extra work).
Makes sure the CPU is running in full performance mode (and, on Linux, in
tickless mode).
Shuts down daemons like smtpd and cron.
Reboots the machine before running the first process execution (which
is our fancy way of saying "shut the VM down and start it again from scratch")
and after every subsequent process execution (so that if
one process execution does get the machine into a nasty state (e.g. into
swap), the machine’s state will be reset to something fairly predictable
before the next process execution is run).
Before every process execution, the Unix account used to run benchmarks is
deleted (including its home directory) and recreated (so cache files
can’t persist between process executions).
Runs every process execution with the same stack / heap limits.
Before the first process execution, all the machine's temperature sensors
are read; subsequent process executions aren’t run until the temperature
sensors are more-or-less back to those readings (because we have no idea
how non-CPU components are affected by temperature).
Checks whether the CPU clocked down and reruns any benchmarks affected (this
happens perhaps 0.5% of the time, so it’s rare, but it does happen).
And so on and so forth.
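To give a flavour of what this means in practice, here is a tiny, illustrative Python sketch of the sort of pre-flight checks involved. This is not Krun’s actual code; the sysfs paths assume Linux with the intel_pstate driver and will differ on other setups:

```python
from pathlib import Path

def check_benchmarking_environment():
    """Refuse to run benchmarks if the machine looks unpredictable.

    A rough approximation of a couple of Krun-style checks; the sysfs paths
    below are specific to Linux with the intel_pstate driver.
    """
    problems = []

    # Turbo boost should be off ("1" means "no turbo" under intel_pstate).
    no_turbo = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")
    if no_turbo.exists() and no_turbo.read_text().strip() != "1":
        problems.append("turbo boost is enabled")

    # Every core should be using the 'performance' frequency governor.
    for gov in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor"):
        if gov.read_text().strip() != "performance":
            problems.append(f"{gov} is not set to 'performance'")

    if problems:
        raise SystemExit("refusing to benchmark: " + "; ".join(problems))
```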
There are a couple of things about Krun which surprise people. First, Krun
doesn’t use CPU pinning. We experimented
extensively with this, but found that it had odd effects. For example, a
seemingly good idea on a 4 core machine is to reserve 1 core for the OS and 3
cores for things being benchmarked. However, we found at least 1 VM which
– on seemingly single-threaded benchmarks – behaved poorly with
such a setup. We suspect this is because the VM asks the OS how many cores the
machine has (receiving the answer “4”), and then uses that many compilation
and/or GC threads; those 4 threads then get pinned to 3 cores, contending
unpredictably with each other. [11]
Second, Krun
does not disable ASLR
(Address Space Layout Randomisation). This introduces obvious non-determinism
into proceedings and is thus often frowned upon. Indeed, initially we turned
ASLR off for this reason. However, I ended up being deeply influenced by Stabilizer, which
convincingly shows that changing a program’s layout (most obviously through ASLR)
means that programs no longer have a single performance point — rather,
they have a set of performance points. In other words, it doesn’t make sense to say “this
program runs at speed X” by picking one layout for the program: one has
to have some sense of the spread of program speeds with different layouts,
because real users will encounter those different layouts, each of which
is a valid version of the program. Unfortunately,
Stabilizer doesn’t build on modern systems, so we can’t be quite as
statistically rigorous in randomising things like linking order. However,
keeping ASLR on gives us an approximation of the effects of randomisation on the
VM.
What to benchmark
There are two hard questions in program benchmarking: what to benchmark and how
long to run things for. I’m going to tackle them in reverse order.
VM benchmarking generally involves running a program in a loop
for a small number of in-process iterations: 5 is common; 30 is considered extremely
large; and the largest I’ve ever heard of is 200. However, I knew from
experience that some of the counters inside VMs that cause performance
changes have thresholds of 1000 or 1024. To make sure that we could see if these
counters had an impact on performance, we therefore needed to run benchmarks
for more than 1024 in-process iterations. We
quickly settled on 2000 in-process iterations as a reasonable number.
We also wanted to see what level of variability there was between different
process executions (i.e. stopping the VM and running it again from scratch). We
had no idea what a good number was for this: we used 10 for a long time, before
eventually settling on 30. To be clear, there was some thought behind the choice
of 2000 in-process iterations, but 30 process executions was little more than a
wild guess.
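To make the two levels of repetition concrete, here is a rough Python sketch of the experiment’s overall shape: each process execution starts a brand new VM process, which then runs the in-process iteration loop itself. The VM command line and the JSON output format are placeholders, not Krun’s actual interface:

```python
import json
import subprocess

PROCESS_EXECUTIONS = 30    # a fresh VM process per process execution
IN_PROCESS_ITERS = 2000    # in-process iterations inside each process

def run_experiment(vm_cmd, benchmark_file):
    """Run one (VM, benchmark) pair; `vm_cmd` and `benchmark_file` are placeholders."""
    all_timings = []
    for _ in range(PROCESS_EXECUTIONS):
        # A brand new VM process each time, so no JIT or GC state carries over.
        out = subprocess.run(
            vm_cmd + [benchmark_file, str(IN_PROCESS_ITERS)],
            capture_output=True, text=True, check=True,
        ).stdout
        # Assumes the in-process harness prints its per-iteration timings as JSON.
        all_timings.append(json.loads(out))
    return all_timings     # 30 lists, each with 2000 per-iteration times
```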
Choosing what benchmarks to run is difficult, because one often
hopes that benchmarks are representative of real programs. Different VM
teams often disagree profoundly over benchmark selection because of this.
Personally, I am sceptical that a small-ish set of programs (no matter how big
the individual programs are) tells us that much about the complete set of
possible programs. Fortunately for us, we had slightly different goals. We
wanted to see how VMs perform on benchmarks that are well known, thus removing
the “it’s just a weird program” retort from the debate. That then meant that we
could use the well known
Computer Language Benchmarks Game (CLBG)
benchmarks. Although one can reasonably argue that these benchmarks are even
less representative of general performance than other benchmarks, they have the virtue that every
VM team out there has made sure that they run well on their VM.
We performed a few simple checks, and made several alterations, to the CLBG
benchmarks. First, we wanted to make sure that – to the extent that a
normal end user can tell – the benchmarks are deterministic, because then
we could reasonably expect different process executions to behave similarly. We
put prints at every if statement, while
loop, and the like, and checked the results from multiple runs. In our case, we
had to alter fasta so that the same random number was used for
every in-process iteration. We also noticed that Java programs
sometimes load classes in a different order, in a way that an end user can
observe. Fixing this in general is neither possible nor desirable, but we wanted
to make sure our benchmark runner didn’t make things worse, so we performed some
simple static games to avoid this issue.
We then dealt with a more subtle issue. Compilers can often
notice that large chunks of benchmarks are computing precisely the same number over and
over again, allowing them to replace such code with a constant. [12] This can quickly cause one to measure something
quite different from what one expected. To avoid this,
many benchmarks write intermediate results to stdout, which makes it harder
for compilers to remove code. However, I/O calls are expensive (due to context
switches and the like): one can quickly end
up in a situation where one is measuring the number of I/O calls more than
anything else. We therefore removed I/O from the core parts of all our
benchmarks, replacing it with checksum calculations (with the same checksum
across all language variants of a given benchmark) that we then (conditionally)
write to stdout at the end of the benchmark. This doesn’t guarantee
that compilers aren’t removing important parts of a benchmark, but it raises the
bar substantially.
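As an illustration of the idiom (a made-up Python example, not one of the actual CLBG benchmarks), the inner loop folds its intermediate results into a checksum, with a single conditional write at the very end:

```python
def one_iteration(n):
    """One in-process iteration: do the work, fold results into a checksum."""
    checksum = 0
    for i in range(1, n + 1):
        result = (i * i) % 2147483647      # stand-in for the benchmark's real work
        checksum = (checksum + result) % 2147483647
    return checksum

def run(iters, n, expected):
    checksums = [one_iteration(n) for _ in range(iters)]
    # A single, conditional, I/O call at the very end; `expected` is the same
    # checksum value across every language variant of the benchmark.
    if checksums[-1] != expected:
        print("checksum mismatch:", checksums[-1])
```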
With that done, it was relatively simple to choose the systems to benchmark:
we used a number of mainstream VMs (Graal 0.22; HHVM 3.19.1; TruffleRuby
20170502; HotSpot 8u121b13; LuaJIT 2.0.4; PyPy 5.7.1; and V8 5.8.283.32) and (as a
base for comparison) GCC 4.9.4.
Results
At this point, I’ve hopefully convinced you that we designed a thoughtful,
careful experiment. I could go into more detail, but it’s probably useful to
look at some data. I’m going to use data from our v1.5 run (the data in the
paper is from a slightly earlier run; the differences are minor). Let’s start
with a simple example:
Starting at the top, the plot title tells us that this is the Fannkuch
Redux benchmark, running on LuaJIT, on an OpenBSD Core i7 machine, and
is the 14th of 30 process executions. Along the x-axis, we have the
in-process iterations, and on the y-axis the time of each iteration.
The inset helps us see the first n in-process iterations (in this
case, 129) in a bit more detail. The first thing we can see is that this
benchmark started slow (in-process iteration 1 took about 0.57233 seconds);
quickly got somewhat faster for a while; and then, at about in-process iteration
190, reached the steady state of peak performance. This is almost exactly what
VMs are supposed to do, although it took much longer to reach the steady state of peak
performance than we might expect and, indeed, much longer than anyone had
ever bothered benchmarking before. Nevertheless, this benchmark clearly warmed up.
The challenge then becomes working out where it warmed up. At first
we had no idea how to work this out, other than using the Kalibera and Jones
method — but we have 3,600 of the above plots per machine, which is far
too many to look at manually. Eventually (thanks to help from Vince Knight), we were put in touch with Rebecca Killick, a
statistician who specialises in changepoint analysis. Intuitively,
changepoint analysis automatically tells us when there are statistically
significant shifts in our data. The vertical red dashed line in the
plot above at in-process iteration 190-ish is a changepoint: the
horizontal dashed line between changepoints (there are implicit changepoints at
in-process iterations 1 and 2000) is a changepoint segment. In other
words, changepoint analysis allows us to split the data in the plot above up,
produce statistics on it, and classify it. We therefore developed a
classification algorithm, which says (simplifying a bit) “if the last
changepoint segment is the fastest changepoint segment in the plot, then this is
a warmup”. [13] You can see the classification “warmup” at the end of the title.
As an additional visual aid, in-process iterations before the steady state are
coloured in grey.
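If you want to play with this kind of analysis yourself, a rough Python approximation using the third-party ruptures changepoint library (not the pipeline we actually used) looks something like this:

```python
import numpy as np
import ruptures as rpt    # third-party changepoint detection library

def changepoint_segments(timings, penalty=15.0):
    """Split per-iteration timings into segments with stable behaviour.

    Returns a list of (start, end, mean_seconds) tuples. `penalty` controls how
    readily new changepoints are introduced and needs tuning for real data.
    """
    signal = np.asarray(timings, dtype=float)
    ends = rpt.Pelt(model="l2", min_size=5).fit(signal).predict(pen=penalty)
    segments, start = [], 0
    for end in ends:                  # `ends` includes len(signal) as its last entry
        segments.append((start, end, float(signal[start:end].mean())))
        start = end
    return segments
```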
The next type of plot we see regularly is the following:
This is the n-body benchmark running on PyPy on a Linux Core i7
machine. The first thing that you might notice is the big red blob around
in-process iteration 1230: this is an outlier, an in-process
iteration which is noticeably different in nature (i.e. much faster or slower)
than its neighbours. It’s common in these sorts of analyses to ignore outliers
if there’s good cause: in our case, we assume that they might be a garbage
collection phase, or a context switch into the kernel, or something else that
shouldn’t cloud our judgement about the benchmark. The second thing you might
notice about this plot is that it only has a single changepoint segment. We
classify such plots as “flat”. Because we’re giving the benefit of the doubt to
VMs, we assume that warmup happened so quickly in such cases that we can’t
observe it happening, meaning that the “flat” classification is morally
equivalent to “warmup”.
Here’s another example of warmup, this time on V8:
The good news is that the benchmark did warm up and hit a steady state of peak
performance, though the fact that it took just under 600 in-process iterations
is surprising. It’s also interesting that there are several earlier
changepoints, perhaps while various rounds of compilation were occurring.
We can also deal with more complex examples of warmup, this time on Graal:
The surprise here isn’t the number of outliers (perhaps they’re the garbage
collector), but that it took over 900 in-process iterations to hit a steady
state of peak performance. Still, the fact that it did so is surely a good
thing.
The unexpected
At this point, things become a little more surprising. Consider this plot from HotSpot:
Notice that it started slow (you might need to look at the inset to see this),
then got fast for 200 in-process iterations, and then
became somewhat slower again. That’s not supposed to happen! It means that we need
to adjust our classification algorithm: if the last changepoint segment in a
plot isn’t the fastest, we classify it as “slowdown”.
In this case, the slowdown reduces execution speed by 5% which, to state the
obvious, is fairly significant (in mature VMs, speed-ups of 5% often require
months or years of work). But here’s why you should be really surprised. The benchmark in this plot is
Richards: it’s one of the benchmarks that HotSpot has used to benchmark
itself since before the VM was even called HotSpot. In other words, this VM is
highly optimised for this benchmark — but no-one ever ran it long enough
before to notice that performance gets much worse over time.
It turns out that slowdowns are common. Here’s an example from V8:
But there’s one final type of plot left — and it’s the most troubling
of all. Here’s an example:
Notice here that the benchmark’s performance keeps shifting, never settling for
long. Of course, perhaps it would if we ran it for even longer, but ultimately we have to
set a cut-off somewhere. This now raises a tricky question: what should our criteria be
for when a benchmark is said never to stabilise? Our choice was to say that if there
are changepoints in the last 500 in-process iterations then
the plot is classified as “no steady state”.
For clarity, “flat” and “warmup” are equally good; “slowdown” and “no steady
state” are bad (in slightly different ways). The latter two are not supposed
to happen: when they do, they invalidate nearly all previous benchmarking
methodologies.
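Simplifying heavily (see [13] for the wrinkles the real algorithm deals with), the classification rules above can be sketched roughly as follows, taking the changepoint segments of a single process execution as input:

```python
def classify(segments, total_iters=2000, last_n=500, tolerance=0.01):
    """Classify one process execution from its (start, end, mean) segments.

    `tolerance` is a crude stand-in for the real algorithm's notion of
    "equivalent" changepoint segments (see [13]).
    """
    # No steady state: a changepoint occurs in the last 500 in-process iterations.
    if any(start >= total_iters - last_n for start, _end, _mean in segments[1:]):
        return "no steady state"
    if len(segments) == 1:
        return "flat"        # warmed up too quickly for us to observe it
    last_mean = segments[-1][2]
    fastest = min(mean for _start, _end, mean in segments)
    if last_mean <= fastest * (1 + tolerance):
        return "warmup"      # the final segment is (near enough) the fastest
    return "slowdown"        # an earlier segment was faster than the final one
```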
Edge cases
Changepoint analysis turned out to be the vital thing we needed to make this
work useful. When we first tried using changepoint analysis on our data, we
sometimes thought it had misclassified things, but in reality it’s just more
rigorous (and far less prone to boredom) than us humans. However, there are
always going to be some edge cases where the classification could go either way.
Here’s my favourite example of this:
Can you see that the timing data keeps shifting, at predictable intervals, above
and below the changepoint segment? The changes are small enough that changepoint
analysis doesn’t consider these shifts as changepoints, so the plot is
classified “warmup”. Interestingly, in some older versions of HotSpot, this
benchmark displays a more pronounced square-wave pattern (i.e. the highs are higher
and the lows lower); each peak and trough becomes a changepoint segment; and
the plot as a whole becomes “no steady state”. In my opinion, for this example,
either classification would be acceptable: but, in general, there are very few
edge cases such as this.
Inconsistency
Earlier I went into some detail about how we ensured that the benchmarks we’re
running are deterministic. We expected that this would lead to consistent
performance profiles for a given (benchmark, VM, machine) triple (in other words,
we expected consistent performance from process executions of the same benchmark
on the same VM on the same machine). We were wrong.
Consider these two plots, both of the same benchmark, on the same VM,
on the same physical machine:
Notice that one process execution warms up, and the other is a no steady state.
To add insult to injury, the no steady state process execution’s performance is
often 5% worse than the warmup process execution’s performance. This sort of
inconsistency happens frequently.
Perhaps less surprisingly, there’s often substantial inconsistency across
operating systems. For example, consider the binary trees benchmark on Linux and
OpenBSD in C:
There are two things to note here. First, the performance of this benchmark on
Linux resembles a bouncing ball. This, to belabour a point, isn’t a VM: it’s a C
program! We suspect (but somehow have never quite managed to get around to
investigating) that this is down to the malloc used on Debian: it’s probably
incrementally growing, then incrementally compacting, its memory structures.
This raises a subtle point which I hadn’t considered when we first started this
work: even C programs have a runtime (libc), which can have
unpredictable effects on performance. Unsurprisingly, C isn’t as prone to odd
effects as the VMs in our experiment, but it’s more prone than I would have
expected. Second, notice the substantial difference in performance between the
two systems. Admittedly, in this case we’re comparing a Xeon Linux machine
against a Core i7 OpenBSD machine, but still, I would be happy to bet money that
Linux will always beat OpenBSD in such a benchmark. Why? Simply because
OpenBSD’s malloc prioritises security over performance, and jumps
through several hoops that almost certainly slow things down substantially.
Benchmarks in aggregate
Looking at the plots above is very useful, because it gives us a way of
understanding what unexpected performance behaviour occurs. However, our
experiment has thousands of these plots, so it’s hard to put together any sort
of bigger picture from them. Fortunately, changepoint analysis allows us to
automatically put together various summaries of the data (with Sarah Mount doing
most of the hard work to generate the astonishing LaTeX tables below). Here’s an
example of the summary data from one of our machines:
There’s a bit too much data there for us to easily read it in this setting, so
let’s zoom in on the bottom right part of that table:
This is the data from the spectralnorm benchmark. There’s still quite a
lot of detail here, so let’s look at a few rows to unpick what’s going on.
I’ll start with the HotSpot row. The squiggle in the “Class.” (i.e. “classification”) column
means that all 30 of spectralnorm’s process executions warmed up on
HotSpot. The “Steady iter (#)” column tells us that the median number of
in-process iterations for the steady state to be reached was 7 (the numbers in
brackets underneath are inter-quartile ranges; think of them as being a
bit like confidence intervals, giving you an indication of the spread of
data). The histogram to the right of that number shows you the spread of data,
with the bucket the median value is part of highlighted in red. In this case, nearly all the data is
in the median bucket, with a couple of outliers. The “Steady iter (s)”
column tells us that the median wall-clock time to reach the steady state was
1.91 seconds. The “Steady perf (s)” is close to the “traditional” measure of
VM performance [14], telling us that the performance of the benchmark in the steady
state was 0.31472 seconds per in-process iteration. Notice though that the
histogram tells us that there are a number of process executions where the
steady state performance is worse than the median, and the very wide 99% confidence
interval (±0.169143s) confirms this. This goes some way to justifying all the
data you’ve just waded through: yes, the benchmark consistently warms up on
HotSpot (which is good), it warms up fairly fast (which is also good), but the
steady state performance can vary hugely between process executions (which is bad).
Depending on what aspect of performance you care about, you may find different
parts of this information more or less important than others.
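As a rough sketch of where the numbers in such a row come from (simplified: the real steady-state calculations use bootstrapping and equivalent-segment handling, as per [14]):

```python
import numpy as np

def steady_state_stats(timings, segments):
    """Simplified versions of the numbers behind one process execution's row.

    `segments` is the (start, end, mean) list from the changepoint step; here
    the steady state is taken to start at the final segment, ignoring the
    equivalent-segment handling the real calculation uses.
    """
    steady_start = segments[-1][0]
    return {
        "steady_iter_num": steady_start + 1,                     # "Steady iter (#)"
        "steady_iter_secs": float(sum(timings[:steady_start])),  # "Steady iter (s)"
        "steady_perf_secs": segments[-1][2],                     # "Steady perf (s)"
    }

def median_and_iqr(values):
    """Summarise one of the above statistics over the 30 process executions."""
    q25, median, q75 = np.percentile(np.asarray(values, dtype=float), [25, 50, 75])
    return median, (q25, q75)
```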
The “—” classification for LuaJIT means that all 30 of its process
executions were flat. By definition that means it warmed up at in-process
iteration 1 and took 0 seconds to do so: thus the steady iter (#) and (s)
columns are empty. The final “pure” classification can be seen for Graal,
all 30 of whose process executions slowed down (note that, though this
doesn’t have a steady state of peak performance, it does still have a
steady state, hence our ability to report such numbers here).
The “═” classification for PyPy is “good inconsistent”: 27 of the 30 process
executions were flat, and 3 were warmups. Since we consider these two
classifications morally equivalent, “good inconsistent” is morally equivalent
to all 30 process executions being flat or warmup.
In contrast, the “═ with a cross through it” classification for TruffleRuby
is “bad inconsistent” i.e. at least 1 process execution was a slowdown or a no
steady state. In this case, 25 of its process executions slowed down and 5 were
no steady state.
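Putting the per-process-execution classifications together, the rules for a (VM, benchmark) pair’s overall classification can be sketched as follows (again a simplification, not our actual code):

```python
def classify_pair(pexec_classes):
    """Overall classification of a (VM, benchmark) pair.

    `pexec_classes` is the list of 30 per-process-execution classifications
    ("flat", "warmup", "slowdown" or "no steady state").
    """
    kinds = set(pexec_classes)
    if "slowdown" in kinds or "no steady state" in kinds:
        if len(kinds) == 1:
            return kinds.pop()         # e.g. all 30 slowed down
        return "bad inconsistent"      # at least one bad process execution, mixed
    if len(kinds) == 1:
        return kinds.pop()             # all flat, or all warmup
    return "good inconsistent"         # a mix of flat and warmup only
```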
The bigger picture
Summarising the results gives a clearer picture of how VM performance can
surprise:
To keep things simple, I’ll only consider the Linux4790 column above
(the OpenBSD machine runs a subset of benchmarks, so requires some care when
interpreting). It’s easiest to start with the bottom half of the table first.
The way to read this is that 22.0% of all process executions (i.e. not worrying
about which VM or which benchmark was involved) were classified as flat and
48.3% as warmup; in other words, just over two thirds of process executions
warmed up in the expected fashion. 20.1% of process executions slowed down and 9.6%
were no steady state.
Those figures are pretty depressing: almost a third of process executions
aren’t doing what we expected! But life gets quite a lot worse in the top of the
table. The way to read this is that 8.9% of (VM, benchmark) pairs were always
classified as flat, 20.0% as warmup, and 11.1% as good inconsistent. In other
words, only 40.0% of (VM, benchmark) pairs consistently did the expected thing for
all 30 of their process executions. Surprisingly, 4.4% consistently slowed down
and 4.4% were consistently no steady state. But 51.1% of all (VM, benchmark)
pairs were bad inconsistent: that is, they contain at least one slowdown or no
steady state process execution. At the risk of stating the obvious, that means
that just over half of the time VMs are not reliably optimising benchmarks!
Things get even worse if you consider all machines together: only 12.5%
of (VM, benchmark, machine) triples perform as expected.
Conclusions
When we set out to look at how long VMs take to warm up, we didn’t expect to
discover that they often don’t warm up. But, alas, the evidence that
they frequently don’t warm up is hard to argue with. Of course, in some cases the
difference in performance is small enough that one can live with it, but it’s
often bad enough to be a problem. We fairly frequently see performance get 5% or more
worse over time in a single process execution. 5% might not sound like
much, but it’s a huge figure when you consider that many VM
optimisations aim to speed things up by 1% at most. It means that many
optimisations that VM developers have slaved away on may have been incorrectly
judged to speed things up or slow things down, because the optimisation is well
within the variance that VMs exhibit. The variance across process executions is
often even bigger, which is even more troubling: it means that it’s not safe
to run a program once, measure its performance, and assume you’ll always get that
level of performance in the future.
Before the doom and gloom gets too much, I’m going to end on an optimistic
note. While I think our work raises big questions about the efficacy of current
VMs, I think it also offers a note of salvation. People are often very happy
with VM performance, even though it probably isn’t working as well as it could.
In particular, my guess is that large programs are subject to many small
performance problems which, in conjunction with each other, are dismissed as
noise. If that is the case, then, if we can produce VMs that don’t suffer from
the problems seen above, we should be able to run programs even faster in the
future.
In the
second part of this blog post, I’m going to: look at whether there are
any simple answers to the performance problems seen above; give some possible
explanations as to how we might have got to this point; and suggest some simple
pointers to make things a bit less bad in the future. In the meantime, those
of you who want more details can
read
the main research paper.
Acknowledgements: Edd Barrett, Carl Friedrich Bolz, Rebecca Killick,
and Sarah Mount did most of the work I’ve reported on above. Vince Knight helped
put us in touch with Rebecca Killick: he really deserved more credit on this
paper than he received, but he gracefully refused my attempts to add him in! The
work of Tomas Kalibera and Richard Jones was a huge influence for all of us.
Mario Wolczko gave crucial feedback and encouragement at important points. We
also received vital feedback, and questions, from a number of other people.
This research was funded by the EPSRC Cooler
(EP/K01790X/1) grant and Lecture (EP/L02344X/1) fellowship, and gifts from
Cloudflare and Oracle Labs.
Footnotes
[1] Unfortunately, many such claims are either wrong (e.g. the thing being measured
isn’t the thing that’s claimed to be measured) or misleading (e.g. making
grand comments that derive from generalising a single workload / benchmark).
[2] I used to love C, but these days I’m
more ambivalent, because I no longer think I can remember all the
ways that I may trigger undefined behaviour.
[3] p11 of Cliff
Click’s slides “A JVM does that?”. To be clear, I’m not criticising Cliff
who I not only know and like, but whose contributions to our field are little
short of jaw-dropping.
[4] The terms “warmup” and “startup” are often used interchangeably, whereas they
refer to two different things: startup is the time for the VM beginning to
execute the first bit of the user’s code; warmup is the time for the VM to reach
the steady state of peak performance.
[5] There are two relevant Kalibera and Jones papers: a complete but
dense description; and a higher-level overview. I
definitely recommend starting with the high-level overview (even though it was published
later).
[6] In my defence, I can only say that I’ve seen a lot of people benchmark even more
incompetently than I’ve described here. My personal favourite was a paper that
benchmarked on a laptop (which is prone to overheating, and thus likely to
throttle the CPU), with turbo boost left on (making overheating even more
likely), running a full user desktop (how much CPU is your web browser consuming
right now? One core of mine is running at 40% while a website animates
something). The resulting numbers might have been correct, but, if they were,
that would surely be more due to luck than judgement.
That said, the most common form of benchmarking (which I’ve certainly been
very guilty of in the past) is far worse: it’s best summarised as “run it until
I like it”. In this form of benchmarking, if you run a benchmark and get a
number you don’t like (too fast or too slow), you run things again and again
until you get an answer that matches your expectations / prejudices. In
practice, the non-determinism of our hardware and software means that one can
often find an appealing answer. Of course, the implied variation between runs
should be an alarm bell!
[7] On OpenBSD, for example, by default root runs a cron job every day
(/etc/daily) and every week (/etc/weekly). The former
does several security related tasks which, one might argue, aren’t likely to
impact anything else; however, the latter rebuilds the locate
database (which is very IO intensive, and which can thus cause very
unpredictable side effects). There is also a monthly cron job but, in practice,
that does very little by default (I had entirely forgotten that it existed until
I checked my facts for this footnote).
[8] It’s sometimes thought that botnets can only attack you from external networks.
However, all it takes is one machine to become compromised (e.g. by someone
running something unfortunate on their machine), and then a botnet can exist on
an internal network. I’ve heard of this happening at least twice at places that
I know well, so it’s a real threat.
[9] The “K” in Krun is a tip of the hat to the Kalibera and Jones work. For those
who care, most of us pronounce it as one syllable (i.e. “Krun” not “K-run”).
[10] Turbo boost is less predictable than most of us realise. For example, we
had some machines with a stated base frequency of 3.6GHz and a turbo boost of
4GHz. If you ran a process which utilised 1 core fully, you could reliably
measure it as running at almost exactly 4GHz; fully utilising 2 and 3 cores gave
the same result; but as soon as a fourth core was fully utilised measurements
fell to almost exactly 3.8GHz. If memory serves this mode is undocumented in
Intel’s manuals, but I saw this behaviour on 2 separate machines.
[11] At the time we experimented with CPU pinning on Linux, there were two mechanisms
available: one was buggy and unreliable; and the other (which, of course, we
tried second) worked OK. I don’t know if this is still the case, but it
certainly makes me wonder if everyone I’ve heard of using CPU pinning is getting
the benchmarking environment they expect.
[12] I’m talking here about compilers optimising away code as part of the compiler’s
standard optimisations. There’s also a less pleasant case
where compilers are customised specifically for one benchmark (e.g. explicitly
recognising a benchmark’s source code, and then changing the compilation
strategy). One of the better known examples of this was Sun’s special treatment
of the 179.art benchmark in SPEC. It’s now difficult to find a concise write-up
of this, but a couple
of articles
provide reasonable pointers.
[13] The real algorithm has two main additional features. First, it treats
changepoint segments which differ only by a very small amount of time as being
equivalent. Second, two (or more) changepoint segments can be considered
equivalent even if they are separated by a non-equivalent changepoint segment.
[14] The details of how we calculate steady performance are quite a lot more involved
than is traditional: influenced by Kalibera and Jones, we use
bootstrapping and changepoint segments to arrive at a slightly more accurate
number than most traditional methodologies.