More Evidence for Problems in VM Warmup

Some readers may remember some work I and others were involved in a few years back, where we looked at how programming language VMs (Virtual Machines), mostly those with JIT (Just-In-Time) compilers, warmed-up — or, often, didn’t. In this blog post I’m going to give a bit of background to this area, and then look at a newly published paper (not written by me!), which gives further evidence of warmup problems.

Background

VMs start programs running in an interpreter where they observe which portions run frequently. Those portions are then compiled into machine code which can then be used in place of the interpreter. The period between starting the program and all of the JIT compilation completing is called the warmup time. Once warmup is complete, the program reaches a steady state of peak performance.

At least, that’s what’s supposed to happen: our work showed that, on widely studied small benchmarks, it often doesn’t. Sometimes programs don’t hit a steady state; sometimes, if they do reach a steady state, it’s slower than what came before; and some programs are inconsistent in whether they hit a steady state or not. A couple of examples hopefully help give you an idea. Here’s an example of a “good” benchmark from our dataset which starts slow and hits a steady state of peak performance:

The way to read this plot is that the benchmark is run in a for loop, with each iteration (1 to 2000) shown on the x-axis, and the time each iteration takes shown on the y-axis. The vertical red lines are changepoints, which show us when there are statistically significant shifts in the data. The horizontal red lines are changepoint segments, which show us the mean value of the iterations in-between two changepoints.
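
If you want to experiment with this kind of analysis yourself, below is a minimal sketch of the idea in Python. It is not the implementation either paper used: it uses the ruptures library, and the timing data, penalty, and minimum segment size are all made up for illustration.

    import numpy as np
    import ruptures as rpt

    # Fake data standing in for one process execution's 2000 in-process
    # iteration timings (seconds): slow early iterations, then a steady state.
    rng = np.random.default_rng(42)
    times = np.concatenate([
        rng.normal(1.10, 0.010, 200),    # interpretation / JIT compilation
        rng.normal(0.95, 0.005, 1800),   # steady state
    ])

    # PELT finds the changepoints; a higher penalty finds fewer of them.
    algo = rpt.Pelt(model="l2", min_size=25).fit(times)
    changepoints = algo.predict(pen=15 * np.log(len(times)))

    # Each changepoint segment's mean (the horizontal red lines in the plots).
    start = 0
    for end in changepoints:
        print(f"iterations {start}-{end}: mean {times[start:end].mean():.3f}s")
        start = end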

Some benchmarks hit a steady state of non-peak performance such as this:

Some benchmarks never hit a steady state:

Taking a representative Linux machine [1] from our experiment as an example, the numbers aren’t happy reading. Only 74.7% of process executions hit a steady state of peak performance. That sounds bad enough, but if we look at (VM, benchmark) pairs, where we compare all 30 process executions of a benchmark on a given VM, we see that there is no guarantee of a consistent style of performance: only 43.5% of (VM, benchmark) pairs hit a steady state of peak performance. If you want more details, you can read the full academic paper, a pair of blog posts I wrote, or watch a talk.

The state of play

The research underlying our paper was hard, often very frustrating work, but as recompense I hoped that it meaningfully nudged the VM benchmarking field forward a bit. However, we weren’t stupid or arrogant enough to believe that we’d solved all benchmarking problems. Towards the end of the paper we tried to make this clear:

We do not claim that this paper represents the end of the road in terms of VM benchmarking methodologies: we hope and expect that this work will be superseded by better approaches in the future.

Personally I expected that within a couple of years our work would have been superseded, perhaps being proven incorrect, or being made obsolete by a better methodology or more extensive experiment. After 5 years of waiting, my expectation was starting to look rather naive. Indeed, many citations of our paper are of the form “VMs warmup [cite our paper]” — in other words, many people cite our paper as suggesting that benchmarks warm-up in a given period, when the paper is all about showing that this often doesn’t happen!

It has thus been a pleasure to stumble across a new paper, Towards effective assessment of steady state performance in Java software: Are we there yet? by Luca Traini, Vittorio Cortellessa, Daniele Di Pompeo, and Michele Tucci, which I think makes an important contribution to the VM benchmarking field. However, at 66 pages long, it is not a quick or an easy read: having studied it somewhat carefully, I believe that it paints an even bleaker picture of (JVM) warmup than our paper.

In this post I’m going to do my best to summarise the parts of the paper that are most relevant to my interests, comparing it to the earlier work I was involved in. For the rest of this post I’ll use the acronym “TCPT” to refer to the authors and their paper and “BBKMT” to refer to the paper I was involved with. Please bear in mind that I was not previously aware of TCPT and I don’t think I’ve ever met or communicated with any of the authors — I may end up grossly misrepresenting their research. But if you’re interested in VM benchmarking, and especially if you’ve read BBKMT, you might find that this post helps you think about TCPT.

Overview of the paper

In essence, TCPT investigate whether Java benchmarks reach a steady state or not. As that might suggest, they have both broader and narrower aims than our work did: broader in that they investigate more research questions than we did and use many more benchmarks; but narrower in that they do so solely in the context of JMH (the Java Microbenchmark Harness, a benchmark runner for the JVM) and worry only about whether benchmarks reach a steady state or not (so, for example, they aren’t concerned by benchmarks which reach a steady state of non-peak performance).

TCPT pose five research questions, but I’m only going to look at the first, since it’s most related to BBKMT:

RQ1: Do Java microbenchmarks reach a steady state of performance?

Terminology

The first challenge for me in understanding TCPT is mapping the paper’s terminology (much of which derives from JMH) to BBKMT [2]. Here are the best mappings I’ve been able to come up with:

TCPT            BBKMT
JVM             HotSpot
Benchmark       {Benchmark, (VM, benchmark) pairs}
Fork            Process execution
Invocation      n/a
Iteration       In-process iteration
Steady state    {Warm-up, flat, good inconsistent}
Inconsistent    One or more no steady state process executions

One terminological nightmare in the Java world is the difference between Java the language, the abstract definition of Java bytecode for a Java Virtual Machine (JVM), and specific Java VMs. Because there are multiple Java VMs, each one of which is “a JVM”, I prefer to be clear that TCPT, and this post, are talking about HotSpot, which is, by far, the most widely used JVM. The versions (yes, versions!) of HotSpot used in TCPT contain 2 separate JIT compilers (C1 and C2; Graal is not, I believe, used), which can be used individually, or as two tiers.

TCPT use “benchmark” to mean both “a specific benchmark X” and “all process executions of X”. In BBKMT, and this post, we used “benchmark” for the former and “(VM, benchmark) pair” for the latter.

“Invocation” has no direct mapping to our work: as far as I can tell, it is a process execution that runs for a fixed period of wall-clock time (rather than for a fixed number of in-process iterations). Since TCPT care only about “steady state” (and not, as we did, flat / warm-up / slow-down / no steady state), that single term maps to: warm-up or flat in the context of a single process execution; or good inconsistent for a set of process executions.

An “inconsistent” benchmark in TCPT maps to a BBKMT (VM, benchmark) pair where at least one process execution is no steady state. Although it turns out not to occur in TCPT’s data, a (VM, benchmark) pair none of whose process executions hits a steady state is “no steady state”, the same definition as in BBKMT.

TCPT also use the term “the two-phase assumption” to denote the expectation that, in theory, benchmarks should consist of a warm-up phase and a steady-state phase. We didn’t capture this quite so neatly in BBKMT, though that’s partly because we differentiated a “steady state of peak performance” (warmup) from a “steady state of non-peak performance” (slowdown).

For the rest of this post, I will mostly use BBKMT terminology, except that when I use “inconsistent” (not “bad inconsistent”!), I will be referencing TCPT’s definition: as we shall see later, splitting apart TCPT’s “inconsistent” from BBKMT’s “bad inconsistent” is an important detail.

Summary figures

TCPT are more coy about giving summary figures than I would be, and it wasn’t until p26 that I got a sense of their results. To me, the most important figures in TCPT are that:

  • 10.9% of process executions are no steady state;
  • 43.5% of (VM, benchmark) pairs are inconsistent (i.e. if you run the benchmark repeatedly, some process executions are steady state, some are non-steady state). Note that no (VM, benchmark) pairs were consistently no steady state.

I’ll compare these numbers to BBKMT a little later, but at this point it’s enough to know that the first result is very similar to BBKMT, the second result a little less so.

Benchmarking methodology

Experimental science such as TCPT isn’t intended to “prove” something definitively: rather it aims to increase, or decrease, our confidence in a particular theory. The results above suggest that TCPT should cause us to decrease our confidence in the “theory of VM warm-up” (at least for HotSpot). Do we, in turn, believe that TCPT’s work is of sufficiently good quality to influence our confidence levels? Again, though this can never be absolute, my answer is broadly “yes”, because their methodology seems well designed.

In summary, TCPT’s methodology is in many parts similar to BBKMT’s, so it might not be surprising that I consider their methodology well designed — I am inevitably biased! But once we dive into detail, we can see that there are important differences. First, TCPT do not reuse the software we created specifically for BBKMT (Krun and warmup_stats). Second, although TCPT use the same statistical technique (and software) as BBKMT, the details of how they use it are quite different. Third, TCPT use far more benchmarks than BBKMT. In the rest of this section I’ll give an overview of my understanding of TCPT’s methodology which, overall, I think is an improvement on BBKMT’s.

Running benchmarks

Let’s start with the basics. I have said more than once that, at its core, the major contribution of BBKMT was simply that we ran benchmarks for longer, and more often, than anyone had done before. In our case, we ran each benchmark for 30 process executions and 2000 in-process iterations [3]. Pleasingly, TCPT do something similar:

We perform 10 JMH forks for each benchmark (as suggested by [BBKMT]), where each fork involves an overall execution time of at least 300 seconds and 3000 benchmark invocations.

The “at least 300 seconds” is different from BBKMT: we ran a fixed amount of work, and didn’t worry about how long it took (though we did aim for the fastest implementation of a benchmark to take at least 0.1s, and preferably 1s, per in-process iteration, to reduce measurement error). Is the “at least” an important detail? Probably not: I can make good arguments for both “fixed amount of work” and “minimum run time”. While I would expect each approach to result in slightly different overall data, I doubt that the differences would be significant.
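
To make the distinction concrete, here is a sketch of the two loop styles in Python; the function names and defaults are mine, not from either paper.

    import time

    def run_fixed_iterations(bench, iters=2000):
        # "Fixed amount of work": a set number of in-process iterations per
        # process execution, however long that takes.
        timings = []
        for _ in range(iters):
            start = time.monotonic()
            bench()
            timings.append(time.monotonic() - start)
        return timings

    def run_minimum_time(bench, min_iters=3000, min_secs=300.0):
        # "Minimum run time": run at least `min_iters` iterations and keep
        # going until at least `min_secs` of wall-clock time has elapsed.
        timings = []
        start_all = time.monotonic()
        while len(timings) < min_iters or time.monotonic() - start_all < min_secs:
            start = time.monotonic()
            bench()
            timings.append(time.monotonic() - start)
        return timings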

In BBKMT we wrote Krun, a benchmark runner that tried to control as many parts of an experiment as possible. Summarising Krun’s features is beyond the scope of this post, but a couple of examples hopefully give you an idea. Krun automatically reboots the machine after each process execution (so that if, for example, a benchmark pushes a machine into swap, that can’t influence the next process execution) and waits until the machine’s temperature sensors have returned (within a small tolerance) to the value they had when the first process execution started (so that the effects of temperature on things like CPU and memory are consistent across process executions). We long discussed a second paper to understand the impact of all these features, but it didn’t fit into the time and funding we had available. My personal guess is that most features only had occasional effect — but I am very glad we implemented them, as it hugely increased our confidence in our data.
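
To give a flavour of the temperature-wait idea, here is a much-simplified sketch — this is not Krun’s actual code, and the sysfs paths, tolerance, and polling interval are illustrative only:

    import glob
    import time

    def read_temps():
        # On a typical Linux machine, thermal zones report millidegrees Celsius.
        temps = {}
        for path in glob.glob("/sys/class/thermal/thermal_zone*/temp"):
            with open(path) as f:
                temps[path] = int(f.read().strip())
        return temps

    def wait_for_temps(baseline, tolerance=1000, poll_secs=30):
        # Block until every sensor is back within `tolerance` millidegrees of
        # the reading taken before the first process execution started.
        while True:
            current = read_temps()
            if all(abs(current.get(p, baseline[p]) - baseline[p]) <= tolerance
                   for p in baseline):
                return
            time.sleep(poll_secs)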

TCPT do several of the things we did in BBKMT to control the overall system state (e.g. disabling turbo boost, disabling hyper threads, turning off Unix daemons) but mostly rely on JMH:

we configure the execution of each benchmark via JMH CLI arguments. Specifically, we configure 3000 measurement iterations (-i 3000) and 0 warmup iterations (-wi 0) to collect all the measurements along the fork. Each iteration continuously executes the benchmark method for 100ms (-r 100ms). The number of fork is configured to 10 (-f 10). As benchmarking mode, we use sample (-bm sample), which returns nominal execution times for a sample of benchmark invocations within the measurement iteration.

TCPT also say they monitor dmesg, though they don’t seem to say if they noticed anything odd or not — in BBKMT we found early on from dmesg that one machine was often overheating, causing us to remove it from our experiment. Indeed, I think that TCPT use only a single machine for benchmarking, whereas in BBKMT we used three, to give us some confidence that our results weren’t the result of a single dodgy machine. I appreciate that we were unusually fortunate in having access to such resources.

As far as I can tell, TCPT don’t monitor CPUs for throttling: in BBKMT we found this affected 0.004% of in-process iterations, and the classification of a single process execution. I would expect that throttling happened more often to TCPT (since, I think, their environment is not controlled as rigorously as Krun controls ours), but I doubt it happens often enough to significantly affect their data.

There is one part of TCPT which I struggle to defend: they use two versions of HotSpot, both of which are fairly old:

We chose to limit the scope of the experiment to JMH microbenchmarks, because JMH is a mature and widely adopted Java microbenchmark harness. We ran the benchmarks on JVM 8 (JDK 1.8.0 update 241) or 11 (JDK 11.0.6), depending on the requirements of the specific system. Using other JVM versions or JVMs from other vendors may change the results.

Java, and therefore HotSpot, 1.8 debuted in 2014, although, because versions are maintained for a long time, update 241 was released in January 2020. Java 11 debuted in 2018, with 11.0.6 also released in January 2020. It’s important to realise that the newest version of Java in January 2020 was version 13. Indeed, BBKMT, published in 2017, also used Java 8! I don’t understand why two versions of HotSpot were used, and I would have found the paper’s results even more relevant if a more modern version of HotSpot had been used.

That point aside, my overall feeling is that TCPT have used a subset of the experimental methodology from BBKMT that will mostly give them similar results to the “full” BBKMT methodology.

Statistical analysis

In one of the most interesting parts of TCPT, they take the statistical analysis we developed for BBKMT and extend it. First, they identify outliers using Tukey’s method, as in BBKMT. Second, they use the PELT algorithm to identify changepoints, as in BBKMT, but with a crucial change. Third, they identify “equivalent” changepoint segments in a different way to BBKMT.

Intuitively, changepoint analysis tells us when data in a time-series has “shifted phase”. Both BBKMT and TCPT use PELT, an algorithm for detecting changepoints [4]. PELT’s penalty argument tunes how aggressive it is at finding changepoints (lower values find more changepoints). Tuning the penalty is an art as much as a science. In BBKMT we used a fairly standard value for the penalty and applied it to all process executions. From memory, we started with a low penalty (I think 12 log(n)) and fairly quickly adjusted it upwards to what became the final value (15 log(n)).

TCPT do something very different: they calculate a different penalty for every process execution. They use Lavielle’s suggestion of trying a number of different penalty values, seeing how many changepoints each leads to, and selecting the value at the “elbow” in the resulting graph where reducing the penalty leads to disproportionately more changepoints. TCPT show a neat example of this in Figure 5 (p22).
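
A rough sketch of the penalty-sweep idea is below; the elbow-picking heuristic here is deliberately crude and is not necessarily what TCPT (or Lavielle) actually do:

    import numpy as np
    import ruptures as rpt

    def changepoint_counts(times, penalties):
        # How many changepoints does PELT find at each candidate penalty?
        algo = rpt.Pelt(model="l2", min_size=25).fit(times)
        return [len(algo.predict(pen=p)) - 1 for p in penalties]

    def pick_elbow(penalties, counts):
        # Crude elbow heuristic: the penalty at the point of maximum curvature
        # in the (penalty, number-of-changepoints) curve.
        curvature = np.abs(np.diff(counts, n=2))
        return penalties[int(np.argmax(curvature)) + 1]

    # penalties = np.linspace(1.0, 50.0, 50)
    # counts = changepoint_counts(times, penalties)  # `times` as in the earlier sketch
    # penalty = pick_elbow(penalties, counts)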

There are obvious advantages and disadvantages to using fixed, or variable, penalties. A fixed penalty means that you end up with a “fair” comparison across similar time-series, but variable penalties will work better if your time-series are dissimilar. In BBKMT we mostly had fairly similar time-series data, so a fixed penalty worked well enough for us (at least if you are happy to accept “we looked at lots of process executions manually to check that the changepoints looked sensible” as justification). At first I was unsure whether TCPT’s variable penalties were a good idea or not but, on reflection, given the huge number of benchmarks they are running (which, I assume, vary in nature more than those we used in BBKMT), I think it’s more likely to give defensible changepoints. It would be interesting to apply their approach to penalties to BBKMT and see how much changes — my intuition is that it would probably work quite well for our data too.

Once changepoint analysis has done its work, TCPT encounter the same problem that we did in BBKMT: changepoint analysis can identify phase shifts smaller than makes sense in our domain. For example, changepoint analysis might differentiate two changepoint segments whose means differ by 0.001s when we know that such small differences are probably due to measurement inaccuracy. In BBKMT we tried various heuristics, eventually settling on something that seems silly, but worked for us: simplifying slightly, we ignored the difference between two changepoint segments if their means were within 0.001s of each other, or within the second changepoint segment’s variance. Astute readers might balk at our use of the “variance”, as we clearly changed units from seconds squared to seconds. All I can say is that, despite its seeming statistical incompetence, it worked well for us as a heuristic.
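
Expressed as code, the heuristic is roughly the following — a simplification of what warmup_stats actually does, with names of my own choosing:

    def bbkmt_equivalent(mean1, mean2, variance2, delta=0.001):
        # Treat two changepoint segments as equivalent if their means are
        # within 0.001s of each other, or within the second segment's
        # variance (knowingly mixing units: seconds vs. seconds squared).
        return abs(mean1 - mean2) <= max(delta, variance2)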

TCPT take a very different approach, which classifies two changepoint segments as equivalent if the intervals formed by their means ± (confidence_interval * 1.05) overlap. I’m not quite sure what to make of this, either statistically or practically. Still, given the odd heuristic we used in BBKMT, I can hardly complain. It would have been interesting to see the different effects of the BBKMT and TCPT heuristics.
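
My reading of that check, again as a sketch rather than their actual implementation, is something like:

    def tcpt_equivalent(mean1, ci1, mean2, ci2, slack=1.05):
        # Two changepoint segments are equivalent if the intervals
        # mean +/- (confidence_interval * 1.05) overlap.
        lo1, hi1 = mean1 - ci1 * slack, mean1 + ci1 * slack
        lo2, hi2 = mean2 - ci2 * slack, mean2 + ci2 * slack
        return lo1 <= hi2 and lo2 <= hi1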

In summary, I tend to think that TCPT’s statistical approach is likely to prove better than BBKMT’s on most data, though I would have much preferred it if they had offered a comparison of the two so that I could understand if there are any practical differences.

Benchmark suite

In BBKMT we used a small suite of deliberately small benchmarks. This had three benefits. First, it allowed us to run the same benchmarks across multiple languages. Second, we were able to give a high degree of assurance that, at the source level, the benchmarks ran deterministically, thus ruling out the possibility that programmer “error” caused performance variation. Third, we modified each benchmark in such a way that it was very difficult for a clever compiler to optimise the benchmark away entirely. The obvious disadvantage is that the benchmarks were small in size (e.g. for Python a total of 898LoC, with a median size of 101LoC) and few in number (6 benchmarks, though bear in mind that each of those 6 benchmarks had variants for 8 languages).
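
To give a flavour of the third point — this is the general trick, not the exact modifications we made, and it is sketched in Python for brevity even though it really matters for JIT-compiled code — the idea is to thread a result through the computation and check it afterwards, so that a clever compiler cannot prove the work is dead:

    def bench_sum(n=100000, expected=4999950000):
        total = 0
        for i in range(n):
            total += i
        # Checking (i.e. consuming) the result keeps the loop observable, so
        # a sufficiently clever compiler cannot simply delete it as dead code.
        if total != expected:
            raise ValueError(f"unexpected result: {total}")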

In contrast, TCPT have a much larger benchmark suite of 586 benchmarks. To collect this suite, they select 30 well-known Java systems, and from each system randomly extract 20 of the available benchmarks. 14 of the original 600 benchmarks failed to run, hence the eventual number of 586. Unfortunately, the paper is rather scanty on further details. In particular, I couldn’t find any mention of how big the benchmarks are, though the fact that they’re often described as “microbenchmarks” suggests they’re probably not much bigger on average than BBKMT’s. I would have liked, at least, to have had a sense of the total LoC of the benchmark suite and the median LoC for all benchmarks. Both metrics are only rough proxies for “how big is the benchmark” but it would have helped me better interpret the data. There’s also no mention of whether TCPT examined the benchmarks for problems or not, other than the 14 benchmarks which didn’t execute at all.

Although I wouldn’t quite say these are minor quibbles (I really want to know the details!), TCPT have somewhat made up for them simply by the sheer size of their benchmark suite. Even if all 586 benchmarks are smaller than those in BBKMT, and even if many of the benchmarks don’t quite work as expected, the sheer number of benchmarks is likely to minimise any resulting distortions.

Results

The research question in TCPT I am most interested in is RQ1 “Do Java microbenchmarks reach a steady state of performance?” The raw(ish) data for this is in Figure 6 (p27), presented per benchmark suite, and as a summary. The table is split in two: the left hand side “(a)” is for process executions; the right side “(b)” is for benchmarks. 10.9% of process executions are no steady state (the final row “total” of (a)) but since these process executions are scattered across benchmarks, 43.5% of (VM, benchmark) pairs are inconsistent.

It is clear that TCPT’s results are not good news for those expecting VMs, specifically HotSpot, to warm up. If you’re a normal user, the results suggest that you’re often not getting the performance you expect. If you’re a benchmarker, then you’re probably using a methodology which produces misleading results. If you’re a HotSpot developer, you have some hard debugging ahead of you to work out how to fix (often intermittent) performance problems.

Comparison to BBKMT’s results

Comparing TCPT and BBKMT’s results is challenging. First, BBKMT’s results span 8 different language implementations, whereas TCPT only covers HotSpot. Second, BBKMT and TCPT summarise data in slightly different ways. While we can do nothing about the first point, we can do something about the second.

Let’s start with the easiest comparison. TCPT find 10.9% of all process executions to be no steady state. In BBKMT, on two Linux machines, we found 8.7% and 9.6% of process executions to be no steady state. Personally I consider TCPT’s figure to be well within the range that I would consider reproduces [5] BBKMT’s results.

Comparing TCPT’s 43.5% of (VM, benchmark) pairs being “inconsistent” to BBKMT’s notion of “bad inconsistent” is a little trickier than it first seems. Because BBKMT considers “slowdowns” (i.e. steady states of non-peak performance) to be “bad”, BBKMT’s “bad inconsistent” includes slowdowns. If we want to compare TCPT and BBKMT, we need to strip out slowdowns from BBKMT. I therefore rebuilt warmup_stats for the first time in years, downloaded the experiment data, and adjusted warmup_stats to use TCPT’s definition of inconsistent.
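
The reclassification itself is conceptually simple. A sketch of the idea (not the actual warmup_stats code) is:

    def classify_pair(process_execution_classes):
        # `process_execution_classes` holds one label per process execution of
        # a (VM, benchmark) pair, e.g. ["warmup", "flat", "no steady state"].
        n_nss = sum(1 for c in process_execution_classes if c == "no steady state")
        if n_nss == 0:
            return "steady state"
        if n_nss == len(process_execution_classes):
            return "no steady state"
        return "inconsistent"

    # 30 process executions, one of which never reached a steady state:
    print(classify_pair(["warmup"] * 29 + ["no steady state"]))  # "inconsistent"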

Using TCPT’s metric, on BBKMT’s two Linux machines, 32.6% and 34.8% of (VM, benchmark) pairs are inconsistent. This is notably lower than TCPT’s figure of 43.5%.

Should we consider that TCPT reproduces BBKMT’s results given this metric? I tend to think that, broadly speaking, yes we should, given the caveats involved: after all, BBKMT’s figures range over 8 VMs and TCPT’s only over HotSpot. On that basis, it’s not reasonable to expect TCPT to produce nearly-identical figures to BBKMT. Indeed, the sheer number of benchmarks used in TCPT means that their figure is more likely to be representative of HotSpot’s “true” performance than BBKMT’s.

That said, it is interesting to think about possible explanations for TCPT’s higher figure. For example: perhaps HotSpot is more prone to occasional no steady state process executions than the wider set of VMs we looked at in BBKMT; perhaps TCPT’s benchmarking environment is noisier than BBKMT’s; or perhaps TCPT’s adjusted statistical approach is more likely to classify process executions as no steady state.

Summary

Overall, I enjoyed reading TCPT. My most regular complaint about research papers (as other members of my long-suffering research group can attest) is that they are nearly always unnecessarily long. TCPT is no exception: I wish its 66 pages had been edited down to 40 pages, as I think that more people would then be able to understand and extract its results. Beyond that slightly superficial whinge, there are real positives to TCPT. In particular, I think that their statistical methodology moves the state-of-the-art beyond BBKMT’s (though, largely thanks to Krun, BBKMT’s benchmark running methodology is more refined than TCPT’s).

My comparison of TCPT’s data to BBKMT suggests that HotSpot’s warmup problems are greater than I would have guessed from BBKMT. I cannot help but wish that TCPT had compared their data to BBKMT’s, as they would then have not only done the same analysis I did above but, hopefully, have gone into detail to understand why. Instead, all you’ve got is my incomplete analysis on a wet November day!

What I do know is that if I was running performance critical software on HotSpot, or I regularly benchmarked software on HotSpot, or I ran a HotSpot team, then I’d be investigating TCPT’s benchmarking methodology and suite in great detail! First, I would want to rerun TCPT’s benchmark suite and analysis on a recent version of HotSpot to see if the problems have got better or worse. Second, I would examine the results of running the benchmark suite for clues that I can use to better understand, and possibly improve, HotSpot.

Since TCPT look only at HotSpot, this post has inevitably mostly considered HotSpot too. It is fun to speculate whether the other VMs in BBKMT would seem similarly problematic if investigated in the same way as TCPT: my best guess is “they would”, but that is pure speculation on my part! I can only hope that future researchers examine those other VMs in the same detail.

When we started our VM benchmarking work, I thought it would be a 2 week job, but it took us over two and a half years to push the research far enough that we felt we had something worth publishing. TCPT is another step in the warmup story, but I suspect there are several steps left to go until we can truly say that we have a good understanding of this under-appreciated area!

Acknowledgements: thanks to Carl Friedrich Bolz-Tereick for comments.

Footnotes

[1]

Specifically the Linux4790 machine.

[2]

TCPT’s terminology has the advantage of brevity. In contrast, in BBKMT we spent huge amounts of time trying to think of terminology which would best allow people to grasp the difference between what became “process execution” and “in-process iteration”, and ended up with lengthier terminology. While I doubt we chose the best-possible terms, I still tend to think we gained more clarity than the brevity we sacrificed. Still, this is something upon which reasonable people can differ.

[3]

The following text from TCPT confused me for several days:

Nonetheless, no benchmark is invoked less than 3000 times per fork, that is 1000 times more than in (Barrett et al., 2017).

I read “1000 times” in the sense of “a multiple of 1000”, thus misreading the claim as implying that BBKMT ran benchmarks for 3 in-process iterations. I eventually realised that the text means “an additional 1000 times”, i.e. TCPT run 3000 in-process iterations compared to BBKMT’s 2000 in-process iterations. Oops!

[4]

Indeed, PELT’s author is Rebecca Killick, the “K” in BBKMT.
