Michiel Haisma (2022-11-16 12:55:58)
Thanks Laurie for your thoughts on this, saves me from digging through all 66 pages of this (albeit very interesting) benchmarking paper.

Hypothetically, could your comparison be expanded (or rather reduced) to consider only the HotSpot results of BBKMT, or would that require too much re-analysis and invalidate other conclusions? Could it be that, when considering only HotSpot, the results match up more closely?

Even if so, we all know that even slight changes in setup can cause different results to be observed. So, to me at least, while the conclusions hold (VM warmup is not what you think, and happens less often than you think), it seems entirely possible that with different machines, a different OS (or just a different distribution), different JRE versions, a room at a different temperature, etc. etc., they would see a higher degree of inconsistency, given TCPT's lesser control over some of these variables.

The benchmark set is also much 'wider' and includes some benchmarks that seem particularly spotty, which may be due to the way they are written or to the type of operations they perform. This may be another lack of control that adds to the variability.

On the statistical side, could it be that even slightly tweaking some of the (at least to me) arbitrary values (the confidence interval +5%, say?) would swing the consistency conclusion by a wide margin?

Thinking about the practical implications of these works: are my programs (running in VMs with JIT compilers) getting properly warmed up? How could I possibly confirm this, given that research groups of experts have spent many years getting it done properly? Personally, I've definitely seen warmup in my own results, but I worry about inconsistencies and slowdowns. Maybe it's because I'm just looking for the warmup and call it good when I see it? When I don't see it, do I just assume it hasn't happened yet? Blame IO issues or noisy neighbours? I'm not sure it's within my capacity to really answer this, so I remain waiting for: "VM Warmup pt. 2: definitely warm, all the time, always".

(I work at Oracle Labs. Views are my own.)

Laurence Tratt (2022-11-16 16:08:09)

I didn't break the HotSpot data out from BBKMT because I feel there's too little of it, and I was worried that people would draw overly strong conclusions from not enough data. However, you can work out your own version of this from Table 2 if you want -- just please don't over-interpret it :)

You're right that minor variations in the experimental setup and statistical analysis can change results. FWIW, I was fairly convinced by TCPT's statistical analysis and I only say "fairly" because for me to say "very" I'd probably have to reimplement the statistical analysis and stare at lots of data for a long time!

In terms of "what can we do?" my simplest suggestion is always "run things more often and for longer". So far, on every VM (and some non-VM static compilers!) I've seen, that shows up problems if you have enough benchmarks. Of course, "how long" is hard to say. In Section 9 of our paper we tried to give a sense of "what would have happened if we'd run our experiment for less time". One can't generalise from the results, but you might still find that gives you a starting point when you're running things yourself. Only once I've run things more often and for longer would I start getting clever with controlling confounding factors.
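[Editor's sketch: the simplest version of "run things more often" can be mocked up in a few lines. This is a naive heuristic, not the changepoint analysis the papers use, and `workload` is a hypothetical stand-in for whatever code you actually care about.]

```java
import java.util.Arrays;

// Naive in-process warmup check: time many iterations of a workload,
// then compare the median of the first and last segments. This is a
// rough heuristic only -- the papers discussed above analyse the whole
// time series with changepoint detection, which this does not attempt.
public class WarmupSketch {
    // Hypothetical workload; substitute the code under test.
    static long workload() {
        long acc = 0;
        for (int i = 0; i < 50_000; i++) {
            acc += Integer.bitCount(i * 31);
        }
        return acc;
    }

    static double median(long[] xs) {
        long[] copy = xs.clone();
        Arrays.sort(copy);
        int mid = copy.length / 2;
        return copy.length % 2 == 1
            ? copy[mid]
            : (copy[mid - 1] + copy[mid]) / 2.0;
    }

    public static void main(String[] args) {
        int iters = 500, segment = 50;
        long[] times = new long[iters];
        long sink = 0; // keep the JIT from eliding the workload
        for (int i = 0; i < iters; i++) {
            long start = System.nanoTime();
            sink += workload();
            times[i] = System.nanoTime() - start;
        }
        double early = median(Arrays.copyOfRange(times, 0, segment));
        double late = median(Arrays.copyOfRange(times, iters - segment, iters));
        System.out.printf("early median: %.0f ns, late median: %.0f ns (sink=%d)%n",
                          early, late, sink);
        // If late >= early, the process may not have warmed up -- or may have
        // slowed down, as both papers observed on some benchmarks. Either way,
        // the per-iteration series is worth plotting before trusting a summary.
    }
}
```

Even a crude check like this, run many times and for long enough, tends to surface the inconsistencies discussed above.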

As for part 2... that sounds like a lot of work, so please don't wait for me ;)