
November Links

November 30 2022

  • The Death of the Key Change. Pop(ular) music has definitely shifted in style and key changes are no longer common. This article might seem a bit US-centric, but I suspect the UK stats would be similar. The graph at about 1/3 of the way down makes the change jaw-droppingly clear.
  • The leap second’s time is up: world votes to stop pausing clocks Leap seconds are to be abandoned... but what about the existing leap seconds? I suppose I'll still have to deal with machines whose view of the world differs by about 40 seconds for a long time to come...
  • I am disappointed by dynamic typing. Dynamically typed languages seem increasingly uninterested in exploring the things that they can do that statically typed languages can't. This article has some interesting thoughts on how such languages can take advantage of their very nature.
If you’d like updates on new blog posts, follow me on Twitter,
or subscribe to email updates:

More Evidence for Problems in VM Warmup

November 15 2022

Some readers may remember some work I and others were involved in a few years back, where we looked at how programming language VMs (Virtual Machines), mostly those with JIT (Just-In-Time) compilers, warmed up — or, often, didn't. In this blog post I'm going to give a bit of background to this area, and then look at a newly published paper (not written by me!), which gives further evidence of warmup problems.


VMs start programs running in an interpreter where they observe which portions run frequently. Those portions are then compiled into machine code which can then be used in place of the interpreter. The period between starting the program and all of the JIT compilation completing is called the warmup time. Once warmup is complete, the program reaches a steady state of peak performance.

At least, that's what's supposed to happen: our work showed that, on widely studied small benchmarks, it often doesn't. Sometimes programs don't hit a steady state; sometimes, if they do reach a steady state, it's slower than what came before; and some programs are inconsistent in whether they hit a steady state or not. A couple of examples hopefully help give you an idea. Here's an example of a "good" benchmark from our dataset which starts slow and hits a steady state of peak performance:

[SVG plot: a benchmark which starts slow and hits a steady state of peak performance.]

The way to read this plot is as follows: the benchmark is run in a for loop, with each iteration (1 to 2000) shown on the x-axis; the time of each iteration is shown on the y-axis. The vertical red lines are changepoints, which show us when there are statistically significant shifts in the data. The horizontal red lines are changepoint segments, which show us the mean value of the iterations between two changepoints.
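In code, the measurement underlying such plots can be sketched roughly as follows. This is a deliberately simplified stand-in for a real benchmark runner, with a hypothetical `workload` function; it shows where each point on the x/y axes comes from:

```python
import time

# Simplified sketch of one process execution: run the benchmark's
# workload for a fixed number of in-process iterations, recording each
# iteration's wall-clock time. The list of times is the time-series
# that the plots show.
def run_benchmark(workload, iterations=2000):
    times = []
    for _ in range(iterations):
        start = time.monotonic()
        workload()
        times.append(time.monotonic() - start)
    return times

iteration_times = run_benchmark(lambda: sum(range(1000)))
```

The changepoint analysis described below is then run over `iteration_times`.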

Some benchmarks hit a steady state of non-peak performance such as this:

[SVG plot: a benchmark which hits a steady state of non-peak performance.]

Some benchmarks never hit a steady state:

[SVG plot: a benchmark which never hits a steady state.]

Taking a representative Linux machine [1] from our experiment as an example, the numbers aren't happy reading. Only 74.7% of process executions hit a steady state of peak performance. That sounds bad enough, but if we look at (VM, benchmark) pairs, where we compare all 30 process executions of a benchmark on a given VM, we see that there is no guarantee of a consistent style of performance: only 43.5% of (VM, benchmark) pairs hit a steady state of peak performance. If you want more details, you can read the full academic paper, a pair of blog posts I wrote, or watch a talk.

The state of play

The research underlying our paper was hard, often very frustrating work, but as recompense I hoped that it meaningfully nudged the VM benchmarking field forward a bit. However, we weren't stupid or arrogant enough to believe that we'd solved all benchmarking problems. Towards the end of the paper we tried to make this clear:

We do not claim that this paper represents the end of the road in terms of VM benchmarking methodologies: we hope and expect that this work will be superseded by better approaches in the future.

Personally I expected that within a couple of years our work would have been superseded, perhaps being proven incorrect, or being made obsolete by a better methodology or more extensive experiment. After 5 years of waiting, my expectation was starting to look rather naive. Indeed, many citations of our paper are of the form "VMs warmup [cite our paper]" — in other words, many people cite our paper as suggesting that benchmarks warm up in a given period, when the paper is all about showing that this often doesn't happen!

It has thus been a pleasure to stumble across a new paper Towards effective assessment of steady state performance in Java software: Are we there yet? by Luca Traini, Vittorio Cortellessa, Daniele Di Pompeo, and Michele Tucci, which I think makes an important contribution to the VM benchmarking field. However, at 66 pages long, it is not a quick or an easy read: having studied it somewhat carefully, I believe that it paints an even bleaker picture of (JVM) warmup than our paper.

In this post I'm going to do my best to summarise the parts of the paper that are most relevant to my interests, comparing it to the earlier work I was involved in. For the rest of this post I'll use the acronym "TCPT" to refer to the authors and their paper and "BBKMT" to refer to the paper I was involved with. Please bear in mind that I was not previously aware of TCPT and I don't think I've ever met or communicated with any of the authors — I may end up grossly misrepresenting their research. But if you're interested in VM benchmarking, and especially if you've read BBKMT, you might find that this post helps you think about TCPT.

Overview of the paper

In essence, TCPT investigate whether Java benchmarks reach a steady state or not. As that might suggest, they have both broader and narrower aims than our work did: broader in that they investigate more research questions than we did and use many more benchmarks; but narrower in that they do so solely in the context of JMH (the Java Microbenchmark Harness, a benchmark runner for the JVM) and worry only about whether benchmarks reach a steady state or not (so, for example, they aren't concerned by benchmarks which reach a steady state of non-peak performance).

TCPT pose five research questions, but I'm only going to look at the first, since it's most related to BBKMT:

RQ1: Do Java microbenchmarks reach a steady state of performance?


The first challenge for me in understanding TCPT is mapping the paper's terminology (much of which derives from JMH) to BBKMT [2]. Here are the best mappings I've been able to come up with:
  • Benchmark → {Benchmark, (VM, benchmark) pairs}
  • Fork → Process execution
  • Iteration → In-process iteration
  • Steady state → {Warm-up, flat, good inconsistent}
  • Inconsistent → One or more no steady state process executions

One terminological nightmare in the Java world is the difference between Java the language, the abstract definition of Java bytecode for a Java Virtual Machine (JVM), and specific Java VMs. Because there are multiple Java VMs, each one of which is "a JVM", I prefer to be clear that TCPT, and this post, are talking about HotSpot, which is, by far, the most widely used JVM. The versions (yes, versions!) of HotSpot used in TCPT contain 2 separate JIT compilers (C1 and C2; Graal is not, I believe, used), which can be used individually, or as two tiers.

TCPT use "benchmark" to mean both "a specific benchmark X" and "all process executions of X". In BBKMT, and this post, we used "benchmark" for the former and "(VM, benchmark) pair" for the latter.

"Invocation" has no direct mapping to our work: as far as I can tell, it is a process execution that runs for a fixed period of wall-clock time (rather than running for a fixed number of in-process iterations). Since TCPT care only about "steady state" (and not, as we did, flat / warm-up / slow-down / no steady state) that single term maps to: warm-up or flat, in the context of a single process execution; or good inconsistent for a set of process executions.

An "inconsistent" benchmark in TCPT maps to a BBKMT (VM, benchmark) pair where at least one process execution is no steady state. Although it turns out not to occur in TCPT's data, a (VM, benchmark) pair none of whose process executions hits a steady state is "no steady state", the same definition as in BBKMT.

TCPT also use the term "the two-phase assumption" to denote the expectation that, in theory, benchmarks should consist of a warm-up and a steady-state phase. We didn't capture this quite so neatly in BBKMT, though that's partly because we differentiated "steady state of peak performance" (warmup) from "steady state of non-peak performance" (slowdown).

For the rest of this post, I will mostly use BBKMT terminology, except that when I use "inconsistent" (not "bad inconsistent"!), I will be referencing TCPT's definition: as we shall see later, splitting apart TCPT's "inconsistent" from BBKMT's "bad inconsistent" is an important detail.

Summary figures

TCPT are more coy about giving summary figures than I would be, and it wasn't until p26 that I got a sense of their results. To me, the most important figures in TCPT are that:
  • 10.9% of process executions are no steady state;
  • 43.5% of (VM, benchmark) pairs are inconsistent (i.e. if you run the benchmark repeatedly, some process executions are steady state, some are non-steady state). Note that no (VM, benchmark) pairs were consistently no steady state.
I'll compare these numbers to BBKMT a little later, but at this point it's enough to know that the first result is very similar to BBKMT, the second result a little less so.

Benchmarking methodology

Experimental science such as TCPT isn't intended to "prove" something definitively: rather it aims to increase, or decrease, our confidence in a particular theory. The results above suggest that TCPT should cause us to decrease our confidence in the "theory of VM warm-up" (at least for HotSpot). Do we, in turn, believe that TCPT's work is of sufficiently good quality to influence our confidence levels? Again, though this can never be absolute, my answer is broadly "yes", because their methodology seems well designed.

In summary, TCPT's methodology is in many parts similar to BBKMT's, so it might not be surprising that I consider their methodology well designed — I am inevitably biased! But once we dive into detail, we can see that there are important differences. First, TCPT do not reuse the software we created specifically for BBKMT (Krun and warmup_stats). Second, although TCPT use the same statistical technique (and software) as BBKMT, the details of how they use it are quite different. Third, TCPT use far more benchmarks than BBKMT. In the rest of this section I'll give an overview of my understanding of TCPT's methodology which, overall, I think is an improvement on BBKMT's.

Running benchmarks

Let's start with the basics. I have said more than once that, at its core, the major contribution of BBKMT was simply that we ran benchmarks for longer, and more often, than anyone had done before. In our case, we ran each benchmark for 30 process executions and 2000 in-process iterations [3]. Pleasingly, TCPT do something similar:

We perform 10 JMH forks for each benchmark (as suggested by [BBKMT]), where each fork involves an overall execution time of at least 300 seconds and 3000 benchmark invocations.

The "at least 300 seconds" is different than BBKMT: we ran a fixed amount of work, and didn't worry about how long it took (though we did aim for the fastest implementation of a benchmark to take at least 0.1s, and preferably 1s, per in-process iteration, to reduce measurement error). Is the "at least" an important detail? Probably not: I can make good arguments for both "fixed amount of work" and "minimum run time". While I would expect each approach to result in slightly different overall data, I doubt that the differences would be significant.

In BBKMT we wrote Krun, a benchmark runner that tried to control as many parts of an experiment as possible. Summarising Krun's features is beyond the scope of this post, but a couple of examples hopefully give you an idea. Krun automatically reboots the machine after each process execution (so that if, for example, a benchmark pushes a machine into swap, that can't influence the next process execution) and waits until the machine's temperature sensors have returned (within a small tolerance) to the value they had when the first process execution started (so that the effects of temperature on things like CPU and memory are consistent across process executions). We long discussed a second paper to understand the impact of all these features, but it didn't fit into the time and funding we had available. My personal guess is that most features only had occasional effect — but I am very glad we implemented them, as it hugely increased our confidence in our data.

TCPT do several of the things we did in BBKMT to control the overall system state (e.g. disabling turbo boost, disabling hyper threads, turning off Unix daemons) but mostly rely on JMH:

we configure the execution of each benchmark via JMH CLI arguments. Specifically, we configure 3000 measurement iterations (-i 3000) and 0 warmup iterations (-wi 0) to collect all the measurements along the fork. Each iteration continuously executes the benchmark method for 100ms (-r 100ms). The number of fork is configured to 10 (-f 10). As benchmarking mode, we use sample (-bm sample), which returns nominal execution times for a sample of benchmark invocations within the measurement iteration.
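Assembled into a single command line, that configuration would look roughly as follows (the jar name is hypothetical; the flags are those quoted above):

```shell
# 3000 measurement iterations, 0 warmup iterations, 100ms per
# iteration, 10 forks, "sample" benchmark mode.
java -jar benchmarks.jar -i 3000 -wi 0 -r 100ms -f 10 -bm sample
```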

TCPT also say they monitor dmesg, though they don't seem to say if they noticed anything odd or not — in BBKMT we found early on from dmesg that one machine was often overheating, causing us to remove it from our experiment. Indeed, I think that TCPT use only a single machine for benchmarking, whereas in BBKMT we used three, to give us some confidence that our results weren't the result of a single dodgy machine. I appreciate that we were unusually fortunate in having access to such resources.

As far as I can tell TCPT don't monitor CPU(s) for throttling: in BBKMT we found this affected 0.004% of in-process executions, and the classification of a single process execution. I would expect that throttling happened more often to TCPT (since, I think, their environment is not controlled as rigorously as Krun's), but I doubt it happens often enough to significantly affect their data.

There is one part of TCPT which I struggle to defend: they use two versions of HotSpot, both of which are fairly old:

We chose to limit the scope of the experiment to JMH microbenchmarks, because JMH is a mature and widely adopted Java microbenchmark harness. We ran the benchmarks on JVM 8 (JDK 1.8.0 update 241) or 11 (JDK 11.0.6), depending on the requirements of the specific system. Using other JVM versions or JVMs from other vendors may change the results.

Java, and therefore HotSpot, 1.8 debuted in 2014, although because versions are maintained for a long time, update 241 was released in January 2020. Java 11 debuted in 2018, with 11.0.6 also released in January 2020. It's important to realise that the newest version of Java in January 2020 was version 13. Indeed, BBKMT, published in 2017, also used Java 8! I don't understand why two versions of HotSpot were used and I would have found the paper's results even more relevant if a more modern version of HotSpot had been used.

That point aside, my overall feeling is that TCPT have used a subset of the experimental methodology from BBKMT that will mostly give them similar results to the "full" BBKMT methodology.

Statistical analysis

In one of the most interesting parts of TCPT, they take the statistical analysis we developed for BBKMT and extend it. First they identify outliers using Tukey's method, as in BBKMT. Second, they use the PELT algorithm to identify changepoints, as in BBKMT, but with a crucial change. Third they identify "equivalent" changepoint segments in a different way to BBKMT.

Intuitively, changepoint analysis tells us when data in a time-series has "shifted phase". Both BBKMT and TCPT use PELT, an algorithm for detecting changepoints [4]. PELT's penalty argument tunes how aggressive it is at finding changepoints (lower values find more changepoints). Tuning the penalty is an art as much as a science. In BBKMT we used a fairly standard value for the penalty and applied it to all process executions. From memory, we started with a low penalty (I think 12 log(n)) and fairly quickly adjusted it upwards to what became the final value (15 log(n)).
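To make the penalty's effect concrete, here is a toy, stdlib-only sketch of penalised changepoint segmentation. It uses the naive O(n²) "optimal partitioning" formulation (PELT computes the same kind of answer more efficiently); it is an illustration of the idea, not the code used in either paper:

```python
# Toy penalised changepoint segmentation. Each segment's cost is its
# sum of squared deviations from the segment mean; the total cost also
# pays `penalty` per segment, so lower penalties find more changepoints.
def segment_cost(xs, i, j):
    """Sum of squared deviations from the mean of xs[i:j]."""
    seg = xs[i:j]
    m = sum(seg) / len(seg)
    return sum((x - m) ** 2 for x in seg)

def changepoints(xs, penalty):
    """Indices where the series shifts phase (O(n^2) dynamic programme)."""
    n = len(xs)
    best = [0.0] * (n + 1)  # best[j]: minimal cost of segmenting xs[0:j]
    prev = [0] * (n + 1)    # prev[j]: start index of the last segment
    for j in range(1, n + 1):
        best[j], prev[j] = min(
            (best[i] + segment_cost(xs, i, j) + penalty, i)
            for i in range(j))
    cps, j = [], n          # walk back through the optimal segmentation
    while j > 0:
        j = prev[j]
        if j > 0:
            cps.append(j)
    return sorted(cps)

data = [10.0] * 50 + [5.0] * 50  # an obvious shift at iteration 50
print(changepoints(data, penalty=1.0))  # → [50]
```

With a huge penalty (say 10000) the same data yields no changepoints at all: that trade-off, between over- and under-segmenting, is exactly what the penalty controls.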

TCPT do something very different: they calculate a different penalty for every process execution. They use Lavielle's suggestion of trying a number of different penalty values, seeing how many changepoints each leads to, and selecting the value at the "elbow" in the resulting graph where reducing the penalty leads to disproportionately more changepoints. TCPT show a neat example of this in Figure 5 (p22).

There are obvious advantages and disadvantages to using fixed, or variable, penalties. A fixed penalty means that you end up with a "fair" comparison across similar time-series, but variable penalties will work better if your time-series are dissimilar. In BBKMT we mostly had fairly similar time-series data, so a fixed penalty worked well enough for us (at least if you are happy to accept "we looked at lots of process executions manually to check that the changepoints looked sensible" as justification). At first I was unsure whether TCPT's variable penalties were a good idea or not, but on reflection, given the huge number of benchmarks (which, I assume, vary in nature more than those we used in BBKMT) they are running, I think it's more likely to give defensible changepoints. It would be interesting to apply their approach to penalties to BBKMT and see how much changes — my intuition is that it would probably work quite well for our data too.

Once changepoint analysis has done its work, TCPT encounter the same problem that we did in BBKMT: changepoint analysis can identify phase shifts smaller than make sense in our domain. For example, changepoint analysis might differentiate two changepoint segments whose means are different by 0.001s when we know that such small differences are probably due to measurement inaccuracy. In BBKMT we tried various heuristics, eventually settling on something that seems silly, but worked for us: simplifying slightly, we ignored differences between two changepoint segments if their means overlapped by 0.001s or overlapped by the second changepoint segment's variance. Astute readers might balk at our use of the "variance" as we clearly changed units from seconds squared to seconds. All I can say is that, despite its seeming statistical incompetence, it worked well for us as a heuristic.
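That heuristic, as simplified here, can be expressed in a few lines (a sketch of the description, not the actual warmup_stats code):

```python
# Are two changepoint segments "equivalent" under the (simplified)
# BBKMT heuristic? `var2` is the second segment's variance, used,
# knowingly loosely, as if it were in units of seconds.
def bbkmt_equivalent(mean1, mean2, var2, eps=0.001):
    delta = abs(mean1 - mean2)
    return delta <= eps or delta <= var2
```

For example, segments with means 1.0005s and 1.0s are treated as equivalent (within 0.001s), while 1.5s and 1.0s are only equivalent if the second segment's variance is at least 0.5.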

TCPT take a very different approach, which classifies two changepoint segments as equivalent if their means ± (confidence_interval * 1.05) overlap. I'm not quite sure what to make of this, either statistically, or practically. Still, given the odd heuristic we used in BBKMT, I can hardly complain. It would have been interesting to see the different effects of the BBKMT and TCPT heuristics.
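My reading of TCPT's rule as code (how exactly the intervals are compared is an assumption on my part):

```python
# Are two changepoint segments "equivalent" under TCPT's rule? Each
# segment gets an interval mean +/- (confidence_interval * 1.05);
# segments are equivalent if the two intervals overlap.
def tcpt_equivalent(mean1, ci1, mean2, ci2, factor=1.05):
    lo1, hi1 = mean1 - ci1 * factor, mean1 + ci1 * factor
    lo2, hi2 = mean2 - ci2 * factor, mean2 + ci2 * factor
    return lo1 <= hi2 and lo2 <= hi1
```

Unlike BBKMT's fixed 0.001s threshold, this scales with each segment's confidence interval, so noisier segments are more readily merged.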

In summary, I tend to think that TCPT's statistical approach is likely to prove better than BBKMT's on most data, though I would have much preferred if they had offered a comparison of the two so that I could understand if there any practical differences.

Benchmark suite

In BBKMT we used a small benchmark suite of deliberately small benchmarks. This had three benefits. First, it allowed us to run the same benchmarks across multiple languages. Second, we were able to give a high degree of assurance that, at the source level, the benchmarks ran deterministically, thus ruling out the possibility that programmer "error" caused performance variation. Third, we modified each benchmark in such a way that we made it very difficult for a clever compiler to fully optimise away the benchmark. The obvious disadvantage is that the benchmarks were small in size (e.g. for Python a total of 898LoC, with a median size of 101LoC) and few in number (6 benchmarks, though bear in mind each of those 6 benchmarks had variants for 8 languages).

In contrast, TCPT have a much larger benchmark suite of 586 benchmarks. To collect this suite, they select 30 well-known Java systems, and from each system randomly extract 20 of the available benchmarks. 14 of the original 600 benchmarks failed to run, hence the eventual number of 586. Unfortunately, the paper is rather scanty on further details. In particular, I couldn't find any mention of how big the benchmarks are, though the fact that they're often described as "microbenchmarks" suggests they're probably not much bigger on average than BBKMT's. I would have liked, at least, to have had a sense of the total LoC of the benchmark suite and the median LoC for all benchmarks. Both metrics are only rough proxies for "how big is the benchmark" but it would have helped me better interpret the data. There's also no mention of whether TCPT examined the benchmarks for problems or not, other than the 14 benchmarks which didn't execute at all.

Although I wouldn't quite say these are minor quibbles (I really want to know the details!), TCPT have somewhat made up for them simply by the sheer size of their benchmark suite. Even if all 586 benchmarks are smaller than those in BBKMT, and even if many of the benchmarks don't quite work as expected, the sheer number of benchmarks is likely to minimise any resulting distortions.


The research question in TCPT I am most interested in is RQ1 "Do Java microbenchmarks reach a steady state of performance?" The raw(ish) data for this is in Figure 6 (p27), presented per benchmark suite, and as a summary. The table is split in two: the left hand side "(a)" is for process executions; the right side "(b)" is for benchmarks. 10.9% of process executions are no steady state (the final row "total" of (a)) but since these process executions are scattered across benchmarks, 43.5% of (VM, benchmark) pairs are inconsistent.

It is clear that TCPT's results are not good news for those expecting VMs, specifically HotSpot, to warm up. If you're a normal user, the results suggest that you're often not getting the performance you expect. If you're a benchmarker, then you're probably using a methodology which produces misleading results. If you're a HotSpot developer, you have some hard debugging ahead of you to work out how to fix (often intermittent) performance problems.

Comparison to BBKMT's results

Comparing TCPT and BBKMT's results is challenging. First, BBKMT's results span 8 different language implementations, whereas TCPT only covers HotSpot. Second, BBKMT and TCPT summarise data in slightly different ways. While we can do nothing about the first point, we can do something about the second.

Let's start with the easiest comparison. TCPT find 10.9% of all process executions to be no steady state. In BBKMT, on two Linux machines, we found 8.7% and 9.6% of process executions to be no steady state. Personally I consider TCPT's figure to be well within the range that I would consider reproduces [5] BBKMT's results.

Comparing TCPT's 43.5% of (VM, benchmark) pairs being "inconsistent" to BBKMT's notion of "bad inconsistent" is a little trickier than it first seems. Because BBKMT considers "slowdowns" (i.e. steady states of non-peak performance) to be "bad", BBKMT's "bad inconsistent" includes slowdowns. If we want to compare TCPT and BBKMT, we need to strip out slowdowns from BBKMT. I therefore rebuilt warmup_stats for the first time in years, downloaded the experiment data, and adjusted warmup_stats to use TCPT's definition of inconsistent.

Using TCPT's metric, on BBKMT's two Linux machines, 32.6% and 34.8% of (VM, benchmark) pairs are inconsistent. This is notably lower than TCPT's figure of 43.5%.

Should we consider that TCPT reproduces BBKMT's results given this metric? I tend to think that, broadly speaking, yes we should, given the caveats involved: after all, BBKMT's figures range over 8 VMs and TCPT's only over HotSpot. On that basis, it's not reasonable to expect TCPT to produce nearly-identical figures to BBKMT. Indeed, the sheer number of benchmarks used in TCPT means that their figure is more likely to be representative of HotSpot's "true" performance than BBKMT's.

That said, it is interesting to think about possible explanations for TCPT's higher figure. For example: perhaps HotSpot is more prone to occasional no steady state process executions than the wider set of VMs we looked at in BBKMT; perhaps TCPT's benchmarking environment is noisier than BBKMT's; or perhaps TCPT's adjusted statistical approach is more likely to classify process executions as no steady state.


Overall, I enjoyed reading TCPT. My most regular complaint about research papers (as other members of my long suffering research group can attest) is that they are nearly always unnecessarily long. TCPT is no exception: I wish its 66 pages had been edited down to 40 pages, as I think that more people would be able to understand and extract its results. Beyond that, slightly superficial, whinge, there are real positives to TCPT. In particular, I think that their statistical methodology moves the state-of-the-art beyond BBKMT's (though, largely thanks to Krun, BBKMT's benchmark running methodology is more refined than TCPT's).

My comparison of TCPT's data to BBKMT suggests that HotSpot's warmup problems are greater than I would have guessed from BBKMT. I cannot help but wish that TCPT had compared their data to BBKMT's, as they would then have not only done the same analysis I did above but, hopefully, have gone into detail to understand why. Instead, all you've got is my incomplete analysis on a wet November day!

What I do know is that if I was running performance critical software on HotSpot, or I regularly benchmarked software on HotSpot, or I ran a HotSpot team, then I'd be investigating TCPT's benchmarking methodology and suite in great detail! First, I would want to rerun TCPT's benchmark suite and analysis on a recent version of HotSpot to see if the problems have got better or worse. Second, I would examine the results of running the benchmark suite for clues that I can use to better understand, and possibly improve, HotSpot.

Since TCPT look only at HotSpot, this post has inevitably mostly considered HotSpot too. It is fun to speculate whether the other VMs in BBKMT would seem similarly problematic if investigated in the same way as TCPT: my best guess is "they would", but that is pure speculation on my part! I can only hope that future researchers examine those other VMs in the same detail.

When we started our VM benchmarking work, I thought it would be a 2 week job, but it took us over two and a half years to push the research far enough that we felt we had something worth publishing. TCPT is another step in the warmup story, but I suspect there are several steps left to go until we can truly say that we have a good understanding of this under-appreciated area!

Acknowledgements: thanks to Carl Friedrich Bolz-Tereick for comments.



[1] Specifically the Linux4790 machine.
[2] TCPT's terminology has the advantage of brevity. In contrast, in BBKMT we spent huge amounts of time trying to think of terminology which would best allow people to grasp the difference between what became "process execution" and "in-process iteration", and ended up with lengthier terminology. While I doubt we chose the best-possible terms, I still tend to think we gained more clarity than the brevity we sacrificed. Still, this is something upon which reasonable people can differ.
[3] The following text from TCPT confused me for several days:
Nonetheless, no benchmark is invoked less than 3000 times per fork, that is 1000 times more than in (Barrett et al., 2017).
I read "1000 times" in the sense of "a multiple of 1000", thus misreading the claim as implying that BBKMT ran benchmarks for 3 in-process iterations. I eventually realised that the text means "an additional 1000 times" i.e. TCPT run 3000 in-process iterations compared to BBKMT's 2000 in-process iterations. Oops!

[4] Indeed, PELT's author is Rebecca Killick, the "K" in BBKMT.
[5] In computer science, "reproduction" has come to mean "an independent group can obtain the same result using artifacts which they develop completely independently".

What is a Research Summer School?

November 10 2022

If I say "summer school" to you then you'll probably think of extra sessions in summer holidays for children; depending on where you grew up, you might expect such sessions to be either exciting non-academic activities or catch-up lessons. What about summer schools for research students (what I'll call "research summer schools" for the rest of this post, though in research circles they're unambiguously referred to as just "summer schools")?

I've co-organised four in-person research summer schools, most recently as part of the Programming Language Implementation Summer School (PLISS) series, and spoken at two others, and one thing that I've realised is that many people don't really know what they involve. Indeed, I didn't fully realise what they are, or could be, even after I'd been involved in several! This post is my brief attempt to pass on some of what I've learnt about research summer schools.

Let's start with the obvious high-level intent: research summer schools are intended to help research students better understand a research area. "Research students" is an umbrella term: the majority of attendees tend to be PhD students, but there is nearly always a mix of people from other stages of life, particularly postdocs and BSc/MSc students, but sometimes including people who are coming at research from a less common career path.

Research summer schools typically consist of a series of talks, some interactive, some traditional lectures, from experts in the field. At least in my area of computing, they tend to be about a week long, and are typically held at a campus university or non-urban residential facility, so that students and speakers are all housed near to one another, and distractions are minimised. As we'll see later, I think this closeness is important, but I've come to realise that one can overdo the monastic aspect: there are advantages to having a small number of moderate distractions somewhat nearby.

Whether stated explicitly or not, research summer schools tend to focus either on breadth (i.e. helping students understand areas near to, but not identical to, the one they're working on) or depth (i.e. helping students understand one area in more detail). One thing that surprised me is that it's easy for organisers to think they're focussing on depth without realising that students perceive it as breadth. This is simply because students, being new to an area, have rarely had enough time to learn as much as the (generally significantly older) organisers and speakers expect.

For example, when I co-organised a research summer school on programming language virtual machines, I assumed that such a specialist topic implied that students would know much of the relevant background material — even though, when I was at a similar stage, I knew only a tiny subset of this material! In one way I was right — the students who attended were much better informed than I had been at the same stage. But in a more significant way I was wrong — they were still students, and I don't think any of them knew the necessary background for every single talk. As soon as I realised this, I started asking some speakers to explain some terms that they thought were basic, but which I could tell most students were unfamiliar with.

As that might suggest, speakers at a research summer school have a tightrope to walk: they often want to show the students something they expect the students to be unfamiliar with, but they need to do so in a way that students can understand. Experience has taught me that, if in doubt, it's better for speakers to assume they're giving a "tutorial" rather than a "research talk". Even students who are very familiar with a given subfield tend to respond well to talks that cover the basics well. Enthusiasm in a research summer school talk is also far more important than for a talk at a research conference. I always ask a number of students which talks they enjoyed most, and the correlation with "talks given with enthusiasm" is astonishingly high. To my surprise, that correlation is consistent between students who are familiar with a talk's area and those who are not.

At least as important as the talks themselves are the interactions between senior folk (speakers and organisers) and students outside talks. I always knew that this was important, but I now realise that I significantly underestimated how important it is. Frankly, only in the research summer school I co-organised this year did I finally feel that I personally got this more-or-less right. In particular, I went out of my way to talk to as many different students as possible, for example by making sure I sat next to different people at almost every meal. After 5 days [1] I think I'd had individual conversations, even if only a few minutes long, with about 80% of students, over breakfasts, coffee breaks, lunches, and dinners.

Talking to so many new people in such a short period of time might sound like a chore, particularly if you're a little introverted, but the appreciation that students show makes the effort worthwhile. In the past I made my job unnecessarily hard by trying to start each conversation in a different way. This year I tried something different: I simply asked students what they were working on. Not only is this easier for me, but it's the best possible question for students: everyone has an answer to it, and most are eager to tell you more! From there, a meaningful conversation quickly and naturally develops. In some cases, what students were working on was far beyond my ken, but in nearly all cases I was able to find something interesting to ask them, and in a few cases we realised there was something deeper to talk about, generally either career or research related.

When there's something worth talking about in more detail, I think it's useful to find a way to give that student more time, and often away from the crowd. These days I simply suggest we go to a nearby cafe or pub to chat. Since most research summer school venues have a limited number of such options nearby, people tend to naturally congregate there after a while. This works out well: by the time others wander over, I've generally had the most important parts of the conversation with the initial student, and then it's useful to gather more students to talk about other things.

The fact that I'm focussing so much on conversation is deliberate. To me, one of the big advantages of a research summer school is that it breaks down barriers between senior (i.e. old) folk and (usually much younger) students. It's easy for students to imagine that well known researchers are another species, impossibly clever, and unapproachable. Research conferences rarely do much to dispel this image. In contrast, when students spend a week in regular contact with senior folk, they tend to start to realise that they can talk to us oldies as if we're normal human beings. That relationship, as small as it might seem to the senior folk, persists beyond the summer school. I find that students feel they can approach me at other events once they've got to know me at a summer school, whereas students who haven't such an opportunity are often too nervous to make first contact.

One thing I've accidentally experimented with is the length of a research summer school and the number of participants. I now think that 6 days is a good duration. It takes students at least a couple of days to take stock of the unfamiliar situation they're in, so they need time after that to take advantage of it. However, too much time causes energy levels to drop to unproductive levels.

It's tempting to try and spread the advantage of a research summer school around by accepting as many students as possible. However, I've realised that this comes with a significant trade-off: the more students there are, the disproportionately fewer interactions each student has with the senior folk. I've come to think that about 8-10 speakers (plus a couple of organisers) and about 35-40 students is a good balance between maximising access and maintaining a cosy feeling that encourages interactions.

These numbers tend to mean that more students want to attend than can be accepted. Because of that, the events I've been involved with don't use a typical "first come, first served" application policy. Instead we solicit expressions of interest (with a simple form that takes under 10 minutes to complete), and later select students from that pool. The selection process is difficult because there are always more worthy applicants than we have places — indeed, we could often fill nearly all places with applicants from "top rated" universities alone! Instead of concentrating benefits, we try to spread them around as best we can, taking many different factors into account when selecting students, whilst doing our best to ensure that those we select are likely to benefit from the experience. No selection process can be perfect, but we've had enough positive comments about the diversity and quality of students attending to think that we're doing a decent job.

Research summer schools are not cheap to run: costs per student and speaker are high, since accommodation and meals are included. In general, particularly if the event is held in a residential facility in a wealthy country, the true cost per student can be prohibitively high. Fortunately, at least in my area of computing, it has proven possible to obtain significant sponsorship from governments, charities, and industry (for example, the summer schools I've organised have received sponsorship from, in rough chronological order, EPSRC, the US Office of Naval Research, NSF, ACM SIGPLAN, Raincode, Facebook / Meta, and probably others I will soon be embarrassed to have forgotten). We use this to subsidise [2] registration, keeping it reasonable for everyone, and to offer part and whole bursaries (for registration and travel) to those who would otherwise have been unable to attend. It is difficult to overstate how important sponsorship is in enabling research summer schools to extend their benefits to a diverse range of students.

It's also difficult to overstate how generous research summer school speakers are: they give up their time and knowledge freely, even though many are very busy people. I'm extremely grateful to each and every speaker! I also find that nearly all speakers really enjoy the experience: relative to the size of research conferences, the small scale of a research summer school lends itself to deeper interactions, which are often very rewarding.

As an organiser my job is a bit different to a speaker's: I have to find a venue; gather together an interesting group of speakers; make sure that as wide a range of students as possible can attend; and make sure that while the event is on, speakers and students can focus on the important things — learning and interacting with each other. Estimating the workload is difficult, but a rough estimate is that there's perhaps a working week of effort before the event (albeit spread out over several months), and during the event itself one can spend nearly every waking minute doing something! Personally I enjoy the work that needs to be done during the event more than that which needs to be done beforehand, and I'm lucky that my co-organisers have been happy to indulge this preference.

The fact that I've organised several research summer schools probably tells you that I think this work is worthwhile. Indeed, research summer schools are now one of the few things where I feel I can justify, without difficulty, the expense in terms of time, money, and the environment that travel requires. The reason for that is that I've now seen the impact over time on many students. Of course, it doesn't work for everyone, but that's inevitable. At the other end of the spectrum, two of the students, based in different continents, who met at a research summer school I co-organised subsequently married! As that shows, research summer schools can have all sorts of positive, if unexpected, benefits!

Acknowledgements: thanks to Jan Vitek for comments.



[1] I should have had a sixth day, but my journey took 38 hours instead of 8!
[2] Note that I haven't said registration should, in general, be free, because humans have an astonishing tendency not to turn up to events they haven't paid for.

October Links

November 1 2022

  • How we use binary search to find compiler bugs. This is such an obvious JIT debugging technique that I almost didn't read to the end of the article. I'm glad I did because... this hadn't occurred to me before!
  • A Vision of Metascience. It's unlikely that the current way we go about research is optimal: how can we explore the design space and improve how we go about doing things?
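The binary-search technique from the first link can be sketched concisely: given a list of compilation units and a way to rerun the program with only a subset of them compiled, bisect down to the single unit whose compilation triggers the bug. A minimal sketch (the function and predicate names are mine, and it assumes exactly one culprit):

```python
def find_culprit(items, is_buggy):
    """Bisect to the single item whose inclusion triggers the bug.

    `is_buggy(subset)` reruns the program with only `subset` compiled
    (everything else interpreted) and reports whether the bug appears.
    Assumes exactly one item in `items` triggers the bug.
    """
    lo, hi = 0, len(items)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_buggy(items[lo:mid]):
            hi = mid  # culprit is in the first half
        else:
            lo = mid  # culprit is in the second half
    return items[lo]

# Hypothetical example: function number 62 of 100 is miscompiled.
functions = list(range(100))
print(find_culprit(functions, lambda subset: 62 in subset))  # 62
```

Each iteration halves the candidate set, so even thousands of compilation units need only a handful of reruns.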

pizauth: another alpha release

October 20 2022

A few weeks back I made the first alpha release of pizauth, a simple program for requesting, showing, and refreshing OAuth2 access tokens. I've been pleasantly surprised at the number of people who've tested pizauth since then, mostly successfully. Inevitably, pizauth hasn't quite worked perfectly for everyone, so I'm making available the next alpha release, 0.1.1. One new feature, and one portability fix, are particularly worth mentioning.

Running pizauth on a remote machine

A valid use case for pizauth is to run it on a remote machine (e.g. to download email in a cron job on a server), but for authentication to happen locally. I had assumed, but hadn't tested, that using pizauth with ssh -L would work. Indeed, it did work, but it required users to work out which port pizauth's HTTP server (a vital part of the OAuth2 authentication sequence) was listening on. Users could discover the port number by carefully examining the authorisation URL:


Did you spot the port number in that ASCII maelstrom? I wouldn't blame you if you didn't! (Answer in the footnote [1].) Just to make life a bit more miserable for users, the port number is randomly generated by the operating system every time pizauth starts.

I've therefore added a global http_listen configuration option which allows users to force pizauth's HTTP server to listen on a specific IP address and port. For example, if you want it to listen locally on port 12345:

http_listen = "";

With this, it's now much easier to use pizauth on a remote machine [2]. In essence, after fixing the port number [3] with http_listen, you run pizauth server on the remote machine. On your local machine you then open an SSH tunnel to the remote machine (note that you have to listen on the same port locally and remotely, in this example 12345):

ssh -L 12345: <remote-machine-name>

When pizauth show is executed on the remote machine, one then copies the authorisation URL over to a browser on the local machine, and authenticates as normal. Once pizauth displays "pizauth processing authentication: you can safely close this page." in the browser, the SSH tunnel can be closed. As and when pizauth requires reauthentication (e.g. because the refresh token has expired), one follows the same "open an SSH tunnel, authenticate, close the SSH tunnel" sequence.

OS X port

The first alpha release didn't compile on OS X because the Unix daemon function leads to a deprecation warning at link time, which was upgraded to an error. Fortunately Eddie Antonio Santos integrated a simple fix for OS X, which at least two other people have subsequently confirmed works well.

Client IDs and secrets are outside pizauth's control

The most common complaint (or plea) I've heard is from people who haven't been able to find the right OAuth2 client IDs and secrets for their setup. Regrettably, there's absolutely nothing I can do about this: it's up to the administrators running each OAuth server as to how they handle this.

Some administrators explicitly list the valid client IDs and secrets (or allow users to generate their own) but many administrators incorrectly assume that client IDs and secrets increase security, and prevent users from finding out valid values. The latter case implicitly creates two classes of users. The first class are shut out from vital services. The second class, using various means that their (incorrectly paranoid) administrators might not approve of, are able to find valid client IDs and secrets which work with their servers.

I hope that a future revision of the OAuth standard will a) require only one of client ID or secret, and b) make that single setting optional, strongly suggesting that it only be used for unmonitored server applications. Until then, all I can do is sympathise with users who can't find valid client IDs and secrets.

Testing is still appreciated!

I'm gradually gaining confidence that pizauth's features and user-facing interface aren't entirely stupid. I hope that even more people will test pizauth to see if there are any other deficiencies that need addressing!



[1] 8765
[2] Using this option also guarantees that pizauth won't listen on a random port that you expected a later program to listen on instead.
[3] I'm not yet aware of a good use case for listening on an IP address other than, but that might be narrowness of thinking on my part.