Why Aren’t More Users More Happy With Our VMs? Part 2

September 19 2018

In the first part of this two-part blog, I showed results from an experiment on VM performance that I’d been part of. Although it wasn’t our original intention, we eventually ended up trying to find out how often mainstream programming language VMs warm up as per traditional expectations. If you consider only individual process executions, then about 70% warm up; if you consider (VM, benchmark) pairs, then about 40% warm up consistently; and if you consider (VM, benchmark, machine) triples, then about 12% warm up consistently.

However you choose to look at it, these numbers aren’t pretty. In trying to make sense of them, I’m going to try to provide some answers to the following three questions. First, are there simple explanations for the performance problems we encountered? Second, how did we get into this situation? Third, how can we do better in the future? To answer the first question, I'm going to draw on our paper. In providing suggested answers to the second and third questions, I'm going to make extensive use of my personal opinions and experiences, so please don’t blame anyone else for them!

Are there simple explanations for poor performance?

There are all sorts of factors that might be responsible for odd performance, but, beyond those covered in the first part of this blog post, there are two that seem both particularly likely and relatively easy to test for: garbage collection and JIT compilation. For example, garbage collection is, to state the obvious, memory intensive: it could conceivably have all sorts of odd effects on processor caches and thus performance. Similarly, JIT compilation can happen at any point, and there's no guarantee that it makes things faster.

We therefore took two of the VMs under examination – HotSpot and PyPy [1] – and performed a second experiment, separate from the first [2], where we also recorded how long each in-process iteration spent in garbage collection activities and JIT compilation. To be clear, the aim of this second experiment wasn’t to find causations (i.e. to find definitive reasons for odd performance), but to see if there are plausible correlations (i.e. good candidates for further investigation).

Here’s an example plot from this second experiment:

[Plot: fasta on PyPy, showing per-iteration wall-clock time (bottom sub-plot) and time spent in garbage collection and JIT compilation (top two sub-plots).]

There are three sub-plots in this plot. All share the same x-axis of in-process iterations (i.e. the benchmark running in a for loop). The “main” sub-plot (the bottom of the 3 sub-plots) is the same as in the first part of this blog, with its y-axis showing the wall-clock time that each in-process iteration takes. In the example above, we can see something odd going on in fasta on PyPy: there are 3 main “sections”, but instead of each “section” having fairly consistent performance for each in-process iteration, they seem to get gradually slower over time.

The top two sub-plots show the time spent in garbage collection and JIT compilation for each in-process iteration [3]. These clearly show that JIT compilation only happens right at the very beginning of the benchmark. However, the time spent in garbage collection is almost perfectly correlated with the overall performance of the benchmark. That suggests first that this benchmark is spending most of its time in garbage collection, and second that there may well be a minor memory leak (in the VM or the benchmark). Again, to be clear, I’m only claiming a correlation, not a causation. However, if I wanted to fix this, I know that I’d start looking for memory allocation sites that don’t seem to be freeing up memory.
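
To give a concrete sense of what I mean by a “plausible correlation”, here is a minimal sketch (not the analysis code we actually used) of how one might check whether per-iteration garbage collection time tracks per-iteration wall-clock time, assuming both series have already been extracted from a profiled run:

# A minimal sketch, not our actual analysis code: given per-iteration
# wall-clock times and per-iteration GC times from a profiled run, compute
# the Pearson correlation coefficient between the two series. A value close
# to 1.0 makes the GC a good candidate for further investigation; it does
# not, of course, prove causation.
from math import sqrt

def pearson(xs, ys):
    assert len(xs) == len(ys) and len(xs) > 1
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: iteration times creep up as the GC does more work each time.
wallclock = [0.30 + 0.0001 * i for i in range(2000)]
gc_time = [0.05 + 0.0001 * i for i in range(2000)]
print(pearson(wallclock, gc_time))  # ~1.0: worth looking at the GC first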

[Plot: Richards on HotSpot, showing per-iteration wall-clock time, garbage collection time, and JIT compilation time.]

Here’s my old friend Richards, running on HotSpot: this benchmark gets 5% slower at about iteration 200, ending up in a steady state of non-peak performance. We can see that we’re regularly spending time garbage collecting [4], but there’s no correlation between garbage collection spikes and the main sub-plot. However, the JIT compilation sub-plot is more interesting: just before the benchmark slows down, there are two long-ish JIT compilation events. It therefore seems plausible that the VM decided to recompile parts of the benchmark (probably based on extra profiling data it had collected), but that the recompilation led to slower machine code being generated. Again, it’s not a guaranteed causation, but it does give a pointer for where to look. However, it's a less useful pointer than the previous example. Since JIT compilers are, to put it mildly, rather complex, fingering the JIT compiler as a culprit still means considering a huge amount of code as the likely cause of the problem.

[Plot: the shallow square-wave benchmark with profiling on, showing per-iteration wall-clock time, garbage collection time, and JIT compilation time.]

This third example shows a more troubling case. This is the shallow square-wave example we saw in the first part of this blog (you can see the original plot if you want a reminder). It shows two things. First, turning on profiling changes the nature of most process executions: instead of appearing almost immediately, the square wave doesn’t appear until the thousandth iteration; and the “steady-ish state” performance is better with profiling on (roughly 0.34-0.36s with profiling on vs. 0.36-0.40s with profiling off). Second, no matter how much you look, there is no correlation between garbage collection (indeed, no garbage collection occurs at all!) or JIT compilation and the square wave. You might get excited by the JIT compilation event around iteration 1375, which seems to correlate with the third trough, but there’s no such correlation at the start of any of the other troughs. I honestly don’t have a clue what’s causing the square wave, and there’s nothing here which gives me the slightest pointer as to what I should look at first.

Unfortunately, a lot of examples look like the second or third cases: the correlation only points us to a still-very-large subset of the VM, or doesn’t point us anywhere at all. Indeed, more recently I was involved in a LuaJIT performance project where we also extracted profiling information about garbage collection and JIT compilation. While this information was sometimes helpful, I wouldn’t say that it was a decisive advantage. In some cases, despite the plots not showing anything obvious, we later traced problems back to the garbage collector or the JIT. There are clearly several possible causes for this, but my guess is that the most common is that the consequences of performance problems are often spread around somewhat evenly, thus not showing up as a visible spike at a particular point in time.

There seem, therefore, few simple answers to the odd performance I pointed out in the first part of this post. How did we get into such a situation? Realistically, the answer is almost surely “for lots of reasons, many of which we’ll probably never know.” However, I have a couple of war stories that I think might provide a partial explanation, and that look at things in a slightly different way than I’ve heard others talk about in the past.

Are benchmark suites complete guides to performance?

At the risk of stating the obvious, most of us use benchmarks to guide our optimisations. In other words, if we think we’ve made a compiler faster, we have one or more programs which we measure before and after the change to see if it really has made things faster or not (and, if it has, by how much). In general, people use multiple such programs, at which point we bestow upon them the grand title of a “benchmark suite”. Optimisations which make our benchmark suite run faster are good; those that make our benchmark suite run slower are bad. But do our benchmark suites tell us about the performance of programs in general?

Some of you may remember that a few years back Dropbox started a new Python VM project called Pyston [5]. I was a bit surprised at the time, because VMs are notoriously a lot more work than people think, and PyPy was already a pretty good Python VM by that point. Still, I didn’t think much more about it, until I happened to bump into Kevin Modzelewski, one of Pyston’s creators, at a conference a year or two later. I had a very interesting conversation with him, as part of which I asked him why Dropbox had decided to create a brand new Python VM. Simplifying things a bit, he said that PyPy performed badly on some common Dropbox idioms, and that he’d distilled the worst offender down to a benchmark where PyPy (a clever JIT compiler) was over 4 times slower than CPython (a simple interpreter).

Looking at the benchmark, the problem became clear very quickly: it was recursive. To my surprise, RPython (the language/framework in which PyPy is written) was unrolling [6] all recursive calls. Unrolling is a powerful, but dangerous, optimisation: it can make programs faster when used judiciously, but used incautiously it causes code bloat and degraded performance. Since unrolling all recursive calls is akin to unrolling all loops, this seemed unlikely to be a good idea. I quickly confirmed with the core developers that this wasn't an intended outcome, but rather a consequence of tracing's naturally aggressive tendency to inline.

I thought this would be a fun problem to debug, as it involved a part of RPython that I had not previously looked at. It took 2 or 3 days before I had found the right part of RPython, worked out what was going on, and found a fix. The core of the resulting diff is fairly simple:

diff --git a/rpython/jit/metainterp/pyjitpl.py b/rpython/jit/metainterp/pyjitpl.py
--- a/rpython/jit/metainterp/pyjitpl.py
+++ b/rpython/jit/metainterp/pyjitpl.py
@@ -951,9 +951,31 @@
         if warmrunnerstate.inlining:
             if warmrunnerstate.can_inline_callable(greenboxes):
+                # We've found a potentially inlinable function; now we need to
+                # see if it's already on the stack. In other words: are we about
+                # to enter recursion? If so, we don't want to inline the
+                # recursion, which would be equivalent to unrolling a while
+                # loop.
                 portal_code = targetjitdriver_sd.mainjitcode
-                return self.metainterp.perform_call(portal_code, allboxes,
-                                                    greenkey=greenboxes)
+                inline = True
+                if self.metainterp.is_main_jitcode(portal_code):
+                    for gk, _ in self.metainterp.portal_trace_positions:
+                        if gk is None:
+                            continue
+                        assert len(gk) == len(greenboxes)
+                        i = 0
+                        for i in range(len(gk)):
+                            if not gk[i].same_constant(greenboxes[i]):
+                                break
+                        else:
+                            # The greenkey of a trace position on the stack
+                            # matches what we have, which means we're definitely
+                            # about to recurse.
+                            inline = False
+                            break
+                if inline:
+                    return self.metainterp.perform_call(portal_code, allboxes,
+                                greenkey=greenboxes)

In essence, what this diff does is the following. When we're tracing (roughly speaking, “JIT compiling”) and about to trace into a function F (implicitly identified here by the “greens”), we check the tracer's call-stack to see if F is present. If it is, then inlining F again would be akin to unrolling, and we stop tracing.

Without worrying too much about the details, I hope you’ll agree that it’s a fairly small diff. What's more, it's effective: performance on PyPy improved by 13.5x (meaning that PyPy went from 4.4x slower than CPython to over 3x faster). At this point, I was patting myself on the back for a job well done, and indulging my ego by wondering if Dropbox would have created a new Python VM if this fix had been present.

However, there was a problem. Yes, I'd made one benchmark 13.5x faster, but PyPy has its own benchmarking suite. When I ran the above fix on PyPy’s benchmark suite of 20-odd benchmarks, 5 became slower (up to 4x slower). I knew that the core PyPy developers would, quite reasonably, not accept a performance “fix” that made performance of the main benchmark suite worse. First, I added a mechanism to control how many levels of recursion are unrolled. Second, I set about uncovering how many unrollings of recursive calls would recover the performance of the main PyPy benchmarks.
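
For the curious, here is a rough sketch, in plain Python rather than RPython, of the shape that the “how many levels of recursion to unroll” mechanism takes: rather than refusing to inline as soon as the greenkey is found on the tracer's stack (as in the diff above), one counts how many times it already appears and only stops inlining beyond a threshold. The names below are illustrative and do not match RPython's actual identifiers:

# Illustrative sketch only: the identifiers do not match RPython's internals.
MAX_UNROLL_RECURSION = 7   # the value that was eventually merged

def should_inline(greenkey, trace_stack):
    # trace_stack holds the greenkeys of the calls the tracer is currently
    # inside. Each time the greenkey we're about to inline already appears,
    # one level of recursive "unrolling" has already happened in this trace.
    unrollings = sum(1 for gk in trace_stack if gk == greenkey)
    return unrollings < MAX_UNROLL_RECURSION

print(should_inline("fib", []))             # True: no recursion yet
print(should_inline("fib", ["fib"] * 7))    # False: the limit has been reached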

I thus did some basic benchmarking, putting this table in the email I sent to the pypy-dev mailing list:

     #unrollings |  1   |  2   |  3   |  5   |  7   |  10  |
-----------------+------+------+------+------+------+------+
hexiom2          | 1.3  | 1.4  | 1.1  | 1.0  | 1.0  | 1.0  |
raytrace-simple  | 3.3  | 3.1  | 2.8  | 1.4  | 1.0  | 1.0  |
spectral-norm    | 3.3  | 1.0  | 1.0  | 1.0  | 1.0  | 1.0  |
sympy_str        | 1.5  | 1.0  | 1.0  | 1.0  | 1.0  | 1.0  |
telco            | 4    | 2.5  | 2.0  | 1.0  | 1.0  | 1.0  |
-----------------+------+------+------+------+------+------+
polymorphism     | 0.07 | 0.07 | 0.07 | 0.07 | 0.08 | 0.09 |

Looking back, I’m slightly embarrassed by the quality of benchmarking this represents, but I hope you’ll forgive me for that. The top half of the table (from hexiom2 to telco) represents the subset of PyPy benchmarks that my fix had slowed down. For example, with 1 unrolling of recursion (i.e. equivalent to inlining the function once but performing no further “unrolling”), raytrace-simple ran 3.3x slower than under “normal” PyPy; with 2 levels of unrolling it ran 3.1x slower; and by 7 levels of unrolling my branch and normal PyPy ran the benchmark at the same speed. On the bottom row is Kevin’s benchmark (which, for reasons I can no longer fathom, has timings in seconds): it’s fastest at 1 unrolling, but even at 7 unrollings it’s only lost a bit of performance.

The table above clearly shows that RPython should unroll recursive function calls 7 levels deep. On that basis, my email to the PyPy list proposed that number; that was the number that was merged into RPython; and that's the number that's still used today.

However, I doubted at the time that 7 was the right value, and today I’m almost positive that it’s not. I'm almost certain that the right value is going to be much closer to 1 — indeed, probably 1 itself. There's no doubt that simple recursive function calls (of the sort found in many of the PyPy benchmarks) benefit from unrolling. However, if unrolling every loop was a good idea, we'd expect every static compiler to do it — but they don't [7]. Recursive functions are like loops, but worse: functions are, on average, bigger than loops. However, in the context of tracing, my reasoning is that we care a lot more about the worst case of unrolling than we do about the best case. My major concern is that recursive functions are often heavily data dependent. Common examples of this are the tree walkers found in compilers, which walk over an abstract syntax tree and compile the program. In such cases, the recursive function looks at a node, does something based on that, and then visits the node’s children. The challenge with this is that the path the recursive function takes depends entirely on the structure of the input tree. This means that the amount of code executed between each call of the recursive function can vary hugely: we frequently end up tracing paths which are never used in full again. The problem is that meta-tracing is a very expensive activity, causing execution to temporarily slow down by around a couple of orders of magnitude: tracing things that aren’t used again can lead to significant performance losses. Even worse, these losses are often invisible, in the sense that it’s not really a flaw in the user’s program, but an often inevitable mismatch with the dynamic compilation approach.
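
To make the “data dependent” point concrete, here is a small made-up example of the kind of tree walker I have in mind: the sequence of recursive calls, and hence the paths a tracer would record, depends entirely on the shape of the input tree, which differs from program to program:

# A made-up AST evaluator: which branches execute between successive
# recursive calls depends entirely on the input tree's shape, so a trace
# recorded for one tree may never be useful in full for another.
def eval_node(node):
    kind = node[0]
    if kind == "num":
        return node[1]
    elif kind == "add":
        return eval_node(node[1]) + eval_node(node[2])
    elif kind == "mul":
        return eval_node(node[1]) * eval_node(node[2])
    elif kind == "neg":
        return -eval_node(node[1])
    else:
        raise ValueError("unknown node kind: %s" % kind)

# Two similarly sized trees with different shapes lead to quite different
# sequences of recursive calls.
print(eval_node(("add", ("num", 1), ("mul", ("num", 2), ("num", 3)))))  # 7
print(eval_node(("neg", ("add", ("num", 4), ("neg", ("num", 5))))))     # 1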

Realistically, no benchmark suite can ever be complete, in the sense that it can claim to represent the performance behaviour of every possible program. A better question is whether the benchmark suite is representative of a fairly large subset of programs. The above example shows that PyPy’s standard benchmark suite doesn't cover some fairly common idioms. How many other common idioms aren’t covered? My strong intuition is that the answer is “many”. This isn’t to knock PyPy’s benchmark suite: indeed, it’s better than most other VMs’ benchmark suites. However, we often derive a false sense of comfort from benchmark suites: too often the phrase “this optimisation sped up the benchmark suite by 3%” is used interchangeably with “this optimisation speeds up programs by 3%”. The harsh reality is that our benchmark suites reject some optimisations that would make the vast majority of programs run faster, and accept some optimisations that make the vast majority of programs run slower. How often we get unrepresentative answers from our benchmark suites is anyone’s guess, but I fear it has happened more in the past than we imagine.

Are benchmark suites correct guides to performance?

The first war story might make you wonder whether or not benchmark suites tell us much about performance of programs in general. But a second issue with benchmark suites is whether the answer we receive is even correct with respect to the benchmark suite itself. Put another way, if a benchmark suite tells us that an attempted optimisation makes the benchmark suite faster or slower, is that answer reliable?

Alongside the main experiment I reported in the previous blog post, we also (separately) benchmarked SpiderMonkey and V8 against the Octane benchmark suite. Octane consists of 17 benchmarks, originally from V8; it fairly quickly became one of the most important JavaScript benchmark suites. Despite not knowing JavaScript at that point, I took on the task (“how hard can it be?”) of getting Octane into shape [8]. In essence, all I did was put in a for loop that ran each benchmark for 2000 in-process iterations [9]. That done, I ran my slightly altered Octane benchmark suite on d8 (the command-line way of running programs on V8) and got the following output:

$ d8 run.js
Richards
DeltaBlue
Encrypt
Decrypt
RayTrace
Earley
Boyer
RegExp
Splay
NavierStokes
PdfJS

<--- Last few GCs --->

14907865 ms: Mark-sweep 1093.9 (1434.4) -> 1093.4 (1434.4) MB, 274.8 / 0.0 ms [allocation failure] [GC in old space requested].
14908140 ms: Mark-sweep 1093.4 (1434.4) -> 1093.3 (1434.4) MB, 274.4 / 0.0 ms [allocation failure] [GC in old space requested].
14908421 ms: Mark-sweep 1093.3 (1434.4) -> 1100.5 (1418.4) MB, 280.9 / 0.0 ms [last resort gc].
14908703 ms: Mark-sweep 1100.5 (1418.4) -> 1107.8 (1418.4) MB, 282.1 / 0.0 ms [last resort gc].


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x20d333ad3ba9 
    2: extractFontProgram(aka Type1Parser_extractFontProgram) [pdfjs.js:17004] [pc=0x3a13b275421b] (this=0x3de358283581 ,stream=0x4603fbdc4e1 )
    3: new Type1Font [pdfjs.js:17216] [pc=0x3a13b2752078] (this=0x4603fbdaea9 ,name=0x4603fbd9c09 ,file=0x4603fb...


#
# Fatal error in CALL_AND_RETRY_LAST
# Allocation failed - process out of memory
#

zsh: illegal hardware instruction  d8 run.js

I can very distinctly remember having a sinking feeling when I saw that. Fortunately, when I looked at it more closely, it became clear that the situation wasn’t as bad as I had first feared. Starting from the top of that output, you can see that a number of Octane's benchmarks (from Richards to NavierStokes) ran successfully, before PdfJS caused d8 to fail. A cursory look at the error message shows that PdfJS failed because V8 had run out of memory. I decided to have a quick look at whatever timing outputs I had got (bearing in mind that this was done without Krun, so it's so-so benchmarking) before the crash and plotted them:

[Plot: timings of the modified Octane suite on d8, up to the point of the crash.]

That graph looks interesting, and there’s a clear blow-up towards the end, but there’s not quite enough data to see if there’s anything meaningful going on before that. Fortunately, the way that I had made in-process iterations was to bundle several of Octane’s ‘inner’ iterations into longer ‘outer’ in-process iterations. I therefore reran things, this time recording how long each ‘inner’ iteration took to run:

[Plot: the same run, with timings recorded per ‘inner’ iteration.]

Now this is starting to look interesting, but the huge spike at the end makes it hard to see the denser detail elsewhere in the plot. I therefore chopped the last 80 or so in-process iterations off and replotted:

[Plot: the per-inner-iteration timings with the final 80 or so iterations removed.]

At this point the likely source of the error became clear to me. The first half of this (sub-)plot has regular spikes, which grow roughly linearly: I guessed that they were the garbage collector running, which was probably having to do a little more work each time it was run. The second half of this plot shows a complete change in the timing data. I guessed that the garbage collector was probably going into some sort of low-memory mode, where it trades off memory for time, which causes it to run much more often, and to run much more slowly. It continually has to do more work, until – just to the right of this final plot – it explodes with an out of memory exception. If ever I’ve seen a plot which screams “memory leak”, this is it.

At this point, I had a very strong suspicion about what the cause was, and a guess that this was more likely to be a bug in PdfJS than in V8. But how likely was it that I would find a bug in a benchmark that was 33,053 lines of code in a language which I didn't know? Frankly, it sounded like a lost cause. Still, I thought it would be worth having a look, so I loaded pdfjs.js into my trusty text editor.

To my surprise, the second line of code after the licence was var canvas_logs = [];, which looked a lot to me like an assignment of a mutable array to a global variable. That’s an obvious candidate for problems, but surely far too obvious to be the bug. Still, I thought I’d see where canvas_logs is used, just in case, so I scrolled to the first use of the variable in the file.

At line 50, it looked to me like content was being appended to the array, but nowhere in that function was anything being removed from it. Indeed, there are 3 other references to this variable in the file, 2 of which check the array’s length, and 1 of which accesses elements in the array. Bingo! Within about 90 seconds, I had found the source of the memory leak. Emphasising my cluelessness with JavaScript, it then took me about 10 minutes to work out how to empty an array of its contents. Programming is a humbling job.

The fix to PdfJS’s memory leak is thus simply to add the line canvas_logs.length = 0 to the beginning of the runPdfJs function. However, to my regret, the fix has not been merged into Octane because, some time after I raised the PR, Octane was retired.
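
The underlying anti-pattern is easy to reproduce in any language: a benchmark accumulates state in a global across in-process iterations and nothing ever clears it. Here is the same pattern, and the same style of one-line fix, sketched in Python rather than JavaScript (the names are mine, not PdfJS's):

# Python analogue of the PdfJS leak: a module-level list that every
# in-process iteration appends to, but which nothing ever empties.
logs = []

def run_iteration_leaky(data):
    # ... do the benchmark's real work ...
    logs.append(data)    # grows without bound across in-process iterations

def run_iteration_fixed(data):
    del logs[:]          # the moral equivalent of canvas_logs.length = 0
    # ... do the benchmark's real work ...
    logs.append(data)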

A reasonable question to ask is whether the memory leak in PdfJS is really a problem. Presumably not many other people have run the benchmark long enough to see the crash I saw. Besides, a benchmark’s job is to measure a VM doing some work and, whether there’s a memory leak or not, the PdfJS benchmark is certainly doing some work. This latter point is alluring but, I believe, dangerous. Let me give a couple of scenarios to illustrate this point.

First, it is common to add benchmarks to a benchmark suite when one perceives that they tell us something new about the compiler in question. For example, let’s imagine that I realise that my benchmark suite is light on benchmarks which measure changes to my VM’s code generator. I have a look around and, having profiled a couple of its in-process iterations, decide that PdfJS is the perfect way of addressing this weakness in the benchmark suite. Later on, I make a change to the code-generator which I hope optimises programs such as PdfJS. I run PdfJS for a while (after all, I know that I need to think about peak performance) and find out that the performance hasn’t changed very much, causing me to discard my changes to the code-generator. The problem here is that PdfJS isn’t really a test of a code-generator, rather it is a test of the garbage collector in low-memory situations. That’s not an unreasonable thing to benchmark, but it’s not what I expected PdfJS to be benchmarking [10]. My change to the code-generator might really have been an improvement, but it might have been overwhelmed by the time spent in garbage collection by this particular benchmark.

Second, many benchmark suites – including Octane – have a mode which says “run the benchmark for n seconds and then report how many in-process iterations completed in that time”. Let’s assume that PdfJS manages to complete 100 in-process iterations in 10 seconds. I now introduce a new, complex, optimisation into the VM, which makes my benchmark complete 101 in-process iterations in 10 seconds. It looks like I’ve got a 1% speed-up, which might not be enough to justify the additional complexity in the VM. However, if each in-process iteration does more work than its predecessor, the real improvement is higher — potentially much, much higher (see the last few in-process iterations of the graphs earlier in this post).
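
A little arithmetic shows how misleading the “iterations completed in n seconds” metric can be when per-iteration cost grows. The sketch below uses entirely made-up numbers: each iteration costs a little more than its predecessor, an optimisation makes every iteration genuinely 10% faster, and yet the iteration count within the time budget barely moves:

# Entirely made-up numbers, purely to illustrate the point: per-iteration
# cost grows as the run goes on (e.g. because of a memory leak), so later
# iterations are far more expensive than early ones.
def iterations_within_budget(budget_secs, speedup=1.0):
    elapsed, n = 0.0, 0
    while True:
        cost = 0.05 * (1.05 ** n) / speedup   # cost of iteration n, in seconds
        if elapsed + cost > budget_secs:
            return n
        elapsed += cost
        n += 1

base = iterations_within_budget(10.0)
opt = iterations_within_budget(10.0, speedup=1.10)  # genuinely 10% faster
print(base, opt)   # 49 vs 50: the count rises by about 2%, not 10%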

In both the above examples, a faulty benchmark like PdfJS can cause us to systematically underappreciate, or exclude, effective optimisations. Although a little less likely, it's also possible to imagine it having the opposite effect (i.e. causing us to include ineffective, or even deleterious, optimisations).

Still, you might well think that I’m banging on a bit too much about a single faulty benchmark. All code has bugs, and it’s not surprising that some benchmarks have bugs too. It’s not unreasonable to suggest that the odd bad benchmark probably won’t lead us too far astray. However, when I looked further at Octane, I found that CodeLoadClosure also has a memory leak [11], and that zlib complains that it “cannot enlarge memory arrays in asm.js” (which might be a memory leak, although I didn’t investigate further). In other words, a non-trivial portion of Octane’s benchmarks appear to be misleading, and it's reasonable to assume that, on occasion, they will have misled VM developers. Does this mean that Octane is an unusually terrible benchmark suite? Probably not, but it's certainly made me even more cautious about benchmark suites than I was before.

Summary

Returning to this post's title: why aren't more users more happy with our VMs? The conclusion I've been forced to is that our benchmarking and our benchmarks have misled us. In a perfect example of Goodhart's Law, we've too often optimised for metrics that aren't representative of what we really care about.

Fortunately, I think it's possible to do a little better without undue effort. Here are my simple suggestions:

  1. Running benchmarks for longer uncovers a surprising number of latent problems. Ultimately, all the odd performance effects we saw in part one of this blog post resulted from running benchmarks for longer than anyone had done before.
  2. We must accept that neither peak performance nor a steady state is guaranteed to occur. Benchmarking methodologies that assume one or both of these concepts are fundamentally broken.
  3. I think it’s intellectually dishonest not to report warmup time. There’s no doubt that warmup time is annoying and that it can paint VMs in an unflattering light. But by pretending it’s a non-issue we’re misleading users about what performance they can expect, and storing up long-term resentment.
  4. We might be able to learn from the machine-learning community and separate out training benchmark suites (which we use to derive the values which drive the various heuristics in VMs) from validation suites (on which we report actual benchmarking numbers).
  5. We need far, far more benchmarks than we currently have, and we have to keep adding benchmarks. In nearly all cases, the more benchmarks we have, the less likely we are to mislead ourselves [12]. My vague guess is that our current benchmark suites are perhaps an order of magnitude too small. But even much larger benchmark suites won’t solve the problem alone: fixed-size benchmark suites become gradually less useful as a VM matures, because the VM becomes more heavily trained towards (or, in a sense, “inoculated against”) the same old benchmarks.
  6. I suspect most users would be happier if we worried more about predictable performance (both making sure that most programs run quite fast, and that programs run at more-or-less the same speed when you rerun them) than we did about peak performance (which currently dominates our thoughts). Put another way, we need to accept that optimisations often come with trade-offs: some appealing optimisations, for example, are highly non-deterministic and may make performance worse almost as often as they make it better.

A big question is whether fixing existing VMs is practical or not. I must admit that the current signs (including some work that I've been involved in) aren't hugely positive in that regard. Most VMs are large (typically hundreds of thousands of lines of code), mature (many people have worked on them, over many years), and complex (squeezing every last bit of juice out of programming languages and hardware). Subtle assumptions, both explicit and implicit, are spread throughout and across the codebase — and many of those assumptions have been forgotten (generally, though not exclusively, because people have moved on). Hunting down performance problems thus often devolves into wild stabs in the dark, looking for clues almost at random throughout large, complex codebases. Perhaps unsurprisingly, such hunts fail more often than they succeed.

However, I remain optimistic. First, I think we will start to see a gradual improvement in current VMs. Second, even despite the various problems I’ve detailed, many users are already fairly happy with their VMs: any improvements we make can only make them happier (and, if we’re lucky, make a few unhappy people happy). Speaking very personally, I still think that we don’t really know what the performance ceiling of dynamic compilation is. It is possible that we’re very close to the best performance possible, although my gut feeling is that there’s significant untapped potential. To that end, I’m now working on a new VM research project. It’s far too early to say anything useful about that work yet, and who knows what will happen with funding and the like, but maybe one day it’ll help nudge the field forward a little bit further.

Finally, if you want to repeat the warmup experiment, use Krun, or produce performance stats, then you can: everything’s under an open-source licence, so enjoy!

Acknowledgements: My thanks to Edd Barrett and Carl Friedrich Bolz-Tereick for comments on this blog post. This research was funded by the EPSRC Cooler (EP/K01790X/1) grant and Lecture (EP/L02344X/1) fellowship, and gifts from Cloudflare and Oracle Labs.


Footnotes

[1] There are two main reasons for restricting things to these two VMs: not all VMs make such profiling information available; and the rest tend to have scant, or missing, information on how to access such features. HotSpot is an honourable exception (it’s well documented in this regard), but we only managed to get the numbers we needed out of PyPy because we had someone from the core team who we could prod for advice.
[2] The cost of extra profiling information to record garbage collection data and JIT compilation can be quite significant — certainly significant enough that it can change the apparent performance of the main benchmark. It’s therefore vital that this extra information is recorded in a separate experiment, and doesn’t affect the main experiment.
[3] We couldn't work out what the units for the PyPy profiling information are (they're not documented, and it's not obvious from the source code either): as we’ll soon see, that doesn’t matter much in this case.
[4] Unlike PyPy, the units for the garbage collection and JIT compilation sub-plots in HotSpot are clear and simple: seconds.
[5] Sometimes it can feel like everyone, and in one or two cases even their dog, has tried making a Python implementation. Here's an incomplete list of Python implementations (with thanks to Carl Friedrich Bolz-Tereick for pointing out several that I didn’t know about): CPython; IronPython; Jython; Nuitka; Psyco; PyPy; Pyston; Shed Skin; Stackless; Starkiller; TrufflePython; Unladen Swallow; WPython; Zippy.
[6] For those unfamiliar with this optimisation, consider this (contrived) code:
i = 0
while i < 2:
  f()
  i += 1
This can be “unrolled” (i.e. the loop can be expanded) to the equivalent:
f()
f()
In this case, the unrolled version is smaller and faster (since it has no branches). In more complex examples, unrolling can uncover optimisations that the loop would otherwise obscure.
[7] The situation in a tracer is a little different: when you trace a loop, you effectively copy the trace (i.e. unroll it), with the first iteration “setting things up” for the second iteration. That’s not really equivalent to unrolling in the normal sense, though.
[8] Our aim here was to show that you could use warmup_stats without having to use Krun. We thus deliberately stuck to Octane’s behaviour – shared with many other benchmark suites – of running multiple benchmarks in a single process execution. It’s highly likely that benchmark ordering will have some impact on one’s perception of performance, but I have no idea to what extent. That’s for someone else to look into!
[9] In the end, I had to do a bit more than this, because the default Octane runner is written in a style that I found hard to follow. I ended up writing a new runner that dropped a couple of features, but was massively simpler.
[10] Even worse, PdfJS isn’t even a very good way of benchmarking how a VM performs in a low memory situation, because it doesn't hit that point until quite a long way in.
[11] Since I quickly exceeded the 90 second budget that PdfJS had led me to expect was reasonable for finding such bugs, I very quickly timed out in finding the cause of the leak.
[12] The only quibble here is that if, somehow, you manage to have a large number of benchmarks which are either identical or extremely similar, you might end up with a skew towards a subset of a VM’s behaviour. Still, relative to where we are today, it’s a problem I’d be willing to risk.

Why Aren’t More Users More Happy With Our VMs? Part 1

September 5 2018

Programming language performance is something that nearly everyone cares about at some point, whether they realise it or not. If you’re a programmer, you sometimes have to make your programs run fast; if you’re a user, you might be frustrated at a program that runs slowly, without knowing that it’s caused by a poor programming language implementation. Those of us who work on, or around, programming language Virtual Machines (VMs) tell a good story about performance, but a surprising number of users seem unhappy with the performance of their programs. Sometimes, yes, they’re unrealistic, but are they always so? In this first blog post (based on this paper) of two, I’m going to show that programs running on VMs often don’t follow the simple performance patterns that nearly all of us expected. Perhaps that’s why users aren’t as happy as VM developers think they should be?

A typical claim

Every so often there is a spate of blog posts along the lines of “Language XYZ is faster than C”. [1] Language XYZ varies, of course, whilst C is always... C. Love it or hate it [2], C is, overall, probably the fastest language there is. At the very least, it’s the bar by which all other languages measure themselves.

For example, a claim made about HotSpot (the standard Java VM, often called just “the JVM”, though there are other Java VMs) is that it makes Java programs as fast as C programs compiled with gcc -O2 [3]. At first glance this statement seems reasonable enough: gcc -O2 produces fast code, as, most people agree, does HotSpot. But what’s actually being compared here? A static (or “Ahead of Time”) compiler such as GCC compiles a program once; the resulting executable/binary can be run as many times as you want, even on another machine without GCC. In contrast, VMs such as HotSpot observe a program running and dynamically compile it into machine code. Dynamic compilers are thus subject to a warmup cost: it takes programs running on them a while to reach a steady state of peak performance. VM developers are very keen on the steady state of peak performance and would rather forget that warmup exists: VM performance is universally reported relative to the steady state of peak performance. Thus the claim at the beginning of this paragraph entirely ignores warmup costs.

A simple representation of this performance model can be seen in the following:

[Plot: an idealised model of VM warmup, with per-iteration times falling to a steady state of peak performance.]

Let’s assume that we’re running a simple benchmark in a for loop: we show each in-process iteration (i.e. each iteration of the loop; the reason for this slightly long-winded name will become apparent soon) of the loop on the x-axis; and the y-axis shows how long each in-process iteration takes to run. When a program starts running in a VM, it’s executed by a profiling interpreter which is a simple, general, but slow implementation of a programming language. As the program executes, the interpreter performs simple profiling to help it determine which parts of the program are frequently executed (e.g. recording how often a function is called or a loop executed). After a little while, the “hot” parts of a program are identified, causing those parts to be compiled into machine code: it’s possible that execution will slow down temporarily while compilation occurs. Once compilation has completed, we reach nirvana: the “steady state” (i.e. it never changes) of “peak performance” (i.e. the program runs faster than it did before). The time from the beginning of the first in-process iteration to the point that the steady state of peak performance is reached is the “warmup” period. [4]
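
As a toy illustration of the “profiling identifies hot code” idea, here is the simplest possible counter-based sketch. Real VMs are vastly more sophisticated than this, and the threshold of 1000 below is just an arbitrary stand-in:

# A toy sketch of counter-based hotness detection; no real VM is this simple.
HOT_THRESHOLD = 1000   # arbitrary stand-in for a VM's hotness threshold

class ToyVM:
    def __init__(self):
        self.counters = {}   # per-function call counts gathered while "interpreting"
        self.compiled = {}   # functions we've "compiled" (here: just marked as hot)

    def call(self, name, func, *args):
        if name in self.compiled:
            return self.compiled[name](*args)       # fast path: "machine code"
        self.counters[name] = self.counters.get(name, 0) + 1
        if self.counters[name] >= HOT_THRESHOLD:
            # In a real VM, dynamic compilation would happen here, possibly
            # slowing execution down temporarily while it runs.
            self.compiled[name] = func
        return func(*args)                          # slow path: the "interpreter"

vm = ToyVM()
for i in range(2000):
    vm.call("square", lambda x: x * x, i)           # becomes "hot" after 1,000 calls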

Of course, a real system will be somewhat noisier than the above suggests: we might expect occasional timing blips while garbage collection occurs; compilation might happen in multiple stages; and so on. However, it captures the basic notions well enough that I’ll use it as the basis for the rest of this blog post.

Measuring warmup

I’ve done a reasonable amount of work on, and with, VMs in recent years (as a couple of examples, consider storage strategies or language composition) which, inevitably, means I’ve spent quite a bit of time measuring performance. The first time that I put a bit of thought into benchmarking, I realised that benchmarking without considering warmup was unsatisfactory, although I didn’t manage to come up with a very sophisticated way of addressing this problem. I also remember being baffled at one or two of the results, and convinced that I’d done something wrong: why did CPython running the Mandelbrot benchmark have a running time of 134.419s and an incredibly wide confidence interval of ± 32.347s? I had no idea, so, in the best tradition of VM measurements, I buried my head in the sand and hoped that no-one would say anything. They didn’t.

The first approach I saw that really tried to do something sophisticated with VM measurement was the work of Kalibera and Jones. [5] The methodology they developed tries hard to work out if/when VM warmup has completed, so that accurate figures can be given for the steady state of peak performance. I can’t say enough good things about this work, which has never received the attention it deserved. There are a few possible reasons for this: the methodology is fairly labour intensive (humans have to look at performance graphs and feed information into a measurement algorithm, which does the rest); some of the maths involved is on the scary looking side; and for a long time there were no publicly available implementations. We (well, mostly Carl Friedrich Bolz and Edd Barrett) eventually put together an implementation that largely solves the last two problems, and we used the resulting methodology in a couple of papers (1 and 2). However, the first problem seemed insoluble: the approach requires quite a lot of a skilled human’s time.

In the process of using the Kalibera and Jones methodology, we noticed quite a lot of variation in the warmup time of different VMs and cases where VMs didn’t seem to warmup at all. This was surprising because pretty much every paper we’d read until that point had assumed – and, in many cases, explicitly stated – that warmup was a quick, consistent, thing. On that basis, it seemed interesting to see how the warmup time of different VMs compared. In May 2015, I asked Edd if he’d knock together a quick experiment in this vein, estimating that it would take a couple of weeks. After a couple of weeks we duly had data to look at but, to put it mildly, it wasn’t what we had expected: it showed all sorts of odd effects. My first reaction was that if we showed this data to anyone else without checking it thoroughly, we’d be in danger of becoming a laughing stock. It was tempting to bury my head in the sand again, but this time it seemed like it would be worth digging deeper to see where we’d gone wrong.

We thus embarked on what we soon called the “warmup experiment”. We quickly realised that, fundamentally, we needed to test a simple scientific hypothesis: that VMs reach a steady state of peak performance. In other words, we wanted to find out whether the fundamental assumption on which previous VM benchmarking has relied holds true. This seems like it should be a simple thing to do, but we soon realised that there are all sorts of things that could interfere with our benchmarking — what are formally called “confounding variables”. When I’d previously benchmarked things, I was happy to run things on a new-ish Linux server installation, and report the results. [6] However, even a seemingly “bare bones” installation can still do many unexpected things. We thus continually iterated on the experiment, at every stage asking ourselves: “what if XYZ coloured our results? How can we stop, or at least reduce the chances of, this happening?”

Thinking in such an adversarial manner isn't natural for most of us, but it can be done. For example, early on we realised that if benchmarks run overnight there’s a good chance that cron is likely to run at least one background task [7]. Similarly, if you’re running sshd (which, on a Unix server, you almost certainly are), a botnet [8] can try using that to log in vast numbers of times: if you’ve ever been subject to such an attack, you’ll know that it can consume a noticeable amount of CPU. But there are many other potential problems. If one benchmark causes a machine to go into swap, subsequent benchmarks will be much slower, simply because some parts of the system will have to be paged back in from disk. Similarly, VMs can leave cache files around to speed up subsequent runs: if that happens, the first run might execute much more code than subsequent runs. After a while of doing this, we shared our initial attempts with other people, who suggested a few more things for us to take a look at, notably whether CPU throttling (due to overheating) was occurring.

After a couple of years of working on this, we had put together a pretty extensive list of potential problems and, in nearly all cases, we’d managed to do something to control them. Because there were so many of these, we wrote a new benchmark runner Krun [9]. In some senses, Krun’s job is very simple: it just has to run benchmarks in a predictable way and record the results. However, there’s a lot of detail hiding just below the surface. I’m not going to go into every detail of Krun, because we’d be here all day (the paper has more details but even that elides some of the nitty-gritties), but let me give you an idea of some of the things Krun does:

  • Turns off CPU turbo boost (which unpredictably interferes with the CPU’s running speed [10]).
  • Turns off the network card (so botnets and the like can’t cause the kernel, or userland, to do extra work).
  • Makes sure the CPU is running in full performance mode (and, on Linux, in tickless mode).
  • Shuts down daemons like smtpd and cron.
  • Reboots the machine before running the first process execution (which is our fancy way of saying "shut the VM down and start it again from scratch") and after every subsequent process execution (so that if one process execution does get the machine into a nasty state (e.g. into swap), the machine’s state will be reset to something fairly predictable before the next process execution is run).
  • Before every process execution, the Unix account used to run benchmarks is deleted (including its home directory) and recreated (so cache files can’t persist between process executions).
  • Runs every process execution with the same stack / heap limits.
  • Before the first process execution, all the machine's temperature sensors are read; subsequent process executions aren’t run until the temperature sensors are more-or-less back to those readings (because we have no idea how non-CPU components are affected by temperature); a sketch of this check is shown below, after the list.
  • Checks whether the CPU clocked down and reruns any benchmarks affected (this happens perhaps 0.5% of the time, so it’s rare, but it does happen).

And so on and so forth.
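
To give a flavour of the grunt work behind checks like the temperature one, here is a rough sketch of “wait until the sensors are more-or-less back at their baseline readings” using Linux's thermal zone files. Krun's real implementation is more careful than this (about which sensors to trust, tolerances, giving up after too long, and so on); the 3°C tolerance and 10 second poll below are made up:

# Rough sketch only: wait for the machine's temperature sensors to return to
# (roughly) the readings taken before the first process execution. The
# tolerance and polling interval are invented for illustration.
import glob, time

def read_temps():
    temps = {}
    for path in glob.glob("/sys/class/thermal/thermal_zone*/temp"):
        with open(path) as f:
            temps[path] = int(f.read().strip()) / 1000.0   # reported in millidegrees C
    return temps

def wait_for_baseline(baseline, tolerance=3.0, poll_secs=10):
    while True:
        current = read_temps()
        if all(abs(current.get(k, baseline[k]) - baseline[k]) <= tolerance
               for k in baseline):
            return
        time.sleep(poll_secs)

baseline = read_temps()    # taken before the first process execution
# ... run a process execution ...
wait_for_baseline(baseline)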

There are a couple of things about Krun which surprise people. First, Krun doesn’t use CPU pinning. We experimented extensively with this, but found that it had odd effects. For example, a seemingly good idea on a 4 core machine is to reserve 1 core for the OS and 3 cores for things being benchmarked. However, we found at least 1 VM which – on seemingly single-threaded benchmarks – behaved poorly with such a setup. We suspect this is because the VM asks the OS how many cores the machine has (receiving the answer “4”), and then uses that many compilation and/or GC threads; those 4 threads then get pinned to 3 cores, contending unpredictably with each other. [11]

Second, Krun does not disable ASLR (Address Space Layout Randomisation). This introduces obvious non-determinism into proceedings and is thus often frowned upon. Indeed, initially we turned ASLR off for this reason. However, I ended up being deeply influenced by Stabilizer, which convincingly shows that changing a program’s layout (most obviously through ASLR) means that programs no longer have a single performance point — rather, they have a set of performance points. In other words, it doesn’t make sense to say “this program runs at speed X” by picking one layout for the program: one has to have some sense of the spread of program speeds with different layouts, because real users will encounter those different layouts, each of which is a valid version of the program. Unfortunately, Stabilizer doesn’t build on modern systems, so we can’t be quite as statistically rigorous in randomising things like linking order. However, keeping ASLR on gives us an approximation of the effects of randomisation on the VM.

What to benchmark

There are two hard questions in program benchmarking: what to benchmark and how long to run things for. I’m going to tackle them in reverse order.

VM benchmarking generally involves running a program in a loop for a small number of in-process iterations: 5 is common; 30 is considered extremely large; and the largest I’ve ever heard of is 200. However, I knew from experience that some of the counters inside VMs that cause performance changes have thresholds of 1000 or 1024. To make sure that we could see if these counters had an impact on performance, we therefore needed to run benchmarks for more than 1024 in-process iterations. We quickly settled on 2000 in-process iterations as a reasonable number. We also wanted to see what level of variability there was between different process executions (i.e. stopping the VM and running it again from scratch). We had no idea what a good number was for this: we used 10 for a long time, before eventually settling on 30. To be clear, there was some thought behind the choice of 2000 in-process iterations, but 30 process executions was little more than a wild guess.
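
In outline (and ignoring everything Krun does to control the machine), the experiment's structure is just two nested loops, with the outer loop starting a fresh VM process each time. A bare-bones sketch, with a made-up command line rather than Krun's real invocation:

# Bare-bones outline of the experiment's structure; none of Krun's machine
# control is shown, and the command line is invented for illustration.
import subprocess

PROCESS_EXECUTIONS = 30
IN_PROCESS_ITERATIONS = 2000

for pexec in range(PROCESS_EXECUTIONS):
    # Each process execution starts the VM from scratch; the in-process
    # iteration loop (and its per-iteration timing) lives inside the VM.
    subprocess.check_call(
        ["vm", "iterations_runner.py", "benchmark.py",
         str(IN_PROCESS_ITERATIONS), "results_%d.json" % pexec])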

Choosing what benchmarks to run is difficult, because one often hopes that benchmarks are representative of real programs. Different VM teams often disagree profoundly over benchmark selection because of this. Personally, I am sceptical that a small-ish set of programs (no matter how big the individual programs are) tells us that much about the complete set of possible programs. Fortunately for us, we had slightly different goals. We wanted to see how VMs perform on benchmarks that are well known, thus removing the “it’s just a weird program” retort from the debate. That then meant that we could use the well known Computer Language Benchmarks Game (CLBG) benchmarks. Although one can reasonably argue that these benchmarks are even less representative of general performance than other benchmarks, they have the virtue that every VM team out there has made sure that they run well on their VM.

We performed a few simple checks, and made several alterations, to the CLBG benchmarks. First, we wanted to make sure that – to the extent that a normal end user can tell – the benchmarks are deterministic, because then we could reasonably expect different process executions to behave similarly. We put prints at every if statement, while loop, and the like, and checked the results from multiple runs. In our case, we had to alter fasta so that the same random number was used for every in-process iteration. We also noticed that Java programs sometimes load classes in a different order in such a way that an end-user can observe. Fixing this in general is neither possible nor desirable, but we wanted to make sure our benchmark runner didn’t make things worse, so we performed some simple static games to avoid this issue.

We then dealt with a more subtle issue. Compilers can often notice that large chunks of benchmarks are computing precisely the same number over and over again, allowing them to replace such code with a constant. [12] This can quickly cause one to measure a much different thing than one expected. To avoid this, many benchmarks write intermediate results to stdout, which makes it harder for compilers to remove code. However, I/O calls are expensive (due to context switches and the like): one can quickly end up in a situation where one is measuring the number of I/O calls more than anything else. We therefore removed I/O from the core parts of all our benchmarks, replacing it with checksum calculations (with the same checksum across all language variants of a given benchmark) that we then (conditionally) write to stdout at the end of the benchmark. This doesn’t guarantee that compilers aren’t removing important parts of a benchmark, but it raises the bar substantially.
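
The following sketch shows the general shape of that transformation on a made-up benchmark: instead of printing intermediate results (and thus measuring I/O), each in-process iteration folds them into a checksum, which is compared against a known value and written out once at the end. The checksum scheme below is deliberately crude and is not the one our benchmarks use; the point is only that the compiler can't remove the work without changing observable output:

# Sketch of the "checksum instead of I/O" idiom on a made-up benchmark; the
# workload and expected value are invented for illustration.
def bench_iteration(n):
    checksum = 0
    for i in range(n):
        val = (i * i) % 7919                    # stand-in for the benchmark's real work
        checksum = (checksum * 31 + val) % (2 ** 32)
    return checksum

EXPECTED = bench_iteration(100000)              # in practice a constant, shared across
                                                # every language variant of the benchmark
for _ in range(2000):                           # 2000 in-process iterations
    assert bench_iteration(100000) == EXPECTED
print(EXPECTED)                                 # a single write, at the very end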

With that done, it was relatively simple to choose the systems to benchmark: we used a number of mainstream VMs (Graal 0.22; HHVM 3.19.1; TruffleRuby 20170502; HotSpot 8u121b13; LuaJIT 2.04; PyPy 5.7.1; and V8 5.8.283.32) and (as a base for comparison) GCC 4.9.4.

Results

At this point, I’ve hopefully convinced you that we designed a thoughtful, careful experiment. I could go into more detail, but it’s probably useful to look at some data. I’m going to use data from our v1.5 run (the data in the paper is from a slightly earlier run; the differences are minor). Let’s start with a simple example:

[Plot: the Fannkuch Redux benchmark on LuaJIT (OpenBSD, Core i7), process execution 14 of 30, classified as “warmup”.]

Starting at the top, the plot title tells us that this is the Fannkuch Redux benchmark, running on LuaJIT, on an OpenBSD Core i7 machine, and is the 14th of 30 process executions. Along the x-axis, we have the in-process iterations, and on the y-axis the time of each iteration. The inset helps us see the first n in-process iterations (in this case, 129) in a bit more detail. The first thing we can see is that this benchmark started slow (in-process iteration 1 took about 0.57233 seconds); quickly got somewhat faster for a while; and then, at about in-process iteration 190, reached the steady state of peak performance. This is almost exactly what VMs are supposed to do, although it took much longer to reach the steady state of peak performance than we might expect and, indeed, much longer than anyone had ever bothered benchmarking before. Nevertheless, this benchmark clearly warmed up.

The challenge then becomes working out where it warmed up. At first we had no idea how to work this out, other than using the Kalibera and Jones method — but we have 3,600 of the above plots to look at per machine, which is far too many to look at manually. Eventually (thanks to help from Vince Knight), we were put in touch with Rebecca Killick, a statistician who specialises in changepoint analysis. Intuitively, changepoint analysis automatically tells us when there are statistically significant shifts in our data. The vertical red dashed line in the plot above at in-process iteration 190-ish is a changepoint: the horizontal dashed line between changepoints (there are implicit changepoints at in-process iterations 1 and 2000) is a changepoint segment. In other words, changepoint analysis allows us to split the data in the plot above up, produce statistics on it, and classify it. We therefore developed a classification algorithm, which says (simplifying a bit) “if the last changepoint segment is the fastest changepoint segment in the plot, then this is a warmup”. [13] You can see the classification “warmup” at the end of the title. As an additional visual aid, in-process iterations before the steady state are coloured in grey.

The next type of plot we see regularly is the following:

SVG-Viewer needed.

This is the n-body benchmark running on PyPy on a Linux Core i7 machine. The first thing that you might notice is the big red blob around in-process iteration 1230: this is an outlier, an in-process iteration which is noticeably different in nature (i.e. much faster or slower) than its neighbours. It’s common in these sorts of analyses to ignore outliers if there’s good cause: in our case, we assume that they might be a garbage collection phase, or a context switch into the kernel, or something else that shouldn’t cloud our judgement about the benchmark. The second thing you might notice about this plot is that it only has a single changepoint segment. We classify such plots as “flat”. Because we’re giving the benefit of the doubt to VMs, we assume that warmup happened so quickly in such cases that we can’t observe it happening, meaning that the “flat” classification is morally equivalent to “warmup”.

Here’s another example of warmup, this time on V8:

SVG-Viewer needed.

The good news is that the benchmark did warm up and hit a steady state of peak performance, though the fact that it took just under 600 in-process iterations is surprising. It’s also interesting that there are several earlier changepoints, perhaps while various rounds of compilation were occurring.

We can also deal with more complex examples of warmup, this time on Graal:

SVG-Viewer needed.

The surprise here isn’t the number of outliers (perhaps they’re the garbage collector), but that it took over 900 in-process iterations to hit a steady state of peak performance. Still, the fact that it did so is surely a good thing.

The unexpected

At this point, things become a little more surprising. Consider this plot from HotSpot:

SVG-Viewer needed.

Notice that it started slow (you might need to look at the inset to see this), then got fast for 200 in-process iterations, and then became somewhat slower again. That’s not supposed to happen! It means that we need to adjust our classification algorithm: if the last changepoint segment in a plot isn’t the fastest, we classify it as “slowdown”.

In this case, the slowdown reduces execution speed by 5% which, to state the obvious, is fairly significant (in mature VMs, speed-ups of 5% often require months or years of work). But here’s why you should be really surprised. The benchmark in this plot is Richards: it’s one of the benchmarks that HotSpot has used to benchmark itself since before the VM was even called HotSpot. In other words, this VM is highly optimised for this benchmark — but no-one ever ran it long enough before to notice that performance gets much worse over time.

It turns out that slowdowns are common. Here’s an example from V8:

SVG-Viewer needed.

But there’s one final type of plot left — and it’s the most troubling of all. Here’s an example:

SVG-Viewer needed.

Notice here that the benchmark’s performance keeps shifting, never settling for long. Of course, perhaps it would if we ran it for even longer, but ultimately we have to set a cut-off somewhere. This now raises a tricky question: what should our criteria be for when a benchmark is said never to stabilise? Our choice was to say that if there are changepoints in the last 500 in-process iterations then the plot is classified as “no steady state”.
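
Pulling the rules above together, the simplified classification looks something like the following sketch (the real algorithm, as footnote 13 explains, also treats nearly identical segments as equivalent; the function and argument names here are mine):

    def classify(num_iters, changepoints, segment_means, last_n=500):
        # changepoints: positions of statistically significant shifts,
        # excluding the implicit changepoints at the first and last
        # iteration; segment_means: the mean iteration time of each
        # changepoint segment, in order.
        if not changepoints:
            return "flat"              # a single segment: no observable warmup
        if any(cp > num_iters - last_n for cp in changepoints):
            return "no steady state"   # still shifting near the end of the run
        if segment_means[-1] <= min(segment_means):
            return "warmup"            # the final segment is the fastest
        return "slowdown"              # an earlier segment was faster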

For clarity, “flat” and “warmup” are equally good; “slowdown” and “no steady state” are bad (in slightly different ways). The latter two are not supposed to happen: when they do, they invalidate nearly all previous benchmarking methodologies.

Edge cases

Changepoint analysis turned out to be the vital thing we needed to make this work useful. When we first tried using changepoint analysis on our data, we sometimes thought it had misclassified things, but in reality it’s just more rigorous (and far less prone to boredom) than us humans. However, there are always going to be some edge cases where the classification could go either way. Here’s my favourite example of this:

SVG-Viewer needed.

Can you see that the timing data keeps shifting, at predictable intervals, above and below the changepoint segment? The changes are small enough that changepoint analysis doesn’t consider these shifts as changepoints, so the plot is classified “warmup”. Interestingly, in some older versions of HotSpot, this benchmark displays a more pronounced square-wave pattern (i.e. the highs are higher and the lows lower); each peak and trough becomes a changepoint segment; and the plot as a whole becomes “no steady state”. In my opinion, for this example, either classification would be acceptable: but, in general, there are very few edge cases such as this.

Inconsistency

Earlier I went into some detail about how we ensured that the benchmarks we’re running are deterministic. We expected that this would lead to consistent performance profiles for a given (benchmark, VM, machine) triple (in other words, we expected consistent performance from process executions of the same benchmark on the same VM on the same machine). We were wrong. Consider these two plots, both of the same benchmark, on the same VM, on the same physical machine:

SVG-Viewer needed.

Notice that one process execution warms up, and the other is a no steady state. To add insult to injury, the no steady state process execution’s performance is often 5% worse than the warmup process execution’s performance. This sort of inconsistency happens frequently.

Perhaps less surprisingly, there’s often substantial inconsistency across operating systems. For example, consider the binary trees benchmark on Linux and OpenBSD in C:

SVG-Viewer needed.

SVG-Viewer needed.

There are two things to note here. First, the performance of this benchmark on Linux resembles a bouncing ball. This, to belabour a point, isn’t a VM: it’s a C program! We suspect (but somehow have never quite managed to get around to investigating) that this is probably the malloc used on Debian: it’s probably incrementally growing, then incrementally compacting, its memory structures. This raises a subtle point which I hadn’t considered when we first started this work: even C programs have a runtime (libc), which can have unpredictable effects on performance. Unsurprisingly, C isn’t as prone to odd effects as the VMs in our experiment, but it’s more prone than I would have expected. Second, notice the substantial difference in performance between the two systems. Admittedly, in this case we’re comparing a Xeon Linux machine against a Core i7 OpenBSD machine, but still, I would be happy to bet money that Linux will always beat OpenBSD in such a benchmark. Why? Simply because OpenBSD’s malloc prioritises security over performance, and jumps through several hoops that almost certainly slow things down substantially.

Benchmarks in aggregate

Looking at the plots above is very useful, because it gives us a way of understanding what unexpected performance behaviour occurs. However, our experiment has thousands of these plots, so it’s hard to put together any sort of bigger picture from them. Fortunately, changepoint analysis allows us to automatically put together various summaries of the data (with Sarah Mount doing most of the hard work to generate the astonishing LaTeX tables below). Here’s an example of the summary data from one of our machines:

SVG-Viewer needed.

There’s a bit too much data there for us to easily read it in this setting, so let’s zoom in on the bottom right part of that table:

SVG-Viewer needed.

This is the data from the spectralnorm benchmark. There’s still quite a lot of detail here, so let’s look at a few rows to unpick what’s going on.

I’ll start with the HotSpot row. The squiggle in the “Class.” (i.e. “classification”) column means that all 30 of spectralnorm’s process executions warmed up on HotSpot. The “Steady iter (#)” column tells us that the median number of in-process iterations for the steady state to be reached was 7 (the numbers in brackets underneath are inter-quartile ranges; think of them as being a bit like confidence intervals, giving you an indication of the spread of data). The histogram to the right of that number shows you the spread of data, with the bucket the median value is part of highlighted in red. In this case, nearly all the data is in the median bucket, with a couple of outliers. The “Steady iter (s)” column tells us that the median wall-clock time to reach the steady state was 1.91 seconds. The “Steady perf (s)” column is close to the “traditional” measure of VM performance [14], telling us that the performance of the benchmark in the steady state was 0.31472 seconds per in-process iteration. Notice though that the histogram tells us that there are a number of process executions where the steady state performance is worse than the median, and the very wide 99% confidence interval (±0.169143s) confirms this. This goes some way to justifying all the data you’ve just waded through: yes, the benchmark consistently warms up on HotSpot (which is good), it warms up fairly fast (which is also good), but the steady state performance can vary hugely between process executions (which is bad). Depending on what aspect of performance you care about, you may find different parts of this information more or less important than others.
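
As a rough sketch of where numbers like these come from (the paper's statistics, per footnote 14, are rather more careful; this is merely indicative, and the names are mine):

    import numpy as np

    def summarise(steady_iters, steady_perfs, confidence=0.99, resamples=10000):
        # steady_iters: each process execution's steady-state iteration
        # number; steady_perfs: each execution's per-iteration times (in
        # seconds) within its steady state.
        rng = np.random.default_rng(1234)
        all_perfs = np.concatenate(steady_perfs)
        # Bootstrap a confidence interval for the median steady performance.
        medians = [np.median(rng.choice(all_perfs, size=len(all_perfs)))
                   for _ in range(resamples)]
        lo, hi = np.percentile(medians, [(1 - confidence) / 2 * 100,
                                         (1 + confidence) / 2 * 100])
        return {
            "steady_iter_median": np.median(steady_iters),
            "steady_iter_iqr": (np.percentile(steady_iters, 25),
                                np.percentile(steady_iters, 75)),
            "steady_perf_median": np.median(all_perfs),
            "steady_perf_ci": (lo, hi),
        }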

The “—” classification for LuaJIT means that all 30 of its process executions were flat. By definition that means it warmed up at in-process iteration 1 and took 0 seconds to do so: thus the steady iter (#) and (s) columns are empty. The final “pure” classification can be seen for Graal, all 30 of whose process executions slowed down (note that, though this doesn’t have a steady state of peak performance, it does still have a steady state, hence our ability to report such numbers here).

The “═” classification for PyPy is “good inconsistent”: 27 of the 30 process executions were flat, and 3 were warmups. Since we consider these two classifications morally equivalent, “good inconsistent” is morally equivalent to all 30 process executions being flat or warmup.

In contrast, the “═ with a cross through it” classification for TruffleRuby is “bad inconsistent” i.e. at least 1 process execution was a slowdown or a no steady state. In this case, 25 of its process executions slowed down and 5 were no steady state.
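
In other words, a pair's summary classification can be derived mechanically from its 30 per-process-execution classifications; a minimal sketch of that rule (the names are mine) is:

    GOOD = {"flat", "warmup"}

    def classify_pair(run_classes):
        # run_classes: the classifications of a (VM, benchmark) pair's
        # 30 process executions.
        distinct = set(run_classes)
        if len(distinct) == 1:
            return distinct.pop()        # consistent, e.g. all flat or all slowdown
        if distinct <= GOOD:
            return "good inconsistent"   # a mix of flat and warmup only
        return "bad inconsistent"        # inconsistent, with at least one bad execution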

The bigger picture

Summarising the results gives a clearer picture of how VM performance can surprise:

SVG-Viewer needed.

To keep things simple, I’ll only consider the Linux4790 column above (the OpenBSD machine runs a subset of benchmarks, so requires some care when interpreting). It’s easiest to start with the bottom half of the table first. The way to read this is that 22.0% of all process executions (i.e. not worrying about which VM or which benchmark was involved) were classified as flat and 48.3% as warmup; in other words, just over two thirds of process executions warmed up in the expected fashion. 20.1% of process executions slowed down and 9.6% were no steady state.

Those figures are pretty depressing: almost a third of process executions aren’t doing what we expected! But life gets quite a lot worse in the top of the table. The way to read this is that 8.9% of (VM, benchmark) pairs were always classified as flat, 20.0% as warmup, and 11.1% as good inconsistent. In other words, only 40.0% of (VM, benchmark) pairs consistently did the expected thing for all 30 of their process executions. Surprisingly, 4.4% consistently slowed down and 4.4% were consistently no steady state. But 51.1% of all (VM, benchmark) pairs were bad inconsistent: that is, they contain at least one slowdown or no steady state process execution. At the risk of stating the obvious, that means that just over half of the time VMs are not reliably optimising benchmarks!

Things get even worse if you consider all machines together: only 12.5% of (VM, benchmark, machine) triples perform as expected.

Conclusions

When we set out to look at how long VMs take to warm up, we didn’t expect to discover that they often don’t warm up. But, alas, the evidence that they frequently don’t warm up is hard to argue with. Of course, in some cases the difference to performance is small enough that one can live with it, but it’s often bad enough to be a problem. We fairly frequently see performance get 5% or more worse over time in a single process execution. 5% might not sound like much, but it’s a huge figure when you consider that many VM optimisations aim to speed things up by 1% at most. It means that many optimisations that VM developers have slaved away on may have been incorrectly judged to speed things up or slow things down, because their effect is well within the variance that VMs exhibit. The variance across process executions is often even bigger, which is even more troubling: it means that it’s not safe to run a program once, measure its performance, and assume you’ll always get that level of performance in the future.

Before the doom and gloom gets too much, I’m going to end on an optimistic note. While I think our work raises big questions about the efficacy of current VMs, I think it also offers a note of salvation. People are often very happy with VM performance, even though it probably isn’t working as well as it could. In particular, my guess is that large programs are subject to many small performance problems which, in conjunction with each other, are dismissed as noise. If that is the case, then, if we can produce VMs that don’t suffer from the problems seen above, we should be able to run programs even faster in the future.

In the second part of this blog post, I’m going to: look at whether there are any simple answers to the performance problems seen above; give some possible explanations as to how we might have got to this point; and suggest some simple pointers to make things a bit less bad in the future. In the meantime, those of you who want more details can read the main research paper.

Acknowledgements: Edd Barrett, Carl Friedrich Bolz, Rebecca Killick, and Sarah Mount did most of the work I’ve reported on above. Vince Knight helped put us in touch with Rebecca Killick: he really deserved more credit on this paper than he received, but he gracefully refused my attempts to add him in! The work of Tomas Kalibera and Richard Jones was a huge influence for all of us. Mario Wolczko gave crucial feedback and encouragement at important points. We also received vital feedback, and questions, from a number of other people. This research was funded by the EPSRC Cooler (EP/K01790X/1) grant and Lecture (EP/L02344X/1) fellowship, and gifts from Cloudflare and Oracle Labs.


Footnotes

[1] Unfortunately, many such claims are either wrong (e.g. the thing being measured isn’t the thing that’s claimed to be measured) or misleading (e.g. making grand comments that derive from generalising a single workload / benchmark).
[2] I used to love C, but these days I’m more ambivalent, because I no longer think I can remember all the ways that I may trigger undefined behaviour.
[3] p11 of Cliff Click’s slides “A JVM does that?”. To be clear, I’m not criticising Cliff who I not only know and like, but whose contributions to our field are little short of jaw-dropping.
[4] The terms “warmup” and “startup” are often used interchangeably, whereas they refer to two different things: startup is the time for the VM to begin executing the first bit of the user’s code; warmup is the time for the VM to reach the steady state of peak performance.
[5] There are two relevant Kalibera and Jones papers: a complete but dense description; and a higher-level overview. I definitely recommend starting with the high-level overview (even though it was published later).
[6] In my defence, I can only say that I’ve seen a lot of people benchmark even more incompetently than I’ve described here. My personal favourite was a paper that benchmarked on a laptop (which are prone to overheat, and thus likely to throttle the CPU), with turbo boost left on (making overheating even more likely), running a full user desktop (how much CPU is your web browser consuming right now? One core of mine is running at 40% while a website animates something). The resulting numbers might have been correct, but, if they were, that would surely be more due to luck than judgement.

That said, the most common form of benchmarking (which I’ve certainly been very guilty of in the past) is far worse: it’s best summarised as “run it until I like it”. In this form of benchmarking, if you run a benchmark and get a number you don’t like (too fast or too slow), you run things again and again until you get an answer that matches your expectations / prejudices. In practice, the non-determinism of our hardware and software means that one can often find an appealing answer. Of course, the implied variation between runs should be an alarm bell!

[7] On OpenBSD, for example, by default root runs a cron job every day (/etc/daily) and every week (/etc/weekly). The former does several security related tasks which, one might argue, aren’t likely to impact anything else; however, the latter rebuilds the locate database (which is very I/O intensive, and which can thus cause very unpredictable side effects). There is also a monthly cron job but, in practice, that does very little by default (I had entirely forgotten that it existed until I checked my facts for this footnote).
[8] It’s sometimes thought that botnets can only attack you from external networks. However, all it takes is one machine to become compromised (e.g. by someone running something unfortunate on their machine), and then a botnet can exist on an internal network. I’ve heard of this happening at least twice at places that I know well, so it’s a real threat.
[9] The “K” in Krun is a tip of the hat to the Kalibera and Jones work. For those who care, most of us pronounce it as one syllable (i.e. “Krun” not “K-run”).
[10] Turbo boost is less predictable than most of us realise. For example, we had some machines with a stated base frequency of 3.6GHz and a turbo boost of 4GHz. If you ran a process which utilised 1 core fully, you could reliably measure it as running at almost exactly 4GHz; fully utilising 2 and 3 cores gave the same result; but as soon as a fourth core was fully utilised measurements fell to almost exactly 3.8GHz. If memory serves this mode is undocumented in Intel’s manuals, but I saw this behaviour on 2 separate machines.
[11] At the time we experimented with CPU pinning on Linux, there were two mechanisms available: one was buggy and unreliable; and the other (which, of course, we tried second) worked OK. I don’t know if this is still the case, but it certainly makes me wonder if everyone I’ve heard of using CPU pinning is getting the benchmarking environment they expect.
[12] I’m talking here about compilers optimising away code as part of the compiler’s standard optimisations. There’s also a less pleasant case where compilers are customised specifically for one benchmark (e.g. explicitly recognising a benchmark’s source code, and then changing the compilation strategy). One of the better known examples of this was Sun’s special treatment of the 179.art benchmark in SPEC. It’s now difficult to find a concise write-up of this, but a couple of articles provide reasonable pointers.
[13] The real algorithm has two main additional features. First, it treats changepoint segments which differ only by a very small amount of time as being equivalent. Second, two (or more) changepoint segments can be considered equivalent even if they are separated by a non-equivalent changepoint segment.
[14] The details of how we calculate steady performance are quite a lot more involved than is traditional: influenced by Kalibera and Jones, we use bootstrapping and changepoint segments to arrive at a slightly more accurate number than most traditional methodologies.

What I’ve Learnt So Far About Writing Research Papers

January 10 2018

I long flattered myself that I'm only good at one thing: programming. Experience has taught me that I'm not as good at it as I thought – indeed, that there are people much better at it than I – but it's fair to say that I'm not terrible at it. The problem is that, as a researcher, programming can only ever be part of my job: it's not enough just to build something interesting, I need to explain it to people too. And explaining it to those further away than my voice can carry requires writing a paper. [1]

There are many people who dismiss paper writing as a distraction from building real systems. I sometimes wondered if it was worthwhile myself but, when you've been knocking around as long as I have, the importance of writing things down becomes clear. I’ve lost track of how many conversations I’ve had along these lines: “We saw that problem in System B, and we had a really interesting solution.” “Where can I read about it?” “We never wrote it up.” “Where can I download it then?” “We never released the source code, the binaries are lost, and it runs on hardware that no longer exists.” I encountered this a lot when I was investigating existing work on language composition: systems with tantalising references ranging from Lisp systems with different macro expansion semantics to novel text editors. Even when the person I was talking to was passionate and knowledgeable about the system in question, I was rarely able to gain the insights I suspect I could have if I'd had a paper to pore over. Doubtless, many of these systems weren't good enough to worry about them slipping from our memory; but the loss of the best, which may have involved tens or even hundreds of skilled person years of effort, is a tragedy. Writing things up does not guarantee that people in the future will be able to understand everything about the work [2], but at least they stand a sporting chance. Besides, even if people of the future can't understand things, writing things down increases the number of people who can understand the work today.

How should one go about writing a paper? When I started my PhD, I had no idea of how one might go about this task, and it took me many years to find an approach that worked for me. In the rest of this blog post, I'm going to try and drill down into some of the things I’ve learnt about writing papers, and I'm going to give some examples of where I’ve learnt from my mistakes. However, I’m deliberately not phrasing anything herein as “advice” — indeed, several parts of my approach seem to run contrary to most of the advice I’ve seen. I don’t see this as a problem: people work in different ways, and if you’re unlucky enough to have a mind similar to mine, you might find something useful in here.

Before starting to write

Before I start writing, there are two crucial decisions that I need to make. First, what am I going to write about? Second, what is the profile of the intended reader?

The first question nearly always has an easy answer which is that I’ve done some research which is interesting and complete enough to be worth telling other people about. If I don’t have a good answer to that first question, it means that I’m probably not ready to start writing.

The second question is more tricky. In general, I assume that the intended reader is someone intelligent, interested, and well-versed in the general area. My job is to fill in the things that they don’t know. One can go mad trying to guess precisely what people do and don’t know, so my starting assumption is that the reader knows roughly what I knew when I started doing the research in question. However, this doesn’t work for all types of writing, particularly when a paper brings together two subjects that have previously been studied by disjoint communities. On any points where I’m unsure about what knowledge I can take for granted, I err on the side of caution, and assume the reader knows marginally less than I did. This should not be mistaken for an assumption that the reader is stupid, but simply an acknowledgement that none of us can know everything.

I also make a fundamental assumption that the reader has good intentions, and that my job is to be upfront and honest with them. This allows me to write clearly about the weaknesses and limitations of my work without worrying that I will be punished for doing so. This may sound like an odd thing to be explicit about, but people in my position are largely evaluated by the quantity and perceived quality of our peer reviewed papers. It is therefore tempting to adjust one’s writing to maximise the chances of acceptance by peer reviewers (who are often rushed; occasionally incompetent; or, though rarely, malevolent). I did that in the past, but I found the implicit cynicism corrosive: I decided, perhaps somewhat pompously, that I’d rather “fail” peer review honourably than “succeed” dishonourably. This has led to one or two more rejections than I might otherwise have received, but these are more than worth the longer-term satisfaction with the papers I write.

Starting to write

It is a truism of writing that the hardest thing is a blank page: building sufficient momentum to begin writing is no easy thing and I've done my fair share of hoping in vain for inspiration to strike. Of course, the only possible solution is to write something, anything, just to get started. The most common suggestion that I've heard is to start writing the bits of a paper that come easier without worrying about the overall order. While this seems to work well for many people, I've never found this a satisfying approach.

Instead, the first thing I do is to try and work out what the paper’s overall message is going to be, which means starting with the paper’s abstract. [3] This gives me a frame of reference for all subsequent writing and enables me to be ruthless in answering questions like “does the reader need to know about this particular aspect?” and “is this section in the right place in the paper?”. I first reread suggestions for writing an abstract (the fourth point in the link on paper writing advice) because it reminds me to be concise about what problem I solved and to motivate why my solution is worth someone’s time reading. In well established fields, where one is tackling a well-known problem, little motivation is needed. More experimental or unusual research, however, needs to be clearly motivated. This is often harder to do than it seems.

Let me give a concrete example based on the paper fine-grained language composition: a case study. We started looking at language composition because previous solutions were crude, because we felt we had a new approach worth trying, and because it looked like it would be fun. I had a few explanations about why it was worth doing (enough to convince people to fund us to do it), but they were vague. After developing a language composition editor we realised we needed to be able to run composed programs as well. We started with an extended warmup exercise, which gave us the confidence to try tackling something bigger.

We eventually settled on composing PHP and Python and started looking at the problem in detail. About 6-12 months later, we had created PyHyp, which was the first ever fine-grained language composition of “big” languages: certainly, we'd been able to get much further, and with less effort, than I had expected at the outset. The problem was that we had also outstripped our ability to explain why we'd done what we'd done. The published abstract that I wrote (and I say “I” because I’m almost solely to blame for it) looks as follows:

Although run-time language composition is common, it normally takes the form of a crude Foreign Function Interface (FFI). While useful, such compositions tend to be coarse-grained and slow. In this paper we introduce a novel fine-grained syntactic composition of PHP and Python which allows users to embed each language inside the other, including referencing variables across languages. This composition raises novel design and implementation challenges. We show that good solutions can be found to the design challenges; and that the resulting implementation imposes an acceptable performance overhead of, at most, 2.6x.

The good thing about this abstract is that it's simple to understand: the bad thing is that it doesn't say why the problem we tackled is worth tackling. Since language composition has been largely ignored as a research challenge, the abstract relies on readers spontaneously realising why language composition is useful and why our research is worth reading — which is far too much to ask.

The frustrating thing is that, while carrying out the research, one of our collaborators had suggested a plausible “killer app”: system migration. The idea is simple. At the moment we have lots of systems written in old languages: we'd like to rid ourselves of the old languages but not the systems. We can either rewrite the systems from scratch (which is expensive and error prone) or use an automatic translator (which produces code that no human wants to maintain). Language composition offers us the potential to compose an old and a new language together and gradually migrate parts of the system piecemeal, at all times having a runnable system. This idea is tucked away in a paragraph near the end of the paper and is easily overlooked. This is an unforced error on my part, because papers are not meant to be like crime novels, with unexpected twists: a good paper should make clear upfront what the paper is about, and gradually fill in detail. Indeed, I struggled to find a good structure for this paper, perhaps in part because of my failure to set out the most compelling case possible in the abstract.

By the time I came to write a blog post on this subject I had realised my mistake and put the migration idea (in expanded form) near the beginning. Alas, anyone who read only the paper would not have seen this implicit mea culpa.

As this example hopefully shows, finding the right research message is often tricky. Fortunately, it is possible to do a decent job: looking back, I think both the storage strategies for collections in dynamically typed languages and Eco: a language composition editor papers do a much better job in their abstracts.

With the abstract out of the way, I draft an introduction. If I've got my high-level thinking in the abstract right, then the introduction is a much easier task because, to some extent, it's simply an expanded abstract. I add more background, including some context about the research's antecedents, and explain why we've done something useful. Sometimes I will explicitly list the paper's contributions, especially if they might otherwise not pop out (thus the storage strategies paper was clear enough without such a list whereas I felt the fine-grained paper would not be understood without one).

With drafts of the abstract and introduction complete (I invariably come back and, at least, tweak them; sometimes I change them extensively), I then find it possible to move on to the paper's main content.

Writing and editing

Perhaps unsurprisingly, I tend to write papers more-or-less linearly from front to back, finishing one section before moving on to the next. I find this helpful to give me the same sense of flow as the eventual reader: it means that I’m forced to think about seemingly simple questions like “have I introduced this concept earlier, or does it need to be introduced here?” It’s easy to overlook the high-level structure of a paper, and I’ve read many papers with all the right pieces but in the wrong order. [4] Similarly I've read many papers which use too many unconnected examples: each feels sensible in its own right, but each example takes time to understand, destroying the reader's flow. A single example which can be built up in stages as the paper progresses nearly always makes the paper easier to read, and is most easily done if writing is done semi-linearly.

That said, I spend the majority of my time at a much lower level: I’m trying to get all the facts that need to be in the paper in there, explained clearly, and in the right order. That means that I spend most of my time thinking about whether a sentence says what I need it to say, without saying anything it shouldn’t say, and whether it does so as concisely as possible. I’m not indulging in false modesty when I say that I don’t think that my first attempt at a sentence has ever satisfied all three. As soon as I’ve typed a sentence in, I’ll reread it, and if it’s really bad (which it normally is), I’ll move it a couple of lines down on the screen and try writing another version immediately. For some complex thoughts I can end up with multiple versions scrolling gradually off the bottom of the screen before I eventually get to something that isn’t obviously flawed. But I'm realistic: I probably won’t get those complex sentences right on the first day, so I try to move on fairly swiftly.

When I start a new day, I go back over what I wrote the previous day and look at it in a fresh light. As well as catching out innumerable typos, I always end up rephrasing and reordering sentences. That’s because, if there’s one thing I’ve learnt about good writing, it’s that it’s the result of good editing. And, as far as I can tell, good editing means extensive editing. It means being able to look at what I wrote and — ignoring my ego, which inevitably says “it’s good enough already” — trying to neutrally judge whether it can be improved. This is easy to say but, at least for me, it was an incredibly difficult skill to learn. I only learnt it when I had to write a grant proposal and my first draft was 11 pages, just a little over the 6 page limit. I had no idea what to do, and my first attempts at cutting things down were, frankly, pathetic. Over about 10 weeks, I accidentally crushed my ego enough to learn how to evaluate my own writing, and from that much followed. The grant was rejected, but I got so much out of the experience of writing it that I still judge it a success.

Order and brevity

Let's look at a concrete example that shows how important it is to think about the order in which things are presented and how shorter writing is often better. The example is something that I remember spending a long time grappling with: an overview of meta-tracing. This was first needed for the approaches to interpreter composition paper, where we wanted to give a rough idea of meta-tracing to unfamiliar readers (who would otherwise find much of the paper tough to understand). Overview sections are often tricky: they have to summarise and simplify related work in a way that doesn’t overwhelm unfamiliar readers, while being accurate enough not to offend people who know the area in depth. This sort of writing ruthlessly exposes holes in one’s understanding and writing abilities.

I'm going to use the first paragraph from Section 2 of the “approaches” paper as my example. For the points I’m about to make, it hopefully doesn’t matter too much if you understand meta-tracing or not. Here's the first draft I wrote (minus citations):

Meta-tracing takes an interpreter and creates a VM which contains both the interpreter and a tracing JIT compiler At run-time, user programs running in the VM begin their execution in the interpreter. When a 'hot loop' in the user program is encountered, the actions of the interpreter are traced (i.e. recorded), optimised, and converted to machine code. Subsequent executions of the loop then use the fast machine code version rather than the slow interpreter. Guards are left behind in the machine code so that if execution needs to diverge from the path recorded by the trace, execution can safely fall back to the interpreter.

I have long since forgotten how long that took to write, but 10-15 minutes is a reasonable guess. At first glance, this paragraph may seem to do a reasonable job of explaining things, once one ignores minor issues like the missing full stop before “at run-time”. However, there’s one glaring problem which took me or someone else (I'm no longer sure who) some time to realise: not only will most readers be unfamiliar with “meta-tracing”, they’ll be unfamiliar with “tracing”, which the above completely fails to mention. That means that this explanation almost entirely fails to achieve its main purpose. After various minor edits to the section as a whole, eventually I inserted a reference to (non-meta) tracing. Between the first draft above and the final published version below, 24 commits [5] were made to the section as a whole (mostly by me, but with some by Edd Barrett):

This section briefly introduces the concept of meta-tracing. Meta-tracing takes as input an interpreter, and from it creates a VM containing both the interpreter and a tracing JIT compiler. Although tracing is not a new idea, it traditionally required manually implementing both interpreter and trace compiler. Meta-tracing, in contrast, automatically generates a trace compiler from an interpreter. At run-time, user programs running in the VM begin their execution in the interpreter. When a 'hot loop' in the user program is encountered, the actions of the interpreter are traced (i.e. recorded), optimised, and converted to machine code. Subsequent executions of the loop then use the fast machine code version rather than the slow interpreter. Guards are left behind in the machine code so that execution can revert back to the interpreter when the path recorded by the trace differs from that required.

The paragraph is now quite a bit longer, but at least it explains tracing. However, it does so poorly (which, as the git history clearly shows, is entirely my fault and not Edd's). Consider the sentences:

Meta-tracing takes as input an interpreter, and from it creates a VM containing both the interpreter and a tracing JIT compiler. Although tracing is not a new idea, it traditionally required manually implementing both interpreter and trace compiler. Meta-tracing, in contrast, automatically generates a trace compiler from an interpreter.

In order, these three sentences: define meta-tracing; reference tracing without explaining why it's relevant; and somewhat obliquely explain why meta-tracing is different from tracing. To add insult to injury, the third of those sentences partly repeats the content of the first sentence. The problem with this jumble of concepts isn't just that it's poor style: by forcing readers to keep switching the concepts they're reasoning about, it makes it hard for them to concentrate on the crucial aspects. Unfortunately, I either didn't notice this problem, or couldn't work out how to fix it, before publication.

Fortunately for us, the story doesn’t end there. Some time later we wrote a paper called fine-grained language composition: a case study which also needed an overview of meta-tracing. I first copied in the paragraph from the “approaches” unchanged [6]. Very soon afterwards (probably because of the benefits of reading the paragraph with a fresh mind) I started addressing the problems noted above. With substantial help from Carl Friedrich Bolz and Edd Barrett, 28 commits to the section as a whole later the final published version looks as follows:

Tracing JIT compilers record hot loops ('traces') in an interpreted program, optimise those traces, and then compile them into machine code. An individual trace is thus a record of one particular path through a program's control flow graph. Subsequent executions which follow that same path can use the machine code generated from the trace instead of the (slow) interpreter. To ensure that the path followed really is the same, 'guards' are left in the machine code at every possible point of divergence. If a guard fails, execution then reverts back to the interpreter.

Meta-tracing JITs have the same basic model, but replace the manually written tracer and machine code generator with equivalents automatically generated from the interpreter itself.

The explanation is now split into two paragraphs. The first has a concise explanation of tracing; the second briefly explains what's different about meta-tracing. Not only is this a much easier flow for the reader, it also has a better explanation of guard failure and, as a pleasant bonus, it's 10% shorter than the “approaches” version. I'm quite proud of this version but notice how long it took us to get there: around 50 commits to the section as a whole (most of which touched the paragraph above). Simplicity and clarity are not easily obtained, but they are always worth striving for.

Writing as part of a team

It's been a long time since I've written a paper on my own. Outside of this blog, nearly all my writing is done in teams of 3–5 people. If my experience is anything to go by, the standard model of collaborating in team writing is for most people to write up the technical bits of a paper they were responsible for, with one person handling the “other” bits (e.g. the introduction and conclusions). Some teams make this model of writing work, but a lot (including some I've been part of) don't, for two different reasons: different people often have jarringly different writing styles [7]; and vital information is often missing, because no individual feels solely responsible for it. It's worth looking at both reasons in more detail.

Sometimes academic writing advice seems to suggest that there is one true writing style to which we should all aspire. This is nonsense, especially with a language as flexible as English: there are often multiple ways of saying the same thing that are all roughly as good as each other. Which of those alternatives one chooses is largely a matter of personal style. For example, my personal style tends towards fairly short sentences and lots of punctuation. Some people I know prefer long sentences and the complete absence of the colon and semi-colon. Obviously I prefer my own style, but that doesn't generally mean the alternative is wrong — just different. [8] The problem comes because readers of technical papers are very alert to minor changes in use of language, and such changes are much more common when different authors fail to homogenise their style. For example, referring to “dynamic compilation” in one part of a paper and “adaptive compilation” elsewhere makes a thoughtful reader wonder if the different phrasings denote a real difference or not. Jolting a reader out of their reading flow in order to ponder such a question is not just a waste of time: the more often it happens, the more likely it is that the reader will lose sight of the big picture.

The problem of missing information is both less obvious and more serious. I frequently read papers which, while seemingly sensible at the level of individual sections, don’t make much sense as a whole. In rare cases this is because the paper is technically flawed. More often it’s because there’s a missing part of the explanation that would help me make sense of what I’m reading. Occasionally I’m able to deduce precisely what the missing information must have been but, more often than not, I’m simply left confused.

Missing information is sometimes simply due to an oversight, or because the authors considered it unworthy of inclusion (generally because “everyone knows that”). However, my experience is that, when writing in a team, a third reason is equally common: no individual feels responsible for that piece of information, implicitly assuming that it’ll appear elsewhere in the paper. Inevitably, therefore, the information is never included anywhere. I’ve occasionally seen people try to solve this by cramming huge numbers of definitions into a paper’s introduction. This then makes the start of the paper both a depressing and an obfuscatory read: the start of the paper should outline the story, not try and fill in every blank.

The rewriter approach to team writing

The approach I use to avoid standard team writing problems is for one person (I’ll call them the “rewriter”) to take overall responsibility for the paper’s writing, ensuring that the paper’s content and style are consistent and complete. The only way I’ve found to make this happen is for the rewriter to examine, and in general rewrite, every contribution to the paper. Most obviously this means that the paper ends up written in a single style. Less obviously it means that one person is forced to understand everything in the paper, lessening the chances of authors telling an inconsistent or incomplete story. The rewriter doesn’t need to be a mini-dictator — indeed, it’s best if they try and work by consensus when possible — but they do need to be unflagging in trying to maintain a consistently high standard across the paper.

I now apply the rewriter idea to nearly all the papers I'm involved with (though I'm not always the rewriter!) and I've learnt a few things since my first experience with this way of writing [9]. First, it relies on individuals being willing to forego their individual writing style and to trust that the rewriter will do a good job. Second, it requires people to be much more open than normal to being questioned about what they’ve written, and to be willing to answer each and every question.

Sometimes the rewriter will read something, realise that they don’t understand the explanation, and ask for clarification before rewriting can commence. Sometimes the rewriter will not have the same assumptions as the original author, misunderstand the original text, and then rewrite it incorrectly. The good news is that the rewritten version is nearly always clearly incorrect: the original author then needs to request changes until the rewritten text becomes accurate. [10] And, more often than you’d expect, the original author realises that there was an aspect of the research that they hadn’t previously considered, and they need to do more technical work before answering the query.

Let me give you a simple example which I quickly plucked out of the VM warmup paper. Carl Friedrich drafted the following text (intentionally quickly, as the style we had got used to was “draft something quickly, run it past the rewriter, then take another pass at it”) [11]:

SVG-Viewer needed.

In this case, I knew that there was an additional factor not mentioned in this text that needed including: we had discovered something about lazy loading interfering with our measurements, although I wasn’t very sure what the “something” was. I thus rewrote the text based on that additional factor but, since I didn’t know much about it, I also added a couple of questions at the same time:

SVG-Viewer needed.

Carl Friedrich then replied:

SVG-Viewer needed.

There are two changes in this version. Most obviously, Carl Friedrich responded to my first question with a comment (and notice that both of us are trying to understand what is going on rather than trying to “win” a debate). Less obviously, he responded to my second question by editing the text and deleting the comment. I’ve always encouraged this way of editing: there’s no point in people playing endless comment ping-pong simply because they’re too afraid to take decisive action. In this case, it was my job to review the edit that had been made (I simply looked at the relevant diff in gitk); had I been unhappy with it, I would have reinstated the text with an additional comment (perhaps “I’m not sure my original question made sense, so let me try again”). As is generally the case, I had no problem with this particular edit.

There were several rounds of question-and-answer (see e.g. 1, 2, 3, or my personal favourite 4) before I eventually felt I understood what had gone on enough to rewrite the text to the following:

SVG-Viewer needed.

That text, somewhat simplified, is clearly recognisable as that in the final paper. Notice that neither Carl Friedrich nor I would have been able to get the text to that quality on our own: it required working together, being open to questions, and not giving up until we were both satisfied that readers would understand the points we were trying to make.

How long does it take to write a paper?

I remember being told early in my career that, once the research is complete, writing a paper takes about a week. I was tortured by this for several years: I rarely got anything out of the door in under a month. And I’ve got much, much worse as my standards have been raised: I reckon that our major papers (for want of a better term) take at least 8-12 person weeks to write. [12] More complex papers take even longer. As a rough proxy, the VM warmup paper took 999 commits from start to finish (and, as can be seen on arXiv, 7 public releases). Many of those commits (particularly towards the end of writing) are small; some (particularly nearer the beginning of writing) are big. But, either way you look at it, a lot of effort went into the writing. It’s one reason why I’m not a prolific writer of papers: much to the distress of most of the bosses I’ve had, I’d rather try and write 1 good paper a year and fail than succeed at writing 20 mediocre ones.

And while it’s hardest to build the momentum to write the first few words, it’s not trivial to build and maintain the daily momentum needed to complete a paper. On a typical day writing ‘fresh’ prose, I aim to write somewhere around 800-1000 words (I don’t count precisely). Sometimes those words come easily, but sometimes they have to be forced out. I do make sure that I never leave a day empty-handed though.

One of the hardest things to know is when a paper is finished. One way is to use an external deadline – e.g. the submission date for a conference – as a proxy for this. In general, this tends to lead to a mad rush in the last few days before the deadline, hugely increasing the chances of mistakes creeping in. I prefer to let papers mature for a lot longer than this, because it continues to amaze me how many mistakes become apparent after a good night’s sleep or two. However, taken too far, this is a recipe for never finishing: I think it’s vital to feel that one is continually moving, even if gradually, towards an end goal. To that end, I try and solidify aspects of the paper as I’m going along, so that I’m not tempted to end up in an endless loop of modifying the same things. Typically, I first satisfy myself that the paper’s structure is stable; then that all the necessary content is in; then that the prose is clear and consistent; and finally that there are no typos left. The movement between those stages is almost exclusively one way: when I’m looking for typos, for example, I won’t suddenly decide to move sections around.

At some point the decision has to be taken that the paper is finished and can be released to the world. If I’ve done my job well, the paper will hopefully find an interested audience. But these days I’m realistic enough to know that the paper won’t be perfect: it won’t be as clear as it should be and it will contain mistakes.

Several years ago, I wrote an overview paper on dynamically typed languages where I tried to collect together a lot of things I’d learnt on the topic. Because I knew that this was an area about which many people have strong opinions, I realised that I would have to do a better job than I'd done on any of my previous papers. I tried to buff the paper to the shiniest of sheens and, when it was published, I patted myself on the back for a job well done. A little while after publication I got the following email [13]:

I'm just reading your article on dynamically typed languages from 2009, and I'm stunned. It's fantastically written, neutral and smart.
At this point, my ego – never the most fragile of objects – had swollen to roughly the size of a small planet. Then I read the next sentence:
So how on earth could you misspell "growth" as "grown" in the abstract?!
I checked, and, indeed, I had made a typo in the first sentence of the abstract – in other words, the 13th word in the paper was wrong. My ego shrank back down to a more appropriate size. I fixed the mistake in the HTML version but the PDF retains the error, as a reminder to my future self of my own fallibility.

Wrapping up

There are many other things I could say, but if I had to boil everything I’ve learnt so far about paper writing into one sentence it would be this: you don’t need to be hugely talented with words (I’m not), but you have to be willing to continually reread, rethink, and reword everything you’ve written. When writing in a team, “you” particularly means “the rewriter”, if you choose that model. Ultimately the goal is simple: will the reader be able to understand the points you wanted to get across, without undue effort? Bad writing, I suspect, comes from people who think their first effort is good enough. Good writing comes from people who know that every section, every paragraph, and every sentence will have to be pored over multiple times: and, just when they think they’ve perfected the writing, getting external feedback might cause them to reevaluate everything. Sometimes this effort can feel like treading in Sisyphus’ footsteps but, in my experience, it’s effort well spent.

Acknowledgements: My thanks to Edd Barrett, Carl Friedrich Bolz-Tereick, Lukas Diekmann, and Sarah Mount for thoughtful comments. Any errors and infelicities are my own.

Footnotes

[1] Blogs like this one have their purpose, but I feel they're solving a different problem to papers. Put crudely, I use my blog to explain things to people who want a general overview, whereas papers contain all the nitty-gritty details.
[2] Unless you're already familiar with Kuhn's work, Dick Gabriel's wonderful paper about incommensurability in programming languages is, in my opinion, required reading. In short: when we read others’ work, particularly if it’s a little old, we often lack the context that all contemporary readers shared; we are therefore liable to misunderstand or dismiss important parts of the explanation. A simple example is the word “computer”. Before the 1940s this referred to people who performed calculations; currently it refers to machines. Unless you’re aware of that fact, you’ll get a completely incorrect impression when reading pre-1940s material that uses the term “computer”. As a collection of syllables, ‘incommensurability’ is an annoying mouthful: but the idea is simple, powerful, and describes a concept that is surprisingly frequently encountered. Dick brings it to life with a very interesting concrete example.
[3] The common advice is to write the abstract last, once you've got the rest of the paper written. This clearly works well for many people, but whenever I've tried doing so, I've found that my writing is aimless and disjointed.
[4] When I’m reviewing papers, I write a series of small comments in the order that I come across them. It’s quite common for me to write “On p2 I don’t understand X” and then to update it later by saying “Ah, this is explained on p5”.
[5] I used git blame -L to find the relevant commits.
[6] This is sometimes referred to as "self-plagiarism" and some authors will write such sections anew each time to ensure they can't be accused of it. Since background material is generally short and doesn't really form part of a paper's contribution, I don't have a problem with people reusing small amounts of such material: the problem, in my opinion, is when people copy supposedly new technical contributions between papers. In this case I even made clear to myself and my co-authors what was going on in the commit message:
Integrate a brief description of meta-tracing.

The first and last paragraphs are directly swiped from 'approaches to interpreter composition'. The middle paragraph is new, as I suspect we will need to talk about hints to explain why cross-language optimisations perform well.
[7] This effect is often exaggerated when different co-authors have different mother tongues: it's generally not difficult to tell a French author writing in English versus a German author, for example. If this sounds like Anglo-snobbery, I really don't mean it to be: I'm still struggling with one language, and am in awe of people who can express themselves in multiple languages.
[8] That said, there are a few things I consider universally bad style for technical papers. The chief style crime I see amongst junior authors is a tendency to think that the following idiotic trope (repeated by seemingly every school English teacher) is true: “never repeat a word or phrase you’ve used recently.” Perhaps this is good advice for fiction, but it’s terrible advice in technical papers, where it’s vital to use the same word(s) to refer to the same concept throughout a paper.
[9] Personally I stumbled across the "rewriter" way of working entirely by accident. I was part of a research project that ultimately turned out to not really need my skills. When it came to write up the resulting research, I felt like a spare limb, so to make myself feel useful I offered my services as a copy-editor. This was accepted by the rest of the team without any of us realising what we'd signed up to. Whenever someone added a new part to the paper, I would read it very carefully. If I couldn't make any sense of it, I would add questions inline asking for clarification. If I could make sense of most or all of it, I rewrote the whole lot in my style. I viewed it as my job to make sure that the paper flowed, stylistically and technically, from start to finish, and that everything within was adequately explained.

The downside of this approach was that I almost drove my co-authors to murder because none of them had encountered such repeated questioning before. However, after a bit of time had passed (and the paper had done reasonably well), we all agreed it was better written than if we'd taken a traditional writing approach. I hasten to add that is not because the paper used my style, but simply because it used one person's style consistently, and that person had also ensured that all the information the paper needed was contained therein.

[10] My personal tactic is to turn seemingly ambiguous things into definite statements. For example, “This is smaller than some other objects” might become “This is smaller than most other objects” or “This is larger than most other objects” depending on my intuition about which way the ambiguity is best resolved. In the best case, I will have rewritten things more clearly than they were before. In the worst case I have made it both more likely that the original author will spot problems and that they will feel the need to correct things.
[11] The “screenshots” below are SVGs extracted from the PDF generated from LaTeX. The inline comments are a macro that Roel Wuyts used on a shared article we wrote many moons ago. I've shamelessly used it ever since. I can't remember who added colour to the comment macros, but it really helps them stand out from the normal text. These days I use “how much red is there on a page” as a good proxy for “how close is this text to being finished?”
[12] How long did it take to write this blog post? Spread over about 8 months, it took about 7 days (partly because I wrote some material that external comments convinced me didn't really fit in). It's one reason why I'm not a prolific blogger, although the more significant reason is that I generally lack anything worth my time writing and your time reading.
[13] I wish I could write emails as succinct and amusing as this one. While I hugely admire laconic humour in others, I’m not generally capable of editing my thoughts down quite so ruthlessly.

What Challenges and Trade-Offs do Optimising Compilers Face?

June 22 2017

After immersing oneself in a research area for several years, it's easy to feel like the heir to a way of looking at the world that is criminally underrated: why don't more people want what I want? This can be a useful motivation, but it can also lead to ignoring the changing context in which most research sits. So, from time to time, I like to ask myself the question: “what if everything I believe about this area is wrong?” It’s a simple question, but hard to answer honestly. Mostly, of course, the only answer my ego will allow me is “I can’t see anything I could be wrong about”. Sometimes I spot a small problem that can be easily fixed. Much more rarely, I've been forced to acknowledge that there’s a deep flaw in my thinking, and that the direction I’m heading in is a dead-end. I’ve twice had to face this realisation, each time after a gradual accumulation of unpleasant evidence that I'd tried to ignore. While abandoning several years’ work wasn’t easy, it felt a lot easier than trying to flog a horse I now knew to be dead.

All of this means that I’m probably a little better disposed to contrarian thinking than I was when I started as a researcher. And so it is that I’m going to write about some thoughts that were initially inspired by a (somewhat controversial) talk by Daniel Bernstein. The talk itself, entitled “Death of Optimizing Compilers” (slides and audio), was presented at ETAPS 2015 in London. I happened to be present for the talk and have engaged in mild advocacy both immediately afterwards and in the two years since. The problem is that the case against is quite easy to state, but the case for is rather hard to make — or, at least, I haven’t managed to frame an argument good enough to convince anyone. But I hope I might now have sharpened my argument enough to make another attempt worthwhile.

Suitably condensed, the talk’s thesis is fairly simple. As software systems become bigger and bigger, an ever smaller percentage of code dominates execution time [1], especially as computers are asked to crunch ever bigger chunks of data. However, optimising compilers (e.g. gcc -O2 or even a VM like HotSpot) are outclassed by humans on the most performance critical code. Because of that, optimising compilers are operating in a region between a problem that’s not worth solving (most parts of most programs) and a problem they can’t solve (the really performance critical parts). It doesn’t take much thought to realise that this last point has been the source of controversy: how can someone belittle the huge amounts of effort put into optimising compilers?

Daniel used his own (very cunning) ChaCha20 algorithm as an example of where a human can significantly outdo anything a compiler can. ChaCha20 is effective because the algorithm was invented with the strengths of modern processors in mind: its design and implementation are intricately intertwined in a way that no compiler could ever expect to match. [2]

It may not surprise you that I don’t agree with Daniel’s conclusion. However, sometimes looking at things from an extreme angle can help bring forgotten aspects into focus. So, while I’m not claiming to have any novel insights, here is my attempt to enumerate some often-forgotten (at least by me) issues which limit the effectiveness of optimising compilers.

Balancing program performance and compilation time

We all know that there's a trade-off between optimisation and compilation time: the more optimised you want your program to be, the longer it takes to compile. At one extreme, language implementations like CPython and CRuby have low compilation costs [3] and little or no optimisation costs, but consequently have poor performance. At the other extreme, superoptimisation tries vast numbers of machine-code combinations before picking the fastest. This means that a tiny change to even the tiniest function can take many, many hours to optimise: scaling this up to even “small” functions could be a challenge. Slightly less extreme are C compilers like gcc and clang. These put considerable effort into optimising programs, but generally compile even medium-sized files in at most a second or two. [4]

Would developers be happy if we suddenly increased compilation times by, say, a factor of 10 if it led to a 5% performance increase? Probably not. One possible solution is to add a “really good optimisation but slow to compile” mode to compilers. However, precedent suggests that such a mode will be little used. For example, gcc’s highest optimisation level -O3 is often treated with a mix of fear and disdain: partly because it often doesn't improve performance [5], but also because it's thought to more often produce buggy code. Whether this is still a justifiable position, whether it's because of compiler bugs, or whether it's due to user bugs that only happen to show up when highly optimised is almost irrelevant: if no-one is prepared to use such a mode, it might as well not be there. OpenBSD, long my operating system of choice, uses -O2 exclusively for both the base operating system and packaged third party software. Another possibility is profile-guided optimisation, which I'll touch on later.

We don't know which parts of a program to optimise

In my experience, one of the easiest ways of distinguishing a seasoned programmer from those early in their career is their approach to “manual” optimisation. When presented with a slow running program, junior programmers tend to quickly identify what they think is the bottleneck, and start hacking away at it. Seasoned developers, in contrast, tend to have realised from bitter experience that their initial guesses about bottlenecks are nearly always wrong, and so profile the program’s performance first.
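
To make that first step concrete, here is a minimal sketch of the sort of profiling run I start with, using Python's cProfile module (slow_program is a hypothetical stand-in for whatever program is actually under investigation):

    import cProfile
    import pstats

    def slow_program():
        # Hypothetical stand-in for the real workload.
        total = 0
        for i in range(1000000):
            total += i % 7
        return total

    # Run the program under the profiler and print the ten functions with
    # the highest cumulative time: those are the first candidates to stare at.
    profiler = cProfile.Profile()
    profiler.runcall(slow_program)
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)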

Sometimes, even standard profiling doesn’t help. My standard approach is to profile a program, identify the longest running functions, and then stare at them to see if I can improve them. This works in many cases, but fundamentally profiling can only highlight symptoms, not prove causes. For example, if I look at the longest running function and it’s a bubble sort, I can quickly change it to a faster sorting algorithm. But if it’s a Timsort function, there probably isn’t much I can obviously do to fix it. The problem could be elsewhere (e.g. more than once I’ve seen sort functions used as part of a de-duplicator function, which is a slow way of solving that particular problem) and may be very subtle (e.g. bad data structures can cause much more memory to be allocated than necessary). Occasionally, though rarely, I can work out a likely culprit, but judge that the necessary rewrite is too big to justify the gain. But, in most cases, I simply can’t work out if there is anything which would benefit from further optimisation or not.

We can now ask a simple question: if it’s difficult for a human to look at the runtime performance of a specific program and work out what should be optimised, how difficult is it for a compiler? The answer, I fear, is that the compiler has, by far, the more difficult job. Static compilers implicitly make a guess about how a program will execute – without having the luxury of actually executing it [6] – and optimise based on that guess. In some cases, we can make accurate guesses (e.g. we can work out statically how long the machine code sequence for 3 * 4 will take versus 12) but in most we can’t (e.g. will a branch that represents a loop never be taken, taken occasionally, or taken often?). The latter quickly overwhelm the former, meaning that the compiler can have no real idea of which parts of a program are likely to be bottlenecks.
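
As an aside, the 3 * 4 case really is at the easy end of the spectrum: even implementations with fairly minimal optimisers fold such constants away. CPython's bytecode compiler, for example, does so, which is easy to check (a throwaway sketch, nothing more):

    import dis

    def f(x):
        return x + 3 * 4

    # The disassembly loads the constant 12 directly: the multiplication
    # has been folded away before the function is ever run.
    dis.dis(f)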

The only safe choice the compiler can thus make is to try and optimise all parts of a program, though this has two issues. First, the trade-off between compilation time and optimisation rears its head again. Since some optimisations are either expensive or rarely applicable, we can’t afford to apply them everywhere, especially as they would mostly have no useful runtime effect. Second, optimisations are heavily order dependent. Imagine I have three optimisations A, B, and C: applying A then B might produce code that C can't do anything with; but applying B then A might produce code that C can optimise. In this simple case we could try both possibilities and, if we're lucky, work out which outcome is better and use that. In a real compiler, with a very large number of optimisations, combinatorial explosion is real and this is impractical. This means that even though our compilers sometimes contain all the optimisations they need to produce high-performing code, they don't run them in the right order (or can't run them for long enough) to optimise all parts of a program. [7] In practice, therefore, trying to optimise all parts of a program means doing a so-so job on all parts of a program. There are two techniques which try and identify the sub-parts of a program that would most benefit from heavy optimisation.
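
Before looking at those two techniques, the ordering problem is perhaps easiest to see with a deliberately silly toy. The sketch below invents a miniature IR (nothing like a real compiler's) and two "passes": constant folding enables dead-branch removal, but running the passes in the opposite order leaves the dead branch in place:

    # Toy IR: an 'if' is ("if", condition, body); a comparison is
    # ("gt", a, b); everything else is left untouched.

    def fold_constants(node):
        # Pass A: turn ("gt", 2, 3) into the constant False.
        if isinstance(node, tuple) and node[0] == "gt":
            return node[1] > node[2]
        if isinstance(node, tuple) and node[0] == "if":
            return ("if", fold_constants(node[1]), node[2])
        return node

    def remove_dead_branches(node):
        # Pass B: delete an 'if' whose condition is the constant False.
        if isinstance(node, tuple) and node[0] == "if" and node[1] is False:
            return None
        return node

    prog = ("if", ("gt", 2, 3), "expensive_call()")

    # A then B: folding exposes the dead branch, which B then removes.
    print(remove_dead_branches(fold_constants(prog)))  # None
    # B then A: B sees a non-constant condition and gives up; the branch stays.
    print(fold_constants(remove_dead_branches(prog)))  # ("if", False, "expensive_call()")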

The first is profile-guided optimisation: in essence, one statically compiles a program, profiles an execution or two, then feeds that back to another run of the static optimiser, which uses that information to home in on the parts of a program which will most benefit from optimisation. Doing this can allow a compiler to run its optimisers for longer on smaller parts of a program, rather than wasting its effort trying to optimise the entire program. This can yield useful performance benefits (see e.g. the 10% improvements that Chrome and Firefox reported) but it’s difficult to set up, compilation becomes even slower, and it only helps if all subsequent runs follow the same general pattern as the profiled run(s). I suspect the latter point is the most critical: if people use your program in different ways from run to run, profile-guided optimisation can even degrade performance.

The second is dynamic compilation (i.e. JIT compilation), which can be thought of as live, continuous profile-guided optimisation: every time a program runs, dynamic compilation observes its behaviour, and compiles based on that behaviour. Thus dynamic compilation has the advantage over static compilation that it can much more accurately identify the genuine bottlenecks in a program. However, dynamic compilation has to be even more mindful of the costs of optimisation, since time spent dynamically compiling is bad for the user (draining their battery or simply taking potential CPU time away from the user). Thus while dynamic compilation is the most effective way of determining which parts of a program will benefit most from optimisation, it has much less leeway to use expensive optimisations than static compilation.

Finally, whether static or dynamic, all compilers contain a finite number of optimisations: they may not contain the optimisation that our program needs to run fast. There is a small-ish set of optimisations which it's safe to assume that every optimising compiler (static or dynamic) contains. After that, things become, in my experience, a lot more reactive. If a user shows a compiler developer a program which runs unexpectedly slow, the compiler developer will probably add, or tweak, an optimisation to make it perform better. Over time, therefore, different compilers tend to do well on an unpredictable subset of all possible programs. You can often see this in the real world where different optimising compilers for the same language perform much better or worse on a given program. Is there anything we can do about this? I don't think so. After all, we can't (or, at least, probably shouldn't) optimise what we haven't seen: benchmarks are, to a large degree, destiny. [8]

We don't know where the performance ceiling is

One of the problems that optimising compiler authors face is that we don’t really know if we’re doing a good job or not. If optimisation makes a given program 50% faster that might be really good, or it might be somewhat mediocre: since we generally don’t know what the best possible performance (what I call the “performance ceiling”) of a program is, we have no real idea how far away we are from it. The closest we come to a “performance oracle” is when there are multiple compilers for a given language: if another compiler makes a program run faster than our compiler, we know we have work to do.

Given that our current static compilers seem to be well within the “diminishing returns on further optimisations” category, it might appear that we're close to the performance ceiling. But, for all we know, we could be stuck in a local maximum. Are there any ways that we might be able to tell? Superoptimisation, mentioned above, suggests that there are potentially many more performance gains to be had, although it's only been applied to small functions. Dynamic compilation is the other obvious possibility: since much more research has been put into static compilers, we don't really know if current comparisons of static and dynamic compilation are the end of the story or not. [9]

Maintaining implicit properties

As Daniel pointed out, compilers have a narrow view of optimisation: they’ll optimise x + 3 * 4 into x + 12, but I don’t know of any that will turn a bubble sort function into a merge sort. In part this is because recognising that a particular piece of code implements bubble sort is hard (think of all the minor, obfuscating, variations). But, more significantly, it’s because compilers are generally unable to prove that such a major change won’t violate properties of the program explicitly intended by the user. For example, what happens if the user’s bubble sort is so coded because they know that the program needs to run in constant memory and they’re prepared to trade off execution speed for memory? That property is implicitly encoded in the bubble sort, and the compiler has no way of knowing if it was explicitly intended or not.
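
To make the example a little more concrete, here is a sketch of the property in question: the in-place sort below never allocates a second list, whereas "upgrading" it to use sorted() would silently do so (purely illustrative code, not taken from any real system):

    def bubble_sort(xs):
        # Sorts xs in place: only a constant amount of extra memory is
        # used, however long the list is.
        for i in range(len(xs)):
            for j in range(len(xs) - 1 - i):
                if xs[j] > xs[j + 1]:
                    xs[j], xs[j + 1] = xs[j + 1], xs[j]

    data = [5, 3, 1, 4, 2]
    bubble_sort(data)     # constant extra memory
    copy = sorted(data)   # allocates a whole new list of the same length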

Real programs contain countless such examples of implicit properties: indeed, most of us, when programming, don’t even consider how much we expect the optimised version of our program to reflect the program as we wrote it. Partly in recognition of these implicit properties, compilers are often rather conservative in their optimisations: for example, with the notable exception of inlining, they are generally reluctant to perform optimisations that silently increase memory usage. Because of this, a human who can reason about the importance of such properties, and disregard them if they are not important, always has a potential edge over a compiler.

Over-prioritising optimisation has consequences

Experience suggests that over-prioritising optimisation can lead to problems in other areas, notably correctness and security. The most obvious, though by no means the only, example of this is C. Most widely known are the problems with undefined behaviour. To cut a long story short, the C specification says, more often than most other languages, “if you do X, the behaviour is undefined”, at which point the program can do absolutely anything, whether you intended it to or not.

The problem is that the vast majority of such cases cannot be detected statically: that is, it’s often impossible to know if your program could exhibit undefined behaviour or not. Have you remembered, for example, to add a runtime check that the divisor of every non-constant division is non-zero? That one’s well known enough that people will tend to check it. But what happens when signed integer addition overflows? C gives no guarantees about the outcome, and most programs have so many additions that we struggle to reason about which ones we might need to be careful about. You can read more examples here.

It goes without saying that a large number of unfortunate bugs (including many security alerts) have occurred because of undefined behaviour. However, the problem has become much worse in recent years as C compilers optimise programs based on the assumption that no undefined behaviour exists. In other words, compilers are now more strictly interpreting the standards which, ironically, means that they have become less conservative in their optimisations. No doubt a few benchmarks increase in speed because of this, but my guess is that it makes only a very small difference to most programs’ performance.

I used to think that I could reliably program in C, but the exploitation of “minor” undefined behaviour has undermined my confidence. To add insult to injury, not only does C have far more undefined behaviour than other languages, making it more likely that one will forget some of the less common cases, but the C standard isn't even freely available (though one can get drafts, which are “close” to the final standard). This has led to deep unhappiness on both sides: programmers tend to say “please don’t violate widely held expectations even if the standard doesn’t mention those expectations”; compiler writers have a tendency to say “we’re following the standard, and are making your programs faster: if there’s a problem in your program, you should fix it.” If there’s any single factor which will hasten the decline of C as a widely used language, I suspect this is it. [10]

C compilers are the most obvious example of this problem, but not the only ones. There are at least a couple of cases where, depending on the day of the week, I can convince myself either way. A good example of this is Python's memory semantics. In theory, Python long specified that programmers should not rely on the deterministic destruction of objects. However, since the only implementation of Python was CPython, which is reference counted – and since most people don't read, or remember, the language specification – many programs came to rely in subtle ways on deterministic destruction (e.g. implicitly expecting that file descriptors would be freed at the end of a function). When PyPy came along, it soon came to favour non-deterministic destruction. Doing so allowed PyPy to have a high performing state-of-the-art garbage collector, but it did break a number of programs (e.g. many ran out of file descriptors). Is that a good trade-off? Could, or should, PyPy have used some mix of reference counting and “proper” garbage collection? Since this is pretty much the only intentional “incompatibility” between CPython and PyPy it feels just about reasonable to me, although I'm sure some people feel differently.
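
The file descriptor case is easy to show in code. The first function below is the sort of thing that works reliably on CPython, where reference counting closes the file as soon as it becomes unreachable, but which can exhaust descriptors on PyPy, where collection happens at some later, unpredictable point; the second makes the property explicit and behaves identically on both (a small sketch; the function names are mine):

    def read_first_line(path):
        # Relies on the file object being collected, and thus closed,
        # almost immediately: true on CPython, not guaranteed on PyPy.
        f = open(path)
        return f.readline()

    def read_first_line_portable(path):
        # The 'with' block closes the descriptor deterministically on any
        # Python implementation.
        with open(path) as f:
            return f.readline()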

Summary

It's important to be realistic: most people don't care about program performance most of the time. Modern computers are so fast that most programs run fast enough even with very slow language implementations. In that sense, I agree with Daniel's premise: optimising compilers are often unimportant. But “often” is often unsatisfying, as it is here. Users find themselves transitioning from not caring at all about performance to suddenly really caring, often in the space of a single day.

This, to me, is where optimising compilers come into their own: they mean that even fewer people need to care about program performance. And I don't mean that they get us from, say, 98 to 99 people out of 100 not needing to care: it's probably more like going from 80 to 99 people out of 100 not needing to care. This is, I suspect, more significant than it seems: it means that many people can go through an entire career without worrying about performance. Martin Berger reminded me of A N Whitehead’s wonderful line that “civilization advances by extending the number of important operations which we can perform without thinking about them” and this seems a classic example of that at work. Even better, optimising compilers are widely tested and thus generally much more reliable than the equivalent optimisations performed manually.

But I think that those of us who work on optimising compilers need to be honest with ourselves, and with users, about what performance improvement one can expect to see on a typical program. We have a tendency to pick the maximum possible improvement and talk about it as if it's the mean, when there's often a huge difference between the two. There are many good reasons for that gap, and I hope in this blog post I've at least made you think about some of the challenges and trade-offs that optimising compilers are subject to.

Acknowledgements: My thanks to Edd Barrett, Martin Berger, Carl Friedrich Bolz, Lukas Diekmann, and Sarah Mount for substantial and insightful comments on early drafts. Any errors, omissions, or infelicities, are my own.

Footnotes

[1] Most readers will be familiar with Knuth’s quip that “premature optimisation is the root of all evil.” However, I doubt that any of us have any real idea what proportion of time is spent in the average part of the average program. In such cases, I tend to assume that Pareto’s principle won't be too far wrong (i.e. that 80% of execution time is spent in 20% of code). In 1971, a study by Knuth and others of Fortran programs found that 50% of execution time was spent in 4% of code. I don't know of modern equivalents of this study, and for them to be truly useful, they'd have to be rather big. If anyone knows of something along these lines, please let me know!
[2] Daniel finished with an idea as to how this limitation could be overcome: instead of compilers being a black box, programs should be able to interact with them to help the compiler optimise them. While there is some work in this general direction (see e.g. Haskell’s approach), I think that pushing it as far as Daniel would like is hard. Certainly, I’ve done enough work in macros and compile-time meta-programming to have some idea of how difficult it is to find a good design for interactions between a program and compiler.
[3] Though it is important to note that both do compile to bytecode internally. The oft-stated distinction between “interpreted” and “compiled” languages is a red-herring: any language can be interpreted or compiled (and some language implementations interpret compiled code).
[4] Interestingly, modern languages seem much happier with lengthier compile times. rustc, which I use a lot, is not fast, even with optimisations turned off; and scalac, when I last used it, was even slower for small files (though since, I think, it doesn't do full program optimisation statically, I imagine it's faster in some senses than rustc at compile-time). I can't say why this is for sure, although I do have a potentially plausible explanation.

Take gcc as a concrete example: its first release was in 1987. At that point, I believe that the Cray-2 was the second fastest machine on the planet, operating at 2 GFLOPS (at something like £25,000,000 in today's money); for reference, my 2016 i7-6700K desktop processor (£301) is something like 80-115 GFLOPS. Using GFLOPS as a measurement of speed of anything is dangerous, particularly when it comes to compilers (which do few, if any, floating point operations). But the general trend is clear: a modern commodity processor is a couple of orders of magnitude faster than a 30 year old supercomputer. It's safe to assume that compiling programs took quite a bit longer in wall-clock time in 1987 than today, and that therefore people would have put quite a lot of effort (consciously and unconsciously) into making compilers run acceptably fast.

Fast forward to today, and the optimisations that made a compiler run not-too-slow in the late 1980s make it blazingly fast today. In contrast, modern language designers and compiler implementers are happy with compile times that are, relatively, much slower but still, in absolute terms, just about acceptable (perhaps around a couple of seconds for a medium-large file, as a rough rule of thumb; which, I would guess, is probably faster in wall-clock time than a C compiler in 1987).

As a ridiculously simple comparison, compiling "hello world" on my laptop (median value over 10 runs – I'm not pretending this is a paper-strength methodology) takes 0.037s for gcc, 0.173s for Go, and 0.287s for Rust.

[5] In the context of LLVM/clang it appears that -O3 optimises no better than -O2 (at least in 2013 when that paper was written).
[6] Abstract interpretation is an attempt to do this. It's got real uses in bug finding, but I'm not sure that it would do very well at performance prediction.
[7] Jamey Sharp has written an approachable overview of this problem and some of the work in the area.
[8] My experience suggests that this is a more common issue in dynamic optimising compilers, which in some sense have more opportunities – or, at least, more information – to optimise, but also more constraints (chiefly time). However, this may partly be a function of the richness of modern languages, which are more likely to be dynamically compiled and optimised.
[9] My gut instinct is that many programs in modern languages such as Haskell would run faster if they were dynamically (rather than statically) compiled. The basic reason is simple. From a runtime perspective, the pattern matching that is common in such languages is just dynamic typing in disguise (the fact that the pattern matching statically enforces exhaustiveness is of no use beyond the type checker). Thus the static compiler is either going to decline to inline the branches of pattern matches, or it'll guess (often wrongly) that one branch is more commonly taken, or (less likely) it'll try and inline all branches. Since inlining makes pretty much every other optimisation much more effective, this inability to statically optimise must cause performance issues. However, I should be clear: I currently have no evidence for this other than my gut instinct, which I suspect might not stand up in a court of justice.
[10] There have been several attempts to define a “safe” subset of C (e.g. this) that is immune to the most egregious examples of undefined behaviour, but it's not clear that any of these will gain sufficient momentum to become widely accepted. That's a pity, but such is life.

Fine-grained Language Composition

September 21 2016

Programming languages are anti-social beasts: while they're happy to communicate extensively with the operating system that launched them, they're reluctant to talk to other programming languages. Most, reluctantly, make an exception for low-level languages, which have a natural mapping onto the low-level system ABI, most notably C. Indeed, used without qualification, the term ‘Foreign Function Interface’ (FFI) is typically assumed to refer to an interface to C, as if it was the only “foreign” programming language in existence.

Programming languages therefore resemble islands: each language defines its own community, culture, and implements its own software. Even if we want to travel to another island, we often find ourselves trapped, and unable to do so. The only real exception to this is languages which run on a single Virtual Machine (VM), most commonly a Java VM. These days, many languages have JVM implementations, but it can sometimes seem that they've simply swapped their expectations of an FFI from C to Java: non-Java JVM languages don't seem to talk to each other very much.

We're so used to this state of affairs that it's difficult for us to see the problems it creates. Perhaps the most obvious is the huge effort that new languages need to expend on creating libraries which already exist in other languages, an effort far greater than that required for the core language or compiler. Less obvious is that the initial language used to write a system becomes a strait-jacket: we almost always stick with our original choice, even if a better language comes along, even if no-one is trained in the original language, or even if it simply becomes difficult to run the original language.

Those who are brave enough, foolhardy enough, or simply desperate have three main options for escaping from this mire. First, they can manually rewrite their system in a new language. However, rewriting anything but the tiniest system from scratch in a new language is expensive – often prohibitively so – and extremely risky: it often takes years for the new system to reach parity with the old; and seemingly unimportant features are often ignored in the rewrite, only for users to rebel at their exclusion. Second, they can use translation tools to automatically rewrite their system for them. Perhaps surprisingly, this technique really works, even for very complex systems [1]. However, unless the two languages involved are very similar, it comes at a price: the automatically translated code rarely looks like something a human would have written. This makes maintenance of the translated systems almost impossible, and perhaps explains why these tools have seen little real-world use. Third, they can use some form of inter-process communication to have parts of the system running in different languages. The most obvious cost for this route is that splitting the system into separate chunks generally requires significant refactoring. Less obvious is that the overhead of inter-process communication is so high (depending on your system, roughly 5-6 orders of magnitude slower than a normal function call) that this technique is only practical if the resulting chunks communicate infrequently.
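
If you want a rough feel for that overhead, the sketch below times an ordinary function call against a round trip to a persistent child process over a pipe, using Python's multiprocessing module purely as a convenient illustration (the exact gap will vary wildly between machines and IPC mechanisms):

    import timeit
    from multiprocessing import Pipe, Process

    def plain_call(x):
        # A normal, in-process function call.
        return x + 1

    def echo_server(conn):
        # A persistent child process that echoes each message straight back,
        # so the parent measures a round trip rather than process start-up.
        while True:
            msg = conn.recv()
            if msg is None:
                break
            conn.send(msg)

    if __name__ == "__main__":
        parent_conn, child_conn = Pipe()
        child = Process(target=echo_server, args=(child_conn,))
        child.start()

        def ipc_call(x):
            parent_conn.send(x)
            return parent_conn.recv()

        print("in-process:   ", timeit.timeit(lambda: plain_call(1), number=100000))
        print("cross-process:", timeit.timeit(lambda: ipc_call(1), number=10000))

        parent_conn.send(None)   # tell the child to exit
        child.join()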

The outlines of a solution

At this point, I'm going to assume that you agree with me that there's a real problem with our inability to move easily between different programming languages. Furthermore, that this is a problem that huge numbers of people (including many very large organisations) suffer from, and that existing solutions will never address. You might then wonder: why don't we hear more complaints about this? After all, people complain endlessly about minor limitations in programming languages, frameworks, and operating systems. I think the reason for the lack of complaint is that people think the problem inevitable. Raging against it would be like raging against the River Severn's tide: whatever you do, it'll swallow you up and deposit you wherever it sees fit.

Recently, I've been involved in research – led by Edd Barrett, with assistance from Carl Friedrich Bolz, Lukas Diekmann, and myself – which, we think, suggests a new solution to this problem. If you’ve already read the accompanying paper, then parts of this blog will be familiar, although I hope there’s enough new things to keep you interested: relative to the paper, I’m going to give a bit more motivation and background, and focus a little less on some of the most minute details.

In essence, what we've done is to compose (i.e. glue) together two real-world programming languages – PHP and Python 2.x [2] – and show that users can move between them seamlessly. We did this by composing two meta-tracing interpreters together, such that, at run-time, the two languages not only share the same address space, but a single JIT compiler operates across them. Building atop a (somewhat traditional) FFI between the two languages, we allow users to embed syntactic fragments of each language inside the other (e.g. one can embed Python functions in PHP and vice versa, and nest those embeddings arbitrarily deeply), and for those fragments to interact as if they were written in the same language (referencing variables across languages and so on). Written down like that, it can appear somewhat intimidating, but hopefully Figure 1 shows that the resulting system, called PyHyp, is fairly intuitive.
Figure 1: A PyHyp program implementing a PCG8 pseudo-random number generator, written in the Eco language composition editor. The outer (white background) parts of the file are written in PHP, the inner (pinkish background) parts of the file in Python. ❶ A Python language box is used to add a generator method written in Python to the PHP class RNG. ❷ A Python language box is used to embed a Python expression inside PHP, including a cross-language variable reference for rng (defined in line 19 in PHP and referenced in line 20 in Python). In this case, a Python list comprehension builds a list of random numbers. When the list is passed to PHP, it is 'adapted' as a PHP array. ❸ Running the program pretty-prints the adapted Python list as a PHP array.

At this point, you might be wondering what jamming together PHP and Python really shows: after all, it's not the first time that people have embedded one language inside another – indeed, I've done my share of language embeddings in the past. Most such embeddings are of a small DSL inside a larger, general purpose programming language. I've seen a few cases of people embedding one programming language inside another, but either the embedded language is tiny (e.g. Lisp) or only a subset of the embedded language is supported (e.g. “Java-like”). There's nothing wrong with such approaches, but it's never really been clear that one can sensibly embed one large programming language inside another.

The approach we ended up taking with PyHyp shows that not only is it possible to embed large programming languages inside each other, but that the result is usable. In essence, we composed together meta-tracing interpreters, resulting in a VM that dynamically compiles composed programs into machine code; and we allow each language's syntax to be directly embedded in the other. Optionally, one can make use of the Eco language composition editor to make the more fiddly parts of the syntactic composition easier to use. Bearing in mind the caveat that Eco is very much a research prototype, we have a reasonable expectation that this general approach will work for a wide variety of other real-world programming languages (indeed, we have also done something similar for Python and Prolog).

It's important to realise that our approach is neither automatic nor magic. Programming languages are made by humans for humans. Composing programming languages together creates new language design issues, which humans must resolve, so that the result is usable by other humans. To be blunt, this was not something that we thought of when we started. In the end, composing real languages turned out to be a bit like assembling flat-pack furniture: some bits don't appear to fit together very well, some take considerable effort to make sense of, and a few appear completely superfluous. We eventually came to think of these challenges as “points of friction” [3]: places where languages don't play nicely with each other. These fall into two categories: semantic friction is where it's difficult to make sense of one language's features in the other (e.g. Python has dictionaries and lists; PHP has only dictionaries; how do we map between the two?); performance friction is where it's difficult to make different parts of each language perform well in conjunction with each other (e.g. cross-language variable lookup is tricky to make fast, since both languages' scoping rules are somewhat dynamic).

What the user sees

What Figure 1 hopefully makes clear is that PyHyp contains all of “normal” PHP and “normal” Python, as well as defining extra rules for composing them together. This is the result of a deliberate design goal: it is vital that normal PHP and Python code works unchanged when it is part of a PyHyp program [4]; we only want users to deal with PyHyp's extra rules when they are writing cross-language code.

So, what are those extra rules? The platitudinous answer is that, as far as possible, we've tried to make them obvious and transparent, particularly if you use Eco. The extra syntactic rules are fairly simple: PyHyp programs always start with PHP [5]; wherever a PHP function is valid, a Python function can be used; wherever a PHP expression is valid, a Python expression can be used; and wherever a Python statement is valid, a PHP function is valid. The extra semantic rules are mostly simple: PHP and Python can reference variables in the outer language (semi-) lexically (we'll look at this in more detail later); core datatypes (e.g. ints, strings) are directly converted between the two languages; collections (lists and the like) are “adapted” such that they're almost indistinguishable from native collections; and everything else is adapted ‘generically’. In essence, an adapter is simply a wrapper around a ‘foreign’ object, such that one language can interact with objects from the other language. [6]

Moving things like integers and strings between the languages works simply and naturally. Life gets a bit more complicated when you encounter areas of semantic friction. One of the most obvious is collections: PHP has only dictionaries (i.e. hash tables; which, confusingly, PHP calls arrays); but Python has separate notions of lists and dictionaries. Passing a Python list or dictionary to PHP is easy: it can only be adapted as a PHP array. The opposite direction is more challenging, as Python lists and dictionaries have substantially different semantics (e.g. lists are ordered, whereas dictionaries are unordered). To most people, it seems “obvious” that if we pass a PHP dictionary which is list-like (i.e. has integer keys from 0, 1, 2, 3 … n) to Python then it should be adapted as a Python list, and indeed our first attempt did just this. However, since both Python (via an adapter) and PHP point to the same underlying PHP array, later mutation can make that PHP array turn into something that isn't list-like (e.g. by adding a key "woo"), at which point the list adapter would be at best misleading, and at worst, incorrect. We therefore ended up taking a different, safer, tack: a PHP dictionary passed to Python is always initially adapted as a Python dictionary. If the user knows for sure that the PHP dictionary really is, and always will be, list-like, then they can obtain a list adapter from the dictionary adapter. As well as the normal dictionary operations, dictionary adapters expose an extra method to_list() [7] which returns a Python list adapter pointing to the underlying PHP dictionary. Of course, this doesn't free us from the possibility of the underlying PHP dictionary ceasing to be list-like — the list adapter continually checks that the underlying dictionary is still list-like, throwing an exception if that is no longer true.
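
Although PyHyp's adapters are really implemented inside the composed RPython VM, the checking involved is easy to sketch in ordinary Python. Everything below (the class name, and the use of a plain dict to stand in for the underlying PHP array) is illustrative invention, not PyHyp's actual code:

    class ListAdapter(object):
        # Wraps a dict standing in for the underlying PHP array; every
        # access re-checks that the dict still looks like a list.
        def __init__(self, php_array):
            self._d = php_array

        def _check_list_like(self):
            # List-like means the keys are exactly 0, 1, ..., n-1.
            if set(self._d.keys()) != set(range(len(self._d))):
                raise TypeError("underlying PHP array is no longer list-like")

        def __getitem__(self, i):
            self._check_list_like()
            return self._d[i]

        def __len__(self):
            self._check_list_like()
            return len(self._d)

    d = {0: "a", 1: "b"}
    lst = ListAdapter(d)
    print(lst[1])        # fine: the underlying dict is still list-like
    d["woo"] = "c"
    print(lst[1])        # raises TypeError: it no longer is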

Those readers familiar with PHP might have noticed the implicit assumption above that PHP arrays are mutable. In fact, they are immutable: appending an item, for example, copies the underlying array, adding an extra item to the copy. Indeed, nearly all PHP datatypes are immutable with the exception of user objects (i.e. instances of classes) and references. This latter data type is a mutable reference to another object (i.e. it is a boxed pointer which can point to different objects over time). References can be used to give the illusion that data types such as arrays are mutable. Instead of passing an array around directly, one passes around a reference; when appending an item to an array wrapped in a reference, for example, one simply updates the reference to point to the new array. PHP has somewhat complex rules for automatically wrapping data types but, simplifying somewhat, functions can say that their parameters must come wrapped in a reference; non-references are automatically wrapped in a reference. Since Python users expect lists and dictionaries to be mutable, we would therefore have an uncomfortable instance of semantic friction if adapted PHP arrays were sometimes mutable, sometimes immutable. PyHyp therefore defines that passing an array from PHP to Python automatically wraps it in a reference if it is not already so. Doing so is natural from both PHP and Python's perspective.
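If it helps to see the underlying idea in code, here is a conceptual sketch in Python (not PHP, and with purely illustrative names) of how a mutable reference makes an immutable array appear mutable:
# A mutable box pointing at an immutable value (a tuple stands in for a PHP array).
class Ref:
  def __init__(self, value):
    self.value = value

def append(ref, item):
  # "Appending" copies the immutable array and repoints the reference at the copy.
  ref.value = ref.value + (item,)

r = Ref((1, 2, 3))
append(r, 4)
print(r.value)   # (1, 2, 3, 4)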

I could go on, because PyHyp resolves a number of other instances of semantic friction, but hopefully you believe the following: even when the two languages involved in PyHyp didn't appear able to play together nicely, we were nearly always able to find a satisfying design to resolve the problem. However, in a few cases, there simply isn't a sensible way to fit the two languages together. For example, Python functions can be called with keyword parameters, but PHP has no equivalent notion, and we couldn't conceive of a usable design that would work around this fact. Therefore, trying to call a PHP function with keyword parameters from Python raises a run-time error.

How the run-time works

At this point, it’s hopefully fairly clear what the semantics of PyHyp are, but how are they implemented? Fundamentally, we do something very simple: we glue together interpreters. However, doing so naively would lead to terrible performance, so we use meta-tracing interpreters. Meta-tracing is a way of creating a VM with a JIT compiler from just a description of an interpreter. As I said in an old blog article, this is a fascinating idea, because it turns programming language economics upside down: you can get a fairly fast language implementation for not a great deal of effort [8]. At the time of writing, the main extant meta-tracing language is RPython (if you think of it as Java-ish semantics with Python syntax, you won't be too far off the mark), which is what we used.

Whilst writing a really high-quality meta-tracing interpreter from scratch is a lot easier than writing a really high-quality manual JIT compiler, it's still more work than we wanted to take on. We therefore took two off-the-shelf interpreters — PyPy, a complete implementation of Python 2.x, and HippyVM, a reasonable, but definitely not fully complete, implementation of PHP [9] — and joined them together. Although there was a little bit of messing around to get this to work (mostly because HippyVM and PyPy both expect to be the sole program being compiled and run), it's fairly easy to do. Simplifying somewhat, PyHyp's main function is HippyVM's: HippyVM imports parts of PyPy and calls them as necessary (and PyPy can call back to HippyVM etc.). From the perspective of a meta-tracer such as RPython, our “HippyVM that imports PyPy” interpreter is no different from any other interpreter. In other words, just because we know it contains two constituent interpreters doesn't mean that meta-tracing knows or cares – it's just another interpreter (albeit the largest ever compiled with RPython). That means that the two constituent interpreters run in the same address space, share a garbage collector, and share a JIT compiler.

At a low-level, PyHyp defines a fairly standard FFI between PHP and Python. For example, PHP code can import and call Python code as in the following example, where PHP code imports Python's random module and then calls its randrange function:
$random = import_py_mod("random");
$num = $random->randrange(10, 20);
echo $num . "\n";
This simple example immediately raises the question of how Python's random module and the Python int returned by randrange are represented in HippyVM. In essence, small immutable data-types (e.g. ints and floats) are directly translated between the two languages (e.g. passing a Python integer to PHP causes a PHP integer to be created). All other data-types are represented by an adapter which points back to the original object; all operations performed on the adapter are transparently forwarded to the original object. For example, passing a Python user object to PHP creates a PyAdapter in PHP: attempting to access an attribute in the adapter forwards the request to the underlying object. Passing an adapted object back to the language it came from simply unwraps the adapter, so that they don't gradually pile-up on top of each other.
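If you want a feel for what “transparently forwarded” means in practice, here is a conceptual sketch in plain Python. PyHyp's real adapters live inside the interpreters themselves, so this is an analogy rather than the actual implementation:
# Conceptual sketch: an adapter that forwards attribute reads and writes to the
# object it wraps, so that users mostly can't tell the adapter and object apart.
class GenericAdapter(object):
  def __init__(self, target):
    object.__setattr__(self, "_target", target)

  def __getattr__(self, name):
    return getattr(self._target, name)

  def __setattr__(self, name, value):
    setattr(self._target, name, value)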

Looking at the implementation of this can help make things clearer. In the above example, the import_py_mod function returns a Python module object which is adapted in PHP as a W_PyModAdapter. Looking up the randrange “method” returns a Python function object which is adapted as a W_PyCallable. Calling that adapter in PHP causes execution to pass to the underlying Python randrange function. As this may suggest, both PHP and Python have a number of adapters. Some adapters make data-types appear more natural in the other language; others are useful for performance; and the ‘generic’ catch-all adapter ensures that every data-type in each language can be adapted.
Figure 2: Showing that adapters are transparent, with output from running the program in the right hand (‘console’) pane. Passing a PHP array to Python creates a dictionary adapter: querying the adapter for its type returns Python's dictionary type. Lines 2 and 3 clearly show that the type of the adapted PHP array, whether as a Python list or dictionary, is indistinguishable from native Python objects. Even identity checks are forwarded: although id_check is passed two separate adapters, the two identity checks work as expected.

This simple model of adapters has a number of advantages: they're simple to implement; they're flexible in terms of reducing semantic and performance friction; and, as Figure 2 shows, they're almost entirely transparent (even object identity checks are forwarded). In order to implement them, we had to invasively modify both HippyVM and PyPy. For the example above, we needed to add Python adapters to HippyVM, and alter PyPy so that Python objects can adapt themselves into PHP. Simplifying slightly, both HippyVM and PyPy have similar data-type hierarchies: both have classes called W_Root that the various data-types inherit from. We thus have to provide an (overridable) way for Python objects to create an appropriate adapter for PHP, as well as the adapters themselves. A simplified version of the relevant RPython code looks as follows. First the modifications to HippyVM:
# HippyVM: hippy/module/pypy_bridge/py_adapters.py
from hippy.objects.base import W_Root as W_PhRoot

class W_PyGenericAdapter(W_PhRoot):
  # Wraps a Python object (w_py_inst) so that PHP code can interact with it.
  def __init__(self, w_py_inst):
    self.w_py_inst = w_py_inst

  def lookup(self, attr_name):
    # Forward attribute lookups to the underlying Python object, adapting the
    # result back into PHP's object hierarchy.
    return self.w_py_inst.lookup(attr_name).to_php()

  def to_py(self):
    # Passing the adapter back to Python simply unwraps the original object.
    return self.w_py_inst
and second the modifications to PyPy:
# PyPy: pypy/interpreter/baseobjspace.py
import py_adapters

class W_Root:
  def to_php(self):
    # Create a generic PHP adapter for this Python object; subclasses can
    # override this to return a more specialised adapter.
    return py_adapters.W_PyGenericAdapter(self)
As this code hopefully suggests, when a Python object (an instance of W_Root) is passed to PHP, its to_php method is called; that returns an instance of W_PyGenericAdapter, which is a class in PHP's object hierarchy. If the adapter is later passed back to Python, to_py simply unwraps it, returning the original Python object.

As a rough measure of size, PyHyp adds about 2.5KLoC to PyPy and HippyVM (of which adapters are about 1.4KLoC). This is a reasonable proxy for complexity: PyHyp inherits most of the lengthy implementation work for free from HippyVM and PyPy. It may be interesting to note that we added 5KLoC of tests, which shows that much of the effort went into working out what we wanted to happen.

Syntactic interactions

At a low level, PyHyp composes the two languages through dynamic compilation of Python code (and of PHP code nested inside Python), but that interface is too ugly for anyone in their right mind to want to use directly. By using language boxes and the Eco language composition editor, we can make syntax composition pleasant. For the simple case of dropping a Python function or expression into a PHP program, there's little to add beyond what is obvious from Figure 1.

Things get more interesting when you want Python and PHP to syntactically interact. What does that even mean? In PyHyp's case, it means that Python and PHP can reference variables in the other language. For example, if PHP defines a variable x then a suitably scoped Python language box can reference that variable x. In the common case, this feature is simple and natural, but digging a bit deeper reveals real challenges. In essence, these boil down to the fact that neither PHP 5.x nor Python 2.x has, from a modern perspective, particularly great scoping rules. Neither language is fully lexically scoped (both can add variables to a scope at run-time) and neither fully implements closures (Python emulates closures for reading variables, but this illusion doesn’t deal with assignment to a variable in a parent function). In addition, PHP has 4 separate global namespaces (one each for functions, classes, constants, and variables), each of which is shared across all source files (PHP includes files, in the C sense, rather than importing modules).
Figure 3: Cross-language variable referencing, showing some of the design challenges in finding good cross-language scoping rules. The inner PHP language box references PHP's range function (inclusive in both its parameters); the Python language box references Python's range function (inclusive in its first, but exclusive in its second, parameter); and the Python language box also references PHP's print_r function.

These details meant that, when we started looking at a design for cross-language variable scoping, we didn't really know where to start: it's something that, as far as we know, no-one has ever attempted before; and we appeared to have picked two languages with different, cumbersome, scoping rules. We quickly discovered that a bad design leads to subtle bugs. Figure 3 shows my favourite example of such a bug. Both PHP and Python define global functions called range(f, t) which generate a list of numbers from f to t. PHP's range is inclusive so range(0, 3) generates [0,1,2,3]; Python's range is exclusive on its second argument so range(0, 3) generates [0,1,2]. That means that if Python or PHP code references the other language's range function then odd things happen (and yes, we discovered this the hard way). A simple solution might be to say that each language can only reference its own global variables, but what happens when Python wants to use PHP's print_r function (which pretty prints PHP data-structures differently than Python's equivalents)?
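The difference is easy to demonstrate from the Python side (the PHP behaviour is as described above):
# Python's range is exclusive in its second argument:
print(list(range(0, 3)))   # [0, 1, 2]
# PHP's range(0, 3) would instead produce [0, 1, 2, 3], inclusive at both ends.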

The eventual design [10] we settled upon deals with such nasty corner cases, but (hopefully) is still simple enough to comprehend. First, within a language box, each language's variable scoping rules are unchanged. When a variable is referenced that is not found in the current language box, then a binding is searched for in the following order. First, the lexical chain of language boxes is recursively searched (from inner to outer) for a binding. Second, if no match has been found, the ‘host’ language box’s globals are searched (i.e. if a Python language box references range, and no outer language box defines a binding for this variable, then Python's global variables are searched first). Third, the other language's globals are searched (which is how Python is able to access PHP’s print_r function). Finally, if no match has been found, an appropriate error is raised. One complication is PHP's multiple global namespaces: in PHP, one can define a function, a class, a constant, and a variable all called x. In PHP, syntactic context disambiguates which namespace is being referred to (e.g. new x() searches for a class called x whereas x() searches for a function of that name). There are no equivalent syntactic forms in Python, so PyHyp searches the global namespaces in the following order, returning the first match: functions, classes, constants, and variables [11].
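Expressed as pseudocode, the lookup order looks roughly as follows. This is an illustrative sketch of the rules above, not PyHyp's actual implementation, and all of the helper names are made up:
# Illustrative sketch of PyHyp's cross-language variable lookup order.
def lookup(name, box):
  # The language box's own (unchanged) scoping rules have already failed.
  # 1. Search the lexical chain of enclosing language boxes, inner to outer.
  outer = box.parent
  while outer is not None:
    if outer.has_binding(name):
      return outer.get_binding(name)
    outer = outer.parent
  # 2. Search the globals of the language box's own ("host") language.
  if box.own_globals().has_binding(name):
    return box.own_globals().get_binding(name)
  # 3. Search the other language's globals; for PHP this means searching its
  #    global namespaces in order: functions, classes, constants, variables.
  for namespace in box.other_languages_globals():
    if namespace.has_binding(name):
      return namespace.get_binding(name)
  # 4. No binding found: raise an appropriate error.
  raise NameError(name)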

From an implementation point of view, we needed about 250LoC to add the cross-language scoping rules. Making things run fast (remember that both languages' lookup rules are entirely dynamic) was a bit more tricky but, once we'd got a good design, not hugely difficult.

At this point you may be wondering why we went to such bother to design cross-language scoping rules. After all, previous language compositions have nearly always done without such a feature. Our reasoning is fairly simple: sometimes it’s convenient to use one language over the other, but not necessarily at the level of converting one function at a time. In such cases, being able to migrate a single line can be useful. Figure 1 gives a simple example: the Python language box on line 20 allows an interaction with the gen function that would be horribly verbose if it was done in PHP. As soon as you want to be able to mix languages on a line-by-line level, it’s inevitable that you will want to reference variables across the languages, and thus that you need to define cross-language scoping rules.

Performance

Let's assume that you’re now prepared to believe that PyHyp is a useful, innovative, fine-grained composition of two languages: all is for naught if PyHyp programs are slow. How do we tell if PyHyp is slow? The answer seems simple: benchmark mono-language programs (i.e. programs written purely in Python or PHP) against PyHyp programs. The reality, unfortunately, is somewhat messier: no-one really knows how to reliably write good benchmarks; when a good benchmark is written, programming language implementers have a tendency to optimise specifically for it, reducing its use as a good measure [12]; and we don't yet know how to produce high-quality statistics for many types of real-world executions [13]. As icing on the cake, anyone who’s sufficiently practised at creating benchmarks (and who, like me, is also sufficiently evil) can create a seemingly reasonable benchmark that just so happens to show the result they want. That said, it's nearly always the case that some benchmarks are better than no benchmarks — just remember the above caveats when interpreting the results!

The first problem we had in benchmarking our composition was that not only were there no available PyHyp benchmarks, but there weren't even good guidelines for how to go about creating composed benchmarks. We therefore created a series of small, focussed benchmarks that concentrate on a single feature at a time, as well as adapting several classic (and, relatively speaking, larger) microbenchmarks. In essence, for each benchmark we created several variants: mono-PHP; mono-Python; ‘outer’ PHP with ‘inner’ Python; and ‘outer’ Python with ‘inner’ PHP. For example, for the ‘outer’ PHP with ‘inner’ Python benchmarks, we took a PHP benchmark (e.g. DeltaBlue), keeping constants, variables, and classes in PHP, but translating methods into Python. Figure 4 shows an example. The aim with these benchmarks is to get a good idea of the costs of the composition when one frequently crosses the barrier between languages.
Figure 4: The DeltaBlue benchmark. On the left is a mono-PHP version. On the right is a composed version: only functions and methods have been translated into Python, with everything else (including classes, constants etc.) remaining in PHP.

If you want to see the full results, I suggest reading Section 7.3 of the paper. However, for most of us, a summary of the interesting results is probably sufficient. Since PyPy is nearly always faster than HippyVM, we can tighten our definition of “is it fast enough” by comparing PyHyp’s performance to PyPy’s. The geometric mean slowdown of PyHyp relative to PyPy is 20% (i.e. 1.2x) and the worst case is about 2.2x. Though the latter figure isn't amazing, bear in mind that it's the worst case — on average, these benchmarks suggest that PyHyp's performance is more than good enough to be usable. Indeed, in many cases PyHyp’s performance is better than that of the standard C-based interpreters used for languages such as Python and Ruby!
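For what it's worth, a geometric mean over per-benchmark slowdown ratios is computed as follows; the numbers here are made up purely to show the calculation and are not the paper's data:
from math import exp, log

# Hypothetical per-benchmark ratios of (PyHyp time / PyPy time).
ratios = [1.05, 1.10, 1.45, 2.20, 0.95]
geomean = exp(sum(log(r) for r in ratios) / len(ratios))
print("%.2f" % geomean)   # 1.28, i.e. a ~28% mean slowdown for these made-up numbers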

So, why is the performance good? First, there can be little doubt that small benchmarks flatter (meta-)tracing — I would expect larger programs to see more of a slowdown, though it's difficult to guess at how much of a slowdown that might be. Second, the most important factor in enabling good optimisations in tracing is inlining, which we were able to enable across the two languages fairly easily. Third, the major run-time overhead that PyHyp's composition imposes is adapters: most non-primitive objects that cross between the two languages have an adapter created for them. We might therefore expect some parts of a program to require double the memory allocation, since each object will require an adapter. However, in most traces which contain parts from both PHP and Python, the vast majority of adapters are created, used, and implicitly discarded within the trace. The trace optimiser can then prove that the adapters don’t ‘escape’ from the trace and thus remove their memory allocation, and nearly all other costs associated with the adapter. Thus, in practice, the typical cost of adapters is close to zero.

Of course, there are a few cases where performance isn't as good as we would have liked. Of the cases where we found something concrete (which isn't all cases), we found some where RPython generates inefficient machine code for two otherwise identical traces (and no-one has quite worked out why yet). A few cases suffer from the fact that PyHyp uses a ‘proper’ non-reference counted garbage collector, which sometimes causes HippyVM/PyHyp to have to copy PHP’s immutable arrays more than one would like. There are also a few parts of PyHyp which we could more extensively optimise with a bit more effort, and I'm sure that one could fairly easily write programs which tickle parts that we haven't yet thought to optimise. But, overall, performance seems likely to remain within “good enough to use” territory, which is the most important test of all.

Conclusions

As far as I know, PyHyp is, by some considerable distance, the most fine-grained language composition yet created. I think it shows that there is a completely different, practical solution to the problem, outlined at the start of this article, of migrating a system from an old language to a new language. I hope you can imagine someone using a PyHyp-like system to migrate a system gradually from (say) PHP to Python, even down to the level of line-by-line migration: the resulting composed program would not only run, but run fast enough to be used in production. Perhaps at some point the whole system would be migrated, and the language composition thrown away. The crucial point is that at no point is one forced to flick a switch and migrate wholesale from one language to another overnight.

Could you use PyHyp for this today? In a sense, it's a piece of software in a slightly unusual state: it's more polished than most research software, but real users will find many missing pieces. Most of those missing pieces are in HippyVM which, although it implements pretty much all of the PHP language, lacks many standard libraries and implements a slightly old version of PHP. Fixing this isn't difficult, but since HippyVM is not currently under active development, it would require someone taking on maintainership and bringing HippyVM to a higher degree of compatibility with current PHP. The good news is that PyHyp picks up additions to both HippyVM and PyPy with ease, so should work on HippyVM resume, PyHyp would benefit too.

In terms of general lessons – apart from the obvious one that PyHyp shows the plausibility of fine-grained language composition – one thing stands out to me. When we started this work, we assumed that implementing a language composition would be hard, and we concentrated our thinking on that issue. We were very wrong. Implementing PyHyp has, mostly, been fairly easy: we reused sophisticated language implementations; the modifications we needed to make were relatively small and relatively simple; and meta-tracing gave us good performance for very little effort. What caused PyHyp to occupy 12-18 months of our time was a problem we didn't even consider at the start: friction. While we were eventually able to find good solutions to all the design problems we encountered, it took us a surprisingly long time to identify the problems, and even longer to find good solutions to some of them. Partly this is because none of us was expert in both PHP and Python (indeed, weird details in both languages can catch out even the most hardened experts). We sometimes had to base our designs on our guess about one or the other language's semantics, and gradually refine our design through extensive test-cases. But the bigger issue was that we simply didn’t have much, if any, precedent to draw upon. I'd like to think some of the design solutions we came up with will be useful to those who undertake language compositions — and I'll certainly be interested to see the results! In the meantime, feel free to read the PyHyp paper or download PyHyp itself.

Acknowledgements: Edd Barrett was the technical lead of the PyHyp project with assistance from Carl Friedrich Bolz, Lukas Diekmann, and myself; Edd, Carl Friedrich, and Lukas gave useful comments on this blog post, but any errors or infelicities are my own. On behalf of the PyHyp project, I’d like to thank Armin Rigo for adjusting RPython to cope with some of PyHyp’s demands, and advice on HippyVM; Ronan Lamy and Maciej Fijałkowski for help with HippyVM; and Jasper Schulz for help with cross-language exceptions. This work would not have been possible without funding from EPSRC (the Cooler and Lecture projects) and, for early versions of Eco, Oracle Labs. I am grateful to both organisations!


Footnotes

[1] I remember reading Really Automatic Scalable Object-Oriented Reengineering, which describes a system for translating large C systems to Eiffel. Although I had seen a couple of commercial systems tackling “old” languages (e.g. Fortran to Java), I was sceptical that a paper at an academic conference would tackle anything very hard. I was thus impressed when I saw that wget was translated automatically: it's not a small program. I was stunned when I saw that Vim was translated, even down to things like signal handling. I also can't mention this paper without noting how beautifully written it is: it's rare to see authors put so much care into making the reader's journey through the paper a pleasant one.
[2] Why PHP and Python? Partly because we had to start somewhere. Partly because we had off-the-shelf access to an excellent implementation of Python and a reasonable-ish implementation of PHP. Whilst you may have strong opinions about either or both of these languages, this blog is resolute in its intention to sit on the fence about their pros and cons.
[3] This term was inspired by Clausewitz's use of ‘Friktion’. To misquote Clausewitz, everything in programming language design is simple, but even the simplest thing is difficult.
[4] Language composition throws up several annoying linguistic issues. First is the term itself: composition is a long, vague, and not universally understood term. When I started my initial work in this area, I used the term because someone else did — which isn't really a very good defence on my part. In retrospect, I would have done better to have used a plainer term like “multi-lingual programming”. Alas, by the time I realised that, we'd already used the term widely enough that changing it seemed likely to lead to more confusion. Mea culpa. The second problem is what to call the resulting programs. We used the term “composed program” for a while, but that's hopelessly ambiguous: every program is composed from smaller chunks, whether they be in different languages or not. “Multi-lingual program” has merit, and seems a good generic term. In our specific case, we talk about PyHyp programs and PyHyp files to make clear that we're talking about programs or files that are written using the rules of our language composition.
[5] We could have used Python as the starting point. Or, probably more usefully, we could have allowed things to start in either language.
[6] Adapters are very similar to the idea of proxies. However, the term ‘proxy’ often comes with the expectation that users know they're working with proxies, and that one can distinguish proxies and the thing they point to. In most cases, PyHyp's adapters are completely transparent, hence our use of a different term.
[7] Having lived with this decision for 18 months or so, we think we'd probably have done better to have first adapted PHP dictionaries as a ‘blank’ adapter which appears to be neither a list nor a Python dictionary, forcing users to call to_list() or to_dict() on it. The main gotcha with the current behaviour is that Python defines iteration on dictionaries to be iteration over the keys: this means that you can end up iterating over what you think is a list, but which is really 0, 1, 2, 3 … n.
[8] Since I wrote that article, Oracle have released Truffle. Although the low-level details are different (Truffle uses dynamic partial evaluation), the high-level goals of Truffle and meta-tracing are the same. From this article's perspective, you can consider Truffle and meta-tracing to be interchangeable, without a huge loss of precision.
[9] Since we started work on PyHyp, work on HippyVM has finished, partly due to a lack of funding. One thing that those of us attempting to do innovative programming language work inevitably learn is that it's really hard to pay the bills.
[10] My vague recollection is that it took about 14 iterations to get to this point. That's language design for you!
[11] It might be nicer to throw an error if the same name exists in more than one of PHP's global namespaces. However, that's slightly complicated by PHP's lazy class loading, which we had to work around for other reasons.
[12] This is Goodhart's Law in action.
[13] We're working on this at the moment, but we're not finished yet.
 
