
Which Parsing Approach?

September 15 2020

We all know that parsing is an important part of designing and implementing programming languages, but it’s the equivalent of Brussels sprouts: good for the diet, but a taste that only a select few enjoy. Unfortunately, I’ve come to realise that our general distaste for parsing is problematic. While many of us think that we’ve absorbed the advances of the 1960s into our collective understanding, I fear that we have regressed, and that we are often making inappropriate decisions about parsing. If that sounds accusatory, I don’t mean it to be: I spent over 20 years assuming that parsing is easy and that I didn’t need to understand it properly in order to use it well. Alas, reality has been a cruel teacher, and in this post I want to share some of the lessons I’ve been forced to slowly learn and acknowledge.

Let’s start with the basics. A grammar encodes the syntax rules for a given language. Parsing is the act of taking in an input (e.g. a source file) and determining if, and how, it corresponds to a grammar. At its most basic level, parsing just says “this input does/doesn’t correspond to the grammar”. That’s rarely useful for programming languages, so we normally execute semantic actions while parsing, allowing us to, for example, build a parse tree that represents the input as a tree. If I have a simple calculator grammar and the input 2-3*4 I might get back a tree that looks like the following:

[Parse tree diagram for 2-3*4: a subtraction node whose left child is 2 and whose right child is a multiplication node with children 3 and 4.]

For the rest of this post, I’ll represent trees as “pretty printed text”, where brackets allow us to succinctly express how the tree is structured. For example the above tree is equivalent to (2-(3*4)). I’m going to assume that “parsing” means “check correspondence to the grammar and build a parse tree”. I’m also going to simplify other parsing jargon and nomenclature whenever I can, to try and keep things somewhat comprehensible and the length somewhat manageable.

Recursive descent

There are a bewildering number of ways that one can parse an input, so I’ll start with what is probably the most common: a hand-written parser. While that could mean just about anything, nearly everyone who writes a half-decent hand-written parser, whether they know it or not, is writing a recursive descent parser [1]. The idea is relatively simple: one writes a series of functions which examine the input string at a given position and, if they match at that position, advance the parse. For example, a first attempt at a recursive descent parser in Python that can parse the simple calculator language above might look as follows:

NUMBERS = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
OPERATORS = ["-", "*"]

class Number:
  def __init__(self, n): self.n = n
  def __str__(self): return str(self.n)
class Mul:
  def __init__(self, lhs, rhs):
    self.lhs = lhs
    self.rhs = rhs
  def __str__(self): return "(%s*%s)" % (str(self.lhs), str(self.rhs))
class Sub:
  def __init__(self, lhs, rhs):
    self.lhs = lhs
    self.rhs = rhs
  def __str__(self): return "(%s-%s)" % (str(self.lhs), str(self.rhs))

def parse_expr(s, i):
  # Returns a pair (i, tree), where i is the index of the first unconsumed
  # character, or None if nothing can be parsed at index i.
  if i == len(s) or s[i] not in NUMBERS:
    return
  j = i
  while j < len(s) and s[j] in NUMBERS:
    j += 1
  lhs = Number(s[i:j])
  if j == len(s) or s[j] not in OPERATORS:
    return (j, lhs)
  op = s[j]
  r = parse_expr(s, j + 1)
  if r is None:
    return
  (i, rhs) = r
  if op == "-":
    return (i, Sub(lhs, rhs))
  else:
    assert op == "*"
    return (i, Mul(lhs, rhs))

def parse(s):
  r = parse_expr(s, 0)
  # Fail unless the entire input was consumed.
  if r is None or r[0] != len(s):
    return "Syntax error"
  return r[1]

print(parse("2-3*4"))
The idea is relatively simple: we have a string ‘s’ we’re parsing, with the variable ‘i’ telling us how far we’ve parsed so far. If parse_expr is able to parse part of the input starting at ‘i’ then it returns a pair (i, tree) telling us how far it parsed and giving us back the tree it created; if it fails it returns None. When I parse 2-3*4 it prints:
(2-(3*4))
In other words, if we were to evaluate that tree, we’d get a result of -10 – success! Admittedly, that has come at a cost: the recursive descent parser has quite a lot of boilerplate to ensure that it doesn’t do something silly and that any syntax errors encountered cause parsing to stop. For example, if you remove the final check in the parse function (which ensures that the entire input was consumed), then 2abc will parse successfully, returning Number(2) and ignoring the fact that abc couldn’t be parsed! There are ways to reduce the boilerplate, but if you write recursive descent parsers for a living, you have to learn to live with it.

Unfortunately, if I try parsing 2*3-4 I get a surprising result:

(2*(3-4))
We’ve all been taught from a young age that the grammar of mathematics requires ‘*’ to “bind harder” than ‘-’. Put more formally, ‘*’ is said to have a higher precedence than ‘-’. Unfortunately, my hand-written recursive descent parser has given both operators the same precedence. If I were to evaluate that tree, I’d get -2 instead of the 2 we’d have expected from the original expression.

Fortunately, there’s a fairly standard way to encode operator precedence which, in the style of the above parser, can be written as follows:

def parse_expr(s, i):
  r = parse_factor(s, i)
  if r is None:
    return
  (i, lhs) = r
  if i < len(s) and s[i] == "-":
    r = parse_expr(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    return (i, Sub(lhs, rhs))
  return (i, lhs)

def parse_factor(s, i):
  r = parse_term(s, i)
  if r is None:
    return None
  (i, lhs) = r
  if i < len(s) and s[i] == "*":
    r = parse_factor(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    return (i, Mul(lhs, rhs))
  return (i, lhs)

def parse_term(s, i):
  if i == len(s) or s[i] not in NUMBERS:
    return
  j = i
  while j < len(s) and s[j] in NUMBERS:
    j += 1
  return (j, Number(s[i:j]))

def parse(s):
  r = parse_expr(s, 0)
  # Fail unless the entire input was consumed.
  if r is None or r[0] != len(s):
    return "Syntax error"
  return r[1]
If I parse these expressions:
print(parse("2-3*4"))
print(parse("2*3-4"))
I can see that I get the expected output:
(2-(3*4))
((2*3)-4)

Success at last! Well, not quite, because if I parse 2-3-4 I get another surprising result:

(2-(3-4))
Unfortunately, as this example shows, we’re incorrectly parsing operators as right associative when they should be left associative. In other words, when we see a sequence of subtractions, earlier subtractions should be matched before later subtractions. Fixing this might seem like it’s easy, but it’s not: the “obvious” way of implementing left associativity in a recursive descent parser causes an infinite loop. Fixing that is more involved than I want to get into here: see this page for an approachable summary of solutions to this problem.
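
To give a flavour of one well-known workaround – replacing the right recursion with a loop – here is a hedged sketch, reusing parse_factor and Sub from above; the page linked to above covers this and other solutions in more depth:

def parse_expr(s, i):
  r = parse_factor(s, i)
  if r is None:
    return
  (i, lhs) = r
  # Loop rather than recurse on the right-hand side: each further "- factor"
  # is folded into the tree from the left, so 2-3-4 parses as ((2-3)-4).
  while i < len(s) and s[i] == "-":
    r = parse_factor(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    lhs = Sub(lhs, rhs)
  return (i, lhs)

The same trick can be applied to parse_factor if one also wants ‘*’ to build left associative trees.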

It might be tempting to see these problems as the result of an idiot (me) writing a parser for a language they don’t sufficiently understand (mathematics). I hope you can see that there’s also something deeper going on. The underlying problem is that the grammar I wanted to write is ambiguous: 2-3*4 can be parsed as equivalent to 2-(3*4) or (2-3)*4. It is often said that recursive descent parsers are inherently unambiguous. While true, this makes a virtue out of a vice: recursive descent parsers are unambiguous simply because they ignore ambiguities. Put another way, whenever a recursive descent parser encounters a point at run-time where an input can be parsed ambiguously, it arbitrarily chooses one of the possibilities, and charges onwards as if the other possibilities had never existed. Significantly, the parser author is not notified that this happened. Since recursive descent parsers are just normal programs, it’s unlikely that we’ll ever be able to make a static analysis that can take such a parser and reliably tell us at compile-time all the points of ambiguity.

It is therefore probably not a coincidence that recursive descent parsers have no real “theory”. Notably, they do not have any known relationship to the class of grammars we understand best — Context-Free Grammars (CFGs). For example, we do not know, in general, the language which a recursive descent parser will accept: all we can do is throw ever more inputs at it and observe if, and how, it parses them, never knowing if another input will cause a surprising parse.

Over time, I’ve come to view recursive descent as the parsing equivalent of assembly programming: maximum flexibility, maximum performance, and maximum danger. Every non-trivial recursive descent parser I’ve seen converted to another formalism has led to unexpected ambiguities being uncovered. Sometimes this leads to incorrect parses (as above), but it just as often leads to seemingly correct input not being parsed at all [2]. There are good reasons for using recursive descent parsers (I’ll get to those later), but, in my opinion, if another formalism can be used, it generally should be.

Generalised parsers

At the opposite end of the spectrum are what have come to be called generalised parsers. There are various generalised parsing algorithms (e.g. Earley, GLL, and GLR) but, from the perspective of this post, they’re all equivalent. Each can parse any CFG (so they rest on a solid theory), even ambiguous ones (so you don’t have to worry about contorting your grammar), and they guarantee to tell you where all of the ambiguous points in the grammar are at run-time (so you don’t have to worry about things being unexpectedly mis-parsed).

These properties appear to make generalised parsing the solution to the problems noted above with recursive descent parsers. However, this comes at a cost. Consider the following grammar which, once again, parses the little subset of mathematics we’re using as an example:

Expr: Expr "-" Expr
    | Expr "*" Expr
    | "INT"
    ;
Given that grammar, many readers will have spotted an obvious point of ambiguity: 2-3*4 can be parsed as equivalent to (2-3)*4 or 2-(3*4). Generalised parsers are interesting because they generate both possibilities at run-time. It is possible for such parsers to return a “parse forest” (i.e. showing all the ambiguous possibilities), but that’s not very useful for programming languages: we expect compilers to settle on a single meaning for the programs we throw at them. We thus need to disambiguate the ambiguous possibilities so that we end up with a single parse tree. An easy way of doing this is to assign a precedence to a rule’s productions so that if, at one point in the parse, more than one of its productions match, we can pick the one with the highest precedence. For example, I might rewrite my grammar to look as follows:
Expr: Expr "-" Expr %precedence 10
    | Expr "*" Expr %precedence 20
    | "INT"
    ;
Assuming that “higher” precedences mean “bind tighter”, this will then parse 2-3*4 as equivalent to 2-(3*4).

My experience is that fewer people (including, from bitter experience, me) spot a second ambiguity in the above grammar: 2-3-4 can be parsed as left (i.e. (2-3)-4) or right (i.e. 2-(3-4)) associative (because of rules such as Expr "-" Expr). Unfortunately, precedences are not sufficient to disambiguate between those two possibilities: one either needs to rewrite the grammar or use a different disambiguation operator [3].

While the good news is that a generalised parser will reliably tell us at run-time that it encountered an ambiguity, the bad news is that we generally have to wait until we encounter an input that is parsed ambiguously to discover that our grammar is ambiguous. There are some decent heuristics that will statically find many of the points of ambiguity, but they are just that — heuristics.

Over time, I’ve come to view generalised parsing as equivalent to dynamic typing: expressive and safe, but with more errors than necessary being deferred to run-time. I spent years trying to write arbitrary CFGs but, for complex grammars, I continually struggled to squeeze out all of the ambiguities [4]. I did not encounter a user who was happy with, or anything other than startled by, ambiguity errors: it is rather odd to be told that your input is valid but can’t be parsed. That said, I think generalised parsers have a part to play in language composition, where composing different grammars inherently leads to ambiguity. However, I no longer believe that generalised parsing is a good fit for “normal” parsing.

Statically unambiguous parsing

There are several parsing approaches which statically rule out ambiguity, bypassing one of the fundamental problems with generalised parsing. I’ll describe the two best known: LL and LR. In essence, these approaches describe subsets of the CFGs which provably contain only unambiguous grammars. It’s common to describe grammars which adhere to one of these subsets as being “a valid LL grammar” or similar.

However, as far as we know, it is not possible to define the complete subset of unambiguous CFGs, so there are unambiguous grammars which do not fit into these subsets. I thus find it easiest to think of these approaches as being analogous to a static type system: they are sound (i.e. if a grammar is a valid LL/LR grammar, it really is unambiguous) but not complete (some unambiguous grammars aren’t valid LL/LR grammars).

LL parsing

Although less common than in the past, LL parsing still underlies systems such as javacc. My personal bias is that LL parsers are largely unappealing, because the lack of left recursion makes expressing many standard programming language constructs as awkward as with recursive descent parsers. However, as hinted at by that commonality, LL grammars have one important feature: they naturally map to recursive descent parsers (but not necessarily vice versa). One can therefore ensure that a recursive descent parser is not accidentally steamrollering over ambiguities by creating an LL grammar and faithfully mapping it to a recursive descent parser.

To my mind, the combination of LL and recursive descent parser has a small but important niche: if you really, truly need the highest possible performance and/or you want the best possible error messages, this is probably the best route we know of. However, it comes at a significant cost. For a realistic programming language grammar, it will typically take many person months of effort [5] to beat an automatically generated parser. I therefore think this approach only makes sense for a small number of projects (notably industrial-strength compilers and IDEs).

LR parsing: prelude

The last of the major parsing approaches I’m going to look at is LR parsing. In common with many people, I spent years trying to avoid LR parsing because I had imbibed the common idea that LR parsing is a terrible thing. Instead I threw myself into other parsing approaches, notably Earley parsers [6].

Then, in late 2008, while bored in meetings, I started writing extsmail, a program mostly intended to send email via ssh. I thought it would be interesting to write this in the style of a traditional Unix daemon, something I had not attempted before. For the two configuration files extsmail needs, I therefore decided to use the traditional Unix daemon parsing tool Yacc. Not only had I not used Yacc before, I had neither used nor studied LR parsing — I suspected I would have quite a task on my hands. I was rather surprised when it turned out to be easy to write a grammar such as externals_parser.y.

However, I assumed that I had been lucky with these grammars, which are rather simple, and went back to avoiding LR parsing. Having realised that generalised parsing and ambiguity were causing me problems, I spent quite a while dabbling with PEG parsing (which is recursive descent in disguise) before eventually realising that was going to cause me different, but no less severe, problems relative to generalised parsing.

Later I stumbled across [7] Tim Wagner’s thesis on incremental parsing which became the pillar that Lukas Diekmann built upon to create Eco [8]. Wagner’s work uses LR parsing but I managed to involve myself a bit with Eco without actually understanding how LR parsing worked. Then, in 2015, when we were experimenting as a group with Rust, Lukas wrote the beginnings of an LR parser as an experiment, and I quickly jumped in and made a few modifications. Without really intending to, I started expanding the code until I realised I had taken on maintainership of what clearly had the potential to become a full Rust LR parser. At that point, I realised I actually needed to understand LR parsing. I found the explanations lurking on the web a bit confusing at first, but the algorithm was simple enough that soon enough I had a full, if basic, Rust LR parser (which became grmtools).

Why am I telling you this long, probably tedious, personal history? Because I want to emphasise that I went out of my way to avoid LR parsing, even though I didn’t really know what I was avoiding or why. Even after I had used LR parsing, and realised that it wasn’t the bogeyman I had expected, I still spent several years trying alternatives. Not only is that embarrassing to admit publicly, it also troubles me: how had I picked up a bias that took me so long to overcome? I’ve gradually alighted upon a plausible explanation for our community’s general dislike of LR parsing and, oddly enough, it relates to undergraduate compiler courses. For reasons that probably made sense in the 1970s and 80s, many compiler courses spend significant, arguably excessive, time on parsing — generally LR parsing. Students come in expecting to be taught how to generate machine code in clever ways, but instead have to learn all sorts of parsing background before they even get to the main LR algorithm. By that point they are thoroughly sick of parsing generally and LR parsing in particular. This is a self-inflicted wound by our subject, as we have accidentally turned people away from a beautiful algorithm [9].

LR parsing

That dealt with, let’s drill into some of the technical details of LR parsing. First, LR is strictly more powerful than LL [10]. In other words, every valid LL grammar is also a valid LR grammar (but not necessarily vice versa). Second, LR grammars are the largest practical subset of unambiguous CFGs that we currently know how to statically define [11].

Let’s actually try out LR parsing [12] by feeding the following grammar:

%start Expr
%%
Expr: Expr "-" Expr
    | Expr "*" Expr
    | "INT"
    ;
to Yacc. Doing so leads to the following being printed at compile-time:
expr1.y: yacc finds 4 shift/reduce conflicts
At this point, I know that some readers will have broken out in a cold sweat at the mention of “shift/reduce conflict”. Don’t panic yet! At the moment, let’s just think of this as the LR parser statically detecting an ambiguity (or four...) and telling us that we should fix it somehow [13].

There are various ways of drilling into more details about those ambiguities. In a shameless plug, I’ll use nimbleparse, but most Yacc implementations have a way of giving more detailed information. nimbleparse also needs a valid lexer, so if I feed it the grammar above as well as this Lex file [14]:

%%
- "-"
\* "*"
[0-9]+ "INT"
I get this output:
Shift/Reduce conflicts:
   State 5: Shift("*") / Reduce(Expr: "Expr" "-" "Expr")
   State 5: Shift("-") / Reduce(Expr: "Expr" "-" "Expr")
   State 6: Shift("*") / Reduce(Expr: "Expr" "*" "Expr")
   State 6: Shift("-") / Reduce(Expr: "Expr" "*" "Expr")

Stategraph:
0: [^ -> . Expr, {'$'}]
   Expr -> 1
   'INT' -> 2
1: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [^ -> Expr ., {'$'}]
   '-' -> 3
   '*' -> 4
2: [Expr -> 'INT' ., {'-', '*', '$'}]
3: [Expr -> Expr '-' . Expr, {'-', '*', '$'}]
   'INT' -> 2
   Expr -> 5
4: [Expr -> Expr '*' . Expr, {'-', '*', '$'}]
   Expr -> 6
   'INT' -> 2
5: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [Expr -> Expr '-' Expr ., {'-', '*', '$'}]
   '*' -> 4
   '-' -> 3
6: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [Expr -> Expr '*' Expr ., {'-', '*', '$'}]
   '*' -> 4
   '-' -> 3
What this shows us is the stategraph (i.e. a statemachine) our grammar has been transformed into and the states where the conflicts have occurred.

It is possible, with a little effort, to understand that stategraph and the conflicts that have occurred. However, I’m not going to go into more detail here, because most readers will probably already have guessed that it’s very hard to make sense of conflicts on large grammars. I’d liken it, roughly speaking, to resolving whole-program type inference errors [15]: the errors reported are correct, but don’t necessarily correspond to the points in your program/grammar that you would consider need fixing.

While I’m sure that it’s possible to improve the way that conflicts are reported [16], to my surprise, I’ve developed lots of grammars without being troubled much by problems with conflicts. Indeed, the only time I’ve ever bothered to try and understand conflicts is when a large existing grammar needs updating to a new external specification, which is not common [17]. In most cases, I’m developing a new, or tweaking an existing, small grammar. Then, just as with languages using type inference, I find it most productive to save and compile after nearly every change. If this does identify a conflict, I know what change caused it, and it then tends to be fairly obvious what a plausible fix is. Not only do I not bother worrying about what state in the stategraph is involved, I don’t even bother checking whether the conflict(s) are shift/reduce, reduce/reduce, or accept/reduce [18].

Honestly, I’ve only encountered one realistic counter-example which is – wait for it – mathematical expressions. It is surprisingly difficult to encode this as an LR grammar, because mathematics’ syntax rules are complex, and nearly every naive grammar for them is ambiguous. Fortunately, because it’s such a common example, solutions to this abound on the internet. Here’s the classic solution:

%start Expr
%%
Expr: Expr "-" Term
    | Term
    ;
Term: Term "*" Factor
    | Factor
    ;
Factor: "INT"
    ;
It has no conflicts, which means that Yacc has statically proven that it is unambiguous! It handles precedence – 2-3*4 parses as 2-(3*4) – and associativity – 2-3-4 parses as (2-3)-4 – correctly.

Over time, I’ve come to view LR parsing as equivalent to static typing: occasionally annoyingly restrictive, but providing enough static guarantees to be worth the annoyance for important software. It’s important to remember that LR isn’t magic: while it will stop you writing an ambiguous grammar, it won’t stop you writing an incorrect grammar for the language you wish to parse. For example, although LR will prevent you making a rule both left and right associative, you still have to choose correctly whether it should be left or right associative.

Performance

People often worry about parsing performance in general, and LR parsing performance specifically, though almost always without cause on modern computers. For example, if I take Java’s grammar (which is unusually big, and therefore slow to parse) and the LR parsing system I wrote (which has been only moderately optimised) I can happily parse many tens of thousands of lines of code per second on my 3 year old laptop. Unless you’ve got billions of lines of source code, or millions of users, this is surely fast enough.

I suspect that parsing performance worries date back to the period when parsing techniques were under heavy development. LR parsing was invented in 1965, a time when computers were painfully slow [19] and resource poor. LR parsing works by generating a statetable at compile-time that is then interpreted at run-time. Those statetables were far too big to be practical on the computers of the day, so two solutions were invented to solve this problem.

First, algorithmic subsets of LR (e.g. LALR, SLR) were invented that reduce the size of statetables, at the cost of reducing the number of grammars that they can accept (i.e. some LR grammars are not valid LALR grammars). In practise, these subsets are annoying to use: they cause some seemingly reasonable grammars to be rejected; and understanding why a grammar has been rejected can require a deep understanding of the algorithm [20].

Second, since 1977 we’ve known that you can substantially shrink LR statetables without restricting the grammars accepted [21]. When combined with a couple of other techniques to squeeze the statetable’s memory footprint [22], even the most puny modern machine can run an arbitrary LR parser at impressive speeds.
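
To give a rough, hypothetical feel for what squeezing a statetable’s memory footprint can mean – this is not how grmtools or any particular crate implements it – here is a sketch which packs integers that each fit in a few bits end-to-end, rather than storing one per machine word:

class PackedVec:
  # Store a sequence of integers, each of which fits in `bits` bits, packed
  # end-to-end into one arbitrary-precision Python int.
  def __init__(self, values, bits):
    self.bits = bits
    self.n = len(values)
    self.data = 0
    for i, v in enumerate(values):
      assert 0 <= v < (1 << bits)
      self.data |= v << (i * bits)

  def __getitem__(self, i):
    assert 0 <= i < self.n
    return (self.data >> (i * self.bits)) & ((1 << self.bits) - 1)

pv = PackedVec([3, 0, 7, 42], 6)
print([pv[i] for i in range(4)])  # prints [3, 0, 7, 42]

A real implementation would pack into an array of fixed-width machine words and deal with entries that straddle word boundaries, but the space saving comes from the same idea.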

Error recovery

When I’m programming, I make a truly embarrassing number of syntax errors. It is vital that the parser I’m using accurately reports where I’ve made such an error: most parsers, including LR parsers, do a decent enough job in this regard [23]. It is then nice if the parser recovers from my error, allowing it to continue parsing.

The very best recursive descent parsers [24] do a pretty good job of error recovery. LL parsing systems also generally do a tolerable job for arbitrary LL grammars.
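
To make “recovers from my error” a little more concrete, here is a deliberately crude, hypothetical sketch of panic-mode recovery bolted onto the toy calculator parser from earlier (it reuses parse_expr and NUMBERS); real error recovery, whether in recursive descent or LR parsers, is considerably more sophisticated than this:

def parse_with_recovery(s):
  trees, errors, i = [], [], 0
  while i < len(s):
    r = parse_expr(s, i)
    if r is None:
      errors.append("syntax error at offset %d" % i)
      # Panic mode: skip forward until we reach something we know how to
      # restart parsing from (here: a digit).
      i += 1
      while i < len(s) and s[i] not in NUMBERS:
        i += 1
    else:
      (i, tree) = r
      trees.append(tree)
  return (trees, errors)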

Unfortunately, it is fair to say that LR parsing systems such as Yacc do a poor job. Yacc itself uses error tokens, but the results are so bad that I find Yacc parsers with error recovery more frustrating to use than those without. However, we can do much better for arbitrary LR grammars, and hopefully more LR parsers will give good error messages in the future.

LR parsing: aesthetics

I’m now going to turn to a fuzzier factor: readability. Whether explicitly or implicitly, people need to know the syntax rules of the language they are using. Some programming language designers assume, or hope, that giving users a few code examples is equivalent to telling them the language’s syntax rules. This works for most cases as we can largely rely on a shared cultural understanding of “what a programming language looks like” [25], but experienced programmers know the dangers of ignoring dank corners such as operator precedence. At a deeper level, those who implement a compiler, or even just an accurate syntax highlighter, need to know precisely what a language’s syntax rules are. In my opinion, the readability of a parser is a vital factor in enabling accurate tooling for, and use of, a programming language.

To my mind, of the various grammars and parsers presented in this post, the easiest to read is the version for generalised parsers, because it most closely matches the informal mathematical grammar I was taught as a child. However, this readability comes at a cost: because the grammar is potentially ambiguous I sometimes misjudge which way a given input will be parsed after disambiguation.

The hardest to read is, without a shadow of a doubt, the recursive descent parser. It’s the longest, the most detailed, and the one lacking any underlying theory to guide the reader.

The lack of left recursion in LL parsing makes many grammars awkward to read. A surprising way of seeing that is by using the fact that many (though not all) LR grammars can be converted to LL semi-mechanically (see e.g. this translation of roughly the same LR grammar as used in this post to an LL equivalent): the resulting LL grammar is never easier to read after the conversion.

LR grammars thus fill an important hole. They’re generally close in readability to an arbitrary CFG; since left associativity is so common, they’re nearly always easier to read than LL grammars; and, if you’ll allow a small amount of poetic license, they’re infinitely easier to read than a recursive descent parser.

Of course, I’m clearly somewhat biased, so perhaps these words from Guy Steele might be more compelling:

[Be] sure that your language will parse. It seems stupid ... to sit down and start designing constructs and not worry about how they will fit together. You can get a language that’s difficult if not impossible to parse, not only for a computer, but for a person. I use Yacc constantly as a check of all my language designs, but I very seldom use Yacc in the implementation. I use it as a tester, to be sure that it’s LR(1) ... because if a language is LR(1) it’s more likely that a person can deal with it.

Dynamic Languages Wizards Series - Panel on Language Design

Summary

Having spent years trying virtually every other possible approach to parsing, I now firmly believe that LR parsing is the best approach for the vast majority of purposes: it has the strongest practical safety guarantees, allows good grammar readability, and has decent performance. In particular, I hope future programming language authors take Guy Steele’s advice above, and make their reference grammar LR compliant [26].

Personally, I’ve put my money where my mouth is: I’ve put a lot of work into grmtools, a Yacc-compatible LR parsing system for Rust. grmtools isn’t yet perfect, or complete, nor is it by any means fully optimised — but it’s more than good enough for many purposes and I intend continuing to maintain it for some time to come. I hope it’s one small step towards encouraging people to rediscover the beauty and utility of LR parsing.

Acknowledgements: Thanks to Lukas Diekmann and Naveneetha Vasudevan for comments on drafts of this post. Thanks to Roberto Ierusalimschy and Terence Parr for answering my queries. All opinions, and any errors or infelicities, are very much due to me!


Footnotes

[1] Parsing Expression Grammars (PEG)s and “parser combinators” in some functional languages are just recursive descent parsers in disguise.
[2] My favourite example of this is best expressed as a Parsing Expression Grammar (PEG):
r <- a / ab
or as a hand-written recursive descent parser:
def r(s, i):
    if i + 1 <= len(s) and s[i] == "a":
        return ...
    elif i + 2 <= len(s) and s[i:i + 2] == "ab":
        return ...
Both of these parsers successfully parse the string ‘a’ but fail to parse the string ‘ab’. As soon as ‘a’ is matched, the rule succeeds, which leaves ‘b’ unmatched; neither parser tries to match ‘ab’ directly.
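Since footnote [1] claims that parser combinators are recursive descent in disguise, it may help to see why the same surprise occurs with them. The following is a minimal, hypothetical sketch (the combinators lit and alt are invented for this example):

def lit(t):
    # Return a parsing function which matches the literal string t at index i,
    # returning the new index on success and None on failure.
    def p(s, i):
        if s[i:i + len(t)] == t:
            return i + len(t)
        return None
    return p

def alt(p1, p2):
    # Ordered choice: try p1 and only fall back to p2 if p1 fails -- the same
    # behaviour as PEG's "/" and the hand-written function above.
    def p(s, i):
        r = p1(s, i)
        if r is not None:
            return r
        return p2(s, i)
    return p

r = alt(lit("a"), lit("ab"))
print(r("a", 0))   # prints 1: "a" is parsed fully
print(r("ab", 0))  # prints 1: only "a" is matched, leaving "b" unconsumed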
[3] I believe that it’s still an open question as to how many distinct disambiguation operators there need to be.
[4] In Converge I ended up cheating, encoding some default disambiguation rules into the parser. When I did this I didn’t really understand the problem that I’d encountered nor did I realise that my “solution” was not curing, but merely delaying, the pain. The only thing more surprising than encountering an ambiguous parse is finding out that your input has been disambiguated-by-default in the wrong way.
[5] To give a rough idea of scale: Rust’s parser is about 10KLoC and javac’s parser about 4.5KLoC.
[6] Yes, I wrote more than one. I no longer recommend it, because Earley’s original algorithm has a bug in it, and descriptions of a/the fix seem either to be incorrect, or to destroy the beauty of the algorithm.
[7] Michael Van De Vanter first pointed Wagner’s work out to me. However, I didn’t appreciate it for what it was. I then forgot about it, and stumbled across it “independently” at a later point, before somehow realising that it was what Michael had already suggested. I later learnt to listen to his advice more carefully, and benefited much from it!
[8] It’s also the basis of Tree-sitter, which might be the best long-term argument I know of for programming languages having an LR grammar!
[9] Perhaps I was lucky not to study a compilers course myself (my university did not offer one at that point), as it meant I couldn’t develop the most severe of allergic reactions to LR parsing.
[10] From least to most expressive we thus have: regular expressions, LL, LR, unambiguous, CFG. In other words, regular expressions are a strict subset of LL, LL a strict subset of LR, and so on. The most complete description of the hierarchy I know can be found in p89 of Alexander Okhotin’s talk (where arrows mean “more expressive” and “ordinary” means “CFG”). Note that recursive descent doesn't fit into this hierarchy at all — formally speaking, we know that it accepts a disjoint set of languages relative to CFGs, but, because PEGs have no underlying theory that we know of, we are unable to precisely define that set further.

Another interesting case is the ALL(*) algorithm which underlies ANTLR. ALL(*) accepts a strict superset of LL (including many ambiguous grammars), but is disjoint with LR since ALL(*) doesn’t support left-recursion. However, ANTLR can remove direct left-recursion before invoking ALL(*), so some grammars that might seem impossible to parse with ALL(*) can in fact be parsed by it. Bearing in mind that we’re talking about infinite sets, and that I don’t think we have a formal proof of the following statement, I think it would be fair to say that the ALL(*) subset of CFGs is bigger than the LR subset.

[11] There are larger unambiguous subsets such as LR-Regular (or “LRR”) grammars. However, as far as I can tell, these are probably not practical. For example, it is not decidable as to whether an arbitrary grammar is LRR or not. Marpa is an LRR-based system that claims that later advances have made it practical, though I haven’t investigated how that claim relates to Szymanski and Williams’ work.
[12] Berkeley Yacc actually implements LALR, but for this example it’s indistinguishable from LR. I’ll discuss LALR a little bit later in this post.
[13] Although I’ve presented the conflicts as errors, in Yacc they’re actually warnings because it has “default conflict resolution” rules (see Section 5 of the Yacc manual). In other words Yacc is willing to take in an ambiguous grammar and automatically disambiguate it to produce an unambiguous grammar. In general, I do not recommend making use of this feature.
[14] Although it’s rarely remarked upon, the traditional splitting of “parsing” into separate lexing and parsing phases is an important part of the ambiguity story. Not only is it easy for the lexer to identify for as a keyword and forest as an identifier, but the parser then only has to distinguish between token types and not token values. Scannerless parsing merges these two phases, which allows more grammars to be expressed, but introduces more scope for ambiguity — and, in some cases, enables the resulting parsing algorithm to accept context-sensitive grammars.
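As a hypothetical illustration of the first point, a traditional lexer can use longest match plus a keyword table, so the parser never even sees the possibility that for is an identifier or that forest starts with a keyword:

import re

KEYWORDS = {"for", "while"}

def lex(s):
  # Longest match: an identifier-like chunk is matched in full, then checked
  # against the keyword table.
  tokens = []
  for m in re.finditer(r"[A-Za-z_]\w*|[0-9]+|\S", s):
    t = m.group()
    if t in KEYWORDS:
      tokens.append(("KEYWORD", t))
    elif t[0].isalpha() or t[0] == "_":
      tokens.append(("ID", t))
    elif t[0].isdigit():
      tokens.append(("INT", t))
    else:
      tokens.append((t, t))
  return tokens

print(lex("for forest"))  # [('KEYWORD', 'for'), ('ID', 'forest')]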
[15] Imagine a Haskell or RPython program where none of the functions have explicit types. The challenge when programming in such systems is that errors are often reported far away from where they were caused. In other words, I might make a static type error in one function, but the type inferencer will detect the resulting error in another function. While type error messages have become much better over time, they can never match human expectations in all cases.
[16] The best conflict reports I’ve seen come from LALRPOP.
[17] Off-hand, I can only think of a single example: when Lukas tried to evolve this Java 7 grammar to Java 8. Until that point, grmtools didn’t have a way of reporting details about conflicts because I hadn’t needed such a feature!

The Java specification used to pride itself on presenting a simple, machine-proven, unambiguous grammar in an appendix. Unfortunately, at some point, this grammar seems to have been dropped from the specification, and I suspect that the new syntax introduced has not been checked for possible ambiguities. We quickly realised that a Java 8 grammar wasn’t important enough to our work for us to invest the time in this, so I don’t know if it is ambiguous or not.

[18] For the insatiably curious, the conflict types mean roughly:
  • shift/reduce: The LR parser can’t be sure whether it should advance the input by one token, or whether a parsing rule will have completed.
  • reduce/reduce: The LR parser can’t be sure which of two rules will have completed.
  • accept/reduce: The LR parser can’t be sure if the entire parse has completed or merely one rule has completed.
That last possibility is so rare that I’d forgotten it even exists before I thought to fact-check this footnote!
[19] Roughly speaking, the fastest super computer in the world at that time ran about 10,000 times slower than a decent desktop chip today.
[20] SLR is particularly restrictive. However, I’m not sure I’ve ever seen SLR used in practise (though I know it was in the past), but LALR is still found in Berkeley Yacc. Even though LALR is less restrictive than SLR, it can still require real programming language grammars to be unpleasantly contorted in places.
[21] Pager’s description is slightly incomplete; it’s best paired with Xin Chen’s thesis. From memory, neither mentions that the algorithm is non-deterministic and can sometimes create unreachable states that can be garbage collected to save a little bit more memory. grmtools’ implementation of this algorithm goes into more detail on such matters and also has the bonus of being runnable. However, Pager’s algorithm doesn’t quite work properly if you use Yacc’s conflict resolution feature. One day I should implement the IELR algorithm to solve this problem.
[22] For example, encoding sparse tables (e.g. in Rust with the sparsevec crate), and packing vectors of small integers (e.g. with the packedvec crate). It’s a long time since I’ve thought about these aspects: from memory, one can do even better than these techniques, but they’re already effective enough that we didn’t feel the need to look further at that point.
[23] There is one major exception in C-ish syntaxes: missing curly brackets. The resulting errors are typically reported many lines after the point that a human would consider as the cause of the problem.
[24] rustc gives the best syntax error messages of any compiler / parser I’ve ever used.
[25] Recent years have reinforced a long-standing trend: programmers don’t like to learn languages with unfamiliar syntaxes. For better or worse, C-ish syntax is likely to be the dominant cultural force in programming languages for decades to come.
[26] That doesn’t mean that the eventual compiler has to contain an LR parser (though I’d start with an LR parser and only consider moving to something else if I had millions of users), but the parser it does contain should be entirely compliant with the reference LR grammar.

Unfortunately, for the foreseeable future, we are going to be stuck with programming languages which have used other parsing formalisms: pity the poor IDE author who has to deal with yet another language with only a recursive descent parser instead of an LR grammar!



Alternative Sources of Advice

May 6 2020

With slowly increasing frequency, I am asked for my advice on life or careers, or something similar. Let us put aside worries about this sign of my increasing decrepitude or the poor taste that some people show in those from whom they seek advice. The brutal truth is that most of the advice that I’ve received as an adult – nearly all of which has been well intentioned – has not been right for me. I therefore assume that any advice I might give others would be equally wrong, and so I try and avoid doing so. In this blog post, I will try to enumerate the problems I have with advice as it is commonly given and look at one way that we can do things differently.

In the context of this article, what I mean by “advice” is the guidance given by one adult (generally older) to others (generally younger) [1]. Advice can be solicited or unsolicited and it can come through many mediums such as one-on-one conversations, listening to a talk, or reading a book.

A common way that advice is given is as a series of commands such as “you should do X” or “you should never do X”. Commands can be expressed bluntly (“you must always do X”) or diplomatically (“in my opinion it is always wise to do X”); they can be general (“you should travel to as many places as possible”) or specific (“you should avoid person P”); and so on.

My issue is not with the form of advice-as-commands, it is that such advice is bad much more often than it is good. The reason for that is simple: most advice tends to generalise from a case of one (“I did X so everyone should do X”) or, worse, from a case of zero (more commonly referred to as “do as I say, not as I do”). The latter is so clearly foolish that it doesn't require explanation. However, I believe there are three issues with the former.

First, generalising from a case of one tends to lead to over-generalisation: if the only sheep I’ve seen happened to be black then I might confidently tell other people that all sheep are black. It took me years to realise how much advice falls into this category, and to have gained enough general knowledge to spot cases that are clearly at odds with reality.

Second, we inevitably generalise from our own back story, and our memories can be astonishingly faulty. I have seemingly clear memories of something happening in my past that other people have been able to definitively prove did not happen. More commonly, we forget challenges we faced in the past, overlook the times that we succeeded through luck rather than judgement, and exaggerate our contributions relative to others’ [2]. When we give advice, we take these faulty memories, generalise, and then simplify: it is hardly a wonder that we often mislead.

Third – and there isn’t a nice way to say this – most of us have not led a sufficiently interesting life to be able to generalise from. Like everyone, I have faced challenges here and there, but compared to those faced by most people throughout history, I have led a comfortable, undemanding existence. Generally speaking, the fewer challenges we’ve faced, and the less demanding those challenges were, the less reliable the advice we derive from it is likely to be.

Real-world advice

Let me start with a real, though ludicrous, example. When I was doing my PhD, I was advised by a number of people that PhD theses must have an odd number of chapters. I was absolutely baffled by this but, since everything about the situation I was in was unfamiliar, I assumed that I must be missing something obvious. It took at least a couple of years before I had enough confidence to ignore this bit of advice [3]. Though I never found its source, I assume that someone saw a successful, and beautifully written, thesis that had an odd number of chapters and generalised from there. I was astonished to find only last year that a recent PhD graduate had heard this same bit of advice!

Arguably more dangerous than ludicrous advice is advice that is not just plausible, but which might be right for many people — but not us. For example, the most common advice I receive these days – perhaps two or three times a year from different people on average – is to write more research papers. I understand why I’m being told this. If you look at my list of publications, you’ll see that the number of papers I produce has slowed down from 2014 onwards — and that despite the fact that, after many years being a solo researcher, in 2013 I started leading a research group that’s never dipped below four people in size. I’ve come to realise that, when I’m given this advice, it isn’t because people doubt my work ethic, but because they assume that I’m doing something wrong and that if I only acknowledged that then I could be productive again.

What’s not obvious is that this is the inevitable result of decisions I made several years ago. In about 2012 I looked back at my past work and realised how little of it I felt proud of: I had too often been happy to aim for the mediocre. By the end of 2013 I had realised that I would have to consistently raise the level I was working at and had developed a rough plan to do so. Most significantly, I had decided that since the thing I enjoy most is programming, I might as well focus almost exclusively on research that required a good deal of programming. Since creating viable software systems is not a quick matter, I simply can’t produce a large volume of publications. This is not to say that the advice I’m given is wrong for other people, but it isn’t right for me given the niche I’ve chosen to focus on [4].

Generalising from other people’s experiences

When someone asks me for advice, what should I say? Perhaps I should advise them to grow up in the countryside to hone their focus? Make them realise that they can't learn properly in a class and that the sooner they plough through things on their own, the quicker they'll learn? That they'll be more efficient if they use OpenBSD? That they should study Entombed's Eyemaster until they understand why the point at 0:48, when the false introduction suddenly transitions to the dumb yet joyous main riff, is one of the pinnacles of musical achievement?

I could go on, but hopefully the point is fairly clear: I'm generally unsure what things I've done, or experienced, have been beneficial or not; and even when I am sure, it's generally because the observation is either trivial or hard to repeat. My favourite example of the latter came from the late Duke of Westminster, the UK’s then-richest man: when asked what advice he would give to a young entrepreneur, he replied “make sure they have an ancestor who was a very close friend of William the Conqueror.”

What are we to do? Are we all condemned to learn everything from scratch? At best, that would be highly inefficient.

In my case, the model I have accidentally gravitated towards is to try and derive advice from (mostly) historical figures. This started in earnest in my late twenties when I realised that I had very little understanding of continental European history and started studying the history of Prussia. At first, so much was unfamiliar that I was just trying to build up enough context to understand major events, places, and players. At some point, though, I realised that I had stumbled upon something deeper than that: I was starting, without realising it, to derive advice from history. That advice was not command-esque – “if you find a wooden horse at your gates, don’t let it in”, for example, is not very useful advice for me. Instead I was picking up a more abstract sense of the situations people can find themselves in and how they’ve chosen to address the problems they’ve faced — stretching the Troy analogy a bit, one might conclude that “if something seems too good to be true then, unless carefully investigated from all angles, it can lead to disaster”.

I’m going to give you two concrete examples of people I learnt from. I’m deliberately choosing people who are long dead, from a country that no longer exists, and whose experiences were in a hugely different field than my own. Hopefully that makes it a little easier to see that there is no need to stick with the familiar when deriving advice.

I’ll start with Moltke the Elder, who led the Prussian army to victory in three separate wars in the mid/late 19th century, changing European and world history in ways that are still with us today [5]. It goes without saying that war has little direct relationship to computing research. Furthermore, Moltke was a cultured, reserved, man of whom it was said that he could be quiet in seven languages: I am a cultural heathen, can only speak English, and tend to bellow. What could I learn from such a fellow?

The thing that intrigued me most about Moltke was that he challenged a fundamental idea I had somehow imbibed, which is that it’s possible for the leader of an organisation to design and carry out a perfect plan. This is not to say that Moltke was not good at planning. He was, by some distance, the best military planner in the second half of the 19th century: he was the first to realise that railways allowed the unprecedentedly large armies he was commanding to be split up and combined on the battlefield. His focus on that alone would make him a major historical figure.

However, Moltke did two things very unlike most army commanders. First, he expected the situations he found himself in to be unclear and ever-changing: he knew that he would have to continually reevaluate and change his plans. It is hard to beat the summary found in two of his own quotes: “no plan of operations extends with any certainty beyond the first contact with the main hostile force” and “strategy is a system of expedients”. Second, he realised that when you are managing competent people, micromanagement reduces efficiency. He continually informed the generals under his command about his overall intent but allowed them significant flexibility in how they went about addressing that intent [6].

These two concepts changed how I wanted to lead a research group. First, I realised that while I might be able to have a fairly consistent high-level aim over time, the concrete methods I employed to meet that aim would change frequently and in ways that I would not be able to anticipate. I have therefore consciously tried to improve my willingness to be flexible, both at a day-to-day and at a strategic level. Second, if I wanted to get the best out of people, I would need to dampen my natural inclination to work in detail with people. This is not easy, because sometimes my age has led to experience which can give a big boost to someone else’s productivity, but it’s all too easy for me to forget to back off at the appropriate point. I’ve therefore consciously tried to back off sooner rather than later, and when possible not to get involved in the details at all. Inevitably, I have not always succeeded in implementing either of these concepts. Indeed, I have made some horrible mistakes along the way — but I think I would have made more, and much worse, mistakes if I hadn’t used the advice I’d derived from Moltke as a guide.

My second example reaches even further back in time: Clausewitz is more widely quoted than Moltke though, for me at least, he is harder to understand. His major contribution is On War, an unfinished, often abstract, book on the nature of war. The part of it that changed the way I think most significantly is his concept of friction:

Everything in war is very simple, but the simplest thing is difficult. The difficulties accumulate and end by producing a kind of friction... Imagine a traveller who late in the day decides to cover two more stages before nightfall. Only four or five hours more, on a paved highway with relays of horses: it should be an easy trip. But at the next station he finds no fresh horses, or only poor ones; the country grows hilly, the road bad, night falls, and finally after many difficulties he is only too glad to reach a resting place with ... primitive accommodation. It is much the same in war. Countless minor incidents – the kind you can never really foresee – combine to lower the general level of performance, so that one always falls far short of the intended goal. Iron will-power can overcome this friction; it pulverizes every obstacle, but of course it wears down the machine as well.
That first sentence baffled me for years, but when taken with the rest of the paragraph, three profound truths eventually made themselves clear to me:
  1. When we state a goal we wish to achieve, we simplify the route we expect to take, partly because we must and partly because we don't know all the issues we will encounter [7]. However, when we try and carry out that goal, those issues are more numerous and collectively more damaging to performance than we could have imagined.
  2. Chance is a significant factor in success and failure.
  3. Organisations are like mechanical machines: in the short-term we can make people work harder than normal but doing so in the long-term burns them out.

When written out like that, it can make On War seem a bit like a self-help book or, perhaps even worse, a case of generalising from a sample of one. In my opinion, nothing could be further from the truth: Clausewitz's work is the result of generalising from many cases over many years. His observation of friction, his explanation of it, and his terminology for it are, for me, astonishing intellectual achievements. His concept of friction fundamentally changed how I operate: I expect my plans to be disrupted in both time and direction; and, except in short, rare situations, I do not expect those disruptions to be solved by increasing people’s workloads. Put another way, I’ve realised that I’m involved in a marathon, not a series of sprints, and that I need to moderate my expectations of the route and my effort accordingly.

Summary

I’ve been given a lot of advice over the years, but I think that which I’ve derived from history has had the greatest influence on me. It has given me access to experiences and perspectives that I would never have come across in my normal life. Furthermore, it is rooted in experience: I continue to be amazed at the things that real people have achieved in adverse circumstances. It also seems to me that the more people’s stories that I read and understand, the more profound the truths that I seem to alight upon: generalising seems to involve some sort of compound interest that I don’t fully understand, but whose effects I hope continue!

For me, this alternative source of advice has been crucial, but it doesn't mean that you should suddenly start devouring Prussian history for similar lessons. It seems to me that the key is to seek out sources of advice in areas different to those we normally have access to, but the sources of advice that might resonate with you could be very different to mine. For example, in the past I learnt a lot about how people deal with stressful situations through watching professional cricket, and more recently through professional road cycling. Others might find their most effective sources to come through very different avenues, such as fiction or film.

That said, I am not dogmatic enough to think that this technique is the only way that one should give or receive advice. For example, there are occasions when advice-as-command really is the right thing, especially when someone is about to do something actively harmful to their best interests. But, in general, when I'm asked to give advice, I try instead to tell people a bit about possibly relevant parts of my story. If some parts of my story give them a useful insight, then that is wonderful; if not, then I hope that someone else’s story (perhaps in conjunction with mine) might prove more useful to them at some point in the future. Fortunately, no matter who you are, and no matter what times you are living through, it seems to me that you can never seek out too many different perspectives.

Acknowledgements: Thanks to Chun Nolan for comments.


Footnotes

[1] There are other types of advice – notably the guidance that parents give children on a regular basis – but let’s exclude those from consideration.
[2] Inversely, humans have an astonishing tendency to ascribe all failures to other people’s malice. My failures have almost solely been due to my own stupidity; the remaining few have mostly been due to others’ incompetence or laziness, not malice.
[3] I had to double check this, but my thesis did end up with an odd number of chapters, so perhaps subliminally I did take the advice on board, despite thinking that I'd rejected it!
[4] Although I can’t be certain, my suspicion is that’s an indirect result of there not being many people working in the same niche these days: the idea that someone might voluntarily devote large chunks of time to programming is thus a rarer one in computing research than it used to be. Interestingly, in some non-computing research fields, the opposite phenomenon seems to be occurring. At its most extreme, there are 5,154 authors on the paper about the Higgs Boson experiments!
[5] None of this would have been possible without Bismarck. His influence on history might, perhaps, be greater than any other single person’s. I fear that the lessons one can draw from Bismarck are more powerful, but they are far darker. I sometimes think that the English-speaking world’s general ignorance of his approach is a good thing.
[6] The concept that Moltke pioneered ultimately developed into Auftragstaktik (the English term for this has become “mission command”, but that has never resonated with me: it feels both too specific and too easily misunderstood). Moltke motivated the need for devolved responsibility, though not in particular detail, and didn't push it down beyond the top ranks of commanders as later exponents of this concept did.
[7] When I wrote “Confusing problems whose solutions are easy to state with problems whose solutions are easy to realise”, I was covering fairly similar ground. I had read Clausewitz, and some associated work, by that point although I hadn't really understood much of it.

Minimum Times Tend to Mislead When Benchmarking

August 15 2019

Most of us want, or need, to benchmark software at some point, either because we’re proactively trying to avoid performance problems, or because we’re trying to narrow down the cause of a specific performance problem. Benchmarking seems like it should be simple, but there are countless ways to go wrong: indeed, as things stand, none of us really knows how to do it well, at least not in general. In this short post, I want to make a simple argument against using the minimum time of a benchmark as a proxy for that benchmark’s performance. Doubtless I am recasting other people’s arguments, poorly, but I’ve seen a resurgence in people advocating for the use of the minimum time recently, so it seems like a good time to make the opposing case.

The basics

There are all sorts of things that one might wish to measure about a program: how much memory it consumes, the number of files it opens, and so on. By far the most common measure is “wall-clock time” [1] – i.e. how much time elapses in the real world while the program is running – and that's what I'm going to concentrate on exclusively here.

Once upon a time, measuring wall-clock time was fairly simple. You read the system clock, ran a program, read the system clock again, and reported the delta between the two clock measurements. There were some minor annoyances to do with clocks [2] but measurements tended to be fairly predictable and repeatable. The weasel word in that statement is “fairly”. Noise has always been an issue in benchmarking: except in rare circumstances, there are always external factors (e.g. other processes; even the kernel itself) that can interfere with your benchmark’s execution which, collectively, we tend to refer to as noise.

The only way to get some idea of the impact of noise is to execute multiple runs of a benchmark and produce statistics over those runs. The problem then becomes: what statistics? If you’re anything like me, statistics is an incredibly daunting area. It’s not so much that the underlying maths is complex (though it often is), but that nearly every statistical method comes with a set of preconditions that can be hard to interpret (e.g. “the data must be independent” – even when you’ve understood what “independent” means in a semi-formal sense, understanding whether your data is or isn't independent can be difficult). If you apply a statistical method to data that violates the preconditions, you’ll still get a number out, but it might not mean what you expected.

Fortunately, in traditional settings, one could fairly safely make the following simplifying assumption: noise can only slow a program down. In other words, if I run a program twice, and it produces wall-clock readings of 1.1s and 1.2s, then the 1.1s reading is closer to the “true” performance of the program. Getting a good performance reading then consists of running the program multiple times and taking the minimum time from those since, by definition, that is the closest to the “true” performance, since it has been the least affected by noise.
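As a concrete illustration of that traditional recipe, here's a minimal sketch (not code from any real benchmarking tool): read the clock, run the code of interest, read the clock again, and, after several such runs, report the minimum delta:

use std::time::Instant;

// A stand-in for the program or function being benchmarked.
fn work() -> u64 {
  (0..10_000_000u64).fold(0, |acc, i| acc.wrapping_add(i))
}

fn main() {
  let mut timings = Vec::new();
  for _ in 0..10 {
    let before = Instant::now();
    let r = work();
    timings.push(before.elapsed());
    assert!(r > 0); // Use the result so the work isn't optimised away.
  }
  // Under the "noise can only slow us down" assumption, the minimum delta is
  // the closest we can get to the "true" running time.
  println!("minimum wall-clock time: {:?}", timings.iter().min().unwrap());
}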

An example of performance nondeterminism

Imagine I’m writing a program which repeatedly needs to find an English word which matches a given sha256 value. Since this calculation is fairly slow, I might choose to use a hashmap to cache the sha256 values. Here’s some simple Rust code which creates a HashMap from sha256 values to normal strings:
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

use sha2::{Digest, Sha256};

fn main() {
  let mut words = HashMap::new();
  let f = File::open("/usr/share/dict/words").unwrap();
  for l in BufReader::new(f).lines().map(|l| l.unwrap()) {
    words.insert(Sha256::digest(l.as_bytes()), l);
  }
}
I can easily call words.get(h) to find a string which matches the sha256 value h and know that I’ll get average lookup performance of O(1). However, let’s assume that on rare occasions I need to do the opposite calculation, that is, I need to find what sha256 value matches a given string. We can add this code to our main function (in this case seeing if the word “zygote”, which happened to be the last word in /usr/share/dict/words that I recognised, really is found in words):
for (k, v) in &words {
  if v == "zygote" {
    for e in k {
      print!("{:x?}", e);
    }
    println!();
    break;
  }
}
If I run the resulting program it duly prints out:
d8be86c985bdd2938cc6cdc9b43039273be1fd97f3e4daed329bade585dd6ef
as the matching value.

Because I have to do a linear scan over the hashmap, this program has O(n) average lookup time. Let’s assume that I think the newly inserted bit of code is a performance bottleneck. I might then try timing the program and seeing what happens. However, because I think I know quite a bit about benchmarking, I assume that the time taken to create the initial hashmap will dominate a single subsequent lookup. To compensate, I put the code of interest into a loop so that it executes enough that I'll get a measurable signal. I also remove the print calls, because I know they can interfere with timings. I thus end up with the following:

for _ in 0..1000 {
  for (_, v) in &words {
    if v == "zygote" {
      break;
    }
  }
}
OK, so let’s build it and run it 10 times so that I can calculate the minimum running time and determine if the performance of this program is good enough:
$ cargo build --release && repeat 10 time target/release/count
    Finished release [optimized] target(s) in 0.01s
target/release/count  0.50s user 0.00s system 99% cpu 0.501 total
target/release/count  0.42s user 0.00s system 99% cpu 0.426 total
target/release/count  0.26s user 0.01s system 99% cpu 0.267 total
target/release/count  0.18s user 0.00s system 98% cpu 0.179 total
target/release/count  0.06s user 0.02s system 99% cpu 0.080 total
target/release/count  0.13s user 0.00s system 97% cpu 0.131 total
target/release/count  0.42s user 0.01s system 99% cpu 0.425 total
target/release/count  0.20s user 0.01s system 98% cpu 0.212 total
target/release/count  0.42s user 0.01s system 99% cpu 0.429 total
target/release/count  0.23s user 0.00s system 98% cpu 0.235 total
At this point I might be a bit surprised: the minimum wall-clock time (0.080s) is over 6 times faster than the maximum (0.501s). I’m quite happy with the minimum time, but decidedly unhappy with the maximum time. What happened? Well, I was running that on a fast Linux server that can be used by lots of other people, so maybe my readings were subject to a lot of noise. Let’s try running it on my trusty OpenBSD desktop:
$ cargo build --release && repeat 10 time target/release/count
    Finished release [optimized] target(s) in 0.02s
target/release/count  0.45s user 0.06s system 98% cpu 0.517 total
target/release/count  0.27s user 0.05s system 104% cpu 0.307 total
target/release/count  1.73s user 0.05s system 99% cpu 1.783 total
target/release/count  1.82s user 0.05s system 99% cpu 1.876 total
target/release/count  2.10s user 0.06s system 99% cpu 2.179 total
target/release/count  1.66s user 0.08s system 99% cpu 1.742 total
target/release/count  0.92s user 0.08s system 99% cpu 1.007 total
target/release/count  0.28s user 0.05s system 96% cpu 0.340 total
target/release/count  2.22s user 0.08s system 100% cpu 2.292 total
target/release/count  0.51s user 0.03s system 101% cpu 0.533 total
Unsurprisingly the timings are now much slower [3], but the ratio between the minimum and maximum times is now well over a factor of 7! What’s going on?

What we are witnessing here is what I call performance nondeterminism for want of a better term. “Nondeterministic” is one of those words that academics bandy around willy-nilly but which can be quite confusing. In our case, it means that a program can run differently from one execution to the next. The most obvious way to write a nondeterministic program is to depend at some point on an unknown value. Sometimes this is done deliberately (e.g. an AI player in a game), although it is often accidental (e.g. a data race or reading from an incorrect place in memory).

In the program I’ve written, however, I’m not deliberately reading a random variable nor (I hope!) are there any bugs. Nevertheless, there is nondeterminism at play, though it’s not directly caused by the user: the order that Rust’s HashMap stores items internally depends on a seed that is randomly chosen each time the program starts (making it harder for attackers to game the hashmap). This is largely invisible when doing a normal lookup, but becomes painfully obvious when you iterate over all the elements: on one run the element of interest might be the first I encounter, and on the next run it might be the last. Altering my program to see how many times I go round the loop before finding a match highlights this:

let mut c = 0;
for _ in 0..1000 {
  for (_, v) in &words {
    c += 1;
    if v == "zygote" {
      break;
    }
  }
}
println!("{}", c);
There's now a clear correlation between that count and the execution time:
$ cargo build --release && repeat 10 time target/release/count
    Finished release [optimized] target(s) in 0.02s
50434000
target/release/count  0.41s user 0.07s system 99% cpu 0.481 total
23732000
target/release/count  0.28s user 0.07s system 96% cpu 0.362 total
80556000
target/release/count  0.65s user 0.04s system 99% cpu 0.696 total
98799000
target/release/count  0.78s user 0.05s system 98% cpu 0.843 total
222337000
target/release/count  1.90s user 0.06s system 100% cpu 1.957 total
82766000
target/release/count  0.63s user 0.08s system 99% cpu 0.713 total
222532000
target/release/count  1.92s user 0.04s system 99% cpu 1.963 total
17949000
target/release/count  0.29s user 0.05s system 100% cpu 0.339 total
18610000
target/release/count  0.30s user 0.04s system 99% cpu 0.341 total
87260000
target/release/count  0.68s user 0.06s system 98% cpu 0.750 total
Notice that the iteration count varies by a factor of more than 10, even in those 10 runs, which largely explains the factor of 6 difference in wall-clock time (it’s likely that the remaining discrepancy is accounted for by the time needed to create the initial HashMap). The reason I call this performance nondeterminism is because it has no impact on correctness – and thus is often forgotten – but can have a significant impact on performance.

So what does all this have to do with using the minimum wall-clock time when benchmarking? In essence it shows that on modern systems, it’s not safe to use the minimum value as a measure of performance because programs such as this have a set of performance points. That seems like a very simple thing to say, but it makes a profound difference to the way one views program performance. In this case, roughly speaking, that set has as many distinct performance points as there are words in /usr/share/dict/words. Indeed, running my program 100,000 times on an unloaded server (an E3-1240 v6 3.70GHz processor, with turbo boost and hyperthreading turned off, all cores set to performance, and running Debian 9) shows a largely flat distribution [4]:

[Histogram: wall-clock times for 100,000 runs of the hashmap-searching program, showing a largely flat distribution.]

As this histogram shows, using the minimum time as a proxy for the program’s performance would be deeply misleading: it would lead us to hugely underestimate the running time this program would take in most cases. Indeed, in this case, even the mean on its own would be misleading! While it’s generally not practical to give people a histogram of performance for every possible benchmark, it’s absolutely vital to report something about the distribution as a whole. In general, confidence intervals are a good technique for doing this, and, in this case, they’d be infinitely better than merely reporting the minimum.
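As a rough sketch of what that might look like, here's how one could summarise the ten Linux server timings from earlier as a mean plus/minus a 99% confidence interval (using the normal approximation, which is itself an assumption that a given set of readings might not satisfy):

fn main() {
  // The ten wall-clock readings (in seconds) from the Linux server run above.
  let timings = [0.501, 0.426, 0.267, 0.179, 0.080, 0.131, 0.425, 0.212, 0.429, 0.235];
  let n = timings.len() as f64;
  let mean = timings.iter().sum::<f64>() / n;
  // Sample standard deviation.
  let sd = (timings.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)).sqrt();
  // 2.576 is the z-value for a 99% confidence level.
  let ci = 2.576 * sd / n.sqrt();
  println!("{:.3} +/- {:.3} seconds", mean, ci);
}

Even a summary like that can hide the shape of a distribution as flat as the one above, which is why the histogram itself is so valuable.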

Is this an isolated example?

When you look around, there are a surprising number of places where performance nondeterminism can occur – and it’s not just caused by software! Caches and predictors in processors clearly cause changes in performance. Indeed, I now regularly hear processor caches and the branch predictor being blamed for unexplained performance variations. Although my cynical side thinks that this is sometimes an excuse to avoid understanding what’s really going on, they clearly have surprising and unpredictable effects. Software has its share of complementary features, one of the best known being ASLR. As an added bonus, these various features interact in ways that we don’t yet really understand.

A couple of pointers might help convince you that the problem is a deep one. I'll start with the work which has had the most influence on my way of thinking in this regard: Stabilizer. In essence, Stabilizer takes a program and deliberately changes its layout (e.g. the order of functions) so that the possible space of layouts is, statistically speaking, equally covered: when this experiment was performed in 2013, LLVM’s -O2 optimisation level was found to be statistically indistinguishable from -O3 on a well known benchmark suite. At the risk of stating the obvious, this is highly surprising: everyone knows that -O3 is “better” than -O2! But that might just be because no-one had taken into account performance nondeterminism before... What Stabilizer clearly showed me is that one must, in general, consider programs as having a set of distinct performance values (as we saw in our hashmap searching example). When I was then involved in work on measuring JIT (Just-In-Time) VM performance, we found that over 50% of well known benchmarks suffered from performance nondeterminism – including some of the C benchmarks we’d included as a control because we assumed they wouldn’t be subject to such problems!

The effects on comparison

It turns out that performance nondeterminism isn’t the only reason why the minimum time of a program is a poor measure of a program’s performance. Let’s imagine a program similar to our previous example, but this time let’s search for a random word in a list of strings. A simple version of this looks as follows:
use std::fs::File;
use std::io::{BufRead, BufReader};

use rand::thread_rng;
use rand::seq::SliceRandom;

fn main() {
  let mut words = Vec::new();
  let f = File::open("/usr/share/dict/words").unwrap();
  for l in BufReader::new(f).lines() {
    words.push(l.unwrap())
  }

  let choice = words.choose(&mut thread_rng()).unwrap();
  for _ in 0..10000 {
    assert!(words.iter().any(|w| w == choice));
  }
}
When I run that 100,000 times I get the following histogram:

[Histogram: wall-clock times for 100,000 runs of the linear search program.]

Even without looking at the code, some readers might have guessed from that histogram that I’m using a linear search over the list. Wouldn’t it be better to replace it with a binary search? Let’s give it a go:
words.sort();
let choice = words.choose(&mut thread_rng()).unwrap();
for _ in 0..10000 {
  assert!(words.binary_search(&choice).is_ok());
}
Note that I first have to sort the words, because Rust's idea of ordering is different from that used in /usr/share/dict/words. With those small changes I then get the following histogram:

[Histogram: wall-clock times for 100,000 runs of the binary search program.]

Unfortunately it looks like my attempted optimisation has failed: the minimum running time of the linear search (0.009s) is lower than that of the binary search (0.010s) [5]. As we all know, an optimisation that fails to make a program faster should never be included, so it seems that all my efforts have been for nought.

However, if we look at the distributions of both versions of the program alongside each other, a rather different story emerges:

[Histogram: the linear and binary search distributions plotted alongside each other.]

As this shows, the binary search is always fast, while the linear search is often comically slow. In other words, given this data, it would be penny wise and pound foolish for me to use the linear search: I’d make the program very marginally faster for a small number of cases but much slower in most cases.

In my experience, optimisations rarely make all possible cases faster: there are trade-offs involved, and one often has to accept that some inputs will become a little slower in order that most cases can become a lot faster. The problem with using the minimum time is that it is entirely oblivious to trade-offs such as this.

Conclusions

The examples I’ve shown above might look silly to you, but they’re simplifications of problems I’ve seen in real programs: in each case, they were pretty much the first thing I tried to show the problem of minimum times [6]. That said, a perfectly reasonable question is whether I’ve simply cherry-picked from amongst a small handful of examples that show such problems. Alas, particularly with regards to performance nondeterminism, this blogpost has merely highlighted the tip of the iceberg – and, as a community, we don’t really know how deep the iceberg goes.

However, performance nondeterminism is only one of the reasons why using the minimum time of a benchmark is rarely a good idea. As the optimisation example (linear vs. binary search) showed, the fundamental problem with the minimum time of a benchmark is that it gives you false confidence that you can avoid thinking about a benchmark's overall performance distribution. Sometimes you can get away with this, but doing so can often lead to you making the performance of your program worse. That said, performance nondeterminism increasingly worries me, because even a cursory glance at computing history over the last 20 years suggests that both hardware (mostly for performance) and software (mostly for security) will gradually increase the levels of performance nondeterminism over time. In other words, using the minimum time of a benchmark is likely to become more inaccurate and misleading in the future...

If you want to replicate my experiments, please have a look at the minimum_times repository. You can also download the data from the run of the experiment I used for this blog post if you want to investigate it and/or compare it to the data you get from running the experiment on your own system.

Acknowledgements: My thanks to Edd Barrett for comments on this blog post.


Footnotes

[1] It’s not uncommon for people to argue that wall-clock time is a bad measure because it’s less predictable than processor counters. The reason wall-clock time appears less predictable is because it takes more things into account (e.g. the amount of time waiting for I/O) and the more things you measure, the more chance there is for variability. There are definitely cases where focussing purely on processor counters is the right thing to do (e.g. tiny loops that are a few instructions long) but in most cases wall-clock time is the only thing that users care about. Put bluntly, as an end user I don't care if a program executes slightly fewer instructions if it does so by doing a lot more I/O: I care purely about elapsed time in the real world.

As a footnote to a footnote, it also transpires that processor counters are not as accurate as one might expect – see the classic 2008 paper by Weaver and McKee as well as this 2019 paper which confirms that the situation has not improved in the interim.

[2] From my own personal experience, system clocks often had poor resolution (i.e. they couldn't accurately tell you small differences in performance) and/or were erratic (for example, I remember a PC I did work on in the mid/late 90s that gained about 90 seconds a day).
[3] Partly because OpenBSD’s /usr/share/dict/words is well over twice as big as Debian’s equivalent.
[4] I’ve sometimes heard it said that the central limit theorem saves us when benchmarking. In my experience, this hope is based on an over simplification of the theorem, which is that if you run a program enough times any randomness (including noise) will converge towards a normal distribution. That is only true if the various random variables are close to being independent and identically distributed. As some of the examples in this blogpost show, we sometimes encounter situations where these requirements are not met.
[5] In this case, the cost of the sort is probably responsible for most of the measurable difference, though interestingly there really are cases where linear search is faster than binary search. For example, on modern machines, if you have very small lists (roughly speaking, around 10 elements or fewer), linear search can win: there are parts of JIT compiling VMs, for example, where people care about the difference in speed!
[6] That does not mean that the data is without surprises. Why does the linear search histogram, for example, have a noticeable peak in the first histogram bin? I can speculate as to possible causes, but I don’t know for sure.

A Quick Look at Trait Objects in Rust

February 12 2019

Rust is a genuinely interesting programming language: it has a number of features which are without precedent in mainstream languages, and those features combine in surprising and interesting ways. In many cases, it's a plausible replacement for C [1]: it leads to fairly fast code; and because it doesn't have a garbage collector, its memory use is more predictable, which can be useful for some programs. I've been doing an increasing amount of programming in Rust for the last 3 or 4 years, the biggest thing being the grmtools set of libraries (a good starting place for those is the quickstart guide). However, there isn’t any doubt in my mind that Rust isn’t the easiest language to learn. That’s partly because it’s a fairly big language: in my opinion, language complexity grows exponentially with language size. But it’s also partly because a lot of Rust is unfamiliar. Indeed, my first couple of years in Rust reminded me of my early fumblings with programming: a lot of activity rewarded with only limited progress. However, it’s been worth it: Rust has proven to be a good match for many of the things I want to do.

One of the things that baffled me for quite a long time is Rust’s “trait objects”: they felt like an odd part of the language and I was never quite sure whether I was using them or not, even when I wanted to be. Since I’ve recently had cause to look into them in more detail, I thought it might be helpful to write a few things down, in case anyone else finds my explanation useful. The first part of this blog post covers the basics and the second part takes a look at the performance implications of the way trait objects are implemented in Rust.

The basics

In general, Rust has a very strong preference for static dispatch of function calls, which is where the function matching a call is determined at compile-time. In other words, when I write a function call f(), the compiler statically works out which function f I’m referring to and makes my function call point directly to that function f. This stands in contrast to dynamic dispatch where the function matching a call is only determined at run-time. Dynamic dispatch is common in languages such as Java. Consider a class C which defines a method m which one or more subclasses override. If we have an object o of type C and a function call o.m(), then we have to wait until run-time to determine whether we should call C's implementation of m or one of its subclasses. Simplifying slightly, static dispatch leads to faster performance [2] while dynamic dispatch gives one more flexibility when structuring a program. These features can combine in a variety of ways: some languages only offer static dispatch; some only offer dynamic dispatch; and some require users to opt-in to dynamic dispatch.

It can be easy to think Rust’s traits alone imply dynamic dispatch, but that is often not the case. Consider this code:

trait T {
  fn m(&self) -> u64;
}

struct S {
  i: u64
}

impl T for S {
  fn m(&self) -> u64 { self.i }
}

fn main() {
  let s = S{i : 100};
  println!("{}", s.m());
}
Here the trait T looks a bit like it's a Java interface, requiring any class/struct which implements it to have a method m to return an integer: indeed, the calling syntax in line 15 s.m() looks like a method call on an object which we might well expect to be dynamically dispatched. However, the Rust compiler statically resolves the call m() to the function T::m. This is possible because the variable s on line 14 is statically determined to have type S and, since S implements the T trait which has an m method, that function is the only possible match. As this suggests, Rust’s traits aren’t really like Java interfaces, and structs aren’t really like classes.

However, Rust does allow dynamic dispatch, although at first it can feel like magic to make it happen. Let's assume we want to make a function which can take in any struct which implements the trait T and call its m function. We might try and write this:

fn f(x: T) {
  println!("{}", x.m())
}
However compiling that leads to the following scary error:
error[E0277]: the size for values of type `(dyn T + 'static)` cannot be known at compilation time
  --> src/main.rs:21:6
   |
21 | fn f(x: T) {
   |      ^ doesn't have a size known at compile-time
   |
   = help: the trait `std::marker::Sized` is not implemented for `(dyn T + 'static)`
   = note: to learn more, visit <https://doc.rust-lang.org/book/second-edition/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
   = note: all local variables must have a statically known size
   = help: unsized locals are gated as an unstable feature
The length of that error is intimidating, but it actually says the same thing in three different ways [3]. By specifying that we'll take in an argument of type T we're moving the run-time value into f. However — and I prefer to think of this in terms of running code — we don't know how big any given struct that implements T will be, so we can’t generate a single chunk of machine code that handles it (should the machine code expect a struct of 8 bytes or 128 bytes or ...? how big should the stack the function requests be?).

One sort-of solution to the problem is to change f to the following:

fn f<X: T>(x: X) {
  println!("{}", x.m())
}
This now compiles, but does so by duplicating code such that static dispatch is still possible. The code between angle brackets <X: T> defines a type parameter X [4]: the type passed (possibly implicitly) for that parameter must implement the trait T. This looks a bit like it's a long-winded way of saying fn(x: T) but the type parameter means that monomorphisation kicks in: a specialised version of f is generated for every distinct type we pass to X, allowing Rust to make everything statically dispatched. Sometimes that's what we want, but it's not always sensible: monomorphisation means that code bloat can be a real concern (we can end up with many separate machine code versions of f); and it can limit our ability to structure our program in a sensible fashion.
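If it helps, here's a minimal self-contained sketch (with throwaway structs A and B that aren't used elsewhere in this post) showing monomorphisation in action:

trait T {
  fn m(&self) -> u64;
}

struct A;
struct B;

impl T for A {
  fn m(&self) -> u64 { 1 }
}

impl T for B {
  fn m(&self) -> u64 { 2 }
}

// The type parameter means the compiler generates a separate, statically
// dispatched copy of f for each concrete type it is called with.
fn f<X: T>(x: X) {
  println!("{}", x.m())
}

fn main() {
  f(A); // compiled as f::<A>, calling A::m directly
  f(B); // compiled as f::<B>, calling B::m directly
}

In other words, the compiled binary ends up with (at least) two copies of f, which is where the code bloat mentioned above comes from.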

Fortunately, there is an alternative solution [5] which does enable dynamic dispatch, as shown in the following code:

trait T {
  fn m(&self) -> u64;
}

struct S1 {
  i: u64
}

impl T for S1 {
  fn m(&self) -> u64 { self.i * 2 }
}

struct S2 {
  j: u64
}

impl T for S2 {
  fn m(&self) -> u64 { self.j * 4 }
}

fn f(x: &T) {
  println!("{}", x.m())
}

fn main() {
  let s1 = S1{i : 100};
  f(&s1);
  let s2 = S2{j : 100};
  f(&s2);
}
Running this program prints out 200 and then 400 and that was achieved by dynamic dispatch! Hooray! But why did it work?

The only real difference with our previous version is that we changed f to take a reference to an object of type T (line 21). Although we don't know how big a struct that implements T might be, every reference to an object that implements T will be the same size, so we only need a single machine code version of f to be generated. It's at this point that the first bit of magic happens: in line 27 we passed a reference to an object of type S1 to f, but f expects a reference to an object of type T. Why is that valid? In such cases, the compiler implicitly coerces &S1 to &T, because it knows that the struct S1 implements the trait T [6]. Importantly, this coercion magically attaches some extra information (I’ll explain more below) so that the run-time system knows it needs to call S1's methods on that object (and not, for example, S2's methods).

That such an implicit coercion is possible is, in my experience, very surprising to those new to Rust. If it’s any consolation, even experienced Rust programmers can fail to spot these coercions: nothing in f's signature tells you that such a coercion will happen, unless you happen to know that T is a trait, and not a struct. To that end, recent versions of Rust let you add a syntactic signpost to make it clear:

fn f(x: &dyn T) {
  println!("{}", x.m())
}
The extra dyn keyword has no semantic effect, but you might feel it makes it a little more obvious that a coercion to a trait object is about to happen. Unfortunately, because the use of that keyword is currently a bit haphazard, one can never take its absence as a guarantee that dynamic dispatch won’t occur.

Interestingly, there is another way to perform this same coercion, but without using references: we can put trait objects in boxes (i.e. put them on the heap). That way, no matter how big the struct stored inside the box, the size of the box we see is always the same. Here's a simple example:

fn f2(x: Box<T>) {
  println!("{}", x.m())
}

fn main() {
  let b: Box<S1> = Box::new(S1{i: 100});
  f2(b);
}
At line 6 we have a variable of type Box<S1> but passing it to f2 automatically coerces it to Box<T>. In a sense, this is just a variant of the reference coercion: in both cases we’ve turned an unsized thing (a trait object) into a sized thing (a reference or a box).

Fat pointers vs. inner vpointers

I deliberately omitted something in my earlier explanation: while it's true that all references to an object of our trait type T are of the same size, it's not necessarily true that references to objects of different types are the same size. An easy way to see that is in this code, which executes without errors:
use std::mem::size_of;

trait T { }

fn main() {
    assert_eq!(size_of::<&bool>(), size_of::<&u128>());
    assert_eq!(size_of::<&bool>(), size_of::<usize>());
    assert_eq!(size_of::<&dyn T>(), size_of::<usize>() * 2);
}
What this says is that the size of a reference to a bool is the same as the size of a reference to a u128 (line 6), and both of those are a machine word big (line 7). This isn't surprising: references are encoded as pointers. What might be surprising is that the size of a reference to a trait object is two machine words big (line 8). What's going on?

Rust uses fat pointers extensively, including when trait objects are used. A fat pointer is simply a pointer-plus-some-other-stuff, so is at least two machine words big. In the case of a reference to a trait object, the first machine word is a pointer to the object in memory, and the second machine word is a pointer to that object’s vtable (which, itself, is a list of pointers to a struct’s dynamically dispatched functions). Although it’s not universally used terminology, let’s call a pointer to a vtable a vpointer. We can now make sense of the ‘magic’ part of the coercion from a struct to a trait object I mentioned earlier: the coercion from a pointer to a fat pointer adds the struct’s vpointer to the resulting fat pointer. In other words, any trait object coerced from an S1 struct will have a fat pointer whose vpointer points to S1’s vtable (call it v1), and any trait object coerced from an S2 struct will have a vpointer to S2’s vtable (v2). v1 will, conceptually, have a single entry pointing to S1::m and v2 a single entry pointing to S2::m. If you want, using unsafe code, you can easily tease the object pointer and vpointer apart.
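To make that concrete, here's a small sketch in the same transmute style as the unsafe code later in this post (repeating the earlier T/S1/S2 definitions so that it's self-contained):

use std::mem::transmute;

trait T {
  fn m(&self) -> u64;
}

struct S1 {
  i: u64
}

impl T for S1 {
  fn m(&self) -> u64 { self.i * 2 }
}

struct S2 {
  j: u64
}

impl T for S2 {
  fn m(&self) -> u64 { self.j * 4 }
}

fn vpointer(x: &dyn T) -> usize {
  // A reference to a trait object is a fat pointer (object pointer, vpointer):
  // transmuting it to two machine words lets us look at the second word.
  let (_obj, vtable) = unsafe { transmute::<&dyn T, (usize, usize)>(x) };
  vtable
}

fn main() {
  let (s1a, s1b) = (S1{i: 1}, S1{i: 2});
  let s2 = S2{j: 3};
  // In practice the first two values printed are the same (S1's vtable, i.e.
  // v1) and the third differs (S2's vtable, i.e. v2).
  println!("{:x} {:x} {:x}", vpointer(&s1a), vpointer(&s1b), vpointer(&s2));
}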

If you’re a Haskell or Go programmer, this use of fat pointers is probably what you expect. Personally I’m used to vpointers living alongside objects, not alongside object pointers: as far as I know the former technique doesn’t have a name so let’s call it inner vpointers. For example, in a typical object orientated language, every object is dynamically dispatched, so every object carries around its own vpointer. In other words, the choices are between adding an extra machine word to pointers (fat pointers) or an extra machine word to objects (inner vpointers).
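Put in terms of simplified Rust structs, the two layouts look roughly like the following sketch (these structs are purely illustrative, not anything from Rust's internals):

use std::mem::size_of;

// With fat pointers, the vpointer travels alongside the object pointer...
#[allow(dead_code)]
struct FatPointer {
  object: *const (),
  vpointer: *const (),
}

// ...with inner vpointers, it would be stored alongside the object's data.
#[allow(dead_code)]
struct InnerVtableObject {
  vpointer: *const (),
  data: u64,
}

fn main() {
  // On a typical 64-bit machine, the extra machine word is paid per pointer in
  // the first scheme...
  assert_eq!(size_of::<FatPointer>(), size_of::<usize>() * 2);
  // ...and per object in the second.
  assert_eq!(size_of::<InnerVtableObject>(), size_of::<usize>() + size_of::<u64>());
}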

Why might Rust have plumped for fat pointers instead of inner vpointers? Considering only performance as a criterion, the downside to inner vpointers is that every object grows in size. If every function call uses an object’s vpointer, this doesn’t matter. However, as I showed earlier, Rust goes out of its way to encourage you to use static dispatch: if it used inner vpointers, the vpointers would probably go unused most of the time [7]. Fat pointers thus have the virtue of only imposing extra costs in the particular program locations where you want to use dynamic dispatch.

What are the performance trade-offs of fat pointers vs. inner vpointers?

Most Rust programs will use dynamic dispatch sparingly – any performance differences between fat pointers and inner vpointers are likely to be irrelevant. However, I want to write programs (language VMs) which use it extensively (for every user-language-level object). It’s therefore of some interest to me whether there’s any difference in performance between the two schemes. Unfortunately I haven’t been able to turn up any relevant performance comparisons [8]: the nearest I’ve seen are papers in the security world, where fat pointers are used to encode certain security properties (e.g. the second part of a fat pointer might carry around a memory block’s length, so that all array accesses can be checked for safety), and thus clearly make performance worse. Our setting is quite different.

In order to make a compelling comparison, I'd need real programs of note, and to measure them rigorously using both schemes, but that’s a bit difficult because I don’t have any such programs yet, and won’t for some time. So I’m going to have to make do with a few micro-benchmarks and the inevitable caveat: one should never assume that anything these micro-benchmarks tell us will translate to larger programs. I'm also not going to go to quite the extremes I have in the past to measure performance: I'm looking for a rough indication rather than perfection.

In order to keep things tractable, I made three assumptions about the sorts of program I'm interested in:

  1. Such programs will create trait objects occasionally but call methods in them frequently. I therefore care a lot more about calling costs than I do about creation costs.
  2. I care more about methods which read from the self object than those that don't. Are there any performance differences between the two?
  3. In general – and unlike most Rust programs – I'm likely to have a reasonable amount of aliasing of objects. Does that cause any performance differences when calling functions?

In order to model this, I created a trait GetVal which contains a single method which returns an integer. I then created two structs which implement that trait: SNoRead returns a fixed integer (i.e. it doesn't read from self); and SWithRead returns an integer stored in the struct (i.e. it reads from self). Both structs are a machine word big, so they should have the same effects on the memory allocator (even though SNoRead doesn't ever read the integer stored within it). Eliding a few details, the code looks like this:

trait GetVal {
  fn val(&self) -> usize;
}

struct SNoRead {
  i: usize
}

impl GetVal for SNoRead {
  fn val(&self) -> usize { 0 }
}

struct SWithRead {
  i: usize
}

impl GetVal for SWithRead {
  fn val(&self) -> usize { self.i }
}
To keep things simple, our first two benchmarks will stick with fat pointers, and we'll just measure the difference between calling SNoRead::val and SWithRead::val. Eliding a number of details, our first benchmark looks like this:
fn time<F>(mut f: F) where F: FnMut() {
  let before = Instant::now();
  for _ in 0..ITERS {
    f();
  }
  let d = Instant::now() - before;
  println!("{:?}", d.as_secs() as f64 + d.subsec_nanos() as f64 * 1e-9);
}

fn bench_fat_no_read() {
  let mut v: Vec<Box<dyn GetVal>> = Vec::with_capacity(VEC_SIZE);
  for _ in 0..VEC_SIZE {
    v.push(Box::new(SNoRead{i:0}));
  }
  time(|| {
    for e in &v {
      assert_eq!(e.val(), 0);
    }
  });
}
In essence, we create a vector with VEC_SIZE elements (lines 11-14), each of which contains a boxed trait object. We then time (lines 2-7) how long it takes to iterate over the vector, calling the method val on each element (lines 16-18). Notice that we don’t measure the time it takes to create the vector and that we repeat the iterating over the vector ITERS times to make the benchmark run long enough. Although I don’t show it above, in order to make the resulting measurements more reliable, each benchmark is compiled into its own binary, and each binary is rerun 30 times. Thus the numbers I’m reporting are the mean of the benchmark run across 30 process executions, and I also report 99% confidence intervals calculated from the standard deviation.

Only having one benchmark makes comparisons a little hard, so let's do the easiest variant first: a version of the above benchmark which uses SWithRead instead of SNoRead. This simply requires duplicating bench_fat_no_read, renaming it to bench_fat_with_read, and replacing SNoRead with SWithRead inside the function.

We then have to decide how big to make the vector, and how many times we repeat iterating over it. I like to make benchmarks run for at least one second if possible, because that tends to lessen the effects of temporary jitter. As an initial measurement, I set VEC_SIZE to 1000000, and ITERS to 1000. Here are the results for the two benchmarks we’ve created thus far:

bench_fat_no_read: 1.708 +/- 0.0138
bench_fat_with_read: 2.152 +/- 0.0103
This isn't too surprising: there's an inevitable fixed cost to iterating over the vector and jumping to another function which is shared by both benchmarks. However, bench_fat_with_read also measures the cost to read self.i in the SWithRead::val function, which slows things down by over 25%.

Now we can move onto the hard bit: creating inner vpointer variants of both benchmarks. This is a little bit fiddly, in part because we need to use unsafe Rust code [9]. The basic technique we need to know is that a fat pointer can be split into its constituent parts as follows:

let x: &dyn T = ...;
let (ptr, vtable) = unsafe { mem::transmute::<_, (usize, usize)>(x) };
You can look at transmute in several different ways, but I tend to think of it as a way of copying bits in memory and giving the copied bits an arbitrary type: in other words, it’s a way of completely and utterly bypassing Rust’s static type system. In this case, we take in a fat pointer which is two machine words big, and split it into two machine word-sized pointers (which I’ve encoded as usizes, because it slightly reduces the amount of code I have to enter later).

What we need to do first is create a vector of thin (i.e. one machine word big) pointers to “vpointer + object” blocks on the heap. Eliding a few annoying details, here’s code which does just that:

fn vec_vtable() -> Vec<*mut ()> {
  let mut v = Vec::with_capacity(VEC_SIZE);
  for _ in 0..VEC_SIZE {
    let s = SNoRead{i: 0};
    let b = unsafe {
      let (_, vtable) = transmute::<&dyn GetVal, (usize, usize)>(&s);
      let b: *mut usize = alloc(...) as *mut usize;
      b.copy_from(&vtable, 1);
      (b.add(1) as *mut SNoRead).copy_from(&s, 1);
      b as *mut ()
    };
    v.push(b);
  }
  v
}
The type *mut () (line 1) is Rust’s rough equivalent of C’s void * pointer. Every time we make a new SNoRead object (line 4), we create a trait object for it (line 5), and pull out its vpointer (line 6; note that this will be the same value for every element in the vector). We then allocate memory (line 7) sufficient to store the vpointer (line 8) and the object itself (line 9): on a 64-bit machine, the vpointer will be stored at offset 0, and the object will be stored starting at offset 8. Notice that a significant portion of this function is unsafe code: that’s inevitable when fiddling around with low-level pointers like this in Rust.

With that done, we can then create a benchmark:

pub fn bench_innervtable_no_read() {
  let v = vec_vtable();
  time(|| {
    for &e in &v {
      let vtable = unsafe { *(e as *const usize) };
      let obj = unsafe { (e as *const usize).add(1) };
      let b: *const dyn GetVal = unsafe { transmute((obj, vtable)) };
      assert_eq!(unsafe { (&*b).val() }, 0);
    }
  });
}
There are a couple of subtleties here. Notice how we recreate a fat pointer by recombining the object pointer and vpointer (lines 5-7): the *const dyn GetVal is vital here, as otherwise Rust won't know which trait we're trying to make a fat pointer for. In order to turn a raw fat pointer into a normal reference, we have to use the &* operator [10] (line 8). With that done, we can then call the method val.

Using the same settings as for our earlier run, our new benchmark (and its with_read variant) performs as follows:

bench_innervtable_no_read: 2.111 +/- 0.0054
bench_innervtable_with_read: 2.128 +/- 0.0090
Unsurprisingly, bench_innervtable_no_read is much slower than its fat pointer cousin bench_fat_no_read: the former has to do a memory read on each iteration of the loop to read the vpointer, whereas the latter has that available in its fat pointer. bench_fat_no_read is thus very cache friendly because all it's doing is iterating linearly through memory (the vector) and then calling the same function (SNoRead::val) repeatedly.

However, bench_innervtable_with_read is only just (taking into account the confidence intervals) slower than bench_innervtable_no_read. Why might this be? Well, reading the vpointer from memory will nearly always bring the object into the processor’s cache too, making the read in SWithRead::val much cheaper afterwards. Put another way, the first read of a random point in memory is often quite slow, as the processor waits for the value to be fetched from RAM; but reading another point almost immediately afterwards is quick, because an entire cache line (a chunk of memory: 64 bytes on x86) is brought into the processor’s cache in one go. If you look carefully, bench_innervtable_with_read is ever so slightly faster than bench_fat_with_read, although not by enough for me to consider it particularly significant.

Is any of this interesting? Depending on your use case, yes, it might be. Imagine you're implementing a GUI framework, which is a classic use case for dynamic dispatch. A lot of the methods that are called will be empty, because the user doesn't need to handle the respective actions. The numbers above show that inner vpointers would slow you down in such a case: fat pointers are clearly the right choice.

Let's make our final class of benchmark: what happens if, instead of our vector pointing to VEC_SIZE distinct objects, each element points to the same underlying object? In other words, what are the performance implications when there is heavy aliasing of the same object? First let's create our vector:

fn vec_multialias_vtable() -> Vec<*mut ()> {
  let ptr = {
    let s = SNoRead{i: 0};
    unsafe {
      let (_, vtable) = transmute::<&dyn GetVal, (usize, usize)>(&s);
      let b = alloc(...) as *mut usize;
      b.copy_from(&vtable, 1);
      (b.add(1) as *mut SNoRead).copy_from(&s, 1);
      b as *mut ()
    }
  };
  vec![ptr; VEC_SIZE]
}
Note that we only create a single pointer to an object (line 6), duplicating that pointer (but not the object!) VEC_SIZE times (line 12).

By now, I suspect you've got the hang of the main benchmarks, so I won't repeat those. Let's go straight to the numbers:

bench_fat_multialias_no_read: 1.709 +/- 0.0104
bench_fat_multialias_with_read: 1.709 +/- 0.0099

bench_innervtable_multialias_no_read: 1.641 +/- 0.0007
bench_innervtable_multialias_with_read: 1.644 +/- 0.0115
Interestingly, there is now a small but noticeable difference: the inner vtable approach is definitely faster. Why is this? Well, it's probably because the fat pointer version of this benchmark consumes more memory. Each fat pointer consumes 2 machine words, so, at an abstract level, the total memory consumption of the program is roughly VEC_SIZE * 2 machine words (I think we can ignore the single machine word needed for the one object). In contrast, the inner vpointer version consumes only VEC_SIZE machine words. It's thus a little more cache friendly, which probably explains the 4% performance difference.
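
To put rough numbers on that, here's a back-of-the-envelope calculation (in Python for brevity; the VEC_SIZE below is a placeholder rather than the value the benchmarks actually use, and I'm assuming 8-byte machine words):

WORD_BYTES = 8          # a 64-bit machine word
VEC_SIZE = 1_000_000    # placeholder value, for illustration only

# Fat pointers: each vector element is (data pointer, vpointer).
fat_bytes = VEC_SIZE * 2 * WORD_BYTES
# Inner vpointers: each vector element is a single thin pointer.
inner_bytes = VEC_SIZE * 1 * WORD_BYTES

print("fat pointers:    %.1f MiB" % (fat_bytes / 1024 / 1024))    # 15.3 MiB
print("inner vpointers: %.1f MiB" % (inner_bytes / 1024 / 1024))  # 7.6 MiB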

An important question is whether, even for these very limited benchmarks, anything changes if the vector size changes. It does, slightly: it seems that the smaller the vector, the smaller the difference between the two approaches (as you can see in the performance numbers for a number of variants).

Conclusions

To my simple mind, trait objects in Rust are confusing because of the way they magically appear through implicit coercions. Maybe my explanation will help someone get to grips with them a little sooner than I managed to.

With regards to the performance comparison, what does it mean for most people? I suggest it shows that Rust’s choice of fat pointers is probably the right one: if you don’t use trait objects, you don’t pay any costs; and if you do use trait objects, the performance is often the best anyway.

What does this mean for me? Well, for VMs, the situation is a little different. In particular, in many VMs, the first thing a method on a VM-level object does is to read from self (e.g. due to biased locking). In such cases, and assuming a sensible object layout, the costs of fat pointers and inner vpointers are probably roughly comparable. Because most languages that aren't Rust allow aliasing, it's also likely that the run-time will see some aliasing, at which point inner vpointers might become a touch quicker; and, even if there's no performance difference, they will use slightly less memory.

Of course, all of this is highly speculative, based on tiny benchmarks, and I'll probably try and keep my options open when implementing my first VM or two in Rust to see if one approach is better in practice than the other. Don't hold your breath for new results any time soon though!

If you want to run, or fiddle with, the benchmarks shown in this blog post, they're available as vtable_bench.

Acknowledgements: My thanks to Edd Barrett and Jacob Hughes for comments on this blog post.


Footnotes

[1] And, presumably, C++, though I don’t know that language well enough to say for sure.
[2] Static dispatch makes inlining trivial. In my opinion, inlining is the mother of all optimisations: the more of it you can do, the faster your program will run. Note that inlining doesn’t have to be done statically (though it’s easier that way), but doing it dynamically tends to cause noticeable warmup.
[3] Although this error is unnecessarily verbose, in general modern rustc gives very good quality error messages: older versions of rustc were quite a lot worse in this regard. Indeed, I've noticed that the better error messages make it a lot easier for those learning Rust today compared to 2015 or so.
[4] It’s tempting to call these things “generics”, making them sound similar to the feature of that name found in languages such as Java. However, when you get a bit further into Rust, you start to realise that the “parameter” part of the “type parameter” name is crucial, because you can pass type parameters around statically in a way that resembles “normal” parameters.
[5] There’s another possible solution which I’m not going to talk about here, involving the CoerceUnsized trait. First, this has been unstable for years, and the lack of activity on the relevant issue suggests it will be unstable for a while to come. Second, it’s really hard to explain what it does.
[6] There are some restrictions on what traits can contain and still be converted into trait objects. I won’t worry about those here.
[7] To some extent, one could statically optimise away some of the costs of inner vpointers by noticing that some structs are never used with dynamic dispatch. That will still allow some wastage, because some structs are only sometimes used with dynamic dispatch. It’s hard to imagine an automatic analysis doing a great job of optimising such cases effectively.
[8] Although it’s not quite the same thing, Gui Andrade's description of dynstack is well worth a read.
[9] Unfortunately what unsafe code is allowed to do and what it's not allowed to do is largely unknown in Rust. This is a big problem for anyone who needs to drop out of safe Rust. There is some progress towards addressing this, but it’s not going to happen quickly.
[10] I must admit that I found this operator's syntax confusing at first. &*x looks like it’s saying “dereference x and get a reference to where x is stored”, but it doesn’t dereference x: it just reinterprets a raw pointer as a Rust reference, which has no run-time effect.

Why Aren’t More Users More Happy With Our VMs? Part 2

September 19 2018

In the first part of this two-part blog, I showed results from an experiment on VM performance that I’d been part of. Although it wasn’t our original intention, we eventually ended up trying to find out how often mainstream programming language VMs warm up as per traditional expectations. If you consider only individual process executions, then about 70% warmup; if you consider (VM, benchmark) pairs, then about 40% warmup consistently; and if you consider (VM, benchmark, machine) triples, then about 12% warmup consistently.

However you choose to look at it, these numbers aren’t pretty. In trying to make sense of them, I’m going to try and provide some answers to the following three questions. First, are there simple explanations for the performance problems we encountered? Second, how did we get into this situation? Third, how can we do better in the future? To answer the first question, I'm going to draw on our paper. In providing suggested answers to the second and third questions, I'm going to make extensive use of my personal opinions and experiences, so please don’t blame anyone else for them!

Are there simple explanations for poor performance?

There are all sorts of factors that might be responsible for odd performance, but, beyond those covered in the first part of this blog post, there are two that seem both particularly likely and relatively easy to test for: garbage collection and JIT compilation. For example, garbage collection is, to state the obvious, memory intensive: it could conceivably have all sorts of odd effects on processor caches and thus performance. Similarly, JIT compilation can happen at any point, and there's no guarantee that it makes things faster.

We therefore took two of the VMs under examination – HotSpot and PyPy [1] – and performed a second experiment, separate from the first [2], where we also recorded how long each in-process iteration spent in garbage collection activities and JIT compilation. To be clear, the aim of this second experiment wasn’t to find causations (i.e. to find definitive reasons for odd performance), but to see if there are plausible correlations (i.e. good candidates for further investigation).
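
In outline, the per-iteration recording looks something like the following sketch. Note that read_gc_time and read_jit_time are hypothetical stand-ins for whatever profiling interface a given VM exposes (they are not real APIs), and the real experiment was rather more careful than this:

import time

def read_gc_time():
    # Hypothetical: cumulative time the VM has spent in garbage collection.
    raise NotImplementedError("depends on the VM's profiling interface")

def read_jit_time():
    # Hypothetical: cumulative time the VM has spent JIT compiling.
    raise NotImplementedError("depends on the VM's profiling interface")

def bench(inner, iters):
    rows = []
    for i in range(iters):
        gc0, jit0 = read_gc_time(), read_jit_time()
        before = time.monotonic()
        inner()
        wall = time.monotonic() - before
        # Store per-iteration deltas so that each in-process iteration's
        # wall-clock time can later be plotted against its GC and JIT
        # compilation time.
        rows.append((i, wall, read_gc_time() - gc0, read_jit_time() - jit0))
    return rows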

Here’s an example plot from this second experiment:

[Plot: fasta on PyPy, with sub-plots showing per-iteration wall-clock, garbage collection, and JIT compilation times.]

There are three sub-plots in this plot. All share the same x-axis of in-process iterations (i.e. the benchmark running in a for loop). The “main” sub-plot (the bottom of the 3 sub-plots) is the same as in the first part of this blog, with its y-axis showing the wall-clock time that each in-process iteration takes. In the example above, we can see something odd going on in fasta on PyPy: there are 3 main “sections”, but instead of each “section” having fairly consistent performance for each in-process iteration, they seem to get gradually slower over time.

The top two sub-plots show the time spent in garbage collection and JIT compilation for each in-process iteration [3]. These clearly show that JIT compilation only happens right at the very beginning of the benchmark. However, the time spent in garbage collection is almost perfectly correlated with the overall performance of the benchmark. That suggests first that this benchmark is spending most of its time in garbage collection, and second that there may well be a minor memory leak (in the VM or the benchmark). Again, to be clear, I’m only claiming a correlation, not a causation. However, if I wanted to fix this, I know that I’d start looking for memory allocation sites that don’t seem to be freeing up memory.
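
If you have the per-iteration numbers to hand (e.g. collected as in the sketch above), a crude way of checking for this kind of correlation is a simple correlation coefficient. This is only a sanity check, not the analysis from the paper:

from statistics import correlation  # Python >= 3.10

def gc_correlation(wall_times, gc_times):
    # Pearson's r between each iteration's wall-clock time and its GC time:
    # a value near 1.0 is consistent with GC dominating the timings.
    return correlation(wall_times, gc_times)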

[Plot: Richards on HotSpot, with the same three sub-plots.]

Here’s my old friend, Richards running on HotSpot: this benchmark gets 5% slower at about iteration 200, ending up in a steady state of non-peak performance. We can see that we’re regularly spending time garbage collecting [4], but there’s no correlation between garbage collection spikes and the main sub-plot. However, the JIT compilation sub-plot is more interesting: just before the benchmark slows down, there are two long-ish JIT compilation events. It therefore seems plausible that the VM decided to recompile parts of the benchmark (probably based on extra profiling data it had collected), but that the recompilation led to slower machine code being generated. Again, it’s not a guaranteed causation, but it does give a pointer for where to look. However, it's a less useful pointer than the previous example. Since JIT compilers are, to put it mildly, rather complex, fingering the JIT compiler as a culprit still means considering a huge amount of code as the likely cause of the problem.

[Plot: the shallow square-wave example, with the same three sub-plots.]

This third example shows a more troubling case. This is the shallow square-wave example we saw in the first part of this blog (you can see the original plot if you want a reminder). It shows two things. First, turning on profiling changes the nature of most process executions: instead of appearing almost immediately, the square wave doesn’t appear until the thousandth iteration; and the “steady-ish state performance” is better with profiling on (roughly 0.34-0.36s with profiling on vs. 0.36-0.40s with profiling off). Second, no matter how much you look, there is no correlation between garbage collection (indeed, no garbage collection occurs at all!) or JIT compilation and the square wave. You might get excited by the JIT compilation event around iteration 1375, which seems to correlate with the third trough, but there’s no such correlation at the start of any of the other troughs. I honestly don’t have a clue what’s causing the square wave, and there’s nothing here which gives me the slightest pointer as to what I should look at first.

Unfortunately, a lot of examples look like the second or third cases: the correlation only points us to a still-very-large subset of the VM; or doesn’t point us anywhere at all. Indeed, more recently I was involved in a LuaJIT performance project where we also extracted profiling information about garbage collection and JIT compilation. While this information was helpful sometimes, I wouldn’t say that it was a decisive advantage. In some cases, despite the plots not showing anything obvious, we later traced problems back to the garbage collector or the JIT. There are clearly several possible causes for this, but my guess is that the most common is that the consequences of performance problems are often spread around somewhat evenly, thus not showing up as a visible spike at a particular point in time.

There seem, therefore, few simple answers to the odd performance I pointed out in the first part of this post. How did we get into such a situation? Realistically, the answer is almost surely “for lots of reasons, many of which we’ll probably never know.” However, I have a couple of war stories that I think might provide a partial explanation, and that look at things in a slightly different way than I’ve heard others talk about in the past.

Are benchmark suites complete guides to performance?

At the risk of stating the obvious, most of us use benchmarks to guide our optimisations. In other words, if we think we’ve made a compiler faster, we have one or more programs which we measure before and after the change to see if it really has made things faster or not (and, if it has, by how much). In general, people use multiple such programs, at which point we bestow upon them the grand title of a “benchmark suite”. Optimisations which make our benchmark suite run faster are good; those that make our benchmark suite run slower are bad. But do our benchmark suites tell us about the performance of programs in general?

Some of you may remember that a few years back Dropbox started a new Python VM project called Pyston [5]. I was a bit surprised at the time, because VMs are notoriously a lot more work than people think, and PyPy was already a pretty good Python VM by that point. Still, I didn’t think much more about it, until I happened to bump into Kevin Modzelewski, one of Pyston’s creators, at a conference a year or two later. I had a very interesting conversation with him, as part of which I asked him why Dropbox had decided to create a brand new Python VM. Simplifying things a bit, he said that PyPy performed badly on some common Dropbox idioms, and that he’d distilled the worst offender down to a benchmark where PyPy (a clever JIT compiler) was over 4 times slower than CPython (a simple interpreter).

Looking at the benchmark, the problem became clear very quickly: it was recursive. To my surprise, RPython (the language/framework in which PyPy is written) was unrolling [6] all recursive calls. Unrolling is a powerful, but dangerous, optimisation: it can make programs faster when used judiciously, but used incautiously it causes code bloat and degraded performance. Since unrolling all recursive calls is akin to unrolling all loops, this seemed unlikely to be a good idea. I quickly confirmed with the core developers that this wasn't an intended outcome, but rather a consequence of tracing's naturally aggressive tendency to inline.

I thought this would be a fun problem to debug, as it involved a part of RPython that I had not previously looked at. It took 2 or 3 days before I had found the right part of RPython, worked out what was going on, and found a fix. The core of the resulting diff is fairly simple:

diff --git a/rpython/jit/metainterp/pyjitpl.py b/rpython/jit/metainterp/pyjitpl.py
--- a/rpython/jit/metainterp/pyjitpl.py
+++ b/rpython/jit/metainterp/pyjitpl.py
@@ -951,9 +951,31 @@
         if warmrunnerstate.inlining:
             if warmrunnerstate.can_inline_callable(greenboxes):
+                # We've found a potentially inlinable function; now we need to
+                # see if it's already on the stack. In other words: are we about
+                # to enter recursion? If so, we don't want to inline the
+                # recursion, which would be equivalent to unrolling a while
+                # loop.
                 portal_code = targetjitdriver_sd.mainjitcode
-                return self.metainterp.perform_call(portal_code, allboxes,
-                                                    greenkey=greenboxes)
+                inline = True
+                if self.metainterp.is_main_jitcode(portal_code):
+                    for gk, _ in self.metainterp.portal_trace_positions:
+                        if gk is None:
+                            continue
+                        assert len(gk) == len(greenboxes)
+                        i = 0
+                        for i in range(len(gk)):
+                            if not gk[i].same_constant(greenboxes[i]):
+                                break
+                        else:
+                            # The greenkey of a trace position on the stack
+                            # matches what we have, which means we're definitely
+                            # about to recurse.
+                            inline = False
+                            break
+                if inline:
+                    return self.metainterp.perform_call(portal_code, allboxes,
+                                greenkey=greenboxes)

In essence, what this diff does is the following. When we're tracing (roughly speaking, “JIT compiling”) and about to trace into a function F (implicitly identified here by the “greens”), we check the tracer's call-stack to see if F is present. If it is, then inlining F again would be akin to unrolling, and we stop tracing.

Without worrying too much about the details, I hope you’ll agree that it’s a fairly small diff. What's more, it's effective: performance on PyPy improved by 13.5x (meaning that PyPy went from 4.4x slower than CPython to over 3x faster). At this point, I was patting myself on the back for a job well done, and indulging my ego by wondering if Dropbox would have created a new Python VM if this fix had been present.

However, there was a problem. Yes, I'd made one benchmark 13.5x faster, but PyPy has its own benchmarking suite. When I ran the above fix on PyPy’s benchmark suite of 20-odd benchmarks, 5 became slower (up to 4x slower). I knew that the core PyPy developers would, quite reasonably, not accept a performance “fix” that made performance of the main benchmark suite worse. First, I added a mechanism to control how many levels of recursion are unrolled. Second, I set about uncovering how many unrollings of recursive calls would recover the performance of the main PyPy benchmarks.
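
In simplified form, the mechanism boils down to something like the following sketch (an illustration, not RPython's actual code), which counts how many times the function we're about to trace into already appears on the tracer's call-stack and refuses to inline once a configurable limit is reached:

MAX_UNROLLINGS = 7  # the value that was eventually merged into RPython

def should_inline(greenkey, trace_stack, limit=MAX_UNROLLINGS):
    # `trace_stack` holds the greenkeys of the functions we are already in
    # the middle of tracing; `greenkey` identifies the function we are about
    # to trace into.
    occurrences = sum(1 for gk in trace_stack if gk == greenkey)
    # Once the function already appears `limit` times on the stack, inlining
    # it again would unroll the recursion further than we want, so stop.
    return occurrences < limit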

I thus did some basic benchmarking, putting this table in the email I sent to the pypy-dev mailing list:

     #unrollings |  1   |  2   |  3   |  5   |  7   |  10  |
-----------------+------+------+------+------+------+------+
hexiom2          | 1.3  | 1.4  | 1.1  | 1.0  | 1.0  | 1.0  |
raytrace-simple  | 3.3  | 3.1  | 2.8  | 1.4  | 1.0  | 1.0  |
spectral-norm    | 3.3  | 1.0  | 1.0  | 1.0  | 1.0  | 1.0  |
sympy_str        | 1.5  | 1.0  | 1.0  | 1.0  | 1.0  | 1.0  |
telco            | 4    | 2.5  | 2.0  | 1.0  | 1.0  | 1.0  |
-----------------+------+------+------+------+------+------+
polymorphism     | 0.07 | 0.07 | 0.07 | 0.07 | 0.08 | 0.09 |
Looking back, I’m slightly embarrassed by the quality of benchmarking this represents, but I hope you’ll forgive me for that. The top half of the table (from hexiom2 to telco) represents the subset of PyPy benchmarks that my fix had slowed down. For example, with 1 unrolling of recursion (i.e. equivalent to inlining the function once but performing no further “unrolling”), raytrace-simple ran 3.3x slower than under “normal” PyPy; with 2 levels of unrolling it ran 3.1x slower; and by 7 levels of unrolling my branch and normal PyPy ran the benchmark at the same speed. On the bottom row is Kevin’s benchmark (which, for reasons I can no longer fathom, has timings in seconds): it’s fastest at 1 unrolling, but even at 7 unrollings it’s only lost a bit of performance.

The table above clearly shows that RPython should unroll recursive function calls 7 levels deep. On that basis, my email to the PyPy list proposed that number; that was the number that was merged into RPython; and that's the number that's still used today.

However, I doubted at the time that 7 was the right value and today I’m almost positive that it’s not the right value. I'm almost certain that the right value is going to be much closer to 1 — indeed, probably 1 itself. There's no doubt that simple recursive function calls (of the sort found in many of the PyPy benchmarks) benefit from unrolling. However, if unrolling every loop were a good idea, we'd expect every static compiler to do it, but they don't [7]. Recursive functions are like loops, but worse: functions are, by definition, bigger than loops on average. However, in the context of tracing, my reasoning is that we care a lot more about the worst case of unrolling than we do the best case. My major concern is that recursive functions are often heavily data dependent. A common example of this is the tree walkers found in compilers, which walk over an abstract syntax tree and compile the program. In such cases, the recursive function looks at a node, does something based on that, and then visits the node’s children. The challenge with this is that the path the recursive function takes depends entirely on the structure of the input tree. This means that the amount of code executed between each call of the recursive function can vary hugely: we frequently end up tracing paths which are never used in full again. The problem is that meta-tracing is a very expensive activity, causing execution to temporarily slow down by around a couple of orders of magnitude: tracing things that aren’t used again can lead to significant performance losses. Even worse, these losses are often invisible, in the sense that it’s not really a flaw in the user’s program, but an often inevitable mismatch with the dynamic compilation approach.
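
To make the data dependence concrete, here's a toy tree walker of the sort I have in mind (purely illustrative, and far simpler than a real compiler): which branches it takes, and how deeply it recurses, is dictated entirely by the shape of the input tree, so two inputs can lead to very different traces being recorded:

class Num:
  def __init__(self, n): self.n = n
class Add:
  def __init__(self, lhs, rhs): self.lhs, self.rhs = lhs, rhs
class Mul:
  def __init__(self, lhs, rhs): self.lhs, self.rhs = lhs, rhs

def compile_node(node, out):
  # The path taken through this function, and the depth of the recursion,
  # is dictated by the input tree, not by the function itself.
  if isinstance(node, Num):
    out.append(("push", node.n))
  elif isinstance(node, Add):
    compile_node(node.lhs, out)
    compile_node(node.rhs, out)
    out.append(("add",))
  else:
    assert isinstance(node, Mul)
    compile_node(node.lhs, out)
    compile_node(node.rhs, out)
    out.append(("mul",))
  return out

print(compile_node(Add(Mul(Num(2), Num(3)), Num(4)), []))  # (2*3)+4
print(compile_node(Num(7), []))                            # a very different "trace"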

Realistically, no benchmark suite can ever be complete, in the sense that it can claim to represent the performance behaviour of every possible program. A better question is whether the benchmark suite is representative of a fairly large subset of programs. The above example shows that PyPy’s standard benchmark suite doesn't cover some fairly common idioms. How many other common idioms aren’t covered? My strong intuition is that the answer is “many”. This isn’t to knock PyPy’s benchmark suite: indeed, it’s better than most other VMs’ benchmark suites. However, we often derive a false sense of comfort from benchmark suites: too often the phrase “this optimisation sped up the benchmark suite by 3%” is used interchangeably with “this optimisation speeds up programs by 3%”. The harsh reality is that our benchmark suites reject some optimisations that would make the vast majority of programs run faster, and accept some optimisations that make the vast majority of programs run slower. How often we get unrepresentative answers from our benchmark suites is anyone’s guess, but I fear it has happened more in the past than we imagine.

Are benchmark suites correct guides to performance?

The first war story might make you wonder whether or not benchmark suites tell us much about performance of programs in general. But a second issue with benchmark suites is whether the answer we receive is even correct with respect to the benchmark suite itself. Put another way, if a benchmark suite tells us that an attempted optimisation makes the benchmark suite faster or slower, is that answer reliable?

Alongside the main experiment I reported in the previous blog post, we also (separately) benchmarked SpiderMonkey and V8 against the Octane benchmark suite. Octane consists of 17 benchmarks, originally from V8; it fairly quickly became one of the most important JavaScript benchmark suites. Despite not knowing JavaScript at that point, I took on the task (“how hard can it be?”) of getting Octane into shape [8]. In essence, all I did was put in a for loop that ran each benchmark for 2000 in-process iterations [9]. That done, I ran my slightly altered Octane benchmark suite on d8 (the command-line way of running programs on V8) and got the following output:

$ d8 run.js
Richards
DeltaBlue
Encrypt
Decrypt
RayTrace
Earley
Boyer
RegExp
Splay
NavierStokes
PdfJS

<--- Last few GCs --->

14907865 ms: Mark-sweep 1093.9 (1434.4) -> 1093.4 (1434.4) MB, 274.8 / 0.0 ms [allocation failure] [GC in old space requested].
14908140 ms: Mark-sweep 1093.4 (1434.4) -> 1093.3 (1434.4) MB, 274.4 / 0.0 ms [allocation failure] [GC in old space requested].
14908421 ms: Mark-sweep 1093.3 (1434.4) -> 1100.5 (1418.4) MB, 280.9 / 0.0 ms [last resort gc].
14908703 ms: Mark-sweep 1100.5 (1418.4) -> 1107.8 (1418.4) MB, 282.1 / 0.0 ms [last resort gc].


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x20d333ad3ba9 
    2: extractFontProgram(aka Type1Parser_extractFontProgram) [pdfjs.js:17004] [pc=0x3a13b275421b] (this=0x3de358283581 ,stream=0x4603fbdc4e1 )
    3: new Type1Font [pdfjs.js:17216] [pc=0x3a13b2752078] (this=0x4603fbdaea9 ,name=0x4603fbd9c09 ,file=0x4603fb...


#
# Fatal error in CALL_AND_RETRY_LAST
# Allocation failed - process out of memory
#

zsh: illegal hardware instruction  d8 run.js

I can very distinctly remember having a sinking feeling when I saw that. Fortunately, when I looked at it more closely, it became clear that the situation wasn’t as bad as I had first feared. Starting from the top of that output, you can see that a number of Octane's benchmarks (from Richards to NavierStokes) ran successfully, before PdfJS caused d8 to fail. A cursory look at the error message shows that PdfJS failed because V8 had run out of memory. I decided to have a quick look at whatever timing outputs I had got (bearing in mind that this was done without Krun, so it's so-so benchmarking) before the crash and plotted them:

[Plot: the in-process iteration timings recorded before the crash.]

That graph looks interesting, and there’s a clear blow-up towards the end, but there’s not quite enough data to see if there’s anything meaningful going on before that. Fortunately, the way that I had made in-process iterations was to bundle several of Octane’s ‘inner’ iterations into longer ‘outer’ in-process iterations. I therefore reran things, this time recording how long each ‘inner’ iteration took to run:

[Plot: timings for each ‘inner’ iteration.]

Now this is starting to look interesting, but the huge spike at the end makes it hard to see the denser detail elsewhere in the plot. I therefore chopped the last 80 or so in-process iterations off and replotted:

[Plot: the same data with the final 80 or so in-process iterations removed.]

At this point the likely source of the error became clear to me. The first half of this (sub-)plot has regular spikes, which grow roughly linearly: I guessed that they were the garbage collector running, which was probably having to do a little more work each time it was run. The second half of this plot shows a complete change in the timing data. I guessed that the garbage collector was probably going into some sort of low-memory mode, where it trades off memory for time, which causes it to run much more often, and to run much more slowly. It continually has to do more work, until – just to the right of this final plot – it explodes with an out of memory exception. If ever I’ve seen a plot which screams “memory leak”, this is it.

At this point, I had a very strong suspicion what the cause was, and a guess that this was more likely to be a bug in PdfJS than V8. But how likely was it that I would find a bug in a benchmark that was 33,053 lines of code in a language which I didn't know? Frankly, it sounded like a lost cause. Still, I thought it would be worth having a look, so I loaded pdfjs.js into my trusty text editor:

To my surprise, the second line of code after the licence was var canvas_logs = [];, which looked a lot to me like an assignment of a mutable array to a global variable. That’s an obvious candidate for problems, but surely far too obvious to be the bug. Still, I thought I’d see where canvas_logs is used, just in case. I scrolled to the first use of the variable in the file:

At line 50, it looked to me like content is being appended to the array, but nowhere in that function is anything being removed from the list. Indeed, there are 3 other references to this variable in the file, 2 of which check the array’s length, and 1 of which accesses elements in the array. Bingo! Within about 90 seconds, I had found the source of the memory leak. Emphasising my cluelessness with JavaScript, it then took me about 10 minutes to work out how to empty an array of its contents. Programming is a humbling job.

The fix to PdfJS’s memory leak is thus simply to add the line canvas_logs.length = 0 to the beginning of the runPdfJs function. However, to my regret, the fix has not been merged into Octane because, some time after I raised the PR, Octane was retired.

A reasonable question to ask is whether the memory leak in PdfJS is really a problem. Presumably not many other people have run the benchmark long enough to see the crash I saw. Besides, a benchmark’s job is to measure a VM doing some work and, whether there’s a memory leak or not, the PdfJS benchmark is certainly doing some work. This latter point is alluring but, I believe, dangerous. Let me give a couple of scenarios to illustrate this point.

First, it is common to add benchmarks to a benchmark suite when one perceives that they tell us something new about the compiler in question. For example, let’s imagine that I realise that my benchmark suite is light on benchmarks which measure changes to my VM’s code generator. I have a look around and, having profiled a couple of its in-process iterations, decide that PdfJS is the perfect way of addressing this weakness in the benchmark suite. Later on, I make a change to the code-generator which I hope optimises programs such as PdfJS. I run PdfJS for a while (after all, I know that I need to think about peak performance) and find out that the performance hasn’t changed very much, causing me to discard my changes to the code-generator. The problem here is that PdfJS isn’t really a test of a code-generator, rather it is a test of the garbage collector in low-memory situations. That’s not an unreasonable thing to benchmark, but it’s not what I expected PdfJS to be benchmarking [10]. My change to the code-generator might really have been an improvement, but it might have been overwhelmed by the time spent in garbage collection by this particular benchmark.

Second, many benchmark suites – including Octane – have a mode which says “run the benchmark for n seconds and then report how many in-process iterations completed in that time”. Let’s assume that PdfJS manages to complete 100 in-process iterations in 10 seconds. I now introduce a new, complex, optimisation into the VM, which makes my benchmark complete 101 in-process iterations in 10 seconds. It looks like I’ve got a 1% speed-up, which might not be enough to justify the additional complexity in the VM. However, if each in-process iteration does more work than its predecessor, the real improvement is higher — potentially much, much higher (see the last few in-process iterations of the graphs earlier in this post).
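
To illustrate that last point with made-up numbers (a toy model, not PdfJS's actual behaviour): suppose the leak means that in-process iteration i costs i units of work, so completing N iterations represents N*(N+1)/2 units. An extra iteration completed in the same time budget then represents a 2% improvement in work processed, not 1%; with the super-linear growth visible at the end of the plots above, the gap would be far larger:

def work_done(iters):
  # Toy cost model: iteration i costs i units, so N iterations cost
  # 1 + 2 + ... + N = N * (N + 1) / 2 units in total.
  return iters * (iters + 1) // 2

before, after = 100, 101  # iterations completed in the fixed time budget
apparent = after / before - 1
actual = work_done(after) / work_done(before) - 1
print("apparent speed-up: %.1f%%" % (100 * apparent))  # 1.0%
print("actual speed-up: %.1f%%" % (100 * actual))      # 2.0%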

In both the above examples, a faulty benchmark like PdfJS can cause us to systematically underappreciate, or exclude, effective optimisations. Although a little less likely, it's also possible to imagine it having the opposite effect (i.e. causing us to include ineffective, or even deleterious, optimisations).

Still, you might well think that I’m banging on a bit too much about a single faulty benchmark. All code has bugs, and it’s not surprising that some benchmarks have bugs too. It’s not unreasonable to suggest that the odd bad benchmark probably won’t lead us too far astray. However, when I looked further at Octane, I found that CodeLoadClosure also has a memory leak [11], and that zlib complains that it “cannot enlarge memory arrays in asm.js” (which might be a memory leak, although I didn’t investigate further). In other words, a non-trivial portion of Octane’s benchmarks appear to be misleading and it's reasonable to assume that, on occasion, they will have misled VM developers. Does this mean that Octane is an unusually terrible benchmark suite? Probably not, but it's certainly made me even more cautious about benchmark suites than I was before.

Summary

Returning back to this post's title: why aren't more users more happy with our VMs? The conclusion I've been forced to is that our benchmarking and our benchmarks have misled us. In a perfect example of Goodhart's Law, we've too often optimised for metrics that aren't representative of what we really care about.

Fortunately, I think it's possible to do a little better without undue effort. Here are my simple suggestions:

  1. Running benchmarks for longer uncovers a surprising number of latent problems. Ultimately, all the odd performance effects we saw in part one of this blog post resulted from running benchmarks for longer than anyone had done before.
  2. We must accept that neither peak performance nor a steady state is guaranteed to occur. Benchmarking methodologies that assume one or both of these concepts are fundamentally broken.
  3. I think it’s intellectually dishonest not to report warmup time. There’s no doubt that warmup time is annoying and that it can paint VMs in an unflattering light. But by pretending it’s a non-issue we’re misleading users about what performance they can expect, and storing up long-term resentment.
  4. We might be able to learn from the machine-learning community and separate out training benchmark suites (which we use to derive the values which drive the various heuristics in VMs) from validation suites (on which we report actual benchmarking numbers).
  5. We need far, far more benchmarks than we currently have, and we have to keep adding benchmarks. In nearly all cases, the more benchmarks we have, the less likely we are to mislead ourselves [12]. My vague guess is that our current benchmark suites are perhaps an order of magnitude too small. But even much larger benchmark suites won’t solve the problem alone: fixed-size benchmark suites become gradually less useful as a VM matures, because the VM becomes more heavily trained towards (or, in a sense, “inoculated against”) the same old benchmarks.
  6. I suspect most users would be happier if we worried more about predictable performance (both making sure that most programs run quite fast, and that programs run at more-or-less the same speed when you rerun them) than we did about peak performance (which currently dominates our thoughts). Put another way, we need to accept that optimisations often come with trade-offs: some appealing optimisations, for example, are highly non-deterministic and may make performance worse almost as often as they make it better.

A big question is whether fixing existing VMs is practical or not. I must admit that the current signs (including some work that I've been involved in) aren't hugely positive in that regard. Most VMs are large (typically hundreds of thousands of lines of code), mature (many people have worked on them, over many years), and complex (squeezing every last bit of juice out of programming language and hardware). Subtle assumptions, both explicit and implicit, are spread throughout and across the codebase — and many of those assumptions have been forgotten (generally, though not exclusively, because people have moved on). Hunting down performance problems thus often devolves into wild stabs in the dark, looking for clues almost at random throughout large, complex codebases. Perhaps unsurprisingly, such hunts fail more often than they succeed.

However, I remain optimistic. First, I think we will start to see a gradual improvement in current VMs. Second, even despite the various problems I’ve detailed, many users are already fairly happy with their VMs: any improvements we make can only make them happier (and, if we’re lucky, make a few unhappy people happy). Speaking very personally, I still think that we don’t really know what the performance ceiling of dynamic compilation is. It is possible that we’re very close to the best performance possible, although my gut feeling is that there’s significant untapped potential. To that end, I’m now working on a new VM research project. It’s far too early to say anything useful about that work yet, and who knows what will happen with funding and the like, but maybe one day it’ll help nudge the field forward a little bit further.

Finally, if you want to repeat the warmup experiment, use Krun, or produce performance stats, then you can: everything’s under an open-source licence, so enjoy!

Acknowledgements: My thanks to Edd Barrett and Carl Friedrich Bolz-Tereick for comments on this blog post. This research was funded by the EPSRC Cooler (EP/K01790X/1) grant and Lecture (EP/L02344X/1) fellowship, and gifts from Cloudflare and Oracle Labs.


Footnotes

[1] There are two main reasons for restricting things to these two VMs: not all VMs make such profiling information available; and the rest tend to have scant, or missing, information on how to access such features. HotSpot is an honourable exception (it’s well documented in this regard), but we only managed to get the numbers we needed out of PyPy because we had someone from the core team who we could prod for advice.
[2] The cost of extra profiling information to record garbage collection data and JIT compilation can be quite significant — certainly significant enough that it can change the apparent performance of the main benchmark. It’s therefore vital that this extra information is recorded in a separate experiment, and doesn’t affect the main experiment.
[3] We couldn't work out what the units for the PyPy profiling information are (they're not documented, and it's not obvious from the source code either): as we’ll soon see, that doesn’t matter much in this case.
[4] Unlike PyPy, the units for the garbage collection and JIT compilation sub-plots in HotSpot are clear and simple: seconds.
[5] Sometimes it can feel like everyone, and in one or two cases even their dog, has tried making a Python implementation. Here's an incomplete list of Python implementations (with thanks to Carl Friedrich Bolz-Tereick for pointing out several that I didn’t know about): CPython; IronPython; Jython; Nuitka; Psyco; PyPy; Pyston; Shed Skin; Stackless; Starkiller; TrufflePython; Unladen Swallow; WPython; Zippy.
[6] For those unfamiliar with this optimisation, consider this (contrived) code:
i = 0
while i < 2:
  f()
  i += 1
This can be “unrolled” (i.e. the loop can be expanded) to the equivalent:
f()
f()
In this case, the unrolled version is smaller and faster (since it has no branches). In more complex examples, unrolling can uncover optimisations that the loop would otherwise obscure.
[7] The situation in a tracer is a little different: when you trace a loop, you effectively copy the trace (i.e. unroll it), with the first iteration “setting things up” for the second iteration. That’s not really equivalent to unrolling in the normal sense, though.
[8] Our aim here was to show that you could use warmup_stats without having to use Krun. We thus deliberately stuck to Octane’s behaviour – shared with many other benchmark suites – of running multiple benchmarks in a single process execution. It’s highly likely that benchmark ordering will have some impact on one’s perception of performance, but I have no idea to what extent. That’s for someone else to look into!
[9] In the end, I had to do a bit more than this, because the default Octane runner is written in a style that I found hard to follow. I ended up writing a new runner that dropped a couple of features, but was massively simpler.
[10] Even worse, PdfJS isn’t even a very good way of benchmarking how a VM performs in a low memory situation, because it doesn't hit that point until quite a long way in.
[11] Since I quickly exceeded the 90 second budget that PdfJS had led me to expect was reasonable for finding such bugs, I very quickly timed out in my search for the cause of the leak.
[12] The only quibble here is that if, somehow, you manage to have a large number of benchmarks which are either identical or extremely similar, you might end up with a skew towards a subset of a VM’s behaviour. Still, relative to where we are today, it’s a problem I’d be willing to risk.