RSS / XML feed

Laurence Tratt's Technical Articles


Fast Enough VMs in Fast Enough Time

February 8 2012

[This is a long article because it covers a lot of ground that is little known by the computing mainstream, and easily misunderstood. If you're familiar with the area, you may find yourself wanting to skip certain explanatory sections.]


If you can stomach the smell, put yourself briefly in the shoes of a programming language designer. What you want to do is create new programming languages, combining new and old ideas into a fresh whole. It sounds like a fun, intellectually demanding job, and occasionally it is. However, we know from experience that languages that exist solely in the mind or on paper are mostly worthless: it is only when they are implemented and we can try them out that we can evaluate them. As well as a language design, therefore, we need a corresponding language implementation.

This need for a corresponding language implementation leads to what I think of as the language designer's dilemma: how much implementation is needed to show that a language design is good?

  • If too little, the language will be dismissed out of hand as unusably slow: the kiss of death for many a language. Especially if the language design is adventurous, the language designer may not even be sure that it can be made adequately efficient.
  • If too much, low-level fiddling will have consumed most of the energy that should have gone into design. Even worse, low-level implementation concerns can easily end up dictating higher-level design choices, to the disadvantage of the latter.
Finding a balance between these two points is difficult. When I was designing Converge, I came down on the side of trying to get the design right (though the end result has more than its fair share of mistakes and hacks). That decision had consequences, as I shall now describe.

Putting VM effort into context

I implemented two Virtual Machines (VMs) for Converge, both in C: the first was epically terrible; the second (introduced in Converge 1.0) merely awful. Even the second is extremely, sometimes almost unusably, slow: it is roughly 5-10x slower than CPython, tending towards the slower end of that range. This partly reflects my lack of experience and knowledge about VM implementation when I started; but it also reflects the fact that the Converge VM was a secondary concern to the language design. Despite that, I estimate that I spent around 18 man months on the second VM (intertwined with a lot of non-VM work). If this seems a lot of effort to spend on such a poor quality VM, it's worth bearing a few things in mind: Converge has a hard to optimise expression evaluation system based on Icon (see this paper for more details); C, whilst a fun language in many ways, doesn't lend itself to the fast writing of reliable programs; and most real-world VMs have had a lot more effort put into them.

As rough comparisons, CPython (the de-facto standard for Python, so named because it is written in C) has probably had a couple of orders of magnitude more effort put into it; and Java's HotSpot roughly three orders of magnitude more. It's not surprising that the second C Converge VM doesn't do well in such a comparison — in comparison, it's not so much a minnow as a single-celled organism.

What all this means is that the second C Converge VM so slow that during demos of some of Converge's advanced features, I learned to make winding up gestures at the side of my laptop to amuse the audience. Even I was not sure whether its combination of features could ever be made adequately efficient.

Other approaches

Why did I write my own VM and not use someone else's? The choices are instructive.

The traditional route for big compilers (e.g. gcc) is to output machine code (perhaps via assembler). The output is efficient, but the compiler itself requires a large amount of effort. For example, simply learning the intricacies of a processor like the x86 sufficiently well to generate efficient code isn't for the faint of heart. In short, the amount of effort this approach demands is generally prohibitive.

An alternative to generating machine code directly is to generate C code and have that compiled into machine code. Several compilers have taken this approach over the years (from Cfront to the original Sather compiler). While still relatively difficult to do, it is certainly much easier than generating machine code directly. However it can often lead to a poor experience for users: not only must they pay the costs of double-compilation, but the translation typically loses a large quantity of useful debugging information (something which Converge pays special attention to). With few exceptions (one of which we'll see later), this approach is now rare.

Perhaps the obvious choice is to use an existing VM. The two Goliath's are the JVM (e.g. HotSpot) and Microsoft's CLR (the .NET VM). When I started work on Converge, the latter (in the form of Mono) didn't run on OpenBSD (my platform of choice), so was immediately discounted. HotSpot, however, remained a possibility because of its often stunning performance levels.

The reason that I couldn't use it isn't really HotSpot specific. Rather, it is something inherent to VMs: they reflect the languages, or group of languages, they were designed for. If a language fits within an existing VMs mould, that VM will probably be an excellent choice; if not, the semantic mismatch between the two can be severe. Of course, given sufficient will-power, any programming language can be translated to any other: in practise, the real issues are the ease of the translation and the efficiency of programs run using it. Two examples at opposite ends of the spectrum highlight this. Jython (Python running on the JVM) is a faithful implementation of Python; but even with the power of HotSpot behind it, Jython almost never exceeds CPython in performance, and is generally slower because some features (mostly relating to Python's highly dynamic customisability) do not lend themselves to an efficient implementation on the JVM. Scala, on the other hand, was designed specifically for the JVM — to ensure reasonable performance, Scala's language design has in some parts had to be compromised (e.g. due to type erasure).

Whether the semantic mismatch is manageable depends on the particular combination of language design and VM. Converge's unusual expression evaluation mechanism was enough on its own to rule out a practical JVM implementation (Jcon, an implementation of Icon for the JVM is still slower than the old C Icon interpreter, which itself is no speed demon). As Converge's development progressed, a number of other features (e.g. its approach to tracing information) have made it increasingly difficult to imagine how it could be practically wedged atop an existing VM.

A language for JITing VMs

What all the above means is that the options for implementing a language are generally unpalatable: they either require an undue amount of work, or compromises in language design, and, too often, both. My suspicion is that this status quo has severely inhibited programming language research: few groups have had sufficient resources to implement unusual languages well enough to prove them usable.

And then I came across PyPy. To be more accurate, after a few years of vaguely hearing of PyPy, 6 months ago I unexpectedly bumped into a PyPy developer who convinced me that PyPy's time has come. After porting PyPy to OpenBSD, I investigated further. What I've come to realise is that PyPy is two separate things:

  1. PyPy is a new VM for Python, which is often substantially faster than CPython. If you're a Python user (and I often am), this is interesting, otherwise it may be of only of minor interest.
  2. RPython is a language for writing VMs. This is of interest to every language designer – current or would-be – and language implementer.
Unfortunately the current literature uses PyPy to cover both, which has confused many a reader (myself included). Henceforth, I shall unambiguously use RPython to refer to the language for writing VMs in, and PyPy for the new Python VM written in RPython.

So, what is RPython? The obvious facts about it are that it is strict subset of Python whose programs are translated to C. Every RPython program is a valid Python program (which can be run using a normal Python interpreter), but not vice versa. However, RPython is suitably restricted to allow meaningful static analysis. Most obviously, static types (with a type system roughly comparable to Java's) are inferred and enforced. In addition, extra analysis is performed e.g. to assure that list indices don't become negative. Users can influence the analysis with assert statements, but otherwise it is fully automatic.

RPython would be of relatively little interest if all it did was subset Python and output C. Though a full programming language, RPython is unlikely to be the next big language for web programming, for example: it's too restricted for the mainstream. Indeed, there is little chance of it becoming a widely used alternative to full Python (I would say the same is also true for similar restricted approaches such as Shed Skin).

However, in addition to outputting optimised C code, RPython automatically creates a second representation of the user's program. Assuming RPython has been used to write a VM for language L, one gets not only a traditional interpreter, but also an optimising Just-In-Time (JIT) compiler for free. In other words, when a program written in L executes on an appropriately written RPython VM, hot loops (i.e. those which are executed frequently) are automatically turned into machine code and executed directly. This is RPython's unique selling point, as I'll now explain.

Traditional VM implementation

Because RPython is unique, it's easy to overlook what's interesting about it: it took me a couple of months of using it before I had built up an accurate understanding. Looking at the traditional approach to VM implementation is perhaps the easiest way to explain what's interesting about RPython.

First, let me make a couple of assumptions explicit. Languages like C are well suited to being translated directly to machine code, as they are little more than a thin layer over machine code: everything that is C about a program is dealt with during compilation. In other words, the C language maintains no run-time infrastructure: once a program is translated into machine code, it's on its own. More complex languages (Java, Python etc.) permit complex run-time interactions and changes. This has many advantages from the programmer's point of view, but makes life difficult for the language implementer: most static optimisations are impossible because the user can often change fundamental behaviour at run-time. The best VMs defer optimisations until run-time, when information about the particular way the program is being executed can be used to dynamically optimise it. The insight here, in part, is that while we can't statically optimise the program for all the possible ways it might be run, we can dynamically optimise it for the way a particular user is running it.

Take a JITing VM such as HotSpot or V8. Such a VM will initially use an interpreter to execute the user's program. Unfortunately for us, interpreter is a vague term, used to mean many different things (and often used as a pejorative): let us assume that it is a slow-ish way of loading in a program and executing it step-by-step (perhaps by loading in a bytecode file or by operating on an Abstract Syntax Tree (AST)), with few, if any, dynamic optimisations. Though slow to execute, interpreters have the advantage that they are easy to write, and have low start-up costs. For code that executes infrequently, they are a reasonable choice.

When, during execution, the interpreter in a VM such as HotSpot or V8 spots a frequently executed piece of code, it will hand that chunk of code off to a JIT compiler (along with information about the context within which that code is used) to convert it into machine code. The JIT compiler (henceforth just the JIT) is entirely separate from the interpreter. In other words, such a VM contains two completely separate implementations of the intended language's semantics. This has several consequences:

  • JITs are much harder to create than interpreters, and involve much more work. Most interested people can write an interpreter; writing a JIT requires much more specialist understanding and time.
  • Tiny differences between the interpreter and JIT (which are almost inevitable in such complex systems) lead to a divergence in a program's observable behaviour. From a programmers point of view, the program behaves one way when interpreted, and another when JITted — unfortunately, it is often extremely difficult to work out which parts of a program have been JITted, or when.
  • Because JITs are much more complex than interpreters, languages with complex run-times are disproportionately hard to create JITs for. In other words, the more complex the language, the harder it is to create a complete JIT, and the more likely that the JIT and interpreter will provide subtly different results. For example, Lua – a relatively small language – has a complete JIT in LuaJIT, which has evolved alongside the traditional interpreter over many years. In contrast, Python – a much more complex language (for better or worse) than Lua – had, until recently, only an incomplete JIT in Psyco, which struggled to match CPython's often rapid evolutions.

Of course, all such problems can be solved with sufficient resources, but these explain in large part why major open source languages like Python and Ruby currently ship with JIT-less VMs (i.e. they have only an interpreter).

JITs for free

What RPython allows one to do is profoundly different to the traditional route. In essence, one writes an interpreter and gets a JIT for free. I suggest rereading that sentence again: it fundamentally changes the economics of language implementation for many of us. To give a rough analogy, it is like moving from manual memory management to automatic garbage collection.

RPython is able to do this because of the particular nature of interpreters. An interpreter, whether it be operating on bytecode or ASTs, is simply a large loop: load the next instruction, perform the associated actions, go back to the beginning of the loop. In order to switch from interpretation to JITing, RPython needs to know when a hot loop has been encountered, in order to generate machine code for that loop and to use it for subsequent executions.

In essence, one need only add two function calls to an RPython program to add a JIT. The first function call (can_enter_jit) is used to inform RPython that a loop in the user program has been encountered, and that it might want to start generating machine code if that loop has been encountered often. The second function call (jit_merge_point) is used to indicate to RPython when it can switch to an existing machine code version of a loop. There are a few other details needed, but altogether less than 10 lines of code do the job. To get the most out of the JIT, extra hints to the JIT and certain programming practices help, but that's more icing on the cake rather than anything fundamental.

Now is a good time to get an idea of how RPython generates a JIT from just an interpreter (what RPython calls the language interpreter). As said earlier, RPython automatically layers alongside C code a second representation of the interpreter (the tracing interpreter). The details of how the tracing interpreter is stored are irrelevant to us, except to note that it's in a form that a JIT can manipulate (conceptually it could be an AST-like structure). RPython's JIT is a tracing JIT. When a hot loop is detected, a marker is left such that the next time the loop is about to run, the JIT will enter tracing mode. During tracing mode, a complete execution of the loop is performed and all the actions it takes are traced (i.e. recorded) using the tracing interpreter (which is much, much slower than the language interpreter). After the loop has finished, the trace is then analysed, optimised, and converted into machine code. All subsequent executions of the loop will then call the machine code version. Since subsequent executions may diverge from the recorded trace, RPython automatically inserts guards into the machine code to detect divergence from the machine code version's capabilities. If a guard fails at any point, execution falls back to the interpreter.

At this point, it's worth taking a brief side-tour to introduce tracing JITs. Although the concept of tracing JITs dates to around 1980, the concept didn't achieve any kind of prominence until around 2006 when Andreas Gal modernised it. It has since been used in several places, most notably in Mozilla's TraceMonkey Javascript VM which was used in Firefox (we'll see later why the past tense is used). Tracing JITs record a specific path of execution in a running program and convert it to machine code. This is quite different to the method-based approach of most JITs and therefore requires explanation.

User's end program Trace when x is set to 6 Optimised trace
if x < 0:
  x = x + 1
else:
  x = x + 2
x = x + 3
guard_type(x, int)
guard_not_less_than(x, 0)
x = int_add(x, 2)
x = int_add(x, 3)
guard_type(x, int)
guard_not_less_than(x, 0)
x = int_add(x, 5)
Figure 1: Example code and high-level traces.

Figure 1 shows a high-level example of a tracing JIT for a dynamically typed Python-esque language. Let us assume that the code in the first column is part of a hot loop that the tracing JIT decides is worth converting into machine code. On the next execution of the loop, the initial value of x is set to 6 (i.e. an integer). The VM then enters the tracing interpreter, completing an iteration of the loop, as well as recording a trace of what happened.

The first thing the tracing JIT will create is a guard to ensure that the generated machine code is only executed if x is an integer on subsequent executions; this allows subsequent operations to be optimised for integers. Note that the specific value of x is not recorded in the trace (if it was 1 or 2 or 383, the resulting trace would be the same; but if it was a negative number the trace would be different), because otherwise the trace would be so narrow as to be almost useless. Next, the condition if x < 0 is false, so the else branch is taken. The trace first records a guard, to ensure that subsequent runs follow the same logic as the trace, and then records the else branch, completely ignoring the first branch. The middle column thus shows the full trace from our simple hot loop. With the trace complete, we can then optimise the trace, compressing the two separate additions of integer constants into one, as shown in the final column. That final trace is then translated into machine code. On subsequent executions of the hot loop, the (fast) machine code version will be executed: if either of the guards is false, the machine code version will exit back to the interpreter.

Optimising traces

While the example in Figure 1 gives a reasonable high-level idea about tracing JITs, it doesn't really explain how the trace is created. RPython badges itself as a meta-tracing system, meaning that the user's end program isn't traced directly (which is what Figure 1 suggests), but rather the interpreter itself is traced. Using the same example code from Figure 1, Figure 2 shows a snippet of the interpreter and the trace of the interpreter that leads to. This trace (though simplified somewhat to make it readable) is indicative of the traces that RPython introduces.

Interpreter Initial trace when x is set to 6
program_counter = 0
stack = []
vars = {...}
while True:
  jit_merge_point(program_counter)
  instr = load_instruction(program_counter)
  if instr == INSTR_VAR_GET:
    stack.push(
     vars[read_var_name_from_instruction()])
    program_counter += 1
  elif instr == INSTR_VAR_SET:
    vars[read_var_name_from_instruction()]
     = stack.pop()
    program_counter += 1
  elif instr == INSTR_INT:
    stack.push(read_int_from_instruction())
    program_counter += 1
  elif instr == INSTR_LESS_THAN:
    rhs = stack.pop()
    lhs = stack.pop()
    if isinstance(lhs, int)
     and isinstance(rhs, int):
      if lhs < rhs:
        stack.push(True)
      else:
        stack.push(False)
    else:
      ...
    program_counter += 1
  elif instr == INSTR_IF:
    result = stack.pop()
    if result == True:
      program_counter += 1
    else:
      program_counter +=
        read_jump_from_if_instruction()
  elif instr == INSTR_ADD:
    lhs = stack.pop()
    rhs = stack.pop()
    if isinstance(lhs, int)
     and isinstance(rhs, int):
      stack.push(lhs + rhs)
    else:
      ...
    program_counter += 1
v0 = program_counter
v1 = stack
v2 = vars
v3 = load_instruction(v0)
guard_eq(v3, INSTR_VAR_GET)
v4 = dict_get(v2, "x")
list_append(v1, v4)
v5 = add(v0, 1)
v6 = load_instruction(v5)
guard_eq(v6, INSTR_INT)
list_append(v1, 0)
v7 = add(v5, 1)
v8 = load_instruction(v7)
guard_eq(v8, INSTR_LESS_THAN)
v9 = list_pop(v1)
v10 = list_pop(v1)
guard_type(v9, int)
guard_type(v10, int)
guard_not_less_than(v9, v10)
list_append(v1, False)
v11 = add(v7, 1)
v12 = load_instruction(v11)
guard_eq(v12, INSTR_IF)
v13 = list_pop(v1)
guard_false(v13)
v14 = add(v11, 2)
v15 = load_instruction(v14)
guard_eq(v15, INSTR_VAR_GET)
v16 = dict_get(v2, "x")
list_append(v1, v16)
v17 = add(v14, 1)
v18 = load_instruction(v17)
guard_eq(v18, INSTR_INT)
list_append(v1, 2)
v19 = add(v17, 1)
v20 = load_instruction(v19)
guard_eq(v20, INSTR_ADD)
v21 = list_pop(v1)
v22 = list_pop(v1)
guard_type(v21, int)
guard_type(v22, int)
v23 = add(v22, v21)
list_append(v1, v23)
v24 = add(v19, 1)
v25 = load_instruction(v24)
guard_eq(v25, INSTR_VAR_SET)
v26 = list_pop(v1)
dict_set(v2, "x", v26)
v27 = add(v24, 1)
v28 = load_instruction(v27)
guard_eq(v28, INSTR_VAR_GET)
v29 = dict_get(v2, "x")
list_append(v1, v29)
v30 = add(v27, 1)
v31 = load_instruction(v30)
guard_eq(v31, INSTR_INT)
list_append(v1, 3)
v32 = add(v30, 1)
v33 = load_instruction(v32)
guard_eq(v33, INSTR_ADD)
v34 = list_pop(v1)
v35 = list_pop(v1)
guard_type(v34, int)
guard_type(v35, int)
v36 = add(v35, v34)
list_append(v1, v36)
v37 = add(v32, 1)
v38 = load_instruction(v37)
guard_eq(v38, INSTR_VAR_SET)
v39 = list_pop(v1)
dict_set(v2, "x", v39)
v40 = add(v37, 1)
Figure 2: An interpreter fragment and a full trace of that interpreter for the end-user program from Figure 1.
Hopefully the interpreter code in Figure 2 is mostly self-explanatory, being a simple-minded stack-based interpreter. One thing that needs explanation is the jit_merge_point call: this is the point where the interpreter tells RPython if you have any machine code to execute for the program beginning at position program_counter, here's where to start it.

Initially the trace might seem rather difficult to read, but if you think of it as a flattened record of all of the interpreters actions while executing the user's code, it becomes rather easier. The astute reader will notice that the traces are in Single Static Assignment (SSA) form: all assignments are to previously unused variables. While one would probably not want to write in this style, it has many advantages for optimisers, because it trivially exposes the data flow. In normal programs, we are often unable to determine whether a variable x may be assigned to at later point (because, for example, one branch of an if assigns to it, but the other doesn't): in SSA form we can know for sure.

With this knowledge, it should hopefully be fairly simple to see that the trace on the right is simply a record of the instructions the interpreter on the left performed while executing our example program. It's worth thinking about this relationship, as it's key to RPython's approach.

Once one has a handle on the trace, the sheer size of it should be a concern: there's a lot of stuff in there, and if it was converted to machine code as-is, one would get disappointing performance gains (around 40%). It is at this point that RPython's trace optimiser kicks in. I'll now try and give an idea of what RPython's trace optimiser can do.

The first thing that's obviously pointless about the above is the continual reading of bytecode instructions. If we start from a specific point in the program, and all the guards succeed, we know that the sequence of instructions read will always be the same in a given trace: checking that we've really got an INSTR_VAR_GET followed by a INSTR_INT is pointless. Fortunately the trace optimiser can be influenced by the user. One thing the trace optimiser knows is that because program_counter is passed to jit_merge_point as a way of identifying the current position in the user's end program, any calculations based on it must be constant. These are thus easily optimised away, leaving the trace looking as in Figure 3.

Constant folded trace
v1 = stack
v2 = vars
v4 = dict_get(v2, "x")
list_append(v1, v4)
list_append(v1, 0)
v9 = list_pop(v1)
v10 = list_pop(v1)
guard_type(v9, int)
guard_type(v10, int)
guard_not_less_than(v9, v10)
list_append(v1, False)
v13 = list_pop(v1)
guard_false(v13)
v16 = dict_get(v2, "x")
list_append(v1, v16)
list_append(v1, 2)
v21 = list_pop(v1)
v22 = list_pop(v1)
guard_type(v21, int)
guard_type(v22, int)
v23 = add(v22, v21)
list_append(v1, v23)
v26 = list_pop(v1)
dict_set(v2, "x", v26)
v29 = dict_get(v2, "x")
list_append(v1, v29)
list_append(v1, 3)
v34 = list_pop(v1)
v35 = list_pop(v1)
guard_type(v34, int)
guard_type(v35, int)
v36 = add(v35, v34)
list_append(v1, v36)
v39 = list_pop(v1)
dict_set(v2, "x", v39)
Figure 3: The trace with calculations related to the program counter optimised away.
Our trace has now become a fair bit of smaller, but we need it to get a lot smaller still if we want good performance. Fortunately the SSA form of the trace now comes to the fore. We can follow the flow of operations on a given list l: if an append of an object on a list l is followed by a pop without any other operations on l in the interim, we can remove both calls. We can also do something similar for the dictionary stores and lookups in the middle of the trace: since they're not used until the end, storing intermediate values is pointless. This is a bigger win, as dictionary lookups are much more expensive than list lookups. The resulting optimisations are shown in Figure 4.
List folded trace Dict folded trace
v1 = stack
v2 = vars
v4 = dict_get(v2, "x")
guard_type(v4, int)
guard_not_less_than(v4, 0)
v16 = dict_get(v2, "x")
guard_type(v16, int)
v23 = add(v16, 2)
dict_set(v2, "x", v23)
v29 = dict_get(v2, "x")
guard_type(v29, int)
v36 = add(v29, 3)
dict_set(v2, "x", v36)
v1 = stack
v2 = vars
v4 = dict_get(v2, "x")
guard_type(v4, int)
guard_not_less_than(v4, 0)
v23 = add(v4, 2)
guard_type(v23, int)
v36 = add(v23, 3)
dict_set(v2, "x", v36)
Figure 4: The trace with list and dictionary optimisations folded away.
Our trace is now looking much smaller, but we can still make two further, easy optimisations. First, we know that if v4 is an int then it will still be an int after addition: we can thus remove the second type check. Once that is done, we can easily collapse the two additions of constants (since x + 2 + 3x + 5). Figure 5 shows both optimisations.
Type folded trace Addition folded trace
v1 = stack
v2 = vars
v4 = dict_get(v2, "x")
guard_type(v4, int)
guard_not_less_than(v4, 0)
v23 = add(v4, 2)
v36 = add(v23, 3)
dict_set(v2, "x", v36)
v1 = stack
v2 = vars
v4 = dict_get(v2, "x")
guard_type(v4, int)
guard_not_less_than(v4, 0)
v23 = add(v4, 5)
dict_set(v2, "x", v23)
Figure 5: The trace with type checks and constant additions folded away.
At last, we have a highly optimised trace which is suitable for conversion to machine code. Not only is it much smaller than the original trace, but it contains far fewer complicated, slow function calls: the resulting machine code will run massively faster than the original interpreter. Trying to produce small traces like this is one of the skills of writing a VM in a tracing JIT. The above example should give you a flavour of how this is done in RPython, though there are many low-level details that can make doing so difficult. As we shall see later, we often need to help the JIT to produce small traces.

A new Converge VM

After becoming intrigued by the possibilities of RPython, I decided to use it to implement a new VM for Converge. To allow an apples-to-apples comparison, my initial goal was to maintain 100% compatibility with the old VM, so that the same bytecode could run on both the C and RPython versions of the VM. That goal wasn't quite met, although I came extremely close: for quite some time, bytecode for the RPython VM could be run on the C VM (but not vice versa).

First, it's probably useful to give some simple stats about the C VM that I was aiming to replace. It is about 13KLoC (thousand lines of code; I exclude blank lines and purely commented lines from this count). It contains a simple mark-and-sweep garbage collector that is mostly accurate but conservatively collects the C stack (so that VM code doesn't need be clever when dealing with references to objects). It implements full continuations at the C level, copying the C stack to the heap and back again as necessary (this turned out to be much more portable than I had initially feared), so that code at both the VM and Converge program level can be written in similar(ish) style. The VM is ported to a variety of 32 and 64 bit little endian systems: OpenBSD, Linux, OS X, Cygwin, and (native binary) Windows. Overall, the VM works, but is very slow, and, in parts, rather hard to understand. There are no precise records to determine the effort put into it, but I estimate it took between 12 and 24 man months — let's call it 18 man months.

I started work on the new VM on September 1st 2011. Before September 1st I had never used RPython, nor had anyone (to the best of my knowledge) outside the core PyPy group used RPython for a VM of this size (at the time, the Happy VM, which implements a subset of PHP, was the closest comparison). Though the Converge VM is obviously not a big VM, it is beyond merely a toy. It also has some unusual aspects (as touched upon earlier in this article) that make it an interesting test for RPython.

By December 19th 2011 I had a feature-compatible version of the Converge VM, in the sense that it could run every Converge program I could lay my hands on (which, admittedly, is not a huge number). After an initially slow period of development, mostly because of my unfamiliarity with RPython, progress became rapid towards the end. The resulting VM is about 5.5KLoC (compared to 13KLoC for the C VM). I estimate I was able to dedicate around half of my time to the VM during those 4 months (I started a new job on September 1st and then taught a course on a largely unfamiliar subject).

Although the two time estimates (18 man months for the C VM vs. 2-3 man months for the RPython VM) aren't fully comparable, they are useful. While many parts of the RPython VM were a simple translation from the C VM, that itself was partially a reimplementation of a previous C VM (though to a lesser extent). The RPython's VM structure is also substantially different than the C VM (it's far cleaner, and easier to understand), so some aspects of the translation were hardly simple. My best guess is that moving from C (a language which I enjoy a great deal, despite its flaws) to RPython was the single biggest factor. If nothing else, large amounts of the C VM involve faffing about with memory resizing; RPython, as a fully garbage collected language, sweeps all that under the carpet.

Status of the VM

The new Converge VM's source is freely downloadable as are binaries for most major platforms (other than Windows, at the time of writing). Eventually this VM will form part of a Converge 2.0 release, although more testing will be needed before it's reached that point. Before you form your opinions about the new VM, it's worth knowing what it is and isn't. It's not meant to be an industrial strength VM, at least not in its current form. Converge is a language for exploring compile-time meta-programming and domain specific languages. Some things which mainstream programming languages need to care greatly about (e.g. overflow checking; Unicode) are partly or wholly ignored. Such matters are a problem for a later day.

It's also worth knowing that I haven't spent a huge amount of time optimising the new VM. As soon as it got to fast enough speed, I was happy. Whereas before I often had to explain away Converge's slow performance, it's now sufficient for the experiments I and others want to do in Converge. It's worth exploring that performance in more detail.

Performance

So, the RPython VM was created in roughly 1/6 the time it took to create the C VM. What is the performance like? This section will try and give a flavour of the performance, though please note that it's not totally scientific.

The December 19th version of Converge (git hash 84bb9d6064 if you wish to look at it) was already usefully faster than the old C VM. One of my simple benchmarks has long been to time make regress in the Converge system (excluding the time to build the VM itself). The December 19th version of the VM isn't directly comparable to the old C VM (it has a little more Converge code, and the parsing algorithm, which occupies a lot of the execution time, is slightly different), but it's close enough to be useful. The following timings are from an Ubuntu server with a 3.33GHz Xeon (I'll explain why Linux later); since Linux support was added a day later, it's necessary to checkout a December 20th version for this test (git hash 00e290ccbb). The C VM runs make regress in 67s; the RPython VM (translated with --opt=3 since the JIT wasn't functional at that point) in 32s.

Looking at output from gprof on the two VMs quickly shows up 2 main reasons for this speedup: first, RPython has an infinitely better garbage collector (with obvious consequences); second, RPython has much better dictionaries (also known as hash tables). Given that dynamically typed OO languages like Converge spend an awful lot of time looking up slot names, optimised dictionaries can play an unexpected part in things.

make regress gives a good idea of the lower bound of performance, since it is torture for a JIT. It has a large body of code which runs for just enough time for the JIT to warm up (i.e. to have identified some hot spots, traced them, and converted them into machine code), but not enough time to benefit from its work (as we shall see later, the compiler is particularly punishing for a tracing JIT). Since December 19th, the old and new system have diverged sufficiently that an accurate comparison of make regress is now hard to make. That said, the system has continued to get faster, and even on the hard case of the compiler, it gives a 2-3x speed-up on the old VM, which is rather useful.

How does it perform on more general types of code? In one sense, this is an impossible question, because no two people share the same definition of general. It's even harder when a JIT is involved, as they often give surprising performance results: given two seemingly similar programs, it's quite common for one to perform substantially better with a JIT than the other. So, as a proxy, I'll use a few simple benchmarks — the limitations of this approach are obvious, but it's better than nothing.

I'll start with the stone benchmark. This simple benchmark started life as an Ada program, but we'll take as our starting point the Python version. The Converge version is a straight-forward translation. The old VM lacks certain timing functionality, so the timings for Converge 1.2 are ever so slightly off (being a few ms higher than they should really be): as you'll soon see, such a small difference doesn't make much difference in the overall scheme of things. By default, stone will perform 50000 iterations of the benchmark. To show how JIT warmup times can effect things, I've also included timings for 500000 runs. All timings are on the Xeon machine described earlier; all tests were run 3 times and the best figure taken.

VM stone (50000) stone (500000)
CPython 2.7.2 0.45s 4.50s
PyPy 1.7 0.07s 0.28s
Converge 1.2 2.39s 24.1s
Converge-current 0.14s 0.44s
Figure 6: The Stone benchmark.
Although stone is venerable, it's also so small and artificial that it doesn't necessarily tell us very much. The Richards benchmark is a much more interesting benchmark that models task dispatching (Mario Wolczko explains Richards in depth and also provides various other language implementations). By default it performs 10 iterations; I've also included timings for 100 iterations. Figure 7 shows the timings.
VM Richards (10) Richards (100)
CPython 2.7.2 2.00s 19.9s
PyPy 1.7 0.34s 0.79s
Converge 1.2 12.4s 126s
Converge-current 0.93s 4.9s
Figure 7: The Richards benchmark.
One thing worthy of note in Figure 7 is the better all-round performance of PyPy compared to Converge: it is substantially faster.

As a final benchmark, and an example of something which programmers need to do frequently, I chose something which neither the old or new Converge VM has had any sort of optimisations for: sorting. Part of the reason why I expected them to do badly is that neither optimises list accesses. x[0] in Converge is translated to x.get(0), whereas many other VMs (including CPython and PyPy) special-case this common operation. 100000 random strings were placed into a file and sorted. The same sorting algorithm is used in both the Python version and Converge version (indeed, I translated the Converge version into Python to ensure a fair comparison). Figure 8 shows the timings.

VM sorting (100000) sorting (1000000)
CPython 2.7.2 1.38s 17.3s
PyPy 1.7 0.22s 3.17s
Converge 1.2 13.40s 678s
Converge current 0.42s 4.23s
Figure 8: The sorting benchmark.
The terrible performance of the old Converge VM in Figure 8 surprised even me. The most likely explanation is that the large number of elements overloads the garbage collector: at a certain point, it can overflow its stack, and performance then degrades non-linearly. I was also surprised by the PyPy figures, with a larger than expected slowdown on the larger number of elements. This appears to be fixed in the nightly PyPy build I downloaded (which is very close to what will be PyPy 1.8): the timings were 0.21s and 2.22s for the small and large datasets respectively.

Although this section has contained a number of very hard figures, one should be careful about making strong claims about performance from such a small set of benchmarks. My gut feeling is that they are over-generous to the new Converge VM, mostly because there are many areas in the VM which have received no optimisation attention at all: if one of those was used repeatedly, performance would suffer disproportionately. I suspect that, rather than appearing to be much faster than CPython 2.7.2 (as above), its performance on a wider set of benchmarks would probably be on a more even par. Even so, that would still be a huge improvement on the old VM. The interesting thing is that most of the performance gains are from RPython: I only made a few relatively easy changes to increase performance, as we shall now see.

Optimising an RPython JIT

Some RPython VMs lead to a much more efficient JIT than others. The trace optimiser, while clever, is not magic and certain idioms prevent it working to its full potential. The early versions of the Converge VM were naive and JIT-unfriendly: interestingly, I found that a surprisingly small number of tactics hugely improved the JIT.

The first tactic is to remove as many instances of arbitrarily resizable lists as possible. The JIT can never be sure when appending an item to such a list might require a resize, and is thus forced to add (opaque) calls to internal list operations to deal with this possibility. Such calls prevent many optimisations from being applicable (and are relatively slow). When this was first pointed out to me, I was horrified: my RPython VM was fairly sizeable and used such lists extensively. Most noticeably, the Converge stack was a global, resizable list. After a little bit of thought, I realised that it's possible to statically calculate how much stack space each Converge function requires (this patch started the ball rolling). I was then able to move from a global resizable stack to a fixed-size stack per function frame (i.e. the frame created upon each function call; these are called continuation frames in Converge, though that need not concern us here). At this point, the relative ease of developing in a fairly high-level language became obvious. If I had tried to do such a far-reaching change in the C VM, it would have taken at least a week to do. In RPython, it took less than a day.

Some other arbitrarily resizable lists took a little more thought. After a while it became clear that, even though each function now had its own fixed-size stack, the global stack of function frames, stored of course in a resizable array, was becoming a bottleneck. That seemed hard to fix: unlike the stack size needed by a function frame, there is no way to statically determine how deeply function calls might nest. A simple solution soon presented itself: having each function frame store a pointer to its parent removed the need for a list of function frames (see this patch).

The second tactic is to tell the JIT when it doesn't need to include a calculation in a trace at all. The basic idea here is that when creating a trace, we often know that certain pieces of information are fairly unlikely to change in that context. We can then tell the JIT that these are constant for that trace: it will insert an appropriate guard to ensure that is true, often allowing subsequent calculations to be optimised away. The use of the word constant here can mislead: it's not a static constant in the sense that it's fixed at compile-time. Rather, it is a dynamic value that, in the context of a particular trace, is unlikely to change. Promoting values and eliding functions are the main tools in this context: Carl Friedrich Bolz described examples in a series of blog posts. The new Converge VM, for examples, uses maps (which date back to Self), much as outlined by Carl Friedrich.

The interesting thing is that I haven't really spent that long optimising the Converge JIT: perhaps a man week in the early days (when I was trying to get a high level picture of RPython) and around two man weeks more recently. As a rough metric, I found that each JIT optimisation I was doing was giving me a roughly 5-10% speedup (though the per-function stack change was much more profitable): the cumulative effect was quite pronounced. Admittedly, I now suspect that I've now picked most of the low-hanging fruit; improving performance further will require increasingly more drastic action (much of it in the Converge compiler, which is likely to prove rather harder to change than the VM). Fortunately, I'm adequately happy with performance as it is.

The end result of these optimisations is that the traces produced by the Converge VM are often very efficient: see this optimised trace (randomly chosen — I'm not even sure what piece of code it represents). What's astonishing is that between promoting values, eliding function calls, and optimising traces, many bytecodes now have little or no code attached to them.

What particularly amazes me is how one of Converge's most crippling features from an efficiency point of view, is now handled. Failure is how an Icon-base expression evaluation system allows limited backtracking. In my paper on Converge's Icon inheritance, I noted that the ADD_FAILURE_FRAME and REMOVE_FAILURE_FRAME instructions (which surround nearly every Converge expression) accounted for around 25-30% of all executed opcodes. What's more annoying is that the vast majority of the time, failure frames are created and discarded without being used. My suggestion at the time was that maybe a register-based VM (in the sense of modern Lua non-JITing VMs) might be able to lower this cost somewhat. What I hadn't anticipated was a tracing JIT. If you compare the unoptimised trace to the optimised, you'll notice that the former has lots of code in ADD_FAILURE_FRAME / REMOVE_FAILURE_FRAME instructions, while in the latter such code has almost entirely disappeared. In large part this explains why the optimised trace is 4 times smaller than the unoptimised trace. The best thing of all is that RPython's trace optimiser does this for free: I didn't raise a single finger to make it happen. In one fell swoop, RPython has given Converge the fastest VM for Icon-esque evaluation ever created.

Tracing JIT issues

Tracing JITs are relatively new and have some limitations, at least based on what we currently know. Mozilla, for example, removed their tracing JIT a few months back, because while it's sometimes blazingly fast, it's sometimes rather slow. This is due to a tracing JIT optimising a single code-path at a time: if a guard fails, execution falls back to the (very slow) tracing interpreter for the remainder of that bytecode (which could be quite long), and then back to the language interpreter for subsequent bytecodes. Code which tends to take the same path time after time benefits hugely from tracing; code which tends to branch unpredictably can take considerable time to derive noticeable benefits from the JIT.

The real issue is that we have no way of knowing which code is likely to branch unpredictably until it actually does so. A real program which does this is the Converge compiler: in several points it walks over an AST, calling a function _preorder(x) which, using reflection, dispatches to a function which can handle that AST type (see e.g. the AST well formedness checker for an example of this idiom). Though it makes perfect sense to us, from the point of view of a tracer, _preorder(x) branches completely unpredictably.

User's end program Trace without inlining Trace with inlining
def f(x):
    return 2 + g(x) + 3

def g(x):
    if x < 0:
        return 0
    else:
        return 1
return 2 + g(x) + 3
guard_not_less_than(x, 0)
return 2 + 1 + 3
Figure 9: Inlining code with a call f(6).

In my limited experience, the inherent ability of a tracing JIT to inline code can exacerbate this issue. Consider the simple program in Figure 9. If the tracing JIT disables inlining, the trace looks as in the middle column: the call to g remains as is. As we saw earlier, raw function calls are not only expensive but their opaqueness prevents the trace optimiser from performing much of its magic. By default, therefore, a tracing JIT will inline such functions, resulting in the trace found in the right-hand column. In this case, the inlined version will typically be much faster since the guard check is extremely quick and much of the other machinery surrounding function calling can be optimised away.

However, while inlining is generally a big win, it can sometimes be a big loss. This is chiefly due to the fact that inlining leads to significantly longer traces. Traces are slow to create (due to the tracing interpreter), so the longer we trace, the greater the overhead we impose on the program. If the trace is later used frequently, and in the exact manner it was recorded, the relative cost of the overhead will reduce over time. If the trace is little used, or if guards fail regularly within it, the overhead can easily outweigh the gains. Unfortunately, there's no obvious way to predict when inlining will be a win or loss, because we can't see into a program's future execution patterns. As a heuristic to somewhat counter this problem, RPython has an (adjustable) upper limit to traces: if too long, they are aborted, and the subsequent trace will turn inlining off. This helps somewhat, but what a good value for too long might be isn't obvious to me; it's likely to vary from language to language, and even program to program. Indeed, in an experiment on a mixed tracing / method-based Java VM, Inoue et al. found that long traces are, overall, a win.

With luck, future research will start to whittle away at tracing JITs weaknesses. However, it seems likely that, in the medium term at least, most hand-crafted VMs will remain method-based (referred to hereon as method JITs). That is, when a method is called repeatedly, the entire method is converted to machine code. Where tracing JITs are optimistic – recording a particular path, removing branching possibilities, and assuming that it will mostly be used in the same way on subsequent executions – method JITs are more pessimistic – converting an entire method to machine code, with most of its branches left (though a few trace-like optimisations are likely to be performed, this isn't that important from our point of view). Although a tracing JIT can often beat a method JIT, the latter's performance is much more consistent. That said, as PyPy shows, the performance of a tracing JIT isn't bad in the general scheme of things. Furthermore, it's not obvious to me how an RPython for method JITs system could be created: tracing JITs seem to me to be much better suited to automatic creation.

How fast can it go?

Something that's currently unclear is how fast one can reasonably expect an RPython VM to go. The best guide we currently have to the achievable speed of RPython VMs is PyPy itself. Although it seems that most of the easy wins have now been applied, it's still getting faster (albeit the rate of gains is slowing down), and, more importantly, is increasingly giving good performance for a range of real programs. The PyPy speed centre is an instructive read. At the time of writing PyPy is a bit over 5 times faster than CPython for a collection of real-world programs; for micro-benchmarks it can be a couple of orders of magnitude quicker.

It's clear that, in general, an RPython VM won't reach the performance of something like HotSpot, which has several advantages: the overall better performance of method-based JITs; the fact that it's hand-coded for one specific class of languages; and the sheer amount of effort put into it. But I'd certainly expect RPython VMs to get to comfortably within an order of magnitude performance levels as HotSpot. Time will tell, and as people write RPython VMs for languages like Java, we'll have better points of comparison.

RPython issues

From what I've written above, I hope you get some sense of how interesting and exciting RPython is for the language design and implementation community. I also hope you get a sense of how impressed I am with RPython. Because of that, I feel able to be frank and honest about the limitations and shortcomings of the approach.

The major problem anyone creating a VM in RPython currently faces is documentation or, more accurately, a lack of it. There are various papers and fragments of documentation on RPython, but they're not yet pulled together into a coherent whole. New users will struggle to find either a coherent high-level description or descriptions of vital low-level details. Indeed, the lack of documentation is currently enough to scare off all but the most dedicated of language enthusiasts. As such a dedicated enthusiast, I got a long way with grep, but as my VM got more sophisticated, I had to resort to the IRC channel (something I've never needed to do before) to check the semantics of low-level details such as eliding, for which guesswork is simply too dangerous.

Part of the problem probably stems from the strict adherence of the PyPy / RPython development process to Test Driven Development (TDD). PyPy / RPython has roughly the same amount of code for the main system as for the tests. Although this would not have been my personal choice, it appears to have served the PyPy / RPython development process extremely well. It's impossible not to admire the astonishing evolution of the project and the concepts it has developed; TDD must have played an important part in this. Indeed, although the Converge VM has inevitably uncovered a few bugs in PyPy, they have been surprisingly few and far between. Unfortunately, the prioritisation of tests seems to have been at the expense of documentation. As I rapidly discovered – to the initial bemusement of the RPython developers – tests are a poor substitute for documentation, typically combining a wealth of low-level detail with a lack of any obvious high-level intent. In short, with so many tests, it's often impossible to work out what is really being tested or why. I certainly struggled with this lack of documentation: I suspect I could have shaved at least a third off of my development time if RPython was as well documented as other language projects. Fortunately the PyPy chaps are aware of this, and there are now open issues to resolve this, and I hope my experiences will feed in to that.

As alluded to above, PyPy / RPython have, since their original conception, changed to a degree unmatched by any other project I can think of. The PyPy project started off as an implement Python in Python project to allow people to easily get their heads around the nuts-and-bolts details of a Python VM. Only gradually, over the years, did it turn into an implement a fast Python VM project, with several dead-ends on the route (including an attempt to do traditional partial evaluation). As the desire to implement a faster Python VM grew, the need to define a smaller language became clearer, hence the gradual appearance of RPython.

The Python part of RPython's name has consequences. In essence, RPython is a statically typed language roughly akin to Java; static types allow RPython to statically optimise an RPython program in a ways that would be impossible for full Python. RPython relies almost exclusively on type inference for its static types, with RPython programs rarely making them explicit (far less often than in, say, a typical Haskell program). This means that type errors, when they do occur, can result in inscrutable error messages (as is typical with type inference). While RPython's error messages have become better in recent months, they are still obtuse; identifying the real cause is harder than in most other type inferred languages. It took quite some time before I felt confident writing more than a few lines at a time. If RPython had more explicit static typing, the scope over which the type inferencer would operate would be smaller, and errors probably less baffling.

The massive evolution PyPy / RPython have taken also has implications for the implementation. In short, RPython does not just have an experimental past, it has an experimental present. The translator is littered with a vast number of assertions, many of which can be triggered by seemingly valid user programs. These are often hard to resolve: one is left wondering what a particular assertion has to do with the program that triggered it. Occasionally, if one is really lucky, an assertion has an associated comment which says something like if you got here, it's probably because of X. As RPython matures, I expect the number of occasions such assertions to be triggered to diminish, but I suspect they are likely to maintain a pest for some time to come.

Because every RPython program is also a valid Python program, RPython VMs can also be run using CPython or PyPy — run untranslated in the RPython lingo. From my point of view this facility has been of little use, because running a VM this way is slow and memory intensive. Unfortunately, despite the drawbacks of untranslated execution, it is sometimes less worse than the alternative, as we shall see. Most noticeably, PyPy's extensive test suite is run untranslated.

The alternative is full translation. RPython is a whole program translator. It slurps in a program, statically analyses the whole thing afresh, before converting it to C. Unfortunately, because of the quantity of work it has to do, the translator is extremely slow. On a fairly fast machine, translation of Converge takes 3 or 4 minutes with --opt=3; that time more than doubles with --opt=jit. PyPy, on a fast machine, can easily take 45-60 minutes to to translate with --opt=jit. Any change, no matter how small, requires a full rerun of the translator. Frankly, I had forgotten what it was like to have such a slow edit-compile-run cycle: it makes experimentation, particularly for those unfamiliar with RPython, extremely painful. At some points, when a bug was far too deep in a VM run to be reachable by running untranslated, I almost tore my hair out due to frequent retranslations. On the plus side, the down-time translation induces all but rules out repetitive strain injuries.

Whole program translation may seem an odd decision, but it has a vital use: RPython uses Python as its compile-time meta-programming language. Basically, the RPython translator loads in an RPython VM and executes it as a normal Python program for as long as it chooses. Once that has finished, the translator is given a reference to the entry point of the RPython VM, and it is from that point that translation occurs. Everything that is referenceable from the entry point must be RPython enough (this vague term is from the current documentation) to be translatable; things not reachable are ignored (and may use arbitrary Python features). What this means is that one is able to use normal Python to make decisions about e.g. portability which could not be deferred until actual translation time. The ability to do some sort of pre-translation meta-programming is absolutely necessary for software that needs to be customisable and portable. However, the fact that most VM files are in fact mixed Python and RPython programs is a headache. Some functions in the normal Python libraries are RPython compatible; some aren't; some times RPython alternatives are provided; sometimes they aren't. This mixing and matching between the two is arbitrary, confusing, and not, I suspect, resolvable.

As a final matter, RPython is not just restricted to generating C: at various points it has also had JVM, CLR, and LLVM backends (though, at the time of writing, none of these is currently usable). RPython has thus tried to create its own type system to abstract away the details of these backends, not entirely successfully. This is not a fault unique to RPython. As anyone who's tried porting a C program to a number of platforms will attest, there is no simple set of integer types which works across platforms. Unfortunately, RPython's history of multiple backends and only semi-successful attempts to abstract away low-level type systems means that it has at least 5 type systems for various parts (some of which, admittedly, are hidden from the user). Not only does each have different rules, but the most common combination (rffi and rarithmetic) has different rules for types of the same name. The precise relationship between the varying type systems remains a mystery to me. I suspect that the current Converge VM does not use appropriate types in some places because of this.

The future

RPython, to my mind, is an astonishing project. It has, almost single-handedly, opened up an entirely new approach to VM implementation. As my experience shows, creating a decent RPython VM is not a huge amount of work (despite some frustrations). In short: never again do new languages need come with unusably slow VMs. That the the PyPy / RPython team have shown that these ideas scale up to a fast implementation of a large, real-world language (Python) is another feather in their cap.

An important question is whether the approach that RPython takes is so unique that it is the only possible tool one can imagine using for the job. As my experience with RPython has grown, the answer is clearly no, because RPython is not magic. In other words, it is the first member of a new class of tool, but I do not expect it to be the last member of that class. If nothing else, RPython probably isn't the ideal language for such purposes, as I showed in the previous section. My best guess is that a new Java-like language with compile-time meta-programming (as found in Converge, as it so happens) might be more appropriate, but I could well be wrong. In the meantime, there is no reason not to embrace RPython — it works and it's here, right now.

If you've got this far, congratulations: it's been a long read, I know! This article is so long because its subject is so worthy. I am a curmudgeon and I find most new developments in software to be thoroughly uninteresting. RPython is different. It's the most interesting thing I've seen in well over a decade. Exactly what its ramifications will be is something that only time can tell, but I think they will be two fold. First, I think new languages will suddenly find themselves able to compete well enough with existing languages that they will be given a chance: I hope this will encourage language designers to experiment more than they have previously felt able. Second, the one or two VMs should be enough for all purposes mindset is now severely challenged: RPython shows that it's possible to create a custom VM for a language which substantially outperforms mashing it atop an existing VM. In summary, the future for programming languages has just got a lot brighter: the language designer's dilemma is no more.

Acknowledgements: The RPython developers have been consistently helpful and the new VM wouldn't have got this far without valuable help from Carl Friedrich Bolz and Armin Rigo in particular: Maciej Fijalkowski and others on the PyPy IRC channel have also been extremely helpful. Martin Berger, Carl Friedrich Bolz, and Armin Rigo also gave insightful comments on this article. Any remaining errors and infelicities are, of course, my own.

Link to this entry


Problems with Software 3: Creating Crises Where There Aren't Any

June 28 2011

[This is number 3 in an occasional series ‘Problems with Software’; read the previous article here.]

As an undergraduate in the late 1990s, the part of my course I liked the least was Software Engineering, which started with obviously silly 1970s practices, and went downhill from there. As someone who was already familiar with both programming and, to some extent, the software industry, I pooh-poohed the course; my arrogance was duly rewarded with the lowest mark of any part of my degree, which taught me a useful life lesson.

One previously unfamiliar concept which this dreadful course introduced to me was the ‘software crisis’. First promulgated in the late 1960s and early 1970s, it captured the notion that software was nearly always over-budget, behind schedule, and of low quality. Those with an appreciation of history will know that most of the software technologies (programming languages, operating systems, and so on) we now take for granted took form around this time. As with any new field, this was an exciting time, with no technical precedent: none of its participants would have been able to predict the path forward. Rather, experimentation with new ideas and tools led to successes and failures; the latter received greater attention, leading to the idea of a ‘software crisis’.

To the best of my knowledge, the ‘software crisis’ was the first wide-spread moment of self-doubt in the history of computing. It reflected the idea that the problems of the day were huge and, quite possibly, perpetual. It is this latter point that, with the benefit of hindsight, looks hopelessly naive: by the early 1990s (and, arguably, many years before), software was demonstrably and continually improving in quality and scope, and has continued doing so ever since. While the massive improvements in hardware certainly helped (by the mid 1990s, hardware imposed constraints only on the most demanding of software), it was surely inevitable that, as a community, we would become better at creating software, given experience.

Mercifully, I have not found myself in a discussion about the ‘software crisis’ for 5 or more years; it seems finally to have been put to rest, albeit many, many years after it had any possible relevance. However it seems that, in software, we abhor a crisis vacuum: as soon as one has disappeared, others arise in its place. I've seen a few over the last few years including a ‘skills crisis’ and an ‘outsourcing crisis’, all minor problems at best.

However, one ‘crisis’ has gained particularly widespread attention in the last few years: the ‘multi-core crisis’. A quick historical detour is instructive. Although – in accordance with Gordon Moore's famous observation – the number of transistors on a CPU has doubled every 2 years for the last 50 years, by the mid part of this decade, it was proving increasingly difficult to make use of extra transistors to improve the performance of single-core (i.e. traditional) processors. In short, the easy performance increases that the software industry had indirectly relied upon for many decades largely disappeared. Instead, hardware manufacturers used the extra transistors to produce multi-core processors. The thesis of the ‘multi-core crisis’ is that software has to make use of the extra cores, yet we currently have no general techniques or languages for writing good parallel programs.

Assuming, for the time being, that we accept that this might be a crisis, there are two things which, I think, we can all agree upon. First, virtually all software ever written (including virtually all software written today) has been created with the assumption of sequential execution. Second, although we feel intuitively that many programs should be paralellisable, we do not currently know how to make them so. In a sense, both these statements are closely related: most software is sequential precisely because we don't know how to parallelise it.

One question, which those behind the ‘multi-core crisis’ do not seem to have asked themselves, is whether this will change. For it to do so, we would need to develop languages or techniques which would allow us to easily write parallel software. Broadly speaking, only two approaches have so far seen wide-spread use: processes and threads. The processes approach breaks software up into chunks, each of which executes in its own largely sealed environment, communicating with other chunks via well-defined channels. It has the advantage that it is relatively easy to understand each process; and that processes can not accidentally cause each other to fail. On the downside, processes must be large: the approach is not suited to fine levels of granularity, ruling out most potential uses. The threads approach is based on parallel execution within a single process, using atomic machine instructions to ensure synchronisation between multiple threads. It has the advantage of allowing fine grained parallelisation. On the downside it is notoriously difficult to comprehend multi-threaded programs, and they are typically highly unreliable because of incorrect assumptions surrounding state shared between threads.

In short, processes and threads are not without their uses, but they will never get us to parallel nirvana. A number of other approaches have been proposed but have yet to prove themselves on a wide scale. Transactional memory, for example, has serious efficiency problems and is conceptually troubling with respect to common operations such as input / output. There also seems to be an interesting tradeoff between parallel and sequential suitability: languages such as Erlang, which are, relatively speaking, good at parallelism, often tend to be rather lacking when it comes to the mundane sequential processing which is still a large part of any ‘parellel’ system.

However, all this is incidental detail. The point about a crisis is not about whether we have solutions to it or not, but whether it is a real, pressing problem. The simple answer is, for the vast majority of the population, “no”. Most peoples computing requirements extend little beyond e-mail, web, and perhaps word processing or similar. For such tasks – be they on a desktop computer or running on a remote server – computers have been more than powerful enough since the turn of the century. If we were able to increase sequential execution speed by an order of magnitude, most people wouldn't notice very much: they are already happy with the performance of their processors, so increasing it again won't make much difference (most users will see far more benefit from an SSD than a faster processor). In those areas where better parallel performance would be a big win – scientific computing, for example, or computer games – people are prepared to pay the considerable premium currently needed to create parallel softwre. Ultimately, we may get better at creating parallel software, but most users' lives will tick on merrily enough either way.

The ‘software crisis’ was bound to solve itself given time and the ‘multi-core crisis’ is no such thing: just because we've been given an extra tool doesn't mean that we're in crisis if we don't use it. Eventually (and this may take many years), I suspect people will realise that. But, I have little doubt, it will be replaced by another ‘crisis’, because that seems to be the way of our subject: no issue is too small not to be thought of as a crisis. The irony of our fixation on artificially created crises is that deep, long-term problems – which arguably should be considered crises – are ignored. The most important of these surely relates to security — too much of our software is susceptible to attack, and too few users understand how to use software in a way that minimises security risks. I have yet to come across any major organisation whose computing systems give me the impression that they are truly secure (indeed, I have occasionally been involved in uncovering trivial flaws in real systems). That to me is a real crisis — and it's one that's being exploited every minute of every day. But it's not sexy, it's not new, and it's hard: none of which makes it worthy, to most people, of being thought of as a crisis.

Link to this entry


Problems with Software 2: Failing to Use the Computing Lever

June 7 2011

[This is number 2 in an occasional series ‘Problems with Software’; read the previous article here.]

We all know at least one and I make no apologies for the picture I now paint. Hunch backed, tongue out, index fingers locked rigidly into an uncomfortable ‘r’ shape, and with no sense of the appropriate level of force to put into their actions. The neanderthal figures I refer to are, of course, two fingered typists. In my experience, a typical two fingered typist will struggle to type 30 Words Per Minute (WPM). An average touch typist will manage at least 70 WPM; and for those prepared to put in a bit of effort in, 90-100 WPM is eminently achievable.

Whether high typing speeds are useful depends on the person and the tasks they need to achieve. Agricultural workers, for example, probably won’t see a huge return on investment in typing skills. What astonishes me is that I know people who spend their entire working day in front of a computer, day in, day out, yet who still use this grossly inefficient technique. The slow typists I know easily lose a few hours every week due to their poor technique — yet suggesting that a few days spent learnt touch typing will quickly pay off in increased productivity is inevitably met with a reply of “I don’t have the time”.

Poor typing technique is a concrete example of a common human tendency — to stop learning as soon as one has a way of performing a task, regardless of whether that is the best way. In typing terms, the slowest typist is perhaps 5 times slower than the fastest: in other areas of human activity, the ratio can be much greater. One of those fields is computing and, by implication, software.

At this point, it is instructive to ask oneself a simple question: what is a computer? The standard answers run along the lines of “a CPU, some RAM, a disk”, “a keyboard and a monitor”, or “an operating system and user software.” While correct in their own way, such answers report the mechanisms involved without considering their purpose. The answer I prefer to this question is more abstract: computers are levers. Archimedes, the first person to explain how levers work, famously said “Give me a place to stand, and I will move the Earth.” When asked to prove this by his King, Archimedes used a lever to move a ship single-handedly, something impossible for even the strongest man to do unaided.

One of the first real uses of a computer shows how effective a lever they can be. The semi-programmable British Colossus computer of the mid 1940s was able to perform thousands of comparisons per second, cracking the seemingly unbreakable German Enigma code. What the code-breakers of Bletchley Park had realised was that a computer could perform simple, repetitive tasks at a speed impossible for humans. Operations that had taken a group of people several weeks to complete could be done by Colossus in half an hour: such was Colossus’s speed that, once operational, it played a significant part in shortening the course of the war.

It is little exaggeration to say that computers have developed into the longest levers available to man. Since Colossus, the rate of progress has been little short of astonishing, and computers have been used to do things that were previously unimaginable: from weather prediction to visual special effects, from DNA sequencing to search engines.

Yet, a lever is only truly effective if used from its end: gripped near the pivot, a lever is, in effect, shortened, and its force magnifying effect reduced. Regrettably, most people grip the computing lever extremely close to the pivot. There are many reasons for this and enumerating them all would take a small book. However, a few examples give an indication of the problem.

  1. Doing tasks manually instead of automatically.
    Computers excel at the simple, repetitive tasks which humans are terrible at. I remember seeing someone change a document which used the American English idiom of placing a comma after e.g. (i.e. “e.g., X”) to the less stilted British English idiom without the comma (i.e. “e.g. X”). The person editing did the change by scanning the (rather large) document and manually changing each occurrence, a task which took a couple of rather dispiriting hours. Inevitably, in a large document, some occurrences were missed. I fixed the remaining occurrences with 100% accuracy by using the ‘search and replace’ function in my text editor — which took less than 5 seconds.
  2. Permanent short-term thinking.
    Whenever I am confronted by a computing task, I ask myself “will I need to do this again?” If yes, I generally spend time working out how to automate the task. Consequently I have built up a considerable suite of small tools (a few of the more polished are publicly available) and techniques over the years to call upon. Most computing professionals I know have not a single such tool (and shockingly few techniques); their thinking never extends beyond trying to bumble through the task at hand. A task that, by hand, takes 30 minutes might take 90 minutes to automate; over the years, as that task recurs every 6 months or so, the automated solution pays for itself many times over.
  3. Using one tool for every task.
    Many people are prepared to invest time in learning one tool, but one tool only; it then becomes the hammer used to treat every task encountered as a nail. While this sometimes works well in the sort term, in the medium and long term, the effort involved in contorting the task to the tool can be huge. Perhaps the most common example is the misuse of spreadsheets to store database-like information — if I had a pound for every time an individual’s name was spelt differently on separate ‘sheets’ within a single Excel file, I would be a wealthy man.
  4. Not knowing the field and not staying upto date.
    The first hour of my day is generally spent checking computing newsgroups and websites, to ensure that I have some knowledge of the breadth of my subject, and the latest developments. Sometimes, I confess, I wish I didn’t have to do this; but if I didn’t, I would miss out on those surprisingly frequent moments when someone raises a problem and I hear myself saying “I was reading about a new program X a few weeks back which sounds like it might do what you need.”
  5. Not searching for help.
    I learnt the rudimentaries of programming before I had a modem; whatever problems I encountered had to be solved with the help of the one manual and two books I owned. Memorably, one extremely trivial problem took me nearly a full month to solve, working on my own. Now, my first instinct is to search for the problem on Google; 95% of the time, someone else will have had the same problem, and someone else will have pointed them to a solution. Yet, when I ask someone who presents me with a trivial problem “did you try looking it up on Google” 95% of the time the answer is “I didn’t think of doing that.”
To some extent, all of us will recognise some parts of ourselves in the above, or can pinpoint other areas where we fail to use the computing lever effectively. There is no shame in that: the shame comes in not addressing the problem. Unfortunately, ignorance in the basic use of computers is considered not only acceptable but the norm amongst the majority of computing and software professionals. The resulting loss in efficiency is vast and a rather sad indictment of our field.

Link to this entry


Problems with Software 1: Confusing Problems Whose Solutions Are Easy to State With Problems Whose Solutions Are Easy to Realise

April 19 2011

I love computing, in particular software: since I was young, I have devoted the major portion of my life to it. In return, I have gained access to a never‐ending source of challenges; met many great people; and had my ego smashed on the jagged rocks that are off‐by‐one errors. None of this makes me very well qualified to write an occasional series of articles on Problems with Software, but I will do my best. The spirit in which I write these is a positive one, perhaps akin to a parent admonishing a child for his own long‐term benefit. Of course, I am, and always be, a student of software, not its parent; searching, learning, and in awe of this ever‐evolving subject. With that in mind, let us begin.

I start with what I consider to be the most common mistake in software; a mistake that no other subject I know of commits so frequently. It is the confusion of problems whose solutions are easy to state with problems whose solutions are easy to realise. A non‐computing example may help explain this. The problem is war: around the world, it ruins the lives of countless people. The easily stated solution is that we should stop people from fighting each other. Regretfully, we know that this solution, while correct, is not worth the space it takes on the page. Telling people they should stop fighting changes nothing, and physically stopping them requires resources far beyond that which can be mustered. None of this is to say that it is not worth tackling the problem of war: but most of us have enough common sense to realise that solutions (for there will surely need to be more than one) will be expensive, long‐term, and come with no guarantee of success.

Alas, in software, such common sense is an uncommon thing. In our subject, unworkable solutions are frequently proposed to deep problems; funding is acquired; manpower deployed; and, after an exhausting death march, failure guaranteed. The problem most frequently invoked is the difficulty of creating good software. This clearly is a problem: too much software is expensive, unreliable, and ill‐suited to the task it was created for. The many easily‐stated solutions proposed have had, at best, a tiny effect; at worst, they have impeded progress by suppressing less exciting, but more realistic, truths. Two concrete examples exemplify this.

The first example is Aspect Orientated Programming (AOP). The problem that AOP addresses is this: developing a program is hard because all of its constituents aspects — even those which are in no way related — must be considered together at all times. Imagine a health‐care system which records patient data and has separate modules for doctors and patients. One aspect is server communication; both doctor and patient units need to communicate with a central server and cope with certain error conditions. This aspect is jumbled up with, and scattered across, the rest of the program, making it hard to check that all server interactions will be performed as expected. The AOP solution is then simple: rather than developing software monolithically, it should be developed as separate aspects, which can be worked on in isolation, before being composed together to produce the end result.

There is no doubt that the AOP solution is intuitive and appealing: the jumbled and scattered nature of current software is a curse, with many deleterious effects. I can remember when I first came across AOP. My response then is my response now (though my argument, I hope, is somewhat better stated these days): it can't work. Furthermore, a simple analogy shows why it can't work.

Imagine a hill, composed, horizontally, of different rock strata. Strata can be taken out and put back in, without ever changing the inherent ‘hillness’ of the hill—a hill is still identifiably a hill even if one strata is taken out (though it might have a distinct kink in it) and each strata makes sense in and of itself, whether it is in the hill or not. In this analogy, aspects are the strata; for AOP to work, it is necessary to imagine being able to take a piece of software, pull out an aspect, and have that aspect be coherent in and of itself. It is a beautiful vision, and one which would change programming. Let us now take a different analogy. Instead of a hill, imagine a gravel drive, consisting of hundreds of thousands of tiny stones of varying colours: grey, honey, black, white, and so on. Even Don Quixote would blanch if I asked him to extract all the honey coloured stones, put them to one side, and then later to put them all back in the same place. The unfortunate reality is that software necessarily resembles a gravel drive—just as different coloured stones exist alongside each other, so do `aspects' of a program. The server communication code in the health care system is likely, for example, to form parts of nested AND expressions—there is no general way to pull such an entangled aspect out in such cases (or to have developed it from scratch) and expect to be able to put it back in.

The second example is Model Driven Architecture (MDA). For my sins, I played a small part in this monstrosity; in my defence I was young, naive, and needed the money, though such pitiful excuses are unlikely to help me come the day of reckoning. The problem that MDA aims to address is the following: developing software is hard because programming languages force one to work at a level of abstraction that is not much above that of machine code. Indeed, this is a problem; while a high‐level description of a business's requirements for its new system may be a few pages long, the corresponding software may have tens or hundreds of thousands of lines of code. Relating the mess of a low‐level program back to the high‐level requirements is a skill that few people successfully acquire. The MDA solution is then simple: development should be done with high‐level (mostly UML) diagrams. Common sense quickly suggests that this approach can never work. Diagrams are wonderful for expressing static relationships, but, overall, terrible for expressing most dynamic behaviour. Looping behaviour is elegantly captured by a textual FOR loop but, as anyone who has ever seen a `visual' programming language in action will know that, is a daunting, unmanageable, mess when represented graphically—and a language which makes looping behaviour difficult is not very useful.

What's most frustrating about AOP and MDA is that while both correctly identify real problems, the solutions they present are appealing in their simplicity yet self‐evidently unviable. Huge amounts of money and time have been invested in both, with little prospect of meaningful reward. Indeed, I am not sure that a good solution could exist for either problem: they seem to me largely intractable and inevitable. Of course, this is not to say that people shouldn't try to come up to solutions to these and other such problems—if whinging old codgers like me always had our way, we'd still be living in dark, damp caves and laughing at Barry from the cave next door while he works on his silly sounding ‘wheel’ concept. Ours is a ‘new’ subject, a vast, unexplored continent, and optimism is a vital component of the makeup of successful settlers. But as vital as optimism is realism—expending precious resources on hopeless causes is a guaranteed way to doom settlement. A little dose of common sense — thinking through a couple of common cases, at the very least — when prevented with a wonderful sounding solution to a hard problem would do our subject a lot of good. Other subjects seem to manage it, and I don't see why in software we can't too.

Link to this entry


Parsing: The Solved Problem That Isn't

March 15 2011

Parsing is the act of taking a stream of characters and deducing if and how they conform to an underlying grammar. For example the sentence Bill hits Ben conforms to the part of the English grammar noun verb noun. Parsing concerns itself with uncovering structure; although this gives a partial indication of the meaning of a sentence, the full meaning is only uncovered by later stages of processing. Parseable, but obviously nonsensical, sentences like Bill evaporates Ben highlight this (the sentence is still noun verb noun, but finding two people who agree on what it means will be a struggle). As humans we naturally parse text all the time, without even thinking about it; indeed, we even have a fairly good ability to parse constructs that we've never seen before.

In computing, parsing is also common; while the grammars are synthetic (e.g. of a specific programming language), the overall idea is the same as for human languages. Although different communities have different approaches to the practicalities of parsing - C programmers reach for lex / yacc; functional programmers to parser combinators; others for tools like ANTLR or a Packrat / PEG-based approach - they typically rely on the same underlying area of knowledge.

After the creation of programming languages themselves, parsing was one of the first major areas tackled by theoretical computer science and, in many peoples eyes, one of its greatest successes. The 1960s saw a concerted effort to uncover good theories and algorithms for parsing. Parsing in the early days seems to have shot off in many directions before, largely, converging. Context Free Grammars (CFGs) eventually won, because they are fairly expressive and easy to reason about, both for practitioners and theorists.

Unfortunately, given the extremely limited hardware of 1960s computers (not helped by the lack of an efficient algorithm), the parsing of an arbitrary CFG was too slow to be practical. Parsing algorithms such as LL, LR, and LALR identified subsets of the full class of CFGs that could be efficiently parsed. Later, relatively practical algorithms for parsing any CFG appeared, most notably Earley's 1973 parsing algorithm. It is easy to overlook the relative difference in performance between then and now: the fastest computer in the world from 1964-1969 was the CDC6600 which executed at around 10 MIPS; my 2010 mobile phone has a processor which runs at over 2000 MIPS. By the time computers had become fast enough for Earley's algorithm, LL, LR, and friends had established a cultural dominance which is only now being seriously challenged - many of the most widely used tools still use those algorithms (or variants) for parsing. Nevertheless in tools such as ACCENT / ENTIRE and recent versions of bison, one has access to performant parsers which can parse any CFG, if that is needed.

The general consensus, therefore, is that parsing is a solved problem. If you've got a parsing problem for synthetic languages, one of the existing tools should do the job. A few heroic people - such as Terence Parr, Adrian Johnstone, and Elizabeth Scott - continue working away to ensure that parsing becomes even more efficient but, ultimately, this will be transparently adopted by tools without overtly changing the way that parsing is typically done.

Language composition

One of the things that's become increasingly obvious to me over the past few years is that the general consensus breaks down for one vital emerging trend: language composition. Composition is one of those long, complicated, but often vague terms that crops up a lot in theoretical work. Fortunately, for our purposes it means something simple: grammar composition, which is where we add one grammar to another and have the combined grammar parse text in the new language (exactly the sort of thing we want to do with Domain Specific Languages (DSLs)). To use a classic example, imagine that we wish to extend a Java-like language with SQL so that we can directly write:
for (String s : SELECT name FROM person WHERE age > 18) {
  ...
}
Let's assume that someone has provided us with two separate grammars: one for the Java-like language and one for SQL. Grammar composition seems like it should be fairly easy. In practice, it turns out to be rather frustrating, and I'll now explain some of the reasons why.

Grammar composition

While grammar composition is theoretically trivial, simply squashing two grammars together is rarely useful in practice. Typically, grammars have a single start rule; one therefore needs to choose which of the two grammars has the start rule. More messy is the fact that the chances of the two grammars referencing each other is slight; in practice, one needs to specify a third tranche of data - often referred to, perhaps slightly misleadingly, as glue - which actually links the two grammars together. In our running example, the Java-like language has the main grammar; the glue will specify where, within the Java-like expressions, SQL statements can be referenced.

For those using old parsing algorithms such as LR (and LL etc.), there is a more fundamental problem. If one takes two LR-compatible grammars and combines them, the resulting grammar is not guaranteed to be LR-compatible (i.e. an LR parser may not be able to parse using it). Therefore such algorithms are of little use for grammar composition.

At this point, users of algorithms such as Earley's have a rather smugger look on their face. Since we know from grammar theory that unioning two CFGs always leads to a valid CFG, such algorithms can always parse the result of grammar composition. But, perhaps inevitably, there are problems.

Tokenization

Parsing is generally a two-phase process: first we break the input up into tokens (tokenization); and then we parse the tokens. Tokens are what we call words in everyday language. In English, words are easily defined (roughly: a word starts and ends with a space or punctuation character). Different computer languages, however, have rather different notions of what their tokens are. Sometimes, tokenization rules are easily combined; however since tokenization is done in ignorance of how the token will later be used, sometimes it is difficult. For example, in SQL SELECT might be a keyword but in Java it is also a valid identifier; it is often hard, if not impossible, to combine such tokenization rules in traditional parsers.

Fortunately there is a solution: scannerless parsing (e.g. SDF2 scannerless parsing). For our purposes, it might perhaps better be called tokenless parsing; the different names reflect the naming conventions of different parsing schools. Scannerless parsing does away with a separate tokenization phase; the grammar now contains the information necessary to dynamically tokenize text. Combining grammars with markedly different tokenization rules is now possible.

Fine-grained composition

In practice, the simple glue mentioned earlier used to combine two grammars is often not enough. There can be subtle conflicts between the grammars, in the sense that the combined language might not give the result that was expected. Consider combining two grammars that have different keywords. Scannerless parsing allows us to combine the two grammars, but we may wish to ensure that the combined languages do not allow users to use keywords in the other language as identifiers. There is no easy way to express this in normal CFGs. The SDF2 paper referenced earlier allows reject productions as a solution to this; unfortunately this then makes SDF2 grammars mildly context sensitive. As far as I know, the precise consequences of this haven't been explored, but it does mean that at least some of the body of CFG theory won't be applicable; it's enough to make one a little nervous, at the very least (not withstanding the excellent work that has been created using the SDF2 formalism by Eeclo Visser and others).

A recent, albeit relatively unknown, alternative are boolean grammars. These are a generalization of CFGs that include conjunction and negation, which, at first glance, are exactly the constructs needed to make grammar composition practical (allowing one to say things like identifiers are any sequence of ASCII characters except SELECT). Boolean grammars, to me at least, seem to have a lot of promise, and Alexander Okhotin is making an heroic effort on them. However, there hasn't yet been any practical use of them that I know of, so wrapping ones head around the practicalities is far from trivial. There are also several open questions about boolean grammars, some of which, until they are answered one way or the other, may preclude wide-scale uptake. In particular, one issue relates to ambiguity, of which more now needs to be said.

Ambiguity

By severely restricting what CFGs they accept, grammars which are compatible with traditional parsing algorithms (LL, LR etc.) are always unambiguous (though, as we shall see, this does not mean that all the incompatible grammars are ambiguous: many are unambiguous). Grammar ambiguity is thus less widely understood than it might otherwise have been. Consider the following grammar of standard arithmetic:
E ::= E "+" E
    | E "-" E
    | E "/" E
    | E "*" E
Using this grammar, a string such as 2 + 3 * 4 can be parsed ambiguously in two ways: as equivalent to (2 + 3) * 4; or as equivalent to 2 + (3 * 4). Parsing algorithms such as Earley's will generate all possibilities even though we often only want one of them (due to arithmetic conventions, in this case we want the latter parse). There are several different ways of disambiguating grammars, such as precedences (in this example, higher precedences win in the face of ambiguity):
E ::= E "+" E  %precedence 1
    | E "-" E  %precedence 1
    | E "/" E  %precedence 2
    | E "*" E  %precedence 3
This might suggest that we can tame ambiguity relatively easily: unfortunately, parsing theory tells us that the reality is rather tricky. The basic issue is that, in general, we can not statically analyse a CFG and determine if it is ambiguous or not. To discover whether a given CFG is ambiguous or not we have to try every possible input: if no input triggers an ambiguous parse, the CFG is not ambiguous. However this is, in general, impractical: most CFGs describe infinite languages and can not be exhaustively tested. There are various techniques which aim to give good heuristics for ambiguity (see Bas Basten's masters thesis for a good summary; I am also collaborating with a colleague on a new approach, though it's far too early to say if it will be useful or not). However, these heuristics are inherently limited: if they say a CFG is ambiguous, it definitely is; but if they can not find ambiguity, all they can say is that the CFG might be unambiguous.

Since theoretical problems are not always practical ones, a good question is the following: is this a real problem? In my experience thus far, defining stand-alone grammars for programming languages using Earley parsing (i.e. a parsing algorithm in which ambiguity is possible), it's not been a huge problem: as the grammar designer, I often understand where dangerous ambiguity might exist, and can nip it in the bud. I've been caught out a couple of times, but not enough to really worry about.

However, I do not think that my experience will hold in the face of wide-spread grammar composition. The theoretical reason is easily stated: combining two unambiguous grammars may result in an ambiguous grammar (which, as previously stated, we are unlikely to be able to statically determine in general). Consider combining two grammars from different authors, neither of whom could have anticipated the particular composition: it seems to me that ambiguity is much more likely to crop up in such cases. It will then remain undetected until an unfortunate user finds an input which triggers the ambiguity. Compilers which fail on seemingly valid input are unlikely to be popular.

PEGs

As stated earlier, unambiguous parsing algorithms such as LL and LR aren't easily usable in grammar composition. More recently, a rediscovered parsing approach has gathered a lot of attention: Packrat / PEG parsing (which I henceforth refer to as PEGs). PEGs are different than everything mentioned previously: they have no formal relation to CFGs. The chief reason for this is PEGs ordered choice operator, which removes any possibility for ambiguity in PEGs. PEGs are interesting because, unlike LL and LR, they're closed under composition: in other words, if you have two PEGs and compose them, you have a valid PEG.

Are PEGs the answer to our problems? Alas - at least as things stand - I now doubt it. First, PEGs are rather inexpressive: like LL and LR parsing, PEGs are often frustrating to use in practise. This is, principally, because they don't support left recursion; Alex Warth proposed an approach which adds left recursion but I discovered what appear to be problems with it, though I should note that there is not yet a general consensus on this (and I am collaborating with a colleague to try and reach an understanding of precisely what left recursion in PEGs should mean). Second, while PEGs are always unambiguous, depending on the glue one uses during composition, the ordered choice operator may cause strings that were previously accepted in the individual languages not to be accepted in the combined language - which, to put it mildly, is unlikely to be the desired behaviour.

Conclusions

If you've got this far, well done. This article has ended up much longer than I originally expected - though far shorter than it could be if I really went into detail on some of these points! It is important to note that I am not a parsing expert: I only ever wanted to be a user of parsing, not - as I currently am - someone who knows bits and pieces about its inner workings. What's happened is that, in wanting to make greater use of parsing, I have gradually become aware of the limitations of what I have been able to find. The emphasis is on gradually: knowledge about parsing is scattered over several decades (from the 60s right up to the present day); many publications (some of them hard to get hold of); and many peoples heads (some of whom no longer work in computing, let alone in the area of parsing). It is therefore hard to get an understanding of the range of approaches or their limitations. This article is my attempt to write down my current understanding and, in particular, the limitations of current approaches when composing grammars; I welcome corrections from those more knowledgeable than myself. Predicting the future is a mugs game, but I am starting to wonder whether, if we fail to come up with more suitable parsing algorithms, programming languages of the future that wish to allow syntax extension will bypass parsing altogether, and use syntax directed editing instead. Many people think parsing is a solved problem - I think it isn't.

Link to this entry

Earlier articles

All articles
 
Last 10 articles
Fast Enough VMs in Fast Enough Time
Problems with Software 3: Creating Crises Where There Aren't Any
Problems with Software 2: Failing to Use the Computing Lever
Problems with Software 1: Confusing Problems Whose Solutions Are Easy to State With Problems Whose Solutions Are Easy to Realise
Parsing: The Solved Problem That Isn't
In Praise of the Imperfect
A Modest Attempt to Help Prevent Unnecessary Static / Dynamic Typing Debates
A Proposal for Error Handling
The Missing Level of Abstraction?
Good Programmers are Good Sysadmins are Good Programmers
 
 
DSLs
Tony Clark
Zef Hemel
 
Modelling
Mark Delgano
Steven Kelly
Stuart Kent
Jim Steel
 
OS
Marc Balmer
Mike Erdely
Peter Hansteen
KernelTrap
OpenBSD Journal
 
Programming
Peter Bell
Gilad Bracha
Tony Clark
Cliff Click
William Cook
Jonathan Edwards
Daniel Ehrenberg
Fabien Fleutot
Martin Fowler
John Goerzen
Grace
James Hague
Jeremy Hylton
Ralph Johnson
JOT
Ralf Laemmel
Lambda the Ultimate
Daniel Lemire
Michael Lucas
Bertrand Meyer
Keith Packard
Havoc Pennington
Brown PLT
PyPy
John Regehr
Software Engineering Radio
Diomidis Spinellis
Shin Tai
Markus Voelter
Phil Wadler
Russel Winder
Steve Yegge