
The Evolution of a Research Paper

January 19 2021

As far as I can recall, the first time that I truly tried to read a research paper was when I started my PhD. I don’t remember which paper was the first recipient of my incompetent gaze, but I do remember how completely out of my depth I felt when trying to make sense of it. Over the many subsequent papers I attempted during that period, I made very little progress in meaningfully digesting any of what I was reading. I quite often found a fog descending on my brain, a curious mix of tiredness and bewilderment, that left me feeling defeated.

In retrospect, I faced three main issues: I didn’t have a good technique for reading papers [1]; I was trying to read papers that far exceeded my rather narrow technical knowledge; and I wasn’t reliably selecting papers that were well written. To a reasonable extent, sheer repetition tends to fix all of these issues.

However, I eventually realised there was a much deeper issue: I did not understand how papers come into being and certainly had no idea how I should go about writing my own. From my naive perspective, papers seemed to come fully formed as 20 page PDFs, full of dense prose, foreboding formulae, and complex tables. I had little idea how they got to that state and, for whatever reason, nobody I bumped into ever seemed to talk in a meaningful way about the mechanics of writing a paper. Even now, I’m not sure how most people or groups go about writing their papers.

A few years back I wrote about what I’ve learnt about writing research papers, but I suspect that does a better job of explaining the detailed mechanics of writing parts of a paper than it does giving a sense of the whole. When I recently saw a simple but effective animation of a paper evolving during its writing process, I realised that it might help give a high-level view of the whole of a paper. What I’m going to do in this post is to take two papers I’ve been involved with and use similar animations to give a sense of how each paper evolved, as well as some of the human story behind why it evolved that way. I’m not going to try and derive grand lessons from these two examples, but I hope that they might help slightly demystify how papers end up looking as they do.

The warmup paper

Let’s start with Virtual Machine Warmup Blows Hot and Cold which is probably the most demanding paper I’ve ever been involved with. To some extent, one can recover the paper’s story from the repository’s commits, but even that won’t tell you everything. So I’m now going to briefly explain the story of the paper [2] from my perspective [3].

It all started innocently enough in May 2015 when Edd Barrett undertook what I thought would be a “quick” bit of work to see how long it takes programming language VMs (Virtual Machines) such as the JVM to “warmup” (i.e. how long does it take them to reach the “steady state of peak performance”?). A couple of weeks later, Edd showed us some rough data, which suggested that VMs often don’t warm up. This was, to put it mildly, surprising. My immediate worry was that we might have overlooked an important factor (e.g. servers overheating and a resultant throttling of performance) that, if fixed, would make all the odd results disappear. Soon joined by Carl Friedrich Bolz and Sarah Mount, we embarked on a lengthy iterative process, each time trying to think of something we might have missed, or got wrong, addressing that issue, and rerunning the experiment to see if anything meaningful changed.

By September 2015, the data we were seeing wasn’t fundamentally very different from May, and we were starting to feel more confident that we hadn’t done anything stupid. However, “more confident” is relative: we were still worried that we’d missed something obvious. We thus decided to create a formal draft paper that we could pass around to others for comment.

As soon as we started creating that draft paper, it became clear that while we had lots of data, we had no idea how to process it or present it. Vince Knight, a statistician, started giving us advice [4], soon putting us in touch with Rebecca Killick, an expert in changepoint analysis. Without the use of changepoint analysis, the eventual paper wouldn’t have been worth reading.

However, it soon became apparent that integrating changepoint analysis was not going to be a quick job. Since we still wanted comments on our experimental setup, the first version of the paper thus doesn’t have any statistical summaries at all. Despite that limitation, we got useful, detailed feedback from several people who I admire. We also discovered, quite unexpectedly, that many senior members of the VM community had seen peculiar benchmarking results over the years, and were surprised only by the extent of the oddities we were seeing. That helped convince me that the path we were on was a worthwhile one.

We kept beavering away on the research and the paper throughout 2016, most notably integrating a new way of producing benchmarking statistics with changepoints. By December 2016 we’d produced a second version of the paper, which we submitted to a programming languages research conference. Though imperfect, most of the major parts of the paper were now in decent shape, and that version of the paper bears a clear resemblance to the final version. However, the paper was rejected, and the reviewer comments we received didn’t really give us useful pointers for improvement [5].

Fortunately for us — and despite having worked on the research for 2 years, and the paper for 18 months by this point — we were still thinking of improvements ourselves, though we were now firmly in the territory of diminishing returns. By April 2017 we submitted the third version of the paper to another conference where it was conditionally accepted, meaning that we needed to address a handful of specific comments in order to convince the reviewers that the paper should be fully accepted. As is common in such cases, addressing those comments wasn’t too difficult (the expectation is that conditionally accepted papers don’t need too much work done on them in order to be fully accepted).

However, one small part of a comment from a reviewer raised a much bigger question in our heads than I think the reviewer could ever have intended. In essence, we wondered what would have happened if we’d not run benchmarks for as long as we did: would shorter runs still have highlighted as many odd things? We put together a partial answer to this question which you can see in Figure 9 of the fourth version of the paper. That version of the paper was enough to make the reviewers happy, and enable the paper to be fully accepted at the conference, but we still felt that we hadn’t addressed this other question adequately, so we kept plugging away.

Amongst the oddities of computing conferences is that they impose a fixed deadline on submission of a final version of a paper, generally around a month after initial notification. We now had such a deadline, but we were also uncovering new challenges in addressing the “what would have happened?” question. I then took my eye off the ball somewhat, and didn’t stop to think that as we improved our answer to the question, we were slowly but steadily increasing the amount of data we needed to process. Suddenly, 3 days before the deadline, a back of the envelope calculation suggested that our server would take about 20 days to process the data. Frankly, we looked to be in trouble, and I remember thinking that there was now a small, but definite, possibility that we would have to retract the paper. Thankfully, we dug ourselves out of the hole we found ourselves in by crudely breaking our Python processing program up into multiple processes and renting a beefy, and expensive, AWS server [6]. It crunched the data in under 24 hours, allowing us to produce the data that ended up as Figure 11 in the fifth version of the paper [7].

That’s the human side of the story, although a few quick numbers may help round things out. We started the underlying research in May 2015 and the paper in September 2015 — and we kept actively working on both until September 2017. As part of the research we created Krun and the experiment itself. The paper itself has 1001 commits, and we made six releases of the paper on arXiv.

Let’s now look at the animation [8] of this paper evolving (skip to 45s if you want to avoid my ugly mug), with each frame representing the state of the paper at one of its 1001 git commits (with the date and time of each commit shown in the bottom right):

You might find it useful to speed up, or slow down, the animation — in my experience, different speeds tend to highlight different aspects of the paper’s evolution. A few further details might help you make sense of the animation. Commits that have a git tag associated with them (representing “releases” of some sort or other) pause the animation for a couple of seconds showing, in red, the tag name alongside the time (e.g. at 1:50 [9]). Within the paper itself, text in red represents a comment, or question, from one author to another. Not every version of the paper is perfectly rendered, because there is an element of bit rot. For example, while TeX is famously stable, LaTeX, and its distributions, aren’t — packages come and go, and backwards compatibility is hit and miss. We also didn’t write the paper in the expectation that anyone would go try and build old versions. However, other than the bibliography sometimes disappearing in the first quarter of the animation, the imperfections seem minor.

A few things that the animation highlights are worth looking at in detail. The paper starts off as a literal skeleton [10]: a title, some authors (and not the final set of authors), a couple of section headings (but little text), and a single citation of another paper (just to make sure the bibliography builds). It then goes through a rapid period of expansion as draft text representing the state of underlying research at that point is added.

Fairly quickly, one sees the first big chunks of red text [11], which is generally one of us noticing that something is either incomplete or nonsensical. This is the first time that one can start to tease apart the relationship between a paper and the research underlying it: for me, a paper is the condensed version of the understanding generated from the underlying research. In my experience, the process of “turning” research into a paper regularly uncovers gaps in the underlying research that we had either not noticed or sometimes actively ignored and which need fixing. This, to my mind, is a virtuous cycle: writing the paper highlights weaknesses in the underlying research that, when fixed, improve the paper.

The periods of expansion sometimes involve a number of commits, but typically last only a few days. Such periods are followed by much lengthier periods of consolidation, which can sometimes last months. Many of these periods of consolidation are where gaps in the research are slowly filled in, though a few exist simply because everyone involved was busy elsewhere! The remainder are those periods leading up to a release of the paper. Some are explicitly “draft” releases, some are submissions to venues (in computing, currently, these mostly take the form of conferences), and there’s generally a “final” release [12]. You’ll notice occasional major changes in the “look” of the paper where a change in the venue we are targeting requires us to change to that venue’s formatting requirements (e.g. at 1:20: before [13] and after [14]).

The final thing the animation makes clear is that, as time goes on, the periods of expansion become shorter, and less frequent, and the periods of consolidation become not just longer, but involve less profound changes. The amount of red text in each commit becomes smaller as the gaps in the research that we notice become smaller. Nevertheless, even at 3 minutes into the animation – fairly near the end of the paper’s evolution – there are noticeable additions to it.

The error recovery paper

Of course, everything that I’ve written above relates to a single paper which may or may not be representative of any other paper. It’s thus worth having a look at a second paper, both as a sanity check, and also to see if anything else interesting arises. In this case I’m going to look at Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers, a paper which shows how one can add practical, automatic, syntax error recovery to LR grammars. I’m not going to recount its history with the same minuteness, but there is one thing that I want to tease out. Let’s start with the animation (which starts at 40s):

This paper wasn’t a small effort — its writing still spanned two and a half years — but it wasn’t on the same scale as the warmup paper. It has one third as many commits and a smaller, though not exactly tiny, experiment. Unlike the warmup work, however, it led to software (grmtools) which non-researchers might be interested in using [15].

It went through a similar, though slightly different, evolution to the warmup paper. I ran a private early draft past a couple of people whose work in the area I respected before the first “public” release on arXiv. It was rejected at a conference and a journal before being accepted at a conference.

I think the most interesting thing about this paper is something you can’t see in its final version, which contains a single new algorithm ‘CPCT+’. However, in the video you’ll see at around 1:55 a major reformatting (before [16] and after [17]) almost immediately followed by a significant reduction in the paper’s contents (before [18] and after [19]). If you look at the second version of the paper, just before this reduction happens, you’ll see that the paper contained a second algorithm ‘MF’.

MF wasn’t removed because of page constraints [20], nor because of reviewer comments: I took the decision to remove it because I thought it would make the paper better. MF is in most respects a superior algorithm to CPCT+: MF runs faster, and fails less often. However, CPCT+ already runs quite fast and doesn’t fail very often, so although MF is demonstrably better there isn’t much scope for it to be much better.

MF is, however, much more complex than CPCT+. I don’t have exact figures, but it wouldn’t surprise me if I spent ten times as long on MF as I did on CPCT+. That meant that I was beguiled by MF and my part in it, without stopping to think “is this complexity worth it?” Eventually that question became difficult for me to ignore. If I had to guess, I reckon it might take someone else at least five times longer to integrate MF into their parsing system than CPCT+. Because MF’s performance gains over CPCT+ are rather modest, this seems a poor trade-off.

Thus, reluctantly, I had to conclude that by including MF, I was demanding more from every reader than I was giving back. Put another way, if I’d left MF in, I’d have been prioritising my ego over the reader’s time: that might be a good trade-off for me, but not for the community overall. Once I’d reached that decision, deleting the text (and the code) was inevitable. That doesn’t mean that deleting it was easy: I was effectively writing off 3–6 months’ work of my own time. Although that sounds like a lot of time, it’s not unique: in my experience it’s common for weeks of work on the underlying research either not to make it into the resulting paper at all, or at most to form part of a single sentence.

Summary

My aim in this entry hasn’t been to derive general lessons that you, or even I, can apply to other papers: each research project, and each paper, is a bespoke effort [21]. However, I think that a few common themes are worthy of note.

First, there’s a difference between the underlying research and a paper representing that research: by definition the latter can only contain a subset of the former (see e.g. the deletion of MF).

Second, the act of writing a paper frequently uncovers holes in the underlying research that need to be filled in.

Third, writing a paper is a slog: in my opinion, every sentence, every paragraph, and every figure is important, and getting them all into shape is no small task.

Fourth, writing up research and releasing papers is important: I’ve lost track of how many times I’ve heard of interesting systems that weren’t written up and which no longer run and thus are lost to us. To me, writing a paper isn’t about gaining another line on my CV: it’s about clearly and accurately articulating the underlying research, warts and all, to the best of my ability, so that other people can understand what’s been done and decide whether, and how, to build upon it.

I started this entry off by noting how difficult I used to find it to read papers. I don’t suppose anything I’ve written in this entry helps make the task of reading a paper easier, but perhaps it has at least given you a little insight into the process by which a paper is produced.

Acknowledgements: Thanks to Edd Barrett, Carl Friedrich Bolz-Tereick, Aleks Bonin, Martin Chapman, Lukas Diekmann, Jake Hughes, and Dan Luu for comments.


Footnotes

[1] It took me many years to realise that there are different approaches to reading a paper, that some approaches are meaningfully better than others, and that different people respond better to some approaches than others. These days, my general approach is to try reading a paper as quickly as possible: I try not to linger on parts that I don’t understand, because I’ve learnt from experience that confusion is often resolved by a later part of the paper. Once I’ve finished, I take a short period to digest what I’ve read and then, if necessary, I’ll go back and do another read through.
[2] I’ve talked a little bit about the story behind the paper before, but not in great detail.
[3] As with pretty much any human activity, different co-authors on a paper will nearly always have a slightly different perspective on what happened, because none of us can fully understand everything that other people went through.
[4] Vince later determined that his area of statistics wasn’t quite the tool we needed, and put us in touch with Rebecca. That was already a generous act, but he topped it a while later by asking that we remove him as an author because he felt he hadn’t contributed enough to the paper to be deserving as an author. I tried to convince him that he had already done more than enough to deserve a credit, but he politely stuck to his guns. I can sometimes lazily descend into the mire of cynicism, but when I see someone acting upon such deeply held morals, it reminds me how many good people there are in the world.
[5] Sometimes when a paper is rejected, one receives useful comments that help to improve the paper. I try not to complain about rejections, because it is almost never useful or edifying. However, I’m prepared to make an exception in this case, because, even 3 years later, I still can’t believe the following comments we received:
The unstated inference that extreme lengths are necessary in order to gain confidence in an empirical results is misplaced and counter-productive.
and:
Second, your paper overlooks the economics of research. The researcher's job is _not_ to produce the most error-free result. Their job is _produce the most insightful research given the resources available_
I don’t think I’ve ever been scolded, before or since, for putting too much work into my research!
[6] From memory, it had 32 cores and 1TiB RAM — still, I think, the beefiest computer I’ve ever used.
[7] We then had to fix a single typo to produce the sixth, and final, version.
[8] This took a mere six and three quarter hours to generate!
[9]
[10]
[11]
[12] Personally I hope that papers are “maintained”, at least for a while, in the same way that software is maintained. If someone points out a major flaw in a paper I’ve written, I hope I’ll go back and make a new version of the paper. There are limits to this: at some point, I will have forgotten most of what I once knew, and bit rot can make fixing things impractical. However, currently, there isn’t really a culture of reporting problems on papers, so I haven’t put this good intention into practice: until then, I’m not really sure how long I’m willing or able to maintain a paper for.
[13]
[14]
[15] It also caused us to produce a number of stand-alone Rust libraries: cactus, packedvec, sparsevec, and vob. I’d like to particularly thank Gabriela Moldovan for her work on packedvec. Embarrassingly, as I wrote this footnote, I noticed that we failed to credit her in the paper. Sorry Gabi!
[16]
[17]
[18]
[19]
[20] Even though computing conferences abandoned printed proceedings several years back, they still impose stringent page limits on papers. This has two unfortunate effects: work that naturally needs more space than is allowed is chopped down until it does, compromising legibility; and work that needs less space is padded out to make it look more impressive, compromising legibility in a different way.
[21] Or, at least, they should be a bespoke effort. Alas, there is a problem in academia with “salami slicing”, where what would best be presented in a single paper is instead split up into multiple papers. This is detrimental to the community as a whole but is done because the authors involved believe it will help their careers.

Automatic Syntax Error Recovery

November 17 2020

Programming is the best antidote to arrogance I've come across — I make so many errors that I am continually reminded of my own fallibility. Broadly speaking, I think of errors as severe or minor. Severe errors are where I have fundamentally misunderstood something about the system I am creating. Each severe error is a bespoke problem, often requiring custom tooling to understand and fix it. Minor errors, in contrast, are repetitive and quickly fixed. However, they’re also much more numerous than severe errors: even shaving a couple of seconds off the time it takes a programmer to fix a class of minor errors is worthwhile when you consider how often they occur.

The most minor of minor errors, and also I suspect the most frequent, are syntax errors. They occur for three main reasons: mental sloppiness; inaccurate typing [1]; or an incomplete understanding of the language’s syntax. The latter case is generally part of a brief-ish learning phase we go through and I’m not sure what a good solution for it might look like. The former two cases, however, are extremely common. When I’ve made a small typo, what I want is the parser in my compiler or IDE to pinpoint the location of the syntax error accurately and then recover from it and continue as if I hadn’t made an error at all. Since compilation is often far from instantaneous, and I often make multiple errors (not just syntax errors), good quality syntax error recovery improves my programming efficiency.

Unfortunately, LR parsers – of which I am particularly fond – have a poor reputation for syntax error recovery. I’m going to show in this article that this isn’t inevitable, and that it’s possible to do surprisingly good automatic syntax error recovery for any LR grammar. If you want to know more details, you might be interested in the paper Lukas Diekmann and I recently published called Don't Panic! Better, Fewer, Syntax Errors for LR Parsers. The paper also has a fairly brief accompanying talk, if you find that sort of thing helpful:

For everyone else, let’s continue. To make our lives easier, for the rest of this article I’m going to shorten “syntax error recovery” to “error recovery”.

Outlining the problem

Let’s see error recovery in action in a widely used compiler — javac:

[Click to start/stop animation]
As a quick reminder, ‘int x y;’ isn’t valid Java syntax. javac correctly detects a syntax error after the ‘x’ token and then tries to recover from that error. Since ‘int x;’ is valid Java, javac assumes that I meant to put a semi-colon after ‘x’, repairs my input accordingly, and continues parsing. This is the good side of error recovery: my silly syntax error hasn’t stopped the compiler from carrying on its good work. However, the bad side of error recovery is immediately apparent: ‘y;’ isn’t valid Java, so javac immediately prints out a second, spurious, syntax error that isn’t any fault of mine.

Of course, I have deliberately picked an example where javac does a poor job but I regret to inform you that it didn’t take me very long to find it. Many parsers do such a poor job of error recovery that experienced programmers often scroll back to the location of the first syntax error, ignoring both its repair and any subsequent syntax errors. Instead of being helpful, error recovery can easily have the opposite effect, slowing us down as we look for the real error amongst a slew of spurious errors.

Let’s look at the modern compiler that I most often use as an exemplar of good error messages, rustc. It often does a good job in the face of syntax errors:


[Click to start/stop animation]
However, even rustc can be tripped up when presented with simple syntax errors:

[Click to start/stop animation]

Some language implementations don’t even bother trying to recover from syntax errors. For example, even if I make two easily fixable syntax errors in a file, CPython stops as soon as it encounters the first:


[Click to start/stop animation]

As all this might suggest, error recovery is hard to do well, and it's unlikely that it will ever fully match human intuition about how syntax errors should be fixed. The root of the problem is that when we hit an error while parsing, there are, in general, an infinite number of routes we could take to try and get parsing back on track. Since exploring infinity takes a while, error recovery has to use heuristics of one sort or another.

The more knowledge a parser has of the language it is parsing, the more refined that heuristic can be. Hand-written parsers have a fundamental advantage here, because one can add as much knowledge about the language’s semantics as one wants to the parser. However, extending a hand-written parser in this way is no small task, especially for languages with large grammars. It’s difficult to get precise figures, but I’ve seen more than one parser that has taken a small number of person years of effort, much of which is devoted to error recovery. Not many of us have that much time to devote to the problem.

Automatically generated parsers, in contrast, are at a clear disadvantage: their only knowledge of the language is that expressed via its grammar. Despite that, automatically generated LL parsers are often able to do a tolerable job of error recovery [2].

Unfortunately, LR parsers have a not undeserved reputation for doing a poor job of error recovery. Yacc, for example, requires users to sprinkle error tokens throughout their grammar in order to have error recovery in the resulting parser: I think I’ve only seen one real grammar which makes use of this feature, and I am sceptical that it can be made to work well. Panic mode is a fully automatic approach to error recovery in LR parsing, but it works by gradually deleting the parsing stack, causing it to delete input before a syntax error in order to try and recover after it. Frankly, panic mode’s repairs are so bad that I think on a modern machine [3] it’s worse than having no error recovery at all.

The roots of a solution

At a certain point when working on grmtools I realised that I should think about error recovery, something which I had only ever encountered as a normal user. A quick word about grmtools: it’s intended as a collection of parsing related libraries in Rust. At the moment, the parts that are most useful to users are lrpar – a Yacc-compatible parser – and, to a lesser extent [4], lrlex – a Lex-ish lexer. For the rest of this article, I’ll almost exclusively be talking about lrpar, as that’s the part concerned with error recovery.

Fortunately for me, I quickly came across Carl Cerecke’s PhD thesis which opened my eyes to an entirely different way of doing error recovery. His thesis rewards careful reading, and has some very good ideas in it. Ultimately I realised that Cerecke’s thesis is a member of what these days I call the Fischer et al. family of error recovery algorithms for LR parsing, since they all trace their lineage back to that paper.

When error recovery algorithms in the Fischer et al. family encounter a syntax error they try to find a repair sequence that, when applied to the input, gets parsing back on track. Different algorithms have different repairs at their disposal and different mechanisms for creating a repair sequence. For example, we ended up using the approach of Corchuelo et al. — one of the most recent members of the Fischer et al. family — as our intellectual base.

CPCT+ in use

We took the Corchuelo et al. algorithm, fixed and semi-formalised it [5], and extended it to produce a new error recovery algorithm CPCT+ that is now part of lrpar. We can use nimbleparse — a simple command-line grammar debugging tool — to see CPCT+ in action on our original Java example:

[Click to start/stop animation]
As with javac’s error recovery, CPCT+ is started when lrpar encounters a syntax error at the ‘y’ token. Unlike javac, CPCT+ presents 3 different repair sequences (numbered 1, 2, 3) to the user which, in order [6], would repair the input to be equivalent to: ‘int x, y;’, ‘int x = y;’, or ‘int x;’. Importantly, repair sequences can contain multiple repairs:

[Click to start/stop animation]
Since you probably don’t want to watch the animation endlessly I’ll put the repair sequences that are reported here:
Parsing error at line 2 column 13. Repair sequences found:
   1: Insert ), Shift {, Delete if, Delete true
   2: Insert ), Shift {, Shift if, Insert (, Shift true, Insert )
This example shows all of the individual repair types that CPCT+ can generate (a small code sketch of these follows the list):
  • ‘Insert x’ means ‘insert a token x at the current position’;
  • ‘Shift x’ means ‘keep the token x at the current position unchanged and advance the search’;
  • ‘Delete x’ means ‘delete the token x at the current position’.
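
If it helps to see these in code form, here is a minimal sketch of how repairs and repair sequences could be modelled in Rust. This is an illustration only, not lrpar’s actual types:

// A sketch only: lrpar's real repair types are richer than this.
#[derive(Clone, Debug, PartialEq)]
enum Repair {
    Insert(String), // insert a token of this kind at the current position
    Shift(String),  // keep the current token unchanged and advance the search
    Delete(String), // delete the token at the current position
}

// A repair sequence is simply an ordered list of repairs.
type RepairSeq = Vec<Repair>;

fn main() {
    // The first repair sequence from the example above.
    let rs: RepairSeq = vec![
        Repair::Insert(")".into()),
        Repair::Shift("{".into()),
        Repair::Delete("if".into()),
        Repair::Delete("true".into()),
    ];
    println!("{:?}", rs);
}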

A repair sequence is just that: an ordered sequence of repairs. For example, the first repair sequence above means that the input will be repaired to be equivalent to:

class C {
    void f() {
        { }
    }
}
while the second repair sequence will repair the input to be equivalent to:
class C {
    void f() {
        if (true) { }
    }
}
As this shows, CPCT+ is doing something very different to traditional error recovery: it’s repairing input spanning multiple tokens in one go. This is perfectly complementary to repairing syntax errors at different points in a file as this example shows:

[Click to start/stop animation]
Although CPCT+ can return multiple repair sequences, it would be impractical to keep all those possibilities running in parallel — I also doubt that users would be able to interpret the resulting errors! lrpar thus takes the first repair sequence returned by CPCT+, applies it to the input, and continues parsing.

At this point you might be rather sick of Java examples. Fortunately, there's nothing Java specific about CPCT+. If I feed it Lua’s Lex and Yacc files and broken input it'll happily repair that too [7]:


[Click to start/stop animation]
Indeed, CPCT+ will happily perform error recovery on any other language for which you can write a Yacc grammar.

Ultimately, CPCT+ has one main novelty relative to previous members of the Fischer et al. family: it presents the complete set of minimum cost repair sequences to the user where other approaches non-deterministically present just one member of that set. In other words, when, for our original Java example, CPCT+ presented this to users:

Parsing error at line 2 column 11. Repair sequences found:
  1: Insert ,
  2: Insert =
  3: Delete y
approaches such as Corchuelo et al. would only have presented one repair sequence to the user. Since those approaches are non-deterministic, each time they’re run they can present a different repair sequence to the one before, which is rather confusing. The intuition behind “minimum cost repair sequence” is that we want to prioritise repair sequences which do the smallest number of alterations to the user’s input: insert and delete repairs increase a repair sequence’s cost, although shift repairs are cost-neutral.
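
To make the cost model concrete, here is a tiny sketch reusing the hypothetical Repair enum from the earlier sketch (lrpar’s real cost calculation differs):

// Sketch of the cost notion: inserts and deletes cost 1, shifts are free.
fn cost(repairs: &[Repair]) -> u64 {
    repairs
        .iter()
        .map(|r| match r {
            Repair::Insert(_) | Repair::Delete(_) => 1,
            Repair::Shift(_) => 0,
        })
        .sum()
}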

To my mind, in the example above, both ‘Insert ,’ and ‘Insert =’ are equally likely to represent what the programmer intended, and it’s helpful to be shown both. ‘Delete y’ is a bit less likely to represent what the programmer intended, but it’s not a ridiculous suggestion, and in other similar contexts would be the more likely of the 3 repair sequences presented.

Under the bonnet

The paper and/or the code are the places to go if you want to know exactly how CPCT+ works, but I’ll try and give you a very rough idea of how it works here.

When lrpar encounters a syntax error, CPCT+ is started with the grammar’s statetable (the statemachine that an LR parser turns a grammar into; see e.g. this example), the current parsing stack (telling us where we are in the statetable, and how we got there), and the current position in the user’s input. By definition the top of the parsing stack will point to an error state in the statetable. CPCT+’s main job is to return a parsing stack and position to lrpar that allows lrpar to continue parsing; producing repair sequences is a happy by-product of that.

CPCT+ is thus a path-finding algorithm in disguise, and we model it as an instance of Dijkstra's algorithm. In essence, each edge in the graph is a repair, which has a cost; we’re looking to find a path that leads us to success. In this case, “success” can occur in two ways: in rare cases where errors happen near the end of a file we might hit the statetable’s sole accept state; more commonly, we settle for shifting 3 tokens in a row (i.e. we’ve got to a point where we can parse some of the user’s input without causing further errors). As this might suggest, the core search is fairly simple.
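
For the curious, here is a heavily simplified sketch of such a search. The neighbours and is_success closures are assumptions standing in for lrpar’s real machinery: the former would enumerate the insert/shift/delete repairs legal in a node’s statetable state, the latter would check for the accept state or three consecutive shifts:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Uniform-cost (Dijkstra-style) search over a graph of repairs: a sketch,
// not lrpar's implementation, which amongst other things merges compatible
// nodes (see the next paragraph).
fn search<N: Clone>(
    start: N,
    neighbours: impl Fn(&N) -> Vec<(N, u64)>, // (successor node, repair cost)
    is_success: impl Fn(&N) -> bool,
) -> Option<(N, u64)> {
    let mut nodes = vec![start];
    let mut todo = BinaryHeap::new();
    todo.push(Reverse((0u64, 0usize))); // (cost so far, index into `nodes`)
    while let Some(Reverse((cost, idx))) = todo.pop() {
        let n = nodes[idx].clone();
        if is_success(&n) {
            return Some((n, cost));
        }
        for (m, c) in neighbours(&n) {
            nodes.push(m);
            todo.push(Reverse((cost + c, nodes.len() - 1)));
        }
    }
    None
}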

Most of CPCT+’s complexity comes from the fact that we try to find all minimum cost paths to success and we need ways of optimising the search. There are a few techniques we describe in the paper to improve performance, so I’ll use what’s probably the most effective as an example. Our basic observation is that, when searching, once-distinct paths often end up reaching the same node, at which point we can model them as one henceforth. We therefore identify compatible nodes and merge them into one. The challenge is then how compatible nodes can be efficiently identified. We make use of an often forgotten facet of hashmaps: a node’s hash behaviour need only be a subset of its equality behaviour. In our case, nodes have three properties (parsing stack, remaining input, repair sequence): we hash based on two of these properties (parsing stack, remaining input) which quickly tells us if a node is potentially compatible; equality then checks (somewhat slowly) all three properties to confirm definite compatibility. This is a powerful optimisation, particularly on the hardest cases, improving average performance by 2x.
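
To illustrate that hashing trick in isolation, here is a sketch of the Rust idiom in question; the field types are stand-ins, not CPCT+’s actual node representation:

use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Nodes are equal only if all three fields match, but we hash just two of
// them: equal nodes still hash identically (so the Hash/Eq contract holds),
// and potentially compatible nodes land in the same bucket, where the full,
// slower, equality check decides definite compatibility.
#[derive(PartialEq, Eq)]
struct Node {
    parse_stack: Vec<u32>, // stand-in for the parsing stack
    position: usize,       // stand-in for the remaining input
    repairs: Vec<String>,  // repair sequence: checked by Eq, not hashed
}

impl Hash for Node {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.parse_stack.hash(state);
        self.position.hash(state);
    }
}

fn main() {
    let mut seen = HashSet::new();
    seen.insert(Node { parse_stack: vec![1, 2], position: 5, repairs: vec![] });
    let candidate = Node { parse_stack: vec![1, 2], position: 5, repairs: vec![] };
    println!("potentially compatible and equal? {}", seen.contains(&candidate));
}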

Ranking repairs

CPCT+ collects the complete set of minimum cost repair sequences because I thought that would best match what a user would hope to see from an error recovery algorithm. The costs of creating the complete set of minimum cost repair sequences were clear early on but, to my surprise, there are additional benefits.

The overarching problem faced by all approaches in the Fischer et al. family is that the search space is unbounded in size. This is why shifting a mere 3 tokens from the user’s input is enough for us to declare a repair sequence successful: ideally we would like to check all of the remaining input, but that would lead to a combinatorial explosion on all but a handful of inputs. Put another way, CPCT+’s core search is inherently local in nature: the repair sequences it creates can still cause subsequent spurious errors beyond the small part of the input that CPCT+ has searched.

The complete set of minimum cost repair sequences allows us to turn the very local search into a regional one, letting us rank repair sequences in a wider context. We take advantage of the fact that CPCT+'s core search typically only finds a small handful of repair sequences. We then temporarily apply each repair sequence to the input and see how far it can parse ahead without causing an error (up to a bound of 250 tokens). We then select the (non-strict) subset which has parsed furthest ahead and discard the rest. Consider this Java example:

class C {
    int x z() { }
}
When run through the “full” CPCT+ algorithm, two repair sequences are reported to the user:
Parsing error at line 2 column 11. Repair sequences found:
   1: Delete z
   2: Insert ;
However if I turn off ranking, we can see that the “core” of CPCT+ in fact generated three repair sequences:
Parsing error at line 2 column 11. Repair sequences found:
   1: Insert =
   2: Insert ;
   3: Delete z
Parsing error at line 2 column 15. Repair sequences found:
   1: Insert ;
In this particular run, lrpar chose to apply the ‘Insert =’ repair sequence to the input. That caused a spurious second error just beyond the region that CPCT+ had searched in. However, the other two repair sequences allow the whole file to be parsed successfully (though there is a semantic problem with the resulting parse, but that’s another matter). It might not be immediately obvious, but traditional Fischer et al. algorithms wouldn’t be able to throw away the ‘Insert =’ repair sequence and keep looking for something better, because they have no point of comparison that would allow them to realise that it’s possible to do better. In other words, the unexpected advantage of the complete set of minimum cost repair sequences is precisely that it allows us to rank repair sequences relative to one another and discard the less good.
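
A minimal sketch of that ranking step might look as follows; parse_distance is an assumed helper standing in for lrpar’s real “apply the repair sequence and parse ahead” machinery:

// Keep only the repair sequences that parse furthest ahead (up to a bound)
// when temporarily applied to the input. A sketch, not lrpar's actual code.
const LOOKAHEAD_BOUND: usize = 250;

fn rank<RS>(
    candidates: Vec<RS>,
    parse_distance: impl Fn(&RS, usize) -> usize, // tokens parsed without error
) -> Vec<RS> {
    let dists: Vec<usize> = candidates
        .iter()
        .map(|rs| parse_distance(rs, LOOKAHEAD_BOUND))
        .collect();
    let best = dists.iter().copied().max().unwrap_or(0);
    candidates
        .into_iter()
        .zip(dists)
        .filter(|(_, d)| *d == best)
        .map(|(rs, _)| rs)
        .collect()
}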

I’ve also realised over time that the ranking process (which requires about 20 lines of code) really kills two birds with one stone. First, and most obviously, it reduces spurious syntax errors. Second — and it took me a while to appreciate this — it reduces the quantity of low-quality repair sequences we present to users, making it more likely that users will actually check the repair sequences that are presented.

Lex errors

Like most parsing systems, grmtools separates out lexing (splitting input up into tokens) from parsing (determining if/how a sequence of tokens conforms to a grammar). This article isn’t the place to get into the pros and cons of this, but one factor is relevant: as soon as the lexer encounters an error in input, it will terminate. That means that parsing won’t start and, since error recovery as we’ve defined it thus far is part of the parser, the user won’t see error recovery. That leads to frustrating situations such as this:

[Click to start/stop animation]
Although it didn’t occur to me for a very long time, it’s trivial to convert lexing errors into parsing errors, and then have error recovery happen as normal. Even more surprisingly, this doesn’t require any support from lrpar or CPCT+. The user merely needs to catch input that otherwise wouldn’t lex by adding a rule such as the following at the end of their Lex file:
. "ERROR"
This matches a single character (the ‘.’) [8] as a new token type called ‘ERROR’. However, grmtools moans if tokens are defined in the Lexer but not used in the Yacc grammar so let’s shut it up by adding a dummy rule to the grammar:
ErrorRule: "ERROR" ;
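For context, the tail of a hypothetical Lex file with the catch-all in place might look something like this (the first three rules are invented for illustration; the only essential part is that the ‘.’ rule comes last):

%%
[0-9]+ "INT"
[a-zA-Z_][a-zA-Z_0-9]* "ID"
[ \t\n\r]+ ;
. "ERROR"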
Now when we run nimbleparse, CPCT+ kicks in on our “lexing error”:
Parsing error at line 2 column 11. Repair sequences found:
  1: Delete #, Delete y
  2: Insert ,, Delete #
  3: Insert =, Delete #
I find this satisfying for two reasons: first, because users get a useful feature for precisely no additional effort on lrpar’s part; and second because lexing and parsing errors are now presented uniformly to the user, where before they were confusingly separate. It would probably be a good idea to make this a core feature, so that we could do things like merge consecutive error tokens, but that wouldn’t change the underlying technique.

Integrating error recovery into actions

As far as I have been able to tell, no “advanced” error recovery algorithm has ever made its way into a long-lasting parsing system: I couldn't find a single implementation which I could run. Indeed, a surprising number of error recovery papers don’t even mention a corresponding implementation, though there must surely have been at least a research prototype at some point.

Whatever software did, or didn’t, exist, none of the papers I’ve read make any mention of how error recovery affects the use of parsers. Think about your favourite compiler: when it encounters a syntax error and recovers from it, it doesn’t just continue parsing, but also runs things like the type checker (though it probably refuses to generate code). Of course, the reason your favourite compiler is doing this is probably because it has a hand-written parser. How should we go about dealing with this in a Yacc-ish setting?

grmtools’ solution is surprisingly simple: action code (i.e. the code that is executed when a production is successfully parsed) isn’t given access to tokens directly, but instead to an enum which allows one to distinguish tokens inserted by error recovery from tokens in the user’s input. The reason for this is probably best seen via an example, in this case a very simple calculator grammar which calculates numbers as it parses:


[Click to start/stop animation]
In this case, what I chose to do when writing the calculator evaluator is to continue evaluating expressions with syntax errors in them, unless an integer was inserted. The reason for that is that I simply don’t have a clue what value I should insert if CPCT+ generated an ‘Insert INT’ repair: is 0 reasonable? what about -1? or 1? As this suggests, inserting tokens can be quite problematic: while one might quibble about whether evaluation should continue when CPCT+ deleted the second ‘+’ in ‘2 + + 3’, at least that case doesn’t require the evaluator to pluck an integer value out of thin air.

This is an example of what I’ve ended up thinking of as the semantic consequences of error recovery: changing the syntax of the user’s input often changes its semantics, and there is no way for grmtools to know which changes have acceptable semantic consequences and which don’t. For example, inserting a missing ‘:’ in Python probably has no semantic consequences, but inserting the integer 999 into a calculator expression will have a significant semantic consequence.

The good news is that lrpar gives users the flexibility to deal with the semantic consequences of token insertion. For example here's the grmtools-compatible grammar for the calculator language:

Expr -> Result<u64, Box<dyn Error>>:
    Expr '+' Term {
      Ok($1?.checked_add($3?)
            .ok_or_else(||
              Box::<dyn Error>::from("Overflow detected."))?)
    }
  | Term { $1 } ;

Term -> Result<u64, Box<dyn Error>>:
    Term '*' Factor {
      Ok($1?.checked_mul($3?)
            .ok_or_else(||
               Box::<dyn Error>::from("Overflow detected."))?)
    }
  | Factor { $1 } ;

Factor -> Result<u64, Box<dyn Error>>:
    '(' Expr ')' { $2 }
  | 'INT' {
      parse_int($lexer.span_str($1.map_err(|_|
        "<evaluation aborted>")?.span()))
    } ;
If you’re not used to Rust, that might look a little scary, so let’s start with some of the non-error recovery parts. First, the calculator grammar evaluates mathematical expressions as parsing occurs, and it deals exclusively in unsigned 64-bit integers. Second, unlike traditional Yacc, lrpar requires each rule to specify its return type. In this case, each rule has a return type Result<u64, Box<dyn Error>> which says “if successful this returns a u64; if unsuccessful it returns a pointer to a description of the error on the heap”. Put another way, ‘dyn Error’ is Rust’s rough equivalent to “this thing will throw an exception”.

As with traditional Yacc, the ‘$n’ expressions in action code reference the nth symbol in a production, where n starts from 1. Symbols reference either rules or tokens. As with most parsing systems, symbols that reference rules have the static type of that rule (in this example Result<u64, Box<dyn Error>>).

Where grmtools diverges from existing systems is that tokens always have the static type Result<Lexeme, Lexeme>. If you’re used to Rust that might look a bit surprising, as Result types nearly always contain two distinct subtypes, but in this case we’re saying that in the “good” case you get a Lexeme and in the “bad” case you also get a Lexeme. The reason for this is that the “good” case (if you’re familiar with Rust terminology, the ‘Ok’ case) represents a token from the user’s actual input and the “bad” case (‘Err’) represents an inserted token. Because Result is a common Rust type, one can then use all of the standard idioms that Rust programmers are familiar with.

Let’s first look at a simplified version of the first rule in the calculator grammar:

Expr -> Result<u64, Box<dyn Error>>:
    Expr '+' Term { $1? + $3? }
  | Term { $1 }
The Expr rule has two productions. The second of those (‘Term’) simply passes through the result of evaluating another rule unchanged. The first production tries adding the two integers produced by other rules together, but if either of those rules produced a dyn Error then the ‘?’ operator percolates that error upwards (roughly speaking: throws the exception up the call stack).

Now let’s look at a simplified (to the point of being slightly incorrect) version of the third rule in the grammar:

Factor -> Result<u64, Box<dyn Error>>:
    '(' Expr ')' { $2 }
  | 'INT' {
      parse_int($1.map_err(|_| "<evaluation aborted>")?)
    } ;
The second production references the INT token type. The action code then contains $1.map_err(|_| "<evaluation aborted>")? which, in English, says “if the token $1 was inserted by error recovery, throw a dyn Error”. In other words, the calculator grammar stops evaluating expressions when it encounters an inserted integer. However, if the token was from the user’s input, it is converted to a u64 (with parse_int) and evaluation continues.

If you look back at the original grammar, you can see that this grammar has made the decision that only inserted integers have unacceptable semantic consequences: inserting a ‘*’ for example allows evaluation to continue.

After parsing has completed, a list of parsing errors (and their repairs) is returned to users, so that they can decide how much further they want to continue computation. There’s thus no danger of lrpar repairing input and the consumer of the parse not being able to tell that error recovery occurred. However, you might wonder why lrpar only allows fine-grained control of insert repairs. Surely it could also allow users to make fine-grained decisions in the face of delete repairs? Yes, it could, but I don’t think that would be a very common desire on the part of users, nor can I think how one would provide a nice interface for them to deal with such cases. What lrpar has is thus a pragmatic compromise. It’s also worth noting that although the above may seem very Rust specific, I’m confident that other languages can find a different, natural encoding of the same idea.

Is it fast enough and good enough?

At this point you might be convinced that CPCT+ is a good idea in theory, but are unsure if it’s usable in practice. To me, there are two important questions: is CPCT+ fast enough to be usable? and is CPCT+ good enough to be usable? Answering such questions isn't easy: until (and, mostly, after...) Carl Cerecke’s thesis, I’m not aware of any error recovery approach that had a vaguely convincing evaluation.

The first problem is that we need syntactically incorrect code to use for an evaluation but nearly all source code you can find in the wild is, unsurprisingly, syntactically correct. While there has been some work on artificially creating syntax errors in files, my experience is that programmers produce such a mind-boggling variety of syntax errors that it’s hard to imagine a tool accurately simulating them.

Unlike most previous approaches, we were fortunate that these days the BlueJ Java editor has an opt-in data-collection framework called Blackbox which records programmers (mostly beginners) as they’re editing programs. Crucially, this includes them attempting to compile syntactically incorrect programs. We thus extracted a corpus of 200,000 syntactically incorrect files which programmers thought were worth compiling (i.e. we didn’t take files at the point that the user was still typing in code). Without access to Blackbox, I don’t know what we’d have done: I’m not aware of any other language for which such a rich dataset exists.

There are a number of ways of looking at the “fast enough” question. On our corpus, the mean time CPCT+ spends on error recovery per file is 0.014s. To put that into context, that’s 3x faster than Corchuelo et al., despite the fact that CPCT+ calculates the complete set of minimum cost repair sequences, while Corchuelo et al. finishes as soon as it finds one minimum(ish) [9] cost repair sequence! I also think that the worst case is important. For various reasons, algorithms like CPCT+ really need a timeout to stop them running forever, which we set to a fairly aggressive 0.5s maximum per file, as it seems reasonable to assume that even the most demanding user will tolerate error recovery taking 0.5s. CPCT+ fully repaired 98.4% of files within the timeout on our corpus; in comparison, Corchuelo et al. repaired 94.5% of files. In summary, in most cases CPCT+ runs fast enough that you’re probably not going to notice it.

A much harder question to answer is whether CPCT+ is good enough. In some sense, the only metric that matters is whether real programmers find the errors and repairs reported useful. Unfortunately, that’s an impractical criterion to evaluate in any sensible period of time. Error recovery papers which try to do so typically have fewer than 20 files in their corpus which leads to unconvincing evaluations. Realistically, one has to find an alternative, easier to measure, metric which serves as a proxy for what we really care about.

In our case, we use the total number of syntax errors found as a metric: the fewer the better. Although we know our corpus has at least 200,000 errors (at least 1 per file), we don’t know for sure how many more than that there are. There’s therefore no way of absolutely measuring an error recovery algorithm using this metric: all we can do is make relative comparisons. To give you a baseline for comparison, panic mode reports 981,628 syntax errors while CPCT+ reports 435,812. One way of putting this into context is that if you use panic mode then, on average, you end up with an additional spurious syntax error for each syntax error that CPCT+ reports i.e. panic mode is much worse on this metric than CPCT+. Comparing CPCT+ with Corchuelo et al. is harder, because although Corchuelo et al. finds 15% fewer syntax errors in the corpus than does CPCT+, it also fails on more files than CPCT+. This is almost certainly explained by the fact that Corchuelo et al. is unable to finish parsing the hardest files which sometimes contain an astonishingly large number of syntax errors (e.g. because of repeated copy and paste errors).

Ultimately a truly satisfying answer to the “is CPCT+ good enough?” question is impossible — we can’t even make a meaningful comparison between CPCT+ and Corchuelo et al. with our metric. What we can say, however, pretty conclusively is that CPCT+ is much better than panic mode, the only other equivalent algorithm that’s ever seen somewhat widespread use.

Limitations and future work

Few things in life are perfect, and CPCT+ definitely isn’t. There’s also clear scope to do better, and if I had a spare year or two to devote to the problem, there are various things I’d look at to make error recovery even better.

CPCT+ only tries repairing input at the point of a syntax error: however, that is often later than the point that a human would consider that they made an error. It’s unrealistic to expect CPCT+, or some variant of it, to deal with situations where the “cause” and “result” of an error are spread far apart. However, my experience is that the cause of an error is frequently just 1 or 2 tokens before the point identified as the actual error. It would be interesting to experiment with “rewinding” CPCT+ 1 or 2 tokens in the input and seeing if that’s a good trade-off. This isn’t trivial in the general case (mostly due to the parsing stack), but might be quite doable in many practical cases.

As we said earlier, CPCT+’s search is inherently local and, even with repair ranking, it can suggest repair sequences which cause spurious errors. There are two promising, complementary, possibilities that I think might lessen this problem. The first is to make use of a little known, and beautiful, approach to dealing with syntax errors: non-correcting error recovery. This works by discarding all of the input before the point of a syntax error and using a modified parser to parse the remainder: it doesn’t tell the user how to repair the input, but it does report the location of syntax errors. Simplifying a bit, algorithms such as CPCT+ over-approximate true syntax errors (i.e. they report (nearly) all “true” syntax errors alongside some “false” syntax errors) whereas non-correcting error recovery under-approximates (i.e. it misses some “true” syntax errors but never reports “false” syntax errors). I think it would be possible to use non-correcting error recovery as a superior alternative to CPCT+’s current ranking system and, perhaps, even to guide the search from the outset. Unfortunately, I don’t think that non-correcting error recovery has yet been “ported” to LR parsing, but I don’t think that task is insoluble. The second possibility is to make use of machine learning (see e.g. this paper). Before you get too excited, I doubt that machine learning is a silver bullet for error recovery, because the search space is too large and the variety of syntax errors that humans make is quite astonishing. However, my gut feeling is that machine learning approaches will be good at recovering non-local errors in a way that algorithms like CPCT+ are not.

Less obviously, some Yacc grammars lend themselves to good repairs more than others. Without naming any names, some grammars are surprisingly permissive, letting through “incorrect” syntax which a later part of the compiler (or, sometimes, the parser’s semantic actions) then rejects [10]. The problem with this is that the search space becomes very large, causing CPCT+ either to produce large numbers of surprising repair sequences or, in the worst cases, to fail to finish its search in the allotted time. One solution is to rewrite such parts of the grammar to specify the acceptable syntax more accurately, though this is much easier for me to suggest than for someone to actually carry out. Another solution might be to provide additional hints to CPCT+ (along the lines of lrpar’s %avoid_insert directive) that enable it to narrow down its search.

Programmers unintentionally leave hints in their input (e.g. indentation), and languages have major structural components (e.g. block markers such as curly brackets), both of which error recovery can take into account (see e.g. this approach). To take advantage of this, the grammar author would need to provide hints such as “what are the start / end markers of a block?”. Such hints would be optional, but my guess is that most grammar authors would find the resulting improvements sufficiently worthwhile that they’d be willing to invest the necessary time to understand how to use them.

Finally, some parts of grammars necessarily allow huge numbers of alternatives, and error recovery at those points is hard work. The most obvious examples of this are binary or logical expressions, where many different operators (e.g. ‘+’, ‘||’ etc.) are possible. This can explode the search space, occasionally causing error recovery to fail or, more often, causing CPCT+ to generate an overwhelming number of repair sequences. My favourite example of this – and this is directly taken from a real example, albeit with modified variable names! – is the seemingly innocent, though clearly syntactically incorrect, Java expression x = f(""a""b);. CPCT+ generates a comical 23,607 repair sequences for it. What is the user supposed to do with all that information? I don't know! One thing I experimented with at some point was making the combinatorial aspect explicit, so that instead of:

Insert x, Insert +, Insert y
Insert x, Insert +, Insert z
Insert x, Insert -, Insert y
Insert x, Insert -, Insert z
the user would be presented with:
Insert x, Insert {+, -}, Insert {y, z}
For various boring reasons, that feature was removed at some point, but writing this down makes me think that it should probably be reintroduced. It wouldn’t completely solve the “overwhelming number of repair sequences” problem, but it would reduce it, probably substantially.
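
To make the idea concrete, here is a minimal sketch of that kind of compression. It is entirely my own illustration (the function names and the greedy merging strategy are not taken from lrpar): repair sequences of the same length which differ at exactly one position are repeatedly merged until no further merges are possible.

from itertools import product

def compress(seqs):
    # Represent each repair sequence as a tuple of per-position alternative
    # sets, then greedily merge any two sequences of the same length which
    # differ at exactly one position.
    work = {tuple(frozenset([x]) for x in seq) for seq in seqs}
    changed = True
    while changed:
        changed = False
        for a, b in product(list(work), repeat=2):
            if a is b or len(a) != len(b):
                continue
            diffs = [i for i in range(len(a)) if a[i] != b[i]]
            if len(diffs) == 1:
                i = diffs[0]
                work -= {a, b}
                work.add(a[:i] + (a[i] | b[i],) + a[i + 1:])
                changed = True
                break
    return work

def pretty(seq):
    # Print single alternatives as-is; factor the common "Insert " verb out of
    # merged alternatives so the output matches the compressed form above.
    out = []
    for alts in seq:
        if len(alts) == 1:
            out.append(next(iter(alts)))
        elif all(x.startswith("Insert ") for x in alts):
            out.append("Insert {%s}" % ", ".join(sorted(x[7:] for x in alts)))
        else:
            out.append("{%s}" % ", ".join(sorted(alts)))
    return ", ".join(out)

seqs = [("Insert x", "Insert +", "Insert y"),
        ("Insert x", "Insert +", "Insert z"),
        ("Insert x", "Insert -", "Insert y"),
        ("Insert x", "Insert -", "Insert z")]
for seq in compress(seqs):
    print(pretty(seq))

Run on the four repair sequences above, this prints the single compressed line shown earlier.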

Summary

Parsing is the sort of topic that brings conversations at parties to a standstill. However, since every programmer relies upon parsing, the better a job we can do of helping them fix errors quickly, the better we make their lives. CPCT+ will never beat the very best hand-written error recovery algorithms, but what it does do is bring pretty decent error recovery to any LR grammar. I hope and expect that better error recovery algorithms will come along in the future, but CPCT+ is here now, and if you use Rust, you might want to take a look at grmtools — I'd suggest starting with the quickstart guide in the grmtools book. Hopefully Yacc-based parsing systems for other languages will port CPCT+, or something like it, to their implementations, because there isn’t anything very Rust specific about CPCT+, and it’s a fairly easy algorithm to implement (under 500 lines of code in its Rust incarnation).

Finally, one of the arguments that some people, quite reasonably, use against LR parsing is that it has poor quality error recovery. That’s a shame because, in my opinion, LR parsing is a beautiful approach to parsing. I hope that CPCT+’s error recovery helps to lessen this particular obstacle to the use of LR parsing.

Acknowledgements: My thanks to Edd Barrett and Lukas Diekmann for comments.


Footnotes

[1] Often said to be the result of “fat fingers”. I have skinny fingers but, alas, this does not seem to have translated to high typing accuracy.
[2] Typically using the FOLLOW set.
[3] When panic mode was introduced, computers were so slow that I expect even bad error recovery was probably semi-useful: if a compiler takes several minutes to run, each genuine error reported to the programmer is valuable, and it’s worth risking them seeing a fairly high proportion of false errors. In contrast, when compilers (mostly) take at most a few tens of seconds, the acceptable proportion of false errors is much lower.
[4] There are languages for which one can fairly easily write a Yacc grammar but not a Lex lexer (e.g. a Python-ish language with indentation-based syntax). Systems like grmtools allow you to use your own (possibly hand-written) lexer for precisely this reason. Fortunately, writing a good quality lexer by hand is generally a fairly easy task.
[5] As a parsing research dilettante, I've long had a suspicion that proper parsing researchers have to sign up to an elaborate, long running joke, whereby they agree to leave out, or otherwise obscure, important details. Alas, most error recovery papers are also part of this conspiracy, so I lost months to misunderstandings of various papers. The most frustrating part is that I wonder if I’ve unintentionally signed up to the conspiracy too: I have no idea whether other people can make sense of the parsing papers I’ve been part of or not...
[6] Note that the order that the three repair sequences are presented in is nondeterministic, so if you run it enough times you’ll see the repair sequences printed in all possible orderings.
[7] If you’re wondering why I used the ‘-q’ option with nimbleparse, it’s because nimbleparse prints debugging information for ambiguous grammars. Lua’s grammar has an inherent ambiguity (with the Lua manual containing the slightly scary statement that “The current parser always [resolves] such constructions in the first way”), which causes nimbleparse to print out its full stategraph and the ambiguities. Even for a small grammar such as Lua’s, the stategraph is not small. As this shows, Yacc input grammars can be ambiguous, though they will be statically converted into a fully unambiguous LR parser. Whether this is a good or bad feature is something that I’ll leave to the philosophers amongst you to debate.
[8] Although ‘.’ is a safe default, there’s no reason why the error token has to be defined that way. However, one has to be careful when using this technique with Lex, because its “longest match” rule means that the text matched by the error token must never be longer than that matched by any other token, otherwise large chunks of input end up being incorrectly classified as an error token. In case you’re wondering, yes, I spent half an hour when writing this blog post wondering why my attempts to be clever weren’t working before remembering the longest match rule.
[9] The original Corchuelo et al. has a bug that means that it misses some minimum cost repair sequences.
[10] My experience so far suggests that this is more often the case in less well used parts of a grammar.

Stick or Twist?

October 7 2020

Download:  Opus (2.1MiB)   mp3 (8.1MiB)
All of us have points in our lives where we have to decide whether we should continue down our current path or change to another. As a researcher, I often face a constrained version of this problem: should I continue on my current research path or change to another? For a long time I wasn’t aware that I was being faced with such decisions; and, when I did become aware, I wasn’t sure how best to make a decision. Over time I’ve realised that a simple “stick or twist?” heuristic mostly works well for me. I don’t claim that anything in this article is novel, nor do I think I’m describing an approach that’s applicable to every situation — but it might provide some useful food for thought.

Let’s start by dismissing a common heuristic: wait and see. When faced with a hard decision, most of us have a tendency to hope that the underlying problem magically disappears: sometimes we consciously delay making a choice, but often we try and pretend the choice doesn’t exist at all. Although it sounds like a cliché, I’m a firm believer that, in general, not making a decision is equivalent to making a decision. If I dither over whether to buy a shop’s last cake or not, someone else will buy it: I’m then, involuntarily, in exactly the same situation as if I’d decided not to buy the cake. If used too often, “wait and see” turns us into corks on the ocean, entirely at the mercy of events. Except for a small number of exceptionally poor decisions, I’ve found that I personally regret decisions I didn’t take more than decisions I did take.

Next, let’s dismiss the two extreme heuristics for making decisions: never change (which is similar, though with a very different motivation, to “wait and see”) or always change [1]. Because we live in a continually changing world, it is inevitable that even a once-optimal path will become suboptimal over time: not changing guarantees irrelevance in the long run. At the other extreme, change for its own sake means that even when we stumble onto an optimal path, we’ll change off it for no good reason.

Although I didn't realise it at the time, my first professional encounter with the need to make a decision about my research direction happened during my PhD. I was part of a research group doing what is now called “software modelling” but which back then was mostly just called UML. As part of that, I started attending international UML standards meetings. I soon realised that most people at these meetings shared a common vision, which is that UML should take over from programming languages. In other words, rather than programming using text, they wished to use UML’s block-and-line diagrams instead. UML had originally been intended to represent high-level program structure (the software equivalent of architectural blueprints), not low-level detail. I wasn’t sure that this was desirable, nor was I sure that it was possible. However, since I was often the youngest person in the room by 15 or more years, I assumed that I must be missing out on a profound insight.

Those doubts grew stronger when, in around 2003, I found myself in a lift with a senior member of the UML standards committee. Somehow his conversation turned to the evil programming languages people, who were somehow conspiring to prevent software from being written in UML. I have never had much truck with conspiracy theories, so I asked him outright how UML could be used for operating system kernels, which have hundreds of thousands of lines of code. He replied, in a tone which suggested he did not expect disagreement, “in 5 years, Linux will be written in UML”. Naive as I was, it was obvious to me that this was a ludicrous timeline and it made me wonder what other unpleasant details the community might have been ignoring.

The confirmation point came a few months later at a more academic meeting. A senior Professor got up to demonstrate the tool his research group had slaved over for years, which allowed one to translate textual programs to UML and back again. He brought up a simple program consisting of 5 lines of text, pressed a button, and transformed it into equivalent UML. The diagrammatic representation was so verbose that it required scrolling over two screens. I stayed silent but someone else politely, and I’m fairly sure innocently, asked “isn’t the diagrammatic version a bit harder to understand?” The Professor looked absolutely stunned by the question: “but... it’s diagrammatic!” was all he could muster. A couple of other people agreed that it looked a bit harder to understand. The Professor was crushed: rarely before or since have I seen a man shift from a confident to a haunted look so quickly.

By this point I had a clear explanation in my head for why UML could not replace programming languages: diagrams are excellent at displaying a small number of elements but they do not scale past what you can squeeze onto a single screen [2]. Even beginners write programs which are too big by this metric!

I then made what I now realise is a foolish mistake: because I could articulate what I thought was a compelling reason why UML would not scale, I assumed everyone else would come to the same conclusion and the field would collapse. 15 years after I thought this, the (renamed) UML research field is, I believe, about the same size as when I left it.

Why was I so wrong? I suspect that most people in the UML community would probably agree with me that UML has problems scaling. However, I suspect that they think the problem will gradually disappear over time whereas I cannot see a realistic way in which it will be solved. However, neither side’s belief can be validated at the moment and it is unlikely that my belief can ever be completely validated. In retrospect, I should have been more humble about my judgment on UML. It would have been enough for me to say that, in my opinion, the probability of solving UML's scaling difficulties was lower than the probability of it being unsolvable.

However, whatever my motivation was, or should have been, I did change direction, into the field of programming languages. I created Converge, whose main aim was to allow people to extend the language by embedding DSLs (Domain Specific Languages) within it. From a technical perspective this worked well enough (see e.g. an assembler DSL or a statemachine DSL) but it required users to differentiate the DSLs with special brackets ($<<...>>). To my surprise, those special brackets turned out to undermine the whole idea: they are so visually intrusive that if a screen of code contains more than a couple of them, it becomes much harder to read than normal. Unfortunately, the brackets are necessary, because without them it is impossible to parse files [3]. Slowly — and painfully because I had spent several years of my life on this — I was forced to conclude that this approach would never be viable. After several years of feeling stuck (and trying, and failing, to move onto other research problems), I realised that it might be possible to solve the problem of language embedding in an entirely different fashion (which ended up in our language composition work).

However, before I had got that far, I made another mistake. Underlying DSLs in Converge is a macro system (or, more formally, "compile-time meta-programming") derived from Template Haskell. Since DSLs in Converge had failed, I generalised further that macros aren't very useful in modern programming languages [4]. When I later gave a talk titled something like “macros aren’t useful”, other macro researchers were, with one exception [5], bemused by my explanation, thinking it comically simplistic. In retrospect, I now realise that we were tackling subtly different problems: I was trying to embed multiple DSLs within a single normal file while they were identifying whole files as being of one DSL or another. Macro-ish techniques work better for the latter than the former because there is no need to have any syntax to identify one language from another. In other words, my generalisation was subtly flawed.

Those two examples represent some of the more consequential times that I’ve reconsidered my direction of travel, but there are others. After a while, I came to understand what my “stick or twist?” heuristic is: I now think of it as continually searching for the simplest plausible reason why my current direction is wrong. When I find a sufficiently simple reason, and once I’ve convinced myself the reason is plausibly true, I feel that the best course of action is to change direction. Why a “simple reason”? Because experience has taught me that until and unless I can distill a problem down to a very simple concrete explanation, I’ve not actually understood the problem. Why “plausible”? Because it’s rarely possible to be certain about such matters. Once the possibility of being wrong has become sufficiently high, I’d rather risk abandoning old work that might have succeeded rather than getting endlessly stuck on a problem I can't solve [6].

As both examples I’ve given might suggest, other people can come to a different conclusion than me. Reasonable people can often disagree about the probability of a risk manifesting, and different people often have very different thresholds for when they consider a risk worth taking. Even the same person can have different thresholds in different areas: I suspect my willingness for taking a risk in research is higher than average, but in other areas of my life it’s definitely lower than average. So, just because I’ve decided to move on doesn’t mean that other people have made the wrong decision by staying. Indeed, I really hope that they prove me wrong! Wouldn’t it be great if the UML community could improve on textual programming? I might think that outcome unlikely, but if I'm wrong, the resulting dent to my ego will be more than worth it.

Before you worry that I'm indulging in false humility, let me assure you of the true meanness of my spirit by stating that I think not all stick or twist heuristics are equally good. In particular, there is a polar opposite heuristic to mine: continually searching for the most complex reason why my current direction is right. I doubt that anyone would admit to this heuristic out loud, but it is not hard to find people using it, for example within politics or, alas, in some academic disciplines. Why a “complex reason”? There seem to me to be two different causes. Some people seem to be scared that if they’re clear about their thoughts they’ll be exposed as charlatans. Others simply want to win, irrespective of whether their argument is correct or not: a lack of clarity is one weapon they use to assert victory [7].

Inevitably, writing my stick or twist heuristic down is much easier than applying it. First, my ego does its best to block any thoughts that I might be wrong. Second, even when I identify a possible problem with my current direction, and a plausible explanation to explain it, acknowledging it to the point of taking action requires surprising will-power. Third, hard decisions are generally hard because we lack concrete data to guide us. I have to rely on my own experience or readings to deduce a plausible direction — and, largely, trust to luck. It can feel frightening to make decisions knowing that.

Despite these issues, anyone who’s bored enough to look over my research papers will be able to identify at least two other major changes of direction similar to those noted above, each of which was the result of using the same heuristic. Less obvious are the more numerous minor changes I make using this heuristic: when programming, I can sometimes go through several such decisions in a single day.

Finally, you may have noticed that I’ve used “change direction” in two different senses. In the first instance, I abandoned a research field entirely and moved to another; in the second, I merely abandoned a particular approach to tackling a problem, later trying a different approach to that same problem. In both cases, I had to put aside several years of work, but the first course of action might be seen by some people as the more dramatic of the two. To me, they're indistinguishable in magnitude: the real difficulty was identifying the problem, simplifying it, and acknowledging it. Personally, I hope I have many changes of direction yet to come!

Acknowledgements: Thanks to Chun Nolan for comments.


Footnotes

[1] I’ve not met anyone who sits precisely at either extreme, but I’ve met people who come surprisingly close.
[2] You can see this clearly with state diagrams. Small examples are nearly always clearer than the textual equivalent; but once they get just a little too big, the textual equivalent is nearly always superior.
[3] Alas, this was neither the first nor the last time that parsing has caused me problems — nearly always because of my ignorance about what is and isn’t possible.
[4] I was wrong: in statically typed languages you often need macros to generate code that the type system would otherwise forbid. This is, of course, the original motivation of Template Haskell but I was too stupid to appreciate this.
[5] A senior member of the community stood up and said “he’s right but for the wrong reasons”. Sometimes one has to take praise in whatever form it comes!
[6] When I was an undergraduate and a PhD student, I was often surprised by older academics whose research seemed stuck in an ancient rut. In fact, “surprised” is a polite way of putting it: I laughed at such people. I’m ashamed to say that one was a gentle, thoughtful lecturer called Malcolm Bird (who died a few years ago). He seemed to us undergraduates to be hopelessly out of date. A couple of years later, after I’d started a PhD, a newly arrived Argentinian PhD student was astonished that our Department had the Malcolm Bird in it. Malcolm, it turns out, had once solved a hard and important problem in computer science. I later found out that as an academic he had continuously taken on more than his fair share of work. In other words, it’s plausible that, in only slightly different circumstances, Malcolm would have had a stellar research career and have been known as one of our subject’s great minds. The realisation that circumstances outside someone’s control can prevent that person realising their “full potential” was a sobering one. It’s too late to apologise to Malcolm in person, so writing about this shameful episode publicly is about as close as I can get.
[7] What I’m never sure with such people is whether their external lack of clarity reflects an interior lack of clarity or not.

Which Parsing Approach?

September 15 2020

We all know that parsing is an important part of designing and implementing programming languages, but it’s the equivalent of Brussels sprouts: good for the diet, but a taste that only a select few enjoy. Unfortunately, I’ve come to realise that our general distaste for parsing is problematic. While many of us think that we’ve absorbed the advances of the 1960s into our collective understanding, I fear that we have regressed, and that we are often making inappropriate decisions about parsing. If that sounds accusatory, I don’t mean it to be: I spent over 20 years assuming that parsing is easy and that I didn’t need to understand it properly in order to use it well. Alas, reality has been a cruel teacher, and in this post I want to share some of the lessons I’ve been forced to slowly learn and acknowledge.

Let’s start with the basics. A grammar encodes the syntax rules for a given language. Parsing is the act of taking in an input (e.g. a source file) and determining if, and how, it corresponds to a grammar. At its most basic level, parsing just says “this input does/doesn’t correspond to the grammar”. That’s rarely useful for programming languages, so we normally execute semantic actions while parsing, allowing us to, for example, build a parse tree that represents the input as a tree. If I have a simple calculator grammar and the input 2-3*4 I might get back a tree that looks like the following:

[Figure: parse tree for 2-3*4, with ‘-’ at the root, ‘2’ as its left child, and a ‘*’ node (whose children are ‘3’ and ‘4’) as its right child.]

For the rest of this post, I’ll represent trees as “pretty printed text”, where brackets allow us to succinctly express how the tree is structured. For example the above tree is equivalent to (2-(3*4)). I’m going to assume that “parsing” means “check correspondence to the grammar and build a parse tree”. I’m also going to simplify other parsing jargon and nomenclature whenever I can, to try and keep things somewhat comprehensible and the length somewhat manageable.

Recursive descent

There are a bewildering number of ways that one can parse an input, so I’ll start with what is probably the most common: a hand-written parser. While that could mean just about anything, nearly everyone who writes a half-decent hand-written parser, whether they know it or not, is writing a recursive descent parser [1]. The idea is relatively simple: one writes a series of functions which examine the input string at a given position and, if they match at that position, advance the parse. For example, a first attempt at a recursive descent parser in Python that can parse the simple calculator language above might look as follows:

NUMBERS = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
OPERATORS = ["-", "*"]

class Number:
  def __init__(self, n): self.n = n
  def __str__(self): return str(self.n)
class Mul:
  def __init__(self, lhs, rhs):
    self.lhs = lhs
    self.rhs = rhs
  def __str__(self): return "(%s*%s)" % (str(self.lhs), str(self.rhs))
class Sub:
  def __init__(self, lhs, rhs):
    self.lhs = lhs
    self.rhs = rhs
  def __str__(self): return "(%s-%s)" % (str(self.lhs), str(self.rhs))

def parse_expr(s, i):
  if i == len(s) or s[i] not in NUMBERS:
    return
  j = i
  while j < len(s) and s[j] in NUMBERS:
    j += 1
  lhs = Number(s[i:j])
  if j == len(s) or s[j] not in OPERATORS:
    return (j + 1, lhs)
  op = s[j]
  r = parse_expr(s, j + 1)
  if r is None:
    return
  (i, rhs) = r
  if op == "-":
    return (i, Sub(lhs, rhs))
  else:
    assert op == "*"
    return (i, Mul(lhs, rhs))

def parse(s):
  r = parse_expr(s, 0)
  if r is None or r[0] <= len(s):
    return "Syntax error"
  return r[1]

print(parse("2-3*4"))
The idea is relatively simple: we have a string ‘s’ we’re parsing, with the variable ‘i’ telling us how far we’ve parsed so far. If it was able to parse part of the input starting at ‘i’ then the parse_expr function returns a pair (i, tree) telling us how far it parsed and giving us back the tree it created; if it failed it returns None. When I parse 2-3*4 it prints:
(2-(3*4))
In other words, if we were to evaluate that tree, we’d get a result of -10 – success! Admittedly, that has come at a cost: the recursive descent parser has quite a lot of boiler-plate to ensure that it doesn’t do something silly and that any syntax errors encountered cause parsing to stop. For example, if you remove the check on lines 40 and 41, then 2abc will parse successfully returning Number(2), ignoring the fact that abc couldn’t be parsed! There are ways to reduce the boilerplate, but if you write recursive descent parsers for a living, you have to learn to live with it.

Unfortunately, if I try parsing 2*3-4 I get a surprising result:

(2*(3-4))
We’ve all been taught from a young age that the grammar of mathematics requires ‘*’ to “bind harder” than ‘-’. Put more formally, ‘*’ is said to have a higher precedence than ‘-’. Unfortunately, my hand-written recursive descent parser has given both operators the same precedence. If I was to evaluate that tree, I’d get -2 instead of the 2 we’d have expected from the original expression.

Fortunately, there’s a fairly standard way to encode operator precedence which, in the style of the above parser, can be written as follows:

def parse_expr(s, i):
  r = parse_factor(s, i)
  if r is None:
    return
  (i, lhs) = r
  if i < len(s) and s[i] == "-":
    r = parse_expr(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    return (i, Sub(lhs, rhs))
  return (i, lhs)

def parse_factor(s, i):
  r = parse_term(s, i)
  if r is None:
    return None
  (i, lhs) = r
  if i < len(s) and s[i] == "*":
    r = parse_factor(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    return (i, Mul(lhs, rhs))
  return (i, lhs)

def parse_term(s, i):
  if i == len(s) or s[i] not in NUMBERS:
    return
  j = i
  while j < len(s) and s[j] in NUMBERS:
    j += 1
  return (j, Number(s[i:j]))

def parse(s):
  r = parse_expr(s, 0)
  if r is None or r[0] < len(s):
    return "Syntax error"
  return r[1]
If I parse these expressions:
print(parse("2-3*4"))
print(parse("2*3-4"))
I can see that I get the expected output:
(2-(3*4))
((2*3)-4)

Success at last! Well, not quite, because if I parse 2-3-4 I get another surprising result:

(2-(3-4))
Unfortunately, as this example shows, we’re incorrectly parsing operators as right associative when they should be left associative. In other words, when we see a sequence of subtractions, earlier subtractions should be matched before later subtractions. Fixing this might seem like it’s easy, but it’s not: the “obvious” way of implementing left associativity in a recursive descent parser causes an infinite loop. Fixing that is more involved than I want to get into here: see this page for an approachable summary of solutions to this problem.
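
For the curious, here is a minimal sketch of one of those solutions (my own rendering, not taken from the page linked above), reusing the Sub class and parse_factor function from the parser above: instead of recursing on the right-hand side, parse_expr loops, folding each newly parsed operand into the left-hand tree, so that 2-3-4 builds ((2-3)-4). parse_factor would need the same treatment to make ‘*’ left associative.

def parse_expr(s, i):
  # Parse the first operand, then repeatedly fold further "- operand" pairs
  # into the left-hand side, giving left associativity without left recursion.
  r = parse_factor(s, i)
  if r is None:
    return
  (i, lhs) = r
  while i < len(s) and s[i] == "-":
    r = parse_factor(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    lhs = Sub(lhs, rhs)
  return (i, lhs)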

It might be tempting to see these problems as the result of an idiot (me) writing a parser for a language they don’t sufficiently understand (mathematics). I hope you can see that there’s also something deeper going on. The underlying problem is that the grammar I wanted to write is ambiguous: 2-3*4 can be parsed as equivalent to 2-(3*4) or (2-3)*4. It is often said that recursive descent parsers are inherently unambiguous. While true, this makes a virtue out of a vice: recursive descent parsers are unambiguous simply because they ignore ambiguities. Put another way, whenever a recursive descent parser encounters a point at run-time where an input can be parsed ambiguously, it arbitrarily chooses one of the possibilities, and charges onwards as if the other possibilities had never existed. Significantly, the parser author is not notified that this happened. Since recursive descent parsers are just normal programs, it’s unlikely that we’ll ever be able to make a static analysis that can take such a parser and reliably tell us at compile-time all the points of ambiguity.

It is therefore probably not a coincidence that recursive descent parsers have no real “theory”. Notably, they do not have any known relationship to the class of grammars we understand best — Context-Free Grammars (CFGs). For example, we do not know, in general, the language which a recursive descent parser will accept: all we can do is throw ever more inputs at it and observe if, and how, it parses them, never knowing if another input will cause a surprising parse.

Over time, I’ve come to view recursive descent as the parsing equivalent of assembly programming: maximum flexibility, maximum performance, and maximum danger. Every non-trivial recursive descent parser I’ve seen converted to another formalism has led to unexpected ambiguities being uncovered. Sometimes this leads to incorrect parses (as above), but it just as often leads to seemingly correct input not being parsed at all [2]. There are good reasons for using recursive descent parsers (I’ll get to those later), but, in my opinion, if another formalism can be used, it generally should be.

Generalised parsers

At the opposite end of the spectrum are what have come to be called generalised parsers. There are various generalised parsing algorithms (e.g. Earley, GLL, and GLR) but, from the perspective of this post, they’re all equivalent. Each can parse any CFG (so they rest on a solid theory), even ambiguous ones (so you don’t have to worry about contorting your grammar), and they guarantee to tell you where all of the ambiguous points in the grammar are at run-time (so you don’t have to worry about things being unexpectedly mis-parsed).

These properties appear to make generalised parsing the solution to the problems noted above with recursive descent parsers. However, this comes at a cost. Consider the following grammar which, once again, parses the little subset of mathematics we’re using as an example:

Expr: Expr "-" Expr
    | Expr "*" Expr
    | "INT"
    ;
Given that grammar, many readers will have spotted an obvious point of ambiguity: 2-3*4 can be parsed as equivalent to (2-3)*4 or 2-(3*4). Generalised parsers are interesting because they generate both possibilities at run-time. It is possible for such parsers to return a “parse forest” (i.e. showing all the ambiguous possibilities), but that’s not very useful for programming languages: we expect compilers to settle on a single meaning for the programs we throw at them. We thus need to disambiguate the ambiguous possibilities so that we end up with a single parse tree. An easy way of doing this is to assign a precedence to a rule’s productions so that if, at one point in the parse, more than one of its productions match, we can pick the one with the highest precedence. For example, I might rewrite my grammar to look as follows:
Expr: Expr "-" Expr %precedence 10
    | Expr "*" Expr %precedence 20
    | "INT"
    ;
Assuming that “higher” precedences mean “bind tighter”, this will then parse 2-3*4 as equivalent to 2-(3*4).

My experience is that fewer people (including, from bitter experience, me) spot a second ambiguity in the above grammar: 2-3-4 can be parsed as left (i.e. (2-3)-4) or right (i.e. 2-(3-4)) associative (because of rules such as Expr "-" Expr). Unfortunately, precedences are not sufficient to disambiguate between those two possibilities: one either needs to rewrite the grammar or use a different disambiguation operator [3].

While the good news is that a generalised parser will reliably tell us at run-time that it encountered an ambiguity, the bad news is that we generally have to wait until we encounter an input that is parsed ambiguously to discover that our grammar is ambiguous. There are some decent heuristics that will statically find many of the points of ambiguity, but they are just that — heuristics.

Over time, I’ve come to view generalised parsing as equivalent to dynamic typing: expressive and safe, but with more errors than necessary being deferred to run-time. I spent years trying to write arbitrary CFGs but, for complex grammars, I continually struggled to squeeze out all of the ambiguities [4]. I did not encounter a user who was happy with, or anything other than startled by, ambiguity errors: it is rather odd to be told that your input is valid but can’t be parsed. That said, I think generalised parsers have a part to play in language composition, where composing different grammars inherently leads to ambiguity. However, I no longer believe that generalised parsing is a good fit for “normal” parsing.

Statically unambiguous parsing

There are several parsing approaches which statically rule out ambiguity, bypassing one of the fundamental problems with generalised parsing. I’ll describe the two best known: LL and LR. In essence, these approaches describe subsets of the CFGs which provably contain only unambiguous grammars. It’s common to describe grammars which adhere to one of these subsets as being “a valid LL grammar” or similar.

However, as far as we know, it is not possible to define the complete subset of unambiguous CFGs, so there are unambiguous grammars which do not fit into these subsets. I thus find it easiest to think of these approaches as being analogous to a static type system: they are sound (i.e. if a grammar is a valid LL/LR grammar, it really is unambiguous) but not complete (some unambiguous grammars aren’t valid LL/LR grammars).

LL parsing

Although less common than in the past, LL parsing still underlies systems such as javacc. My personal bias is that LL parsers are largely unappealing, because the lack of left recursion makes expressing many standard programming language constructs as awkward as with recursive descent parsers. However, as hinted at by that commonality, LL grammars have one important feature: they naturally map to recursive descent parsers (but not necessarily vice versa). One can therefore ensure that a recursive descent parser is not accidentally steamrollering over ambiguities by creating an LL grammar and faithfully mapping it to a recursive descent parser.

To my mind, the combination of LL and recursive descent parser has a small but important niche: if you really, truly need the highest possible performance and/or you want the best possible error messages, this is probably the best route we know of. However, it comes at a significant cost. For a realistic programming language grammar, it will typically take many person months of effort [5] to beat an automatically generated parser. I therefore think this approach only makes sense for a small number of projects (notably industrial-strength compilers and IDEs).

LR parsing: prelude

The last of the major parsing approaches I’m going to look at is LR parsing. In common with many people, I spent years trying to avoid LR parsing because I had imbibed the common idea that LR parsing is a terrible thing. Instead I threw myself into other parsing approaches, notably Earley parsers [6].

Then, in late 2008, while bored in meetings, I started writing extsmail, a program mostly intended to send email via ssh. I thought it would be interesting to write this in the style of a traditional Unix daemon, something I had not attempted before. For the two configuration files extsmail needs, I therefore decided to use the traditional Unix daemon parsing tool Yacc. Not only had I not used Yacc before, I had neither used nor studied LR parsing — I suspected I would have quite a task on my hands. I was rather surprised when it turned out to be easy to write a grammar such as externals_parser.y.

However, I assumed that I had been lucky with these grammars, which are rather simple, and went back to avoiding LR parsing. Having realised that generalised parsing and ambiguity were causing me problems, I spent quite a while dabbling with PEG parsing (which is recursive descent in disguise) before eventually realising that was going to cause me different, but no less severe, problems relative to generalised parsing.

Later I stumbled across [7] Tim Wagner’s thesis on incremental parsing which became the pillar that Lukas Diekmann built upon to create Eco [8]. Wagner’s work uses LR parsing but I managed to involve myself a bit with Eco without actually understanding how LR parsing worked. Then, in 2015, when we were experimenting as a group with Rust, Lukas wrote the beginnings of an LR parser as an experiment, and I quickly jumped in and made a few modifications. Without really intending to, I started expanding the code until I realised I had taken on maintainership of what clearly had the potential to become a full Rust LR parser. At that point, I realised I actually needed to understand LR parsing. I found the explanations lurking on the web a bit confusing at first, but the algorithm was simple enough that soon enough I had a full, if basic, Rust LR parser (which became grmtools).

Why am I telling you this long, probably tedious, personal history? Because I want to emphasise that I went out of my way to avoid LR parsing, even though I didn’t really know what I was avoiding or why. Even after I had used LR parsing, and realised that it wasn’t the bogeyman I had expected, I still spent several years trying alternatives. Not only is that embarrassing to admit publicly, it also troubles me: how had I picked up a bias that took me so long to overcome? I’ve gradually alighted upon a plausible explanation for our community’s general dislike of LR parsing and, oddly enough, it relates to undergraduate compiler courses. For reasons that probably made sense in the 1970s and 80s, many compiler courses spend significant, arguably excessive, time on parsing — generally LR parsing. Students come in expecting to be taught how to generate machine code in clever ways, but instead have to learn all sorts of parsing background before they even get to the main LR algorithm. By that point they are thoroughly sick of parsing generally and LR parsing in particular. This is a self-inflicted wound by our subject, as we have accidentally turned people away from a beautiful algorithm [9].

LR parsing

That dealt with, let’s drill into some of the technical details of LR parsing. First, LR is strictly more powerful than LL [10]. In other words, every valid LL grammar is also a valid LR grammar (but not necessarily vice versa). Second, LR grammars are the largest practical subset of unambiguous CFGs that we currently know how to statically define [11].

Let’s actually try out LR parsing [12] by feeding the following grammar:

%start Expr
%%
Expr: Expr "-" Expr
    | Expr "*" Expr
    | "INT"
    ;
to Yacc. Doing so leads to the following being printed at compile-time:
expr1.y: yacc finds 4 shift/reduce conflicts
At this point, I know that some readers will have broken out in a cold sweat at the mention of “shift/reduce conflict”. Don’t panic yet! At the moment, let’s just think of this as the LR parser statically detecting an ambiguity (or four...) and telling us that we should fix it somehow [13].

There are various ways of drilling into more details about those ambiguities. In a shameless plug, I’ll use nimbleparse, but most Yacc implementations have a way of giving more detailed information. nimbleparse also needs a valid lexer, so if I feed it the grammar above as well as this Lex file [14]:

%%
- "-"
\* "*"
[0-9]+ "INT"
I get this output:
Shift/Reduce conflicts:
   State 5: Shift("*") / Reduce(Expr: "Expr" "-" "Expr")
   State 5: Shift("-") / Reduce(Expr: "Expr" "-" "Expr")
   State 6: Shift("*") / Reduce(Expr: "Expr" "*" "Expr")
   State 6: Shift("-") / Reduce(Expr: "Expr" "*" "Expr")

Stategraph:
0: [^ -> . Expr, {'$'}]
   Expr -> 1
   'INT' -> 2
1: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [^ -> Expr ., {'$'}]
   '-' -> 3
   '*' -> 4
2: [Expr -> 'INT' ., {'-', '*', '$'}]
3: [Expr -> Expr '-' . Expr, {'-', '*', '$'}]
   'INT' -> 2
   Expr -> 5
4: [Expr -> Expr '*' . Expr, {'-', '*', '$'}]
   Expr -> 6
   'INT' -> 2
5: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [Expr -> Expr '-' Expr ., {'-', '*', '$'}]
   '*' -> 4
   '-' -> 3
6: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [Expr -> Expr '*' Expr ., {'-', '*', '$'}]
   '*' -> 4
   '-' -> 3
What this shows us is the stategraph (i.e. a statemachine) our grammar has been transformed into and the states where the conflicts have occurred.

It is possible, with a little effort, to understand that stategraph and the conflicts that have occurred. However, I’m not going to go into more detail here, because most readers will probably already have guessed that it’s very hard to make sense of conflicts on large grammars. I’d liken it, roughly speaking, to resolving whole-program type inference errors [15]: the errors reported are correct, but don’t necessarily correspond to the points in your program/grammar that you would consider need fixing.

While I’m sure that it’s possible to improve the way that conflicts are reported [16], to my surprise, I’ve developed lots of grammars without being troubled much by problems with conflicts. Indeed, the only time I’ve ever bothered to try and understand conflicts is when a large existing grammar needs updating to a new external specification, which is not common [17]. In most cases, I’m developing a new, or tweaking an existing, small grammar. Then, just as with languages using type inference, I find it most productive to save and compile after nearly every change. If this does identify a conflict, I know what change caused it, and it then tends to be fairly obvious what a plausible fix is. Not only do I not bother worrying about what state in the stategraph is involved, I don’t even bother checking whether the conflict(s) are shift/reduce, reduce/reduce, or accept/reduce [18].

Honestly, I’ve only encountered one realistic counter-example which is – wait for it – mathematical expressions. It is surprisingly difficult to encode this as an LR grammar, because mathematics’ syntax rules are complex, and nearly every naive grammar for them is ambiguous. Fortunately, because it’s such a common example, solutions to this abound on the internet. Here’s the classic solution:

%start Expr
%%
Expr: Expr "-" Term
    | Term
    ;
Term: Term "*" Factor
    | Factor
    ;
Factor: "INT"
    ;
It has no conflicts, which means that Yacc has statically proven that it is unambiguous! It handles precedence – 2-3*4 parses as 2-(3*4) – and associativity – 2-3-4 parses as (2-3)-4 – correctly.

Over time, I’ve come to view LR parsing as equivalent to static typing: occasionally annoyingly restrictive, but providing enough static guarantees to be worth the annoyance for important software. It’s important to remember that LR isn’t magic: while it will stop you writing an ambiguous grammar, it won’t stop you writing an incorrect grammar for the language you wish to parse. For example, although LR will prevent you making a rule both left and right associative, you still have to choose correctly whether it should be left or right associative.

Performance

People often worry about parsing performance in general, and LR parsing performance specifically, though almost always without cause on modern computers. For example, if I take Java’s grammar (which is unusually big, and therefore slow to parse) and the LR parsing system I wrote (which has been only moderately optimised) I can happily parse many tens of thousands of lines of code per second on my 3 year old laptop. Unless you’ve got billions of lines of source code, or millions of users, this is surely fast enough.

I suspect that parsing performance worries date back to the period when parsing techniques were under heavy development. LR parsing was invented in 1965, a time when computers were painfully slow [19] and resource poor. LR parsing works by generating a statetable at compile-time that is then interpreted at run-time. Those statetables were far too big to be practical on the computers of the day, so two solutions were invented to solve this problem.

First, algorithmic subsets of LR (e.g. LALR, SLR) were invented that reduce the size of statetables, at the cost of reducing the number of grammars that they can accept (i.e. some LR grammars are not valid LALR grammars). In practice, these subsets are annoying to use: they cause some seemingly reasonable grammars to be rejected; and understanding why a grammar has been rejected can require a deep understanding of the algorithm [20].

Second, since 1977 we’ve known that you can substantially shrink LR statetables without restricting the grammars accepted [21]. When combined with a couple of other techniques to squeeze the statetable’s memory footprint [22], even the most puny modern machine can run an arbitrary LR parser at impressive speeds.

Error recovery

When I’m programming, I make a truly embarrassing number of syntax errors. It is vital that the parser I’m using accurately reports where I’ve made such an error: most parsers, including LR parsers, do a decent enough job in this regard [23]. It is then nice if the parser recovers from my error, allowing it to continue parsing.

The very best recursive descent parsers [24] do a pretty good job of error recovery. LL parsing systems also generally do a tolerable job for arbitrary LL grammars.

Unfortunately, it is fair to say that LR parsing systems such as Yacc do a poor job. Yacc itself uses error tokens, but the results are so bad that I find Yacc parsers with error recovery more frustrating to use than those without. However, we can do much better for arbitrary LR grammars, and hopefully more LR parsers will give good error messages in the future.

LR parsing: aesthetics

I’m now going to turn to a fuzzier factor: readability. Whether explicitly or implicitly, people need to know the syntax rules of the language they are using. Some programming language designers assume, or hope, that giving users a few code examples is equivalent to telling them the language’s syntax rules. This works for most cases as we can largely rely on a shared cultural understanding of “what a programming language looks like” [25], but experienced programmers know the dangers of ignoring dank corners such as operator precedence. At a deeper level, those who implement a compiler, or even just an accurate syntax highlighter, need to know precisely what a language’s syntax rules are. In my opinion, the readability of a parser is a vital factor in enabling accurate tooling for, and use of, a programming language.

To my mind, of the various grammars and parsers presented in this post, the easiest to read is the version for generalised parsers, because it most closely matches the informal mathematical grammar I was taught as a child. However, this readability comes at a cost: because the grammar is potentially ambiguous I sometimes misjudge which way a given input will be parsed after disambiguation.

The hardest to read is, without a shadow of a doubt, the recursive descent parser. It’s the longest, the most detailed, and the one lacking any underlying theory to guide the reader.

The lack of left recursion in LL parsing makes many grammars awkward to read. A surprising way of seeing this is to use the fact that many (though not all) LR grammars can be converted to LL semi-mechanically (see e.g. this translation of roughly the same LR grammar as used in this post to an LL equivalent): the resulting LL grammar is never easier to read than the LR original.
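To give a flavour of this, here is an illustrative fragment of my own (not one of the grammars used earlier in this post): a left-recursive LR rule for subtraction and its semi-mechanical LL rewrite:

/* LR: left recursive, so "a - b - c" naturally groups as "(a - b) - c". */
Expr: Expr "-" Term | Term ;

/* LL: left recursion removed; left associativity now has to be
   reconstructed by hand in the semantic actions. */
Expr: Term ExprTail ;
ExprTail: "-" Term ExprTail | /* empty */ ;

Both accept the same language, but the second no longer mirrors how most of us think about subtraction.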

LR grammars thus fill an important hole. They’re generally close in readability to an arbitrary CFG; since left associativity is so common, they’re nearly always easier to read than LL grammars; and, if you’ll allow a small amount of poetic license, they’re infinitely easier to read than a recursive descent parser.

Of course, I’m clearly somewhat biased, so perhaps these words from Guy Steele might be more compelling:

[Be] sure that your language will parse. It seems stupid ... to sit down and start designing constructs and not worry about how they will fit together. You can get a language that’s difficult if not impossible to parse, not only for a computer, but for a person. I use Yacc constantly as a check of all my language designs, but I very seldom use Yacc in the implementation. I use it as a tester, to be sure that it’s LR(1) ... because if a language is LR(1) it’s more likely that a person can deal with it.

Dynamic Languages Wizards Series - Panel on Language Design

Summary

Having spent years trying virtually every other possible approach to parsing, I now firmly believe that LR parsing is the best approach for the vast majority of purposes: it has the strongest practical safety guarantees, allows good grammar readability, and has decent performance. In particular, I hope future programming language authors take Guy Steele’s advice above, and make their reference grammar LR compliant [26].

Personally, I’ve put my money where my mouth is: I’ve put a lot of work into grmtools, a Yacc-compatible LR parsing system for Rust. grmtools isn’t yet perfect, or complete, nor is it by any means fully optimised — but it’s more than good enough for many purposes and I intend to continue maintaining it for some time to come. I hope it’s one small step towards encouraging people to rediscover the beauty and utility of LR parsing.

Acknowledgements: Thanks to Lukas Diekmann and Naveneetha Vasudevan for comments on drafts of this post. Thanks to Roberto Ierusalimschy and Terence Parr for answering my queries. All opinions, and any errors or infelicities, are very much due to me!

Footnotes

[1] Parsing Expression Grammars (PEGs) and “parser combinators” in some functional languages are just recursive descent parsers in disguise.
[2] My favourite example of this is best expressed as a Parsing Expression Grammar (PEG):
r <- a / ab
or as a hand-written recursive descent parser:
def r(s, i):
    # Like the PEG's ordered choice, this tries "a" first and, if it matches,
    # never attempts the "ab" branch.
    if i < len(s) and s[i] == "a":
        return ...
    elif i + 1 < len(s) and s[i:i+2] == "ab":
        return ...
Both of these parsers successfully parse the string ‘a’ but fail to parse the string ‘ab’. As soon as ‘a’ is matched, the rule succeeds, which leaves ‘b’ unmatched; neither parser tries to match ‘ab’ directly.
[3] I believe that it’s still an open question as to how many distinct disambiguation operators there need to be.
[4] In Converge I ended up cheating, encoding some default disambiguation rules into the parser. When I did this I didn’t really understand the problem that I’d encountered nor did I realise that my “solution” was not curing, but merely delaying, the pain. The only thing more surprising than encountering an ambiguous parse is finding out that your input has been disambiguated-by-default in the wrong way.
[5] To give a rough idea of scale: Rust’s parser is about 10KLoC and javac’s parser about 4.5KLoC.
[6] Yes, I wrote more than one. I no longer recommend it, because Earley’s original algorithm has a bug in it, and descriptions of a/the fix seem either to be incorrect, or to destroy the beauty of the algorithm.
[7] Michael Van De Vanter first pointed Wagner’s work out to me. However, I didn’t appreciate it for what it was. I then forgot about it, and stumbled across it “independently” at a later point, before somehow realising that it was what Michael had already suggested. I later learnt to listen to his advice more carefully, and benefited much from it!
[8] It’s also the basis of Tree-sitter, which might be the best long-term argument I know of for programming languages having an LR grammar!
[9] Perhaps I was lucky not to study a compilers course myself (my university did not offer one at that point), as it meant I couldn’t develop the most severe of allergic reactions to LR parsing.
[10] From least to most expressive we thus have: regular expressions, LL, LR, unambiguous, CFG. In other words, regular expressions are a strict subset of LL, LL a strict subset of LR, and so on. The most complete description of the hierarchy I know can be found in p89 of Alexander Okhotin’s talk (where arrows mean “more expressive” and “ordinary” means “CFG”). Note that recursive descent doesn't fit into this hierarchy at all — formally speaking, we know that it accepts a disjoint set of languages relative to CFGs, but, because PEGs have no underlying theory that we know of, we are unable to precisely define that set further.

Another interesting case is the ALL(*) algorithm which underlies ANTLR. ALL(*) accepts a strict superset of LL (including many ambiguous grammars), but is disjoint with LR since ALL(*) doesn’t support left-recursion. However, ANTLR can remove direct left-recursion before invoking ALL(*), so some grammars that might seem impossible to parse with ALL(*) can in fact be parsed by it. Bearing in mind that we’re talking about infinite sets, and that I don’t think we have a formal proof of the following statement, I think it would be fair to say that the ALL(*) subset of CFGs is bigger than the LR subset.

[11] There are larger unambiguous subsets such as LR-Regular (or “LRR”) grammars. However, as far as I can tell, these are probably not practical. For example, it is not decidable whether an arbitrary grammar is LRR or not. [Update 2020-10-28: a previous version of this footnote suggested that Marpa is LRR-based. It is a generalised parser that can therefore also parse LRR grammars. My apologies for the confusion!]
[12] Berkeley Yacc actually implements LALR, but for this example it’s indistinguishable from LR. I’ll discuss LALR a little bit later in this post.
[13] Although I’ve presented the conflicts as errors, in Yacc they’re actually warnings because it has “default conflict resolution” rules (see Section 5 of the Yacc manual). In other words Yacc is willing to take in an ambiguous grammar and automatically disambiguate it to produce an unambiguous grammar. In general, I do not recommend making use of this feature.
[14] Although it’s rarely remarked upon, the traditional splitting of “parsing” into separate lexing and parsing phases is an important part of the ambiguity story. Not only is it easy for the lexer to identify for as a keyword and forest as an identifier, but the parser then only has to distinguish between token types and not token values. Scannerless parsing merges these two phases, which allows more grammars to be expressed, but introduces more scope for ambiguity — and, in some cases, enables the resulting parsing algorithm to accept context-sensitive grammars.
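As a small illustration of the for/forest point (my own sketch, not something from any of the lexers discussed in this post), a longest-match lexer settles the question before the parser ever sees a token:

import re

KEYWORDS = {"for", "if", "while"}
# Longest match: the identifier alternative consumes as many characters as it
# can, so "forest" can never be split into "for" followed by "est".
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S")

def lex(src):
    for m in TOKEN_RE.finditer(src):
        t = m.group()
        if t in KEYWORDS:
            yield ("KEYWORD", t)
        elif t[0].isalpha() or t[0] == "_":
            yield ("ID", t)
        elif t.isdigit():
            yield ("INT", t)
        else:
            yield ("OP", t)

print(list(lex("for forest")))  # [('KEYWORD', 'for'), ('ID', 'forest')]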
[15] Imagine a Haskell or RPython program where none of the functions have explicit types. The challenge when programming in such systems is that errors are often reported far away from where they were caused. In other words, I might make a static type error in one function, but the type inferencer will detect the resulting error in another function. While type error messages have become much better over time, they can never match human expectations in all cases.
[16] The best conflict reports I’ve seen come from LALRPOP.
[17] Off-hand, I can only think of a single example: when Lukas tried to evolve this Java 7 grammar to Java 8. Until that point, grmtools didn’t have a way of reporting details about conflicts because I hadn’t needed such a feature!

The Java specification used to pride itself on presenting a simple, machine-proven, unambiguous grammar in an appendix. Unfortunately, at some point, this grammar seems to have been dropped from the specification, and I suspect that the new syntax introduced has not been checked for possible ambiguities. We quickly realised that a Java 8 grammar wasn’t important enough to our work for us to invest the time in this, so I don’t know if it is ambiguous or not.

[18] For the insatiably curious, the conflict types mean roughly:
  • shift/reduce: The LR parser can’t be sure whether it should advance the input by one token, or whether a parsing rule will have completed.
  • reduce/reduce: The LR parser can’t be sure which of two rules will have completed.
  • accept/reduce: The LR parser can’t be sure if the entire parse has completed or merely one rule has completed.
That last possibility is so rare that I’d forgotten it even exists before I thought to fact-check this footnote!
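For example, the following illustrative rule (my own, not one from this post) is ambiguous and gives an LR parser a shift/reduce conflict: having parsed E "-" E with another "-" as the next token, the parser cannot tell whether to reduce (grouping to the left) or shift (grouping to the right):

E: E "-" E
 | "INT"
 ;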
[19] Roughly speaking, the fastest supercomputer in the world at that time ran about 10,000 times slower than a decent desktop chip today.
[20] SLR is particularly restrictive. I’m not sure I’ve ever seen SLR used in practice (though I know it was in the past), but LALR is still found in Berkeley Yacc. Even though LALR is less restrictive than SLR, it can still require real programming language grammars to be unpleasantly contorted in places.
[21] Pager’s description is slightly incomplete; it’s best paired with Xin Chen’s thesis. From memory, neither mentions that the algorithm is non-deterministic and can sometimes create unreachable states that can be garbage collected to save a little bit more memory. grmtools’ implementation of this algorithm goes into more detail on such matters and also has the bonus of being runnable. However, Pager’s algorithm doesn’t quite work properly if you use Yacc’s conflict resolution feature. One day I should implement the IELR algorithm to solve this problem.
[22] For example, encoding sparse tables (e.g. in Rust with the sparsevec crate), and packing vectors of small integers (e.g. with the packedvec crate). It’s a long time since I’ve thought about these aspects: from memory, one can do even better than these techniques, but they’re already effective enough that we didn’t feel the need to look further at that point.
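To give a rough sense of what packing vectors of small integers means (a deliberately naive sketch of my own; packedvec itself is rather more sophisticated), the idea is to store each element in only as many bits as the largest element needs:

def pack(elems):
    bits = max(max(elems).bit_length(), 1)  # bits needed per element
    packed = 0
    for i, x in enumerate(elems):
        packed |= x << (i * bits)
    return packed, bits

def get(packed, bits, idx):
    return (packed >> (idx * bits)) & ((1 << bits) - 1)

packed, bits = pack([3, 0, 7, 2, 5])
print(bits)                  # 3 bits per element rather than a machine word each
print(get(packed, bits, 2))  # 7

The real crates use cleverer representations, but the basic trade (a little bit-twiddling in exchange for a much smaller footprint) is the same.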
[23] There is one major exception in C-ish syntaxes: missing curly brackets. The resulting errors are typically reported many lines after the point that a human would consider as the cause of the problem.
[24] rustc gives the best syntax error messages of any compiler / parser I’ve ever used.
[25] Recent years have reinforced a long-standing trend: programmers don’t like to learn languages with unfamiliar syntaxes. For better or worse, C-ish syntax is likely to be the dominant cultural force in programming languages for decades to come.
[26] That doesn’t mean that the eventual compiler has to contain an LR parser (though I’d start with an LR parser and only consider moving to something else if I had millions of users), but the parser it does contain should be entirely compliant with the reference LR grammar.

Unfortunately, for the foreseeable future, we are going to be stuck with programming languages whose designers have used other parsing formalisms: pity the poor IDE author who has to deal with yet another language with only a recursive descent parser instead of an LR grammar!


Alternative Sources of Advice

May 6 2020

With slowly increasing frequency, I am asked for my advice on life or careers, or something similar. Let us put aside worries about this sign of my increasing decrepitude or the poor taste that some people show in those from whom they seek advice. The brutal truth is that most of the advice that I’ve received as an adult – nearly all of which has been well intentioned – has not been right for me. I therefore assume that any advice I might give others would be equally wrong, and so I try and avoid doing so. In this blog post, I will try to enumerate the problems I have with advice as it is commonly given and look at one way that we can do things differently.

In the context of this article, what I mean by “advice” is the guidance given by one adult (generally older) to others (generally younger) [1]. Advice can be solicited or unsolicited and it can come through many mediums such as one-on-one conversations, listening to a talk, or reading a book.

A common way that advice is given is as a series of commands such as “you should do X” or “you should never do X”. Commands can be expressed bluntly (“you must always do X”) or diplomatically (“in my opinion it is always wise to do X”); they can be general (“you should travel to as many places as possible”) or specific (“you should avoid person P”); and so on.

My issue is not with the form of advice-as-commands, it is that such advice is bad much more often than it is good. The reason for that is simple: most advice tends to generalise from a case of one (“I did X so everyone should do X”) or, worse, from a case of zero (more commonly referred to as “do as I say, not as I do”). The latter is so clearly foolish that it doesn't require explanation. However, I believe there are three issues with the former.

First, generalising from a case of one tends to lead to over-generalisation: if the only sheep I’ve seen happened to be black then I might confidently tell other people that all sheep are black. It took me years to realise how much advice falls into this category, and to have gained enough general knowledge to spot cases that are clearly at odds with reality.

Second, we inevitably generalise from our own back story, and our memories can be astonishingly faulty. I have seemingly clear memories of something happening in my past that other people have been able to definitively prove did not happen. More commonly, we forget challenges we faced in the past, overlook the times that we succeeded through luck rather than judgement, and exaggerate our contributions relative to others’ [2]. When we give advice, we take these faulty memories, generalise, and then simplify: it is hardly a wonder that we often mislead.

Third – and there isn’t a nice way to say this – most of us have not led a sufficiently interesting life to be able to generalise from. Like everyone, I have faced challenges here and there, but compared to those faced by most people throughout history, I have led a comfortable, undemanding existence. Generally speaking, the fewer challenges we’ve faced, and the less demanding those challenges were, the less reliable the advice we derive from them is likely to be.

Real-world advice

Let me start with a real, though ludicrous, example. When I was doing my PhD, I was advised by a number of people that PhD theses must have an odd number of chapters. I was absolutely baffled by this but, since everything about the situation I was in was unfamiliar, I assumed that I must be missing something obvious. It took at least a couple of years before I had enough confidence to ignore this bit of advice [3]. Though I never found its source, I assume that someone saw a successful, and beautifully written, thesis that had an odd number of chapters and generalised from there. I was astonished to find only last year that a recent PhD graduate had heard this same bit of advice!

Arguably more dangerous than ludicrous advice is advice that is not just plausible, but which might be right for many people — but not us. For example, the most common advice I receive these days – perhaps two or three times a year from different people on average – is to write more research papers. I understand why I’m being told this. If you look at my list of publications, you’ll see that the number of papers I produce has slowed down from 2014 onwards — and that despite the fact that, after many years being a solo researcher, in 2013 I started leading a research group that’s never dipped below four people in size. I’ve come to realise that, when I’m given this advice, it isn’t because people doubt my work ethic, but because they assume that I’m doing something wrong and that if I only acknowledged that then I could be productive again.

What’s not obvious is that this is the inevitable result of decisions I made several years ago. In about 2012 I looked back at my past work and realised how little of it I felt proud of: I had too often been happy to aim for the mediocre. By the end of 2013 I had realised that I would have to consistently raise the level I was working at and had developed a rough plan to do so. Most significantly, I had decided that since the thing I enjoy most is programming, I might as well focus almost exclusively on research that required a good deal of programming. Since creating viable software systems is not a quick matter, I simply can’t produce a large volume of publications. This is not to say that the advice I’m given is wrong for other people, but it isn’t right for me given the niche I’ve chosen to focus on [4].

Generalising from other people’s experiences

When someone asks me for advice, what should I say? Perhaps I should advise them to grow up in the countryside to hone their focus? Make them realise that they can't learn properly in a class and that the sooner they plough through things on their own, the quicker they'll learn? That they'll be more efficient if they use OpenBSD? That they should study Entombed's Eyemaster until they understand why the point at 0:48, when the false introduction suddenly transitions to the dumb yet joyous main riff, is one of the pinnacles of musical achievement?

I could go on, but hopefully the point is fairly clear: I'm generally unsure what things I've done, or experienced, have been beneficial or not; and even when I am sure, it's generally because the observation is either trivial or hard to repeat. My favourite example of the latter came from the late Duke of Westminster, the UK’s then-richest man: when asked what advice he would give to a young entrepreneur, he replied “make sure they have an ancestor who was a very close friend of William the Conqueror.”

What are we to do? Are we all condemned to learn everything from scratch? At best, that would be highly inefficient.

In my case, the model I have accidentally gravitated towards is to try and derive advice from (mostly) historical figures. This started in earnest in my late twenties when I realised that I had very little understanding of continental European history and started studying the history of Prussia. At first, so much was unfamiliar that I was just trying to build up enough context to understand major events, places, and players. At some point, though, I realised that I had stumbled upon something deeper than that: I was starting, without realising it, to derive advice from history. That advice was not command-esque – “if you find a wooden horse at your gates, don’t let it in”, for example, is not very useful advice for me. Instead I was picking up a more abstract sense of the situations people can find themselves in and how they’ve chosen to address the problems they’ve faced — stretching the Troy analogy a bit, one might conclude that “if something seems too good to be true then, unless carefully investigated from all angles, it can lead to disaster”.

I’m going to give you two concrete examples of people I learnt from. I’m deliberately choosing people who are long dead, from a country that no longer exists, and whose experiences were in a hugely different field than my own. Hopefully that makes it a little easier to see that there is no need to stick with the familiar when deriving advice.

I’ll start with Moltke the Elder, who led the Prussian army to victory in three separate wars in the mid/late 19th century, changing European and world history in ways that are still with us today [5]. It goes without saying that war has little direct relationship to computing research. Furthermore, Moltke was a cultured, reserved, man of whom it was said that he could be quiet in seven languages: I am a cultural heathen, can only speak English, and tend to bellow. What could I learn from such a fellow?

The thing that intrigued me most about Moltke was that he challenged a fundamental idea I had somehow imbibed, which is that it’s possible for the leader of an organisation to design and carry out a perfect plan. This is not to say that Moltke was not good at planning. He was, by some distance, the best military planner in the second half of the 19th century: he was the first to realise that railways allowed the unprecedentedly large armies he was commanding to be split up and combined on the battlefield. His focus on that alone would make him a major historical figure.

However, Moltke did two things very unlike most army commanders. First, he expected the situations he found himself in to be unclear and ever-changing: he knew that he would have to continually reevaluate and change his plans. It is hard to beat the summary found in two of his own quotes: “no plan of operations extends with any certainty beyond the first contact with the main hostile force” and “strategy is a system of expedients”. Second, he realised that when you are managing competent people, micromanagement reduces efficiency. He continually informed the generals under his command about his overall intent but allowed them significant flexibility in how they went about addressing that intent [6].

These two concepts changed how I wanted to lead a research group. First, I realised that while I might be able to have a fairly consistent high-level aim over time, the concrete methods I employed to meet that aim would change frequently and in ways that I would not be able to anticipate. I have therefore consciously tried to improve my willingness to be flexible, both at a day-to-day and at a strategic level. Second, if I wanted to get the best out of people, I would need to dampen my natural inclination to work in detail with people. This is not easy, because sometimes my age has led to experience which can give a big boost to someone else’s productivity, but it’s all too easy for me to forget to back off at the appropriate point. I’ve therefore consciously tried to back off sooner rather than later, and when possible not to get involved in the details at all. Inevitably, I have not always succeeded in implementing either of these concepts. Indeed, I have made some horrible mistakes along the way — but I think I would have made more, and much worse, mistakes if I hadn’t used the advice I’d derived from Moltke as a guide.

My second example reaches even further back in time: Clausewitz is more widely quoted than Moltke though, for me at least, harder to understand. His major contribution is On War, an unfinished, often abstract, book on the nature of war. The part that changed the way I think most significantly is his concept of friction:

Everything in war is very simple, but the simplest thing is difficult. The difficulties accumulate and end by producing a kind of friction... Imagine a traveller who late in the day decides to cover two more stages before nightfall. Only four or five hours more, on a paved highway with relays of horses: it should be an easy trip. But at the next station he finds no fresh horses, or only poor ones; the country grows hilly, the road bad, night falls, and finally after many difficulties he is only too glad to reach a resting place with ... primitive accommodation. It is much the same in war. Countless minor incidents – the kind you can never really foresee – combine to lower the general level of performance, so that one always falls far short of the intended goal. Iron will-power can overcome this friction; it pulverizes every obstacle, but of course it wears down the machine as well.
That first sentence baffled me for years, but when taken with the rest of the paragraph, three profound truths eventually made themselves clear to me:
  1. When we state a goal we wish to achieve, we simplify the route we expect to take, partly because we must and partly because we don't know all the issues we will encounter [7]. However, when we try and carry out that goal, those issues are more numerous and collectively more damaging to performance than we could have imagined.
  2. Chance is a significant factor in success and failure.
  3. Organisations are like mechanical machines: in the short-term we can make people work harder than normal but doing so in the long-term burns them out.

When written out like that, it can make On War seem a bit like a self-help book or, perhaps even worse, a case of generalising from a case of one. In my opinion, nothing could be further from the truth: Clausewitz's work is the result of generalising from many cases over many years. His observation of friction, his explanation of it, and his terminology for it are, for me, astonishing intellectual achievements. It fundamentally changed how I operate: I expect my plans to be disrupted in both time and direction; and, except in short, rare situations, I do not expect those disruptions to be solved by increasing people’s workloads. Put another way, I’ve realised that I’m involved in a marathon, not a series of sprints, and that I need to moderate my expectations of the route and my effort accordingly.

Summary

I’ve been given a lot of advice over the years, but I think that which I’ve derived from history has had the greatest influence on me. It has given me access to experiences and perspectives that I would never have come across in my normal life. Furthermore, it is rooted in experience: I continue to be amazed at the things that real people have achieved in adverse circumstances. It also seems to me that the more people’s stories I read and understand, the more profound the truths that I seem to alight upon: generalising seems to involve some sort of compound interest that I don’t fully understand, but whose effects I hope continue!

For me, this alternative source of advice has been crucial, but it doesn’t mean that you should suddenly start devouring Prussian history for similar lessons. It seems to me that the key is to seek out sources of advice in areas different to those we normally have access to, but the sources of advice that might resonate with you could be very different to mine. For example, in the past I learnt a lot about how people deal with stressful situations through watching professional cricket, and more recently through professional road cycling. Others might find their most effective sources to come through very different avenues, such as fiction or film.

That said, I am not dogmatic enough to think that this technique is the only way that one should give or receive advice. For example, there are occasions when advice-as-command really is the right thing, especially when someone is about to do something actively harmful to their best interests. But, in general, when I'm asked to give advice, I try instead to tell people a bit about possibly relevant parts of my story. If some parts of my story give them a useful insight, then that is wonderful; if not, then I hope that someone else’s story (perhaps in conjunction with mine) might prove more useful to them at some point in the future. Fortunately, no matter who you are, and no matter what times you are living through, it seems to me that you can never seek out too many different perspectives.

Acknowledgements: Thanks to Chun Nolan for comments.

Footnotes

[1] There are other types of advice – notably the guidance that parents give children on a regular basis – but let’s exclude those from consideration.
[2] Inversely, humans have an astonishing tendency to ascribe all failures to other people’s malice. My failures have almost solely been due to my own stupidity; the remaining few have mostly been due to others’ incompetence or laziness, not malice.
[3] I had to double check this, but my thesis did end up with an odd number of chapters, so perhaps subliminally I did take the advice on board, despite thinking that I'd rejected it!
[4] Although I can’t be certain, my suspicion is that’s an indirect result of there not being many people working in the same niche these days: the idea that someone might voluntarily devote large chunks of time to programming is thus a rarer one in computing research than it used to be. Interestingly, in some non-computing research fields, the opposite phenomenon seems to be occurring. At its most extreme, there are 5,154 authors on the paper about the Higgs Boson experiments!
[5] None of this would have been possible without Bismarck. His influence on history might, perhaps, be greater than any other single person’s. I fear that the lessons one can draw from Bismarck are more powerful, but they are far darker. I sometimes think that the English-speaking world’s general ignorance of his approach is a good thing.
[6] The concept that Moltke pioneered ultimately developed into Auftragstaktik (the English term for this has become “mission command”, but that has never resonated with me: it feels both too specific and too easily misunderstood). Moltke motivated the need for devolved responsibility, though not in particular detail, and didn't push it down beyond the top ranks of commanders as later exponents of this concept did.
[7] When I wrote “Confusing problems whose solutions are easy to state with problems whose solutions are easy to realise”, I was covering fairly similar ground. I had read Clausewitz, and some associated work, by that point although I hadn't really understood much of it.