Which Parsing Approach?

Recent posts
pizauth: HTTPS redirects
Recording and Processing Spoken Word
Why the Circular Specification Problem and the Observer Effect Are Distinct
What Factors Explain the Nature of Software?
Some Reflections on Writing Unix Daemons
Faster Shell Startup With Shell Switching
Choosing What To Read
Debugging A Failing Hotkey
How Often Should We Sharpen Our Tools?
Four Kinds of Optimisation

Blog archive

We all know that parsing is an important part of designing and implementing programming languages, but it’s the equivalent of Brussels sprouts: good for the diet, but a taste that only a select few enjoy. Unfortunately, I’ve come to realise that our general distaste for parsing is problematic. While many of us think that we’ve absorbed the advances of the 1960s into our collective understanding, I fear that we have regressed, and that we are often making inappropriate decisions about parsing. If that sounds accusatory, I don’t mean it to be: I spent over 20 years assuming that parsing is easy and that I didn’t need to understand it properly in order to use it well. Alas, reality has been a cruel teacher, and in this post I want to share some of the lessons I’ve been forced to slowly learn and acknowledge.

Let’s start with the basics. A grammar encodes the syntax rules for a given language. Parsing is the act of taking in an input (e.g. a source file) and determining if, and how, it corresponds to a grammar. At its most basic level, parsing just says “this input does/doesn’t correspond to the grammar”. That’s rarely useful for programming languages, so we normally execute semantic actions while parsing, allowing us to, for example, build a parse tree that represents the input as a tree. If I have a simple calculator grammar and the input 2-3*4 I might get back a tree that looks like the following:

For the rest of this post, I’ll represent trees as “pretty printed text”, where brackets allow us to succinctly express how the tree is structured. For example the above tree is equivalent to (2-(3*4)). I’m going to assume that “parsing” means “check correspondence to the grammar and build a parse tree”. I’m also going to simplify other parsing jargon and nomenclature whenever I can, to try and keep things somewhat comprehensible and the length somewhat manageable.

Recursive descent

There are a bewildering number of ways that one can parse an input, so I’ll start with what is probably the most common: a hand-written parser. While that could mean just about anything, nearly everyone who writes a half-decent hand-written parser, whether they know it or not, is writing a recursive descent parser [1]. The idea is relatively simple: one writes a series of functions which examine the input string at a given position and, if they match at that position, advance the parse. For example, a first attempt at a recursive descent parser in Python that can parse the simple calculator language above might look as follows:

NUMBERS = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
OPERATORS = ["-", "*"]

class Number:
  def __init__(self, n): self.n = n
  def __str__(self): return str(self.n)
class Mul:
  def __init__(self, lhs, rhs):
    self.lhs = lhs
    self.rhs = rhs
  def __str__(self): return "(%s*%s)" % (str(self.lhs), str(self.rhs))
class Sub:
  def __init__(self, lhs, rhs):
    self.lhs = lhs
    self.rhs = rhs
  def __str__(self): return "(%s-%s)" % (str(self.lhs), str(self.rhs))

def parse_expr(s, i):
  if s[i] not in NUMBERS:
    return
  j = i
  while j < len(s) and s[j] in NUMBERS:
    j += 1
  lhs = Number(s[i:j])
  if j == len(s) or s[j] not in OPERATORS:
    return (j + 1, lhs)
  op = s[j]
  r = parse_expr(s, j + 1)
  if r is None:
    return
  (i, rhs) = r
  if op == "-":
    return (i, Sub(lhs, rhs))
  else:
    assert op == "*"
    return (i, Mul(lhs, rhs))

def parse(s):
  r = parse_expr(s, 0)
  if r is None or r[0] <= len(s):
    return "Syntax error"
  return r[1]

print(parse("2-3*4"))

The idea is relatively simple: we have a string ‘s’ we’re parsing, with the variable ‘i’ telling us how far we’ve parsed so far. If it was able to parse part of the input starting at ‘i’ then the parse_expr function returns a pair (i, tree) telling us how far it parsed and giving us back the tree it created; if it failed it returns None. When I parse 2-3*4 it prints:

(2-(3*4))

In other words, if we were to evaluate that tree, we’d get a result of -10 – success! Admittedly, that has come at a cost: the recursive descent parser has quite a lot of boiler-plate to ensure that it doesn’t do something silly and that any syntax errors encountered cause parsing to stop. For example, if you remove the check on lines 40 and 41, then 2abc will parse successfully returning Number(2), ignoring the fact that abc couldn’t be parsed! There are ways to reduce the boilerplate, but if you write recursive descent parsers for a living, you have to learn to live with it.

Unfortunately, if I try parsing 2*3-4 I get a surprising result:

(2*(3-4))

We’ve all been taught from a young age that the grammar of mathematics requires ‘*’ to “bind harder” than ‘-’. Put more formally, ‘*’ is said to have a higher precedence than ‘-’. Unfortunately, my hand-written recursive descent parser has given both operators the same precedence. If I was to evaluate that tree, I’d get -2 instead of the 2 we’d have expected from the original expression.

Fortunately, there’s a fairly standard way to encode operator precedence which, in the style of the above parser, can be written as follows:

def parse_expr(s, i):
  r = parse_factor(s, i)
  if r is None:
    return
  (i, lhs) = r
  if i < len(s) and s[i] == "-":
    r = parse_expr(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    return (i, Sub(lhs, rhs))
  return (i, lhs)

def parse_factor(s, i):
  r = parse_term(s, i)
  if r is None:
    return None
  (i, lhs) = r
  if i < len(s) and s[i] == "*":
    r = parse_factor(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    return (i, Mul(lhs, rhs))
  return (i, lhs)

def parse_term(s, i):
  if s[i] not in NUMBERS:
    return
  j = i
  while j < len(s) and s[j] in NUMBERS:
    j += 1
  return (j, Number(s[i:j]))

def parse(s):
  r = parse_expr(s, 0)
  if r is None or r[0] <= len(s):
    return "Syntax error"
  return r[1]

If I parse these expressions:

print(parse("2-3*4"))
print(parse("2*3-4"))

I can see that I get the expected output:

(2-(3*4))
((2*3)-4)

Success at last! Well, not quite, because if I parse 2-3-4 I get another surprising result:

(2-(3-4))

Unfortunately, as this example shows, we’re incorrectly parsing operators as right associative when they should be left associative. In other words, when we see a sequence of subtractions, earlier subtractions should be matched before later subtractions. Fixing this might seem like it’s easy, but it’s not: the “obvious” way of implementing left associativity in a recursive descent parser causes an infinite loop. Fixing that is more involved than I want to get into here: see this page for an approachable summary of solutions to this problem.

It might be tempting to see these problems as the result of an idiot (me) writing a parser for a language they don’t sufficiently understand (mathematics). I hope you can see that there’s also something deeper going on. The underlying problem is that the grammar I wanted to write is ambiguous: 2-3*4 can be parsed as equivalent to 2-(3*4) or (2-3)*4. It is often said that recursive descent parsers are inherently unambiguous. While true, this makes a virtue out of a vice: recursive descent parsers are unambiguous simply because they ignore ambiguities. Put another way, whenever a recursive descent parser encounters a point at run-time where an input can be parsed ambiguously, it arbitrarily chooses one of the possibilities, and charges onwards as if the other possibilities had never existed. Significantly, the parser author is not notified that this happened. Since recursive descent parsers are just normal programs, it’s unlikely that we’ll ever be able to make a static analysis that can take such a parser and reliably tell us at compile-time all the points of ambiguity.

It is therefore probably not a coincidence that recursive descent parsers have no real “theory”. Notably, they do not have any known relationship to the class of grammars we understand best — Context-Free Grammars (CFGs). For example, we do not know, in general, the language which a recursive descent parser will accept: all we can do is throw ever more inputs at it and observe if, and how, it parses them, never knowing if another input will cause a surprising parse.

Over time, I’ve come to view recursive descent as the parsing equivalent of assembly programming: maximum flexibility, maximum performance, and maximum danger. Every non-trivial recursive descent parser I’ve seen converted to another formalism has led to unexpected ambiguities being uncovered. Sometimes this leads to incorrect parses (as above), but it just as often leads to seemingly correct input not being parsed at all [2]. There are good reasons for using recursive descent parsers (I’ll get to those later), but, in my opinion, if another formalism can be used, it generally should be.

Generalised parsers

At the opposite end of the spectrum are what have come to be called generalised parsers. There are various generalised parsing algorithms (e.g. Earley, GLL, and GLR) but, from the perspective of this post, they’re all equivalent. Each can parse any CFG (so they rest on a solid theory), even ambiguous ones (so you don’t have to worry about contorting your grammar), and they guarantee to tell you where all of the ambiguous points in the grammar are at run-time (so you don’t have to worry about things being unexpectedly mis-parsed).

These properties appear to make generalised parsing the solution to the problems noted above with recursive descent parsers. However, this comes at a cost. Consider the following grammar which, once again, parses the little subset of mathematics we’re using as an example:

Expr: Expr "-" Expr
    | Expr "*" Expr
    | "INT"
    ;

Given that grammar, many readers will have spotted an obvious point of ambiguity: 2-3*4 can be parsed as equivalent to (2-3)*4 or 2-(3*4). Generalised parsers are interesting because they generate both possibilities at run-time. It is possible for such parsers to return a “parse forest” (i.e. showing all the ambiguous possibilities), but that’s not very useful for programming languages: we expect compilers to settle on a single meaning for the programs we throw at them. We thus need to disambiguate the ambiguous possibilities so that we end up with a single parse tree. An easy way of doing this is to assign a precedence to a rule’s productions so that if, at one point in the parse, more than one of its productions match, we can pick the one with the highest precedence. For example, I might rewrite my grammar to look as follows:

Expr: Expr "-" Expr %precedence 10
    | Expr "*" Expr %precedence 20
    | "INT"
    ;

Assuming that “higher” precedences mean “bind tighter”, this will then parse 2-3*4 as equivalent to 2-(3*4).

My experience is that fewer people (including, from bitter experience, me) spot a second ambiguity in the above grammar: 2-3-4 can be parsed as left (i.e. (2-3)-4) or right (i.e. 2-(3-4)) associative (because of rules such as Expr "-" Expr). Unfortunately, precedences are not sufficient to disambiguate between those two possibilities: one either needs to rewrite the grammar or use a different disambiguation operator [3].

While the good news is that a generalised parser will reliably tell us at run-time that it encountered an ambiguity, the bad news is that we generally have to wait until we encounter an input that is parsed ambiguously to discover that our grammar is ambiguous. There are some decent heuristics that will statically find many of the points of ambiguity, but they are just that — heuristics.

Over time, I’ve come to view generalised parsing as equivalent to dynamic typing: expressive and safe, but with more errors than necessary being deferred to run-time. I spent years trying to write arbitrary CFGs but, for complex grammars, I continually struggled to squeeze out all of the ambiguities [4]. I did not encounter a user who was happy with, or anything other than startled by, ambiguity errors: it is rather odd to be told that your input is valid but can’t be parsed. That said, I think generalised parsers have a part to play in language composition, where composing different grammars inherently leads to ambiguity. However, I no longer believe that generalised parsing is a good fit for “normal” parsing.

Statically unambiguous parsing

There are several parsing approaches which statically rule out ambiguity, bypassing one of the fundamental problems with generalised parsing. I’ll describe the two best known: LL and LR. In essence, these approaches describe subsets of the CFGs which provably contain only unambiguous grammars. It’s common to describe grammars which adhere to one of these subsets as being “a valid LL grammar” or similar.

However, as far as we know, it is not possible to define the complete subset of unambiguous CFGs, so there are unambiguous grammars which do not fit into these subsets. I thus find it easiest to think of these approaches as being analogous to a static type system: they are sound (i.e. if a grammar is a valid LL/LR grammar, it really is unambiguous) but not complete (some unambiguous grammars aren’t valid LL/LR grammars).

LL parsing

Although less common than in the past, LL parsing still underlies systems such as javacc. My personal bias is that LL parsers are largely unappealing, because the lack of left recursion makes expressing many standard programming language constructs as awkward as with recursive descent parsers. However, as hinted at by that commonality, LL grammars have one important feature: they naturally map to recursive descent parsers (but not necessarily vice versa). One can therefore ensure that a recursive descent parser is not accidentally steamrollering over ambiguities by creating an LL grammar and faithfully mapping it to a recursive descent parser.

To my mind, the combination of LL and recursive descent parser has a small but important niche: if you really, truly need the highest possible performance and/or you want the best possible error messages, this is probably the best route we know of. However, it comes at a significant cost. For a realistic programming language grammar, it will typically take many person months of effort [5] to beat an automatically generated parser. I therefore think this approach only makes sense for a small number of projects (notably industrial-strength compilers and IDEs).

LR parsing: prelude

The last of the major parsing approaches I’m going to look at is LR parsing. In common with many people, I spent years trying to avoid LR parsing because I had imbibed the common idea that LR parsing is a terrible thing. Instead I threw myself into other parsing approaches, notably Earley parsers [6].

Then, in late 2008, while bored in meetings, I started writing extsmail, a program mostly intended to send email via ssh. I thought it would be interesting to write this in the style of a traditional Unix daemon, something I had not attempted before. For the two configuration files extsmail needs, I therefore decided to use the traditional Unix daemon parsing tool Yacc. Not only had I not used Yacc before, I had neither used nor studied LR parsing — I suspected I would have quite a task on my hand. I was rather surprised when it turned out to be easy to write a grammar such as externals_parser.y.

However, I assumed that I had been lucky with these grammars, which are rather simple, and went back to avoiding LR parsing. Having realised that generalised parsing and ambiguity were causing me problems, I spent quite a while dabbling with PEG parsing (which is recursive descent in disguise) before eventually realising that was going to cause me different, but no less severe, problems relative to generalised parsing.

Later I stumbled across [7] Tim Wagner’s thesis on incremental parsing which became the pillar that Lukas Diekmann built upon to create Eco [8]. Wagner’s work uses LR parsing but I managed to involve myself a bit with Eco without actually understanding how LR parsing worked. Then, in 2015, when we were experimenting as a group with Rust, Lukas wrote the beginnings of an LR parser as an experiment, and I quickly jumped in and made a few modifications. Without really intending to, I started expanding the code until I realised I had taken on maintainership of what clearly had the potential to become a full Rust LR parser. At that point, I realised I actually needed to understand LR parsing. I found the explanations lurking on the web a bit confusing a first but the algorithm was simple enough that soon enough I had a full, if basic, Rust LR parser (which became grmtools).

Why am I telling you this long, probably tedious, personal history? Because I want to emphasise that I went out of my way to avoid LR parsing, even though I didn’t really know what I was avoiding or why. Even after I had used LR parsing, and realised that it wasn’t the bogeyman I had expected, I still spent several years trying alternatives. Not only is that embarrassing to admit publicly, it also troubles me: how had I picked up a bias that took me so long to overcome? I’ve gradually alighted upon a plausible explanation for our community’s general dislike of LR parsing and, oddly enough, it relates to undergraduate compiler courses. For reasons that probably made sense in the 1970s and 80s, many compiler courses spend significant, arguably excessive, time on parsing — generally LR parsing. Students come in expecting to be taught how to generate machine code in clever ways, but instead have to learn all sorts of parsing background before they even get to the main LR algorithm. By that point they are thoroughly sick of parsing generally and LR parsing in particular. This is a self-inflicted wound by our subject, as we have accidentally turned people away from a beautiful algorithm [9].

LR parsing

That dealt with, let’s drill into some of the technical details of LR parsing. First, LR is strictly more powerful than LL [10]. In other words, every valid LL grammar is also a valid LR grammar (but not necessarily vice versa). Second, LR grammars are the largest practical subset of unambiguous CFGs that we currently know how to statically define [11].

Let’s actually try out LR parsing [12] by feeding the following grammar:

%start Expr
%%
Expr: Expr "-" Expr
    | Expr "*" Expr
    | "INT"
    ;

to Yacc. Doing so leads to the following being printed at compile-time:

expr1.y: yacc finds 4 shift/reduce conflicts

At this point, I know that some readers will have broken out in a cold sweat at the mention of “shift/reduce conflict”. Don’t panic yet! At the moment, let’s just think of this as the LR parser statically detecting an ambiguity (or four…) and telling us that we should fix it somehow [13].

There are various ways of drilling into more details about those ambiguities. In a shameless plug, I’ll use nimbleparse, but most Yacc implementations have a way of giving more detailed information. nimbleparse also needs a valid lexer, so if I feed it the grammar above as well as this Lex file [14]:

%%
- "-"
\* "*"
[0-9]+ "INT"

I get this output:

Shift/Reduce conflicts:
   State 5: Shift("*") / Reduce(Expr: "Expr" "-" "Expr")
   State 5: Shift("-") / Reduce(Expr: "Expr" "-" "Expr")
   State 6: Shift("*") / Reduce(Expr: "Expr" "*" "Expr")
   State 6: Shift("-") / Reduce(Expr: "Expr" "*" "Expr")

Stategraph:
0: [^ -> . Expr, {'$'}]
   Expr -> 1
   'INT' -> 2
1: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [^ -> Expr ., {'$'}]
   '-' -> 3
   '*' -> 4
2: [Expr -> 'INT' ., {'-', '*', '$'}]
3: [Expr -> Expr '-' . Expr, {'-', '*', '$'}]
   'INT' -> 2
   Expr -> 5
4: [Expr -> Expr '*' . Expr, {'-', '*', '$'}]
   Expr -> 6
   'INT' -> 2
5: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [Expr -> Expr '-' Expr ., {'-', '*', '$'}]
   '*' -> 4
   '-' -> 3
6: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [Expr -> Expr '*' Expr ., {'-', '*', '$'}]
   '*' -> 4
   '-' -> 3

What this shows us is the stategraph (i.e. a statemachine) our grammar has been transformed into and the states where the conflicts have occurred.

It is possible, with a little effort, to understand that stategraph and the conflicts that have occurred. However, I’m not going to go into more detail here, because most readers will probably already have guessed that it’s very hard to make sense of conflicts on large grammars. I’d liken it, roughly speaking, to resolving whole-program type inference errors [15]: the errors reported are correct, but don’t necessarily correspond to the points in your program/grammar that you would consider need fixing.

While I’m sure that it’s possible to improve the way that conflicts are reported [16], to my surprise, I’ve developed lots of grammars without being troubled much by problems with conflicts. Indeed, the only time I’ve ever bothered to try and understand conflicts is when a large existing grammar needs updating to a new external specification, which is not common [17]. In most cases, I’m developing a new, or tweaking an existing, small grammar. Then, just as with languages using type inference, I find it most productive to save and compile after nearly every change. If this does identify a conflict, I know what change caused it, and it then tends to be fairly obvious what a plausible fix is. Not only do I not bother worrying about what state in the stategraph is involved, I don’t even bother checking whether the conflict(s) are shift/reduce, reduce/reduce, or accept/reduce [18].

Honestly, I’ve only encountered one realistic counter-example which is – wait for it – mathematical expressions. It is surprisingly difficult to encode this as an LR grammar, because mathematics’ syntax rules are complex, and nearly every naive grammar for them ambiguous. Fortunately, because it’s such a common example, solutions to this abound on the internet. Here’s the classic solution:

%start Expr
%%
Expr: Expr "-" Term
    | Term
    ;
Term: Term "*" Factor
    | Factor
    ;
Factor: "INT"
    ;

It has no conflicts, which means that Yacc has statically proven that it is unambiguous! It handles precedence – 2-3*4 parses as 2-(3*4) – and associativity – 2-3-4 parses as (2-3)-4 – correctly.

Over time, I’ve come to view LR parsing as equivalent to static typing: occasionally annoyingly restrictive, but providing enough static guarantees to be worth the annoyance for important software. It’s important to remember that LR isn’t magic: while it will stop you writing an ambiguous grammar, it won’t stop you writing an incorrect grammar for the language you wish to parse. For example, although LR will prevent you making a rule both left and right associative, you still have to choose correctly whether it should be left or right associative.

Performance

People often worry about parsing performance in general, and LR parsing performance specifically, though almost always without cause on modern computers. For example, if I take Java’s grammar (which is unusually big, and therefore slow to parse) and the LR parsing system I wrote (which has been only moderately optimised) I can happily parse many tens of thousands of lines of code per second on my 3 year old laptop. Unless you’ve got billions of lines of source code, or millions of users, this is surely fast enough.

I suspect that parsing performance worries date back to the period when parsing techniques were under heavy development. LR parsing was invented in 1965, a time when computers were painfully slow [19] and resource poor. LR parsing works by generating a statetable at compile-time that is then interpreted at run-time. Those statetables were far too big to be practical on the computers of the day, so two solutions were invented to solve this problem.

First, algorithmic subsets of LR (e.g. LALR, SLR) were invented that reduce the size of statetables, at the cost of reducing the number of grammars that they can accept (i.e. some LR grammars are not valid LALR grammars). In practise, these subsets are annoying to use: they cause some seemingly reasonable grammars to be rejected; and understanding why a grammar has been rejected can require a deep understanding of the algorithm [20].

Second, since 1977 we’ve known that you can substantially shrink LR statetables without restricting the grammars accepted [21]. When combined with a couple of other techniques to squeeze the statetable’s memory footprint [22], even the most puny modern machine can run an arbitrary LR parser at impressive speeds.

Error recovery

When I’m programming, I make a truly embarrassing number of syntax errors. It is vital that the parser I’m using accurately reports where I’ve made such an error: most parsers, including LR parsers, do a decent enough job in this regard [23]. It is then nice if the parser recovers from my error, allowing it to continue parsing.

The very best recursive descent parsers [24] do a pretty good job of error recovery. LL parsing systems also generally do a tolerable job for arbitrary LL grammars.

Unfortunately, it is fair to say that LR parsing systems such as Yacc do a poor job. Yacc itself uses error tokens, but the results are so bad that I find Yacc parsers with error recovery more frustrating to use than those without. However, we can do much better for arbitrary LR grammars, and hopefully more LR parsers will give good error messages in the future.

LR parsing: aesthetics

I’m now going to turn to a fuzzier factor: readability. Whether explicitly or implicitly, people need to know the syntax rules of the language they are using. Some programming language designers assume, or hope, that giving users a few code examples is equivalent to telling them the language’s syntax rules. This works for most cases as we can largely rely on a shared cultural understanding of “what a programming language looks like” [25], but experienced programmers know the dangers of ignoring dank corners such as operator precedence. At a deeper level, those who implement a compiler, or even just an accurate syntax highlighter, need to know precisely what a language’s syntax rules are. In my opinion, the readability of a parser is a vital factor in enabling accurate tooling for, and use of, a programming language.

To my mind, of the various grammars and parsers presented in this post, the easiest to read is the version for generalised parsers, because it most closely matches the informal mathematical grammar I was taught as a child. However, this readability comes at a cost: because the grammar is potentially ambiguous I sometimes misjudge which way a given input will be parsed after disambiguation.

The hardest to read is, without a shadow of a doubt, the recursive descent parser. It’s the longest, the most detailed, and the one lacking any underlying theory to guide the reader.

The lack of left recursion in LL parsing makes many grammars awkward to read. A surprising way of seeing that is by using the fact that many (though not all) LR grammars can be converted to LL semi-mechanically (see e.g. this translation of roughly the same LR grammar as used in this post to an LL equivalent): the resulting LL grammar is never easier to read after the conversion.

LR grammars thus fill an important hole. They’re generally close in readability to an arbitrary CFG; since left associativity is so common, they’re nearly always easier to read than LL grammars; and, if you’ll allow a small amount of poetic license, they’re infinitely easier to read than a recursive descent parser.

Of course, I’m clearly somewhat biased, so perhaps these words from Guy Steele might be more compelling:

[Be] sure that your language will parse. It seems stupid … to sit down and start designing constructs and not worry about how they will fit together. You can get a language that’s difficult if not impossible to parse, not only for a computer, but for a person. I use Yacc constantly as a check of all my language designs, but I very seldom use Yacc in the implementation. I use it as a tester, to be sure that it’s LR(1) … because if a language is LR(1) it’s more likely that a person can deal with it.

Dynamic Languages Wizards Series - Panel on Language Design

Summary

Having spent years trying virtually every other possible approach to parsing, I now firmly believe that LR parsing is the best approach for the vast majority of purposes: it has the strongest practical safety guarantees, allows good grammar readability, and has decent performance. In particular, I hope future programming language authors take Guy Steele’s advice above, and make their reference grammar LR compliant [26].

Personally, I’ve put my money where my mouth is: I’ve put a lot of work into grmtools, a Yacc-compatible LR parsing system for Rust. grmtools isn’t yet perfect, or complete, nor is it by any means fully optimised — but it’s more than good enough for many purposes and I intend continuing to maintain it for some time to come. I hope it’s one small step towards encouraging people to rediscover the beauty and utility of LR parsing.

Acknowledgements: Thanks to Lukas Diekmann and Naveneetha Vasudevan for comments on drafts of this post. Thanks to Roberto Ierusalimschy and Terence Parr for answering my queries. All opinions, and any errors or infelicities, are very much due to me!

Newer 2020-09-15 08:00 Older
If you’d like updates on new blog posts: follow me on Mastodon or Twitter; or subscribe to the RSS feed; or subscribe to email updates:

Footnotes

[1]

Parsing Expression Grammars (PEG)s and “parser combinators” in some functional languages are just recursive descent parsers in disguise.

Parsing Expression Grammars (PEG)s and “parser combinators” in some functional languages are just recursive descent parsers in disguise.

[2]

My favourite example of this is best expressed as a Parsing Expression Grammar (PEG):

r <- a / ab

or as a hand-written recursive descent parser:

def r(s, i):
    if i + 1 < len(s) and s[i] == "a":
        return ...
    elif i + 2 < len(s) and s[i] == "ab":
        return ...

Both of these parsers successfully parse the string ‘a’ but fail to parse the string ‘ab’. As soon as ‘a’ is matched, the rule succeeds, which leaves ‘b’ unmatched; neither parser tries to match ‘ab’ directly.

My favourite example of this is best expressed as a Parsing Expression Grammar (PEG):

r <- a / ab

or as a hand-written recursive descent parser:

def r(s, i):
    if i + 1 < len(s) and s[i] == "a":
        return ...
    elif i + 2 < len(s) and s[i] == "ab":
        return ...

Both of these parsers successfully parse the string ‘a’ but fail to parse the string ‘ab’. As soon as ‘a’ is matched, the rule succeeds, which leaves ‘b’ unmatched; neither parser tries to match ‘ab’ directly.

[3]

I believe that it’s still an open question as to how many distinct disambiguation operators there need to be.

I believe that it’s still an open question as to how many distinct disambiguation operators there need to be.

[4]

In Converge I ended up cheating, encoding some default disambiguation rules into the parser. When I did this I didn’t really understand the problem that I’d encountered nor did I realise that my “solution” was not curing, but merely delaying, the pain. The only thing more surprising than encountering an ambiguous parse is finding out that your input has been disambiguated-by-default in the wrong way.

In Converge I ended up cheating, encoding some default disambiguation rules into the parser. When I did this I didn’t really understand the problem that I’d encountered nor did I realise that my “solution” was not curing, but merely delaying, the pain. The only thing more surprising than encountering an ambiguous parse is finding out that your input has been disambiguated-by-default in the wrong way.

[5]

To give a rough idea of scale: Rust’s parser is about 10KLoC and javac’s parser about 4.5KLoC.

To give a rough idea of scale: Rust’s parser is about 10KLoC and javac’s parser about 4.5KLoC.

[6]

Yes, I wrote more than one. I no longer recommend it, because Earley’s original algorithm has a bug in it, and descriptions of a/the fix seem either to be incorrect, or to destroy the beauty of the algorithm.

Yes, I wrote more than one. I no longer recommend it, because Earley’s original algorithm has a bug in it, and descriptions of a/the fix seem either to be incorrect, or to destroy the beauty of the algorithm.

[7]

Michael Van De Vanter first pointed Wagner’s work out to me. However, I didn’t appreciate it for what it was. I then forgot about it, and stumbled across it at “independently” at a later point, before somehow realising that it was what Michael had already suggested. I later learnt to listen to his advice more carefully, and benefited much from it!

Michael Van De Vanter first pointed Wagner’s work out to me. However, I didn’t appreciate it for what it was. I then forgot about it, and stumbled across it at “independently” at a later point, before somehow realising that it was what Michael had already suggested. I later learnt to listen to his advice more carefully, and benefited much from it!

[8]

It’s also the basis of Tree-sitter, which might be the best long-term argument I know of for programming languages having an LR grammar!

It’s also the basis of Tree-sitter, which might be the best long-term argument I know of for programming languages having an LR grammar!

[9]

Perhaps I was lucky not to study a compilers course myself (my university did not offer one at that point), as it meant I couldn’t develop the most severe of allergic reactions to LR parsing.

Perhaps I was lucky not to study a compilers course myself (my university did not offer one at that point), as it meant I couldn’t develop the most severe of allergic reactions to LR parsing.

[10]

From least to most expressive we thus have: regular expressions, LL, LR, unambiguous, CFG. In other words, regular expressions are a strict subset of LL, LL a strict subset of LR, and so on. The most complete description of the hierarchy I know can be found in p89 of Alexander Okhotin’s talk (where arrows mean “more expressive” and “ordinary” means “CFG”). Note that recursive descent doesn’t fit into this hierarchy at all — formally speaking, we know that it accepts a disjoint set of languages relative to CFGs, but, because PEGs have no underlying theory that we know of, we are unable to precisely define that set further.

Another interesting case is the ALL(*) algorithm which underlies ANTLR. ALL(*) accepts a strict superset of LL (including many ambiguous grammars), but is disjoint with LR since ALL(*) doesn’t support left-recursion. However, ANTLR can remove direct left-recursion before invoking ALL(*), so some grammars that might seem impossible to parse with ALL(*) can in fact be parsed by it. Bearing in mind that we’re talking about infinite sets, and that I don’t think we have a formal proof of the following statement, I think it would be fair to say that the ALL(*) subset of CFGs is bigger than the LR subset.

From least to most expressive we thus have: regular expressions, LL, LR, unambiguous, CFG. In other words, regular expressions are a strict subset of LL, LL a strict subset of LR, and so on. The most complete description of the hierarchy I know can be found in p89 of Alexander Okhotin’s talk (where arrows mean “more expressive” and “ordinary” means “CFG”). Note that recursive descent doesn’t fit into this hierarchy at all — formally speaking, we know that it accepts a disjoint set of languages relative to CFGs, but, because PEGs have no underlying theory that we know of, we are unable to precisely define that set further.

Another interesting case is the ALL(*) algorithm which underlies ANTLR. ALL(*) accepts a strict superset of LL (including many ambiguous grammars), but is disjoint with LR since ALL(*) doesn’t support left-recursion. However, ANTLR can remove direct left-recursion before invoking ALL(*), so some grammars that might seem impossible to parse with ALL(*) can in fact be parsed by it. Bearing in mind that we’re talking about infinite sets, and that I don’t think we have a formal proof of the following statement, I think it would be fair to say that the ALL(*) subset of CFGs is bigger than the LR subset.

[11]

There are larger unambiguous subsets such as LR-Regular (or “LRR”) grammars. However, as far as I can tell, these are probably not practical. For example, it is not decidable as to whether an arbitrary grammar is LRR or not. [Update 2020-10-28: a previous version of this footnote suggested that Marpa is LRR-based. It is a generalised parser that can therefore also parse LRR grammars. My apologies for the confusion!] [Update: 2022-12-20: Askar Safin points out that a later paper than the one I cited shows that it is decidable as to whether an arbitrary grammar is LRR or not.]

There are larger unambiguous subsets such as LR-Regular (or “LRR”) grammars. However, as far as I can tell, these are probably not practical. For example, it is not decidable as to whether an arbitrary grammar is LRR or not. [Update 2020-10-28: a previous version of this footnote suggested that Marpa is LRR-based. It is a generalised parser that can therefore also parse LRR grammars. My apologies for the confusion!] [Update: 2022-12-20: Askar Safin points out that a later paper than the one I cited shows that it is decidable as to whether an arbitrary grammar is LRR or not.]

[12]

Berkeley Yacc actually implements LALR, but for this example it’s indistinguishable from LR. I’ll discuss LALR a little bit later in this post.

Berkeley Yacc actually implements LALR, but for this example it’s indistinguishable from LR. I’ll discuss LALR a little bit later in this post.

[13]

Although I’ve presented the conflicts as errors, in Yacc they’re actually warnings because it has “default conflict resolution” rules (see Section 5 of the Yacc manual). In other words Yacc is willing to take in an ambiguous grammar and automatically disambiguate it to produce an unambiguous grammar. In general, I do not recommend making use of this feature.

Although I’ve presented the conflicts as errors, in Yacc they’re actually warnings because it has “default conflict resolution” rules (see Section 5 of the Yacc manual). In other words Yacc is willing to take in an ambiguous grammar and automatically disambiguate it to produce an unambiguous grammar. In general, I do not recommend making use of this feature.

[14]

Although it’s rarely remarked upon, the traditional splitting of “parsing” into separate lexing and parsing phases is an important part of the ambiguity story. Not only is it easy for the lexer to identify for as a keyword and forest as an identifier, but the parser then only has to distinguish between token types and not token values. Scannerless parsing merges these two phases, which allows more grammars to be expressed, but introduces more scope for ambiguity — and, in some cases, enables the resulting parsing algorithm to accept context-sensitive grammars.

Although it’s rarely remarked upon, the traditional splitting of “parsing” into separate lexing and parsing phases is an important part of the ambiguity story. Not only is it easy for the lexer to identify for as a keyword and forest as an identifier, but the parser then only has to distinguish between token types and not token values. Scannerless parsing merges these two phases, which allows more grammars to be expressed, but introduces more scope for ambiguity — and, in some cases, enables the resulting parsing algorithm to accept context-sensitive grammars.

[15]

Imagine a Haskell or RPython program where none of the functions have explicit types. The challenge when programming in such systems is that errors are often reported far away from where they were caused. In other words, I might make a static type error in one function, but the type inferencer will detect the resulting error in another function. While type error messages have become much better over time, they can never match human expectations in all cases.

Imagine a Haskell or RPython program where none of the functions have explicit types. The challenge when programming in such systems is that errors are often reported far away from where they were caused. In other words, I might make a static type error in one function, but the type inferencer will detect the resulting error in another function. While type error messages have become much better over time, they can never match human expectations in all cases.

[16]

The best conflict reports I’ve seen come from LALRPOP.

The best conflict reports I’ve seen come from LALRPOP.

[17]

Off-hand, I can only think of a single example: when Lukas tried to evolve this Java 7 grammar to Java 8. Until that point, grmtools didn’t have a way of reporting details about conflicts because I hadn’t needed such a feature!

The Java specification used to pride itself on presenting a simple, machine-proven, unambiguous grammar in an appendix. Unfortunately, at some point, this grammar seems to have been dropped from the specification, and I suspect that the new syntax introduced has not been checked for possible ambiguities. We quickly realised that a Java 8 grammar wasn’t important enough to our work for us to invest the time in this, so I don’t know if it is ambiguous or not.

Off-hand, I can only think of a single example: when Lukas tried to evolve this Java 7 grammar to Java 8. Until that point, grmtools didn’t have a way of reporting details about conflicts because I hadn’t needed such a feature!

The Java specification used to pride itself on presenting a simple, machine-proven, unambiguous grammar in an appendix. Unfortunately, at some point, this grammar seems to have been dropped from the specification, and I suspect that the new syntax introduced has not been checked for possible ambiguities. We quickly realised that a Java 8 grammar wasn’t important enough to our work for us to invest the time in this, so I don’t know if it is ambiguous or not.

[18]

For the insatiably curious, the conflict types mean roughly:

  • shift/reduce: The LR parser can’t be sure whether it should advance the input by one token, or whether a parsing rule will have completed.
  • reduce/reduce: The LR parser can’t be sure which of two rules will have completed.
  • accept/reduce: The LR parser can’t be sure if the entire parse has completed or merely one rule has completed.

That last possibility is so rare that I’d forgotten it even exists before I thought to fact-check this footnote!

For the insatiably curious, the conflict types mean roughly:

  • shift/reduce: The LR parser can’t be sure whether it should advance the input by one token, or whether a parsing rule will have completed.
  • reduce/reduce: The LR parser can’t be sure which of two rules will have completed.
  • accept/reduce: The LR parser can’t be sure if the entire parse has completed or merely one rule has completed.

That last possibility is so rare that I’d forgotten it even exists before I thought to fact-check this footnote!

[19]

Roughly speaking, the fastest super computer in the world at that time ran about 10,000 times slower than a decent desktop chip today.

Roughly speaking, the fastest super computer in the world at that time ran about 10,000 times slower than a decent desktop chip today.

[20]

SLR is particularly restrictive. However, I’m not sure I’ve ever seen SLR used in practise (though I know it was in the past), but LALR is still found in Berkeley Yacc. Even though LALR is less restrictive than SLR, it can still require real programming language grammars to be unpleasantly contorted in places.

SLR is particularly restrictive. However, I’m not sure I’ve ever seen SLR used in practise (though I know it was in the past), but LALR is still found in Berkeley Yacc. Even though LALR is less restrictive than SLR, it can still require real programming language grammars to be unpleasantly contorted in places.

[21]

Pager’s description is slightly incomplete; it’s best paired with Xin Chen’s thesis. From memory, neither mentions that the algorithm is non-deterministic and can sometimes create unreachable states that can be garbage collected to save a little bit more memory. grmtool’s implementation of this algorithm goes into more detail on such matters and also has the bonus of being runnable. However, Pager’s algorithm doesn’t quite work properly if you use Yacc’s conflict resolution feature. One day I should implement the IELR algorithm to solve this problem.

Pager’s description is slightly incomplete; it’s best paired with Xin Chen’s thesis. From memory, neither mentions that the algorithm is non-deterministic and can sometimes create unreachable states that can be garbage collected to save a little bit more memory. grmtool’s implementation of this algorithm goes into more detail on such matters and also has the bonus of being runnable. However, Pager’s algorithm doesn’t quite work properly if you use Yacc’s conflict resolution feature. One day I should implement the IELR algorithm to solve this problem.

[22]

For example, encoding sparse tables (e.g. in Rust with the sparsevec crate), and packing vectors of small integers (e.g. with the packedvec crate). It’s a long time since I’ve thought about these aspects: from memory, one can do even better than these techniques, but they’re already effective enough that we didn’t feel the need to look further at that point.

For example, encoding sparse tables (e.g. in Rust with the sparsevec crate), and packing vectors of small integers (e.g. with the packedvec crate). It’s a long time since I’ve thought about these aspects: from memory, one can do even better than these techniques, but they’re already effective enough that we didn’t feel the need to look further at that point.

[23]

There is one major exception in C-ish syntaxes: missing curly brackets. The resulting errors are typically reported many lines after the point that a human would consider as the cause of the problem.

There is one major exception in C-ish syntaxes: missing curly brackets. The resulting errors are typically reported many lines after the point that a human would consider as the cause of the problem.

[24]

rustc gives the best syntax error messages of any compiler / parser I’ve ever used.

rustc gives the best syntax error messages of any compiler / parser I’ve ever used.

[25]

Recent years have reinforced a long-standing trend: programmers don’t like to learn languages with unfamiliar syntaxes. For better or worse, C-ish syntax is likely to be the dominant cultural force in programming languages for decades to come.

Recent years have reinforced a long-standing trend: programmers don’t like to learn languages with unfamiliar syntaxes. For better or worse, C-ish syntax is likely to be the dominant cultural force in programming languages for decades to come.

[26]

That doesn’t mean that the eventual compiler has to contain an LR parser (though I’d start with an LR parser and only consider moving to something else if I had millions of users), but the parser it does contain should be entirely compliant with the reference LR grammar.

Unfortunately, for the foreseeable future, we are going to be stuck with programming languages who have used other parsing formalisms: pity the poor IDE author who has to deal with yet another language with only a recursive descent parser instead of an LR grammar!

That doesn’t mean that the eventual compiler has to contain an LR parser (though I’d start with an LR parser and only consider moving to something else if I had millions of users), but the parser it does contain should be entirely compliant with the reference LR grammar.

Unfortunately, for the foreseeable future, we are going to be stuck with programming languages who have used other parsing formalisms: pity the poor IDE author who has to deal with yet another language with only a recursive descent parser instead of an LR grammar!

Comments



(optional)
(used only to verify your comment: it is not displayed)