Why We Need to Know LR and Recursive Descent Parsing Techniques

January 17 2023

Blog archive

 
Last 10 blog posts
Distinguishing an Interpreter from a Compiler
try_repeat released
Why We Need to Know LR and Recursive Descent Parsing Techniques
Compiled and Interpreted Languages: Two Ways of Saying Tomato
Software Security Research Position
How Might Generative AI Change Programming?
pizauth: differentiating transient from permanent errors
November Links
More Evidence for Problems in VM Warmup
What is a Research Summer School?
 
A couple of people have asked me for my thoughts on part of an article from Tiark Rompf called (roughly) "Just Write the Parser" which advocates the use of recursive descent parsing over more "formal" approaches to parsing, at least in the context of teaching compilers. I'm fairly sure I read this article a couple of years ago, but it may have been updated, or simply have been discovered anew. Either way, since I admire both Tiark and his work a great deal, it was useful for me to engage with his ideas and compare them to my own.

In this post, I'm going to summarise Tiark's argument as I see it, and try to explain where I agree and disagree with that summary. I will inevitably rehash many of the arguments from my 2020 "Which Parsing Approach?" post, but I'm going to try to focus on some factors which I now realise were at best buried, and at worst inadequately explained, in that previous post.

Summarising the argument

In essence I believe Tiark is making two points:

  1. Making parsing a major component of teaching compilers is misguided.
  2. Context-free grammars, and their associated parsing techniques, don't align well with real-world compilers, and thus we should deemphasise CFGs (Context-Free Grammars) and their associated parsing algorithms.
I agree with the first point. Learning has to involve prioritisation: we can't learn the underlying theory of everything. Parsing seems to me to be one of those topics where a fairly superficial understanding of the underlying theory is sufficient for most people to use it effectively. Personally, I would rather spend the time that a deep understanding of the underlying theory of parsing requires elsewhere: there are many other parts of compilers that better reward a deeper understanding
[1].

However, I disagree with the second point — or, at least, I disagree with part of it. It is true that many real-world languages' syntaxes are not compatible with LR parsing (and the like) and must be parsed using recursive descent parsing. Without wishing to put words in his mouth, I suspect that Tiark sees this as a virtue of recursive descent parsing.

In contrast, I have come, over time, to see languages that require recursive descent parsing as reflecting a flaw in our approach to language design [2]. I believe it has led us to design languages that are not just hard to parse, that don't just tend to have buggy parsers, but which are harder than necessary to understand.

Differentiating LR and recursive descent parsing

Let's start with some basic definitions. A grammar is a specification of the valid sentences in a language. Note that grammars need not be executable — indeed, they may be specified formally or informally, explicitly or implicitly. A parser is an implementation of a grammar, which checks whether an input is valid with respect to that grammar or not. Grammars are ambiguous (e.g. a grammar such as E: E "+" E | E "-" E is ambiguous because 2+3*4 can be parsed as equivalent to either (2+3)*4 or 2+(3*4) with it) or unambiguous.

For example, all of us share a common understanding of the grammar for basic arithmetic expressions. We can then implement an LR or a recursive descent parser which takes arithmetic expressions as inputs and checks whether they conform to the arithmetic grammar.

In practise, grammars and parsers are often closer in nature than their abstract definitions suggest. In one direction, we can automatically derive parsers from some grammars (e.g. those written to conform to the LR subset of CFGs). In the other direction, a parser implicitly defines a grammar [3]. There is a subtle corollary to this: if I manually create a parser from the specification of a grammar, I may introduce bugs, which means that the parser's implicit grammar may not quite match the explicit grammar I thought I was implementing!

There are various ways of implementing parsers, but for me there are generally only two which we really need to consider: LR parsing and recursive descent parsing.

LR parsing defines a subset of CFGs: every grammar in the LR subset is guaranteed to be unambiguous [4]. From an LR grammar we can automatically derive an LR parser (e.g. using a tool like YACC or lrpar). A challenge with LR parsing is that some grammars are difficult to express unambiguously (e.g. arithmetic expressions): some are impossible to express unambiguously (e.g. many real-world programming languages).

In practise, a recursive descent parser can be thought of as meaning "a hand-written written parser": unless you go out of your way to do otherwise, nearly all hand-written parsers are recursive descent parsers. As this suggests, recursive descent parsers are easy to write. Since they're also just normal programs, we can use them to parse any language we can conceive of.

Presented like this there there doesn't seem to be much of a choice between the two. Recursive descent parsing is easier and more flexible, so let's just use that!

However, recursive descent parsing has an inherent consequence: it produces parsers which resolve ambiguities in the underlying grammar without telling you that they have been resolved. For example, a recursive descent parser for arithmetic expressions will parse 2+3*4 as equivalent to either (2+3)*4 or 2+(3*4), but not both. However, since a recursive descent parser is just a normal program, you might not even realise that you have written your parser in such a way that it has chosen to parse just one of the possibilities — hopefully you chose to parse it as equivalent to 2+(3*4)!

In contrast, a CFG can't be LR if it has such an ambiguity. If you give a grammar such as E: E "+" E | E "-" E to YACC it will tell you it is ambiguous (using the infamous, and in some ways unfortunate, terminology of "shift/shift", "reduce/reduce", or "accept/reduce"). Although few people think of it as such, LR is effectively a type checker for grammar ambiguity: if you don't get any errors, your grammar is guaranteed to be unambiguous. One of the beauties of LR parsing is that the same grammar that has been "type checked" to be unambiguous can then be automatically turned into a parser: there is no way that you can accidentally introduce ambiguity, or other bugs, into the resulting parser [5].

Fundamentally, recursive descent parsers have no equivalent "type checker", and we know that we can't build such an equivalent for the general case. Recursive descent parsers thus parse the languages they parse, but we don't, in general, know how to recover the grammar that such parsers are implicitly parsing. That is not the tautology it sounds: my experience is that bugs in recursive parsers are common. I have had the dubious pleasure of converting a number of recursive descent parsers to LR grammars and none of the non-trivial recursive descent parsers has parsed the language their authors expected. Sometimes they accept inputs they shouldn't; sometimes they reject inputs they shouldn't.

This wouldn't be as much of a problem if it was easy to test parsers, but it turns out to be surprisingly hard to create good parser suites. Most parser test suites that I've seen focus most of their effort on testing that "good" inputs are accepted. Since the range of "bad" inputs is, in general, much more diverse, test suites rarely cover much of the space of "bad" inputs to check that they're rejected [6].

There is an additional, easily over-looked, factor that comes from relying on recursive descent parsers: learners cannot, in any sensible time, understand from the recursive descent parser the language's syntax. Many languages ignore this problem at first, and hope that learners can guess the language's syntax from examples. This is inevitably unsatisfactory, and it is then common over time for languages to try deriving a CFG from the recursive descent parser, to help learners and tool authors. However, these CFGs rarely fully match the recursive descent parser, leaving learners confused, and tool authors frustrated.

There are two other use-cases for recursive descent parsing. The first is performance. Although there aren't, to the best of my knowledge, modern performance numbers, it is reasonable to assume that recursive descent parsing is generally faster than LR parsing. However, "faster" is relative: on modern machines, even a naive LR parser on a huge grammar will happily parse thousands of lines of code a second which is enough for the vast majority of use cases. The second is error recovery. LR parsers have traditionally had terrible error recovery. Immodestly, I now claim that some work I was involved in allows LR parsers to have error recovery that is better than all but the best recursive descent parsers.

Summary

It seems to me that if you believe static type checking allows you to write programs that are more reliable than if you used dynamic type checking then, logically, you must also believe that LR parsing is more reliable than recursive descent parsing. Similarly, if you believe that explicit static types help readers better understand code, then you must also believe that explicit LR grammars better convey to readers the grammar of a language than the equivalent recursive descent parser.

There is then a separate debate to be had about whether the costs of LR parsing are worth the benefits. My personal experience is that, for unambiguous languages, it takes relatively little practise to design LR grammars. It is then more than just a bonus that we can automatically generate a parser from the grammar. Indeed, and to my surprise, I've found it markedly quicker to write an LR grammar than to write, test, and debug the equivalent recursive descent parser. My advice for writing LR grammars to those new to it boils down to, in essence, "write your grammar in small bits, continually test it for LR-ness, and if you get a shift/reduce (etc.) error, undo your last change, and try a variation." If that sounds trite, or patronising, please bear in mind that it's exactly the approach I take myself!

If you agree with this line of thinking, then you might find the following list of recommendations useful:

  • New languages (whether they be "programming languages" or mere "configuration languages" and the like) should come with an LR grammar: you then have a specification that users can understand, and people can automatically derive parsers for their favourite use-case directly from that specification. Specifying new languages in this style is not difficult, and the restrictions on syntax rarely onerous. If writing a grammar in LR style is hard for you as the language author, it will almost certainly be hard for users to understand!
  • Existing languages have often evolved in a manner that makes it difficult, or impossible, to specify an LR grammar. There's no point in trying to fight this: just use recursive descent parsing.
  • The more complex your recursive descent parser, the more effort you need to put into testing. Without extensive testing, your parser will do things you didn't expect: it is just normal code, after all, and all of us make mistakes when coding!
  • If you're writing a parser for a language which doesn't have an LR grammar, it can be worth trying to create one. If you're lucky you'll find that your language has an unambiguous grammar; if you're unlucky, you'll at least understand where some of the points of ambiguity are [7]. I find that this makes writing a recursive descent parser a less fraught task, as at least I know where the likely bugs due to ambiguity will lie.
  • If you're writing a parser for a language which is ambiguous, you might be able to use the conflict resolution offered by YACC and friends. However, in general, I suggest avoiding YACC's conflict resolution if possible, as you have to at least somewhat understand the underlying LR algorithm to understand how conflicts have been resolved.
  • If you need the best possible performance or error recovery, recursive descent parsing is the best choice. If you know you will need this and you're designing a new language, at least create an LL grammar, since from that you can derive a recursive descent parser. [Bear in mind that LL grammars are more annoying to write than LR grammars which is why I suggest avoiding LL unless you have this very specific use case.]
  • LALR and SLR are subsets of LR parsing that are irritating to use and, on any machine from the last 20 years, have no meaningful advantages. Avoid.
  • Otherwise, just use LR parsing, whether it's YACC or the equivalent for your favourite language (e.g. I am fond of lrpar for Rust since I wrote it).
Looking back to Tiark's article, my opinion on teaching parsing is two-fold. First, parsing isn't generally worth much of student's time, so teach them the minimum possible. Second, the minimum possible is, in my opinion, that which is needed to write: recursive descent parsing, so that they can parse existing languages which lack an unambiguous grammar; and LR parsing so that they don't keep designing hard-to-parse languages in the future.

If you’d like updates on new blog posts: follow me on Mastodon or Twitter; or subscribe to the RSS feed; or subscribe to email updates:

Footnotes

[1] I didn't have the opportunity to study compilers as an undergraduate. If I had, I would not have wanted parsing to take up more than 1 week of such a course: I'd have wanted to know more about the many other parts that constitute a compiler!
[2] Note I'm using the word "design" deliberately: it is not, in my opinion, a flaw to implement parsers with recursive descent, provided there is a non-recursive descent design of the grammar elsewhere.
[3] More formally, and generally, we can talk about "the language a parser accepts".
[4] Though there are some unambiguous grammars that are not in the LR subset. That need not concern us here.
[5] This isn't true for Yacc which allows action code to change how the parser operates, allowing it to parse non-LR languages. Perhaps unsurprisingly, I do not consider this a good idea, and lrpar does not feature such functionality.
[6] Oddly enough, my experience of converting recursive descent parsers to LR grammars suggests that recursive descent parsers seem about as often to reject "good" inputs as to accept "bad" inputs. I don't have a convincing explanation as to why this might be the case.
[7] Though depending on how many ambiguities your grammar has, you might find that some "shadow" others. In other words, by failing to write one part of a grammar in the LR subset, you might not be able to write other parts of your grammar in the LR subset either, causing some ambiguities to remain unnoticed.
I didn't have the opportunity to study compilers as an undergraduate. If I had, I would not have wanted parsing to take up more than 1 week of such a course: I'd have wanted to know more about the many other parts that constitute a compiler!
Note I'm using the word "design" deliberately: it is not, in my opinion, a flaw to implement parsers with recursive descent, provided there is a non-recursive descent design of the grammar elsewhere.
More formally, and generally, we can talk about "the language a parser accepts".
Though there are some unambiguous grammars that are not in the LR subset. That need not concern us here.
This isn't true for Yacc which allows action code to change how the parser operates, allowing it to parse non-LR languages. Perhaps unsurprisingly, I do not consider this a good idea, and lrpar does not feature such functionality.
Oddly enough, my experience of converting recursive descent parsers to LR grammars suggests that recursive descent parsers seem about as often to reject "good" inputs as to accept "bad" inputs. I don't have a convincing explanation as to why this might be the case.
Though depending on how many ambiguities your grammar has, you might find that some "shadow" others. In other words, by failing to write one part of a grammar in the LR subset, you might not be able to write other parts of your grammar in the LR subset either, causing some ambiguities to remain unnoticed.

Comments

Comment:
Name:
Homepage: (optional)
Email: (used only to verify your comment: it is not displayed)
Can't load comments