In this post, I'm going to summarise Tiark's argument as I see it, and try
to explain where I agree and disagree with that summary. I will inevitably
rehash many of the arguments from my 2020 "Which Parsing
Approach?" post, but I'm going to try to focus on some factors which I now
realise were at best buried, and at worst inadequately explained, in that
previous post.
Summarising the argument
In essence I believe Tiark is making two points:
I agree with the first point. Learning has to involve prioritisation: we can't
learn the underlying theory of everything. Parsing seems to me to be
one of those topics where a fairly superficial understanding of the underlying
theory is sufficient for most people to use it effectively. Personally, I would
rather spend the time that a deep understanding of the underlying theory of parsing
requires elsewhere: there are many other parts of compilers that better reward
a deeper understanding [1].
However, I disagree with the second point — or, at least, I disagree
with part of it. It is true that many real-world languages' syntaxes are not compatible
with LR parsing (and the like) and must be parsed using
recursive descent parsing.
Without wishing to put words in his mouth, I suspect that Tiark sees this as a virtue of
recursive descent parsing.
In contrast, I have come, over time, to see
languages that require recursive descent parsing as reflecting
a flaw in our approach to language design [2].
I believe it has led us to design languages
that are not just hard to parse, that don't just tend to have buggy parsers,
but which are harder than necessary to understand.
Differentiating LR and recursive descent parsing
Let's start with some basic definitions. A grammar is a
specification of the valid sentences in a language. Note that grammars need not
be executable — indeed, they may be specified formally or informally,
explicitly or implicitly. A parser is an implementation of a grammar, which
checks whether an input is valid with respect to that grammar or not. Grammars
are either ambiguous (e.g. a grammar such as E: E "+" E | E "*" E is ambiguous
because 2+3*4 can be parsed as equivalent to either (2+3)*4 or 2+(3*4) with it)
or unambiguous. For example, all of us share a common understanding of the grammar for basic
arithmetic expressions. We can then implement an LR or a recursive descent
parser which takes arithmetic expressions as inputs and checks whether they
conform to the arithmetic grammar.
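As an illustrative sketch (the stratified grammar and function names here are my own choices, not anything from Tiark's article or my earlier post), a recursive descent recogniser for basic arithmetic expressions might look like:

```python
# A minimal recursive descent recogniser for the (hypothetical) grammar:
#   Expr:   Term ("+" Term)*
#   Term:   Factor ("*" Factor)*
#   Factor: INT | "(" Expr ")"
# It checks whether an input conforms to the grammar, nothing more.
def accepts(s: str) -> bool:
    toks = s.replace(" ", "")
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def factor():
        nonlocal pos
        c = peek()
        if c is not None and c.isdigit():
            # Consume a run of digits as a single INT token.
            while peek() is not None and peek().isdigit():
                pos += 1
            return True
        if c == "(":
            pos += 1
            if expr() and peek() == ")":
                pos += 1
                return True
        return False

    def term():
        nonlocal pos
        if not factor():
            return False
        while peek() == "*":
            pos += 1
            if not factor():
                return False
        return True

    def expr():
        nonlocal pos
        if not term():
            return False
        while peek() == "+":
            pos += 1
            if not term():
                return False
        return True

    # The whole input must be consumed for it to be valid.
    return expr() and pos == len(toks)
```

Note that each grammar rule maps directly onto one (recursive) function, which is where recursive descent parsing gets its name.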
In practice, grammars and parsers are often closer in nature than their
abstract definitions suggest. In one direction, we can automatically derive
parsers from some grammars (e.g. those written to conform to the LR subset of
CFGs). In the other direction, a parser implicitly defines a grammar [3]. There is
a subtle corollary to this: if I manually create a parser from the
specification of a grammar, I may introduce bugs, which means that the parser's
implicit grammar may not quite match the explicit grammar I thought I was
implementing!
There are various ways of implementing parsers, but for me there are
generally only two which we really need to consider: LR parsing and recursive
descent parsing.
LR parsing defines a subset of CFGs: every grammar in the LR subset is
guaranteed to be unambiguous [4]. From an LR
grammar we can automatically derive an LR parser (e.g. using a tool like YACC
or lrpar). A challenge with LR
parsing is that some languages are difficult to express unambiguously (e.g.
arithmetic expressions); some are impossible to express as LR grammars (e.g.
many real-world programming languages).
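For example, arithmetic expressions can be made unambiguous by a standard textbook rewrite (not specific to any one tool): stratify the grammar so that precedence and associativity are encoded in the rule structure itself.

```
/* Ambiguous: 2+3*4 has two parse trees. */
E: E "+" E | E "*" E | "INT" ;

/* Unambiguous and LR-friendly: "*" binds tighter than "+",
   and both operators are left associative. */
Expr:   Expr "+" Term   | Term   ;
Term:   Term "*" Factor | Factor ;
Factor: "INT" | "(" Expr ")" ;
```

This rewrite takes a little practice, which is part of what makes some languages "difficult" to express unambiguously.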
In practice, a recursive descent parser can be thought of as meaning "a
hand-written parser": unless you go out of your way to do otherwise,
nearly all hand-written parsers are recursive descent parsers. As this suggests,
recursive descent parsers are easy to write. Since they're also just normal
programs, we can use them to parse any language we can conceive of.
Presented like this, there doesn't seem to be much of a choice
between the two. Recursive descent parsing is easier and more flexible,
so let's just use that!
However, recursive descent parsing has an inherent consequence: it produces
parsers which resolve ambiguities in the underlying grammar without telling you
that they have been resolved. For example, a recursive descent parser for
arithmetic expressions will parse 2+3*4 as equivalent to either (2+3)*4 or
2+(3*4), but not both. However, since a recursive descent parser is just a
normal program, you might not even realise that you have written your parser
in such a way that it has chosen to parse just one of the possibilities —
hopefully you chose to parse it as equivalent to 2+(3*4)!
In contrast, a CFG can't be LR if it has such an ambiguity. If you give a
grammar such as E: E "+" E | E "-" E to YACC it will tell you it
is ambiguous (using the infamous, and in some ways unfortunate, terminology of "shift/reduce",
"reduce/reduce", or "accept/reduce"). Although few people think of it as such,
LR is effectively a type checker for grammar ambiguity: if you don't get any
errors, your grammar is guaranteed to be unambiguous. One of the beauties of LR
parsing is that the same grammar that has been "type checked" to be unambiguous
can then be automatically turned into a parser: there is no way that you can
accidentally introduce ambiguity, or other bugs, into the resulting parser
[5].
Fundamentally, recursive descent parsers have no equivalent "type checker",
and we know that we can't build such an equivalent for the general case.
Recursive descent parsers thus parse the languages they parse, but we don't, in
general, know how to recover the grammar that such parsers are implicitly
parsing. That is not the tautology it sounds: my experience is that bugs in
recursive descent parsers are common. I have had the dubious pleasure of converting a
number of recursive descent parsers to LR grammars and none of the non-trivial
recursive descent parsers has parsed the language their authors expected.
Sometimes they accept inputs they shouldn't; sometimes they reject inputs they
shouldn't.
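A hypothetical sketch (not from any real codebase) makes the problem concrete: these two recursive descent evaluators for chains of subtractions accept exactly the same inputs, but silently commit to different groupings of E: E "-" E, and nothing warns the author which one they have written.

```python
def eval_sub_right(s: str) -> int:
    # Right recursion: silently groups "2-3-4" as 2-(3-4).
    head, sep, rest = s.partition("-")
    return int(head) if sep == "" else int(head) - eval_sub_right(rest)

def eval_sub_left(s: str) -> int:
    # Left fold: groups "2-3-4" as (2-3)-4, the conventional reading.
    parts = s.split("-")
    acc = int(parts[0])
    for p in parts[1:]:
        acc -= int(p)
    return acc
```

Both functions are plausible-looking "normal programs", yet eval_sub_right("2-3-4") gives 3 while eval_sub_left("2-3-4") gives the conventional -5.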
This wouldn't be as much of a problem if it were easy to test parsers, but it turns out
to be surprisingly hard to create good parser test suites. Most parser test suites
that I've seen focus most of their effort on testing that "good" inputs are
accepted. Since the range of "bad" inputs is, in general, much more diverse,
test suites rarely cover much of the space of "bad" inputs to check that
they're rejected [6].
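The kind of test suite I have in mind can be sketched as follows, using a deliberately trivial stand-in recogniser (the "digits separated by +" grammar is purely illustrative):

```python
import re

# Hypothetical stand-in recogniser: accepts integers separated by "+".
def accepts(s: str) -> bool:
    return re.fullmatch(r"\d+(\+\d+)*", s) is not None

GOOD = ["1", "2+3", "10+20+30"]
BAD = ["", "+", "2+", "+3", "2++3", "2 + 3"]

# Most real test suites stop after this half:
assert all(accepts(s) for s in GOOD)
# The often-neglected half: malformed inputs must be rejected too.
assert not any(accepts(s) for s in BAD)
```

Even for this toy grammar, the "bad" inputs are more varied than the "good" ones, which suggests why this half of the space is so rarely well covered.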
There is an additional, easily overlooked, factor that comes from
relying on recursive descent parsers: learners cannot,
in any sensible amount of time, work out the language's syntax from the
recursive descent parser. Many languages ignore this problem at first, and hope
that learners can guess the language's syntax from examples. This is inevitably
unsatisfactory, and it is then common over time for languages to try deriving
a CFG from the recursive descent parser, to help learners and tool authors.
However, these CFGs rarely fully match the recursive descent parser, leaving
learners confused, and tool authors frustrated.
There are two other use-cases for recursive descent parsing. The first is
performance. Although there aren't, to the best of my knowledge, modern
performance numbers, it is reasonable to assume that recursive descent parsing
is generally faster than LR parsing. However, "faster" is relative: on modern
machines, even a naive LR parser on a huge grammar will happily parse thousands
of lines of code a second, which is enough for the vast majority of use cases.
The second is error recovery. LR parsers have traditionally had terrible error
recovery. Immodestly, I now claim that some work I was
involved in allows LR parsers to have error recovery that is better than
all but the best recursive descent parsers.
There is then a separate debate to be had about whether the costs of LR
parsing are worth the benefits. My personal experience is that, for unambiguous
languages, it takes relatively little practice to design LR grammars. It is
then more than just a bonus that we can automatically generate a parser
from the grammar. Indeed, and to my surprise, I've found it markedly quicker to write an LR grammar
than to write, test, and debug the equivalent recursive descent parser.
My advice for writing LR grammars to those new to it boils down to,
in essence, "write your grammar in
small bits, continually test it for LR-ness, and if you get a shift/reduce
(etc.) error, undo your last change, and try a variation." If that sounds
trite, or patronising, please bear in mind that it's exactly the approach I
take myself!
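To make that advice concrete, here is a tiny, hypothetical Yacc-style fragment showing the kind of iteration I mean:

```
/* This innocent-looking rule is ambiguous: YACC reports a
   shift/reduce conflict, because 1-2-3 can group either way. */
Expr: Expr "-" Expr | "INT" ;

/* Undo, and try a variation: making the rule left recursive
   encodes left associativity, and the conflict disappears. */
Expr: Expr "-" "INT" | "INT" ;
```

Each small change is immediately checked for LR-ness, so ambiguities are caught the moment they are introduced rather than discovered later.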
If you agree with this line of thinking, then you might find the following
list of recommendations useful:
Summary
It seems to me that if you believe static type checking allows you to write
programs that are more reliable than if you used dynamic type checking then,
logically, you must also believe that LR parsing is more reliable than
recursive descent parsing. Similarly, if you believe that explicit static types
help readers better understand code, then you must also believe that
explicit LR grammars better convey to readers the grammar of a language than
the equivalent recursive descent parser.
Looking back to Tiark's article, my opinion on teaching parsing is two-fold.
First, parsing isn't generally worth much of a student's time, so teach them the
minimum possible. Second, the minimum possible is, in my opinion, that which is
needed to write: recursive descent parsing, so that they can parse existing
languages which lack an unambiguous grammar; and LR parsing so that they don't
keep designing hard-to-parse languages in the future.
Footnotes
[1]I didn't have the opportunity to study compilers as an undergraduate. If I had, I would not have wanted parsing to take up more than 1 week of such a course: I'd have wanted to know more about the many other parts that constitute a compiler!
[2]Note I'm using the word "design" deliberately: it is not, in my opinion, a flaw to implement parsers with recursive descent, provided there is a non-recursive descent design of the grammar elsewhere.
[3]More formally, and generally, we can talk about "the language a parser accepts".
[4]Though there are some unambiguous grammars that are not in the LR subset. That need not concern us here.
[5]This isn't true for Yacc which allows action code to change how the parser operates, allowing it to parse non-LR languages. Perhaps unsurprisingly, I do not consider this a good idea, and lrpar does not feature such functionality.
[6]Oddly enough, my experience of converting recursive descent parsers to LR grammars suggests that recursive descent parsers seem about as often to reject "good" inputs as to accept "bad" inputs. I don't have a convincing explanation as to why this might be the case.
[7]Though depending on how many ambiguities your grammar has, you might find that some "shadow" others. In other words, by failing to write one part of a grammar in the LR subset, you might not be able to write other parts of your grammar in the LR subset either, causing some ambiguities to remain unnoticed.