“lex and yacc” are two terms that are used together so often that they can seem indivisible. lex takes a string as input and divides it up into lexemes (roughly speaking “words”) and yacc takes a sequence of lexemes and parses them. Since some parsing approaches such as recursive descent parsing unify these phases, why do lex and yacc split them apart?
Running example
Let’s start with a simple language, which allows us to assign simple expressions to variables:
x = 1; y = x;
Using lrlex from the Rust grmtools libraries, I can define a lexer file [1] for this fragment as follows:
%% [a-z]+ "ID" [0-9]+ "INT" = "=" ; ";" [ \t\n\r]+ ;
If there were any declarations (i.e. settings) they would come before the “%%” line: there are none in this example. Each line after the “%%” defines a lexing rule. Lexing rules consist of a regular expression and either an identifier or “;”. For example, the rule [0-9]+ "INT" says that every non-empty sequence of digits defines an integer, assigning it the token type “INT”. If a lexing rule has “;” instead of an identifier, any text matching the regular expression is discarded: as is common, in this example the lexer simply discards whitespace (spaces, tabs, and newlines).
Helpfully, we can run lrlex on our lex file and input to see what lexemes it produces:
$ lrlex /tmp/t.l /tmp/t
ID x
= =
INT 1
; ;
ID y
= =
ID x
; ;
Each line tells us the token type and the lexeme (i.e. the text matched).
We can now define a simple yacc grammar which allows us to write files with multiple “statements” (i.e. variable definitions or assignments):
%%
stmts: stmts stmt | stmt ;
stmt: "ID" "=" expr ";" ;
expr: "INT" | "ID" ;
If you’re unfamiliar with yacc grammars, the stmts rule is a long-winded way of writing the more regex-like “stmts: stmt+”, i.e. that a valid input is one or more statements.
We can then use grmtools’ nimbleparse to parse our input relative to our lexer and grammar:
$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmts
  stmt
   ID x
   = =
   expr
    INT 1
   ; ;
 stmt
  ID y
  = =
  expr
   ID x
  ; ;
The indentation in nimbleparse’s output shows us the tree structure: if you’re not sure what you’re looking at, you might find it helpful to mentally rotate the tree 90° clockwise.
Overlapping lexing rules
Let’s now assume that I want to introduce mathematical constants into our language, specifically the constant “pi”:
z = pi;
Since pi is a constant, I want to forbid people from assigning to it. In other words, I want this statement to cause an error:
pi = 3;
I could do this in a “type checking” phase after parsing, but there are good reasons why I might want to make my parser reject such inputs, including: I’m lazy and don’t want to add a type checking phase; it’s easier and faster for external tools (e.g. your IDE) to spot incorrect input if it’s encoded in the parser; and it’s generally easier for users to understand valid inputs from a grammar than it is from a description of type rules.
Let’s give it a go by changing our lexer to explicitly identify pi:
%% [a-z]+ "ID" [0-9]+ "INT" = "=" ; ";" [ \t\n\r]+ ; pi "PI"
and our grammar to only allow pi in the right hand side of an expression:
%%
stmts: stmts stmt | stmt ;
stmt: "ID" "=" expr ";" ;
expr: "INT" | "ID" | "PI" ;
Unfortunately when I run it on the input I expected to be rejected, it is parsed without error:
$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmt
  ID pi
  = =
  expr
   INT 3
  ; ;
Clearly I’ve done something wrong, but what?
From experience, when a yacc grammar doesn’t do what’s expected, a lot of people immediately start trying to understand the LR parsing algorithm. I’ve surprised 3 people in the last month by saying, quite truthfully, that I no longer remember how LR parsing works. I’m sure I could refresh my memory fairly quickly if I really needed to, but I have not found my forgetfulness to be a problem because fixing these sorts of problems rarely requires understanding the parsing algorithm.
In this case, it turns out that I don’t need to understand anything about LR parsing, because the problem is in the lexer! We can see that by running lrlex on our input:
$ lrlex /tmp/t.l /tmp/t
ID pi
= =
INT 3
; ;
This clearly shows that pi has the token type ID rather than the expected token type PI.
At this point, it’s clear that we need to know at least a little about how lex (and lrlex) lex inputs. Fortunately, the rules are surprisingly simple. Summarising briefly:
- At each point in the input, lex tries every rule in the lex file — it does not stop at the first match.
- If multiple rules match, the “winner” is decided in two steps:
  - The rule(s) that match the greatest number of characters “win”.
  - If multiple rules match the same (greatest!) number of characters, the earliest rule in the lex file “wins”.
In our case, the regular expressions “[a-z]+” and “pi” both matched the input “pi” but since “[a-z]+” came earlier in the lex file, it “won”. As this shows – and contrary to what people often suspect is the case – rules in a lex file are ordered, and that order is important.
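To make these rules concrete, here is a minimal sketch of lex-style rule selection in Rust, using the regex crate (the same engine lrlex builds on). It is not lrlex’s actual implementation, and next_lexeme is a name I’ve invented purely for illustration:
use regex::Regex;

// Each rule pairs an anchored regex with a token type; `None` means
// "discard the match", like the whitespace rule in our lex file.
fn next_lexeme<'a>(
    rules: &[(Regex, Option<&'static str>)],
    input: &'a str,
) -> Option<(Option<&'static str>, &'a str)> {
    // (rule index, match length) of the best candidate seen so far.
    let mut best: Option<(usize, usize)> = None;
    for (i, (re, _)) in rules.iter().enumerate() {
        // Every rule is tried: we do not stop at the first match.
        if let Some(m) = re.find(input) {
            if m.start() == 0
                && m.end() > 0
                && best.map_or(true, |(_, len)| m.end() > len)
            {
                // Only a strictly longer match displaces the current best,
                // so on a tie the earliest rule in the file wins.
                best = Some((i, m.end()));
            }
        }
    }
    best.map(|(i, len)| (rules[i].1, &input[..len]))
}

fn main() {
    let rules = [
        (Regex::new(r"^[a-z]+").unwrap(), Some("ID")),
        (Regex::new(r"^pi").unwrap(), Some("PI")),
    ];
    // Both rules match "pi", and both match 2 characters, so the earlier
    // rule wins: this prints `Some((Some("ID"), "pi"))`.
    println!("{:?}", next_lexeme(&rules, "pi"));
}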
The fix in our case is thus easy — we just need to put the lexing rule for pi earlier than the rule for ID:
%% pi "PI" [a-z]+ "ID" [0-9]+ "INT" = "=" ; ";" [ \t\n\r]+ ;
We can then check that our input is lexed as we expect, and that the incorrect input is rejected:
$ lrlex /tmp/t.l /tmp/t
PI pi
= =
INT 3
; ;
$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmt
  ID
  = =
  expr
   INT 3
  ; ;

Parsing error at line 1 column 1. Repair sequences found:
   1: Insert ID, Delete pi
Since I’ve left grmtools’ error recovery on, it has both recognised “pi = ...” as incorrect and realised that we should have written “<id> = ...” [2].
Simple though this example is, it shows us one reason why it is useful to separate lexing from parsing. To see that, let’s imagine a tool called lecc which merges together lex and yacc. If we take the first lexer and grammar from this post and adapt it for lecc we might write something like:
%%
stmts: stmts stmt | stmt ;
stmt: "[a-z]+" "=" expr ";" ;
expr: "[0-9]+" | "[a-z]+" ;
Because of the nature of this combined lexer/grammar, it’s fairly easy to see how lecc would be able to lex/parse it in a way that’s equivalent to our separate lexer and grammar from earlier.
However, what happens if I want to write a lecc file for the lexer and grammar which forbids me assigning to “pi”? I now need a regular expression which matches every non-empty sequence of lower-case ASCII characters except the string “pi”. Since lex doesn’t have things like negative lookahead assertions [3], I have to be quite creative in how I write this:
%%
stmts: stmts stmt | stmt ;
stmt: "[a-oq-z][a-z]*|p[a-hj-z][a-z]*|pi[a-z]+|p" "=" expr ";" ;
expr: "[0-9]+" | "[a-z]+" ;
Have I got the big regular expression right? I think I might have, but even after I’ve lightly tested it, I’m not entirely confident. Imagine if I also wanted to handle other constants such as “e” or “G” and so on — I would very quickly exhaust my ability to write such a regular expression, even though it’s always theoretically possible to do so. If I were to generate the necessary regular expression automatically, it’s quite probable that the result would end up being so large and complex that it would have poor performance.
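One way to gain a little more confidence is to test the big regular expression directly. The following sketch uses Rust’s regex crate (which is also what lrlex uses); the sample inputs are ones I have picked by hand, so this is light testing rather than a proof of correctness:
use regex::Regex;

fn main() {
    // The "any non-empty lower-case identifier except pi" regex from the
    // lecc grammar above, anchored so it must match the whole string.
    let re = Regex::new(r"^(?:[a-oq-z][a-z]*|p[a-hj-z][a-z]*|pi[a-z]+|p)$").unwrap();
    for s in ["x", "p", "pa", "pig", "pint", "spin", "pi"] {
        println!("{s}: {}", re.is_match(s));
    }
    // If the regex is right, every input above prints `true` except "pi".
}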
We thus see the first argument for separating lexing from parsing: it makes it much easier for us to deal with “overlapping” lexing rules. Our first lexing file only worked because at any point in the input at most one rule could match. As soon as we introduced pi, we created overlap between two rules (ID and pi) [4]. Real languages have many instances of overlap (keywords and identifiers being an obvious, but by no means the only, example).
Scannerless parsing
What my imaginary version of lecc is doing is really a form of scannerless parsing. Astute readers might have realised that by combining a lexer and parser, I’ve accidentally introduced ambiguity problems into the seemingly unambiguous LR formalism. We can see this if I translate the expr rule from earlier as follows:
expr: "[0-9]+" | "[a-z]+" | "pi" ;
What does the input “pi” match? Well, it matches the second and third productions in that rule! In other words, we’ve transferred what I called the “overlapping” problem from the lexer to the parser.
Fundamentally, the “overlapping” problem in lexing is a form of ambiguity [5]. We now see a second argument (which is in a sense a generalisation of the first argument) for separating lexing and parsing: it allows us to push the inherent ambiguity of most languages into the lexer, keeping the grammar unambiguous.
What happens, however, if we embrace ambiguity and push scannerless parsing further? The modern form of scannerless parsing came into being thanks to the late – and much missed by me and others – Eelco Visser [6] as part of his PhD thesis. Amongst other ideas he introduced the idea of “reject” rules which we might represent as follows:
%%
stmts: stmts stmt | stmt ;
stmt: id_except_pi "=" expr ";" ;
expr: "[0-9]+" | "[a-z]+" ;
id_except_pi: "[a-z]+" ;
id_except_pi reject: "pi" ;
In essence, the idea is to say that “id_except_pi matches all identifiers but then rejects those identifiers which match ‘pi’”. This achieves the same outcome as my fiercely complex regular expression from earlier, and is much easier to use!
However, there are problems. In essence, “reject” rules interfere with the compositionality of context-free grammars, making them slightly context sensitive. A partial fix was suggested for this problem some years ago but, to the best of my knowledge, a full fix (or, at least, a full understanding of a fix) remains an open problem. As this might suggest, combining lexing and parsing is much harder than it may first seem.
Summary
The most obvious reason for separating lexing and parsing is that it allows us to move some things that are difficult to express in a parser into a lexer, where they are easy to express. More fundamentally, it allows us to use unambiguous parsing algorithms for languages that would seem to be ambiguous. This can seem a contradiction in terms, or disingenuous: does forcing lexing to remove ambiguity sully supposedly unambiguous parsing algorithms like LR? In my opinion, this is a fun philosophical debating point, but isn’t very relevant in practice: having separate lexers works rather well.
However, as with nearly everything in parsing, there are trade-offs. lex-style lexers are not closed under composition. In other words, because lex files have ordered rules, combining two lex files together can have surprising results, with some rules never matching. This is a terrible shame, because computing would be much better if we could reliably compose languages. Scannerless parsing, as we saw above, is closed under composition, but causes other trade-offs.
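As a hypothetical illustration (written in lrlex’s format; it is not taken from any real language), imagine composing a fragment whose lexer defines identifiers with another fragment whose lexer defines a while keyword, such that the identifier rule ends up first in the combined file:
%%
[a-z]+ "ID"
while "WHILE"
Wherever “while” matches, “[a-z]+” matches at least as many characters and comes earlier in the file, so the “WHILE” rule can never win: the composed language silently loses its keyword.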
There are two other major approaches to language composition. First, is to have some way to switch between different lexers. This almost always involves some form of syntactic marker: Copper and Silver are probably the best examples of such a system. However, when I was working on Converge, which allowed something similar and also required syntactic markers, I eventually concluded that I was rarely willing to pay the cost of the syntactic noise it required. Second, is some form of tree-based editing. Our group (well, mostly Lukas!) developed the Eco editor to explore this, and I think we’ve shown that it can work well in some surprising situations, but it’s still not magic. Parsing continues to be a frustrating topic — not only are there no free lunches, but sometimes there is no lunch available, no matter how deep your pockets!
Footnotes
[1] If you’re familiar with lex, you might notice that lrlex uses a slightly different format.
[2] Note that the token type “ID” in the tree has no value — error recovery can tell you need an ID at that point, but it can’t guess what the specific ID should be.
[3] Although it’s often forgotten, modern regular expression systems can parse far more than formally defined “regular languages” — in other words, modern “regular expression” syntax is a misnomer. lex predates these modern concepts and – to the best of my knowledge – its regular expressions can truly only parse regular languages.
[4] Another way of looking at this problem is that it is due to languages like English using a finite number of symbols (roughly speaking “an alphabet”, though it includes things like numbers and so on). If every concept we wanted to express had a single unique symbol, we could never have overlap.
[5] For the simple regular expression language supported by lex I believe one could statically detect all instances of overlap, although I’m not sure if any such tool exists. As an additional complication, lrlex uses Rust’s regex crate which, like most “regular expression” engines, supports features beyond the theoretical concept of “regular expression” — I don’t think one could guarantee to statically detect all instances of overlap for such languages.
[6] 15 years ago I interviewed Eelco on parsing — I’d never really done something like that, but because Eelco was so knowledgeable (even though he told me afterwards he felt a bit rusty on parsing!), I think it came out surprisingly well. Listening back, it’s nice to hear his voice again.