Why Split Lexing and Parsing Into Two Separate Phases?

“lex and yacc” are two terms that are used together so often that they can seem indivisible. lex takes a string as input and divides it up into lexemes (roughly speaking “words”) and yacc takes a sequence of lexemes and parses them. Since some parsing approaches such as recursive descent parsing unify these phases, why do lex and yacc split them apart?

Running example

Let’s start with a simple language, which allows us to assign simple expressions to variables:

x = 1;
y = x;

Using lrlex from the Rust grmtools libraries, I can define a lexer file [1] for this fragment as follows:

%%
[a-z]+ "ID"
[0-9]+ "INT"
= "="
; ";"
[ \t\n\r]+ ;

If there were any declarations (i.e. settings) they would come before the “%%” line: there are none in this example. Each line after the “%%” defines a lexing rule. Lexing rules consist of a regular expression and either an identifier or “;”. For example, the rule [0-9]+ "INT" says that every non-empty sequence of digits defines an integer, assigning it the token type “INT”. If a lexing rule has “;” instead of an identifier, any text matching the regular expression is discarded: as is common, in this example the lexer simply discards whitespace (spaces, tabs, and newlines).

Helpfully, we can run lrlex on our lex file and input to see what lexemes it produces:

$ lrlex /tmp/t.l /tmp/t
ID x
= =
INT 1
; ;
ID y
= =
ID x
; ;

Each line tells us the token type and the lexeme (i.e. the text matched).

We can now define a simple yacc grammar which allows us to write files with multiple “statements” (i.e. variable definitions or assignments):

%%
stmts: stmts stmt | stmt ;
stmt: "ID" "=" expr ";" ;
expr: "INT" | "ID" ;

If you’re unfamiliar with yacc grammars, the stmts rule is a long-winded way of writing the more regex-like “stmts: stmt+”, i.e. a valid input is one or more statements.

We can then use grmtools’ nimbleparse to parse our input relative to our lexer and grammar:

$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmts
  stmt
   ID x
   = =
   expr
    INT 1
   ; ;
 stmt
  ID y
  = =
  expr
   ID x
  ; ;

The indentation in nimbleparse shows us the tree structure: if you’re not sure what you’re looking at, you might find it helpful to mentally rotate the tree 90° clockwise.

Overlapping lexing rules

Let’s now assume that I want to introduce mathematical constants into our language, specifically the constant “pi”:

z = pi;

Since pi is a constant, I want to forbid people from assigning to it. In other words I want this expression to cause an error:

pi = 3;

I could do this in a “type checking” phase after parsing, but there are good reasons why I might want to make my parser reject such inputs, including: I’m lazy and don’t want to add a type checking phase; it’s easier and faster for external tools (e.g. your IDE) to spot incorrect input if it’s encoded in the parser; and it’s generally easier for users to understand valid inputs from a grammar than it is from a description of type rules.

Let’s give it a go by changing our lexer to explicitly identify pi:

%%
[a-z]+ "ID"
[0-9]+ "INT"
= "="
; ";"
[ \t\n\r]+ ;
pi "PI"

and our grammar to only allow pi in the right hand side of an expression:

%%
stmts: stmts stmt | stmt ;
stmt: "ID" "=" expr ";" ;
expr: "INT" | "ID" | "PI" ;

Unfortunately when I run it on the input I expected to be rejected, it is parsed without error:

$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmt
  ID pi
  = =
  expr
   INT 3
  ; ;

Clearly I’ve done something wrong, but what?

From experience, when a yacc grammar doesn’t do what’s expected, a lot of people immediately start trying to understand the LR parsing algorithm. I’ve surprised 3 people in the last month by saying, quite truthfully, that I no longer remember how LR parsing works. I’m sure I could refresh my memory fairly quickly if I really needed to, but I have not found my forgetfulness to be a problem because fixing these sorts of problems rarely requires understanding the parsing algorithm.

In this case, it turns out that I don’t need to understand anything about LR parsing, because the problem is in the lexer! We can see that by running lrlex on our input:

$ lrlex /tmp/t.l /tmp/t
ID pi
= =
INT 3
; ;

This clearly shows that pi has the token type ID rather than the expected token type PI.

At this point, it’s clear that we need to know at least a little about how lex (and lrlex) lex inputs. Fortunately, the rules are surprisingly simple. Summarising briefly:

  • At each point in the input, lex tries every rule in the lex file — it does not stop at the first match.

  • If multiple rules match, the “winner” is decided in two steps:

    1. The rule(s) that match the greatest number of characters “win”.
    2. If multiple rules match the same (most!) number of characters, the earliest rule in the lex file “wins”.

In our case, the regular expressions “[a-z]+” and “pi” both matched the input “pi” but since “[a-z]+” came earlier in the lex file, it “won”. As this shows – and contrary to what people often suspect is the case – rules in a lex file are ordered, and that order is important.
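
To make those two rules concrete, here is a rough Rust sketch of longest-match-then-rule-order lexing. This is not how lrlex is actually implemented; it assumes the regex crate, and keeps the rules in the same broken order as my lex file above:

use regex::Regex;

fn main() {
    // Rules in the same (broken) order as the lex file above: `[a-z]+` comes
    // before `pi`. Each pattern is anchored with `^` so it only matches at
    // the current position.
    let rules = [
        ("ID", Regex::new(r"^[a-z]+").unwrap()),
        ("INT", Regex::new(r"^[0-9]+").unwrap()),
        ("=", Regex::new(r"^=").unwrap()),
        (";", Regex::new(r"^;").unwrap()),
        ("SKIP", Regex::new(r"^[ \t\n\r]+").unwrap()),
        ("PI", Regex::new(r"^pi").unwrap()),
    ];

    let input = "pi = 3;";
    let mut pos = 0;
    while pos < input.len() {
        // Try every rule and keep the longest match. Using `>` (not `>=`)
        // when comparing lengths means that, on a tie, the earliest rule in
        // `rules` wins.
        let mut best: Option<(&str, usize)> = None;
        for (name, re) in &rules {
            if let Some(m) = re.find(&input[pos..]) {
                if best.map_or(true, |(_, len)| m.end() > len) {
                    best = Some((*name, m.end()));
                }
            }
        }
        let (name, len) = best.expect("lexing error");
        if name != "SKIP" {
            // Prints "ID pi", "= =", "INT 3", "; ;" for this input.
            println!("{} {}", name, &input[pos..pos + len]);
        }
        pos += len;
    }
}

Even though the PI rule matches “pi”, the ID rule matches the same number of characters and comes first, so the sketch prints “ID pi”, mirroring lrlex’s output above.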

The fix in our case is thus easy — we just need to put the lexing rule for pi earlier than the rule for ID:

%%
pi "PI"
[a-z]+ "ID"
[0-9]+ "INT"
= "="
; ";"
[ \t\n\r]+ ;

We can then check that our input is lexed as we expect, and that the incorrect input is rejected:

$ lrlex /tmp/t.l /tmp/t
PI pi
= =
INT 3
; ;
$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmt
  ID
  = =
  expr
   INT 3
  ; ;

Parsing error at line 1 column 1. Repair sequences found:
   1: Insert ID, Delete pi

Since I’ve left grmtools’ error recovery on, it has both recognised “pi = ...” as incorrect and realised that we should have written “<id> = ...” [2].

Simple though this example is, it shows us one reason why it is useful to separate lexing from parsing. To see that, let’s imagine a tool called lecc which merges together lex and yacc. If we take the first lexer and grammar from this post and adapt it for lecc we might write something like:

%%
stmts: stmts stmt | stmt ;
stmt: "[a-z]+" "=" expr ";" ;
expr: "[0-9]+" | "[a-z]+" ;

Because of the nature of this combined lexer/grammar, it’s fairly easy to see how lecc would be able to lex/parse it in a way that’s equivalent to our separate lexer and grammar from earlier.

However, what happens if I want to write a lecc file for the lexer and grammar which forbids me from assigning to “pi”? I now need a regular expression which matches every non-empty sequence of lower-case ASCII characters except the string “pi”. Since lex doesn’t have things like negative lookahead assertions [3], I have to be quite creative in how I write this:

%%
stmts: stmts stmt | stmt ;
stmt: "[a-oq-z][a-z]*|p[a-hj-z][a-z]*|pi[a-z]+|p" "=" expr ";" ;
expr: "[0-9]+" | "[a-z]+" ;

Have I got the big regular expression right? I think I might have, but even after I’ve lightly tested it, I’m not entirely confident. Imagine if I also wanted to handle other constants such as “e” or “G” and so on — I would very quickly exhaust my ability to write such a regular expression, even though it’s always theoretically possible to do so. If I were to generate the necessary regular expression automatically, it’s quite probable that the result would end up being so large and complex that it would have poor performance.
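
Rather than reason about the regular expression purely in my head, I can at least throw a few inputs at it. This quick check is my own addition, assuming the Rust regex crate; the pattern is anchored so that whole strings, not substrings, are tested:

use regex::Regex;

fn main() {
    // The "any non-empty lowercase identifier except pi" regular expression
    // from the lecc example, wrapped in ^(?:...)$ to match whole strings.
    let re = Regex::new(r"^(?:[a-oq-z][a-z]*|p[a-hj-z][a-z]*|pi[a-z]+|p)$").unwrap();
    for s in ["x", "p", "px", "pie", "pi"] {
        // Everything should print "true" except "pi".
        println!("{}: {}", s, re.is_match(s));
    }
}

Running this suggests the regular expression does what I hope, but it is hardly a proof.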

We thus see the first argument for separating lexing from parsing: it makes it much easier for us to deal with “overlapping” lexing rules. Our first lexing file only worked because at any point in the input at most one rule could match. As soon as we introduced pi, we created overlap between two rules (ID and pi) [4]. Real languages have many instances of overlap (keywords and identifiers being an obvious, but by no means the only, example).

Scannerless parsing

What my imaginary version of lecc is doing is really a form of scannerless parsing. Astute readers might have realised that by combining a lexer and parser, I’ve accidentally introduced ambiguity problems into the seemingly unambiguous LR formalism. We can see this if I translate the expr from earlier as follows:

expr: "[0-9]+" | "[a-z]+" | "pi" ;

What does the input “pi” match? Well, it matches the second and third productions in that rule! In other words, we’ve transferred what I called the “overlapping” problem from the lexer to the parser.

Fundamentally, the “overlapping” problem in lexing is a form of ambiguity [5]. We now see a second argument (which is in a sense a generalisation of the first argument) for separating lexing and parsing: it allows us to push the inherent ambiguity of most languages into the lexer, keeping the grammar unambiguous.

What happens, however, if we embrace ambiguity and push scannerless parsing further? The modern form of scannerless parsing came into being thanks to the late – and much missed by me and others – Eelco Visser [6] as part of his PhD thesis. Amongst other ideas he introduced the idea of “reject” rules which we might represent as follows:

%%
stmts: stmts stmt | stmt ;
stmt: id_except_pi "=" expr ";" ;
expr: "[0-9]+" | "[a-z]+" ;
id_except_pi: "[a-z]+" ;
id_except_pi reject: "pi" ;

In essence, the idea is to say that “id_except_pi matches all identifiers but then rejects those identifiers which match ‘pi’”. This achieves the same outcome as my fiercely complex regular expression from earlier, and is much easier to use!
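
Although this is not how Eelco’s scannerless parsers implement reject rules, one can get a feel for their semantics with a small sketch (again assuming the Rust regex crate): accept whatever the identifier rule accepts, minus whatever the reject rule accepts:

use regex::Regex;

fn main() {
    let id = Regex::new(r"^[a-z]+$").unwrap();
    let reject = Regex::new(r"^pi$").unwrap();

    // id_except_pi accepts everything `id` accepts, unless the whole lexeme
    // also matches the reject pattern.
    let id_except_pi = |s: &str| id.is_match(s) && !reject.is_match(s);

    assert!(id_except_pi("x"));
    assert!(id_except_pi("pie"));
    assert!(!id_except_pi("pi"));
}

The subtraction is expressed directly, rather than being compiled away into a single hairy regular expression.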

However, there are problems. In essence, “reject” rules interfere with the compositionality of context-free grammars, making them slightly context sensitive. A partial fix was suggested for this problem some years ago but, to the best of my knowledge, a full fix (or, at least, a full understanding of a fix) remains an open problem. As this might suggest, combining lexing and parsing is much harder than it may first seem.

Summary

The most obvious reason for separating lexing and parsing is that it allows us to move some things that are difficult to express in a parser into a lexer, where they are easy to express. More fundamentally, it allows us to use unambiguous parsing algorithms for languages that would seem to be ambiguous. This can seem a contradiction in terms, or disingenuous: does forcing lexing to remove ambiguity sully supposedly unambiguous parsing algorithms like LR? In my opinion, this is a fun philosophical debating point, but isn’t very relevant in practice: having separate lexers works rather well.

However, as with nearly everything in parsing, there are trade-offs. lex-style lexers are not closed under composition. In other words, because lex files have ordered rules, combining two lex files together can have surprising results, with some rules never matching. This is a terrible shame, because computing would be much better if we could reliably compose languages. Scannerless parsing, as we saw above, is closed under composition, but causes other trade-offs.
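
As a concrete, if contrived, illustration of the composition problem, imagine naively appending a second lexer’s keyword rule after a first lexer’s identifier rule. A rough sketch (my own, assuming the Rust regex crate and the longest-match-then-rule-order behaviour described earlier) shows that the keyword rule can never win:

use regex::Regex;

// Return the winning (token type, length) for the start of `s`: the longest
// match wins and ties go to the earliest rule.
fn lex_one<'a>(rules: &[(&'a str, Regex)], s: &str) -> Option<(&'a str, usize)> {
    let mut best: Option<(&'a str, usize)> = None;
    for (name, re) in rules {
        if let Some(m) = re.find(s) {
            if best.map_or(true, |(_, len)| m.end() > len) {
                best = Some((*name, m.end()));
            }
        }
    }
    best
}

fn main() {
    let rules = [
        ("ID", Regex::new(r"^[a-z]+").unwrap()),   // from the first lexer
        ("WHILE", Regex::new(r"^while").unwrap()), // from the second lexer
    ];
    // Both rules match "while" with the same length, so the earlier ID rule
    // wins and the WHILE rule is silently dead.
    assert_eq!(lex_one(&rules, "while"), Some(("ID", 5)));
}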

There are two other major approaches to language composition. The first is to have some way to switch between different lexers. This almost always involves some form of syntactic marker: Copper and Silver are probably the best examples of such a system. However, when I was working on Converge, which allowed something similar and also required syntactic markers, I eventually concluded that I was rarely willing to pay the cost of the syntactic noise it required. The second is some form of tree-based editing. Our group (well, mostly Lukas!) developed the Eco editor to explore this, and I think we’ve shown that it can work well in some surprising situations, but it’s still not magic. Parsing continues to be a frustrating topic — not only are there no free lunches, but sometimes there is no lunch available, no matter how deep your pockets!

Footnotes

[1]

If you’re familiar with lex, you might notice that lrlex uses a slightly different format.

[2]

Note that the token type “ID” in the tree has no value — error recovery can tell you need an ID at that point, but it can’t guess what the specific ID should be.

[3]

Although it’s often forgotten, modern regular expression systems can parse far more than formally defined “regular languages” — in other words, modern “regular expression” syntax is a misnomer. lex predates these modern concepts and – to the best of my knowledge – its regular expressions can truly only parse regular languages.

[4]

Another way of looking at this problem is that it is due to languages like English using a finite number of symbols (roughly speaking “an alphabet”, though it includes things like numbers and so on). If every concept we wanted to express had a single unique symbol, we could never have overlap.

[5]

For the simple regular expression language supported by lex I believe one could statically detect all instances of overlap, although I’m not sure if any such tool exists. As an additional complication, lrlex uses Rust’s regex crate which, like most “regular expression” engines, supports features beyond the theoretical concept of “regular expression” — I don’t think one could guarantee to statically detect all instances of overlap for such languages.

[6]

15 years ago I interviewed Eelco on parsing — I’d never really done something like that, but because Eelco was so knowledgeable (even though he told me afterwards he felt a bit rusty on parsing!), I think it came out surprisingly well. Listening back, it’s nice to hear his voice again.
