My Interview with Eelco Visser on Parsing

In my previous post, I mentioned in a footnote an interview I had done with the late Eelco Visser in 2008. The audio is available here, but there isn’t a transcript available. That’s a pity, because, as far as I know, Eelco didn’t do another interview along those lines, and searching through audio is still a difficult thing to do well. So, I thought I’d produce a transcript, as a little way of honouring Eelco’s memory.

In my opinion, the technical material in the interview speaks for itself. The interview process, however, had a partly comic side — neither of us had a clue what we were doing. I had never been an interviewer before (and only did one further interview after this), and I don’t think Eelco had done a technical interview like this. My understanding of parsing was woefully incomplete, and Eelco hadn’t thought about some of the deeper details in several years (he said afterwards that he was a bit “rusty”), though it clearly came back to him very quickly! We recorded the interview in a dingy college room in Cambridge that, perhaps fittingly, looked like it was last decorated during the heyday of parsing research — the 1960s.

Afterwards, I’m fairly certain that we went for a drink together in the college bar. I don’t remember the specific topic we talked about, but it almost certainly would have been like many others we had: he’d talk about deep technical matters while I made stupid jokes, with the two ends of the conversation somehow meeting in the middle from time to time. I unthinkingly expected to have more of those conversations in the future. While that’s no longer possible, I have many happy memories of those we did have.

As is conventional, I’ve edited the transcript for clarity (us humans stumble a lot when we’re formulating our thoughts!). I think that Eelco’s engaging personality and deep knowledge come across, particularly as the interview goes on and both of us better understood our roles. I hope you find this an interesting read!

Welcome, wherever you are, my name is Laurence Tratt. Today we’re at the Code Generation Conference in Cambridge and my guest is Eelco Visser. We’re going to be talking today about parsing, so I’m going to let Eelco do a quick little introduction and tell you where he’s from.

Hi Laurie, my name is Eelco Visser. I work at Delft University of Technology, which is in the Netherlands. I’m an Associate Professor there, and I head a group of PhD students and postdocs doing research into language engineering, software configuration management, and stuff like that.

A lot of our listeners will have heard about parsing — they’ll know that parsing goes on somewhere, it’s this magical term! Could you give us an introduction to parsing, why we want to do it, what roughly it does? We’ll go into more detail as we go along.

So the general idea of parsing is that you have text in some language, and you want to do some processing on that. You want to get at the structure of the text and the structure underlying the text. Typically you write a program that reads the characters of the text, makes sense of it, and makes a structured representation of that.

Is this a little bit like natural language where implicitly, when we talk, we break apart a sentence: this is the verb, the noun and that sort of thing, the subject?

Exactly, so that’s actually called parsing in natural language linguistics theory. Parsing comes from dividing sentences up into parts and then making little tree structures for them. So that’s exactly what parsing is about.

But there are also all kinds of places in computing where you do it. Think of a text file with lines where you have columns separated by tabs, or colons, or something like that, and you want to get the elements from the columns. You would write a little program that gets the elements from the columns, by reading the characters and seeing where the divisions are, and then translate it into some kind of tabular form.

So even breaking up a CSV file is a parsing of a sort?

Exactly.

That’s giving us quite a wide range! What are the places where a typical user is going to be using parsing, even if they don’t know about it?

Well, exactly, all those places, right! Even reading in a string from a form in a web application, you have to see whether it’s actually a correct sentence according to whatever you want in that place in the form. That’s the recognition: you want to say, well this string should comply with this particular language: it should look like this, it should be an email address, it should have a bunch of letters and an @ symbol and then a bunch of other letters, and a dot and some more stuff like that. So that’s already a place where you might come across parsing problems.
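[2023 note: as a rough illustration of recognition, here is a minimal Python sketch; the pattern is deliberately simplistic and nowhere near a real email validator.]

```python
import re

# A deliberately simplistic "email-ish" recogniser: some letters, an '@',
# more letters, a dot, and a bit more -- just enough to show the idea of
# checking that a string complies with a little language.
EMAIL_ISH = re.compile(r"^[A-Za-z0-9.]+@[A-Za-z0-9]+\.[A-Za-z]+$")

def looks_like_email(s: str) -> bool:
    return EMAIL_ISH.match(s) is not None

print(looks_like_email("eelco@example.org"))  # True
print(looks_like_email("not an email"))       # False
```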

People are probably thinking about it in terms like compilers and that sort of thing, but as you’re saying it’s used pretty much everywhere when dealing with text that goes into a computer.

Exactly, yes.

As I said earlier, a lot of our listeners, and a lot of people in general, will have the general feeling that parsing is a pretty hard topic — it’s one of those things that people often run away from in fear.

Why is that?

Well that’s what I wanted to ask you! Why is this something that people are often scared of? Is it because there’re a lot of technical terms associated with it? that it has a reputation, with compilers and so on? that there’s a lot of scary tools (we’ll be getting into those later)? Do you think it’s just that people are often actually doing it themselves anyway? You said, if they’re breaking down CSV files, or web page validation, they’re doing parsing, and maybe they don’t think of what they’re doing, this simple thing, as parsing?

Right, and then they may have come across Lex and Yacc in their education and either understood it, but thought it was quite difficult, or were too afraid of it to actually look at it. I don’t know exactly. But Lex and Yacc are the older technology in this area and they are a bit hard to use. I’m not really sure about the psychology of this, because I’m quite happy and familiar using parsing technology — I find it a bit hard to reason about why people would find it hard!

I would tend to agree with you, I think that perhaps for a lot of people, there’s a certain cultural thing that says it’s hard, without necessarily a basis for it. But one of the reasons for that might be that, if we go back to the example of web page validation, let’s imagine I’ve got a date form and someone can type in a date, in US format, British format, ISO format, whatever — there are all these various date formats! A lot of that sort of parsing is done with a little bit of hand hackery. Often they’ll break things apart. Can we do all types of parsing using that technique?

Right, if you have a general purpose programming language you can solve any parsing problem that you want. The only problem is that, at some point, it becomes very unpleasant doing it yourself, and then you would rather like to use some higher level abstractions and a domain specific language.

How unpleasant would you say? I mean, are we talking about it being impractical, or that there are a lot of corner cases that don’t tend to be handled correctly?

Well, it just becomes very tedious. You have to think of every step in the process: if you have read this character, then all these cases may occur. So basically, you have to build up a state machine with transitions, and you have to think of all the cases, and that is very tedious and easy to get wrong.
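[2023 note: a small taste of the tedium being described here: a hand-written state machine that recognises signed integers, with every case spelt out explicitly. This is my own sketch, not code from any of the tools discussed.]

```python
# Recognise an optionally signed integer by hand: one explicit state per
# position in the "grammar", with every character case spelt out.
def is_integer(s: str) -> bool:
    state = "start"
    for c in s:
        if state == "start":
            if c in "+-":
                state = "sign"
            elif c.isdigit():
                state = "digits"
            else:
                return False
        elif state == "sign":
            if c.isdigit():
                state = "digits"
            else:
                return False
        elif state == "digits":
            if not c.isdigit():
                return False
    return state == "digits"

print(is_integer("-42"))   # True
print(is_integer("4-2"))   # False
```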

So another thing that a lot of people will often use is regular expressions — most languages now have reasonable regular expression support, and regular expressions are relatively powerful. Is that enough to do all this parsing stuff? Does that solve some of my problems?

Well, definitely some of these. Input form validation problems, or even CSV file parsing — you can get a long way with regular expressions and [pull?] programming and that kind of stuff. So, basically, regular expressions are another way to do parsing — it’s the next level up from parsing manually and writing your own parsing program completely.

Regular expressions are nice because it’s a formal language that declaratively describes a structured little language, but it’s not enough. The problem with regular expressions is that they cannot deal with recursion. So as soon as you have a language that has some kind of parentheses, where you have expressions that have parentheses and then within those parentheses you can have another expression with parentheses and other nested expressions, then we know from formal language theory that regular expressions don’t do the job.

In other words, it’s the problem that they can’t balance things out: if there are two open brackets, they can’t count, in the sense of knowing that there should be two closing brackets at the end?

That’s right.

OK, so that’s a fairly good reason. So, regular expressions get me further, in a sense, or allow me to do more, but they still hit a buffer?

In technical terms, regular expressions are implemented using a finite state machine, so after reading a character you have to decide which state to go to next. The only thing you have to remember is which state you are in. The problem with these parentheses languages is that you need a stack machine, because you need to remember where you have been.

So you end up manually coding extra stuff on top as well as using the regular expressions again? So we’re kind of back to the manual process again, effectively?

With pure regular expressions, you cannot do it. But then, if you have some way of calling regular expressions and defining them in functions, or something like that, then you can [use] the function call stack to emulate this, and you can get this nested behaviour.
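[2023 note: a minimal sketch of the trick just mentioned: using the function call stack, via recursion, to get the nesting that a pure regular expression cannot express.]

```python
# Recognise strings of balanced parentheses such as "(()(()))".  Each
# recursive call remembers, on the call stack, that one more ')' is owed.
def balanced(s: str) -> bool:
    def parse(i: int) -> int:
        # Parse a (possibly empty) sequence of "( ... )" groups starting at
        # index i; return the index just past what was consumed, or -1 on error.
        while i < len(s) and s[i] == "(":
            i = parse(i + 1)          # recurse for the nested contents
            if i == -1 or i >= len(s) or s[i] != ")":
                return -1
            i += 1                    # consume the matching ')'
        return i

    return parse(0) == len(s)

print(balanced("(()(()))"))  # True
print(balanced("(()"))       # False
```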

Okay, then we move on to the tools that people know about. But before we get there – and this is why academics start getting involved – there’s a whole theory of parsing, of formality, and so on. Could you give us an idea of the underlying theory of parsing, not going into too much detail perhaps, that solves the problems you’ve talked about? That regular expressions can’t balance brackets and so on.

So this goes back to the 50s and to linguistics, even to the study of natural language. Noam Chomsky in America was studying natural languages and he came up with a hierarchy for formalizing language grammars. People had been thinking about grammars, but Chomsky came up with a formal system to describe languages. He came up with this idea of context-free grammars. The idea there is that you have a symbol, a so-called nonterminal, and you have productions, [which] can rewrite symbols. So basically it was a system for doing string rewriting. His systems were not for parsing, so much as they were for describing the generation of language. They were generative grammars.

They could recognize that something was valid?

No, they could generate the valid sentences of a language. So, the idea of context-free grammars was that you write it as a symbol, an arrow, and a bunch of other symbols. And the idea was that you start with the start symbol, and then you could look at the rules of the grammar, and those rules said: well, if you have the symbol ‘s’ then you can replace it with the symbol ‘noun’ followed by ‘verb’ and then noun you could replace by this and that.

So it’s describing the valid sentences that I can build?

Exactly. It’s a generator for the language.
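[2023 note: a little sketch of a grammar used as a generator, starting from the start symbol and repeatedly replacing non-terminals until only words remain. The toy grammar is invented purely for illustration.]

```python
import random

# A toy generative grammar: non-terminals map to lists of possible
# right-hand sides; strings not in the dictionary are terminal words.
GRAMMAR = {
    "S":    [["NOUN", "VERB"]],
    "NOUN": [["the dog"], ["the cow"]],
    "VERB": [["barks"], ["moos"]],
}

def generate(symbol: str = "S") -> str:
    if symbol not in GRAMMAR:          # a terminal: emit it as-is
        return symbol
    rhs = random.choice(GRAMMAR[symbol])
    return " ".join(generate(sym) for sym in rhs)

print(generate())   # e.g. "the dog moos" -- grammatical, if zoologically dubious
```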

So these are like the implicit rules that we all learned from our parents, listening to them speak as we grew up? We know that sentences have a noun and a verb and a subject and so on — we have those rules implicitly.

Chomsky’s problem was really: can we understand how language works? And his theory was that, actually, our mind somehow uses such kinds of rules to describe how we generate sentences. There’s a debate about whether that actually works for natural language.

In addition to these context-free grammars, you can also add context-sensitive grammars, where you say, well, if you have this symbol in the context of some other symbols then you can rewrite it to this set of symbols.

So the actual specific content of a word changes what I can say afterwards? Let me rephrase that. With the context-free grammars, I can put any random words in, provided they obey the structure correctly.

Right.

But with the context-sensitive [grammars], if I say “dog” then I have to say “bark”, I can’t say “meow” or “moo” or something like that.

Yes, something like that.

I’m simplifying. [2023 note: yeah, right!]

Yeah.

So how does this then relate to computing parsing?

So then in the early 60s people were working on ALGOL 60. This was a domain specific language for defining algorithms – that was the problem at the time – and they tried to formalize the design of that language. They had a report about it. They came up with this formalism called eBNF, well, BNF actually (Backus-Naur Form). Backus and Naur, if I remember, came up with this formalism for describing grammars which was basically equivalent to Chomsky’s context-free grammars. So they used these rules to describe the syntax of ALGOL. That gave rise to a whole lot of research about defining languages.

So the original definition of ALGOL 60 was saying what the valid syntax was, but at that point it was still a “paper thing” in a sense? It hadn’t been “tooled [up]” necessarily, or not originally — it was thought of just as a way of saying: here is the valid syntax of ALGOL 60?

Right, it was a language specification language.

So we mentioned BNF earlier, and we also mentioned eBNF, so could you just give our listeners a brief idea of the difference? What does the “e” do, as it were?

The “e” stands for “extended BNF”. BNF is plain context-free grammars: you have a symbol on the left hand side and a bunch of symbols on the right hand side and you basically can expand symbols. There are typical patterns in grammars that you would like to write down a bit more concisely. One typical example is lists: you want to have a list of expressions separated by commas. In a context-free grammar, you would have to define a non-terminal expression list, and then say, well, an expression list is either empty, or it can be an expression followed by a comma followed by an expression list. So that’s a typical pattern that you have to write over and over again. What eBNF does is provide high-level constructs to allow you to define those things. So, typically in eBNF you can write “expression*” for a list of expressions.

Zero to many, just like in regular expressions?

Exactly. So basically it adds regular expression operators to grammars.

And this is a nice syntactic convenience, because there’s this repeated tedious idiom that you had to go over and this allows you to nicely…

Right, I mean you can define a simple transformation that transforms eBNF grammars to basic BNF grammars.

So from an expressive power point of view there’s no difference between the two, just eBNF is a lot more pleasant to work with?

Right.
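[2023 note: the list idiom above also shows up when writing parsers by hand: the recursive BNF rule corresponds to a recursive function, while eBNF’s repetition operator corresponds to a simple loop. A rough sketch, using a comma-separated list of identifiers.]

```python
# Parse a comma-separated list of identifiers, e.g. "a, b, c".
# BNF style:  List ::= <empty> | Ident "," List       -> recursion
# eBNF style: List ::= (Ident ("," Ident)*)?           -> a loop

def parse_list_bnf(tokens, i=0):
    if i < len(tokens) and tokens[i].isidentifier():
        if i + 1 < len(tokens) and tokens[i + 1] == ",":
            return [tokens[i]] + parse_list_bnf(tokens, i + 2)
        return [tokens[i]]
    return []

def parse_list_ebnf(tokens):
    items, i = [], 0
    while i < len(tokens) and tokens[i].isidentifier():
        items.append(tokens[i])
        i += 2 if i + 1 < len(tokens) and tokens[i + 1] == "," else 1
    return items

toks = ["a", ",", "b", ",", "c"]
print(parse_list_bnf(toks))    # ['a', 'b', 'c']
print(parse_list_ebnf(toks))   # ['a', 'b', 'c']
```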

What comes next? If [someone has] a description of a language, maybe they’re familiar with tools like Lex and Yacc, maybe even JavaCC and so on, they know that they [should] write some sort of context-free grammar. They’re probably even aware of the traditional syntax, where there’s a little term on the left hand side, then there’s an arrow, and then there’s some stuff on the right hand side. Magically, they feed this into one of these tools and then it’ll generate some code for them, that then reads in a text file and does some magical stuff for them. So could you maybe give us an idea of what are the inputs to these tools? Are they context-free grammars in general, and what is the typical output of a tool?

The basic idea is that you write the context-free grammar and feed it to a tool. Then there are basically two classes of tools, divided into top-down parsers and bottom-up parsers. Both these classes take a restricted subset of the full set of context-free grammars.

The idea of a top-down parser is that it starts with the start symbol, it says: well okay, here are a number of productions, and my sentence should consist of one of these productions; I’m going to look at the leftmost symbol of all these productions and try to see whether the first word in my string matches the first word in one of these productions.

So a production is an individual rule in the grammar, should we say?

Yes. And then it goes on and it decides: well, this is the production you should take. And then it says, well, next we expect this symbol and we’re going to look at the production of that and then eventually you have a complete tree.

The problem with top-down parsers is that they cannot deal with left recursion. For instance, imagine a production where an expression ‘e’ can be an expression ‘e’ followed by the symbol ‘+’ followed by another expression [i.e. ‘e: e + e’].

In other words: if I do the standard calculator grammar, I’ve got expression plus expression, or expression minus expression, or expression star expression, standard simple mathematics, and that can’t be expressed?

Exactly. So when you feed [such a grammar] into a top-down parser, it will loop, because it will say: well, okay, I need to recognize an expression; in order to recognize an expression I’m going to have to recognize an expression. So it will infinitely loop and…

…never makes any progress?

Right. And then you have bottom-up parsers. Yacc is the prototypical bottom-up parser. The idea there is that, if you want to recognize a string for a certain symbol, then you look at all productions and try to see if you can make progress with any of those productions (and maybe multiple). At some point, you reach the end of a production, and then you can actually say: well, now we can make a reduction, and we can say we have recognized that this part of the string corresponds to this production, and then you can go up one level.
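[2023 note: a minimal recursive descent (i.e. top-down) sketch for additions over single digits. The left-recursive rule ‘e: e + e’ has to be rewritten, here as a digit followed by zero or more ‘+ digit’s, precisely because of the looping problem just described.]

```python
# Top-down (recursive descent) parsing of additions over single digits.
# The left-recursive rule  e: e "+" e  would make the expression function
# call itself immediately and loop forever, so it is rewritten as:
#   e: digit ("+" digit)*
def parse_expr(s: str):
    pos = 0

    def digit():
        nonlocal pos
        if pos < len(s) and s[pos].isdigit():
            pos += 1
            return int(s[pos - 1])
        raise SyntaxError(f"digit expected at {pos}")

    tree = digit()
    while pos < len(s) and s[pos] == "+":
        pos += 1
        tree = ("+", tree, digit())   # build a little tree as we go
    if pos != len(s):
        raise SyntaxError(f"unexpected input at {pos}")
    return tree

print(parse_expr("1+2+3"))   # ('+', ('+', 1, 2), 3)
```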

So we’ve mentioned the classic tool — Yacc is the one that we’ll keep on coming back to, probably! A lot of these tools take in a grammar file, whether it be a top-down or bottom-up parser. A lot of people will be familiar with them generating a whole load of code: you feed them in and magically it generates a C file or a Java file or something. Is that an intrinsic part of the way a parser needs to work? Why are they doing this?

Essentially what happens is that a parser generator transforms a grammar into a state machine. The state machine then keeps track of: if the parser is in this state and you see this token in the input, then go to this state. That’s how these things work and that’s what we were discussing before, right? You can write these state machines by hand, by making if-then-else statements, and stuff like that, but you want to avoid that. The clever thing about parser generators is that they generate that state machine for you.

So when they generate all this stuff, that’s an intrinsic part of what they do. But anyone who’s ever looked at the file that, say, Yacc generates will know there’s an awful lot of stuff in there! Presumably, it could have just done that at runtime, but by doing it in advance it’s able to pre-compute all this. This is presumably the interesting thing?

Right. The process of transforming the grammar into a state machine is often an expensive process, so you want to pre-compute that. At least, if you have large grammars this can be quite an expensive process. Especially if you have LR parser generators with lots of lookahead, these tables become huge, so you definitely want to do that offline.

But then the other issue is that the essential part of this is that you compute the state machine, and then you could have some representation of that. You could put the state machine in an XML file and then read the XML file in at runtime and interpret it. So the fact that parser generators generate code is not really essential. You need a representation of the state machine and then you can interpret that somehow, or you can generate code that executes the state machine transitions, but that is a matter of engineering, basically.
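[2023 note: a sketch of the point that generated code is not essential: the state machine can just as well be plain data (here a dictionary, though it could equally be read in from a file) interpreted by a generic loop. It recognises the same signed integers as the earlier hand-written example.]

```python
# A table-driven recogniser for signed integers.  The table is ordinary
# data; a parser generator could emit it to a file and this generic loop
# could interpret it at runtime, with no generated code at all.
TABLE = {
    ("start",  "sign"):  "sign",
    ("start",  "digit"): "digits",
    ("sign",   "digit"): "digits",
    ("digits", "digit"): "digits",
}
ACCEPTING = {"digits"}

def classify(c: str) -> str:
    if c in "+-":
        return "sign"
    if c.isdigit():
        return "digit"
    return "other"

def run(s: str) -> bool:
    state = "start"
    for c in s:
        state = TABLE.get((state, classify(c)))
        if state is None:
            return False
    return state in ACCEPTING

print(run("-123"))  # True
print(run("--1"))   # False
```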

OK, I think that will be quite interesting to a lot of people. As I said, it’s often a sort of a cultural expectation that they generate code.

For instance, the SGLR parser that’s used by SDF – the grammar formalism that I’ve worked on – actually interprets a file that encodes the state machine. You don’t have to generate a parser for each language, you can just read it in at runtime. But the computation of the state machine is the expensive part.

You want to just do that once?

Yes.

So let’s imagine that I’ve got a Yacc thingy: I’ve generated some code, or I’m reading it at runtime. I’ve got a parser and I give it some input. The question is: what does it do next? We’ve touched on this: it builds a tree or…?

So the question is: you have this grammar, the basic thing that a parser does for you is recognize sentences, so first you could use it in a web application and recognize that the input of your form is actually correct according to the syntax of the language. And so you could indeed make a parser for dates, but next you will probably want to actually use the structure. We talked through this in the beginning: you actually want to have the structure of the text, so you can do further processing. There’s an intrinsic correspondence: a grammar provides a structure for sentences, a natural tree structure.

Let’s look at the expression grammar: so you have an expression plus an expression is another expression, and expression times expression is an expression. So now if you read the sentence ‘a + b * c’, then the natural tree representation is: you have a ‘+’ node with ‘a’ as a child and you have a ‘*’ node which is a child of this ‘+’ node and it has children ‘b’ and ‘c’ probably. What you could let a grammar or parser do is build this tree structure.

For instance, the SDF formalism that I mentioned before does exactly that. So, when you parse a string with that parser, it will build a tree structure: a term representation of the text that you have parsed. Then you can use that as an input for further processing.
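[2023 note: written down as nested Python tuples, the tree being described for ‘a + b * c’ would look something like this, assuming the usual precedence in which ‘*’ binds more tightly than ‘+’.]

```python
# The natural tree for "a + b * c": a '+' node whose right child is the
# '*' node, reflecting that multiplication binds more tightly.
tree = ("+", "a", ("*", "b", "c"))

def to_string(node) -> str:
    # Walk the tree back to text (fully parenthesised, to make the
    # structure explicit).
    if isinstance(node, str):
        return node
    op, left, right = node
    return f"({to_string(left)} {op} {to_string(right)})"

print(to_string(tree))   # (a + (b * c))
```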

This is what a lot of people call a parse tree?

That’s right.

Some parsers automatically create this parse tree, and sometimes you create it manually, or you don’t even necessarily have to. For example, the Yacc family has ‘semantic actions’, and you see this slightly artificial example where they have the calculator grammar, and they actually automatically add numbers up and multiply them.

Right. The original philosophy in compilers in the 70s was, well, you didn’t have a lot of memory, and you should really parse a program and, immediately as you were parsing, compile it to machine code. So you have a single linear sweep of your program that should do everything at once. That was easier, because you didn’t have to read the whole program into memory, because there just wasn’t that much memory! You really had to optimize for those kinds of things.

You see that languages in those days were designed for this. For instance, Pascal was designed so that all definitions came before uses. It was designed to enable this kind of processing: you parse your program and then immediately you could compile it. Now, parser formalisms were also designed in such a way that rather than building up a tree (there’s this natural tree underlying the grammar) you would perform “semantic actions” as a result of recognizing a production. So you would attach to a production a so-called semantic action, which basically was a statement in your host programming language – Yacc used C – and you could execute a bit of C code. There was access to the stack, so you could actually access the children of the tree.

One way to use this is to, in the semantic action, build up a tree that you could then later process. But you could also immediately perform semantic evaluation. You mentioned the expression calculator — that’s the standard example in a compiler course, where instead of building the tree ‘+ a b’, you would actually compute ‘+’ the value of ‘a’ and the value of ‘b’.
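[2023 note: to make the contrast concrete, here is the same little top-down parser as before, but with the ‘semantic action’ computing a value on the spot instead of building a tree node. This is a sketch of the general style, not of Yacc itself.]

```python
# Single-digit additions again, but the "semantic action" evaluates
# immediately instead of constructing '+' tree nodes.
def eval_expr(s: str) -> int:
    pos = 0

    def digit() -> int:
        nonlocal pos
        if pos < len(s) and s[pos].isdigit():
            pos += 1
            return int(s[pos - 1])
        raise SyntaxError(f"digit expected at {pos}")

    value = digit()
    while pos < len(s) and s[pos] == "+":
        pos += 1
        value += digit()      # the action: add, rather than build a node
    if pos != len(s):
        raise SyntaxError(f"unexpected input at {pos}")
    return value

print(eval_expr("1+2+3"))   # 6
```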

I think what we’re effectively saying there is that, back in the day, parsing was a very expensive activity in computation and memory terms, so there was a lot of effort to try and make this whole thing efficient. So, for example, parse trees: if you parse a whole load of text and build a big tree, it’s expensive, so we try and do things as we go along. Now that maybe also brings us to one of the reasons that I think parsing scares people: there’s a load of these weird terms: LL, LR, LALR, and so on. Can you give us maybe a little idea of the history? Why are there these various terms? Maybe we can explain some of the major ones just briefly at a high level.

Before, I mentioned top down parsing and bottom up parsing, these are the fundamental ways to do parsing.

In top down parsing, for instance, you say, well, we should decide on the production to take by looking at the first symbol. So, basically what you are doing there is: look at the first symbol in the input and then try to match that against the leftmost symbols of the productions. Or you could have a bit more lookahead. That is called LL parsing, and LL(1) is where you have a lookahead of one and LL(2) is where you have a lookahead of two.

So that means: it’s not that I look two things ahead in the grammar, but that I look two things ahead in the input?

That’s right. You would actually do a bit of a computation to see which tokens were predicted by symbols in the production.

So maybe this is a point where we should actually be explicit about one of the things that a lot of these parsing approaches have given us implicitly, and that we don’t think about explicitly: we’ve mentioned tokens and non-terminals. All of these parsing approaches are actually based on a sort of two-phase structure, aren’t they?

That’s right.

The first stage being: I break stuff up into words. If I look at English, for example, words are separated by spaces. So I just break everything up, and those are the tokens? And you also called those non-terminals?

No. The tokens are the words, so things like ‘+’ and an identifier. The non-terminals are the symbols in your grammars: things like expressions, statements, and for loops.

So those are things that are not actually in the user’s input? They’re parts of the tree structure.

Exactly.

OK. So, those two phases are separated, and this is why we have Lex and Yacc, for example? Lex is doing that first phase, splitting things up into words, and then Yacc is making the tree structure or running the semantic actions.

Right.
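[2023 note: a sketch of that first, Lex-like phase on its own: a little regular-expression-driven tokeniser that turns characters into a list of tokens, which a Yacc-like second phase would then arrange into a tree.]

```python
import re

# The Lex-like phase: chop the character stream into (kind, text) tokens.
TOKEN_RE = re.compile(r"\s*(?:(?P<NUM>\d+)|(?P<ID>[A-Za-z_]\w*)|(?P<OP>[+*()]))")

def tokenize(s: str):
    tokens, pos = [], 0
    while pos < len(s):
        m = TOKEN_RE.match(s, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

print(tokenize("x + 42 * y"))
# [('ID', 'x'), ('OP', '+'), ('NUM', '42'), ('OP', '*'), ('ID', 'y')]
```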

OK, sorry, so we’re going back: you’ve explained LL. Then we’ve got LR and LALR and those things. Maybe we can go over a little bit about those?

So, LR and LALR are variants used in generating decision tables for bottom-up parsers: shift-reduce parsers. They are technical variations on deciding which class of grammars you can actually deal with in a parser generator.

We talked about context-free grammars, and I think it’s fair to say that, for most computer languages we’re interested in, context-free grammars are what we want.

Yes, you would say so, but then people invent things like offside rules in Haskell or Python, which you cannot express with context-free grammars.

Maybe we’ll get on to that later — that’s quite interesting! But all these funny terms like LL and LR and LALR are really saying: we’re taking subsets of the context-free grammars? We’re deliberately cutting down the number of things we can parse, presumably for efficiency reasons generally?

The issue is ambiguities. In a top down parser, you need to decide which way to go: you can look a couple of tokens ahead and then you can decide. But if you have to look too far ahead then you’re inspecting the string all the time. For some grammars, for instance these left recursive grammars, there aren’t enough tokens you can look ahead, because there’s not a fixed number of tokens you could look at to decide where to go. So the grammar is restricted to not be left recursive; that’s one example of a restriction to fit a particular approach to parsing.

We touched earlier on the fact that a lot of this parsing stuff happened in the 1960s, and computers were just a tiny little bit slower then. So the LL-type parsers have this obvious limitation, but the LR parsers still don’t allow me to express every context-free grammar. Again, presumably that is just because that allowed those very slow computers to parse text at a reasonable speed?

Well, it’s not just slowness: it is really about algorithmic complexity. Earley designed a parsing algorithm in the 60s called Earley parsing and that can deal with all context-free grammars (well, maybe except for circular ones) but the complexity of that algorithm is cubic, so it can take up to n³ time to parse a string.

If I’ve understood that correctly, what we’re saying is that Earley parsing (that’s “Earley” with an e at the end, rather than just early in the morning!), the algorithm isn’t necessarily difficult to implement, for the poor person who writes the parser. But it can take an amount of time cubic in the length of the input to parse: in other words it can be very, very slow. So, when it was invented, on computers of the day, that would have taken hours possibly to parse a relatively simple string for the calculator grammar, for example.

Typically, things get bad if you get longer strings. Well, that is complexity, right — it doesn’t scale to large things. So the challenge was to have parsing algorithms that were linear, and LL and LR guarantee linearity for parsers.

Is this still a concern today? The processor I have in my phone, for example, is 10 times the speed of the fastest supercomputer in the 1960s.

But still, complexity beats everything, right? Things go up fast. So it remains a concern: you don’t want an exponential parsing algorithm, or even cubic, you don’t want to have that.

Is that why we’re still using parsers with this LR stuff and LL and so on? The efficiency is still worth us cutting down the amount of things we can easily parse just so we know we can do it very efficiently?

Well, partly. There are just tools around that are based on this stuff. They’ve proven their worth and they work. Another advantage of using these subsets is that if you succeed in fitting your grammar into this subset, you guarantee that your grammar is non-ambiguous. Ambiguity is a problem where you can parse a string in more than one way: you can say, well, it might be an expression, but it also might be a statement. That is not typically what you want when you define a language, because you want to be clear.

Is the classic type of ambiguity: I have the expression ‘2 + 3 * 4’. Now we implicitly expect that to be ‘2 + (3 * 4)’ but a parsing algorithm thinks: well, I could either do that as ‘(2 + 3) * 4’ or the other one.

That’s an example of ambiguity, indeed. So depending on how you formulate this, the grammar may be ambiguous to parse.

Earlier, we mentioned parse trees. Now, there are two terms that often come up in parsing, which are “parse trees” and “abstract syntax trees”. Could you differentiate between them for us?

Sure. A parse tree is a tree you get directly from parsing according to the grammar. We talked about this production ‘e’ is ‘e + e’, so the parse tree you would get from that is a node with label ‘e’ and it has three children: one labelled ‘e’, one labelled ‘+’ and one labelled ‘e’. You may have white space in-between, and you could have that in the parse tree as well.

It still maintains the tokens: the original user’s text is all somewhere in there?

Exactly. You can exactly reproduce the string from the tree by just going over the tree and printing all the bits of text that are stored there, and then you get your string back. There’s a one-to-one correspondence between the parse tree and the string you have parsed.

And then we have “abstract syntax tree”?

So now this ‘+’ child of this node is not very interesting. So what you do for an abstract syntax tree is transform this parse tree into a tree where you label the nodes with the name of the production. So for instance we could name the plus production ‘+’ typically, and then we could transform this tree ‘e’ with children ‘e + e’ to a tree with a root labelled ‘+’ and with two expression nodes as children. We throw away the literal and also the white space and stuff like that, because we typically don’t need that for further processing.

So if I try and repeat that back in another way. If I’ve got a compiler, I do my first pass, and I get back a parse tree which has got all of the bits of the text in there. But that’s quite low level. So then I’ll typically have a phase which will create an abstract syntax tree, which will flatten the structure, throw away the bits of the user’s input that are not relevant, that are just formatting related and so on, and leave me with the essence of what the user wanted to say.

Exactly.
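[2023 note: a rough sketch of the kind of transformation being described: a parse tree that still contains literals and layout is mapped to an abstract syntax tree labelled with a constructor name, with the noise thrown away and single-child chains flattened. The representation is invented purely for illustration.]

```python
# A parse tree for "a + b", keeping every piece of the original text:
parse_tree = ("e", [("e", ["a"]), " ", "+", " ", ("e", ["b"])])

def to_ast(node):
    # Drop whitespace and the '+' literal; label the node with a
    # constructor name instead.
    if isinstance(node, str):
        return None if node.strip() in ("", "+") else node
    _, children = node
    kept = [c for c in (to_ast(ch) for ch in children) if c is not None]
    if len(kept) == 1:           # flatten injection-like chains
        return kept[0]
    return ("Add", *kept)        # name the production instead of keeping '+'

print(to_ast(parse_tree))        # ('Add', 'a', 'b')
```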

Are there tools? How does one create an abstract syntax tree? Is it done by hand?

We talked about semantic actions a while ago, and there you see that the user is actually in control of building the tree. Typically you immediately build an abstract syntax tree rather than a parse tree, so there’s not this two-stage processing: you just build an abstract syntax tree immediately. The disadvantage of that is that you actually have to do it: you have to build the structure yourself. You could say, well, you could actually derive this abstract syntax tree automatically from the grammar, because there’s a one-to-one correspondence between the grammar, parse trees, and abstract syntax trees.

SDF is a formalism which was designed to do that. There’s a one-to-one correspondence between parse trees and abstract syntax trees. There’s an issue why you cannot do that with all kinds of grammar formalisms. In many grammar formalisms, well, we talked about ambiguities earlier, and you have to disambiguate explicitly. Typically you do that for expressions, for instance, by having different levels of precedence, which we encode as non-terminal symbols. So you have a level for multiplicative expressions and additive expressions and these are different symbols. And then what you get is a so-called injection, so productions that say an additive expression can be a multiplicative expression, or it can be an additive plus an additive expression. For this injection going from multiplicative to additive, you will get a tree node with a single child in it, and you may have long chains of these injections. That’s often why you see that you want to have this mapping from parse trees to abstract syntax trees, so you not only throw out these literals and layout, but you also flatten these injection chains. That’s one thing.

Another thing is that some approaches use iteration: you can have a long row of ‘e + e + e + e’ and that’s expressed as an iteration. But in a tree you would rather like to have a nested tree structure. Then you have to do a transformation to get from that list of expressions to the nested structure. That’s the case in top down parsing formalisms, for instance, where you cannot express left recursion.

So the idea of SDF was that it should be a formalism where the tree structure you get really is in one-to-one correspondence with the grammar. So from the grammar, you know what your tree looks like. In order to do that you shouldn’t express precedence in the grammar but you should just have a production ‘e’ is ‘e + e’ or ‘e * e’ and then you have separate disambiguation rules that tell you: this one is left associative, and multiplication is higher precedence than addition. When you do that outside of the production rules then you don’t contaminate the production rules with this information. You can actually see the tree structure in the grammar.

So SDF is a way of giving significant help in creating abstract syntax trees, and deferring some of these decisions about precedence and so on until the second stage when you’re creating that abstract syntax tree, rather than the parse tree?

No. It’s basically a separation of concerns issue. You can say the grammar describes several things. One is the tree structure that you want to think about, the logical structure of your language elements. The other is, well, how should we parse these things? These expressions should be disambiguated there, the operators should be parsed in a particular order. You could see that as a separate concern, right? If you’re writing a compiler, you’re interested in the logical structure of your language, not so much in disambiguation. That’s kind of a parsing issue. SDF allows you to do that separation of concerns by having a separate definition of disambiguation rules which are used within the parser. The parser takes care that you don’t get ambiguity, but you don’t have to express it in the grammar.

Earlier, we were talking about Lex and Yacc and we had these two very distinct phases of breaking the input up into tokens and then parsing it. Now, I believe that you were one of the original people involved in “scannerless parsing”, where in a sense these two phases are somehow merged. Could you talk to us a little bit about that, please?

That’s right. I worked on this SDF formalism, which was designed to have some natural abstract syntax tree. Before I started working on this, we had a traditional two-phase language definition. In Lex and Yacc, you have separate files for defining the lexical syntax and for defining the context-free syntax. In SDF, we actually had a single formalism that allowed you to describe the lexical syntax and the context-free syntax, so it already integrated those issues. But the lexical syntax was still described using regular expressions and character classes: ‘*’ and ‘+’, and those kinds of things. So what I did was just use normal context-free grammars to define the lexical syntax, rather than just regular expressions, and throw away the scanner. So, rather than having a separate stage in which we tokenize the string, the parser reads characters, so the tokens are now the characters, and we use this context-free grammar to do both the so-called scanning and the parsing bit.

What advantage does that gain one? It’s nice squishing two phases into one, but does this give me some extra practical expressive power or efficiency?

Yes, for instance we talked earlier about the limitations of regular expressions, that you cannot express nested structures. Think of the problem of nested comments, so you have Java-style long comments and you want to nest these things. That doesn’t work, because the scanner starts at the comment opening symbol, and then you have another comment opening symbol, and then you have the comment close corresponding to the second opening symbol — but it will just stop there and stop parsing your comment. Using context-free grammars for lexical syntax, you can actually express things like nested comments. So that’s a little nice — it kind of generalizes the definition of lexical syntax from what we had before.
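[2023 note: the nested comment problem in sketch form. A single regular expression for ‘/* ... */’ comments stops at the first close marker; a recursive recogniser, the sort of thing a context-free lexical syntax gives you, has no trouble.]

```python
# Recognise a (possibly nested) /* ... */ comment starting at index i;
# return the index just past its closing "*/", or -1 if it never closes.
def skip_comment(s: str, i: int) -> int:
    assert s.startswith("/*", i)
    i += 2
    while i < len(s):
        if s.startswith("*/", i):
            return i + 2
        if s.startswith("/*", i):
            i = skip_comment(s, i)     # recurse for the nested comment
            if i == -1:
                return -1
        else:
            i += 1
    return -1

text = "/* outer /* inner */ still outer */x"
end = skip_comment(text, 0)
print(text[end:])   # "x" -- the whole nested comment was consumed
```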

That was kind of the motivation that I had in my thesis. But that was not a very strong story, I have to say. I mean, there were a few small problems in Pascal that you could solve more nicely using this, but it was not really a killer application of scannerless parsing. I didn’t realize until much later what the real killer application would be, and that is language combination.

The problem of combining languages is that you have to combine grammars. Well, let’s first talk about why you would want to combine languages. So, we’re here at the code generation conference and what you’re doing in code generation is generating code. So what you typically do is write templates that do code generation. You have some kind of meta language, a template language, and you have a bit of quoted language, the stuff that you generate. Typical template engines just generate text: the templates don’t have any structure. That means that the templates are not checked syntactically, so you cannot know if the template’s code fragments are syntactically correct. Also, you don’t have the structure of the text you generate, so you cannot do any further processing. At the least, you would have to parse after generation, and that’s kind of clumsy.

So now, using language combination, you can say: well, OK, we define a template language that is extended with, say, the syntax of Java. We could actually parse the template fragments using the proper syntax of Java, and then we can syntactically check that these fragments are correct. Then we can use the underlying abstract syntax to actually build an abstract syntax tree rather than text.

So, this was difficult because, with Lex and Yacc, those two phases were very distinct. I had to take my entire text in and tokenize it, and then I would parse it. But if I’ve got two languages in a file, then they probably don’t share the same tokenization rules, and that would make combining them very difficult. But if I’m using scannerless parsing it can automatically say, “oh, I’ve reached the Java bit, now I switch into the Java parser” as it were. Then I go back to the template language, and I switch into the parser and scanner for that.

Right. So there’s a bit of theoretical foundation for this. The nice thing is that context-free grammars are closed under composition. So if you take two context-free grammars and put them together, you have a new context-free grammar. But regular grammars, the kind used for defining the tokens of a language, are not closed under composition. Nor are subsets of the context-free grammars such as LL or LR.

Can we explain to our listeners what we mean by “it’s not closed under composition”?

You take two Lex files, you put them together, you don’t have a new Lex file.

It’s an invalid file probably?

Yeah, so it may not be a new valid Lex file because it doesn’t follow the restrictions of the language. Or, it may actually be one, but because the rules have to be in some order, you may not get the tokenization that you want. For instance, keywords of one language are identifiers in the other language, and they would then be reserved words, so you cannot use those keywords as identifiers in the other language. And that is annoying. For instance, say you have a template language that has a keyword ‘define’, and in your Java program you want to use ‘define’ as an identifier. If you make this combined grammar then you couldn’t generate the Java program with ‘define’ in it, because it’s a keyword in the template language. That’s the sort of thing that’s annoying.

The real problem with combining scanners is that a scanner just goes over a string and says, well, this should be an identifier, or this should be a keyword, without looking at the context of where you are in the file. If you do scannerless parsing, the parser basically steers the scanning, so the tokenization says, well, at this point I expect a Java identifier, and it will look for a Java identifier. At some other point it will say, here I’m expecting a template engine keyword, so you may have ‘define’ here.
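[2023 note: a toy sketch of what ‘the parser steers the scanning’ buys you: whether ‘define’ is a keyword or an identifier depends on which context the parser is currently in. The languages and names here are invented for illustration.]

```python
# In template-language context, "define" is a keyword; inside quoted
# Java-ish code the very same word is just an identifier.
TEMPLATE_KEYWORDS = {"define", "end"}

def scan_word(word, expecting):
    if expecting == "template" and word in TEMPLATE_KEYWORDS:
        return ("KEYWORD", word)
    return ("IDENTIFIER", word)

print(scan_word("define", "template"))  # ('KEYWORD', 'define')
print(scan_word("define", "java"))      # ('IDENTIFIER', 'define')
```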

Is this a context sensitive grammar then?

It’s a bit confusing isn’t it? A context-free grammar that does context.

It is still context free?

It is context-free, but it still provides context. It’s always a bit counter-intuitive! It still is a context-free grammar, but you do scanning in the context of a particular symbol.

It maintains all the theoretical niceness of context-free grammars?

Exactly.

Given that we’re moving into this future, possibly, where we’re getting lots of domain specific languages and we’re embedding them in other languages, scannerless parsing is probably a significant way of enabling this? So you see this as the killer application for scannerless parsing?

Absolutely, yes. I mean, we have been doing really great things due to the fact that we have this scannerless parsing algorithm. We’ve been embedding domain specific languages into Java. We’ve been building code generation tools where we embed the source and target language into the meta language, so they become parsed fragments of those languages. I think you only get that right using something like scannerless parsing and the techniques underlying SDF.

OK, excellent. Well, I would agree it sounds like a really great way of moving things forward! I think we’ll probably be seeing an awful lot more scannerless parsing as we move into this sort of world. I think that’s probably something for everyone to keep an eye on. Eelco Visser, thank you very much!

Thanks — it was fun.

2023-05-16 09:30