Parsing: The Solved Problem That Isn't

[RSS feed]
 

March 15 2011

Parsing is the act of taking a stream of characters and deducing if and how they conform to an underlying grammar. For example the sentence Bill hits Ben conforms to the part of the English grammar noun verb noun. Parsing concerns itself with uncovering structure; although this gives a partial indication of the meaning of a sentence, the full meaning is only uncovered by later stages of processing. Parseable, but obviously nonsensical, sentences like Bill evaporates Ben highlight this (the sentence is still noun verb noun, but finding two people who agree on what it means will be a struggle). As humans we naturally parse text all the time, without even thinking about it; indeed, we even have a fairly good ability to parse constructs that we've never seen before.

In computing, parsing is also common; while the grammars are synthetic (e.g. of a specific programming language), the overall idea is the same as for human languages. Although different communities have different approaches to the practicalities of parsing - C programmers reach for lex / yacc; functional programmers to parser combinators; others for tools like ANTLR or a Packrat / PEG-based approach - they typically rely on the same underlying area of knowledge.

After the creation of programming languages themselves, parsing was one of the first major areas tackled by theoretical computer science and, in many peoples eyes, one of its greatest successes. The 1960s saw a concerted effort to uncover good theories and algorithms for parsing. Parsing in the early days seems to have shot off in many directions before, largely, converging. Context Free Grammars (CFGs) eventually won, because they are fairly expressive and easy to reason about, both for practitioners and theorists.

Unfortunately, given the extremely limited hardware of 1960s computers (not helped by the lack of an efficient algorithm), the parsing of an arbitrary CFG was too slow to be practical. Parsing algorithms such as LL, LR, and LALR identified subsets of the full class of CFGs that could be efficiently parsed. Later, relatively practical algorithms for parsing any CFG appeared, most notably Earley's 1973 parsing algorithm. It is easy to overlook the relative difference in performance between then and now: the fastest computer in the world from 1964-1969 was the CDC6600 which executed at around 10 MIPS; my 2010 mobile phone has a processor which runs at over 2000 MIPS. By the time computers had become fast enough for Earley's algorithm, LL, LR, and friends had established a cultural dominance which is only now being seriously challenged - many of the most widely used tools still use those algorithms (or variants) for parsing. Nevertheless in tools such as ACCENT / ENTIRE and recent versions of bison, one has access to performant parsers which can parse any CFG, if that is needed.

The general consensus, therefore, is that parsing is a solved problem. If you've got a parsing problem for synthetic languages, one of the existing tools should do the job. A few heroic people - such as Terence Parr, Adrian Johnstone, and Elizabeth Scott - continue working away to ensure that parsing becomes even more efficient but, ultimately, this will be transparently adopted by tools without overtly changing the way that parsing is typically done.

Language composition

One of the things that's become increasingly obvious to me over the past few years is that the general consensus breaks down for one vital emerging trend: language composition. Composition is one of those long, complicated, but often vague terms that crops up a lot in theoretical work. Fortunately, for our purposes it means something simple: grammar composition, which is where we add one grammar to another and have the combined grammar parse text in the new language (exactly the sort of thing we want to do with Domain Specific Languages (DSLs)). To use a classic example, imagine that we wish to extend a Java-like language with SQL so that we can directly write:
for (String s : SELECT name FROM person WHERE age > 18) {
  ...
}
Let's assume that someone has provided us with two separate grammars: one for the Java-like language and one for SQL. Grammar composition seems like it should be fairly easy. In practice, it turns out to be rather frustrating, and I'll now explain some of the reasons why.

Grammar composition

While grammar composition is theoretically trivial, simply squashing two grammars together is rarely useful in practice. Typically, grammars have a single start rule; one therefore needs to choose which of the two grammars has the start rule. More messy is the fact that the chances of the two grammars referencing each other is slight; in practice, one needs to specify a third tranche of data - often referred to, perhaps slightly misleadingly, as glue - which actually links the two grammars together. In our running example, the Java-like language has the main grammar; the glue will specify where, within the Java-like expressions, SQL statements can be referenced.

For those using old parsing algorithms such as LR (and LL etc.), there is a more fundamental problem. If one takes two LR-compatible grammars and combines them, the resulting grammar is not guaranteed to be LR-compatible (i.e. an LR parser may not be able to parse using it). Therefore such algorithms are of little use for grammar composition.

At this point, users of algorithms such as Earley's have a rather smugger look on their face. Since we know from grammar theory that unioning two CFGs always leads to a valid CFG, such algorithms can always parse the result of grammar composition. But, perhaps inevitably, there are problems.

Tokenization

Parsing is generally a two-phase process: first we break the input up into tokens (tokenization); and then we parse the tokens. Tokens are what we call words in everyday language. In English, words are easily defined (roughly: a word starts and ends with a space or punctuation character). Different computer languages, however, have rather different notions of what their tokens are. Sometimes, tokenization rules are easily combined; however since tokenization is done in ignorance of how the token will later be used, sometimes it is difficult. For example, in SQL SELECT might be a keyword but in Java it is also a valid identifier; it is often hard, if not impossible, to combine such tokenization rules in traditional parsers.

Fortunately there is a solution: scannerless parsing (e.g. SDF2 scannerless parsing). For our purposes, it might perhaps better be called tokenless parsing; the different names reflect the naming conventions of different parsing schools. Scannerless parsing does away with a separate tokenization phase; the grammar now contains the information necessary to dynamically tokenize text. Combining grammars with markedly different tokenization rules is now possible.

Fine-grained composition

In practice, the simple glue mentioned earlier used to combine two grammars is often not enough. There can be subtle conflicts between the grammars, in the sense that the combined language might not give the result that was expected. Consider combining two grammars that have different keywords. Scannerless parsing allows us to combine the two grammars, but we may wish to ensure that the combined languages do not allow users to use keywords in the other language as identifiers. There is no easy way to express this in normal CFGs. The SDF2 paper referenced earlier allows reject productions as a solution to this; unfortunately this then makes SDF2 grammars mildly context sensitive. As far as I know, the precise consequences of this haven't been explored, but it does mean that at least some of the body of CFG theory won't be applicable; it's enough to make one a little nervous, at the very least (not withstanding the excellent work that has been created using the SDF2 formalism by Eeclo Visser and others).

A recent, albeit relatively unknown, alternative are boolean grammars. These are a generalization of CFGs that include conjunction and negation, which, at first glance, are exactly the constructs needed to make grammar composition practical (allowing one to say things like identifiers are any sequence of ASCII characters except SELECT). Boolean grammars, to me at least, seem to have a lot of promise, and Alexander Okhotin is making an heroic effort on them. However, there hasn't yet been any practical use of them that I know of, so wrapping ones head around the practicalities is far from trivial. There are also several open questions about boolean grammars, some of which, until they are answered one way or the other, may preclude wide-scale uptake. In particular, one issue relates to ambiguity, of which more now needs to be said.

Ambiguity

By severely restricting what CFGs they accept, grammars which are compatible with traditional parsing algorithms (LL, LR etc.) are always unambiguous (though, as we shall see, this does not mean that all the incompatible grammars are ambiguous: many are unambiguous). Grammar ambiguity is thus less widely understood than it might otherwise have been. Consider the following grammar of standard arithmetic:
E ::= E "+" E
    | E "-" E
    | E "/" E
    | E "*" E
Using this grammar, a string such as 2 + 3 * 4 can be parsed ambiguously in two ways: as equivalent to (2 + 3) * 4; or as equivalent to 2 + (3 * 4). Parsing algorithms such as Earley's will generate all possibilities even though we often only want one of them (due to arithmetic conventions, in this case we want the latter parse). There are several different ways of disambiguating grammars, such as precedences (in this example, higher precedences win in the face of ambiguity):
E ::= E "+" E  %precedence 1
    | E "-" E  %precedence 1
    | E "/" E  %precedence 2
    | E "*" E  %precedence 3
This might suggest that we can tame ambiguity relatively easily: unfortunately, parsing theory tells us that the reality is rather tricky. The basic issue is that, in general, we can not statically analyse a CFG and determine if it is ambiguous or not. To discover whether a given CFG is ambiguous or not we have to try every possible input: if no input triggers an ambiguous parse, the CFG is not ambiguous. However this is, in general, impractical: most CFGs describe infinite languages and can not be exhaustively tested. There are various techniques which aim to give good heuristics for ambiguity (see Bas Basten's masters thesis for a good summary; I am also collaborating with a colleague on a new approach, though it's far too early to say if it will be useful or not). However, these heuristics are inherently limited: if they say a CFG is ambiguous, it definitely is; but if they can not find ambiguity, all they can say is that the CFG might be unambiguous.

Since theoretical problems are not always practical ones, a good question is the following: is this a real problem? In my experience thus far, defining stand-alone grammars for programming languages using Earley parsing (i.e. a parsing algorithm in which ambiguity is possible), it's not been a huge problem: as the grammar designer, I often understand where dangerous ambiguity might exist, and can nip it in the bud. I've been caught out a couple of times, but not enough to really worry about.

However, I do not think that my experience will hold in the face of wide-spread grammar composition. The theoretical reason is easily stated: combining two unambiguous grammars may result in an ambiguous grammar (which, as previously stated, we are unlikely to be able to statically determine in general). Consider combining two grammars from different authors, neither of whom could have anticipated the particular composition: it seems to me that ambiguity is much more likely to crop up in such cases. It will then remain undetected until an unfortunate user finds an input which triggers the ambiguity. Compilers which fail on seemingly valid input are unlikely to be popular.

PEGs

As stated earlier, unambiguous parsing algorithms such as LL and LR aren't easily usable in grammar composition. More recently, a rediscovered parsing approach has gathered a lot of attention: Packrat / PEG parsing (which I henceforth refer to as PEGs). PEGs are different than everything mentioned previously: they have no formal relation to CFGs. The chief reason for this is PEGs ordered choice operator, which removes any possibility for ambiguity in PEGs. PEGs are interesting because, unlike LL and LR, they're closed under composition: in other words, if you have two PEGs and compose them, you have a valid PEG.

Are PEGs the answer to our problems? Alas - at least as things stand - I now doubt it. First, PEGs are rather inexpressive: like LL and LR parsing, PEGs are often frustrating to use in practise. This is, principally, because they don't support left recursion; Alex Warth proposed an approach which adds left recursion but I discovered what appear to be problems with it, though I should note that there is not yet a general consensus on this (and I am collaborating with a colleague to try and reach an understanding of precisely what left recursion in PEGs should mean). Second, while PEGs are always unambiguous, depending on the glue one uses during composition, the ordered choice operator may cause strings that were previously accepted in the individual languages not to be accepted in the combined language - which, to put it mildly, is unlikely to be the desired behaviour.

Conclusions

If you've got this far, well done. This article has ended up much longer than I originally expected - though far shorter than it could be if I really went into detail on some of these points! It is important to note that I am not a parsing expert: I only ever wanted to be a user of parsing, not - as I currently am - someone who knows bits and pieces about its inner workings. What's happened is that, in wanting to make greater use of parsing, I have gradually become aware of the limitations of what I have been able to find. The emphasis is on gradually: knowledge about parsing is scattered over several decades (from the 60s right up to the present day); many publications (some of them hard to get hold of); and many peoples heads (some of whom no longer work in computing, let alone in the area of parsing). It is therefore hard to get an understanding of the range of approaches or their limitations. This article is my attempt to write down my current understanding and, in particular, the limitations of current approaches when composing grammars; I welcome corrections from those more knowledgeable than myself. Predicting the future is a mugs game, but I am starting to wonder whether, if we fail to come up with more suitable parsing algorithms, programming languages of the future that wish to allow syntax extension will bypass parsing altogether, and use syntax directed editing instead. Many people think parsing is a solved problem - I think it isn't.

Follow me on Twitter @laurencetratt

Link to this entry

 

All posts

 

Last 10 posts

An editor for composed programs
The Bootstrapped Compiler and the Damage Done
Relative and Absolute Levels
General Purpose Programming Languages' Speed of Light
Another Non-Argument in Type Systems
Server Failover For the Cheap and Forgetful
Fast Enough VMs in Fast Enough Time
Problems with Software 3: Creating Crises Where There Aren't Any
Problems with Software 2: Failing to Use the Computing Lever
Problems with Software 1: Confusing Problems Whose Solutions Are Easy to State With Problems Whose Solutions Are Easy to Realise
 
 

DSLs

Tony Clark
Zef Hemel
 

Modelling

Mark Delgano
Steven Kelly
Jim Steel
 

OS

Marc Balmer
Ross Burton
Peter Hansteen
OpenBSD Journal
Ted Unangst
 

Programming

Peter Bell
Gilad Bracha
Tony Clark
Cliff Click
William Cook
Jonathan Edwards
Daniel Ehrenberg
Fabien Fleutot
Martin Fowler
John Goerzen
Grace
James Hague
James Iry
JOT
Ralf Laemmel
Lambda the Ultimate
Daniel Lemire
Michael Lucas
Bertrand Meyer
Keith Packard
Havoc Pennington
Brown PLT
John Regehr
Diomidis Spinellis
Shin Tai
Markus Voelter
Phil Wadler
Russel Winder
Steve Yegge