Text is Dead They Say

June 16 2005
Updated: July 11 2005
Updated: January 27 2008

Martin Fowler has recently published a lengthy but rewarding article on language workbenches. I agree with the majority of what he says, and will follow up on that in due course. However there is one particular thread running through the article with which I disagree, and which I wish to address here. Martin's thrust is that in the so-called "post-IntelliJ" world that we inhabit, programming via textual files is a relic from a past age.

The notion that we are passing out of the "age of text" (my term, for want of a better one) is one that an increasing number of people appear to be arriving at. However, while Martin uses this as one of the central themes of his document, I personally think it does not affect his core argument. Furthermore, I disagree with the basic tenet.

In the scenario Martin outlines, users will increasingly create and edit programs via DSLs. Martin's assumption is that the only practical way of editing DSLs is via dedicated editors that respond to the user's input in intelligent ways. This implies to him that users are directly editing a representation of an instance of the language's abstract syntax. I believe that, practically speaking, this implication is correct, if irrelevant. However, I believe the original assumption to be flawed - why should users have to edit DSL code via specialised editors? I suspect that there are two deeply connected reasons behind this:

  • Parsing has traditionally been a very difficult thing to do. Even those well versed in the parsing algorithms of such tools as yacc often find expressing a suitable grammar very difficult.
  • Explaining to a user what counts as valid input to a DSL is very difficult, partly because the grammars that many tools use are so complex that they are of little use as explanation.

I suspect that people such as Martin have been burned by bad parsing technologies many times before (I know I have). The prospect of such pain every time one wants to come up with a textual DSL is deeply off-putting. Python contains the best (or should that be worst?) example of an unintelligible grammar I have ever seen. Try following the Python grammar to work out how it encodes operator precedence, and you'll see why parsing technologies derived from work in the 1950s and 1960s have seriously affected productivity in this area, through to the current day. Is it any wonder that people run a mile rather than work with such nonsense? The prospect of creating a DSL editor via a tool thus becomes attractive, just to avoid the yaccs and JavaCCs of this world.

The fact that I need to mention that yacc is an LALR(1) parser should suggest to the uninitiated how baroque virtually everything to do with parsing actually is. From a user's perspective, LALR(1) really means "this thing doesn't parse a lot of things that I quite reasonably would expect it to be able to parse". In other words, yacc will generally refuse to work on most "human friendly" grammars, requiring the user to understand arcane nonsense such as "shift/reduce conflicts". This is all done in the name of efficiency. However, the need for extreme efficiency in these algorithms often dates from the 1960s. As we all know, computers were a little bit slower back in the day. What was slow back then often works at the speed of light today. There are parsing algorithms out there - Earley parsing is a particular favourite of mine - which can parse any context-free grammar, a class that covers the syntax of the vast majority of modern programming languages. They may not be as fast as yacc but for the vast majority of purposes (up to and including full blown compilers) they're absolutely fine, and no user is likely to notice the difference in execution speed. However, anyone who uses, say, Earley parsing will most definitely notice the ease with which grammars can be expressed and used. Therefore I believe the pain which most people associate with parsing text is the result of a 40 year old obsession with speed that today verges on the masochistic.
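To give a feel for how little machinery this takes, here is a minimal sketch of an Earley recogniser in Python. It is not taken from Converge or any other tool: the toy grammar, the `Item` tuple and the `recognise` function are all names invented for this illustration. The sketch assumes a grammar with no empty productions, and it only answers yes or no rather than building a parse tree.

```python
from collections import namedtuple

# An Earley item: the rule `head -> body`, how far through the body we are
# (`dot`), and the input position at which this item was started (`origin`).
Item = namedtuple("Item", "head body dot origin")

# Hypothetical toy grammar: ambiguous and left-recursive on purpose.
# Keys are non-terminals; any other symbol is treated as a terminal.
GRAMMAR = {
    "E": [("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("n",)],
}

def recognise(tokens, grammar=GRAMMAR, start="E"):
    """Return True if `tokens` can be derived from `start`.

    Assumes the grammar has no empty productions."""
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in grammar[start]:
        chart[0].add(Item(start, body, 0, 0))
    for i in range(len(tokens) + 1):
        todo = list(chart[i])
        while todo:
            item = todo.pop()
            new = []
            if item.dot == len(item.body):
                # Completer: advance every item that was waiting for item.head.
                new = [Item(p.head, p.body, p.dot + 1, p.origin)
                       for p in chart[item.origin]
                       if p.dot < len(p.body) and p.body[p.dot] == item.head]
            elif item.body[item.dot] in grammar:
                # Predictor: start every rule for the expected non-terminal.
                wanted = item.body[item.dot]
                new = [Item(wanted, body, 0, i) for body in grammar[wanted]]
            elif i < len(tokens) and tokens[i] == item.body[item.dot]:
                # Scanner: the next token matches the expected terminal.
                chart[i + 1].add(
                    Item(item.head, item.body, item.dot + 1, item.origin))
            for n in new:
                if n not in chart[i]:
                    chart[i].add(n)
                    todo.append(n)
    return any(it.head == start and it.dot == len(it.body) and it.origin == 0
               for it in chart[len(tokens)])
```

With this grammar, `recognise(list("n+n*n"))` accepts and `recognise(list("n+*n"))` rejects. The point of the sketch is that the grammar is written down exactly as one would explain it to a person, without being restructured to suit the parser.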

As a simple example of a chunk of an Earley grammar, here's part of a grammar for calculators that encodes operator precedence:

E ::= E "+" E  %precedence 10
    | E "-" E  %precedence 10
    | E "/" E  %precedence 20
    | E "*" E  %precedence 30
    | "(" E ")"

Compare it to the Python example earlier. I think the reader will agree that this is somewhat more readable, and I don't even feel that I need to explain what the %precedence stuff means because it's intuitive enough on its own. You can give an example like this to most people who've got a half-decent undergraduate computer science education and expect them to understand it. Although this is a simple example, I hope that the reader is at least somewhat convinced that there are practical parsing techniques which alleviate the majority of the pain associated with traditional techniques.
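For readers who would like the %precedence annotations spelled out anyway, one plausible reading can be sketched in Python. This is an assumption-laden illustration, not how Converge resolves ambiguity: it uses precedence climbing rather than Earley parsing, it assumes that a higher number binds more tightly and that every operator is left-associative, and it adds integer literals, which the grammar fragment above does not show. The names `PRECEDENCE`, `tokenise` and `parse` are hypothetical.

```python
import re

# Precedence numbers copied from the grammar fragment above.
# Assumption: a higher number binds more tightly.
PRECEDENCE = {"+": 10, "-": 10, "/": 20, "*": 30}

def tokenise(text):
    """Split text into integer literals, operators and brackets."""
    return re.findall(r"\d+|[-+*/()]", text)

def parse(tokens, min_prec=0):
    """Consume `tokens` (a list, mutated in place) and return a nested
    tuple of the form (operator, left, right). Integer literals are an
    addition for this sketch; well-formed input is assumed."""
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 0)
        tokens.pop(0)  # discard the closing ")"
    else:
        left = int(tok)
    # Keep absorbing operators that bind at least as tightly as min_prec.
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Assumed left associativity: the right operand may only contain
        # operators that bind strictly tighter than `op`.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left
```

Under these assumptions, `parse(tokenise("1+2*3"))` groups as `("+", 1, ("*", 2, 3))`, while `parse(tokenise("1-2-3"))` groups as `("-", ("-", 1, 2), 3)`. With the numbers exactly as given in the grammar, `*` also binds more tightly than `/`.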

So how does this relate to Martin's original point? Well, if you agree with what I've said up until now, then I hope you'll also agree that it just might be practical to parse textual DSLs easily. Furthermore, it's then possible to give the user trivially adapted documentation about the grammar that they actually have a reasonable chance of understanding. I really believe that this is the case. It's the fundamental idea underlying my Converge system, and while that's still in its early days, nothing I or others have seen suggests to me that this is an unreasonable position to hold.

Systems such as Converge don't require fancy editors to be built (with all the potential usability and portability problems that are often associated with such things), and furthermore they still allow all our favourite tools to be used. After print statements, the most effective debugging tool I have yet found is grep. While grep works perfectly well on Converge source code, how is it likely to do on the proprietary format held in the sort of tools that Martin's advocating? The implications of that worry me, and they might worry you too. But that's another argument for another time.

In summary, while I understand why people have an aversion to textual source code these days, I believe that it is due to antiquated tools and techniques. If one uses modern tools and techniques, parsing text becomes an easy and pleasant activity, and virtually everything in Martin's article holds true for pure textual systems. I suppose the cachet once associated with misappropriating a Pete Townshend lyric has disappeared over the last year or two, but I can't resist it. Text is dead they say? Long live text.

Updated (July 11 2005): Clarified that Earley parsing can cope with all context-free grammars. Thanks to Marcin Tustin for spotting this omission.

Updated (January 27 2008): Updated the link to the Python grammar, and changed the syntax for the Earley example to something more obviously EBNF-like.

Follow me on Twitter @laurencetratt
