
My Interview with Eelco Visser on Parsing

May 16 2023

In my previous post, I mentioned in a footnote an interview I had done with the late Eelco Visser in 2008. The audio is available here, but there isn't a transcript available. That's a pity, because, as far as I know, Eelco didn't do another interview along those lines, and searching through audio is still a difficult thing to do well. So, I thought I'd produce a transcript, as a little way of honouring Eelco's memory.

In my opinion, the technical material in the interview speaks for itself. The interview process, however, had a partly comic side — neither of us had a clue what we were doing. I had never been an interviewer before (and only did one further interview after this), and I don't think Eelco had done a technical interview like this. My understanding of parsing was woefully incomplete, and Eelco hadn't thought about some of the deeper details in several years (he said afterwards that he was a bit "rusty"), though it clearly came back to him very quickly! We recorded the interview in a dingy college room in Cambridge that, perhaps fittingly, looked like it was last decorated during the heyday of parsing research — the 1960s.

Afterwards, I'm fairly certain that we went for a drink together in the college bar. I don't remember the specific topic we talked about, but it almost certainly would have been like many others we had: he'd talk about deep technical matters while I made stupid jokes, with the two ends of the conversation somehow meeting in the middle from time to time. I unthinkingly expected to have more of those conversations in the future. While that's no longer possible, I have many happy memories of those we did have.

As is conventional, I've edited the transcript for clarity (us humans stumble a lot when we're formulating our thoughts!). I think that Eelco's engaging personality and deep knowledge come across, particularly as the interview goes on and both of us better understood our roles. I hope you find this an interesting read!

Welcome, wherever you are, my name is Laurence Tratt. Today we're at the Code Generation Conference in Cambridge and my guest is Eelco Visser. We're going to be talking today about parsing, so I'm going to let Eelco do a quick little introduction and tell you where he's from.

Hi Laurie, my name is Eelco Visser. I work at Delft University of Technology, which is in the Netherlands. I'm an Associate Professor there, and I head a group of PhD students and postdocs doing research into language engineering, software configuration management, and stuff like that.

A lot of our listeners will have heard about parsing — they'll know that parsing goes on somewhere, it's this magical term! Could you give us an introduction to parsing, why we want to do it, what roughly it does? We'll go into more detail as we go along.

So the general idea of parsing is that you have text in some language, and you want to do some processing on that. You want to get at the structure of the text, the structure underlying it. Typically you write a program that reads the characters of the text, makes sense of them, and makes a structured representation of that.

Is this a little bit like natural language where implicitly, when we talk, we break apart a sentence: this is the verb, the noun and that sort of thing, the subject?

Exactly, so that's actually called parsing in natural language linguistics theory. Parsing comes from dividing sentences up into parts and then making little tree structures for them. So that's exactly what parsing is about.

But there are also all kinds of places in computing where you do it. Think of a text file with lines where you have columns separated by tabs, or colons, or something like that, and you want to get the elements from the columns. You would write a little program that gets the elements from the columns, by reading the characters and seeing where the divisions are, and then translate it into some kind of tabular form.

So even breaking up a CSV file is a parsing of a sort?

Exactly.

That's giving us quite a wide range! What are the places where a typical user is going to be using parsing, even if they don't know about it?

Well, exactly, all those places, right! Even reading in a string from a form in a web application, you have to see whether it's actually a correct sentence according to whatever you want in that place in the form. That's the recognition: you want to say, well this string should comply with this particular language: it should look like this, it should be an email address, it should have a bunch of letters and an @ symbol and then a bunch of other letters, and a dot and some more stuff like that. So that's already a place where you might come across parsing problems.

People are probably thinking about it in terms like compilers and that sort of thing, but as you're saying it's used pretty much everywhere when dealing with text that goes into a computer.

Exactly, yes.

As I said earlier, a lot of our listeners, and a lot of people in general, will have the general feeling that parsing is a pretty hard topic — it's one of those things that people often run away from in fear.

Why is that?

Well that's what I wanted to ask you! Why is this something that people are often scared of? Is it because there are a lot of technical terms associated with it? That it has a reputation, with compilers and so on? That there are a lot of scary tools (we'll be getting into those later)? Or do you think it's just that people are often actually doing it themselves anyway? You said, if they're breaking down CSV files, or doing web page validation, they're doing parsing, and maybe they don't think of what they're doing, this simple thing, as parsing?

Right, and then they may have come across Lex and Yacc in their education and either understood it, but thought it was quite difficult, or were too afraid of it to actually look at it. I don't know exactly. But Lex and Yacc is the older technology in this area and it is a bit hard to use. I'm not really sure about the psychology of this, because I'm quite happy and familiar using parsing technology — I find it a bit hard to reason about why people would find it hard!

I would tend to agree with you, I think that perhaps for a lot of people, there's a certain cultural thing that says it's hard, without necessarily a basis for it. But one of the reasons for that might be that, if we go back to the example of web page validation, let's imagine I've got a date form and someone can type in a date, in US format, British format, ISO format, whatever — there are all these various date formats! A lot of that sort of parsing is a little bit of hand hackery. Often they'll break things apart. Can we do all types of parsing using that technique?

Right, if you have a general purpose programming language you can implement any parsing problem that you want. The only problem is that, at some point, it becomes very unpleasant doing it yourself, and then you would rather like to use some higher level abstractions and a domain specific language.

How unpleasant would you say? I mean, are we talking about it being impractical, or that there are a lot of corner cases that don't tend to be handled correctly?

Well, it just becomes very tedious. You have to think of every step in the process: if you have read this character, then all these cases may occur. So basically, you have to build up a state machine with transitions, and you have to think of all the cases, and that is very tedious and easy to get wrong.

So another thing that a lot of people will often use is regular expressions — most languages now have reasonable regular expression support, and regular expressions are relatively powerful. Is that enough to do all this parsing stuff? Does that solve some of my problems?

Well, definitely some of these. Input form validation problems, or even CSV file parsing — you can get a long way with regular expressions and [pull?] programming and that kind of stuff. So, basically, regular expressions are another way to do parsing — it's the next level up from parsing manually and writing your own parsing program completely.

Regular expressions are nice because they're a formal language that declaratively describes a structured little language, but they're not enough. The problem with regular expressions is that they cannot deal with recursion. So as soon as you have a language that has some kind of parentheses, where you have expressions that have parentheses and then within those parentheses you can have another expression with parentheses and other nested expressions, then we know from formal language theory that regular expressions don't do the job.
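
[2023 note: a small yacc-style sketch makes the recursion explicit. Assuming a hypothetical lexer that provides "INT", "(" and ")" tokens, the rule below refers to itself, so it accepts arbitrarily deep nesting such as "((42))", which no regular expression can express:]

%%
expr: "INT" | "(" expr ")" ;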

In other words, it's the problem that they can't balance things out if there's two open brackets, they can't count, in the sense that there should be two at the end?

That's right.

OK, so that's a fairly good reason. So, regular expressions get me further, in a sense, or allow me to do more, but they still hit a buffer?

In technical terms, regular expressions are implemented using a finite state machine, so after reading a character you have to decide which state to go to next. The only thing you have to remember is which state you are in. The problem with these parentheses languages is that you need a stack machine, because you need to remember where you have been.

So you end up manually coding extra stuff on top as well as using the regular expressions again? So we're kind of back to the manual process again, effectively?

With pure regular expressions, you cannot do it. But then, if you have some way of calling regular expressions and then defining them by functions, or something like that, then you can [use] the function call stack to emulate this, and you can get this nested behavior.

Okay, then we move on to the tools that people know about. But before we get there – and this is why academics start getting involved – there's a whole theory of parsing, of formality, and so on. Could you give us an idea of the underlying theory of parsing, not going into too much detail perhaps, but the theory that solves the problems that you've talked about? That regular expressions can't balance brackets and so on.

So this goes back to the 50s and to linguistics, even to the study of natural language. Noam Chomsky in America was studying natural languages and he came up with a hierarchy for formalizing language grammars. People had been thinking about grammars, but Chomsky came up with a formal system to describe languages. He came up with this idea of context-free grammars. The idea there is that you have a symbol, a so-called nonterminal, and you have productions, [which] can rewrite symbols. So basically it was a system for doing string rewriting. His systems were not for parsing, so much as they were for describing the generation of language. They were generative grammars.

They could recognize that something was valid?

No, they could generate the valid sentences of a language. So, the idea of context-free grammars was that you write it as a symbol, an arrow, and a bunch of other symbols. And the idea was that you start with the start symbol, and then you could look at the rules of the grammar, and those rules said: well, if you have the symbol 's' then you can replace it with the symbol 'noun' followed by 'verb' and then noun you could replace by this and that.
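
[2023 note: in rough BNF-style notation, a toy generative grammar of this kind might look as follows; the symbols s, noun, and verb are purely illustrative:]

s    : noun verb ;
noun : "dogs" | "cats" ;
verb : "bark" | "meow" ;

Starting from the start symbol s, one possible generation is: s => noun verb => "dogs" verb => "dogs" "bark".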

So it's describing the valid sentences that I can build?

Exactly. It's a generator for the language.

So this is like the implicit rules that we all learned from our parents, listening to them speak as we grew up? We know that sentences have a noun and a verb and a subject and so on — we have those rules implicitly.

Chomsky's problem was really: can we understand how language works? And his theory was that, actually, our mind somehow uses such kinds of rules to describe how we generate sentences. There's a debate about whether that actually works for natural language.

In addition to these context-free grammars, you can also have context-sensitive grammars, where you say, well, if you have this symbol in the context of some other symbols then you can rewrite it to this set of symbols.

So the actual specific content of a word changes what I can say afterwards? Let me rephrase that. With the context-free grammars, I can put any random words in, provided they obey the structure correctly.

Right.

But with the context-sensitive [grammars], if I say "dog" then I have to say "bark", I can't say "meow" or "moo" or something like that.

Yes, something like that.

I'm simplifying. [2023 note: yeah, right!]

Yeah.

So how does this then relate to computing parsing?

So then in the early 60s people were working on ALGOL 60. This was a domain specific language for defining algorithms – that was the problem at the time – and they tried to formalize the design of that language. They had a report about it. They came up with this formalism called eBNF, well, BNF actually (Backus-Naur Form). Backus and Naur, if I remember, came up with this formalism for describing grammars which was basically equivalent to Chomsky's context-free grammars. So they used these rules to describe the syntax of ALGOL. That gave rise to a whole lot of research about defining languages.

So the original definition of ALGOL 60 was saying what the valid syntax was, but at that point it was still a "paper thing" in a sense? It hadn't been "tooled [up]" necessarily, or not originally — it was thought of as a way of just saying: here is the valid syntax of ALGOL 60?

Right, it was a language specification language.

So we mentioned earlier about BNF, and we also mentioned eBNF, so could you just give our listeners a brief idea of the difference? What does the "e" do as it were?

The "e" stands for "extended BNF". BNF is plain context-free grammar: you have a symbol on the left hand side and a bunch of symbols on the right hand side and you basically can expand symbols. There are typical patterns in grammars that you would like to write down a bit more concisely. One typical example is lists: you want to have a list of expressions separated by commas. In a context-free grammar, you would have to define a non-terminal expression list, and then say, well, an expression is either empty, or it can be an expression followed by comma followed by an expression list. So that's a typical pattern that you have to write over and over again. What eBNF does is provide a high-level construct to allow you to define those things. So, typically in eBNF you can write "expression*" for a list of expressions.

Zero to many, just like in regular expressions?

Exactly. So basically it adds regular expression operators to grammars.

And this is a nice syntactic convenience, because there's this repeated tedious idiom that you had to go over and this allows you to nicely...

Right, I mean you can define a simple transformation that transforms eBNF grammars to basic BNF grammars.

So from an expressive power point of view there's no difference between the two, just eBNF is a lot more pleasant to work with?

Right.

What comes next? If [someone has] a description of a language, maybe they're familiar with tools like Lex and Yacc, maybe even JavaCC and so on, they know that they [should] write some sort of context-free grammar. They're probably even aware of the traditional syntax, where there's a little term on the left hand side, then there's an arrow, and then there's some stuff on the right hand side. Magically, they feed this into one of these tools and then it'll generate some code for them, that then reads in a text file and does some magical stuff for them. So could you maybe give us an idea of what are the inputs to these tools? Are they context-free grammars in general, and what is the typical output of a tool?

The basic idea is that you write the context-free grammar and feed it to a tool. Then there are basically two classes of tools, divided into top-down parsers and bottom-up parsers. Both these classes take a restricted subset of the full set of context-free grammars.

The idea of a top-down parser is that it starts with the start symbol, it says: well okay, here are a number of productions, and my sentence should consist of one of these productions; I'm going to look at the leftmost symbol of all these productions and try to see whether the first word in my string matches the first word in one of these productions.

So a production is an individual rule in the grammar, should we say?

Yes. And then it goes on and it decides: well, this is the production you should take. And then it says, well, next we expect this symbol and we're going to look at the production of that and then eventually you have a complete tree.

The problem with top-down parsers is that they cannot deal with left recursion. For instance, imagine a production where an expression 'e' can be an expression 'e' followed by the symbol '+' followed by another expression [i.e. 'e: e + e'].

In other words: if I do the standard calculator grammar, I've got expression plus expression, or expression minus expression, or expression star expression, standard simple mathematics, and that can't be expressed?

Exactly. So when you feed [such a grammar] into a top-down parser, it will loop, because it will say: well, okay, I need to recognize an expression; in order to recognize an expression I'm going to have to recognize an expression. So it will infinitely loop and...

...never makes any progress?

Right. And then you have bottom-up parsers. Yacc is the prototypical bottom-up parser. The idea there is that if you want to recognize a string for a certain symbol, then you look at all productions and try to see if you can make progress with any of those productions (and maybe multiple). At some point you reach the end of a production, and then you can actually say: well, now we can make a reduction, and we can say we have recognized that this part of the string corresponds to this production, and then you can go up one level.
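
[2023 note: the difference is visible in the grammar itself. In the sketch below, which assumes an "INT" token, the first rule is left recursive: a naive top-down parser trying to recognize expr immediately tries to recognize expr again, forever, whereas a bottom-up (Yacc-style) parser handles it happily. The second rule is a right-recursive reformulation that a top-down parser can cope with, at the cost of changing the shape, and hence the associativity, of the resulting tree:]

/* left recursive: loops in a naive top-down parser, fine bottom-up */
expr: expr "+" "INT" | "INT" ;

/* right recursive: acceptable to a top-down parser */
expr: "INT" "+" expr | "INT" ;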

So we've mentioned the classic tool — Yacc is the one that we'll keep on coming back to, probably! A lot of these tools take in a grammar file, whether it be a top down or bottom up parser. A lot of people will be familiar with them generating a whole load of code: you feed them in and magically it generates a C file or a Java file or something. Is that an intrinsic part of the way a parser needs to work? Why are they doing this?

Essentially what happens is that a parser generator transforms a grammar into a state machine. The state machine then keeps track of: if the parser is in this state and sees this token in the input, then go to that state. That's how these things work and that's what we were discussing before, right? You can write these state machines by hand, by making if-then-else statements, and stuff like that, but you want to avoid that. The clever thing about parser generators is that they generate that state machine for you.

So when they generate all this stuff, that's an intrinsic part of what they do. But one of the things that anyone who's ever looked at the file that, say, Yacc generates — there's an awful lot of stuff in there! Presumably, it could have just done that at runtime, but by doing it in advance it's able to pre-compute all this. This is presumably the interesting thing?

Right. The process of transforming the grammar into a state machine is often an expensive process, so you want to pre-compute that. At least, if you have large grammars this can be quite an expensive process. Especially if you have LR parser generators with lots of lookahead, then these tables become huge, so you want to definitely do that offline.

But then the other issue is that the essential part of this is that you compute the state machine, and then you could have some representation of that. You could put the state machine in an XML file and then read the XML file in at runtime and interpret it. So the fact that parser generators generate code is not really essential. You need a representation of the state machine and then you can interpret that somehow, or you can generate code that executes the state machine transitions, but that is a matter of engineering basically.

OK, I think that will be quite interesting to a lot of people. As I said, it's often a sort of a cultural expectation that they generate code.

For instance, the SGLR parser that's used by SDF – the grammar formalism that I've worked on – actually interprets a file that encodes the state machine. You don't have to generate a parser for each language, you can just read it in at runtime. But the computation of the state machine is the expensive part.

You want to just do that once?

Yes.

So let's imagine that I've got a Yacc thingy: I've generated some code, or I'm reading it at runtime. I've got a parser and I give it some input. The question is: what does it do next? We've touched on this: it builds a tree or...?

So the question is: you have this grammar, and the basic thing that a parser does for you is recognize sentences. So first you could use it in a web application and recognize that the input of your form is actually correct according to the syntax of your language. And so you could indeed make a parser for dates, but next you will probably want to actually use the structure. We talked through this in the beginning: you actually want to have the structure of the text, so you can do further processing. There's an intrinsic correspondence: a grammar provides a structure for sentences, a natural tree structure.

Let's look at the expression grammar: so you have an expression plus an expression is another expression, and expression times expression is an expression. So now if you read the sentence 'a + b * c', then the natural tree representation is: you have a '+' node with 'a' as a child and you have a '*' node which is a child of this '+' node and it has children 'b' and 'c' probably. What you could let a grammar or parser do is build this tree structure.

For instance, the SDF formalism that I mentioned before does exactly that. So, when you parse a string with that parser, it will build a tree structure: a term representation of the text that you have parsed. Then you can use that as an input for further processing.
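
[2023 note: drawn as an indented tree, 'a + b * c' comes out roughly like this, with the '*' node sitting beneath the '+' node because multiplication binds tighter:]

+
 a
 *
  b
  c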

This is what a lot of people call a parse tree?

That's right.

Some parsers automatically create this parse tree, and sometimes you create it manually, or you don't even necessarily have to do so. For example, the Yacc family has 'semantic actions', and you see this slightly artificial example where they have the calculator grammar, and they actually automatically add numbers up and multiply them.

Right. The original philosophy in compilers in the 70s was, well, you didn't have a lot of memory, and you should really parse a program and, immediately as you were parsing, compile it to machine code. So you have a single linear sweep of your program that should do everything at once. That was easier, because you didn't have to read the whole program into memory, because there just wasn't that much memory! You really had to optimize for those kinds of things.

You see that languages in those days were designed for this. For instance, Pascal was designed so that all definitions came before uses. It was designed to enable this kind of processing: you parse your program and then immediately you could compile it. And parser formalisms were also designed in such a way that, rather than building up a tree (there's this natural tree underlying the grammar), you would perform "semantic actions" as a result of recognizing a production. So you would attach to a production a so-called semantic action, which basically was a statement in your host programming language – Yacc used C – and you could execute a bit of C code. There was access to the stack, so you could actually access the children of the trees.

One way to use this is to, in the semantic action, build up a tree that you could then later process. But you could also immediately perform semantic evaluation. You mentioned the expression calculator — that's the standard example in compiler courses, where instead of building the tree '+ a b', you would actually compute the value of 'a' plus the value of 'b'.
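
[2023 note: in classic Yacc notation, the two styles look roughly like this. These are alternative versions of the same rule, not one grammar, and mknode is a hypothetical tree-building helper:]

/* evaluate immediately, calculator-style */
expr : expr '+' expr { $$ = $1 + $3; } ;

/* or build a tree node for later processing */
expr : expr '+' expr { $$ = mknode('+', $1, $3); } ;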

I think what we're effectively saying there is that, back in the day, parsing was a very expensive activity in computation and memory terms, so there was a lot of effort to try and make this whole thing efficient. So, for example, parse trees: if you parse a whole load of text and build a big tree, it's expensive, so we try and do things as we go along. Now that maybe also brings us to one of the reasons that I think parsing scares people: there's a load of these weird terms: LL, LR, LALR, and so on. Can you give us maybe a little idea of the history? Why are there these various terms? Maybe we can explain some of the major ones just briefly at a high level.

Before, I mentioned top down parsing and bottom up parsing: these are the fundamental ways to do parsing.

For instance, top down parsing, you say, well, we should decide on the production to take by looking at the first symbol. So, basically what you were doing there is: look at the first symbol in the input and then try to match that against the symbols in the left hand side of the production. Or you could have a bit more look ahead. That is called LL parsing, and LL(1) is where you have a look ahead of one and LL(2) is where you have a look ahead of two.

So what that means is: it's not that I look two things ahead in the grammar, but that I look two things ahead in the input?

That's right. You would actually do a bit of a computation to see which tokens were predicted by symbols in the production.

So maybe this is a point where we should be explicit about one of the things that a lot of these parsing approaches have given us implicitly, and we don't think about explicitly: we've mentioned tokens and non-terminals. All of these parsing approaches are actually based on a sort of two-phase structure, aren't they?

That's right.

The first stage being: I break stuff up into words. If I look at English, for example, a word is separated by a space. So I just break everything up, so those are the tokens? And you also called those non-terminals?

No. The tokens are the words, so things like '+' and an identifier. The non-terminals are the symbols in your grammars: things like expression, and statement, and for loop.

So those are things that are not actually in the user's input? Those are parts of the tree structure.

Exactly.

OK. So, those two phases are separated, and this is why we have Lex and Yacc, for example? Lex is doing that first phase, splitting things up into words, and then Yacc is making the tree structure or running the semantic actions.

Right.

OK, sorry, so we're going back: you've explained LL. Then we've got LR and LALR and those things. Maybe we can go over a little bit about those?

So, LR and LALR are variants used in generating decision tables for bottom-up parsers: shift reduce parsers. They are technical variations on deciding which class of grammars you can actually deal with in a parser generator.

We talked about context-free grammars and I think it's fair to say that, for most computer languages we're interested in, context-free grammars are what we want.

Yes, you would say so, but then people invent things like offside rules in Haskell or Python, which you cannot express with context-free grammars.

Maybe we'll get on to that later — that's quite interesting! But all these funny symbol terms like LL and LR and LALR, are really saying: we're taking subsets of the context-free grammars? We're deliberately cutting down the number of things we can parse, presumably for efficiency reasons generally?

The issue is ambiguities. In a top down parser, you need to decide which way to go: you can look a couple of tokens ahead and then you can decide. But if you have to look too far ahead then you're inspecting the string all the time. For some grammars, for instance these left recursive grammars, there's not enough tokens you can look ahead, because there's not a fixed number of tokens you could look at to decide where to go. So the grammar is restricted to be non-left-recursive, because that's one example of a restriction to fit a particular approach to parsing.

We keep coming back to the fact that a lot of this parsing stuff happened in the 1960s, and computers were just a tiny little bit slower then. So the LL-type parsers have this obvious limitation, but the LR parsers still don't allow me to express every context-free grammar. Again, presumably that is just because that allowed those very slow computers to parse text at a reasonable speed?

Well, it's not just slowness: it is really about algorithmic complexity. Earley designed a parsing algorithm in the 60s called Earley parsing and that can deal with all context-free grammars (well, maybe except for circular ones), but the complexity of that algorithm is cubic, so it can take up to n³ time to parse a string.

If I've understood that correctly, what we're saying is that with Earley parsing (that's "Earley" with an e at the end, rather than just early in the morning!), the algorithm isn't necessarily difficult to implement for the poor person who writes the parser. But the time it takes grows cubically with the length of the input: in other words it can be very, very slow. So, when it was invented, on computers of the day, it could possibly have taken hours to parse a relatively simple string for the calculator grammar, for example.

Typically, things get bad when you get longer strings. Well, that is complexity, right — it doesn't scale to large things. So the challenge was to have parsing algorithms that were linear, and LL and LR guarantee linearity for parsers.

Is this still a concern today? The processor I have in my phone, for example, is 10 times the speed of the fastest supercomputer in the 1960s.

But still, complexity beats everything, right? Things go up fast. So it remains a concern: you don't want an exponential parsing algorithm, or even cubic, you don't want to have that.

Is that why we're still using parsers with this LR stuff and LL and so on? The efficiency is still worth us cutting down the amount of things we can easily parse just so we know we can do it very efficiently?

Well, partly. There are just tools around that are based on this stuff. They've proven their worth and they work. Another advantage of using these subsets is that if you succeed in fitting your grammar into the subset, you guarantee that your grammar is unambiguous. Ambiguity is a problem where you can parse a string in more than one way: you can say, well, it might be an expression, but it also might be a statement. That is not typically what you want when you define a language, because you want to be clear.

Is the classic type of ambiguity: I have the expression '2 + 3 * 4'. Now we implicitly expect that to be '2 + (3 * 4)', but a parsing algorithm thinks: well, I could either do that as '(2 + 3) * 4' or the other one.

That's an example of ambiguity, indeed. So depending on how you formulate this, the grammar may be ambiguous to parse.

Earlier, we mentioned parse trees. Now, there are two terms that often come up in parsing, which is "parse trees" and "abstract syntax trees". Could you differentiate between them for us?

Sure. A parse tree is a tree you get directly from parsing according to the grammar. We talked about this production 'e' is 'e + e', so the parse tree you would get from that is a node with label 'e' and it has three children: one labeled 'e', one labeled '+' and one labeled 'e'. You may have white space in-between; you could have that in the parse tree as well.

It still maintains the tokens: the original user's text is all somewhere in there?

Exactly. You can exactly reproduce the string from the tree by just going over the tree and printing all the bits of text that are stored there, and then you get your string back. There's a one-to-one correspondence between the parse tree and the string you have parsed.

And then we have "abstract syntax tree"?

So now this '+' child of this node is not very interesting. So what you do for an abstract syntax tree is transform this parse tree into a tree where you label the nodes of the tree with the name of the production. So for instance we could name the plus production '+', typically, and then we could transform this tree 'e' with children 'e + e' to a tree with a root labelled '+' and with two expression nodes as children. We throw away the literal and also white space and stuff like that, because we typically don't need it for further processing.

So let me try and repeat that back in another way. If I've got a compiler, I do my first pass, and I get back a parse tree which has got all of the bits of the text in there. But that's quite low level. So then I'll typically have a phase which creates an abstract syntax tree, which will flatten the structure, throw away the bits of the user's input that are not relevant, that are just formatting related and so on, and leave me with the essence of what the user wanted to say.

Exactly.
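
[2023 note: for the input 'a + b', the contrast looks roughly like this:]

parse tree (every token kept):
e
 e
  a
 +
 e
  b

abstract syntax tree (node named after the production; literals and layout dropped):
+
 a
 b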

Are there tools? How does one create an abstract syntax tree? Is it done by hand?

We talked about semantic actions a while ago, and there you see that the user is actually in control of building the tree. Typically you immediately build an abstract syntax tree rather than a parse tree, so there's not this two-stage processing: you just build an abstract syntax tree immediately. The disadvantage of that is that you actually have to do it, you have to build the structure. You could say, well, you could actually derive this abstract syntax tree automatically from the grammar, because there's a one-to-one correspondence between grammar parse trees and abstract syntax trees.

SDF is a formalism which was designed to do that. There's a one-to-one correspondence between parse trees and abstract syntax trees. There's an issue why you cannot do that with all kinds of grammar formalisms. Because in many grammar formalisms (well, we talked about ambiguities earlier) you have to disambiguate explicitly. Typically you do that for expressions, for instance, by having different levels of precedence, which we encode as non-terminal symbols. So you have a level for multiplicative expressions and additive expressions and these are different symbols. And then what you get is a so-called injection: productions that say an additive expression can be a multiplicative expression, or it can be an additive plus an additive expression. From this injection going from multiplicative to additive, you will get a tree node with a single child in it, and you may have long chains of these injections. That's often why you see that you want to have this mapping from parse trees to abstract syntax trees, so you not only throw out these literals and layout, but you also flatten these injection chains. That's one thing.

Another thing is that some approaches use iteration: you can have a long row of 'e + e + e + e' and that's expressed as an iteration. But in a tree you would rather like to have a nested tree structure. Then you have to do a transformation to get from that list of expressions to the nested structure. For instance, there's a case in top down parsing formalisms where you cannot express left recursion.

So the idea of SDF was that it should be a formalism where the tree structure you get is really a one-to-one correspondence to the grammar formalism. So from the grammar, you know what your tree looks like. In order to do that you shouldn't express precedence in the grammar but you should just have a production 'e' is 'e + e' or 'e * e' and then you have separate disambiguation rules that tell you: this one is left associative, and multiplication is higher precedence than addition. When you do that outside of the production rules then you don't contaminate the production rules with this information. You can actually see this tree structure in the grammar.

So SDF is a way of giving significant help in creating abstract syntax trees, and deferring some of these decisions about precedence and so on until the second stage when you're creating that abstract syntax tree, rather than the parse tree?

No. It's basically a separation of concerns issue. You can say the grammar describes several things. One is the tree structure that you want to think about, the logical structure of your language elements. The other is, well, how should we parse these things? These expressions should be disambiguated there, the operators should be parsed in a particular order. You could see that as a separate concern, right? If you're writing a compiler, you're interested in the logical structure of your language, not so much in disambiguation. That's kind of a parsing issue. SDF allows you to do that separation of concerns by having a separate definition of disambiguation rules which are used within the parser. The parser takes care that you don't get ambiguity, but you don't have to express it in the grammar.
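
[2023 note: SDF has its own priority and associativity notation; for readers who know Yacc, the closest analogue is precedence declarations, which also live outside the production rules. A sketch in classic Yacc syntax:]

%token NUM
%left '+'
%left '*'
%%
e : e '+' e
  | e '*' e
  | NUM
  ;

Because '*' is declared after '+' it binds tighter, and both operators are left associative; the productions themselves stay uncluttered.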

Earlier, we were talking about Lex and Yacc and we had these two very distinct phases of breaking the input up into tokens and then parsing it. Now, I believe that you were one of the original people involved in "scannerless parsing", where in a sense these two phases are somehow merged. Could you talk to us a little bit about that, please?

That's right. I worked on this SDF formalism, which was designed to have some natural abstract syntax tree. Before I started working on this, we had a traditional two-phase language definition. In Lex and Yacc, you have separate files for defining the lexical syntax and for defining the context-free syntax. In SDF, we actually had a single formalism that allowed you to describe the lexical syntax and the context-free syntax, so it already integrated those issues. But the lexical syntax was still described using regular expressions and character classes: regular expressions, '*' and '+', and those kinds of things. So what I did is just use normal context-free grammars to define the lexical syntax rather than regular expressions, and throw away the scanner. So, rather than having a separate stage in which we tokenize the string, the parser reads characters, so the tokens are now the characters, and we use this context-free grammar to do both the so-called scanning and the parsing bit.

What advantage does that gain one? It's nice squishing two phases into one, but does this give me some extra practical expressive power or efficiency?

Yes, for instance we talked earlier about the limitations of regular expressions, that you cannot express nested structures. Think of the problem of nested comments: you have Java-style long comments and you want to nest these things. That doesn't work [with a regular-expression-based scanner]: you have a comment opening symbol, then another comment opening symbol nested inside it, and then a comment close corresponding to that second opening symbol — the scanner will just stop there and stop parsing your comment. Using context-free grammars for lexical syntax, you can actually express things like nested comments. So that's rather nice — it kind of generalizes the definition of lexical syntax from what we had before.
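
[2023 note: in rough grammar notation, nested comments are just another recursive rule, which is precisely the kind of thing a regular-expression-based scanner cannot express. Here other_char is a placeholder for any piece of text that neither opens nor closes a comment:]

comment      : "/*" comment_part* "*/" ;
comment_part : comment | other_char ;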

That was kind of the motivation sort of thing that I had in my thesis. But that was not a very strong story, I have to say. I mean, there were a few small problems in Pascal that you could solve more nicely using this, but it was not really a killer application of scannerless parsing. I didn't realize until much later what the real killer application would be, and that is language combination.

The problem of combining languages is that you have to combine grammars. Well, let's first talk about why you would want to combine languages. So, we're here at the code generation conference and what you're doing in code generation is generating code. What you typically do is write templates that do code generation. You have some kind of meta language, a template language, and you have a bit of quoted language, the stuff that you generate. Typical template engines just generate text: the templates don't have any structure. That means that the templates are not checked syntactically, so you cannot know if the template code fragments are syntactically correct. Also, you don't have the structure of the text you generate and you cannot do any further processing. At the least, you would have to parse after generation, and that's kind of clumsy.

So now using language combination you can say, well, OK, we define a template language that is extended with, say, the syntax of Java. We can actually parse the template fragments using the proper syntax of Java, and then we can syntactically check that these fragments are correct. Then we can use the underlying structure to actually build an abstract syntax tree rather than text.

So, this was difficult because Lex and Yacc, those two phases were very distinct. I had to entirely take my text in and tokenize it and then I would parse it. But if I've got two languages in a file, then they probably don't share the same tokenization rules, and that would make combining them very difficult. But if I'm using scannerless parsing it can automatically say, "oh I've reached the Java bit, now I turn into the Java parser" as it were. Then I go back to the template language, I turn into the parser and scanner for that.

Right. So there's a bit of theoretical foundation for this. The key point is that context-free grammars are closed under composition. So if you take two context-free grammars and put them together, you have a new context-free grammar. But if you take a regular grammar, the kind that is used for defining the tokens of a language, regular grammars are not closed under composition. Nor are subsets of the context-free grammars such as LL or LR.

Can we explain to our listeners what we mean by "it's not closed under composition"?

You take two Lex files, you put them together, you don't have a new Lex file.

It's an invalid file probably?

Yeah, so it may not be a new valid Lex file because it doesn't follow the restrictions of the language. Or, it may actually be one, but because the rules have to be in some order, you may not get the tokenization that you want. For instance, keywords of one language may be identifiers in the other language: they become reserved words, and so you cannot use those keywords as identifiers in the other language. And that is annoying. For instance, say you have a template language that has a keyword 'define', and in your Java program you want to use 'define' as an identifier. If you make this combined grammar then you couldn't generate the Java program with 'define' as an identifier, because it's a keyword in the template language. That's the sort of thing that's annoying.

The real problem with combining scanners is that a scanner just goes over a string and says, well, this should be an identifier, or this should be a keyword, without looking at the context of where you are in the file. If you do scannerless parsing, the parser basically steers the scanning, so the tokenization says, well, at this point I expect a Java identifier, and it will look for a Java identifier. At some other point it will say, here I'm expecting a template engine keyword, so you may have 'define' here.

Is this a context sensitive grammar then?

It's a bit confusing isn't it? A context-free grammar that does context.

It is still context free?

It is context-free, but it still provides context. It's always a bit counter-intuitive! It still is a context-free grammar, but you do scanning in the context of a particular symbol.

It maintains all the theoretical niceness of context-free grammars?

Exactly.

Given that we're moving into this future, possibly, where we're getting lots of domain specific languages and we're embedding them in [other languages], scannerless parsing is probably a significant way of enabling this? So you see this as the killer application for scannerless parsing?

Absolutely, yes. I mean, we have been doing really great things due to the fact that we have this scannerless parsing algorithm. We've been embedding domain specific languages into Java. We've been building code generation tools where we embed the source and target language into the meta language, so they become parsed fragments of those languages. I think you only get that right using something like scannerless parsing and the techniques underlying SDF.

OK, excellent. Well, I would agree it sounds like a really great way of moving things forward! I think we'll probably be seeing an awful lot more scannerless parsing as we move into this sort of world. I think that's probably something for everyone to keep an eye on. Eelco Visser, thank you very much!

Thanks — it was fun.


Why Split Lexing and Parsing Into Two Separate Phases?

May 2 2023

"lex and yacc" are two terms that are used together so often that they can seem indivisible. lex takes a string as input and divides it up into lexemes (roughly speaking "words") and yacc takes a sequence of lexemes and parses them. Since some parsing approaches such as recursive descent parsing unify these phases, why do lex and yacc split them apart?

Running example

Let's start with a simple language, which allows us to assign simple expressions to variables:

x = 1;
y = x;

Using lrlex from the Rust grmtools libraries, I can define a lexer file [1] for this fragment as follows:

%%
[a-z]+ "ID"
[0-9]+ "INT"
= "="
; ";"
[ \t\n\r]+ ;

If there were any declarations (i.e. settings) they would come before the "%%" line: there are none in this example. Each line after the "%%" defines a lexing rule. Lexing rules consist of a regular expression and either an identifier or ";". For example, the rule [0-9]+ "INT" says that every non-empty sequence of digits defines an integer, assigning it the token type "INT". If a lexing rule has ";" instead of an identifier, any text matching the regular expression is discarded: as is common, in this example the lexer simply discards whitespace (spaces, tabs, and newlines).

Helpfully, we can run lrlex on our lex file and input to see what lexemes it produces:

$ lrlex /tmp/t.l /tmp/t
ID x
= =
INT 1
; ;
ID y
= =
ID x
; ;

Each line tells us the token type and the lexeme (i.e. the text matched).

We can now define a simple yacc grammar which allows us to write files with multiple "statements" (i.e. variable definitions or assignments):

%%
stmts: stmts stmt | stmt ;
stmt: "ID" "=" expr ";" ;
expr: "INT" | "ID" ;

If you're unfamiliar with yacc grammars, the stmts rule is a long-winded way of writing the more regex-like "stmts: stmt+", i.e. that a valid input is one or more statements.

We can then use grmtools' nimbleparse to parse our input relative to our lexer and grammar:

$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmts
  stmt
   ID x
   = =
   expr
    INT 1
   ; ;
 stmt
  ID y
  = =
  expr
   ID x
  ; ;

The indentation in nimbleparse shows us the tree structure: if you're not sure what you're looking at, you might find it helpful to mentally rotate the tree 90° clockwise.

Overlapping lexing rules

Let's now assume that I want to introduce mathematical constants into our language, specifically the constant "pi":

z = pi;

Since pi is a constant, I want to forbid people from assigning to it. In other words I want this expression to cause an error:

pi = 3;

I could do this in a "type checking" phase after parsing, but there are good reasons why I might want to make my parser reject such inputs, including: I'm lazy and don't want to add a type checking phase; it's easier and faster for external tools (e.g. your IDE) to spot incorrect input if it's encoded in the parser; and it's generally easier for users to understand valid inputs from a grammar than it is from a description of type rules.

Let's give it a go by changing our lexer to explicitly identify pi:

%%
[a-z]+ "ID"
[0-9]+ "INT"
= "="
; ";"
[ \t\n\r]+ ;
pi "PI"

and our grammar to only allow pi in the right hand side of an expression:

%%
stmts: stmts stmt | stmt ;
stmt: "ID" "=" expr ";" ;
expr: "INT" | "ID" | "PI" ;

Unfortunately when I run it on the input I expected to be rejected, it is parsed without error:

$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmt
  ID pi
  = =
  expr
   INT 3
  ; ;

Clearly I've done something wrong, but what?

From experience, when a yacc grammar doesn't do what's expected, a lot of people immediately start trying to understand the LR parsing algorithm. I've surprised 3 people in the last month by saying, quite truthfully, that I no longer remember how LR parsing works. I'm sure I could refresh my memory fairly quickly if I really needed to, but I have not found my forgetfulness to be a problem because fixing these sorts of problems rarely requires understanding the parsing algorithm.

In this case, it turns out that I don't need to understand anything about LR parsing, because the problem is in the lexer! We can see that by running lrlex on our input:

$ lrlex /tmp/t.l /tmp/t
ID pi
= =
INT 3
; ;

This clearly shows that pi has the token type ID rather than the expected token type PI.

At this point, it's clear that we need to know at least a little about how lex (and lrlex) lex inputs. Fortunately, the rules are surprisingly simple. Summarising briefly:

  • At each point in the input, lex tries every rule in the lex file — it does not stop at the first match.
  • If multiple rules match, the "winner" is decided in two steps:
    1. The rule(s) that match the most characters "win".
    2. If multiple rules match the same (most!) number of characters, the earliest rule in the lex file "wins".

In our case, the regular expressions "[a-z]+" and "pi" both matched the input "pi" but since "[a-z]+" came earlier in the lex file, it "won". As this shows – and contrary to what people often suspect is the case – rules in a lex file are ordered, and that order is important.

The fix in our case is thus easy — we just need to put the lexing rule for pi earlier than the rule for ID:

%%
pi "PI"
[a-z]+ "ID"
[0-9]+ "INT"
= "="
; ";"
[ \t\n\r]+ ;

We can then check that our input is lexed as we expect, and that the incorrect input is rejected:

$ lrlex /tmp/t.l /tmp/t
ID pi
= =
INT 3
; ;
$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmt
  ID 
  = =
  expr
   INT 3
  ; ;

Parsing error at line 1 column 1. Repair sequences found:
   1: Insert ID, Delete pi

Since I've left grmtools' error recovery on, it has both recognised "pi = ..." as incorrect and realised that we should have written "<id> = ..." [2].

Simple though this example is, it shows us one reason why it is useful to separate lexing from parsing. To see that, let's imagine a tool called lecc which merges together lex and yacc. If we take the first lexer and grammar from this post and adapt it for lecc we might write something like:

%%
stmts: stmts stmt | stmt ;
stmt: "[a-z]+" "=" expr ";" ;
expr: "[0-9]+" | "[a-z]+" ;

Because of the nature of this combined lexer/grammar, it's fairly easy to see how lecc would be able to lex/parse it in a way that's equivalent to our separate lexer and grammars from earlier.

However, what happens if I want to write a lecc file for the lexer and grammar which forbids me assigning to "pi"? I now need a regular expression which matches every non-empty sequence of lower-case ASCII characters except the string "pi". Since lex doesn't have things like negative lookahead assertions [3], I have to be quite creative in how I write this:

%%
stmts: stmts stmt | stmt ;
stmt: "[a-oq-z][a-z]*|p[a-hj-z][a-z]*|pi[a-z]+|p" "=" expr ";" ;
expr: "[0-9]+" | "[a-z]+" ;

Have I got the big regular expression right? I think I might have, but even after I've lightly tested it, I'm not entirely confident. Imagine if I also wanted to handle other constants such as "e" or "G" and so on — I would very quickly exhaust my ability to write such a regular expression, even though it's always theoretically possible to do so. If I were to generate the necessary regular expression automatically, it's quite probable that the result would end up being so large and complex that it would have poor performance.

We thus see the first argument for separating lexing from parsing: it makes it much easier for us to deal with "overlapping" lexing rules. Our first lexing file only worked because at any point in the input at most one rule could match. As soon as we introduced pi, we created overlap between two rules (ID and pi) [4]. Real languages have many instances of overlap (keywords and identifiers being an obvious, but by no means the only, example).

Scannerless parsing

What my imaginary version of lecc is doing is really a form of scannerless parsing. Astute readers might have realised that by combining a lexer and parser, I've accidentally introduced ambiguity problems into the seemingly unambiguous LR formalism. We can see this if I translate the expr from earlier as follows:
expr: "[0-9]+" | "[a-z]+" | "pi" ;
What does the input "pi" match? Well, it matches the second and third productions in that rule! In other words, we've transferred what I called the "overlapping" problem from the lexer to the parser.

Fundamentally, the "overlapping" problem in lexing is a form of ambiguity [5]. We now see a second argument (which is in a sense a generalisation of the first argument) for separating lexing and parsing: it allows us to push the inherent ambiguity of most languages into the lexer, keeping the grammar unambiguous.

What happens, however, if we embrace ambiguity and push scannerless parsing further? The modern form of scannerless parsing came into being thanks to the late – and much missed by me and others – Eelco Visser [6] as part of his PhD thesis. Amongst other ideas he introduced the idea of "reject" rules which we might represent as follows:

%%
stmts: stmts stmt | stmt ;
stmt: id_except_pi "=" expr ";" ;
expr: "[0-9]+" | "[a-z]+" ;
id_except_pi: "[a-z]+" ;
id_except_pi reject: "pi" ;

In essence, the idea is to say that "id_except_pi matches all identifiers but then rejects those identifiers which match 'pi'". This achieves the same outcome as my fiercely complex regular expression from earlier, and is much easier to use!

However, there are problems. In essence, "reject" rules interfere with the compositionality of context-free grammars, making them slightly context sensitive. A partial fix was suggested for this problem some years ago but, to the best of my knowledge, a full fix (or, at least, a full understanding of a fix) remains an open problem. As this might suggest, combining lexing and parsing is much harder than it may first seem.

Summary

The most obvious reason for separating lexing and parsing is that it allows us to move some things that are difficult to express in a parser into a lexer, where they are easy to express. More fundamentally, it allows us to use unambiguous parsing algorithms for languages that would seem to be ambiguous. This can seem a contradiction in terms, or disingenuous: does forcing lexing to remove ambiguity sully supposedly unambiguous parsing algorithms like LR? In my opinion, this is a fun philosophical debating point, but isn't very relevant in practice: having separate lexers works rather well.

However, as with nearly everything in parsing, there are trade-offs. lex-style lexers are not closed under composition. In other words, because lex files have ordered rules, combining two lex files together can have surprising results, with some rules never matching. This is a terrible shame, because computing would be much better if we could reliably compose languages. Scannerless parsing, as we saw above, is closed under composition, but causes other trade-offs.

There are two other major approaches to language composition. First, is to have some way to switch between different lexers. This almost always involves some form of syntactic marker: Copper and Silver are probably the best examples of such a system. However, when I was working on Converge, which allowed something similar and also required syntactic markers, I eventually concluded that I was rarely willing to pay the cost of the syntactic noise it required. Second, is some form of tree-based editing. Our group (well, mostly Lukas!) developed the Eco editor to explore this, and I think we've shown that it can work well in some surprising situations, but it's still not magic. Parsing continues to be a frustrating topic — not only are there no free lunches, but sometimes there is no lunch available, no matter how deep your pockets!


Footnotes

[1] If you're familiar with lex, you might notice that lrlex uses a slightly different format.
[2] Note that the token type "ID" in the tree has no value — error recovery can tell that you need an ID at that point, but it can't guess what the specific ID should be.
[3] Although it's often forgotten, modern regular expression systems can parse far more than formally defined "regular languages" — in other words, modern "regular expression" syntax is a misnomer. lex predates these modern concepts and – to the best of my knowledge – its regular expressions can truly only parse regular languages.
[4] Another way of looking at this problem is that it is due to languages like English using a finite number of symbols (roughly speaking "an alphabet", though it includes things like numbers and so on). If every concept we wanted to express had a single unique symbol, we could never have overlap.
[5] For the simple regular expression language supported by lex I believe one could statically detect all instances of overlap, although I'm not sure if any such tool exists. As an additional complication, lrlex uses Rust's regex crate which, like most "regular expression" engines, supports features beyond the theoretical concept of "regular expression" — I don't think one could guarantee to statically detect all instances of overlap for such languages.
[6] 15 years ago I interviewed Eelco on parsing — I'd never really done something like that, but because Eelco was so knowledgeable (even though he told me afterwards he felt a bit rusty on parsing!), I think it came out surprisingly well. Listening back, it's nice to hear his voice again.

Displaying My Washing Machine's Remaining Time With curl, jq, and pizauth

April 11 2023

A couple of months ago our washing machine produced a considerable quantity of smoke when I opened its door, which I interpreted as a suggestion that a replacement was needed. After rummaging around the internet for advice, a couple of days later a new Miele washing machine took its place in our home.

As well as boggling my mind at how much better the new machine was than the cheap machine it replaced, I was pleased to discover that Miele make available a third party API which one can use to interact with the machine. Putting aside any worries about connecting a high-speed rotating device to the internet, I decided to have a poke around. Although the API is more limited in practice than in theory – the API has support for setting values, but those mostly aren't used – I eventually realised it can help me solve one recurring problem.

Like most washing machines, ours beeps to notify me that it's finished. However, I can't always empty it straight away, and the beep is obnoxious and repetitive: it ends up disturbing everyone in the house. I can turn off the beep, but then I don't know when the machine has finished. Miele's app can notify me, but it regularly logs me out, and finding out the time remaining is a chore. What I really wanted was a countdown of the remaining time, and a notification, on my desktop computer. Fortunately, I can do exactly that using basic Unix tools. Here's what a sped-up version of the result looks like (the middle part is hugely sped up; the end part slightly less so):

In this post I'm going to explain how I got this working with the Unix shell, curl, jq, and pizauth. Interested readers should not have much difficulty adapting these techniques for other similar situations.

Registration and Authentication

To get started, I first had to register my washing machine to my email address with Miele's app. It's not the smoothest experience: I had to give it a couple of tries before it worked, but at least it's a one-off task.

With my washing machine registered to my email address, I then needed to set up OAuth2 so that I could use the API from my computer. First, I needed an OAuth2 client ID and client secret [1]. Miele allows anyone to generate their own client ID and client secret by registering themselves as a developer with Miele, which means giving them a random "app name" and the same email address used to register the device in the app. I then had to register that app as one that is allowed to use my Miele account.

I then needed an OAuth2 tool: I used pizauth because, well, I wrote it. My ~/.config/pizauth.conf file contains Miele's authorisation and token URIs and the client ID and client secret [2] I got by registering myself with Miele:

account "miele" {
  auth_uri = "https://api.mcs3.miele.com/thirdparty/login/";
  token_uri = "https://api.mcs3.miele.com/thirdparty/token/";
  client_id = "8nka83ka-38ak-38a9-38ah-ah38uaha82hi";
  client_secret = "HOaniszumazr978toh789th789aAGa83";
}
I then ran pizauth server which causes pizauth to listen for requests for access tokens. The first time that pizauth is asked to display an access token it will report an error which contains the URL needed to authenticate yourself with Miele:
$ pizauth show miele
ERROR - Access token unavailable until authorised with URL https://api.mcs3.miele.com/thirdparty/login/?access_type=offline&code_challenge=8akyhq28nahikgkanioqa8yHAatho2troanihtHIHI8&code_challenge_method=S256&client_id=8nka83ka-38ak-38a9-38ah-ah38uaha82hi&redirect_uri=http%3A%2F%2Flocalhost%3A8599%2F&response_type=code&state=auhykFAaaBB
Plugging that URL into a web browser, using the same email address and password as I used in Miele's App, and selecting a country (in my case "Great Britain") completes authentication, and pizauth now works as expected.
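As a quick check that everything has worked, asking pizauth for an access token again should now print one out. Like the client ID and client secret above, the token below is made up, but it gives a flavour of the output:
$ pizauth show miele
DAogiOat3hoahtu2Troanih38aQZNdY8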

Getting the device ID

The Miele API is a RESTful API which we can query using curl, though we have to send an OAuth2 access token each time we want it to do something for us. Let's start by asking the API to list all the devices I've registered with Miele:
$ curl \
  --silent \
  --header "Authorization: Bearer $(pizauth show miele)" \
  https://api.mcs3.miele.com/v1/devices
The only surprising part of this is the --header part: pizauth show miele displays an access token on stdout which is incorporated into the HTTP request. A couple of seconds later, curl prints a vast wodge of JSON to stdout:
{"000173036828":{"ident":{"type":{"key_localized":"Device type","value_raw":1,"value_localized":"Washing machine"},"deviceName":"","protocolVersion":4,"deviceIdentLabel":{"fabNumber":"000173036828","fabIndex":"44","techType":"WEG365","matNumber":"11385920","swids":["3829","20384","28593","23848","29382","28390","23849","23820"]},"xkmIdentLabel":{"techType":"EK057","releaseVersion":"08.20"}},"state":{"ProgramID":{"value_raw":1,"value_localized":"Cottons","key_localized":"Program name"},"status":{"value_raw":5,"value_localized":"In use","key_localized":"status"},"programType":{"value_raw":1,"value_localized":"Own programme","key_localized":"Program type"},"programPhase":{"value_raw":260,"value_localized":"Main wash","key_localized":"Program phase"},"remainingTime":[1,25],"startTime":[0,0],"targetTemperature":[{"value_raw":6000,"value_localized":60.0,"unit":"Celsius"},{"value_raw":-32768,"value_localized":null,"unit":"Celsius"},{"value_raw":-32768,"value_localized":null,"unit":"Celsius"}],"coreTargetTemperature":[{"value_raw":-32768,"value_localized":null,"unit":"Celsius"}],"temperature":[{"value_raw":-32768,"value_localized":null,"unit":"Celsius"},{"value_raw":-32768,"value_localized":null,"unit":"Celsius"},{"value_raw":-32768,"value_localized":null,"unit":"Celsius"}],"coreTemperature":[{"value_raw":-32768,"value_localized":null,"unit":"Celsius"}],"signalInfo":false,"signalFailure":false,"signalDoor":false,"remoteEnable":{"fullRemoteControl":true,"smartGrid":false,"mobileStart":false},"ambientLight":null,"light":null,"elapsedTime":[1,9],"spinningSpeed":{"unit":"rpm","value_raw":1400,"value_localized":"1400","key_localized":"Spin speed"},"dryingStep":{"value_raw":null,"value_localized":"","key_localized":"Drying level"},"ventilationStep":{"value_raw":null,"value_localized":"","key_localized":"Fan level"},"plateStep":[],"ecoFeedback":{"currentWaterConsumption":{"unit":"l","value":6.0},"currentEnergyConsumption":{"unit":"kWh","value":0.7},"waterForecast":0.5,"energyForecast":0.5},"batteryLevel":null}}}
I don't know about you, but I find large wodges of JSON rather hard to read, which is irritating because in amongst that JSON wodge is the device ID of my washing machine. That ID is required to use later parts of the API so I need to fish it out somehow. Fortunately, jq allows us to easily extract the field in question:
$ curl \
  --silent \
  --header "Authorization: Bearer $(pizauth show miele)" \
  https://api.mcs3.miele.com/v1/devices \
  | jq -r '.[] | .ident.deviceIdentLabel.fabNumber'
000173036828
That jq command actually prints out every device ID I've registered with Miele. Since I only have a single Miele device it's fairly obvious that the single ID displayed is for my washing machine, but if you have multiple IDs, how can you tell them apart? The curl command stays the same as before, but now I use jq to fish out the model number and even a human readable name:
$ curl ... \
  | jq -r \
    '.[]
     | .ident
     | (.deviceIdentLabel
     | (.fabNumber + ": " + .techType))
       + " " + .type.value_localized'
000173036828: WEG365 Washing machine
It's now clear that "000173036828" is the device ID I want going forward.
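If you have several devices registered, you can also narrow things down programmatically. For example, this jq sketch (which assumes the same JSON structure as the output above) prints the device IDs of washing machines only:
$ curl ... \
  | jq -r \
    '.[]
     | select(.ident.type.value_localized == "Washing machine")
     | .ident.deviceIdentLabel.fabNumber'
000173036828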

Getting the remaining time

Now that I have its device ID, I can use a different part of the API to see my washing machine's current state:
$ curl \
  --silent \
  --header "Authorization: Bearer $(pizauth show miele)" \
  https://api.mcs3.miele.com/v1/devices/000173036828/state
This gives me another huge wodge of JSON of which only the remainingTime field is interesting:
$ curl ... \
  | jq -r '.remainingTime'
[
  0,
  56
]
That means I've got 0 hours and 56 minutes left — but that formatting is rather horrible. My first attempt might be:
$ curl ... \
  | jq -r '.remainingTime[0] + .remainingTime[1]'
54
but that's added the two integers in the array together, so that's not what I want. Instead, I want to convert each integer to a string and then concatenate them:
$ curl ... \
  | jq -r \
    '(.remainingTime[0] | tostring)
     + ":"
     + (.remainingTime[1] | tostring)'
0:54
What happens when the remaining time goes below 10?
$ curl ... \
  | jq -r \
    '(.remainingTime[0] | tostring)
     + ":"
     + (.remainingTime[1] | tostring)'
0:9
That's rather confusing! We need to make sure the minutes are always two digits. There is no builtin way of padding numbers in jq, but we can multiply strings by integers, so with a bit of thought we can write:
$ curl ... \
  | jq -r \
    '(.remainingTime[0] | tostring)
     + ":"
     + (.remainingTime[1] | tostring
        | (length | if . >= 2 then "" else "0" * (2 - .) end))
     + (.remainingTime[1] | tostring)'
0:09
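An alternative that some may find easier to read is to test the number itself rather than the length of its string representation. This is just a sketch; it should behave the same as the version above:
$ curl ... \
  | jq -r \
    '(.remainingTime[0] | tostring)
     + ":"
     + (if .remainingTime[1] < 10 then "0" else "" end)
     + (.remainingTime[1] | tostring)'
0:09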
The API also tells us what phase the washing machine is currently in, which I find interesting, so let's print that out:
$ curl ... \
  | jq -r '.programPhase.value_localized'
Rinsing
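As an aside, the remaining time and the phase come from the same state endpoint, so if you want to avoid querying the API twice you can pull both out in a single jq invocation. Here's a sketch, omitting the zero-padding from above for brevity:
$ curl ... \
  | jq -r \
    '(.remainingTime[0] | tostring)
     + ":"
     + (.remainingTime[1] | tostring)
     + " (" + .programPhase.value_localized + ")"'
0:54 (Rinsing)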

A Countdown

At this point, I can print out the remaining time for the current load once: what I really want to do is update this so that I have a meaningful countdown. Before we try and go further, let's bundle what we've got into a simple script so that we don't have to continually enter commands into a terminal:
#! /bin/sh

set -e

DEVICE_ID=000173036828

get() {
  curl \
    --silent \
    --header "Authorization: Bearer $(pizauth show miele)" \
    https://api.mcs3.miele.com/v1/devices/${DEVICE_ID}/state \
  | jq -r "$1"
}

case $1 in
  phase) get .programPhase.value_localized ;;
  remaining_time)
    get \
      '(.remainingTime[0] | tostring)
       + ":"
       + (.remainingTime[1] | tostring
          | (length
	     | if . >= 2 then "" else "0" * (2 - .) end))
       + (.remainingTime[1] | tostring)'
    ;;
esac
Let's call that script washing_machine:
$ washing_machine remaining_time
0:42
$ washing_machine phase
Rinsing
What we now want to do is create a loop which prints out the remaining time. A first cut, which updates the remaining time every 30 seconds, is as follows:
while [ true ]; do
  printf "\r%s (%s)" \
    $(washing_machine remaining_time) \
    $(washing_machine phase)
  sleep 30
done
That works surprisingly well, but has two flaws. First, when the load is finished, the loop doesn't terminate. Second, \r is "carriage return" which means that the countdown keeps displaying text on the same line. This works well provided any new text is the same length as, or longer than, what came before. However, if the new text is shorter, we end up with odd output such as:
0:32 (Rinsing)h)
We can fix the first problem by noticing that the load is complete when remaining_time returns "0:00":
while [ true ]; do
  t=$(washing_machine remaining_time)
  if [[ $t == "0:00" ]]; then
    break
  fi
  printf "\r%s (%s)" \
    "$t" \
    "$(washing_machine phase)"
  sleep 30
done
We can fix the second problem in various ways, but the most portable I know of uses tput el after the carriage return. This causes all text between the cursor (which \r has moved to the start of the line) and the end of the line to be cleared:
while [ true ]; do
  t=$(washing_machine remaining_time)
  if [[ $t == "0:00" ]]; then
    break
  fi
  printf "\r%s%s (%s)" \
    $(tput el) \
    "$t" \
    "$(washing_machine phase)"
  sleep 30
done
At this point, I'm into nitpicking mode: it irritates me that my countdown has a blinking cursor next to it. I can turn that off and on with tput civis and tput cnorm respectively. However, I need to make sure that if my program is terminated early (e.g. because I press Ctrl-C), cursor blinking is restored. Fortunately I can use the trap command to restore blinking if this happens, so I can adjust my program as follows:
tput civis
trap "tput cnorm" 1 2 3 13 15
while [ true ]; do
  ...
done
tput cnorm
Last, but not least, when the load has finished I use notify-send to pop a message up in my desktop telling me the load is complete:
notify-send -t 60000 "Washing finished"
Now that I've got that sorted out, I can put it into my script (keeping the now-recursive calls as-is), which looks as follows:
#! /bin/sh

set -e

DEVICE_ID=000173036828

get() {
  curl \
    --silent \
    --header "Authorization: Bearer $(pizauth show miele)" \
    https://api.mcs3.miele.com/v1/devices/${DEVICE_ID}/state \
  | jq -r "$1"
}

case $1 in
  countdown)
    tput civis
    trap "tput cnorm" 1 2 3 13 15
    while [ true ]; do
      t=$(washing_machine remaining_time)
      if [[ $t == "0:00" ]]; then
        break
      fi
      printf "\r%s%s (%s)" \
        $(tput el) \
        "$t" \
        "$(washing_machine phase)"
      sleep 30
    done
    tput cnorm
    echo
    notify-send -t 60000 "Washing finished"
    ;;
  phase) get .programPhase.value_localized ;;
  remaining_time)
    get \
      '(.remainingTime[0] | tostring)
       + ":"
       + (.remainingTime[1] | tostring
          | (length
	     | if . >= 2 then "" else "0" * (2 - .) end))
       + (.remainingTime[1] | tostring)'
    ;;
esac
And now each time I use my washing machine I need only execute the following at the command line:
$ washing_machine countdown

Summary

Open standards are great, especially when they can be used with simple tools. Nearly everyone has curl installed on their machine already and most people can easily install jq via their favourite package manager. At the time of writing, I think OpenBSD is the only OS which makes pizauth easily installable, but perhaps that will change as time goes on. Still, hopefully the underlying message of this post is clear: we can often do a lot with simple tools!

Update (2023-04-17): A lot of people (including some commenters here) don't understand why I don't simply use a stopwatch instead of the script above. There are a few reasons (e.g. a stopwatch is actually more effort for me to run than my script) but one is particularly important: modern washing machines adjust their time as they run (e.g. based on load weight, or problems when spinning), so the time you see when you start a program can be very different from the actual running time. The script above gets the machine's current time estimate, which naturally takes into account any such adjustments.


Footnotes

[1] The latter doesn't make much sense in this context, and it's not required by the OAuth2 standard, but Miele seem to require it.
[2] No, these are not my real client ID and client secret, but they do follow the same format as the real thing. I'll be doing something similar for other IDs in this post too.

pizauth: dump and restore

April 3 2023

It's been a little while since I've written about pizauth. I had expected to make a 1.0.0 release by now: that release will be where I commit, to the extent reasonably possible, to maintaining a stable interface in the future. However, there was one feature I couldn't find a good design for: persisting state to disk. This missing feature had been nagging away at me (and a small, though growing, number of other people) for quite a while: recently I finally alighted upon a design that I thought would work well. pizauth-0.2.2 is the first release to support this. Perhaps this will be the last release before 1.0.0!

A little backstory might be interesting. pizauth allows you to authenticate yourself to a remote OAuth2 server (generally using a username, password, and two-factor authentication) and then receive back access tokens that you can use to access services without further authentication. For example, if you're using an email client on your computer, you may well be using OAuth2 to read and send email, even without realising it. OAuth2 has other uses too: for example, I can (and do!) use OAuth2 to see how long my washing machine has left to complete!

Access tokens typically have a lifetime of about an hour, at which point they become useless. Since you probably don't want to be forced to fully reauthenticate every hour, OAuth2 defines a second concept called refresh tokens which allow you to obtain new access tokens. pizauth waits until just before an access token is about to expire before using an account's refresh token to obtain a new access token, giving the illusion of continually valid access tokens. Refresh tokens don't have a fixed lifetime, and typically remain valid at least for many days and potentially indefinitely.

While we obviously don't ever want someone to get hold of any secrets we possess, the short lifetime of an access token gives successful attackers a limited window of opportunity in which to impersonate you. In contrast, if an attacker gets hold of a refresh token, they may be able to impersonate you almost indefinitely. I think of refresh tokens as being "superuser" tokens, since if someone gets hold of the superuser password of your server, there is almost no limit to the bad things they can do.

Because of the sensitivity of refresh tokens, I made a decision early on that pizauth would keep all state about tokens (access and refresh) in memory and not save them on disk. That decision is the fundamental reason why pizauth is a daemon, and not a "reactive" system that loads tokens each time it is run. Although storing secrets in memory isn't totally fool-proof [1], storing secrets on disk is fraught with risk.

Mostly, having pizauth work solely in memory works well. However, there is one common annoyance: whenever you reboot a machine and reload pizauth, you have to reauthenticate all your accounts. If you only have one account and reboot every few weeks, this isn't much of a worry. If you have to reboot near-daily and have multiple accounts, it becomes a chore.

The obvious answer is for pizauth to persist access and refresh tokens to disk so that, after a reboot, it can reload them. This sounds simple, and most of the other OAuth2 Unix tools I know of support it — so why might I say that I found it hard to find a satisfying design for this feature in pizauth?

Let's start by stating the obvious. We really – really, really! – want to encrypt refresh tokens in a way that other programs can't decrypt without interaction from the user. pizauth could define its own encryption strategy, but I think most users would prefer to use an external encryption tool that they already trust. Although gpg is still probably the best known encryption tool on Unix, for the rest of this post I'm going to assume the encryption tool we're using is Age, because I find it much easier to use.

I use a password with Age [2], so if I want to use Age with pizauth, I somehow need to type that password into Age in order to decrypt my secrets. Personally I type that password in at the terminal, so it wouldn't be that difficult to have pizauth call Age and block until I've typed my password into Age. However, many people quite reasonably use password or encryption tools which don't expect to be called by a program like pizauth: some expect to call a program like pizauth, and some expect the user to manually copy and paste output over to their program of choice. Neither use case would be well served by pizauth calling an external tool.

One of the challenging things about design issues in general is that existing solutions tend to narrow our thinking. This was exactly what happened to me: I found it hard to think beyond "having pizauth call encryption tools is a bad idea but current OAuth2 tools do just that". And so, despite knowing that some people wanted this feature, I couldn't bring myself to implement it in a way I thought I'd come to regret.

Eventually the way out of my bind became clear to me: rather than having pizauth call an encryption tool, I needed to allow encryption tools to call pizauth. Doing so would allow users to do whatever weird and whacky thing they might want to do to get refresh tokens into pizauth.

Once I'd made this (in retrospect rather simple) observation, a reasonable design soon occurred to me. First I needed a way for pizauth to persist "state" (the name I've given to a combination of access and refresh tokens, as well as parts of pizauth's configuration). For that, I added a command-line call pizauth dump which causes pizauth to output its state to stdout. Second I needed a way for pizauth to load that state back in. For that, I added pizauth restore which reads a previous dump from stdin. A user can then do things like:

$ pizauth show act
54Z7JXDAAQZNdYB8qIP4NJhlA3SVdbCd
$ pizauth dump | age -e -o pizauth.age -R age_public_key
$ pizauth shutdown
$ pizauth server
$ pizauth show act
ERROR - Access token unavailable until authorised with URL ...
$ age -d -i age_private_key -o - pizauth.age | pizauth restore
Enter passphrase for identity file "age_private_key":
$ pizauth show act
54Z7JXDAAQZNdYB8qIP4NJhlA3SVdbCd
Running through each command in order:
  1. pizauth show shows that pizauth has an access token for account act;
  2. pizauth dump writes pizauth's state (including that access token and, hopefully, a refresh token) to stdout which is immediately encrypted by Age (age -e) and written to disk;
  3. pizauth shutdown shuts pizauth down;
  4. pizauth server restarts pizauth,
  5. but since no authentication has yet occurred pizauth show act cannot show an access token;
  6. the previously dumped state can be decrypted with Age (age -d) and piped back into pizauth with pizauth restore;
  7. an access token is now available, even though I haven't reauthenticated.
The advantage of this scheme, as I hope this example makes clear, is that the user is in control of encryption and (in particular) decryption: I think (hope!) this scheme is flexible enough for any conceivable decryption tool the user might want to use.

Astute readers will probably be confused at this point: allowing users to manually dump and restore pizauth's state is all well and good, but how should one know when to dump state? Both access and refresh tokens can change over time, so if you want an accurate restore, you need an up to date dump.

One possibility is simply to regularly poll pizauth (perhaps every minute). A nicer way is to make use of the fact that, in general, encryption doesn't require user interaction. pizauth has thus grown a new global option token_event_cmd which allows an arbitrary command to be run every time an access or refresh token changes in some way (e.g. a new token, a refreshed token, or an invalidated token) [3]:

token_event_cmd =
  "pizauth dump | age -e -o pizauth.age -R age_public_key";
In this example, every time a token changes, output from pizauth dump is immediately encrypted and written to disk. In a startup shell script that I run when I log in, I then have:
while true; do
  age -d -i age_private_key -o - pizauth.age \
    | pizauth restore \
    && break
done
The while loop deals with the fact that I sometimes enter my password incorrectly: this will keep calling age until I get my password correct!

pizauth has one final trick up its sleeve — or, rather, a check up its sleeve. When you reload pizauth's configuration, any changes in an account's "secure" details (i.e. those settings that, if changed, could cause you to leak a secret to an attacker) invalidate that account's tokens. dump and restore could allow you to bypass those checks if you dump state, shut pizauth down, change the configuration, and restore state. pizauth thus stores extra data about the secure aspects of your configuration in a dump: when restore is called, any changes between the dump's view of an account's configuration and the actual configuration cause tokens to be invalidated. This should stop you accidentally shooting yourself in the foot!

I hope those users who need state to be persisted find the new dump and restore functionality useful and usable — testing and comments are much appreciated! If they turn out to work well – and there are certainly more use cases for them than I've talked about in this post – then perhaps that 1.0.0 release is not too far away!


Footnotes

[1] For example, the number of side channels that people have uncovered in hardware in recent years is mind-boggling.
[2] Perhaps surprisingly, Age didn't support passphrases for a large part of its early history. If nothing else, this would have made me nervous of accidentally copying the private key to an untrusted host. Passphrases aren't a guarantee, but they make it much harder to recover the private key.
[3] token_event_cmd can discriminate based on the account name and token event kind but only users with very particular needs will ever need to do so.

How Big Should a Programming Language Be?

March 23 2023

Reading the thought-provoking "Patterns & Abstractions" post reminded me of a long-held opinion I have about programming language design: we have a tendency to keep adding features to a language until it becomes so big [1] that its sheer size makes it difficult to use reliably. Since most of us spend most of our time programming in one language, it can be difficult to see a common trend amongst languages in general.

Language size over time

Let me start with a concrete example. I started using Python around version 1.3, though the first version I really remember was 1.4 [2]. Python 1.5 was a real improvement, adding the assert statement, nested packages ("hierarchical modules" in the release notes), and the re regular expression module — things that I suspect nearly every modern Python programmer finds useful. At that point, Python was still a relatively small language, to the extent that you could reasonably expect to store nearly every detail about it (and its implementation!) in your head.

Python 1.6 added support for unicode — support that was clearly useful, but which turned out to be sufficiently awkward that (somewhat) fixing that awkwardness turned out to be a major cause underlying the schism between Python 2 and 3. As Python 1.6 turned to 2.0 and beyond, this formerly small language kept adding features, many of which I find useful (e.g. list comprehensions), but each of which has inevitably made the language bigger.

In retrospect, Python 1.5 was probably the last version of Python that someone with my limited brainpower could reasonably expect to fully understand. By the time of Python 2.7, Python was definitely a big language, and it has only grown in size since then. I doubt there are many people who could claim they understood every detail of Python 2.7, and there is probably no one who claims that of recent versions of Python 3.

The criteria for language features

How did Python become a big language? The initial version of Python was small, and the first several versions deliberately tried to maintain that smallness: most feature suggestions were explicitly rejected on the grounds that they would make the language too big. At some point (perhaps between Python 1.6 and Python 2.0?) it seemed to me that something changed: features were not rejected solely on the grounds of making the language too big, but on the grounds that they didn't fix an important enough problem.

The difference between these two criteria might seem minor, but "reject a design if it doesn't solve an important problem" implicitly downgrades the size of the language as a concern in and of itself. As soon as a language's designers stop worrying about the size of the language, it seems inevitable to me that the language will grow almost without bound. Certainly, Python looks set to keep on adding new features.

Although I've used Python as an example, I could have chosen many other languages. For example, although Java was never what I'd call a small language, it had many years of fairly minor language changes before adding generics to Java 1.5 — that seemed to be the point at which Java began its journey to the large language it now is. Haskell started off as a fairly small ML-inspired language, but is now a very large language (albeit one can opt-in or out from many features in the standard compiler). JavaScript has gone from a long-weekend design to a rather large language. I know a lot less about languages like C# and the like, but I wouldn't be surprised if they've gone, or are going, through something similar.

These days I do most of my programming in Rust, a language which is already big, but which is perhaps now being forced to think about whether it wants to become very big or not. For example, if preventing Rust getting too big is a goal, it's hard to imagine the recently proposed keyword generics initiative being accepted, even though it clearly solves a problem. On the other hand, Rust has already added dedicated support for async/await, so is it a good idea to cleave the library system in two? Whatever the outcome is, some people will be unhappy — these decisions aren't easy!

Why does this happen? Well, no programming language is perfect: there are always use cases it doesn't support well. In many cases, growing the language in size helps it support those use cases better. Since there is an unbounded number of potential use cases, there is thus always "more" language design that we can do, each time making the language a little bigger. As Patterns & Abstractions points out, one continual temptation is to move "patterns" (i.e. standard idioms of use) to "abstractions" (i.e. language features). There are advantages to such moves: languages can, for example, give better error messages and better optimise "abstractions". But there is also a cost: patterns do not affect a language's size, but abstractions do. Only some users of a language will encounter a given pattern, but nearly every user is at some point confronted by nearly every abstraction. This quote from Bjarne Stroustrup in a 1992 document on suggested C++ extensions is relevant to many languages:

Please also understand that there are dozens of reasonable extensions and changes being proposed. If every extension that is reasonably well-defined, clean and general, and would make life easier for a couple of hundred or couple of thousand C++ programmers were accepted, the language would more than double in size. We do not think this would be an advantage to the C++ community.

Must all languages grow in size?

There are some interesting exceptions to the trend of languages growing over time.

Lisp-based languages, particularly those in the Scheme tradition, tend to maintain fairly small language cores. However, because of the extensive use of macros, it's sometimes difficult to know how to classify "language size" in such languages. I'm far too out of date with, say, modern Racket to have an informed opinion on its size.

Lua was, and is, a small language — and, probably not coincidentally, so is its implementation. Interestingly, as far as I have been able to tell, one of the ways Lua's designers have kept the language evolving is to regularly force (generally small) breaking changes on its user base. It's an interesting trade-off, and not one that has been taken by any other language that I can think of.

Perhaps the best known (partial) exception is C. Modern C is a surprisingly different language to the original C, but that evolution happens rather slowly these days, with new versions of the C standard coming out around every 8-12 years. Each new version tends to add few new language features, or even standard functions, and tends to put at least as much effort into clarifying parts of the language's semantics that have either always been unclear or which have been rendered unclear by new hardware (see Why Aren't Programming Language Specifications Comprehensive? for some examples).

My impression of C, as someone from the outside occasionally looking in, is that those responsible for the C language standard probably consider themselves to be more "maintainers" than "designers": they want C to be a useful language in the modern world, but they also clearly want to balance that with not letting C grow too much bigger.

One advantage C has is, ironically, C++: anyone who wants "C but with lots of new features" can be told to go and use C++ instead of growing C in size. Although it's difficult to distinguish cause from effect, perhaps this has dissuaded those people who like adding new features to languages from considering proposing them for C, when C++ is likely to be far more receptive to their suggestions. Certainly, C++ is a huge language which keeps on growing: Bjarne Stroustrup's recent-ish plea to "Remember the Vasa" strikes a cautionary note that few would make about C.

Summary

Language designers face challenging trade-offs. Their languages can always be made better for more users by adding new features, but doing so makes the language bigger. At some point, languages become big enough that few users can understand all of the language — but nearly every user of a language will at least occasionally encounter nearly every feature. When users encounter features they don't understand, their reactions can vary from surprise to worry — if they notice at all! The larger a language is, the easier it is for users to misuse it without even knowing it.

There is no easy way for us to evaluate whether a new feature is worth increasing a language's size for. It's also easier for those people who will benefit from a feature to advocate for it than for those people whose lives might (unwittingly) be made worse by it to advocate against it. Even talking about "size" in anything other than a vague way [3] is hard: quantitative metrics like the size of a language's grammar or compiler do not always correlate with the inherently qualitative measure of "cognitive load" that two new features may impose on humans.

Perhaps in an ideal world, language designers would at some point declare their languages "complete", capping the language's size. However, hardware and other external factors change in ways that sometimes force even the most conservative language to adapt. Even a language like C, which evolves slowly, still evolves. And for as long as a language must evolve, it must have designers. And for as long as a language has designers, there will always be the temptation to add shiny new features.

However, I think language designers can learn something useful from C's example. While we can't declare our languages "complete", we can at some point declare them to be in "minimal evolution" mode. This is decidedly less sexy, and offers fewer job opportunities to language designers, than continually adding new features — but it also allows users to focus on the tasks they want to program, rather than having to deal with ever more complexity. Slowly, over time, new languages will come along which take the best ideas, simplify and unify them, and can keep the field moving forwards. In general, in my opinion, that's shown itself to be a more successful tactic than hoping that ever-bigger languages can satisfy all users.


Footnotes

[1] Astute readers will notice that this post talks about language "size" rather than "complexity". In general, these two criteria strongly correlate, but of the two "size" is marginally more objective. But, if you prefer to use "complexity" instead of "size", you can do so without changing this post's fundamental meaning.
[2] I don't have a precise chronology to hand, and at the time I was using an unusual platform – Acorn's RISC OS – so there was probably a delay between the "main" release and it being available on RISC OS, muddling my memory even further.
[3] There's a reason I haven't defined "size" precisely in this post — whatever definition I use won't work for everyone! I've hoped that your intuition of "size" is sufficient to understand the overall points I'm making.