These days, it’s not uncommon to hear that a language’s compiler is written in itself. This phrase often confuses non-language implementers: how can a language’s compiler be written in itself? The answer is surprisingly simple: what’s being referred to is the second compiler created for the language. In such cases, the first compiler created is a bootstrapping compiler with just enough functionality to allow a compiler to be written in the language being created. In other words, when creating language NL, a basic compiler is first written in existing language OL, allowing a compiler for NL to be written in NL itself [1]. When the new compiler is completed, the language is said to be bootstrapped, and the bootstrapping compiler is thrown away, leaving only the bootstrapped compiler written in NL. Not every language bootstraps its compiler, but it is increasingly common to do so, especially for those languages which target VMs [2]. Indeed, it’s become sufficiently common that I’ve sometimes seen new languages which haven’t been bootstrapped looked on with slight suspicion (the argument seems to boil down to either or both of you clearly don’t think much of your language if you’re not prepared to use it to write its own compiler or your language is so underpowered, it can’t even be used to write its own compiler).
There are obvious advantages to creating a bootstrapped compiler. Most notably, it means that the language being designed and implemented is also used early in the implementation process. What might otherwise be abstract musings about the use of the language are made concrete by the need to have the language be good enough to write the bootstrapped compiler in. As the compiler is written, missing or undesirable features in the language are discovered and the language’s design extended and changed as appropriate. Bootstrapping thus tends to see the design and implementation of a language progressing in unison.
Traditionally, bootstrapping is seen as an almost universal good (apart from occasional griping over performance issues, which can be ignored). I have bootstrapped languages myself, and I will do so again (attacks by random buses allowing). However, I have recently become aware of a real problem with bootstrapping: it can, and does, distort language design. In other words, by having a bootstrapped compiler as the first – and, often, for some time the biggest – program in a language, the eventual language design is often unintentionally altered and rarely for the better. I suspect this has two root causes.
First, we understand how to design compilers better than any other type of program. There are many books devoted solely to the topic of writing compilers. Many clever researchers have put huge amounts of effort into understanding and optimising almost every aspect of compiler design. We have many example compilers and a huge folklore to draw on. This means that programming a compiler is unlike programming any other system I know. Normal programming – especially when starting – involves a huge degree of uncertainty as to how the program should be constructed; even what the program is expected to do is likely to be shrouded in fog. Not so with compilers: we know exactly what they should do, how their major components should be designed, and how those components should be integrated together. Programming a compiler therefore involves an unusually high degree of certainty right from the start. Compiler writers are able to focus on the essence of the task of the hand, without having to worry that choices taken will later necessitate a fundamental change in the compiler’s design.
Second, compilers are an atypical class of program. In essence, a compiler is a simple batch pipeline process. A program is read in and translated to a tree; a series of tree transformations are applied; and eventually one of those trees is saved out as some sort of binary data (e.g. machine code or bytecode). Most of the intermediate tree transformations calculate a relatively simple bit of information about the program and create a slightly modified tree based on it. A few calculations crop up time and time again, such as: maps from variables to scopes or types; and stacks to determine closures. Significantly, and unlike most programs in the real world, there is no interaction with users: the compiler knows all it needs about the outside world from the moment it is called.
Since we know an unusual amount about how to write compilers, and because compilers are such a specific style of program, compilers tend to use programming languages in an unrepresentative fashion. A certain style of programming is common in compilers: pattern matching against trees and creating slightly altered trees from them. Most of the trees can be easily statically typed and are, or can trivially be made to be, immutable, since – relative to the capabilities of modern computers – the trees involved are rather small. If that is the program you are most used to writing, it is perhaps inevitable that your language design will make that use case particularly easy to realise, at the expense of more normal programs. On top of that, the complexity of the language is almost invisible to those bootstrapping the compiler, so there is a continual temptation to add features in to the language or its libraries.
In a sense, my fear is easily summarised: we language designers are all too often making languages which are much better for writing bootstrapped compilers than they are for writing other programs. We’re coming up with languages with very odd features that are used relatively little other than by the compiler writer, and with languages so feature heavy that only the language designer can understand how everything fits together. Ultimately, we’re thinking about ourselves much more than we are normal programmers. I have certainly been guilty of this myself and I fear I see it in several new languages too (though, thankfully, not all).
There is no guaranteed solution to this problem. We can’t force an iron curtain to be drawn between language designer and compiler. But, as language designers, we can at least be aware of the dangers of drawing overly strong conclusions based solely on the experience of writing a compiler in our new language. As language users, we should treat the claims of language designers with scepticism until they, or other people, have written a reasonable number of non-compiler programs in their language. This still won’t get us to nirvana, but it will stop us being distracted by shiny language features that solve issues compiler writers have that few other programmers share. If we don’t, not only shall the meek inherit the earth, but they may choose to do so with non-bootstrapped languages.
Acknowledgements: My thanks to Edd Barrett, Carl Friedrich Bolz, and Lukas Diekmann for insightful comments on early drafts of this article. All opinions, errors, omissions, or infelicities, are my own.
Footnotes
Converge’s bootstrap compiler, for example, was written in Python.
Converge’s bootstrap compiler, for example, was written in Python.
For no particular reason that I can think of, languages which compile to machine code seem less inclined to bootstrap.
For no particular reason that I can think of, languages which compile to machine code seem less inclined to bootstrap.