The Bootstrapped Compiler and the Damage Done
December 4 2013
These days, it's not uncommon to hear that a language's compiler
written in itself. This phrase often confuses non-language implementers:
how can a language's compiler be written in itself? The answer is
surprisingly simple: what's being referred to is the second compiler created
for the language. In such cases, the first compiler created is a
bootstrapping compiler with just enough functionality to allow a
compiler to be written in the language being created. In other words, when
creating language NL, a basic compiler is first written in existing
language OL, allowing a compiler for NL to be written in
NL itself . When the new compiler is
completed, the language is said to be bootstrapped, and the bootstrapping
compiler is thrown away, leaving only the bootstrapped compiler
written in NL. Not every language bootstraps its compiler, but it is
increasingly common to do so, especially for those languages which target VMs
. Indeed, it's become sufficiently common that I've
sometimes seen new languages which haven't been bootstrapped looked on with
slight suspicion (the argument seems to boil down to either or both of
clearly don't think much of your language if you're not prepared to use it to
write its own compiler or
your language is so underpowered, it can't
even be used to write its own compiler).
There are obvious advantages to creating a bootstrapped compiler. Most
notably, it means that the language being designed and implemented is also
used early in the implementation process. What might otherwise be abstract
musings about the use of the language are made concrete by the need to have
the language be good enough to write the bootstrapped compiler in. As the
compiler is written, missing or undesirable features in the language are
discovered and the language's design extended and changed as appropriate.
Bootstrapping thus tends to see the design and implementation of a language
progressing in unison.
Traditionally, bootstrapping is seen as an almost universal good (apart
from occasional griping over performance issues, which can be ignored). I
have bootstrapped languages myself, and I will do so again (attacks by random
buses allowing). However, I have recently become aware of a real problem with
bootstrapping: it can, and does, distort language design. In other words, by
having a bootstrapped compiler as the first – and, often, for some time
the biggest – program in a language, the eventual language design is
often unintentionally altered and rarely for the better. I suspect this has
two root causes.
First, we understand how to design compilers better than any other type of
program. There are many books devoted solely to the topic of writing
compilers. Many clever researchers have put huge amounts of effort into
understanding and optimising almost every aspect of compiler design. We have
many example compilers and a huge folklore to draw on. This means that
programming a compiler is unlike programming any other system I know. Normal
programming – especially when starting – involves a huge degree
of uncertainty as to how the program should be constructed; even what the
program is expected to do is likely to be shrouded in fog. Not so with
compilers: we know exactly what they should do, how their major components
should be designed, and how those components should be integrated together.
Programming a compiler therefore involves an unusually high degree of
certainty right from the start. Compiler writers are able to focus on the
essence of the task of the hand, without having to worry that choices taken
will later necessitate a fundamental change in the compiler's design.
Second, compilers are an atypical class of program. In essence, a compiler
is a simple batch pipeline process. A program is read in and translated to a
tree; a series of tree transformations are applied; and eventually one of
those trees is saved out as some sort of binary data (e.g. machine code or
bytecode). Most of the intermediate tree transformations calculate a
relatively simple bit of information about the program and create a slightly
modified tree based on it. A few calculations crop up time and time again,
such as: maps from variables to scopes or types; and stacks to determine
closures. Significantly, and unlike most programs in the real world, there is
no interaction with users: the compiler knows all it needs about the outside
world from the moment it is called.
Since we know an unusual amount about how to write compilers, and because
compilers are such a specific style of program, compilers tend to use
programming languages in an unrepresentative fashion. A certain style of
programming is common in compilers: pattern
matching against trees and creating slightly altered trees from them. Most of
the trees can be easily statically typed and are, or can trivially be made to
be, immutable, since – relative to the capabilities of modern computers
– the trees involved are rather small. If that is the program you are
most used to writing, it is perhaps inevitable that your language design will
make that use case particularly easy to realise, at the expense of more
normal programs. On top of that, the complexity of the language is
almost invisible to those bootstrapping the compiler, so there is a continual
temptation to add features in to the language or its libraries.
In a sense, my fear is easily summarised: we language designers are all
too often making languages which are much better for writing bootstrapped
compilers than they are for writing other programs. We're coming up with
languages with very odd features that are used relatively little other than
by the compiler writer, and with languages so feature heavy that only the
language designer can understand how everything fits together. Ultimately, we're
thinking about ourselves much more than we are normal programmers. I have
certainly been guilty of this myself and I fear I see it in several new
languages too (though, thankfully, not all).
There is no guaranteed solution to this problem. We can't force an iron
curtain to be drawn between language designer and compiler. But, as language
designers, we can at least be aware of the dangers of drawing overly strong
conclusions based solely on the experience of writing a compiler in our new
language. As language users, we should treat the claims of language designers
with scepticism until they, or other people, have written a
reasonable number of non-compiler programs in their language. This still
won't get us to nirvana, but it will stop us being distracted by shiny
language features that solve issues compiler writers have that few other
programmers share. If we don't, not only shall the meek inherit the earth,
but they may choose to do so with non-bootstrapped languages.
Acknowledgements: My thanks to Edd Barrett, Carl Friedrich Bolz,
and Lukas Diekmann for insightful comments on early drafts of this article.
All opinions, errors, omissions, or infelicities, are my own.
Converge's bootstrap compiler, for example, was written in Python.
For no particular reason that I can think of, languages which compile to machine code seem less inclined to bootstrap.
Follow me on Twitter