Laurence Tratt's Blog

 

The Bootstrapped Compiler and the Damage Done

December 4 2013

These days, it's not uncommon to hear that a language's compiler is written in itself. This phrase often confuses non-language implementers: how can a language's compiler be written in itself? The answer is surprisingly simple: what's being referred to is the second compiler created for the language. In such cases, the first compiler created is a bootstrapping compiler with just enough functionality to allow a compiler to be written in the language being created. In other words, when creating language NL, a basic compiler is first written in existing language OL, allowing a compiler for NL to be written in NL itself [1]. When the new compiler is completed, the language is said to be bootstrapped, and the bootstrapping compiler is thrown away, leaving only the bootstrapped compiler written in NL. Not every language bootstraps its compiler, but it is increasingly common to do so, especially for those languages which target VMs [2]. Indeed, it's become sufficiently common that I've sometimes seen new languages which haven't been bootstrapped looked on with slight suspicion (the argument seems to boil down to either or both of "you clearly don't think much of your language if you're not prepared to use it to write its own compiler" or "your language is so underpowered, it can't even be used to write its own compiler").

There are obvious advantages to creating a bootstrapped compiler. Most notably, it means that the language being designed and implemented is also used early in the implementation process. What might otherwise be abstract musings about the use of the language are made concrete by the need to have the language be good enough to write the bootstrapped compiler in. As the compiler is written, missing or undesirable features in the language are discovered and the language's design extended and changed as appropriate. Bootstrapping thus tends to see the design and implementation of a language progressing in unison.

Traditionally, bootstrapping is seen as an almost universal good (apart from occasional griping over performance issues, which can be ignored). I have bootstrapped languages myself, and I will do so again (attacks by random buses allowing). However, I have recently become aware of a real problem with bootstrapping: it can, and does, distort language design. In other words, by having a bootstrapped compiler as the first – and, often, for some time the biggest – program in a language, the eventual language design is often unintentionally altered and rarely for the better. I suspect this has two root causes.

First, we understand how to design compilers better than any other type of program. There are many books devoted solely to the topic of writing compilers. Many clever researchers have put huge amounts of effort into understanding and optimising almost every aspect of compiler design. We have many example compilers and a huge folklore to draw on. This means that programming a compiler is unlike programming any other system I know. Normal programming – especially when starting – involves a huge degree of uncertainty as to how the program should be constructed; even what the program is expected to do is likely to be shrouded in fog. Not so with compilers: we know exactly what they should do, how their major components should be designed, and how those components should be integrated together. Programming a compiler therefore involves an unusually high degree of certainty right from the start. Compiler writers are able to focus on the essence of the task at hand, without having to worry that choices taken will later necessitate a fundamental change in the compiler's design.

Second, compilers are an atypical class of program. In essence, a compiler is a simple batch pipeline process. A program is read in and translated to a tree; a series of tree transformations are applied; and eventually one of those trees is saved out as some sort of binary data (e.g. machine code or bytecode). Most of the intermediate tree transformations calculate a relatively simple bit of information about the program and create a slightly modified tree based on it. A few calculations crop up time and time again, such as: maps from variables to scopes or types; and stacks to determine closures. Significantly, and unlike most programs in the real world, there is no interaction with users: the compiler knows all it needs about the outside world from the moment it is called.

Since we know an unusual amount about how to write compilers, and because compilers are such a specific style of program, compilers tend to use programming languages in an unrepresentative fashion. A certain style of programming is common in compilers: pattern matching against trees and creating slightly altered trees from them. Most of the trees can be easily statically typed and are, or can trivially be made to be, immutable, since – relative to the capabilities of modern computers – the trees involved are rather small. If that is the sort of program you are most used to writing, it is perhaps inevitable that your language design will make that use case particularly easy to realise, at the expense of more normal programs. On top of that, the complexity of the language is almost invisible to those bootstrapping the compiler, so there is a continual temptation to add features to the language or its libraries.
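To make that style concrete, here is a minimal sketch – in Python, with invented Num, Add, and Var node classes – of the kind of pass compilers are largely built from: match on the shape of a tree and return a slightly altered copy of it.

# Illustrative only: a tiny constant-folding pass over a made-up AST, written
# in the tree-in, tree-out style typical of compiler internals.
from collections import namedtuple

Num = namedtuple("Num", "value")    # e.g. 2
Add = namedtuple("Add", "lhs rhs")  # e.g. 2 + 3
Var = namedtuple("Var", "name")     # e.g. x

def fold(node):
    # Return a new tree in which constant additions have been evaluated.
    if isinstance(node, Add):
        lhs, rhs = fold(node.lhs), fold(node.rhs)
        if isinstance(lhs, Num) and isinstance(rhs, Num):
            return Num(lhs.value + rhs.value)
        return Add(lhs, rhs)
    return node  # Num and Var nodes are left untouched

# (x + (2 + 3)) folds to (x + 5); note that the input tree is never mutated.
print(fold(Add(Var("x"), Add(Num(2), Num(3)))))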

In a sense, my fear is easily summarised: we language designers are all too often making languages which are much better for writing bootstrapped compilers than they are for writing other programs. We're coming up with languages with very odd features that are used relatively little other than by the compiler writer, and with languages so feature heavy that only the language designer can understand how everything fits together. Ultimately, we're thinking about ourselves much more than we are normal programmers. I have certainly been guilty of this myself and I fear I see it in several new languages too (though, thankfully, not all).

There is no guaranteed solution to this problem. We can't force an iron curtain to be drawn between language designer and compiler. But, as language designers, we can at least be aware of the dangers of drawing overly strong conclusions based solely on the experience of writing a compiler in our new language. As language users, we should treat the claims of language designers with scepticism until they, or other people, have written a reasonable number of non-compiler programs in their language. This still won't get us to nirvana, but it will stop us being distracted by shiny language features that solve issues compiler writers have that few other programmers share. If we don't, not only shall the meek inherit the earth, but they may choose to do so with non-bootstrapped languages.

Acknowledgements: My thanks to Edd Barrett, Carl Friedrich Bolz, and Lukas Diekmann for insightful comments on early drafts of this article. All opinions, errors, omissions, or infelicities, are my own.

Footnotes

[1] Converge's bootstrap compiler, for example, was written in Python.

[2] For no particular reason that I can think of, languages which compile to machine code seem less inclined to bootstrap.



Relative and Absolute Levels

September 24 2013

Recently, I've seen a number of discussions along the lines of what makes a language declarative or imperative? (see, in chronological order, William Cook, Bob Harper, and Neel Krishnaswami). Trying to get a crisp definition of vague terms often involves endlessly going around in circles. Just in case that isn't bad enough, in this specific case, some believe that their favourite term connotes superiority, thus setting otherwise reasonable people against each other. Intuitively, the two terms are talking about the level of abstraction of a programming language: declarative means that something provides more what and less how; while imperative is more how and only implicitly what. We have a vague feeling that the terms should be precise opposites, but in practice we keep stumbling across examples which are hard to fit precisely into exactly one of the terms. William and Bob are, I believe, both correct in saying that, if the terms declarative or imperative once had a specific meaning, they have long been robbed of it.

I would suggest that what's really going on is that we're trying to use the terms to express absolutes when they can only meaningfully express the relative relation of one thing to another. Before tackling the original point directly, let me try and give an example that is less likely to offend those of us immersed in programming languages. Imagine there are two people competing against each other in a 4km cycle race. Jim takes 5 minutes and Tim takes 4 minutes 50. We might say Tim is quick but we would be incorrect. Tim is only quick relative to Jim. When Kim — who can do the race in 4 minutes 30 — comes into the room, we can clearly see that Tim is not quick compared to Kim. Thus, while relative comparisons between individuals are safe, absolute statements are not: we can't be sure if we will encounter a person who changes our notions of quickest or slowest. Put more carefully, the absolute statements are only correct for the set of individuals we've sampled. Relative statements (e.g. Tim is quicker than Jim) remain correct even when we encounter new individuals.

With that in mind, let's look at a few examples from programming languages:

  • Absolute: Haskell is a declarative language. Incorrect.
  • Relative: Haskell is a more declarative language than C. Correct.
  • Absolute: C is an imperative language. Incorrect.
  • Relative: C is a more imperative language than Haskell. Correct.

Depending on your background, you may find the two incorrects hard to swallow initially. Consider this little C fragment:

a = 1;
b = 2;
c = 3;
d = a + b + c;
print_int(d);

and an example compilation of it into ARM assembler [1]:
MOV R0, #1          @ a = 1
MOV R1, #2          @ b = 2
MOV R2, #3          @ c = 3
ADD R3, R0, R1      @ R3 = a + b
ADD R3, R3, R2      @ R3 = R3 + c
STMDB R13!, {R0}    @ save R0 on the stack before the call
MOV R0, R3          @ pass d as the first argument
BL print_int        @ call print_int
LDMIA R13!, {R0}    @ restore R0

Both fragments behave the same, in the sense that they describe the same output (the printing of the number 6). The C version however states much less detail about how that output should occur. The assembler version first has to break the addition down into 2 separate instructions. It then has to save the original value of R0 to prevent it being clobbered by the argument being passed to print_int. Relative to C, the assembler version feels much less declarative: one might even be moved to say that the assembler version is more imperative.

The fun begins when one realises that the assembly version is not the end of the road. We could, if we wished, go all the way down to electrons when considering a program executing. In the opposite direction, a DSL for expressing Context Free Grammars may feel much more declarative than the Haskell program that is generated from it. As that might suggest, we also need to bear in mind that there isn't a strict hierarchy here. There are multiple dimensions in which different languages can be considered declarative or imperative relative to each other.

This has an interesting implication. For any given two abstraction levels, we can find another in between, or one sideways (and so on). If we wanted to name each, we'd go mad. Thus, proposals such as Neel Krishnaswami's are interesting thought exercises, but we simply can't scale them up (unless you find the prospect of a massive, ever-expanding dictionary appealing).

In summary, trying to define declarative and imperative precisely as absolute terms doesn't work, if it ever did. Using them as relative terms can feel pedantic, but at least has the virtue of not being inaccurate. Although we really only need one of imperative or declarative, I expect both terms to comfortably outlive me, and for people to use them as vague absolutes in the same way that we use quick and slow as vague absolutes. It's also worth noting that the programming language community is not the only one bedevilled by this issue. The UML folk tie themselves in knots over Platform Independent and Platform Specific Models (PIMs and PSMs). Students of war have spent hundreds of years trying to nail down perfect definitions of strategy and tactics, without success. Humans like the seeming certainty of absolute statements, even in cases where only relative comparisons are safe.

Acknowledgements: My thanks to Carl Friedrich Bolz and Edd Barrett for insightful comments on early drafts of this article. All opinions, errors, omissions, or infelicities, are my own.

Footnotes

[1] It's been 15 years since I last wrote ARM assembler, so my apologies if I've got any minor details wrong.



General Purpose Programming Languages' Speed of Light

April 9 2013

Having recently returned from a week of talks on programming languages, I found myself wondering where general purpose programming languages might go in the future; soon I wondered where they could go. The plain truth about programming languages is that while there have been many small gains in the last 20 years, there have been no major advances. There have been no new paradigms, though some previously obscure paradigms have grown in popularity; I'm not even aware of major new language features (beyond some aspects of static type systems), though different languages combine features in slightly different ways.

The lack of major advances in a given time period may not seem surprising. After all, major advances are like earthquakes: they occur at unpredictable intervals, with little warning before their emergence. Perhaps we are on the cusp of a new discovery and none of us – except, possibly, its authors – are aware of that. That is possible, but I am no longer sure that it's likely. If it doesn't happen, then it seems clear to me that we are getting ever closer to reaching general purpose programming languages' speed of light—the point beyond which they cannot go.

Programming languages cannot grow in size forever

The basis of my argument is simple:

  1. Every concept in a programming language imposes a certain cognitive cost upon users.
  2. Our brains' ability to comprehend new concepts in a language is inversely proportional to the number of concepts already present.
  3. At a certain number of concepts, we simply can't understand new ones.

Note that I deliberately phrased the first point to reflect the fact that some concepts are more complex to learn than others; but that, at a certain point, the volume of concepts becomes at least as important as the cumulative complexity because of the inevitable interactions between features. In other words, if language L1 has 20 features, of which 6 are complex and 14 simple, the cognitive cost is similar to L2 which has 3 complex and 17 simple features.

This doesn't sound too bad, until one considers the following:

  1. The core of virtually every extant programming language is largely similar.
  2. That core contains many concepts.
  3. Many of those concepts have a high cognitive cost.

In other words, most programming languages are surprisingly similar at their core. Before we can differentiate L1 from L2, we must already have understood features ranging from variables to recursion, from built-in datatypes to evaluation order.

Programming languages' core

The overly earnest teenage boy that lives within all of us wants to believe that programming languages are massively different. At any given moment, someone with a nihilistic bent and an internet connection can find fans of different programming languages fighting pitched battles where language L1 is argued to be always better than L2. The dreary irony, of course, is that what L1 and L2 share in common is nearly always greater than that which differentiates them. How many people use programming languages without function calls? without recursion? without variables? without ...? Nobody does, really. Those languages that are truly unique, such as stack-based languages, remain locked in obscurity.

The reality is that, since the late 1950s, programming languages have steadily been converging towards a stable core of features. That core has proven itself sufficiently good to allow us to develop huge pieces of software. However, it leaves our brain surprisingly little capacity to understand new features. Try and do too much in one language, and problems accrue. We long ago developed the technical ability to create programming languages too complicated for any single person to fully understand. C++ is the standard (though not the only) example of this, and far be it from me not to pick on an easy target. As far as I can tell, not a single person on the planet understands every aspect of C++, not even Stroustrup. This is problematic because one always lives in fear of looking at a C++ program and finding uses of the language one doesn't understand. At best, individuals have to continually look up information to confirm what's going on, which limits productivity. At worst, people misinterpret programs due to gaps in their understanding; this introduces new bugs and allows old bugs to hide in plain view.

In an ideal world, we would be able to understand programming languages in a modular fashion. In other words, if I don't use a feature X, I don't need to know anything about X. Sometimes this works, and organisational coding standards are one way to try and take advantage of this. If this modularity property held fully, it would allow us to add as many modularised features to programming languages as we wanted. However, programming rarely involves creating a program in isolation from the rest of the world, particularly in the modern era. Instead, we pull in libraries from multiple external sources. Even if I don't want to use feature X, a library I use might; and, if it does, it can easily do so in a way which forces me to know about X. Take C++ again: I might not want to use templates in my C++ programs but most modern C++ libraries do. Even if a programming language is designed so that its features are modular, the social use of such a language tends to destroy its modularity. This explains why it's virtually impossible to be a competent programmer without a reasonable understanding of nearly every feature in a language.

Better education?

A reasonable question to ask is whether this ability to exceed the human brain's capacity is a temporary one, due to poor education. As time goes on, we tend to become better at condensing and passing on knowledge. I have heard it said, for example, that what might have baffled an early 20th century physics PhD student is now considered a basic part of an undergraduate education (since I struggled with physics as soon as it became semi-mathematical, I am not in a position to judge). Perhaps, given time, we can teach people the same number of concepts that they know now, but at a lower cognitive cost.

I do expect some movement along these lines, but I'm not convinced it will make a fundamental difference. It seems unlikely that programming students of the future will find it substantially easier to learn recursion or iteration, to pick two semi-random examples. Every language needs to choose at least one of these concepts. Of the two, most people find recursion harder to grasp, but nested iteration doesn't come naturally to many people either. Those of us who've been programming for many years consistently underestimate how difficult newcomers find it. Starting programming education from a young age might change this dynamic, but it's difficult to imagine that sufficient quantities of competent programmers will move to become teachers, at least in the short term.

Better language design

Let's assume, briefly, that you agree that C++ is too complex but are wondering whether its difficulties are due to bad design rather than size. Perhaps all I've really done so far is point out that badly designed languages overload our brains sooner than well designed ones? There is clearly some truth in this but, again, I'm not sure that it fundamentally changes the landscape.

Let's take two statically typed functional languages as an example of more considered language design: Scala and Haskell. Both languages aim to stop users shooting themselves in the foot through the use of expressive type systems. Maybe by constraining users, we can help our brains absorb more general language concepts? Alas, I am not convinced. Some of the things Haskell and Scala's type systems can express are astonishing; but I have also seen each type system baffle world-renowned experts. Type systems, like any other language feature, do not come for free. Even if we could design a perfect programming language, I doubt it could be much larger than such languages are today.

What the above means is that the space for a major advance in general purpose programming languages is limited. The analogy with the speed of light works well here too. As an object's speed approaches the speed of light, the energy required to accelerate it tends towards infinity, which is why a normal object can't ever travel at the speed of light. Worse, the energy required is non-linear, so the closer you get to the speed of light, constant increases in energy lead to ever-smaller increases in speed. Language design is rather like that. Beyond a certain point, every feature – no matter how well designed – has a disproportionate cost. It takes longer to understand, longer to learn how to use idiomatically, and longer to understand its interactions with other features.

Tooling

The tooling that surrounds programming languages has changed significantly in the last two decades. Most developers now use huge IDEs [1], which offer a variety of ways to understand and manipulate programs. Verification techniques and tools have made major advances [2]. And whole new styles of tools now exist. To take two examples: Valgrind allows one to analyse dangerous run-time behaviour in a way that can significantly tame the danger of even unsafe languages such as C; Moose allows one to gain a reasonable high-level understanding of large programs in a fraction of the time it takes to wade through source code by hand.

Clearly, such tooling will only increase in quantity and quality as we move forwards. But will it help us understand programming languages more effectively? Again, I expect it to help a bit, but not to make a fundamental difference. Perhaps future editors will have the ability to simplify code until we zoom in (rather like an extreme version of folding). Ultimately, however, I don't know how we can avoid understanding the detail of programming languages at some point in the development process.

Moving beyond general purpose

What are we to do? Give up on general purpose programming language design? No—for two reasons. First, because a degree of group-think in language design means that we haven't explored the language design space as well as we should have yet. For all we know, there could be useful discoveries to be made in unexplored areas [3]. Second, because the general purpose part of the name is subtly misleading. Traditionally, every programming language has been expected to be applicable to every problem. This then leads to the "my language is better than yours" notion, which is clearly nonsense. Anyone who tells you their favoured programming language – or even their favoured paradigm – is always the best is delusional. There is no such thing, and my experience is that programming language designers are almost universally modest and realistic about the areas to which their language is best suited.

In a single day, I can do Unix daemon programming in C, virtual machine development in RPython, system administration in the Unix shell, data manipulation in Python, and websites in a hodge podge of languages. Each has strengths and weaknesses which make it better suited to some situations than others. People who need certain styles of concurrent programming might favour Erlang for that, but more conventional languages for sequential processing. Many use cases are currently uncatered for: languages which target power efficient execution may become of greater interest in the future. All in all, I suspect we will see a greater level of customisation of nominally general purpose languages in the future.

I also suspect we will see greater use of more obviously restricted languages, which are often called domain specific languages [4]. Rather than try and make features that work well for all users – and that must still be understood by those for whom they don't work well – we are likely to have better luck by focussing on specific groups and making their life easier. For example, I do not wish to have my programming language cluttered with syntax for traditional mathematics (beyond the bare essentials of addition and the like), because I don't use it often enough. A programmer crunching data to produce statistics, on the other hand, would be much more productive with such a syntax. My current guess is that we will build such languages by composing smaller ones, but there are other possible routes.

In many ways, neither of these outcomes is quite as exciting as the notion of a perfect language. The contemporary emergence of a wide class of acceptable general purpose languages is an acknowledgement that we can't squeeze every feature we might like into a single language—our brains simply can't handle the result. Rather than try and move general purpose languages beyond the speed of light, we'll probably end up with many different points near it. At the same time, the possibility of restricted domain specific languages may enhance our productivity in narrower domains. This is, perhaps, not as immediately exciting a prospect as targeting all users, but it is its relative modesty that makes it more likely to pay off. Furthermore, unlike general purpose languages, that journey is nearer its beginning than its end. We might be reaching the speed of light for general purpose programming languages, but we've barely started with domain specific languages.

And, of course, there's always the remote possibility of an unforeseeable earthquake hitting programming languages. Unlike in the real world, we are not callous or inhuman for hoping that such an earthquake will hit us, even though many of us are not sure it will ever come.

Acknowledgements: My thanks to Martin Berger, Carl Friedrich Bolz, Lukas Diekmann, and Naveneetha Vasudevan for insightful comments on an early draft of this article. All opinions, errors, and infelicities are my own.

Footnotes

[1] I don't, because I'm a Unix Luddite at heart, but you may well think that's my loss.

[2] Even the simple static analyser in clang catches a host of nasty bugs that previously escaped the eyes of even the best developers. More advanced tools can do even better, though one should not overstate the power of such tools.

[3] As an example of the fun to be had exploring unusual language design, I immodestly offer my experiences on integrating an Icon-like expression system into a Python-like language. If nothing else, it taught me that it is possible to completely subvert expectations of how standard parts of a programming language work. The HOPL series of workshops, and the generally wonderful papers therein, are another excellent source of programming language design ideas.

[4] Note that I'm talking about languages with distinct syntax and semantics; the term is also currently used to describe particular idioms of library design and use in conventional languages, though I suspect that will fade in the future.



Another Non-Argument in Type Systems

October 25 2012

I find most static vs. dynamic debates to be painful (as I've said before) and, in general, I try to ignore them. However, every so often a new argument is raised which needs consideration. About 18 months ago, Bob Harper asserted that a dynamic language is a straitjacketed static language that affords less rather than more expressiveness. Part of the underlying argument is correct, but it avoids an important part of the story. I've recently seen several restatements of this assertion, so, unfortunately, I feel compelled to write a brief rebuttal explaining why I believe it's yet another non-argument in the ongoing debate about which class of languages is better.

My firm belief is that statically and dynamically typed languages overlap in some ways, but differ in others. Each has tasks they are better suited to, and neither is inherently better than the other. Bob shows in his article one area in which statically typed languages allow things that dynamically typed languages can't; in this article, I'm simply going to point out that this leads to a mirror-image where dynamically typed languages allow things that statically typed languages can't. In that sense, the argument I make here is a restatement of ideas in my dynamically typed languages paper.

Before we can go any further, we need a shared understanding of a concept from statically typed languages such as ML and Haskell (i.e. not Java or its ilk): sum types. Since this concept is unfamiliar to many programmers, I'm going to present a simple definition of the core features; sum types have a lot of extra features (tagging, pattern matching, etc.) that aren't germane to this article's argument. The basic idea is that sum types are a super type that ranges over other types. Whereas in OO languages a sub-class explicitly chooses its super-class(es), sum types (the super class equivalent) choose the types they range over (the subclass equivalent) — the latter have no idea which sum types they are ranged over by. I can therefore take the types T1 and T2 and make a sum type over them S. Whenever my program wants to do anything with a value of static type S, I must then use what I'll call a typecase statement to differentiate whether the dynamic type of the value is actually a T1 or a T2 (readers familiar with the concept will realise that this is intended to be a very simple form of pattern matching). Writing this out in pseudocode can help visualize the important relationships between these concepts:

type T1:
  ...

type T2:
  ...

sumtype S rangesover T1, T2

func f(x : S):
  typecase x:
    whenoftype T1:
      ...
    whenoftype T2:
      ...

As this suggests, typecase acts as a dynamic type check, with the only difference from its cousins in dynamically typed languages being that we are forced to write out all of the possible types the sum type ranges over. We could not, for example, leave off the whenoftype T2 clause in the above example.

With that under our belt, we can now look at Bob's argument. He makes two points which can seem to be separate but which I suggest are really just different ways of looking at a single underlying idea:

  1. Dynamic typing is but a special case of static typing, one that limits, rather than liberates, one that shuts down opportunities, rather than opening up new vistas.
  2. this is precisely what is wrong with dynamically typed languages: rather than affording the freedom to ignore types, they instead impose the bondage of restricting attention to a single type! Every single value has to be a value of that type, you have no choice!

Bob points out that, conceptually, a dynamically typed program has a single static sum type ranging over all the other types in the program. We could therefore take a dynamically typed program and perform a conversion to a statically typed equivalent. Although Bob doesn't make it explicit how to do this, we can work out a simple manual method (though it's not clear to me that this would be fully automatable for most real programs). We would first need to see all the points where the program performs operations that use types (perhaps by a sort of type inference; but we'd also need to include dynamic type checks using e.g. isinstance in Python or instanceof in Java). All the types referenced would then be ranged over a single sum type. Whenever a sum type is encountered statically, we would then need to use a typecase to check the dynamic type of the value, and then perform an appropriate action. In this way we can guarantee that our once-dynamically typed program is statically type-safe.
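To give a rough flavour of what that encoding looks like, here is a sketch in Python; the names Dyn and double are mine, and Python's optional type annotations are used purely as notation for the sum type.

from typing import Union

# Hypothetical sketch: Dyn stands in for the single sum type ranging over
# every type the once-dynamically typed program uses.
Dyn = Union[int, str, list]

def double(x: Dyn) -> Dyn:
    # The typecase: every operation on a Dyn value must first check which of
    # the ranged-over types the value actually is - precisely the checks a
    # dynamically typed run-time would have performed implicitly.
    if isinstance(x, int):
        return x * 2
    if isinstance(x, str):
        return x + x
    if isinstance(x, list):
        return x + x
    raise TypeError("value outside the types Dyn ranges over")

print(double(3), double("ab"), double([1]))  # prints: 6 abab [1, 1]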

This conceptual argument is quite correct. However, it is then used as the basis for two jumps of logic, both of which I believe to be incorrect. First, that this means dynamic typing is a special case of static typing; second, that, because of this, dynamically typed languages only have a single type. I also think that the thought experiment misses the very important factor of whether, if we were to do this for real, the results would be in any way usable. I'll now tackle these points in reverse order.

Usability

Sum types are a powerful theoretical tool; when used judiciously, they can also improve code. Used indiscriminately – particularly where a sum type ranges over an unduly large number of other types – they create an unwieldy mess because of the requirement to check all the types that have been ranged over.

The encoding of dynamically typed languages we used above would lead to a huge increase in the size of programs, since a static type system forces us to explicitly express all the checks that a dynamically typed run-time implicitly performs for us. It would also remove the possibility of making use of a dynamically typed system's ability to run programs which are deliberately not statically type safe (a useful property when comprehending and modifying large systems). I therefore assert that while it's a valid thought experiment, it would be deeply impractical for many real systems which make use of the easy dynamicity offered by dynamically typed languages.

Unitypes and multitypes

Bob rests much of his argument on the notion that his encoding shows that dynamically typed languages only have a single type — that they are, therefore, unityped (as opposed, one presumes, to multityped). As his suggested encoding indirectly shows, dynamically typed languages can indeed be thought of as having a single static type (or, as I often tend to think of things, a static type checker which always returns true). However, they can have an arbitrary number of dynamic types. The fundamental difference between statically and dynamically typed languages isn't over how many types they have: it's about when those types are utilised.

Consider a program which tries to add an int to a string (e.g. "abc" + 123). A statically typed language, depending on its type system, is likely to either reject such a statement as untypeable, or coerce the int to a string at compile-time. A dynamically typed language has exactly the same choices as the statically typed language: it is likely to reject such a statement as untypeable, or coerce the int to a string. The only difference is that the choice of rejection or coercion happens upon every execution at run-time, not once at compile-time.
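A trivial Python fragment makes the difference in timing visible: the type error below is only raised if, and when, the offending line is actually executed.

def greet(flag):
    if flag:
        return "abc" + 123  # a type error, but only when this line runs
    return "abc"

print(greet(False))  # fine: the ill-typed line is never executed
print(greet(True))   # raises TypeError at run-time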

It therefore makes no sense to say that a language is unityped without qualifying whether that relates to its static or dynamic type system. Python, for example, is statically unityped and dynamically multityped; C is statically multityped and dynamically unityped; and Haskell is statically and dynamically multityped. Although it's a somewhat minor point, one can argue (with a little hand waving) that many assembly languages are both statically and dynamically unityped.

The relative expressiveness of dynamically and statically typed languages

Expressiveness is a word that we in programming languages should be wary of, because, like Humpty Dumpty, we often use it to mean whatever we want it to mean. Bob implicitly equates expressiveness with the ability to state more, stronger, static properties of a program. That's one reasonable interpretation of expressiveness, but it's not the only one — let's call it static expressiveness. If we consider dynamic expressiveness, we can see that expressiveness is a multi-dimensional quality, with no clear winners.

Every static type system designer faces a fundamental choice: should the static type system be sound or complete? If the type system is complete then it will be unsound, letting through some programs that will cause run-time type errors. Conversely, if the type system is sound then (because, for Turing-complete languages, statically predicting all run-time behaviour is undecidable) it will be incomplete, rejecting some programs that will run without type errors at run-time. Nearly all statically typed languages therefore aim to be sound rather than complete [1].

The following program (in pseudo-Python) will fail to type check in any sound static type system which I know of:

l = [...]  # A list of non-empty strings
if len(l) > 0:
  l = l[0] # Extract the first string from the list
  print(l) # Print the string
if len(l) <= 0:
  l.sort() # The sort operation is valid on lists but not strings

Sound static type systems will reject such programs as type-unsafe because a) the l variable changes type (from List(String) on line 1 to String on line 3) and b) a later use of the variable requires it to be of a specific type (List(*)) on line 6. While this program may not be an example of good design, we can trivially see that this program will never lead to a type error at run-time because of the non-overlapping predicates on the if clauses. We could design a sound type system that will deal with this specific case but, because a sound type-system won't be complete, we can always find another similar run-time correct example which the type system will reject [2].

As this example shows, dynamically typed languages will execute to completion every program which contains no dynamic type errors. Statically typed languages will prevent some such programs from executing in the first place (at best, forcing the user to rewrite his already correct program using sum types). On this basis, one can therefore argue that dynamically typed languages are more dynamically expressive. This also gives us the mirror image of the encoding above: one can always translate a statically typed program into a dynamically typed equivalent.

All that I've done in the above is to restate a simple principle: statically and dynamically typed languages enforce their types at different points in a program's life-cycle. Depending on whether you view things through a static or dynamic type prism, either class of language can be considered more expressive. Dynamically typed languages will try and execute type-unsafe programs; statically typed languages will reject some type-safe programs. Neither choice is ideal, but all our current theories suggest this choice is fundamental. Therefore, stating that one class of language is more expressive than the other is pointless unless you clearly state which prism you're viewing life through.

Conclusion

In summary, if Bob's article had explicitly stated that he was solely considering static expressivity (as e.g. James Iry's chart of language's different static expressivity does), I would have agreed with it. In this article, I've tried to show that statically and dynamically typed languages overlap in many ways, but differ in others. Neither is exactly a subset of the other nor, as the underlying theory shows, could they be. Arguing whether static or dynamic typing is universally better or more expressive is therefore pointless. One must consider the pros and cons for the tasks one has at hand, and choose accordingly. Personally, I use both styles, and I have no wish to unduly restrict my language toolbox.

Acknowledgements: My thanks to Martin Berger and Christian Urban for insightful comments on early drafts of this article. All opinions, errors, omissions, or infelicities, are my own.

Updated (October 25 2012)

Joseph Cottam pointed out that the code example was incorrect if l contained an empty string at position 0. I've updated the comment in the code accordingly.

Footnotes

[1] Since formally proving this turns out to be rather hard, some seemingly sound type systems have turned out not to be so (e.g. Eiffel). Fixing this after the fact is then problematic because restricting the type system further to make it sound is likely to end up with perfectly correct real-world programs suddenly being identified as type-unsafe.

[2] One can equally imagine static analysis techniques which would let the program snippet through. Static analysis techniques tend to be both unsound and incomplete in order to make wider-ranging analysis possible. They therefore cannot replace static type systems, but can augment them.



Server Failover For the Cheap and Forgetful

August 2 2012

I've long been of the opinion that real computer people should run their own servers. Most obviously, doing so gives you huge flexibility to do unusual things and share them easily with other people. Less obviously, I think it's hard to be a good programmer unless you're also a competent sysadmin. And so, for the last 12 years or so, I've hosted my own e-mail, web sites, and so on. This has proved to be a good choice: my systems have been more flexible and reliable than my various employers', for example. I hope that I'm not dogmatic about doing everything myself, so when a better alternative appears, I switch to it. For example, after several years hosting git repositories for different projects myself, I now use github (and its ilk) almost exclusively, because they do a better job of multi-user repositories than I can on my own. But, in general, I prefer to do things myself.

However, there is one undeniable truth when hosting everything yourself: when things go wrong, there's no-one else to blame. If software crashes, hardware dies, or - and this has happened to me twice - the water company cuts through the power cable, it's your sites that are down. After the water company cut through the power cable for a second time, I realised that not having my main server running for three days wasn't very much fun. I eventually managed to route things through a cobbled together backup server, but it was a huge pain. I broke out in a cold sweat every time I wondered how long the next downtime might be: a week? or more? Even worse, I realised, I might not be available at the time to fix things.

I therefore started looking at ways to put in place a reasonable failover system. Importantly, I wanted the most important services on the main server to failover to a backup until the main server came back to life. I naively assumed there would be a fairly easy way of doing this, but, if there is one, I couldn't find it at the time (there are now several web-based DNS services which appear to offer this service, though I haven't tried any of them). Presumably this is why most people (including some surprisingly large organisations) never put into place any sort of failover solution and those that do tend to have large systems teams behind them. Some of the material I read made my eyes hurt: it involved tweaking vast numbers of options in vast numbers of configuration files. As someone who aims to spend, on average, a few minutes a week on sysadmin duties, I knew this was not a sensible long-term option. I'd inevitably forget all the complicated details, which, from a sysadmin viewpoint, is storing up problems for a rainy day.

After some thought, I cobbled together a relatively simple solution that has served me well for a couple of years or more. I think of it as failover for the cheap and forgetful. Nothing in here is new, as such, but it's difficult to find it written down in one place. Dedicated sysadmins will probably snigger at how crude it is, but it works well enough for my needs. It's not intended to deal with, for example, large scale denial of service attacks nor to keep every possible service up and running. It's designed to keep the main services (e.g. web and email) online when the main server is down. This article outlines the basic idea, gives some example configuration files, and the shell script I use to tie everything together.

The setup

Overall, my setup is analogous to having a main server in a data centre, and using a home computer for a backup server. The details are as follows. I have a relatively beefy server (which came to me for nowt, courtesy of a friend-of-a-friend clearing out a data centre) sitting on a fast internet connection. I have a very unbeefy backup server (a 4 year old, passively cooled Tranquil T7, a fanless Atom-based machine; they no longer make that model, but I highly recommend their other machines, which are exquisitely made and totally silent) sitting on a DSL line. The main server runs some fairly hefty services (including a large database) which the backup server can't, but those aren't critical. The only advantage the backup server has is a vast disk, which I use for backing up my music; the main server has faster, but much smaller, disks.

The servers are hosted with completely different network providers and are physically distant, with well over 100 miles between them. Both decisions are deliberate: I don't want network or location specific problems taking down both machines at the same time (many bigger providers, from Redbus to Amazon have regretted locating too many services in one location). Both machines run the latest stable release of OpenBSD, so moving configuration options between machines is fairly easy.

Requirements

What I want out of my system is fairly straightforward. The main server should do as much as it possibly can. The backup server should generally do relatively little, otherwise those who share its DSL line might become upset that their connection has slowed down. When the main server is down, it should be reasonably easy to access services on the backup server. If the backup server has taken over from the main server it should hand back to the main server as soon as that comes back up.

I also want as much of the setup as possible to be under my own control. Experience has taught me that unless (and, often, even if) you pay decent money, large systems run by large groups aren't as reliable as I would like: it's too easy for a small part of such a system to go down and its recovery not to be prioritised. Plus, doing things yourself is informative and good, clean fun.

Why a solution isn't easy

There are some complete, fully transparent, solutions for failover (e.g. CARP), but these generally have preconditions (e.g. machines being on the same network) which rule them out for my purposes.

Most people are familiar with the idea of having multiple mail servers. DNS allows a given domain name to specify multiple MX hosts, each with a priority. When sending mail to a domain, the MX servers are tried in order of priority until one accepts delivery. Failover for the sending of mail is therefore simple. But what about the reading of mail? and how do we see the same set of e-mail at both machines? DNS doesn't have any answer for that. And DNS doesn't have a real-world answer for failover of anything except e-mail sending. Well, it does in theory (see here), but few people implement it, so it doesn't in practice. That means that HTTP failover, for example, must be handled on a per-domain basis.

There are some half-measures which provide partial solutions, but they're unsatisfactory. Round robin DNS, for example, allows requests to be shared between servers. Often used as a simple form of load balancing, it seems like it might also be a way of implementing failover on the cheap. However, it will happily point a proportion of clients at a server which is down, which isn't a good failover strategy.

Outline of a solution

The basis of a solution is relatively simple. First, we have to have a way of rewriting DNS records so that when the main server goes down, the DNS records for the domain are changed to point to the backup server (and back again, when the main server comes up). Second, we need a way of keeping the data within services in sync. As an example of the latter, I'll describe how I synchronise e-mail.

DNS

DNS has a reputation for being a dark art. In large part, I suspect this is because the configuration files for BIND are so hard to read. For some years I used a web-based DNS service just to avoid the pain of working out how to configure BIND. To get failover up and running easily, I decided to tackle my DNS fear head on. I ended up using NSD instead of BIND, partly because it's a little easier to configure, but mostly because, as I understand things, it's likely to displace BIND in OpenBSD one day.

The basic tactic I use is to continually rewrite the main DNS entry to point at the preferred server. The backup server contains the master DNS. It regularly polls the main server by trying to download index.html via HTTP. If the download succeeds, it rewrites the DNS records to point at the main server; if the transfer fails, the DNS is rewritten to point at the backup server. The main server runs a slave DNS, which is notified of any changes to the backup server's records. This might seem backwards, but if the main server were to contain the master DNS, it wouldn't be able to rewrite DNS to point at the backup server.

The rewriting process is in some ways crude and in some ways a bit clever.

It's crude, because I wanted a solution that needs only basic shell scripting (so it can run on a fresh OpenBSD install immediately) and normal text files (no key exchange, or anything which involves configuring things that I might forget). I therefore manually write DNS zone files as a normal user with two magic markers @serial number@ and @IP@. @serial number@ is a monotonically increasing ID which allows other DNS servers to work out if they've got the latest version of the zone file or not. @IP@ is the IP address of the currently favoured server, yo-yoing between the main and backup servers.

It's in some ways a bit clever because the @serial number@ isn't an arbitrary integer: it's conventionally in the format YYYYMMDDSS (YYYY is the year; MM the month; DD the day; SS the increment during the day). What this means is that in any given day, we can only increment SS 99 times before we run out of options. Therefore we only increment it when @IP@ has changed since the last write of the file; this allows the main server to go down and come back up 49 times in a day before we run out of IDs, which seems enough to cope with all but the most bizarre outcomes.

I therefore wrote a script called zonerw which automatically copes with updating the zone files only when the main server has changed status since the previous execution (i.e. up to down, or down to up), and then notifies the slaves. There's also one other time when we want to rewrite the zone files: when the user has manually changed one or more entries in them (e.g. added a new entry). zonerw therefore has two modes: minimal which is used when regularly polling the state of the main server and only increments @serial number@ if a change is detected; and forced which the user can manually invoke when they have changed the zone files and which forcibly increments @serial number@.
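The zonerw script itself isn't reproduced here; purely to illustrate the logic described above – poll the main server, substitute the two markers, and bump the serial number only when the favoured IP address has changed – here is a rough Python sketch. The file names, paths, and state file are invented for the example, and the real tool is a shell script.

# Illustrative sketch only - not the author's zonerw. It assumes a small state
# file recording the IP address and serial number used in the previous run.
import datetime
import pathlib
import urllib.request

MAIN_IP, BACKUP_IP = "1.2.3.4", "5.6.7.8"

def main_server_up():
    # Poll the main server by trying to fetch index.html over HTTP.
    try:
        urllib.request.urlopen("http://1.2.3.4/index.html", timeout=10)
        return True
    except OSError:
        return False

def rewrite_zone(base_path, out_path, state_path, force=False):
    ip = MAIN_IP if main_server_up() else BACKUP_IP
    state = pathlib.Path(state_path)
    old_ip, old_serial = state.read_text().split()
    if ip == old_ip and not force:
        return  # minimal mode: nothing changed, so leave the zone file alone
    # Serial numbers are YYYYMMDDSS: restart SS at 0 on a new day, otherwise
    # increment it.
    today = datetime.date.today().strftime("%Y%m%d")
    ss = int(old_serial[8:]) + 1 if old_serial.startswith(today) else 0
    serial = "%s%02d" % (today, ss)
    zone = pathlib.Path(base_path).read_text()
    zone = zone.replace("@serial number@", serial).replace("@IP@", ip)
    pathlib.Path(out_path).write_text(zone)
    state.write_text("%s %s" % (ip, serial))
    # ...then reload NSD and notify the slaves.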

Setting up zonerw is relatively simple. We need slightly different NSD config files on the main server (DNS slave) which we'll assume has IP address 1.2.3.4 and backup server (DNS master) which we'll assume is 5.6.7.8. The two config files are stored in ~/etc/ (so they're easily synchronised across machines). The user then creates a base zone file in ~/var/zones/nsd/ with the magic markers placed as comments (after the ; character) e.g. tratt.net.base (they get rewritten to /var/zones/nsd/, the normal place for NSD zone files). zonerw then runs as a cronjob on the backup server every 5 or so minutes.

The zone files are relatively simple (assuming you can read such things in the first place: see e.g. this tutorial for pointers if you can't), but a few things are worthy of note. First, the domains have a short refresh time of 5 minutes but a long expiry time of 12 weeks. The short refresh time informs other DNS servers that they should regularly check to see if the domain's settings have changed: if the server status changes, we want the updated IP address to be quickly reflected in DNS lookups. The long expiry time means that, if all of my DNS servers are unavailable, other DNS servers should hold onto the records they previously fetched for a long time. This is because the complete disappearance of a domain from DNS (i.e. all a domain's DNS servers are down) is a nightmare scenario: mail starts bouncing with fatal errors and, in general, people (and other servers) assume you've died and aren't coming back. Finally, notice that the master DNS server notifies two servers of DNS changes. As well as my two servers in the UK, I also use the free service at afraid.org as a DNS slave, just in case network connectivity to the UK is lost (it happened in the past though, admittedly, not for many, many years). It also reduces the DNS request load on the backup server. If I had a server outside the UK, I might do this myself, but I don't, so I'm grateful for afraid's service.

Synchronising e-mail

Our DNS configuration gives us a nice way of making sure the main DNS entry always points to an active server. That doesn't solve the problem of making the content on the servers synchronise. In other words, it's not much good pointing people to a backup server if the content it serves is out of date. In my case, my websites are relatively simple: a script updates them both from the same git directory, so they're nearly always in sync. Other services are, in general, more difficult. Some - most notably databases - have built-in support for synchronising across servers. Most, however, take a little more thought, and each needs to be done on a case-by-case basis. As a simple case study, I'm going to show how I synchronise e-mail across two machines.

E-mail is interesting in that both servers accept e-mail deliveries: we therefore have a two-way synchronisation problem. Traditionally, e-mail was delivered to a single mbox file: synchronising that would be very hard. Most mail servers (and many mail clients) now support the maildir format, where each e-mail is stored in an individual file. Maildirs also try very hard to make each e-mail's filename globally unique (chiefly by using the server's hostname as part of the filename). We thus know that e-mails delivered to the same user on different machines will have different filenames. We can then frame the solution in terms of synchronising directories, which is much easier.

Fortunately, there are already good directory synchronisation tools available, so we don't need to make our own. I use Unison, though there are alternatives. The backup server connects to the main server every 10 minutes and synchronises the maildirs across both machines. The connection is done using an unprivileged user and SSH keys (so that user doesn't have to type in a password) and sudo (allowing the unprivileged user to only execute a mail synchronisation script as root).
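As a flavour of what the synchronisation step can look like, here is a minimal sketch; the user name, host name, and paths are invented, and the unprivileged-user/sudo arrangement described above is elided for brevity. Something along these lines runs from cron on the backup server every 10 minutes.

# Illustrative sketch only: synchronise one user's maildir between the backup
# server (local) and the main server (remote) using Unison in batch mode.
import subprocess
import sys

def sync_maildir(user):
    cmd = [
        "unison",
        "/home/%s/Maildir" % user,                               # local maildir
        "ssh://sync@main.example.org//home/%s/Maildir" % user,   # remote maildir
        "-batch",                                                # no prompting
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    sys.exit(sync_maildir("someuser"))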

The best bit about this solution is how easy it is to set up. It's also easy to extend to multiple users on a Unix box, and it gives a high degree of reliability. Mail that is delivered to one machine appears on the other with a maximum 10 minute gap - in other words, I can never lose more than 10 minutes' worth of e-mail.

[As a side note, since I also read my mail through maildirs (using mutt), I also use Unison to synchronise my mail between my desktop, laptop and servers. By default, my mail fetching script connects to the main server. I can manually tell it to collect mail from the backup server, so I can always download my mail, even when the main server is down. If you're wondering why I don't use the main DNS entry for this purpose, it's because Unison uses the host name of the remote server in its archive file; if the hostname you're connecting to later points to a different machine, Unison gets very upset.]

Of course, some services are much harder to synchronise than e-mail, and there is no general strategy. But hopefully this gives an idea for one such strategy.

Summary

As I said earlier, the techniques outlined in this article aren't particularly sophisticated: they won't cope with DOS attacks, for example. I wouldn't even want my backup server to be subject to a slashdotting (or whatever the kids are calling it these days) - it doesn't really have the bandwidth to support it. But for the sort of usages my servers typically see, and for virtually no ongoing maintenance cost, these techniques have proved very useful. Hopefully documenting them might be of use to somebody else.
