RSS / XML feed

Laurence Tratt's Technical Articles


Extended Backtraces

June 2 2008

I can quite distinctly remember when, as a teenager, I realised that I would spend much of my life debugging. I was programming in assembler, where loops are simply branch statements to defined labels. Because loops are so common, one would use the same label for each loop; later loops would redefine the label, meaning that no problems could occur. Being a literal person I used the label loop for all such loops. When one of my programs mysteriously failed, I could not work out why. Eventually I realised that one of my label definitions had spelt looop (three O's instead of two) instead of loop, so my loop had branched back to the previous loop in the file. Spotting that took me a couple of days.

Later, I realised that most programming errors fit into two broad categories: the obvious and the subtle. Obvious errors are those whose source can be easily pinpointed (even if fixing the problem takes a while). The subtle are typically those where cause and effect are separated, making identification of the root of the problem difficult (often, when eventually located, such problems are easily fixed). The looop problem above was, to a relative novice programmer, a very subtle problem (years of experience have taught me that looking for daft spellings and in my own programs is a good initial target when debugging).

There are, to my mind, two tricks to debugging. The first is to try and turn subtle problems into obvious problems; however subtle problems are typically inherently subtle and unamenable to such treatment. The second is to try and speed up the solving of obvious problems. For me, the main tool for solving obvious problems is the humble backtrace which, when an exception occurs, shows one (in some manner or another) the call stack and, hopefully, what file and line number each entry therein is associated with. Given the following trivial program:

func head(l):
    return [l[0], l[1 : ]]

func main():
    head([1])
    head([])
a standard looking backtrace would be:
Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6
  2: File "/tmp/head.cv", line 2
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
Using this we can fairly quickly see that the cause of our error is passing an empty list to a function which assumes that there is at least one element in the list. [As a side note, this example is in one way unrepresentative: in the vast majority of cases, it's typically the bottom one or two lines of the backtrace that pinpoint the real source of the error.]

Backtraces like the above can be found in most modern programming languages like Java. They are immensely useful and form precisely half of my debugging toolkit, the other half being printf - in my view of the world, these two tools obviate the need for debuggers. The power of backtraces is most obviously felt in those languages that don't have them. C programs typically need to be run through a debugger to get a backtrace, meaning that errors in programs running in production can be extremely difficult to diagnose. The first Haskell program I wrote had the head error in it. The resulting message just said Prelude.head: empty list with nothing else to help me - no line numbers or even file names. Needless to say, it took me a long while to work out what this meant, and how it happened (if I remember correctly, I passed an empty list to a library function which, in turn, called head). Unsurprisingly that was also pretty much the last program I wrote in Haskell - languages that turn should-be-obvious errors into subtle errors are of no use to me. [Apparently Haskell now has some in-built, though hardly easy to use, support for backtraces. The implications relating to Haskell's prioritisation of features are, to my mind, highly amusing.]

Python was the first language I saw that took backtraces a little bit further, printing (when possible) the line of source code associated with each part of the backtrace. A Python-esque backtrace looks roughly as follows:

Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6
       head([])
  2: File "/tmp/head.cv", line 2
       return [l[0], l[1 : ]]
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This simple innovation is a real boon: as in this case, one often doen't even need to open a source file in a text editor to see the error made. Python-esque backtraces help make obvious errors quicker to solve than traditional backtraces.

I realised early on in Converge's development that knowing merely the line number of an error was only part of the problem. Often a specific sub-expression within a certain line is the relevant part of the backtrace, and the rest of the line is noise. Converge therefore recorded the column (i.e. offset within a line) where each error is associated with, meaning that backtraces looked like the following:

Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6, column 4
  2: File "/tmp/head.cv", line 2, column 13
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This extra information is very helpful: it means that I can accurately pinpoint which of the two list lookups in line 2 is responsible for calling List.get incorrectly. As a useful advantage, Converge's approach also means that errors that happen within multi-line statements (i.e. logical lines of source split over multiple physical lines in a source file to aid presentation) work properly.

Converge's backtraces stayed like the above for quite some time, until recently when I realised that knowing the start column associated with an error is only part of the story. What one really wants to know is the start and end of the associated expression. A small tweak to the parser, and a huge (but mechanical) change to the compiler, and Converge backtraces could tell one how many characters in the line an error was associated with:

Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6, column 4, length 8
       head([])
  2: File "/tmp/head.cv", line 2, column 12, length 4
       return [l[0], l[1 : ]]
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This is almost helpful, but in practice I find it surprisingly hard to count n characters within a line on screen, which hinders interpretation of the above data.

A short while later, the answer hit me: what the backtraces need to do is to highlight the relevant sub-expression within the line. Here's a screenshot of the above error running in an xterm with the latest version of Converge:

As you might notice, the tiny little difference here is that the part of each line pertinent to the error is in bold and underlined. Knowing that, one can instantly see that the first of the two list lookups on line 2 is responsible for calling List.get incorrectly. Interestingly, my first attempt at this put the offending fragments only in bold, but since whitespace can sometimes be a significant part of an error, underlining can be a useful aid. In the case where the associated source code is split over multiple lines, the first relevant line of source code is printed with ... added to the end of the line to inform the user that the printed line is not the end of the story.

As I explained in a previous entry, when Converge DSLs are translated into Converge ASTs, individual call stack entries can be associated with more than one source location. This means that backtraces tend to be rather long, which previously made tracking down the cause of an error tedious - loading multiple files into text editors and continually flipping back and forth to xterms is not fun. Extended backtraces become a real life saver in this regard. Here's an example where a DSL incorrectly tries to subtract a string from an integer:

Looking at this backtrace, an experienced programmer will be able to quickly surmise that, given the exception message, the most likely candidate for this error is in the ex4.cv file (which, as the eagle-eyed may notice, is code written in the DSL - Converge's errors work with both the base language and with embedded DSLs). Imagine trying to debug this with a traditional backtrace: there's a lot of information in the backtrace, and there would be no indication as to which part of it is more likely to be responsible for the error.

From a practical point of view, Converge's extended backtraces have no run-time penalty for correct code, and users don't have to do anything to enable them - they're a standard part of the system. Extended backtraces can be found in -current versions of Converge (at the time of writing, Converge's support for Curses under Windows is weak, so underlining doesn't work there - it's a quick, fun little project for someone who's interested).

So, going back to the start of this entry, how do Converge's extended backtraces help with debugging? Well, they might help turn the odd subtle error into an obvious error, but that's an incidental benefit. What I think they do is make solving obvious errors much quicker than previously. In the sort time since I've had extended backtraces, I've noticed that I've often been able to almost instantly fix errors that before might have taken me a couple of minutes. Given the number of programming errors I make, the cumulative time saving is most welcome.

In summary, I think that Converge's extended backtraces are a real boon to programming. To the best of my knowledge, Converge is the first language with such backtraces in - I hope it won't be the last!

Link to this entry


Designing Sane Scoping Rules

March 3 2008

If there's one thing which unites pretty much every post-assembly programming language it is the use of the humble variable. Variables are such a common feature that we tend to take them for granted; perhaps I show my background in assembly by being explicitly aware of them. However where programming languages often differ is in the way that they allow one to reference variables: the dreaded scoping rules. In this article I'm going to outline why Converge has the scoping rules it does.

The first thing I need to do is to outline the problem. Although all my examples are framed in terms of a typical imperative programming language, the underlying concepts all translate fairly directly to functional languages too. Here's the simplest example possible:

x := 2
...
y := x
In every language I know this says assign 2 to x and the assign the value of x to y. So after this code is run both x and y will have the value 2.

The first major issue with regards to variables is global vs. local variables. Take the following code which is intended to represent the top-level of a source file:

x := 2

func f():
    x := 3

f()
In a lot of BASIC-type languages there is only one underlying x variable in this program, so after this code is run the outer x will have the value 3. In essence we have a flat variable scope: all variables belong to the same, single, namespace. This not only makes writing programs error-prone (e.g. one function accidentally corrupts another's x), but makes certain styles of programming largely impractical (e.g. recursive functions). It's probably no coincidence that the first mainstream programming language to make a virtue out of recursion - Algol 60 - also was among the first to get its scoping rules in reasonable order.

Retro-fitting sane scoping rules is not easy if the first version of your language used the above scoping rule (note the deliberate use of the singular) - any change of scoping rule(s) has a high chance of breaking programs in nasty ways. Some of the BASIC-type languages I first used solved the backwards compatibility problem by making variables global by default but allowing variables in functions to be declared as local. This allows one to rewrite the above example as follows:

x := 2

func f():
    local x
    x := 3

f()
This code now has two distinct x variables: one at the top-level and one in the f function. Every time f is called - even recursively - it will be given space for a new, fresh x. This feature makes programming a lot easier, even if it defaults to the insane global scoping rule by default. As an aside, surprisingly (to me at least), you can still see the global by default scoping rule in the modern language Lua. This goes to show how fundamental scoping rules are: once they're in a language, users will resist nearly all meaningful change to them.

Most programming languages adopt a slight variant of the above rules which are fairly easy to understand in practice. Essentially variables with the same name as a top-level variable reference that top-level variable directly, while all other variables are local. So in a C-type language the following code contains two variables: a top-level x and a y local to f:

x := 2

func f():
    x := 3
    y := 4

f()
Running the above code means that the top-level x is set to 3, while the y variable is local to f. Variations on this set of scoping rules underly many programming languages in use today.

The next major design challenge for scoping rules is much more subtle and confuses many of us to this day. Knowing that the global keyword in Python declares that assignment to a specified variable doesn't make it local to that scope, consider the following Python code:

x = 2

def f():
    x = 3
    
    def g():
        global x
        print x
        x = 4
    
    print x
    g()

f()
print x
What do you think this will print out? Let's try it out with Python 2.5:
$ python scope.py
3
2
4
$
In other words, we get a result which is a long way from what we might have expected: the print statement in g prints 2 instead of the expected 3 and the final print statement prints 4 instead of the expected 2. What is it doing? What's happening is that most languages don't have nested scopes (as one might expect) but two scopes: a top-level (a.k.a. global or module) scope and a function scope. What this means is that the assignment to x in g references the top-level x, not the x in f; you might want to read that twice to check that you've really understood it.

It might at first seem that Python's scoping rules are simply silly; actually, they're not unreasonable and they're shared by most programming languages (e.g. Java). Why? The problem is that the function g might outlive f. Here's a simple example:

func f():
    x := 2
    func g():
        return x
    return g

f()()
In other words, f returns a reference to g; when g is executed, the value of x known to f will have disappeared as, in most programming languages, variables are stored on the stack. This means that variables only exist for the duration of a function call. Since f's variables will have disappeared when g is executed, all sorts of bad things could happen.

Scheme was the first language that presented a practical solution to this problem of nested scopes in the form of closures. The standard way that closures are defined is guaranteed to confuse and I'm not going to repeat it. They're actually very simple: essentially each function allocates heap memory to store variables on. Thus if an inner function outlives an outer function there is no problem in referencing variables in the outer function even if the stack space has long since disappeared, since a function calls variables can outlive the function call itself.

[As an aside, the fact that closures need to allocate heap memory (although it's often possible to statically analyse such allocations away) has been used as an argument against them in languages such as Java. That's the chief reason that Java has all sorts of complications like inner classes, final variables and so on: Java resisted closures, and then had to resort to hacks to get a poor facsimile of its functionality. It's hard to imagine any decent programming language being built now that doesn't implement closures (Converge certainly does), so closures are gradually losing their exotic tag (which, I suspect, is based on their definition and not their utility, which is far from exotic).]

When I was designing Converge, I put some effort into deciding what its scoping rules were going to be. I wanted to make things safe by default (e.g. no global by default-type nonsense), and to make closures easy to deal with. Converge's scoping rules are (or, at least, were), I believe, the simplest of any imperative programming language. They boil down to this: assigning to a variable makes it local to a function; unless a variable is declared nonlocal, in which case scopes are searched from inner to outer to find the matching variable. That's it. Rewriting the Python example into Converge yields the following:

x := 2

func f():
    x := 3
    
    func g():
        nonlocal x
        Sys::println(x)
        x := 4
    
    Sys::println(x)
    g()

func main():
    f()
    Sys::println(x)
Which when run does the expected:
$ converge s.cv
3
3
2
$
The interesting thing here, in my mind at least, is the nonlocal keyword. Although it's a tad awkward sounding, it was the best combination of brevity and accuracy that I could think of. Unlike Python it would be incorrect to declare a variable as global since there are more than 2 scopes: in fact, there are an arbitrary number. What nonlocal is saying is when you see an x search each successive outer scope in order to find where it was originally defined. It's not a commonly used feature, but when you need it is priceless.

I said above that Converge has - or had - the simplest of any imperative programming language (at least as far as I'm aware). Some time after the first publications and release of Converge, the Python team decided to fix their scoping rules for their backwards-compatibility-breaking Python 3000 release. PEP 3104 contains the eventual proposal they came up with. Interestingly it is identical to Converge's scoping rules even down to using the nonlocal keyword. Please note that I'm not suggesting that the Python team copied Converge uncredited (and even if they did, I wouldn't mind - I actively encourage people to take the good ideas from Converge and put them in other languages). However I think it shows that these are good scoping rules and that eventually imperative programming languages will evolve towards something like them.

The open question is this: why has it taken us, as a community, 50 years or more to define two simple scoping rules (assignment is local; nonlocal successively searches outer scopes) and one simple implementation technique (closures), and why have we taken so many wrong turns (in this article I haven't enumerated the silliest, such as dynamic scoping)? I suspect the answer, if it could be definitively uncovered, would give one a very interesting insight to the subject of computing in general.

Link to this entry


Some Lessons Learned from Icon

December 3 2007
Updated: December 10 2007

One of Converge's lesser known influences is Icon. Although it's a relatively obscure language itself, a surprising number of people have heard of Icon's direct predecessor SNOBOL. I first heard of Icon through Tim Peters and Python, where Icon was the inspiration for Python's generator feature (basically procedures that whose execution can be resumed to produce multiple values). However Icon has a much richer, and definitely unique, approach to expression evaluation than any other imperative programming language. When I started working on Converge, I decided to go back to the source, and see how much of Icon's unique approach I could incorporate into Converge.

This article is a personal look back at Icon's influence on Converge, and the successes and failures resulting from that influence. It's quite probable that much of it reflects my slightly superficial understanding of Icon - apart from its innovative approach to expression evaluation, much of the language has an old fashioned feel to it, sometimes appearing like a dynamically typed version of Pascal, and this prevented me from ever wanting to write anything large in it. This isn't a criticism of Icon as such - it was ahead of its time in many ways - but is merely an observation that modern programmers expect something slightly different from their programming languages. Hopefully, even with the above caveat, this article contains something of value to those interested in programming languages.

Why is Icon interesting

In a nutshell, Icon is interesting due to what it calls goal-directed evaluation. This is an evaluation mechanism which gives to an imperative programming language some of the flexibility more often associated with languages like Prolog, such as a limited form of backtracking. It is built around the concepts of success and failure, and generators (see the Converge documentation for more details). Expressions can succeed or fail; if they succeed, they produce a value; if they fail, they do not produce a value and transmit failure to their containing scope. Generators are functions which can produce more than one value.

Icon's syntax is derived from Algol, so the following complete program prints 1 to 9 inclusive:

procedure upto(x)
  i := 0
  while i < x do {
    suspend i
    i := i + 1
  }
end

procedure main()
  every x := upto(10) do write(x)
end
Most of this is probably reasonably intuitive, apart from the every and suspend keywords. every can be considered to be equivalent to the typical understanding of a for statement (indeed, in Converge, the same construct is called for). suspend is equivalent to yield in Python or Converge, returning a value, but allowing the procedure calls execution to be resumed directly after the suspend statement.

Failure and if: a beautiful combination

If I had to pick one thing from Icon that I would not be without in Converge, it would be the concept of failure in if statements. There is something beautiful about saying if this function succeeded, assign the value to this variable and then do xyz. This idiom is used to build a very useful convention that is present throughout Converge's libraries. As in most languages, container items such as dictionaries have a get function which returns the value associated with a certain key; geting a key which doesn't exist generates an error. There is a very a common variation on this use case, which is to check whether a dictionary has a certain key, and if it does to do something with its value; otherwise a different code path is taken. In most languages this idiom is expressed roughly as follows:
d := Dict{"a" : 2, "b" : 8}
if d.has_key("a"):
  Sys::println(d["a"])
Not only is the double lookup of a an eyesore, and a maintenance accident waiting to happen, but it can be a significant overhead in tight loops. In Python (and probably other languages), it's common to see the following idiom:
d := {"a" : 2, "b" : 8}
try:
  v = d.has_key("a")
  print v
except KeyError:
  pass
This idiom makes use of the fact that it's nearly always quicker to catch an exception if a key isn't found than to perform two lookups. In my opinion, this idiom is far from pleasant, for reasons that I'm sure most readers will intuitively understand.

In Converge one can express this common idiom thus:

x := Dict{"a" : 2, "b" : 8}
if v := x.find("a"):
  Sys::println(v)
In this example, v only gets assigned a value if find succeeds. There are some other advantages to this approach (e.g. there's no fiddling around checking for null), but the symmetry between get and find, and the consistency of this convention across libraries has proved a real success in Converge. For me, this feature alone justifies the Icon experiment in Converge.

Variables should not spring into existence with a default value

The idiom mentioned in the previous section is, unfortunately, somewhat dangerous in Icon itself, as all variables have a default value of sorts. What this means is that the following Icon code runs without error:
procedure main()
  if 1 < 0 then x := 0
  write(x)
end
To me, this is odd, because x never has a value assigned to it (as the expression in the if statement is clearly false); any errors resulting from such a mistake turn out to be very difficult to debug. In Converge I therefore made it so that reading from a currently unassigned variable raises an error, which protects against programming slips such as the following:
x := Dict{"a" : 2, "b" : 8}
if v := x.find("c"):
  ...
else:
  Sys::println(v)
As the lookup of c fails, v doesn't have a value assigned to it, so reading from v in the else branch of the if statement raises an Unassigned_Var_Exception, pinpointing the offending code.

Procedures shouldn't fail by default

In Icon, the default return value of a procedure is failure. This makes quite a lot of sense for generators such as the following:
procedure upto(x)
  i := 0
  while i < x do {
    suspend i
    i := i + 1
  }
end
As this suggests, the typical action for a generator is (via a loop) to generate several values; once that loop is finished, the generator fails. Thus each Icon procedure effectively has return &fail at its end. Very early versions of Converge inherited this behaviour.

What became quickly apparent is that while this is reasonable behaviour for generators, it can be a disaster for normal procedures. Look at this code and see if you can work out what will happen when it's executed:

procedure f(x)
  if x > 0 then return 1
end

procedure main()
  write(f(-1))
end
If you guessed that nothing would be printed to screen, congratulations. If you didn't, then don't worry, because this frequently surprised me. This happens since the failure of the if's condition in means that the procedure executes its default return action, which is to fail. Thus the procedure doesn't return a value, and the call to write is never even evaluated.

Although this might seem trivial, the problem becomes significantly worse when it's embedded in a large body of code. f is an example of what I informally think of as the dangling if return problem - in other words, it's quite easy to write a procedure where different values should be returned, but where one branch of an if incorrectly neglects to actually return anything. I find this occurs quite often during the early stages of development when one is fleshing out functions. On several occasions I spent several hours tracking down instances of this problem.

The question I asked myself was whether this feature was worth the pain. A quick analysis of real world code quickly showed that the vast majority of functions aren't generators, and indeed only ever return a single value. Therefore optimising for the exceptional case (generators) is an odd design choice.

I considered two solutions to the problem. First, one could syntactically differentiate generators and procedures, with the former failing by default and the latter not. Second, one can make all procedures not fail by default. Adding new syntax is a decision that should never be taken lightly, and since generators aren't that common, I couldn't justify the added conceptual complexity. Therefore Converge functions became similar to Python functions, with their default return action being to return the null object.

Generators are useful, but easily hidden

Generators are an integral part of Icon and Converge, but as well as having no specific syntax to define them, there is no specific syntax to call them. Assuming that g is a generator, I have grown quite fond of the following idiom:
l := []
...
for l.append(g())
which appends every element generated by g to l. However the problem in such cases is that it can be difficult to work out what part of the expression is the generator. Converge now tries to solve this problem by prefixing generator function names, whenever it makes sense, with iter_ which highlights that the function in question is a generator. This makes it easier to look at a piece of code and work out what it does without having to understand every function called.

Backtracking is rarely useful

One of the major features of goal-directed evaluation is that backtracking can be performed. The most common way for this to be achieved is by linking expressions together with &. The resulting conjunction of these expressions only succeeds if all of its component expressions succeed. The expressions are evaluated in order. If one of them fails, and a previous expression is a generator, then backtracking occurs. The previous generator is resumed to produce a new value, and the conjunction then resumes execution from that point.

Combining conjunction with for allows compact expressions such as the following to be expressed:

x := "abcabaacvcabcbab"
for Sys::println(i := 0.iter_to(x.len()) & i % 2 == 0 & x[i] == "a" & i)
This prints out every index position in a which is a multiple of two, and which contains the character a (0 6 10 14 for those who are interested).

There are two problems with the above. Most obviously, it is incredibly difficult to read from a human perspective; indeed, it would be far better written as a couple of if statements within a for body. The second problem is that the form of backtracking used is often too weak to be useful. Icon allows variables in conjunctions to have assignments undone during backtracking, but even this isn't really enough (I didn't even bother implementing such functionality in Converge, because I couldn't really see its use). Full backtracking often involves undoing assignments within objects and so on, and no imperative language can ever hope to do this automatically.

Out of all the systems I've written in Converge, only one has made any practical use of the built-in backtracking, and even then it needed some manual help to be truly useful.

The fail variable causes bizarre behaviour

One way in which Icon shows its age is its differentiation between values and references. In keeping with most object orientated languages, Converge makes no such distinction to the user (although within the VM it does so for efficiencies sake). One of Icon's main oddities is that fail is a synonym for return &fail, while &fail is a sort-of reference to an invisible object.

Converge simplified this, so that fail was a variable pointing to a semi-special object (in similar fashion to the null object). However the fail variable proved, in admittedly rare cases, to be conceptually troubling. It's troubling in Icon too, although in slightly different ways. What do you think happens in Icon if I define l as the following and then iterate over each of its elements and print them out?

l := [1, &fail, 3]
If you guessed that an error is raised because l didn't get assigned a value (because evaluating &fail cause the whole thing to fail), pat yourself on the back. Now try this:
l := []
put(l, 1)
put(l, &fail)
put(l, 3)
If you guessed that 1, then 2 get printed out, you're doing very well. And finally consider this:
l := []
put(l, 1)
t := &fail
put(l, t)
put(l, 3)
If you guessed that 1, then a blank line, and then 2 get printed out, you're doing much better than I ever managed to.

Converge's attempts to simplify thus were somewhat successful, but left one hole. While one could successfully evaluate the following list in Converge:

l := [1, fail, 2]
the following mysteriously printed nothing:
Sys::println(l[1])
since l[1] is a synonym for l.get(1), which of course tried to evaluate return fail, which then caused get to fail. While this might seem a contrived example, it unavoidably cropped up when getting a module to return the value of a specific definition, since one of those definitions was fail.

fail was edged out of Converge in stages. A while ago, the recommendation became that the only safe idiom involving fail was return fail. Perhaps surprisingly, this was not an onerous restriction. Nevertheless it still failed to prevent the problem of fail being a definition in a module. Therefore the fail variable was finally banished entirely from Converge recently (although it lives on internally in the VM). fail is now an expression in Converge's grammar, and is equivalent to return &fail in Icon. Thus finally, all of the bizarre behaviour associated with the fail variable is forgotten, and I now sleep easier at night.

Do something different

When I was looking for information in Icon, I stumbled across a few interviews with Ralph Griswold (Icon's main designer), and a paper on its history. One thing that became clear to me was that a major design goal of Icon was not to be just another (boring) synthesis of a few existing language features - like the vast majority of programming languages - but to try out genuinely new things. Have a look at e.g. p38 of this interview or p6 and 13 of this paper looking back at Icon's history. Whatever you might think of Icon, it undoubtedly fulfilled this aim: its expression evaluation system alone has no precedent in contemporary programming languages (and there are other interesting parts of the language).

When I was starting Converge, I agreed whole heartedly with this aim, but had no idea how to achieve it. Indeed, when Converge started, I only had the vague intention of chucking a few seemingly useful influences into the melting pot (initially Python, Icon, ObjVLisp, and ultimately Template Haskell when I'd got the mix of the former three languages about right). I had that vague feeling that either everything interesting had been done before, or that the remaining interesting things were outside of my grasp. To be honest, I sort of gave up on the aim of being innovative, and just concentrated on trying to mix Converge's influences together in a coherent way. What has been interesting is that as Converge has developed, new challenges have presented themselves, and I've had to find answers (some better than others) to such challenges. And in so doing, I think Converge has grown some genuinely innovative features, such as DSL blocks, and its approach to error information for macros amongst others. And in Converge's case - unlike Icon, as I understand it - the journey isn't over. Indeed, since I think Converge is currently nearing the end of the beginning, there's bound to be new challenges ahead, some of which will lead to new innovations of one sort or another, some ultimately successful, some ultimately not.

When I read about Icon's aim, and looked at the end result, I assumed that the innovative ideas in Icon had sprung out of nowhere. I, on the other hand, had no idea how to bring such ideas, fully formed, into the world. What's become clear to me is a restating of Edison's famous rule. Innovation is often a matter of perspiration over inspiration. Now my advice to those who wish to do innovative work is simple: if you put the miles in, you'll raise your head one day and discover that you're in a place where no-one else has ever found themselves in before. It's a good feeling, and one that benefits us all, as collectively we gradually discover what works and what doesn't.

Updated (December 10 2007): Used a clearer example in the Failure and if: a beautiful combination section.

Link to this entry


IEEE Software Special Issue on Dynamically Typed Languages

September 13 2007

Roel Wuyts and I were lucky enough to guest edit the current issue of IEEE Software magazine on dynamically typed languages. We had some very good submissions, and narrowing down the selection was quite a challenge!

The IEEE have helpfully put a few of the articles from the special issue online for free. Get them while you can! The full issue is available in paper form (obviously), or online, although you need IEEE membership to access the other articles.

Link to this entry


How Difficult is it to Write a Compiler?

August 9 2007

Recently I was discussing Converge with someone, and mentioned how little time the core compiler had taken to implement (no compile-time meta-programming, limited error checking, but a functioning compiler nonetheless) - only a few days. The chap I was talking to looked at me and told me that he didn't believe me. I laughed. Actually, he really didn't believe me. Initially that surprised me, and I wondered why he might think that I was pulling his leg. But then I thought back, and only a few years ago I would probably have had the same reaction. This article is my attempt to explain his reaction, and why it's no longer the case.

When I was younger, there were three things in computing that seemed so complex that I had no idea how anyone had ever created them; certainly, I didn't think I would ever be capable of working on such things. The imposing triumvirate were hardware, operating systems, and programming languages. Although electricity has always baffled me, the hardware people have done an amazing job on interoperability of components in recent years, such that I've never felt the need to gain a low-level understanding of how electrons whizz around my systems. Operating systems were partially demystified for me as I moved into the world of open source UNIX-like operating systems (for the record, I've been an OpenBSD user for 8 years or so), where one could poke and prod virtually any aspect - although kernels retain an aura of mystery for most of us. Despite the fact that I was familiar with several languages, and had even implemented a primitive assembler language (which, to my surprise, found its way into a real system - my first large-ish programming success), real programming languages remained a mystery to me for much longer. The obvious question is: why?

Programming languages are, for most of us, an integral part of the computing infrastructure. Only a small handful of languages have ever had a wide impact, and by the time a programming language becomes popular it already probably has a decade or more of history behind it. This means that when the average programmer first comes across a new language, it has had substantial development behind it, libraries and documentation, developed its own culture, and undoubtedly had several flaws uncovered; not to mention lots of little platform portability hacks, and often complex optimisations. Faced with this wodge of sheer stuff, most peoples reaction is either to shrug their shoulders and accept it at face value (the majority), try to understand it but not know where to start (a minority), or spot how similar it is to every other language (a tiny handful of people). Because the first two categories massively outnumber the third category, a general feeling - which we're not necessarily consciously aware of - has developed that programming languages are special, and therefore that only special people can create them (a veiled insult, if ever I saw one). Thus if I suggest to people that, despite having created a fairly fully featured programming language, I'm no more capable a programmer than the next man, they dismiss it as false modesty.

So let's start looking at things in more detail. If you break a programming language down, it only has a handful of things to it. What's more, the variance between different languages is - despite the unfortunate zealotry which often clouds debate - exceedingly small. In fact, at a high level there are really only two things which any language must have: a definition and an implementation. Sometimes a paper definition is produced up front, and only when this is complete is an implementation created. Sometimes the latter defines the former; but conceptually these are two separate things. Basically the definition says things like statement S does X, and the implementation actually takes in a source file and compiles S such that it really does do X. Defining a language is very simple: any fool can do it (the skill comes in defining a good language, but that's a different story). We can then break the implementation down into three components: a compiler, libraries, and a run-time system. Again, these are conceptually separate things, although sometimes they are combined (e.g. most Python programmers do not distinguish the compiler from the run-time system). Libraries are not difficult to create, although they are time consuming. Sometimes the run-time system will be a VM, sometimes it will be bundled up with the executable created by a compiler; regardless, we'll assume for the purposes of this article that a reasonable VM is already available.

Most people don't have much trouble understanding the high level view, but it's when they start trying to understand the compiler - the thing that ultimately they interact with - that things start getting more complex. The problem is that looking at implementations like Python (quite complex), Java (very complex), or GCC (something beyond merely very complex), reveals too much detail to easily uncover the big picture. The Converge compiler, at the moment at least, is rather simpler (although, like any other program, it is growing slowly but surely as it accretes features) and is a nice snapshot of a compiler. It has three basic stages:

  1. Read in a source file, and create a parse tree.
  2. Turn the parse tree into an abstract syntax tree.
  3. Turn the abstract syntax tree into object code.
Although the terminology might be a little different from language to language, these three stages are fairly common (although they might be augmented by extra stages e.g. for optimisation). Unfortunately my experience is that even at this stage, people look at those three codes and think magic. Let's break them down a little more.

Parsing

Parsing is the process of reading in a big wodge of text, discovering its underlying structure, and then creating a parse tree which is easy to operate on. For English, this is a bit like taking in a sentence and discovering what its subject, its verb, and so on is. Parsing doesn't really uncover meaning as such: that's for later stages.

If we have an input file such as the following:

import Sys

func main():
  Sys::println(2 + 3)

What is its parse tree? Well, in order to uncover that, we use a parsing system, which given a description of a language structure - a grammar - can automatically create a parse tree. Parsing has a bad reputation, often deservedly so: commonly used parsing technologies are often horrible to use, although it is perfectly possible to make much more pleasant parsing technologies. Regardless of that, this is the only thing in a compiler where you will need to check an external source to understand a bit more about grammars (Wikipedia's page on formal grammars is a decent start). An indicative chunk of the Converge grammar is as follows:

top_level ::= definition ( "NEWLINE" definition )*
          ::=

definition  ::= class_def
            ::= func_def
            ::= import
            ::= var_def ( "," var_def )* ":=" expr
            ::= splice
            ::= insert

import      ::= "IMPORT" import_name import_as ( "," import_name import_as )*
import_name ::= "ID" ( "::" "ID" )*
import_as   ::= "AS" "ID"
            ::=
In essence this says: a program consists of zero or more definitions; a definition is a class, a function etc.; an import is one or more module name imports, each of which can assign to a different variable name than the module name (using as). Given that and our input program, the following parse tree is produced (using only a couple of lines of code to call a library function or two):
top_level
  -> definition
    -> import
      -> <IMPORT import>
      -> import_name
        -> <ID Sys>
      -> import_as
  -> <NEWLINE>
  -> definition
    -> func_def
      -> func_type
        -> <FUNC func>
      -> func_name
        -> <ID main>
      -> <(>
      -> func_params
      -> <)>
      -> <:>
      -> <INDENT>
      -> func_decls
      -> expr_body
        -> expr
          -> application
            -> expr
              -> module_lookup
                -> expr
                  -> var_lookup
                    -> <ID Sys>
                -> <::>
                -> <ID println>
            -> <(>
            -> expr
              -> binary
                -> expr
                  -> number
                    -> <INT 2>
                -> binary_op
                  -> <+>
                -> expr
                  -> number
                    -> <INT 3>
            -> <)>
      -> <DEDENT>
This might look a bit imposing at first, but it's trivial to convert this back into the original input (although information about blank lines and comments would disappear reversing the transformation). If you want to see what the parse tree is for other Converge inputs, Converge helpfully includes a little program called convergep which given a source file input automatically prints out its parse tree (the above output is cut 'n' paste straight from convergep).

If you've understood this part, congratulations - you've got over the main stumbling block to creating a compiler.

From parse tree to abstract syntax tree

As stated earlier, a parse tree captures structure, but it doesn't capture meaning as such, beyond that which is captured in the structure. This can be seen by the fact that the parse tree still contains things like the : token, despite the fact this is for the users' visual benefit, rather than effecting the meaning of the program. What the second stage of a compiler does is to understand the meaning of the program (in some sense), strip out all the stuff which doesn't effect that, and convert it into another, very similar, data structure - the Abstract Syntax Tree (AST) - which captures this. For example the text 2 + 3 is converted from its parse tree equivalent:
-> expr
  -> binary
    -> expr
      -> number
        -> <INT 2>
    -> binary_op
      -> <+>
    -> expr
      -> number
        -> <INT 3>
into an AST along the lines of Add(Int(2), Int(3)). Notice that typically the parse tree and the AST have almost identical structure. Because of this, the conversion from parse tree to AST is generally very simple. For Converge, this conversion is captured in the compiler/Compiler/IModule_Generator.cv file. As the language grows, clearly this conversion gets larger, but because it is largely mechanical (and therefore often a simple cut 'n' paste job), it is not a difficult task. For example in Converge's early days, when it had approximately the same expressivity as Python, this conversion took around 2-3 days to code from scratch.

From abstract syntax tree to object code

Object code is an intentionally vague term - it might be machine code, VM bytecode, or even an intermediary programming language. Basically here we take an AST like Add(Int(2), Int(3)) and turn it into instructions such as:
INT 2
INT 3
ADD
In a language such as Converge, where the VM is designed explicitly to be used with the language, then there is a strong relationship between the AST and the object code. In other words, this means that the conversion to object code is about the same size and complexity as the conversion from parse tree to AST. If you're converting to, say, assembly code then this will be a trickier task, but even then there will still be a largely mechanical element to the conversion. You can see this in Converge's compiler/Compiler/Bytecode_Generator.cv file.

Conclusion

That's it. You don't need to know, or do, any more to create a compiler. When it's broken down in this simple way, I hope that it partly demystifies what a compiler is. There's nothing particularly magical about a compiler, and with the exception of parsing (which regretfully is often more involved than it should be), nothing particularly complex. I hope you get some sense of how little work there is in creating a compiler for a simple language. Of course, if you want to do something a little more complex, then things become rapidly more work, but you'd be amazed how few languages require such complexity.

In conclusion, the main difficulty for most of us in creating a compiler is overcoming the cultural absurdity which tells us that mere mortals can't create compilers. They can. I did. You can too.

Link to this entry

Earlier articles

All articles
 
Last 10 articles
Extended Backtraces
Designing Sane Scoping Rules
Some Lessons Learned from Icon
IEEE Software Special Issue on Dynamically Typed Languages
How Difficult is it to Write a Compiler?
When Are Macros Useful?
Filling in a Gap
Are Multicore Processors the Root of a New Software Crisis?
The High Risk of Novel Language Features
Evolving DSLs
 
 
DSLs
Martin Bravenboer
Eelco Visser
 
Modelling
Grady Booch
Steve Cook
Keith Duddy
Jack Greenfield
Steven Kelly
Stuart Kent
Michael Lawley
Kerry Raymond
Jim Steel
Alan Cameron Wills
 
OS
Marc Balmer
Mike Erdely
KernelTrap
OpenBSD Journal
 
Programming
Artima
Peter Bell
Gilad Bracha
Ian Cartwright
Code Generation Network
Bram Cohen
Adrian Colyer
Bruce Eckel
Jonathan Edwards
Daniel Ehrenberg
Fabien Fleutot
Chad Fowler
Mark Guzdial
Elliotte Rusty Harold
Jeremy Hylton
Ralph Johnson
Ralf Laemmel
Lambda the Ultimate
Patrick Logan
Niclas Nilsson
Keith Packard
Havoc Pennington
Guido van Rossum
Keith Short
Software Engineering Radio
Diomidis Spinellis
Darren Tucker
Markus Voelter
Phil Wadler
Eugene Wallingford
Marcus Widerberg
Steve Yegge