Debugging Layers

A couple of weeks back, I joked to someone that one day I would write a condensed retrospective of all my failed research ideas but that, until I did, people could read my research papers for the long version. It seemed a clever thing to say at the time, and I dutifully slapped myself on the back. I then found myself discussing Converge’s approach to error tracking and reporting with someone to whom I thought it might be vaguely relevant, later pointing my poor victim to a chunk of a paper and – when I later remembered its existence – a blog article. The core idea is simple: generated code can be related to one or more source locations. This means that when a user layers one piece of code atop another, each of those layers shows up in stack traces, making debugging significantly easier. When, with a fresh mind, I read over the paper and the blog article, I realised that it’s hard to extract this, and a few related, albeit minor, ideas out of those bits and pieces: they assume the reader knows more about Converge than is realistic.

Normally I wouldn’t necessarily mind too much if old ideas are a little hard to grasp, but I’ve come to view Converge’s error tracking and reporting as probably the only useful piece of language design I managed in the five or so years I spent on Converge [1]. I therefore found myself deciding that I needed to write a brief retrospective on Converge and this feature so that some other poor soul doesn’t spend five years of their life on a feature that can be explained in ten minutes. The aim of this article is to give enough background to understand the core idea of debugging layered programs, but what is enough depends on your point of view. I’ve therefore tried to aim this article at two audiences: one who just wants to quickly get a sense of the important idea (if that’s you, skim the background material, but don’t worry too much about the details); and another who wants to get a deeper sense of how the important idea really works (if that’s you, it’s worth spending some time getting a reasonable appreciation of the background). I’m sure I won’t fully succeed in satisfying either audience, but hopefully both might find a little something of interest herein.

A quick introduction to Converge

For our purposes, Converge can be thought of as Python with a mildly modified syntax and somewhat different standard library: it’s dynamically typed, has an indentation syntax, and a mostly similar world view. One immediately obvious difference is that programs must have a main function and that printing isn’t a builtin statement or function. Here’s a minimal hello world program:

import Sys

func main():
  Sys::println("Hello world!")

On top of this run-of-the-mill base [2], Converge adds a macro system [3] adopted from Template Haskell. In essence, one can generate program fragments at compile-time and splice them into the program such that they can be used as if the programmer had typed them into the original source file. Here’s a contrived example:

import CEI, Sys, Time

func secs_from_epoch():
  Sys::println("Getting time")
  return CEI::iint(Time::current().sec)

func main():
  Sys::println($<secs_from_epoch()>)

Saving that code into a file t1.cv and running it a couple of times at the command-line gives the following:

% converge -v /home/ltratt/t1.cv
===> Compiling /home/ltratt/t1.cv...
Getting time
===> Finished compiling /home/ltratt/t1.cv.
===> Linking.
1436729518
% converge -v /home/ltratt/t1.cv
1436729518

I’m going to assume that Converge’s basic syntax is fairly obvious, but there are a few other things that need to be explained before this example makes sense. First, Converge has distinct compile, link, and run-time phases. Running Converge with -v causes information lines to be printed on stderr (lines 2, 4, and 5) stating when compilation and linking have started and finished. Since there wasn’t a cached compiled version of the program in the above example when it was first run, Converge compiled and linked the program (saving that output to a cache file) before running it. On the second run, Converge read the cached version in and ran it directly, rather than recompiling the source code.
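As a rough illustration of that caching decision – in Python rather than Converge, with an invented function name and cache layout, and not a description of Converge’s actual implementation – the check amounts to: recompile only when there is no cached binary, or when the source file has changed since the cache was written.

import os

def needs_recompile(src_path, cache_path):
    # No cached compiled version yet: compile and link from scratch.
    if not os.path.exists(cache_path):
        return True
    # The source has been modified since the cache was written: recompile.
    return os.path.getmtime(src_path) > os.path.getmtime(cache_path)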

Most of the program’s source code is obvious except for the $<e> construct, which we’ll call an explicit macro invocation. The expression e is evaluated at compile-time and is allowed to refer to any function or top-level variable defined above the macro invocation; it is expected to generate a program fragment as an Abstract Syntax Tree (AST); that program fragment then overwrites the macro invocation so that when the program is later run, the generated program fragment is run. The secs_from_epoch function generates a trivial program fragment, with an AST representation of an integer (the CEI::iint function takes an actual integer and returns an instance of the AST integer class). In essence, after the explicit macro invocation was evaluated, the program that was compiled and linked looked as follows:

import CEI, Sys, Time

func secs_from_epoch():
  Sys::println("Getting time")
  return CEI::iint(Time::current().sec)

func main():
  Sys::println(1436729518)

As this hopefully shows, even though the program was run twice, the macro invocation was only executed when the program was compiled in the first run. No matter how many subsequent times we run the program, it will print out the time that program was first compiled (1436729518). Only if the program is recompiled (e.g. because the cached compiled version is deleted or the source code has changed) will the secs_from_epoch function be called again.

DSL compilers

In my opinion, Converge’s raw macro facility isn’t hugely useful for day-to-day programming. Rather, it is intended to be used as a basis for what I should have called DSL compilers: in essence, one can embed arbitrary strings into a source file and compile them into Converge ASTs at compile-time. In so doing, we’ve added a new layer on top of the main language. Although I’m not going to dwell on this, one can layer as deeply as one wants, and it’s not unusual for DSLs to be implemented in terms of other DSLs.

As a simple example of embedding a DSL, I’m going to use a stack-based machine. At compile-time the DSL will be compiled into a Converge function which can then be called as normal, returning the top-most value on the stack. Here’s a use of the DSL:

f := $<<sb>>:
  PUSH 2
  PUSH 3
  ADD

func main():
  Sys::println(f())

We’ll see the definition of sb shortly, but you will be unsurprised to be told that the above program prints out 5 when run. The $<<sb>> syntax (line 1) means all the code on the next level of indentation (lines 2–4) should be left unparsed; the raw string representing that code should be passed to the function sb which will be called at compile-time. In essence, a simple variant of normal macros allows us to embed arbitrary syntaxes into a file, making use of Converge’s indentation-based syntax to know when a DSL starts and finishes, and thus where normal parsing of Converge code can resume. If we take the full file and run it with converge -v, we see the following output:

===> Compiling /home/ltratt/t2.cv...
func ():
    stack := []
    stack.append(2)
    stack.append(3)
    $$3$$rhs$$ := stack.pop()
    $$4$$lhs$$ := stack.pop()
    stack.append($$4$$lhs$$ + $$3$$rhs$$)
    return stack[0 - 1]
===> Finished compiling /home/ltratt/t2.cv.
===> Linking.
5

As this might suggest, the sb function has compiled our DSL into an anonymous function (it also printed it out so that we can get some idea of what it did, but that isn’t something it’s required to do). The compilation is very simple-minded: an empty stack is created (line 3); the integers 2 and 3 are pushed onto the stack (lines 4 and 5); then they are popped off (lines 6 and 7) so that they can be added together and the result pushed onto the stack (line 8); and finally the top-most value on the stack is returned (line 9). Variables beginning with $$ might look scary, but for our purposes they’re simply normal variables with unique names generated by the Converge compiler.
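If it helps to demystify those names, here is a rough sketch in Python of a fresh-name supply (not Converge’s implementation; the naming scheme is simply modelled on the output above):

import itertools

_counter = itertools.count(3)

def fresh_name(base):
    # Each call returns a name no user-written variable can clash with.
    return "$$%d$$%s$$" % (next(_counter), base)

print(fresh_name("rhs"))  # $$3$$rhs$$
print(fresh_name("lhs"))  # $$4$$lhs$$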

Let’s start with a simple implementation of sb:

import Builtins, CEI, Sys

func sb(dsl_block, src_infos):
  instrs := [] // A sequence of Converge instructions
  // Iterate over each line in the DSL block
  for line := dsl_block.split("\n").iter():
    sp := line.stripped().split(" ") // Split the line into a list
    if sp[0] == "PUSH":
      ast_int := CEI::iint(Builtins::Int.new(sp[1]))
      instrs.append([| &stack.append(${ast_int}) |])
    elif sp[0] == "ADD":
      instrs.extend([|
        rhs := &stack.pop()
        lhs := &stack.pop()
        &stack.append(lhs + rhs)
      |])
  ast := [|
    func ():
      &stack := []
      $c{instrs}
      return &stack[-1]
  |]
  Sys::println(CEI::pp_itree(ast))
  return ast

This function takes a string (the dsl_block argument on line 3) and gradually compiles it into a sequence of Converge expressions, which are then inserted inside an anonymous function (lines 17–22); to make our lives a bit easier, the function is printed out (line 23) before being returned (line 24).

Before going any further, I need to explain three bits of related syntax. First are quasi-quotes [| ... |], which are a syntactic convenience, but an important one. Surrounding an expression with quasi-quotes makes that expression evaluate to its AST equivalent. Instead of laboriously writing out CEI::ibinary(0, CEI::iint(2), CEI::iint(3)) we can instead write [| 2 + 3 |] which compiles to exactly the same code. Second are insertions ${...} [4], which can be used inside quasi-quotes to insert one AST into another (e.g. lines 10 and 20). [| 2 + ${CEI::iint(3)} |] is thus equivalent to [| 2 + 3 |]. Most insertions splice in a single tree, but an insertion used alone on a line can also splice in a list of ASTs (e.g. line 20). Third, Converge has a hygienic name system, which renames all variables inside quasi-quotes which don’t begin with a & to fresh (i.e. unique) names. In the case of sb, we want to stitch together multiple tree fragments and have the stack variable refer to the same thing across all of them. In general, one doesn’t want to use & too frequently, though for the purposes of this article, it doesn’t matter too much.
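For readers who prefer to see these ideas outside Converge, here is a loose Python analogue using the standard ast module. Python has no quasi-quotes, so ast.parse stands in for [| ... |] and a direct field assignment stands in for an insertion; none of this is Converge’s API.

import ast

# Manual construction: the analogue of CEI::ibinary(0, CEI::iint(2), CEI::iint(3)).
manual = ast.Expression(
    body=ast.BinOp(left=ast.Constant(2), op=ast.Add(), right=ast.Constant(3)))

# "Quasi-quoted" construction: the analogue of [| 2 + 3 |].
quoted = ast.parse("2 + 3", mode="eval")

# An insertion: build "2 + _" and splice an AST for 3 into the hole,
# roughly [| 2 + ${CEI::iint(3)} |].
hole = ast.parse("2 + 0", mode="eval")
hole.body.right = ast.Constant(3)

for tree in (manual, quoted, hole):
    ast.fix_missing_locations(tree)
    print(eval(compile(tree, "<example>", "eval")))  # prints 5 each time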

Better reporting of syntax errors

At this point, we’ve implemented a small DSL which is translated at compile-time into Converge code. However, our implementation of the DSL compiler is rather too simplistic: any mistake by a user of the stack machine DSL will lead to inscrutable errors. Consider a trivial syntax error, where the user forgets to add a value to a PUSH instruction:

f := $<<sb>>:
  PUSH 2
  PUSH
  ADD

This innocuous error leads to a long stack-trace [5] in the DSL compiler whose last few entries are:

   ...
   17: File "/home/ltratt/t3.cv", line 27, column 5, length 31
        f := $<<sb>>:...
   18: File "/home/ltratt/t3.cv", line 9, column 45, length 5
        ast_int := CEI::iint(Builtins::Int.new(sp[1]))
   19: (internal), in Builtins::List.get
 Bounds_Exception: 1 exceeds upper bound 1.

Let us assume we have a user brave enough to follow that stack-trace into code: they will probably deduce that the error relates to the PUSH instruction, but, if their input has multiple such instructions, which might be the culprit? Fortunately, Converge has two tools to help us.

The first tool is something I deliberately skipped over earlier. sb is passed two arguments, with the second conventionally called src_infos. If we print that argument out we see that it is a list of lists:

[["overdrive.tratt.net:/home/ltratt/t3.cv", 725, 21]]

I’ll go into more details of this later, but the basic idea is simple. The src_infos argument tells us where the DSL block started and its extent—in the file t3.cv, starting at character 725, and spanning 21 characters. This information gives us the potential to report back errors in the dsl_block string in terms of the real line numbers the user will see in their file. However, doing so manually is a tedious chore.
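To give a flavour of that chore, here is a minimal sketch in Python (the function is hypothetical, not part of Converge) of mapping an offset inside the dsl_block string back to a line and column in the user’s real file, given a single src info:

def block_offset_to_file_pos(src_info, block_offset, file_contents):
    path, block_start, block_span = src_info
    assert 0 <= block_offset <= block_span
    # Absolute character offset of the error within the original file.
    abs_off = block_start + block_offset
    line = file_contents.count("\n", 0, abs_off) + 1
    col = abs_off - (file_contents.rfind("\n", 0, abs_off) + 1)
    return path, line, col

# e.g. an error 10 characters into the DSL block of t3.cv:
#   block_offset_to_file_pos(("t3.cv", 725, 21), 10, open("t3.cv").read())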

Fortunately, the second tool is that Converge has a built-in parser DSL which can use src infos. We can adapt sb as follows:

import Builtins, CEI, CPK::Earley::DSL, Sys

parse := $<<DSL::mk_parser("machine", ["PUSH", "ADD"])>>:
    machine ::= instr ( "NEWLINE" instr )*
    instr   ::= push
              | add
    push    ::= "PUSH" "INT"
    add     ::= "ADD"

func sb(dsl_block, src_infos):
  parse_tree := parse(dsl_block, src_infos)
  instrs := [] // A sequence of Converge instructions
  for i := 0.iter_to(parse_tree.len(), 2):
    instr := parse_tree[i][0]
    instr_type := instr.name
    if instr_type == "push":
      ast_int := CEI::iint(Builtins::Int.new(instr[1].value))
      instrs.append([| &stack.append(${ast_int}) |])
    elif instr_type == "add":
      instrs.extend([|
        rhs := &stack.pop()
        lhs := &stack.pop()
        &stack.append(lhs + rhs)
      |])
  ast := [|
    func ():
      &stack := []
      $c{instrs}
      return &stack[-1]
  |]
  Sys::println(CEI::pp_itree(ast))
  return ast

The parsing DSL takes a fairly conventional context-free grammar (lines 4–8). Unless the user goes out of their way to prevent it, the parsing DSL automatically reuses Converge’s lexer, so the stack machine DSL has access to lexing rules such as INT. The parsing DSL requires an explicit start rule (machine) and, optionally, a list of additional keywords over those it automatically inherits from Converge (PUSH and ADD). It then converts the grammar into a function which takes two arguments and returns a parse tree. For our very simple DSL, the tree is so shallow that I didn’t need to bother writing a normal tree walker. The result of our simple change is powerful. If we rerun our faulty input against the new sb we get the following error:

 Error:
   File "/home/ltratt/t4.cv", line 36, column 8, length 0:
      PUSH 2
 Parsing error at or near `NEWLINE' token.

While I won’t pretend this is the best possible parsing error message – as humans, we’d have preferred in this instance to have been shown the line before rather than after the newline – the text PUSH 2 really can be found on line 36 in the file t4.cv. Importantly, that puts us right next to the line of code which led to the bug, which is far preferable to trying to divine the source of the problem from a stack-trace. DSL compilers can also use a related tactic to report static errors to the user using the CEI::error function, which takes a string and, optionally, src infos.

Run-time errors

Finding errors statically is useful but also somewhat easy—life gets a lot harder with run-time errors. The following program is syntactically valid but obviously incorrect, trying to add two items from the stack when only one item is present:

f := $<<sb>>:
  PUSH 2
  ADD

func main():
  Sys::println(f())

Since sb performs few static checks, this code will compile. Looking back at the definition of sb, we would expect that the ADD instruction will fail at run-time when it tries to perform the second pop. Indeed that’s precisely what happens:

% converge -v t5.cv
===> Compiling /home/ltratt/t5.cv...
func ():
    stack := []
    stack.append(2)
    $$3$$rhs$$ := stack.pop()
    $$4$$lhs$$ := stack.pop()
    stack.append($$4$$lhs$$ + $$3$$rhs$$)
    return stack[0 - 1]
===> Finished compiling /home/ltratt/t5.cv.
===> Linking.
Traceback (most recent call at bottom):
  1: File "/home/ltratt/t5.cv", line 40, column 15, length 3
       Sys::println(f())
  2: File "/home/ltratt/t5.cv", line 22, column 15, length 12
       lhs := &stack.pop()
     File "/home/ltratt/t5.cv", line 28, column 6, length 10
       $c{instrs}
  3: (internal), in Builtins::List.pop
Bounds_Exception: -1 below lower bound 0.

The exception raised is Bounds_Exception, which is Converge’s way of saying you tried to access an element beyond the start or end of the list. The stack-trace initially looks helpful: we can immediately see that Converge tracks errors down to the level of where within a line they occur; and we can also see that the problem is related to the &stack.pop() call. However, deeper inspection shows that it’s hard to relate it to the source of the problem. After all, I made a mistake in my DSL code, but the stack-trace information is presented in terms of the DSL compiler. If my stack machine code had contained hundreds of ADD instructions, which would have been the source of the problem? Here we see the problem of debugging layers, as programming languages traditionally privilege one layer—the base programming language.

The stack-trace contains a clue that we might be able to do something about this. There are three frames on the stack (i.e. the main function has called another function, which has itself called another function), but the second frame is associated with two source locations (lines 22 and 28). In short, Converge ASTs can be associated with one or more src infos. ASTs that are created by quasi-quotes automatically have a src info relating them to the source location they were defined in; and further src infos are automatically added whenever ASTs are inserted into each other [6]. Users can then add additional src infos to a quasi-quote using the [<e>| ... |] syntax. Since every terminal and non-terminal in the parse tree records the src info it’s associated with, it’s trivial to pick out the appropriate src info from the parse tree and add it to a quasi-quote. In this case, we simply take the src infos from the ADD non-terminal when we construct the appropriate quasi-quotes, modifying sb as follows:

 elif instr_type == "add":
   instrs.extend([<instr.src_infos>|
     rhs := &stack.pop()
     lhs := &stack.pop()
     &stack.append(lhs + rhs)
   |])

With this small modification, running our incorrect program leads to the following stack-trace:

 Traceback (most recent call at bottom):
   1: File "/home/ltratt/t5.cv", line 40, column 15, length 3
        Sys::println(f())
   2: File "/home/ltratt/t5.cv", line 22, column 15, length 12
        lhs := &stack.pop()
      File "/home/ltratt/t5.cv", line 37, column 2, length 3
        ADD
      File "/home/ltratt/t5.cv", line 28, column 6, length 10
        $c{instrs}
   3: (internal), in Builtins::List.pop
 Bounds_Exception: -1 below lower bound 0.

We can now see that the second frame on the stack is associated with 3 entries, one of which is the line from the stack machine itself (lines 6 and 7 in the stack trace). With this, the user can quickly pinpoint which ADD instruction is the culprit, no matter how many such instructions their program contains.

A reasonable question to ask is whether it’s really necessary to allow more than one src info. Could we not maintain the traditional single source location per stack frame, but allow the user to choose what that location records, and still get the effect we wanted? As it so happens, the first version of Converge did just that, so I can conclusively answer this with a no. The reason is very simple: when an error happens at run-time, there is no way of knowing which layer was responsible. Unlike traditional compilers, which are heavily tested over multiple years, DSL compilers tend to be more fragile and more susceptible to bugs. When an error happens, there are roughly even odds as to whether the DSL compiler author or the DSL user is at fault. Only if the debugging information for both layers is reported can both people have a chance of debugging the problem.

How it works

A src info is a triple (file identifier, character offset, character span); src infos are an ordered list of these triples. The key to what I’ve presented above is that src infos – and specifically the notion that there can be more than one at any point – are present in all aspects of the system: the tokenizer, the parser, the compiler, and the VM. In other words, all the parts that make up Converge were written with src infos in mind. The tokenizer starts the work off, giving each token a single src info (though, for consistency, it is a list of src infos that just so happens to have a single src info). The parser doesn’t change tokens, but each time it creates a non-terminal, it calculates the span of the tokens in its subtree (i.e. it looks at all the tokens in the subtree in a depth-first fashion, finding the beginning of the first token in the subtree, and then calculating the span up to and including the last token in the subtree) and creates a src info based on that. As the compiler creates an AST from the parse tree, it copies src infos over (sometimes from tokens, sometimes from non-terminals). The bytecode saved by the compiler then associates each individual bytecode instruction with one or more src infos. When an error happens at run-time, the VM then digs out the appropriate src infos for each location in the relevant stack frame. Thus, src infos have no run-time cost, other than when an exception is raised.
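As a concrete illustration of the parser’s part of this – a rough Python sketch with invented class and function names, not Converge’s internals – computing a non-terminal’s src info means finding the first and last tokens beneath it and spanning from the start of one to the end of the other:

from dataclasses import dataclass, field

@dataclass
class Token:
    kind: str
    value: str
    src_infos: list               # [(file, offset, span), ...]

@dataclass
class NonTerm:
    name: str
    children: list
    src_infos: list = field(default_factory=list)

def tokens_in(node):
    # Depth-first list of all tokens in a parse-tree subtree.
    if isinstance(node, Token):
        return [node]
    return [t for child in node.children for t in tokens_in(child)]

def compute_src_info(nonterm):
    toks = tokens_in(nonterm)
    path, start, _ = toks[0].src_infos[0]
    _, last_off, last_span = toks[-1].src_infos[0]
    nonterm.src_infos = [(path, start, (last_off + last_span) - start)]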

The bytecode format that Converge uses is not particularly innovative. Simplifying slightly, each compiled module is its own self-contained binary blob. Within that are several contiguous blocks recording various information. One of these blocks is the sequence of VM instructions (e.g. push, pop etc.), each of which we can think of as a single machine word. Another block then records one or more src infos, again each stored as a machine word, for each VM instruction. There must therefore be at least as many src infos as there are VM instructions. However, since high-level operations invariably compile to multiple VM instructions, it is generally the case that successive VM instructions record the same src infos. Converge therefore adds a few bits to each src info to record how many consecutive VM instructions the src info refers to. On average, this simple encoding reduces the size of the src infos block by half.
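A toy version of that encoding – again in Python and purely illustrative, since the real thing packs everything into machine words – looks something like this:

def encode_src_infos(per_instr):
    # per_instr: one list of src infos per VM instruction, in order.
    encoded = []                        # [(repeat_count, src_infos), ...]
    for infos in per_instr:
        if encoded and encoded[-1][1] == infos:
            encoded[-1] = (encoded[-1][0] + 1, infos)
        else:
            encoded.append((1, infos))
    return encoded

def decode_src_infos(encoded):
    out = []
    for count, infos in encoded:
        out.extend([infos] * count)
    return out

# Three consecutive instructions compiled from the same expression share src infos:
instrs = [[("t5.cv", 725, 21)]] * 3 + [[("t5.cv", 700, 3)]]
assert decode_src_infos(encode_src_infos(instrs)) == instrs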

Summary

The scheme I’ve described in this article took full form in the second version of Converge. In my mind, Converge is a finished piece of work. At the very least, I feel that I’ve learnt pretty much all that I’m going to learn from Converge, and I think it unlikely I will do much work on it again. So why have I put together such a long-winded article on one little aspect of Converge? It’s for the very simple reason that the idea of src infos – notice my deliberate use of the plural – might have relevance to other languages. In particular, any language which encourages its users to generate code might benefit from such a concept: it really does make debugging multiple layers of generated code much easier. My use of the word encourages is deliberate: I don’t think a language needs to have a macro-ish system for the concept of multiple src infos to be useful. For example, there appears to be a growing culture of generating JavaScript code offline, and while, in such a scenario, src infos would need to be recorded differently, their benefits would be the same. I’m not suggesting that retrofitting this concept in is easy: it will often require changing various components, including delicate parts of VMs. I’m also not suggesting that every language would benefit from this concept: it seems unlikely that a C program would ever want the overhead, no matter how small, that src infos require.

So there you are—that’s probably the most useful thing I managed in five years of language design. If I’m tempted to do language design again, I will print a t-shirt with the slogan I spent five years designing a programming language and I only got one lousy feature as a reminder to leave it to people with more talent for it than I.

Acknowledgements: My thanks to Carl Friedrich Bolz, Martin Berger, and Lukas Diekmann for insightful comments on early drafts of this article. All opinions, errors, omissions, or infelicities, are my own.

2015-07-28 08:00

Footnotes

[1]

In my opinion, Converge’s experiment with an Icon-like expression evaluation system wasn’t a success. Fortunately, as so often in research, the dung heap of failed ideas provided fertile ground for a new idea to grow. When I wrote that paper, I thought that the many stack operations needed by Icon’s evaluation system precluded an efficient implementation, before foolishly hand-waving about a possible partial optimisation. What I didn’t foresee at that point was that meta-tracing would optimise Icon’s evaluation system without me having to do anything. Though I was slow to recognise it, meta-tracing had freed me from many onerous constraints I thought inevitable.

[2]

My personal experience suggests that programming language designers resemble actors: most hopefuls, such as myself, lack the necessary talent; even for those who are talented, the odds of success are remote; and yet there is never any shortage of people who believe they are destined for greatness.

[3]

Formally it should be called compile-time meta-programming, but that’s a distracting mouthful. At some points I’ve tried abbreviating this term to CTMP, but, in retrospect, more readers would have been helped if I’d shortened it to macro. If you understand every detail of such systems, there’s a minor difference between traditional macros and compile-time meta-programming but, for most of us, macro gets the point across well enough without requiring the digestion of new terminology.

[4]

The $c{e} variant of insertion allows non-hygienic splicing. Put another way, it means splice e in and allow it to dynamically capture variables from the surrounding environment. Hygiene, and Converge’s approach to it, aren’t relevant to this article, so I refer interested parties to the documentation for further details.

[5]

The bolding and highlighting you see are shown by default in the console.

[6]

I’m not sure this is the best policy, because it often records semi or completely redundant src infos. However, all my attempts to find a predictable scheme that recorded less src infos led to cases where the lack of src infos hindered debugging. I have no idea whether this reflects a lack of imagination on my part or whether humans have expectations that no automatic scheme can ever quite meet.
