What I’ve Learnt So Far About Writing Research Papers

January 10 2018

I long flattered myself that I'm only good at one thing: programming. Experience has taught me that I'm not as good at it as I thought – indeed, that there are people much better at it than I – but it's fair to say that I'm not terrible at it. The problem is that, as a researcher, programming can only ever be part of my job: it's not enough just to build something interesting, I need to explain it to people too. And explaining it to those further away than my voice can carry requires writing a paper. [1]

There are many people who dismiss paper writing as a distraction from building real systems. I sometimes wondered if it was worthwhile myself but, when you've been knocking around as long as I have, the importance of writing things down becomes clear. I’ve lost track of how many conversations I’ve had along these lines: “We saw that problem in System B, and we had a really interesting solution.” “Where can I read about it?” “We never wrote it up.” “Where can I download it then?” “We never released the source code, the binaries are lost, and it runs on hardware that no longer exists.” I encountered this a lot when I was investigating existing language composition systems, with tantalising references ranging from Lisp systems with different macro expansion semantics to novel text editors. Even when the person I was talking to was passionate and knowledgeable about the system in question, I was rarely able to gain the insights I suspect I could have if I'd had a paper to pore over. Doubtless, many of these systems weren't good enough to worry about them slipping from our memory; but the loss of the best, which may have involved tens or even hundreds of skilled person years of effort, is a tragedy. Writing things up does not guarantee that people in the future will be able to understand everything about the work [2], but at least they stand a sporting chance. Besides, even if people of the future can't understand things, writing things down increases the number of people who can understand the work today.

How should one go about writing a paper? When I started my PhD, I had no idea of how one might go about this task, and it took me many years to find an approach that worked for me. In the rest of this blog post, I'm going to try and drill down into some of the things I’ve learnt about writing papers, and I'm going to give some examples of where I’ve learnt from my mistakes. However, I’m deliberately not phrasing anything herein as “advice” — indeed, several parts of my approach seem to run contrary to most of the advice I’ve seen. I don’t see this as a problem: people work in different ways, and if you’re unlucky enough to have a mind similar to mine, you might find something useful in here.

Before starting to write

Before I start writing, there are two crucial decisions that I need to make. First, what am I going to write about? Second, what is the profile of the intended reader?

The first question nearly always has an easy answer: I’ve done some research which is interesting and complete enough to be worth telling other people about. If I don’t have a good answer to that first question, it means that I’m probably not ready to start writing.

The second question is more tricky. In general, I assume that the intended reader is someone intelligent, interested, and well-versed in the general area. My job is to fill in the things that they don’t know. One can go mad trying to guess precisely what people do and don’t know, so my starting assumption is that the reader knows roughly what I knew when I started doing the research in question. However, this doesn’t work for all types of writing, particularly when a paper brings together two subjects that have previously been studied by disjoint communities. On any points where I’m unsure about what knowledge I can take for granted, I err on the side of caution, and assume the reader knows marginally less than I did. This should not be mistaken for an assumption that the reader is stupid, but simply an acknowledgement that none of us can know everything.

I also make a fundamental assumption that the reader has good intentions, and that my job is to be upfront and honest with them. This allows me to write clearly about the weaknesses and limitations of my work without worrying that I will be punished for doing so. This may sound like an odd thing to be explicit about, but people in my position are largely evaluated by the quantity and perceived quality of our peer reviewed papers. It is therefore tempting to adjust one’s writing to maximise the chances of acceptance by peer reviewers (who are often rushed; occasionally incompetent; or, though rarely, malevolent). I did that in the past, but I found the implicit cynicism corrosive: I decided, perhaps somewhat pompously, that I’d rather “fail” peer review honourably than “succeed” dishonourably. This has led to one or two more rejections than I might otherwise have received, but these are more than worth the longer-term satisfaction with the papers I write.

Starting to write

It is a truism of writing that the hardest thing is a blank page: building sufficient momentum to begin writing is no easy thing and I've done my fair share of hoping in vain for inspiration to strike. Of course, the only possible solution is to write something, anything, just to get started. The most common suggestion that I've heard is to start writing the bits of a paper that come most easily, without worrying about the overall order. While this seems to work well for many people, I've never found this a satisfying approach.

Instead, the first thing I do is to try and work out what the paper’s overall message is going to be, which means starting with the paper’s abstract. [3] This gives me a frame of reference for all subsequent writing and enables me to be ruthless in answering questions like “does the reader need to know about this particular aspect?” and “is this section in the right place in the paper?”. I first reread suggestions for writing an abstract (the fourth point in the link on paper writing advice) because it reminds me to be concise about what problem I solved and to motivate why my solution is worth someone’s time reading. In well established fields, where one is tackling a well-known problem, little motivation is needed. More experimental or unusual research, however, needs to be clearly motivated. This is often harder to do than it seems.

Let me give a concrete example based on the paper fine-grained language composition: a case study. We started looking at language composition because previous solutions were crude, because we felt we had a new approach worth trying, and because it looked like it would be fun. I had a few explanations about why it was worth doing (enough to convince people to fund us to do it), but they were vague. After developing a language composition editor we realised we needed to be able to run composed programs as well. We started with an extended warmup exercise, which gave us the confidence to try tackling something bigger.

We eventually settled on composing PHP and Python and started looking at the problem in detail. About 6-12 months later, we had created PyHyp, which was the first ever fine-grained language composition of “big” languages: certainly, we'd been able to get much further, and with less effort, than I had expected at the outset. The problem was that we had also outstripped our ability to explain why we'd done what we'd done. The published abstract that I wrote (and I say “I” because I’m almost solely to blame for it) looks as follows:

Although run-time language composition is common, it normally takes the form of a crude Foreign Function Interface (FFI). While useful, such compositions tend to be coarse-grained and slow. In this paper we introduce a novel fine-grained syntactic composition of PHP and Python which allows users to embed each language inside the other, including referencing variables across languages. This composition raises novel design and implementation challenges. We show that good solutions can be found to the design challenges; and that the resulting implementation imposes an acceptable performance overhead of, at most, 2.6x.

The good thing about this abstract is that it's simple to understand: the bad thing is that it doesn't say why the problem we tackled is worth tackling. Since language composition has been largely ignored as a research challenge, the abstract relies on readers spontaneously realising why language composition is useful and why our research is worth reading — which is far too much to ask.

The frustrating thing is that, while carrying out the research, one of our collaborators had suggested a plausible “killer app”: system migration. The idea is simple. At the moment we have lots of systems written in old languages: we'd like to rid ourselves of the old languages but not the systems. We can either rewrite the systems from scratch (which is expensive and error prone) or use an automatic translator (which produces code that no human wants to maintain). Language composition offers us the potential to compose an old and a new language together and gradually migrate parts of the system piecemeal, at all times having a runnable system. This idea is tucked away in a paragraph near the end of the paper and is easily overlooked. This is an unforced error on my part, because papers are not meant to be like crime novels, with unexpected twists: a good paper should make clear upfront what the paper is about, and gradually fill in detail. Indeed, I struggled to find a good structure for this paper, perhaps in part because of my failure to set out the most compelling case possible in the abstract.

By the time I came to write a blog post on this subject I had realised my mistake and put the migration idea (in expanded form) near the beginning. Alas, anyone who read only the paper would not have seen this implicit mea culpa.

As this example hopefully shows, finding the right research message is often tricky. Fortunately, it is possible to do a decent job: looking back, I think both the storage strategies for collections in dynamically typed languages and Eco: a language composition editor papers do a much better job in their abstracts.

With the abstract out of the way, I draft an introduction. If I've got my high-level thinking in the abstract right, then the introduction is a much easier task because, to some extent, it's simply an expanded abstract. I add more background, including some context about the research's antecedents, and explain why we've done something useful. Sometimes I will explicitly list the paper's contributions, especially if they might otherwise not pop out (thus the storage strategies paper was clear enough without such a list whereas I felt the fine-grained paper would not be understood without one).

With drafts of the abstract and introduction complete (I invariably come back and, at least, tweak them; sometimes I change them extensively), I then find it possible to move on to the paper's main content.

Writing and editing

Perhaps unsurprisingly, I tend to write papers more-or-less linearly from front to back, finishing one section before moving on to the next. I find this helpful to give me the same sense of flow as the eventual reader: it means that I’m forced to think about seemingly simple questions like “have I introduced this concept earlier, or does it need to be introduced here?” It’s easy to overlook the high-level structure of a paper, and I’ve read many papers with all the right pieces but in the wrong order. [4] Similarly, I've read many papers which use too many unconnected examples: each feels sensible in its own right, but each example takes time to understand, destroying the reader's flow. A single example which can be built up in stages as the paper progresses nearly always makes the paper easier to read, and is most easily achieved if the writing is done semi-linearly.

That said, I spend the majority of my time at a much lower level: I’m trying to get all the facts that need to be in the paper in there, explained clearly, and in the right order. That means that I spend most of my time thinking about whether a sentence says what I need it to say, without saying anything it shouldn’t say, and whether it does so as concisely as possible. I’m not indulging in false modesty when I say that I don’t think that my first attempt at a sentence has ever satisfied all three. As soon as I’ve typed a sentence in, I’ll reread it, and if it’s really bad (which it normally is), I’ll move it a couple of lines down on the screen and try writing another version immediately. For some complex thoughts I can end up with multiple versions scrolling gradually off the bottom of the screen before I eventually get to something that isn’t obviously flawed. But I'm realistic: I probably won’t get those complex sentences right on the first day, so I try to move on fairly swiftly.

When I start a new day, I go back over what I wrote the previous day and look at it in a fresh light. As well as catching out innumerable typos, I always end up rephrasing and reordering sentences. That’s because, if there’s one thing I’ve learnt about good writing, it’s that it’s the result of good editing. And, as far as I can tell, good editing means extensive editing. It means being able to look at what I wrote and — ignoring my ego, which inevitably says “it’s good enough already” — trying to neutrally judge whether it can be improved. This is easy to say but, at least for me, it was an incredibly difficult skill to learn. I only learnt it when I had to write a grant proposal and my first draft was 11 pages, just a little over the 6 page limit. I had no idea what to do, and my first attempts at cutting things down were, frankly, pathetic. Over about 10 weeks, I accidentally crushed my ego enough to learn how to evaluate my own writing, and from that much followed. The grant was rejected, but I got so much out of the experience of writing it that I still judge it a success.

Order and brevity

Let's look at a concrete example that shows how important it is to think about the order in which things are presented and how shorter writing is often better. The example is something that I remember spending a long time grappling with: an overview of meta-tracing. This was first needed for the approaches to interpreter composition paper, where we wanted to give a rough idea of meta-tracing to unfamiliar readers (who would otherwise find much of the paper tough to understand). Overview sections are often tricky: they have to summarise and simplify related work in a way that doesn’t overwhelm unfamiliar readers, while being accurate enough not to offend people who know the area in depth. This sort of writing ruthlessly exposes holes in one’s understanding and writing abilities.

I'm going to use the first paragraph from Section 2 of the “approaches” paper as my example. For the points I’m about to make, it hopefully doesn’t matter too much if you understand meta-tracing or not. Here's the first draft I wrote (minus citations):

Meta-tracing takes an interpreter and creates a VM which contains both the interpreter and a tracing JIT compiler At run-time, user programs running in the VM begin their execution in the interpreter. When a 'hot loop' in the user program is encountered, the actions of the interpreter are traced (i.e. recorded), optimised, and converted to machine code. Subsequent executions of the loop then use the fast machine code version rather than the slow interpreter. Guards are left behind in the machine code so that if execution needs to diverge from the path recorded by the trace, execution can safely fall back to the interpreter.

I have long since forgotten how long that took to write, but 10-15 minutes is a reasonable guess. At first glance, this paragraph may seem to do a reasonable job of explaining things, once one ignores minor issues like the missing full stop before “at run-time”. However, there’s one glaring problem which took me or someone else (I'm no longer sure who) some time to realise: not only will most readers be unfamiliar with “meta-tracing”, they’ll be unfamiliar with “tracing”, which the above completely fails to mention. That means that this explanation almost entirely fails to achieve its main purpose. After various minor edits to the section as a whole, eventually I inserted a reference to (non-meta) tracing. Between the first draft above and the final published version below, 24 commits [5] were made to the section as a whole (mostly by me, but with some by Edd Barrett):

This section briefly introduces the concept of meta-tracing. Meta-tracing takes as input an interpreter, and from it creates a VM containing both the interpreter and a tracing JIT compiler. Although tracing is not a new idea, it traditionally required manually implementing both interpreter and trace compiler. Meta-tracing, in contrast, automatically generates a trace compiler from an interpreter. At run-time, user programs running in the VM begin their execution in the interpreter. When a 'hot loop' in the user program is encountered, the actions of the interpreter are traced (i.e. recorded), optimised, and converted to machine code. Subsequent executions of the loop then use the fast machine code version rather than the slow interpreter. Guards are left behind in the machine code so that execution can revert back to the interpreter when the path recorded by the trace differs from that required.

The paragraph is now quite a bit longer, but at least it explains tracing. However, it does so poorly (which, as the git history clearly shows, is entirely my fault and not Edd's). Consider the sentences:

Meta-tracing takes as input an interpreter, and from it creates a VM containing both the interpreter and a tracing JIT compiler. Although tracing is not a new idea, it traditionally required manually implementing both interpreter and trace compiler. Meta-tracing, in contrast, automatically generates a trace compiler from an interpreter.

In order, these three sentences: define meta-tracing; reference tracing without explaining why it’s relevant; and somewhat obliquely explain why meta-tracing is different from tracing. To add insult to injury, the third of those sentences partly repeats the content of the first sentence. The problem with this jumble of concepts isn’t just that it’s poor style: by forcing readers to keep switching the concepts they’re reasoning about, it makes it hard for them to concentrate on the crucial aspects. Unfortunately, I either didn’t notice this problem, or couldn’t work out how to fix it, before publication.

Fortunately for us, the story doesn’t end there. Some time later we wrote a paper called fine-grained language composition: a case study which also needed an overview of meta-tracing. I first copied in the paragraph from the “approaches” paper unchanged [6]. Very soon afterwards (probably because of the benefits of reading the paragraph with a fresh mind) I started addressing the problems noted above. With substantial help from Carl Friedrich Bolz and Edd Barrett, 28 commits to the section as a whole later, the final published version looks as follows:

Tracing JIT compilers record hot loops ('traces') in an interpreted program, optimise those traces, and then compile them into machine code. An individual trace is thus a record of one particular path through a program's control flow graph. Subsequent executions which follow that same path can use the machine code generated from the trace instead of the (slow) interpreter. To ensure that the path followed really is the same, 'guards' are left in the machine code at every possible point of divergence. If a guard fails, execution then reverts back to the interpreter.

Meta-tracing JITs have the same basic model, but replace the manually written tracer and machine code generator with equivalents automatically generated from the interpreter itself.

The explanation is now split into two paragraphs. The first has a concise explanation of tracing; the second briefly explains what's different about meta-tracing. Not only is this a much easier flow for the reader, it also has a better explanation of guard failure and, as a pleasant bonus, it's 10% shorter than the “approaches” version. I'm quite proud of this version but notice how long it took us to get there: around 50 commits to the section as a whole (most of which touched the paragraph above). Simplicity and clarity are not easily obtained, but they are always worth striving for.
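
For readers who have never encountered tracing before, the model is perhaps easiest to see in miniature. The following toy sketch is my own illustration for this blog post (it is not from either paper, and it glosses over nearly everything a real VM does): a straight-line “trace” of one hot path through a loop, with a guard that falls back to the general interpreter when the path diverges.

# Toy illustration of tracing: a straight-line "trace" of one hot path,
# with a guard that bails back to the general (slow) interpreter.

def interpret(xs):
    # The general interpreter: handles every case, but slowly.
    total = 0
    for x in xs:
        if isinstance(x, int):
            total += x
        else:
            total += len(x)  # e.g. strings contribute their length
    return total

def traced_loop(xs):
    # Suppose the hot loop was first observed running on ints only: the
    # recorded trace assumes ints, and a guard checks that assumption on
    # every iteration. If the guard fails, we fall back to the interpreter
    # for the rest of the work (a real VM would be considerably cleverer).
    total = 0
    for i, x in enumerate(xs):
        if not isinstance(x, int):            # guard
            return total + interpret(xs[i:])  # guard failure: bail out
        total += x                            # the fast path in the trace
    return total

assert traced_loop([1, 2, 3]) == interpret([1, 2, 3])
assert traced_loop([1, "ab", 3]) == interpret([1, "ab", 3])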

Writing as part of a team

It's been a long time since I've written a paper on my own. Outside of this blog, nearly all my writing is done in teams of 3–5 people. If my experience is anything to go by, the standard model of collaborating in team writing is for most people to write up the technical bits of a paper they were responsible for, with one person handling the “other” bits (e.g. the introduction and conclusions). Some teams make this model of writing work, but a lot (including some I've been part of) don't, for two different reasons: different people often have jarringly different writing styles [7]; and vital information is often missing, because no individual feels solely responsible for it. It's worth looking at both reasons in more detail.

Sometimes academic writing advice seems to suggest that there is one true writing style to which we should all aspire. This is nonsense, especially with a language as flexible as English: there are often multiple ways of saying the same thing that are all roughly as good as each other. Which of those alternatives one chooses is largely a matter of personal style. For example, my personal style tends towards fairly short sentences and lots of punctuation. Some people I know prefer long sentences and the complete absence of the colon and semi-colon. Obviously I prefer my own style, but that doesn't generally mean the alternative is wrong — just different. [8] The problem comes because readers of technical papers are very alert to minor changes in use of language, and such changes are much more common when different authors fail to homogenise their style. For example, referring to “dynamic compilation” in one part of a paper and “adaptive compilation” elsewhere makes a thoughtful reader wonder if the different phrasings denote a real difference or not. Jolting a reader out of their reading flow in order to ponder such a question is not just a waste of time: the more often it happens, the more likely it is that the reader will lose sight of the big picture.

The problem of missing information is both less obvious and more serious. I frequently read papers which, while seemingly sensible at the level of individual sections, don’t make much sense as a whole. In rare cases this is because the paper is technically flawed. More often it’s because there’s a missing part of the explanation that would help me make sense of what I’m reading. Occasionally I’m able to deduce precisely what the missing information must have been but, more often than not, I’m simply left confused.

Missing information is sometimes simply due to an oversight, or because the authors considered it unworthy of inclusion (generally because “everyone knows that”). However, my experience is that, when writing in a team, a third reason is equally common: no individual feels responsible for that piece of information, implicitly assuming that it’ll appear elsewhere in the paper. Inevitably, therefore, the information is never included anywhere. I’ve occasionally seen people try to solve this by cramming huge numbers of definitions into a paper’s introduction. This then makes the start of the paper both a depressing and an obfuscatory read: the start of the paper should outline the story, not try and fill in every blank.

The rewriter approach to team writing

The approach I use to avoid standard team writing problems is for one person (I’ll call them the “rewriter”) to take overall responsibility for the paper’s writing, ensuring that the paper’s content and style are consistent and complete. The only way I’ve found to make this happen is for the rewriter to examine, and in general rewrite, every contribution to the paper. Most obviously this means that the paper ends up written in a single style. Less obviously it means that one person is forced to understand everything in the paper, lessening the chances of authors telling an inconsistent or incomplete story. The rewriter doesn’t need to be a mini-dictator — indeed, it’s best if they try and work by consensus when possible — but they do need to be unflagging in trying to maintain a consistently high standard across the paper.

I now apply the rewriter idea to nearly all the papers I'm involved with (though I'm not always the rewriter!) and I've learnt a few things since my first experience with this way of writing [9]. First, it relies on individuals being willing to forego their individual writing style and to trust that the rewriter will do a good job. Second, it requires people to be much more open than normal to being questioned about what they’ve written, and to be willing to answer each and every question.

Sometimes the rewriter will read something, realise that they don’t understand the explanation, and ask for clarification before rewriting can commence. Sometimes the rewriter will not have the same assumptions as the original author, misunderstand the original text, and then rewrite it incorrectly. The good news is that the rewritten version is nearly always clearly incorrect: the original author then needs to request changes until the rewritten text becomes accurate. [10] And, more often than you’d expect, the original author realises that there was an aspect of the research that they hadn’t previously considered, and they need to do more technical work before answering the query.

Let me give you a simple example which I quickly plucked out of the VM warmup paper. Carl Friedrich drafted the following text (intentionally quickly, as the style we had got used to was “draft something quickly, run it past the rewriter, then take another pass at it”) [11]:

[Screenshot: Carl Friedrich's draft text.]

In this case, I knew that there was an additional factor not mentioned in this text that needed including: we had discovered something about lazy loading interfering with our measurements, although I wasn’t very sure what the “something” was. I thus rewrote the text based on that additional factor but, since I didn’t know much about it, I also added a couple of questions at the same time:

[Screenshot: my rewrite of the text, with two inline questions.]

Carl Friedrich then replied:

[Screenshot: Carl Friedrich's reply, answering one question with a comment and the other with an edit.]

There are two changes in this version. Most obviously, Carl Friedrich responded to my first question with a comment (and notice that both of us are trying to understand what is going on rather than trying to “win” a debate). Less obviously, he responded to my second question by editing the text and deleting the comment. I’ve always encouraged this way of editing: there’s no point in people playing endless comment ping-pong simply because they’re too afraid to take decisive action. In this case, it was my job to review the edit that had been made (I simply looked at the relevant diff in gitk); had I been unhappy with it, I would have reinstated the text with an additional comment (perhaps “I’m not sure my original question made sense, so let me try again”). As is generally the case, I had no problem with this particular edit.

There were several rounds of question-and-answer (see e.g. 1, 2, 3, or my personal favourite 4) before I eventually felt I understood what had gone on enough to rewrite the text to the following:

[Screenshot: the rewritten text.]

That text, somewhat simplified, is clearly recognisable as that in the final paper. Notice that neither Carl Friedrich nor I would have been able to get the text to that quality on our own: it required working together, being open to questions, and not giving up until we were both satisfied that readers would understand the points we were trying to make.

How long does it take to write a paper?

I remember being told early in my career that, once the research is complete, writing a paper takes about a week. I was tortured by this for several years: I rarely got anything out of the door in under a month. And I’ve got much, much worse as my standards have been raised: I reckon that our major papers (for want of a better term) take at least 8-12 person weeks to write. [12] More complex papers take even longer. As a rough proxy, the VM warmup paper took 999 commits from start to finish (and, as can be seen on arXiv, 7 public releases). Many of those commits (particularly towards the end of writing) are small; some (particularly nearer the beginning of writing) are big. But, either way you look at it, a lot of effort went into the writing. It’s one reason why I’m not a prolific writer of papers: much to the distress of most of the bosses I’ve had, I’d rather try and write 1 good paper a year and fail than succeed at writing 20 mediocre ones.

And while it’s hardest to build the momentum to write the first few words, it’s not trivial to build and maintain the daily momentum needed to complete a paper. On a typical day writing ‘fresh’ prose, I aim to write somewhere around 800-1000 words (I don’t count precisely). Sometimes those words come easily, but sometimes they have to be forced out. I do make sure that I never leave a day empty-handed though.

One of the hardest things to know is when a paper is finished. One way is to use an external deadline – e.g. the submission date for a conference – as a proxy for this. In general, this tends to lead to a mad rush in the last few days before the deadline, hugely increasing the chances of mistakes creeping in. I prefer to let papers mature for a lot longer than this, because it continues to amaze me how many mistakes become apparent after a good night’s sleep or two. However, taken too far, this is a recipe for never finishing: I think it’s vital to feel that one is continually moving, even if gradually, towards an end goal. To that end, I try and solidify aspects of the paper as I’m going along, so that I’m not tempted to end up in an endless loop of modifying the same things. Typically, I first satisfy myself that the paper’s structure is stable; then that all the necessary content is in; then that the prose is clear and consistent; and finally that there are no typos left. The movement between those stages is almost exclusively one way: when I’m looking for typos, for example, I won’t suddenly decide to move sections around.

At some point the decision has to be taken that the paper is finished and can be released to the world. If I’ve done my job well, the paper will hopefully find an interested audience. But these days I’m realistic enough to know that the paper won’t be perfect: it won’t be as clear as it should be and it will contain mistakes.

Several years ago, I wrote an overview paper on dynamically typed languages where I tried to collect together a lot of things I’d learnt on the topic. Because I knew that this was an area about which many people have strong opinions, I realised that I would have to do a better job than I'd done on any of my previous papers. I tried to buff the paper to the shiniest of sheens and, when it was published, I patted myself on the back for a job well done. A little while after publication I got the following email [13]:

I'm just reading your article on dynamically typed languages from 2009, and I'm stunned. It's fantastically written, neutral and smart.

At this point, my ego – never the most fragile of objects – had swollen to roughly the size of a small planet. Then I read the next sentence:

So how on earth could you misspell "growth" as "grown" in the abstract?!

I checked, and, indeed, I had made a typo in the first sentence of the abstract – in other words, the 13th word in the paper was wrong. My ego shrank back down to a more appropriate size. I fixed the mistake in the HTML version but the PDF retains the error, as a reminder to my future self of my own fallibility.

Wrapping up

There are many other things I could say but if I had to boil everything I’ve learnt so far about paper writing into one sentence it would be this: you don’t need to be hugely talented with words (I’m not), but you have to be willing to continually reread, rethink, and reword everything you’ve written. When writing in a team, “you” particularly means “the rewriter”, if you choose that model. Ultimately the goal is simple: will the reader be able to understand the points you wanted to get across, without undue effort? Bad writing, I suspect, comes from people who think their first effort is good enough. Good writing comes from people who know that every section, every paragraph, and every sentence will have to be pored over multiple times: and, just when they think they’ve perfected the writing, getting external feedback might cause them to reevaluate everything. Sometimes this effort can feel like treading in Sisyphus’ footsteps but, in my experience, it’s effort well spent.

Acknowledgements: My thanks to Edd Barrett, Carl Friedrich Bolz-Tereick, Lukas Diekmann, and Sarah Mount for thoughtful comments. Any errors and infelicities are my own.

Footnotes

[1] Blogs like this one have their purpose, but I feel they're solving a different problem to papers. Put crudely, I use my blog to explain things to people who want a general overview, whereas papers contain all the nitty-gritty details.
[2] Unless you're already familiar with Kuhn's work, Dick Gabriel's wonderful paper about incommensurability in programming languages is, in my opinion, required reading. In short: when we read others’ work, particularly if it’s a little old, we often lack the context that all contemporary readers shared; we are therefore liable to misunderstand or dismiss important parts of the explanation. A simple example is the word “computer”. Before the 1940s this referred to people who performed calculations; currently it refers to machines. Unless you’re aware of that fact, you’ll get a completely incorrect impression when reading pre-1940s material that uses the term “computer”. As a collection of syllables, ‘incommensurability’ is an annoying mouthful: but the idea is simple, powerful, and describes a concept that is surprisingly frequently encountered. Dick brings it to life with a very interesting concrete example.
[3] The common advice is to write the abstract last, once you've got the rest of the paper written. This clearly works well for many people, but whenever I've tried doing so, I've found that my writing is aimless and disjointed.
[4] When I’m reviewing papers, I write a series of small comments in the order that I come across them. It’s quite common for me to write “On p2 I don’t understand X” and then to update it later by saying “Ah, this is explained on p5”.
[5] I used git blame -L to find the relevant commits.
[6] This is sometimes referred to as "self-plagiarism" and some authors will write such sections anew each time to ensure they can't be accused of it. Since background material is generally short and doesn't really form part of a paper's contribution, I don't have a problem with people reusing small amounts of such material: the problem, in my opinion, is when people copy supposedly new technical contributions between papers. In this case I even made clear to myself and my co-authors what was going on in the commit message:
Integrate a brief description of meta-tracing.

The first and last paragraphs are directly swiped from 'approaches to interpreter composition'. The middle paragraph is new, as I suspect we will need to talk about hints to explain why cross-language optimisations perform well.
[7] This effect is often exaggerated when different co-authors have different mother tongues: it's generally not difficult to tell a French author writing in English versus a German author, for example. If this sounds like Anglo-snobbery, I really don't mean it to be: I'm still struggling with one language, and am in awe of people who can express themselves in multiple languages.
[8] That said, there are a few things I consider universally bad style for technical papers. The chief style crime I see amongst junior authors is a tendency to think that the following idiotic trope (repeated by seemingly every school English teacher) is true: “never repeat a word or phrase you’ve used recently.” Perhaps this is good advice for fiction, but it’s terrible advice in technical papers, where it’s vital to use the same word(s) to refer to the same concept throughout a paper.
[9] Personally I stumbled across the "rewriter" way of working entirely by accident. I was part of a research project that ultimately turned out to not really need my skills. When it came to writing up the resulting research, I felt like a spare limb, so to make myself feel useful I offered my services as a copy-editor. This was accepted by the rest of the team without any of us realising what we'd signed up to. Whenever someone added a new part to the paper, I would read it very carefully. If I couldn't make any sense of it, I would add questions inline asking for clarification. If I could make sense of most or all of it, I rewrote the whole lot in my style. I viewed it as my job to make sure that the paper flowed, stylistically and technically, from start to finish, and that everything within was adequately explained.

The downside of this approach was that I almost drove my co-authors to murder because none of them had encountered such repeated questioning before. However, after a bit of time had passed (and the paper had done reasonably well), we all agreed it was better written than if we'd taken a traditional writing approach. I hasten to add that this is not because the paper used my style, but simply because it used one person's style consistently, and that person had also ensured that all the information the paper needed was contained therein.

[10] My personal tactic is to turn seemingly ambiguous things into definite statements. For example, “This is smaller than some other objects” might become “This is smaller than most other objects” or “This is larger than most other objects” depending on my intuition about which way the ambiguity is best resolved. In the best case, I will have rewritten things more clearly than they were before. In the worst case I have made it both more likely that the original author will spot problems and that they will feel the need to correct things.
[11] The “screenshots” below are SVGs extracted from the PDF generated from LaTeX. The inline comments are a macro that Roel Wuyts used on a shared article we wrote many moons ago. I've shamelessly used it ever since. I can't remember who added colour to the comment macros, but it really helps them stand out from the normal text. These days I use “how much red is there on a page” as a good proxy for “how close is this text to being finished?”
[12] How long did it take to write this blog post? Spread over about 8 months, it took about 7 days (partly because I wrote some material that external comments convinced me didn't really fit in). It's one reason why I'm not a prolific blogger, although the more significant reason is that I generally lack anything worth my time writing and your time reading.
[13] I wish I could write emails as succinct and amusing as this one. While I hugely admire laconic humour in others, I’m not generally capable of editing my thoughts down quite so ruthlessly.


What Challenges and Trade-Offs do Optimising Compilers Face?

June 22 2017

After immersing oneself in a research area for several years, it's easy to feel like the heir to a way of looking at the world that is criminally underrated: why don't more people want what I want? This can be a useful motivation, but it can also lead to ignoring the changing context in which most research sits. So, from time to time, I like to ask myself the question: “what if everything I believe about this area is wrong?” It’s a simple question, but hard to answer honestly. Mostly, of course, the only answer my ego will allow me is “I can’t see anything I could be wrong about”. Sometimes I spot a small problem that can be easily fixed. Much more rarely, I've been forced to acknowledge that there’s a deep flaw in my thinking, and that the direction I’m heading in is a dead-end. I’ve twice had to face this realisation, each time after a gradual accumulation of unpleasant evidence that I'd tried to ignore. While abandoning several years’ work wasn’t easy, it felt a lot easier than trying to flog a horse I now knew to be dead.

All of this means that I’m probably a little better disposed to contrarian thinking than I was when I started as a researcher. And so it is that I’m going to write about some thoughts that were initially inspired by a (somewhat controversial) talk by Daniel Bernstein. The talk itself, entitled “Death of Optimizing Compilers” (slides and audio), was presented at ETAPS 2015 in London. I happened to be present for the talk and have engaged in mild advocacy both immediately afterwards and in the two years since. The problem is that the case against is quite easy to state but the case for is rather hard — or, at least, I haven’t managed to frame an argument good enough to convince anyone. But I hope I might now have sharpened my argument enough to make another attempt worthwhile.

Suitably condensed, the talk’s thesis is fairly simple. As software systems become bigger and bigger, an ever smaller percentage of code dominates execution time [1], especially as computers are asked to crunch ever bigger chunks of data. However, optimising compilers (e.g. gcc -O2 or even a VM like HotSpot) are outclassed by humans on the most performance critical code. Because of that, optimising compilers are operating in a region between a problem that’s not worth solving (most parts of most programs) and a problem they can’t solve (the really performance critical parts). It doesn’t take much thought to realise that this last point has been the source of controversy: how can someone belittle the huge amounts of effort put into optimising compilers?

Daniel used his own (very cunning) ChaCha20 algorithm as an example of where a human can significantly outdo anything a compiler can. ChaCha20 is effective because the algorithm was invented with the strengths of modern processors in mind: its design and implementation are intricately intertwined in a way that no compiler could ever expect to match. [2]

It may not surprise you that I don’t agree with Daniel’s conclusion. However, sometimes looking at things from an extreme angle can help bring forgotten aspects into focus. So, while I’m not claiming to have any novel insights, here is my attempt to enumerate some often-forgotten (at least by me) issues which limit the effectiveness of optimising compilers.

Balancing program performance and compilation time

We all know that there's a trade-off between optimisation and compilation time: the more optimised you want your program to be, the longer it takes to compile. At one extreme, language implementations like CPython and CRuby have low compilation costs [3] and little or no optimisation costs, but have consequently poor performance. At the other extreme, superoptimisation tries vast numbers of machine-code combinations before picking the fastest. This means that a tiny change to even the tiniest function can take many, many hours to optimise: scaling this up to even “small” functions could be a challenge. Slightly less extreme are C compilers like gcc and clang. These put considerable effort into optimising programs, but generally compile even medium sized files in at most a second or two. [4]
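
To give a flavour of why superoptimisation is so expensive, here is a deliberately silly toy sketch (mine, and nothing like a real superoptimiser, which searches over machine instructions and checks equivalence far more carefully): it exhaustively enumerates short sequences of simple operations and tests each against a reference function on a handful of inputs. Even with four operations and a maximum length of three, the search space grows quickly.

# Toy "superoptimiser": exhaustively search short sequences of simple
# operations for one that matches a reference function on test inputs.
# Real superoptimisers work on machine code and verify equivalence; this
# merely tests a handful of inputs.
from itertools import product

OPS = {
    "double": lambda x: x * 2,
    "inc":    lambda x: x + 1,
    "shl3":   lambda x: x << 3,
    "neg":    lambda x: -x,
}

def search(reference, max_len=3, tests=range(-5, 6)):
    # Try every sequence of operations up to max_len long.
    for length in range(1, max_len + 1):
        for seq in product(OPS, repeat=length):
            def run(x, seq=seq):
                for name in seq:
                    x = OPS[name](x)
                return x
            if all(run(t) == reference(t) for t in tests):
                return seq
    return None

# Find a short sequence equivalent to the reference function 8 * x + 1.
print(search(lambda x: 8 * x + 1))   # prints ('shl3', 'inc')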

Would developers be happy if we suddenly increased compilation times by, say, a factor of 10 if it led to a 5% performance increase? Probably not. One possible solution is to add a “really good optimisation but slow to compile” mode to compilers. However, precedent suggests that such a mode will be little used. For example, gcc’s highest optimisation level -O3 is often treated with a mix of fear and disdain: partly because it often doesn't improve performance [5], but also because it's thought to more often produce buggy code. Whether this is still a justifiable position, whether it's because of compiler bugs, or whether it's due to user bugs that only happen to show up when highly optimised is almost irrelevant: if no-one is prepared to use such a mode, it might as well not be there. OpenBSD, long my operating system of choice, uses -O2 exclusively for both the base operating system and packaged third party software. Another possibility is profile-guided optimisation, which I'll touch on later.

We don't know which parts of a program to optimise

In my experience, one of the easiest ways of distinguishing a seasoned programmer from those early in their career is their approach to “manual” optimisation. When presented with a slow running program, junior programmers tend to quickly identify what they think is the bottleneck, and start hacking away at it. Seasoned developers, in contrast, tend to have realised from bitter experience that their initial guesses about bottlenecks are nearly always wrong, and first profile performance.

Sometimes, even standard profiling doesn’t help. My standard approach is to profile a program, identify the longest running function(s), and then stare at it to see if I can improve it. This works in many cases, but fundamentally profiling can only highlight symptoms, not prove causes. For example, if I look at the longest running function and it’s a bubble sort, I can quickly change it to a faster sorting algorithm. But if it’s a Timsort function, there probably isn’t much I can obviously do to fix it. The problem could be elsewhere (e.g. more than once I’ve seen sort functions used as part of a de-duplicator function, which is a slow way of solving that particular problem) and may be very subtle (e.g. bad data structures can cause much more memory to be allocated than necessary). Occasionally, though rarely, I can work out a likely culprit, but judge that the necessary rewrite is too big to justify the gain. But, in most cases, I simply can’t work out if there is anything which would benefit from further optimisation or not.
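
For what it's worth, the profiling I have in mind is nothing fancier than a language's stock tooling. In Python, for example, something like the following is enough to see which functions dominate a run (work and helper are stand-ins for a real program):

# Profile a run and list the functions with the largest cumulative time.
# work() is a stand-in for whatever program is being investigated.
import cProfile
import pstats

def helper(n):
    return sum(i * i for i in range(n))

def work():
    return [helper(20_000) for _ in range(200)]

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)  # top 5 entries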

We can now ask a simple question: if it’s difficult for a human to look at the runtime performance of a specific program and work out what should be optimised, how difficult is it for a compiler? The answer, I fear, is that the compiler has, by far, the more difficult job. Static compilers implicitly make a guess about how a program will execute – without having the luxury of actually executing it [6] – and optimise based on that guess. In some cases, we can make accurate guesses (e.g. we can work out statically how long the machine code sequence for 3 * 4 will take versus 12) but in most we can’t (e.g. will a branch that represents a loop never be taken, taken occasionally, or taken often?). The latter quickly overwhelm the former, meaning that the compiler can have no real idea of which parts of a program are likely to be bottlenecks.
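
The 3 * 4 example is not hypothetical, incidentally: even CPython's bytecode compiler folds it at compile time, as the dis module shows (the exact bytecode differs between CPython versions, but the folded constant is there in all recent ones):

# CPython constant-folds 3 * 4 when compiling: the bytecode loads the
# constant 12 directly rather than multiplying at run-time.
import dis

dis.dis(compile("x + 3 * 4", "<example>", "eval"))
# The output contains a LOAD_CONST of 12: the multiplication has already
# happened. No comparable trick is available for predicting how often a
# branch will be taken.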

The only safe choice the compiler can thus make is to try and optimise all parts of a program, though this has two issues. First, the trade-off between compilation time and optimisation rears its head again. Since some optimisations are either expensive or rarely applicable, we can’t afford to apply them everywhere, especially as they would mostly have no useful runtime effect. Second, optimisations are heavily order dependent. Imagine I have three optimisations A, B, and C: applying A then B might produce code that C can't do anything with; but applying B then A might produce code that C can optimise. In this simple case we could try both possibilities and, if we're lucky, work out which outcome is better and use that. In a real compiler, with a very large number of optimisations, combinatorial explosion is real and this is impractical. This means that even though our compilers sometimes contain all the optimisations they need to produce high-performing code, they don't run them in the right order (or can't run them for long enough) to optimise all parts of a program. [7] In practice, therefore, trying to optimise all parts of a program means doing a so-so job on all parts of a program. There are two techniques which try and identify the sub-parts of a program that would most benefit from heavy optimisation.
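
Before turning to those two techniques, here is a tiny illustration of the ordering problem (two made-up passes over a toy expression form, not taken from any real compiler): constant folding can only remove the addition once constant propagation has replaced the variables, so running the same two passes in the opposite order achieves nothing.

# Two toy passes over expressions of the form ("+", lhs, rhs), where
# leaves are ints or variable names whose constant values are known.
CONSTS = {"a": 2, "b": 3}

def propagate(e):
    # Constant propagation: replace known variables with their values.
    if isinstance(e, str):
        return CONSTS.get(e, e)
    if isinstance(e, tuple):
        op, l, r = e
        return (op, propagate(l), propagate(r))
    return e

def fold(e):
    # Constant folding: evaluate "+" when both operands are already ints.
    if isinstance(e, tuple):
        op, l, r = e[0], fold(e[1]), fold(e[2])
        if op == "+" and isinstance(l, int) and isinstance(r, int):
            return l + r
        return (op, l, r)
    return e

expr = ("+", "a", "b")
print(fold(propagate(expr)))   # 5: propagate, then fold
print(propagate(fold(expr)))   # ('+', 2, 3): folding first achieves nothing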

The first is profile-guided optimisation: in essence, one statically compiles a program, profiles an execution or two, then feeds that back to another run of the static optimiser, which uses that information to home in on the parts of a program which will most benefit from optimisation. Doing this can allow a compiler to run its optimisers for longer on smaller parts of a program, rather than wasting its effort trying to optimise the entire program. This can yield useful performance benefits (see e.g. the 10% improvements that Chrome and Firefox reported) but it’s difficult to set up, compilation becomes even slower, and it only helps if all subsequent runs follow the same general pattern as the profiled run(s). I suspect the latter point is the most critical: if people use your program in different ways from run to run, profile-guided optimisation can even degrade performance.

The second is dynamic compilation (i.e. JIT compilation), which can be thought of as live, continuous profile-guided optimisation: every time a program runs, dynamic compilation observes its behaviour, and compiles based on that behaviour. Thus dynamic compilation has the advantage over static compilation that it can much more accurately identify the genuine bottlenecks in a program. However, dynamic compilation has to be even more mindful of the costs of optimisation, since time spent dynamically compiling is bad for the user (draining their battery or simply taking potential CPU time away from the user). Thus while dynamic compilation is the most effective way of determining which parts of a program will benefit most from optimisation, it has much less leeway to use expensive optimisations than static compilation.

Finally, whether static or dynamic, all compilers contain a finite number of optimisations: they may not contain the optimisation that our program needs to run fast. There is a small-ish set of optimisations which it's safe to assume that every optimising compiler (static or dynamic) contains. After that, things become, in my experience, a lot more reactive. If a user shows a compiler developer a program which runs unexpectedly slowly, the compiler developer will probably add, or tweak, an optimisation to make it perform better. Over time, therefore, different compilers tend to do well on an unpredictable subset of all possible programs. You can often see this in the real world where different optimising compilers for the same language perform much better or worse on a given program. Is there anything we can do about this? I don't think so. After all, we can't (or, at least, probably shouldn't) optimise what we haven't seen: benchmarks are, to a large degree, destiny. [8]

We don't know where the performance ceiling is

One of the problems that optimising compiler authors face is that we don’t really know if we’re doing a good job or not. If optimisation makes a given program 50% faster that might be really good, or it might be somewhat mediocre: since we generally don’t know what the best possible performance (what I call the “performance ceiling”) of a program is, we have no real idea how far away we are from it. The closest we come to a “performance oracle” is when there are multiple compilers for a given language: if another compiler makes a program run faster than our compiler, we know we have work to do.

Given that our current static compilers seem to be well within the “diminishing returns on further optimisations” category, it might appear that we're close to the performance ceiling. But, for all we know, we could be stuck in a local maximum. Are there any ways that we might be able to tell? Superoptimisation, mentioned above, suggests that there are potentially many more performance gains to be had, although it's only been applied on small functions. Dynamic compilation is the other obvious possibility: since much more research has been put into static compilers, we don't really know if current comparisons of static and dynamic compilation are the end of the story or not. [9]

Maintaining implicit properties

As Daniel pointed out, compilers have a narrow view of optimisation: they’ll optimise x + 3 * 4 into x + 12, but I don’t know of any that will turn a bubble sort function into a merge sort. In part this is because recognising that a particular piece of code implements bubble sort is hard (think of all the minor, obfuscating, variations). But, more significantly, it’s because compilers are generally unable to prove that such a major change won’t violate properties of the program explicitly intended by the user. For example, what happens if the user’s bubble sort is so coded because they know that the program needs to run in constant memory and they’re prepared to trade-off execution speed for memory? That property is implicitly encoded in the bubble sort, and the compiler has no way of knowing if it was explicitly intended or not.
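As a purely illustrative sketch of the kind of implicit property at stake (the functions below are invented for this example): both sort a list, but the first works in place in constant extra memory, while the second allocates new lists at every level of recursion. A compiler that silently substituted one for the other would change exactly the kind of property the bubble sort's author may have been relying on:

def bubble_sort(xs):
    # In place: no allocation proportional to len(xs).
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

def merge_sort(xs):
    # Asymptotically faster, but builds new lists at every level of recursion.
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out = []
    while left and right:
        out.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    return out + left + right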

Real programs contain countless such examples of implicit properties: indeed, most of us, when programming, don’t even consider how much we expect the optimised version of our program to reflect the program as we wrote it. Partly in recognition of these implicit properties, compilers are often rather conservative in their optimisations: for example, with the notable exception of inlining, they are generally reluctant to perform optimisations that silently increase memory usage. Because of this, a human who can reason about the importance of such properties, and disregard them if they are not important, always has a potential edge over a compiler.

Over-prioritising optimisation has consequences

Experience suggests that over-prioritising optimisation can lead to problems in other areas, notably correctness and security. The most obvious, though by no means the only, example of this is C. Most widely known are the problems with undefined behaviour. To cut a long story short, the C specification says, more often than most other languages, “if you do X, the behaviour is undefined”, at which point the program can do absolutely anything, whether you intended it to or not.

The problem is that the vast majority of such cases cannot be detected statically: that is, it’s often impossible to know if your program could exhibit undefined behaviour or not. Have you remembered, for example, to add a runtime check that the divisor of every non-constant division is non-zero? That one’s well known enough that people will tend to check it. But what happens when signed integer addition overflows? C gives no guarantees about the outcome, and most programs have so many additions that we struggle to reason about which ones we might need to be careful about. You can read more examples here.
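To make the burden concrete, here is a sketch of the guard a careful C programmer must apply before every signed addition, since the overflowing addition itself is already undefined (written in Python purely for illustration, with 32-bit limits assumed):

INT_MAX, INT_MIN = 2**31 - 1, -2**31

def checked_add(a, b):
    # Mirror the check one must write in C *before* computing a + b.
    if b > 0 and a > INT_MAX - b:
        raise OverflowError("a + b would overflow")
    if b < 0 and a < INT_MIN - b:
        raise OverflowError("a + b would underflow")
    return a + b

Multiply that by the number of additions in a non-trivial program and it becomes clear why so many of them go unchecked.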

It goes without saying that a large number of unfortunate bugs (including many security alerts) have occurred because of undefined behaviour. However, the problem has become much worse in recent years as C compilers optimise programs based on the assumption that no undefined behaviour exists. In other words, compilers are now more strictly interpreting the standards which, ironically, means that they have become less conservative in their optimisations. No doubt a few benchmarks increase in speed because of this, but my guess is that it makes only a very small difference to most programs' performance.

I used to think that I could reliably program in C, but the exploitation of “minor” undefined behaviour has undermined my confidence. To add insult to injury, not only does C have far more undefined behaviour than other languages, making it more likely that one will forget some of the less common cases, but the C standard isn't even freely available (though one can get drafts, which are “close” to the final standard). This has led to deep unhappiness on both sides: programmers tend to say “please don’t violate widely held expectations even if the standard doesn’t mention those expectations”; compiler writers have a tendency to say “we’re following the standard, and are making your programs faster: if there’s a problem in your program, you should fix it.” If there’s any single factor which will hasten the decline of C as a widely used language, I suspect this is it. [10]

C compilers are the most obvious example of this problem, but not the only ones. There are at least a couple of cases where, depending on the day of the week, I can convince myself either way. A good example of this is Python's memory semantics. In theory, Python long specified that programmers should not rely on the deterministic destruction of objects. However, since the only implementation of Python was CPython, which is reference counted – and since most people don't read, or remember, the language specification – many programs came to rely in subtle ways on deterministic destruction (e.g. implicitly expecting that file descriptors would be freed at the end of a function). When PyPy came along, it soon came to favour non-deterministic destruction. Doing so allowed PyPy to have a high performing state-of-the-art garbage collector, but it did break a number of programs (e.g. many ran out of file descriptors). Is that a good trade-off? Could, or should, PyPy have used some mix of reference counting and “proper” garbage collection? Since this is pretty much the only intentional “incompatibility” between CPython and PyPy it feels just about reasonable to me, although I'm sure some people feel differently.

Summary

It's important to be realistic: most people don't care about program performance most of the time. Modern computers are so fast that most programs run fast enough even with very slow language implementations. In that sense, I agree with Daniel's premise: optimising compilers are often unimportant. But “often” is often unsatisfying, as it is here. Users find themselves transitioning from not caring at all about performance to suddenly really caring, often in the space of a single day.

This, to me, is where optimising compilers come into their own: they mean that even fewer people need to care about program performance. And I don't mean that they get us from, say, 98 to 99 people out of 100 not needing to care: it's probably more like going from 80 to 99 people out of 100 not needing to care. This is, I suspect, more significant than it seems: it means that many people can go through an entire career without worrying about performance. Martin Berger reminded me of A N Whitehead’s wonderful line that “civilization advances by extending the number of important operations which we can perform without thinking about them” and this seems a classic example of that at work. Even better, optimising compilers are widely tested and thus generally much more reliable than the equivalent optimisations performed manually.

But I think that those of us who work on optimising compilers need to be honest with ourselves, and with users, about what performance improvement one can expect to see on a typical program. We have a tendency to pick the maximum possible improvement and talk about it as if it's the mean, when there's often a huge difference between the two. There are many good reasons for that gap, and I hope in this blog post I've at least made you think about some of the challenges and trade-offs that optimising compilers are subject to.

Acknowledgements: My thanks to Edd Barrett, Martin Berger, Carl Friedrich Bolz, Lukas Diekmann, and Sarah Mount for substantial and insightful comments on early drafts. Any errors, omissions, or infelicities, are my own.

Footnotes

[1] Most readers will be familiar with Knuth’s quip that “premature optimisation is the root of all evil.” However, I doubt that any of us have any real idea what proportion of time is spent in the average part of the average program. In such cases, I tend to assume that Pareto’s principle won't be too far wrong (i.e. that 80% of execution time is spent in 20% of code). In 1971, a study by Knuth and others of Fortran programs found that 50% of execution time was spent in 4% of code. I don't know of modern equivalents of this study, and for them to be truly useful, they'd have to be rather big. If anyone knows of something along these lines, please let me know!
[2] Daniel finished with an idea as to how this limitation could be overcome: instead of compilers being a black box, programs should be able to interact with them to help the compiler optimise them. While there is some work in this general direction (see e.g. Haskell’s approach), I think that pushing it as far as Daniel would like is hard. Certainly, I’ve done enough work in macros and compile-time meta-programming to have some idea of how difficult it is to find a good design for interactions between a program and compiler.
[3] Though it is important to note that both do compile to bytecode internally. The oft-stated distinction between “interpreted” and “compiled” languages is a red-herring: any language can be interpreted or compiled (and some language implementations interpret compiled code).
[4] Interestingly, modern languages seem much happier with lengthier compile times. rustc, which I use a lot, is not fast, even with optimisations turned off; and scalac, when I last used it, was even slower for small files (though since, I think, it doesn't do full program optimisation statically, I imagine it's faster in some senses than rustc at compile-time). I can't say why this is for sure, although I do have a potentially plausible explanation.

Take gcc as a concrete example: its first release was in 1987. At that point, I believe that the Cray-2 was the second fastest machine on the planet operating at 2 GFLOPS (at something like £25,000,000 in today's money); for reference, my 2016 i7-6700K desktop processor (£301) is something like 80-115 GFLOPS. Using GFLOPS as a measurement of speed of anything is dangerous, particularly when it comes to compilers (which do few, if any, floating point operations). But the general trend is clear: a modern commodity processor is a couple of orders of magnitude faster than a 30 year old supercomputer. It's safe to assume that compiling programs took quite a bit longer in wall-clock time in 1987 than today, and that therefore people would have put quite a lot of effort (consciously and unconsciously) into making compilers run acceptably fast.

Fast forward to today, and the optimisations that made a compiler run not-too-slow in the late 1980s make it blazingly fast today. In contrast, modern language designers and compiler implementers are happy with compile times that are, relatively, much slower but still, in absolute terms, just about acceptable (perhaps around a couple of seconds for a medium-large file, as a rough rule of thumb; which, I would guess, is probably faster in wall-clock time than a C compiler in 1987).

As a ridiculously simple comparison, compiling "hello world" on my laptop (median value over 10 runs – I'm not pretending this is a paper-strength methodology) takes 0.037s for gcc, 0.173s for Go, and 0.287s for Rust.
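For what it's worth, a comparison like that only needs a few lines to reproduce; the sketch below (the file names are hypothetical, and it is no more a paper-strength methodology than the numbers above) takes the median wall-clock time of 10 compile runs per tool:

import statistics, subprocess, time

COMMANDS = {
    "gcc":  ["gcc", "hello.c", "-o", "hello"],
    "go":   ["go", "build", "hello.go"],
    "rust": ["rustc", "hello.rs"],
}

for name, cmd in COMMANDS.items():
    times = []
    for _ in range(10):
        start = time.monotonic()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.monotonic() - start)
    print(name, round(statistics.median(times), 3))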

[5] In the context of LLVM/clang it appears that -O3 optimises no better than -O2 (at least in 2013 when that paper was written).
[6] Abstract interpretation is an attempt to do this. It's got real uses in bug finding, but I'm not sure that it would do very well at performance prediction.
[7] Jamey Sharp has written an approachable overview of this problem and some of the work in the area.
[8] My experience suggests that this is a more common issue in dynamic optimising compilers, which in some sense have more opportunities – or, at least, more information – to optimise, but also more constraints (chiefly time). However, this may partly be a function of the richness of modern languages, which are more likely to be dynamically compiled and optimised.
[9] My gut instinct is that many programs in modern languages such as Haskell would run faster if they were dynamically (rather than statically) compiled. The basic reason is simple. From a runtime perspective, the pattern matching that is common in such languages is just dynamic typing in disguise (the fact that the pattern matching statically enforces exhaustiveness is of no use beyond the type checker). Thus the static compiler is either going to decline to inline the branches of pattern matches, or it'll guess (often wrongly) that one branch is more commonly taken, or (less likely) it'll try and inline all branches. Since inlining makes pretty much every other optimisation much more effective, this inability to statically optimise must cause performance issues. However, I should be clear: I currently have no evidence for this other than my gut instinct, which I suspect might not stand up in a court of justice.
[10] There have been several attempts to define a “safe” subset of C (e.g. this) that is immune to the most egregious examples of undefined behaviour, but it's not clear that any of these will gain sufficient momentum to become widely accepted. That's a pity, but such is life.


Fine-grained Language Composition

September 21 2016

Programming languages are anti-social beasts: while they're happy to communicate extensively with the operating system that launched them, they're reluctant to talk to other programming languages. Most, reluctantly, make an exception for low-level languages, which have a natural mapping onto the low-level system ABI, most notably C. Indeed, used without qualification, the term ‘Foreign Function Interface’ (FFI) is typically assumed to refer to an interface to C, as if it was the only “foreign” programming language in existence.

Programming languages therefore resemble islands: each language defines its own community, culture, and implements its own software. Even if we want to travel to another island, we often find ourselves trapped, and unable to do so. The only real exception to this is languages which run on a single Virtual Machine (VM), most commonly a Java VM. These days, many languages have JVM implementations, but it can sometimes seem that they've simply swapped their expectations of an FFI from C to Java: non-Java JVM languages don't seem to talk to each other very much.

We're so used to this state of affairs that it's difficult for us to see the problems it creates. Perhaps the most obvious is the huge effort that new languages need to expend on creating libraries which already exist in other languages, an effort far greater than the core language or compiler requires. Less obvious is that the initial language used to write a system becomes a strait-jacket: we almost always stick with our original choice, even if a better language comes along, even if no-one is trained in the original language, or even if it simply becomes difficult to run the original language.

Those who are brave enough, foolhardy enough, or simply desperate have three main options for escaping from this mire. First, they can manually rewrite their system in a new language. However, rewriting anything but the tiniest system from scratch in a new language is expensive – often prohibitively so – and extremely risky: it often takes years for the new system to reach parity with the old; and seemingly unimportant features are often ignored in the rewrite, only for users to rebel at their exclusion. Second, they can use translation tools to automatically rewrite their system for them. Perhaps surprisingly, this technique really works, even for very complex systems [1]. However, unless the two languages involved are very similar, it comes at a price: the automatically translated code rarely looks like something a human would have written. This makes maintenance of the translated systems almost impossible, and perhaps explains why these tools have seen little real-world use. Third, they can use some form of inter-process communication to have parts of the system running in different languages. The most obvious cost for this route is that splitting the system into separate chunks generally requires significant refactoring. Less obvious is that the overhead of inter-process communication is so high (depending on your system, roughly 5-6 orders of magnitude slower than a normal function call) that this technique is only practical if the resulting chunks communicate infrequently.

The outlines of a solution

At this point, I'm going to assume that you agree with me that there's a real problem with our inability to move easily between different programming languages. Furthermore, that this is a problem that huge numbers of people (including many very large organisations) suffer from, and that existing solutions will never address. You might then wonder: why don't we hear more complaints about this? After all, people complain endlessly about minor limitations in programming languages, frameworks, and operating systems. I think the reason for the lack of complaint is that people think the problem inevitable. Raging against it would be like raging against the River Severn's tide: whatever you do, it'll swallow you up and deposit you wherever it sees fit.

Recently, I've been involved in research – led by Edd Barrett, with assistance from Carl Friedrich Bolz, Lukas Diekmann, and myself – which, we think, suggests a new solution to this problem. If you’ve already read the accompanying paper, then parts of this blog will be familiar, although I hope there’s enough new things to keep you interested: relative to the paper, I’m going to give a bit more motivation and background, and focus a little less on some of the most minute details.

In essence, what we've done is to compose (i.e. glue) together two real-world programming languages – PHP and Python 2.x [2] – and show that users can move between them seamlessly. We did this by composing two meta-tracing interpreters together, such that, at run-time, the two languages not only share the same address space, but that a single JIT compiler operates across them. Building atop a (somewhat traditional) FFI between the two languages, we allow users to embed syntactic fragments of each language inside the other (e.g. one can embed Python functions in PHP and vice versa, and nest those embeddings arbitrarily deep), and for those fragments to interact as if they were written in the same language (referencing variables across languages and so on). Written down like that, it can appear somewhat intimidating, but hopefully Figure 1 shows that the resulting system, called PyHyp, is fairly intuitive.
Figure 1: A PyHyp program implementing a PCG8 pseudo-random number generator, written in the Eco language composition editor. The outer (white background) parts of the file are written in PHP, the inner (pinkish background) parts of the file in Python. ❶ A Python language box is used to add a generator method written in Python to the PHP class RNG. ❷ A Python language box is used to embed a Python expression inside PHP, including a cross-language variable reference for rng (defined in line 19 in PHP and referenced in line 20 in Python). In this case, a Python list comprehension builds a list of random numbers. When the list is passed to PHP, it is 'adapted' as a PHP array. ❸ Running the program pretty-prints the adapted Python list as a PHP array.

At this point, you might be wondering what jamming together PHP and Python really shows: after all, it's not the first time that people have embedded one language inside another – indeed, I've done my share of language embeddings in the past. Most such embeddings are of a small DSL inside a larger, general purpose programming language. I've seen a few cases of people embedding one programming language inside another, but either the embedded language is tiny (e.g. Lisp) or only a subset of the embedded language is supported (e.g. “Java-like”). There's nothing wrong with such approaches, but it's never really been clear that one can sensibly embed one large programming language inside another.

The approach we ended up taking with PyHyp shows that not only is it possible to embed large programming languages inside each other, but that the result is usable. In essence, we composed together meta-tracing interpreters, resulting in a VM that dynamically compiles composed programs into machine code; and we allow each language's syntax to be directly embedded in the other. Optionally, one can make use of the Eco language composition editor to make the more fiddly parts of the syntactic composition easier to use. Bearing in mind the caveat that Eco is very much a research prototype, we have a reasonable expectation that this general approach will work for a wide variety of other real-world programming languages (indeed, we have also done something similar for Python and Prolog).

It's important to realise that our approach is neither automatic nor magic. Programming languages are made by humans for humans. Composing programming languages together creates new language design issues, which humans must resolve, so that the result is usable by other humans. To be blunt, this was not something that we thought of when we started. In the end, composing real languages turned out to be a bit like assembling flat-pack furniture: some bits don't appear to fit together very well, some take considerable effort to make sense of, and a few appear completely superfluous. We eventually came to think of these challenges as “points of friction” [3]: places where languages don't play nicely with each other. These fall into two categories: semantic friction is where it's difficult to make sense of one language's features in the other (e.g. Python has dictionaries and lists; PHP has only dictionaries; how do we map between the two?); performance friction is where it's difficult to make different parts of each language perform well in conjunction with each other (e.g. cross-language variable lookup is tricky to make fast, since both languages' scoping rules are somewhat dynamic).

What the user sees

What Figure 1 hopefully makes clear is that PyHyp contains all of “normal” PHP and “normal” Python, as well as defining extra rules for composing them together. This is the result of a deliberate design goal: it is vital that normal PHP and Python code works unchanged when it is part of a PyHyp program [4]; we only want users to deal with PyHyp's extra rules when they are writing cross-language code.

So, what are those extra rules? The platitudinous answer is that, as far as possible, we've tried to make them obvious and transparent, particularly if you use Eco. The extra syntactic rules are fairly simple: PyHyp programs always start with PHP [5]; wherever a PHP function is valid, a Python function can be used; wherever a PHP expression is valid, a Python expression can be used; and wherever a Python statement is valid, a PHP function is valid. The extra semantic rules are mostly simple: PHP and Python can reference variables in the outer language (semi-) lexically (we'll look at this in more detail later); core datatypes (e.g. ints, strings) are directly converted between the two languages; collections (lists and the like) are “adapted” such that they're almost indistinguishable from native collections; and everything else is adapted ‘generically’. In essence, an adapter is simply a wrapper around a ‘foreign’ object, such that one language can interact with objects from the other language. [6]

Moving things like integers and strings between the languages works simply and naturally. Life gets a bit more complicated when you encounter areas of semantic friction. One of the most obvious is collections: PHP has only dictionaries (i.e. hash tables; which, confusingly, PHP calls arrays); but Python has separate notions of lists and dictionaries. Passing a Python list or dictionary to PHP is easy: it can only be adapted as a PHP array. The opposite direction is more challenging, as Python lists and dictionaries have substantially different semantics (e.g. lists are ordered, whereas dictionaries are unordered). To most people, it seems “obvious” that if we pass a PHP dictionary which is list-like (i.e. has integer keys from 0, 1, 2, 3 … n) to Python then it should be adapted as a Python list, and indeed our first attempt did just this. However, since both Python (via an adapter) and PHP point to the same underlying PHP array, later mutation can make that PHP array turn into something that isn't list-like (e.g. by adding a key "woo"), at which point the list adapter would be at best misleading, and at worst, incorrect. We therefore ended up taking a different, safer, tack: a PHP dictionary passed to Python is always initially adapted as a Python dictionary. If the user knows for sure that the PHP dictionary really is, and always will be, list-like, then they can obtain a list adapter from the dictionary adapter. As well as the normal dictionary operations, dictionary adapters expose an extra method to_list() [7] which returns a Python list adapter pointing to the underlying PHP dictionary. Of course, this doesn't free us from the possibility of the underlying PHP dictionary becoming non-list-like — the list adapter continually checks that the underlying dictionary is still list-like, throwing an exception if that is no longer true.
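A minimal sketch (the names are invented; this is not PyHyp's actual code) of the check such a list adapter has to repeat on every access, where 'list-like' means the keys are exactly 0, 1, 2, … n-1:

def is_list_like(php_dict):
    # A PHP array only counts as list-like if its keys are 0, 1, 2, ..., n-1.
    return all(k == i for i, k in enumerate(php_dict.keys()))

class ListAdapter:
    def __init__(self, php_dict):
        self.php_dict = php_dict

    def __getitem__(self, i):
        # Re-check on every operation: the underlying dictionary may have
        # stopped being list-like since the adapter was handed out.
        if not is_list_like(self.php_dict):
            raise TypeError("underlying PHP array is no longer list-like")
        return self.php_dict[i]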

Those readers familiar with PHP might have noticed the implicit assumption above that PHP arrays are mutable. In fact, they are immutable: appending an item, for example, copies the underlying array, adding an extra item to the copy. Indeed, nearly all PHP datatypes are immutable with the exception of user objects (i.e. instances of classes) and references. This latter data type is a mutable reference to another object (i.e. it is a boxed pointer which can point to different objects over time). References can be used to give the illusion that data types such as arrays are mutable. Instead of passing an array around directly, one passes around a reference; when appending an item to an array wrapped in a reference, for example, one simply updates the reference to point to the new array. PHP has somewhat complex rules for automatically wrapping data types but, simplifying somewhat, functions can say that their parameters must come wrapped in a reference; non-references are automatically wrapped in a reference. Since Python users expect lists and dictionaries to be mutable, we would therefore have an uncomfortable instance of semantic friction if adapted PHP arrays were sometimes mutable, sometimes immutable. PyHyp therefore defines that passing an array from PHP to Python automatically wraps it in a reference if it is not already so. Doing so is natural from both PHP and Python's perspective.
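The following is a purely illustrative Python sketch of that model (the names are invented): the array itself is an immutable value, and 'mutation' is simulated by making a reference point at a freshly copied array:

class Ref:
    def __init__(self, value):
        # 'value' is an immutable tuple standing in for a PHP array.
        self.value = value

def append(ref, item):
    # "Appending" copies the array and repoints the reference at the copy.
    ref.value = ref.value + (item,)

xs = Ref((1, 2, 3))
append(xs, 4)
print(xs.value)   # (1, 2, 3, 4)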

I could go on, because PyHyp resolves a number of other instances of semantic friction, but hopefully you believe the following: even when the two languages involved in PyHyp didn't appear able to play together nicely, we were nearly always able to find a satisfying design to resolve the problem. However, in a few cases, there simply isn't a sensible way to fit the two languages together. For example, Python functions can be called with keyword parameters, but PHP has no equivalent notion, and we couldn't conceive of a usable design that would work around this fact. Therefore, trying to call a PHP function with keyword parameters from Python raises a run-time error.

How the run-time works

At this point, it’s hopefully fairly clear what the semantics of PyHyp are, but how are they implemented? Fundamentally, we do something very simple: we glue together interpreters. However, doing so naively would lead to terrible performance, so we use meta-tracing interpreters. Meta-tracing is a way of creating a VM with a JIT compiler from just a description of an interpreter. As I said in an old blog article, this is a fascinating idea, because it turns programming language economics upside down: you can get a fairly fast language implementation for not a great deal of effort [8]. At the time of writing, the main extant meta-tracing language is RPython (if you think of it as Java-ish semantics with Python syntax, you won't be too far off the mark), which is what we used.

Whilst writing a really high-quality meta-tracing interpreter from scratch is a lot easier than writing a really high-quality manual JIT compiler, it's still more work than we wanted to take on. We therefore took two off-the-shelf interpreters — PyPy, a complete implementation of Python 2.x, and HippyVM, a reasonably, but definitely not fully, complete implementation of PHP [9] — and joined them together. Although there was a little bit of messing around to get this to work (mostly because HippyVM and PyPy both expect to be the sole program being compiled and run), it's fairly easy to do. Simplifying somewhat, PyHyp's main function is that of HippyVM's: HippyVM imports parts of PyPy and calls them as necessary (and PyPy can call back to HippyVM etc.). From the perspective of a meta-tracer such as RPython, our “HippyVM that imports PyPy” interpreter is no different than any other interpreter. In other words, just because we know it contains two constituent interpreters doesn't mean that meta-tracing knows or cares – it's just another interpreter (albeit the largest ever compiled with RPython). That means that the two constituent interpreters run in the same address space, share a garbage collector, and share a JIT compiler.

At a low-level, PyHyp defines a fairly standard FFI between PHP and Python. For example, PHP code can import and call Python code as in the following example, where PHP code imports Python's random module and then calls its randrange function:
$random = import_py_mod("random");
$num = $random->randrange(10, 20);
echo $num . "\n";
This simple example immediately raises the question of how Python's random module and the Python int returned by randrange are represented in HippyVM. In essence, small immutable data-types (e.g. ints and floats) are directly translated between the two languages (e.g. passing a Python integer to PHP causes a PHP integer to be created). All other data-types are represented by an adapter which points back to the original object; all operations performed on the adapter are transparently forwarded to the original object. For example, passing a Python user object to PHP creates a PyAdapter in PHP: attempting to access an attribute in the adapter forwards the request to the underlying object. Passing an adapted object back to the language it came from simply unwraps the adapter, so that they don't gradually pile up on top of each other.

Looking at the implementation of this can help make things clearer. In the above example, the import_py_mod function returns a Python module object which is adapted in PHP as a W_PyModAdapter. Looking up the randrange “method” returns a Python function object which is adapted as a W_PyCallable. Calling that adapter in PHP causes execution to pass to the underlying Python randrange function. As this may suggest, both PHP and Python have a number of adapters. Some adapters make data-types appear more natural in the other language; others are useful for performance; and the ‘generic’ catch-all adapter ensures that every data-type in each language can be adapted.
Figure 2: Showing that adapters are transparent, with output from running the program in the right hand (‘console’) pane. Passing a PHP array to Python creates a dictionary adapter: querying the adapter for its type returns Python's dictionary type. Lines 2 and 3 clearly show that the type of the adapted PHP array, whether as a Python list or dictionary, is indistinguishable from native Python objects. Even identity checks are forwarded: although id_check is passed two separate adapters, the two identity checks work as expected.

This simple model of adapters has a number of advantages: they're simple to implement; flexible in terms of reducing semantic and performance friction; and, as Figure 2 shows they're almost entirely transparent (even object identity checks are forwarded). In order to implement them, we had to invasively modify both HippyVM and PyPy. For the example above, we needed to add Python adapters to HippyVM, and alter PyPy so that Python objects can adapt themselves into PHP. Simplifying slightly, both HippyVM and PyPy have similar data-type hierarchies: both have classes called W_Root that the various data-types inherit from. We thus have to provide an (overridable) way for Python objects to create an appropriate adapter for PHP, and the adapters themselves. A simplified version of the relevant RPython code looks as follows. First the modifications to HippyVM:
# HippyVM: hippy/module/pypy_bridge/py_adapters.py
from hippy.objects.base import W_Root as W_PhRoot

class W_PyGenericAdapter(W_PhRoot):
  def __init__(self, py_inst):
    self.py_inst = py_inst

  def lookup(self, attr_name):
    # Forward the lookup to the adapted Python object, adapting the
    # result back into PHP's object hierarchy.
    return self.py_inst.lookup(attr_name).to_php()

  def to_py(self):
    # Passing the adapter back to Python simply unwraps it.
    return self.py_inst
and second the modifications to PyPy:
# PyPy: pypy/interpreter/baseobjspace.py
import py_adapters

class W_Root:
  def to_php(self):
    # Wrap this Python object in a generic PHP-side adapter.
    return py_adapters.W_PyGenericAdapter(self)
As this code hopefully suggests, when a Python object (an instance of W_Root) is passed to PHP, its to_php method is called; that returns a W_PyGenericAdapter, which is a class in PHP's object hierarchy. If the adapter is passed back to Python, the original object is unwrapped and passed back to Python.

As a rough measure of size, PyHyp adds about 2.5KLoC of code to PyPy and HippyVM (of which adapters are about 1.4KLoC). This is a reasonable proxy for complexity: PyHyp inherits most of the lengthy implementation work for free from HippyVM and PyPy. It may be interesting to note that we added 5KLoC of tests, which shows that much of the effort went into working out what we wanted to see happen.

Syntactic interactions

At a low level, PyHyp composes the two languages through dynamic compilation of Python code (and of PHP code nested inside Python), but that interface is too ugly for anyone in their right mind to want to use directly. By using language boxes and the Eco language composition editor, we can make syntax composition pleasant. For the simple case of dropping a Python function or expression into a PHP program, there's little to add beyond what is obvious from Figure 1.

Things get more interesting when you want Python and PHP to syntactically interact. What does that even mean? In PyHyp's case, it means that Python and PHP can reference variables in the other language. For example, if PHP defines a variable x then a suitably scoped Python language box can reference that variable x. In the common case, this feature is simple and natural, but digging a bit deeper reveals real challenges. In essence, these boil down to the fact that neither PHP 5.x nor Python 2.x has, from a modern perspective, particularly great scoping rules. Neither language is fully lexically scoped (both can add variables to a scope at run-time) and neither fully implements closures (Python emulates closures for reading variables, but this illusion doesn’t deal with assignment to a variable in a parent function). In addition, PHP has 4 separate global namespaces (one each for functions, classes, constants, and variables), each of which is shared across all source files (PHP includes, in a C sense, files rather than imports modules).
Figure 3: Cross-language variable referencing, showing some of the design challenges in finding good cross-language scoping rules. The inner PHP language box references PHP's range function (inclusive in both its parameters); the Python language box references Python's range function (inclusive in its first, but exclusive in its second, parameter); the Python language box references PHP's print_r function.

These details meant that, when we started looking at a design for cross-language variable scoping, we didn't really know where to start: it's something that, as far as we know, no-one has ever attempted before; and we appeared to have picked two languages with different, cumbersome, scoping rules. We quickly discovered that a bad design leads to subtle bugs. Figure 3 shows my favourite example of this. Both PHP and Python define global functions called range(f, t) which generate a list of numbers from f to t. PHP's range is inclusive so range(0, 3) generates [0,1,2,3]; Python's range is exclusive on its second argument so range(0, 3) generates [0,1,2]. That means that if Python or PHP code references the other language's range function then odd things happen (and yes, we discovered this the hard way). A simple solution might be to say that each language can only reference its own global variables, but what happens when Python wants to use PHP's print_r function (which pretty prints PHP data-structures differently than Python's equivalents)?

The eventual design [10] we settled upon deals with such nasty corner cases, but (hopefully) is still simple enough to comprehend. First, within a language box, each language's variable scoping rules are unchanged. When a variable is referenced that is not found in the current language box, then a binding is searched for in the following order. First, the lexical chain of language boxes is recursively searched (from inner to outer) for a binding. Second, if no match has been found, the ‘host’ language box’s globals are searched (i.e. if a Python language box references range, and no outer language box defines a binding for this variable, then Python's global variables are searched first). Third, the other language's globals are searched (which is how Python is able to access PHP’s print_r function). Finally, if no match has been found, an appropriate error is raised. One complication is PHP's multiple global namespaces: in PHP, one can define a function, a class, a constant, and a variable all called x. In PHP, syntactic context disambiguates which namespace is being referred to (e.g. new x() searches for a class called x whereas x() searches for a function of that name). There are no equivalent syntactic forms in Python, so PyHyp searches the global namespaces in the following order, returning the first match: functions, classes, constants, and variables [11].
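The following sketch restates that lookup order as code, for a variable referenced inside a language box (the names are invented, and this is not PyHyp's implementation, which also has to make such lookups fast):

PHP_NAMESPACE_ORDER = ["functions", "classes", "constants", "variables"]

class LanguageBox:
    def __init__(self, bindings, parent=None):
        self.bindings = bindings
        self.parent = parent

def lookup(name, box, host_globals, other_globals):
    # Steps 1 and 2: the language box's own scope, then enclosing language
    # boxes, innermost first (each box uses its own language's scoping rules).
    while box is not None:
        if name in box.bindings:
            return box.bindings[name]
        box = box.parent
    # Step 3: the host language box's globals.
    if name in host_globals:
        return host_globals[name]
    # Step 4: the other language's globals; for PHP, its multiple global
    # namespaces are searched in a fixed order.
    for ns in PHP_NAMESPACE_ORDER:
        if name in other_globals.get(ns, {}):
            return other_globals[ns][name]
    raise NameError(name)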

From an implementation point of view, we needed about 250LoC to add the cross-language scoping rules. Making things run fast (remember that both languages' lookup rules are entirely dynamic) was a bit more tricky but, once we'd got a good design, not hugely difficult.

At this point you may be wondering why we went to such bother to design cross-language scoping rules. After all, previous language compositions have nearly always done without such a feature. Our reasoning is fairly simple: sometimes it’s convenient to use one language over the other, but not necessarily at the level of converting one function at a time. In such cases, being able to migrate a single line can be useful. Figure 1 gives a simple example: the Python language box on line 20 allows an interaction with the gen function that would be horribly verbose if it was done in PHP. As soon as you want to be able to mix languages on a line-by-line level, it’s inevitable that you will want to reference variables across the languages, and thus that you need to define cross-language scoping rules.

Performance

Let's assume that you’re now prepared to believe that PyHyp is a useful, innovative, fine-grained composition of two languages. But all is for naught if PyHyp programs are slow. How do we tell if PyHyp is slow? The answer seems simple: benchmark mono-language programs (i.e. written purely in Python or PHP) against PyHyp programs. The reality, unfortunately, is somewhat messier: no-one really knows how to reliably write good benchmarks; when a good benchmark is written, programming language implementers have a tendency to optimise specifically for it, reducing its use as a good measure [12]; and we don't yet know how to produce high-quality statistics for many types of real-world executions [13]. As icing on the cake, anyone who’s sufficiently practised at creating benchmarks (and who, like me, is also sufficiently evil) can create a seemingly reasonable benchmark that just so happens to show the result they want to. That said, it's nearly always the case that some benchmarks are better than no benchmarks — just remember the above caveats when interpreting the results!

The first problem we had in benchmarking our composition was that not only were there no available PyHyp benchmarks, but there weren't even good guidelines for how to go about creating composed benchmarks. We therefore created a series of small, focussed benchmarks that concentrate on a single feature at a time, as well as adapting several classic (relatively speaking, larger) microbenchmarks. In essence, for each benchmark we created several variants: mono-PHP; mono-Python; ‘outer’ PHP with ‘inner’ Python; and ‘outer’ Python with ‘inner’ PHP. For example, for the ‘outer’ PHP with ‘inner’ Python benchmarks, we took a PHP benchmark (e.g. DeltaBlue), keeping constants, variables, and classes in PHP, but translating methods into Python. Figure 4 shows an example. The aim with these benchmarks is to get a good idea of the costs of the composition when one frequently crosses the barrier between languages.
Figure 4: The DeltaBlue benchmark. On the left is a mono-PHP version. On the right is a composed version: only functions and methods have been translated into Python, with everything else (including classes, constants etc.) remaining in PHP.

If you want to see the full results, I suggest reading Section 7.3 of the paper. However, for most of us, a summary of the interesting results is probably sufficient. Since PyPy is nearly always faster than HippyVM, we can tighten our definition of “is it fast enough”, by comparing PyHyp’s performance to PyPy’s. The geometric mean slowdown of PyHyp relative to PyPy is 20% (1.2x) and the worst case is about 120% (2.2x). Though the latter figure isn't amazing, bear in mind that it's the worst case — on average, these benchmarks suggest that PyHyp's performance is more than good enough to be usable. Indeed, in many cases PyHyp’s performance is better than the standard C-based interpreters used for languages such as Python and Ruby!

So, why is the performance good? First, there can be little doubt that small benchmarks flatter (meta-)tracing — I would expect larger programs to see more of a slowdown, though it's difficult to guess at how much of a slowdown that might be. Second, the most important factor in enabling good optimisations in tracing is inlining, which we were able to enable across the two languages fairly easily. Third, the major overhead that PyHyp's composition imposes at run-time is adapters: most non-primitive objects that cross between the two languages have an adapter created. Thus we might expect that some parts of a program will double their memory allocations, since each object will also require an adapter. In most traces which contain parts from both PHP and Python, the vast majority of adapters are created, used, and implicitly discarded within the trace. The trace optimiser can then prove that the adapters don’t ‘escape’ from the trace and thus remove their memory allocation, and nearly all other costs associated with the adapter. Thus, in practice, the typical cost of adapters is close to zero.

Of course, there are a few cases where performance isn't as good as we would have liked. Of the cases where we found something concrete (which isn't all cases), we found some where RPython generates inefficient machine code for two otherwise identical traces (and no-one has quite worked out why yet). A few cases suffer from the fact that PyHyp uses a ‘proper’ non-reference counted garbage collector, which sometimes causes HippyVM/PyHyp to have to copy PHP’s immutable arrays more than one would like. There are also a few parts of PyHyp which we could more extensively optimise with a bit more effort, and I'm sure that one could fairly easily write programs which tickle parts that we haven't yet thought to optimise. But, overall, performance seems likely to remain within “good enough to use” territory, which is the most important test of all.

Conclusions

As far as I know, PyHyp is, by some considerable distance, the most fine-grained language composition yet created. I think it shows that there is a completely different, practical solution to the problem, outlined at the start of this article, of migrating a system from an old language to a new language. I hope you can imagine someone using a PyHyp-like system to migrate a system gradually from (say) PHP to Python, even down to the level of line-by-line migration: the resulting composed program would not only run, but run fast enough to be used in production. Perhaps at some point the whole system would be migrated, and the language composition thrown away. The crucial point is that at no point is one forced to flick a switch and migrate wholesale from one language to another overnight.

Could you use PyHyp for this today? In a sense, it's a piece of software in a slightly unusual state: it's more polished than most research pieces of software; but real users will find many missing pieces. Most of those missing pieces are in HippyVM which, although it implements pretty much all of the PHP language, lacks many standard libraries and implements a slightly old version of PHP. Fixing this isn't difficult, but since HippyVM is not currently under active development, it would require someone taking on maintainership, and bringing HippyVM to a higher degree of compatibility with current PHP. The good news is that PyHyp picks up additions to both HippyVM and PyPy with ease, so should work on HippyVM resume, PyHyp would benefit too.

In terms of general lessons – apart from the obvious one that PyHyp shows the plausibility of fine-grained language composition – one thing stands out to me. When we started this work, we assumed that implementing a language composition would be hard, and we concentrated our thinking on that issue. We were very wrong. Implementing PyHyp has, mostly, been fairly easy: we reused sophisticated language implementations; the modifications we needed to make were relatively small and relatively simple; and meta-tracing gave us good performance for very little effort. What caused PyHyp to occupy 12-18 months of our time was a problem we didn't even consider at the start: friction. While we were eventually able to find good solutions to all the design problems we encountered, it took us a surprisingly long time to identify the problems, and even longer to find good solutions to some of them. Partly this is because none of us was expert in both PHP and Python (indeed, weird details in both languages can catch out even the most hardened experts). We sometimes had to base our designs on our guess about one or the other language's semantics, and gradually refine our design through extensive test-cases. But the bigger issue was that we simply didn’t have much, if any, precedent to draw upon. I'd like to think some of the design solutions we came up with will be useful to those who undertake language compositions — and I'll certainly be interested to see the results! In the meantime, feel free to read the PyHyp paper or download PyHyp itself.

Acknowledgements: Edd Barrett was the technical lead of the PyHyp project with assistance from Carl Friedrich Bolz, Lukas Diekmann, and myself; Edd, Carl Friedrich, and Lukas gave useful comments on this blog post, but any errors or infelicities are my own. On behalf of the PyHyp project, I’d like to thank Armin Rigo for adjusting RPython to cope with some of PyHyp’s demands, and advice on HippyVM; Ronan Lamy and Maciej Fijałkowski for help with HippyVM; and Jasper Schulz for help with cross-language exceptions. This work would not have been possible without funding from EPSRC (the Cooler and Lecture projects) and, for early versions of Eco, Oracle Labs. I am grateful to both organisations!

Footnotes

[1] I remember reading Really Automatic Scalable Object-Oriented Reengineering, which describes a system for translating large C systems to Eiffel. Although I had seen a couple of commercial systems tackling “old” languages (e.g. Fortran to Java), I was sceptical that a paper at an academic conference would tackle anything very hard. I was thus impressed when I saw that wget was translated automatically: it's not a small program. I was stunned when I saw that Vim was translated, even down to things like signal handling. I also can't mention this paper without noting how beautifully written it is: it's rare to see authors put so much care into making the reader's journey through the paper a pleasant one.
[2] Why PHP and Python? Partly because we had to start somewhere. Partly because we had off-the-shelf access to an excellent implementation of Python and a reasonable-ish implementation of PHP. Whilst you may have strong opinions about either or both of these languages, this blog is resolute in its intention to sit on the fence about their pros and cons.
[3] This term was inspired by Clausewitz's use of ‘Friktion’. To misquote Clausewitz, everything in programming language design is simple, but even the simplest thing is difficult.
[4] Language composition throws up several annoying linguistic issues. First is the term itself: composition is a long, vague, and not universally understood term. When I started my initial work in this area, I used the term because someone else did — which isn't really a very good defence on my part. In retrospect, I would have done better to have used a plainer term like “multi-lingual programming”. Alas, by the time I realised that, we'd already used the term widely enough that changing it seemed likely to lead to more confusion. Mea culpa. The second problem is what to call the resulting programs. We used the term “composed program” for a while, but that's hopelessly ambiguous: every program is composed from smaller chunks, whether they be in different languages or not. “Multi-lingual program” has merit, and seems a good generic term. In our specific case, we talk about PyHyp programs and PyHyp files to make clear that we're talking about programs or files that are written using the rules of our language composition.
[5] We could have used Python as the starting point. Or, probably more usefully, we could have allowed things to start in either language.
[6] Adapters are very similar to the idea of proxies. However, the term ‘proxy’ often comes with the expectation that users know they're working with proxies, and that one can distinguish proxies from the things they point to. In most cases, PyHyp's adapters are completely transparent, hence our use of a different term.
[7] Having lived with this decision for 18 months or so, we think we'd probably have done better to have first adapted PHP dictionaries as a ‘blank’ adapter which appears to be neither a list nor a Python dictionary, forcing users to call to_list() or to_dict() on it. The main gotcha with the current behaviour is that Python defines iteration on dictionaries to be iteration over the keys: this means that you can end up iterating over what you think is a list, but which is really 0, 1, 2, 3 … n.
[8] Since I wrote that article, Oracle have released Truffle. Although the low-level details are different (Truffle uses dynamic partial evaluation), the high-level goals of Truffle and meta-tracing are the same. From this article's perspective, you can consider Truffle and meta-tracing to be interchangeable, without a huge loss of precision.
[9] Since we started work on PyHyp, work on HippyVM has stopped, partly due to a lack of funding. One thing that those of us attempting to do innovative programming language work inevitably learn is that it's really hard to pay the bills.
[10] My vague recollection is that it took about 14 iterations to get to this point. That's language design for you!
[11] It might be nicer to throw an error if the same name exists in more than one of PHP's global namespaces. However, that's slightly complicated by PHP's lazy class loading, which we had to work around for other reasons.
[12] This is Goodhart's Law in action.
[13] We're working on this at the moment, but we're not finished yet.


Debugging Layers

July 28 2015

A couple of weeks back, I joked to someone that one day I would write a condensed retrospective of all my failed research ideas but that, until I did, people could read my research papers for the long version. It seemed a clever thing to say at the time, and I dutifully slapped myself on the back. I then found myself discussing Converge's approach to error tracking and reporting with someone to whom I thought it might be vaguely relevant, pointing my poor victim to a chunk of a paper and – when I later remembered its existence – a blog article. The core idea is simple: generated code can be related to one or more source locations. This means that when a user layers one piece of code atop another, each of those layers shows up in stack traces, making debugging significantly easier. When, with a fresh mind, I read over the paper and the blog article, I realised that it's hard to extract this idea, and a few related (albeit minor) ones, from those bits and pieces: they assume the reader knows more about Converge than is realistic.

Normally I wouldn't necessarily mind too much if old ideas are a little hard to grasp, but I've come to view Converge's error tracking and reporting as probably the only useful piece of language design I managed in the five or so years I spent on Converge [1]. I therefore found myself deciding that I needed to write a brief retrospective on Converge and this feature so that some other poor soul doesn't spend five years of their life on a feature that can be explained in ten minutes. The aim of this article is to give enough background to understand the core idea of debugging layered programs, but what is enough depends on your point of view. I've therefore tried to aim this article at two audiences: one who just wants to quickly get a sense of the important idea (if that's you, skim the background material, but don't worry too much about the details); and another who wants to get a deeper sense of how the important idea really works (if that's you, it's worth spending some time getting a reasonable appreciation of the background). I'm sure I won't fully succeed in satisfying either audience, but hopefully both might find a little something of interest herein.

A quick introduction to Converge

For our purposes, Converge can be thought of as Python with a mildly modified syntax and somewhat different standard library: it's dynamically typed, has an indentation syntax, and a mostly similar world view. One immediately obvious difference is that programs must have a main function and that printing isn't a builtin statement or function. Here's a minimal hello world program:

import Sys

func main():
  Sys::println("Hello world!")
On top of this run-of-the-mill base [2], Converge adds a macro system [3] adopted from Template Haskell. In essence, one can generate program fragments at compile-time and splice them into the program such that they can be used as if the programmer had typed them into the original source file. Here's a contrived example:
import CEI, Sys, Time

func secs_from_epoch():
  Sys::println("Getting time")
  return CEI::iint(Time::current().sec)

func main():
  Sys::println($<secs_from_epoch()>)
Saving that code into a file t1.cv and running it a couple of times at the command-line gives the following:
% converge -v /home/ltratt/t1.cv
===> Compiling /home/ltratt/t1.cv...
Getting time
===> Finished compiling /home/ltratt/t1.cv.
===> Linking.
1436729518
% converge -v /home/ltratt/t1.cv
1436729518
I'm going to assume that Converge's basic syntax is fairly obvious, but there are a few other things that need to be explained before this example makes sense. First, Converge has distinct compile, link, and run-time phases. Running Converge with -v causes information lines to be printed on stderr (lines 2, 4, and 5) stating when compilation and linking have started and finished. Since there wasn't a cached compiled version of the program in the above example when it was first run, Converge compiled and linked the program (saving that output to a cache file) before running it. On the second run, Converge read the cached version in and ran it directly, rather than recompiling the source code.

Most of the program's source code is obvious except the $<e> construct which we'll call an explicit macro invocation. The expression e is evaluated at compile-time and is allowed to refer to any function or top-level variable defined above the macro invocation; it is expected to generate a program fragment as an Abstract Syntax Tree (AST); that program fragment then overwrites the macro invocation so that when the program is later run, the generated program fragment is run. The secs_from_epoch function generates a trivial program fragment, with an AST representation of an integer (the CEI::iint function takes an actual integer and returns an AST integer class). In essence, after the explicit macro invocation was evaluated, the program that was compiled and linked looked as follows:

import CEI, Sys, Time

func secs_from_epoch():
  Sys::println("Getting time")
  return CEI::iint(Time::current().sec)

func main():
  Sys::println(1436729518)
As this hopefully shows, even though the program was run twice, the macro invocation was only executed when the program was compiled in the first run. No matter how many subsequent times we run the program, it will print out the time that program was first compiled (1436729518). Only if the program is recompiled (e.g. because the cached compiled version is deleted or the source code has changed) will the secs_from_epoch function be called again.

DSL compilers

In my opinion, Converge's raw macro facility isn't hugely useful for day-to-day programming. Rather, it is intended to be used as a basis for what I should have called DSL compilers: in essence, one can embed arbitrary strings into a source file and compile them into Converge ASTs at compile-time. In so doing, we've added a new layer on top of the main language. Although I'm not going to dwell on this, one can layer as deeply as one wants, and it's not unusual for DSLs to be implemented in terms of other DSLs.

As a simple example of embedding a DSL, I'm going to use a stack-based machine. At compile-time the DSL will be compiled into a Converge function which can then be called as normal, returning the top-most value on the stack. Here's a use of the DSL:

f := $<<sb>>:
  PUSH 2
  PUSH 3
  ADD

func main():
  Sys::println(f())
We'll see the definition of sb shortly, but you will be unsurprised to be told that the above program prints out 5 when run. The $<<sb>> syntax (line 1) means that all the code on the next level of indentation (lines 2–4) should be left unparsed; the raw string representing that code should be passed to the function sb, which will be called at compile-time. In essence, a simple variant of normal macros allows us to embed arbitrary syntaxes into a file, making use of Converge's indentation-based syntax to know when a DSL starts and finishes, and thus where normal parsing of Converge code can resume. If we take the full file and run it with converge -v, we see the following output:
===> Compiling /home/ltratt/t2.cv...
func ():
    stack := []
    stack.append(2)
    stack.append(3)
    $$3$$rhs$$ := stack.pop()
    $$4$$lhs$$ := stack.pop()
    stack.append($$4$$lhs$$ + $$3$$rhs$$)
    return stack[0 - 1]
===> Finished compiling /home/ltratt/t2.cv.
===> Linking.
5
As this might suggest, the sb function has compiled our DSL into an anonymous function (it also printed it out so that we can get some idea of what it did, but that isn't something it's required to do). The compilation is very simple-minded: an empty stack is created (line 3); the integers 2 and 3 are pushed onto the stack (lines 4 and 5); then they are popped off (lines 6 and 7) so that they can be added together and the result pushed onto the stack (line 8); and finally the top-most value on the stack is returned (line 9). Variables beginning with $$ might look scary, but for our purposes they're simply normal variables with unique names generated by the Converge compiler.

Let's start with a simple implementation of sb:

import Builtins, CEI, Sys

func sb(dsl_block, src_infos):
  instrs := [] // A sequence of Converge instructions
  // Iterate over each line in the DSL block
  for line := dsl_block.split("\n").iter():
    sp := line.stripped().split(" ") // Split the line into a list
    if sp[0] == "PUSH":
      ast_int := CEI::iint(Builtins::Int.new(sp[1]))
      instrs.append([| &stack.append;(${ast_int}) |])
    elif sp[0] == "ADD":
      instrs.extend([|
        rhs := &stack.pop;()
        lhs := &stack.pop;()
        &stack.append;(lhs + rhs)
      |])
  ast := [|
    func ():
      &stack; := []
      $c{instrs}
      return &stack;[-1]
  |]
  Sys::println(CEI::pp_itree(ast))
  return ast
This function takes in a string (the dsl_block argument on line 3) and gradually compiles it into a sequence of Converge expressions which are then inserted inside an anonymous function (lines 17–22), which, to make our lives a bit easier, is printed out (line 23) and then returned (line 24).

Before going any further, I need to explain three bits of related syntax. First are quasi-quotes [| ... |], which are a syntactic convenience, but an important one. Surrounding an expression with quasi-quotes makes that expression evaluate to its AST equivalent. Instead of laboriously writing out CEI::ibinary(0, CEI::iint(2), CEI::iint(3)), we can write [| 2 + 3 |] which compiles to exactly the same code. Second are insertions ${...} [4], which can be used inside quasi-quotes to insert one AST into another (e.g. lines 10 and 20). [| 2 + ${CEI::iint(3)} |] is thus equivalent to [| 2 + 3 |]. Most insertions splice in a single tree, but an insertion used alone on a line can also splice in a list of ASTs (e.g. line 20). Third, Converge has a hygienic name system, which renames all variables inside quasi-quotes which don't begin with a & to fresh (i.e. unique) names. In the case of sb, we want to stitch together multiple tree fragments and have the stack variable refer to the same thing across all of them. In general, one doesn't want to use & too frequently, though for the purposes of this article, it doesn't matter too much.
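
Readers more familiar with Python than Converge may find a rough analogy helpful. The sketch below is mine – it uses Python's ast module, not Converge, and hygiene has no direct analogue here – but the correspondence is: building an AST by hand is like the laborious CEI:: calls, parsing a snippet is like a quasi-quote, and overwriting a node is like an insertion:
import ast

# Building `2 + 3` by hand, roughly the Python analogue of writing out
# CEI::ibinary(0, CEI::iint(2), CEI::iint(3)) in Converge.
manual = ast.Expression(
    body=ast.BinOp(left=ast.Constant(2), op=ast.Add(), right=ast.Constant(3))
)
ast.fix_missing_locations(manual)
print(eval(compile(manual, "<manual>", "eval")))   # prints 5

# A quasi-quote is roughly analogous to parsing a snippet into an AST...
quoted = ast.parse("2 + PLACEHOLDER", mode="eval")
# ...and an insertion to overwriting part of that AST with another AST,
# much as ${CEI::iint(3)} splices an AST integer into [| 2 + ... |].
quoted.body.right = ast.Constant(3)
ast.fix_missing_locations(quoted)
print(eval(compile(quoted, "<quoted>", "eval")))   # prints 5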

Better reporting of syntax errors

At this point, we've implemented a small DSL which is translated at compile-time into Converge code. However, our implementation of the DSL compiler is rather too simplistic: any mistake by a user of the stack machine DSL will lead to inscrutable errors. Consider a trivial syntax error, where the user forgets to add a value to a PUSH instruction:
f := $<<sb>>:
  PUSH 2
  PUSH
  ADD
This innocuous error leads to a long stack-trace [5] in the DSL compiler whose last few entries are:
   ...
   17: File "/home/ltratt/t3.cv", line 27, column 5, length 31
        f := $<<sb>>:...
   18: File "/home/ltratt/t3.cv", line 9, column 45, length 5
        ast_int := CEI::iint(Builtins::Int.new(sp[1]))
   19: (internal), in Builtins::List.get
 Bounds_Exception: 1 exceeds upper bound 1.
Let us assume we have a user brave enough to follow that stack-trace into code: they will probably deduce that the error relates to the PUSH instruction, but, if their input has multiple such instructions, which might be the culprit? Fortunately, Converge has two tools to help us.

The first tool is something I deliberately skipped over earlier. sb is passed two arguments, with the second conventionally called src_infos. If we print that argument out we see that it is a list of lists:

 [["overdrive.tratt.net:/home/ltratt/t3.cv", 725, 21]]
I'll go into more details of this later, but the basic idea is simple. The src_infos argument tells us where the DSL block started and its extent—in the file t3.cv, starting at character 725, and spanning 21 characters. This information gives us the potential to report back errors in the dsl_block string in terms of the real line numbers the user will see in their file. However, doing so manually is a tedious chore. Fortunately, the second tool is that Converge has a built-in parser DSL which can use src infos. We can adapt sb as follows:
import Builtins, CEI, CPK::Earley::DSL, Sys

parse := $<<DSL::mk_parser("machine", ["PUSH", "ADD"])>>:
    machine ::= instr ( "NEWLINE" instr )*
    instr   ::= push
              | add
    push    ::= "PUSH" "INT"
    add     ::= "ADD"

func sb(dsl_block, src_infos):
  parse_tree := parse(dsl_block, src_infos)
  instrs := [] // A sequence of Converge instructions
  for i := 0.iter_to(parse_tree.len(), 2):
    instr := parse_tree[i][0]
    instr_type := instr.name
    if instr_type == "push":
      ast_int := CEI::iint(Builtins::Int.new(instr[1].value))
      instrs.append([| &stack.append;(${ast_int}) |])
    elif instr_type == "add":
      instrs.extend([|
        rhs := &stack.pop;()
        lhs := &stack.pop;()
        &stack.append;(lhs + rhs)
      |])
  ast := [|
    func ():
      &stack; := []
      $c{instrs}
      return &stack;[-1]
  |]
  Sys::println(CEI::pp_itree(ast))
  return ast
The parsing DSL takes a fairly conventional context-free grammar (lines 4-8). Unless the user goes out of their way to do so, the parsing DSL automatically reuses Converge's lexer, so the stack machine DSL has access to lexing rules such as INT. The parsing DSL requires an explicit start rule (machine) and, optionally, a list of additional keywords over those it automatically inherits from Converge (PUSH and ADD). It then converts the grammar into a function which takes two arguments and returns a parse tree. For our very simple DSL, the tree is so shallow that I didn't need to bother writing a normal tree walker. The result of our simple change is powerful. If we rerun our faulty input against the new sb we get the following error:
 Error:
   File "/home/ltratt/t4.cv", line 36, column 8, length 0:
      PUSH 2
 Parsing error at or near `NEWLINE' token.
While I won't pretend this is the best possible parsing error message – as humans, we'd have preferred in this instance to have been shown the line before rather than after the newline – the text PUSH 2 really can be found on line 36 in the file t4.cv. Importantly, that puts us right next to the line of code which led to the bug, which is far preferable to trying to divine the source of the problem from a stack-trace. DSL compilers can also use a related tactic to report static errors to the user using the CEI::error function, which takes a string and, optionally, src infos.
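
To make the offset arithmetic concrete, here is a minimal Python sketch – my own illustration, not Converge code, and the names are invented – of how a src info of the form (file identifier, character offset, character span) can be combined with a position inside dsl_block to recover the file, line, and column a user would recognise. It assumes the DSL block is a contiguous run of characters in the file:
def error_location(src_info, offset_in_dsl_block, file_text):
    # src_info is (file identifier, character offset, character span);
    # offset_in_dsl_block is where, within the DSL string, the error occurred;
    # file_text is the full text of the user's file.
    path, start, span = src_info
    assert 0 <= offset_in_dsl_block <= span
    absolute = start + offset_in_dsl_block
    line = file_text.count("\n", 0, absolute) + 1                  # 1-based line
    column = absolute - (file_text.rfind("\n", 0, absolute) + 1)   # 0-based column
    return path, line, column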

Run-time errors

Finding errors statically is useful but also somewhat easy—life gets a lot harder with run-time errors. The following program is syntactically valid but obviously incorrect, trying to add two items from the stack when only one item is present:
f := $<<sb>>:
  PUSH 2
  ADD

func main():
  Sys::println(f())
Since sb performs few static checks, this code will compile. Looking back at the definition of sb, we would expect that the ADD instruction will fail at run-time when it tries to perform the second pop. Indeed that's precisely what happens:
% converge -v t5.cv
===> Compiling /home/ltratt/t5.cv...
func ():
    stack := []
    stack.append(2)
    $$3$$rhs$$ := stack.pop()
    $$4$$lhs$$ := stack.pop()
    stack.append($$4$$lhs$$ + $$3$$rhs$$)
    return stack[0 - 1]
===> Finished compiling /home/ltratt/t5.cv.
===> Linking.
Traceback (most recent call at bottom):
  1: File "/home/ltratt/t5.cv", line 40, column 15, length 3
       Sys::println(f())
  2: File "/home/ltratt/t5.cv", line 22, column 15, length 12
       lhs := &stack.pop;()
     File "/home/ltratt/t5.cv", line 28, column 6, length 10
       $c{instrs}
  3: (internal), in Builtins::List.pop
Bounds_Exception: -1 below lower bound 0.
The exception raised is Bounds_Exception, which is Converge's way of saying you tried to access an element beyond the start or end of the list. The stack-trace initially looks helpful: we can immediately see that Converge tracks errors down to the level of where within a line they occur; and we can also see that the problem is related to the &stack.pop() call. However, deeper inspection shows that it's hard to relate the stack-trace to the source of the problem. After all, I made a mistake in my DSL code, but the stack-trace information is presented in terms of the DSL compiler. If my stack machine code had contained hundreds of ADD instructions, which would have been the source of the problem? Here we see the problem of debugging layers, as programming languages traditionally privilege one layer—the base programming language.

The stack-trace contains a clue that we might be able to do something about this. There are three frames on the stack (i.e. the main function has called another function, which has itself called another function), but the second frame is associated with two source locations (lines 22 and 28). In short, Converge ASTs can be associated with one or more src infos. ASTs that are created by quasi-quotes automatically have a src info relating them to the source location they were defined in; and further src infos are automatically added whenever ASTs are inserted into each other [6]. Users can then add additional src infos to a quasi-quote using the [<e>| ... |] syntax. Since every terminal and non-terminal in the parse tree records the src info it's associated with, it's trivial to pick out the appropriate src info from the parse tree and add it to a quasi-quote. In this case, we simply take the src infos from the ADD non-terminal when we construct the appropriate quasi-quotes, modifying sb as follows:

 elif instr_type == "add":
   instrs.extend([<instr.src_infos>|
     rhs := &stack.pop;()
     lhs := &stack.pop;()
     &stack.append;(lhs + rhs)
   |])
With this small modification, running our incorrect program leads to the following stack-trace:
 Traceback (most recent call at bottom):
   1: File "/home/ltratt/t5.cv", line 40, column 15, length 3
        Sys::println(f())
   2: File "/home/ltratt/t5.cv", line 22, column 15, length 12
        lhs := &stack.pop;()
      File "/home/ltratt/t5.cv", line 37, column 2, length 3
        ADD
      File "/home/ltratt/t5.cv", line 28, column 6, length 10
        $c{instrs}
   3: (internal), in Builtins::List.pop
 Bounds_Exception: -1 below lower bound 0.
We can now see that the second frame on the stack is associated with 3 entries, one of which is the line from the stack machine itself (lines 6 and 7 in the stack trace). With this, the user can quickly pinpoint which ADD instruction is the culprit, no matter how many such instructions their program contains.

A reasonable question to ask is whether it's really necessary to allow more than one src info. Could we not maintain the traditional single source location per stack frame entry, but allow the user to choose what that location records, and still get the effect we wanted? As it so happens, the first version of Converge did just that, so I can conclusively answer this with a no. The reason is very simple: when an error happens at run-time, there is no way of knowing which layer was responsible. Unlike traditional compilers, which are heavily tested over multiple years, DSL compilers tend to be more fragile and more susceptible to bugs. When an error happens, there are roughly even odds as to whether the DSL compiler author or the DSL user is at fault. Only if the debugging information for both layers is reported can both people have a chance of debugging the problem.

How it works

A src info is a triple (file identifier, character offset, character span); src infos are an ordered list of these triples. The key to what I've presented above is that src infos – and specifically the notion that there can be more than one at any point – are present in all aspects of the system: the tokenizer, the parser, the compiler, and the VM. In other words, all the parts that make up Converge were written with src infos in mind. The tokenizer starts the work off, giving each token a single src info (though, for consistency, it is a list of src infos that just so happens to have a single src info). The parser doesn't change tokens, but each time it creates a non-terminal, it calculates the span of the tokens in its subtree (i.e. it looks at all the tokens in the subtree in a depth-first fashion, finding the beginning of the first token in the subtree, and then calculating the span up to and including the last token in the subtree) and creates a src info based on that. As the compiler creates an AST from the parse tree, it copies src infos over (sometimes from tokens, sometimes from non-terminals). The bytecode saved by the compiler then associates each individual bytecode instruction with one or more src infos. When an error happens at run-time, the VM then digs out the appropriate src infos for each location in the relevant stack frame. Thus, src infos have no run-time cost, other than when an exception is raised.
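
As a rough model of the data involved – written in Python with invented names, since the real machinery is spread across Converge's tokenizer, parser, compiler, and VM – each token carries a list of src infos, and a non-terminal's src info is computed from the span of the tokens beneath it:
class Token:
    # A token carries a *list* of src infos; the tokenizer gives it just one.
    def __init__(self, value, src_info):
        self.value = value
        self.src_infos = [src_info]        # src_info is (file, offset, span)

class NonTerminal:
    def __init__(self, name, children):
        self.name = name
        self.children = children
        # As the parser does: walk the subtree depth-first, find the first and
        # last tokens, and record the span covering everything in between.
        toks = list(self._tokens())
        path, first_off, _ = toks[0].src_infos[0]
        _, last_off, last_span = toks[-1].src_infos[0]
        self.src_infos = [(path, first_off, last_off + last_span - first_off)]

    def _tokens(self):
        for child in self.children:
            if isinstance(child, Token):
                yield child
            else:
                yield from child._tokens()

# Illustrative use (file name and offsets invented):
push = NonTerminal("push", [Token("PUSH", ("t3.cv", 725, 4)),
                            Token("2", ("t3.cv", 730, 1))])
# push.src_infos is now [("t3.cv", 725, 6)]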

The bytecode format that Converge uses is not particularly innovative. Simplifying slightly, each compiled module is its own self-contained binary blob. Within that are several contiguous blocks recording various information. One of these blocks is the sequence of VM instructions (e.g. push, pop etc.), each of which we can think of as a single machine word. Another block then records one or more src infos, again each stored as a machine word, for each VM instruction. There must therefore be at least as many src infos as there are VM instructions. However, since high-level operations invariably compile to multiple VM instructions, it is generally the case that successive VM instructions record the same src infos. Converge therefore adds a few bits to each src info to record how many consecutive VM instructions the src info refers to. On average, this simple encoding reduces the size of the src infos block by half.
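
The run-length trick is easy to mock up. In the Python sketch below – an invented representation on my part; Converge actually packs the repeat count into spare bits of each machine word – the encoder takes one list of src infos per VM instruction and collapses consecutive duplicates into (src infos, count) pairs:
def encode(per_instr_src_infos):
    # Input: one list of src infos per VM instruction.
    # Output: [src_infos, count] pairs, one per run of identical entries.
    encoded = []
    for src_infos in per_instr_src_infos:
        if encoded and encoded[-1][0] == src_infos:
            encoded[-1][1] += 1      # same src infos as the previous instruction
        else:
            encoded.append([src_infos, 1])
    return encoded

def decode(encoded):
    # Reverse the encoding: one list of src infos per VM instruction again.
    per_instr = []
    for src_infos, count in encoded:
        per_instr.extend([src_infos] * count)
    return per_instr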

Summary

The scheme I've described in this article took full form in the second version of Converge. In my mind, Converge is a finished piece of work. At the very least, I feel that I've learnt pretty much all that I'm going to learn from Converge, and I think it unlikely I will do much work on it again. So why have I put together such a long-winded article on one little aspect of Converge? It's for the very simple reason that the idea of src infos – notice my deliberate use of the plural – might have relevance to other languages. In particular, any language which encourages its users to generate code might benefit from such a concept: it really does make debugging multiple layers of generated code much easier. My use of the word encourages is deliberate: I don't think a language needs to have a macro-ish system for the concept of multiple src infos to be useful. For example, there appears to be a growing culture of generating JavaScript code offline, and while, in such a scenario, src infos would need to be recorded differently, their benefits would be the same. I'm not suggesting that retrofitting this concept in is easy: it will often require changing various components, including delicate parts of VMs. I'm also not suggesting that every language would benefit from this concept: it seems unlikely that a C program would ever want the overhead, no matter how small, that src infos require.

So there you are—that's probably the most useful thing I managed in five years of language design. If I'm tempted to do language design again, I will print a t-shirt with the slogan I spent five years designing a programming language and I only got one lousy feature as a reminder to leave it to people with more talent for it than I.

Acknowledgements: My thanks to Carl Friedrich Bolz, Martin Berger, and Lukas Diekmann for insightful comments on early drafts of this article. All opinions, errors, omissions, or infelicities, are my own.

Footnotes

[1] In my opinion, Converge's experiment with an Icon-like expression evaluation system wasn't a success. Fortunately, as so often in research, the dung heap of failed ideas provided fertile ground for a new idea to grow. When I wrote that paper, I thought that the many stack operations needed by Icon's evaluation system precluded an efficient implementation, before foolishly hand-waving about a possible partial optimisation. What I didn't foresee at that point was that meta-tracing would optimise Icon's evaluation system without me having to do anything. Though I was slow to recognise it, meta-tracing had freed me from many onerous constraints I thought inevitable.
[2] My personal experience suggests that programming language designers resemble actors: most hopefuls, such as myself, lack the necessary talent; even for those who are talented, the odds of success are remote; and yet there is never any shortage of people who believe they are destined for greatness.
[3] Formally it should be called compile-time meta-programming, but that's a distracting mouthful. At some points I've tried abbreviating this term to CTMP, but, in retrospect, more readers would have been helped if I'd shortened it to macro. If you understand every detail of such systems, there's a minor difference between traditional macros and compile-time meta-programming but, for most of us, macro gets the point across well enough without requiring the digestion of new terminology.
[4] The $c{e} variant of insertion allows non-hygienic splicing. Put another way, it means splice e in and allow it to dynamically capture variables from the surrounding environment. Hygiene, and Converge's approach to it, aren't relevant to this article, so I refer interested parties to the documentation for further details.
[5] The bolding and highlighting you see is shown by default in the console.
[6] I'm not sure this is the best policy, because it often records partly or completely redundant src infos. However, all my attempts to find a predictable scheme that recorded fewer src infos led to cases where the lack of src infos hindered debugging. I have no idea whether this reflects a lack of imagination on my part or whether humans have expectations that no automatic scheme can ever quite meet.


An Editor for Composed Programs

August 20 2014

A few of you might remember a blog post of mine from 2011 titled Parsing: The Solved Problem That Isn't. My hope with parsing has always been that I can take an off-the-shelf approach and use it without having to understand all the details. In that particular post, I was looking at parsing from the angle of language composition: is it possible to glue together two programming language grammars and parse the result? To cut a long story short, my conclusion was that none of the current parsing approaches works well for language composition. I was nervous about going public, because I felt sure that I must have missed something obvious in what is a diffuse area (publications are scattered across different conferences, journals, and over a long period of time). To my relief, a number of parsing experts have read, or heard me talk about, the technical facts from that post without barring me from the parsing community [1].

When I wrote the original blog post, I did not know if a good solution to the problem I was considering could exist. I ended by wondering if SDE (Syntax Directed Editing) might be the solution. What I've discovered is that SDE is a wonderful way of guessing someone's age: those under 40 or so have rarely heard of SDE; and those above 45 [2] tend to either convulse in fits of laughter when it's mentioned, become very afraid, or assume I'm an idiot. In other words, those who know of SDE have a strong reaction to it (which, almost invariably, is against SDE). This blog post first gives an overview of SDE before describing a new approach to editing composed programs that we have developed.

Syntax directed editing

Imagine I want to execute a program which adds 2 and 3 times x together. I fire up my trusty text editor, type 2 + 3 * x in, and pass it to a compiler. The compiler first parses the text and creates a parse tree from it, in this case plus(int(2), mul(int(3), var(x))), and then uses that tree to do the rest of the compilation process. In a sense, when programming in a text editor I think in trees; mentally flatten that tree structure into sequences of ASCII characters that I can type into the text editor; and then have the parser recreate the tree structure I had in my head when I wrote the program. The problem with this process is not, as you might be thinking, efficiency—even naive parsing algorithms tend to run fast enough on modern machines. Rather, it is that some apparently reasonable programs fail to parse, and the cause of the errors can be obscure, particularly to novice programmers.

In the early 80s, a number of researchers developed SDE tools as a solution to the problem of frustrated and demotivated novices. Instead of allowing people to type in arbitrary text and parsing it later, SDE tools work continuously on trees. SDE tools have the grammar of the language being edited built in, which means they know which trees are syntactically valid. Rather than typing in 2 + 3 * x, one first instructs the SDE tool to create a plus node. A box with a + character appears on screen with two empty boxes to the left and right. One then clicks in the left hand box, types 2, before clicking in the right hand box and typing *. Two more boxes appear on screen into which one enters 3 and x. There is therefore no need for parsing to ever occur: one can directly input the tree in one's head into the SDE tool, rather than indirectly specifying it via a sequence of ASCII characters.

The advantage of SDE is that it guides users in creating correct programs. Indeed, SDE tools actively prevent incorrect programs from being created: at all points the tree is syntactically correct, albeit some holes may be temporarily unfilled. However, this advantage comes at a severe cost: to be frank, SDE tools are generally annoying to use. Novices, like small children, will tolerate nearly anything you throw at them, because they are unaware of better alternatives. When seasoned programmers were asked to use SDE tools, they revolted against the huge decline in their productivity. SDE tools not only make inputting new programs somewhat annoying, but they completely change the way existing programs are edited. Two examples will hopefully give you an idea of why. First, programmers often get half way through an edit in one part of a file (e.g. starting an assignment expression) and realise they need to define something else earlier in a file. They therefore break off from the initial edit, often leaving the program in a syntactically invalid state, and make textual edits elsewhere, before returning to the initial edit. SDE tools tend to forbid such workflows since they need to keep the program in a syntactically valid state [3]. Second, when editing text, it is often convenient to forget the tree structure of the program being edited, and treat it as a simple sequence of ASCII characters. If you have ever done something like selecting 2 + 3 from the expression 2 + 3 * x, you have done just this. SDE tools force edits to be relative to a tree. For example, code selections invariably have to be of a whole sub-tree: one can select 2, or +, or 3, or 3 * x (etc.) but not 2 + 3. If this seems like a minor restriction, you should try using such a tool: I don't think there is a programmer on this earth who will not find the changes required to use such tools hugely detrimental to productivity.

SDE tools thus quickly disappeared from view and within a few years it was almost as if they had never existed. The world would not have been much worse off for this, except for the fact that SDE and language composition are a perfect match. Whereas composing traditional grammars or parsers always involves trade-offs of expressivity or reliability, SDE imposes no such compromise. Since SDE tools are always in control of the languages they are editing, they can allow arbitrary languages to be composed in powerful ways.

I was certainly not the first person to note the potential of SDE for language composition. JetBrains' MPS is a relatively recent SDE tool which allows composition of powerful editors. To the best of my knowledge, it is the most pleasant SDE tool ever developed. In particular, when entering new text, MPS feels tantalisingly close to a normal text editor. One really can type in 2 + 3 * x as in a normal text editor and have the correct tree created without requiring unusual interactions with the user interface. MPS applies small tree-rewritings as you type, so if you follow 2 with +, the 2 node is shuffled in the tree to make it the left hand child of the +, and the cursor is placed in an empty box to the right hand side of the +. The author of the language being edited by MPS has to enable such transformations, and while this is no small effort, it is a cost that need be paid only once. However, whilst MPS does a good job of hiding its SDE heritage when typing in new text, editing old text has many of the same restrictions as traditional SDE tools. As far as I can tell, MPS has gradually special-cased some common operations to work normally (i.e. as in a text editor), but many common operations have surprising effects. In some cases, edits which are trivial in a traditional editor require special key-presses and actions in MPS to extract and reorder the tree.

MPS is a striking achievement in SDE tools. It lowers SDE to the point that some programmers can be forced, with some encouragement, to use it. Some of the compositions using MPS are truly stunning, and have pushed the envelope far further than anyone has ever managed before. But, if I'm being honest, I wouldn't want to edit a program in MPS, using its current editor, and nor do I think programmers will take to it voluntarily: it makes too many simple things complicated.

A new solution

Simplifying a little, it seemed to us that MPS had made as good a stab as anyone is likely to manage at moving an SDE tool in the direction of a normal text editor; we wondered if it might be possible to move a normal text editor in the direction of SDE. In other words, we hoped to make an editor with all the tree goodness of SDE tools, but which would feel like a normal text editor. So, two years ago, in the Software Development Team at King's College London, we set out to see if such a tool could exist. I am happy to say that we now believe it can, and we (by which I mostly mean Lukas Diekmann) have created a prototype editor Eco based on the principles we have developed. In essence, we have taken a pre-existing incremental parsing algorithm and extended it such that one can arbitrarily insert language boxes at any point in a file. Language boxes allow users to use different languages' syntaxes within a single file. This simple technique gives us huge power. Before describing it in more detail, it's easiest to imagine what using Eco feels like.

An example

Imagine we have three modular language definitions: HTML, Python, and SQL. For each we have, at a minimum, a grammar. These modular languages can be composed in arbitrary ways, but let's choose some specific rules to make a composed language called HTML+Python+SQL: the outer language box must be HTML; in the outer HTML language box, Python language boxes can be inserted wherever HTML elements are valid (i.e. not inside HTML tags); SQL language boxes can be inserted anywhere a Python statement is valid; and HTML language boxes can be inserted anywhere a Python statement is valid (but one cannot nest Python inside such an inner HTML language box). Each language uses our incremental parser-based editor. An example of using Eco can be seen in Figure 1.

Figure 1: Editing a composed program. An outer HTML document contains several Python language boxes. Some of the Python language boxes themselves contain SQL language boxes. Some specific features are as follows. ❶ A highlighted (SQL) language box (highlighted because the cursor is in it). ❷ An unhighlighted (SQL) language box (by default Eco only highlights the language box the cursor is in, though users can choose to highlight all boxes). ❸ An (inner) HTML language box nested inside Python.

From the user's perspective, after loading Eco, they can create a new file of type HTML+Python+SQL. After the blank page appears, they can start typing HTML exactly as they would in any other editor, adding, altering, removing, or copying and pasting text without restriction. The HTML is continually parsed by the outer language box's incremental parser and a parse tree constructed and updated appropriately within the language box. Syntax errors are highlighted with red squiggles as the user types. The HTML grammar is a standard BNF grammar which specifies where Python+SQL language boxes are syntactically valid by referencing a separate, modular Python grammar. When the user wishes to insert Python code, they press Ctrl+L, which opens a menu of available languages (see Figure 2); they then select Python+SQL from the languages listed and in so doing insert a Python language box into the HTML they had been typing. The Python+SQL language box can appear at any point in the text; however, until it is put into a place consistent with the HTML grammar's reference to the Python+SQL grammar, the language box will be highlighted as a syntax error. Note that this does not affect the user's ability to edit the text inside or outside the box, and the editing experience retains the feel of a normal text editor. As Figure 3 shows, Eco happily tolerates syntactic errors — including language boxes in positions which are syntactically invalid — in multiple places.

Figure 2: Inserting a language box opens up a menu of the languages that Eco knows about. Languages which Eco can be sure are valid in the current context are highlighted in bold to help guide the user.


Figure 3: Editing a file with multiple syntax errors. Lines 6, 8 and 11 contain syntax errors in the traditional sense, and are indicated with horizontal red squiggles. A different kind of syntax error has occurred on line 4: the SQL language box is invalid in its current position (indicated by a vertical squiggle).

Typing inside the Python+SQL language box makes it visibly grow on screen to encompass its contents. Language boxes can be thought of as being similar to the quoting mechanism in traditional text-based approaches which use brackets such as [|...|]; unlike text-based brackets, language boxes can never conflict with the text contained within them. Users can leave a language box by clicking outside it, using the cursor keys, or pressing Ctrl+Shift+L. Within the parse tree, the language box is represented by a token whose type is Python+SQL and whose value is irrelevant to the incremental parser. As this may suggest, conceptually the top-level language of the file (HTML in this case) is a language box itself. Each language box has its own editor, which in this example means each has an incremental parser.

Figure 4: An elided example of an SQL language box nested within an outer Python language box. From the perspective of the outer incremental parser, the tree stops at the SQL token. However, we can clearly see in the above figure that the SQL language box has its own parse tree.

Assuming that there are no syntax errors, at the end of the editing process, the user will be left with a parse tree with multiple nested language boxes inside it as in Figure 4. Put another way, the user will have entered a composed program with no restrictions on where language boxes can be placed; with no requirement to pick a bracketing mechanism which may conflict with nested languages; with no potential for ambiguity; and without sacrificing the ability to edit arbitrary portions of text (even those which happen to span multiple branches of a parse tree, or even those which span different language boxes). In short, Eco feels almost exactly like a normal text editor with the only minor difference being when one enters or exits a language box.

How it works

Language boxes are an embarrassingly simple concept, based on a very simple observation. Context free parsing orders tokens into a tree based solely on the types of tokens: the value is not looked at (doing so would make the parser context sensitive, which is generally undesirable). When the user presses Ctrl+L and selects language L, Eco simply creates a token in the parse tree of type L. From the perspective of the parser, language boxes are normal tokens; the fact that they have sub-trees as values is of no consequence to it.
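
A tiny Python sketch (my own, with invented names; Eco's internals are rather more involved) makes the observation concrete: since a context-free parser classifies tokens purely by type, a language box is simply a token whose type names a language and whose value happens to be another parse tree:
class EcoToken:
    def __init__(self, type_, value):
        self.type = type_     # e.g. "INT", "PLUS", or a language name like "Python+SQL"
        self.value = value    # a string for ordinary tokens; a sub-tree for language boxes

def parser_classifies(token):
    # The only thing the (context-free) incremental parser ever inspects.
    return token.type

ordinary = EcoToken("INT", "2")
nested_tree = ["module", ["stmt", "..."]]         # stands in for a real parse tree
box = EcoToken("Python+SQL", nested_tree)
assert parser_classifies(box) == "Python+SQL"     # the sub-tree is never examined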

Consequences

Since, in general, the programs that Eco can write cannot be guaranteed to parse as normal text, Eco's native save format is a tree [4]. This does mean that one loses the traditional file-based notion of most programming languages. Fortunately, other communities such as Smalltalk have long since worked out how to deal with important day-to-day functionality like version control when one moves away from greppable, diffable files. We have not yet incorporated any such work, but it seems unlikely to be a hugely difficult task to do so.

Some thoughts

We think that Eco is not only the first in a new class of editor, but the first editor which makes editing composed programs painless. Put another way, I wouldn't mind using an Eco-like tool to edit composed programs, and I am extremely fussy about my text editor. The fact that I say Eco-like is not accidental. Eco is a prototype tool, with all that entails: it is hugely incomplete (e.g. we currently have little or no undo functionality, depending on how you view things), buggy, and sometimes unexpectedly slow (though not because of the incremental parser, which is astonishingly fast). What we have learnt in building it, however, is more than sufficient to see how the ideas would scale up to an industrially ready tool. Will such a tool ever appear? I don't know. However, unlike when I wrote my original blog post, I can now state with some confidence that a good solution is possible and practical.

If you're interested in finding out more about Eco, you can read the (academic) SLE paper this blog post is partly based on, or download Eco yourself [5]. The paper explains a number of Eco's additional features—it has a few tricks up its sleeves that you might not expect from just reading this blog.

It is also important to realise that Eco is not magic: it is a tool to make editing composed programs practical. It is still necessary to specify what the composed syntax should mean (i.e. to give it semantics), assuming we want to do something with the result. In the case of the HTML+Python+SQL composition, composed programs can be exported to a Python file and then executed. Outer HTML fragments are translated to print statements; SQL language boxes to SQL API calls (with the database connection being whatever variable a call to sqlite3.connect was assigned to); and inner HTML fragments to strings. We can thus export files from our composition and run them as normal Python programs. There are many other possible, and reasonable, ways to export a program from Eco, and one can imagine a single composition having multiple exporters. If you want to see how Eco fits into the wider language composition landscape, this video of a recent talk shows some of the directions we're pushing ahead on.
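
To give a flavour of what such an export can look like – and note this is a hypothetical reconstruction on my part (the file name, table, and HTML are invented), not Eco's actual output – an HTML+Python+SQL document might plausibly flatten to Python along these lines:
# Hypothetical sketch of an exported HTML+Python+SQL program, written to
# illustrate the description above; it is not Eco's real output.
import sqlite3

print("<html><body>")                                   # outer HTML fragment -> print
db = sqlite3.connect("site.db")                         # ordinary Python from a Python box
for (name,) in db.execute("SELECT name FROM users"):    # SQL language box -> API call
    print("<p>" + name + "</p>")                        # inner HTML fragment -> a string
print("</body></html>")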

Acknowledgements: This work was generously funded by Oracle Labs without whom there would be no Eco. Lukas Diekmann has been the driving force behind Eco. Edd Barrett and Carl Friedrich Bolz gave insightful comments on a draft of this blog post.

Footnotes

[1] Though some of them quite reasonably disagree with me as to whether some of the limitations or trade-offs I mentioned really rule out some approaches from language composition.
[2] I'm not sure if my sample size of 40-45 year olds is insufficient, or whether they really are the inbetween generation when it comes to SDE. Fortunately, I will sleep soundly enough without knowing the answer to this particular question.
[3] I'm told that some of the original SDE tools were more flexible than this suggests, but these seem to be the exception, and not the rule. It's unclear to me exactly how flexible such tools were in this regard. Alas, the software involved has either disappeared or requires impossibly ancient hardware and software to run.
[4] It currently happens to be gzipped JSON, which is an incidental detail, rather than an attempt to appeal to old farts like me (with gzip) as well as the (JSON-loving) kids.
[5] Depending on when you read this, you may wish to check out the git repository, in order to have the latest fixes included.

 
