Retrofitting JIT Compilers into C Interpreters

C interpreters are a common language implementation technique and the basis for the reference implementations of languages such as Lua, Ruby, and Python. Unfortunately, C interpreters are slow, especially compared to language implementations powered by JIT compilers. In this post I’m going to show that it is possible to take C interpreters and, by changing a tiny proportion of code, automatically turn them into JIT compiling VMs (Virtual Machines)1. This offers a point in the language performance design space that was previously out of reach: better performance while retaining full compatibility with reference implementations.

The system that lets us do this is yk, something we’ve been working on for some time. I want to set expectations appropriately. yk is alpha-stage software: it’s not production ready, though it’s some way beyond a hacky research prototype in breadth and quality. You can quite easily hit TODOs that halt the program, or encounter missing parts where execution continues suboptimally; we only have a subset of the optimisations we’d like; only an x64 backend is currently implemented; and we don’t yet have the breadth of languages and benchmarks that one would need to draw strong conclusions about performance.

That said, you can already see some interesting results. Conveying programming language performance is quite hard, but this video might give you an idea. First, we run a Mandelbrot program through the reference PUC Lua VM2 and then through yklua, our fork of PUC Lua with an automatically derived JIT compiler:

I must confess that I have cherry picked an example where we get better-than-average performance. Rather than the ~4x improvement in the video above, a more realistic figure is the geometric mean over our Lua benchmark suite3, which is so far a little under 2x. Small benchmarks inevitably flatter the sorts of techniques I’ll be talking about, so “real” programs will on average see less than that4.

What might surprise you is that we only added about 400LoC to PUC Lua, and changed under 50LoC. This hopefully makes the trade-off yk offers clear: yklua does not reach the performance peaks of the wonderful, carefully hand-written, LuaJIT; but LuaJIT is stuck in the 13 year old past of Lua 5.15, whereas yklua trivially tracks updates to PUC Lua. The point of this post isn’t to tell you that one side or the other of this trade-off is best. Rather, it’s to show that the trade-off exists, and that yk fills in a part of the design space that was largely out of bounds before.

There’s also nothing inherently Lua specific about yk. Alongside various smaller implementations, we’ve recently put a few days’ effort into adjusting MicroPython, leading to – you might spot a naming pattern here – ykmicropython. The MicroPython interpreter uses some idioms that we have yet to support, and any benchmark which encounters one of those currently has terrible performance. But if you have a program which doesn’t hit many of those parts you can already see a meaningful performance improvement. The Mandelbrot benchmark isn’t where ykmicropython performs best, but it’s still good enough to see a difference (“normal” MicroPython top, ykmicropython bottom):

Eventually, my hope is that if you have a C interpreter, big or small, yk will allow you to add a JIT to it for little effort, gaining additional performance for free.

In this post I’ll explain why we created yk, why it works the way it does, and some of the technical challenges involved. Despite the length of this post, I won’t come close to covering everything I arguably should, and I’ll inevitably simplify much of what I do cover. Future posts will hopefully fill in more details, add nuance, and track yk’s evolution.

I want to thank Shopify and the Royal Academy of Engineering, who together generously funded the underlying research. I am immensely grateful to both, though of course I speak for neither organisation! The work I’m presenting in this post is a joint effort with Edd Barrett, Lukas Diekmann, and Pavel Durov. I want to dedicate this work to the late Chris Seaton, who was an early champion of yk when it was no more than a tiny demo: he is much missed by those of us who considered him a friend.

Why are JIT compilers hard?

Many programming languages are implemented as interpreters. This is the most common approach for dynamically typed languages, but is also true of a surprising number of other languages, from CPU simulators to domain specific languages. Interpreters are simple to write, but do not have great performance.

It is therefore common to add a Just-In-Time (JIT) compiler to interpreters. The fundamental thesis of JIT compilation is that the way a program executed in the near past is indicative of how it will execute in the future. JIT compilers thus speculatively optimise6 a program; whilst the speculations hold, performance is good; if the speculations fail to hold, performance will (possibly temporarily) decrease.

JIT compilers therefore observe a running program, identify the hot (i.e. most frequently executed) parts and some surrounding context (e.g. the types of objects used), extract and optimise those, and compile them to machine code (henceforth “JITed code”). The JIT-compiled code can then be run in place of the interpreter.

JIT compilers are difficult and expensive to create. Most obviously, they have lots of delicate, moving parts, and JIT compiler authors need to be expert in multiple domains. Minor mistakes can lead to days of frustrating debugging; architectural mistakes can take years to undo. Over a decade ago, someone who knew more than me estimated that HotSpot (aka “the JVM” i.e. the standard Java VM) had already consumed well over 1,000 person years of effort. V8, the JavaScript JIT compiling VM in Chrome, has had a large team working on it for many years.

Less obviously, because of their complexity, JIT compilers can be difficult to evolve. Even if one creates a fully compatible JIT compiler for a given version of a language, the blighters in control of the language specification are bound to extend or alter the language. It’s quite easy for changes in the language to break assumptions embedded deep within a JIT compiler, causing the latter to lag behind the language specification. Eventually, “lag” has a tendency to turn into “will never catch up”, which tends to be the beginning of irrelevance.

JIT compilers also make achieving full – and I mean “full”! – compatibility difficult. Often it is necessary to create a new VM in order to create the best possible JIT compiler. However, doing so almost inevitably means foregoing full compatibility. The new VM might not be able to interface with existing extension modules; it might run finalisers at slightly different points in time; not implement every last function in the standard library; and so on. If a user tries a new VM and their software doesn’t work out of the box, not only will they not experiment further, but they will often resolve never to waste time on that VM again.

Perhaps the best evidence for how hard JIT compilers are to create is simply to enumerate the publicly7 known JIT compilers for Python: they include Cinder, IronPython, Jython, Nuitka, Psyco, Pyjion, PyPy, Pyston, S6, Shed Skin, Starkiller, TrufflePython, Unladen Swallow, WPython, and Zippy. Few of these are still active, despite the skilled people involved, and the desires and hopes of many users.

Automatic JIT compilers

We could bypass most of the problems I’ve noted above if we could somehow automatically create JIT compilers. There are two main approaches to automatically creating JIT compilers from interpreters, with those two approaches exemplified by two systems: meta-tracing in RPython (which is part of the PyPy project); and (a form of) partial evaluation in Truffle.

RPython and Truffle are brilliant technical achievements: I remain in awe of both projects. They can, in different ways and with different trade-offs, create astonishingly high-performance JIT compilers from interpreters.

There is, however, a catch: both systems require you to create a new interpreter. They thus run into the almost-but-not-quite-compatible challenge I mentioned earlier. Perhaps because of this, neither system has seen the widespread use that their excellent performance would suggest they should.

Deriving a JIT compiler from C interpreters

While most of the languages we’re interested in optimising have an English specification, users tend to take what the reference interpreter does as the specification. From the perspective of a language lawyer, users are clearly wrong to do so, but few users know that language lawyers even exist — they mostly care about completing the task at hand!

What this means is so obvious that I did not fully internalise it for many years: for many languages, the C interpreter is the source of truth for the language. Rightly or wrongly, any implementation which deviates from that source of truth, even in minor ways, will be unacceptable to most users.

This means that RPython and Truffle, wonderful though they are, leave an important gap in the design space. When I eventually woke up to that reality, the question we needed to answer became obvious: is it possible to automatically derive a JIT compiler from a C interpreter? If we could do so, we could avoid the compatibility issues that prevent many users from using faster language implementations.

Other people have had, at least to some degree, similar thoughts and there is a small body of work sharing the same aim, notably this often forgotten paper and the later copy-and-patch. Putting aside the (sometimes large) amount of manual work these approaches require, their basic aim is to JIT compile an interpreter’s core interpreter loop. This is a worthwhile thing to do, but many programs spend much of their time outside of that loop. For example, basic functionality like “is object o1 == o2?” is often deferred to functions outside of the core loop. In my opinion, if you want to get good performance, you have to be able to inline code far beyond the core interpreter loop.

Ultimately, at least in my pea brain, there is then only one technique that can realistically achieve the aim we want: meta-tracing8. And, indeed, that’s exactly what yk does: it meta-traces C interpreters. That’s why we could take the normal C Lua VM, add a few lines of code, and have it become a JIT compiler.

However “meta-tracing for C” is a far harder thing than that simple phrase suggests. Put bluntly, it’s very close to trying to ram a square peg into a round hole, which is why more than one person I’ve met has suggested, mostly politely, that it’s never going to work. Indeed, we’ve had to solve all sorts of problems, big and small, to make this a practical reality.

Tracing and meta-tracing

Meta-tracing is a surprisingly challenging technique to explain9, but I’ll do my best. First let’s reuse terminology that I think came from Truffle: a host language is the one we write interpreters in (e.g. C); a guest language is the language being implemented (e.g. Lua, Ruby, your favourite CPU simulator, etc).

Tracing

Let’s start briefly by looking at (non-meta) tracing JIT compilation. Such a system looks for hot loops (e.g. for or while loops) in a guest program. When it encounters one, it then traces (i.e. records) the execution of the next iteration of the loop, optimises it, and compiles it into machine code. When the interpreter next comes to execute that loop, it hands execution over to the JIT-compiled code.

For example, imagine I have this guest language program:

for x in ... do
  if x > 0 then
    y = y + 3
  end
end

and the for loop has become hot enough to trace. Let’s assume that in the particular iteration that was traced x had a concrete value greater than 0. The resulting trace might look roughly as follows:

OP_LOOKUP("x")
guard true, pop() > 0
OP_LOOKUP("y")
OP_INT(3)
OP_ADD
OP_SET("y")

The trace is, mostly, a record of the guest language operations (“opcodes” or “bytecodes”) that were executed. We can then JIT compile the trace above into machine code and use it in place of the interpreter.

There is, however, one obvious oddity in the trace. When we executed the for loop, x > 0 evaluated to true, and we thus executed the true branch of the if statement. You can clearly see y = y + 3 in the above fragment, but the if has become a guard. Formally speaking, we say a trace linearises control flow. In plain English, a trace records only one of an if statement’s true or false branches, never both.

What that means is that a trace only records a subset of a program’s behaviour: at some point, an execution of the trace might need to move beyond that subset. Guards record the points when this can happen: if a guard fails, the rest of the trace is invalid. When that happens, the JIT-compiled code must deoptimise back to the interpreter.

We can thus immediately see a fundamental difference between tracing JIT compilers and the more common “method-based” JIT compilers. In the latter, one is looking for functions that execute frequently. When such a function is found, it is compiled (at run-time!) in much the same way that a traditional ahead-of-time compiler like gcc/clang would compile a function. In other words, method-based JIT compilers produce code which captures far more of a program’s behaviour10 than a trace-based compiler.

Some readers with long memories will remember that the first JavaScript JIT compiler in Firefox was TraceMonkey. As its name suggests, it was a tracing JIT compiler for JavaScript, but its performance track record was uneven, and after 2 or 3 years it was replaced by a method-based JIT compiler. This experience has led a number of people over the years to declare that tracing JIT compilers are inferior in all ways.

My view is more nuanced. If you’ve got enough time, enough money, and a green field, method-based JIT compilers are almost certainly going to be superior in the general case11. If you lack any of those three factors, the right technical choice might be different. yk is explicitly aimed at the case where you lack a green field12.

Meta-tracing

How does meta-tracing differ from tracing? A tracing JIT compiler is specialised to a particular guest language, and contains a hand-written trace recorder, a hand-written optimiser, and a hand-written code generator. A meta-tracing system says “hand me an interpreter for your favourite guest language and I will do the rest for you”. It does that by recording what the host language interpreter is doing while executing the guest language.

For example if I have a C interpreter along these lines:

Instruction *code = ...;
int pc = 0;
Value **stack = ...;
while (true) {
  Instruction i = code[pc];
  switch (GET_OPCODE(i)) {
    case OP_LOOKUP: push(lookup(GET_OPVAL(i))); pc++; break;
    case OP_ADD: push(pop() + pop()); pc++; break;
    case OP_JGT:
      if (pop() > 0) pc++; else pc += GET_OPVAL(i);
      break;
    case ...: ...
  }
}

then my guest language fragment from earlier will lead to a trace along the following lines:

push(lookup("x")); pc++;
guard true, pop() > 0; pc++;
push(lookup("y")); pc++;
push(3); pc++;
push(pop() + pop()); pc++;
set("y", pop()); pc++;

As this might suggest, we have recorded the code the host-language interpreter executed while it was itself running the guest program. The (meta-)trace we see here is thus a trace of the C interpreter’s actions. That trace was automatically constructed by the meta-tracer; it can then be passed to the meta-tracing system’s optimiser and code generator. In other words, meta-tracing automates what must be hand-written in a (non-meta) tracing JIT compiler.

There’s one additional factor that’s worth highlighting about meta-tracing: it naturally, and aggressively, inlines function calls. In other words, rather than recording push(...), the trace will record the contents of push, so the trace will look more like:

guard true, *used < *capacity;
storage[*used] = elem;
*used += 1;

This ability to inline is a fundamental part of meta-tracing’s strength, and why yk is different from previous approaches to speeding up C interpreters.

From a C interpreter to a meta-tracing interpreter

Now we’ve seen the high-level idea of meta-tracing, the next question is: how does yk convert a C interpreter into a meta-tracing-compatible interpreter? The first key part is that we use a fork of LLVM called – wait for it – ykllvm which must be used to compile the interpreter. At a basic level this just requires changing cc and ld in the interpreter’s Makefile to:

`yk-config release --cc` -o vm.o vm.c
`yk-config release --ldflags` -o vm vm.o

At a deeper level, ykllvm is doing three main things: adding the ability to record a trace; serialising IR so that the interpreter can be reconstructed after trace recording; and making deoptimisation possible. We’ll look at the last of these later: let’s start with the first two.

When we’ve detected a hot loop at the guest language level, we need a way to record what actions the interpreter takes. There are various ways one could do this13, but the one we use now is to insert calls to a function __yk_trace_basicblock(id) in every basic block: roughly speaking, this can be thought of as “at every point there’s an if or equivalent”. We give every basic block a unique ID.

After an interpreter has run through ykllvm it is compiled in a way that is roughly equivalent to the following:

Instruction *code = ...;
int pc = 0;
Value **stack = ...;
while (true) {
  __yk_trace_basicblock(0);
  Instruction i = code[pc];
  switch (GET_OPCODE(i)) {
    case OP_LOOKUP:
      __yk_trace_basicblock(1);
      push(lookup(GET_OPVAL(i)));
      pc++; break;
    case OP_ADD:
      __yk_trace_basicblock(2);
      push(pop() + pop());
      pc++; break;
    case OP_JGT:
      __yk_trace_basicblock(3);
      if (pop() > 0) {
        __yk_trace_basicblock(4);
        pc++;
      } else {
        __yk_trace_basicblock(5);
        pc += GET_OPVAL(i);
      }
      break;
    case ...: ...
  }
}

When a trace is recorded we’ll end up with a sequence along the lines of [0, 1, 0, 1, 0, 2, ...] as a representation of the guest program executing OP_LOOKUP; OP_LOOKUP; OP_ADD.

When we’ve got that sequence of basic block IDs we then need to reconstruct ykllvm’s view of the interpreter’s IR (Intermediate Representation). Roughly speaking, this can be thought of as a tree / graph representation of the interpreter that ykllvm built up whilst compiling it. If we take a chunk of LLVM’s “normal” IR for yklua and represent it textually we get:

359:                                              ; preds = %341
  %360 = load ptr, ptr %355, align 8, !tbaa !11
  %361 = load ptr, ptr %352, align 8, !tbaa !11
  %362 = call ptr @luaH_getshortstr(ptr noundef %361, ptr noundef %360) #18
  %363 = getelementptr inbounds nuw i8, ptr %362, i64 8
  %364 = load i8, ptr %363, align 8, !tbaa !9
  %365 = and i8 %364, 15
  %366 = icmp eq i8 %365, 0
  br i1 %366, label %371, label %367
367:                                              ; preds = %359
  %368 = load i64, ptr %362, align 8, !tbaa !11
  store i64 %368, ptr %345, align 8, !tbaa !11
  %369 = load i8, ptr %363, align 8, !tbaa !9
  %370 = getelementptr inbounds nuw i8, ptr %345, i64 8
  store i8 %369, ptr %370, align 8, !tbaa !9
  br label %3329
371:                                              ; preds = %341, %359
  %372 = phi ptr [ %362, %359 ], [ null, %341 ]
  store ptr %203, ptr %171, align 8, !tbaa !11
  %373 = load ptr, ptr %172, align 8, !tbaa !11
  store ptr %373, ptr %13, align 8, !tbaa !11
  call void @luaV_finishget(ptr noundef nonnull %0, ptr noundef nonnull %352, ptr noundef %355, ptr noundef %345, ptr noundef %372)
  %374 = load volatile i32, ptr %173, align 8, !tbaa !11
  br label %3329

ykllvm converts that to a slightly different IR: we tweak a few things (e.g. the fearsomely complicated getelementptr instruction is split into easier-to-handle variants) and add safepoints (we’ll come back to those later):

bb47:
  %47_0: ptr = ptr_add %46_2, 8
  %47_1: i8 = load %47_0
  %47_2: i8 = and %47_1, 15i8
  %47_3: i1 = eq %47_2, 0i8
  condbr %47_3, bb49, bb48 [safepoint: 396i64, (%0_0, %0_2, %0_5, %0_6, ...
bb48:
  %48_0: i64 = load %46_2
  *%45_3 = %48_0
  %48_2: i8 = load %47_0
  %48_3: ptr = ptr_add %45_3, 8
  *%48_3 = %48_2
  br bb812
bb49:
  %49_0: ptr = phi bb47 -> %46_2, bb45 -> 0x0
  *%6_10 = %29_0
  %49_2: ptr = load %11_5
  *%0_21 = %49_2
  call luaV_finishget(%0_0, %45_10, %45_13, %45_3, %49_0) [safepoint: 397i64, ...
  br bb50

This “ykllvm IR” is then serialised (i.e. converted into a moderately efficient sequence of bytes) and placed in a special section in the binary alongside the “normally compiled” C code. When a trace is recorded, we can then map basic block IDs (e.g. 47) to a fragment in ykllvm IR, stitch all those fragments together, optimise, and compile them. As this might suggest, although I’ve framed this post in terms of “meta-tracing C interpreters”, arguably what we’re really doing is better described as meta-tracing (a view of) LLVM IR.

Optimising a trace

Just stitching together a trace of recorded basic blocks and compiling them into machine code gives at best moderate performance gains. Indeed, in yk the situation would be worse than in previous approaches because of things like __yk_trace_basicblock, which frustrate some static optimisations in LLVM!

What we need to do is optimise the traces. yk has several14 traditional compiler optimisations built in. For example constant folding allows us to take a trace like this:

%7: i8 = 8
%8: i8 = 7
%9: i8 = add %7, %8
call print(%9)

and optimise it to:

%9: i8 = 15
call print(%9)

Dead load / store analysis allows us to take a trace such as:

%8: ptr = ...
%9: i8 = ...
store %9, %8 ; store the value of %9 to the pointer in %8
%11: i8 = load %8
call print(%11)

and optimise it to:

%8: ptr = ...
%9: i8 = ...
store %9, %8
call print(%9)

and so on.

Helping the optimiser

The optimisations above are effective and fast to run — but on a normal interpreter, they only have a moderate effect. The fundamental reason for that is that a C interpreter is an arbitrary program and yk’s optimiser often either can’t soundly optimise parts of it, or isn’t sure that doing so will be worth it.

Fortunately, interpreter authors have additional knowledge that can hugely help the optimiser. Sometimes that knowledge is more about the guest language (e.g. a function’s opcodes are immutable) and sometimes it is more about the interpreter (e.g. that a seemingly unbounded loop will in fact only run a small number of times).

Crucially, none of this knowledge is directly expressed in the C code. What we need is a way for the interpreter author to pass this extra knowledge to yk so that its optimiser can take advantage of it. Fortunately, RPython has done a lot of the intellectual heavy lifting in this regard, and we can reuse many of its concepts. Surprisingly, adding or tweaking just a few lines of C code can lead to significant performance improvements in JIT-compiled code. In order to understand why, I’ll dive into detail about two of the major optimisation features that yk offers interpreter authors.

Promotion

Let’s start with one of the core features, promotion, which takes a run-time value and “burns it into” a trace as a constant, and an associated guard (to ensure that if the value changes on subsequent execution, deoptimisation occurs, and execution is still correct).

yk exposes promotion as a C function yk_promote. In normal use, this is just the identity function15, but when a trace is being recorded, yk_promote copies the value passed to it into a buffer: in the compiled trace, the call to yk_promote will be replaced with the associated constant.

For example, in our C interpreter from earlier we might implement OP_INT(x) which pushes a constant integer (sometimes called an “immediate”) onto the stack as follows:

...
case OP_INT: push(yk_promote(constant_pool[GET_OPVAL(i)])); pc++; break;

Without the yk_promote the trace for OP_INT(3) might look roughly as follows:

tmp = constant_pool[i & 8]
push(tmp)

With yk_promote it will become:

tmp = constant_pool[i & 8]
guard true, tmp == 3
push(3)

The bad news is that we’ve had to insert a guard (in case the speculation turns out not to hold) but the good news is that after that guard we know for sure that tmp is the constant 3, rather than some unknown value. That means – if I hand-wave a bit, because if I went into full detail we’d be here for hours – that when we see a sequence such as:

tmp = constant_pool[i & 8]
guard true, tmp == 3
push(3)
pc += 1
i = code[pc]
tmp = constant_pool[i & 8]
guard true, tmp == 4
push(4)
push(pop() + pop())

we can optimise that last instruction into push(7) without even looking at the stack (though we will have to ensure that the used portion of the stack is adjusted appropriately). Doing so not only removes some instructions (pop must do a load at the very least) but makes it possible for us to generate better machine code.

Idempotent functions

As you might have guessed from the example above, yk_promote isn’t enough on its own to make a big dent in performance. Happily, though, we can combine it with another feature to make what is perhaps the single biggest individual optimisation for a yk interpreter: exposing that opcodes are immutable16.

The additional tool we need is a way of expressing idempotent functions: that is, functions that always return the same output given the same input17. yk allows this to be expressed with the yk_idempotent function annotation18. Cast your mind back to the C interpreter from earlier and this line:

Instruction i = code[pc];

Let’s assume that for a given pc, we always return the same Instruction (e.g. “at pc 4 you will always see OP_ADD”). We then change this line to be a function call:

Instruction i = load_inst(yk_promote(code + pc));

and then add the function load_inst with the appropriate yk_idempotent annotation:

__attribute__((yk_idempotent))
Instruction load_inst(Instruction *pc) {
  return *pc;
}

During trace recording, yk will record the value that load_inst returned. If the trace optimiser can prove that load_inst was called with a constant value (e.g. because of yk_promote), it will remove the call to load_inst and replace it with that return value. In other words, if load_inst(0x1234) returns 0b01110111, then in the trace we will simply see that constant.

The obvious advantage of this is that we have removed the load of the opcode entirely, but what we’ve really done is open the floodgates for yk’s optimiser to do its thing on instruction decoding. For example if GET_OPVAL is:

#define GET_OPVAL(i) ((i & 0b11110000) >> 4)

then yk, knowing that i is 0b01110111, will then optimise GET_OPVAL(i) to 0b0111. As this suggests, optimisations tend to accumulate, particularly in the context of tracing interpreter loops: proving that something is constant often allows other things further down the trace to be optimised too.

The good news doesn’t stop there. Since we had to promote pc, calculations like pc += GET_OPVAL(i) optimise to a constant address; and that allows us to remove most guards based on them (which we can now trivially prove are true).

In other words, by simply breaking out load_inst and declaring it idempotent – a diff of less than 10 lines in yklua – we are able to remove nearly all instructions relating to instruction decoding, as well as several knock-on operations. Individually each operation that’s optimised away is small, and very fast to run, but when enough of them are optimised away there’s a big effect. In yklua, removing the yk_idempotent annotation slows things down by around 4x!

Other optimisations

yk provides other optimisations. For example, by default yk does not inline host-language functions which contain loops into a trace: annotating functions with yk_unroll tells yk to inline these functions, unrolling the loops they contain. This is always semantically correct, but generally dangerous for performance (how often does a loop go around?) — except in cases where you know that the loop will go around a small, constant or near-constant number of times. A classic example of this is a function in an interpreter which unpacks function arguments: guest language functions tend to have a fixed number of parameters. The argument unpacking function tends to be a good candidate for yk_unroll.

However, rather than going through all the possible optimisations, what I hope you’re starting to realise from my description is that there is quite a lot of low-hanging fruit in many C interpreters that can help yk. Indeed, nearly all the performance gains I showed at the start of this post result from basic tweaks to the interpreter (a few yk_promotes, a couple of yk_idempotents, and several yk_unrolls and so on).

An open question is how much more benefit can one get? My intuition is: quite a lot, but it will be highly dependent on how invasive the changes you’re willing to make are.

At the trivial end of things are simple refactorings that make the interpreter more yk friendly. For example, sometimes a very large, important, function has a single (unbounded) loop in it: moving that loop into its own function makes the larger function inlineable. More difficult – though heavily dependent on the interpreter – would be to add Self-style maps, which would open up a lot of opportunities for yk’s optimiser.

Backwards code generation

yk is big enough, complex enough, and with enough novelty that, wherever possible, we’ve tried to filch good ideas from elsewhere. One idea we’ve recently made use of comes from LuaJIT: yk generates machine code by traversing the trace in reverse. This turns out to allow us to generate better code, more quickly, and to do so with less implementation complexity. Since, as far as I know, yk is only the second system to do this, it’s worth giving an overview of what the advantages of doing so are.

For example, given a trace such as:

%5: i8 = load %4
%6: i8 = load %4
%7: i8 = add %5, %6

yk generates the (say) x64 code for add first, and only then for the two loads. There are two main reasons for doing this: simple register allocation; and implicit dead code elimination.

Register allocation sounds like it should be a simple thing for a compiler, and in some ways it is. We have variables (e.g. the %x variables in traces) and we assign them to registers in the CPU (e.g. rax). When we run out of registers (of which there are typically few, commonly 16 on modern CPUs), we put variables on the stack (which is near-infinite in size). Unfortunately, done naively, too many variables end up on the stack and performance tanks19.

As that might suggest, good register allocation isn’t simple. Indeed, so challenging are the trade-offs between “generate code that runs fast enough” and “don’t make an unusably slow compiler” that register allocation has been an active research topic for decades. Summarising the reasons for that succinctly is hard, but the basic problem is that you need to look arbitrarily far backwards and forwards in a program to do good register allocation, and that’s difficult and expensive for programs with complex control flow.

Fortunately, in tracing compilers, LuaJIT shows that we can do something completely different: backwards code generation. In normal forwards code generation, one assigns a fairly arbitrary register to a variable and hopes that will prove to be a good choice for later parts of the trace. In backwards code generation, one assigns a register to a variable knowing that it’s a good choice now, and leaves it up to “earlier” code to adjust as necessary. Those two choices might sound equivalent, but they’re not.

For quite a while, yk did forwards code generation. The register allocator was fiddly and fragile, and in order to make it generate even vaguely good code, we had to have an initial backwards pass that calculated variable lifetimes and also recorded, where appropriate, facts such as “this variable will definitely need to end up in rdi later”.

That we had to have an initial backwards pass was, eventually, the clue for me to consider adopting LuaJIT’s approach. I must admit that at first I found backwards code generation made my head hurt: forwards code generation is far more intuitive. But, once I’d got used to the idea20, the eventual register allocator became relatively simple to work with. Crucially, it’s easy to use21, faster to run, and generates far better code than forwards code generation.

I also mentioned above that backwards code generation implicitly does dead code elimination. When a platform backend asks what register variable %7 should be in, it is implicitly saying “%7 will be used later in the trace”. We record that using a single bit. That means that we no longer need to work out %7’s lifetime: we discover it implicitly upon our first encounter with the variable (which, since this is backwards code generation, is its last use). Then, when we come to process %7 itself, we ask: does it have side effects? If so, we must generate code for it. If not, is its value used later in the trace? If not, we can skip it entirely. That latter check is dead code elimination. All in all, dead code elimination is fewer than 10LoC.
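To make the shape of this concrete, here’s a minimal sketch in C (invented names and a toy IR; this is not yk’s actual code) of a backwards pass over a trace: a single used bit per variable is set when a later instruction’s operand refers to it, and side-effect-free instructions whose bit is unset are skipped entirely:

```c
#include <stdbool.h>
#include <stddef.h>

/* A hypothetical, minimal trace IR (not yk's real one): each instruction
   has an opcode, up to two operands (indices of earlier instructions in
   the trace), and a flag saying whether it has side effects. */
typedef enum { OP_LOAD, OP_ADD, OP_STORE } Opcode;

typedef struct {
    Opcode op;
    int lhs, rhs;      /* operand indices; -1 means "no operand" */
    bool side_effects; /* e.g. a store must always be generated */
} Inst;

/* Walk the trace backwards, returning how many instructions we would
   generate code for. `used` must hold `len` bools, all initially false. */
static size_t codegen_backwards(const Inst *trace, size_t len, bool *used) {
    size_t generated = 0;
    for (size_t j = len; j-- > 0; ) {
        const Inst *inst = &trace[j];
        /* Dead code elimination in one condition: no side effects, and
           no later instruction asked for this value. */
        if (!inst->side_effects && !used[j])
            continue;
        generated++;
        /* "Generating code" for an instruction means its operands are now
           known to be needed earlier in the trace, so set their used bits.
           A real backend would also assign registers at this point. */
        if (inst->lhs >= 0) used[inst->lhs] = true;
        if (inst->rhs >= 0) used[inst->rhs] = true;
    }
    return generated;
}
```

Given a three-instruction trace in which one load’s result is never used, this pass generates code for only two instructions: the dead load is eliminated by a single condition rather than a separate analysis pass.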

The reason I’m highlighting this is because traces end up with quite a lot of dead code, whether due to trace optimisation or platform specific code generation tricks. For example, consider this trace:

%2: ptr = ...
%3: ptradd %2, 8
%4: load %3

On x64, yk will generate the following code:

; %2: ptr = ...
mov rax, ...
; %3: ptradd %2, 8
; %4: load %3
mov rdi, [rax+8]

Note that we didn’t generate any code for the ptradd instruction because load was able to fold it into its addressing mode. This might seem trivial, but it means we’ve avoided using a register for %3: you don’t have to do this very often to have a big, positive, impact on register allocation!

Locations

Astute readers will have noticed that I haven’t made clear how an interpreter tells yk where guest language loops are. As you might expect, yk has absolutely no idea where a Lua or Ruby or Python loop might start or end: indeed, it doesn’t even know whether such languages have while loops or some other construct entirely.

To tell yk where loops are, interpreters have to use a location, which we pass to a control point. Locations are simple: they’re markers that tell yk something about part of the guest program (typically about a specific opcode in the user’s program). Control points22 are more subtle: they’re the point in the interpreter where control is handed over to yk alongside a location. A control point call might do nothing, might start tracing, or might start executing a compiled trace.

For our purposes23, we can consider locations to come in two flavours:

  • null: this guest-level pc is not the start of a loop.
  • loop: this guest-level pc is the start of a loop.

One constructs these with yk_location_null or yk_location_new respectively, with those two functions returning an opaque YkLocation type. Strictly speaking, one doesn’t need null locations, but they often make it easier to get an interpreter up and running with yk.

YkLocations are, quite deliberately, a machine word wide, so that you can store them efficiently (e.g. alongside relevant opcodes). That said, the easiest thing to do is to create an array with one YkLocation per opcode in the user’s program. For example, if we have a language that represents while loops with a backwards jump opcode OP_JMP we might create the array of YkLocations as follows:

Instruction *code = ...;
YkLocation *yklocs = calloc(prog_len, sizeof(YkLocation));
for (int pc = 0; pc < prog_len; pc++) {
  Instruction i = code[pc];
  if (GET_OPCODE(i) == OP_JMP && GET_OPVAL(i) < 0)
    yklocs[pc] = yk_location_new();
  else
    yklocs[pc] = yk_location_null();
}

And we would insert the control point call into the main loop as follows (having suitably initialised yk with yk_mt_new):

YkMT *mt = yk_mt_new(NULL);
Instruction *code = ...;
YkLocation *yklocs = ...;
int pc = 0;
Value **stack = ...;
while (true) {
  Instruction i = code[pc];
  yk_mt_control_point(mt, &yklocs[pc]);
  switch (GET_OPCODE(i)) {
    case OP_LOOKUP: ...
    ...

At a basic level that’s really all that needs to be done in an interpreter.

Now, there are deeper questions such as “what’s the right place to start a loop?”, “what is a loop?” (e.g. recursive function calls), and “what are good traces?” (too many can be as bad as too few). Some of those are solely for interpreter authors; some need help from yk; and we haven’t yet implemented all the bits that interpreter authors need. I won’t dwell further on that here: I hope to return to this in the near future, as we’ve experimented with various promising designs recently.

Deoptimisation

Earlier I spoke quite a bit about “speculation” and “deoptimisation”. To recap: when we speculate on a value we leave behind a guard; if that guard fails, we deoptimise from the JIT-compiled code back to the C interpreter.

For those in the know “deoptimise … back to the C interpreter” will sound, to use the technical term, bonkers. It implies that somehow we can jump back to an arbitrary point in the binary that LLVM compiled, which seems impossible. Fortunately, it is possible — though it required quite a bit of work, and the acceptance of some trade-offs.

Before I go into detail, let’s think roughly about what needs to happen. When a control point starts executing JIT-compiled code, it doesn’t create a new function frame: instead it takes over the AOT (Ahead-Of-Time) LLVM-compiled binary’s function frame and then adds stuff of its own choosing at the end. When a guard fails, we need to: stop the program; adjust the stack so that it’s in the right shape for the AOT binary; put values in the right register; and jump to the right offset in the AOT binary.

You may not be surprised to learn that LLVM was not built with such a use-case in mind. Fortunately it has the basic tool we need: stackmaps, which tell us the layout of registers and the stack at a given point in the LLVM IR. We then roll our own concept of safepoints24 built upon stackmaps: in essence a safepoint pairs a normal LLVM instruction with a stackmap.

ykllvm records a safepoint at each conditional and function call. When a guard is about to be created in a trace, yk then knows the stack of safepoints it relates to. The reason for a “stack” of safepoints is inlining: the conditional that the guard relates to may be many functions deep from where the trace started. By recording a safepoint just before each function call, we know what the stack of AOT frames would have looked like.

Stackmaps are conceptually simple. For a given LLVM variable %7 they record information such as: this variable’s value is stored in registers rdi, r8, and at stack offset 0x7F. Except they don’t quite do so, probably because this part of LLVM is still experimental25. Stackmaps only record one location for a value (e.g. %7 is stored in rdi) but we need to know all of the locations. We also need to know additional information, such as the callee-saved registers passed to inlined functions. Stackmaps in ykllvm are thus extended to do just that.
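As a rough illustration (an invented representation, neither LLVM’s real stackmap encoding nor yk’s extension of it), the difference is between a record that holds one location per live value and one that holds all of them, since deoptimisation must know every place a value lives:

```c
#include <stdbool.h>
#include <stddef.h>

/* A deliberately simplified, hypothetical stackmap record. The point of
   the extension described above is the `nlocs`/`locs` pair: stock
   stackmaps give you a single location per value, but rebuilding AOT
   frames needs every register and stack slot the value occupies. */
typedef enum { LOC_REG, LOC_STACK } LocKind;

typedef struct {
    LocKind kind;
    int reg;       /* register number when kind == LOC_REG */
    int stack_off; /* frame offset when kind == LOC_STACK */
} ValueLoc;

typedef struct {
    int var;          /* e.g. %7 */
    size_t nlocs;     /* >1 when the value lives in several places */
    ValueLoc locs[4];
} StackmapEntry;

/* Is this value currently available in the given register? */
static bool in_register(const StackmapEntry *e, int reg) {
    for (size_t i = 0; i < e->nlocs; i++)
        if (e->locs[i].kind == LOC_REG && e->locs[i].reg == reg)
            return true;
    return false;
}
```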

We then had to deal with the fact that LLVM happily moves stackmaps around, allowing instructions to be inserted between them and their once-neighbours. This means that they no longer contain accurate information, and certainly don’t serve as valid safepoints anymore! To solve that we introduced a new pass that reassociates stackmaps with the correct neighbouring instruction, updating the stackmap as appropriate.

However, there are a few parts of LLVM which mess with things beyond our pass’s ability to fix. Our solution, alas, is simply to turn some optimisations off in functions that can be traced. This slows down the interpreter somewhat, but fortunately yk’s trace optimiser is able to recover the lost performance once code is traced.

Deoptimisation also raises a completely different, more fundamental, problem which has nothing to do with LLVM: what happens to stack pointers? Consider this code:

int x = ...;
int *y = &x;
if (z) printf("%d\n", *y);

Let’s imagine that z is false when we record the trace but true at some point later: we’ll deoptimise then call printf with a stack address. However, that stack address relates to when JIT-compiled code was running, when the stack had a very different layout: the chances of the address being the same on the AOT stack are slim.

Fortunately, the fix for this is simple: we introduce a shadow stack (creating a fresh shadow stack for each newly created thread) and place all variables whose address is taken on it. The shadow stack is shared between AOT and JIT frames — the latter does not extend or alter it. So, whether we’re running in JIT or AOT compiled code, the address of x will be the same26.
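Here’s a hedged sketch of the idea (invented names; yk’s real shadow stack is maintained by a compiler pass, not by hand): address-taken locals are bump-allocated on a separate, heap-allocated stack, so their addresses don’t depend on which machine-level frame is currently executing.

```c
#include <stdint.h>
#include <stdlib.h>

/* A hypothetical per-thread shadow stack. Variables whose address is
   taken live here rather than on the machine stack, so their addresses
   stay stable whether AOT or JIT code is running. */
typedef struct {
    uint8_t *base;
    size_t used, cap;
} ShadowStack;

static ShadowStack *shadow_new(size_t cap) {
    ShadowStack *ss = malloc(sizeof(ShadowStack));
    ss->base = malloc(cap);
    ss->used = 0;
    ss->cap = cap;
    return ss;
}

/* Allocate a slot for one address-taken local in the current frame; the
   pointer stays valid until the frame is popped. */
static void *shadow_alloc(ShadowStack *ss, size_t size) {
    if (ss->used + size > ss->cap)
        return NULL; /* a real implementation would grow the stack */
    void *p = ss->base + ss->used;
    ss->used += size;
    return p;
}

static void shadow_pop(ShadowStack *ss, size_t size) {
    ss->used -= size;
}
```

Because both the AOT frame and the JIT frame that takes it over allocate x’s slot from the same shadow stack, &x is identical in both worlds, so a pointer captured before deoptimisation remains valid after it.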

Conclusions

A couple of weeks ago I finally got around to porting yklua from Lua-5.4.6 to Lua 5.5.0: that’s two years’ worth of changes to Lua27. How long did it take me? Well under 2 hours. I lost 30 minutes to a boring linker-related oddity28, and nearly as long again to some external Lua tests that no longer work on Lua-5.5. The “hard” part of the work took me at most an hour: in any previous system, such changes would probably have taken months, if not longer.

Anecdote though it may be, I hope this gives you a sense that yk is a genuinely new point in the language performance design space, and that it has the potential to benefit many languages and users.

All that said, I don’t want to pretend that yk is the finished article: it most definitely isn’t! There are all sorts of things that need improving. Some are minor and will be noticed by few. Some will have much more profound effects. For example, at the time of writing, ykmicropython happens to work in a way that hits several gaps in yk: the resulting performance can be comically bad. Most of these are known gaps: filling them in isn’t rocket science, but it is work.

I hope, though, that you can see the promise that yk has to offer: that I could update yklua so quickly still seems magical to me! That said, we still have plenty of work to do to realise yk’s full promise, but we’re a very small team (as of April, 1½ people if you count generously), so please be realistic about the pace at which that can occur!

For those who are interested in playing with yk, I suggest starting at the yk book. Enjoy!

Acknowledgements: Thanks to Edd Barrett and Lukas Diekmann for comments.

2026-04-15 12:35

Footnotes

1

A quick note on terminology. Formally speaking, a VM contains one or more language implementations, for example both an interpreter and a JIT compiler. “VM” is therefore for the umbrella term, and I’ll only use a specific term (“interpreter” or “JIT”) where it’s necessary to do so.

2

One unexpected outcome of this is that I have developed a huge admiration for the PUC Lua interpreter. It’s written in an intimidating style, with seemingly endless C macros. Once I got my head around it, I realised that it is one of the most thoughtful codebases I have ever seen. Everything is there for a reason; there is a consistency which one rarely sees; and it is incredibly efficient.

3

Which, itself, isn’t as big as I would like.

4

For example, if you have an allocation heavy program, you will see only small benefits from yklua; if your program spends all of its time in the interpreter loop, or certain libraries, you might be lucky enough to see 6x. As you might expect, all these figures have been gradually improving over time – mostly in yk, but also in yklua itself – and will continue to do so. The speedups I’ve shown are thus best thought of as a floor, not a ceiling, for these particular benchmarks.

5

Well, almost. It’s a slight superset of Lua-5.1, with, I believe, a smattering of features of Lua-5.2.

6

Which, in this context, is a posh way of saying “gamble”.

7

Yes, there are/were additional Python JIT compilers that did not become public knowledge.

8

If you want to think of yk as “RPython for LLVM”, you will be very close to the truth!

9

This old post of mine is better than nothing, but I would explain things in a different way now.

10

I say “far more” rather than “all” because method-based JITs still speculate on things like object types, concrete values, and so on.

11

Though there is a subset of real programs where a well-written tracing JIT compiler will win out.

12

And, alas, enough money. I do know people who’ve got into compilers for the money. As a route to riches, it seems to me quixotic.

13

We put a huge amount of effort into doing this in hardware. One day I will write up why that didn’t work.

14

Though, as of yet, nowhere near as many as I would like it to have.

15

i.e. yk_promote(x) == x.

16

This technique can easily be adapted to situations where opcodes aren’t immutable, but aren’t expected to change often. There is a small overhead from doing so, but in the grand scheme of things, it’s insignificant.

17

Unlike the similar concept of a pure function, an idempotent function can have side-effects, but those side-effects must not change the output value that is produced. For example, an idempotent function might insert a key into a hashmap if it is not present, and then always return the same key subsequently.

18

Idempotency in yk is a strong guarantee on the interpreter author’s behalf: unlike promotion, if the interpreter author gets yk_idempotent wrong, the system is in the land of undefined behaviour.

19

Though very modern CPUs are doing an ever-better job at making loads and stores to the stack as cheap as register accesses. I’ve been stunned by some of the results I’ve seen on recent AMD CPUs. That said, the general point holds: even if some stack accesses are now more-or-less free, many, probably most, are still expensive.

20

For me, the key idea – and I admit that I don’t know if this is how LuaJIT does it – was to think about register allocation as finding diffs between the register state for the n+1th and nth instructions.

21

We do have a very simple “hints” API that can tell us what register a value earlier in the trace will be in. This is mostly used for things like function return arguments. Happily, from a performance perspective, it’s only used on demand, requiring fairly few calls, and isn’t an entire pass that requires O(n) memory.

22

This isn’t a great name, because it incorrectly implies “control flow” when that might not actually be the case for a given location. We haven’t yet come up with a better name.

23

Not only are some of these names in flux, but I will also be adding additional concepts to yk soon. We needn’t concern ourselves with them for now.

24

Not to be confused with LLVM’s safepoints, which are also called statepoints (yes, really), which aren’t quite the same thing as patchpoints. As this might suggest, there are multiple experimental APIs in this part of LLVM, mostly intended to help garbage collectors — but none of these APIs seems complete in its implementation, with some having obvious flaws.

Although I’ve seen several references to these features being used by production VMs, it’s unclear to me whether they’re using stock LLVM or a local fork (perhaps like ykllvm) where these things are fixed. For example, see this brave soul who has enumerated several of the sharp edges in stock LLVM.

25

The stackmap documentation says that it should have enough information to rebuild stack frames:

An important motivation for this design is to allow a runtime to commandeer a stack frame when execution reaches an instruction address associated with a stack map. The runtime must be able to rebuild a stack frame and resume program execution using the information provided by the stack map. For example, execution may resume in an interpreter or a recompiled version of the same function.

but our experience suggests it does not currently have all the necessary information.

26

Interestingly this fix doesn’t work for setjmp: if you call setjmp in JIT-compiled code, then deoptimise, and call longjmp, bad things happen. We have a reasonable design for a fix for this in the works.

27

Lua’s download page shows that Lua-5.4.6 was released on 2023-05-02 and Lua-5.5.0 on 2025-12-15.

28

Arguably I got lucky here. Linkers are mysterious things, understood by few people. I, alas, am not in that select few.
