Compiled and Interpreted Languages: Two Ways of Saying Tomato

What we might think to be a settled truth often isn’t. I was reminded of this recently after I publicly stated that there is no such thing as a compiled or an interpreted programming language: many people have made clear to me that they believe strongly that languages clearly fall into one of these two categories. Indeed, a simple web search suggests that the vast majority of pages describing compiled and interpreted languages propose a crisp delineation between the two.

I certainly sympathise with the temptation to classify languages in this way, because most programming languages are commonly implemented using one technique or the other. However, experience has taught me that it’s important not to conflate the most common implementation choice with inherent properties of a language.

In this post I’m therefore going to show, using a series of programming language implementations, why languages shouldn’t be classified as compiled or interpreted. Before I do that, I need to start by clarifying the difference between a programming language specification and an implementation of that specification.

Programming language specifications vs. implementations

What is a programming language? Let’s take Python as an example. Many people will say that the Python programming language is what I use to run my Python programs. For example at the command-line I might execute:

$ cat hello.py
print("Hello world")
$ python3 hello.py
Hello world

Even in this simple example, we can see something a little surprising. On all the machines I have access to, there is no longer a python executable: I have to explicitly choose between the python2 and python3 executables. Still, it’s probably not that surprising that a programming language has multiple versions, so this difference can be easily brushed aside.

What is more difficult to ignore is that on most of my machines, I have access to entirely different Python executables. The most well known is PyPy:

$ pypy hello.py
Hello world

PyPy isn’t a version of the Python programming language: indeed, if I go to its download page, I can see that it has download links for variants of PyPy for different Python versions.

Using this, we can differentiate two things: the specification of the Python programming language and implementations of that specification. Python calls the specification the “Python Language Reference”. The python3 executable is an implementation of that specification called CPython [1] [2]. PyPy is one of several completely separate implementations of the Python language specification (see also IronPython amongst others).

Not all implementations of the Python specification are equivalent, either from a community perspective or technically.

CPython is de facto the “standard” implementation of Python. Most explorations of changes to the language specification are made in CPython, and releases of “Python the language specification” are implicitly assumed to be made with each release of “CPython the implementation”.

Other implementations of the Python specification then face an interesting choice. Like all programming language specifications [3], Python’s leaves considerable scope for implementations to satisfy it in different ways. Most Python implementations, however, try to match CPython’s behaviour whenever possible. PyPy, for example, often goes considerably out of its way to match CPython’s behaviour, but we can still observe differences in seemingly simple code:

$ cat diffs.py
print(str(0) is str(0))
$ python3 diffs.py
False
$ pypy diffs.py
True

It’s important to note that in this example code, both CPython and PyPy are correctly implementing the Python specification, but they have chosen to do so in different ways.
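The difference above is confined to object identity: Python’s is operator asks whether its two operands are the very same object, and the specification leaves implementations free to reuse, or not reuse, objects as they see fit. Where the specification does pin behaviour down, such as value equality with ==, every conforming implementation must agree; for example, both of the following should print True:

$ python3 -c "print(str(0) == str(0))"
True
$ pypy -c "print(str(0) == str(0))"
True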

Python is far from the only language to clearly separate its specification from implementations. For example, JavaScript has a specification and multiple implementations such as JavaScriptCore (most commonly associated with the Safari browser), SpiderMonkey (Firefox), and V8 (Chrome). C’s specification is implemented by multiple compilers, most obviously Clang and GCC. [4]

Simple programming language implementations

History clearly shows that most programming languages start life with a single implementation, but widely adopted programming languages eventually attract multiple implementations. This has a clear downside — as we saw above, different implementations are rarely fully compatible, and the differences between them can become a seemingly endless source of bugs and frustration. However, there are upsides or, at least, justifications including sheer interest (language implementations are fun!), portability (enabling a language to work on a platform previous implementations did not support), and “embrace and extend” (trying to crush a competitor) [5].

For the rest of this post, I want to focus on two opposing factors that are common reasons behind the choice of a language implementation: simplicity (how easy is it to understand and change the programming language implementation?), and performance (how fast does the implementation make programs run?). The first implementation of most programming languages tends to be simple, partly to make it easier to complete the implementation, partly because it enables experimentation. Over time, there is then a tendency to forego simplicity to gain performance [6].

Let’s start with simplicity. The simplest programming language implementations are invariably associated with interpreters. As a starting definition for what “interpreter” means, let’s assume it means “execute the input program directly”. For example, a complete Rust interpreter for the BF language is as follows (full source code here):

use std::{env, error, fs, io::{self, Read, Write}};
fn main() -> Result<(), Box<dyn error::Error>> {
  let prog = fs::read(env::args().nth(1).unwrap())?;
  let mut pc = 0;
  let mut cells = vec![0u8; 10000];
  let mut cc = 0;
  while pc < prog.len() {
    match prog[pc] as char {
      '<' => cc -= 1,
      '>' => cc += 1,
      '+' => cells[cc] = cells[cc].wrapping_add(1),
      '-' => cells[cc] = cells[cc].wrapping_sub(1),
      '[' if cells[cc] == 0 => {
        let mut count = 1;
        while count > 0 {
          pc += 1;
          if prog[pc] as char == ']' { count -= 1 }
          else if prog[pc] as char == '[' { count += 1 }
        }
      }
      ']' if cells[cc] != 0 => {
        let mut count = 1;
        while count > 0 {
          pc -= 1;
          if prog[pc] as char == '[' { count -= 1 }
          else if prog[pc] as char == ']' { count += 1 }
        }
      }
      '.' => io::stdout().write_all(&cells[cc..cc+1])?,
      ',' =>
        io::stdin().read_exact(&mut cells[cc..cc + 1])?,
      _ => ()
    }
    pc += 1;
  }
  Ok(())
}

BF is a Turing tape language. Memory is represented by an array of (technically infinite, but restricted above to 10,000) unsigned 8-bit integer cells. The program can access one cell at a time: the current position of the cell pointer is cc; “<” moves the cell pointer to the left and “>” moves it to the right; “+” increments the current cell (with wrapping behaviour, i.e. 255 + 1 wraps to 0) and “-” decrements the current cell (also with wrapping behaviour). “[...]” denotes BF’s while loop equivalent, which compares the current cell for equality/inequality to zero and jumps forwards/backwards to the matching bracket accordingly: note that brackets can nest, so jumps have to take that into account. A full stop “.” outputs the current byte to stdout (where it is interpreted as an ASCII character); a comma “,” reads a byte from stdin and places it in the current cell.

As you might expect, BF programs look a little strange. This BF program:

++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.
+++++++..+++.>++.<<+++++++++++++++.>.
+++.------.--------.>+.>.

turns out to be an old friend when run:

$ cargo run --release --bin interp1 hello.bf
Hello World!

Those of us who are used to programming language implementations tend to forget how magical this is: in just 37 lines of code, I have made a full implementation of a Turing complete programming language. Even though BF is a minimalist language, you can – albeit awkwardly – express any possible program in it. I hope you can still be impressed by how simple a complete interpreter for it is!

Speeding up an interpreter

There is a downside to the simplicity of the BF interpreter above: it’s much slower than other BF implementations. It’s thus worth us investing a little time in making it faster.

A classic way of optimising any programming language implementation is to process the program before executing it with the aim of moving work from during-program-execution to before-execution-processing. In the case of our simple BF interpreter from above, an obvious problem is the way it treats BF’s square brackets [...]. When the interpreter encounters either the opening or closing square bracket, it has to skip forwards/backwards for the matching square bracket, bearing in mind that brackets can be nested. Fortunately, since BF programs can’t modify themselves, we always jump forwards/backwards to the same location. Thus we can precalculate the jumps, store them in a cache, and turn the dynamic searches (using a while loop) into a simple lookup in a vector (full source code here):

fn main() -> Result<(), Box<dyn error::Error>> {
  ...
  let mut bmap = vec![];
  let mut bstack = vec![];
  for (i, b) in prog.iter().enumerate() {
    bmap.push(usize::max_value());
    if *b as char == '[' { bstack.push(i); }
    else if *b as char == ']' {
      let s = bstack.pop().unwrap();
      bmap[s] = i;
      bmap[i] = s;
    }
  }
  ...
  while pc < prog.len() {
    match prog[pc] as char {
      ...
      '[' if cells[cc] == 0 => pc = bmap[pc],
      ']' if cells[cc] != 0 => pc = bmap[pc],
    }
    pc += 1;
  }
  Ok(())
}

Interestingly, this version is 3 lines shorter than our original version. This isn’t entirely uncommon: faster interpreters can sometimes be slightly smaller than their slower equivalents, though “smaller” doesn’t necessarily mean “easier to understand”!

How much faster has the bracket precalculation made this version? I’ll use three BF programs as simple benchmarks: bench.bf (a semi-arbitrary BF workload), hanoi.bf (a Towers of Hanoi solver), and mandelbrot.bf (a Mandelbrot generator). The point here isn’t to do research paper quality benchmarking, but to give us some fun programs to use for comparison. For example, here’s mandelbrot.bf running on the first BF interpreter from earlier:

It’s now time to see some numbers. I ran each benchmark 30 times on an unloaded Xeon 6254 server running Debian 11 with Rust 1.66. Results are presented in seconds with 99% confidence intervals. The benchmarking program, data-processor program, and the data I used for this post are all available online. Here are the results for our first two interpreters:

                                    bench.bf        hanoi.bf         mandelbrot.bf
1: base                             2.067±0.0042    13.404±0.0785    30.307±0.0160
2: + bracket caching                1.430±0.0268     8.308±0.0710    14.341±0.1792

“base” is the original BF interpreter from earlier in the post; “+ bracket caching” is our updated version. As we can see, performance has improved by approximately 1.4-2x, depending on the benchmark (this sort of variation between benchmarks is common when benchmarking changes to programming language implementations).

Changing representation

Our first two BF interpreters both load the input program in and execute it directly. In contrast, mainstream language implementations such as Python read a program in and translate it to a different representation. We lack common terminology for what to call this different representation, so I will call it a sequence of opcodes (feel free to substitute “bytecode” or “bitcode” or the like, if you prefer such terms).

We first modify our interpreter (full source code here) by adding an enum, with one opcode per BF instruction as a starting point:

enum Ops {
  Left, Right, Add, Sub, LBrack, RBrack, Output, Input
}

We then have to translate BF instructions into our opcodes, which is an almost 1:1 translation, and change the main loop to work on opcodes:

fn main() -> Result<(), Box<dyn error::Error>> {
  let mut prog = vec![];
  for b in fs::read(env::args().nth(1).unwrap())? {
    match b as char {
      '<' => prog.push(Ops::Left),
      '>' => prog.push(Ops::Right),
      '+' => prog.push(Ops::Add),
      '-' => prog.push(Ops::Sub),
      '[' => prog.push(Ops::LBrack),
      ']' => prog.push(Ops::RBrack),
      '.' => prog.push(Ops::Output),
      ',' => prog.push(Ops::Input),
      _ => (),
    }
  }
  ...
  while pc < prog.len() {
    match prog[pc] {
      Ops::Left => cc -= 1,
      Ops::Right => cc += 1,
      Ops::Add => cells[cc] = cells[cc].wrapping_add(1),
      Ops::Sub => cells[cc] = cells[cc].wrapping_sub(1),
      Ops::LBrack if cells[cc] == 0 => pc = bmap[pc],
      Ops::RBrack if cells[cc] != 0 => pc = bmap[pc],
      Ops::Output => io::stdout().write_all(&cells[cc..cc+1])?,
      Ops::Input =>
        io::stdin().read_exact(&mut cells[cc..cc + 1])?,
      _ => ()
    }
    pc += 1;
  }
  Ok(())
}

This step initially slows execution down slightly, probably in large part because a single-byte BF instruction has now become (on a 64-bit machine) an 8-byte opcode:

                                    bench.bf        hanoi.bf         mandelbrot.bf
2: + bracket caching                1.430±0.0268     8.308±0.0710    14.341±0.1792
3: + opcodes                        1.802±0.0103     9.242±0.0047    16.353±0.1486

Fortunately, we can recover some of the lost performance by folding the bracket cache into the LBrack and RBrack opcodes (full source code here):

enum Ops {
  ..., LBrack(usize), RBrack(usize), ...
}
fn main() -> Result<(), Box<dyn error::Error>> {
  let mut prog = vec![];
  for b in fs::read(env::args().nth(1).unwrap())? {
    match b as char {
      ...
      '[' => prog.push(Ops::LBrack(usize::max_value())),
      ']' => prog.push(Ops::RBrack(usize::max_value())),
    }
  }
  let mut bstack = vec![];
  let mut i = 0;
  while i < prog.len() {
    match prog[i] {
      Ops::LBrack(_) => bstack.push(i),
      Ops::RBrack(_) => {
        let s = bstack.pop().unwrap();
        prog[s] = Ops::LBrack(i);
        prog[i] = Ops::RBrack(s);
      }
      _ => ()
    }
    i += 1;
  }
  ...
  while pc < prog.len() {
    match prog[pc] {
      ...
      Ops::LBrack(i) if cells[cc] == 0 => pc = i,
      Ops::RBrack(i) if cells[cc] != 0 => pc = i,
    }
    pc += 1;
  }
  Ok(())
}

This does slightly complicate the processing stage: we first of all create “temporary” LBrack and RBrack opcodes (with usize::max_value() as a dummy value) since we don’t yet have enough information to know how far to jump forwards/backwards [7]. We thus have to add a second stage where we iterate over the opcodes and replace the temporary LBrack and RBrack opcodes with properly calculated jump values.

The effects of this vary across our benchmarks, improving two, and having no effect on the other:

                                    bench.bf        hanoi.bf         mandelbrot.bf
2: + bracket caching                1.430±0.0268     8.308±0.0710    14.341±0.1792
3: + opcodes                        1.802±0.0103     9.242±0.0047    16.353±0.1486
4: + opcode bracket caching         1.263±0.0042     9.240±0.0028    14.898±0.0089

Still, the aim in moving from interpreting BF programs directly to using opcodes isn’t for immediate performance gains. Rather, the use of opcodes is an intermediate step that allows us to optimise BF programs in other ways: so long as the basic opcodes don’t slow us down too much, we hope to build upon them to win in other ways.

Repeated opcodes

One obvious inefficiency in BF programs is repeated sequences of the same instruction. For example, to increment a cell by 3, I have to use three add instructions “+++”. We can optimise such repeated sequences by adding a simple count to the opcode, replacing Add with Add(3) for this simple example. Let’s start by doing this just for the Add opcode (full source code here):

enum Ops {
  ..., Add(u8), ...
}
fn main() -> Result<(), Box<dyn error::Error>> {
  let mut prog = vec![];
  let bytes = fs::read(env::args().nth(1).unwrap())?;
  let mut i = 0;
  while i < bytes.len() {
    match bytes[i] as char {
      ...
      '+' => {
        let j = bytes[i..].iter()
          .take_while(|b| **b as char == '+').count();
        prog.push(Ops::Add(u8::try_from(j).unwrap()));
        i += j - 1;
      }
    }
    i += 1;
  }
  ...
  while pc < prog.len() {
    match prog[pc] {
      ...
      Ops::Add(i) => cells[cc] = cells[cc].wrapping_add(i),
    }
    pc += 1;
  }
  Ok(())
}

This speeds up performance on bench.bf and hanoi.bf but slows down mandelbrot.bf:

                                    bench.bf        hanoi.bf         mandelbrot.bf
4: + opcode bracket caching         1.263±0.0042     9.240±0.0028    14.898±0.0089
5: + Add(n)                         1.029±0.0042     6.912±0.0042    16.169±0.0108

Now that we know that identifying repeated opcodes is at least sometimes useful, let’s use the same technique for Left, Right, and Sub (full source code here):

enum Ops {
  Left(usize), Right(usize), Add(u8), Sub(u8), ...
}
fn main() -> Result<(), Box<dyn error::Error>> {
  ...
  while i < bytes.len() {
    match bytes[i] as char {
      '<' => {
        let j = bytes[i..].iter()
          .take_while(|b| **b as char == '<').count();
        prog.push(Ops::Left(j));
        i += j - 1;
      }
      '>' => {
        let j = bytes[i..].iter()
          .take_while(|b| **b as char == '>').count();
        prog.push(Ops::Right(j));
        i += j - 1;
      }
      '+' => { ... }
      '-' => {
        let j = bytes[i..].iter()
          .take_while(|b| **b as char == '-').count();
        prog.push(Ops::Sub(u8::try_from(j).unwrap()));
        i += j - 1;
      }
    }
    i += 1;
  }
  ...
  while pc < prog.len() {
    match prog[pc] {
      Ops::Left(i) => cc -= i,
      Ops::Right(i) => cc += i,
      Ops::Add(i) => cells[cc] = cells[cc].wrapping_add(i),
      Ops::Sub(i) => cells[cc] = cells[cc].wrapping_sub(i),
      ...
    }
    pc += 1;
  }
  Ok(())
}

This has a significant benefit for mandelbrot.bf though it does slightly slow the other two benchmarks down. Still, all of the benchmarks are now notably faster than before we identified repeated opcodes:

                                    bench.bf        hanoi.bf         mandelbrot.bf
4: + opcode bracket caching         1.263±0.0042     9.240±0.0028    14.898±0.0089
5: + Add(n)                         1.029±0.0042     6.912±0.0042    16.169±0.0108
6: + Sub(n) / Left(n) / Right(n)    1.135±0.0056     7.042±0.0038     5.721±0.0094

It’s also worth noting that only by translating BF programs to opcodes could we have made this way of optimising BF programs practical. But there’s more to come!

Multi-opcode optimisations

At this point we’ve arguably done the easy stuff and now we have to think a bit more deeply about which parts of a common BF program might be slow. One common BF idiom is a while loop that resets a cell to zero, written [-] in BF source: its running time is proportional to the current value of the cell (i.e. the higher the current value in the cell, the longer it takes to reset to zero!). In terms of our opcodes this results in the sequence [LBrack, Sub(1), RBrack].

There is no way that we can optimise this opcode sequence in terms of the opcodes we currently have. But, fortunately, adding opcodes is easy, and the BF program never need know we’ve done it! Let’s add a new opcode Zero which simply resets a cell to zero directly, and replace all [LBrack, Sub(1), RBrack] sequences with a single Zero opcode. The code is fairly simple (full source code here):

enum Ops {
  ..., Zero
}
fn main() -> Result<(), Box<dyn error::Error>> {
  ...
  let mut i = 0;
  while i < prog.len() {
    if i + 2 < prog.len() &&
       matches!(
         prog[i..i+3],
         [Ops::LBrack(_), Ops::Sub(1), Ops::RBrack(_)]) {
      prog.splice(i..i+3, [Ops::Zero]);
      i += 3;
    } else { i += 1; }
  }
  ...
  while pc < prog.len() {
    match prog[pc] {
      ...
      Ops::Zero => cells[cc] = 0,
    }
    pc += 1;
  }
  Ok(())
}

This hugely speeds up bench.bf and hanoi.bf:

                                    bench.bf        hanoi.bf         mandelbrot.bf
6: + Sub(n) / Left(n) / Right(n)    1.135±0.0056     7.042±0.0038     5.721±0.0094
7: + Zero                           0.301±0.0042     0.491±0.0033     6.023±0.0052

Putting it all together

You might be wondering why I started talking about compiled and interpreted languages, added a digression about programming language specifications, and have ended up optimising BF.

Partly this is to emphasise the split between language specifications and implementations: I’ve not altered the BF specification, but I now have multiple distinct BF implementations.

However, the real reason is that, through a series of small tweaks to our evolving BF implementation(s), we have moved from a “clearly interpreted” BF implementation to what I would argue is a “compiled” implementation. In the former, we execute the BF input program directly; in the latter, we not only transform the BF program into another representation, but we also perform optimisations on that representation.

To help make this clearer, I’m going to show the complete implementation as it currently stands, moving the code into different functions, and naming the aspects of the implementation using more standard terminology:


use std::{env, error, fs, io::{self, Read, Write}};
enum Ops {
  Left(usize), Right(usize), Add(u8), Sub(u8),
  LBrack(usize), RBrack(usize), Zero, Output, Input
}
fn main() -> Result<(), Box<dyn error::Error>> {
  let prog = compile()?;
  evaluate(prog)?;
  Ok(())
}

// Compiler
fn optimise(prog: &mut Vec<Ops>) {
  let mut i = 0;
  while i < prog.len() {
    if i + 2 < prog.len() &&
      matches!(
        prog[i..i+3],
        [Ops::LBrack(_), Ops::Sub(1), Ops::RBrack(_)]) {
      prog.splice(i..i+3, [Ops::Zero]);
      i += 3;
    } else { i += 1; }
  }
}
fn compile() -> Result<Vec<Ops>, Box<dyn error::Error>> {
  let mut prog = vec![];
  let bytes = fs::read(env::args().nth(1).unwrap())?;
  let mut i = 0;
  while i < bytes.len() {
    match bytes[i] as char {
      '<' => {
        let j = bytes[i..].iter()
          .take_while(|b| **b as char == '<').count();
        prog.push(Ops::Left(j));
        i += j - 1;
      }
      '>' => {
        let j = bytes[i..].iter()
          .take_while(|b| **b as char == '>').count();
        prog.push(Ops::Right(j));
        i += j - 1;
      }
      '+' => {
        let j = bytes[i..].iter()
          .take_while(|b| **b as char == '+').count();
        prog.push(Ops::Add(u8::try_from(j).unwrap()));
        i += j - 1;
      }
      '-' => {
        let j = bytes[i..].iter()
          .take_while(|b| **b as char == '-').count();
        prog.push(Ops::Sub(u8::try_from(j).unwrap()));
        i += j - 1;
      }
      '[' => prog.push(Ops::LBrack(usize::max_value())),
      ']' => prog.push(Ops::RBrack(usize::max_value())),
      '.' => prog.push(Ops::Output),
      ',' => prog.push(Ops::Input),
      _ => (),
    }
    i += 1;
  }
  optimise(&mut prog);
  let mut bstack = vec![];
  let mut i = 0;
  while i < prog.len() {
    match prog[i] {
      Ops::LBrack(_) => bstack.push(i),
      Ops::RBrack(_) => {
        let s = bstack.pop().unwrap();
        prog[s] = Ops::LBrack(i);
        prog[i] = Ops::RBrack(s);
      }
      _ => ()
    }
    i += 1;
  }
  Ok(prog)
}

// Evaluator / "interpreter"
fn evaluate(prog: Vec<Ops>)
  -> Result<(), Box<dyn error::Error>>
{
  let mut cells = vec![0u8; 10000];
  let mut cc = 0usize;
  let mut pc = 0;
  while pc < prog.len() {
    match prog[pc] {
      Ops::Left(i) => cc -= i,
      Ops::Right(i) => cc += i,
      Ops::Add(i) => cells[cc] = cells[cc].wrapping_add(i),
      Ops::Sub(i) => cells[cc] = cells[cc].wrapping_sub(i),
      Ops::LBrack(i) if cells[cc] == 0 => pc = i,
      Ops::RBrack(i) if cells[cc] != 0 => pc = i,
      Ops::Zero => cells[cc] = 0,
      Ops::Output => io::stdout().write_all(&cells[cc..cc+1])?,
      Ops::Input =>
        io::stdin().read_exact(&mut cells[cc..cc + 1])?,
      _ => ()
    }
    pc += 1;
  }
  Ok(())
}

To help differentiate the various parts of the implementation more clearly, I’ve labelled them with comments. This implementation starts with generic parts (the opcodes and main function). When a program is loaded in, it is compiled into a sequence of opcodes. The compiler is split into two: a combination parser / code generator and a separate optimisation step. When the program is fully compiled it is then evaluated (or “interpreted” if you prefer that term).

This raises an interesting question: when would you say our BF implementations moved from interpreted to compiled? I find it hard to pinpoint the transition definitively. The first BF implementation is definitely interpreted; and by the fifth I’d say it’s definitely compiled; but I’m not sure which of the intermediate stages can be stated as the definitive transition point. Rather, the intermediate stages get increasingly hard to classify with certainty.

The muddiness doesn’t end there: different aspects of real language implementations are often muddled together, just as they are in our final BF interpreter. For example, some compilers have very distinct lexing, parsing, AST (Abstract Syntax Tree) generation, and IR (Intermediate Representation) generation phases. Even in our final BF implementation there’s no advantage to separating these phases out: there isn’t even a distinct AST or IR! As do many other compilers, our final BF implementation does some optimisations in early phases (the repeated opcode optimisation) and some in a more clearly delineated optimisation phase (the Zero opcode optimisation). If you look inside, say, LLVM, you’ll see that optimisations happen at various stages beyond those explicitly marked as the “optimising stages”.
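To make the “distinct AST” point a little more concrete: if we did want such a layer for BF, it might look something like the following hypothetical enum, where loops nest structurally rather than being encoded as jump offsets, and a later stage would flatten it into the Ops sequence we execute.

// A hypothetical AST for BF, a layer our implementation deliberately skips:
// loops contain their bodies directly instead of storing jump targets.
enum AstNode {
  Left(usize),
  Right(usize),
  Add(u8),
  Sub(u8),
  Zero,
  Output,
  Input,
  Loop(Vec<AstNode>),
}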

A reasonable question, or worry, is whether I’ve deliberately created implementations in a strange style just to make a point. I haven’t. This style – that is, a single executable (often, but not always, called a “VM” (Virtual Machine)) which internally contains a separate compiler and evaluator – is widespread. Notable mainstream language implementations using it include CPython and CRuby. CPython even caches the output of its compiler as .pyc files — the output of a compiler doesn’t have to be machine code!
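For example, one can ask CPython to compile a module without running it, and then see the opcodes it produced written out to disk. The exact file name depends on the CPython version in use, but the session looks something like this:

$ python3 -m py_compile hello.py
$ ls __pycache__
hello.cpython-311.pyc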

Finally, programming language performance is sometimes surprising. Here are all 8 of the BF implementations we’ve created:

                                    bench.bf        hanoi.bf         mandelbrot.bf
1: base                             2.067±0.0042    13.404±0.0785    30.307±0.0160
2: + bracket caching                1.430±0.0268     8.308±0.0710    14.341±0.1792
3: + opcodes                        1.802±0.0103     9.242±0.0047    16.353±0.1486
4: + opcode bracket caching         1.263±0.0042     9.240±0.0028    14.898±0.0089
5: + Add(n)                         1.029±0.0042     6.912±0.0042    16.169±0.0108
6: + Sub(n) / Left(n) / Right(n)    1.135±0.0056     7.042±0.0038     5.721±0.0094
7: + Zero                           0.301±0.0042     0.491±0.0033     6.023±0.0052
8: reorganised                      0.282±0.0038     0.484±0.0038     5.746±0.0038

You may be as surprised as I am that reordering the implementation into separate compilation and evaluation functions noticeably improves performance! Perhaps rustc and LLVM are indirectly showing their approval for correctly separating and labelling the relevant parts of the implementation?!

Putting this into perspective, here are the first (left) and last (right) BF implementations running side by side for Mandelbrot (I even gave the first implementation a slight head-start, as you’ll see from the video):

Conclusions

I hope that this post has shown you two main things. First, that language specifications and implementations are very different things. Second, via the series of evolving BF implementations, that any given language can be implemented as an interpreter or a compiler. If you don’t find the opcode compiler I wrote convincing, then you can find BF implementations that generate C code or even machine code directly! You could even have good fun adding more BF optimisations, or a machine code backend, to the final BF implementation from this post.
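To give a feel for how small that step is, here is a minimal, hypothetical sketch (not part of this post’s repository; emit_c is a name I’ve made up) of a C-generating backend for the final implementation’s opcodes: rather than evaluating the Ops sequence, it prints one C statement per opcode and leaves the hard work to a C compiler.

use std::fmt::Write;

// Hypothetical sketch: emit C source from the final implementation's opcodes.
// Loops become C while loops, so the jump targets stored in LBrack/RBrack are
// not needed; Input differs slightly from our evaluator in that it does not
// treat end-of-file as an error.
fn emit_c(prog: &[Ops]) -> String {
  let mut out = String::from(
    "#include <stdio.h>\n\
     unsigned char cells[10000];\n\
     int main(void) {\n\
     unsigned char *cc = cells;\n");
  for op in prog {
    let _ = match op {
      Ops::Left(i) => writeln!(out, "cc -= {};", i),
      Ops::Right(i) => writeln!(out, "cc += {};", i),
      Ops::Add(i) => writeln!(out, "*cc += {};", i),
      Ops::Sub(i) => writeln!(out, "*cc -= {};", i),
      Ops::LBrack(_) => writeln!(out, "while (*cc) {{"),
      Ops::RBrack(_) => writeln!(out, "}}"),
      Ops::Zero => writeln!(out, "*cc = 0;"),
      Ops::Output => writeln!(out, "putchar(*cc);"),
      Ops::Input => writeln!(out, "*cc = getchar();"),
    };
  }
  out.push_str("  return 0;\n}\n");
  out
}

Piping its output through a C compiler would turn the implementation into an ahead-of-time compiler to machine code, while the BF-facing parts stay exactly as they are.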

Another thing I hope that I’ve indirectly shown, or at least hinted at, is that any language can be implemented in any style. Even if you want to think of CPython as an “interpreter” (a term that can be dangerous, as it occludes the separate internal compiler), there’s PyPy, which dynamically compiles Python programs into machine code, and various systems which statically compile Python code into machine code (see e.g. mypyc). There are interpreters for C and C++ (e.g. Cint or Cling). C++’s constexpr and Rust’s const require their respective compilers to include an evaluation mechanism to evaluate C++/Rust code at compile-time — a mechanism that is normally implemented as an interpreter (and Clang’s will soon look rather like our later BF implementations!). Most JIT VMs start programs running in an interpreter before compiling them into machine code, but some use a “baseline” compiler to immediately compile programs into machine code before later optimising them (V8 used this technique until fairly recently).
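As a small illustration of the point about C++’s constexpr and Rust’s const, here is a tiny, hypothetical example: fib is ordinary-looking Rust, but because FIB10 is a constant, rustc’s built-in evaluator has to run it at compile time.

// fib is run by rustc's compile-time evaluator: FIB10 is a constant baked
// into the binary, computed before the program ever executes.
const fn fib(n: u64) -> u64 {
  let mut a = 0;
  let mut b = 1;
  let mut i = 0;
  while i < n {
    let t = a + b;
    a = b;
    b = t;
    i += 1;
  }
  a
}
const FIB10: u64 = fib(10);

fn main() {
  println!("{}", FIB10); // prints 55
}

The evaluator that computes FIB10 is, in effect, an interpreter for Rust embedded inside a compiler: another reminder that the two techniques happily coexist within a single implementation.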

I could go on — but hopefully you get the idea! Languages are not their implementations, and any language can be implemented in any style. And, unlike the pronunciation of tomato, all those styles of implementation are equally valid.

Acknowledgements: thanks to Carl Friedrich Bolz-Tereick and Lukas Diekmann for comments.

Update (2023-10-28): I’ve fixed an off-by-1 error that prevented the optimisation in the 7th and 8th implementations from kicking in for [-] if it was the very last thing in the program.


Footnotes

[1]

At the time of writing, the GitHub page says “This is Python version 3.12.0 alpha 3”, showing how easy it is to mix up a language specification and implementation!

[2]

As far as I know, the name “CPython” was not used originally when Python was created. Tracking down when it was used is hard. Certainly, by the time Jython (“Python for Java”) came along, I can find references to “CPython” as a way of distinguishing the two implementations, but whether that name was used before then I’m unsure.

[3]

Most programming language specifications are written in English, which always leaves room for ambiguity. Some programming language specifications (though, rarely, the “official” specifications) are written in more formal logical notations which leave much less room for ambiguity (e.g. see the “Full Monty” specification of Python).

However, even the most precise specifications will not describe every possible aspect of observable behaviour, partly because different implementations need some flexibility to satisfy other goals. For example, different approaches to garbage collection are beneficial for different sets of users, but those approaches can also cause programs to execute slightly differently. This is not to say that flexibility in programming language specifications is always intentional!

[4]

Some languages don’t even have a single specification, such as BASIC, allowing many different languages and implementations with “BASIC” in their name to diverge substantially from one another — yet such languages and implementations tend to resemble, at least in their core, our idea of what BASIC should be.

[5]

Interestingly, both “portability” and “embrace and extend” are now much less common reasons for creating a new language implementation.

“Embrace and extend” is now generally seen as gauche at best, or anti-competitive at worst. It is now common for “competitor” implementations to genuinely collaborate with one another. We are all much better off for this, in my opinion!

“Portability”, once something I spent a lot of time messing around with, has reduced due to several factors coinciding. First, the number of hardware and operating systems in practical use has declined, so there are fewer platforms to worry about. Second, many more language implementations are designed from the beginning (or, at least, soon after their beginning) with portability in mind. Third, systems like LLVM (or the JVM) allow language implementations to easily abstract away from a lot of low-level details, gaining portability almost for free.

[6]

Sometimes the first, simple, implementation withers away or is discarded: this is common with statically typed languages. Sometimes the first, simple, implementation becomes the de facto standard implementation: this is common with dynamically typed languages. There are various possible explanations for this cultural difference, though I don’t think it’s hugely important either way.

[7]

OK, we do know how far to jump backwards — for now. That won’t be true a little later, so I’m going to keep things simple.
