Static Integer Types

June 22 2021

Over several years I’ve had several conversations with people about static integer types in programming languages. Given the number of subtleties in the design, and pitfalls in the use, of integer types it’s not surprising that most programmers are unaware of at least some of them. However, what is perhaps more surprising is that most programming language designers and implementers aren’t fully aware of some of the subtleties and pitfalls either. That leads to languages and their implementations baking in unintended, and sometimes unfortunate, design decisions. At the end of conversations on this topic, I’ve sometimes said “this stuff is hard and someone should think more about this and write it up” but it’s an unsatisfying conclusion. So even though I don’t have all the answers – indeed, I’m fairly sure that I don’t know all the questions – I’m going to jot down my current thinking on this subject.

The basics

In this post I’m considering the design of fixed-width integer types in statically typed languages. By “statically typed” I mean types which are enforced at compile-time [1], whether they are explicitly spelt out in source code or inferred by the compiler. By “fixed width” I mean integers which fit within a number of bits determined at compile-time. These come in two flavours: those where the number of bits is the same on all platforms (e.g. u32 typically means “an unsigned 32-bit integer”); and those where the number of bits is parameterised by the platform (e.g. Rust’s usize might be 16-bits, 32-bits, 64-bits, or bigger depending on the underlying platform). I won’t be considering floating point numbers, arbitrarily-sized integers (i.e. those whose storage size varies at run-time depending on the number they are storing), or the like.

An obvious question – and one that I imagine most new programmers have – is why do we have more than one integer type? Why, on a 64-bit machine, do I not always use 64-bit integers? There are various possible explanations, but to me two stand out. First, we often have to interact with things (from external libraries to the hardware we’re running on) that require integers of certain widths. Second, even though our computers often have large quantities of DRAM, memory is not infinite, and even when there’s plenty of free memory, using it wastefully can slow programs down. For example, imagine that I’m writing a system which records students’ marks out of 10: I only need 4-bits to represent all the valid marks, a sixteenth of the memory a 64-bit integer would use. If my program records lots of students’ marks, then the memory saving between 4-bits and 64-bits will probably have a noticeable effect on the program’s performance, because the caches in modern CPUs are tiny relative to DRAM. For example, my desktop computer has 32GiB DRAM but the L1 cache (split into two equal-sized sub-parts for instructions and data) is a mere 64KiB (about 0.0002% of the size of DRAM): saving memory is an effective way of improving program performance.

In practice, few programming languages offer the user direct access to 4-bit integers [2]. Since truncating the range of numbers we need to deal with is nearly always a terrible idea, we have to round up our ideal integer width to the next highest representable integer type. I would thus probably represent students’ marks as an 8-bit integer, but if I needed to represent the distance between two points on earth in miles I’d probably use a 16-bit number (although the maximum possible distance, 8000 miles, can be represented in 13-bits). Interestingly, although it’s invisible to us as normal programmers, compilers and CPUs also sometimes round integer types up even further than we realise. On a 64-bit machine we might want to store a 32-bit integer in our program, but there’s nothing to stop the compiler or the CPU treating it as a 64-bit value for arithmetic operations, provided whenever we can observe the value it’s 32-bits wide.

Signed and unsigned integers

Although I have been clear about the maximum values I wanted to represent in those examples, I’ve been deliberately vague about the minimum values. In principle the minimum could be any value, but we only really need to worry about whether it is zero or less than zero (i.e. negative).

If the minimum value is zero, we can use unsigned integer types: their minimum value is 0 and their maximum value 2^n - 1 (both inclusive), where n is the number of bits of that integer width (i.e. an 8-bit integer can represent 0..(2^8 - 1) = 0..255 inclusive).

If the minimum value is less than zero, we need to use signed integers. Unlike unsigned integer types, whose representation is obvious (the minimum value has no bits set; the maximum value has all bits set), things are a bit trickier with signed integer types. The most common representation is two’s complement, where the minimum value is -2^(n-1) and the maximum value is 2^(n-1) - 1. The resulting asymmetry – an 8-bit integer can represent -128..127 – can feel a bit icky, but two’s complement has the advantage that arithmetic operations are identical no matter the signedness of the integer. Since CPUs have universally used two’s complement since the late 1960s (I know of only one, very niche, exception [3]), I think it’s safe to assume that two’s complement is the only representation any of us will ever encounter.
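
If you want to see these ranges, and the effect of two’s complement, for yourself, a tiny C sketch along the following lines will do the job (nothing here is clever: it just prints the limits from stdint.h and converts an all-bits-set byte):
#include <stdint.h>
#include <stdio.h>
int main() {
  /* The asymmetric range of a signed 8-bit integer... */
  printf("int8_t:  %d..%d\n", INT8_MIN, INT8_MAX);
  /* ...versus the range of an unsigned 8-bit integer. */
  printf("uint8_t: 0..%d\n", UINT8_MAX);
  /* Under two's complement, -1 has all bits set: converting it to unsigned
     (a conversion C fully defines) therefore gives 255. */
  int8_t minus_one = -1;
  printf("%d\n", (uint8_t) minus_one);
}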

Why do we have both signed and unsigned integers? Many people think that a good reason to have both is because we often use computers to count things that are naturally non-negative integers, for example the number of apples you’ve put in your online shopping basket. Using unsigned types in such cases feels like a sensible idea but there are downsides. For example, what’s the value of 2 - 3 as an unsigned integer? And what happens if you later discover that you need to represent negative quantities, for example the number of rotten apples that you need to refund the customer? You often end up in the wonderful situation where you have both unsigned and signed versions of your counts at different points of your program, not to mention the difficulties in dealing with edge cases when converting between the two types. I still have nightmares about the time I forgot to account for the asymmetric range of signed integer types when checking to see if an unsigned integer could fit in a signed integer — debugging that took quite a while.

Such problems would go away entirely if we only ever used signed integers. However, signed integers don’t work very well when we’re representing things to do with memory. For example, if, on a machine with a 32-bit address range, we used signed integers to represent the size of an array of bytes, we could only easily represent sizes of at most 2GiB — despite having an address space of 4GiB. That issue is less of a problem on 64-bit machines (63-bits represents a lot of memory) but what would a signed memory address mean and what happens if/when we want to use all 64-bits to represent memory addresses [4]? While high-level languages can easily forego unsigned integer types, they seem to be a necessary part of lower-level programming languages.

Native word width

The chances are that the computer (be it a desktop, laptop, phone, or otherwise) you’re reading this article on is a 64-bit machine, by which we mean its native word width is 64-bits. We use this term so often that, in my experience, we rarely take a step back to think about what this really means. A typical approximation is that the word width defines both the “main” integer width as well as the width of addresses: we expect operations upon those to be particularly efficient (implying, for example, that most CPU registers are word width). On x86, or amd64/x86-64 [5], hardware – the hardware that defines most software developers’ view of hardware – this is a safe approximation to reality.

Old duffers like me remember the days of not just 32-bit machines, but 16-bit and 8-bit machines: we have direct experience that the world is not just about 64-bit word widths. Surprisingly few of us realise that, even today, many “embedded” devices have a native word width of 8-bits. Indeed, it was only in the 1970s that the notion of multiple-of-8 word widths became common: a quick delve into computing history turns up an astonishing variety of word widths. I can remember my literal astonishment when I saw a decimal (i.e. non-binary) computer at The National Museum of Computing — I had read of such things but did not know that specimens still existed. However, until I looked into things a bit deeper for this article, I had no idea of other oddities like 27-bit machines!

The more you look into the notion of “word width”, the fuzzier the concept becomes. One obvious reason is backwards compatibility. On an amd64/x86-64 machine, for example, the terminology is compatible with their 16-bit predecessors: a WORD is 16-bits (due to the 8086), a DWORD is 32-bits (due to the 386), and a QWORD 64-bits.

There are deeper reasons for this fuzziness. The first computer my family acquired was the (now largely forgotten) Dragon 32 — an 8-bit computer named for its 32KiB of memory. That should immediately ring alarm bells: assuming each address is 8 bits wide, and addresses reference bytes, then an 8-bit address can only reference 256 bytes of memory. In practice, memory is addressed by combining more than one 8-bit value to form an address.

We thus need to differentiate between a platform’s data width (i.e. the width of an integer; e.g. 8-bits on the Dragon 32) and address width (16-bits on the Dragon 32). The problem [6] is then how does one combine (smaller) data-width values to reference (larger) addresses? Various schemes have been devised to bridge this gap, from building addresses from two or more registers, to bank switching, and beyond.
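
As a rough sketch of the simplest such scheme — gluing two data-width values together to form a single address — imagine building a 16-bit address from a “high” and a “low” byte (the variable names here are purely illustrative):
#include <stdint.h>
#include <stdio.h>
int main() {
  /* Two 8-bit (data width) values... */
  uint8_t hi = 0x7F;
  uint8_t lo = 0xFF;
  /* ...combined into a single 16-bit (address width) value. */
  uint16_t addr = (uint16_t) ((hi << 8) | lo);
  printf("0x%04X\n", (unsigned) addr);  /* 0x7FFF, the top of 32KiB of memory */
}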

The future of address and data widths

The reason we need to care about address and data widths is that, as soon as one programs in a language higher-level than assembly language, one sometimes wants to treat pointers as integers and vice versa. I most often need to do this when mucking around with pointer tagging (i.e. using the always-set-to-zero bits of an aligned pointer to store extra information), where I need to treat the pointer as an integer, set or unset the bits I care about, then treat the integer as a pointer again. In order to do this, I need to know precisely what the rules are for treating pointers as integers and vice versa. The rules are trivial if the data and address widths are the same, but I hope the previous section convinced you that, historically, this has not always been the case.
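
To make that concrete, here is a hedged sketch in C of the pointer tagging dance, using the uintptr_t type I discuss below, and assuming — as is true of malloc on common platforms, though not something the C standard promises for arbitrary pointers — that the allocation is aligned to at least 2 bytes, so the bottom bit of the pointer is free:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
  int *p = malloc(sizeof(int));
  *p = 42;
  /* Treat the pointer as an integer; alignment means its bottom bit is 0. */
  uintptr_t raw = (uintptr_t) p;
  assert((raw & 1) == 0);
  /* Set the tag bit... */
  uintptr_t tagged = raw | 1;
  /* ...and later clear it, treating the integer as a pointer once more. */
  int *q = (int *) (tagged & ~(uintptr_t) 1);
  printf("%d\n", *q);
  free(q);
}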

You might be wondering whether it’s worth worrying about such issues when it’s rather unlikely that anyone will ever again produce a 27-bit machine. However, I believe it’s quite possible that future hardware will, once again, have different data and address widths. There are several possible reasons why one might do so, but I want to use an example that I know at least a little about.

CHERI is an extension to instruction sets that allows the enforcement of hardware capabilities. On current mainstream platforms, different processes can be mutually distrusting, but within a process every piece of code has access to every part of the process’s memory. CHERI adds hardware permissions to pointers, allowing different parts of the same process to be mutually distrusting, for example handing out pointers that allow access to only limited portions of the process’s memory. The potential for increasing security without suffering the overhead of multiple processes is significant. Those with long memories will know that the idea of capabilities isn’t new (for example, the E language is a fascinating exploration of this idea), but what’s meaningfully different about CHERI is that it’s enforced in hardware and adaptable to modern instruction sets. For example, in 2022 there will be physical, fairly modern, Armv8 processors extended with CHERI available for early adopters to play with.

Amongst CHERI’s clever tricks is how permissions are carried around: instead of a pointer being just a memory address, under CHERI it becomes a memory address plus a set of permissions (plus a secret bit stored somewhere hidden so that the hardware can tell if you’ve forged permissions). Although the size of these permissions varies in different CHERI-enabled instruction sets, the most common encoding on 64-bit machines is that a pointer is a 64-bit memory address and a 64-bit set of permissions — 128-bits wide in total. In other words, a typical CHERI CPU has different data and address widths [7]. History, it seems, has not yet ended.

Integer types and pointers

How do systems programming languages deal with platforms that have different data and address widths?

Let’s start with Rust, which offers the usize type, defining it as “The pointer-sized unsigned integer type. The size of this primitive is how many bytes it takes to reference any location in memory.” That definition is based on the assumption that a machine’s address and data widths are the same. How should one adapt Rust to work under CHERI given this mismatch in assumptions? Well, one could devise various clever schemes to try and force Rust to acknowledge this difference, but doing so will break a lot of Rust code that has no choice but to rely on usize’s existing guarantee of identical address and data widths (including some that I’ve written), mostly in ways that won’t be statically discoverable.

The only pragmatic scheme I can think of is to define usize to be the wider of the data and address widths (i.e. 128-bits on most CHERI devices). This will allow most Rust code to run correctly without change (other than being recompiled). It doesn’t entirely guarantee correctness: for example, code that uses the size of a usize as a proxy for the data width will behave incorrectly. Still, I think it’s reasonable to assume that this scheme will work well enough to be acceptable. However, it’s inherently wasteful: most usizes only ever store the size of an object in memory, meaning that 8 bytes of their storage space will never be used. That might not sound much, but I have written Rust programs where removing 8 bytes from a single, commonly used struct has increased performance by 5%. There is no doubt in my mind that doubling the size of usize will, therefore, lead to a noticeable slowdown (perhaps in the region of 5-10%) for many programs.

Other languages have different approaches to integer types, with an obvious alternative being to remove the notion of parameterised width types entirely. D (which I don’t claim to know well) is an example of a language where all integer types have an explicit, fixed width. This suggests that we no longer have to worry about Rust’s conflation of address and data widths into usize, but fixed width integer types don’t allow one to adapt to different word widths. D therefore defines size_t to be an alias for the underlying platform’s word width. I will leave to the theologians whether this means that D really only has fixed width integer types, but either way, the end result is that D’s size_t makes the same assumptions as Rust’s usize.

Rust and D’s conflation of address and data widths suggests that modern systems programming languages are in an awkward spot when it comes to address and data widths. However, there are too many systems programming languages in the wild for me to claim knowledge of all, or even most, of them. That means that I know that my next statement is almost certainly wrong for at least one language out there. However, as far as I know, the only language which is sufficiently flexible is C — at least, C as of the C99 version.

Let’s start with C’s uintptr_t [8], an integer type that’s guaranteed to be wide enough to store a pointer safely. At first glance this might look like Rust’s usize in disguise, but C’s guarantees are subtly different: simplifying heavily, uintptr_t will be at least as wide as the address width but may be wider or narrower than the data width. The formal definition is rather more subtle (simplified only slightly): C guarantees that the integer type uintptr_t is wide enough to represent every void* pointer. The reason for the use of void* is that pointer types in C do not all have to be the same width, but void* is guaranteed to be wide enough that any other pointer type can safely be cast to void* — thus uintptr_t is implicitly wide enough for any pointer.

Another of C’s integer types, size_t, represents the size of an object in memory. That might look like it’s related to the concept of the address width, but it’s subtly different. On segmented architectures like the 8086, size_t will represent a smaller integer type (16-bits in this case) than the address width (20-bits). In other words, size_t allows us to separate out the notion of “how big can any object be in memory” from “what is the address of this memory”. As this suggests, we should have been talking about address, data, and object widths (and not just address and data widths) all along!

On that note, you might think int or long represents the data width, but neither does, and indeed C has no other integer type that corresponds to the data width. In part this is because of a mess of backwards compatibility: if you’re running OpenBSD on an amd64/x86-64 machine, int will be 32-bits and long will be 64-bits, but, on the same hardware, Windows defines both integer types to be 32-bits [9].
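
One way to get a feel for this tangle is simply to print the widths of the various types on whichever platform you happen to be sitting at. The following sketch does just that; on a typical 64-bit Unix I would expect it to report 32-bits for int and 64-bits for the other four types, but other platforms are entirely free to differ:
#include <stdint.h>
#include <stdio.h>
int main() {
  printf("int:       %zu bits\n", sizeof(int) * 8);
  printf("long:      %zu bits\n", sizeof(long) * 8);
  printf("void *:    %zu bits\n", sizeof(void *) * 8);
  printf("size_t:    %zu bits\n", sizeof(size_t) * 8);
  printf("uintptr_t: %zu bits\n", sizeof(uintptr_t) * 8);
}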

At this point (and even allowing for the simplifications in my explanation of C’s integer types), anyone who doesn’t consider themselves a C language lawyer might find their eyes glazing over. Fortunately there are two useful lessons for us.

We can see the first lesson in C’s definition of uintptr_t, which isn’t defined in terms of data or address widths, or even in some other notion of “address” — but, rather, in terms of the more abstract notion of a “pointer”. By deliberately abstracting away from the hardware, C is more adaptable than languages that make assumptions about hardware. This turns out to be very useful for CHERI. In contrast, Rust conflates the notions of address, data, and object widths into usize, making it hard to adapt naturally to a platform like CHERI.

The second lesson is, to some extent, a consequence of the first. Because C defines several integer types that abstract away from the hardware, their definition is invariably “at least wide enough to store X” rather than “exactly the right width to store X”. Similarly, few, if any, guarantees are made about the relationship between integer types. Thus, while C can easily adapt to a wide range of platforms, it does so in part by placing a considerable burden on the programmer to check whether they can, for example, safely store a size_t into a uintptr_t or vice versa. Using C’s integer types correctly over all possible platforms is, at the very least, exceedingly difficult. In contrast, Rust’s stronger guarantees are easier to work with.

Casting between integer types

Programmers often need to cast one integer type into another. For example, in C I can write this code:
#include <stdio.h>
int main() {
  int x = 0;
  long y = x;
  printf("%d %ld\n", x, y);
}
On my OpenBSD/amd64 machine, int is 32-bits and long 64-bits. Using clang --std=c99 -Wall to compile the above, there are no warnings or errors, and the resulting binary prints out 0 0 when run. Notice that on line 4 the 32-bit value in x was stored into the 64-bit variable y and that I didn’t have to do anything to make that “conversion” happen: C performed an implicit cast on my behalf. Fortunately, C guarantees that long is always at least as wide as int, so this particular cast can never lose information.

Consider instead this code:

#include <stdio.h>
#include <sys/limits.h>
int main() {
  long y = LONG_MAX;
  int z = y;
  printf("%ld %d\n", y, z);
}
This code also compiles without warning or error, printing out 9223372036854775807 -1 on my machine. If you dive into the C specification, you’ll discover that converting a value which doesn’t fit into a signed integer type is implementation-defined behaviour (unlike signed arithmetic overflow, which is undefined behaviour [10]): different compilers and platforms can legitimately give different results. In practice, the behaviour I observed can be explained very simply. y has the rightmost 63-bits set to 1; the implicit cast to the 32-bit int type on line 5 throws away the left-most 32-bits, but keeps the right-most 32-bits, which are still all set to 1; and because I’m running on a two’s complement platform, a 32-bit signed integer with all bits set represents the value -1.

The culprit here is the implicit cast from long to int which does something dangerous without me having explicitly asked for it. In my opinion, implicit casts are a terrible idea, and the program above should have caused a compile-time error.

I believe that users should have to spell out explicit casts. C even has syntax for explicit casts: I can change line 5 to int z = (int) y. Let’s briefly pretend that we can remove implicit casts from C: what do I gain? For me the killer example is that if I change the type of an attribute in a struct from int to long, I will get a compile-time error at every point in my program that tries to treat the attribute as an int. This makes tracking down all the code affected by my change trivial.

However, C’s explicit casts aren’t a sufficient safety feature on their own — indeed, in a sense, they deliberately override safety. In the example above where we cast long to int on OpenBSD/amd64, we get a dangerous result whether the cast is explicit or implicit. One might argue that this example is such a blatant mistake that it’s not worth worrying about, but people make the same underlying mistake in situations where they have not quite understood the guarantees made about the relationship between integer types. For example, I might write the following code:

#include <stdio.h>
#include <stdint.h>
int main() {
  intptr_t y = ...;
  long z = (long) y;
  printf("%ld %ld\n", y, z);
}
I know that on my desktop machine both long and intptr_t are 64-bits and that casting between them never truncates data. However, on CHERI, intptr_t will be (at least) 128-bits while long is probably still 64-bits: the above cast will silently truncate data, which I almost certainly don’t want. There are several options for preventing this sort of truncation.

First, one can check statically the width of various integer types (e.g. in a configure script) and refuse to compile code on platforms where the assumptions about the relationship between integer types are invalid. This is good in the sense that it stops the program from so much as being compiled on a platform where it would run incorrectly, but bad in the sense that it does nothing to help you find and fix the problematic cases.
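
As a sketch of this approach (one of several ways to do it; C11’s _Static_assert makes the same idea less ugly), one can abuse C’s rules about array sizes so that compilation fails whenever an assumption about integer widths doesn’t hold — here, the deliberately non-portable assumption that long is exactly 64-bits, which is true on OpenBSD/amd64 but not on 64-bit Windows:
#include <limits.h>
/* This typedef is only valid if the condition holds: an array of size -1 is
   a compile-time error, so compilation stops on any platform where long is
   not exactly 64-bits wide. */
typedef char assert_long_is_64_bits[(sizeof(long) * CHAR_BIT == 64) ? 1 : -1];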

Second, one can dynamically check for truncation in a cast and raise an error if it occurs. For example, I can define a function longtoint_dyn that performs both a cast and a truncation check:

#include <err.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/limits.h>
int longtoint_dyn(long x) {
  if ((x < (long) INT_MIN) || (x > (long) INT_MAX)) {
    err(1, "%ld cannot be represented as an int", x);
  }
  return x;
}
int main() {
  int x = longtoint_dyn(10);
  printf("%d\n", x);
  int y = longtoint_dyn(LONG_MAX);
  printf("%d\n", y);
}
When run on my desktop machine this prints:
10
a.out: 9223372036854775807 cannot be represented as an int: Undefined error: 0

Third, one can make explicit casts fail if they will potentially truncate on the platform the program is being compiled for. Defining this in C is rather awkward, but you can do it semi-portably in C99 with something like:

#if INT_MAX < LONG_MAX
#define longtoint_static(x) _Pragma("GCC error \"unsafe cast\"")
#else
#define longtoint_static(x) ((int) (x))
#endif
In essence, if the cast is always safe, longtoint_static compiles to a simple cast. If the cast might not be safe, longtoint_static causes a compile-time error [11].
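
Using it is then no more onerous than writing a normal cast. For example (assuming the definition above is in scope):
long y = 1234;
/* On platforms where int can represent every long this is a plain cast; on
   (e.g.) OpenBSD/amd64, where it can't, this line refuses to compile. */
int z = longtoint_static(y);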

I am extremely fond of these sorts of casts, as they allow me to write code that works well for the platforms I currently care about without compromising the potential to reliably adapt the code to other platforms I might encounter in the future. In essence, casts of this sort both document and enforce every simplifying assumption about integer types I make, and they do so without imposing a run-time performance penalty. If I do encounter a platform where one or more of my assumptions is invalid, I’m notified of each potentially problematic place in my code. Fixing this is a tedious task, but in essence I’ve built my own static analysis ahead of time.

A reasonable question is whether jumping through these hoops is solely a result of C being an old language [12]. Let’s take Rust as a good example of a modern language. Very sensibly, it forbids implicit casts between integer types entirely. However, it does have explicit casts with the as keyword. When I started writing Rust code, I was somewhat liberal in my use of as with integer types, until I realised that doing so can lead to surprising results, just as casts in C do. Fortunately, Rust has two other ways of casting between integer types.

First are From casts. For example, usize::from(0u16) is a statically checked cast from a 16-bit number to a platform-parameterised usize: it always compiles successfully and is guaranteed never to truncate. However, the expression usize::from(0u32) fails to compile on my 64-bit desktop computer, even though the 32-bit to 64-bit conversion would never truncate. The reason for this discrepancy is two-fold: Rust guarantees that usize is at least 16-bits wide; and it only defines From casts for integer types that will succeed on every platform. Since casting a 32-bit integer to a usize would fail on a 16-bit platform, the cast is never allowed to compile — even on a 64-bit platform where it would always succeed. In other words, From casts are like a less flexible version of the longtoint_static function above.

Second, Rust has TryFrom casts which allow one to check for possible truncation at run-time. For example, usize::try_from(0u32).unwrap() casts a 32-bit number to a usize and aborts the program (with the unwrap call) if the cast would truncate the input, rather like longtoint_dyn above. Looking at the generated machine code for usize::try_from(0u32).unwrap() on a 64-bit machine is interesting: when unoptimised (Rust’s ‘debug’ mode) the machine code really does check for truncation, even though truncation can never possibly happen; but when optimised (Rust’s ‘release’ mode), the machine code contains no truncation checks at all.

TryFrom casts are very useful, but there are two things missing. First, and most obviously, they can’t do what longtoint_static does and statically rule out conversions that are potentially unsafe on a given platform. I’ve sometimes defined an equivalent of longtoint_static using the static_assertions crate but, of course, the resulting casts are only available to whatever code I’m writing. Second, there are no guarantees that the checks in a TryFrom are optimised away, even if they can never fail [13]. I have seen diligent programmers writing performance critical code deliberately avoid using TryFrom because they are worried that the checks won’t be optimised away. That’s deeply unfortunate, but it’s difficult to criticise them for doing so.

Although Rust’s casting features have weaknesses, I think that they provide clear pointers for a better design. I hope it’s uncontroversial that, like Rust, languages should not allow implicit casts between integer types at all. However, in my opinion languages should support three kinds of explicit casts for integer types:

  1. Those which always succeed. Rust's as is a good example of what to do here: when casting to a narrower integer type, integers should be truncated; when casting to a wider integer type, they should be zero extended for unsigned types or sign extended for signed types.
  2. Those which dynamically check whether the input would be truncated and return an error (like TryFrom in Rust). If, on a given platform, the cast can be statically proven never to truncate, then the dynamic checks must be guaranteed to be fully optimised away (which Rust does not do [14]).
  3. Those which fail to compile if the underlying platform might truncate the input (which no language I know of has direct support for).
Programmers can then choose amongst these three kinds of casts as appropriate for different portions of their program: I’ve written programs (e.g. grmtools) where I use each of these kinds of explicit cast at different points.

Overflow

Consider the following C code:
#include <stdio.h>
#include <stdint.h>
#include <sys/limits.h>
int main() {
  int x = INT_MAX;
  int y = x + 1;
  printf("%d %d\n", x, y);
}
It compiles without error on my desktop machine but when run prints out 2147483647 -2147483648. Because INT_MAX is the maximum value a (signed) 32-bit int can hold, adding 1 to it causes overflow — which, for signed integers in C, is undefined behaviour. On my machine, the particular behaviour I observed was that the integer “incremented” into the negative half of the two’s complement representation. This is a rich source of bugs, but in C there is no mechanism for detecting overflow after it has occurred — one has to check for it before it occurs, by manually comparing values against the maximum they could be, which is surprisingly awkward to do.
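
To show what I mean by “surprisingly awkward”, here is a sketch of pre-checking a signed addition in C (the function name is my own invention): because signed overflow is undefined behaviour, the check itself has to be written so that it cannot overflow either:
#include <limits.h>
#include <stdio.h>
/* Returns 1 and stores x + y in *result if the addition cannot overflow;
   returns 0 (leaving *result untouched) otherwise. */
int int_checked_add(int x, int y, int *result) {
  if ((y > 0 && x > INT_MAX - y) || (y < 0 && x < INT_MIN - y)) {
    return 0;
  }
  *result = x + y;
  return 1;
}
int main() {
  int r;
  if (int_checked_add(INT_MAX, 1, &r)) {
    printf("%d\n", r);
  } else {
    printf("overflow avoided\n");
  }
}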

Rust takes a different approach. Consider this program:

fn main() {
  println!("{} {}", i32::max_value(), i32::max_value() + 1);
}
If I compile that in debug mode and run it I get the following run-time error:
thread 'main' panicked at 'attempt to add with overflow', t.rs:2:20
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
In other words, Rust noticed that the addition would overflow, realised that would be a bad thing, and aborted the program, telling me exactly why and where it did so.

However, if I turn on rustc’s optimisations with -O and rerun the program, it not only doesn’t cause an error, but it prints 2147483647 -2147483648, just as C does. There is one slight improvement in the situation relative to C: integer overflow is not undefined behaviour [15], and is guaranteed to perform two’s complement wrapping.

The reason that Rust checks for overflow in debug mode but does not do so in release mode is performance: when I asked a knowledgeable Rust person about this a couple of years ago, I was told that measurements made a few years ago showed that overflow checks made benchmarks about 5% slower on average. That’s a big enough difference that it was thought better to turn off overflow checks in release mode. Rust users can turn overflow checking back on even in release mode — though I doubt that many people know they can do so.

Fortunately, Rust gives users the ability to explicitly check for overflow during arithmetic operations with the checked_* family of methods, such as usize::checked_add. If adding the two values would overflow, None is returned; otherwise the sum is returned (wrapped in Some). This is a nice API and makes dealing with overflow much easier than in C. Even better, there is also the saturating_* family of methods which clamp the result at the maximum or minimum value (i.e. preventing wrapping). For example, i32::max_value().saturating_add(1) == i32::max_value().

Rust’s approach is so much nicer than C’s that it took me a while to realise that an even better design is possible! My guess is that the checked_* family were added because they map directly onto a feature (overflow detection) that most CPUs support in hardware. However, Rust forces you to opt into checked_* even though it’s the “safe” case — the default, at least in release mode, remains (dangerous) wrapping. It would be better, in my opinion, to invert this: I would like all additions to be checked unless I explicitly mark them as unchecked. Personally, the only reason I ever want unchecked arithmetic is in performance critical code where I can either guarantee that wrapping can’t occur or I’m willing to take the risk that it won’t. Very few portions of code are so performance critical that I would ever bother using unchecked arithmetic: at a rough guess, I’d suspect a sensible ratio of unchecked:checked arithmetic in an average chunk of code might be something like 1:1000.

The good news is that this approach is within our grasp with the unchecked_* family of methods, such as usize::unchecked_add. The bad news is that they’re currently only available in Rust nightly (and they’ve even disappeared from Rust nightly at some points), but hopefully they’ll make their way into Rust stable soon, and more people will feel able to turn on overflow checks in release mode — perhaps it might even become the default!

Other integer types

Although they’re little used, it’s worth briefly mentioning two other classes of integer types that C99 defines: the fast and least integer types. These have subtly different meanings, although the basic idea is the same: they allow you to express the idea that you want an integer type that’s at least n bits wide, but you don’t mind if the integer type the compiler selects for you has more than n bits. The difference between the two classes is easiest to see by example.

uint_fast32_t will choose the “fastest for this platform” integer type that’s at least 32-bits. For example, on a 64-bit machine it might be faster to perform 64-bit operations rather than 32-bit operations, at which point uint_fast32_t would be 64-bits wide. That’s generally not true on amd64/x86-64, for example, so one would expect uint_fast32_t to define a 32-bit integer on such platforms. OpenBSD does just that but, for reasons I can’t comprehend, Linux defines it as a 64-bit integer. In other words, on one of the most important hardware / OS combinations, using uint_fast32_t will generally slow your code down a bit. Perhaps unsurprisingly, I’ve never seen the fast types used in real code.

uint_least16_t defines an integer type that is at least 16-bits wide but may be wider. The description of these integer types in the C spec (including in C17) is particularly inscrutable, and I’m not entirely confident that I’ve understood it. However, I believe the intention is that the compiler will round a least type up to the next smallest integer type it offers: in other words, it won't define it to be unnecessarily big. That also defines an implicit ordering: it's always the case that the width of uint_leastM_t will be smaller than or equal to the width of uint_leastN_t if M < N.
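
If you are curious what your platform chooses, a small sketch like the following will tell you; on my OpenBSD/amd64 machine I would expect it to report 32-bits and 16-bits respectively, whereas on Linux/amd64 the first line will report 64-bits:
#include <stdint.h>
#include <stdio.h>
int main() {
  printf("uint_fast32_t:  %zu bits\n", sizeof(uint_fast32_t) * 8);
  printf("uint_least16_t: %zu bits\n", sizeof(uint_least16_t) * 8);
}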

As far as I can see, the fast types are a failure, and the least types are unused, possibly because no-one can understand exactly what they mean. I think the least types are a useful idea though, and if I were designing a new language I’d consider adding support for the idea, although I wouldn’t consider it crucial.

Summary

There are other things I could talk about (e.g. pointer provenance or non-integral pointer types), but I think I’ve written enough on integer types for one lifetime.

What I hope I’ve done in this article is: show that real hardware has a more complex notion of “word width” than a narrow amd64/x86-64 perspective suggests; examine how well, or not, integer types in existing systems programming languages support this variety of hardware; and consider what this suggests for the design of integer types in a mythical future systems programming language. I’m not claiming that I’ve thought of every possible issue in this area, nor that the design suggestions I’ve made are the best possible, but they do seem to be an improvement on what existing languages offer.

Let me finish by summarising what I think the major design lessons for systems programming languages are:

  • Both signed and unsigned integer types have their place in systems programming languages.
  • It’s reasonable to assume that all arithmetic will be on a two’s complement representation.
  • Real hardware does not have a single “word width”: we need separate notions of address, data, and object widths.
  • The stronger the guarantees one makes about the relationship between integer types, the harder it is to adapt to different platforms.
  • The weaker the guarantees one makes about the relationship between integer types, the harder they are to use correctly.
  • Implicit casts are a terrible idea.
  • There are three useful kinds of explicit casts: always succeed (truncating or extending as appropriate); dynamically check for truncation and fail if so; cause a compile-time error if the cast can potentially lead to truncation at run-time.
  • Arithmetic operations should by default return an error at run-time on overflow; if users don’t want a particular arithmetic operation to fail they should have to explicitly request that a specific arithmetic operation is unchecked.
  • Undefined behaviour on integer types is a terrible idea (though unspecified behaviour might have a place).

The final design lesson is that integer types are not an exciting aspect of language design, and as a language designer it’s tempting to put off thinking about them in favour of more interesting features — certainly, this was the approach I took for many years. However, since programmers quickly embed assumptions about integer types throughout their programs, integer types become more difficult to change than many other language features. In my opinion, therefore, integer types are one of the things a language designer needs to think about earlier rather than later.

Updated: (2021-06-23) Travis Downs pointed out that I’d misclassified the 8088.

Acknowledgements: Thanks to Jacob Bramley, Edd Barrett, Aleks Bonin, and Jacob Hughes for comments.

Footnotes

[1] There are many interesting design challenges around numbers and dynamically typed languages, but that’s a very different topic!
[2] The only modern language I’ve come across which allows this is Zig, which allows integers to be any bit width between (I think) 1 and 65535. C99, perhaps inevitably, gives explicit flexibility for such integers to be defined, but no platform I know of actually does so.
[3] The UNISYS ClearPath Dorado – the last machine in this series was released in 2015, but it’s compatible with hardware going back to 1962!
[4] Most x86 CPUs, for example, currently only use 48-bits of the possible 64-bit address range. There are already some mainstream CPUs which use 57-bits (to represent 5 level page tables) and it would be a brave person who would bet against more bits being used at some future date.
[5] The naming of this architecture is a mess: AMD created it initially, hence “amd64”, but once Intel also adopted the instruction set, that name was always going to be contentious. Thus we’ve ended up in the confusing situation where for many, but not all, operating systems we tell people “download the x86-64 version called amd64”.
[6] I’m not aware of any machines whose data width is larger than their address width — if I had to guess, I’d suggest that no such machine has ever existed. UPDATE (2021-06-22): When will I learn not to guess?! Mark Bessey points out that System/360 machines have a 32-bit data width but a 24-bit address width.
[7] At this point the terminology becomes torturous: it might be better to say that CHERI’s pointer width is different than its address width. For the sake of consistency, I’ll stick with the existing terminology.
[8] For reasons that are unclear to me, uintptr_t is an optional type in C99. However, I don’t know of any platforms which support C99 but don’t define it.
[9] There is of course the wonderful long long integer type. Although I don’t know for sure, I like to think that at some point soon after long long was introduced, people realised that this approach wasn’t going to end well: would a new platform have come along that required a long long long integer type?!
[10] Undefined behaviour is, in my opinion, a terrible thing for everyone except people who write compiler optimisers. I talked a bit about that as part of another blog post.
[11] It’s possible to make this prettier with information like the line of the cast and so on, but it doesn’t fundamentally change things.
[12] C++ improves upon C: it has static_cast which roughly corresponds to longtoint_static, though C++ has no built-in equivalent of longtoint_dyn.
[13] That said, whenever I’ve actually looked at the machine code resulting from a TryFrom in Rust’s release mode, the checks are optimised away.
[14] I’m aware that our current compiler infrastructure – where we tend to separate out the compiler front-end (e.g. rustc) from the optimiser and machine code generator (e.g. LLVM) – makes it difficult, and perhaps impossible, to guarantee optimisation outcomes. That is not a reason to say that we should never have such guarantees: it suggests we need to fix our compiler infrastructure to make it possible to make and honour such guarantees.
[15] To be specific, unsigned integer overflow in C has defined behaviour but signed integer overflow does not.

Automatic Video Editing

March 23 2021

Amongst the many consequences of COVID-19 has been the suspension of in-person talks: suddenly, people like me have had to think about how to produce prerecorded videos. In this article I’m going to explain what I’ve come to understand about video recording and the “automatic video editing” technique I’ve used for videos such as Virtual Machine Warmup Blows Hot and Cold.

To give you an idea of how unfamiliar I was with video production, 12 months ago I not only had no experience of creating videos but I hadn’t even watched enough videos to form an opinion of what was possible or desirable. However, within a couple of hours of attending my first online conference in June 2020, the challenges of video production had become clear to me. To be blunt, most of the videos I was watching were worse than the equivalent live talk: they were mostly bullet point laden PowerPoint slides with poor quality audio. I’ve long been of the opinion that bullet point presentations are unengaging, as they’re designed for the speaker’s benefit, not the audience’s. However, they become even less engaging when there’s no person to look at and all you can hear is a disembodied, poorly amplified, and fuzzy-sounding mumble. Fortunately for me, I saw a talk by Sam Ainsworth that showed me that a different, rather more engaging, type of talk was possible.

Realising that I’d need to produce videos of my own at some point, I immediately [1] started looking into how one might go about recording videos. Unfortunately, it soon became obvious that common video production tools such as OBS hadn’t been ported to OpenBSD [2]. Undeterred, the following day, I experimented with using FFmpeg to record and mix videos. I’ll go into more detail about FFmpeg later, but at this point it’s enough to know that FFmpeg is a powerful command-line tool that can record and edit videos.

After a while, I realised that what I wanted to produce were “screencasts with a talking head” videos, which superimpose a shot of a human head onto a capture of a desktop computer’s screen. I’d soon grappled with FFmpeg enough that I could record my camera, microphone, and X11 screen (i.e. my computer display) into separate video files. Then I started trying to edit the resulting files into a half-decent recording and ran into what I’ve since realised is a truism: video editing is hard.

Current approaches to video editing

I’ve done quite a lot of audio recording in my time, and editing it is easy: audio is easily visualised as a wave form, and searching for silences, in particular, is trivial. Selecting the appropriate parts to delete is a simple matter of a couple of mouse clicks and pressing “delete”. For a 15 minute podcast, I probably record about 40 minutes of audio and then spend about 30 minutes editing that down to the final 15 minutes. It might surprise you that editing takes less time than there are minutes of recorded audio, but many unwanted parts (e.g. long periods of silence where I’m glugging water) are so visually obvious that I don’t even need to listen to them.

It is one of life’s odder ironies that it’s impossible to sensibly visualise a video consisting of more than a handful of frames. Finding the right parts to delete is a slow job that involves lots of hunting around and replaying — sometimes the audio part of the video can help speed this up, but silence is not a guarantee that nothing interesting is happening on screen.

The sheer size of video files makes the editing process even more painful — manipulating videos taxes even the most powerful of computers. Saving out a 25 minute video on my computer using a software encoder [3] can easily take 10 minutes, utilising all 4 cores to the maximum while doing so. Experimentation is constrained by long turn-around times.

Two other things exacerbate the problem. First, there’s a lot more that can go wrong when recording a video, so to get the same quality as the equivalent audio-only recording, I typically have to record quite a lot more footage, meaning there’s more to edit. Second, video gives one more options: for example, I might want some scenes in the final video to have just the screen, some to be of just me talking, and some to be a mix of the two. Extra options mean extra complexity, which means extra editing time.

It is thus probably unsurprising that audio and video editing tools vary substantially in difficulty. I must admit that I find Audacity a frustrating tool, as each new version finds new ways of crashing on me, and some of its processing tools are underwhelming [4]. However, I think that even someone unfamiliar with audio editing can get to grips with Audacity in a few minutes. More sophisticated multi-track editors like Ardour (a wonderful, and underappreciated, program in my opinion) are harder to learn, but they pale in comparison to video editors. I’ve mostly used Shotcut because it’s the least hard to use tool I’ve found of that ilk, but it’s still awkward to learn and use, because video editing is fundamentally challenging: to do anything even moderately complex, I have to trawl around various online forums and YouTube videos to find out the sequence of steps to follow.

Automatic video editing

Just at the point that I was starting to despair at how much of my life I might have to dedicate to editing videos, I stumbled across a wonderful idea from Vivek Haldar, who showed that it’s possible to automatically edit videos. In essence, Vivek’s approach is to have a final video comprised of what I’ll call multiple scenes, where the scenes are specified by placing markers in the video that mean “I’m happy with the last scene” and “rewind to the beginning of the last scene and let me try recording it again” (enabling what are called retakes). The markers are embedded in a very literal sense — they turn most of the screen capture red or green in the recorded video (see e.g. 3:45 in Vivek’s video). A Python script then analyses each frame in the video for those markers and extracts the “good” portions to create a final video suitable for uploading elsewhere.

In my opinion, Vivek’s idea is beautifully simple, as it automates away the torturous job of extracting the portions of a recording that you want to keep. I quickly experimented with his idea in my FFmpeg setup, but found that the red and green flashes reflected on my face, making me look alternately satanic and queasy [5]. Vivek solves this by also automatically editing out silences, but, as we’ll see later, I want greater control over transitions. I also tried delaying the recording either side of the pauses, but found that made it difficult to time transitions appropriately. I was tempted to give up at this point, but Vivek’s idea was just too appealing, so I wondered if I could achieve the same effect without embedding colours in the video.

An overview of Aeschylus

What I eventually came up with is Aeschylus (a sort-of acronym for “Auto-Editing ScreenCasts”). Aeschylus is a proof of concept for automatic video editing — it is not, to put it mildly, very polished. When I started writing it I had no idea of what I wanted, no idea of what was possible, and no idea of how to go about doing the things that I didn’t know I wanted. That lack of clarity is reflected in its mish-mash of shell script, Python, and generated Makefiles. That said, after 6 months of using Aeschylus, I think I’ve gained enough experience to say with confidence that automatic video editing works, that other people might find Aeschylus’s style of automatic video editing useful, and that more robust video production systems should be able to easily integrate something equivalent.

Before I go any further, it’s worth being clear as to what Aeschylus is and isn’t suited to. It:

  1. targets a simple style of videos: screencasts with a talking head.
  2. produces videos suitable for uploading to sites like YouTube: it is not useful for live streaming.
  3. intentionally doesn’t edit “on the fly”. All the constituent parts (e.g. recordings of the camera and screen) are kept separate and can be edited separately later if problems are found.
  4. expects you to record your videos in one location in one sitting (perhaps one could relax this requirement somewhat, but at some point you’d end up back at the complexity of “traditional” video editing).
  5. requires you to be at a computer during recording or, at least, to have a means of telling the recording software when you want to set a marker.

These constraints are acceptable for some purposes (e.g. lecturing or teaching), but not others. In particular, if you have any artistic intentions for your videos, this is not going to be the technique for you.

What you get in return for accepting these constraints is a minimal-effort means of producing decent-ish looking videos. In my case, when I’m ready to record a video, I run Aeschylus, press Ctrl+F2 when I’m ready to start the “good” footage, talk for as long as I need to, and press Ctrl+F3 when I’m finished. Aeschylus then switches from “recording” to “editing” mode which involves lengthy, fully automated, video processing. A little while later, I’m presented with a .nut [6] file suitable for uploading to YouTube.

A couple of examples of publicly accessible videos that I’ve produced with Aeschylus are Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers and Virtual Machine Warmup Blows Hot and Cold. While no-one is going to confuse either video with a professional documentary, I think their quality is good enough not to be distracting. Both videos required precisely zero editing after recording.

Markers

The main difference between Aeschylus and Vivek’s system is that Aeschylus stores markers in a separate file to the video, recording the wall-clock time at which each marker was added. For example, when I press Ctrl+F1 a safepoint marker is added, when I press Ctrl+F2 a rewind marker is added, and when I press Ctrl+F3 a fastforward marker is added. After recording is complete, the markers file looks something like:

1615910625.813433 rewind
1615910640.256888 rewind
1615910653.923218 safepoint
1615910669.182979 rewind
1615910680.459997 fastforward
The long numbers are times in seconds since the Unix epoch, allowing us to tell when those markers were recorded (in this case, approximately 2021-03-16 16:03) down to a fraction of a second. One obvious concern with this approach is how to synchronise the markers with the recording: I’ll talk about that in detail a bit later.

The three marker types used above mean the following:

  • rewind: retake the current scene, rewinding to the previous safepoint or, if there are no previous safepoints, the start of the recording.
  • safepoint: commit the previous scene.
  • fastforward: ignore the recording from this point until the next safepoint or, if there are no subsequent safepoints, until the end of the recording.

Those three marker types might not feel like much, but they’re surprisingly powerful. rewind can be used to denote “start video extraction from this point” and fastforward can be used to denote “stop video extraction at this point”. It’s normal to have multiple adjacent rewinds since they ignore any preceding marker that isn’t a safepoint. Aeschylus also has an undo marker which deletes the preceding marker, if there is one, but I’ve never used it in practice.

Thus, for the example markers above, Aeschylus will trim the start (from the very beginning of the recording up to the second rewind marker), edit out the retake between the safepoint and the second rewind, and then trim the end (the fastforward).

At this point it’s probably easiest to see examples. I recorded a simple “demo” video with a few example markers and first of all removed those markers so that you can see the “before” footage. This includes the inevitable wonky first second of camera footage when using FFmpeg on OpenBSD but, more usefully, shows the FFmpeg recording process running in the centre and highlights the markers that are set in the bottom left:

If I put the markers back in and rerun Aeschylus it then produces an automatically edited video. If you’re not quite sure how this works, you might find it useful to compare the frame count in the FFmpeg process in the video above and below. The “after” video thus looks as follows:

Hopefully that gives you a good idea of what automatic video editing does. The minimum practical sequence of markers to produce a video is simply rewind and fastforward. In practice, I tend to start with multiple rewinds (introductions are important, and it takes me a little while to get into the swing of things), with subsequent safepoints followed by zero or more rewinds, and one final fastforward.

One take versus multiple takes

Aeschylus makes it easy to retake (i.e. rerecord) a scene as often as you can stomach, without incurring any extra editing time. This leads to an inevitable debate about the virtues of “authentic” and “perfect” recordings [7]. On one side of the argument, people assert that “authentic” recordings emulate the wonders of a live talk while “perfect” recordings feel stiff and robotic.

I think this overlooks the fact that truly good live talks are hard to give and thus rare. Part of that is inherent — most people, including me, are not natural public speakers, so we tend to fumble over our words, pause randomly, or repeat ourselves unnecessarily. Part, annoyingly, is because many speakers do not adequately practise: at the most extreme are those speakers to whom each slide in their talk comes as a visible surprise. While some amazing live speakers can probably create wonderful videos in a single take, the rest of us should probably expect to retake parts of a talk if we value, and want to use efficiently, our viewers’ time.

As this suggests, I have little truck with the idea of an “authentic” talk but I don’t want to fall into the trap of trying to create a perfect talk. If nothing else, I probably wouldn’t succeed in doing so, but would exhaust myself in the attempt. Fortunately, I’ve found that when recording videos, one can easily explore the middle ground between authentic and perfect, and automatic video editing makes doing so even easier. Just as with live talks, I’ve ended up producing two classes of video.

First is what I’ll call “lecturing”, by which I mean delivering lengthy quantities of material in a relatively short period of time (e.g. lecturing undergraduate students might involve 20+ hours of lectures over 10 weeks). When lecturing in-person, I can’t realistically practise everything I’m going to say in advance, so I end up pausing, repeating, and fumbling a reasonable amount. Similarly, I need to produce enough lecture-quality videos that I need to tolerate a certain level of inefficiency and imperfection in my delivery if I want to record them in reasonable time. When using Aeschylus, I have settled into a consistent pattern where I have a record:keep ratio of about 2:1 (i.e. for every 10 minutes of material I record I keep 5 minutes). More interestingly, these videos have a live-length:video-length ratio of about 2:1 (i.e. for every 10 minutes I would have spoken in real life, the video covers the same ground in 5 minutes). It took me a little while to realise that, with automatic video editing, I’m spending the same time recording videos as I spent delivering them in-person — but viewers only have to spend half as long watching to get the same – in fact, better! – results. Part of that is the reduction in talking inefficiencies, but part of it is because I no longer have to repeat myself as often: if students miss something, they can rewind to focus on just that part. This, in my opinion, is a significant improvement relative to live lectures.

Second is what I’ll call “research talks”, by which I mean one-off, shorter talks, generally on a highly technical topic. When giving these in-person I generally have to practise them several times in order that they’re somewhat accurate and complete. What happens if we try and maintain these standards for video recordings? For the two cases I’ve done so far (Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers and Virtual Machine Warmup Blows Hot and Cold) I’ve had a record:keep ratio of about 5:1 to 6:1 (i.e. for every 10-12 minutes of material I record, I keep 2 minutes). I estimate the live-length:video-length coverage ratio to be about 1.5:1 (i.e. for every 10 minutes I would have spoken in real life, the video covers the same ground in about 7 minutes). However, the recorded videos are more accurate and complete than the equivalent in-person talks, so that ratio perhaps undersells things.

There is one final factor that’s worth considering in this context: how long should a scene in a video be? It would be possible to put together a perfect recording consisting of scenes that last a few seconds each, but when transitions are too frequent, the final video feels inhuman and unsettling. In general, I aim for scenes to average around 30-90 seconds. 30 second scenes seem to be above the “inhumanly frequent transition” threshold, and 90 second scenes mean that, when I inevitably fumble my words, I don’t have to repeat too much content. Put another way, the longer a scene is, the more opportunities I’m giving myself to make a bad enough mistake that I feel forced to do a retake.

In summary, whether I’m producing lecturing or research quality videos, my personal aim is not to produce perfect videos. Instead, I deliberately aim for something at least a little short of perfect: I retake a scene when I think I’ve paused or fumbled in a grating manner, but I’m willing to accept minor pauses and fumbles. While I’m not saying I’ve got the balance right, I hope this leaves a “human” element to the videos, while also wasting very little of the viewer’s time.

Scene transitions and layouts

Aeschylus originally supported only one scene layout: screen capture with a shrunken version of me in the bottom right corner (e.g. Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers). Because I appear in the same position during the whole video, I soon realised that transitions between scenes can be extremely jarring if I’ve moved too much either side of the transition [8]. Fortunately there is a simple solution: at each safepoint and rewind I sit upright, look at the same part of my camera, and try to strike a neutral tone with my face. The resulting transitions are still visible, but they’re much less jarring.

While this simple scene layout is fine for short videos, it can feel rather boring for longer videos. To solve that problem, I realised that I needed to add support for additional scene layouts: just the screen capture; and a full scale version of my made-for-radio face. You can see all three scene layouts in action in Virtual Machine Warmup Blows Hot and Cold.

Although I cannot deny a certain vanity behind the full-scale face scene layout, I think the different layouts make the resulting videos a bit more engaging. To my surprise, despite being much more obvious, transitions between different scene layouts are much less jarring than transitions within the same scene layout. I’m not sure if that’s because I’ve become inured to that type of transition from over-consumption of television and film, or whether it’s something more fundamental, but it doesn’t really matter either way.

To switch scene layout, Aeschylus has additional marker types (e.g. camera0_scene, screen0_scene, and so on). There is an interesting interaction between setting a scene layout and rewind. My first implementation made Aeschylus “remember” which scene layout was active at a safepoint, reverting to that if a rewind was encountered. However, that made it impossible to cleanly change scene layout as it always happened during a “live” scene. Now, rewind does nothing to change scene layout. Fortunately, this is easy to use in practice. After a safepoint, I set a new scene layout marker if I need one, then set a rewind marker, and carry on. I don’t ever try to change scene layouts in the middle of a scene.

Occasionally I forget to change scene layouts at safepoints or change to the wrong scene layout. Fortunately, because Aeschylus (intentionally) only edits after recording has completed, it’s easy to edit the markers file to add or remove the necessary scene change marker and regenerate the video.

A final challenge with scene transitions is getting an appropriate gap between the final word in the last scene and the first word in the next scene. Too large a gap feels a bit weird; too small a gap can be incredibly jarring. My solution is simple: I set a safepoint just slightly before the point that I would start speaking again; and as soon as I set a rewind, I breathe in to start speaking. In other words, the “gap” is unevenly distributed, with most of it being at the end of a scene [9].

FFmpeg

FFmpeg is often described as the Swiss army knife of video tools, which is unfair: compared to FFmpeg, Swiss army knives have a miserly number of tools. Now that I know what to look for, I suspect that nearly every video we watch digitally has passed through FFmpeg at least once. Aeschylus uses FFmpeg for recording (which is fairly easy) and editing (which turns out to be rather harder).

The sheer variety of tasks that FFmpeg can accomplish means that it would be unreasonable to expect it to be an easy tool to use. However, FFmpeg is even more difficult to use than one might expect, for three reasons. First, it is an old tool, so inevitably there are layers of cruft, and unfortunate design decisions, that can’t be removed or changed without breaking backwards compatibility. Second, although nearly every feature is documented, virtually no feature is documented in a useful way. A common idiom is to say “Feature X does X”, with no real explanation of what X does, or any indication of the implications of using that feature. Third, FFmpeg only rarely complains that data can’t be processed: in general, if it’s not in quite the format the next stage in the pipeline requires, it will coerce it to make it suitable [10], silently choosing values for any options you didn’t think to specify. If, as is common, the output doesn’t play as you hoped, you have to work out what options are applicable to your input and output and try adjusting them. I have often had to dig into FFmpeg’s source code to find out what’s going on, and sometimes I’ve had to resort to sheer guesswork [11].

However, in case this sounds like whinging, I want to be clear: I really like FFmpeg. Once you’ve got your head around its major concepts – in particular, its powerful filters – you can automate an astonishing number of tasks. Here’s an example of the sort of pipeline that Aeschylus uses:

ffmpeg -hide_banner \
  -i camera0.nut \
  -filter_complex " \
    [0:a]pan=mono|c0=c0+c1[a]; \
    [0:v] \
      fps=fps=24, \
      crop=1400:1045:330:0, \
      vaguedenoiser, \
      chromakey=3e5b0b:0.04:0.02, \
      despill=type=green:mix=0.5:expand=0.3:brightness=0:green=-1:blue=0 \
    [camera0]; \
      color=c=0x595959:r=24:s=1920x1080 \
    [screen0]; \
    [screen0][camera0] \
      overlay=shortest=1:x=W-w+0:y=H-h+1 \
    [v]" \
  -map "[a]" -map "[v]" \
  -f nut -c:v libx264 -crf 0 -preset ultrafast -c:a flac out.nut
I won’t go into all the details, but a few clues can go a long way to making sense of FFmpeg. This command takes as input the file camera0.nut, processes it, and writes the output to out.nut, encoding the video as H.264 and the audio as flac.

The -filter_complex option allows you to access FFmpeg’s full range of features via a filter expression language. Each stage in the pipeline takes in zero or more streams and produces one or more streams as output, with the syntax [in_1][in_n]filter=opt_1=val_1:opt_n=val_n[out_1][out_n] (note that stream names can be, and often are, reused). The input file is split into two: the audio stream becomes [0:a] and the video stream [0:v]. The -map arguments specify which streams end up in the output file (in this case the [a] and [v] streams).

In this example, the filter pipeline converts the stereo audio to mono (pan=...), crops the image (crop=...), reduces camera noise (vaguedenoiser [12]), removes the green screen (chromakey=...) and any reflection from the green screen (despill=...), generates a 1080p grey image (color=...), and places the camera footage over that grey image (overlay=...). Here’s a before and after:

As I hope this suggests, FFmpeg is incredibly powerful, including a wide array of image and audio processing possibilities.
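
For example, here’s a hedged sketch, not Aeschylus’s actual command, of extracting a single scene from a recording whose packets carry Unix epoch timestamps (see “Clock synchronisation” below). It reuses two of the marker times from earlier and assumes recording.nut contains both the camera’s audio and video:

# Keep only the footage between a rewind marker and the following safepoint,
# resetting timestamps so that the extracted scene starts at zero.
ffmpeg -hide_banner \
  -i recording.nut \
  -filter_complex " \
    [0:v]trim=start=1615910640.256888:end=1615910653.923218,setpts=PTS-STARTPTS[v]; \
    [0:a]atrim=start=1615910640.256888:end=1615910653.923218,asetpts=PTS-STARTPTS[a]" \
  -map "[v]" -map "[a]" \
  -f nut -c:v libx264 -crf 0 -preset ultrafast -c:a flac scene1.nut
Because the intermediate files use lossless codecs, repeatedly chopping and recombining footage like this doesn’t degrade quality.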

There are, of course, some quirks. For quite a while Aeschylus produced output videos in one go, but as the processing became more complex, I discovered that FFmpeg sometimes wants to buffer input before running a filter: from memory, the overlay filter is a surprising culprit (although I may be blaming the wrong filter). As a lazy hack, Aeschylus now generates a Makefile [13] which sets off a cascade of FFmpeg processes. While that does end up generating a number of intermediate files, it can at least do much of the processing in parallel. It’s not a quick process, but it’s just about the right side of interminable. On my 2016 4 core desktop computer, it can process the 3 hours of input for Virtual Machine Warmup Blows Hot and Cold into a ready to upload video in under an hour and a half. Someone more expert with video processing than I could probably improve this substantially.
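
To give an idea of what the cascade boils down to, here’s a hedged sketch in shell; Aeschylus really does generate a Makefile, which gets dependency tracking and make -j style parallelism for free, and extract_scene.sh here is a hypothetical wrapper around a command like the one shown earlier:

# Extract each scene into its own lossless intermediate file in parallel...
./extract_scene.sh recording.nut 1615910640.256888 1615910653.923218 scene1.nut &
./extract_scene.sh recording.nut 1615910669.182979 1615910680.459997 scene2.nut &
wait
# ...then losslessly concatenate the intermediates into the final video.
printf "file '%s'\n" scene1.nut scene2.nut > concat.txt
ffmpeg -f concat -safe 0 -i concat.txt -c copy out.nut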

Clock synchronisation

Aeschylus works by using the wall-clock time of markers to extract portions of a video. Ideally Aeschylus would instruct the FFmpeg recording process that a marker has been set, and then FFmpeg could use its preferred clock to accurately associate the marker with the audio and video. Unfortunately, although I suspect FFmpeg might have a way of doing this (almost certainly involving “cues”), I wasn’t able to work out how to do so.

Instead, Aeschylus uses the wall-clock time since the Unix epoch for markers, forcing FFmpeg to also use Unix epoch timestamps for audio and video packets. Fortunately forcing FFmpeg to do this isn’t too difficult:

ffmpeg -hide_banner \
  -use_wallclock_as_timestamps 1 \
  -f v4l2 -i /dev/video0 \
  -map "0:v" -copyts recording.nut
However, this raises the inevitable spectre of clock synchronisation. The basic problem is that clocks in various devices tend to run at slightly different speeds: even if you synchronise two different clocks, they drift apart over time. This is not merely a theoretical concern. For example, I recently observed noticeable audio phase issues from two devices’ clocks drifting apart only 20 seconds after starting recording.

I assumed at first that FFmpeg would always use a fixed-rate monotonic clock – i.e. a clock which not only doesn’t go backwards but always goes forwards at a fixed rate (e.g. isn’t affected by NTP) – when available. Instead, if my poke around the source code is anything to go by, FFmpeg uses several different clocks, often the normal system clock or sometimes a monotonic clock. On Linux, for example, FFmpeg mostly uses CLOCK_MONOTONIC (which doesn’t go backwards, but doesn’t go forwards at a fixed rate) except for the v4l2 backend which you can force to use the more sensible CLOCK_MONOTONIC_RAW (which goes forwards at a fixed rate). Possibly because of the different clocks involved, early versions of Aeschylus had huge problems accurately associating markers with the recorded input, causing scene extraction to sometimes be off by multiple video frames.

In the audio world, high-end devices solve this problem by allowing one device to send its clock signal to another, so both are perfectly synchronised. Although I couldn’t find any meaningful documentation for this feature, I believe FFmpeg also allows this sort of synchronisation using the “,” syntax in -map:

ffmpeg \
  -use_wallclock_as_timestamps 1 \
  -f v4l2 -i /dev/video0 \
  -f sndio -i snd/0 \
  -map "0:v,1:a" -map "1:a" -copyts recording.mkv
On my OpenBSD system, as far as I can tell, the sndio backend in FFmpeg uses the non-monotonic system clock, and applies that to video frames as they arrive. Aeschylus thus records marker time with (hopefully) the same system clock. Using the non-monotonic system clock will almost certainly go wrong in some cases (e.g. if NTP starts noticeably adjusting the clock frequency), but I’ve got away with it so far.

There are probably already video recording systems that have most, if not all, of the necessary functionality to enable automatic video editing. For example, although I haven’t looked into it in depth, something like this plugin for OBS might do the trick, if I’ve understood the documentation correctly. However, for automatic video editing to be truly reliable, I think marker setting needs to be accurate to within half a video frame’s length (at 24fps, 1 frame lasts about 0.04s, so markers need to be accurate to roughly 0.02s). The OBS plugin seems to have an accuracy of 1 second, though I suspect that can be improved without difficulty.

Equipment

The internet is full of people detailing the lorry load of equipment they use to produce videos. Even if I could afford all of it, I wouldn’t have room to store it. Fortunately, one can produce decent looking videos with surprisingly little equipment.

For Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers, I used my 2016 OpenBSD desktop computer, a RØDE Procaster microphone, an Elgato green screen, and a Logitech C920 webcam. I got the Procaster in the hope (mostly realised) that its rear noise rejection would make my creaky 15 year old keyboard less obvious; most people would probably be perfectly happy with a cheaper USB microphone. If I hadn’t wanted to “blend in” to the background, I wouldn’t have needed the green screen at all. The only lighting I used was good old fashioned English sunshine streaming through the window (which explains why I slightly change colour at some points in the video).

Cameras are a slightly touchier subject. The C920 is probably the most common stand-alone webcam, and is generally considered a decent webcam — but that doesn’t mean it’s a good camera. Frankly, webcams are a classic example of a neglected technology: the C920, for example, produces worse videos than a 10 year old mobile phone camera. At full scale, webcam images are blurry and noisy (or “grainy”). Fortunately, if you scale them down, and use a fairly heavy noise reducer, they don’t look too bad.

By the time of Virtual Machine Warmup Blows Hot and Cold I had acquired a Fujifilm x-mount mirrorless camera for photography which, alongside a £12 HDMI capture card, makes a good video camera too. Since the light in winter is much more variable than in summer, I closed the curtains, and used the normal room light plus an £18 ring light (though it’s little more than a glorified desk lamp). A bit more lighting would have improved things, but even with this crude setup, the results are tolerable, producing good enough full-size images from the camera that I could inflict my full-sized face on viewers. However, my slightly old computer, combined with OpenBSD’s sluggish USB performance, means that I can only reliably capture 1080p footage at 24fps [14].

Automating screen capture setup and teardown

When I first tried recording videos, I would manually change the video resolution (from 1920x1200 to 1920x1080), close a few obviously unused applications, start recording, and hope for the best. I ruined a couple of early videos by forgetting to change resolution or forgetting to close the web browser so chat messages popped up halfway through. The most memorable mistake was when I left my mouse pointer visible just by the image of my head, in such a way that it looked like a cheap devil horn. Even without these cock-ups, the resulting screen capture was cluttered and distracting.

I soon realised that I needed to more carefully control what was running during screen capture, and that I needed to automate this so that I wouldn’t get it wrong quite so often. Aeschylus thus calls user-defined “setup” and “teardown” functions so that arbitrary code can do whatever it needs to get screen capture into order.

Exactly what you can and can’t do in this regard is heavily dependent on your operating system and, if you’re on Unix, your desktop environment. Fortunately on OpenBSD and XFCE I could automate everything I wanted, discovering some wonderful tools (e.g. screenkey and unclutter) and techniques along the way.

For example, I send SIGSTOP (roughly equivalent to “Ctrl+Z in the terminal”) to bulky programs such as web browsers to ensure they don’t interfere with the recording. I use xwininfo and xdotool to make things like xfce4-panel (the application bar at the bottom of my screen) invisible during recording.
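
As a sketch of the sort of thing the setup function might run (the browser and window class names here are assumptions of mine, not necessarily what Aeschylus uses):

# Pause the browser so it can't pop up notifications mid-recording; the
# teardown function would send SIGCONT to wake it up again.
pkill -STOP firefox
# Unmap (hide) the XFCE panel for the duration of the recording; the teardown
# function would use "xdotool windowmap" to restore it.
panel=$(xdotool search --class xfce4-panel | head -1)
xdotool windowunmap "$panel"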

It’s pleasantly surprising how far one can go in changing programs’ configurations from the command line. For example, I use bigger fonts in my editor and terminal in videos than I do normally. xfce4-terminal notices changes to its configuration file [15] so I can change from the old font to a new one with:

sed -i "s/^FontName.*\$/FontName=${NEW_TERMINAL_FONT}/g" \
  ~/.config/xfce4/terminal/terminalrc
I do something similar with neovim, though I haven’t bothered trying to affect running instances: I have to load a new neovim to get the font size increase.

I also set various hot keys during setup, for example to enable gromit-mpx:

xfconf-query --channel xfce4-keyboard-shortcuts \
  --property "/commands/custom/F12" \
  --create --type string --set "gromit-mpx -t"

As this suggests, most of the necessary setup is relatively simple. There is one notable exception. I realised early on that if I made the background colour of my slides the same colour as the background of my desktop, viewers wouldn’t notice when I was moving from one to the other. While it looks like I’ve created a carefully constructed textual overlay, it’s actually just a PDF in a normal PDF viewer which is recorded as part of the X11 screen capture. It’s a cheap trick but an effective one! Unfortunately, I haven’t yet worked out a simple way of changing the background colour of all virtual desktops in XFCE so I have this horrific looking snippet:

r=`python3 -c "print((($BACKGROUND_COLOUR >> 16) & 255) / float(0xFF))"`
g=`python3 -c "print((($BACKGROUND_COLOUR >> 8) & 255) / float(0xFF))"`
b=`python3 -c "print(($BACKGROUND_COLOUR & 255) / float(0xFF))"`
for wksp in `xfconf-query -c xfce4-desktop -l | \
             grep -E "screen0.*(HDMI|monitor|eDP)-?[0-9]/workspace0" | \
             cut -d "/" -f 1-5 | sort -u`; do
  xfconf-query --create -c xfce4-desktop -p ${wksp}/rgba1 \
    -t double -t double -t double -t double -s $r -s $g -s $b -s 1 || true
  xfconf-query -c xfce4-desktop -p ${wksp}/image-style -s 0 || true
  xfconf-query -c xfce4-desktop -p ${wksp}/color-style -s 0 || true
done
I feel sure that it must be possible to do better than that, but at least it works!

The overall effect is significant, as you can see if I run my Aeschylus setup script on the HTML I’m writing for this article:

Of course, once I’ve finished recording, I want my desktop back to normal, so Aeschylus also calls a teardown function. Fortunately, once you’ve got the setup code working correctly, the teardown code is nearly always much easier to write.

Summary

Aeschylus is the worst written bit of software I’ve put my name to since I was about 15 years old and it will probably never be usable for anyone other than me. The reason that I’m willing to risk whatever minimal reputation I have is that I think Aeschylus fleshes out Vivek’s original idea a little bit further, and shows that automatic video editing works at a larger scale. I'm sure that the idea can be extended further — for example, I suspect that many people would like a way of replaying the last few seconds of the previous scene to help them decide what they want to say in the next retake of the current scene.

It’s important to stress that automatic video editing makes one part of video production easier, but it can’t change your delivery or, more significantly, create good content. It’s also not the right tool for people who want to produce videos with ground-breaking visuals, but it works well if you want to produce videos that look just about good enough that they don’t detract from their content. For me, and I suspect quite a lot of other people, that’s as high a bar as we need or want to jump over!

Addendum (2021-03-24): I’ve been pointed at a small number of tools which share some of Aeschylus’s goals. Probably closest in spirit is Gregor Richards's plip, which makes it easy to record markers in OBS (neatly solving the clock synchronisation problem) and then edit out the chunks between markers in a custom editor. The commercial Descript tool takes a different approach, making it possible to do things like edit a video based on automatically transcribed text. I imagine that makes editing much easier, though – like approaches which automatically remove silence – I expect that the resulting videos will contain more, and more jarring, transitions than I find comfortable. With luck, there will be further innovation in this area as time goes on!

Acknowledgements: Thanks to Edd Barrett, Lukas Diekmann, and Vivek Haldar for comments.


Footnotes

[1] “Immediately” is not an exaggeration. After seeing Sam’s talk, I reinvigorated my previously desultory efforts into investigating microphones, placing an order for a RØDE Procaster later that same day. It turned up in time for me to use it for the Ask Me Anything I did with Doug Lea 3 days later. In retrospect, that could have gone badly wrong: I had tested it for less than 5 minutes and on a different computer than the one I used to talk to Doug. As you can clearly see from the video, I hadn’t even worked out at what angle it might sound best. Fortunately, nothing went wrong, but that was more luck than judgment.
[2] At the time of writing, OBS has been at least partly ported to OpenBSD, although I haven’t tried using it.
[3] Hardware encoding is much faster, but generally lower quality, both in terms of visual effects, and also in terms of how much the resulting file is compressed.
[4] For example, Audacity’s built-in compressor has a minimum attack time of 10ms, which is much slower than one wants for speech.
[5] This was exacerbated by the fact that it had not occurred to me to turn the monitor’s brightness down during recording and by the small desk I had at the time. For other reasons, I fixed the latter problem by purchasing this decent sized sit-stand desk — 6 months on, I remain both impressed and pleased with it!
[6] Amongst the many surprising realisations I’ve had with video is that the common video formats have surprising problems. The mp4 format, for example, can contain lossless video but not lossless audio (unless you slightly ignore the standard). I found that out the hard way early on when, after multiple rounds of processing, some of my test videos sounded like they’d been recorded on a landline! The mkv format allows lossless video and audio but, to my surprise, can’t represent some framerates (including 24fps) accurately.

The nut format isn’t very well known, but I haven’t yet found a problem with it other than it’s apparently not supported by all players — but it’s supported by the video players I use, as well as sites like YouTube. During video processing I exclusively use lossless audio and video codecs (flac and H.264 in lossless mode) and I only ever upload lossless videos, so whatever processing YouTube and friends undertake does the least possible damage to the final video people watch.

[7] Recorded music is an obvious analogy. Unless (and, often, even if) you listen to a live recording, you’re not listening to a recording of a band playing a song once through: you’re listening to multiple performances glued together to make that recording. Digital recording has allowed this to be taken to an extreme: tiny bursts of “perfect” playing can be glued together to make a recording which no human could ever achieve in a single take. I am not a fan of the results.
[8] To my surprise, some colleagues who are regular YouTube watchers are so used to such transitions that they didn’t even notice them. I remain unsure whether to be impressed or concerned by this.
[9] Because my keyboard is so creaky, the noise of the key press and depress during transitions was awful. Aeschylus fades the volume down and then back up either side of the transition to lessen this.
[10] The major exception is that FFmpeg treats video and audio very differently, and not only refuses to coerce between the two, but won’t even accept one or the other, even when it perhaps could. Thus one ends up with several filters which do the same thing to audio as video (and vice versa), but accept only audio or video (e.g. asetpts and setpts).
[11] To give you an idea, I discovered by trial and error a 700ms delay in a part of FFmpeg I cared about. Absolutely baffled, I turned to Google to find someone on a forum stating that the -muxdelay option defaults to 700ms. This is, of course, not documented.
[12] vaguedenoiser is particularly slow. I previously used hqdn3d, which is much faster, and gave good results on low quality webcam footage. On higher quality mirrorless input, however, hqdn3d felt like it was smudging things a bit too much for my liking.
[13] The Makefile for Virtual Machine Warmup Blows Hot and Cold weighs in at 1396 lines.
[14] Although some modern monitors can adapt their refresh rate to the content displayed, a safer bet is that most have a refresh rate that’s an integer multiple of 30. However, for the sort of videos I’m producing the difference between 24fps and 30fps isn’t too significant.
[15] Unfortunately, xfce4-terminal does go a bit wonky sometimes during configuration changes, notably it sometimes gets confused and changes its window size or displays the wrong font. This seems to mostly happen when the font size is decreased, although I have no idea why this might be the case.

The Evolution of a Research Paper

January 19 2021

Download:  Opus (3.1MiB)   mp3 (11.1MiB)
As far as I can recall, the first time that I truly tried to read a research paper was when I started my PhD. I don’t remember which paper was the first recipient of my incompetent gaze, but I do remember how completely out of my depth I felt when trying to make sense of it. Over many subsequent papers I tried during that period, I made very little progress in meaningfully digesting any of what I was reading. I quite often found a fog descending on my brain, a curious mix of tiredness and bewilderedness, that left me feeling defeated.

In retrospect, I faced three main issues: I didn’t have a good technique for reading papers [1]; I was trying to read papers that far exceeded my rather narrow technical knowledge; and I wasn’t reliably selecting papers that were well written. To a reasonable extent, sheer repetition tends to fix all of these issues.

However, I eventually realised there was a much deeper issue: I did not understand how papers come into being and certainly had no idea how I should go about writing my own. From my naive perspective, papers seemed to come fully formed as 20 page PDFs, full of dense prose, foreboding formulae, and complex tables. I had little idea how they got to that state and, for whatever reason, nobody I bumped into ever seemed to talk in a meaningful way about the mechanics of writing a paper. Even now, I’m not sure how most people or groups go about writing their papers.

A few years back I wrote about what I’ve learnt about writing research papers, but I suspect that does a better job of explaining the detailed mechanics of writing parts of a paper than it does giving a sense of the whole. When I recently saw a simple but effective animation of a paper evolving during its writing process, I realised that it might help give a high-level view of the whole of a paper. What I’m going to do in this post is to take two papers I’ve been involved with and use similar animations to give a sense of how each paper evolved, as well as some of the human story behind why it evolved that way. I’m not going to try and derive grand lessons from these two examples, but I hope that they might help slightly demystify how papers end up looking as they do.

The warmup paper

Let’s start with Virtual Machine Warmup Blows Hot and Cold which is probably the most demanding paper I’ve ever been involved with. To some extent, one can recover the paper’s story from the repository’s commits, but even that won’t tell you everything. So I’m now going to briefly explain the story of the paper [2] from my perspective [3].

It all started innocently enough in May 2015 when Edd Barrett undertook what I thought would be a “quick” bit of work to see how long it takes programming language VMs (Virtual Machines) such as the JVM to “warmup” (i.e. how long does it take them to reach the “steady state of peak performance”?). A couple of weeks later, Edd showed us some rough data, which suggested that VMs often don’t warm-up. This was, to put it mildly, surprising. My immediate worry was that we might have overlooked an important factor (e.g. servers overheating and a resultant throttling of performance) that, if fixed, would make all the odd results disappear. Soon joined by Carl Friedrich Bolz and Sarah Mount, we embarked on a lengthy iterative process, each time trying to think of something we might have missed, or got wrong, addressing that issue, and rerunning the experiment to see if anything meaningful changed.

By September 2015, the data we were seeing wasn’t fundamentally very different from May, and we were starting to feel more confident that we hadn’t done anything stupid. However, “more confident” is relative: we were still worried that we’d missed something obvious. We thus decided to create a formal draft paper that we could pass around to others for comment.

As soon as we started creating that draft paper, it became clear that while we had lots of data, we had no idea how to process it or present it. Vince Knight, a statistician, started giving us advice [4], soon putting us in touch with Rebecca Killick, an expert in changepoint analysis. Without the use of changepoint analysis, the eventual paper wouldn’t have been worth reading.

However, it soon became apparent that integrating changepoint analysis was not going to be a quick job. Since we still wanted comments on our experimental setup, the first version of the paper thus doesn’t have any statistical summaries at all. Despite that limitation, we got useful, detailed feedback from several people who I admire. We also discovered, quite unexpectedly, that many senior members of the VM community had seen peculiar benchmarking results over the years, and were surprised only by the extent of the oddities we were seeing. That helped convince me that the path we were on was a worthwhile one.

We kept beavering away on the research and the paper throughout 2016, most notably integrating a new way of producing benchmarking statistics with changepoints. By December 2016 we’d produced a second version of the paper, which we submitted to a programming languages research conference. Though imperfect, most of the major parts of the paper were now in decent shape, and that version of the paper bears a clear resemblance to the final version. However, the paper was rejected, and the reviewer comments we received didn’t really give us useful pointers for improvement [5].

Fortunately for us — and despite having worked on the research for 2 years, and the paper for 18 months by this point — we were still thinking of improvements ourselves, though we were now firmly in the territory of diminishing returns. By April 2017 we submitted the third version of the paper to another conference where it was conditionally accepted, meaning that we needed to address a handful of specific comments in order to convince the reviewers that the paper should be fully accepted. As is common in such cases, addressing those comments wasn’t too difficult (the expectation is that conditionally accepted papers don’t need too much work done on them in order to be fully accepted).

However, one small part of a comment from a reviewer raised a much bigger question in our heads than I think the reviewer could ever have intended. In essence, we wondered what would have happened if we’d not run benchmarks for as long as we did: would shorter runs still have highlighted as many odd things? We put together a partial answer to this question which you can see in Figure 9 of the fourth version of the paper. That version of the paper was enough to make the reviewers happy, and enable the paper to be fully accepted at the conference, but we still felt that we hadn’t addressed this other question adequately, so we kept plugging away.

Amongst the oddities of computing conferences is that they impose a fixed deadline on submission of a final version of a paper, generally around a month after initial notification. We now had such a deadline, but we were also uncovering new challenges in addressing the “what would have happened?” question. I then took my eye off the ball somewhat, and didn’t stop to think that as we improved our answer to the question, we were slowly but steadily increasing the amount of data we needed to process. Suddenly, 3 days before the deadline, a back of the envelope calculation suggested that our server would take about 20 days to process the data. Frankly, we looked to be in trouble, and I remember thinking that there was now a small, but definite, possibility that we would have to retract the paper. Thankfully, we dug ourselves out of the hole we found ourselves in by crudely breaking our Python processing program up into multiple processes and renting a beefy, and expensive, AWS server [6]. It crunched the data in under 24 hours, allowing us to produce the data that ended up as Figure 11 in the fifth version of the paper [7].

That’s the human side of the story, although a few quick numbers may help round things out. We started the underlying research in May 2015 and the paper in September 2015 — and we kept actively working on both until September 2017. As part of the research we created Krun and the experiment itself. The paper itself has 1001 commits, and we made six releases of the paper on arXiv.

Let’s now look at the animation [8] of this paper evolving (skip to 45s if you want to avoid my ugly mug), with each frame representing the state of the paper at one of its 1001 git commits (with the date and time of each commit shown in the bottom right):

You might find it useful to speed up, or slow down, the animation — in my experience, different speeds tend to highlight different aspects of the paper’s evolution. A few further details might help you make sense of the animation. Commits that have a git tag associated with them (representing “releases” of some sort or other) pause the animation for a couple of seconds showing, in red, the tag name alongside the time (e.g. at 1:50 [9]). Within the paper itself, text in red represents a comment, or question, from one author to another. Not every version of the paper is perfectly rendered, because there is an element of bit rot. For example, while TeX is famously stable, LaTeX, and its distributions, aren’t — packages come and go, and backwards compatibility is hit and miss. We also didn’t write the paper in the expectation that anyone would go try and build old versions. However, other than the bibliography sometimes disappearing in the first quarter of the animation, the imperfections seem minor.

A few things that the animation highlights are worth looking at in detail. The paper starts off as a literal skeleton [10]: a title, some authors (and not the final set of authors), a couple of section headings (but little text), and a single citation of another paper (just to make sure the bibliography builds). It then goes through a rapid period of expansion as draft text representing the state of underlying research at that point is added.

Fairly quickly, one sees the first big chunks of red text [11], which is generally one of us noticing that something is either incomplete or nonsensical. This is the first time that one can start to tease apart the relationship between a paper and the research underlying it: for me, a paper is the condensed version of the understanding generated from the underlying research. In my experience, the process of “turning” research into a paper regularly uncovers gaps in the underlying research that we had either not noticed or sometimes actively ignored and which need fixing. This, to my mind, is a virtuous cycle: writing the paper highlights weaknesses in the underlying research that, when fixed, improve the paper.

The periods of expansion sometimes involve a number of commits, but typically last only a few days. Such periods are followed by much lengthier periods of consolidation, which can sometimes last months. Many of these periods of consolidation are where gaps in the research are slowly filled in, though a few exist simply because everyone involved was busy elsewhere! The remainder are those periods leading up to a release of the paper. Some are explicitly “draft” releases, some are submissions to venues (in computing, currently, these mostly take the form of conferences), and there’s generally a “final” release [12]. You’ll notice occasional major changes in the “look” of the paper where a change in the venue we are targeting requires us to change to that venue’s formatting requirements (e.g. at 1:20: before [13] and after [14]).

The final thing the animation makes clear is that, as time goes on, the periods of expansion become shorter, and less frequent, and the periods of consolidation become not just longer, but involve less profound changes. The amount of red text in each commit becomes smaller as the gaps in the research that we notice become smaller. Nevertheless, even at 3 minutes into the animation – fairly near the end of the paper’s evolution – there are noticeable additions to it.

The error recovery paper

Of course, everything that I’ve written above relates to a single paper which may or may not be representative of any other paper. It’s thus worth having a look at a second paper, both as a sanity check, and also to see if anything else interesting arises. In this case I’m going to look at Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers, a paper which shows how one can add practical, automatic, syntax error recovery to LR grammars. It is not my intention to expatiate with the same minuteness on the whole series of its history, but there is one thing that I want to tease out. Let’s start with the animation (which starts at 40s):

This paper wasn’t a small effort — its writing still spanned two and a half years — but it wasn’t on the same scale as the warmup paper. It has one third as many commits and a smaller, though not exactly tiny, experiment. Unlike the warmup work, however, it led to software (grmtools) which non-researchers might be interested in using [15].

It ran through a similar, though slightly different, evolution to the warmup paper. I ran a private early draft past a couple of people whose work in the area I respected before the first “public” release on arXiv. It was rejected at a conference and a journal before being accepted at a conference.

I think the most interesting thing about this paper is something you can’t see in its final version, which contains a single new algorithm ‘CPCT+’. However, in the video you’ll see at around 1:55 a major reformatting (before [16] and after [17]) almost immediately followed by a significant reduction in the paper’s contents (before [18] and after [19]). If you look at the second version of the paper, just before this reduction happens, you’ll see that the paper contained a second algorithm ‘MF’.

MF wasn’t removed because of page constraints [20], nor because of reviewer comments: I took the decision to remove it because I thought it would make the paper better. MF is in most respects a superior algorithm to CPCT+: MF runs faster, and fails less often. However, CPCT+ already runs quite fast and doesn’t fail very often, so although MF is demonstrably better there isn’t much scope for it to be much better.

MF is, however, much more complex than CPCT+. I don’t have exact figures, but it wouldn’t surprise me if I spent ten times as long on MF as I did on CPCT+. That meant that I was beguiled by MF and my part in it, without stopping to think “is this complexity worth it?” Eventually that question became difficult for me to ignore. If I had to guess, I reckon it might take someone else at least five times longer to integrate MF into their parsing system than CPCT+. Because MF’s performance gains over CPCT+ are rather modest, this seems a poor trade-off.

Thus, reluctantly, I had to conclude that by including MF, I was demanding more from every reader than I was giving back. Put another way, if I’d have left MF in, I’d have been prioritising my ego over the reader’s time: that might be a good trade-off for me, but not for the community overall. Once I’d reached that decision, deleting the text (and the code) was inevitable. That doesn’t mean that deleting it was easy: I was effectively writing off 3–6 months work of my own time. Although that sounds like a lot of time, it’s not unique: in my experience it’s common for weeks of work on the underlying research either not to make it into the resulting paper at all, or at most to form part of a single sentence.

Summary

My aim in this entry hasn’t been to derive general lessons that you, or even I, can apply to other papers: each research project, and each paper, is a bespoke effort [21]. However, I think that a few common themes are worthy of note.

First, there’s a difference between the underlying research and a paper representing that research: by definition the latter can only contain a subset of the former (see e.g. the deletion of MF).

Second, the act of writing a paper frequently uncovers holes in the underlying research that need to be filled in.

Third, writing a paper is a slog: in my opinion, every sentence, every paragraph, and every figure is important, and getting them all into shape is no small task.

Fourth, writing up research and releasing papers is important: I’ve lost track of how many times I’ve heard of interesting systems that weren’t written up and which no longer run and thus are lost to us. To me, writing a paper isn’t about gaining another line on my CV: it’s about clearly and accurately articulating the underlying research, warts and all, to the best of my ability, so that other people can understand what’s been done and decide whether, and how, to build upon it.

I started this entry off by noting how difficult I used to find it to read papers. I don’t suppose anything I’ve written in this entry helps make the task of reading a paper easier, but perhaps it has at least given you a little insight into the process by which a paper is produced.

Acknowledgements: Thanks to Edd Barrett, Carl Friedrich Bolz-Tereick, Aleks Bonin, Martin Chapman, Lukas Diekmann, Jake Hughes, and Dan Luu for comments.


Footnotes

[1] It took me many years to realise that there are different approaches to reading a paper, that some approaches are meaningfully better than others, and that different people respond better to some approaches than others. These days, my general approach is to try reading a paper as quickly as possible: I try not to linger on parts that I don’t understand, because I’ve learnt from experience that confusion is often resolved by a later part of the paper. Once I’ve finished, I take a short period to digest what I’ve read and then, if necessary, I’ll go back and do another read through.
[2] I’ve talked a little bit about the story behind the paper before, but not in great detail.
[3] As with pretty much any human activity, different co-authors on a paper will nearly always have a slightly different perspective on what happened, because none of us can fully understand everything that other people went through.
[4] Vince later determined that his area of statistics wasn’t quite the tool we needed, and put us in touch with Rebecca. That was already a generous act, but he topped it a while later by asking that we remove him as an author because he felt he hadn’t contributed enough to the paper to be deserving as an author. I tried to convince him that he had already done more than enough to deserve a credit, but he politely stuck to his guns. I can sometimes lazily descend into the mire of cynicism, but when I see someone acting upon such deeply held morals, it reminds me how many good people there are in the world.
[5] Sometimes when a paper is rejected, one receives useful comments that help to improve the paper. I try not to complain about rejections, because it is almost never useful or edifying. However, I’m prepared to make an exception in this case, because, even 3 years later, I still can’t believe the following comments we received:
The unstated inference that extreme lengths are necessary in order to gain confidence in an empirical results is misplaced and counter-productive.
and:
Second, your paper overlooks the economics of research. The researcher's job is _not_ to produce the most error-free result. Their job is _produce the most insightful research given the resources available_
I don’t think I’ve ever been scolded, before or since, for putting too much work into my research!
[6] From memory, it had 32 cores and 1TiB RAM — still, I think, the beefiest computer I’ve ever used.
[7] We then had to fix a single typo to produce the sixth, and final, version.
[8] This took a mere six and three quarter hours to generate!
[9] (image)
[10] (image)
[11] (image)
[12] Personally I hope that papers are “maintained”, at least for a while, in the same way that software is maintained. If someone points out a major flaw in a paper I’ve written, I hope I’ll go back and make a new version of the paper. There are limits to this: at some point, I will have forgotten most of what I once knew, and bit rot can make fixing things impractical. However, currently, there isn’t really a culture of reporting problems on papers, so I haven’t put this good intention into practice: until then, I’m not really sure how long I’m willing or able to maintain a paper for.
[13] (image)
[14] (image)
[15] It also caused us to produce a number of stand-alone Rust libraries: cactus, packedvec, sparsevec, and vob. I’d like to particularly thank Gabriela Moldovan for her work on packedvec. Embarrassingly, as I wrote this footnote, I noticed that we failed to credit her in the paper. Sorry Gabi!
[16]
[17]
[18]
[19]
[20] Even though computing conferences abandoned printed proceedings several years back, they still impose stringent page limits on papers. This has two unfortunate effects: work that naturally needs more space than is allowed is chopped down until it does, compromising legibility; and work that needs less space is padded out to make it look more impressive, compromising legibility in a different way.
[21] Or, at least, they should be a bespoke effort. Alas, there is a problem in academia with “salami slicing”, where what would best be presented in a single paper is instead split up into multiple papers. This is detrimental to the community as a whole but is done because the authors involved believe it will help their careers.

Automatic Syntax Error Recovery

November 17 2020

Programming is the best antidote to arrogance I've come across — I make so many errors that I am continually reminded of my own fallibility. Broadly speaking, I think of errors as severe or minor. Severe errors are where I have fundamentally misunderstood something about the system I am creating. Each severe error is a bespoke problem, often requiring custom tooling to understand and fix it. Minor errors, in contrast, are repetitive and quickly fixed. However, they’re also much more numerous than severe errors: even shaving a couple of seconds off of the time it takes a programmer to fix a class of minor errors is worthwhile when you consider how often they occur.

The most minor of minor errors, and also I suspect the most frequent, are syntax errors. They occur for three main reasons: mental sloppiness; inaccurate typing [1]; or an incomplete understanding of the language’s syntax. The latter case is generally part of a brief-ish learning phase we go through and I’m not sure what a good solution for it might look like. The former two cases, however, are extremely common. When I’ve made a small typo, what I want is the parser in my compiler or IDE to pinpoint the location of the syntax error accurately and then recover from it and continue as if I hadn’t made an error at all. Since compilation is often far from instantaneous, and I often make multiple errors (not just syntax errors), good quality syntax error recovery improves my programming efficiency.

Unfortunately, LR parsers – of which I am particularly fond – have a poor reputation for syntax error recovery. I’m going to show in this article that this isn’t inevitable, and that it’s possible to do surprisingly good automatic syntax error recovery for any LR grammar. If you want to know more details, you might be interested in the paper Lukas Diekmann and I recently published called Don't Panic! Better, Fewer, Syntax Errors for LR Parsers. The paper also has a fairly brief accompanying talk, if you find that sort of thing helpful:

For everyone else, let’s continue. To make our lives easier, for the rest of this article I’m going to shorten “syntax error recovery” to “error recovery”.

Outlining the problem

Let’s see error recovery in action in a widely used compiler — javac:

[Click to start/stop animation]
As a quick reminder, ‘int x y;’ isn’t valid Java syntax. javac correctly detects a syntax error after the ‘x’ token and then tries to recover from that error. Since ‘int x;’ is valid Java, javac assumes that I meant to put a semi-colon after ‘x’, repairs my input accordingly, and continues parsing. This is the good side of error recovery: my silly syntax error hasn’t stopped the compiler from carrying on its good work. However, the bad side of error recovery is immediately apparent: ‘y;’ isn’t valid Java, so javac immediately prints out a second, spurious, syntax error that isn’t any fault of mine.

Of course, I have deliberately picked an example where javac does a poor job but I regret to inform you that it didn’t take me very long to find it. Many parsers do such a poor job of error recovery that experienced programmers often scroll back to the location of the first syntax error, ignoring both its repair and any subsequent syntax errors. Instead of being helpful, error recovery can easily have the opposite effect, slowing us down as we look for the real error amongst a slew of spurious errors.

Let’s look at the modern compiler that I most often use as an exemplar of good error messages, rustc. It often does a good job in the face of syntax errors:


[Click to start/stop animation]
However, even rustc can be tripped up when presented with simple syntax errors:

[Click to start/stop animation]

Some language implementations don’t even bother trying to recover from syntax errors. For example, even if I make two easily fixable syntax errors in a file, CPython stops as soon as it encounters the first:


[Click to start/stop animation]

As all this might suggest, error recovery is hard to do well, and it's unlikely that it will ever fully match human intuition about how syntax errors should be fixed. The root of the problem is that when we hit an error while parsing, there are, in general, an infinite number of ways we could try to get parsing back on track. Since exploring infinity takes a while, error recovery has to use heuristics of one sort or another.

The more knowledge a parser has of the language it is parsing, the more refined that heuristic can be. Hand-written parsers have a fundamental advantage here, because one can add as much knowledge about the language’s semantics as one wants to the parser. However, extending a hand-written parser in this way is no small task, especially for languages with large grammars. It’s difficult to get precise figures, but I’ve seen more than one parser that has taken a small number of person years of effort, much of which is devoted to error recovery. Not many of us have that much time to devote to the problem.

Automatically generated parsers, in contrast, are at a clear disadvantage: their only knowledge of the language is that expressed via its grammar. Despite that, automatically generated LL parsers are often able to do a tolerable job of error recovery [2].

Unfortunately, LR parsers have a not undeserved reputation for doing a poor job of error recovery. Yacc, for example, requires users to sprinkle error tokens throughout their grammar in order to have error recovery in the resulting parser: I think I’ve only seen one real grammar which makes use of this feature, and I am sceptical that it can be made to work well. Panic mode is a fully automatic approach to error recovery in LR parsing, but it works by gradually deleting the parsing stack, causing it to delete input before a syntax error in order to try and recover after it. Frankly, panic mode’s repairs are so bad that I think on a modern machine [3] it’s worse than having no error recovery at all.

The roots of a solution

At a certain point when working on grmtools I realised that I should think about error recovery, something which I had only ever encountered as a normal user. A quick word about grmtools: it’s intended as a collection of parsing related libraries in Rust. At the moment, the parts that are most useful to users are lrpar – a Yacc-compatible parser – and, to a lesser extent [4], lrlex – a Lex-ish lexer. For the rest of this article, I’ll almost exclusively be talking about lrpar, as that’s the part concerned with error recovery.

Fortunately for me, I quickly came across Carl Cerecke’s PhD thesis which opened my eyes to an entirely different way of doing error recovery. His thesis rewards careful reading, and has some very good ideas in it. Ultimately I realised that Cerecke’s thesis is a member of what these days I call the Fischer et al. family of error recovery algorithms for LR parsing, since they all trace their lineage back to that paper.

When error recovery algorithms in the Fischer et al. family encounter a syntax error they try to find a repair sequence that, when applied to the input, gets parsing back on track. Different algorithms have different repairs at their disposal and different mechanisms for creating a repair sequence. For example, we ended up using the approach of Corchuelo et al. — one of the most recent members of the Fischer et al. family — as our intellectual base.

CPCT+ in use

We took the Corchuelo et al. algorithm, fixed and semi-formalised it [5], and extended it to produce a new error recovery algorithm CPCT+ that is now part of lrpar. We can use nimbleparse — a simple command-line grammar debugging tool — to see CPCT+ in action on our original Java example:

[Click to start/stop animation]
As with javac’s error recovery, CPCT+ is started when lrpar encounters a syntax error at the ‘y’ token. Unlike javac, CPCT+ presents 3 different repair sequences (numbered 1, 2, 3) to the user which, in order [6], would repair the input to be equivalent to: ‘int x, y;’, ‘int x = y;’, or ‘int x;’. Importantly, repair sequences can contain multiple repairs:

[Click to start/stop animation]
Since you probably don’t want to watch the animation endlessly I’ll put the repair sequences that are reported here:
Parsing error at line 2 column 13. Repair sequences found:
   1: Insert ), Shift {, Delete if, Delete true
   2: Insert ), Shift {, Shift if, Insert (, Shift true, Insert )
This example shows all of the individual repair types that CPCT+ can generate:
  • ‘Insert x’ means ‘insert a token x at the current position’;
  • ‘Shift x’ means ‘keep the token x at the current position unchanged and advance the search’;
  • ‘Delete x’ means ‘delete the token x at the current position’.

A repair sequence is just that: an ordered sequence of repairs. For example, the first repair sequence above means that the input will be repaired to be equivalent to:

class C {
    void f() {
        { }
    }
}
while the second repair sequence will repair the input to be equivalent to:
class C {
    void f() {
        if (true) { }
    }
}
As this shows, CPCT+ is doing something very different to traditional error recovery: it’s repairing input spanning multiple tokens in one go. This is perfectly complementary to repairing syntax errors at different points in a file as this example shows:

[Click to start/stop animation]
Although CPCT+ can return multiple repair sequences, it would be impractical to keep all those possibilities running in parallel — I also doubt that users would be able to interpret the resulting errors! lrpar thus takes the first repair sequence returned by CPCT+, applies it to the input, and continues parsing.
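
To make the machinery described so far concrete, here is a minimal Rust sketch of repairs, repair sequences and the notion of cost that the “minimum cost repair sequences” discussed below are based on. It is purely illustrative, not lrpar’s internal representation, and ‘TokenKind’ is just a stand-in for whatever identifies token types in a real parser:

// Illustrative only: not lrpar's actual types. `TokenKind` stands in for
// whatever identifies token types in a real parser.
type TokenKind = &'static str;

#[derive(Clone, Debug)]
enum Repair {
    Insert(TokenKind), // insert a token of this kind at the current position
    Shift(TokenKind),  // keep this input token unchanged and advance the search
    Delete(TokenKind), // delete this input token
}

// A repair sequence is an ordered sequence of repairs, e.g.
// [Insert(")"), Shift("{"), Delete("if"), Delete("true")].
type RepairSeq = Vec<Repair>;

// Inserts and deletes make a repair sequence more costly; shifts are free.
fn cost(seq: &RepairSeq) -> usize {
    seq.iter()
       .filter(|r| matches!(r, Repair::Insert(_) | Repair::Delete(_)))
       .count()
}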

At this point you might be rather sick of Java examples. Fortunately, there's nothing Java specific about CPCT+. If I feed it Lua’s Lex and Yacc files and broken input it'll happily repair that too [7]:


[Click to start/stop animation]
Indeed, CPCT+ will happily perform error recovery on any other language for which you can write a Yacc grammar.

Ultimately, CPCT+ has one main novelty relative to previous members of the Fischer et al. family: it presents the complete set of minimum cost repair sequences to the user, whereas other approaches non-deterministically present just one member of that set. In other words, when, for our original Java example, CPCT+ presented this to users:

Parsing error at line 2 column 11. Repair sequences found:
  1: Insert ,
  2: Insert =
  3: Delete y
approaches such as Corchuelo et al. would only have presented one repair sequence to the user. Since those approaches are non-deterministic, each time they’re run they can present a different repair sequence to the one before, which is rather confusing. The intuition behind “minimum cost repair sequence” is that we want to prioritise repair sequences which do the smallest number of alterations to the user’s input: insert and delete repairs increase a repair sequence’s cost, although shift repairs are cost-neutral.

To my mind, in the example above, both ‘Insert ,’ and ‘Insert =’ are equally likely to represent what the programmer intended, and it’s helpful to be shown both. ‘Delete y’ is a bit less likely to represent what the programmer intended, but it’s not a ridiculous suggestion, and in other similar contexts would be the more likely of the 3 repair sequences presented.

Under the bonnet

The paper and/or the code are the places to go if you want to know exactly how CPCT+ works, but I’ll try and give you a very rough idea here.

When lrpar encounters a syntax error, CPCT+ is started with the grammar’s statetable (the statemachine that an LR parser turns a grammar into; see e.g. this example), the current parsing stack (telling us where we are in the statetable, and how we got there), and the current position in the user’s input. By definition the top of the parsing stack will point to an error state in the statetable. CPCT+’s main job is to return a parsing stack and position to lrpar that allows lrpar to continue parsing; producing repair sequences is a happy by-product of that.

CPCT+ is thus a path-finding algorithm in disguise, and we model it as an instance of Dijkstra's algorithm. In essence, each edge in the graph is a repair, which has a cost; we’re looking to find a path that leads us to success. In this case, “success” can occur in two ways: in rare cases where errors happen near the end of a file we might hit the statetable’s sole accept state; more commonly, we settle for shifting 3 tokens in a row (i.e. we’ve got to a point where we can parse some of the user’s input without causing further errors). As this might suggest, the core search is fairly simple.
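
As a very rough illustration of that search (emphatically not lrpar’s real code, which among other things collects the complete set of minimum cost repair sequences rather than stopping at the first, and imposes a timeout), the skeleton is essentially a textbook cheapest-first search:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// A highly simplified sketch: nodes are explored cheapest-first,
// Dijkstra-style. A node counts as successful once it has shifted 3 tokens
// in a row (reaching the accept state, the other success condition, isn't
// modelled here). `neighbours` stands in for "all single repairs applicable
// at this node".
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct SearchNode {
    cost: usize,            // number of inserts and deletes so far
    shifts_in_a_row: usize, // consecutive shifts since the last insert/delete
    repairs: Vec<String>,   // the repair sequence built up so far
}

fn search(
    start: SearchNode,
    neighbours: impl Fn(&SearchNode) -> Vec<SearchNode>,
) -> Option<Vec<String>> {
    let mut todo = BinaryHeap::new();
    todo.push(Reverse(start));
    while let Some(Reverse(n)) = todo.pop() {
        if n.shifts_in_a_row >= 3 {
            return Some(n.repairs); // parsing is back on track
        }
        for m in neighbours(&n) {
            todo.push(Reverse(m));
        }
    }
    None // the search space was exhausted without success
}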

Most of CPCT+’s complexity comes from the fact that we try to find all minimum cost paths to success and we need ways of optimising the search. There are a few techniques we describe in the paper to improve performance, so I’ll use what’s probably the most effective as an example. Our basic observation is that, when searching, once-distinct paths often end up reaching the same node, at which point we can model them as one henceforth. We therefore identify compatible nodes and merge them into one. The challenge is then how compatible nodes can be efficiently identified. We make use of an often forgotten facet of hashmaps: a node’s hash behaviour need only be a subset of its equality behaviour. In our case, nodes have three properties (parsing stack, remaining input, repair sequence): we hash based on two of these properties (parsing stack, remaining input) which quickly tells us if a node is potentially compatible; equality then checks (somewhat slowly) all three properties to confirm definite compatibility. This is a powerful optimisation, particularly on the hardest cases, improving average performance by 2x.
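
Here is a rough Rust illustration of that facet, with placeholder field types rather than lrpar’s real ones. Equal nodes always produce equal hashes, so the Hash/Eq contract holds, but the hash deliberately ignores the repair sequence:

use std::hash::{Hash, Hasher};

// A sketch with placeholder field types, not lrpar's real ones.
#[derive(PartialEq, Eq)]
struct Node {
    stack: Vec<usize>,    // stands in for the parsing stack
    input_off: usize,     // stands in for the remaining input
    repairs: Vec<String>, // stands in for the repair sequence so far
}

impl Hash for Node {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash on only two of the three properties: nodes that differ only in
        // their repair sequences land in the same bucket, which cheaply tells
        // us they are potentially compatible; the derived equality above then
        // checks all three properties to confirm definite compatibility.
        self.stack.hash(state);
        self.input_off.hash(state);
    }
}

Such nodes can then be dropped into a standard HashSet or HashMap: a hash collision cheaply flags potential compatibility and the slower equality check confirms it.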

Ranking repairs

CPCT+ collects the complete set of minimum cost repair sequences because I thought that would best match what a user would hope to see from an error recovery algorithm. The costs of creating the complete set of minimum cost repair sequences were clear early on but, to my surprise, there are additional benefits.

The overarching problem faced by all approaches in the Fischer et al. family is that the search space is unbounded in size. This is why shifting a mere 3 tokens from the user’s input is enough for us to declare a repair sequence successful: ideally we would like to check all of the remaining input, but that would lead to a combinatorial explosion on all but a handful of inputs. Put another way, CPCT+’s core search is inherently local in nature: the repair sequences it creates can still cause subsequent spurious errors beyond the small part of the input that CPCT+ has searched.

The complete set of minimum cost repair sequences allows us to trivially turn the very-local search into a regional one, letting us rank repair sequences in a wider context. We take advantage of the fact that CPCT+'s core search typically only finds a small handful of repair sequences. We then temporarily apply each repair sequence to the input and see how far it can parse ahead without causing an error (up to a bound of 250 tokens). We then select the (non-strict) subset which has parsed furthest ahead and discard the rest. Consider this Java example:

class C {
    int x z() { }
}
When run through the “full” CPCT+ algorithm, two repair sequences are reported to the user:
Parsing error at line 2 column 11. Repair sequences found:
   1: Delete z
   2: Insert ;
However if I turn off ranking, we can see that the “core” of CPCT+ in fact generated three repair sequences:
Parsing error at line 2 column 11. Repair sequences found:
   1: Insert =
   2: Insert ;
   3: Delete z
Parsing error at line 2 column 15. Repair sequences found:
   1: Insert ;
In this particular run, lrpar chose to apply the ‘Insert =’ repair sequence to the input. That caused a spurious second error just beyond the region that CPCT+ had searched in. However, the other two repair sequences allow the whole file to be parsed successfully (though there is a semantic problem with the resulting parse, but that’s another matter). It might not be immediately obvious, but traditional Fischer et al. algorithms wouldn’t be able to throw away the ‘Insert =’ repair sequence and keep looking for something better, because they have no point of comparison that would allow them to realise that it’s possible to do better. In other words, the unexpected advantage of the complete set of minimum cost repair sequences is precisely that it allows us to rank repair sequences relative to one another and discard the less good.
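
The ranking itself is conceptually tiny. As a sketch of the idea (not lrpar’s actual code), one can think of it as: score each repair sequence by how far ahead it lets parsing continue, then keep only the joint winners. ‘parse_ahead’ here is a stand-in for “temporarily apply this repair sequence to the input and return how many tokens then parse without error, capped at 250”:

// A sketch of repair sequence ranking, not lrpar's actual code.
fn rank_repairs<R>(seqs: Vec<R>, parse_ahead: impl Fn(&R) -> usize) -> Vec<R> {
    let scores: Vec<usize> = seqs.iter().map(|r| parse_ahead(r)).collect();
    let best = scores.iter().copied().max().unwrap_or(0);
    // Keep the (non-strict) subset of repair sequences that parsed furthest
    // ahead; the rest are discarded.
    seqs.into_iter()
        .zip(scores)
        .filter(|(_, score)| *score == best)
        .map(|(seq, _)| seq)
        .collect()
}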

I’ve also realised over time that the ranking process (which requires about 20 lines of code) really kills two birds with one stone. First, and most obviously, it reduces spurious syntax errors. Second — and it took me a while to appreciate this — it reduces the quantity of low-quality repair sequences we present to users, making it more likely that users will actually check the repair sequences that are presented.

Lex errors

Like most parsing systems, grmtools separates out lexing (splitting input up into tokens) from parsing (determining if/how a sequence of tokens conforms to a grammar). This article isn’t the place to get into the pros and cons of this, but one factor is relevant: as soon as the lexer encounters an error in input, it will terminate. That means that parsing won’t start and, since error recovery as we’ve defined it thus far is part of the parser, the user won’t see error recovery. That leads to frustrating situations such as this:

[Click to start/stop animation]
Although it didn’t occur to me for a very long time, it’s trivial to convert lexing errors into parsing errors, and then have error recovery happen as normal. Even more surprisingly, this doesn’t require any support from lrpar or CPCT+. The user merely needs to catch input that otherwise wouldn’t lex by adding a rule such as the following at the end of their Lex file:
. "ERROR"
This matches a single character (the ‘.’) [8] as a new token type called ‘ERROR’. However, grmtools moans if tokens are defined in the Lexer but not used in the Yacc grammar so let’s shut it up by adding a dummy rule to the grammar:
ErrorRule: "ERROR" ;
Now when we run nimbleparse, CPCT+ kicks in on our “lexing error”:
Parsing error at line 2 column 11. Repair sequences found:
  1: Delete #, Delete y
  2: Insert ,, Delete #
  3: Insert =, Delete #
I find this satisfying for two reasons: first, because users get a useful feature for precisely no additional effort on lrpar’s part; and second because lexing and parsing errors are now presented uniformly to the user, where before they were confusingly separate. It would probably be a good idea to make this a core feature, so that we could do things like merge consecutive error tokens, but that wouldn’t change the underlying technique.
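
As an illustration of how little is involved, here is what a complete (if tiny) Lex file for a calculator-like language might look like with the catch-all rule added at the end. I am writing this from memory of the grmtools documentation, so treat the exact format (in particular the ‘;’ used to discard whitespace) as illustrative rather than definitive:

%%
[0-9]+ "INT"
\+ "+"
\* "*"
\( "("
\) ")"
[ \t\n]+ ;
. "ERROR"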

Integrating error recovery into actions

As far as I have been able to tell, no “advanced” error recovery algorithm has ever made its way into a long-lasting parsing system: I couldn't find a single implementation which I could run. Indeed, a surprising number of error recovery papers don’t even mention a corresponding implementation, though there must surely have been at least a research prototype at some point.

Whatever software did, or didn’t, exist, none of the papers I’ve read make any mention of how error recovery affects the use of parsers. Think about your favourite compiler: when it encounters a syntax error and recovers from it, it doesn’t just continue parsing, but also runs things like the type checker (though it probably refuses to generate code). Of course, the reason your favourite compiler is doing this is probably because it has a hand-written parser. How should we go about dealing with this in a Yacc-ish setting?

grmtools’ solution is surprisingly simple: action code (i.e. the code that is executed when a production is successfully parsed) isn’t given access to tokens directly, but instead to an enum which allows one to distinguish tokens inserted by error recovery from tokens in the user’s input. The reason for this is probably best seen via an example, in this case a very simple calculator grammar which calculates numbers as it parses:


[Click to start/stop animation]
In this case, what I chose to do when writing the calculator evaluator is to continue evaluating expressions with syntax errors in, unless an integer was inserted. The reason for that is that I simply don’t have a clue what value I should insert if CPCT+ generated an `Insert INT’ repair: is 0 reasonable? what about -1? or 1? As this suggests, inserting tokens can be quite problematic: while one might quibble about whether evaluation should continue when CPCT+ deleted the second ‘+’ in ‘2 + + 3’, at least that case doesn’t require the evaluator to pluck an integer value out of thin air.

This is an example of what I’ve ended up thinking of as the semantic consequences of error recovery: changing the syntax of the user’s input often changes its semantics, and there is no way for grmtools to know which changes have acceptable semantic consequences and which don’t. For example, inserting a missing ‘:’ in Python probably has no semantic consequences, but inserting the integer 999 into a calculator expression will have a significant semantic consequence.

The good news is that lrpar gives users the flexibility to deal with the semantic consequences of token insertion. For example here's the grmtools-compatible grammar for the calculator language:

Expr -> Result<u64, Box<dyn Error>>:
    Expr '+' Term {
      Ok($1?.checked_add($3?)
            .ok_or_else(||
              Box::<dyn Error>::from("Overflow detected."))?)
    }
  | Term { $1 } ;

Term -> Result<u64, Box<dyn Error>>:
    Term '*' Factor {
      Ok($1?.checked_mul($3?)
            .ok_or_else(||
               Box::<dyn Error>::from("Overflow detected."))?)
    }
  | Factor { $1 } ;

Factor -> Result<u64, Box<dyn Error>>:
    '(' Expr ')' { $2 }
  | 'INT' {
      parse_int($lexer.span_str($1.map_err(|_|
        "<evaluation aborted>")?.span()))
    } ;
If you’re not used to Rust, that might look a little scary, so let’s start with some of the non-error recovery parts. First, the calculator grammar evaluates mathematical expressions as parsing occurs, and it deals exclusively in unsigned 64-bit integers. Second, unlike traditional Yacc, lrpar requires each rule to specify its return type. In this case, each rule has a return type Result<u64, Box<dyn Error>> which says “if successful this returns a u64; if unsuccessful it returns a pointer to a description of the error on the heap”. Put another way ‘dyn Error’ is Rust’s rough equivalent to “this thing will throw an exception”.

As with traditional Yacc, the ‘$n’ expressions in action code reference a production’s symbols, where n starts from 1. Symbols reference either rules or tokens. As with most parsing systems, symbols that reference rules have the static type of that rule (in this example Result<u64, Box<dyn Error>>).

Where grmtools diverges from existing systems is that tokens always have the static type Result<Lexeme, Lexeme>. If you’re used to Rust that might look a bit surprising, as Result types nearly always contain two distinct subtypes, but in this case we’re saying that in the “good” case you get a Lexeme and in the “bad” case you also get a Lexeme. The reason for this is that the “good” case (if you’re familiar with Rust terminology, the ‘Ok’ case) represents a token from the user’s actual input and the “bad” case (‘Err’) represents an inserted token. Because Result is a common Rust type, one can then use all of the standard idioms that Rust programmers are familiar with.

Let’s first look at a simplified version of the first rule in the calculator grammar:

Expr -> Result<u64, Box<dyn Error>>:
    Expr '+' Term { Ok($1? + $3?) }
  | Term { $1 } ;
The Expr rule has two productions. The second of those (‘Term’) simply passes through the result of evaluating another rule unchanged. The first production tries adding the two integers produced by other rules together, but if either of those rules produced a dyn Error then the ‘?’ operator percolates that error upwards (roughly speaking: throws the exception up the call stack).

Now let’s look at a simplified (to the point of being slightly incorrect) version of the third rule in the grammar:

Factor -> Result<u64, Box<dyn Error>>:
    '(' Expr ')' { $2 }
  | 'INT' {
      parse_int($1.map_err(|_| "<evaluation aborted>")?)
    } ;
The second production references the INT token type. The action code then contains $1.map_err(|_| "<evaluation aborted>")? which, in English, says “if the token $1 was inserted by error recovery, throw a dyn Error”. In other words, the calculator grammar stops evaluating expressions when it encounters an inserted integer. However, if the token was from the user’s input, it is converted to a u64 (with parse_int) and evaluation continues.

If you look back at the original grammar, you can see that this grammar has made the decision that only inserted integers have unacceptable semantic consequences: inserting a ‘*’ for example allows evaluation to continue.

After parsing has completed, a list of parsing errors (and their repairs) is returned to users, so that they can decide how much further they want to continue computation. There’s thus no danger of lrpar repairing input and the consumer of the parse not being able to tell that error recovery occurred. However, you might wonder why lrpar only allows fine-grained control of insert repairs. Surely it could also allow users to make fine-grained decisions in the face of delete repairs? Yes, it could, but I don’t think that would be a very common desire on the part of users, nor can I think how one would provide a nice interface for them to deal with such cases. What lrpar has is thus a pragmatic compromise. It’s also worth noting that although the above may seem very Rust specific, I’m confident that other languages can find a different, natural encoding of the same idea.
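
For what it’s worth, here is roughly what consuming those errors looks like, written from my memory of the grmtools quickstart guide: the module names (calc_l, calc_y), the pp/token_epp pretty-printing helpers, and the need for a build.rs that generates the lexer and parser are all assumptions that may not match current versions exactly.

// A hedged sketch from memory of the grmtools quickstart; module and
// function names are assumptions and may differ in current versions.
// A build.rs using lrlex/lrpar's build-time APIs is also needed (not shown).
use lrlex::lrlex_mod;
use lrpar::lrpar_mod;

lrlex_mod!("calc.l"); // lexer generated at build time
lrpar_mod!("calc.y"); // parser generated at build time

fn main() {
    let lexerdef = calc_l::lexerdef();
    let lexer = lexerdef.lexer("2 + + 3");
    // parse() hands back both the value computed by the action code and the
    // list of parsing errors (with their repair sequences), so the caller
    // always knows whether error recovery occurred.
    let (res, errs) = calc_y::parse(&lexer);
    for e in errs {
        println!("{}", e.pp(&lexer, &calc_y::token_epp));
    }
    match res {
        Some(Ok(r)) => println!("Result: {}", r),
        _ => println!("Unable to evaluate expression."),
    }
}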

Is it fast enough and good enough?

At this point you might be convinced that CPCT+ is a good idea in theory, but are unsure if it’s usable in practice. To me, there are two important questions: is CPCT+ fast enough to be usable? and is CPCT+ good enough to be usable? Answering such questions isn't easy: until (and, mostly, after...) Carl Cerecke’s thesis, I’m not aware of any error recovery approach that had a vaguely convincing evaluation.

The first problem is that we need syntactically incorrect code to use for an evaluation but nearly all source code you can find in the wild is, unsurprisingly, syntactically correct. While there has been some work on artificially creating syntax errors in files, my experience is that programmers produce such a mind-boggling variety of syntax errors that it’s hard to imagine a tool accurately simulating them.

Unlike the authors of most previous approaches, we were fortunate that these days the BlueJ Java editor has an opt-in data-collection framework called Blackbox, which records programmers (mostly beginners) as they’re editing programs. Crucially, this includes them attempting to compile syntactically incorrect programs. We thus extracted a corpus of 200,000 syntactically incorrect files which programmers thought were worth compiling (i.e. we didn’t take files at the point that the user was still typing in code). Without access to Blackbox, I don’t know what we’d have done: I’m not aware of any other language for which such a rich dataset exists.

There are a number of ways of looking at the “fast enough” question. On our corpus, the mean time CPCT+ spends on error recovery per file is 0.014s. To put that into context, that’s 3x faster than Corchuelo et al., despite the fact that CPCT+ calculates the complete set of minimum cost repair sequences, while Corchuelo et al. finishes as soon as it finds one minimum(ish) [9] cost repair sequence! I also think that the worst case is important. For various reasons, algorithms like CPCT+ really need a timeout to stop them running forever, which we set to a fairly aggressive 0.5s maximum per file, as it seems reasonable to assume that even the most demanding user will tolerate error recovery taking 0.5s. CPCT+ fully repaired 98.4% of files within the timeout on our corpus; in comparison, Corchuelo et al. repaired 94.5% of files. In summary, in most cases CPCT+ runs fast enough that you’re probably not going to notice it.

A much harder question to answer is whether CPCT+ is good enough. In some sense, the only metric that matters is whether real programmers find the errors and repairs reported useful. Unfortunately, that’s an impractical criterion to evaluate in any sensible period of time. Error recovery papers which try to do so typically have fewer than 20 files in their corpus, which leads to unconvincing evaluations. Realistically, one has to find an alternative, easier to measure, metric which serves as a proxy for what we really care about.

In our case, we use the total number of syntax errors found as a metric: the fewer the better. Although we know our corpus has at least 200,000 errors (at least 1 per file), we don’t know for sure how many more than that there are. There’s therefore no way of absolutely measuring an error recovery algorithm using this metric: all we can do is make relative comparisons. To give you a baseline for comparison, panic mode reports 981,628 syntax errors while CPCT+ reports 435,812. One way of putting this into context is that if you use panic mode then, on average, you end up with an additional spurious syntax error for each syntax error that CPCT+ reports i.e. panic mode is much worse on this metric than CPCT+. Comparing CPCT+ with Corchuelo et al. is harder, because although Corchuelo et al. finds 15% fewer syntax errors in the corpus than does CPCT+, it also fails on more files than CPCT+. This is almost certainly explained by the fact that Corchuelo et al. is unable to finish parsing the hardest files which sometimes contain an astonishingly large number of syntax errors (e.g. because of repeated copy and paste errors).

Ultimately a truly satisfying answer to the “is CPCT+ good enough?” question is impossible — we can’t even make a meaningful comparison between CPCT+ and Corchuelo et al. with our metric. What we can say, however, pretty conclusively is that CPCT+ is much better than panic mode, the only other equivalent algorithm that’s ever seen somewhat widespread use.

Limitations and future work

Few things in life are perfect, and CPCT+ definitely isn’t. There’s also clear scope to do better, and if I had a spare year or two to devote to the problem, there are various things I’d look at to make error recovery even better.

CPCT+ only tries repairing input at the point of a syntax error: however, that is often later than the point that a human would consider that they made an error. It’s unrealistic to expect CPCT+, or some variant of it, to deal with situations where the “cause” and “result” of an error are spread far apart. However, my experience is that the cause of an error is frequently just 1 or 2 tokens before the point identified as the actual error. It would be interesting to experiment with “rewinding” CPCT+ 1 or 2 tokens in the input and seeing if that’s a good trade-off. This isn’t trivial in the general case (mostly due to the parsing stack), but might be quite doable in many practical cases.

As we said earlier, CPCT+’s search is inherently local and, even with repair ranking, it can suggest repair sequences which cause spurious errors. There are two promising, complementary, possibilities that I think might lessen this problem. The first is to make use of a little known, and beautiful, approach to dealing with syntax errors: non-correcting error recovery. This works by discarding all of the input before the point of a syntax error and using a modified parser to parse the remainder: it doesn’t tell the user how to repair the input, but it does report the location of syntax errors. Simplifying a bit, algorithms such as CPCT+ over-approximate true syntax errors (i.e. they report (nearly) all “true” syntax errors alongside some “false” syntax errors) whereas non-correcting error recovery under-approximates (i.e. it misses some “true” syntax errors but never reports “false” syntax errors). I think it would be possible to use non-correcting error recovery as a superior alternative to CPCT+’s current ranking system and, perhaps, even to guide the search from the outset. Unfortunately, I don’t think that non-correcting error recovery has currently been “ported” to LR parsing, but I don’t think that task is insoluble. The second possibility is to make use of machine learning (see e.g. this paper). Before you get too excited, I doubt that machine learning is a silver bullet for error recovery, because the search space is too large and the variety of syntax errors that humans make is quite astonishing. However, my gut feeling is that machine learning approaches will be good at recovering non-local errors in a way that algorithms like CPCT+ are not.

Less obviously, some Yacc grammars lend themselves to good repairs more than others. Without naming any names, some grammars are surprisingly permissive, letting through “incorrect” syntax which a later part of the compiler (or, sometimes, the parser’s semantic actions) then rejects [10]. The problem with this is that the search space becomes very large, causing CPCT+ either to produce large numbers of surprising repair sequences or, in the worst cases, not be able to finish its search in the allotted time. One solution to this is to rewrite such parts of the grammar to more accurately specify the acceptable syntax, though this is much easier for me to suggest than for someone to actually carry out. Another solution might be to provide additional hints to CPCT+ (along the lines of lrpar’s %avoid_insert directive) that enable it to narrow down its search.

Programmers unintentionally leave hints in their input (e.g. indentation), and languages have major structural components (e.g. block markers such as curly brackets), that error recovery can take into account (see e.g. this approach). To take advantage of this, the grammar author would need to provide hints such as “what are the start / end markers of a block”. Such hints would be optional, but my guess is that most grammar authors would find the resulting improvements sufficiently worthwhile that they’d be willing to invest the necessary time to understand how to use them.

Finally, some parts of grammars necessarily allow huge numbers of alternatives and error recovery at those points is hard work. The most obvious examples of this are binary and logical expressions, where many different operators (e.g. ‘+’, ‘||’ etc.) are possible. This can explode the search space, occasionally causing error recovery to fail or, more often, CPCT+ to generate an overwhelming number of repair sequences. My favourite example of this – and this is directly taken from a real example, albeit with modified variable names! – is the seemingly innocent, though clearly syntactically incorrect, Java expression x = f(""a""b);. CPCT+ generates a comical 23,607 repair sequences for it. What is the user supposed to do with all that information? I don't know! One thing I experimented with at some point was making the combinatorial aspect explicit so that instead of:

Insert x, Insert +, Insert y
Insert x, Insert +, Insert z
Insert x, Insert -, Insert y
Insert x, Insert -, Insert z
the user would be presented with:
Insert x, Insert {+, -}, Insert {y, z}
For various boring reasons, that feature was removed at some point, but writing this down makes me think that it should probably be reintroduced. It wouldn’t completely solve the “overwhelming number of repair sequences” problem, but it would reduce it, probably substantially.
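
If that feature were reintroduced, a simple version of the compaction might look something like the following sketch. This is not lrpar code: it assumes the repair sequences are distinct and all of the same length, and it only returns a summary when the summary is faithful to the input.

use std::collections::BTreeSet;

// A sketch of compacting repair sequences: collect the set of repairs seen
// at each position. The compact form is only a faithful summary when the
// original sequences are exactly the cartesian product of the per-position
// sets, so we check that before returning it; the check also assumes the
// input sequences are distinct.
fn compact(seqs: &[Vec<String>]) -> Option<Vec<BTreeSet<String>>> {
    let len = seqs.first()?.len();
    if seqs.iter().any(|s| s.len() != len) {
        return None; // only equal-length sequences can be summarised this way
    }
    let mut cols: Vec<BTreeSet<String>> = vec![BTreeSet::new(); len];
    for seq in seqs {
        for (i, repair) in seq.iter().enumerate() {
            cols[i].insert(repair.clone());
        }
    }
    // e.g. the four sequences above become
    // [{Insert x}, {Insert +, Insert -}, {Insert y, Insert z}].
    if cols.iter().map(|c| c.len()).product::<usize>() == seqs.len() {
        Some(cols)
    } else {
        None
    }
}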

Summary

Parsing is the sort of topic that brings conversations at parties to a standstill. However, since every programmer relies upon parsing, the better a job we can do of helping them fix errors quickly, the better we make their lives. CPCT+ will never beat the very best hand-written error recovery algorithms, but what it does do is bring pretty decent error recovery to any LR grammar. I hope and expect that better error recovery algorithms will come along in the future, but CPCT+ is here now, and if you use Rust, you might want to take a look at grmtools — I'd suggest starting with the quickstart guide in the grmtools book. Hopefully Yacc parsers for other languages might port CPCT+, or something like it, to their implementations, because there isn’t anything very Rust specific about CPCT+, and it’s a fairly easy algorithm to implement (under 500 lines of code in its Rust incantation).

Finally, one of the arguments that some people, quite reasonably, use against LR parsing is that it has poor quality error recovery. That’s a shame because, in my opinion LR parsing is a beautiful approach to parsing. I hope that CPCT+’s error recovery helps to lessen this particular obstacle to the use of LR parsing.

Acknowledgements: My thanks to Edd Barrett and Lukas Diekmann for comments.


Footnotes

[1] Often said to be the result of “fat fingers”. I have skinny fingers but, alas, this does not seem to have translated to high typing accuracy.
[2] Typically using the FOLLOW set.
[3] When panic mode was introduced, computers were so slow that I expect even bad error recovery was probably semi-useful: if a compiler takes several minutes to run, each genuine error reported to the programmer is valuable, and it’s worth risking them seeing a fairly high proportion of false errors. In contrast, when compilers (mostly) take at most a few tens of seconds, the acceptable proportion of false errors is much lower.
[4] There are languages for which one can fairly easily write a Yacc grammar but not a Lex lexer (e.g. a Python-ish language with indentation-based syntax). Systems like grmtools allow you to use your own (possibly hand-written) lexer for precisely this reason. Fortunately, writing a good quality lexer by hand is generally a fairly easy task.
[5] As a parsing research dilettante, I've long had a suspicion that proper parsing researchers have to sign up to an elaborate, long running joke, whereby they agree to leave out, or otherwise obscure, important details. Alas, most error recovery papers are also part of this conspiracy, so I lost months to misunderstandings of various papers. The most frustrating part is that I wonder if I’ve unintentionally signed up to the conspiracy too: I have no idea whether other people can make sense of the parsing papers I’ve been part of or not...
[6] Note that the order that the three repair sequences are presented in is nondeterministic, so if you run it enough times you’ll see the repair sequences printed in all possible orderings.
[7] If you’re wondering why I used the ‘-q’ option with nimbleparse, it’s because nimbleparse prints debugging information for ambiguous grammars. Lua’s grammar has an inherent ambiguity (with the Lua manual containing the slightly scary statement that “The current parser always [resolves] such constructions in the first way”), which causes nimbleparse to print out its full stategraph and the ambiguities. Even for a small grammar such as Lua’s, the stategraph is not small. As this shows, Yacc input grammars can be ambiguous, though they will be statically converted into a fully unambiguous LR parser. Whether this is a good or bad feature is something that I’ll leave to the philosophers amongst you to debate.
[8] Although ‘.’ is a safe default, there’s no reason why the error token has to be defined that way. However, one has to be careful if using this technique with Lex because its “longest match” rule means that the text matched by the error token must never be longer than that matched by any other token, otherwise large chunks of input end up being incorrectly classified as an error token. In case you’re wondering, yes, I spent half an hour when writing this blog post wondering why my attempts to be clever weren’t working before remembering the longest match rule.
[9] The original Corchuelo et al. has a bug that means that it misses some minimum cost repair sequences.
[10] My experience so far suggests that this is more often the case in less well used parts of a grammar.

Stick or Twist?

October 7 2020

Download:  Opus (2.1MiB)   mp3 (8.1MiB)
All of us have points in our lives where we have to decide whether we should continue down our current path or change to another. As a researcher, I often face a constrained version of this problem: should I continue on my current research path or change to another? For a long time I wasn’t aware that I was being faced with such decisions; and, when I did become aware, I wasn’t sure how best to make a decision. Over time I’ve realised that a simple “stick or twist?” heuristic mostly works well for me. I don’t claim that anything in this article is novel, nor do I think I’m describing an approach that’s applicable to every situation — but it might provide some useful food for thought.

Let’s start by dismissing a common heuristic: wait and see. When faced with a hard decision, most of us have a tendency to hope that the underlying problem magically disappears: sometimes we consciously delay making a choice, but often we try and pretend the choice doesn’t exist at all. Although it sounds like a cliché, I’m a firm believer that, in general, not making a decision is equivalent to making a decision. If I dither over whether to buy a shop’s last cake or not, someone else will buy it: I’m then, involuntarily, in exactly the same situation as if I’d decided not to buy the cake. If used too often, “wait and see” turns us into corks on the ocean, entirely at the mercy of events. Except for a small number of exceptionally poor decisions, I’ve found that I personally regret decisions I didn’t take more than decisions I did take.

Next, let’s dismiss the two extreme heuristics for making decisions: never change (which is similar, though with a very different motivation, to “wait and see”) or always change [1]. Because we live in a continually changing world, it is inevitable that even a once-optimal path will become suboptimal over time: not changing guarantees irrelevance in the long run. At the other extreme, change for its own sake means that even when we stumble onto an optimal path, we’ll change off it for no good reason.

Although I didn't realise it at the time, my first professional encounter with the need to make a decision about my research direction happened during my PhD. I was part of a research group doing what is now called “software modelling” but which back then was mostly just called UML. As part of that, I started attending international UML standards meetings. I soon realised that most people at these meetings shared a common vision, which is that UML should take over from programming languages. In other words, rather than programming using text, they wished to use UML’s block-and-line diagrams instead. UML had originally been intended to represent high-level program structure (the software equivalent of architectural blueprints), not low-level detail. I wasn’t sure that this was desirable, nor was I sure that it was possible. However, since I was often the youngest person in the room by 15 or more years, I assumed that I must be missing out on a profound insight.

Those doubts grew stronger when, in around 2003, I found myself in a lift with a senior member of the UML standards committee. Somehow his conversation turned to the evil programming languages people, who were somehow conspiring to prevent software from being written in UML. I have never had much truck with conspiracy theories, so I asked him outright how UML could be used for operating system kernels, which have hundreds of thousands of lines of code. He replied, in a tone which suggested he did not expect disagreement, “in 5 years, Linux will be written in UML”. Naive as I was, it was obvious to me that this was a ludicrous timeline and it made me wonder what other unpleasant details the community might have been ignoring.

The confirmation point came a few months later at a more academic meeting. A senior Professor got up to demonstrate the tool his research group had slaved over for years, which allowed one to translate textual programs to UML and back again. He brought up a simple program consisting of 5 lines of text, pressed a button, and transformed it into equivalent UML. The diagrammatic representation was so verbose that it required scrolling over two screens. I stayed silent but someone else politely, and I’m fairly sure innocently, asked “isn’t the diagrammatic version a bit harder to understand?” The Professor looked absolutely stunned by the question: ”but... it’s diagrammatic!” was all he could muster. A couple of other people agreed that it looked a bit harder to understand. The Professor was crushed: rarely before or since have I seen a man shift from a confident to a haunted look so quickly.

By this point I had a clear explanation in my head for why UML could not replace programming languages: diagrams are excellent at displaying a small number of elements but they do not scale past what you can squeeze onto a single screen [2]. Even beginners write programs which are too big by this metric!

I then made what I now realise is a foolish mistake: because I could articulate what I thought was a compelling reason why UML would not scale, I assumed everyone else would come to the same conclusion and the field would collapse. 15 years after I thought this, the (renamed) UML research field is, I believe, about the same size as when I left it.

Why was I so wrong? I suspect that most people in the UML community would probably agree with me that UML has problems scaling. However, I suspect that they think the problem will gradually disappear over time whereas I cannot see a realistic way in which it will be solved. However, neither side’s belief can be validated at the moment and it is unlikely that my belief can ever be completely validated. In retrospect, I should have been more humble about my judgment on UML. It would have been enough for me to say that, in my opinion, the probability of solving UML's scaling difficulties was lower than the probability of it being unsolvable.

However, whatever my motivation was, or should have been, I did change direction, into the field of programming languages. I created Converge, whose main aim was to allow people to extend the language by embedding DSLs (Domain Specific Languages) within it. From a technical perspective this worked well enough (see e.g. an assembler DSL or a statemachine DSL) but it required users to differentiate the DSLs with special brackets ($<<...>>). To my surprise, those special brackets turned out to undermine the whole idea: they are so visually intrusive that if a screen of code contains more than a couple of them, it becomes much harder to read than normal. Unfortunately, the brackets are necessary, because without them it is impossible to parse files [3]. Slowly — and painfully because I had spent several years of my life on this — I was forced to conclude that this approach would never be viable. After several years of feeling stuck (and trying, and failing, to move onto other research problems), I realised that it might be possible to solve the problem of language embedding in an entirely different fashion (which ended up in our language composition work).

However, before I had got that far, I made another mistake. Underlying DSLs in Converge is a macro system (or, more formally, “compile-time meta-programming”) derived from Template Haskell. Since DSLs in Converge had failed, I generalised further, concluding that macros aren’t very useful in modern programming languages [4]. When I later gave a talk titled something like “macros aren’t useful”, other macro researchers were, with one exception [5], bemused by my explanation, thinking it comically simplistic. In retrospect, I now realise that we were tackling subtly different problems: I was trying to embed multiple DSLs within a single normal file, while they were treating whole files as being of one DSL or another. Macro-ish techniques work better for the latter than the former because there is no need for any syntax to distinguish one language from another. In other words, my generalisation was subtly flawed.
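To make the Template Haskell comparison concrete, here is a minimal sketch of the sort of compile-time meta-programming that footnote [4] alludes to: generating a family of functions that no single, ordinarily-typed function can express. This is plain Template Haskell rather than Converge, and the module and function names (Sel, sel) are my own invention for illustration; I am assuming GHC with the TemplateHaskell extension enabled.

    {-# LANGUAGE TemplateHaskell #-}
    -- Sel.hs: an illustrative sketch, not code from Converge or the talk.
    module Sel (sel) where

    import Language.Haskell.TH

    -- Build a selector for the i-th component of an n-tuple, so that
    -- $(sel 2 3) expands at compile-time to \(_, x, _) -> x. No single
    -- well-typed function covers every tuple size, which is exactly the
    -- kind of code a macro can generate on demand.
    sel :: Int -> Int -> Q Exp
    sel i n = do
      x <- newName "x"
      let pats = [ if j == i then varP x else wildP | j <- [1 .. n] ]
      lamE [tupP pats] (varE x)

Spliced from a separate module, print ($(sel 2 3) ("a", "b", "c")) prints "b": the macro generates, at compile-time, code that you would otherwise have to write out by hand for each tuple size, which is footnote [4]’s point in miniature.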

Those two examples represent some of the more consequential times that I’ve reconsidered my direction of travel, but there are others. After a while, I came to understand what my “stick or twist?” heuristic is: I now think of it as continually searching for the simplest plausible reason why my current direction is wrong. When I find a sufficiently simple reason, and once I’ve convinced myself the reason is plausibly true, I feel that the best course of action is to change direction. Why a “simple reason”? Because experience has taught me that until and unless I can distill a problem down to a very simple concrete explanation, I’ve not actually understood the problem. Why “plausible”? Because it’s rarely possible to be certain about such matters. Once the possibility of being wrong has become sufficiently high, I’d rather risk abandoning work that might have succeeded than get endlessly stuck on a problem I can’t solve [6].

As both examples I’ve given might suggest, other people can come to a different conclusion from mine. Reasonable people can often disagree about the probability of a risk manifesting, and different people often have very different thresholds for when they consider a risk worth taking. Even the same person can have different thresholds in different areas: I suspect my willingness to take a risk in research is higher than average, but in other areas of my life it’s definitely lower than average. So, just because I’ve decided to move on doesn’t mean that other people have made the wrong decision by staying. Indeed, I really hope that they prove me wrong! Wouldn’t it be great if the UML community could improve on textual programming? I might think that outcome unlikely, but if I’m wrong, the resulting dent to my ego will be more than worth it.

Before you worry that I'm indulging in false humility, let me assure you of the true meanness of my spirit by stating that I think not all stick or twist heuristics are equally good. In particular, there is a polar opposite heuristic to mine: continually searching for the most complex reason why my current direction is right. I doubt that anyone would admit to this heuristic out loud, but it is not hard to find people using it, for example within politics or, alas, in some academic disciplines. Why a “complex reason”? There seem to me to be two different causes. Some people seem to be scared that if they’re clear about their thoughts they’ll be exposed as charlatans. Others simply want to win, irrespective of whether their argument is correct or not: a lack of clarity is one weapon they use to assert victory [7].

Inevitably, writing my stick or twist heuristic down is much easier than applying it. First, my ego does its best to block any thoughts that I might be wrong. Second, even when I identify a possible problem with my current direction, and a plausible explanation for it, acknowledging it to the point of taking action requires surprising will-power. Third, hard decisions are generally hard because we lack concrete data to guide us. I have to rely on my own experience and reading to deduce a plausible direction — and, largely, trust to luck. It can feel frightening to make decisions knowing that.

Despite these issues, anyone who’s bored enough to look over my research papers will be able to identify at least two other major changes of direction similar to those noted above, each of which was the result of using the same heuristic. Less obvious are the more numerous minor changes I make using this heuristic: when programming, I can sometimes go through several such decisions in a single day.

Finally, you may have noticed that I’ve used “change direction” in two different senses. In the first instance, I abandoned a research field entirely and moved to another; in the second, I merely abandoned a particular approach to tackling a problem, later trying a different approach to that same problem. In both cases, I had to put aside several years of work, but the first course of action might be seen by some people as the more dramatic of the two. To me, they're indistinguishable in magnitude: the real difficulty was identifying the problem, simplifying it, and acknowledging it. Personally, I hope I have many changes of direction yet to come!

Acknowledgements: Thanks to Chun Nolan for comments.

Footnotes

[1] I’ve not met anyone who sits precisely at either extreme, but I’ve met people who come surprisingly close.
[2] You can see this clearly with state diagrams. Small examples are nearly always clearer than the textual equivalent; but once they get just a little too big, the textual equivalent is nearly always superior.
[3] Alas, this was neither the first nor the last time that parsing has caused me problems — nearly always because of my ignorance about what is and isn’t possible.
[4] I was wrong: in statically typed languages you often need macros to generate code that the type system would otherwise forbid. This was, of course, the original motivation for Template Haskell, but I was too stupid to appreciate it at the time.
[5] A senior member of the community stood up and said “he’s right but for the wrong reasons”. Sometimes one has to take praise in whatever form it comes!
[6] When I was an undergraduate and a PhD student, I was often surprised by older academics whose research seemed stuck in an ancient rut. In fact, “surprised” is a polite way of putting it: I laughed at such people. I’m ashamed to say that one was a gentle, thoughtful lecturer called Malcolm Bird (who died a few years ago). He seemed to us undergraduates to be hopelessly out of date. A couple of years later, after I’d started a PhD, a newly arrived Argentinian PhD student was astonished that our Department had the Malcolm Bird in it. Malcolm, it turns out, had once solved a hard and important problem in computer science. I later found out that as an academic he had continually taken on more than his fair share of work. In other words, it’s plausible that, in only slightly different circumstances, Malcolm would have had a stellar research career and have been known as one of our subject’s great minds. The realisation that circumstances outside someone’s control can prevent that person realising their “full potential” was a sobering one. It’s too late to apologise to Malcolm in person, so writing about this shameful episode publicly is about as close as I can get.
[7] What I’m never sure of with such people is whether their external lack of clarity reflects an interior lack of clarity or not.