Why Aren't Programming Language Specifications Comprehensive?

In Compiled and Interpreted Languages: Two Ways of Saying Tomato I used this little example to show that we can observe differences in how CPython and PyPy execute code:

$ cat diffs.py
print(str(0) is str(0))
$ python3 diffs.py
False
$ pypy diffs.py
True

Although I explained in that post that programming language specifications and implementations are separate things, I didn’t explain why it’s ever considered acceptable for different implementations of a language to produce different output for a given program. In this post I’m going to attempt to correct that oversight, and to do so in a way that I hope is relatively approachable. As that suggests, I will be as informal as possible, since I’ve found that more formal treatments of this subject can obscure where it has a day-to-day impact on programming.

Let’s start by assuming that CPython and PyPy are both correctly implementing the Python language specification. Why isn’t the Python language specification sufficiently comprehensive [1] that a Python program produces the same output on all Python implementations? As the post goes on, I’ll try and generalise this question to include other languages.

Deliberate flexibility

Let’s look at the part of the Python language specification that’s relevant to the example above, which compares strings by identity [2]:

The operators is and is not test for an object’s identity: x is y is true if and only if x and y are the same object… [4]

How on earth can CPython and PyPy interpret this simple, clear definition differently? Although it’s easily missed, at the end of that paragraph in the Python specification is a footnote “[4]”, which says:

Due to automatic garbage-collection, free lists, and the dynamic nature of descriptors, you may notice seemingly unusual behaviour in certain uses of the is operator, like those involving comparisons between instance methods, or constants. Check their documentation for more info.

From the perspective of this post, it’s an unfortunately written footnote — less a specification than a vaguely worded excuse. Let’s try and write a quick specification for “is” in something approaching specificationese:

Identity checks with “is” allow a user to test if two references point to exactly the same object. Only instances of user-level classes are guaranteed to consistently test true/false with an identity check. The result of identity tests on any other kind of object is implementation defined: an implementation may choose, for example, to cache integers or strings, perform identity checks on integers by value, and so on.

The key phrase is “implementation defined”: this specification makes clear what implementations are required to do (make sure is works with instances of user classes) and where they have wriggle room to do different things (instances of anything else). By implication, users must not rely on the results of identity tests for anything other than user-level classes.
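
To make that concrete, here is a minimal sketch of where the guarantee does and does not hold (the first two results are required by our quick specification; the third is the implementation-defined case this post started with):

class C: pass

a = C()
b = a
c = C()
print(a is b)            # True on every implementation: same user-class instance
print(a is c)            # False on every implementation: distinct instances
print(str(0) is str(0))  # implementation defined: False on CPython, True on PyPy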

Why might a specification choose to allow implementations this sort of wriggle room? After all, it clearly makes life harder for users, who can no longer know exactly how a program will execute if it uses is on instances of non-user classes. The main reason, as hinted at in the original Python footnote, is that different, desirable, approaches to implementing a language are sometimes fundamentally unable to achieve the same outcome.

Perhaps the easiest way to see the interaction between language specifications and implementation techniques is object destructors (e.g. __del__ methods in Python or drop in Rust), which are called when an implementation determines that an object is no longer being used. In a reference counted implementation, like CPython, destructors are called immediately after the last reference to an object is lost. In contrast, implementations using “full” garbage collection, like PyPy, only check whether an object is still referenced periodically: an object’s destructor might not be run until a long time after the last reference to that object is lost.

We can see this happening with the following, slightly contrived [3], program:

class C:
  # Destructor: runs when the implementation decides the object is dead.
  def __del__(self): print("__del__")
x = C()
x = None  # drop the only reference to the C instance
print("p")
# Churn memory so that PyPy runs at least one collection cycle.
for _ in range(1000): y = [0] * 10000

which runs the destructor at different points in execution on CPython and PyPy:

$ python3 /tmp/t.py
__del__
p
$ pypy /tmp/t.py
p
__del__

There is simply no practical way to make a garbage collector like PyPy’s always call destructors at the same point in a program as CPython’s reference counting: if you want the performance advantages of “full” garbage collection, you have to forgo predictable destruction. Since both reference counting and “full” garbage collection have pros and cons, many language specifications, like Python’s, deliberately leave implementations the necessary wriggle room to choose either approach (e.g. not by requiring destructors to be run at a predictable point).
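
Incidentally, if a program really needs a destructor to run at a predictable point on a “full” garbage collector, the usual workaround is to ask for a collection explicitly. A minimal sketch, assuming gc.collect() triggers a full collection on the implementation at hand (the language specification does not require destructors to have run by the time it returns):

import gc

class C:
    def __del__(self): print("__del__")

x = C()
x = None      # drop the only reference
gc.collect()  # request a collection: on PyPy, __del__ then runs promptly
print("p")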

However, it’s important to realise that not all language specifications allow wriggle room in the same places. For example, the Rust specification requires that implementations call drop methods at the same point (and, if there are multiple objects going out of scope, to do so in a specific order!). Users now rely on that behaviour for correctness. In other words, if the Rust specification was to try to retrospectively allow transparent “full” garbage collection, with unpredictable calling of drop methods, much Rust code would break.

As this might suggest, changing a language specification to allow new wriggle room breaks backwards compatibility. The burden that this places on users ranges from the easily detected (e.g. syntax changes are spotted immediately by the compiler and thus easily, if tediously, fixed) to the dangerously subtle (e.g. if the rules on when destructors are called are changed). In practice, this means that the older a language is, and the more users it has, the harder it is to add wriggle room, particularly of the non-easily-detected variety: as time goes on, it becomes progressively more difficult to “fix” or “adapt” many aspects of a language specification.

Semi-inevitable flexibility

Language specifications can be thought of as defining what it means for multiple implementations to run programs “equivalently”. Program equivalence is a difficult concept to nail down: although I’ve implicitly defined equivalence thus far to mean “prints the same things on different implementations”, there are many other notions too.

From our perspective, what isn’t specified is often at least as important as what is. Notably, there are aspects of a language’s execution where a specification could require something to happen, but variation seems so inevitable that specifications do not attempt to define it. The most obvious instance of this – so obvious that most people don’t even consider it – is timing. Consider this little Python program:

import time
before = time.time()
i = 0
for _ in range(1000000):
  i += 1  # a loop with no observable effect: PyPy's JIT can optimise it away
after = time.time()
if after - before > 0.01:  # CPython executes every iteration, so takes far longer
  print("Running on CPython")
else:
  print("Running on PyPy")

Let’s run it:

$ python3 /tmp/t.py
Running on CPython
$ pypy /tmp/t.py
Running on PyPy

Just to be sure, I ran each of those commands 100 times: each time they correctly printed out “Running on CPython” or “Running on PyPy” [4]. Notice that the program doesn’t directly probe the version of Python it’s running on: it just times how long a portion of the program takes and makes an inference from that (i.e. that CPython will take longer than 0.01s). The reason this works is that I know PyPy will quickly recognise that the for loop can be fully optimised away, whereas CPython won’t.

The Python language specification could, if it really wanted to, say “implementations are forbidden to optimise away for loops, even if those for loops don’t have any directly observable effects”. Of course, it doesn’t say that, because in most situations it would be very silly to force programs to run unnecessarily slowly [5]. Likewise, language specifications rarely specify how fast operations run in terms of seconds, because that would mean that upgrading our computer’s CPU would not make our programs run faster.

Although not quite as clear cut, a similar aspect is memory storage: many languages specify little or nothing about how objects and the like are stored in memory. Those languages that do specify some aspects of storage tend to do so in limited ways. Most notably, low-level languages like C often specify how the fields of individual objects are laid out in memory, but say nothing about how, for example, multiple objects are allocated on the stack.
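
Python itself specifies nothing here, but we can illustrate C-style layout guarantees from Python with ctypes. A small sketch (Point is a hypothetical struct of mine; the offsets shown are those of a typical 64-bit platform):

import ctypes

class Point(ctypes.Structure):
    # C-compatible layout: field order is specified, alignment is platform defined.
    _fields_ = [("x", ctypes.c_int32),
                ("y", ctypes.c_int64)]

print(Point.x.offset, Point.y.offset)  # typically "0 8": 4 bytes of padding after x
print(ctypes.sizeof(Point))            # typically 16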

In one way, we could consider these aspects to be similar to the first category of “deliberate flexibility”. However, my experience is that most language designers (or, at least, me!) don’t even think about aspects like timing at all, since it just seems obvious to us that different implementations will run programs at different speeds. For want of a better term, I think of such aspects as “semi-inevitable”: neither specifier nor implementer bothers wasting mental bandwidth worrying about such aspects.

Undesirable flexibility

Many readers will have heard the term “undefined behaviour”, something that causes many of us to quake in fear, or shake in anger. In essence, language specifications sometimes say “if you do X, then all bets are off as to what happens”. For example, if in C your program tries to access the 2nd element of an array of size 1, you have entered the land of undefined behaviour, and there are no guarantees about how your program will execute on an arbitrary implementation of C.
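
By way of contrast, Python fully defines the equivalent situation: every implementation must raise an exception for an out-of-bounds access, so a program like this behaves identically everywhere:

a = [0]
try:
    a[1]  # out-of-bounds: the specification guarantees an IndexError
except IndexError as e:
    print("caught:", e)  # on CPython: "caught: list index out of range"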

Before we explore undefined behaviour in more detail, let’s first tighten our definition of “implementation defined”. I’ll define it to mean those points in a specification where different implementations are explicitly allowed to take different execution strategies, but must tell you which they are using. In some cases they may be free to pick any possible execution strategy; in others they must choose from a subset dictated by the specification.

Undefined behaviour might at first sound like yet another variation of implementation defined. However, undefined behaviour is the language specification telling programmers where the boundaries of what it is willing to specify lie: if a programmer steps beyond those boundaries, the language implementation can do absolutely anything. Unfortunately, giving simple concrete examples of the effects of undefined behaviour is hard, as they are intimately tied to a combination of compiler version and hardware: an approachable description of behaviours at a point in time can be found in this paper, for example; a more recent investigation is also worth reading.

Most people tend to think that undefined behaviour sounds a bit bad, in the same way that a cold shower sounds a bit bad, but hardly fatal. However, undefined behaviour is insidious, because the problems it causes can be both very bad, and easily overlooked or misunderstood.

Why might a language specification choose to use the concept of undefined behaviour if it is so dangerous? Originally, undefined behaviour was chiefly a means of enabling portability across hardware that was much more diverse than today’s (e.g. 40 years ago hardware had not universally settled on a representation of signed integers); these days we’d probably expect most such cases to be implementation defined [6]. However, over time, undefined behaviour has also come to be seen as important to efficient program execution. For example, detecting integer overflow (e.g. if you add 1 << 62 to itself, the result no longer fits in a signed 64-bit integer) is an expensive operation on most machines [7]: many people feel they are unwilling to pay that price [8], given that most integer operations in most programs will never overflow. Because of this, C specifies that overflow of signed integers is undefined behaviour, so that users don’t have to pay the costs of overflow checks.

This might sound like a slightly theoretical issue — most integers in most programs really won’t overflow, and since we’re only ever likely to encounter two’s complement hardware, the effects of overflow are easily predicted. However, most of us will at some point want to make sure an integer in one of our programs doesn’t overflow, and then we encounter the difficulty of checking overflow. Most obviously, you mustn’t retrospectively check whether overflow has occurred: you must instead perform a torturous check in advance to see whether overflow will occur. If you write the check in a way that leads to undefined behaviour according to the C standard, most modern C compilers will simply remove your check without telling you! Even more surprisingly, if your program can execute in a way that causes undefined behaviour, the effects of undefined behaviour can be felt “backwards in time”: code executed before the “actual” undefined behaviour can be executed in unexpected ways.
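
Python’s integers cannot overflow, since they are arbitrary precision, but we can still sketch the check-in-advance pattern that safe C code must use for 64-bit signed arithmetic (checked_add is a hypothetical helper of mine):

INT64_MAX = (1 << 63) - 1
INT64_MIN = -(1 << 63)

def checked_add(a: int, b: int) -> int:
    # Test whether a + b would leave the i64 range *before* adding: in C, a
    # retrospective "if (a + b > INT64_MAX)" test is itself undefined behaviour,
    # and the compiler is entitled to delete it.
    if b > 0 and a > INT64_MAX - b:
        raise OverflowError("a + b would overflow")
    if b < 0 and a < INT64_MIN - b:
        raise OverflowError("a + b would underflow")
    return a + b

print(checked_add(1, 2))          # 3
print(checked_add(INT64_MAX, 1))  # raises OverflowError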

For many years, the presence of undefined behaviour in the specification didn’t really cause problems. However, over time, compiler authors – a very competitive bunch – started looking for new opportunities to make their compilers faster than others. They started looking at the C language specification with lawyerly eyes, and realised that undefined behaviour allowed them to make “good” programs run faster: by implication, those who write “bad” programs do not deserve to have their programs run “correctly”. Once it became acceptable to exploit undefined behaviour, we ended up on the slippery slope to where we are now.

We thus see a surprising aspect of language specifications: they often aren’t comprehensive because they weren’t comprehensive. Furthermore, older languages influence newer languages. There might be a parallel universe where the first C language specification decided to give a safe (i.e. not undefined behaviour!) definition to signed integer overflow: in that parallel universe, it’s probable that all subsequent language specifications would also use safe definitions for integer overflow [9].

A frustration I have with C’s extensive use of undefined behaviour is that it gives the concept a bad name when it really does have uses. For example, I believe that array accesses should, as standard, check that you are accessing an in-bounds element. However, there is performance critical code where this check is too costly. In such cases, I think there should be a second mechanism for accessing array elements that does not have a bounds check, instead documenting that I must promise not to access out-of-bounds elements. If I break that promise, I will deservedly end up in the land of undefined behaviour. As this suggests, the possibility of undefined behaviour should, in my opinion, be opt-in, and something that users can easily confine to small parts of a program.
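
In Python terms, the API shape I have in mind looks something like the following sketch (the names are mine; since Python’s own indexing remains checked underneath, this can only illustrate the contract, not real undefined behaviour):

class Array:
    def __init__(self, items):
        self._items = list(items)

    def get(self, i):
        # The default: always bounds-checked.
        if not 0 <= i < len(self._items):
            raise IndexError(i)
        return self._items[i]

    def get_unchecked(self, i):
        # Opt-in: the caller promises 0 <= i < len(self._items); breaking that
        # promise in a low-level language would be undefined behaviour.
        return self._items[i]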

Unknown flexibility

There is an interesting, and to me importantly distinct, way that language specifications are influenced by history: sometimes they don’t specify some aspects because no-one thought of specifying such aspects.

The most obvious instance of this is memory models. Simplifying significantly, and focussing on just one specific aspect, modern programming languages must specify how memory reads and writes interact across multiple threads if they want users to be able to write safe multi-threaded programs. However, before multicore processors started entering the mainstream (in the years immediately after 2000) there was little need to specify these aspects at all [10], and so no programming language that I know of did so.
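
A sketch of the sort of program that forces the question (the exact final value varies across implementations and versions, which is precisely the problem a memory model has to address):

import threading

counter = 0

def bump():
    global counter
    for _ in range(100_000):
        counter += 1  # read-modify-write: increments from the two threads can be lost

t1, t2 = threading.Thread(target=bump), threading.Thread(target=bump)
t1.start(); t2.start()
t1.join(); t2.join()
print(counter)  # 200000 if no increments were lost; often less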

Multicore processors meant that programming languages found themselves in a situation where they suddenly needed to carefully specify something they had never considered before. Java was the first language I know of that took this seriously; other languages like C and C++ followed a little later. Although I’ve slightly lost track of the current state of such matters, I believe that there are still edge cases that mainstream language specifications do not adequately cover.

From this we can see that language specifications are often reactive: sometimes, a previously ignored problem can no longer be ignored; sometimes, something that was not a problem becomes a problem.

Summary

Language specifications are not comprehensive because we don’t want them to be, can’t make them be, or because they once weren’t. At a sufficiently high level, these reasons are indistinguishable, but I find that separating them out really helps me understand what I’m dealing with at any given point.

An important question is: can we do better? Yes, we can. Look at any mainstream language’s specification over time, and you’ll see that there’s a clear trend in the direction of greater precision and coverage. We are fortunate that so many talented people have put so much effort into improving the situation, and must hope that they continue to do so!

However, you will also notice that all mainstream languages’ specifications are in English prose. It is tempting to imagine that if we had formal specifications (i.e. “mathematical” or “formal logic” specifications) all problems would be solved. However, even if it were practical – and for large languages it’s at least at the edge of our current abilities – we would still need such specifications to leave flexibility for implementations to do things differently from one another. In other words, the flexibility that specifications allow implementations is not an accident of the prose they are written in: it is inherent to their nature [11].

On a final note, you may have noticed that I have been careful not to mix my opinions of a programming language with my opinions on its specification. I mostly avoid using C these days, because I am too stupid to write reliable programs given the many ways I can cause undefined behaviour. However, the C specification is, mostly, rather good: it covers the vast majority of what should be covered, and I find much of it to be thoughtful and clearly written. In contrast, Rust, a language that I use much more, has a painfully incomplete specification: most obviously there are no meaningful semantics for unsafe code [12]. As this may suggest, an implicit reason why language specifications are not comprehensive is that they are hard, and take years of painstaking work to mature. I try and remind myself how hard they are whenever I am tempted to criticise language designers and specifiers!

Acknowledgements: My thanks to Martin Berger, Lukas Diekmann, and Stephen Kell for comments.


Footnotes

[1]

I’m deliberately avoiding using the term “complete” here, because, depending on the context, that has a formal meaning that can slightly distract us.

[2]

Note the problem here is that, as in most programming languages, Python allows us to differentiate whether two objects have the same contents or whether two objects are actually the same object (i.e. they have the same identity).

[3]

To the surprise of many people, most garbage collected languages do not guarantee to call an object’s destructor: instead, the typical guarantee is that a destructor is called at most once. The for loop in this program is designed to force PyPy to run at least one garbage collection cycle: if this loop is removed, you’ll see that “__del__” is never printed!

[4]

If you have a machine which is substantially slower / faster, or which has a mix of performance and efficiency cores, you might find the timing numbers are a little different.

[5]

Interestingly – and, to most of us, surprisingly – there are programs where we really, really don’t want this sort of optimisation to happen. Formally speaking, my little program has used a “timing side channel” to infer something about the program being run. Some security-conscious programs require that their algorithms run in constant time so that an attacker can’t use a timing side-channel to infer such properties about the program or its data. As this shows, it can be difficult, perhaps even impossible, to accommodate all possible use cases of a language in a specification.

[6]

Stephen Kell (who tells me he found out about this from Victor Yodaiken) pointed me at a fascinating example from Kernighan and Ritchie’s “The C Programming Language”. In the second edition that I have access to, Section 8.7 is an example memory allocator. Page 185 contains this text:

There is still one assumption, however, that pointers to different blocks returned by sbrk can be meaningfully compared. This is not guaranteed by the standard, which permits pointer comparisons only within an array. Thus this version of malloc is portable only among machines for which general pointer comparison is meaningful.

It is difficult to know for sure what result Kernighan and Ritchie expected from their code when it was compiled and run on a machine for which general pointer comparison was not meaningful. They may have expected the code not to compile, though given the relatively simple compilers of the day, this seems unlikely. Probably more likely, they expected the code to compile, and run in a way that could be determined in advance by understanding the hardware. In other words, they probably expected the program to run in a predictable way, even if that wasn’t the way they might have wanted.

In modern C, the code clearly makes use of undefined behaviour because of the comparison of arbitrary pointers with < in the free function (see Appendix J of the C11 specification). It is then impossible to predict exactly how the code will execute once it is compiled by a modern optimising compiler.

In case that makes it sound like I know lots about the C specification, I should thank Stephen for doing the detective work!

[7]

The cost of detecting overflow heavily depends on the hardware and compiler. I am not sure the situation has much improved since John Regehr’s plea 9 years ago.

[8]

Though virtually no-one I’ve met who believes they are unwilling to pay the price knows what the price is!

[9]

In my opinion, Rust gets this partly right by requiring that integers have a two’s complement representation. This is a reasonable choice for modern hardware, but was not a choice that C could sensibly have made when it was originally specified. By forcing a representation for integers, one can always know what the result of integer overflow will be.
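
A quick sketch of why a forced representation pins the result down (wrap_i64 is a hypothetical helper of mine, reducing a Python integer to the i64 range):

def wrap_i64(x: int) -> int:
    # Interpret x modulo 2**64 as a two's-complement signed 64-bit value.
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

print(wrap_i64((1 << 63) - 1 + 1))  # i64 MAX + 1 wraps to -9223372036854775808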

However, in release mode, Rust is allowed to, and does by default, remove all overflow checks, which I believe is too much of a security risk to be acceptable.

[10]

As compilers have become more aggressively optimising, a fully fleshed out memory model has also become important for single-threaded code. My impression, though, is that the emergence of multicore hardware was what led to the development of fully fleshed out memory models. Those memory models then made it easier for compilers to define which single-threaded code could be more aggressively optimised.

[11]

For example, consider the excellent stream of work from Peter Sewell and friends on formalising C, who call this flexibility “looseness” (see e.g. this paper).

[12]

When challenged on this, I ask people to define Rust’s aliasing rules. After pointing to various unofficial suggestions (most obviously stacked borrows), I point them to the official documentation which says:

The precise Rust aliasing rules are somewhat in flux, but the main points are not contentious

Since I have often had to use unsafe code in Rust, this has placed me in the unenviable position of looking through forums for hints from Rust and LLVM developers and examining the output of LLVM to come to an educated guess as to what the current semantics might be. I have hope that meaningful progress on this front is about to start.
