In “Compiled and Interpreted Languages: Two Ways of Saying Tomato” I used this little example to show that we can observe differences in how CPython and PyPy execute code:
$ cat diffs.py
print(str(0) is str(0))
$ python3 diffs.py
False
$ pypy diffs.py
True
Although I explained in that post that programming language specifications and implementations are separate things, I didn’t explain why it’s ever considered acceptable for different implementations of a language to produce different output for a given program. In this post I’m going to attempt to correct that oversight, and to do so in a way that I hope is relatively approachable. As that suggests, I will be as informal as possible, since I’ve found that more formal treatments of this subject can obscure where it has a day-to-day impact on programming.
Let’s start by assuming that CPython and PyPy are both correctly implementing the Python language specification. Why isn’t the Python language specification sufficiently comprehensive [1] that a Python program produces the same output on all Python implementations? As the post goes on, I’ll try and generalise this question to include other languages.
Deliberate flexibility
Let’s look at the part of the Python language specification that’s relevant to the example above, which compares strings by identity [2]:
The operators is and is not test for an object’s identity: x is y is true if and only if x and y are the same object… [4]
How on earth can CPython and PyPy interpret this simple, clear definition differently? Although it’s easily missed, at the end of that paragraph in the Python specification is a footnote “[4]”, which says:
Due to automatic garbage-collection, free lists, and the dynamic nature of descriptors, you may notice seemingly unusual behaviour in certain uses of the is operator, like those involving comparisons between instance methods, or constants. Check their documentation for more info.
From the perspective of this post, it’s an unfortunately written footnote — less a specification than a vaguely worded excuse. Let’s try and write a quick specification for “is” in something approaching specificationese:
Identity checks with “is” allow a user to test if two references point to exactly the same object. Only instances of user-level classes are guaranteed to consistently test true/false with an identity check. The result of identity tests on any other kind of object is implementation defined: an implementation may choose, for example, to cache integers or strings, perform identity checks on integers by value, and so on.
The key phrase is “implementation defined”: this specification makes clear what implementations are required to do (make sure is works with instances of user classes) and where they have wriggle room to do different things (instances of anything else). By implication, users must not rely on the results of identity tests for anything other than user-level classes.
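
Here is a minimal sketch of what that wriggle room looks like in practice (the comments record behaviour observed on current CPython and PyPy; a conforming implementation is free to differ on everything except the user-class cases):

class C:
    pass

a, b = C(), C()
print(a is b)   # False on every conforming implementation
print(a is a)   # True on every conforming implementation

# Implementation defined territory: CPython happens to cache small
# integers, so the first line prints True there; the second prints
# False on CPython but True on PyPy.
print(int("5") is int("5"))
print(str(0) is str(0))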
Why might a specification choose to allow implementations this sort of wriggle room? After all, it clearly makes life harder for users, who can no longer know exactly how a program will execute if it uses is on instances of non-user classes. The main reason, as hinted at in the original Python footnote, is that different, desirable, approaches to implementing a language are sometimes fundamentally unable to achieve the same outcome.
Perhaps the easiest way to see the interaction between language specifications and implementation techniques is object destructors (e.g. __del__ methods in Python or drop in Rust), which are called when an implementation determines that an object is no longer being used. In a reference counted implementation, like CPython, destructors are called immediately after the last reference to an object is lost. In contrast, implementations using “full” garbage collection, like PyPy, only check whether an object is still referenced periodically: an object’s destructor might not be run until a long time after the last reference to that object is lost.
We can see this happening with the following, slightly contrived [3], program:
class C:
    def __del__(self):
        print("__del__")

x = C()
x = None
print("p")
for _ in range(1000):
    y = [0] * 10000
which runs the destructor at different points in execution on CPython and PyPy:
$ python3 /tmp/t.py
__del__
p
$ pypy /tmp/t.py
p
__del__
There is simply no practical way to make a garbage collector like PyPy’s always call destructors at the same point in a program as CPython’s reference counting: if you want the performance advantages of “full” garbage collection, you have to forgo predictable destruction. Since both reference counting and “full” garbage collection have pros and cons, many language specifications, like Python’s, deliberately leave implementations the necessary wriggle room to choose either approach (e.g. not by requiring destructors to be run at a predictable point).
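
If you need a destructor to run at a particular point under a “full” collector, the closest you can get is to request a collection explicitly. A small sketch using Python’s standard gc module (even this is a request, not a guarantee: PyPy usually, but not provably always, runs pending destructors after it):

import gc

class C:
    def __del__(self):
        print("__del__")

x = C()
x = None
gc.collect()  # Explicitly request a garbage collection cycle.
print("p")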
However, it’s important to realise that not all language specifications allow wriggle room in the same places. For example, the Rust specification requires that implementations call drop methods at the same point (and, if there are multiple objects going out of scope, to do so in a specific order!). Users now rely on that behaviour for correctness. In other words, if the Rust specification were to try to retrospectively allow transparent “full” garbage collection, with unpredictable calling of drop methods, much Rust code would break.
As this might suggest, changing a language specification to allow new wriggle room breaks backwards compatibility. The burden that this places on users ranges from the easily detected (e.g. syntax changes are spotted immediately by the compiler and thus easily, if tediously, fixed) to the dangerously subtle (e.g. if the rules on when destructors are called are changed). In practice, the older a language is, and the more users it has, the harder it is to add wriggle room, particularly of the non-easily-detected variety: as time goes on, it becomes progressively more difficult to “fix” or “adapt” many aspects of a language specification.
Semi-inevitable flexibility
Language specifications can be thought of as defining what it means for multiple implementations to run programs “equivalently”. Program equivalence is a difficult concept to nail down: although I’ve implicitly defined equivalence thus far to mean “prints the same things on different implementations”, there are many other notions too.
From our perspective, what isn’t specified is often at least as important as what is. Notably, there are aspects of a language’s execution where a specification could require something to happen, but variation seems so inevitable that specifications do not attempt to define it. The most obvious instance of this – so obvious that most people don’t even consider it – is timing. Consider this little Python program:
import time

before = time.time()
i = 0
for _ in range(1000000):
    i += 1
after = time.time()
if after - before > 0.01:
    print("Running on CPython")
else:
    print("Running on PyPy")
Let’s run it:
$ python3 /tmp/t.py
Running on CPython
$ pypy /tmp/t.py
Running on PyPy
Just to be sure, I ran each of those commands 100 times: each time they correctly printed out “Running on CPython” or “Running on PyPy” [4]. Notice that the program doesn’t directly probe the version of Python it’s running on: it just times how long a portion of the program takes and makes an inference from that (i.e. that CPython will take longer than 0.01s). The reason this works is that I know PyPy will quickly recognise that the for loop can be fully optimised away, whereas CPython won’t.
The Python language specification could, if it really wanted to, say “implementations are forbidden to optimise away for loops, even if those for loops don’t have any directly observable effects”. Of course, it doesn’t say that, because in most situations it would be very silly to force programs to run unnecessarily slowly [5].
Likewise, language specifications rarely specify how fast operations run in
terms of seconds, because that would mean that upgrading our computer’s CPU
would not make our programs run faster.
Although not quite as clear cut, a similar aspect is memory storage: many languages specify little or nothing about how objects and the like are stored in memory. Those languages that do specify some aspects of storage tend to do so in limited ways. Most notably, low-level languages like C often specify how the fields of individual objects are laid out in memory, but say nothing about how, for example, multiple objects are allocated on the stack.
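
Python’s ctypes module, which mirrors C’s layout rules, makes this split visible. A rough sketch (the offsets in the comments assume a typical 64-bit platform, so they are observations rather than guarantees):

import ctypes

class Point(ctypes.Structure):
    # C-style struct: the layout of fields *within* the object is
    # predictable from C's alignment rules.
    _fields_ = [("x", ctypes.c_int32), ("y", ctypes.c_int64)]

print(Point.x.offset)        # 0
print(Point.y.offset)        # typically 8 (4 bytes of alignment padding)
print(ctypes.sizeof(Point))  # typically 16

# But nothing is said about where whole objects live relative to one
# another: these addresses are unpredictable.
p1, p2 = Point(1, 2), Point(3, 4)
print(ctypes.addressof(p1), ctypes.addressof(p2))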
In one way, we could consider these aspects to be similar to the first category of “deliberate flexibility”. However, my experience is that most language designers (or, at least, me!) don’t even think about aspects like timing at all, since it just seems obvious to us that different implementations will run programs at different speeds. For want of a better term, I think of such aspects as “semi-inevitable”: neither specifier nor implementer bothers wasting mental bandwidth worrying about such aspects.
Undesirable flexibility
Many readers will have heard the term “undefined behaviour”, something that causes many of us to quake in fear, or shake in anger. In essence, language specifications sometimes say “if you do X, then all bets are off as to what happens”. For example, if in C your program tries to access the 2nd element of an array of size 1, you have entered the land of undefined behaviour, and there are no guarantees about how your program will execute on an arbitrary implementation of C.
Before we explore undefined behaviour in more detail, let’s first tighten our definition of “implementation defined”. I’ll define it to mean those points in a specification where different implementations are explicitly allowed to take different execution strategies, but must tell you which they are using. In some cases they may be free to pick any possible execution strategy; in others they must choose from a subset dictated by the specification.
Undefined behaviour might at first sound like yet another variation of implementation defined. However, undefined behaviour is the language specification telling programmers where the boundaries of what it is willing to specify lie: if a programmer steps beyond those boundaries, the language implementation can do absolutely anything. Unfortunately, giving simple concrete examples of the effects of undefined behaviour is hard, as they are intimately tied to a combination of compiler version and hardware: an approachable description of behaviours at a point in time can be found in this paper, for example; a more recent investigation is also worth reading.
Most people tend to think that undefined behaviour sounds a bit bad, in the same way that a cold shower sounds a bit bad, but hardly fatal. However, undefined behaviour is insidious, because the problems it causes can be both very bad, and easily overlooked or misunderstood.
Why might a language specification choose to use the concept of undefined
behaviour if it is so dangerous? Originally, undefined behaviour was chiefly
a means of enabling portability across hardware that was much more diverse
than today’s (e.g. 40 years ago hardware had not universally settled
on a representation of signed integers); these days we’d probably
expect most such cases to be implementation defined [6].
However, over time, undefined behaviour has also come to be seen as important to efficient program execution. For example, detecting integer overflow (e.g. if you try to add 1 << 63 to itself, the result no longer fits in a 64-bit integer) is an expensive operation on most machines [7]: many people feel they are unwilling to pay that price [8], given that most integer operations in most programs will never overflow. Because of this, C specifies that overflow of signed integers is undefined behaviour, so that users don’t have to pay the costs of overflow checks.
This might sound like a slightly theoretical issue — most integers in most programs really won’t overflow, and since we’re only ever likely to encounter two’s complement hardware, the effects of overflow are easily predicted. However, most of us will at some point want to make sure an integer in one of our programs doesn’t overflow, and then we encounter the difficulty of checking overflow. Most obviously, you mustn’t retrospectively check whether overflow has occurred: you must instead perform a torturous check in advance to see whether overflow will occur. If you write the check in a way that leads to undefined behaviour according to the C standard, most modern C compilers will simply remove your check without telling you! Even more surprisingly, if your program can execute in a way that causes undefined behaviour, the effects of undefined behaviour can be felt “backwards in time”: code executed before the “actual” undefined behaviour can be executed in unexpected ways.
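
To make the “check in advance” pattern concrete, here is a sketch in Python (whose integers cannot overflow, so the 64-bit bounds below merely mimic C’s int64_t; the structure of the checks is the point, not the language):

INT64_MAX = (1 << 63) - 1
INT64_MIN = -(1 << 63)

def checked_add(a, b):
    # The checks must run *before* the addition: in C, a retrospective
    # check such as `if (a + b < a)` is itself signed overflow, i.e.
    # undefined behaviour, and modern compilers may delete it.
    if b > 0 and a > INT64_MAX - b:
        raise OverflowError("a + b would overflow")
    if b < 0 and a < INT64_MIN - b:
        raise OverflowError("a + b would underflow")
    return a + b

print(checked_add(1, 2))          # 3
print(checked_add(INT64_MAX, 1))  # raises OverflowError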
For many years, the presence of undefined behaviour in the specification didn’t really cause problems. However, over time, compiler authors – a very competitive bunch – started looking for new opportunities to make their compilers faster than others. They started looking at the C language specification with lawyerly eyes, and realised that undefined behaviour allowed them to make “good” programs run faster: by implication, they believe that those who write “bad” programs do not deserve to have their programs run “correctly”. Once it became acceptable to exploit undefined behaviour, we ended up on the slippery slope to where we are now.
We thus see a surprising aspect of language specifications: they often aren’t comprehensive because they weren’t comprehensive. Furthermore, older languages influence newer languages. There might be a parallel universe where the first C language specification decided to give a safe (i.e. not undefined behaviour!) definition to signed integer overflow: in that parallel universe, it’s probable that all subsequent language specifications would also use safe definitions for integer overflow [9].
A frustration I have with C’s extensive use of undefined behaviour is that it gives the concept a bad name when it really does have uses. For example, I believe that array accesses should as standard check that you are accessing an in-bounds element. However, there is performance critical code where this check is too costly. In such cases, I think there should be a second mechanism for accessing array elements that does not have a bounds check, instead documenting that I must promise not to access out-of-bounds elements. If I break that promise, I will deservedly end up in the land of undefined behaviour. As this suggests, the possibility of undefined behaviour should, in my opinion, be opt-in, and something that users can easily confine to small parts of a program.
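
As a hypothetical sketch of that second mechanism (the names get and get_unchecked are illustrative; Python always bounds checks, so the “unchecked” variant can only document the contract, while Rust’s get / get_unchecked pair on slices is a real-world analogue):

class Array:
    def __init__(self, items):
        self._items = list(items)

    def get(self, i):
        # The default: always bounds checked.
        if not 0 <= i < len(self._items):
            raise IndexError(f"index {i} out of bounds")
        return self._items[i]

    def get_unchecked(self, i):
        # Opt-in: the caller promises 0 <= i < len(self._items).
        # In a language like C or Rust, breaking that promise would be
        # undefined behaviour, confined to callers of this method.
        return self._items[i]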
Unknown flexibility
There is an interesting, and to me importantly distinct, way that language specifications are influenced by history: sometimes they don’t specify some aspects because no-one thought of specifying such aspects.
The most obvious instance of this is memory models. Simplifying significantly, and focussing on just one specific aspect, modern programming languages must specify how memory reads and writes interact across multiple threads if they want users to be able to write safe multi-threaded programs. However, before multicore processors started entering the mainstream (in the years immediately after 2000) there was little need to specify these aspects at all [10], and so no programming language that I know of did so.
Multicore processors meant that programming languages found themselves in a situation where they suddenly needed to carefully specify something they had never considered before. Java was the first language I know of that took this seriously; other languages like C and C++ followed a little later. Although I’ve slightly lost track of the current state of such matters, I believe that there are still edge cases that mainstream language specifications do not adequately cover.
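
The underlying problem is easy to provoke in Python. In this sketch, counter += 1 is a read-modify-write sequence, so even under CPython’s global interpreter lock unsynchronised threads can interleave and lose updates (how often depends on the CPython version and its thread-switching heuristics); no specification tells you exactly what will be printed:

import threading

counter = 0

def work():
    global counter
    for _ in range(100000):
        counter += 1  # load, add, store: threads can interleave here

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Often less than the "expected" 400000; the exact value can vary
# from run to run.
print(counter)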
From this we can see that language specifications are often reactive: sometimes, a previously ignored problem can no longer be ignored; sometimes, something that was not a problem becomes a problem.
Summary
Language specifications are not comprehensive because we don’t want them to be, can’t make them be, or because they once weren’t. At a sufficiently high level, these reasons are indistinguishable, but I find that separating them out really helps me understand what I’m dealing with at any given point.
An important question is: can we do better? Yes, we can. Look at any mainstream language’s specification over time, and you’ll see that there’s a clear trend in the direction of greater precision and coverage. We are fortunate that so many talented people have put so much effort into improving the situation, and must hope that they continue to do so!
However, you will also notice that all mainstream languages’ specifications are in English prose. It is tempting to imagine that if we had formal specifications (i.e. “mathematical” or “formal logic” specifications) all problems would be solved. However, even if it were practical – and for large languages it’s at least at the edge of our current abilities – we would still need such specifications to leave flexibility for implementations to do things differently from one another. In other words, it is not the prose form of specifications that causes them to allow implementations flexibility: that flexibility is inherent to their nature [11].
On a final note, you may have noticed that I have been careful not to mix my opinions of a programming language with my opinions on its specification. I mostly avoid using C these days, because I am too stupid to write reliable programs given the many ways I can cause undefined behaviour. However, the C specification is, mostly, rather good: it covers the vast majority of what should be covered, and I find much of it to be thoughtful and clearly written. In contrast, Rust, a language that I use much more, has a painfully incomplete specification: most obviously there are no meaningful semantics for unsafe code [12]. As this may suggest, an implicit reason why language specifications are not comprehensive is that they are hard, and take years of painstaking work to mature. I try and remind myself how hard they are whenever I am tempted to criticise language designers and specifiers!
Acknowledgements: My thanks to Martin Berger, Lukas Diekmann, and Stephen Kell for comments.
Footnotes
[1] I’m deliberately avoiding using the term “complete” here, because, depending on the context, that has a formal meaning that can slightly distract us.
[2] Note the problem here is that, as in most programming languages, Python allows us to differentiate whether two objects have the same contents or whether two objects are actually the same object (i.e. they have the same identity).
[3] To the surprise of many people, most garbage collected languages do not guarantee to call an object’s destructor: instead, the typical guarantee is that a destructor is called at most once. The for loop in this program is designed to force PyPy to run at least one garbage collection cycle: if this loop is removed, you’ll see that “__del__” is never printed!
[4] If you have a machine which is substantially slower / faster, or which has a mix of performance and efficiency cores, you might find the timing numbers are a little different.
[5] Interestingly – and, to most of us, surprisingly – there are programs where we really, really don’t want this sort of optimisation to happen. Formally speaking, my little program has used a “timing side channel” to infer something about the program being run. Some security-conscious programs require that their algorithms run in constant time so that an attacker can’t use a timing side-channel to infer such properties about the program or its data. As this shows, it can be difficult, perhaps even impossible, to accommodate all possible use cases of a language in a specification.
[6] Stephen Kell (who tells me he found out about this from Victor Yodaiken) pointed me at a fascinating example from Kernighan and Ritchie’s “The C Programming Language”. In the second edition that I have access to, Section 8.7 is an example memory allocator. Page 185 contains this text:

“There is still one assumption, however, that pointers to different blocks returned by sbrk can be meaningfully compared. This is not guaranteed by the standard, which permits pointer comparisons only within an array. Thus this version of malloc is portable only among machines for which general pointer comparison is meaningful.”

It is difficult to know for sure what result Kernighan and Ritchie expected from their code when it was compiled and run on a machine for which general pointer comparison was not meaningful. They may have expected the code not to compile, though given the relatively simple compilers of the day, this seems unlikely. Probably more likely, they expected the code to compile, and run in a way that could be determined in advance by understanding the hardware. In other words, they probably expected the program to run in a predictable way, even if that wasn’t the way they might have wanted.

In modern C, the code clearly makes use of undefined behaviour because of the comparison of arbitrary pointers with < in the free function (see Appendix J of the C11 specification). It is then impossible to predict exactly how the code will execute once it is compiled by a modern optimising compiler.

In case that makes it sound like I know lots about the C specification, I should thank Stephen for doing the detective work!
[7] The cost of detecting overflow heavily depends on the hardware and compiler. I am not sure the situation has much improved since John Regehr’s plea 9 years ago.
[8] Though virtually no-one I’ve met who believes they are unwilling to pay the price knows what the price is!
[9] In my opinion, Rust gets this partly right by requiring that integers have a two’s complement representation. This is a reasonable choice for modern hardware, but was not a choice that C could sensibly have made when it was originally specified. By forcing a representation for integers, one can always know what the result of integer overflow will be. However, in release mode, Rust is allowed to, and does by default, remove all overflow checks, which I believe is too much of a security risk to be acceptable.
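
Because two’s complement is forced, the overflow result is computable in advance. A sketch of the arithmetic in Python (wrapping_add64 is my name for it; Rust’s own i64::wrapping_add behaves like this):

def wrapping_add64(a, b):
    # Two's complement wrapping addition on 64-bit signed integers.
    r = (a + b) & ((1 << 64) - 1)          # keep the low 64 bits
    return r - (1 << 64) if r >= (1 << 63) else r

print(wrapping_add64((1 << 63) - 1, 1))    # -9223372036854775808, i.e. i64::MIN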
[10] As compilers have become more aggressively optimising, a fully fleshed out memory model has also become important for single-threaded code. My impression, though, is that the emergence of multicore hardware was what led to the development of fully fleshed out memory models. Those memory models then made it easier for compilers to define which single-threaded code could be more aggressively optimised.
[11] For example, consider the excellent stream of work from Peter Sewell and friends on formalising C, who call this flexibility “looseness” (see e.g. this paper).
[12] When challenged on this, I ask people to define Rust’s aliasing rules. After pointing to various unofficial suggestions (most obviously stacked borrows), I point them to the official documentation, which says:

“The precise Rust aliasing rules are somewhat in flux, but the main points are not contentious”

Since I have often had to use unsafe code in Rust, this has placed me in the unenviable position of looking through forums for hints from Rust and LLVM developers and examining the output of LLVM to come to an educated guess as to what the current semantics might be. I have hope that meaningful progress on this front is about to start.