Language size over time
Let me start with a concrete example. I started using Python around version 1.3, though the first version I really remember was 1.4 [2]. Python 1.5 was a real improvement, adding the assert statement, nested packages ("hierarchical modules" in the release notes), and the re regular expression module — things that I suspect nearly every modern Python programmer finds useful. At that point, Python was still a relatively small language, to the extent that you could reasonably expect to store nearly every detail about it (and its implementation!) in your head.
Python 1.6 added support for Unicode — support that was clearly useful, but sufficiently awkward that (somewhat) fixing that awkwardness became a major cause of the schism between Python 2 and 3. As Python 1.6 turned to 2.0 and beyond, this formerly small language kept adding features, many of which I find useful (e.g. list comprehensions), but each of which has inevitably made the language bigger.
In retrospect, Python 1.5 was probably the last version of Python that someone with my limited brainpower could reasonably expect to fully understand. By the time of Python 2.7, Python was definitely a big language, and it has only grown in size since then. I doubt there are many people who could claim they understood every detail of Python 2.7, and there is probably no one who claims that of recent versions of Python 3.
The criteria for language features
How did Python become a big language? The initial version of Python was small, and the first several versions deliberately tried to maintain that smallness: most feature suggestions were explicitly rejected on the grounds that they would make the language too big. At some point (perhaps between Python 1.6 and Python 2.0?) it seemed to me that something changed: features were not rejected solely on the grounds of making the language too big, but on the grounds that they didn't fix an important enough problem.
The difference between these two criteria might seem minor, but "reject a design if it doesn't solve an important problem" implicitly downgrades the size of the language as a concern in and of itself. As soon as a language's designers stop worrying about the size of the language, it seems inevitable to me that the language will grow almost without bound. Certainly, Python looks set to keep on adding new features.
Although I've used Python as an example, I could have chosen many other languages. For example, although Java was never what I'd call a small language, it had many years of fairly minor language changes before adding generics in Java 1.5 — that seemed to be the point at which Java began its journey to the large language it now is. Haskell started off as a fairly small ML-inspired language, but is now a very large language (albeit one where you can opt in to, or out of, many features in the standard compiler). JavaScript has gone from a long-weekend design to a rather large language. I know a lot less about languages such as C#, but I wouldn't be surprised if they've gone, or are going, through something similar.
These days I do most of my programming in Rust, a language which is already big, but which is perhaps now being forced to think about whether it wants to become very big or not. For example, if preventing Rust getting too big is a goal, it's hard to imagine the recently proposed keyword generics initiative being accepted, even though it clearly solves a problem. On the other hand, Rust has already added dedicated support for async/await, so is it a good idea to cleave the library system in two? Whatever the outcome is, some people will be unhappy — these decisions aren't easy!
Why does this happen? Well, no programming language is perfect: there are always use cases it doesn't support well. In many cases, growing the language helps it support those use cases better. Since there are an unbounded number of potential use cases, there is thus always "more" language design that we can do, each time making the language a little bigger. As Patterns & Abstractions points out, one continual temptation is to move "patterns" (i.e. standard idioms of use) to "abstractions" (i.e. language features). There are advantages to such moves: languages can, for example, give better error messages and better optimise "abstractions". But there is also a cost: patterns do not affect a language's size, but abstractions do. Only some users of a language will encounter a given pattern, but nearly every user is at some point confronted by nearly every abstraction. This quote from Bjarne Stroustrup in a 1992 document on suggested C++ extensions is relevant to many languages:
Please also understand that there are dozens of reasonable extensions and changes being proposed. If every extension that is reasonably well-defined, clean and general, and would make life easier for a couple of hundred or couple of thousand C++ programmers were accepted, the language would more than double in size. We do not think this would be an advantage to the C++ community.
Must all languages grow in size?
There are some interesting exceptions to the trend of languages growing over time. Lisp-based languages, particularly those in the Scheme tradition, tend to maintain fairly small language cores. However, because of the extensive use of macros, it's sometimes difficult to know how to classify "language size" in such languages. I'm far too out of date with, say, modern Racket to have an informed opinion on its size.
Lua was, and is, a small language — and, probably not coincidentally, so is its implementation. Interestingly, as far as I have been able to tell, one of the ways Lua's designers have kept the language evolving is to regularly force (generally small) breaking changes on its user base. It's an interesting trade-off, and not one that has been taken by any other language that I can think of.
Perhaps the best known (partial) exception is C. Modern C is a surprisingly different language to the original C, but that evolution happens rather slowly these days, with new versions of the C standard coming out around every 8-12 years. Each new version tends to add few new language features, or even standard library functions, and tends to put at least as much effort into clarifying parts of the language's semantics that have either always been unclear or which have been rendered unclear by new hardware (see Why Aren't Programming Language Specifications Comprehensive? for some examples).
My impression of C, as someone from the outside occasionally looking in, is that those responsible for the C language standard probably consider themselves to be more "maintainers" than "designers": they want C to be a useful language in the modern world, but they also clearly want to balance that with not letting C grow too much bigger.
One advantage C has is, ironically, C++: anyone who wants "C but with lots of new features" can be told to go and use C++ instead of growing C in size. Although it's difficult to distinguish cause from effect, perhaps this has dissuaded those people who like adding new features to languages from considering proposing them for C, when C++ is likely to be far more receptive to their suggestions. Certainly, C++ is a huge language which keeps on growing: Bjarne Stroustrup's recent-ish plea to "Remember the Vasa" strikes a cautionary note that few would make about C.
Summary
Language designers face challenging trade-offs. Their languages can always be made better for more users by adding new features, but doing so makes the language bigger. At some point, languages become big enough that few users can understand all of the language — but nearly every user of a language will at least occasionally encounter nearly every feature. When users encounter features they don't understand, their reactions can vary from surprise to worry — if they notice at all! The larger a language is, the easier it is for users to misuse it without even knowing it.
There is no easy way for us to evaluate whether a new feature is worth increasing a language's size for. It's also easier for those people who will benefit from a feature to advocate for it than for those people whose lives might (unwittingly) be made worse by it to advocate against it. Even talking about "size" in anything other than a vague way [3] is hard: quantitative metrics like the size of a language's grammar or compiler do not always correlate with the inherently qualitative measure of the "cognitive load" that a new feature may impose on humans.
Perhaps in an ideal world, language designers would at some point declare their languages "complete", capping the language's size. However, hardware and other external factors change in ways that sometimes force even the most conservative language to adapt. Even a language like C, which evolves slowly, still evolves. And for as long as a language must evolve, it must have designers. And for as long as a language has designers, there will always be the temptation to add shiny new features.
However, I think language designers can learn something useful from C's example. While we can't declare our languages "complete", we can at some point declare them to be in "minimal evolution" mode. This is decidedly less sexy, and offers fewer job opportunities to language designers, than continually adding new features — but it also allows users to focus on the tasks they actually want to program, rather than having to deal with ever more complexity. Slowly, over time, new languages will come along which take the best ideas, simplify and unify them, and keep the field moving forwards. In general, in my opinion, that's shown itself to be a more successful tactic than hoping that ever-bigger languages can satisfy all users.
Footnotes
[1] Astute readers will notice that this post talks about language "size" rather than "complexity". In general, these two criteria strongly correlate, but of the two "size" is marginally more objective. But, if you prefer to use "complexity" instead of "size", you can do so without changing this post's fundamental meaning.
[2] I don't have a precise chronology to hand, and at the time I was using an unusual platform – Acorn's RISC OS – so there was probably a delay between the "main" release and it being available on RISC OS, muddling my memory even further.
[3] There's a reason I haven't defined "size" precisely in this post — whatever definition I use won't work for everyone! I hope that your intuition of "size" is sufficient to understand the overall points I'm making.
Rust's Two Kinds of 'Assert' Make for Better Code
March 16 2023
Daniel Lemire's recent post "runtime asserts are not free" looks at the run-time cost of assert statements in C and shows that a simple assert in a frequently executed loop can cause significant overhead.
My own opinion on assertions has shifted over the years, from "I don't see the point" to "use them sparingly" to "use them as much as possible". That last shift is largely due to Rust having two kinds of "assert" statement – assert and debug_assert – which has allowed me to accurately express two different kinds of assertions, largely freeing me from performance worries. If you come from a language that only has one kind of assert statement, this distinction can seem pointless, so in this post I want to briefly explain why it helped shift my thinking.
Background
Let me quickly define what I mean by an "assert": it's a programming language statement that checks a property and causes a crash if that property does not hold (conventionally called a "failing assert"). For example, if I have a Python program with a list of people's ages and calculate the minimum age, I might want to check that the youngest person doesn't have a negative age:
    ages = [...]
    youngest = min(ages)
    assert(youngest >= 0)

If ages contains a negative value – or if min doesn't work correctly! – the assert will fail and cause a run-time exception:

    Traceback (most recent call last):
      File "/tmp/t.py", line 3, in <module>
        assert(youngest >= 0)
    AssertionError

In other words, writing assert is roughly equivalent to:

    ages = [...]
    youngest = min(ages)
    if not (youngest >= 0):
        raise AssertionError

In practice, asserts are mostly used to check assumptions about a program's state — in this case, that at no point has a negative age entered into the system.
There are two major reasons why I might want to check this particular assumption. First, I might have written subsequent code which will only execute correctly with non-negative youngest values: I want to prevent that subsequent code from executing if that property is violated. Second, the assert both documents and checks the property. In other words, I could just have written a comment:

    ages = [...]
    youngest = min(ages)
    # youngest must be non-negative or bad things will happen below
    ...

That comment accurately describes the program's assumption, but if the assumption is incorrect – perhaps because another part of the program uses -1 to mean "we don't know how old this person is" – the threatened "bad things" will occur. If I'm lucky, the effects will be relatively benign, and perhaps even invisible. But, if I'm unlucky, genuinely bad things will occur, ranging from odd output to security vulnerabilities.
Debugging incorrect assumptions of this sort is hard, because the effect of the assumption's violation is generally only noticed long after the violation occurred. It's not unusual for some poor programmer to spend a day or more hunting down a problem only to find that it was caused by the violation of a simple assumption. In contrast, the assert causes my program to crash predictably, with a clear report, and at the earliest possible opportunity. In general, fixing the causes of failing asserts tends to be relatively simple.
Why asserts are used less often than one might think
As I've described them above, asserts sound like a clear win — but most programs use many fewer asserts than one might hope.
The most obvious reason for this is that programmers often don't realise the assumptions they're embedding in their programs, or don't consider the consequences of their assumptions. This tends to be particularly true for junior programmers, who have not yet built up scar tissue from multi-day debugging sessions that were necessary only because they didn't think to use asserts. It took me many years of programming before I realised how much time I was wasting by not thinking about, and checking with asserts, my assumptions about a program's properties.
Sometimes it's also very difficult to work out how to assert the property one cares about. This is particularly true in languages like C where there is no built-in help to express properties such as "no element in the list can be negative". The lengthier, and more difficult, an assert needs to be – particularly if it needs a helper function – the less likely it is to be written down.
Inevitably, some asserts are plain wrong, either expressing an incorrect property, or expressing a correct property incorrectly. I think most of us expect such mistakes. However, what many people don't realise is that asserts can change a program's behaviour if they have side effects. I have shot myself in the foot more than once by copying and pasting code such as l[i++] into an assert, causing the program to execute differently depending on whether the assert is compiled in or not. I view this as inevitable stupidity on my part, rather than a flaw in the concept of asserts, but I have heard of at least one organisation that bans (or, at least, banned) asserts because of this issue.
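The same footgun exists in Rust. Jumping ahead slightly to the debug_assert macro discussed below, here is a small sketch of my own (not from the post) of a side-effecting check that makes debug and release builds behave differently:

```rust
fn main() {
    let mut it = vec![1, 2, 3].into_iter();
    // The condition below has a side effect: it consumes an element from the
    // iterator -- but only when debug assertions are compiled in.
    debug_assert!(it.next() == Some(1));
    // Passes in a debug build (the next element is 2), but fails in a release
    // build, where the debug_assert! above is not evaluated and the next
    // element is still 1.
    assert_eq!(it.next(), Some(2));
}
```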
Performance issues
Daniel pointed out a very different reason for avoiding asserts: they can cause serious performance issues when used in the "wrong" places. An assert introduces a branch (i.e. an "if") which must be executed at run-time [1]. The existence of an assert can also cause a compiler to miss compile-time optimisation opportunities [2]. There is a general fear in the programming community about the performance costs of asserts, even though none of us can know how they impact a given program without actually measuring it.
To avoid performance issues, most software is compiled in either "debug" (sometimes called "testing") or "release" modes: in debug mode, asserts are compiled in and checked at run-time; but in release mode, the asserts are not compiled in, and thus not checked at run-time. In languages like C, there isn't a standard concept of "debug" and "release" modes, but many people consider "release" mode to imply adding the flag -DNDEBUG, which causes assert to become a no-op. Rust's standard Cargo build system defaults to debug mode, with --release performing a release build.
Two kinds of asserts
While not compiling (and thus not checking) asserts in release mode removes performance problems, it also weakens the guarantees we have about a program's correctness — just because a test suite doesn't violate an assumption doesn't mean that real users won't use the program in a way which does.
These days I thus view asserts as falling into two categories:
- Checking problem domain assumptions.
- Checking internal assumptions.
The first category contains assumptions about the "real world" problem my program is trying to help solve. For example, if I'm writing a warehouse stock system, parts of my program might assume properties such as "an item's barcode is never empty".
The second category contains assumptions about the way I've structured my program. For example, I might have written a function which runs much faster if I assume the input integer is greater than 1. It's probable that, at the time I write that function, I won't call it in a way that violates that property: but later programmers (including me!) will quite possibly forget, or not notice, that property. An assertion is thus particularly helpful for future programmers, especially when they're refactoring code, to give them greater confidence that they haven't broken the program in subtle ways.
What it took me years to realise is that I have very different confidence levels about my assumptions in each category. I have high confidence that violations of my assumptions in the second category will be caught during normal testing. However, I have much lower confidence that violations of my assumptions in the first category will be caught during testing.
My differing confidence levels shouldn't be surprising — after all, I've been hired because I can program, not because I know much about warehouse stock systems or barcodes! However, since "normal testing" implies "debug mode" and "user is running the program" implies "release mode", it means that the assumptions I am least confident about are not checked when they're most needed.
Two kinds of assert statement
The problem that I've just expressed ultimately occurs because languages like C force us to encode both kinds of assumption with a single assert statement: either all asserts are compiled in or none are.
I long considered this inevitable, but when I moved to Rust several years back, I slowly realised that I now had access to two kinds of assert statement. debug_assert is rather like assert in C, in the sense that the assumptions it expresses are only checked in debug mode. In contrast, assert is "always checked in both debug and release builds, and cannot be disabled."
This might seem like a small difference, but for me it completely unlocked the power of assertions. If you look at a lot of the code I now write, you'll see liberal use of debug_assert, often checking quite minor assumptions, including those I never think likely to be violated. I never even think, let alone worry, about the performance impact of debug_assert. But occasionally you'll spot an assert, sometimes even in fairly frequently executed code — those are where I'm checking particularly important assumptions, or assumptions in which I have particularly low confidence. Each time I write assert I think about the possible performance impact, and also about whether there's a way I can increase my confidence in the assumption to the point that I can downgrade it to a debug_assert. Similarly, when it comes to debugging, I often check assert statements quite carefully as they indicate my low confidence in a particular assumption: it's more likely that I have to reconsider an assert than a debug_assert.
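To make this concrete, here is a small sketch of my own (a hypothetical warehouse-flavoured function, not code from the post) showing how the two macros split my two kinds of assumption:

```rust
/// Return a discounted price in pence for a warehouse item.
fn discounted_price(barcode: &str, price_pence: u64, percent_off: u64) -> u64 {
    // Problem-domain assumption I have low confidence in: checked in both
    // debug and release builds.
    assert!(!barcode.is_empty(), "an item's barcode must never be empty");
    // Internal assumption I expect normal testing to catch: only checked in
    // debug builds, so it costs nothing in release mode.
    debug_assert!(percent_off <= 100);
    price_pence - (price_pence * percent_off) / 100
}

fn main() {
    // Prints 225 (a 10% discount on 250 pence); the barcode is illustrative.
    println!("{}", discounted_price("5010029217698", 250, 10));
}
```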
Of course, there's no reason why you can't write your own equivalents of assert and debug_assert in C, or any other language, but having them built into a language (or standard library), where their differing motivations are clearly documented and widely understood, makes it much easier to use them freely. I hope languages other than Rust will also come to make this distinction in assertions — though I would prefer a shorter name than "debug_assert"!
Footnotes
[1] Branch predictors are amazing but sometimes people talk about them as if they make branches free. Alas, they're not that amazing!
[2] The inverse is also true: asserts can sometimes help a compiler optimise code better. Although I have no hard data to suggest which outcome occurs more often, I am inclined to think the negative outcome is more common.
Scheduling my Electricity Usage
March 5 2023
Modern societies function in large part thanks to our ability to use as much electricity as we want, when we want it. As renewable energy (solar, wind) plays a greater part in the electricity generation mix, our opportunities to use "cleaner" energy increase, but renewables are also fickle — sometimes the wind doesn't blow, and the sun doesn't shine. Today in the UK, for example, the weather is overcast and still: solar and wind aren't going to generate much power. In such cases, the electricity we're using is generated from much "dirtier" sources.
For the last 6 months or so, I've been trying to see what I can do to increase the "clean" proportion of electricity I use. I have a lot of limiting factors: I can't, for example, install solar panels or a wind turbine. But what I can do is change when I use electricity.
My intention is not for me (or my family!) to live like a stylite or troglodyte: personally, I rather like the conveniences and comfort of modern life. Therefore, there are lots of things where my timings are inflexible — I'm not going to turn the oven on to produce a meal at midnight, nor am I going to switch my computer off when there's a big football match on.
However, there are several things where my timings are flexible. I'm going to use my washing machine as an example (the same principle applies to e.g. dishwashers, tumble dryers, and the like). I often need to make sure something is washed by a certain time, but it doesn't matter if I wash it right now or just before the point that I need it to have been washed.
Since I know that peak energy usage tends to be from around 16:00-19:00 each day, I guessed that's the "dirtiest" time of the day to use electricity and I avoided running my washing machine then. After a little more thought, I guessed that not many people are using electricity early in the morning, so I scheduled the washing machine to come on at about 06:00. I guessed that this was early enough to have some benefit from a power generation perspective, but just late enough that the (potentially noisy!) spin cycle wouldn't occur until our neighbours are awake.
However, you'll notice that I was doing quite a lot of guessing. I also recently realised that my assumptions were largely based on wind power, which is the main renewable source in the UK in winter. It's a reasonable guess that wind power is fairly evenly spread over 24 hours but as spring arrives, solar power is going to become more important, and that's clearly highly unevenly spread throughout the day. How should I factor that in?
I was thus very pleased a few weeks back to stumble across the UK Carbon Intensity forecasting site. At first my internal programmer was attracted by the "API" part of the site, but I soon realised that this site is useful for more normal activities. It provides a 48 hour forecast of how carbon intensive (a tighter definition than the wishy-washy "clean" I'd been using until now) energy usage will be.
If you scroll down about halfway (I wish this was at the top of the site!) you can see the current forecast:
[Graph: the site's 48-hour carbon intensity forecast, alongside the previous 24 hours.]
On the x-axis we have time and on the y-axis carbon intensity (lower is better). The left-hand one third of the graph (in off white) shows the previous 24 hours, matching the forecast to the reality. The right-hand two thirds of the graph show the forecast. As this shows, today's overcast, still weather in the UK means that electricity usage is very carbon intensive, but tomorrow, as wind speeds pick up as the morning goes on, carbon intensity declines noticeably, hitting a trough at midday.
What I like about this forecast is how simple it is to read. Currently I need to have washed a load of towels before Friday. I could wash them today, but that would be foolish when I could do so tomorrow at much lower carbon intensity. So tomorrow, around 11:30, you will find me feeding a load of towels into my washing machine. Not only is this easier than scheduling an early morning wash, but I don't have to worry about the spin cycle disturbing anyone's sleep!
Of course, just me changing my energy usage patterns doesn't make much difference: backup (carbon intensive) electricity generation will still be active whatever I do. And, perhaps, one day in the future we'll have access to as much carbon neutral energy as we want.
However, in the here and now, the forecasting site allows those of us who have the ability, whether at home or at work, to shift our energy usage a bit to see if we can also shift the collective needle a bit. It also shows the power of a neat little website with a simple graph — I just hope they move the main graph to the top of the site soon! And, of course, the UK forecasting site is only relevant for those of us in the UK — are there equivalents for other countries? If you know of any, please feel free to leave a comment below!
Why Aren't Programming Language Specifications Comprehensive?
February 16 2023
In Compiled and Interpreted Languages: Two Ways of Saying Tomato I used this little example to show that we can observe differences in how CPython and PyPy execute code:

    $ cat diffs.py
    print(str(0) is str(0))
    $ python3 diffs.py
    False
    $ pypy diffs.py
    True

Although I explained in that post that programming language specifications and implementations are separate things, I didn't explain why it's ever considered acceptable for different implementations of a language to produce different output for a given program. In this post I'm going to attempt to correct that oversight, and to do so in a way that I hope is relatively approachable. As that suggests, I will be as informal as possible, since I've found that more formal treatments of this subject can obscure where it has a day-to-day impact on programming.
Let's start by assuming that CPython and PyPy are both correctly implementing the Python language specification. Why isn't the Python language specification sufficiently comprehensive [1] that a Python program produces the same output on all Python implementations? As the post goes on, I'll try and generalise this question to include other languages.
Deliberate flexibility
Let's look at the part of the Python language specification that's relevant to the example above, which compares strings by identity [2]:
The operators is and is not test for an object's identity: x is y is true if and only if x and y are the same object... [4]
How on earth can CPython and PyPy interpret this simple, clear definition differently? Although it's easily missed, at the end of that paragraph in the Python specification is a footnote "[4]", which says:
Due to automatic garbage-collection, free lists, and the dynamic nature of descriptors, you may notice seemingly unusual behaviour in certain uses of the is operator, like those involving comparisons between instance methods, or constants. Check their documentation for more info.
From the perspective of this post, it's an unfortunately written footnote — less a specification than a vaguely worded excuse. Let's try and write a quick specification for "is" in something approaching specificationese:
Identity checks with "is" allow a user to test if two references point to exactly the same object. Only instances of user-level classes are guaranteed to consistently test true/false with an identity check. The result of identity tests on any other kind of object is implementation defined: an implementation may choose, for example, to cache integers or strings, perform identity checks on integers by value, and so on.
The key phrase is "implementation defined": this specification makes clear what implementations are required to do (make sure is works with instances of user classes) and where they have wriggle room to do different things (instances of anything else). By implication, users must not rely on the results of identity tests for anything other than user-level classes.
Why might a specification choose to allow implementations this sort of wriggle room? After all, it clearly makes life harder for users, who can no longer know exactly how a program will execute if it uses is on instances of non-user classes. The main reason, as hinted at in the original Python footnote, is that different, desirable, approaches to implementing a language are sometimes fundamentally unable to achieve the same outcome.
Perhaps the easiest way to see the interaction between language specifications and implementation techniques is object destructors (e.g. __del__ methods in Python or drop in Rust), which are called when an implementation determines that an object is no longer being used. In a reference counted implementation, like CPython, destructors are called immediately after the last reference to an object is lost. In contrast, implementations using "full" garbage collection, like PyPy, only check whether an object is still referenced periodically: an object's destructor might not be run until a long time after the last reference to that object is lost.
We can see this happening with the following, slightly contrived [3], program:

    class C:
        def __del__(self):
            print("__del__")

    x = C()
    x = None
    print("p")
    for _ in range(1000):
        y = [0] * 10000

which runs the destructor at different points in execution on CPython and PyPy:

    $ python3 /tmp/t.py
    __del__
    p
    $ pypy /tmp/t.py
    p
    __del__
There is simply no practical way to make a garbage collector like PyPy's always call destructors at the same point in a program as CPython's reference counting: if you want the performance advantages of "full" garbage collection, you have to forgo predictable destruction. Since both reference counting and "full" garbage collection have pros and cons, many language specifications, like Python's, deliberately leave implementations the necessary wriggle room to choose either approach (e.g. not by requiring destructors to be run at a predictable point).
However, it's important to realise that not all language specifications allow wriggle room in the same places. For example, the Rust specification requires that implementations call drop methods at the same point (and, if there are multiple objects going out of scope, to do so in a specific order!). Users now rely on that behaviour for correctness. In other words, if the Rust specification was to try to retrospectively allow transparent "full" garbage collection, with unpredictable calling of drop methods, much Rust code would break.
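To illustrate (a small sketch of my own, not from the post): Rust guarantees that local variables are dropped in reverse declaration order at the end of their scope, so code can legitimately rely on exactly when, and in what order, drop runs:

```rust
struct Noisy(&'static str);

impl Drop for Noisy {
    fn drop(&mut self) {
        // Runs at a predictable point: the end of the enclosing scope.
        println!("dropping {}", self.0);
    }
}

fn main() {
    let _a = Noisy("a");
    let _b = Noisy("b");
    println!("end of main");
    // Guaranteed output: "end of main", then "dropping b", then "dropping a",
    // because locals are dropped in reverse declaration order. A "full"
    // garbage collector that ran destructors at arbitrary points would break
    // any code relying on this ordering.
}
```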
As this might suggest, changing a language specification to allow new wriggle room breaks backwards compatibility. The burden that this places on users ranges from the easily detected (e.g. syntax changes are spotted immediately by the compiler and thus easily, if tediously, fixed) to the dangerously subtle (e.g. if the rules on when destructors are called are changed). In practice, this means that the older a language is and/or the more users it has, the harder it is to add wriggle room, particularly if it's of the non-easily-detected variety. As time goes on, it thus becomes progressively more difficult to "fix" or "adapt" many aspects of a language specification.
Semi-inevitable flexibility
Language specifications can be thought of as defining what it means for multiple implementations to run programs "equivalently". Program equivalence is a difficult concept to nail down: although I've implicitly defined equivalence thus far to mean "prints the same things on different implementations", there are many other notions too.
From our perspective, what isn't specified is often at least as important as what is. Notably, there are aspects of a language's execution where a specification could require something to happen, but variation seems so inevitable that specifications do not attempt to define it. The most obvious instance of this – so obvious that most people don't even consider it – is timing. Consider this little Python program:
    import time
    before = time.time()
    i = 0
    for _ in range(1000000):
        i += 1
    after = time.time()
    if after - before > 0.01:
        print("Running on CPython")
    else:
        print("Running on PyPy")

Let's run it:
    $ python2 /tmp/t.py
    Running on CPython
    $ pypy /tmp/t.py
    Running on PyPy

Just to be sure, I ran each of those commands 100 times: each time they correctly printed out "Running on CPython" or "Running on PyPy" [4]. Notice that the program doesn't directly probe the version of Python it's running on: it just times how long a portion of the program takes and makes an inference from that (i.e. that CPython will take longer than 0.01s). The reason this works is that I know PyPy will quickly recognise that the for loop can be fully optimised away, whereas CPython won't.
The Python language specification could, if it really wanted to, say "implementations are forbidden to optimise away for loops, even if those for loops don't have any directly observable effects". Of course, it doesn't say that, because in most situations it would be very silly to force programs to run unnecessarily slowly [5].
Likewise, language specifications rarely specify how fast operations run in terms of seconds, because that would mean that upgrading our computer's CPU would not make our programs run faster.
Although not quite as clear cut, a similar aspect is memory storage: many languages specify little or nothing about how objects and the like are stored in memory. Those languages that do specify some aspects of storage tend to do so in limited ways. Most notably, low-level languages like C often specify how the fields of individual objects are laid out in memory, but say nothing about how, for example, multiple objects are allocated on the stack.
In one way, we could consider these aspects to be similar to the first category of "deliberate flexibility". However, my experience is that most language designers (or, at least, me!) don't even think about aspects like timing at all, since it just seems obvious to us that different implementations will run programs at different speeds. For want of a better term, I think of such aspects as "semi-inevitable": neither specifier nor implementer bothers wasting mental bandwidth worrying about such aspects.
Undesirable flexibility
Many readers will have heard the term "undefined behaviour", something that causes many of us to quake in fear, or shake in anger. In essence, language specifications sometimes say "if you do X, then all bets are off as to what happens". For example, if in C your program tries to access the 2nd element of an array of size 1, you have entered the land of undefined behaviour, and there are no guarantees about how your program will execute on an arbitrary implementation of C.
Before we explore undefined behaviour in more detail, let's first tighten our definition of "implementation defined". I'll define it to mean those points in a specification where different implementations are explicitly allowed to take different execution strategies, but must tell you which they are using. In some cases, they may be free to pick any possible execution strategy; in others, they may have to choose from a subset dictated by the specification.
Undefined behaviour might at first sound like yet another variation of implementation defined. However, undefined behaviour is the language specification telling programmers where the boundaries of what it is willing to specify lie: if a programmer steps beyond those boundaries, the language implementation can do absolutely anything. Unfortunately, giving simple concrete examples of the effects of undefined behaviour is hard, as they are intimately tied to a combination of compiler version and hardware: an approachable description of behaviours at a point in time can be found in this paper, for example; a more recent investigation is also worth reading.
Most people tend to think that undefined behaviour sounds a bit bad, in the same way that a cold shower sounds a bit bad, but hardly fatal. However, undefined behaviour is insidious, because the problems it causes can be both very bad, and easily overlooked or misunderstood.
Why might a language specification choose to use the concept of undefined behaviour if it is so dangerous? Originally, undefined behaviour was chiefly a means of enabling portability across hardware that was much more diverse than today's (e.g. 40 years ago hardware had not universally settled on a representation of signed integers); these days we'd probably expect most such cases to be implementation defined [6].
However, over time, undefined behaviour has also come to be seen as important to efficient program execution. For example, detecting integer overflow (e.g. if you try to add 1 << 63 to itself, the result no longer fits in a 64-bit integer) is an expensive operation on most machines [7]: many people feel they are unwilling to pay that price [8], given that most integer operations in most programs will never overflow. Because of this, C specifies that overflow of signed integers is undefined behaviour, so that users don't have to pay the costs of overflow checks.
This might sound like a slightly theoretical issue — most integers in most programs really won't overflow, and since we're only ever likely to encounter two's complement hardware, the effects of overflow are easily predicted. However, most of us will at some point want to make sure an integer in one of our programs doesn't overflow, and then we encounter the difficulty of checking overflow. Most obviously, you mustn't retrospectively check whether overflow has occurred: you must instead perform a torturous check in advance to see whether overflow will occur. If you write the check in a way that leads to undefined behaviour according to the C standard, most modern C compilers will simply remove your check without telling you! Even more surprisingly, if your program can execute in a way that causes undefined behaviour, the effects of undefined behaviour can be felt "backwards in time": code executed before the "actual" undefined behaviour can be executed in unexpected ways.
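As a point of comparison (a sketch of my own in Rust, not from the post), Rust's standard library side-steps the "check in advance" dance by providing checked arithmetic that reports overflow directly, alongside the kind of advance check careful C code has to make:

```rust
fn main() {
    let a: i64 = i64::MAX;
    let b: i64 = 1;

    // checked_add returns None instead of overflowing, so overflow can be
    // detected after the fact without invoking undefined behaviour.
    match a.checked_add(b) {
        Some(sum) => println!("sum is {sum}"),
        None => println!("adding {a} and {b} would overflow"),
    }

    // The "check in advance" style, roughly the shape of the test careful C
    // code must perform before doing the addition itself.
    if b > 0 && a > i64::MAX - b {
        println!("adding {a} and {b} would overflow");
    } else {
        println!("sum is {}", a + b);
    }
}
```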
For many years, the presence of undefined behaviour in the specification didn't really cause problems. However, over time, compiler authors – a very competitive bunch – started looking for new opportunities to make their compilers faster than others. They started looking at the C language specification with lawyerly eyes, and realised that undefined behaviour allowed them to make "good" programs run faster: by implication, they believe that those who write "bad" programs do not deserve to have their programs run "correctly". Once it became acceptable to exploit undefined behaviour, we ended up on the slippery slope to where we are now.
We thus see a surprising aspect of language specifications: they often aren't comprehensive because they weren't comprehensive. Furthermore, older languages influence newer languages. There might be a parallel universe where the first C language specification decided to give a safe (i.e. not undefined behaviour!) definition to signed integer overflow: in that parallel universe, it's probable that all subsequent language specifications would also use safe definitions for integer overflow [9].
A frustration I have with C's extensive use of undefined behaviour is that it gives the concept a bad name when it really does have uses. For example, I believe that array accesses should as standard check that you are accessing an in-bounds element. However, there is performance critical code where this check is too costly. In such cases, I think there should be a second mechanism for accessing array elements that does not have a bounds check, instead documenting that I must promise not to access out-of-bounds elements. If I break that promise, I will deservedly end up in the land of undefined behaviour. As this suggests, the possibility of undefined behaviour should, in my opinion, be opt-in, and something that users can easily confine to small parts of a program.
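Rust's slices work roughly along these lines (my example, not the post's): indexing is bounds-checked by default, while the unchecked variant is opt-in, unsafe, and easy to confine to a small span of code:

```rust
fn main() {
    let xs = [10, 20, 30];

    // The default: a bounds-checked access that panics on an out-of-bounds
    // index rather than triggering undefined behaviour.
    println!("{}", xs[2]);

    // The opt-in escape hatch: no bounds check, but the caller must promise
    // the index is in bounds. Breaking that promise is undefined behaviour,
    // and the `unsafe` block confines that promise to a small, visible span.
    let y = unsafe { *xs.get_unchecked(2) };
    println!("{y}");
}
```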
Unknown flexibility
There is an interesting, and to me importantly distinct, way that language specifications are influenced by history: sometimes they don't specify some aspects because no-one thought of specifying such aspects.
The most obvious instance of this is memory models. Simplifying significantly, and focussing on just one specific aspect, modern programming languages must specify how memory reads and writes interact across multiple threads if they want users to be able to write safe multi-threaded programs. However, before multicore processors started entering the mainstream (in the years immediately after 2000) there was little need to specify these aspects at all [10], and so no programming language that I know of did so.
Multicore processors meant that programming languages found themselves in a situation where they suddenly needed to carefully specify something they had never considered before. Java was the first language I know of that took this seriously; other languages like C and C++ followed a little later. Although I've slightly lost track of the current state of such matters, I believe that there are still edge cases that mainstream language specifications do not adequately cover.
From this we can see that language specifications are often reactive: sometimes, a previously ignored problem can no longer be ignored; sometimes, something that was not a problem becomes a problem.
Summary
Language specifications are not comprehensive because we don't want them to be, can't make them be, or because they once weren't. At a sufficiently high level, these reasons are indistinguishable, but I find that separating them out really helps me understand what I'm dealing with at any given point.
An important question is: can we do better? Yes, we can. Look at any mainstream language's specification over time, and you'll see that there's a clear trend in the direction of greater precision and coverage. We are fortunate that so many talented people have put so much effort into improving the situation, and must hope that they continue to do so!
However, you will also notice that all mainstream languages' specifications are in English prose. It is tempting to imagine that if we had formal specifications (i.e. "mathematical" or "formal logic" specifications) then all problems would be solved. However, even if it were practical – and for large languages it's at least at the edge of our current abilities – we would still need such specifications to leave flexibility for implementations to do things differently from one another. In other words, it is not the form of specifications that causes them to allow implementations flexibility: that flexibility is inherent to their nature [11].
On a final note, you may have noticed that I have been careful not to mix my opinions of a programming language with my opinions on its specification. I mostly avoid using C these days, because I am too stupid to write reliable programs given the many ways I can cause undefined behaviour. However, the C specification is, mostly, rather good: it covers the vast majority of what should be covered, and I find much of it to be thoughtful and clearly written. In contrast, Rust, a language that I use much more, has a painfully incomplete specification: most obviously there are no meaningful semantics for unsafe code [12]. As this may suggest, an implicit reason why language specifications are not comprehensive is that they are hard, and take years of painstaking work to mature. I try and remind myself how hard they are whenever I am tempted to criticise language designers and specifiers!
Acknowledgements: My thanks to Martin Berger, Lukas Diekmann, and Stephen Kell for comments.
Footnotes
[1] I'm deliberately avoiding using the term "complete" here, because, depending on the context, that has a formal meaning that can slightly distract us.
[2] Note the problem here is that, as in most programming languages, Python allows us to differentiate whether two objects have the same contents or whether two objects are actually the same object (i.e. they have the same identity).
[3] To the surprise of many people, most garbage collected languages do not guarantee to call an object's destructor: instead, the typical guarantee is that a destructor is called at most once. The for loop in this program is designed to force PyPy to run at least one garbage collection cycle: if this loop is removed, you'll see that "__del__" is never printed!
[4] If you have a machine which is substantially slower / faster, or which has a mix of performance and efficiency cores, you might find the timing numbers are a little different.
[5] Interestingly – and, to most of us, surprisingly – there are programs where we really, really don't want this sort of optimisation to happen. Formally speaking, my little program has used a "timing side channel" to infer something about the program being run. Some security-conscious programs require that their algorithms run in constant time so that an attacker can't use a timing side-channel to infer such properties about the program or its data. As this shows, it can be difficult, perhaps even impossible, to accommodate all possible use cases of a language in a specification.
[6] Stephen Kell (who tells me he found out about this from Victor Yodaiken) pointed me at a fascinating example from Kernighan and Ritchie's "The C Programming Language". In the second edition that I have access to, Section 8.7 is an example memory allocator. Page 185 contains this text:
There is still one assumption, however, that pointers to different blocks returned by sbrk can be meaningfully compared. This is not guaranteed by the standard, which permits pointer comparisons only within an array. Thus this version of malloc is portable only among machines for which general pointer comparison is meaningful.
It is difficult to know for sure what result Kernighan and Ritchie expected from their code when it was compiled and run on a machine for which general pointer comparison was not meaningful. They may have expected the code not to compile, though given the relatively simple compilers of the day, this seems unlikely. Probably more likely, they expected the code to compile, and run in a way that could be determined in advance by understanding the hardware. In other words, they probably expected the program to run in a predictable way, even if that wasn't the way they might have wanted.
In modern C, the code clearly makes use of undefined behaviour because of the comparison of arbitrary pointers with < in the free function (see Appendix J of the C11 specification). It is then impossible to predict exactly how the code will execute once it is compiled by a modern optimising compiler.
In case that makes it sound like I know lots about the C specification, I should thank Stephen for doing the detective work!
[7] The cost of detecting overflow heavily depends on the hardware and compiler. I am not sure the situation has much improved since John Regehr's plea 9 years ago.
[8] Though virtually no-one I've met who believes they are unwilling to pay the price knows what the price is!
[9] In my opinion, Rust gets this partly right by requiring that integers have a two's complement representation. This is a reasonable choice for modern hardware, but was not a choice that C could sensibly have made when it was originally specified. By forcing a representation for integers, one can always know what the result of integer overflow will be.
However, in release mode, Rust is allowed to, and does by default, remove all overflow checks, which I believe is too much of a security risk to be acceptable.
[10] As compilers have become more aggressively optimising, a fully fleshed out memory model has also become important for single-threaded code. My impression, though, is that the emergence of multicore hardware was what led to the development of fully fleshed out memory models. Those memory models then made it easier for compilers to define which single-threaded code could be more aggressively optimised.
[11] For example, consider the excellent stream of work from Peter Sewell and friends on formalising C, who call this flexibility "looseness" (see e.g. this paper).
[12] When challenged on this, I ask people to define Rust's aliasing rules. After pointing to various unofficial suggestions (most obviously stacked borrows), I point them to the official documentation which says:
The precise Rust aliasing rules are somewhat in flux, but the main points are not contentious
Since I have often had to use unsafe code in Rust, this has placed me in the unenviable position of looking through forums for hints from Rust and LLVM developers and examining the output of LLVM to come to an educated guess as to what the current semantics might be. I have hope that meaningful progress on this front is about to start.
Distinguishing an Interpreter from a Compiler
January 26 2023
In Compiled and Interpreted Languages: Two Ways of Saying Tomato, I showed how any language can be implemented as an interpreter or a compiler. However, I was deliberately vague when defining how one might distinguish an "interpreter" from a "compiler", in no small part because I couldn't think of a crisp definition of the two terms. Although I wasn't quite as vague as "I know it when I see it", I was uncomfortably close.
It was thus with great interest that I read a comment on the post from a good friend, and language implementation veteran, Mario Wolczko. In a sense, Mario gave two ways of distinguishing compilers from interpreters, but I'm going to quote the one that made my jaw drop:
How to distinguish between a compiler and an interpreter? One way is to look at the time it takes. A compiler should run in time roughly proportional to the size of the source, whereas the runtime of an interpreter is determined by how much work the source program has to do for a given input. For example, suppose the source program reads a number and then has a loop whose trip count is this number. The time it takes to compile this program will be independent of the number, in contrast to the time it takes to interpret the program.
This is, in my opinion, a brilliant way of distinguishing interpreters from compilers.
In case his definition doesn't click with you, let me try a minor variation (ab)using Big O notation. For a program P, we can consider the running time of a compiler to be O(n) where n is the size of the source code (e.g. lines of code [1]). In contrast, the running time of an interpreter is O(f(m)) where m is a concrete input [2] to the program and f() an evaluation function. O(n) grows linearly with n but O(f(m)) may have a non-linear relationship with m.
There are a couple of things I like about this definition. First, it makes clear that compilers run in ignorance of concrete inputs (there is no m in the Big O notation for a compiler above). Second, it makes clear that the running time of an interpreter relates to a specific concrete input, not the size of the program being interpreted, nor to every possible concrete input. If you're used to compilers and interpreters, both facts are "obvious", but I've never seen them defined so crisply before.
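To see the distinction in running code, here is a toy sketch of my own (not an example from the post): a "language" whose only instruction is add1, with a compile step whose cost depends only on the source size n, and an interpret step whose cost depends on the concrete run-time input m:

```rust
use std::time::Instant;

// O(n): a single pass over the source, with no knowledge of run-time inputs.
fn compile(src: &str) -> usize {
    src.split_whitespace().filter(|tok| *tok == "add1").count()
}

// O(f(m)): the amount of work depends on the concrete input m.
fn interpret(num_add1s: usize, m: u64) -> u64 {
    let mut acc: u64 = 0;
    for _ in 0..m {
        for _ in 0..num_add1s {
            acc += 1;
        }
    }
    acc
}

fn main() {
    let prog = compile("add1 add1 add1");
    for m in [1_000u64, 10_000_000] {
        let t = Instant::now();
        let r = interpret(prog, m);
        println!("m={m}: result {r} in {:?}", t.elapsed());
    }
}
```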
How does this style of definition work for the grey zone in between traditional compilers and interpreters? Some readers may have heard of partial evaluation, which takes a program, a concrete input, and partially compiles the program relative to that input. At first glance, that might sound like it has the same complexity as an interpreter, even though it feels like it should be more like a compiler.
Fortunately, I think a simple expansion of Mario's definition shows the difference clearly: we just need to give separate Big O notations for the compile-time and run-time phases of different language implementation approaches:
| | Compile-time | Run-time |
|---|---|---|
| Compiler | O(n) | O(f(m)) |
| Interpreter | O(0) | O(f(m)) |
| Partial evaluator | O(g(m)) | O(f(m) - g(m)) |
Bringing it back to some of the examples from Compiled and Interpreted Languages: Two Ways of Saying Tomato, how might we classify, say, CPython?
| | Compile-time | Run-time |
|---|---|---|
| Compiler | O(n) | O(f(m)) |
| Interpreter | O(0) | O(f(m)) |
| Partial evaluator | O(g(m)) | O(f(m) - g(m)) |
| CPython | O(n) | O(f(m)) |
Now we can have a go at the various BF implementations from Compiled and Interpreted Languages: Two Ways of Saying Tomato:
| | Compile-time | Run-time |
|---|---|---|
| Compiler | O(n) | O(f(m)) |
| Interpreter | O(0) | O(f(m)) |
| Partial evaluator | O(g(m)) | O(f(m) - g(m)) |
| CPython | O(n) | O(f(m)) |
| 1: base | O(0) | O(f(m)) |
| 2: + bracket caching | O(n) | O(f(m)) |
| 3: + opcodes | O(n) | O(f(m)) |
| 4: + opcode bracket caching | O(n) | O(f(m)) |
| 5: + Add(n) | O(n) | O(f(m)) |
| 6: + Sub(n) / Left(n) / Right(n) | O(n) | O(f(m)) |
| 7: + Zero | O(n) | O(f(m)) |
| 8: reorganised | O(n) | O(f(m)) |
This raises an interesting question: when would you say our BF implementations moved from interpreted to compiled? I find it hard to pinpoint the transition definitively. The first BF implementation is definitely interpreted; and by the fifth I'd say it's definitely compiled; but I'm not sure which of the intermediate stages can be stated as the definitive transition point. Rather, the intermediate stages get increasingly hard to classify with certainty.
Using our new definitions has forced me to be clear that even the second BF interpreter has "compiler" Big O notation!
Thanks again to Mario for giving me a definition that's made my thoughts more concrete. If you're interested in language implementation, please be sure to look at the Virtual Machines and Managed Runtimes course he gave, and which is freely available online — it's a wonderful resource!
Acknowledgements: thanks to Mario Wolczko for comments.
Footnotes
[1] This definition would also work for a number of other reasonable metrics, for example, the number of functions.
[2] To keep things simple I'm implicitly assuming that m is an integer.
[3] Another way of looking at this would be to say that m is split into two parts: m_static and m_dynamic. Then the compile-time would be f(m_static) and the run-time would be f(m_dynamic).
[4] There is a long-standing debate as to whether O(0) is equivalent to O(1) or not, since O(1) stands for any constant, which includes zero. If you feel more comfortable with an interpreter being defined as O(1), please feel free to make that substitution in your head for the rest of this post.