What Challenges and Trade-Offs do Optimising Compilers Face?

After immersing oneself in a research area for several years, it’s easy to feel like the heir to a way of looking at the world that is criminally underrated: why don’t more people want what I want? This can be a useful motivation, but it can also lead to ignoring the changing context in which most research sits. So, from time to time, I like to ask myself the question: “what if everything I believe about this area is wrong?” It’s a simple question, but hard to answer honestly. Mostly, of course, the only answer my ego will allow me is “I can’t see anything I could be wrong about”. Sometimes I spot a small problem that can be easily fixed. Much more rarely, I’ve been forced to acknowledge that there’s a deep flaw in my thinking, and that the direction I’m heading in is a dead-end. I’ve twice had to face this realisation, each time after a gradual accumulation of unpleasant evidence that I’d tried to ignore. While abandoning several years’ work wasn’t easy, it felt a lot easier than trying to flog a horse I now knew to be dead.

All of this means that I’m probably a little better disposed to contrarian thinking than I was when I started as a researcher. And so it is that I’m going to write about some thoughts that were initially inspired by a (somewhat controversial) talk by Daniel Bernstein. The talk itself, entitled “Death of Optimizing Compilers” (slides and audio), was presented at ETAPS 2015 in London. I happened to be present for the talk and have engaged in mild advocacy for it both immediately afterwards and in the two years since. The problem is that the case against optimising compilers is quite easy to state, but the case for them rather harder — or, at least, I haven’t managed to frame an argument good enough to convince anyone. But I hope I might now have sharpened my argument enough to make another attempt worthwhile.

Suitably condensed, the talk’s thesis is fairly simple. As software systems become bigger and bigger, an ever smaller percentage of code dominates execution time [1], especially as computers are asked to crunch ever bigger chunks of data. However, optimising compilers (e.g. gcc -O2 or even a VM like HotSpot) are outclassed by humans on the most performance-critical code. Because of that, optimising compilers are operating in a region between a problem that’s not worth solving (most parts of most programs) and a problem they can’t solve (the really performance-critical parts). It doesn’t take much thought to realise why this last point has been the source of controversy: how can someone belittle the huge amounts of effort put into optimising compilers?

Daniel used his own (very cunning) ChaCha20 algorithm as an example of where a human can significantly outdo anything a compiler can. ChaCha20 is effective because the algorithm was invented with the strengths of modern processors in mind: its design and implementation are intricately intertwined in a way that no compiler could ever expect to match. [2]

It may not surprise you that I don’t agree with Daniel’s conclusion. However, sometimes looking at things from an extreme angle can help bring forgotten aspects into focus. So, while I’m not claiming to have any novel insights, here is my attempt to enumerate some often-forgotten (at least by me) issues which limit the effectiveness of optimising compilers.

Balancing program performance and compilation time

We all know that there’s a trade-off between optimisation and compilation time: the more optimised you want your program to be, the longer it takes to compile. At one extreme, language implementations like CPython and CRuby have low compilation costs [3] and spend little or no time on optimisation, but consequently have poor performance. At the other extreme, superoptimisation tries vast numbers of machine-code combinations before picking the fastest. This means that even a tiny change to the tiniest function can take many, many hours to optimise: scaling this up to even “small” functions could be a challenge. Slightly less extreme are C compilers like gcc and clang. These put considerable effort into optimising programs, but generally compile even medium-sized files in at most a second or two. [4]

Would developers be happy if we suddenly increased compilation times by, say, a factor of 10 if it led to a 5% performance increase? Probably not. One possible solution is to add a “really good optimisation but slow to compile” mode to compilers. However, precedent suggests that such a mode would be little used. For example, gcc’s highest optimisation level -O3 is often treated with a mix of fear and disdain: partly because it often doesn’t improve performance [5], but also because it’s thought to more often produce buggy code. Whether this is still a justifiable position, whether it’s because of compiler bugs, or whether it’s due to user bugs that only show up when the code is highly optimised, is almost irrelevant: if no-one is prepared to use such a mode, it might as well not be there. OpenBSD, long my operating system of choice, uses -O2 exclusively for both the base operating system and packaged third-party software. Another possibility is profile-guided optimisation, which I’ll touch on later.

We don’t know which parts of a program to optimise

In my experience, one of the easiest ways of distinguishing a seasoned programmer from those early in their career is their approach to “manual” optimisation. When presented with a slow running program, junior programmers tend to quickly identify what they think is the bottleneck, and start hacking away at it. Seasoned developers, in contrast, tend to have realised from bitter experience that their initial guesses about bottlenecks are nearly always wrong, and first profile performance.

Sometimes, even standard profiling doesn’t help. My standard approach is to profile a program, identify the longest-running function(s), and then stare at them to see if I can improve them. This works in many cases, but fundamentally profiling can only highlight symptoms, not prove causes. For example, if I look at the longest-running function and it’s a bubble sort, I can quickly change it to a faster sorting algorithm. But if it’s a Timsort function, there probably isn’t much I can obviously do to improve it. The problem could be elsewhere (e.g. more than once I’ve seen sort functions used as part of a de-duplication function, which is a slow way of solving that particular problem) and may be very subtle (e.g. bad data structures can cause much more memory to be allocated than necessary). Occasionally, though rarely, I can work out a likely culprit, but judge that the necessary rewrite is too big to justify the gain. But, in most cases, I simply can’t work out whether there is anything which would benefit from further optimisation or not.
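
As a minimal sketch of that de-duplication example (in C, with names invented for illustration): a profiler pointed at the code below will report qsort as the hot spot, even though the real cause of the slowness is the overall approach; a hash table, or not generating the duplicates in the first place, avoids the sort entirely.

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* De-duplicate 'vals' in place, returning the new length. A profile will
       blame qsort for most of the running time, but qsort is only a symptom:
       sorting is a roundabout way of removing duplicates in the first place. */
    size_t dedup(int *vals, size_t n) {
        if (n == 0)
            return 0;
        qsort(vals, n, sizeof(int), cmp_int);
        size_t out = 1;
        for (size_t i = 1; i < n; i++)
            if (vals[i] != vals[out - 1])
                vals[out++] = vals[i];
        return out;
    }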

We can now ask a simple question: if it’s difficult for a human to look at the runtime performance of a specific program and work out what should be optimised, how difficult is it for a compiler? The answer, I fear, is that the compiler has, by far, the more difficult job. Static compilers implicitly make a guess about how a program will execute – without having the luxury of actually executing it [6] – and optimise based on that guess. In some cases, we can make accurate guesses (e.g. we can work out statically how long the machine code sequence for 3 * 4 will take versus 12) but in most we can’t (e.g. will a branch that represents a loop never be taken, taken occasionally, or taken often?). The latter quickly overwhelm the former, meaning that the compiler can have no real idea of which parts of a program are likely to be bottlenecks.
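
A minimal C illustration of that difference (the functions are hypothetical, chosen only for this post): the first guess is one a static compiler can make with certainty; the second depends entirely on runtime data.

    /* The compiler can prove statically that 3 * 4 and 12 produce identical
       results, so folding the constant is always safe and always worthwhile. */
    int scaled(int x) {
        return x + 3 * 4;   /* compiled exactly as if we had written x + 12 */
    }

    /* How many times does this loop run? Never, occasionally, or millions of
       times? That depends entirely on the data passed in, so a static compiler
       can only guess whether this function is a bottleneck worth heavy effort. */
    long sum(const int *vals, int n) {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += vals[i];
        return total;
    }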

The only safe choice the compiler can thus make is to try and optimise all parts of a program, though this has two issues. First, the trade-off between compilation time and optimisation rears its head again. Since some optimisations are either expensive or rarely applicable, we can’t afford to apply them everywhere, especially as they would mostly have no useful runtime effect. Second, optimisations are heavily order dependent. Imagine I have three optimisations A, B, and C: applying A then B might produce code that C can’t do anything with; but applying B then A might produce code that C can optimise. In this simple case we could try both possibilities and, if we’re lucky, work out which outcome is better and use that. In a real compiler, with a very large number of optimisations, combinatorial explosion is real and this is impractical. This means that even though our compilers sometimes contain all the optimisations they need to produce high-performing code, they don’t run them in the right order (or can’t run them for long enough) to optimise all parts of a program. [7] In practice, therefore, trying to optimise all parts of a program means doing a so-so job on all parts of a program. There are two techniques which try and identify the sub-parts of a program that would most benefit from heavy optimisation.
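
Before turning to those techniques, here is a small, contrived C sketch of the ordering problem (the function names are invented for illustration): if inlining runs first, constant folding and then dead-code elimination can delete the branch entirely; run dead-code elimination first and it finds nothing to remove.

    static int scale(int x) {
        return x * 4;
    }

    int f(void) {
        int y = scale(3);   /* inlining turns this into y = 3 * 4 ...           */
        if (y > 100)        /* ...constant folding then makes the test false... */
            return -1;      /* ...and only then can dead-code elimination       */
        return y;           /*    remove the branch altogether.                 */
    }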

The first is profile-guided optimisation: in essence, one statically compiles a program, profiles an execution or two, then feeds that back into another run of the static optimiser, which uses the information to home in on the parts of the program which will most benefit from optimisation. Doing this allows a compiler to run its optimisers for longer on smaller parts of a program, rather than wasting effort trying to optimise the entire program. This can yield useful performance benefits (see e.g. the 10% improvements that Chrome and Firefox reported), but it’s difficult to set up, compilation becomes even slower, and it only helps if all subsequent runs follow the same general pattern as the profiled run(s). I suspect the last point is the most critical: if people use your program in different ways from run to run, profile-guided optimisation can even degrade performance.

The second is dynamic compilation (i.e. JIT compilation), which can be thought of as live, continuous profile-guided optimisation: every time a program runs, dynamic compilation observes its behaviour, and compiles based on that behaviour. Thus dynamic compilation has the advantage over static compilation that it can much more accurately identify the genuine bottlenecks in a program. However, dynamic compilation has to be even more mindful of the costs of optimisation, since time spent dynamically compiling is bad for the user (draining their battery or simply taking potential CPU time away from the user). Thus while dynamic compilation is the most effective way of determining which parts of a program will benefit most from optimisation, it has much less leeway to use expensive optimisations than static compilation.

Finally, whether static or dynamic, all compilers contain a finite number of optimisations: they may not contain the optimisation that our program needs to run fast. There is a small-ish set of optimisations which it’s safe to assume that every optimising compiler (static or dynamic) contains. After that, things become, in my experience, a lot more reactive. If a user shows a compiler developer a program which runs unexpectedly slow, the compiler developer will probably add, or tweak, an optimisation to make it perform better. Over time, therefore, different compilers tend to do well on an unpredictable subset of all possible programs. You can often see this in the real world where different optimising compilers for the same language perform much better or worse on a given program. Is there anything we can do about this? I don’t think so. After all, we can’t (or, at least, probably shouldn’t) optimise what we haven’t seen: benchmarks are, to a large degree, destiny. [8]

We don’t know where the performance ceiling is

One of the problems that optimising compiler authors face is that we don’t really know if we’re doing a good job or not. If optimisation makes a given program 50% faster, that might be really good, or it might be somewhat mediocre: since we generally don’t know what the best possible performance of a program is (what I call its “performance ceiling”), we have no real idea how far away we are from it. The closest we come to a “performance oracle” is when there are multiple compilers for a given language: if another compiler makes a program run faster than our compiler, we know we have work to do.

Given that our current static compilers seem to be well within the “diminishing returns on further optimisations” category, it might appear that we’re close to the performance ceiling. But, for all we know, we could be stuck in a local maximum. Are there any ways that we might be able to tell? Superoptimisation, mentioned above, suggests that there are potentially many more performance gains to be had, although it has only been applied to small functions. Dynamic compilation is the other obvious possibility: since much more research has been put into static compilers, we don’t really know if current comparisons of static and dynamic compilation are the end of the story or not. [9]

Maintaining implicit properties

As Daniel pointed out, compilers have a narrow view of optimisation: they’ll optimise x + 3 * 4 into x + 12, but I don’t know of any that will turn a bubble sort function into a merge sort. In part this is because recognising that a particular piece of code implements bubble sort is hard (think of all the minor, obfuscating variations). But, more significantly, it’s because compilers are generally unable to prove that such a major change won’t violate properties of the program explicitly intended by the user. For example, what happens if the user’s bubble sort is coded that way because they know the program needs to run in constant memory, and they’re prepared to trade off execution speed for memory? That property is implicitly encoded in the bubble sort, and the compiler has no way of knowing whether it was explicitly intended or not.
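
A minimal sketch of such an implicit property (in C, written for this post): the in-place bubble sort below uses constant extra memory, and nothing in the code tells a compiler whether that property is accidental or essential, so it cannot safely substitute an algorithm (such as a typical merge sort) that allocates auxiliary storage.

    #include <stddef.h>

    /* Sorts in place using O(1) extra memory. That constant-memory property is
       implicit: the compiler cannot tell whether it was intended, so it must
       not silently swap in an algorithm with different memory behaviour. */
    void bubble_sort(int *vals, size_t n) {
        for (size_t i = 0; i + 1 < n; i++)
            for (size_t j = 0; j + 1 < n - i; j++)
                if (vals[j] > vals[j + 1]) {
                    int tmp = vals[j];
                    vals[j] = vals[j + 1];
                    vals[j + 1] = tmp;
                }
    }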

Real programs contain countless such examples of implicit properties: indeed, most of us, when programming, don’t even consider how much we expect the optimised version of our program to reflect the program as we wrote it. Partly in recognition of these implicit properties, compilers are often rather conservative in their optimisations: for example, with the notable exception of inlining, they are generally reluctant to perform optimisations that silently increase memory usage. Because of this, a human who can reason about the importance of such properties, and disregard them if they are not important, always has a potential edge over a compiler.

Over-prioritising optimisation has consequences

Experience suggests that over-prioritising optimisation can lead to problems in other areas, notably correctness and security. The most obvious, though by no means the only, example of this is C. Most widely known are the problems with undefined behaviour. To cut a long story short, the C specification says, more often than most other languages, “if you do X, the behaviour is undefined”, at which point the program can do absolutely anything, whether you intended it to or not.

The problem is that the vast majority of such cases cannot be detected statically: that is, it’s often impossible to know if your program could exhibit undefined behaviour or not. Have you remembered, for example, to add a runtime check that the divisor of every non-constant division is non-zero? That one’s well known enough that people will tend to check it. But what happens when signed integer addition overflows? C gives no guarantees about the outcome, and most programs have so many additions that we struggle to reason about which ones we might need to be careful about. You can read more examples here.
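
As a hedged C example of how easy this is to get wrong: the overflow check below looks sensible, but because signed overflow is undefined behaviour the compiler is entitled to assume that x + 1 never overflows, and optimising compilers can and do remove such checks.

    #include <limits.h>

    /* Intended to detect overflow before returning a wrapped-around value. But
       if signed overflow "cannot happen", then x + 1 < x can never be true, so
       an optimising compiler may reduce this function to a plain 'return x + 1;'.
       A well-defined version would test 'x == INT_MAX' before doing the addition. */
    int increment_checked(int x) {
        if (x + 1 < x)
            return INT_MAX;
        return x + 1;
    }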

It goes without saying that a large number of unfortunate bugs (including many security alerts) have occurred because of undefined behaviour. However, the problem has become much worse in recent years as C compilers optimise programs based on the assumption that no undefined behaviour exists. In other words, compilers are now interpreting the standards more strictly, which, ironically, means that they have become less conservative in their optimisations. No doubt a few benchmarks increase in speed because of this, but my guess is that it makes only a very small difference to most programs’ performance.

I used to think that I could reliably program in C, but the exploitation of “minor” undefined behaviour has undermined my confidence. To add insult to injury, not only does C have far more undefined behaviour than other languages, making it more likely that one will forget some of the less common cases, but the C standard isn’t even freely available (though one can get drafts, which are “close” to the final standard). This has led to deep unhappiness on both sides: programmers tend to say “please don’t violate widely held expectations even if the standard doesn’t mention those expectations”; compiler writers have a tendency to say “we’re following the standard, and are making your programs faster: if there’s a problem in your program, you should fix it.” If there’s any single factor which will hasten the decline of C as a widely used language, I suspect this is it. [10]

C compilers are the most obvious example of this problem, but not the only ones. There are at least a couple of cases where, depending on the day of the week, I can convince myself either way. A good example of this is Python’s memory semantics. In theory, Python long specified that programmers should not rely on the deterministic destruction of objects. However, since the only implementation of Python was CPython, which is reference counted – and since most people don’t read, or remember, the language specification – many programs came to rely in subtle ways on deterministic destruction (e.g. implicitly expecting that file descriptors would be freed at the end of a function). When PyPy came along, it soon came to favour non-deterministic destruction. Doing so allowed PyPy to have a high performing state-of-the-art garbage collector, but it did break a number of programs (e.g. many ran out of file descriptors). Is that a good trade-off? Could, or should, PyPy have used some mix of reference counting and “proper” garbage collection? Since this is pretty much the only intentional “incompatibility” between CPython and PyPy it feels just about reasonable to me, although I’m sure some people feel differently.

Summary

It’s important to be realistic: most people don’t care about program performance most of the time. Modern computers are so fast that most programs run fast enough even with very slow language implementations. In that sense, I agree with Daniel’s premise: optimising compilers are often unimportant. But “often” is often unsatisfying, as it is here. Users find themselves transitioning from not caring at all about performance to suddenly really caring, often in the space of a single day.

This, to me, is where optimising compilers come into their own: they mean that even fewer people need to care about program performance. And I don’t mean that they get us from, say, 98 to 99 people out of 100 not needing to care: it’s probably more like going from 80 to 99 people out of 100. This is, I suspect, more significant than it seems: it means that many people can go through an entire career without worrying about performance. Martin Berger reminded me of A. N. Whitehead’s wonderful line that “civilization advances by extending the number of important operations which we can perform without thinking about them”, and this seems a classic example of that at work. Even better, optimising compilers are widely tested and thus generally much more reliable than the equivalent optimisations performed manually.

But I think that those of us who work on optimising compilers need to be honest with ourselves, and with users, about what performance improvement one can expect to see on a typical program. We have a tendency to pick the maximum possible improvement and talk about it as if it’s the mean, when there’s often a huge difference between the two. There are many good reasons for that gap, and I hope in this blog post I’ve at least made you think about some of the challenges and trade-offs that optimising compilers are subject to.

Acknowledgements: My thanks to Edd Barrett, Martin Berger, Carl Friedrich Bolz, Lukas Diekmann, and Sarah Mount for substantial and insightful comments on early drafts. Any errors, omissions, or infelicities, are my own.

Footnotes

[1]

Most readers will be familiar with Knuth’s quip that “premature optimisation is the root of all evil.” However, I doubt that any of us have any real idea what proportion of time is spent in the average part of the average program. In such cases, I tend to assume that Pareto’s principle won’t be too far wrong (i.e. that 80% of execution time is spent in 20% of code). In 1971, a study by Knuth and others of Fortran programs found that 50% of execution time was spent in 4% of code. I don’t know of modern equivalents of this study, and for them to be truly useful, they’d have to be rather big. If anyone knows of something along these lines, please let me know!

[2]

Daniel finished with an idea as to how this limitation could be overcome: instead of compilers being a black box, programs should be able to interact with them to help the compiler optimise them. While there is some work in this general direction (see e.g. Haskell’s approach), I think that pushing it as far as Daniel would like is hard. Certainly, I’ve done enough work in macros and compile-time meta-programming to have some idea of how difficult it is to find a good design for interactions between a program and compiler.

[3]

Though it is important to note that both do compile to bytecode internally. The oft-stated distinction between “interpreted” and “compiled” languages is a red-herring: any language can be interpreted or compiled (and some language implementations interpret compiled code).

[4]

Interestingly, modern languages seem much happier with lengthier compile times. rustc, which I use a lot, is not fast, even with optimisations turned off; and scalac, when I last used it, was even slower for small files (though since, I think, it doesn’t do full program optimisation statically, I imagine it’s faster in some senses than rustc at compile-time). I can’t say why this is for sure, although I do have a potentially plausible explanation.

Take gcc as a concrete example: its first release was in 1987. At that point, I believe that the Cray-2 was the second fastest machine on the planet, operating at 2 GFLOPS (and costing something like £25,000,000 in today’s money); for reference, my 2016 i7-6700K desktop processor (£301) is something like 80-115 GFLOPS. Using GFLOPS as a measurement of speed of anything is dangerous, particularly when it comes to compilers (which do few, if any, floating point operations). But the general trend is clear: a modern commodity processor is a couple of orders of magnitude faster than a 30-year-old supercomputer. It’s safe to assume that compiling programs took quite a bit longer in wall-clock time in 1987 than today, and that therefore people would have put quite a lot of effort (consciously and unconsciously) into making compilers run acceptably fast.

Fast forward to today, and the optimisations that made a compiler run not-too-slow in the late 1980s make it blazingly fast today. In contrast, modern language designers and compiler implementers are happy with compile times that are, relatively, much slower but still, in absolute terms, just about acceptable (perhaps around a couple of seconds for a medium-large file, as a rough rule of thumb; which, I would guess, is probably faster in wall-clock time than a C compiler in 1987).

As a ridiculously simple comparison, compiling “hello world” on my laptop (median value over 10 runs – I’m not pretending this is a paper-strength methodology) takes 0.037s for gcc, 0.173s for Go, and 0.287s for Rust.

[5]

In the context of LLVM/clang it appears that -O3 optimises no better than -O2 (at least in 2013 when that paper was written).

[6]

Abstract interpretation is an attempt to do this. It’s got real uses in bug finding, but I’m not sure that it would do very well at performance prediction.

[8]

My experience suggests that this is a more common issue in dynamic optimising compilers, which in some sense have more opportunities – or, at least, more information – to optimise, but also more constraints (chiefly time). However, this may partly be a function of the richness of modern languages, which are more likely to be dynamically compiled and optimised.

[9]

My gut instinct is that many programs in modern languages such as Haskell would run faster if they were dynamically (rather than statically) compiled. The basic reason is simple. From a runtime perspective, the pattern matching that is common in such languages is just dynamic typing in disguise (the fact that the pattern matching statically enforces exhaustiveness is of no use beyond the type checker). Thus the static compiler is either going to decline to inline the branches of pattern matches, or it’ll guess (often wrongly) that one branch is more commonly taken, or (less likely) it’ll try and inline all branches. Since inlining makes pretty much every other optimisation much more effective, this inability to statically optimise must cause performance issues. However, I should be clear: I currently have no evidence for this other than my gut instinct, which I suspect might not stand up in a court of justice.

[10]

There have been several attempts to define a “safe” subset of C (e.g. this) that is immune to the most egregious examples of undefined behaviour, but it’s not clear that any of these will gain sufficient momentum to become widely accepted. That’s a pity, but such is life.
