Laurence Tratt: A Proposal for Error Handling

A while ago I wrote about my experience of writing extsmail, and how surprised I was that highly reliable and fault tolerant programs could be written in C. In large part I attributed this to the lack of exceptions in C. In this article, I expand upon this point, consider some of the practical issues with exception-based languages, and present a candidate language design proposal that might partly mitigate the problem. I don’t promise that this is a good design; but it does present some of the issues in a different way than I’ve previously seen and, if it encourages a debate on this issue, that might be useful enough.

The two approaches

There are two main approaches to trapping and propagating errors which, to avoid possible ambiguity, I define as follows:

Error checking is where functions return a code denoting success or, otherwise, an error; callers check the error code and either ignore it, perform a specific action based on that code, or propagate it to their caller. Error checking does not require any support from the language, compiler, or VM; it depends entirely on programmers following the convention. Languages which use this idiom include C.
Exceptions imply a secondary means of controlling program flow other than function calls and returns. When an exception is raised, this secondary control flow immediately propagates the exception to the function’s caller; if a try block exists, the exception can be caught and handled (or rethrown); if no such block exists, the exception continues propagating up the call stack. Exceptions require certain features in the language, and support from both compiler and VM. Languages which use this idiom include Java, Python, and Converge.

Outline of the problem

The fundamental problem as I see it can be concisely expressed: exceptions allow predictable systems to be written with much less code than with error checking; but exceptions make writing highly fault tolerant code difficult. The former point is, I’m fairly sure, uncontentious; with exceptions, one needn’t explicitly check and propagate most errors, negating the need for vast wodges of code. For many systems, this is fine - indeed, it is desirable. In general, most of the small and medium sized systems I write have little to no exception checking; if something odd happens (e.g. a file is missing) I prefer such systems to immediately exit rather than limp on to an unpredictable end.

The problem comes when one is trying to write highly fault tolerant code, by which I mean systems which try to keep on running even when undesirable events or situations arise. I documented my experiences when writing the (tiny) e-mail sending program extsmail earlier. Highly fault tolerant systems need to deal explicitly with almost every error that can occur; with judicious thought, it is amazing how many errors can either be partly or fully recovered from. In extsmail, the only fatal error (i.e. the system immediately exits) is when memory is exhausted. I would guess that about 40% of extsmail’s code is to do with error checking and handling - such fault tolerant systems are much more expensive to produce than the alternative. extsmail is written in C, a language which much folklore would suggest is fundamentally ill-suited to the task. The fact that extsmail - and indeed, most of the other C software I use - works and, I hope, works reasonably well, continues to grate against my long-held prejudices; yet I honestly don’t think I could write a system which is as fault tolerant in any other comparable language - in particular, in any exception-based language of my acquaintance.

Some readers might interpret the above as being a position based either on ignorance, or a hatred of exceptions. While I generally plead guilty to charges of ignorance, I can provide some evidence that in this rare situation it’s not the case. The language that I designed - Converge - is an exception-based language. To the charge of hatred I say that, except when writing highly fault-tolerant systems, I believe exceptions are the best way of dealing with errors.

If I don’t hate exceptions, what’s the problem? Here’s the rub. In theory there is no real problem with exceptions; they’re clearly a more pleasing solution to the problem than error checking. The problem comes because they seem to inevitably lead to a certain style of programming that is not suitable for highly fault tolerant code. This style permeates every exception based language and library I have seen. Before I go into more detail as to what the problem is, it’s best to take a step back and look at the competing solutions.

Error checking

Let’s consider error checking in a C-like environment (I say C-like because I wish to avoid the horror that is errno; let us also assume that this C-like environment allows functions to return multiple values). A possible example is the following:

int err;
if ((err = write(...)) != 0) {
  switch (err) {
    case EINTR:
      ...
    case EAGAIN:
      ...
    case ...:
      ...
  }
}

In other words, a function write is called and, if it returns an error, all the possible errors that the function can return are explicitly dealt with. There are three important things implicit in the above. First, functions uniformly denote success or error (most Unix C functions denote success by returning 0; values other than 0 indicate failure). Second, errors are simple integer values. This means that they can easily be stored and compared against pre-defined error codes. Third, functions carefully document which errors they can return so that the caller need only handle those errors.
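The same convention can be sketched in runnable form. The Python below is only an illustration under stated assumptions: `checked_write` and `handle_write` are hypothetical names, and since the real `os.write` raises `OSError` rather than returning a code, the wrapper converts that exception back into an integer to imitate the C-like environment assumed above.

```python
import errno
import os


def checked_write(fd, data):
    """Hypothetical wrapper: return (bytes_written, err), with err == 0 on success."""
    try:
        return os.write(fd, data), 0
    except OSError as e:
        # Turn the exception back into a plain integer error code.
        return 0, e.errno


def handle_write(fd, data):
    """Dispatch explicitly on each error code this caller knows about."""
    n, err = checked_write(fd, data)
    if err == 0:
        return "wrote %d bytes" % n
    elif err == errno.EINTR:
        return "interrupted: retry"
    elif err == errno.EAGAIN:
        return "would block: try later"
    elif err == errno.EBADF:
        return "bad file descriptor"
    else:
        return "unhandled error %d" % err
```

Because the error is an ordinary integer, it can be stored, compared, and passed on like any other value, which is the second of the three implicit points above.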

In general, it’s unusual to explicitly deal with each different error that a function can return. At the other extreme, one can choose to simply ignore the error returned by a function:

write(...);

which is equivalent to the exception-based code try { write } catch (Exception) { } in, say, Java - although noticeably terser. In practice, one will also see many instances of the following idiom:

if (write(...) != 0)
  err(1, "Fatal error.\n");

This is a rough approximation of the concept in exception-based systems of exiting the whole program if an exception is not caught at some level. This particular idiom in error-checking systems is a pain in the rear end: it’s tedious and verbose to write; and if one has several points that share the error message Fatal error then it may not be possible to easily distinguish which one of them triggered.

As the above hopefully suggests, error checking code suffers several problems including: verbosity; the ease with which minor typos escape notice; and the difficulty of retrospectively changing APIs (extending the errors that a function returns would, in general, necessitate changing - or at least checking - all callers of that function). However there is one point which, while it might initially seem a problem, turns out to have interesting positive consequences. Since error checking relies entirely on convention, every function that follows this idiom must carefully document what errors it can return; without that, the idiom would be unusable. We’ll return to this shortly.

Exceptions

Most readers are probably familiar with exception based systems as the majority of modern languages utilise them in one form or another. It is this familiarity which I suspect numbs us to one of the major practical problems with exceptions. Let’s recast the initial error checking example into exceptions:

try {
  err = write(...);
}
catch Interrupt_Exception e {
  ...
}
catch Again_Exception e {
  ...
}
catch ... {
  ...
}

As this suggests, it’s possible to almost exactly emulate the error checking approach in an exception based language. However, one will almost never see the above idiom in an exception based language (I don’t think I’ve ever seen it). Exceptions are typically organised into a hierarchy so one is much more likely to see:

try {
  err = write(...);
}
catch IO_Exception e {
  ...
}

Oddly enough, this organisation of exceptions into hierarchies, while convenient, is not something I’m keen on for fault tolerant systems. The reason is that, by abstracting away from the specific error that occurred, it tends to give a false sense of security that one is dealing with a given exception in the right way. For example, most systems have a multitude of IO related exceptions, yet it is rare for anyone to deal with anything other than the top-level IO exception class; in a truly fault tolerant system many of those sub-exceptions are likely to be best dealt with differently than others.
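To make the contrast concrete, here is a minimal sketch in Python, whose built-in `OSError` plays the role of the top-level IO exception. The functions `coarse`, `fine`, and `raiser`, and the particular recovery strings, are hypothetical; they only illustrate how one handler for the whole hierarchy hides distinctions that a fault tolerant system might care about.

```python
def coarse(action):
    """Typical style: one handler for the whole I/O hierarchy."""
    try:
        action()
        return "ok"
    except OSError:
        return "io error"


def fine(action):
    """Fault tolerant style: each sub-exception gets its own response."""
    try:
        action()
        return "ok"
    except InterruptedError:
        return "retry now"
    except BlockingIOError:
        return "retry later"
    except BrokenPipeError:
        return "reconnect"
    except OSError:
        # Anything in the hierarchy not listed above.
        return "give up"


def raiser(exc):
    """Build a stand-in action that fails with the given exception."""
    def action():
        raise exc
    return action
```

With these definitions, `coarse` gives the same answer for a broken pipe and an interrupted call, while `fine` separates them.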

Of course, the beauty of exceptions is that they don’t need to be explicitly handled. Furthermore, there is much less potential to silently, and accidentally, swallow an error (as can easily happen in error-checking systems; and assuming one doesn’t have a brain-dead checked exception system as in Java). A system which uses exceptions is entirely predictable: it will run reliably and tend to fail reliably and quickly. In contrast, a buggy error-checking system will often fall over long after the real error occurred, in a way which makes debugging painful at best.

The problem with exceptions

Ironically it is the ease of exceptions which I believe is their downfall.

The first problem is theoretical in nature. Most languages don’t include exceptions in their typing system. This means that there is no way of knowing what exceptions a function will throw. As such, this isn’t a problem: I’m not known as a big fan of static typing anyway. The problem then becomes a cultural one: the exception based languages with which I am familiar only lightly document what exceptions a function can throw. Even when they document the exceptions, it is rarely clear exactly what circumstances will lead to the exception being raised; and it is generally equally ambiguous as to exactly what the exception being raised means. Compare that with the terse, but invariably precise, descriptions of the C Unix man pages. Each function religiously documents the error codes it will return and what they mean. Clearly this is a cultural problem at heart - there’s no real reason why exception based systems have to have such poor documentation. But I would argue that, because most programs have no need to precisely know what exceptions a function could throw, there is no cultural pressure to document such things. And, as soon as one function is imprecisely documented, any function which calls it is inevitably going to be imprecise in its documentation (even if it doesn’t realise it). Poor documentation of exceptions is thus difficult to avoid.

The second problem is due to the ease with which exceptions can be thrown. Most functions, if carefully analysed, can potentially throw a vast number of exceptions. This is hardly surprising. Every look-up of an element in a list; every division of an integer; not to mention the fun that polymorphism can bring to the table; all of these things can potentially raise an exception. The culture of most exception based languages is not to see this as a problem and thus not to be defensive: exceptions safely catch problems, so little to nothing is done to check in advance that a given action will not raise an exception. In contrast, error-checking languages have to be defensive: if the pre-conditions for an action are not checked in advance, the system is likely to go wrong in unpredictable fashion. Therefore error-checking systems force programmers to explicitly consider every error that a function might have to return; and most of them are either dealt with explicitly or, at least, documented. In general, functions in error-checking languages return far fewer distinct errors than functions in exception-based languages can raise exceptions.

The third problem is related to the second. Philosophically speaking, I believe that there are two types of exceptions:
1. Programmer cock-ups (e.g. pop’ing an empty list; division by zero).
2. The inability of an external thing to fulfil its contract (e.g. writing to a network socket which has been closed by the other end).

Programmer cock-ups are frequent (at least, they are in the programs I write), and mean that one or more assumptions underlying the program in question have been violated. These to me are very bad things: in general, I wish the program to be immediately terminated so that the program doesn’t limp on to an even worse death later, and so that I can easily pinpoint the underlying cause. Furthermore, by definition, I as the programmer don’t mean to make these cock-ups, so I don’t put try ... catch statements in to handle them. In other words, I think that exceptions cover programmer cock-ups very well. In contrast, the inability of an external thing to fulfil its contract is different. We all know that network sockets can close at any time; so whenever reading from a network socket, it is reasonable to expect to check for read errors. In fault tolerant systems, one is likely to try and deal with all such errors explicitly.
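The two categories can be sketched side by side. In the Python below, `average` and `read_reading` are hypothetical names chosen for illustration: the first leaves a programmer cock-up to surface as an uncaught exception, and the second treats the failure of an external source as something to be received and handled explicitly.

```python
def average(samples):
    # Programmer cock-up if samples is empty: ZeroDivisionError is deliberately
    # left uncaught so that the program stops at the real cause.
    return sum(samples) / len(samples)


def read_reading(source):
    """External failure: the source is expected to break its contract sometimes."""
    try:
        return source(), None
    except ConnectionError as e:
        # Handled explicitly; the caller decides whether to retry or reconnect.
        return None, e
```

A caller of `read_reading` always gets a pair back and must look at the second element, whereas a caller of `average` writes no handling code at all.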

The fourth problem relates to the try { ... } catch { ... } construct. Any code in the try block which raises an exception will be dealt with appropriately. The difficulty here is the frequency with which too much code is put into this try block. The classic case I see looks as follows:
```
try {
  f(g(...));
}
catch (Exception) {
  ...
}
```
The intention here is that if f throws an exception, the system still soldiers on. The problem is that the ease with which code can be put in the try block means that the evaluation of f’s parameters is also included; and it is rarely meant to be. In other words, any exceptions which g raises will be silently - and in this case unintentionally - swallowed. In an error checking approach, the call to g would have to be explicitly checked for errors, largely preventing this incorrect idiom.
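The effect is easy to reproduce. In this Python sketch, `f` and `g` are hypothetical stand-ins with made-up failure conditions; `too_wide` mirrors the idiom above, and `narrow` shows one way to keep the argument evaluation outside the try block.

```python
def g(x):
    """Stand-in argument computation: fails on negative input."""
    if x < 0:
        raise ValueError("g: negative input")
    return x * 2


def f(y):
    """Stand-in operation whose failure we are prepared to tolerate."""
    if y > 10:
        raise RuntimeError("f: value too large")
    return y + 1


def too_wide(x):
    # g(x) is evaluated inside the try block, so its failure is swallowed too.
    try:
        return f(g(x))
    except Exception:
        return None


def narrow(x):
    # Evaluate the argument first; only f's failure is tolerated.
    y = g(x)
    try:
        return f(y)
    except RuntimeError:
        return None
```

Calling `too_wide(-1)` quietly returns `None` even though the failure came from `g`; `narrow(-1)` lets the `ValueError` propagate.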

For average programs, none of these points is a huge problem; in fact, they arguably make a programmer’s life easier. But for fault-tolerant systems all are genuine issues: it is impossible to write a fault tolerant system if you are unsure what errors you need to deal with; if the number of errors you are required to deal with becomes too great, the sheer magnitude of the task will overwhelm you; if you’re unclear what errors are the result of your mistakes and which aren’t, debugging becomes a philosophical minefield; and if you tend to be too crude in the granularity with which you check errors, mistakes will happen.

A proposal

At a high-level, what I believe might be an improvement on the current situation is a feature which, to some extent, merges the better parts of error-checking and exceptions. Let us therefore consider what the best parts of each approach are. Error checking forces programmers to be considerate both of the errors that they have to deal with, and the quantity of errors they return to the caller. Exceptions make code far smaller and are particularly good at dealing with programmer cock-ups and very unusual situations (e.g. out of memory errors) which virtually no-one wants to deal with explicitly.

What both error-checking and exceptions provide is a way of returning two things from a function: an error code / exception, and the results of the function. The beginning part of my proposal is for an operator which pulls these two things apart in one go. For argument’s sake, let’s use back-slash \ as this operator. On the left-hand side of \ are the function’s normal return values; on the right-hand side its error code. Error codes are integers (reflecting one of my prejudices; but the type of the error code is not hugely important), with 0 indicating success. We can thus write code such as:

bytes_written \ err = write(...);
if (err == EINTR)
  ...

On the other side of the coin, we need a way for functions to return errors (or not). Mirroring the above syntax, return x returns x as per normal and sets the returned error code to 0 (or whatever the no error value is). return x \ e returns x and sets the returned error code to e (note that it would be valid, though syntactically redundant, to explicitly return the no error value).
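No existing language offers the back-slash operator, so the following Python is only an emulation under stated assumptions. `WithError`, `splits`, and `NO_ERROR` are hypothetical helpers, and the `write` function with its `limit` parameter is a made-up stand-in: a plain `return x` is treated as carrying the no error value, and returning `WithError(x, e)` stands in for `return x \ e`.

```python
import errno

NO_ERROR = 0


class WithError:
    """Stand-in for returning a value together with an error code."""
    def __init__(self, value, err):
        self.value = value
        self.err = err


def splits(func):
    """Make func always yield a (value, err) pair for the caller to pull apart."""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, WithError):
            return result.value, result.err
        # A plain return implies the no error value.
        return result, NO_ERROR
    return wrapper


@splits
def write(buf, limit):
    """Hypothetical write that accepts at most `limit` bytes per call."""
    if limit == 0:
        return WithError(0, errno.EAGAIN)  # value 0 with error code EAGAIN
    return min(len(buf), limit)            # plain return: error code is 0


bytes_written, err = write(b"hello", 3)
```

The tuple unpacking on the last line plays the part of `bytes_written \ err = write(...)`.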

Although it’s not directly related, an obvious problem with error checking is the difficulty with extending an API; any error codes not explicitly checked for will be silently swallowed. To some extent this is a deliberate design decision: it should be culturally difficult (not impossible; but definitely difficult) to extend the range of errors that a function returns (exceptions provide no encouragement to follow this rule whatsoever). However, I also assume that any language with this feature will have an equivalent of Converge’s ndif statement which raises an exception if none of its branches match.
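The behaviour described for ndif can be approximated in Python with a final branch that raises. This is a sketch of the idea only, not of Converge’s actual semantics; `UnmatchedErrorCode` and `on_write_result` are hypothetical names.

```python
import errno


class UnmatchedErrorCode(Exception):
    """Raised when no branch handles an error code (hypothetical name)."""


def on_write_result(err):
    if err == 0:
        return "done"
    elif err == errno.EINTR:
        return "retry"
    elif err == errno.EAGAIN:
        return "wait"
    else:
        # Refuse to swallow codes that were added to the API later.
        raise UnmatchedErrorCode(err)
```

If the function being called starts returning a new code, callers written this way fail loudly at the dispatch site instead of carrying on as if nothing happened.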

So what has all this bought us? Well, at first glance, we’ve just got an error-checking system which formalises the way in which errors are returned. Indeed, this is a large part of the story. We now have a simple, uniform way of performing the error-checking approach. But, if this was all we’d got, we wouldn’t have advanced much beyond C. By formalising how errors are returned, we can also define what happens if the error code is not explicitly checked for. In other words, what does the following code do?

bytes_written = write(...);

Answer: if write raises an error and it is not dealt with, the error turns from an integer error code into a standard exception. In general, I would not expect such exceptions to be caught; they would terminate the program, leaving behind a simple to debug line-by-line backtrace.
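A rough emulation of this rule is possible in Python, with one important caveat: Python cannot tell whether a caller used the back-slash form, so the sketch below assumes two explicit call forms instead. Calling `write(...)` directly is the unchecked form, and `write.split(...)` stands in for `x \ err = write(...)`. The names `UncheckedError` and `error_returning`, and the toy `write` with its `limit` parameter, are all hypothetical.

```python
import errno
import os


class UncheckedError(Exception):
    """Exception produced when an error code is not explicitly received."""
    def __init__(self, code):
        name = errno.errorcode.get(code, code)
        super().__init__("%s: %s" % (name, os.strerror(code)))
        self.code = code


def error_returning(func):
    """func returns (value, err); expose an unchecked and a checked call form."""
    def unchecked(*args, **kwargs):
        value, err = func(*args, **kwargs)
        if err != 0:
            # Nobody asked for the code, so it becomes an ordinary exception.
            raise UncheckedError(err)
        return value
    unchecked.split = func  # checked form: hands back (value, err) untouched
    return unchecked


@error_returning
def write(buf, limit):
    """Toy write that accepts at most `limit` bytes per call."""
    if limit == 0:
        return 0, errno.EAGAIN
    return min(len(buf), limit), 0
```

Here `write(b"hello", 0)` raises `UncheckedError`, while `write.split(b"hello", 0)` returns the pair and leaves the decision to the caller.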

Finally, I expect exceptions to be maintained almost exactly as they exist in current languages, including the throws / raise construct (which can also be called with an error code, so that if a calling function doesn’t know what to do with a particular error code, it can be turned into an exception by that caller). The only difference I would make is that I would forbid exception hierarchies. This is because in this proposal, it’s not really expected that exceptions will ever be caught (with one caveat): they’re really just a debugging aid and, as such, exception hierarchies only muddy the waters. Because there are fault-tolerant systems such as network servers - where staying alive is the number one priority - I would also maintain try ... catch, though I would expect it to be little used other than to surround a top-level call, with the catch block restarting the server (or a similar recovery mechanism) in the event of an exception.
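The one remaining use of try ... catch might look something like the following Python sketch. `supervise` is a hypothetical name, and the `max_restarts` limit is an assumption added here so that the loop cannot spin forever; the text above does not specify any such limit.

```python
def supervise(serve, max_restarts):
    """Run serve(); on any exception restart it, up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return serve()
        except Exception as e:
            if restarts == max_restarts:
                # Out of restarts: let the exception terminate the program.
                raise
            restarts += 1
            print("restarting after:", repr(e))
```

Everything below the top-level call is then free to treat exceptions purely as a debugging aid.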

Intentions

The intention of the proposal is twofold:

Programmer cock-ups are handled as normal exceptions, don’t require any extra effort on the part of the user, and lead to immediate and easy-to-debug program failure.
It allows programs which want to do explicit error-checking to do so, and to do so in a framework in which error-checking is culturally practical; in other words, it doesn’t suffer from the practical, mostly cultural, problems noted with exceptions.

What’s worth noting is that use of the error-checking facility is entirely optional: if one doesn’t use the \ construct, what’s left is a standard (if slightly simplified) exception-based system. In other words, the error-checking approach is a bonus which doesn’t interfere with normal exception-based practices; it degrades gracefully into using standard exceptions.

Conclusion

In one way, what this article has done is to note problems with the practice of exceptions, and then suggest a partial return to the stone-age of explicit error checking. In that sense, one counter-argument is that I am aiming to throw the baby out with the bath water; and there is merit in that argument. There are also a couple of obvious questions which it raises. First, has this scheme been proposed elsewhere? Not that I know of, but there is a vast number of relatively obscure languages lurking around, so it’s not impossible. Second, would this work in a practical language? I don’t know - this is a classic paper design which may well not survive its first pass through a compiler. One day I hope to find out.

My thanks to Martin Berger for commenting on a draft of this article. Any errors and infelicities are my own, as are all the bad ideas.

2009-12-14 08:00
