Async and Finaliser Deadlocks

Two days ago I was listening to the Oxide podcast on futurelocks, a very complicated kind of bug involving async code in Rust. I must admit that I struggled to understand what was going on, partly because of the subject matter, and partly because podcasts are my backdrop to household chores1. At some point towards the end, though, someone phrased things in a way that my pea brain could immediately understand.

In essence, I think futurelocks are a complex instance of a long-standing problem with (what I will call for now) asynchronous code. You may notice that I did not say a “well known” problem. Personally, I only realised this problem exists a couple of years ago. For better or worse, subsequent discussions with many other folk have convinced me that my lack of awareness is common.

Deadlocking finalisers

Helpfully, Oxide have a thoughtful writeup of the futurelock problem. However, even though I like to flatter myself that I’m a competent Rust programmer, I had to work hard to understand precisely what’s going on.

Fortunately we can create a simplified version of the underlying problem in Python:

import threading
mutex = threading.Lock()
class T:
  def __del__(self):
    print("acquiring")
    mutex.acquire()
    print("acquired")
    mutex.release()

t = T()
mutex.acquire()
t = None
mutex.release()

In essence, this code is modelling a classic programming need. A mutex (colloquially a “lock”) guards a shared resource (e.g. a network socket, counter, etc.). When the garbage collector determines that an object is no longer used, it runs its finaliser, i.e. its __del__ method. In this case the finaliser acquires the mutex (i.e. locks it), allowing it to do something with the shared resource the mutex is guarding, and then releases the mutex (i.e. unlocks it).

Unfortunately, when I run this code in CPython, it hangs at the terminal, having written just:

$ python3 t.py
acquiring

What’s going on? Why hasn’t it printed acquired?

Before I explain why it hangs, it might help if I show one “fix”: if I remove the line t = None and rerun the program, it works as expected:

$ python3 t.py
acquiring
acquired
$

The problem, in essence, is that my original program uses mutexes in a language implementation that runs finalisers interleaved with user code on the same operating system thread. I was able to craft this program to show what I wanted because I happen to know that CPython’s garbage collector uses reference counting2: the t = None line meant that the T object created on line 10 was immediately recognised as unused, and its finaliser executed.

The finaliser then tried to acquire the mutex (line 6). That mutex is shared with code outside the object, and unfortunately the “main” part of the program has already grabbed the lock (line 11). At this point we have a classic deadlock: one part of the program holds a lock; another part will not continue until it has managed to grab that lock; but the latter must complete before the former can continue, and that is not possible. No matter how long I wait, this program will never print acquired.

The reason I have ended up in this unfortunate situation is that the finaliser is executed asynchronously in-between lines 12 (t=None) and 13 (mutex.release()) on the same thread as the “main” program. My experience is that we’re so used to the general idea of finalisers that we don’t stop to think how different their execution is to that of normal functions.

Most obviously, my program has no explicit calls to t.__del__(): another part of CPython’s run-time called that method on my behalf. Less obviously, CPython’s run-time chose when to call __del__. It happened to do so at the point that I wrote t=None but it could have done so at any point from then on, and still satisfied Python’s specification. Indeed, I can cause CPython to call __del__ at different points in many ways. For example, if I’d written t2=t just after t=T(), my program would not have hung.
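
To make this concrete, here is the “main” part of the original program with that extra reference added; the T class is unchanged:

t = T()
t2 = t           # a second reference: the refcount is now 2
mutex.acquire()
t = None         # refcount drops to 1, so __del__ is not called here
mutex.release()
t2 = None        # refcount drops to 0: __del__ runs after the release

Because the finaliser now runs after mutex.release(), it acquires the mutex without incident and the program prints acquired.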

This is “non-local” reasoning at work. In general, I can’t look at a line of Python code and statically know whether it will cause one or more finalisers to run unless I also understand every other part of the system which potentially relates to that line.

The history of this problem

I said earlier that this was a long-standing problem. This 2012 post from Andy Wingo uses a very similar example to the one above, so that’s 13 years ago. But – and I’ve learnt this is inevitable whenever it comes to anything relating to programming languages and memory – Hans Boehm got there much earlier, in 2002. His wonderful paper Destructors, Finalizers, and Synchronization is one of those reads where I can split my life into before/after.

In my case, I was pointed at this paper by Jacob Hughes during the work for what ultimately became “Alloy”, the garbage collector for Rust that we wrote up as Garbage Collection for Rust: The Finalizer Frontier. Boehm’s insights in his paper had a profound effect on the design goals we set for Alloy: the pun in the title of our paper is a fairly direct result of the challenges that we realised Boehm had implicitly set us in his paper.

Boehm’s paper dates from a period when Java, in particular, was experiencing all sorts of problems caused by the widespread use of finalisers3. Amongst them is the example I wrote in Python above. Boehm explicitly, and clearly, explains exactly this problem: finalisers that try to acquire mutexes can cause surprising deadlocks. As Boehm points out, this isn’t an odd thing to want to do:

There is usually not much of a point in writing a finalizer that touches only the object being finalized, since such object updates wouldn’t normally be observable. Thus useful finalizers must touch global shared state.

In other words: finalisers often need to use mutexes to update other objects.

The situation in CPython isn’t too bad in this regard. Reference counting gives me a fighting chance of working out in advance when finalisers will be called. If I get it wrong then, assuming my program is deterministic, I will get deadlocks at the same point in execution on every run. That makes debugging annoying but plausible.

In a non-reference counted implementation, such as Java implementations, things are much worse: finalisers can be called at any point; and even in seemingly deterministic programs, they will tend to be called at different points in different executions. Debugging such cases must have been an absolute nightmare!

The solution for garbage collectors

Fortunately the solution is quite simple, and is most clearly articulated on a slide accompanying Boehm’s paper:

Garbage collectors should run finalizers from a separate thread4

To truly understand why this is the solution, we need to define some new terminology. Until now I have been using “asynchronous” in its colloquial sense of “allow one operating system thread to interleave two different portions of code”. Technical folk like to call this concurrent execution. In contrast, parallel execution is when two portions of code are run on two operating system threads. Unfortunately, in day-to-day use these two terms are used interchangeably, leading to endless confusion.

In our situation, I find it easier to think of cooperative multitasking vs. preemptive multitasking.

I used to use a GUI-based operating system called RISC OS which used cooperative multitasking. A program would do some work and voluntarily give control back to the operating system, which would then hand control over to another program. If a program did not give control back (i.e. did not “cooperate”), either due to a bug or because the work it was doing was taking longer than the author had anticipated, the entire system would appear to freeze. This happened frustratingly often, particularly for new versions of software.

Modern operating systems instead switch control between different programs when the operating system chooses (i.e. preemptively): one program cannot hog execution and stop others running.

Garbage collectors like CPython’s run finalisers in a “cooperative” way: they implicitly hope that the finaliser executes quickly, or at least does not get stuck forever. In our running example, we saw a case where a finaliser would not cooperate: it wanted to acquire a lock that it could never get, and thus was stuck forever.

It might be tempting to look at our example and say that it’s the programmer’s fault for writing a non-cooperative finaliser. The problem is that there is no way to write a cooperative finaliser that achieves the desired outcome. I might try waiting for other code to let go of the mutex with something like:

import time

class T:
  def __del__(self):
    # Poll the mutex rather than blocking on it, naively hoping the
    # "main" code gets a chance to release it in the meantime.
    while not mutex.acquire(blocking=False):
      time.sleep(0.1)
    print("acquired")
    mutex.release()

but this will still hang forever and cause my CPU to get very hot! The finaliser will only be called once: it can’t temporarily let the “main” code run and release the mutex, despite my vague attempt above.

We now have, I hope, the terminology to understand the solution Boehm is emphasising:

Garbage collectors should run finalizers from a separate thread

As we’ve seen above, finalisers cannot be guaranteed to be cooperative in and of themselves. The solution is then simple: running them on another thread doesn’t require them to be cooperative! In other words, if the __del__ method from our original example was run on another thread, then the program could not have deadlocked: one thread would have waited until the other had released the mutex, then continued.
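
To see what this looks like for our running example, here is a hand-rolled sketch of the idea; the queue, thread, and helper names are mine, purely for illustration, and no real garbage collector works exactly like this:

import queue
import threading

mutex = threading.Lock()
to_finalise = queue.Queue()

def finalisation_thread():
  # Blocking on the mutex here cannot deadlock the main thread:
  # the two threads run preemptively, in parallel.
  while True:
    f = to_finalise.get()
    f()
    to_finalise.task_done()

threading.Thread(target=finalisation_thread, daemon=True).start()

def do_finalise():
  print("acquiring")
  mutex.acquire()
  print("acquired")
  mutex.release()

class T:
  def __del__(self):
    # Don't touch shared state here: just hand the real work over
    # to the finalisation thread.
    to_finalise.put(do_finalise)

t = T()
mutex.acquire()
t = None            # __del__ runs, but only enqueues do_finalise
mutex.release()
to_finalise.join()  # wait for the finalisation work to complete

This version prints acquired and exits: even though __del__ still runs in-between t = None and mutex.release(), the blocking acquire now happens on the finalisation thread, which simply waits until the main thread releases the mutex.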

This is the approach that modern JVMs take. It is also the approach we realised we were duty-bound to take in Alloy: finalisers are always run on a separate thread. This turned out to have huge implications which we did our level best to address. If you read the paper, you’ll realise that a huge chunk of it is devoted to making it possible to safely run as many Rust destructors – most of which were not written in anticipation of multi-threading – as possible on other threads.

From finalisers to futurelocks

At this point I can imagine many people feeling frustrated that I have dumbed down the futurelocks problem that I motivated this post with. It’s clearly more complex than the simple finaliser deadlock I’ve used as a running example. Indeed, a good-faith reader might reasonably question whether the two things share anything in common other than deadlocking.

My belief is that they share something fundamental in common: cooperative multitasking and the interleaved execution of arbitrary code on the same thread. As soon as you have those two ingredients, it seems to me that you have the potential for deadlocks.

In my view, async runtimes often (not always, but often) share an important flaw with cooperative multitasking: they hope that the next future / promise cooperates and returns control. If it cannot return control, the system will be stuck.
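
Python’s asyncio lets us see this hope being dashed in miniature. The following sketch is deliberately blunt, and much simpler than a real futurelock, but it shows a future that cannot return control stalling a single-threaded event loop:

import asyncio
import threading

mutex = threading.Lock()

async def blocker():
  # This future does not cooperate: acquire() blocks the event
  # loop's only thread, so no other future can be scheduled.
  mutex.acquire()

async def releaser():
  mutex.release()

async def main():
  mutex.acquire()
  # CPython schedules gather's tasks in creation order, so blocker
  # runs first and releaser never gets a chance to run.
  await asyncio.gather(blocker(), releaser())

asyncio.run(main())

Just as with the finaliser example, this program hangs forever: the scheduler’s hope that blocker would cooperate is dashed, and releaser, the only code that could unblock it, never runs.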

In the futurelocks example, the flaw is that the select! macro commits to running one future when another future is the one that can unlock the mutex. Thus select! implicitly hopes that the future it is committed to is the one that allows the system to make progress. That hope is sometimes dashed.

We can thus see some interesting similarities and differences between finaliser deadlocks and futurelock deadlocks.

Both are based on the misplaced hope that code will always cooperate — sometimes it is impossible for code to cooperate, even if it wants to. However, if we could guarantee what code will execute at any given point, we could at least assure ourselves that only truly cooperative code will be executed. Alas, one does not know which future / promise / finaliser will be executed or resumed: it could have come from anywhere, at any point in the past.

However, one reason I suspect that async programmers run into this issue less frequently is that while finalisers can run at any arbitrary point, async schedulers only run code at awaits or when a future has completed. In other words, the possible interleavings are many fewer in the async case. However, “many fewer” is very different to “none”. Could one create a variant of the select! macro that fixes the particular problem? Yes: but one could always create other seemingly valid async code that leads to the same problem.

Solutions?

Is there a way to solve futurelocks short of causing the system to exit if a deadlock is detected at run-time? I must admit that I am pessimistic.

Why don’t we just ban people from using mutexes in async functions? This is clearly possible – but async system code often needs to utilise multiple threads for the same performance reasons as non-async code. If you need that, you’re almost certainly going to want to use mutexes to safely share state amongst futures that might end up executing on different threads.

Can we create a static analysis that rules out the bad cases? The non-local nature of the problem means that type systems definitely won’t be enough to completely solve the problem, though they might be able to rule out some cases. Perhaps a more sophisticated, bespoke, static analysis could do what we want, but I suspect that it will be hard to create an analysis which accepts enough good cases and rejects enough bad cases to be worthwhile.

Perhaps we should just run all potentially deadlocking futures on another thread? That sounds just like the solution to the problem of deadlocking finalisers! However, the only way this would work is if every potentially deadlocking future / promise was executed on a thread that did not currently hold a mutex. In general, we won’t be able to tell whether a particular future / promise might have the potential to deadlock. That would mean putting many, perhaps most, futures on a thread where we know they can’t interfere with any other half-executed future. For deeply nested async functions, this could require an unbounded number of threads. I suspect the resulting performance would be terrible.

Summary

As this post might have suggested, I am less optimistic about the long term desirability of async code than many other people. Async does solve some problems extremely well — but it introduces new ones that I, and I believe many others, find harder to reason about. This post has touched on one, but there are others5.

Fundamentally, one way the analogy in this post breaks down is that scheduling finalisers has a simple, correct, fairly efficient solution: put them on their own thread. I do not see that there is, or can be, a practical equivalent for the async case, where scheduling is necessarily more complex. I fear that futurelocks will always be a problem.

Personally, I love the fact that (at least on non-embedded devices), operating system threads are cheap to create and run — and they give true parallelism to boot! Rust also makes writing reliable multi-threaded code vastly easier than in any other language. When problems do occur, which so far is very rare6, I find them relatively easy to debug.

Threads aren’t a complete solution, though. In particular, they’re a very heavyweight way of solving problems like “which of my many open network sockets has data for me to read?” For those cases, I manually use poll or its equivalents. I’m not a fan of the API – the number of ways to detect when a file has been fully read with poll is mind-boggling and not consistent across platforms – but at least problems with it remain localised to the code calling poll.
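
For the curious, here is roughly what that style looks like in Python; this is a Unix-only sketch, and the socketpair merely stands in for real network connections:

import select
import socket

# Ask the OS which of our sockets are readable with one call, on
# one thread, instead of dedicating a thread to each socket.
a, b = socket.socketpair()
poller = select.poll()
poller.register(a, select.POLLIN)

b.sendall(b"hello")
for fd, _event in poller.poll(1000):  # wait up to 1000ms
  assert fd == a.fileno()
  print(a.recv(4096))                 # prints b'hello'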

However, I appreciate that such opinions put me in a tiny minority. Despite that, I hope that my post has shown you that there is a more general problem which is worth knowing about. I also encourage you to listen to the Oxide podcast, which is one of my favourite technical podcasts! Without a poke from that, I wouldn’t have thought to write this post at all!

2025-11-12 11:05

Footnotes

1

When I realised I can see how many podcast hours I rack up in a year, I got a real sense of how much time I spend on chores!

2

PyPy does not deadlock on this example as written. If I add:

import gc
gc.collect()

between t = None and mutex.release() then it does.

3

Boehm, I’m sure, was also aware of many of these problems from his work on BDWGC, the retrofittable garbage collector for C/C++. Java, though, was the hot topic of the day.

4

As I understand matters, JVMs were required to do this, but did not always do so. I believe a common mistake was for low-memory situations to cause finalisers to be run during allocation, i.e. potentially on the same thread as code holding a lock that the finaliser would try to acquire.

5

For example: incomprehensible backtraces; futures that the code forgets to execute; the bifurcation of the ecosystem; and, in Rust at least, a significant increase in language complexity.

6

The only nasty case I’ve found was due to the unspecified nature of temporary lifetime extension. Once I realised the problem, the fix was, fortunately, trivial.
