Some Reflections on Writing Unix Daemons

Unix daemons are programs which run in the background, performing tasks on our behalf indefinitely. Daemons are somewhat mysterious, easily overlooked, programs of a kind few of us have experience in writing.

In this post I’m going to describe some of the things I’ve learnt in the course of writing and maintaining 3 daemons (extsmail, snare, and pizauth) over 15 years. I’m going to start, however, with a brief overview of what daemons are, and how we get them up and running, as it will make the later parts of the post easier to understand.

Background

Let’s start with the most important question: why the term “daemon” and why the weird spelling? The term seems to date from the early 1960s when “operating systems” as we know them today started to take recognisable form. The general term is, quite sensibly, based on Maxwell’s demon, but I assume the “ye olde world” pseudo-antiquarian spelling (i.e. the extra “a”) is the result of people at the time finding it amusing. Unix seems to have used “daemon” as the spelling from early in its history. The spellings “demon” and “daemon” are often used interchangeably, probably because the former is both more familiar and in people’s spell checker [1].

As this post goes on we’ll look at a diverse set of daemons, but to get us started, let’s think of “server” processes (e.g. HTTP/web or SMTP/mail servers). These tend to have a minimal user-interface, text-file configuration, and some command-line switches. Most are configured once at, or shortly after, system installation. Crucially, this class of daemon is not tied to a “human user login” — once the daemon is executed, it puts itself “into the background”, running indefinitely, even if the user logs out.

Over time, daemons have become more common in non-server contexts. Unix desktops start a surprising number of daemons (e.g. for audio multiplexing). Although there’s some variety in implementation, some of these end up being tied indirectly to the desktop login (i.e. logging out will try to kill them). I’m going to ignore that distinction, because it isn’t significant and serves mostly to obscure the commonality between the two classes of daemon.

Starting a daemon

Most daemons start running when a system is booted, or when a user starts a desktop environment.

Because daemons need to run in the background, they have to start – or, more accurately, be started – differently to most other programs. Fundamentally, daemons need to be somehow “detached” from normal terminal process control so that they run in the background, and aren’t killed when their immediate parent process exits (e.g. because a user logs out or closes a terminal window). As far as I can tell, there have been three main approaches to doing so.

First was the “double fork method”. The first fork creates a new process as normal; the child process calls setsid; then the child process forks again. The grandchild process is then unable to reattach to the terminal. I’m unsure if the possible security gains from doing this are worthwhile or not, and fork is not a cost-free operation [2].
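The fork, setsid, fork sequence can be sketched in Python, whose os module exposes the same system calls. This is an illustration only, not code from any of the daemons discussed in this post: the function name double_fork and the run callback are hypothetical.

```python
import os


def double_fork(run):
    """Run `run()` in a detached grandchild; return in the original process."""
    pid = os.fork()                 # first fork: creates the child
    if pid > 0:
        os.waitpid(pid, 0)          # original process: reap the short-lived child
        return
    try:
        os.setsid()                 # child: start a new session, dropping the terminal
        if os.fork() == 0:          # second fork: creates the grandchild
            run()                   # grandchild: not a session leader
    finally:
        os._exit(0)                 # child and grandchild never return to the caller
```

Because the grandchild is not a session leader, it cannot acquire a controlling terminal again. A real daemon would normally also change directory and redirect stdin, stdout, and stderr, which this sketch leaves out.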

Much later, in 1990, the daemon() function became fairly, though never universally, widespread. In essence it does a single fork. The first version lacks the nochdir and noclose parameters, which were added just two weeks later. The current version of this function in OpenBSD is recognisable from that second version! Though there are some mild differences between the double fork method and daemon(), most of us can live a happy and productive life without knowing about them.

More recently, a very different method of starting daemons is via a “manager” of some sort, the best known of which is Linux’s systemd. Rather than having processes detach themselves from the terminal, these managers run daemons as if they were normal (albeit long-running) programs. This slightly simplifies the daemons themselves and provides a more homogeneous experience for the user. However, the manager itself now has to provide functionality for a wider variety of cases than most people care about. Notably, the manager has to run early in the boot process and to have some notion of daemon dependency for those cases where one daemon must run before another. I’m not a Linux person, but even I’m aware that systemd evokes strong positive and negative opinions, though I have no experience of, and no opinion on, this matter.

Ultimately, no matter how a daemon is started, what all daemons share in common is that they are long-running background tasks.

My experience

I’ve now written, and still maintain, 3 Unix daemons (in chronological order):

  1. extsmail, for sending email via external commands.
  2. snare, a GitHub webhooks runner.
  3. pizauth, for accessing OAuth2 tokens.

As well as spanning 3 different problem domains and use cases, these 3 daemons span time: I started extsmail in 2008, snare in 2019, and pizauth in 2022. They’re written in 2 languages: extsmail in C; and snare and pizauth in Rust. My understanding of what daemons are, and what they could be, has changed considerably over that time as has the context in which they run.

I’m going to start by going over each daemon in turn, before gathering together some general lessons.

extsmail

I started extsmail during a time when internet access was less widely spread, slower, and less reliable, than it is now. I wanted to be able to send emails from mutt (later neomutt) without having to check whether they’d actually been sent. I could have set up a full, local SMTP server, but many systems would have rejected emails coming from a random IP address. I could have set up some sort of authenticated relay, but that not only looked tricky to do but also difficult to secure (in the days before Let’s Encrypt).

What I really wanted to do was exploit my existing ssh setup to send email, implicitly gaining its authentication and security for free. That led me to create extsmail which is most easily thought of as “a per-user sendmail binary that keeps trying to send email via a given command until it succeeds”. For example to configure mutt to use extsmail one just needs to set:

set sendmail="extsmail -oem -oi"

and then configure extsmail itself with this in ~/.extsmail/externals:

group {
  external myremote {
    sendmail = "/usr/bin/ssh myremote.example.com /usr/sbin/sendmail"
  }
}

In this simple example, sendmail on a remote machine is run as if it was on my local machine. My mail appears to have been sent by myremote.example.com, even though I wrote it on my local machine.
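The core idea, handing a message to an external command on stdin and treating a zero exit status as success, can be sketched in a few lines of Python. This is not how extsmail is implemented: the function names, the retry interval, and the use of a shell are all assumptions made for illustration.

```python
import subprocess
import time


def try_send(message: bytes, command: str) -> bool:
    """Pipe `message` to `command` on stdin; exit status 0 counts as sent."""
    result = subprocess.run(command, shell=True, input=message,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def send_until_done(message: bytes, command: str,
                    retry_secs: float = 60.0, max_attempts=None) -> int:
    """Retry until the command succeeds; return the number of attempts used."""
    attempts = 0
    while True:
        attempts += 1
        if try_send(message, command):
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError("message still queued")
        time.sleep(retry_secs)      # wait before trying the command again
```

With the configuration above, command would be the ssh invocation from the externals file. A message is only considered sent once the command reports success, so a dropped connection simply leads to another attempt later.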

Personally I start extsmaild (the daemon part of extsmail) in ~/.xsession with extsmaild -m daemon.

I didn’t have a clue what I was doing when I started extsmail. Indeed, it was the first “proper” Unix program I’d written. I certainly didn’t expect to be using it over 15 years later, but as things stand I can’t imagine not using it. I’m writing this paragraph on a train and though it has wifi, it’s not entirely reliable. extsmail allows me to press “send” on an email and be confident that it will be sent despite the surprising frequency with which attempts fail on the train’s wifi.

Although there have been a small number of bugs that have meant some emails have been stuck in the queue, as far as I remember extsmail has never claimed that an email has been sent that wasn’t, which is more important. Just on my own I’ve sent at least 80,000 emails via extsmail!

Looking back, extsmail has gone through a few phases: the rush of initial development in late 2008 to early 2009; a long period until early 2012 of mostly minor portability tweaks as it picked up users; a roughly year-long period until early 2013 as I slowly understood and fixed a number of rare situations where messages could get stuck in the queue; another year-long period in 2014 where a fairly prolific contributor made a number of small but useful improvements; then a long period since of maintenance and roughly annual releases.

The good

Something that’s obvious now, but wasn’t obvious to me when I started, was that it is plausible to create good quality Unix daemons without prohibitive effort levels. extsmail isn’t a huge program but it’s proven useful to, and, for at least the last decade, reliable for, a number of users. I derive quite a bit of satisfaction from knowing that even such a niche tool is useful to other people.

A very different lesson I learnt from extsmail is the power of conventions. This came in two flavours.

First are Unix conventions around how software is packaged and installed. I decided to create a standard configure script, standard installation expectations, conventional man pages, and so on, in part just to see how they were done. Most of these conventions are at least slightly annoying — man pages, for example, use an utterly bizarre, hard-to-read, format. However, over time the advantage of doing so became clear — they’re familiar to users, and reduce the burden of installation, understanding, and using a new tool. Compare that to a project with a custom build system (or the use of a new build tool), which can waste huge amounts of a user’s time. Ever since I have tried to think “can I solve this problem in a way that reuses known conventions or do I really, really have to do something different?” The answer, once I can control my ego, is nearly always “I don’t need to do something different”.

Second is the much vaguer Unix convention that good programs should try to do one thing well and leave as much as possible to external programs. extsmail has a specific use case: send email via external commands. In the maximalist direction, I could have tried to make a fully general purpose program, but then I’d have emulated the complexity of the “big” SMTP servers, which would have defeated the point. In the minimalist direction, I could have made extsmail only able to send email via ssh, as that was the only use-case I had, and the only one I knew of, when I started writing extsmail.

By allowing users to run arbitrary commands, I found myself in the happy situation that an early user told me they were using msmtp instead of ssh with extsmail. Not only did I not know of msmtp, but soon after I found I needed SMTP with extsmail, and msmtp was the perfect fit! extsmail made me realise that I could design small programs that were flexible enough not to feel unduly constrained. This sounds like a small thing, but that realisation has had a profound effect on how I design software ever since.

The bad

In retrospect, extsmail has several minor flaws.

First are its configuration flaws. extsmail uses the older convention of expecting configuration files in ~/.extsmail/ rather than the more modern, and in my opinion rather tidier, ~/.config/ [3]. That is easily fixed, but if I’m going to do that I should also address the fact that extsmail forces users to create two configuration files (~/.extsmail/config and ~/.extsmail/externals) when one would do. I made this choice out of sheer stupidity: I didn’t look hard enough at precedents, and didn’t think through the flexibility that might realistically be needed. I keep telling myself that I’ll allow these two files to be merged into one [4], and have a migration period where both old and new are valid, but it’s not ever a high enough priority to get to the top of my TODO list.

Second are the two binaries extsmail installs: extsmail is a very tiny sendmail replacement (about 150LoC); the main daemon is named extsmaild. The ‘d’ suffix is not an uncommon Unix convention, but if you installed a program called “extsmail” what command would you run first? extsmail, of course! Doing so loads a program which seems to sit mute at your terminal. In fact it’s waiting for input from stdin (like sendmail!), but that’s not at all obvious. If you quit extsmail manually – with Ctrl-C or Ctrl-D – it leaves behind a blank file. extsmaild used to complain that this was not a valid file: after this confused early extsmail users, I taught extsmaild to ignore empty files altogether.

Third is the relative lack of automated tests. There are many odd cases that documentation tells us we need to handle, but which we don’t know how to actually make happen artificially. It’s then very tempting to do no automated testing at all, a temptation I gave in to.

Astonishingly, it took a long time before the consequences of this decision became clear to me. In particular, as careful as I tried to be in handling the odd cases that Unix throws up – I became a very careful reader of man pages – there are more than one can imagine.

Two examples that I eventually bumped into are that Unix’s handling of child processes is an absolute mess and that poll behaves differently across platforms. Even after I adjusted extsmail for these in isolation, I didn’t think about them in conjunction with each other. After many years I realised there was a problem and forced myself to put together a simple test suite, which helped me narrow the problem down and slowly but surely fix it. If I’d created a test suite earlier, even something small, I’d have saved myself a lot of pain!

Other Lessons

One thing that really surprised me when I started writing extsmail is how many ways there are for Unix functions to fail. A simple example is that many functions in Unix can return early (with EINTR) if a signal is received: sometimes one can just retry calling the function, but in other cases, one might want to do something different.

The sheer variety of ways things can go wrong is mind-boggling: pipes can close unexpectedly; files that have data in them no longer return their data; child processes sometimes act like computer-game lemmings; and so on. Over time I encountered all of these in real life, and more besides.

As I slowly realised the burden I’d taken on, I worried that I’d bitten off more than I could chew. I was tempted to simplify the problem by having the program exit if any of the weirder cases happened, particularly those that I didn’t know how to trigger.

I soon decided that a fundamental part of being a daemon is the ability to keep running (correctly!) when things go wrong. That eventually led me to decide that I would make extsmail continue in every case except a lack of memory [5]. I think I eventually succeeded, though handling all of these cases now probably takes around 50% of extsmail’s source code!

A surprising outcome for me was the realisation that I would not have been able to write extsmail in an exception-based language. Because C forces you to think about each and every case where something can go wrong, libraries have to carefully document all the ways they can go wrong. That was exactly what I needed to make extsmail robust.

snare

The research group I’m part of tries, as a matter of course, to automate much of our infrastructure. We use Buildbot for our continuous integration, but there are all sorts of other tasks where that’s not the right tool. For example, we wanted to automatically rebuild websites when there are pushes to a GitHub repository, but the existing systems that were intended to do this were complex: I remember not even being able to get one or two to build, let alone run!

I quickly realised that what I wanted was a daemon that could run arbitrary Unix commands when an event occurred on a repository (pushes, pull requests merging, etc). Since I wasn’t very sure how easy this might be, I hacked together a quick prototype in Python in a couple of hours. It worked well enough that I realised that writing a “proper” daemon was plausible. Since I was mostly writing code in Rust at that point, I decided to use this imminent work as a test-bed for writing a daemon in Rust.

I thus created snare. snare is configured with a single configuration file (I learnt that lesson from extsmail!). A minimal configuration file looks as follows:

listen = "<ip-address>:<port>";

github {
  match ".*" {
    cmd = "/path/to/prps/%o/%r %e %j";
    errorcmd = "cat %s | mailx -s \"snare error: github.com/%o/%r\" someone@example.com";
    secret = "<secret>";
  }
}

As this suggests, snare listens on a given IP address and port. In a GitHub repository, you specify a “webhook” that sends messages (as HTTP requests) to that IP and port. match ".*" is a regular expression that matches against an owner/repository pair: .* matches any string. When a request is received, snare checks that it matches the secret and then runs the shell command cmd: the % variables expand to the GitHub owner, repository, event, and JSON request respectively. If anything goes wrong, snare runs errorcmd, in this example sending error output to a given email address.
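The two steps in that paragraph, checking the request against the secret and expanding the % variables, can be sketched in Python. This is not snare’s code. GitHub signs each webhook body with HMAC-SHA256 and sends the result in the X-Hub-Signature-256 header as sha256=<hex digest>; the function names, the %% escape, and the error handling below are assumptions for illustration.

```python
import hashlib
import hmac


def signature_ok(secret: bytes, body: bytes, header: str) -> bool:
    """Check a GitHub X-Hub-Signature-256 header value against the body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the strings differ.
    return hmac.compare_digest(expected.encode(), header.encode())


def expand(template: str, values: dict) -> str:
    """Replace %o, %r, ... using `values`; `%%` (assumed) yields a literal %."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            key = template[i + 1]
            if key == "%":
                out.append("%")
            elif key in values:
                out.append(values[key])
            else:
                raise ValueError(f"unknown escape %{key}")
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For a push to a/b, expand("/path/to/prps/%o/%r %e %j", ...) would yield a command line whose program path ends in a/b, followed by the event name and whatever is substituted for %j.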

The good

snare was written relatively quickly. Within a couple of months of part-time work it had taken firm shape. A couple of months after that it was largely finished. Partly I created it quickly because I had the experience of extsmail under my belt.

But, mostly, it’s because snare was written in Rust, not C. This was a very deliberate choice on my part: one surprising lesson from extsmail was how C makes it possible, despite its clear flaws, to write reliable programs. No other language I’d come across seemed a plausible replacement for C, until I tried Rust. Overall I consider the experiment a definite success!

snare borrows a configuration pattern that I’d first seen in OpenBSD’s pf and seen repeated a few times since. In essence, the configuration file “executes” from top to bottom with each match overwriting some or all of the previous values. This is a simple, powerful pattern. For example, given this snare configuration:

listen = "<address>:<port>";
github {
  match ".*" {
    cmd = "/path/to/prps/%o/%r %e %j";
    errorcmd = "cat %s | mailx -s \"snare error: github.com/%o/%r\" abc@def.com";
    secret = "sec";
  }
  match "a/b" {
    errorcmd = "lpr %s";
  }
}

the repositories a/b and c/d will have the following settings:

a/b:
  queue = sequential
  timeout = 3600
  cmd = "/path/to/prps/%o/%r %e %j"
  errorcmd = "lpr %s"
  secret = "sec"
c/d:
  queue = sequential
  timeout = 3600
  cmd = "/path/to/prps/%o/%r %e %j"
  errorcmd = "cat %s | mailx -s \"snare error: github.com/%o/%r\" abc@def.com"
  secret = "sec"

As this might suggest, queue = sequential and timeout = 3600 are defaults provided by snare: the match ".*" block overrides cmd, errorcmd, and secret; and a/b further overwrites errorcmd.

This configuration style isn’t appropriate for every program, but when it is, it works really well. Some of the lengthy snare setups I have benefit hugely from this idiom.
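The top-to-bottom override rule is easy to model. The following Python sketch reproduces the a/b and c/d example: it is not how snare is implemented, and it assumes that each pattern must match the whole owner/repository string.

```python
import re

# Defaults named in the text as being provided by snare.
DEFAULTS = {"queue": "sequential", "timeout": 3600}

# The match blocks from the example configuration, in file order.
MATCHES = [
    (".*", {
        "cmd": "/path/to/prps/%o/%r %e %j",
        "errorcmd": 'cat %s | mailx -s "snare error: github.com/%o/%r" abc@def.com',
        "secret": "sec",
    }),
    ("a/b", {"errorcmd": "lpr %s"}),
]


def settings_for(repo: str) -> dict:
    """Apply every matching block in order; later blocks overwrite earlier ones."""
    settings = dict(DEFAULTS)
    for pattern, overrides in MATCHES:
        if re.fullmatch(pattern, repo):     # assumption: whole-string match
            settings.update(overrides)
    return settings
```

Calling settings_for("a/b") gives errorcmd = "lpr %s" with everything else inherited, while settings_for("c/d") keeps the mailx command.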

The bad

snare contains an HTTP server to listen for requests. I used an external library which had the seemingly small side-effect of using Rust’s async/await features. I soon came to regret this. There are use-cases for async/await, particularly for single-threaded languages, or for people writing network servers that have to deal with vast numbers of queries. Rust is multi-threaded – indeed, its type system forbids most classic multi-threading errors – and very few of us write servers that deal with vast numbers of queries.

Unfortunately, async/await splits code into two in ways that become very awkward to work with. I quickly found that snare was being divided into code that needed to be async/await and that which didn’t. I gasped when I found that I needed to use a different Mutex library for async/await code, and moved to reduce its use in snare to a bare minimum. Even that wasn’t quite enough: I found that snare had weird, though very minor, memory leaks that seemed to result from the vast quantity of async/await run-time code I’d implicitly slurped in.

In 2023 I finally got fed up enough to remove the last async/await vestiges. snare’s dependencies decreased from 213 to 172 and on OpenBSD the binary size shrank by 20%, for exactly the same user-visible functionality! Of course, this touched the core parts of snare, and I had again been too stupid to add in a test suite. So first I added a test suite for basic functionality, giving me confidence that the changes weren’t breaking that; I then further extended the test suite to cover the security-relevant parts of snare.

And, finally, why did I use semi-colons at the end of configuration options? I think I was influenced by a period writing lots of grammars for C-like languages. However, it feels out of place with Unix configuration norms.

Other Lessons

extsmail and snare both listen for the SIGHUP signal and reload their configuration file when the signal is received. I have rarely used this facility with extsmail, but I have used it quite often for snare: since it might be running jobs, I’d rather not shut it down and restart it unless I have to.
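The usual shape of SIGHUP reloading is a handler that only sets a flag, plus a main loop that notices the flag and re-reads the configuration. The Python sketch below shows that shape; it is not taken from extsmail or snare, and the function names, as well as the choice to keep the old configuration if the new one fails to load, are assumptions.

```python
import signal

_reload_requested = False


def _on_sighup(signum, frame):
    # Do as little as possible in the handler: just note the request.
    global _reload_requested
    _reload_requested = True


signal.signal(signal.SIGHUP, _on_sighup)


def take_reload_request() -> bool:
    """Return True if SIGHUP arrived since the last call, clearing the flag."""
    global _reload_requested
    requested, _reload_requested = _reload_requested, False
    return requested


def maybe_reload(current, load):
    """Call from the main loop: reload if asked, keep the old config on error."""
    if not take_reload_request():
        return current
    try:
        return load()
    except Exception:
        return current      # a real daemon would log the failure here
```

A user would trigger the reload with kill -HUP <pid>, which relies on them finding the right PID first.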

However, SIGHUP is unsatisfactory: communicating via a PID (Process ID) is both dangerous [6] and clumsy (it is awkward to distinguish multiple instances from each other). When something goes wrong, users aren’t notified directly: messages are hidden away in logs, which they often forget to check.

This observation is not a new one – indeed, many people have made it before me – but I resolved that I would try a different approach for any other Unix daemons I might write in the future.

pizauth

OAuth2 is an increasingly widely used authentication standard. In 2022 I suddenly found myself needing to use it for some crucial parts of my workflow, including email, but couldn’t find an existing tool which worked well in my setup. I realised from my experience with snare that I could now create decent quality daemons fairly quickly, so soon afterwards I created the first version of pizauth. The first alpha version was released about 6 weeks later.

pizauth requires a single configuration file (typically ~/.config/pizauth.conf) which looks roughly as follows:

account "officesmtp" {
  auth_uri = "https://login.microsoftonline.com/common/oauth2/v2.0/authorize";
  token_uri = "https://login.microsoftonline.com/common/oauth2/v2.0/token";
  client_id = "..."; // Fill in with your Client ID
  client_secret = "..."; // Fill in with your Client secret
  scopes = [
    "https://outlook.office365.com/IMAP.AccessAsUser.All",
    "https://outlook.office365.com/SMTP.Send",
    "offline_access"
  ];
  // You don't have to specify login_hint, but it does make
  // authentication a little easier.
  auth_uri_fields = {"login_hint": "email@example.com"};
}

Reflecting a lesson I learnt from snare, pizauth is split into two parts. The “backend” or “server” part is started with:

pizauth server

The “frontend” commands then communicate with the backend via a Unix domain socket. For example instead of sending a SIGHUP signal to a PID and crossing my fingers, pizauth reload asks pizauth in a safe way to reload its configuration file. The two parts communicate with a very simple text-based protocol. pizauth reload sends the command reload: to the server which replies with ok: or error:<reason>.
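A request/reply exchange of this kind needs very little code. The Python sketch below handles the reload: command and the ok: / error:<reason> replies over an already-connected Unix socket. It is not pizauth’s implementation: the function names are hypothetical, and marking the end of a request by shutting down the sending side is an assumption about framing.

```python
import socket


def read_all(conn) -> bytes:
    """Read until the peer shuts down its sending side."""
    chunks = []
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def serve_one(conn, reload_config) -> None:
    """Backend side: answer a single request on an accepted connection."""
    request = read_all(conn).decode("utf-8", "replace")
    if request == "reload:":
        try:
            reload_config()
            reply = "ok:"
        except Exception as exc:
            reply = f"error:{exc}"
    else:
        reply = "error:unknown command"
    conn.sendall(reply.encode())
    conn.close()


def send_command(conn, command: str) -> str:
    """Frontend side: send one command and return the backend's reply."""
    conn.sendall(command.encode())
    conn.shutdown(socket.SHUT_WR)   # tell the backend the request is complete
    return read_all(conn).decode()
```

In a real daemon the backend would bind an AF_UNIX socket at a known path and call serve_one on each accepted connection. Unlike a signal, the frontend gets an answer back and can show any error directly to the user.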

To obtain an authorisation token, one runs pizauth show. The first time this is run for an account, it will print an error to stderr that includes an authorisation URL:

$ pizauth show officesmtp
ERROR - Token unavailable until authorised with URL https://login.microsoftonline.com/common/oauth2/v2.0/authorize?access_type=offline&code_challenge=xpVa0mDzvR1Ozw5_cWN43DsO-k5_blQNHIzynyPfD3c&code_challenge_method=S256&scope=https%3A%2F%2Foutlook.office365.com%2FIMAP.AccessAsUser.All+https%3A%2F%2Foutlook.office365.com%2FSMTP.Send+offline_access&client_id=&redirect_uri=http%3A%2F%2Flocalhost%3A14204%2F&response_type=code&state=%25E6%25A0%25EF%2503h6%25BCK&client_secret=&login_hint=email@example.com

The user then needs to open that URL in the browser of their choice and complete authentication. Once complete, pizauth will be notified, and shortly afterwards pizauth show officesmtp will start showing a token on stdout:

$ pizauth show officesmtp
DIASSPt7jlcBPTWUUCtXMWtj9TlPC6U3P3aV6C9NYrQyrhZ9L2LhyJKgl5MP7YV4

Once authentication has completed, pizauth regularly updates the access token, aiming to do so sufficiently seamlessly that end users never need to worry when access tokens change.

pizauth has some useful tricks up its sleeve. Most obviously, pizauth show is asynchronous, so authorisation might be required by another program without an alert being printed on your terminal. auth_notify_cmd allows the user to specify a command to be run in such cases. To immediately open authorisation URLs in a browser we can specify:

auth_notify_cmd = "open \"$PIZAUTH_URL\"";

Having the browser suddenly open a page out of nowhere can be discombobulating. If we want to be more subtle, we can use notify-send to show a pop-up which we have to click on to open a URL:

auth_notify_cmd = "if [[ \"$(notify-send -A \"Open $PIZAUTH_ACCOUNT\" -t 30000 'pizauth authorisation')\" == \"0\" ]]; then open \"$PIZAUTH_URL\"; fi";

The good

pizauth was written in almost indecent haste: I was a better programmer in Rust by this time, and had the experience of writing snare under my belt. I avoided several of the mistakes I’d made when writing snare, which saved me useful amounts of time.

I deliberately tried to incorporate what I’d learnt about daemons from extsmail and snare to make pizauth a more “modern” Unix daemon. Most obviously, the split into a backend and frontends, which communicate with domain sockets, removes the traditional worries about PIDs and race conditions. All pizauth’s commands are run through a single binary, which makes installation and discovery easy.

pizauth also respects various modern conventions about configuration and cache file locations. pizauth info is a subcommand I wish more daemons had (inspired by this post) as it takes the guesswork out of file locations:

$ pizauth info
pizauth version 1.0.0:
  cache directory: /home/ltratt/.cache/pizauth
  config file: /home/ltratt/.config/pizauth.conf

“Modern” doesn’t mean diverging from the good parts of “traditional” daemons. In particular, pizauth is fairly minimal. I’m particularly fond of the design of pizauth dump and pizauth restore which are designed to work well with external tools.

The bad

It’s perhaps a little too soon for me to know what I’ll regret about pizauth. The code which deals with communication between backend and frontends could do with a good tidy-up, but it is functional and safe.

Perhaps the deepest issue I’ve realised so far is that I wish, behind the scenes, I’d installed multiple binaries: one for the backend; and one per frontend command. If each of those binaries contained only the code it needs to run, the scope for things like gadget-based attacks would be reduced.


Higher-Level Reflections

I hope that sharing my detailed reflections on extsmail, snare, and pizauth has been useful. Doing so has also helped me collect together some higher-level reflections, some of which are probably obvious from the above, but some of which might not be. I’m going to break them down into categories.

We Need More Daemons

The thing that makes daemons a distinct kind of software is that they run permanently in the background and react to events. We intuitively realise why this is relevant for daemons that react to events external to the computer (e.g. an HTTP server reacting to GET requests), but we generally underplay how useful this is for events internal to the computer. Because a daemon can react immediately to events, it allows different programs to work together so seamlessly that as a user we perceive them as one entity.

For example, many people run – or would like to run! – a daemon which detects each update of a file on their local drives and immediately backs it up remotely. extsmail is a more niche example of this same technique: as soon as an email is written to the drive, extsmail tries to run a command to “send” it. There are many other kinds of events that one might usefully react to, including hardware attach/detach, networks going up and down, programs starting and stopping, and so on.

Once upon a time daemons had to “poll” (i.e. intermittently query) the system to see if an event had occurred. Not only did this introduce lag, but continually waking the daemon up to perform mostly pointless operations could be a notable computational and power draw. Modern Unices have largely solved this problem: processes can ask the kernel to notify them of a wide variety of such events (see e.g. epoll or kqueue). In other words, we can now write daemons which consume virtually no resources until an event of probable interest actually occurs.
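The difference from polling is easiest to see in code. Python’s selectors module picks epoll or kqueue where the platform provides them, so the sketch below sleeps in the kernel until something is readable. A pipe stands in for a real event source and the "new-mail" label is invented; watching files or devices needs platform-specific facilities that this sketch does not cover.

```python
import os
import selectors

sel = selectors.DefaultSelector()       # epoll or kqueue where available
read_end, write_end = os.pipe()         # stands in for any event source
sel.register(read_end, selectors.EVENT_READ, data="new-mail")


def wait_for_events(timeout=None):
    """Sleep until a registered descriptor is readable; return its label(s)."""
    return [key.data for key, _mask in sel.select(timeout)]
```

With timeout=None the call blocks indefinitely and uses no CPU while waiting; it returns as soon as something is written to the pipe.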

My strong belief is that too few programs exploit this way of working and that the world would be a better place if we wrote more of them. To some extent I think this is because many people don’t realise what a daemon could do, let alone what modern Unices make easy and efficient to do.

What I’ve also realised is that many excellent programmers implicitly assume that they cannot, or should not, write a daemon [7]. Many assume that “other” people are better suited to doing so, either because daemons can be created only by the very best programmers, or those most steeped in ancient Unix lore.

You might think that I’m about to descend into platitudes and say that every programmer can create good quality daemons. As much as I might like that to be the case, long experience has taught me otherwise. In particular, when writing a daemon one has to think at every point “what are all the things that could go wrong?” Not every programmer seems able or willing to think in that way.

However “not every programmer” is far from “only the best programmers”. I’m a decent programmer, but I know many better — and I’ve been able to create three reasonable quality daemons in little scraps of time. It’s also much easier to create a daemon now than it used to be, in part because Rust is a viable language for daemons.

There’s thus a good news story here: while I think we need more daemons, it’s never been easier to create good daemons!

Simplicity is a Virtue for Daemons

“Do one thing and do it well” has become a Unix cliché. As advice goes, it’s often not appropriate — I remember using minimalistic text editors, for example, and I much prefer the modern kitchen sinks. But the very nature of daemons makes this advice particularly apt.

The reason for this is the unusual nature of daemons: because they run in the background, they’re easily forgotten. I’m using “forgotten” deliberately because it can be interpreted in two ways: daemons are forgotten because they’re doing their job well and we assume they will keep doing so [8]; or they’re forgotten because they stopped doing their job well but no-one has noticed and/or knows what the culprit is.

It can be tempting to think that if no-one has noticed something isn’t working then the thing isn’t important. The problem is that sometimes we don’t notice that something isn’t working until it’s too late, automatic backups not working being a well-known example.

One common solution to daemons going wrong is to have continual notifications or monitoring: when something goes wrong, a human is alerted, and they have to fix the resulting mess. Sometimes there’s no way around this, but it is obviously inefficient, and out of reach to all but the most patient or best resourced. More to the point it focusses on symptoms, not causes.

My belief is that the reliability of a daemon in large part depends on its configuration complexity. Unlike foreground software that we knowingly interact with daily, we tend not to fully understand what daemons can do, or exactly how they work. Even worse, we tend to configure them once up-front, when our ignorance is at its height, and then leave them alone for months or years.

As we increase the complexity of a daemon’s configuration, we thus increase the likelihood of users configuring it incorrectly. I think that most daemons offer too many configuration options to users: many seem useful, or at worst harmless, but collectively they tend to overload the user. Often the user doesn’t even realise they’re overloaded: the more configuration options a daemon has, the worse it tends to do at documenting individual details.

Because daemons are hard to test, I am sometimes sceptical that the interactions between more obscure configuration options can be fully relied upon. Ultimately the more configuration options we make available to users the more complex our daemon becomes internally, and the more likely we are to introduce bugs.

User-Interface

Users typically interact with daemons in 4 ways:

  1. Starting the daemon with given flags (e.g. -v).
  2. Specifying a configuration file (e.g. ~/.config/pizauth.conf).
  3. Sending signals (e.g. SIGHUP) to the daemon.
  4. Communicating via a domain socket (i.e. inter-process communication where the pipe is identified with a given file such as ~/.cache/pizauth/pizauth.sock).

Those 4 ways neatly break down into categories: configuration (1 and 2) and communication (3 and 4).

Configuration

As noted above, just as it is best to keep the scope of a daemon as narrow as possible, it is also best to keep configuration flexibility as limited as possible.

I tend to add very few flags to my daemons, normally only: one to specify a configuration file (-c); one to help debugging by not forking into the background (-d) [9]; and one to temporarily increase logging verbosity (-v, which can generally be specified multiple times for ever-increasing log information).
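As a minimal sketch of how small such a flag surface can be, the following hand-rolled parser accepts only -c, -d, and -v. The names `Flags` and `parse_flags` are hypothetical and chosen for illustration; this is not code from extsmail, snare, or pizauth, and a real daemon might well use a getopt-style library instead.

```rust
/// The only three things the command line can change.
#[derive(Debug, Default, PartialEq)]
struct Flags {
    /// Path given with `-c`, if any.
    config: Option<String>,
    /// `-d`: stay in the foreground instead of forking.
    foreground: bool,
    /// Number of times `-v` was given.
    verbosity: u8,
}

/// Parse arguments (without the program name). `-d` and `-v` may be
/// clustered (e.g. `-dvv`); in this sketch `-c` must stand alone and be
/// followed by a path.
fn parse_flags<I: IntoIterator<Item = String>>(args: I) -> Result<Flags, String> {
    let mut flags = Flags::default();
    let mut args = args.into_iter();
    while let Some(arg) = args.next() {
        if arg == "-c" {
            let path = args
                .next()
                .ok_or_else(|| "-c requires a path".to_string())?;
            flags.config = Some(path);
            continue;
        }
        match arg.strip_prefix('-') {
            Some(cluster) if !cluster.is_empty() => {
                for c in cluster.chars() {
                    match c {
                        'd' => flags.foreground = true,
                        'v' => flags.verbosity = flags.verbosity.saturating_add(1),
                        other => return Err(format!("unknown flag: -{other}")),
                    }
                }
            }
            // Anything that is not a flag is rejected rather than guessed at.
            _ => return Err(format!("unexpected argument: {arg}")),
        }
    }
    Ok(flags)
}
```

Rejecting everything unrecognised, rather than silently ignoring it, is in keeping with the idea that a daemon should tell the user about mistakes up-front.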

For configuration files, I also try to keep the options to the minimum that I think a user needs. In general, I believe that users think they need more flexibility than they actually do. A small, but sometimes vocal, set of users get very frustrated when they can’t configure every little aspect of the system, but I don’t think they should ruin it for the rest of us. When I feel it is necessary to provide an option for a subset of users I strive to provide a sensible default that means most people can ignore it. For example, snare allows users to configure how many parallel jobs it can run, but the default is to allow as many jobs as there are active CPU cores. This obviously doesn’t suit all cases, but it does well enough for most.
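To illustrate the "sensible default" idea, here is a small sketch of how an optional setting might fall back to the machine's parallelism. The function names are hypothetical, and I am assuming the standard library's `std::thread::available_parallelism` as the source of the default; it reports an estimate of the parallelism available to the process, which is not necessarily how snare counts cores.

```rust
use std::thread;

/// Number of jobs to run when the user has not configured a value.
fn default_parallel_jobs() -> usize {
    // The query can fail on some platforms, so fall back to a single job.
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

/// Use the configured value if it is present and non-zero, otherwise the default.
fn effective_jobs(configured: Option<usize>) -> usize {
    configured
        .filter(|&n| n > 0)
        .unwrap_or_else(default_parallel_jobs)
}
```

Most users would never set the option at all, in which case `effective_jobs(None)` quietly does something reasonable.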

Finally there is the issue of configuration file syntax. It is tempting to use a “standard” syntax such as TOML. In general I don’t think these standard syntaxes are a great fit for daemons: they’re either not flexible enough (e.g. TOML) or too flexible (e.g. YAML). I prefer to create a simple Yacc grammar, generally a C-ish syntax with curly brackets for nesting. One advantage of using a custom syntax is that I can catch a surprising number of errors in the parser and give good quality error messages. While doing so does require users to learn a new syntax, most of us are so familiar with this kind of syntax that it’s not a noticeable barrier to entry.

Communication

Traditionally the only way one could directly communicate with a daemon was via a signal. For example, most daemons interpret SIGHUP as “reload your configuration file” [10].

I now strongly believe that using signals in this way is a bad idea that we should consign to the past. Put bluntly, signals are one of the worst designed parts of Unix. From a user’s perspective, their reliance on process names and/or PIDs introduces copious opportunities to send signals to the wrong process. From the daemon author’s perspective, signals execute code in a bizarre form of asynchronous execution during which very few normal operations can be safely performed. I’ve noticed that most programmers are far too optimistic about what is permissible in a signal handler and only bitter experience of odd failures makes them really read and understand the rules.

Instead it is better to communicate with a daemon via a domain socket. In essence, this is a Unix pipe whose user-facing end is identified via a named file. Since writing bytes into a pipe isn’t a great user experience, the “daemon” then needs splitting into two: the “backend” or “server” daemon and a “front-end” or “client” command which communicates with the backend.

pizauth is an example of a daemon which uses this idiom. pizauth server creates a domain socket (e.g. ~/.cache/pizauth/pizauth.sock) when it starts. Commands such as pizauth show act1 connect to the domain socket, write a simple command (e.g. showtoken:act1) to the socket, and listen for the reply (e.g. access_token:abcd1234).
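The request/reply exchange described above can be sketched with the standard library's Unix domain socket types. This is an illustration only, not pizauth's implementation: the newline framing, the error replies, the function names, and the temporary socket path are all assumptions of mine; only the example command and reply strings come from the text.

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;

/// Backend: turn one request line into one reply line.
/// The account and token are hard-coded purely for illustration.
fn handle_request(line: &str) -> String {
    match line.trim_end().split_once(':') {
        Some(("showtoken", "act1")) => "access_token:abcd1234".to_string(),
        Some(("showtoken", act)) => format!("error:unknown account {act}"),
        _ => "error:unknown command".to_string(),
    }
}

/// Backend: accept one connection, read one line, write one line back.
fn serve_one(listener: &UnixListener) -> std::io::Result<()> {
    let (stream, _) = listener.accept()?;
    let mut line = String::new();
    BufReader::new(&stream).read_line(&mut line)?;
    let mut writer = &stream;
    writeln!(writer, "{}", handle_request(&line))?;
    Ok(())
}

/// Front-end: connect, send a command, and return the reply.
fn request(sock: &Path, cmd: &str) -> std::io::Result<String> {
    let stream = UnixStream::connect(sock)?;
    let mut writer = &stream;
    writeln!(writer, "{cmd}")?;
    let mut reply = String::new();
    BufReader::new(&stream).read_line(&mut reply)?;
    Ok(reply.trim_end().to_string())
}

/// Run backend and front-end in one process over a throwaway socket.
fn demo() -> std::io::Result<String> {
    let sock = std::env::temp_dir().join(format!("sock-sketch-{}.sock", std::process::id()));
    let _ = std::fs::remove_file(&sock);
    let listener = UnixListener::bind(&sock)?;
    let server = std::thread::spawn(move || serve_one(&listener));
    let reply = request(&sock, "showtoken:act1");
    server.join().expect("server thread panicked")?;
    let _ = std::fs::remove_file(&sock);
    reply
}
```

In a real daemon the backend would loop over incoming connections and the front-end would be the user-facing command; the point here is only how little machinery the exchange needs.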

As well as bypassing worries about PIDs and signal handlers, this idiom turns out to have other advantages. It makes it easy and reliable for a daemon to check if it is already running: if the domain socket file doesn’t exist, the daemon obviously isn’t running; if the domain socket file does exist, the daemon can send a request to the socket and if no answer is received then the daemon isn’t running. Finally, it makes it trivial to add new ways of interacting with the daemon. pizauth, for example, has grown several new subcommands over time, each of which takes me just a minute or two to implement [11].
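The "is it already running?" check might look roughly like the following sketch. The `status` probe command and the function name are hypothetical, and treating any reply byte as proof of life is a simplification; a fuller version would validate the reply.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::time::Duration;

/// Returns true only if something at `sock` accepts a connection and sends
/// back at least one byte within `timeout`.
fn daemon_is_running(sock: &Path, timeout: Duration) -> bool {
    // A missing socket file and a stale file with no listener both fail here.
    let stream = match UnixStream::connect(sock) {
        Ok(s) => s,
        Err(_) => return false,
    };
    // Without a timeout, a wedged daemon would hang the check forever.
    if stream.set_read_timeout(Some(timeout)).is_err() {
        return false;
    }
    let mut writer = &stream;
    // "status" is a hypothetical probe command for this sketch.
    if writer.write_all(b"status\n").is_err() {
        return false;
    }
    let mut reader = &stream;
    let mut buf = [0u8; 1];
    matches!(reader.read(&mut buf), Ok(n) if n > 0)
}
```

A daemon that finds a socket file but gets `false` from this check can reasonably treat the file as left over from a previous run, remove it, and start up.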

There are two consequences to this style of daemon. First, how does one expose all the different subcommands to the user? Some daemons install multiple binaries with different names. Especially if these binaries don’t share a common prefix, this reduces discoverability. Even if they do share a common prefix, it means that simple -h style help is only per-binary. In contrast pizauth -h can sensibly show the user all the subcommands in one go:

$ pizauth -h
Usage:
  pizauth dump
  pizauth info [-j]
  pizauth refresh [-u] <account>
  pizauth restore
  pizauth reload
  pizauth revoke <account>
  pizauth server [-c <config-path>] [-dv]
  pizauth show [-u] <account>
  pizauth shutdown
  pizauth status

Second, how does one implement all these subcommands? pizauth literally installs a single binary: if invoked as pizauth server it takes one path of execution; all other subcommands take another path. For many daemons this is probably acceptable, but it does mean that the single binary is quite big. These days that makes me somewhat worried that I’m exposing an unnecessarily big attack surface for gadget attacks.

In retrospect, I think it would have been better to install multiple subcommand binaries out of sight of the user (e.g. pizauth-server, pizauth-show into /usr/local/libexec/pizauth/) and have a single binary or shell script (/usr/local/bin/pizauth) which forwards execution on. Each subcommand binary would then contain only the minimal code necessary to perform its particular action. From an implementation perspective, the “core” of pizauth would then be a library which each subcommand then imports and uses as needed. It wouldn’t have been much more difficult to implement this style and I will probably do so for any future daemons I write.
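A forwarding front-end of this kind can be very small. The sketch below is hypothetical rather than anything pizauth ships: the function names are mine, the directory holding the per-subcommand binaries is passed in as a parameter, and the subcommand list is copied from the usage message shown earlier.

```rust
use std::ffi::OsString;
use std::io;
use std::os::unix::process::CommandExt;
use std::path::{Path, PathBuf};
use std::process::Command;

/// Subcommands the front-end is willing to forward to.
const SUBCOMMANDS: &[&str] = &[
    "dump", "info", "refresh", "restore", "reload",
    "revoke", "server", "show", "shutdown", "status",
];

/// Map a subcommand name to its hidden binary. Only names on the fixed list
/// are accepted, so user input never reaches the path unchecked.
fn subcommand_binary(dir: &Path, name: &str) -> Option<PathBuf> {
    if SUBCOMMANDS.contains(&name) {
        Some(dir.join(format!("pizauth-{name}")))
    } else {
        None
    }
}

/// Replace the current process with the subcommand binary.
/// This only returns if forwarding failed.
fn forward(dir: &Path, name: &str, rest: Vec<OsString>) -> io::Error {
    match subcommand_binary(dir, name) {
        // exec() does not return on success.
        Some(bin) => Command::new(bin).args(rest).exec(),
        None => io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("unknown subcommand: {name}"),
        ),
    }
}
```

The front-end's `main` would call `forward` with the first argument as the subcommand name and print the usage message if an error comes back. Because it uses `exec`, no extra process lingers between the user and the subcommand.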

Implementation Suggestions

Until recently, daemons were almost always written in C. It is hard to recommend C for new software: it reduces productivity and it is difficult to write secure code in [12].

My experience with snare and pizauth is that Rust is a viable language for writing daemons in. Rust isn’t a perfect language (e.g. unsafe Rust currently has no meaningful semantics so any code which uses unsafe comes with fewer guarantees than C) and some parts of the implementation rub unpleasantly against traditional Unix expectations (the size of the binaries it produces is astonishing). But one can be massively more productive in Rust than in C, even before one reaches for the astonishing breadth of libraries now available. Certainly, I would not have found enough time to write snare or pizauth if I’d restricted myself to C.

One pleasant benefit of Rust’s type system is that it makes writing multi-threaded programs much easier than any other equivalent language. Since daemons often want to do multiple things in parallel (e.g. listen for new events, and carry out actions relevant for previous events), threads are a natural way of doing so. pizauth, in particular, makes extensive use of threads and is much better for it. My only note of caution is that code with locks often falls foul of the unspecified nature of temporary lifetime extension. This isn’t a theoretical issue — I only realised what temporary lifetime extension was after it caused problems (now fixed) in pizauth.
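To give a flavour of how temporaries and locks can interact badly, here is a sketch of one closely related pitfall: a lock guard created as a temporary in a `match` scrutinee stays alive until the end of the whole `match`. I am not claiming this is the exact problem pizauth hit; the function names are hypothetical, and `try_lock` is used only to observe whether the lock is still held.

```rust
use std::sync::Mutex;

/// Pops an item and reports whether the lock was still held inside the arm.
fn lock_held_in_arm(queue: &Mutex<Vec<u32>>) -> bool {
    // The temporary MutexGuard created by lock() is not dropped after pop():
    // it lives until the end of the match.
    let held = match queue.lock().unwrap().pop() {
        // try_lock() fails because this thread still holds the guard.
        // Calling lock() here instead would deadlock or panic.
        Some(_) => queue.try_lock().is_err(),
        None => false,
    };
    held
}

/// Same logic, but the guard is dropped before the match begins.
fn lock_released_first(queue: &Mutex<Vec<u32>>) -> bool {
    // Here the guard is a temporary of the let statement, so it is dropped
    // at the end of that statement.
    let item = queue.lock().unwrap().pop();
    let held = match item {
        Some(_) => queue.try_lock().is_err(),
        None => false,
    };
    held
}
```

With a queue holding at least one item, the first function reports that the lock is still held inside the arm and the second reports that it is not. In a multi-threaded daemon the first form can mean a lock is held across slow work, or that an attempt to re-lock deadlocks.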

Whatever language one uses for a daemon, robustness – the ability to keep working correctly in the face of difficult situations – is key. It’s tempting to think robustness is something that can be added to a daemon later, but it has profound effects at both a macro and micro level. Expecting to write a daemon as quickly as a throwaway script is a bad idea: robustness takes time; it requires more code (often much more code); and it requires careful and continual thought.

The final lesson I have learned for daemons is that just because automated testing is hard doesn’t mean that one should avoid automated testing. Most daemons I know of have little if any automated testing, and I accepted this as a given. However, because daemons run continuously, they are more likely to encounter unlikely cases than any other software we write. While some unlikely cases, particularly those that can only happen on certain platforms, are implausible to trigger, I came to regret not more thoroughly testing the daemons that I wrote.

The obvious challenge is that there’s often not much one can unit test in a daemon. For example, snare and pizauth’s configuration modules have reasonable unit testing [13], but what I worry about more is that the system as a whole does the right thing. I mentioned earlier problems with extsmail, child processes, and poll. extsmail now has a small, but surprisingly effective, test suite which checks these aspects work properly.

snare had a different problem: I became increasingly reluctant over time to make meaningful changes to snare because I was worried that I might break things in a subtle way. When I decided to rip out the external HTTP server snare was using, I realised that I would need confidence that my replacement didn’t break things. I thus also added a meaningful system-wide test suite (sometimes called “integration” or “black-box” tests) which now covers a large amount of snare’s functionality.

The trick, in both extsmail and snare’s case, was to accept that while it’s impractical to test everything about a daemon, most things can be tested with sufficient thought. Tests can easily set up isolated environments (e.g. in temporary directories), run simple shell commands, and then check the effects of those commands. In many cases, the resulting tests are “timing based”: that is, the only thing a test can sometimes do is sleep for a sufficiently long time that a testable action must have occurred. Timing based tests are obviously fragile, and consequently are looked down upon. But when this is the only plausible option for testing, it is far better to accept this is necessary than not to test at all!
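The shape of such a test can be sketched as follows. This is not taken from extsmail's or snare's test suites: the helper names are hypothetical, and instead of one long fixed sleep I have swapped in polling up to a deadline, which is still timing based but usually finishes sooner.

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `cond` until it returns true or `timeout` elapses.
fn wait_until(timeout: Duration, mut cond: impl FnMut() -> bool) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if cond() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(Duration::from_millis(20));
    }
}

/// Run `script` via `sh -c` inside a fresh temporary directory and report
/// whether the file `expected` appears there before `timeout` elapses.
fn run_in_sandbox(
    name: &str,
    script: &str,
    expected: &str,
    timeout: Duration,
) -> std::io::Result<bool> {
    // An isolated directory, so the test cannot disturb (or be disturbed by)
    // anything else on the machine.
    let dir = std::env::temp_dir().join(format!("{name}-{}", std::process::id()));
    let _ = std::fs::remove_dir_all(&dir);
    std::fs::create_dir_all(&dir)?;
    let mut child = Command::new("sh")
        .arg("-c")
        .arg(script)
        .current_dir(&dir)
        .spawn()?;
    let seen = wait_until(timeout, || dir.join(expected).exists());
    // Clean up: the child may already have exited, so ignore kill errors.
    let _ = child.kill();
    child.wait()?;
    std::fs::remove_dir_all(&dir)?;
    Ok(seen)
}
```

A test for a real daemon would start the daemon in place of the shell script and check for whatever observable effect it is supposed to produce, with a timeout generous enough to survive a heavily loaded test machine.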

In summary, even though many existing Unix daemons lack automated testing, I think we now know we should do better, and we know how to do so. Certainly, I’ve learnt my lesson: any future daemon I write will come with thorough testing from day one.

Summary

My high-level recommendations for writing a daemon are:

  1. Daemons tend to be configured once and then forgotten. Thus the more focussed in scope they are, and the simpler they are to configure, the more likely they are to keep running correctly. It’s tempting to try and make daemons super flexible, but that forces a substantial cognitive burden on users and implementers which is generally unnecessary and often counter-productive.

  2. Writing new daemons in Rust is viable and, for even moderately complex daemons, a good trade-off. Perhaps, though, for the smallest daemons, the large size of Rust binaries might not be appropriate.

  3. Daemons need system-wide automated testing if we want to have confidence in their correctness today and after any future changes. Testing daemons can seem difficult or impossible, especially if one expects to test everything, but often a large subset of functionality can be automatically tested fairly easily.

  4. Communicating with a daemon by signals (e.g. SIGHUP) is a bad idea. Unix domain sockets give greater flexibility and a better user experience.

  5. Installing one user-facing binary which has sub-command processing makes understanding, and later expansion, of the daemon easier.

Finally, there is no better way to understand Unix than to write a daemon. There are many positives to doing this — surprisingly, I was able to make use of a number of lessons I learnt from extsmail in my research, and even one or two from snare! There is also a negative. Being forced to confront Unix’s decades of history, the flaws present from birth, and those accreted since, has convinced me that we should be rethinking operating systems. But even if someone finds the resources to reimagine operating systems, we’ll almost certainly want something which looks rather like daemons!

2024-02-28 11:00

Footnotes

[1]

I have mostly used “daemon” in documentation but have sometimes slipped and written “demon” in the past. Having finally bothered to check the history behind the spelling, I’m going to try and use “daemon” more consistently in the future.

[2]

We now take for granted that forking is fairly fast. In particular, we assume that fork uses COW (Copy-On-Write): that is, it doesn’t require immediately copying all of the parent process’s memory. It seems that it wasn’t until around 1980-1981 that Unixes started using COW for fork. I assume that fork, especially a double fork, was a rather slow operation before this!

[3]

Technically this is the default if $XDG_CONFIG_HOME isn’t set in the environment. I have wondered more than once if anyone ever sets that variable!

[4]

From a parsing perspective, this should be trivial.

[5]

Given how frugal extsmaild is with memory use, it isn’t very likely to run out of memory. Certainly, it’s not yet happened to me!

[6]

What happens if the process you’re interested in stops and, before you notice, another starts with the same PID?

[7]

Though many such people are willing to write “microservices”, many of which can be seen as a cousin of daemons. For reasons that I can only ascribe vaguely to “culture”, microservices do not in general seem to be written to the same standard we expect from daemons. There are of course many exceptions to this generalisation, and I don’t think there’s an inherent reason for it to be so in general anyway. Also, this might just be me being old and grumpy.

[8]

In cricket a wicket keeper is said to be good when no-one notices what they’re doing: we only notice them when they make mistakes.

[9]

(Updated 2024-02-09) Alexis pointed me to a message showing that dpd was the first daemon to have the -d switch.

[10]

Officially SIGHUP informs a process that its controlling terminal is closed. Since daemons (normally) have no controlling terminal, this signal is irrelevant to them, which is why it is now widely reused for “reload your configuration file”.

[11]

It was relatively obvious to me when writing pizauth that I needed to do some sort of “frontend” / “backend” split but when I wrote snare I didn’t think I needed to. I now think this was a mistake and that snare should also have been written in this style. For example, I would sometimes like to know what jobs snare is currently running, but I don’t currently have a nice way of showing that to the user. Once you have the “frontend” / “backend” split implemented, it becomes trivial to add such features.

[12]

The prevalence of undefined behaviour in C makes this even worse.

[13]

snare’s testing is a little underwhelming in this regard; pizauth’s is more thorough.
