|Home > Blog||e-mail: email@example.com github: ltratt twitter: @laurencetratt|
August 2 2012
However, there is one undeniable truth when hosting everything yourself: when things go wrong, there's no-one else to blame. If software crashes, hardware dies, or - and this has happened to me twice - the water company cuts through the power cable, it's your sites that are down. After the water company cut through the power cable for a second time, I realised that not having my main server running for three days wasn't very much fun. I eventually managed to route things through a cobbled together backup server, but it was a huge pain. I broke out in a cold sweat every time I wondered how long the next downtime might be: a week? or more? Even worse, I realised, I might not be available at the time to fix things.
I therefore started looking at ways to put in place a reasonable failover system. Importantly, I wanted the most important services on the main server to failover to a backup until the main server came back to life. I naively assumed there would be a fairly easy way of doing this, but, if there is one, I couldn't find it at the time (there are now several web-based DNS services which appear to offer this service, though I haven't tried any of them). Presumably this is why most people (including some surprisingly large organisations) never put into place any sort of failover solution and those that do tend to have large systems teams behind them. Some of the material I read made my eyes hurt: it involved tweaking vast numbers of options in vast numbers of configuration files. As someone who aims to spend, on average, a few minutes a week on sysadmin duties, I knew this was not a sensible long-term option. I'd inevitably forget all the complicated details, which, from a sysadmin viewpoint, is storing up problems for a rainy day.
After some thought, I cobbled together a relatively simple solution that has served me well for a couple of years or more. I think of it as failover for the cheap and forgetful. Nothing in here is new, as such, but it's difficult to find it written down in one place. Dedicated sysadmins will probably snigger at how crude it is, but it works well enough for my needs. It's not intended to deal with, for example, large scale denial of service attacks nor to keep every possible service up and running. It's designed to keep the main services (e.g. web and email) online when the main server is down. This article outlines the basic idea, gives some example configuration files, and the shell script I use to tie everything together.
The setupOverall, my setup is analogous to having a main server in a data centre, and using a home computer for a backup server. The details are as follows.
The servers are hosted with completely different network providers and are physically distant, with well over 100 miles between them. Both decisions are deliberate: I don't want network or location specific problems taking down both machines at the same time (many bigger providers, from Redbus to Amazon have regretted locating too many services in one location). Both machines run the latest stable release of OpenBSD, so moving configuration options between machines is fairly easy.
RequirementsWhat I want out of my system is fairly straightforward. The main server should do as much as it possibly can. The backup server should generally do relatively little, otherwise those who share its DSL line might become upset that their connection has slowed down. When the main server is down, it should be reasonably easy to access services on the backup server. If the backup server has taken over from the main server it should hand back to the main server as soon as that comes back up.
I also want as much of the setup as possible to be under my own control. Experience has taught me that unless (and, often, even if) you pay decent money, large systems run by large groups aren't as reliable as I would like: it's too easy for a small part of such a system to go down and its recovery not to be prioritised. Plus, doing things yourself is informative and good, clean fun.
Why a solution isn't easyThere are some complete, fully transparent, solutions for failover (e.g. CARP), but these generally have preconditions (e.g. machines being on the same network) which rule them out for my purposes.
Most people are familiar with the idea of having multiple mail servers. DNS allows a given domain name to specify multiple MX hosts, each with a priority. When sending mail to a domain, the MX servers are tried in order of priority until one accepts delivery. Failover for the sending of mail is therefore simple. But what about the reading of mail? and how do we
There are some half-measures which provide partial solutions, but they're unsatisfactory. Round robin DNS, for example, allows requests to be shared between servers. Often used as a simple form of load balancing, it seems like it might also be a way of implementing failover on the cheap. However, it will happily point a proportion of clients at a server which is down, which isn't a good failover strategy.
Outline of a solutionThe basis of a solution is relatively simple. First, we have to have a way of rewriting DNS records so that when the main server goes down, the DNS records for the domain are changed to point to the backup server (and back again, when the main server comes up). Second, we need a way of keeping the data within services in sync. As an example of the latter service, I'll describe how I synchronise e-mail.
DNSDNS has a reputation for being a dark art. In large part, I suspect this is because the configuration files for BIND are so hard to read. For some years I used a web-based DNS service just to avoid the pain of working out how to configure BIND. To get failover up and running easily, I decided to tackle my DNS fear head on. I ended up using NSD instead of BIND, partly because it's a little easier to configure, but mostly because, as I understand things, it's likely to displace BIND in OpenBSD one day.
The basic tactic I use is to continually rewrite the main DNS entry to point at the preferred server. The backup server contains the master DNS. It regularly polls the main server by trying to download
The rewriting process is in some ways crude and in some ways a bit clever.
It's crude, because I wanted a solution that needs only basic shell scripting (so it can run on a fresh OpenBSD install immediately) and normal text files (no key exchange, or anything which involves configuring things that I might forget). I therefore manually write DNS zone files as a normal user with two magic markers
It's in some ways a bit clever because the
I therefore wrote a script called
The zone files are relatively simple (assuming you can read such things in the first place: see e.g. this tutorial for pointers if you can't), but a few things are worthy of note. First, the domains have a short refresh time of 5 minutes but a long expiry time of 12 weeks. The short refresh time informs other DNS servers that they should regularly check to see if the domain's settings have changed: if the server status changes, we want the updated IP address to be quickly reflected in DNS lookups. The long expiry time means that, if all of my DNS servers are unavailable, other DNS servers should hold onto the records they previously fetched for a long time. This is because the complete disappearance of a domain from DNS (i.e. all a domains DNS servers are down) is a nightmare scenario: mail starts bouncing with fatal errors and, in general, people (and other servers) assume you've died and aren't coming back. Finally, notice that the master DNS server notifies two servers of DNS changes. As well as my 2 servers in the UK, I also use the free service at afraid.org as a DNS slave, just in case network connectivity to the UK is lost (it happened in the past though, admittedly, not for many, many years). It also reduces the DNS request load on the backup server. If I had a server outside the UK, I might do this myself, but I don't, so I'm grateful for afraid's service.
Synchronising e-mailOur DNS configuration gives us a nice way of making sure the main DNS entry always points to an active server. That doesn't solve the problem of making the content on the servers synchronise. In other words, it's not much good pointing people to a backup server if the content it serves is out of date. In my case, my websites are relatively simple: a script updates them both from the same git directory, so they're nearly always in sync. Other services are, in general, more difficult. Some - most notably databases - have built-in support for synchronising across servers. Most, however, take a little more thought, and each needs to be done on a case-by-case basis. As a simple case study, I'm going to show how I synchronise e-mail across two machines.
E-mail is interesting in that both servers accept e-mail deliveries: we therefore have a two-way synchronisation problem. Traditionally, e-mail was delivered to a single
Fortunately, there are already good directory synchronisation tools available, so we don't need to make our own. I use Unison, though there are alternatives. The backup server connects to the main server every 10 minutes and synchronises the maildirs across both machines. The connection is done using an unprivileged user and SSH keys (so that user doesn't have to type in a password) and
The best bit about this solution is how easy it is to setup. It's also easy to extend to multiple users on a Unix box. It also gives a high degree of reliability. Mail that is delivered to one machine appears on the other with a maximum 10 minute gap - in other words, I can never lose more than 10 minutes worth of e-mail.
[As a side note, since I also read my mail through maildirs (using mutt), I also use Unison to synchronise my mail between my desktop, laptop and servers. By default, my mail fetching script connects to the main server. I can manually tell it to collect mail from the backup server, so I can always download my mail, even when the main server is down. If you're wondering why I don't use the main DNS entry for this purpose, it's because Unison uses the host name of the remote server in its archive file; if the hostname you're connecting to later points to a different machine, Unison gets very upset.]
Of course, some services are much harder to synchronise than e-mail, and there is no general strategy. But hopefully this gives an idea for one such strategy.
SummaryAs I said earlier, the techniques outlined in this article aren't particularly sophisticated: they won't cope with DOS attacks, for example. I wouldn't even want my backup server to be subject to a slashdotting (or whatever the kids are calling it these days) - it doesn't really have the bandwidth to support it. But for the sort of usages my servers typically see, and for virtually no ongoing maintenance cost, these techniques have proved very useful. Hopefully documenting them might be of use to somebody else.
|Home > Blog||e-mail: firstname.lastname@example.org github: ltratt twitter: @laurencetratt|
|Copyright © 1995-2012 Laurence Tratt|