Home Directory Synchronization

December 10 2005

My life sometimes feels overly peripatetic. One way in which I feel this pinch is that I regularly use three computers: my desktop at home, my laptop, and my desktop at work. I also have a server at work (on which you are probably reading this) and a server at home. This creates an obvious problem in synchronizing files between machines. When I was only regularly using two machines, I used the traditional approach to the problem, which was to manually copy files (using scp) from machine to machine. This was a pain, and even with just two machines I occasionally overwrote files with old versions, or went on a trip only to discover I didn't have the latest version of a particular file on my laptop. With more than two computers the problem becomes disproportionately more difficult.

My solution to this problem is not a novel one, but nor does it seem to be particularly well known. I had the germ of the idea around three years ago, and largely got it working before finding that Joey Hess had already eloquently described most of the important steps; I used some of Joey's ideas to refine my setup. The idea, which Joey describes fairly completely, is 'use version control to store files in your home directory.' Version control systems such as CVS are generally used so that multiple developers can work on the same source code and share their changes with each other in a controlled fashion. As this implies, on each developer's machine lies a (largely) identical copy of the shared source code. However, there's no reason to restrict this to cases where multiple people are involved. If one has multiple computers, using version control software simply means that each contains an identical copy of the shared files.

The benefits of taking this approach are, in my experience, almost impossible to overstate. Not only has my life become much easier because there is far less chance of making mistakes, but I've also been able to be much more cavalier about moving between machines, adding new machines to my menagerie, and even simply reinstalling existing machines. Of course, for most normal people out there this won't be an advantage at all, since it fulfils a need you don't have and uses a mechanism you won't want to understand, but if you're a serious computer user I think you should consider it.

I suspect one of the reasons why this method is rarely used - I know a grand total of one person in real life who uses something approaching this technique - is the use of "version control system" in the above text. Version control software is traditionally scary (most of the tools were one or more of big, slow, and unreliable), and of course it is seen as being applicable only to source code. In practice, even with simple tools, neither of these points is valid. Using this technique does require some thought, and it does take getting used to, but once one is used to it, the benefits significantly outweigh the disadvantages.

One thing that's interesting is that I see the list of pros, cons and irrelevancies a little differently from Joey's and other similar write-ups.

    Pros
  • 1=. Automatic file synchronization across multiple machines.
  • 1=. Multiple distributed backups (I always have a minimum of two copies of my latest data in locations separated by over 100 miles).
  • 3. No real reason to ever remove files from your home directory since there's no significant organizational penalty for keeping them around.
  • 4. Allows easy use of staging servers. I use this to develop my web site locally on various machines, before seamlessly committing it to my web server.
  • 5. Ensures home directory is kept 'clean': a fresh checkout will not check out irrelevant files (e.g. all the cruft files associated with running LaTeX, or those resulting from a C compilation) since those will never have been added to the repository.

    Cons
  • 1. Getting your existing files into shape to make the move to this system can be time-consuming (it probably took me around 4-5 hours).
  • 2. Adding files to the repository (and maintaining the list of files which shouldn't be added to the repository) is tedious, but fortunately takes relatively little time once one is used to it.
  • 3. You really need access to a server that is available anywhere on the Internet to get the most out of the technique. As most interested parties will have a DSL line, this is a very minor con.

    Irrelevancies
  • 1. Being able to get old versions of your files is useful. I have used this feature once. And that was just to see what would happen.
  • 2. It's not practical with binary data. In fact, there's no problem with this, provided you're sensible. See 'Divide your binary data into three types' a little later in the article.
I think it's telling that I could easily have written many more pros (admittedly, with diminishing returns), but I struggled to think of even a few cons and irrelevancies.

So now that I've been using this technique for a few years I feel that I have a few useful suggestions for anyone tempted to go down this highly recommended route.

Use a commonly available version control system.

At some point you will probably want to ensure that you can synchronize your data on a machine where it might be a liability to have unusual software. I use CVS since most of its (well known) deficiencies relate to problems encountered with multiple developers. The only significant remaining pain relates to directory handling and renaming files, and I can live with that, as annoying as it is.

An oft-used alternative is Subversion, but I wouldn't touch that with a barge pole, since it appears to be a project with the limited ambition of just replacing CVS. Unfortunately, while they fixed some of CVS's more obvious deficiencies, they've introduced some tear-inducingly stupid new flaws. I've seen several corrupted repositories because using BSD-DB or similar for a storage backend is an obviously bad move.

At some point, one of the more advanced systems like Darcs or bzr might be well known enough to use here. But not for a few years yet I suspect.

Think before you name and add files.

Especially with CVS, renaming of files and directories is a slow and tedious task. But no matter what your system, a useful consequence of using this approach is that you will probably carry a copy of every file you add to your repository for life. If you choose an inappropriate name in haste, or locate a file in an inappropriate location, you will make life difficult for yourself in the long run.

A corollary of this is that the layout of the top-level directories in your home directory is extremely important. I have the following:

  • .private
  • audio
  • bin
  • misc
  • photos
  • research
  • share
  • src
  • tmp
  • web
  • work
Notice that I have been dull in the extreme in my naming, that I have reused standard UNIX naming conventions when possible, and that I have only a few top-level directories. These are all deliberate decisions. Each one of these is also a CVS module, which means that I only check out certain combinations of directories on certain machines (e.g. .private only gets checked out on trusted machines).

Divide your binary data into three types.

Since binary data tends to be much bigger than text files, I split binary data into three groups:
  1. Binary data which is both 'irreplaceable' and doesn't change regularly, should be checked into the repository. Photos come under this heading.
  2. Binary data which has no intrinsic value should be considered local to a particular machine. This means that it can be deleted without consequence.
  3. Binary data which it is useful to synchronize, but which can be recreated by other means if necessary, is synchronized by a lighter weight mechanism. Audio comes under this category (since I own every CD I have converted into oggs, I can recreate this data if necessary) as does some programs' data (I use this to synchronize my RSS reader's data).
I use Unison for this last task. Unison is very fast, but I don't entirely trust it because I've watched in horror as it deleted a large directory of files when one of its archive files (Unison's name for its record keeping files) was removed (fortunately I was testing it out, so I had a backup). I only use it to synchronize files that I can recreate from another source.
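
As a rough illustration of keeping the third category separate from the repository, the sketch below builds a non-interactive Unison invocation restricted to a few paths. The path list, the roots, and the function names are assumptions made up for this example rather than a description of the actual setup; only Unison's `-batch` and `-path` options are relied upon.

```python
import subprocess

# Hypothetical list of directories that are synchronized by Unison
# rather than stored in version control (relative to both roots).
UNISON_PATHS = ["audio"]


def unison_command(local_root, remote_root, paths):
    """Build a Unison command line limited to the given paths.

    -batch makes Unison run without asking questions; each -path
    restricts synchronization to one path below the two roots.
    """
    cmd = ["unison", local_root, remote_root, "-batch"]
    for path in paths:
        cmd += ["-path", path]
    return cmd


def sync_recreatable(local_root, remote_root, paths=UNISON_PATHS,
                     run=subprocess.run):
    """Synchronize only data that could be recreated from another source."""
    return run(unison_command(local_root, remote_root, paths), check=True)
```

Restricting the paths explicitly means that a Unison mishap can only touch data that falls into the 'recreatable' category.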

Some lateral thinking can lead to useful savings in terms of the amount of binary data you store. For example, I store only the large versions of my photos in my repository, but I've set up Makefiles so that the thumbnails and web pages that allow one to sensibly view these files are created after checkout (or any changes to the photos). Although the saving of around 15% that I get in this particular case might not seem very significant, this actually translates to a useful saving when checking out a fresh repository or manipulating files because binary data tends to dwarf textual data in size.
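
The rule a Makefile applies here is simple: regenerate a derived file when it is missing or older than its source. The following Python sketch shows that rule only. It is not the Makefile itself, and the directory layout and the assumption that a thumbnail has the same file name as its photo are invented for the example.

```python
from pathlib import Path


def stale_thumbnails(photo_dir, thumb_dir, suffix=".jpg"):
    """Return (photo, thumbnail) pairs that need (re)generating.

    A thumbnail is considered stale if it does not exist or if it is
    older than the photo it is derived from, which is the same
    timestamp comparison that make performs.
    """
    photo_dir, thumb_dir = Path(photo_dir), Path(thumb_dir)
    stale = []
    for photo in sorted(photo_dir.glob("*" + suffix)):
        thumb = thumb_dir / photo.name  # assumed naming scheme
        if not thumb.exists() or thumb.stat().st_mtime < photo.stat().st_mtime:
            stale.append((photo, thumb))
    return stale
```

Only the photos need to live in the repository; everything this function reports can be rebuilt on each machine after a checkout.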

E-mail is special.

Using either version control or the binary data technique outlined above for e-mail would be masochistic. I use OfflineIMAP to synchronize my e-mail because it's better suited to the task and I use some other useful tricks on it (which I will document in a later entry).

Automate your setup.

I have a couple of small scripts which make my life a lot easier. The first is an obvious one which I call cvssync (not the best name in retrospect) and which takes one of two arguments: ci or up. It goes through all my various CVS modules and updates them or commits the changes, runs some Unison commands, calls my cvsfix script (see Joey's article for suggestions on what this should do), and performs a few other minor tasks, none of which I need to explicitly remember.
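
A minimal sketch of the core of such a script might look like the following. It is not the real cvssync: the module list, the commit message, and the function names are assumptions, and the Unison and cvsfix steps are left out. It relies only on the standard `cvs update -d -P` and `cvs commit -m` commands.

```python
import os
import subprocess

# Hypothetical module list; each module is assumed to be checked out
# as a directory of the same name directly under the home directory.
MODULES = ["bin", "misc", "research", "web"]


def cvs_command(action, message="sync"):
    """Return the cvs argument list for the 'up' or 'ci' action."""
    if action == "up":
        # -d creates directories added to the repository since the last
        # update; -P prunes directories that have become empty.
        return ["cvs", "-q", "update", "-d", "-P"]
    if action == "ci":
        return ["cvs", "-q", "commit", "-m", message]
    raise ValueError("expected 'ci' or 'up', got %r" % (action,))


def sync(action, home, modules=MODULES, run=subprocess.run):
    """Run the chosen action in every module that is checked out locally."""
    cmd = cvs_command(action)
    for module in modules:
        path = os.path.join(home, module)
        if os.path.isdir(path):  # skip modules not present on this machine
            run(cmd, cwd=path, check=True)
```

Calling `sync("up", os.path.expanduser("~"))` would then update every module that exists on the current machine, and `sync("ci", ...)` would commit them.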

The second script is much less obvious: I call it cvsbootstrap. When I have a freshly installed machine, I put just this one script on there and it connects to my server and downloads all the various CVS modules etc. on to the new machine. This makes the process of installing a new machine painless. The script takes one of two arguments, maximal or minimal, which determines which modules are checked out (minimal is used on irregularly used machines or on those whose security I do not entirely trust). Since I use this script relatively infrequently, it is often broken by the time I use it on a new machine, since I may have changed the layout of my setup in some minor way, but even when it only does 75% of the job I need it to do, it still saves me a couple of hours of remembering how long-forgotten parts of my setup work. I tend to fix the error that occurred, and then check it in without any testing, which reflects the unusual nature of this script.
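
To make the idea concrete, here is a hedged sketch of the checkout part of such a bootstrap script. The contents of the two profiles, the repository location, and the function names are all assumptions for illustration; the real script does more than this. The only CVS feature used is `cvs -d <root> checkout <module>`.

```python
import subprocess

# Hypothetical profiles: which modules each argument checks out.
# Note that .private is deliberately absent from the minimal profile.
PROFILES = {
    "minimal": ["bin", "misc"],
    "maximal": [".private", "bin", "misc", "photos", "research",
                "share", "src", "web", "work"],
}


def checkout_commands(cvsroot, profile):
    """Return one 'cvs checkout' command per module in the profile."""
    return [["cvs", "-d", cvsroot, "checkout", module]
            for module in PROFILES[profile]]


def bootstrap(cvsroot, profile, home, run=subprocess.run):
    """Check out every module of the profile into the home directory."""
    for cmd in checkout_commands(cvsroot, profile):
        run(cmd, cwd=home, check=True)
```

Because `check=True` stops at the first failing checkout, a layout change that breaks the script shows up immediately at the module that caused it.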

Create a complete backup before you try this.

Trust me on this one. At first you will either forget to add files, not add them correctly, not fully understand the software you're using, or suffer some similar problem. If you have a backup you can fix these problems with little penalty; after a month or so without problems, you may well feel comfortable discarding the backup.
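
Any backup method will do. As one possibility, the sketch below writes a compressed tar archive of a home directory using Python's standard tarfile module. The archive name and the function are assumptions for this example, and the destination is required to lie outside the home directory so that the archive does not try to include itself.

```python
import tarfile
import time
from pathlib import Path


def backup_home(home, dest_dir):
    """Write a gzip-compressed tar archive of home into dest_dir."""
    home = Path(home).resolve()
    dest_dir = Path(dest_dir).resolve()
    # Refuse to write the archive inside the tree being archived.
    if dest_dir == home or home in dest_dir.parents:
        raise ValueError("dest_dir must be outside the home directory")
    archive = dest_dir / ("home-backup-%s.tar.gz" % time.strftime("%Y%m%d"))
    with tarfile.open(str(archive), "w:gz") as tar:
        # Store paths relative to the directory name, not the full path.
        tar.add(str(home), arcname=home.name)
    return archive
```

The resulting archive can simply be deleted once the new setup has proved itself.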
