RSS / XML feed

Laurence Tratt's Technical Articles


Good programmers are good sysadmins are good programmers

March 20 2009

It is human nature to assume that what is familiar to us, is familiar to all. I can still clearly remember, when I was around 12, going to a friends house, and being astonished at the fact that around my dinner plate were three pairs of knives and forks. In fact, petrified would probably be a better word - not only had I never personally seen anything like it, I had never imagined that anyone would, or indeed could, use more than one knife and fork in a single meal.

Of course, my life thus far has not just involved my surprise at other peoples habits; on occasions (less rare than my ego might have preferred), other people have been surprised at mine. Recently a non-computing friend saw my main computer workspace - a Unix setup with 4 xterms displayed - and asked, jokingly, if I was plotting to take over the world (I blame the media for this particular image). I'm so used to my own setup that I no longer think of it as odd but, when suitably prompted, I can see why other people might think so. In comparison to a Windows or Mac machine, full of little visual goodies, and perhaps with only a web browser or word processor loaded, a number of tiny xterms filled with half-executed commands does look odd.

It's not really surprising that a non-computing person would find my setup odd - after all, I spend a lot of time on computers, so it's to be expected that some things that I have come to find natural scare casual users. What has surprised, and continues to surprise me, is how many computing people I come across find my setup odd - sufficiently odd that it attracts comment. Some people are baffled as to why my systems are as they are, some are curious as to how it works, and some people sneer at the way I do things. There is no good answer to the sneer, nor any great reason to answer that person (although, possibly due to a deep character flaw, I find the sneer rather amusing). However the how and why are interesting questions, which raise interesting points, and are more closely integrated than they may first appear.

Here's the broad setup I use. I have a desktop machine (because it's fast and comfortable), a laptop (which I use only when out and about, because laptops are slow and ergonomically disastrous), a main server (where you're probably reading this from), and a backup server (for the next time the water company cuts through the cable powering the main server; in an attempt to salve my environmental qualms, the backup server is a very low power device that also serves some domestic purposes). Though various people have some sort of access to the servers, I personally administer all 4 machines. This raises two immediate problems: how to keep the administration overhead to a minimum; and how to keep files synchronised between each machine.

The answer to the administration overhead question is, for me, simple: use the same operating system for all machines. That way, the lessons learned on one machine apply trivially to the others (and, when things go belly up, machines can relatively easily stand in for one another). As someone who (through a quirk of geography and history) never passed through the DOS / Windows world, I eventually gravitated towards Unix operating systems and, after a brief flirtation with Linux, I've exclusively used OpenBSD for nearly 10 years. This immediately scares most people off, or confirms their worst suspicions of me - to give you an idea of the popularity of this OS, at the time of writing, I've met precisely 1 OpenBSD user in real life. Why did I chose OpenBSD? Simple: it's simple. OpenBSD does very little by default, and what it does, it does well, consistently, with minimal configuration, good documentation, and is easily administered remotely. I can have a blank box turned into a complete OpenBSD install with everything I want, setup how I need, in a couple of hours (most of which is automatic downloading of stuff, and doesn't require my presence). I keep an open mind about OS replacements, but so far none of them appears an improvement, or even a sideways step. Of course, using what is often dismissively called a server operating system does involve some compromises, although less than you might think - the increasing diversity of real-world OS's (thanks indirectly, I think, to OS X) has meant that running a minority platform involves fewer compromises than it did 5 or 6 years back.

The file synchronization problem is a little more subtle, but arguably more important. I outlined my mechanism for this a while back and while it's changed in detail quite a bit, in spirit it's still the same: I use a version control system (git these days) for my important files and Unison for large files that I can recreate via other mechanisms.

A corollary of using a decent Unix, and synchronizing files automatically, is that virtually everything is configured by simple text files, so to a large extent my configuration also propagates across machines. I have also tried over the years to accept, whenever possible, the default configuration on a machine. The reason for this is simple: the less I feel the need to change, the easier it is to move between machines (and different OS's). Of course, there's a limit to how far I'm prepared to accept someone else's choices, and so I do change a reasonable number of settings; but, compared to most people I know, I change relatively little.

As well as trying to use the default configuration as far as is practicable, I also try to maximise my use of tools supplied with the OS and, failing that, to use the simplest tool that does the task I require. A decent Unix comes with a wide variety of little tools, most of them neglected by most users; it continues to amaze me as to how many tasks can be expressed in terms of these little tools. Using tools that are standard across many different machines and OS's again lowers the barriers to moving between machines. It also generally implies a greater consistency of user experience, since tools from the same providers tend to be more consistent; some providers (particularly commercial) seem to delight in perverse user interface choices, which means that installing and learning new tools can be an uncomfortable experience. I also try, whenever possible, to use command-line tools not because - despite what some of my friends think - I like being obscure (my formative years were spent on RISC OS, where the GUI was King and the default assumption was that the command-line was for the mentally unsound) but because it's easier to control command-line tools and maintain consistency across platforms.

Most of my time on a computer is spent either doing e-mail (I use mutt because it's the least annoying mail client I've yet found, despite its obvious limitations), web browsing, programming, or writing. The latter two tasks are the most interesting. If for you, as for me, an average day is a wild trip of programming in several languages, and working on several papers of dubious literary merit, then you'll know how much time one can spend editing text; I often have 20 or 30 files open for editing. The sad truth of the matter is that, as far as I can tell, the modern computing world does not contain a single decent text editor (whereas RISC OS, which I mentioned earlier, had at least 2 excellent text editors). Most text editors are either arcane (e.g. vi and emacs) or bloated (e.g. Eclipse). Since I am not clever enough for the former, and far too impatient to wait for the latter to load (I had a massive shock 4 or 5 years back when, on a powerful machine, I found to my horror that if I typed at full speed in a well known IDE, there was a noticeable lag in text appearing on screen), I use a half-way option, NEdit. NEdit has many limitations and flaws, but it's simple, loads almost immediately, and its syntax highlighting is just about powerful enough to satisfy me.

Let's return to the 4 xterms I mentioned at the beginning of the article, which scared my non-computing friend - it's both worse and better than it seems. I setup KDE so that it has 12 virtual desktops (one of the main reasons I used KDE in the early days was because it binds sensible keys to virtual desktop selection by default), of which I typically use 7 to 8 at any given point. The first is my main work area; one of the xterms has mutt permanently loaded (I occasionally load other mutts in different xterms to simultaneously read multiple folders; an advantage of using a simple tool), one is mostly used for downloading e-mail, and the other two are for random commands and ssh sessions. Desktop 2 is for web browsing and desktop 3 for web page editing. Desktop 4 is my calendar. Desktops 5 and 6 are for programming. Desktops 7 and 8 are for paper writing, with 9 and 10 being used for secondary paper writing. The remaining desktops are spares. This may seem complex or unduly pernickety. The answer to both points is essentially the same: I evolved this setup organically over time so it seems natural, to me at least. Because of virtual desktops, I need only a single 19" monitor, although these days it's a struggle to find a sensibly sized monitor. Unfortunately, computer people are generally gadget people, and gadget people are easily fooled by bigger, faster, better, so monitors these days have largely useless resolutions. Wide-screen monitors, for example, are (I assume) good for watching films, but they're fairly useless for text editing, where screen height is more important than width. Furthermore, the pointless fixation on resolution means that many fixed size things (some fonts, icons etc.) appear tiny, so I increasingly see people having to put their nose virtually to their screen to read things. 1280x1024 works for me and, until someone doubles the resolution without increasing the screen size (since then one can imagine that old apps could be transparently run with two physical pixels for every logical pixel they perceive, while new apps will have access to the genuine high resolution, making everything a bit smoother and sharper), I will try to resist the siren call of the resolution junkies.

In the above I've tried to give a brief outline of how I use computers - a modern reader may well detect a certain Luddite tendency in some of the choices, but hopefully I've also provided some small justification for each of the detailed choices. Of course, none of the above is really a high-level why. Why go to all this effort? Why these particular choices? Although I didn't explicitly think of it this way when I first started down the path that led me to my current mode of operation, there is a solid reason behind it. When I look at the really good programmers I've come across (directly and indirectly), then, with only one exception I can really think of, they seem to share one thing in common: they're also good sysadmins. Their machine(s) are in good order, simple, with the right hardware for the job, and ready for the task at hand; and they can whip a new machine into shape quickly. When the muse strikes, there is little to get in the way of good work: they know their machines inside out, the tools inside out, all their files are easily available, everything used frequently loads quickly, and they can flick rapidly between the sub-tasks that constitute a larger job. Similarly, the best sysadmins I've seen are also good programmers. I don't think it's possible to understand a modern OS without being a decent programmer - more importantly, it's certainly not possible to tame and control an OS in the desired way without programming being involved. There's a symbiosis between these two activities that seems to me undeniable; being really good in one requires being at least fairly good in the other.

So, in conclusion, my computer setup is an attempt to emulate, in my own small way, the best habits I've been able to pick up from those more able than myself. It's a continual work in progress, but it does the trick for me.

Link to this entry


How can C Programs be so Reliable?

November 11 2008

C is, today, a unique programming language. Surprisingly few people can really program in C and yet many of us have quite strong opinions about it. Buffer overflows, stack smashing, integer overflows - C has many well publicised flaws, and these terms are often bandied about confidently, even by those unfamiliar with C. Personally I shied away from C for a decade, for one reason or another: originally, compilers were expensive (this being the days before free UNIX clones were readily available) and slow; the culture was intimidatory; and, of course, all the C scare stories made me think that a mere mortal programmer such as myself would never be able to write a reliable C program.

Discounting a couple of tiny C modules that I created largely by blindly cutting and pasting from other places, the first C program I wrote was the Converge VM. Two things from this experience surprised me. First, writing C programs turned out not to be that difficult. With hindsight, I should have realised that a youth misspent writing programs in assembler gave me nearly all the mental tools I needed - after all, C is little more than a high-level assembly language. Once one has understood a concept such as pointers (arguably the trickiest concept in low-level languages, having no simple real-world analogy) in one language, one has understood it in every language. Second, the Converge VM hasn't been riddled with bugs as I expected.

In fact, ignoring logic errors that would have happened in any language, only two C-specific errors have thus far caused any real problem in the Converge VM (please note, I'm sure there are lots of bugs lurking - but I'm happy not to have hit too many of them yet). One was a list which wasn't correctly NULL terminated (a classic C error); that took a while to track down. The other was much more subtle, and took several days, spread over a couple of months, to solve. The Converge garbage collector can conservatively garbage collect arbitrary malloc'd chunks of memory, looking for pointers. In all modern architectures, pointers have to live on word-aligned boundaries. However, malloc'd chunks of memory are often not word-aligned in length. Thus sometimes the garbage collector would try and read the 4 bytes of memory starting at position 4 in a chunk - even if that chunk that was only 5 bytes long. In other words, the garbage collector tried to read in 1 byte of proper data and 3 bytes of possibly random stuff in an area of memory it didn't theoretically have access to. The rare, and subtle, errors this led to were almost impossible to reason about. But let's be honest - in how many languages can one retrospectively add a garbage collector?

My experience with the Converge VM didn't really fit my previous prejudices. I had implicitly bought into the idea that C programs segfault at random, eat data, and generally act like Vikings on a day trip to Lindisfarne; in contrast, programs written in higher level languages supposedly fail in nice, predictable patterns. Gradually it occurred to me that virtually all of the software that I use on a daily a basis - that to which I entrust my most important data - is written in C. And I can't remember the last time there was a major problem with any of this software - it's reliable in the sense that it doesn't crash, and also reliable in the sense that it handles minor failures gracefully. Granted, I am extremely fussy about the software I use (I've been an OpenBSD user for 9 years or so, and software doesn't get much better than that), and there are some obvious reasons as to why it might be so reliable: it's used by (relatively) large numbers of people, who help shake out bugs; the software has been developed over a long period of time, so previous generations bore the brunt of the bugs; and, if we're being brutally honest, only fairly competent programmers tend to use C in the first place. But still, the fundamental question remained: why is so much of the software I use in C so reliable?

After a dark period of paper writing, I've recently been doing a little bit of C programming. As someone who, at some points, spends far too much time away from home, reliably sending e-mail has always been an issue. For several years I have sent e-mail by piping messages to a sendmail process on a remote machine via ssh. While this solves several problems (e.g. blacklisting), it has the problem that on many networks (particularly wireless networks) a surprising number of network connections get dropped. Checking that each e-mail has been sent is a frustrating process. So, having mulled on its design for a little while, I decided to create a simple utility to robustly send e-mail via ssh. The resulting program - extsmail - has more features than I'd originally expected, but the basic idea is simply to retry sending messages via an external command such as ssh, until the message has been successfully sent. I also wanted the utility to be as frugal with resources as practical, and to be as portable as possible. This inevitably led to extsmail being written in C. I then decided, as an experiment, to try and write this, as far as possible, in the traditional UNIX way: only to rely on features found in all sensible UNIX clones and to be robust against failure. In so doing, I made two observations, new to me, about writing software in C.

The first observation is semi-obvious. Because software written in C can fail in so many ways, I was much more careful than normal when writing it. In particular, anything involved in manipulating chunks of memory raises the prospect of off-by-one type errors - which are particularly dangerous in C. Whereas in a higher-level language I might be lazy and think hmm, do I need to subtract 1 from this value when I index into the array? Let's run it and find out, in C I thought OK, let's sit down and reason about this. Ironically, the time taken to run-and-discover often seems not to be much different to sit-down-and-think - except the latter is a lot more mentally draining.

The second observation is something I had not previously considered. In C there is no exception handling. If, as in the case of extsmail, one wants to be robust against errors, one has to handle all possible error paths oneself. This is extremely painful in one way - a huge proportion (I would guess at least 40%) of extsmail is dedicated to detecting and recovering from errors - although made easier by the fact that UNIX functions always carefully detail how and when they will fail. In other words, when one calls a function like stat in C, the documentation lists all the failure conditions; the user can then easily choose which errors conditions he wishes his program to recover from, and which are fatal to further execution (in extsmail, out of memory errors are about the only fatal errors). This is a huge difference in mind-set from exception based languages, where the typical philosophy is to write code as normal, only rarely inserting try ... catch blocks to recover from specific errors (which are only sporadically documented). Java, with its checked exceptions, takes a different approach telling the user you must try and catch these specific exceptions when you call this function.

What I realised is that neither exception-based approach is appropriate when one wishes to make software as robust as possible. What one needs is to know exactly which errors / exceptions a function can return / raise, and then deal with each on a case-by-case basis. While it is possible that modern IDEs could (indeed, they may well do, for all I know) automatically show you some of the exceptions that a given function can raise, this can only go so far. Theoretically speaking, sub-classing and polymorphism in OO languages means that pre-compiled libraries can not be sure what exceptions a given function call may raise (since subclasses may overload functions, which can then raise different exceptions). From a practical point of view, I suspect that many functions would claim to raise so many different exceptions that the user would be overwhelmed: in contrast, the UNIX functions are very aware that they need to minimise the amount of errors that they return to the user, either by recovering from internal failure, or by grouping errors. I further suspect that many libraries that rely on exception handling would need to be substantially rewritten to reduce the number of exceptions they raise to a reasonable number. Furthermore, it is the caller of a function who needs to determine which errors are minor and can be recovered from, and which cause more fundamental problems, possibly resulting in the program exiting; checked exceptions, by forcing the caller to deal with certain exceptions, miss the point here.

Henry Spencer said, Those who don't understand UNIX are doomed to reinvent it, poorly. And that's probably why so many of the programs written in C are more reliable than our prejudices might suggest - the UNIX culture, the oldest and wisest in mainstream computing, has found ways of turning some of C's limitations and flaws into advantages. As my experience shows, I am yet another person to slowly realise this. All that said, I don't recommend using C unless much thought has been given to the decision - the resulting software might be reliable, but it will have taken a significant human effort to produce it.

Link to this entry


Free Text Geocoding

September 1 2008

I think nearly everyone would agree that maps are useful things; I've only met a couple of people who labour under the illusion that they know the way from anywhere to anywhere. Personally I find maps more than interesting - they're often fascinating. Maps are one thing the British do immensely well (judging by some of the inaccurate doodles I've seen abroad). By looking at the detailed Ordnance Survey map of the area where I grew up, I can easily see layers of historical information such as: the Roman road a mile away; the long-abandoned railway lines upon which roads were later built; the now-drained marshes and the canal network later built upon them. Of course, these are fairly standard maps with a well-understood relationship to the real world; sites such as Strange Maps and Mark Easton's blog show how maps can present data in many other forms. My headmaster at school thought that geography as a subject had lost its way; instead of focusing on rock strata, it should focus on teaching children where places where. At the time, we all thought he was a touch eccentric; in retrospect, he was absolutely right. Huge tracts of politics, economics, and history only make sense in the context of a given place(s). I could go on, but I hope my point is clear: maps are important, and not just for finding our way around.

A big problem with traditional maps is finding things: places, areas, and so on. Even for a medium sized paper map, the size of the index is huge; and the index for a good quality London A-Z is often almost as big as the map itself. Paper map indexes have their own funny little language, trying to squeeze as much data in as possible. However paper indexes have two problems. First, how many times have you forgotten what square you were supposed to look at when you get to the proscribed page? Second, indexes can only grow to a certain size before they are unusable.

Computer geocoding

The advent of computer mapping has been a real boon, and not to just map fiends such as I. The first site I used regularly was Street Map. Having the map data for the whole UK was incredibly useful and the search facilities, while crude, were a huge improvement on a paper index. When later sites such as Google Maps became available, I was stunned. Suddenly freely viewable map data (though note I do not use the phrase free map data) for much of the world was coupled with a new way of searching. No paper index, or Streetmap's cloying need to be told what type of search was being performed (e.g. place name, street name, post code etc.). Instead is what I call (for want of a better name) free text geocoding: that is, where one types in something in the format used in real-life, and the search engine finds the right place automatically. One doesn't need to tell the search engine that the search is for a postcode or a place-name, or even which country one is searching - it magically does the right thing. Well, of course - not always the right thing. For example, at the time of writing, if I do a search for Penrith in Google Maps UK, I get taken to a suburb of Sydney in Australia. The poor residents of Penrith in the North of England don't even have their town mentioned as a possible match for the search; one must instead search for something like Penrith Cumbria or Penrith UK. There are a range of related minor infelicities in both Google Maps and Yahoo Maps. However on average, both do an adequate job.

There are several reasons why sites such as Google Maps are not as widely used as they might be. Regrettably the chief reason is legal: there are restrictions on the way that map data can be used. Fortunately an initiative - OpenStreetMap - to create much freer map data was started a few years back (and started by Brits - we are, it appears, a nation of map lovers) and is now at the stage where, despite many huge holes in its data, it is semi-usable: in central London, for example, it already has arguably the highest quality maps. Some of the uses of OpenStreetMap are already quite astonishing (the UK postcode layer is a simple, but effective, example of what can be done - click the little + icon in the top-right of the map for more fun), and its accelerating progress is impressive.

One part of OpenStreetMap that frustrates me a little is its search (the Name Finder). Although it does a reasonable job in many respects, the results it returns are hard to interpret (type in Penrith and then try and work out which result is the UK town - I got this wrong on my first go and I'm English!), and it's not easy to use it outside of a website (because it's written in PHP). Furthermore the version running on OpenStreetMap's front page is painfully slow (possibly because it's running on overloaded servers). [At the time of writing, I can't even work out how to get it to find Penrith in the UK automatically. Queries like Penrith, UK crash it. Penrith - please don't take the cold-shoulder from the map search engines personally!]

Free text geocoding

Since one of the huge benefits of using computer mapping is its search ability, I thought a fun little summer project would be to create my own free text geocoder. I started with only a vague idea of what a free text geocoder should (or could) do. While I don't now claim to have thought of everything, I do now have a much clearer idea of what a good free text geocoder should do. I've split these into must and should haves. A free text geocoder must:

  1. give accurate results (meaning it always, somewhere in the list of matches, gives the place the user is looking for). While this might seem obvious, some of the existing free text geocoders don't do this, as we saw earlier.
  2. require as little formatting from the user as practicable (e.g. London SW1 and SW1 London are both likely searches).
  3. return any unmatched text (thus allowing searches like cafes in Pimlico to return Pimlico as the place matched and cafes in as unmatched text which an application can then use as it pleases).
  4. be fast: results shouldn't take more than a second on average (and preferably should be much quicker).
  5. be localisable. This is both linguistic and cultural. Searching should take place disregarding the users input language, but results should be localised when possible. Results should be formatted relative to the users local cultural expectations (e.g. in the US, state names are always shown; in the UK county names are nearly always shown).
If possible, it should:
  • be possible to use it do more than just find the latitude and longitude of a place.
  • try and weight the results so that the most likely match(es) are given higher priority (e.g. an Englishman searching for Penrith should see the English town as the first match, while an Australian should see the Sydney suburb).
  • be usable in different contexts (e.g. in websites, or in applications).
  • be amenable to (possibly quite low-level) customization.

Fetegeo

To this end I created, and have now released, Fetegeo with a BSD / MIT licence. Using Fetegeo's included client / server interface, queries can be performed on the command-line:

$ fetegeoc geo London
Match #1
  ID: 719913
  Name: London
  Latitude: 51.508415
  Longitude: -0.125533
  Country ID: 233
  Parent ID: 1262
  Population: 7421209
  PP: London, United Kingdom
  Dangling text:
Of course, there are a lot of London's in the world and I haven't copied all of Fetegeo's output. Notice though that, since the preferred country of the user wasn't specified, it's chosen what most people are likely to consider to be the London as the first match. If the user specifies that their country is Canada then London in Ontario is the first match:
$ fetegeoc -c ca geo london
Match #1
  ID: 2984878
  Name: London
  Latitude: 42.983389283
  Longitude: -81.233042387
  Country ID: 39
  Parent ID: 540
  Population: 346765
  PP: London, Ontario
  Dangling text:
Fetegeo can be instructed to allow dangling (i.e. unmatched) text in matches:
$ fetegeoc -d geo Museums in London
Match #1
  ID: 719913
  Name: London
  Latitude: 51.508415
  Longitude: -0.125533
  Country ID: 233
  Parent ID: 1262
  Population: 7421209
  PP: London, United Kingdom
  Dangling text: Museums in

If you're interested, there's a slightly more thorough description of the ways that Fetegeo can be used, and a simple demo which geocodes results and shows them on an OpenStreetMap map.

How Fetegeo works

Internally, Fetegeo's search is fairly simple and its approach is easily described. Strings in Fetegeo are always normalised; in particular punctuation is removed, and strings are lower-cased. String queries to the database are always on hashes of normalised words. Given a normalised string S, Fetegeo breaks it into a list of words. It then works right-to-left in the list, trying to find all possible matches. Whenever a match within S occurs, a counter is decremented meaning that subsequent matching takes place n elements from the end of the list. Matches are greedy; they always try to first match the maximum number of words possible, before gradually trying fewer words. Matches are exhaustive in the sense that all possibilities are tried; however only the longest matches are eventually returned to the user (some obvious optimisations are used here, so that some possibilities aren't tried if it's obvious they won't work). Fetegeo first of all tries to match the entirety of S as being a place within the users country; if that fails, it tries to match a country name at the right-hand side of the string. It then tries to match places and postcodes; if a place is found, it is considered to be a parent area; subsequent matches can only match places within that area or (recursively) a sub-area. Postcodes can occur at any point in the match. Of course, there's a lot more detail than this in the code, but this is the essence of Fetegeo's fast free text geocoding.

How Fetegeo compares

How does Fetegeo score on the must / should chart?

It's too early to say how accurate Fetegeo's results are. First, Fetegeo has only been relatively lightly tested so far: it's inevitable that there will be bugs and oversights. Second, any free text geocoder is subject to something I'm tentatively calling Tratt's First Law of Free Text Geocoding (though I doubt I'm the first to think of it): the upper bound for results quality is determined by your dataset. The best free text geocoder in the world can only give iffy results with an iffy dataset. Fetegeo's initial dataset is based on Geonames data (and postcode data from various other sources). While Geonames should be saluted as the first serious attempt to collate freely-available place data, the structure of the data is less than ideal, and the data itself is of variable quality, suffering from frequent inaccuracies and duplication. Because of this, Fetegeo has been designed to be relatively independent of any particular dataset; I hope one day in the not too distant future that OpenStreetMap's data will be sufficiently broad in scope to replace Geoname's (OpenStreetMap's data is already deeper in the sense that it includes roads, which Geonames doesn't).

Fetegeo is already reasonably fast, given that it's only been semi-optimised. On my 3-ish year old desktop machine, using the stock install of PostgreSQL (a few tweaks would, I suspect, make it perform much better - if only I could work out what those tweaks were amongst the mass of overlapping configuration options!), typical queries are answered in less than 0.1s. Fetegeo makes use of simple caching internally to speed things up. Someone who understands databases better than I could almost certainly make things run much faster.

Fetegeo has the beginnings of being usable for more than just longitude / latitude searches, but there is some way to go yet to prove this is feasible. In particular I would like to see it capable of being used by applications to classify things as being in particular areas. Imagine you have a website listing X's in Britain (where X could be just about anything), where each X is located at a particular latitude / longitude. This allows one to easily search for all X's near place P. However users often want to perform area searches such as all X's in London or all X's in Rutland. Exposing the identifiers of places, counties, states (and so on) makes this latter type of query feasible.

Fetegeo is usable in a number of different ways. As such, Fetegeo is just a Python library which can be included and used in any application. Fetegeo also comes with a standard internet server and (command-line) client, which can receive and answer XML queries (as an aside, the XML parser used is often the slowest part of querying). This means that even a simple web-site can query a single Fetegeo server and make use of its caching facilities and so on.

Conclusion

I have no idea whether anyone will find Fetegeo useful. It seems to me that, even in its current embryonic form, it fills an unoccupied niche, at least in terms of its licence if not its functionality (yet). I hope that other people might find it interesting, and start to extend its functionality to make it more widely applicable. If you want to find out more, and contribute, please waltz on over to fetegeo.org.

Link to this entry


Extended Backtraces

June 2 2008

I can quite distinctly remember when, as a teenager, I realised that I would spend much of my life debugging. I was programming in assembler, where loops are simply branch statements to defined labels. Because loops are so common, one would use the same label for each loop; later loops would redefine the label, meaning that no problems could occur. Being a literal person I used the label loop for all such loops. When one of my programs mysteriously failed, I could not work out why. Eventually I realised that one of my label definitions had spelt looop (three O's instead of two) instead of loop, so my loop had branched back to the previous loop in the file. Spotting that took me a couple of days.

Later, I realised that most programming errors fit into two broad categories: the obvious and the subtle. Obvious errors are those whose source can be easily pinpointed (even if fixing the problem takes a while). The subtle are typically those where cause and effect are separated, making identification of the root of the problem difficult (often, when eventually located, such problems are easily fixed). The looop problem above was, to a relative novice programmer, a very subtle problem (years of experience have taught me that looking for daft spellings in my own programs is a good initial target when debugging).

There are, to my mind, two tricks to debugging. The first is to try and turn subtle problems into obvious problems; however subtle problems are typically inherently subtle and unamenable to such treatment. The second is to try and speed up the solving of obvious problems. For me, the main tool for solving obvious problems is the humble backtrace which, when an exception occurs, shows one (in some manner or another) the call stack and, hopefully, what file and line number each entry therein is associated with. Given the following trivial program:

func head(l):
    return [l[0], l[1 : ]]

func main():
    head([1])
    head([])
a standard looking backtrace would be:
Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6
  2: File "/tmp/head.cv", line 2
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
Using this we can fairly quickly see that the cause of our error is passing an empty list to a function which assumes that there is at least one element in the list. [As a side note, this example is in one way unrepresentative: in the vast majority of cases, it's typically the bottom one or two lines of the backtrace that pinpoint the real source of the error.]

Backtraces like the above can be found in most modern programming languages like Java. They are immensely useful and form precisely half of my debugging toolkit, the other half being printf - in my view of the world, these two tools obviate the need for debuggers. The power of backtraces is most obviously felt in those languages that don't have them. C programs typically need to be run through a debugger to get a backtrace, meaning that errors in programs running in production can be extremely difficult to diagnose. The first Haskell program I wrote had the head error in it. The resulting message just said Prelude.head: empty list with nothing else to help me - no line numbers or even file names. Needless to say, it took me a long while to work out what this meant, and how it happened (if I remember correctly, I passed an empty list to a library function which, in turn, called head). Unsurprisingly that was also pretty much the last program I wrote in Haskell - languages that turn should-be-obvious errors into subtle errors are of no use to me. [Apparently Haskell now has some in-built, though hardly easy to use, support for backtraces. The implications relating to Haskell's prioritisation of features are, to my mind, highly amusing.]

Python was the first language I saw that took backtraces a little bit further, printing (when possible) the line of source code associated with each part of the backtrace. A Python-esque backtrace looks roughly as follows:

Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6
       head([])
  2: File "/tmp/head.cv", line 2
       return [l[0], l[1 : ]]
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This simple innovation is a real boon: as in this case, one often doen't even need to open a source file in a text editor to see the error made. Python-esque backtraces help make obvious errors quicker to solve than traditional backtraces.

I realised early on in Converge's development that knowing merely the line number of an error was only part of the problem. Often a specific sub-expression within a certain line is the relevant part of the backtrace, and the rest of the line is noise. Converge therefore recorded the column (i.e. offset within a line) where each error is associated with, meaning that backtraces looked like the following:

Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6, column 4
  2: File "/tmp/head.cv", line 2, column 13
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This extra information is very helpful: it means that I can accurately pinpoint which of the two list lookups in line 2 is responsible for calling List.get incorrectly. As a useful advantage, Converge's approach also means that errors that happen within multi-line statements (i.e. logical lines of source split over multiple physical lines in a source file to aid presentation) work properly.

Converge's backtraces stayed like the above for quite some time, until recently when I realised that knowing the start column associated with an error is only part of the story. What one really wants to know is the start and end of the associated expression. A small tweak to the parser, and a huge (but mechanical) change to the compiler, and Converge backtraces could tell one how many characters in the line an error was associated with:

Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6, column 4, length 8
       head([])
  2: File "/tmp/head.cv", line 2, column 12, length 4
       return [l[0], l[1 : ]]
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This is almost helpful, but in practice I find it surprisingly hard to count n characters within a line on screen, which hinders interpretation of the above data.

A short while later, the answer hit me: what the backtraces need to do is to highlight the relevant sub-expression within the line. Here's a screenshot of the above error running in an xterm with the latest version of Converge:

As you might notice, the tiny little difference here is that the part of each line pertinent to the error is in bold and underlined. Knowing that, one can instantly see that the first of the two list lookups on line 2 is responsible for calling List.get incorrectly. Interestingly, my first attempt at this put the offending fragments only in bold, but since whitespace can sometimes be a significant part of an error, underlining can be a useful aid. In the case where the associated source code is split over multiple lines, the first relevant line of source code is printed with ... added to the end of the line to inform the user that the printed line is not the end of the story.

As I explained in a previous entry, when Converge DSLs are translated into Converge ASTs, individual call stack entries can be associated with more than one source location. This means that backtraces tend to be rather long, which previously made tracking down the cause of an error tedious - loading multiple files into text editors and continually flipping back and forth to xterms is not fun. Extended backtraces become a real life saver in this regard. Here's an example where a DSL incorrectly tries to subtract a string from an integer:

Looking at this backtrace, an experienced programmer will be able to quickly surmise that, given the exception message, the most likely candidate for this error is in the ex4.cv file (which, as the eagle-eyed may notice, is code written in the DSL - Converge's errors work with both the base language and with embedded DSLs). Imagine trying to debug this with a traditional backtrace: there's a lot of information in the backtrace, and there would be no indication as to which part of it is more likely to be responsible for the error.

From a practical point of view, Converge's extended backtraces have no run-time penalty for correct code, and users don't have to do anything to enable them - they're a standard part of the system. Extended backtraces can be found in -current versions of Converge (at the time of writing, Converge's support for Curses under Windows is weak, so underlining doesn't work there - it's a quick, fun little project for someone who's interested).

So, going back to the start of this entry, how do Converge's extended backtraces help with debugging? Well, they might help turn the odd subtle error into an obvious error, but that's an incidental benefit. What I think they do is make solving obvious errors much quicker than previously. In the sort time since I've had extended backtraces, I've noticed that I've often been able to almost instantly fix errors that before might have taken me a couple of minutes. Given the number of programming errors I make, the cumulative time saving is most welcome.

In summary, I think that Converge's extended backtraces are a real boon to programming. To the best of my knowledge, Converge is the first language with such backtraces in - I hope it won't be the last!

Link to this entry


Designing Sane Scoping Rules

March 3 2008

If there's one thing which unites pretty much every post-assembly programming language it is the use of the humble variable. Variables are such a common feature that we tend to take them for granted; perhaps I show my background in assembly by being explicitly aware of them. However where programming languages often differ is in the way that they allow one to reference variables: the dreaded scoping rules. In this article I'm going to outline why Converge has the scoping rules it does.

The first thing I need to do is to outline the problem. Although all my examples are framed in terms of a typical imperative programming language, the underlying concepts all translate fairly directly to functional languages too. Here's the simplest example possible:

x := 2
...
y := x
In every language I know this says assign 2 to x and the assign the value of x to y. So after this code is run both x and y will have the value 2.

The first major issue with regards to variables is global vs. local variables. Take the following code which is intended to represent the top-level of a source file:

x := 2

func f():
    x := 3

f()
In a lot of BASIC-type languages there is only one underlying x variable in this program, so after this code is run the outer x will have the value 3. In essence we have a flat variable scope: all variables belong to the same, single, namespace. This not only makes writing programs error-prone (e.g. one function accidentally corrupts another's x), but makes certain styles of programming largely impractical (e.g. recursive functions). It's probably no coincidence that the first mainstream programming language to make a virtue out of recursion - Algol 60 - also was among the first to get its scoping rules in reasonable order.

Retro-fitting sane scoping rules is not easy if the first version of your language used the above scoping rule (note the deliberate use of the singular) - any change of scoping rule(s) has a high chance of breaking programs in nasty ways. Some of the BASIC-type languages I first used solved the backwards compatibility problem by making variables global by default but allowing variables in functions to be declared as local. This allows one to rewrite the above example as follows:

x := 2

func f():
    local x
    x := 3

f()
This code now has two distinct x variables: one at the top-level and one in the f function. Every time f is called - even recursively - it will be given space for a new, fresh x. This feature makes programming a lot easier, even if it defaults to the insane global scoping rule by default. As an aside, surprisingly (to me at least), you can still see the global by default scoping rule in the modern language Lua. This goes to show how fundamental scoping rules are: once they're in a language, users will resist nearly all meaningful change to them.

Most programming languages adopt a slight variant of the above rules which are fairly easy to understand in practice. Essentially variables with the same name as a top-level variable reference that top-level variable directly, while all other variables are local. So in a C-type language the following code contains two variables: a top-level x and a y local to f:

x := 2

func f():
    x := 3
    y := 4

f()
Running the above code means that the top-level x is set to 3, while the y variable is local to f. Variations on this set of scoping rules underly many programming languages in use today.

The next major design challenge for scoping rules is much more subtle and confuses many of us to this day. Knowing that the global keyword in Python declares that assignment to a specified variable doesn't make it local to that scope, consider the following Python code:

x = 2

def f():
    x = 3
    
    def g():
        global x
        print x
        x = 4
    
    print x
    g()

f()
print x
What do you think this will print out? Let's try it out with Python 2.5:
$ python scope.py
3
2
4
$
In other words, we get a result which is a long way from what we might have expected: the print statement in g prints 2 instead of the expected 3 and the final print statement prints 4 instead of the expected 2. What is it doing? What's happening is that most languages don't have nested scopes (as one might expect) but two scopes: a top-level (a.k.a. global or module) scope and a function scope. What this means is that the assignment to x in g references the top-level x, not the x in f; you might want to read that twice to check that you've really understood it.

It might at first seem that Python's scoping rules are simply silly; actually, they're not unreasonable and they're shared by most programming languages (e.g. Java). Why? The problem is that the function g might outlive f. Here's a simple example:

func f():
    x := 2
    func g():
        return x
    return g

f()()
In other words, f returns a reference to g; when g is executed, the value of x known to f will have disappeared as, in most programming languages, variables are stored on the stack. This means that variables only exist for the duration of a function call. Since f's variables will have disappeared when g is executed, all sorts of bad things could happen.

Scheme was the first language that presented a practical solution to this problem of nested scopes in the form of closures. The standard way that closures are defined is guaranteed to confuse and I'm not going to repeat it. They're actually very simple: essentially each function allocates heap memory to store variables on. Thus if an inner function outlives an outer function there is no problem in referencing variables in the outer function even if the stack space has long since disappeared, since a function calls variables can outlive the function call itself.

[As an aside, the fact that closures need to allocate heap memory (although it's often possible to statically analyse such allocations away) has been used as an argument against them in languages such as Java. That's the chief reason that Java has all sorts of complications like inner classes, final variables and so on: Java resisted closures, and then had to resort to hacks to get a poor facsimile of its functionality. It's hard to imagine any decent programming language being built now that doesn't implement closures (Converge certainly does), so closures are gradually losing their exotic tag (which, I suspect, is based on their definition and not their utility, which is far from exotic).]

When I was designing Converge, I put some effort into deciding what its scoping rules were going to be. I wanted to make things safe by default (e.g. no global by default-type nonsense), and to make closures easy to deal with. Converge's scoping rules are (or, at least, were), I believe, the simplest of any imperative programming language. They boil down to this: assigning to a variable makes it local to a function; unless a variable is declared nonlocal, in which case scopes are searched from inner to outer to find the matching variable. That's it. Rewriting the Python example into Converge yields the following:

x := 2

func f():
    x := 3
    
    func g():
        nonlocal x
        Sys::println(x)
        x := 4
    
    Sys::println(x)
    g()

func main():
    f()
    Sys::println(x)
Which when run does the expected:
$ converge s.cv
3
3
2
$
The interesting thing here, in my mind at least, is the nonlocal keyword. Although it's a tad awkward sounding, it was the best combination of brevity and accuracy that I could think of. Unlike Python it would be incorrect to declare a variable as global since there are more than 2 scopes: in fact, there are an arbitrary number. What nonlocal is saying is when you see an x search each successive outer scope in order to find where it was originally defined. It's not a commonly used feature, but when you need it is priceless.

I said above that Converge has - or had - the simplest of any imperative programming language (at least as far as I'm aware). Some time after the first publications and release of Converge, the Python team decided to fix their scoping rules for their backwards-compatibility-breaking Python 3000 release. PEP 3104 contains the eventual proposal they came up with. Interestingly it is identical to Converge's scoping rules even down to using the nonlocal keyword. Please note that I'm not suggesting that the Python team copied Converge uncredited (and even if they did, I wouldn't mind - I actively encourage people to take the good ideas from Converge and put them in other languages). However I think it shows that these are good scoping rules and that eventually imperative programming languages will evolve towards something like them.

The open question is this: why has it taken us, as a community, 50 years or more to define two simple scoping rules (assignment is local; nonlocal successively searches outer scopes) and one simple implementation technique (closures), and why have we taken so many wrong turns (in this article I haven't enumerated the silliest, such as dynamic scoping)? I suspect the answer, if it could be definitively uncovered, would give one a very interesting insight to the subject of computing in general.

Link to this entry

Earlier articles

All articles
 
Last 10 articles
Good programmers are good sysadmins are good programmers
How can C Programs be so Reliable?
Free Text Geocoding
Extended Backtraces
Designing Sane Scoping Rules
Some Lessons Learned from Icon
IEEE Software Special Issue on Dynamically Typed Languages
How Difficult is it to Write a Compiler?
When Are Macros Useful?
Filling in a Gap