Free Text Geocoding
September 1 2008
I think nearly everyone would agree that maps are useful things; I've only met a couple of people who labour under the illusion that they know the way from anywhere to anywhere. Personally I find maps more than interesting - they're often fascinating. Maps are one thing the British do immensely well (judging by some of the inaccurate doodles I've seen abroad). By looking at the detailed Ordnance Survey map of the area where I grew up, I can easily see layers of historical information such as: the Roman road a mile away; the long-abandoned railway lines upon which roads were later built; the now-drained marshes and the canal network later built upon them. Of course, these are fairly standard maps with a well-understood relationship to the real world; sites such as Strange Maps and Mark Easton's blog show how maps can present data in many other forms. My headmaster at school thought that geography as a subject had lost its way; instead of focusing on rock strata, it should focus on teaching children where places were. At the time, we all thought he was a touch eccentric; in retrospect, he was absolutely right. Huge tracts of politics, economics, and history only make sense in the context of a given place or places. I could go on, but I hope my point is clear: maps are important, and not just for finding our way around.
A big problem with traditional maps is finding things: places, areas, and so on. Even for a medium-sized paper map, the size of the index is huge; and the index for a good quality London A-Z is often almost as big as the map itself. Paper map indexes have their own funny little language, trying to squeeze as much data in as possible. However, paper indexes have two problems. First, how many times have you forgotten what square you were supposed to look at by the time you get to the prescribed page? Second, indexes can only grow to a certain size before they are unusable.
The advent of computer mapping has been a real boon, and not just to map fiends such as me. The first site I used regularly was Streetmap. Having the map data for the whole UK was incredibly useful and the search facilities, while crude, were a huge improvement on a paper index. When later sites such as Google Maps became available, I was stunned. Suddenly freely viewable map data (though note I do not use the phrase free map data) for much of the world was coupled with a new way of searching. No paper index, and none of Streetmap's cloying need to be told what type of search was being performed (e.g. place name, street name, postcode etc.). Instead there is what I call (for want of a better name) free text geocoding: that is, where one types in something in the format used in real life, and the search engine finds the right place automatically. One doesn't need to tell the search engine that the search is for a postcode or a place name, or even which country one is searching - it magically does the right thing. Well, of course - not always the right thing. For example, at the time of writing, if I do a search for Penrith in Google Maps UK, I get taken to a suburb of Sydney in Australia. The poor residents of Penrith in the North of England don't even have their town mentioned as a possible match for the search; one must instead search for something like Penrith Cumbria or Penrith UK. There is a range of related minor infelicities in both Google Maps and Yahoo Maps. However, on average, both do an adequate job.
There are several reasons why sites such as Google Maps are not as widely used as they might be. Regrettably the chief reason is legal: there are restrictions on the way that map data can be used. Fortunately an initiative - OpenStreetMap - to create much freer map data was started a few years back (and started by Brits - we are, it appears, a nation of map lovers) and is now at the stage where, despite many huge holes in its data, it is semi-usable: in central London, for example, it already has arguably the highest quality maps. Some of the uses of OpenStreetMap are already quite astonishing (the UK postcode layer is a simple, but effective, example of what can be done - click the little icon in the top-right of the map for more fun), and its accelerating progress is impressive.
One part of OpenStreetMap that frustrates me a little is its search (the Name Finder). Although it does a reasonable job in many respects, the results it returns are hard to interpret (type in Penrith and then try and work out which result is the UK town - I got this wrong on my first go and I'm English!), and it's not easy to use it outside of a website (because it's written in PHP). Furthermore the version running on OpenStreetMap's front page is painfully slow (possibly because it's running on overloaded servers). [At the time of writing, I can't even work out how to get it to find Penrith in the UK automatically. Queries like Penrith, UK crash it. Penrith - please don't take the cold shoulder from the map search engines personally!]
Free text geocoding
Since one of the huge benefits of using computer mapping is its search ability, I thought a fun little summer project would be to create my own free text geocoder. I started with only a vague idea of what a free text geocoder should (or could) do. While I don't claim to have thought of everything, I do now have a much clearer idea of what a good free text geocoder should do. I've split the requirements into must haves and should haves. A free text geocoder must:
- give accurate results (meaning it always, somewhere in the list of matches, gives the place the user is looking for). While this might seem obvious, some of the existing free text geocoders don't do this, as we saw earlier.
- require as little formatting from the user as practicable (e.g. London SW1 and SW1 London are both likely searches).
- return any unmatched text (thus allowing searches like cafes in Pimlico to return Pimlico as the place matched and cafes in as unmatched text which an application can then use as it pleases).
- be fast: results shouldn't take more than a second on average (and preferably should be much quicker).
- be localisable. This is both linguistic and cultural. Searching should take place regardless of the user's input language, but results should be localised when possible. Results should be formatted according to the user's local cultural expectations (e.g. in the US, state names are always shown; in the UK county names are nearly always shown).
If possible, it should:
- be usable for more than just finding the latitude and longitude of a place.
- try to weight the results so that the most likely match(es) are given higher priority (e.g. an Englishman searching for Penrith should see the English town as the first match, while an Australian should see the Sydney suburb).
- be usable in different contexts (e.g. in websites, or in applications).
- be amenable to (possibly quite low-level) customization.
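The weighting requirement is perhaps the easiest one to picture in code. What follows is only a minimal sketch of one way it could be done, not how any particular geocoder does it: the Candidate class, the rank function and the weight numbers are all made up for illustration. Candidates in the user's country are sorted first, and a generic weight breaks ties otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical search result; the field names are assumptions."""
    name: str
    country: str  # two-letter country code, lower-cased
    weight: int   # made-up importance score; bigger means more prominent


def rank(candidates, user_country=None):
    """Put places in the user's country first, then order by weight."""
    def key(c):
        in_user_country = user_country is not None and c.country == user_country
        # False sorts before True, so "not in_user_country" puts local places first.
        return (not in_user_country, -c.weight)
    return sorted(candidates, key=key)


# Illustrative data only: the weights are invented, not real statistics.
LONDONS = [
    Candidate("London, Ontario", "ca", 1),
    Candidate("London, United Kingdom", "gb", 2),
]

print(rank(LONDONS)[0].name)        # no preferred country given
print(rank(LONDONS, "ca")[0].name)  # user says they are in Canada
```

With no country given the more heavily weighted candidate comes first; with "ca" the Canadian one does. Both remain in the list, which is what the accuracy requirement asks for.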
To this end I created, and have now released, Fetegeo with a BSD / MIT licence. Using Fetegeo's included client / server interface, queries can be performed on the command-line:
$ fetegeoc geo London
Country ID: 233
Parent ID: 1262
PP: London, United Kingdom
Of course, there are a lot of Londons in the world and I haven't copied all of Fetegeo's output. Notice though that, since the preferred country of the user wasn't specified, it's chosen what most people are likely to consider to be the London as the first match. If the user specifies that their country is Canada then London in Ontario is the first match:
$ fetegeoc -c ca geo london
Country ID: 39
Parent ID: 540
PP: London, Ontario
Fetegeo can be instructed to allow dangling (i.e. unmatched) text in matches:
$ fetegeoc -d geo Museums in London
Country ID: 233
Parent ID: 1262
PP: London, United Kingdom
Dangling text: Museums in
If you're interested, there's a slightly more thorough description of the ways that Fetegeo can be used, and a simple demo which geocodes results and shows them on an OpenStreetMap map.
How Fetegeo works
Internally, Fetegeo's search is fairly simple and its approach is easily described. Strings in Fetegeo are always normalised; in particular punctuation is removed, and strings are lower-cased. String queries to the database are always on hashes of normalised words. Given a normalised string S, Fetegeo breaks it into a list of words. It then works right-to-left in the list, trying to find all possible matches. Whenever a match within S occurs, a counter is decremented, meaning that subsequent matching takes place n elements from the end of the list. Matches are greedy; they always try to first match the maximum number of words possible, before gradually trying fewer words. Matches are exhaustive in the sense that all possibilities are tried; however only the longest matches are eventually returned to the user (some obvious optimisations are used here, so that some possibilities aren't tried if it's obvious they won't work). Fetegeo first of all tries to match the entirety of S as being a place within the user's country; if that fails, it tries to match a country name at the right-hand side of the string. It then tries to match places and postcodes; if a place is found, it is considered to be a parent area; subsequent matches can only match places within that area or (recursively) a sub-area. Postcodes can occur at any point in the match. Of course, there's a lot more detail than this in the code, but this is the essence of Fetegeo's fast free text geocoding.
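To make the right-to-left, greedy, parent-constrained matching a little more concrete, here is a toy sketch. It is emphatically not Fetegeo's code: the in-memory PLACES table, the use of MD5 as the hash, and the function names are all assumptions for illustration, and it follows only a single greedy path rather than exhaustively trying every possibility. It also leaves out postcodes and the user's preferred country.

```python
import hashlib
import string


def normalise(text):
    """Lower-case the text and strip punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def name_hash(text):
    """Hash of a normalised name (MD5 is an arbitrary choice here)."""
    return hashlib.md5(normalise(text).encode("utf-8")).hexdigest()


# Hypothetical toy gazetteer: (id, parent id, name).
PLACES = [
    (1, None, "United Kingdom"),
    (2, 1, "London"),
    (3, 2, "Pimlico"),
    (4, None, "Canada"),
    (5, 4, "London"),
]
PARENT = {pid: parent for pid, parent, _ in PLACES}
NAME = {pid: name for pid, _, name in PLACES}
INDEX = {}
for pid, _, name in PLACES:
    INDEX.setdefault(name_hash(name), []).append(pid)


def is_within(pid, area):
    """True if place pid lies inside area, directly or via its ancestors."""
    while pid is not None:
        pid = PARENT[pid]
        if pid == area:
            return True
    return False


def geocode(query):
    """Return (matched place names, dangling text) for a free text query."""
    words = normalise(query).split()
    end = len(words)  # matching happens on words[:end], from the right
    area = None       # most recently matched place, used as the parent area
    matched = []
    while end > 0:
        for start in range(end):  # start == 0 is the longest span: greedy
            ids = INDEX.get(name_hash(" ".join(words[start:end])), [])
            ids = [i for i in ids if area is None or is_within(i, area)]
            if ids:
                area = ids[0]  # the sketch keeps only the first candidate
                matched.append(area)
                end = start    # move the right-hand edge leftwards
                break
        else:
            break  # nothing matched: whatever is left is dangling text
    return [NAME[i] for i in matched], " ".join(words[:end])


print(geocode("Pimlico, London, United Kingdom"))
print(geocode("cafes in Pimlico"))
```

The first query matches the country, then London within it, then Pimlico within London, leaving no dangling text. The second matches Pimlico and returns "cafes in" as dangling text. Note how the parent constraint means that "London, Canada" can only ever pick up the Canadian London.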
How Fetegeo compares
How does Fetegeo score on the must / should chart?
It's too early to say how accurate Fetegeo's results are. First, Fetegeo has only been relatively lightly tested so far: it's inevitable that there will be bugs and oversights. Second, any free text geocoder is subject to something I'm tentatively calling Tratt's First Law of Free Text Geocoding (though I doubt I'm the first to think of it): the upper bound for results quality is determined by your dataset. The best free text geocoder in the world can only give iffy results with an iffy dataset. Fetegeo's initial dataset is based on Geonames data (and postcode data from various other sources). While Geonames should be saluted as the first serious attempt to collate freely-available place data, the structure of the data is less than ideal, and the data itself is of variable quality, suffering from frequent inaccuracies and duplication. Because of this, Fetegeo has been designed to be relatively independent of any particular dataset; I hope one day in the not too distant future that OpenStreetMap's data will be sufficiently broad in scope to replace Geonames' (OpenStreetMap's data is already deeper in the sense that it includes roads, which Geonames' doesn't).
Fetegeo is already reasonably fast, given that it's only been semi-optimised. On my 3-ish year old desktop machine, using the stock install of PostgreSQL (a few tweaks would, I suspect, make it perform much better - if only I could work out what those tweaks were amongst the mass of overlapping configuration options!), typical queries are answered in less than 0.1s. Fetegeo makes use of simple caching internally to speed things up. Someone who understands databases better than I could almost certainly make things run much faster.
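As an illustration of the kind of simple caching that helps here, the sketch below memoises lookups on the normalised query string using Python's functools.lru_cache. I'm not claiming this is how Fetegeo's cache is written: slow_lookup is a hypothetical stand-in for a database query, and the cache size is arbitrary.

```python
from functools import lru_cache

# Counts how often the pretend database is actually hit.
CALLS = {"db": 0}


def slow_lookup(normalised_query):
    """Hypothetical stand-in for an expensive database query."""
    CALLS["db"] += 1
    return ("result for", normalised_query)


@lru_cache(maxsize=1024)
def _cached_lookup(normalised_query):
    return slow_lookup(normalised_query)


def lookup(query):
    """Normalise first, so that trivially different queries share a cache entry."""
    return _cached_lookup(" ".join(query.lower().split()))


lookup("London")
lookup("  london ")
print(CALLS["db"])  # the second call was served from the cache
```

Normalising before the cache lookup matters: without it, "London" and "london" would each cost a trip to the database.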
Fetegeo has the beginnings of being usable for more than just longitude / latitude searches, but there is some way to go yet to prove this is feasible. In particular I would like to see it capable of being used by applications to classify things as being in particular areas. Imagine you have a website listing Xs in Britain (where X could be just about anything), where each X is located at a particular latitude / longitude. This allows one to easily search for all Xs near place P. However users often want to perform area searches such as all Xs in London or all Xs in Rutland. Exposing the identifiers of places, counties, states (and so on) makes this latter type of query feasible.
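The difference between the two kinds of query can be sketched as follows. Everything here is hypothetical: the Listing class, the area identifiers (100 and 200 are invented, not real Fetegeo IDs) and the two sample listings, whose coordinates are only approximate. The point is simply that a "near" search needs distances, whereas an "in" search becomes a set membership test once each listing has been tagged with the identifiers of the areas that contain it.

```python
import math
from dataclasses import dataclass

LONDON_ID = 100   # invented identifiers, for illustration only
RUTLAND_ID = 200


@dataclass(frozen=True)
class Listing:
    name: str
    lat: float
    lon: float
    area_ids: frozenset  # identifiers of every area containing this listing


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance using the haversine formula."""
    radius = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def near(listings, lat, lon, radius_km):
    """All listings within radius_km of the given point."""
    return [x for x in listings if distance_km(x.lat, x.lon, lat, lon) <= radius_km]


def within(listings, area_id):
    """All listings tagged as lying inside the given area."""
    return [x for x in listings if area_id in x.area_ids]


LISTINGS = [
    Listing("A", 51.50, -0.12, frozenset({LONDON_ID})),   # roughly central London
    Listing("B", 52.67, -0.73, frozenset({RUTLAND_ID})),  # roughly Oakham, Rutland
]

print([x.name for x in near(LISTINGS, 51.5074, -0.1278, 10)])
print([x.name for x in within(LISTINGS, RUTLAND_ID)])
```

A radius search can only approximate "in Rutland" with a circle; the identifier-based search gets the county's actual extent for free, provided the geocoder exposes those identifiers.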
Fetegeo is usable in a number of different ways. At its core, Fetegeo is just a Python library which can be included and used in any application. Fetegeo also comes with a standard internet server and (command-line) client, which can receive and answer XML queries (as an aside, the XML parser used is often the slowest part of querying). This means that even a simple website can query a single Fetegeo server and make use of its caching facilities and so on.
I have no idea whether anyone will find Fetegeo useful. It seems to me that, even in its current embryonic form, it fills an unoccupied niche, at least in terms of its licence if not its functionality (yet). I hope that other people might find it interesting, and start to extend its functionality to make it more widely applicable. If you want to find out more, and contribute, please waltz on over to fetegeo.org.