Tuesday, November 07, 2006

Geo-ripping Wikipedia

As part of my ongoing quest to stop my drift into senility and PowerPoint (the difference is marginal), and to make sure that when I recommend things to clients they actually work, I went in search of some Web Services to play with. There used to be a useful bunch over at Capescience; now they've just got the Global Weather one, which is okay, but I could do with more than one (and I hate stock quote examples). I also wanted to see what could be done to get some interesting information out of Wikipedia, so I hatched a plan.

The idea was to create a very simple Web Service which took the Wikipedia backup file and extracted from it the geolocation information that now exists on lots of the pages.

Stage one was doing the geo-rip, which was very simple. I elected to use a StAX parser (the reference implementation, in fact) as the file is excessively large. Using StAX was certainly quick (it takes Windows longer to copy the file somewhere than it takes StAX to do the parse), but there are a few odd elements to using it (which I'll probably cover in a separate post). That gave me a simple database schema, created using Hibernate 3 as an EJB 3 implementation and using inheritance in the objects mapped down onto the tables (a very nice implementation by the Hibernate folks).
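For anyone curious what the streaming pass looks like, here's a minimal sketch of the kind of StAX loop I mean. The element names follow the Wikipedia export format (page, title, text); the class name and the extractCoordinates() method are illustrative names I've made up for this post, the latter standing in for the template parsing and the Hibernate persistence step.

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    // Minimal sketch: stream the dump, pull out each page title and its wiki
    // text, and hand them off for coordinate extraction. Nothing is held in
    // memory beyond the current page, which is the whole point of using StAX.
    public class GeoRip {

        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream(args[0]));

            String title = null;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("title".equals(name)) {
                        title = reader.getElementText();
                    } else if ("text".equals(name)) {
                        // the article body in raw wiki markup
                        extractCoordinates(title, reader.getElementText());
                    }
                }
            }
            reader.close();
        }

        private static void extractCoordinates(String title, String wikiText) {
            // placeholder for the template parsing and Hibernate persistence
        }
    }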

Next up was the bit that I thought should be simple: after all, I had my domain model, I had my queries, I had even built some service facades, so how long could it possibly take to get this onto a server? The answer was far too long, particularly if you try to use Axis2 v1.0 (which I just couldn't get to work). Switching to Axis 1.4 picked up the pace considerably, and thanks to some server coaching by Mr Hedges it's now up and running.
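The facades themselves are nothing exotic. As a purely illustrative sketch (the class and method names below are made up, not the ones on the live service), the shape of thing Axis 1.4 ends up exposing is just a plain Java class whose public methods become the service operations, wired in with a deploy.wsdd entry pointing at the class; Axis will then serve up the generated WSDL at the endpoint's ?wsdl URL.

    // Illustrative facade shape only: Axis 1.4 turns the public methods of a
    // plain class like this into the service operations.
    public class GeoSearchService {

        /** Titles of pages whose name matches the query, capped at ten. */
        public String[] findTitlesByName(String name) {
            // delegate to the Hibernate-backed query layer (sketched further down)
            return new String[0];
        }

        /** Latitude/longitude pair for an exact page title, or null if unknown. */
        public double[] getCoordinates(String title) {
            return null;
        }
    }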

There are around 70,000 geolocations that I managed to extract from Wikipedia. Some of these aren't accurate, for several reasons:
1) They just aren't accurate in Wikipedia
2) There were multiple co-ordinates in the page, so I just picked the first one
3) There are eight formats that I identified; there could be more, which would be parsed incorrectly (a trimmed-down sketch of the parsing appears after this list)
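To give a flavour of point 3: the extraction is basically regular expressions over the raw wiki text of each page. The sketch below handles just two of the template shapes, roughly {{coor d|...}} and {{coor dms|...}}; the patterns in my actual parser are rather more forgiving, and the remaining formats follow the same idea.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Cut-down coordinate extraction: the first recognisable template wins,
    // which is also why only the first set of co-ordinates on a page is used.
    public class CoordinateParser {

        private static final Pattern COOR_D = Pattern.compile(
                "\\{\\{coor d\\|([\\d.]+)\\|([NS])\\|([\\d.]+)\\|([EW])");
        private static final Pattern COOR_DMS = Pattern.compile(
                "\\{\\{coor dms\\|(\\d+)\\|(\\d+)\\|([\\d.]+)\\|([NS])\\|"
                + "(\\d+)\\|(\\d+)\\|([\\d.]+)\\|([EW])");

        /** Returns {latitude, longitude}, or null if no known template is found. */
        public static double[] parse(String wikiText) {
            Matcher m = COOR_DMS.matcher(wikiText);
            if (m.find()) {
                return new double[] {
                        dms(m.group(1), m.group(2), m.group(3), m.group(4)),
                        dms(m.group(5), m.group(6), m.group(7), m.group(8)) };
            }
            m = COOR_D.matcher(wikiText);
            if (m.find()) {
                double lat = Double.parseDouble(m.group(1));
                double lon = Double.parseDouble(m.group(3));
                return new double[] {
                        "S".equals(m.group(2)) ? -lat : lat,
                        "W".equals(m.group(4)) ? -lon : lon };
            }
            return null; // a format this sketch doesn't know about
        }

        private static double dms(String d, String m, String s, String hemi) {
            double value = Double.parseDouble(d)
                    + Double.parseDouble(m) / 60.0
                    + Double.parseDouble(s) / 3600.0;
            return ("S".equals(hemi) || "W".equals(hemi)) ? -value : value;
        }
    }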

So there it is: extracting 70,000 locations out of unstructured information and repurposing it via a Web Service for external access. A couple of notes:

1) The Wikipedia -> DB translation is done offline as a batch
2) Name-based queries are limited to ten returns (see the query sketch after this list)
3) Any issues let me know
4) Don't rely on any information in this for anything serious.
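On note 2, the ten-return cap is easiest to enforce in the query itself. The sketch below shows the idea using the Hibernate Session API; GeoLocation here is a flattened stand-in for the mapped entity (the real model uses the inheritance mentioned above) and the property names are illustrative.

    import java.util.List;

    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.Id;

    import org.hibernate.Session;
    import org.hibernate.SessionFactory;

    // Flattened stand-in for the mapped entity (the real model uses inheritance).
    @Entity
    class GeoLocation {
        @Id @GeneratedValue
        Long id;
        String title;
        double latitude;
        double longitude;
    }

    // Name-based lookup with the ten-row cap applied at query time.
    public class GeoLocationQueries {

        private final SessionFactory sessionFactory;

        public GeoLocationQueries(SessionFactory sessionFactory) {
            this.sessionFactory = sessionFactory;
        }

        @SuppressWarnings("unchecked")
        public List<GeoLocation> findByName(String name) {
            Session session = sessionFactory.openSession();
            try {
                return session.createQuery("from GeoLocation where title like :name")
                              .setParameter("name", "%" + name + "%")
                              .setMaxResults(10) // never hand back more than ten rows
                              .list();
            } finally {
                session.close();
            }
        }
    }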

The next stage is to get some Web Services up that use WS-Security and some other WS-* elements, so there are more public testing services for people to use.

