Thursday, 1 March 2007

How we make the TomeRaider Wikipedia File

Thanks for all your suggestions and glitch finds in the Wikipedia file released last week. We have just uploaded a new one that’s much purer. This one also links to the print version of each Wikipedia article, as this renders better on handhelds.

It generally takes about 40 hours to make one of these TomeRaider Wikipedia files, and that’s not including checking and glitch fixing.

The process:

  • We download the raw text data of the Wikipedia. Uncompressed this comes in at over ten gigs. One file. Ten Gigs. It’s a whopper!
  • Then we pass it through a simple filter we wrote in C to get the size down to something manageable. This filter truncates the articles and removes some other “known junk”.
  • Then we have to pass it through a second program we wrote that does a bazillion find-and-replace operations using regular expressions.
  • At that point, the output of this second program should be ready to import into TomeRaider.
  • When a TomeRaider file of this size is made (it is now 1.5 million entries), three very processor-intensive tasks are performed:
    • The text gets compressed using our own methods. This takes an age, but it’s so efficient to uncompress on your smartphone that the trade-off is worth it.
    • All of the hyperlinks in every page need to be checked and fixed or removed. Even using binary search this takes a long, long time.
    • The file then needs to be sorted into the right order for TomeRaider. Again, a simple task, but on such huge quantities of text it takes a long, long time.
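
For the curious, here is a minimal sketch of what the find-and-replace pass, the link check and the final sort might look like, written in Python purely for illustration. It is not the actual TomeRaider tooling: the three clean-up rules, the function names (`clean`, `title_exists`, `fix_links`, `build`) and the choice to replace a dead link with its label are all assumptions. It also assumes links use the usual `[[Target|label]]` wiki form and that titles match exactly, which is a simplification.

```python
import bisect
import re

# Hypothetical clean-up rules as (pattern, replacement) pairs.
# A real rule list would be far longer than these three.
CLEANUP_RULES = [
    (re.compile(r"<!--.*?-->", re.DOTALL), ""),  # strip HTML comments
    (re.compile(r"'{2,}"), ""),                  # strip ''italic'' / '''bold''' quote marks
    (re.compile(r"[ \t]+"), " "),                # collapse runs of spaces and tabs
]

# Matches [[Target]] and [[Target|label]] wiki links.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def clean(text):
    """Apply every find-and-replace rule in order."""
    for pattern, replacement in CLEANUP_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


def title_exists(sorted_titles, title):
    """Binary search for an exact title in a sorted list of titles."""
    i = bisect.bisect_left(sorted_titles, title)
    return i < len(sorted_titles) and sorted_titles[i] == title


def fix_links(text, sorted_titles):
    """Keep links whose target exists; replace dead links with plain text."""
    def replace(match):
        target = match.group(1).strip()
        label = match.group(2) or target
        if title_exists(sorted_titles, target):
            return match.group(0)  # target is present, leave the link alone
        return label               # target is missing, keep only the visible text
    return WIKI_LINK.sub(replace, text)


def build(articles):
    """Clean each article, sort by title, then check every link."""
    cleaned = {title: clean(body) for title, body in articles.items()}
    sorted_titles = sorted(cleaned)
    return [(t, fix_links(cleaned[t], sorted_titles)) for t in sorted_titles]
```

Calling `build` on a dictionary that maps titles to raw article text returns a title-sorted list of `(title, text)` pairs. Each lookup costs only about log2(n) comparisons, roughly 21 for 1.5 million entries, but that still has to be repeated for every link in every article, which is where the time goes.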

And the result, when you take it out of the oven, is a perfectly baked TomeRaider file of the wonderful Wikipedia.

I remember talking to Andrew Orlowski about the Wikipedia a couple of years ago; he was pretty pessimistic about its potential and the validity of its content. But it does seem that peer review on such a huge scale is a method that can produce factual content with a new kind of reliability.

You can download the new version of the TomeRaider Wikipedia here and the new TomeRaider PPC/Smartphone here.

4 comments:

Peter said...

Your Wikipedia version is so cut down that it is almost pointless to look at it. Each article only has the first couple of lines. Eric (http://infodisiac.com/Wikipedia/) did a much better job and even included pictures. It's a pity he hasn't updated it for a year.

burepe said...

I really would like a version of the Japanese wikipedia. Is that possible? How can I make it?

Aidan said...

Why don't you make the full wikipedia? That would make me buy your program - this is just so cut down it's a waste of time.

Given the price of large flash cards these days a full wikipedia should be easy to do. This however is just utterly useless.
