Wednesday, October 30, 2013


I've been having some fun recently using MongoDB (and its Python module, pymongo) which I'm planning to use for roistr the search engine. The stuff I need it for is fairly straightforward and I'm enjoying using it. My mac has a test database while the real system is sitting on some more substantial hardware.

It does, however, feel odd to talk about using, shall we say, commodity software for a search engine, no matter how early. Of course it's a good idea to use tried and tested open source programs built precisely for this type of task (plus it's a great learning experience), but I imagine that a lot of organisations have their own home-made stuff. Being a band of exactly one and trying to build a billion-index search engine, I'm up against it so I need to rely on others' softwares.

Sunday, October 13, 2013

Roistr is becoming a search engine

We've ad a lot of fun with Roistr's semantic relevance engine, but we have managed to figure out a number of technical problems to make it a viable search engine technology. What this means is that we're phasing out the purely relevance parts, and instead we're going to make Roistr a semantic search engine – or rather, a search engine that relies on natural language processing techniques rather than simple keyword matching. We expect it to be ready in a few months, depending upon how quickly we can get stuff sorted out

First of all, we're only going to index the English version of Wikipedia. It's only a few million documents, and forms a nice test bed as well as being an information base for future developments. Once that's done, we'll aim to spider the web. This, of course, will take a while, especially given that we have extremely limited resources, but it will be fun to test ourselves against the giants.

Our long-term ambition is to have an index of over 1 billion documents. From there, we should be able to cover most searches fairly well. Will it be better than existing search engines like Google, Bing or Blekko? Well, time will tell.