Sunday, October 30, 2011

Using Roistr to establish meaning

At its heart, Roistr's semantic relevance engine is a very powerful tool. Here, I'll talk about how even the limited web-based demonstration can be used to extract meaning.

Let's take this well-known sentence: The quick brown fox jumps over the lazy dog.

Our question is this: how can we work out which words in this sentence are the key words - the ones that most indicate what it's about?

A word's meaning is established in several ways but primarily through its own context: the words that surround it are the best guide to what it means. There are other things such as the user's own context (the pre-existing knowledge they have) but for now, we'll concentrate just on each word's own context.

So what do we do? We take each word and compare it against the sentence. The first screenshot shows how this is done. The entire sentence is the category definition and each meaningful word is compared against it. You can try this yourself.

Then we compare and look at the similarity scores.

The second figure shows the similarities of each word to its parent sentence, all ranked in descending order. What do we see?

The most striking thing is that the word 'fox' has the highest similarity (0.62), closely followed by 'dog' (0.58). Both of these are quite high and would be lower for longer documents.

These similarity scores tell us that the two words most important to the sentence's meaning are 'fox' and 'dog', in that order. If we were to summarise this sentence in just one word, we would use the word 'fox' with 'dog' coming a close second.
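For anyone who wants to play with the idea offline, here is a minimal sketch of the same procedure: score each word against the whole sentence and rank the results. It is not Roistr's engine. The three-dimensional word vectors are invented purely for illustration, the sentence vector is simply their average, and the names TOY_VECTORS and rank_words are my own, so the scores it prints will not match the figures above.

```python
import math

# Invented 3-dimensional "meaning" vectors, purely for illustration.
# A real system would learn vectors like these from a large corpus.
TOY_VECTORS = {
    "quick": (0.2, 0.7, 0.1),
    "brown": (0.1, 0.3, 0.6),
    "fox": (0.8, 0.5, 0.3),
    "jumps": (0.3, 0.8, 0.0),
    "lazy": (0.2, 0.1, 0.7),
    "dog": (0.9, 0.2, 0.4),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sentence_vector(words, vectors):
    """Average the vectors of the given words (assumes at least one word)."""
    known = [vectors[w] for w in words]
    dims = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dims))


def rank_words(sentence, vectors):
    """Score each known word against the whole sentence, highest first."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    # Words without a vector ("the", "over") are skipped, like stop words.
    content = [w for w in dict.fromkeys(words) if w in vectors]
    doc = sentence_vector(content, vectors)
    scored = [(w, cosine(vectors[w], doc)) for w in content]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_words("The quick brown fox jumps over the lazy dog.", TOY_VECTORS)
for word, score in ranking:
    print(f"{word:6s} {score:.2f}")
```

The ranking step is the part that carries over; the quality of the result depends entirely on where the vectors come from.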

This is just a toy test but does illustrate one use to which Roistr's semantic relevance engine can be put.

Saturday, October 22, 2011

Social Recommendations demo released

We at Roistr have released a demonstration of what our semantic relevance engine can do. With nothing other than a Twitter ID, we can extract a person's tweets and work out which of Amazon's current best-sellers will most fit that person.

It works using a semantic relevance comparison between the tweets (as a single document) and each book's editorial review ("blurb"). The assumption is that more similar texts imply closer interests.
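As a rough illustration of the comparison step, here is a small sketch that joins a person's tweets into one document and ranks some blurbs against it. It uses plain bag-of-words cosine similarity as a stand-in, which only matches exact words and is far cruder than a semantic comparison. The tweets, book titles and blurbs are made up, and recommend is a hypothetical helper name.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case word counts; a crude stand-in for a semantic representation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(tweets, blurbs):
    """Treat all tweets as one document and rank each blurb against it."""
    profile = bag_of_words(" ".join(tweets))
    scored = [
        (title, cosine(profile, bag_of_words(blurb)))
        for title, blurb in blurbs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up example data.
tweets = [
    "Baking sourdough bread again this weekend",
    "New bread recipe: rye and seeds",
]
blurbs = {
    "Book A": "A guide to baking bread at home, with sourdough and rye recipes.",
    "Book B": "A thriller about a detective chasing a spy across Europe.",
}
results = recommend(tweets, blurbs)
for title, score in results:
    print(f"{title}: {score:.2f}")
```

Word counting would miss the link between, say, "loaf" and "bread", which is exactly the gap a semantic comparison is meant to close.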

You can try it out at and try anyone's Twitter ID.

The next step is to make it work using public posts from Facebook, GooglePlus and blogs!

Sunday, October 16, 2011

Roistr's value proposition

Roistr's bottom-line aim is to increase conversions in a way that is quick and integrates simply into a vertical.

We aim to benefit companies by letting them understand relevance in qualitative information in the same way as a human would - but much faster and without getting tired. It's difficult to get any computer system to read through a series of documents and provide a measure of meaning, particularly in a way that matters to companies. But Roistr can do it.

How? It can take a set of documents (these could be Twitter tweets, G+ or Facebook posts, forum posts - pretty much anything that someone writes about themselves), aggregate them as a single document and compare it against the product / service offerings. The idea is that a person's posts will indicate their interests and from that, companies can work out the most relevant offering they have for that person.

How easy is it to integrate? This depends a lot upon the host system but Roistr is designed to be as easy to integrate as possible. We have an API. Organisations keep their data as normal and, at whatever point in their process suits them, send it to us. We categorise / cluster it according to meaning and send back JSON objects listing how each document should be ranked / categorised / clustered.
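To give a feel for the client side, here is a sketch of how returned JSON might be consumed. The response format is not described in this post, so everything about the body below (the "results" key, "document_id", "category", "score", and the values) is invented for illustration, as is the helper name group_by_category.

```python
import json

# Hypothetical response body. The field names and values are invented
# for illustration and are not a documented format.
response_body = """
{
  "results": [
    {"document_id": "post-17", "category": "cycling", "score": 0.71},
    {"document_id": "post-04", "category": "cooking", "score": 0.64},
    {"document_id": "post-09", "category": "cycling", "score": 0.38}
  ]
}
"""


def group_by_category(body):
    """Group (document_id, score) pairs by category, best score first."""
    grouped = {}
    for item in json.loads(body)["results"]:
        grouped.setdefault(item["category"], []).append(
            (item["document_id"], item["score"])
        )
    for docs in grouped.values():
        docs.sort(key=lambda pair: pair[1], reverse=True)
    return grouped


grouped = group_by_category(response_body)
for category, docs in grouped.items():
    print(category, docs)
```

Whatever the real field names turn out to be, the client's job stays about this small: parse the JSON and slot the rankings back into the existing process.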

In summary: we help companies understand the type of data that offers true insights into a person but is extremely hard to analyse; and we do so in a way that integrates as simply as possible with our clients' systems.

Wednesday, October 12, 2011

New website on the way!

Here is an early preview of the new site's design. The current one really is a stop-gap until further notice; but this one looks much better. The exact content needs to be worked on because Roistr's proposition is quite a hard one to communicate; but the concept, layout and interaction design will be up when I get chance to sit down and code.

Tuesday, October 11, 2011

Online demonstrations

Early on, I made the decision that releasing Roistr without online demonstrations would be a mistake. This was simply because I could not see it convincing anyone without some kind of demonstration. However, within this general field, a lot of companies simply offer the 'contact us for a demo' type wording and it seems to work well enough for them.

We are, however, a start-up company which means that we have few eyeballs. Although the website is getting views from around the world already, it's not enough on its own.

So Roistr has a couple of online demonstrations. These are very simplistic and pared right back, and the most important one is the planned categorisation demo. This is where you can enter a category description of something and find out which of a bunch of documents is most relevant. It's very cool but of itself it isn't really relevant to people who come by the site unless they go to the effort of entering their own data.

So I thought of a demo where people could enter a Twitter user name and the last 20 tweets would be matched against a set of something. That would be cool because a) it is easy to complete (just enter any valid Twitter name), b) if the user enters their own user name (quite likely), then results will have greater personal relevance than just some random text, and c) it can be tested with any Twitter user name (only public posts are retrieved).

The problem now is working out what to match against: should it be a product list? A series of books? News articles?

I'm not sure yet but maybe all of them - people can then choose the type of product or service that is likely to appeal to them.

But there is an element of unfamiliarity about this: the list has to be short enough to be quickly scanned (else they might think, "I'm sure there is something more relevant to me but I cannot find it!"), general (so that there is something of high relevance there), but specific (to improve relevance). It's a difficult task. I posted a question on Quora about this. Already there are followers so people are interested in the question. No answers yet so I'm going to have to make a judgement call. :-)

Monday, October 10, 2011


What is Roistr built on? I can talk a little bit about some of the stuff we work on. There is a lot of proprietary code but a fair bit is built upon various open source libraries.

A vast amount is Python which is a wonderful language for rapid prototyping. Ally it with the numpy and scipy libraries, and it's pretty fast at hard-core number crunching which is a lot of what Roistr does. A part of the analysis we do uses NLTK (the Natural Language ToolKit) and another part uses Gensim. Both come highly recommended.

The server is the Tornado server released as open source by Facebook and written by FriendFeed.

This software represents awesome work so thank you to all those who have released it.

Sunday, October 9, 2011

Value Proposition

The hardest part of the entire process of making Roistr has been effectively communicating the value proposition (i.e., what Roistr can do for businesses). I am generally good with words, but I can see the flaws in any ideas I have.

There was excellent advice on a Quora thread ( which I've used. The gist was to generate a small number of plain English claims about what Roistr can do; there was also advice to present a problem, how Roistr solves it and the resulting benefit.

So what we came up with were three points:

Discover what's really important to your customers
Match your products and services to each individual customer
See increased conversions by providing customers with more relevant offerings

The next step is to tie in an effective demonstration that really communicates the individual relevance aspects to potential customers. I've asked on Quora ( and the question already has several followers but no answers yet.

In other news, I'm preparing an API for release. Because a lot of our work is Python based, Python will be running it. It's a RESTful service (or appears to be - there are some grey areas in the definition) so it can be accessed easily enough from most languages. With Python, it just requires importing the urllib and urllib2 modules, though I might prepare a proper module to simplify operations. The aim is to encourage people to try it out and see what they can do with it, so I want it to be as painless as possible.
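Here is a sketch of what a call could look like from Python. It is written for Python 3, where the urllib and urllib2 functionality lives in urllib.request and urllib.parse. The endpoint URL and the parameter names ("category", "documents") are hypothetical placeholders, since the API isn't released yet, and the request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, used only as a placeholder.
API_URL = "https://api.example.invalid/compare"


def build_request(category, documents, url=API_URL):
    """Build a form-encoded POST request (parameter names are assumptions)."""
    payload = urllib.parse.urlencode(
        {
            "category": category,
            "documents": json.dumps(documents),
        }
    ).encode("utf-8")
    return urllib.request.Request(url, data=payload, method="POST")


request = build_request(
    "The quick brown fox jumps over the lazy dog.",
    ["fox", "dog"],
)
print(request.get_method(), request.full_url)

# Sending it would look like this once a real endpoint exists:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

A proper client module would mostly wrap these few lines and add error handling around the call.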

I anticipate that the first release of the API should be made within a couple of days.

Friday, October 7, 2011

Oh yeah, the URL

Just in case anyone's interested, the site is now available, along with some demonstrations. It's quite creaky (take my word for it!) but it does allow you to try the semantic relevance engine out for yourself through a couple of very limited web-based demonstrations.

It's at Email me directly with comments and the like at alan - at -

Monday, October 3, 2011

Getting close to launch

Some stealth mode this is - me writing public blog posts!

Roistr is getting very close to release. The engine itself works beautifully and the next part was to build the website. It's a work in progress (you can see it at so don't expect too much yet!

In case you're curious, the web server and framework is FriendFeed's Tornado which turned out to be pretty good when we grokked it. It's made developing the site very easy indeed; and is an awesome server.

The site will offer web-based demonstrations and an API for more detailed work. The demonstrations will offer:

1) Compare All - submit a bunch of documents and compare each one to every other
2) Planned categorisation - give us some category descriptions and a bunch of documents to be categorised, and we'll do the hard work!
3) Unplanned categorisation - much like the planned categorisation above but you don't need to provide a description. Our semantic analysis will work out which documents are clustered.
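To make the first two demonstrations concrete, here is a minimal sketch of "compare all" and planned categorisation. It uses word-count cosine similarity as a stand-in for semantic analysis, the documents and category descriptions are made up, and the function names are my own. Unplanned categorisation is left out because it needs a clustering step on top of the pairwise scores.

```python
import math
import re
from collections import Counter
from itertools import combinations


def vectorise(text):
    """Lower-case word counts; a crude stand-in for semantic analysis."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between the word counts of two texts."""
    va, vb = vectorise(a), vectorise(b)
    dot = sum(n * vb[w] for w, n in va.items())
    norm = math.sqrt(sum(n * n for n in va.values())) * math.sqrt(
        sum(n * n for n in vb.values())
    )
    return dot / norm if norm else 0.0


def compare_all(docs):
    """Compare All: similarity for every pair of documents, keyed by index."""
    return {
        (i, j): similarity(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)
    }


def planned_categorisation(categories, docs):
    """Assign each document to the category description it most resembles."""
    return [
        max(categories, key=lambda name: similarity(categories[name], doc))
        for doc in docs
    ]


# Made-up example data.
docs = [
    "The fox chased the rabbit through the forest",
    "Interest rates rose as the bank tightened lending",
    "A rabbit hid from the fox in the forest undergrowth",
]
categories = {
    "wildlife": "animals such as the fox and rabbit living in a forest",
    "finance": "bank lending, interest rates and money",
}
pairs = compare_all(docs)
labels = planned_categorisation(categories, docs)
print(pairs)
print(labels)
```

With this toy data the two forest documents score highest against each other and land in the same category, which is the behaviour the demonstrations are meant to show on real text.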

The API will be limited in how often it can be used (we have limited hardware and resources being a bootstrapped company) but there will be the opportunity to really test Roistr out in much more detail.

We're also planning a cooler demonstration but we're not sure what yet. Something like: if you enter a Twitter name, we'll work out which of 1,000 news stories or products are most similar. Suggestions are welcomed!