Tuesday, December 27, 2011

Experiment is coming soon

We're planning a little experiment soon to test how well Roistr can match book descriptions with Tweets. We'll be getting people to rate each of ten books according to how interesting each book is to them personally.

Watch this space because we'll be asking for participants soon.

Thursday, December 22, 2011

Personalised Advertising Guided by Social Media

Summary

Current advertising selections are determined by a number of factors, often combined to produce content that is relevant to the user. Increased relevance results in increased conversions.

I'll discuss some of the ways in which products are targeted towards potential buyers, along with the benefits and drawbacks of each. There's also some talk about the future directions of advertising.

How are particular groups targeted now?

1. No targeting
The most obvious answer is that in many cases they're not. Adverts are simply put up in the hope that they hit at some point. The benefit is that you'll reach most people; the drawback is that a lot of people who have no interest whatsoever will also be reached. Even with Internet communications being cheap, this can still result in wasted effort and money.

2. Demographics
Traditionally, marketers try to understand their customers (potential or current) by grouping them, primarily using demographics. This works on the assumption that members of each group have similar predictable characteristics (they have similar habits and buy similar things). In some ways, this is almost a type-theory; and these are not widely accepted in contemporary psychology. The benefits are that doing this does increase conversions over zero-targeting and groups can be readily identified once a customer's information has been provided. The drawback is that getting this information is hard - companies are often willing to pay a lot of money for this type of data about a market. The more specifically a product can be targeted, the greater the chance of conversion; but to target more specifically requires an increasing amount of information about potential buyers. This can be hard, particularly as a lot of the information might not be obvious or accessible in large scale survey type research.

3. Stated interests
Potential buyers might be asked to state what their preferences are. This can help filter out irrelevant advertising materials and focus on relevant ones, thus leading to increased conversions. The drawbacks are that it can be hard to get this information and that such information may simply not be true.

4. Purchase history
Another form of prediction by groups is by using prior purchases. The theory is that if two people buy product A and the first person also bought product B, then the second person is more likely to buy product B than someone who didn't buy A. The advantage is that it requires no demographic information about buyers, and that it increases conversions to a level above chance. The disadvantages are that it relies upon the assumption of types of people; the data analysis required is quite heavy (often with very large sets); the difficulty in getting the data in the first place; and that it doesn't account for transience - trends or influences that act upon large groups of people for short periods of time.
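The purchase-history idea can be sketched in a few lines of Python. This is only a toy illustration: the baskets below are made-up data, and the helper name rate_of is my own, not part of any real system. It compares how often product B is bought by people who bought A against people who didn't.

```python
def rate_of(target, baskets):
    """Fraction of the given baskets that contain the target product."""
    baskets = list(baskets)
    if not baskets:
        return 0.0
    return sum(target in basket for basket in baskets) / len(baskets)


# Hypothetical purchase histories: one set of products per customer.
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"C"},
    {"B", "C"},
    {"C", "D"},
]

bought_a = [basket for basket in baskets if "A" in basket]
not_bought_a = [basket for basket in baskets if "A" not in basket]

# How likely is B among buyers of A, versus among everyone else?
p_b_given_a = rate_of("B", bought_a)
p_b_given_not_a = rate_of("B", not_bought_a)

print("P(B | bought A)     = %.2f" % p_b_given_a)
print("P(B | not bought A) = %.2f" % p_b_given_not_a)
```

With real data the baskets run to millions of rows, which is where the heavy analysis mentioned above comes from.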

5. Contextual adverts
Google popularised the contextual advert. These relied upon analysing the text of whatever page or search query a person entered and produced semantically similar adverts in response. These are held to have high levels of conversions due to increased relevance but they fail in one aspect: they only take into account context but not the person behind the context. "We are what we search for" is not enough because many searches are for things that do not define us but rather meet temporary or one-off information needs.

6. Social network
Facebook recently announced its intention to use the social graph to target advertisements. The theory is that what your friends buy is what you will want to buy. This may or may not work, and it ignores the realities of social networking. Many people use social networks with families (do I really want to buy what my daughter or my father wants?), for professional promotion (so a recruiter I've chatted with once bought a particular car. That will influence my decision not at all) or with people we have non-permanent relationships with (I can 'friend' people I haven't seen since primary school. How do their purchasing decisions relate to mine?).

These methods offer a partial solution. If done well, they will increase conversions which makes it easy to be complacent, maybe even think that the problem is solved as well as it possibly can be. The most effective advertising and marketing units will use a range of methods that complement each other to provide as full a coverage as possible.

But if we explore newer techniques, we might find other ways to complement them.

Trade-offs
There appears to be a trade-off. To have a better chance of a sale, an advert has to be targeted more specifically. But targeting more specifically requires data gathering and analysis prior to advertising.

But these methods still don't address the person behind the advert. People are either unconsidered/treated as uniform members of a group (which is successful but could be greatly improved) or have to provide data about themselves up-front. The only exception is the contextual advert which takes no account of the person but rather their current information need.

Future methods
The holy grail of advertising is to produce a method that takes into account not just context but also the person behind it in a way that requires as little prior data gathering and analysis as possible, preferably none. These data must also be honest: the possibility of potential buyers giving misleading information should be low. Finally, the data must be obtained with the permission of the potential buyer. Not having this permission could backfire and turn a potential customer into one who refuses to do business.

So, where?
This leads to the question: where can we get such information?

The information sources must:

  1. Be made publicly available with users' permission, preferably express
  2. Be retrieved for low or effectively zero cost
  3. Be about a single person
  4. Give a description of a person at the personal level
  5. Give some indication of the person's current context
  6. Provide a degree of authenticity

The solution
One answer is social media. Facebook, Twitter, and GooglePlus all offer information that is (often) publicly available and highlights the concerns of interest to an individual at a personal and contextual level. If a matter was not relevant to someone, why would they write about it?

But this information is hard to analyse. There are no neat forms with precise Likert scales, no specifically expressed interests and the like. It's plain, natural text written to be understood by other humans. It needs more preparation than most companies can put into a single potential sale before it can be analysed. Methods such as human-performed content analysis can categorise propositions according to set criteria and this information is gold. Human methods, however, don't scale very well.

But there is hope. Methods within artificial intelligence, specifically natural language processing, can analyse such text within a representation or 'map' of language. From this, we can see how closely related two pieces of text are. Or, in other words, we can relate someone's Facebook posts to a range of product descriptions.

Using natural language processing techniques can help you understand how similar two pieces of text are: one being a person's social media posts and the other being a range of products or services. The assumption is that the more similar an advert is to someone's social media, the higher relevance it will have to that person. This means higher conversions and greater sales.

There are many methods within natural language processing to estimate relevance. Google themselves use a system that was (and may still be) reliant upon WordNet, a formal ontology of words and how they relate to each other.

Text analysis is used to identify topics within text and sentiment analysis is used to understand general feelings or attitudes towards something.

At Roistr, we use a combination of methods to understand the underlying meanings of documents. These are used to gauge the proximity of two or more documents from which we infer relevance. If two pieces of text are close then they are similar in meaning.
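To make "proximity of two documents" a bit more concrete, here is a minimal sketch using the simplest stand-in I can think of: word counts and cosine similarity. This is not how Roistr's engine works (it only sees exact word overlap, not meaning), and the function names and example texts are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    va, vb = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(va[word] * vb[word] for word in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical social media post and two hypothetical adverts.
post = "loving this new science fiction novel about robots"
advert_a = "a science fiction novel about robots and space"
advert_b = "discount garden furniture sale"

print(cosine_similarity(post, advert_a))  # shares several words with the post
print(cosine_similarity(post, advert_b))  # shares none
```

A word-count approach like this misses synonyms and word senses entirely, which is exactly the gap a semantic method is meant to close.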

Evidence
We're doing an experiment soon using Amazon's top ten best sellers and asking people with a Twitter account to rate the 10 books on a scale of how much the book interests them personally. We will then take each person's public Tweets and rate them using our semantic relevance engine. The two sets of results will be compared to each other: individual human judgements compared to our personalised advertising doing the same thing using the individual's tweets.

Hopefully, this will give us some numbers to help us see whether it's possible to predict the most relevant product from a person's tweets. The experiment will be released soon and we will publish the results both here and in a white paper that will be free to download.
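One way the two sets of results could be compared is with a rank correlation between a person's ratings and the engine's scores for the same ten books. The sketch below uses Spearman's rank correlation as an assumption on my part; the analysis method hasn't been fixed here, and the ratings, the 1 to 5 scale and the scores are invented purely to show the mechanics.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented data for one participant: ratings for ten books (assumed 1-5 scale)
# and the engine's similarity score for the same ten books.
human_ratings = [5, 2, 4, 1, 3, 4, 2, 5, 1, 3]
engine_scores = [0.81, 0.35, 0.60, 0.12, 0.44, 0.72, 0.20, 0.77, 0.30, 0.41]

rho = spearman(human_ratings, engine_scores)
print("Spearman's rho: %.2f" % rho)
```

A value near 1 would mean the engine orders the books much as the person does; a value near 0 would mean no relationship.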

Sunday, December 11, 2011

Using Tornado as a web framework at Roistr

At Roistr, one of our essential tools is the Tornado web server. We use it, in conjunction with other tools, as the web framework that generates our pages. Its templates are easy to use, and this post shows how to include HTML from different files.

Each Tornado program has a main program that runs everything and each page has its own object that inherits from tornado.web.RequestHandler. This object has methods for get and post.
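As a rough sketch of that structure (not Roistr's actual code), a page object with get and post methods might look like the following. The class name HomePage, the "name" form field and the port number are placeholders of mine.

```python
import asyncio

import tornado.escape
import tornado.web


class HomePage(tornado.web.RequestHandler):
    """One page of the site (placeholder name)."""

    def get(self):
        # Called for HTTP GET requests to this page's URL.
        self.write("Hello from a GET request")

    def post(self):
        # Called for HTTP POST requests; "name" is a hypothetical form field.
        name = self.get_argument("name", "anonymous")
        self.write("Hello, " + tornado.escape.xhtml_escape(name))


def make_app():
    """The main program maps URL patterns to page objects."""
    return tornado.web.Application([(r"/", HomePage)])


async def serve(port=8888):
    """Start listening and keep running until interrupted."""
    make_app().listen(port)
    await asyncio.Event().wait()
```

To run it, call asyncio.run(serve()) and visit http://localhost:8888/ in a browser. This start-up style is the one used by recent Tornado releases; older versions start the IOLoop directly.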

Roistr itself is composed of a number of template features (bits of HTML common to all pages) and unique code (HTML unique to a single page). Being able to use a file to save those common bits makes it easy to update across the site - change one file (say the menu) and all the pages change.

In Roistr, we store the header, the menu and the footer in separate files. If we change any of these files, all web pages are updated. This saves work, saves testing effort, and makes it easier to maintain the site, particularly when changes are needed.

But how to incorporate this into a web page?

It's simple. As an example, let's say the header is contained in a file called header.html. We want to include this with a page's unique content. The header contains the opening tags of the page all the way up to (and including) the tag.

We go to the unique page (let's call it uniques.html) and the first thing we enter (because the header comes first) is this:

{% include header.html %}

This instructs Tornado to take all the code from header.html and insert it into the page at that point. After that line, we continue with the unique page's content.

So generally, a Roistr page consists of:

{% include header.html %}
{% include menu.html %}
blah... blah... blah...
{% include footer.html %}

And now if I want to change anything in the header, the menu, or the footer across the site, I can change one of the above files and the whole site changes.
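If you want to try the include mechanism without setting up files, here is a small self-contained sketch. It uses Tornado's DictLoader to hold the templates in memory instead of on disk; the HTML inside each template and the title variable are invented for the example and are not Roistr's real header, menu or footer.

```python
import tornado.template

# In-memory stand-ins for the files on disk (contents are made up).
templates = {
    "header.html": "<html><head><title>{{ title }}</title></head><body>",
    "menu.html": '<ul><li><a href="/">Home</a></li></ul>',
    "footer.html": "</body></html>",
    "uniques.html": (
        "{% include header.html %}\n"
        "{% include menu.html %}\n"
        "<p>blah... blah... blah...</p>\n"
        "{% include footer.html %}\n"
    ),
}

loader = tornado.template.DictLoader(templates)

# generate() returns bytes; included templates can see the same variables.
page = loader.load("uniques.html").generate(title="Example").decode("utf-8")
print(page)
```

In a real site the templates live in a directory named by the application's template_path setting, and a handler calls self.render("uniques.html", title="Example") to do the same thing.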

If you wanted to set up your own CMS, you could easily do so just by dividing up your pages into modules and loading each module - just like existing CMSs do. It's a bit more work that way but it's not really that much compared to learning Joomla or Drupal from scratch. Plus it's probably easier than building a custom site that would require significant modification of the Joomla or Drupal code.

I guess it depends on what you want. Personally, I like to be in total control of the HTML/CSS code that gets churned out because a) I might radically change it, and b) I'm in control of updates. Of course, you have all the problems on your own plate but that's part of the choice.

Best of luck!

Natural Language Interfaces

Here at Roistr, we're working to push the boundaries of UX as hard as we can. We're just a small company but very motivated and passionate about ensuring the best possible UX for everyone.

Does our work have any effect on this desire? Well yes. The deeper part is that we're pushing hard to make the next generation of natural language interfaces. Our ultimate goal is to pass the Turing test, something many very capable people have failed to do. It's good to have an ambition though :-)

So what role does the current incarnation of our semantic relevance engine have on natural language interfaces?

We can reduce the amount of effort it takes to get things done which is good UX in my book. For example, in one business use case we have for recruiters, we can storm ahead easily. I recently looked at one recruitment website and it took me 20 minutes to go through the sign-up forms before I got fed up and abandoned the process.

With Roistr, I just upload a resume or CV, and that can be used to match me to particular jobs. Plus it does a reasonable job of matching which is great news. Uploading a resume or even copying / pasting it is much quicker than having to type in my work details *yet again*.

Another thing Roistr can do is to understand what concepts lie at the heart of what people write. I've used it already in tests to elicit the core concept from a document and it works surprisingly well. This can be used for automatic summarisation, categorisation and a whole host of other applications.

We're planning a sitemap tool whereby we can take a content breakdown and reform it into an information architecture. In my own experience of card sorts, people organise content according to meaning (whether topic or function), and being able to assess similarity of meaning means that we're able to associate similar things. From this, we can build an IA in just a few seconds, even for large amounts of content that couldn't realistically be put into a card sort.

Another possibility is making the job of keeping content useful easier. We're considering writing an extension for SharePoint that organises content according to meaning, much like a human would. We can also identify possible duplicates which for large corporate intranet sites is very useful. A better UX is provided by providing content that is up-to-date, timely and relevant so Roistr can play a big role in making the world a more sensible place.

As I said, our ultimate aim is to have a machine that you can have a sensible conversation with; something that's not like talking to a socially-inappropriate amnesiac but more like a real person. The possibilities we have are endless...

Thursday, December 1, 2011

How to work this?

Our concept of Roistr is that it works as SaaS - software as a service - that works when requested by clients. Whenever a business needs to understand the meaning of a set of documents or perform a common operation (like finding which ones of a collection are the most similar to a standard), they can access our RESTful API and get results instantly.

But recently, Roistr was accepted for Microsoft's BizSpark programme. This is a nice bonus because it has links to lots of other companies so it's a good networking opportunity. It also provides access to a load of MS software and this got me thinking of whether we should offer the semantic relevance engine as a software product - something boxed or downloaded and used locally rather than purely online. It's certainly possible, assuming that anyone who runs it has a powerful enough machine (multi-core ideally, which isn't so rare these days).

If so, we'll need to work out how this can be done. Roistr's API is RESTful so it should integrate easily with anything else. A simple HTTP call is made and this can be done via any language worth its salt.

But is there an advantage to be made from offering a Java API? A C# or MS .Net API? We're unsure on this but we now have the tools to make a .Net API thanks to MS.

Monday, November 7, 2011

New demos

Today we released a new demonstration that shows semantic search using books.


The idea is that the user provides some kind of information and this is used to match against book descriptions. It shows the power of semantic matching over keyword-based systems.




Two different types of information can be provided: the first is a descriptive paragraph. My example here is the classic line, "The quick brown fox jumps over the lazy dog". Results are in the next screenshot below and all are related in some way to 'fox'.




How does this improve over keyword searches? Well, the main thing is that Roistr looks for the gist behind a document. It neatly deals with synonyms and has word-sense disambiguation built in to try and focus on the most meaningful match. It seeks out affinity beyond the purely lexical. The book choices are also intriguing. They're less like a normal library topic search and more like a knowledgeable librarian's recommendations.

We haven't tested this with users yet but I get the feeling that the accuracy is higher than a keyword search when used with real-world queries. I'm personally quite pleased with it.


The other way is to enter someone's Twitter name - it can be anyone's, not just your own - and the 20 most recent Tweets from that Twitter name are taken and used instead of the paragraph we just talked about.




The results show the book descriptions that are most semantically similar to the Tweets. Now this may or may not be related to the person's real interests; but if people Tweet about something, it's more likely that they're interested in it.

Here are the results using the Twitter name:




The usual provisos matter here: this is an early demo (the designer in me is going nuts!) and alpha, but it does seem to work reliably. I had the engine process almost 200k documents on the weekend without a problem.

In other words, all this is early work but very promising. The engine's quite stable and simple - much like a Unix tool, it does one job but does it very well - and can fit into a number of frameworks easily. The testament to this is that the code to make this run was written in a few hours (including the web interface which was the most time-consuming part).

Friday, November 4, 2011

How reliable is Roistr?

One question asked of us is how reliable Roistr is under stress. It works fine with the limited demonstrations, for example, but can it handle something closer to real-world work?

Lately, we've been preparing more detailed demonstrations. They both involve linking someone's social media to a list of a) book reviews and b) movie reviews (note: the latter is just for internal use right now). Although limited in scope, the data sets are more like real world data sets: the book reviews cover over 14,000 books; the movies over 150,000. So far, the engine has been working solidly and has produced results without a single blip on it.

The problem we do have is that it's not so fast in producing results - the 14,000+ books took 30 minutes; but calculating each vector is not a trivial operation: retrieving each vector takes many millions of floating point operations. We would like to reduce the time taken for this but this is where Roistr offers real value over existing methods: things like keyword matches are quicker but they aren't as good. Mimicking human performance takes a lot of effort.

We're happier to put up with slower results because they're more accurate and the important thing is for relevance to be maximised.

If you would like a demonstration or field test of Roistr, talk to me, Alan Salmoni (email link) and I'll see what we can organise for you.

Sunday, October 30, 2011

Using Roistr to establish meaning

At its heart, Roistr's semantic relevance engine is a very powerful tool. Here, I'll talk about how even the limited web-based demonstration can be used to extract meaning.

Let's take this well-known sentence: The quick brown fox jumps over the lazy dog.

Our question is this: how can we work out what are the key words in this sentence - the words that most indicate what it's about?

Word meaning is established in several ways but primarily through its own context. The words that surround it are the most likely way to understand what a word means. There are other things such as the user's own context (the pre-existing knowledge they have) but for now, we'll concentrate just on each word's own context.

So what do we do? We take each word and compare it against the sentence. The first screenshot shows how this is done. The entire sentence is the category definition and each meaning word is compared against it. You can try this yourself.



Then we compare and look at the similarity scores.

The second figure shows the similarities of each word to its parent sentence, all ranked in descending order. What do we see?

The most striking thing is that the word 'fox' has the highest similarity (0.62), closely followed by 'dog' (0.58). Both of these are quite high and would be lower for longer documents.

These similarity scores tell us that the two words that are most important to the sentence's meaning are 'fox' and 'dog' respectively. If we were to summarise this sentence in just one word, we would use the word 'fox' with 'dog' coming a close second.

This is just a toy test but does illustrate one use to which Roistr's semantic relevance engine can be put.

Saturday, October 22, 2011

Social Recommendations demo released

We at Roistr have released a demonstration of what our semantic relevance engine can do. With nothing other than a Twitter ID, we can extract a person's tweets and work out which of Amazon's current best-sellers will most fit that person.

It works using a semantic relevance comparison between the tweets (as a single document) and each book's editorial review ("blurb"). The assumption is that more similar texts imply closer interests.
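The shape of that comparison can be sketched as below: join the tweets into one document, score it against each blurb, and sort. Plain word-count cosine similarity stands in for the semantic relevance engine here, so this only catches shared words, and the tweets, book titles and blurbs are all invented.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Turn text into a word-count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(va, vb):
    """Cosine similarity between two word-count vectors."""
    dot = sum(va[word] * vb[word] for word in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def rank_books(tweets, blurbs):
    """Return book titles ordered from most to least similar to the tweets."""
    profile = vectorise(" ".join(tweets))  # all tweets as a single document
    scored = [(cosine(profile, vectorise(blurb)), title)
              for title, blurb in blurbs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


# Invented tweets and blurbs, for illustration only.
tweets = [
    "Spent the weekend hiking in the mountains",
    "New hiking boots arrived today",
]
blurbs = {
    "Pasta Basics": "Simple pasta recipes for busy cooks",
    "Trail Guide": "A guide to hiking trails in the mountains",
    "City Breaks": "Weekend trips to the best cities",
}

ranking = rank_books(tweets, blurbs)
print(ranking)
```

Swapping the cosine function for a call to a semantic engine would leave the rest of the flow unchanged.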

You can try it out at http://Roistr.com/social and try anyone's Twitter ID.

The next step is to make it work using public posts from Facebook, GooglePlus and blogs!

Sunday, October 16, 2011

Roistr's value proposition

Roistr's bottom-line aim is to increase conversions in a way that is quick and integrates simply into a vertical.

We aim to benefit companies by letting them understand relevance in qualitative information in the same way as a human - but much faster and it doesn't get tired. It's difficult to get any computer system to sequentially read a series of documents and be able to provide a measure of meaning, particularly in some way that matters to companies. But Roistr can do it.

How? It can take a set of documents (these could be Twitter tweets, G+ or Facebook posts, forum posts - pretty much anything that someone writes about themselves), aggregate them as a single document and compare them against the product / service offerings. The idea is that a person's posts will indicate their interests and from that, companies can work out the most relevant offering they have for that person.

How easy is it to integrate? This depends a lot upon the host system but Roistr is designed to be as easy to integrate as possible. We have an API. Organisations will have their data as per normal; and they can interrupt their process to send it all to us. We categorise / cluster it according to meaning and send back a list of how each document should be ranked / categorised / clustered using JSON objects.
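On the client side, handling a ranked list returned as JSON might look something like this. The response format is purely hypothetical: the field names results, document_id and score are placeholders I've made up, not the API's documented output.

```python
import json

# A made-up response body; the real field names may differ.
response_body = """
{"results": [
  {"document_id": "doc-2", "score": 0.41},
  {"document_id": "doc-1", "score": 0.77},
  {"document_id": "doc-3", "score": 0.12}
]}
"""


def ranked_ids(body):
    """Parse the JSON and return document ids, most relevant first."""
    results = json.loads(body)["results"]
    ordered = sorted(results, key=lambda item: item["score"], reverse=True)
    return [item["document_id"] for item in ordered]


print(ranked_ids(response_body))
```

The client keeps its own data as normal and only needs to map the returned ids back onto its own records.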

In summary: we help companies understand the type of data that offers true insights into a person but is extremely hard to analyse; and we do so in a way that integrates as simply as possible with our clients' systems.

Wednesday, October 12, 2011

New website on the way!


Here is an early preview of the new site's design. The current one really is a stop-gap until further notice; but this one looks much better. The exact content needs to be worked on because Roistr's proposition is quite a hard one to communicate; but the concept, layout and interaction design will be up when I get the chance to sit down and code.

Tuesday, October 11, 2011

Online demonstrations

Early on, I made the decision that releasing Roistr without online demonstrations would be a mistake. This was simply because I could not see it convincing anyone without some kind of demonstration. However, within this general field, a lot of companies simply offer the 'contact us for a demo' type wording and it seems well enough for them.

We are, however, a start-up company which means that we have few eyeballs. Although the website is getting views from around the world already, it's not enough on its own.

So Roistr has a couple of online demonstrations. These are very simplistic and pruned right back, and the most important one is the planned categorisation demo. This is where you can enter a category description of something and find out which of a bunch of documents is most relevant. It's very cool but of itself it isn't really relevant to people who come by the site unless they go to the effort of entering their own data.

So I thought of a demo where people could enter a Twitter user name and the last 20 tweets would be matched against a set of something. That would be cool because a) it is easy to complete (just enter any valid Twitter name), b) if the user enters their own user name (quite likely), then results will have greater personal relevance than just some random text, and c) it can be tested with any Twitter user name (only public posts are retrieved).

The problem now is working out what to match against: should it be a product list? A series of books? News articles?

I'm not sure yet but maybe all of them - people can then choose the type of product or service that is likely to appeal to them.

But there is an element of unfamiliarity about this: the list has to be short enough to be quickly scanned (else they might think, "I'm sure there is something more relevant to me but I cannot find it!"), general (so that there is something of high relevance there), but specific (to improve relevance). It's a difficult task. I posted a question on Quora about this. Already there are followers so people are interested in the question. No answers yet so I'm going to have to make a judgement call. :-)

Monday, October 10, 2011

Technology

What is Roistr built on? I can talk a little bit about some of the stuff we work on. There is a lot of proprietary code but a fair bit is built upon various open source libraries.

A vast amount is Python which is a wonderful language for rapid prototyping. Ally it with the numpy and scipy libraries, and it's pretty fast at hard-core number crunching which is a lot of what Roistr does. A part of the analysis we do uses NLTK (the Natural Language ToolKit) and another part uses Gensim. Both come highly recommended.

The server is the Tornado server released as open source by Facebook and written by FriendFeed.

This software represents awesome work so thank you to all those who have released it.

Sunday, October 9, 2011

Value Proposition

The hardest part of the entire process of making Roistr has been effectively communicating the value proposition (i.e., what Roistr can do for businesses). I am generally good with words, but I can see the flaws in any ideas I have.

There was excellent advice on a Quora thread (http://www.quora.com/What-is-the-best-way-to-communicate-a-value-proposition?q=value+proposition) which I've used. The gist was to generate a small number of plain English claims about what Roistr can do; there was also advice to present a problem, how Roistr solves it and the resulting benefit.

So what we came up with were three points:

Discover what's really important to your customers
Match your products and services to each individual customer
See increased conversions by providing customers with more relevant offerings

The next step is to tie in an effective demonstration that really communicates the individual relevance aspects to potential customers. I've asked on Quora (http://www.quora.com/Are-there-effective-semantic-relevance-demonstrations) and the question already has several followers but no answers yet.

In other news, I'm preparing an API for release. Because a lot of our work is Python based, Python will be running it. It's a RESTful service (or appears to be - there are some grey areas in the definition) so it can be accessed easily enough from most languages. With Python, it just requires importing the urllib and urllib2 modules though I might prepare a proper module to simplify operations. The aim is to encourage people to try it out and see what they can do with it, so I want it to be as painless as possible.
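As a rough idea of what a call from Python could look like, here is a sketch that builds (but does not send) an HTTP request. urllib and urllib2 are the Python 2 modules; in Python 3 their functionality lives in urllib.request, which is what this uses. The endpoint URL and the "documents" field are hypothetical placeholders, since the API isn't released yet.

```python
import json
import urllib.request

# Hypothetical endpoint; the real URL and parameters are not yet published.
API_URL = "https://api.example.com/compare"


def build_request(documents):
    """Package a list of documents as a JSON POST request."""
    payload = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(["first document", "second document"])
print(request.get_method(), request.full_url)

# Sending it would look like this (left out so the sketch runs offline):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

A small wrapper module would mostly hide these few lines behind one function call.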

I anticipate that the first release of the API should be made within a couple of days.

Friday, October 7, 2011

Oh yeah, the URL

Just in case anyone's interested, the site is now up with some demonstrations. It's quite creaky (take my word for it!) but it does allow you to try the semantic relevance engine out for yourself through a couple of very limited web-based demonstrations.

It's at http://roistr.com. Email me directly with comments and the like at alan - at - thoughtintodesign.com.

Monday, October 3, 2011

Getting close to launch

Some stealth mode this is - me writing public blog posts!

Roistr is getting very close to release. The engine itself works beautifully and the next part was to build the website. It's a work in progress (you can see it at http://roistr.com) so don't expect too much yet!

In case you're curious, the web server and framework is FriendFeed's Tornado which turned out to be pretty good when we grokked it. It's made developing the site very easy indeed; and is an awesome server.

The site will offer web-based demonstrations and an API for more detailed work. The demonstrations will offer:

1) Compare All - submit a bunch of documents and compare each one to every other
2) Planned categorisation - give us some category descriptions and a bunch of documents to be categorised, and we'll do the hard work!
3) Unplanned categorisation - much like the planned categorisation above but you don't need to provide a description. Our semantic analysis will work out which documents are clustered.
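To show the shape of the first two demonstrations, here is a toy sketch. It uses word-count cosine similarity as a stand-in for the semantic analysis, so it is not the engine itself, and the function names, categories and documents are invented. Unplanned categorisation would need a clustering step on top of the pairwise scores and isn't shown.

```python
import itertools
import math
import re
from collections import Counter


def vectorise(text):
    """Turn text into a word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(va, vb):
    """Cosine similarity between two word-count vectors."""
    dot = sum(va[word] * vb[word] for word in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def compare_all(documents):
    """Compare All: similarity for every pair of documents."""
    vectors = {name: vectorise(text) for name, text in documents.items()}
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a, b in itertools.combinations(sorted(vectors), 2)
    }


def planned_categorisation(categories, documents):
    """Assign each document to the category description it is closest to."""
    category_vectors = {name: vectorise(text) for name, text in categories.items()}
    assignment = {}
    for name, text in documents.items():
        doc_vector = vectorise(text)
        assignment[name] = max(
            category_vectors,
            key=lambda cat: cosine(category_vectors[cat], doc_vector),
        )
    return assignment


# Invented category descriptions and documents.
categories = {
    "sport": "football cricket tennis match team",
    "cooking": "recipe oven bake ingredients kitchen",
}
documents = {
    "d1": "the team won the football match",
    "d2": "bake the cake in a hot oven",
    "d3": "a tennis match report",
}

pairs = compare_all(documents)
assignment = planned_categorisation(categories, documents)
print(pairs)
print(assignment)
```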

The API will be limited in how often it can be used (we have limited hardware and resources being a bootstrapped company) but there will be the opportunity to really test Roistr out in much more detail.

We're also planning a cooler demonstration but we're not sure what yet. Something like: if you enter a Twitter name, we'll work out which of 1,000 news stories or products are most similar. Suggestions are welcomed!

Tuesday, August 2, 2011

This is Roistr

What can you write for a first entry? I guess an introduction to Roistr would be good.

Roistr is a system for semantic association and uses natural language processing techniques to work out how similar 2 (or more!) documents are. This doesn't sound tremendously useful until you think about it in more detail: What Roistr is going to provide is a scalable way for organisations to associate similar documents.

So if you had a job description and a large bunch of resumes, you can find out which resumes most closely match the JD. Or if you have a selection of holidays and a customer who doesn't know exactly which holiday they want, they can write a description of their perfect holiday and Roistr will select the best one. Or if you have a lot of adverts, it will choose which one best matches a page's content.

A lot of work has already been done to solve this type of problem but existing solutions don't work in the same way as humans do. Roistr does just that. Already, the core engine (I call it the Semantic Relevance Engine or SRE) can categorise quite well without any training at all. This means that Roistr is like employing an educated person who can match things at a very fast speed.

Right now, we are in stealth mode but getting close to release (hence this blog to begin letting people know what we can do). The core engine works well and reliably, but offering it as a service presents a set of challenges that are being swiftly knocked down, one by one.

btw, Roistr is a part of my company, Thought Into Design Ltd, which is incorporated in the United Kingdom. Watch this space for more news or follow @ThoughtN2Design on Twitter.