Thursday, August 15, 2013

Shortcutting vector multiplications

Roistr represents documents with vectors of numbers. When I want to see how close two documents are in meaning, the semantic relevance engine uses cosine similarity, which after years of experimentation gives me the best results (or at least, those closest to human judgement).

The problem

Each cosine comparison represents literally thousands of calculations: the dot product of the two vectors is computed, the norm of each vector is computed, and the dot product is then divided by the product of the two norms.
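Those steps can be sketched in a few lines of NumPy (the function name here is just illustrative, not Roistr's actual API):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the two vectors, divided by the product of their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

For document vectors with thousands of dimensions, each call works through every element several times, which is where the cost comes from.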

The problem is that it's computationally expensive – in other words, it's slow. 

For comparing a document against one hundred other documents, this operation needs to be performed 100 times, which means 100 times the wait.

And for real-world work, there would be thousands, maybe tens of thousands, of other documents. For a real-world search engine, we're talking billions of comparisons and probably trillions of operations for each tiny search. This does not scale: nobody wants to wait long for search results these days when a reasonable approximation can be had from Bing, Blekko or DuckDuckGo.

What I did

So my task was to investigate whether this process had any short-cuts. Imagine if I had an index of billions of documents and a statistical summary of each. Then, with an incoming document (say, a search phrase), the engine could quickly identify the 20 or 100 most likely candidates and compare only against those.

This would cut down response time enormously and make semantic search more viable. For each cosine similarity calculation, I could discard potentially thousands of documents.

So the trick is to figure out what characteristics (if any) describe a vector well enough to allow this approximation.
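A two-stage lookup along those lines might look like this. The summary statistics (mean, variance, range) are the ones tried below; the scoring step is a deliberate placeholder, since finding a summary that actually predicts the cosine value is the open question:

```python
import numpy as np

def summarise(vec):
    # Cheap per-document statistics, computed once at index time
    return np.array([vec.mean(), vec.var(), np.ptp(vec)])

def shortlist(query, summaries, k):
    # Placeholder scoring: distance between summary vectors stands in
    # for whatever model ends up predicting the cosine value.
    q = summarise(query)
    scores = np.linalg.norm(summaries - q, axis=1)
    # Indices of the k most promising documents; only these would get
    # the full (expensive) cosine comparison.
    return np.argsort(scores)[:k]
```

The exact cosine calculation then runs only on the shortlisted candidates instead of the whole index.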

So far, I've done a very limited test. Suffice to say that the single-statistic descriptions I used (mean, variance, range) are insufficient on their own.

But by plugging them all into a regression, I created a model that might have some viability. Far more work is needed of course to validate this, but even being able to discard 80% of documents would reduce system response times by a large degree.


The model so far attempts to predict the cosine value by using a negative coefficient for the mean, a positive one for the variance, and a very weak negative one for the range. The results are far from perfect (R = 0.74, R^2 = 0.55, R^2 adj = 0.33) and only one of the independent variables approached significance; but the results are just an initial stab, more of a pilot study than anything: the statistical power is very low.
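The mechanics of fitting such a model are straightforward; here is a sketch using ordinary least squares. The data below are random stand-ins, not the actual experimental values, so the fitted coefficients mean nothing here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: in practice each row would hold a document's
# (mean, variance, range) and y the observed cosine similarity.
X = rng.random((50, 3))
y = rng.random(50)

# Add an intercept column and fit ordinary least squares
A = np.column_stack([np.ones(len(X)), X])
coef, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)

# Predicted cosine values from the fitted linear model
predicted = A @ coef
```

With the real data, the signs of `coef[1:]` would correspond to the negative mean, positive variance, and weakly negative range coefficients described above.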

Next steps are to expand the materials, run a load of calculations, and see if a model with a lower residual is possible.  

Exciting stuff and great to be doing stats again!

EDIT: I'm aware this might be a solved problem but I want to figure this out myself.