Friday, August 16, 2013

What about a line of best fit + residual?

Some more thoughts (btw, I'm going to experiment with these).

I spent some time doing a personal literature review but found that existing solutions are aimed at sparse vectors which are extremely commonplace within information retrieval. My vectors, however, are dense. It's rare to find any zero value – if my engine returns any zero values, it's failed or very coincidental.

So I sat back and thought for a bit. We have a vector and want to know which other vectors it's approximately closest to.

So why not pre-calculate a line of best fit for each stored vector? Then, an unknown document can have its own line calculated easily, and too-distant vectors eliminated easily with two simple comparisons: remove all vectors with slopes too different from the unknown document's slope; then remove remaining vectors with intercepts too different from the unknown document's intercept.

Furthermore, residuals could also be used.

So after three simple 1 number ~ 1 number comparisons for each of the stored vectors, we should be able to eliminate a whole bunch of documents. It's not perfect by any means, but it's not meant to be. Its only supposed to work as an approximation to reduce the problem space.

The question then (if it begins to work of course!) is what qualifies as "too different". This is a function of the stored vectors' slopes, intercepts and residuals, and what percentage of the overall store we want left for detailed comparison.