Monday, August 19, 2013

Vector 'velocity' not working

Comparing the lines of best fit was a red herring, largely because the lines were almost identical regardless of the documents they described.

So the next idea was to examine 'velocity'. I must admit that I'm not too sure what the real name is, though I'm sure an established measure of it exists. It is, however, a fairly simple concept: how much each point varies compared to its neighbours. It is somewhat similar to variance and other measures of dispersion, but variance treats each data point as effectively independent of all the others. Here, each data point is measured against its neighbours.
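As a rough sketch of the idea (in Python, and not the R code linked at the end of this post), one way to measure point-to-neighbour variation is to sum the absolute differences between adjacent elements. That definition, the function name `velocity`, and the toy vectors are all assumptions made for illustration; the actual measure used here may differ.

```python
import statistics


def velocity(vector):
    """Assumed definition: sum of absolute differences between adjacent elements."""
    return sum(abs(b - a) for a, b in zip(vector, vector[1:]))


# Two toy vectors holding the same values in a different order.
smooth = [1, 2, 3, 4, 5]
jagged = [1, 5, 2, 4, 3]

# Variance ignores ordering, so it cannot tell the two apart.
print(statistics.pvariance(smooth) == statistics.pvariance(jagged))  # True

# The neighbour-based measure does depend on ordering.
print(velocity(smooth))  # 4
print(velocity(jagged))  # 10
```

The point of the toy data is only to show the contrast with variance: reordering the values leaves the variance unchanged but changes the neighbour-to-neighbour measure.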

I found little correlation between the difference in velocity scores and the difference in cosine measures. This means that velocity isn't a way to short-cut all-pairs similarity measurements for dense matrices.
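A toy illustration of how the two measures can disagree, again in Python and under the same assumed definition of velocity (sum of absolute differences between adjacent elements). The vectors are made up and are not the document vectors discussed in this post; `cosine` and `velocity` are hypothetical helper names.

```python
import math


def velocity(vector):
    """Assumed definition: sum of absolute differences between adjacent elements."""
    return sum(abs(b - a) for a, b in zip(vector, vector[1:]))


def cosine(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


reference = [1, 1, 1, 1]
first = [1, 2, 3, 4]
second = [1, 4, 2, 3]  # same values as `first`, reordered

# Both vectors have the same cosine against the reference...
print(math.isclose(cosine(first, reference), cosine(second, reference)))  # True

# ...but their velocities differ.
print(velocity(first))   # 3
print(velocity(second))  # 6
```

Here the two vectors are indistinguishable by cosine against the reference, yet one has twice the velocity of the other, so a difference in velocity says nothing about a difference in cosine in this case.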

If you examine these two vectors, they produced almost the same cosine with another document (0.605 to three decimal places), yet the velocity measures were 77.01 for the first document and 175.0 for the other. Plots of the two vectors appear to support this. Many other documents with vastly different cosines were interlopers when it came to velocity measures.

This was the vector showing a velocity measure of 77.01. The second vector showed 175.0.
Despite both showing a similar relationship to another document, the patterns of the two vectors are markedly different, so velocity is not sufficient to approximate a cosine similarity score.

R code for velocity:

https://gist.github.com/salmoni/6269948