Friday, August 16, 2013

Fail – part 1

After yesterday's article on short-cutting cosine similarity comparisons, I ran a larger-scale test using the Brown corpus (all 500 docs). I retrieved the vectors, ran all 124,750 pairwise comparisons and calculated the descriptives for each comparison.
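For anyone wanting to repeat the exercise, here is a rough sketch of that pipeline in Python. It is not the code I used: the function names are made up for illustration, and the choice of descriptives (mean and standard deviation per vector) is an assumption. It only needs the standard library.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def descriptives(vec):
    """Hypothetical feature set: mean and population standard deviation."""
    return {"mean": mean(vec), "sd": pstdev(vec)}


def compare_all(vectors):
    """Yield (i, j, cosine, stats_i, stats_j) for every unordered pair."""
    stats = [descriptives(v) for v in vectors]  # computed once per document
    for i, j in combinations(range(len(vectors)), 2):
        yield i, j, cosine_similarity(vectors[i], vectors[j]), stats[i], stats[j]


def pair_count(n):
    """Number of unordered pairs among n documents: n * (n - 1) / 2."""
    return n * (n - 1) // 2
```

With 500 documents, `pair_count(500)` gives 124,750, which is where the number of comparisons above comes from. Each row yielded by `compare_all` pairs one cosine value with the descriptives of both documents, ready to be fed into a regression.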

The linear regression showed nothing worthwhile, which means the idea doesn't hold up and is not worth pursuing.

Why the difference in results? Well, the earlier test was very small-scale, so any interesting results were probably statistical artifacts rather than anything substantial. But it's good to investigate this properly and know the idea doesn't work.

For those interested, R = 0.10, R^2 = 0.01 and adjusted R^2 = 0.0. Beta coefficients were 0.0 except for the mean at -0.08. Not surprising when the regression sum of squares was 10.03 and the residual sum of squares was 2,352.51.
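If you want to compute the same kind of fit statistics on your own data, the standard definitions are easy to sketch. This assumes an ordinary least-squares fit with an intercept; the function names are hypothetical and this is not the output of whatever stats package you happen to use.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SS_residual / SS_total for a fitted model."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total  # undefined if all observed values are equal


def adjusted_r_squared(r2, n_observations, n_predictors):
    """Penalise R^2 for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)
```

When the residual sum of squares dwarfs the part explained by the regression, R^2 sits close to zero, which is the situation described above.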

Clearly, if there is another factor (or factors) that can approximate a cosine comparison using descriptive statistics, I have not found it yet.