## Friday, August 23, 2013

### All-pairs comparisons for dense matrices

Some of my recent posts described my woes in trying to figure out how I could quickly identify the most likely subset of vectors that match a candidate vector without repeatedly calculating cosine similarity.

I have found a method that appears to work! I won't describe it here because it needs more testing. Falsifications can and should be reported soon, but confirmations should be announced cautiously!

But if it works, I can reduce a set of vectors down to a small fraction using far fewer calculations. It's not ideal, but effective indexing and searching in a database should, in theory, improve performance to the point where we can identify the best-matching subset with relative ease.

### Making a new search engine

So why is this being done? Why do I want to be able to take a multi-billion element index of vectors and quickly and cheaply identify a tiny (say 100-item) subset before conducting the cosine comparisons?

Well, that would be giving it away but the subheading is fairly clear.

Besides, we're figuring out a way to store a 3+ billion page search engine on a terabyte of space. Yes, it's possible and yes, we're going to try it.

Once we've sorted out a few more problems!

## Thursday, August 22, 2013

### Using Wordnet from Prolog – part one

Prolog has many uses in natural language processing, and here is an introduction to using Wordnet through Prolog with examples of common tasks.

I enjoy learning Prolog because it conforms to how I thought computers should be programmed before I'd actually programmed one. You'd specify a set of rules, and the computer came back with answers that conformed to those rules.

But Prolog is also handy for natural language processing (NLP), and one notable endeavour is the famous Wordnet database being made available in this fascinating language. This is well cool. I've used Wordnet lots in Python but rarely in Prolog.

First, download the Prolog Wordnet effort here (documentation here). I tried using it in GProlog but got an "atom table full" error; SWI-Prolog handled it well enough.

Once downloaded, unzip it, open a terminal (if you haven't already), and cd to the directory the Wordnet files were unzipped to. Then start Prolog and type:

```prolog
[wn_s].
```

Remember the trailing full-stop/period.

Then, searching for a word's synsets is easy:

```prolog
s(ID, W_num, 'ease', Ss_type, Sense_num, Tag_count).
```

And I got back a range of synsets for the word 'ease':

```prolog
ID = 101064148
W_num = 2
Ss_type = n
Sense_num = 5
Tag_count = 0
```

Which means the Wordnet ID for that synset is 101064148, 'ease' is the second word in the synset (W_num = 2), the synset is a noun sense (Ss_type = n), this is the fifth sense of 'ease' (Sense_num = 5), and it has no tags in the semantic concordance (Tag_count = 0).

Press the semi-colon to cycle through them all.

This is so simple. And if you want a particular sense (e.g., the first sense of 'ease'), just constrain the Sense_num argument:

```prolog
s(ID, W_num, 'ease', Ss_type, 1, Tag_count).
```

Ss_type mostly reports words as type n, which means noun. It can also report verbs (v), adverbs (r), adjectives (a) and adjective satellites (s).

### Glossy Wordnet

Now let's retrieve a 'gloss'. A gloss is a natural-language definition of a particular word-sense, often with an example sentence illustrating its use.

First we load the gloss module:

```prolog
[wn_g].
```

Then, we can take a word-sense ID and retrieve the gloss:

```prolog
g(101064148, Gloss).
```

Which shows:

```prolog
Gloss = 'freedom from activity (work or strain or responsibility); "took his repose by the swimming pool"'.
```

In the next Prolog and Wordnet article, I'll go through some more advanced ways of using Wordnet with Prolog.

## Monday, August 19, 2013

### Lots of Python statistics code

As some of you might know, I've been writing code for statistics in Python for some time. I also have a repository on Github which is the basis of a book I've been writing for several years.

There's a lot there, including modules for unit testing. If you want to join in, feel free to fork it, or email changes to me if you prefer.

https://github.com/salmoni/CompStatsPython

And for the eagle-eyed amongst you, there's some basic stats stuff in Prolog too! See stats.pl here

List of descriptives:

• VSort - sort a vector
• MSort - sort a matrix
• CalculateRanks - for calculating the ranks of a matrix
• GetSSCP_M - calculates the sum of squares and cross-products matrix
• GetVarsCovars_M - calculates the variances and covariances matrix
• GetVariances - calculates the variances of a matrix of variables
• GetStdDevs - calculates the standard deviations of a matrix of variables
• GetCorrelationMatrix - calculates the correlation matrix
• Count - returns the number of non-missing data
• sum - returns the sum of non-missing data
• minimum - returns the minimum of non-missing data
• maximum - returns the maximum of non-missing data
• Range - maximum minus the minimum
• proportions -
• relfreqmode -
• cumsum -
• cumproduct -
• cumpercent -
• frequencies -
• trimmeddata -
• trimmedmean -
• bitrimmedmean -
• mean -
• median -
• mode -
• moment -
• TukeyQuartiles - returns Tukey's hinges
• MooreQuartiles - returns Moore & McCabe's hinges
• SPQuantile - quantile used by S-Plus
• TradQuantile - quantile used by SPSS
• MidstepQuantile - mid-step quantile
• Q1 - Q1 quantile from Hyndman & Fan
• Q2 - Q2 quantile from Hyndman & Fan
• Q3 - Q3 quantile from Hyndman & Fan
• Q4 - Q4 quantile from Hyndman & Fan
• Q5 - Q5 quantile from Hyndman & Fan
• Q6 - Q6 quantile from Hyndman & Fan
• Q7 - Q7 quantile from Hyndman & Fan
• Q8 - Q8 quantile from Hyndman & Fan
• Q9 - Q9 quantile from Hyndman & Fan
• InterquartileRange -
• SS - sum of squares
• SSDevs - sum of squared deviations from the mean
• SampVar - sample variance
• PopVar - population variance
• SampStdDev - sample standard deviation
• PopStdDev - population standard deviation
• StdErr - standard error
• CoeffVar - coefficient of variation
• ConfidenceIntervals - returns the confidence intervals
• MAD - median absolute deviation
• GeometricMean - the geometric mean
• HarmonicMean - the harmonic mean
• MSSD - mean square of successive differences
• Skewness - returns the skewness
• Kurtosis - returns the kurtosis
• StandardScore - transforms a vector into a standard (i.e., z-) score
• EffectSizeControl - returns an effect size if a control condition is present
• EffectSize - returns an effect size if no control is present
• FiveNumber - Tukey's five number summary (minimum, lower quartile, median, upper quartile, maximum)
• OutliersSQR - returns two arrays, one of outliers defined by 1.5 * IQR, and the other without these outliers

And others such as

• Cronbach's Alpha
• Kruskal-Wallis
• Multiple regression
• Mann-Whitney U
• Wilcoxon
• ANOVA

### Vector 'velocity' not working

Comparing the lines of best fit was a red-herring largely because the lines were almost identical, regardless of the documents they described.

So the next idea was to examine 'velocity'. I must admit I'm not sure what the real name for this is, and I'm sure it's formally measured somewhere. It is, however, a fairly simple concept: how much each point varies compared to its neighbours. It's somewhat similar to variance and other measures of dispersion, but variance treats each data point as effectively independent of all the others, whereas here each data point is measured against its neighbours.
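Since I don't have a formal definition, here's a minimal Python sketch of one plausible 'velocity' measure: the mean squared difference between neighbouring elements (akin to a mean square of successive differences). The actual measure used in the R code linked below may well differ.

```python
def velocity(vec):
    """One plausible 'velocity' measure: the mean squared difference
    between each element and its neighbour. This is a guess at the
    measure described above; the exact formula may differ."""
    if len(vec) < 2:
        return 0.0
    # Squared difference between each consecutive pair of elements
    diffs = [(b - a) ** 2 for a, b in zip(vec, vec[1:])]
    return sum(diffs) / len(diffs)
```

A flat vector scores 0; a vector that jumps around scores higher, even if the two have similar overall dispersion.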

I found little correlation between the difference in velocity score and the difference in cosine measure. This means that velocity isn't a way to short-cut all-pairs similarity measurements for dense matrices.

For example, I examined two vectors that produced almost the same cosine with a third document (0.605 to three decimal places), yet their velocity measures were 77.01 and 175.0. Plots of the two vectors bear this out: despite both showing a similar relationship to the other document, their patterns are markedly different, so velocity is not sufficient to approximate a cosine similarity score. Many other documents with vastly different cosines were interlopers when it came to velocity measures.

R code for velocity:

https://gist.github.com/salmoni/6269948

## Saturday, August 17, 2013

### Too early to tell... And some reading

I mentioned some literature I came across for short-cutting the all-pairs similarity search. These papers deal with sparse matrices rather than the dense ones I need, but they might still be useful.

• Optimizing Parallel Algorithms for All Pairs Similarity Search (2013), http://www.cs.ucsb.edu/projects/psc/wsdm2013.pdf
• Scaling up all-pairs similarity search (2007), http://wwwconference.org/www2007/papers/paper342.pdf
• An Incremental Prefix Filtering Approach for the All Pairs Similarity Search Problem (unknown), http://www.win.tue.nl/~lamthuy/MyPub/ippair.pdf
• Parallel All Pairs Similarity Search (unknown), http://www.lidi.info.unlp.edu.ar/WorldComp2011-Mirror/IKE5242.pdf

Some early results from trying to compare lines of best fit aren't that promising but this is just early analysis. Full results here later.

## Friday, August 16, 2013

### What about a line of best fit + residual?

Some more thoughts (btw, I'm going to experiment with these).

I spent some time doing a personal literature review but found that existing solutions are aimed at sparse vectors which are extremely commonplace within information retrieval. My vectors, however, are dense. It's rare to find any zero value – if my engine returns any zero values, it's failed or very coincidental.

So I sat back and thought for a bit. We have a vector and want to know which other vectors it's approximately closest to.

So why not pre-calculate a line of best fit for each stored vector? Then, an unknown document can have its own line calculated easily, and too-distant vectors eliminated easily with two simple comparisons: remove all vectors with slopes too different from the unknown document's slope; then remove remaining vectors with intercepts too different from the unknown document's intercept.

Furthermore, residuals could also be used.

So after three simple number-to-number comparisons for each of the stored vectors, we should be able to eliminate a whole bunch of documents. It's not perfect by any means, but it's not meant to be. It's only supposed to work as an approximation to reduce the problem space.
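The idea above can be sketched in plain Python (function names and tolerance values here are illustrative guesses, not from the actual engine):

```python
def fit_line(vec):
    """Least-squares line of best fit over (index, value) pairs.
    Returns (slope, intercept, residual sum of squares).
    Assumes the vector has at least two elements."""
    n = len(vec)
    xs = range(n)
    mean_x = (n - 1) / 2.0
    mean_y = sum(vec) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, vec))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    resid = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, vec))
    return slope, intercept, resid

def prefilter(query, stored, slope_tol=0.05, icept_tol=0.1, resid_tol=5.0):
    """Keep only the stored vectors whose fitted line is 'close enough'
    to the query's: three cheap comparisons per stored vector."""
    q_slope, q_icept, q_resid = fit_line(query)
    keep = []
    for key, vec in stored.items():
        s, i, r = fit_line(vec)
        if (abs(s - q_slope) <= slope_tol
                and abs(i - q_icept) <= icept_tol
                and abs(r - q_resid) <= resid_tol):
            keep.append(key)
    return keep
```

In practice the slope, intercept and residual of every stored vector would be precomputed at index time, so only the query's line needs fitting at search time.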

The question then (if it begins to work of course!) is what qualifies as "too different". This is a function of the stored vectors' slopes, intercepts and residuals, and what percentage of the overall store we want left for detailed comparison.

### Fail – part 1

After yesterday's article on short-cutting cosine similarity comparisons (http://blog.roistr.com/2013/08/shortcutting-vector-multiplications.html), I did a larger scale test using the Brown corpus (all 500 docs). So I retrieved vectors, undertook all 124,750 comparisons and calculated the descriptives for each comparison.

The linear regression showed nothing worthwhile which makes this idea false and not worth pursuing.

Why the difference in results? Well, the former was very small-scale so any interesting results were probably statistical artifacts rather than anything substantial. But it's good to investigate this properly and know the idea doesn't work.

For those interested, R = 0.10, R^2 = 0.01 and R^2 adjusted = 0.0. Beta coefficients were 0.0 except for mean at -0.08. Not surprising when the sum of squares was 10.03 and the residual was 2,352.51.

Clearly, if there is another factor (or factors) that can approximate a cosine comparison using descriptive statistics, I have not found it yet.

## Thursday, August 15, 2013

### Shortcutting vector multiplications

Roistr represents documents with vectors of numbers. When I want to see how close two documents are in meaning, the semantic relevance engine uses cosine similarity, which after years of experimentation gives me the best results (or at least, those closest to human judgement).

### The problem

Each cosine comparison represents literally thousands of calculations: the dot product of the two vectors is computed, and then divided by the product of their magnitudes (norms).
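In code, the whole comparison boils down to a dot product and two norms (a minimal sketch in plain Python; Roistr's actual implementation presumably uses optimised numerical libraries):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = dot(a, b) / (|a| * |b|).
    Each call costs on the order of the vector length in
    multiplications, which is what makes millions of comparisons
    expensive."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Identical directions score 1.0, orthogonal vectors score 0.0.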

The problem is that it's computationally expensive – in other words, it's slow.

For comparing a document against one hundred other documents, this operation needs to be performed 100 times, multiplying the cost a hundredfold.

And for real-world work, there would be thousands and maybe tens of thousands of other documents. For a real-world search engine, we're talking billions of comparisons and probably trillions of operations for each tiny search. This does not scale, and standard methods win because nobody wants to wait long for search results these days when a reasonable approximation can be had from Bing, Blekko or DuckDuckGo.

### What I did

So my task was to investigate whether this process had any short-cuts. Imagine if I had an index of billions of documents and had a statistical summary of each. Then, with an incoming document (say, a search phrase), the engine could quickly identify the 20 or 100 most likely candidates and compare only against those.

This would cut down response time enormously and make semantic search more viable. For each cosine similarity calculation, I could discard potentially thousands of documents.

So the trick is to figure out what characteristics (if any) describe a vector well enough to allow this approximation.
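Summaries like these cost a single pass per vector and can be precomputed at index time. A minimal sketch in plain Python (the function name is illustrative):

```python
def describe(vec):
    """Cheap per-vector summary: mean, population variance and range.
    Candidate features for roughly predicting how a vector will behave
    in a cosine comparison."""
    n = len(vec)
    mean = sum(vec) / n
    variance = sum((x - mean) ** 2 for x in vec) / n
    vec_range = max(vec) - min(vec)
    return mean, variance, vec_range
```

Comparing two documents' summaries is then a handful of arithmetic operations instead of thousands.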

So far, I've done a very limited test. Suffice it to say that the single descriptors I used (mean, variance, range) are insufficient on their own.

But by plugging them all into a regression, I created a model that might have some viability. Far more work is needed of course to validate this, but even being able to discard 80% of documents would reduce system response times by a large degree.

### Results

The model so far attempts to predict the cosine value by using a negative coefficient for the mean, a positive one for the variance, and a very weak negative one for the range. The results are far from perfect (R = 0.74, R^2 = 0.55, R^2 adj = 0.33) and only one of the independent variables approached significance; but the results are just an initial stab, more of a pilot study than anything: the statistical power is very low.

Next steps are to expand the materials, run a load of calculations, and see if a model with a lower residual is possible.

Exciting stuff and great to be doing stats again!

EDIT: I'm aware this might be a solved problem but I want to figure this out myself.