Wednesday, October 30, 2013

MongoDB

I've been having some fun recently using MongoDB (and its Python module, pymongo), which I'm planning to use for Roistr, the search engine. The things I need it for are fairly straightforward and I'm enjoying using it. My Mac holds a test database while the real system sits on more substantial hardware.
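For anyone curious, here is roughly the shape of the pymongo code involved. To be clear, the database and collection names and the document fields below are placeholders for illustration, not Roistr's actual schema.

from pymongo import MongoClient

# Connect to a local MongoDB instance (defaults to localhost:27017)
client = MongoClient()
db = client["roistr_test"]      # placeholder database name
docs = db["documents"]          # placeholder collection name

# Store a document along with some metadata
docs.insert_one({
    "url": "http://example.com/some-page",
    "title": "An example page",
    "text": "Body text of the page goes here...",
})

# Index the url field so look-ups stay quick as the collection grows
docs.create_index("url")

# Retrieve the document again
page = docs.find_one({"url": "http://example.com/some-page"})
print(page["title"])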

It does, however, feel odd to talk about using, shall we say, commodity software for a search engine, no matter how early the stage. Of course it's a good idea to use tried and tested open-source programs built precisely for this type of task (plus it's a great learning experience), but I imagine that a lot of organisations have their own home-made stuff. Being a band of exactly one and trying to build a billion-document index, I'm up against it, so I need to rely on other people's software.

Sunday, October 13, 2013

Roistr is becoming a search engine

We've had a lot of fun with Roistr's semantic relevance engine, and we have managed to solve a number of technical problems that make it viable as a search engine technology. What this means is that we're phasing out the purely relevance-focused parts, and instead we're going to make Roistr a semantic search engine, or rather, a search engine that relies on natural language processing techniques rather than simple keyword matching. We expect it to be ready in a few months, depending on how quickly we can get everything sorted out.
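To give a flavour of the difference (with made-up numbers rather than anything from Roistr's real pipeline): keyword matching only rewards literal token overlap, whereas comparing document vectors can surface a relevant page that shares no words at all with the query.

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "cheap car insurance"
query_vec = np.array([0.85, 0.2, 0.15])   # made-up embedding for the query

docs = {
    "Affordable automobile cover":  np.array([0.9, 0.1, 0.2]),
    "Cheap car parts":              np.array([0.5, 0.8, 0.1]),
    "History of the Roman Empire":  np.array([0.1, 0.1, 0.9]),
}

# Keyword matching: count literal tokens shared with the query
keyword_hits = {title: len(set(query.split()) & set(title.lower().split()))
                for title in docs}

# Semantic matching: rank by cosine similarity of the vectors
semantic_hits = {title: round(cosine(query_vec, vec), 3) for title, vec in docs.items()}

print(keyword_hits)   # "Affordable automobile cover" shares no tokens with the query...
print(semantic_hits)  # ...but has the highest cosine similarity of the three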

First of all, we're only going to index the English version of Wikipedia. It's only a few million documents, and forms a nice test bed as well as being an information base for future developments. Once that's done, we'll aim to spider the web. This, of course, will take a while, especially given that we have extremely limited resources, but it will be fun to test ourselves against the giants.

Our long-term ambition is to have an index of over 1 billion documents. From there, we should be able to cover most searches fairly well. Will it be better than existing search engines like Google, Bing or Blekko? Well, time will tell.

Friday, August 23, 2013

Shortcutting all-pairs comparisons for dense matrices – got it

All-pairs comparisons for dense matrices


Some of my recent posts described my woes in trying to figure out how I could quickly identify the subset of vectors most likely to match a candidate vector without repeatedly calculating cosine similarity.

I have found a method that appears to work! I won't describe it here because it needs more testing. Falsifications can and should be reported soon, but confirmations should be announced cautiously!

But if it works, I can cut a set of vectors down to a small fraction using far fewer calculations. It's not ideal, but combined with effective indexing and searching in a database it should, in theory, improve performance substantially, to the point where we can identify the best-matching subset with relatively little effort.
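I won't give the game away, but to illustrate the general shape of the problem, here is a sketch of one well-known pre-filtering approach: random-hyperplane (SimHash) locality-sensitive hashing, which shortlists candidates by Hamming distance on bit signatures before exact cosines are computed on the survivors. To be clear, this is not necessarily the method I've found; it's just one standard way of cheaply narrowing down dense vectors.

import numpy as np

class CosineLSH:
    # Random-hyperplane (SimHash) pre-filter: vectors separated by a small
    # angle tend to fall on the same side of random hyperplanes, so Hamming
    # distance between bit signatures approximates angular distance.
    def __init__(self, dim, n_bits=64, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((dim, n_bits))

    def signature(self, vecs):
        return (np.atleast_2d(vecs) @ self.planes) > 0

    def shortlist(self, query, corpus_sigs, k=100):
        q_sig = self.signature(query)
        dists = np.count_nonzero(corpus_sigs != q_sig, axis=1)
        return np.argsort(dists)[:k]    # indices of the k nearest signatures

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

corpus = np.random.randn(10000, 300)    # stand-in for 10,000 document vectors
query = np.random.randn(300)

lsh = CosineLSH(dim=300)
sigs = lsh.signature(corpus)            # computed once, stored alongside the index
candidates = lsh.shortlist(query, sigs, k=100)

# Exact cosine similarity is only computed for the 100 shortlisted vectors
best = max(candidates, key=lambda i: cosine(query, corpus[i]))

The pay-off is that the expensive all-pairs step shrinks from the whole collection down to whatever the shortlist size is.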

Making a new search engine

So why is this being done? Why do I want to be able to take a multi-billion-element index of vectors and quickly and cheaply identify a tiny (say, 100-item) subset before conducting the cosine comparisons?

Well, that would be giving it away, but the subheading is fairly clear.

Besides, we're figuring out a way to store a 3+ billion page search engine on a terabyte of space: that works out to roughly 330 bytes per page at most. Yes, it's possible and yes, we're going to try it.

Once we've sorted out a few more problems!

Thursday, August 22, 2013

Using Wordnet from Prolog – part one

Prolog has many uses in natural language processing, and here is an introduction to using Wordnet through Prolog with examples of common tasks.

I enjoy learning Prolog because it conforms to how I thought computers should be programmed before I'd actually programmed one. You'd specify a set of rules, and the computer came back with answers that conformed to those rules.

But Prolog is also handy for natural language processing (NLP), and one result of these endeavours is that the famous Wordnet has been made available in this fascinating language. This is well cool. I've used Wordnet a lot in Python but rarely in Prolog.

First, download the Prolog version of Wordnet here (documentation here). I tried using it in GProlog but got an "atom table full" error; SWI-Prolog handled it well enough.

Once downloaded, unzip it, open a terminal (if you haven't already), and cd to the directory the Wordnet files were unzipped to. Then start your Prolog interpreter and type:

[wn_s].

Remember the trailing full-stop / period.

Then, searching for a word's synsets was easy:

s(ID, W_num, 'ease', Ss_type, Sense_num, Tag_count).

And I got back a range of synsets containing the word 'ease'.

ID = 101064148
W_num = 2
Ss_type = n
Sense_num = 5
Tag_count = 0

Which means the Wordnet ID for that synset is 101064148, 'ease' is the second word in the synset (W_num = 2), the synset is a noun synset (Ss_type = n), this is the fifth sense of 'ease' (Sense_num = 5), and this sense has not been tagged in the semantic concordance texts (Tag_count = 0).

Press the semi-colon to cycle through them all.

This is so simple. And if you want a particular sense (e.g., the first sense of 'ease'), just specify it thus:

s(ID, W_num, 'ease', Ss_type, 1, Tag_count).

Ss_type mostly reports words as type 'n', which means 'noun'. It can also report verbs (v), adverbs (r), adjectives (a) and adjective satellites (s).

Glossy Wordnet

Now let's retrieve a 'gloss'. A gloss is a short natural-language definition of a particular word sense, usually followed by one or more example sentences illustrating its use.

First we load the gloss module.

[wn_g].

Then, we can take a synset ID and retrieve its gloss:

g(101064148, Gloss).

Which shows:

Gloss = 'freedom from activity (work or strain or responsibility); "took his repose by the swimming pool"'.

In the next Prolog and Wordnet article, I'll go through some more advanced ways of using Wordnet with Prolog.

Monday, August 19, 2013

Lots of Python statistics code

As some of you might know, I've been writing code for statistics in Python for some time. I also have a repository on Github which is the basis of a book I've been writing for several years.

There's a lot there, including modules for unit testing. If you want to join in, feel free to fork the repository, or email changes to me if you prefer.

https://github.com/salmoni/CompStatsPython

And for the eagle-eyed amongst you, there's some basic stats stuff in Prolog too! See stats.pl here.

List of descriptives:


  • VSort - sort a vector
  • MSort - sort a matrix
  • CalculateRanks - for calculating the ranks of a matrix
  • GetSSCP_M - calculates the sum of squares and cross-products matrix
  • GetVarsCovars_M - calculates the variances and covariances matrix
  • GetVariances - calculates the variances of a matrix of variables
  • GetStdDevs - calculates the standard deviations of a matrix of variables
  • GetCorrelationMatrix - calculates the correlation matrix
  • Count - returns the number of non-missing data
  • sum - returns the sum of non-missing data
  • minimum - returns the minimum of non-missing data
  • maximum - returns the maximum of non-missing data
  • Range - maximum minus the minimum
  • proportions - 
  • relfreqmode - 
  • cumsum - 
  • cumproduct - 
  • cumpercent - 
  • frequencies - 
  • trimmeddata - 
  • trimmedmean - 
  • bitrimmedmean - 
  • mean - 
  • median - 
  • mode - 
  • moment - 
  • TukeyQuartiles - returns Tukey's hinges
  • MooreQuartiles - returns Moore & McCabe's hinges
  • SPQuantile - quantile used by S-Plus
  • TradQuantile - quantile used by SPSS
  • MidstepQuantile - mid-step quantile
  • Q1 - Q1 quantile from Hyndman & Fan
  • Q2 - Q2 quantile from Hyndman & Fan
  • Q3 - Q3 quantile from Hyndman & Fan
  • Q4 - Q4 quantile from Hyndman & Fan
  • Q5 - Q5 quantile from Hyndman & Fan
  • Q6 - Q6 quantile from Hyndman & Fan
  • Q7 - Q7 quantile from Hyndman & Fan
  • Q8 - Q8 quantile from Hyndman & Fan
  • Q9 - Q9 quantile from Hyndman & Fan
  • InterquartileRange - 
  • SS - sum of squares
  • SSDevs - sum of squared deviations from the mean
  • SampVar - sample variance
  • PopVar - population variance
  • SampStdDev - sample standard deviation
  • PopStdDev - population standard deviation
  • StdErr - standard error
  • CoeffVar - coefficient of variation
  • ConfidenceIntervals - returns the confidence intervals
  • MAD - Median absolute deviation
  • GeometricMean - the geometric mean
  • HarmonicMean - the harmonic mean
  • MSSD - mean square of successive differences
  • Skewness - returns the skewness
  • Kurtosis - returns the kurtosis
  • StandardScore - transforms a vector into a standard (ie, z-) score
  • EffectSizeControl - returns an effect size if a control condition is present
  • EffectSize - returns an effect size if no control is present
  • FiveNumber - Tukey's five number summary (minimum, lower quartile, median, upper quartile, maximum)
  • OutliersSQR - returns two arrays, one of outliers defined by 1.5 * IQR, and the other without these outliers


And others such as


  • Cronbach's Alpha
  • Kruskal-Wallis
  • Multiple regression
  • Mann-Whitney U
  • Wilcoxon
  • ANOVA



Vector 'velocity' not working

Comparing the lines of best fit was a red herring, largely because the lines were almost identical regardless of the documents they described.

So the next idea was to examine 'velocity'. I must admit that I'm not sure what the proper name for this is; I'm sure it is formally measured somewhere. It is, however, a fairly simple concept: how much each point varies compared with its neighbours. It's somewhat similar to variance and other measures of dispersion, but whereas variance treats each data point as effectively independent of all the others, here each data point is measured against its neighbours.
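The R code is in the gist linked at the end of this post. As a rough Python rendering of the idea as I've described it (each point compared with its neighbours rather than with the mean), something like the following will do; the exact formula in the gist may differ.

import numpy as np

def velocity(vec):
    # Sum of squared differences between neighbouring elements: unlike
    # variance, this depends on the ordering of the points
    vec = np.asarray(vec, dtype=float)
    return float(np.sum(np.diff(vec) ** 2))

# Reordering the same values leaves variance unchanged but changes 'velocity'
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([1.0, 4.0, 2.0, 5.0, 3.0])
print(np.var(a), np.var(b))        # identical: 2.0 and 2.0
print(velocity(a), velocity(b))    # different: 4.0 and 26.0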

I found little correlation between the difference in velocity score and the difference in cosine measure. This means that velocity isn't a way to short-cut all-pairs similarity measurements for dense matrices.

Take these two vectors as an example: each produced almost the same cosine with a third document (0.605 to three decimal places), yet the velocity measures were very different, 77.01 for the first vector and 175.0 for the second. Plots of the two vectors bear this out: despite both having a similar relationship to the third document, their patterns are markedly different, so velocity is not sufficient to approximate a cosine similarity score. Many other documents with vastly different cosines were interlopers when it came to velocity measures.

R code for velocity:

https://gist.github.com/salmoni/6269948

Saturday, August 17, 2013

Too early to tell... And some reading

I mentioned some literature I came across on short-cutting the all-pairs similarity search. These papers deal with sparse matrices rather than the dense ones I need, but they might still be useful.

Optimizing Parallel Algorithms for All Pairs Similarity Search (2013), http://www.cs.ucsb.edu/projects/psc/wsdm2013.pdf

Scaling up all-pairs similarity search (2007), http://wwwconference.org/www2007/papers/paper342.pdf

An Incremental Prefix Filtering Approach for the All Pairs Similarity Search Problem (unknown), http://www.win.tue.nl/~lamthuy/MyPub/ippair.pdf

Parallel All Pairs Similarity Search (unknown), http://www.lidi.info.unlp.edu.ar/WorldComp2011-Mirror/IKE5242.pdf


Some early results from trying to compare lines of best fit aren't that promising, but this is just an early analysis. Full results will be posted here later.