Wednesday, October 30, 2013


I've been having some fun recently using MongoDB (and its Python module, pymongo), which I'm planning to use for Roistr, the search engine. The stuff I need it for is fairly straightforward and I'm enjoying using it. My Mac has a test database while the real system is sitting on some more substantial hardware.

It does, however, feel odd to talk about using, shall we say, commodity software for a search engine, no matter how early. Of course it's a good idea to use tried and tested open source programs built precisely for this type of task (plus it's a great learning experience), but I imagine that a lot of organisations have their own home-made stuff. Being a band of exactly one and trying to build a billion-index search engine, I'm up against it, so I need to rely on others' software.

Sunday, October 13, 2013

Roistr is becoming a search engine

We've had a lot of fun with Roistr's semantic relevance engine, and we have now figured out a number of technical problems that make it viable as search engine technology. What this means is that we're phasing out the purely relevance parts and instead making Roistr a semantic search engine – or rather, a search engine that relies on natural language processing techniques rather than simple keyword matching. We expect it to be ready in a few months, depending upon how quickly we can get stuff sorted out.

First of all, we're only going to index the English version of Wikipedia. It's only a few million documents, and forms a nice test bed as well as being an information base for future developments. Once that's done, we'll aim to spider the web. This, of course, will take a while, especially given that we have extremely limited resources, but it will be fun to test ourselves against the giants.

Our long-term ambition is to have an index of over 1 billion documents. From there, we should be able to cover most searches fairly well. Will it be better than existing search engines like Google, Bing or Blekko? Well, time will tell.

Friday, August 23, 2013

Shortcutting all-pairs comparisons for dense matrices – got it

All-pairs comparisons for dense matrices

Some of my recent posts described my woes in trying to figure out how I could quickly identify the most likely subset of vectors that match a candidate vector without repeatedly calculating cosine similarity.

I have found a method that appears to work! I won't describe it here because it needs more testing. Falsifications can and should be reported soon, but confirmations should be announced cautiously!

But if it works, I can reduce a set of vectors down to a fraction using fewer calculations. It's not ideal, but effective indexing and searching in a database should, in theory, substantially improve performance to the point where we can identify the best matching subset with relative ease.

Making a new search engine

So why is this being done? Why do I want to be able to take a multi-billion-element index of vectors and quickly and cheaply identify a tiny (say, 100-item) subset before conducting the cosine comparisons?

Well, that would be giving it away but the subheading is fairly clear.

Besides, we're figuring out a way to store a 3+ billion page search engine on a terabyte of space. Yes, it's possible and yes, we're going to try it.

Once we've sorted out a few more problems!

Thursday, August 22, 2013

Using Wordnet from Prolog – part one

Prolog has many uses in natural language processing, and here is an introduction to using Wordnet through Prolog with examples of common tasks.

I enjoy learning Prolog because it matches how I thought computers would be programmed before I'd actually programmed one: you specify a set of rules, and the computer comes back with answers that conform to those rules.

But Prolog is also handy for natural language processing (NLP), and one of these endeavours is the famous Wordnet being made available in this fascinating language. This is well cool. I've used Wordnet lots in Python but rarely in Prolog.

First, download the Prolog Wordnet effort here (documentation here). I tried using it in GProlog but got an "atom table full" error. SWI Prolog handled it well enough.

Once downloaded, open it up, open a terminal (if one isn't open already), and cd to the directory the Wordnet files were unzipped to. Then start your Prolog and type:
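Assuming the standard WordNet Prolog distribution, where the synset database lives in wn_s.pl, the command is the usual consult:

```prolog
?- [wn_s].
```

This loads the s/6 facts that the queries below rely on.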


Remember the terminating full-stop / period.

Then, searching for a word's synsets was easy:

s(ID, W_num, 'ease', Ss_type, Sense_num, Tag_count).

And I got back a range of synsets related to the word, ease.

ID = 101064148
W_num = 2
Ss_type = n
Sense_num = 5
Tag_count = 0
Which means the Wordnet ID for that synset is 101064148, 'ease' is the second word in the synset (W_num = 2), the synset has 'ease' as a noun (Ss_type = n), this is the fifth sense of 'ease' (Sense_num = 5), and it has no corpus tags (Tag_count = 0).

Press the semi-colon to cycle through them all.

This is so simple. And if you want to pin things down (e.g., to synsets where 'ease' is the first word, W_num = 1), just specify it thus:

s(ID, 1, 'ease', As_type, Sense_num, Tag_count).

As_type mostly reports words as type 'n', which means 'noun'. It can also report verbs (v), adverbs (r), adjectives (a) and adjective satellites (s).

Glossy Wordnet

Now let's retrieve a 'gloss'. A gloss is a natural language definition of the particular word-sense, often with an example sentence illustrating its use.

First we load the gloss module.
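Assuming the standard WordNet Prolog file names, where the glosses live in wn_g.pl, this is another consult:

```prolog
?- [wn_g].
```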


Then, we can take a word-sense ID and retrieve the gloss:

g(101064148, Gloss).

Which shows:

Gloss = 'freedom from activity (work or strain or responsibility); "took his repose by the swimming pool"'.

In the next Prolog and Wordnet article, I'll go through some more advanced ways of using Wordnet with Prolog.

Monday, August 19, 2013

Lots of Python statistics code

As some of you might know, I've been writing statistics code in Python for some time. I also have a repository on Github which forms the basis of a book I've been writing for several years.

There's a lot there, including modules for unit testing. If you want to join in, feel free to fork, or email changes to me if you prefer.

And for the eagle-eyed amongst you, there's some basic stats stuff in Prolog too! See here.

List of descriptives:

  • VSort - sort a vector
  • MSort - sort a numpy.matrix
  • CalculateRanks - for calculating the ranks of a numpy.matrix
  • GetSSCP_M - calculates the sum of squares and cross-products numpy.matrix
  • GetVarsCovars_M - calculates the variances and covariances numpy.matrix
  • GetVariances - calculates the variances of a numpy.matrix of variables
  • GetStdDevs - calculates the standard deviations of a numpy.matrix of variables
  • GetCorrelationMatrix - calculates the correlation numpy.matrix
  • Count - returns the number of non-missing data
  • sum - returns the sum of non-missing data
  • minimum - returns the minimum of non-missing data
  • maximum - returns the maximum of non-missing data
  • Range - maximum minus the minimum
  • proportions - 
  • relfreqmode - 
  • cumsum - 
  • cumproduct - 
  • cumpercent - 
  • frequencies - 
  • trimmeddata - 
  • trimmedmean - 
  • bitrimmedmean - 
  • mean - 
  • median - 
  • mode - 
  • moment - 
  • TukeyQuartiles - returns Tukey's hinges
  • MooreQuartiles - returns Moore & McCabe's hinges
  • SPQuantile - quantile used by S-Plus
  • TradQuantile - quantile used by SPSS
  • MidstepQuantile - mid-step quantile
  • Q1 - Q1 quantile from Hyndman & Fan
  • Q2 - Q2 quantile from Hyndman & Fan
  • Q3 - Q3 quantile from Hyndman & Fan
  • Q4 - Q4 quantile from Hyndman & Fan
  • Q5 - Q5 quantile from Hyndman & Fan
  • Q6 - Q6 quantile from Hyndman & Fan
  • Q7 - Q7 quantile from Hyndman & Fan
  • Q8 - Q8 quantile from Hyndman & Fan
  • Q9 - Q9 quantile from Hyndman & Fan
  • InterquartileRange - 
  • SS - sum of squares
  • SSDevs - sum of squared deviations from the mean
  • SampVar - sample variance
  • PopVar - population variance
  • SampStdDev - sample standard deviation
  • PopStdDev - population standard deviation
  • StdErr - standard error
  • CoeffVar - coefficient of variation
  • ConfidenceIntervals - returns the confidence intervals
  • MAD - median absolute deviation
  • GeometricMean - the geometric mean
  • HarmonicMean - the harmonic mean
  • MSSD - mean square of successive differences
  • Skewness - returns the skewness
  • Kurtosis - returns the kurtosis
  • StandardScore - transforms a vector into a standard (ie, z-) score
  • EffectSizeControl - returns an effect size if a control condition is present
  • EffectSize - returns an effect size if no control is present
  • FiveNumber - Tukey's five number summary (minimum, lower quartile, median, upper quartile, maximum)
  • OutliersSQR - returns two arrays, one of outliers defined by 1.5 * IQR, and the other without these outliers

And others such as

  • Cronbach's Alpha
  • Kruskal-Wallis
  • Multiple regression
  • Mann-Whitney U
  • Wilcoxon
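To give a flavour of what's in the repository, here's a pure-Python sketch of one of the routines listed above, FiveNumber. This is an illustration using Tukey's hinge convention, not the repository's actual code:

```python
def median(xs):
    """Middle value of a sorted copy; mean of the middle two when even."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number(xs):
    """Tukey's five-number summary: minimum, lower hinge, median,
    upper hinge, maximum."""
    xs = sorted(xs)
    n = len(xs)
    half = (n + 1) // 2  # each half includes the median when n is odd
    return (xs[0], median(xs[:half]), median(xs), median(xs[n - half:]), xs[-1])
```

For example, five_number([1, 2, 3, 4, 5, 6, 7]) gives (1, 2.5, 4, 5.5, 7).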

Vector 'velocity' not working

Comparing the lines of best fit was a red herring, largely because the lines were almost identical regardless of the documents they described.

So the next idea was to examine 'velocity'. I must admit that I'm not sure what the real name for this is; I'm sure it's formally measured somehow. It is, however, a fairly simple concept: how much each point varies compared to its neighbours. It's somewhat similar to variance and other measures of dispersion, but where variance treats each data point as effectively independent of all the others, here each data point is measured against its neighbours.

I found little correlation between the difference in velocity score and the difference in cosine measure. This means that velocity isn't a way to short-cut all-pairs similarity measurements for dense matrices.

Examine these two vectors: they produced almost the same cosine with another document (0.605 to three decimal places), yet their velocity measures were 77.01 for the first document and 175.0 for the other. Plots of the two vectors appear to bear this out. Many other documents with vastly different cosines were interlopers when it came to velocity measures.

This was the vector showing a velocity measure of 77.01; the second vector showed 175.0. Despite both showing a similar cosine relationship to the other document, the patterns of the two vectors are markedly different, so velocity is not sufficient to approximate a cosine similarity score.

R code for velocity:
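As a sketch of the idea (in Python rather than R): since the post doesn't define 'velocity' formally, I'm assuming here it means the sum of squared differences between successive elements, in line with the neighbour-to-neighbour description above.

```python
def velocity(vec):
    """Sum of squared differences between successive elements: a rough
    measure of how much each point varies against its neighbours."""
    return sum((b - a) ** 2 for a, b in zip(vec, vec[1:]))
```

For example, velocity([1, 3, 2, 5]) gives 14, while a perfectly flat vector gives 0.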

Saturday, August 17, 2013

Too early to tell... And some reading

I mentioned some literature I came across for short-cutting the all-pairs similarity search. These papers deal with sparse matrices, not dense ones like I need, but they might still be useful.

Optimizing Parallel Algorithms for All Pairs Similarity Search (2013)

Scaling up all-pairs similarity search (2007)

An Incremental Prefix Filtering Approach for the All Pairs Similarity Search Problem (unknown)

Parallel All Pairs Similarity Search (unknown)

Some early results from trying to compare lines of best fit aren't that promising but this is just early analysis. Full results here later.

Friday, August 16, 2013

What about a line of best fit + residual?

Some more thoughts (btw, I'm going to experiment with these).

I spent some time doing a personal literature review but found that existing solutions are aimed at sparse vectors, which are extremely commonplace in information retrieval. My vectors, however, are dense. It's rare to find any zero value – if my engine returns a zero value, it has either failed or hit a big coincidence.

So I sat back and thought for a bit. We have a vector and want to know which other vectors it's approximately closest to.

So why not pre-calculate a line of best fit for each stored vector? Then an unknown document can have its own line calculated easily, and too-distant vectors eliminated with two simple comparisons: remove all vectors with slopes too different from the unknown document's slope; then remove remaining vectors with intercepts too different from the unknown document's intercept.

Furthermore, residuals could also be used.

So after three simple one-number-to-one-number comparisons for each of the stored vectors, we should be able to eliminate a whole bunch of documents. It's not perfect by any means, but it's not meant to be. It's only supposed to work as an approximation to reduce the problem space.
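A sketch of the filter described above, in Python. The function names, the tolerances and the use of the residual sum of squares are illustrative assumptions; in practice each stored vector's slope, intercept and residual would be pre-computed and indexed:

```python
def fit_line(vec):
    """Least-squares slope, intercept and residual sum of squares for a
    vector plotted against its element indices 0..n-1."""
    n = len(vec)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(vec) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, vec))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, vec))
    return slope, intercept, rss

def filter_candidates(query_vec, stored, slope_tol, icept_tol, rss_tol):
    """Keep only stored vectors whose fitted line is close to the query's."""
    qs, qi, qr = fit_line(query_vec)
    keep = []
    for vec in stored:
        s, i, r = fit_line(vec)  # would be looked up, not recomputed
        if (abs(s - qs) <= slope_tol and abs(i - qi) <= icept_tol
                and abs(r - qr) <= rss_tol):
            keep.append(vec)
    return keep
```

Only the survivors would then go through the full cosine comparison.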

The question then (if it begins to work of course!) is what qualifies as "too different". This is a function of the stored vectors' slopes, intercepts and residuals, and what percentage of the overall store we want left for detailed comparison.

Fail – part 1

After yesterday's article on short-cutting cosine similarity comparisons, I did a larger scale test using the Brown corpus (all 500 docs). So I retrieved vectors, undertook all 124,750 comparisons and calculated the descriptives for each comparison.

The linear regression showed nothing worthwhile which makes this idea false and not worth pursuing.

Why the difference in results? Well, the former was very small-scale so any interesting results were probably statistical artifacts rather than anything substantial. But it's good to investigate this properly and know the idea doesn't work.

For those interested, R = 0.10, R^2 = 0.01 and R^2 adjusted = 0.0. Beta coefficients were 0.0 except for mean at -0.08. Not surprising when the sum of squares was 10.03 and the residual was 2,352.51.

Clearly, if there is another factor (or factors) that can approximate a cosine comparison using descriptive statistics, I have not found it yet.

Thursday, August 15, 2013

Shortcutting vector multiplications

Roistr represents documents with vectors of numbers. When I want to see how close two documents are in meaning, the semantic relevance engine uses a cosine similarity, which after years of experimentation gives me the best results (or at least, those closest to human judgement).

The problem

Each cosine comparison represents literally thousands of calculations: the norms of both vectors are calculated and multiplied together, the dot product of the two vectors is taken, and the dot product is then divided by the product of the norms.
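In plain Python, that computation looks like this (a sketch, not Roistr's actual code):

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical vectors score 1.0; orthogonal ones score 0.0.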

The problem is that it's computationally expensive – in other words, it's slow. 

For comparing a document against one hundred other documents, this operation needs to be performed 100 times. Which is 100 times the slowness.

And for real-world work, there would be thousands, maybe tens of thousands, of other documents. For a real-world search engine, we're talking billions of comparisons and probably trillions of operations for each tiny search. This does not scale, and standard methods win out because nobody wants to wait long for search results these days when a reasonable approximation can be had from Bing, Blekko or DuckDuckGo.

What I did

So my task was to investigate whether this process had any short-cuts. Imagine if I had an index of billions of documents and a statistical summary of each. Then, with an incoming document (say, a search phrase), the engine could quickly identify the 20 or 100 most likely candidates and compare only against those.

This would cut down response time enormously and make semantic search more viable. For each cosine similarity calculation, I could discard potentially thousands of documents.

So the trick is to figure out what characteristics (if any) describe a vector well enough to allow this approximation.

So far, I've done a very limited test. Suffice to say that the single descriptives I used (mean, variance, range) are insufficient on their own.

But by plugging them all into a regression, I created a model that might have some viability. Far more work is needed of course to validate this, but even being able to discard 80% of documents would reduce system response times by a large degree.


The model so far attempts to predict the cosine value by using a negative coefficient for the mean, a positive one for the variance, and a very weak negative one for the range. The results are far from perfect (R = 0.74, R^2 = 0.55, R^2 adj = 0.33) and only one of the independent variables approached significance; but the results are just an initial stab, more of a pilot study than anything: the statistical power is very low.

Next steps are to expand the materials, run a load of calculations, and see if a model with a lower residual is possible.  

Exciting stuff and great to be doing stats again!

EDIT: I'm aware this might be a solved problem but I want to figure this out myself.

Wednesday, January 16, 2013

Prolog and NLP

Prolog has been a fun language to use. It's like how I thought programming computers would be before I'd ever programmed one: you tell it things (facts and rules) and then ask it questions to find answers.

In natural language processing, Prolog used to be very common but much less so now with languages such as Python and Java taking a lot of attention.

But, of course, if the only tool you know is a hammer, then every problem looks like a nail. What Prolog and other declarative languages can do is exceptional, but I don't see it as the entire solution. At Roistr, we've always said that Python is our mainstay language, which we use to create statistical models of text. Computational linguistics has fallen out of favour somewhat lately, but I wondered how to mix the two together.

As an introduction, Prolog can take statements and return true / yes (it's logically possible), or false / no (it's not). There are other returns but we can worry about those some other time.

The concern I have is that Prolog doesn't handle the grey areas that naturally occur in real life. My thought was: what if we could create a Prolog engine and supply it with explicit facts and rules? Queries that rely solely upon those facts and rules would be dealt with as normal, but it would also be able to deal with implicit knowledge by consulting a semantic relevance engine.

So if it was given facts about the things that John owns but a car isn't mentioned, the semantic relevance engine could step in and realise that 'car' is almost synonymous with the 'automobile' fact that does appear.

It could also handle unknown facts and infer rules from the semantic associations the engine provides. Say it's provided with a fact about John that says he's in France. When asked if he's in Europe, it would search its explicit knowledge base, which returns false, or rather more correctly unknown (because it's not explicitly stated or derivable that France is in Europe). But the engine would note the close association between the concepts 'France' and 'Europe' and allow the program to infer a likelihood.

So let's work this into an example:

?- john(europe).
Explicit false
Implicit 0.87 true
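A toy sketch of this two-stage lookup in Python. The fact store, the association scores and the threshold are all invented for illustration; a real version would ask the semantic relevance engine for the scores:

```python
FACTS = {("john", "in", "france")}  # explicit knowledge base

# Stand-in for the semantic relevance engine's association scores.
SIMILARITY = {frozenset(("france", "europe")): 0.87}

def query(subject, relation, obj, threshold=0.5):
    """Try an explicit match first; otherwise fall back to semantic
    association against known facts with the same subject and relation."""
    if (subject, relation, obj) in FACTS:
        return ("explicit", True)
    for s, r, o in FACTS:
        if s == subject and r == relation:
            score = SIMILARITY.get(frozenset((o, obj)), 0.0)
            if score >= threshold:
                return ("implicit", score)
    return ("unknown", False)
```

So query("john", "in", "france") matches explicitly, while query("john", "in", "europe") falls through to the association score and comes back implicit at 0.87.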

So the semantic relevance engine would be a back-up knowledge base, used to provide factual knowledge and inference. It's a bit bizarre to think about and needs a lot of work before the validity of the notion can be tested.

SEO Link Checking with Semantic Relevance

How can you tell if inbound or outbound links point to sites that Google will penalise you for? Is there a way to compare the content of sites that you link to, or sites that link to you, to see if they're damaging your rankings?

When a site links to you, it's easy to check manually to see if it's a decent site or not. After all, you just have to go into your SEO software and click on the link...

But this doesn't scale well. If you're an SEO consultant or agency with hundreds or thousands of sites, it's impossible to check them all.

Roistr's semantic relevance engine is a good candidate for this task. It can analyse your site and work out its overall 'gist'. It can even compare each page against this gist to make sure your content is reasonably consistent.

But it can also take other sites' content and compare it against yours. This way, you can see if your blog about immigration is being linked to by sites concerned with immigration - or if they're just trying to sell slippers.
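One way such a comparison could work, sketched in Python: treat the site's 'gist' as the element-wise mean of its page vectors and reuse cosine similarity against it. Both choices are my assumptions for illustration, not Roistr's published method:

```python
import math

def cosine(a, b):
    """Standard cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))

def site_gist(page_vectors):
    """Element-wise mean of a site's page vectors."""
    n = len(page_vectors)
    return [sum(col) / n for col in zip(*page_vectors)]

def on_topic(linking_page_vec, my_page_vectors, threshold=0.5):
    """Is a linking page's content close to this site's overall gist?"""
    return cosine(linking_page_vec, site_gist(my_page_vectors)) >= threshold
```

A page about your own topic scores near 1.0 against the gist; the slipper-sellers fall well below the threshold.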

We're currently working on providing a nice demonstration. You'll be able to enter a URL and we will compare it against some other sites so you can see if they're on-topic to you or not.

And if it's any good, you can contact us to automate a link-checking solution for all your sites!