## Thursday, February 12, 2015

### Fast prime numbers in Python

I spent some time recently on Project Euler and got side-tracked by the efficient calculation of prime numbers. After using a brute force method (iterating through a range of numbers and trying to find their factors), I read around and found a nice page at http://rebrained.com/?p=458, I found a good Stack Overflow page at http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n. These inspired me to try again and I came up with the following routine. It's faster than most but not as fast as primes6 on the first page I linked to when generating more than approximately prime numbers up to 350,000-400,000. Below that, nothing seems to touch it.

It's also different. It uses numpy (which is cheating, in a way) but does the job well. I like it because it seems more understandable once you've grasped that the routine doesn't do any division. Instead, it's pure sieve operations on a vector of booleans. Anything found to be divisible by anything other than 1 and itself is marked as False, and the routine finishes by returning the indices of True values - which are primes.

I've triangulated the results by summing them and comparing the sums against those of other routines and there's no differences I've noticed yet.

import numpy as np
from math import sqrt

def ajs_primes3a(upto):
mat = np.ones((upto), dtype=bool) # set up a long boolean array
mat[0] = False # remove 0
mat[1] = False # remove 1
mat[4::2] = False # remove anything divisible by 2
for idx in range(3, int(sqrt(upto))+1, 2): # remove anything else divisible
mat[idx*2::idx] = False
return np.where(mat == True)[0] # return the indices which are the primes

I'm quite pleased with this early foray into optimising a routine but there's work to do compared to prime6. What I like is that it has no division and instead seems to be a pure sieve and doesn't create a long list of numbers.

I tried other versions with a half-series so that anything divisible by 2 just wasn't considered, but what I came up with just weren't as fast.

Times (msecs, same machine, same time, best of 3-6 multiple runs)

10k      100k     500k     1m       20m
prime6       0.001258 0.002722 0.007229 0.001229 0.22388
erat         0.005414 0.059047 0.333737 0.673749 15+ seconds
ajs_primes3a 0.000360 0.001897 0.008540 0.016952 0.70135

Up to 100k, mine leads and might even be faster than loading in from file. prime6, however, takes over strongly after that and extends its lead very well. Mine doesn't lose too much ground, considering, so it's best to think of mine as fast-ish but nicely understandable.

## Wednesday, October 30, 2013

### MongoDB

I've been having some fun recently using MongoDB (and its Python module, pymongo) which I'm planning to use for roistr the search engine. The stuff I need it for is fairly straightforward and I'm enjoying using it. My mac has a test database while the real system is sitting on some more substantial hardware.

It does, however, feel odd to talk about using, shall we say, commodity software for a search engine, no matter how early. Of course it's a good idea to use tried and tested open source programs built precisely for this type of task (plus it's a great learning experience), but I imagine that a lot of organisations have their own home-made stuff. Being a band of exactly one and trying to build a billion-index search engine, I'm up against it so I need to rely on others' softwares.

## Sunday, October 13, 2013

### Roistr is becoming a search engine

We've ad a lot of fun with Roistr's semantic relevance engine, but we have managed to figure out a number of technical problems to make it a viable search engine technology. What this means is that we're phasing out the purely relevance parts, and instead we're going to make Roistr a semantic search engine – or rather, a search engine that relies on natural language processing techniques rather than simple keyword matching. We expect it to be ready in a few months, depending upon how quickly we can get stuff sorted out

First of all, we're only going to index the English version of Wikipedia. It's only a few million documents, and forms a nice test bed as well as being an information base for future developments. Once that's done, we'll aim to spider the web. This, of course, will take a while, especially given that we have extremely limited resources, but it will be fun to test ourselves against the giants.

Our long-term ambition is to have an index of over 1 billion documents. From there, we should be able to cover most searches fairly well. Will it be better than existing search engines like Google, Bing or Blekko? Well, time will tell.

## Friday, August 23, 2013

### All-pairs comparisons for dense matrices

Some of my recent posts described my woes in trying to figure out how I could quickly identify the most likely subset of vectors that match a candidate vector without repeatedly calculating cosine similarity.

I have found a method that appears to work! I won't describe it here because it needs more testing. Falsifications can and should be reported soon, but confirmations should be announced cautiously!

But if it works, I can reduce a set of vectors down to a fraction using fewer calculations. It's not ideal, but effective indexing and searching in a database should, in theory, substantially improve performance to the point where we can easily identify the best matching subset with relatively little ease.

### Making a new search engine

So why is this being done? Why do I want to be able to take a mulit-billion element index of vectors and quickly and cheaply identify a tiny (say 100-item) subset before conducting the cosine comparisons?

Well, that would be giving it away but the subheading is fairly clear.

Besides, we're figuring out a way to store a 3+ billion page search engine on a terabyte of space. Yes, it's possible and yes, we're going to try it.

Once we've sorted out a few more problems!

## Thursday, August 22, 2013

### Using Wordnet from Prolog – part one

Prolog has many uses in natural language processing, and here is an introduction to using Wordnet through Prolog with examples of common tasks.

I enjoy learning Prolog because it conforms to how I thought computers should be programmed before I'd actually programmed one. You'd specify a set of rules, and the computer came back with answers that conformed to those rules.

But Prolog is also handy at natural language processing (NLP) and one of these endeavours is the famous Wordnet being made available in this fascinating language. This is well cool. I've used Wordnet lots in Python but rarely in Prolog.

First, download the Prolog Wordnet effort here (documentation here). I tried using it in GProlog but got an "atom table full" error. SWI Prolog handled it well enough.

Once downloaded, open it up and open a terminal (if not already), and cd to the directory the Wordnet files were unzipped to. Then begin your Prolog and type

[wn_s].

Remember the following full-stop / period.

Then, searching for a word's synsets was easy:

s(ID, W_num, 'ease', Ss_type, Sense_num, Tag_count).

And I got back a range of synsets related to the word, ease.

ID = 101064148
W_num = 2
Ss_type = n
Sense_num = 5
Tag_count = 0
Which means the Wordnet ID for that sense is 101064148, it's the second synset (W_num = 2), this synset has 'ease' as a noun (Ss_type = n), there are a total of 5 senses for 'ease' (Sense_num = 5) and there are no tags (Tag_count = 0).

Press the semi-colon to cycle through them all.

This is so simple. And if you want a particular sense (e.g., the first synset of 'ease'), just specify it thus:

s(ID, 1, 'ease', As_type, Sense_num, Tag_count).

As_type mostly reports words as type 'n' which means 'noun'. It can also report verbs (v), adverbs (r) and adjectives and adjective satellites.

Glossy Wordnet

Now let's retrieve a 'gloss'. A gloss is a natural language sentence illustrating the use of the particular word-sense.

First we load the gloss module.

[wn_g].

Then, we can take a word-sense ID and retrieve the gloss:

g(101064148, Gloss).

Which shows:

Gloss = 'freedom from activity (work or strain or responsibility); "took his repose by the swimming pool"'.

In the next Prolog and Wordnet article, I'll go through some more advanced ways of using Wordnet with Prolog.

## Monday, August 19, 2013

### Lots of Python statistics code

As some of you might know, I've been writing code for statistics in Python for some time. I also have a repository on Github which is the basis of a book I've been writing for several years.

There's a lot there including modules for unit testing. If you want to join in, feel free to fork or email changes in to me if you prefer.

https://github.com/salmoni/CompStatsPython

And for the eagle-eyed amongst you, there's some basic stats stuff in Prolog too! See stats.pl here

List of descriptives:

• VSort - sort a vector
• MSort - sort a numpy.matrix
• CalculateRanks - for calculating the ranks of a numpy.matrix
• GetSSCP_M - calculates the sum of squares and cross-products numpy.matrix
• GetVarsCovars_M - calculates the variances and covariances numpy.matrix
• GetVariances - calculates the variances of a numpy.matrix of variables
• GetStdDevs - calculates the standard deviations of a numpy.matrix of variables
• GetCorrelationMatrix - calculates the correlation numpy.matrix
• Count - returns the number of non-missing data
• sum - returns the sum of non-missing data
• minimum - returns the minimum of non-missing data
• maximum - returns the maximum of non-missing data
• Range - maximum minus the minimum
• proportions -
• relfreqmode -
• cumsum -
• cumproduct -
• cumpercent -
• frequencies -
• trimmeddata -
• trimmedmean -
• bitrimmedmean -
• mean -
• median -
• mode -
• moment -
• TukeyQuartiles - returns Tukey's hinges
• MooreQuartiles - returns Moore & McCabe's hinges
• SPQuantile - quantile used by S-Plus
• TradQuantile - quantile used by SPSS
• MidstepQuantile - mid-step qua
• Q1 - Q1 quantile from Hyndnumpy.man & Fan
• Q2 - Q2 quantile from Hyndnumpy.man & Fan
• Q3 - Q3 quantile from Hyndnumpy.man & Fan
• Q4 - Q4 quantile from Hyndnumpy.man & Fan
• Q5 - Q5 quantile from Hyndnumpy.man & Fan
• Q6 - Q6 quantile from Hyndnumpy.man & Fan
• Q7 - Q7 quantile from Hyndnumpy.man & Fan
• Q8 - Q8 quantile from Hyndnumpy.man & Fan
• Q9 - Q9 quantile from Hyndnumpy.man & Fan
• InterquartileRange -
• SS - sum of squares
• SSDevs - sum of squared deviations from the mean
• SampVar - sample variance
• PopVar - population variance
• SampStdDev - sample standard deviation
• PopStdDev - population standard deviation
• StdErr - standard error
• CoeffVar - coefficient of variation
• ConfidenceIntervals - returns the confidence intervals
• numpy.maD - Median absolute deviation
• GeometricMean - the geometric mean
• HarmonicMean - the harmonic mean
• MSSD - mean square of successive differences
• Skewness - returns the skewness
• Kurtosis - returns the kurtosis
• StandardScore - transforms a vector into a standard (ie, z-) score
• EffectSizeControl - returns an effect size if a control condition is present
• EffectSize - returns an effect size if no control is present
• FiveNumber - Tukey's five number sumnumpy.mary (minimum, lower quartile, median, upper quartile, maximum)
• OutliersSQR - returns two arrays, one of outliers defined by 1.5 * IQR, and the other without these outliers

And others such as

• Cronbach's Alpha
• Kruskal-Wallis
• Multiple regression
• Mann-Whitney U
• Wilcoxon
• ANOVA

### Vector 'velocity' not working

Comparing the lines of best fit was a red-herring largely because the lines were almost identical, regardless of the documents they described.

So the next idea was to examine 'velocity'. I must admit that I'm not too sure what the real name is and I'm sure it's measured somehow. It is, however, a fairly simple concept of how much each point varies compared to its neighbours. Although somewhat similar to variance and other measures of dispersion, consider variance to be a measure of dispersion where each data point is effectively independent of all the others. Here, each data point is measured against its neighbours.

I found little correlation with the difference in velocity score and difference in cosine measure. This means that velocity isn't a way to short-cut all-pairs similarity measurements for dense matrices.

If you examine these two vectors, they produced almost the same cosine with another document (0.605 to three decimal places). The velocity measures were 77.01 for the first document and 175.0 for the other. Examining plots of the two vectors appears to support this. Many other documents with vastly different cosines were interlopers when it came to velocity measures.

This was the vector showing a velocity measure of 77.01. The second vector showed 175.0.
Despite both showing a similar relationship to another document, the pattern of the vector is markedly different and not sufficient to approximate a cosine similarity score.

R code for velocity:

https://gist.github.com/salmoni/6269948