Friday, August 10, 2018

Parsing HTML with Go to get a specific table's data (using goQuery)

Abstract


How to use Go to download a web page and parse the data of one specific table when the page contains several. The same approach is useful whenever you want information from tags within a single node rather than from across the entire page. The full code is here: https://gist.github.com/salmoni/27aee5bb0d26536391aabe7f13a72494

Salstat into Go


We're rewriting our statistics program, Salstat, in Go. Why Go? Well, it was an opportunity to learn the language (I like concurrency – it's great for the kind of large-scale analysis we do at Roistr), it's fast, it has the wx toolkit available, and it's far easier to create packages for people to download and install than it is with Python.

So far, it's been a blast and really good fun. I personally have found Go to be almost as terse as Python, which was a surprise. It's possibly a little less readable, but not significantly so. I like the speed of compilation and execution and, as I said above, the ease of packaging really makes a difference.

One problem we faced was with the scraping feature. This allows users to enter the URL of a webpage that has a nice table of data. Salstat downloads the page, parses it and gives you a preview so you can select which table to import data from. That's all fine, but Go was hard to handle here because all the documentation we read covered only simple cases, like extracting all the links in a page.

With Salstat, we want each table's headings and cells, but gathered per table rather than pooled across all tables. There was little information to help us do this. With a bit of experimentation, we found the approach below. It's almost obvious when you see it but, until you see it, maybe it's not so obvious.

Our use case


We have some HTML that contains two tables, each with data. We want all of the data for the second table and nothing else. The returned data will be the headings and the cells. A made-up example of this kind of HTML is shown below.
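For illustration, here's a minimal, hypothetical page held in a Go string constant called data; this is what the snippets later in the post parse (in Salstat itself, data holds the body of the downloaded page):

// Hypothetical two-table page used by the snippets below.
const data = `
<html><body>
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>France</td><td>67</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>10</td></tr>
  <tr><td>Bob</td><td>12</td></tr>
</table>
</body></html>`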

Which library


We used the goQuery library, which is a version of jQuery written in Go. It isn't fully jQuery-compliant because Go doesn't run within a browser, but it gave us access to each table in a page and then to each row, each table heading and each table cell within that table.

Let's import the library, along with the standard library packages that the snippets below use:

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

Then set up variables to hold a slice of headings and a two-dimensional slice of row data. This is done within a function:

func goGet() {
var headings, row []string
var rows [][]string

Next, we parse the HTML, which is held here in a string called data (in Salstat, this comes from the downloaded page), and abort if there is an error:

doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
if err != nil {
fmt.Println("Could not parse the HTML")
log.Fatal(err)
}

The next bit is nice. Using the goQuery library, we find each table within the document:

doc.Find("table").Each(func(index int, tablehtml *goquery.Selection) {

This iterates through each table and allows us to perform an action. It gives us the table's index and a selection which points to the HTML code within just that specific table.

if index == 1 { // 0 being the first table, 1 being the second

We can also find rows within the table with a similar line:

tablehtml.Find("tr").Each(func(indextr int, rowhtml *goquery.Selection) {

Cool. Now we've identified each table within a page, and each row within each table. Next, we want to find the table headings (with the "th" tag) and store them in the 'headings' variable:

rowhtml.Find("th").Each(func(indexth int, tableheading *goquery.Selection) {
headings = append(headings, tableheading.Text())
})

These lines find each table heading and append its text to the 'headings' variable. Note that I wanted just the text of the heading rather than the HTML.

Next, we get each cell ('td' tag) in the row, which builds up a list of that row's cells. Once we have all the cells for a row, the list is appended to the 'rows' variable and the single-row variable is reset:

rowhtml.Find("td").Each(func(indexth int, tablecell *goquery.Selection) {
row = append(row, tablecell.Text())
})
rows = append(rows, row)
row = nil
})
}
})

Note the closing brackets: the first ends the "tr" loop, the next closes the if statement and the last ends the "table" loop.

We now have all the data for the table we want and can print them out to check:

fmt.Println("####### headings = ", len(headings), headings)
fmt.Println("####### rows = ", len(rows), rows)
}

Again, note the closing bracket which ends the function.
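To see how the pieces fit together, here's a minimal, self-contained sketch. It assumes the hypothetical data constant from earlier and a trivial main function, and it changes nothing else; the actual program is in the gist linked at the top.

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func goGet() {
    var headings, row []string
    var rows [][]string

    // Parse the HTML held in 'data'; in Salstat this is the downloaded page.
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
    if err != nil {
        fmt.Println("Could not parse the HTML")
        log.Fatal(err)
    }

    // Visit every table but only collect data from the second one (index 1).
    doc.Find("table").Each(func(index int, tablehtml *goquery.Selection) {
        if index == 1 {
            tablehtml.Find("tr").Each(func(indextr int, rowhtml *goquery.Selection) {
                // Headings ("th" tags) go into 'headings'.
                rowhtml.Find("th").Each(func(indexth int, tableheading *goquery.Selection) {
                    headings = append(headings, tableheading.Text())
                })
                // Cells ("td" tags) build up one row, which is then stored and reset.
                rowhtml.Find("td").Each(func(indexth int, tablecell *goquery.Selection) {
                    row = append(row, tablecell.Text())
                })
                rows = append(rows, row)
                row = nil
            })
        }
    })

    fmt.Println("####### headings = ", len(headings), headings)
    fmt.Println("####### rows = ", len(rows), rows)
}

func main() {
    goGet()
}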

Thursday, February 12, 2015

Fast prime numbers in Python

I spent some time recently on Project Euler and got side-tracked by the efficient calculation of prime numbers. After using a brute-force method (iterating through a range of numbers and trying to find their factors), I read around and found a nice page at http://rebrained.com/?p=458 and a good Stack Overflow page at http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n. These inspired me to try again and I came up with the following routine. It's faster than most, though not as fast as prime6 from the first page I linked to once you generate primes up to more than approximately 350,000-400,000. Below that, nothing seems to touch it.

It's also different. It uses numpy (which is cheating, in a way) but does the job well. I like it because it seems more understandable once you've grasped that the routine doesn't do any division. Instead, it's pure sieve operations on a vector of booleans. Anything found to be divisible by anything other than 1 and itself is marked as False, and the routine finishes by returning the indices of True values - which are primes.

I've sanity-checked the results by summing them and comparing the sums against those of other routines, and I haven't noticed any differences yet.

import numpy as np
from math import sqrt

def ajs_primes3a(upto):
    mat = np.ones(upto, dtype=bool) # a long boolean vector: True means "still a candidate prime"
    mat[0] = False # 0 is not prime
    mat[1] = False # 1 is not prime
    mat[4::2] = False # knock out the even numbers above 2
    for idx in range(3, int(sqrt(upto)) + 1, 2): # sieve with each odd number up to sqrt(upto)
        mat[idx*2::idx] = False # knock out its multiples
    return np.where(mat)[0] # the indices still True are the primes

I'm quite pleased with this early foray into optimising a routine, but there's work to do to catch prime6. What I like is that it uses no division, works as a pure sieve, and doesn't create a long list of numbers.

I tried other versions with a half-series, so that anything divisible by 2 wasn't considered at all, but what I came up with just wasn't as fast.

Times (msecs; same machine, same time, best of 3 to 6 runs)

             10k      100k     500k     1m       20m
prime6       0.001258 0.002722 0.007229 0.001229 0.22388
erat         0.005414 0.059047 0.333737 0.673749 15+ seconds
ajs_primes3a 0.000360 0.001897 0.008540 0.016952 0.70135

Up to 100k, mine leads and might even be faster than loading the primes in from a file. prime6, however, takes over strongly after that and extends its lead very well. Mine doesn't lose too much ground, considering, so it's best to think of it as fast-ish but nicely understandable.



Wednesday, October 30, 2013

MongoDB

I've been having some fun recently using MongoDB (and its Python module, pymongo), which I'm planning to use for Roistr, the search engine. The stuff I need it for is fairly straightforward and I'm enjoying using it. My Mac has a test database while the real system sits on some more substantial hardware.

It does, however, feel odd to talk about using, shall we say, commodity software for a search engine, no matter how early a stage it's at. Of course it's a good idea to use tried and tested open source programs built precisely for this type of task (plus it's a great learning experience), but I imagine that a lot of organisations have their own home-made stuff. Being a band of exactly one and trying to build a search engine with a billion-document index, I'm up against it, so I need to rely on other people's software.

Sunday, October 13, 2013

Roistr is becoming a search engine

We've had a lot of fun with Roistr's semantic relevance engine, but we have managed to work out a number of technical problems that make it viable as a search engine technology. What this means is that we're phasing out the purely relevance-focused parts, and instead we're going to make Roistr a semantic search engine – or rather, a search engine that relies on natural language processing techniques rather than simple keyword matching. We expect it to be ready in a few months, depending upon how quickly we can get everything sorted out.

First of all, we're only going to index the English version of Wikipedia. It's only a few million documents, and forms a nice test bed as well as being an information base for future developments. Once that's done, we'll aim to spider the web. This, of course, will take a while, especially given that we have extremely limited resources, but it will be fun to test ourselves against the giants.

Our long-term ambition is to have an index of over 1 billion documents. From there, we should be able to cover most searches fairly well. Will it be better than existing search engines like Google, Bing or Blekko? Well, time will tell.

Friday, August 23, 2013

Shortcutting all-pairs comparisons for dense matrices – got it

All-pairs comparisons for dense matrices


Some of my recent posts described my woes in trying to figure out how I could quickly identify the most likely subset of vectors that match a candidate vector without repeatedly calculating cosine similarity.

I have found a method that appears to work! I won't describe it here because it needs more testing. Falsifications can and should be reported soon, but confirmations should be announced cautiously!

But if it works, I can reduce a set of vectors down to a fraction of its size using fewer calculations. It's not ideal, but with effective indexing and searching in a database it should, in theory, substantially improve performance to the point where we can identify the best-matching subset with relatively little effort.

Making a new search engine

So why is this being done? Why do I want to be able to take a multi-billion-element index of vectors and quickly and cheaply identify a tiny (say, 100-item) subset before conducting the cosine comparisons?

Well, that would be giving it away but the subheading is fairly clear.

Besides, we're figuring out a way to store a 3+ billion page search engine on a terabyte of space. Yes, it's possible and yes, we're going to try it.

Once we've sorted out a few more problems!

Thursday, August 22, 2013

Using Wordnet from Prolog – part one

Prolog has many uses in natural language processing, and here is an introduction to using Wordnet through Prolog with examples of common tasks.

I enjoy learning Prolog because it conforms to how I thought computers should be programmed before I'd actually programmed one. You'd specify a set of rules, and the computer came back with answers that conformed to those rules.

But Prolog is also handy for natural language processing (NLP), and one of these endeavours is the famous Wordnet database being made available in this fascinating language. This is well cool. I've used Wordnet lots in Python but rarely in Prolog.

First, download the WordNet Prolog package and its documentation from the WordNet website. I tried using it in GProlog but got an "atom table full" error. SWI Prolog handled it well enough.

Once it's downloaded, unzip it, open a terminal (if you haven't already) and cd to the directory the Wordnet files were unzipped to. Then start your Prolog and type

[wn_s].

Remember the full-stop / period at the end.

Then, searching for a word's synsets was easy:

s(ID, W_num, 'ease', Ss_type, Sense_num, Tag_count).

And I got back a range of synsets containing the word 'ease'.

ID = 101064148
W_num = 2
Ss_type = n
Sense_num = 5
Tag_count = 0

This means the synset ID is 101064148, 'ease' is the second word in that synset (W_num = 2), the synset is a noun synset (Ss_type = n), this is the fifth sense of 'ease' (Sense_num = 5) and it has no semantic concordance tags (Tag_count = 0).

Press the semi-colon to cycle through them all.

This is so simple. And if you want a particular sense (e.g., the first sense of 'ease'), just specify it by binding Sense_num:

s(ID, W_num, 'ease', Ss_type, 1, Tag_count).

Ss_type mostly reports words as type 'n', which means 'noun'. It can also report verbs (v), adverbs (r), adjectives (a) and adjective satellites (s).

Glossy Wordnet

Now let's retrieve a 'gloss'. A gloss is a short natural-language definition of a particular word-sense, often with an example sentence showing its use.

First we load the gloss module.

[wn_g].

Then, we can take a word-sense ID and retrieve the gloss:

g(101064148, Gloss).

Which shows:

Gloss = 'freedom from activity (work or strain or responsibility); "took his repose by the swimming pool"'.

In the next Prolog and Wordnet article, I'll go through some more advanced ways of using Wordnet with Prolog.

Monday, August 19, 2013

Lots of Python statistics code

As some of you might know, I've been writing code for statistics in Python for some time. I also have a repository on Github which is the basis of a book I've been writing for several years.

There's a lot there including modules for unit testing. If you want to join in, feel free to fork or email changes in to me if you prefer.

https://github.com/salmoni/CompStatsPython

And for the eagle-eyed amongst you, there's some basic stats stuff in Prolog too! See stats.pl in the same repository.

List of descriptives:


  • VSort - sort a vector
  • MSort - sort a numpy.matrix
  • CalculateRanks - for calculating the ranks of a numpy.matrix
  • GetSSCP_M - calculates the sum of squares and cross-products numpy.matrix
  • GetVarsCovars_M - calculates the variances and covariances numpy.matrix
  • GetVariances - calculates the variances of a numpy.matrix of variables
  • GetStdDevs - calculates the standard deviations of a numpy.matrix of variables
  • GetCorrelationMatrix - calculates the correlation numpy.matrix
  • Count - returns the number of non-missing data
  • sum - returns the sum of non-missing data
  • minimum - returns the minimum of non-missing data
  • maximum - returns the maximum of non-missing data
  • Range - maximum minus the minimum
  • proportions - 
  • relfreqmode - 
  • cumsum - 
  • cumproduct - 
  • cumpercent - 
  • frequencies - 
  • trimmeddata - 
  • trimmedmean - 
  • bitrimmedmean - 
  • mean - 
  • median - 
  • mode - 
  • moment - 
  • TukeyQuartiles - returns Tukey's hinges
  • MooreQuartiles - returns Moore & McCabe's hinges
  • SPQuantile - quantile used by S-Plus
  • TradQuantile - quantile used by SPSS
  • MidstepQuantile - mid-step quantile
  • Q1 - Q1 quantile from Hyndman & Fan
  • Q2 - Q2 quantile from Hyndman & Fan
  • Q3 - Q3 quantile from Hyndman & Fan
  • Q4 - Q4 quantile from Hyndman & Fan
  • Q5 - Q5 quantile from Hyndman & Fan
  • Q6 - Q6 quantile from Hyndman & Fan
  • Q7 - Q7 quantile from Hyndman & Fan
  • Q8 - Q8 quantile from Hyndman & Fan
  • Q9 - Q9 quantile from Hyndman & Fan 
  • InterquartileRange - 
  • SS - sum of squares
  • SSDevs - sum of squared deviations from the mean
  • SampVar - sample variance
  • PopVar - population variance
  • SampStdDev - sample standard deviation
  • PopStdDev - population standard deviation
  • StdErr - standard error
  • CoeffVar - coefficient of variation
  • ConfidenceIntervals - returns the confidence intervals
  • MAD - median absolute deviation
  • GeometricMean - the geometric mean
  • HarmonicMean - the harmonic mean
  • MSSD - mean square of successive differences
  • Skewness - returns the skewness
  • Kurtosis - returns the kurtosis
  • StandardScore - transforms a vector into a standard (ie, z-) score
  • EffectSizeControl - returns an effect size if a control condition is present
  • EffectSize - returns an effect size if no control is present
  • FiveNumber - Tukey's five number summary (minimum, lower quartile, median, upper quartile, maximum)
  • OutliersSQR - returns two arrays, one of outliers defined by 1.5 * IQR, and the other without these outliers


And others such as


  • Cronbach's Alpha
  • Kruskal-Wallis
  • Multiple regression
  • Mann-Whitney U
  • Wilcoxon
  • ANOVA