Friday, August 10, 2018

Parsing HTML with Go to get a specific table's data (using goQuery)

Abstract


How to use Go to download a web page and extract the data from one specific table when the page contains several. The same approach is useful whenever you want the tags within a particular node rather than across the entire page. The full code is here: https://gist.github.com/salmoni/27aee5bb0d26536391aabe7f13a72494

Salstat into Go


We're rewriting our statistics program, Salstat, in Go. Why Go? Well, it was an opportunity to learn the language (I like concurrency, which is great for the kind of large-scale analysis we do at Roistr). It's also fast, has the wx toolkit available, and makes it much easier than Python to create packages for people to download and install.

So far, it's been a blast and really good fun. I've found Go to be almost as terse as Python, which was a surprise. It's possibly a little less readable, but not significantly so. I like the speed of compilation and execution and, as I said above, the ease of packaging really makes a difference.

One problem we faced was with the scraping feature. This allows users to enter the URL of a webpage that has a nice table of data. Salstat downloads the page, parses it, and gives you a preview so you can select which table to import data from. That's all fine, but it was hard to do in Go because all the documentation we read covered only simple cases, like collecting all the links in a page.

With Salstat, we want a table's headings and cells, but kept separate for each table rather than pooled across all the tables on the page. There was little info to help us do this. With a bit of experimentation, we found the approach described here. It's almost obvious once you see it, but until then it's not so obvious.

Our use case


We have some HTML code that contains two tables, each with data. We want all and only the data for the second table. Returned data will be headings and cells.

Which library


We used the goQuery library, which brings jQuery-style selection to Go. It's not fully jQuery compliant because Go doesn't run inside a browser, but it gave us access to each table in a page and then, for each table, its rows, headings and cells.

Let's import the library, along with the standard packages used by the code that follows:

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

Then set up variables to hold a list of headings and a 2-dimensional list of data. This is done within a function:

func goGet() {
	var headings, row []string
	var rows [][]string

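Next, we load the page's HTML and abort if there is an error. In this snippet the HTML is already held in a string called data, which is wrapped in a reader and parsed:

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
	if err != nil {
		fmt.Println("Could not parse the HTML")
		log.Fatal(err)
	}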

The next bit is nice. Using the goQuery library, we find each table within the document:

	doc.Find("table").Each(func(index int, tablehtml *goquery.Selection) {

This iterates through each table and allows us to perform an action. It gives us the table's index and a selection which points to the HTML code within just that specific table.

		if index == 1 { // 0 being the first table, 1 being the second

We can also find rows within the table with a similar line:

			tablehtml.Find("tr").Each(func(indextr int, rowhtml *goquery.Selection) {

Cool. Now we've identified each table within a page, and each row within each table. Next, we want to find the table headings (with the "th" tag) and store them in the 'headings' variable:

				rowhtml.Find("th").Each(func(indexth int, tableheading *goquery.Selection) {
					headings = append(headings, tableheading.Text())
				})

These lines find each table heading in the row and append it to the 'headings' variable. Note that I wanted just the text of the heading rather than the HTML.

Next, we get each cell ('td' tag) for each row, which creates a list of that row's cells. Once we've got all the cells for a row, the list of cells is appended to the 'rows' variable and the single 'row' variable is reset. Note that a row containing only headings has no 'td' cells, so it adds an empty entry to 'rows':

				rowhtml.Find("td").Each(func(indextd int, tablecell *goquery.Selection) {
					row = append(row, tablecell.Text())
				})
				rows = append(rows, row)
				row = nil
			})
		}
	})

Note the closing brackets: they end the row callback, the 'if' block and the table callback, in that order.

We now have all the data for the table we want and can print them out to check:

	fmt.Println("####### headings = ", len(headings), headings)
	fmt.Println("####### rows = ", len(rows), rows)
}

Again, note the closing bracket, which ends the function.