Abstract
How to use Go to download a web page and parse data from multiple tables. This is also useful for extracting information from tags within a single node rather than across the entire page. The full code is here: https://gist.github.com/salmoni/27aee5bb0d26536391aabe7f13a72494
Salstat into Go
We're re-writing our statistics program, Salstat, in Go. Why Go? Well, it was an opportunity to learn the language (I like concurrency – it's great for the kind of large-scale analysis we do at Roistr), it's fast, the wx toolkit is available for it, and it's far easier to create packages for people to download and install than it is with Python.
So far, it's been a blast and really good fun. I've personally found Go to be almost as terse as Python, which was a surprise. Possibly a little less readable, but not significantly so. I like the speed of compilation and execution and, as I said above, the ease of packaging really makes a difference.
One problem we faced was with the scraping feature. This lets users enter the URL of a web page that has a nice table of data; Salstat downloads the page, parses it, and gives you a preview to select which table to import data from. That's all fine, but Go was hard to handle here because all the documentation we read covered only simple cases, like extracting every link in a page.
With Salstat, we want to download a table's headings and cells, but only for that particular table, not across all tables on the page. There was little information to help us do this, but a bit of experimentation turned up a solution. It's almost obvious once you see it – but until then, it's not so obvious.
Our use case
We have some HTML that contains two tables, each with data. We want all, and only, the data for the second table. The returned data will be the table's headings and cells.
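To make this concrete, here's a minimal sample of the kind of HTML we mean, held in a Go string variable called data. This snippet is our own illustration, not taken from a real page:

data := `
<html><body>
<table>
    <tr><th>Ignore this</th></tr>
    <tr><td>Not wanted</td></tr>
</table>
<table>
    <tr><th>Name</th><th>Score</th></tr>
    <tr><td>Alice</td><td>10</td></tr>
    <tr><td>Bob</td><td>8</td></tr>
</table>
</body></html>`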
Which library
We used the goquery library, which brings a jQuery-like API to Go. It's not fully jQuery-compliant because Go doesn't run inside a browser, but it gave us access to each table in a page and then to each row, each table heading and each table cell within that table.
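If you haven't got it already, it installs in the usual way:

go get github.com/PuerkitoBio/goquery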
Let's import the library:
import "github.com/PuerkitoBio/goquery"
Then set up variables to hold a list of headings and a 2-dimensional list of data. This is done within a function:
func goGet() {
    var headings, row []string // the table's headings and the current row's cells
    var rows [][]string        // every row of the chosen table
Next, we parse the HTML (held in the data string from earlier) into a goquery document and abort if there is an error:
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
    if err != nil {
        fmt.Println("Could not parse the HTML")
        log.Fatal(err)
    }
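In Salstat itself the page comes from a URL rather than a string. A minimal sketch of that variant uses the standard net/http package (add "net/http" to the import block; url here is a placeholder for whatever address the user entered):

    res, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }

Either way, we end up with a doc we can query. Now we find every table in the document and act only on the one we want: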
doc.Find("table").Each(func(index int, tablehtml *goquery.Selection) {
if (index) == 1 { // 0 being the first table, 1 being the second
            tablehtml.Find("tr").Each(func(indextr int, rowhtml *goquery.Selection) {
                rowhtml.Find("th").Each(func(indexth int, tableheading *goquery.Selection) {
                    headings = append(headings, tableheading.Text())
                })
These lines find each table heading (th tag) and append its text to the headings variable. Note that I wanted just the text of the heading rather than the HTML, hence .Text().
Next, we get each cell (td tag) in the row, which builds up a list of the row's cells. Once we've got all the cells for a row, the list is appended to the rows variable and the single-row variable is reset:
rowhtml.Find("td").Each(func(indexth int, tablecell *goquery.Selection) {
row = append(row, tablecell.Text())
})
rows = append(rows, row)
row = nil
})
}
})
Note the closing brackets, which end the row loop, the if block and the table loop in turn.
We now have all the data for the table we want and can print it out to check:
fmt.Println("####### headings = ", len(headings), headings)
fmt.Println("####### rows = ", len(rows), rows)
}
Again, note the closing bracket, which ends the function.
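With the sample HTML from earlier, the check prints something like:

####### headings =  2 [Name Score]
####### rows =  3 [[] [Alice 10] [Bob 8]]

Notice the empty first row: rows = append(rows, row) runs for every tr, including the heading row, which has th cells but no td cells. If that matters for your data, a small guard (our addition, not part of the original gist) skips empty rows:

                if len(row) > 0 {
                    rows = append(rows, row)
                }
                row = nil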