Tuesday, June 19, 2012

Useful Corpora

When researching natural language processing techniques to build our semantic relevance engine, we needed a large corpus of English language text to train on.

Finding these materials wasn't easy, so this post lists the sources I used along with the issues I ran into when preparing them for analysis. I hope it's of use to people out there, and I plan to add to it as I go along.

| Name & URL | What it is | How big it is | Issues preparing it |
|---|---|---|---|
| Wikipedia | A grand database / encyclopaedia of user-generated content | Enormous. There are downloads of builds around 6 GB in size and more recent dumps are much larger. | The HTML is hard to parse even with competent libraries like BeautifulSoup. The current dump is 7.8 GB compressed as of writing. I found the 2010 Westbury text-only, cleaned download via BitTorrent (1.7 GB compressed). HTML static dumps no longer work (up to 2008 only) but a current database dump exists. |
| Westbury Usenet Corpus | A series of Usenet posts, 2005-2010. I'm not sure if they're cleaned / prepared / have spam removed. | Organised by year; the largest is 7.7 GB, the smallest is 2.1 GB. | As far as I remember, articles were clearly divided and the data were well-cleaned. |
| Ohsumed (search for Ohsumed if the link doesn't work) | Summaries of articles from medical journals. Fields include: title, abstract, MeSH indexing terms, author(s), source and publication type. | There are 348,566 summaries coming to 400 MB in size. | This collection is good for testing specialist-topic information retrieval. |
| Project Gutenberg | A large collection of books (mostly novels) in many languages, with a concentration of English. This is a great collection for the general (if sometimes old-fashioned) use of English without being specific to a particular topic. | | First, filter out non-English novels. Copyright notices and any preambles (including author names) are not standard and need to be removed. |
| Brown Corpus | One of the original stalwart corpora, this offers 500 articles in 15 categories. | | It's available in NLTK. It's tagged with part-of-speech tags, and another version (the SemCor version) is tagged with WordNet word senses. |
| 20 Newsgroups | This is a limited and anonymised collection of Usenet messages taken from 20 groups. | There are about 20,000 messages. The collection is about 20 MB. | They arrive as plain text. Depending upon what you want to test, you might need to control for common header symbols (e.g., 'From', 'Subject'). I find this corpus very good for real-world spontaneous categorisation tests because the data are quite noisy. |
| London Gazette Data | | | |
| Reuters 21578 | Although superseded by NIST's, this older Reuters corpus is still good. | 8.2 MB compressed, 28.0 MB uncompressed for 12,902 documents in 90 classes. | It's available (tagged) in NLTK. Just a personal observation: I've found that these documents are often difficult to spontaneously categorise accurately. |
| IMDB | Plot summaries along with other information such as goofs and trivia for each movie. | Movie reviews of over 60,000 movies, often with more than one review per movie. The plot summaries file is 68.3 MB compressed. | Results are in plain text. Movies are divided by a line of dashes; the title (and date in brackets) is preceded by "MV: " on the same line, review text is preceded by "PL: " on the same line, and the author's user name (sometimes an email) is preceded by "BY: ". You need to be ready to cope with Unicode in the movie names and with more than one review per movie. Check the terms and conditions to ensure that what you want to do with the data is okay. |
| Bible Corpus | This is available in 56 languages. In terms of modern corpora, it's quite small but is good for learning. | 5.2 MB as uncompressed XML | Needs to be parsed as XML. I'd recommend BeautifulSoup. |
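For the Project Gutenberg entry above, here is a minimal sketch of the two preparation steps (language filtering and preamble removal). It rests on assumptions that are not guaranteed for every file: that the header carries a `Language:` line, and that the body sits between lines starting `*** START OF` and `*** END OF`. As noted, the preambles are not standard, so treat this as a first pass and inspect what it leaves behind. The function names are my own.

```python
import re

# Assumed markers; many files carry them, but not all do.
START = re.compile(r"^\*\*\*\s*START OF.*$", re.MULTILINE)
END = re.compile(r"^\*\*\*\s*END OF.*$", re.MULTILINE)
LANG = re.compile(r"^Language:\s*(.+)$", re.MULTILINE)


def is_english(text):
    """True if the (assumed) 'Language:' header line says English."""
    match = LANG.search(text)
    return bool(match) and match.group(1).strip().lower() == "english"


def strip_boilerplate(text):
    """Keep only the text between the start and end markers.

    If a marker is missing, fall back to the start or end of the file.
    """
    start = START.search(text)
    begin = start.end() if start else 0
    end = END.search(text, begin)
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


# A made-up miniature file, for illustration only.
SAMPLE = (
    "Title: Example\n"
    "Author: A. Writer\n"
    "Language: English\n"
    "\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\n"
    "It was a dark night.\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence text here.\n"
)

if is_english(SAMPLE):
    print(strip_boilerplate(SAMPLE))
```

Files without the markers pass through unchanged (apart from whitespace trimming), so it is worth logging those and checking them by hand.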
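For 20 Newsgroups, controlling for header symbols such as 'From' and 'Subject' can be sketched as below. I'm assuming the usual Usenet layout (header lines, a blank line, then the body); `split_message` and its `keep` parameter are hypothetical names, not part of any library.

```python
def split_message(raw, keep=("Subject",)):
    """Split one message into (kept headers, body text).

    Assumes headers and body are separated by the first blank line.
    Only header names listed in `keep` are returned; the rest are dropped.
    """
    head, _, body = raw.partition("\n\n")
    kept = {}
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        if sep and name in keep:
            kept[name] = value.strip()
    return kept, body.strip()


# A made-up message, for illustration only.
RAW = (
    "From: someone@example.com\n"
    "Subject: Re: orbit question\n"
    "Lines: 2\n"
    "\n"
    "First line of body.\n"
    "Second line.\n"
)

headers, body = split_message(RAW)
print(headers)
print(body)
```

Whether to keep the subject line depends on the test: it often names the topic outright, which can make categorisation look easier than it is.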
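The IMDB layout described above (dashed divider, then "MV: ", "PL: " and "BY: " lines) lends itself to a small line-by-line parser. This is a sketch under that description only; `parse_plot_summaries` is a hypothetical name, and the sample lines are made up. When reading the real file, open it with an explicit encoding that you have checked against the download, since the titles contain non-ASCII characters.

```python
def parse_plot_summaries(lines):
    """Group summaries by movie title.

    Returns {title: [(author, text), ...]}, so a movie with several
    summaries keeps all of them.
    """
    summaries = {}
    title, text = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("MV: "):
            title = line[4:].strip()
            summaries.setdefault(title, [])
            text = []
        elif line.startswith("PL: "):
            text.append(line[4:].strip())
        elif line.startswith("BY: ") and title is not None:
            # "BY: " closes one summary; a movie may have more to follow.
            summaries[title].append((line[4:].strip(), " ".join(text)))
            text = []
        elif line and set(line) == {"-"}:
            # A line made only of dashes separates movies.
            title, text = None, []
    return summaries


# Made-up sample in the layout described above.
SAMPLE_LINES = [
    "-----------------------------\n",
    "MV: Amélie (2001)\n",
    "\n",
    "PL: A shy waitress decides\n",
    "PL: to help those around her.\n",
    "\n",
    "BY: user_one\n",
    "\n",
    "PL: Second summary.\n",
    "\n",
    "BY: someone@example.com\n",
    "-----------------------------\n",
    "MV: Other (1999)\n",
    "PL: Text.\n",
    "BY: x\n",
]

for movie, entries in parse_plot_summaries(SAMPLE_LINES).items():
    print(movie, len(entries))
```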
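For the Bible Corpus, a rough sketch of pulling verses out of the XML follows. Two caveats: I use the standard library's `xml.etree.ElementTree` here instead of BeautifulSoup so the example has no third-party dependency, and the element layout (`seg` elements with `type="verse"` and an `id` attribute) is an assumption about the file that you should check against your own copy. The sample document is made up.

```python
import xml.etree.ElementTree as ET


def verses(xml_text):
    """Yield (verse id, verse text) pairs from the assumed layout."""
    root = ET.fromstring(xml_text)
    for seg in root.iter("seg"):
        if seg.get("type") == "verse":
            # Collapse line breaks and indentation inside the element.
            text = " ".join("".join(seg.itertext()).split())
            yield seg.get("id"), text


# Made-up miniature document using the assumed element names.
SAMPLE_XML = """<cesDoc><text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">
  In the beginning God created the heaven and the earth.
</seg>
<seg id="b.GEN.1.2" type="verse">And the earth was without form, and void.</seg>
</div></div>
</body></text></cesDoc>"""

for verse_id, text in verses(SAMPLE_XML):
    print(verse_id, text)
```

At 5.2 MB the whole file fits comfortably in memory, so parsing it in one go like this should be fine.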