Tuesday, June 19, 2012

Useful Corpora

When researching natural language processing techniques to build our semantic relevance engine, we needed a large corpus of English language text to train on.

Finding these materials wasn't the easiest task, so this post discusses the different sources I used, along with issues I ran into when preparing them for analysis. I hope it's of use to people out there, and I plan to add to it as I go along.


For each corpus below, I've noted what it is, how big it is, and any issues preparing it.

Wikipedia
What it is: A grand database / encyclopaedia of user-generated content.
How big it is: Enormous. There are downloads of builds around 6GB in size, and more recent dumps are much larger; the current dump is 7.8GB compressed as of writing. I found the 2010 Westbury text-only and cleaned download via BitTorrent (1.7GB compressed).
Issues preparing it: The HTML is hard to parse even with competent libraries like BeautifulSoup. HTML static dumps no longer work (they only run up to 2008), but a current database dump exists.

Westbury Usenet Corpus
What it is: A series of Usenet posts, 2005-2010. I'm not sure if they're cleaned / prepared / have spam removed.
How big it is: Organised by year; the largest year is 7.7GB, the smallest 2.1GB.
Issues preparing it: As far as I remember, articles were clearly divided and the data were well cleaned.

Ohsumed (search for Ohsumed if the link doesn't work)
What it is: Summaries of articles from medical journals. Fields include title, abstract, MeSH indexing terms, author(s), source and publication type.
How big it is: There are 348,566 summaries, coming to about 400MB.
Issues preparing it: This collection is good for testing specialist-topic information retrieval.

Project Gutenberg
What it is: A large collection of books (mostly novels) in many languages, with a concentration of English. This is a great collection for general (if sometimes old-fashioned) use of English without being specific to a particular topic.
Issues preparing it: First, filter out the non-English texts. Copyright notices and any preambles (including author names) are not standardised and need to be removed.

Brown Corpus
What it is: One of the original stalwart corpora, this offers 500 articles in 15 categories. It's available in NLTK.
Issues preparing it: It's tagged with part-of-speech tags, and another version (the SemCor version) is tagged with WordNet word senses.

20 Newsgroups
What it is: A limited and anonymised collection of Usenet messages taken from 20 groups.
How big it is: There are about 20,000 messages; the collection is about 20MB.
Issues preparing it: The messages arrive as plain text. Depending upon what you want to test, you might need to control for common header fields (e.g. 'From', 'Subject'). I find this corpus very good for real-world spontaneous categorisation tests because the data are quite noisy.

Parliament

London Gazette Data

Reuters 21578
What it is: Although superseded by NIST's newer collection, this older Reuters corpus is still good.
How big it is: 8.2MB compressed, 28.0MB uncompressed, for 12,902 documents in 90 classes. It's available (tagged) in NLTK.
Issues preparing it: Just a personal observation: I've found that these documents are often difficult to categorise spontaneously with any accuracy.

IMDB plot summaries (along with other information such as goofs and trivia for each movie)
What it is: Plot summaries for over 60,000 movies, often with more than one summary per movie.
How big it is: The plot summaries file is 68.3MB compressed.
Issues preparing it: The files are plain text. Movies are divided by a line of dashes; the title (with the date in brackets) is preceded by "MV: " on the same line; the summary text is preceded by "PL: "; and the author's user name (sometimes an email) is preceded by "BY: ". You need to be ready to cope with Unicode in the movie names and with more than one summary per movie. Check the terms and conditions to ensure that what you want to do with the data is okay. A rough parsing sketch appears after this table.

Bible Corpus
What it is: This is available in 56 languages. In terms of modern corpora it's quite small, but it's good for learning.
How big it is: 5.2MB as uncompressed XML.
Issues preparing it: It needs to be parsed as XML; I'd recommend BeautifulSoup.
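
For the IMDB plot summaries file, here is the kind of parser I have in mind: a minimal sketch assuming the "MV: " / "PL: " / "BY: " layout described in the table, with records separated by a line of dashes. The field handling and the latin-1 encoding are assumptions on my part, so check them against the file you actually download.

```python
# Rough sketch of parsing the IMDB plot summaries list, assuming the layout
# described above: records separated by a line of dashes, with "MV: " (title),
# "PL: " (summary text, possibly over several lines) and "BY: " (author) prefixes.
# The encoding and field layout are assumptions -- verify against the real file.
import codecs

def parse_plot_list(path):
    movies = []                                  # list of dicts: title, plots
    current = {"title": None, "plots": []}
    plot_lines = []

    def flush_plot():
        # Close off the summary accumulated so far, if any.
        if plot_lines:
            current["plots"].append(" ".join(plot_lines))
            plot_lines[:] = []

    # Movie titles contain non-ASCII characters, so read with an explicit encoding.
    with codecs.open(path, encoding="latin-1") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("MV: "):          # new movie record
                flush_plot()
                if current["title"]:
                    movies.append(current)
                current = {"title": line[4:], "plots": []}
            elif line.startswith("PL: "):        # summary text
                plot_lines.append(line[4:])
            elif line.startswith("BY: "):        # author line ends one summary
                flush_plot()
            elif line and set(line) == {"-"}:    # dashed separator line
                flush_plot()
    flush_plot()
    if current["title"]:
        movies.append(current)
    return movies
```

Used like `parse_plot_list("plot.list")`, this returns one entry per movie with a list of summaries, which copes with the more-than-one-summary-per-movie case mentioned above.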

Sunday, June 3, 2012

Negative results for word-sense disambiguation using latent semantic analysis

In a previous post, I mentioned that Roistr's semantic relevance engine had some potential for word-sense disambiguation (WSD). For those who don't know, WSD is needed when a word has more than one sense: 'dog', say, can refer to a canine mammal (Canis lupus familiaris) or to a chase (among other senses). Being able to do WSD effectively is crucial for machines to get a handle on the context of language, and I see it as one of the fundamental challenges to overcome in NLP.

My earlier post mentioned how Roistr (which uses LSA) identified the correct word sense in a single case, which made me curious to know more. I examined this in more detail and got negative results, which are just as important to report as positive ones.

For the experiment, I took the SemCor annotated corpus (which is based on the Brown corpus but has its words tagged with word senses) and extracted some sentences containing words that have five different senses in WordNet. Each entire sentence was then compared to each sense's definition and example (plain text describing the word in context), and the results were compared against the actual word sense. To do this comparison, I used Roistr's demonstration best-matches feature, putting the source sentence in the left-hand box and the word-sense definitions and examples in the right-hand boxes (one box for each word sense). The selected sense was the one whose definition-and-example text had the highest similarity to the source sentence.
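
To make the method concrete, here is a minimal sketch of the same idea using NLTK's WordNet interface for the sense definitions and examples. The semantic_similarity() function is a placeholder for Roistr's engine (or any sentence-similarity model you have to hand); it is not Roistr's actual API, and the example call at the bottom is hypothetical.

```python
# Sketch of disambiguation-by-similarity: score each WordNet sense's
# definition + examples against the sentence and pick the best-scoring sense.
from nltk.corpus import wordnet as wn

def semantic_similarity(text_a, text_b):
    """Placeholder: return a similarity score between two texts.
    Swap in your own model (LSA, embeddings, etc.)."""
    raise NotImplementedError

def disambiguate(sentence, target_word):
    """Pick the WordNet sense whose definition and examples are most
    similar to the sentence containing the target word."""
    best_sense, best_score = None, float("-inf")
    for synset in wn.synsets(target_word):
        # Plain-text description of this sense, as used in the experiment.
        gloss = synset.definition() + " " + " ".join(synset.examples())
        score = semantic_similarity(sentence, gloss)
        if score > best_score:
            best_sense, best_score = synset, score
    return best_sense

# Hypothetical usage (needs a real semantic_similarity implementation):
# sense = disambiguate("The hunter's dog chased the fox across the field.", "dog")
# print(sense, sense.definition())
```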

For none of the words was the correct sense selected. The median rank of the correct sense was 3 out of 5, which is fairly poor!

The problems might come from the model (though it seems to work well in other tasks), from this method of WSD (which might simply not be a good enough approach), or from confounds in the definition and example texts, which might contain words that don't provide the best context. I suspect all three are responsible to different degrees, with the second the strongest contributor, though I don't have evidence to back this up yet.

Still, it's good to know, and I hope this might be of use to someone else.