When researching natural language processing techniques to build our semantic relevance engine, we needed a large corpus of English language text to train on.
Finding these materials wasn't the easiest of tasks, so this post runs through the different sources I used along with the issues I hit when preparing them for analysis. I hope it's of use to people out there, and I'll add to it as I go along. Short code sketches for the fiddlier preparation steps follow the table.
Name & URL | What it is | How big it is | Issues preparing it |
---|---|---|---|
Wikipedia | A grand database / encyclopaedia of user-generated content | Enormous. Older builds were around 6 GB, and more recent dumps are much larger: the current database dump is 7.8 GB compressed as of writing. The 2010 Westbury text-only, cleaned download (found via BitTorrent) is 1.7 GB compressed. | The HTML is hard to parse even with competent libraries like BeautifulSoup (see the first sketch after the table). The HTML static dumps no longer work (they cover up to 2008 only), but a current database dump exists. |
Westbury Usenet Corpus | A series of Usenet posts, 2005-2010. | Organised by year; the largest file is 7.7 GB, the smallest 2.1 GB. | As far as I remember, articles were clearly divided and the data were well cleaned, though I can't say for certain whether spam was removed. |
Ohsumed (search for Ohsumed if the link doesn't work) | Summaries of articles from medical journals. Fields include: title, abstract, MeSH indexing terms, author(s), source and publication type. | There are 348,566 summaries, coming to about 400 MB. | This collection is good for testing specialist-topic information retrieval. |
Project Gutenberg | A large collection of books (mostly novels) in many languages, with a concentration of English. It's a great collection for general (if sometimes old-fashioned) use of English without being specific to a particular topic. | | First, filter out the non-English books. Copyright notices and preambles (including author names) aren't standardised and need to be removed; a rough header-stripping sketch follows the table. |
Brown Corpus | One of the original stalwart corpora, this offers 500 articles in 15 categories. | Roughly a million words. It's available in NLTK, tagged with part-of-speech tags; another version (the SemCor version) is tagged with WordNet word senses (see the NLTK sketch after the table). | |
20 Newsgroups | This is a limited and anonymised collection of Usenet messages taken from 20 groups. | There are about 20,000 messages, and the collection is about 20 MB. | They arrive as plain text. Depending on what you want to test, you may need to strip the common header fields (e.g. 'From', 'Subject'); see the header-stripping sketch after the table. I find this corpus very good for real-world spontaneous-categorisation tests because the data are quite noisy. |
Parliament | | | |
London Gazette Data | | | |
Reuters 21578 | Although superseded by NIST's newer collection, this older Reuters corpus is still good. | 8.2 MB compressed, 28.0 MB uncompressed, for 12,902 documents in 90 classes. It's available (tagged) in NLTK; see the sketch after the table. | Just a personal observation: I've found that these documents are often difficult to spontaneously categorise accurately. |
IMDB plot summaries, along with other information such as goofs and trivia for each movie | Movie reviews for over 60,000 movies, often with more than one review per movie. | The plot summaries file is 68.3 MB compressed. | The data are plain text. Movies are divided by a line of dashes; the title (with the date in brackets) follows "MV: ", review text follows "PL: ", and the author's user name (sometimes an email address) follows "BY: ". Be ready to cope with Unicode in the movie names and with more than one review per movie; a parsing sketch follows the table. Check the terms and conditions to ensure that what you want to do with the data is okay. |
Bible Corpus | This is available in 56 languages. In terms of modern corpora it's quite small, but it's good for learning. | 5.2 MB as uncompressed XML | Needs XML parsing; I'd recommend BeautifulSoup (see the last sketch after the table). |
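
Below are the code sketches referenced in the table. First, Wikipedia: a minimal example of flattening one page of an HTML dump to plain text with BeautifulSoup. The file name is a placeholder, and real dump pages need far more cleanup (navigation boxes, infoboxes, reference lists) than this shows.

```python
from bs4 import BeautifulSoup

# "page.html" is a placeholder for one page taken from an HTML dump.
with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Drop script and style blocks before extracting text.
for tag in soup(["script", "style"]):
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text[:500])
```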
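
For Project Gutenberg, a rough sketch for trimming the preamble and the licence footer. As noted in the table, the boilerplate isn't standard: the "*** START OF ... ***" and "*** END OF ... ***" marker wording varies between files, so this matches loosely and falls back to the whole text when no marker is found.

```python
import re

def strip_gutenberg_boilerplate(text):
    # Marker wording varies between files, so match loosely.
    start = re.search(r"\*\*\* ?START OF.*?\*\*\*", text, re.IGNORECASE)
    end = re.search(r"\*\*\* ?END OF.*?\*\*\*", text, re.IGNORECASE)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()
```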
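
The Brown Corpus needs almost no preparation if you load it through NLTK:

```python
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)  # one-off download

print(brown.categories())                         # the 15 categories
print(brown.words(categories="news")[:10])        # plain tokens
print(brown.tagged_words(categories="news")[:5])  # (word, POS tag) pairs
```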
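
For 20 Newsgroups, the simplest way to control for the header fields is to drop everything up to the first blank line. This is crude (some posts quote headers inside the body, which this doesn't catch), but it covers the common case:

```python
def strip_headers(message):
    # Header fields ("From:", "Subject:", ...) end at the first blank line.
    header, sep, body = message.partition("\n\n")
    return body if sep else message
```

If you'd rather not handle the raw files yourself, scikit-learn's `fetch_20newsgroups` offers a similar `remove=("headers",)` option.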
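
Reuters 21578 also ships with NLTK. Note that a single document can carry several category labels, which I suspect is part of why spontaneous categorisation of this collection is hard:

```python
import nltk
from nltk.corpus import reuters

nltk.download("reuters", quiet=True)  # one-off download

fid = reuters.fileids()[0]
print(reuters.categories(fid))  # all the labels on this one document
print(reuters.raw(fid)[:200])   # its raw text
```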
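
For the IMDB plot summaries, a rough parser for the layout described in the table: records separated by a line of dashes, with "MV: ", "PL: " and "BY: " prefixes. The layout can drift between dump versions and the file isn't guaranteed to be UTF-8, so this reads latin-1 and skips anything it doesn't recognise:

```python
def parse_plot_list(path):
    movies, current, plot_lines = [], None, []
    with open(path, encoding="latin-1") as f:  # dump isn't guaranteed UTF-8
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("MV: "):
                # New movie: title (with date in brackets) after the prefix.
                current = {"title": line[4:], "reviews": []}
                movies.append(current)
                plot_lines = []
            elif current and line.startswith("PL: "):
                plot_lines.append(line[4:])
            elif current and line.startswith("BY: "):
                # "BY:" closes one review; a movie can have several.
                current["reviews"].append(
                    {"plot": " ".join(plot_lines), "author": line[4:]}
                )
                plot_lines = []
            # Dashed separator lines and anything else are skipped.
    return movies
```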
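
Finally, the Bible corpus XML with BeautifulSoup, as recommended in the table. Two caveats: the "xml" parser needs lxml installed, and the `seg` tag name here is an assumption on my part; check the element names in the file you actually download.

```python
from bs4 import BeautifulSoup

with open("English.xml", encoding="utf-8") as f:  # placeholder file name
    soup = BeautifulSoup(f, "xml")  # the "xml" parser requires lxml

# "seg" is assumed to be the per-verse element; adjust to the real schema.
verses = [seg.get_text(strip=True) for seg in soup.find_all("seg")]
print(len(verses), verses[:2])
```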