Sunday, November 18, 2012

Non-English language models

Part of Thought Into Design's work was to create some language models to understand the similarity of web pages that link to each other. This is in response to Google's recent updates where irrelevant links appear to be penalised for SEO purposes.

The English model was our first focus because that's what the largest share of customers are using.

And it took ages to do. As always with statistical things, the time-consuming part was preparing the data to a good enough level. When trying to create statistical models, it's easy to jump in and tokenise / stem / lemmatise / whatever the corpus and go ahead but even a cursory inspection of your dictionary will show nonsense values that should not be there. The critical part is to reduce this noise to a level so low that the good stuff shines more brightly.

In all, it took a couple of months of work to prepare the corpus properly along with some mind-bending custom-written tokenising code and many late nights.

And there was another requirement. The process had to be repeatable. The client will need to a) repeat this every so often for their model to stay up to date and b) build models in other languages. A large part of the time was sunk into creating scripts that could do all this in a repeatable process for any language.

But now we have it. The English-language models were created in just a few days of intensive analysis and they work, and work well, using an amalgam of different transforms with some proprietary magic.

I wanted to see how well the process adapted to new languages so I tried to create a model in Norwegian. Why Norwegian? Well, I like Norway lots, particularly the outbreak of peace in response to Breivik's murders, and the corpus was very small compared to English.

The whole process took less than 24 hours which includes compiling the corpus. In theory, I can do this for most European languages.

Actually, this is a lie. I need to check the veracity of each dictionary and even a superficial examination requires a good knowledge of the language. I can make estimates but there's nothing like having a handy Norwegian to glance over my shoulder.

But that aside, I can create large working statistical language models in 14 languages (based on the evidence so far!). Further blood, sweat and tears are needed for Asian languages but that's in hand.

Tuesday, June 19, 2012

Useful Corpora

When researching natural language processing techniques to build our semantic relevance engine, we needed a large corpus of English language text to train on.

Finding these materials wasn't the easiest task and this post discusses different sources I used along with issues when preparing them for analysis. I hope it's of use to people out there. I hope to add to it as I go along.

Wikipedia
  What it is: A grand database / encyclopaedia of user-generated content.
  How big it is: Enormous. There are downloads of builds around 6GB in size and more recent dumps are much larger.
  Issues preparing it: The HTML is hard to parse even with competent libraries like BeautifulSoup.
  Notes: The current dump is 7.8GB compressed as of writing. I found the 2010 Westbury text-only, cleaned download via BitTorrent (1.7 GB compressed). HTML static dumps no longer work (up to 2008 only) but a current database dump exists.

Westbury Usenet Corpus
  What it is: A series of Usenet posts, 2005-2010. I'm not sure if they're cleaned / prepared / have spam removed.
  How big it is: Organised by year; the largest is 7.7GB, the smallest is 2.1GB.
  Issues preparing it: As far as I remember, articles were clearly divided and the data were well-cleaned.

Ohsumed (search for Ohsumed if the link doesn't work)
  What it is: Summaries of articles from medical journals. Fields include: title, abstract, MeSH indexing terms, author(s), source and publication type.
  How big it is: There are 348,566 summaries coming to 400 MB in size.
  Issues preparing it: This collection is good for testing specialist topic information retrieval.

Project Gutenberg
  What it is: A large collection of books (mostly novels) in many languages with a concentration of English. This is a great collection for the general (if sometimes old-fashioned) use of English without being specific to a particular topic.
  Issues preparing it: First, filter out non-English novels. Copyright notices and any preambles (including author names) are not standard and need to be removed.

Brown Corpus
  What it is: One of the original stalwart corpora, this offers 500 articles in 15 categories.
  Issues preparing it: It's available in NLTK. It's tagged with part-of-speech tags and another version (the SemCor version) is tagged with Wordnet word senses.

20 Newsgroups
  What it is: A limited and anonymised collection of Usenet messages taken from 20 groups.
  How big it is: There are about 20,000 messages. The collection is about 20MB.
  Issues preparing it: They arrive as plain text. Depending upon what you want to test, you might need to control for common symbols (e.g., 'From', 'Subject'). I find this corpus very good for real-world spontaneous categorisation tests because the data are quite noisy.

London Gazette Data

Reuters 21578
  What it is: Although superseded by NIST's, this older Reuters corpus is still good.
  How big it is: 8.2 MB compressed, 28.0 MB uncompressed for 12,902 documents in 90 classes.
  Issues preparing it: It's available (tagged) in NLTK. Just a personal observation: I've found that these documents are often difficult to spontaneously categorise accurately.

IMDB plot summaries (along with other information such as goofs and trivia for each movie)
  What it is: Movie reviews of over 60,000 movies, often with more than one review per movie.
  How big it is: The plot summaries file is 68.3 MB compressed.
  Issues preparing it: Results are in plain text. Movies are divided by a line of dashes, title (and date in brackets) is preceded by "MV: " on the same line, review text is preceded by "PL: " on the same line and the author's user name (sometimes an email) is preceded by "BY: ". You need to be ready to cope with Unicode in the movie names and for more than 1 review per movie. Check the terms and conditions to ensure that what you want to do with the data is okay.

Bible Corpus
  What it is: This is available in 56 languages. In terms of modern corpora, it's quite small but is good for learning.
  How big it is: 5.2 MB as uncompressed XML.
  Issues preparing it: Needs to be parsed as XML. I'd recommend BeautifulSoup.
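
As an illustration of the IMDB layout described above, here is a minimal sketch of a parser that groups plot texts by movie. It assumes only the "MV: ", "PL: " and "BY: " prefixes mentioned in this post; the function name and the sample movies are made up for the example, and the real file may have header lines or other quirks that this does not handle.

```python
from collections import defaultdict


def parse_plot_summaries(lines):
    """Group plot texts by movie title from an iterable of text lines."""
    plots = defaultdict(list)  # title -> list of (author, text) pairs
    title, text = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("MV: "):
            # A new movie starts; any unfinished text is dropped.
            title, text = line[4:].strip(), []
        elif line.startswith("PL: "):
            text.append(line[4:].strip())
        elif line.startswith("BY: ") and title is not None:
            # A "BY: " line closes one summary; a movie may have several.
            plots[title].append((line[4:].strip(), " ".join(text)))
            text = []
        # Anything else (dashed dividers, blank lines) is ignored.
    return dict(plots)


# Invented sample data in the layout described in the post.
sample = """\
-------------------------------------------------------------------------------
MV: Example Movie (2001)

PL: A first line of plot text
PL: that continues on a second line.

BY: someone@example.com

PL: A second summary of the same movie.

BY: Another User
-------------------------------------------------------------------------------
MV: Émigré Film (1999)

PL: Another plot.

BY: anon
""".splitlines()

movies = parse_plot_summaries(sample)
for movie_title, summaries in movies.items():
    print(movie_title, len(summaries))
```

When reading the real file, open it with an explicit `encoding=` argument; which encoding is right is something to check against the download itself.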

Sunday, June 3, 2012

Negative results for word-sense disambiguation using latent semantic analysis

In a previous post, I mentioned that Roistr's semantic relevance engine had some potential for word-sense disambiguation (WSD). For those who don't know, WSD is the task of working out which sense is meant when a word has more than one: say 'dog' can refer to a canine mammal (Canis lupus familiaris) or to a chase (among others). Being able to do WSD effectively is crucial for machines to get a handle on the context of language. I see it as one of the fundamental challenges to overcome in NLP.

My earlier post mentioned how Roistr (which uses LSA) identified the correct word sense for a single case which made me curious to know more. I examined this in more detail and got negative results which are just as important to report as positive ones.

For the experiment, I took the SemCor annotated corpus (which is based on the Brown corpus but has words tagged with word-sense) and extracted some sentences which contained words with 5 different senses in WordNet. The entire sentences were then compared to each sense's definition and example (plain text describing the word in context) and results compared against the actual word sense. To do this comparison, I used Roistr's demonstration best matches feature and put the source sentence in a left-hand box and the word sense definitions and examples in a right hand box (one box for each word sense). The selection was that definition and example text with the highest similarity to the source sentence.
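
The selection step can be sketched roughly as below. This is not Roistr's engine: as a stand-in for LSA similarity it uses a plain cosine over raw word counts, and the sense labels and gloss texts are invented for the example rather than taken from WordNet or SemCor.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_senses(sentence, sense_texts):
    """Return sense labels ordered from most to least similar to the sentence."""
    target = bag_of_words(sentence)
    scores = {
        label: cosine(target, bag_of_words(text))
        for label, text in sense_texts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def rank_of_correct(ranking, correct_label):
    """1-based position of the annotated sense in the ranking."""
    return ranking.index(correct_label) + 1


# Hypothetical definition-plus-example texts for two senses of "bank".
senses = {
    "bank.river": "sloping land beside a river or other body of water",
    "bank.money": "a financial institution that accepts deposits and lends money",
}
sentence = "He sat on the bank of the river and watched the water"

ranking = rank_senses(sentence, senses)
print(ranking, rank_of_correct(ranking, "bank.river"))
```

Collecting `rank_of_correct` over many sentences and taking the median gives the kind of summary figure reported next.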

Of all words, none of the correct senses were selected. The median rank was 3 out of 5 which is fairly poor!

Problems might come from the model (though it seems to work well in other tasks), this method of WSD (which might just not be a good enough solution), or confounds from the definition and example texts which might contain words that don't provide the best context. I feel confident that all are responsible to different degrees with the second point the strongest though I don't have evidence to back this up yet.

Still, it's good to know and I hope this might be of use to someone else.

Monday, January 9, 2012

Wordnet examples and confusion with word-sense disambiguation

Wordnet is a truly amazing piece of software and we use it a lot at Roistr for semantic relevance. One useful part of it is that each synset has examples of the word sense in use. But sometimes, these can mislead a little...

I'm currently looking into improving word-sense disambiguation - this is when you're faced with a word and it has more than one possible meaning. If I said, "tear", do I mean "tear apart; rip" or "tear from your eye; teardrop"? Identifying the word sense is word-sense disambiguation.

One of the methods I'm investigating is using our semantic relevance engine for disambiguation. As a preliminary test, I tried to disambiguate the sentence, "the quick brown fox jumps over the lazy dog" and I wanted to disambiguate the word "dog". For a human, it's easy - the sentence refers to "dog" the animal, Wordnet's first synset for "dog" ('dog.n.01'). But this can be hard for a computer to do. Wordnet produced 8 synsets for the word "dog" which is enough of a problem space to begin developing my mental model of this process.

So what I tried was extracting the 'examples' from Wordnet. Most synsets have example text showing the word in use. Going to Python and typing in the following shows the examples for "dog".

>>> from nltk.corpus import wordnet as wn
>>> ss = wn.synsets("dog")
>>> print(ss)
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> # All of the synsets for the word "dog"
>>> for s in ss:
...     # examples() is a method in NLTK 3; skip synsets with no example text
...     if s.examples():
...         print(s.examples())
...
['the dog barked all night']
['she got a reputation as a frump', "she's a real dog"]
['you lucky dog']
['you dirty dog']
['the andirons were too hot to touch']
['The policeman chased the mugger down the alley', 'the dog chased the rabbit']

I made two of the example texts into a single sentence: the second and the last. Then I compared each of these examples against the original text, "the quick brown fox jumps over the lazy dog". The theory is that the most appropriate word sense's example text will have the highest similarity to the original text. This is what I found:

0.577778545308  the dog barked all night
0.572310228632  she got a reputation as a frump "shes a real dog"
0.576110795307  you lucky dog
0.572804146515  you dirty dog
0.0652719366684 the andirons were too hot to touch
0.64144233489   The policeman chased the mugger down the alley, the dog chased the rabbit

Nuts. The final word-sense was a verb and produced the highest similarity (0.64). It was also incorrect. This was a shame because the first synset's example text was the second highest (0.577778) and was the correct one. So near and yet so far...

But then something struck me - the final synset's word sense tries to describe "dog" as "chase". The word, "chase" appears in the example text - but so does the word "dog" in the sense of 'dog.n.01'!

So in this particular instance, the example text was confounded because it included 2 word senses. No wonder the similarities were so high!

I changed the example text of the final "chase" synset to just "The policeman chased the mugger down the alley" and ran the analysis again. This time, the results were:

0.577778545308  the dog barked all night
0.572310228632  she got a reputation as a frump "shes a real dog"
0.576110795307  you lucky dog
0.572804146515  you dirty dog
0.0652719366684 the andirons were too hot to touch
0.175722447184  The policeman chased the mugger down the alley

This makes a lot more sense. I was hoping for a greater distinction between the top-rated word sense and the next but the rank ordering is correct. I think that there is potential in this method but I also have 2 ideas:

  1. Word sense - for fine grained word-sense disambiguations, even highly educated humans often won't agree
  2. From what I've seen, a lot of word-sense disambiguation results in a single decision. This is rarely the case for humans - we will often wait or search for more contextual information before fully committing to a particular word-sense. Effective writing should (in theory) make the word-sense clear; poor writing leaves more chance for confusion. This means that I want to work on a method of using probabilities to represent word-senses rather than forcing something into making the earliest possible decision. Tricky indeed, but if it's more accurate then it's probably worth looking at with Moore's Law and all that.
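
One very simple way to hold on to several candidate senses instead of a single decision is to rescale the similarity scores so they sum to 1. The sketch below does that with the second set of scores from this post. Plain normalisation is only an assumption made for illustration (it is not a calibrated probability model), and the function name is made up.

```python
def to_probabilities(similarities):
    """Scale non-negative similarity scores so that they sum to 1."""
    if any(score < 0 for score in similarities.values()):
        raise ValueError("this simple scheme assumes non-negative scores")
    total = sum(similarities.values())
    if total <= 0:
        raise ValueError("need at least one positive similarity")
    return {sense: score / total for sense, score in similarities.items()}


# Scores from the second run above, keyed by example text.
scores = {
    "the dog barked all night": 0.577778545308,
    'she got a reputation as a frump "shes a real dog"': 0.572310228632,
    "you lucky dog": 0.576110795307,
    "you dirty dog": 0.572804146515,
    "the andirons were too hot to touch": 0.0652719366684,
    "The policeman chased the mugger down the alley": 0.175722447184,
}

probs = to_probabilities(scores)
for text, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{p:.3f}  {text}")
```

Because the top four scores are so close, the resulting weights are close too, which is the point: more context would be needed before committing to one sense.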

Friday, January 6, 2012

See us on AngelList

In case anyone's curious, we can be contacted on AngelList. We're not really looking for investment but it's a useful site for releasing company updates.

Sentiment analysis with Latent Semantic Analysis

Part of Roistr's work was generating an effective model for latent semantic analysis (LSA). If you don't know about LSA, it is described as a bag-of-words method of analysing text: you create the standard word by document matrix, transform it with term frequency / inverse document frequency weighting, and then use a method (most commonly singular value decomposition) to transform the matrix into 3 component matrices. One of these matrices has its dimensionality reduced (generally to somewhere between 200 and 400 dimensions) and the matrices are multiplied back together. The resulting matrix can be used to provide surprisingly human-like judgements on text. More information is here.
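
In symbols, the steps above look roughly like this. The tf-idf formula shown is one common variant and is an assumption here (several weightings are in use); m is the number of terms, n the number of documents and k the number of dimensions kept.

```latex
% tf-idf weight of term i in document j
% (tf_{ij}: count of term i in document j; df_i: number of documents containing term i)
X_{ij} = \mathrm{tf}_{ij} \cdot \log\frac{n}{\mathrm{df}_i},
\qquad X \in \mathbb{R}^{m \times n}

% Singular value decomposition into three component matrices
X = U \Sigma V^{\top}

% Keep only the k largest singular values, then multiply back together
X_k = U_k \Sigma_k V_k^{\top},
\qquad 200 \le k \le 400
```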

I did a few experiments recently on whether LSA could be used for sentiment analysis. The research question was given a set of movie reviews rated by experts as positive and negative, could I use LSA to find out which was which?

My first idea was to use a standard model and compare documents against 2 others: one was filled with positive sentiment words; the other with negative sentiment words. The concept was to see whether a document was more proximate to the positive-sentiment words document or the negative-sentiment words document.

The results showed that despite an observed non-significant difference, there was no ability to discriminate sentiment.

The second method I used created 2 LSA models, one using a dictionary of purely positive words and another using a dictionary of purely negative words. I then used these models to analyse text documents and see which produced the better understanding by comparing analysis via both of the sentiment models against the analysis of the standard model. The theory is that a strong expression of sentiment should show a greater departure from the standard model than a weaker expression.

This showed no significant differences which implies no ability to discriminate according to positive or negative sentiment.

So no luck so far but it's important to discuss the things that don't work as well as those that do.

There are weaknesses with this work that might have confounded the conclusion:

  • It only analysed entire documents. This may be acceptable for many tweets which are single topic / single expression but not for more complex documents (e.g., product reviews)
  • The lists of sentiment words may have been incomplete
  • The standard model may not be suitable for this purpose - it is an excellent component to understand semantic meaning (topic identification, semantic extraction and so on) but sentiment analysis is a different kettle of fish. 

However, I feel confident that the two approaches I've discussed are not worth pursuing in more detail. They might have a great role to play in combination with other methods which is another avenue we're currently exploring.

Wednesday, January 4, 2012

Identifying base units of meaning

Within NLP and at least from my own perspective, there is a methodological challenge of identifying propositions (or rather, what I consider them to be). These are the base units of meaning and have applications in text analytics, sentiment analysis and other such stuff. I get the impression that the differences between utterances, sentences and propositions have become maybe too academic (though they have a basis). I need to be able to focus on propositions because the NLP community tends to use sentences as the base unit of complex meaning. I doubt this is right because sentences can contain several different statements or declarations.

For example, working with propositions makes sentiment analysis much more powerful because each statement can be analysed on its own. Consider the sentence, "The Godfather is fair but the trailers were totally abysmal." If you want to find out people's responses to the particular movie, then conventional sentiment analysis will record it as negative because the entire sentence is used as a unit and "totally abysmal" is more negative than "fair" is positive even though the former has nothing to do with the movie itself but rather with other movies in trailers. Topic identification should play a role here but it's likely that mistakes will be made.

Splitting the sentence into baser "statements" makes the sentiment analysis more accurate because it can tell that there are really two statements: "The Godfather is fair" and "the trailers were totally abysmal". Good sentiment analysis (i.e., that not based on the entire document) will see that the first statement is about the topic we want to know about and the sentiment is "fair", and also that the second statement is largely irrelevant so we can ignore any expressed sentiment.
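
To make the idea concrete, here is a deliberately naive sketch that splits a sentence at ";", "and" or "but" and keeps the statements that mention a topic. The function names are made up, and the splitting rule is an assumption that will go wrong on plenty of real text (e.g., "salt and pepper").

```python
import re


def split_statements(sentence):
    """Naively split a sentence at ';' or at the words 'and' / 'but'."""
    body = sentence.strip().rstrip(".")
    parts = re.split(r"\s*;\s*|,?\s+(?:but|and)\s+", body)
    return [part.strip() for part in parts if part.strip()]


def statements_about(sentence, topic):
    """Keep only the statements that mention the topic (case-insensitive)."""
    return [s for s in split_statements(sentence) if topic.lower() in s.lower()]


review = "The Godfather is fair but the trailers were totally abysmal."
print(split_statements(review))
print(statements_about(review, "Godfather"))
```

Only the statement about the movie would then be passed on for sentiment scoring.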

But identifying the component statements is a hard technical problem, so much so that many efforts just use the base sentence rather than tackle it.

I had some thoughts last night about statement identification while reading Kintsch's Comprehension book again, particularly the toy example of John driving and then crashing his car (4 sentences in 2 paragraphs). The similarity between the most similar words seems to indicate something: the number of statements is equal to the number of most similar words. What counts as most similar? Well, there's a boundary somewhere, but (say) any word whose cosine similarity to the document is within 0.10 of the highest-scoring word. Everything below would be words that support the statement.

So if the words of a sentence are compared against the entire sentence and one word stands out with the highest similarity (say the highest is 0.56 and the next is 0.32), then the sentence is likely to contain a single statement - that is, it really is a single sentence.

If, however, the highest scoring word is not much different to the next scoring word (say the highest is 0.56 and the next is 0.52) then there are likely to be two statements: one with the 0.56 word as its operative word and the other with 0.52 as its operative word. In other words, the entire sentence is saying two things and could easily be broken into two sentences.
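
The rule of thumb can be written down in a few lines. This sketch only does the counting: it assumes the word-to-sentence similarities have already been computed elsewhere, and the words attached to the scores below are invented (only the 0.56 / 0.32 / 0.52 figures come from the examples above).

```python
def operative_words(word_scores, margin=0.10):
    """Return the words whose similarity is within `margin` of the top score."""
    if not word_scores:
        return []
    top = max(word_scores.values())
    ranked = sorted(word_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, score in ranked if top - score <= margin]


# One clear winner: likely a single statement.
single = operative_words({"car": 0.56, "road": 0.32, "the": 0.05})

# Two close scores: likely two statements.
double = operative_words({"car": 0.56, "tree": 0.52, "the": 0.05})

print(len(single), single)
print(len(double), double)
```

The `margin` value is the boundary mentioned earlier; 0.10 is just the guess from this post, not a tested threshold.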

I doubt very much if this is a truism and it won't work in many cases; but I wondered if this could be the start of a guiding principle? Perhaps even a method of splitting a complex sentence into constituent parts (say if there is also conjoining content like "and" or ";").

This is definitely worth some research...

Tuesday, January 3, 2012

From Bag-of-Words to single words

Within natural language processing, there are a number of techniques labelled "bag-of-words" techniques. They essentially treat documents as a single "bag" of words with no account of word order or the proximity of other words. Examples include latent semantic analysis (LSA) and latent Dirichlet allocation (LDA).
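
A tiny sketch of what "no account of word order" means in practice: two sentences with the same words in a different order end up as the same bag. The tokenising rule here is a simplification chosen for the example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens, ignoring order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


a = bag_of_words("The dog chased the rabbit")
b = bag_of_words("The rabbit chased the dog")

print(a)
print(a == b)  # the two sentences are indistinguishable as bags
```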

This seems to limit the utility of such approaches. When performing sentiment analysis, they can be misleading: although the overall sentiment of a document can be gleaned, it may be useless. If a document says this: "My car is rubbish. I'm going to sell it to a great person and I'll get an amazing deal if I'm lucky", most overall approaches would classify this as positive. The core element ("My car is rubbish") is negative. This is compounded when analysing longer documents. Even with those of just a few paragraphs, an overall sentiment may be an entirely meaningless measure if a document contains different themes.

So the bag-of-words approaches seem misleading in managing text analysis.

Some work by Walter Kintsch [1] offers clues as to a way out. In his seminal book, "Comprehension", he outlined a way in which bag-of-words approaches can be used cleverly to focus on levels below the document level (chapter level, paragraph level, sentence level, propositional level, and even word level). Such methods can be used for many purposes including word sense disambiguation and topic identification.

The key is to compare proximity in the same way as you would compare semantic similarity between two documents but in this case, compare an element of each level against the whole document. So if you wanted to know which word was the single 'operative' word in a document, you would break down the document into words and compare each against the entire document. The word with the highest similarity is the operative word. Likewise, you can identify the operative paragraph, chapter, and proposition within a document; or even the operative word in a proposition, sentence, paragraph or chapter and so on.
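
A minimal sketch of that comparison follows. It assumes each unit (here, a word) already has a vector from some semantic space, and it takes the vector for the whole as the sum of its parts, which is a simplifying assumption. The 2-dimensional vectors are made up purely to show the mechanics and say nothing about what a real LSA model would return.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_units(units, vectors):
    """Compare each unit with the whole and return (unit, cosine), best first."""
    dims = len(next(iter(vectors.values())))
    # The whole is represented as the sum of its units' vectors.
    whole = [sum(vectors[u][d] for u in units) for d in range(dims)]
    scored = [(u, cosine(vectors[u], whole)) for u in units]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy vectors; a real model would have hundreds of dimensions.
toy_vectors = {
    "john": [0.9, 0.2],
    "car": [0.8, 0.3],
    "tree": [0.6, 0.5],
    "flowers": [0.1, 0.9],
}

ranked = rank_units(["john", "car", "tree", "flowers"], toy_vectors)
operative = ranked[0][0]
print(operative, ranked)
```

The same function works at any level: pass sentence vectors to find the operative sentence in a paragraph, or paragraph vectors to find the operative paragraph in a document.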

Kintsch used a worked example:

John was driving his new car on a lonely country road. The air was warm and full of the smell of spring flowers.
He hit a hole in the road and a spring broke. John lost control and the car hit a tree.

Here we have 1 document with 2 paragraphs. Each paragraph has 2 sentences.

In Kintsch's example, he compared (using cosine measures) words and sentences (to find the operative word in each sentence), sentences and paragraphs, and paragraphs and the whole text.

In the first sentence, the word 'John' followed by 'tree' showed the highest cosines; in the second, 'air'; the third, 'car'; the fourth, 'John' followed closely by 'tree'.

For sentences compared to paragraphs, the final sentence had the highest cosine with its paragraph. When the two paragraphs were compared against the entire document, both were similar. Kintsch reported that this can also be a good method to automatically evaluate summaries. It may also be used to ensure the smooth flow of subjects throughout a text.

This makes bag-of-words techniques incredibly useful and applicable in domains beyond pure document-based analysis.

[1] Comprehension (2007) W. Kintsch.  Cambridge University Press.

Monday, January 2, 2012

"From Sentiment Analysis to Enterprise Applications"

I was quite pleased to see a new article by Seth Grimes entitled, "From Sentiment Analysis to Enterprise Applications" in which Seth discusses the uses of sentiment analysis in the enterprise market.

Of most interest to me was how sentiment analysis offers insights into marketing that were responsive (more-or-less immediate; which is a part of what we're doing at Roistr) and how such information helps to segment data. Our speciality is unstructured text which is traditionally quite hard to analyse: simplistic methods, like superficial text analysis, don't offer a high degree of insight; which means that methods to harness these data have to be quite advanced to work well across a range of situations.

The article is here: From Sentiment Analysis to Enterprise Applications

Goodbye Forum, Hello Blogspot!

We've switched off our forum and are using this place instead for discussion and comments. It's early days for the company and I was so pleased to see hundreds of shiny new members on our BBPress installation. This all turned to disappointment as the user names and emails made it clear that we had a plethora of spammers and nothing else.

Feel free to comment here. We're very happy to receive guest posts from anyone interested in Roistr, natural language processing, social commerce, social media or artificial intelligence / machine learning.