Monday, January 9, 2012

Wordnet Examples and confusion with word-sense disambiguation

Wordnet is a truly amazing piece of software and we use it a lot at Roistr for semantic relevance. One useful part of it is that each synset has examples of the word sense in use. But sometimes, these can mislead a little...

I'm currently looking into improving word-sense disambiguation - this is when you're faced with a word that has more than one possible meaning. If I said "tear", do I mean "tear apart; rip" or "tear from your eye; teardrop"? Identifying the intended sense is word-sense disambiguation.

One of the methods I'm investigating is using our semantic relevance engine for disambiguation. As a preliminary test, I tried to disambiguate the sentence, "the quick brown fox jumps over the lazy dog", focusing on the word "dog". For a human, it's easy - the sentence refers to "dog" the animal, Wordnet's first synset for "dog" ('dog.n.01'). But this can be hard for a computer to do. Wordnet produced 8 synsets for the word "dog", which is enough of a problem space to begin developing my mental model of this process.

So what I tried was extracting the 'examples' from Wordnet. Most synsets have example text showing the word in use. Going to Python and typing in the following shows the examples for "dog".

>>> from nltk.corpus import wordnet as wn
>>> ss = wn.synsets("dog")
>>> print ss
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> # All of the synsets for the word "dog"
>>> for s in ss:
...     print s.examples
['the dog barked all night']
['she got a reputation as a frump', "she's a real dog"]
['you lucky dog']
['you dirty dog']
[]
[]
['the andirons were too hot to touch']
['The policeman chased the mugger down the alley', 'the dog chased the rabbit']



I made two of the example texts into single sentences: the second synset's and the last one's (each had more than one example string). Then I compared each example against the original text, "the quick brown fox jumps over the lazy dog". The theory is that the example text of the most appropriate word sense will have the highest similarity to the original text.
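Here's a minimal sketch of that comparison loop. Since the Roistr engine itself isn't public, this uses TF-IDF cosine similarity from scikit-learn as a stand-in for our semantic relevance measure, so the scores it prints won't match the ones below (Python 2, as in the session above).

from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

target = "the quick brown fox jumps over the lazy dog"

# Join each synset's example strings into a single sentence,
# skipping the synsets that have no examples
examples = [" ".join(s.examples) for s in wn.synsets("dog") if s.examples]

# Vectorise the target and the examples together so they share a vocabulary
vectors = TfidfVectorizer().fit_transform([target] + examples)
scores = cosine_similarity(vectors[0], vectors[1:])[0]

for score, text in zip(scores, examples):
    print score, text

Running the comparisons with the actual engine gave the following: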

0.577778545308  the dog barked all night
0.572310228632  she got a reputation as a frump "shes a real dog"
0.576110795307  you lucky dog
0.572804146515  you dirty dog
0.0652719366684 the andirons were too hot to touch
0.64144233489   The policeman chased the mugger down the alley, the dog chased the rabbit

Nuts. The final word sense was a verb and produced the highest similarity (0.64). It was also incorrect. This was a shame because the first synset's example text was the second highest (0.577778) and was the correct one. So near and yet so far...

But then something struck me - the final synset's word sense describes "dog" as "chase". The word "chase" appears in the example text - but so does the word "dog" from dog.n.01!

So in this particular instance, the example text was confounded because it contained two word senses of "dog". No wonder the similarity was so high!

I changed the example text of the final "chase" synset to just "The policeman chased the mugger down the alley" and ran the analysis again. This time, the results were:

0.577778545308  the dog barked all night
0.572310228632  she got a reputation as a frump "shes a real dog"
0.576110795307  you lucky dog
0.572804146515  you dirty dog
0.0652719366684 the andirons were too hot to touch
0.175722447184  The policeman chased the mugger down the alley

This makes a lot more sense. I was hoping for a greater distinction between the top-rated word sense and the runner-up, but the rank ordering is correct. I think there is potential in this method, but I also have two observations:

  1. Granularity - for fine-grained word-sense disambiguation, even highly educated humans often won't agree on the correct sense
  2. From what I've seen, a lot of word-sense disambiguation forces a single decision. This is rarely how humans work - we will often wait, or search for more contextual information, before fully committing to a particular word sense. Effective writing should (in theory) make the word sense clear; poor writing leaves more room for confusion. So I want to work on a method that represents word senses as probabilities rather than forcing the earliest possible decision (see the sketch below). Tricky indeed, but if it's more accurate then it's probably worth the extra computation - Moore's Law and all that.
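As a very rough sketch of that second idea: instead of taking the best-scoring sense, normalise the similarity scores into a probability distribution over senses. The scores below are the ones from the corrected run; simple normalisation is just one obvious choice.

# Represent word senses as probabilities instead of a single decision,
# by normalising the per-sense similarity scores from the run above
scores = {
    "dog.n.01": 0.577779,
    "frump.n.01": 0.572310,
    "dog.n.03": 0.576111,
    "cad.n.01": 0.572804,
    "andiron.n.01": 0.065272,
    "chase.v.01": 0.175722,
}

total = sum(scores.values())
probabilities = dict((sense, s / total) for sense, s in scores.items())

for sense in sorted(probabilities, key=probabilities.get, reverse=True):
    print sense, round(probabilities[sense], 3)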


Friday, January 6, 2012

See us on AngelList

In case anyone's curious, we can be contacted on AngelList (http://angel.co/roistr-1). We're not really looking for investment, but it's a useful site for releasing company updates.

Sentiment analysis with Latent Semantic Analysis


Part of Roistr's work was generating an effective model for latent semantic analysis (LSA). If you don't know about LSA, it is a bag-of-words method of analysing text: you create the standard term-by-document matrix, weight it with the term frequency / inverse document frequency (TF-IDF) scheme, and then decompose it (most commonly with singular value decomposition) into 3 component matrices. The decomposition is truncated to a reduced number of dimensions (generally somewhere between 200 and 400) and the matrices are multiplied back together. The resulting matrix can be used to provide surprisingly human-like judgements on text.
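For the curious, here is a minimal sketch of that pipeline in scikit-learn. The toy corpus and the 2 retained dimensions are placeholders (a real model is trained on a large corpus with 200-400 dimensions), and this is an illustration rather than the Roistr implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus - a real LSA model needs many thousands of documents
corpus = [
    "the dog barked all night",
    "the cat sat on the mat",
    "stocks fell sharply in early trading",
    "the market rallied after the announcement",
]

# Document-term matrix weighted by TF-IDF
X = TfidfVectorizer().fit_transform(corpus)

# Truncated SVD projects documents into the reduced semantic space
svd = TruncatedSVD(n_components=2)  # 200-400 in a real model
doc_vectors = svd.fit_transform(X)

print doc_vectors.shape  # (4 documents, 2 dimensions)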

I did a few experiments recently on whether LSA could be used for sentiment analysis. The research question was: given a set of movie reviews rated by experts as positive or negative, could I use LSA to find out which was which?

My first idea was to use a standard model and compare each document against 2 others: one filled with positive sentiment words, the other with negative sentiment words. The concept was to see whether a document was more proximate to the positive-sentiment document or the negative-sentiment document.
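A minimal sketch of that first method, again with TF-IDF cosine similarity standing in for the LSA model, and with illustrative word lists (the real seed documents were far longer):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative seed documents - the real word lists were far longer
positive_doc = "good great excellent wonderful superb enjoyable brilliant"
negative_doc = "bad terrible awful dreadful boring tedious abysmal"

def classify(review):
    # Whichever seed document the review is more proximate to wins
    vectors = TfidfVectorizer().fit_transform(
        [review, positive_doc, negative_doc])
    pos = cosine_similarity(vectors[0], vectors[1])[0, 0]
    neg = cosine_similarity(vectors[0], vectors[2])[0, 0]
    return "positive" if pos > neg else "negative"

print classify("a wonderful and enjoyable film")
print classify("tedious, boring and frankly dreadful")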

The results showed no significant difference: the model had no ability to discriminate sentiment.

The second method created 2 LSA models: one built from a dictionary of purely positive words, the other from a dictionary of purely negative words. I then used these models to analyse text documents, comparing each sentiment model's analysis against the standard model's. The theory was that a strong expression of sentiment should show a greater departure from the standard model than a weaker expression.

This also showed no significant differences, which implies no ability to discriminate positive from negative sentiment.

So no luck so far but it's important to discuss the things that don't work as well as those that do.

There are weaknesses with this work that might have confounded the conclusion:


  • It only analysed entire documents. This may be acceptable for many tweets, which are single topic / single expression, but not for more complex documents (e.g., product reviews)
  • The lists of sentiment words may have been incomplete
  • The standard model may not be suitable for this purpose - it is an excellent component to understand semantic meaning (topic identification, semantic extraction and so on) but sentiment analysis is a different kettle of fish. 


However, I feel confident that the two approaches I've discussed are not worth pursuing in more detail on their own. They might still have a role to play in combination with other methods, which is another avenue we're currently exploring.


Wednesday, January 4, 2012

Identifying base units of meaning


Within NLP, at least from my own perspective, there is a methodological challenge in identifying propositions (or rather, what I consider them to be). These are the base units of meaning, and they have applications in text analytics, sentiment analysis and other such work. I get the impression that the distinction between utterances, sentences and propositions has become perhaps too academic (though it has a sound basis). I want to focus on propositions because the NLP community tends to use the sentence as the base unit of complex meaning, and I doubt that choice: a single sentence can contain several different statements or declarations.

For example, working at the proposition level makes sentiment analysis much more powerful. Consider the sentence, "The Godfather is fair but the trailers were totally abysmal." If you want to find out people's responses to the movie itself, conventional sentiment analysis will record this as negative, because the entire sentence is used as a unit and "totally abysmal" is more negative than "fair" is positive - even though the former has nothing to do with the movie itself, but rather with the trailers shown before it. Topic identification should play a role here, but in practice mistakes get made.

Splitting the sentence into more basic "statements" makes the sentiment analysis more accurate because it can tell that there are really two statements: "The Godfather is fair" and "the trailers were totally abysmal". Good sentiment analysis (i.e., analysis not based on the entire document) will see that the first statement is about the topic we care about and its sentiment is "fair", and that the second statement is largely irrelevant, so any sentiment it expresses can be ignored.

But identifying the component statements is a hard technical problem, so much so that many efforts just use the base sentence rather than tackle it.

I had some thoughts last night about statement identification while re-reading Kintsch's Comprehension book, particularly the toy example of John driving and then crashing his car (4 sentences in 2 paragraphs). The pattern of word-to-text similarities seems to indicate something: the number of statements appears to equal the number of words whose similarity to the whole text stands out from the rest. What counts as standing out? There's a boundary somewhere - say, within 0.10 of the cosine similarity of the most relevant word. Everything below that boundary would be words that merely support a statement.

So if the words of a sentence are compared against the entire sentence and one word stands out with the highest similarity (say the highest is 0.56 and the next is 0.32), then the sentence is likely to contain a single statement - it really is saying just one thing.

If, however, the highest scoring word is not much different to the next scoring word (say the highest is 0.56 and the next is 0.52) then there are likely to be two statements: one with the 0.56 word as its operative word and the other with 0.52 as its operative word. In other words, the entire sentence is saying two things and could easily be broken into two sentences.
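Here's a minimal sketch of the counting rule itself, applied to hypothetical word-to-sentence cosine scores (the numbers are invented for illustration, and the 0.10 boundary is the guess from above):

# Guess the number of statements in a sentence from word-to-sentence
# cosine similarities: every word within GAP of the top score is
# treated as the operative word of its own statement
GAP = 0.10

def operative_words(word_scores):
    """word_scores maps each word to its similarity with the sentence."""
    top = max(word_scores.values())
    return [w for w, s in word_scores.items() if top - s <= GAP]

# One word stands out, so this looks like a single statement
print operative_words({"john": 0.56, "car": 0.32, "road": 0.21})

# Two words sit close together at the top: likely two statements
print operative_words({"john": 0.56, "car": 0.52, "tree": 0.30})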

I doubt very much that this is a truism, and it won't work in many cases; but I wondered if it could be the start of a guiding principle? Perhaps even a method of splitting a complex sentence into its constituent parts (say, where there is also conjoining content like "and" or ";").

This is definitely worth some research...

Tuesday, January 3, 2012

From Bag-of-Words to single words

Within natural language processing, there are a number of techniques labelled "bag-of-words" techniques. They essentially treat each document as a single "bag" of words, taking no account of word order or the proximity of other words. Examples include latent semantic analysis (LSA) and latent Dirichlet allocation (LDA).

This seems to limit the utility of such approaches. When performing sentiment analysis, they can be misleading: the overall sentiment of a document can be gleaned, but it may be useless. If a document says, "My car is rubbish. I'm going to sell it to a great person and I'll get an amazing deal if I'm lucky", most whole-document approaches would classify it as positive, even though the core element ("My car is rubbish") is negative. This is compounded when analysing longer documents: even with just a few paragraphs, an overall sentiment score may be entirely meaningless if the document contains different themes.

So whole-document bag-of-words approaches can be misleading for text analysis.

Some work by Walter Kintsch [1] offers clues as to a way out. In his seminal book, "Comprehension", he outlined a way in which bag-of-words approaches can be used cleverly to focus on levels below the document level (chapter level, paragraph level, sentence level, propositional level, and even word level). Such methods can be used for many purposes including word sense disambiguation and topic identification.

The key is to compare proximity in the same way as you would compare semantic similarity between two documents, but in this case to compare an element of each level against the whole document. So if you wanted to know which word was the single 'operative' word in a document, you would break the document into words and compare each word against the entire document; the word with the highest similarity is the operative word. Likewise, you can identify the operative paragraph, chapter or proposition within a document, or even the operative word in a proposition, sentence, paragraph or chapter, and so on.

Kintsch used a worked example:

John was driving his new car on a lonely country road. The air was warm and full of the smell of spring flowers.
He hit a hole in the road and a spring broke. John lost control and the car hit a tree.

Here we have 1 document with 2 paragraphs, each containing 2 sentences.

In Kintsch's example, he compared (using cosine measures) words and sentences (to find the operative word in each sentence), sentences and paragraphs, and paragraphs and the whole text.

In the first sentence, the word 'John' followed by 'tree' showed the highest cosines; in the second, 'air'; the third, 'car'; the fourth, 'John' followed closely by 'tree'.

For sentences compared to paragraphs, the final sentence had the highest cosine with its paragraph. When the two paragraphs were compared against the entire document, both were similar. Kintsch reported that this can also be a good method for automatically evaluating summaries. It may also be used to check the smooth flow of topics throughout a text.
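Here's a minimal sketch of the word-by-sentence comparison on that toy text. It builds a tiny LSA-style space with scikit-learn purely for illustration; with a corpus this small the cosines won't reproduce Kintsch's figures, which came from a space trained on a large background corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "John was driving his new car on a lonely country road",
    "The air was warm and full of the smell of spring flowers",
    "He hit a hole in the road and a spring broke",
    "John lost control and the car hit a tree",
]

# Build a toy semantic space from the four sentences
tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components=2)
svd.fit(tfidf.fit_transform(sentences))

def embed(text):
    return svd.transform(tfidf.transform([text]))

# The operative word is the word most similar to its whole sentence
for sentence in sentences:
    scores = [(cosine_similarity(embed(word), embed(sentence))[0, 0], word)
              for word in set(sentence.lower().split())]
    print sentence, "->", max(scores)[1]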

This makes bag-of-words techniques incredibly useful and applicable in domains beyond pure document-based analysis.

[1] Kintsch, W. (2007). Comprehension: A Paradigm for Cognition. Cambridge University Press.

Monday, January 2, 2012

"From Sentiment Analysis to Enterprise Applications"

I was quite pleased to see a new article by Seth Grimes entitled, "From Sentiment Analysis to Enterprise Applications" in which Seth discusses the uses of sentiment analysis in the enterprise market.

Of most interest to me was how sentiment analysis offers insights into marketing that are responsive (more-or-less immediate, which is part of what we're doing at Roistr) and how such information helps to segment data. Our speciality is unstructured text, which is traditionally quite hard to analyse: simplistic methods, like superficial text analysis, don't offer a high degree of insight, so methods to harness these data have to be quite advanced to work well across a range of situations.

The article is here: From Sentiment Analysis to Enterprise Applications

Goodbye Forum, Hello Blogspot!

We've switched off our forum and are using this place instead for discussion and comments. It's early days for the company and I was so pleased to see hundreds of shiny new members on our BBPress installation. This all turned to disappointment as the user names and emails made it clear that we had a plethora of spammers and nothing else.

Feel free to comment here. We're very happy to receive guest posts from anyone interested in Roistr, natural language processing, social commerce, social media or artificial intelligence / machine learning.