Monday, January 9, 2012

Wordnet Examples and confusion with word-sense disambiguation

Wordnet is a truly amazing piece of software and we use it a lot at Roistr for semantic relevance. One useful feature is that most synsets come with example sentences showing the word sense in use. But sometimes, these can mislead a little...

I'm currently looking into improving word-sense disambiguation - this is when you're faced with a word that has more than one possible meaning. If I said "tear", do I mean "tear apart; rip" or "tear from your eye; teardrop"? Identifying the intended sense is word-sense disambiguation.

One of the methods I'm investigating is using our semantic relevance engine for disambiguation. As a preliminary test, I tried to disambiguate the sentence, "the quick brown fox jumps over the lazy dog", focusing on the word "dog". For a human, it's easy - the sentence refers to "dog" the animal, Wordnet's first synset for "dog" ('dog.n.01'). But this can be hard for a computer to do. Wordnet produced 8 synsets for the word "dog", which is enough of a problem space to begin developing my mental model of this process.

So what I tried was extracting the 'examples' from Wordnet. Most synsets have example text showing the word in use. Firing up Python and typing in the following shows the examples for "dog".

>>> from nltk.corpus import wordnet as wn
>>> ss = wn.synsets("dog")
>>> print ss
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> # All of the synsets for the word "dog"
>>> for s in ss:
...     print s.examples
['the dog barked all night']
['she got a reputation as a frump', "she's a real dog"]
['you lucky dog']
['you dirty dog']
[]
[]
['the andirons were too hot to touch']
['The policeman chased the mugger down the alley', 'the dog chased the rabbit']



Two of the synsets (the second and the last) have more than one example sentence, so I joined each of those into a single string. Then I compared each synset's example text against the original sentence, "the quick brown fox jumps over the lazy dog". The theory is that the most appropriate word sense's example text will have the highest similarity to the original text. This is what I found:

0.577778545308  the dog barked all night
0.572310228632  she got a reputation as a frump "shes a real dog"
0.576110795307  you lucky dog
0.572804146515  you dirty dog
0.0652719366684 the andirons were too hot to touch
0.64144233489   The policeman chased the mugger down the alley, the dog chased the rabbit

Nuts. The final word sense was a verb and produced the highest similarity (0.64). It was also incorrect. This was a shame because the first synset's example text had the second-highest similarity (0.577778) and was the correct one. So near and yet so far...
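
To make the procedure concrete, the comparison loop looks roughly like the following. This is only a sketch: similarity() here is a plain bag-of-words cosine that stands in for Roistr's semantic relevance engine (so the numbers it produces won't match the ones above), and s.examples / s.name are attributes in the NLTK version I'm using (newer NLTK makes them methods).

from collections import Counter
import math

from nltk.corpus import wordnet as wn

def similarity(a, b):
    # Stand-in for the semantic relevance engine: a simple
    # bag-of-words cosine similarity between two strings.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sentence = "the quick brown fox jumps over the lazy dog"

# Join each synset's example sentences into one string and score it
# against the target sentence; ideally the correct sense's examples
# will come out on top.
for s in wn.synsets("dog"):
    example_text = " ".join(s.examples)   # s.examples() in newer NLTK
    if example_text:
        print s.name, similarity(sentence, example_text)   # s.name() in newer NLTK

In the real setup, similarity() is the semantic relevance engine rather than this cosine stand-in; everything else stays the same.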

But then something struck me - the final synset's word sense describes "dog" as a verb meaning "chase". The word "chase" appears in its example text - but so does the word "dog" in its dog.n.01 sense!

So in this particular instance, the example text was confounded because it included two word senses. No wonder the similarity was so high!
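
In code terms, the obvious fix is to override that one synset's example text before re-scoring - again just a sketch, reusing the stand-in similarity() and the sentence from the earlier snippet:

examples = dict((s.name, " ".join(s.examples)) for s in wn.synsets("dog"))

# Drop the confounded "the dog chased the rabbit" sentence from the
# verb sense, leaving only its first example.
examples['chase.v.01'] = "The policeman chased the mugger down the alley"

for name, text in examples.items():
    if text:
        print name, similarity(sentence, text)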

I changed the example text of the final "chase" synset to just "The policeman chased the mugger down the alley" and ran the analysis again. This time, the results were:

0.577778545308  the dog barked all night
0.572310228632  she got a reputation as a frump "shes a real dog"
0.576110795307  you lucky dog
0.572804146515  you dirty dog
0.0652719366684 the andirons were too hot to touch
0.175722447184  The policeman chased the mugger down the alley

This makes a lot more sense. I was hoping for a greater distinction between the top-rated word sense and the next, but the rank ordering is correct. I think there is potential in this method, but I also have two further thoughts:

  1. Word-sense granularity - for fine-grained word-sense distinctions, even highly educated humans often won't agree on the correct sense
  2. From what I've seen, a lot of word-sense disambiguation forces a single, immediate decision. This is rarely the case for humans - we will often wait or search for more contextual information before fully committing to a particular word sense. Effective writing should (in theory) make the intended sense clear; poor writing leaves more chance for confusion. This means I want to work on a method that uses probabilities to represent the candidate word senses rather than forcing the earliest possible decision (see the sketch after this list). Tricky indeed, but if it's more accurate then it's probably worth the extra computation, Moore's Law and all that.
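
As a rough illustration of that second point, the per-sense similarities could be normalised into a probability distribution instead of collapsing straight to a single winner. This is only a sketch of the idea (using the rounded scores from the second run above), not something Roistr does yet:

def sense_distribution(scores):
    # Turn {sense_name: similarity} into a probability distribution so
    # downstream code can defer the decision or fold in later context.
    total = sum(scores.values())
    if total == 0:
        # No evidence either way: fall back to a uniform distribution.
        return dict((name, 1.0 / len(scores)) for name in scores)
    return dict((name, score / total) for name, score in scores.items())

# Similarities from the second run, keyed by synset name.
scores = {
    'dog.n.01': 0.5778, 'frump.n.01': 0.5723, 'dog.n.03': 0.5761,
    'cad.n.01': 0.5728, 'andiron.n.01': 0.0653, 'chase.v.01': 0.1757,
}

for name, p in sorted(sense_distribution(scores).items()):
    print name, round(p, 3)

The resulting distribution could then be sharpened or revised as more context arrives, which is closer to the "wait for more information" behaviour described above.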