Sunday, June 3, 2012

Negative results for word-sense disambiguation using latent semantic analysis

In a previous post, I mentioned that Roistr's semantic relevance engine had some potential for word-sense disambiguation (WSD). For those who don't know, WSD is the task of working out which sense of a word is intended when the word has more than one: 'dog', for example, can refer to a domesticated canine (Canis lupus familiaris) or to the act of chasing someone (among others). Being able to do WSD effectively is crucial for machines to get a handle on the context of language. I see it as one of the fundamental challenges to overcome in NLP.

My earlier post mentioned how Roistr (which uses LSA) identified the correct word sense in a single case, which made me curious to know more. I examined this in more detail and got negative results, which are just as important to report as positive ones.

For the experiment, I took the SemCor annotated corpus (which is based on the Brown corpus but has words tagged with their word sense) and extracted some sentences containing words that have 5 different senses in WordNet. Each entire sentence was then compared to each sense's definition and example (plain text describing the word in context), and the results were compared against the actual word sense. To do this comparison, I used Roistr's demonstration best matches feature and put the source sentence in a left-hand box and the word sense definitions and examples in right-hand boxes (one box for each word sense). The selected sense was the one whose definition and example text had the highest similarity to the source sentence.
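For anyone who wants to try the same procedure, here is a minimal sketch of the ranking step. It is not Roistr's code: it swaps the LSA model for a plain bag-of-words cosine similarity so that it runs with nothing but the standard library, and the function names, the sentence and the sense texts are all made up for illustration (they are not taken from SemCor or WordNet).

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_senses(sentence, sense_texts):
    """Return sense labels ordered from most to least similar to the sentence."""
    source = bag_of_words(sentence)
    scores = {
        label: cosine(source, bag_of_words(text))
        for label, text in sense_texts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical example data (assumption: invented for this sketch).
sentence = "The dog barked at the postman and wagged its tail."
sense_texts = {
    "dog_animal": "a domesticated animal that barks; the dog wagged its tail",
    "dog_chase": "to follow or pursue someone closely; reporters dogged the minister",
}
ranking = rank_senses(sentence, sense_texts)
```

The first label in `ranking` is the selected sense. With a real LSA model, only `bag_of_words` and `cosine` would need replacing with a projection into the latent space; the ranking logic stays the same.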

The correct sense was not selected for any of the words. The median rank of the correct sense was 3 out of 5, which is fairly poor!
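To put those figures in context, it helps to compare them with what random ordering of the senses would give. The sketch below shows one way to score a set of rankings and to work out that chance baseline; the function names are my own for illustration, and no experimental data is included.

```python
from fractions import Fraction
from statistics import median


def rank_of_correct(ranking, correct_label):
    """1-based position of the correct sense in a ranked list of labels."""
    return ranking.index(correct_label) + 1


def summarise(ranks):
    """Top-1 accuracy and median rank over a list of 1-based ranks."""
    top1 = Fraction(sum(1 for r in ranks if r == 1), len(ranks))
    return top1, median(ranks)


def chance_baseline(n_senses):
    """Top-1 accuracy and median rank if the senses were ordered at random."""
    ranks = list(range(1, n_senses + 1))  # every rank is equally likely
    return Fraction(1, n_senses), median(ranks)


chance_top1, chance_median = chance_baseline(5)
```

For words with 5 senses, random ordering picks the correct sense one time in five and gives a median rank of 3, so a median rank of 3 is no better than chance.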

The problems might come from the model (though it seems to work well in other tasks), from this method of WSD (which might just not be a good enough solution), or from confounds in the definition and example texts, which might contain words that don't provide the best context. I feel confident that all three are responsible to different degrees, with the second the strongest, though I don't have evidence to back this up yet.

Still, it's good to know, and I hope this might be of use to someone else.