Tuesday, January 3, 2012

From Bag-of-Words to single words

Within natural language processing, a number of techniques are labelled "bag-of-words" techniques. They essentially treat a document as a single "bag" of words, taking no account of word order or of which words appear near each other. Examples include latent semantic analysis (LSA) and latent Dirichlet allocation (LDA).

This seems to limit the utility of such approaches. When performing sentiment analysis, they can be misleading: although the overall sentiment of a document can be gleaned, it may be useless. If a document says this: "My car is rubbish. I'm going to sell it to a great person and I'll get an amazing deal if I'm lucky", most document-level approaches would classify it as positive. Yet the core element ("My car is rubbish") is negative. The problem is compounded when analysing longer documents. Even for those of just a few paragraphs, an overall sentiment may be an entirely meaningless measure if the document contains different themes.
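The car example can be made concrete with a minimal sketch. It assumes a tiny hand-made word list (`POSITIVE` and `NEGATIVE` are hypothetical, invented purely for illustration, not a real sentiment lexicon) and a simple positive-minus-negative count, which is only one of many ways a bag-of-words sentiment score could be computed.

```python
import re

# Hypothetical toy lexicon, for illustration only; not a real sentiment resource.
POSITIVE = {"great", "amazing", "lucky"}
NEGATIVE = {"rubbish"}

def tokens(text):
    """Lower-case the text and split it into words, discarding order-related structure."""
    return re.findall(r"[a-z']+", text.lower())

def bag_score(text):
    """Positive word count minus negative word count; word order is ignored."""
    words = tokens(text)
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

document = ("My car is rubbish. I'm going to sell it to a great person "
            "and I'll get an amazing deal if I'm lucky")

# Naive sentence split on full stops, good enough for this example.
sentences = [s.strip() for s in document.split(".") if s.strip()]

document_score = bag_score(document)                 # whole document as one bag
sentence_scores = [bag_score(s) for s in sentences]  # one bag per sentence

print(document_score, sentence_scores)
```

Scored as one bag, the document comes out positive, while scoring each sentence separately shows the negative opening sentence that the overall figure hides.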

So bag-of-words approaches can seem misleading when applied to whole documents.

Some work by Walter Kintsch [1] offers clues as to a way out. In his seminal book, "Comprehension", he outlined a way in which bag-of-words approaches can be used cleverly to focus on levels below the document level (chapter level, paragraph level, sentence level, propositional level, and even word level). Such methods can be used for many purposes including word sense disambiguation and topic identification.

The key is to compare proximity in the same way as you would compare the semantic similarity of two documents, but in this case you compare an element at each level against the whole document. So if you wanted to know which word was the single 'operative' word in a document, you would break the document down into words and compare each against the entire document. The word with the highest similarity is the operative word. Likewise, you can identify the operative paragraph, chapter, or proposition within a document, or even the operative word in a proposition, sentence, paragraph or chapter, and so on.
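As a rough sketch of this idea, the code below compares each unit against the whole using cosine similarity over raw word counts. This is a simplification: raw counts stand in for the LSA vectors that a real semantic space would provide, and the function names (`bag`, `cosine`, `operative`) and the sample text are invented for illustration.

```python
import math
import re
from collections import Counter

def bag(text):
    """Bag-of-words vector: word -> count, ignoring order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def operative(units, whole):
    """Return the unit (word, sentence, paragraph...) most similar to the whole."""
    whole_bag = bag(whole)
    return max(units, key=lambda u: cosine(bag(u), whole_bag))

# Hypothetical sample text, split naively into sentences on full stops.
doc = "the cat sat on the mat. the dog barked."
sentences = [s.strip() for s in doc.split(".") if s.strip()]

print(operative(sentences, doc))
```

One caveat with this simplification: with raw counts, a single word's cosine against the document is just its frequency divided by the document's vector length, so the "operative word" would simply be the most frequent word. It is the semantic space (such as LSA), where each word has its own dense vector, that makes the word-level comparison informative.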

Kintsch used a worked example:

John was driving his new car on a lonely country road. The air was warm and full of the smell of spring flowers.
He hit a hole in the road and a spring broke. John lost control and the car hit a tree.
Here we have one document with two paragraphs, each containing two sentences.

In Kintsch's example, he compared (using cosine measures) words and sentences (to find the operative word in each sentence), sentences and paragraphs, and paragraphs and the whole text.

In the first sentence, the word 'John' followed by 'tree' showed the highest cosines; in the second, 'air'; the third, 'car'; the fourth, 'John' followed closely by 'tree'.

For sentences compared to paragraphs, the final sentence had the highest cosine with its paragraph. When the two paragraphs were compared against the entire document, both were similar. Kintsch reported that this can also be a good method to automatically evaluate summaries. It may also be used to ensure the smooth flow of subjects throughout a text.

This makes bag-of-words techniques incredibly useful and applicable in domains beyond pure document-based analysis.

[1] Comprehension (2007) W. Kintsch.  Cambridge University Press.