Wednesday, January 4, 2012

Identifying base units of meaning

Within NLP, at least from my own perspective, there is a methodological challenge in identifying propositions (or rather, what I take them to be). These are the base units of meaning and have applications in text analytics, sentiment analysis and similar areas. I get the impression that the differences between utterances, sentences and propositions have become perhaps too academic (though the distinction has a basis). I need to be able to focus on what I think of as propositions because the NLP community tends to use sentences as the base unit of complex meaning. I doubt this choice because a sentence can contain several different statements or declarations.

For example, working at the level of statements makes sentiment analysis much more powerful because each sentiment can be tied to the thing it is actually about. Consider the sentence, "The Godfather is fair but the trailers were totally abysmal." If you want to find out people's responses to that particular movie, then conventional sentiment analysis will record the sentence as negative, because the entire sentence is used as a unit and "totally abysmal" is more negative than "fair" is positive, even though the former has nothing to do with the movie itself but rather with the other movies shown in the trailers. Topic identification should play a role here, but mistakes are still likely.

Splitting the sentence into more basic "statements" makes the sentiment analysis more accurate because it can tell that there are really two statements: "The Godfather is fair" and "the trailers were totally abysmal". Good sentiment analysis (i.e., analysis not based on the entire document) will see that the first statement is about the topic we want to know about and that its sentiment is "fair", and also that the second statement is largely irrelevant, so we can ignore any sentiment it expresses.
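To make the contrast concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the two-word lexicon and its weights are made up, the statements are split by hand, and "topic identification" is reduced to a crude substring check. It is not how a real sentiment system works; it only shows why the unit of analysis matters.

```python
# Made-up sentiment weights, chosen so that "abysmal" outweighs "fair".
LEXICON = {"fair": 1, "abysmal": -3}


def score(text):
    """Sum the lexicon weights of the words in a piece of text."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(LEXICON.get(word, 0) for word in words)


def topic_sentiment(statements, topic):
    """Score only the statements that mention the topic (crude substring match)."""
    relevant = [s for s in statements if topic.lower() in s.lower()]
    return sum(score(s) for s in relevant)


sentence = "The Godfather is fair but the trailers were totally abysmal."
# Split by hand here; doing this automatically is the hard part.
statements = ["The Godfather is fair", "the trailers were totally abysmal"]

whole_sentence = score(sentence)                      # sentence as one unit
per_topic = topic_sentiment(statements, "Godfather")  # statement about the movie only

print(whole_sentence, per_topic)
```

With these invented weights, the whole sentence comes out negative while the statement about the movie alone comes out mildly positive.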

But identifying the component statements is a hard technical problem, so much so that many efforts just use the base sentence rather than tackle it.

I had some thoughts last night about statement identification while reading Kintsch's Comprehension book again, particularly the toy example of John driving and then crashing his car (4 sentences in 2 paragraphs). The similarity scores of the most similar words seem to indicate something: the number of statements is equal to the number of most similar words. What counts as most similar? Well, there's a boundary somewhere, but (say) within 0.10 of the highest score in a cosine similarity calculation between each word and the document. Every word scoring below that would be a word that supports a statement.

So if the words of a sentence are compared against the entire sentence and one word stands out with the highest similarity (say the highest is 0.56 and the next is 0.32), then the sentence is likely to contain a single statement - in effect, a single simple sentence.

If, however, the highest scoring word is not much different to the next scoring word (say the highest is 0.56 and the next is 0.52) then there are likely to be two statements: one with the 0.56 word as its operative word and the other with 0.52 as its operative word. In other words, the entire sentence is saying two things and could easily be broken into two sentences.
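As a rough sketch of this counting rule in Python: the function below takes word-to-sentence similarity scores as given (how they are computed, e.g. with LSA vectors, is left open) and returns the words within a margin of the top score. The function names, the 0.10 default margin and the example words are my own assumptions; only the 0.56 / 0.32 / 0.52 scores come from the text above.

```python
def operative_words(similarities, margin=0.10):
    """Return the words whose score is within `margin` of the highest score.

    `similarities` maps each word to its (precomputed) cosine similarity
    with the whole sentence. Words are returned highest score first.
    """
    if not similarities:
        return []
    top = max(similarities.values())
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [word for word, sim in ranked if top - sim <= margin]


def count_statements(similarities, margin=0.10):
    """Estimate the number of statements as the number of operative words."""
    return len(operative_words(similarities, margin))


# Hypothetical words attached to the scores used in the text.
one_statement = {"crashed": 0.56, "car": 0.32, "his": 0.05}
two_statements = {"fair": 0.56, "abysmal": 0.52, "the": 0.05}

print(count_statements(one_statement))   # one word stands out
print(operative_words(two_statements))   # two words are close together
```

The margin is the "boundary somewhere" and would need tuning against real data; nothing here says 0.10 is the right value.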

I doubt very much that this holds universally, and it won't work in many cases; but I wondered if this could be the start of a guiding principle. Perhaps even a method of splitting a complex sentence into its constituent parts (say, if there is also conjoining content like "and" or ";").
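A very naive version of that splitting step might look like the following Python sketch. It assumes the number of expected statements has already been estimated somehow, and it only splits on ";", "and" and "but" ("but" is my addition, to cover the Godfather example). It will wrongly split phrases like "salt and pepper", so treat it as a starting point rather than a method.

```python
import re

# Conjoining content to split on; the exact list is an assumption.
CONJOINERS = re.compile(r"\s*(?:;|\b(?:and|but)\b)\s*", re.IGNORECASE)


def split_statements(sentence, expected=2):
    """Split a sentence at conjoining content if more than one statement is expected."""
    if expected < 2:
        return [sentence.strip()]
    parts = (part.strip(" .") for part in CONJOINERS.split(sentence))
    return [part for part in parts if part]


sentence = "The Godfather is fair but the trailers were totally abysmal."
print(split_statements(sentence))
```

When only one statement is expected, the sentence is returned untouched, so the similarity-based estimate acts as a gate on whether to split at all.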

This is definitely worth some research...