What is the difference between lexicon-based and model-based techniques?
Sentiment analysis: the example of the movie “Star Wars”

Lexicon-based techniques struggle to understand the meaning behind combinations of terms, because they only detect the presence of words associated with positive or negative sentiment, not how those words interact with each other. For example, complicated negation, and irony in particular, make it very challenging for lexicon-based techniques to predict the sentiment correctly.
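To make this limitation concrete, here is a minimal sketch of a lexicon-based scorer. The word lists and the function name `lexicon_sentiment` are hypothetical and chosen only for illustration; they are not the lexicon used for the results in this post, and real lexicons are much larger and usually weight their terms.

```python
import re

# Hypothetical toy lexicon; real lexicons hold thousands of scored terms.
POSITIVE = {"love", "loved", "good", "great", "perfect"}
NEGATIVE = {"bad", "boring", "awful", "terrible"}


def lexicon_sentiment(text):
    """Label text by counting positive and negative lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("We loved it, a perfect match"))
print(lexicon_sentiment("This was not good"))
```

Both calls print "positive". The second one is wrong: the scorer sees the word "good" but has no notion that "not" reverses it, which is exactly the kind of interaction between words that a pure word count cannot see.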

Model-based approaches, on the other hand, can capture the meaning of more complicated expressions, as long as they are given a suitable training set whose labels correctly characterize the sentiment. We have to remember that even the humans who label the training data may find it challenging to judge the sentiment correctly.
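As a rough illustration of how a model can learn from word combinations, the sketch below trains a tiny multinomial Naive Bayes classifier on unigrams and bigrams. Everything in it is an assumption made for the example: this post does not say which model was used, and the class `NaiveBayes`, the helper `features` and the six training sentences in `TRAIN` are hypothetical.

```python
import math
import re
from collections import Counter


def features(text):
    """Return unigrams plus bigrams, so word combinations become features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.priors = Counter(labels)
        self.counts = {label: Counter() for label in self.priors}
        for text, label in zip(texts, labels):
            self.counts[label].update(features(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, text):
        # Features never seen during training carry no information here.
        known = [f for f in features(text) if f in self.vocab]

        def log_score(label):
            counts = self.counts[label]
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(self.priors[label] / sum(self.priors.values()))
            return score + sum(math.log((counts[f] + 1) / total) for f in known)

        return max(self.counts, key=log_score)


# Hypothetical, hand-labelled training examples (not the data from this post).
TRAIN = [
    ("so good, I loved it", "positive"),
    ("a really good movie", "positive"),
    ("good fun from start to finish", "positive"),
    ("it could have been so good", "negative"),
    ("this could have been better", "negative"),
    ("not good at all", "negative"),
]

model = NaiveBayes().fit(*zip(*TRAIN))
print(model.predict("It could have been good"))
print(model.predict("So good, loved it"))
```

With this toy training set, the first sentence is labelled negative even though it contains "good", because combinations such as "could have" and "have been" only occur in negative examples. The second sentence is labelled positive. The result depends entirely on the training data, which is why the suitability of that data matters so much.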

Let’s look at some example results. Both techniques easily capture the positive sentiment in tweets such as: “We LOVED seeing #StarWars #TheForceAwakens at @AMCTheatres”, or even in tweets that consist only of hashtags, such as: “#true #love #starwars #shestheone #perfect #match”. However, the model-based technique correctly identified the tweet “How have I still not seen #StarWars” as positive, whereas the lexicon-based technique classified it as negative. Similarly, the model-based technique correctly classified the tweet “It could have been so good! #starwars #MastersOfTheUniverse #trailer https://t.co/2DwRDzVb2w” as negative, whereas the lexicon-based technique classified it as positive (the phrase “so good” was probably responsible for this).

Why can the model-based technique overcome the challenges of the lexicon-based one? Because it is trained on data from which it can learn how combinations of words correlate with sentiment, provided, of course, that the training data are suitable for the task at hand. In our case this holds, but only partially.

The issue is that the training data consist of relatively detailed movie reviews, whereas our task in the end is to classify rather short tweets that might reveal sentiment about a specific movie. Thus, the lexicon-based technique can have an advantage in some cases. For example, it correctly identifies that the tweet “One more reason #StarWars is better off without George Lucas. https://t.co/95dhZLH3ke” reveals a positive sentiment, whereas the model-based technique classifies it as negative.

Although this tweet makes a comparison, its sentiment towards the new Star Wars movie is positive.

So it is not surprising that we may opt to combine the two techniques. Ways to do this can be found in the related sentiment analysis literature (for example, see: https://www.cs.uic.edu/~liub/FBS/SentimentAnalysis-and-OpinionMining.pdf).
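One very simple way to combine the two techniques is sketched below: trust the model when it is confident, and fall back to the lexicon otherwise. This is only an illustrative assumption, not a method taken from the literature cited above or used in this post; the function name `combine`, its parameters and the threshold of 0.75 are hypothetical.

```python
def combine(model_label, model_confidence, lexicon_label, threshold=0.75):
    """Trust the model when it is confident; otherwise defer to the lexicon.

    model_confidence is assumed to be the model's probability for its own
    prediction, between 0 and 1. A "neutral" lexicon result means the
    lexicon found no evidence, so the model's label is kept in that case.
    """
    if model_confidence >= threshold or lexicon_label == "neutral":
        return model_label
    return lexicon_label


print(combine("negative", 0.95, "positive"))  # model is confident
print(combine("negative", 0.55, "positive"))  # model is unsure
```

The first call keeps the model's label and the second one takes the lexicon's. In practice the threshold would have to be tuned on labelled tweets, since the right balance depends on how well the training data match the task.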

Please also read the first and second parts of this blog.