Star Wars movie: How can we analyze reviews on the social web?
Today we explain how to analyze sentiment in online reviews of the Star Wars movie.
In the blog entry “Sentiment Analysis with RapidMiner #1: Basics” (in German) we described two sentiment-analysis techniques available in RapidMiner:
- lexicon-based and
- model-based, which uses machine learning.
First, we use RapidMiner’s lexicon-based technique to apply sentiment analysis to tweets related to the Star Wars movie.
We use the “Search Twitter” operator provided by the “Social Media” extension (a Twitter connection has to be configured first) and retrieve the 1000 most recent or popular tweets for the hashtag query “#starwars”. We then select only the “Text” attribute, which holds the content of the retrieved tweets, and ignore the remaining metadata. A “Loop Values” operator iterates over the tweet IDs (the ID first has to be converted to a nominal attribute for this to work) and saves each tweet as a separate text file. Finally, we apply the “Process Documents from Files” operator to process each saved tweet, choosing the TF-IDF representation to create the vector space. The main process is illustrated below.
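To make the TF-IDF step more concrete, here is a minimal Python sketch of one common TF-IDF variant (relative term frequency multiplied by log(N/df)). It is only an illustration: RapidMiner's exact weighting and normalization may differ, and the function name `tf_idf` and the sample tweets are made up for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    RapidMiner's own weighting may differ in detail.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical, already tokenized tweets
tweets = [
    ["star", "wars", "great"],
    ["star", "wars", "boring"],
    ["great", "soundtrack"],
]
vectors = tf_idf(tweets)
```

In this variant, a term that occurs in every document receives a weight of zero, which is why very common words contribute little to the vector space.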
Inside the “Process Documents from Files” operator we define the processing steps for each tweet. We tokenize each tweet to extract its individual words and thus represent the tweet as a “bag of words”.
We can also filter stop-words. Next, we apply the stemming operator from the WordNet extension, which finds the lemma of each word. Finally, we apply the “Extract Sentiment” operator (again from the WordNet extension). We choose to use all available parts of speech: nouns, verbs, adjectives and adverbs. The resulting sub-process is illustrated below.
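The idea behind this sub-process can be sketched in a few lines of Python. This is not what the “Extract Sentiment” operator does internally; it is a simplified illustration that assumes a tiny hand-made lexicon (`LEXICON`), a tiny stop-word list (`STOP_WORDS`), and plain averaging of word scores. Stemming is left out to keep the sketch short.

```python
import re

# Hypothetical mini-lexicon; a real one (e.g. derived from WordNet) is far larger.
LEXICON = {"great": 0.8, "love": 0.9, "good": 0.6, "boring": -0.6, "bad": -0.7}
# Hypothetical stop-word list, for illustration only
STOP_WORDS = {"the", "a", "is", "was", "and", "of", "i"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexicon_sentiment(text):
    """Average the lexicon scores of all known, non-stop-word tokens."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    # Tweets without any lexicon word are treated as neutral
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `lexicon_sentiment("The plot was boring")` yields a negative value, while a tweet that contains no lexicon word is scored as neutral (0.0).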
The output of the process can be visualized in RapidMiner as a histogram of the computed sentiment values (attribute “sentiment”). An example result is depicted below. The histogram offers interesting insights; for example, in the right part of the distribution we can spot a sub-group of Twitter users who express a relatively high positive sentiment and who are likely to be loyal fans of the Star Wars series.
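Outside RapidMiner, the same kind of histogram can be computed by counting sentiment values per bin. The sketch below assumes that the scores lie in the range [-1, 1] and uses equal-width bins; both the range and the function name `histogram` are assumptions made for illustration.

```python
def histogram(values, n_bins=10, low=-1.0, high=1.0):
    """Count how many values fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for v in values:
        if low <= v <= high:
            # The upper edge belongs to the last bin
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts
```

A cluster of high counts in the last bins would correspond to the group of strongly positive tweets described above.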
The second approach uses a model to classify sentiment as positive or negative. The model needs to be trained on data that are a) related to our topic, that is, movie reviews, and b) labeled with the sentiment of each review, that is, positive vs. negative. The Polarity dataset v2.0, part of the Movie Review Data made available by Cornell University (http://www.cs.cornell.edu/people/pabo/movie-review-data/), contains 1000 positive and 1000 negative reviews. After downloading the data, we process each review to extract its terms, using the TF-IDF option for vector creation, as illustrated in the right part of the figure below. As before, we apply tokenization, stemming, and stop-word removal during processing (see the right part of the figure below). The resulting term-document matrix is used to train an SVM classifier inside the cross-validation operator. The attained accuracy is about 80%. Both the model and the set of all 1510 terms are stored in the repository.
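The cross-validation step can be sketched independently of the classifier. The following Python example is a simplified illustration, not RapidMiner's implementation: it builds folds by taking every k-th example (RapidMiner's sampling may differ), and the `train` and `predict` callables are placeholders where an SVM would go. The threshold "classifier" below is only a toy stand-in so that the sketch runs on its own.

```python
def k_fold_accuracy(examples, labels, train, predict, k=10):
    """Estimate accuracy by k-fold cross-validation.

    `train(examples, labels)` returns a model; `predict(model, example)`
    returns a label. Both are placeholders for any classifier, e.g. an SVM.
    """
    n = len(examples)
    correct = 0
    for fold in range(k):
        # Every k-th example (offset by `fold`) is held out for testing
        test_idx = set(range(fold, n, k))
        train_x = [examples[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        model = train(train_x, train_y)
        correct += sum(predict(model, examples[i]) == labels[i] for i in test_idx)
    return correct / n


# Toy stand-in classifier (NOT an SVM): threshold halfway between class means
def train_threshold(values, labels):
    pos = [v for v, l in zip(values, labels) if l == "pos"]
    neg = [v for v, l in zip(values, labels) if l == "neg"]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, value):
    return "pos" if value > threshold else "neg"


# Hypothetical one-dimensional data, for illustration only
scores = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85]
labels = ["neg", "pos", "neg", "pos", "neg", "pos"]
accuracy = k_fold_accuracy(scores, labels, train_threshold, predict_threshold, k=3)
```

Each example is used for testing exactly once, so the returned value is the share of held-out examples that were classified correctly, which is the quantity reported as accuracy above.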
Finally, we apply the trained model to the data we have already downloaded from Twitter, as shown in the figure below. We process each tweet by keeping only the terms used in the trained model and applying the same transformations as for the training data above. We then apply the trained model to the resulting term-document matrix and obtain a sentiment score for each tweet: the difference between the classifier’s confidence for the positive class and its confidence for the negative class.
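The two tweet-specific steps, restricting a tweet to the model's vocabulary and turning the two confidences into one score, are simple enough to sketch in Python. The function names are hypothetical, and the sketch assumes that both confidences lie between 0 and 1.

```python
def restrict_to_vocabulary(tokens, vocabulary):
    """Drop every token the trained model has never seen."""
    return [t for t in tokens if t in vocabulary]


def sentiment_score(confidence_positive, confidence_negative):
    """Positive minus negative class confidence.

    Assuming both confidences lie in [0, 1], the score lies in [-1, 1].
    """
    return confidence_positive - confidence_negative
```

With confidences of 0.75 (positive) and 0.25 (negative), for instance, the score is 0.5; a score near zero means the classifier is undecided.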
In our next blog article we will compare the two sentiment-analysis techniques.