01 nov 2012
[Update]: you can check out the code on GitHub.
In this post I will give a very introductory view of some techniques that can be useful when you want to perform a basic analysis of opinions written in English.
These techniques come 100% from experience in real-life projects. Don't expect a theoretical introduction to Sentiment Analysis or the multiple strategies out there to achieve opinion mining; this is only a practical example of applying some basic rules to extract the polarity (positive or negative) of a text.
Let's start looking at an example opinion:
"What can I say about this place. The staff of the restaurant is nice and the eggplant is not bad. Apart from that, very uninspired food, lack of atmosphere and too expensive. I am a staunch vegetarian and was sorely dissapointed with the veggie options on the menu. Will be the last time I visit, I recommend others to avoid."
As you can see, this is a mainly negative review about a restaurant.
General or detailed sentiment
Sometimes we only want an overall rating of the sentiment of the whole review. In other cases, we need a little more detail, and we want each negative or positive comment identified.
This kind of detailed detection can be quite challenging. Sometimes the aspect is explicit. An example is the opinion "very uninspired food", where the criticized aspect is the food. In other cases, it is implicit: the sentence "too expensive" gives a negative opinion about the price without mentioning it.
In this post I will focus on detecting the overall polarity of a review, leaving for later the identification of individual opinions on concrete aspects of the restaurant. To compute the polarity of a review, I'm going to use an approach based on dictionaries and some basic algorithms.
A note about the dictionaries
A dictionary is no more than a list of words that share a category. For example, you can have a dictionary for positive expressions, and another one for stop words.
The design of the dictionaries depends heavily on the concrete topic where you want to perform the opinion mining. Mining hotel opinions is quite different from mining laptop opinions. Not only can the positive/negative expressions differ, but the context vocabulary is also quite distinct.
Before writing code, there is an important decision to make. Our code will have to interact with text, splitting, tagging, and extracting information from it.
But what should be the structure of our text?
This is a key decision because it will determine our algorithms in some ways. We should decide if we want to differentiate sentences inside a paragraph. We could define a sentence as a list of tokens. But what is a token? A string? A more complex structure? Note that we will want to assign tags to our tokens. Should we allow only one tag per token, or unlimited ones?
Infinite options here. We could choose a very simple structure, for example defining the text simply as a list of words. Or we could define a more elaborate structure carrying every possible attribute of a processed text (word lemmas, word forms, multiple taggings, inflections...).
As usual, a compromise between these two extremes can be a good way to go.
For the examples of this post, I'm going to use the following structure:
- Each text is a list of sentences
- Each sentence is a list of tokens
- Each token is a tuple of three elements: a word form (the exact word that appeared in the text), a word lemma (a generalized version of the word), and a list of associated tags
This is a structure I've found quite useful. It is ready for some "advanced" processing (lemmatization, multiple tags) without being too complex (at least in Python).
This is an example of a POS-tagged paragraph:
```python
[[('All', 'All', ['DT']), ('that', 'that', ['DT']), ('is', 'is', ['VBZ']),
  ('gold', 'gold', ['NN']), ('does', 'does', ['VBZ']), ('not', 'not', ['RB']),
  ('glitter', 'glitter', ['VB']), ('.', '.', ['.'])],
 [('Not', 'Not', ['RB']), ('all', 'all', ['DT']), ('those', 'those', ['DT']),
  ('who', 'who', ['WP']), ('wander', 'wander', ['NN']), ('are', 'are', ['VBP']),
  ('lost', 'lost', ['VBN'])]]
```
Once we have decided the structural shape of our processed text, we can start writing some code to read and pre-process it. By pre-processing I mean some common first steps in NLP: tokenizing, splitting into sentences, and POS tagging.
I will use the NLTK library for these tasks:
Now, using these two simple wrapper classes, I can perform basic text preprocessing, where the input is the text as a string and the output is a collection of sentences, each of which is again a collection of tokens.
For the moment, our tokens are quite simple. Since we are using NLTK and it does not lemmatize words, our forms and lemmas will always be identical. At this point of the process, the only tag associated with each word is its POS tag provided by NLTK.
The next step is to recognize positive and negative expressions. To achieve this, I'm going to use dictionaries, i.e. simple files containing expressions that will be searched in our text.
For example, I'm going to define two tiny dictionaries, one for positive expressions and another for negative ones:
```yaml
nice: [positive]
awesome: [positive]
cool: [positive]
superb: [positive]
```

```yaml
bad: [negative]
uninspired: [negative]
expensive: [negative]
dissapointed: [negative]
recommend others to avoid: [negative]
```
In case you were wondering, we could have used a simpler format or a single file, but this dictionary format will be useful later.
Note that these are only two example dictionaries, useless in a real life project.
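Since each line has the simple shape `expression: [tag, ...]`, loading a dictionary file into a Python mapping is easy. The format also happens to be valid YAML, so `yaml.safe_load` would work too; here is a dependency-free sketch of a hypothetical loader:

```python
def load_dictionary(path):
    """Parse dictionary lines like "nice: [positive]" into an
    {expression: [tags]} mapping. Blank lines are ignored."""
    entries = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # split on the first colon only, so multi-word keys survive
            expression, _, tags = line.partition(':')
            entries[expression.strip()] = [tag.strip()
                                           for tag in tags.strip().strip('[]').split(',')]
    return entries
```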
The following code defines a class that I will use to tag our pre-processed text with the dictionaries we just defined.
When tagging our review, the input is the previously preprocessed text, and the output is the same text, enriched with tags of type "positive" or "negative":
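The class below is a sketch of such a tagger. As an assumption, it takes a list of already-loaded {expression: [tags]} mappings rather than file paths, and it tries the longest expression first, so a multi-word entry like "recommend others to avoid" wins over its individual words:

```python
class DictionaryTagger(object):
    """Tag tokens that match expressions found in one or more dictionaries.

    Each dictionary is a mapping from an expression (possibly multi-word)
    to a list of tags, e.g. {'recommend others to avoid': ['negative']}.
    """

    def __init__(self, dictionaries):
        self.dictionary = {}
        self.max_key_size = 0
        for curr_dict in dictionaries:
            for key, tags in curr_dict.items():
                key = key.lower()
                self.dictionary.setdefault(key, []).extend(tags)
                self.max_key_size = max(self.max_key_size, len(key.split()))

    def tag(self, postagged_sentences):
        return [self.tag_sentence(sentence) for sentence in postagged_sentences]

    def tag_sentence(self, sentence):
        """Scan the sentence left to right, trying the longest possible
        expression first, so multi-word entries win over single words."""
        tagged_sentence = []
        i = 0
        n = len(sentence)
        while i < n:
            j = min(i + self.max_key_size, n)
            found = False
            while j > i:
                literal = ' '.join(token[0] for token in sentence[i:j]).lower()
                lemma = ' '.join(token[1] for token in sentence[i:j]).lower()
                expression = literal if literal in self.dictionary else (
                    lemma if lemma in self.dictionary else None)
                if expression is not None:
                    # merge the matched tokens into a single token, keeping
                    # the tags (e.g. POS tags) they already carried
                    original_tags = [t for token in sentence[i:j] for t in token[2]]
                    tagged_sentence.append(
                        (literal, lemma, self.dictionary[expression] + original_tags))
                    i = j
                    found = True
                    break
                j -= 1
            if not found:
                tagged_sentence.append(sentence[i])
                i += 1
        return tagged_sentence
```

Calling `tag` on the preprocessed review yields the same nested structure, with "positive" or "negative" prepended to the tag lists of the matched tokens.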
We can already perform a basic calculation of how positive or negative a review is.
Simply counting how many positive and negative expressions we detected could serve as a (very naive) sentiment measure.
The following code snippet applies this idea:
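A sketch of this naive scoring, assuming tokens are the (form, lemma, [tags]) triples described above:

```python
def value_of(sentiment):
    """Map a tag to a numeric value; non-sentiment tags (e.g. POS tags) count as 0."""
    if sentiment == 'positive':
        return 1
    if sentiment == 'negative':
        return -1
    return 0

def sentiment_score(review):
    """Sum the values of every tag of every token in every sentence."""
    return sum(value_of(tag)
               for sentence in review
               for token in sentence
               for tag in token[2])
```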
So, our review could be considered "quite negative", since it has a score of -4.
The previous "sentiment score" was very basic: it only counts positive and negative expressions and sums them, without taking into account that some expressions may be more positive or negative than others.
One way of defining this "strength" is to use two new dictionaries: one for "incrementers" and another for "decrementers".
Let's define two tiny examples:
```yaml
too: [inc]
very: [inc]
sorely: [inc]
```

```yaml
barely: [dec]
little: [dec]
```
We instantiate our tagger again, telling it to use these two new dictionaries as well.
Now we can improve our sentiment score somewhat. The idea is that "good" has more strength than "barely good", but less than "very good".
The following code defines the recursive function sentence_score to compute the sentiment score of a sentence. The most remarkable thing about it is that it uses information about the previous token to make a decision on the score of the current token.
This function is then used by our new sentiment_score function:
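A sketch of this idea follows; value_of is repeated here so the snippet is self-contained, and doubling on an incrementer and halving on a decrementer are arbitrary choices:

```python
def value_of(sentiment):
    if sentiment == 'positive':
        return 1
    if sentiment == 'negative':
        return -1
    return 0

def sentence_score(sentence_tokens, previous_token, acum_score):
    """Walk the sentence recursively, doubling or halving a token's score
    when the previous token was tagged as an incrementer or decrementer."""
    if not sentence_tokens:
        return acum_score
    current_token = sentence_tokens[0]
    token_score = sum(value_of(tag) for tag in current_token[2])
    if previous_token is not None:
        previous_tags = previous_token[2]
        if 'inc' in previous_tags:
            token_score *= 2.0
        elif 'dec' in previous_tags:
            token_score /= 2.0
    return sentence_score(sentence_tokens[1:], current_token,
                          acum_score + token_score)

def sentiment_score(review):
    return sum(sentence_score(sentence, None, 0.0) for sentence in review)
```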
Notice that the review is now considered more negative, due to the appearance of expressions such as "very uninspired", "too expensive" and "sorely dissapointed".
With the approach we've been following so far, some expressions could be incorrectly tagged. For example, this part of our example review:
the eggplant is not bad
contains the word bad, but the sentence is a positive opinion about the eggplant.
This is because of the appearance of the negation word not, which flips the meaning of the negative adjective bad.
We could take these polarity flips into account by defining a dictionary of inverters:
```yaml
lack of: [inv]
not: [inv]
```
When tagging our text, we also specify this new dictionary when instantiating our tagger.
Then, we can adapt our sentiment_score function. We want it to flip the polarity of a sentiment word when it is preceded by an inverter:
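Only the previous-token check changes: a sketch with the inverter branch added (again repeating the helpers so the snippet is self-contained):

```python
def value_of(sentiment):
    if sentiment == 'positive':
        return 1
    if sentiment == 'negative':
        return -1
    return 0

def sentence_score(sentence_tokens, previous_token, acum_score):
    if not sentence_tokens:
        return acum_score
    current_token = sentence_tokens[0]
    token_score = sum(value_of(tag) for tag in current_token[2])
    if previous_token is not None:
        previous_tags = previous_token[2]
        if 'inc' in previous_tags:
            token_score *= 2.0
        elif 'dec' in previous_tags:
            token_score /= 2.0
        elif 'inv' in previous_tags:
            # an inverter like "not" flips the polarity of what follows
            token_score *= -1.0
    return sentence_score(sentence_tokens[1:], current_token,
                          acum_score + token_score)

def sentiment_score(review):
    return sum(sentence_score(sentence, None, 0.0) for sentence in review)
```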
Recalculating again the sentiment score:
It's now -5.0 since "not bad" is considered positive.
We have seen a little introduction to some basic techniques and algorithms that can give us an overall "score" of how positive or negative a review is.
The steps we've followed are:
- Split the text into sentences, and each sentence into tokens
- Add POS (Part of Speech) tags to the split text, using NLTK
- Enrich the POS-tagged text with our own tags using dictionaries. These tags live at a different "semantic level" from the POS tags: "positive", "negative", "inverter", "incrementer" and "decrementer"
- Implement some basic extraction rules over the tagged text, in the form of Python functions
This can be a good starting point for someone interested in sentiment analysis, but it is only the very beginning.
In a real-life system you will have to work harder, especially on the extraction rules (and, of course, on the dictionaries).
The method described so far is a rule-based approach. There are other techniques for sentiment analysis, for example applying machine-learning algorithms. In any case, I think advanced rule-based or machine-learning systems are out of the scope of an introductory post like this.
Hope you enjoyed the reading!