With an ever-increasing amount of data being generated every day, it's important to have the right text analytics tools at your disposal. Repustate's approach to text analytics is comprehensive, battle-tested and flexible enough to meet the needs of customers across a variety of industries and use cases.
This page will give you a peek behind the curtain to see how Repustate performs sentiment analysis and how Repustate handles the various pitfalls one might encounter when analyzing text for sentiment.
Before any sentiment can be computed, a block of text has to be broken down into its grammatical constructs. This helps guide Repustate's machine learning algorithms in the right direction. The decomposition of text into its grammatical parts is called part-of-speech tagging. Here's an example:
I really loved the pizza last night
Running this through Repustate's part-of-speech tagger returns this output:
I(PP) really(ADV) loved(VB) the(ART) pizza(NN) last(ADJ) night(NN)
After each word, the appropriate "tag" follows. PP means personal pronoun, ADV means adverb, VB means verb and so on. A full tagset is available here. With tags applied, we now have a hint as to which words or phrases might be of interest. For example, ART refers to an article. Articles rarely influence the sentiment of a sentence, so we can safely ignore these words.
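The filtering step described above can be sketched in a few lines of Python. This is an illustration only, not Repustate's implementation: the function names and the IGNORED_TAGS set are assumptions made for the example, and the input is simply the tagged string shown earlier.

```python
import re

# Tags assumed, for this sketch, to carry no sentiment of their own.
IGNORED_TAGS = {"ART"}

def parse_tagged(tagged: str) -> list[tuple[str, str]]:
    """Split 'word(TAG) word(TAG) ...' into (word, tag) pairs."""
    return re.findall(r"([^\s()]+)\((\w+)\)", tagged)

def drop_ignored(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the words whose tag is not in IGNORED_TAGS."""
    return [(word, tag) for word, tag in pairs if tag not in IGNORED_TAGS]

tagged = "I(PP) really(ADV) loved(VB) the(ART) pizza(NN) last(ADJ) night(NN)"
kept = drop_ignored(parse_tagged(tagged))
print([word for word, _ in kept])
# ['I', 'really', 'loved', 'pizza', 'last', 'night']
```

Only "the" is dropped here; a real pipeline would likely ignore more tags than just articles.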
One final step before we get to the sentiment: the concept of lemmatizing a word. In most languages, the same word can be written differently based on several factors including tense, gender and number. Some languages, such as Arabic, have more complex rules, while English has fewer rules to worry about because it doesn't have grammatical gender. Let's illustrate this with an example:
I really enjoyed that movie
I really enjoy watching movies
To a human, it's obvious these sentences are similar, but we're trying to teach a machine how to know that. The first thing to notice is that "enjoy" and "enjoyed" are similar. When you lemmatize the word "enjoyed", that is, reduce it to its root form, you get "enjoy". The same goes for "movies" and "movie". Lemmatizing simplifies sentiment analysis because it reduces the total number of distinct words we have to consider when looking for polarity.
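Here is a minimal Python sketch of the idea, using the two sentences above. The LEMMAS lookup table is a hypothetical stand-in invented for this example; real lemmatizers rely on language-specific rules and far larger dictionaries, and this is not how Repustate's lemmatizer is built.

```python
# Hypothetical lookup table covering only the words in the example sentences.
LEMMAS = {"enjoyed": "enjoy", "movies": "movie", "watching": "watch"}

def lemmatize(word: str) -> str:
    """Lower-case the word and map it to its root form if we know one."""
    word = word.lower()
    return LEMMAS.get(word, word)

def lemmatize_sentence(sentence: str) -> list[str]:
    """Lemmatize every whitespace-separated word in the sentence."""
    return [lemmatize(word) for word in sentence.split()]

first = lemmatize_sentence("I really enjoyed that movie")
second = lemmatize_sentence("I really enjoy watching movies")

print(first)   # ['i', 'really', 'enjoy', 'that', 'movie']
print(second)  # ['i', 'really', 'enjoy', 'watch', 'movie']

# After lemmatizing, the overlap between the two sentences is easy to see.
shared = set(first) & set(second)
```

Before lemmatizing, the two sentences share only "I" and "really"; afterwards they also share "enjoy" and "movie".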
At its simplest, sentiment analysis is about finding single words that have sentiment associated with them. Words like "love", "hate", "great", "terrible" etc. immediately convey a particular sentiment or polarity ("polarity" meaning positive or negative sentiment). These terms are said to have a prior polarity.
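At this simplest level, the lookup could be sketched in Python as follows. The lexicon contains only the four words mentioned above, and the numeric scores (+1.0 and -1.0) are assumptions for illustration rather than values Repustate uses.

```python
# Toy prior-polarity lexicon: the words come from the text, the scores are assumed.
PRIOR_POLARITY = {"love": 1.0, "great": 1.0, "hate": -1.0, "terrible": -1.0}

def find_polar_terms(lemmas: list[str]) -> list[tuple[str, float]]:
    """Return (term, polarity) for every lemma that has a prior polarity."""
    return [(w, PRIOR_POLARITY[w]) for w in lemmas if w in PRIOR_POLARITY]

def naive_score(lemmas: list[str]) -> float:
    """Add up the prior polarities, ignoring context entirely."""
    return sum(polarity for _, polarity in find_polar_terms(lemmas))

lemmas = ["i", "really", "love", "the", "pizza", "last", "night"]
print(find_polar_terms(lemmas))  # [('love', 1.0)]
print(naive_score(lemmas))       # 1.0
```

The input is assumed to be lemmatized already, which is why "love" appears instead of "loved".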
But is sentiment analysis really just a matter of finding prior polarity terms? If it is, then we're done here. But, of course, it's not and there's much more to it. Consider this example:
I did not like that restaurant last night.
The word "like", as a verb, conveys positive sentiment. But everyone would agree that this sentence is negative with regard to the restaurant. Or what about this sentence:
That dinner could not have been any better.
As these two examples show us, prior polarity terms by themselves aren't enough. We need more analysis to accurately determine sentiment.
A negation reverses the polarity of the following (or sometimes, preceding) term. In the restaurant example, "did not like" is negative because of the presence of "not", in spite of "like". In the dinner example, the phrase "could not have been" is what we call an "amplifier". Even though it contains a negation, what it actually does is amplify the term or phrase that follows it. Note that either a positive or a negative term can follow this particular amplifier.
Negations and amplifiers make sentiment analysis more complex: simply looking up prior polarity terms and adding up the occurrences, or something similarly naive, just won't work.
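To see how negations and amplifiers change the picture, here is a small Python sketch that handles the two example sentences. Everything in it is an assumption made for illustration: the word lists, the amplification factor of 2.0, and the rule that a modifier applies to the next polar term. It is not Repustate's algorithm, and it does not handle negations that follow the term they modify.

```python
PRIOR_POLARITY = {"like": 1.0, "better": 1.0, "worse": -1.0}
NEGATIONS = {"not", "never", "no"}
# Multi-word amplifiers are checked first so their "not" is not read as a negation.
AMPLIFIERS = [("could", "not", "have", "been")]
AMPLIFY_FACTOR = 2.0  # assumed strength of an amplifier

def score(tokens: list[str]) -> float:
    total = 0.0
    modifier = 1.0  # applied to the next polar term, then reset
    i = 0
    while i < len(tokens):
        amp = next((a for a in AMPLIFIERS if tuple(tokens[i:i + len(a)]) == a), None)
        if amp:
            modifier = AMPLIFY_FACTOR
            i += len(amp)
            continue
        token = tokens[i]
        if token in NEGATIONS:
            modifier = -1.0
        elif token in PRIOR_POLARITY:
            total += modifier * PRIOR_POLARITY[token]
            modifier = 1.0
        i += 1
    return total

print(score("i did not like that restaurant last night".split()))   # -1.0
print(score("that dinner could not have been any better".split()))  # 2.0
```

A plain sum of prior polarities would score both sentences as +1.0, missing the negation in the first and the emphasis in the second.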
When you calculate the sentiment of one entire block of text, the result is called the document sentiment. Often, this score suffices. When examining shorter pieces of text, such as Twitter content, you can usually get away with just using document-level sentiment. Editorials from news sites or blog articles also generally lend themselves to being analyzed in their entirety. So when does document sentiment fail? Consider this block of text:
We stayed at the Ritz for 4 days last month. The food and service was exquisite, but the price was way too expensive compared to surrounding hotels.
This hotel review snippet begins positive, but then veers negative. Analyzing this document in its totality may hide this nuance. If you're the hotel manager, you might want to know which aspects of your hotel people enjoy and which ones they do not.
To combat the issue of multiple polarities appearing within one block of text, Repustate's API allows for scoped sentiment. This means you can get multiple sentiment scores from the same block of text, each scoped to a particular topic (e.g. a person or place) or to a semantic idea (e.g. pricing at a hotel). By using scoped sentiment, the nuance that exists in a review is brought to the surface and all relevant sentiments are captured.
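The general idea of scoping can be illustrated with a toy Python sketch applied to the hotel review above. This does not call or reproduce Repustate's API: the clause-splitting rule, the TOPICS keyword sets and the two-word lexicon are all hypothetical choices made so the example stays small.

```python
import re

# Assumed scores for the two polar words in the review.
POLARITY = {"exquisite": 1.0, "expensive": -1.0}
# Hypothetical topic keywords for a hotel review.
TOPICS = {"food": {"food"}, "service": {"service"}, "price": {"price", "pricing"}}

def scoped_sentiment(text: str) -> dict[str, float]:
    """Score each clause, then credit the score to every topic the clause mentions."""
    scores: dict[str, float] = {}
    # Treat punctuation and the word "but" as clause boundaries.
    for clause in re.split(r"[.;,]|\bbut\b", text.lower()):
        words = re.findall(r"[a-z']+", clause)
        clause_score = sum(POLARITY.get(w, 0.0) for w in words)
        for topic, keywords in TOPICS.items():
            if keywords & set(words):
                scores[topic] = scores.get(topic, 0.0) + clause_score
    return scores

review = ("We stayed at the Ritz for 4 days last month. The food and service was "
          "exquisite, but the price was way too expensive compared to surrounding hotels.")
print(scoped_sentiment(review))
# {'food': 1.0, 'service': 1.0, 'price': -1.0}
```

Summed over the whole document, these scores cancel out to zero, which is exactly the nuance a single document-level score would hide.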
Different industries employ different means of communication. Words or phrases that have one connotation in one context might have a completely different one in another. It's nearly impossible to account for all these variations in one universal model. That's why Repustate allows you to create your own sentiment rules that are specific to your industry and use case.
Multilingual sentiment analysis is hard, especially when dealing with real-world text. But Repustate has developed multiple techniques that, when used together, deliver a fast, accurate and predictable text analytics engine.
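One simple way to picture industry-specific rules is as a set of overrides layered on top of a general lexicon. The Python sketch below is only an illustration of that idea; the lexicons, the scores and the word "unpredictable" are hypothetical examples, and it does not show how custom rules are defined in Repustate.

```python
# General-purpose lexicon with assumed scores.
BASE_LEXICON = {"great": 1.0, "terrible": -1.0, "unpredictable": -1.0}
# Hypothetical override: in film reviews, an unpredictable plot is a good thing.
MOVIE_RULES = {"unpredictable": 1.0}

def build_lexicon(base: dict[str, float], overrides: dict[str, float]) -> dict[str, float]:
    """Return a new lexicon where industry rules take precedence over the base."""
    return {**base, **overrides}

def score(tokens: list[str], lexicon: dict[str, float]) -> float:
    """Sum the polarity of every token found in the lexicon."""
    return sum(lexicon.get(token, 0.0) for token in tokens)

tokens = "the plot was unpredictable".split()
print(score(tokens, BASE_LEXICON))                              # -1.0
print(score(tokens, build_lexicon(BASE_LEXICON, MOVIE_RULES)))  # 1.0
```

The same sentence flips from negative to positive once the domain rule is applied, while words without an override keep their general score.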