Multilingual Sentiment Analysis: The Other 80%

English is the global language of business, but it is not the native language of the majority of the world’s customers. In fact, only 20% of people in the world speak a functional version of English. What about the other 80% of consumers? Do businesses not value their feedback? Or is the marketplace biased toward English-speaking customers?

The main goal of most businesses is to acquire new customers and keep the ones they already have. This means that businesses need to understand the feelings and opinions of their customers in the native language they choose to express themselves in. That is great in theory, but it rarely happens in practice. In most cases, marketers and researchers either ignore the feedback provided by non-English speakers, or translate text written in languages such as Arabic, Chinese and Urdu into English and then analyze the translation. This is not acceptable unless you believe that accuracy is unimportant to consumer sentiment analysis.

This article takes a brief look at the importance of a native, multilingual approach to voice of customer using sentiment analysis, which identifies, extracts and analyzes customer emotions and opinions across social media, surveys, customer service tickets, etc.

Language And The Challenge Of Translation

Translation is the process of transferring words or text from one language into another. Sounds simple, right? Wrong. Due to the complexities of language in general, and the specific differences between various languages, translation is rarely a word-for-word transfer from, say, English to Italian, or vice versa. Translations must deal with contrasts between basic linguistic building blocks such as grammar, syntax, semantics, lexicons, morphology, and even tonality in the case of languages like Mandarin, Thai and Punjabi. Add in all the various devices that people use, such as idioms, sarcasm and slang, and it’s easy to see that trying to accurately capture the intended meaning of a text through translation can be a daunting task.

But when businesses try to know their consumers, what they really want to understand is the true, intended meaning behind those consumers’ feelings about prices, product features, customer service, quality, etc. In other words, when analyzing voice of customer, accuracy is everything.

What is Voice of Customer?

Voice of the customer, or VoC, is the practice of analyzing customer feedback to improve a product, solution, or service. VoC is typically gauged using customer survey tools and feedback systems. Most companies understand the importance of customer feedback analysis and how it can supercharge their customer experience efforts, but many organizations still rely on archaic methods to extract the data they need to better understand what their customers are trying to say.

Marketing departments, PR agencies, and even some automated social media listening tools often use machine translation services such as those from Google and Amazon to convert foreign-language text into English before running sentiment analysis on it. One major disadvantage of machine translation is its inability to pick up on cultural nuances, contextual clues, and local slang. The result is content that can feel mechanical, stilted and full of errors.

By using customer feedback with Voice of Customer methods, organizations can gain valuable insights into where they are doing a good job and where they might need some work. Gathering consumer insights doesn’t have to be over-complicated; with Repustate, you can use a 21st-century solution to get the best results. Get into the minds of your customers using opinion mining and read between the survey lines. For these insights to be fully leveraged, they must be precise, which means analyzing opinions in the native language in which they were originally expressed.

Multilingual Sentiment Analysis

So what is multilingual sentiment analysis? Many of us know what sentiment analysis is: the identification, extraction and analysis of consumer feelings and opinions expressed through social media or customer surveys.

A multilingual approach to sentiment analysis begins with the simple belief that to understand voice of customer, you must analyse consumer opinions and feelings in the original language in which they were expressed. If those comments are in Portuguese, then they must be reviewed in Portuguese, and so on. For practitioners of this approach to text analytics, translation and the inaccuracy it introduces are among the most severe errors you can commit as a brand marketer, product manager or market researcher.

Translating any text data before analyzing it, be it a social media post or a product review, can reduce sentiment analysis accuracy by almost 20%. That’s huge if you are looking to use the results to guide your change management. In any form of data analytics, inaccuracy is a cardinal sin.

How does Repustate perform Sentiment Analysis in 23 Languages?

Steps for Doing Multilingual Sentiment Analysis

Repustate uses a group of semantic technologies to calculate sentiment by applying language specific rules to each piece of text. That means there isn’t one “true” algorithm; what works in English doesn’t necessarily apply to Arabic.

Repustate’s sentiment analysis is performed natively for each language. No intermediary translation is done, which means the accuracy is much higher.

With that said, regardless of the language being analyzed, each block of text goes through the following set of transformations:

Step 1: Part of speech tagging

This involves classifying each word at a grammatical level. That is, identifying which words are nouns, verbs, adjectives, adverbs etc. and which words are objects of others. Repustate identifies conjunctions and subordinate clauses, prepositional phrases and noun phrases - all to help the Repustate engine “understand” the true meaning of the text.

Repustate has developed its own part-of-speech tagger for each language used. Part-of-speech tagging is done by first accumulating a massive corpus of pre-tagged text (i.e. humans have gone through and labelled each word with its part of speech). With this information, Repustate trains a part-of-speech tagger and relies on probabilities to determine the correct part of speech for a given word in a given context. For example, the word “like” can be a verb (“I like you”) or a preposition (“He looks like his brother”). In the first case, “like” connotes positive sentiment, but in the second case, it does not.

In English, there are a handful of words that have this double (or triple) meaning, where in some contexts a word carries sentiment and in others it doesn’t. Arabic sentiment analysis is considerably more complex: a single Arabic word can have up to 12 different meanings depending on the surrounding context.

To perform sentiment analysis accurately, you need a very finely tuned and well-trained part-of-speech tagger. It must be language specific, as some languages have a much more complex morphology than others.
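As a rough illustration of the probabilistic approach described above, the sketch below trains a toy tagger on four hand-tagged sentences and picks the most frequent tag for each word given the tag of the word before it. It is not Repustate’s tagger: the corpus, the tag names and the `train` and `tag` functions are all hypothetical, and a production tagger would need a far larger corpus and a richer model.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (hypothetical; a real tagger needs a massive one).
TAGGED = [
    [("I", "PRON"), ("like", "VERB"), ("you", "PRON")],
    [("We", "PRON"), ("like", "VERB"), ("coffee", "NOUN")],
    [("He", "PRON"), ("looks", "VERB"), ("like", "ADP"),
     ("his", "PRON"), ("brother", "NOUN")],
    [("She", "PRON"), ("sings", "VERB"), ("like", "ADP"),
     ("a", "DET"), ("bird", "NOUN")],
]


def train(corpus):
    """Count how often each word gets each tag, given the previous tag."""
    context_counts = defaultdict(Counter)  # (prev_tag, word) -> tag counts
    word_counts = defaultdict(Counter)     # word -> tag counts (fallback)
    for sentence in corpus:
        prev = "<START>"
        for word, tag_ in sentence:
            w = word.lower()
            context_counts[(prev, w)][tag_] += 1
            word_counts[w][tag_] += 1
            prev = tag_
    return context_counts, word_counts


def tag(words, model):
    """Greedily assign each word its most frequent tag in context."""
    context_counts, word_counts = model
    prev, result = "<START>", []
    for word in words:
        w = word.lower()
        if (prev, w) in context_counts:
            best = context_counts[(prev, w)].most_common(1)[0][0]
        elif w in word_counts:
            best = word_counts[w].most_common(1)[0][0]
        else:
            best = "NOUN"  # crude default for unseen words
        result.append((word, best))
        prev = best
    return result


model = train(TAGGED)
print(tag("I like you".split(), model))
print(tag("He looks like his brother".split(), model))
```

With this toy model, “like” is tagged as a verb in “I like you” and as a preposition (ADP) in “He looks like his brother”, so only the first occurrence would be treated as a sentiment-bearing word.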

Step 2: Lemmatization

The next step is to lemmatize each word where applicable. Lemmatization is the process of reducing an inflected word to its dictionary form, or lemma. For example, “loved”, “loving” and “loves” are all forms of the lemma “love”. To make sure no word goes unanalyzed, a proper lemmatizer is required and, again, it must be language specific. The rules for inflecting nouns and verbs based on number, gender, tense etc. differ wildly from language to language. Repustate handles all of this for each language.
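To make the idea concrete, here is a minimal English-only sketch of a dictionary-backed lemmatizer: it tries a few suffix rules and accepts a candidate only if it appears in a list of known lemmas. The word lists, the rules and the `lemmatize` function are illustrative assumptions, not Repustate’s implementation, and every other language would need its own tables.

```python
# Hypothetical English-only data; other languages need their own tables.
LEMMAS = {"love", "walk", "study", "go", "good", "child"}
IRREGULAR = {"went": "go", "better": "good", "children": "child"}

# (suffix to strip, text to append), tried in order.
SUFFIX_RULES = [
    ("ies", "y"),
    ("ing", "e"),
    ("ing", ""),
    ("ed", "e"),
    ("ed", ""),
    ("s", ""),
]


def lemmatize(word):
    """Return the lemma of `word`, or the word itself if none is found."""
    w = word.lower()
    if w in LEMMAS:
        return w
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix):
            candidate = w[: len(w) - len(suffix)] + replacement
            if candidate in LEMMAS:  # only accept known dictionary forms
                return candidate
    return w


for form in ["loved", "loving", "loves", "walked", "studies", "went"]:
    print(form, "->", lemmatize(form))
```

Checking candidates against a lemma list is what keeps “walked” from becoming “walke” while still mapping “loved” to “love”.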

Step 3: Prior Polarity

There are many words that, even without any surrounding context, immediately connote sentiment. Words like “love”, “hate”, “despise” etc. have an immediate polarizing effect. Sentiment analysis relies on having an exhaustive list of terms that have prior polarity in order to provide a foundation for determining sentiment.
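In its simplest form, a prior polarity list is just a lookup table from lemma to score. The sketch below assumes a tiny hand-written table; the words, the scores and the `prior_polarities` function are hypothetical, and a real lexicon would be far larger and maintained per language.

```python
# Hypothetical lexicon: lemma -> prior polarity in [-1, 1].
PRIOR_POLARITY = {
    "love": 1.0,
    "hate": -1.0,
    "despise": -1.0,
    "good": 0.5,
    "boring": -0.5,
}


def prior_polarities(lemmas):
    """Pair each lemma with its prior polarity (0.0 if it has none)."""
    return [(lemma, PRIOR_POLARITY.get(lemma.lower(), 0.0)) for lemma in lemmas]


print(prior_polarities(["I", "love", "coffee"]))
```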

Step 4: Negations, amplifiers & other grammatical constructs

We almost have all the tools we need, but if we stopped right here, Repustate’s sentiment analysis would be inaccurate in any but the most trivial cases. Repustate now layers on nuanced grammatical aspects, unique to each language, including negations and amplifiers.

A negation reverses the polarity of the following (or sometimes, preceding) term. Consider the difference between “I like coffee” and “I do not like coffee”. In some languages, the negation comes first, in some, it comes after, and in some it appears at the end of the sentence (Turkish for example). Repustate is aware of all these language specific nuances.

The phrase “could not have been” is what we call an “amplifier”. Even though it contains a negation (“not”), what it actually does is amplify the term or phrase that follows it (e.g. “This vacation could not have been better”).

Conjunctions and subordinate clauses often act to contradict their preceding component. Consider the phrase “I wanted to like the movie, but it was so boring”. The first half of the sentence indicates positive sentiment, but the conjunction “but” then works to counter that sentiment.

Negations, amplifiers and other grammatical constructs are what make determining sentiment complex. Repustate’s sentiment analysis handles these complexities quickly and thoroughly for all languages.
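The three constructs discussed in this step can be sketched for English as follows. This is a simplified illustration under stated assumptions, not Repustate’s engine: the lexicon, the multiplier values (2.0 for amplifiers, 0.25 for text before “but”) and the `score_tokens` function are hypothetical, and the rule that a negation affects only the next polar term is specific to this English sketch.

```python
# Hypothetical English-only resources.
LEXICON = {"like": 0.5, "better": 0.5, "boring": -0.5, "love": 1.0}
NEGATORS = {"not", "never", "no"}
AMPLIFIERS = [("could", "not", "have", "been")]  # multi-word phrases
CONTRAST = {"but"}


def score_tokens(tokens):
    """Sum term polarities after applying negations, amplifiers and 'but'."""
    tokens = [t.lower() for t in tokens]
    scores = []              # adjusted scores of the polar terms seen so far
    flip, boost = False, 1.0
    i = 0
    while i < len(tokens):
        # Check amplifier phrases first so their inner "not" is not
        # mistaken for a plain negation.
        phrase = next(
            (p for p in AMPLIFIERS if tuple(tokens[i:i + len(p)]) == p), None
        )
        if phrase:
            boost = 2.0
            i += len(phrase)
            continue
        tok = tokens[i]
        if tok in CONTRAST:
            scores = [s * 0.25 for s in scores]  # down-weight what came before
            flip, boost = False, 1.0
        elif tok in NEGATORS:
            flip = True
        elif tok in LEXICON:
            value = LEXICON[tok] * boost
            scores.append(-value if flip else value)
            flip, boost = False, 1.0
        i += 1
    return sum(scores)


print(score_tokens("I like coffee".split()))
print(score_tokens("I do not like coffee".split()))
print(score_tokens("This vacation could not have been better".split()))
print(score_tokens("I wanted to like the movie but it was so boring".split()))
```

In this sketch the negated sentence scores negative, the amplified one scores higher than a plain “better” would, and the “but” sentence ends up negative overall because the positive first clause is down-weighted.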

Step 5: Wrapping it all up using machine learning

Repustate uses machine learning to calculate a sentiment score that combines various factors including the presence of terms with prior polarity, any negations, amplifiers or other grammatical constructs, as well as the length of the text. Shorter text with a high ratio of polarizing terms to non-polarizing terms leads to a score closer to the bounds of -1 (fully negative) and 1 (fully positive). A score of 0 or very close to 0 (±0.05) can be interpreted as being neutral; either there was no sentiment expressed or it was ambiguous.
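The scoring behaviour described above can be mimicked with a small hand-written function. The sketch below is a stand-in for a learned model, not Repustate’s: it divides the summed term scores by the text length and squashes the result into the range -1 to 1 with `tanh`. The gain of 3.0 and the function names are arbitrary assumptions; only the ±0.05 neutral band comes from the text. In a real system the weighting would be learned from labelled data.

```python
import math


def sentiment_score(term_scores, n_tokens):
    """Squash length-normalised term scores into the open interval (-1, 1)."""
    if n_tokens == 0:
        return 0.0
    density = sum(term_scores) / n_tokens  # polar weight per token
    return math.tanh(3.0 * density)        # 3.0 is an illustrative gain


def label(score, neutral_band=0.05):
    """Map a score to a label, treating values near 0 as neutral."""
    if abs(score) <= neutral_band:
        return "neutral"
    return "positive" if score > 0 else "negative"


short = sentiment_score([1.0], n_tokens=2)    # e.g. "Love it"
long_ = sentiment_score([1.0], n_tokens=10)   # same term in a longer text
print(short, label(short))
print(long_, label(long_))
print(label(0.03))
```

Because the term scores are divided by the number of tokens, the same polar word pushes a two-word text much closer to 1 than a ten-word text, which matches the length effect described above.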

Repustate provides sentiment analysis for 23 languages including: English, Portuguese, Italian, Swedish, Finnish, French, Thai, Korean, Spanish, Urdu, Arabic, Dutch, Danish, Chinese, German, Turkish, Hebrew, Russian, Malaysian, Polish, Japanese, Indonesian and Norwegian.
