No matter how clever your algorithms and heuristics, having a ton of data trumps all.
We’ve been mixing ingredients for a while at Repustate, trying to come up with the perfect recipe for our problem. Our goal is to determine what somebody is intending to buy based on what they write on various social media outlets. For example, “I want to buy a bike” is a statement of purchase intent, while “There goes the last bus” is not. We went about trying to solve this problem by taking a small sample of data, manually tagging it as being purchase intent or not, and then hoping to extrapolate. This is generally the approach taken in the world of NLP. We derived all sorts of clever rules-based logic to try to accurately predict this intent. Turns out, it’s next to impossible to do this accurately based solely on “state-of-the-art” algorithms and systems.
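To make the rules-based approach concrete, here is a minimal sketch in Python. The patterns and the `has_purchase_intent` function are hypothetical illustrations, not Repustate's actual rules. They handle the two examples above but miss informal phrasing.

```python
import re

# Hypothetical hand-written rules for illustration only; a real rule set
# would be far larger and still miss many phrasings.
INTENT_PATTERNS = [
    re.compile(r"\b(?:want|need|plan|planning|looking)\s+to\s+(?:buy|get|purchase)\b", re.I),
    re.compile(r"\bshopping\s+for\b", re.I),
]


def has_purchase_intent(text):
    """Return True if any hand-written pattern matches the text."""
    return any(pattern.search(text) for pattern in INTENT_PATTERNS)


examples = [
    "I want to buy a bike",        # matched by the first pattern
    "There goes the last bus",     # no pattern matches
    "gonna cop a new bike 2moro",  # purchase intent, but no rule covers it
]
for text in examples:
    print(text, "->", has_purchase_intent(text))
```

The third example is the kind of sentence that hand-written rules keep missing: the intent is clear to a human reader, but no pattern anticipates the wording.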
You see, it’s not that the algorithms and methodologies that academics in the NLP world publish are useless; it’s just that they’re not useful when applied to more general sets of documents. Let me explain.
Academics generally take a corpus of text from a very specific subset of language (movie reviews are always a popular choice) and train their machine learning systems on that. Their results are often great (some report greater than 90% accuracy), but I view these as contrived. Movie reviews and articles from Reuters’ archive, another popular source, are well-written pieces of English text. The sentences are structured properly, there is proper subject-verb agreement, etc. In other words, it’s almost the complete opposite of what you read on Twitter, Facebook or a blog comment.
The reality is that accurate social media listening means tagging social media content with semantic meaning, and you can’t rely on the traditional means of learning to do that. You have to do what Google does: amass a ridiculous amount of data, find patterns, and then try to predict. Google Instant is an example of this. Its predictions of what you’re trying to search for are based on the billions of n-grams Google has generated over the past decade. When you type “How can I get my girlfriend to”, Google knows with a high degree of probability what you’re going to type next. I won’t reproduce the possibilities Google shows here, as they’re more than slightly NSFW.
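The idea of predicting the next word from n-gram counts can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Google Instant is actually implemented: the corpus is made up, and the names `build_ngram_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_model(sentences, n=3):
    """Count which word follows each (n-1)-word context. Assumes n >= 2."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            model[context][next_word] += 1
    return model


def predict_next(model, text, n=3, k=3):
    """Return up to k most frequent continuations of the last n-1 words."""
    tokens = text.lower().split()
    context = tuple(tokens[-(n - 1):])
    counts = model.get(context, Counter())
    return [word for word, _ in counts.most_common(k)]


# Made-up toy corpus; a real system would count over billions of n-grams.
corpus = [
    "i want to buy a bike",
    "i want to buy a car",
    "i want to buy a bike",
    "i want to sell my bike",
    "we want to buy a house",
]

model = build_ngram_model(corpus, n=3)
print(predict_next(model, "I want to"))
print(predict_next(model, "We want to buy a"))
```

With only five sentences the predictions are crude, and that is the point: the technique itself is simple, and the quality comes almost entirely from how much data sits behind the counts.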
And that’s what Repustate is doing now: grabbing loads of data and categorizing it manually to build a sentiment analysis engine. It’s the only way.