Twitter hashtags are everywhere these days, and the ability to mine them for data is an important one. The problem with hashtags is that each one is a single long string composed of a few smaller words. If we can segment the long hashtag into its individual words, we gain extra insight into the context of the tweet and can perhaps determine its sentiment as a result.
So what to do? How do we solve the problem of the long, single string?
As with Chinese sentiment, we rely on conditional probabilities to determine the most likely words in a string of characters. Put differently, we're trying to answer the question: "If the previous word was X, what are the odds the next word is Y?" To answer this, you need to build a probability model from some tagged data. We grabbed the most common bigrams from Google's ngram database and, using the frequencies listed, constructed a probability model.
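To make the model-building step concrete, here's a minimal sketch of turning raw bigram counts into conditional probabilities. The function name, the toy counts, and the data layout are illustrative assumptions, not Repustate's actual code:

```python
from collections import defaultdict

def build_bigram_model(bigram_counts):
    """Convert raw bigram counts into conditional probabilities P(next | prev).

    `bigram_counts` maps (prev_word, next_word) -> frequency, e.g. counts
    pulled from an ngram dataset. (Illustrative structure, an assumption.)
    """
    # Total occurrences of each first word, used as the denominator.
    totals = defaultdict(int)
    for (prev, _nxt), count in bigram_counts.items():
        totals[prev] += count

    # P(next | prev) = count(prev, next) / count(prev, *)
    model = defaultdict(dict)
    for (prev, nxt), count in bigram_counts.items():
        model[prev][nxt] = count / totals[prev]
    return model

# Toy frequencies, purely for illustration.
counts = {
    ("vacation", "fallout"): 30,
    ("vacation", "fall"): 10,
    ("fall", "out"): 60,
}
model = build_bigram_model(counts)
# model["vacation"]["fallout"] == 30 / 40 == 0.75
```

With the real Google ngram frequencies plugged in, the same lookup answers "given word X, how likely is word Y next?" directly.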
To better understand why we need the probabilities, let's take a look at a concrete example. Take the following hashtag: #vacationfallout. There are two possible segmentations here, ["vacation", "fallout"] or ["vacation", "fall", "out"]. So how do we know which to use? We examine the probability that the string "fallout" comes after "vacation". This probability, as we know from our model, is higher than the probability of the words "fall" and "out" coming after "vacation", so that's the one we go with.
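The comparison between candidate segmentations can be sketched as scoring each candidate by the product of its bigram probabilities and picking the highest. The function, the hard-coded model, and the probability values below are illustrative assumptions, not the actual implementation:

```python
def score(segmentation, model, default=1e-9):
    """Multiply conditional bigram probabilities along one candidate split.

    `default` is a tiny fallback probability for unseen bigrams
    (an assumed smoothing choice, not necessarily what Repustate does).
    """
    prob = 1.0
    for prev, nxt in zip(segmentation, segmentation[1:]):
        prob *= model.get(prev, {}).get(nxt, default)
    return prob

# Toy model with made-up probabilities for the #vacationfallout example.
model = {
    "vacation": {"fallout": 0.75, "fall": 0.25},
    "fall": {"out": 0.6},
}
candidates = [["vacation", "fallout"], ["vacation", "fall", "out"]]
best = max(candidates, key=lambda seg: score(seg, model))
# score of ["vacation", "fallout"] is 0.75;
# score of ["vacation", "fall", "out"] is 0.25 * 0.6 = 0.15,
# so "fallout" wins.
```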
Now of course, since we're dealing with probabilities, we might be wrong. Perhaps the author did intend for that hashtag to mean ["vacation", "fall", "out"]. But we learn to live with the fact that we'll be wrong sometimes; the key is that we'll be right far more often than we're wrong.
Since the Repustate API is hit pretty heavily, we still need to be concerned with performance. The first step is to determine whether there's a hashtag to expand at all, which we do with a simple regular expression. Once we've determined a hashtag is present, we expand it into its individual words. To make things go a bit faster, we memoize the functions we use so that when we encounter the same patterns (and we do), we don't waste time recalculating things from scratch each time. Here's the decorator we use to memoize in Python:
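The original decorator isn't reproduced in this excerpt, so what follows is a standard memoizing decorator consistent with the description, a sketch rather than Repustate's exact code:

```python
import functools

def memoize(func):
    """Cache results keyed on the function's positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        # Only compute on a cache miss; repeated patterns hit the cache.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

@memoize
def segment(hashtag):
    # Hypothetical expensive segmentation work would go here.
    ...
```

Applied to the segmentation functions, this means a hashtag pattern we've already expanded once is answered from the cache on every subsequent request.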