We recently introduced a new set of API calls for our enterprise customers (all customers will soon have access to these APIs) that allows you to create customized rules for categorizing text. For example, let's say you want to classify Twitter data into one of two categories: Photography or Photoshop. If it has to do with photography, that's one category, if it has to do with Photoshop, that's another category.
So we begin by listing out some boolean rules as to how we want to match our text. We can use the OR operator, we can use AND, we can use negation by placing a "-" (dash or hyphen) before a token and we can use parentheses to group pieces of logic together. Here are our definitions for the two categories:
Photography: photo OR camera OR picture
Photoshop: "Photoshop" -photo -shop
The first rule states if a tweet has at least one of the words "photo", "camera" or "picture" then classify it as being in the Photograph category. The 2nd rule states if it has the word "Photoshop" and does not contain the words "photo" and "shop" by themselves, then this piece of text is under the Photoshop category. You'll notice there's an implicit AND operator where ever we use a white space to separate tokens.
Now one approach would be to take your text, tokenize it into a bag of words, and then go one by one through each of the boolean clauses and see which matches. But that's way too slow. We can do this much faster using regular expressions.
The hilarity of the quote above not withstanding, this problem is ready made for regular expressions because we can compile our rules once at startup time and then just iterate over them for each piece of text. But how do you convert the category definitions above into a regular expression, using negative lookaheads/behinds and all that other regexp goodness? We used Python's AST module.
Thinking back to your days in CS, you'll remember that an expression like 2 + 2 can be parsed and converted into a tree, where the binary operator '+' is the parent and it has two child nodes, namely '2' and '2'. If we take our boolean expressions and replace OR with '+' and the AND operator (whitespace) with a '*', we can feed our text into the Python ast module like so:
The "process" method, shown below, is what then traverses the tree and emits the necessary regular expression text:
(I've omitted the code for some of the helper methods but what you see here is the heart of the algorithm). So the final regular expression for the two rules about would look like this:
That second rule in particular is a doozy because it's using lookarounds which are a pain in the butt to try to manually derive.
The AST module emits a tree where each node has a type. So when traversing the tree, we just have to check which type of node we're dealing with and proceed accordingly. If it's a binary operator for example, such as the OR operation, we know we have to put a pipe (i.e. "|") between the two operands to form an "OR" in regular expression syntax. Similarly, AND and NOT are processed accordingly and since it's a tree, we can do this recursively. Neat.
(More documentation on the AST module can be found here.)
The final product is a very fast regular expression which allows us to categorize text quickly and accurately. Next post, I'm going to talk about semantic categorization (e.g. Tag all pieces of text that have to do with football or baseball under the "Sports" category) so stay tuned!