This is a (long) blog post about our experience at Repustate in migrating a big chunk of code from Python/Cython to Go. If you want to read the whole story, background and all, read on. If you're interested in just what Python developers need to know before taking the plunge, click the link below.
One of the best technological feats that we've pulled off here at Repustate was implementing Arabic sentiment analysis. Arabic is one tough nut to crack because of the complex morphological forms Arabic words can take. Tokenization (splitting a sentence up into individual words) is also tougher in Arabic than in, say, English, because Arabic words can contain whitespace within the word itself (e.g. the position of 'aleph' within a word). Without giving away our secret recipe, Repustate uses support vector machines (SVMs) to come up with the most likely meaning behind a sentence and then applies sentiment to that. In total, we use 22 models (i.e. 22 SVMs) and each word in a document is analyzed. So if you have 500 words in a document, that's more than 10,000 comparisons against the SVMs (500 × 22 = 11,000).
Repustate is almost entirely a Python shop; we use Django for the API and website. So it only made sense (at the time) to keep the code base homogeneous and implement all of the Arabic sentiment engine in Python as well. As far as prototyping and implementing goes, Python is hard to beat: it's very expressive, has awesome third-party libraries, and so on. If you're serving up web pages, it's perfect. But when you're doing low-level computations and lots of comparisons against hash tables (dictionaries in Python), things get slow. We were able to process about 2-3 Arabic documents per second, which is too slow. By comparison, our English language sentiment engine can do about 500 per second.
So we fired up the Python profiler and began investigating what was taking so long. Remember above how I said we have 22 SVMs and each word passes through all of them? Well, that was all done in serial, not in parallel. OK, our first thought was to change this to a map/reduce-like operation. TL;DR: The map/reduce idiom stinks in Python. When you need concurrency, Python is just not your friend. At PyCon 2013, Guido spoke about Tulip, his new project that was hoping to remedy this, but that's not due out for a while, and why wait when there's already something better out there?
My friend at Mozilla told me that Mozilla Services was switching over to Go for much of their logging infrastructure, in part because of the awesomeness of goroutines. Go was designed by the folks at Google and it was designed with concurrency as a first-class notion, not an afterthought, as Python's various solutions are. So we went about making the change from Python to Go.
While the Go code is not yet in production, the results are ridiculously encouraging. We're doing 1000 documents/s now, using WAY less memory, and not having to debug ugly multiprocess/gevent/"why won't Control-C kill my process" code that you get in Python.
Anyone who has a bit of an understanding of how programming languages work (interpreted vs. compiled, dynamic vs. static) will say, "Well duh, obviously Go is faster". Yeah, we could have re-written the whole thing in Java and seen similar improvements, but that's not why Go is such a winner. The code you write with Go just seems to be correct. I can't really put my finger on it, but somehow once the code compiled (and it compiles QUICKLY), you just get the feeling that it'll work (not just run without error, but even logically be correct). I know, that sounds very wishy-washy, but it's true. It's very similar to Python in terms of verbosity (or lack thereof) and it treats functions as first-class objects, so functional programming is easy to reason about. And of course, goroutines and channels make your life so much easier. So you get the performance boost of static typing and having finer control over memory allocation but you don't forfeit too much in expressiveness.
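To make the goroutines-and-channels point a bit more concrete, here's a minimal sketch of the "run every word through 22 models" shape described above, done first in serial and then with one goroutine per model. This is not Repustate's actual code: `Model`, `toyModels`, `scoreSerial` and `scoreConcurrent` are hypothetical names invented for illustration, and the "models" are toy functions standing in for real SVMs.

```go
package main

import (
	"fmt"
	"sync"
)

// Model is a hypothetical stand-in for one trained SVM: it maps a word to a score.
// Because functions are first-class values in Go, a model can simply be a func.
type Model func(word string) float64

// result carries one model's total back to the collector, tagged with its index.
type result struct {
	index int
	total float64
}

// scoreSerial runs every word through every model, one model after another.
func scoreSerial(models []Model, words []string) []float64 {
	totals := make([]float64, len(models))
	for i, m := range models {
		for _, w := range words {
			totals[i] += m(w)
		}
	}
	return totals
}

// scoreConcurrent starts one goroutine per model and gathers the totals
// over a buffered channel, so no goroutine blocks while sending its result.
func scoreConcurrent(models []Model, words []string) []float64 {
	results := make(chan result, len(models))
	var wg sync.WaitGroup
	for i, m := range models {
		wg.Add(1)
		go func(i int, m Model) {
			defer wg.Done()
			var total float64
			for _, w := range words {
				total += m(w)
			}
			results <- result{index: i, total: total}
		}(i, m)
	}
	wg.Wait()
	close(results)

	totals := make([]float64, len(models))
	for r := range results {
		totals[r.index] = r.total
	}
	return totals
}

// toyModels builds n placeholder models (assumption: a real SVM would
// evaluate a decision function here instead of using the word length).
func toyModels(n int) []Model {
	models := make([]Model, n)
	for i := range models {
		weight := float64(i + 1)
		models[i] = func(word string) float64 {
			return weight * float64(len(word))
		}
	}
	return models
}

func main() {
	models := toyModels(22)
	words := []string{"go", "is", "fast"}
	fmt.Println(scoreConcurrent(models, words))
}
```

Both functions return the same totals; the concurrent one just lets the Go runtime spread the 22 independent loops across the available cores. Whether one goroutine per model, per word, or per document works best depends on the workload, so treat the split here as one possible choice rather than a recommendation.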
With all the compliments out of the way, you really do need a different mindset at times when dealing with Go compared to Python. So here's a list of notes I kept as the migration took place - just random things that popped into my head when converting Python code to Go:
Yes, a million times, yes. The speed boost is just too good to pass up. Also, and this counts for something I think, Go is a trendy language right now, so when it comes to recruiting, I think having Go as a critical part of Repustate's tech stack will help.