At Repustate, much of our data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how do we keep things fast for the client as well as light as possible for the server. We also distribute our software as virtual machines to some customers so memory usage has to be light because we can't control how much memory our customers will allocate to the VMs they deploy.
To summarize, here's our checklist of requirements:
So our first attempt was to store the models on disk in a MongoDB and to load them into memory as Python dictionaries. This worked and satisfied #3 and #4 but failed #1 and #2. This is how Repustate operated for a while, but memory usage kept growing and it became unsustainable. Python dictionaries are not memory efficient. And it was too expensive for each Apache process to need a copy of this since we were not sharing the data between processes.
One night I was complaining about our dilemma and a friend of mine, who happens to be a great developer at Red Hat, said these three words: "memory mapped file". Of course! In fact, Repustate already uses memory mapped files but I completely forgot about this. So that solves half my problem - it meets requirements #2. But what format does the memory mapped file take? Thankfully computer science has already solved all the world's problems and the perfect data structure was already out there: tries.
Tries (pronounced "trees" for some reason and not "try's") AKA radix trees AKA prefix trees are a data structure that lend themselves to objects that need string keys. Wikipedia has a better write up but long story short, tries are great for the type of models Repustate uses.
I found this package, marisa tries, which is a Python wrapper around a C++ implementation of a marisa trie. "Marisa" is an acronym for Matching Algorithm with Recursively Implemented StorAge. What's great about marisa tries is the storage mechanism really shrinks how much memory you need. The author of the Python plugin claimed 50-100X reduction in size - our experience is similar.
What's great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server's memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python's dictionary implementation.
Next time you're in need of sharing large amounts of data, give memory mapped tries a chance.