Repustate takes pride in offering a Named Entity Recognition API that is both accurate and fast. This API is the lifeblood of Deep Search; it allows Deep Search to identify millions of entities across all supported languages in just a few milliseconds and make them all searchable.
But don't just take our word for it. We've compiled a benchmark that tests other vendors' offerings against Repustate's Named Entity Recognition capabilities to see which is best.
Both the code and the input data are open source.
We compared each vendor to Repustate according to the following criteria:
|Entity detection coverage||Are all entities present in the text recognized? This tests simple exact matches, but also context-sensitive disambiguations, Twitter handles, ticker symbols, aliases, acronyms, nicknames, etc.|
|Granularity of entity classification||We evaluate how specifically the entities are classified. Does the vendor differentiate between Location types (cities, countries, rivers, etc.), or are all locations just tagged as "Location"?|
|Language coverage||The sample data includes text in many languages, including some tricky ones like Arabic and Japanese. We tested to see which vendors handled these languages properly.|
|Speed of API||The amount of time it takes to process the test data set.|
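To make the first and last criteria concrete, here is a minimal sketch of how coverage and speed could be measured. It is not the benchmark's actual code: `coverage`, `timed_run`, and `fake_vendor` are hypothetical names, the sample sentence is invented, and a real run would swap `fake_vendor` for a call to a vendor's API.

```python
import time
from typing import Callable, Iterable, List, Set, Tuple


def coverage(expected: Set[str], detected: Set[str]) -> float:
    """Fraction of expected entities that were detected (case-insensitive)."""
    if not expected:
        return 1.0
    expected_norm = {e.casefold() for e in expected}
    detected_norm = {d.casefold() for d in detected}
    return len(expected_norm & detected_norm) / len(expected_norm)


def timed_run(
    extract: Callable[[str], Set[str]], texts: Iterable[str]
) -> Tuple[List[Set[str]], float]:
    """Run `extract` on every text; return the results and elapsed milliseconds."""
    start = time.perf_counter()
    results = [extract(text) for text in texts]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return results, elapsed_ms


def fake_vendor(text: str) -> Set[str]:
    """Hypothetical stand-in for a vendor client: it only knows two entities."""
    known = {"Toronto", "AAPL"}
    return {name for name in known if name in text}


# Invented sample data: one sentence and the entities we expect to find in it.
texts = ["Toronto traders bought AAPL and followed @repustate."]
expected = [{"Toronto", "AAPL", "@repustate"}]

results, elapsed_ms = timed_run(fake_vendor, texts)
scores = [coverage(e, d) for e, d in zip(expected, results)]
print(f"coverage per document: {scores}, total time: {elapsed_ms:.2f} ms")
```

In this toy run the stand-in vendor misses the Twitter handle, so it scores two out of three on the single document. Timing the whole loop, rather than individual calls, mirrors the "time to process the test data set" criterion.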
Below are our findings. All of these results can be reproduced using the source code and sample data provided.
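|Vendor||Entity detection coverage||Granularity of entity classification||Language coverage||Speed of API|
|Google Cloud NLP||75%||❌||10||1070|
|Microsoft Azure Cognitive||50%||❌||6||3210|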
1 Repustate is the baseline at 100% accuracy on the sample data
2 Number of languages out of the 22 covered by Repustate
3 Repustate averages 60ms per API call including network latency
4 spaCy is run locally while all other providers are over HTTP. As such, there's no network latency in spaCy's time
Repustate's Named Entity Recognition API only classifies entities it has seen before. Many of the other vendors detect people, places, and organizations based on their position in a sentence, grammatical structure, and so on. As a result, Repustate's model has a tendency to miss lesser-known entities.
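The trade-off can be illustrated with a toy lookup-based recognizer. This is only a sketch of the general "known entities" approach, not Repustate's implementation: the `GAZETTEER` entries, the type labels, and the `lookup_entities` function are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Toy dictionary of known entities; entries and labels are illustrative only.
GAZETTEER: Dict[str, str] = {
    "toronto": "Location.City",
    "nile": "Location.River",
    "google": "Organization.Company",
}


def lookup_entities(text: str) -> List[Tuple[str, str]]:
    """Return (entity, type) pairs for single words found in the gazetteer."""
    found = []
    for token in text.split():
        word = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        label = GAZETTEER.get(word.casefold())
        if label is not None:
            found.append((word, label))
    return found


print(lookup_entities("Zorblatt Industries opened an office in Toronto."))
# → [('Toronto', 'Location.City')]
```

The invented company "Zorblatt Industries" is skipped because it is not in the dictionary, even though its position in the sentence makes it look like an organization. A model that relies on sentence structure could catch it, at the cost of less specific labels.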
Another weak spot for Repustate (and, to be fair, for every other vendor too) is detecting the titles of TV shows, books, movies, and songs in any context. Often, these titles are made up of very common words and require quite a bit of surrounding context to disambiguate properly.
Repustate's Named Entity Recognition API provides much more specific classification of entities than any other offering. Classifying locations by their specific type or people by their profession opens the door for better content recommendation and document similarity comparisons. Most of the vendors have "Other" or "Misc" catch-all classifications which are quite useless in any real-world application.
Use Repustate. Joking aside, Google's Cloud NLP performs quite well and has good language support. It's a bit slow compared to the others, but if your dataset isn't too big, that speed hit shouldn't be too bad. If for whatever reason you can't use Repustate, we recommend Google ... But seriously, use Repustate.