Overview

Repustate takes pride in offering a Named Entity Recognition API that is both accurate and fast. This API is the lifeblood of Deep Search: it lets Deep Search identify millions of entities across all supported languages in just a few milliseconds and make them all searchable.

But don't just take our word for it. We've compiled a benchmark that tests other vendors' offerings against Repustate's Named Entity Recognition capabilities to see which is best.

All of the code and the input data are open source.

The Contenders

Evaluation Criteria

We compared each vendor to Repustate according to the following criteria:

Criterion | Details
Entity detection coverage | Are all entities present in the text recognized? This tests simple exact matches, but also context-sensitive disambiguations, Twitter handles, ticker symbols, aliases, acronyms, nicknames, etc.
Granularity of entity classification | We evaluate how specifically the entities are classified. Does the vendor differentiate between Location types (cities, countries, rivers, etc.), or are all locations just tagged as "Location"?
Language coverage | The sample data contains samples from many languages, including some tricky ones like Arabic and Japanese. We tested to see which vendors handled these languages properly.
Speed of API | The amount of time it takes to process the test data set.
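To make the criteria concrete, here is a minimal sketch of how a single sample could be scored for coverage, granularity and speed. It is not the benchmark's actual code: `coverage`, `distinct_types`, `timed_ms` and the `fake_vendor` stub are hypothetical names chosen for illustration, and a real run would call each vendor's API client instead of the stub.

```python
import time
from typing import Callable, Iterable, List, Tuple

Entity = Tuple[str, str]  # (entity name, entity type)


def coverage(expected: Iterable[str], detected: Iterable[Entity]) -> float:
    """Fraction of expected entity names the vendor returned (case-insensitive)."""
    want = {name.casefold() for name in expected}
    got = {name.casefold() for name, _etype in detected}
    if not want:
        return 1.0
    return len(want & got) / len(want)


def distinct_types(detected: Iterable[Entity]) -> int:
    """Rough granularity signal: how many different type labels were used."""
    return len({etype for _name, etype in detected})


def timed_ms(extract: Callable[[str], List[Entity]], text: str) -> Tuple[List[Entity], float]:
    """Run one extraction call and return (entities, elapsed milliseconds)."""
    start = time.perf_counter()
    entities = extract(text)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return entities, elapsed_ms


def fake_vendor(text: str) -> List[Entity]:
    # Hypothetical stand-in for a vendor API client; it always misses "YC".
    return [("Dropbox", "Organization")]


sample = "My YC app: Dropbox - Throw away your USB drive"
entities, ms = timed_ms(fake_vendor, sample)
score = coverage(["YC", "Dropbox"], entities)
print(f"coverage={score:.0%} types={distinct_types(entities)} elapsed={ms:.3f} ms")
```

Matching on case-folded names keeps the sketch short; scoring aliases or disambiguated entities would need to compare canonical identifiers rather than surface strings.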

Results

Below are our findings. All of these results can be reproduced using the source code and sample data provided.

Vendor | Accuracy [1] | Granularity | Languages [2] | Speed (ms) [3]
Google Cloud NLP | 75% | | 10 | 1070
Amazon Comprehend | 67% | | 6 | 160
Dandelion | 63% | | 7 | 250
TextRazor | 61% | | 12 | 240
Microsoft Azure Cognitive | 50% | | 6 | 3210
spaCy | 45% | | 7 | 30 [4]
Aylien | 42% | | 6 | 150

[1] Repustate is the baseline at 100% accuracy on the sample data.
[2] Number of languages out of the 22 covered by Repustate.
[3] Repustate averages 60ms per API call, including network latency.
[4] spaCy is run locally, while all other providers are called over HTTP. As such, there's no network latency in spaCy's time.
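One plausible reading of the first footnote is that each vendor's accuracy is its score expressed as a percentage of the baseline's score. The sketch below shows that arithmetic only; the function name and the counts are hypothetical and are not taken from the benchmark data.

```python
def relative_accuracy(vendor_hits: int, baseline_hits: int) -> float:
    """Vendor score as a percentage of the baseline's score."""
    if baseline_hits <= 0:
        raise ValueError("baseline must have at least one correct entity")
    return 100.0 * vendor_hits / baseline_hits


# Hypothetical counts of correctly detected entities, for illustration only.
hits = {"baseline": 40, "vendor_a": 30, "vendor_b": 18}
table = {name: relative_accuracy(n, hits["baseline"]) for name, n in hits.items()}

for name, pct in table.items():
    print(f"{name}: {pct:.0f}%")
```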

Summary

  • None of the vendors resolve Twitter usernames to their underlying entity (e.g. @realDonaldTrump → Donald Trump).
  • Only TextRazor detected stock ticker symbols (e.g. $AAPL → Apple Inc.).
  • Context disambiguation accuracy varied depending on how hard the sentence was. None of the vendors were able to pick up all entities in the sentence "My YC app: Dropbox - Throw away your USB drive". "YC" (Y Combinator) in particular was never picked up.
  • Google's NLP offering is ... confusing. Sometimes it works really well, and other times it is surprisingly poor. In the sentence "Brian Cox starts in the HBO show 'Succession'", Google didn't pick up the show name (Succession) and selected the wrong Brian Cox.
  • Not all vendors know about aliases or detect common spelling mistakes. "Montreal Canadians" was correctly resolved to "Montreal Canadiens" by only Amazon, Google and TextRazor.
  • None of the vendors were able to associate noun "qualifiers" that appeared later in a sentence with nouns appearing earlier. For example, in the sentence "Blue, Brown, Orange, Green and Red Lines were running normally.", the qualifier "Lines" also applies to "Blue", "Brown", "Orange" and "Green", which could then be inferred to be subway lines, and therefore the ones in Chicago.
  • None of the vendors classify entities at anywhere near the level of granularity of Repustate. Microsoft Azure does, however, qualify numerical values at a very granular level (temperature, percentages, dates, etc.). This granularity surpasses Repustate's for numerical values. However, classifying all people as just "People" or "Person" omits a great deal of context and severely limits how much analysis can be done. Repustate classifies people, for example, into as many as 55 classifications.
  • Some vendors mention that future releases will support more languages, but at publication time none supported Named Entity Recognition in all 22 languages Repustate supports, and none supported any of the really hard languages (Arabic, Japanese, Korean). A few supported Chinese, but with limited entity coverage.
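Two of the cases above, resolving handles and ticker symbols and distributing a trailing qualifier across a list, can be sketched with a few regular expressions. This is a toy illustration under stated assumptions, not how Repustate or any vendor implements it: the `HANDLES` and `TICKERS` tables are hypothetical stand-ins for a real knowledge base, and the list pattern only handles the simple "A, B, C and D Xs" shape.

```python
import re

# Hypothetical alias tables; a real system would need a far larger knowledge base.
HANDLES = {"realdonaldtrump": "Donald Trump"}
TICKERS = {"AAPL": "Apple Inc."}

HANDLE_RE = re.compile(r"(?<!\w)@(\w{1,15})")
TICKER_RE = re.compile(r"(?<!\w)\$([A-Z]{1,5})\b")
# Capitalized items separated by ", " and a final " and ", then a plural qualifier.
LIST_RE = re.compile(r"((?:[A-Z]\w+, )+[A-Z]\w+ and [A-Z]\w+) ([A-Z]\w+s)\b")


def resolve_social_tokens(text: str) -> list:
    """Map @handles and $TICKERS to canonical entity names when they are known."""
    found = []
    for match in HANDLE_RE.finditer(text):
        name = HANDLES.get(match.group(1).lower())
        if name:
            found.append(name)
    for match in TICKER_RE.finditer(text):
        name = TICKERS.get(match.group(1))
        if name:
            found.append(name)
    return found


def expand_shared_qualifier(text: str) -> list:
    """Attach the trailing qualifier (naively singularized) to every list item."""
    match = LIST_RE.search(text)
    if not match:
        return []
    items = re.split(r", | and ", match.group(1))
    singular = match.group(2)[:-1]  # naive: drops the final "s" only
    return [f"{item} {singular}" for item in items]


print(resolve_social_tokens("@realDonaldTrump tweeted about $AAPL"))
print(expand_shared_qualifier("Blue, Brown, Orange, Green and Red Lines were running normally."))
```

Expanding the list into "Blue Line", "Brown Line" and so on is only the first step; deciding that these are transit lines in a particular city still requires a disambiguation step that this sketch does not attempt.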

Where Repustate trails

Repustate's Named Entity Recognition API only classifies entities it has seen before. Many of the other vendors detect people, places, and organizations based on their position in a sentence, grammatical structure, etc. As a result, Repustate's model has a tendency to miss lesser-known entities.
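The trade-off can be illustrated with a toy comparison between a lookup against known entities and a positional cue. Everything here is an assumption made for illustration: the `KNOWN` table, the "CEO of" pattern and the made-up company name "Zyntrix" do not come from any vendor's implementation.

```python
import re

# Hypothetical knowledge base of entities the model has "seen before".
KNOWN = {"dropbox": "Company", "montreal canadiens": "Sports team"}

# Naive positional cue: a capitalized word following "CEO of".
CEO_OF = re.compile(r"\bCEO of ([A-Z]\w+)")


def lookup_entities(text: str) -> list:
    """Dictionary-based detection: only names present in KNOWN can be returned."""
    lowered = text.casefold()
    # Substring matching is deliberately simplistic; real systems tokenize first.
    return [(name, kind) for name, kind in KNOWN.items() if name in lowered]


def pattern_entities(text: str) -> list:
    """Context-based detection: guesses an organization from sentence position."""
    return [(m.group(1), "Organization") for m in CEO_OF.finditer(text)]


text = "She is the CEO of Zyntrix, a rival to Dropbox."
print(lookup_entities(text))   # finds only the known entity
print(pattern_entities(text))  # finds the unseen name, but with a coarse type
```

The lookup returns a specific type for the entity it knows and nothing for the one it has never seen, while the pattern catches the unseen name but can only label it generically.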

Another weak spot for Repustate (and, to be fair, for every other vendor too) is detecting the titles of TV shows, books, movies and songs in any context. Often, these types of entities are very common words and require quite a bit of surrounding context to properly disambiguate.

Where Repustate leads

Repustate's Named Entity Recognition API provides much more specific classification of entities than any other offering. Classifying locations by their specific type or people by their profession opens the door for better content recommendation and document similarity comparisons. Most of the vendors have "Other" or "Misc" catch-all classifications which are quite useless in any real-world application.
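As a rough illustration of why specific types help with document similarity, the sketch below compares two documents by the Jaccard overlap of their entity types. The type labels are hypothetical and are not Repustate's actual taxonomy; the point is only that coarse labels can make unrelated documents look identical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical labels: one document about a politician and a country,
# another about an athlete and a city.
doc_a_coarse = {"Person", "Location"}
doc_b_coarse = {"Person", "Location"}
doc_a_fine = {"Person.Politician", "Location.Country"}
doc_b_fine = {"Person.Athlete", "Location.City"}

coarse_score = jaccard(doc_a_coarse, doc_b_coarse)
fine_score = jaccard(doc_a_fine, doc_b_fine)
print(f"coarse={coarse_score:.2f} fine={fine_score:.2f}")
```

With coarse labels the two documents are indistinguishable, while the finer labels show they have nothing in common at the type level.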

Conclusion

Use Repustate. Joking aside, Google's Cloud NLP performs quite well and has good language support. It's a bit slow compared to the others, but if your dataset isn't too big, that speed hit shouldn't be too bad. If for whatever reason you can't use Repustate, we recommend Google ... But seriously, use Repustate.