Repustate is now a part of Sprout Social.

Repustate's Named Entity Recognition API is the best in the business - and we can prove it.

Overview

Repustate takes pride in offering a Named Entity Recognition API that is both accurate and fast. This API is the lifeblood of Deep Semantic Search, as it allows Deep Search to identify millions of entities across all supported languages in just a few milliseconds and make them all searchable.

But don't just take our word for it. We've compiled a benchmark to test other vendors' offerings against Repustate's Named Entity Recognition capabilities to see which is best.

All of the code is open source as well as the input data.

The Contenders

Google Cloud NLP
Microsoft Azure Cognitive
Dandelion
Aylien
Amazon Comprehend
spaCy
TextRazor

Evaluation Criteria

Criterion	Details
Entity detection coverage	Are all entities present in the text recognized? This tests simple exact matches, but also context sensitive disambiguations, Twitter handles, ticker symbols, aliases, acronyms, nicknames etc.
Granularity of entity classification	We evaluate how specifically the entities are classified. Does the vendor differentiate between Location types (cities, countries, rivers etc.) or are all locations just tagged as "Location"?
Language coverage	The sample data includes text in many languages, including some tricky ones like Arabic and Japanese. We tested to see which vendors handled these languages properly.
Speed of API	The amount of time it takes to process the test data set.
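
As an illustration of the first criterion, here is a minimal sketch of how a coverage score could be computed for one sentence. It is an assumption made for clarity, not the scoring code used in this benchmark: the `normalize` and `coverage` helpers and the sample lists are hypothetical, and matching is a simple case-insensitive name comparison.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so 'Brian  Cox' matches 'brian cox'."""
    return " ".join(name.lower().split())


def coverage(expected: list[str], detected: list[str]) -> float:
    """Fraction of expected entities that appear in the detected list."""
    if not expected:
        return 1.0  # nothing to find, so nothing was missed
    found = {normalize(d) for d in detected}
    hits = sum(1 for e in expected if normalize(e) in found)
    return hits / len(expected)


# Hypothetical sample: the entities a sentence should yield,
# and what an imaginary vendor returned for it.
expected = ["Brian Cox", "HBO", "Succession"]
detected = ["brian cox", "HBO"]

score = coverage(expected, detected)
print(f"Coverage: {score:.0%}")
```

A stricter variant would also require the detected entity to resolve to the right real-world identity, which is what the context disambiguation cases test.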

Results

Below are our findings. All of these results can be reproduced using the source code and sample data provided.

VENDOR	ACCURACY	GRANULARITY	LANGUAGES	SPEED (MS)
Repustate	95%	✓	23	60
Google Cloud NLP	75%	❌	10	1070
Amazon Comprehend	67%	❌	6	160
Dandelion	63%	❌	7	250
TextRazor	61%	❌	12	240
Microsoft Azure Cognitive	50%	❌	6	3210
spaCy	45%	❌	7	30¹
Aylien	42%	❌	6	150

¹ spaCy is run locally while all other providers are over HTTP. As such, there's no network latency in spaCy's time.
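
To make the footnote concrete, the sketch below shows how a fixed network round trip shows up in a wall-clock timing. It is a toy under stated assumptions, not the benchmark's timing code: `local_call` and `remote_call` are hypothetical stand-ins, and the 50 ms delay is an invented figure rather than a measured latency.

```python
import time
from typing import Callable


def time_ms(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time of fn() in milliseconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def local_call() -> None:
    """Stand-in for an in-process library call (no network involved)."""
    sum(range(10_000))


def remote_call() -> None:
    """Stand-in for an HTTP API call: the same work plus a simulated round trip."""
    time.sleep(0.05)  # assumed 50 ms of network latency
    local_call()


local_ms = time_ms(local_call)
remote_ms = time_ms(remote_call)
print(f"local: {local_ms:.1f} ms, remote: {remote_ms:.1f} ms")
```

The gap between the two numbers is almost entirely the simulated round trip, which is why a locally run library and an HTTP API are not directly comparable on speed.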

Summary

None of the vendors resolve Twitter usernames to their underlying entity (e.g. @realDonaldTrump → Donald Trump).
Only TextRazor detected stock ticker symbols (e.g. $AAPL → Apple Inc.).
Context disambiguation accuracy varied depending on how hard the sentence was. None of the vendors were able to pick up all entities in the sentence "My YC app: Dropbox - Throw away your USB drive". "YC" (Y Combinator) in particular was never picked up.
Google's NLP offering is ... confusing. Sometimes it works really well, and other times it is confusingly poor. In the sentence "Brian Cox starts in the HBO show 'Succession'", Google didn't pick up the show name (Succession) and it selected the wrong Brian Cox.
Not all vendors know about aliases or detect common spelling mistakes. "Montreal Canadians" was correctly resolved to "Montreal Canadiens" only by Amazon, Google and TextRazor.
None of the vendors were able to associate noun "qualifiers" that appeared later in a sentence to nouns appearing earlier. For example in the sentence "Blue, Brown, Orange, Green and Red Lines were running normally." - the qualifier "Lines" applies to "Blue", "Brown", "Orange" and "Green" which could then be inferred to be subway lines and therefore the ones in Chicago.
None of the vendors classify entities at anywhere near the level of granularity of Repustate. Microsoft Azure does, however, classify numerical values at a very granular level (temperature, percentages, dates etc.). This granularity surpasses Repustate's for numerical values. However, classifying all people as just being "People" or "Person" omits so much context and severely limits how much analysis can be done. Repustate classifies people, for example, into as many as 55 classifications.
Some vendors mention that future releases will support more languages, but at publication time none supported Named Entity Recognition in all 23 languages Repustate supports, and none supported any of the really hard languages (Arabic, Japanese, Korean). A few supported Chinese but with limited entity coverage.
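
For readers wondering what resolving a Twitter handle or ticker symbol involves, here is a minimal sketch of the idea. It is an assumption for illustration only and does not describe how Repustate or any other vendor implements it: the `HANDLES` and `TICKERS` tables are tiny hypothetical lookups, where a real system would need a large, maintained knowledge base.

```python
import re

# Hypothetical lookup tables, keyed by lowercase handle and uppercase ticker.
HANDLES = {"realdonaldtrump": "Donald Trump"}
TICKERS = {"AAPL": "Apple Inc."}

# "@" or "$" must not follow a word character, which skips e-mail addresses.
HANDLE_RE = re.compile(r"(?<!\w)@(\w{1,15})")
TICKER_RE = re.compile(r"(?<!\w)\$([A-Z]{1,5})\b")


def resolve_mentions(text: str) -> list[tuple[str, str]]:
    """Return (surface form, entity name) pairs for known handles and tickers."""
    results = []
    for match in HANDLE_RE.finditer(text):
        entity = HANDLES.get(match.group(1).lower())
        if entity:
            results.append((match.group(0), entity))
    for match in TICKER_RE.finditer(text):
        entity = TICKERS.get(match.group(1))
        if entity:
            results.append((match.group(0), entity))
    return results


pairs = resolve_mentions("@realDonaldTrump tweeted about $AAPL today")
print(pairs)
```

Pattern matching only finds the candidates. The hard part, and the part the benchmark exercises, is having the knowledge to map each candidate to the right entity.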

Where Repustate Leads

Repustate's Named Entity Recognition API provides much more specific classification of entities than any other offering.
Classifying locations by their specific type or people by their profession opens the door for better content recommendation and document similarity comparisons.
Most of the vendors have "Other" or "Misc" catch-all classifications which are quite useless in any real-world application.
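
A toy example shows why granularity matters for document similarity. The dotted labels such as `Person.politician` are hypothetical and are not Repustate's actual taxonomy; the sketch simply compares two documents by the set of entity types they contain, once with fine labels and once with the labels collapsed to their top level.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def coarse(label: str) -> str:
    """Collapse a fine label like 'Person.politician' to its top level, 'Person'."""
    return label.split(".", 1)[0]


# Hypothetical entity-type profiles of two documents.
doc_a = {"Person.politician", "Location.city"}
doc_b = {"Person.athlete", "Location.city"}

fine_sim = jaccard(doc_a, doc_b)
coarse_sim = jaccard({coarse(t) for t in doc_a}, {coarse(t) for t in doc_b})
print(f"fine: {fine_sim:.2f}, coarse: {coarse_sim:.2f}")
```

With coarse labels the two documents look identical, since both contain a person and a location. With fine labels the politics story and the sports story are clearly distinguished.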

Conclusion

Use Repustate. Joking aside, Google's Cloud NLP performs quite well and has good language support. It's a bit slow compared to the others, but if your dataset isn't too big, that speed hit shouldn't be too bad. If for whatever reason you can't use Repustate, we recommend Google ... But seriously, use Repustate.