Benchmarks

This page shows performance benchmarks for the most popular API calls Repustate supports. The three most recent versions of Repustate are shown along with their respective throughput.

Performance is measured in kilobytes per second (KB/s) rather than "documents" or "requests" per second, because documents and requests vary widely in size and have no standard definition.
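As a rough illustration of the metric, the sketch below computes throughput in KB/s for an arbitrary processing function. The names `corpus_size_bytes`, `kb_per_second` and `measure_throughput` are hypothetical and are not part of Repustate's API. It also assumes that text size is counted in UTF-8 bytes and that 1 KB = 1024 bytes, neither of which is stated on this page.

```python
import time


def corpus_size_bytes(documents):
    """Total size of the documents in bytes (assumption: UTF-8 encoding)."""
    return sum(len(doc.encode("utf-8")) for doc in documents)


def kb_per_second(total_bytes, elapsed_seconds):
    """Convert a byte count and a duration to KB/s (assumption: 1 KB = 1024 bytes)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (total_bytes / 1024) / elapsed_seconds


def measure_throughput(process, documents):
    """Time `process` over every document and return the throughput in KB/s.

    `process` is a placeholder for whatever call is being benchmarked.
    """
    total_bytes = corpus_size_bytes(documents)
    start = time.perf_counter()
    for doc in documents:
        process(doc)
    elapsed = time.perf_counter() - start
    return kb_per_second(total_bytes, elapsed)
```

Because the result depends only on bytes processed and elapsed time, one 10 KB article and ten 1 KB comments contribute equally to the figure, which is what makes it comparable across corpora.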

Broadly speaking, Repustate is CPU-bound, so throughput increases with the number of CPU cores available. All benchmarks are carried out on a 16-core server with 2x Intel Xeon E5-2630 v3 processors running Ubuntu 17.10.



About the benchmarks

To test the performance of Repustate, a large corpus of text in all the languages Repustate supports is used. The corpus is a mix of long-form, well-written text (NYTimes, BBC, etc.), shorter text (Reddit comments, blog and message board comments), and short text that often has inconsistent formatting (all caps, every word title-cased, no punctuation, emoji, etc.). Some kinds of text are processed faster than others, so the corpus needs a good variety of writing styles and document sizes.

As an example of why variety is needed, take the task of named entity recognition. Consider these two sentences:

1) The Giants beat the Cowboys 20 - 15 last night.

2) The New York Giants beat the Dallas Cowboys 20 - 15 last night.

The first sentence will take longer to process because the terms "Giants" and "Cowboys" are fairly common and therefore need to be disambiguated using the context of the sentence, whereas "New York Giants" and "Dallas Cowboys" are already fully disambiguated.
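The difference between the two sentences can be seen in a toy sketch. This is not how Repustate performs named entity recognition. The alias table `CANDIDATES` and the helper functions are hypothetical, and the only point is that a short name maps to several candidate entities while a full name maps to one.

```python
# Hypothetical alias table: surface form -> entities it could refer to.
CANDIDATES = {
    "Giants": ["New York Giants", "San Francisco Giants"],
    "Cowboys": ["Dallas Cowboys", "Oklahoma State Cowboys"],
    "New York Giants": ["New York Giants"],
    "Dallas Cowboys": ["Dallas Cowboys"],
}


def find_mentions(sentence):
    """Find known aliases in the sentence, preferring the longest match."""
    mentions = []
    remaining = sentence
    for alias in sorted(CANDIDATES, key=len, reverse=True):
        if alias in remaining:
            mentions.append(alias)
            # Blank out the match so shorter aliases inside it are not counted.
            remaining = remaining.replace(alias, " ")
    return mentions


def ambiguous_mentions(sentence):
    """Return the mentions that have more than one candidate entity."""
    return [m for m in find_mentions(sentence) if len(CANDIDATES[m]) > 1]


short_form = "The Giants beat the Cowboys 20 - 15 last night."
full_form = "The New York Giants beat the Dallas Cowboys 20 - 15 last night."

print(ambiguous_mentions(short_form))  # two mentions need context to resolve
print(ambiguous_mentions(full_form))   # nothing left to disambiguate
```

In this sketch, every ambiguous mention stands for extra work that would need sentence context to resolve, which is why the first sentence costs more to process than the second.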

Having a variety of text samples from different sources ensures that the benchmarks cover the most common patterns of text seen in media today.