Introduction to Deep Search

Repustate's Deep Search uses machine learning to analyze text for notable entities (businesses, people, brands, products, locations, historical events, etc.) and stores all of this information in a search index. Using an intuitive natural language search interface, you can search your content semantically instead of relying on the usual keyword and boolean approach. Deep Search works across all currently supported languages.

With Deep Search, queries that would otherwise have required complex boolean expressions and complicated data munging, or that were impossible to perform, become trivial. Using a simple interface, you can perform queries such as:

  • Find all documents mentioning American politicians negatively in Arabic, English or Russian.
  • Find all documents mentioning businesses with a market cap greater than $100 billion and whose stock went up by more than 10% in the last 30 days.
  • Find all documents mentioning a clothing brand or car brand positively.

As Repustate adds more languages, entities and metadata, the combinations of queries will grow and the possibilities become near endless.

Terminology

To better acquaint yourself with Repustate's Deep Search, here are some terms commonly used in this and other documentation:

Term            Description
Document        Any plain text you wish to analyze. The text must be in one of the currently supported languages. There is no minimum length, but the document must be smaller than 5MB.
Index           A collection of documents. You can have multiple indexes, but you can only search one at a time.
Entity          Any person, place or thing of note. Examples include Barack Obama, Toronto, iPhone XR, and Chocolate. There are over 6 million entities in Deep Search.
Metadata        An attribute of an entity that allows for more granular searching. Metadata examples are age, gender, place of birth, and stock price.
Classification  A semantic categorization of an entity. Entities can have one or more classifications. Example classifications in Deep Search are Person.politician, Person.pro_athlete, Location.city, Product.smartphone and Health.disease. There are over 200 classifications in Deep Search.

You can see a complete listing of all classifications, metadata and themes here.

Indexing

Getting your data into Deep Search is a simple and straightforward task:

  1. Create an index, giving it a unique name (you can have as many indexes as you like).
  2. Using Repustate's RESTful API, post your documents to the index you created in step 1.
  3. Documents are indexed in real-time - you're now ready to search!
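
The first two steps above can be sketched in Python as follows. This is a minimal illustration rather than Repustate's actual API: the base URL, the paths, the JSON field names and the Authorization header are all hypothetical placeholders, and the 5MB limit is assumed to mean 5 MiB of UTF-8 encoded text. Consult the API documentation for the real endpoints. The functions only build the requests; passing one to urllib.request.urlopen would send it.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder: the real base URL and paths are defined in
# Repustate's API documentation, not in this overview.
BASE_URL = "https://api.example.com/deepsearch"

# Assumption: "less than 5MB" is interpreted as 5 MiB of UTF-8 encoded text.
MAX_DOCUMENT_BYTES = 5 * 1024 * 1024


def _json_post(url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request. The bearer-token header is an assumption."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def build_create_index_request(api_key: str, index_name: str) -> urllib.request.Request:
    """Step 1: a request that would create an index with a unique name."""
    return _json_post(f"{BASE_URL}/indexes", api_key, {"name": index_name})


def build_add_document_request(api_key: str, index_name: str, text: str) -> urllib.request.Request:
    """Step 2: a request that would post one plain-text document to the index."""
    size = len(text.encode("utf-8"))
    if size >= MAX_DOCUMENT_BYTES:
        raise ValueError(f"document is {size} bytes; it must be smaller than 5MB")
    # Percent-encode the index name so it is safe to use as a URL path segment.
    safe_name = urllib.parse.quote(index_name, safe="")
    return _json_post(f"{BASE_URL}/indexes/{safe_name}/documents", api_key, {"text": text})
```

Checking the document size before sending avoids a round trip for a document the service would reject anyway.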

Searching

As soon as your documents are indexed they are ready to be searched. There are two components to searching:

  1. Constructing a properly formatted query, and
  2. Sending the query to the correct search index.

Deep Search has two methods of constructing queries and the method you choose depends on your use case and user profile:

  1. Natural language search e.g. "Find politicians who are American and mentioned negatively in Arabic,English,Russian"
  2. Deep Search Query Language (DSQL) e.g. "Person.politician.nationality:American sentiment:neg lang:ar,en,ru"
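
If you generate DSQL programmatically, a small helper can assemble the query string. The sketch below reproduces only the term shapes visible in the example above (an attribute:value term, a sentiment term and a comma-separated lang term). The function name and parameters are hypothetical, and the full DSQL grammar may support more than this.

```python
from typing import Optional, Sequence


def build_dsql(
    attribute: str,
    value: str,
    sentiment: Optional[str] = None,
    languages: Sequence[str] = (),
) -> str:
    """Join DSQL terms with spaces, mirroring the example query's layout."""
    terms = [f"{attribute}:{value}"]
    if sentiment:
        terms.append(f"sentiment:{sentiment}")
    if languages:
        # Language codes are comma-separated with no spaces, as in the example.
        terms.append("lang:" + ",".join(languages))
    return " ".join(terms)


query = build_dsql(
    "Person.politician.nationality",
    "American",
    sentiment="neg",
    languages=["ar", "en", "ru"],
)
print(query)  # Person.politician.nationality:American sentiment:neg lang:ar,en,ru
```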

Regardless of how you structure your queries, Deep Search indexes can be searched in one of two ways:

  1. Using the Deep Search API. Searches are sent over HTTPS and responses are returned as JSON
  2. Using the Deep Search JavaScript plugin. This renders a search bar in any web page and comes with smart suggestions and hints on how to structure your queries. This is recommended if you're exposing the search feature to outside users. Responses are JSON and can be rendered as you see fit via callbacks.
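
For the first option, a search over HTTPS might look like the sketch below. The endpoint URL and the "index" and "query" parameter names are hypothetical placeholders, not the documented API; only the general shape (an HTTPS request carrying the query, a JSON response) comes from the description above.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder endpoint; see the Deep Search API docs for the real one.
SEARCH_URL = "https://api.example.com/deepsearch/search"


def build_search_url(index_name: str, query: str) -> str:
    """Encode the target index and the query as URL parameters (names assumed)."""
    params = urllib.parse.urlencode({"index": index_name, "query": query})
    return f"{SEARCH_URL}?{params}"


def search(index_name: str, query: str, timeout: float = 10.0):
    """Send the query over HTTPS and decode the JSON response body."""
    url = build_search_url(index_name, query)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Because each request names a single index, this matches the rule that only one index can be searched at a time.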

Read the Deep Search docs.

Themes

Deep Search tries to identify any themes that are present in a document. A theme is a high-level topic, such as sports or health. The list of available themes is:

arts automotives business entertainment
finance food health home
law military music politics
religion science sports fashion
technology transportation weather

You can search your index for any documents that match one or more of the above themes.
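
Since the set of themes is fixed, it can be worth validating theme names on the client before building a query. The helper below is a hypothetical convenience, not part of Deep Search; it simply checks names against the list above.

```python
from typing import Iterable, List

# The themes listed above, copied verbatim.
THEMES = frozenset({
    "arts", "automotives", "business", "entertainment",
    "finance", "food", "health", "home",
    "law", "military", "music", "politics",
    "religion", "science", "sports", "fashion",
    "technology", "transportation", "weather",
})


def validate_themes(requested: Iterable[str]) -> List[str]:
    """Return the normalized theme names, or raise if any name is not a known theme."""
    normalized = [name.strip().lower() for name in requested]
    unknown = sorted(set(normalized) - THEMES)
    if unknown:
        raise ValueError("unknown theme(s): " + ", ".join(unknown))
    return normalized
```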

Geolocation

In addition to entities, themes and sentiment, Deep Search extracts any geolocation data it can infer from the text of a document. If a specific landmark, neighbourhood, city, state, province or country is mentioned, the document is augmented with the lat/long coordinates for the location. This allows you to query your documents by specifying a geofence and requiring all matching documents to mention a location that falls within the geofence.

The Deep Search Query documentation contains more information about using this feature.
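
To make the geofence idea concrete, the sketch below tests whether a lat/long pair falls inside a rectangular fence. It illustrates the concept only and is not Deep Search's implementation: the rectangular shape, the class name and the example box are assumptions, and the check does not handle fences that cross the 180th meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoFence:
    """A rectangular fence given by its south-west and north-east corners, in degrees."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        # Simple range check on each axis; assumes west <= east (no antimeridian crossing).
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# An illustrative box drawn loosely around the Toronto area.
fence = GeoFence(south=43.4, west=-79.8, north=44.0, east=-79.0)

print(fence.contains(43.6532, -79.3832))  # Toronto: True
print(fence.contains(45.5019, -73.5674))  # Montreal: False
```

A document mentioning Toronto would match a query using this fence, while one mentioning only Montreal would not.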

Customization

While Deep Search maintains an exhaustive list of entities, it is possible to add your own. Using the Deep Search API, you can add any entity you like and it will be immediately available for analysis. Custom entities are private to your account and no other customer will have access to them.

Data & Privacy

Any data stored on Repustate's servers using the cloud version of Deep Search is kept private, secure and only for your use. At no time does any third party have access to your Deep Search data, nor do Repustate's own employees ever inspect customer data unless specifically requested to by the customer.

If you decide to close your account or delete an index, all data and any backups are deleted forever.

On Premise

Deep Search is also available as an on-premise install. In this scenario, all of your documents are indexed into a backend of your choosing on servers you manage. The API calls are exactly the same as when using Repustate's public cloud offering; the only difference is where the data is stored. Currently, Deep Search supports the following backends:

  • PostgreSQL and any database that implements the PostgreSQL wire protocol (e.g. CockroachDB)
  • MySQL and any database that implements the MySQL wire protocol (e.g. MariaDB, MemSQL)
  • Oracle
  • SQLite
  • Microsoft SQL Server
  • Solr
  • MongoDB
  • HBase

For more information about how the on-premise install works, its technical requirements and configuration, please see the Repustate Server documentation.