Named entity recognition using spatial clustering

Our Customer Success Stories

Repustate has helped organizations worldwide turn their data into actionable insights.

Learn how these insights helped them increase productivity, customer loyalty, and sales revenue.

See all Stories

Table of Contents

Here’s a simple task: given the sentence “Any good restaurants near Ballard or Fremont?” figure out what the person is referring to. Where are Ballard and Fremont? What are Ballard or Fremont?

This (relatively) simple task illustrates the challenges of AI and machine learning. How do we train a language model to understand the intent of the query and resolve any entities as accurately as a person would but much faster?

Step by Step process

The first step is identify any candidates for entities. There are a variety of ways of doing this from analyzing the grammar of the sentence via a part of speech tagger to training a neural network to detect entity candidates based on a pre-tagged corpus. In our simple example, the two entities we’re most concerned with are “Ballard” and “Fremont”. A part of speech tagger would have tagged both of these as being proper nouns, which are usually great candidates for named entity recognition.

OK now that we have our candidates - what (or where?) is a Ballard and Fremont? The use of the word “near” is a hint that Ballard and Fremont are probably locations and not people (e.g. Glen Ballard). If we look up a list of all locations called Ballard and Fremont, you get locations ranging from Ballard, Utah to Ballard, Ireland and from Fremont, Nebraska to Fremont, Ohio. How do we know the right ones? Spatial clustering to the rescue

If we plotted all of the locations containing Ballard and Fremont on a map, you’d see a bunch of dots all over the place. Here is where we have to apply a bit of common sense and heuristics and assume the query isn’t asking for restaurant recommendations in Utah or Ireland, but two locations called Ballard and Fremont that are close to one another. Fortunately, there is an algorithm called DBSCAN which does just that. Given a list of points, this algorithm identifies which points are clustered the closest to one another. Once you have your list of clusters, sort them by the smallest distance between the points in a given cluster and you have your answer.

To illustrate this in practice: Ballard, California is 423 km from Fremont, California. Ballard the neighbourhood in Seattle is 4.9 km from Fremont, the neighbourhood in Seattle. Our algorithm thus identifies the Seattle neighbourhoods Ballard & Fremont as the most likely entities given the context and would return a list of restaurants in the Seattle area.

This relatively simple semantic search exercise demonstrates the hidden complexity of machine learning and the great challenges we face in trying to emulate, and surpass, human knowledge.

Categories

Our Customer Success Stories

Step by Step process

Join leading companies using Repustate

Related Articles

Top 10 Word Cloud Generators

Top 8 Data Analysis Companies

Top 10 Data Cleaning Techniques for Better Results