Giancarlo Frison Signals from the Noise

Catalog Entity Extraction for Search

Keyword extraction from search queries is a fundamental aspect of conversational commerce. In this article I illustrate a simple but effective way to get relevant entities from user’s utterances and rank them against an unstructured product catalog and an ontology database.

The primary purpose of a conversational application is to serve user demands, and when an user search in a e-commerce context, he is mostly looking for products. There is one main distinction that characterize a query when it is performed in the website rather than a messaging application. In the website, when users submit a query they already express their search intention, therefore the terms are usually concise and descriptive. Conversely, when inquiring a Chatbot, users use more expressive forms such as: Could you suggest me pale ale beers and ice creams for my party?. While the intention is deducted by a classification task, relevant terms for search, are just a subset of the entire sentence.

Baseline approach for searching, would be to take all text as query, returning innumerable hits of everything even remotely relevant, providing little help for customers. Another solution regards Named Entity Recognition, a class of algorithms that seeks and classify entities, also by means of neural networks.

While machine learning techniques can reach high levels of accuracy, they might not be the favorite solution for production usage. They require hardly available training data, and what will work for a specific product segment will not work for another. That is why the following approach could offer the flexibility demanded for real use case scenarios. It is easily plugged in any e-commerce without any particular adaptation. This method is very simple. I don’t consider structured product features, rather I take in account only simple and concise information that is obtained just by the product name.

I want to extract the features that might affect the Chatbot’s answer, based on the quality of the search query.

It is very plausible to give straights result when the query is really pertinent to returned item list, as well as informing the users whenever the query terms do not match exactly with what we can offer them, or even when the query terms demand for something we cannot provide. The desirable features are:

  • Query entities selection. When in the query there are more than one entity cluster, the conversational agent will be able to detect it and to ask the user to choose with entity will search first. For example, in query above there 2 terms: pale ale beers and ice creams. For example the Chatbot could answer:

    Are you searching for pale ale beer or ice cream?

  • Partial term matching. The user is prompted that the exact criteria does not match, but a less ranking one is provided. Pale ale beers is not in catalog, but ale beer yes.

    We don’t have pale ale beer, but just ale beer. These are our suggestions:…

  • Term out of market segment. Prompt the user that the inquired item is not sold by this shopping website.

    We do not sell insurance, sorry.

Indexing and searching tasks

The two fundamental tasks in information retrieval are the one for collecting and storing product informations, and on the other side, the task for obtaining them. Index phase collects features from the products’ name, while the search phase extracts matches from text query. Both tasks manipulate text in the following ways: Entities clusterization, Part Of Speech filtering, Lemmatization.

Entities clusterization

The objective is to isolate every entity within their search space (or features) that refines the query. For doing that, I use stop words (irrelevant terms such as articles, prepositions, adverbs) and some punctuations (full stops, semicolon, exclamation and question marks) to split the entire sentence into word clusters

could you suggest me pale ale beers and ice creams for my party

This rainbowed sentence assume me, and, for as stop words for tokenizing the possible entities clusters.

Part Of Speech filtering

Clusters previously obtained are filter by their Part Of Speech (POS) classification. The POS tagging assigns to each word their definition as noun, verb, adjective, adverb. I explicitly exclude verbs, adverbs and pronouns. This is why could you suggest is excluded since it is entirely formed by ignored words. The output is represented as:

could you suggest / pale ale beers / ice creams / party


Lemmatization refers to the process of returning the root form of inflected words, in order to facilitate the analysis and the search of those terms. For example, “Finds” and “found” are grouped together as “find”. In this way, cluster entities are turned into:

pale ale beer(s) / ice cream(s) / party

Catalog indexing

Text manipulation, as above described, occurs both for storing the catalog data and for querying.

In the indexing phase, when all catalog is scanned, parsed and tokenized, all particles will be stored into a Set. A Set is a collection of distinct items. For efficiently storing the presence of a particular cluster, bloom filters play a fundamental role.

Bloom Filters

How to check if a n-gram is present in the product list? Bloom filters solve the problem on storing large Set in a fixed and pre-defined sized vector. By the algorithm, an element is converted in some numeric values (h) and set true in a bit vector, at the h position.

Bloom filters allow to compress a large amount of source data, negotiating a grade of uncertainty.

How could be validated the presence of the element in the bit array? Just checking if the vector is true/false in the h position. That gives the certainty whether the element is not present, or, whether vector checking is positive, a determined confidence degree that such element is present. The true positive probability depends on the vector length and the number of hashes.

N-gram generation

An n-gram is a contiguous sequence of n words. I generate all possible combination of n-grams out of word clusters. The emphasized words are the result of this generation:

Beer Ice cream Party
pale ale beer ice cream party
pale ale ice  
ale beer cream  

Once the n-grams are generated, it is fast to check if one of them is present by inquiring the catalog bloom filter. For each entity cluster, we can check n-grams starting from the longest, in order to prioritize what exactly the user wants. We want to know also if the exact entity cluster is not present but only an its sub-gram. Moreover, we need to deal with such queries that asks products or services not offered by the given catalog market segment.

Ontology database

How can we check whether in the query there are valid terms but they are not treated by us? ConceptNet could be the answer. For this purpose more than 400K terms have been collected among several categories and indexed as the catalog terms in a separate database.


At the end of this process the final output will look like to this:

entity clusters:
  term: pale ale beer
  catalog: false
    term: ale beer
    catalog: true

  term: ice cream
  catalog: true

  term: party
  catalog: false

I have described a simple way for extracting query terms from a raw sentence. This approach provides useful information that could be managed by an conversational engine for corroborating search results with meaningful answers. On the other hand, this model doesn’t handle with misspellings, which represent alone about 15% of online search failures. This technique doesn’t deal with relatedness matching, or semantic matching. That means we can’t satisfy the search with relevant and pertinent results whenever customers use different terms from those in the website. I have already solved this problem by means of neural networks, and I will describe it in another article.

Acknowledgment Thank you Sidi for the contribution.