26 Sep 2019
03 Aug 2021
Chatbots are designed to answer queries or accomplish specific tasks through natural language. Conversational commerce apps have to handle a wide variety of query types, from very detailed to generic:
fruit juice? \
I search for Voelkel apfel. \
Could you suggest me pale ale beers and vanilla ice cream for my party?
Queries may contain groups of words that carry a specific meaning as a whole:
Please, give me a red bull.
Red bull does not refer to a colored bull. In the context of a conversational app for groceries, it unequivocally points to the energy drink.
The app should return search results that are truly pertinent to what the user is looking for. It should also tell the user when the query terms do not exactly match what the merchant can offer, or when they ask for something that is completely off-scope.
Where does the information needed to tell one meaning of a word sequence (a collocation) from another come from? It can be obtained by processing a large amount of text specific to the domain of interest, in this case food. Statistical analysis of that text makes it possible to extract named entities, which can then be used to pull query terms out of text statements.
However, domain-specific text archives are a valuable resource that can be costly and hard to acquire.
The idea revolves around three key processes: collocation gathering, product indexing, and key-term extraction from user queries. The indexing process collects features from the product names, while the search process matches the relevant collocations against the text query.
In the era of Big Data, raw and structured representations of human knowledge are becoming available at a faster pace than ever. Open databases such as DBpedia, Wiktionary, WordNet and OpenCyc can be loosely described as stores of general common-sense knowledge. They are aggregated in ConceptNet.io, a structured ontology database organized as a semantic network. Collocations and words are linked through a dense network of relationships such as "is part of", "is capable of" and "is a type of". Relations and types can easily be filtered to obtain the corpus of terms needed for a specific domain, in our example food. Collocations can thus be looked up in those databases instead of being mined from big corpora that might be unavailable or very hard to process.
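To make the filtering step more concrete, here is a minimal sketch in Python. It assumes the edges have already been exported as `(start, relation, end)` triples. The relation names follow ConceptNet's `/r/...` naming, but the sample edges and the `domain_terms` helper are hypothetical and only serve as an illustration.

```python
# Hypothetical sample of ConceptNet-style edges: (start, relation, end).
# The edges are illustrative, not pulled from the real database.
EDGES = [
    ("red bull", "/r/IsA", "energy drink"),
    ("energy drink", "/r/IsA", "drink"),
    ("drink", "/r/IsA", "food"),
    ("vanilla ice cream", "/r/IsA", "ice cream"),
    ("ice cream", "/r/IsA", "dessert"),
    ("dessert", "/r/IsA", "food"),
    ("wheel", "/r/PartOf", "car"),
    ("dog", "/r/CapableOf", "bark"),
]


def domain_terms(edges, root, relations=frozenset({"/r/IsA"})):
    """Collect every term that reaches `root` through the kept relations."""
    # Map each concept to the concepts that point at it.
    children = {}
    for start, relation, end in edges:
        if relation in relations:
            children.setdefault(end, set()).add(start)

    # Walk the graph backwards from the root concept.
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


food_terms = domain_terms(EDGES, "food")
```

With this sample, `food_terms` contains both single words and collocations such as `red bull` and `vanilla ice cream`, while unrelated concepts like `wheel` are left out.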
Once the relevant collocations are acquired, the next step is to describe how incoming queries (like the examples at the beginning) can be processed to filter out the key terms we are interested in.
I identified some desirable features the solution should provide:
- Query entities selection. When the query contains more than one entity cluster, the conversational agent detects it and asks the user to choose which entity to search for first. For example: give me a red bull and a ...
- Partial term matching. The user is told that nothing matches the exact criteria, but a lower-ranking match is offered. For example, in give me vanilla ice cream, the specific vanilla ice cream is not available, but ice cream is.
- Terms off scope. The user is told that the requested item is not for sale. For example: I'm looking for an ...
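The three behaviours listed above could be sketched as a small reply function. This is only an illustration under my own assumptions: each entity cluster is a list of candidate terms ordered from longest to shortest, each with a `catalog` flag, and the function names and reply wording are hypothetical.

```python
def reply_for_cluster(cluster):
    """Pick a reply for one entity cluster (candidates ordered longest first)."""
    wanted = cluster[0]["term"]
    for candidate in cluster:
        if candidate["catalog"]:
            if candidate["term"] == wanted:
                return f"Here are the results for {wanted}."
            # Partial term matching: a shorter candidate is in the catalog.
            return f"{wanted} is not available, but {candidate['term']} is."
    # Terms off scope: nothing in the cluster is sold by the merchant.
    return f"Sorry, {wanted} is not for sale here."


def reply(clusters):
    """Pick a reply for the whole query."""
    if not clusters:
        return "Sorry, I could not find any product in your request."
    if len(clusters) > 1:
        # Query entities selection: let the user pick what to search first.
        names = " or ".join(cluster[0]["term"] for cluster in clusters)
        return f"Which one should I search first: {names}?"
    return reply_for_cluster(clusters[0])


ice_cream = [
    {"term": "vanilla ice cream", "catalog": False},
    {"term": "ice cream", "catalog": True},
]
red_bull = [{"term": "red bull", "catalog": True}]

partial = reply([ice_cream])
choice = reply([red_bull, ice_cream])
```

Here `partial` is the partial-match message for vanilla ice cream, and `choice` asks the user to pick between the two clusters.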
Term extraction from the product catalog and from the user's text query share the same procedures, described below: lemmatization and n-gram factorization.
Lemmatization returns the root form of an inflected word. For example, runs and running both point to the same root, run.
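As a rough illustration, lemmatization can be pictured as a lookup from inflected forms to roots. The table below is a toy, hypothetical stand-in: a real system would rely on a proper lemmatizer (for instance the WordNet-based one shipped with NLTK) rather than a hand-written dictionary.

```python
# Toy lemma table, for illustration only.
LEMMAS = {
    "runs": "run",
    "running": "run",
    "beers": "beer",
    "juices": "juice",
}


def lemmatize(word):
    """Return the root form of a word, or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)


def lemmatize_text(text):
    """Lemmatize every whitespace-separated token of a text."""
    return [lemmatize(token) for token in text.split()]


tokens = lemmatize_text("pale ale beers")
```

After this step, `pale ale beers` in a query and `pale ale beer` in a product name are reduced to the same tokens.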
Indexing product catalog
While the product names from the catalog are parsed, each name is split into tokens, and the tokens are finally stored in an in-memory set (Bloom filters).
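The indexing step might look like the sketch below. It is not the original implementation: the `BloomFilter` class, its sizes, and the choice to store every contiguous word sequence of a product name are assumptions made for the example. A Bloom filter never misses an item that was added, but it can occasionally report an item that was not.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter (hypothetical sizes, one byte per bit for clarity)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, item):
        # Derive several positions by salting a SHA-256 hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def index_catalog(product_names):
    """Store every contiguous word sequence of each product name."""
    index = BloomFilter()
    for name in product_names:
        tokens = name.lower().split()
        for n in range(1, len(tokens) + 1):
            for i in range(len(tokens) - n + 1):
                index.add(" ".join(tokens[i:i + n]))
    return index


catalog_index = index_catalog(["Ice Cream", "Red Bull"])
```

A lookup is then a plain membership test, such as `"ice cream" in catalog_index`.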
N-Grams extraction from search query
An n-gram is a contiguous sequence of n words. In the example above, could you suggest me vanilla ice cream for my party, the collocation vanilla ice cream is expanded as:
vanilla ice cream,
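The expansion of a phrase into its n-grams can be sketched with a short helper. The function name `ngrams` and the longest-first ordering are my own choices for this example.

```python
def ngrams(tokens, max_n=None):
    """Return all contiguous n-grams of a token list, longest first."""
    max_n = max_n or len(tokens)
    result = []
    for n in range(min(max_n, len(tokens)), 0, -1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result


grams = ngrams("vanilla ice cream".split())
# grams == ["vanilla ice cream", "vanilla ice", "ice cream",
#           "vanilla", "ice", "cream"]
```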
The system weights the filtered items by length: longer first. The more consecutive collocation terms are detected, the better the search output will be. In the end, the search output looks like this:
entity clusters:
  term: vanilla ice cream
  catalog: false
  term: ice cream
  catalog: true
  term: party
  catalog: false
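An output of this shape could be produced along the following lines. This is a sketch under assumptions: `VOCABULARY` stands for the collocations gathered from the knowledge base, `CATALOG` for the indexed product terms (plain Python sets here), and both contain hypothetical sample data.

```python
def ngrams(tokens):
    """Return all contiguous n-grams of a token list, longest first."""
    result = []
    for n in range(len(tokens), 0, -1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result


def extract_terms(query, vocabulary, catalog):
    """Keep the known n-grams of a query and flag those sold in the catalog."""
    tokens = query.lower().replace("?", " ").split()
    hits = []
    for gram in ngrams(tokens):  # longest n-grams come first
        if gram in vocabulary:
            hits.append({"term": gram, "catalog": gram in catalog})
    return hits


# Hypothetical sample data.
VOCABULARY = {"vanilla ice cream", "ice cream", "party"}
CATALOG = {"ice cream"}

hits = extract_terms(
    "could you suggest me vanilla ice cream for my party",
    VOCABULARY,
    CATALOG,
)
```

Because the n-grams are generated from longest to shortest, `hits` lists vanilla ice cream before ice cream and party, without a separate sorting step.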