Giancarlo Frison Signals from the Noise

The Basic Principles of Language

What is this exhilarating noise that comes out of my mouth when I talk? It is surely not that the precise sequence of sounds, pops and squeezes is particularly melodic; rather, thanks to the palace of sophistication erected around language, we can talk and afford a wide range of expressions. Since I began, somewhat erratically, to explore natural language processing, I have been wondering how language comes so naturally to us while being extremely complicated from a computational perspective. What has caught my curiosity is the nature of language and the fundamental aspects that might have shaped the rudimentary ‘Me Tarzan, you Jane’, the sentence that paraphrases the earliest and simplest level of language.

The difficulty of studying the evolution of language is that evidence of its early forms is sparse. Spoken languages don’t leave fossils. Moreover, all existing languages, including those of the most remote tribes, are already sophisticated. Contemporary languages have large vocabularies and refined grammatical structures, and can express almost anything with a remarkable richness of detail. Even in the written records collected so far, dating back 5,000 years or so, things look almost the same as they do now. Linguists have studied how communication changes over time and inferred what it might have looked like when the first rudimentary steps toward language were taken. What are the basic principles of language without which the whole towering edifice would immediately collapse like a house of cards? I will introduce them with a simple composition, which could hardly be taken as an example of eloquence, but which nobody would find difficult to understand:

I supermarket enter      basket bring      pick fresh fruit

I go cashier      pay cashier basket      bring bag      quit

As you may notice, there are no grammatical elements (prepositions, conjunctions, adverbs, plurals, tenses, relative clauses, complement clauses) to glue the sentences together, nor any abstract terms. Nonetheless, the proto-sentence remains comprehensible thanks to a very few natural principles that arrange the words. Those principles crystallized in our brains millions of years before language was even conceived by our ancestors. Evolution wired them into our cortex to facilitate communication. The first principle is that the earliest distinctions in language came from the concrete world: actions, things, and the pointing words used to refer to them in space. The second principle concerns the sequentiality of events and, as one can imagine, it affects the ordering of words. The third is about the economy of communication: meanings and references are contextualized within the sentence.

Pointing words

Pointing words help to refer to or locate something in space. They are this, that, here and there, and their reference depends on where the actors are. What is this for me could be that for you, because of the relative positions of object and subject. These referencing words are compelling not only because children use them as an accompaniment to the pointing gesture, reinforcing the intimate link between the physical world and its mental representation in developing brains. Pointing words, unlike other grammatical terms, do not originate from anything other than pointing words. They are root, core concepts.

Things, actions

The sample text suggests that early languages were restricted to simple words, those involving only concrete entities in the here and now. The distinction between things and actions is also part of the social intelligence and world representation that we share with other primates, so this conceptual distinction was already in place. Even metaphors, which account for a large share of the words in our dictionaries, turn out to have concrete origins: they evolved from elements of the physical environment.

Order of words

Another basic principle of any language relies on a single strategy: the ordering of words. What belongs together in reality also appears close together in language, and follows the same sequence. It is natural to describe an action as the central word between two participants. Between the actor and the patient (on whom the action is performed), the order is the ordinary mapping from reality to language. Consider, for example, Caesar’s Principle: I came, I saw, I conquered (veni, vidi, vici). The saying is attributed to Julius Caesar after a victory. The order of the words is clearly not accidental; it reflects the sequence of actions in the real world.

Context

The third principle is concerned with repetition. What has already been stated, or is not particularly important, does not need to be repeated. What can be understood or inferred from the context may be omitted from the sentence. This follows the principle of least effort, which also applies to language. Had I written the story like this:

I supermarket enter      I bring basket      I pick fruit      I quit

the redundancy of the subject would be truly annoying, in any language. Several ways of keeping track of the participants in a conversation have been invented; take pronouns, for example.

Concept Search by Word Embeddings

Catalog search is one of the most important factors in the success of e-commerce sites, and accurate, relevant results are critical to successful conversion.
The following approach aims to reduce user frustration by presenting related products when the searched items are not available in the catalog. The central hypothesis is that a user might buy products with characteristics similar to those of the product originally searched for, turning the search into a purchase.

Search engines find relevant matches for a query according to various information-retrieval algorithms. Those systems find text occurrences but, regardless of their effectiveness, they are strictly bound to the terms provided by the catalog. Therefore, products cannot be retrieved by words that are not already present in the inventory.

Concept matching (a sub-domain of semantic search) refers to retrieving instances based on their meaning. Associating terms with an acceptable degree of relatedness pivots around these key points:

  1. Knowledge gathering. Where can semantic relations among words be identified?
  2. Concept extraction. How can those relations be extracted and then predicted?

The processing applied to the data to obtain the features we need is called word embedding.

Word embedding is a very popular term, undoubtedly because of the contribution of the deep learning community. It is associated with research on distributional semantics, the branch of study concerned with deriving semantic similarities between words from their distributional properties.

“A word is characterized by the company it keeps.” (J. R. Firth)

Algorithms (such as the well-known skip-gram, CBOW and GloVe) are employed to train models that predict words as they sequentially appear in a given text corpus. As a result, the word embedding model converts a single word into a list of numbers, a vector. Analogous words are represented by similar vectors, and cosine similarity, the cosine of the angle between two word vectors, scores the relatedness between two words.
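
To make the scoring concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and the words in `toy_vectors` are invented for illustration; real embeddings are learned from a corpus and usually have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any trained model.
toy_vectors = {
    "chardonnay": [0.9, 0.1, 0.3],
    "riesling": [0.8, 0.2, 0.4],
    "hammer": [0.1, 0.9, 0.0],
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With these toy values, `most_similar("chardonnay", toy_vectors)` ranks riesling ahead of hammer, because the first two vectors point in nearly the same direction.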

Concept Matching Algorithm

Concept Search Diagram

In the example above, the user submits the search query Chardonnay, which is unknown to the catalog but has similar terms in the word embeddings. Some of them might exist in the catalog, and the matching products are returned to the user.

algorithm retrieve_alternatives is
  input: unrecognized term query, word vectors embeddings
  output: ordered list of products and ranking

  query_embeddings ← get similarities of query from embeddings
  results ← empty
  for each w in query_embeddings:
    result ← search by w
    result.ranking ← result.ranking * w.score
    append result to results
  return results sorted by ranking
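
The pseudocode can be sketched in Python as follows. This is only an illustration under assumptions: `similar_terms` stands in for the embedding model and `search` for the catalog search engine, and both, like the toy wine data, are hypothetical.

```python
def retrieve_alternatives(query, similar_terms, search):
    """Return catalog hits for terms related to an unrecognized query.

    similar_terms: maps a word to a list of (related_word, score) pairs.
    search: callable returning a list of (product, ranking) pairs for a term.
    Both are hypothetical stand-ins for the embedding model and search engine.
    """
    results = []
    for word, score in similar_terms.get(query, []):
        for product, ranking in search(word):
            # Weight the engine's ranking by the relatedness of the term.
            results.append((product, ranking * score))
    return sorted(results, key=lambda hit: hit[1], reverse=True)

# Toy data, invented for illustration only.
similar_terms = {
    "chardonnay": [("riesling", 0.9), ("merlot", 0.7), ("sauvignon", 0.8)],
}
catalog = {
    "riesling": [("Riesling 2015", 1.0)],
    "merlot": [("Merlot Reserve", 1.0)],
}

def search(term):
    """Pretend search engine: exact lookup in the toy catalog."""
    return catalog.get(term, [])

alternatives = retrieve_alternatives("chardonnay", similar_terms, search)
```

Here sauvignon is a related term but has no product in the toy catalog, so only the riesling and merlot products are returned, ordered by weighted ranking.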

Topic-specific Embeddings

Word embeddings are obtained by processing a huge quantity of text, called a corpus (plural corpora). Several large, structured sets of texts are available for creating word embeddings, such as the Google News corpus and Wikipedia, as well as word vectors already trained on those corpora. Since the quality of word embeddings reflects the corpus from which they were generated, I purposely created a topic-specific corpus specialized in food, by scanning more than 600 food blogs and collecting roughly 40 MB of prepared text. That amount of text is tiny in comparison with Google News, but it is nonetheless enough for computing similarity in the narrow range of catalog queries. Preparing the corpus includes removing everything but words, case conversion and sentence tokenization. I chose fastText for computing the text representations: it uses sub-word information to build vectors for unknown words and, as the name suggests, it is really fast.
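
The preparation steps mentioned above (keeping only words, case conversion, sentence tokenization) might look like the sketch below. The regular expressions are my own simplifying assumptions, not the actual pipeline: sentences are split naively on end punctuation, and anything that is not a letter is dropped.

```python
import re

def prepare_corpus(raw_text):
    """Split text into sentences, lowercase them and keep only words."""
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    prepared = []
    for sentence in sentences:
        # Keep runs of letters (with inner apostrophes); drop digits,
        # punctuation and symbols.
        words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", sentence.lower())
        if words:
            prepared.append(" ".join(words))
    return prepared

lines = prepare_corpus("Add 2 eggs! Don't overmix. Bake for 30 minutes.")
```

Each returned string is one cleaned sentence, a shape that is convenient to write out one sentence per line before training.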

This solution has been filed as “System, computer-implemented method and computer program product for information retrieval” at the European Patent Office. It is applicable to many different domains, such as clothing, automobile and electronics retail, just by obtaining a properly specialized corpus from which word similarity can be inferred.

Stochastic Conversational Workflows

Traditionally, user interfaces are a series of screens and forms for exchanging information with the user. Most applications start with a main screen from which users navigate using breadcrumbs, menus and buttons such as back and forward. This paradigm remained almost unaltered with the coming of hypertext, where one may jump from one page or dialog to another through visual links that are immediately accessible. Chatbots shift UX towards a conversational hypertext that produces the appearance of having a conversation with the computer. People can interact naturally and, since everyone already knows at least one natural language, nobody needs any training for it.

The casualness of the medium contrasts with the complexity of the structured, and sometimes cumbersome, functions needed to achieve a specific goal. While web users can easily switch to a new goal-oriented scenario with a click, in a messaging application this is mainly done by texting. Conversational applications usually implement workflows not with screens or forms but by piling new dialog scenarios onto the conversational stack. Technically, it is something like stacking finite state machines, in which every layer represents a particular task. When the current task is accomplished, its dialog state closes and is removed from the stack. Conversational workflows may be managed by state-machine engines implemented directly in the chatbot or, alternatively, by existing flow managers such as Dialogflow (formerly API.AI), Wit.ai and LUIS.ai (Language Understanding Intelligent Service), whose web dashboards let designers set up conversation processes.
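
The stack of finite state machines can be pictured with a small sketch. The class names, dialog names and steps below are hypothetical, and each state machine is reduced to a linear list of steps; this is not how any of the mentioned flow managers is implemented.

```python
class Dialog:
    """A tiny finite state machine: an ordered list of steps to complete."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = list(steps)
        self.position = 0

    @property
    def current_step(self):
        return self.steps[self.position]

    def advance(self):
        """Consume one step and report whether the dialog is finished."""
        self.position += 1
        return self.position >= len(self.steps)


class DialogStack:
    """Stack of dialogs; only the top one handles the next user message."""

    def __init__(self):
        self._stack = []

    def push(self, dialog):
        self._stack.append(dialog)

    def depth(self):
        return len(self._stack)

    def handle_message(self):
        """Advance the active dialog and pop it once it is accomplished."""
        if not self._stack:
            return None
        active = self._stack[-1]
        step = active.current_step
        if active.advance():
            self._stack.pop()
        return (active.name, step)


stack = DialogStack()
stack.push(Dialog("checkout", ["confirm_cart", "pay"]))
stack.push(Dialog("search", ["ask_product"]))  # interrupts the checkout
trace = [stack.handle_message() for _ in range(3)]
```

The trace shows the search dialog being completed and removed first, after which the interrupted checkout resumes from where it stopped.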

Conventional Dialog Stack

Even though it might be tempting to assume that users will follow the exact logical sequence of steps defined by the bot’s designer, this rarely happens. The austere interface of a Chatbot does not constrain users to a pre-defined schema; it neither prevents nor discourages them from behaving as they naturally do with other people: express a demand, ask for information, change the search criteria for a product, maybe ask again, and eventually pay or just abort the process.

Rule-based vs stochastic workflow process

People do not communicate in stacks. They tend to jump from one subject to another almost at random. Users may decide to do something entirely different, no matter how the process flow has been structured. They may ask questions unrelated to the current procedure, or cancel it and then start over again. It is natural for humans to switch topics during a dialogue, for whatever reason.

Even though it is easy to reach significant outcomes with little effort by using one of the NLU (Natural Language Understanding) systems mentioned above, those modelers quickly show their limitations because of their rigidity in activation and behavior. In other words, conversational workflows may need to be hard-coded on the basis of a few discriminating features, mainly a limited set of user intentions. It may be almost impossible to manage programmatically the large number of features that can affect a dialog. Further, variations of such processes may result in growing complexity, becoming unmanageable over time.

New conversational model

The real world is a stubborn place. It is complex in ways that resist abstraction and modeling. It notices and reacts to our attempts to affect it. Nor can we hope to examine it objectively through pre-defined rules or programmed state machines.

The fundamental aspect that cannot be ignored is the probabilistic nature of the model that will serve those dialogs.

The complexity of the system increases with the number of variables that lead the conversation to one function instead of another. Hence the need to move away from rule-based systems and embrace uncertainty, probability and statistics.

Once the limitations of those approaches are unveiled, we are ready to approach context and sequentiality in a completely different way. We should let machine learning do what it does best: calculate predictions over large amounts of input.

The system should be able to replicate human behavior by learning from segments of real conversations.

The model should query the system to get more information and make more plausible decisions about what to present to the user. With those premises, machine learning enters the game, and it comes in the form of neural networks.

Text classification

Neural networks are adaptable systems whose ability to learn comes from varying the strength of the connections between their artificial neurons. They are basically universal function approximators. I have previously described neural networks for intention classification that use convolutions to grasp the semantics behind a user’s sentences.

Additional input

To help the classifier, it can be given, in addition to the raw text, a set of discrete features carrying information that has some predictive relation with the action to choose. That information could be related to the text itself, but it could also relate to the user profile or to the ongoing marketing campaigns promoted by the merchant.

For example, should a particular search request be directed to a special promotion? Can the time of day affect the query selection, because at night camomile sells better than black tea? Just feed the classifier with those features, and let ML do the hard work.
The new conversational model should not just understand the single intention of a phrase, but evaluate it in the overall context of the user’s engagement with the Chatbot.
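
As a rough illustration of such discrete features, the sketch below encodes the time of day, the active campaign and a customer flag as a one-hot vector that could be passed to a classifier alongside the text. The campaign names, the day-part boundaries and the returning-customer flag are all assumptions made up for this example.

```python
CAMPAIGNS = ["none", "summer_sale", "tea_week"]  # hypothetical campaigns
DAY_PARTS = ["morning", "afternoon", "evening", "night"]

def day_part(hour):
    """Map an hour (0-23) to a coarse part of the day (arbitrary cut-offs)."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 23:
        return "evening"
    return "night"

def one_hot(value, vocabulary):
    """Encode a categorical value as a list with a single 1.0."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

def context_features(hour, campaign, is_returning_customer):
    """Concatenate the discrete features into one numeric vector."""
    return (one_hot(day_part(hour), DAY_PARTS)
            + one_hot(campaign, CAMPAIGNS)
            + [1.0 if is_returning_customer else 0.0])

features = context_features(23, "tea_week", True)
```

The resulting vector has a fixed length of eight regardless of the input, which is what makes it easy to concatenate with a learned representation of the sentence.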

Time series models

The concept of time series depends on the idea that past behavior can be used to predict future behavior. In sequence-based models, the output is determined not just by the last input, as in regression predictive models, but also by the inputs that preceded it. Such a predictor should have some interesting characteristics. The latest inputs should affect the final output more than those far back in time, and the model should be able to override, remove and remember qualifiers along the sequence. Long short-term memory (LSTM) units come to our rescue. LSTMs can remember values over arbitrary event series, and they are a more sophisticated extension of recurrent neural networks.

Multi-task learning (MTL)

Consider a hypothetical recommender that has learned to predict your preferences about cars. A single-task system could be trained to give a single output, the car model that matches your profile. A multi-task model could be trained to return not only the model’s name but also the color, the engine type and the accessories. MTL aims to solve multiple classification tasks simultaneously within the same neural network. We can view MTL as a form of transfer learning in which commonalities and differences across tasks are exploited to improve the overall learning efficiency. It seems that neural networks love diversity: the more tasks they learn, the more accurate their predictions become, compared with training on each task separately. The conversational engine I created uses multi-task learning. It does not merely return a single label; it returns a fine-grained set of parameters that add expressiveness to the behavior of the Chatbot.

Neural network architecture

My proposal is to use both CNN and LSTM in a fully-connected neural network, in order to combine the classification qualities of the CNN with the sequentiality of the LSTM. This means that the meaning of a user’s utterance is not considered in isolation, as in the current state of the art of Chatbot classifiers, but is evaluated in the context of the conversation. Fully-connected here means that the different layers of the network (CNN, LSTM) are affected by the same feed-forward and back-propagation iterations.

[Figure: Stochastic Conversational Workflows]

While the CNN extracts a relevant representation of the user’s input, other types of input can be fed into the LSTM layer, as illustrated. In this way, the meaning of the user’s sentence is evaluated together with a set of additional parameters that can affect the outcome of a particular conversational step.

Context and Sequentiality in Conversational Applications

Contextual memory plays a central role in any type of interaction between a Chatbot and its user. Conversation is a bidirectional transfer of information in which the interlocutors are aware of the relational, environmental and cultural context of the exchange. I will show some examples of how a context-based system might improve the flow of the dialog.

Consider the following sentence:

If you prefer salty food, tiramisù is a bad dessert to eat.

If we quote only the part that says:

tiramisù is a bad dessert to eat.

The resulting sense is completely misleading. The speaker appears to be saying simply that tiramisù is a bad dessert for everybody, while it is a bad choice only for those who don’t like sweets and prefer salty food instead. Even though the quote is taken directly from the original sentence, it omits important information that you need in order to understand what the speaker is really saying. That omitted information is the context.

Context is the set of circumstances surrounding a message.

In the realm of conversational applications, where users converse with Chatbots, context is a fundamental aspect of any type of interaction, from goal-oriented to one-shot tasks. Let’s consider a simple case in which the same sentence appears in different contexts, with consequently different outcomes.

Take the following sentence: Add bread into cart. The system should react according to what the user has previously entered, more precisely:

  • If the user has just searched for a product → Drive the user to tap the related ‘Add to Cart’ button of the desired product.
  • In all other cases → Search for bread.
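
The expected behavior for this sentence can be written down as a hand-coded rule, which is useful as a reference even though the point of the approach is to learn such behavior rather than program it. The intent and action labels below are hypothetical names chosen for the example.

```python
def react_to_add_to_cart(previous_intents):
    """Choose the reaction to 'Add bread into cart' from the dialog history.

    previous_intents: intents recognized so far, oldest first.
    """
    last_intent = previous_intents[-1] if previous_intents else None
    if last_intent == "search_product":
        # The user has just searched: point them to the 'Add to Cart' button.
        return "prompt_add_to_cart_button"
    # In all other cases, treat the sentence as a search request.
    return "search_product"
```

Only the most recent intent matters in this rule: a search followed by an unrelated message falls back to searching for bread again.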

Context as customer engagement

The sequentiality of the interaction between customer and Chatbot is the key to understanding the user’s intention in any situation. In the online buying process, improving the conversion rate and accelerating checkouts makes the real difference between the success and failure of e-commerce initiatives. Conversational systems can tie customer engagement and purchase together in a very short cycle. Smart sequencing, when seamlessly embedded in the processes, can lead to new purchase opportunities.

Charly context dialog flow

Exploring contextual repetition

A sequence is a collection of events that might contain repetition. Repetitive inputs might signal requests that were not satisfied or not understood. In this odd example we can simulate a behavioral pattern in which an interlocutor gets upset by an annoying, repetitive progression of unsolicited opinions, as I might rhetorically describe the following conversation:

In this article I have summarized why the ability to apply contextual intelligence should be an intrinsic skill of any conversational application, in any scenario. It is the proficiency to automatically adapt responses to what the user is demanding during a conversation. All these example cases are actually implemented in Charly. How I made it will be the subject of a following article. Stay tuned!

Catalog Entity Extraction for Search

Keyword extraction from search queries is a fundamental aspect of conversational commerce. In this article I illustrate a simple but effective way to get relevant entities from users’ utterances and rank them against an unstructured product catalog and an ontology database.

The primary purpose of a conversational application is to serve user demands, and when users search in an e-commerce context, they are mostly looking for products. There is one main distinction between a query performed on a website and one performed in a messaging application. On a website, users who submit a query have already expressed their intention to search, so the terms are usually concise and descriptive. Conversely, when addressing a Chatbot, users use more expressive forms such as: Could you suggest me pale ale beers and ice creams for my party? While the intention is deduced by a classification task, the terms relevant for the search are just a subset of the entire sentence.

A baseline approach to searching would be to take the whole text as the query, which returns innumerable hits for everything even remotely relevant and provides little help to customers. Another solution is Named Entity Recognition, a class of algorithms that seek and classify entities, including by means of neural networks.

While machine learning techniques can reach high levels of accuracy, they might not be the preferred solution for production use. They require training data that is hard to obtain, and what works for one product segment will not necessarily work for another. That is why the following approach could offer the flexibility demanded by real use-case scenarios. It can easily be plugged into any e-commerce site without particular adaptation. The method is very simple: I don’t consider structured product features; rather, I take into account only the simple, concise information obtained from the product name.

I want to extract the features that might affect the Chatbot’s answer, based on the quality of the search query.

It makes sense to give straight results when the query is really pertinent to the returned item list, and to inform users whenever the query terms do not exactly match what we can offer them, or when they ask for something we cannot provide. The desirable features are:

  • Query entity selection. When the query contains more than one entity cluster, the conversational agent will detect it and ask the user to choose which entity to search for first. For example, the query above contains 2 terms, pale ale beers and ice creams, so the Chatbot could answer:

    Are you searching for pale ale beer or ice cream?

  • Partial term matching. The user is told that the exact criteria do not match, but that a lower-ranking match is available. Pale ale beer is not in the catalog, but ale beer is.

    We don’t have pale ale beer, but just ale beer. These are our suggestions:…

  • Term out of market segment. Tell the user that the requested item is not sold by this shopping website.

    We do not sell insurance, sorry.

Indexing and searching tasks

The two fundamental tasks in information retrieval are collecting and storing product information on one side, and retrieving it on the other. The indexing phase collects features from the products’ names, while the search phase extracts matches from the text query. Both tasks manipulate text in the following ways: entity clusterization, Part Of Speech filtering and lemmatization.

Entities clusterization

The objective is to isolate every entity together with the search space (or features) that refines the query. To do that, I use stop words (irrelevant terms such as articles, prepositions and adverbs) and some punctuation marks (full stops, semicolons, exclamation and question marks) to split the entire sentence into word clusters:

could you suggest me pale ale beers and ice creams for my party

Here the words ‘me’, ‘and’ and ‘for’ act as the stop words that split the sentence into candidate entity clusters.
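
A minimal sketch of this splitting step follows. The stop-word set contains only the three words from the example; a real list would be far longer, and commas or other punctuation attached to words are not handled here.

```python
import re

STOP_WORDS = {"me", "and", "for"}  # only the stop words of the example

def split_into_clusters(sentence, stop_words=STOP_WORDS):
    """Split a sentence into word clusters at stop words and punctuation."""
    clusters = []
    # Full stops, semicolons, exclamation and question marks end a cluster.
    for segment in re.split(r"[.;!?]", sentence.lower()):
        current = []
        for word in segment.split():
            if word in stop_words:
                if current:
                    clusters.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            clusters.append(" ".join(current))
    return clusters

clusters = split_into_clusters(
    "Could you suggest me pale ale beers and ice creams for my party?"
)
```

At this stage the clusters still contain verbs and pronouns, for instance could you suggest and my party; those are dealt with by the filtering step that follows.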

Part Of Speech filtering

The clusters obtained previously are filtered by their Part Of Speech (POS) classification. POS tagging assigns each word its category, such as noun, verb, adjective or adverb. I explicitly exclude verbs, adverbs and pronouns. This is why could you suggest is excluded: it is entirely formed of ignored words. The output is:

could you suggest (excluded) / pale ale beers / ice creams / party

Lemmatization

Lemmatization refers to the process of returning the root form of inflected words, in order to facilitate the analysis and the search of those terms. For example, “Finds” and “found” are grouped together as “find”. In this way, cluster entities are turned into:

pale ale beer(s) / ice cream(s) / party

Catalog indexing

Text manipulation, as described above, occurs both when storing the catalog data and when querying.

In the indexing phase, when the whole catalog is scanned, parsed and tokenized, all the resulting tokens are stored in a Set. A Set is a collection of distinct items. For storing the presence of a particular cluster efficiently, Bloom filters play a fundamental role.

Bloom Filters

How can we check whether an n-gram is present in the product list? Bloom filters solve the problem of storing a large Set in a vector of fixed, pre-defined size. The algorithm converts an element into some numeric values (h) and sets the bit vector to true at each position h.

Bloom filters make it possible to compress a large amount of source data, in exchange for a degree of uncertainty.

How can the presence of an element in the bit array be validated? Just by checking whether the vector is true or false at the h positions. A negative check gives the certainty that the element is not present; a positive check gives only a certain degree of confidence that the element is present. The false-positive probability depends on the vector length, the number of hashes and the number of stored elements.
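
A Bloom filter along these lines can be sketched in a few lines of Python. The sizes and the way hash positions are derived (salting SHA-256 with a seed) are arbitrary choices for this illustration, not the parameters used in the actual system.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit vector answering 'definitely not' or 'probably yes'."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        """Derive num_hashes positions by salting one SHA-256 hash."""
        for seed in range(self.num_hashes):
            data = f"{seed}:{item}".encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item):
        # False means certainly absent; True means present with some
        # probability, because different items can share bit positions.
        return all(self.bits[p] for p in self._positions(item))

catalog_filter = BloomFilter()
for term in ["ale beer", "ice cream"]:
    catalog_filter.add(term)
```

Every added term is always reported as possibly present, while an unseen term is usually, but not always, reported as absent; enlarging the vector lowers the chance of such false positives.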

N-gram generation

An n-gram is a contiguous sequence of n words. I generate all possible n-grams out of each word cluster. The results of this generation are:

Beer            Ice cream    Party
pale ale beer   ice cream    party
pale ale        ice
ale beer        cream
pale
ale
beer

Once the n-grams are generated, it is fast to check whether any of them is present by querying the catalog Bloom filter. For each entity cluster, we check n-grams starting from the longest, in order to prioritize exactly what the user wants. We also want to know when the exact entity cluster is not present but one of its sub-grams is. Moreover, we need to deal with queries that ask for products or services not offered in the catalog’s market segment.
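
The generation and the longest-first lookup can be sketched as below. For simplicity a plain Python set with made-up terms stands in for the catalog Bloom filter, and the result layout (`term`, `catalog`, `subterms`) is an assumption about how the matches might be reported.

```python
def ngrams_longest_first(cluster):
    """All contiguous word n-grams of a cluster, longest first."""
    words = cluster.split()
    grams = []
    for n in range(len(words), 0, -1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

def match_cluster(cluster, catalog_terms):
    """Report whether a cluster, or else its longest sub-grams, are known."""
    result = {"term": cluster, "catalog": cluster in catalog_terms}
    if not result["catalog"]:
        # Skip the first n-gram: it is the full cluster itself.
        found = [g for g in ngrams_longest_first(cluster)[1:]
                 if g in catalog_terms]
        if found:
            longest = len(found[0].split())
            result["subterms"] = [
                {"term": g, "catalog": True}
                for g in found if len(g.split()) == longest
            ]
    return result

catalog_terms = {"ale beer", "ice cream", "beer"}  # hypothetical catalog
report = [match_cluster(c, catalog_terms)
          for c in ["pale ale beer", "ice cream", "party"]]
```

Only the longest matching sub-grams are kept, so ale beer is reported for pale ale beer while the shorter beer is ignored.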

Ontology database

How can we check whether the query contains valid terms that we simply do not deal in? ConceptNet could be the answer. For this purpose, more than 400K terms have been collected across several categories and indexed, like the catalog terms, in a separate database.

Conclusion

At the end of this process, the final output will look like this:

entity clusters:
  term: pale ale beer
  catalog: false
  subterms:
    term: ale beer
    catalog: true

  term: ice cream
  catalog: true

  term: party
  catalog: false

I have described a simple way of extracting query terms from a raw sentence. This approach provides useful information that a conversational engine could use to accompany search results with meaningful answers. On the other hand, this model doesn’t handle misspellings, which alone account for about 15% of online search failures. Nor does this technique deal with relatedness matching, or semantic matching. That means we can’t satisfy the search with relevant, pertinent results whenever customers use terms different from those on the website. I have already solved this problem by means of neural networks, and I will describe it in another article.

Acknowledgment Thank you Sidi for the contribution.