Giancarlo Frison Signals from the Noise

First Steps on Evolutionary Systems

Nick Youngson CC BY-SA 3.0 Alpha Stock Images

Goal programming attempts to find solutions that satisfy a set of goals or, when that is not possible, violate them minimally. It has been applied in innumerable domains such as engineering, finance and resource allocation. Solutions may include optimal strategies to maximize, for example, the profit of a sale or, on the other hand, to minimize the cost of a purchase below an acceptable threshold.

An optimized plan can be expressed as a program defined as an abstract syntax tree (AST):

This is the tree representation of a formula. An AST can describe any computer program: the leaves are input values, while the root and the intermediate nodes are primitive operators arranged in cascade. An automatic system built upon goal optimization should develop methods for synthesizing programs by running intelligent agents that learn by themselves how to reach their targets. Autonomous agents are comparable to robots operating in an environment where they pursue their goals among other peers in a competitive or collaborative manner. Given the complexity of a multi-agent system, one of the most promising techniques can be found in the realm of evolutionary algorithms.
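As a minimal sketch of the idea, a program tree can be written as nested tuples and evaluated recursively. The tuple encoding, the `OPERATORS` table and the `evaluate` helper are assumptions made for illustration; they are not taken from the system described in this post.

```python
import operator

# Primitive operators available to the root and intermediate nodes
# (an illustrative choice, not the original operator set).
OPERATORS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
}


def evaluate(node, inputs):
    """Recursively evaluate a tree: leaves are input names or constants."""
    if isinstance(node, tuple):
        name, left, right = node
        return OPERATORS[name](evaluate(left, inputs), evaluate(right, inputs))
    if isinstance(node, str):
        return inputs[node]  # leaf: named input value
    return node  # leaf: numeric constant


# (x + 2) * y written as a tree: root "mul", intermediate node "add".
tree = ("mul", ("add", "x", 2), "y")
result = evaluate(tree, {"x": 3, "y": 4})
```

Keeping the tree as plain data is what makes it easy for an evolutionary system to cut, splice and mutate programs later on.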

Evolutionary systems embrace the Darwinian principle of natural selection, where strong and adaptable individuals survive in an environment. The mechanism at its foundation is very simple and works as follows:

  • A number of ASTs (chromosomes) are randomly created.
  • Each chromosome is evaluated through a fitness function.
  • The best ones are selected, the others are discarded.
  • The selected chromosomes are bred to produce a new generation.
  • Offspring are randomly mutated.
  • Repeat until the score threshold is reached.

The “breeding” is called crossover. Taking the sample above, the chromosomes (ASTs) of two selected individuals are merged in an attempt to find a better function, one that minimizes (or maximizes) the outcome:
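One common way to merge two trees is subtree crossover: a randomly chosen subtree of one parent is grafted onto a randomly chosen point of the other. The sketch below assumes trees encoded as nested tuples of the form `(operator, child, child)`; the encoding and all function names are hypothetical, and the post does not say which crossover variant was actually used.

```python
import random


def subtrees(node, path=()):
    """Yield the path (a tuple of child indexes) to every node in the tree."""
    yield path
    if isinstance(node, tuple):
        # node[0] is the operator name; children start at index 1.
        for i in range(1, len(node)):
            yield from subtrees(node[i], path + (i,))


def get(node, path):
    """Follow a path of child indexes down to a subtree."""
    for i in path:
        node = node[i]
    return node


def replace(node, path, new):
    """Return a copy of node with the subtree at path swapped for new."""
    if not path:
        return new
    i = path[0]
    return node[:i] + (replace(node[i], path[1:], new),) + node[i + 1:]


def crossover(parent_a, parent_b, rng=random):
    """Graft a random subtree of parent_b onto a random point of parent_a."""
    target = rng.choice(list(subtrees(parent_a)))
    donor = get(parent_b, rng.choice(list(subtrees(parent_b))))
    return replace(parent_a, target, donor)


a = ("add", "x", ("mul", "y", 2))
b = ("sub", ("mul", "x", "x"), 1)
child = crossover(a, b, random.Random(0))
```

Because tuples are immutable, `replace` builds a new tree and leaves both parents untouched, so they can take part in further breeding.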

This method, also known as genetic programming (GP), may outperform other optimization algorithms when problems present non-linear relationships and when the solution space has many local minima, where gradient-based algorithms show their limits in escaping them, as in the Rastrigin function:
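For reference, the Rastrigin function in n dimensions is usually defined as f(x) = A·n + Σ (x_i² − A·cos(2π·x_i)) with A = 10. The short sketch below evaluates it; the function name and the sample points are chosen only for illustration.

```python
import math


def rastrigin(xs, a=10.0):
    """Rastrigin function: global minimum 0 at the origin, many local minima."""
    return a * len(xs) + sum(x * x - a * math.cos(2 * math.pi * x) for x in xs)


at_origin = rastrigin([0.0, 0.0])  # the global minimum
near_local = rastrigin([1.0, 1.0])  # close to one of the many local minima
on_ridge = rastrigin([0.5, 0.5])  # on the ridge separating the two basins
```

A descent that starts near (1, 1) has to climb over the ridge to reach the origin, which is exactly where purely gradient-based methods tend to get stuck.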

GP does not guarantee finding the optimal solution, but rather a solution with a certain degree of optimality, which is fine when approximation is tolerated. GP might appear to brute-force the search for solutions, but cumulative selection reduces the search to very few generations, as reported for the Weasel program:

I don’t know who it was first pointed out that, given enough time, a monkey bashing away at random on a typewriter could produce all the works of Shakespeare. The operative phrase is, of course, given enough time. Let us limit the task facing our monkey somewhat. Suppose that he has to produce, not the complete works of Shakespeare but just the short sentence ‘Methinks it is like a weasel’, and we shall make it relatively easy by giving him a typewriter with a restricted keyboard, one with just the 26 (capital) letters, and a space bar. How long will he take to write this one little sentence?

GP has been shown to be a competitive alternative to neural network algorithms (Q-Learning) on reinforcement learning tasks, including Atari and humanoid locomotion, while being faster to learn. In the example above, assuming that each letter in a sequence of 28 characters is selected at random from 27 symbols, the number of possible combinations is 27^28, about 10^40. GP solves it in 46 generations.
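A minimal sketch of the Weasel program shows how cumulative selection works. The number of offspring per generation (100), the mutation rate (5%) and the choice of keeping the parent among the candidates are assumptions made here, so the number of generations will differ from run to run and from the figure quoted above.

```python
import random
import string

TARGET = "METHINKS IT IS LIKE A WEASEL"
ALPHABET = string.ascii_uppercase + " "  # 26 capital letters and a space


def score(candidate):
    """Number of characters already matching the target."""
    return sum(c == t for c, t in zip(candidate, TARGET))


def mutate(parent, rate, rng):
    """Copy the parent, replacing each character with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else c for c in parent
    )


def weasel(offspring=100, rate=0.05, seed=None):
    """Return the number of generations needed to reach TARGET."""
    rng = random.Random(seed)
    parent = "".join(rng.choice(ALPHABET) for _ in TARGET)
    generations = 0
    while parent != TARGET:
        children = [mutate(parent, rate, rng) for _ in range(offspring)]
        # Cumulative selection: keep the best of the children and the parent.
        parent = max(children + [parent], key=score)
        generations += 1
    return generations


generations = weasel(seed=1)
```

Each generation builds on the best string found so far instead of starting from scratch, which is what separates this search from the monkey typing at random.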

As GP is inspired by the biological nature of evolution, it often surprises researchers with unexpected outcomes. That is the case of a piece of software created for repairing buggy code. It found a clever loophole in order to fix a bug in a sorting algorithm:

In other experiments, the fitness function rewarded minimizing the difference between what the program generated and the ideal target output, which was stored in text files. After several generations of evolution, suddenly and strangely, many perfectly fit solutions appeared, seemingly out of nowhere. Upon manual inspection, these highly fit programs still were clearly broken. It turned out that one of the individuals had deleted all of the target files when it was run!

Aggregated fitness functions

Back to the subject of this post: a generic and flexible environment for training agents to reach some goals must deal with what is defined as multi-objective optimization. The final outcome should satisfy several goals that might conflict with each other, for example growing the profit of a business and raising the salaries of its employees. Multi-objective optimization gives rise to a set of Pareto-optimal solutions. The purpose of training is to create agents that can find as many such solutions as possible. The aggregated fitness function (AFF) has minimal knowledge of how a goal is achieved and evaluates only what is actually achieved. Therefore, the procedure by which an agent accomplishes a task is irrelevant. The obvious drawback is that there is no guidance for evolution through intermediate solutions.
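To make the notion of Pareto-optimality concrete, here is a small sketch that filters a list of candidate outcomes down to the non-dominated ones. Both objectives are assumed to be maximized, and the (profit, salary) numbers are invented for the example.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere.

    Every objective is assumed to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (profit, salary) outcomes of five candidate plans.
plans = [(10, 1), (8, 3), (5, 5), (4, 4), (9, 1)]
front = pareto_front(plans)
```

None of the surviving plans is better than another on both objectives at once, so choosing among them is a matter of preference rather than optimization.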

Simple experiment

A GP system has been instructed to model as many ASTs as there are digits in the numbers of a randomly generated set. Each program should extract its respective digit from the given number: units, tens and so on. Given a set of programs, one per digit position, and the digits of the input number, each program should output the digit at its own position, just as simple as that. The fitness function returns the sum of the squared errors between the calculated digits and the correct ones. In doing so, the feedback for the single program is lost; only the aggregated one is considered by the genetic algorithm.
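The aggregated fitness described above can be sketched as follows. The helper names are hypothetical, and the hand-written lambdas merely stand in for evolved ASTs; the point is that only one total error is returned for the whole set of programs.

```python
def digits(number, count):
    """Digits of `number`, least significant first: units, tens, ..."""
    return [(number // 10 ** i) % 10 for i in range(count)]


def aggregated_fitness(programs, numbers):
    """Sum of squared errors over every program and every input number.

    Only this single total reaches the evolutionary loop; the error of each
    individual program is not reported separately. Lower is better.
    """
    total = 0
    for number in numbers:
        expected = digits(number, len(programs))
        for program, digit in zip(programs, expected):
            total += (program(number) - digit) ** 2
    return total


# Hand-written stand-ins for evolved programs (illustration only).
units = lambda n: n % 10
tens = lambda n: (n // 10) % 10
wrong_tens = lambda n: 0

perfect = aggregated_fitness([units, tens], [42, 17, 90])
flawed = aggregated_fitness([units, wrong_tens], [42, 17, 90])
```

In the flawed pair the whole error comes from the tens program, but the returned total alone does not reveal which of the two programs is responsible.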

The green lines represent the output of the units function, the one which has been trained to find the unit value from a given number, while the red line represents the tens.

The Basic Principles of Language

What is this exhilarating noise that comes out of my mouth when I talk? Surely it is not that the precise sequence of sounds, pops and squeezes is particularly melodic; rather, thanks to the palace of sophistication erected in favor of language, we can talk and afford a wide range of expressions. Since I began, erratically, to explore natural language processing, I have been wondering how it comes so naturally to us while it is extremely complicated from a computational perspective. What has caught my curiosity is the nature of language and the fundamental aspects that might have shaped the rudimentary ‘Me Tarzan, you Jane’, the sentence that epitomizes the earliest and simplest level of language.

The difficulty of studying the evolution of language is that evidence of its early forms is sparse. Spoken languages don’t leave fossils. Moreover, all existing languages, including those of the most remote tribes, are already sophisticated. Contemporary languages have a lot of words and refined grammatical structures, and can express almost everything with a remarkable richness of detail. Even in the written human records collected so far, dating back 5,000 years or so, things look almost the same as they do now. Linguists have studied how communication changes over time and inferred what it might have looked like when the first rudimentary steps toward a language were taken. What are the basic and fundamental aspects and principles of language that, if they were taken away, would make the whole towering edifice of language collapse like a house of cards? I will introduce them with a simple composition, which cannot be taken as an example of eloquence, but which nobody would find difficult to understand:

I supermarket enter      basket bring      pick fresh fruit

I go cashier      pay cashier basket      bring bag      quit

As might be noticed, there are no grammatical elements (prepositions, conjunctions, adverbs, plurals, tenses, relative clauses, complement clauses) that glue and hold sentences together, nor any abstract terms. Nonetheless, the proto-sentence remains comprehensible thanks to a very few natural principles that arrange those words together. Those principles crystallized in our brains millions of years before language was even conceived by our ancestors. Evolution wired those principles into our cortex to facilitate communication. The first lines of distinction in early languages came from the concrete world, such as actions and things and how to refer to them in space: the pointing words. The second principle refers to the sequentiality of events and, as one can correctly imagine, this affects the ordering of words. The third is more about the economy of communication, by contextualizing meanings and references in the sentence.

Pointing words

Pointing words help to refer to or locate something in space. They are this, that, here and there, and their reference depends on where the actors are. What is this for me could be that for you, due to the relative position of object and subject. Those referencing words are compelling not simply because children use them as an accompaniment to the pointing gesture, reinforcing the intimate link between the physical world and its mental representation in immature brains. Pointing words, unlike other grammatical terms, do not originate from anything other than pointing words. They are root and core concepts.

Things, actions

The sample text should help to show that early languages were restricted to simple words, the ones involving only concrete entities in the here and now. The distinction between things and actions is also part of the social intelligence and world representation that we share with other primates, so this conceptual distinction was already there. Even metaphors, which account for a large share of the words in our dictionary, turn out to have concrete origins: they evolved from elements of the physical environment.

Order of words

Another basic principle of any language relies on a single strategy: the ordering of words. What belongs together in reality also appears close together in language and follows the same sequentiality. It is natural to describe an action as the central word between two participants. Between the actor and the patient (on whom the action is performed), the order is the ordinary mapping from reality to language. Consider for example Caesar’s Principle: I came, I saw, I conquered (veni, vidi, vici). This saying is attributed to Julius Caesar after a victory. The order of words is clearly not accidental; it reflects the sequence of actions in the real world.

Context

The third principle is concerned with repetition. What has already been stated, or is not particularly important, does not need to be repeated. What can be understood and inferred from the context may be omitted from the sentence. This follows the principle of least effort, which also applies to language. If I had written the story like this:

I supermarket enter      I bring basket      I pick fruit      I quit

the redundancy of the subject would be truly annoying, in any language. Several ways have been invented to keep track of the participants in a conversation; take pronouns, for example.

Concept Search by Word Embeddings

Catalog search is one of the most important factors in the success of e-commerce sites, and accurate, relevant results are critical to successful conversion.
The following approach aims to reduce user frustration by presenting related products when the searched items are not available in the catalog. The central hypothesis is that a user might buy products with characteristics similar to those of the product originally searched for, turning the search into a purchase.

Search engines help to find relevant matches against a query according to various information-retrieval algorithms. Those systems find text occurrences but, regardless of their effectiveness, they are strictly bound to the terms provided by the catalog. Therefore, products cannot be retrieved by words that are not already present in the inventory.

Concept matching (a sub-domain of semantic search) refers to the quality of retrieved instances based on significance. The association of terms with an acceptable degree of relatedness hinges on these key points:

  1. Knowledge gathering. Where is it possible to identify semantic relations among words?
  2. Concept extraction. How can relations be extracted and then predicted?

The processing applied to the data to obtain the desired features is called word embedding.

Word embedding is a very popular term, undoubtedly because of the contribution of the deep learning community. It is associated with research on distributional semantics, the branch of study that quantifies semantic similarities between words based on their distributional properties.

“A word is characterized by the company it keeps.” (J. R. Firth)

Algorithms (like the well-known skip-gram, CBOW and GloVe) are employed to train models that predict words as they appear sequentially in a given text corpus. As a result, the word embedding model converts a single word into a vector of numbers. Analogous words are represented by similar vectors, and cosine similarity measures the cosine of the angle between two word vectors, thus scoring the relatedness between two words.
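Cosine similarity itself takes only a few lines. In the sketch below the three-dimensional vectors are made up for illustration; real word embeddings typically have a few hundred dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, not the output of any trained model.
chardonnay = [0.9, 0.1, 0.3]
riesling = [0.8, 0.2, 0.35]
hammer = [-0.2, 0.9, -0.4]

related = cosine_similarity(chardonnay, riesling)
unrelated = cosine_similarity(chardonnay, hammer)
```

The score ranges from -1 to 1 and ignores vector length, so two words are compared by the direction of their vectors only.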

Concept Matching Algorithm

Concept Search Diagram

In the example above the user submits the unknown search query Chardonnay, for which some similar terms are retrieved from the word embeddings. Some of them might exist in the catalog, and the matching products are returned to the user.

algorithm retrieve_alternatives is
  input: unrecognized term query, word vectors embeddings
  output: ordered list of products and ranking

  query_embeddings ← get similarities of query from embeddings
  results ← empty
  for each w in query_embeddings:
    result ← search by w
    result.ranking ← result.ranking * w.score
    append result to results
  return results sort by ranking
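The pseudocode above can be turned into a runnable sketch. Here the embeddings are a plain dictionary and the catalog search is a function passed in by the caller; both are hypothetical stand-ins for a trained word-vector model and a real search engine, and the descending sort order is an assumption.

```python
def retrieve_alternatives(query, embeddings, search):
    """Rank catalog products found through terms similar to `query`.

    `embeddings` maps a term to a list of (similar_term, score) pairs and
    `search` maps a term to a list of (product, ranking) pairs. Both are
    hypothetical stand-ins for a word-vector model and a search engine.
    """
    results = []
    for term, score in embeddings.get(query, []):
        for product, ranking in search(term):
            # Weight the engine's ranking by how close the term is to the query.
            results.append((product, ranking * score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy data for illustration only.
embeddings = {
    "chardonnay": [("sauvignon", 0.8), ("merlot", 0.6), ("riesling", 0.7)]
}
catalog = {
    "sauvignon": [("Sauvignon Blanc 2016", 1.0)],
    "merlot": [("Merlot Reserve", 0.9)],
}
alternatives = retrieve_alternatives(
    "chardonnay", embeddings, lambda term: catalog.get(term, [])
)
```

Similar terms with no match in the catalog, like riesling in this toy data, simply contribute nothing to the result list.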

Topic-specific Embeddings

Word embeddings are obtained by processing a huge quantity of text, called a corpus (plural: corpora). Several large and structured sets of texts are available for creating word embeddings (the Google News corpus, Wikipedia, and so on), as well as word vectors already trained on those corpora. Since the quality of word embeddings reflects the corpus from which they have been generated, I purposely created a topic-specific corpus specialized in food, by scanning more than 600 food blogs and collecting roughly 40 MB of prepared text. The amount of text is tiny in comparison with Google News, but it is nonetheless enough for the purpose of computing similarity in the small range of catalog queries. The preparation of the corpus includes the removal of everything but words, case conversion and sentence tokenization. I chose fastText for computing text representations: it uses sub-word information to build vectors for unknown words and, as the name might suggest, it is really fast.
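The preparation steps mentioned above (keeping only words, lowercasing, splitting into sentences) might look like the following sketch. The regular expressions are a deliberately naive assumption, not the actual pipeline used to build the food corpus.

```python
import re


def prepare(text):
    """Split raw text into sentences of lowercase words."""
    sentences = re.split(r"[.!?]+", text)  # naive sentence tokenization
    prepared = []
    for sentence in sentences:
        # Keep alphabetic words only; digits and punctuation are dropped.
        words = re.findall(r"[^\W\d_]+", sentence.lower())
        if words:
            prepared.append(" ".join(words))
    return prepared


lines = prepare("Chill the Chardonnay for 2 hours! Serve with grilled fish.")
```

Writing one prepared sentence per line gives a plain-text file of the kind that word-embedding tools generally expect as training input.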

This solution has been filed as “System, computer-implemented method and computer program product for information retrieval” at the European patent office. It is applicable to many different domains, such as clothing, automobile or electronics retail, just by obtaining the proper specialized corpora from which word similarity can be inferred.

Stochastic Conversational Workflows

Traditionally, user interfaces are a series of screens and forms for exchanging information with the user. Most applications start with a main screen from which users can navigate using breadcrumbs, menus and buttons like back and forward. This paradigm remained almost unaltered with the coming of hypertext, where one may jump from one page or dialog to another through visual links that are immediately accessible. Chatbots shift UX towards conversational hypertext, which produces the appearance of having a conversation with the computer. People can interact naturally and, since everyone already knows at least one natural language, nobody needs any training for it.

The casualness of the medium contrasts with the complexity of the structured, and sometimes cumbersome, functions needed to achieve a specific goal. While web users can easily switch to a new goal-oriented scenario with a click, in a messaging application this is mainly done by texting. Conversational applications usually implement workflows not through screens or forms but by piling new dialog scenarios onto the conversational stack. Technically, it is something like stacking finite state machines, in which every layer represents a particular task. When the current one is accomplished, the dialog state closes and it is removed from the stack. Conversational workflows may be managed by state-machine engines implemented directly in the chatbot or, alternatively, by existing flow managers such as Dialogflow (formerly API.AI), Wit.ai or LUIS.ai (Language Understanding Intelligent Service), with which designers can set up conversation processes in web dashboards.
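A conventional dialog stack of this kind can be sketched in a few lines. The `Dialog` and `DialogStack` classes below are hypothetical and much simpler than what a real flow manager provides; each dialog is reduced to an ordered list of steps.

```python
class Dialog:
    """One task on the stack: a tiny state machine over ordered steps."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = list(steps)
        self.position = 0

    def advance(self):
        """Move to the next step; return True when the task is finished."""
        self.position += 1
        return self.position >= len(self.steps)


class DialogStack:
    """Stack of running dialogs; only the top one receives user input."""

    def __init__(self):
        self.stack = []

    def push(self, dialog):
        self.stack.append(dialog)

    def current(self):
        return self.stack[-1] if self.stack else None

    def step(self):
        """Advance the top dialog and pop it once it is accomplished."""
        if self.stack and self.stack[-1].advance():
            self.stack.pop()


stack = DialogStack()
stack.push(Dialog("checkout", ["address", "payment"]))
stack.push(Dialog("search", ["query"]))  # the user digresses into a search
stack.step()  # the search completes and is popped
resumed = stack.current().name
```

The sketch also makes the rigidity visible: the only way back to the checkout is to finish whatever sits on top of it.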

Conventional Dialog Stack

Even though it might be tempting to assume users will follow the exact logical sequence of steps defined by the bot’s designer, this rarely occurs. The austere interface of a Chatbot does not constrain users to a pre-defined schema; it neither prevents nor discourages them from behaving as they naturally do with other people: expressing a demand, asking for information, changing the search criteria for a product, maybe asking again, and eventually paying or just aborting the process.

Rule-based vs stochastic workflow process

People do not communicate in stacks. They tend to jump from one subject to another in an almost random way. Users may decide to do something entirely different, no matter how the process flow has been structured. They may ask questions unrelated to the current procedure, or cancel it and then start over again. It is natural for humans to switch topics during a dialogue for whatever reason.

Even though it is easy to reach significant outcomes with little effort by using one of the mentioned NLU (Natural Language Understanding) systems, those modelers quickly show their limitations due to their rigid characteristics concerning activation and behavior. In other words, conversational workflows may need to be hard-coded based on a few discriminating features, mainly restricted to a limited set of user intentions. It may be almost impossible to manage programmatically the large number of features that can affect dialogs. Further, variations of such processes may result in growing complexity, becoming unmanageable over time.

New conversational model

The real world is a stubborn place. It is complex in ways that resist abstraction and modeling. It notices and reacts to our attempts to affect it. Nor can we hope to examine it objectively by pre-defined rules or by programmed state machines.

The fundamental aspect that cannot be ignored is the probabilistic nature of the model that will serve those dialogs.

The complexity of the system increases with the number of variables that lead the conversation to one function instead of another. Hence the obligation to move away from rule-based systems and embrace uncertainty, probability and statistics.

Once the limitations of those approaches are unveiled, we are ready to handle context and sequentiality in a completely different way. We should let machine learning do what it does best: calculate predictions over large amounts of input.

The system should be able to replicate human behavior by learning from real conversation segments.

The model should query the system for more information and make more plausible decisions about what to present to the user. On those premises, machine learning enters the game, and it comes in the form of neural networks.

Text classification

Neural networks are adaptable systems whose ability to learn comes from varying the strength of the connections between their artificial neurons. They are basically universal function approximators. I have described neural networks for intention classification that use convolutions to grasp the semantics behind users’ sentences.

Additional input

To help the classifier, a set of discrete features could be provided in addition to the raw text, carrying information that shares some predictive relation with the action to choose. That information could be related to the text itself, but it could also be contextual to the user profile or to the ongoing marketing campaigns promoted by the merchant.

For example, should a particular search request be directed to a special promotion? Can the time of day affect the query selection, because at night camomile sells better than black tea? Just feed the classifier with those features, and let ML do the hard work.
The new conversational model should not just understand the single intention of a phrase, but also interpret it in the overall context in which the user has engaged with the Chatbot.

Time series models

The concept of time series depends on the idea that past behavior can be used to predict future behavior. In sequence-based models, the output is determined not just by the last input, as in regression predictive models, but also by the inputs that preceded it. This peculiar predictor should have some interesting characteristics. The latest inputs should affect the final output more than the ones far away in time, and those models should be able to override, remove or remember qualifiers along the sequence. Long short-term memory (LSTM) units come to our rescue. LSTMs can remember values over arbitrary event series and they are a more sophisticated extension of recurrent neural networks.
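How an LSTM decides what to remember, override or remove can be seen in a single step of a single unit. The sketch below uses scalar inputs and hand-picked, untrained weights purely for illustration; a real network works on vectors and learns its weights by back-propagation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One step of a single-unit LSTM with scalar input and state.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        wx, wh, b = w[name]
        return activation(wx * x + wh * h + b)

    f = gate("forget", sigmoid)   # how much of the old memory to keep
    i = gate("input", sigmoid)    # how much new information to write
    g = gate("cell", math.tanh)   # candidate value to write
    o = gate("output", sigmoid)   # how much of the memory to expose
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c


# Hand-picked, untrained weights: the forget gate saturates open and the
# input gate saturates shut, so the memory c passes through unchanged.
remember = {"forget": (0, 0, 100), "input": (0, 0, -100),
            "cell": (1, 0, 0), "output": (0, 0, 100)}
h, c = lstm_step(x=5.0, h=0.0, c=0.7, w=remember)
```

Swapping the two saturated biases would do the opposite: the old memory is erased and replaced by the candidate computed from the current input.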

Multi-task learning (MTL)

Consider a hypothetical recommender that has learned to predict your preferences about cars. A mono-task system could be trained to give a single output that matches a car model with your profile. A multi-task model could be trained to return not only the name of the model, but also the color, the engine type and the accessories. MTL aims to solve multiple classification tasks simultaneously within the same neural network. We can view MTL as a form of transfer learning where commonalities and differences across tasks are exploited to improve the overall learning efficiency. It seems that neural networks love diversity: the more tasks they learn, the more accurate their predictions become, compared to training on each task separately. The conversational engine I created uses multi-task learning (MTL). It does not merely return a single label; instead, it returns a fine-grained set of parameters that add expressiveness to the behavior of the Chatbot.

Neural network architecture

My proposal is to use both (CNN, LSTM) in a fully-connected neural network, in order to combine the classification qualities of the CNN with the sequentiality of the LSTM. That means the meaning of a user’s utterance is not considered alone, as in the current state of the art of Chatbot classifiers, but is evaluated in the context of the conversation. Fully-connected neural network here means that the different layers of the network (CNN, LSTM) are affected by the same feed-forward and back-propagation iterations.

Stochastic Conversational Workflows

While the CNN extracts a relevant representation of the user’s input, other types of input can be fed into the LSTM layer, as illustrated. In this way, the meaning of the user’s sentence is evaluated together with a set of additional parameters that can affect the decision outcome of a particular conversational step.

Context and Sequentiality in Conversational Applications

Contextual memory in conversational applications plays a central role in any type of interaction between the Chatbot and the user. It is the bidirectional transfer of information where interlocutors are aware of the relational, environmental, and cultural context of the exchange. I will show some examples of how a context-based system might improve the flow of the dialog.

Consider the following sentence:

If you prefer salty food, tiramisù is a bad dessert to eat.

If we quote only the part that says:

tiramisù is a bad dessert to eat.

The resulting sense is completely misleading. The speaker appears to be simply saying that tiramisù is a bad dessert for everybody, while it is a bad choice only for those who don’t like sweets and prefer salty food instead. Even though the quote is taken directly from the original sentence, it omits important information that you need in order to understand what the speaker is really saying. That omitted information is the context.

Context is the circumstances surrounding a message.

In the realm of conversational applications, where users can converse with Chatbots, context is a fundamental aspect of any type of interaction, from goal-oriented to one-shot tasks. Let’s consider a simple case, where the same sentence appears in different contexts with consequently different outcomes.

Take the following sentence: Add bread into cart. The system should react according to what the user has previously entered, more precisely:

  • If the user has just searched for a product → Drive the user to select the desired product and tap the related ‘Add to Cart’ button.
  • In all other cases → Search for bread.
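The two cases above can be written down as an explicit rule, just to make the expected behavior concrete. The intent and action names are hypothetical labels; in the approach advocated in these posts the mapping would be learned from conversation sequences rather than hard-coded like this.

```python
def next_action(intent, history):
    """Pick the reaction to an intent given the previous user actions.

    `history` is the list of earlier intents, most recent last. The intent
    and action names are hypothetical labels used only for illustration.
    """
    if intent == "add_to_cart":
        if history and history[-1] == "search_product":
            # The product list is already on screen: point to its button.
            return "prompt_add_to_cart_button"
        # Nothing to add yet: look the product up first.
        return "search_product"
    return "fallback"


after_search = next_action("add_to_cart", ["greeting", "search_product"])
cold_start = next_action("add_to_cart", [])
```

The same utterance produces two different actions, and the only thing that differs between the calls is the history passed in.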


Context as customer engagement

The sequentiality of the interaction between customer and Chatbot is the key to understanding the user’s intention, in any situation. In the online buying process, shortening the path to conversion and accelerating checkouts makes a real difference between the success and failure of e-commerce initiatives. Conversational systems can tie customer engagement and purchase together in a very short cycle. Smart sequencing, when seamlessly embedded in the processes, can lead to new purchase opportunities.

Charly context dialog flow

Exploring contextual repetition

A sequence is a collection of events that might contain repetition. Repetitive inputs might signal unsatisfied or misunderstood requests. In this odd example we can simulate a behavioral pattern where an interlocutor may get upset by an annoying and repetitive progression of unsolicited opinions, as I may rhetorically describe the following conversation:

In this article I summarized how the ability to apply Contextual Intelligence should be an intrinsic skill of any conversational application, in any scenario. It is the proficiency at automatically adapting responses to what the user is demanding during a conversation. All those example cases are actually implemented in Charly. How I made it will be the subject of a following article. Stay tuned!