Giancarlo Frison Signals from the Noise

Program Induction and Synthesis on ICML 2018

The International Conference on Machine Learning ICML took place this year in Europe, in the beautiful city of Stockholm from 10th to 15th of July. This is one of the two premiere conferences (within NIPS) on Artificial Intelligence research, and the numbers indicate the magnitude of the event: 612 accepted papers out of 2473 submissions, 9 tutorial and 67 workshop sessions on the latest advances in all disciplines of machine learning. One of the most intriguing workshop was about machine intelligence capable of writing software code for complex procedural behavior.

Program Induction

Can we teach computers to write code? This is the question that brings out an entire branch of research specialized in program synthesis. Programming is a demanding task that requires extensive knowledge, experience and not a frivolous degree of creativity. Despite the premises that discourage any form of automation, machine learning can reshape the way software is developed. That could be seen as a giant step comparable to the transition of punch cards in the ’60s in favor of magnetic tapes. New programming systems could enable non-programmers to produce correct, cheap, safe and efficient software, opening new automation horizons even for not yet foreseeable purposes.

Curious on what does it feel like? Try to describe a problem in plain English and translate it in a routine in your favorite programming language. Let’s have a look on this sample:

You are given a number var0. You have to set var2 to 2. If var0-2 is divisible by 3 you have to set var1 to 1, otherwise you have to set var1 to zero. For each var3 between 1 and var0-1, if var2 is less than var0 you have to, add var3*3+2 to var2, if var0-var2 is greater than or equal to zero and var0-var2 is divisible by 3 add 1 to var1; otherwise you have to break from the enclosing loop. You have to return var1.

As you might have noticed, this is a detailed description of an algorithm. Natural language lacks of precise semantics necessary for describing concepts with mathematic granularity, this is why the generated code looks less amusing than the original (natural) code:

Likewise ML tasks detect patterns, a program synthesis task aims to solve problems that might be identified as a composition of basic programming primitives, such as conditional operators, loop controls and even recursions and used for generating specific program trees.

Generating code is a extremely challenging problem and the output’s goodness is still very limited by using existing approaches. One discrepancy in the flow above settle in the different nature of the encoded solution (generally neural networks) and generated code. Neural networks lay in the differentiable realm (gradient-based training) while source code generation belongs to the discrete parish, because of its rigorous grammar and intolerance to typos. If you want to get hands dirty on it, it is now available a public dataset for training your system on software programming.

An introductory talk on this matter was hold by Joshua Tenenbaum, professor at MIT and contributor of Bayesian methods for computational learning. He is specialized in cognitive sciences and and he strives to grasp the fundamental mechanics to human learning, in order to ultimately transfer them to computer programs. I find it useful to describe the Bayesian approach as a generic framework for every-day life decisions, where possible solutions are weighted according to our observations and past experiences flavored by doses of uncertainty.

DreamCoder

The DreamCoder (Ellis 2018) rumbles on the idea of creating a repository of simple problems (and their paired solutions) for being reused on solving more complex ones. Even though the full paper is so far not publicly available, it seems inspired by the ways programmers organize their work: building shared subroutines that can be composed to develop more complex procedures. Instead of solving all problems from scratch, it tries to think flexibly and brings what it has already learned.

Program synthesis using example

What I guess would be the most promising and attainable utilization of program induction is coding repairing and migration. Raise an hand 🙋 who hates repetitive, error-prone, edits due to general refactoring or library upgrading. Those tasks occurs during software evolution and they can be done only manually, since they are beyond the capabilities of IDEs. REFAZER (rebuild in Portuguese) attempts to classify code transformations by large set of examples. It has been proven to be correct on 84% of the transformations, which is not bad at all. The research has been published and it is available here (Rolim et.al; ICSE 2017). The technique for synthesizing programs from examples quotes:

Each rewrite rule matches some subtrees of the given AST and outputs modified versions of these subtrees. Additionally, we specify constraints for our DSL operators based on the input-output examples to reduce the search space of transformations, allowing PROSE to efficiently synthesize them.

There are large and open repository for such dataset, take for example Github as widely known repository for open-sourced projects, so uniquely for this case, the training data availability should not be a blocking issue.

First Steps on Evolutionary Systems

Nick Youngson CC BY-SA 3.0 Alpha Stock Images

Goal programming attempts to find solutions which possibly satisfy, otherwise violates minimally, a set of goals. It has been enjoyed in innumerable domains such as engineering, financing or resource allocation. Solutions may include optimal strategies to maximize, for example, a sale’s profit or, on the other hand, to minimize the cost of a purchase under an acceptable threshold.

An optimized plan could be blended as a program defined as an abstract syntax tree (AST):

This is the tree representation of . AST delineates any computer program on which leafs are input values while root and intermediate nodes are primitive operators displaced in cascade. An automatic system built upon goal optimizations should develop methods for synthesizing programs by running intelligent agents that learn by them self on how to reach targets. Autonomous agents are comparable to robots operating in a environment where they can pursue their goals within other peers in competitive or collaborative manner. For the level of complexity of a multi-agent system, one of the promising technique could be found, is in the realm of evolutionary algorithms.

Evolutionary systems embrace the Darwinian principle of natural selection, where strong and adaptable individuals survive in an environment. The mechanism at its foundation is very simple and it is as follow:

  • A number of ASTs (chromosomes) are randomly created.
  • Each chromosome is evaluated through a fitness function.
  • Best ones are selected, the others are disposed.
  • Chromosomes could be breded among the selected for a new generation.
  • Offsprings are randomly mutated.
  • Repeat until the score threshold is reached.

The “breeding” is called crossover. Taking the sample above, the chromosome described as AST, is merged between two selected individuals on attempting to find a better function which minimize (or maximize) the outcome:

This method, also known as genetic programming (GP), may overtake other optimization algorithms when problems presents no-linear relationships and when the solution space has many local minima where gradient-based algorithms show their limits on overcoming them like in the Rastrigin function:

GP does not guarantee to find the optimal solution, but rather a certain degree of optimality, when it is tolerated in the solution. GP might appear to brute force the seeking for solutions, but the cumulative selection lower the complexity to very few generations, like reported in the Weasel program:

I don’t know who it was first pointed out that, given enough time, a monkey bashing away at random on a typewriter could produce all the works of Shakespeare. The operative phrase is, of course, given enough time. Let us limit the task facing our monkey somewhat. Suppose that he has to produce, not the complete works of Shakespeare but just the short sentence ‘Methinks it is like a weasel’, and we shall make it relatively easy by giving him a typewriter with a restricted keyboard, one with just the 26 (capital) letters, and a space bar. How long will he take to write this one little sentence?

GP has been proved to be a competitive alternative by being faster to learn in comparison of neural network algorithms (Q-Learning) on reinforcement learning, including Atari and humanoid locomotion. In the example above, assuming that the selection of each letter in a sequence of 28 characters will be random, the number of possible combinations are about . GP solves it in 46 generations.

As GP is inspired by biological nature of evolution, it often surprises researchers by unexpected outcomes. That is the case of a software created for repairing buggy code. It found a clever loophole in order to fix a bug in a sorting algorithm:

In other experiments, the fitness function rewarded minimizing the difference between what the program generated and the ideal target output, which was stored in text files. After several generations of evolution, suddenly and strangely, many perfectly fit solutions appeared, seemingly out of nowhere. Upon manual inspection, these highly fit programs still were clearly broken. It turned out that one of the individuals had deleted all of the target files when it was run!

Aggregated fitness functions

Back to the subject of this post, a generic and flexible environment for training agents on reaching some goals must deal with what is defined as multi-objective optimization. The final outcome should solve several goals which might be in conflict with each other, like for example growing profit for a business and rise salary to its employees. Multi-objective optimization give rise to a set of Pareto-optimal solutions. The purpose of training is to create agents that can find as many such solutions as possible. The aggregated fitness function (AFF) has minimal knowledge on how a goal is achieved and evaluates only what is actually achieved. Therefore, the procedure of how an agent accomplishes a task is irrelevant. The drawback is obviously that there is no guidance for evolution through immediate solutions.

Simple experiment

A GP system has been instructed to model as many ASTs as the number of digit of a randomly generated set of numbers. Those programs should find the respective digit out of the given number, like units, tens and so on. Assuming as a set of programs, and the the digits of the input number when the output will be and , just as simple as that. The fitness function measure the square of the sum of the distance between the predicted and real digit. The fitness function returns the sum of the squared error between the calculated digit and the correct ones . In doing so, the feedback for the single program is lost, only the aggregated one is considered by the genetic algorithm.

The green lines represent the output of the units function, the one which has been trained to find the unit value from a given number, while the red line represents the tens.

The Basic Principles of Language

What is this exhilarating noise come out of my mouth when I talk? Not surely because that precise sequence of sounds, pops and squeezes are particularly melodic, but thanks to that palace of sophistications erected in favor of language, we can talk and afford a wide range of expressions. Since I began erratically to explore natural language processing I have been wondering how it comes out so natural for us, while it is extremely complicated from a computational perspective. What has caught my curiosity is the nature of language and its fundamental aspects that might have shaped the rudimentary ‘Me Tarzan, you Jane’, the sentence that paraphrases the earliest and the simplest level of language.

The difficulty of studying the evolution of language is that in its early forms the available evidences are sparse. Spoken languages don’t leave fossils. Moreover, all existing languages, including the far remote tribal ones, are already sophisticated. Contemporary ones have a lot of words, refined grammar structures and can express almost everything with a remarkable richness of details. Even in written human records collected so far, dating 5.000 years ago or so, things look almost the same like they are now. Linguists have studied how communication change over time and inferred how it could appear us when the first rudimental steps toward a language were adopted in the first place. What are the basic and fundamental aspects and principles of language that whether they would be taken away, the whole towering edifice of language would immediately collapse like a stack of cards? I would introduce them by a simple composition, which could not be taken as an example of eloquence, but nobody would find it difficult to understand:

I supermarket enter      basket bring      pick fresh fruit

I go cashier      pay cashier basket      bring bag      quit

As might be noticed, there are no grammatical elements (prepositions, conjunctions, adverbs, plurals, tenses, relative clauses, complement clauses) that glue and hold sentences together, nor any abstract term. Nonetheless, the proto-sentence remains comprehensible due to very few natural principles that arrange those words together. Those principles crystallized into our brain million of years before language was even conceived by our ancestors. The evolution wired those principles in our cortex for facilitating communication. The first lines of distinction in early languages came from the concrete world, such as actions and things and how to refer to them in space, the pointing words. The second principle refers to the sequentiality of events and and as one can correctly imagine this affect the ordering of words. The third is more about the economy of communication, by contextualizing meanings and references in the sentence.

Pointing words

Pointing words assist for referring or locating something in space. They are This, that, here, there and their reference depends on where the actors are. What is this for me could be that for you, due to the relative position of object and subject. Those referencing words are not simply compelling because children use them as an accompaniment to the pointing gesture, reinforcing the intimate link between physical world and mental representation in premature brains. Pointing words, oppositely to other grammatical terms, are not originated by anything else than pointing words. They are root and core concepts.

Things, actions

The sample text should help to inform that early languages were restricted to simple words, the ones involving only concrete entities in the here and now. Things and action distinction is also a part of what is social intelligence and the world representation which is common in other primates and this conceptual distinction was already there. Even metaphors, that count a large belonging among words of our dictionary, turns out of have concrete origins, they were evolved from elements of physical environment.

Order of words

Another basic principle of any language relies on a single strategy: the ordering of words. What belongs together in reality appears close also in the language and follows the same sequentiality. It is natural to describe an action as central word between two participants. Between the actor and the patient (whom the action is performed) the order is the ordinary mapping from reality to language. Consider for example the Caesar’s Principle: I came, I saw, I conquered (veni vidi vici). This saying was conferred to Julius Caesar after a victory. The order of words is clearly not accidental, it reflects the sequence of actions in the real world.

Context

The third principle is concerned with repetition. What is already stated or it is not particularly important does not need to be iterated again. What could be understood and inferred from the context may be omitted in the sentence. This follow the principle of least effort, which is also applicable in language. Whether I would have written the story like this:

I supermarket enter      I bring basket      I pick fruit      I quit

the redundancy of the subject would be truly annoying, in any language. Have been invented several ways to keeping track of participants in the conversation, take by example pronouns.

Concept Search by Word Embeddings

Semantic Search Catalog search is one of the most important factor to the success of e-commerce sites and accurate and relevant results are critical to successful conversion.
The following approach aims to reduce user frustration by presenting related products, when searched items are not available in catalog. The central hypothesis is that an user might buy products with similar characteristics of a product originally searched, leading the successful search into a purchase.

Search engines help to find relevant matches against a query according to various information-retrieval algorithms. Those systems find text occurrences, but regardless their effectiveness, they are unequivocally related to the terms provided by the catalog. Therefore, products cannot be retrieved by words that are not already present in the inventory.

Concept matching (a sub-domain of semantic search) refers to the quality of retrieved instances based on significance. The association of terms by an acceptable grade of relatedness, pivots around those key points:

  1. Knowledge gathering. Where is it possible to identify semantic relations among words?
  2. Concept extraction. How relations could be extracted and then predicted?

The elaboration applied to the data for obtaining our demanded features is called word embedding.

Word embedding is a very popular term undoubtedly because of the contribution of the deep learning community. It is associate to the research of distributional semantics, the branch of studies for elaborating semantic similarities between words based on their distributional properties.

“a word is characterized by the company it keeps”. cit R. Firth

Algorithms (like the well-known skip-gram, cbow, glove) are employed to train models for predict words as they sequentially appears in a given text corpora. As result, the word embedding model converts a single word into a list of similarities, a vector. Analogous words are represented by similar vectors and cosine similarity measures the cosine of the angle between word vectors, thus scoring the relatedness between two words.

Concept Matching Algorithm

Concept Search Diagram

In the example above the user submits the unknown search query Chardonnay which has some similar terms retrieved in the word embeddings. Some of them might exist in catalog and they are returned to the user.

algorithm retrieve_alternatives is

  input: unrecognized term query, word vectors embeddings

  output: ordered list of products and ranking

  query_embeddings ← get similarities of query from embeddings

  results ← empty

  for each w in query_embeddings:

    resultsearch by w

    result.ranking ←result.ranking * w.score

    append result to results

  return results sort by ranking

Topic-specific Embeddings

Word embeddings are obtained by elaborating a huge quantity of text, namely corpus or corpora. There are available several large and structured set of texts for creating word embeddings: Google News corpus, Wikipedia, and so on, as well as word vectors already trained against those corpora. Since the quality of word embeddings reflects the corpus from which it has been generated, I purposely created a topic-specific corpora specialized in food, by scanning more than 600 food blogs and collecting roughly 40 Mb of prepared text. The amount of text is risible in comparison with Google News but nonetheless it is enough for the purposes of computing similarity in the small range of catalog queries. The preparation of corpora includes the remotion of everything but words, case conversion and sentence tokenization. I choose fastText for elaborating text representations, it uses sub-word information to build vectors for unknown words and as the name might suggest, it is really fast.

This solution has been filed as “System, computer-implemented method and computer program product for information retrieval” at the European patent office. It is applicable to many different domains, like in clothing, automobile, electronics retail, just by getting the proper specialized corpora from which word similarity can be inferred.

Stochastic Conversational Workflows

Traditionally, user interfaces are a series of screens and forms for exchanging informations with the user. Most of the applications start with a main screen from which users can navigate using breadcrumbs, menus, buttons like back and forward. This paradigm remained almost unaltered with the coming of hypertext where one may jump from a page or dialog into another by visual links, that are immediately accessible. Chatbots shift UX towards conversational hypertext that produces the appearance of having a conversation with the computer. People can interact naturally, and since everyone already knows at least one natural language, nobody needs any training for it.

The casualness of the medium contrasts with the complexity of the structured, and sometimes cumbersome, functions for achieving a specific goal, but while web-users can easily switch to a new goal-oriented scenario with a click, in the messaging application this could be mainly done by texting. Conversational applications usually implement workflows not by screens or forms but by piling new dialog scenarios into the conversational stack. Technically, it is something like packing finite state machines, on which every layer represents a particular task. When the current one is accomplished, the dialog state closes and it is removed from the stack. Conversational workflows may be managed by state-machine engines implemented directly in the chatbot or alternatively by existing flow managers such as Dialogflow (former API.AI), Wit.ai, LUIS.ai (Language Understanding Intelligent Service) by which designers can setup conversation processes in their web dashboards.

Conventional Dialog Stack

Even though it might be tempting to assume users will follow the exact logical sequence of steps defined by the bot’s designer, it rarely occurs. The austere interface of a Chatbot does not constraint users in pre-defined schema, it does not prevent nor discourage them to behave like they naturally do with other people: express a demand, ask information, change search criteria for a product, ask maybe again, and eventually pay or just abort the process.

Rule-base vs stochastic workflow process

People do not communicate in stacks. They tend to jump from a subject to another almost in a random way. Users may decide to do something entirely different, no matter how the process flow has been structured. They may ask for questions unrelated to the current procedure, or cancel it and then start over again. It is natural that humans switch topics during dialogue for whatever reason.

Even though it is easy to reach significant outcomes by using one of the mentioned NLU (Natural Language Understanding) systems with little effort, those modelers pose quickly their limitations due to their rigid characteristics concerning activation and behavior. In other words, conversational workflows may need to be hard-coded based on few discriminating features, mainly imposed in a limited set of user intentions. It may be almost impossible to manage programmatically a large amount of features that can affect dialogs. Further, variations of such processes may result in growing complexity, becoming unmanageable over time.

New conversational model

Real world is a stubborn place. It is complex in ways that resist abstraction and modeling. It notices and reacts to our attempts to affect it. Nor can we hope to examine it objectively by pre-defined rules or by programmed state machines.

The fundamental aspect can’t be ignored is the probabilistic nature of the model that will serve those dialogs.

The complexity of the system increases with the number of variables, leading the conversation to a certain function instead of another. Hence, the obligation to move away from rule-based systems and embrace uncertainty, probability and statistic.

Once the limitations of those approaches are unveiled, we are ready to attain context and sequentiality in a completely different way. We should let machine learning do what it can do best: calculate predictions over large amount of input.

The system should being able to replicate human behavior by learning from real conversation segments.

The model should inquiry the system for getting more informations and take more plausible decisions on what to present to the user. With those premises, machine learning is entering into the game, and it comes with the form of neural networks.

Text classification

Neural networks are adaptable systems whose ability to learn comes from varying the strength of connections between its artificial neurons. They are basically universal function approximators. I described neural networks for intention classification by using convolutions for grasping the semantics behind user’s sentences.

Additional input

For helping the classifier could be provided, other than the raw text, a set of discrete features bring informations that share some predictive relation with the action to choose. Those informations could be somehow related to text itself, but also they could be contextual to the user profile or to the ongoing marketing campaigns promoted by the merchant.

For example, a particular search request should be directed into a special promotion? The time of the day can affect the query selection because in the night a camomile sells better than black tea? Just feed the classifier with those features, and let ML to do the hard work.
The new conversational model should not just understand the single intention of a phrase, but elaborate it in the overall context the user has engaged with the Chatbot.

Time series models

The concept of time series is dependent on the idea that past behavior can be used to predict future behavior. In sequence-based models, the output is not just determined by the last input, like in regression predictive models, but also by its proceedings. This peculiar predictor should have some interesting characteristics. Latests input affect more the final output than the ones far away in time, and those models should be able to override, remove, remember qualifiers along the sequence. Long short-term memory (LSTM) units come to our rescue. LSTMs can remember values over arbitrary event series and they are a more sophisticated extension of the recurrent neural networks.

Multi-task learning (MTL)

Consider a hypothetical recommender that has learned to predict your preferences about cars. A mono-task system could be trained to give a single output that might match the car model with your profile. A multi-task model could be trained to not only return you the model’s name, but also the color, the engine type, the accessories. MTL aims to solve simultaneously multiple classification tasks within the same neural network. We can view MTL as a form of transfer learning where commonalities and differences across tasks are exploited to improve the overall learning efficiency. It seems that neural networks love diversity: more tasks they learn more accurate are their predictions, compared to training them separately. The conversational engine I created uses multi-task learning (MTL). It not merely returns a single label, but instead it returns a fine-grained set of parameters that add expressiveness into the behavior of the Chatbot.

Neural network architecture

My proposition is about to use both (CNN, LSTM) in a fully-connected neural network, in order to leverage classification qualities of CNN with the sequentiality of LSTM. That means, a particular meaning of an user’s utterance is not considered alone like the current state of the art of Chatbot classifiers, but it is evaluated in the context of the conversation. Fully-connected neural network means that the different layers of the network (CNN, LSTM) are affected by the sames feed-forward and back-propagation iterations.

Stochastic Conversational Workflows

While CNN extracts a relevant representation of the user’s input, other type of inputs can be feed into the LSTM layer as illustrated. In this way, the meaning of the user’s sentence is evaluated within a set a additional parameters that can affect the decision outcome of a particular conversational step.