2 minute read

Reinforcement learning covers a family of algorithms with the purpose of maximize a cumulative reward that an agent can obtain from an environment.
It seems like training crows to collect cigarette butts in exchange for peanuts, or paraphrasing an old say, the carrot and stick metaphor for cold algorithms instead of living donkeys.

The agent and environment have not been emphasized vainly, they represent more concretely a vacuum cleaner
sweeping your flat, an A/B testing engine for commerce or a driveless car in a crossroad. If you have heard about latest advances in the field, you would have came across of Deepmind’s AlphaZero, by which it is possible, with an affordable set of hardware, to build from scratch the best chess player in the world in just 4 hours.

“Difficulties strengthen the mind, as labor does the body.”
Lucius Annaeus Seneca

Researches don’t lack of challenges in this field. The most important one comes from the intrinsic nature of RL learning process, which rely solely on the evaluations of its actions. Improvements are driven by just one signal, the reward.

Rewards are often very sparse.

They come after hundred or thousands of steps, exponentially increasing the combination of actions the agent must explore for finding a barely better sequence among of them. If that does not seem arduous, consider also the non-stationary nature of some environments, particularly common in dynamic scenarios where the learning phase resemble pursuing a moving target.

Non-stationary environments

Non-stationary means that the return of an action, performed in the precisely exact conditions of a past experience, might be different from what expected. This is particularly intuitive in the case of multi-agent scenario (MARL), in which the agent plays with one or more other agents that are learning too.

What distinguish RL from other optimization methods

While RL helps on creating agents that can autonomously take decisions, other algorithms attain this goal too, but with different working principles.

Supervised learning could be easily distinguished because it is trained with correct samples instead of vague rewards, making it simpler for loss functions to converge into an useful solution.

Mathematical optimization differs from RL in a more subtle way. Likewise RL, the simplex algorithm find solutions by iterating on optimization loops, but it works only on perfect information problems.
When considering what it exactly means, let’s look at the knapsack or the traveling salesman problems.

All the necessary informations for elaborating the optimal solution are readily there.

In other words, there is no exploration.
Conversely, an RL agent is like a probe on an heavenly body, where the assumptions on the environment are nearly absent. The agent needs to figure out autonomously the good and the bad actions only by the feedback from environment, the so called model free learning approach.

Genetic Programming is an evolutionary optimization method that share most of the characteristics of RL, it is iterative, suitable for imperfect information systems due to its stochastic explorative nature. Even the terminology is somehow related. For example, what is the objective function, in GP is named fitness function, just a polyseme.
What differentiate it from RL is it’s evolutionary method, I briefly explained in this post.