
Sunday, September 2, 2018

Structure Learning and Imitation Learning

In a classical prediction use case, the predicted output is either a number (for regression) or a category (for classification).  A set of training data (x, y), where x is the input and y is the labeled output, is provided to train a parameterized predictive model.

  • The model is characterized by a set of parameters w
  • Given an input x, the model predicts y_hat = f(x; w) for regression, or the probability of each possible class for classification
  • Define a loss function L(y, y_hat) for regression, or L(y, P(y=a | x), P(y=b | x) ...) for classification, and find the parameters w that minimize L
This problem is typically viewed as an optimization problem and solved with a gradient descent approach.
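
To make the classical setting concrete, here is a minimal sketch (with synthetic, purely illustrative data) of fitting a linear regression model y_hat = f(x; w) by gradient descent on a squared-error loss.

```python
# Minimal sketch: fit parameters w of a linear model y_hat = f(x; w) = x . w
# by gradient descent on the mean squared error loss L(y, y_hat).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # inputs x
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)    # labeled outputs y

w = np.zeros(3)                                # model parameters
lr = 0.1
for _ in range(200):
    y_hat = X @ w                              # model prediction
    grad = 2 * X.T @ (y_hat - y) / len(y)      # gradient of the loss w.r.t. w
    w -= lr * grad                             # gradient descent step

print(w)                                       # should approach true_w
```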



Need for Structure Learning

However, in some cases, y is not as simple as a number or a class.  For example:
  • For machine translation, x is a sentence in English, and y is a translated sentence in French
  • For a self-driving vehicle, x is a camera image, and y is the control action on the steering wheel, brake, and gas pedal
In these cases, the output y can be viewed as an object.  But wait, can we break down the object into multiple numbers / categories and use the classical regression / classification approach to solve it?  Not quite, because the loss function cannot simply be formulated as the summation of the losses of individual components.  For example, two French sentences using different words may still both be very good translations of the same English sentence.  So we need to generalize a bit more and introduce the concepts of an object and compatibility here.

The prediction problem can be generalized as: given input x, find an object y such that y is the "most compatible" with x.  The compatibility is a parameterized function that we are going to learn from the training data.
  • The compatibility function is defined as F(x, y; w)
  • During the training phase, we tune the parameters w such that for every sample in the training data, F(x, y; w) is the maximum.  In other words, F(x, Y=y; w) > F(x, Y=other_val; w)


Notice that the training process differs from classical ML in the following way.
  • There are two optimization loops here:  a) given parameters w, find the y_opt that maximizes F(x, y; w);  b) given the loss = the gap between F(x, y; w) and F(x, y_opt; w), find the w that minimizes the gap.
  • It turns out the first optimization is solved in a problem-specific way, while the second optimization can be solved with the classical gradient descent approach.
After we learn the compatibility function parameters, at inference phase we apply the first optimization to the given input x to find the most compatible y_opt such that F(x, y_opt; w) is the maximum.
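
Below is a minimal, hypothetical sketch of this nested optimization with a linear compatibility function F(x, y; w) = w · phi(x, y), using a structured-perceptron-style update; the feature function phi and the candidate generator candidates are placeholders that would be problem specific.

```python
# Sketch of structure learning with a linear compatibility F(x, y; w) = w . phi(x, y).
# `phi` (joint feature function) and `candidates` (generator of possible y objects
# for a given x) are hypothetical placeholders for a real problem.
import numpy as np

def train(data, phi, candidates, dim, epochs=10, lr=1.0):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            # Inner optimization: find the y most compatible with x under current w
            y_opt = max(candidates(x), key=lambda c: w @ phi(x, c))
            # Outer optimization: if the best candidate differs from the labeled y,
            # move w to widen F(x, y; w) relative to F(x, y_opt; w)
            if y_opt != y:
                w += lr * (phi(x, y) - phi(x, y_opt))
    return w
```

At inference time, the same inner optimization (the max over candidates) is applied to a new x to produce the most compatible y_opt.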

Rather than trying to exactly match y_hat to the y in the training data, structure learning enables us to learn a more abstract relationship (i.e. compatibility) between x and y, so that we can output other equally good y even if it is not the same as the y in the training data.  This more generalized form is very powerful when we don't have a lot of training data.  The downside of structure learning is that it is compute intensive, because at inference phase it needs to solve an optimization problem, which is typically expensive.

Imitation Learning

In a typical setting of Reinforcement Learning, a digital agent observes the state of the environment, uses its policy to determine an action that it believes will maximize its cumulative future reward, takes the action, gets the reward from the environment, and transitions to the next state.  Reinforcement learning is about how the agent optimizes its policy through its experience of interacting with the environment.



For an overview of Reinforcement Learning and its basic algorithms, you can visit my previous blog post here.

Basically, reinforcement learning uses a trial-and-error approach to learn its lessons.  This approach can be very costly, because serious mistakes can be made during the trial process (imagine using reinforcement learning to learn self-driving: we may crash many cars before we learn anything meaningful).  In practice, people rely on a simulator to mimic the environment.  However, building a good simulator is not easy, because it requires a very deep understanding of how the actual environment behaves; this is one of the limitations that restricts reinforcement learning from being broadly applied.

Another important design consideration is how the reward is assigned.  Of course, we can use the actual reward from the environment to train the policy, but this is usually very inefficient.  Imagine playing a chess game, where we only get the reward at the end when we win or lose.  Propagating this reward all the way back to each move is very inefficient.  To make learning faster, we usually use a technique called "reward shaping": we assign some artificial reward along the trajectory to bias the agent towards certain desirable actions (based on domain knowledge).

One special form of reward shaping is "imitation learning", where we assign intermediate reward based on how similar the action is to what an expert does in real-life circumstances.  Let's say we collect a set of observations of the expert taking action y in state x, and try to learn a model that biases the agent towards taking action y when it sees state x.  But wait, doesn't that sound like a supervised learning problem?  Can we just train a prediction model from x to y and be done?

Unfortunately, it is not that simple.  Expert data is typically very sparse and expensive to get, meaning we usually don't have much data from the expert.  Imagine in a self-driving program, if we want to learn how to react when the car is about to crash, we may not find any such situation in the expert's observations, because the expert may never run into such a dangerous situation at all.  On the other hand, we don't need to copy exactly what the expert did in every situation; we just need to copy the behavior that is relevant to the situation.

"Inverse Reinforcement Learning" comes into rescue.  Basically, it cuts off the reward from the environment and replace it with a "reward estimator function", which is trained from a set of expert behavior, assuming that expert behavior will achieve highest reward.


The underlying algorithm of inverse reinforcement learning is based on the "structure learning" algorithm above.  In this case, x is the start state and y is the trajectory of the expert, which is basically the training data.  And y_opt is the trajectory produced by the agent's policy, which is learned from the reward function using a Reinforcement Learning algorithm.  The compatibility function is basically our reward function, because we assume the expert behavior achieves the highest reward.

Then we bring in the structure learning algorithm below ...


The agent still needs to interact with the environment (or a simulator) to get its trajectory, but the environment only needs to determine the next state, not the reward.  Again, there are two nested optimization loops in the algorithm (see the sketch after this list):
  • Given a reward function (characterized by w), use classical RL to learn the optimal policy
  • Use the optimal policy to interact with the environment to collect the total reward of each episode, then adjust the reward function parameters w such that the expert behavior always gets the highest total reward.
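
Here is a rough, hypothetical sketch of these two nested loops with a linear reward R(s, a; w) = w · f(s, a); solve_rl, rollout, and features are assumed helpers (run classical RL under the current reward, collect a trajectory, and sum feature vectors along a trajectory, respectively).

```python
# Rough sketch of the inverse RL loop with a linear reward R(s, a; w) = w . f(s, a).
# `solve_rl`, `rollout`, and `features` are hypothetical helpers.
import numpy as np

def inverse_rl(expert_trajectories, env, features, solve_rl, rollout,
               dim, iterations=50, lr=0.01):
    w = np.zeros(dim)
    # Average feature counts of the expert's trajectories
    expert_feat = np.mean([features(t) for t in expert_trajectories], axis=0)
    for _ in range(iterations):
        # Inner loop: learn the optimal policy under the current reward weights
        policy = solve_rl(env, reward_weights=w)
        # Collect the agent's trajectory and its feature counts
        agent_feat = features(rollout(env, policy))
        # Outer loop: adjust w so the expert's total reward stays above the agent's
        w += lr * (expert_feat - agent_feat)
    return w
```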


Friday, August 25, 2017

Reinforcement Learning Overview

There are basically three different types of Machine Learning:
  • Supervised Learning:  The major use case is prediction.  We provide a set of training data including the input and output, then train a model that can predict the output from an unseen input.
  • Unsupervised Learning:  The major use case is pattern extraction.  We provide a set of data that has no output, and the algorithm tries to extract the underlying non-trivial structure within the data.
  • Reinforcement Learning:  The major use case is optimization.  Mimicking how humans learn from childhood, we use a trial-and-error approach to find out which actions produce good outcomes, and bias our preference towards those good actions.
In this post, I will provide an overview of the settings of Reinforcement Learning as well as some of its key algorithms.

Agent / Environment Interaction

Reinforcement Learning is all about how we can make good decisions through trial and error.  It is the interaction between the "agent" and the "environment".

Repeat the following steps until a termination condition is reached:
  1. The agent observes that the environment is in state s
  2. Out of all possible actions, the agent needs to decide which action to take (this is the "policy": a function that outputs an action given the current state)
  3. The agent takes the action, and the environment receives that action
  4. Through a transition matrix model, the environment determines the next state and proceeds to that state
  5. Through a reward distribution model, the environment determines the reward given to the agent for taking action a at state s
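
A minimal sketch of this interaction loop is shown below; the env and policy objects are hypothetical, with env.reset() returning the initial state and env.step(a) returning the next state, the reward, and a termination flag.

```python
# Minimal agent/environment interaction loop (hypothetical env and policy interfaces).
def run_episode(env, policy):
    state = env.reset()                          # observe the initial state s
    total_reward, done = 0.0, False
    while not done:                              # repeat until termination
        action = policy(state)                   # the policy picks an action for state s
        state, reward, done = env.step(action)   # environment transitions and rewards
        total_reward += reward
    return total_reward
```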




The goal for the agent is to determine an optimal policy such that the "value" of the start state is maximized.

Some terminology:
  • Episode:  a sequence (s1, a1, r1, s2, a2, r2, s3, a3, r3, ..., st, at, rt, ..., sT, aT, rT)
  • Reward rt:  the reward the agent receives after taking action at in state st at time t
  • Return:  the cumulative reward from the time the action is taken (sum of rt, r[t+1], ..., rT)
  • Value:  the expected return at a particular state, called the "state value" V(s), or the expected return when taking action a at state s, called the "Q value" Q(s,a)
The optimal policy can be formulated as choosing the action a* among all choices of a at state s such that Q(s, a*) is maximum.

To deal with never-ending interaction, we put a discount factor "gamma" on future rewards.  This discount factor turns the sum of an infinite series into a finite number.
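
For example, the discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... can be computed by accumulating the rewards backwards:

```python
# Discounted return: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):     # accumulate from the end of the episode backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
```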

Optimal Policy when model is known

If we know the "model", then figuring out the policy is easy.  We just need to use dynamic programming techniques to compute the optimal policy offline, and there is no need for learning.

Two algorithms can be used: 

"Value iteration" starts with a random value and iteratively update the value based on the Bellman's equation, and finally compute the "value" of each state or state/action pair (also call Q state). The optimal policy for a given state s is to choose the action a* that maximize the Q value, Q(s, a). 


Another algorithm "Policy iteration" starts with a random policy, and iteratively modifies the policy to make it better, until the policy at next iteration doesn't change any more.


However, in practice, we usually don't know the model, so we cannot compute the optimal policy as described above.

Optimal Policy when model is unknown

One solution is "model-based" learning: we spend some time learning the transition probability model as well as the reward distribution model.  To make sure we experience all possible combinations of state/action pairs, we take random actions in order to learn the model.

Once we learn the model, we can go back to use the value iteration or policy iteration to determine the optimal policy.

Learning has a cost, though.  Rather than taking the best action, we take random actions in order to explore actions we haven't tried before, and it is very likely that the associated reward is not the maximum.  However, we accumulate knowledge about how the environment reacts under a wider range of scenarios, and hopefully this helps us take better actions in the future.  In other words, we trade off short-term gain for long-term gain.

Making the right balance is important.  A common approach is the epsilon-greedy algorithm: at each decision step, with a small probability e we take a random action, and with probability (1-e) we take the best-known action we have explored before.
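
A minimal epsilon-greedy sketch, assuming Q is a mapping from (state, action) pairs to value estimates:

```python
# Epsilon-greedy action selection over a tabular Q estimate.
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    if random.random() < epsilon:                # explore with probability e
        return random.randrange(n_actions)
    # Exploit: pick the best-known action for this state (unseen pairs default to 0)
    return max(range(n_actions), key=lambda a: Q.get((state, a), 0.0))
```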


Another solution approach is "model-free" learning.  Let's go back and look at the detailed formulas under value iteration and policy iteration: the reason we need the model is to calculate the expected state value and Q value.  Can we figure out these expected values directly through trial and error?

Value based model free learning

If we modify the Q value iteration algorithm to replace the expected reward/next state with the actual reward/next state, we arrive at the SARSA algorithm below.
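
A minimal tabular SARSA sketch follows, reusing the hypothetical env interface and the epsilon_greedy helper above; the expected value is replaced by the (s, a, r, s', a') transition actually experienced.

```python
# Tabular SARSA sketch: update Q(s, a) towards r + gamma * Q(s', a') using the
# transition actually experienced (hypothetical env returning (state, reward, done)).
from collections import defaultdict

def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                       # Q(s, a) table, default 0
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, epsilon)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, n_actions, epsilon)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```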



Deep Q Learning 

The algorithm above requires us to keep a table of all Q(s,a) values, which can be huge, and becomes infinite if the state or action space is continuous.  To deal with this, we introduce the idea of a value function.  The state and action become the input parameters of this function, which creates "input features" that are fed into a linear model that finally outputs the Q value.



Now we modify the previous SARSA algorithm as follows (a minimal sketch appears after this list) ...

  • Instead of looking up the Q(s,a) value, we call the function (which can be a DNN), pass in the features f(s, a), and get its output
  • We randomly initialize the parameters of the function (the weights, if the function is a DNN)
  • We update the parameters using gradient descent on the loss, which can be the difference between the estimated value and the target value (e.g. a one-step look-ahead estimate: r + gamma*max_a'[Q(s',a')] )
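
Here is a minimal sketch of this idea with a linear function approximator Q(s, a; w) = w · f(s, a); the feature map f is a hypothetical placeholder, and the target uses the one-step look-ahead mentioned above.

```python
# One update step of Q learning with a linear function approximator
# Q(s, a; w) = w . f(s, a).  The feature map `f` is a hypothetical placeholder.
import numpy as np

def q_update(w, f, s, a, r, s2, actions, done, alpha=0.01, gamma=0.99):
    q_sa = w @ f(s, a)
    # One-step look-ahead target: r + gamma * max_a' Q(s', a'; w)
    target = r if done else r + gamma * max(w @ f(s2, a2) for a2 in actions)
    # Gradient step on the squared difference between the estimate and the target
    w += alpha * (target - q_sa) * f(s, a)
    return w
```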


If we further generalize the Q value function using a deep neural network, and update its parameters using back propagation, then we arrive at a simple version of Deep Q Learning.

While this algorithm allows us to learn a Q value function that can represent a continuous state, we still need to evaluate every action and pick the one with the maximum Q value.  In other words, the action space can only be discrete and finite.

Policy gradient

Since the end goal is to pick the right action, and finding the Q value is just a means to that end (so we can pick the action with the maximum Q), why don't we learn a function that takes a state and directly outputs an action?  Using this policy function approach, we can handle both continuous and discrete action spaces.

The key idea is to learn a function that, given a state, outputs an action:

  • If the action is discrete, it outputs a probability distribution over the actions
  • If the action is continuous, it outputs the mean and variance of the action, assuming a normal distribution
The agent samples from the output distribution to determine the action, so its chosen action is stochastic (non-deterministic).  Then the environment determines the reward and next state, and the cycle repeats ...
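
A small sketch of the sampling step, assuming the policy network outputs raw logits in the discrete case and a mean/standard deviation in the continuous case:

```python
# Sampling an action from the policy's output distribution.
import numpy as np

def sample_discrete(logits):
    # Discrete case: softmax over the policy outputs gives action probabilities
    p = np.exp(logits - np.max(logits))
    p /= p.sum()
    return np.random.choice(len(p), p=p)

def sample_continuous(mean, std):
    # Continuous case: assume a normal distribution over the action
    return np.random.normal(mean, std)
```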

The goal is to find the best policy function such that the expected value of Q(s, a) is maximized.  Notice that s and a are random variables whose distribution is parameterized by θ.




To maximize an "expected value" of a function with parameters θ, we need to calculate the gradient of that function.


Actor Critic Algorithm

There are two moving targets in this equation:

  • To improve the policy function, we need an accurate estimate of the Q value, and we also need the gradient of log π(a|s)
  • To make the Q value estimate more accurate, we need a stable policy function
We can break these down into two different roles:
  • An actor, whose job is to improve the policy function by tuning the policy function parameters
  • A critic, whose job is to fine-tune the estimate of the Q value based on the current (incrementally improving) policy
The "actor critic" algorithm is shown below.


We then enhance this algorithm with the following steps:
  • Replace the Q value function with an advantage function, where A(s, a) = Q(s, a) - Expected Q(s, *), i.e. A(s, a) = Q(s, a) - V(s)
  • Run multiple threads asynchronously
This is the state-of-the-art A3C algorithm.
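
In practice the advantage is often estimated without a separate Q network, for example by using the one-step TD error as a stand-in; a tiny illustrative helper:

```python
# One-step TD estimate of the advantage: A(s, a) ≈ r + gamma * V(s') - V(s).
def advantage_estimate(r, v_s, v_s2, gamma=0.99, done=False):
    return r + (0.0 if done else gamma * v_s2) - v_s
```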



Learning resources and credits


Some of the algorithms discussed above are extracted from the following sources.