Deep Reinforcement learning notes (UBC)
Background
This note is the class note of UBC Deep reinforcement learning, namely CS294112 or CS285. the lecturer is Sergey Levine. The lecturer video can be find on Youtube. I wrote two notes on reinforcement learning before, one is basic RL, the other is the David Silver class note.
Different from the previous courses, this course includes deeper theoretical view, more recent methods and some advancer topics. Especially in model based RL and metalearning. It is more suitable for the guys who are interested on robotics control and deeper understanding of reinforcement learning.
This class is a little bit hard to study, so make sure you follow the class tightly.
I’m sorry that some of maths can not view properly on the post, maybe I will figure out it that later.
1. Imitation learning
The main problem of imitation: distribution drift
how to make the distribution of training dataset as same as the distribution under policy?
DAgger
DAgger: Dataset Aggregation
goal: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)$ !
how? just run $p_{\pi_\theta}(a_t  o_t)$. 
but need labels $a_t$ !

train $\pi_{\theta}(a_t o_t)$ from human data $\mathcal{D}={o_1,a_1,…,o_N,a_N}$; 
run $\pi_\theta(a_t o_t)$ to get dataset $\mathcal{D}_\pi = {o_1,…,o_M}$;  ask human to label $\mathcal{D}_\pi$ with actions $s_t$;
 aggregate: $\mathcal{D}\gets \mathcal{D}\cup\mathcal{D}_\pi$
fit the model perfectly
why fail to fit expert?
 NonMarkonvian behavior
 use history observations
 Multimodal behavior
 for discrete action, it is OK since Softmax output probability over actions
 for continuous action
 output mixture of Gaussians
 latent variable models(inject noise to network input)
 autoregressive discretization
other problems of imitation learning
 human labeled data is finite
 human not good at some problems
reward function of imitation learning
reward function of imitation learning can be
MDP & RL Intro
The goal of RL
expected reward where $p_\theta(\tau)$ is the distribution of the sequence
Q & V
Types of RL algorithms
 Policy gradient
 valuebased
 Actorcritic
 modelbased RL
 for planning
 optimal control
 discrete planning
 improve policy
 something else
 dynamic programming
 simulated experience
 for planning
tradeoffs
 sample efficiency
 stability & ease of use
assumptions
 stochastic or deterministic
 continuous or discrete
 episodic or infinite horizon
sample efficiency

off policy: able to improve the policy without generating new samples from that policy

on policy: each time the policy is changed, even a little bit, we need to generate new samples
stability & ease of use
convergence is a problem
supervised learning almost always gradient descent
RL often not strictly gradient descent
2. Policy gradient
Objective function
policy differentiation
log derivative
Objective function differentiation
evaluating the policy gradient
REINFORCE algorithm

sample ${\tau^i}$ from $\pi_\theta(s_t s_t)$ (run the policy); 
$\Delta_\theta J(\theta)\approx\sum_{i}^N\left(\sum_{t} \Delta_\theta \log \pi_\theta (a_t s_t) \right)\left(\sum_{t} r(s_t,a_t)\right)$;  $\theta \gets\theta+\alpha\Delta_\theta J(\theta)$.
policy gradient
Reduce variance
Causality: policy at time $t’$ cannot affect reward at time t when $t<t’$ baseline prove Here, $\tau$ means the whole episode sample by current policy.
we can proof that their has a optimal baseline to reduce variance: But in practice, we just use the expectation of reward as baseline to reduce the complexity.
policy gradient is onpolicy algorithm
Offpolicy learning & importance sampling
what if we sample from $\bar{\pi}(\tau)$ instead?
Importance sampling
so apply this to our objective function, we have and we have so we have
The offpolicy policy gradient
we can view state and action separately, then:
If $\frac{p_{\theta’}(s_t)}{p_{\theta}(s_t)}$ is small and bounded, then we can delete it, and this leads to TPRO method we will discuss later.
for coding, we can use “pseudoloss” as weighted maximum likelihood with automatic differentiation:
policy gradient in practice
 the gradient has high variance
 this isn’t the same as supervised learning!
 gradients will be really noisy!

consider using much larger batches
 tweaking learning rates is very hard
 adaptive step size rules like ADAM can be OKish
 we will learn about policy gradientspecific learning rate adjustment method later
3. Actorcritic method
### Basics
recap policy gradient where Q is a sample from trajectories, which is unbiased estimate but has high variance problem.
We can use expectation to reduce variance and we define then
Advantage
The better $A^\pi(s_t,a_t)$ estimate, the lower the variance.
Value function fitting
and we add a little bias(one step biased sample) for convenience so we have then we only need to fit $V^\pi(s)$ !
Policy evaluation
Monte Carlo policy evaluation (this is what policy gradient does) We can try multiple samples if we can reset the environment to previous state Monte Carlo evaluation with function approximation
with function approximation, only using one sample from trajectory still pretty good.
training data: ${\left(s_{i,t},\sum_{t’=t}^Tr(s_{i,t’},a_{i,t’})\right)}$
supervised regression: $\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)y_i\parallel^2$
Ideal target: Monte Carlo target:
TD(bootstrapped)
training data: $ {\left(s_{i,t},r(s_{s_{i,t}},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\right)} $
Actorcritic algorithm
batch actorcritic algorithm:

sample ${s_i,a_i}$ from $\pi_\theta(a s)$  fit $\hat{V_\phi^\pi}(s)$ to sampled reward sums
 evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\hat{V}\phi^\pi(s’_i)\hat{V}\phi^\pi(s_i)$

$\Delta_\theta J(\theta)\approx \sum_i \Delta_\theta \log \pi_\theta (a_{i} s_{i})\hat{A}^\pi(s_i,a_i)$  $\theta \gets \theta+\alpha\Delta_\theta J(\theta)$
Aside: discount factors
what if T (episode length) is $\infty$ ?
$\hat{V}_\phi^\pi$ can get infinitely large in many cases
simple trick: better to get rewards sooner than later actually we use discount in policy gradient as Online actorcritic algorithm(can apply to every single step):

take action $a\sim\pi_\theta(a s)$, get $(s,a,s’,r)$  update $\hat{V_\phi^\pi}(s)$ using target $r+\gamma\hat{V_\phi^\pi}(s’)$
 evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}\phi^\pi(s’)\hat{V}\phi^\pi(s)$

$\Delta_\theta J(\theta)\approx \Delta_\theta \log \pi_\theta (a s)\hat{A}^\pi(s,a)$  $\theta \gets \theta+\alpha\Delta_\theta J(\theta)$
Architecture design
network architecture choice

value network and policy network are separate(more stable and sample)

some of value network and policy network are shared(have shared feature)
works best with a batch
tradeoff and balance
policy gradient
actorcritic policy gradient is no bias but has higher variance
actorcritic is lower variance but not unbiased
so can we combine these two things?
here we have critics as statedependent baselines
 no bias
 lower variance
Eligibility traces & nstep returns
Critic and Monte Carlo critic
combine these two?
nstep returns choosing $n>1$ often works better!!!
Generalized advantage estimation(GAE)
Do we have to choose just one n?
Cut everywhere all at once!
How to weight?
Mostly prefer cutting earlier(less variance) $w_n\propto\lambda^{n1}$ e.g. $\lambda=0.95$
and this leads to Eligibility traces in this way, every time you want to update a state, you need to have n steps experience
4. Value based methods
$\underset{a_t}{\arg\max}A^\pi(s_t,a_t)$ : best action from $s_t$, if we then follow $\pi$
then:
this at least as good as any $a_t \sim \pi(a_t, s_t)$
Policy iteration
 evaluate $A^\pi(s,a)$
 set $\pi \gets \pi’$
Dynamic programming
assume we know $p(s’  s,a)$ and s and a are both discrete (and small) 
bootstrapped update: with deterministic policy $\pi(s)=a$, we have
So policy iteration become
 set $Q^\pi(s,a)\gets r(s,a)+\gamma E[V^\pi(s’)]$
 set $V(s)\gets \max_a Q(s,a)$
Function approximator
$\mathcal{L}=\frac{1}{2}\sum_i\parallel V_\phi (s)\max_a Q(s,a)\parallel^2$
fitted value iteration
fitted value iteration algorithm:

set $y_i \gets \max_{a_i}(r(s_i a_i)+\gamma E[V_\phi(s’_i)])$  set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel V_\phi (s)y_i\parallel^2$$
but we can not do maximum if we do not have dynamics, so we evaluate Q instead of V
fitted Qiteration
 collect dataset ${(s_i, a_i,s_i’,r_i)}$ using some policy
 set $y_i \gets r(s_i,a_i) +\gamma \max_{a_i’}Q_\phi(s_i’,a_i’)$
 set $\phi \gets \arg min_\phi \frac{1}{2}\sum_iQ_\phi(s_i,a_i)y_i^2$ repeat step 2,3 k times and then return step 1
Qlearning is offpolicy, since it fit the $Q(s,a)$, which estimate all state action Q, no matter the action and state from which policy, it has the maximum item. And for the $r(s,a)$ item, given a and a, transition is independent of $\pi$.
exploration

epsilongreedy

Boltzmann exploration
Value function learning theory
value iteration:
 set $Q(s,a) \gets r(s,a)+\gamma E[V(s’)]$
 set $V(s) \gets \max_a Q(s,a)$
tabular case is converged.
Nontabular case is not guarantee convergence.
In actorcritic, it also need to estimate the V, and if using bootstrap approach, it has the same problem that can not guarantee convergence.
5. Practical Qlearning
What’s wrong of the online Qlearning?
Actually, it is not gradient decent, it do not calculate the gradient of target Q in y.
samples is i.i.d
Replay buffer (replay samples many times)
Qlearning with a replay buffer:
 collect dataset ${(s_i,a_i,s_i’,t_i)}$ using some policy, add it to $\mathcal{B}$
 sample a batch $(s_i,a_i,s_i’,r_i)$ from $\mathcal{B} $
 $\phi \gets\phi\alpha\sum_i\frac{d Q_\phi}{d \phi} (s_i,a_i) \frac{1}{2}\sum_i(Q_\phi(s_i,a_i) [r(s_i,a_i) +\gamma \max_{a_i’}Q_\phi(s_i’,a_i’)])^2$ , do these k times
Target network
DQN (Target network+Replay buffer)
 save target network parameters: $\phi’ \gets \phi$
 collect dataset ${(s_i,a_i,s_i’,t_i)}$ using some policy, add it to $\mathcal{B}$ , do this N times
 sample a batch $(s_i,a_i,s_i’,r_i)$ from $\mathcal{B} $
 $\phi \gets \arg min_\phi \frac{1}{2}\sum_iQ_\phi(s_i,a_i) [r(s_i,a_i) +\gamma \max_{a_i’}Q_{\phi’}(s_i’,a_i’)]^2$ , do 2,3 k times
 collect dataset ${(s_i,a_i,s_i’,t_i)}$ using some policy, add it to $\mathcal{B}$ , do this N times
Alternative target network
Polyak averaging: soft update to avoid sudden target network update:
update $\phi’$: $\phi’ \gets \tau \phi’ + (1\tau)\phi$ e.g. $\tau =0.999$
Double Qlearning
Are the Qvalues accurate?
It’s often much larger than the true value since the maximum operation will always adds the noisy Q estimation to make Q function overestimate.
Target value $y_j=r_i +\gamma \max_{s_j’}Q_{\phi’}(s_j’,a_j’)$ value also comes from $Q_{\phi’}$ action selected according to $Q_{\phi’}$
How to address this?
Double Qlearning
idea: don’t use the same network to choose the action and evaluate value! (decorrelate the noise)
use two networks: the value of both two networks come from the other network!
Double Qlearning in practice
Just use the current and target networks as $\phi_A$ and $\phi_B$, use current network to choose action, and current network get Q value.
standard Qlearning: $y=r+\gamma Q_{\phi’}(s’, \arg \max_{a’}Q_{\phi’}(s’,a’))$
double Qlearning: $y=r+\gamma Q_{\phi’}(s’, \arg \max_{a’}Q_{\phi}(s’,a’))$
Multistep returns
In Qlearning, this only actually correct when learning onpolicy. Because the sum of r comes from the transitions of different policy.
How to fix?
 ignore the problem when N is small
 cut the tracedynamically choose N to get only onpolicy data
 works well when data mostly onpolicy, and the action space is samll
 importance sampling—ref the paper “safe and efficient offpolicy reinforcement learning” Munos et al. 16
Qlearning with continuous actions
How to do argmax in continuous actions space?

optimization
 gradient based optimization (e.g., SGD) a bit slow in the inner loop
 action space typically lowdimensional—–what about stochastic optimization?
a simple if sample from discrete cations
$max_a Q(s,a)\approx max{Q(s,a_1),…,Q(s,a_N)}$
more accurate solution:
 crossentropy method (CEM)
 simple iterative stochastic optimization
 CMAES

use function class that is easy to optimize NAF: Normalized Advantage Functions
Using the neural network to get $\mu,P,V$
Then but this lose some representational power

learn an approximate maximizer
DDPG idea: train another network $\mu_\theta(s)$ such that $\mu_\theta(s)\approx arg \max_aQ_\phi(s,a)$
how to train? solve $\theta \gets \arg \max_\theta Q_\phi(s,\mu_\theta(s))$ DDPG:
 take some action $s_i$ and observe $(s_i,a_i,s_i’,r_i)$, add it to $\mathcal{B}$
 sample minibatch ${s_j,a_j,s_j’,r_j}$ from $\mathcal{B}$ uniformly
 compute $y_j=r_j+\gamma Q_{\phi’}(s_j’,\mu_{\theta’}(s_j’))$ using target nets $Q_{\phi’}$ and $\mu_{\theta’}$_j
 $\phi \gets \phi  \alpha\sum_j\frac{dQ_\phi}{d\phi}(s_j,a)(Q_\phi(s_j,a_j)y_k)$
 $\theta \gets \theta  \beta\sum_j\frac{d\mu}{d\theta}(s_j)\frac{dQ_\phi}{da}(s_j,a)$
 update $\phi’$ and $\theta’$ (e.g., Polyak averaging)
Tips for Qlearning

Bellman error gradients can be big;clip gradients or use Huber loss instead of square error

Double Qlearning helps a lot in practice, simple and no downsides

Nstep returns also help a lot, but have some downsides

Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too

Run multiple random seeds, it’s very inconsistent between runs
6. Advanced Policy Gradients
Basics
Recap
Recap: policy gradient
REINFORCE algorithm

sample ${\tau^i}$ from $\pi_\theta(s_t s_t)$ (run the policy) 
$\Delta_\theta J(\theta)\approx\sum_{i}\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t^i s_t^i) \left(\sum_{t’=t}^T r(s_{t’},a_{t’})\right)\right)$  $\theta \gets\theta+\alpha\Delta_\theta J(\theta)$
Why does policy gradient work?
policy gradient as policy iteration
$J(\theta)=E_{\tau \sim p_\theta(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]$
so we proved that:
The Goal is Making things offpolicy
But we want to sample from $\pi_\theta$ not $\pi_{\theta’}$, we apply importance sampling: but there still has the state sample from $p_{\theta’}(s_t)$ , and can we approximate it as $p_\theta(s_t)$? so that we can use $\hat{A}^\pi(s_t,a_t)$ to get improved policy $\pi’$.
Bounding the objective value
Here we can prove that:
$\pi_{\theta’}$ if close to $\pi_\theta$ if $  \pi_{\theta’(a_t  s_t)}\pi_\theta(a_t  s_t)  \le\epsilon$ for all $s_t$ 
$  p_{\theta’}(s_t)p_\theta(s_t)  \le 2\epsilon t$ 
The prove of this refer the lecture video or the TRPO paper.
It’s easy to prove that: so C is $O(Tr_{max})$ or $O(\frac{r_{max}}{1\gamma})$
So after all the prove before, what we get? For small enough $\epsilon$, this is guaranteed to improve $J(\theta’)J(\theta)$
A more convenient bound is using KL divergence: $  \pi_{\theta’(a_t  s_t)}\pi_\theta(a_t  s_t)  \le \sqrt{\frac{1}{2}D_{KL}(\pi_{\theta’}(a_t  s_t)\pi(a_t  s_t))}$ 
$\Rightarrow D_{KL}(\pi_{\theta’}(a_ts_t)\pi(a_ts_t)$ bounds state marginal difference, where Why not using $\epsilon$ but the $D_{KL}$?
KL divergence has some very convenient properties that make i much easier to approximate!
So the optimization becomes:
Solving the constrained optimization problem
How do we enforce the constraint?
By using dual gradient descent, we set the object function as
 Maximize $\mathcal{L}(\theta’, \lambda)$ with respect to $\theta$

$\lambda \gets + \alpha(D_{KL}(\pi_{\theta’}(a_t s_t)\pi(a_t s_t))\epsilon)$
How else do we optimize the object?
define: applying Firstorder Taylor expansion and optimize and so the optimization becomes and gradient ascent does this: by updating like $\theta’=\theta’+\sqrt{\frac{\epsilon}{\Delta_\theta J(\theta)^2}}\Delta_\theta J(\theta)$, this is what actually gradient ascent(policy gradient) doing.
But this (the gradient ascent constrain) is not a good constrain since some parameters change probabilities a lot more than others, and we want that the probability distributions are close.
Applying ‘second order Taylor expansion’ to $D_{KL}$ where $\pmb{F}$ is the ‘Fisherinformation matrix’ which can estimate with with samples And if we use the following update the constrain will satisfied. and this is called the natural gradient.
Practical methods and notes
 natural policy gradient $\theta’=\theta+\alpha\pmb{F}^{1}\Delta_\theta J(\theta)$
 Generally a good choice to stabilize policy gradient training
 See this paper for details:
 Petters, Schaal. Reinforcement learning of motor skills with policy gradients.
 Practical implementation: requires efficient Fishervector products, a bit nontrivial to do without computing the full matrix
 See: Schulman et all. Trust region policy optimization
 Trust region policy optimization (TRPO) $\alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}\Delta_\theta J(\theta)}}$
 Just use the IS (important sampling) objective directly (use $\bar{A}$ as object)
 Use regularization to stay close to old policy
 See: proximal policy optimization (PPO)
So the TRPO and the PPO is two Practical methods solving the constrained optimization in neural network setting.
7. Optimal Control and Planning
Recap: the reinforcement learning objective In modelfree RL, we do not know $p(s_{t+1}s_t,a_t)$.
But actually we knew the dynamics sometimes.
 Often we do know the dynamics
 Often we can learn the dynamics
If we know the dynamics, what can we do?
Modelbased reinforcement learning

Modelbased reinforcement learning: learn the transition dynamics, then figure out how to choose actions

How can we make decisions if we know the dynamics?
a. How can we choose actions under perfect knowledge of the system dynamics?
b. Optimal control, trajectory optimization, planning

How can we learning unknown dynamics?

How can we then also learn policies? (e,g. by imitating optimal control)
The objective
Deterministic case
Stochastic openloop case
openloop: choose a_1…a_T in one time, not step by step closedloop: every step the agent gets a feedback from environment
Stochastic closedloop case
Stochastic optimization
optimal control/planning：
Crossentropy method (CEM)
Here $A$ is $a_1,…,a_t$
 sample $A_1,…,A_n$ from $p(A)$
 evaluate $J(A_1),…,J(A_n)$
 pick M elites $A_{i_1},…,A_{i_M}$ with the highest value, where $M<N$
 refit $p(A)$ to the elites $A_{i_1},…,A_{i_M}$
Monte Carlo Tree Search (MCTS)
Generic MCTS sketch
 find a leaf $s_l$ using TreePolicy($s_1$)
 evaluate the leaf using DefaultPolicy($s_l$)
 update all values in the tree between $a_1$ and $s_l$
take best action from $s_1$ and repeat
every node stores Q and N, Q is the estimated value and N is the visited number
UCT TreePolicy($s_t$)
if $s_t$ nit fully expanded, choose new $a_t$
else choose child with best Score($s_{t+1}$) $$ Score(s_t) = \frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2\ln N(s_{t1})}{N(s_t)}} $$ For more about MCTS, ref Browne. et al. A survey of Monte Carlo Tree Search Methods. (2012)
Optimal control
Here we shows the optimization process if we know the environment dynamics. Almost the stuffs in control theory.
Deterministic case
Shooting methods vs collocation
the previous CEM is actually random shooting method.
collocation method: optimize over actions and states, with constraints.
Linear case: LQR
Linear case: the case that F is linear function and cost is quadratic function Where Base case: solve for $u_T$ only We substitute $u_T$ by $x_T$ to eliminate $u_T$ Then solve for $U_{T1}$ in term terms of $x_{T1}$ and then do same thing as the T case, result in similar results.
backward recursion
for $t=T$ to 1: Forward recursion
For $t=1$ to $T$:
Stochastic dynamics
if the probability is Gaussian and the mean is linear and variance is fixed. Then same algorithm can be applied since symmetry of Gaussian.
Nonlinear case: DDP/iterative LQR
approximate a nonlinear system as a linearquadratic system using Taylor expansion
In fact, this just Newton’s method for trajectory optimization.
for more Newton’s method for trajectory optimization, ref follow papers:
 Differential dynamic programming.(1970)
 Synthesis and Stabilization of complex behaviors through online trajectory optimization.(2012)
 practical guide for implementing nonlinear iterative LQR.
 Learning Neural Network policies with guided policy search under unknown dynamics (2014)
 Probabilistic formation and trust region alternative to deterministic line search.
8. ModelBased Reinforcement Learning (learning the model)
Basic
Why learn the model?
If we knew $f(s_t,a_t)=s_{t+1}$, we could use the tools from last course.
(or $p(s_{t+1} s_t,a_t)$ in stochastic case)
modelbased reinforcement learning version 0.5:

run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$  learning dynamics model $f(s,a)$ to minimize $\sum_if(s_i,a_i)s_i’^2$
 plan through $f(s,a)$ to choose actions
Does it work?
 This is how system identification works in classical robotics
 Some care should be taken to design a good base policy
 Particularly effective if we can handengineer a dynamics representation using our knowledge of physics, and fit just a few parameters
 The model only fit the base policy, but the final actual policy beyond that policy, that will cause distribution mismatch problem.
Overfitting problem
Distribution mismatch problem
Can we do better?
can we make $p_{\pi_0}(s_t)=p_{\pi_f}(s_t)$?
modelbased reinforcement learning version 1.0:

run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$  learning dynamics model $f(s,a)$ to minimize $\sum_if(s_i,a_i)s_i’^2$
 plan through $f(s,a)$ to choose actions
 execute those actions and add the resulting data ${(s,a,s’)_j}$ to $\mathcal{D}$, and repeat step 2~4
But the model has errors, so it may lead to some bad actions, How to address that?
MPC
modelbased reinforcement learning version 1.5:

run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$  learning dynamics model $f(s,a)$ to minimize $\sum_if(s_i,a_i)s_i’^2$
 plan through $f(s,a)$ to choose actions
 execute the first planned action, observe resulting state $s’$ (MPC)
 append $(s,a,s’)$ to dataset $\mathcal{D}$. repeat steps 3~5, and every N steps repeat steps 2~5
Using model uncertainty
Can we do better by using model uncertainty?
How to get uncertainty?
 use output entropy(bad idea)
 estimate model uncertainty

one way to get this is by Bayesian neural networks (BNN) (introduce later)

another way is train multiple models, and see if they agree each other.(Bootstrap ensembles)
How to train?
main idea: need to generate “independent” datasets to get “independent” models.
can do this by resampling from dataset with replacement, means you have same distribution but different ordered datasets
Does this works?
This basically works
Very crude approximation, because the number of models is usually small (<10)
Resampling with replacement is usually unnecessary, because SGD and random initialization usually makes the models sufficiently independent
For candidate action sequence $a_1,…,a_H$:

sample $\theta\sim p(\theta \mathcal{D})$ 
at each time step $t$, sample $s_{t+1}\sim p(s_{t+1} s_t,a_t,\theta)$  calculate $R=\sum_tr(s_t,a_t)$
 repeat steps 1 to 3 and accumulate the average reward
Modelbased RL with images (POMDP)
Modelbased RL with latent space models
What about complex observations?
 High dimensionality
 Redundancy
 Partial observability
learn approximate posterior $q_\psi(s_t  o_{1:t},a_{1:t})$ 
other choices:

$q_\psi(s_t,s_{t+1} o_{1:t},a_{1:t})$ 
$q_\psi(s_t o_t)$
here we only estimate $q_\psi(s_t  o_t)$ 
assume that $q_\psi(s_t  o_t)$ is deterministic 
stochastic case requires variational inference (later)
Deterministic encoder and maybe the reward also need to learn. Modelbased RL with latent space models

run base policy $\pi_0(o_t a_t)$ (e.g., random policy) to collect $\mathcal{D}={(o,a,o’)_i}$ 
learning $p_\psi(s_{t+1} s_t,a_t), p_\psi(r_t s_t), p(o_t s_t), g_\psi(o_t)$  plan through the model to choose actions
 execute the first planned action, observe resulting state $o’$ (MPC)
 append $(o,a,o’)$ to dataset $\mathcal{D}$. repeat steps 3~5, and every N steps repeat steps 2~5
Learn directly in observation space
directly learn $p(o_{t+1}  o_t,a_t)$ 
do image prediction
learn reward or set the goal observation
9. ModelBased RL and Policy Learning
Basic
What if we want a policy rather than just optimal control?
 Do not need to replan (faster)
 Potentially better generalization
 Closed loop control
Backpropagate directly into the policy
modelbased reinforcement learning version 2.0:

run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$  learning dynamics model $f(s,a)$ to minimize $\sum_if(s_i,a_i)s_i’^2$

backpropagate through $f(s,a)$ into the policy to optimize $\pi_\theta(a_t s_t)$ 
run $\pi_\theta (a_t s_t)$, appending the visited tuples $(s,a,s’)$ to $\mathcal{D}$ , repeat steps 2~4
What’s the problem?
 similar parameter sensitivity problems as shooting methods
 But no longer have convenient second order LQRlike method, because policy parameters couple all the tie steps, so no dynamic programming
 Similar problem to training long RNNs with BPTT
 Vanishing and exploding gradients
 Unlike LSTM, we can’t just “choose” a simple dynamics, dynamics are chosen by nature
Guided policy search
How to deal with constrain?
Dual gradient decent (DGD)
 Find $x^*\gets \arg\min_x\mathcal{L}(x,\lambda)$
 Compute $\frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)$
 $\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$
A small tweak to DGD: augmented Lagrangian
 Find $x^*\gets \arg\min_x\bar{\mathcal{L}}(x,\lambda)$
 Compute $\frac{dg}{d\lambda}=\frac{d\bar{\mathcal{L}}}{d\lambda}(x^*,\lambda)$
 $\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$
When far from solution, quadratic term tends to improve stability
Constraining trajectory optimization with dual gradient descent
Guided policy search (GPS) discussion
 Find $\tau \gets \arg\min_\tau \bar{\mathcal{L}}(\tau,\theta,\lambda)$ (e.g. via iLQR or other planning methods)
 Find $\theta \gets \arg\min_\theta\bar{\mathcal{L}}(\tau, \theta, \lambda)$ (e.g. via SGD)
 $\lambda \gets \lambda+\alpha \frac{dg}{d\lambda}$ and repeat
 Can be interpreted as constrained trajectory optimization method
 Can be interpreted as imitation of optimal control expert, since step 2 is just supervised learning
 The optimal control “teacher” adapts to the learner , and avoids actions that the learner can’t mimic
General guided policy search scheme
 Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
 Optimize $\theta$ with respect to some supervised objective
 Increment or modify dual variables $\lambda$
Need to choose:
 form of $p(\tau)$ or $\tau$ (if deterministic)
 optimization method for $p(\tau)$ or $\tau$
 surrogate $\tilde{c}(x_t,u_t)$

supervised objective for $\pi_\theta(u_t x_t)$
Deterministic case
 Optimize $\tau$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
 Optimize $\theta$ with respect to some supervised objective
 Increment or modify dual variables $\lambda$. repeat 1~3
Learning with multiple trajectories
 Optimize each $\tau_i$ in parallel with respect to $\tilde{c}(x_t,u_t)$
 Optimize $\theta$ with respect to some supervised objective
 Increment or modify dual variables $\lambda$. repeat 1~3
Stochastic (Gaussian) GPS
 Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
 Optimize $\theta$ with respect to some supervised objective
 Increment or modify dual variables $\lambda$
Here, a little different from pure imitation learning that mimic the optimal control result, the agent imitate the planning results, and then if it can not perform good imitation, then the optimization process will adjust its planning policy to fit the learning policy since the policy is a constrain of planning.
Input Remapping Trick
Imitation optimal control
Imitation optimal control with DAgger
 from current state $s_t$, run MCTS to get $a_t,a_{t+1},…$
 add $(s_t,a_t)$ to dataset $\mathcal{D}$

execute action $s_t\sim\pi(a_t s_t)$ (not MCTS action!). repeat 1~3 N times  update the policy by training on $\mathcal{D}$
Problems of the original DAgger
 Ask human to label the state from other policy is hard
 run the initial not good policy in real world is dangerous in some applications
We address the first problem with a planning method, and how about the the second problem?
Imitating MPC: PLATO algorithm

train $\pi_\theta(u_t o_t)$ from labeled data $\mathcal{D}={o_1,u_1,…,o_N,u_N}$ 
run $\hat{\pi}(u_t o_t)$ to get dataset $\mathcal{D}_\pi={o_1,…,o_M}$  Ask computer to label $\mathcal{D_\pi}$ with actions $u_t$
 Aggregate: $\mathcal{D}\gets\mathcal{D}\cup\mathcal{D}_\pi$
Simple stochastic policy: $\hat{\pi}(u_tx_t)=\mathcal{N}(K_tx_t+k_t, \Sigma_{u_t})$
Here the $\hat{\pi}$ is replaned by optimal control method, for simplicity, choose Gaussian policy since it is easy to plan with LQR, and the planning also add the KL constrain, which make sure the behavior policy is not far from the learning policy, but with this planning, it move actions away from some very bad (dangerous) actions.
DAgger vs GPS
 DAgger does not require an adaptive expert
 Any expert will do, so long as states from learned policy can be labeled
 Assumes it is possible to match expert’s behavior up to bounded loss
 Not always possible (e.g. partially observed domains)
 GPS adapts the “expert” behavior
 Does not require bounded loss on initial expert (expert will change)
Why imitate?

Relatively stable and easy to use

Supervise learning works very well

control\planning (usually) works very well

The combination of two (usually) works very well

 Input remapping trick: can exploit availability of additional information at training time to learn policy from raw observations. (planning with state and learning policy with observations)
 overcomes optimization challenges of backpropagating into policy directly
Modelfree optimization with a model
 just use policy gradient(or other modelfree RL method) even though you have a model. (just treat the model as a simulator)
 Sometimes better than using the gradients!
Dyna
online Qlearning algorithm that performs modelfree RL with a model
 given state $s$, pick action $a$ using exploration policy
 observe $s’$ and $r$, to get transition $(s,a,s’,r)$

update model $\hat{p}(s’ s,a)$ and $\hat{r}(s,a)$ using $(s,a,s’)$  Qupdate: $Q(s,a)\gets Q(s,a)+\alpha E_{s’,r}[r+\max_{a’}Q(s’,a’)Q(s,a)]$
 repeat $K$ times:
 sample $(s,a)\sim\mathcal{B}$ from buffer of past states and actions
 Qupdate: $Q(s,a)\gets Q(s,a)+\alpha E_{s’,r}[r+\max_{a’}Q(s’,a’)Q(s,a)]$
when model become better, reevaluate the old states and make the estimation more accurate.
General “Dynastyle” modelbased RL recipe
 given state $s$, pick action $a$ using exploration policy

learn model $\hat{p}(s’ s,a)$ (and optionally, $\hat{r}(s,a)$)  repeat K times:
 sample $s\sim\mathcal{B}$ from buffer
 choose action a (from $\mathcal{B}$, from $\pi$, or random)

simulate $s’\sim\hat{p}(s’ s,a)$ (and $r=\hat{r}(s,a)$)  train on $(s,a,s’,r)$ with modelfree RL
 (optional) take N more modelbased steps
This only requires short (as few as one step) rollouts from model, which has a little accumulated error.
Modelbased RL algorithms summary
Methods
 Learn model and plan (without policy)
 Iteratively more data to overcome distribution mismatch
 Replan every time step (MPC) to mitigate small model errors
 Learning policy
 Backpropagate into policy (e.g., PILCO)–simple but potentially unstable
 imitate optimal control in a constrained optimization framework (e.g., GPS)
 imitate optimal control via DAggerlike process (e.g., PLATO)
 Use modelfree algorithm with a model(Dyna, etc.)
Limitation of modelbased RL
 Need some kind of model
 Not always available
 Sometimes harder to learn than the policy
 Learning the model takes time & data
 Sometimes expressive model classes (neural nets) are not fast
 Sometimes fast model classes (linear models) are not expressive
 Some kind of additional assumptions
 Linearizability/continuity
 Ability to reset the system (for local linear models)
 Smoothness (for GPstyle global model)
 Etc.
Here are some of my understandings of modelbased RL:
First, why we need modelbased RL?
the modelfree RL learn everything from experience, the state space may very larger, it learning from scratch, which need a lot of exploration, otherwise, it may hard to coverage or has high chance to coverage to local optimal.
But in modelbased RL, the model is known or already learned, so it shift the very hard exploration process to planning, which can find some decent directions that lead to good results by optimal control methods or just search by simulating with model. So after planning, the pretty promising trajectories is generated, and the policy only requires to learn to imitate these good trajectories, which reduce lots of random exploration.
Second, why not just use optimal control rather than learning policy?
Actually, you just use the optimal control methods, like traditional control methods, or MPC.
However, not all model can apply the explicit optimal control method like LQR, since the model is hard to solve mathematically. In addition, using neural network may have better generalization property, and close loop control seems better.
Third, What kind of method can i use in model based RL?
 learning model and just using planing, do not learn policy
 Learn policy by guided policy search
 imitating optimal control with DAgger
What kind of algorithm should I use?
rank of the samples efficiency required (low to high) but computation efficiency (high to low)
 gradientfree methods (e.g. NES, CMA, etc)
 full online methods (e.g. A3C)
 policy gradient methods (e.g. TRPO)
 replay buffer value estimation methods (Qlearning, DDPG, NAF, SAC, etc.)
 modelbased deep RL (e.g. PETS, guided policy search)
 modelbased “shallow” RL (e.g. PILCO)
Comments