Deep Reinforcement learning notes (UBC)

22 minute read


This note is the class note of UBC Deep reinforcement learning, namely CS294-112 or CS285. the lecturer is ‎Sergey Levine. The lecturer video can be find on Youtube. I wrote two notes on reinforcement learning before, one is basic RL, the other is the David Silver class note.

Different from the previous courses, this course includes deeper theoretical view, more recent methods and some advancer topics. Especially in model based RL and meta-learning. It is more suitable for the guys who are interested on robotics control and deeper understanding of reinforcement learning.

This class is a little bit hard to study, so make sure you follow the class tightly.

I’m sorry that some of maths can not view properly on the post, maybe I will figure out it that later.

1. Imitation learning

The main problem of imitation: distribution drift

how to make the distribution of training dataset as same as the distribution under policy?


DAgger: Dataset Aggregation

goal: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)​$ !

how? just run $p_{\pi_\theta}(a_t o_t)$.

but need labels $a_t​$ !

  1. train $\pi_{\theta}(a_t o_t)$ from human data $\mathcal{D}={o_1,a_1,…,o_N,a_N}$;
  2. run $\pi_\theta(a_t o_t)$ to get dataset $\mathcal{D}_\pi = {o_1,…,o_M}$;
  3. ask human to label $\mathcal{D}_\pi$ with actions $s_t$;
  4. aggregate: $\mathcal{D}\gets \mathcal{D}\cup\mathcal{D}_\pi$

fit the model perfectly

why fail to fit expert?

  1. Non-Markonvian behavior
    • use history observations
  2. Multimodal behavior
    • for discrete action, it is OK since Softmax output probability over actions
    • for continuous action
      • output mixture of Gaussians
      • latent variable models(inject noise to network input)
      • autoregressive discretization

other problems of imitation learning

  • human labeled data is finite
  • human not good at some problems

reward function of imitation learning

reward function of imitation learning can be

MDP & RL Intro

The goal of RL

expected reward where $p_\theta(\tau)$ is the distribution of the sequence

Q & V

Types of RL algorithms

  • Policy gradient
  • value-based
  • Actor-critic
  • model-based RL
    • for planning
      • optimal control
      • discrete planning
    • improve policy
    • something else
      • dynamic programming
      • simulated experience


  • sample efficiency
  • stability & ease of use


  • stochastic or deterministic
  • continuous or discrete
  • episodic or infinite horizon

sample efficiency

  • off policy: able to improve the policy without generating new samples from that policy

  • on policy: each time the policy is changed, even a little bit, we need to generate new samples

stability & ease of use

convergence is a problem

supervised learning almost always gradient descent

RL often not strictly gradient descent

2. Policy gradient

Objective function

policy differentiation

log derivative

Objective function differentiation

evaluating the policy gradient

REINFORCE algorithm

  1. sample ${\tau^i}$ from $\pi_\theta(s_t s_t)$ (run the policy);
  2. $\Delta_\theta J(\theta)\approx\sum_{i}^N\left(\sum_{t} \Delta_\theta \log \pi_\theta (a_t s_t) \right)\left(\sum_{t} r(s_t,a_t)\right)$;
  3. $\theta \gets\theta+\alpha\Delta_\theta J(\theta)$.
policy gradient
Reduce variance

Causality: policy at time $t’$ cannot affect reward at time t when $t<t’$ baseline prove Here, $\tau$ means the whole episode sample by current policy.

we can proof that their has a optimal baseline to reduce variance: But in practice, we just use the expectation of reward as baseline to reduce the complexity.

policy gradient is on-policy algorithm

Off-policy learning & importance sampling

what if we sample from $\bar{\pi}(\tau)$ instead?

Importance sampling

so apply this to our objective function, we have and we have so we have

The off-policy policy gradient we can view state and action separately, then:
If $\frac{p_{\theta’}(s_t)}{p_{\theta}(s_t)}$ is small and bounded, then we can delete it, and this leads to TPRO method we will discuss later.

for coding, we can use “pseudo-loss” as weighted maximum likelihood with automatic differentiation:

policy gradient in practice
  • the gradient has high variance
    • this isn’t the same as supervised learning!
    • gradients will be really noisy!
  • consider using much larger batches

  • tweaking learning rates is very hard
    • adaptive step size rules like ADAM can be OK-ish
    • we will learn about policy gradient-specific learning rate adjustment method later

3. Actor-critic method

### Basics

recap policy gradient where Q is a sample from trajectories, which is unbiased estimate but has high variance problem.

We can use expectation to reduce variance and we define then


The better $A^\pi(s_t,a_t)$ estimate, the lower the variance.

Value function fitting

and we add a little bias(one step biased sample) for convenience so we have then we only need to fit $V^\pi(s)$ !

Policy evaluation

Monte Carlo policy evaluation (this is what policy gradient does) We can try multiple samples if we can reset the environment to previous state Monte Carlo evaluation with function approximation

with function approximation, only using one sample from trajectory still pretty good.

training data: ${\left(s_{i,t},\sum_{t’=t}^Tr(s_{i,t’},a_{i,t’})\right)}$

supervised regression: $\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2$

Ideal target: Monte Carlo target:


training data: $ {\left(s_{i,t},r(s_{s_{i,t}},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\right)} $

Actor-critic algorithm

batch actor-critic algorithm:

  1. sample ${s_i,a_i}$ from $\pi_\theta(a s)$
  2. fit $\hat{V_\phi^\pi}(s)$ to sampled reward sums
  3. evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\hat{V}\phi^\pi(s’_i)-\hat{V}\phi^\pi(s_i)$
  4. $\Delta_\theta J(\theta)\approx \sum_i \Delta_\theta \log \pi_\theta (a_{i} s_{i})\hat{A}^\pi(s_i,a_i)$
  5. $\theta \gets \theta+\alpha\Delta_\theta J(\theta)$

Aside: discount factors

what if T (episode length) is $\infty$ ?

$\hat{V}_\phi^\pi$ can get infinitely large in many cases

simple trick: better to get rewards sooner than later actually we use discount in policy gradient as Online actor-critic algorithm(can apply to every single step):

  1. take action $a\sim\pi_\theta(a s)$, get $(s,a,s’,r)$
  2. update $\hat{V_\phi^\pi}(s)$ using target $r+\gamma\hat{V_\phi^\pi}(s’)$
  3. evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}\phi^\pi(s’)-\hat{V}\phi^\pi(s)$
  4. $\Delta_\theta J(\theta)\approx \Delta_\theta \log \pi_\theta (a s)\hat{A}^\pi(s,a)$
  5. $\theta \gets \theta+\alpha\Delta_\theta J(\theta)$

Architecture design

network architecture choice

  • value network and policy network are separate(more stable and sample)

  • some of value network and policy network are shared(have shared feature)

works best with a batch

trade-off and balance

policy gradient

actor-critic policy gradient is no bias but has higher variance

actor-critic is lower variance but not unbiased

so can we combine these two things?

here we have critics as state-dependent baselines

  • no bias
  • lower variance

Eligibility traces & n-step returns

Critic and Monte Carlo critic

combine these two?

n-step returns choosing $n>1$ often works better!!!

Generalized advantage estimation(GAE)

Do we have to choose just one n?

Cut everywhere all at once!

How to weight?

Mostly prefer cutting earlier(less variance) $w_n\propto\lambda^{n-1}$ e.g. $\lambda=0.95$

and this leads to Eligibility traces in this way, every time you want to update a state, you need to have n steps experience

4. Value based methods

$\underset{a_t}{\arg\max}A^\pi(s_t,a_t)$ : best action from $s_t$, if we then follow $\pi$


this at least as good as any $a_t \sim \pi(a_t, s_t)$

Policy iteration

  1. evaluate $A^\pi(s,a)$
  2. set $\pi \gets \pi’$

Dynamic programming

assume we know $p(s’ s,a)$ and s and a are both discrete (and small)

bootstrapped update: with deterministic policy $\pi(s)=a$, we have

So policy iteration become

  1. set $Q^\pi(s,a)\gets r(s,a)+\gamma E[V^\pi(s’)]$
  2. set $V(s)\gets \max_a Q(s,a)$

Function approximator

$\mathcal{L}=\frac{1}{2}\sum_i\parallel V_\phi (s)-\max_a Q(s,a)\parallel^2$

fitted value iteration

fitted value iteration algorithm:

  1. set $y_i \gets \max_{a_i}(r(s_i a_i)+\gamma E[V_\phi(s’_i)])$
  2. set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel V_\phi (s)-y_i\parallel^2$$

but we can not do maximum if we do not have dynamics, so we evaluate Q instead of V

fitted Q-iteration
  1. collect dataset ${(s_i, a_i,s_i’,r_i)}$ using some policy
  2. set $y_i \gets r(s_i,a_i) +\gamma \max_{a_i’}Q_\phi(s_i’,a_i’)$
  3. set $\phi \gets \arg min_\phi \frac{1}{2}\sum_i|Q_\phi(s_i,a_i)-y_i|^2$ repeat step 2,3 k times and then return step 1

Q-learning is off-policy, since it fit the $Q(s,a)$, which estimate all state action Q, no matter the action and state from which policy, it has the maximum item. And for the $r(s,a)$ item, given a and a, transition is independent of $\pi$.

  1. epsilon-greedy

  2. Boltzmann exploration

Value function learning theory

value iteration:

  1. set $Q(s,a) \gets r(s,a)+\gamma E[V(s’)]$
  2. set $V(s) \gets \max_a Q(s,a)$

tabular case is converged.

Non-tabular case is not guarantee convergence.

In actor-critic, it also need to estimate the V, and if using bootstrap approach, it has the same problem that can not guarantee convergence.

5. Practical Q-learning

What’s wrong of the on-line Q-learning?

Actually, it is not gradient decent, it do not calculate the gradient of target Q in y.

samples is i.i.d

Replay buffer (replay samples many times)

Q-learning with a replay buffer:

  1. collect dataset ${(s_i,a_i,s_i’,t_i)}$ using some policy, add it to $\mathcal{B}$
    1. sample a batch $(s_i,a_i,s_i’,r_i)$ from $\mathcal{B} $
    2. $\phi \gets\phi-\alpha\sum_i\frac{d Q_\phi}{d \phi} (s_i,a_i) \frac{1}{2}\sum_i(Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a_i’}Q_\phi(s_i’,a_i’)])^2$ , do these k times

Target network

DQN (Target network+Replay buffer)

  1. save target network parameters: $\phi’ \gets \phi$
    1. collect dataset ${(s_i,a_i,s_i’,t_i)}$ using some policy, add it to $\mathcal{B}$ , do this N times
      1. sample a batch $(s_i,a_i,s_i’,r_i)$ from $\mathcal{B} $
      2. $\phi \gets \arg min_\phi \frac{1}{2}\sum_i|Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a_i’}Q_{\phi’}(s_i’,a_i’)]|^2$ , do 2,3 k times

Alternative target network

Polyak averaging: soft update to avoid sudden target network update:

update $\phi’$: $\phi’ \gets \tau \phi’ + (1-\tau)\phi$ e.g. $\tau =0.999$

Double Q-learning

Are the Q-values accurate?

It’s often much larger than the true value since the maximum operation will always adds the noisy Q estimation to make Q function overestimate.

Target value $y_j=r_i +\gamma \max_{s_j’}Q_{\phi’}(s_j’,a_j’)$ value also comes from $Q_{\phi’}$ action selected according to $Q_{\phi’}$

How to address this?

Double Q-learning

idea: don’t use the same network to choose the action and evaluate value! (de-correlate the noise)

use two networks: the value of both two networks come from the other network!

Double Q-learning in practice

Just use the current and target networks as $\phi_A$ and $\phi_B$, use current network to choose action, and current network get Q value.

standard Q-learning: $y=r+\gamma Q_{\phi’}(s’, \arg \max_{a’}Q_{\phi’}(s’,a’))$

double Q-learning: $y=r+\gamma Q_{\phi’}(s’, \arg \max_{a’}Q_{\phi}(s’,a’))$

Multi-step returns

In Q-learning, this only actually correct when learning on-policy. Because the sum of r comes from the transitions of different policy.

How to fix?

  • ignore the problem when N is small
  • cut the trace-dynamically choose N to get only on-policy data
    • works well when data mostly on-policy, and the action space is samll
  • importance sampling—ref the paper “safe and efficient off-policy reinforcement learning” Munos et al. 16

Q-learning with continuous actions

How to do argmax in continuous actions space?

  1. optimization

    • gradient based optimization (e.g., SGD) a bit slow in the inner loop
    • action space typically low-dimensional—–what about stochastic optimization?

    -a simple if sample from discrete cations

    $max_a Q(s,a)\approx max{Q(s,a_1),…,Q(s,a_N)}$

    -more accurate solution:

    • cross-entropy method (CEM)
      • simple iterative stochastic optimization
    • CMA-ES
  2. use function class that is easy to optimize NAF: Normalized Advantage Functions

    Using the neural network to get $\mu,P,V$

    Then but this lose some representational power

  3. learn an approximate maximizer

    DDPG idea: train another network $\mu_\theta(s)$ such that $\mu_\theta(s)\approx arg \max_aQ_\phi(s,a)$

    how to train? solve $\theta \gets \arg \max_\theta Q_\phi(s,\mu_\theta(s))$ DDPG:

    1. take some action $s_i$ and observe $(s_i,a_i,s_i’,r_i)$, add it to $\mathcal{B}$
    2. sample mini-batch ${s_j,a_j,s_j’,r_j}$ from $\mathcal{B}$ uniformly
    3. compute $y_j=r_j+\gamma Q_{\phi’}(s_j’,\mu_{\theta’}(s_j’))$ using target nets $Q_{\phi’}$ and $\mu_{\theta’}$_j
    4. $\phi \gets \phi - \alpha\sum_j\frac{dQ_\phi}{d\phi}(s_j,a)(Q_\phi(s_j,a_j)-y_k)$
    5. $\theta \gets \theta - \beta\sum_j\frac{d\mu}{d\theta}(s_j)\frac{dQ_\phi}{da}(s_j,a)$
    6. update $\phi’$ and $\theta’$ (e.g., Polyak averaging)

Tips for Q-learning

  • Bellman error gradients can be big;clip gradients or use Huber loss instead of square error

  • Double Q-learning helps a lot in practice, simple and no downsides

  • N-step returns also help a lot, but have some downsides

  • Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too

  • Run multiple random seeds, it’s very inconsistent between runs

6. Advanced Policy Gradients



Recap: policy gradient

REINFORCE algorithm

  1. sample ${\tau^i}$ from $\pi_\theta(s_t s_t)$ (run the policy)
  2. $\Delta_\theta J(\theta)\approx\sum_{i}\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t^i s_t^i) \left(\sum_{t’=t}^T r(s_{t’},a_{t’})\right)\right)$
  3. $\theta \gets\theta+\alpha\Delta_\theta J(\theta)$

Why does policy gradient work?

policy gradient as policy iteration

$J(\theta)=E_{\tau \sim p_\theta(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]$

so we proved that:

The Goal is Making things off-policy

But we want to sample from $\pi_\theta$ not $\pi_{\theta’}$, we apply importance sampling: but there still has the state sample from $p_{\theta’}(s_t)$ , and can we approximate it as $p_\theta(s_t)$? so that we can use $\hat{A}^\pi(s_t,a_t)$ to get improved policy $\pi’$.

Bounding the objective value

Here we can prove that:

$\pi_{\theta’}$ if close to $\pi_\theta$ if $ \pi_{\theta’(a_t s_t)}-\pi_\theta(a_t s_t) \le\epsilon$ for all $s_t$
$ p_{\theta’}(s_t)-p_\theta(s_t) \le 2\epsilon t$

The prove of this refer the lecture video or the TRPO paper.

It’s easy to prove that: so C is $O(Tr_{max})$ or $O(\frac{r_{max}}{1-\gamma})$

So after all the prove before, what we get? For small enough $\epsilon$, this is guaranteed to improve $J(\theta’)-J(\theta)$

A more convenient bound is using KL divergence: $ \pi_{\theta’(a_t s_t)}-\pi_\theta(a_t s_t) \le \sqrt{\frac{1}{2}D_{KL}(\pi_{\theta’}(a_t s_t)|\pi(a_t s_t))}$

$\Rightarrow D_{KL}(\pi_{\theta’}(a_t|s_t)|\pi(a_t|s_t)$ bounds state marginal difference, where Why not using $\epsilon$ but the $D_{KL}$?

KL divergence has some very convenient properties that make i much easier to approximate!

So the optimization becomes:

Solving the constrained optimization problem

How do we enforce the constraint?

By using dual gradient descent, we set the object function as

  1. Maximize $\mathcal{L}(\theta’, \lambda)$ with respect to $\theta$
  2. $\lambda \gets + \alpha(D_{KL}(\pi_{\theta’}(a_t s_t)|\pi(a_t s_t))-\epsilon)$

How else do we optimize the object?

define: applying First-order Taylor expansion and optimize and so the optimization becomes and gradient ascent does this: by updating like $\theta’=\theta’+\sqrt{\frac{\epsilon}{|\Delta_\theta J(\theta)|^2}}\Delta_\theta J(\theta)$, this is what actually gradient ascent(policy gradient) doing.

But this (the gradient ascent constrain) is not a good constrain since some parameters change probabilities a lot more than others, and we want that the probability distributions are close.

Applying ‘second order Taylor expansion’ to $D_{KL}$ where $\pmb{F}$ is the ‘Fisher-information matrix’ which can estimate with with samples And if we use the following update the constrain will satisfied. and this is called the natural gradient.

How natural policy gradients improves

Practical methods and notes

  • natural policy gradient $\theta’=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)$
    • Generally a good choice to stabilize policy gradient training
    • See this paper for details:
      • Petters, Schaal. Reinforcement learning of motor skills with policy gradients.
    • Practical implementation: requires efficient Fisher-vector products, a bit non-trivial to do without computing the full matrix
      • See: Schulman et all. Trust region policy optimization
  • Trust region policy optimization (TRPO) $\alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}\Delta_\theta J(\theta)}}$
  • Just use the IS (important sampling) objective directly (use $\bar{A}$ as object)
    • Use regularization to stay close to old policy
    • See: proximal policy optimization (PPO)

So the TRPO and the PPO is two Practical methods solving the constrained optimization in neural network setting.

7. Optimal Control and Planning

Recap: the reinforcement learning objective In model-free RL, we do not know $p(s_{t+1}|s_t,a_t)$.

But actually we knew the dynamics sometimes.

  • Often we do know the dynamics
  • Often we can learn the dynamics

If we know the dynamics, what can we do?

Model-based reinforcement learning

  1. Model-based reinforcement learning: learn the transition dynamics, then figure out how to choose actions

  2. How can we make decisions if we know the dynamics?

    a. How can we choose actions under perfect knowledge of the system dynamics?

    b. Optimal control, trajectory optimization, planning

  3. How can we learning unknown dynamics?

  4. How can we then also learn policies? (e,g. by imitating optimal control)

The objective

Deterministic case

Stochastic open-loop case

open-loop: choose a_1…a_T in one time, not step by step closed-loop: every step the agent gets a feedback from environment

Stochastic closed-loop case

Stochastic optimization

optimal control/planning:

Cross-entropy method (CEM)

Here $A$ is $a_1,…,a_t$

  1. sample $A_1,…,A_n$ from $p(A)$
  2. evaluate $J(A_1),…,J(A_n)$
  3. pick M elites $A_{i_1},…,A_{i_M}$ with the highest value, where $M<N$
  4. refit $p(A)$ to the elites $A_{i_1},…,A_{i_M}$

Monte Carlo Tree Search (MCTS)

Generic MCTS sketch

  1. find a leaf $s_l$ using TreePolicy($s_1$)
  2. evaluate the leaf using DefaultPolicy($s_l$)
  3. update all values in the tree between $a_1$ and $s_l$

take best action from $s_1$ and repeat

every node stores Q and N, Q is the estimated value and N is the visited number

UCT TreePolicy($s_t$)

if $s_t$ nit fully expanded, choose new $a_t$

else choose child with best Score($s_{t+1}$) $$ Score(s_t) = \frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2\ln N(s_{t-1})}{N(s_t)}} $$ For more about MCTS, ref Browne. et al. A survey of Monte Carlo Tree Search Methods. (2012)

Optimal control

Here we shows the optimization process if we know the environment dynamics. Almost the stuffs in control theory.

Deterministic case

Shooting methods vs collocation

the previous CEM is actually random shooting method.

collocation method: optimize over actions and states, with constraints.

Linear case: LQR

Linear case: the case that F is linear function and cost is quadratic function Where Base case: solve for $u_T$ only We substitute $u_T$ by $x_T$ to eliminate $u_T$ Then solve for $U_{T-1}$ in term terms of $x_{T-1}$ and then do same thing as the T case, result in similar results.

backward recursion

for $t=T$ to 1: Forward recursion

For $t=1$ to $T$:

Stochastic dynamics

if the probability is Gaussian and the mean is linear and variance is fixed. Then same algorithm can be applied since symmetry of Gaussian.

Nonlinear case: DDP/iterative LQR

approximate a nonlinear system as a linear-quadratic system using Taylor expansion

In fact, this just Newton’s method for trajectory optimization.

for more Newton’s method for trajectory optimization, ref follow papers:

  1. Differential dynamic programming.(1970)
  2. Synthesis and Stabilization of complex behaviors through online trajectory optimization.(2012)
    • practical guide for implementing non-linear iterative LQR.
  3. Learning Neural Network policies with guided policy search under unknown dynamics (2014)
    • Probabilistic formation and trust region alternative to deterministic line search.

8. Model-Based Reinforcement Learning (learning the model)


Why learn the model?

If we knew $f(s_t,a_t)=s_{t+1}$, we could use the tools from last course.

(or $p(s_{t+1} s_t,a_t)$ in stochastic case)

model-based reinforcement learning version 0.5:

  1. run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$
  2. learning dynamics model $f(s,a)$ to minimize $\sum_i|f(s_i,a_i)-s_i’|^2$
  3. plan through $f(s,a)$ to choose actions

Does it work?

  • This is how system identification works in classical robotics
  • Some care should be taken to design a good base policy
  • Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters
  • The model only fit the base policy, but the final actual policy beyond that policy, that will cause distribution mismatch problem.

Over-fitting problem

Distribution mismatch problem

Can we do better?

can we make $p_{\pi_0}(s_t)=p_{\pi_f}(s_t)$?

model-based reinforcement learning version 1.0:

  1. run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$
  2. learning dynamics model $f(s,a)$ to minimize $\sum_i|f(s_i,a_i)-s_i’|^2$
  3. plan through $f(s,a)$ to choose actions
  4. execute those actions and add the resulting data ${(s,a,s’)_j}$ to $\mathcal{D}$, and repeat step 2~4

But the model has errors, so it may lead to some bad actions, How to address that?


model-based reinforcement learning version 1.5:

  1. run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$
  2. learning dynamics model $f(s,a)$ to minimize $\sum_i|f(s_i,a_i)-s_i’|^2$
  3. plan through $f(s,a)$ to choose actions
  4. execute the first planned action, observe resulting state $s’$ (MPC)
  5. append $(s,a,s’)$ to dataset $\mathcal{D}$. repeat steps 3~5, and every N steps repeat steps 2~5

Using model uncertainty

Can we do better by using model uncertainty?

How to get uncertainty?

  1. use output entropy(bad idea)
  2. estimate model uncertainty
  • one way to get this is by Bayesian neural networks (BNN) (introduce later)

  • another way is train multiple models, and see if they agree each other.(Bootstrap ensembles)

How to train?

main idea: need to generate “independent” datasets to get “independent” models.

can do this by re-sampling from dataset with replacement, means you have same distribution but different ordered datasets

Does this works?

This basically works

Very crude approximation, because the number of models is usually small (<10)

Re-sampling with replacement is usually unnecessary, because SGD and random initialization usually makes the models sufficiently independent

For candidate action sequence $a_1,…,a_H$:

  1. sample $\theta\sim p(\theta \mathcal{D})$
  2. at each time step $t$, sample $s_{t+1}\sim p(s_{t+1} s_t,a_t,\theta)$
  3. calculate $R=\sum_tr(s_t,a_t)$
  4. repeat steps 1 to 3 and accumulate the average reward

Model-based RL with images (POMDP)

Model-based RL with latent space models

What about complex observations?

  • High dimensionality
  • Redundancy
  • Partial observability
learn approximate posterior $q_\psi(s_t o_{1:t},a_{1:t})$

other choices:

  • $q_\psi(s_t,s_{t+1} o_{1:t},a_{1:t})$
  • $q_\psi(s_t o_t)$
here we only estimate $q_\psi(s_t o_t)$
assume that $q_\psi(s_t o_t)$ is deterministic

stochastic case requires variational inference (later)

Deterministic encoder and maybe the reward also need to learn. Model-based RL with latent space models

  1. run base policy $\pi_0(o_t a_t)$ (e.g., random policy) to collect $\mathcal{D}={(o,a,o’)_i}$
  2. learning $p_\psi(s_{t+1} s_t,a_t), p_\psi(r_t s_t), p(o_t s_t), g_\psi(o_t)$
  3. plan through the model to choose actions
  4. execute the first planned action, observe resulting state $o’$ (MPC)
  5. append $(o,a,o’)$ to dataset $\mathcal{D}$. repeat steps 3~5, and every N steps repeat steps 2~5

Learn directly in observation space

directly learn $p(o_{t+1} o_t,a_t)$

do image prediction

learn reward or set the goal observation

9. Model-Based RL and Policy Learning


What if we want a policy rather than just optimal control?

  • Do not need to re-plan (faster)
  • Potentially better generalization
  • Closed loop control

Back-propagate directly into the policy

model-based reinforcement learning version 2.0:

  1. run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$
  2. learning dynamics model $f(s,a)$ to minimize $\sum_i|f(s_i,a_i)-s_i’|^2$
  3. back-propagate through $f(s,a)$ into the policy to optimize $\pi_\theta(a_t s_t)$
  4. run $\pi_\theta (a_t s_t)$, appending the visited tuples $(s,a,s’)$ to $\mathcal{D}$ , repeat steps 2~4

What’s the problem?

  • similar parameter sensitivity problems as shooting methods
    • But no longer have convenient second order LQR-like method, because policy parameters couple all the tie steps, so no dynamic programming
  • Similar problem to training long RNNs with BPTT
    • Vanishing and exploding gradients
    • Unlike LSTM, we can’t just “choose” a simple dynamics, dynamics are chosen by nature

How to deal with constrain?

Dual gradient decent (DGD)

  1. Find $x^*\gets \arg\min_x\mathcal{L}(x,\lambda)$
  2. Compute $\frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)$
  3. $\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$

A small tweak to DGD: augmented Lagrangian

  1. Find $x^*\gets \arg\min_x\bar{\mathcal{L}}(x,\lambda)$
  2. Compute $\frac{dg}{d\lambda}=\frac{d\bar{\mathcal{L}}}{d\lambda}(x^*,\lambda)$
  3. $\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$

When far from solution, quadratic term tends to improve stability

Constraining trajectory optimization with dual gradient descent

Guided policy search (GPS) discussion

  1. Find $\tau \gets \arg\min_\tau \bar{\mathcal{L}}(\tau,\theta,\lambda)$ (e.g. via iLQR or other planning methods)
  2. Find $\theta \gets \arg\min_\theta\bar{\mathcal{L}}(\tau, \theta, \lambda)$ (e.g. via SGD)
  3. $\lambda \gets \lambda+\alpha \frac{dg}{d\lambda}$ and repeat
  • Can be interpreted as constrained trajectory optimization method
  • Can be interpreted as imitation of optimal control expert, since step 2 is just supervised learning
  • The optimal control “teacher” adapts to the learner , and avoids actions that the learner can’t mimic

General guided policy search scheme

  1. Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
  2. Optimize $\theta$ with respect to some supervised objective
  3. Increment or modify dual variables $\lambda$

Need to choose:

  • form of $p(\tau)$ or $\tau$ (if deterministic)
  • optimization method for $p(\tau)$ or $\tau$
  • surrogate $\tilde{c}(x_t,u_t)$
  • supervised objective for $\pi_\theta(u_t x_t)$
Deterministic case
  1. Optimize $\tau$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
  2. Optimize $\theta$ with respect to some supervised objective
  3. Increment or modify dual variables $\lambda$. repeat 1~3

Learning with multiple trajectories

  1. Optimize each $\tau_i$ in parallel with respect to $\tilde{c}(x_t,u_t)$
  2. Optimize $\theta$ with respect to some supervised objective
  3. Increment or modify dual variables $\lambda$. repeat 1~3
Stochastic (Gaussian) GPS
  1. Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
  2. Optimize $\theta$ with respect to some supervised objective
  3. Increment or modify dual variables $\lambda$

Here, a little different from pure imitation learning that mimic the optimal control result, the agent imitate the planning results, and then if it can not perform good imitation, then the optimization process will adjust its planning policy to fit the learning policy since the policy is a constrain of planning.

Input Remapping Trick

Imitation optimal control

Imitation optimal control with DAgger

  1. from current state $s_t$, run MCTS to get $a_t,a_{t+1},…$
  2. add $(s_t,a_t)$ to dataset $\mathcal{D}$
  3. execute action $s_t\sim\pi(a_t s_t)$ (not MCTS action!). repeat 1~3 N times
  4. update the policy by training on $\mathcal{D}$

Problems of the original DAgger

  • Ask human to label the state from other policy is hard
  • run the initial not good policy in real world is dangerous in some applications

We address the first problem with a planning method, and how about the the second problem?

Imitating MPC: PLATO algorithm

  1. train $\pi_\theta(u_t o_t)$ from labeled data $\mathcal{D}={o_1,u_1,…,o_N,u_N}$
  2. run $\hat{\pi}(u_t o_t)$ to get dataset $\mathcal{D}_\pi={o_1,…,o_M}$
  3. Ask computer to label $\mathcal{D_\pi}$ with actions $u_t$
  4. Aggregate: $\mathcal{D}\gets\mathcal{D}\cup\mathcal{D}_\pi$

Simple stochastic policy: $\hat{\pi}(u_t|x_t)=\mathcal{N}(K_tx_t+k_t, \Sigma_{u_t})$

Here the $\hat{\pi}$ is re-planed by optimal control method, for simplicity, choose Gaussian policy since it is easy to plan with LQR, and the planning also add the KL constrain, which make sure the behavior policy is not far from the learning policy, but with this planning, it move actions away from some very bad (dangerous) actions.

DAgger vs GPS
  • DAgger does not require an adaptive expert
    • Any expert will do, so long as states from learned policy can be labeled
    • Assumes it is possible to match expert’s behavior up to bounded loss
      • Not always possible (e.g. partially observed domains)
  • GPS adapts the “expert” behavior
    • Does not require bounded loss on initial expert (expert will change)
Why imitate?
  • Relatively stable and easy to use

    • Supervise learning works very well

    • control\planning (usually) works very well

    • The combination of two (usually) works very well

  • Input remapping trick: can exploit availability of additional information at training time to learn policy from raw observations. (planning with state and learning policy with observations)
  • overcomes optimization challenges of back-propagating into policy directly

Model-free optimization with a model

  • just use policy gradient(or other model-free RL method) even though you have a model. (just treat the model as a simulator)
  • Sometimes better than using the gradients!


on-line Q-learning algorithm that performs model-free RL with a model

  1. given state $s$, pick action $a$ using exploration policy
  2. observe $s’$ and $r$, to get transition $(s,a,s’,r)$
  3. update model $\hat{p}(s’ s,a)$ and $\hat{r}(s,a)$ using $(s,a,s’)$
  4. Q-update: $Q(s,a)\gets Q(s,a)+\alpha E_{s’,r}[r+\max_{a’}Q(s’,a’)-Q(s,a)]$
  5. repeat $K$ times:
    1. sample $(s,a)\sim\mathcal{B}$ from buffer of past states and actions
    2. Q-update: $Q(s,a)\gets Q(s,a)+\alpha E_{s’,r}[r+\max_{a’}Q(s’,a’)-Q(s,a)]$

when model become better, re-evaluate the old states and make the estimation more accurate.

General “Dyna-style” model-based RL recipe

  1. given state $s$, pick action $a$ using exploration policy
  2. learn model $\hat{p}(s’ s,a)$ (and optionally, $\hat{r}(s,a)$)
  3. repeat K times:
    1. sample $s\sim\mathcal{B}$ from buffer
    2. choose action a (from $\mathcal{B}$, from $\pi$, or random)
    3. simulate $s’\sim\hat{p}(s’ s,a)$ (and $r=\hat{r}(s,a)$)
    4. train on $(s,a,s’,r)$ with model-free RL
    5. (optional) take N more model-based steps

This only requires short (as few as one step) rollouts from model, which has a little accumulated error.

Model-based RL algorithms summary


  • Learn model and plan (without policy)
    • Iteratively more data to overcome distribution mismatch
    • Re-plan every time step (MPC) to mitigate small model errors
  • Learning policy
    • Back-propagate into policy (e.g., PILCO)–simple but potentially unstable
    • imitate optimal control in a constrained optimization framework (e.g., GPS)
    • imitate optimal control via DAgger-like process (e.g., PLATO)
    • Use model-free algorithm with a model(Dyna, etc.)

Limitation of model-based RL

  • Need some kind of model
    • Not always available
    • Sometimes harder to learn than the policy
  • Learning the model takes time & data
    • Sometimes expressive model classes (neural nets) are not fast
    • Sometimes fast model classes (linear models) are not expressive
  • Some kind of additional assumptions
    • Linearizability/continuity
    • Ability to reset the system (for local linear models)
    • Smoothness (for GP-style global model)
    • Etc.

Here are some of my understandings of model-based RL:

First, why we need model-based RL?

the model-free RL learn everything from experience, the state space may very larger, it learning from scratch, which need a lot of exploration, otherwise, it may hard to coverage or has high chance to coverage to local optimal.

But in model-based RL, the model is known or already learned, so it shift the very hard exploration process to planning, which can find some decent directions that lead to good results by optimal control methods or just search by simulating with model. So after planning, the pretty promising trajectories is generated, and the policy only requires to learn to imitate these good trajectories, which reduce lots of random exploration.

Second, why not just use optimal control rather than learning policy?

Actually, you just use the optimal control methods, like traditional control methods, or MPC.

However, not all model can apply the explicit optimal control method like LQR, since the model is hard to solve mathematically. In addition, using neural network may have better generalization property, and close loop control seems better.

Third, What kind of method can i use in model based RL?

  • learning model and just using planing, do not learn policy
  • Learn policy by guided policy search
  • imitating optimal control with DAgger

What kind of algorithm should I use?

rank of the samples efficiency required (low to high) but computation efficiency (high to low)

  • gradient-free methods (e.g. NES, CMA, etc)
  • full on-line methods (e.g. A3C)
  • policy gradient methods (e.g. TRPO)
  • replay buffer value estimation methods (Q-learning, DDPG, NAF, SAC, etc.)
  • model-based deep RL (e.g. PETS, guided policy search)
  • model-based “shallow” RL (e.g. PILCO)