48 minute read


This note is the class note of UBC Deep reinforcement learning, namely CS294-112 or CS285. the lecturer is ‎Sergey Levine. The lecturer video can be find on Youtube. I wrote two notes on reinforcement learning before, one is basic RL, the other is the David Silver class note.

Different from the previous courses, this course includes deeper theoretical view, more recent methods and some advancer topics. Especially in model based RL and meta-learning. It is more suitable for the guys who are interested on robotics control and deeper understanding of reinforcement learning.

This class is a little bit hard to study, so make sure you follow the class tightly.

I’m sorry that some of maths can not view properly on the post, maybe I will figure out it later.

Table of contents


1. Imitation learning

2. Policy gradient

3. Actor-critic method

4. Value based methods

5. Practical Q-learning

6. Advanced Policy Gradients

7. Optimal Control and Planning

8. Model-Based Reinforcement Learning (learning the model)

9. Model-Based RL and Policy Learning

10 Variational Inference and Generative Models

11. Re-framing Control as an Inference Problem

12. Inverse Reinforcement Learning

13. Transfer and Multi-task Learning

14. Distributed RL

15. Exploration

16 Meta Reinforcement learning

17 Information theory, challenges, open problems

18 Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)

1. Imitation learning

The main problem of imitation: distribution drift

how to make the distribution of training dataset as same as the distribution under policy?


DAgger: Dataset Aggregation

goal: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)$ !

how? just run $p_{\pi_\theta}(a_t o_t)$

but need labels $a_t$ !

  1. train $\pi_{\theta}(a_t o_t)$ from human data $\mathcal{D}={o_1,a_1,…,o_N,a_N}$
  2. run $\pi_\theta(a_t o_t)$ to get dataset $\mathcal{D}_\pi = {o_1,…,o_M}$
  3. ask human to label $\mathcal{D}_\pi$ with actions $s_t$
  4. aggregate: $\mathcal{D}\gets \mathcal{D}\cup\mathcal{D}_\pi$

fit the model perfectly

why fail to fit expert?

  1. Non-Markonvian behavior
    • use history observations
  2. Multimodal behavior
    • for discrete action, it is OK since Softmax output probability over actions
    • for continuous action
      • output mixture of Gaussians
      • latent variable models(inject noise to network input)
      • autoregressive discretization

other problems of imitation learning

  • human labeled data is finite
  • human not good at some problems

reward function of imitation learning

reward function of imitation learning can be \(r(s,a) = \log {p(a=\pi^*(s)|s)}\)

MDP & RL Intro

The goal of RL

expected reward \(p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\) where $p_\theta(\tau)$ is the distribution of the sequence

Q & V

\[V^\pi(s_t)=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'})|s_t]\\ v^\pi(s_t)=E_{s_t\sim\pi(a_t|s_t)}[Q^\pi(s_t,a_t)]\]

Types of RL algorithms

  • Policy gradient
  • value-based
  • Actor-critic
  • model-based RL
    • for planning
      • optimal control
      • discrete planning
    • improve policy
    • something else
      • dynamic programming
      • simulated experience


  • sample efficiency
  • stability & ease of use


  • stochastic or deterministic
  • continuous or discrete
  • episodic or infinite horizon

sample efficiency

  • off policy: able to improve the policy without generating new samples from that policy
  • on policy: each time the policy is changed, even a little bit, we need to generate new samples

stability & ease of use

convergence is a problem

supervised learning almost always gradient descent

RL often not strictly gradient descent

2. Policy gradient

Objective function

\[\theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\\ J(\theta)=E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\approx\frac{1}{N}\sum_i\sum_tr(s_{i,t},a_{i,t})\]

policy differentiation

log derivative
\[\begin{align} \pi_\theta(\tau)\Delta \log \pi_\theta(\tau)&=\pi_\theta(\tau)\frac{\Delta\pi_\theta(\tau)}{\pi_\theta(\tau)}=\Delta\pi_\theta(\tau)\\ \pi_\theta(\tau)&=\pi_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \log \pi_\theta(\tau) &=\log p(s_1) + \sum_{t=1}^T \log \pi_\theta (a_t|s_t) + \log p(s_{t+1}|s_t,a_t)\\ &\Delta_\theta \left[\log p(s_1) + \sum_{t=1}^T \log \pi_\theta (a_t|s_t) + \log p(s_{t+1}|s_t,a_t)\right]= \sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t|s_t) \end{align}\]
Objective function differentiation
\[\begin{align} \theta^*&=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[{r}(\tau)]=\int\pi_\theta(\tau)r(\tau)d\tau\\ {r}(\tau)&=\sum_t r(s_t,a_t)\\ \Delta_\theta J(\theta)&=\int\Delta_\theta \pi_\theta(\tau)r(\tau)d\tau\\ &=\int\pi_\theta(\tau)\Delta_\theta \log \pi_\theta(\tau)r(\tau) d\tau\\ &=E_{r\sim\pi_\theta}[\Delta_\theta \log \pi_\theta(\tau)r(\tau)]\\ &=E_{r\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t|s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right] \end{align}\]
evaluating the policy gradient
\[\begin{align} \Delta_\theta J(\theta)&=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t|s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]\\ \Delta_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t|s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\\ \theta &\gets\theta+\alpha\Delta_\theta J(\theta) \end{align}\]

REINFORCE algorithm

  1. sample ${\tau^i}$ from $\pi_\theta(s_t s_t)$ (run the policy)
  2. $\Delta_\theta J(\theta)\approx\sum_{i}^N\left(\sum_{t} \Delta_\theta \log \pi_\theta (a_t s_t) \right)\left(\sum_{t} r(s_t,a_t)\right)$
  3. $\theta \gets\theta+\alpha\Delta_\theta J(\theta)$
policy gradient
\[\Delta_\theta J(\theta)\approx\frac{1}{N} \Delta_\theta \log \pi_\theta (\tau) r(\tau)\]
Reduce variance

Causality: policy at time $t’$ cannot affect reward at time t when $t<t’$ \(\begin{align} \Delta_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\ &\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t’},a_{i,t'})\right)\\ &=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\hat{Q}_{i,t} \end{align}\) baseline \(b=\frac{1}{N}\sum_{i=1}^{n}e(\tau)\\ \Delta_\theta J(\theta)\approx\frac{1}{N} \Delta_\theta \log \pi_\theta (\tau) [r(\tau)-b]\) prove \(\begin{align} E[\Delta_\theta \log \pi_\theta(\tau)b]&=\int \pi_\theta(\tau) \Delta_\theta \log \pi_\theta(\tau)b d\tau \\ &= \int\Delta_\theta \pi_\theta(\tau)b d\tau\\ &=b\Delta_\theta\int\pi_\theta(\tau)\\ &=b\Delta_\theta1\\ &=0 \end{align}\) Here, $\tau$ means the whole episode sample by current policy.

we can proof that their has a optimal baseline to reduce variance: \(b=\frac{E[g(\tau)^2e(\tau)]}{E[g(\tau)^2]}\) But in practice, we just use the expectation of reward as baseline to reduce the complexity.

policy gradient is on-policy algorithm

Off-policy learning & importance sampling

\[\theta^*=\underset{\theta}{\arg\max} J(\theta)\\ J(\theta)=E_{\tau\sim\pi_\theta(\tau)}[r(\tau)]\]

what if we sample from $\bar{\pi}(\tau)$ instead?

Importance sampling \(\begin{align} E_{x\sim p(x)}[f(x)]&=\int p(x)f(x)dx\\ &=\int \frac {q(x)}{q(x)}p(x)f(x)dx\\ &=E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right] \end{align}\) so apply this to our objective function, we have \(J(\theta)=E_{\tau\sim\bar{\pi}(\tau)}\left[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\right]\) and we have \(\pi_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}=\frac{p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)p(s_{t+1}|s_t,a_t)}=\frac{\prod_{t=1}^T\pi_\theta(a_t|s_t)}{\prod_{t=1}^T \bar{\pi}(a_t|s_t)}\) so we have \(\begin{align} J(\theta')&=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]\\ \Delta_{\theta'}J(\theta')&=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\Delta_{\theta'}\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]\\ &=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\Delta_{\theta'} \log \pi_{\theta}(\tau)r(\tau)\right] \end{align}\) The off-policy policy gradient \(\begin{align} \Delta_{\theta'}J(\theta')&=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\Delta_{\theta'} \log \pi_{\theta}(\tau)r(\tau)\right]\\ &=E_{\tau\sim\pi_\theta}\left[\left(\prod_{t=1}^T\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\right)\left(\sum_{t=1}^T \Delta_{\theta'} \log \pi_{\theta'} (a_t|s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]\\ &=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_{\theta'} \log \pi_{\theta'} (a_t|s_t) \right)\left(\prod_{t'=1}^t\frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_\theta(a_{t'}|s_{t'})}\right)\left(\sum_{t'=t}^T r(s_{t'},a_{t'})\left(\prod_{t''=t}^{T}\frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_\theta(a_{t''}|s_{t''})}\right)\right)\right] \end{align}\) we can view state and action separately, then:
\(\begin{align} \theta^*&=\underset{\theta}{\arg\max} \sum_{t=1}^TE_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]\\ J(\theta)&=E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]\\ &=E_{s_t\sim p_\theta(s_t)}\left[E_{a_t\sim \pi(a_t,s_t)}[r(s_t,a_t)]\right]\\ J(\theta')&=E_{s_t\sim p_\theta(s_t)}\left[\cancel{\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}}E_{a_t\sim \pi(a_t,s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}r(s_t,a_t)\right]\right] \end{align}\) If $\frac{p_{\theta’}(s_t)}{p_{\theta}(s_t)}$ is small and bounded, then we can delete it, and this leads to TPRO method we will discuss later.

for coding, we can use “pseudo-loss” as weighted maximum likelihood with automatic differentiation: \(\bar{J}(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log \pi_\theta (a_{i,t}|s_{i,t})\hat{Q}_{i,t}\)

policy gradient in practice
  • the gradient has high variance
    • this isn’t the same as supervised learning!
    • gradients will be really noisy!
  • consider using much larger batches
  • tweaking learning rates is very hard
    • adaptive step size rules like ADAM can be OK-ish
    • we will learn about policy gradient-specific learning rate adjustment method later

3. Actor-critic method


recap policy gradient \(\begin{align} \Delta_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\ &\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t’},a_{i,t'})\right)\\ &=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\hat{Q}_{i,t} \end{align}\) where Q is a sample from trajectories, which is unbiased estimate but has high variance problem.

We can use expectation to reduce variance \(\hat{Q}_i,t\approx \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]\) and we define \(\hat{Q}_i,t= \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]\\ V(s_t)=E_{a_t\sim\pi(s_t|s_t)}[Q(s_t,a_t)]\\\) then \(\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})(Q(s_{i,t}, a_{i,t})-V(s_{i,t}))\)


\[A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)\\ \Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})A^\pi(s_t,a_t)\]

The better $A^\pi(s_t,a_t)$ estimate, the lower the variance.

Value function fitting

\[Q^\pi(s_t,a_t)=r(s_t,a_t)+E_{s_{t+1}\sim p(s_{t+1}|s_t,a_t})[V^\pi(s_{t+1})]\]

and we add a little bias(one step biased sample) for convenience \(Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})\) so we have \(A^\pi(s_t,a_t) \approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)\) then we only need to fit $V^\pi(s)$ !

Policy evaluation

\[V^\pi(s_t)=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t]\\ J(\theta)=E_{s_1\sim p(s_1)}[V^\pi(s_1)]\]

Monte Carlo policy evaluation (this is what policy gradient does) \(V^\pi(s_t)\approx \sum_{t'=t}^Tr(s_{t'},a_{t'})\) We can try multiple samples if we can reset the environment to previous state \(V^\pi(s_t)\approx \frac{1}{N}\sum_{i=0}^N\sum_{t'=t}^Tr(s_{t'},a_{t'})\) Monte Carlo evaluation with function approximation

with function approximation, only using one sample from trajectory still pretty good.

training data: ${\left(s_{i,t},\sum_{t’=t}^Tr(s_{i,t’},a_{i,t’})\right)}$

supervised regression: $\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2$

Ideal target: \(y_{i,t}=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t]\approx r(s_{s_{i,t}},a_{i,t})+V^\pi(s_{i,t+1})\approx r(s_{s_{i,t}},a_{i,t})+\hat{V^\pi_\phi}(s_{i,t+1})\) Monte Carlo target: \(y_{i,t}=\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\)


training data: $ {\left(s_{i,t},r(s_{s_{i,t}},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\right)} $

Actor-critic algorithm

batch actor-critic algorithm:

  1. sample ${s_i,a_i}$ from $\pi_\theta(a s)$
  2. fit $\hat{V_\phi^\pi}(s)$ to sampled reward sums
  3. evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\hat{V}\phi^\pi(s’_i)-\hat{V}\phi^\pi(s_i)$
  4. $\Delta_\theta J(\theta)\approx \sum_i \Delta_\theta \log \pi_\theta (a_{i} s_{i})\hat{A}^\pi(s_i,a_i)$
  5. $\theta \gets \theta+\alpha\Delta_\theta J(\theta)$
\[V^\pi(s_{i,t})=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}]\\ V^\pi(s_{i,t})\approx\sum_{t'=t}^Tr(s_{t'},a_{t'})\\ V^\pi(s_{i,t})\approx r(s_{s_{i,t}},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\\ \mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2\]

Aside: discount factors

what if T (episode length) is $\infty$ ?

$\hat{V}_\phi^\pi$ can get infinitely large in many cases

simple trick: better to get rewards sooner than later \(V^\pi(s_{i,t})\approx r(s_{s_{i,t}},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1})\\ \gamma \in [0,1]\) actually we use discount in policy gradient as \(\Delta_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)\) Online actor-critic algorithm(can apply to every single step):

  1. take action $a\sim\pi_\theta(a s)$, get $(s,a,s’,r)$
  2. update $\hat{V_\phi^\pi}(s)$ using target $r+\gamma\hat{V_\phi^\pi}(s’)$
  3. evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}\phi^\pi(s’)-\hat{V}\phi^\pi(s)$
  4. $\Delta_\theta J(\theta)\approx \Delta_\theta \log \pi_\theta (a s)\hat{A}^\pi(s,a)$
  5. $\theta \gets \theta+\alpha\Delta_\theta J(\theta)$

Architecture design

network architecture choice

  • value network and policy network are separate(more stable and sample)
  • some of value network and policy network are shared(have shared feature)

works best with a batch

trade-off and balance

policy gradient \(\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-b\right)\) actor-critic \(\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)\right)\) policy gradient is no bias but has higher variance

actor-critic is lower variance but not unbiased

so can we combine these two things?

here we have critics as state-dependent baselines \(\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-\hat{V}_\phi^\pi(s_{i,t})\right)\)

  • no bias
  • lower variance

Eligibility traces & n-step returns

Critic and Monte Carlo critic \(\hat{A}^\pi_C(s_t,a_t)=r(s_t,a_t)+\gamma\hat{V}_\phi^\pi(s_{t+1})-\hat{V}_\phi^\pi(s_t)\\ \hat{A}^\pi_{MC}(s_t,a_t)=\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_{t})\)

combine these two?

n-step returns \(\hat{A}^\pi_{n}(s_t,a_t)=\sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^n\hat{V}_\phi^\pi(s_{t+n})-\hat{V}_\phi^\pi(s_{t})\) choosing $n>1$ often works better!!!

Generalized advantage estimation(GAE)

Do we have to choose just one n?

Cut everywhere all at once! \(\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{n=1}^\infty w_n\hat{A}(s_t,a_t)\)

How to weight?

Mostly prefer cutting earlier(less variance) $w_n\propto\lambda^{n-1}$ e.g. $\lambda=0.95$

and this leads to Eligibility traces \(\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{n=1}^\infty (\gamma\lambda)^{t'-t}\delta_{t'}\\ \delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}_\phi^\pi(s_{t'+1})-\hat{V}_\phi^\pi(s_{t'})\) in this way, every time you want to update a state, you need to have n steps experience

4. Value based methods

$\underset{a_t}{\arg\max}A^\pi(s_t,a_t)$ : best action from $s_t$, if we then follow $\pi$

then: \(\pi'(a_t|s_t)=\begin{cases}1, &if \quad a_=\underset{a_t}{\arg\max}A^\pi(s_t,a_t) \cr 0, &otherwise\end{cases}\)

this at least as good as any $a_t \sim \pi(a_t, s_t)$

Policy iteration

  1. evaluate $A^\pi(s,a)$
  2. set $\pi \gets \pi’$

Dynamic programming

assume we know $p(s’ s,a)$ and s and a are both discrete (and small)

bootstrapped update: \(V^\pi(s) \gets E_{a\sim\pi(a|s)}[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}[V^\pi(s')]]\) with deterministic policy $\pi(s)=a$, we have \(V^\pi(s) \gets r(s,\pi(s))+\gamma E_{s'\sim p(s'|s,\pi(s))}[V^\pi(s')]\)

\[\underset{a_t}{\arg\max}A^\pi(s_t,a_t)=\underset{a_t}{\arg\max}Q^\pi(s_t,a_t)\\ Q^\pi(s,a)=r(s,a)+\gamma E[V^\pi(s')]\]

So policy iteration become

  1. set $Q^\pi(s,a)\gets r(s,a)+\gamma E[V^\pi(s’)]$
  2. set $V(s)\gets \max_a Q(s,a)$

Function approximator

$\mathcal{L}=\frac{1}{2}\sum_i\parallel V_\phi (s)-\max_a Q(s,a)\parallel^2$

fitted value iteration

fitted value iteration algorithm:

  1. set $y_i \gets \max_{a_i}(r(s_i a_i)+\gamma E[V_\phi(s’_i)])$
  2. set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel V_\phi (s)-y_i\parallel^2$$

but we can not do maximum if we do not have dynamics, so we evaluate Q instead of V \(Q^\pi(s,a) \gets r(s,a)+\gamma E_{s'\sim p(s'|s,a)}[Q^\pi(s',\pi(s'))]\)

fitted Q-iteration
  1. collect dataset ${(s_i, a_i,s_i’,r_i)}$ using some policy
  2. set $y_i \gets r(s_i,a_i) +\gamma \max_{a_i’}Q_\phi(s_i’,a_i’)$
  3. set $\phi \gets \arg min_\phi \frac{1}{2}\sum_i|Q_\phi(s_i,a_i)-y_i|^2$ repeat step 2,3 k times and then return step 1

Q-learning is off-policy, since it fit the $Q(s,a)$, which estimate all state action Q, no matter the action and state from which policy, it has the maximum item. And for the $r(s,a)$ item, given a and a, transition is independent of $\pi$.

  1. epsilon-greedy \(\pi(a_t|s_t)=\begin{cases}1-\epsilon , &if a_t={\arg\max}Q_\phi(s_t,a_t) \cr \epsilon(|\mathcal{A}|-1), &otherwise\end{cases}\)

  2. Boltzmann exploration

    \[\pi(a_t|s_t) \propto \exp(Q_\phi(s_t,a_t))\]

Value function learning theory

value iteration:

  1. set $Q(s,a) \gets r(s,a)+\gamma E[V(s’)]$
  2. set $V(s) \gets \max_a Q(s,a)$

tabular case is converged.

Non-tabular case is not guarantee convergence.

In actor-critic, it also need to estimate the V, and if using bootstrap approach, it has the same problem that can not guarantee convergence.

5. Practical Q-learning

What’s wrong of the on-line Q-learning?

Actually, it is not gradient decent, it do not calculate the gradient of target Q in y.

samples is i.i.d

Replay buffer (replay samples many times)

Q-learning with a replay buffer:

  1. collect dataset ${(s_i,a_i,s_i’,t_i)}$ using some policy, add it to $\mathcal{B}$
    1. sample a batch $(s_i,a_i,s_i’,r_i)$ from $\mathcal{B} $
    2. $\phi \gets\phi-\alpha\sum_i\frac{d Q_\phi}{d \phi} (s_i,a_i) \frac{1}{2}\sum_i(Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a_i’}Q_\phi(s_i’,a_i’)])^2$ , do these k times

Target network

DQN (Target network+Replay buffer)

  1. save target network parameters: $\phi’ \gets \phi$
    1. collect dataset ${(s_i,a_i,s_i’,t_i)}$ using some policy, add it to $\mathcal{B}$ , do this N times
      1. sample a batch $(s_i,a_i,s_i’,r_i)$ from $\mathcal{B} $
      2. $\phi \gets \arg min_\phi \frac{1}{2}\sum_i|Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a_i’}Q_{\phi’}(s_i’,a_i’)]|^2$ , do 2,3 k times

Alternative target network

Polyak averaging: soft update to avoid sudden target network update:

update $\phi’$: $\phi’ \gets \tau \phi’ + (1-\tau)\phi$ e.g. $\tau =0.999$

Double Q-learning

Are the Q-values accurate?

It’s often much larger than the true value since the maximum operation will always adds the noisy Q estimation to make Q function overestimate.

Target value $y_j=r_i +\gamma \max_{s_j’}Q_{\phi’}(s_j’,a_j’)$ \(\max_{a'}Q_{\phi'}(s',a') = Q_{\phi'}(s',\arg \max_{a'}Q_{\phi'}(s',a'))\) value also comes from $Q_{\phi’}$ action selected according to $Q_{\phi’}$

How to address this?

Double Q-learning

idea: don’t use the same network to choose the action and evaluate value! (de-correlate the noise)

use two networks: \(Q_{\phi_A}\gets r +\gamma Q_{\phi_B}(s',\arg \max_{a'}Q_{\phi_A}(s',a')) \\ Q_{\phi_B}\gets r +\gamma Q_{\phi_A}(s',\arg \max_{a'}Q_{\phi_B}(s',a'))\) the value of both two networks come from the other network!

Double Q-learning in practice

Just use the current and target networks as $\phi_A$ and $\phi_B$, use current network to choose action, and current network get Q value.

standard Q-learning: $y=r+\gamma Q_{\phi’}(s’, \arg \max_{a’}Q_{\phi’}(s’,a’))$

double Q-learning: $y=r+\gamma Q_{\phi’}(s’, \arg \max_{a’}Q_{\phi}(s’,a’))$

Multi-step returns

\[y_{j,t}=\sum_{t'=t}^{t'+N-1}r_{j,t'}+\gamma ^N \max Q_{\phi'}(s_{j,t+N},a_{j, t+N})\]

In Q-learning, this only actually correct when learning on-policy. Because the sum of r comes from the transitions of different policy.

How to fix?

  • ignore the problem when N is small
  • cut the trace-dynamically choose N to get only on-policy data
    • works well when data mostly on-policy, and the action space is samll
  • importance sampling—ref the paper “safe and efficient off-policy reinforcement learning” Munos et al. 16

Q-learning with continuous actions

How to do argmax in continuous actions space?

  1. optimization

    • gradient based optimization (e.g., SGD) a bit slow in the inner loop
    • action space typically low-dimensional—–what about stochastic optimization?

    -a simple if sample from discrete cations

    $max_a Q(s,a)\approx max{Q(s,a_1),…,Q(s,a_N)}$

    -more accurate solution:

    • cross-entropy method (CEM)
      • simple iterative stochastic optimization
    • CMA-ES
  2. use function class that is easy to optimize \(Q_{\phi}(s,a) = -\frac{1}{2}(a-\mu_\phi(s))^TP_{\phi}(s)(a-\mu_\phi(s))+V_\phi(s)\) NAF: Normalized Advantage Functions

    Using the neural network to get $\mu,P,V$

    Then \(\arg \max_aQ_\phi(s,a) =\mu_\phi(s)\; \; \max_aQ(s,a)=V_\phi(s)\) but this lose some representational power

  3. learn an approximate maximizer

    DDPG \(\max_aQ_\phi(s,a)=Q_\phi(s,arg\max_a Q_\phi(s,a))\) idea: train another network $\mu_\theta(s)$ such that $\mu_\theta(s)\approx arg \max_aQ_\phi(s,a)$

    how to train? solve $\theta \gets \arg \max_\theta Q_\phi(s,\mu_\theta(s))$ \(\frac{dQ_\phi}{d\theta}=\frac{da}{d\theta}\frac{dQ_\theta}{da}\) DDPG:

    1. take some action $s_i$ and observe $(s_i,a_i,s_i’,r_i)$, add it to $\mathcal{B}$
    2. sample mini-batch ${s_j,a_j,s_j’,r_j}$ from $\mathcal{B}$ uniformly
    3. compute $y_j=r_j+\gamma Q_{\phi’}(s_j’,\mu_{\theta’}(s_j’))$ using target nets $Q_{\phi’}$ and $\mu_{\theta’}$_j
    4. $\phi \gets \phi - \alpha\sum_j\frac{dQ_\phi}{d\phi}(s_j,a)(Q_\phi(s_j,a_j)-y_k)$
    5. $\theta \gets \theta - \beta\sum_j\frac{d\mu}{d\theta}(s_j)\frac{dQ_\phi}{da}(s_j,a)$
    6. update $\phi’$ and $\theta’$ (e.g., Polyak averaging)

Tips for Q-learning

  • Bellman error gradients can be big;clip gradients or use Huber loss instead of square error \(\pi(a_t|s_t)=\begin{cases}x^2/2 , &\text{if} \;|x|\le\delta \cr \delta|x|-\delta^2/2, &\text{otherwise}\end{cases}\)

  • Double Q-learning helps a lot in practice, simple and no downsides

  • N-step returns also help a lot, but have some downsides

  • Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too

  • Run multiple random seeds, it’s very inconsistent between runs

6. Advanced Policy Gradients



Recap: policy gradient

REINFORCE algorithm

  1. sample ${\tau^i}$ from $\pi_\theta(s_t s_t)$ (run the policy)
  2. $\Delta_\theta J(\theta)\approx\sum_{i}\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t^i s_t^i) \left(\sum_{t’=t}^T r(s_{t’},a_{t’})\right)\right)$
  3. $\theta \gets\theta+\alpha\Delta_\theta J(\theta)$

Why does policy gradient work?

policy gradient as policy iteration

$J(\theta)=E_{\tau \sim p_\theta(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]$ \(\begin{align} J(\theta')-J(\theta)&=J(\theta')-E_{s_0 \sim p(s_1)}[V^{\pi_\theta}(s_0)]\\ &=J(\theta')-E_{\tau \sim p_{\theta'}(\tau)}[V^{\pi_\theta}(s_0)]\\ &=J(\theta')-E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tV^{\pi_\theta}(s_t)-\sum_{t=1}^\infty\gamma^tV^{\pi_\theta}(s_t)\right]\\ &=J(\theta')+E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\ &=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]+E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\ &=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(r(s_t,a_t)+\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\ &==E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right] \end{align}\) so we proved that: \(J(\theta')-J(\theta)=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\)

The Goal is Making things off-policy

But we want to sample from $\pi_\theta$ not $\pi_{\theta’}$, we apply importance sampling: \(\begin{align} E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]&=\sum_{t=0}^{\infty}E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta'}(a_t|s_t)}\left[\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ &=\sum_{t=0}^{\infty}E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right] \end{align}\) but there still has the state sample from $p_{\theta’}(s_t)$ , and can we approximate it as $p_\theta(s_t)$? so that we can use $\hat{A}^\pi(s_t,a_t)$ to get improved policy $\pi’$.

Bounding the objective value

Here we can prove that:

$\pi_{\theta’}$ if close to $\pi_\theta$ if $ \pi_{\theta’(a_t s_t)}-\pi_\theta(a_t s_t) \le\epsilon$ for all $s_t$
$ p_{\theta’}(s_t)-p_\theta(s_t) \le 2\epsilon t$

The prove of this refer the lecture video or the TRPO paper.

It’s easy to prove that: \(\begin{align} E_{p_{\theta'}}[f(s_t)]=\sum_{s_t}p_{\theta'}(s_t)f(s_t)&\ge\sum_{s_t}p_\theta(s_t)f(s_t)-|p_{\theta'}(s_t)-p_\theta(s_t)|\max_{s_t}f(s_t)\\ &\ge\sum_{s_t}p_\theta(s_t)f(s_t)-2\epsilon t\max_{s_t}f(s_t) \end{align}\) so \(\sum_t E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \ge\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\sum_t 2\epsilon t C\) C is $O(Tr_{max})$ or $O(\frac{r_{max}}{1-\gamma})$

So after all the prove before, what we get? \(\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \text{such that}\:\:|\pi_{\theta'(a_t|s_t)}-\pi_\theta(a_t|s_t)|\le\epsilon\) For small enough $\epsilon$, this is guaranteed to improve $J(\theta’)-J(\theta)$

A more convenient bound is using KL divergence: $ \pi_{\theta’(a_t s_t)}-\pi_\theta(a_t s_t) \le \sqrt{\frac{1}{2}D_{KL}(\pi_{\theta’}(a_t s_t)|\pi(a_t s_t))}$

$\Rightarrow D_{KL}(\pi_{\theta’}(a_t|s_t)|\pi(a_t|s_t)$ bounds state marginal difference, where \(D_{KL}(p_1(s)\|p_2(x))=E_{x\sim p_1(x)}\left[ \log \frac{p_1(x)}{p_2(x)}\right]\) Why not using $\epsilon$ but the $D_{KL}$?

KL divergence has some very convenient properties that make i much easier to approximate!

So the optimization becomes: \(\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\\text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))\le\epsilon\)

Solving the constrained optimization problem

How do we enforce the constraint?

By using dual gradient descent, we set the object function as \(\mathcal{L}(\theta',\lambda)=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\lambda(D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))-\epsilon)\)

  1. Maximize $\mathcal{L}(\theta’, \lambda)$ with respect to $\theta$
  2. $\lambda \gets + \alpha(D_{KL}(\pi_{\theta’}(a_t s_t)|\pi(a_t s_t))-\epsilon)$

How else do we optimize the object?

define: \(\begin{align} \bar{A}(\theta')&=\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \bar{A}(\theta)&=\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right] \end{align}\) applying First-order Taylor expansion and optimize \(\theta' \gets \arg \max_{\theta'}\Delta_\theta\bar A(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))\le\epsilon\) and \(\begin{align} \Delta_{\theta'}\bar A(\theta')&=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^t \Delta_{\theta'}\log{\pi_{\theta'}(a_t|s_t)}A^{\pi_\theta}(s_t,a_t)\right]\right]\\ \Delta_{\theta}\bar A(\theta)&=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\gamma^t \Delta_{\theta}\log{\pi_{\theta}(a_t|s_t)}A^{\pi_\theta}(s_t,a_t)\right]\right]=\Delta_\theta J(\theta) \end{align}\) so the optimization becomes \(\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))\le\epsilon\) and gradient ascent does this: \(\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:\|\theta-\theta'\|\le\epsilon\) by updating like $\theta’=\theta’+\sqrt{\frac{\epsilon}{|\Delta_\theta J(\theta)|^2}}\Delta_\theta J(\theta)$, this is what actually gradient ascent(policy gradient) doing.

But this (the gradient ascent constrain) is not a good constrain since some parameters change probabilities a lot more than others, and we want that the probability distributions are close.

Applying ‘second order Taylor expansion’ to $D_{KL}$ \(D_{KL}(\pi_{\theta'}\|\pi_\theta)\approx\frac{1}{2}(\theta'-\theta)^T\pmb{F}(\theta'-\theta)\) where $\pmb{F}$ is the ‘Fisher-information matrix’ which can estimate with with samples \(\pmb{F}=E_{\pi_\theta}[\Delta_{\theta}\log\pi_\theta(a|s)\Delta_\theta\log \pi_\theta(a|s)^T]\) And if we use the following update \(\theta'=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)\\ \alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}\Delta_\theta J(\theta)}}\) the constrain will satisfied. and this is called the natural gradient.

How natural policy gradients improves

Practical methods and notes

  • natural policy gradient $\theta’=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)$
    • Generally a good choice to stabilize policy gradient training
    • See this paper for details:
      • Petters, Schaal. Reinforcement learning of motor skills with policy gradients.
    • Practical implementation: requires efficient Fisher-vector products, a bit non-trivial to do without computing the full matrix
      • See: Schulman et all. Trust region policy optimization
  • Trust region policy optimization (TRPO) $\alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}\Delta_\theta J(\theta)}}$
  • Just use the IS (important sampling) objective directly (use $\bar{A}$ as object)
    • Use regularization to stay close to old policy
    • See: proximal policy optimization (PPO)

So the TRPO and the PPO is two Practical methods solving the constrained optimization in neural network setting.

7. Optimal Control and Planning

Recap: the reinforcement learning objective \(p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\) In model-free RL, we do not know $p(s_{t+1}|s_t,a_t)$.

But actually we knew the dynamics sometimes.

  • Often we do know the dynamics
  • Often we can learn the dynamics

If we know the dynamics, what can we do?

Model-based reinforcement learning

  1. Model-based reinforcement learning: learn the transition dynamics, then figure out how to choose actions

  2. How can we make decisions if we know the dynamics?

    a. How can we choose actions under perfect knowledge of the system dynamics?

    b. Optimal control, trajectory optimization, planning

  3. How can we learning unknown dynamics?

  4. How can we then also learn policies? (e,g. by imitating optimal control)

The objective


Deterministic case


Stochastic open-loop case

\[p_\theta(s_1,...,s_T|a_1,...,a_T)=p(s_1)\prod_{t=1}^Tp(s_{t+1}|s_t,a_t)\\ a_1,...,a_T=\arg\max_{a_1,...,a_T}E\left[\sum_{t=1}^Tr(s_t,a_t)|a_1,...,a_T\right]\]

open-loop: choose a_1…a_T in one time, not step by step closed-loop: every step the agent gets a feedback from environment

Stochastic closed-loop case

\[p_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \pi=\underset{\pi}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\]

Stochastic optimization

optimal control/planning: \(a_1,...,a_t=\arg\max_{a_1,...,a_t}J(a_1,...,a_t)\\ A=\arg\max_AJ(A)\)

Cross-entropy method (CEM)

Here $A$ is $a_1,…,a_t$

  1. sample $A_1,…,A_n$ from $p(A)$
  2. evaluate $J(A_1),…,J(A_n)$
  3. pick M elites $A_{i_1},…,A_{i_M}$ with the highest value, where $M<N$
  4. refit $p(A)$ to the elites $A_{i_1},…,A_{i_M}$

Monte Carlo Tree Search (MCTS)

Generic MCTS sketch

  1. find a leaf $s_l$ using TreePolicy($s_1$)
  2. evaluate the leaf using DefaultPolicy($s_l$)
  3. update all values in the tree between $a_1$ and $s_l$

take best action from $s_1$ and repeat

every node stores Q and N, Q is the estimated value and N is the visited number

UCT TreePolicy($s_t$)

​ if $s_t$ nit fully expanded, choose new $a_t$

​ else choose child with best Score($s_{t+1}$) \(Score(s_t) = \frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2\ln N(s_{t-1})}{N(s_t)}}\) For more about MCTS, ref Browne. et al. A survey of Monte Carlo Tree Search Methods. (2012)

Optimal control

Here we shows the optimization process if we know the environment dynamics. Almost the stuffs in control theory.

Deterministic case \(\min_{u_1,...,u_T}\sum_{t=1}^Tc(s_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\ \min_{u_1,...,u_T}c(x_1,u_1)+c(f(x_1,u_1),u_2)+...+c(f(f(...)...),u_T)\)

Shooting methods vs collocation

the previous CEM is actually random shooting method.

collocation method: optimize over actions and states, with constraints. \(\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(s_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\)

Linear case: LQR


Linear case: the case that F is linear function and cost is quadratic function \(f(x_t,u_t)=F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t\\ c(x_t,u_t)=\frac{1}{2}\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^TC_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^Tc_t\) Where \(C_T=\begin{bmatrix} C_{x_T,x_T} & C_{x_T,u_T} \\ C_{u_T,x_T} & C_{u_T,u_T} \end{bmatrix}\\ c_T=\begin{bmatrix} c_{x_T}\\ c_{u_T} \end{bmatrix}\) Base case: solve for $u_T$ only \(\begin{align} Q(x_T,u_T)&= \text{const}+\frac{1}{2}\begin{bmatrix} x_{T} \\ u_{T} \\ \end{bmatrix}^TC_t\begin{bmatrix} x_{T} \\ u_{T} \\ \end{bmatrix}+\begin{bmatrix} x_{T} \\ u_{T} \\ \end{bmatrix}^Tc_T\\ \Delta_{u_T}Q(x_T,u_T)&=C_{u_t,x_T}x_T+c_{u_T,u_T}u_T+c_{u_T}^T=0\\ u_T&=-C_{u_T,u_T}^{-1}(C_{u_T,x_T}X_T+c_{u_T})\\ U_T&=K_Tx_T+k_T\\ K_T&=-C_{u_T,u_T}^{-1}C_{u_T,x_T}\\ k_T&=-C_{u_T,u_T}^{-1}c_{u_T} \end{align}\) We substitute $u_T$ by $x_T$ to eliminate $u_T$ \(\begin{align} V(x_T)&= \text{const}+\frac{1}{2}\begin{bmatrix} x_{T} \\ K_Tx_T+k_T \\ \end{bmatrix}^TC_t\begin{bmatrix} x_{T} \\ K_Tx_T+k_T \\ \end{bmatrix}+\begin{bmatrix} x_{T} \\ K_Tx_T+k_T \\ \end{bmatrix}^Tc_T\\ V(x_T)&=\text{const}+\frac{1}{2}x_T^TV_Tx_T+x_T^Tv_T \end{align}\) Then solve for $U_{T-1}$ in term terms of $x_{T-1}$ \(\begin{align} f(x_{T-1},u_{T-1})&=x_T=F_t\begin{bmatrix} x_ \\ u_ \\ \end{bmatrix}+f_{T-1}\\ c(x_t,u_t)&=\frac{1}{2}\begin{bmatrix}           x_{T-1} \\           u_{T-1} \\         \end{bmatrix}^TC_t\begin{bmatrix}           x_{T-1} \\           u_{T-1} \\         \end{bmatrix}+\begin{bmatrix}           x_{T-1} \\           u_{T-1} \\         \end{bmatrix}^Tc_{T-1}+V(f(x_{T-1},u_{T-1}))\\ V(f(x_{T-1},u_{T-1}))&=\text{const}+\frac{1}{2}x_T^TV_Tx_T+x_T^Tv_T\\ &\text{and then raplece $x_T$ with the dynamic f} \end{align}\) and then do same thing as the T case, result in similar results.

backward recursion

for $t=T$ to 1: \(\begin{align} Q_t&=C_t+F_t^TV_{t+1}F_t\\ q_t&=c_t+F_t^TV_{t+1}f_t+F_t^Tv_{t+1}\\ Q(x_t,u_t)&=\text{const}+\frac{1}{2}\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^TQ_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^Tq_t\\ u_t &\gets \arg\max_{u_t}Q(x_t,u_t)=K_tx_t+k_t\\ K_t&=-Q_{u_t,u_t}^{-1}Q_{u_t,x_t}\\ k_t&=-Q_{u_t,u_t}^{-1}q_{u_t}\\ V_t&=Q_{x_t,x_t}+Q_{x_t,u_t}K_t+K_t^TQ_{u_t,x_t}+K_t^TQ_{u_t,u_t}K_t\\ v_t&=q_{x_t}+Q_{x_t,u_t}k_t+K_t^TQ_{u_t}+K_t^TQ_{u_t,u_t}k_t\\ V(x_t)&=\text{const}+\frac{1}{2}x_t^T V_tx_t+x_t^Tv_t\\ V(x_t)&=\min Q(x_t,u_t) \end{align}\) Forward recursion

For $t=1$ to $T$: \(u_t=K_tx_t+k_t\\ x_{t+1}=f(x_t,u_t)\)

Stochastic dynamics

if the probability is Gaussian and the mean is linear and variance is fixed. Then same algorithm can be applied since symmetry of Gaussian. \(f(x_t,u_t)=F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t\\ x_{t-1}\sim p(x_{t+1}|x_t,u_t)\\ p(x_{t+1}|x_t,u_t)=\mathcal{N}\left(F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t, \Sigma_t\right)\)

Nonlinear case: DDP/iterative LQR

approximate a nonlinear system as a linear-quadratic system using Taylor expansion \(f(x_t,u_t)\approx f(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}f(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}\\ c(x_t,u_t)\approx c(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}+\frac{1}{2}\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}^T\Delta^2_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}\)

\[\bar{f}(\delta x_t,\delta u_t)=F_t\begin{bmatrix} \delta x_t \\ \delta u_t \\ \end{bmatrix}\\ \bar{c}=\frac{1}{2}\begin{bmatrix} \delta x_{t} \\ \delta u_{t} \\ \end{bmatrix}^TC_t\begin{bmatrix} \delta x_{t} \\ \delta u_{t} \\ \end{bmatrix}+\begin{bmatrix} \delta x_{t} \\ \delta u_{t} \\ \end{bmatrix}^Tc_t\\ \delta x_t= x_t-\hat{x}_t\\ \delta u_t= u_t-\hat{u}_t\]

In fact, this just Newton’s method for trajectory optimization.

for more Newton’s method for trajectory optimization, ref follow papers:

  1. Differential dynamic programming.(1970)
  2. Synthesis and Stabilization of complex behaviors through online trajectory optimization.(2012)
    • practical guide for implementing non-linear iterative LQR.
  3. Learning Neural Network policies with guided policy search under unknown dynamics (2014)
    • Probabilistic formation and trust region alternative to deterministic line search.

8. Model-Based Reinforcement Learning (learning the model)


Why learn the model?

If we knew $f(s_t,a_t)=s_{t+1}$, we could use the tools from last course.

(or $p(s_{t+1} s_t,a_t)$ in stochastic case)

model-based reinforcement learning version 0.5:

  1. run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$
  2. learning dynamics model $f(s,a)$ to minimize $\sum_i|f(s_i,a_i)-s_i’|^2$
  3. plan through $f(s,a)$ to choose actions

Does it work?

  • This is how system identification works in classical robotics
  • Some care should be taken to design a good base policy
  • Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters
  • The model only fit the base policy, but the final actual policy beyond that policy, that will cause distribution mismatch problem.

Over-fitting problem

Distribution mismatch problem

Can we do better?

can we make $p_{\pi_0}(s_t)=p_{\pi_f}(s_t)$?

model-based reinforcement learning version 1.0:

  1. run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$
  2. learning dynamics model $f(s,a)$ to minimize $\sum_i|f(s_i,a_i)-s_i’|^2$
  3. plan through $f(s,a)$ to choose actions
  4. execute those actions and add the resulting data ${(s,a,s’)_j}$ to $\mathcal{D}$, and repeat step 2~4

But the model has errors, so it may lead to some bad actions, How to address that?


model-based reinforcement learning version 1.5:

  1. run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$
  2. learning dynamics model $f(s,a)$ to minimize $\sum_i|f(s_i,a_i)-s_i’|^2$
  3. plan through $f(s,a)$ to choose actions
  4. execute the first planned action, observe resulting state $s’$ (MPC)
  5. append $(s,a,s’)$ to dataset $\mathcal{D}$. repeat steps 3~5, and every N steps repeat steps 2~5

Using model uncertainty

Can we do better by using model uncertainty?

How to get uncertainty?

  1. use output entropy(bad idea)
  2. estimate model uncertainty
\[\int p(s_{t+1}|s_t,a_t,\theta)p(\theta|\mathcal{D})d\theta\]
  • one way to get this is by Bayesian neural networks (BNN) (introduce later)
  • another way is train multiple models, and see if they agree each other.(Bootstrap ensembles)
\[p(\theta|\mathcal{D})\approx\frac{1}{N}\sum_i\delta(\theta_i)\\ \int p(s_{t+1}|s_t,a_t,\theta)p(\theta|\mathcal{D})d\theta\approx\frac{1}{N}\sum_ip(s_{t+1}|s_t,a_t,\theta_i)\]

How to train?

main idea: need to generate “independent” datasets to get “independent” models.

can do this by re-sampling from dataset with replacement, means you have same distribution but different ordered datasets

Does this works?

This basically works

Very crude approximation, because the number of models is usually small (<10)

Re-sampling with replacement is usually unnecessary, because SGD and random initialization usually makes the models sufficiently independent

For candidate action sequence $a_1,…,a_H$:

  1. sample $\theta\sim p(\theta \mathcal{D})$
  2. at each time step $t$, sample $s_{t+1}\sim p(s_{t+1} s_t,a_t,\theta)$
  3. calculate $R=\sum_tr(s_t,a_t)$
  4. repeat steps 1 to 3 and accumulate the average reward

Model-based RL with images (POMDP)

Model-based RL with latent space models

What about complex observations?

  • High dimensionality
  • Redundancy
  • Partial observability
\[\max_\phi\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(s_{t+1,i}|s_{t,i},a_{t,i})+\log p_\phi(o_{t,i}|s_{t,i})]\]
learn approximate posterior $q_\psi(s_t o_{1:t},a_{1:t})$

other choices:

  • $q_\psi(s_t,s_{t+1} o_{1:t},a_{1:t})$
  • $q_\psi(s_t o_t)$
here we only estimate $q_\psi(s_t o_t)$
assume that $q_\psi(s_t o_t)$ is deterministic

stochastic case requires variational inference (later)

Deterministic encoder \(q_\psi(s_t|o_t)=\delta(s_t=g_\psi(o_t))\Rightarrow s_t=g_\psi(o_t)\) and maybe the reward also need to learn. \(\max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i})|g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i}|g_\psi (o_{t,i})]\\ \max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i})|g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i}|g_\psi (o_{t,i})+log p_\phi(r_{t,i}|g_\psi(o_{t,i}))]\) Model-based RL with latent space models

  1. run base policy $\pi_0(o_t a_t)$ (e.g., random policy) to collect $\mathcal{D}={(o,a,o’)_i}$
  2. learning $p_\psi(s_{t+1} s_t,a_t), p_\psi(r_t s_t), p(o_t s_t), g_\psi(o_t)$
  3. plan through the model to choose actions
  4. execute the first planned action, observe resulting state $o’$ (MPC)
  5. append $(o,a,o’)$ to dataset $\mathcal{D}$. repeat steps 3~5, and every N steps repeat steps 2~5

Learn directly in observation space

directly learn $p(o_{t+1} o_t,a_t)$

do image prediction

learn reward or set the goal observation

9. Model-Based RL and Policy Learning


What if we want a policy rather than just optimal control?

  • Do not need to re-plan (faster)
  • Potentially better generalization
  • Closed loop control

Back-propagate directly into the policy

model-based reinforcement learning version 2.0:

  1. run base policy $\pi_0(a_t s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s’)_i}$
  2. learning dynamics model $f(s,a)$ to minimize $\sum_i|f(s_i,a_i)-s_i’|^2$
  3. back-propagate through $f(s,a)$ into the policy to optimize $\pi_\theta(a_t s_t)$
  4. run $\pi_\theta (a_t s_t)$, appending the visited tuples $(s,a,s’)$ to $\mathcal{D}$ , repeat steps 2~4

What’s the problem?

  • similar parameter sensitivity problems as shooting methods
    • But no longer have convenient second order LQR-like method, because policy parameters couple all the tie steps, so no dynamic programming
  • Similar problem to training long RNNs with BPTT
    • Vanishing and exploding gradients
    • Unlike LSTM, we can’t just “choose” a simple dynamics, dynamics are chosen by nature
\[\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(s_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\] \[\min_{u_1,...,u_T,x_1,...,x_T,\theta}\sum_{t=1}^Tc(s_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1}), u_t=\pi_\theta(x_t)\\ \min_{u_1,...,u_T,x_1,...,x_T,\theta}\sum_{t=1}^Tc(s_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\ \:\: \text{s.t.}\:\:u_t=\pi_\theta(x_t)\]

How to deal with constrain?

Dual gradient decent (DGD)

\[\min_xf(x)\:\:\text{s.t.}\:C(x)=0\:\:\:\:\:\: \mathcal{L}(x,\lambda)=f(x)+\lambda C(x)\\ g(\lambda)=\mathcal{L}(x^*(\lambda),\lambda)\\ x^*=\arg\min_x\mathcal{L}(x,\lambda)\\ \frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)\]
  1. Find $x^*\gets \arg\min_x\mathcal{L}(x,\lambda)$
  2. Compute $\frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)$
  3. $\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$

A small tweak to DGD: augmented Lagrangian \(\bar{\mathcal{L}}(x,\lambda)=f(x)+\lambda C(x)+\rho\|C(x)\|^2\)

  1. Find $x^*\gets \arg\min_x\bar{\mathcal{L}}(x,\lambda)$
  2. Compute $\frac{dg}{d\lambda}=\frac{d\bar{\mathcal{L}}}{d\lambda}(x^*,\lambda)$
  3. $\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$

When far from solution, quadratic term tends to improve stability

Constraining trajectory optimization with dual gradient descent \(\min_{\tau,\theta}c(\tau)\:\:\text{s.t.}\:\:u_t=\pi_\theta(x_t)\\ \bar{\mathcal{L}}(\tau,\theta,\lambda)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2\)

Guided policy search (GPS) discussion

  1. Find $\tau \gets \arg\min_\tau \bar{\mathcal{L}}(\tau,\theta,\lambda)$ (e.g. via iLQR or other planning methods)
  2. Find $\theta \gets \arg\min_\theta\bar{\mathcal{L}}(\tau, \theta, \lambda)$ (e.g. via SGD)
  3. $\lambda \gets \lambda+\alpha \frac{dg}{d\lambda}$ and repeat
  • Can be interpreted as constrained trajectory optimization method
  • Can be interpreted as imitation of optimal control expert, since step 2 is just supervised learning
  • The optimal control “teacher” adapts to the learner , and avoids actions that the learner can’t mimic

General guided policy search scheme

  1. Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
  2. Optimize $\theta$ with respect to some supervised objective
  3. Increment or modify dual variables $\lambda$

Need to choose:

  • form of $p(\tau)$ or $\tau$ (if deterministic)
  • optimization method for $p(\tau)$ or $\tau$
  • surrogate $\tilde{c}(x_t,u_t)$
  • supervised objective for $\pi_\theta(u_t x_t)$
Deterministic case
\[\min_{\tau,\theta}c(\tau)\:\:\:\text{s.t.}\:\:\:u_t=\pi_\theta(x_t)\\ \bar{\mathcal{L}}(\tau,\theta,\lambda)=\tilde{c}(\tau)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2\] \[\tilde{c}_{k+1,i}(x_t,u_t)=c(x_t,u_t)+\lambda_{k+1,i}\log \pi_\theta (u_t|x_t)\]
  1. Optimize $\tau$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
  2. Optimize $\theta$ with respect to some supervised objective
  3. Increment or modify dual variables $\lambda$. repeat 1~3

Learning with multiple trajectories \(\min_{\tau_1,...,\tau_N,\theta}\sum_{i=1}^{N}c(\tau_i)\:\:\:\text{s.t.}\:\:\:u_{t,i}=\pi_\theta(x_{t,i})\:\:\forall i\:\forall t\)

  1. Optimize each $\tau_i$ in parallel with respect to $\tilde{c}(x_t,u_t)$
  2. Optimize $\theta$ with respect to some supervised objective
  3. Increment or modify dual variables $\lambda$. repeat 1~3
Stochastic (Gaussian) GPS
\[\min_{p,\theta}E_{\tau\sim p(\tau)}{c(\tau)}\:\:\text{s.t.}\:\:p(u_t|x_t)=\pi_\theta(u_t|x_t)\\ p(u_t|x_t)=\mathcal{N}(K_t(x_t-\hat{x}_t)+k_t+\hat{u}_t,\Sigma_t)\]
  1. Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
  2. Optimize $\theta$ with respect to some supervised objective
  3. Increment or modify dual variables $\lambda$

Here, a little different from pure imitation learning that mimic the optimal control result, the agent imitate the planning results, and then if it can not perform good imitation, then the optimization process will adjust its planning policy to fit the learning policy since the policy is a constrain of planning.

Input Remapping Trick \(\min_{p,\theta}E_{\tau\sim p(\tau)}[{c(\tau)}]\:\:\text{s.t.}\:\:p(u_t|x_t)=\pi_\theta(u_t|o_t)\\)

Imitation optimal control

Imitation optimal control with DAgger

  1. from current state $s_t$, run MCTS to get $a_t,a_{t+1},…$
  2. add $(s_t,a_t)$ to dataset $\mathcal{D}$
  3. execute action $s_t\sim\pi(a_t s_t)$ (not MCTS action!). repeat 1~3 N times
  4. update the policy by training on $\mathcal{D}$

Problems of the original DAgger

  • Ask human to label the state from other policy is hard
  • run the initial not good policy in real world is dangerous in some applications

We address the first problem with a planning method, and how about the the second problem?

Imitating MPC: PLATO algorithm

  1. train $\pi_\theta(u_t o_t)$ from labeled data $\mathcal{D}={o_1,u_1,…,o_N,u_N}$
  2. run $\hat{\pi}(u_t o_t)$ to get dataset $\mathcal{D}_\pi={o_1,…,o_M}$
  3. Ask computer to label $\mathcal{D_\pi}$ with actions $u_t$
  4. Aggregate: $\mathcal{D}\gets\mathcal{D}\cup\mathcal{D}_\pi$

Simple stochastic policy: $\hat{\pi}(u_t|x_t)=\mathcal{N}(K_tx_t+k_t, \Sigma_{u_t})$ \(\hat{\pi}(u_t|x_t)=\arg\min_{\hat{\pi}}\sum_{t'=t}^TE_{\hat{\pi}}[c(x_{t'},u_{t'})]+\lambda D_{KL}(\hat{\pi}(u_t|x_t)\|\pi_\theta(u_t|o_t))\)

Here the $\hat{\pi}$ is re-planed by optimal control method, for simplicity, choose Gaussian policy since it is easy to plan with LQR, and the planning also add the KL constrain, which make sure the behavior policy is not far from the learning policy, but with this planning, it move actions away from some very bad (dangerous) actions.

DAgger vs GPS
  • DAgger does not require an adaptive expert
    • Any expert will do, so long as states from learned policy can be labeled
    • Assumes it is possible to match expert’s behavior up to bounded loss
      • Not always possible (e.g. partially observed domains)
  • GPS adapts the “expert” behavior
    • Does not require bounded loss on initial expert (expert will change)
Why imitate?
  • Relatively stable and easy to use
    • Supervise learning works very well
    • control\planning (usually) works very well
    • The combination of two (usually) works very well
  • Input remapping trick: can exploit availability of additional information at training time to learn policy from raw observations. (planning with state and learning policy with observations)
  • overcomes optimization challenges of back-propagating into policy directly


Model-free optimization with a model

  • just use policy gradient(or other model-free RL method) even though you have a model. (just treat the model as a simulator)
  • Sometimes better than using the gradients!


on-line Q-learning algorithm that performs model-free RL with a model

  1. given state $s$, pick action $a$ using exploration policy
  2. observe $s’$ and $r$, to get transition $(s,a,s’,r)$
  3. update model $\hat{p}(s’ s,a)$ and $\hat{r}(s,a)$ using $(s,a,s’)$
  4. Q-update: $Q(s,a)\gets Q(s,a)+\alpha E_{s’,r}[r+\max_{a’}Q(s’,a’)-Q(s,a)]$
  5. repeat $K$ times:
    1. sample $(s,a)\sim\mathcal{B}$ from buffer of past states and actions
    2. Q-update: $Q(s,a)\gets Q(s,a)+\alpha E_{s’,r}[r+\max_{a’}Q(s’,a’)-Q(s,a)]$

when model become better, re-evaluate the old states and make the estimation more accurate.

General “Dyna-style” model-based RL recipe

  1. given state $s$, pick action $a$ using exploration policy
  2. learn model $\hat{p}(s’ s,a)$ (and optionally, $\hat{r}(s,a)$)
  3. repeat K times:
    1. sample $s\sim\mathcal{B}$ from buffer
    2. choose action a (from ​$\mathcal{B}$, from $\pi$, or random)
    3. simulate $s’\sim\hat{p}(s’ s,a)$ (and $r=\hat{r}(s,a)$)
    4. train on $(s,a,s’,r)$ with model-free RL
    5. (optional) take N more model-based steps

This only requires short (as few as one step) rollouts from model, which has a little accumulated error.

Model-based RL algorithms summary


  • Learn model and plan (without policy)
    • Iteratively more data to overcome distribution mismatch
    • Re-plan every time step (MPC) to mitigate small model errors
  • Learning policy
    • Back-propagate into policy (e.g., PILCO)–simple but potentially unstable
    • imitate optimal control in a constrained optimization framework (e.g., GPS)
    • imitate optimal control via DAgger-like process (e.g., PLATO)
    • Use model-free algorithm with a model(Dyna, etc.)

Limitation of model-based RL

  • Need some kind of model
    • Not always available
    • Sometimes harder to learn than the policy
  • Learning the model takes time & data
    • Sometimes expressive model classes (neural nets) are not fast
    • Sometimes fast model classes (linear models) are not expressive
  • Some kind of additional assumptions
    • Linearizability/continuity
    • Ability to reset the system (for local linear models)
    • Smoothness (for GP-style global model)
    • Etc.

Here are some of my understandings of model-based RL:

First, why we need model-based RL?

the model-free RL learn everything from experience, the state space may very larger, it learning from scratch, which need a lot of exploration, otherwise, it may hard to coverage or has high chance to coverage to local optimal.

But in model-based RL, the model is known or already learned, so it shift the very hard exploration process to planning, which can find some decent directions that lead to good results by optimal control methods or just search by simulating with model. So after planning, the pretty promising trajectories is generated, and the policy only requires to learn to imitate these good trajectories, which reduce lots of random exploration.

Second, why not just use optimal control rather than learning policy?

Actually, you just use the optimal control methods, like traditional control methods, or MPC.

However, not all model can apply the explicit optimal control method like LQR, since the model is hard to solve mathematically. In addition, using neural network may have better generalization property, and close loop control seems better.

Third, What kind of method can i use in model based RL?

  • learning model and just using planing, do not learn policy
  • Learn policy by guided policy search
  • imitating optimal control with DAgger

What kind of algorithm should I use?

rank of the samples efficiency required (low to high) but computation efficiency (high to low)

  • gradient-free methods (e.g. NES, CMA, etc)
  • full on-line methods (e.g. A3C)
  • policy gradient methods (e.g. TRPO)
  • replay buffer value estimation methods (Q-learning, DDPG, NAF, SAC, etc.)
  • model-based deep RL (e.g. PETS, guided policy search)
  • model-based “shallow” RL (e.g. PILCO)


10 Variational Inference and Generative Models

Probabilistic models

Latent variable models

\[p(x)=\sum_zp(x|z)p(z)\\ p(y|x)=\sum_zp(y|x,z)p(z)\]

Latent variable models in general

feed random Gaussian to neural network to fit any distribution \(p(x|z)=\mathcal{N}(\mu_{nn},\sigma_{nn}(z))\\ p(x)=\int p(x|z)p(z)dz\) where $p(z)$ is Gaussian.

the neural network input is sample of Gaussian, and output is many mean and variance of Gaussian.

Latent variable models in RL: conditional latent variable models for multi-modal policies

How to train latent variable models?

model to fit a distribution

the model: $p_\theta(x)$

the data: $\mathcal{D}={x_1,x_2,x_3,…,x_N}$

maximum likelihood fit: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i \log p_\theta(x_i)$

in latent variable model

the model: $p(x)=\int p(x z)p(z)dz$
maximum likelihood fit: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i \log \left(\int p(x z)p(z)dz\right)$

Estimating the log-likelihood

alternative: expected log-likelihood: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i E_{z\sim p(z x_i)} \log p_\theta(x_i,z)$
The variational approximation

approximate $p(z|x_i)$ with $q_i(z)=\mathcal{N}(\mu_i,\sigma_i)$ \(\begin{align} \log p(x_i) &= \log \int_z p(x_i|z)p(z)\\ &=\log \int_z p(x_i|z)p(z)\frac{q_i(z)}{q_i(z)}\\ &=\log E_{z\sim q_i(z)}\left[\frac{p(x_i|z)p(z)}{q_i(z)}\right]\\ &\ge E_{z \sim q_i(z)}\left[\log \frac{p(x_i|z)p(z)}{q_i(z)}\right]\\ &= E_{z \sim q_i(z)}[\log p(x_i|z)+\log p(z)]- E_{z \sim q_i(z)}[\log q_i(z)]\\ &= E_{z \sim q_i(z)}[\log p(x_i|z)+\log p(z)]+ \mathcal{H}(q_i) \end{align}\) Jensen’s inequality: $\log E[y]\ge E[\log y]$

Entropy: $\mathcal{H}(p)=-E_{x \sim p(x)}[\log p(x)]=-\int_x p(x)\log p(x)dx$

KL Divergence: $D_{KL}(q   p)=E_{x \sim q(x)}\left[\log \frac{q(x)}{p(x)}\right]=E_{x \sim q(x)}[\log q(x)]- E_{x \sim q(x)}[\log p(x)] =-E_{x \sim q(x)}[\log p(x)]- \mathcal{H}(q)$

further analysis \(\log p(x_i) = E_{z \sim q_i(z)}[\log p(x_i|z)+\log p(z)]+ \mathcal{H}(q_i)=\mathcal{L}_i(q,p_i)\) so what makes a good $q_i(z)$?

intuition: $q_i(z)$ should approximate $p(z x_i)$

why? \(\begin{align} D_{KL}(q_i(z_i||p(z|x_i)))&=E_{z\sim q_i(z)}\left[\log \frac{q_i(z)}{p(z|x_i)}\right]\\ &=E_{z \sim q_i(z)}\left[\log \frac{q_i(z)p(x_i)}{p(x_i,z)}\right]\\ &=-E_{z\sim q_i(z)}[\log p(x_i|z)+\log p(z)] +E_{z \sim q_i(z)}+E_{z\sim q_i(z)}[\log q_i(z)]+E_{z \sim q_i(z)}[\log p(x_i)]\\ &=-E_{z\sim q_i(z)}[\log p(x_i|z)+\log p(z)] +E_{z \sim q_i(z)}-\mathcal{H}(q_i)\log p(x_i)\\ &=-\mathcal{L}_i(p,q_i)+\log p(x_i) \end{align}\)

\[\begin{align} \log p(x_i)&=D_{KL}(q_i(z_i||p(z|x_i)))+\mathcal{L}_i(p,q_i)\\ 0 &\le D_{KL}(q_i(z_i||p(z|x_i))) \\ \log p(x_i)&\ge \mathcal{L}_i(p,q_i) \end{align}\]
So this also prove that $\mathcal{L}_i(p,q_i)$ is low bound, and the KL Divergence is the bound gap, so when $q_i(x)$ close to $p(z x_i)$, the bound will become tight.

Actually minimize KL divergence is maximizing $\mathcal{L}_i(p,q_i)$. so we need to adjust $q_i(z)$ to maximizing $\mathcal{L}_i(p,q_i)$.

So all we need to do is: \(\theta \gets \arg \max_{\theta} \frac{1}{N}\sum_i \mathcal{L}_i(q,q_i)\)

\[\mathcal{L}_i(q,p_i)=E_{z \sim q_i(z)}[\log p_\theta(x_i|z)+\log p(z)]+ \mathcal{H}(q_i)\]


for each $x_i$ (or mini-batch):

​ calculate $\Delta_\theta \mathcal{L}_i(p,q_i)$

​ sample $z \sim q_i(z)$

​ $\Delta_\theta \mathcal{L}(p,q_i)\approx\Delta_\theta \log p_\theta(x_i z)$

​ $\theta \gets \theta+\alpha \Delta_\theta\mathcal{L}(p,q_i)$

​ update $q_i$ to maximize $\mathcal{L}_i(p,q_i)$

How to update $q_i$?

let’s say $q_i(z)=\mathcal{N}(\mu_i,\sigma_i)$

use gradient $\Delta_{\mu_i}\mathcal{L}i(p,q_i)$ and $\Delta{\sigma_i}\mathcal{L}_i(p,q_i)$

gradient ascent on $\mu_i,\sigma_i$

What’ the problem?

every sample has a $\mu_i,\sigma_i$, when you have many samples , the total paramters are $ \theta +( \mu_i + \sigma_i )*N$
intuition: $q_i(z)$ should approximate $p(z x_i)$
what if we learn a network $q_i(z)=q(z x_i)\approx p(z x_i)$ ?
so we have two network: $p_\theta(x z)$ and $q_\phi(z x)$

Amortized variational inference

\[q_\phi(z|x)=\mathcal{N}(\mu_\phi(x),\sigma_\phi(x))\] \[\log p(x_i)\ge E_{z \sim q_\phi(z|x_i)}[\log p_\theta(x_i|z)+\log p(z)]+ \mathcal{H}(q_\phi(z|x_i))=\mathcal{L}(q_\theta(x_i|z),q_\phi(z|x_i))\]


for each $x_i$ (or mini-batch):

​ calculate $\mathcal{L}(q_\theta(x_i z),q_\phi(z x_i))$
​ sample $z \sim q_\phi(z x_i)$
​ $\Delta_\theta \mathcal{L}(p,q_i)\approx\Delta_\theta \log p_\theta(x_i z)$

​ $\theta \gets \theta+\alpha \Delta_\theta\mathcal{L}$

​ $\theta \gets \theta+\alpha \Delta_\phi\mathcal{L}$

how can we get $\Delta_\phi\mathcal{L}$ ? \(\mathcal{L}_i=E_{z \sim q_\phi(z|x_i)}[\log p_\theta(x_i|z)+\log p(z)]+ \mathcal{H}(q_\phi(z|x_i))\\ J(\phi)= E_{z \sim q_\phi(z|x_i)}[r(x_i,z)]\\ J(\phi)\approx \frac{1}{M}\sum_j\Delta_\phi \log q_\phi(z_j|x_i)r(x_i,z_j)\) one way is just use the policy gradient trick, but it has high variance, th other way is apply the re-parameterization trick.

The re-parameterization trick

\[q_\phi(z|x)=\mathcal{N}(\mu_\phi(x),\sigma_\phi(x))\\ z=\mu_\phi(x)+\epsilon\sigma_\phi(x)\;\:\epsilon\sim \mathcal{N}(0,1)\] \[\begin{align} J(\phi)&= E_{z \sim q_\phi(z|x_i)}[r(x_i,z)]\\ &=E_{\epsilon \sim \mathcal{N}(0,1)}[r(x_i,\mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))] \end{align}\]

and then we can estimating $\Delta_\phi J(\phi)$:

​ sample $\epsilon_1,…,\epsilon_M$ from $\mathcal{N}(0,1)$ (even a single sample per data point works well!)

​ $\Delta_\phi J(\phi) \approx \frac{1}{M}\sum_j\Delta_\phi r(x_i,\mu_\phi(x_i)+\epsilon_j\sigma_\phi(x_i)) $

this is low variance since it get gradient from r, not just sample r.

Another way to look at it… \(\begin{align} \mathcal{L}_i&=E_{z \sim q_\phi(z|x_i)}[\log p_\theta(x_i|z)+\log p(z)]+ \mathcal{H}(q_\phi(z|x_i))\\ &=E_{z \sim q_\phi(z|x_i)}[\log p_\theta(x_i|z)]+E_{z \sim q_\phi(z|x_i)}[\log p(z)]+ \mathcal{H}(q_\phi(z|x_i))\\ &=E_{z \sim q_\phi(z|x_i)}[\log p_\theta(x_i|z)]-D_{KL}(q_\phi(z|x_i||p(z)))\\ &=E_{\epsilon \sim \mathcal{N}(0,1)}[p_\theta(x_i,\mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))]-D_{KL}(q_\phi(z|x_i||p(z)))\\ &\approx \log p_\theta(x_i,\mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))-D_{KL}(q_\phi(z|x_i||p(z))) \end{align}\)

Re-parameterization trick vs. policy gradient

  • policy gradient $J(\phi)\approx \frac{1}{M}\sum_j\Delta_\phi \log q_\phi(z_j x_i)r(x_i,z_j)$
    • Can handle both discrete and continuous latent variables
    • but has high variance, requires multiple sample & small learning rates
  • Re-parameterization trick $\Delta_\phi J(\phi) \approx \frac{1}{M}\sum_j\Delta_\phi r(x_i,\mu_\phi(x_i)+\epsilon_j\sigma_\phi(x_i)) $
    • only continuous latent variables
    • very simple to implement
    • low variance

The variational auto encoder (VAE)


Conditional models \(\mathcal{L}_i=E_{z \sim q_\phi(z|x_i,y_i)}[\log p_\theta(y_i|x_i,z)+\log p(z|x_i)]+ \mathcal{H}(q_\phi(z|x_i,y_i))\) the application of variational inference

  • using RL\control+variational inference to model human behavior
  • using generative models and variational inference for exploration

this class is a little tough, here is some of my understanding:

we want to represent the distribution of a object(by neural network) and make it has ability to generalize the feature of the object (multi-modal). To achieve this, the random variables($z$) are used as input of model, but what kind of random variances can achieve this? the mathematic proof shows that the distribution of random variance should approximate $p(z x_i)$, this is kind of compress sensing, where $z$ is the latent variable. Finally, this is used to do variational auto encoder (VAE).

To make the network trainable, the re-parameterization trick is applied.

11. Re-framing Control as an Inference Problem

Get object function from policy

The human behavior is stochastic and suboptimal but overall good, and how can interpret this kind of model?

A probabilistic graphical model of decision making

\[p(\tau|\mathcal{O}_{1:T})\:\:\:\;\;\;p(\mathcal{O}_t|s_t,a_t)=\exp(r(s_t,a_t))\\ p(\tau|\mathcal{O}_{1:T})=\frac{p(\tau,\mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})}\propto p(\tau)\prod\exp(r(s_t,a_t))=p(\tau)\exp\left(\sum_tr(s_t,a_t\right)\]

$\mathcal{O}$ is the boolean variable indicate the policy choose randomly or choose to maximize reward


Backward massages

\[\begin{align} \beta_t(s_t,a_t)&=p(\mathcal{O}_{t:T}|s_t,a_t)\\ &=\int p(\mathcal{O}_{t:T},s_{t+1}|s_t,a_t)ds_{t+1}\\ &=\int p(\mathcal{O}_{t+1:T}|s_{t+1}) p(s_{t+1}|s_t,a_t)p(\mathcal{O}_t|s_t,a_t)ds_{t+1}\\ \end{align}\] \[p(\mathcal{O}_{t+1:T}|s_{t+1})=\beta_{t+1}(s_{t+1})=\int p(\mathcal{O}_{t+1:T}|s_{t+1,}a_{t+1})p(a_{t+1}|s_{t+1})d a_{t+1}\\ =\int \beta_t(s_{t+1,}a_{t+1})p(a_{t+1}|s_{t+1})d a_{t+1}\]

for $t=T-1$ to 1:

\[\begin{align} \beta_t(s_t,a_t)&=p(\mathcal{O}_t|s_t,a_t)E_{s_{t+1}\sim p(s_{t+1}|s_t,a_t)}[\beta_{t+1}(s_{t+1})]\\ \beta_t(s_t)&=E_{s_t\sim p(a_t|s_t)}[\beta_t(s_t,a_t)] \end{align}\]

let $V_t(s_t) =\log \beta_t(s_t)$

let $Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)$ \(V_t=\log\int \exp(Q_t(s_t,a_t))da_t\\ V_t(s_t) \to \max_{s_t}Q_t(s_t,a_t)\;\text{as}\;Q_t(s_t,a_t)\;\text{gets bigger!}\\ Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\\ \text{for determistic transition: }Q_t(s_t,a_t)=r(s_t,a_t)+V_{t+1}(s_{t+1})\)

Policy computation

\[\begin{align*} p(a_t|s_t,\mathcal{O}_1:T)&=\pi(a_t|s_t)\\ &=p(a_t|s_t,\mathcal{O}_t:T)\\ &=\frac{p(a_t,a_t|\mathcal{O}_{t:T})}{p(s_t|\mathcal{O}_{t:T})}\\ &=\frac{p(\mathcal{O}_{t:T}|a_t,s_t)p(a_t,s_t)/p(a_t,s_t)}{p(\mathcal{O}_{t:T}|s_t)p(s_t)/p(\mathcal{O}_{t:T})}\\ &=\frac{p(\mathcal{O}_{t:T}|a_t,s_t)}{p(\mathcal{O}_{t:T}|s_t)}\frac{p(a_t,s_t)}{p(s_t)}\\ &=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}p(a_t|s_t)\\ &=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)} \end{align*}\]

Policy computation with value functions

for $t=T-1$ to 1:

\[V_t=\log\int \exp(Q_t(s_t,a_t))da_t\\ Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\]
\[\pi(a_t|s_t)=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}\\ V_t(s_t) =\log \beta_t(s_t)\\ Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)\]

So \(\pi(a_t|s_t)=\exp(Q_t(s_t,a_t)-V_t(s_t))=\exp(A_t(s_t,a_t))\) with temperature: $\pi(a_t|s_t)=\exp(\frac{1}{\alpha}Q_t(s_t,a_t)-\frac{1}{\alpha}V_t(s_t))=\exp(\frac{1}{\alpha}A_t(s_t,a_t))$ if $\alpha$ is near zero, the max action will dominate, so it’s more like deterministic, and $\alpha$ near 1 is more stochastic.

  • Natural interpretation: better actions are more probable
  • Random tie-breaking
  • Analogous to Boltzmann exploration
  • Approaches greedy policy as temperature decreases

Forward massages

\[\begin{align*} \alpha_t(s_t)&=p(s_t|\mathcal{O_{1:t-1}})\\ &=\int p(s_t,s_{t-1},,a_{t-1}|\mathcal{O}_{1:t-1})ds_{t-1}da_{t-1}\\ &=\int p(s_t|s_{t-1},a_{t-1},\mathcal{O}_{1:t-1})p(a_{t-1}|s_{t-1},\mathcal{O}_{1:t-1})p(s_{t-1|\mathcal{O}_{1:t-1}})ds_{t-1}da_{t-1}\\ &=\int p(s_t|s_{t-1},a_{t-1})p(a_{t-1}|s_{t-1},\mathcal{O}_{1:t-1})p(s_{t-1|\mathcal{O}_{1:t-1}})ds_{t-1}da_{t-1} \end{align*}\] \[\begin{align*} p(s_t|s_{t-1},a_{t-1})p(a_{t-1}|s_{t-1},\mathcal{O}_{1:t-1}) &=\frac{p(\mathcal{O}_{t-1}|s_{t-1},a_{t-1})p(a_{t-1}|s_{t-1})}{p(\mathcal{O}_{t-1}|s_{t-1})}\frac{p(\mathcal{O}_{t-1}|s_{t-1})p(s_{t-1}|\mathcal{O}_{1:t-2})}{p(\mathcal{O}_{t-1}|\mathcal{O}_{1:t-2})}\\ &=\frac{p(\mathcal{O}_{t-1}|s_{t-1},a_{t-1})p(a_{t-1}|s_{t-1})}{1}\frac{\alpha_{t-1}(s_{t-1})}{p(\mathcal{O}_{t-1}|\mathcal{O}_{1:t-2})} \end{align*}\]

what if we want $p(s_t|\mathcal{O}_{1:T})$? \(\begin{align*} p(s_t|\mathcal{O}_{1:T})&=\frac{p(s_t,\mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})}\\ &=\frac{p(\mathcal{O}_{t:T}|s_t)p(s_t,\mathcal{O}_{1:t-1})}{p(\mathcal{O}_{1:T})}\\ &\propto \beta_t(s_t)p(s_t|\mathcal{O}_{1:t-1})p(\mathcal{O}_{1:t-1})\\ &\propto \beta_t(s_t)\alpha_t(s_t) \end{align*}\)

The optimism problem

for $t=T-1$ to 1:

\[\begin{align} \beta_t(s_t,a_t)&=p(\mathcal{O}_t|s_t,a_t)E_{s_{t+1}\sim p(s_{t+1}|s_t,a_t)}[\beta_{t+1}(s_{t+1})]\\ \beta_t(s_t)&=E_{s_t\sim p(a_t|s_t)}[\beta_t(s_t,a_t)] \end{align}\]

let $V_t(s_t) =\log \beta_t(s_t)$

let $Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)$ \(Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\) why did this happen?

The inference problem: $p(s_{1:T},a_{1:T} \mathcal{O}_{1:T})$
marginalizing and conditioning, we get: $p(a_t s_t,\mathcal{O}_{1:T})$ (the policy)

“given that you obtained high reward, what was your action probability?”

marginalizing and conditioning, we get: $p(s_{t+1}, s_t,a_t,\mathcal{O}{1:T})\ne p(s{t+1 s_t,a_t})$

“given that you obtained high reward, what was your transition probability?”

because we asking the question of transition probability that condition on the good result, so there is the optimism problem.

Addressing the optimism problem

we actually want to ask “given that you obtained high reward, what was your action probability when the transition probability did not change?”

find anther distribution $q(s_{1:T},a_{1:T})$ that is close to $p(s_{1:T},a_{1:T} \mathcal{O}{1:T})$ but has dynamics $p(s{t+1} s_t,a_t)$

Try variational inference!

let $\mathrm{x}=\mathcal{O}{1:T}$ and $\mathrm{z}=(s{1:T},s_{1:T})$ find $q(\mathrm{z})$ to approximate $p(\mathrm{z} \mathrm{x})$

Control via variational inference

\[q(s_{1:T},a_{1:T})=p(s_1)\prod_t p(s_{t+1}|s_t,a_t)q(a_t|s_t)\]

The variational lower bound (last class) \(\begin{align} \log p(x)&\ge E_{z \sim q(z)}[\log p(x,z)-\log q(z)]\\ \log p(\mathcal{O}_{1:T}) &\ge E_{s_{1:T},a_{1:T}\sim q}[\log p(s_1)+\sum_{t=1}^T\log p(s_{t+1|s_t,a_t})+\sum_{t=1}^T\log p(\mathcal{O}_t|s_t,a_t)\\-\log p(s_1)-&\sum_{t=1}^T\log p(s_{t+1}|s_t,a_t)-\sum_{t=1}^T\log q(s_t|s_t)]\\ &=E_{(s_{1:T},a_{1:T})\sim q}\left[\sum_tr(s_t,a_t)-\log q(a_t|s_t)\right]\\ &=\sum_t E_{(s_t,a_t)\sim q}[r(s_t,a_t)+\mathcal{H}(q(a_t|s_t))] \end{align}\) maximize the rewards and entropy

Optimizing the variation lower bound \(Q_t(s_t,a_t)=r(s_t,a_t)+E[(V_{t+1(s_{t+1})})]\\ V_t(s_t)=\log \int \exp(Q_t(s_t,a_t))d a_t\)

backward pass-variational

for $t=T-1$ to 1:

\[V_t=\log\int \exp(Q_t(s_t,a_t))da_t\\ Q_t(s_t,a_t)=r(s_t,a_t)+E[\exp(V_{t+1}(s_{t+1}))]\]


  • discounted SOC: $Q_t(s_t,a_t)=r(s_t,a_t+\gamma E(V_{t+1(s_{t+1})}))$
  • explicit temperature: $V_t(s_t)=\alpha \log \int \exp(\frac{1}{\alpha}Q_t(s_t,a_t))da_t$

Soft Q-learning

soft Q-learning $\phi \gets \phi + \alpha \Delta_\phi Q_\phi(s,a)(r(s,a)+\gamma V(s’)-Q_\phi(s,a))$

target value: $V(s’)=\text{soft} \max_{a’}Q_\phi(s’,a’)=\log \int \exp (Q_\phi(s’,a’))da’$

$\pi(a s)=\exp(Q_\phi(s,a)-V(s))=\exp(A(s,a))$


Policy gradient with soft optimality

$\pi(a s)=\exp(Q_\phi(s,a)-V(s))$ optimizes $\sum_{\pi(s_t,a_t)}[r(s_t,a_t)]+E_{\pi(s_t)}[\mathcal{H}(\pi(a_t s_t))]$
intuition: $\pi(a s)\propto \exp(Q_\phi(s,a))$ when $\pi$ minimizes $D_{KL}(\pi(a s)   \frac{1}{Z}\exp(Q(s,a))$

Soft Policy gradient vs soft Q-learning

policy gradient derivation: \(J(\theta)=\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)]+E_{\pi(s_t)}[\mathcal{H}(\pi(a|s_t))]\\ =\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)-\log \pi(a_t|s_t)]\\ \log \pi(s_t|s_t)=Q(s_t,a_t)-V(s_t)\)

\[\begin{align} &\Delta_\theta\left[\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)-\log \pi(a_t|s_t)]\right]\\ &\approx \frac{1}{N}\sum_i\sum_t\Delta_\theta \log \pi(a_t|s_t)\left(r(s_t,a_t)+\left(\sum_{t'=t+1}^Tr(s_{t'},s_{t'})-\log \pi(a_{t'}|s_{t'})\right)-\log \pi(a_t|s_t)-1\right)\\ &\approx \frac{1}{N}\sum_i\sum_t(\Delta_\theta Q(s_t,a_t)-\Delta_\theta V(s_t))\left(r(s_t,a_t)+Q(s_{t+1},s_{t+1})-Q(s_t,a_t)+V(s_t)\right)\\ &\approx \frac{1}{N}\sum_i\sum_t(\Delta_\theta Q(s_t,a_t)-\Delta_\theta V(s_t))\left(r(s_t,a_t)+Q(s_{t+1},s_{t+1})-Q(s_t,a_t)\right) \end{align}\]

soft Q-learning: $$

  • \frac{1}{N}\sum_i\sum_t\Delta_\theta Q(s_t,a_t)\left(r(s_t,a_t)+\text{soft}\max_{a_{t+1}}Q(s_{t+1},s_{t+1})-Q(s_t,a_t)\right) $$

Benefits of soft optimality

  • Improve exploration and prevent entropy collapse
  • Easier to specialize (fine-tune) policies for more specific tasks
  • Principled approach to break ties
  • Better robustness (due to wider converge of states)
  • Con reduce to hard optimality as reward magnitude increase
  • Good model for modeling human behavior

12. Inverse Reinforcement Learning

Why should we worry about learning rewards

The imitation learning perspective

Standard imitation learning:

  • copy the action performed by the expert
  • no reasoning about outcomes of actions

Human imitation learning:

  • copy the intent of the expert
  • might take very different actions

The reinforcement learning perspective

sometimes reward is complicated and unclear

Inverse reinforcement learning

Infer reward functions from demonstrations

  • by itself, this is an underspecified problem
  • many reward function can explain the same behavior

Inverse reinforcement learning:


states $s \in \mathcal{S}$, action $a \in \mathcal{A}$

(sometimes) transitions $p(s’ s,a)$

samples ${\tau_i}$ sample from $\pi^*(\tau)$

learn $r_\psi(s,a)$

… and then use it to learn $\pi*(a s)$

Learning the optimality variable

$p(\mathcal{O}_t s_t,a_t,\psi)=\exp(r_\psi(s_t,a_t))$
$p(\tau \mathcal{O}{1:T},\psi) \propto p(\tau)\exp\left(\sum_tr\psi(s_t,a_t\right)$


samples ${\tau_i}$ sampled from $\pi*(\tau)$
​ maximum likelihood learning: $\max_\psi \frac{1}{N}\sum_{i-1}^N\log p(\tau_i \mathcal{O}{1:T},\psi)=\max\psi \frac{1}{N}\sum_{i=1}^Nr_{\psi}(\tau_i)-\log Z$

$\log Z$ make sure all trajectory probability sums up to 1

The IRL partition function
\[\max_\psi \frac{1}{N}\sum_{i=1}^Nr_{\psi}(\tau_i)-\log Z\\ Z=\int p(\tau)\exp(r_\psi(\tau))d\tau\\ \Delta_\psi \mathcal{L}=\frac{1}{N}\sum_{i=1}^N\Delta_\psi r_\psi(\tau_i)-\frac{1}{Z}\int p(\tau)\exp (r_\psi(\tau))\Delta_\psi r_\psi(\tau)d\tau\\ =E_{\tau\sim \pi^*(\tau)}[\Delta_\psi r_\psi(\tau_i)]-E_{\tau \sim p(\tau|\mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]\]
Estimating the expectation
$ p(s_t,a_t \mathcal{O}_{1:T},\psi)=p(a_t s_t,\mathcal{O}_{1:T},\psi)p(s_t \mathcal{O}_{1:T},\psi)$

let $\mu_t(s_t,a_t)\propto\beta(s_t,a_t)\alpha(s_t)$ \(\begin{align} E_{\tau \sim p(\tau|\mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]&= E_{\tau \sim p(\tau|\mathcal{O}_{1:T},\psi)}[\Delta_\psi \sum_{t=1}^Tr_\psi(s_t,a_t)]\\ &=\sum_{t=1}^T E_{(s_t,a_t) \sim p(s_t,a_t|\mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(s_t,a_t)]\\ &=\sum_{t=1}^T\vec{\mu}_t^T\Delta_\psi\vec{r}_\psi \end{align}\)

The MaxEnt IRL algorithm
  1. Given $\psi$, compute backward message $\beta(s_t,a_t)$
  2. Given $\psi$, compute forward message $\alpha(s_t)$
  3. Compute $\mu_t(s_t,a_t) \propto \beta(s_t,a_t)\alpha(s_t)$
  4. Evaluate $\Delta_\psi \mathcal{L}=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\Delta_\psi r_\psi(s_{i,t},a_{i,t})-\sum_{t=1}^T\int\int\mu_t(s_t,a_t)\Delta_\psi r_\psi(s_t,a_t)ds_tda_t$
  5. $\psi \gets \psi +\eta \Delta_\psi\mathcal{L}$

Why MaxEnt? in the case where $r_\psi(s_t,a_t)=\psi^Tf(s_t,a_t)$, we can show that it optimizes $\max_\psi\mathcal{H}(\pi^{r_\psi}) $ such sthat $E_{\pi^{r_\psi}}[f]=E_{\pi^*}[f]$

paper: Ziebart et al. 2008: Maximum Entropy Inverse Reinforcement learning

what’s missing so far?

  • MaxEnt IRL so far requires…
    • Solving for (soft) optimal policy in the inner loop
    • Enumerating all state-action tuples for visitation frequency and gradient
  • To apply this in practical problem settings, we need to handle…
    • Large and continuous state and action spaces
    • States obtained via sampling only
    • Unknown dynamics

recall: \(\Delta_\psi \mathcal{L}=E_{\tau\sim \pi^*(\tau)}[\Delta_\psi r_\psi(\tau_i)]-E_{\tau \sim p(\tau|\mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]\) The first part is just the trajectory under expert policy, the second part of this is soft optimal policy under current reward, a sample idea is to learning this soft policy using the methods introduced in last class, bu it is impractical since you need to train it to convergence, it has a lot of computation.

More efficient sample-based updates

Guided cost learning
\[\Delta_\psi \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{M}\sum_{j=1}^M\Delta_\psi r_\psi(\tau_j)\]
instead of learn the $p(a_t s_t, \mathcal{O}_{1:T},\psi)$using any max-ant RL algorithm to converge then run this policy to sample ${\tau_j}$, can we just run one or several gradient step and then sample?

Solution 1: use importance sampling \(\Delta_\psi\mathcal{L} \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{\sum_jw_j}\sum_{j=1}^Mw_j\Delta_\psi r_\psi(\tau_j)\;\;\;\;w_j=\frac{p(\tau)\exp(r_\psi(\tau_j))}{\pi(\tau_j)}\)

\[\begin{align} w_j&=\frac{p(\tau)\exp(r_\psi(\tau_j))}{\pi(\tau_j)}\\ &=\frac{p(s_1)\prod_tp(s_{t+1}|s_t,a_t)\exp(r_\psi(s_t,a_t))}{p(s_1)\prod_tp(s_{t+1}|s_t,a_t)\pi(a_t|s_t)}\\ &=\frac{\exp(\sum_tr_\psi(s_t,a_t))}{\prod_t\pi(a_t|s_t)} \end{align}\]

each policy update w.r.t. $r_\psi$ brings us closer to the target distribution!

paper : guided cost learning algorithm. Finn et al. ICML ‘16

this actually like a game, or GAN, which the $\Delta_\psi\mathcal{L} \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{\sum_jw_j}\sum_{j=1}^Mw_j\Delta_\psi r_\psi(\tau_j)$ makes demos are made more likely, samples less likely. But $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)r_\psi(\tau_j)$ policy changed to make it harder to distinguish from demos.

Inverse RL as a Generative adversarial Networks (GAN)

In GAN, the best discriminator is $D^(x)=\frac{p^(x)}{p_\theta(x)+p^*(x)}$

For IRL, optimal policy approaches $\pi_\theta(\tau)\propto p(\tau)\exp(r_\psi(\tau))$

choose this parameterization for discriminator: \(\begin{align} D_\psi(\tau)&=\frac{p(\tau\frac{1}{Z}\exp(r(\tau)))}{p_\theta(\tau)+p(\tau)\frac{1}{Z}\exp(r(\tau))}\\ &=\frac{\frac{1}{Z}\exp(r(\tau))}{\prod_t\pi_\theta(a_t|s_t)+\frac{1}{Z}\exp(r(\tau))}\\ &=\frac{\frac{1}{Z}\exp(r(\tau))}{\prod_t\pi_\theta(a_t|s_t)+\frac{1}{Z}\exp(r(\tau))} \end{align}\) and using GAN to lean by $\psi \gets \arg \max_\psi E_{\tau \sim p*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]$, and then it will get same result as previous Inverse RL update rule. and here we don not need important weight, we can just optimize Z w.r.t. same objective as $\psi$ ! and the weight is in Z.

so the discriminator is $\psi \gets \arg \max_\psi E_{\tau \sim p*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]$

the generator is $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)r_\psi(\tau_j)$

After inverse RL, the reward has already learned, when the environment changed, it use learned reward representation has the generalization ability to learn policy under new environment.

Regular discriminator

Can we just use a regular discriminator? \(\psi \gets \arg \max_\psi E_{\tau \sim p*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]\) and just parametrize $D_\psi(\tau)$ as standard binary neural net classifier

the generator becomes $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)\log D_\psi(\tau_j)$

  • this is simpler to set up optimization
  • but discriminator knows nothing at convergence
  • and do know the reward representation, only get the policy $\pi_\theta$


13. Transfer and Multi-task Learning

Use prior knowledge

Transfer learning: using experience from one set of tasks for faster learning and better performance on a new task. In RL, task is MDP

shot: number of attempts in the target domain

0-shot: just run a policy trained in the source domain

1-shot: try the task once

few shot: try the task a few times

  1. “forward” transfer: train on one task, transfer to new task
    1. just try it and hope for best
    2. fine-tune on the new task
    3. randomize source domian
  2. Multi-task transfer: train on many tasks, transfer to new task
    1. generate highly randomized source domains
    2. model-based reinforcement learning
    3. model distillation
    4. contextual policies
    5. modular policy networks
  3. Multi-task meta-learning: learning to learning from many tasks
    1. RNN-based meta-learning
    2. Gradient-based meta-learning

This class is pretty high-level introduction about Transfer and Multi-task Learning. The class give make possible direction of Transfer and Multi-task Learning, and gives many paper for further study in this part, see lecture slides for more detail of recommended papers.

14. Distributed RL

2013/2015: DQN: replay buffer

2015: GORILA

2016: A3C: one learner multiple actors, actor generate gradient and send it to learner

2018: IMPALA: several actors and learnings, actor only acting and generate data for learner, using importance sampling (V-trace) correct for policy lag.

2018: Ape-X/R2D2: Reintroduces replay

2019: R2D3

RLlib: Abstractions for Distributed Reinforcement Learning (ICML’18)

15. Exploration

Exploration in bandit

Regret \(Reg(T)=TE[r(a^*)]-\sum_{t=1}^Tr(a_t)\)

Optimistic exploration

optimistic estimate: $a=\arg\max \hat{\mu}_a + C \sigma_a$

Intuition: try each arm until you are sure it’s not great

example: $a=\arg \max \hat{\mu}_a+\sqrt{\frac{2\ln T}{N(a)}}$ $Reg (T)$ is $O(\log T)$

Probability matching /posterior sampling

assume $r(a_i)\sim p_{\theta_i}(r_i)$

this defines a POMDP with $s=[\theta_1,…,\theta_n]$

belief state is $\hat{p}(\theta_1,…,\theta_n)$

idea: sample $\theta_1,…,\theta_n \sim\hat{p}(\theta_1,..,\theta_n)$

pretend the model $\theta_1,..,\theta_n$ is correct

take the optimal action

update the model and repeat the process

Information gain

let $\mathcal{H}(\hat{p}(z))$ be the current entropy of our $z$ estimate

let $\mathcal{H}(\hat{p}(z) y)$ be the entropy of our $z$ estimate after observation $y$

Information gain is \(IG(z,y)=E_y[\mathcal{H}(\hat{p}(z))-\mathcal(\hat{p}(z)|y)]\\ IG(z,y|a)=E_y[\mathcal{H}(\hat{p}(z))-\mathcal(\hat{p}(z)|y)|a]\) For bandit

$y= r_a, z=\theta_a$

$g(a)=IG(\theta_a,r_a a)$


choose $a$ according to $\arg\min_a \frac{\Delta(a)^2}{g(a)}$

Exploration in DRL

Optimistic exploration in RL

UCB: $a=\arg \max \hat{\mu}_a+\sqrt{\frac{2\ln T}{N(a)}}$

In MDPs, count-based exploration: use $N(s,a)$ or $N(s)$ to add exploration bonus

use $r^+(s,a)=r(s,a)+\mathcal{B}(N(s))$

use $r^+(s,a)$ instead of $r(s,a)$ with any model-free algorithm

but when using counts, maybe we didn’t see the same thing twice. so we need to have a representation of similarity of states.

idea: fit a density model $p_\theta(s)$ (or $p_\theta(s,a)$)

$p_\theta(s)$ might be high even for a new $s$ if $s$ is similar to previously seen states \(P(s)=\frac{N(s)}{n}\\ P'(s)=\frac{N(s)+1}{n+1}\)

Exploring with pseudo-counts

fit model $p_\theta(s)$ to all states $\mathcal{D}$ seen so for

take a step $i$ and observe $s_i$

fit new model $p_{\theta’}(s)$ to $\mathcal{D}\cup s_i$

use $p_\theta(s_i)$ and $p_{\theta’}(s_i)$ to estimate $\hat{N}(s)$

set $r^+(s,a)=r(s,a)+\mathcal{B}(N(s))$, and repeat!

How to get $\hat{N}(s)$ , use previous equations and get \(\hat{N}(s_i)=\hat{n}p_\theta(s_i)\\ \hat{n}=\frac{1-p_{\theta'}(s_i)}{p_{\theta'}(s_i)-p_\theta(s_i)}p_\theta(s_i)\)

What kind of bonus to use? many chooses

UCB: $\mathcal{B}(N(s))=\sqrt{\frac{2\ln T}{N(s)}}$

MBIE-EB: $\mathcal{B}(N(s))=\sqrt{\frac{1}{N(s)}}$

BEB: $\mathcal{B}(N(s))=\frac{1}{N(s)}$

What kind of model ($p_\theta(s)$) to use?
  • Bellemare et al.: “CTS” model: condition each pixel on its top-left neighborhood

  • Counting with hashes

    idea: compress $s$ into a $k$-bit code via $\phi (s)$, then count $N(\phi(s))$

    short codes=more hash collisions

    Can use VAE compression to get hash

  • implicit density modeling with exemplar model

    explicitly compare to new state to past states

    Intuition: the state is novel if it is easy to distinguish from all previous seen states by a classifier

    for each observed state $s$, fit a classifier to classify that state against all past states $\mathcal{D}$, use classifiler error to obtain density \(p_\theta(s)=\frac{1-D_s(s)}{D_s(s)}\) In practice, just train one amortized model takes in exemplar as input

    for detail ref. Fu et al. “EX2: Exploration with exemplar models”

  • Heuristic estimation of counts via errors

    idea: we do not need the densities, just get something tell us if the state is novel or not!

    let’s say we have some target function $f^*(s,a)$, this function can be any function, just show a mapping against $s,a$.

    given our buffer $\mathcal{D}={(s_i,a_i)}$, fit $\hat{f}_\theta(s,a)$

    use $\xi(s,a)=   \hat{f}_\theta(s,a)-f^*(s,a)   ^2$ as bonus

​ what should we use for $f^*(s,a)$?

​ one common choice: set $f^*(s,a)=s’$

​ even simpler: $f^(s,a)=f_\phi(s,a)$, where $\phi$ is a *random parameter vector (random network distillation)

Posterior sampling in deep RL

  1. sample Q-function Q from $p(Q)$
  2. act according Q for for one episode
  3. update p(Q)


  1. given a dataset $\mathcal{D}$, re-sample with replacement N times to get $\mathcal{D}_1,…,\mathcal{D}_N$
  2. train each model $f_{\theta_i}$ on $\mathcal{D}_i$
  3. to sample from $p(\theta)$, sample $i \in[1,…,N]$ and use $f_{\theta_i}$

but training N big neural nets is expensive, so we use one network with shared base network and multiple head for each model, and in practice, we actually do not do re-sample, since we have different initialization for each model.

Reasoning about information gain


prediction gain: $\log p_{\theta’}(s)-\log p_\theta(s)$

intuition: if density changed a lot, the state was novel

variational inference:

IG can be equivalently written as $D_{KL}(p(z y)   p(z))$
learn about transitions $p_\theta(s_{t+1} s_t,a_t): z=\theta$
$y = (s_t,a_t,s_{t+1})$ $D_{KL}(p(\theta h,s_t,a_t,s_{t+1})   p(\theta h))$ where h is the history of all prior transitions

intuition: a transition is more informative if it causes belief over $\theta$ to change

idea: use variational inference to estimate $q(\theta \phi)\approx p(\theta h)$

given new transition $(s,a,s’)$, update $\phi$ to get $\phi’$

and use $D_{KL}(q(\theta \phi)   q(\theta \phi))$ as approximate bonus

for more details, ref Houthooft et al. “VIME”

Imitation learning vs. Reinforcement learning

Imitation learning

  • requires demonstrations
  • must address distributional shift
  • Simple, stable supervised learning+
  • Only as good as the demo

Reinforcement learning

  • Require reward function
  • Must address exploration
  • Potential non-convergent RL
  • Can become arbitrarily good+

Can we get the best of both?

we have both demonstration and rewards

IRL already addresses distributional shift via RL, but it doesn’t use a known reward function!

Simplest combination: pre-train by imitation & fine-tune by RL

  1. collected demonstration data $(s_i,a_i)$
  2. initialize $\pi_\theta$ as $\max_\theta \sum_i \log \pi_\theta(a_i s_i)$
  3. run $\pi_\theta$ to collect experience
  4. improve $\pi_\theta$ with any RL algorithm and repeat 3 and 4

but in step 3 the agent can be very bad due to distribution shift, so the first batch of bad data can destroy initialization.

Off-policy RL

we can address this by off-policy RL. Off-policy RL can use any data, so we can keep the demonstration in buffer.

  • off-policy policy gradient (with importance sampling)
  • off-policy Q-learning
Policy gradient with demonstrations
\[\Delta J(\theta)=\sum_{\tau \in \mathcal{D}}\left[\sum_{t=1}^T\Delta_\theta\log\pi_\theta(a_t|s_t)\left(\prod_{t'=1}^t\frac{\pi_\theta(a_{t'}|s_{t'})}{q(a_{t'}|s_{t'})}\right)\left(\sum_{t'=t}^Tr(s_{t'},a_{t'})\right)\right]\]

where the $\mathcal{D}$ includes both demo data and policy data

Problem 1: which distribution did the demonstrations come from?

​ option 1: use supervised behavior cloning to approximate $\pi_{demo}$

​ option 2: assume Dirac delta: $\pi_{demo}(\tau)=\frac{1}{N}\delta (\tau \in \mathcal{D})$ this works best with self-normalized importance sampling—– $E_{p(x)}[f(x)]\approx\frac{1}{\sum_j\frac{p(x_j)}{q(x_j)}}\sum_i\frac{p(x_i)}{q(x_i)}f(x_i)$

Problem 2: what to do if have multiple distributions

fusion distribution: $q(x)=\frac{1}{M}\sum_iq_i(x)$

Q-learning with demonstrations

just drop demo demonstration data to replay buffer

What’s the problem?

importance sampling: recipe for getting stuck

Q-learning: just good data is not enough, only having good data is hard to fit good Q value

more problems with this highly off-policy Q learning

this is highly off-policy so we do not using imitation policy to collect data any more, so f Q function makes any mistake, tis had to fix it, and then we any train on totally garbage.

to address this problem, \(Q(s,a) \gets r(s,a)+E_{a'\sim\pi_{new}}[Q(s',a')]\) How to pick $\pi_{new}(a|s)$?

option 1: stay close to $\beta$

​ e.g. $D_{KL}(\pi_{new}(. s)   \beta(. s))\le \epsilon$

​ issue 1: we don’t know $\beta$

​ issue 2: this is way too conservative

option 2: constrain to support of $\beta$ ref. these two papers

​ Kumar et al. stabilizing Off-Policy Q-learning vai bootstrapping Error Reduction

​ fojimoto et al. Off-Policy Deep Reinforcement learning without exploration

Imitation as an auxiliary loss function

imitation objective: $\sum_{(s,a) \in \mathcal{D}{demo}}\log \pi\theta(a s)$

RL objective: $E_{\pi_\theta}[r(s,a)]$

hybrid objective: $E_{\pi_\theta} [r(s,a)]+\lambda \sum_{(s,a) \in \mathcal{D}{demo}} \log \pi\theta(a s)$

Hybrid Q-learning: \(J(Q)=J_{DQ}(Q)+\lambda_1J_n(Q)+\lambda_2j_E(Q)+\lambda_3J_{L2}(Q)\\ J_E(Q)=\max_{a\in A}[Q(s,a)+l(s_E,a)]-Q(s,a_E)\) and $J_{DQ}$ is the Q-learning loss, $J_n(Q)$ is n-step Q-learning loss, $J_{L2}$ is regulation loss

what’s the problem?

  • need to tune the weight
  • The design of the objective, esp. for imitation, takes a lot of care
  • Algorithm becomes problem-dependent

16 Meta Reinforcement learning

This part introduced many meta learning methods in high level, may find and read papers later

17 Information theory, challenges, open problems

Information theory

entropy \(\mathcal{H}(p(x))=-E_{x\sim p(x)}[\log p(x)\) mutual dependence \(\begin{align} \mathcal{I}&=D_{KL}(p(x,y)||p(x)p(y))\\ &=E_{(x,y)\sim p(x,y)}\left[\log \frac{p(x,y)}{p(x)p(y)}\right]\\ &=\mathcal{H}(p(y))-\mathcal{H}(p(y|x)) \end{align}\) define $\pi(s)$ state marginal distribution of policy $\pi$

$\mathcal{H}(\pi(s))$ state marginal entropy of policy $\pi$

empowerment: $\mathcal{I}(s_{t+1},a_t)=\mathcal{H}(s_{t+1})-\mathcal{H}(s_{t+1} a_t)$

Learning without a reward function by reaching goals

one way is giving goal states, maybe using VAE to generate goals

  1. Propose goal: $z_g \sim p(z)$, $x_g \sim p_\theta(x_g z_g)$
  2. Attempt to reach goal using $\pi(a x,x_g)$, reach $\bar{x}$
  3. Use data to update $\pi$
  4. Use data to update $p_\theta(x_g z_g)$, $q_\phi(z_g x_g)$

but how to diverse goals?

in step 4

the Standard MLE: $\theta, \phi \sim \arg \max _{\theta, \phi}E[\log p(\bar{x})]$

the weighted MLE: $\theta, \phi \sim \arg \max _{\theta, \phi}E[w(\bar{x})\log p(\bar{x})]$$

where $w(\bar{x})=p_\theta(\bar{x})^\alpha$

key result: for ant $\alpha \in [-1,0) $, entropy $\mathcal{H}(p_\theta(x))$ increases!

This actually doing $\max \mathcal{H}(p(G))$

and what does RL do?

$\pi(s S,G)$ trained to reach goal G as $\pi$ gets better, final state S gets close to G,
that means $p(G S)$ becomes more deterministic!

so we actually doing this \(\max \mathcal{H}(p(G))-\mathcal(p(G|S))=\max \mathcal{I}(S;G)\)

Learning diverse skills

$\pi(s s,z)$ and $z$ is task index.

Intuition: different skill should visit different state-space regions

Diversity-promoting reward function \(\pi(a|s,z) =\arg \max_\pi \sum_z E_{s \sim \pi(s|z)}[r(a,z)]$\) where $r(s,z)= \log p(z|s)$, reward states that are unlikely for other $z’\ne z$.

Here, when z is sampled, the only thing learning doing is maximize the probability of state of this policy by tune the $\pi(s z)$, it turns out that different policies get different skills. Ref. Diversity is all you need (ICLR-2019)

In fact this is also goal reaching. \(\mathcal{I}(z,s)=H(z)-H(z|s)\) we actually maximizing $H(z)$ by sampling $p(z)$uniformly, and the minimizing $H(z|s)$ using the algorithm.

Unsupervised reinforcement learning for meta-learning

using unsupervised reinforcement learning to propose tasks for meta learning.

so at first, using unsupervised meta-RL tor generate tasks, then using meat-learning for those tasks by given reward function from previous steps.

Challenges in deep reinforcement learning

Core algorithm:

  • Stability
  • Efficiency
  • Generalization


  • Problem formulation
  • Supervision

Stability and hyper-parameter tuning

  • Devising stable RL algorithm is very hard
  • Q-learning/value function estimation
    • No guarantee of convergence
    • Lots of parameters for stability: target network delay, replay buffer size, clipping, sensitivity to learning rates, etc.
  • Policy gradient/likelihood ratio/REINFORCE
    • Very high variance gradient estimator
    • Lots of samples, complex bases, etc.
    • Parameters: batch size, learning rate, design of baseline
  • Model-based RL algorithms
    • Model class and fitting method
    • Optimizing policy w.r.t. model non-trivial due to back-propagation through time
    • More subtle issue: policy tends to exploit the model

The challenge with hyper-parameters is severe

  • Algorithms with favorable improvement and convergence properties
    • TRPO
    • Safe reinforcement learning, High-confidence policy improvement [Thomas ‘15’]
  • Algorithms that adaptively adjust parameters
    • Q-Prop

Not great for beating benchmarks, but absolutely essential to make RL a viable tool for real-world problems.

Sample complexity

  • real-world learning becomes difficult or impractical
  • Precludes the use of expensive, high-fidelity simulators
  • Limits applicability to real-world problems

what can we do?

  • Better model-based RL algorithms
  • Design faster algorithms
    • Addressing Function Approximation Error in Actor-Critic Algorithms[Fujimoto et al. ‘18’]
    • Soft Actor-Critic
  • Reuse prior knowledge to accelerate reinforcement learning
    • RL2 [Duan at al. 17]
    • Learning to reinforcement learning [Wang et al. ‘17’]
    • MAML [Finn at al. ‘17’]

Scaling & Generalization

  • Small-scale
  • Emphasizes mastery
  • Evaluated on performance
  • Where i the generalization

Reinforcement learning need to re-collect data during training

Assumption problems

Single task or multi-task

  • Train on multiple tasks, then try ti generalize or finetune
    • policy distillation
    • Actor-mimic
    • MAML
  • Unsupervised or weakly supervised learning of diverse behaviors
    • Stochastic neural networks
    • Reinforcement learning with deep energy-based policies

Where does the supervision come from?

  • find some different tasks
  • learn objectives/reward form demonstration (IRL)
  • Generate objectives automatically

What is the role of the reward function?

Unsupervised reinforcement learning

  1. Interaction with the world without a reward function

  2. Learning something about the world

  3. Use what you learned to quickly solve new tasks

Other sources of supervision

  • Demonstrations
  • Language
  • Human preferences

Where does the supervision signal come from?

  • Yannn YeCun’s cake
    • Unsupervised or self-supervised learning
    • Model learning (predict the future)
    • Generative modeling of the world
    • Lots to do even before you accomplish your goal
  • Imitation & understanding other agents
  • The giant value backup

18 Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)


Learning to Learning with gradient. Finn. PhD Thesis 2018

Efficient Off-Policy Meta-Reinforcement learning via Probabilistic Context Variable ICML-19

These algorithms only adapt to same similar tasks, but they can not adapt to entirely new tasks!!!

If we want to do this, we need to make sure the meta-train task distribution same as the meta-test task distribution.

  • Algorithms: more general than a policy, from demo and trial? or others, Meta world

  • Task Representation: How to tell the tasks, language or goal?

  • Data: RoboNet