Deep Reinforcement Learning Notes (UC Berkeley)
Background
These are my notes for the UC Berkeley Deep Reinforcement Learning course (CS294-112, now CS285), taught by Sergey Levine. The lecture videos can be found on YouTube. I have written two notes on reinforcement learning before: one on basic RL, and one on David Silver's course.
Compared with those courses, this one takes a deeper theoretical view and covers more recent methods and advanced topics, especially model-based RL and meta-learning. It is well suited for readers interested in robotic control and in a deeper understanding of reinforcement learning.
This class is fairly demanding, so make sure you follow it closely.
Some of the math may not render properly in this post; I may fix that later.
Table of contents
7. Optimal Control and Planning
8. Model-Based Reinforcement Learning (learning the model)
9. Model-Based RL and Policy Learning
10. Variational Inference and Generative Models
11. Reframing Control as an Inference Problem
12. Inverse Reinforcement Learning
13. Transfer and Multi-task Learning
16. Meta Reinforcement Learning
17. Information Theory, Challenges, Open Problems
18. Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)
1. Imitation learning
The main problem of imitation learning: distributional drift.
How can we make the distribution of the training data match the distribution of observations seen under the learned policy?
DAgger
DAgger: Dataset Aggregation
goal: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)$!
how? just run $\pi_\theta(a_t\mid o_t)$
but need labels $a_t$ !

1. train $\pi_\theta(a_t\mid o_t)$ from human data $\mathcal{D}=\{o_1,a_1,\dots,o_N,a_N\}$
2. run $\pi_\theta(a_t\mid o_t)$ to get dataset $\mathcal{D}_\pi = \{o_1,\dots,o_M\}$
3. ask a human to label $\mathcal{D}_\pi$ with actions $a_t$
4. aggregate: $\mathcal{D}\gets \mathcal{D}\cup\mathcal{D}_\pi$, and repeat (a sketch follows below)
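A minimal sketch of this loop in Python, assuming hypothetical helpers `train_policy` (supervised learning), `run_policy` (rollout that returns observations), and `expert_label` (the human/expert labeler):

```python
# Minimal DAgger sketch; train_policy, run_policy, expert_label are hypothetical helpers.
def dagger(expert_data, n_iters=10):
    dataset = list(expert_data)            # list of (observation, action) pairs
    policy = train_policy(dataset)         # 1. supervised learning on human data
    for _ in range(n_iters):
        observations = run_policy(policy)  # 2. run the learned policy, keep only observations
        labeled = [(o, expert_label(o)) for o in observations]  # 3. expert labels visited states
        dataset += labeled                 # 4. aggregate the datasets
        policy = train_policy(dataset)     # retrain and repeat
    return policy
```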
Even if we fit the training data well, why might we still fail to fit the expert?
- Non-Markovian behavior
  - use a history of observations
- Multimodal behavior
  - for discrete actions this is fine, since the softmax outputs a probability over all actions
  - for continuous actions:
    - output a mixture of Gaussians
    - latent variable models (inject noise into the network input)
    - autoregressive discretization

other problems of imitation learning
- human-labeled data is finite
- humans are not good at providing labels for some problems
reward function of imitation learning
The reward function for imitation learning can be written as \(r(s,a) = \log p(a=\pi^*(s)\mid s)\)
MDP & RL Intro
The goal of RL
expected reward \(p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t\mid s_t)p(s_{t+1}\mid s_t,a_t)\\ \theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\) where $p_\theta(\tau)$ is the distribution over trajectories
Q & V
\[V^\pi(s_t)=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t]\\ V^\pi(s_t)=E_{a_t\sim\pi(a_t\mid s_t)}[Q^\pi(s_t,a_t)]\]
Types of RL algorithms
- Policy gradient
- value-based
- actor-critic
- model-based RL
  - for planning
    - optimal control
    - discrete planning
  - improve the policy
  - something else
    - dynamic programming
    - simulated experience for planning
tradeoffs
- sample efficiency
- stability & ease of use

assumptions
- stochastic or deterministic
- continuous or discrete
- episodic or infinite horizon
sample efficiency
- off-policy: able to improve the policy without generating new samples from that policy
- on-policy: each time the policy is changed, even a little bit, we need to generate new samples
stability & ease of use
- convergence is a problem
- supervised learning is almost always gradient descent on a fixed objective
- RL is often not strictly gradient descent
2. Policy gradient
Objective function
\[\theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\\ J(\theta)=E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\approx\frac{1}{N}\sum_i\sum_tr(s_{i,t},a_{i,t})\]
policy differentiation
log derivative
\[\begin{align} \pi_\theta(\tau)\Delta_\theta \log \pi_\theta(\tau)&=\pi_\theta(\tau)\frac{\Delta_\theta\pi_\theta(\tau)}{\pi_\theta(\tau)}=\Delta_\theta\pi_\theta(\tau)\\ \pi_\theta(\tau)&=\pi_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t\mid s_t)p(s_{t+1}\mid s_t,a_t)\\ \log \pi_\theta(\tau) &=\log p(s_1) + \sum_{t=1}^T \log \pi_\theta (a_t\mid s_t) + \log p(s_{t+1}\mid s_t,a_t)\\ \Delta_\theta \log \pi_\theta(\tau)&=\Delta_\theta \left[\log p(s_1) + \sum_{t=1}^T \log \pi_\theta (a_t\mid s_t) + \log p(s_{t+1}\mid s_t,a_t)\right]= \sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t\mid s_t) \end{align}\]
Objective function differentiation
\[\begin{align} J(\theta)&=E_{\tau\sim p_\theta(\tau)}[{r}(\tau)]=\int\pi_\theta(\tau)r(\tau)d\tau\\ {r}(\tau)&=\sum_t r(s_t,a_t)\\ \Delta_\theta J(\theta)&=\int\Delta_\theta \pi_\theta(\tau)r(\tau)d\tau\\ &=\int\pi_\theta(\tau)\Delta_\theta \log \pi_\theta(\tau)r(\tau) d\tau\\ &=E_{\tau\sim\pi_\theta}[\Delta_\theta \log \pi_\theta(\tau)r(\tau)]\\ &=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t\mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right] \end{align}\]
evaluating the policy gradient
\[\begin{align} \Delta_\theta J(\theta)&=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t\mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]\\ \Delta_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\ \theta &\gets\theta+\alpha\Delta_\theta J(\theta) \end{align}\]
REINFORCE algorithm

1. sample $\{\tau^i\}$ from $\pi_\theta(a_t\mid s_t)$ (run the policy)
2. $\Delta_\theta J(\theta)\approx\sum_{i}\left(\sum_{t} \Delta_\theta \log \pi_\theta (a_t^i\mid s_t^i) \right)\left(\sum_{t} r(s_t^i,a_t^i)\right)$
3. $\theta \gets\theta+\alpha\Delta_\theta J(\theta)$
policy gradient
\[\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N \Delta_\theta \log \pi_\theta (\tau_i) r(\tau_i)\]
Reduce variance
Causality: the policy at time $t'$ cannot affect the reward at time $t$ when $t<t'$ \(\begin{align} \Delta_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\ &\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\\ &=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t})\hat{Q}_{i,t} \end{align}\) baseline \(b=\frac{1}{N}\sum_{i=1}^{N}r(\tau_i)\\ \Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N \Delta_\theta \log \pi_\theta (\tau_i) [r(\tau_i)-b]\) proof that the baseline is unbiased: \(\begin{align} E[\Delta_\theta \log \pi_\theta(\tau)b]&=\int \pi_\theta(\tau) \Delta_\theta \log \pi_\theta(\tau)b\, d\tau \\ &= \int\Delta_\theta \pi_\theta(\tau)b\, d\tau\\ &=b\Delta_\theta\int\pi_\theta(\tau)d\tau\\ &=b\Delta_\theta 1\\ &=0 \end{align}\) Here, $\tau$ means a whole episode sampled by the current policy.
We can prove that there is an optimal baseline that minimizes variance: \(b=\frac{E[g(\tau)^2r(\tau)]}{E[g(\tau)^2]}\) where $g(\tau)=\Delta_\theta\log\pi_\theta(\tau)$. But in practice, we just use the average reward as the baseline to reduce complexity.
Policy gradient is an on-policy algorithm.
Offpolicy learning & importance sampling
\[\theta^*=\underset{\theta}{\arg\max} J(\theta)\\ J(\theta)=E_{\tau\sim\pi_\theta(\tau)}[r(\tau)]\]
What if we sample from $\bar{\pi}(\tau)$ instead?
Importance sampling
\(\begin{align}
E_{x\sim p(x)}[f(x)]&=\int p(x)f(x)dx\\
&=\int \frac {q(x)}{q(x)}p(x)f(x)dx\\
&=E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right]
\end{align}\)
so apply this to our objective function, we have
\(J(\theta)=E_{\tau\sim\bar{\pi}(\tau)}\left[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\right]\)
and we have
\(\pi_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t\mid s_t)p(s_{t+1}\mid s_t,a_t)\\
\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}=\frac{p(s_1)\prod_{t=1}^T\pi_\theta(a_t\mid s_t)p(s_{t+1}\mid s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t\mid s_t)p(s_{t+1}\mid s_t,a_t)}=\frac{\prod_{t=1}^T\pi_\theta(a_t\mid s_t)}{\prod_{t=1}^T \bar{\pi}(a_t\mid s_t)}\)
so we have
\(\begin{align}
J(\theta')&=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]\\
\Delta_{\theta'}J(\theta')&=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\Delta_{\theta'}\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]\\
&=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\Delta_{\theta'} \log \pi_{\theta'}(\tau)r(\tau)\right]
\end{align}\)
The offpolicy policy gradient
\(\begin{align}
\Delta_{\theta'}J(\theta')&=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\Delta_{\theta'} \log \pi_{\theta'}(\tau)r(\tau)\right]\\
&=E_{\tau\sim\pi_\theta}\left[\left(\prod_{t=1}^T\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\right)\left(\sum_{t=1}^T \Delta_{\theta'} \log \pi_{\theta'} (a_t\mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]\\
&=E_{\tau\sim\pi_\theta}\left[\sum_{t=1}^T \Delta_{\theta'} \log \pi_{\theta'} (a_t\mid s_t) \left(\prod_{t'=1}^t\frac{\pi_{\theta'}(a_{t'}\mid s_{t'})}{\pi_\theta(a_{t'}\mid s_{t'})}\right)\left(\sum_{t'=t}^T r(s_{t'},a_{t'})\left(\prod_{t''=t}^{t'}\frac{\pi_{\theta'}(a_{t''}\mid s_{t''})}{\pi_\theta(a_{t''}\mid s_{t''})}\right)\right)\right]
\end{align}\)
we can view state and action separately, then:
\(\begin{align}
\theta^*&=\underset{\theta}{\arg\max} \sum_{t=1}^TE_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]\\
J(\theta)&=E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]\\
&=E_{s_t\sim p_\theta(s_t)}\left[E_{a_t\sim \pi_\theta(a_t\mid s_t)}[r(s_t,a_t)]\right]\\
J(\theta')&=E_{s_t\sim p_\theta(s_t)}\left[\cancel{\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}}E_{a_t\sim \pi_\theta(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}r(s_t,a_t)\right]\right]
\end{align}\)
If the two policies are close, the ratio $\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}$ stays bounded, so we can ignore it; this leads to the TRPO method we will discuss later.
For coding, we can use a "pseudo-loss" as a weighted maximum likelihood with automatic differentiation: \(\bar{J}(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log \pi_\theta (a_{i,t}\mid s_{i,t})\hat{Q}_{i,t}\)
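For example, a minimal PyTorch-style sketch of this pseudo-loss, assuming `log_probs` and `q_hats` are `[N, T]` tensors computed from sampled trajectories (names are hypothetical):

```python
import torch

def pseudo_loss(log_probs, q_hats):
    # log_probs[i, t] = log pi_theta(a_{i,t} | s_{i,t}); q_hats[i, t] = reward-to-go estimate
    # Negative sign: maximizing J(theta) = minimizing -J(theta) with a standard optimizer.
    return -(log_probs * q_hats.detach()).sum(dim=1).mean()

# usage: loss = pseudo_loss(log_probs, q_hats); loss.backward(); optimizer.step()
```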
policy gradient in practice
- the gradient has high variance
  - this isn't the same as supervised learning!
  - gradients will be really noisy!
- consider using much larger batches
- tweaking learning rates is very hard
  - adaptive step size rules like Adam can be OK-ish
  - we will learn about policy-gradient-specific learning rate adjustment methods later
3. Actor-critic methods
Basics
recap policy gradient \(\begin{align} \Delta_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\ &\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\\ &=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t})\hat{Q}_{i,t} \end{align}\) where $\hat{Q}_{i,t}$ is a single-trajectory sample, which is an unbiased estimate but has high variance.
We can use the expectation to reduce variance \(\hat{Q}_{i,t}\approx \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t,a_t]\) and we define \(Q^\pi(s_t,a_t)= \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t,a_t]\\ V^\pi(s_t)=E_{a_t\sim\pi(a_t\mid s_t)}[Q^\pi(s_t,a_t)]\\\) then \(\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t})(Q^\pi(s_{i,t}, a_{i,t})-V^\pi(s_{i,t}))\)
Advantage
\[A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)\\ \Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t})A^\pi(s_{i,t},a_{i,t})\]
The better the estimate of $A^\pi(s_t,a_t)$, the lower the variance.
Value function fitting
\[Q^\pi(s_t,a_t)=r(s_t,a_t)+E_{s_{t+1}\sim p(s_{t+1}\mid s_t,a_t)}[V^\pi(s_{t+1})]\]and we accept a little bias (a single sample of the next state) for convenience \(Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})\) so we have \(A^\pi(s_t,a_t) \approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)\) then we only need to fit $V^\pi(s)$!
Policy evaluation
\[V^\pi(s_t)=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t]\\ J(\theta)=E_{s_1\sim p(s_1)}[V^\pi(s_1)]\]Monte Carlo policy evaluation (this is what policy gradient does) \(V^\pi(s_t)\approx \sum_{t'=t}^Tr(s_{t'},a_{t'})\) We can average multiple samples if we are able to reset the environment to a previous state \(V^\pi(s_t)\approx \frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^Tr(s_{t'},a_{t'})\) Monte Carlo evaluation with function approximation
With function approximation, using just one sample per state from the trajectory still works pretty well.
training data: $\left\{\left(s_{i,t},\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\right)\right\}$
supervised regression: $\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V}_\phi^\pi(s_i)-y_i\parallel^2$
Ideal target: \(y_{i,t}=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_{i,t}]\approx r(s_{i,t},a_{i,t})+V^\pi(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\) Monte Carlo target: \(y_{i,t}=\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\)
TD(bootstrapped)
training data: $\left\{\left(s_{i,t},r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\right)\right\}$
Actorcritic algorithm
batch actor-critic algorithm:

1. sample $\{s_i,a_i\}$ from $\pi_\theta(a\mid s)$ (run the policy)
2. fit $\hat{V}_\phi^\pi(s)$ to the sampled reward sums
3. evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)$
4. $\Delta_\theta J(\theta)\approx \sum_i \Delta_\theta \log \pi_\theta (a_{i}\mid s_{i})\hat{A}^\pi(s_i,a_i)$
5. $\theta \gets \theta+\alpha\Delta_\theta J(\theta)$
Aside: discount factors
what if T (episode length) is $\infty$ ?
$\hat{V}_\phi^\pi$ can get infinitely large in many cases
simple trick: it is better to get rewards sooner than later \(V^\pi(s_{i,t})\approx r(s_{i,t},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1})\\ \gamma \in [0,1]\) actually we use the discount in policy gradient as \(\Delta_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)\) Online actor-critic algorithm (can be applied at every single step; see the sketch after the steps):

1. take action $a\sim\pi_\theta(a\mid s)$, get $(s,a,s',r)$
2. update $\hat{V}_\phi^\pi(s)$ using target $r+\gamma\hat{V}_\phi^\pi(s')$
3. evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}_\phi^\pi(s')-\hat{V}_\phi^\pi(s)$
4. $\Delta_\theta J(\theta)\approx \Delta_\theta \log \pi_\theta (a\mid s)\hat{A}^\pi(s,a)$
5. $\theta \gets \theta+\alpha\Delta_\theta J(\theta)$
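A rough PyTorch-style sketch of one such online step, assuming a hypothetical `actor` that returns a `torch.distributions` object and a `critic` that returns $\hat V_\phi(s)$:

```python
import torch

def actor_critic_step(s, a, r, s_next, actor, critic,
                      actor_opt, critic_opt, gamma=0.99):
    # 1-2. critic update: fit V(s) to the bootstrapped target r + gamma * V(s')
    with torch.no_grad():
        target = r + gamma * critic(s_next)
    critic_loss = (critic(s) - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # 3. advantage estimate A(s, a) = r + gamma * V(s') - V(s)
    with torch.no_grad():
        advantage = r + gamma * critic(s_next) - critic(s)

    # 4-5. actor update: gradient of log pi(a|s) weighted by the advantage
    log_prob = actor(s).log_prob(a)        # actor(s) is assumed to be a distribution
    actor_loss = -(log_prob * advantage).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```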
Architecture design
network architecture choice
- separate value and policy networks (more stable and simple)
- shared network with value and policy heads (shared features)
works best with a batch
tradeoff and balance
policy gradient \(\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-b\right)\) actor-critic \(\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t})\left(r(s_{i,t},a_{i,t})+\gamma\hat{V}_\phi^\pi(s_{i,t+1})-\hat{V}_\phi^\pi(s_{i,t})\right)\) the policy gradient is unbiased but has higher variance
the actor-critic estimator has lower variance but is biased
so can we combine these two things?
here we use the critic as a state-dependent baseline \(\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-\hat{V}_\phi^\pi(s_{i,t})\right)\)
 no bias
 lower variance
Eligibility traces & nstep returns
Critic and Monte Carlo critic \(\hat{A}^\pi_C(s_t,a_t)=r(s_t,a_t)+\gamma\hat{V}_\phi^\pi(s_{t+1})-\hat{V}_\phi^\pi(s_t)\\ \hat{A}^\pi_{MC}(s_t,a_t)=\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_{t})\)
combine these two?
n-step returns \(\hat{A}^\pi_{n}(s_t,a_t)=\sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^n\hat{V}_\phi^\pi(s_{t+n})-\hat{V}_\phi^\pi(s_{t})\) choosing $n>1$ often works better!
Generalized advantage estimation(GAE)
Do we have to choose just one n?
Cut everywhere, all at once! \(\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{n=1}^\infty w_n\hat{A}_n^\pi(s_t,a_t)\)
How to weight?
Mostly prefer cutting earlier (less variance): $w_n\propto\lambda^{n-1}$, e.g. $\lambda=0.95$
and this leads to eligibility traces \(\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\delta_{t'}\\ \delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}_\phi^\pi(s_{t'+1})-\hat{V}_\phi^\pi(s_{t'})\) In this form, every time you want to update a state you need experience from several future steps.
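A rough sketch of how these advantages can be computed recursively for one trajectory (plain Python; `values` is assumed to hold one extra entry $\hat V(s_T)$ for bootstrapping):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards: list of r_t; values: list of V(s_t) with one extra bootstrap entry V(s_T).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        gae = delta + gamma * lam * gae                          # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = gae
    return advantages
```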
4. Value-based methods
$\underset{a_t}{\arg\max}A^\pi(s_t,a_t)$ : best action from $s_t$, if we then follow $\pi$
then: \(\pi'(a_t\mid s_t)=\begin{cases}1, &\text{if } a_t=\underset{a_t}{\arg\max}A^\pi(s_t,a_t) \cr 0, &\text{otherwise}\end{cases}\)
this is at least as good as any $a_t \sim \pi(a_t\mid s_t)$
Policy iteration
 evaluate $A^\pi(s,a)$
 set $\pi \gets \pi’$
Dynamic programming
assume we know $p(s'\mid s,a)$ and that s and a are both discrete (and small)
bootstrapped update: \(V^\pi(s) \gets E_{a\sim\pi(a\mid s)}[r(s,a)+\gamma E_{s'\sim p(s'\mid s,a)}[V^\pi(s')]]\) with a deterministic policy $\pi(s)=a$, we have \(V^\pi(s) \gets r(s,\pi(s))+\gamma E_{s'\sim p(s'\mid s,\pi(s))}[V^\pi(s')]\)
\[\underset{a_t}{\arg\max}A^\pi(s_t,a_t)=\underset{a_t}{\arg\max}Q^\pi(s_t,a_t)\\ Q^\pi(s,a)=r(s,a)+\gamma E[V^\pi(s')]\]
So policy iteration becomes
 set $Q^\pi(s,a)\gets r(s,a)+\gamma E[V^\pi(s’)]$
 set $V(s)\gets \max_a Q(s,a)$
Function approximator
$\mathcal{L}=\frac{1}{2}\sum_i\parallel V_\phi (s_i)-\max_a Q(s_i,a)\parallel^2$
fitted value iteration
fitted value iteration algorithm:

1. set $y_i \gets \max_{a_i}(r(s_i,a_i)+\gamma E[V_\phi(s'_i)])$
2. set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel V_\phi (s_i)-y_i\parallel^2$

but we cannot take the max over actions if we do not know the dynamics, so we evaluate Q instead of V \(Q^\pi(s,a) \gets r(s,a)+\gamma E_{s'\sim p(s'\mid s,a)}[Q^\pi(s',\pi(s'))]\)
fitted Qiteration
1. collect dataset $\{(s_i, a_i,s'_i,r_i)\}$ using some policy
2. set $y_i \gets r(s_i,a_i) +\gamma \max_{a'_i}Q_\phi(s'_i,a'_i)$
3. set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel Q_\phi(s_i,a_i)-y_i\parallel^2$, repeat steps 2-3 K times, then go back to step 1

Q-learning is off-policy: it fits $Q(s,a)$ over all state-action pairs, the max over $a'$ in the target does not depend on which policy collected the data, and given $s$ and $a$, the reward $r(s,a)$ and the transition are independent of $\pi$.
exploration (see the sketch after this list)

1. epsilon-greedy \(\pi(a_t\mid s_t)=\begin{cases}1-\epsilon , &\text{if } a_t={\arg\max}_{a_t}Q_\phi(s_t,a_t) \cr \epsilon/(|\mathcal{A}|-1), &\text{otherwise}\end{cases}\)

2. Boltzmann exploration

\[\pi(a_t\mid s_t) \propto \exp(Q_\phi(s_t,a_t))\]
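A small NumPy sketch of both exploration rules (the epsilon-greedy version below explores uniformly over all actions, a common simplification of the formula above):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: 1-D array of Q(s, a) for each action in the current state
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))     # explore uniformly at random
    return int(np.argmax(q_values))                 # exploit the greedy action

def boltzmann(q_values, temperature=1.0):
    # pi(a|s) proportional to exp(Q(s, a) / temperature)
    logits = (q_values - np.max(q_values)) / temperature   # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))
```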
Value function learning theory
value iteration:
 set $Q(s,a) \gets r(s,a)+\gamma E[V(s’)]$
 set $V(s) \gets \max_a Q(s,a)$
In the tabular case, value iteration converges.
In the non-tabular case (with function approximation), convergence is not guaranteed.
Actor-critic also needs to estimate V, and if it uses a bootstrapped target, it has the same problem: convergence cannot be guaranteed.
5. Practical Q-learning
What's wrong with online Q-learning?
Actually, it is not true gradient descent: the target value y also depends on the Q parameters, but we do not compute the gradient through it.
Also, sequential samples are correlated rather than i.i.d.
Replay buffer (replay samples many times)
Q-learning with a replay buffer:
1. collect dataset $\{(s_i,a_i,s'_i,r_i)\}$ using some policy, add it to $\mathcal{B}$
2. sample a batch $(s_i,a_i,s'_i,r_i)$ from $\mathcal{B}$
3. $\phi \gets\phi-\alpha\sum_i\frac{d Q_\phi}{d \phi} (s_i,a_i) \left(Q_\phi(s_i,a_i) - [r(s_i,a_i) +\gamma \max_{a'_i}Q_\phi(s'_i,a'_i)]\right)$, repeat steps 2-3 K times
Target network
DQN (target network + replay buffer):
1. save target network parameters: $\phi' \gets \phi$
2. collect dataset $\{(s_i,a_i,s'_i,r_i)\}$ using some policy, add it to $\mathcal{B}$
3. sample a batch $(s_i,a_i,s'_i,r_i)$ from $\mathcal{B}$
4. $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel Q_\phi(s_i,a_i) - [r(s_i,a_i) +\gamma \max_{a'_i}Q_{\phi'}(s'_i,a'_i)]\parallel^2$, repeat steps 3-4 K times, repeat steps 2-4 N times, then go back to step 1
Alternative target network
Polyak averaging: soft updates to avoid sudden target network changes:
update $\phi'$: $\phi' \gets \tau \phi' + (1-\tau)\phi$, e.g. $\tau =0.999$ (see the sketch below)
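A rough PyTorch-style sketch of one DQN update with a replay buffer and Polyak-averaged target network; it assumes the buffer holds `(s, a, r, s_next)` tuples of tensors with integer (int64) actions, and all names are illustrative:

```python
import random
import torch

def dqn_update(q_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99, tau=0.999):
    # sample a minibatch of transitions (s, a, r, s') from the replay buffer
    s, a, r, s_next = map(torch.stack, zip(*random.sample(buffer, batch_size)))

    with torch.no_grad():
        # target uses the slowly-updated target network
        y = r + gamma * target_net(s_next).max(dim=1).values

    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)   # Q_phi(s_i, a_i)
    loss = torch.nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Polyak averaging of the target network parameters
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.data.mul_(tau).add_((1 - tau) * p.data)
```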
Double Qlearning
Are the Qvalues accurate?
The learned Q-values are often much larger than the true values, because the max operation picks up the positive noise in the Q estimates and makes the Q function overestimate.
Target value $y_j=r_j +\gamma \max_{a'_j}Q_{\phi'}(s'_j,a'_j)$ \(\max_{a'}Q_{\phi'}(s',a') = Q_{\phi'}(s',\arg \max_{a'}Q_{\phi'}(s',a'))\) The value comes from $Q_{\phi'}$, and the action is also selected according to $Q_{\phi'}$.
How to address this?
Double Qlearning
idea: don’t use the same network to choose the action and evaluate value! (decorrelate the noise)
use two networks: \(Q_{\phi_A}\gets r +\gamma Q_{\phi_B}(s',\arg \max_{a'}Q_{\phi_A}(s',a')) \\ Q_{\phi_B}\gets r +\gamma Q_{\phi_A}(s',\arg \max_{a'}Q_{\phi_B}(s',a'))\) the value of both two networks come from the other network!
Double Qlearning in practice
Just use the current and target networks as $\phi_A$ and $\phi_B$: use the current network to choose the action, and the target network to evaluate the Q-value.
standard Qlearning: $y=r+\gamma Q_{\phi’}(s’, \arg \max_{a’}Q_{\phi’}(s’,a’))$
double Qlearning: $y=r+\gamma Q_{\phi’}(s’, \arg \max_{a’}Q_{\phi}(s’,a’))$
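A minimal sketch of the double Q-learning target under the same assumptions as the DQN sketch above:

```python
import torch

def double_q_target(r, s_next, q_net, target_net, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)   # action chosen by the current network
        # value of that action evaluated by the target network
        return r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)
```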
Multistep returns
\[y_{j,t}=\sum_{t'=t}^{t+N-1}\gamma^{t'-t}r_{j,t'}+\gamma ^N \max_{a_{j,t+N}} Q_{\phi'}(s_{j,t+N},a_{j, t+N})\]In Q-learning this is only actually correct when learning on-policy, because the summed rewards come from transitions generated by a different policy.
How to fix?
- ignore the problem: often works well when N is small
- cut the trace: dynamically choose N to include only on-policy data
  - works well when the data is mostly on-policy and the action space is small
- importance sampling; see the paper "Safe and efficient off-policy reinforcement learning", Munos et al. 2016
Qlearning with continuous actions
How to do argmax in continuous actions space?

optimization
- gradient-based optimization (e.g., SGD): a bit slow in the inner loop
- the action space is typically low-dimensional, so what about stochastic optimization?

a simple solution: sample from a discrete set of actions
$\max_a Q(s,a)\approx \max\{Q(s,a_1),\dots,Q(s,a_N)\}$
more accurate solution:
 crossentropy method (CEM)
 simple iterative stochastic optimization
 CMAES

use a function class that is easy to optimize \(Q_{\phi}(s,a) = -\frac{1}{2}(a-\mu_\phi(s))^TP_{\phi}(s)(a-\mu_\phi(s))+V_\phi(s)\) NAF: Normalized Advantage Functions
Use a neural network to output $\mu,P,V$.
Then \(\arg \max_aQ_\phi(s,a) =\mu_\phi(s)\; \; \max_aQ(s,a)=V_\phi(s)\) but this loses some representational power

learn an approximate maximizer
DDPG \(\max_aQ_\phi(s,a)=Q_\phi(s,\arg\max_a Q_\phi(s,a))\) idea: train another network $\mu_\theta(s)$ such that $\mu_\theta(s)\approx \arg \max_aQ_\phi(s,a)$
how to train? solve $\theta \gets \arg \max_\theta Q_\phi(s,\mu_\theta(s))$ \(\frac{dQ_\phi}{d\theta}=\frac{da}{d\theta}\frac{dQ_\phi}{da}\) DDPG:
1. take some action $a_i$ and observe $(s_i,a_i,s'_i,r_i)$, add it to $\mathcal{B}$
2. sample a minibatch $\{s_j,a_j,s'_j,r_j\}$ from $\mathcal{B}$ uniformly
3. compute $y_j=r_j+\gamma Q_{\phi'}(s'_j,\mu_{\theta'}(s'_j))$ using target nets $Q_{\phi'}$ and $\mu_{\theta'}$
4. $\phi \gets \phi - \alpha\sum_j\frac{dQ_\phi}{d\phi}(s_j,a_j)(Q_\phi(s_j,a_j)-y_j)$
5. $\theta \gets \theta + \beta\sum_j\frac{d\mu}{d\theta}(s_j)\frac{dQ_\phi}{da}(s_j,a)$
6. update $\phi'$ and $\theta'$ (e.g., Polyak averaging)
Tips for Qlearning

Bellman error gradients can be big; clip gradients or use the Huber loss instead of the squared error \(L(x)=\begin{cases}x^2/2 , &\text{if} \;|x|\le\delta \cr \delta|x|-\delta^2/2, &\text{otherwise}\end{cases}\)

Double Qlearning helps a lot in practice, simple and no downsides

Nstep returns also help a lot, but have some downsides

Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too

Run multiple random seeds, it’s very inconsistent between runs
6. Advanced Policy Gradients
Basics
Recap
Recap: policy gradient
REINFORCE algorithm

1. sample $\{\tau^i\}$ from $\pi_\theta(a_t\mid s_t)$ (run the policy)
2. $\Delta_\theta J(\theta)\approx\sum_{i}\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t^i\mid s_t^i) \left(\sum_{t'=t}^T r(s_{t'}^i,a_{t'}^i)\right)\right)$
3. $\theta \gets\theta+\alpha\Delta_\theta J(\theta)$
Why does policy gradient work?
policy gradient as policy iteration
$J(\theta)=E_{\tau \sim p_\theta(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]$ \(\begin{align} J(\theta')-J(\theta)&=J(\theta')-E_{s_0 \sim p(s_0)}[V^{\pi_\theta}(s_0)]\\ &=J(\theta')-E_{\tau \sim p_{\theta'}(\tau)}[V^{\pi_\theta}(s_0)]\\ &=J(\theta')-E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tV^{\pi_\theta}(s_t)-\sum_{t=1}^\infty\gamma^tV^{\pi_\theta}(s_t)\right]\\ &=J(\theta')+E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\ &=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]+E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\ &=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(r(s_t,a_t)+\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\ &=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right] \end{align}\) so we have proved: \(J(\theta')-J(\theta)=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\)
The goal: making it off-policy
But we want to sample from $\pi_\theta$, not $\pi_{\theta'}$, so we apply importance sampling: \(\begin{align} E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]&=\sum_{t=0}^{\infty}E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta'}(a_t\mid s_t)}\left[\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ &=\sum_{t=0}^{\infty}E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right] \end{align}\) but the states are still sampled from $p_{\theta'}(s_t)$; can we approximate this with $p_\theta(s_t)$, so that we can use $\hat{A}^{\pi_\theta}(s_t,a_t)$ to obtain the improved policy $\pi'$?
Bounding the objective value
Here we can prove that:
$\pi_{\theta'}$ is close to $\pi_\theta$ if $|\pi_{\theta'}(a_t\mid s_t)-\pi_\theta(a_t\mid s_t)|\le\epsilon$ for all $s_t$
$\Rightarrow |p_{\theta'}(s_t)-p_\theta(s_t)|\le 2\epsilon t$
For the proof, refer to the lecture video or the TRPO paper.
It is then easy to show that: \(\begin{align} E_{p_{\theta'}}[f(s_t)]=\sum_{s_t}p_{\theta'}(s_t)f(s_t)&\ge\sum_{s_t}p_\theta(s_t)f(s_t)-|p_{\theta'}(s_t)-p_\theta(s_t)|\max_{s_t}f(s_t)\\ &\ge\sum_{s_t}p_\theta(s_t)f(s_t)-2\epsilon t\max_{s_t}f(s_t) \end{align}\) so \(\sum_t E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \ge\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\sum_t 2\epsilon t C\) where C is $O(Tr_{max})$ or $O(\frac{r_{max}}{1-\gamma})$
So after all this proof, what do we get? \(\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \text{such that}\:\:|\pi_{\theta'}(a_t\mid s_t)-\pi_\theta(a_t\mid s_t)|\le\epsilon\) For small enough $\epsilon$, this is guaranteed to improve $J(\theta')-J(\theta)$.
A more convenient bound uses the KL divergence: $|\pi_{\theta'}(a_t\mid s_t)-\pi_\theta(a_t\mid s_t)|\le \sqrt{\frac{1}{2}D_{KL}(\pi_{\theta'}(a_t\mid s_t)\|\pi_\theta(a_t\mid s_t))}$
$\Rightarrow D_{KL}(\pi_{\theta'}(a_t\mid s_t)\|\pi_\theta(a_t\mid s_t))$ also bounds the state-marginal difference, where \(D_{KL}(p_1(x)\|p_2(x))=E_{x\sim p_1(x)}\left[ \log \frac{p_1(x)}{p_2(x)}\right]\) Why use $D_{KL}$ rather than the total-variation bound $\epsilon$?
The KL divergence has some very convenient properties that make it much easier to approximate!
So the optimization becomes: \(\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\\text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t\mid s_t)\|\pi_\theta(a_t\mid s_t))\le\epsilon\)
Solving the constrained optimization problem
How do we enforce the constraint?
Using dual gradient descent, we set the Lagrangian as \(\mathcal{L}(\theta',\lambda)=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\lambda(D_{KL}(\pi_{\theta'}(a_t\mid s_t)\|\pi_\theta(a_t\mid s_t))-\epsilon)\)
1. Maximize $\mathcal{L}(\theta', \lambda)$ with respect to $\theta'$
2. $\lambda \gets \lambda + \alpha(D_{KL}(\pi_{\theta'}(a_t\mid s_t)\|\pi_\theta(a_t\mid s_t))-\epsilon)$
How else do we optimize the object?
define: \(\begin{align} \bar{A}(\theta')&=\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \bar{A}(\theta)&=\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\left[\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right] \end{align}\) applying a first-order Taylor expansion and optimizing \(\theta' \gets \arg \max_{\theta'}\Delta_\theta\bar A(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t\mid s_t)\|\pi_\theta(a_t\mid s_t))\le\epsilon\) and \(\begin{align} \Delta_{\theta'}\bar A(\theta')&=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\gamma^t \Delta_{\theta'}\log{\pi_{\theta'}(a_t\mid s_t)}A^{\pi_\theta}(s_t,a_t)\right]\right]\\ \Delta_{\theta}\bar A(\theta)&=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\left[\gamma^t \Delta_{\theta}\log{\pi_{\theta}(a_t\mid s_t)}A^{\pi_\theta}(s_t,a_t)\right]\right]=\Delta_\theta J(\theta) \end{align}\) so the optimization becomes \(\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t\mid s_t)\|\pi_\theta(a_t\mid s_t))\le\epsilon\) while gradient ascent instead does this: \(\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:\|\theta-\theta'\|^2\le\epsilon\) by updating like $\theta'=\theta+\sqrt{\frac{\epsilon}{\|\Delta_\theta J(\theta)\|^2}}\Delta_\theta J(\theta)$; this is what gradient ascent (plain policy gradient) is actually doing.
But this (the gradient-ascent constraint) is not a good constraint, since some parameters change the action probabilities much more than others, and what we really want is for the probability distributions to stay close.
Applying a second-order Taylor expansion to $D_{KL}$ \(D_{KL}(\pi_{\theta'}\|\pi_\theta)\approx\frac{1}{2}(\theta'-\theta)^T\pmb{F}(\theta'-\theta)\) where $\pmb{F}$ is the Fisher information matrix, which can be estimated from samples \(\pmb{F}=E_{\pi_\theta}[\Delta_{\theta}\log\pi_\theta(a\mid s)\Delta_\theta\log \pi_\theta(a\mid s)^T]\) And if we use the following update \(\theta'=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)\\ \alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}^{-1}\Delta_\theta J(\theta)}}\) the constraint will be satisfied; this is called the natural gradient.
Practical methods and notes
- natural policy gradient $\theta'=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)$
  - generally a good choice to stabilize policy gradient training
  - see this paper for details: Peters, Schaal. Reinforcement learning of motor skills with policy gradients.
  - practical implementation: requires efficient Fisher-vector products, a bit non-trivial to do without computing the full matrix; see: Schulman et al. Trust region policy optimization
- trust region policy optimization (TRPO): $\alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}^{-1}\Delta_\theta J(\theta)}}$
 Just use the IS (important sampling) objective directly (use $\bar{A}$ as object)
 Use regularization to stay close to old policy
 See: proximal policy optimization (PPO)
So TRPO and PPO are two practical methods for solving this constrained optimization in the neural network setting.
7. Optimal Control and Planning
Recap: the reinforcement learning objective \(p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t\mid s_t)p(s_{t+1}\mid s_t,a_t)\\ \theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\) In model-free RL, we do not know $p(s_{t+1}\mid s_t,a_t)$.
But in fact we sometimes do know the dynamics:
 Often we do know the dynamics
 Often we can learn the dynamics
If we know the dynamics, what can we do?
Modelbased reinforcement learning

Modelbased reinforcement learning: learn the transition dynamics, then figure out how to choose actions

How can we make decisions if we know the dynamics?
a. How can we choose actions under perfect knowledge of the system dynamics?
b. Optimal control, trajectory optimization, planning

How can we learn unknown dynamics?

How can we then also learn policies? (e.g., by imitating optimal control)
The objective
\[\min_{a_1,...,a_T}\sum_{t=1}^Tc(s_t,a_t)\;\text{s.t.}\;s_t=f(s_{t-1},a_{t-1})\]Deterministic case
\[a_1,...,a_T=\arg\max_{a_1,...,a_T}\sum_{t=1}^Tr(s_t,a_t)\;\text{s.t.}\;s_t=f(s_{t-1},a_{t-1})\]Stochastic open-loop case
\[p_\theta(s_1,...,s_T\mid a_1,...,a_T)=p(s_1)\prod_{t=1}^Tp(s_{t+1}\mid s_t,a_t)\\ a_1,...,a_T=\arg\max_{a_1,...,a_T}E\left[\sum_{t=1}^Tr(s_t,a_t)\mid a_1,...,a_T\right]\]open-loop: choose $a_1,\dots,a_T$ all at once, not step by step; closed-loop: at every step the agent gets feedback from the environment
Stochastic closedloop case
\[p_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi(a_t\mid s_t)p(s_{t+1}\mid s_t,a_t)\\ \pi=\underset{\pi}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\]Stochastic optimization
optimal control/planning: \(a_1,...,a_T=\arg\max_{a_1,...,a_T}J(a_1,...,a_T)\\ A=\arg\max_AJ(A)\)
Crossentropy method (CEM)
Here $A$ denotes the whole action sequence $a_1,\dots,a_T$ (a sketch follows after the steps)
1. sample $A_1,\dots,A_N$ from $p(A)$
2. evaluate $J(A_1),\dots,J(A_N)$
3. pick the M elites $A_{i_1},\dots,A_{i_M}$ with the highest values, where $M<N$
4. refit $p(A)$ to the elites $A_{i_1},\dots,A_{i_M}$, and repeat
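A rough NumPy sketch of CEM with a diagonal Gaussian over action sequences, assuming one-dimensional actions and a hypothetical `evaluate_J(A)` that returns the planning objective:

```python
import numpy as np

def cem_plan(evaluate_J, horizon, n_samples=100, n_elites=10, n_iters=5):
    # p(A) is a per-timestep Gaussian over a 1-D action sequence of length `horizon`
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        samples = np.random.randn(n_samples, horizon) * std + mean   # 1. sample A_1..A_N from p(A)
        scores = np.array([evaluate_J(a) for a in samples])          # 2. evaluate J(A_i)
        elites = samples[np.argsort(scores)[-n_elites:]]             # 3. pick the M best sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6   # 4. refit p(A) to the elites
    return mean
```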
Monte Carlo Tree Search (MCTS)
Generic MCTS sketch
 find a leaf $s_l$ using TreePolicy($s_1$)
 evaluate the leaf using DefaultPolicy($s_l$)
- update all values in the tree between $s_1$ and $s_l$
take best action from $s_1$ and repeat
every node stores Q and N, Q is the estimated value and N is the visited number
UCT TreePolicy($s_t$)
if $s_t$ is not fully expanded, choose a new $a_t$
else choose the child with the best Score($s_{t+1}$) \(Score(s_t) = \frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2\ln N(s_{t-1})}{N(s_t)}}\) For more about MCTS, see Browne et al., A Survey of Monte Carlo Tree Search Methods (2012)
Optimal control
Here we show the optimization process when we know the environment dynamics; this is essentially standard material from control theory.
Deterministic case \(\min_{u_1,...,u_T}\sum_{t=1}^Tc(x_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\ \min_{u_1,...,u_T}c(x_1,u_1)+c(f(x_1,u_1),u_2)+...+c(f(f(...)...),u_T)\)
Shooting methods vs collocation
the CEM approach above is actually a random shooting method.
collocation method: optimize over both actions and states, with constraints \(\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(x_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\)
Linear case: LQR
\[\min_{u_1,...,u_T}c(x_1,u_1)+c(f(x_1,u_1),u_2)+...+c(f(f(...)...),u_T)\]Linear case: the dynamics $f$ is a linear function and the cost is quadratic \(f(x_t,u_t)=F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t\\ c(x_t,u_t)=\frac{1}{2}\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^TC_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^Tc_t\) Where \(C_T=\begin{bmatrix} C_{x_T,x_T} & C_{x_T,u_T} \\ C_{u_T,x_T} & C_{u_T,u_T} \end{bmatrix}\\ c_T=\begin{bmatrix} c_{x_T}\\ c_{u_T} \end{bmatrix}\) Base case: solve for $u_T$ only \(\begin{align} Q(x_T,u_T)&= \text{const}+\frac{1}{2}\begin{bmatrix} x_{T} \\ u_{T} \\ \end{bmatrix}^TC_T\begin{bmatrix} x_{T} \\ u_{T} \\ \end{bmatrix}+\begin{bmatrix} x_{T} \\ u_{T} \\ \end{bmatrix}^Tc_T\\ \Delta_{u_T}Q(x_T,u_T)&=C_{u_T,x_T}x_T+C_{u_T,u_T}u_T+c_{u_T}^T=0\\ u_T&=-C_{u_T,u_T}^{-1}(C_{u_T,x_T}x_T+c_{u_T})\\ u_T&=K_Tx_T+k_T\\ K_T&=-C_{u_T,u_T}^{-1}C_{u_T,x_T}\\ k_T&=-C_{u_T,u_T}^{-1}c_{u_T} \end{align}\) We substitute $u_T=K_Tx_T+k_T$ to eliminate $u_T$ \(\begin{align} V(x_T)&= \text{const}+\frac{1}{2}\begin{bmatrix} x_{T} \\ K_Tx_T+k_T \\ \end{bmatrix}^TC_T\begin{bmatrix} x_{T} \\ K_Tx_T+k_T \\ \end{bmatrix}+\begin{bmatrix} x_{T} \\ K_Tx_T+k_T \\ \end{bmatrix}^Tc_T\\ V(x_T)&=\text{const}+\frac{1}{2}x_T^TV_Tx_T+x_T^Tv_T \end{align}\) Then solve for $u_{T-1}$ in terms of $x_{T-1}$ \(\begin{align} f(x_{T-1},u_{T-1})&=x_T=F_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \\ \end{bmatrix}+f_{T-1}\\ Q(x_{T-1},u_{T-1})&=\frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \\ \end{bmatrix}^TC_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \\ \end{bmatrix}+\begin{bmatrix} x_{T-1} \\ u_{T-1} \\ \end{bmatrix}^Tc_{T-1}+V(f(x_{T-1},u_{T-1}))\\ V(f(x_{T-1},u_{T-1}))&=\text{const}+\frac{1}{2}x_T^TV_Tx_T+x_T^Tv_T\\ &\text{and then replace $x_T$ with the dynamics $f(x_{T-1},u_{T-1})$} \end{align}\) and then do the same thing as in the $T$ case, which yields a similar result.
backward recursion
for $t=T$ to 1: \(\begin{align} Q_t&=C_t+F_t^TV_{t+1}F_t\\ q_t&=c_t+F_t^TV_{t+1}f_t+F_t^Tv_{t+1}\\ Q(x_t,u_t)&=\text{const}+\frac{1}{2}\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^TQ_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^Tq_t\\ u_t &\gets \arg\min_{u_t}Q(x_t,u_t)=K_tx_t+k_t\\ K_t&=-Q_{u_t,u_t}^{-1}Q_{u_t,x_t}\\ k_t&=-Q_{u_t,u_t}^{-1}q_{u_t}\\ V_t&=Q_{x_t,x_t}+Q_{x_t,u_t}K_t+K_t^TQ_{u_t,x_t}+K_t^TQ_{u_t,u_t}K_t\\ v_t&=q_{x_t}+Q_{x_t,u_t}k_t+K_t^Tq_{u_t}+K_t^TQ_{u_t,u_t}k_t\\ V(x_t)&=\text{const}+\frac{1}{2}x_t^T V_tx_t+x_t^Tv_t\\ V(x_t)&=\min_{u_t} Q(x_t,u_t) \end{align}\) Forward recursion
For $t=1$ to $T$: \(u_t=K_tx_t+k_t\\ x_{t+1}=f(x_t,u_t)\)
Stochastic dynamics
If the transition probability is Gaussian with a linear mean and fixed covariance, the same algorithm applies, thanks to the symmetry of the Gaussian. \(f(x_t,u_t)=F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t\\ x_{t+1}\sim p(x_{t+1}\mid x_t,u_t)\\ p(x_{t+1}\mid x_t,u_t)=\mathcal{N}\left(F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t, \Sigma_t\right)\)
Nonlinear case: DDP/iterative LQR
approximate a nonlinear system as a linear-quadratic system using a Taylor expansion \(f(x_t,u_t)\approx f(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}f(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}\\ c(x_t,u_t)\approx c(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}+\frac{1}{2}\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}^T\Delta^2_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}\)
\[\bar{f}(\delta x_t,\delta u_t)=F_t\begin{bmatrix} \delta x_t \\ \delta u_t \\ \end{bmatrix}\\ \bar{c}(\delta x_t,\delta u_t)=\frac{1}{2}\begin{bmatrix} \delta x_{t} \\ \delta u_{t} \\ \end{bmatrix}^TC_t\begin{bmatrix} \delta x_{t} \\ \delta u_{t} \\ \end{bmatrix}+\begin{bmatrix} \delta x_{t} \\ \delta u_{t} \\ \end{bmatrix}^Tc_t\\ \delta x_t= x_t-\hat{x}_t\\ \delta u_t= u_t-\hat{u}_t\]In fact, this is just Newton's method for trajectory optimization.
for more on Newton's method for trajectory optimization, see the following papers:
- Differential Dynamic Programming (1970)
- Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization (2012)
- a practical guide for implementing nonlinear iterative LQR
- Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics (2014): probabilistic formulation and trust region alternative to deterministic line search
8. ModelBased Reinforcement Learning (learning the model)
Basic
Why learn the model?
If we knew $f(s_t,a_t)=s_{t+1}$, we could use the tools from the last lecture.
(or $p(s_{t+1}\mid s_t,a_t)$ in the stochastic case)
model-based reinforcement learning version 0.5:

1. run base policy $\pi_0(a_t\mid s_t)$ (e.g., a random policy) to collect $\mathcal{D}=\{(s,a,s')_i\}$
2. learn the dynamics model $f(s,a)$ to minimize $\sum_i\|f(s_i,a_i)-s'_i\|^2$
3. plan through $f(s,a)$ to choose actions
Does it work?
 This is how system identification works in classical robotics
 Some care should be taken to design a good base policy
 Particularly effective if we can handengineer a dynamics representation using our knowledge of physics, and fit just a few parameters
- The model is only fit to data from the base policy, but the final policy visits different states, which causes a distribution mismatch problem.
Overfitting problem
Distribution mismatch problem
Can we do better?
can we make $p_{\pi_0}(s_t)=p_{\pi_f}(s_t)$?
model-based reinforcement learning version 1.0:

1. run base policy $\pi_0(a_t\mid s_t)$ (e.g., a random policy) to collect $\mathcal{D}=\{(s,a,s')_i\}$
2. learn the dynamics model $f(s,a)$ to minimize $\sum_i\|f(s_i,a_i)-s'_i\|^2$
3. plan through $f(s,a)$ to choose actions
4. execute those actions and add the resulting data $\{(s,a,s')_j\}$ to $\mathcal{D}$; repeat steps 2-4
But the model has errors, so it may lead to some bad actions, How to address that?
MPC
model-based reinforcement learning version 1.5 (see the sketch after this list):

1. run base policy $\pi_0(a_t\mid s_t)$ (e.g., a random policy) to collect $\mathcal{D}=\{(s,a,s')_i\}$
2. learn the dynamics model $f(s,a)$ to minimize $\sum_i\|f(s_i,a_i)-s'_i\|^2$
3. plan through $f(s,a)$ to choose actions
4. execute only the first planned action, observe the resulting state $s'$ (MPC)
5. append $(s,a,s')$ to dataset $\mathcal{D}$; repeat steps 3-5, and every N steps repeat steps 2-5
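A rough sketch of this MPC loop, assuming a gym-style `env` and hypothetical helpers `collect_random_data`, `fit_model`, and `plan`:

```python
def model_based_rl_v15(env, fit_model, plan, n_outer=10, n_inner=100):
    # fit_model fits f(s, a) ~ s' on the dataset; plan returns a full action sequence
    # through the learned model. Both are hypothetical helpers.
    dataset = collect_random_data(env)          # 1. base (random) policy data, list of (s, a, s')
    for _ in range(n_outer):
        model = fit_model(dataset)              # 2. fit the dynamics model
        s = env.reset()
        for _ in range(n_inner):
            actions = plan(model, s)            # 3. plan a sequence of actions through the model
            a = actions[0]                      # 4. MPC: execute only the first planned action
            s_next, r, done, _ = env.step(a)
            dataset.append((s, a, s_next))      # 5. append the observed transition and replan
            s = s_next if not done else env.reset()
    return model
```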
Using model uncertainty
Can we do better by using model uncertainty?
How to get uncertainty?
- use the output entropy (a bad idea)
- estimate model uncertainty
  - one way is Bayesian neural networks (BNNs) (introduced later)
  - another way is to train multiple models and see whether they agree with each other (bootstrap ensembles)

How to train?
main idea: we need to generate "independent" datasets to get "independent" models.
We can do this by resampling the dataset with replacement, which gives datasets from the same distribution but with different samples.
Does this work?
It basically works.
Very crude approximation, because the number of models is usually small (<10)
Resampling with replacement is usually unnecessary, because SGD and random initialization usually makes the models sufficiently independent
For each candidate action sequence $a_1,\dots,a_H$:

1. sample $\theta\sim p(\theta\mid \mathcal{D})$
2. at each time step $t$, sample $s_{t+1}\sim p(s_{t+1}\mid s_t,a_t,\theta)$
3. calculate $R=\sum_tr(s_t,a_t)$
4. repeat steps 1 to 3 and accumulate the average reward
Modelbased RL with images (POMDP)
Modelbased RL with latent space models
What about complex observations?
 High dimensionality
 Redundancy
 Partial observability
learn an approximate posterior $q_\psi(s_t\mid o_{1:t},a_{1:t})$
other choices:
- $q_\psi(s_t,s_{t+1}\mid o_{1:t},a_{1:t})$
- $q_\psi(s_t\mid o_t)$

here we only estimate $q_\psi(s_t\mid o_t)$
assume that $q_\psi(s_t\mid o_t)$ is deterministic
the stochastic case requires variational inference (later)
Deterministic encoder \(q_\psi(s_t\mid o_t)=\delta(s_t=g_\psi(o_t))\Rightarrow s_t=g_\psi(o_t)\) and maybe the reward model also needs to be learned \(\max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i})\mid g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i}\mid g_\psi (o_{t,i}))]\\ \max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i})\mid g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i}\mid g_\psi (o_{t,i}))+\log p_\phi(r_{t,i}\mid g_\psi(o_{t,i}))]\) Model-based RL with latent space models

1. run base policy $\pi_0(a_t\mid o_t)$ (e.g., a random policy) to collect $\mathcal{D}=\{(o,a,o')_i\}$
2. learn $p_\psi(s_{t+1}\mid s_t,a_t), p_\psi(r_t\mid s_t), p(o_t\mid s_t), g_\psi(o_t)$
3. plan through the model to choose actions
4. execute the first planned action, observe the resulting observation $o'$ (MPC)
5. append $(o,a,o')$ to dataset $\mathcal{D}$; repeat steps 3-5, and every N steps repeat steps 2-5
Learn directly in observation space
directly learn $p(o_{t+1}\mid o_t,a_t)$
do image prediction
learn reward or set the goal observation
9. ModelBased RL and Policy Learning
Basic
What if we want a policy rather than just optimal control?
 Do not need to replan (faster)
 Potentially better generalization
 Closed loop control
Backpropagate directly into the policy
model-based reinforcement learning version 2.0:

1. run base policy $\pi_0(a_t\mid s_t)$ (e.g., a random policy) to collect $\mathcal{D}=\{(s,a,s')_i\}$
2. learn the dynamics model $f(s,a)$ to minimize $\sum_i\|f(s_i,a_i)-s'_i\|^2$
3. backpropagate through $f(s,a)$ into the policy to optimize $\pi_\theta(a_t\mid s_t)$
4. run $\pi_\theta (a_t\mid s_t)$, appending the visited tuples $(s,a,s')$ to $\mathcal{D}$; repeat steps 2-4
What’s the problem?
 similar parameter sensitivity problems as shooting methods
- But we no longer have a convenient second-order LQR-like method, because the policy parameters couple all the time steps, so dynamic programming no longer applies
 Similar problem to training long RNNs with BPTT
 Vanishing and exploding gradients
 Unlike LSTM, we can’t just “choose” a simple dynamics, dynamics are chosen by nature
Guided policy search
\[\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(x_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\] \[\min_{u_1,...,u_T,x_1,...,x_T,\theta}\sum_{t=1}^Tc(x_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1}), u_t=\pi_\theta(x_t)\\ \min_{u_1,...,u_T,x_1,...,x_T,\theta}\sum_{t=1}^Tc(x_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\ \:\: \text{s.t.}\:\:u_t=\pi_\theta(x_t)\]How do we deal with the constraint?
Dual gradient descent (DGD)
\[\min_xf(x)\:\:\text{s.t.}\:C(x)=0\:\:\:\:\:\: \mathcal{L}(x,\lambda)=f(x)+\lambda C(x)\\ g(\lambda)=\mathcal{L}(x^*(\lambda),\lambda)\\ x^*=\arg\min_x\mathcal{L}(x,\lambda)\\ \frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)\]
1. Find $x^*\gets \arg\min_x\mathcal{L}(x,\lambda)$
2. Compute $\frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)$
3. $\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$
A small tweak to DGD: the augmented Lagrangian \(\bar{\mathcal{L}}(x,\lambda)=f(x)+\lambda C(x)+\rho\|C(x)\|^2\)
 Find $x^*\gets \arg\min_x\bar{\mathcal{L}}(x,\lambda)$
 Compute $\frac{dg}{d\lambda}=\frac{d\bar{\mathcal{L}}}{d\lambda}(x^*,\lambda)$
 $\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$
When far from solution, quadratic term tends to improve stability
Constraining trajectory optimization with dual gradient descent \(\min_{\tau,\theta}c(\tau)\:\:\text{s.t.}\:\:u_t=\pi_\theta(x_t)\\ \bar{\mathcal{L}}(\tau,\theta,\lambda)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2\)
Guided policy search (GPS) discussion
 Find $\tau \gets \arg\min_\tau \bar{\mathcal{L}}(\tau,\theta,\lambda)$ (e.g. via iLQR or other planning methods)
 Find $\theta \gets \arg\min_\theta\bar{\mathcal{L}}(\tau, \theta, \lambda)$ (e.g. via SGD)
 $\lambda \gets \lambda+\alpha \frac{dg}{d\lambda}$ and repeat
 Can be interpreted as constrained trajectory optimization method
 Can be interpreted as imitation of optimal control expert, since step 2 is just supervised learning
 The optimal control “teacher” adapts to the learner , and avoids actions that the learner can’t mimic
General guided policy search scheme
 Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
 Optimize $\theta$ with respect to some supervised objective
 Increment or modify dual variables $\lambda$
Need to choose:
 form of $p(\tau)$ or $\tau$ (if deterministic)
 optimization method for $p(\tau)$ or $\tau$
 surrogate $\tilde{c}(x_t,u_t)$

- supervised objective for $\pi_\theta(u_t\mid x_t)$
Deterministic case
\[\min_{\tau,\theta}c(\tau)\:\:\:\text{s.t.}\:\:\:u_t=\pi_\theta(x_t)\\ \bar{\mathcal{L}}(\tau,\theta,\lambda)=\tilde{c}(\tau)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2\] \[\tilde{c}_{k+1,i}(x_t,u_t)=c(x_t,u_t)+\lambda_{k+1,i}\log \pi_\theta (u_t\mid x_t)\]
1. Optimize $\tau$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
2. Optimize $\theta$ with respect to some supervised objective
3. Increment or modify the dual variables $\lambda$; repeat steps 1-3
Learning with multiple trajectories \(\min_{\tau_1,...,\tau_N,\theta}\sum_{i=1}^{N}c(\tau_i)\:\:\:\text{s.t.}\:\:\:u_{t,i}=\pi_\theta(x_{t,i})\:\:\forall i\:\forall t\)
 Optimize each $\tau_i$ in parallel with respect to $\tilde{c}(x_t,u_t)$
 Optimize $\theta$ with respect to some supervised objective
 Increment or modify dual variables $\lambda$. repeat 1~3
Stochastic (Gaussian) GPS
\[\min_{p,\theta}E_{\tau\sim p(\tau)}[c(\tau)]\:\:\text{s.t.}\:\:p(u_t\mid x_t)=\pi_\theta(u_t\mid x_t)\\ p(u_t\mid x_t)=\mathcal{N}(K_t(x_t-\hat{x}_t)+k_t+\hat{u}_t,\Sigma_t)\]
1. Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$
2. Optimize $\theta$ with respect to some supervised objective
3. Increment or modify the dual variables $\lambda$
Here, unlike pure imitation learning that just mimics a fixed optimal-control result, the agent imitates the planned trajectories, and if it cannot imitate them well, the optimization adjusts the plans to fit the learned policy, since the policy acts as a constraint on planning.
Input Remapping Trick \(\min_{p,\theta}E_{\tau\sim p(\tau)}[c(\tau)]\:\:\text{s.t.}\:\:p(u_t\mid x_t)=\pi_\theta(u_t\mid o_t)\)
Imitation optimal control
Imitation optimal control with DAgger
1. from the current state $s_t$, run MCTS to get $a_t,a_{t+1},\dots$
2. add $(s_t,a_t)$ to dataset $\mathcal{D}$
3. execute action $a_t\sim\pi(a_t\mid s_t)$ (not the MCTS action!); repeat steps 1-3 N times
4. update the policy by training on $\mathcal{D}$
Problems of the original DAgger
- asking a human to label states visited by another policy is hard
- running the initial (poor) policy in the real world is dangerous in some applications

We address the first problem with a planning method; what about the second problem?
Imitating MPC: PLATO algorithm

1. train $\pi_\theta(u_t\mid o_t)$ from labeled data $\mathcal{D}=\{o_1,u_1,\dots,o_N,u_N\}$
2. run $\hat{\pi}(u_t\mid o_t)$ to get dataset $\mathcal{D}_\pi=\{o_1,\dots,o_M\}$
3. ask the computer (the planner) to label $\mathcal{D}_\pi$ with actions $u_t$
4. aggregate: $\mathcal{D}\gets\mathcal{D}\cup\mathcal{D}_\pi$

Simple stochastic policy: $\hat{\pi}(u_t\mid x_t)=\mathcal{N}(K_tx_t+k_t, \Sigma_{u_t})$ \(\hat{\pi}(u_t\mid x_t)=\arg\min_{\hat{\pi}}\sum_{t'=t}^TE_{\hat{\pi}}[c(x_{t'},u_{t'})]+\lambda D_{KL}(\hat{\pi}(u_t\mid x_t)\|\pi_\theta(u_t\mid o_t))\)
Here $\hat{\pi}$ is replanned by an optimal control method. For simplicity it is chosen to be Gaussian, since that is easy to plan with LQR, and the planning objective also includes the KL term, which keeps the behavior policy close to the learned policy while still moving actions away from very bad (dangerous) ones.
DAgger vs GPS
 DAgger does not require an adaptive expert
 Any expert will do, so long as states from learned policy can be labeled
 Assumes it is possible to match expert’s behavior up to bounded loss
 Not always possible (e.g. partially observed domains)
 GPS adapts the “expert” behavior
 Does not require bounded loss on initial expert (expert will change)
Why imitate?
 Relatively stable and easy to use
- Supervised learning works very well
- control/planning (usually) works very well
 The combination of two (usually) works very well
 Input remapping trick: can exploit availability of additional information at training time to learn policy from raw observations. (planning with state and learning policy with observations)
 overcomes optimization challenges of backpropagating into policy directly
Modelfree optimization with a model
 just use policy gradient(or other modelfree RL method) even though you have a model. (just treat the model as a simulator)
 Sometimes better than using the gradients!
Dyna
an online Q-learning algorithm that performs model-free RL with a model:
1. given state $s$, pick action $a$ using an exploration policy
2. observe $s'$ and $r$, to get transition $(s,a,s',r)$
3. update model $\hat{p}(s'\mid s,a)$ and $\hat{r}(s,a)$ using $(s,a,s')$
4. Q-update: $Q(s,a)\gets Q(s,a)+\alpha E_{s',r}[r+\max_{a'}Q(s',a')-Q(s,a)]$
5. repeat $K$ times:
   - sample $(s,a)\sim\mathcal{B}$ from the buffer of past states and actions
   - Q-update: $Q(s,a)\gets Q(s,a)+\alpha E_{s',r}[r+\max_{a'}Q(s',a')-Q(s,a)]$
As the model gets better, the old states are re-evaluated and the estimates become more accurate.
General “Dynastyle” modelbased RL recipe
1. given state $s$, pick action $a$ using an exploration policy
2. learn model $\hat{p}(s'\mid s,a)$ (and optionally, $\hat{r}(s,a)$)
3. repeat K times:
   - sample $s\sim\mathcal{B}$ from the buffer
   - choose action $a$ (from $\mathcal{B}$, from $\pi$, or at random)
   - simulate $s'\sim\hat{p}(s'\mid s,a)$ (and $r=\hat{r}(s,a)$)
   - train on $(s,a,s',r)$ with model-free RL
   - (optional) take N more model-based steps

This only requires short rollouts from the model (as few as one step), which accumulate little model error. A rough sketch of the inner loop follows below.
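A minimal sketch of that model-based inner loop, with a hypothetical learned `model(s, a) -> (s_next, r)` and a generic `q_update` rule standing in for the model-free learner:

```python
import random

def dyna_style_steps(buffer, model, q_update, n_model_steps=10):
    # buffer: list of real (s, a, s', r) transitions; model and q_update are hypothetical.
    for _ in range(n_model_steps):
        s, a, _, _ = random.choice(buffer)      # sample a previously seen state (reuse its action)
        s_next, r = model(s, a)                 # simulate one short step with the learned model
        q_update(s, a, s_next, r)               # train model-free RL on the imagined transition
```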
Modelbased RL algorithms summary
Methods
- Learn a model and plan (without a policy)
  - iteratively collect more data to overcome distribution mismatch
  - replan at every time step (MPC) to mitigate small model errors
- Learn a policy
  - backpropagate into the policy (e.g., PILCO): simple but potentially unstable
  - imitate optimal control in a constrained optimization framework (e.g., GPS)
  - imitate optimal control via a DAgger-like process (e.g., PLATO)
  - use a model-free algorithm with a model (Dyna, etc.)
Limitation of modelbased RL
 Need some kind of model
 Not always available
 Sometimes harder to learn than the policy
 Learning the model takes time & data
 Sometimes expressive model classes (neural nets) are not fast
 Sometimes fast model classes (linear models) are not expressive
 Some kind of additional assumptions
 Linearizability/continuity
 Ability to reset the system (for local linear models)
 Smoothness (for GPstyle global model)
 Etc.
Here are some of my understandings of modelbased RL:
First, why we need modelbased RL?
Model-free RL learns everything from experience. The state space may be very large and learning starts from scratch, which requires a lot of exploration; otherwise it may be hard to converge, or it may converge to a poor local optimum.
In model-based RL, the model is known or already learned, so the very hard exploration problem shifts to planning, which can find promising directions through optimal control methods or by search through simulation with the model. After planning, fairly promising trajectories have already been generated, and the policy only has to imitate these good trajectories, which removes a lot of random exploration.
Second, why not just use optimal control rather than learning a policy?
Actually, you can just use optimal control methods, like traditional controllers or MPC.
However, not every model admits an explicit optimal control method like LQR, since the model may be hard to handle analytically. In addition, a neural network policy may generalize better, and closed-loop control is usually preferable.
Third, what kinds of methods can I use in model-based RL?
- learn a model and just use planning, without learning a policy
- learn a policy by guided policy search
- imitate optimal control with DAgger
What kind of algorithm should I use?
ranked by required sample efficiency from low to high (and by computational efficiency from high to low):
 gradientfree methods (e.g. NES, CMA, etc)
 full online methods (e.g. A3C)
 policy gradient methods (e.g. TRPO)
 replay buffer value estimation methods (Qlearning, DDPG, NAF, SAC, etc.)
 modelbased deep RL (e.g. PETS, guided policy search)
 modelbased “shallow” RL (e.g. PILCO)
10 Variational Inference and Generative Models
Probabilistic models
Latent variable models
\[p(x)=\sum_zp(x\mid z)p(z)\\ p(y\mid x)=\sum_zp(y\mid x,z)p(z)\]Latent variable models in general
feed random Gaussian noise into a neural network to fit (almost) any distribution \(p(x\mid z)=\mathcal{N}(\mu_{nn}(z),\sigma_{nn}(z))\\ p(x)=\int p(x\mid z)p(z)dz\) where $p(z)$ is a Gaussian.
The neural network input is a sample from the Gaussian, and the output is the mean and variance of a Gaussian over x.
Latent variable models in RL: conditional latent variable models for multimodal policies
How to train latent variable models?
model to fit a distribution
the model: $p_\theta(x)$
the data: $\mathcal{D}={x_1,x_2,x_3,…,x_N}$
maximum likelihood fit: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i \log p_\theta(x_i)$
in latent variable model
the model: $p(x)=\int p(x\mid z)p(z)dz$
maximum likelihood fit: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i \log \left(\int p_\theta(x_i\mid z)p(z)dz\right)$
Estimating the log-likelihood
alternative: expected log-likelihood: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i E_{z\sim p(z\mid x_i)}[\log p_\theta(x_i,z)]$
The variational approximation
approximate $p(zx_i)$ with $q_i(z)=\mathcal{N}(\mu_i,\sigma_i)$ \(\begin{align} \log p(x_i) &= \log \int_z p(x_iz)p(z)\\ &=\log \int_z p(x_iz)p(z)\frac{q_i(z)}{q_i(z)}\\ &=\log E_{z\sim q_i(z)}\left[\frac{p(x_iz)p(z)}{q_i(z)}\right]\\ &\ge E_{z \sim q_i(z)}\left[\log \frac{p(x_iz)p(z)}{q_i(z)}\right]\\ &= E_{z \sim q_i(z)}[\log p(x_iz)+\log p(z)] E_{z \sim q_i(z)}[\log q_i(z)]\\ &= E_{z \sim q_i(z)}[\log p(x_iz)+\log p(z)]+ \mathcal{H}(q_i) \end{align}\) Jensen’s inequality: $\log E[y]\ge E[\log y]$
Entropy: $\mathcal{H}(p)=E_{x \sim p(x)}[\log p(x)]=\int_x p(x)\log p(x)dx$
KL Divergence: $D_{KL}(q  p)=E_{x \sim q(x)}\left[\log \frac{q(x)}{p(x)}\right]=E_{x \sim q(x)}[\log q(x)] E_{x \sim q(x)}[\log p(x)] =E_{x \sim q(x)}[\log p(x)] \mathcal{H}(q)$ 
further analysis \(\log p(x_i) \ge E_{z \sim q_i(z)}[\log p(x_i\mid z)+\log p(z)]+ \mathcal{H}(q_i)=\mathcal{L}_i(p,q_i)\) so what makes a good $q_i(z)$?
intuition: $q_i(z)$ should approximate $p(z  x_i)$ 
why? \(\begin{align} D_{KL}(q_i(z)\|p(z\mid x_i))&=E_{z\sim q_i(z)}\left[\log \frac{q_i(z)}{p(z\mid x_i)}\right]\\ &=E_{z \sim q_i(z)}\left[\log \frac{q_i(z)p(x_i)}{p(x_i,z)}\right]\\ &=-E_{z\sim q_i(z)}[\log p(x_i\mid z)+\log p(z)] +E_{z\sim q_i(z)}[\log q_i(z)]+E_{z \sim q_i(z)}[\log p(x_i)]\\ &=-E_{z\sim q_i(z)}[\log p(x_i\mid z)+\log p(z)] -\mathcal{H}(q_i)+\log p(x_i)\\ &=-\mathcal{L}_i(p,q_i)+\log p(x_i) \end{align}\)
\[\begin{align} \log p(x_i)&=D_{KL}(q_i(z)\|p(z\mid x_i))+\mathcal{L}_i(p,q_i)\\ 0 &\le D_{KL}(q_i(z)\|p(z\mid x_i)) \\ \log p(x_i)&\ge \mathcal{L}_i(p,q_i) \end{align}\]
So this also proves that $\mathcal{L}_i(p,q_i)$ is a lower bound, and the KL divergence is the gap of the bound: when $q_i(z)$ is close to $p(z\mid x_i)$, the bound becomes tight. Since $\log p(x_i)$ does not depend on $q_i$, minimizing the KL divergence is the same as maximizing $\mathcal{L}_i(p,q_i)$, so we adjust $q_i(z)$ to maximize $\mathcal{L}_i(p,q_i)$.
So all we need to do is: \(\theta \gets \arg \max_{\theta} \frac{1}{N}\sum_i \mathcal{L}_i(p,q_i)\)
\[\mathcal{L}_i(p,q_i)=E_{z \sim q_i(z)}[\log p_\theta(x_i\mid z)+\log p(z)]+ \mathcal{H}(q_i)\]Algorithm:
for each $x_i$ (or minibatch):
calculate $\Delta_\theta \mathcal{L}_i(p,q_i)$
sample $z \sim q_i(z)$
$\Delta_\theta \mathcal{L}(p,q_i)\approx\Delta_\theta \log p_\theta(x_i  z)$ 
$\theta \gets \theta+\alpha \Delta_\theta\mathcal{L}(p,q_i)$
update $q_i$ to maximize $\mathcal{L}_i(p,q_i)$
How to update $q_i$?
let’s say $q_i(z)=\mathcal{N}(\mu_i,\sigma_i)$
use gradients $\Delta_{\mu_i}\mathcal{L}_i(p,q_i)$ and $\Delta_{\sigma_i}\mathcal{L}_i(p,q_i)$
gradient ascent on $\mu_i,\sigma_i$
What’s the problem?
every sample has its own $\mu_i,\sigma_i$; with many samples, the total number of parameters is $\vert\theta\vert+(\vert\mu_i\vert+\vert\sigma_i\vert)\cdot N$
intuition: $q_i(z)$ should approximate $p(z  x_i)$ 
what if we learn a network $q_i(z)=q(z  x_i)\approx p(z  x_i)$ ? 
so we have two network: $p_\theta(x  z)$ and $q_\phi(z  x)$ 
Amortized variational inference
\[q_\phi(z\mid x)=\mathcal{N}(\mu_\phi(x),\sigma_\phi(x))\] \[\log p(x_i)\ge E_{z \sim q_\phi(z\mid x_i)}[\log p_\theta(x_i\mid z)+\log p(z)]+ \mathcal{H}(q_\phi(z\mid x_i))=\mathcal{L}(p_\theta(x_i\mid z),q_\phi(z\mid x_i))\]Algorithm:
for each $x_i$ (or minibatch):
calculate $\mathcal{L}(p_\theta(x_i\mid z),q_\phi(z\mid x_i))$
sample $z \sim q_\phi(z  x_i)$ 
$\Delta_\theta \mathcal{L}(p,q_i)\approx\Delta_\theta \log p_\theta(x_i  z)$ 
$\theta \gets \theta+\alpha \Delta_\theta\mathcal{L}$
$\phi \gets \phi+\alpha \Delta_\phi\mathcal{L}$
how can we get $\Delta_\phi\mathcal{L}$ ? \(\mathcal{L}_i=E_{z \sim q_\phi(z\mid x_i)}[\log p_\theta(x_i\mid z)+\log p(z)]+ \mathcal{H}(q_\phi(z\mid x_i))\\ J(\phi)= E_{z \sim q_\phi(z\mid x_i)}[r(x_i,z)]\\ \Delta_\phi J(\phi)\approx \frac{1}{M}\sum_j\Delta_\phi \log q_\phi(z_j\mid x_i)r(x_i,z_j)\) where $r(x_i,z)=\log p_\theta(x_i\mid z)+\log p(z)$. One way is to use the policy gradient trick, but it has high variance; the other way is to apply the reparameterization trick.
The reparameterization trick
\[q_\phi(z\mid x)=\mathcal{N}(\mu_\phi(x),\sigma_\phi(x))\\ z=\mu_\phi(x)+\epsilon\sigma_\phi(x)\;\;\epsilon\sim \mathcal{N}(0,1)\] \[\begin{align} J(\phi)&= E_{z \sim q_\phi(z\mid x_i)}[r(x_i,z)]\\ &=E_{\epsilon \sim \mathcal{N}(0,1)}[r(x_i,\mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))] \end{align}\]and then we can estimate $\Delta_\phi J(\phi)$:
sample $\epsilon_1,…,\epsilon_M$ from $\mathcal{N}(0,1)$ (even a single sample per data point works well!)
$\Delta_\phi J(\phi) \approx \frac{1}{M}\sum_j\Delta_\phi r(x_i,\mu_\phi(x_i)+\epsilon_j\sigma_\phi(x_i)) $
this has low variance since it uses the gradient of $r$, not just samples of $r$.
Another way to look at it… \(\begin{align} \mathcal{L}_i&=E_{z \sim q_\phi(z\mid x_i)}[\log p_\theta(x_i\mid z)+\log p(z)]+ \mathcal{H}(q_\phi(z\mid x_i))\\ &=E_{z \sim q_\phi(z\mid x_i)}[\log p_\theta(x_i\mid z)]+E_{z \sim q_\phi(z\mid x_i)}[\log p(z)]+ \mathcal{H}(q_\phi(z\mid x_i))\\ &=E_{z \sim q_\phi(z\mid x_i)}[\log p_\theta(x_i\mid z)]-D_{KL}(q_\phi(z\mid x_i)\|p(z))\\ &=E_{\epsilon \sim \mathcal{N}(0,1)}[\log p_\theta(x_i\mid \mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))]-D_{KL}(q_\phi(z\mid x_i)\|p(z))\\ &\approx \log p_\theta(x_i\mid \mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))-D_{KL}(q_\phi(z\mid x_i)\|p(z)) \end{align}\)
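Putting the amortized bound and the reparameterization trick together, a minimal PyTorch sketch might look like the following (layer sizes, names, and the unit-variance Gaussian decoder are my illustrative assumptions, not the lecture's):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal sketch of the amortized ELBO with the reparameterization trick."""
    def __init__(self, x_dim=784, z_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, z_dim)
        self.enc_logstd = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def elbo(self, x):
        h = self.enc(x)
        mu, logstd = self.enc_mu(h), self.enc_logstd(h)
        eps = torch.randn_like(mu)                  # epsilon ~ N(0, I)
        z = mu + eps * logstd.exp()                 # reparameterization: z = mu + eps * sigma
        recon = self.dec(z)
        # log p_theta(x|z) for a unit-variance Gaussian decoder (up to a constant)
        log_px_z = -0.5 * ((x - recon) ** 2).sum(dim=-1)
        # KL( N(mu, sigma) || N(0, I) ) in closed form
        kl = 0.5 * (mu ** 2 + (2 * logstd).exp() - 2 * logstd - 1).sum(dim=-1)
        return (log_px_z - kl).mean()               # maximize this (minimize its negative)
```

A training step then just minimizes `-model.elbo(x)` with any optimizer.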
Reparameterization trick vs. policy gradient

 policy gradient: $\Delta_\phi J(\phi)\approx \frac{1}{M}\sum_j\Delta_\phi \log q_\phi(z_j\mid x_i)r(x_i,z_j)$
 can handle both discrete and continuous latent variables
 but has high variance, requires multiple samples and small learning rates
 Reparameterization trick $\Delta_\phi J(\phi) \approx \frac{1}{M}\sum_j\Delta_\phi r(x_i,\mu_\phi(x_i)+\epsilon_j\sigma_\phi(x_i)) $
 only continuous latent variables
 very simple to implement
 low variance
The variational auto encoder (VAE)
Conditional models \(\mathcal{L}_i=E_{z \sim q_\phi(z\mid x_i,y_i)}[\log p_\theta(y_i\mid x_i,z)+\log p(z\mid x_i)]+ \mathcal{H}(q_\phi(z\mid x_i,y_i))\) applications of variational inference:
 using RL/control + variational inference to model human behavior
 using generative models and variational inference for exploration
this class is a little tough, here is some of my understanding:
we want to represent the distribution of an object (with a neural network) and give it the ability to capture the object's features in a multimodal way. To achieve this, a random variable $z$ is used as an input to the model. But what distribution should this random variable have? The math shows that its distribution should approximate $p(z\mid x_i)$, which is a kind of compression, with $z$ as the latent variable. This is exactly what the variational autoencoder (VAE) does. To make the network trainable, the reparameterization trick is applied.
11. Reframing Control as an Inference Problem
Get object function from policy
Human behavior is stochastic and suboptimal but overall good; how can we model this kind of behavior?
A probabilistic graphical model of decision making
\[p(\tau\mid\mathcal{O}_{1:T})\:\:\:\;\;\;p(\mathcal{O}_t\mid s_t,a_t)=\exp(r(s_t,a_t))\\ p(\tau\mid\mathcal{O}_{1:T})=\frac{p(\tau,\mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})}\propto p(\tau)\prod_t\exp(r(s_t,a_t))=p(\tau)\exp\left(\sum_tr(s_t,a_t)\right)\]$\mathcal{O}_t$ is a Boolean optimality variable indicating whether the action at step $t$ was chosen to maximize reward rather than at random.
Inference
Backward messages
\[\begin{align} \beta_t(s_t,a_t)&=p(\mathcal{O}_{t:T}\mid s_t,a_t)\\ &=\int p(\mathcal{O}_{t:T},s_{t+1}\mid s_t,a_t)ds_{t+1}\\ &=\int p(\mathcal{O}_{t+1:T}\mid s_{t+1}) p(s_{t+1}\mid s_t,a_t)p(\mathcal{O}_t\mid s_t,a_t)ds_{t+1}\\ \end{align}\] \[p(\mathcal{O}_{t+1:T}\mid s_{t+1})=\beta_{t+1}(s_{t+1})=\int p(\mathcal{O}_{t+1:T}\mid s_{t+1},a_{t+1})p(a_{t+1}\mid s_{t+1})d a_{t+1}\\ =\int \beta_{t+1}(s_{t+1},a_{t+1})p(a_{t+1}\mid s_{t+1})d a_{t+1}\]for $t=T-1$ to 1:
\[\begin{align} \beta_t(s_t,a_t)&=p(\mathcal{O}_t\mid s_t,a_t)E_{s_{t+1}\sim p(s_{t+1}\mid s_t,a_t)}[\beta_{t+1}(s_{t+1})]\\ \beta_t(s_t)&=E_{a_t\sim p(a_t\mid s_t)}[\beta_t(s_t,a_t)] \end{align}\]
let $V_t(s_t) =\log \beta_t(s_t)$
let $Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)$ \(V_t(s_t)=\log\int \exp(Q_t(s_t,a_t))da_t\\ V_t(s_t) \to \max_{a_t}Q_t(s_t,a_t)\;\text{as}\;Q_t(s_t,a_t)\;\text{gets bigger!}\\ Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\\ \text{for deterministic transitions: }Q_t(s_t,a_t)=r(s_t,a_t)+V_{t+1}(s_{t+1})\)
Policy computation
\[\begin{align*} p(a_t\mid s_t,\mathcal{O}_{1:T})&=\pi(a_t\mid s_t)\\ &=p(a_t\mid s_t,\mathcal{O}_{t:T})\\ &=\frac{p(a_t,s_t\mid\mathcal{O}_{t:T})}{p(s_t\mid\mathcal{O}_{t:T})}\\ &=\frac{p(\mathcal{O}_{t:T}\mid a_t,s_t)p(a_t,s_t)/p(\mathcal{O}_{t:T})}{p(\mathcal{O}_{t:T}\mid s_t)p(s_t)/p(\mathcal{O}_{t:T})}\\ &=\frac{p(\mathcal{O}_{t:T}\mid a_t,s_t)}{p(\mathcal{O}_{t:T}\mid s_t)}\frac{p(a_t,s_t)}{p(s_t)}\\ &=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}p(a_t\mid s_t)\\ &=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)} \end{align*}\]Policy computation with value functions
for $t=T-1$ to 1:
\[V_t=\log\int \exp(Q_t(s_t,a_t))da_t\\ Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\]\[\pi(a_ts_t)=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}\\ V_t(s_t) =\log \beta_t(s_t)\\ Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)\]
So \(\pi(a_t\mid s_t)=\exp(Q_t(s_t,a_t)-V_t(s_t))=\exp(A_t(s_t,a_t))\) with temperature: $\pi(a_t\mid s_t)=\exp(\frac{1}{\alpha}Q_t(s_t,a_t)-\frac{1}{\alpha}V_t(s_t))=\exp(\frac{1}{\alpha}A_t(s_t,a_t))$ if $\alpha$ is near zero, the max action dominates, so the policy is nearly deterministic; $\alpha$ near 1 gives a more stochastic policy.
 Natural interpretation: better actions are more probable
 Random tiebreaking
 Analogous to Boltzmann exploration
 Approaches greedy policy as temperature decreases
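To make the backward pass and the policy computation concrete, here is a minimal numpy sketch for a small tabular MDP (the array layout and names are my assumptions); note that it uses the $\log E[\exp(V_{t+1})]$ backup, whose optimism problem is discussed below.

```python
import numpy as np

def soft_value_iteration(rewards, transitions, T):
    """Tabular backward pass for control as inference (a sketch, my variable names).

    rewards: [S, A] array of r(s, a); transitions: [S, A, S] array of p(s'|s, a).
    Uses V_t(s) = log sum_a exp(Q_t(s, a)) and the "optimistic" backup
    Q_t(s, a) = r(s, a) + log E_{s'}[exp(V_{t+1}(s'))].
    """
    S, A = rewards.shape
    Q = np.zeros((T, S, A))
    V = np.zeros((T + 1, S))                              # V[T] = 0 at the horizon
    for t in range(T - 1, -1, -1):
        m = V[t + 1].max()                                # shift for numerical stability
        Q[t] = rewards + m + np.log(transitions @ np.exp(V[t + 1] - m))
        qmax = Q[t].max(axis=1, keepdims=True)            # soft max over actions
        V[t] = (qmax + np.log(np.exp(Q[t] - qmax).sum(axis=1, keepdims=True))).squeeze(1)
    policy = np.exp(Q - V[:T, :, None])                   # pi_t(a|s) = exp(Q_t - V_t)
    return Q, V, policy
```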
Forward messages
\[\begin{align*} \alpha_t(s_t)&=p(s_t\mid\mathcal{O}_{1:t-1})\\ &=\int p(s_t,s_{t-1},a_{t-1}\mid\mathcal{O}_{1:t-1})ds_{t-1}da_{t-1}\\ &=\int p(s_t\mid s_{t-1},a_{t-1},\mathcal{O}_{1:t-1})p(a_{t-1}\mid s_{t-1},\mathcal{O}_{1:t-1})p(s_{t-1}\mid\mathcal{O}_{1:t-1})ds_{t-1}da_{t-1}\\ &=\int p(s_t\mid s_{t-1},a_{t-1})p(a_{t-1}\mid s_{t-1},\mathcal{O}_{t-1})p(s_{t-1}\mid\mathcal{O}_{1:t-1})ds_{t-1}da_{t-1} \end{align*}\] \[\begin{align*} p(a_{t-1}\mid s_{t-1},\mathcal{O}_{t-1})p(s_{t-1}\mid\mathcal{O}_{1:t-1}) &=\frac{p(\mathcal{O}_{t-1}\mid s_{t-1},a_{t-1})p(a_{t-1}\mid s_{t-1})}{p(\mathcal{O}_{t-1}\mid s_{t-1})}\frac{p(\mathcal{O}_{t-1}\mid s_{t-1})p(s_{t-1}\mid\mathcal{O}_{1:t-2})}{p(\mathcal{O}_{t-1}\mid\mathcal{O}_{1:t-2})}\\ &=\frac{p(\mathcal{O}_{t-1}\mid s_{t-1},a_{t-1})p(a_{t-1}\mid s_{t-1})\,\alpha_{t-1}(s_{t-1})}{p(\mathcal{O}_{t-1}\mid\mathcal{O}_{1:t-2})} \end{align*}\]what if we want $p(s_t\mid\mathcal{O}_{1:T})$? \(\begin{align*} p(s_t\mid\mathcal{O}_{1:T})&=\frac{p(s_t,\mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})}\\ &=\frac{p(\mathcal{O}_{t:T}\mid s_t)p(s_t,\mathcal{O}_{1:t-1})}{p(\mathcal{O}_{1:T})}\\ &\propto \beta_t(s_t)p(s_t\mid\mathcal{O}_{1:t-1})p(\mathcal{O}_{1:t-1})\\ &\propto \beta_t(s_t)\alpha_t(s_t) \end{align*}\)
The optimism problem
for $t=T-1$ to 1:
\[\begin{align} \beta_t(s_t,a_t)&=p(\mathcal{O}_t\mid s_t,a_t)E_{s_{t+1}\sim p(s_{t+1}\mid s_t,a_t)}[\beta_{t+1}(s_{t+1})]\\ \beta_t(s_t)&=E_{a_t\sim p(a_t\mid s_t)}[\beta_t(s_t,a_t)] \end{align}\]
let $V_t(s_t) =\log \beta_t(s_t)$
let $Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)$ \(Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\) why did this happen?
The inference problem: $p(s_{1:T},a_{1:T}  \mathcal{O}_{1:T})$ 
marginalizing and conditioning, we get: $p(a_t  s_t,\mathcal{O}_{1:T})$ (the policy) 
“given that you obtained high reward, what was your action probability?”
marginalizing and conditioning, we get: $p(s_{t+1}\mid s_t,a_t,\mathcal{O}_{1:T})\ne p(s_{t+1}\mid s_t,a_t)$
“given that you obtained high reward, what was your transition probability?”
because we are asking about the transition probability conditioned on a good outcome, the inferred dynamics themselves become optimistic; this is the optimism problem.
Addressing the optimism problem
we actually want to ask “given that you obtained high reward, what was your action probability when the transition probability did not change?”
find another distribution $q(s_{1:T},a_{1:T})$ that is close to $p(s_{1:T},a_{1:T}\mid\mathcal{O}_{1:T})$ but has the true dynamics $p(s_{t+1}\mid s_t,a_t)$
Try variational inference!
let $\mathrm{x}=\mathcal{O}_{1:T}$ and $\mathrm{z}=(s_{1:T},a_{1:T})$, and find $q(\mathrm{z})$ to approximate $p(\mathrm{z}\mid\mathrm{x})$
Control via variational inference
\[q(s_{1:T},a_{1:T})=p(s_1)\prod_t p(s_{t+1}\mid s_t,a_t)q(a_t\mid s_t)\]The variational lower bound (last class) \(\begin{align} \log p(x)&\ge E_{z \sim q(z)}[\log p(x,z)-\log q(z)]\\ \log p(\mathcal{O}_{1:T}) &\ge E_{(s_{1:T},a_{1:T})\sim q}[\log p(s_1)+\sum_{t=1}^T\log p(s_{t+1}\mid s_t,a_t)+\sum_{t=1}^T\log p(\mathcal{O}_t\mid s_t,a_t)\\&\quad-\log p(s_1)-\sum_{t=1}^T\log p(s_{t+1}\mid s_t,a_t)-\sum_{t=1}^T\log q(a_t\mid s_t)]\\ &=E_{(s_{1:T},a_{1:T})\sim q}\left[\sum_tr(s_t,a_t)-\log q(a_t\mid s_t)\right]\\ &=\sum_t E_{(s_t,a_t)\sim q}[r(s_t,a_t)+\mathcal{H}(q(a_t\mid s_t))] \end{align}\) maximize the rewards and the entropy
Optimizing the variational lower bound \(Q_t(s_t,a_t)=r(s_t,a_t)+E[V_{t+1}(s_{t+1})]\\ V_t(s_t)=\log \int \exp(Q_t(s_t,a_t))d a_t\)
backward pass, variational version
for $t=T-1$ to 1:
\[V_t(s_t)=\log\int \exp(Q_t(s_t,a_t))da_t\\ Q_t(s_t,a_t)=r(s_t,a_t)+E[V_{t+1}(s_{t+1})]\]
Variants:
 discounted SOC: $Q_t(s_t,a_t)=r(s_t,a_t)+\gamma E[V_{t+1}(s_{t+1})]$
 explicit temperature: $V_t(s_t)=\alpha \log \int \exp(\frac{1}{\alpha}Q_t(s_t,a_t))da_t$
Soft Qlearning
soft Q-learning: $\phi \gets \phi + \alpha \Delta_\phi Q_\phi(s,a)(r(s,a)+\gamma V(s')-Q_\phi(s,a))$
target value: $V(s’)=\text{soft} \max_{a’}Q_\phi(s’,a’)=\log \int \exp (Q_\phi(s’,a’))da’$
$\pi(a \mid s)=\exp(Q_\phi(s,a)-V(s))=\exp(A(s,a))$
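As a sketch of how the soft target and the resulting policy could be computed for discrete actions (helper names are mine, not a library API):

```python
import torch

def soft_q_targets(q_next, rewards, dones, gamma=0.99, alpha=1.0):
    """Soft Q-learning target (a sketch).

    q_next: [batch, A] Q_phi(s', .) from the target network.
    V(s') = alpha * log sum_a' exp(Q(s', a') / alpha)  (soft max over actions).
    """
    v_next = alpha * torch.logsumexp(q_next / alpha, dim=-1)
    return rewards + gamma * (1.0 - dones) * v_next

def soft_policy_log_probs(q_values, alpha=1.0):
    """pi(a|s) proportional to exp(Q(s,a)/alpha); returns log-probabilities."""
    return torch.log_softmax(q_values / alpha, dim=-1)
```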
Policy gradient with soft optimality
$\pi(a \mid s)=\exp(Q_\phi(s,a)-V(s))$ optimizes $\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)]+E_{\pi(s_t)}[\mathcal{H}(\pi(a_t \mid s_t))]$
intuition: $\pi(a \mid s)\propto \exp(Q_\phi(s,a))$ when $\pi$ minimizes $D_{KL}(\pi(a \mid s)\,\|\, \frac{1}{Z}\exp(Q(s,a)))$
Soft Policy gradient vs soft Qlearning
policy gradient derivation: \(J(\theta)=\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)]+E_{\pi(s_t)}[\mathcal{H}(\pi(a\mid s_t))]\\ =\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)-\log \pi(a_t\mid s_t)]\\ \log \pi(a_t\mid s_t)=Q(s_t,a_t)-V(s_t)\)
\[\begin{align} &\Delta_\theta\left[\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)-\log \pi(a_t\mid s_t)]\right]\\ &\approx \frac{1}{N}\sum_i\sum_t\Delta_\theta \log \pi(a_t\mid s_t)\left(r(s_t,a_t)+\left(\sum_{t'=t+1}^Tr(s_{t'},a_{t'})-\log \pi(a_{t'}\mid s_{t'})\right)-\log \pi(a_t\mid s_t)-1\right)\\ &\approx \frac{1}{N}\sum_i\sum_t(\Delta_\theta Q(s_t,a_t)-\Delta_\theta V(s_t))\left(r(s_t,a_t)+Q(s_{t+1},a_{t+1})-Q(s_t,a_t)+V(s_t)\right)\\ &\approx \frac{1}{N}\sum_i\sum_t(\Delta_\theta Q(s_t,a_t)-\Delta_\theta V(s_t))\left(r(s_t,a_t)+Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right) \end{align}\]soft Q-learning: \[\frac{1}{N}\sum_i\sum_t\Delta_\theta Q(s_t,a_t)\left(r(s_t,a_t)+\text{soft}\max_{a_{t+1}}Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right)\]
Benefits of soft optimality
 Improve exploration and prevent entropy collapse
 Easier to specialize (finetune) policies for more specific tasks
 Principled approach to break ties
 Better robustness (due to wider coverage of states)
 Can reduce to hard optimality as reward magnitude increases
 Good model for modeling human behavior
12. Inverse Reinforcement Learning
Why should we worry about learning rewards
The imitation learning perspective
Standard imitation learning:
 copy the action performed by the expert
 no reasoning about outcomes of actions
Human imitation learning:
 copy the intent of the expert
 might take very different actions
The reinforcement learning perspective
sometimes reward is complicated and unclear
Inverse reinforcement learning
Infer reward functions from demonstrations
 by itself, this is an underspecified problem
 many reward functions can explain the same behavior
Inverse reinforcement learning:
given:
states $s \in \mathcal{S}$, action $a \in \mathcal{A}$
(sometimes) transitions $p(s’  s,a)$ 
samples $\{\tau_i\}$ sampled from $\pi^*(\tau)$
learn $r_\psi(s,a)$
… and then use it to learn $\pi^*(a \mid s)$
Learning the optimality variable
$p(\mathcal{O}_t  s_t,a_t,\psi)=\exp(r_\psi(s_t,a_t))$ 
$p(\tau \mid \mathcal{O}_{1:T},\psi) \propto p(\tau)\exp\left(\sum_tr_\psi(s_t,a_t)\right)$
given:
samples $\{\tau_i\}$ sampled from $\pi^*(\tau)$; maximum likelihood learning: $\max_\psi \frac{1}{N}\sum_{i=1}^N\log p(\tau_i \mid \mathcal{O}_{1:T},\psi)=\max_\psi \frac{1}{N}\sum_{i=1}^Nr_{\psi}(\tau_i)-\log Z$
the $\log Z$ term makes sure the trajectory probabilities sum to 1
The IRL partition function
\[\max_\psi \frac{1}{N}\sum_{i=1}^Nr_{\psi}(\tau_i)-\log Z\\ Z=\int p(\tau)\exp(r_\psi(\tau))d\tau\\ \Delta_\psi \mathcal{L}=\frac{1}{N}\sum_{i=1}^N\Delta_\psi r_\psi(\tau_i)-\frac{1}{Z}\int p(\tau)\exp (r_\psi(\tau))\Delta_\psi r_\psi(\tau)d\tau\\ =E_{\tau\sim \pi^*(\tau)}[\Delta_\psi r_\psi(\tau)]-E_{\tau \sim p(\tau\mid\mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]\]Estimating the expectation
$ p(s_t,a_t  \mathcal{O}_{1:T},\psi)=p(a_t  s_t,\mathcal{O}_{1:T},\psi)p(s_t  \mathcal{O}_{1:T},\psi)$ 
let $\mu_t(s_t,a_t)\propto\beta(s_t,a_t)\alpha(s_t)$ \(\begin{align} E_{\tau \sim p(\tau\mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]&= E_{\tau \sim p(\tau\mathcal{O}_{1:T},\psi)}[\Delta_\psi \sum_{t=1}^Tr_\psi(s_t,a_t)]\\ &=\sum_{t=1}^T E_{(s_t,a_t) \sim p(s_t,a_t\mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(s_t,a_t)]\\ &=\sum_{t=1}^T\vec{\mu}_t^T\Delta_\psi\vec{r}_\psi \end{align}\)
The MaxEnt IRL algorithm
 Given $\psi$, compute backward message $\beta(s_t,a_t)$
 Given $\psi$, compute forward message $\alpha(s_t)$
 Compute $\mu_t(s_t,a_t) \propto \beta(s_t,a_t)\alpha(s_t)$
 Evaluate $\Delta_\psi \mathcal{L}=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\Delta_\psi r_\psi(s_{i,t},a_{i,t})-\sum_{t=1}^T\int\int\mu_t(s_t,a_t)\Delta_\psi r_\psi(s_t,a_t)ds_tda_t$
 $\psi \gets \psi +\eta \Delta_\psi\mathcal{L}$
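For the linear-reward case $r_\psi(s,a)=\psi^Tf(s,a)$ mentioned just below, step 4 reduces to a difference of feature expectations; here is a minimal numpy sketch (array names are mine, and the expert visitation counts and $\mu_t$ are assumed to be precomputed from the forward/backward messages):

```python
import numpy as np

def maxent_irl_gradient(features, expert_counts, visitation_mu):
    """One MaxEnt IRL gradient for a linear reward r_psi(s,a) = psi^T f(s,a) (a sketch).

    features:      [S, A, K] feature vectors f(s, a)
    expert_counts: [S, A] empirical state-action visitation frequencies from the demos
    visitation_mu: [S, A] visitation mu(s, a), summed over time, under the current
                   soft-optimal policy (from the forward/backward messages)
    """
    expert_term = np.einsum('sak,sa->k', features, expert_counts)
    policy_term = np.einsum('sak,sa->k', features, visitation_mu)
    return expert_term - policy_term          # grad_psi L; then psi <- psi + eta * grad
```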
Why MaxEnt? in the case where $r_\psi(s_t,a_t)=\psi^Tf(s_t,a_t)$, we can show that it optimizes $\max_\psi\mathcal{H}(\pi^{r_\psi})$ such that $E_{\pi^{r_\psi}}[f]=E_{\pi^*}[f]$
paper: Ziebart et al. 2008: Maximum Entropy Inverse Reinforcement learning
what’s missing so far?
 MaxEnt IRL so far requires…
 Solving for (soft) optimal policy in the inner loop
 Enumerating all stateaction tuples for visitation frequency and gradient
 To apply this in practical problem settings, we need to handle…
 Large and continuous state and action spaces
 States obtained via sampling only
 Unknown dynamics
recall: \(\Delta_\psi \mathcal{L}=E_{\tau\sim \pi^*(\tau)}[\Delta_\psi r_\psi(\tau)]-E_{\tau \sim p(\tau\mid\mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]\) The first expectation is over trajectories from the expert policy; the second is over the soft-optimal policy under the current reward. A simple idea is to learn this soft-optimal policy with the methods introduced in the last class, but that is impractical: the policy would have to be trained to convergence inside the loop, which takes a lot of computation.
More efficient samplebased updates
Guided cost learning
\[\Delta_\psi \mathcal{L}\approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{M}\sum_{j=1}^M\Delta_\psi r_\psi(\tau_j)\]instead of learning $p(a_t\mid s_t, \mathcal{O}_{1:T},\psi)$ with some max-ent RL algorithm until convergence and then running this policy to sample $\{\tau_j\}$, can we just run one or a few gradient steps and then sample?
Solution 1: use importance sampling \(\Delta_\psi\mathcal{L} \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{\sum_jw_j}\sum_{j=1}^Mw_j\Delta_\psi r_\psi(\tau_j)\;\;\;\;w_j=\frac{p(\tau)\exp(r_\psi(\tau_j))}{\pi(\tau_j)}\)
\[\begin{align} w_j&=\frac{p(\tau)\exp(r_\psi(\tau_j))}{\pi(\tau_j)}\\ &=\frac{p(s_1)\prod_tp(s_{t+1}s_t,a_t)\exp(r_\psi(s_t,a_t))}{p(s_1)\prod_tp(s_{t+1}s_t,a_t)\pi(a_ts_t)}\\ &=\frac{\exp(\sum_tr_\psi(s_t,a_t))}{\prod_t\pi(a_ts_t)} \end{align}\]each policy update w.r.t. $r_\psi$ brings us closer to the target distribution!
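Because $w_j$ involves $\exp(\sum_t r_\psi)$, it is easiest to compute the self-normalized weights in log space; a small numpy sketch (my helper, assuming per-trajectory reward sums and sampler log-probabilities are available):

```python
import numpy as np

def importance_weights(traj_rewards, traj_logprobs):
    """Self-normalized importance weights w_j for guided cost learning (a sketch).

    traj_rewards[j]  = sum_t r_psi(s_t, a_t) for sampled trajectory j
    traj_logprobs[j] = sum_t log pi(a_t|s_t) under the sampler policy
    w_j is proportional to exp(sum_t r) / prod_t pi(a_t|s_t), computed in log space.
    """
    log_w = traj_rewards - traj_logprobs
    log_w -= log_w.max()                      # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()                        # normalized: weights sum to 1
```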
paper : guided cost learning algorithm. Finn et al. ICML ‘16
this is actually like a game, or a GAN: the reward update $\Delta_\psi\mathcal{L} \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{\sum_jw_j}\sum_{j=1}^Mw_j\Delta_\psi r_\psi(\tau_j)$ makes the demos more likely and the samples less likely, while the policy update $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)r_\psi(\tau_j)$ changes the policy to make its samples harder to distinguish from the demos.
Inverse RL as a Generative Adversarial Network (GAN)
In a GAN, the best discriminator is $D^*(x)=\frac{p^*(x)}{p_\theta(x)+p^*(x)}$
For IRL, optimal policy approaches $\pi_\theta(\tau)\propto p(\tau)\exp(r_\psi(\tau))$
choose this parameterization for the discriminator: \(\begin{align} D_\psi(\tau)&=\frac{p(\tau)\frac{1}{Z}\exp(r(\tau))}{p_\theta(\tau)+p(\tau)\frac{1}{Z}\exp(r(\tau))}\\ &=\frac{\frac{1}{Z}\exp(r(\tau))}{\prod_t\pi_\theta(a_t\mid s_t)+\frac{1}{Z}\exp(r(\tau))} \end{align}\) and train it as in a GAN: $\psi \gets \arg \max_\psi E_{\tau \sim p^*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]$, which gives the same result as the previous inverse RL update rule. Here we do not need importance weights: we can just optimize $Z$ with respect to the same objective as $\psi$, and the weights are absorbed into $Z$.
so the discriminator update is $\psi \gets \arg \max_\psi E_{\tau \sim p^*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]$
the generator is $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)r_\psi(\tau_j)$
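A hedged sketch of evaluating this particular discriminator parameterization in log space (names are mine; `log_num` stands for $r_\psi(\tau)-\log Z$ and `log_pi` for $\sum_t\log\pi_\theta(a_t\mid s_t)$ of the same trajectories):

```python
import torch

def log_D(log_num, log_pi):
    """log D_psi(tau) for D = exp(log_num) / (exp(log_pi) + exp(log_num)),
    computed stably in log space."""
    return log_num - torch.logsumexp(torch.stack([log_pi, log_num]), dim=0)

def discriminator_objective(log_num_demo, log_pi_demo, log_num_samp, log_pi_samp):
    """E_demo[log D] + E_samples[log(1 - D)]; maximize w.r.t. psi (and Z)."""
    log_one_minus_D = log_pi_samp - torch.logsumexp(
        torch.stack([log_pi_samp, log_num_samp]), dim=0)
    return log_D(log_num_demo, log_pi_demo).mean() + log_one_minus_D.mean()
```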
After inverse RL the reward has been learned, so when the environment changes, the learned reward representation generalizes and can be used to learn a policy in the new environment.
Regular discriminator
Can we just use a regular discriminator? \(\psi \gets \arg \max_\psi E_{\tau \sim p^*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]\) and just parametrize $D_\psi(\tau)$ as a standard binary neural net classifier
the generator becomes $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)\log D_\psi(\tau_j)$
 this is simpler to set up and optimize
 but the discriminator knows nothing at convergence
 and we do not recover a reward representation; we only get the policy $\pi_\theta$
13. Transfer and Multitask Learning
Use prior knowledge
Transfer learning: using experience from one set of tasks for faster learning and better performance on a new task. In RL, a task is an MDP.
shot: number of attempts in the target domain
0shot: just run a policy trained in the source domain
1shot: try the task once
few shot: try the task a few times
 “forward” transfer: train on one task, transfer to new task
 just try it and hope for best
 finetune on the new task
 randomize the source domain
 Multitask transfer: train on many tasks, transfer to new task
 generate highly randomized source domains
 modelbased reinforcement learning
 model distillation
 contextual policies
 modular policy networks
 Multitask metalearning: learning to learn from many tasks
 RNNbased metalearning
 Gradientbased metalearning
This lecture is a fairly high-level introduction to transfer and multitask learning. It lays out the possible directions and lists many papers for further study; see the lecture slides for the recommended reading.
14. Distributed RL
2013/2015: DQN: replay buffer
2015: GORILA
2016: A3C: one learner, multiple actors; actors compute gradients and send them to the learner
2018: IMPALA: several actors and learners; actors only act and generate data for the learners, with importance sampling (V-trace) correcting for policy lag.
2018: Ape-X/R2D2: reintroduces the replay buffer
2019: R2D3
RLlib: Abstractions for Distributed Reinforcement Learning (ICML’18)
15. Exploration
Exploration in bandit
Regret \(Reg(T)=TE[r(a^*)]-\sum_{t=1}^Tr(a_t)\)
Optimistic exploration
optimistic estimate: $a=\arg\max \hat{\mu}_a + C \sigma_a$
Intuition: try each arm until you are sure it’s not great
example: $a=\arg \max \hat{\mu}_a+\sqrt{\frac{2\ln T}{N(a)}}$, which gives regret $Reg(T)$ of $O(\log T)$
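A minimal sketch of the UCB rule above (the `pull` callback and the constant $c$ are my assumptions):

```python
import numpy as np

def ucb_bandit(pull, n_arms, T, c=2.0):
    """Minimal UCB sketch: a = argmax mu_hat_a + sqrt(c * ln t / N(a)).
    `pull(a)` is assumed to return a stochastic reward for arm a."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for t in range(1, T + 1):
        if t <= n_arms:
            a = t - 1                                # try each arm once first
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            a = int(np.argmax(means + bonus))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]       # running mean update
    return means, counts
```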
Probability matching /posterior sampling
assume $r(a_i)\sim p_{\theta_i}(r_i)$
this defines a POMDP with $s=[\theta_1,…,\theta_n]$
belief state is $\hat{p}(\theta_1,…,\theta_n)$
idea: sample $\theta_1,…,\theta_n \sim\hat{p}(\theta_1,..,\theta_n)$
pretend the model $\theta_1,..,\theta_n$ is correct
take the optimal action
update the model and repeat the process
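For Bernoulli arms with Beta priors, this posterior sampling loop is only a few lines; a sketch (my names, assuming `pull(a)` returns a reward in $\{0,1\}$):

```python
import numpy as np

def thompson_bernoulli(pull, n_arms, T):
    """Posterior sampling for Bernoulli bandits with Beta(1,1) priors (a sketch).
    Sample a model from the belief, act optimally for it, update, repeat."""
    alpha = np.ones(n_arms)                          # successes + 1
    beta = np.ones(n_arms)                           # failures + 1
    for _ in range(T):
        theta = np.random.beta(alpha, beta)          # sample theta_1..theta_n from belief
        a = int(np.argmax(theta))                    # pretend the sampled model is correct
        r = pull(a)                                  # observe reward in {0, 1}
        alpha[a] += r
        beta[a] += 1 - r
    return alpha, beta
```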
Information gain
let $\mathcal{H}(\hat{p}(z))$ be the current entropy of our $z$ estimate
let $\mathcal{H}(\hat{p}(z)  y)$ be the entropy of our $z$ estimate after observation $y$ 
Information gain is \(IG(z,y)=E_y[\mathcal{H}(\hat{p}(z))-\mathcal{H}(\hat{p}(z)\mid y)]\\ IG(z,y\mid a)=E_y[\mathcal{H}(\hat{p}(z))-\mathcal{H}(\hat{p}(z)\mid y)\mid a]\) For bandits:
$y= r_a, z=\theta_a$
$g(a)=IG(\theta_a,r_a  a)$ 
$\Delta(a)=E[r(a^*)-r(a)]$
choose $a$ according to $\arg\min_a \frac{\Delta(a)^2}{g(a)}$
Exploration in DRL
Optimistic exploration in RL
UCB: $a=\arg \max \hat{\mu}_a+\sqrt{\frac{2\ln T}{N(a)}}$
In MDPs, countbased exploration: use $N(s,a)$ or $N(s)$ to add exploration bonus
use $r^+(s,a)=r(s,a)+\mathcal{B}(N(s))$
use $r^+(s,a)$ instead of $r(s,a)$ with any modelfree algorithm
but in large or continuous state spaces we may never see exactly the same state twice, so we need a notion of similarity between states.
idea: fit a density model $p_\theta(s)$ (or $p_\theta(s,a)$)
$p_\theta(s)$ might be high even for a new $s$ if $s$ is similar to previously seen states \(P(s)=\frac{N(s)}{n}\\ P'(s)=\frac{N(s)+1}{n+1}\)
Exploring with pseudocounts
fit model $p_\theta(s)$ to all states $\mathcal{D}$ seen so far
take a step $i$ and observe $s_i$
fit new model $p_{\theta’}(s)$ to $\mathcal{D}\cup s_i$
use $p_\theta(s_i)$ and $p_{\theta’}(s_i)$ to estimate $\hat{N}(s)$
set $r^+(s,a)=r(s,a)+\mathcal{B}(N(s))$, and repeat!
How to get $\hat{N}(s)$? Use the previous two equations and solve for $\hat{N}$ and $\hat{n}$: \(\hat{N}(s_i)=\hat{n}p_\theta(s_i)\\ \hat{n}=\frac{1-p_{\theta'}(s_i)}{p_{\theta'}(s_i)-p_\theta(s_i)}\)
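A tiny sketch of the pseudo-count computation, obtained by solving the two equations above, together with one of the bonuses listed below (plain Python; the density values are assumed to come from the fitted models):

```python
def pseudo_count(p_old, p_new):
    """Pseudo-count from the density model before/after seeing s_i (a sketch).

    p_old = p_theta(s_i), p_new = p_theta'(s_i) after refitting on D + {s_i}.
    Solving p_old = N/n and p_new = (N+1)/(n+1) for N and n gives:
    """
    n_hat = (1.0 - p_new) / (p_new - p_old)   # effective dataset size n_hat
    return n_hat * p_old                      # N_hat(s_i)

def bonus(n_visits):
    """One choice of exploration bonus from the list below (MBIE-EB style)."""
    return (1.0 / n_visits) ** 0.5
```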
What kind of bonus to use? There are many choices:
UCB: $\mathcal{B}(N(s))=\sqrt{\frac{2\ln T}{N(s)}}$
MBIEEB: $\mathcal{B}(N(s))=\sqrt{\frac{1}{N(s)}}$
BEB: $\mathcal{B}(N(s))=\frac{1}{N(s)}$
What kind of model ($p_\theta(s)$) to use?

Bellemare et al.: “CTS” model: condition each pixel on its topleft neighborhood

Counting with hashes
idea: compress $s$ into a $k$bit code via $\phi (s)$, then count $N(\phi(s))$
short codes=more hash collisions
Can use VAE compression to get hash

implicit density modeling with exemplar model
explicitly compare to new state to past states
Intuition: the state is novel if it is easy to distinguish from all previous seen states by a classifier
for each observed state $s$, fit a classifier to classify that state against all past states $\mathcal{D}$, and use the classifier error to obtain a density \(p_\theta(s)=\frac{1-D_s(s)}{D_s(s)}\) In practice, just train one amortized model that takes the exemplar as input
for detail ref. Fu et al. “EX2: Exploration with exemplar models”

Heuristic estimation of counts via errors
idea: we do not need the densities; we just need something that tells us whether a state is novel!
let’s say we have some target function $f^*(s,a)$; this can be any function that defines a mapping from $(s,a)$.
given our buffer $\mathcal{D}={(s_i,a_i)}$, fit $\hat{f}_\theta(s,a)$
use $\xi(s,a)= \Vert\hat{f}_\theta(s,a)-f^*(s,a)\Vert^2$ as the bonus
what should we use for $f^*(s,a)$?
one common choice: set $f^*(s,a)=s’$
even simpler: $f^*(s,a)=f_\phi(s,a)$, where $\phi$ is a random parameter vector (random network distillation)
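As a concrete sketch of random network distillation (state-only for simplicity; all sizes are illustrative):

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random network distillation bonus (a minimal sketch).

    f_star is a fixed, randomly initialized target network; f_hat is trained to
    match it on visited states, and the prediction error is the novelty bonus."""
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.f_star = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.f_hat = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                   nn.Linear(128, feat_dim))
        for p in self.f_star.parameters():
            p.requires_grad_(False)                  # target stays random and fixed

    def bonus(self, obs):
        # exploration bonus: squared prediction error against the random target
        return ((self.f_hat(obs) - self.f_star(obs)) ** 2).mean(dim=-1)

    def loss(self, obs):
        # train f_hat on buffer states to predict the random target
        return self.bonus(obs).mean()
```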
Posterior sampling in deep RL
 sample Qfunction Q from $p(Q)$
 act according to $Q$ for one episode
 update $p(Q)$
Bootstrap
 given a dataset $\mathcal{D}$, resample with replacement N times to get $\mathcal{D}_1,…,\mathcal{D}_N$
 train each model $f_{\theta_i}$ on $\mathcal{D}_i$
 to sample from $p(\theta)$, sample $i \in[1,…,N]$ and use $f_{\theta_i}$
but training N big neural nets is expensive, so we use one network with a shared base and a separate head for each model; in practice we often do not even resample, since the different random initializations of the heads already provide diversity.
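A minimal sketch of the shared-trunk, multi-head architecture just described (sizes are illustrative):

```python
import torch
import torch.nn as nn

class BootstrappedQ(nn.Module):
    """Shared trunk with K independent Q heads (a sketch)."""
    def __init__(self, obs_dim, n_actions, n_heads=10, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_heads)]
        )

    def forward(self, obs, head_idx):
        # at the start of each episode, sample head_idx and act greedily w.r.t. it
        return self.heads[head_idx](self.trunk(obs))
```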
Reasoning about information gain
approximations:
prediction gain: $\log p_{\theta'}(s)-\log p_\theta(s)$
intuition: if density changed a lot, the state was novel
variational inference:
IG can be equivalently written as $D_{KL}(p(z  y)  p(z))$ 
learn about transitions $p_\theta(s_{t+1}  s_t,a_t): z=\theta$ 
$y = (s_t,a_t,s_{t+1})$ $D_{KL}(p(\theta  h,s_t,a_t,s_{t+1})  p(\theta  h))$ where h is the history of all prior transitions 
intuition: a transition is more informative if it causes belief over $\theta$ to change
idea: use variational inference to estimate $q(\theta  \phi)\approx p(\theta  h)$ 
given new transition $(s,a,s’)$, update $\phi$ to get $\phi’$
and use $D_{KL}(q(\theta \mid \phi')\,\|\, q(\theta \mid \phi))$ as the approximate bonus
for more details, ref Houthooft et al. “VIME”
Imitation learning vs. Reinforcement learning
Imitation learning
 requires demonstrations
 must address distributional shift
 Simple, stable supervised learning+
 Only as good as the demo
Reinforcement learning
 Require reward function
 Must address exploration
 Potential nonconvergent RL
 Can become arbitrarily good+
Can we get the best of both?
we have both demonstration and rewards
IRL already addresses distributional shift via RL, but it doesn’t use a known reward function!
Simplest combination: pretrain by imitation & finetune by RL
 collected demonstration data $(s_i,a_i)$

 initialize $\pi_\theta$ as $\arg\max_\theta \sum_i \log \pi_\theta(a_i\mid s_i)$ (behavior cloning)
 run $\pi_\theta$ to collect experience
 improve $\pi_\theta$ with any RL algorithm and repeat 3 and 4
but in step 3 the agent can be very bad due to distribution shift, so the first batch of bad data can destroy the initialization.
Offpolicy RL
we can address this by offpolicy RL. Offpolicy RL can use any data, so we can keep the demonstration in buffer.
 offpolicy policy gradient (with importance sampling)
 offpolicy Qlearning
Policy gradient with demonstrations
\[\Delta J(\theta)=\sum_{\tau \in \mathcal{D}}\left[\sum_{t=1}^T\Delta_\theta\log\pi_\theta(a_ts_t)\left(\prod_{t'=1}^t\frac{\pi_\theta(a_{t'}s_{t'})}{q(a_{t'}s_{t'})}\right)\left(\sum_{t'=t}^Tr(s_{t'},a_{t'})\right)\right]\]where the $\mathcal{D}$ includes both demo data and policy data
Problem 1: which distribution did the demonstrations come from?
option 1: use supervised behavior cloning to approximate $\pi_{demo}$
option 2: assume a Dirac delta: $\pi_{demo}(\tau)=\frac{1}{N}\delta (\tau \in \mathcal{D})$; this works best with self-normalized importance sampling: $E_{p(x)}[f(x)]\approx\frac{1}{\sum_j\frac{p(x_j)}{q(x_j)}}\sum_i\frac{p(x_i)}{q(x_i)}f(x_i)$
Problem 2: what to do if have multiple distributions
fusion distribution: $q(x)=\frac{1}{M}\sum_iq_i(x)$
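A small numpy sketch of the self-normalized importance sampling estimator used above, with the fusion distribution playing the role of $q$ (helper names are mine):

```python
import numpy as np

def self_normalized_is(f_vals, log_p, log_q):
    """Self-normalized importance sampling estimate of E_p[f] (a sketch).

    f_vals: f(x_i) for samples x_i ~ q; log_p / log_q: log densities at x_i.
    """
    log_w = log_p - log_q
    log_w -= log_w.max()                      # numerical stability
    w = np.exp(log_w)
    return np.sum(w * f_vals) / np.sum(w)     # weights normalize to 1
```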
Qlearning with demonstrations
just drop the demonstration data into the replay buffer
What’s the problem?
importance sampling: a recipe for getting stuck
Q-learning: having only good data is not enough; without bad actions in the data it is hard to fit accurate Q-values
more problems with this highly offpolicy Q learning
this is highly off-policy, so we no longer use the current policy to collect data; if the Q function makes a mistake, it is hard to correct it, and we can end up training on garbage targets.
to address this problem, \(Q(s,a) \gets r(s,a)+E_{a'\sim\pi_{new}}[Q(s',a')]\) How to pick $\pi_{new}(as)$?
option 1: stay close to $\beta$
e.g. $D_{KL}(\pi_{new}(.  s)  \beta(.  s))\le \epsilon$ 
issue 1: we don’t know $\beta$
issue 2: this is way too conservative
option 2: constrain to support of $\beta$ ref. these two papers
Kumar et al. Stabilizing Off-Policy Q-learning via Bootstrapping Error Reduction
Fujimoto et al. Off-Policy Deep Reinforcement Learning without Exploration
Imitation as an auxiliary loss function
imitation objective: $\sum_{(s,a) \in \mathcal{D}_{demo}}\log \pi_\theta(a \mid s)$
RL objective: $E_{\pi_\theta}[r(s,a)]$
hybrid objective: $E_{\pi_\theta} [r(s,a)]+\lambda \sum_{(s,a) \in \mathcal{D}_{demo}} \log \pi_\theta(a \mid s)$
Hybrid Q-learning: \(J(Q)=J_{DQ}(Q)+\lambda_1J_n(Q)+\lambda_2J_E(Q)+\lambda_3J_{L2}(Q)\\ J_E(Q)=\max_{a\in A}[Q(s,a)+l(a_E,a)]-Q(s,a_E)\) where $J_{DQ}$ is the Q-learning loss, $J_n(Q)$ is the n-step Q-learning loss, $J_E(Q)$ is the supervised large-margin loss on the expert actions, and $J_{L2}$ is the regularization loss
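A hedged PyTorch sketch of the large-margin term $J_E(Q)$ above (the margin value and helper names are my assumptions):

```python
import torch

def margin_loss(q_values, expert_actions, margin=0.8):
    """Large-margin supervised term on demo actions (a sketch, DQfD-style).

    q_values: [batch, A]; expert_actions: [batch] indices of the demo actions.
    J_E = max_a [Q(s,a) + l(a_E,a)] - Q(s,a_E), with l = margin if a != a_E else 0.
    """
    l = torch.full_like(q_values, margin)
    l.scatter_(1, expert_actions.unsqueeze(1), 0.0)   # zero margin at the expert action
    augmented_max = (q_values + l).max(dim=1).values
    q_expert = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return (augmented_max - q_expert).mean()
```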
what’s the problem?
 need to tune the weight
 The design of the objective, esp. for imitation, takes a lot of care
 Algorithm becomes problemdependent
16 Meta Reinforcement learning
This part introduced many meta-learning methods at a high level; find and read the papers later for details.
17 Information theory, challenges, open problems
Information theory
entropy \(\mathcal{H}(p(x))=-E_{x\sim p(x)}[\log p(x)]\) mutual information \(\begin{align} \mathcal{I}(x;y)&=D_{KL}(p(x,y)\|p(x)p(y))\\ &=E_{(x,y)\sim p(x,y)}\left[\log \frac{p(x,y)}{p(x)p(y)}\right]\\ &=\mathcal{H}(p(y))-\mathcal{H}(p(y\mid x)) \end{align}\) define $\pi(s)$ as the state marginal distribution of policy $\pi$
$\mathcal{H}(\pi(s))$ state marginal entropy of policy $\pi$
empowerment: $\mathcal{I}(s_{t+1};a_t)=\mathcal{H}(s_{t+1})-\mathcal{H}(s_{t+1} \mid a_t)$
Learning without a reward function by reaching goals
one way is to give goal states, perhaps using a VAE to generate the goals

Propose goal: $z_g \sim p(z)$, $x_g \sim p_\theta(x_g z_g)$ 
 Attempt to reach goal using $\pi(a\mid x,x_g)$, reaching final state $\bar{x}$
 Use data to update $\pi$

Use data to update $p_\theta(x_g z_g)$, $q_\phi(z_g x_g)$
but how do we get diverse goals?
in step 4
the standard MLE: $\theta, \phi \gets \arg \max _{\theta, \phi}E[\log p(\bar{x})]$
the weighted MLE: $\theta, \phi \gets \arg \max _{\theta, \phi}E[w(\bar{x})\log p(\bar{x})]$
where $w(\bar{x})=p_\theta(\bar{x})^\alpha$
key result: for any $\alpha \in [-1,0)$, the entropy $\mathcal{H}(p_\theta(x))$ increases!
This actually doing $\max \mathcal{H}(p(G))$
and what does RL do?
$\pi(a \mid S,G)$ is trained to reach goal $G$; as $\pi$ gets better, the final state $S$ gets closer to $G$,
which means $p(G \mid S)$ becomes more deterministic!
so we are actually doing \(\max \mathcal{H}(p(G))-\mathcal{H}(p(G\mid S))=\max \mathcal{I}(S;G)\)
Learning diverse skills
$\pi(a \mid s,z)$ where $z$ is the task (skill) index.
Intuition: different skill should visit different statespace regions
Diversity-promoting reward function \(\pi(a\mid s,z) =\arg \max_\pi \sum_z E_{s \sim \pi(s\mid z)}[r(s,z)]\) where $r(s,z)= \log p(z\mid s)$, rewarding states that are unlikely under other skills $z'\ne z$.
Here, once $z$ is sampled, all learning does is maximize the probability that the visited states identify this skill, by tuning $\pi(s \mid z)$; it turns out that different $z$ end up with different skills. Ref. Diversity Is All You Need (ICLR 2019)
In fact this is also a form of goal reaching. \(\mathcal{I}(z;s)=H(z)-H(z\mid s)\) we maximize $H(z)$ by sampling $p(z)$ uniformly, and minimize $H(z\mid s)$ with the algorithm above.
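A minimal sketch of the skill discriminator and the resulting reward (sizes and names are illustrative; DIAYN additionally subtracts $\log p(z)$, which is a constant when skills are sampled uniformly):

```python
import torch
import torch.nn as nn

class SkillDiscriminator(nn.Module):
    """Discriminator q(z|s) for diversity-driven skill learning (a sketch)."""
    def __init__(self, obs_dim, n_skills, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_skills))

    def reward(self, obs, z):
        # r(s, z) = log q(z|s): high when the visited state identifies the skill
        # (z is a LongTensor of skill indices with shape [batch])
        log_q_z = torch.log_softmax(self.net(obs), dim=-1)
        return log_q_z.gather(-1, z.unsqueeze(-1)).squeeze(-1)
```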
Unsupervised reinforcement learning for metalearning
using unsupervised reinforcement learning to propose tasks for meta-learning.
so at first, unsupervised RL is used to generate tasks; then meta-learning is applied to those tasks, using the reward functions obtained in the previous step.
Challenges in deep reinforcement learning
Core algorithm:
 Stability
 Efficiency
 Generalization
Assumptions:
 Problem formulation
 Supervision
Stability and hyperparameter tuning
 Devising stable RL algorithm is very hard
 Qlearning/value function estimation
 No guarantee of convergence
 Lots of parameters for stability: target network delay, replay buffer size, clipping, sensitivity to learning rates, etc.
 Policy gradient/likelihood ratio/REINFORCE
 Very high variance gradient estimator
 Lots of samples, complex baselines, etc.
 Parameters: batch size, learning rate, design of baseline
 Modelbased RL algorithms
 Model class and fitting method
 Optimizing policy w.r.t. model nontrivial due to backpropagation through time
 More subtle issue: policy tends to exploit the model
The challenge with hyperparameters is severe
 Algorithms with favorable improvement and convergence properties
 TRPO
 Safe reinforcement learning, Highconfidence policy improvement [Thomas ‘15’]
 Algorithms that adaptively adjust parameters
 QProp
Not great for beating benchmarks, but absolutely essential to make RL a viable tool for realworld problems.
Sample complexity
 realworld learning becomes difficult or impractical
 Precludes the use of expensive, highfidelity simulators
 Limits applicability to realworld problems
what can we do?
 Better modelbased RL algorithms
 Design faster algorithms
 Addressing Function Approximation Error in ActorCritic Algorithms[Fujimoto et al. ‘18’]
 Soft ActorCritic
 Reuse prior knowledge to accelerate reinforcement learning
 RL2 [Duan et al. '17]
 Learning to reinforcement learn [Wang et al. '17]
 MAML [Finn et al. '17]
Scaling & Generalization
 Smallscale
 Emphasizes mastery
 Evaluated on performance
 Where is the generalization?
Reinforcement learning needs to recollect data during training
Assumption problems
Single task or multitask
 Train on multiple tasks, then try to generalize or fine-tune
 policy distillation
 Actormimic
 MAML
 Unsupervised or weakly supervised learning of diverse behaviors
 Stochastic neural networks
 Reinforcement learning with deep energybased policies
Where does the supervision come from?
 find some different tasks
 learn objectives/reward form demonstration (IRL)
 Generate objectives automatically
What is the role of the reward function?
Unsupervised reinforcement learning

Interaction with the world without a reward function

Learning something about the world

Use what you learned to quickly solve new tasks
Other sources of supervision
 Demonstrations
 Language
 Human preferences
Where does the supervision signal come from?
 Yann LeCun's cake
 Unsupervised or selfsupervised learning
 Model learning (predict the future)
 Generative modeling of the world
 Lots to do even before you accomplish your goal
 Imitation & understanding other agents
 The giant value backup
18 Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)
Metalearning:
Learning to Learn with Gradients. Finn, PhD Thesis, 2018
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. ICML 2019
These algorithms only adapt to similar tasks; they cannot adapt to entirely new tasks!
If we want that, we need to make sure the meta-train task distribution is the same as the meta-test task distribution.

 Algorithms: learn something more general than a policy, e.g. from demos and trials, or other forms of supervision; see Meta-World

Task Representation: How to tell the tasks, language or goal?

Data: RoboNet