# Deep Reinforcement learning notes (UBC)

## Background

This note is the class note of UBC Deep reinforcement learning, namely CS294-112 or CS285. the lecturer is ‎Sergey Levine. The lecturer video can be find on Youtube. I wrote two notes on reinforcement learning before, one is basic RL, the other is the David Silver class note.

Different from the previous courses, this course includes deeper theoretical view, more recent methods and some advancer topics. Especially in model based RL and meta-learning. It is more suitable for the guys who are interested on robotics control and deeper understanding of reinforcement learning.

This class is a little bit hard to study, so make sure you follow the class tightly.

I’m sorry that some of maths can not view properly on the post, maybe I will figure out it that later.

## 1. Imitation learning

### The main problem of imitation: distribution drift

how to make the distribution of training dataset as same as the distribution under policy?

DAgger

#### DAgger: Dataset Aggregation

goal: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)​$ !

 how? just run $p_{\pi_\theta}(a_t o_t)$.

but need labels $a_t​$ !

1.  train $\pi_{\theta}(a_t o_t)$ from human data $\mathcal{D}={o_1,a_1,…,o_N,a_N}$;
2.  run $\pi_\theta(a_t o_t)$ to get dataset $\mathcal{D}_\pi = {o_1,…,o_M}$;
3. ask human to label $\mathcal{D}_\pi$ with actions $s_t$;
4. aggregate: $\mathcal{D}\gets \mathcal{D}\cup\mathcal{D}_\pi$

fit the model perfectly

#### why fail to fit expert?

1. Non-Markonvian behavior
• use history observations
2. Multimodal behavior
• for discrete action, it is OK since Softmax output probability over actions
• for continuous action
• output mixture of Gaussians
• latent variable models(inject noise to network input)
• autoregressive discretization

other problems of imitation learning

• human labeled data is finite
• human not good at some problems

#### reward function of imitation learning

reward function of imitation learning can be $r(s,a) = \log {p(a=\pi^*(s)|s)}$

## MDP & RL Intro

### The goal of RL

expected reward $p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]$ where $p_\theta(\tau)$ is the distribution of the sequence

### Types of RL algorithms

• value-based
• Actor-critic
• model-based RL
• for planning
• optimal control
• discrete planning
• improve policy
• something else
• dynamic programming
• simulated experience

• sample efficiency
• stability & ease of use

#### assumptions

• stochastic or deterministic
• continuous or discrete
• episodic or infinite horizon

#### sample efficiency

• off policy: able to improve the policy without generating new samples from that policy

• on policy: each time the policy is changed, even a little bit, we need to generate new samples

#### stability & ease of use

convergence is a problem

supervised learning almost always gradient descent

RL often not strictly gradient descent

%

%

#### REINFORCE algorithm

1.  sample ${\tau^i}$ from $\pi_\theta(s_t s_t)$ (run the policy);
2.  $\Delta_\theta J(\theta)\approx\sum_{i}^N\left(\sum_{t} \Delta_\theta \log \pi_\theta (a_t s_t) \right)\left(\sum_{t} r(s_t,a_t)\right)$;
3. $\theta \gets\theta+\alpha\Delta_\theta J(\theta)$.
##### Reduce variance

Causality: policy at time $t’$ cannot affect reward at time t when $t<t’$ % baseline $b=\frac{1}{N}\sum_{i=1}^{n}e(\tau)\\ \Delta_\theta J(\theta)\approx\frac{1}{N} \Delta_\theta \log \pi_\theta (\tau) [r(\tau)-b]$ prove % Here, $\tau$ means the whole episode sample by current policy.

we can proof that their has a optimal baseline to reduce variance: $b=\frac{E[g(\tau)^2e(\tau)]}{E[g(\tau)^2]}$ But in practice, we just use the expectation of reward as baseline to reduce the complexity.

#### Off-policy learning & importance sampling

what if we sample from $\bar{\pi}(\tau)$ instead?

Importance sampling %

so apply this to our objective function, we have $J(\theta)=E_{\tau\sim\bar{\pi}(\tau)}\left[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\right]$ and we have $\pi_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}=\frac{p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)p(s_{t+1}|s_t,a_t)}=\frac{\prod_{t=1}^T\pi_\theta(a_t|s_t)}{\prod_{t=1}^T \bar{\pi}(a_t|s_t)}$ so we have %

The off-policy policy gradient % we can view state and action separately, then:
% If $\frac{p_{\theta’}(s_t)}{p_{\theta}(s_t)}$ is small and bounded, then we can delete it, and this leads to TPRO method we will discuss later.

for coding, we can use “pseudo-loss” as weighted maximum likelihood with automatic differentiation: $\bar{J}(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log \pi_\theta (a_{i,t}|s_{i,t})\hat{Q}_{i,t}$

• the gradient has high variance
• this isn’t the same as supervised learning!
• gradients will be really noisy!
• consider using much larger batches

• tweaking learning rates is very hard

## 3. Actor-critic method

### Basics

recap policy gradient % where Q is a sample from trajectories, which is unbiased estimate but has high variance problem.

We can use expectation to reduce variance $\hat{Q}_i,t\approx \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]$ and we define $\hat{Q}_i,t= \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]\\ V(s_t)=E_{a_t\sim\pi(s_t|s_t)}[Q(s_t,a_t)]\\$ then $\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})(Q(s_{i,t}, a_{i,t})-V(s_{i,t}))$

The better $A^\pi(s_t,a_t)$ estimate, the lower the variance.

#### Value function fitting

and we add a little bias(one step biased sample) for convenience $Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})$ so we have $A^\pi(s_t,a_t) \approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)$ then we only need to fit $V^\pi(s)$ !

#### Policy evaluation

Monte Carlo policy evaluation (this is what policy gradient does) $V^\pi(s_t)\approx \sum_{t'=t}^Tr(s_{t'},a_{t'})$ We can try multiple samples if we can reset the environment to previous state $V^\pi(s_t)\approx \frac{1}{N}\sum_{i=0}^N\sum_{t'=t}^Tr(s_{t'},a_{t'})$ Monte Carlo evaluation with function approximation

with function approximation, only using one sample from trajectory still pretty good.

training data: ${\left(s_{i,t},\sum_{t’=t}^Tr(s_{i,t’},a_{i,t’})\right)}$

supervised regression: $\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2$

Ideal target: $y_{i,t}=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t]\approx r(s_{s_{i,t}},a_{i,t})+V^\pi(s_{i,t+1})\approx r(s_{s_{i,t}},a_{i,t})+\hat{V^\pi_\phi}(s_{i,t+1})$ Monte Carlo target: $y_{i,t}=\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})$

#### TD(bootstrapped)

training data: ${\left(s_{i,t},r(s_{s_{i,t}},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\right)}$

### Actor-critic algorithm

batch actor-critic algorithm:

1.  sample ${s_i,a_i}$ from $\pi_\theta(a s)$
2. fit $\hat{V_\phi^\pi}(s)$ to sampled reward sums
3. evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\hat{V}\phi^\pi(s’_i)-\hat{V}\phi^\pi(s_i)$
4.  $\Delta_\theta J(\theta)\approx \sum_i \Delta_\theta \log \pi_\theta (a_{i} s_{i})\hat{A}^\pi(s_i,a_i)$
5. $\theta \gets \theta+\alpha\Delta_\theta J(\theta)$

#### Aside: discount factors

what if T (episode length) is $\infty$ ?

$\hat{V}_\phi^\pi$ can get infinitely large in many cases

simple trick: better to get rewards sooner than later $V^\pi(s_{i,t})\approx r(s_{s_{i,t}},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1})\\ \gamma \in [0,1]$ actually we use discount in policy gradient as $\Delta_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)$ Online actor-critic algorithm(can apply to every single step):

1.  take action $a\sim\pi_\theta(a s)$, get $(s,a,s’,r)$
2. update $\hat{V_\phi^\pi}(s)$ using target $r+\gamma\hat{V_\phi^\pi}(s’)$
3. evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}\phi^\pi(s’)-\hat{V}\phi^\pi(s)$
4.  $\Delta_\theta J(\theta)\approx \Delta_\theta \log \pi_\theta (a s)\hat{A}^\pi(s,a)$
5. $\theta \gets \theta+\alpha\Delta_\theta J(\theta)$

#### Architecture design

network architecture choice

• value network and policy network are separate(more stable and sample)

• some of value network and policy network are shared(have shared feature)

works best with a batch

$\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-b\right)$ actor-critic $\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)\right)$ policy gradient is no bias but has higher variance

actor-critic is lower variance but not unbiased

so can we combine these two things?

here we have critics as state-dependent baselines $\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-\hat{V}_\phi^\pi(s_{i,t})\right)$

• no bias
• lower variance

Eligibility traces & n-step returns

Critic and Monte Carlo critic $\hat{A}^\pi_C(s_t,a_t)=r(s_t,a_t)+\gamma\hat{V}_\phi^\pi(s_{t+1})-\hat{V}_\phi^\pi(s_t)\\ \hat{A}^\pi_{MC}(s_t,a_t)=\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_{t})$

combine these two?

n-step returns $\hat{A}^\pi_{n}(s_t,a_t)=\sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^n\hat{V}_\phi^\pi(s_{t+n})-\hat{V}_\phi^\pi(s_{t})$ choosing $n>1$ often works better!!!

Do we have to choose just one n?

Cut everywhere all at once! $\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{n=1}^\infty w_n\hat{A}(s_t,a_t)$

How to weight?

Mostly prefer cutting earlier(less variance) $w_n\propto\lambda^{n-1}$ e.g. $\lambda=0.95$

and this leads to Eligibility traces $\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{n=1}^\infty (\gamma\lambda)^{t'-t}\delta_{t'}\\ \delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}_\phi^\pi(s_{t'+1})-\hat{V}_\phi^\pi(s_{t'})$ in this way, every time you want to update a state, you need to have n steps experience

## 4. Value based methods

$\underset{a_t}{\arg\max}A^\pi(s_t,a_t)$ : best action from $s_t$, if we then follow $\pi$

then: $% $

 this at least as good as any $a_t \sim \pi(a_t, s_t)$

### Policy iteration

1. evaluate $A^\pi(s,a)$
2. set $\pi \gets \pi’$

### Dynamic programming

 assume we know $p(s’ s,a)$ and s and a are both discrete (and small)

bootstrapped update: $V^\pi(s) \gets E_{a\sim\pi(a|s)}[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}[V^\pi(s')]]$ with deterministic policy $\pi(s)=a$, we have $V^\pi(s) \gets r(s,\pi(s))+\gamma E_{s'\sim p(s'|s,\pi(s))}[V^\pi(s')]$

So policy iteration become

1. set $Q^\pi(s,a)\gets r(s,a)+\gamma E[V^\pi(s’)]$
2. set $V(s)\gets \max_a Q(s,a)$

#### Function approximator

$\mathcal{L}=\frac{1}{2}\sum_i\parallel V_\phi (s)-\max_a Q(s,a)\parallel^2$

##### fitted value iteration

fitted value iteration algorithm:

1.  set $y_i \gets \max_{a_i}(r(s_i a_i)+\gamma E[V_\phi(s’_i)])$
2. set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel V_\phi (s)-y_i\parallel^2$$but we can not do maximum if we do not have dynamics, so we evaluate Q instead of V $Q^\pi(s,a) \gets r(s,a)+\gamma E_{s'\sim p(s'|s,a)}[Q^\pi(s',\pi(s'))]$ ##### fitted Q-iteration 1. collect dataset {(s_i, a_i,s_i’,r_i)} using some policy 2. set y_i \gets r(s_i,a_i) +\gamma \max_{a_i’}Q_\phi(s_i’,a_i’) 3. set \phi \gets \arg min_\phi \frac{1}{2}\sum_i|Q_\phi(s_i,a_i)-y_i|^2 repeat step 2,3 k times and then return step 1 Q-learning is off-policy, since it fit the Q(s,a), which estimate all state action Q, no matter the action and state from which policy, it has the maximum item. And for the r(s,a) item, given a and a, transition is independent of \pi. ##### exploration 1. epsilon-greedy $% $ 2. Boltzmann exploration #### Value function learning theory value iteration: 1. set Q(s,a) \gets r(s,a)+\gamma E[V(s’)] 2. set V(s) \gets \max_a Q(s,a) tabular case is converged. Non-tabular case is not guarantee convergence. In actor-critic, it also need to estimate the V, and if using bootstrap approach, it has the same problem that can not guarantee convergence. ## 5. Practical Q-learning What’s wrong of the on-line Q-learning? Actually, it is not gradient decent, it do not calculate the gradient of target Q in y. samples is i.i.d ### Replay buffer (replay samples many times) Q-learning with a replay buffer: 1. collect dataset {(s_i,a_i,s_i’,t_i)} using some policy, add it to \mathcal{B} 1. sample a batch (s_i,a_i,s_i’,r_i) from \mathcal{B} 2. \phi \gets\phi-\alpha\sum_i\frac{d Q_\phi}{d \phi} (s_i,a_i) \frac{1}{2}\sum_i(Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a_i’}Q_\phi(s_i’,a_i’)])^2 , do these k times ### Target network #### DQN (Target network+Replay buffer) 1. save target network parameters: \phi’ \gets \phi 1. collect dataset {(s_i,a_i,s_i’,t_i)} using some policy, add it to \mathcal{B} , do this N times 1. sample a batch (s_i,a_i,s_i’,r_i) from \mathcal{B} 2. \phi \gets \arg min_\phi \frac{1}{2}\sum_i|Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a_i’}Q_{\phi’}(s_i’,a_i’)]|^2 , do 2,3 k times #### Alternative target network Polyak averaging: soft update to avoid sudden target network update: update \phi’: \phi’ \gets \tau \phi’ + (1-\tau)\phi e.g. \tau =0.999 ### Double Q-learning #### Are the Q-values accurate? It’s often much larger than the true value since the maximum operation will always adds the noisy Q estimation to make Q function overestimate. Target value y_j=r_i +\gamma \max_{s_j’}Q_{\phi’}(s_j’,a_j’) $\max_{a'}Q_{\phi'}(s',a') = Q_{\phi'}(s',\arg \max_{a'}Q_{\phi'}(s',a'))$ value also comes from Q_{\phi’} action selected according to Q_{\phi’} How to address this? #### Double Q-learning idea: don’t use the same network to choose the action and evaluate value! (de-correlate the noise) use two networks: $Q_{\phi_A}\gets r +\gamma Q_{\phi_B}(s',\arg \max_{a'}Q_{\phi_A}(s',a')) \\ Q_{\phi_B}\gets r +\gamma Q_{\phi_A}(s',\arg \max_{a'}Q_{\phi_B}(s',a'))$ the value of both two networks come from the other network! #### Double Q-learning in practice Just use the current and target networks as \phi_A and \phi_B, use current network to choose action, and current network get Q value. standard Q-learning: y=r+\gamma Q_{\phi’}(s’, \arg \max_{a’}Q_{\phi’}(s’,a’)) double Q-learning: y=r+\gamma Q_{\phi’}(s’, \arg \max_{a’}Q_{\phi}(s’,a’)) ### Multi-step returns In Q-learning, this only actually correct when learning on-policy. Because the sum of r comes from the transitions of different policy. How to fix? • ignore the problem when N is small • cut the trace-dynamically choose N to get only on-policy data • works well when data mostly on-policy, and the action space is samll • importance sampling—ref the paper “safe and efficient off-policy reinforcement learning” Munos et al. 16 ### Q-learning with continuous actions How to do argmax in continuous actions space? 1. optimization • gradient based optimization (e.g., SGD) a bit slow in the inner loop • action space typically low-dimensional—–what about stochastic optimization? -a simple if sample from discrete cations max_a Q(s,a)\approx max{Q(s,a_1),…,Q(s,a_N)} -more accurate solution: • cross-entropy method (CEM) • simple iterative stochastic optimization • CMA-ES 2. use function class that is easy to optimize $Q_{\phi}(s,a) = -\frac{1}{2}(a-\mu_\phi(s))^TP_{\phi}(s)(a-\mu_\phi(s))+V_\phi(s)$ NAF: Normalized Advantage Functions Using the neural network to get \mu,P,V Then $\arg \max_aQ_\phi(s,a) =\mu_\phi(s)\; \; \max_aQ(s,a)=V_\phi(s)$ but this lose some representational power 3. learn an approximate maximizer DDPG $\max_aQ_\phi(s,a)=Q_\phi(s,arg\max_a Q_\phi(s,a))$ idea: train another network \mu_\theta(s) such that \mu_\theta(s)\approx arg \max_aQ_\phi(s,a) how to train? solve \theta \gets \arg \max_\theta Q_\phi(s,\mu_\theta(s)) $\frac{dQ_\phi}{d\theta}=\frac{da}{d\theta}\frac{dQ_\theta}{da}$ DDPG: 1. take some action s_i and observe (s_i,a_i,s_i’,r_i), add it to \mathcal{B} 2. sample mini-batch {s_j,a_j,s_j’,r_j} from \mathcal{B} uniformly 3. compute y_j=r_j+\gamma Q_{\phi’}(s_j’,\mu_{\theta’}(s_j’)) using target nets Q_{\phi’} and \mu_{\theta’}_j 4. \phi \gets \phi - \alpha\sum_j\frac{dQ_\phi}{d\phi}(s_j,a)(Q_\phi(s_j,a_j)-y_k) 5. \theta \gets \theta - \beta\sum_j\frac{d\mu}{d\theta}(s_j)\frac{dQ_\phi}{da}(s_j,a) 6. update \phi’ and \theta’ (e.g., Polyak averaging) ### Tips for Q-learning • Bellman error gradients can be big;clip gradients or use Huber loss instead of square error $% $ • Double Q-learning helps a lot in practice, simple and no downsides • N-step returns also help a lot, but have some downsides • Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too • Run multiple random seeds, it’s very inconsistent between runs ## 6. Advanced Policy Gradients ### Basics #### Recap Recap: policy gradient REINFORCE algorithm 1.  sample {\tau^i} from \pi_\theta(s_t s_t) (run the policy) 2.  \Delta_\theta J(\theta)\approx\sum_{i}\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t^i s_t^i) \left(\sum_{t’=t}^T r(s_{t’},a_{t’})\right)\right) 3. \theta \gets\theta+\alpha\Delta_\theta J(\theta) Why does policy gradient work? policy gradient as policy iteration J(\theta)=E_{\tau \sim p_\theta(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right] % so we proved that: $J(\theta')-J(\theta)=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]$ #### The Goal is Making things off-policy But we want to sample from \pi_\theta not \pi_{\theta’}, we apply importance sampling: % but there still has the state sample from p_{\theta’}(s_t) , and can we approximate it as p_\theta(s_t)? so that we can use \hat{A}^\pi(s_t,a_t) to get improved policy \pi’. ### Bounding the objective value Here we can prove that:  \pi_{\theta’} if close to \pi_\theta if \pi_{\theta’(a_t s_t)}-\pi_\theta(a_t s_t) \le\epsilon for all s_t  p_{\theta’}(s_t)-p_\theta(s_t) \le 2\epsilon t The prove of this refer the lecture video or the TRPO paper. It’s easy to prove that: % so $\sum_t E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \ge\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\sum_t 2\epsilon t C$ C is O(Tr_{max}) or O(\frac{r_{max}}{1-\gamma}) So after all the prove before, what we get? $\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \text{such that}\:\:|\pi_{\theta'(a_t|s_t)}-\pi_\theta(a_t|s_t)|\le\epsilon$ For small enough \epsilon, this is guaranteed to improve J(\theta’)-J(\theta)  A more convenient bound is using KL divergence: \pi_{\theta’(a_t s_t)}-\pi_\theta(a_t s_t) \le \sqrt{\frac{1}{2}D_{KL}(\pi_{\theta’}(a_t s_t)|\pi(a_t s_t))} \Rightarrow D_{KL}(\pi_{\theta’}(a_t|s_t)|\pi(a_t|s_t) bounds state marginal difference, where $D_{KL}(p_1(s)\|p_2(x))=E_{x\sim p_1(x)}\left[ \log \frac{p_1(x)}{p_2(x)}\right]$ Why not using \epsilon but the D_{KL}? KL divergence has some very convenient properties that make i much easier to approximate! So the optimization becomes: $\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\\text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))\le\epsilon$ ### Solving the constrained optimization problem How do we enforce the constraint? By using dual gradient descent, we set the object function as $\mathcal{L}(\theta',\lambda)=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\lambda(D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))-\epsilon)$ 1. Maximize \mathcal{L}(\theta’, \lambda) with respect to \theta 2.  \lambda \gets + \alpha(D_{KL}(\pi_{\theta’}(a_t s_t)|\pi(a_t s_t))-\epsilon) How else do we optimize the object? define: % applying First-order Taylor expansion and optimize $\theta' \gets \arg \max_{\theta'}\Delta_\theta\bar A(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))\le\epsilon$ and % so the optimization becomes $\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))\le\epsilon$ and gradient ascent does this: $\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:\|\theta-\theta'\|\le\epsilon$ by updating like \theta’=\theta’+\sqrt{\frac{\epsilon}{|\Delta_\theta J(\theta)|^2}}\Delta_\theta J(\theta), this is what actually gradient ascent(policy gradient) doing. But this (the gradient ascent constrain) is not a good constrain since some parameters change probabilities a lot more than others, and we want that the probability distributions are close. Applying ‘second order Taylor expansion’ to D_{KL} $D_{KL}(\pi_{\theta'}\|\pi_\theta)\approx\frac{1}{2}(\theta'-\theta)^T\pmb{F}(\theta'-\theta)$ where \pmb{F} is the ‘Fisher-information matrix’ which can estimate with with samples $\pmb{F}=E_{\pi_\theta}[\Delta_{\theta}\log\pi_\theta(a|s)\Delta_\theta\log \pi_\theta(a|s)^T]$ And if we use the following update $\theta'=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)\\ \alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}\Delta_\theta J(\theta)}}$ the constrain will satisfied. and this is called the natural gradient. ### Practical methods and notes • natural policy gradient \theta’=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta) • Generally a good choice to stabilize policy gradient training • See this paper for details: • Petters, Schaal. Reinforcement learning of motor skills with policy gradients. • Practical implementation: requires efficient Fisher-vector products, a bit non-trivial to do without computing the full matrix • See: Schulman et all. Trust region policy optimization • Trust region policy optimization (TRPO) \alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}\Delta_\theta J(\theta)}} • Just use the IS (important sampling) objective directly (use \bar{A} as object) • Use regularization to stay close to old policy • See: proximal policy optimization (PPO) So the TRPO and the PPO is two Practical methods solving the constrained optimization in neural network setting. ## 7. Optimal Control and Planning Recap: the reinforcement learning objective $p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]$ In model-free RL, we do not know p(s_{t+1}|s_t,a_t). But actually we knew the dynamics sometimes. • Often we do know the dynamics • Often we can learn the dynamics If we know the dynamics, what can we do? ### Model-based reinforcement learning 1. Model-based reinforcement learning: learn the transition dynamics, then figure out how to choose actions 2. How can we make decisions if we know the dynamics? a. How can we choose actions under perfect knowledge of the system dynamics? b. Optimal control, trajectory optimization, planning 3. How can we learning unknown dynamics? 4. How can we then also learn policies? (e,g. by imitating optimal control) ### The objective #### Deterministic case #### Stochastic open-loop case open-loop: choose a_1…a_T in one time, not step by step closed-loop: every step the agent gets a feedback from environment #### Stochastic closed-loop case ### Stochastic optimization optimal control/planning： $a_1,...,a_t=\arg\max_{a_1,...,a_t}J(a_1,...,a_t)\\ A=\arg\max_AJ(A)$ #### Cross-entropy method (CEM) Here A is a_1,…,a_t 1. sample A_1,…,A_n from p(A) 2. evaluate J(A_1),…,J(A_n) 3. pick M elites A_{i_1},…,A_{i_M} with the highest value, where M<N 4. refit p(A) to the elites A_{i_1},…,A_{i_M} #### Monte Carlo Tree Search (MCTS) Generic MCTS sketch 1. find a leaf s_l using TreePolicy(s_1) 2. evaluate the leaf using DefaultPolicy(s_l) 3. update all values in the tree between a_1 and s_l take best action from s_1 and repeat every node stores Q and N, Q is the estimated value and N is the visited number UCT TreePolicy(s_t) if s_t nit fully expanded, choose new a_t else choose child with best Score(s_{t+1})$$ Score(s_t) = \frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2\ln N(s_{t-1})}{N(s_t)}}$$For more about MCTS, ref Browne. et al. A survey of Monte Carlo Tree Search Methods. (2012)  ## Optimal control Here we shows the optimization process if we know the environment dynamics. Almost the stuffs in control theory. Deterministic case $\min_{u_1,...,u_T}\sum_{t=1}^Tc(s_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\ \min_{u_1,...,u_T}c(x_1,u_1)+c(f(x_1,u_1),u_2)+...+c(f(f(...)...),u_T)$ #### Shooting methods vs collocation the previous CEM is actually random shooting method. collocation method: optimize over actions and states, with constraints. $\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(s_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})$ #### Linear case: LQR Linear case: the case that F is linear function and cost is quadratic function $f(x_t,u_t)=F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t\\ c(x_t,u_t)=\frac{1}{2}\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^TC_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^Tc_t$ Where $% $ Base case: solve for$u_T$only % We substitute$u_T$by$x_T$to eliminate$u_T$% Then solve for$U_{T-1}$in term terms of$x_{T-1}$% and then do same thing as the T case, result in similar results. ##### backward recursion for$t=T$to 1: % Forward recursion For$t=1$to$T$: $u_t=K_tx_t+k_t\\ x_{t+1}=f(x_t,u_t)$ ##### Stochastic dynamics if the probability is Gaussian and the mean is linear and variance is fixed. Then same algorithm can be applied since symmetry of Gaussian. $f(x_t,u_t)=F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t\\ x_{t-1}\sim p(x_{t+1}|x_t,u_t)\\ p(x_{t+1}|x_t,u_t)=\mathcal{N}\left(F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t, \Sigma_t\right)$ #### Nonlinear case: DDP/iterative LQR approximate a nonlinear system as a linear-quadratic system using Taylor expansion $f(x_t,u_t)\approx f(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}f(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}\\ c(x_t,u_t)\approx c(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}+\frac{1}{2}\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}^T\Delta^2_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}$ In fact, this just Newton’s method for trajectory optimization. for more Newton’s method for trajectory optimization, ref follow papers: 1. Differential dynamic programming.(1970) 2. Synthesis and Stabilization of complex behaviors through online trajectory optimization.(2012) • practical guide for implementing non-linear iterative LQR. 3. Learning Neural Network policies with guided policy search under unknown dynamics (2014) • Probabilistic formation and trust region alternative to deterministic line search. ## 8. Model-Based Reinforcement Learning (learning the model) ### Basic Why learn the model? If we knew$f(s_t,a_t)=s_{t+1}$, we could use the tools from last course.  (or$p(s_{t+1} s_t,a_t)$in stochastic case) model-based reinforcement learning version 0.5: 1.  run base policy$\pi_0(a_t s_t)$(e.g., random policy) to collect$\mathcal{D}={(s,a,s’)_i}$2. learning dynamics model$f(s,a)$to minimize$\sum_i|f(s_i,a_i)-s_i’|^2$3. plan through$f(s,a)$to choose actions Does it work? • This is how system identification works in classical robotics • Some care should be taken to design a good base policy • Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters • The model only fit the base policy, but the final actual policy beyond that policy, that will cause distribution mismatch problem. ### Over-fitting problem #### Distribution mismatch problem Can we do better? can we make$p_{\pi_0}(s_t)=p_{\pi_f}(s_t)$? model-based reinforcement learning version 1.0: 1.  run base policy$\pi_0(a_t s_t)$(e.g., random policy) to collect$\mathcal{D}={(s,a,s’)_i}$2. learning dynamics model$f(s,a)$to minimize$\sum_i|f(s_i,a_i)-s_i’|^2$3. plan through$f(s,a)$to choose actions 4. execute those actions and add the resulting data${(s,a,s’)_j}$to$\mathcal{D}$, and repeat step 2~4 But the model has errors, so it may lead to some bad actions, How to address that? #### MPC model-based reinforcement learning version 1.5: 1.  run base policy$\pi_0(a_t s_t)$(e.g., random policy) to collect$\mathcal{D}={(s,a,s’)_i}$2. learning dynamics model$f(s,a)$to minimize$\sum_i|f(s_i,a_i)-s_i’|^2$3. plan through$f(s,a)$to choose actions 4. execute the first planned action, observe resulting state$s’$(MPC) 5. append$(s,a,s’)$to dataset$\mathcal{D}$. repeat steps 3~5, and every N steps repeat steps 2~5 #### Using model uncertainty Can we do better by using model uncertainty? How to get uncertainty? 1. use output entropy(bad idea) 2. estimate model uncertainty • one way to get this is by Bayesian neural networks (BNN) (introduce later) • another way is train multiple models, and see if they agree each other.(Bootstrap ensembles) How to train? main idea: need to generate “independent” datasets to get “independent” models. can do this by re-sampling from dataset with replacement, means you have same distribution but different ordered datasets Does this works? This basically works Very crude approximation, because the number of models is usually small (<10) Re-sampling with replacement is usually unnecessary, because SGD and random initialization usually makes the models sufficiently independent For candidate action sequence$a_1,…,a_H$: 1.  sample$\theta\sim p(\theta \mathcal{D})$2.  at each time step$t$, sample$s_{t+1}\sim p(s_{t+1} s_t,a_t,\theta)$3. calculate$R=\sum_tr(s_t,a_t)$4. repeat steps 1 to 3 and accumulate the average reward ### Model-based RL with images (POMDP) #### Model-based RL with latent space models What about complex observations? • High dimensionality • Redundancy • Partial observability  learn approximate posterior$q_\psi(s_t o_{1:t},a_{1:t})$other choices: • $q_\psi(s_t,s_{t+1} o_{1:t},a_{1:t})$• $q_\psi(s_t o_t)$ here we only estimate$q_\psi(s_t o_t)$ assume that$q_\psi(s_t o_t)$is deterministic stochastic case requires variational inference (later) Deterministic encoder $q_\psi(s_t|o_t)=\delta(s_t=g_\psi(o_t))\Rightarrow s_t=g_\psi(o_t)$ and maybe the reward also need to learn. $\max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i})|g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i}|g_\psi (o_{t,i})]\\ \max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i})|g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i}|g_\psi (o_{t,i})+log p_\phi(r_{t,i}|g_\psi(o_{t,i}))]$ Model-based RL with latent space models 1.  run base policy$\pi_0(o_t a_t)$(e.g., random policy) to collect$\mathcal{D}={(o,a,o’)_i}$2.  learning$p_\psi(s_{t+1} s_t,a_t), p_\psi(r_t s_t), p(o_t s_t), g_\psi(o_t)$3. plan through the model to choose actions 4. execute the first planned action, observe resulting state$o’$(MPC) 5. append$(o,a,o’)$to dataset$\mathcal{D}$. repeat steps 3~5, and every N steps repeat steps 2~5 #### Learn directly in observation space  directly learn$p(o_{t+1} o_t,a_t)$do image prediction learn reward or set the goal observation ## 9. Model-Based RL and Policy Learning ### Basic What if we want a policy rather than just optimal control? • Do not need to re-plan (faster) • Potentially better generalization • Closed loop control Back-propagate directly into the policy model-based reinforcement learning version 2.0: 1.  run base policy$\pi_0(a_t s_t)$(e.g., random policy) to collect$\mathcal{D}={(s,a,s’)_i}$2. learning dynamics model$f(s,a)$to minimize$\sum_i|f(s_i,a_i)-s_i’|^2$3.  back-propagate through$f(s,a)$into the policy to optimize$\pi_\theta(a_t s_t)$4.  run$\pi_\theta (a_t s_t)$, appending the visited tuples$(s,a,s’)$to$\mathcal{D}$, repeat steps 2~4 What’s the problem? • similar parameter sensitivity problems as shooting methods • But no longer have convenient second order LQR-like method, because policy parameters couple all the tie steps, so no dynamic programming • Similar problem to training long RNNs with BPTT • Vanishing and exploding gradients • Unlike LSTM, we can’t just “choose” a simple dynamics, dynamics are chosen by nature How to deal with constrain? #### Dual gradient decent (DGD) 1. Find$x^*\gets \arg\min_x\mathcal{L}(x,\lambda)$2. Compute$\frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)$3.$\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$A small tweak to DGD: augmented Lagrangian 1. Find$x^*\gets \arg\min_x\bar{\mathcal{L}}(x,\lambda)$2. Compute$\frac{dg}{d\lambda}=\frac{d\bar{\mathcal{L}}}{d\lambda}(x^*,\lambda)$3.$\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$When far from solution, quadratic term tends to improve stability Constraining trajectory optimization with dual gradient descent $\min_{\tau,\theta}c(\tau)\:\:\text{s.t.}\:\:u_t=\pi_\theta(x_t)\\ \bar{\mathcal{L}}(\tau,\theta,\lambda)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2$ #### Guided policy search (GPS) discussion 1. Find$\tau \gets \arg\min_\tau \bar{\mathcal{L}}(\tau,\theta,\lambda)$(e.g. via iLQR or other planning methods) 2. Find$\theta \gets \arg\min_\theta\bar{\mathcal{L}}(\tau, \theta, \lambda)$(e.g. via SGD) 3.$\lambda \gets \lambda+\alpha \frac{dg}{d\lambda}$and repeat • Can be interpreted as constrained trajectory optimization method • Can be interpreted as imitation of optimal control expert, since step 2 is just supervised learning • The optimal control “teacher” adapts to the learner , and avoids actions that the learner can’t mimic General guided policy search scheme 1. Optimize$p(\tau)$with respect to some surrogate$\tilde{c}(x_t,u_t)$2. Optimize$\theta$with respect to some supervised objective 3. Increment or modify dual variables$\lambda$Need to choose: • form of$p(\tau)$or$\tau$(if deterministic) • optimization method for$p(\tau)$or$\tau$• surrogate$\tilde{c}(x_t,u_t)$•  supervised objective for$\pi_\theta(u_t x_t)$##### Deterministic case 1. Optimize$\tau$with respect to some surrogate$\tilde{c}(x_t,u_t)$2. Optimize$\theta$with respect to some supervised objective 3. Increment or modify dual variables$\lambda$. repeat 1~3 Learning with multiple trajectories 1. Optimize each$\tau_i$in parallel with respect to$\tilde{c}(x_t,u_t)$2. Optimize$\theta$with respect to some supervised objective 3. Increment or modify dual variables$\lambda$. repeat 1~3 ##### Stochastic (Gaussian) GPS 1. Optimize$p(\tau)$with respect to some surrogate$\tilde{c}(x_t,u_t)$2. Optimize$\theta$with respect to some supervised objective 3. Increment or modify dual variables$\lambda$Here, a little different from pure imitation learning that mimic the optimal control result, the agent imitate the planning results, and then if it can not perform good imitation, then the optimization process will adjust its planning policy to fit the learning policy since the policy is a constrain of planning. Input Remapping Trick $\min_{p,\theta}E_{\tau\sim p(\tau)}[{c(\tau)}]\:\:\text{s.t.}\:\:p(u_t|x_t)=\pi_\theta(u_t|o_t)\$ ### Imitation optimal control #### Imitation optimal control with DAgger 1. from current state$s_t$, run MCTS to get$a_t,a_{t+1},…$2. add$(s_t,a_t)$to dataset$\mathcal{D}$3.  execute action$s_t\sim\pi(a_t s_t)$(not MCTS action!). repeat 1~3 N times 4. update the policy by training on$\mathcal{D}$Problems of the original DAgger • Ask human to label the state from other policy is hard • run the initial not good policy in real world is dangerous in some applications We address the first problem with a planning method, and how about the the second problem? #### Imitating MPC: PLATO algorithm 1.  train$\pi_\theta(u_t o_t)$from labeled data$\mathcal{D}={o_1,u_1,…,o_N,u_N}$2.  run$\hat{\pi}(u_t o_t)$to get dataset$\mathcal{D}_\pi={o_1,…,o_M}$3. Ask computer to label$\mathcal{D_\pi}$with actions$u_t$4. Aggregate:$\mathcal{D}\gets\mathcal{D}\cup\mathcal{D}_\pi$Simple stochastic policy:$\hat{\pi}(u_t|x_t)=\mathcal{N}(K_tx_t+k_t, \Sigma_{u_t})$$\hat{\pi}(u_t|x_t)=\arg\min_{\hat{\pi}}\sum_{t'=t}^TE_{\hat{\pi}}[c(x_{t'},u_{t'})]+\lambda D_{KL}(\hat{\pi}(u_t|x_t)\|\pi_\theta(u_t|o_t))$ Here the$\hat{\pi}$is re-planed by optimal control method, for simplicity, choose Gaussian policy since it is easy to plan with LQR, and the planning also add the KL constrain, which make sure the behavior policy is not far from the learning policy, but with this planning, it move actions away from some very bad (dangerous) actions. ##### DAgger vs GPS • DAgger does not require an adaptive expert • Any expert will do, so long as states from learned policy can be labeled • Assumes it is possible to match expert’s behavior up to bounded loss • Not always possible (e.g. partially observed domains) • GPS adapts the “expert” behavior • Does not require bounded loss on initial expert (expert will change) ##### Why imitate? • Relatively stable and easy to use • Supervise learning works very well • control\planning (usually) works very well • The combination of two (usually) works very well • Input remapping trick: can exploit availability of additional information at training time to learn policy from raw observations. (planning with state and learning policy with observations) • overcomes optimization challenges of back-propagating into policy directly ### Model-free optimization with a model • just use policy gradient(or other model-free RL method) even though you have a model. (just treat the model as a simulator) • Sometimes better than using the gradients! #### Dyna on-line Q-learning algorithm that performs model-free RL with a model 1. given state$s$, pick action$a$using exploration policy 2. observe$s’$and$r$, to get transition$(s,a,s’,r)$3.  update model$\hat{p}(s’ s,a)$and$\hat{r}(s,a)$using$(s,a,s’)$4. Q-update:$Q(s,a)\gets Q(s,a)+\alpha E_{s’,r}[r+\max_{a’}Q(s’,a’)-Q(s,a)]$5. repeat$K$times: 1. sample$(s,a)\sim\mathcal{B}$from buffer of past states and actions 2. Q-update:$Q(s,a)\gets Q(s,a)+\alpha E_{s’,r}[r+\max_{a’}Q(s’,a’)-Q(s,a)]$when model become better, re-evaluate the old states and make the estimation more accurate. #### General “Dyna-style” model-based RL recipe 1. given state$s$, pick action$a$using exploration policy 2.  learn model$\hat{p}(s’ s,a)$(and optionally,$\hat{r}(s,a)$) 3. repeat K times: 1. sample$s\sim\mathcal{B}$from buffer 2. choose action a (from$\mathcal{B}$, from$\pi$, or random) 3.  simulate$s’\sim\hat{p}(s’ s,a)$(and$r=\hat{r}(s,a)$) 4. train on$(s,a,s’,r)\$ with model-free RL
5. (optional) take N more model-based steps

This only requires short (as few as one step) rollouts from model, which has a little accumulated error.

### Model-based RL algorithms summary

#### Methods

• Learn model and plan (without policy)
• Iteratively more data to overcome distribution mismatch
• Re-plan every time step (MPC) to mitigate small model errors
• Learning policy
• Back-propagate into policy (e.g., PILCO)–simple but potentially unstable
• imitate optimal control in a constrained optimization framework (e.g., GPS)
• imitate optimal control via DAgger-like process (e.g., PLATO)
• Use model-free algorithm with a model(Dyna, etc.)

#### Limitation of model-based RL

• Need some kind of model
• Not always available
• Sometimes harder to learn than the policy
• Learning the model takes time & data
• Sometimes expressive model classes (neural nets) are not fast
• Sometimes fast model classes (linear models) are not expressive
• Some kind of additional assumptions
• Linearizability/continuity
• Ability to reset the system (for local linear models)
• Smoothness (for GP-style global model)
• Etc.

Here are some of my understandings of model-based RL:

First, why we need model-based RL?

the model-free RL learn everything from experience, the state space may very larger, it learning from scratch, which need a lot of exploration, otherwise, it may hard to coverage or has high chance to coverage to local optimal.

But in model-based RL, the model is known or already learned, so it shift the very hard exploration process to planning, which can find some decent directions that lead to good results by optimal control methods or just search by simulating with model. So after planning, the pretty promising trajectories is generated, and the policy only requires to learn to imitate these good trajectories, which reduce lots of random exploration.

Second, why not just use optimal control rather than learning policy?

Actually, you just use the optimal control methods, like traditional control methods, or MPC.

However, not all model can apply the explicit optimal control method like LQR, since the model is hard to solve mathematically. In addition, using neural network may have better generalization property, and close loop control seems better.

Third, What kind of method can i use in model based RL?

• learning model and just using planing, do not learn policy
• Learn policy by guided policy search
• imitating optimal control with DAgger

### What kind of algorithm should I use?

rank of the samples efficiency required (low to high) but computation efficiency (high to low)

• gradient-free methods (e.g. NES, CMA, etc)
• full on-line methods (e.g. A3C)
• policy gradient methods (e.g. TRPO)
• replay buffer value estimation methods (Q-learning, DDPG, NAF, SAC, etc.)
• model-based deep RL (e.g. PETS, guided policy search)
• model-based “shallow” RL (e.g. PILCO)

Tags:

Updated: