# Deep Reinforcement Learning Notes (UC Berkeley)

## Background

These are my notes for the UC Berkeley deep reinforcement learning course, CS294-112 (now CS285), taught by Sergey Levine. The lecture videos can be found on YouTube. I have written two notes on reinforcement learning before: one on basic RL, the other on David Silver's course.

Compared with those earlier courses, this one takes a deeper theoretical view, covers more recent methods, and includes some more advanced topics, especially model-based RL and meta-learning. It is well suited to people interested in robotic control and in a deeper understanding of reinforcement learning.

This class is fairly demanding, so make sure you follow the lectures closely.

Apologies that some of the math does not render properly in this post; I may fix it later.

Background

1. Imitation learning

3. Actor-critic method

4. Value based methods

5. Practical Q-learning

7. Optimal Control and Planning

8. Model-Based Reinforcement Learning (learning the model)

9. Model-Based RL and Policy Learning

10. Variational Inference and Generative Models

11. Re-framing Control as an Inference Problem

12. Inverse Reinforcement Learning

14. Distributed RL

15. Exploration

16. Meta Reinforcement Learning

17. Information Theory, Challenges, Open Problems

18. Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)

## 1. Imitation learning

### The main problem of imitation: distributional shift

How can we make the distribution of the training data match the distribution of observations seen under the learned policy?


#### DAgger: Dataset Aggregation

goal: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)$ !

how? just run $\pi_\theta(a_t|o_t)$

but need labels $a_t$ !

1. train $\pi_\theta(a_t|o_t)$ from human data $\mathcal{D}=\{o_1,a_1,\dots,o_N,a_N\}$
2. run $\pi_\theta(a_t|o_t)$ to get dataset $\mathcal{D}_\pi = \{o_1,\dots,o_M\}$
3. ask a human to label $\mathcal{D}_\pi$ with actions $a_t$
4. aggregate: $\mathcal{D}\gets \mathcal{D}\cup\mathcal{D}_\pi$
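The loop above can be sketched in code. This is a toy illustration, not from the lecture: `expert_policy`, `train_policy`, and the rollout dynamics are all hypothetical stand-ins (the "policy" is a tabular majority-vote classifier, and the expert is just a fixed function of the observation).

```python
import random
from collections import Counter, defaultdict

def expert_policy(obs):
    # hypothetical expert: stands in for the human labeler
    return obs % 3

def train_policy(dataset):
    # fit a simple tabular policy: most frequent expert action per observation
    counts = defaultdict(Counter)
    for obs, act in dataset:
        counts[obs][act] += 1
    return {o: c.most_common(1)[0][0] for o, c in counts.items()}

def run_policy(policy, n_steps=20, seed=0):
    # roll out the current policy; toy dynamics: next obs depends on the action
    rng = random.Random(seed)
    obs, traj = 0, []
    for _ in range(n_steps):
        traj.append(obs)
        act = policy.get(obs, 0)      # default action for unseen observations
        obs = (obs + act + rng.randint(0, 2)) % 10
    return traj

# DAgger loop, mirroring steps 1-4
dataset = [(o, expert_policy(o)) for o in [0, 1, 2]]          # initial human data
for it in range(5):
    policy = train_policy(dataset)                            # 1. train pi_theta on D
    observations = run_policy(policy, seed=it)                # 2. run pi_theta to get D_pi
    labeled = [(o, expert_policy(o)) for o in observations]   # 3. expert labels D_pi
    dataset += labeled                                        # 4. D <- D union D_pi
policy = train_policy(dataset)
```

The key point the code makes concrete: the observations in `dataset` increasingly come from the learner's own rollouts, not from the expert's trajectories.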

If we could fit the expert perfectly, there would be no distributional shift at all. But why might we fail to fit the expert?

#### Why might we fail to fit the expert?

1. Non-Markovian behavior
  • use a history of observations
2. Multimodal behavior
  • for discrete actions this is fine, since a softmax outputs a probability over actions
  • for continuous actions, options include:
    • output a mixture of Gaussians
    • latent variable models (inject noise into the network input)
    • autoregressive discretization

Other problems of imitation learning:

• human-labeled data is finite
• humans are not good at providing labels for some problems

#### reward function of imitation learning

The reward function for imitation learning can be written as $r(s,a) = \log p(a=\pi^*(s)|s)$, the log-likelihood of the expert's action.

## MDP & RL Intro

### The goal of RL

The trajectory distribution is $p_\theta(\tau)=p_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)$ and the goal of RL is to maximize expected reward: $\theta^*=\underset{\theta}{\arg\max}\, E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$

### Types of RL algorithms

• policy gradients
• value-based
• actor-critic
• model-based RL
  • for planning
    • optimal control
    • discrete planning
  • improve the policy
  • something else
    • dynamic programming
    • simulated experience

Tradeoffs between algorithms:

• sample efficiency
• stability & ease of use

#### assumptions

• stochastic or deterministic
• continuous or discrete
• episodic or infinite horizon

#### sample efficiency

• off policy: able to improve the policy without generating new samples from that policy
• on policy: each time the policy is changed, even a little bit, we need to generate new samples

#### stability & ease of use

Convergence is a problem: supervised learning is almost always gradient descent, while RL is often not strictly gradient descent.

#### REINFORCE algorithm

1. sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$ (run the policy)
2. $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t} \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \right)\left(\sum_{t} r(s_{i,t},a_{i,t})\right)$
3. $\theta \gets\theta+\alpha\nabla_\theta J(\theta)$
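A minimal sketch of these three steps, assuming a toy two-armed bandit so that a "trajectory" is a single action (the reward model, learning rate, and batch size are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                       # logits of a softmax policy over 2 arms
true_reward = np.array([0.0, 1.0])        # arm 1 is the better arm

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    N = 20
    for _ in range(N):                    # 1. sample "trajectories" (length 1 here)
        a = rng.choice(2, p=probs)
        r = true_reward[a] + rng.normal(0, 0.1)
        dlogp = -probs                    # gradient of log softmax ...
        dlogp[a] += 1.0                   # ... is one-hot(a) - probs
        grad += dlogp * r                 # 2. grad log pi times total reward
    theta += 0.1 * grad / N               # 3. gradient ascent step
```

After training, the policy should put most of its probability on the better arm.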
##### Reduce variance

Causality: the policy at time $t'$ cannot affect the reward at time $t$ when $t<t'$.

Baseline: $b=\frac{1}{N}\sum_{i=1}^{N}r(\tau_i)$, giving $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi_\theta (\tau_i) [r(\tau_i)-b]$. One can prove that subtracting a baseline keeps the estimator unbiased. Here, $\tau$ means a whole episode sampled by the current policy.

We can prove that there is an optimal baseline that minimizes the variance: $b=\frac{E[g(\tau)^2r(\tau)]}{E[g(\tau)^2]}$ where $g(\tau)=\nabla_\theta\log\pi_\theta(\tau)$. But in practice, we just use the average reward as the baseline to reduce complexity.
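The two baselines are easy to compare numerically. A sketch on an invented one-step example (Gaussian "policy" over a scalar action, quadratic reward; every detail here is an assumption for illustration), estimating the per-sample variance of the gradient estimate under each baseline:

```python
import numpy as np

rng = np.random.default_rng(1)
# toy setup: single-step "trajectories" from a fixed Gaussian policy N(mu, 1)
mu = 0.0
actions = rng.normal(mu, 1.0, size=10000)
rewards = -(actions - 2.0) ** 2                   # r(tau)
g = actions - mu                                  # grad_mu log N(a | mu, 1) = a - mu

def grad_variance(b):
    est = g * (rewards - b)                       # per-sample gradient estimate
    return est.var()

b_mean = rewards.mean()                           # simple baseline: average reward
b_opt = (g**2 * rewards).mean() / (g**2).mean()   # "optimal" baseline from above
```

The optimal baseline should give the lowest empirical variance, the average-reward baseline should still beat using no baseline at all.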

#### Off-policy learning & importance sampling

what if we sample from $\bar{\pi}(\tau)$ instead?

Importance sampling: $E_{x\sim p(x)}[f(x)]=E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right]$. Applying this to our objective function, we have $J(\theta)=E_{\tau\sim\bar{\pi}(\tau)}\left[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\right]$, and since $\pi_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)$, the initial-state and transition terms cancel in the ratio: $\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}=\frac{p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)p(s_{t+1}|s_t,a_t)}=\frac{\prod_{t=1}^T\pi_\theta(a_t|s_t)}{\prod_{t=1}^T \bar{\pi}(a_t|s_t)}$. This gives the off-policy policy gradient. Viewing states and actions separately, the per-timestep ratio becomes $\frac{p_{\theta'}(s_{t})}{p_{\theta}(s_{t})}\frac{\pi_{\theta'}(a_{t}|s_{t})}{\pi_{\theta}(a_{t}|s_{t})}$.
If $\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}$ is bounded, we can drop it, and this leads to the TRPO method we will discuss later.
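The importance-sampling identity is easy to verify numerically. A sketch with invented distributions: estimate an expectation under $p$ (playing the role of the new policy) using only samples from $q$ (the old policy), reweighted by $p/q$:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def f(x):
    return x ** 2

# target: E_{x ~ p}[f(x)] with p = N(1, 1); samples come from q = N(0, 1)
x = rng.normal(0.0, 1.0, size=200000)                 # samples from q only
w = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 1.0)   # importance weights p/q
estimate = np.mean(w * f(x))
# true value: E_{N(1,1)}[x^2] = mu^2 + sigma^2 = 2
```

The estimator is unbiased, but its variance grows quickly as $p$ and $q$ move apart, which is exactly the practical problem with off-policy policy gradients.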

for coding, we can use “pseudo-loss” as weighted maximum likelihood with automatic differentiation: $\bar{J}(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log \pi_\theta (a_{i,t}|s_{i,t})\hat{Q}_{i,t}$

• the gradient has high variance
• this isn’t the same as supervised learning!
• gradients will be really noisy!
• consider using much larger batches
• tweaking learning rates is very hard

## 3. Actor-critic method

### Basics

Recap of the policy gradient: $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\hat{Q}_{i,t}$, where $\hat{Q}_{i,t}$ is a single sample from the trajectories, an unbiased estimate but with a high variance problem.

We can use an expectation to reduce variance, $\hat{Q}_{i,t}\approx \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]$, and we define $Q(s_t,a_t)= \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]\\ V(s_t)=E_{a_t\sim\pi(a_t|s_t)}[Q(s_t,a_t)]\\$ then $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})(Q(s_{i,t}, a_{i,t})-V(s_{i,t}))$

The difference $A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)$ is called the advantage; the better the estimate of $A^\pi(s_t,a_t)$, the lower the variance.

#### Value function fitting

We accept a little bias (a one-step biased sample) for convenience: $Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})$, so we have $A^\pi(s_t,a_t) \approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)$; then we only need to fit $V^\pi(s)$!

#### Policy evaluation

Monte Carlo policy evaluation (this is what policy gradient does): $V^\pi(s_t)\approx \sum_{t'=t}^Tr(s_{t'},a_{t'})$. We can average multiple samples if we can reset the environment to a previous state: $V^\pi(s_t)\approx \frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})$. Monte Carlo evaluation with function approximation:

With function approximation, using just one sample per state from a trajectory still works pretty well.

training data: $\{\left(s_{i,t},\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\right)\}$

supervised regression: $\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2$

Ideal target: $y_{i,t}=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}]\approx r(s_{i,t},a_{i,t})+V^\pi(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})$ Monte Carlo target: $y_{i,t}=\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})$

#### TD(bootstrapped)

training data: $\{\left(s_{i,t},r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\right)\}$
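The two kinds of targets can be compared on a short made-up trajectory (undiscounted here, matching the formulas above; the discount factor appears later). The reward sequence and value estimates below are invented numbers:

```python
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 1.0])       # r(s_t, a_t) along one trajectory
v_hat = np.array([3.5, 2.8, 3.1, 0.9, 0.0])    # current estimates V(s_0) .. V(s_4)

# Monte Carlo target: full reward-to-go from each state
mc_targets = np.cumsum(rewards[::-1])[::-1]

# bootstrapped (TD) target: one reward plus the current estimate at the next state
td_targets = rewards + v_hat[1:]
```

The MC targets use only the sampled rewards (unbiased, high variance); the TD targets lean on the current, possibly wrong, value estimates (biased, low variance).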

### Actor-critic algorithm

batch actor-critic algorithm:

1. sample $\{s_i,a_i\}$ from $\pi_\theta(a|s)$
2. fit $\hat{V}_\phi^\pi(s)$ to sampled reward sums
3. evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)$
4. $\nabla_\theta J(\theta)\approx \sum_i \nabla_\theta \log \pi_\theta (a_{i}|s_{i})\hat{A}^\pi(s_i,a_i)$
5. $\theta \gets \theta+\alpha\nabla_\theta J(\theta)$

#### Aside: discount factors

what if T (episode length) is $\infty$ ?

$\hat{V}_\phi^\pi$ can get infinitely large in many cases

simple trick: it is better to get rewards sooner than later: $V^\pi(s_{i,t})\approx r(s_{i,t},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1}),\quad \gamma \in [0,1]$. In practice we use the discount in the policy gradient as $\nabla_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)$

Online actor-critic algorithm (can be applied at every single step):

1. take action $a\sim\pi_\theta(a|s)$, get $(s,a,s',r)$
2. update $\hat{V}_\phi^\pi(s)$ using target $r+\gamma\hat{V}_\phi^\pi(s')$
3. evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}_\phi^\pi(s')-\hat{V}_\phi^\pi(s)$
4. $\nabla_\theta J(\theta)\approx \nabla_\theta \log \pi_\theta (a|s)\hat{A}^\pi(s,a)$
5. $\theta \gets \theta+\alpha\nabla_\theta J(\theta)$
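A tabular sketch of one such online update, with softmax logits standing in for the actor network and a value table for the critic (the state/action counts, step sizes, and discount are arbitrary choices for illustration):

```python
import numpy as np

n_states, n_actions = 3, 2
alpha_v, alpha_pi, gamma = 0.1, 0.1, 0.9
V = np.zeros(n_states)                    # tabular critic V_phi
theta = np.zeros((n_states, n_actions))   # tabular policy logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def online_ac_step(s, a, r, s_next):
    """One online actor-critic update for a single transition (s, a, s', r)."""
    target = r + gamma * V[s_next]        # step 2: TD target
    V[s] += alpha_v * (target - V[s])     # step 2: move critic toward the target
    adv = r + gamma * V[s_next] - V[s]    # step 3: advantage estimate
    probs = softmax(theta[s])
    dlogp = -probs                        # step 4: grad log pi(a|s) ...
    dlogp[a] += 1.0                       # ... = one-hot(a) - probs
    theta[s] += alpha_pi * dlogp * adv    # step 5: gradient step on the actor
    return adv
```

For example, a transition with positive advantage should raise the logit of the taken action relative to the others.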

#### Architecture design

network architecture choices:

• separate value and policy networks (more stable and simple)
• some layers of the value and policy networks are shared (shared features)

Online updates work best with a batch (e.g., from parallel workers).

policy gradient $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-b\right)$ actor-critic $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)\right)$ The policy gradient is unbiased but has higher variance;

the actor-critic estimate has lower variance but is biased.

so can we combine these two things?

here we have critics as state-dependent baselines $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-\hat{V}_\phi^\pi(s_{i,t})\right)$

• no bias
• lower variance

Eligibility traces & n-step returns

Critic and Monte Carlo critic $\hat{A}^\pi_C(s_t,a_t)=r(s_t,a_t)+\gamma\hat{V}_\phi^\pi(s_{t+1})-\hat{V}_\phi^\pi(s_t)\\ \hat{A}^\pi_{MC}(s_t,a_t)=\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_{t})$

combine these two?

n-step returns $\hat{A}^\pi_{n}(s_t,a_t)=\sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^n\hat{V}_\phi^\pi(s_{t+n})-\hat{V}_\phi^\pi(s_{t})$ choosing $n>1$ often works better!!!
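A sketch of the n-step advantage, using the common convention of $n$ rewards followed by a bootstrap at $s_{t+n}$ (an assumption on my part; the `values` array is assumed to hold $\hat{V}$ for each state plus a terminal value, and the return truncates at the episode end):

```python
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """n discounted rewards from time t, bootstrapped with V at s_{t+n},
    minus V at s_t. `values` has one more entry than `rewards`."""
    n = min(n, len(rewards) - t)               # truncate at the episode end
    discounts = gamma ** np.arange(n)
    ret = np.sum(discounts * rewards[t:t + n]) + gamma ** n * values[t + n]
    return ret - values[t]
```

With small $n$ this behaves like the one-step critic estimate (low variance, more bias); with large $n$ it approaches the Monte Carlo advantage.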

Do we have to choose just one n?

Cut everywhere all at once! $\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{n=1}^\infty w_n\hat{A}^\pi_n(s_t,a_t)$

How to weight?

Mostly prefer cutting earlier(less variance) $w_n\propto\lambda^{n-1}$ e.g. $\lambda=0.95$

and this leads to the generalized advantage estimator, closely related to eligibility traces: $\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\delta_{t'}\\ \delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}_\phi^\pi(s_{t'+1})-\hat{V}_\phi^\pi(s_{t'})$ Note that updating a state this way requires experience from the steps that follow it.
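The $\delta$-based form can be computed with a simple backward recursion over a finite trajectory. A sketch (with $\lambda=1$ and $\gamma=1$ it reduces to the Monte Carlo advantage; with $\lambda=0$ to the one-step critic advantage):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.
    `values` holds V-hat for each state plus a terminal value (len+1)."""
    rewards, values = np.asarray(rewards), np.asarray(values)
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD errors
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):                # backward recursion
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

The backward pass exploits the identity $\hat{A}_t = \delta_t + \gamma\lambda\hat{A}_{t+1}$, so the whole trajectory is processed in one sweep.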

## 4. Value based methods

$\underset{a_t}{\arg\max}A^\pi(s_t,a_t)$ : best action from $s_t$, if we then follow $\pi$

then: $\pi'(a_t|s_t)=1$ if $a_t=\underset{a_t}{\arg\max}\,A^\pi(s_t,a_t)$, and $0$ otherwise

This is at least as good as any $a_t \sim \pi(a_t|s_t)$.

### Policy iteration

1. evaluate $A^\pi(s,a)$
2. set $\pi \gets \pi’$

### Dynamic programming

assume we know $p(s'|s,a)$ and that $s$ and $a$ are both discrete (and small)

bootstrapped update: $V^\pi(s) \gets E_{a\sim\pi(a|s)}[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}[V^\pi(s')]]$ with deterministic policy $\pi(s)=a$, we have $V^\pi(s) \gets r(s,\pi(s))+\gamma E_{s'\sim p(s'|s,\pi(s))}[V^\pi(s')]$

So policy iteration simplifies to value iteration:

1. set $Q(s,a)\gets r(s,a)+\gamma E[V(s')]$
2. set $V(s)\gets \max_a Q(s,a)$
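These two steps can be run to convergence on a tiny hand-made MDP. A sketch (the transition tensor and rewards below are invented: action 1 moves from state 0 to the rewarding state 1, which is absorbing):

```python
import numpy as np

# tiny deterministic MDP: 2 states, 2 actions
P = np.zeros((2, 2, 2))        # P[s, a, s'] transition probabilities
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 0.0],      # R[s, a]: reward 1 for any action in state 1
              [1.0, 1.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(200):
    Q = R + gamma * P @ V      # step 1: P @ V gives E_{s'}[V(s')] per (s, a)
    V = Q.max(axis=1)          # step 2: greedy backup
```

Here the fixed point is $V(1)=1/(1-\gamma)=10$ and $V(0)=\gamma V(1)=9$, which the iteration approaches geometrically.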

#### Function approximator

$\mathcal{L}=\frac{1}{2}\sum_i\parallel V_\phi (s)-\max_a Q(s,a)\parallel^2$

##### fitted value iteration

fitted value iteration algorithm:

1. set $y_i \gets \max_{a_i}(r(s_i,a_i)+\gamma E[V_\phi(s'_i)])$
2. set $\phi \gets \underset{\phi}{\arg\min}\frac{1}{2}\sum_i\parallel V_\phi (s_i)-y_i\parallel^2$
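A sketch of fitted value iteration with a linear function approximator on the same kind of tiny MDP as above (all numbers invented). With one-hot features the regression recovers the tabular solution exactly, which makes the sketch easy to check; the regression itself is done in closed form with least squares:

```python
import numpy as np

n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s']
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 0.0], [1.0, 1.0]])
gamma = 0.9

features = np.eye(n_states)     # phi(s): one-hot state features
phi_w = np.zeros(n_states)      # parameters of V_phi(s) = phi(s) . w

for _ in range(200):
    V = features @ phi_w
    y = (R + gamma * P @ V).max(axis=1)   # targets y_i = max_a (r + gamma E[V])
    # regression: fit phi to minimize sum_i ||V_phi(s_i) - y_i||^2
    phi_w, *_ = np.linalg.lstsq(features, y, rcond=None)
```

With richer (non-tabular) features the regression step introduces approximation error, and convergence is no longer guaranteed, which is exactly the stability issue discussed later.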

### Unsupervised reinforcement learning for meta-learning

using unsupervised reinforcement learning to propose tasks for meta-learning.

So first, unsupervised meta-RL generates tasks; then meta-learning is applied to those tasks using the reward functions produced in the previous step.

### Challenges in deep reinforcement learning

Core algorithm:

• Stability
• Efficiency
• Generalization

Assumptions:

• Problem formulation
• Supervision

#### Stability and hyper-parameter tuning

• Devising stable RL algorithms is very hard
• Q-learning / value function estimation
  • No guarantee of convergence
  • Lots of parameters for stability: target network delay, replay buffer size, clipping, sensitivity to learning rates, etc.
• Policy gradient methods
  • Very high variance gradient estimator
  • Lots of samples, complex baselines, etc.
  • Parameters: batch size, learning rate, design of the baseline
• Model-based RL algorithms
  • Model class and fitting method
  • Optimizing the policy w.r.t. the model is non-trivial due to back-propagation through time
  • More subtle issue: the policy tends to exploit the model

The challenge with hyper-parameters is severe

• Algorithms with favorable improvement and convergence properties
• TRPO
• Safe reinforcement learning, high-confidence policy improvement [Thomas '15]
• Q-Prop

Not great for beating benchmarks, but absolutely essential to make RL a viable tool for real-world problems.

#### Sample complexity

• real-world learning becomes difficult or impractical
• Precludes the use of expensive, high-fidelity simulators
• Limits applicability to real-world problems

what can we do?

• Better model-based RL algorithms
• Design faster algorithms
• Addressing Function Approximation Error in Actor-Critic Methods [Fujimoto et al. '18]
• Soft Actor-Critic
• Reuse prior knowledge to accelerate reinforcement learning
• RL2 [Duan et al. '17]
• Learning to reinforcement learn [Wang et al. '17]
• MAML [Finn et al. '17]

#### Scaling & Generalization

• Small-scale
• Emphasizes mastery
• Evaluated on performance
• Where is the generalization?

Reinforcement learning needs to re-collect data during training.

### Assumption problems

• Train on multiple tasks, then try to generalize or finetune
• policy distillation
• Actor-mimic
• MAML
• Unsupervised or weakly supervised learning of diverse behaviors
• Stochastic neural networks
• Reinforcement learning with deep energy-based policies

Where does the supervision come from?

• learn objectives/rewards from demonstrations (IRL)
• Generate objectives automatically

What is the role of the reward function?

Unsupervised reinforcement learning

1. Interaction with the world without a reward function

2. Learning something about the world

3. Use what you learned to quickly solve new tasks

Other sources of supervision

• Demonstrations
• Language
• Human preferences

Where does the supervision signal come from?

• Yann LeCun's cake
• Unsupervised or self-supervised learning
• Model learning (predict the future)
• Generative modeling of the world
• Lots to do even before you accomplish your goal
• Imitation & understanding other agents
• The giant value backup

## 18 Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)

Meta-learning:

Learning to Learn with Gradients. Finn, PhD thesis, 2018.

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. ICML 2019.

If we want to do this, we need to make sure the meta-train task distribution is the same as the meta-test task distribution.

• Algorithms: more general than a single policy; learning from demonstrations and trials, or other sources; the Meta-World benchmark