Deep Reinforcement learning notes (UBC) · Dongda Li · 2019-10-11 · https://dongdongbh.tech/UBC-RL<h2 id="background">Background</h2> <p>This is my class note for the deep reinforcement learning course at UC Berkeley, namely CS294-112 (now CS285). The lecturer is <a href="https://people.eecs.berkeley.edu/~svlevine/">Sergey Levine</a>, and the lecture videos can be found on YouTube. I wrote two notes on reinforcement learning before: one on <a href="https://dongdongbh.tech/RL-note/">basic RL</a>, the other a <a href="https://dongdongbh.tech/RL-courses/">David Silver class note</a>.</p> <p>Different from those courses, this course takes a deeper theoretical view, covers more recent methods, and includes some more advanced topics, especially model-based RL and meta-learning. It is better suited for people who are interested in robotic control and want a deeper understanding of reinforcement learning.</p> <p>This class is fairly hard, so make sure you follow it closely.</p> <p>Some of the math may not render properly in this post; I may fix that later.</p> <h2 id="1-imitation-learning">1. Imitation learning</h2> <h3 id="the-main-problem-of-imitation-distribution-drift">The main problem of imitation: distribution drift</h3> <p>How can we make the distribution of the training dataset match the distribution induced by the policy?</p> <p>DAgger</p> <h4 id="dagger-dataset-aggregation">DAgger: Dataset Aggregation</h4> <p>goal: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)$!</p> <p>how? just run $\pi_\theta(a_t|o_t)$. but we need labels $a_t$!</p> <ol> <li>train $\pi_{\theta}(a_t|o_t)$ from human data $\mathcal{D}={o_1,a_1,…,o_N,a_N}$;</li> <li>run $\pi_\theta(a_t|o_t)$ to get dataset $\mathcal{D}_\pi = {o_1,…,o_M}$;</li> <li>ask a human to label $\mathcal{D}_\pi$ with actions $a_t$;</li> <li>aggregate: $\mathcal{D}\gets \mathcal{D}\cup\mathcal{D}_\pi$.</li> </ol> <p>What if we could instead fit the expert so well that there is no drift?</p> <h4 id="why-fail-to-fit-expert">why fail to fit expert?</h4> <ol> <li>Non-Markovian behavior <ul> <li>use a history of observations</li> </ul> </li> <li>Multimodal behavior <ul> <li>for discrete actions it is OK, since the softmax outputs a probability over actions</li> <li>for continuous actions <ul> <li>output a mixture of Gaussians</li> <li>latent variable models (inject noise into the network input)</li> <li>autoregressive discretization</li> </ul> </li> </ul> </li> </ol> <p>other problems of imitation learning</p> <ul> <li>human-labeled data is finite</li> <li>humans are not good at some tasks</li> </ul> <h4 id="reward-function-of--imitation-learning">reward function of imitation learning</h4> <p>the reward function of imitation learning can be <script type="math/tex">r(s,a) = \log {p(a=\pi^*(s)|s)}</script></p> <h2 id="mdp--rl-intro">MDP &amp; RL Intro</h2> <h3 id="the-goal-of-rl">The goal of RL</h3> <p>expected reward <script type="math/tex">p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]</script> where $p_\theta(\tau)$ is the distribution over trajectories</p> <h3 id="q--v">Q &amp; V</h3> <script type="math/tex; mode=display">V^\pi(s_t)=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'})|s_t]\\ V^\pi(s_t)=E_{a_t\sim\pi(a_t|s_t)}[Q^\pi(s_t,a_t)]</script> <h3 
id="types-of-rl-algorithms">Types of RL algorithms</h3> <ul> <li>Policy gradient</li> <li>value-based</li> <li>Actor-critic</li> <li>model-based RL <ul> <li>for planning <ul> <li>optimal control</li> <li>discrete planning</li> </ul> </li> <li>improve policy</li> <li>something else <ul> <li>dynamic programming</li> <li>simulated experience</li> </ul> </li> </ul> </li> </ul> <h4 id="trade-offs">trade-offs</h4> <ul> <li>sample efficiency</li> <li>stability &amp; ease of use</li> </ul> <h4 id="assumptions">assumptions</h4> <ul> <li>stochastic or deterministic</li> <li>continuous or discrete</li> <li>episodic or infinite horizon</li> </ul> <h4 id="sample-efficiency">sample efficiency</h4> <ul> <li> <p><strong>off policy</strong>: able to improve the policy without generating new samples from that policy</p> </li> <li> <p><strong>on policy</strong>: each time the policy is changed, even a little bit, we need to generate new samples</p> </li> </ul> <h4 id="stability--ease-of-use">stability &amp; ease of use</h4> <p><strong>convergence</strong> is a problem</p> <p>supervised learning almost <em>always</em> gradient descent</p> <p>RL often <em>not</em> strictly gradient descent</p> <h2 id="2-policy-gradient">2. 
Policy gradient</h2> <h4 id="objective-function">Objective function</h4> <script type="math/tex; mode=display">\theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\\ J(\theta)=E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\approx\frac{1}{N}\sum_i\sum_tr(s_{i,t},a_{i,t})</script> <h4 id="policy-differentiation">policy differentiation</h4> <h5 id="log-derivative">log derivative</h5> <p><script type="math/tex">% <![CDATA[ \begin{align} \pi_\theta(\tau)\Delta_\theta \log \pi_\theta(\tau)&=\pi_\theta(\tau)\frac{\Delta_\theta\pi_\theta(\tau)}{\pi_\theta(\tau)}=\Delta_\theta\pi_\theta(\tau)\\ \pi_\theta(\tau)&=\pi_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \log \pi_\theta(\tau) &=\log p(s_1) + \sum_{t=1}^T \left[\log \pi_\theta (a_t|s_t) + \log p(s_{t+1}|s_t,a_t)\right]\\ &\Delta_\theta \left[\log p(s_1) + \sum_{t=1}^T \left[\log \pi_\theta (a_t|s_t) + \log p(s_{t+1}|s_t,a_t)\right]\right]= \sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t|s_t) \end{align} %]]></script></p> <h5 id="objective-function-differentiation">Objective function differentiation</h5> <p><script type="math/tex">% <![CDATA[ \begin{align} \theta^*&=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[{r}(\tau)]=\int\pi_\theta(\tau)r(\tau)d\tau\\ {r}(\tau)&=\sum_t r(s_t,a_t)\\ \Delta_\theta J(\theta)&=\int\Delta_\theta \pi_\theta(\tau)r(\tau)d\tau\\ &=\int\pi_\theta(\tau)\Delta_\theta \log \pi_\theta(\tau)r(\tau) d\tau\\ &=E_{\tau\sim\pi_\theta}[\Delta_\theta \log \pi_\theta(\tau)r(\tau)]\\ &=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t|s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right] \end{align} %]]></script></p> <h5 id="evaluating-the-policy-gradient">evaluating the policy gradient</h5> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \Delta_\theta J(\theta)&=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t|s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]\\ \Delta_\theta 
J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t|s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\\ \theta &\gets\theta+\alpha\Delta_\theta J(\theta) \end{align} %]]></script> <h4 id="reinforce-algorithm">REINFORCE algorithm</h4> <ol> <li>sample ${\tau^i}$ from $\pi_\theta(a_t|s_t)$ (run the policy);</li> <li>$\Delta_\theta J(\theta)\approx\sum_{i=1}^N\left(\sum_{t} \Delta_\theta \log \pi_\theta (a_t|s_t) \right)\left(\sum_{t} r(s_t,a_t)\right)$;</li> <li>$\theta \gets\theta+\alpha\Delta_\theta J(\theta)$.</li> </ol> <h5 id="policy-gradient">policy gradient</h5> <script type="math/tex; mode=display">\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N \Delta_\theta \log \pi_\theta (\tau_i) r(\tau_i)</script> <h5 id="reduce-variance">Reduce variance</h5> <p><strong>Causality</strong>: the policy at time $t'$ cannot affect the reward at time $t$ when $t&lt;t'$ <script type="math/tex">% <![CDATA[ \begin{align} \Delta_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\ &\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\\ &=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\hat{Q}_{i,t} \end{align} %]]></script> <strong>baseline</strong> <script type="math/tex">b=\frac{1}{N}\sum_{i=1}^{N}r(\tau_i)\\ \Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N \Delta_\theta \log \pi_\theta (\tau_i) [r(\tau_i)-b]</script> proof: <script type="math/tex">% <![CDATA[ \begin{align} E[\Delta_\theta \log \pi_\theta(\tau)b]&=\int \pi_\theta(\tau) \Delta_\theta \log \pi_\theta(\tau)b d\tau \\ &= \int\Delta_\theta \pi_\theta(\tau)b d\tau\\ &=b\Delta_\theta\int\pi_\theta(\tau)d\tau\\ &=b\Delta_\theta 1\\ &=0 \end{align} %]]></script> 
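</p>

<p>The zero-expectation argument above can be checked numerically. The following is a minimal sketch for a small discrete softmax policy; the logits and the baseline value are arbitrary toy numbers, not from the course:</p>

```python
import numpy as np

# Toy softmax policy over 3 discrete actions (logits chosen arbitrarily).
theta = np.array([0.2, -0.5, 1.0])
probs = np.exp(theta) / np.exp(theta).sum()

# For a softmax policy, d log pi(a) / d theta = one_hot(a) - probs.
grad_log_pi = np.eye(3) - probs  # row a holds the gradient for action a

b = 3.7  # an arbitrary constant baseline
# E_{a ~ pi}[ grad log pi(a) * b ] -- comes out exactly zero
expected = (probs[:, None] * grad_log_pi * b).sum(axis=0)
print(expected)
```

<p>Whatever constant $b$ we pick, the probability-weighted rows cancel, which is exactly the $b\Delta_\theta\int\pi_\theta(\tau)d\tau=0$ step of the proof.</p>

<p>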
Here, $\tau$ denotes a whole <strong>episode</strong> sampled from the current policy.</p> <p>One can show that there is an optimal baseline that minimizes the variance: <script type="math/tex">b=\frac{E[g(\tau)^2r(\tau)]}{E[g(\tau)^2]}</script> where $g(\tau)=\Delta_\theta \log \pi_\theta(\tau)$. But in practice we just use the average reward as the baseline, to reduce complexity.</p> <blockquote> <p>policy gradient is an <strong>on-policy</strong> algorithm</p> </blockquote> <h4 id="off-policy-learning--importance-sampling">Off-policy learning &amp; importance sampling</h4> <script type="math/tex; mode=display">\theta^*=\underset{\theta}{\arg\max} J(\theta)\\ J(\theta)=E_{\tau\sim\pi_\theta(\tau)}[r(\tau)]</script> <p>what if we sample from $\bar{\pi}(\tau)$ instead?</p> <p><strong>Importance sampling</strong> <script type="math/tex">% <![CDATA[ \begin{align} E_{x\sim p(x)}[f(x)]&=\int p(x)f(x)dx\\ &=\int \frac {q(x)}{q(x)}p(x)f(x)dx\\ &=E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right] \end{align} %]]></script></p> <p>so applying this to our objective function, we have <script type="math/tex">J(\theta)=E_{\tau\sim\bar{\pi}(\tau)}\left[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\right]</script> and we have <script type="math/tex">\pi_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}=\frac{p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)p(s_{t+1}|s_t,a_t)}=\frac{\prod_{t=1}^T\pi_\theta(a_t|s_t)}{\prod_{t=1}^T \bar{\pi}(a_t|s_t)}</script> so we have <script type="math/tex">% <![CDATA[ \begin{align} J(\theta')&=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]\\ \Delta_{\theta'}J(\theta')&=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\Delta_{\theta'}\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]\\ &=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\Delta_{\theta'} \log \pi_{\theta'}(\tau)r(\tau)\right] \end{align} %]]></script></p> 
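<p>The importance-sampling identity above is easy to sanity-check numerically. A minimal sketch (the target, proposal, and test function are arbitrary toy choices): sampling from a proposal $q$ and reweighting by $p/q$ recovers an expectation under $p$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy densities: target p = N(1, 1), proposal q = N(0, 2); f(x) = x^2.
def p(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2.0 * np.pi)

def q(x):
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

x = rng.normal(0.0, 2.0, size=200_000)   # samples drawn from the proposal q
est = np.mean(p(x) / q(x) * x ** 2)      # importance-weighted estimate of E_p[x^2]

# True value: E_{x~N(1,1)}[x^2] = variance + mean^2 = 2
print(est)
```

<p>Note the proposal here has heavier tails than the target, which keeps the weights $p/q$ bounded; in the policy-gradient setting the analogous ratio is a product of per-step action-probability ratios, which can blow up over long horizons.</p>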
<p><strong>The off-policy policy gradient</strong> <script type="math/tex">% <![CDATA[ \begin{align} \Delta_{\theta'}J(\theta')&=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\Delta_{\theta'} \log \pi_{\theta'}(\tau)r(\tau)\right]\\ &=E_{\tau\sim\pi_\theta}\left[\left(\prod_{t=1}^T\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\right)\left(\sum_{t=1}^T \Delta_{\theta'} \log \pi_{\theta'} (a_t|s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]\\ &=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_{\theta'} \log \pi_{\theta'} (a_t|s_t) \right)\left(\prod_{t'=1}^t\frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_\theta(a_{t'}|s_{t'})}\right)\left(\sum_{t'=t}^T r(s_{t'},a_{t'})\left(\prod_{t''=t}^{T}\frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_\theta(a_{t''}|s_{t''})}\right)\right)\right] \end{align} %]]></script> we can view states and actions separately, then:<br /> <script type="math/tex">% <![CDATA[ \begin{align} \theta^*&=\underset{\theta}{\arg\max} \sum_{t=1}^TE_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]\\ J(\theta)&=E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]\\ &=E_{s_t\sim p_\theta(s_t)}\left[E_{a_t\sim \pi(a_t|s_t)}[r(s_t,a_t)]\right]\\ J(\theta')&=E_{s_t\sim p_\theta(s_t)}\left[\cancel{\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}}E_{a_t\sim \pi(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}r(s_t,a_t)\right]\right] \end{align} %]]></script> If $\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}$ is small and bounded, then we can drop it, and this leads to the <strong>TRPO</strong> method we will discuss later.</p> <p>For implementation, we can use a “pseudo-loss” as a weighted maximum likelihood with automatic differentiation: <script type="math/tex">\bar{J}(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log \pi_\theta (a_{i,t}|s_{i,t})\hat{Q}_{i,t}</script></p> <h5 id="policy-gradient-in-practice">policy gradient in practice</h5> <ul> <li>the gradient has <strong>high variance</strong> <ul> <li>this isn’t the same as supervised 
learning!</li> <li>gradients will be really noisy!</li> </ul> </li> <li> <p>consider using much <strong>larger batches</strong></p> </li> <li>tweaking <strong>learning rates</strong> is very hard <ul> <li>adaptive step size rules like ADAM can be OK-ish</li> <li>we will learn about policy gradient-specific learning rate adjustment methods later</li> </ul> </li> </ul> <h2 id="3-actor-critic-method">3. Actor-critic method</h2> <h3 id="basics">Basics</h3> <p>recap policy gradient <script type="math/tex">% <![CDATA[ \begin{align} \Delta_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\ &\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\\ &=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\hat{Q}_{i,t} \end{align} %]]></script> where $\hat{Q}_{i,t}$ is a single-trajectory sample: an unbiased estimate, but with high variance.</p> <p>We can use the expectation to reduce variance <script type="math/tex">\hat{Q}_{i,t}\approx \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]</script> and we define <script type="math/tex">\hat{Q}_{i,t}= \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]\\ V(s_t)=E_{a_t\sim\pi(a_t|s_t)}[Q(s_t,a_t)]\\</script> then <script type="math/tex">\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})(Q(s_{i,t}, a_{i,t})-V(s_{i,t}))</script></p> <h4 id="advantage">Advantage</h4> <script type="math/tex; mode=display">A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)\\ \Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})A^\pi(s_{i,t},a_{i,t})</script> <p>The better the estimate of $A^\pi(s_t,a_t)$, the lower the variance.</p> <h4 id="value-function-fitting">Value function fitting</h4> <script type="math/tex; 
Q^\pi">
mode=display">Q^\pi(s_t,a_t)=r(s_t,a_t)+E_{s_{t+1}\sim p(s_{t+1}|s_t,a_t)}[V^\pi(s_{t+1})]</script> <p>and we add a little bias (a one-step biased sample) for convenience <script type="math/tex">Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})</script> so we have <script type="math/tex">A^\pi(s_t,a_t) \approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)</script> then we only need to fit $V^\pi(s)$ !</p> <h4 id="policy-evaluation">Policy evaluation</h4> <script type="math/tex; mode=display">V^\pi(s_t)=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t]\\ J(\theta)=E_{s_1\sim p(s_1)}[V^\pi(s_1)]</script> <p>Monte Carlo policy evaluation (this is what policy gradient does) <script type="math/tex">V^\pi(s_t)\approx \sum_{t'=t}^Tr(s_{t'},a_{t'})</script> We can use multiple samples <strong>if we can reset</strong> the environment to a previous state <script type="math/tex">V^\pi(s_t)\approx \frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^Tr(s_{t'},a_{t'})</script> <strong>Monte Carlo evaluation with function approximation</strong></p> <p>with function approximation, using only one sample per trajectory is still pretty good.</p> <p>training data: ${\left(s_{i,t},\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\right)}$</p> <p>supervised regression: $\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2$</p> <p>Ideal target: <script type="math/tex">y_{i,t}=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}]\approx r(s_{i,t},a_{i,t})+V^\pi(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+\hat{V^\pi_\phi}(s_{i,t+1})</script> Monte Carlo target: <script type="math/tex">y_{i,t}=\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})</script></p> <h4 id="tdbootstrapped">TD (bootstrapped)</h4> <p>training data: ${\left(s_{i,t},r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\right)}$</p> <h3 id="actor-critic-algorithm">Actor-critic algorithm</h3> <p>batch actor-critic algorithm:</p> <ol> <li>sample ${s_i,a_i}$ from $\pi_\theta(a|s)$</li> <li>fit 
$\hat{V_\phi^\pi}(s)$ to the sampled reward sums</li> <li>evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)$</li> <li>$\Delta_\theta J(\theta)\approx \sum_i \Delta_\theta \log \pi_\theta (a_{i}|s_{i})\hat{A}^\pi(s_i,a_i)$</li> <li>$\theta \gets \theta+\alpha\Delta_\theta J(\theta)$</li> </ol> <script type="math/tex; mode=display">V^\pi(s_{i,t})=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}]\\ V^\pi(s_{i,t})\approx\sum_{t'=t}^Tr(s_{t'},a_{t'})\\ V^\pi(s_{i,t})\approx r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\\ \mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2</script> <h4 id="aside-discount-factors">Aside: discount factors</h4> <p>what if T (the episode length) is $\infty$ ?</p> <p>$\hat{V}_\phi^\pi$ can get infinitely large in many cases</p> <p>simple trick: better to get rewards sooner than later <script type="math/tex">V^\pi(s_{i,t})\approx r(s_{i,t},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1})\\ \gamma \in [0,1]</script> actually we use the discount in policy gradient as <script type="math/tex">\Delta_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)</script> Online actor-critic algorithm (can be applied at every single step):</p> <ol> <li>take action $a\sim\pi_\theta(a|s)$, get $(s,a,s',r)$</li> <li>update $\hat{V_\phi^\pi}(s)$ using target $r+\gamma\hat{V_\phi^\pi}(s')$</li> <li>evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}_\phi^\pi(s')-\hat{V}_\phi^\pi(s)$</li> <li>$\Delta_\theta J(\theta)\approx \Delta_\theta \log \pi_\theta (a|s)\hat{A}^\pi(s,a)$</li> <li>$\theta \gets \theta+\alpha\Delta_\theta J(\theta)$</li> </ol> <h4 id="architecture-design">Architecture design</h4> <p>network 
architecture choice</p> <ul> <li> <p>separate value network and policy network (more stable and simpler)</p> </li> <li> <p>value network and policy network share some layers (shared features)</p> </li> </ul> <p>works best with a batch</p> <h4 id="trade-off-and-balance">trade-off and balance</h4> <p>policy gradient</p> <p><script type="math/tex">\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-b\right)</script> actor-critic <script type="math/tex">\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)\right)</script> policy gradient is unbiased but has higher variance</p> <p>actor-critic has lower variance but is biased</p> <blockquote> <p>so can we combine these two things?</p> </blockquote> <p>here we have <strong>critics as state-dependent baselines</strong> <script type="math/tex">\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\left(\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-\hat{V}_\phi^\pi(s_{i,t})\right)</script></p> <ul> <li>no bias</li> <li>lower variance</li> </ul> <p><strong>Eligibility traces &amp; n-step returns</strong></p> <p>Critic and Monte Carlo critic <script type="math/tex">\hat{A}^\pi_C(s_t,a_t)=r(s_t,a_t)+\gamma\hat{V}_\phi^\pi(s_{t+1})-\hat{V}_\phi^\pi(s_t)\\ \hat{A}^\pi_{MC}(s_t,a_t)=\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_{t})</script></p> <blockquote> <p>combine these two?</p> </blockquote> <p>n-step returns <script type="math/tex">\hat{A}^\pi_{n}(s_t,a_t)=\sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^n\hat{V}_\phi^\pi(s_{t+n})-\hat{V}_\phi^\pi(s_{t})</script> choosing $n&gt;1$ often works better!!!</p> <p><strong>Generalized advantage estimation (GAE)</strong></p> <blockquote> <p>Do we 
have to choose just one n?</p> </blockquote> <p>Cut everywhere all at once! <script type="math/tex">\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{n=1}^\infty w_n\hat{A}^\pi_{n}(s_t,a_t)</script></p> <blockquote> <p>How to weight?</p> </blockquote> <p>Mostly prefer cutting earlier (less variance): $w_n\propto\lambda^{n-1}$, e.g. $\lambda=0.95$</p> <p>and this leads to eligibility traces <script type="math/tex">\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\delta_{t'}\\ \delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}_\phi^\pi(s_{t'+1})-\hat{V}_\phi^\pi(s_{t'})</script> in this way, every time you want to update at a state, you need the following steps of experience from that state</p> <h2 id="4-value-based-methods">4. Value based methods</h2> <p>$\underset{a_t}{\arg\max}A^\pi(s_t,a_t)$ : the best action from $s_t$, if we then follow $\pi$</p> <p>then: <script type="math/tex">% <![CDATA[ \pi'(a_t|s_t)=\begin{cases}1, &if \quad a_t=\underset{a_t}{\arg\max}A^\pi(s_t,a_t) \cr 0, &otherwise\end{cases} %]]></script></p> <blockquote> <p>this is at least as good as any $a_t \sim \pi(a_t|s_t)$</p> </blockquote> <h3 id="policy-iteration">Policy iteration</h3> <ol> <li>evaluate $A^\pi(s,a)$</li> <li>set $\pi \gets \pi'$</li> </ol> <h3 id="dynamic-programming">Dynamic programming</h3> <p>assume we know $p(s'|s,a)$ and that $s$ and $a$ are both discrete (and small)</p> <p>bootstrapped update: <script type="math/tex">V^\pi(s) \gets E_{a\sim\pi(a|s)}[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}[V^\pi(s')]]</script> with a deterministic policy $\pi(s)=a$, we have <script type="math/tex">V^\pi(s) \gets r(s,\pi(s))+\gamma E_{s'\sim p(s'|s,\pi(s))}[V^\pi(s')]</script></p> <script type="math/tex; mode=display">\underset{a_t}{\arg\max}A^\pi(s_t,a_t)=\underset{a_t}{\arg\max}Q^\pi(s_t,a_t)\\ Q^\pi(s,a)=r(s,a)+\gamma E[V^\pi(s')]</script> <p>So policy iteration becomes</p> <ol> <li>set $Q^\pi(s,a)\gets r(s,a)+\gamma E[V^\pi(s')]$</li> <li>set 
$V(s)\gets \max_a Q(s,a)$</li> </ol> <h4 id="function-approximator">Function approximator</h4> <p>$\mathcal{L}=\frac{1}{2}\sum_i\parallel V_\phi (s)-\max_a Q(s,a)\parallel^2$</p> <h5 id="fitted-value-iteration">fitted value iteration</h5> <p>fitted value iteration algorithm:</p> <ol> <li>set $y_i \gets \max_{a_i}(r(s_i,a_i)+\gamma E[V_\phi(s'_i)])$</li> <li>set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel V_\phi (s_i)-y_i\parallel^2$</li> </ol> <p>but we cannot take the maximum if we do not know the dynamics, so we evaluate Q instead of V <script type="math/tex">Q^\pi(s,a) \gets r(s,a)+\gamma E_{s'\sim p(s'|s,a)}[Q^\pi(s',\pi(s'))]</script></p> <h5 id="fitted-q-iteration">fitted Q-iteration</h5> <ol> <li>collect dataset ${(s_i, a_i,s'_i,r_i)}$ using some policy</li> <li>set $y_i \gets r(s_i,a_i) +\gamma \max_{a'_i}Q_\phi(s'_i,a'_i)$</li> <li>set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel Q_\phi(s_i,a_i)-y_i\parallel^2$; repeat steps 2 and 3 $k$ times, then return to step 1</li> </ol> <p>Q-learning is <strong>off-policy</strong>: it fits $Q(s,a)$, which estimates the value of every state-action pair no matter which policy the states and actions came from, and the target contains the maximum over actions. 
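</p>

<p>In the tabular limit, the fitted Q-iteration loop above (without the function approximator) is just the repeated backup $Q(s,a)\gets r(s,a)+\gamma\max_{a'}Q(s',a')$. A minimal sketch on a made-up deterministic chain MDP (the states, rewards, and $\gamma$ are all toy choices):</p>

```python
import numpy as np

# Tabular Q-iteration on a toy deterministic chain: states 0..3,
# action 1 moves right, action 0 moves left, reward 1 for arriving at state 3.
n_states, n_actions, gamma = 4, 2, 0.9

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)

Q = np.zeros((n_states, n_actions))
for _ in range(50):
    # the backup: Q(s,a) <- r(s,a) + gamma * max_a' Q(s',a')
    for s in range(n_states):
        for a in range(n_actions):
            s2, r = step(s, a)
            Q[s, a] = r + gamma * Q[s2].max()

print(Q.argmax(axis=1))  # greedy policy: move right in every state
```

<p>Note the backup never asks which policy produced a transition: any $(s,a,s',r)$ tuple, from any behavior policy, yields the same update, which is exactly the off-policy property.</p>

<p>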
And for the $r(s,a)$ term: given $s$ and $a$, the transition is independent of $\pi$.</p> <h5 id="exploration">exploration</h5> <ol> <li> <p>epsilon-greedy <script type="math/tex">% <![CDATA[ \pi(a_t|s_t)=\begin{cases}1-\epsilon , &\text{if}\; a_t={\arg\max}Q_\phi(s_t,a_t) \cr \epsilon/(|\mathcal{A}|-1), &\text{otherwise}\end{cases} %]]></script></p> </li> <li> <p>Boltzmann exploration</p> <script type="math/tex; mode=display">\pi(a_t|s_t) \propto \exp(Q_\phi(s_t,a_t))</script> </li> </ol> <h4 id="value-function-learning-theory">Value function learning theory</h4> <p>value iteration:</p> <ol> <li>set $Q(s,a) \gets r(s,a)+\gamma E[V(s')]$</li> <li>set $V(s) \gets \max_a Q(s,a)$</li> </ol> <p>The tabular case converges.</p> <p>The non-tabular case has no convergence guarantee.</p> <p>Actor-critic also needs to estimate $V$, and when it uses a bootstrapped estimator it has the same problem: convergence is not guaranteed.</p> <h2 id="5-practical-q-learning">5. Practical Q-learning</h2> <p>What’s wrong with online Q-learning?</p> <blockquote> <p>Actually, it is not gradient descent: it does not compute the gradient through the target Q value in $y$.</p> <p>The samples are not i.i.d.</p> </blockquote> <h3 id="replay-buffer-replay-samples-many-times">Replay buffer (replay samples many times)</h3> <p>Q-learning with a replay buffer:</p> <ol> <li>collect dataset ${(s_i,a_i,s'_i,r_i)}$ using some policy, <strong>add</strong> it to $\mathcal{B}$ <ol> <li>sample a batch $(s_i,a_i,s'_i,r_i)$ from $\mathcal{B}$</li> <li>$\phi \gets\phi-\alpha\sum_i\frac{d Q_\phi}{d \phi} (s_i,a_i) \left(Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a'_i}Q_\phi(s'_i,a'_i)]\right)$; do these $k$ times</li> </ol> </li> </ol> <h3 id="target-network">Target network</h3> <h4 id="dqn-target-networkreplay-buffer">DQN (Target network + Replay buffer)</h4> <ol> <li>save target network parameters: $\phi' \gets \phi$ <ol> <li>collect dataset ${(s_i,a_i,s'_i,r_i)}$ using some policy, <strong>add</strong> it to $\mathcal{B}$; do this N times <ol> <li>sample a batch 
$(s_i,a_i,s'_i,r_i)$ from $\mathcal{B}$</li> <li>$\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a'_i}Q_{\phi'}(s'_i,a'_i)]\parallel^2$; do steps 2 and 3 $k$ times</li> </ol> </li> </ol> </li> </ol> <h4 id="alternative-target-network">Alternative target network</h4> <p>Polyak averaging: a soft update to avoid sudden target network changes:</p> <p>update $\phi'$: $\phi' \gets \tau \phi' + (1-\tau)\phi$, e.g. $\tau =0.999$</p> <h3 id="double-q-learning">Double Q-learning</h3> <h4 id="are-the-q-values-accurate">Are the Q-values accurate?</h4> <p>They are often much <strong>larger</strong> than the true values, since the <strong>maximum</strong> operation always picks up the <strong>noise</strong> in the Q estimate and makes the Q function overestimate.</p> <p>Target value $y_j=r_j +\gamma \max_{a'_j}Q_{\phi'}(s'_j,a'_j)$ <script type="math/tex">\max_{a'}Q_{\phi'}(s',a') = Q_{\phi'}(s',\arg \max_{a'}Q_{\phi'}(s',a'))</script> the value <em>also</em> comes from $Q_{\phi'}$, and the action is selected according to $Q_{\phi'}$</p> <p>How to address this?</p> <h4 id="double-q-learning-1">Double Q-learning</h4> <p>idea: don’t use the same network to choose the action and evaluate its value! 
(<strong>de-correlate</strong> the noise)</p> <p>use two networks: <script type="math/tex">Q_{\phi_A}\gets r +\gamma Q_{\phi_B}(s',\arg \max_{a'}Q_{\phi_A}(s',a')) \\ Q_{\phi_B}\gets r +\gamma Q_{\phi_A}(s',\arg \max_{a'}Q_{\phi_B}(s',a'))</script> the value for each network comes from the <strong>other</strong> network!</p> <h4 id="double-q-learning-in-practice">Double Q-learning in practice</h4> <p>Just use the current and target networks as $\phi_A$ and $\phi_B$: the current network chooses the action, and the target network evaluates the Q value.</p> <p>standard Q-learning: $y=r+\gamma Q_{\phi'}(s', \arg \max_{a'}Q_{\phi'}(s',a'))$</p> <p>double Q-learning: $y=r+\gamma Q_{\phi'}(s', \arg \max_{a'}Q_{\phi}(s',a'))$</p> <h3 id="multi-step-returns">Multi-step returns</h3> <script type="math/tex; mode=display">y_{j,t}=\sum_{t'=t}^{t+N-1}\gamma^{t'-t}r_{j,t'}+\gamma ^N \max_{a_{j,t+N}} Q_{\phi'}(s_{j,t+N},a_{j, t+N})</script> <p>In Q-learning this is only actually correct when learning <strong>on-policy</strong>, because the summed rewards come from transitions generated by a different policy.</p> <p>How to fix this?</p> <ul> <li>ignore the problem (often fine when N is small)</li> <li>cut the trace: dynamically choose N to use only on-policy data <ul> <li>works well when the data is mostly on-policy and the action space is small</li> </ul> </li> <li>importance sampling; see the paper “Safe and efficient off-policy reinforcement learning”, Munos et al. 2016</li> </ul> <h3 id="q-learning-with-continuous-actions">Q-learning with continuous actions</h3> <p>How do we take the argmax over a continuous action space?</p> <ol> <li> <p>optimization</p> <ul> <li>gradient-based optimization (e.g., SGD) is a bit slow in the inner loop</li> <li>the action space is typically low-dimensional; what about stochastic optimization?</li> </ul> <p>a simple solution: sample from a discrete set of actions: $\max_a Q(s,a)\approx \max \{Q(s,a_1),…,Q(s,a_N)\}$</p> <p>more accurate solutions:</p> <ul> <li>cross-entropy method (CEM) <ul> <li>simple iterative stochastic optimization</li> </ul> </li> <li>CMA-ES</li> </ul> </li> <li> <p>use a function class that is easy to optimize <script type="math/tex">Q_{\phi}(s,a) = -\frac{1}{2}(a-\mu_\phi(s))^TP_{\phi}(s)(a-\mu_\phi(s))+V_\phi(s)</script> <strong>NAF</strong>: <strong>N</strong>ormalized <strong>A</strong>dvantage <strong>F</strong>unctions</p> <p>Use a neural network to output $\mu,P,V$</p> <p>Then <script type="math/tex">\arg \max_aQ_\phi(s,a) =\mu_\phi(s)\; \; \max_aQ(s,a)=V_\phi(s)</script> <strong>but</strong> this loses some representational power</p> </li> <li> <p>learn an approximate maximizer</p> <p><strong>DDPG</strong> <script type="math/tex">\max_aQ_\phi(s,a)=Q_\phi(s,\arg\max_a Q_\phi(s,a))</script> idea: train another network $\mu_\theta(s)$ such that $\mu_\theta(s)\approx \arg \max_aQ_\phi(s,a)$</p> <p>how to train? 
solve $\theta \gets \arg \max_\theta Q_\phi(s,\mu_\theta(s))$ <script type="math/tex">\frac{dQ_\phi}{d\theta}=\frac{da}{d\theta}\frac{dQ_\phi}{da}</script> DDPG:</p> <ol> <li>take some action $a_i$ and observe $(s_i,a_i,s'_i,r_i)$, add it to $\mathcal{B}$</li> <li>sample a mini-batch ${s_j,a_j,s'_j,r_j}$ from $\mathcal{B}$ uniformly</li> <li>compute $y_j=r_j+\gamma Q_{\phi'}(s'_j,\mu_{\theta'}(s'_j))$ using the target nets $Q_{\phi'}$ and $\mu_{\theta'}$</li> <li>$\phi \gets \phi - \alpha\sum_j\frac{dQ_\phi}{d\phi}(s_j,a_j)(Q_\phi(s_j,a_j)-y_j)$</li> <li>$\theta \gets \theta + \beta\sum_j\frac{d\mu}{d\theta}(s_j)\frac{dQ_\phi}{da}(s_j,a)$</li> <li>update $\phi'$ and $\theta'$ (e.g., Polyak averaging)</li> </ol> </li> </ol> <h3 id="tips-for-q-learning">Tips for Q-learning</h3> <ul> <li> <p>Bellman error gradients can be big; clip gradients or use the Huber loss instead of the squared error <script type="math/tex">% <![CDATA[ L(x)=\begin{cases}x^2/2 , &\text{if} \;|x|\le\delta \cr \delta|x|-\delta^2/2, &\text{otherwise}\end{cases} %]]></script></p> </li> <li> <p>Double Q-learning helps <em>a lot</em> in practice, is simple, and has no downsides</p> </li> <li> <p>N-step returns also help a lot, but have some downsides</p> </li> <li> <p>Schedule exploration (high to low) and learning rates (high to low); the Adam optimizer can help too</p> </li> <li> <p>Run multiple random seeds; it’s very inconsistent between runs</p> </li> </ul> <h2 id="6-advanced-policy-gradients">6. 
Advanced Policy Gradients</h2> <h3 id="basics">Basics</h3> <h4 id="recap">Recap</h4> <p>Recap: policy gradient</p> <p><strong>REINFORCE</strong> algorithm</p> <ol> <li>sample ${\tau^i}$ from $\pi_\theta(a_t|s_t)$ (run the policy)</li> <li>$\Delta_\theta J(\theta)\approx\sum_{i}\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t^i|s_t^i) \left(\sum_{t'=t}^T r(s_{t'},a_{t'})\right)\right)$</li> <li>$\theta \gets\theta+\alpha\Delta_\theta J(\theta)$</li> </ol> <p>Why does policy gradient work?</p> <p>policy gradient as <strong>policy iteration</strong></p> <p>$J(\theta)=E_{\tau \sim p_\theta(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]$</p> <p><script type="math/tex">% <![CDATA[ \begin{align} J(\theta')-J(\theta)&=J(\theta')-E_{s_0 \sim p(s_0)}[V^{\pi_\theta}(s_0)]\\ &=J(\theta')-E_{\tau \sim p_{\theta'}(\tau)}[V^{\pi_\theta}(s_0)]\\ &=J(\theta')-E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tV^{\pi_\theta}(s_t)-\sum_{t=1}^\infty\gamma^tV^{\pi_\theta}(s_t)\right]\\ &=J(\theta')+E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\ &=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]+E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\ &=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(r(s_t,a_t)+\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\ &=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right] \end{align} %]]></script> so we have proved that: <script type="math/tex">J(\theta')-J(\theta)=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]</script></p> <h4 id="the-goal-is-making-things-off-policy">The goal: making things off-policy</h4> <p>But we <strong>want to 
sample</strong> from \pi_\theta not \pi_{\theta’}, we apply <strong>importance sampling</strong>: <script type="math/tex">% <![CDATA[ \begin{align} E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]&=\sum_{t=0}^{\infty}E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta'}(a_t|s_t)}\left[\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ &=\sum_{t=0}^{\infty}E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right] \end{align} %]]></script> but there <strong>still</strong> has the state sample from p_{\theta’}(s_t) , and can we approximate it as p_\theta(s_t)? so that we can use \hat{A}^\pi(s_t,a_t) to get improved policy \pi’.</p> <h3 id="bounding-the-objective-value">Bounding the objective value</h3> <p>Here we can prove that:</p> <table> <tbody> <tr> <td>\pi_{\theta’} if close to \pi_\theta if </td> <td>\pi_{\theta’(a_t</td> <td>s_t)}-\pi_\theta(a_t</td> <td>s_t)</td> <td>\le\epsilon for all s_t</td> </tr> </tbody> </table> <table> <tbody> <tr> <td></td> <td>p_{\theta’}(s_t)-p_\theta(s_t)</td> <td>\le 2\epsilon t</td> </tr> </tbody> </table> <p>The prove of this refer the lecture video or the <strong>TRPO</strong> paper.</p> <p>It’s easy to prove that: <script type="math/tex">% <![CDATA[ \begin{align} E_{p_{\theta'}}[f(s_t)]=\sum_{s_t}p_{\theta'}(s_t)f(s_t)&\ge\sum_{s_t}p_\theta(s_t)f(s_t)-|p_{\theta'}(s_t)-p_\theta(s_t)|\max_{s_t}f(s_t)\\ &\ge\sum_{s_t}p_\theta(s_t)f(s_t)-2\epsilon t\max_{s_t}f(s_t) \end{align} %]]></script> so <script type="math/tex">\sum_t E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \ge\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim 
\pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\sum_t 2\epsilon t C</script> C is O(Tr_{max}) or O(\frac{r_{max}}{1-\gamma})</p> <p>So after all the prove before, what we get? <script type="math/tex">\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \text{such that}\:\:|\pi_{\theta'(a_t|s_t)}-\pi_\theta(a_t|s_t)|\le\epsilon</script> For <strong>small enough</strong> \epsilon, this is <strong>guaranteed to improve</strong> J(\theta’)-J(\theta)</p> <table> <tbody> <tr> <td>A more convenient bound is using KL divergence: </td> <td>\pi_{\theta’(a_t</td> <td>s_t)}-\pi_\theta(a_t</td> <td>s_t)</td> <td>\le \sqrt{\frac{1}{2}D_{KL}(\pi_{\theta’}(a_t</td> <td>s_t)|\pi(a_t</td> <td>s_t))}</td> </tr> </tbody> </table> <p>\Rightarrow D_{KL}(\pi_{\theta’}(a_t|s_t)|\pi(a_t|s_t) bounds state marginal difference, where <script type="math/tex">D_{KL}(p_1(s)\|p_2(x))=E_{x\sim p_1(x)}\left[ \log \frac{p_1(x)}{p_2(x)}\right]</script> Why not using \epsilon but the D_{KL}?</p> <blockquote> <p>KL divergence has some <strong>very convenient properties</strong> that make i much easier to approximate!</p> </blockquote> <p>So the optimization becomes: <script type="math/tex">\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\\text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))\le\epsilon</script></p> <h3 id="solving-the-constrained-optimization-problem">Solving the constrained optimization problem</h3> <p>How do we enforce the <strong>constraint</strong>?</p> <p>By using <strong>dual gradient descent</strong>, we set the object function as <script 
type="math/tex">\mathcal{L}(\theta',\lambda)=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\lambda(D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))-\epsilon)</script></p> <ol> <li>Maximize \mathcal{L}(\theta’, \lambda) with respect to \theta</li> <li> <table> <tbody> <tr> <td>\lambda \gets + \alpha(D_{KL}(\pi_{\theta’}(a_t</td> <td>s_t)|\pi(a_t</td> <td>s_t))-\epsilon)</td> </tr> </tbody> </table> </li> </ol> <p>How <strong>else</strong> do we optimize the object?</p> <p>define: <script type="math/tex">% <![CDATA[ \begin{align} \bar{A}(\theta')&=\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\ \bar{A}(\theta)&=\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right] \end{align} %]]></script> applying <strong>First-order Taylor expansion</strong> and optimize <script type="math/tex">\theta' \gets \arg \max_{\theta'}\Delta_\theta\bar A(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))\le\epsilon</script> and <script type="math/tex">% <![CDATA[ \begin{align} \Delta_{\theta'}\bar A(\theta')&=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^t \Delta_{\theta'}\log{\pi_{\theta'}(a_t|s_t)}A^{\pi_\theta}(s_t,a_t)\right]\right]\\ \Delta_{\theta}\bar A(\theta)&=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t|s_t)}\left[\gamma^t \Delta_{\theta}\log{\pi_{\theta}(a_t|s_t)}A^{\pi_\theta}(s_t,a_t)\right]\right]=\Delta_\theta J(\theta) \end{align} %]]></script> so the optimization becomes <script type="math/tex">\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\ \text{such 
that}\:\:D_{KL}(\pi_{\theta'}(a_t|s_t)\|\pi(a_t|s_t))\le\epsilon</script> and gradient ascent does this: <script type="math/tex">\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\ \text{such that}\:\:\|\theta-\theta'\|\le\epsilon</script> by updating like \theta’=\theta’+\sqrt{\frac{\epsilon}{|\Delta_\theta J(\theta)|^2}}\Delta_\theta J(\theta), this is what actually gradient ascent(policy gradient) doing.</p> <p>But this (the gradient ascent constrain) is not a good constrain since some parameters change probabilities a lot more than others, and we want that the probability distributions are close.</p> <p>Applying ‘second order Taylor expansion’ to D_{KL} <script type="math/tex">D_{KL}(\pi_{\theta'}\|\pi_\theta)\approx\frac{1}{2}(\theta'-\theta)^T\pmb{F}(\theta'-\theta)</script> where \pmb{F} is the ‘Fisher-information matrix’ which can estimate with with samples <script type="math/tex">\pmb{F}=E_{\pi_\theta}[\Delta_{\theta}\log\pi_\theta(a|s)\Delta_\theta\log \pi_\theta(a|s)^T]</script> And if we use the following update <script type="math/tex">\theta'=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)\\ \alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}\Delta_\theta J(\theta)}}</script> the constrain will satisfied. and this is called the <strong>natural gradient</strong>.</p> <p><img src="/home/dd/.config/Typora/typora-user-images/1570189074407.png" alt="How natural policy gradients improves" /></p> <h3 id="practical-methods-and-notes">Practical methods and notes</h3> <ul> <li>natural policy gradient \theta’=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta) <ul> <li>Generally a good choice to stabilize policy gradient training</li> <li>See this paper for details: <ul> <li>Petters, Schaal. Reinforcement learning of motor skills with policy gradients.</li> </ul> </li> <li>Practical implementation: requires efficient Fisher-vector products, a bit non-trivial to do without computing the full matrix <ul> <li>See: Schulman et all. 
Trust region policy optimization</li> </ul> </li> </ul> </li> <li>Trust region policy optimization (<strong>TRPO</strong>) \alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}\Delta_\theta J(\theta)}}</li> <li>Just use the IS (importance sampling) objective directly (use \bar{A} as the objective) <ul> <li>Use regularization to stay close to the old policy</li> <li>See: proximal policy optimization (<strong>PPO</strong>)</li> </ul> </li> </ul> <p>So TRPO and PPO are two practical methods for solving this constrained optimization in the neural-network setting.</p> <h2 id="7-optimal-control-and-planning">7. Optimal Control and Planning</h2> <p>Recap: the reinforcement learning objective <script type="math/tex">p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]</script> In model-free RL, we do not know p(s_{t+1}|s_t,a_t).</p> <p>But in fact we sometimes do know the dynamics.</p> <ul> <li>Often we do know the dynamics</li> <li>Often we can learn the dynamics</li> </ul> <p>If we know the dynamics, what can we do?</p> <h3 id="model-based-reinforcement-learning">Model-based reinforcement learning</h3> <ol> <li> <p>Model-based reinforcement learning: learn the transition dynamics, then figure out how to choose actions</p> </li> <li> <p>How can we make decisions if we know the dynamics?</p> <p>a. How can we choose actions under perfect knowledge of the system dynamics?</p> <p>b. Optimal control, trajectory optimization, planning</p> </li> <li> <p>How can we learn <em>unknown dynamics</em>?</p> </li> <li> <p>How can we then also learn policies? (e.g. by imitating optimal control)</p> </li> </ol> <h3 id="the-objective">The objective</h3> <script type="math/tex; mode=display">\min_{a_1,...,a_T}\sum_{t=1}^Tc(s_t,a_t)\;\text{s.t.}\;s_t=f(s_{t-1},a_{t-1})</script> <h4 id="deterministic-case">Deterministic case</h4> <script type="math/tex; mode=display">a_1,...,a_T=\arg\max_{a_1,...,a_T}\sum_{t=1}^Tr(s_t,a_t)\;\text{s.t.}\;s_t=f(s_{t-1},a_{t-1})</script> <h4 id="stochastic-open-loop-case">Stochastic open-loop case</h4> <script type="math/tex; mode=display">p_\theta(s_1,...,s_T|a_1,...,a_T)=p(s_1)\prod_{t=1}^Tp(s_{t+1}|s_t,a_t)\\ a_1,...,a_T=\arg\max_{a_1,...,a_T}E\left[\sum_{t=1}^Tr(s_t,a_t)|a_1,...,a_T\right]</script> <p><strong>open-loop</strong>: choose a_1,…,a_T all at once, not step by step; <strong>closed-loop</strong>: at every step the agent gets feedback from the environment</p> <h4 id="stochastic-closed-loop-case">Stochastic closed-loop case</h4> <script type="math/tex; mode=display">p_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\\ \pi=\underset{\pi}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]</script> <h3 id="stochastic-optimization">Stochastic optimization</h3> <p>optimal control/planning: <script type="math/tex">a_1,...,a_t=\arg\max_{a_1,...,a_t}J(a_1,...,a_t)\\ A=\arg\max_AJ(A)</script></p> <h4 id="cross-entropy-method-cem">Cross-entropy method (CEM)</h4> <p>Here A is a_1,…,a_t</p> <ol> <li>sample A_1,…,A_N from p(A)</li> <li>evaluate J(A_1),…,J(A_N)</li> <li>pick M <em>elites</em> A_{i_1},…,A_{i_M} with the highest value, where M&lt;N</li> <li>refit p(A) to the elites A_{i_1},…,A_{i_M}</li> </ol> <h4 id="monte-carlo-tree-search-mcts">Monte Carlo Tree Search (MCTS)</h4> <p>Generic MCTS sketch</p> <ol> <li>find a leaf s_l using TreePolicy(s_1)</li> <li>evaluate the leaf using DefaultPolicy(s_l)</li> <li>update all values in the tree between s_1 and s_l</li> </ol> <p>take the best action from s_1 and repeat</p> <p>every node stores Q and N, where Q is the estimated value and N is the visit count</p> <p><strong>UCT</strong> TreePolicy(s_t)</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if s_t not fully expanded, choose new a_t
else choose child with best Score(s_{t+1})
</code></pre></div></div> <p>where <script type="math/tex">\text{Score}(s_t) = \frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2\ln N(s_{t-1})}{N(s_t)}}</script></p> <p>For more about MCTS, see Browne et al. A Survey of Monte Carlo Tree Search Methods (2012).</p> <h2 id="optimal-control">Optimal control</h2> <p>Here we show the optimization process when we know the environment dynamics; this is mostly material from control theory.</p> <p><strong>Deterministic</strong> case <script type="math/tex">\min_{u_1,...,u_T}\sum_{t=1}^Tc(x_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\ \min_{u_1,...,u_T}c(x_1,u_1)+c(f(x_1,u_1),u_2)+...+c(f(f(...)...),u_T)</script></p> <h4 id="shooting-methods-vs-collocation">Shooting methods vs collocation</h4> <p>the previous CEM is actually a random shooting method.</p> <p>collocation method: optimize over actions and states, with constraints. 
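</p> <p>The random-shooting/CEM planner described above can be sketched in a few lines of numpy (a minimal sketch; the evaluator J is a placeholder for whatever total-reward computation the planner uses, and the noise floor on the standard deviation is a common practical tweak to avoid premature convergence):</p>

```python
import numpy as np

def cem_plan(J, horizon, n_samples=64, n_elites=8, iters=30):
    """Cross-entropy method for an open-loop action sequence A = (a_1, ..., a_T)."""
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        A = np.random.randn(n_samples, horizon) * std + mean  # 1. sample A_1..A_N from p(A)
        scores = np.array([J(a) for a in A])                  # 2. evaluate J(A_i)
        elites = A[np.argsort(scores)[-n_elites:]]            # 3. pick the M best elites
        mean = elites.mean(axis=0)                            # 4. refit p(A) to the elites
        std = np.maximum(elites.std(axis=0), 0.1)             # noise floor (practical tweak)
    return mean
```

<p>Running this on a toy objective such as J(A)=-\sum_t(a_t-a_t^*)^2 recovers the optimal sequence to within the noise floor.</p> <p>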
<script type="math/tex">\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(x_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})</script></p> <h4 id="linear-case-lqr">Linear case: LQR</h4> <script type="math/tex; mode=display">\min_{u_1,...,u_T}c(x_1,u_1)+c(f(x_1,u_1),u_2)+...+c(f(f(...)...),u_T)</script> <p>Linear case: the case where f is a <strong>linear</strong> function and the cost is a <strong>quadratic</strong> function <script type="math/tex">f(x_t,u_t)=F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t\\ c(x_t,u_t)=\frac{1}{2}\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^TC_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^Tc_t</script> Where <script type="math/tex">% <![CDATA[ C_T=\begin{bmatrix} C_{x_T,x_T} & C_{x_T,u_T} \\ C_{u_T,x_T} & C_{u_T,u_T} \end{bmatrix}\\ c_T=\begin{bmatrix} c_{x_T}\\ c_{u_T} \end{bmatrix} %]]></script> Base case: solve for u_T only <script type="math/tex">% <![CDATA[ \begin{align} Q(x_T,u_T)&= \text{const}+\frac{1}{2}\begin{bmatrix} x_{T} \\ u_{T} \\ \end{bmatrix}^TC_T\begin{bmatrix} x_{T} \\ u_{T} \\ \end{bmatrix}+\begin{bmatrix} x_{T} \\ u_{T} \\ \end{bmatrix}^Tc_T\\ \Delta_{u_T}Q(x_T,u_T)&=C_{u_T,x_T}x_T+C_{u_T,u_T}u_T+c_{u_T}^T=0\\ u_T&=-C_{u_T,u_T}^{-1}(C_{u_T,x_T}x_T+c_{u_T})\\ u_T&=K_Tx_T+k_T\\ K_T&=-C_{u_T,u_T}^{-1}C_{u_T,x_T}\\ k_T&=-C_{u_T,u_T}^{-1}c_{u_T} \end{align} %]]></script> We substitute u_T=K_Tx_T+k_T to eliminate u_T <script type="math/tex">% <![CDATA[ \begin{align} V(x_T)&= \text{const}+\frac{1}{2}\begin{bmatrix} x_{T} \\ K_Tx_T+k_T \\ \end{bmatrix}^TC_T\begin{bmatrix} x_{T} \\ K_Tx_T+k_T \\ \end{bmatrix}+\begin{bmatrix} x_{T} \\ K_Tx_T+k_T \\ \end{bmatrix}^Tc_T\\ V(x_T)&=\text{const}+\frac{1}{2}x_T^TV_Tx_T+x_T^Tv_T \end{align} %]]></script> Then solve for u_{T-1} in terms of x_{T-1} <script type="math/tex">% <![CDATA[ \begin{align} f(x_{T-1},u_{T-1})&=x_T=F_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \\ \end{bmatrix}+f_{T-1}\\ Q(x_{T-1},u_{T-1})&=\frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \\ \end{bmatrix}^TC_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \\ \end{bmatrix}+\begin{bmatrix} x_{T-1} \\ u_{T-1} \\ \end{bmatrix}^Tc_{T-1}+V(f(x_{T-1},u_{T-1}))\\ V(f(x_{T-1},u_{T-1}))&=\text{const}+\frac{1}{2}x_T^TV_Tx_T+x_T^Tv_T\\ &\text{and then replace } x_T \text{ with the dynamics } f(x_{T-1},u_{T-1}) \end{align} %]]></script> then proceed as in the T case, which yields analogous results.</p> <h5 id="backward-recursion">backward recursion</h5> <p>for t=T to 1: <script type="math/tex">% <![CDATA[ \begin{align} Q_t&=C_t+F_t^TV_{t+1}F_t\\ q_t&=c_t+F_t^TV_{t+1}f_t+F_t^Tv_{t+1}\\ Q(x_t,u_t)&=\text{const}+\frac{1}{2}\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^TQ_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}^Tq_t\\ u_t &\gets \arg\min_{u_t}Q(x_t,u_t)=K_tx_t+k_t\\ K_t&=-Q_{u_t,u_t}^{-1}Q_{u_t,x_t}\\ k_t&=-Q_{u_t,u_t}^{-1}q_{u_t}\\ V_t&=Q_{x_t,x_t}+Q_{x_t,u_t}K_t+K_t^TQ_{u_t,x_t}+K_t^TQ_{u_t,u_t}K_t\\ v_t&=q_{x_t}+Q_{x_t,u_t}k_t+K_t^Tq_{u_t}+K_t^TQ_{u_t,u_t}k_t\\ V(x_t)&=\text{const}+\frac{1}{2}x_t^T V_tx_t+x_t^Tv_t\\ V(x_t)&=\min_{u_t} Q(x_t,u_t) \end{align} %]]></script></p> <h5 id="forward-recursion">forward recursion</h5> <p>For t=1 to T: <script type="math/tex">u_t=K_tx_t+k_t\\ x_{t+1}=f(x_t,u_t)</script></p> <h5 id="stochastic-dynamics">Stochastic dynamics</h5> <p>If the transition probability is Gaussian with a linear mean and fixed covariance, then the same algorithm can be applied, due to the symmetry of the Gaussian. 
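</p> <p>The backward recursion above can be sketched in numpy for the simplified case c_t = f_t = 0 (a minimal sketch with illustrative names, not a full implementation; F and C are the block matrices defined above):</p>

```python
import numpy as np

def lqr_backward(F, C, T, n):
    """Finite-horizon LQR backward recursion (f_t = c_t = 0 for brevity).

    F: (n, n+m) dynamics matrix, x_{t+1} = F [x_t; u_t]
    C: (n+m, n+m) quadratic cost matrix; n is the state dimension.
    Returns the gains K_1, ..., K_T (u_t = K_t x_t).
    """
    V = np.zeros((n, n))                                 # V_{T+1} = 0
    gains = []
    for _ in range(T):
        Q = C + F.T @ V @ F                              # Q_t = C_t + F_t^T V_{t+1} F_t
        Qxx, Qxu = Q[:n, :n], Q[:n, n:]
        Qux, Quu = Q[n:, :n], Q[n:, n:]
        K = -np.linalg.solve(Quu, Qux)                   # K_t = -Q_uu^{-1} Q_ux
        V = Qxx + Qxu @ K + K.T @ Qux + K.T @ Quu @ K    # V_t
        gains.append(K)
    return gains[::-1]                                   # ordered K_1 ... K_T
```

<p>The forward recursion is then just u_t = K_t x_t, x_{t+1} = F [x_t; u_t]; on a double integrator this drives the state to the origin.</p> <p>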
<script type="math/tex">f(x_t,u_t)=F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t\\ x_{t+1}\sim p(x_{t+1}|x_t,u_t)\\ p(x_{t+1}|x_t,u_t)=\mathcal{N}\left(F_t\begin{bmatrix} x_{t} \\ u_{t} \\ \end{bmatrix}+f_t, \Sigma_t\right)</script></p> <h4 id="nonlinear-case-ddpiterative-lqr">Nonlinear case: DDP/iterative LQR</h4> <p>approximate a nonlinear system as a linear-quadratic system using a <strong>Taylor expansion</strong> <script type="math/tex">f(x_t,u_t)\approx f(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}f(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}\\ c(x_t,u_t)\approx c(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}+\frac{1}{2}\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}^T\Delta^2_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_{t}-\hat{x}_t \\ u_{t}-\hat{u}_t \\ \end{bmatrix}</script></p> <script type="math/tex; mode=display">\bar{f}(\delta x_t,\delta u_t)=F_t\begin{bmatrix} \delta x_t \\ \delta u_t \\ \end{bmatrix}\\ \bar{c}=\frac{1}{2}\begin{bmatrix} \delta x_{t} \\ \delta u_{t} \\ \end{bmatrix}^TC_t\begin{bmatrix} \delta x_{t} \\ \delta u_{t} \\ \end{bmatrix}+\begin{bmatrix} \delta x_{t} \\ \delta u_{t} \\ \end{bmatrix}^Tc_t\\ \delta x_t= x_t-\hat{x}_t\\ \delta u_t= u_t-\hat{u}_t</script> <p>In fact, this is just Newton’s method for trajectory optimization.</p> <p>For more on Newton’s method for trajectory optimization, see the following papers:</p> <ol> <li>Differential Dynamic Programming (1970)</li> <li>Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization (2012) <ul> <li>practical guide for implementing non-linear iterative LQR.</li> </ul> </li> <li>Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics (2014) <ul> <li>Probabilistic formulation and trust-region alternative to deterministic line search.</li> </ul> </li> </ol> <h2 
id="8-model-based-reinforcement-learning-learning-the-model">8. Model-Based Reinforcement Learning (learning the model)</h2> <h3 id="basic">Basic</h3> <p>Why learn the model?</p> <blockquote> <p>If we knew f(s_t,a_t)=s_{t+1} (or p(s_{t+1}|s_t,a_t) in the stochastic case), we could use the tools from the last lecture.</p> </blockquote> <p>model-based reinforcement learning <strong>version 0.5</strong>:</p> <ol> <li>run base policy \pi_0(a_t|s_t) (e.g., a random policy) to collect \mathcal{D}={(s,a,s')_i}</li> <li>learn dynamics model f(s,a) to minimize \sum_i\|f(s_i,a_i)-s_i'\|^2</li> <li>plan through f(s,a) to choose actions</li> </ol> <p>Does it work?</p> <ul> <li>This is how <strong>system identification</strong> works in classical robotics</li> <li>Some care should be taken to design a good base policy</li> <li>Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters</li> <li>The model is only fit to data from the base policy, but the final policy visits states beyond that distribution, which causes the <strong>distribution mismatch problem</strong>.</li> </ul> <h3 id="over-fitting-problem">Over-fitting problem</h3> <h4 id="distribution-mismatch-problem">Distribution mismatch problem</h4> <p>Can we do better?</p> <p>can we make p_{\pi_0}(s_t)=p_{\pi_f}(s_t)?</p> <p>model-based reinforcement learning <strong>version 1.0:</strong></p> <ol> <li>run base policy \pi_0(a_t|s_t) (e.g., a random policy) to collect \mathcal{D}={(s,a,s')_i}</li> <li>learn dynamics model f(s,a) to minimize \sum_i\|f(s_i,a_i)-s_i'\|^2</li> <li>plan through f(s,a) to choose actions</li> <li>execute those actions and add the resulting data {(s,a,s')_j} to \mathcal{D}; repeat steps 2~4</li> </ol> <p>But the model has errors, so it may lead to some bad actions. How can we address that?</p> <h4 id="mpc">MPC</h4> <p>model-based reinforcement learning <strong>version 1.5</strong>:</p> <ol> <li>run base policy \pi_0(a_t|s_t) (e.g., a random policy) to collect \mathcal{D}={(s,a,s')_i}</li> <li>learn dynamics model f(s,a) to minimize \sum_i\|f(s_i,a_i)-s_i'\|^2</li> <li>plan through f(s,a) to choose actions</li> <li>execute the <strong>first</strong> planned action, observe the resulting state s' (<strong>MPC</strong>)</li> <li>append (s,a,s') to dataset \mathcal{D}; repeat steps 3~5, and every N steps repeat steps 2~5</li> </ol> <h4 id="using-model-uncertainty">Using model uncertainty</h4> <p>Can we do better by using model <strong>uncertainty</strong>?</p> <p>How to get uncertainty?</p> <ol> <li>use the output entropy (a bad idea)</li> <li>estimate model uncertainty</li> </ol> <script type="math/tex; mode=display">\int p(s_{t+1}|s_t,a_t,\theta)p(\theta|\mathcal{D})d\theta</script> <ul> <li> <p>one way to get this is with Bayesian neural networks (BNN) (introduced later)</p> </li> <li> <p>another way is to train multiple models and see whether they agree with each other (<strong>bootstrap ensembles</strong>)</p> </li> </ul> <script type="math/tex; mode=display">p(\theta|\mathcal{D})\approx\frac{1}{N}\sum_i\delta(\theta-\theta_i)\\ \int p(s_{t+1}|s_t,a_t,\theta)p(\theta|\mathcal{D})d\theta\approx\frac{1}{N}\sum_ip(s_{t+1}|s_t,a_t,\theta_i)</script> <p>How to train?</p> <blockquote> <p>main idea: need to generate “independent” datasets to get “independent” models.</p> <p>we can do this by re-sampling the dataset with replacement, which gives datasets with the same distribution but different compositions</p> </blockquote> <p>Does this work?</p> <blockquote> <p>This basically works</p> <p>Very crude approximation, because the number of models is usually small (&lt;10)</p> <p>Re-sampling with replacement is usually unnecessary, because SGD and random initialization usually make the models sufficiently independent</p> 
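<p>A minimal sketch of this re-sampling step (hypothetical helper; each ensemble member gets its own bootstrapped dataset):</p>

```python
import numpy as np

def bootstrap_datasets(data, n_models, seed=0):
    """Resample the dataset with replacement, once per ensemble member."""
    rng = np.random.default_rng(seed)
    n = len(data)
    return [[data[i] for i in rng.integers(0, n, size=n)]
            for _ in range(n_models)]
```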
</blockquote> <p>For candidate action sequence a_1,…,a_H:</p> <ol> <li>sample \theta\sim p(\theta|\mathcal{D})</li> <li>at each time step t, sample s_{t+1}\sim p(s_{t+1}|s_t,a_t,\theta)</li> <li>calculate R=\sum_tr(s_t,a_t)</li> <li>repeat steps 1 to 3 and accumulate the average reward</li> </ol> <h3 id="model-based-rl-with-images-pomdp">Model-based RL with images (POMDP)</h3> <h4 id="model-based-rl-with-latent-space-models">Model-based RL with latent space models</h4> <p>What about <strong>complex observations</strong>?</p> <ul> <li>High dimensionality</li> <li>Redundancy</li> <li>Partial observability</li> </ul> <script type="math/tex; mode=display">\max_\phi\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(s_{t+1,i}|s_{t,i},a_{t,i})+\log p_\phi(o_{t,i}|s_{t,i})]</script> <p>learn an <em>approximate</em> posterior q_\psi(s_t|o_{1:t},a_{1:t})</p> <p>other choices:</p> <ul> <li>q_\psi(s_t,s_{t+1}|o_{1:t},a_{1:t})</li> <li>q_\psi(s_t|o_t)</li> </ul> <p>here we only estimate q_\psi(s_t|o_t), and assume that q_\psi(s_t|o_t) is <em>deterministic</em>; the stochastic case requires variational inference (later)</p> <p><strong>Deterministic encoder</strong> <script type="math/tex">q_\psi(s_t|o_t)=\delta(s_t=g_\psi(o_t))\Rightarrow s_t=g_\psi(o_t)</script> and the reward model may also need to be learned. 
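</p> <p>The candidate-evaluation procedure above (sample \theta_i, roll out, average the return) can be sketched as follows (a minimal sketch; models stands in for the sampled \theta_i, and reward_fn is assumed given):</p>

```python
import numpy as np

def evaluate_candidate(models, reward_fn, s0, actions, n_rollouts=10, seed=0):
    """Average return of an action sequence under a bootstrap ensemble.

    models: list of step functions s_next = model(s, a), one per sampled theta_i.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        model = models[rng.integers(len(models))]  # 1. sample theta ~ p(theta | D)
        s, R = s0, 0.0
        for a in actions:
            R += reward_fn(s, a)                   # 3. accumulate r(s_t, a_t)
            s = model(s, a)                        # 2. sample s_{t+1} ~ p(. | s_t, a_t, theta)
        total += R
    return total / n_rollouts                      # 4. average over rollouts
```

<p>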
<script type="math/tex">\max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i})|g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i}|g_\psi(o_{t,i}))]\\ \max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i})|g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i}|g_\psi(o_{t,i}))+\log p_\phi(r_{t,i}|g_\psi(o_{t,i}))]</script></p> <p>Model-based RL with latent space models:</p> <ol> <li>run base policy \pi_0(a_t|o_t) (e.g., a random policy) to collect \mathcal{D}={(o,a,o')_i}</li> <li>learn p_\phi(s_{t+1}|s_t,a_t), p_\phi(r_t|s_t), p_\phi(o_t|s_t), and the encoder g_\psi(o_t)</li> <li>plan through the model to choose actions</li> <li>execute the <strong>first</strong> planned action, observe the resulting observation o' (<strong>MPC</strong>)</li> <li>append (o,a,o') to dataset \mathcal{D}; repeat steps 3~5, and every N steps repeat steps 2~5</li> </ol> <h4 id="learn-directly-in-observation-space">Learn directly in observation space</h4> <p>directly learn p(o_{t+1}|o_t,a_t), i.e., do image prediction</p> <p>learn the reward, or set a goal observation</p> <h2 id="9-model-based-rl-and-policy-learning">9. 
Model-Based RL and Policy Learning</h2> <h3 id="basic-1">Basic</h3> <p>What if we want a policy rather than just optimal control?</p> <ul> <li>Do not need to re-plan (faster)</li> <li>Potentially better generalization</li> <li>Closed loop control</li> </ul> <p>Back-propagate directly into the policy</p> <p>model-based reinforcement learning <strong>version 2.0</strong>:</p> <ol> <li> <table> <tbody> <tr> <td>run base policy \pi_0(a_t</td> <td>s_t) (e.g., random policy) to collect \mathcal{D}={(s,a,s’)_i}</td> </tr> </tbody> </table> </li> <li>learning dynamics model f(s,a) to minimize \sum_i|f(s_i,a_i)-s_i’|^2</li> <li> <table> <tbody> <tr> <td>back-propagate through f(s,a) into the policy to optimize \pi_\theta(a_t</td> <td>s_t)</td> </tr> </tbody> </table> </li> <li> <table> <tbody> <tr> <td>run \pi_\theta (a_t</td> <td>s_t), appending the visited tuples (s,a,s’) to \mathcal{D} , repeat steps 2~4</td> </tr> </tbody> </table> </li> </ol> <p>What’s the <strong>problem</strong>?</p> <ul> <li>similar parameter sensitivity problems as shooting methods <ul> <li>But no longer have convenient second order LQR-like method, because policy parameters <strong>couple</strong> all the tie steps, so no dynamic programming</li> </ul> </li> <li>Similar problem to training long RNNs with BPTT <ul> <li>Vanishing and exploding gradients</li> <li>Unlike LSTM, we can’t just “choose” a simple dynamics, dynamics are chosen by nature</li> </ul> </li> </ul> <h3 id="guided-policy-search">Guided policy search</h3> <script type="math/tex; mode=display">\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(s_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})</script> <script type="math/tex; mode=display">\min_{u_1,...,u_T,x_1,...,x_T,\theta}\sum_{t=1}^Tc(s_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1}), u_t=\pi_\theta(x_t)\\ \min_{u_1,...,u_T,x_1,...,x_T,\theta}\sum_{t=1}^Tc(s_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\ \:\: \text{s.t.}\:\:u_t=\pi_\theta(x_t)</script> <p>How to deal with 
constrain?</p> <h4 id="dual-gradient-decent-dgd">Dual gradient decent (DGD)</h4> <script type="math/tex; mode=display">\min_xf(x)\:\:\text{s.t.}\:C(x)=0\:\:\:\:\:\: \mathcal{L}(x,\lambda)=f(x)+\lambda C(x)\\ g(\lambda)=\mathcal{L}(x^*(\lambda),\lambda)\\ x^*=\arg\min_x\mathcal{L}(x,\lambda)\\ \frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)</script> <ol> <li>Find x^*\gets \arg\min_x\mathcal{L}(x,\lambda)</li> <li>Compute \frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)</li> <li>\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}</li> </ol> <p>A small tweak to DGD: augmented Lagrangian</p> <script type="math/tex; mode=display">\bar{\mathcal{L}}(x,\lambda)=f(x)+\lambda C(x)+\rho\|C(x)\|^2</script> <ol> <li>Find x^*\gets \arg\min_x\bar{\mathcal{L}}(x,\lambda)</li> <li>Compute \frac{dg}{d\lambda}=\frac{d\bar{\mathcal{L}}}{d\lambda}(x^*,\lambda)</li> <li>\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}</li> </ol> <p>When far from solution, quadratic term tends to improve stability</p> <p>Constraining trajectory optimization with dual gradient descent <script type="math/tex">\min_{\tau,\theta}c(\tau)\:\:\text{s.t.}\:\:u_t=\pi_\theta(x_t)\\ \bar{\mathcal{L}}(\tau,\theta,\lambda)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2</script></p> <h4 id="guided-policy-search-gps-discussion">Guided policy search (GPS) discussion</h4> <ol> <li>Find \tau \gets \arg\min_\tau \bar{\mathcal{L}}(\tau,\theta,\lambda) (e.g. via iLQR or other planning methods)</li> <li>Find \theta \gets \arg\min_\theta\bar{\mathcal{L}}(\tau, \theta, \lambda) (e.g. 
via SGD)</li> <li>\lambda \gets \lambda+\alpha \frac{dg}{d\lambda} and repeat</li> </ol> <ul> <li>Can be interpreted as constrained trajectory optimization method</li> <li>Can be interpreted as imitation of optimal control expert, since step 2 is just supervised learning</li> <li>The optimal control “teacher” adapts to the learner , and avoids actions that the learner can’t mimic</li> </ul> <p>General guided policy search scheme</p> <ol> <li>Optimize p(\tau) with respect to some surrogate \tilde{c}(x_t,u_t)</li> <li>Optimize \theta with respect to some supervised objective</li> <li>Increment or modify dual variables \lambda</li> </ol> <p>Need to choose:</p> <ul> <li>form of p(\tau) or \tau (if deterministic)</li> <li>optimization method for p(\tau) or \tau</li> <li>surrogate \tilde{c}(x_t,u_t)</li> <li> <table> <tbody> <tr> <td>supervised objective for \pi_\theta(u_t</td> <td>x_t)</td> </tr> </tbody> </table> </li> </ul> <h5 id="deterministic-case-1">Deterministic case</h5> <script type="math/tex; mode=display">\min_{\tau,\theta}c(\tau)\:\:\:\text{s.t.}\:\:\:u_t=\pi_\theta(x_t)\\ \bar{\mathcal{L}}(\tau,\theta,\lambda)=\tilde{c}(\tau)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2</script> <ol> <li>Optimize \tau with respect to some surrogate \tilde{c}(x_t,u_t)</li> <li>Optimize \theta with respect to some supervised objective</li> <li>Increment or modify dual variables \lambda. repeat 1~3</li> </ol> <p>Learning with multiple trajectories</p> <script type="math/tex; mode=display">\min_{\tau_1,...,\tau_N,\theta}\sum_{i=1}^{N}c(\tau_i)\:\:\:\text{s.t.}\:\:\:u_{t,i}=\pi_\theta(x_{t,i})\:\:\forall i\:\forall t</script> <ol> <li>Optimize each \tau_i <em>in parallel</em> with respect to \tilde{c}(x_t,u_t)</li> <li>Optimize \theta with respect to some supervised objective</li> <li>Increment or modify dual variables \lambda. 
repeat 1~3</li> </ol> <h5 id="stochastic-gaussian-gps">Stochastic (Gaussian) GPS</h5> <script type="math/tex; mode=display">\min_{p,\theta}E_{\tau\sim p(\tau)}{c(\tau)}\:\:\text{s.t.}\:\:p(u_t|x_t)=\pi_\theta(u_t|x_t)\\ p(u_t|x_t)=\mathcal{N}(K_t(x_t-\hat{x}_t)+k_t+\hat{u}_t,\Sigma_t)</script> <ol> <li>Optimize p(\tau) with respect to some surrogate \tilde{c}(x_t,u_t)</li> <li>Optimize \theta with respect to some supervised objective</li> <li>Increment or modify dual variables \lambda</li> </ol> <blockquote> <p>Here, a little different from pure imitation learning that mimic the optimal control result, the agent imitate the planning results, and then if it can not perform good imitation, then the optimization process will adjust its planning policy to fit the learning policy since the policy is a constrain of planning.</p> </blockquote> <p>Input Remapping Trick <script type="math/tex">\min_{p,\theta}E_{\tau\sim p(\tau)}[{c(\tau)}]\:\:\text{s.t.}\:\:p(u_t|x_t)=\pi_\theta(u_t|o_t)\</script></p> <h3 id="imitation-optimal-control">Imitation optimal control</h3> <h4 id="imitation-optimal-control-with-dagger">Imitation optimal control with DAgger</h4> <ol> <li>from current state s_t, run MCTS to get a_t,a_{t+1},…</li> <li>add (s_t,a_t) to dataset \mathcal{D}</li> <li> <table> <tbody> <tr> <td>execute action s_t\sim\pi(a_t</td> <td>s_t) (not MCTS action!). 
repeat 1~3 N times</td> </tr> </tbody> </table> </li> <li>update the policy by training on \mathcal{D}</li> </ol> <p>Problems of the original DAgger</p> <ul> <li>Ask human to label the state from other policy is hard</li> <li>run the initial not good policy in real world is dangerous in some applications</li> </ul> <p>We address the first problem with a planning method, and how about the the second problem?</p> <h4 id="imitating-mpc-plato-algorithm">Imitating MPC: PLATO algorithm</h4> <ol> <li> <table> <tbody> <tr> <td>train \pi_\theta(u_t</td> <td>o_t) from labeled data \mathcal{D}={o_1,u_1,…,o_N,u_N}</td> </tr> </tbody> </table> </li> <li> <table> <tbody> <tr> <td>run \hat{\pi}(u_t</td> <td>o_t) to get dataset \mathcal{D}_\pi={o_1,…,o_M}</td> </tr> </tbody> </table> </li> <li>Ask computer to label \mathcal{D_\pi} with actions u_t</li> <li>Aggregate: \mathcal{D}\gets\mathcal{D}\cup\mathcal{D}_\pi</li> </ol> <p><strong>Simple</strong> stochastic policy: \hat{\pi}(u_t|x_t)=\mathcal{N}(K_tx_t+k_t, \Sigma_{u_t}) <script type="math/tex">\hat{\pi}(u_t|x_t)=\arg\min_{\hat{\pi}}\sum_{t'=t}^TE_{\hat{\pi}}[c(x_{t'},u_{t'})]+\lambda D_{KL}(\hat{\pi}(u_t|x_t)\|\pi_\theta(u_t|o_t))</script></p> <blockquote> <p>Here the \hat{\pi} is re-planed by optimal control method, for simplicity, choose Gaussian policy since it is easy to plan with LQR, and the planning also add the KL constrain, which make sure the behavior policy is not far from the learning policy, but with this planning, it move actions away from some very bad (dangerous) actions.</p> </blockquote> <h5 id="dagger-vs-gps">DAgger vs GPS</h5> <ul> <li>DAgger does not require an adaptive expert <ul> <li>Any expert will do, so long as states from learned policy can be labeled</li> <li>Assumes it is possible to match expert’s behavior up to bounded loss <ul> <li>Not always possible (e.g. 
partially observed domains)</li> </ul> </li> </ul> </li> <li>GPS adapts the “expert” behavior <ul> <li>Does not require bounded loss on initial expert (expert will change)</li> </ul> </li> </ul> <h5 id="why-imitate">Why imitate?</h5> <ul> <li> <p>Relatively stable and easy to use</p> <ul> <li> <p>Supervised learning works very well</p> </li> <li> <p>control/planning (usually) works very well</p> </li> <li> <p>The combination of the two (usually) works very well</p> </li> </ul> </li> <li>Input remapping trick: can exploit availability of additional information at training time to learn policy from raw observations. (planning with state and learning policy with observations)</li> <li>overcomes optimization challenges of back-propagating into the policy directly</li> </ul> <h3 id="model-free-optimization-with-a-model">Model-free optimization with a model</h3> <ul> <li>just use policy gradient (or another model-free RL method) even though you have a model (just treat the model as a simulator)</li> <li>Sometimes better than using the gradients!</li> </ul> <h4 id="dyna">Dyna</h4> <p>on-line Q-learning algorithm that performs model-free RL with a model</p> <ol> <li>given state s, pick action a using exploration policy</li> <li>observe s’ and r, to get transition (s,a,s’,r)</li> <li>update model \hat{p}(s’|s,a) and \hat{r}(s,a) using (s,a,s’)</li> <li>Q-update: Q(s,a)\gets Q(s,a)+\alpha E_{s’,r}[r+\max_{a’}Q(s’,a’)-Q(s,a)]</li> <li>repeat K times: <ol> <li>sample (s,a)\sim\mathcal{B} from buffer of past states and actions</li> <li>Q-update: Q(s,a)\gets Q(s,a)+\alpha E_{s’,r}[r+\max_{a’}Q(s’,a’)-Q(s,a)]</li> </ol> </li> </ol> <p>when the model becomes better, re-evaluate the old states to make the estimates more accurate.</p> <h4 id="general-dyna-style-model-based-rl-recipe">General “Dyna-style” model-based RL recipe</h4> <ol> <li>given state s, pick action a using exploration policy</li> <li>learn model \hat{p}(s’|s,a) (and optionally, \hat{r}(s,a))</li> <li>repeat K times: <ol> <li>sample s\sim\mathcal{B} from buffer</li> <li>choose action a (from \mathcal{B}, from \pi, or random)</li> <li>simulate s’\sim\hat{p}(s’|s,a) (and r=\hat{r}(s,a))</li> <li>train on (s,a,s’,r) with model-free RL</li> <li>(optional) take N more model-based steps</li> </ol> </li> </ol> <p>This only requires short (as few as one-step) rollouts from the model, which accumulate little error.</p> <h3 id="model-based-rl-algorithms-summary">Model-based RL algorithms summary</h3> <h4 id="methods">Methods</h4> <ul> <li>Learn model and plan (without policy) <ul> <li>Iteratively collect more data to overcome distribution mismatch</li> <li>Re-plan every time step (MPC) to mitigate small model errors</li> </ul> </li> <li>Learning policy <ul> <li>Back-propagate into policy (e.g., PILCO)–simple but potentially unstable</li> <li>imitate optimal control in a constrained optimization framework (e.g., GPS)</li> <li>imitate optimal control via DAgger-like process (e.g., PLATO)</li> <li>Use model-free algorithm with a model (Dyna, etc.)</li> </ul> </li> </ul> <h4 id="limitation-of-model-based-rl">Limitation of model-based RL</h4> <ul> <li>Need some kind of model <ul> <li>Not always available</li> <li>Sometimes harder to learn than the policy</li> </ul> </li> <li>Learning the model takes time &amp; data <ul> <li>Sometimes expressive model classes (neural nets) are not fast</li> <li>Sometimes fast model classes (linear models) are not expressive</li> </ul> </li> <li>Some kind of additional assumptions <ul> <li>Linearizability/continuity</li> <li>Ability to reset the system (for local linear models)</li> <li>Smoothness (for GP-style global model)</li> <li>Etc.</li> </ul> </li> </ul> <blockquote> <p>Here are some of my understandings of model-based RL:</p> <p>First, <strong>why</strong> do we need model-based 
RL?</p> <p>Model-free RL learns everything from experience. The state space may be very large, and learning from scratch needs a lot of exploration; otherwise it may be hard to converge, or it may very likely converge to a local optimum.</p> <p>But in model-based RL, the model is known or already learned, so the very hard exploration process is shifted to planning, which can find decent directions that lead to good results, either by optimal control methods or just by searching via simulation with the model. After planning, fairly promising trajectories are generated, and the policy only needs to learn to imitate these good trajectories, which removes a lot of random exploration.</p> <p>Second, why not just use optimal control rather than learning a policy?</p> <p>Actually, you can just use optimal control methods, like traditional control methods, or MPC.</p> <p>However, not every model admits an explicit optimal control method like LQR, since the model may be hard to solve mathematically. In addition, using a neural network may give better generalization, and closed-loop control seems better.</p> <p>Third, what kind of method can I use in model-based RL?</p> <ul> <li>learn the model and just use planning, without learning a policy</li> <li>learn a policy by guided policy search</li> <li>imitate optimal control with DAgger</li> </ul> </blockquote> <h3 id="what-kind-of-algorithm-should-i-use">What kind of algorithm should I use?</h3> <p>ranked by required sample efficiency (low to high), and correspondingly by computational efficiency (high to low)</p> <ul> <li>gradient-free methods (e.g. NES, CMA, etc.)</li> <li>fully on-line methods (e.g. A3C)</li> <li>policy gradient methods (e.g. TRPO)</li> <li>replay buffer value estimation methods (Q-learning, DDPG, NAF, SAC, etc.)</li> <li>model-based deep RL (e.g. PETS, guided policy search)</li> <li>model-based “shallow” RL (e.g. 
PILCO)</li> </ul>Dongda Lidongdongbhbh@gmail.comBackgroundFor research beginner2019-09-27T00:00:00+08:002019-09-27T00:00:00+08:00https://dongdongbh.tech/good-research<p>Recently, a friend of mine told me that he learned a lot of things from my posts and Github. I’m so happy that people can benefit from my posts, which encourages me to keep updating my site.</p> <p>In this post, I want to share my views about how to be a good researcher.</p> <h2 id="what-is-a-good-research-work">What is a good research work?</h2> <p>First, what is a good research work?</p> <blockquote> <p>Many people think that a good researcher is the one who has published lots of research papers. But in my view, quality is more important than quantity.</p> </blockquote> <p>So how to define the quality of a research paper?</p> <blockquote> <p>Many people think a good paper is one that has lots of citations. Yes, of course, citation count is a good metric for evaluating a research paper. However, in my view, a really good paper is one that still attracts new citations more than ten years after it was published. A good work either proposes a deep theoretical view, or finds a good method that greatly influences the development of the field.</p> </blockquote> <h2 id="how-to-start">How to start?</h2> <p>First of all, people should know whether they want to do research. In my opinion, a researcher should have a strong desire to make things better.</p> <p>If you find that you have a very strong motivation to do research, here is some advice for you.</p> <ul> <li>Please read every research document in <strong>English</strong>; the original research paper or book should be read first, then you can find some blogs explaining that. 
Never read material in your own language, even if it is well translated.</li> <li>When reading a paper, you should <strong>think</strong> about it deeply: what is the underlying reason that makes things better? Is there anything strange that could be improved? If you do not understand it, try to dig into it a little bit harder and make sure you fully understand it, or find some source code to see the details of its implementation. A good researcher is not only a good coder, but a good thinker; never stop at the superficial level, but look for the fundamental reason.</li> <li>Improve your English <strong>writing</strong> skill, not only the grammar and vocabulary, but also the <strong>logic</strong> of the essay.</li> <li>One should not hold the view that doing things just so-so, “almost there”, is enough, but rather have a kind of Obsessive-Compulsive Disorder (OCD): I must find the reason, I must do it best.</li> <li>Do not trust authority; believe your strong intuition.</li> </ul>Dongda Lidongdongbhbh@gmail.comRecently, a friend of mine told me that he learned a lot of things from my posts and Github. 
I’m so happy that people can benefit from my posts, which encourages me to keep updating my site.Linux tricks2019-09-20T00:00:00+08:002019-09-20T00:00:00+08:00https://dongdongbh.tech/linux-terminal-shortcut<h2 id="tricks-on-linux-command-line">Tricks on Linux command line</h2> <p><code class="highlighter-rouge">cd -</code>: back to the last working directory</p> <h4 id="running-multiple-commands">Running multiple commands</h4> <ul> <li> <p>Running multiple commands in one single command</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>command_1; command_2; command_3 </code></pre></div> </div> </li> <li> <p>Running multiple commands in one single command only if the <strong>previous command was successful</strong></p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>command_1 &amp;&amp; command_2 </code></pre></div> </div> </li> </ul> <h4 id="previous-commands-and-arguments">Previous commands and arguments</h4> <p><code class="highlighter-rouge">!!</code>: the last command</p> <p><code class="highlighter-rouge">!$</code>: the last argument of the previous command</p> <p>Alt+. 
inserts the previous command’s last argument</p> <h4 id="check-for-spelling-of-words-in-linux">Check for Spelling of Words in Linux</h4> <p><code class="highlighter-rouge">look docum</code></p> <h2 id="linux-terminal-shortcuts-list">Linux terminal shortcuts list</h2> <ol> <li>Ctrl+a Move cursor to <strong>start of line</strong></li> <li>Ctrl+e Move cursor to <strong>end of line</strong></li> <li>Ctrl+b Move back one character</li> <li>Alt+b Move back one word</li> <li>Ctrl+f Move forward one character</li> <li>Alt+f Move forward one word</li> <li>Ctrl+d Delete current character</li> <li>Ctrl+w Cut the word before the cursor</li> <li>Ctrl+k Cut everything <strong>after the cursor</strong></li> <li>Alt+d Cut word after the cursor</li> <li>Alt+w Cut word before the cursor</li> <li>Ctrl+y Paste the last cut text</li> <li>Ctrl+_ Undo</li> <li>Ctrl+u Cut everything <strong>before the cursor</strong></li> <li>Ctrl+xx Toggle between first and current position</li> <li>Ctrl+l <strong>Clear the terminal</strong></li> <li>Ctrl+c Cancel the command</li> <li>Ctrl+r <strong>Search</strong> command in history - type the search term</li> <li>Ctrl+j End the search at current history entry</li> <li>Ctrl+g Cancel the search and restore original line</li> <li>Ctrl+n Next command from the History</li> <li>Ctrl+p previous command from the History</li> </ol> <p>Ref <a href="https://stackoverflow.com/questions/9679776/how-do-i-clear-delete-the-current-line-in-terminal">stackoverflow</a></p> <h2 id="pretty-view-csv-file-in-terminal">pretty view csv file in terminal</h2> <p>add this to .bashrc, and then just <code class="highlighter-rouge">pretty_csv xxx.csv</code></p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function pretty_csv { column -t -s, -n "@" | less -F -S -X -K } </code></pre></div></div>Dongda Lidongdongbhbh@gmail.comTricks on Linux command line cd -: back to the last working directoryReinforcement Learning Course Notes-David 
Silver2019-05-18T00:00:00+08:002019-05-18T00:00:00+08:00https://dongdongbh.tech/RL-courses<h2 id="background">Background</h2> <p>I started learning Reinforcement Learning in 2018. I first learned it from the book “Deep Reinforcement Learning Hands-On” by Maxim Lapan; that book taught me some high-level concepts of Reinforcement Learning and how to implement it with PyTorch step by step. But when I dug deeper into Reinforcement Learning, I found the high-level intuition was not enough, so I read <a href="http://incompleteideas.net/book/bookdraft2017nov5.pdf">Reinforcement Learning: An Introduction</a> by Sutton and Barto, and, following the course <a href="https://www.youtube.com/watch?v=2pWv7GOvuf0">Reinforcement Learning</a> by David Silver, I got a deeper understanding of RL. For code implementations of the book and course, refer to <a href="https://github.com/dennybritz/reinforcement-learning">this</a> Github repository.</p> <p>Here are some of my notes from taking the course. For some concepts and ideas that are hard to understand, I add some of my own explanations and intuition in this post, and I omit some simple concepts; hopefully this note will also help you start your RL journey.</p> <h4 id="table-of-contents">Table of contents</h4> <p><a href="#background">Background </a></p> <p><a href="#1introduction">1.Introduction </a></p> <p><a href="#2mdp">2.MDP </a></p> <p><a href="#3planningbydynamicprogramming">3. Planning by Dynamic Programming </a></p> <p><a href="#4modelfreeprediction">4. model-free prediction </a></p> <p><a href="#5modelfreecontrol">5 Model-free control </a></p> <p><a href="#6valuefunctionapproximation">6 Value function approximation </a></p> <p><a href="#7policygradientmethods">7 Policy gradient methods </a></p> <p><a href="#8integratinglearningandplanning">8.Integrating Learning and Planning </a></p> <p><a href="#9explorationandexploitation">9. Exploration and Exploitation </a></p> <p><a href="#10casestudyrlinclassicgames">10. 
Case Study: RL in Classic Games </a></p> <h2 id="1introduction">1.Introduction</h2> <p>RL features</p> <ul> <li>reward signal</li> <li>feedback delay</li> <li>sequences are not i.i.d.</li> <li>actions affect subsequent data</li> </ul> <h3 id="why-using-discount-reward">Why using discount reward?</h3> <ul> <li>mathematically convenient</li> <li>avoids <strong>infinite</strong> returns in cyclic Markov processes</li> <li>we are not very confident about our <strong>prediction of reward</strong>; maybe we are only confident about the next few steps.</li> <li>human shows preference for immediate reward</li> <li>it is sometimes possible to use undiscounted reward</li> </ul> <h2 id="2mdp">2.MDP</h2> <p>In MDP, reward is <strong>action reward</strong>, not state reward! <script type="math/tex">R_s^a=E[R_{t+1}|S_t=s,A_t=a]</script> The Bellman Optimality Equation is <strong>non-linear</strong>, so we solve it by iterative methods.</p> <h2 id="3-planning-by-dynamic-programming">3. Planning by Dynamic Programming</h2> <h3 id="planningclearly-know-the-mdpmodel-and-try-to-find-optimal-policy">planning (clearly know the MDP (model) and try to find the optimal policy)</h3> <p><strong>prediction</strong>: given an MDP and a policy, output the value function (policy evaluation)</p> <p><strong>control</strong>: given an MDP, output the optimal value function and optimal policy (solving the MDP)</p> <ul> <li> <p>policy evaluation</p> </li> <li> <p>policy iteration</p> <ul> <li> <p>policy evaluation (k steps to converge)</p> </li> <li> <p>policy improvement</p> <p>if we iterate the policy again and again on a known MDP, we will finally get the optimal policy (proved). 
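</p> <p>The loop above (evaluate the current policy, then improve it greedily) can be sketched in a few lines of plain Python; the two-state MDP below is invented purely for illustration:</p>

```python
# Tabular policy iteration on a tiny, made-up two-state MDP.
# P[s][a] = list of (prob, next_state, reward) triples; all numbers
# here are invented purely for illustration.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma, n_states, n_actions = 0.9, 2, 2

def q_value(V, s, a):
    # one-step lookahead: expected reward plus discounted next-state value
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def evaluate(policy, theta=1e-8):
    # iterative policy evaluation until the value change is below theta
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            v = q_value(V, s, policy[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def improve(V):
    # greedy policy improvement w.r.t. the current value function
    return [max(range(n_actions), key=lambda a: q_value(V, s, a))
            for s in range(n_states)]

policy = [0, 0]
while True:
    V = evaluate(policy)        # policy evaluation
    new_policy = improve(V)     # policy improvement
    if new_policy == policy:    # policy stable => optimal (MDP is known)
        break
    policy = new_policy
```

On this toy MDP the policy stops changing after one improvement step, which is exactly the convergence argument above.</p> <p>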
so <strong>policy iteration solves the MDP</strong>.</p> </li> </ul> </li> <li> <p>value iteration</p> <ol> <li> <p>value update (1-step policy evaluation)</p> </li> <li> <p>policy improvement (one greedy step based on the updated value)</p> <p>iterating this also solves the MDP</p> </li> </ol> </li> </ul> <h3 id="asynchronous-dynamic-programming">asynchronous dynamic programming</h3> <ul> <li>in-place dynamic programming (update the old value with the new value immediately; do not wait for all states’ new values)</li> <li>prioritized sweeping (based on value iteration error)</li> <li>real-time dynamic programming (run the game)</li> </ul> <h2 id="4-model-free-prediction">4. model-free prediction</h2> <p>model-free by sampling</p> <h3 id="monte-carlo-learning">Monte-Carlo learning</h3> <p>every update of Monte-Carlo learning requires a <strong>full episode</strong></p> <ul> <li> <p>First-Visit Monte-Carlo policy evaluation</p> <p>run the agent following the policy; the <strong>first</strong> time state s is visited in an episode, do the following calculation <script type="math/tex">N(s)\gets N(s)+1 \\ S(s)\gets S(s)+G_t \\ V(s)=S(s)/N(s) \\ V(s)\to v_\pi \quad as \quad N(s) \to \infty</script></p> </li> <li> <p>Every-Visit Monte-Carlo policy evaluation</p> <p>run the agent following the policy; <strong>every</strong> time state s is visited in an episode (maybe there is a loop, so a state can be visited more than once), do the same calculation</p> <blockquote> <p>Incremental mean <script type="math/tex">% <![CDATA[ \begin{align} \mu_k &= \frac{1}{k}\sum_{j=1}^k x_j \\ &=\frac{1}{k}(x_k + \sum_{j=1}^{k-1} x_j) \\ &= \frac{1}{k}(x_k + (k-1)\mu_{k-1}) \\ &= \mu_{k-1}+\frac{1}{k}(x_k - \mu_{k-1}) \end{align} %]]></script></p> </blockquote> <p>so by the incremental mean: <script type="math/tex">N(S_t)\gets N(S_t)+1 \\ V(S_t)\gets V(S_t)+\frac{1}{N(S_t)}(G_t-V(S_t)) \\</script> In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes. 
<script type="math/tex">V(S_t)\gets V(S_t)+\alpha(G_t-V(S_t))</script></p> </li> </ul> <h3 id="temporal-difference-learning">Temporal-Difference Learning</h3> <p>learns from <strong>incomplete</strong> episodes; it bootstraps from a guess of the return. <script type="math/tex">V(S_t)\gets V(S_t)+\alpha(G_t-V(S_t)) \\ V(S_t)\gets V(S_t)+\alpha(R_{t+1}+\gamma V(S_{t+1}) - V(S_t))</script> <strong>TD(0) target</strong>: G_t=R_{t+1}+\gamma V(S_{t+1})</p> <p><strong>TD error</strong>: \delta_t = R_{t+1}+\gamma V(S_{t+1}) -V(S_t)</p> <h3 id="tdlambdabalance-between-mc-and-td">TD(\lambda)—balance between MC and TD</h3> <p>Let the TD target look n steps into the future; if n is very large and the episode terminates, then it is Monte-Carlo <script type="math/tex">% <![CDATA[ \begin{align} G_t^{(n)}&=R_{t+1}+\gamma R_{t+2}+ ... + \gamma^{n-1} R_{t+n} + \gamma^nV(S_{t+n}) \\ V(S_t)&\gets V(S_t)+\alpha(G_t^{(n)}-V(S_t)) \end{align} %]]></script> Averaging n-step returns—<strong>forward TD(\lambda)</strong> <script type="math/tex">% <![CDATA[ \begin{align} G_t^{\lambda} &= (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1} G_t^{(n)} \\ V(S_t)&\gets V(S_t)+\alpha(G_t^\lambda-V(S_t)) \end{align} %]]></script> Eligibility traces combine the frequency heuristic and the recency heuristic <script type="math/tex">% <![CDATA[ \begin{align} E_0(s) &= 0 \\ E_t(s) &= \gamma \lambda E_{t-1}(s) + 1(S_t=s) \end{align} %]]></script> TD(\lambda)—TD(0) with \lambda-decayed eligibility traces—<strong>backward TD(\lambda)</strong> <script type="math/tex">% <![CDATA[ \begin{align} \delta_t &= R_{t+1}+\gamma V(S_{t+1}) -V(S_t) \\ V(s) &\gets V(s)+\alpha \delta_tE_t(s) \end{align} %]]></script> if the updates are offline (meaning that within one episode we always use the old values), then the sum of forward TD(\lambda) updates is identical to the backward TD(\lambda) updates <script type="math/tex">\sum_{t=1}^T \alpha \delta_t E_t(s) = \sum_{t=1}^T \alpha(G_t^\lambda - V(S_t))1(S_t=s)</script></p> <h2 id="5-model-free-control">5 Model-free control</h2> 
<p>\epsilon-greedy policy adds exploration to make sure we keep improving our policy and exploring the environment.</p> <h3 id="on-policy-monte-carlo-control">On policy Monte-Carlo control</h3> <p>for <strong>every episode</strong>:</p> <ol> <li>policy evaluation: Monte-Carlo policy evaluation Q\approx q_\pi </li> <li>policy improvement: \epsilon-greedy policy improvement based on Q(s,a)</li> </ol> <p>Greedy in the limit with infinite exploration (GLIE) will find the optimal solution.</p> <h4 id="glie-monte-carlo-control">GLIE Monte-Carlo control</h4> <p>for the kth episode, set \epsilon \gets 1/k ; eventually \epsilon_k decays to zero, and we get the optimal policy.</p> <h3 id="on-policy-td-learning">On-policy TD learning</h3> <p><strong>Sarsa</strong> <script type="math/tex">Q(S,A) \gets Q(S,A)+\alpha (R+ \gamma Q(S',A')-Q(S,A))</script> <strong>On-Policy Sarsa:</strong></p> <p>for <strong>every time-step</strong>:</p> <ul> <li>policy evaluation: Sarsa, Q\approx q_\pi </li> <li>policy improvement: \epsilon-greedy policy improvement based on Q(s,a)</li> </ul> <p>forward n-step Sarsa —&gt; Sarsa(\lambda), just like TD(\lambda)</p> <p><strong>Eligibility traces:</strong> <script type="math/tex">% <![CDATA[ \begin{align} E_0(s,a) &= 0 \\ E_t(s,a) &= \gamma \lambda E_{t-1}(s,a) + 1(S_t=s,A_t=a) \end{align} %]]></script> backward Sarsa(\lambda) by adding eligibility traces</p> <p>and at every time step, <strong>for all (s,a)</strong> do the following: <script type="math/tex">% <![CDATA[ \begin{align} \delta_t &= R_{t+1}+\gamma Q(S_{t+1},A_{t+1}) -Q(S_t,A_t) \\ Q(s,a) &\gets Q(s,a)+\alpha \delta_tE_t(s,a) \end{align} %]]></script></p> <blockquote> <p>The intuition is that the reward and value of the current state-action pair influence all other state-action pairs, but they influence the most frequent and recent pairs more, and \lambda controls how much the current step influences the others. If you only use one-step Sarsa, every time you get a reward it updates only one state-action pair, so it is slower. 
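</p> <p>As a concrete sketch of this backward view, here is tabular Sarsa(\lambda) with accumulating traces; the ChainEnv, the seed, and all hyperparameters below are made up purely for illustration:</p>

```python
import random
from collections import defaultdict

# Backward-view Sarsa(lambda): one TD error updates every recently
# visited (s, a) pair, weighted by its decaying eligibility trace.
# ChainEnv is a made-up 5-state corridor: action 1 moves right,
# action 0 moves left, reward 1.0 on reaching the right end.
class ChainEnv:
    def __init__(self, n=5):
        self.n = n
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(self.n - 1, self.s + (1 if a == 1 else -1)))
        done = self.s == self.n - 1
        return self.s, (1.0 if done else 0.0), done

alpha, gamma, lam, eps, n_actions = 0.1, 0.9, 0.9, 0.1, 2
Q = defaultdict(float)
random.seed(0)

def eps_greedy(s):
    # epsilon-greedy with random tie-breaking among equal Q values
    if random.random() < eps:
        return random.randrange(n_actions)
    qs = [Q[(s, a)] for a in range(n_actions)]
    best = max(qs)
    return random.choice([a for a in range(n_actions) if qs[a] == best])

env = ChainEnv()
for episode in range(200):
    E = defaultdict(float)          # eligibility traces, reset per episode
    s = env.reset()
    a = eps_greedy(s)
    for t in range(1000):           # step cap keeps the demo bounded
        s2, r, done = env.step(a)
        a2 = eps_greedy(s2)
        target = r if done else r + gamma * Q[(s2, a2)]
        delta = target - Q[(s, a)]  # one scalar TD error...
        E[(s, a)] += 1.0            # bump trace for the current pair
        for key in list(E):         # ...updates ALL traced pairs at once
            Q[key] += alpha * delta * E[key]
            E[key] *= gamma * lam   # recency decay of each trace
        s, a = s2, a2
        if done:
            break
```

Note how a single reward at the end of the chain immediately propagates back along the whole traced trajectory, instead of moving one state-action pair per episode as in one-step Sarsa.</p> <p>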
For more, refer to the Gridworld example in lecture 5.</p> </blockquote> <h3 id="off-policy-learning">Off-policy learning</h3> <h4 id="importance-sampling">Importance sampling</h4> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} E_{X\sim P}[f(X)] &= \sum P(X)f(X) \\ &=\sum Q(X) \frac{P(X)}{Q(X)} f(X) \\ &= E_{X\sim Q}\left[\frac{P(X)}{Q(X)} f(X)\right] \end{align} %]]></script> <p>Importance sampling for off-policy TD <script type="math/tex">V(S_t) \gets V(S_t) + \alpha \left(\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}(R_{t+1}+\gamma V(S_{t+1}))-V(S_t)\right)</script></p> <h4 id="q-learning">Q-learning</h4> <p>The next action is chosen using the behavior policy (the true behavior) A_{t+1} \sim \mu(.|S_t), but we consider the alternative successor action (our target policy) A’ \sim \pi(.|S_t) <script type="math/tex">Q(S,A) \gets Q(S,A)+\alpha (R_{t+1} + \gamma Q(S_{t+1},A')-Q(S,A))</script></p> <blockquote> <p>This may be hard to understand, so let me explain: no matter what action we actually take (behave) next, we update Q according to our target policy’s action, so finally we get the Q of the target policy \pi.</p> </blockquote> <h5 id="off-policy-control-with-q-learning">Off-policy control with Q-Learning</h5> <ul> <li> <p>the target policy is greedy w.r.t. Q(s,a) <script type="math/tex">\pi(S_{t+1})=\underset{a'}{\arg\max} Q(S_{t+1},a')</script></p> </li> <li> <p>the behavior policy \mu is e.g. \epsilon -greedy w.r.t. 
Q(s,a), or maybe some totally random policy; <strong>it doesn’t matter for us</strong> since it is off-policy, and we only evaluate Q on \pi.</p> </li> </ul> <script type="math/tex; mode=display">Q(S,A) \gets Q(S,A)+\alpha (R+ \gamma \max_{a'} Q(S',a')-Q(S,A))</script> <p>and Q-learning converges to the optimal action-value function Q(s,a) \to q_*(s,a)</p> <blockquote> <p>Q-learning can be used in off-policy learning, but it can also be used in on-policy learning!</p> <p>For on-policy learning with an \epsilon -greedy policy update, Sarsa is a good method, but using Q-learning is also fine, since \epsilon -greedy is similar to the max-Q policy; you still explore most policy actions, so it is also efficient.</p> </blockquote> <h2 id="6-value-function-approximation">6 Value function approximation</h2> <p>Before this lecture, we talked about <strong>tabular learning</strong>, where we maintain a Q table or value table, etc.</p> <h3 id="introduction">Introduction</h3> <h4 id="why">why</h4> <ul> <li>state space is large</li> <li>continuous state space</li> </ul> <h4 id="value-function-approximation">Value function approximation</h4> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \hat{v}(s,\pmb{w}) &\approx v_\pi(s) \\ \hat{q}(s,a,\pmb{w})&\approx q_\pi(s,a) \end{align} %]]></script> <h4 id="approximator">Approximator</h4> <ul> <li>non-stationary (state values are changing since the policy is changing)</li> <li>non-i.i.d. 
(samples are correlated, drawn according to the policy)</li> </ul> <h3 id="incremental-method">Incremental method</h3> <h4 id="basic-sgd-for-value-function-approximation">Basic SGD for Value function approximation</h4> <ul> <li>Stochastic Gradient descent</li> <li>feature vectors</li> </ul> <script type="math/tex; mode=display">x(S) = \begin{pmatrix} x_1(s) \\ \vdots \\ x_n(s) \end{pmatrix}</script> <ul> <li> <p>linear value function approximation <script type="math/tex">% <![CDATA[ \begin{align} \hat{v}(S,\pmb{w}) &= \pmb{x}(S)^T \pmb{w} = \sum_{j=1}^n \pmb{x}_j(S) \pmb{w}_j\\ J(\pmb {w}) &= E_\pi\left[(v_\pi(S)-\hat{v}(S,\pmb{w}))^2\right] \\ \Delta\pmb{w}&=-\frac{1}{2} \alpha \Delta_w J(\pmb{w}) \\ &=\alpha E_\pi \left[(v_\pi(S)-\hat{v}(S,\pmb{w})) \Delta_{\pmb{w}}\hat{v}(S,\pmb{w})\right] \\ \Delta\pmb{w}&=\alpha (v_\pi(S)-\hat{v}(S,\pmb{w})) \Delta_{\pmb{w}}\hat{v}(S,\pmb{w}) \\ & = \alpha (v_\pi(S)-\hat{v}(S,\pmb{w}))\pmb{x}(S) \end{align} %]]></script></p> </li> <li> <p>Table lookup feature</p> </li> </ul> <blockquote> <p>table lookup is a special case of linear value function approximation, where each component of w is the value of an individual state.</p> </blockquote> <script type="math/tex; mode=display">x(S) = \begin{pmatrix} 1(S=s_1)\\ \vdots \\ 1(S=s_n) \end{pmatrix}\\ \hat{v}(S,w) = \begin{pmatrix} 1(S=s_1)\\ \vdots \\ 1(S=s_n) \end{pmatrix}.\begin{pmatrix} w_1\\ \vdots \\ w_n \end{pmatrix}</script> <p>Incremental prediction algorithms</p> <h4 id="how-to-supervise">How to supervise?</h4> <ul> <li> <p>For MC, the target is the return G_t <script type="math/tex">\Delta w = \alpha (G_t-\hat{v}(S_t,\pmb w))\Delta_w \hat{v}(S_t,w)</script></p> </li> <li> <p>For TD(0), the target is the TD target R_{t+1} + \gamma \hat{v}(S_{t+1},\pmb{w}) <script type="math/tex">\Delta w = \alpha (R_{t+1} + \gamma \hat{v}(S_{t+1},\pmb{w})-\hat{v}(S_t,\pmb w))\Delta_w \hat{v}(S_t,w)</script></p> <blockquote> <p>note that the TD target also contains \hat{v}(S_{t+1},\pmb{w}), which depends on w, but we <strong>do 
not</strong> calculate the gradient through it; we just trust the target at each time step. We only look forward, rather than looking forward and backward at the same time; otherwise it cannot converge.</p> </blockquote> </li> <li> <p>For TD(\lambda), the target is the \lambda-return G_t^\lambda <script type="math/tex">\Delta\pmb{w} = \alpha (G_t^\lambda-\hat{v}(S_t,\pmb w))\Delta_{\pmb{w}} \hat{v}(S_t,\pmb{w})</script></p> <p>for the backward view of linear TD(\lambda): <script type="math/tex">% <![CDATA[ \begin{align} \delta_t&= R_{t+1} + \gamma \hat{v}(S_{t+1},\pmb{w})-\hat{v}(S_t,\pmb{w}) \\ E_t &= \gamma \lambda E_{t-1} +\pmb{x}(S_t) \\ \Delta \pmb{w}&=\alpha \delta_t E_t \end{align} %]]></script></p> <blockquote> <p>here, unlike E_t(s) = \gamma \lambda E_{t-1}(s) + 1(S_t=s) , we put x(S_t) in E_t, so we don’t need to remember all previous x(S_t); note that in linear TD, \Delta \hat{v}(S_t,\pmb{w}) is x(S_t).</p> <p>here the eligibility trace is over state features, so the most recent states (state features) have more weight; unlike TD(0), this updates all previous states <strong>simultaneously</strong>, with state weights decayed by \lambda.</p> </blockquote> </li> </ul> <h4 id="control-with-value-function-approximation">Control with value function approximation</h4> <p>policy evaluation: <strong>approximate</strong> policy evaluation, \hat{q}(.,.,\pmb{w}) \approx q_\pi</p> <p>policy improvement: \epsilon - greedy policy improvement.</p> <p>Action-value function approximation <script type="math/tex">% <![CDATA[ \begin{align} \hat{q}(S,A,\pmb{w}) &\approx q_\pi(S,A)\\ J(\pmb {w}) &= E_\pi\left[(q_\pi(S,A)-\hat{q}(S,A,\pmb{w}))^2\right] \\ \Delta\pmb{w}&=-\frac{1}{2} \alpha \Delta_w J(\pmb{w}) \\ &=\alpha (q_\pi(S,A)-\hat{q}(S,A,\pmb{w}))\Delta_{\pmb{w}}\hat{q}(S,A,\pmb{w}) \end{align} %]]></script> <strong>Linear</strong> Action-value function approximation <script type="math/tex">% <![CDATA[ \begin{align}x(S,A) &= \begin{pmatrix} x_1(S,A)\\ \vdots \\ x_n(S,A) \end{pmatrix} \\ 
\hat{q}(S,A,\pmb{w}) &= \pmb{x}(S,A)^T \pmb{w} = \sum_{j=1}^n \pmb{x}_j(S,A) \pmb{w}_j\\ \Delta\pmb{w}&=\alpha (q_\pi(S,A)-\hat{q}(S,A,\pmb{w}))x(S,A) \end{align} %]]></script> The target is similar to the value update; I’m lazy and will not write it down, you can refer to the book.</p> <p>TD is not guaranteed to converge</p> <p>convergence of gradient TD learning</p> <p>Convergence of Control Algorithms</p> <h3 id="batch-reinforcement-learning">Batch reinforcement learning</h3> <p><strong>motivation:</strong> try to fit the experiences</p> <ul> <li>given value function approximation \hat{v}(s,\pmb{w}) \approx v_\pi(s)</li> <li>experience \mathcal{D} consisting of \langle state,value \rangle pairs:</li> </ul> <script type="math/tex; mode=display">\mathcal{D} = \{\langle s_1,v_1^\pi\rangle\,\langle s_2,v_2^\pi\rangle\,...,\langle s_n,v_n^\pi\rangle\}</script> <ul> <li><strong>Least squares</strong> minimizing sum-squared error <script type="math/tex">% <![CDATA[ \begin{align} LS(\pmb{w})&=\sum_{t=1}^T(v_t^\pi -\hat{v}(s_t,\pmb{w}))^2 \\ &=\mathbb{E}_\mathcal{D}\left[(v^\pi-\hat{v}(s,\pmb{w}))^2\right] \end{align} %]]></script></li> </ul> <h4 id="sgd-with-experience-replayde-correlate-states">SGD with experience replay (de-correlates states)</h4> <ol> <li> <p><strong>sample</strong> state, value from experience <script type="math/tex">\langle s,v^\pi\rangle \sim \mathcal{D}</script></p> </li> <li> <p>apply SGD update</p> </li> </ol> <script type="math/tex; mode=display">\Delta\pmb{w} = \alpha (v^\pi-\hat{v}(s,\pmb{w}))\Delta_{\pmb{w}}\hat{v}(s,\pmb{w})</script> <p>This converges to the least squares solution</p> <h4 id="dqn-experience-replay--fixed-q-targetsoff-policy">DQN (experience replay + Fixed Q-targets)(off-policy)</h4> <ol> <li> <p>Take action a_t according 
to an \epsilon - greedy policy to get experience (s_t,a_t,r_{t+1},s_{t+1}) and store it in \mathcal{D}</p> </li> <li> <p>Sample a random mini-batch of transitions (s,a,r,s’)</p> </li> <li> <p>compute Q-learning targets w.r.t. the old, fixed parameters w^-</p> </li> <li> <p>optimize the MSE between the Q-network and the Q-learning targets. <script type="math/tex">\mathcal{L}_i(\mathrm{w}_i) = \mathbb{E}_{s,a,r,s' \sim \mathcal{D}_i}\left[\left(r+\gamma \max_{a'} Q(s',a';\mathrm{w^-})-Q(s,a;\mathrm{w}_i)\right)^2\right]</script></p> </li> <li> <p>using SGD update</p> </li> </ol> <blockquote> <p>Naive non-linear Q-learning is hard to make converge, so <strong>why does DQN converge?</strong></p> <ul> <li>experience replay de-correlates states, making them more like i.i.d.</li> <li>Fixed Q-targets make it stable</li> </ul> </blockquote> <h4 id="least-square-evaluation">Least square evaluation</h4> <p>if the approximation function is <strong>linear</strong> and the <strong>feature space is small</strong>, we can solve the policy evaluation by least squares <strong>directly</strong>.</p> <ul> <li>policy evaluation: evaluation by <strong>least squares Q-learning</strong></li> <li>policy improvement: greedy policy improvement.</li> </ul> <h2 id="7-policy-gradient-methods">7 Policy gradient methods</h2> <h3 id="introduction-1">Introduction</h3> <h4 id="policy-based-reinforcement-learning">policy-based reinforcement learning</h4> <p><strong>directly parametrize the policy</strong> <script type="math/tex">\pi_\theta(s,a) = \mathcal{P}[a|s,\theta]</script> advantages:</p> <ul> <li>better convergence properties</li> <li>effective in high-dimensional or <strong>continuous action spaces</strong></li> <li>can learn stochastic policies</li> </ul> <p>disadvantages:</p> <ul> <li>converges to a local rather than global optimum</li> <li>evaluating a policy is typically inefficient and has high variance</li> </ul> <h4 id="policy-gradient">policy gradient</h4> <p>Let J(\theta) be the policy objective function</p> <p>find a <em>local 
<strong>maximum</strong></em> of the policy objective function (the value of the policy) <script type="math/tex">\Delta \theta = \alpha \Delta_\theta J(\theta)</script> where \Delta_\theta J(\theta) is the <em>policy gradient</em> <script type="math/tex">\Delta_\theta J(\theta) = \begin{pmatrix} \frac{\partial J(\theta)}{\partial\theta_1 } \\ \vdots \\ \frac{\partial J(\theta)}{\partial\theta_n } \end{pmatrix}</script> <strong>Score function trick</strong> <script type="math/tex">% <![CDATA[ \begin{align} \Delta_\theta\pi_\theta(s,a) &= \pi_\theta(s,a) \frac{\Delta_\theta\pi_\theta(s,a)}{\pi_\theta(s,a)} \\ &=\pi_\theta(s,a)\Delta_\theta \log\pi_\theta(s,a) \end{align} %]]></script> The <em>score function</em> is \Delta_\theta \log\pi_\theta(s,a)</p> <h5 id="policy">policy</h5> <ul> <li>Softmax policy for discrete actions</li> <li>Gaussian policy for continuous action spaces</li> </ul> <p>for <strong>one-step</strong> MDPs apply the <em>score function trick</em> <script type="math/tex">% <![CDATA[ \begin{align} J(\theta) & = \mathbb{E}_{\pi_\theta}[r] \\ & = \sum_{s\in \mathcal{S}} d(s) \sum _{a\in \mathcal{A}} \pi_\theta(s,a)\mathcal{R}_{s,a}\\ \Delta_\theta J(\theta) & = \sum_{s\in \mathcal{S}} d(s) \sum _{a\in \mathcal{A}} \pi_\theta(s,a)\Delta_\theta\log\pi_\theta(s,a)\mathcal{R}_{s,a} \\ & = \mathbb{E}_{\pi_\theta}[\Delta_\theta\log\pi_\theta(s,a)r] \end{align} %]]></script></p> <h4 id="policy-gradient-theorem">Policy gradient theorem</h4> <p>the policy gradient is <script type="math/tex">\Delta_\theta J(\theta)= \mathbb{E}_{\pi_\theta}[\Delta_\theta\log\pi_\theta(s,a)Q^{\pi_\theta}(s,a)]</script></p> <h3 id="monte-carlo-policy-gradientreinforce">Monte-Carlo policy gradient(REINFORCE)</h3> <p>using the return v_t as an unbiased sample of Q^{\pi_\theta}(s_t,a_t) <script type="math/tex">\Delta\theta_t = \alpha\Delta_\theta\log\pi_\theta(s_t,a_t)v_t\\ v_t = G_t = r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}...</script> pseudo code</p> <blockquote> <p><strong>function REINFORCE</strong> Initialize \theta arbitrarily</p>
<p>​ <strong>for</strong> each episode { s_1,a_1,r_2,…,s_{T-1},a_{T-1},R_T } \sim \pi_\theta <strong>do</strong></p> <p>​ <strong>for</strong> t=1 to T-1 <strong>do</strong></p> <p>​ \theta \gets \theta+\alpha\Delta_\theta\log\pi_\theta(s_t,a_t)v_t</p> <p>​ <strong>end for</strong></p> <p>​ <strong>end for</strong></p> <p>​ <strong>return</strong> \theta</p> <p><strong>end function</strong></p> </blockquote> <p>REINFORCE has a <strong>high variance problem</strong>, since it estimates v_t by sampling.</p> <h3 id="actor-critic-policy-gradient">Actor-Critic policy gradient</h3> <h5 id="idea">Idea</h5> <p>use a critic to estimate the action-value function <script type="math/tex">Q_w(s,a) \approx Q^{\pi_\theta}(s,a)</script> The Actor-Critic algorithm follows an <em>approximate</em> policy gradient <script type="math/tex">\Delta_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}[\Delta_\theta\log\pi_\theta(s,a)Q_w(s,a)] \\ \Delta \theta= \alpha \Delta_\theta\log\pi_\theta(s,a)Q_w(s,a)</script></p> <h5 id="action-value-actor-critic">Action value actor-Critic</h5> <p>Using linear value fn approx.
Q_w(s,a) = \phi(s,a)^Tw</p> <ul> <li>Critic Updates w by TD(0)</li> <li>Actor Updates \theta by policy gradient</li> </ul> <blockquote> <p><strong>function QAC</strong> Initialize s, \theta</p> <p>​ Sample a \sim \pi_\theta</p> <p>​ <strong>for</strong> each step <strong>do</strong></p> <p>​ Sample reward r=\mathcal{R}_s^a; sample transition s’ \sim \mathcal{P}_s^a.</p> <p>​ Sample action a’ \sim \pi_\theta(s’,a’)</p> <p>​ \delta = r + \gamma Q_w(s’,a’)- Q_w(s,a)</p> <p>​ \theta = \theta + \alpha \Delta_\theta \log \pi_\theta(s,a) Q_w(s,a)</p> <p>​ w \gets w+\beta \delta\phi(s,a)</p> <p>​ a \gets a’, s\gets s’</p> <p>​ <strong>end for</strong></p> <p><strong>end function</strong></p> </blockquote> <p>So <strong>value-based learning can be seen as a special case of actor-critic</strong>: greedy action selection over Q is a limiting case of the policy gradient. If we make the policy gradient step size very large, the probability of the action that maximizes Q approaches 1 and the probabilities of the others approach 0, which is exactly what greedy means.</p> <h5 id="reducing-variance-using-a-baseline">Reducing variance using a baseline</h5> <ul> <li> <p>Subtract a baseline function B(s) from the policy gradient</p> </li> <li> <p>This can reduce variance, without changing the expectation <script type="math/tex">% <![CDATA[ \begin{align} \mathbb{E}_{\pi_\theta} [\Delta_\theta \log \pi_\theta(s,a)B(s)] &= \sum_{s \in \mathcal{S}}d^{\pi_\theta}(s)\sum_a \Delta_\theta \pi_\theta(s,a)B(s)\\ &= \sum_{s \in \mathcal{S}}d^{\pi_\theta}(s)B(s)\Delta_\theta \sum_{a\in \mathcal{A}} \pi_\theta(s,a)\\ & = \sum_{s \in \mathcal{S}}d^{\pi_\theta}(s)B(s)\Delta_\theta (1) \\ &=0 \end{align} %]]></script></p> </li> <li> <p>a good baseline is the state value function B(s) = V^{\pi_\theta}(s)</p> </li> <li> <p>So we can rewrite the policy gradient using the <em>advantage function</em> A^{\pi_\theta}(s,a) <script type="math/tex">A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) \\ \Delta_\theta
J(\theta) = \mathbb{E}_{\pi_\theta}[\Delta_\theta \log \pi_\theta(s,a)A^{\pi_\theta}(s,a)]</script></p> </li> </ul> <blockquote> <p>Actually, by using the advantage function we remove the variance between states, which makes the policy network more stable.</p> </blockquote> <p>So how do we <strong>estimate the advantage function</strong>? We could use two networks to estimate Q and V respectively, but that is more complicated; more commonly it is done by bootstrapping.</p> <ul> <li>TD error</li> </ul> <script type="math/tex; mode=display">\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s')-V^{\pi_\theta}(s)</script> <ul> <li> <p>The TD error is an unbiased estimate (sample) of the advantage function <script type="math/tex">% <![CDATA[ \begin{align} \mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta}|s,a] & = \mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s')|s,a] - V^{\pi_\theta}(s) \\ & = Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s) \\ & = A^{\pi_\theta}(s,a) \end{align} %]]></script></p> </li> <li> <p>So <script type="math/tex">\Delta_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\Delta_\theta \log \pi_\theta(s,a)\delta^{\pi_\theta}]</script></p> </li> <li> <p>In practice, we can use an approximate one-step TD error <script type="math/tex">\delta_v = r + \gamma V_v(s')-V_v(s)</script></p> </li> <li> <p>this approach only requires one set of critic parameters v.</p> </li> </ul> <p>For the critic, we can plug in the previously used value-approximation methods, such as MC, TD(0), TD(\lambda) and TD(\lambda) with eligibility traces.</p> <ul> <li> <p>MC policy gradient, \mathrm{v}_t is the true MC return.
<script type="math/tex">\Delta \theta = \alpha(\mathrm{v}_t - V_v(s_t))\Delta_\theta \log \pi_\theta(s_t,a_t)</script></p> </li> <li> <p>TD(0) <script type="math/tex">\Delta \theta = \alpha(r + \gamma V_v(s_{t+1})-V_v(s_t))\Delta_\theta \log \pi_\theta(s_t,a_t)</script></p> </li> <li> <p>TD(\lambda) <script type="math/tex">\Delta \theta = \alpha(\mathrm{v}_t^\lambda -V_v(s_t))\Delta_\theta \log \pi_\theta(s_t,a_t)</script></p> </li> <li> <p>TD(\lambda) with eligibility traces (backward-view) <script type="math/tex">% <![CDATA[ \begin{align} \delta & = r_{t+1} + \gamma V_v(s_{t+1}) -V_v(s_t) \\ e_{t+1} &= \lambda e_t + \Delta_\theta\log \pi_\theta(s,a) \\ \Delta \theta&= \alpha \delta e_t \end{align} %]]></script></p> </li> </ul> <p>For a <strong>continuous action space</strong>, we can use a Gaussian to represent our policy, but a Gaussian policy is noisy, so it is better to use a <strong>deterministic policy</strong> (just pick the mean) to reduce noise and make convergence easier. This leads to the <strong>deterministic policy gradient (DPG)</strong> algorithm.</p> <h4 id="deterministic-policy-gradientoff-policy">Deterministic policy gradient(off-policy)</h4> <p>Deterministic policy: <script type="math/tex">a_t = \mu(s_t|\theta^\mu)</script> The Q network is parametrized by \theta^Q, and the distribution of states under the behavior policy is \rho^\beta <script type="math/tex">% <![CDATA[ \begin{align} L(\theta^Q) &= \mathbb{E}_{s_t \sim \rho^\beta, a_t \sim \beta,r_t\sim E}[(Q(s_t,a_t|\theta^Q)-y_t)^2] \\ y_t & = r(s_t,a_t)+\gamma Q(s_{t+1},\mu(s_{t+1})|\theta^Q) \end{align} %]]></script> The policy network is parametrized by \theta^\mu <script type="math/tex">% <![CDATA[ \begin{align} J(\theta^\mu) & = \mathbb{E}_{s \sim \rho^\beta}[Q(s,a| \theta^Q)|_{s=s_t,a=\mu(s_t|\theta^\mu)}] \\ \Delta_{\theta^\mu}J &\approx \mathbb{E}_{s \sim \rho^\beta}[\Delta_{\theta^\mu} Q(s,a| \theta^Q)|_{s=s_t,a=\mu(s_t|\theta^\mu)}] \\ & = \mathbb{E}_{s \sim \rho^\beta}[\Delta_a Q(s,a|\theta^Q)|_{s=s_t,a=\mu(s_t)}\Delta_{\theta^\mu}\mu(s|\theta^\mu)|_{s=s_t}] \end{align} %]]></script> to make training more stable, we use target networks for both the critic and the actor, and update them by <strong>soft update</strong>: <script type="math/tex">% <![CDATA[ soft\; update\left\{ \begin{aligned} \theta^{Q'} & \gets \tau\theta^Q+(1-\tau)\theta^{Q'} \\ \theta^{\mu'} & \gets \tau\theta^\mu+(1-\tau)\theta^{\mu'} \\ \end{aligned} \right. %]]></script> and we set \tau very small to update the parameters smoothly, e.g. \tau = 0.001.</p> <p>In addition, we add some noise to the deterministic action when we are exploring the environment to gather experience. <script type="math/tex">\mu'(s_t) = \mu(s_t|\theta_t^\mu)+\mathcal{N}_t</script> where \mathcal{N} is the noise; it can be chosen to suit the environment, e.g. Ornstein-Uhlenbeck noise.</p> <h2 id="8integrating-learning-and-planning">8.Integrating Learning and Planning</h2> <h3 id="introduction-2">Introduction</h3> <p>model-free RL</p> <ul> <li>no model</li> <li><strong>Learn</strong> value function (and/or policy) from experience</li> </ul> <p>model-based RL</p> <ul> <li>learn a model from experience</li> <li><strong>plan</strong> value function (and/or policy) from the model</li> </ul> <p>Model \mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle <script type="math/tex">S_{t+1} \sim \mathcal{P}_\eta(S_{t+1}|S_t,A_t) \\ R_{t+1} = \mathcal{R}_\eta(R_{t+1}|S_t,A_t)</script> <strong>Model learning</strong> from experience {S_1,A_1,R_2,…,S_T} by supervised learning <script type="math/tex">S_1, A_1 \to R_2, S_2 \\ S_2, A_2 \to R_3, S_3 \\ \vdots \\ S_{T-1}, A_{T-1} \to R_T, S_T</script></p> <ul> <li>s,a \to r is a regression problem</li> <li>s,a \to s’ is a density estimation problem</li> </ul> <h3 id="planning-with-a-model">Planning with a model</h3> <h5 id="sample-based-planning">Sample-based planning</h5> <ol> <li>sample experience from the model</li> <li>apply model-free RL to the samples <ul>
<li>Monte-Carlo control</li> <li>Sarsa</li> <li>Q-learning</li> </ul> </li> </ol> <p>The performance of model-based RL is limited by the model: we can only find the optimal policy for the approximate MDP</p> <h3 id="integrated-architectures">Integrated architectures</h3> <p>Integrating learning and planning: <strong>Dyna</strong></p> <ul> <li>Learn a model from real experience</li> <li><strong>Learn and plan</strong> value function (and/or policy) from <strong>real <u>and</u> simulated experience</strong></li> </ul> <h3 id="simulation-based-search">Simulation-Based Search</h3> <ul> <li><strong>Forward search</strong> selects the best action by <strong>lookahead</strong></li> <li>build a <strong>search tree</strong> with the <strong>current state</strong> s_t at the root</li> <li>solve the <strong>sub-MDP</strong> starting from <strong>now</strong></li> </ul> <p>Simulation-Based Search</p> <ol> <li><strong>Simulate</strong> episodes of experience from <strong>now</strong> with the model</li> <li>Apply <strong>model-free</strong> RL to the simulated episodes <ul> <li>Monte-Carlo control \to Monte-Carlo search</li> <li>Sarsa \to TD search</li> </ul> </li> </ol> <h4 id="sample-monte-carlo-search">Sample Monte-Carlo search</h4> <ul> <li> <p>Given a model \mathcal{M}_v and a <strong>simulation policy</strong> \pi</p> </li> <li> <p>For each action a \in \mathcal{A}</p> <ul> <li> <p>Simulate K episodes from the current (real) state s_t <script type="math/tex">\{s_t,a,R_{t+1}^k,S_{t+1}^k,A_{t+1}^k,...,s_T^k\}_{k=1}^K \sim \mathcal{M}_v,\pi</script></p> </li> <li> <p>Evaluate the action by the mean return (<strong>Monte-Carlo evaluation</strong>)</p> </li> </ul> <script type="math/tex; mode=display">Q(s_t,a) = \frac{1}{K}\sum_{k=1}^K G_t \overset{\text{P}}{\to} q_\pi(s_t,a)</script> </li> <li> <p>Select the current (real) action with maximum value <script type="math/tex">a_t = \underset{a \in \mathcal{A}}{\arg\max} Q(S_{t},a)</script></p> </li> </ul> <h4 id="monte-carlo-tree-search">Monte-Carlo tree search</h4> <ul> <li> <p>Given a
model \mathcal{M}_v</p> </li> <li> <p>Simulate K episodes from the current (real) state s_t using the current simulation policy \pi <script type="math/tex">\{s_t,A_t^k,R_{t+1}^k,S_{t+1}^k,A_{t+1}^k,...,s_T^k\}_{k=1}^K \sim \mathcal{M}_v,\pi</script></p> </li> <li> <p>Build a search tree containing visited states and actions</p> </li> <li> <p><strong>Evaluate</strong> states Q(s,a) by the mean return of episodes from s,a <script type="math/tex">Q(s_t,a) = \frac{1}{N(s,a)}\sum_{k=1}^K \sum_{u=t}^T \mathbf{1}(S_u,A_u = s,a) G_u \overset{\text{P}}{\to} q_\pi(s_t,a)</script></p> </li> <li> <p>After the search is finished, select the current (real) action with maximum value in the search tree <script type="math/tex">a_t = \underset{a \in \mathcal{A}}{\arg\max} Q(S_{t},a)</script></p> </li> <li> <p>Each simulation consists of two phases (in-tree, out-of-tree)</p> <ul> <li><strong>Tree policy</strong> (improves): pick actions to maximise Q(s,a)</li> <li><strong>Default policy</strong> (fixed): pick actions randomly</li> </ul> </li> </ul> <blockquote> <p>Here we update Q over the whole sub-tree, not only the current state. After every episode of searching, we improve the policy based on the newly updated values, then start a new search. As the search progresses, we exploit the direction that is more promising, since we keep updating our search policy toward that direction. In addition, we also need to explore the other directions a little, so we can run MCTS by picking the action with the maximum Upper Confidence Bound (UCT); that is the idea of AlphaZero.</p> </blockquote> <p><strong>Temporal-Difference Search</strong></p> <p>e.g.
update by Sarsa <script type="math/tex">\Delta Q(S,A) = \alpha (R+\gamma Q(S',A') -Q(S,A))</script> and you can also use <strong>function approximation</strong> for the <strong>simulated</strong> Q.</p> <p><strong>Dyna-2</strong></p> <ul> <li><strong>long-term</strong> memory (real experience): TD learning</li> <li><strong>Short-term</strong> (working) memory (simulated experience): TD search &amp; TD learning</li> </ul> <h2 id="9-exploration-and-exploitation">9. Exploration and Exploitation</h2> <p>Ways to explore</p> <ul> <li>random exploration <ul> <li>use <strong>Gaussian noise</strong> in a <strong>continuous action space</strong></li> <li>\epsilon - greedy: act randomly with probability \epsilon</li> <li>Softmax: sample actions from the policy distribution</li> </ul> </li> <li>optimism in the face of uncertainty: prefer to explore states/actions with the highest uncertainty <ul> <li>Optimistic Initialization</li> <li>UCB</li> <li>Thompson sampling</li> </ul> </li> <li>Information state space <ul> <li>Gittins indices</li> <li>Bayes-adaptive MDPs</li> </ul> </li> </ul> <p>State-action exploration vs.
parameter exploration</p> <h3 id="multi-arm-bandit">Multi-arm bandit</h3> <p>Total <strong>regret</strong> <script type="math/tex">% <![CDATA[ \begin{align} L_t &= \mathbb{E}\left[\sum_{\tau=1}^t V^*-Q(a_\tau)\right] \\ & =\sum_{a \in \mathcal{A}}\mathbb{E}[N_t(a)](V^*-Q(a)) \\ &=\sum_{a \in \mathcal{A}}\mathbb{E}[N_t(a)]\Delta a \end{align} %]]></script> Optimistic Initialization</p> <ul> <li>initialize Q(a) to a high value</li> <li>then act greedily</li> <li>results in linear regret</li> </ul> <p>\epsilon - greedy</p> <ul> <li>results in linear regret</li> </ul> <p>decaying \epsilon - greedy</p> <ul> <li>sub-linear regret, but it needs to know the gaps; if you tune the decay schedule very well it is good, otherwise it may be bad.</li> </ul> <p>The regret has a lower bound, and it is logarithmic in t.</p> <p>The performance of any algorithm is determined by the similarity between the optimal arm and the other arms <script type="math/tex">\lim_{t \to \infty}L_t \ge \log t\sum_{a|\Delta a>0} \frac{\Delta a}{KL(\mathcal{R}^a||\mathcal{R}^{a_*})}</script></p> <h4 id="optimism-in-the-face-of-uncertainty"><strong>Optimism</strong> in the Face of <strong>Uncertainty</strong></h4> <p><strong>Upper Confidence Bounds (UCB)</strong></p> <ul> <li> <p>Estimate an upper confidence U_t(a) for each action value</p> </li> <li> <p>Such that q(a) \leq Q_t(a)+U_t(a) with high probability</p> </li> <li> <p>The upper confidence depends on the number of times N_t(a) that action a has been selected</p> </li> <li> <p>Select the action maximizing the Upper Confidence Bound (UCB) <script type="math/tex">A_t =\underset{a \in \mathcal{A}}{\arg\max} [Q_t(a)+U_t(a)]</script></p> </li> </ul> <p><em>Theorem (Hoeffding’s Inequality)</em></p> <blockquote> <p>Let X_1,…,X_t be i.i.d. random variables in [0,1], and let \overline{X}_t = \frac{1}{t}\sum_{\tau=1}^t X_\tau be the sample mean.
Then <script type="math/tex">\mathbb{P}[\mathbb{E}[X]> \overline{X}_t+u] \leq e^{-2tu^2}</script></p> </blockquote> <p>We apply Hoeffding’s inequality to the rewards of the bandit, conditioned on selecting action a <script type="math/tex">\mathbb{P}[ q(a)> Q_t(a)+U_t(a)] \leq e^{-2N_t(a)U_t(a)^2}</script></p> <ul> <li> <p>Pick a probability p that the true value exceeds the UCB</p> </li> <li> <p>Then solve for U_t(a) <script type="math/tex">\begin{align} e^{-2N_t(a)U_t(a)^2} = p \\ \\ U_t(a)=\sqrt{\frac{-\log p}{2N_t(a)}} \end{align}</script></p> </li> <li> <p>Reduce p as we observe more rewards, e.g. p = t^{-4} <script type="math/tex">U_t(a)=\sqrt{\frac{2\log t}{N_t(a)}}</script></p> </li> <li> <p>This ensures we select the optimal action as t \to \infty</p> </li> </ul> <p>This leads to the <strong>UCB1 algorithm</strong> <script type="math/tex">A_t =\underset{a \in \mathcal{A}}{\arg\max} \left[Q_t(a)+\sqrt{\frac{2\log t}{N_t(a)}}\right]</script> The UCB algorithm achieves logarithmic asymptotic total regret <script type="math/tex">\lim_{t\to\infty}L_t \leq 8\log t\sum_{a|\Delta a>0}\Delta a</script> Bayesian Bandits</p> <p>Probability matching (Thompson sampling) is optimal for the bandit setting, but may not be good for MDPs.</p> <h4 id="solving-information-state-space-banditsmdp">Solving Information State Space Bandits—MDP</h4> <p>define an MDP on the information state space</p> <h3 id="mdp">MDP</h3> <p>UCB <script type="math/tex">A_t =\underset{a \in \mathcal{A}}{\arg\max} [Q(S_{t},a)+U_t(S_t,a)]</script> R-Max algorithm</p> <h2 id="10-case-study-rl-in-classic-games">10. Case Study: RL in Classic Games</h2> <p>TBA.</p>
But when I dig out more about Reinforcement Learning, I find the high level intuition is not enough, so I read the Reinforcement Learning An introduction by S.G, and following the courses Reinforcement Learning by David Silver, I got deeper understanding of RL. For the code implementation of the book and course, refer this Github repository. Here is some of my notes when I taking the course, for some concepts and ideas that are hard to understand, I add some my own explanation and intuition on this post, and I omit some simple concepts on this note, hopefully this note will also help you to start your RL tour. Table of contents Background 1.Introduction 2.MDP 3. Planning by Dynamic Programming 4. model-free prediction 5 Model-free control 6 Value function approximation 7 Policy gradient methods 8.Integrating Learning and Planning 9. Exploration and Exploitation 10. Case Study: RL in Classic Games 1.Introduction RL feature reward signal feedback delay sequence not i.i.d action affect subsequent data Why using discount reward? mathematically convenient avoids infinite returns in cyclic Markov processes we are not very confident about our prediction of reward, maybe we we only confident about some near future steps. human shows preference for immediate reward it is sometimes possible to use undiscounted reward 2.MDP In MDP, reward is action reward, not state reward! Bellman Optimality Equation is non-linear , so we solve it by iteration methods. 3. Planning by Dynamic Programming planning(clearly know the MDP(model) and try to find optimal policy) prediction: given of MDP and policy, you output the value function(policy evaluation) control: given MDP, output optimal value function and optimal policy(solving MDP) policy evaluation policy iteration policy evaluation(k steps to converge) policy improvement if we iterate policy once and once again and the MDP we already know, we will finally get the optimal policy(proved). so the policy iteration solve the MDP. 
value iteration value update (1 step policy evaluation) policy improvement(one step greedy based on updated value) iterate this also solve the MDP asynchronous dynamic programming in-place dynamic programming(update the old value with new value immediately, not wait for all states new value) prioritized sweeping(based on value iteration error) real-time dynamic programming(run the game) 4. model-free prediction model-free by sample Monte-Carlo learning every update of Monte-Carlo learning must have full episode First-Visit Monte-Carlo policy evaluation just run the agent following the policy the first time that state s is visited in an episode and do following calculation Every-Visit Monte-Carlo policy evaluation just run the agent following the policy the every time(maybe there is a loop, a state can be visited more than one time) that state s is visited in an episode Incremental mean so by the incremental mean: In non-stationary problem, it can be useful to track a running mean, i.e. forget old episodes. Temporal-Difference Learning learn form incomplete episodes, it gauss the reward. TD target: G_t=R_{t+1}+\gamma V(S_{t+1}) TD(0) TD error: \delta_t = R_{t+1}+\gamma V(S_{t+1}) -V(S_t) TD(\lambda)—balance between MC and TD Let TD target look n steps into the future, if n is very large and the episode is terminal, then it’s Monte-Carlo Averaging n-step returns—forward TD(\lambda) Eligibility traces, combine frequency heuristic and recency heuristic TD(\lambda)—TD(0) and \lambda decayed Eligibility traces —backward TD(\lambda) if the updates are offline (means in one episode, we always use the old value), then the sum of forward TD(\lambda) is identical to the backward TD(\lambda) 5 Model-free control \epsilon-greedy policy add exploration to make sure we are improving our policy and explore the ervironment. 
On policy Monte-Carlo control for every episode: policy evaluation: Monte-Carlo policy evaluation Q\approx q_\pi policy improvement: \epsilon-greedy policy improvement based on Q(s,a) Greedy in the limit with infinite exploration (GLIE) will find optimal solution. GLIE Monte-Carlo control for the kth episode, set \epsilon \gets 1/k , finally \epsilon_k reduce to zero, and it will get the optimal policy. On-policy TD learning Sarsa On-Policy Sarsa: for every time-step: policy evaluation: Sarsa, Q\approx q_\pi policy improvement: \epsilon-greedy policy improvement based on Q(s,a) forward n-step Sarsa —&gt;Sarsa(\lambda) just like TD(\lambda) Eligibility traces: backward Sarsa(\lambda) by adding eligibility traces and every time step for all (s,a) do following: The intuition of this that the current state action pair reward and value influence all other state action pairs, but it will influence the most frequent and recent pair more. and the \lambda shows how much current influence others. if you only use one step Sarsa, every you get reward, it only update one state action pair, so it is slower. For more, refer Gridworld example on course-5. Off-policy learning Importance sampling Importance sampling for off-policy TD Q-learning Next action is chosen using behavior policy(the true behavior) A_{t+1} ~\sim \mu(. S_t) but consider alternative successor action(our target policy) A’ \sim \pi(.|S_t) Here has something may hard to understand, so I explain it. no matter what action we actually do(behave) next, we just update Q according our target policy action, so finally we got the Q of target policy \pi. Off-policy control with Q-Learning the target policy is greedy w.r.t Q(s,a) the behavior policy \mu is e.g. \epsilon -greedy w.r.t. Q(s,a) or maybe some totally random policy, it doesn’t matter for us since it is off-policy, and we only evaluate Q on \pi. 
and Q-learning will converges to the optimal action-value function Q(s,a) \to q_*(s,a) Q-learning can be used in off-policy learning, but it also can be used in on-policy learning! For on-policy, if you using \epsilon -greedy policy update, Sarsa is a good on-policy method, but you use Q-learning is fine since \epsilon -greedy is similar to max Q policy, so you can make sure you explore most of policy action, so it is also efficient. 6 Value function approximation Before this lecture, we talk about tabular learning since we have to maintain a Q table or value table etc. Introduction why state space is large continuous state space Value function approximation Approximator non-stationary (state values are changing since policy is changinng) non-i.i.d. (sample according policy) Incremental method Basic SGD for Value function approximation Stochastic Gradient descent feature vectors linear value function approximation Table lookup feature table lookup is a special case of linear value function approximation, where w is the value of individual state. Incremental prediction algorithms How to supervise? For MC, the target is the return G_t For TD(0), the target is the TD target R_{t+1} + \gamma \hat{v}(S_{t+1},\pmb{w}) here should notice that the TD target also has \hat{v}(S_{t+1},\pmb{w}), it contains w, but we do not calculate gradient of it, we just trust target at each time step, we only look forward, rather than look forward and backward at the same time. Otherwise it can not converge. 
For TD(\lambda) , the target is \lambda-return G_t^\lambda \begin{align} \Delta\pmb{w} &amp;= \alpha (G_t^\lambda-\hat{v}(S_t,\pmb w))\Delta_w \hat{v}(S_t,w) \ \end{align} \begin{align} \delta_t&amp;= R_{t+1} + \gamma \hat{v}(S_{t+1},\pmb{w})-\hat{v}(S_t,\pmb{w}) E_t &amp;= \gamma \lambda E_{t-1} +\pmb{x}(S_t) \Delta \pmb{w}&amp;=\alpha \delta_t E_t \end{align}$here, unlike$E_t(s) = \gamma \lambda E_{t-1}(s) + 1(S_t=s)$, we put$x(S_t)$in$E_t$, so we don’t need remember all previous$x(S_t)$, note that in Linear TD,$\Delta \hat{v}(S_t,\pmb{w})$is$x(S_t)$. here the eligibility traces is the state features, so the most recent state(state feature) have more weight, unlike TD(0), this is update all previous states simultaneously and the weight of state decayed by$\lambda$. Control with value function approximation policy evaluation: approximate policy evaluation,$\hat{q}(.,.,\pmb{w}) \approx q_\pi$policy improvement:$\epsilon - greedy$policy improvement. Action-value function approximation Linear Action-value function approximation The target is similar as value update, I’m lazy and do not write it down, you can refer it on book. TD is not guarantee converge convergence of gradient TD learning Convergence of Control Algorithms Batch reinforcement learning motivation: try to fit the experiences given value function approximation$\hat{v}(s,\pmb{w} \approx v_\pi(s))$experience$\mathcal{D} $consisting of$\langle state,value \rangle$pairs: Least squares minimizing sum-squares error SGD with experience replay(de-correlate states) sample state, vale form experience apply SGD update Then converge to least squares solution DQN (experience replay + Fixed Q-targets)(off-policy) Take action$a_t$according to$\epsilon - greedy$policy to get experience$(s_t,a_t,r_{t+1},s_{t+1})$store in$\mathcal{D}$Sample random mini-batch of transitions$(s,s,r,s’)$compute Q-learning targets w.r.t. old, fixed parameters$w^-$optimize MSE between Q-network and Q-learning targets. 
using SGD update On-linear Q-learning hard to converge, so why DQN converge? experience replay de-correlate state make it more like i.i.d. Fixed Q-targets make it stable Least square evaluation if the approximation function is linear and the feature space is small, we can solve the policy evaluation by least square directly. policy evaluation: evaluation by least squares Q-learning policy improvement: greedy policy improvement. 7 Policy gradient methods Introduction policy-based reinforcement learning directly parametrize the policy advantages: better convergence properties effective in high-dimensional or continuous action spaces can learn stochastic policies disadvantages: converge to a local rather then global optimum evaluating a policy is typically inefficient and high variance policy gradient Let$J(\theta)$be policy objective function find local maximum of policy objective function(value of policy) where$\Delta_\theta J(\theta)$is the policy gradient Score function trick The score function is$\Delta_\theta \log\pi_\theta(s,a)$policy Softmax policy for discrete actions Gaussian policy for continuous action spaces for one-step MDPs apply score function trick Policy gradient theorem the policy gradient is Monte-Carlo policy gradient(REINFORCE) using return$v_t$as an unbiased sample of$Q^{\pi_\theta}(s_t,a_t)$pseudo code function REINFORCE Initialize$\theta$arbitrarily ​ for each episode$ { s_1,a_1,r_2,…,s_{T-1},a_{T-1},R_T } \sim \pi_\theta$do ​ for$t=1$to$T-1$do ​$\theta \gets \theta+\alpha\Delta_\theta\log\pi_\theta(s,a)v_t$​ end for ​ end for ​ return$\theta$end function REINFORCE has the high variance problem, since it get$v_t$by sampling. 
Actor-Critic policy gradient Idea use a critic to estimate the action-value function Actor-Critic algorithm follow an approximate policy gradient Action value actor-Critic Using linear value fn approx.$Q_w(s,a) = \phi(s,a)^Tw$Critic Updates w by TD(0) Actor Updates$\theta$by policy gradient function QAC Initialize$s, \theta$​ Sample$a \sim \pi_\theta$​ for each step do ​ Sample reward$r=\mathcal{R}_s^a$; sample transition$s’ \sim \mathcal{P}_s^a$,. ​ Sample action$a’ \sim \pi_\theta(s’,a’)$​$\delta = r + \gamma Q_w(s’,a’)- Q_w(s,a)$​$\theta = \theta + \alpha \Delta_\theta \log \pi_\theta(s,a) Q_w(s,a)$​$w \gets w+\beta \delta\phi(s,a)$​$a \gets a’, s\gets s’$​ end for end function So it seems that Value-based learning is a spacial case of actor-critic, since the greedy function based on Q is one spacial case of policy gradient, when we set the policy gradient step size very large, then the probability of the action which max Q will close to 1, and the others will close to 0, that is what greedy means. Reducing variance using a baseline Subtract a baseline function$B(s)$from the policy gradient This can reduce variance, without changing expectation a good baseline is the state value function$B(s) = V^{\pi_\theta}(s)$So we can rewrite the policy gradient using the advantage function$A^{\pi_\theta}(s,a)$Actually, by using advantage function, we get rid of the variance between states, and it will make our policy network more stable. So how to estimate the advantage function? you can using two network to estimate Q and V respectively, but it is more complicated. More commonly used is by bootstrapping. TD error TD error is an unbiased estimate(sample) of the advantage function So In practice, we can use an approximate TD error for one step this approach only requires one set of critic parameters for v. For Critic, we can plug in previous used methods in value approximation, such as MC, TD(0),TD($\lambda$) and TD($\lambda$) with eligibility traces. 
MC policy gradient,$\mathrm{v}_t$is the true MC return. TD(0) TD($\lambda$) TD($\lambda$) with eligibility traces (backward-view) For continuous action space, we use Gauss to represent our policy, but Gauss is noisy, so it’s better to use deterministic policy(by just picking the mean) to reduce noise and make it easy to converge. This turns out the deterministic policy gradient(DPG) algorithm. Deterministic policy gradient(off-policy) Deterministic policy: Q network parametrize by$\theta^Q$,the distribution of states under behavior policy is$\rho^\beta$policy network parametrize by$\theta^\mu$to make training more stable, we use target network for both critic network and actor network, and update them by soft update: and we set$\tau$very small to update parameters smoothly, e.g.$\tau = 0.001$. In addition, we add some noise to deterministic action when we are exploring the environment to get experience. where$\mathcal{N}$is the noise, it can be chosen to suit the environment, e.g. Ornstein-Uhlenbeck noise. 
8. Integrating Learning and Planning. Introduction: model-free RL uses no model and learns the value function (and/or policy) from experience; model-based RL learns a model from experience and plans the value function (and/or policy) with the model. Model: $\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$. Model learning from experience $\{S_1,A_1,R_2,\dots,S_T\}$ by supervised learning: $s,a \to r$ is a regression problem; $s,a \to s'$ is a density estimation problem. Planning with a model, i.e. sample-based planning: sample experience from the model and apply model-free RL (Monte-Carlo control, Sarsa, Q-learning) to the samples. The performance of model-based RL is limited by the optimal policy of the approximate MDP. Integrated architectures, integrating learning and planning: Dyna. Learn a model from real experience; learn and plan the value function (and/or policy) from both real and simulated experience. Simulation-Based Search. Forward search: select the best action by lookahead; build a search tree with the current state $s_t$ at the root; solve the sub-MDP starting from now. Simulation-based search: simulate episodes of experience from now with the model, and apply model-free RL to the simulated episodes (Monte-Carlo control $\to$ Monte-Carlo search; Sarsa $\to$ TD search). Simple Monte-Carlo search: given a model $\mathcal{M}_v$ and a simulation policy $\pi$, for each action $a \in \mathcal{A}$ simulate K episodes from the current (real) state $s_t$, evaluate each action by its mean return (Monte-Carlo evaluation), and select the current (real) action with maximum value. Monte-Carlo tree search: given a model $\mathcal{M}_v$, simulate K episodes from the current (real) state $s_t$ using the current simulation policy $\pi$; build a search tree containing the visited states and actions; evaluate states Q(s,a) by the mean return of episodes from s,a; after the search is finished, select the current (real) action with maximum value in the search tree. Each simulation consists of two phases (in-tree, out-of-tree): the tree policy (which improves) picks actions to maximise Q(s,a); the default policy (fixed) picks actions randomly. Here we update Q on the whole
sub-tree, not only the current state. After every simulated episode we improve the policy based on the newly updated values, then start a new search. As the search progresses, we exploit the direction that is more promising, since we keep shifting the search policy toward it; we also need to explore the other directions a little, so we can run MCTS by picking the action with the maximum Upper Confidence Bound applied to Trees (UCT); that is the idea behind AlphaZero. Temporal-Difference Search: e.g. update by Sarsa; you can also use function approximation for the simulated Q. Dyna-2: long-term memory (real experience) uses TD learning; short-term (working) memory (simulated experience) uses TD search & TD learning. 9. Exploration and Exploitation. Ways to explore: random exploration (use Gaussian noise in a continuous action space; $\epsilon$-greedy, i.e. act randomly with probability $\epsilon$; Softmax, i.e. sample an action from the policy distribution); optimism in the face of uncertainty, i.e. prefer to explore states/actions with the highest uncertainty (optimistic initialization, UCB, Thompson sampling); information state space (Gittins indices, Bayes-adaptive MDPs). State-action exploration vs. parameter exploration. Multi-arm bandit: total regret. Optimistic initialization: initialize Q(a) to a high value, then act greedily; this yields linear regret. $\epsilon$-greedy also yields linear regret. Decaying $\epsilon$-greedy gives sub-linear regret but needs to know the gaps: if you tune the decay very well so it matches the gap, it is good; otherwise it may be bad.
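The decaying $\epsilon$-greedy scheme can be sketched on a simple multi-arm bandit (the $c/t$ decay schedule, the Gaussian rewards, and the constant $c$ are illustrative assumptions; choosing $c$ well is exactly where knowledge of the gaps comes in):

```python
import random

def eps_greedy_bandit(true_means, steps=2000, c=10.0, seed=1):
    """Decaying eps-greedy on a multi-arm bandit: eps_t = min(1, c / (t + 1))."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    q = [0.0] * n_arms                  # empirical mean reward per arm
    for t in range(steps):
        eps = min(1.0, c / (t + 1))     # decaying exploration rate
        if rng.random() < eps:
            a = rng.randrange(n_arms)                   # explore
        else:
            a = max(range(n_arms), key=q.__getitem__)   # exploit
        r = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]                  # incremental mean
    return q, counts
```

With a fixed $\epsilon$ the agent keeps paying a constant exploration cost forever (linear regret); the $c/t$ decay is what makes sub-linear regret possible, provided $c$ suits the gaps between arm means.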
The regret has a lower bound, and it is logarithmic. The performance of any algorithm is determined by the similarity between the optimal arm and the other arms. Optimism in the Face of Uncertainty: Upper Confidence Bounds (UCB). Estimate an upper confidence $U_t(a)$ for each action value, such that $q(a) \leq Q_t(a)+U_t(a)$ with high probability. The upper confidence depends on the number of times $N(a)$ that action a has been sampled. Select the action maximizing the Upper Confidence Bound (UCB). Theorem (Hoeffding’s Inequality): let $X_1,\dots,X_t$ be i.i.d. random variables in $[0,1]$, and let $\overline{X}_t = \frac{1}{t}\sum_{\tau=1}^t X_\tau$ be the sample mean; then $P[\,\mathbb{E}[X] > \overline{X}_t + u\,] \leq e^{-2tu^2}$. We apply Hoeffding’s inequality to the rewards of the bandit, conditioned on selecting action a: pick a probability p that the true value exceeds the UCB, then solve for $U_t(a)$; reduce p as we observe more rewards, e.g. $p = t^{-4}$, to make sure we select the optimal action as $t \to \infty$. This leads to the UCB1 algorithm, which achieves logarithmic asymptotic total regret. Bayesian Bandits: probability matching (Thompson sampling) is optimal for one sample, but may not be good for MDPs. Solving information-state-space bandits: define an MDP on the information state space (MDP UCB, R-Max algorithm). 10. Case Study: RL in Classic Games TBA.Set up machine learning development environment2019-04-18T00:00:00+08:002019-04-18T00:00:00+08:00https://dongdongbh.tech/ML-env<h2 id="background">background</h2> <p>Machine learning algorithms always need lots of computing resources; the machines are generally big and noisy, and most of them run Linux, so most of us run our code on a remote server.
How to set up the remote development environment so that we can work smoothly is really important.</p> <p>There are several things we have to set up:</p> <ul> <li>ssh</li> <li>file transfer with server</li> <li>UI on remote</li> <li>machine learning environment</li> <li>remote development tool</li> <li>others</li> </ul> <p><strong>Notice:</strong> If you are in <strong>mainland China</strong>, you’d better set up a proxy to go through the <a href="https://en.wikipedia.org/wiki/Great_Firewall">GFW</a>, so that you can enjoy a free <u>network</u>. You need to set up your proxy both in the browser and in the terminal, since you will download many packages from the terminal while setting up your environment; otherwise it may cost you lots of time. For how to set up a network proxy, refer to my post—<a href="https://dongdongbh.tech/blog/vps/">set up VPS</a>.</p> <p><em>Special statement: This tutorial is only for learning and research, thanks.</em></p> <h2 id="ssh">ssh</h2> <p>Basically, we visit our remote server through an ssh connection; refer to this for <a href="http://www.linuxproblem.org/art_9.html">ssh without password</a>.</p> <p>On a Linux PC, just follow the basic tutorial. On Windows, you first need a bash environment; I recommend some bash env.
based on mintty, such as <a href="http://cygwin.com/">Cygwin</a>, <a href="https://www.git-scm.com/downloads">git bash</a> or <a href="https://github.com/goreliu/wsl-terminal">wsl-terminal</a>.</p> <p>You can give your ssh server an alias by editing <code class="highlighter-rouge">~/.ssh/config</code>, e.g.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Host my_server
    HostName example.com   # ip or domain name
    User root              # user name
</code></pre></div></div> <p>Then you can just visit your server with <code class="highlighter-rouge">ssh my_server</code>.</p> <p>In addition, if you need to connect to your server through a <strong>jumper machine</strong>, refer to my <a href="https://dongdongbh.github.io/note/#/server">note</a> on how to make ssh smoother by adding an ssh tunnel.</p> <p>If your server must be visited through an SSL-based VPN, the VPN client only has a Windows version, and your host machine is Linux, then how do you make it work? Refer to my post—<a href="https://dongdongbh.tech/enabling-ssl-VPN-on-linux/">Enabling SSL VPN on Linux</a>.</p> <h2 id="file-transfer-with-server">file transfer with server</h2> <p>Just refer to my other post <a href="https://dongdongbh.tech/markup/file-transport/">Transfer files</a>, using scp or sshfs.</p> <h2 id="machine-learning-environment">machine learning environment</h2> <ol> <li> <p>basic tools</p> <p>Just install some basic tools on Linux, such as</p> <blockquote> <p>git vim tmux htop etc.</p> </blockquote> <p>You can write a shell script to do all that; I will publish a script on my GitHub to do that later.</p> </li> <li> <p>development environment</p> <p>machine learning frameworks:</p> <ul> <li><a href="https://pytorch.org/get-started/locally/">pytorch</a></li> <li><a href="https://www.tensorflow.org/install">tensorflow</a></li> <li>Cuda driver</li> </ul> <p>how to set up the environment</p> <p>Most remote servers use python as the high-level development language; there are
several python <strong>package management</strong> tools, such as <a href="https://pypi.org/project/pip/">pip</a> and <a href="https://docs.conda.io/en/latest/">conda</a>, and python <strong>virtual environment</strong> managers, such as <a href="https://docs.conda.io/en/latest/miniconda.html">miniconda</a>, <a href="https://docs.anaconda.com/">anaconda</a> and <a href="https://github.com/pyenv/pyenv">pyenv</a>. I personally recommend conda; you can use Miniconda or Anaconda.</p> <p>You’d better set up a python environment that is separate from the system one, since other people may also be using your machine, and mixing things up may make your environment heavy and out of your control. Moreover, a virtual env makes it easier to transplant your environment.</p> <p>Some basic conda commands; for more, refer to <a href="https://docs.conda.io/projects/conda/en/latest/commands.html">conda cmd</a>.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>conda install numpy
conda remove numpy
conda create <span class="nt">-n</span> myenv
conda create <span class="nt">-n</span> test_env <span class="nv">python</span><span class="o">=</span>3.6
conda list
conda config <span class="nt">--add</span> channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config <span class="nt">--set</span> show_channel_urls yes
conda activate <span class="nv">$ENVIRONMENT_NAME</span>
conda deactivate
</code></pre></div> </div> </li> </ol> <h2 id="remote-development-tool">remote development tool</h2> <p>Some developers do like to just write code in vim via ssh to their server, so how to develop on remote?
one option is Jupyter notebook, where you can write code and view plotted figures in the browser; for an IDE, I recommend PyCharm, with which you can write code locally and automatically synchronize it to your remote server, running the code remotely from your local PyCharm IDE.</p> <ol> <li> <p>Jupyter</p> <p>For more about the jupyter setup, refer to my note <a href="https://dongdongbh.github.io/note/#/remote-visit-https">jupyter and tensorboard configuration</a></p> <p>If you want to use specific GPUs in your python program, please use <code class="highlighter-rouge">CUDA_VISIBLE_DEVICES=0,1 python test.py</code>. If you are in Jupyter, there is a way to set it, but sometimes it doesn’t work, so I recommend you just add the following line to your ‘~/.bashrc’, so you needn’t set it every time.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export CUDA_VISIBLE_DEVICES=0,1
</code></pre></div> </div> </li> <li> <p>tensorboard</p> <p>tensorboard is a monitor for you to trace and debug your algorithm; it is also visited in the browser after the remote server sets it up.</p> </li> <li> <p>pycharm</p> <p>To set up pycharm to work with a remote server, refer to this <a href="https://medium.com/@erikhallstrm/work-remotely-with-pycharm-tensorflow-and-ssh-c60564be862d">post</a>.</p> <p>On pycharm, you can edit, run and debug your code on the remote server.
and it also supports matplotlib, so you can plot figures on the remote machine and view them in the IDE.</p> </li> <li> <p>vs code</p> <p>vs code can only edit remote files, but cannot run them on the remote machine via the local interface; for details, refer to <a href="https://matttrent.com/remote-development/">Developing on a remote server</a></p> </li> </ol> <h2 id="view-remote-ui-on-local">view remote UI on local</h2> <p>Basically, you can use X11 forwarding to do it: just use <code class="highlighter-rouge">ssh -X name@domain</code>. On Linux, you only need a basic setup in <code class="highlighter-rouge">.ssh/config</code> to enable X11 forwarding; to test it, just run <code class="highlighter-rouge">xclock</code> on the remote server, and a clock UI will pop up on your local machine.</p> <p>For Windows users, you need to install and open an X11 server first; you can install <a href="http://www.straightrunning.com/XmingNotes/">xming</a>.</p> <h2 id="others">Others</h2> <p>TBC.</p>Dongda Lidongdongbhbh@gmail.combackground
Enabling SSL VPN on linux2019-04-12T00:00:00+08:002019-04-12T00:00:00+08:00https://dongdongbh.tech/enabling-ssl-VPN-on-linux<h3 id="why">Why?</h3> <p>Many SSL VPN software companies (such as Huawei, Sangfor, etc.) do not provide a Linux client, so we need to use a virtual machine to run the Windows client and bridge the network to Linux. Ref. this <a href="https://zsrkmyn.github.io/how-to-use-sangfor-sslvpn-in-linux.html">post</a>, which is in Chinese (LOL).</p> <h3 id="how-to-install-windows-on-qemu-hosted-on-ubuntu1804">How to install Windows on qemu hosted on ubuntu 18.04</h3> <ol> <li>First, make sure you properly installed “qemu-kvm” and “virt-manager”.</li> <li>Download the Windows installation ISO file and the <a href="https://fedoraproject.org/wiki/Windows_Virtio_Drivers">VirtIO</a> ISO file. Notice that you <strong>must</strong> use the latest VirtIO driver; otherwise there is a bug that makes your network unstable.</li> <li>Start <em>Virtual Machine Manager</em> to run the GUI installation guide.
Like this:<img src="../assets/images/vpn_post/1555035639703.png" alt="Virtual Machine Manager" /></li> <li>Then follow <a href="https://github.com/hpaluch/hpaluch.github.io/wiki/Install-Windows7-on-KVM-Qemu">this wiki</a> to install.</li> </ol> <p><strong>Notice:</strong> In my test case, Huawei’s VPN software doesn’t work on Win7; it only works on Win8 or later.</p> <h3 id="how-to-bridge">How to bridge</h3> <ol> <li>On the Linux host, add and start a bridge:</li> </ol> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo ip l add qbr0 type bridge
sudo ip l set qbr0 up
</code></pre></div></div> <ol> <li> <p>Add this bridge network card (NC) to the virtual machine like this<img src="../assets/images/vpn_post/1555061415008.png" alt="1555061415008" /></p> </li> <li> <p>Set up the VPN NC to be shared to the bridge NC on the guest machine.</p> </li> <li> <p>Start the SSL VPN software on Windows; Windows sets your bridge NC IP to <code class="highlighter-rouge">192.168.137.1</code>, as shown in the figure.</p> <p><img src="../assets/images/vpn_post/1555062668324.png" alt="1555062668324" /></p> </li> <li> <p>So in all, you have to run the following before you start your network:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>ip l add qbr0 <span class="nb">type </span>bridge
<span class="nb">sudo </span>ip l <span class="nb">set </span>qbr0 up
<span class="nb">sudo </span>ip a add 192.168.137.9/24 dev qbr0
<span class="nb">sudo </span>ip r add 10.0.0.0/8 via 192.168.137.1 dev qbr0
</code></pre></div> </div> </li> <li> <p>Verify with <code class="highlighter-rouge">ip r</code> like this, where <code class="highlighter-rouge">172.xx.xx.xx</code> is my host NC IP, <code class="highlighter-rouge">192.168.122.xx</code> is the virtual machine bridge IP for the Internet, and <code class="highlighter-rouge">192.168.137.x</code> is for the bridge VPN.
Oh, my God, how complex it is!!!</p> <p><img src="../assets/images/vpn_post/route2.png" alt="route2" /></p> </li> <li> <p>Enjoy it!</p> <p><code class="highlighter-rouge">ssh xxx@10.xx.xx.xx</code></p> </li> </ol> <p>Thanks to my friend <a href="https://zsrkmyn.github.io/">Stephen</a> for helping me debug this.</p> <p>Thanks for reading!</p>Dongda Lidongdongbhbh@gmail.comWhy?Set up HP laserjet 1020 printer on linux2019-04-12T00:00:00+08:002019-04-12T00:00:00+08:00https://dongdongbh.tech/setup-hp-1020-priter-on-linux<h3 id="background">Background</h3> <p>Most Linux distributions have printer drivers by default; for HP it is HPLIP, but an additional closed-source plugin is needed for this 1020 printer. When installing this plug-in I encountered some problems, which I describe below.</p> <h3 id="problem">Problem</h3> <p>Print error:</p> <p><code class="highlighter-rouge">hp laserjet 1020, hpcups 3.17.10, requires proprietary plugin</code></p> <h3 id="solution-1-hpip">Solution 1: hplip</h3> <ol> <li>install hplip</li> </ol> <p><code class="highlighter-rouge">sudo apt-get install hplip hplip-gui</code></p> <ol> <li>refer to <a href="https://developers.hp.com/hp-linux-imaging-and-printing/binary_plugin.html">this</a> to install the plugin: <ol> <li>connect the printer and type the command hp-plugin</li> <li>follow the GUI to automatically download the plugin, <strong>but</strong> there was a 404 problem for me, maybe caused by my network (I am in China, LOL).
so you need to manually download the plugin and install it.</li> <li>you can manually download the xxx.plugin file <a href="https://www.openprinting.org/download/printdriver/auxfiles/HP/plugins/">here</a>, and load it from the local file into HPLIP with the <code class="highlighter-rouge">hp-plugin</code> command.</li> <li>enjoy it!</li> </ol> </li> </ol> <h3 id="solution-2-foo2zjs">Solution 2: foo2zjs</h3> <p><code class="highlighter-rouge">foo2zjs</code> is an open-source printer driver, and it works well on the HP 1020.</p> <ol> <li> <p>remove hplip: <code class="highlighter-rouge">sudo apt-get remove --assume-yes hplip hpijs hplip-cups hplip-data libhpmud0 foomatic-db-hpijs </code></p> </li> <li>build and install foo2zjs: <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get install cupsys-bsd foo2zjs make build-essential
wget http://support.ideainformatica.com/hplj1020/foo2zjs-patched.tar.gz
tar zxvf foo2zjs-patched.tar.gz
cd foo2zjs
make
sudo make install
sudo make install-udev
sudo udevstart
</code></pre></div> </div> </li> <li> <p>plug in the printer and run <code class="highlighter-rouge">sudo /etc/init.d/cupsys restart</code></p> </li> <li> <p>to make plain (lpr) text print nicely, run</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo lpoptions -o cpi=12 -o lpi=7 -o page-left=36 -o page-right=36 -o page-top=36 -o page-bottom=36
</code></pre></div> </div> </li> <li> <p>open the system printer manager and set the HP printer model to foo2zjs by changing the model from <code class="highlighter-rouge">foo2zjs/ppd</code> and selecting the <code class="highlighter-rouge">xxx.ppd</code> for your printer.</p> </li> <li> <p>if there is any problem, you may modify the ppd file manually, e.g.
change the default page size from “Letter” to A4:</p> <p><code class="highlighter-rouge">sudo gedit xxx/ppd/LaserJet-1020.ppd</code></p> </li> <li>enjoy it!</li> </ol>Dongda Lidongdongbhbh@gmail.comBackgroundCreate your website on cloud2019-01-28T00:00:00+08:002019-01-28T00:00:00+08:00https://dongdongbh.tech/resource/create-website<h2 id="create-your-website-on-virtual-private-servervps">Create your website on a Virtual Private Server (VPS)</h2> <p>We host our website on a cloud VPS. The website is based on Jekyll, so we can simply write our pages in Markdown. For the convenience of updating the site, we build a Git server on the VPS to publish it automatically.</p> <h3 id="requirements">Requirements</h3> <ul> <li>a VPS (e.g. a google cloud VM instance)</li> <li>a domain name (e.g. dongdongbh.tech)</li> </ul> <h3 id="steps">Steps</h3> <ol> <li> <p>ssh login to your server (assuming your server runs Linux);</p> </li> <li> <p>install Ruby, <a href="https://jekyllrb.com/docs/">Jekyll</a>, Git, <a href="https://www.nginx.com/resources/wiki/start/topics/tutorials/install/">Nginx</a>;</p> </li> <li> <p>set up Nginx in <code class="highlighter-rouge">/etc/nginx/sites-enabled/default</code>, write:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">##</span> <span class="c1"># You should look at the following URL's in order to grasp a solid understanding</span> <span class="c1"># of Nginx configuration files in order to fully unleash the power of Nginx.</span> <span class="c1"># http://wiki.nginx.org/Pitfalls</span> <span class="c1"># http://wiki.nginx.org/QuickStart</span> <span class="c1"># http://wiki.nginx.org/Configuration</span> <span class="c1">#</span> <span class="c1"># Generally, you will want to move this file somewhere, and start with a clean</span> <span class="c1"># file but keep this around for reference.
Or just disable in sites-enabled.</span> <span class="c1">#</span> <span class="c1"># Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.</span> <span class="c1">##</span> <span class="c1"># Default server configuration</span> <span class="c1">#</span> <span class="s">server {</span> <span class="err"> </span><span class="s">listen 80;</span> <span class="err"> </span><span class="s">listen [::]:80;</span> <span class="err"> </span> <span class="err"> </span><span class="c1"># your domain name</span> <span class="err"> </span><span class="s">server_name dongdongbh.tech www.dongdongbh.tech;</span> <span class="err"> </span><span class="s">rewrite ^(.*)$ https://$host$1 permanent;</span> <span class="err">}</span> <span class="s">server {</span> <span class="err"> </span><span class="c1"># listen 80;</span> <span class="err"> </span><span class="c1"># listen [::]:80;</span> <span class="err"> </span><span class="c1"># SSL configuration for https</span> <span class="err"> </span> <span class="err"> </span><span class="s">listen 443 ssl default_server;</span> <span class="err"> </span><span class="s">listen [::]:443 ssl default_server;</span> <span class="err"> </span><span class="c1"># put your ssl certificate file under /etc/nginx/cert directory and set here</span> <span class="err"> </span><span class="c1"># or you can follow the ssl vendor's instruction</span> <span class="err"> </span><span class="c1"># ssl on; </span> <span class="err"> </span><span class="s">ssl_certificate cert/xxxxxxxxxxxx.pem;</span> <span class="s">ssl_certificate_key cert/xxxxxxxxxxxx.key;</span> <span class="err"> </span><span class="s">ssl_session_timeout 5m;</span> <span class="s">ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE:ECDH:AES:HIGH:!NULL:!aNULL:!MD5:!ADH:!RC4;</span> <span class="s">ssl_protocols TLSv1 TLSv1.1 TLSv1.2;</span> <span class="s">ssl_prefer_server_ciphers on;</span> <span class="err"> </span> <span class="err"> </span><span class="c1"># Self 
signed certs generated by the ssl-cert package</span> <span class="err"> </span><span class="c1"># Don't use them in a production server!</span> <span class="err"> </span><span class="c1">#</span> <span class="err"> </span><span class="c1">#include snippets/snakeoil.conf;</span> <span class="err"> </span><span class="c1"># your site location</span> <span class="err"> </span><span class="s">root /var/www/mysite;</span> <span class="err"> </span><span class="c1"># Add index.php to the list if you are using PHP</span> <span class="err"> </span><span class="s">index index.html index.htm index.nginx-debian.html;</span> <span class="err"> </span><span class="c1"># your domain name</span> <span class="err"> </span><span class="s">server_name dongdongbh.tech www.dongdongbh.tech;</span> <span class="err"> </span><span class="s">location / {</span> <span class="err"> </span><span class="c1"># First attempt to serve request as file, then</span> <span class="err"> </span><span class="c1"># as directory, then fall back to displaying a 404.</span> <span class="err"> </span><span class="s">try_files $uri $uri/ =404;</span> <span class="err"> }</span> <span class="err">}</span> </code></pre></div> </div> </li> <li> <p>For HTTPS SSL encryption, you can use the free certificate provider <a href="https://letsencrypt.org/getting-started/">Let’s Encrypt</a>, with <a href="https://certbot.eff.org/lets-encrypt/debianstretch-nginx">Certbot</a> to do it.</p> </li> <li> <p>set up your git server repository for your site, e.g. <code class="highlighter-rouge">/srv/git/website.git</code>.
For details, ref. <a href="https://git-scm.com/book/en/v2/Git-on-the-Server-Setting-Up-the-Server">Setting Up Git Server</a></p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>chgrp <span class="nt">-R</span> <span class="o">[</span>remote user name] /srv/git<span class="o">(</span>the dir<span class="o">)</span>
<span class="nb">sudo </span>chmod <span class="nt">-R</span> g+rw /srv/git<span class="o">(</span>the dir<span class="o">)</span>
</code></pre></div> </div> </li> <li> <p>find (or create) the file <code class="highlighter-rouge">post-receive</code> in the dir <code class="highlighter-rouge">/srv/git/website.git/hooks</code> and fill in the following lines:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nv">GIT_REPO</span><span class="o">=</span>/srv/git/website.git
<span class="nv">TMP_GIT_CLONE</span><span class="o">=</span>/tmp/mysite
<span class="nv">PUBLIC_WWW</span><span class="o">=</span>/var/www/mysite
git clone <span class="nv">$GIT_REPO</span> <span class="nv">$TMP_GIT_CLONE</span>
<span class="nb">cd</span> <span class="nv">$TMP_GIT_CLONE</span>
<span class="nb">sudo </span><span class="nv">JEKYLL_ENV</span><span class="o">=</span>production bundle <span class="nb">exec </span>jekyll build <span class="nt">-s</span> <span class="nv">$TMP_GIT_CLONE</span> <span class="nt">-d</span> <span class="nv">$PUBLIC_WWW</span>
rm <span class="nt">-Rf</span> <span class="nv">$TMP_GIT_CLONE</span>
<span class="nb">exit</span>
</code></pre></div> </div> </li> <li> <p>make sure your VPS ports 80&amp;443 are open;</p> </li> <li> <p>In the site directory on your local computer:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git remote add server user_name@dongdongbh.tech:/srv/git/website.git
</code></pre></div> </div> <p>then you can use git to update
your website: when you push your local updates to the server, your site will update automatically.</p> </li> </ol>Dongda Lidongdongbhbh@gmail.comCreate your website on Virtual Private Server(VPS)Reinforcement learning notes2018-12-01T00:00:00+08:002018-12-01T00:00:00+08:00https://dongdongbh.tech/RL-note<h2 id="table-of-contents">Table of contents</h2> <ul> <li><a href="#1.Basic">Basic</a></li> <li><a href="#cross-entropy-method">Cross-entropy method</a></li> <li><a href="#tabular-learning">Tabular Learning</a></li> <li><a href="#deep-q-learning">DQN</a></li> <li><a href="#policy-gradients">Policy Gradients</a></li> <li><a href="#Deep-Reinforcement-Learning (Deep RL) in-Natural-Language-Processing (NLP)">DRL in NLP</a></li> <li><a href="#nn-functions">NN functions</a></li> </ul> <h2 id="basic">basic</h2> <h5 id="markov-decision-process-mdp">Markov Decision Process (MDP)</h5> <p><img src="../assets/images/rl/reinforcement-learning-fig1-700.jpg" width="70%" /></p> <p>environment, state, observation, reward, action, agent</p> <h5 id="policy">Policy</h5> <script type="math/tex; mode=display">\pi: S \times A \to [0,1]\\ \label{l1} \pi(a|s) = P(a_t=a|s_t=s)</script> <h5 id="state-value-function">State-value function</h5> <script type="math/tex; mode=display">R = \Sigma_{t = 0}^{\infty}\gamma^tr_t\\ \label{l2} V_\pi(s) = E[R] = E[\Sigma_{t = 0}^{\infty}\gamma^tr_t|s_0 = s]\\</script> <p>where $r_t$ is the reward at step $t$ and $\gamma\in[0,1]$ is the discount rate.</p> <h5 id="value-function">Value function</h5> <script type="math/tex; mode=display">V^\pi(s) =E[R|s,\pi]\\ \label{l3} V^*(s) = \max_\pi V^\pi(s)</script> <h5 id="action-value-function">Action value function</h5> <script type="math/tex; mode=display">Q^\pi(s,a) = E[R|s,a,\pi]\\ \label{l4} Q^*(s,a) = \max_\pi Q^\pi(s,a)</script> <h3 id="method-classification">method classification</h3> <ul> <li>model-based: use previous observations to <strong>predict</strong> the following rewards and observations</li> <li>model-free: train it by
intuition</li> <li>policy-based: <strong>directly</strong> approximating the policy of the agent</li> <li>value-based: the agent calculates the <strong>value of every possible action</strong></li> <li>off-policy: the method can learn from old <strong>historical data</strong></li> <li>on-policy: the method requires <strong>fresh data</strong> obtained from the environment</li> </ul> <h3 id="police-based-method">Policy-based method</h3> <p><strong>just like a classification problem</strong></p> <ul> <li>NN input: observation</li> <li>NN output: distribution over actions</li> <li>agent: randomly chooses an action based on the distribution over actions (the policy)</li> </ul> <h2 id="cross-entropy-method">cross-entropy method</h2> <h4 id="steps">steps:</h4> <ol> <li>Play N episodes using our current model and environment.</li> <li>Calculate the total reward for every episode and decide on a reward boundary. Usually, we use some percentile of all rewards, such as the 50th or 70th.</li> <li>Throw away all episodes with a reward below the boundary.</li> <li>Train on the remaining “elite” episodes using observations as the input and the issued actions as the desired output.</li> <li>Repeat from step 1 until we become satisfied with the result.</li> </ol> <p>use the <strong>cross-entropy loss</strong> as the loss function</p> <p><strong>drawback:</strong> Cross-entropy methods have difficulty telling which step or which state is good and which is not; they only know whether an episode as a whole was better or worse.</p> <h2 id="tabular-learning">tabular learning</h2> <h3 id="why-using-q-but-not-v">Why use Q but not V?</h3> <p>If I know the value of the current state, I know whether the state is good, but I don’t know how to choose the next action; even if I know V of all next states, I <strong>cannot directly</strong> tell which action to take, so we base the decision on Q.</p> <p>If I know Q for all available actions, we just choose the action with the maximal Q; then this action
surely has the max V according to the definition of V (the relationship between Q and V).</p> <h3 id="the-value-iteration-in-the-env-with-a-loop">The value iteration in the Env with a loop</h3> <p>If there is no discounting ($\gamma = 1$) and the environment has a loop, the value of a state will be infinite.</p> <h3 id="problems--in-q-learning">problems in Q-learning</h3> <ul> <li>the state space is not discrete</li> <li>the state space is very large</li> <li>we don’t know the transition-probability and reward matrix $P(s',r|s,a)$</li> </ul> <h3 id="value-iteration">Value iteration</h3> <h4 id="reward-table">Reward table</h4> <ul> <li>index: “source state” + “action” + “target state”</li> <li>value: reward</li> </ul> <h4 id="transition-table">Transition table</h4> <ul> <li>index: “state” + “action”</li> <li>value: a mapping from target state to visit counts</li> </ul> <h4 id="value-table">Value table</h4> <ul> <li>index: state</li> <li>value: value of the state</li> </ul> <h4 id="steps-1">Steps</h4> <ol> <li> <p>take random actions to build the reward and transition tables</p> </li> <li> <p>perform a value-iteration loop over all states</p> </li> <li> <p>play several full episodes, choosing the best action using the updated value table; at the same time, update the reward and transition tables with the new data.</p> </li> </ol> <p><strong>Problems of separating training and testing</strong>: With the previous steps you actually separate training and testing, which brings another problem: if the task is difficult, random actions can hardly reach the final state, so you may lack samples of the states which are near the final step.
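<p>The value-iteration steps above can be sketched in plain Python. This is a toy sketch on a hypothetical hand-built two-state MDP (the tables and helper names are mine, not the book's code):</p>

```python
# Hypothetical toy MDP, tables laid out as described above.
# rewards[(s, a, s2)] -> observed reward; transits[(s, a)] -> {target state: count}
rewards = {(0, 0, 0): 0.0, (0, 1, 1): 1.0, (1, 0, 0): 0.0, (1, 1, 1): 2.0}
transits = {(0, 0): {0: 10}, (0, 1): {1: 10},
            (1, 0): {0: 10}, (1, 1): {1: 10}}
GAMMA = 0.9

def action_value(values, s, a):
    """Q(s, a) estimated from the transition counts and the reward table."""
    counts = transits[(s, a)]
    total = sum(counts.values())
    return sum(c / total * (rewards[(s, a, s2)] + GAMMA * values[s2])
               for s2, c in counts.items())

def value_iteration_sweep(values, states=(0, 1), actions=(0, 1)):
    """One sweep over all states: V(s) <- max_a Q(s, a)."""
    return [max(action_value(values, s, a) for a in actions) for s in states]

values = [0.0, 0.0]
for _ in range(500):
    values = value_iteration_sweep(values)
# values converges to the fixed point of the Bellman optimality equation.
```

Here state 1 with action 1 loops on itself with reward 2, so its value converges to $2/(1-\gamma)=20$, and state 0 reaches it for reward 1, giving $1+0.9\cdot 20=19$.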
So you should probably conduct training and testing at the same time, and add some exploration into testing.</p> <h3 id="q-learning">Q-learning</h3> <p>Different from value iteration, Q-learning changes the value table to a Q-value table:</p> <h4 id="q-value-table">Q value table</h4> <ul> <li>index: “state” + “action”</li> <li>value: action value (Q)</li> </ul> <p>Here: <script type="math/tex">V(s) = \max_{a}Q(s,a)</script></p> <h2 id="deep-q-learning">deep q-learning</h2> <h4 id="dqn">DQN:</h4> <p>input: state</p> <p>output: the values of all n actions for the input state</p> <p>classification: off-policy, value-based and model-free</p> <h4 id="problems">problems:</h4> <ul> <li>how to balance explore &amp; exploit</li> <li>data is not independent and identically distributed (i.i.d.), which is required for neural network training</li> <li>the environment may be a partially observable MDP (<strong>POMDP</strong>)</li> </ul> <h4 id="basic-tricks-in-deepmind-2015-paper">Basic tricks in Deepmind 2015 paper:</h4> <ul> <li>$\epsilon$-greedy to deal with explore &amp; exploit</li> <li>replay buffer and target network to deal with the i.i.d. requirement: <ul> <li>the replay buffer makes the data more random by sampling experiences randomly from the buffer</li> <li>the target network isolates the influence of nearby Q values during training</li> </ul> </li> <li>several observations stacked as one state to deal with POMDP</li> </ul> <h4 id="double-dqn">Double DQN</h4> <p><strong>Idea:</strong> Choose <strong>actions</strong> for the next state using the <strong>trained network</strong> but take <strong>values of Q from the target net</strong>.</p> <h4 id="noisy-networks">Noisy Networks</h4> <p><strong>Idea:</strong> Add <strong>noise to the weights of the fully-connected layers</strong> of the network and adjust the parameters of this noise during training using back propagation.
(to replace $\epsilon$-greedy and improve performance)</p> <h4 id="prioritized-replay-buffer">Prioritized replay buffer</h4> <p><strong>Idea:</strong> This method tries to improve the efficiency of samples in the replay buffer by <strong>prioritizing those samples according to the training loss</strong>.</p> <p><strong>Trick:</strong> use loss weights to compensate for the distribution bias introduced by the priorities.</p> <h4 id="dueling-dqn">Dueling DQN</h4> <p><strong>Idea:</strong> The Q-values Q(s, a) our network is trying to approximate can be divided into two quantities: the value of the state V(s) and the advantage of actions in this state A(s, a).</p> <p><strong>Trick:</strong> constrain the mean advantage in any state to be zero.</p> <h4 id="categorical-dqn">Categorical DQN</h4> <p><strong>Idea:</strong> Train the probability distribution of the action's Q-value rather than the Q-value itself.</p> <p><strong>Tricks:</strong></p> <ul> <li> <p>use a generic parametric distribution: a fixed number of values placed regularly on a value range; each of these support points is called an atom.</p> </li> <li> <p>use the Kullback–Leibler (KL) divergence as the loss</p> </li> </ul> <h2 id="policy-gradients">policy gradients</h2> <h3 id="reinforce">REINFORCE</h3> <h4 id="idea">idea</h4> <p>Policy Gradient <script type="math/tex">\nabla J \approx E[Q(s,a)\nabla\log\pi(a|s)] \label{l5}</script> loss formula <script type="math/tex">loss = -Q(s,a)\log\pi(a|s) \label{l6}</script> Increase the probability of actions that have given us good total reward and decrease the probability of actions with bad final outcomes.
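<p>The loss formula above takes only a few lines of plain Python; a toy sketch with made-up per-step numbers (not a full training loop):</p>

```python
import math

def reinforce_loss(episode):
    """loss = -sum_t Q(s_t, a_t) * log pi(a_t | s_t), as in the formula above.

    episode: list of (q_value, action_prob) pairs, where action_prob is
    pi(a_t | s_t) of the action actually taken (hypothetical toy data).
    """
    return -sum(q * math.log(p) for q, p in episode)

# A good total reward (large Q) on a low-probability action gives a large loss,
# so minimizing it pushes that action's probability up.
loss = reinforce_loss([(1.0, 0.5), (2.0, 0.25)])
```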
<script type="math/tex">0 < \pi(a|s) \le 1\\ -\log\pi(a|s) \ge 0 \label{l7}</script></p> <h4 id="problems-1"><strong>problems:</strong></h4> <ul> <li>one training step needs full episodes, since Q is computed from a finished episode</li> <li> <p>high gradient variance: episodes with many steps have a larger Q than short ones</p> </li> <li>convergence to some locally-optimal policy due to lack of exploration</li> <li>data is not i.i.d.: correlation between samples</li> </ul> <h4 id="basic-tricks">basic tricks</h4> <ul> <li>learning Q (Actor-Critic)</li> <li>subtracting a value called the baseline from Q to reduce the gradient variance</li> <li>to prevent our agent from being stuck in a local minimum, subtracting the entropy from the loss function, punishing the agent for being too certain about the action to take</li> <li>parallel environments to reduce <strong>correlation</strong>, taking steps from different environments</li> </ul> <h3 id="actor--critic">Actor-Critic</h3> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{aligned} Q(s,a) &= \Sigma_{i=0}^{N-1}\gamma^ir_i+\gamma^NV(s_N)\\ Loss_{value} &= MSE(V(s),Q(s,a))\\ \label{Q_update} \end{aligned} \end{equation} %]]></script> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{aligned} Q(s,a) &= A(s,a)+V(s)\\ Loss_{policy} &= -A(s,a)\log\pi(a|s)\\ \label{pg_update} \end{aligned} \end{equation} %]]></script> <p>Use equation \ref{Q_update} to train V(s) (the critic) and equation \ref{pg_update} to train the policy. We call A(s,a) the advantage, so this is advantage Actor-Critic (<strong>A2C</strong>).</p> <p><strong>Idea</strong>: The scale of our gradient is just the advantage A(s, a); we use another neural network to approximate V(s) for every observation.</p> <h4 id="implementation">Implementation</h4> <p>In practice, policy and value networks partially overlap, mostly due to efficiency and convergence considerations.
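<p>The two A2C loss equations above can be sketched together in plain Python (a toy sketch with made-up numbers; the helper name is mine):</p>

```python
import math

def a2c_losses(rewards, v_last, v_s, action_prob, gamma=0.99):
    """N-step Q estimate, value loss (MSE) and policy loss, per the equations above."""
    n = len(rewards)
    q = sum(gamma ** i * r for i, r in enumerate(rewards)) + gamma ** n * v_last
    value_loss = (v_s - q) ** 2          # MSE between V(s) and the Q estimate
    advantage = q - v_s                  # A(s, a) = Q(s, a) - V(s)
    policy_loss = -advantage * math.log(action_prob)
    return q, value_loss, policy_loss

# Two rewards of 1.0, V(s_N) = 10, V(s) = 5, pi(a|s) = 0.5, gamma = 0.9:
q, value_loss, policy_loss = a2c_losses([1.0, 1.0], 10.0, 5.0, 0.5, gamma=0.9)
```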
In this case, policy and value are implemented as different heads of the network, taking the output from the common body and transforming it into the probability distribution and a single number representing the value of the state. This lets both heads share low-level features but combine them in different ways.</p> <h4 id="tricks">Tricks</h4> <ul> <li> <p>add an entropy bonus to the loss function <script type="math/tex">H_{entropy} = -\Sigma (\pi \log\pi) \\ Loss_{entropy} = \beta*\Sigma_i (\pi_\theta(s_i)*\log\pi_\theta(s_i)) \label{l10}</script></p> <blockquote> <p>the entropy loss has its minimum when the probability distribution is uniform, so by adding it to the loss function we push our agent away from being too certain about its actions.</p> </blockquote> </li> <li> <p>use several environments to improve stability</p> </li> <li> <p>gradient clipping prevents our gradients at the optimization stage from becoming too large and pushing our policy too far.</p> </li> </ul> <h4 id="total-loss-function">Total Loss function</h4> <p>Finally, our loss is the sum of the PG, value and entropy losses <script type="math/tex">Loss =Loss_{policy}+Loss_{value}+Loss_{entropy}</script></p> <h4 id="asynchronous-advantage-actor-critica3c">Asynchronous Advantage Actor-Critic(A3C)</h4> <blockquote> <p>It just uses parallel environments to speed up training; there are also some code-level tricks that speed things up by fully utilizing multiple GPUs and CPUs.
For more details, ref some open source implementations on Github.</p> </blockquote> <h2 id="deep-reinforcement-learning-deep-rl-in-natural-language-processing-nlp">Deep Reinforcement Learning (Deep RL) in Natural Language Processing (NLP)</h2> <h3 id="basic-concepts-in-nlp">Basic concepts in NLP</h3> <ul> <li>Recurrent Neural Networks (RNNs)</li> <li>word embeddings</li> <li>the <strong>seq2seq</strong> model</li> <li><a href="http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention">Recurrent models of visual attention</a> (original paper NIPS 2014)</li> </ul> <p>Ref. <a href="http://cs224d.stanford.edu">CS224d</a> for more about NLP.</p> <h4 id="rnn">RNN</h4> <p>The idea of an RNN is a network with fixed input and output that is applied to a sequence of objects and can pass information along this sequence. This information is called the hidden state and is normally just a vector of numbers of some size.</p> <p><strong>Unfold RNN (unfold by time)</strong></p> <p><img src="../assets/images/rl/Recurrent_neural_network_unfold.svg.png" width="70%" /></p> <p>An RNN produces different outputs for the same input in different contexts, so RNNs can be seen as a standard building block for systems that need to process variable-length input.</p> <h5 id="lstm">LSTM</h5> <p><img src="../assets/images/rl/800px-Long_Short-Term_Memory.svg.png" width="70%" /></p> <h4 id="word-embeddingword2vec">Word embedding(word2vec)</h4> <p>Word and phrase embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where <strong>words or phrases from the vocabulary are mapped to vectors of real numbers</strong>.
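<p>Mechanically, an embedding is just a lookup table from word to vector. A minimal sketch with a made-up three-word vocabulary (the vectors are invented for illustration, not trained):</p>

```python
import math

# Toy embedding table: each word maps to a low-dimensional real vector.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.9],
    "apple": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity, a common way to compare embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# With trained vectors, related words end up closer than unrelated ones.
similar = cosine(embeddings["king"], embeddings["queen"])
dissimilar = cosine(embeddings["king"], embeddings["apple"])
```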
Conceptually it involves a mathematical embedding from a space with <strong>one dimension per word to a continuous vector space with a much lower dimension</strong>.</p> <p><strong>Methods</strong> to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge-base methods, and explicit representation in terms of the context in which words appear.</p> <p>Word embeddings are useful for NLP tasks such as syntactic parsing and sentiment analysis. Ref <a href="https://en.wikipedia.org/wiki/Word_embedding">word embedding</a> for details.</p> <p>You can use pretrained embeddings or train your own on your dataset.</p> <h4 id="encoder-decoderseq2seq">Encoder-Decoder(seq2seq)</h4> <p><img src="../assets/images/rl/seq2seq.jpg" width="70%" /></p> <p>Use an RNN to process an input sequence and encode it into a fixed-length representation; this RNN is called the encoder. Then feed the encoded vector into another RNN, called the decoder, which has to produce the resulting sequence.
It is widely used in machine translation.</p> <ul> <li> <p><strong>teacher-forcing mode</strong>: the decoder input is the target reference</p> </li> <li> <p><strong>curriculum learning mode</strong>: the decoder input is the last output of the previous decoder step</p> <table> <thead> <tr> <th style="text-align: center"><img src="../assets/images/rl/curriculum learning.jpg" width="70%" /></th> </tr> </thead> <tbody> <tr> <td style="text-align: center">curriculum learning mode</td> </tr> </tbody> </table> </li> <li> <p><strong>attention mechanism</strong></p> </li> </ul> <table> <thead> <tr> <th style="text-align: center"><img src="../assets/images/rl/nmt-model-fast.gif" width="70%" /></th> </tr> </thead> <tbody> <tr> <td style="text-align: center">seq2seq (picture from Google)</td> </tr> <tr> <td style="text-align: center"><img src="../assets/images/rl/attention.jpg" width="70%" /></td> </tr> <tr> <td style="text-align: center">attention mechanism (this picture from <a href="https://zhuanlan.zhihu.com/p/40920384">zhihu</a>)</td> </tr> </tbody> </table> <h3 id="rl-in-seq2seq">RL in seq2seq</h3> <ul> <li>sample from the probability distribution instead of learning some average result</li> <li>the score is not differentiable, but we can still use PG updates, with the score as the scale</li> <li>introduce stochasticity into the decoding process when the dataset is limited</li> <li>use the argmax score as the baseline for Q</li> </ul> <h2 id="continuous-action-space">Continuous Action Space</h2> <h2 id="ddpg">DDPG</h2> <p>TBC.</p> <h2 id="model-based-rl">Model-based RL</h2> <h4 id="what-is-model-base">What is model-based?</h4> <p>Consider our MDP as a model that takes a state s and an action a and outputs a reward r and a new state s’, i.e. <code class="highlighter-rouge">(r,s') = M(s,a)</code>. Sometimes we do not know the exact model of the MDP, but we can learn it from our experiences. Here is the question: what is the model used for?
We can use it for searching and planning: searching gives us more experiences by simulating with our model alone, and planning means using the simulated experiences to update our policy and value function; the planning methods can be the same as the direct RL methods.</p> <p>See the following diagram: the left side is direct RL, and the right side is model-based RL.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-----------------&gt;policy update&lt;-----------------------
|                        |                             |
|direct RL               |planning                     |
|            real environment               simulation |
|                        ^                             |
|interact                |searching                    |
v                        |                             |
-------------------experiences-----------------------&gt;Model
</code></pre></div></div> <blockquote> <p>Why use model-based RL?</p> <p>We want to get more experiences by simulation and speed up learning.</p> </blockquote> <h4 id="how-to-search">How to search?</h4> <p>We want to get more experiences by searching, so how do we search? Here are some methods:</p> <ul> <li>Rollout search</li> <li>Monte Carlo tree search (MCTS)</li> <li>…</li> </ul> <h2 id="nn-functions">nn functions</h2> <h4 id="sigmoid">sigmoid</h4> <p><em><u>It transfers a scalar input to (0,1)</u></em></p> <script type="math/tex; mode=display">f(x)=\frac{1}{1+e^{-x}} = \frac{e^{x}}{e^{x}+1}</script> <h4 id="softmax"><strong>softmax</strong></h4> <p>In short, <em><u>it transfers a K-dimensional vector input to values in (0,1) that sum to 1</u></em></p> <p>In mathematics, the softmax function, or normalized exponential function, is a generalization of the logistic function that “squashes” a K-dimensional vector <strong>z</strong> of arbitrary real values to a K-dimensional vector σ(<strong>z</strong>) of real values, where each entry is in the range (0, 1) and all the entries add up to 1.</p> <h4 id="tanh">tanh</h4> <p><em><u>It transfers a scalar input to (-1,1)</u></em></p> <script type="math/tex; mode=display">f(x)=tanh(x)= \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}</script> <h4 id="relu"><strong>relu</strong></h4> <script
type="math/tex; mode=display">f(x)=max(0,x)</script> <h2 id="reference">Reference</h2> <ul> <li> <p>Lapan M. Deep Reinforcement Learning Hands-On. 2018.</p> </li> <li>Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529.</li> <li>Mnih V, Heess N, Graves A. Recurrent models of visual attention[C]//Advances in neural information processing systems. 2014: 2204-2212.</li> <li>Paszke A, Gross S, et al. Automatic differentiation in PyTorch. 2017.</li> </ul>Dongda Lidongdongbhbh@gmail.comTable of contents Basic Cross-entropy method Tabular Learning DQN Policy Gradients DRL in NLP NN functionsConvolutional Neural Networks dimension2018-12-01T00:00:00+08:002018-12-01T00:00:00+08:00https://dongdongbh.tech/CNN-dimension<h3 id="convolution-operation-share-the-convolution-core"><strong>convolution operation</strong>: share the convolution core</h3> <p>output size is: <script type="math/tex">O = (n-f+1) * (n-f+1)</script> where the input size is $n\times n$ and the convolution core size is $f\times f$.</p> <p><img src="../assets/images/cnn/convolution_core.gif" alt="con" /></p> <h3 id="terms-channels-strides-padding"><strong>terms</strong>: channels, strides, padding</h3> <ul> <li><strong>channels</strong>: see the following picture: the input is a $6\times 6\times 3$ RGB picture and we use a $3\times 3\times 3$ convolution core; the last 3 is the channel number of the convolution core, which is the same as the channel number of the input picture. During convolution, we multiply element-wise with the input and add everything together, so the output size is $4\times 4\times 1$.</li> </ul> <p><img src="../assets/images/cnn/convolution.png" alt="cnn" /></p> <p>We usually use more than one convolution core: multiple cores extract multiple features from the input. The following image shows the situation where we use two cores, so the output size is $4\times 4\times 2$.
Then the input channel of the following convolution layer is 2.</p> <p><img src="../assets/images/cnn/cnn.png" alt="cnn" /></p> <p><img src="../assets/images/cnn/layers.png" alt="cnn" /></p> <p>output size: <script type="math/tex">O =(n-f+1) * (n-f+1) * C_o</script> where $C_o$ is the number of convolution cores (output channels).</p> <ul> <li> <p><strong>padding</strong>: add data to the border of the input, often to make sure the output size is the same as the input.</p> <p><img src="../assets/images/cnn/padding.png" alt="padding" /></p> <p>output size: <script type="math/tex">O =(n+2p-f+1) * (n+2p-f+1) * C_o</script> where $p$ is the padding length on one side, e.g. in the last picture $p = 1$.</p> <p>if you want to make sure the output size is the same as the input, set $p = (f-1)/2$, since then $n-f+2p+1 = n$.</p> </li> <li> <p><strong>strides</strong>: the stride is the step length the convolution core moves each time.</p> </li> </ul> <p><img src="../assets/images/cnn/stride.png" alt="strides" /></p> <p><img src="../assets/images/cnn/stride-2.png" alt="strides" /></p> <p>output size: <script type="math/tex">O =(\frac{n+2p-f}{s}+1) * (\frac{n+2p-f}{s}+1) * C_o</script> where $s$ is the stride length.</p> <h3 id="demo"><strong>demo</strong></h3> <p><img src="../assets/images/cnn/demo.gif" alt="cnn" /></p> <p>So let’s get the demo’s output size:</p> <hr /> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{aligned} p=1\\ s=2\\ S_{input} &= 7*7*3\\ S_{core} &= 3*3*3\\ S_{out} &= (\frac{n+2p-f}{s}+1) * (\frac{n+2p-f}{s}+1) * C_o \\ &= (\frac{7+2-3}{2}+1) * (\frac{7+2-3}{2}+1) * 2 \\&= 4*4*2 \end{aligned} \end{equation} %]]></script>Dongda Lidongdongbhbh@gmail.comconvolution operation: share the convolution core
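<p>The output-size formula and the demo's arithmetic above can be checked with a small helper (the function name is mine, not from any library):</p>

```python
def conv_output_size(n, f, p=0, s=1, cores=1):
    """Spatial side is (n + 2p - f) / s + 1; depth is the number of cores."""
    side = (n + 2 * p - f) // s + 1
    return (side, side, cores)

# Demo above: 7x7x3 input, 3x3x3 core, padding 1, stride 2, two cores.
demo_shape = conv_output_size(n=7, f=3, p=1, s=2, cores=2)   # 4x4x2

# Earlier example: 6x6x3 input, 3x3x3 core, no padding, stride 1, two cores.
basic_shape = conv_output_size(n=6, f=3, cores=2)            # 4x4x2
```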