<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://dongdongbh.tech/feed.xml" rel="self" type="application/atom+xml" /><link href="https://dongdongbh.tech/" rel="alternate" type="text/html" /><updated>2026-03-22T02:03:43-04:00</updated><id>https://dongdongbh.tech/feed.xml</id><title type="html">Dongda’s homepage</title><subtitle>Homepage of Dongda Li, an amazing website.</subtitle><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><entry><title type="html">Mindwtr: The Best Free, Open-Source GTD App for All Platforms</title><link href="https://dongdongbh.tech/blog/mindwtr/" rel="alternate" type="text/html" title="Mindwtr: The Best Free, Open-Source GTD App for All Platforms" /><published>2025-12-10T00:00:00-05:00</published><updated>2026-03-22T02:03:37-04:00</updated><id>https://dongdongbh.tech/blog/mindwtr</id><content type="html" xml:base="https://dongdongbh.tech/blog/mindwtr/"><![CDATA[<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "Mindwtr",
  "applicationCategory": "ProductivityApplication",
  "operatingSystem": "Windows, macOS, Linux, Android, iOS",
  "description": "Free, open-source GTD (Getting Things Done) app. Local-first, cross-platform, no account required.",
  "url": "https://dongdongbh.tech/blog/mindwtr/",
  "downloadUrl": "https://github.com/dongdongbh/Mindwtr/releases",
  "codeRepository": "https://github.com/dongdongbh/Mindwtr",
  "installUrl": [
    "https://apps.apple.com/app/mindwtr/id6758597144",
    "https://play.google.com/store/apps/details?id=tech.dongdongbh.mindwtr",
    "https://apps.microsoft.com/detail/9n0v5b0b6frx"
  ],
  "screenshot": "https://dongdongbh.tech/assets/images/mindwtr-og.png",
  "keywords": ["GTD", "Getting Things Done", "task management", "productivity", "open source", "local-first"],
  "author": {
    "@type": "Person",
    "name": "Dongda Li",
    "url": "https://dongdongbh.tech"
  },
  "offers": {
    "@type": "Offer",
    "price": "0",
    "priceCurrency": "USD"
  },
  "license": "https://opensource.org/licenses/AGPL-3.0",
  "isAccessibleForFree": true,
  "featureList": [
    "GTD workflow (Capture, Clarify, Organize, Reflect, Engage)",
    "Local-first data model",
    "Cross-platform (Windows, macOS, Linux, Android, iOS, Web)",
    "WebDAV, Dropbox, and self-hosted sync",
    "AI copilot with BYOK and local LLM support",
    "Obsidian integration",
    "CLI, REST API, and MCP server"
  ]
}
</script>

<h2 id="what-is-mindwtr">What is Mindwtr?</h2>

<p><strong>Mindwtr is a free, open-source GTD (Getting Things Done) application</strong> that runs on Windows, macOS, Linux, Android, and iOS. It is local-first, requires no account, and implements the complete GTD methodology — from inbox capture to weekly review.</p>

<p>I built Mindwtr because I could not find a GTD app that matched how I actually live and think. I wanted something calm, fast, and honest. Not a product designed to maximize screen time. Not a tool that makes me pay forever just to keep access to my own tasks. And not an app that treats Getting Things Done like a checklist of trendy features.</p>

<p>I also needed true cross-platform support. My day moves across different devices and operating systems, so I wanted one GTD system that follows me everywhere instead of forcing me into one ecosystem.</p>

<p>I wanted a system I could trust for years:</p>
<ul>
  <li>my data stays mine</li>
  <li>the workflow stays clear</li>
  <li>the app stays useful even without a central hosted service</li>
  <li>the experience stays consistent across platforms</li>
</ul>

<p>So I started building Mindwtr.</p>

<h2 id="why-gtd-and-why-a-dedicated-gtd-app-matters">Why GTD, and why a dedicated GTD app matters</h2>

<p>For me, GTD is not about being “productive” in a social-media sense. It is about mental clarity.</p>

<p>My brain works better when it does not need to remember everything. The Getting Things Done method gives me a reliable loop:</p>
<ul>
  <li><strong>Capture</strong> what has my attention</li>
  <li><strong>Clarify</strong> what it means</li>
  <li><strong>Organize</strong> it in the right place</li>
  <li><strong>Reflect</strong> regularly</li>
  <li><strong>Engage</strong> with confidence</li>
</ul>

<p>Mindwtr is built around that full GTD workflow. If an app skips these parts, it becomes a simple list manager. I wanted a full GTD practice, not a bucket of todos. That is the key difference between a task management app and a true GTD application.</p>

<h2 id="how-mindwtr-compares-to-other-gtd-apps">How Mindwtr compares to other GTD apps</h2>

<p>When searching for the best GTD app, you will find options like Todoist, TickTick, OmniFocus, Nirvana, and Everdo. Here is how Mindwtr compares:</p>

<table>
  <thead>
    <tr>
      <th>Capability</th>
      <th>Mindwtr</th>
      <th>Todoist</th>
      <th>TickTick</th>
      <th>OmniFocus</th>
      <th>NirvanaHQ</th>
      <th>Everdo</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Open source</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
    </tr>
    <tr>
      <td>GTD-native workflow</td>
      <td>Yes</td>
      <td>Partial</td>
      <td>Partial</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>All major platforms (incl. Linux)</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>Apple only</td>
      <td>Web + mobile</td>
      <td>No mobile</td>
    </tr>
    <tr>
      <td>Local-first, no account required</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
      <td>Yes</td>
      <td>No</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>AI assistant (BYOK + local LLM)</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Flexible sync (WebDAV / Dropbox / self-hosted)</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>Partial</td>
    </tr>
    <tr>
      <td>Completely free</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
    </tr>
  </tbody>
</table>

<p>Mindwtr is the only GTD app that combines open source, full cross-platform support including Linux, local-first data, and a complete Getting Things Done workflow — all for free.</p>

<h2 id="key-features-of-mindwtr-as-a-gtd-application">Key features of Mindwtr as a GTD application</h2>

<ul>
  <li><strong>Full GTD workflow</strong>: Capture, Clarify, Organize, Reflect, Engage — end to end.</li>
  <li><strong>Focus view</strong>: combines a time-based agenda with context-filtered next actions.</li>
  <li><strong>Local-first data</strong>: file-based storage with optional WebDAV, Dropbox, or self-hosted cloud sync.</li>
  <li><strong>Obsidian integration</strong>: import tasks from your Obsidian vault with deep links on desktop.</li>
  <li><strong>AI copilot</strong> (optional): clarify, break down, and review tasks with BYOK AI (OpenAI, Gemini, Claude, or local LLMs).</li>
  <li><strong>Cross-platform</strong>: desktop apps (Tauri v2) for Windows, macOS, Linux; mobile apps (React Native) for Android and iOS; plus a PWA.</li>
  <li><strong>Automation</strong>: CLI, REST API, and MCP server for LLM-powered workflows.</li>
  <li><strong>Weekly review wizard</strong>: guided review with reminders to keep your GTD system current.</li>
  <li><strong>Pomodoro timer</strong>: optional focus timer integrated into the Focus view.</li>
  <li><strong>16 languages</strong>: English, Chinese, Spanish, Hindi, Arabic, German, Russian, Japanese, French, Portuguese, Polish, Korean, Italian, Turkish, Dutch, and more.</li>
</ul>

<h2 id="philosophy-a-calm-gtd-app">Philosophy: a calm GTD app</h2>

<p>Mindwtr follows a simple principle:
<strong>simple by default, powerful when needed.</strong></p>

<p>That means:</p>
<ul>
  <li>progressive disclosure: advanced options appear when they matter</li>
  <li>less by default: fewer knobs, less noise, less cognitive load</li>
  <li>avoid feature creep: clarity over clutter</li>
  <li>local-first foundation: your system should work even when the internet is unreliable</li>
  <li>practical cross-platform: desktop and mobile should feel like one trusted system</li>
</ul>

<p>I want Mindwtr to feel like a quiet workspace, not a cockpit.</p>

<h2 id="how-a-gtd-app-helps-in-daily-life">How a GTD app helps in daily life</h2>

<p>Most of the value is not dramatic. It is small, repeated relief.</p>

<ul>
  <li>In the morning, I can quickly see what deserves attention today.</li>
  <li>During the day, I can capture tasks before they disappear from memory.</li>
  <li>When I feel overloaded, I can process inbox items and turn ambiguity into clear next actions.</li>
  <li>In weekly review, I can reset direction instead of drifting.</li>
  <li>Across devices, I can keep one trusted system instead of scattered notes and reminders.</li>
</ul>

<p>That is the core promise of a good GTD application: less mental friction, better decisions, and more calm.</p>

<p>And because Mindwtr supports almost all major platforms, I do not have to rebuild my workflow when I switch devices.</p>

<h2 id="why-an-open-source-gtd-app-matters">Why an open-source GTD app matters</h2>

<p>Mindwtr is free and open source because this kind of tool should be inspectable, adaptable, and community-owned.</p>

<p>Open source means:</p>
<ul>
  <li>no lock-in by design</li>
  <li>transparent behavior</li>
  <li>contributions from real users</li>
  <li>long-term sustainability beyond one company roadmap</li>
</ul>

<p>If something feels wrong, anyone can report it. If something can be better, anyone can improve it. That keeps the project honest.</p>

<p>Most GTD apps on the market are proprietary and require monthly subscriptions. Mindwtr proves that a high-quality Getting Things Done application can be free, open, and community-driven.</p>

<h2 id="get-mindwtr--free-gtd-app-for-all-platforms">Get Mindwtr — free GTD app for all platforms</h2>

<p>Mindwtr started as a personal need, but it became a shared tool for people who want a practical GTD system without noise, lock-in, or subscription pressure.</p>

<p>Today it runs across almost all major platforms: <strong>Windows, macOS, Linux, Android, and iOS</strong>.</p>

<p><strong>Install Mindwtr:</strong></p>

<table>
  <thead>
    <tr>
      <th>Platform</th>
      <th>Install</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Windows</td>
      <td><a href="https://apps.microsoft.com/detail/9n0v5b0b6frx">Microsoft Store</a>, <a href="https://winstall.app/apps/dongdongbh.Mindwtr">Winget</a>, <a href="https://github.com/dongdongbh/homebrew-mindwtr">Scoop</a></td>
    </tr>
    <tr>
      <td>macOS</td>
      <td><a href="https://apps.apple.com/app/mindwtr/id6758597144">Mac App Store</a>, <a href="https://formulae.brew.sh/cask/mindwtr">Homebrew</a></td>
    </tr>
    <tr>
      <td>Linux</td>
      <td><a href="https://flathub.org/apps/tech.dongdongbh.mindwtr">Flathub</a>, <a href="https://aur.archlinux.org/packages/mindwtr-bin">AUR</a>, APT, DNF, AppImage</td>
    </tr>
    <tr>
      <td>Android</td>
      <td><a href="https://play.google.com/store/apps/details?id=tech.dongdongbh.mindwtr">Google Play</a>, <a href="https://apt.izzysoft.de/fdroid/index/apk/tech.dongdongbh.mindwtr">IzzyOnDroid</a></td>
    </tr>
    <tr>
      <td>iOS</td>
      <td><a href="https://apps.apple.com/app/mindwtr/id6758597144">App Store</a></td>
    </tr>
    <tr>
      <td>Web</td>
      <td>PWA with Docker self-hosting</td>
    </tr>
  </tbody>
</table>

<p><strong>Links:</strong></p>
<ul>
  <li>GitHub: <a href="https://github.com/dongdongbh/Mindwtr">https://github.com/dongdongbh/Mindwtr</a></li>
  <li>Wiki &amp; documentation: <a href="https://github.com/dongdongbh/Mindwtr/wiki">https://github.com/dongdongbh/Mindwtr/wiki</a></li>
  <li>Issues: <a href="https://github.com/dongdongbh/Mindwtr/issues">https://github.com/dongdongbh/Mindwtr/issues</a></li>
  <li>Discussions: <a href="https://github.com/dongdongbh/Mindwtr/discussions">https://github.com/dongdongbh/Mindwtr/discussions</a></li>
  <li>Discord: <a href="https://discord.gg/ahhFxuDBb4">https://discord.gg/ahhFxuDBb4</a></li>
</ul>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="productivity" /><category term="GTD" /><category term="open-source" /><category term="Getting Things Done" /><category term="task management" /><category term="GTD app" /><category term="cross-platform" /><summary type="html"><![CDATA[Mindwtr is a free, open-source GTD (Getting Things Done) app for Windows, macOS, Linux, Android, and iOS. Local-first, no account required, with full GTD workflow support.]]></summary></entry><entry><title type="html">From Docker to Singularity: Setting Up and Managing Tasks with HTCondor and Slurm</title><link href="https://dongdongbh.tech/blog/singularity/" rel="alternate" type="text/html" title="From Docker to Singularity: Setting Up and Managing Tasks with HTCondor and Slurm" /><published>2025-01-10T00:00:00-05:00</published><updated>2025-11-14T21:59:09-05:00</updated><id>https://dongdongbh.tech/blog/singularity</id><content type="html" xml:base="https://dongdongbh.tech/blog/singularity/"><![CDATA[<h3 id="background"><strong>Background</strong></h3>

<p>When I first started using my university’s computing cluster, I quickly realized I needed to set up custom environments for my tasks. Like many, I initially turned to <strong>Docker</strong>, a popular tool for containerization. However, I soon ran into challenges when using Docker on an HPC cluster. This led me to discover <strong>Singularity</strong>, a container solution specifically designed for HPC environments. In this post, I’ll explain why we need containers in HPC, the key differences between Docker and Singularity, and provide a step-by-step guide to managing tasks with Singularity, HTCondor, and Slurm.</p>

<hr />

<h3 id="why-do-we-need-containers-in-hpc"><strong>Why Do We Need Containers in HPC?</strong></h3>

<p>HPC clusters are shared environments where multiple users run diverse tasks. This can create conflicts:</p>
<ol>
  <li><strong>Dependency Issues</strong>: Programs often require specific libraries, compilers, or environments that may not be installed on the cluster.</li>
  <li><strong>Permission Restrictions</strong>: Most HPC systems don’t grant users <code class="language-plaintext highlighter-rouge">sudo</code> access, making it difficult to install system-level packages.</li>
  <li><strong>Reproducibility</strong>: Without containers, reproducing results across different systems can be challenging.</li>
</ol>

<p><strong>Containers</strong> solve these problems by bundling applications and their dependencies into portable environments. With a container, you can:</p>
<ul>
  <li>Install software that requires <code class="language-plaintext highlighter-rouge">sudo</code> inside the container.</li>
  <li>Run the container on any compatible system without worrying about the host environment.</li>
</ul>
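
<p>As a concrete sketch of that idea, here is a minimal Singularity definition file: everything that needs <code class="language-plaintext highlighter-rouge">sudo</code> is installed once in the <code class="language-plaintext highlighter-rouge">%post</code> section, and the resulting image runs anywhere Singularity is available. The package list is illustrative; swap in your own dependencies.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cuda_image.def (build with: sudo singularity build cuda_image.sif cuda_image.def)
Bootstrap: docker
From: nvidia/cuda:11.8.0-base-ubuntu20.04

%post
    # system-level setup that would require sudo on the host
    apt-get update &amp;&amp; apt-get install -y libjpeg-dev python3 python3-pip
    pip3 install torch torchvision
</code></pre></div></div>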

<hr />

<h3 id="why-use-singularity-for-hpc"><strong>Why Use Singularity for HPC?</strong></h3>

<p>As outlined above, containers solve the dependency, permission, and reproducibility problems of shared clusters. The remaining question is which container runtime is actually suitable for HPC.</p>

<p><strong>Why not Docker?</strong>
Docker isolates the container from the host and requires root privileges to run, which makes it unsuitable for shared HPC systems. <strong>Singularity</strong>, on the other hand:</p>
<ul>
  <li>Integrates seamlessly with the host system (e.g., mounts home directories by default).</li>
  <li>Runs without root privileges, making it safer and compatible with shared environments.</li>
  <li>Allows easy access to host-level resources like GPUs, shared filesystems, and user-installed environments (e.g., Conda).</li>
</ul>

<h3 id="key-difference-between-docker-and-singularity"><strong>Key Difference Between Docker and Singularity</strong></h3>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>Docker</th>
      <th>Singularity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Isolation</strong></td>
      <td>Containers are isolated from the host</td>
      <td>Integrates with the host system</td>
    </tr>
    <tr>
      <td><strong>Root Privileges</strong></td>
      <td>Requires root privileges to run</td>
      <td>Runs without root privileges</td>
    </tr>
    <tr>
      <td><strong>HPC Compatibility</strong></td>
      <td>Not designed for HPC</td>
      <td>Specifically designed for HPC</td>
    </tr>
    <tr>
      <td><strong>Filesystem Access</strong></td>
      <td>Host filesystem is not mounted</td>
      <td>Host home directory is mounted by default</td>
    </tr>
  </tbody>
</table>

<hr />

<h3 id="best-practices-for-singularity-in-hpc"><strong>Best Practices for Singularity in HPC</strong></h3>

<p>The key realization when using Singularity is that you <strong>only need to install software requiring <code class="language-plaintext highlighter-rouge">sudo</code></strong> inside the container. For everything else (e.g., user-level Python packages or Conda environments), you can use the host environment.</p>

<p>For example:</p>
<ol>
  <li>Use Singularity to install system-level dependencies (e.g., CUDA libraries).</li>
  <li>Use the host system for Conda environments, scripts, and datasets.</li>
</ol>
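
<p>The host side of that split needs nothing more than a regular user-level Conda setup, for example (environment name and packages are illustrative):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># on the host, outside any container (no sudo required)
conda create -n my_env python=3.10
conda activate my_env
pip install numpy wandb   # user-level packages live on the host
</code></pre></div></div>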

<hr />

<h3 id="step-by-step-guide-from-docker-to-singularity"><strong>Step-by-Step Guide: From Docker to Singularity</strong></h3>

<h4 id="1-save-a-running-docker-container-as-an-image"><strong>1. Save a Running Docker Container as an Image</strong></h4>
<ol>
  <li><strong>Run and Configure the Docker Container</strong>:
Start a Docker container:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">-it</span> nvidia/cuda:11.8.0-base-ubuntu20.04 bash
</code></pre></div>    </div>
    <p>Inside the container, install system-level dependencies:</p>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get update <span class="o">&amp;&amp;</span> apt-get <span class="nb">install</span> <span class="nt">-y</span> libjpeg-dev python3 python3-pip
pip <span class="nb">install </span>torch torchvision
</code></pre></div>    </div>
  </li>
  <li><strong>Save the Running Container as a Docker Image</strong>:
Get the container ID:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker ps
</code></pre></div>    </div>
    <p>Commit the running container:</p>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker commit &lt;container_id&gt; cuda_image
</code></pre></div>    </div>
  </li>
</ol>

<h4 id="2-convert-the-docker-image-to-a-singularity-sif-file"><strong>2. Convert the Docker Image to a Singularity SIF File</strong></h4>
<p>Use Singularity to convert the Docker image:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>singularity build cuda_image.sif docker-daemon://cuda_image:latest
</code></pre></div></div>
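
<p>If the image is already published to a registry, Singularity can also build the SIF directly from it, with no local Docker daemon involved (shown here with the same CUDA base image; adjust the tag as needed):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>singularity build cuda_image.sif docker://nvidia/cuda:11.8.0-base-ubuntu20.04
</code></pre></div></div>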


<h4 id="3-use-singularity-with-host-resources"><strong>3. Use Singularity with Host Resources</strong></h4>
<p>Run the Singularity container, binding the host’s home directory and using the host’s Conda environment:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>singularity <span class="nb">exec</span> <span class="nt">--bind</span> /home/user:/home/user cuda_image.sif bash <span class="nt">-c</span> <span class="s2">"
  source /home/user/miniconda3/etc/profile.d/conda.sh &amp;&amp;
  conda activate my_env &amp;&amp;
  python /home/user/code/train.py
"</span>
</code></pre></div></div>
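
<p>One detail worth adding for GPU jobs: Singularity only exposes the host’s NVIDIA driver and devices inside the container when you pass the <code class="language-plaintext highlighter-rouge">--nv</code> flag. A GPU variant of the command above looks like:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>singularity exec --nv --bind /home/user:/home/user cuda_image.sif \
  python /home/user/code/train.py
</code></pre></div></div>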

<hr />

<h3 id="running-tasks-with-htcondor"><strong>Running Tasks with HTCondor</strong></h3>

<h4 id="1-wrapper-script"><strong>1. Wrapper Script</strong></h4>
<p>The wrapper script is executed for each task submission. Ensure it uses the proper shebang:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/usr/bin/bash</span>

<span class="c"># Load Conda and activate the environment</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="s2">"/home/user/miniconda3/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>
<span class="nb">source</span> /home/user/miniconda3/etc/profile.d/conda.sh
conda activate my_env

<span class="c"># Run the Singularity container and execute the Python script</span>
singularity <span class="nb">exec</span> <span class="nt">--nv</span> <span class="nt">--bind</span> /home/user:/home/user cuda_image.sif bash <span class="nt">-c</span> <span class="s2">"
  source /home/user/miniconda3/etc/profile.d/conda.sh &amp;&amp;
  conda activate my_env &amp;&amp;
  python /home/user/code/train.py
"</span>
</code></pre></div></div>

<h4 id="2-htcondor-submit-file"><strong>2. HTCondor Submit File</strong></h4>
<p>Create a submission file for your task:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>executable = wrapper.sh
output     = output/task.out
error      = output/task.err
log        = output/task.log
request_gpus = 1
Requirements = (CUDADeviceName == "NVIDIA A100 80GB PCIe")
queue
</code></pre></div></div>

<p>Submit the task:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>condor_submit task.sub
</code></pre></div></div>
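
<p>HTCondor is also convenient for parameter sweeps: one submit file can queue many independent jobs, with the <code class="language-plaintext highlighter-rouge">$(Process)</code> macro expanding to 0, 1, 2, … for each job. A sketch (how <code class="language-plaintext highlighter-rouge">wrapper.sh</code> interprets its argument is up to you):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>executable = wrapper.sh
arguments  = $(Process)
output     = output/task_$(Process).out
error      = output/task_$(Process).err
log        = output/task.log
request_gpus = 1
queue 10
</code></pre></div></div>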

<h4 id="3-monitor-jobs"><strong>3. Monitor Jobs</strong></h4>
<p>Check the status of your jobs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>condor_q
</code></pre></div></div>
<p>Check GPU availability:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>condor_status <span class="nt">-constraint</span> <span class="s1">'CUDADeviceName == "NVIDIA A100 80GB PCIe"'</span>
</code></pre></div></div>
<p>Check GPU users:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>condor_status <span class="nt">-constraint</span> <span class="s1">'CUDADeviceName == "NVIDIA A100 80GB PCIe" &amp;&amp; State == "Claimed"'</span> <span class="nt">-af</span> Name RemoteOwner
</code></pre></div></div>
<hr />

<h3 id="managing-tasks-with-slurm"><strong>Managing Tasks with Slurm</strong></h3>

<p>Slurm is another workload manager, optimized for distributed training and tightly coupled tasks.</p>

<h4 id="1-slurm-script"><strong>1. Slurm Script</strong></h4>
<p>Write a Slurm submission script:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c">#SBATCH --job-name=my_task</span>
<span class="c">#SBATCH --output=task.out</span>
<span class="c">#SBATCH --error=task.err</span>
<span class="c">#SBATCH --gres=gpu:1</span>

singularity <span class="nb">exec</span> <span class="nt">--nv</span> /path/to/cuda_image.sif python /home/user/code/train.py
</code></pre></div></div>

<p>Submit the job:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sbatch task.slurm
</code></pre></div></div>
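
<p>Monitoring and control mirror the HTCondor commands above:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>squeue -u $USER      # list your pending and running jobs
scancel &lt;job_id&gt;     # cancel a job
sacct -j &lt;job_id&gt;    # accounting info for a finished job
</code></pre></div></div>
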
<h3 id="comparing-htcondor-and-slurm"><strong>Comparing HTCondor and Slurm</strong></h3>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>HTCondor</th>
      <th>Slurm</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Best Use Case</strong></td>
      <td>High-throughput, independent tasks</td>
      <td>Distributed, tightly coupled tasks</td>
    </tr>
    <tr>
      <td><strong>GPU Support</strong></td>
      <td>Yes</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td><strong>Ease of Use</strong></td>
      <td>Simple for independent jobs</td>
      <td>Better for multi-node configurations</td>
    </tr>
    <tr>
      <td><strong>Distributed Training</strong></td>
      <td>Not optimized for communication-heavy jobs</td>
      <td>Supports MPI, NCCL, Gloo</td>
    </tr>
  </tbody>
</table>

<hr />

<h3 id="distributed-training-with-singularity"><strong>Distributed Training with Singularity</strong></h3>

<p>For distributed training, tools like <strong>NCCL</strong>, <strong>Gloo</strong>, and <strong>MPI</strong> are critical:</p>
<ol>
  <li><strong>NCCL</strong>: Best for multi-GPU training on NVIDIA hardware.</li>
  <li><strong>Gloo</strong>: General-purpose communication for PyTorch.</li>
  <li><strong>MPI</strong>: High-performance communication for multi-node setups.</li>
</ol>

<p><strong>Why InfiniBand?</strong></p>
<ul>
  <li>Standard Ethernet can become a bottleneck in communication-heavy distributed training.</li>
  <li>InfiniBand provides high-speed, low-latency connections for scaling training across nodes.</li>
</ul>
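
<p>To see what interconnect and topology a node actually offers, a few quick checks help (availability of these tools varies by site):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip link              # look for ib0 / ibp* InfiniBand interfaces
ibstat               # adapter status, if infiniband-diags is installed
nvidia-smi topo -m   # GPU/NIC topology matrix on NVIDIA nodes
</code></pre></div></div>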

<hr />

<h3 id="conclusion"><strong>Conclusion</strong></h3>

<p>Singularity simplifies the process of running containerized tasks on HPC systems. By combining Singularity with HTCondor and Slurm, you can efficiently manage high-throughput and distributed workloads. Use Docker for building containers, but leverage Singularity for running them in HPC environments. And remember: only include system-level dependencies in the container, while keeping user-level tools and data on the host system.</p>

<p>For more information:</p>
<ul>
  <li><a href="https://sylabs.io/docs/">Singularity Documentation</a></li>
  <li><a href="https://docs.docker.com/">Docker Documentation</a></li>
  <li><a href="https://htcondor.readthedocs.io/">HTCondor Documentation</a></li>
  <li><a href="https://slurm.schedmd.com/documentation.html">Slurm Documentation</a></li>
</ul>

<hr />]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[From Docker to Singularity-Setting Up and Managing Tasks with HTCondor and Slurm]]></summary></entry><entry><title type="html">Using `torchrun` for Distributed Training</title><link href="https://dongdongbh.tech/blog/torchrun/" rel="alternate" type="text/html" title="Using `torchrun` for Distributed Training" /><published>2025-01-10T00:00:00-05:00</published><updated>2025-01-10T01:49:21-05:00</updated><id>https://dongdongbh.tech/blog/torchrun</id><content type="html" xml:base="https://dongdongbh.tech/blog/torchrun/"><![CDATA[<p><code class="language-plaintext highlighter-rouge">torchrun</code> is a utility provided by <strong>PyTorch</strong> to simplify launching distributed training jobs. It manages process spawning, inter-process communication, and resource allocation across multiple GPUs and nodes.</p>

<p>Here’s a detailed guide on how to use <code class="language-plaintext highlighter-rouge">torchrun</code> for distributed training:</p>

<hr />

<h3 id="1-understand-distributed-training-concepts"><strong>1. Understand Distributed Training Concepts</strong></h3>
<ul>
  <li><strong>Distributed Data Parallel (DDP)</strong>:
    <ul>
      <li>PyTorch’s <code class="language-plaintext highlighter-rouge">torch.nn.parallel.DistributedDataParallel</code> (DDP) is the backbone for distributed training.</li>
      <li>It splits data across GPUs and synchronizes gradients during training.</li>
    </ul>
  </li>
  <li><strong>Backend Options</strong>:
    <ul>
      <li><strong>NCCL</strong>: Recommended for GPU-based training (supports CUDA).</li>
      <li><strong>Gloo</strong>: Works for CPU-based training or smaller setups.</li>
      <li><strong>MPI</strong>: For large-scale multi-node clusters (requires MPI setup).</li>
    </ul>
  </li>
  <li><strong>Process Groups</strong>:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">torchrun</code> launches a group of processes that communicate with each other.</li>
      <li>Each GPU typically corresponds to one process.</li>
    </ul>
  </li>
</ul>

<hr />

<h3 id="2-install-necessary-dependencies"><strong>2. Install Necessary Dependencies</strong></h3>
<p>Ensure your PyTorch version supports <code class="language-plaintext highlighter-rouge">torchrun</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>torch torchvision torchaudio
</code></pre></div></div>

<p>For multi-node distributed training:</p>
<ul>
  <li><strong>NCCL</strong> is automatically installed with PyTorch.</li>
  <li>For <strong>MPI</strong>, install the required libraries:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nb">install </span>libopenmpi-dev
</code></pre></div>    </div>
  </li>
</ul>

<hr />

<h3 id="3-prepare-your-training-script"><strong>3. Prepare Your Training Script</strong></h3>
<p>Modify your PyTorch training script to use <code class="language-plaintext highlighter-rouge">DistributedDataParallel</code>.</p>

<h4 id="key-changes-in-trainpy">Key Changes in <code class="language-plaintext highlighter-rouge">train.py</code>:</h4>
<ol>
  <li><strong>Initialize Distributed Process Group</strong>:
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.distributed</span> <span class="k">as</span> <span class="n">dist</span>
<span class="kn">from</span> <span class="nn">torch.nn.parallel</span> <span class="kn">import</span> <span class="n">DistributedDataParallel</span> <span class="k">as</span> <span class="n">DDP</span>

<span class="k">def</span> <span class="nf">setup</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">world_size</span><span class="p">):</span>
    <span class="n">dist</span><span class="p">.</span><span class="n">init_process_group</span><span class="p">(</span><span class="s">"nccl"</span><span class="p">,</span> <span class="n">rank</span><span class="o">=</span><span class="n">rank</span><span class="p">,</span> <span class="n">world_size</span><span class="o">=</span><span class="n">world_size</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">cleanup</span><span class="p">():</span>
    <span class="n">dist</span><span class="p">.</span><span class="n">destroy_process_group</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li><strong>Wrap Your Model with DDP</strong>:
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">main</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">world_size</span><span class="p">):</span>
    <span class="n">setup</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">world_size</span><span class="p">)</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">MyModel</span><span class="p">().</span><span class="n">to</span><span class="p">(</span><span class="n">rank</span><span class="p">)</span>
    <span class="n">ddp_model</span> <span class="o">=</span> <span class="n">DDP</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">device_ids</span><span class="o">=</span><span class="p">[</span><span class="n">rank</span><span class="p">])</span>

    <span class="c1"># Training loop
</span>    <span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">ddp_model</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
        <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
        <span class="n">outputs</span> <span class="o">=</span> <span class="n">ddp_model</span><span class="p">(</span><span class="n">inputs</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">rank</span><span class="p">))</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="n">targets</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">rank</span><span class="p">))</span>
        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>

    <span class="n">cleanup</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li><strong>Spawn Processes (plain <code class="language-plaintext highlighter-rouge">python</code> launch only)</strong>:
Use <code class="language-plaintext highlighter-rouge">torch.multiprocessing.spawn</code> when you start the script directly with <code class="language-plaintext highlighter-rouge">python train.py</code>:
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">world_size</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">device_count</span><span class="p">()</span>
    <span class="n">torch</span><span class="p">.</span><span class="n">multiprocessing</span><span class="p">.</span><span class="n">spawn</span><span class="p">(</span><span class="n">main</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">world_size</span><span class="p">,),</span> <span class="n">nprocs</span><span class="o">=</span><span class="n">world_size</span><span class="p">,</span> <span class="n">join</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>    </div>
    <p>When launching with <code class="language-plaintext highlighter-rouge">torchrun</code> (next section), skip this step: <code class="language-plaintext highlighter-rouge">torchrun</code> spawns one process per GPU itself and sets the <code class="language-plaintext highlighter-rouge">RANK</code>, <code class="language-plaintext highlighter-rouge">WORLD_SIZE</code>, and <code class="language-plaintext highlighter-rouge">LOCAL_RANK</code> environment variables for your script to read.</p>
  </li>
</ol>

<hr />

<h3 id="4-launch-training-with-torchrun"><strong>4. Launch Training with <code class="language-plaintext highlighter-rouge">torchrun</code></strong></h3>
<p>Use <code class="language-plaintext highlighter-rouge">torchrun</code> to manage distributed training processes.</p>

<h4 id="single-node-multi-gpu-training"><strong>Single Node, Multi-GPU Training</strong></h4>
<p>For a single node with 4 GPUs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torchrun <span class="nt">--nproc_per_node</span><span class="o">=</span>4 train.py
</code></pre></div></div>

<ul>
  <li><code class="language-plaintext highlighter-rouge">--nproc_per_node</code>: Number of processes to launch (e.g., number of GPUs).</li>
</ul>

<h4 id="multi-node-distributed-training"><strong>Multi-Node Distributed Training</strong></h4>
<p>For multi-node setups:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torchrun <span class="nt">--nnodes</span><span class="o">=</span>2 <span class="nt">--nproc_per_node</span><span class="o">=</span>4 <span class="nt">--rdzv_backend</span><span class="o">=</span>c10d <span class="se">\</span>
         <span class="nt">--rdzv_endpoint</span><span class="o">=</span><span class="s2">"master_ip:29500"</span> train.py
</code></pre></div></div>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">--nnodes</code></strong>: Number of nodes participating in training.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">--nproc_per_node</code></strong>: Number of processes per node (typically number of GPUs per node).</li>
  <li><strong><code class="language-plaintext highlighter-rouge">--rdzv_backend</code></strong>: Rendezvous backend (<code class="language-plaintext highlighter-rouge">c10d</code> is the recommended choice).</li>
  <li><strong><code class="language-plaintext highlighter-rouge">--rdzv_endpoint</code></strong>: IP and port of the master node for communication.</li>
</ul>
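
<p>Before launching a multi-node run, it can save time to confirm each worker can actually reach the master’s rendezvous port over TCP. A quick check, assuming <code class="language-plaintext highlighter-rouge">master_ip</code> is the address used above:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nc -zv master_ip 29500   # should report the port as open / connection succeeded
</code></pre></div></div>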

<hr />

<h3 id="5-check-system-configuration"><strong>5. Check System Configuration</strong></h3>
<p>Ensure the environment is configured correctly:</p>
<ol>
  <li><strong>NCCL Settings</strong>:
    <ul>
      <li>For multi-node training:
        <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">NCCL_DEBUG</span><span class="o">=</span>INFO
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">NCCL_SOCKET_IFNAME</span><span class="o">=</span>eth0  <span class="c"># or your network interface</span>
</code></pre></div>        </div>
      </li>
    </ul>
  </li>
  <li><strong>CUDA and GPU Settings</strong>:
    <ul>
      <li>Confirm GPU visibility:
        <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nvidia-smi
</code></pre></div>        </div>
      </li>
      <li>Set <code class="language-plaintext highlighter-rouge">CUDA_VISIBLE_DEVICES</code> if needed:
        <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1,2,3
</code></pre></div>        </div>
      </li>
    </ul>
  </li>
</ol>

<hr />

<h3 id="6-monitor-and-debug"><strong>6. Monitor and Debug</strong></h3>
<ol>
  <li>Use verbose logging for debugging:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">TORCH_DISTRIBUTED_DEBUG</span><span class="o">=</span>DETAIL
torchrun <span class="nt">--nproc_per_node</span><span class="o">=</span>4 train.py
</code></pre></div>    </div>
  </li>
  <li>Check GPU utilization:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>watch <span class="nt">-n</span> 1 nvidia-smi
</code></pre></div>    </div>
  </li>
  <li>Debug communication issues (e.g., NCCL or Gloo):
    <ul>
      <li>Check network connectivity between nodes.</li>
      <li>Use <code class="language-plaintext highlighter-rouge">dmesg</code> or log files for hardware errors.</li>
    </ul>
  </li>
</ol>

<hr />

<h3 id="example-scenarios"><strong>Example Scenarios</strong></h3>

<h4 id="single-node-4-gpus"><strong>Single Node, 4 GPUs</strong></h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torchrun <span class="nt">--nproc_per_node</span><span class="o">=</span>4 train.py
</code></pre></div></div>

<h4 id="two-nodes-8-gpus-total-4-gpus-per-node"><strong>Two Nodes, 8 GPUs Total (4 GPUs Per Node)</strong></h4>
<ol>
  <li>Start on the master node:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torchrun <span class="nt">--nnodes</span><span class="o">=</span>2 <span class="nt">--nproc_per_node</span><span class="o">=</span>4 <span class="nt">--rdzv_endpoint</span><span class="o">=</span><span class="s2">"master_ip:29500"</span> train.py
</code></pre></div>    </div>
  </li>
  <li>Start on the worker node:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torchrun <span class="nt">--nnodes</span><span class="o">=</span>2 <span class="nt">--nproc_per_node</span><span class="o">=</span>4 <span class="nt">--rdzv_endpoint</span><span class="o">=</span><span class="s2">"master_ip:29500"</span> train.py
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h3 id="7-best-practices"><strong>7. Best Practices</strong></h3>
<ol>
  <li>Use <strong>batch normalization</strong> carefully in DDP: plain BatchNorm computes statistics per GPU, while <code class="language-plaintext highlighter-rouge">SyncBatchNorm</code> synchronizes them across GPUs at the cost of extra communication.</li>
  <li>Optimize network communication with <strong>InfiniBand</strong> if available.</li>
  <li>Profile your training script to identify bottlenecks using PyTorch’s profiler.</li>
</ol>

<hr />

<h3 id="references"><strong>References</strong></h3>
<ul>
  <li><a href="https://pytorch.org/docs/stable/distributed.html">PyTorch Distributed Training Documentation</a></li>
  <li><a href="https://developer.nvidia.com/nccl">NCCL Documentation</a></li>
  <li><a href="https://pytorch.org/docs/stable/elastic/run.html">Torchrun CLI Reference</a></li>
</ul>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[Using `torchrun` for Distributed Training]]></summary></entry><entry><title type="html">Setting Up a Nebula Overlay Network with Syncthing</title><link href="https://dongdongbh.tech/blog/nebula/" rel="alternate" type="text/html" title="Setting Up a Nebula Overlay Network with Syncthing" /><published>2025-01-01T00:00:00-05:00</published><updated>2025-01-03T02:39:45-05:00</updated><id>https://dongdongbh.tech/blog/nebula</id><content type="html" xml:base="https://dongdongbh.tech/blog/nebula/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In today’s interconnected world, managing secure and efficient data synchronization across multiple devices is crucial. Overlay networks provide a robust solution for creating secure communication channels over existing networks. This tutorial will guide you through setting up a <strong>Nebula overlay network</strong> and integrating it with <strong>Syncthing</strong> for seamless and secure file synchronization between your PC and Android phone.</p>

<hr />

<h2 id="what-is-an-overlay-network">What is an Overlay Network?</h2>

<p>An <strong>overlay network</strong> is a virtual network built on top of an existing physical network. It allows devices to communicate as if they are directly connected, regardless of their actual physical locations. Overlay networks are instrumental in enhancing security, managing network traffic, and enabling functionalities like:</p>

<ul>
  <li><strong>VPN Services</strong>: Creating secure tunnels between devices.</li>
  <li><strong>Peer-to-Peer Communication</strong>: Facilitating direct connections without centralized servers.</li>
  <li><strong>Network Segmentation</strong>: Isolating different parts of a network for security or performance reasons.</li>
</ul>

<hr />

<h3 id="nebula-overlay-network">Nebula Overlay Network</h3>

<p><a href="https://github.com/slackhq/nebula">Nebula</a> is an open-source, scalable overlay networking tool that enables secure communication between devices, regardless of their physical location or network configuration. It is ideal for private networking and secure communication.</p>

<h2 id="prerequisites">Prerequisites</h2>

<ul>
  <li>A PC running Linux (tested on Ubuntu/Debian).</li>
  <li>A public server to act as a Lighthouse.</li>
  <li>An Android phone with the Nebula app installed.</li>
  <li>Basic knowledge of networking and terminal commands.</li>
  <li>Root or administrative privileges on the devices.</li>
</ul>

<hr />

<h2 id="setting-up-nebula-overlay-network">Setting Up Nebula Overlay Network</h2>

<h3 id="1-download-nebula-software">1. Download Nebula Software</h3>

<p>On the PC and Lighthouse server, download and extract Nebula:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://github.com/slackhq/nebula/releases/download/v1.6.1/nebula-linux-amd64.tar.gz
<span class="nb">tar</span> <span class="nt">-xzf</span> nebula-linux-amd64.tar.gz
<span class="nb">sudo mv </span>nebula /usr/local/bin/
<span class="nb">sudo mv </span>nebula-cert /usr/local/bin/
</code></pre></div></div>

<h3 id="2-create-certificate-authority-ca">2. Create Certificate Authority (CA)</h3>

<p>Generate the CA certificate and private key:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nebula-cert ca <span class="nt">-name</span> <span class="s2">"Nebula Network"</span>
<span class="nb">sudo mkdir</span> <span class="nt">-p</span> /etc/nebula
<span class="nb">sudo mv </span>ca.crt ca.key /etc/nebula/
<span class="nb">sudo chmod </span>600 /etc/nebula/ca.key
</code></pre></div></div>

<h3 id="3-generate-certificates-and-keys-for-each-device">3. Generate Certificates and Keys for Each Device</h3>

<h4 id="a-pc-configuration">a. PC Configuration</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nebula-cert sign <span class="nt">-name</span> <span class="s2">"PC"</span> <span class="nt">-ip</span> <span class="s2">"192.168.100.2/24"</span>
<span class="nb">sudo mv </span>PC.crt PC.key /etc/nebula/
</code></pre></div></div>

<h4 id="b-lighthouse-server-configuration">b. Lighthouse Server Configuration</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nebula-cert sign <span class="nt">-name</span> <span class="s2">"Lighthouse"</span> <span class="nt">-ip</span> <span class="s2">"192.168.100.1/24"</span>
<span class="nb">sudo mkdir</span> <span class="nt">-p</span> /etc/nebula/pki
<span class="nb">sudo mv </span>Lighthouse.crt Lighthouse.key /etc/nebula/pki/
</code></pre></div></div>

<h4 id="c-android-phone-configuration">c. Android Phone Configuration</h4>

<ol>
  <li>Open the Nebula app on your Android phone to generate a public key.</li>
  <li>Transfer the public key to your PC and use it to create a signed certificate:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nebula-cert sign <span class="nt">-name</span> <span class="s2">"Phone"</span> <span class="nt">-ip</span> <span class="s2">"192.168.100.3/24"</span> <span class="nt">-in-pub-key</span> phone_public.key
</code></pre></div></div>

<ol start="3">
  <li>Transfer the <code class="language-plaintext highlighter-rouge">Phone.crt</code> and <code class="language-plaintext highlighter-rouge">ca.crt</code> back to your phone via the Nebula app.</li>
</ol>

<hr />

<h3 id="4-configure-nebula-on-each-device">4. Configure Nebula on Each Device</h3>

<h4 id="a-pc-configuration-file">a. PC Configuration File</h4>

<p>Create <code class="language-plaintext highlighter-rouge">/etc/nebula/config.yml</code> with the following content:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">pki</span><span class="pi">:</span>
  <span class="na">ca</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/ca.crt"</span>
  <span class="na">cert</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/PC.crt"</span>
  <span class="na">key</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/PC.key"</span>

<span class="na">static_host_map</span><span class="pi">:</span>
  <span class="s2">"</span><span class="s">192.168.100.1"</span><span class="err">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">xx.xx.xx.xx:4242"</span><span class="pi">]</span>

<span class="na">lighthouse</span><span class="pi">:</span>
  <span class="na">am_lighthouse</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">interval</span><span class="pi">:</span> <span class="m">60</span>
  <span class="na">hosts</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s2">"</span><span class="s">192.168.100.1"</span>

<span class="na">listen</span><span class="pi">:</span>
  <span class="na">host</span><span class="pi">:</span> <span class="s">0.0.0.0</span>
  <span class="na">port</span><span class="pi">:</span> <span class="m">4242</span>

<span class="na">punchy</span><span class="pi">:</span>
  <span class="na">punch</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">delay</span><span class="pi">:</span> <span class="s">1s</span>
  <span class="na">respond</span><span class="pi">:</span> <span class="no">true</span>

<span class="na">relay</span><span class="pi">:</span>
  <span class="na">relays</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">192.168.100.1</span>
  <span class="na">am_relay</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">use_relays</span><span class="pi">:</span> <span class="no">true</span>


<span class="na">tun</span><span class="pi">:</span>
  <span class="na">disabled</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">drop_local_broadcast</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">drop_multicast</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">tx_queue</span><span class="pi">:</span> <span class="m">500</span>
  <span class="na">dev</span><span class="pi">:</span> <span class="s">nebula1</span>
  <span class="na">mtu</span><span class="pi">:</span> <span class="m">1300</span>

<span class="na">firewall</span><span class="pi">:</span>
  <span class="na">outbound_action</span><span class="pi">:</span> <span class="s">drop</span>
  <span class="na">inbound_action</span><span class="pi">:</span> <span class="s">drop</span>
  <span class="na">inbound</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">port</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">proto</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">host</span><span class="pi">:</span> <span class="s">any</span>
  <span class="na">outbound</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">port</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">proto</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">host</span><span class="pi">:</span> <span class="s">any</span>
<span class="na">logging</span><span class="pi">:</span>
  <span class="na">level</span><span class="pi">:</span> <span class="s">info</span>
  <span class="na">format</span><span class="pi">:</span> <span class="s">text</span>
</code></pre></div></div>

<h4 id="b-lighthouse-configuration-file">b. Lighthouse Configuration File</h4>

<p>Create <code class="language-plaintext highlighter-rouge">/etc/nebula/lighthouse.yml</code> with the following content:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">pki</span><span class="pi">:</span>
  <span class="na">ca</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/pki/ca.crt"</span>
  <span class="na">cert</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/pki/Lighthouse.crt"</span>
  <span class="na">key</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/pki/Lighthouse.key"</span>

<span class="na">static_host_map</span><span class="pi">:</span>
  <span class="s2">"</span><span class="s">192.168.100.1"</span><span class="err">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">xx.xx.xx.xx:4242"</span><span class="pi">]</span>

<span class="na">lighthouse</span><span class="pi">:</span>
  <span class="na">am_lighthouse</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">interval</span><span class="pi">:</span> <span class="m">60</span>

<span class="na">listen</span><span class="pi">:</span>
  <span class="na">host</span><span class="pi">:</span> <span class="s">0.0.0.0</span>
  <span class="na">port</span><span class="pi">:</span> <span class="m">4242</span>

<span class="na">punchy</span><span class="pi">:</span>
  <span class="na">punch</span><span class="pi">:</span> <span class="no">true</span>

<span class="na">relay</span><span class="pi">:</span>
  <span class="na">am_relay</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">use_relays</span><span class="pi">:</span> <span class="no">true</span>

<span class="na">firewall</span><span class="pi">:</span>
  <span class="na">inbound</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">port</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">proto</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">host</span><span class="pi">:</span> <span class="s">any</span>
  <span class="na">outbound</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">port</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">proto</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">host</span><span class="pi">:</span> <span class="s">any</span>


<span class="na">tun</span><span class="pi">:</span>
  <span class="na">disabled</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">drop_local_broadcast</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">drop_multicast</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">tx_queue</span><span class="pi">:</span> <span class="m">500</span>
  <span class="na">dev</span><span class="pi">:</span> <span class="s">nebula1</span>
  <span class="na">mtu</span><span class="pi">:</span> <span class="m">1300</span>

<span class="na">stats</span><span class="pi">:</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s">prometheus</span>
  <span class="na">listen</span><span class="pi">:</span> <span class="s">0.0.0.0:8080</span>
  <span class="na">path</span><span class="pi">:</span> <span class="s">/metrics</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">prometheusns</span>
  <span class="na">subsystem</span><span class="pi">:</span> <span class="s">nebula</span>
  <span class="na">interval</span><span class="pi">:</span> <span class="s">10s</span>

  <span class="na">message_metrics</span><span class="pi">:</span> <span class="no">false</span>

  <span class="na">lighthouse_metrics</span><span class="pi">:</span> <span class="no">false</span>

<span class="na">logging</span><span class="pi">:</span>
  <span class="na">level</span><span class="pi">:</span> <span class="s">info</span>
  <span class="na">format</span><span class="pi">:</span> <span class="s">text</span>
</code></pre></div></div>

<h4 id="c-android-phone-configuration-file">c. Android Phone Configuration File</h4>

<p>Follow the Nebula app instructions to import <code class="language-plaintext highlighter-rouge">Phone.crt</code> and <code class="language-plaintext highlighter-rouge">ca.crt</code>.</p>

<hr />

<h3 id="5-open-udp-port-4242-on-lighthouse">5. Open UDP Port 4242 on Lighthouse</h3>

<p>Ensure the Lighthouse server allows incoming UDP traffic on port 4242:</p>
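
<p>For example, with <code class="language-plaintext highlighter-rouge">ufw</code> (use the equivalent <code class="language-plaintext highlighter-rouge">firewalld</code> rule or cloud security-group entry if that is what your server uses):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo ufw allow 4242/udp
sudo ufw status
</code></pre></div></div>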

<hr />

<h3 id="6-start-nebula-services">6. Start Nebula Services</h3>

<p>Start Nebula on the PC and Lighthouse server (on the Lighthouse, point <code class="language-plaintext highlighter-rouge">-config</code> at <code class="language-plaintext highlighter-rouge">/etc/nebula/lighthouse.yml</code> instead):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nebula <span class="nt">-config</span> /etc/nebula/config.yml
</code></pre></div></div>
<p>To set Nebula as a service on a Linux system, you can create a <strong>systemd</strong> service file. This ensures that Nebula starts automatically on boot and can be managed like other system services.</p>

<h3 id="steps-to-set-up-nebula-as-a-service">Steps to Set Up Nebula as a Service</h3>

<ol>
  <li><strong>Create a Systemd Service File</strong>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nano /etc/systemd/system/nebula.service
</code></pre></div>    </div>
  </li>
  <li><strong>Add the Following Configuration</strong>
    <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[Unit]</span>
<span class="py">Description</span><span class="p">=</span><span class="s">Nebula Overlay Network</span>
<span class="py">After</span><span class="p">=</span><span class="s">network.target</span>

<span class="nn">[Service]</span>
<span class="py">Type</span><span class="p">=</span><span class="s">simple</span>
<span class="py">ExecStart</span><span class="p">=</span><span class="s">/usr/local/bin/nebula -config /etc/nebula/config.yml</span>
<span class="py">Restart</span><span class="p">=</span><span class="s">on-failure</span>
<span class="py">User</span><span class="p">=</span><span class="s">root</span>

<span class="nn">[Install]</span>
<span class="py">WantedBy</span><span class="p">=</span><span class="s">multi-user.target</span>
</code></pre></div>    </div>

    <p><strong>Explanation:</strong></p>
    <ul>
      <li><strong><code class="language-plaintext highlighter-rouge">ExecStart</code></strong> specifies the path to the Nebula binary and configuration file.</li>
      <li><strong><code class="language-plaintext highlighter-rouge">Restart=on-failure</code></strong> ensures Nebula restarts automatically if it crashes.</li>
      <li><strong><code class="language-plaintext highlighter-rouge">User=root</code></strong> runs Nebula with root privileges (required for managing network interfaces).</li>
    </ul>
  </li>
  <li><strong>Reload the Systemd Daemon</strong>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl daemon-reload
</code></pre></div>    </div>
  </li>
  <li><strong>Enable Nebula to Start on Boot</strong>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl <span class="nb">enable </span>nebula
</code></pre></div>    </div>
  </li>
  <li><strong>Start the Nebula Service</strong>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl start nebula
</code></pre></div>    </div>
  </li>
  <li><strong>Check the Service Status</strong>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl status nebula
</code></pre></div>    </div>

    <p><strong>Expected Output:</strong></p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>● nebula.service - Nebula Overlay Network
     Loaded: loaded (/etc/systemd/system/nebula.service; enabled; vendor preset: enabled)
     Active: active (running) since ...
</code></pre></div>    </div>
  </li>
  <li><strong>Stop or Restart the Service (Optional)</strong>
    <ul>
      <li>To stop the service:
        <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl stop nebula
</code></pre></div>        </div>
      </li>
      <li>To restart the service:
        <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl restart nebula
</code></pre></div>        </div>
      </li>
    </ul>
  </li>
</ol>

<h3 id="logs-and-debugging">Logs and Debugging</h3>

<ul>
  <li>View logs for the Nebula service:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>journalctl <span class="nt">-u</span> nebula
</code></pre></div>    </div>
  </li>
</ul>
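<p>To follow the log live while testing connectivity:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo journalctl -u nebula -f
</code></pre></div></div>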

<p>By configuring Nebula as a service, you can ensure it runs reliably and automatically, making your overlay network setup more robust and manageable.</p>

<hr />

<h2 id="setting-up-syncthing">Setting Up Syncthing</h2>

<h3 id="1-download-and-install-syncthing">1. Download and Install Syncthing</h3>

<p>On each device, download and install Syncthing from the <a href="https://syncthing.net/">official website</a>.</p>
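<p>On Debian and Ubuntu it is also available from the package manager; a quick sketch, assuming the distribution package is recent enough for your needs:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt install syncthing
# Run it as your normal user; the web GUI listens on http://127.0.0.1:8384 by default
syncthing
</code></pre></div></div>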

<h3 id="2-configure-syncthing-to-use-nebula-network">2. Configure Syncthing to Use Nebula Network</h3>

<p>In Syncthing, set the Nebula IPs (e.g., <code class="language-plaintext highlighter-rouge">tcp://192.168.100.3:22000</code>) as the device addresses.</p>

<h3 id="3-syncing-files-across-devices">3. Syncing Files Across Devices</h3>

<p>Add shared folders in Syncthing and start syncing files securely over the Nebula network.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>By setting up Nebula and Syncthing, you’ve created a secure overlay network for file synchronization across devices. This setup ensures privacy, flexibility, and efficient communication.</p>

<p><strong>Special statement: This tutorial is only for learning and research, thanks.</strong></p>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[Setting Up a Nebula Overlay Network with Syncthing]]></summary></entry><entry><title type="html">Transparent proxy with V2ray and clash</title><link href="https://dongdongbh.tech/blog/tproxy/" rel="alternate" type="text/html" title="Transparent proxy with V2ray and clash" /><published>2021-10-19T00:00:00-04:00</published><updated>2025-01-06T14:02:41-05:00</updated><id>https://dongdongbh.tech/blog/tproxy</id><content type="html" xml:base="https://dongdongbh.tech/blog/tproxy/"><![CDATA[<hr />

<h2 id="background">Background</h2>

<p>In my <a href="https://dongdongbh.tech/blog/vps/">previous post</a>, I demonstrated how to set up a proxy server and configure clients on various devices. However, configuring a proxy client on every device is cumbersome, and some users prefer to route <em>all</em> traffic through a proxy, regardless of whether it is HTTP, HTTPS, SOCKS5, or another protocol. In such cases, a <strong>transparent proxy</strong> is more convenient: it acts as a router, so every connected device (including the host itself) uses the proxy without per-device configuration. Transparent proxies are also called bypass gateways or soft routers, and they typically operate at the TCP/UDP layer.</p>

<p>This post will guide you step-by-step on setting up a transparent proxy on Linux. To follow along, you should have basic knowledge of Linux and networking.</p>

<p>There are various tools to achieve transparent proxy functionality, such as Proxifier (Windows), <em>Surge for Mac</em>, tun2socks, and dns2socks for Linux. The key to implementing a transparent proxy lies in correctly handling <strong>DNS resolution</strong>. In this post, I use the built-in DNS configurations of <strong>V2Ray</strong> and <strong>Clash</strong>. DNS requests are forwarded using <code class="language-plaintext highlighter-rouge">iptables</code> and <code class="language-plaintext highlighter-rouge">ip route</code>. Notably, Clash offers robust DNS settings.</p>

<hr />

<h2 id="basics">Basics</h2>

<h3 id="how-dns-works">How DNS Works</h3>

<p>When you make an HTTP request, the system first sends a DNS query (UDP port 53) with the domain name to the configured DNS server. The server responds with the corresponding IP address. The application then establishes a TCP connection to the target server using this IP and begins data exchange.</p>

<p>A typical connection flow looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            DNS Query
   ___________________________&gt;
APP&lt;---------------------------DNS Server
  |         Response (IP)
  |         TCP Data Exchange
  |-----------------------------------&gt;Website  
</code></pre></div></div>

<p>When using a proxy, the flow changes. Below are scenarios for different proxy types:</p>

<hr />

<h4 id="socks5-proxy">SOCKS5 Proxy</h4>

<p>In the SOCKS5 case, the application packs the domain name and TCP data into a SOCKS5 protocol packet. The proxy client forwards this packet to the proxy server, where DNS resolution occurs.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      SOCKS5 (Domain + TCP Data)                    SOCKS5 Data / Proxy Protocol                          DNS Query
APP---------------------------------&gt;Proxy Client-------------------------------------&gt;Proxy Server&lt;-------------------&gt;DNS Server
                                                                                          |     Response (IP)
                                                                                          |
                                                                                          |     TCP Data
                                                                                          |--------------------------------Website  
</code></pre></div></div>

<p>In <strong>global/transparent proxy</strong> setups, not all applications can handle SOCKS5. Thus, a program must intercept DNS requests and process them appropriately.</p>

<hr />

<h4 id="tun2socksredir">tun2socks/redir</h4>

<p><strong>tun2socks</strong> (part of BadVPN) picks up TCP connections routed to a TUN interface (regardless of destination IP) and forwards them to a SOCKS server. This lets applications without built-in SOCKS support use the proxy, even on a Linux router.</p>

<p>In this case, the application sends a DNS request, receives an IP (possibly inaccurate if the DNS server is “dirty”), and then establishes a TCP connection. The local proxy client intercepts the connection and repackages the destination (the original IP, or the domain if protocol sniffing can recover it) together with the TCP data into a proxy-protocol packet for forwarding.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         Requested IP
   &lt;_________________________________              Domain + Data (Proxy Protocol)                      DNS Query
APP---------------------------------&gt;Proxy Client&lt;-------------------------------------&gt;Proxy Server&lt;-------------------&gt;DNS Server
  |              DNS Query             /|\                                                  |     Response (IP)
  |                                     |                                                   |
  |       TCP Data                      |                                                   |     TCP Data     
  |-------------------------------------|                                                   |----------------------------&gt;Website
</code></pre></div></div>

<hr />

<h4 id="fake-ip-mode">Fake IP Mode</h4>

<p>In this mode, the local proxy client intercepts DNS requests and returns a “fake” IP to the application. The application establishes a connection with the fake IP, while the proxy client maps the fake IP to the actual domain and forwards the data to the proxy server.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>          Fake IP
   &lt;_________________________________                  Domain + Data (Proxy Protocol)                    DNS Query
APP---------------------------------&gt;Proxy Client&lt;-------------------------------------&gt;Proxy Server&lt;-------------------&gt;DNS Server
  |              DNS Query             /|\                                                  |     Response (IP)
  |                                     |                                                   |
  |   TCP Data                          |                                                   |     TCP Data     
  |-------------------------------------|                                                   |----------------------------&gt;Website
</code></pre></div></div>

<p><strong>Advantages &amp; Disadvantages</strong>:</p>

<ul>
  <li><strong>Fake IP Mode</strong>: Faster, because the proxy client answers DNS queries immediately with a fake IP instead of waiting for real resolution. The trade-off is that the application never learns the website’s real IP.</li>
  <li><strong>Real IP Mode</strong> (redir-host): Slower, but applications that genuinely need the destination’s actual IP get it.</li>
</ul>

<hr />

<h3 id="iptables"><code class="language-plaintext highlighter-rouge">iptables</code></h3>

<p>To manipulate traffic for transparent proxies, familiarity with <code class="language-plaintext highlighter-rouge">iptables</code> is essential. Here’s an overview:</p>

<p><img src="../../assets/images/tproxy/iptable.webp" alt="iptables Overview" /></p>

<p>Basic <code class="language-plaintext highlighter-rouge">iptables</code> commands:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>iptables <span class="nt">-L</span> <span class="nt">-t</span> <span class="o">{</span>nat,mangle<span class="o">}</span>   <span class="c"># List chains</span>
iptables <span class="nt">-N</span> XXXX              <span class="c"># Create chain</span>
iptables <span class="nt">-A</span> ...               <span class="c"># Add rule</span>
iptables <span class="nt">-D</span> ...               <span class="c"># Delete rule</span>
</code></pre></div></div>
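<p>When debugging the rules used later in this post, listing a chain with packet counters and line numbers shows what actually matches and lets you delete a rule by its number:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>iptables -t nat -L -n -v --line-numbers   # all nat chains, with packet counters
iptables -t nat -D PREROUTING 1           # delete rule number 1 of PREROUTING (example)
</code></pre></div></div>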

<hr />

<h2 id="requirements">Requirements</h2>

<ul>
  <li>A Linux machine.</li>
  <li>An operational V2Ray or Clash proxy.</li>
</ul>

<hr />

<h2 id="setting-up-a-v2ray-transparent-proxy">Setting Up a V2Ray Transparent Proxy</h2>

<h3 id="v2ray-configuration-configjson">V2Ray Configuration (<code class="language-plaintext highlighter-rouge">config.json</code>)</h3>

<p>The <code class="language-plaintext highlighter-rouge">dokodemo-door</code> inbound receives all traffic redirected by <code class="language-plaintext highlighter-rouge">iptables</code>. Traffic processed by V2Ray is marked with socket mark <code class="language-plaintext highlighter-rouge">255 (0xFF)</code> to avoid loopback.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"routing"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="err">...</span><span class="p">},</span><span class="w">
  </span><span class="nl">"inbounds"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="err">...</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"port"</span><span class="p">:</span><span class="w"> </span><span class="mi">12345</span><span class="p">,</span><span class="w">
      </span><span class="nl">"protocol"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dokodemo-door"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"settings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"tcp,udp"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"followRedirect"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
      </span><span class="p">},</span><span class="w">
      </span><span class="nl">"sniffing"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"enabled"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
        </span><span class="nl">"destOverride"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"http"</span><span class="p">,</span><span class="w"> </span><span class="s2">"tls"</span><span class="p">]</span><span class="w">
      </span><span class="p">},</span><span class="w">
      </span><span class="nl">"streamSettings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"sockopt"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="nl">"tproxy"</span><span class="p">:</span><span class="w"> </span><span class="s2">"redirect"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"outbounds"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="err">...</span><span class="w">
      </span><span class="nl">"streamSettings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="err">...</span><span class="w">
        </span><span class="nl">"sockopt"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="nl">"mark"</span><span class="p">:</span><span class="w"> </span><span class="mi">255</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<hr />

<h3 id="iptables-configuration"><code class="language-plaintext highlighter-rouge">iptables</code> Configuration</h3>

<p>Run these commands with root privileges (<code class="language-plaintext highlighter-rouge">sudo su</code>):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">lan_ipaddr</span><span class="o">=</span><span class="s2">"192.168.1.1"</span>  <span class="c"># Local router IP</span>
<span class="nv">proxy_server</span><span class="o">=</span><span class="s2">"123.123.123.123"</span>  <span class="c"># Proxy server IP</span>
<span class="nv">proxy_port</span><span class="o">=</span><span class="s2">"7892"</span>  <span class="c"># Transparent proxy port</span>

<span class="c"># Enable IP forwarding</span>
<span class="nb">echo </span>net.ipv4.ip_forward<span class="o">=</span>1 <span class="o">&gt;&gt;</span> /etc/sysctl.conf <span class="o">&amp;&amp;</span> sysctl <span class="nt">-p</span>

<span class="c"># Route for loopback</span>
ip rule add fwmark 1 table 100
ip route add <span class="nb">local </span>0.0.0.0/0 dev lo table 100

<span class="c"># Proxy local network</span>
iptables <span class="nt">-t</span> mangle <span class="nt">-N</span> V2RAY
iptables <span class="nt">-t</span> mangle <span class="nt">-A</span> V2RAY <span class="nt">-d</span> <span class="k">${</span><span class="nv">proxy_server</span><span class="k">}</span> <span class="nt">-j</span> RETURN
iptables <span class="nt">-t</span> mangle <span class="nt">-A</span> V2RAY <span class="nt">-d</span> 127.0.0.1/32 <span class="nt">-j</span> RETURN
...
</code></pre></div></div>

<p>This method uses the <strong>REDIRECT</strong> approach. For the <strong>TPROXY</strong> method, refer to <a href="https://toutyrater.github.io/app/tproxy.html">this guide</a>.</p>

<h2 id="setting-up-a-clash-transparent-proxy">Setting Up a Clash Transparent Proxy</h2>

<p>Clash is a powerful, rule-based proxy tool with features like high-level routing and DNS management. It is widely used due to its flexibility and robust capabilities.</p>

<h3 id="configuring-clash-as-a-bypass-gateway">Configuring Clash as a Bypass Gateway</h3>

<p>If you’re using a Raspberry Pi as a bypass gateway, you can set a static IP address for it and configure it to act as both a DHCP and DNS server. Alternatively, if your main router supports multiple gateways, you can configure two gateways:</p>

<ol>
  <li>One for traffic routed through the Raspberry Pi (for proxying).</li>
  <li>Another for direct routing (normal internet access).</li>
</ol>
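<p>As a sketch of the static-address option above: on Raspberry Pi OS releases that use <code class="language-plaintext highlighter-rouge">dhcpcd</code>, the address can be pinned in <code class="language-plaintext highlighter-rouge">/etc/dhcpcd.conf</code> (all addresses below are placeholders; substitute your own LAN layout):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>interface eth0
# the Pi itself
static ip_address=192.168.1.2/24
# the main WiFi router
static routers=192.168.1.1
# point DNS at the Pi, where Clash will listen
static domain_name_servers=192.168.1.2
</code></pre></div></div>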

<p>The network overview would look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Phone/PC/Pad
        |
      1 |
        |
+-------v-------+      2      +---------------+
|               |-------------&gt;               |
|  WiFi Router  |             |  Raspberry Pi |
|               &lt;-------------|               |
+------+--+-----+      3      +---------------+
       |  |
    3.1|  | 3.2
       |  +----------&gt;  Direct LAN
       v
   +---+---+
   | Proxy |
   +---+---+
       |
       |
       v
 Internet WAN
</code></pre></div></div>

<hr />

<h3 id="avoiding-loop-problems">Avoiding Loop Problems</h3>

<p>To prevent traffic loops, create a dedicated user <code class="language-plaintext highlighter-rouge">clash</code> and ensure that traffic originating from this user is excluded from the proxying rules.</p>
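<p>The exclusion itself is an <code class="language-plaintext highlighter-rouge">iptables</code> owner match, which only exists in the <code class="language-plaintext highlighter-rouge">OUTPUT</code> chain (traffic generated by the gateway itself). A minimal sketch, assuming the <code class="language-plaintext highlighter-rouge">clash</code> user created below and the redirect port from the script later in this post:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Traffic generated by the clash user skips the redirect, breaking the loop
iptables -t nat -A OUTPUT -p tcp -m owner --uid-owner clash -j RETURN
# Other local TCP traffic goes to Clash's redir port
# (in practice, RETURN private/loopback ranges first, as in the PREROUTING script)
iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports 7892
</code></pre></div></div>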

<h4 id="steps-to-create-a-user-for-clash">Steps to Create a User for Clash</h4>

<ol>
  <li>Create the <code class="language-plaintext highlighter-rouge">clash</code> user and its home directory:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>useradd <span class="nt">-U</span> clash
<span class="nb">sudo </span>mkhomedir_helper clash
<span class="nb">sudo chown </span>clash:clash /usr/local/bin/clash
</code></pre></div></div>

<ol>
  <li>Create or modify the service file at <code class="language-plaintext highlighter-rouge">/etc/systemd/system/clash.service</code> to define the user as <code class="language-plaintext highlighter-rouge">clash</code>:</li>
</ol>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Unit]
Description=clash
After=network.target

[Service]
User=clash
Group=clash
AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_NET_ADMIN
ExecStart=/usr/local/bin/clash -d /etc/clash
Restart=on-failure

[Install]
WantedBy=multi-user.target
</code></pre></div></div>

<ul>
  <li><code class="language-plaintext highlighter-rouge">CAP_NET_BIND_SERVICE</code>: Allows the Clash process to bind to privileged ports like 53 (DNS).</li>
  <li><code class="language-plaintext highlighter-rouge">CAP_NET_ADMIN</code>: Grants the Clash process permissions for network administration (necessary for UDP proxying).</li>
</ul>

<ol>
  <li>Reload the systemd configuration and enable Clash:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl daemon-reload
<span class="nb">sudo </span>systemctl <span class="nb">enable </span>clash
</code></pre></div></div>

<hr />

<h3 id="clash-dns-configuration">Clash DNS Configuration</h3>

<p>To ensure reliable DNS resolution, configure Clash to handle DNS queries. Below is a sample DNS configuration for <code class="language-plaintext highlighter-rouge">config.yaml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">dns</span><span class="pi">:</span>
  <span class="na">enable</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">ipv6</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">listen</span><span class="pi">:</span> <span class="s">0.0.0.0:1053</span>
  <span class="na">enhanced-mode</span><span class="pi">:</span> <span class="s">redir-host</span>       <span class="c1"># Modes: redir-host or fake-ip</span>
  <span class="na">use-hosts</span><span class="pi">:</span> <span class="no">true</span>                 <span class="c1"># Use hosts for resolution</span>
  <span class="na">nameserver</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">119.29.29.29</span>      <span class="c1"># DNSPod</span>
    <span class="pi">-</span> <span class="s">223.5.5.5</span>         <span class="c1"># Alibaba DNS</span>
  <span class="na">fallback</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">tls://8.8.8.8:853</span>         <span class="c1"># Google DNS over TLS</span>
    <span class="pi">-</span> <span class="s">tls://8.8.4.4:853</span>         <span class="c1"># Google DNS over TLS</span>
    <span class="pi">-</span> <span class="s">https://1.1.1.1/dns-query</span> <span class="c1"># Cloudflare DNS over HTTPS</span>
    <span class="pi">-</span> <span class="s">https://dns.google/dns-query</span> <span class="c1"># Google DNS over HTTPS</span>
  <span class="na">fallback-filter</span><span class="pi">:</span>
    <span class="na">geoip</span><span class="pi">:</span> <span class="no">true</span>
</code></pre></div></div>

<p>For <strong>fake-ip</strong> mode, update the configuration:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="na">enhanced-mode</span><span class="pi">:</span> <span class="s">fake-ip</span>
  <span class="na">fake-ip-range</span><span class="pi">:</span> <span class="s">198.18.0.1/16</span>  <span class="c1"># Fake IP pool</span>
</code></pre></div></div>

<hr />

<h3 id="clash-full-configuration-example">Clash Full Configuration Example</h3>

<p>Below is a complete example of <code class="language-plaintext highlighter-rouge">config.yaml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">port</span><span class="pi">:</span> <span class="m">7890</span>
<span class="na">socks-port</span><span class="pi">:</span> <span class="m">7891</span>
<span class="na">redir-port</span><span class="pi">:</span> <span class="m">7892</span>
<span class="na">allow-lan</span><span class="pi">:</span> <span class="no">true</span>
<span class="na">mode</span><span class="pi">:</span> <span class="s">Rule</span>
<span class="na">log-level</span><span class="pi">:</span> <span class="s">info</span>
<span class="na">external-controller</span><span class="pi">:</span> <span class="s">0.0.0.0:9090</span>
<span class="na">secret</span><span class="pi">:</span> <span class="s2">"</span><span class="s">"</span>
<span class="na">external-ui</span><span class="pi">:</span> <span class="s">dashboard</span>

<span class="c1"># Define proxies</span>
<span class="na">proxies</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Proxy"</span>
    <span class="na">type</span><span class="pi">:</span> <span class="s">http</span>
    <span class="na">server</span><span class="pi">:</span> <span class="s">your.proxy.server</span>
    <span class="na">port</span><span class="pi">:</span> <span class="m">1234</span>
    <span class="na">username</span><span class="pi">:</span> <span class="s">user</span>
    <span class="na">password</span><span class="pi">:</span> <span class="s">pass</span>

<span class="c1"># Proxy groups and rules</span>
<span class="na">proxy-groups</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Default"</span>
    <span class="na">type</span><span class="pi">:</span> <span class="s">select</span>
    <span class="na">proxies</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">Proxy</span>

<span class="na">rules</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">DOMAIN-SUFFIX,example.com,Default</span>
  <span class="pi">-</span> <span class="s">GEOIP,CN,DIRECT</span>
  <span class="pi">-</span> <span class="s">MATCH,Default</span>

<span class="na">dns</span><span class="pi">:</span>
  <span class="na">enable</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">ipv6</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">listen</span><span class="pi">:</span> <span class="s">0.0.0.0:1053</span>
  <span class="na">enhanced-mode</span><span class="pi">:</span> <span class="s">redir-host</span>
  <span class="na">nameserver</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">119.29.29.29</span>
    <span class="pi">-</span> <span class="s">223.5.5.5</span>
  <span class="na">fallback</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">tls://8.8.8.8:853</span>
    <span class="pi">-</span> <span class="s">tls://8.8.4.4:853</span>
    <span class="pi">-</span> <span class="s">https://1.1.1.1/dns-query</span>
    <span class="pi">-</span> <span class="s">https://dns.google/dns-query</span>
  <span class="na">fallback-filter</span><span class="pi">:</span>
    <span class="na">geoip</span><span class="pi">:</span> <span class="no">true</span>
</code></pre></div></div>

<hr />

<h3 id="configuring-iptables-for-clash">Configuring <code class="language-plaintext highlighter-rouge">iptables</code> for Clash</h3>

<ol>
  <li>Enable IP forwarding:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo </span>net.ipv4.ip_forward<span class="o">=</span>1 <span class="o">&gt;&gt;</span> /etc/sysctl.conf <span class="o">&amp;&amp;</span> sysctl <span class="nt">-p</span>
</code></pre></div></div>

<ol>
  <li>Create a script to set up <code class="language-plaintext highlighter-rouge">iptables</code> rules:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

<span class="nv">IPT</span><span class="o">=</span>/sbin/iptables
<span class="nv">lan_ipaddr</span><span class="o">=</span><span class="si">$(</span>/sbin/ip route | <span class="nb">awk</span> <span class="s1">'/default/ { print $3 }'</span><span class="si">)</span>
<span class="nv">dns_port</span><span class="o">=</span><span class="s2">"1053"</span>  
<span class="nv">proxy_port</span><span class="o">=</span><span class="s2">"7892"</span> 

<span class="c"># Flush existing rules</span>
<span class="nv">$IPT</span> <span class="nt">-F</span>

<span class="c"># Create chains</span>
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-N</span> CLASH_TCP_RULE
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-F</span> CLASH_TCP_RULE

<span class="c"># Exclude local addresses</span>
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-d</span> 10.0.0.0/8 <span class="nt">-j</span> RETURN
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-d</span> 127.0.0.0/8 <span class="nt">-j</span> RETURN
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-d</span> 192.168.0.0/16 <span class="nt">-j</span> RETURN
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-p</span> tcp <span class="nt">--dport</span> 22 <span class="nt">-j</span> RETURN
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-p</span> tcp <span class="nt">--dport</span> <span class="k">${</span><span class="nv">proxy_port</span><span class="k">}</span> <span class="nt">-j</span> RETURN

<span class="c"># Redirect remaining TCP traffic</span>
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-p</span> tcp <span class="nt">-j</span> REDIRECT <span class="nt">--to-ports</span> <span class="k">${</span><span class="nv">proxy_port</span><span class="k">}</span>

<span class="c"># DNS redirection</span>
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> PREROUTING <span class="nt">-p</span> udp <span class="nt">--dport</span> 53 <span class="nt">-j</span> REDIRECT <span class="nt">--to-port</span> <span class="k">${</span><span class="nv">dns_port</span><span class="k">}</span>

<span class="c"># Apply rules</span>
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> PREROUTING <span class="nt">-p</span> tcp <span class="nt">-j</span> CLASH_TCP_RULE
</code></pre></div></div>

<ol>
  <li>Save the script, make it executable, and run it:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">chmod</span> +x clash-iptables.sh
<span class="nb">sudo</span> ./clash-iptables.sh
</code></pre></div></div>

<ol>
  <li>To persist the rules after reboot, use <code class="language-plaintext highlighter-rouge">iptables-persistent</code>:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>iptables-persistent
<span class="nb">sudo </span>iptables-save | <span class="nb">sudo </span>tee /etc/iptables/rules.v4 <span class="o">&gt;</span> /dev/null
</code></pre></div></div>

<p>Alternatively, add the script to <code class="language-plaintext highlighter-rouge">/etc/rc.local</code> for execution at startup.</p>

<hr />

<p><strong>Special statement: This tutorial is only for learning and research, thanks.</strong></p>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[Transparent proxy with V2ray and clash on Linux，bypass gateway, soft router]]></summary></entry><entry><title type="html">Pytorch distributed data parallel step by step</title><link href="https://dongdongbh.tech/ddp/" rel="alternate" type="text/html" title="Pytorch distributed data parallel step by step" /><published>2020-11-19T00:00:00-05:00</published><updated>2025-01-06T14:02:41-05:00</updated><id>https://dongdongbh.tech/ddp</id><content type="html" xml:base="https://dongdongbh.tech/ddp/"><![CDATA[<hr />

<h2 id="background">Background</h2>

<p>How can you speed up your training? What should you do when your model is too large to fit into a single GPU’s memory? How can you efficiently utilize multiple GPUs?</p>

<p><strong>Distributed training</strong> is designed to address these challenges. In PyTorch, two common approaches for distributed training are <strong>DataParallel</strong> and <strong>Distributed Data Parallel (DDP)</strong>.</p>

<hr />

<h3 id="dataparallel">DataParallel</h3>

<p>The <code class="language-plaintext highlighter-rouge">DataParallel</code> module splits a batch of data into smaller mini-batches, each assigned to a different GPU, and every GPU holds a replica of the model. Outputs from all GPUs are gathered on a master GPU, which computes the loss; during back-propagation, gradients are accumulated on the master GPU, which updates the model parameters. The updated parameters are then broadcast back to all GPUs.</p>

<p>However, there are key limitations with <code class="language-plaintext highlighter-rouge">DataParallel</code>:</p>

<ol>
  <li><strong>Communication Overhead</strong>: Gradients and updated model parameters must be shuttled between GPUs on every step, causing significant communication overhead.</li>
  <li><strong>Memory Bottleneck</strong>: The master GPU gathers all outputs and performs the parameter updates, so its memory fills up first while the other GPUs’ memory goes under-utilized.</li>
  <li><strong>Slower Training</strong>: Loss computation and parameter updates are funneled through a single GPU, inside a single Python process bound by the GIL, which slows training down.</li>
</ol>

<hr />

<h3 id="distributed-data-parallel-ddp">Distributed Data Parallel (DDP)</h3>

<p><strong>Distributed Data Parallel (DDP)</strong> is a more efficient solution that addresses the drawbacks of <code class="language-plaintext highlighter-rouge">DataParallel</code>. DDP runs one process per GPU and attaches autograd hooks to each parameter: as gradients become ready during back-propagation, they are synchronized across GPUs with the <code class="language-plaintext highlighter-rouge">AllReduce</code> operation, overlapping communication with computation. Each process then applies identical parameter updates locally, so no master GPU is needed.</p>

<p><strong>Key Advantages</strong>:</p>
<ul>
  <li><strong>Reduced Communication Overhead</strong>: Only gradients are synchronized, reducing data transfer costs.</li>
  <li><strong>Balanced Memory Usage</strong>: Each GPU handles its own back-propagation, resulting in similar memory usage across GPUs.</li>
  <li><strong>Scalability</strong>: DDP supports multi-node setups and peer-to-peer communication between GPUs.</li>
  <li><strong>Improved Performance</strong>: Multiple CPU processes are used, alleviating the limitations of Python’s Global Interpreter Lock (GIL).</li>
</ul>

<p>For more details, see <a href="https://pytorch.org/tutorials/beginner/dist_overview.html">PyTorch Distributed Overview</a>.</p>

<p>This guide focuses on implementing DDP for single-machine, multi-GPU setups.</p>

<hr />

<h2 id="getting-started-with-ddp">Getting Started with DDP</h2>

<h3 id="running-ddp">Running DDP</h3>

<p>The <code class="language-plaintext highlighter-rouge">torch.distributed.launch</code> utility spawns multiple processes for you. Set <code class="language-plaintext highlighter-rouge">nproc_per_node</code> to the number of GPUs on your machine so that each process corresponds to one GPU.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1 python <span class="nt">-m</span> torch.distributed.launch <span class="nt">--nproc_per_node</span><span class="o">=</span>2 main.py <span class="nv">$args</span>
</code></pre></div></div>
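<p>On PyTorch 1.10 and later, <code class="language-plaintext highlighter-rouge">torch.distributed.launch</code> is deprecated in favor of <code class="language-plaintext highlighter-rouge">torchrun</code>, which passes the local rank through the <code class="language-plaintext highlighter-rouge">LOCAL_RANK</code> environment variable instead of a <code class="language-plaintext highlighter-rouge">--local_rank</code> argument:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py $args
</code></pre></div></div>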

<hr />

<h3 id="preparing-data">Preparing Data</h3>

<h4 id="supervised-learning">Supervised Learning</h4>

<p>Use <code class="language-plaintext highlighter-rouge">DistributedSampler</code> to split the dataset among processes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_sampler</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">distributed</span><span class="p">.</span><span class="n">DistributedSampler</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">)</span>
<span class="n">train_loader</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="p">...,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">train_sampler</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="reinforcement-learning">Reinforcement Learning</h4>

<p>In reinforcement learning, run the environment in each rank process with <strong>different seeds</strong> to ensure diversity.</p>
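<p>A minimal sketch: offset a base seed by the process rank so every worker collects different trajectories (<code class="language-plaintext highlighter-rouge">env</code> stands in for your RL environment):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.distributed as dist

base_seed = 42                       # hypothetical experiment-level seed
rank = dist.get_rank()               # requires init_process_group (see below)
torch.manual_seed(base_seed + rank)  # per-rank PyTorch randomness
env.seed(base_seed + rank)           # env is a placeholder; newer Gym APIs
                                     # use env.reset(seed=base_seed + rank)
</code></pre></div></div>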

<hr />

<h3 id="ddp-initialization-with-nvidia-nccl-backend">DDP Initialization with NVIDIA NCCL Backend</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch.distributed</span> <span class="k">as</span> <span class="n">dist</span>
<span class="kn">from</span> <span class="nn">torch.nn.parallel</span> <span class="kn">import</span> <span class="n">DistributedDataParallel</span> <span class="k">as</span> <span class="n">DDP</span>
<span class="kn">import</span> <span class="nn">argparse</span>

<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">"--local_rank"</span><span class="p">,</span> <span class="n">default</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">local_rank</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">().</span><span class="n">local_rank</span>

<span class="c1"># Initialize DDP
</span><span class="n">dist</span><span class="p">.</span><span class="n">init_process_group</span><span class="p">(</span><span class="n">backend</span><span class="o">=</span><span class="s">'nccl'</span><span class="p">,</span> <span class="n">init_method</span><span class="o">=</span><span class="s">'env://'</span><span class="p">)</span>
<span class="n">rank</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">get_rank</span><span class="p">()</span>
<span class="n">world_size</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">get_world_size</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"My rank=</span><span class="si">{</span><span class="n">rank</span><span class="si">}</span><span class="s">, local_rank=</span><span class="si">{</span><span class="n">local_rank</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">set_device</span><span class="p">(</span><span class="n">local_rank</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h3 id="wrapping-the-model">Wrapping the Model</h3>

<p>Wrap your model with <code class="language-plaintext highlighter-rouge">DistributedDataParallel</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">DDP</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">device_ids</span><span class="o">=</span><span class="p">[</span><span class="n">local_rank</span><span class="p">],</span> <span class="n">output_device</span><span class="o">=</span><span class="n">local_rank</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h3 id="training">Training</h3>

<p>Synchronize the sampler for each epoch and perform training as usual:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">):</span>
    <span class="n">train_loader</span><span class="p">.</span><span class="n">sampler</span><span class="p">.</span><span class="n">set_epoch</span><span class="p">(</span><span class="n">epoch</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">data</span><span class="p">,</span> <span class="n">label</span> <span class="ow">in</span> <span class="n">train_loader</span><span class="p">:</span>
        <span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">label</span><span class="p">)</span>
        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span class="p">)</span>
        <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
</code></pre></div></div>

<hr />

<h3 id="logging-data">Logging Data</h3>

<p>Use <code class="language-plaintext highlighter-rouge">torch.distributed.reduce</code> to aggregate data across ranks. For example, summing the loss across GPUs and calculating the mean:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss</span> <span class="o">=</span> <span class="n">loss</span><span class="p">.</span><span class="n">clone</span><span class="p">().</span><span class="n">detach</span><span class="p">()</span>
<span class="n">dist</span><span class="p">.</span><span class="nb">reduce</span><span class="p">(</span><span class="n">loss</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">if</span> <span class="n">dist</span><span class="p">.</span><span class="n">get_rank</span><span class="p">()</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
    <span class="n">loss_mean</span> <span class="o">=</span> <span class="n">loss</span> <span class="o">/</span> <span class="n">dist</span><span class="p">.</span><span class="n">get_world_size</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch: </span><span class="si">{</span><span class="n">epoch</span><span class="si">}</span><span class="s">, Loss: </span><span class="si">{</span><span class="n">loss_mean</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h3 id="saving-and-loading-checkpoints">Saving and Loading Checkpoints</h3>

<h4 id="saving-checkpoints">Saving Checkpoints</h4>

<p>Only save checkpoints on rank 0:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">dist</span><span class="p">.</span><span class="n">get_rank</span><span class="p">()</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
    <span class="n">checkpoint_state</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">'iter_no'</span><span class="p">:</span> <span class="n">iter_no</span><span class="p">,</span>
        <span class="s">'model'</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">state_dict</span><span class="p">(),</span>
        <span class="s">'optimizer'</span><span class="p">:</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">state_dict</span><span class="p">(),</span>
    <span class="p">}</span>
    <span class="n">torch</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">checkpoint_state</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="loading-checkpoints">Loading Checkpoints</h4>

<p>Map the checkpoint to the current rank’s device:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">load_checkpoint</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">rank</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">):</span>
    <span class="n">map_location</span> <span class="o">=</span> <span class="p">{</span><span class="s">'cuda:%d'</span> <span class="o">%</span> <span class="mi">0</span><span class="p">:</span> <span class="s">'cuda:%d'</span> <span class="o">%</span> <span class="n">rank</span><span class="p">}</span>
    <span class="n">checkpoint_state</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">checkpoint_path</span><span class="p">,</span> <span class="n">map_location</span><span class="o">=</span><span class="n">map_location</span><span class="p">)</span>
    <span class="n">model</span><span class="p">.</span><span class="n">load_state_dict</span><span class="p">(</span><span class="n">checkpoint_state</span><span class="p">[</span><span class="s">'model'</span><span class="p">])</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">load_state_dict</span><span class="p">(</span><span class="n">checkpoint_state</span><span class="p">[</span><span class="s">'optimizer'</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">checkpoint_state</span><span class="p">[</span><span class="s">'iter_no'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre></div></div>

<hr />

<h3 id="handling-batchnorm">Handling BatchNorm</h3>

<p>To synchronize BatchNorm across GPUs, convert the model to use <code class="language-plaintext highlighter-rouge">SyncBatchNorm</code> before wrapping it with DDP:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">SyncBatchNorm</span><span class="p">.</span><span class="n">convert_sync_batchnorm</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">DDP</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">device_ids</span><span class="o">=</span><span class="p">[</span><span class="n">local_rank</span><span class="p">],</span> <span class="n">output_device</span><span class="o">=</span><span class="n">local_rank</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h3 id="common-issues-and-troubleshooting">Common Issues and Troubleshooting</h3>

<ol>
  <li><strong>Program Hangs</strong>: Ensure all ranks participate in collective operations like <code class="language-plaintext highlighter-rouge">reduce</code>.</li>
  <li><strong>NCCL Errors in Docker</strong>: Check for appropriate NCCL configurations or Docker flags.</li>
  <li><strong>Unused Parameters</strong>: Avoid having unused parameters, as they may cause synchronization issues.</li>
</ol>

<p>These issues will be covered in more detail in a future post.</p>

<hr />]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="tutorial" /><summary type="html"><![CDATA[Pytorch distributed data parallel]]></summary></entry><entry><title type="html">Docker container for machine learning environments</title><link href="https://dongdongbh.tech/docker/" rel="alternate" type="text/html" title="Docker container for machine learning environments" /><published>2020-08-12T00:00:00-04:00</published><updated>2026-03-22T01:36:28-04:00</updated><id>https://dongdongbh.tech/docker</id><content type="html" xml:base="https://dongdongbh.tech/docker/"><![CDATA[<hr />

<h2 id="docker-basics">Docker Basics</h2>

<p>Refer to the <a href="https://docs.docker.com/">Docker documentation</a> and use <code class="language-plaintext highlighter-rouge">docker --help</code> for more details. Here’s a great <a href="https://ropenscilabs.github.io/r-docker-tutorial/">Docker tutorial</a> to get started.</p>

<hr />

<h3 id="docker-image-operations">Docker Image Operations</h3>

<ul>
  <li><strong>Download</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker pull <span class="o">[</span>OPTIONS] NAME[:TAG|@DIGEST]
</code></pre></div>    </div>
  </li>
  <li><strong>Commit Changes</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker commit <span class="o">[</span>OPTIONS] CONTAINER <span class="o">[</span>REPOSITORY[:TAG]]
</code></pre></div>    </div>
  </li>
</ul>

<hr />

<h3 id="checking-docker-status">Checking Docker Status</h3>

<ul>
  <li><strong>List Running Containers</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker ps
</code></pre></div>    </div>
  </li>
  <li><strong>List Images</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker images
</code></pre></div>    </div>
  </li>
  <li><strong>Inspect a Container/Image</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker inspect
</code></pre></div>    </div>
  </li>
</ul>

<hr />

<h3 id="other-useful-commands">Other Useful Commands</h3>

<ul>
  <li><strong>Run a Container</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run
</code></pre></div>    </div>
  </li>
  <li><strong>Remove an Image</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker rmi
</code></pre></div>    </div>
  </li>
  <li><strong>Remove a Container</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker <span class="nb">rm</span>
</code></pre></div>    </div>
  </li>
  <li><strong>Copy Files</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker <span class="nb">cp</span> <span class="o">[</span>OPTIONS] SRC_PATH|- CONTAINER:DEST_PATH
</code></pre></div>    </div>
  </li>
</ul>

<hr />

<h3 id="switching-between-interactive-and-daemon-modes">Switching Between Interactive and Daemon Modes</h3>

<p>Press <code class="language-plaintext highlighter-rouge">&lt;Ctrl&gt; + p</code> followed by <code class="language-plaintext highlighter-rouge">&lt;Ctrl&gt; + q</code> to detach from a container running in interactive mode and switch it to daemon mode. To reattach, use:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker attach <span class="o">[</span>OPTIONS] CONTAINER
</code></pre></div></div>

<hr />

<h3 id="running-docker-without-sudo">Running Docker Without <code class="language-plaintext highlighter-rouge">sudo</code></h3>

<p>To allow running Docker commands without <code class="language-plaintext highlighter-rouge">sudo</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>groupadd docker
<span class="nb">sudo </span>usermod <span class="nt">-aG</span> docker <span class="nv">$USER</span>
newgrp docker
</code></pre></div></div>

<hr />

<h3 id="pushing-images-to-docker-hub">Pushing Images to Docker Hub</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker login
docker tag &lt;image_id&gt; yourhubusername/REPOSITORY_NAME:tag
docker push yourhubusername/REPOSITORY_NAME
</code></pre></div></div>

<hr />

<h3 id="writing-a-dockerfile">Writing a Dockerfile</h3>

<p>A basic <code class="language-plaintext highlighter-rouge">Dockerfile</code> template (refer to the <a href="https://docs.docker.com/engine/reference/builder/">Dockerfile documentation</a>):</p>

<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> ubuntu:18.04</span>
<span class="k">COPY</span><span class="s"> . /app</span>
<span class="k">EXPOSE</span><span class="s"> 9000</span>
<span class="k">RUN </span>make /app
<span class="k">CMD</span><span class="s"> python /app/app.py</span>
</code></pre></div></div>
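
<p>To use this template, a minimal sketch (the image name <code class="language-plaintext highlighter-rouge">myapp</code> is just a placeholder):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># build an image from the directory containing the Dockerfile
docker build -t myapp .
# publish the exposed port on the host and run it
docker run -p 9000:9000 myapp
</code></pre></div></div>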

<hr />

<h3 id="using-docker-composeyml">Using <code class="language-plaintext highlighter-rouge">docker-compose.yml</code></h3>

<p>A <code class="language-plaintext highlighter-rouge">docker-compose.yml</code> file example (refer to <a href="https://docs.docker.com/compose/compose-file/">Compose documentation</a>):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">version</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.8"</span>
<span class="na">services</span><span class="pi">:</span>
  <span class="na">webapp</span><span class="pi">:</span>
    <span class="na">build</span><span class="pi">:</span>
      <span class="na">context</span><span class="pi">:</span> <span class="s">./dir</span>
      <span class="na">dockerfile</span><span class="pi">:</span> <span class="s">Dockerfile-alternate</span>
      <span class="na">args</span><span class="pi">:</span>
        <span class="na">buildno</span><span class="pi">:</span> <span class="m">1</span>
</code></pre></div></div>
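
<p>With this file in place, the usual lifecycle commands are:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker-compose up -d --build   # build and start in the background
docker-compose logs -f webapp  # follow the service logs
docker-compose down            # stop and remove the containers
</code></pre></div></div>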

<p>See <a href="https://docs.docker.com/compose/django/">this example</a> for setting up a Django project.</p>

<hr />

<h2 id="docker-proxy-configuration">Docker Proxy Configuration</h2>

<ol>
  <li>
    <p><strong>Set Proxy for <code class="language-plaintext highlighter-rouge">docker pull</code></strong>:<br />
Refer to the <a href="https://docs.docker.com/config/daemon/systemd/#httphttps-proxy">Docker proxy documentation</a>.</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo mkdir</span> <span class="nt">-p</span> /etc/systemd/system/docker.service.d
<span class="nb">sudo </span>vim /etc/systemd/system/docker.service.d/http-proxy.conf
</code></pre></div>    </div>
  </li>
  <li>
    <p><strong>Add the Following Configuration</strong>:<br />
Replace <code class="language-plaintext highlighter-rouge">127.0.0.1:1080</code> with your proxy’s address.</p>

    <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[Service]</span>
<span class="py">Environment</span><span class="p">=</span><span class="s">"HTTP_PROXY=socks5://127.0.0.1:1080"</span>
<span class="py">Environment</span><span class="p">=</span><span class="s">"HTTPS_PROXY=socks5://127.0.0.1:1080"</span>
</code></pre></div>    </div>
  </li>
  <li><strong>Apply Changes</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl daemon-reload
<span class="nb">sudo </span>systemctl restart docker
</code></pre></div>    </div>
  </li>
  <li><strong>Verify</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl show <span class="nt">--property</span><span class="o">=</span>Environment docker
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h2 id="using-docker-with-cuda">Using Docker with CUDA</h2>

<p>Refer to <a href="https://github.com/NVIDIA/nvidia-docker">NVIDIA Docker</a> for details.</p>

<h3 id="setup-nvidia-container-toolkit">Setup NVIDIA Container Toolkit</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">distribution</span><span class="o">=</span><span class="si">$(</span><span class="nb">.</span> /etc/os-release<span class="p">;</span><span class="nb">echo</span> <span class="nv">$ID$VERSION_ID</span><span class="si">)</span>
curl <span class="nt">-s</span> <span class="nt">-L</span> https://nvidia.github.io/nvidia-docker/gpgkey | <span class="nb">sudo </span>apt-key add -
curl <span class="nt">-s</span> <span class="nt">-L</span> https://nvidia.github.io/nvidia-docker/<span class="nv">$distribution</span>/nvidia-docker.list | <span class="nb">sudo tee</span> /etc/apt/sources.list.d/nvidia-docker.list

<span class="nb">sudo </span>apt-get update <span class="o">&amp;&amp;</span> <span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> nvidia-container-toolkit
<span class="nb">sudo </span>systemctl restart docker
</code></pre></div></div>

<h3 id="pull-and-run-a-cuda-image">Pull and Run a CUDA Image</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker pull nvidia/cuda:10.2-basic
docker run <span class="nt">--gpus</span> all <span class="nt">--ipc</span><span class="o">=</span>host <span class="nt">--net</span> host <span class="nt">-it</span> <span class="nt">--rm</span> <span class="se">\</span>
  <span class="nt">-v</span> /etc/localtime:/etc/localtime:ro <span class="se">\</span>
  <span class="nt">-v</span> /dev/shm:/dev/shm <span class="se">\</span>
  <span class="nt">-v</span> <span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span>:/workspace <span class="se">\</span>
  <span class="nt">--user</span> <span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span>:<span class="si">$(</span><span class="nb">id</span> <span class="nt">-g</span><span class="si">)</span> <span class="se">\</span>
  nvidia/cuda:10.2-runtime-ubuntu18.04
</code></pre></div></div>
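
<p>To verify that containers can actually see the GPU, run <code class="language-plaintext highlighter-rouge">nvidia-smi</code> inside one:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># should list the host GPUs from inside the container
docker run --gpus all --rm nvidia/cuda:10.2-base nvidia-smi
</code></pre></div></div>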

<h3 id="create-a-dockerfile-for-cuda">Create a Dockerfile for CUDA</h3>

<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ARG</span><span class="s"> DOCKER_BASE_IMAGE=nvidia/cuda:10.2-basic</span>
<span class="k">FROM</span><span class="s"> $DOCKER_BASE_IMAGE</span>

<span class="k">RUN </span><span class="nb">rm</span> /etc/apt/sources.list.d/cuda.list <span class="o">&amp;&amp;</span> <span class="se">\
</span>    <span class="nb">rm</span> /etc/apt/sources.list.d/nvidia-ml.list <span class="o">&amp;&amp;</span> <span class="se">\
</span>    apt-get update <span class="o">&amp;&amp;</span> apt-get <span class="nb">install</span> <span class="nt">-y</span> <span class="nb">sudo</span>

<span class="k">COPY</span><span class="s"> pre-install.sh .</span>
<span class="k">RUN </span>./pre-install.sh

<span class="k">ARG</span><span class="s"> UID=1000</span>
<span class="k">ARG</span><span class="s"> GID=1000</span>
<span class="k">ARG</span><span class="s"> USER=docker</span>
<span class="k">ARG</span><span class="s"> PW=docker</span>

<span class="k">RUN </span>useradd <span class="nt">-m</span> <span class="k">${</span><span class="nv">USER</span><span class="k">}</span> <span class="nt">--uid</span><span class="o">=</span><span class="k">${</span><span class="nv">UID</span><span class="k">}</span> <span class="nt">-s</span> /bin/bash <span class="o">&amp;&amp;</span> <span class="se">\
</span>    <span class="nb">echo</span> <span class="s2">"</span><span class="k">${</span><span class="nv">USER</span><span class="k">}</span><span class="s2">:</span><span class="k">${</span><span class="nv">PW</span><span class="k">}</span><span class="s2">"</span> | chpasswd <span class="o">&amp;&amp;</span> <span class="se">\
</span>    adduser <span class="k">${</span><span class="nv">USER</span><span class="k">}</span> <span class="nb">sudo</span>

<span class="k">USER</span><span class="s"> ${USER}</span>
<span class="k">WORKDIR</span><span class="s"> /home/${USER}</span>
</code></pre></div></div>
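
<p>Build it so the container user matches your host UID/GID (this is what the <code class="language-plaintext highlighter-rouge">ARG</code> lines are for); the tag <code class="language-plaintext highlighter-rouge">cuda-dev</code> is just an example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker build -t cuda-dev \
  --build-arg UID=$(id -u) \
  --build-arg GID=$(id -g) .
</code></pre></div></div>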

<hr />

<h2 id="container-using-host-proxy">Container Using Host Proxy</h2>

<h3 id="1-configure-proxy">1. Configure Proxy</h3>

<p>Add the following to <code class="language-plaintext highlighter-rouge">~/.docker/config.json</code>:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"proxies"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"default"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"httpProxy"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8118"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"httpsProxy"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8118"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"noProxy"</span><span class="p">:</span><span class="w"> </span><span class="s2">"localhost"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Alternatively, set the proxy in the <code class="language-plaintext highlighter-rouge">Dockerfile</code> or during build:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker build <span class="nt">--net</span> host ...
</code></pre></div></div>

<hr />

<h2 id="accessing-containers-via-ssh">Accessing Containers via SSH</h2>

<h3 id="ssh-from-the-host-machine">SSH from the Host Machine</h3>

<ol>
  <li>Ensure SSH is installed and running in the container (a minimal setup sketch follows this list).</li>
  <li>
    <p>Find the container’s IP address:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker inspect &lt;container_id&gt; | <span class="nb">grep</span> <span class="s2">"IPAddress"</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>SSH to the container:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh user@&lt;container_ip_address&gt;
</code></pre></div>    </div>
  </li>
</ol>
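
<p>For step 1, a minimal sketch for Debian/Ubuntu-based images (package and service names differ on other distributions):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># open a root shell in the running container
docker exec -it &lt;container_id&gt; bash
# inside the container: install and start the SSH server
apt-get update &amp;&amp; apt-get install -y openssh-server
service ssh start
</code></pre></div></div>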

<h3 id="direct-ssh-to-containers-on-remote-machines">Direct SSH to Containers on Remote Machines</h3>

<p>Map the container’s SSH port to the host:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">-p</span> 52022:22 container1
docker run <span class="nt">-p</span> 53022:22 container2
</code></pre></div></div>

<p>SSH to the container using the host’s IP and mapped port:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-p</span> 52022 user@&lt;host_ip&gt;
</code></pre></div></div>

<hr />

<h2 id="accessing-files-inside-containers">Accessing Files Inside Containers</h2>

<ol>
  <li><strong>Map Directories</strong>: Use volume mapping during <code class="language-plaintext highlighter-rouge">docker run</code>.</li>
  <li>
    <p><strong>Set Up a Web Server</strong>: Run a basic HTTP server in the container:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 <span class="nt">-m</span> http.server
</code></pre></div>    </div>
  </li>
  <li><strong>Use WebDAV</strong>: Set up <a href="https://www.comparitech.com/net-admin/webdav/">WebDAV</a> for collaborative access.</li>
</ol>

<h3 id="webdav-example">WebDAV Example</h3>

<ol>
  <li>
    <p>Install WebDAV:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>wsgidav cheroot
</code></pre></div>    </div>
  </li>
  <li>
    <p>Create a <code class="language-plaintext highlighter-rouge">wsgidav.yaml</code> configuration file.</p>
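
    <p>A minimal sketch of what that file can contain; the key names follow the WsgiDAV documentation, but check the schema for your installed version:</p>

    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>host: 0.0.0.0
port: 8000
provider_mapping:
  "/": "./share"     # directory to expose
simple_dc:
  user_mapping:
    "*": true        # anonymous access; define real users for anything sensitive
</code></pre></div>    </div>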
  </li>
  <li>
    <p>Run WebDAV:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wsgidav <span class="nt">--config</span><span class="o">=</span>wsgidav.yaml <span class="nt">--host</span><span class="o">=</span>0.0.0.0 <span class="nt">--port</span><span class="o">=</span>8000 <span class="nt">--root</span> ./share
</code></pre></div>    </div>
  </li>
  <li>
    <p>Set up an SSH tunnel:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-f</span> <span class="nt">-N</span> <span class="nt">-L</span> 9980:0.0.0.0:8000 <span class="nt">-p</span> 12345 user@&lt;jumper_ip&gt;
</code></pre></div>    </div>
  </li>
  <li>
    <p>Access the container’s files via WebDAV (<code class="language-plaintext highlighter-rouge">dav://localhost:9980/</code>).</p>
  </li>
</ol>

<p>Enjoy seamless file management directly from your file explorer!</p>

<hr />]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="tutorial" /><summary type="html"><![CDATA[Docker tutorial: setting up containers for machine learning environments, images, Dockerfiles, and common commands]]></summary></entry><entry><title type="html">Setting Up a File Server on VPS with Nginx</title><link href="https://dongdongbh.tech/blog/file-server/" rel="alternate" type="text/html" title="Setting Up a File Server on VPS with Nginx" /><published>2020-06-09T00:00:00-04:00</published><updated>2025-01-06T15:07:50-05:00</updated><id>https://dongdongbh.tech/blog/file-server</id><content type="html" xml:base="https://dongdongbh.tech/blog/file-server/"><![CDATA[<hr />

<p>This guide provides a detailed walkthrough to set up a web file server using <strong>Nginx</strong>, <strong>h5ai</strong>, <strong>Aria2</strong>, and <strong>AriaNG</strong> on a Debian-based VPS. Additionally, it explains how to enhance functionality with SSL and local development options.</p>

<hr />

<h2 id="background">Background</h2>

<p>Learn to host a file server with <a href="https://larsjung.de/h5ai/">h5ai</a>, manage downloads using <a href="https://aria2.github.io/">Aria2</a>, and set up configurations via <strong>Nginx</strong> on a Debian 9 VPS.</p>

<hr />

<h2 id="how-to">How to</h2>

<h3 id="1-basic-nginx-configuration">1. Basic Nginx Configuration</h3>

<p>Update the Nginx configuration to host your file server:</p>

<ol>
  <li>Open <code class="language-plaintext highlighter-rouge">/etc/nginx/sites-enabled/default</code> and add:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">server {</span>
    <span class="s">listen xxxx;</span> <span class="c1"># Replace xxxx with your desired port</span>
    <span class="s">server_name localhost;</span>
    <span class="s">root /home/bh/share;</span>

    <span class="s">location / {</span>
        <span class="s">autoindex on;</span>           <span class="c1"># Enable directory listing</span>
        <span class="s">autoindex_exact_size on;</span> <span class="c1"># Show file sizes</span>
        <span class="s">autoindex_localtime on;</span> <span class="c1"># Show local time for files</span>
    <span class="err">}</span>
<span class="err">}</span>
</code></pre></div>    </div>
  </li>
  <li>Reload Nginx:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>service nginx reload
</code></pre></div>    </div>
  </li>
  <li>Access the file server at <code class="language-plaintext highlighter-rouge">yourdomain.com:xxxx</code>.</li>
</ol>

<hr />

<h3 id="2-enhance-with-h5ai">2. Enhance with h5ai</h3>

<h4 id="install-h5ai">Install h5ai</h4>

<ol>
  <li>Install PHP:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>php
</code></pre></div>    </div>
  </li>
  <li>Update the Nginx configuration in <code class="language-plaintext highlighter-rouge">/etc/nginx/sites-enabled/default</code>:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">server {</span>
    <span class="s">listen xxxx;</span>
    <span class="s">server_name localhost;</span>
    <span class="s">root /home/bh/share;</span>
    <span class="s">index index.html /_h5ai/public/index.php;</span>

    <span class="s">location ~ \.php$ {</span>
        <span class="s">fastcgi_pass unix:/run/php/php7.4-fpm.sock;</span> <span class="c1"># Check your PHP socket path</span>
        <span class="s">include fastcgi_params;</span>
        <span class="s">fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;</span>
        <span class="s">fastcgi_param SCRIPT_NAME $fastcgi_script_name;</span>
    <span class="s">}</span>
<span class="err">}</span>
</code></pre></div>    </div>
  </li>
  <li>Reload Nginx:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>service nginx reload
</code></pre></div>    </div>
  </li>
</ol>
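
<p>The configuration above expects the h5ai files under <code class="language-plaintext highlighter-rouge">/_h5ai</code> in the web root, so download and unpack h5ai there as well. A sketch (the version number may be out of date; check the h5ai site):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd /home/bh/share   # the web root from the Nginx config
wget https://release.larsjung.de/h5ai/h5ai-0.30.0.zip
unzip h5ai-0.30.0.zip   # creates the _h5ai directory
</code></pre></div></div>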

<hr />

<h3 id="3-add-folder-password-protection">3. Add Folder Password Protection</h3>

<p>Protect specific folders with HTTP authentication:</p>

<ol>
  <li>Install <code class="language-plaintext highlighter-rouge">apache2-utils</code>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>apache2-utils
</code></pre></div>    </div>
  </li>
  <li>Create a password file and user:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>htpasswd <span class="nt">-c</span> /etc/nginx/passwd your-username
</code></pre></div>    </div>
  </li>
  <li>Update Nginx configuration:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">location /private {</span>
    <span class="s">autoindex on;</span>
    <span class="s">auth_basic "Restricted Access";</span>
    <span class="s">auth_basic_user_file /etc/nginx/passwd;</span>
<span class="err">}</span>
</code></pre></div>    </div>
  </li>
  <li>Reload Nginx:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>service nginx reload
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h3 id="4-integrate-aria2-and-ariang">4. Integrate Aria2 and AriaNG</h3>

<h4 id="install-aria2">Install Aria2</h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>aria2
</code></pre></div></div>

<h4 id="configure-aria2">Configure Aria2</h4>
<ol>
  <li>Create configuration files:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> ~/.aria2
vim ~/.aria2/aria2.conf
</code></pre></div>    </div>
  </li>
  <li>Add the following:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">dir=/home/your-username/aria2/download</span>
<span class="s">enable-rpc=true</span>
<span class="s">rpc-listen-all=true</span>
<span class="s">rpc-listen-port=6800</span>
<span class="s">rpc-secret=your_rpc_password</span>
<span class="s">file-allocation=none</span>
<span class="s">continue=true</span>
<span class="s">max-concurrent-downloads=10</span>
</code></pre></div>    </div>
  </li>
  <li>Run Aria2:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aria2c <span class="nt">--conf-path</span><span class="o">=</span><span class="s2">"/home/your-username/.aria2/aria2.conf"</span>
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h4 id="install-and-configure-ariang">Install and Configure AriaNG</h4>

<ol>
  <li>Download <a href="https://github.com/mayswind/AriaNg/releases">AriaNG</a>.</li>
  <li>Place files in <code class="language-plaintext highlighter-rouge">/home/your-username/aria2/AriaNG</code>.</li>
</ol>
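
<p>A sketch of those two steps; the release version below is an example, so check the releases page for the latest:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir -p /home/your-username/aria2/AriaNG
cd /home/your-username/aria2/AriaNG
wget https://github.com/mayswind/AriaNg/releases/download/1.3.7/AriaNg-1.3.7.zip
unzip AriaNg-1.3.7.zip
</code></pre></div></div>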

<hr />

<h4 id="configure-nginx-for-aria2">Configure Nginx for Aria2</h4>

<ol>
  <li>Create a new Nginx configuration in <code class="language-plaintext highlighter-rouge">/etc/nginx/sites-available/aria.conf</code>:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">server {</span>
    <span class="s">listen 443 ssl;</span>
    <span class="s">server_name your-domain.com;</span>

    <span class="s">root /home/your-username/aria2/AriaNG;</span>

    <span class="s">location ^~ /jsonrpc {</span>
        <span class="s">proxy_pass http://127.0.0.1:6800/jsonrpc;</span>
        <span class="s">proxy_set_header Host $http_host;</span>
        <span class="s">proxy_set_header X-Real-IP $remote_addr;</span>
        <span class="s">proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;</span>
    <span class="s">}</span>

    <span class="s">ssl_certificate /path/to/fullchain.pem;</span> <span class="c1"># Adjust paths</span>
    <span class="s">ssl_certificate_key /path/to/privkey.pem;</span>
<span class="err">}</span>
</code></pre></div>    </div>
  </li>
  <li>Enable the configuration:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo ln</span> <span class="nt">-s</span> /etc/nginx/sites-available/aria.conf /etc/nginx/sites-enabled/
<span class="nb">sudo </span>service nginx reload
</code></pre></div>    </div>
  </li>
  <li>Visit the web interface and configure RPC settings in AriaNG.</li>
</ol>

<hr />

<h3 id="5-set-up-ssl-with-certbot">5. Set Up SSL with Certbot</h3>

<ol>
  <li>Install Certbot:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>certbot
</code></pre></div>    </div>
  </li>
  <li>Obtain and configure an SSL certificate:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>certbot <span class="nt">--nginx</span>
</code></pre></div>    </div>
  </li>
  <li>Update Nginx configurations to use SSL.</li>
</ol>
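
<p>Let’s Encrypt certificates expire after 90 days. The Certbot package installs a renewal timer for you; you can verify that renewal will work with a dry run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>certbot renew <span class="nt">--dry-run</span>
</code></pre></div></div>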

<hr />

<h3 id="6-run-aria2-as-a-daemon">6. Run Aria2 as a Daemon</h3>

<ol>
  <li>Create a systemd service:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>vim /etc/systemd/system/aria2.service
</code></pre></div>    </div>
  </li>
  <li>Add:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>Unit]
<span class="nv">Description</span><span class="o">=</span>Aria2c download manager
<span class="nv">After</span><span class="o">=</span>network.target

<span class="o">[</span>Service]
<span class="nv">User</span><span class="o">=</span>your-username
<span class="nv">ExecStart</span><span class="o">=</span>/usr/bin/aria2c <span class="nt">--conf-path</span><span class="o">=</span>/home/your-username/.aria2/aria2.conf
<span class="nv">Restart</span><span class="o">=</span>on-failure

<span class="o">[</span>Install]
<span class="nv">WantedBy</span><span class="o">=</span>multi-user.target
</code></pre></div>    </div>
  </li>
  <li>Enable and start the service:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl <span class="nb">enable </span>aria2.service
<span class="nb">sudo </span>systemctl start aria2.service
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h3 id="7-local-development-options">7. Local Development Options</h3>

<ol>
  <li>Use <strong>Samba</strong> or <strong>NFS</strong> for local file sharing.</li>
  <li>For SSHFS:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sshfs user@host:/path/to/share /local/mount/point
</code></pre></div>    </div>
    <p>Unmount with:</p>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>umount /local/mount/point
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<p>By following this guide, you can successfully set up a secure, functional file server, integrate powerful tools like Aria2 and AriaNG, and enable seamless file management both locally and remotely.</p>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[Set up web file server, h5ai, Aria2 with nginx on debian VPS.]]></summary></entry><entry><title type="html">Accessing an Intranet Machine from Anywhere Using FRP</title><link href="https://dongdongbh.tech/blog/expose-Intranet/" rel="alternate" type="text/html" title="Accessing an Intranet Machine from Anywhere Using FRP" /><published>2020-06-07T00:00:00-04:00</published><updated>2025-01-06T14:57:46-05:00</updated><id>https://dongdongbh.tech/blog/expose-Intranet</id><content type="html" xml:base="https://dongdongbh.tech/blog/expose-Intranet/"><![CDATA[<hr />

<h2 id="background">Background</h2>

<p><strong>Scenario</strong>:<br />
You have an intranet machine (e.g., a computer in your company) without a public IP, and you want to access it from home or anywhere with internet connectivity. You might also want to host a website on this local machine.</p>

<p><strong>Requirement</strong>:<br />
You need access to a server with a public IP.</p>

<hr />

<h2 id="solution-overview">Solution Overview</h2>

<p>To achieve this, tools like <strong>FRP</strong>, <a href="https://ngrok.com/">ngrok</a>, <strong>NPS</strong>, and <strong>Zerotier</strong> can be used. Here, we focus on <strong>FRP (Fast Reverse Proxy)</strong>, an open-source tool. You can download the appropriate version for your operating system from the <a href="https://github.com/fatedier/frp/releases">FRP GitHub Releases</a>.</p>

<p>This guide demonstrates how to set up FRP for SSH access. For more features, refer to the <a href="https://github.com/fatedier/frp">FRP documentation</a>.</p>

<hr />

<h2 id="ssh-usage">SSH Usage</h2>

<h3 id="file-setup">File Setup</h3>

<ol>
  <li>Place <strong><code class="language-plaintext highlighter-rouge">frps</code></strong> and <strong><code class="language-plaintext highlighter-rouge">frps.ini</code></strong> on the public server.</li>
  <li>Place <strong><code class="language-plaintext highlighter-rouge">frpc</code></strong> and <strong><code class="language-plaintext highlighter-rouge">frpc.ini</code></strong> on the intranet machine.</li>
</ol>

<hr />

<h3 id="accessing-the-intranet-machine-via-ssh">Accessing the Intranet Machine via SSH</h3>

<h4 id="step-1-configure-the-public-server">Step 1: Configure the Public Server</h4>

<p>Edit the <code class="language-plaintext highlighter-rouge">frps.ini</code> file:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># frps.ini
</span><span class="nn">[common]</span>
<span class="py">bind_port</span> <span class="p">=</span> <span class="s">7000</span>
</code></pre></div></div>

<p>Start <code class="language-plaintext highlighter-rouge">frps</code> in the background:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">nohup</span> ./frps <span class="nt">-c</span> ./frps.ini <span class="o">&gt;</span> /dev/null 2&gt;&amp;1 &amp;
</code></pre></div></div>

<hr />

<h4 id="step-2-configure-the-intranet-machine">Step 2: Configure the Intranet Machine</h4>

<p>Edit the <code class="language-plaintext highlighter-rouge">frpc.ini</code> file. Replace <code class="language-plaintext highlighter-rouge">x.x.x.x</code> with the public server’s IP address:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># frpc.ini
</span><span class="nn">[common]</span>
<span class="py">server_addr</span> <span class="p">=</span> <span class="s">x.x.x.x</span>
<span class="py">server_port</span> <span class="p">=</span> <span class="s">7000</span>

<span class="nn">[ssh]</span>
<span class="py">type</span> <span class="p">=</span> <span class="s">tcp</span>
<span class="py">local_ip</span> <span class="p">=</span> <span class="s">127.0.0.1</span>
<span class="py">local_port</span> <span class="p">=</span> <span class="s">22</span>
<span class="py">remote_port</span> <span class="p">=</span> <span class="s">6000</span>
</code></pre></div></div>

<p>Start <code class="language-plaintext highlighter-rouge">frpc</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./frpc <span class="nt">-c</span> ./frpc.ini
</code></pre></div></div>
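
<p>As with <code class="language-plaintext highlighter-rouge">frps</code>, you can keep the client running in the background after closing the shell:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">nohup</span> ./frpc <span class="nt">-c</span> ./frpc.ini <span class="o">&gt;</span> /dev/null 2&gt;&amp;1 &amp;
</code></pre></div></div>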

<hr />

<h4 id="step-3-connect-to-the-intranet-machine">Step 3: Connect to the Intranet Machine</h4>

<p>From any external machine, connect via SSH. Replace <code class="language-plaintext highlighter-rouge">x.x.x.x</code> with the public server’s IP, and assume the username is <code class="language-plaintext highlighter-rouge">test</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-oPort</span><span class="o">=</span>6000 <span class="nb">test</span>@x.x.x.x
</code></pre></div></div>

<hr />

<h3 id="important-notes">Important Notes</h3>

<ol>
  <li>Ensure the ports used in FRP (e.g., <code class="language-plaintext highlighter-rouge">7000</code>, <code class="language-plaintext highlighter-rouge">6000</code>) are open on the public server’s firewall (see the example below).</li>
  <li>Each client machine needs a unique <strong><code class="language-plaintext highlighter-rouge">remote_port</code></strong> for mapping.</li>
</ol>
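
<p>For note 1: if the server uses <code class="language-plaintext highlighter-rouge">ufw</code>, opening the ports looks like this (cloud providers may additionally require a security-group rule):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo ufw allow 7000/tcp   # frps bind_port
sudo ufw allow 6000/tcp   # remote_port used for SSH
</code></pre></div></div>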

<hr />

<h3 id="extending-access-for-web-applications">Extending Access for Web Applications</h3>

<p>To access tools like <strong>Jupyter Notebook</strong> or <strong>TensorBoard</strong>, simply add additional port mappings in <code class="language-plaintext highlighter-rouge">frpc.ini</code>. For example:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[jupyter]</span>
<span class="py">type</span> <span class="p">=</span> <span class="s">tcp</span>
<span class="py">local_ip</span> <span class="p">=</span> <span class="s">127.0.0.1</span>
<span class="py">local_port</span> <span class="p">=</span> <span class="s">8888</span>
<span class="py">remote_port</span> <span class="p">=</span> <span class="s">8889</span>
</code></pre></div></div>

<p>Then, access the application at <code class="language-plaintext highlighter-rouge">http://x.x.x.x:8889</code> (plain HTTP, unless the service itself serves TLS).</p>

<hr />

<h2 id="using-ssh-on-a-mobile-phone">Using SSH on a Mobile Phone</h2>

<p>For <strong>iOS</strong>, you can use <strong>Termius</strong>, which offers basic SSH functionality for free.</p>

<h3 id="steps-to-set-up-ssh-in-termius">Steps to Set Up SSH in Termius</h3>

<ol>
  <li>Open Termius.</li>
  <li>Navigate to <strong>Hosts</strong> → <strong>Add New</strong> → Enter the remote IP, SSH username, and password, then save.</li>
  <li>Connect to the host.</li>
</ol>

<hr />

<h3 id="using-ssh-keys-in-termius">Using SSH Keys in Termius</h3>

<ol>
  <li>Open Termius.</li>
  <li>Go to <strong>Keychain</strong> → <strong>Add Key</strong> (or use an existing key).</li>
  <li>Copy the public key.</li>
  <li>Append the Termius public key to the server’s <code class="language-plaintext highlighter-rouge">~/.ssh/authorized_keys</code> file.</li>
  <li>In Termius, edit the host, attach the created key, and save.</li>
  <li>Connect to the server.</li>
</ol>

<p>Enjoy secure and efficient SSH access from your mobile device!</p>

<hr />

<p>This guide enables you to set up remote SSH access to an intranet machine using FRP and explains how to extend its functionality for web applications and mobile devices.</p>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[Expose Intranet machine to outside]]></summary></entry><entry><title type="html">Deep Reinforcement learning notes (UBC)</title><link href="https://dongdongbh.tech/UBC-RL/" rel="alternate" type="text/html" title="Deep Reinforcement learning notes (UBC)" /><published>2019-11-27T00:00:00-05:00</published><updated>2025-11-13T18:07:35-05:00</updated><id>https://dongdongbh.tech/UBC-RL</id><content type="html" xml:base="https://dongdongbh.tech/UBC-RL/"><![CDATA[<h2 id="background">Background</h2>

<p>These are my notes for the UC Berkeley deep reinforcement learning course (CS294-112, now CS285), taught by <a href="https://people.eecs.berkeley.edu/~svlevine/">Sergey Levine</a>. The lecture videos can be found on YouTube. I have written two reinforcement learning notes before: one on <a href="https://dongdongbh.tech/RL-note/">basic RL</a> and one on the <a href="https://dongdongbh.tech/RL-courses/">David Silver course</a>.</p>

<p>Compared with those, this course takes a deeper theoretical view, covers more recent methods, and includes more advanced topics, especially model-based RL and meta-learning. It is well suited to readers interested in robotics control and a deeper understanding of reinforcement learning.</p>

<p>The class is fairly demanding, so make sure you keep up with the lectures.</p>

<p>Some of the math may not render properly in this post; <a href="../assets/pdf/UBC.pdf">download the PDF version of the notes</a> instead.</p>

<h4 id="table-of-contents">Table of contents</h4>

<p><a href="#Background">Background</a></p>

<p><a href="#1. Imitation learning">1. Imitation learning
 </a></p>

<p><a href="#2. Policy gradient">2. Policy gradient
 </a></p>

<p><a href="#3. Actor-critic method">3. Actor-critic method
 </a></p>

<p><a href="#4. Value based methods">4. Value based methods
 </a></p>

<p><a href="#5. Practical Q-learning">5. Practical Q-learning
 </a></p>

<p><a href="#6. Advanced Policy Gradients">6. Advanced Policy Gradients
 </a></p>

<p><a href="#7. Optimal Control and Planning">7. Optimal Control and Planning
 </a></p>

<p><a href="#8. Model-Based Reinforcement Learning (learning the model)">8. Model-Based Reinforcement Learning (learning the model)
 </a></p>

<p><a href="#9. Model-Based RL and Policy Learning">9. Model-Based RL and Policy Learning
 </a></p>

<p><a href="#10 Variational Inference and Generative Models">10 Variational Inference and Generative Models
 </a></p>

<p><a href="#11. Re-framing Control as an Inference Problem">11. Re-framing Control as an Inference Problem
 </a></p>

<p><a href="#12. Inverse Reinforcement Learning">12. Inverse Reinforcement Learning
 </a></p>

<p><a href="#13. Transfer and Multi-task Learning">13. Transfer and Multi-task Learning
 </a></p>

<p><a href="#14. Distributed RL">14. Distributed RL
 </a></p>

<p><a href="#15. Exploration">15. Exploration
 </a></p>

<p><a href="#16 Meta Reinforcement learning">16 Meta Reinforcement learning
 </a></p>

<p><a href="#17 Information theory, challenges, open problems">17 Information theory, challenges, open problems
 </a></p>

<p><a href="#18 Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)">18 Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)
 </a></p>

<h2 id="1-imitation-learning">1. Imitation learning</h2>

<h3 id="the-main-problem-of-imitation-distribution-drift">The main problem of imitation: distribution drift</h3>

<p>How can we make the training-data distribution match the distribution of observations seen under the learned policy?</p>

<p>DAgger</p>

<h4 id="dagger-dataset-aggregation">DAgger: Dataset Aggregation</h4>

<p>goal: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)$ !</p>

<p>how? just run $\pi_\theta(a_t \mid o_t)$</p>

<p>but need labels $a_t$ !</p>

<ol>
  <li>train $\pi_{\theta}(a_t \mid o_t)$ from human data $\mathcal{D}={o_1,a_1,…,o_N,a_N}$</li>
  <li>run $\pi_\theta(a_t \mid o_t)$ to get dataset $\mathcal{D}_\pi = {o_1,…,o_M}$</li>
  <li>ask human to label $\mathcal{D}_\pi$ with actions $a_t$</li>
  <li>aggregate: $\mathcal{D}\gets \mathcal{D}\cup\mathcal{D}_\pi$</li>
</ol>

<p>An alternative to collecting more labels: fit the expert so well that the policy does not drift.</p>

<h4 id="why-fail-to-fit-expert">why fail to fit expert?</h4>

<ol>
  <li>Non-Markovian behavior
    <ul>
      <li>use history observations</li>
    </ul>
  </li>
  <li>Multimodal behavior
    <ul>
      <li>for discrete actions this is fine, since the softmax outputs a probability over all actions</li>
      <li>for continuous actions, options include:
        <ul>
          <li>output a mixture of Gaussians</li>
          <li>latent variable models (inject noise into the network input)</li>
          <li>autoregressive discretization</li>
        </ul>
      </li>
    </ul>
  </li>
</ol>

<p>other problems of imitation learning</p>

<ul>
  <li>human labeled data is finite</li>
  <li>human not good at some problems</li>
</ul>

<h4 id="reward-function-of-imitation-learning">reward function of imitation learning</h4>

<p>reward function of imitation learning can be</p>

\[r(s,a) = \log {p(a=\pi^*(s) \mid s)}\]

<h2 id="mdp--rl-intro">MDP &amp; RL Intro</h2>

<h3 id="the-goal-of-rl">The goal of RL</h3>

<p>expected reward</p>

\[p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)\\
\theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\]

<p>where $p_\theta(\tau)$ is the distribution of the sequence</p>

<h3 id="q--v">Q &amp; V</h3>

\[V^\pi(s_t)=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t]\\
V^\pi(s_t)=E_{a_t\sim\pi(a_t \mid s_t)}[Q^\pi(s_t,a_t)]\]

<h3 id="types-of-rl-algorithms">Types of RL algorithms</h3>

<ul>
  <li>Policy gradient</li>
  <li>value-based</li>
  <li>Actor-critic</li>
  <li>model-based RL, where the learned model is used
    <ul>
      <li>for planning (optimal control, discrete planning)</li>
      <li>to improve a policy</li>
      <li>for something else (dynamic programming, generating simulated experience)</li>
    </ul>
  </li>
</ul>

<h4 id="trade-offs">trade-offs</h4>

<ul>
  <li>sample efficiency</li>
  <li>stability &amp; ease of use</li>
</ul>

<h4 id="assumptions">assumptions</h4>

<ul>
  <li>stochastic or deterministic</li>
  <li>continuous or discrete</li>
  <li>episodic or infinite horizon</li>
</ul>

<h4 id="sample-efficiency">sample efficiency</h4>

<ul>
  <li><strong>off policy</strong>: able to improve the policy without generating new samples from that policy</li>
  <li><strong>on policy</strong>: each time the policy is changed, even a little bit, we need to generate new samples</li>
</ul>

<h4 id="stability--ease-of-use">stability &amp; ease of use</h4>

<p><strong>convergence</strong> is a problem</p>

<p>Supervised learning almost <em>always</em> gradient descent</p>

<p>RL often <em>not</em> strictly gradient descent</p>

<h2 id="2-policy-gradient">2. Policy gradient</h2>

<h4 id="objective-function">Objective function</h4>

\[\theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\\
J(\theta)=E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\approx\frac{1}{N}\sum_i\sum_tr(s_{i,t},a_{i,t})\]

<h4 id="policy-differentiation">policy differentiation</h4>

<h5 id="log-derivative">log derivative</h5>

\[\begin{align}
\pi_\theta(\tau)\Delta \log \pi_\theta(\tau)&amp;=\pi_\theta(\tau)\frac{\Delta\pi_\theta(\tau)}{\pi_\theta(\tau)}=\Delta\pi_\theta(\tau)\\
\pi_\theta(\tau)&amp;=\pi_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)\\
\log \pi_\theta(\tau) &amp;=\log p(s_1) + \sum_{t=1}^T \log \pi_\theta (a_t \mid s_t) + \log p(s_{t+1} \mid s_t,a_t)\\
&amp;\Delta_\theta \left[\log p(s_1) + \sum_{t=1}^T \log \pi_\theta (a_t \mid s_t) + \log p(s_{t+1} \mid s_t,a_t)\right]= \sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t \mid s_t) 
\end{align}\]

<h5 id="objective-function-differentiation">Objective function differentiation</h5>

\[\begin{align}
\theta^*&amp;=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[{r}(\tau)]=\int\pi_\theta(\tau)r(\tau)d\tau\\
{r}(\tau)&amp;=\sum_t r(s_t,a_t)\\
\Delta_\theta J(\theta)&amp;=\int\Delta_\theta \pi_\theta(\tau)r(\tau)d\tau\\
&amp;=\int\pi_\theta(\tau)\Delta_\theta \log \pi_\theta(\tau)r(\tau) d\tau\\
&amp;=E_{r\sim\pi_\theta}[\Delta_\theta \log \pi_\theta(\tau)r(\tau)]\\
&amp;=E_{r\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t \mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]
\end{align}\]

<h5 id="evaluating-the-policy-gradient">evaluating the policy gradient</h5>

\[\begin{align}
\Delta_\theta J(\theta)&amp;=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t \mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]\\
\Delta_\theta J(\theta)&amp;\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t \mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\\
\theta &amp;\gets\theta+\alpha\Delta_\theta J(\theta)
\end{align}\]

<h4 id="reinforce-algorithm">REINFORCE algorithm</h4>

<ol>
  <li>sample ${\tau^i}$ from $\pi_\theta(a_t \mid s_t)$ (run the policy)</li>
  <li>$\Delta_\theta J(\theta)\approx\sum_{i}^N\left(\sum_{t} \Delta_\theta \log \pi_\theta (a_t \mid s_t) \right)\left(\sum_{t} r(s_t,a_t)\right)$</li>
  <li>$\theta \gets\theta+\alpha\Delta_\theta J(\theta)$</li>
</ol>

<h5 id="policy-gradient">policy gradient</h5>

\[\Delta_\theta J(\theta)\approx\frac{1}{N} \Delta_\theta \log \pi_\theta (\tau) r(\tau)\]

<h5 id="reduce-variance">Reduce variance</h5>

<p><strong>Causality</strong>: the policy at time $t'$ cannot affect the reward at time $t$ when $t&lt;t'$</p>

\[\begin{align}
\Delta_\theta J(\theta)&amp;\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\
&amp;\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\\
&amp;=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\hat{Q}_{i,t}
\end{align}\]

<p><strong>baseline</strong></p>

<p>\(b=\frac{1}{N}\sum_{i=1}^{N}r(\tau_i)\\
\Delta_\theta J(\theta)\approx\frac{1}{N} \Delta_\theta \log \pi_\theta (\tau) [r(\tau)-b]\)
Proof that subtracting a baseline leaves the gradient unbiased:
\(\begin{align}
E[\Delta_\theta \log \pi_\theta(\tau)b]&amp;=\int \pi_\theta(\tau) \Delta_\theta \log \pi_\theta(\tau)b \, d\tau \\
&amp;= \int\Delta_\theta \pi_\theta(\tau)b \, d\tau\\
&amp;=b\Delta_\theta\int\pi_\theta(\tau)d\tau\\
&amp;=b\Delta_\theta 1\\
&amp;=0
\end{align}\)</p>

<p>Here, $\tau$ means a whole <strong>episode</strong> sampled by the current policy.</p>

<p>We can prove that there is an optimal baseline that minimizes the variance, where $g(\tau)=\Delta_\theta \log \pi_\theta(\tau)$:</p>

\[b=\frac{E[g(\tau)^2r(\tau)]}{E[g(\tau)^2]}\]

<p>But in practice, we just use the average reward as the baseline to keep things simple.</p>

<blockquote>
  <p>policy gradient is <strong>on-policy</strong> algorithm</p>
</blockquote>

<h4 id="off-policy-learning--importance-sampling">Off-policy learning &amp; importance sampling</h4>

\[\theta^*=\underset{\theta}{\arg\max} J(\theta)\\
J(\theta)=E_{\tau\sim\pi_\theta(\tau)}[r(\tau)]\]

<p>what if we sample from $\bar{\pi}(\tau)$ instead?</p>

<p><strong>Importance sampling</strong></p>

\[\begin{align}
E_{x\sim p(x)}[f(x)]&amp;=\int p(x)f(x)dx\\
&amp;=\int \frac {q(x)}{q(x)}p(x)f(x)dx\\
&amp;=E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right]
\end{align}\]

<p>so apply this to our objective function, we have</p>

\[J(\theta)=E_{\tau\sim\bar{\pi}(\tau)}\left[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\right]\]

<p>and we have</p>

\[\pi_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)\\
\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}=\frac{p(s_1)\prod_{t=1}^T\pi_\theta(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)}=\frac{\prod_{t=1}^T\pi_\theta(a_t \mid s_t)}{\prod_{t=1}^T \bar{\pi}(a_t \mid s_t)}\]

<p>so we have</p>

\[\begin{align}
J(\theta')&amp;=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]\\
\Delta_{\theta'}J(\theta')&amp;=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\Delta_{\theta'}\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]\\
&amp;=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\Delta_{\theta'} \log \pi_{\theta}(\tau)r(\tau)\right]
\end{align}\]

<p><strong>The off-policy policy gradient</strong></p>

\[\begin{align}
\Delta_{\theta'}J(\theta')&amp;=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\Delta_{\theta'} \log \pi_{\theta}(\tau)r(\tau)\right]\\
&amp;=E_{\tau\sim\pi_\theta}\left[\left(\prod_{t=1}^T\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\right)\left(\sum_{t=1}^T \Delta_{\theta'} \log \pi_{\theta'} (a_t \mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]\\
&amp;=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_{\theta'} \log \pi_{\theta'} (a_t \mid s_t) \right)\left(\prod_{t'=1}^t\frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}\right)\left(\sum_{t'=t}^T r(s_{t'},a_{t'})\left(\prod_{t''=t}^{T}\frac{\pi_{\theta'}(a_{t''} \mid s_{t''})}{\pi_\theta(a_{t''} \mid s_{t''})}\right)\right)\right]
\end{align}\]

<p>we can view state and action separately, then:</p>

\[\begin{align}
\theta^*&amp;=\underset{\theta}{\arg\max} \sum_{t=1}^TE_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]\\
J(\theta)&amp;=E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]\\
&amp;=E_{s_t\sim p_\theta(s_t)}\left[E_{a_t\sim \pi(a_t,s_t)}[r(s_t,a_t)]\right]\\
J(\theta')&amp;=E_{s_t\sim p_\theta(s_t)}\left[\cancel{\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}}E_{a_t\sim \pi(a_t,s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}r(s_t,a_t)\right]\right]
\end{align}\]

<p>If the ratio $\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}$ is bounded (the two policies stay close), we can drop it; this leads to the <strong>TRPO</strong> method, which we will discuss later.</p>

<p>For coding, we can use “pseudo-loss” as weighted maximum likelihood with automatic differentiation:</p>

\[\bar{J}(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log \pi_\theta (a_{i,t} \mid s_{i,t})\hat{Q}_{i,t}\]

<h5 id="policy-gradient-in-practice">policy gradient in practice</h5>

<ul>
  <li>the gradient has <strong>high variance</strong>
    <ul>
      <li>this isn’t the same as supervised learning!</li>
      <li>gradients will be really noisy!</li>
    </ul>
  </li>
  <li>consider using much <strong>larger batches</strong></li>
  <li>tweaking <strong>learning rates</strong> is very hard
    <ul>
      <li>adaptive step-size rules like ADAM can be OK-ish</li>
      <li>we will cover policy-gradient-specific learning-rate adjustment methods later</li>
    </ul>
  </li>
</ul>

<h2 id="3-actor-critic-method">3. Actor-critic method</h2>

<h3 id="basics">Basics</h3>

<p>recap policy gradient</p>

\[\begin{align}
\Delta_\theta J(\theta)&amp;\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\
&amp;\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\\
&amp;=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\hat{Q}_{i,t}
\end{align}\]

<p>where $\hat{Q}_{i,t}$ is computed from a single sampled trajectory: an unbiased estimate, but one with high variance.</p>

<p>We can use expectation to reduce variance</p>

\[\hat{Q}_{i,t}\approx \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t,a_t]\]

<p>And we define</p>

\[\hat{Q}_{i,t}= \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t,a_t]\\
V(s_t)=E_{a_t\sim\pi(a_t \mid s_t)}[Q(s_t,a_t)]\]

<p>then</p>

\[\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})(Q(s_{i,t}, a_{i,t})-V(s_{i,t}))\]

<h4 id="advantage">Advantage</h4>

\[A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)\\
\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})A^\pi(s_t,a_t)\]

<p>The better the estimate of $A^\pi(s_t,a_t)$, the lower the variance.</p>

<h4 id="value-function-fitting">Value function fitting</h4>

\[Q^\pi(s_t,a_t)=r(s_t,a_t)+E_{s_{t+1}\sim p(s_{t+1} \mid s_t,a_t)}[V^\pi(s_{t+1})]\]

<p>and we accept a little bias (a single-sample estimate of the next state) for convenience</p>

\[Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})\]

<p>so we have</p>

\[A^\pi(s_t,a_t) \approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)\]

<p>then we only need to fit $V^\pi(s)$ !</p>

<h4 id="policy-evaluation">Policy evaluation</h4>

\[V^\pi(s_t)=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t]\\
J(\theta)=E_{s_1\sim p(s_1)}[V^\pi(s_1)]\]

<p>Monte Carlo policy evaluation (this is what policy gradient does)</p>

\[V^\pi(s_t)\approx \sum_{t'=t}^Tr(s_{t'},a_{t'})\]

<p>We can average multiple samples <strong>if we can reset</strong> the environment to a previous state:</p>

\[V^\pi(s_t)\approx \frac{1}{N}\sum_{i=0}^N\sum_{t'=t}^Tr(s_{t'},a_{t'})\]

<p><strong>Monte Carlo evaluation with function approximation</strong></p>

<p>With function approximation, even a single sample per state from the trajectory works quite well.</p>

<p>training data: ${\left(s_{i,t},\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\right)}$</p>

<p>supervised regression: $\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2$</p>

<p>Ideal target:</p>

\[y_{i,t}=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t]\approx r(s_{i,t},a_{i,t})+V^\pi(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\]

<p>Monte Carlo target:</p>

\[y_{i,t}=\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\]

<h4 id="tdbootstrapped">TD(bootstrapped)</h4>

<p>training data: $ {\left(s_{i,t},r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\right)} $</p>

<h3 id="actor-critic-algorithm">Actor-critic algorithm</h3>

<p>batch actor-critic algorithm:</p>

<ol>
  <li>sample ${s_i,a_i}$ from $\pi_\theta(a \mid s)$</li>
  <li>fit $\hat{V_\phi^\pi}(s)$ to sampled reward sums</li>
  <li>evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)$</li>
  <li>$\Delta_\theta J(\theta)\approx \sum_i \Delta_\theta \log \pi_\theta (a_{i} \mid s_{i})\hat{A}^\pi(s_i,a_i)$</li>
  <li>$\theta \gets \theta+\alpha\Delta_\theta J(\theta)$</li>
</ol>

\[V^\pi(s_{i,t})=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_{i,t}]\\
V^\pi(s_{i,t})\approx\sum_{t'=t}^Tr(s_{t'},a_{t'})\\
V^\pi(s_{i,t})\approx r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\\
\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2\]

<h4 id="aside-discount-factors">Aside: discount factors</h4>

<p>what if T (episode length) is $\infty$ ?</p>

<p>$\hat{V}_\phi^\pi$ can get infinitely large in many cases</p>

<p>simple trick: better to get rewards sooner than later</p>

\[V^\pi(s_{i,t})\approx r(s_{i,t},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1})\\
\gamma \in [0,1]\]
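
<p>A quick check of why the discount keeps values finite, assuming a constant reward $r$ at every step:</p>

\[\sum_{t=0}^{\infty}\gamma^t r=\frac{r}{1-\gamma}\]

<p>e.g. $r=1$ with $\gamma=0.99$ gives a value of $100$, whereas $\gamma=1$ diverges.</p>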

<p>actually we use discount in policy gradient as</p>

\[\Delta_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)\]

<p>Online actor-critic algorithm(can apply to every single step):</p>

<ol>
  <li>take action $a\sim\pi_\theta(a \mid s)$, get $(s,a,s',r)$</li>
  <li>update $\hat{V}_\phi^\pi(s)$ using target $r+\gamma\,\hat{V}_\phi^\pi(s')$</li>
  <li>evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\,\hat{V}_\phi^\pi(s')-\hat{V}_\phi^\pi(s)$</li>
  <li>$\Delta_\theta J(\theta)\approx \Delta_\theta \log \pi_\theta (a \mid s)\hat{A}^\pi(s,a)$</li>
  <li>$\theta \gets \theta+\alpha\Delta_\theta J(\theta)$</li>
</ol>

<h4 id="architecture-design">Architecture design</h4>

<p>network architecture choice</p>

<ul>
  <li>separate value and policy networks (more stable and simpler)</li>
  <li>partially shared value and policy networks (shared features)</li>
</ul>

<p>Online actor-critic works best with batched updates (e.g., from parallel workers).</p>

<h4 id="trade-off-and-balance">trade-off and balance</h4>

<p>policy gradient</p>

\[\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-b\right)\]

<p>Actor-critic</p>

\[\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\left(r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)\right)\]

<p>Policy gradient is no bias but has higher variance</p>

<p>Actor-critic is lower variance but not unbiased</p>

<blockquote>
  <p>so can we combine these two things?</p>
</blockquote>

<p>Here we have <strong>critics as state-dependent baselines</strong></p>

\[\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\left(\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-\hat{V}_\phi^\pi(s_{i,t})\right)\]

<ul>
  <li>no bias</li>
  <li>lower variance</li>
</ul>

<p><strong>Eligibility traces &amp; n-step returns</strong></p>

<p>Critic and Monte Carlo critic</p>

\[\hat{A}^\pi_C(s_t,a_t)=r(s_t,a_t)+\gamma\hat{V}_\phi^\pi(s_{t+1})-\hat{V}_\phi^\pi(s_t) \\
\hat{A}^\pi_{MC}(s_t,a_t)=\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_{t})\]

<blockquote>
  <p>combine these two?</p>
</blockquote>

<p>n-step returns</p>

\[\hat{A}^\pi_{n}(s_t,a_t)=\sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^n\hat{V}_\phi^\pi(s_{t+n})-\hat{V}_\phi^\pi(s_{t})\]

<p>Choosing $n&gt;1$ often works better!!!</p>

<p><strong>Generalized advantage estimation(GAE)</strong></p>

<blockquote>
  <p>Do we have to choose just one n?</p>
</blockquote>

<p>Cut everywhere all at once!</p>

\[\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{n=1}^\infty w_n\hat{A}(s_t,a_t)\]

<blockquote>
  <p>How to weight?</p>
</blockquote>

<p>Mostly prefer cutting earlier (less variance): $w_n\propto\lambda^{n-1}$, e.g. $\lambda=0.95$</p>

<p>and this leads to Eligibility traces</p>

\[\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\delta_{t'}\\
\delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}_\phi^\pi(s_{t'+1})-\hat{V}_\phi^\pi(s_{t'})\]

<p>In this form, updating the estimate for a state requires the subsequent steps of experience to have been collected.</p>

<h2 id="4-value-based-methods">4. Value based methods</h2>

<p>$\underset{a_t}{\arg\max}A^\pi(s_t,a_t)$ : best action from $s_t$, if we then follow $\pi$</p>

<p>then:</p>

\[\pi'(a_t \mid s_t)=\begin{cases}1, &amp;if \quad a_t=\underset{a_t}{\arg\max}A^\pi(s_t,a_t) \cr 0, &amp;otherwise\end{cases}\]

<blockquote>
<p>this is at least as good as any $a_t \sim \pi(a_t \mid s_t)$</p>
</blockquote>

<h3 id="policy-iteration">Policy iteration</h3>

<ol>
  <li>evaluate $A^\pi(s,a)$</li>
  <li>set $\pi \gets \pi'$</li>
</ol>

<h3 id="dynamic-programming">Dynamic programming</h3>

<p>assume we know $p(s’ \mid s,a)$ and s and a are both discrete (and small)</p>

<p>bootstrapped update:</p>

\[V^\pi(s) \gets E_{a\sim\pi(a \mid s)}[r(s,a)+\gamma E_{s'\sim p(s' \mid s,a)}[V^\pi(s')]]\]

<p>with deterministic policy $\pi(s)=a$, we have</p>

\[V^\pi(s) \gets r(s,\pi(s))+\gamma E_{s'\sim p(s' \mid s,\pi(s))}[V^\pi(s')]\]

\[\underset{a_t}{\arg\max}A^\pi(s_t,a_t)=\underset{a_t}{\arg\max}Q^\pi(s_t,a_t)\\
Q^\pi(s,a)=r(s,a)+\gamma E[V^\pi(s')]\]

<p>So policy iteration become</p>

<ol>
  <li>set $Q^\pi(s,a)\gets r(s,a)+\gamma E[V^\pi(s')]$</li>
  <li>set $V(s)\gets \max_a Q(s,a)$</li>
</ol>

<h4 id="function-approximator">Function approximator</h4>

<p>$\mathcal{L}=\frac{1}{2}\sum_i\parallel V_\phi (s)-\max_a Q(s,a)\parallel^2$</p>

<h5 id="fitted-value-iteration">fitted value iteration</h5>

<p>fitted value iteration algorithm:</p>

<ol>
  <li>set $y_i \gets \max_{a_i}(r(s_i,a_i)+\gamma E[V_\phi(s'_i)])$</li>
  <li>set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel V_\phi (s_i)-y_i\parallel^2$</li>
</ol>

<p>But we cannot take the max over actions without knowing the dynamics, so we evaluate Q instead of V:</p>

\[Q^\pi(s,a) \gets r(s,a)+\gamma E_{s'\sim p(s' \mid s,a)}[Q^\pi(s',\pi(s'))]\]

<h5 id="fitted-q-iteration">fitted Q-iteration</h5>

<ol>
  <li>collect dataset ${(s_i, a_i,s_i',r_i)}$ using some policy</li>
  <li>set $y_i \gets r(s_i,a_i) +\gamma \max_{a_i'}Q_\phi(s_i',a_i')$</li>
  <li>set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel Q_\phi(s_i,a_i)-y_i\parallel^2$; repeat steps 2 and 3 K times, then return to step 1</li>
</ol>

<p>Q-learning is <strong>off-policy</strong>: it fits $Q(s,a)$ for all state-action pairs, and the target takes a <strong>max</strong> over actions rather than using an action sampled from the current policy, so transitions collected under any policy can be reused. Likewise, given $s$ and $a$, the reward $r(s,a)$ and the transition are independent of $\pi$.</p>

<h5 id="exploration">exploration</h5>

<ol>
  <li>epsilon-greedy</li>
</ol>

\[\pi(a_t \mid s_t)=\begin{cases}1-\epsilon , &amp;\text{if}\; a_t=\underset{a_t}{\arg\max}Q_\phi(s_t,a_t) 
 \cr \epsilon/( \mid \mathcal{A} \mid -1), &amp;\text{otherwise}\end{cases}\]

<ol start="2">
  <li>Boltzmann exploration</li>
</ol>

\[\pi(a_t \mid s_t) \propto \exp(Q_\phi(s_t,a_t))\]

<h4 id="value-function-learning-theory">Value function learning theory</h4>

<p>value iteration:</p>

<ol>
  <li>set $Q(s,a) \gets r(s,a)+\gamma E[V(s')]$</li>
  <li>set $V(s) \gets \max_a Q(s,a)$</li>
</ol>

<p>In the tabular case, value iteration converges.</p>

<p>In the non-tabular case (with function approximation), convergence is not guaranteed.</p>

<p>Actor-critic also needs to estimate $V$, and when it uses the bootstrapped update it has the same problem: convergence cannot be guaranteed.</p>

<h2 id="5-practical-q-learning">5. Practical Q-learning</h2>

<p>What’s wrong with online Q-learning?</p>

<blockquote>
  <p>Actually, it is not true gradient descent: it does not differentiate through the target Q inside $y$.</p>

  <p>And the samples are not i.i.d.: consecutive transitions are highly correlated.</p>
</blockquote>

<h3 id="replay-buffer-replay-samples-many-times">Replay buffer (replay samples many times)</h3>

<p>Q-learning with a replay buffer:</p>

<ol>
  <li>collect dataset ${(s_i,a_i,s_i',r_i)}$ using some policy, <strong>add</strong> it to $\mathcal{B}$</li>
  <li>sample a batch $(s_i,a_i,s_i',r_i)$ from $\mathcal{B} $</li>
  <li>$\phi \gets\phi-\alpha\sum_i\frac{d Q_\phi}{d \phi} (s_i,a_i) \left(Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a_i'}Q_\phi(s_i',a_i')]\right)$; repeat steps 2 and 3 $K$ times</li>
</ol>
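<p>A minimal ring-buffer implementation of $\mathcal{B}$, as a sketch (the capacity and field layout are illustrative):</p>

<pre><code class="language-python">import numpy as np

class ReplayBuffer:
    """Fixed-capacity buffer; old transitions are overwritten once full."""

    def __init__(self, capacity=100_000):
        self.storage, self.capacity, self.pos = [], capacity, 0

    def add(self, s, a, r, s2):
        if len(self.storage) &lt; self.capacity:
            self.storage.append((s, a, r, s2))
        else:
            self.storage[self.pos] = (s, a, r, s2)
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, rng=np.random.default_rng()):
        idx = rng.integers(len(self.storage), size=batch_size)
        s, a, r, s2 = zip(*(self.storage[i] for i in idx))
        return (np.array(s), np.array(a), np.array(r), np.array(s2))
</code></pre>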

<h3 id="target-network">Target network</h3>

<h4 id="dqn-target-networkreplay-buffer">DQN (Target network+Replay buffer)</h4>

<ol>
  <li>save target network parameters: $\phi' \gets \phi$</li>
  <li>collect dataset ${(s_i,a_i,s_i',r_i)}$ using some policy, <strong>add</strong> it to $\mathcal{B}$ ; do this N times</li>
  <li>sample a batch $(s_i,a_i,s_i',r_i)$ from $\mathcal{B} $</li>
  <li>$\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a_i'}Q_{\phi'}(s_i',a_i')]\parallel ^2$ ; repeat steps 3 and 4 $K$ times, then return to step 1</li>
</ol>

<h4 id="alternative-target-network">Alternative target network</h4>

<p>Polyak averaging: soft update to avoid sudden target network update:</p>

<p>update $\phi'$: $\phi' \gets \tau \phi' + (1-\tau)\phi$ e.g. $\tau =0.999$</p>

<h3 id="double-q-learning">Double Q-learning</h3>

<h4 id="are-the-q-values-accurate">Are the Q-values accurate?</h4>

<p>It’s often much <strong>larger</strong> than the true value: the <strong>maximum</strong> over <strong>noisy</strong> Q estimates is biased upward (for noisy estimates, $E[\max(X_1,X_2)]\ge\max(E[X_1],E[X_2])$), so the Q-function overestimates.</p>

<p>Target value $y_j=r_j +\gamma \max_{a_j'}Q_{\phi'}(s_j',a_j')$</p>

\[\max_{a'}Q_{\phi'}(s',a') = Q_{\phi'}(s',\arg \max_{a'}Q_{\phi'}(s',a'))\]

<p>the value <em>also</em> comes from $Q_{\phi'}$, with the action selected according to $Q_{\phi'}$</p>

<p>How to address this?</p>

<h4 id="double-q-learning-1">Double Q-learning</h4>

<p>idea: don’t use the same network to choose the action and evaluate value! (<strong>de-correlate</strong> the noise)</p>

<p>use two networks:</p>

\[Q_{\phi_A}\gets r +\gamma Q_{\phi_B}(s',\arg \max_{a'}Q_{\phi_A}(s',a')) \\
Q_{\phi_B}\gets r +\gamma Q_{\phi_A}(s',\arg \max_{a'}Q_{\phi_B}(s',a'))\]

<p>each network’s target value comes from the <strong>other</strong> network!</p>

<h4 id="double-q-learning-in-practice">Double Q-learning in practice</h4>

<p>Just use the current and target networks as $\phi_A$ and $\phi_B$: the current network chooses the action, and the target network evaluates its Q value.</p>

<p>standard Q-learning: $y=r+\gamma Q_{\phi'}(s', \arg \max_{a'}Q_{\phi'}(s',a'))$</p>

<p>double Q-learning: $y=r+\gamma Q_{\phi'}(s', \arg \max_{a'}Q_{\phi}(s',a'))$</p>
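<p>A small numpy illustration of the two targets, with toy Q-tables standing in for the current and target networks (all numbers are made up):</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
Q_cur  = rng.normal(size=(5, 3))   # current network, as a (state, action) table
Q_targ = rng.normal(size=(5, 3))   # target network
gamma, r = 0.99, 1.0
s2 = np.array([0, 2, 4])           # batch of next states

# standard: the target network both selects and evaluates the action
y_std = r + gamma * Q_targ[s2].max(axis=1)

# double: the current network selects, the target network evaluates
a_sel = Q_cur[s2].argmax(axis=1)
y_dbl = r + gamma * Q_targ[s2, a_sel]
</code></pre>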

<h3 id="multi-step-returns">Multi-step returns</h3>

\[y_{j,t}=\sum_{t'=t}^{t+N-1}\gamma^{t'-t}r_{j,t'}+\gamma ^N \max_{a_{j,t+N}} Q_{\phi'}(s_{j,t+N},a_{j, t+N})\]

<p>In Q-learning, this is only actually correct when learning <strong>on-policy</strong>, because the summed rewards come from transitions generated by the behavior policy, which may differ from the current policy.</p>

<p>How to fix?</p>

<ul>
  <li>ignore the problem: often works well when N is small</li>
  <li>cut the trace: dynamically choose N so that only on-policy data is used; works well when the data is mostly on-policy and the action space is small</li>
  <li>importance sampling: see “Safe and efficient off-policy reinforcement learning” (Munos et al., 2016)</li>
</ul>

<h3 id="q-learning-with-continuous-actions">Q-learning with continuous actions</h3>

<p>How do we take the argmax in a continuous action space?</p>

<ol>
  <li>optimization</li>
</ol>

<ul>
  <li>gradient based optimization (e.g., SGD): a bit slow in the inner loop</li>
  <li>the action space is typically low-dimensional: what about stochastic optimization?</li>
</ul>

<p>a simple solution: sample actions from a discrete set</p>

<p>$\max_a Q(s,a)\approx \max\{Q(s,a_1),…,Q(s,a_N)\}$ with $a_1,…,a_N$ sampled from some distribution (e.g., uniform)</p>

<p>more accurate solutions:</p>

<ul>
  <li>cross-entropy method (CEM)</li>
  <li>simple iterative stochastic optimization</li>
  <li>CMA-ES</li>
</ul>

<ol start="2">
  <li>use function class that is easy to optimize</li>
</ol>

\[Q_{\phi}(s,a) = -\frac{1}{2}(a-\mu_\phi(s))^TP_{\phi}(s)(a-\mu_\phi(s))+V_\phi(s)\]

<p><strong>NAF</strong>: <strong>N</strong>ormalized <strong>A</strong>dvantage <strong>F</strong>unctions</p>

<p>Use a neural network to output $\mu,P,V$</p>

<p>Then</p>

\[\arg \max_aQ_\phi(s,a) =\mu_\phi(s)\; \; \max_aQ(s,a)=V_\phi(s)\]

<p><strong>but</strong> this loses some representational power</p>

<ol start="3">
  <li>learn an approximate maximizer</li>
</ol>

<p><strong>DDPG</strong></p>

\[\max_aQ_\phi(s,a)=Q_\phi(s,\arg\max_a Q_\phi(s,a))\]

<p>idea: train another network $\mu_\theta(s)$ such that $\mu_\theta(s)\approx \arg\max_aQ_\phi(s,a)$</p>

<p>how to train? solve $\theta \gets \arg \max_\theta Q_\phi(s,\mu_\theta(s))$</p>

\[\frac{dQ_\phi}{d\theta}=\frac{da}{d\theta}\frac{dQ_\phi}{da}\]

<p>DDPG:</p>

<ol>
  <li>take some action $a_i$ and observe $(s_i,a_i,s_i',r_i)$, add it to $\mathcal{B}$</li>
  <li>sample mini-batch ${s_j,a_j,s_j',r_j}$ from $\mathcal{B}$ uniformly</li>
  <li>compute $y_j=r_j+\gamma Q_{\phi'}(s_j',\mu_{\theta'}(s_j'))$ using target nets $Q_{\phi'}$ and $\mu_{\theta'}$</li>
  <li>$\phi \gets \phi - \alpha\sum_j\frac{dQ_\phi}{d\phi}(s_j,a_j)(Q_\phi(s_j,a_j)-y_j)$</li>
  <li>$\theta \gets \theta + \beta\sum_j\frac{d\mu}{d\theta}(s_j)\frac{dQ_\phi}{da}(s_j,\mu_\theta(s_j))$</li>
  <li>update $\phi’$ and $\theta’$ (e.g., Polyak averaging)</li>
</ol>
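<p>A compact PyTorch sketch of one DDPG update, implementing steps 3~6 on a synthetic mini-batch; the network sizes, learning rates, and batch contents are illustrative assumptions, not a reference implementation:</p>

<pre><code class="language-python">import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.999

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

q = mlp(obs_dim + act_dim, 1)                          # critic Q_phi
mu = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())   # actor mu_theta
q_t = mlp(obs_dim + act_dim, 1)                        # target critic
mu_t = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh()) # target actor
q_t.load_state_dict(q.state_dict())
mu_t.load_state_dict(mu.state_dict())
opt_q = torch.optim.Adam(q.parameters(), lr=1e-3)
opt_mu = torch.optim.Adam(mu.parameters(), lr=1e-4)

# a synthetic mini-batch standing in for samples from the replay buffer
s = torch.randn(32, obs_dim)
a = torch.rand(32, act_dim) * 2 - 1
r = torch.randn(32, 1)
s2 = torch.randn(32, obs_dim)

# step 3: y_j = r_j + gamma * Q_phi'(s'_j, mu_theta'(s'_j))
with torch.no_grad():
    y = r + gamma * q_t(torch.cat([s2, mu_t(s2)], dim=1))
# step 4: critic regression toward y
q_loss = ((q(torch.cat([s, a], dim=1)) - y) ** 2).mean()
opt_q.zero_grad(); q_loss.backward(); opt_q.step()
# step 5: actor ascends Q(s, mu(s))
mu_loss = -q(torch.cat([s, mu(s)], dim=1)).mean()
opt_mu.zero_grad(); mu_loss.backward(); opt_mu.step()
# step 6: Polyak-average both target networks
with torch.no_grad():
    for p, p_t in zip(list(q.parameters()) + list(mu.parameters()),
                      list(q_t.parameters()) + list(mu_t.parameters())):
        p_t.mul_(tau).add_((1 - tau) * p)
</code></pre>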

<h3 id="tips-for-q-learning">Tips for Q-learning</h3>

<ul>
  <li>Bellman error gradients can be big; clip gradients or use Huber loss instead of squared error</li>
</ul>

\[L_\delta(x)=\begin{cases}x^2/2 , &amp;\text{if} \; \mid x \mid \le\delta 
 \cr \delta \mid x \mid -\delta^2/2, &amp;\text{otherwise}\end{cases}\]
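<p>The same loss as a one-function numpy sketch:</p>

<pre><code class="language-python">import numpy as np

def huber(x, delta=1.0):
    a = np.abs(x)
    return np.where(a &lt;= delta, 0.5 * x ** 2, delta * a - 0.5 * delta ** 2)
</code></pre>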

<ul>
  <li>
    <p>Double Q-learning helps <em>a lot</em> in practice, simple and no downsides</p>
  </li>
  <li>
    <p>N-step returns also help a lot, but have some downsides</p>
  </li>
  <li>
    <p>Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too</p>
  </li>
  <li>
    <p>Run multiple random seeds, it’s very inconsistent between runs</p>
  </li>
</ul>

<h2 id="6-advanced-policy-gradients">6. Advanced Policy Gradients</h2>

<h3 id="basics-1">Basics</h3>

<h4 id="recap">Recap</h4>

<p>Recap: policy gradient</p>

<p><strong>REINFORCE</strong> algorithm</p>

<ol>
  <li>sample ${\tau^i}$ from $\pi_\theta(a_t \mid s_t)$ (run the policy)</li>
  <li>$\Delta_\theta J(\theta)\approx\sum_{i}\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t^i \mid s_t^i) \left(\sum_{t'=t}^T r(s_{t'},a_{t'})\right)\right)$</li>
  <li>$\theta \gets\theta+\alpha\Delta_\theta J(\theta)$</li>
</ol>

<p>Why does policy gradient work?</p>

<p>policy gradient as <strong>policy iteration</strong></p>

<p>$J(\theta)=E_{\tau \sim p_\theta(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]$</p>

\[\begin{align}
J(\theta')-J(\theta)&amp;=J(\theta')-E_{s_0 \sim p(s_0)}[V^{\pi_\theta}(s_0)]\\
&amp;=J(\theta')-E_{\tau \sim p_{\theta'}(\tau)}[V^{\pi_\theta}(s_0)]\\
&amp;=J(\theta')-E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tV^{\pi_\theta}(s_t)-\sum_{t=1}^\infty\gamma^tV^{\pi_\theta}(s_t)\right]\\
&amp;=J(\theta')+E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\
&amp;=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]+E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\
&amp;=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(r(s_t,a_t)+\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\
&amp;=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]
\end{align}\]

<p>so we proved that:</p>

\[J(\theta')-J(\theta)=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\]

<h4 id="the-goal-is-making-things-off-policy">The Goal is Making things off-policy</h4>

<p>But we <strong>want to sample</strong> from $\pi_\theta$ not $\pi_{\theta'}$, so we apply <strong>importance sampling</strong>:</p>

\[\begin{align}
E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]&amp;=\sum_{t=0}^{\infty}E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta'}(a_t \mid s_t)}\left[\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\
&amp;=\sum_{t=0}^{\infty}E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]
\end{align}\]

<p>but the states are <strong>still</strong> sampled from $p_{\theta'}(s_t)$; can we approximate this with $p_\theta(s_t)$, so that we can use $\hat{A}^\pi(s_t,a_t)$ to get an improved policy $\pi'$?</p>

<h3 id="bounding-the-objective-value">Bounding the objective value</h3>

<p>Here we can prove that:</p>

<p>$\pi_{\theta'}$ is close to $\pi_\theta$ if $ \mid \pi_{\theta'}(a_t \mid s_t)-\pi_\theta(a_t \mid s_t) \mid \le\epsilon$ for all $s_t$, and then</p>

<p>$ \mid p_{\theta'}(s_t)-p_\theta(s_t) \mid \le 2\epsilon t$</p>

<p>For the proof, refer to the lecture video or the <strong>TRPO</strong> paper.</p>

<p>It’s easy to prove that:</p>

\[\begin{align}
E_{p_{\theta'}}[f(s_t)]=\sum_{s_t}p_{\theta'}(s_t)f(s_t)&amp;\ge\sum_{s_t}p_\theta(s_t)f(s_t)- \sum_{s_t}\mid p_{\theta'}(s_t)-p_\theta(s_t) \mid \max_{s_t}f(s_t)\\
&amp;\ge\sum_{s_t}p_\theta(s_t)f(s_t)-2\epsilon t\max_{s_t}f(s_t)
\end{align}\]

<p>so</p>

\[\sum_t E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\
\ge\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\sum_t 2\epsilon t C\]

<p>where C is $O(Tr_{max})$ in the finite-horizon case, or $O(\frac{r_{max}}{1-\gamma})$ in the infinite-horizon discounted case</p>

<p>So, after all of this proving, what do we get?</p>

\[\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\
\text{such that}\:\: \mid \pi_{\theta'}(a_t \mid s_t)-\pi_\theta(a_t \mid s_t) \mid \le\epsilon\]

<p>For <strong>small enough</strong> $\epsilon$, this is <strong>guaranteed to improve</strong> $J(\theta')-J(\theta)$</p>

<p>A more convenient bound uses the KL divergence: $ \mid \pi_{\theta'}(a_t \mid s_t)-\pi_\theta(a_t \mid s_t) \mid \le \sqrt{\frac{1}{2}D_{KL}(\pi_{\theta'}(a_t \mid s_t) \mid \mid \pi_\theta(a_t \mid s_t))}$</p>

<p>$\Rightarrow D_{KL}(\pi_{\theta'}(a_t \mid s_t) \mid \mid \pi_\theta(a_t \mid s_t))$ bounds the state marginal difference, where</p>

\[D_{KL}(p_1(x) \mid \mid p_2(x))=E_{x\sim p_1(x)}\left[ \log \frac{p_1(x)}{p_2(x)}\right]\]

<p>Why use the $D_{KL}$ bound rather than $\epsilon$ directly?</p>

<blockquote>
  <p>KL divergence has some <strong>very convenient properties</strong> that make it much easier to approximate!</p>
</blockquote>

<p>So the optimization becomes:</p>

\[\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\\text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t \mid s_t) \mid \mid \pi_\theta(a_t \mid s_t))\le\epsilon\]

<h3 id="solving-the-constrained-optimization-problem">Solving the constrained optimization problem</h3>

<p>How do we enforce the <strong>constraint</strong>?</p>

<p>By using <strong>dual gradient descent</strong>, we set the objective function as</p>

\[\mathcal{L}(\theta',\lambda)=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\lambda(D_{KL}(\pi_{\theta'}(a_t \mid s_t) \mid \mid \pi_\theta(a_t \mid s_t))-\epsilon)\]

<ol>
  <li>Maximize $\mathcal{L}(\theta', \lambda)$ with respect to $\theta'$</li>
  <li>$\lambda \gets \lambda + \alpha(D_{KL}(\pi_{\theta'}(a_t \mid s_t) \mid \mid \pi_\theta(a_t \mid s_t))-\epsilon)$</li>
</ol>

<p>How <strong>else</strong> can we optimize the objective?</p>

<p>define:</p>

\[\begin{align}
\bar{A}(\theta')&amp;=\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\
\bar{A}(\theta)&amp;=\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]
\end{align}\]

<p>applying <strong>First-order Taylor expansion</strong> and optimize</p>

\[\theta' \gets \arg \max_{\theta'}\Delta_\theta\bar A(\theta)^T(\theta'-\theta)\\
\text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t \mid s_t)\ \mid \pi(a_t \mid s_t))\le\epsilon\]

<p>and</p>

\[\begin{align}
\Delta_{\theta'}\bar A(\theta')&amp;=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^t \Delta_{\theta'}\log{\pi_{\theta'}(a_t \mid s_t)}A^{\pi_\theta}(s_t,a_t)\right]\right]\\
\Delta_{\theta}\bar A(\theta)&amp;=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\gamma^t \Delta_{\theta}\log{\pi_{\theta}(a_t \mid s_t)}A^{\pi_\theta}(s_t,a_t)\right]\right]=\Delta_\theta J(\theta)
\end{align}\]

<p>so the optimization becomes</p>

\[\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\
\text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t \mid s_t)\ \mid \pi(a_t \mid s_t))\le\epsilon\]

<p>and gradient ascent does this:</p>

\[\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\
\text{such that}\:\: \mid \mid \theta-\theta' \mid \mid \le\epsilon\]

<p>by updating like $\theta'=\theta+\sqrt{\frac{\epsilon}{ \mid \mid \Delta_\theta J(\theta) \mid \mid ^2}}\Delta_\theta J(\theta)$; this is what gradient ascent (plain policy gradient) actually does.</p>

<p>But the gradient ascent constraint is not a good one, since some parameters change the action probabilities much more than others, and what we really want is for the probability <em>distributions</em> to stay close.</p>

<p>Applying a second-order Taylor expansion to $D_{KL}$:</p>

\[D_{KL}(\pi_{\theta'} \mid \mid \pi_\theta)\approx\frac{1}{2}(\theta'-\theta)^T\pmb{F}(\theta'-\theta)\]

<p>where $\pmb{F}$ is the <strong>Fisher information matrix</strong>, which can be estimated from samples:</p>

\[\pmb{F}=E_{\pi_\theta}[\Delta_{\theta}\log\pi_\theta(a \mid s)\Delta_\theta\log \pi_\theta(a \mid s)^T]\]

<p>And if we use the following update</p>

\[\theta'=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)\\
\alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}^{-1}\Delta_\theta J(\theta)}}\]

<p>then the constraint will be satisfied; this is called the <strong>natural gradient</strong>.</p>

<blockquote>
  <p><em>Figure reference: the KL trust region diagram lives in the <a href="../assets/pdf/UBC.pdf">PDF version of these notes</a>.</em></p>
</blockquote>
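<p>A toy numpy sketch of one natural gradient step for a softmax policy over a handful of actions (a stateless bandit, so $\pmb{F}$ is small enough to form explicitly); the rewards, damping, and $\epsilon$ are illustrative assumptions:</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n_actions, eps = 4, 0.01
theta = np.zeros(n_actions)
r = np.array([1.0, 0.5, 0.2, 0.0])          # toy per-action rewards

pi = np.exp(theta - theta.max()); pi /= pi.sum()
acts = rng.choice(n_actions, size=5000, p=pi)

glogp = np.eye(n_actions)[acts] - pi        # grad log pi(a) for a softmax
g = (glogp * r[acts][:, None]).mean(axis=0) # policy gradient estimate
F = glogp.T @ glogp / len(acts)             # Fisher estimate from samples
F += 1e-3 * np.eye(n_actions)               # damping: the softmax Fisher is singular
nat = np.linalg.solve(F, g)                 # F^{-1} grad J
alpha = np.sqrt(2 * eps / (g @ nat))        # step size satisfying the KL constraint
theta = theta + alpha * nat
</code></pre>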

<h3 id="practical-methods-and-notes">Practical methods and notes</h3>

<ul>
  <li>natural policy gradient 							$\theta'=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)$</li>
  <li>Generally a good choice to stabilize policy gradient training</li>
  <li>See this paper for details: Peters &amp; Schaal, “Reinforcement learning of motor skills with policy gradients”</li>
  <li>Practical implementation: requires efficient Fisher-vector products; a bit non-trivial to do without computing the full matrix</li>
  <li>See: Schulman et al., “Trust region policy optimization”</li>
  <li>Trust region policy optimization (<strong>TRPO</strong>): choose $\alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}^{-1}\Delta_\theta J(\theta)}}$</li>
  <li>Or just use the IS (importance sampling) objective directly (use $\bar{A}$ as the objective)</li>
  <li>Use regularization to stay close to the old policy</li>
  <li>See: proximal policy optimization (<strong>PPO</strong>)</li>
</ul>

<p>So TRPO and PPO are two practical methods for solving this constrained optimization problem in the neural network setting.</p>

<h2 id="7-optimal-control-and-planning">7. Optimal Control and Planning</h2>

<p>Recap: the reinforcement learning objective</p>

\[p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)\\
\theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\]

<p>In model-free RL, we do not know $p(s_{t+1} \mid s_t,a_t)$.</p>

<p>But actually, we sometimes do know the dynamics.</p>

<ul>
  <li>Often we do know the dynamics</li>
  <li>Often we can learn the dynamics</li>
</ul>

<p>If we know the dynamics, what can we do?</p>

<h3 id="model-based-reinforcement-learning">Model-based reinforcement learning</h3>

<ol>
  <li>
    <p>Model-based reinforcement learning: learn the transition dynamics, then figure out how to choose actions</p>
  </li>
  <li>
    <p>How can we make decisions if we know the dynamics?</p>
  </li>
</ol>

<p>a. How can we choose actions under perfect knowledge of the system dynamics?</p>

<p>b. Optimal control, trajectory optimization, planning</p>

<ol start="3">
  <li>
    <p>How can we learn the <em>unknown dynamics</em>?</p>
  </li>
  <li>
    <p>How can we then also learn policies? (e.g., by imitating optimal control)</p>
  </li>
</ol>

<h3 id="the-objective">The objective</h3>

\[\min_{a_1,...,a_T}\sum_{t=1}^Tc(s_t,a_t)\;\text{s.t.}\;s_t=f(s_{t-1},a_{t-1})\]

<h4 id="deterministic-case">Deterministic case</h4>

\[a_1,...,a_T=\arg\max_{a_1,...,a_T}\sum_{t=1}^Tr(s_t,a_t)\;\text{s.t.}\;s_t=f(s_{t-1},a_{t-1})\]

<h4 id="stochastic-open-loop-case">Stochastic open-loop case</h4>

\[p_\theta(s_1,...,s_T \mid a_1,...,a_T)=p(s_1)\prod_{t=1}^Tp(s_{t+1} \mid s_t,a_t)\\
a_1,...,a_T=\arg\max_{a_1,...,a_T}E\left[\sum_{t=1}^Tr(s_t,a_t) \mid a_1,...,a_T\right]\]

<p><strong>open-loop</strong>: commit to $a_1,\dots,a_T$ all at once, not step by step
<strong>closed-loop</strong>: at every step the agent gets feedback from the environment</p>

<h4 id="stochastic-closed-loop-case">Stochastic closed-loop case</h4>

\[p_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)\\
\pi=\underset{\pi}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\]

<h3 id="stochastic-optimization">Stochastic optimization</h3>

<p>optimal control/planning:</p>

\[a_1,...,a_t=\arg\max_{a_1,...,a_t}J(a_1,...,a_t)\\
A=\arg\max_AJ(A)\]

<h4 id="cross-entropy-method-cem">Cross-entropy method (CEM)</h4>

<p>Here $A$ is $a_1,…,a_t$</p>

<ol>
  <li>sample $A_1,…,A_n$ from $p(A)$</li>
  <li>evaluate $J(A_1),…,J(A_n)$</li>
  <li>pick M <em>elites</em> $A_{i_1},…,A_{i_M}$ with the highest value, where $M&lt;N$</li>
  <li>refit $p(A)$ to the elites $A_{i_1},…,A_{i_M}$</li>
</ol>
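<p>A numpy sketch of this loop on a toy objective (the target sequence and hyperparameters are made up for illustration):</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
H = 5                                  # planning horizon
target = np.linspace(-1.0, 1.0, H)     # toy optimum
J = lambda A: -np.sum((A - target) ** 2, axis=-1)

mu, sigma = np.zeros(H), np.ones(H)
N, M = 100, 10                         # samples and elites per iteration
for _ in range(20):
    A = rng.normal(mu, sigma, size=(N, H))             # 1. sample A_1..A_N from p(A)
    elites = A[np.argsort(J(A))[-M:]]                  # 2-3. evaluate, keep top M
    mu, sigma = elites.mean(0), elites.std(0) + 1e-6   # 4. refit p(A)
# mu is now close to `target`
</code></pre>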

<h4 id="monte-carlo-tree-search-mcts">Monte Carlo Tree Search (MCTS)</h4>

<p>Generic MCTS sketch</p>

<ol>
  <li>find a leaf $s_l$ using TreePolicy($s_1$)</li>
  <li>evaluate the leaf using DefaultPolicy($s_l$)</li>
  <li>update all values in the tree between $s_1$ and $s_l$</li>
</ol>

<p>take best action from $s_1$ and repeat</p>

<p>every node stores Q and N, Q is the estimated value and N is the visited number</p>

<p><strong>UCT</strong> TreePolicy($s_t$)</p>

<p>if $s_t$ is not fully expanded, choose a new $a_t$</p>

<p>else choose the child with the best Score($s_{t+1}$)</p>

\[Score(s_t) = \frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2\ln N(s_{t-1})}{N(s_t)}}\]

<p>For more about MCTS, see Browne et al., “A Survey of Monte Carlo Tree Search Methods” (2012)</p>
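<p>For concreteness, the score as a function (matching the formula above; here $Q$ is the node’s accumulated value and $N$ its visit count):</p>

<pre><code class="language-python">import numpy as np

def uct_score(Q, N, N_parent, C=1.0):
    # exploitation (average value) plus exploration bonus
    return Q / N + 2 * C * np.sqrt(2 * np.log(N_parent) / N)
</code></pre>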

<h3 id="optimal-control">Optimal control</h3>

<p>Here we show the optimization process when we know the environment dynamics. This is essentially material from control theory.</p>

<p><strong>Deterministic</strong> case</p>

\[\min_{u_1,...,u_T}\sum_{t=1}^Tc(x_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\
\min_{u_1,...,u_T}c(x_1,u_1)+c(f(x_1,u_1),u_2)+...+c(f(f(...)...),u_T)\]

<h4 id="shooting-methods-vs-collocation">Shooting methods vs collocation</h4>

<p>the CEM procedure above is actually a random shooting method.</p>

<p>collocation method: optimize over actions and states, with constraints.</p>

\[\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(x_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\]

<h4 id="linear-case-lqr">Linear case: LQR</h4>

\[\min_{u_1,...,u_T}c(x_1,u_1)+c(f(x_1,u_1),u_2)+...+c(f(f(...)...),u_T)\]

<p>Linear case: the dynamics $f$ are a <strong>linear</strong> function and the cost is a <strong>quadratic</strong> function</p>

\[f(x_t,u_t)=F_t\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}+f_t\\
 c(x_t,u_t)=\frac{1}{2}\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}^TC_t\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}+\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}^Tc_t\]

<p>Where</p>

\[C_T=\begin{bmatrix} 
C_{x_T,x_T} &amp; C_{x_T,u_T} \\
C_{u_T,x_T} &amp; C_{u_T,u_T} 
\end{bmatrix}\\
c_T=\begin{bmatrix} 
c_{x_T}\\
c_{u_T}
\end{bmatrix}\]

<p>Base case: solve for $u_T$ only</p>

\[\begin{align}
Q(x_T,u_T)&amp;= \text{const}+\frac{1}{2}\begin{bmatrix}
 x_{T} \\
 u_{T} \\
 \end{bmatrix}^TC_T\begin{bmatrix}
 x_{T} \\
 u_{T} \\
 \end{bmatrix}+\begin{bmatrix}
 x_{T} \\
 u_{T} \\
 \end{bmatrix}^Tc_T\\
 \Delta_{u_T}Q(x_T,u_T)&amp;=C_{u_T,x_T}x_T+C_{u_T,u_T}u_T+c_{u_T}^T=0\\
 u_T&amp;=-C_{u_T,u_T}^{-1}(C_{u_T,x_T}x_T+c_{u_T})\\
 u_T&amp;=K_Tx_T+k_T\\
 K_T&amp;=-C_{u_T,u_T}^{-1}C_{u_T,x_T}\\
 k_T&amp;=-C_{u_T,u_T}^{-1}c_{u_T}
\end{align}\]

<p>We substitute $u_T=K_Tx_T+k_T$ to eliminate $u_T$</p>

\[\begin{align}
V(x_T)&amp;= \text{const}+\frac{1}{2}\begin{bmatrix}
 x_{T} \\
 K_Tx_T+k_T \\
 \end{bmatrix}^TC_T\begin{bmatrix}
 x_{T} \\
 K_Tx_T+k_T \\
 \end{bmatrix}+\begin{bmatrix}
 x_{T} \\
 K_Tx_T+k_T \\
 \end{bmatrix}^Tc_T\\
 V(x_T)&amp;=\text{const}+\frac{1}{2}x_T^TV_Tx_T+x_T^Tv_T
\end{align}\]

<p>Then solve for $u_{T-1}$ in terms of $x_{T-1}$</p>

\[\begin{align}
f(x_{T-1},u_{T-1})&amp;=x_T=F_{T-1}\begin{bmatrix}
 x_{T-1} \\
 u_{T-1} \\
 \end{bmatrix}+f_{T-1}\\
Q(x_{T-1},u_{T-1})&amp;=\frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^TC_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}+\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^Tc_{T-1}+V(f(x_{T-1},u_{T-1}))\\
V(f(x_{T-1},u_{T-1}))&amp;=\text{const}+\frac{1}{2}x_T^TV_Tx_T+x_T^Tv_T\\
&amp;\text{and then replace $x_T$ with the dynamics $f$}
\end{align}\]

<p>and then do the same thing as in the $T$ case, which yields analogous results.</p>

<h5 id="backward-recursion">backward recursion</h5>

<p>for $t=T$ to 1:</p>

\[\begin{align}
Q_t&amp;=C_t+F_t^TV_{t+1}F_t\\
q_t&amp;=c_t+F_t^TV_{t+1}f_t+F_t^Tv_{t+1}\\
Q(x_t,u_t)&amp;=\text{const}+\frac{1}{2}\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}^TQ_t\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}+\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}^Tq_t\\
u_t &amp;\gets \arg\min_{u_t}Q(x_t,u_t)=K_tx_t+k_t\\
K_t&amp;=-Q_{u_t,u_t}^{-1}Q_{u_t,x_t}\\
k_t&amp;=-Q_{u_t,u_t}^{-1}q_{u_t}\\
V_t&amp;=Q_{x_t,x_t}+Q_{x_t,u_t}K_t+K_t^TQ_{u_t,x_t}+K_t^TQ_{u_t,u_t}K_t\\
v_t&amp;=q_{x_t}+Q_{x_t,u_t}k_t+K_t^Tq_{u_t}+K_t^TQ_{u_t,u_t}k_t\\
V(x_t)&amp;=\text{const}+\frac{1}{2}x_t^T V_tx_t+x_t^Tv_t\\
V(x_t)&amp;=\min_{u_t} Q(x_t,u_t)
\end{align}\]

<h5 id="forward-recursion">forward recursion</h5>

<p>For $t=1$ to $T$:</p>

\[u_t=K_tx_t+k_t\\
x_{t+1}=f(x_t,u_t)\]
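<p>A numpy sketch of the backward and forward passes for a toy time-invariant linear-quadratic problem, following the recursion above; the dimensions, dynamics, and cost matrices are arbitrary stand-ins:</p>

<pre><code class="language-python">import numpy as np

nx, nu, T = 2, 1, 20
F = np.hstack([np.eye(nx) + 0.1 * np.eye(nx, k=1),   # dynamics x' = F [x; u] + f
               0.1 * np.ones((nx, nu))])
f = np.zeros(nx)
C = np.eye(nx + nu); C[nx:, nx:] *= 0.1              # quadratic cost weights
c = np.zeros(nx + nu)

# backward recursion: for t = T down to 1
V, v, Ks, ks = np.zeros((nx, nx)), np.zeros(nx), [], []
for t in reversed(range(T)):
    Q = C + F.T @ V @ F
    q = c + F.T @ V @ f + F.T @ v
    Qxx, Qxu = Q[:nx, :nx], Q[:nx, nx:]
    Qux, Quu = Q[nx:, :nx], Q[nx:, nx:]
    qx, qu = q[:nx], q[nx:]
    K = -np.linalg.solve(Quu, Qux)
    k = -np.linalg.solve(Quu, qu)
    V = Qxx + Qxu @ K + K.T @ Qux + K.T @ Quu @ K
    v = qx + Qxu @ k + K.T @ qu + K.T @ Quu @ k
    Ks.append(K); ks.append(k)
Ks, ks = Ks[::-1], ks[::-1]

# forward recursion: roll out u_t = K_t x_t + k_t
x = np.array([1.0, -0.5])
for t in range(T):
    u = Ks[t] @ x + ks[t]
    x = F @ np.concatenate([x, u]) + f
</code></pre>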

<h5 id="stochastic-dynamics">Stochastic dynamics</h5>

<p>If the transition probability is Gaussian with a linear mean and fixed covariance, then the same algorithm can be applied, thanks to the symmetry of the Gaussian.</p>

\[f(x_t,u_t)=F_t\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}+f_t\\
x_{t+1}\sim p(x_{t+1} \mid x_t,u_t)\\
 p(x_{t+1} \mid x_t,u_t)=\mathcal{N}\left(F_t\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}+f_t, \Sigma_t\right)\]

<h4 id="nonlinear-case-ddpiterative-lqr">Nonlinear case: DDP/iterative LQR</h4>

<p>approximate a nonlinear system as a linear-quadratic system using <strong>Taylor expansion</strong></p>

\[f(x_t,u_t)\approx f(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}f(\hat{x}_t,\hat{u}_t)\begin{bmatrix}
 x_{t}-\hat{x}_t \\
 u_{t}-\hat{u}_t \\
 \end{bmatrix}\\
c(x_t,u_t)\approx c(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix}
 x_{t}-\hat{x}_t \\
 u_{t}-\hat{u}_t \\
 \end{bmatrix}+\frac{1}{2}\begin{bmatrix}
 x_{t}-\hat{x}_t \\
 u_{t}-\hat{u}_t \\
 \end{bmatrix}^T\Delta^2_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix}
 x_{t}-\hat{x}_t \\
 u_{t}-\hat{u}_t \\
 \end{bmatrix}\]

\[\bar{f}(\delta x_t,\delta u_t)=F_t\begin{bmatrix}
 \delta x_t \\
 \delta u_t \\
 \end{bmatrix}\\
\bar{c}=\frac{1}{2}\begin{bmatrix}
 \delta x_{t} \\
 \delta u_{t} \\
 \end{bmatrix}^TC_t\begin{bmatrix}
 \delta x_{t} \\
 \delta u_{t} \\
 \end{bmatrix}+\begin{bmatrix}
 \delta x_{t} \\
 \delta u_{t} \\
 \end{bmatrix}^Tc_t\\
 \delta x_t= x_t-\hat{x}_t\\
 \delta u_t= u_t-\hat{u}_t\]

<p>In fact, this is just Newton’s method for trajectory optimization.</p>

<p>For more on Newton’s method for trajectory optimization, see the following papers:</p>

<ol>
  <li>Differential dynamic programming (1970)</li>
  <li>Synthesis and stabilization of complex behaviors through online trajectory optimization (2012)
    <ul>
      <li>practical guide for implementing non-linear iterative LQR.</li>
    </ul>
  </li>
  <li>Learning neural network policies with guided policy search under unknown dynamics (2014)
    <ul>
      <li>Probabilistic formulation and trust region alternative to deterministic line search.</li>
    </ul>
  </li>
</ol>

<h2 id="8-model-based-reinforcement-learning-learning-the-model">8. Model-Based Reinforcement Learning (learning the model)</h2>

<h3 id="basic">Basic</h3>

<p>Why learn the model?</p>

<blockquote>
  <p>If we knew $f(s_t,a_t)=s_{t+1}$, we could use the tools from the previous lecture.</p>

  <p>(or $p(s_{t+1} \mid s_t,a_t)$ in stochastic case)</p>
</blockquote>

<p>model-based reinforcement learning <strong>version 0.5</strong>:</p>

<ol>
  <li>run base policy $\pi_0(a_t \mid s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s')_i}$</li>
  <li>learn a dynamics model $f(s,a)$ to minimize $\sum_i \mid \mid f(s_i,a_i)-s_i' \mid \mid ^2$</li>
  <li>plan through $f(s,a)$ to choose actions</li>
</ol>

<p>Does it work?</p>

<ul>
  <li>This is how <strong>system identification</strong> works in classical robotics</li>
  <li>Some care should be taken to design a good base policy</li>
  <li>Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters</li>
  <li>The model is only fit to data from the base policy, but the final policy visits states beyond that distribution, which causes the <strong>distribution mismatch problem</strong>.</li>
</ul>

<h3 id="over-fitting-problem">Over-fitting problem</h3>

<h4 id="distribution-mismatch-problem">Distribution mismatch problem</h4>

<p>Can we do better?</p>

<p>can we make $p_{\pi_0}(s_t)=p_{\pi_f}(s_t)$?</p>

<p>model-based reinforcement learning <strong>version 1.0:</strong></p>

<ol>
  <li>run base policy $\pi_0(a_t \mid s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s')_i}$</li>
  <li>learn a dynamics model $f(s,a)$ to minimize $\sum_i \mid \mid f(s_i,a_i)-s_i' \mid \mid ^2$</li>
  <li>plan through $f(s,a)$ to choose actions</li>
  <li>execute those actions and add the resulting data ${(s,a,s')_j}$ to $\mathcal{D}$; repeat steps 2~4</li>
</ol>

<p>But the model has errors, so the plan may include some bad actions. How do we address that?</p>

<h4 id="mpc">MPC</h4>

<p>model-based reinforcement learning <strong>version 1.5</strong>:</p>

<ol>
  <li>run base policy $\pi_0(a_t \mid s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s')_i}$</li>
  <li>learn a dynamics model $f(s,a)$ to minimize $\sum_i \mid \mid f(s_i,a_i)-s_i' \mid \mid ^2$</li>
  <li>plan through $f(s,a)$ to choose actions</li>
  <li>execute the <strong>first</strong> planned action, observe resulting state $s'$ (<strong>MPC</strong>)</li>
  <li>append $(s,a,s')$ to dataset $\mathcal{D}$; repeat steps 3~5, and every N steps repeat steps 2~5</li>
</ol>
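<p>A toy end-to-end sketch of version 1.5 on a scalar system, with a linear model fit by least squares and random shooting as the planner; the dynamics, cost, and all constants are invented for illustration:</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
f_true = lambda s, a: 0.9 * s + 0.5 * a          # unknown "real" dynamics

# 1. base (random) policy collects D = {(s, a, s')}
S = rng.normal(size=200)
A = rng.uniform(-1.0, 1.0, size=200)
S2 = f_true(S, A)

s = 2.0
for t in range(50):
    # 2. fit a linear model f(s, a) = theta . [s, a] by least squares
    theta, *_ = np.linalg.lstsq(np.stack([S, A], axis=1), S2, rcond=None)
    # 3. plan: random shooting over H-step action sequences through the model
    H, n_seq = 5, 256
    seqs = rng.uniform(-1.0, 1.0, size=(n_seq, H))
    sim, cost = np.full(n_seq, s), np.zeros(n_seq)
    for h in range(H):
        sim = theta[0] * sim + theta[1] * seqs[:, h]
        cost += np.abs(sim)                       # cost: distance from 0
    a = seqs[np.argmin(cost), 0]                  # 4. execute only the first action
    s2 = f_true(s, a)
    S, A, S2 = np.append(S, s), np.append(A, a), np.append(S2, s2)  # 5. append
    s = s2
</code></pre>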

<h4 id="using-model-uncertainty">Using model uncertainty</h4>

<p>Can we do better by using model <strong>uncertainty</strong>?</p>

<p>How to get uncertainty?</p>

<ol>
  <li>use the output entropy (a bad idea: it measures noise in the prediction, not uncertainty about the model)</li>
  <li>estimate model uncertainty</li>
</ol>

\[\int p(s_{t+1} \mid s_t,a_t,\theta)p(\theta \mid \mathcal{D})d\theta\]

<ul>
  <li>one way to get this is by Bayesian neural networks (BNN) (introduce later)</li>
  <li>another way is to train multiple models and see if they agree with each other (<strong>bootstrap ensembles</strong>)</li>
</ul>

\[p(\theta \mid \mathcal{D})\approx\frac{1}{N}\sum_i\delta(\theta_i)\\
\int p(s_{t+1} \mid s_t,a_t,\theta)p(\theta \mid \mathcal{D})d\theta\approx\frac{1}{N}\sum_ip(s_{t+1} \mid s_t,a_t,\theta_i)\]

<p>How to train?</p>

<blockquote>
  <p>main idea: need to generate “independent” datasets to get “independent” models.</p>

  <p>can do this by re-sampling from the dataset with replacement, which gives datasets drawn from the same distribution but with different compositions</p>
</blockquote>

<p>Does this work?</p>

<blockquote>
  <p>This basically works</p>

  <p>Very crude approximation, because the number of models is usually small (&lt;10)</p>

  <p>Re-sampling with replacement is usually unnecessary, because SGD and random initialization usually makes the models sufficiently independent</p>
</blockquote>

<p>For candidate action sequence $a_1,…,a_H$:</p>

<ol>
  <li>sample $\theta\sim p(\theta \mid \mathcal{D})$</li>
  <li>at each time step $t$, sample $s_{t+1}\sim p(s_{t+1} \mid s_t,a_t,\theta)$</li>
  <li>calculate $R=\sum_tr(s_t,a_t)$</li>
  <li>repeat steps 1 to 3 and accumulate the average reward</li>
</ol>
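<p>A sketch of this evaluation with a toy bootstrap ensemble, where simple closures and a hand-written reward stand in for learned networks (all names and numbers are illustrative):</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
models = [lambda s, a, w=w: 0.9 * s + w * a for w in (0.4, 0.5, 0.6)]
reward = lambda s, a: -abs(s)

def score(action_seq, s0, n_eval=10):
    total = 0.0
    for _ in range(n_eval):
        f = models[rng.integers(len(models))]   # 1. sample theta ~ p(theta | D)
        s, ret = s0, 0.0
        for a in action_seq:
            s = f(s, a)                         # 2. roll the sampled model forward
            ret += reward(s, a)                 # 3. accumulate reward
        total += ret
    return total / n_eval                       # 4. average over repetitions
</code></pre>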

<h3 id="model-based-rl-with-images-pomdp">Model-based RL with images (POMDP)</h3>

<h4 id="model-based-rl-with-latent-space-models">Model-based RL with latent space models</h4>

<p>What about <strong>complex observations</strong>?</p>

<ul>
  <li>High dimensionality</li>
  <li>Redundancy</li>
  <li>Partial observability</li>
</ul>

\[\max_\phi\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(s_{t+1,i} \mid s_{t,i},a_{t,i})+\log p_\phi(o_{t,i} \mid s_{t,i})]\]

<p>learn <em>approximate</em> posterior $q_\psi(s_t \mid o_{1:t},a_{1:t})$</p>

<p>other choices:</p>

<ul>
  <li>$q_\psi(s_t,s_{t+1} \mid o_{1:t},a_{1:t})$</li>
  <li>$q_\psi(s_t \mid o_t)$</li>
</ul>

<p>here we only estimate $q_\psi(s_t \mid o_t)$</p>

<p>assume that $q_\psi(s_t \mid o_t)$ is <em>deterministic</em></p>

<p>stochastic case requires variational inference (later)</p>

<p><strong>Deterministic encoder</strong></p>

\[q_\psi(s_t \mid o_t)=\delta(s_t=g_\psi(o_t))\Rightarrow s_t=g_\psi(o_t)\]

<p>and the reward model may also need to be learned.</p>

\[\max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i}) \mid g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i} \mid g_\psi (o_{t,i}))]\\
\max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i}) \mid g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i} \mid g_\psi (o_{t,i}))+\log p_\phi(r_{t,i} \mid g_\psi(o_{t,i}))]\]

<p>Model-based RL with latent space models</p>

<ol>
  <li>run base policy $\pi_0(a_t \mid o_t)$ (e.g., random policy) to collect $\mathcal{D}={(o,a,o')_i}$</li>
  <li>learn $p_\phi(s_{t+1} \mid s_t,a_t), p_\phi(r_t \mid s_t), p_\phi(o_t \mid s_t), g_\psi(o_t)$</li>
  <li>plan through the model to choose actions</li>
  <li>execute the <strong>first</strong> planned action, observe the resulting observation $o'$ (<strong>MPC</strong>)</li>
  <li>append $(o,a,o')$ to dataset $\mathcal{D}$; repeat steps 3~5, and every N steps repeat steps 2~5</li>
</ol>

<h4 id="learn-directly-in-observation-space">Learn directly in observation space</h4>

<p>directly learn $p(o_{t+1} \mid o_t,a_t)$</p>

<p>do image prediction</p>

<p>learn reward or set the goal observation</p>

<h2 id="9-model-based-rl-and-policy-learning">9. Model-Based RL and Policy Learning</h2>

<h3 id="basic-1">Basic</h3>

<p>What if we want a policy rather than just optimal control?</p>

<ul>
  <li>Do not need to re-plan (faster)</li>
  <li>Potentially better generalization</li>
  <li>Closed loop control</li>
</ul>

<p>Back-propagate directly into the policy</p>

<p>model-based reinforcement learning <strong>version 2.0</strong>:</p>

<ol>
  <li>run base policy $\pi_0(a_t \mid s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s')_i}$</li>
  <li>learn a dynamics model $f(s,a)$ to minimize $\sum_i \mid \mid f(s_i,a_i)-s_i' \mid \mid ^2$</li>
  <li>back-propagate through $f(s,a)$ into the policy to optimize $\pi_\theta(a_t \mid s_t)$</li>
  <li>run $\pi_\theta (a_t \mid s_t)$, appending the visited tuples $(s,a,s')$ to $\mathcal{D}$; repeat steps 2~4</li>
</ol>

<p>What’s the <strong>problem</strong>?</p>

<ul>
  <li>similar parameter sensitivity problems as shooting methods</li>
  <li>But we no longer have a convenient second-order LQR-like method, because the policy parameters <strong>couple</strong> all the time steps, so no dynamic programming</li>
  <li>Similar problem to training long RNNs with BPTT</li>
  <li>Vanishing and exploding gradients</li>
  <li>Unlike LSTM, we can’t just “choose” a simple dynamics, dynamics are chosen by nature</li>
</ul>

<h3 id="guided-policy-search">Guided policy search</h3>

\[\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(x_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\]

\[\min_{u_1,...,u_T,x_1,...,x_T,\theta}\sum_{t=1}^Tc(x_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1}), u_t=\pi_\theta(x_t)\\
\min_{u_1,...,u_T,x_1,...,x_T,\theta}\sum_{t=1}^Tc(x_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\
\:\: \text{s.t.}\:\:u_t=\pi_\theta(x_t)\]

<p>How do we deal with the constraint?</p>

<h4 id="dual-gradient-decent-dgd">Dual gradient decent (DGD)</h4>

\[\min_xf(x)\:\:\text{s.t.}\:C(x)=0\:\:\:\:\:\: \mathcal{L}(x,\lambda)=f(x)+\lambda C(x)\\
g(\lambda)=\mathcal{L}(x^*(\lambda),\lambda)\\
x^*=\arg\min_x\mathcal{L}(x,\lambda)\\
\frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)\]

<ol>
  <li>Find $x^*\gets \arg\min_x\mathcal{L}(x,\lambda)$</li>
  <li>Compute $\frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)$</li>
  <li>$\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$</li>
</ol>
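<p>A tiny sketch of the loop on the toy problem $\min_x x^2$ s.t. $x-1=0$, where step 1 has the closed form $x^*=-\lambda/2$ (the problem and constants are invented for illustration):</p>

<pre><code class="language-python"># Dual gradient descent on: min_x x^2  s.t.  C(x) = x - 1 = 0
# Lagrangian: L(x, lam) = x^2 + lam * (x - 1)
lam, alpha = 0.0, 0.5
for _ in range(100):
    x_star = -lam / 2.0          # 1. x* = argmin_x L(x, lam)  (closed form here)
    dg = x_star - 1.0            # 2. dg/dlam = C(x*)
    lam = lam + alpha * dg       # 3. gradient step on the dual variable
print(x_star, lam)               # converges to x* = 1, lam = -2
</code></pre>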

<p>A small tweak to DGD: augmented Lagrangian</p>

\[\bar{\mathcal{L}}(x,\lambda)=f(x)+\lambda C(x)+\rho \mid \mid C(x) \mid \mid ^2\]

<ol>
  <li>Find $x^*\gets \arg\min_x\bar{\mathcal{L}}(x,\lambda)$</li>
  <li>Compute $\frac{dg}{d\lambda}=\frac{d\bar{\mathcal{L}}}{d\lambda}(x^*,\lambda)$</li>
  <li>$\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$</li>
</ol>

<p>When far from solution, quadratic term tends to improve stability</p>

<p>Constraining trajectory optimization with dual gradient descent
\(\min_{\tau,\theta}c(\tau)\:\:\text{s.t.}\:\:u_t=\pi_\theta(x_t)\\
\bar{\mathcal{L}}(\tau,\theta,\lambda)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2\)</p>

<h4 id="guided-policy-search-gps-discussion">Guided policy search (GPS) discussion</h4>

<ol>
  <li>Find $\tau \gets \arg\min_\tau \bar{\mathcal{L}}(\tau,\theta,\lambda)$ (e.g. via iLQR or other planning methods)</li>
  <li>Find $\theta \gets \arg\min_\theta\bar{\mathcal{L}}(\tau, \theta, \lambda)$ (e.g. via SGD)</li>
  <li>$\lambda \gets \lambda+\alpha \frac{dg}{d\lambda}$ and repeat</li>
</ol>

<ul>
  <li>Can be interpreted as constrained trajectory optimization method</li>
  <li>Can be interpreted as imitation of optimal control expert, since step 2 is just supervised learning</li>
  <li>The optimal control “teacher” adapts to the learner, and avoids actions that the learner can’t mimic</li>
</ul>

<p>General guided policy search scheme</p>

<ol>
  <li>Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$</li>
  <li>Optimize $\theta$ with respect to some supervised objective</li>
  <li>Increment or modify dual variables $\lambda$</li>
</ol>

<p>Need to choose:</p>

<ul>
  <li>form of $p(\tau)$ or $\tau$ (if deterministic)</li>
  <li>optimization method for $p(\tau)$ or $\tau$</li>
  <li>surrogate $\tilde{c}(x_t,u_t)$</li>
  <li>supervised objective for $\pi_\theta(u_t \mid x_t)$</li>
</ul>

<h5 id="deterministic-case-1">Deterministic case</h5>

\[\min_{\tau,\theta}c(\tau)\:\:\:\text{s.t.}\:\:\:u_t=\pi_\theta(x_t)\\
\bar{\mathcal{L}}(\tau,\theta,\lambda)=\tilde{c}(\tau)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2\]

\[\tilde{c}_{k+1,i}(x_t,u_t)=c(x_t,u_t)+\lambda_{k+1,i}\log \pi_\theta (u_t \mid x_t)\]

<ol>
  <li>Optimize $\tau$ with respect to some surrogate $\tilde{c}(x_t,u_t)$</li>
  <li>Optimize $\theta$ with respect to some supervised objective</li>
  <li>Increment or modify dual variables $\lambda$. repeat 1~3</li>
</ol>

<p>Learning with multiple trajectories</p>

\[\min_{\tau_1,...,\tau_N,\theta}\sum_{i=1}^{N}c(\tau_i)\:\:\:\text{s.t.}\:\:\:u_{t,i}=\pi_\theta(x_{t,i})\:\:\forall i\:\forall t\]

<ol>
  <li>Optimize each $\tau_i$ <em>in parallel</em> with respect to $\tilde{c}(x_t,u_t)$</li>
  <li>Optimize $\theta$ with respect to some supervised objective</li>
  <li>Increment or modify dual variables $\lambda$. repeat 1~3</li>
</ol>

<h5 id="stochastic-gaussian-gps">Stochastic (Gaussian) GPS</h5>

\[\begin{aligned}
\min_{p,\theta}\;E_{\tau\sim p(\tau)}[c(\tau)] &amp; \quad \text{s.t.}\quad p(u_t \mid x_t)=\pi_\theta(u_t \mid x_t) \\
p(u_t \mid x_t) &amp;=\mathcal{N}\big(K_t(x_t-\hat{x}_t)+k_t+\hat{u}_t,\Sigma_t\big)
\end{aligned}\]

<ol>
  <li>Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$</li>
  <li>Optimize $\theta$ with respect to some supervised objective</li>
  <li>Increment or modify dual variables $\lambda$</li>
</ol>

<blockquote>
  <p>Here, unlike pure imitation learning that mimics a fixed optimal control result, the agent imitates the planning results; and if it cannot imitate them well, the optimization process adjusts the planning to fit the learned policy, since the policy is a constraint on the planning.</p>
</blockquote>

<p>Input Remapping Trick</p>

<script type="math/tex; mode=display">
\min_{p,\theta} E_{\tau\sim p(\tau)}[c(\tau)] \quad \text{s.t.}\quad p(u_t \mid x_t)=\pi_\theta(u_t \mid o_t)
</script>

<h3 id="imitation-optimal-control">Imitation optimal control</h3>

<h4 id="imitation-optimal-control-with-dagger">Imitation optimal control with DAgger</h4>

<ol>
  <li>from current state $s_t$, run MCTS to get $a_t,a_{t+1},…$</li>
  <li>add $(s_t,a_t)$ to dataset $\mathcal{D}$</li>
  <li>execute action $a_t\sim\pi(a_t \mid s_t)$ (not the MCTS action!); repeat 1~3 N times</li>
  <li>update the policy by training on $\mathcal{D}$</li>
</ol>

<p>Problems of the original DAgger</p>

<ul>
  <li>Asking a human to label states visited by another policy is hard</li>
  <li>Running the initial (poor) policy in the real world is dangerous in some applications</li>
</ul>

<p>We address the first problem with a planning method; what about the second problem?</p>

<h4 id="imitating-mpc-plato-algorithm">Imitating MPC: PLATO algorithm</h4>

<ol>
  <li>train $\pi_\theta(u_t \mid o_t)$ from labeled data $\mathcal{D}={o_1,u_1,…,o_N,u_N}$</li>
  <li>run $\hat{\pi}(u_t \mid o_t)$ to get dataset $\mathcal{D}_\pi={o_1,…,o_M}$</li>
  <li>Ask computer to label $\mathcal{D_\pi}$ with actions $u_t$</li>
  <li>Aggregate: $\mathcal{D}\gets\mathcal{D}\cup\mathcal{D}_\pi$</li>
</ol>

<p><strong>Simple</strong> stochastic policy: $\hat{\pi}(u_t \mid x_t)=\mathcal{N}(K_tx_t+k_t, \Sigma_{u_t})$</p>

\[\hat{\pi}(u_t \mid x_t)=\arg\min_{\hat{\pi}}\sum_{t'=t}^TE_{\hat{\pi}}[c(x_{t'},u_{t'})]+\lambda D_{KL}(\hat{\pi}(u_t \mid x_t) \mid \mid \pi_\theta(u_t \mid o_t))\]

<blockquote>
  <p>Here $\hat{\pi}$ is re-planned by an optimal control method. For simplicity, a Gaussian policy is chosen since it is easy to plan with LQR. The planning objective also includes the KL constraint, which keeps the behavior policy close to the learned policy while still moving actions away from very bad (dangerous) ones.</p>
</blockquote>

<h5 id="dagger-vs-gps">DAgger vs GPS</h5>

<ul>
  <li>DAgger does not require an adaptive expert</li>
  <li>Any expert will do, so long as states from learned policy can be labeled</li>
  <li>Assumes it is possible to match expert’s behavior up to bounded loss</li>
  <li>Not always possible (e.g. partially observed domains)</li>
  <li>GPS adapts the “expert” behavior</li>
  <li>Does not require bounded loss on initial expert (expert will change)</li>
</ul>

<h5 id="why-imitate">Why imitate?</h5>

<ul>
  <li>Relatively stable and easy to use</li>
  <li>Supervised learning works very well</li>
  <li>control/planning (usually) works very well</li>
  <li>The combination of two (usually) works very well</li>
  <li>Input remapping trick: can exploit availability of additional information at training time to learn policy from raw observations. (planning with state and learning policy with observations)</li>
  <li>overcomes optimization challenges of back-propagating into policy directly</li>
</ul>

<blockquote>
  <p><em>See the accompanying PDF for the illustrative rollout diagram used in class.</em></p>
</blockquote>

<h3 id="model-free-optimization-with-a-model">Model-free optimization with a model</h3>

<ul>
  <li>just use policy gradient (or another model-free RL method) even though you have a model (i.e., treat the model as a simulator)</li>
  <li>Sometimes better than using the gradients!</li>
</ul>

<h4 id="dyna">Dyna</h4>

<p>on-line Q-learning algorithm that performs model-free RL with a model</p>

<ol>
  <li>given state $s$, pick action $a$ using exploration policy</li>
  <li>observe $s'$ and $r$, to get transition $(s,a,s',r)$</li>
  <li>update model $\hat{p}(s' \mid s,a)$ and $\hat{r}(s,a)$ using $(s,a,s')$</li>
  <li>Q-update: $Q(s,a)\gets Q(s,a)+\alpha E_{s',r}[r+\gamma\max_{a'}Q(s',a')-Q(s,a)]$</li>
  <li>repeat $K$ times:</li>
  <li>sample $(s,a)\sim\mathcal{B}$ from buffer of past states and actions</li>
  <li>Q-update: $Q(s,a)\gets Q(s,a)+\alpha E_{s',r}[r+\gamma\max_{a'}Q(s',a')-Q(s,a)]$</li>
</ol>

<p>as the model becomes better, old states are re-evaluated and the estimates become more accurate.</p>

<h4 id="general-dyna-style-model-based-rl-recipe">General “Dyna-style” model-based RL recipe</h4>

<ol>
  <li>given state $s$, pick action $a$ using exploration policy</li>
  <li>learn model $\hat{p}(s' \mid s,a)$ (and optionally, $\hat{r}(s,a)$)</li>
  <li>repeat K times:</li>
  <li>sample $s\sim\mathcal{B}$ from buffer</li>
  <li>choose action a (from $\mathcal{B}$, from $\pi$, or random)</li>
  <li>simulate $s'\sim\hat{p}(s' \mid s,a)$ (and $r=\hat{r}(s,a)$)</li>
  <li>train on $(s,a,s’,r)$ with model-free RL</li>
  <li>(optional) take N more model-based steps</li>
</ol>

<p>This only requires short rollouts from the model (as few as one step), which accumulate very little model error.</p>

<h3 id="model-based-rl-algorithms-summary">Model-based RL algorithms summary</h3>

<h4 id="methods">Methods</h4>

<ul>
  <li>Learn model and plan (without policy)</li>
  <li>Iteratively collect more data to overcome distribution mismatch</li>
  <li>Re-plan every time step (MPC) to mitigate small model errors</li>
  <li>Learning policy</li>
  <li>Back-propagate into policy (e.g., PILCO)–simple but potentially unstable</li>
  <li>imitate optimal control in a constrained optimization framework (e.g., GPS)</li>
  <li>imitate optimal control via DAgger-like process (e.g., PLATO)</li>
  <li>Use a model-free algorithm with a model (Dyna, etc.)</li>
</ul>

<h4 id="limitation-of-model-based-rl">Limitation of model-based RL</h4>

<ul>
  <li>Need some kind of model</li>
  <li>Not always available</li>
  <li>Sometimes harder to learn than the policy</li>
  <li>Learning the model takes time &amp; data</li>
  <li>Sometimes expressive model classes (neural nets) are not fast</li>
  <li>Sometimes fast model classes (linear models) are not expressive</li>
  <li>Some kind of additional assumptions</li>
  <li>Linearizability/continuity</li>
  <li>Ability to reset the system (for local linear models)</li>
  <li>Smoothness (for GP-style global model)</li>
  <li>Etc.</li>
</ul>

<blockquote>
  <p>Here are some of my understandings of model-based RL:</p>

  <p>First, <strong>why</strong> we need model-based RL?</p>

  <p>Model-free RL learns everything from experience. The state space may be very large and learning starts from scratch, which requires a lot of exploration; otherwise it may be hard to converge, or likely to converge to a local optimum.</p>

  <p>But in model-based RL, the model is known or already learned, so the very hard exploration process shifts to planning, which can find decent directions that lead to good results, either via optimal control methods or via search by simulating with the model. After planning, promising trajectories have been generated, and the policy only has to learn to imitate these good trajectories, which removes a lot of random exploration.</p>

  <p>Second, why not just use optimal control rather than learning a policy?</p>

  <p>Actually, you can: just use optimal control methods, like traditional control methods or MPC.</p>

  <p>However, not every model admits an explicit optimal control method like LQR, since some models are hard to solve mathematically. In addition, a neural network policy may generalize better, and closed-loop control tends to be more robust.</p>

  <p>Third, what kinds of methods can I use in model-based RL?</p>

  <ul>
    <li>learn the model and just plan, without learning a policy</li>
    <li>Learn policy by guided policy search</li>
    <li>imitating optimal control with DAgger</li>
  </ul>
</blockquote>

<h3 id="what-kind-of-algorithm-should-i-use">What kind of algorithm should I use?</h3>

<p>ranked by the sample efficiency required (low to high), which also orders them by computational efficiency (high to low):</p>

<ul>
  <li>gradient-free methods (e.g. NES, CMA, etc)</li>
  <li>full on-line methods (e.g. A3C)</li>
  <li>policy gradient methods (e.g. TRPO)</li>
  <li>replay buffer value estimation methods (Q-learning, DDPG, NAF, SAC, etc.)</li>
  <li>model-based deep RL (e.g. PETS, guided policy search)</li>
  <li>model-based “shallow” RL (e.g. PILCO)</li>
</ul>

<blockquote>
  <p><em>The full pipeline sketch is preserved in the downloadable PDF version.</em></p>
</blockquote>

<h2 id="10-variational-inference-and-generative-models">10 Variational Inference and Generative Models</h2>

<h3 id="probabilistic-models">Probabilistic models</h3>

<h4 id="latent-variable-models">Latent variable models</h4>

\[p(x)=\sum_zp(x \mid z)p(z)\\
p(y \mid x)=\sum_zp(y \mid x,z)p(z)\]

<p>Latent variable models in general</p>

<p>feed Gaussian noise through a neural network to fit (approximately) any distribution</p>

\[p(x \mid z)=\mathcal{N}(\mu_{nn}(z),\sigma_{nn}(z))\\
p(x)=\int p(x \mid z)p(z)dz\]

<p>where $p(z)$ is Gaussian.</p>

<p>the neural network takes a Gaussian sample as input, and outputs the mean and variance of a Gaussian over $x$.</p>

<p>Latent variable models in RL: conditional latent variable models for <strong>multi-modal policies</strong></p>

<h4 id="how-to-train-latent-variable-models">How to train latent variable models?</h4>

<p>model to fit a distribution</p>

<p>the model: $p_\theta(x)$</p>

<p>the data: $\mathcal{D}={x_1,x_2,x_3,…,x_N}$</p>

<p>maximum likelihood fit: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i \log p_\theta(x_i)$</p>

<p>in latent variable model</p>

<p>the model: $p(x)=\int p(x \mid z)p(z)dz$</p>

<p>maximum likelihood fit: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i \log \left(\int p(x \mid z)p(z)dz\right)$</p>

<p>Estimating the log-likelihood</p>

<p>alternative: <em>expected</em> log-likelihood: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i E_{z\sim p(z \mid x_i)} \log p_\theta(x_i,z)$</p>

<h5 id="the-variational-approximation">The variational approximation</h5>

<p>approximate $p(z \mid x_i)$ with $q_i(z)=\mathcal{N}(\mu_i,\sigma_i)$</p>

\[\begin{align}
\log p(x_i) &amp;= \log \int_z p(x_i \mid z)p(z)dz\\
&amp;=\log \int_z p(x_i \mid z)p(z)\frac{q_i(z)}{q_i(z)}dz\\
&amp;=\log E_{z\sim q_i(z)}\left[\frac{p(x_i \mid z)p(z)}{q_i(z)}\right]\\
&amp;\ge E_{z \sim q_i(z)}\left[\log \frac{p(x_i \mid z)p(z)}{q_i(z)}\right]\\
&amp;= E_{z \sim q_i(z)}[\log p(x_i \mid z)+\log p(z)]- E_{z \sim q_i(z)}[\log q_i(z)]\\
&amp;= E_{z \sim q_i(z)}[\log p(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_i)
\end{align}\]

<p>Jensen’s inequality: $\log E[y]\ge E[\log y]$</p>

<p>Entropy: $\mathcal{H}(p)=-E_{x \sim p(x)}[\log p(x)]=-\int_x p(x)\log p(x)dx$</p>

<p>KL Divergence: $D_{KL}(q \mid \mid p)=E_{x \sim q(x)}\left[\log \frac{q(x)}{p(x)}\right]=E_{x \sim q(x)}[\log q(x)]- E_{x \sim q(x)}[\log p(x)] =-E_{x \sim q(x)}[\log p(x)]- \mathcal{H}(q)$</p>

<p>further analysis</p>

<p>\(\log p(x_i) \ge E_{z \sim q_i(z)}[\log p(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_i)=\mathcal{L}_i(p,q_i)\)</p>

<p>so what makes a good $q_i(z)$?</p>

<p>intuition: $q_i(z)$ should approximate $p(z \mid x_i)$</p>

<p>why?</p>

\[\begin{align}
D_{KL}(q_i(z) \mid \mid p(z \mid x_i))&amp;=E_{z\sim q_i(z)}\left[\log \frac{q_i(z)}{p(z \mid x_i)}\right]\\
&amp;=E_{z \sim q_i(z)}\left[\log \frac{q_i(z)p(x_i)}{p(x_i,z)}\right]\\
&amp;=-E_{z\sim q_i(z)}[\log p(x_i \mid z)+\log p(z)] +E_{z\sim q_i(z)}[\log q_i(z)]+E_{z \sim q_i(z)}[\log p(x_i)]\\
&amp;=-E_{z\sim q_i(z)}[\log p(x_i \mid z)+\log p(z)] -\mathcal{H}(q_i)+\log p(x_i)\\
&amp;=-\mathcal{L}_i(p,q_i)+\log p(x_i)
\end{align}\]

\[\begin{align}
\log p(x_i)&amp;=D_{KL}(q_i(z) \mid \mid p(z \mid x_i))+\mathcal{L}_i(p,q_i)\\
0 &amp;\le D_{KL}(q_i(z) \mid \mid p(z \mid x_i)) \\
\log p(x_i)&amp;\ge \mathcal{L}_i(p,q_i)
\end{align}\]

<blockquote>
  <p>So this also proves that $\mathcal{L}_i(p,q_i)$ is a lower bound, and the KL divergence is the bound gap; when $q_i(z)$ is close to $p(z \mid x_i)$, the bound becomes tight.</p>

  <p>Minimizing the KL divergence is the same as maximizing $\mathcal{L}_i(p,q_i)$, so we adjust $q_i(z)$ to maximize $\mathcal{L}_i(p,q_i)$.</p>
</blockquote>

<p>So all we need to do is:</p>

\[\theta \gets \arg \max_{\theta} \frac{1}{N}\sum_i \mathcal{L}_i(p,q_i)\]

\[\mathcal{L}_i(p,q_i)=E_{z \sim q_i(z)}[\log p_\theta(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_i)\]

<p>Algorithm:</p>

<p>for each $x_i$ (or mini-batch):</p>

<p>calculate $\Delta_\theta \mathcal{L}_i(p,q_i)$:</p>

<p>sample $z \sim q_i(z)$</p>

<p>$\Delta_\theta \mathcal{L}(p,q_i)\approx\Delta_\theta \log p_\theta(x_i \mid z)$</p>

<p>$\theta \gets \theta+\alpha \Delta_\theta\mathcal{L}(p,q_i)$</p>

<p>update $q_i$ to maximize $\mathcal{L}_i(p,q_i)$</p>

<p>How to update $q_i$?</p>

<p>let’s say $q_i(z)=\mathcal{N}(\mu_i,\sigma_i)$</p>

<p>use gradients $\Delta_{\mu_i}\mathcal{L}_i(p,q_i)$ and $\Delta_{\sigma_i}\mathcal{L}_i(p,q_i)$</p>

<p>gradient ascent on $\mu_i,\sigma_i$</p>

<p>What’s the problem?</p>

<p>every sample has a $\mu_i,\sigma_i$. When you have many samples, the total parameters are $ \lvert \theta \rvert + ( \lvert \mu_i \rvert + \lvert \sigma_i \rvert )N$</p>

<p>intuition: $q_i(z)$ should approximate $p(z \mid x_i)$</p>

<p>what if we learn a <em>network</em> $q_i(z)=q(z \mid x_i)\approx p(z \mid x_i)$ ?</p>

<p>so we have two networks: $p_\theta(x \mid z)$ and $q_\phi(z \mid x)$</p>

<h3 id="amortized-variational-inference">Amortized variational inference</h3>

\[q_\phi(z \mid x)=\mathcal{N}(\mu_\phi(x),\sigma_\phi(x))\]

\[\log p(x_i)\ge E_{z \sim q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_\phi(z \mid x_i))=\mathcal{L}(p_\theta(x_i \mid z),q_\phi(z \mid x_i))\]

<p>Algorithm:</p>

<p>for each $x_i$ (or mini-batch):</p>

<p>calculate $\mathcal{L}(p_\theta(x_i \mid z),q_\phi(z \mid x_i))$:</p>

<p>sample $z \sim q_\phi(z \mid x_i)$</p>

<p>$\Delta_\theta \mathcal{L}\approx\Delta_\theta \log p_\theta(x_i \mid z)$</p>

<p>$\theta \gets \theta+\alpha \Delta_\theta\mathcal{L}$</p>

<p>$\phi \gets \phi+\alpha \Delta_\phi\mathcal{L}$</p>

<p>how can we get $\Delta_\phi\mathcal{L}$ ?
\(\mathcal{L}_i=E_{z \sim q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_\phi(z \mid x_i))\\
J(\phi)= E_{z \sim q_\phi(z \mid x_i)}[r(x_i,z)]\\
\Delta_\phi J(\phi)\approx \frac{1}{M}\sum_j\Delta_\phi \log q_\phi(z_j \mid x_i)r(x_i,z_j)\)
one way is to just use the policy gradient trick, but it has high variance; the other way is to apply the re-parameterization trick.</p>

<h4 id="the-re-parameterization-trick">The re-parameterization trick</h4>

\[q_\phi(z \mid x)=\mathcal{N}(\mu_\phi(x),\sigma_\phi(x))\\
z=\mu_\phi(x)+\epsilon\sigma_\phi(x)\;\:\epsilon\sim \mathcal{N}(0,1)\]

\[\begin{align}
J(\phi)&amp;= E_{z \sim q_\phi(z \mid x_i)}[r(x_i,z)]\\
&amp;=E_{\epsilon \sim \mathcal{N}(0,1)}[r(x_i,\mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))]
\end{align}\]

<p>and then we can estimating $\Delta_\phi J(\phi)$:</p>

<p>sample $\epsilon_1,…,\epsilon_M$ from $\mathcal{N}(0,1)$ (even a single sample per data point works well!)</p>

<p>$\Delta_\phi J(\phi) \approx \frac{1}{M}\sum_j\Delta_\phi r(x_i,\mu_\phi(x_i)+\epsilon_j\sigma_\phi(x_i)) $</p>

<p>this has low variance, since it uses the gradient of $r$ rather than just samples of $r$.</p>
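<p>A numpy sketch of the estimator for a scalar Gaussian and a toy reward $r(z)=-(z-3)^2$; gradient ascent on $(\mu, \log\sigma)$ drives $\mu$ toward 3 and shrinks $\sigma$ (all constants are illustrative):</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
mu, log_sigma = 0.0, 0.0
dr = lambda z: -2.0 * (z - 3.0)          # dr/dz for r(z) = -(z - 3)^2

for _ in range(500):
    eps = rng.normal(size=16)            # eps ~ N(0, 1)
    sigma = np.exp(log_sigma)
    z = mu + eps * sigma                 # re-parameterized sample
    g_mu = dr(z).mean()                  # chain rule: dz/dmu = 1
    g_ls = (dr(z) * eps * sigma).mean()  # chain rule: dz/dlog_sigma = eps * sigma
    mu += 0.05 * g_mu
    log_sigma += 0.05 * g_ls
</code></pre>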

<p>Another way to look at it…</p>

\[\begin{align}
\mathcal{L}_i&amp;=E_{z \sim q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_\phi(z \mid x_i))\\
&amp;=E_{z \sim q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)]+E_{z \sim q_\phi(z \mid x_i)}[\log p(z)]+ \mathcal{H}(q_\phi(z \mid x_i))\\
&amp;=E_{z \sim q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)]-D_{KL}(q_\phi(z \mid x_i) \mid \mid p(z))\\
&amp;=E_{\epsilon \sim \mathcal{N}(0,1)}[\log p_\theta(x_i \mid \mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))]-D_{KL}(q_\phi(z \mid x_i) \mid \mid p(z))\\
&amp;\approx \log p_\theta(x_i \mid \mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))-D_{KL}(q_\phi(z \mid x_i) \mid \mid p(z))
\end{align}\]

<h4 id="re-parameterization-trick-vs-policy-gradient">Re-parameterization trick vs. policy gradient</h4>

<ul>
  <li>policy gradient: $\Delta_\phi J(\phi)\approx \frac{1}{M}\sum_j\Delta_\phi \log q_\phi(z_j \mid x_i)r(x_i,z_j)$</li>
  <li>Can handle both discrete and continuous latent variables</li>
  <li>but has high variance, requires multiple samples &amp; small learning rates</li>
  <li>Re-parameterization trick $\Delta_\phi J(\phi) \approx \frac{1}{M}\sum_j\Delta_\phi r(x_i,\mu_\phi(x_i)+\epsilon_j\sigma_\phi(x_i)) $</li>
  <li>only continuous latent variables</li>
  <li>very simple to implement</li>
  <li>low variance</li>
</ul>

<h3 id="the-variational-auto-encoder-vae">The variational auto encoder (VAE)</h3>

<blockquote>
  <p><em>Consult the PDF export for the value-iteration schematic mentioned here.</em></p>
</blockquote>

<p>Conditional models
\(\mathcal{L}_i=E_{z \sim q_\phi(z \mid x_i,y_i)}[\log p_\theta(y_i \mid x_i,z)+\log p(z \mid x_i)]+ \mathcal{H}(q_\phi(z \mid x_i,y_i))\)
Applications of variational inference:</p>

<ul>
  <li>using RL\control+variational inference to model human behavior</li>
  <li>using generative models and variational inference for exploration</li>
</ul>

<blockquote>
  <p>this class is a little tough, here is some of my understanding:</p>

  <p>we want to represent the distribution of an object (with a neural network) and give it the ability to capture the object’s features (multi-modality). To achieve this, random variables ($z$) are used as input to the model. But what kind of random variables can achieve this? The mathematical argument shows that the distribution of the random variable should approximate $p(z \mid x_i)$; this resembles compressed sensing, with $z$ as the latent variable. Finally, this is used to build the variational auto-encoder (VAE).</p>

  <p>To make the network trainable, the re-parameterization trick is applied.</p>
</blockquote>

<h2 id="11-re-framing-control-as-an-inference-problem">11. Re-framing Control as an Inference Problem</h2>

<p>Get the objective function from the policy.</p>

<p>Human behavior is stochastic and suboptimal but overall good; how can we model and interpret this kind of behavior?</p>

<h3 id="a-probabilistic-graphical-model-of-decision-making">A probabilistic graphical model of decision making</h3>

\[p(\mathcal{O}_t \mid s_t,a_t)=\exp(r(s_t,a_t))\\
p(\tau \mid \mathcal{O}_{1:T})=\frac{p(\tau,\mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})}\propto p(\tau)\prod_t\exp(r(s_t,a_t))=p(\tau)\exp\left(\sum_t r(s_t,a_t)\right)\]

<p>$\mathcal{O}_t$ is a boolean variable indicating whether the agent is acting optimally (maximizing reward) at time $t$, rather than acting randomly</p>

<h3 id="inference">Inference</h3>

<h4 id="backward-massages">Backward massages</h4>

\[\begin{align}
\beta_t(s_t,a_t)&amp;=p(\mathcal{O}_{t:T} \mid s_t,a_t)\\
&amp;=\int p(\mathcal{O}_{t:T},s_{t+1} \mid s_t,a_t)ds_{t+1}\\
&amp;=\int p(\mathcal{O}_{t+1:T} \mid s_{t+1}) p(s_{t+1} \mid s_t,a_t)p(\mathcal{O}_t \mid s_t,a_t)ds_{t+1}\\
\end{align}\]

\[p(\mathcal{O}_{t+1:T} \mid s_{t+1})=\beta_{t+1}(s_{t+1})=\int p(\mathcal{O}_{t+1:T} \mid s_{t+1},a_{t+1})p(a_{t+1} \mid s_{t+1})d a_{t+1}\\
=\int \beta_{t+1}(s_{t+1},a_{t+1})p(a_{t+1} \mid s_{t+1})d a_{t+1}\]

<p>for $t=T-1$ to 1:</p>

<blockquote>
\[\begin{align}
\beta_t(s_t,a_t)&amp;=p(\mathcal{O}_t \mid s_t,a_t)E_{s_{t+1}\sim p(s_{t+1} \mid s_t,a_t)}[\beta_{t+1}(s_{t+1})]\\
\beta_t(s_t)&amp;=E_{a_t\sim p(a_t \mid s_t)}[\beta_t(s_t,a_t)]
\end{align}\]
</blockquote>

<p>let $V_t(s_t) =\log \beta_t(s_t)$</p>

<p>let $Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)$</p>

\[V_t(s_t)=\log\int \exp(Q_t(s_t,a_t))da_t\\
V_t(s_t) \to \max_{a_t}Q_t(s_t,a_t)\;\text{as}\;Q_t(s_t,a_t)\;\text{gets bigger!}\\
Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\\
\text{for deterministic transitions: }Q_t(s_t,a_t)=r(s_t,a_t)+V_{t+1}(s_{t+1})\]
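<p>A tabular sketch of this backward pass (a minimal illustration; the MDP arrays <code class="language-plaintext highlighter-rouge">R</code> and <code class="language-plaintext highlighter-rouge">P</code> and their sizes are made-up inputs). The last line anticipates the policy $\pi(a_t \mid s_t)=\exp(Q_t-V_t)$ derived in the next subsection:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(R, P, T):
    """Backward messages in log space for a small tabular MDP.
    R: (S, A) rewards; P: (S, A, S) transition probabilities."""
    S, A = R.shape
    V = np.zeros(S)            # log beta_T(s) = 0, i.e. beta_T = 1
    Q = None
    for t in range(T - 1, -1, -1):
        # Q_t = r + log E_{s'}[exp(V_{t+1})]: a soft max over the
        # dynamics, which is exactly the optimism problem noted below.
        Q = R + np.log(P @ np.exp(V))
        V = logsumexp(Q, axis=1)           # soft max over actions
    return Q, V

rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
Q1, V1 = soft_value_iteration(rng.random((S, A)), P, T=10)
pi = np.exp(Q1 - V1[:, None])              # pi(a|s) = exp(Q - V)
</code></pre></div></div>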

<h4 id="policy-computation">Policy computation</h4>

\[\begin{align*}
p(a_t \mid s_t,\mathcal{O}_{1:T})&amp;=\pi(a_t \mid s_t)\\
&amp;=p(a_t \mid s_t,\mathcal{O}_{t:T})\\
&amp;=\frac{p(a_t,s_t \mid \mathcal{O}_{t:T})}{p(s_t \mid \mathcal{O}_{t:T})}\\
&amp;=\frac{p(\mathcal{O}_{t:T} \mid a_t,s_t)p(a_t,s_t)/p(\mathcal{O}_{t:T})}{p(\mathcal{O}_{t:T} \mid s_t)p(s_t)/p(\mathcal{O}_{t:T})}\\
&amp;=\frac{p(\mathcal{O}_{t:T} \mid a_t,s_t)}{p(\mathcal{O}_{t:T} \mid s_t)}\frac{p(a_t,s_t)}{p(s_t)}\\
&amp;=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}p(a_t \mid s_t)\\
&amp;=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}
\end{align*}\]

<p>(the last step assumes a uniform action prior $p(a_t \mid s_t)$; a non-uniform prior can be folded into the reward)</p>

<h4 id="policy-computation-with-value-functions">Policy computation with value functions</h4>

<p>for $t=T-1$ to 1:</p>

<blockquote>
\[V_t(s_t)=\log\int \exp(Q_t(s_t,a_t))da_t\\
Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\]
</blockquote>

\[\pi(a_t \mid s_t)=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}\\
V_t(s_t) =\log \beta_t(s_t)\\
Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)\]

<p>So</p>

\[\pi(a_t \mid s_t)=\exp(Q_t(s_t,a_t)-V_t(s_t))=\exp(A_t(s_t,a_t))\]

<p>with temperature: $\pi(a_t \mid s_t)=\exp(\frac{1}{\alpha}Q_t(s_t,a_t)-\frac{1}{\alpha}V_t(s_t))=\exp(\frac{1}{\alpha}A_t(s_t,a_t))$. As $\alpha$ approaches zero the max action dominates and the policy becomes nearly deterministic; $\alpha$ near 1 gives a more stochastic policy.</p>

<ul>
  <li>Natural interpretation: better actions are more probable</li>
  <li>Random tie-breaking</li>
  <li>Analogous to Boltzmann exploration</li>
  <li>Approaches greedy policy as temperature decreases</li>
</ul>

<h4 id="forward-massages">Forward massages</h4>

\[\begin{align*}
\alpha_t(s_t)&amp;=p(s_t \mid \mathcal{O}_{1:t-1})\\
&amp;=\int p(s_t,s_{t-1},a_{t-1} \mid \mathcal{O}_{1:t-1})ds_{t-1}da_{t-1}\\
&amp;=\int p(s_t \mid s_{t-1},a_{t-1},\mathcal{O}_{1:t-1})p(a_{t-1} \mid s_{t-1},\mathcal{O}_{1:t-1})p(s_{t-1} \mid \mathcal{O}_{1:t-1})ds_{t-1}da_{t-1}\\
&amp;=\int p(s_t \mid s_{t-1},a_{t-1})p(a_{t-1} \mid s_{t-1},\mathcal{O}_{1:t-1})p(s_{t-1} \mid \mathcal{O}_{1:t-1})ds_{t-1}da_{t-1}
\end{align*}\]

\[\begin{align*}
p(a_{t-1} \mid s_{t-1},\mathcal{O}_{1:t-1})p(s_{t-1} \mid \mathcal{O}_{1:t-1})
&amp;=\frac{p(\mathcal{O}_{t-1} \mid s_{t-1},a_{t-1})p(a_{t-1} \mid s_{t-1})}{p(\mathcal{O}_{t-1} \mid s_{t-1})}\frac{p(\mathcal{O}_{t-1} \mid s_{t-1})p(s_{t-1} \mid \mathcal{O}_{1:t-2})}{p(\mathcal{O}_{t-1} \mid \mathcal{O}_{1:t-2})}\\
&amp;=p(\mathcal{O}_{t-1} \mid s_{t-1},a_{t-1})p(a_{t-1} \mid s_{t-1})\frac{\alpha_{t-1}(s_{t-1})}{p(\mathcal{O}_{t-1} \mid \mathcal{O}_{1:t-2})}
\end{align*}\]

<p>what if we want $p(s_t \mid \mathcal{O}_{1:T})$?</p>

\[\begin{align*}
p(s_t \mid \mathcal{O}_{1:T})&amp;=\frac{p(s_t,\mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})}\\
&amp;=\frac{p(\mathcal{O}_{t:T} \mid s_t)p(s_t,\mathcal{O}_{1:t-1})}{p(\mathcal{O}_{1:T})}\\
&amp;\propto \beta_t(s_t)p(s_t \mid \mathcal{O}_{1:t-1})p(\mathcal{O}_{1:t-1})\\
&amp;\propto \beta_t(s_t)\alpha_t(s_t)
\end{align*}\]

<h3 id="the-optimism-problem">The optimism problem</h3>

<p>for $t=T-1$ to 1:</p>

<blockquote>
\[\begin{align}
\beta_t(s_t,a_t)&amp;=p(\mathcal{O}_t \mid s_t,a_t)E_{s_{t+1}\sim p(s_{t+1} \mid s_t,a_t)}[\beta_{t+1}(s_{t+1})]\\
\beta_t(s_t)&amp;=E_{a_t\sim p(a_t \mid s_t)}[\beta_t(s_t,a_t)]
\end{align}\]
</blockquote>

<p>let $V_t(s_t) =\log \beta_t(s_t)$</p>

<p>let $Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)$
\(Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\)
the next-state value enters as an optimistic soft max over the dynamics rather than an expectation; why did this happen?</p>

<p>The inference problem: $p(s_{1:T},a_{1:T} \mid \mathcal{O}_{1:T})$</p>

<p>marginalizing and conditioning, we get: $p(a_t \mid s_t,\mathcal{O}_{1:T})$ (the policy)</p>

<blockquote>
  <p>“given that you obtained high reward, what was your action probability?”</p>
</blockquote>

<p>marginalizing and conditioning, we get: $p(s_{t+1} \mid s_t,a_t,\mathcal{O}_{1:T})\ne p(s_{t+1} \mid s_t,a_t)$</p>

<blockquote>
  <p>“given that you obtained high reward, what was your transition probability?”</p>
</blockquote>

<p>Because we are asking about the transition probability conditioned on a good outcome, the inferred dynamics become optimistic; this is the optimism problem.</p>

<h4 id="addressing-the-optimism-problem">Addressing the optimism problem</h4>

<p>we actually want to ask “given that you obtained high reward, what was your action probability when the transition probability did not change?”</p>

<p>find another distribution $q(s_{1:T},a_{1:T})$ that is close to $p(s_{1:T},a_{1:T} \mid \mathcal{O}_{1:T})$ but has dynamics $p(s_{t+1} \mid s_t,a_t)$</p>

<p>Try variational inference!</p>

<p>let $\mathbf{x}=\mathcal{O}_{1:T}$ and $\mathbf{z}=(s_{1:T},a_{1:T})$; find $q(\mathbf{z})$ to approximate $p(\mathbf{z} \mid \mathbf{x})$</p>

<h4 id="control-via-variational-inference">Control via variational inference</h4>

\[q(s_{1:T},a_{1:T})=p(s_1)\prod_t p(s_{t+1} \mid s_t,a_t)q(a_t \mid s_t)\]

<p>The variational lower bound (last class)</p>

\[\begin{align}
\log p(x)&amp;\ge E_{z \sim q(z)}[\log p(x,z)-\log q(z)]\\
\log p(\mathcal{O}_{1:T}) &amp;\ge E_{s_{1:T},a_{1:T}\sim q}\Big[\log p(s_1)+\sum_{t=1}^T\log p(s_{t+1} \mid s_t,a_t)+\sum_{t=1}^T\log p(\mathcal{O}_t \mid s_t,a_t)\\
&amp;\qquad -\log p(s_1)-\sum_{t=1}^T\log p(s_{t+1} \mid s_t,a_t)-\sum_{t=1}^T\log q(a_t \mid s_t)\Big]\\
&amp;=E_{(s_{1:T},a_{1:T})\sim q}\left[\sum_t r(s_t,a_t)-\log q(a_t \mid s_t)\right]\\
&amp;=\sum_t E_{(s_t,a_t)\sim q}[r(s_t,a_t)+\mathcal{H}(q(a_t \mid s_t))]
\end{align}\]

<p>maximize the rewards and entropy</p>

<p>Optimizing the variational lower bound</p>

\[Q_t(s_t,a_t)=r(s_t,a_t)+E[V_{t+1}(s_{t+1})]\\
V_t(s_t)=\log \int \exp(Q_t(s_t,a_t))d a_t\]

<h4 id="backward-pass-variational">backward pass-variational</h4>

<p>for $t=T-1$ to 1:</p>

<blockquote>
\[V_t(s_t)=\log\int \exp(Q_t(s_t,a_t))da_t\\
Q_t(s_t,a_t)=r(s_t,a_t)+E[V_{t+1}(s_{t+1})]\]
</blockquote>

<p><strong>Variants:</strong></p>

<ul>
  <li>discounted SOC: $Q_t(s_t,a_t)=r(s_t,a_t)+\gamma E[V_{t+1}(s_{t+1})]$</li>
  <li>explicit temperature: $V_t(s_t)=\alpha \log \int \exp(\frac{1}{\alpha}Q_t(s_t,a_t))da_t$</li>
</ul>

<h4 id="soft-q-learning">Soft Q-learning</h4>

<p>soft Q-learning $\phi \gets \phi + \alpha \Delta_\phi Q_\phi(s,a)(r(s,a)+\gamma V(s’)-Q_\phi(s,a))$</p>

<p>target value: $V(s’)=\text{soft} \max_{a’}Q_\phi(s’,a’)=\log \int \exp (Q_\phi(s’,a’))da’$</p>

<p>$\pi(a \mid s)=\exp(Q_\phi(s,a)-V(s))=\exp(A(s,a))$</p>
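<p>A tabular sketch of this update (assuming a finite action set, so the soft max integral becomes a logsumexp; array shapes are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.special import logsumexp

def soft_q_step(Q, s, a, r, s_next, lr=0.1, gamma=0.99):
    """One tabular soft Q-learning step: the target replaces
    max_a' Q(s', a') with the soft max logsumexp_a' Q(s', a')."""
    v_next = logsumexp(Q[s_next])
    Q[s, a] += lr * (r + gamma * v_next - Q[s, a])
    return Q

def soft_policy(Q, s):
    """pi(a|s) = exp(Q(s,a) - V(s)) = exp(A(s,a))."""
    return np.exp(Q[s] - logsumexp(Q[s]))
</code></pre></div></div>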

<blockquote>
  <p><em>Detailed timeline graphic available in the PDF notes.</em></p>
</blockquote>

<h4 id="policy-gradient-with-soft-optimality">Policy gradient with soft optimality</h4>

<p>$\pi(a \mid s)=\exp(Q_\phi(s,a)-V(s))$ optimizes $\sum_t E_{\pi(s_t,a_t)}[r(s_t,a_t)]+E_{\pi(s_t)}[\mathcal{H}(\pi(a_t \mid s_t))]$</p>

<p><strong>intuition:</strong> $\pi(a \mid s)\propto \exp(Q_\phi(s,a))$ when $\pi$ minimizes $D_{KL}(\pi(a \mid s) \mid \mid \frac{1}{Z}\exp(Q(s,a)))$</p>

<h4 id="soft-policy-gradient-vs-soft-q-learning">Soft Policy gradient vs soft Q-learning</h4>

<p>policy gradient derivation:</p>

\[J(\theta)=\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)]+E_{\pi(s_t)}[\mathcal{H}(\pi(a \mid s_t))]\\
=\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)-\log \pi(a_t \mid s_t)]\\
\log \pi(a_t \mid s_t)=Q(s_t,a_t)-V(s_t)\]

\[\begin{align}
&amp;\Delta_\theta\left[\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)-\log \pi(a_t \mid s_t)]\right]\\
&amp;\approx \frac{1}{N}\sum_i\sum_t\Delta_\theta \log \pi(a_t \mid s_t)\left(r(s_t,a_t)+\left(\sum_{t'=t+1}^Tr(s_{t'},a_{t'})-\log \pi(a_{t'} \mid s_{t'})\right)-\log \pi(a_t \mid s_t)-1\right)\\
&amp;\approx \frac{1}{N}\sum_i\sum_t(\Delta_\theta Q(s_t,a_t)-\Delta_\theta V(s_t))\left(r(s_t,a_t)+Q(s_{t+1},a_{t+1})-Q(s_t,a_t)+V(s_t)\right)\\
&amp;\approx \frac{1}{N}\sum_i\sum_t(\Delta_\theta Q(s_t,a_t)-\Delta_\theta V(s_t))\left(r(s_t,a_t)+Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right)
\end{align}\]

<p>soft Q-learning:</p>

\[- \frac{1}{N}\sum_i\sum_t\Delta_\theta Q(s_t,a_t)\left(r(s_t,a_t)+\text{soft}\max_{a_{t+1}}Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right)\]

<h4 id="benefits-of-soft-optimality">Benefits of soft optimality</h4>

<ul>
  <li>Improve exploration and prevent entropy collapse</li>
  <li>Easier to specialize (fine-tune) policies for more specific tasks</li>
  <li>Principled approach to break ties</li>
  <li>Better robustness (due to wider coverage of states)</li>
  <li>Can reduce to hard optimality as reward magnitude increases</li>
  <li>Good model for modeling human behavior</li>
</ul>

<h2 id="12-inverse-reinforcement-learning">12. Inverse Reinforcement Learning</h2>

<h3 id="why-should-we-worry-about-learning-rewards">Why should we worry about learning rewards</h3>

<h4 id="the-imitation-learning-perspective">The imitation learning perspective</h4>

<p>Standard imitation learning:</p>

<ul>
  <li>copy the action performed by the expert</li>
  <li>no reasoning about outcomes of actions</li>
</ul>

<p>Human imitation learning:</p>

<ul>
  <li>copy the <em>intent</em> of the expert</li>
  <li>might take very different actions</li>
</ul>

<h4 id="the-reinforcement-learning-perspective">The reinforcement learning perspective</h4>

<p>sometimes the reward function is complicated and hard to specify</p>

<h3 id="inverse-reinforcement-learning">Inverse reinforcement learning</h3>

<p>Infer reward functions from demonstrations</p>

<ul>
  <li>by itself, this is an underspecified problem</li>
  <li>many reward functions can explain the same behavior</li>
</ul>

<p>Inverse reinforcement learning:</p>

<p>Given:</p>

<ul>
  <li>states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$</li>
  <li>(sometimes) a transition model $p(s’ \mid s,a)$</li>
  <li>demonstrations $\{\tau_i\}$ drawn from $\pi^*(\tau)$</li>
</ul>

<p>Goal: learn $r_\psi(s,a)$ and then recover $\pi^*(a \mid s)$.</p>

<h4 id="learning-the-optimality-variable">Learning the optimality variable</h4>

<p>$p(\mathcal{O}_t \mid s_t,a_t,\psi)=\exp(r_\psi(s_t,a_t))$</p>

<p>$p(\tau \mid \mathcal{O}_{1:T},\psi) \propto p(\tau)\exp\left(\sum_t r_\psi(s_t,a_t)\right)$</p>

<p>Here the demonstrations are $\{\tau_i\}$ sampled from $\pi^*(\tau)$ and we perform maximum-likelihood learning:</p>

\[\max_\psi \frac{1}{N}\sum_{i=1}^N \log p(\tau_i \mid \mathcal{O}_{1:T},\psi)
  = \max_\psi \frac{1}{N}\sum_{i=1}^N r_{\psi}(\tau_i) - \log Z,\]

<p>where $\log Z$ ensures the trajectory probabilities sum to one.</p>

<h5 id="the-irl-partition-function">The IRL partition function</h5>

\[\max_\psi \frac{1}{N}\sum_{i=1}^Nr_{\psi}(\tau_i)-\log Z\\
Z=\int p(\tau)\exp(r_\psi(\tau))d\tau\\
\Delta_\psi \mathcal{L}=\frac{1}{N}\sum_{i=1}^N\Delta_\psi r_\psi(\tau_i)-\frac{1}{Z}\int p(\tau)\exp (r_\psi(\tau))\Delta_\psi r_\psi(\tau)d\tau\\
=E_{\tau\sim \pi^*(\tau)}[\Delta_\psi r_\psi(\tau)]-E_{\tau \sim p(\tau \mid \mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]\]

<h5 id="estimating-the-expectation">Estimating the expectation</h5>

<p>$p(s_t,a_t \mid \mathcal{O}_{1:T},\psi)=p(a_t \mid s_t,\mathcal{O}_{1:T},\psi)p(s_t \mid \mathcal{O}_{1:T},\psi)$</p>

<p>let $\mu_t(s_t,a_t)\propto\beta(s_t,a_t)\alpha(s_t)$</p>

\[\begin{align}
E_{\tau \sim p(\tau \mid \mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]&amp;= E_{\tau \sim p(\tau \mid \mathcal{O}_{1:T},\psi)}[\Delta_\psi \sum_{t=1}^Tr_\psi(s_t,a_t)]\\
&amp;=\sum_{t=1}^T E_{(s_t,a_t) \sim p(s_t,a_t \mid \mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(s_t,a_t)]\\
&amp;=\sum_{t=1}^T\vec{\mu}_t^T\Delta_\psi\vec{r}_\psi
\end{align}\]

<h5 id="the-maxent-irl-algorithm">The MaxEnt IRL algorithm</h5>

<ol>
  <li>Given $\psi$, compute backward message $\beta(s_t,a_t)$</li>
  <li>Given $\psi$, compute forward message $\alpha(s_t)$</li>
  <li>Compute $\mu_t(s_t,a_t) \propto \beta(s_t,a_t)\alpha(s_t)$</li>
  <li>Evaluate $\Delta_\psi \mathcal{L}=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\Delta_\psi r_\psi(s_{i,t},a_{i,t})-\sum_{t=1}^T\int\int\mu_t(s_t,a_t)\Delta_\psi r_\psi(s_t,a_t)ds_tda_t$</li>
  <li>$\psi \gets \psi +\eta \Delta_\psi\mathcal{L}$</li>
</ol>
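<p>A compact tabular sketch of steps 1 through 5 (a linear reward $r_\psi=\psi^Tf$ and known dynamics are assumed; all array shapes and names are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.special import logsumexp

def maxent_irl_step(psi, F, P, p0, demos, T, lr=0.01):
    """One MaxEnt IRL gradient step for a linear reward r = F @ psi.
    F: (S, A, d) features; P: (S, A, S) dynamics; p0: (S,) initial
    state distribution; demos: list of trajectories [(s, a), ...]."""
    R = F @ psi                                   # (S, A) reward table
    S, A, d = F.shape
    # 1-2. Backward messages via soft value iteration under r_psi.
    V, pis = np.zeros(S), []
    for _ in range(T):
        Q = R + np.log(P @ np.exp(V) + 1e-300)
        V = logsumexp(Q, axis=1)
        pis.append(np.exp(Q - V[:, None]))        # pi_t(a|s)
    pis = pis[::-1]
    # 3-4. Forward pass: visitation mu_t and model feature expectations.
    grad_model, p_s = np.zeros(d), p0.copy()
    for t in range(T):
        mu = p_s[:, None] * pis[t]                # mu_t(s, a)
        grad_model += np.einsum("sa,sad-&gt;d", mu, F)
        p_s = np.einsum("sa,sap-&gt;p", mu, P)
    # Expert feature expectations: average total counts per demo.
    grad_expert = np.mean(
        [sum(F[s, a] for (s, a) in tau) for tau in demos], axis=0)
    # 5. Gradient ascent on the likelihood.
    return psi + lr * (grad_expert - grad_model)
</code></pre></div></div>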

<p>Why MaxEnt?
in the case where $r_\psi(s_t,a_t)=\psi^Tf(s_t,a_t)$, we can show that it optimizes $\max\mathcal{H}(\pi^{r_\psi})$ such that $E_{\pi^{r_\psi}}[f]=E_{\pi^*}[f]$</p>

<p>paper: Ziebart et al. 2008, Maximum Entropy Inverse Reinforcement Learning</p>

<p>what’s missing so far?</p>

<ul>
  <li>MaxEnt IRL so far requires…
    <ul>
      <li>solving for the (soft) optimal policy in the inner loop</li>
      <li>enumerating all state-action tuples for visitation frequency and gradient</li>
    </ul>
  </li>
  <li>To apply this in practical problem settings, we need to handle…
    <ul>
      <li>large and continuous state and action spaces</li>
      <li>states obtained via sampling only</li>
      <li>unknown dynamics</li>
    </ul>
  </li>
</ul>

<p>recall:</p>

\[\Delta_\psi \mathcal{L}=E_{\tau\sim \pi^*(\tau)}[\Delta_\psi r_\psi(\tau_i)]-E_{\tau \sim p(\tau \mid \mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]\]

<p>The first term is estimated from expert trajectories; the second is an expectation under the soft optimal policy for the current reward. A simple idea is to learn this soft policy with the methods from the previous lecture, but that is impractical: the inner policy must be trained to convergence at every reward update, which costs a lot of computation.</p>

<h4 id="more-efficient-sample-based-updates">More efficient sample-based updates</h4>

<h5 id="guided-cost-learning">Guided cost learning</h5>

\[\Delta_\psi \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{M}\sum_{j=1}^M\Delta_\psi r_\psi(\tau_j)\]

<p>instead of learning $p(a_t \mid s_t, \mathcal{O}_{1:T},\psi)$ with a max-ent RL algorithm until convergence and then running this policy to sample $\{\tau_j\}$, can we just run one or a few gradient steps and then sample?</p>

<p>Solution 1: use importance sampling</p>

\[\Delta_\psi\mathcal{L} \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{\sum_jw_j}\sum_{j=1}^Mw_j\Delta_\psi r_\psi(\tau_j)\;\;\;\;w_j=\frac{p(\tau)\exp(r_\psi(\tau_j))}{\pi(\tau_j)}\]

\[\begin{align}
w_j&amp;=\frac{p(\tau)\exp(r_\psi(\tau_j))}{\pi(\tau_j)}\\
&amp;=\frac{p(s_1)\prod_tp(s_{t+1} \mid s_t,a_t)\exp(r_\psi(s_t,a_t))}{p(s_1)\prod_tp(s_{t+1} \mid s_t,a_t)\pi(a_t \mid s_t)}\\
&amp;=\frac{\exp(\sum_t r_\psi(s_t,a_t))}{\prod_t\pi(a_t \mid s_t)}
\end{align}\]

<p>each policy update w.r.t. $r_\psi$ brings us closer to the target distribution!</p>

<p>paper: Guided Cost Learning. Finn et al., ICML ’16</p>

<p>This is actually like a game, i.e. a GAN: the reward update $\Delta_\psi\mathcal{L} \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{\sum_jw_j}\sum_{j=1}^Mw_j\Delta_\psi r_\psi(\tau_j)$ makes demos more likely and samples less likely, while the policy update $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)r_\psi(\tau_j)$ changes the policy to make its samples <em>harder</em> to distinguish from demos.</p>

<h5 id="inverse-rl-as-a-generative-adversarial-networks-gan">Inverse RL as a Generative adversarial Networks (GAN)</h5>

<p>In GAN, the best discriminator is $D^*(x)=\frac{p^*(x)}{p_\theta(x)+p^*(x)}$</p>

<p>For IRL, optimal policy approaches $\pi_\theta(\tau)\propto p(\tau)\exp(r_\psi(\tau))$</p>

<p>choose this parameterization for discriminator:</p>

\[\begin{align}
D_\psi(\tau)&amp;=\frac{p(\tau)\frac{1}{Z}\exp(r_\psi(\tau))}{\pi_\theta(\tau)+p(\tau)\frac{1}{Z}\exp(r_\psi(\tau))}\\
&amp;=\frac{\frac{1}{Z}\exp(r_\psi(\tau))}{\prod_t\pi_\theta(a_t \mid s_t)+\frac{1}{Z}\exp(r_\psi(\tau))}
\end{align}\]

<p>and train as in a GAN: $\psi \gets \arg \max_\psi E_{\tau \sim p^*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]$. This yields the same result as the previous inverse RL update rule, and here we do not need importance weights: we can simply optimize $Z$ with respect to the same objective as $\psi$, so the weighting is absorbed into $Z$.</p>

<p>so the discriminator update is $\psi \gets \arg \max_\psi E_{\tau \sim p^*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]$</p>

<p>the generator update is $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)r_\psi(\tau_j)$</p>

<p>After inverse RL, the reward has been learned; when the environment changes, the learned reward representation generalizes and can be used to learn a policy in the new environment.</p>

<h5 id="regular-discriminator">Regular discriminator</h5>

<p>Can we just use a regular discriminator?</p>

<p>\(\psi \gets \arg \max_\psi E_{\tau \sim p^*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]\)
and just parametrize $D_\psi(\tau)$ as a standard binary neural net classifier</p>

<p>the generator becomes $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)\log D_\psi(\tau_j)$</p>

<ul>
  <li>this is simpler to set up and optimize</li>
  <li>but the discriminator knows nothing at convergence</li>
  <li>and we do not recover the reward representation; we only get the policy $\pi_\theta$</li>
</ul>

<blockquote>
  <p><em>The original slide for this section appears in the PDF; it is referenced here to avoid broken local paths.</em></p>
</blockquote>

<h2 id="13-transfer-and-multi-task-learning">13. Transfer and Multi-task Learning</h2>

<p>Use prior knowledge</p>

<p><strong>Transfer learning</strong>: using experience from one set of tasks for faster learning and better performance on a new task. In RL, a task is an MDP.</p>

<p><strong>shot</strong>: number of attempts in the target domain</p>

<p><strong>0-shot</strong>: just run a policy trained in the source domain</p>

<p><strong>1-shot</strong>: try the task once</p>

<p><strong>few shot</strong>: try the task a few times</p>

<ol>
  <li>“forward” transfer: train on one task, transfer to a new task
    <ul>
      <li>just try it and hope for the best</li>
      <li>fine-tune on the new task</li>
      <li>randomize the source domain</li>
    </ul>
  </li>
  <li>Multi-task transfer: train on many tasks, transfer to a new task
    <ul>
      <li>generate highly randomized source domains</li>
      <li>model-based reinforcement learning</li>
      <li>model distillation</li>
      <li>contextual policies</li>
      <li>modular policy networks</li>
    </ul>
  </li>
  <li>Multi-task meta-learning: learn to learn from many tasks
    <ul>
      <li>RNN-based meta-learning</li>
      <li>gradient-based meta-learning</li>
    </ul>
  </li>
</ol>

<p>This lecture is a fairly high-level introduction to transfer and multi-task learning. It lays out possible directions and gives many papers for further study; see the lecture slides for details on the recommended papers.</p>

<h2 id="14-distributed-rl">14. Distributed RL</h2>

<p>2013/2015: DQN: replay buffer</p>

<p>2015: GORILA</p>

<p>2016: A3C: one learner, multiple actors; each actor computes gradients and sends them to the learner</p>

<p>2018: IMPALA: several actors and learners; actors only act and generate data for the learners, and importance sampling (V-trace) corrects for policy lag</p>

<p>2018: Ape-X/R2D2: reintroduces the replay buffer</p>

<p>2019: R2D3</p>

<p>RLlib: Abstractions for Distributed Reinforcement Learning (ICML’18)</p>

<h2 id="15-exploration">15. Exploration</h2>

<h3 id="exploration-in-bandit">Exploration in bandit</h3>

<p>Regret</p>

\[Reg(T)=TE[r(a^*)]-\sum_{t=1}^Tr(a_t)\]

<h4 id="optimistic-exploration">Optimistic exploration</h4>

<p>optimistic estimate: $a=\arg\max \hat{\mu}_a + C \sigma_a$</p>

<p>Intuition: try each arm until you are sure it’s not great</p>

<p>example (UCB): $a=\arg \max_a \hat{\mu}_a+\sqrt{\frac{2\ln T}{N(a)}}$, which achieves $Reg(T)=O(\log T)$</p>
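<p>A minimal sketch of this rule (Gaussian arms and the constants are illustrative assumptions):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def ucb_bandit(true_means, T, seed=0):
    """UCB on a Gaussian bandit; empirical regret grows roughly
    like log T."""
    rng = np.random.default_rng(seed)
    n = len(true_means)
    counts, sums, total = np.zeros(n), np.zeros(n), 0.0
    for t in range(1, T + 1):
        if t &lt;= n:
            a = t - 1                        # pull each arm once first
        else:
            a = int(np.argmax(sums / counts
                              + np.sqrt(2 * np.log(t) / counts)))
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1; sums[a] += r; total += r
    return T * max(true_means) - total       # Reg(T)

print(ucb_bandit([0.1, 0.5, 0.9], T=10_000))
</code></pre></div></div>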

<h4 id="probability-matching-posterior-sampling">Probability matching /posterior sampling</h4>

<p>assume $r(a_i)\sim p_{\theta_i}(r_i)$</p>

<p>this defines a POMDP with $s=[\theta_1,…,\theta_n]$</p>

<p>belief state is $\hat{p}(\theta_1,…,\theta_n)$</p>

<p>idea: sample $\theta_1,…,\theta_n \sim\hat{p}(\theta_1,..,\theta_n)$</p>

<p>pretend the model $\theta_1,..,\theta_n$ is correct</p>

<p>take the optimal action</p>

<p>update the model and repeat the process</p>

<h4 id="information-gain">Information gain</h4>

<p>let $\mathcal{H}(\hat{p}(z))$ be the current entropy of our $z$ estimate</p>

<p>let $\mathcal{H}(\hat{p}(z) \mid y)$ be the entropy of our $z$ estimate after observation $y$</p>

<p>Information gain is</p>

\[IG(z,y)=E_y[\mathcal{H}(\hat{p}(z))-\mathcal{H}(\hat{p}(z) \mid y)]\\
IG(z,y \mid a)=E_y[\mathcal{H}(\hat{p}(z))-\mathcal{H}(\hat{p}(z) \mid y) \mid a]\]

<p>For bandit</p>

<p>$y= r_a, z=\theta_a$</p>

<p>$g(a)=IG(\theta_a,r_a \mid a)$</p>

<p>$\Delta(a)=E[r(a^*)-r(a)]$</p>

<p>choose $a$ according to $\arg\min_a \frac{\Delta(a)^2}{g(a)}$</p>

<h3 id="exploration-in-drl">Exploration in DRL</h3>

<h4 id="optimistic-exploration-in-rl">Optimistic exploration in RL</h4>

<p>UCB: $a=\arg \max \hat{\mu}_a+\sqrt{\frac{2\ln T}{N(a)}}$</p>

<p>In MDPs, count-based exploration: use $N(s,a)$ or $N(s)$ to add <em>exploration bonus</em></p>

<p>use $r^+(s,a)=r(s,a)+\mathcal{B}(N(s))$</p>

<p>use $r^+(s,a)$ instead of $r(s,a)$ with any model-free algorithm</p>

<p>but with raw counts we may never see exactly the same state twice, so we need a representation of the <strong>similarity</strong> of states.</p>

<p>idea: fit a density model $p_\theta(s)$ (or $p_\theta(s,a)$)</p>

<p>$p_\theta(s)$ might be high even for a new $s$ if $s$ is similar to previously seen states</p>

\[P(s)=\frac{N(s)}{n}\\
P'(s)=\frac{N(s)+1}{n+1}\]

<h5 id="exploring-with-pseudo-counts">Exploring with pseudo-counts</h5>

<p>fit model $p_\theta(s)$ to all states $\mathcal{D}$ seen so far</p>

<p>take a step $i$ and observe $s_i$</p>

<p>fit new model $p_{\theta’}(s)$ to $\mathcal{D}\cup s_i$</p>

<p>use $p_\theta(s_i)$ and $p_{\theta’}(s_i)$ to estimate $\hat{N}(s)$</p>

<p>set $r^+(s,a)=r(s,a)+\mathcal{B}(\hat{N}(s))$, and repeat!</p>

<p>How to get $\hat{N}(s)$? Use the two count equations above and solve:</p>

\[\hat{N}(s_i)=\hat{n}p_\theta(s_i)\\
\hat{n}=\frac{1-p_{\theta'}(s_i)}{p_{\theta'}(s_i)-p_\theta(s_i)}\]
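<p>Solving those two equations is a one-liner; a quick sketch with a sanity check against known counts:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def pseudo_count(p, p_new):
    """Solve p = N/n and p' = (N+1)/(n+1) for N:
    N = p (1 - p') / (p' - p)."""
    return p * (1.0 - p_new) / (p_new - p)

# Sanity check against true counts N=3, n=10:
# p = 3/10 and p' = 4/11 should recover N_hat = 3.0
print(pseudo_count(0.3, 4 / 11))
</code></pre></div></div>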

<h5 id="what-kind-of-bonus-to-use-many-chooses">What kind of bonus to use? many chooses</h5>

<p>UCB: $\mathcal{B}(N(s))=\sqrt{\frac{2\ln T}{N(s)}}$</p>

<p>MBIE-EB: $\mathcal{B}(N(s))=\sqrt{\frac{1}{N(s)}}$</p>

<p>BEB: $\mathcal{B}(N(s))=\frac{1}{N(s)}$</p>

<h5 id="what-kind-of-model-p_thetas-to-use">What kind of model ($p_\theta(s)$) to use?</h5>

<ul>
  <li>
    <p>Bellemare et al.: “CTS” model: condition each pixel on its top-left neighborhood</p>
  </li>
  <li>
    <p>Counting with hashes</p>
  </li>
</ul>

<p>idea: compress $s$ into a $k$-bit code via $\phi (s)$, then count $N(\phi(s))$</p>

<p>shorter codes = more hash collisions</p>

<p>Can use VAE compression to get hash</p>

<ul>
  <li>implicit density modeling with exemplar model</li>
</ul>

<p>explicitly compare to new state to past states</p>

<p>Intuition: the state is <strong>novel</strong> if it is <strong>easy</strong> to <strong>distinguish</strong> from all previously seen states by a classifier</p>

<p>for each observed state $s$, fit a classifier to distinguish that state from all past states $\mathcal{D}$, and use the classifier error to obtain a density</p>

\[p_\theta(s)=\frac{1-D_s(s)}{D_s(s)}\]

<p>In practice, just train one amortized model that takes the exemplar as input</p>

<p>for details see Fu et al., “EX2: Exploration with Exemplar Models”</p>

<ul>
  <li>Heuristic estimation of counts via errors</li>
</ul>

<p>idea: we do not need the densities, just something that tells us whether the state is novel!</p>

<p>let’s say we have some target function $f^*(s,a)$; it can be any function, just a fixed mapping of $(s,a)$.</p>

<p>given our buffer $\mathcal{D}=\{(s_i,a_i)\}$, fit $\hat{f}_\theta(s,a)$</p>

<p>use $\xi(s,a)= \mid \mid \hat{f}_\theta(s,a)-f^*(s,a) \mid \mid ^2$ as bonus</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>What should we use for $f^*(s,a)$?

One common choice: set $f^*(s,a)=s'$

Even simpler: $f^*(s,a)=f_\phi(s,a)$, where $\phi$ is a *random* parameter vector (random network distillation)
</code></pre></div></div>
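<p>A minimal random network distillation sketch (assuming PyTorch; all layer sizes are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

obs_dim, feat_dim = 8, 32                    # illustrative sizes
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                       nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)                  # f*: frozen random network
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                          nn.Linear(64, feat_dim))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def bonus(s):
    """xi(s) = ||f_hat(s) - f*(s)||^2: large on novel states."""
    return (predictor(s) - target(s)).pow(2).sum(-1)

def train_step(batch_states):
    """Fit f_hat to f* on visited states, shrinking their bonus."""
    loss = bonus(batch_states).mean()
    opt.zero_grad(); loss.backward(); opt.step()
</code></pre></div></div>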

<h4 id="posterior-sampling-in-deep-rl">Posterior sampling in deep RL</h4>

<ol>
  <li>sample a Q-function $Q$ from $p(Q)$</li>
  <li>act according to $Q$ for one episode</li>
  <li>update $p(Q)$</li>
</ol>

<p>Bootstrap</p>

<ol>
  <li>given a dataset $\mathcal{D}$, re-sample with replacement N times to get $\mathcal{D}_1,…,\mathcal{D}_N$</li>
  <li>train each model $f_{\theta_i}$ on $\mathcal{D}_i$</li>
  <li>to sample from $p(\theta)$, sample $i \in[1,…,N]$ and use $f_{\theta_i}$</li>
</ol>

<p>but training $N$ big neural nets is expensive, so we use a single shared base network with a separate head for each model. In practice we often skip the re-sampling, since different random initializations already give enough diversity.</p>

<h4 id="reasoning-about-information-gain">Reasoning about information gain</h4>

<p>approximations:</p>

<p>prediction gain: $\log p_{\theta’}(s)-\log p_\theta(s)$</p>

<p>intuition: if density changed a lot, the state was novel</p>

<p>variational inference:</p>

<p>IG can be equivalently written as $D_{KL}(p(z \mid y) \mid \mid p(z))$</p>

<p>learn about <em>transitions</em> $p_\theta(s_{t+1} \mid s_t,a_t): z=\theta$</p>

<p>$y = (s_t,a_t,s_{t+1})$, giving $D_{KL}(p(\theta \mid h,s_t,a_t,s_{t+1}) \mid \mid p(\theta \mid h))$, where $h$ is the history of all prior transitions</p>

<p>intuition: a transition is more informative if it causes belief over $\theta$ to change</p>

<p>idea: use variational inference to estimate $q(\theta \mid \phi)\approx p(\theta \mid h)$</p>

<p>given new transition $(s,a,s’)$, update $\phi$ to get $\phi’$</p>

<p>and use $D_{KL}(q(\theta \mid \phi’) \mid \mid q(\theta \mid \phi))$ as the approximate bonus</p>

<p>for more details, see Houthooft et al., “VIME”</p>

<h3 id="imitation-learning-vs-reinforcement-learning">Imitation learning vs. Reinforcement learning</h3>

<p>Imitation learning</p>

<ul>
  <li>requires demonstrations</li>
  <li>must address distributional shift</li>
  <li>Simple, stable supervised learning+</li>
  <li>Only as good as the demo</li>
</ul>

<p>Reinforcement learning</p>

<ul>
  <li>Require reward function</li>
  <li>Must address exploration</li>
  <li>Potential non-convergent RL</li>
  <li>Can become arbitrarily good+</li>
</ul>

<p>Can we get the best of both?</p>

<p>we have both demonstration and rewards</p>

<p>IRL already addresses distributional shift via RL, but it doesn’t use a known reward function!</p>

<h4 id="simplest-combination-pre-train-by-imitation--fine-tune-by-rl">Simplest combination: pre-train by imitation &amp; fine-tune by RL</h4>

<ol>
  <li>collect demonstration data $(s_i,a_i)$</li>
  <li>initialize $\pi_\theta$ as $\max_\theta \sum_i \log \pi_\theta(a_i \mid s_i)$</li>
  <li>run $\pi_\theta$ to collect experience</li>
  <li>improve $\pi_\theta$ with any RL algorithm and repeat 3 and 4</li>
</ol>

<p>but in step 3 the policy can be very bad due to distribution shift, so the first batch of bad data can destroy the initialization.</p>

<h4 id="off-policy-rl">Off-policy RL</h4>

<p>we can address this with off-policy RL. Off-policy RL can use any data, so we can keep the demonstrations in the buffer.</p>

<ul>
  <li>off-policy policy gradient (with importance sampling)</li>
  <li>off-policy Q-learning</li>
</ul>

<h5 id="policy-gradient-with-demonstrations">Policy gradient with demonstrations</h5>

\[\Delta_\theta J(\theta)=\sum_{\tau \in \mathcal{D}}\left[\sum_{t=1}^T\Delta_\theta\log\pi_\theta(a_t \mid s_t)\left(\prod_{t'=1}^t\frac{\pi_\theta(a_{t'} \mid s_{t'})}{q(a_{t'} \mid s_{t'})}\right)\left(\sum_{t'=t}^Tr(s_{t'},a_{t'})\right)\right]\]

<p>where the $\mathcal{D}$ includes both demo data and policy data</p>

<p>Problem 1: which distribution did the demonstrations come from?</p>

<ul>
  <li>
    <p>option 1: use supervised behavior cloning to approximate $\pi_{demo}$</p>
  </li>
  <li>
<p>option 2: assume a Dirac delta: $\pi_{demo}(\tau)=\frac{1}{N}\delta (\tau \in \mathcal{D})$. This works best with self-normalized importance sampling (see the sketch after this list):
$E_{p(x)}[f(x)]\approx\frac{1}{\sum_j\frac{p(x_j)}{q(x_j)}}\sum_i\frac{p(x_i)}{q(x_i)}f(x_i)$</p>
  </li>
</ul>
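<p>A tiny sketch of the self-normalized estimator from option 2 (the arguments are hypothetical callables returning log-densities and values):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def snis(f, xs, log_p, log_q):
    """Self-normalized importance sampling:
    E_p[f] ~ sum_i w_i f(x_i) / sum_j w_j, with w = p/q computed
    stably in log space (the shared max cancels in the ratio)."""
    log_w = log_p(xs) - log_q(xs)
    w = np.exp(log_w - log_w.max())
    return np.sum(w * f(xs)) / np.sum(w)
</code></pre></div></div>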

<p>Problem 2: what to do if we have multiple distributions</p>

<ul>
  <li><em>fusion</em> distribution: $q(x)=\frac{1}{M}\sum_iq_i(x)$</li>
</ul>

<h5 id="q-learning-with-demonstrations">Q-learning with demonstrations</h5>

<p>just drop the demonstration data into the replay buffer</p>

<p>What’s the problem?</p>

<p>importance sampling: recipe for getting stuck</p>

<p>Q-learning: good data alone is not enough; with only good data it is hard to fit accurate Q-values, because nothing shows what bad actions look like</p>

<p>more problems with this highly off-policy Q-learning:</p>

<p>this is highly off-policy, so we are no longer using the current policy to collect data; if the Q function makes a mistake, it is hard to fix it, and we can end up training on garbage.</p>

<p>to address this problem,</p>

\[Q(s,a) \gets r(s,a)+E_{a'\sim\pi_{new}}[Q(s',a')]\]

<p>How to pick $\pi_{new}(a \mid s)$?</p>

<p>option 1: stay close to $\beta$</p>

<ul>
  <li>
    <p>e.g. $D_{KL}(\pi_{new}(. \mid s) \mid \mid \beta(. \mid s))\le \epsilon$</p>
  </li>
  <li>
    <p>issue 1: we don’t know $\beta$</p>
  </li>
  <li>
    <p>issue 2: this is way too conservative</p>
  </li>
</ul>

<p>option 2: constrain to the support of $\beta$; see these two papers:</p>

<ul>
  <li>
<p>Kumar et al., Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction</p>
  </li>
  <li>
<p>Fujimoto et al., Off-Policy Deep Reinforcement Learning without Exploration</p>
  </li>
</ul>

<h4 id="imitation-as-an-auxiliary-loss-function">Imitation as an auxiliary loss function</h4>

<p>imitation objective: $\sum_{(s,a) \in \mathcal{D}_{demo}}\log \pi_\theta(a \mid s)$</p>

<p>RL objective: $E_{\pi_\theta}[r(s,a)]$</p>

<p>hybrid objective: $E_{\pi_\theta} [r(s,a)]+\lambda \sum_{(s,a) \in \mathcal{D}_{demo}} \log \pi_\theta(a \mid s)$</p>

<p>Hybrid Q-learning:</p>

\[J(Q)=J_{DQ}(Q)+\lambda_1J_n(Q)+\lambda_2J_E(Q)+\lambda_3J_{L2}(Q)\\
J_E(Q)=\max_{a\in A}[Q(s,a)+l(a_E,a)]-Q(s,a_E)\]

<p>where $J_{DQ}$ is the Q-learning loss, $J_n(Q)$ is the n-step Q-learning loss, $J_E$ is a large-margin imitation loss ($l(a_E,a)$ is zero for the expert action and positive otherwise), and $J_{L2}$ is a regularization loss</p>
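<p>A sketch of the large-margin term $J_E$ (assuming PyTorch and a discrete action space; the margin value is an illustrative constant):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def margin_loss(q_values, a_expert, margin=0.8):
    """Large-margin imitation loss J_E:
    max_a [Q(s,a) + l(a_E, a)] - Q(s, a_E),
    with l(a_E, a) = margin for a != a_E and 0 for the expert action."""
    l = torch.full_like(q_values, margin)
    l.scatter_(1, a_expert.unsqueeze(1), 0.0)   # zero margin at a_E
    best = (q_values + l).max(dim=1).values
    expert_q = q_values.gather(1, a_expert.unsqueeze(1)).squeeze(1)
    return (best - expert_q).mean()

q = torch.randn(4, 3, requires_grad=True)       # Q(s, .) for a batch
print(margin_loss(q, torch.tensor([0, 2, 1, 0])))
</code></pre></div></div>

<p>The loss is zero only when the expert action beats every other action by the margin, which pushes the Q-values of demonstrated actions above the rest.</p>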

<p>what’s the problem?</p>

<ul>
  <li>need to tune the weight</li>
  <li>The design of the objective, esp. for imitation, takes a lot of care</li>
  <li>Algorithm becomes problem-dependent</li>
</ul>

<h2 id="16-meta-reinforcement-learning">16 Meta Reinforcement learning</h2>

<p>This part introduced many meta-learning methods at a high level; find and read the papers later.</p>

<h2 id="17-information-theory-challenges-open-problems">17 Information theory, challenges, open problems</h2>

<h3 id="information-theory">Information theory</h3>

<p>entropy</p>

\[\mathcal{H}(p(x))=-E_{x\sim p(x)}[\log p(x)]\]

<p>mutual information</p>

\[\begin{align}
\mathcal{I}(x,y)&amp;=D_{KL}(p(x,y) \mid \mid p(x)p(y))\\
&amp;=E_{(x,y)\sim p(x,y)}\left[\log \frac{p(x,y)}{p(x)p(y)}\right]\\
&amp;=\mathcal{H}(p(y))-\mathcal{H}(p(y \mid x))
\end{align}\]

<p>define $\pi(s)$ as the state <em>marginal</em> distribution of policy $\pi$</p>

<p>$\mathcal{H}(\pi(s))$ state <em>marginal</em> entropy of policy $\pi$</p>

<p>empowerment: $\mathcal{I}(s_{t+1},a_t)=\mathcal{H}(s_{t+1})-\mathcal{H}(s_{t+1} \mid a_t)$</p>

<h3 id="learning-without-a-reward-function-by-reaching-goals">Learning without a reward function by reaching goals</h3>

<p>one way is to give goal states, e.g. using a VAE to generate goals</p>

<ol>
  <li>Propose goal: $z_g \sim p(z)$, $x_g \sim p_\theta(x_g \mid z_g)$</li>
  <li>Attempt to reach goal using $\pi(a \mid x,x_g)$, reach $\bar{x}$</li>
  <li>Use data to update $\pi$</li>
  <li>Use data to update $p_\theta(x_g \mid z_g)$, $q_\phi(z_g \mid x_g)$</li>
</ol>

<p>but how do we get diverse goals?</p>

<p>in step 4</p>

<p>the standard MLE: $\theta, \phi \gets \arg \max _{\theta, \phi}E[\log p(\bar{x})]$</p>

<p>the weighted MLE: $\theta, \phi \gets \arg \max _{\theta, \phi}E[w(\bar{x})\log p(\bar{x})]$</p>

<p>where $w(\bar{x})=p_\theta(\bar{x})^\alpha$</p>

<p>key result: for any $\alpha \in [-1,0)$, entropy $\mathcal{H}(p_\theta(x))$ increases!</p>

<p>This is actually doing $\max \mathcal{H}(p(G))$</p>

<p>and what does RL do?</p>

<p>$\pi(a \mid S,G)$ is trained to reach goal $G$; as $\pi$ gets better, the final state $S$ gets closer to $G$,</p>

<p>that means $p(G \mid S)$ becomes more deterministic!</p>

<p>so we are actually doing this:</p>

\[\max \mathcal{H}(p(G)) - \mathcal{H}(p(G \mid S)) = \max \mathcal{I}(S;G)\]

<h3 id="learning-diverse-skills">Learning diverse skills</h3>

<p>$\pi(a \mid s,z)$, where $z$ is the task (skill) index.</p>

<p>Intuition: different skill should visit different state-space regions</p>

<p>Diversity-promoting reward function</p>

\[\pi(a \mid s,z) = \arg\max_\pi \sum_z \mathbb{E}_{s \sim \pi(s \mid z)}[r(s,z)]\]

<p>where $r(s,z)= \log p(z \mid s)$, rewarding states that are unlikely for other $z’ \ne z$.</p>

<p>Here, once $z$ is sampled, learning simply maximizes the discriminability of the states this policy visits by tuning $\pi(s \mid z)$; it turns out that different values of $z$ acquire different skills. Ref. “Diversity Is All You Need” (ICLR 2019)</p>

<p>In fact this is also goal reaching.</p>

\[\mathcal{I}(z,s)=\mathcal{H}(z)-\mathcal{H}(z \mid s)\]

<p>we maximize $\mathcal{H}(z)$ by sampling $z$ uniformly from $p(z)$, and minimize $\mathcal{H}(z \mid s)$ with the algorithm above.</p>
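<p>A sketch of the diversity-promoting reward (assuming PyTorch; the discriminator architecture and sizes are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

n_skills, s_dim = 8, 4                       # illustrative sizes
disc = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                     nn.Linear(64, n_skills))    # models p(z | s)

def diversity_reward(s, z):
    """r(s, z) = log p(z | s): high when the skill z is easy to
    infer from the visited state s."""
    log_pz = torch.log_softmax(disc(s), dim=-1)
    return log_pz.gather(-1, z.unsqueeze(-1)).squeeze(-1)

r = diversity_reward(torch.randn(5, s_dim),
                     torch.randint(0, n_skills, (5,)))

# The discriminator itself is trained as an ordinary classifier on
# (s, z) pairs, which drives H(z | s) down, while sampling z uniformly
# keeps H(z) at its maximum.
</code></pre></div></div>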

<h3 id="unsupervised-reinforcement-learning-for-meta-learning">Unsupervised reinforcement learning for meta-learning</h3>

<p>use unsupervised reinforcement learning to propose tasks for meta-learning.</p>

<p>First, use unsupervised RL to generate tasks; then run meta-learning on those tasks with the reward functions obtained in the previous step.</p>

<h3 id="challenges-in-deep-reinforcement-learning">Challenges in deep reinforcement learning</h3>

<p>Core algorithm:</p>

<ul>
  <li>Stability</li>
  <li>Efficiency</li>
  <li>Generalization</li>
</ul>

<p>Assumptions:</p>

<ul>
  <li>Problem formulation</li>
  <li>Supervision</li>
</ul>

<h4 id="stability-and-hyper-parameter-tuning">Stability and hyper-parameter tuning</h4>

<ul>
  <li>Devising stable RL algorithms is very hard</li>
  <li>Q-learning/value function estimation
    <ul>
      <li>No guarantee of convergence</li>
      <li>Lots of parameters for stability: target network delay, replay buffer size, clipping, sensitivity to learning rates, etc.</li>
    </ul>
  </li>
  <li>Policy gradient/likelihood ratio/REINFORCE
    <ul>
      <li>Very high variance gradient estimator</li>
      <li>Lots of samples, complex baselines, etc.</li>
      <li>Parameters: batch size, learning rate, design of baseline</li>
    </ul>
  </li>
  <li>Model-based RL algorithms
    <ul>
      <li>Model class and fitting method</li>
      <li>Optimizing the policy w.r.t. the model is non-trivial due to back-propagation through time</li>
      <li>More subtle issue: the policy tends to <em>exploit</em> the model</li>
    </ul>
  </li>
</ul>

<p>The challenge with hyper-parameters is severe</p>

<ul>
  <li>Algorithms with favorable improvement and convergence properties
    <ul>
      <li>TRPO</li>
      <li>Safe reinforcement learning, high-confidence policy improvement [Thomas ‘15]</li>
    </ul>
  </li>
  <li>Algorithms that adaptively adjust parameters
    <ul>
      <li>Q-Prop</li>
    </ul>
  </li>
</ul>

<p><strong>Not great for beating benchmarks</strong>, but absolutely essential to make RL a viable tool for real-world problems.</p>

<h4 id="sample-complexity">Sample complexity</h4>

<ul>
  <li>real-world learning becomes difficult or impractical</li>
  <li>Precludes the use of expensive, high-fidelity simulators</li>
  <li>Limits applicability to real-world problems</li>
</ul>

<p>what can we do?</p>

<ul>
  <li>Better model-based RL algorithms</li>
  <li>Design faster algorithms
    <ul>
      <li>Addressing Function Approximation Error in Actor-Critic Algorithms [Fujimoto et al. ‘18]</li>
      <li>Soft Actor-Critic</li>
    </ul>
  </li>
  <li>Reuse prior knowledge to accelerate reinforcement learning
    <ul>
      <li>RL2 [Duan et al. ‘17]</li>
      <li>Learning to reinforcement learn [Wang et al. ‘17]</li>
      <li>MAML [Finn et al. ‘17]</li>
    </ul>
  </li>
</ul>

<h4 id="scaling--generalization">Scaling &amp; Generalization</h4>

<ul>
  <li>Small-scale</li>
  <li>Emphasizes mastery</li>
  <li>Evaluated on performance</li>
  <li>Where is the generalization?</li>
</ul>

<p>Reinforcement learning needs to re-collect data during training</p>

<h3 id="assumption-problems">Assumption problems</h3>

<p>Single task or multi-task</p>

<ul>
  <li>Train on multiple tasks, then try to generalize or fine-tune
    <ul>
      <li>policy distillation</li>
      <li>Actor-Mimic</li>
      <li>MAML</li>
    </ul>
  </li>
  <li>Unsupervised or weakly supervised learning of diverse behaviors
    <ul>
      <li>stochastic neural networks</li>
      <li>reinforcement learning with deep energy-based policies</li>
    </ul>
  </li>
</ul>

<p>Where does the supervision come from?</p>

<ul>
  <li>find some different tasks</li>
  <li>learn objectives/rewards from demonstration (IRL)</li>
  <li>Generate objectives automatically</li>
</ul>

<p>What is the role of the reward function?</p>

<p>Unsupervised reinforcement learning</p>

<ol>
  <li>
    <p>Interaction with the world without a reward function</p>
  </li>
  <li>
    <p>Learning something about the world</p>
  </li>
  <li>
    <p>Use what you learned to quickly solve new tasks</p>
  </li>
</ol>

<p>Other sources of supervision</p>

<ul>
  <li>Demonstrations</li>
  <li>Language</li>
  <li>Human preferences</li>
</ul>

<p>Where does the supervision signal come from?</p>

<ul>
  <li>Yann LeCun’s cake</li>
  <li>Unsupervised or self-supervised learning</li>
  <li>Model learning (predict the future)</li>
  <li>Generative modeling of the world</li>
  <li>Lots to do even before you accomplish your goal</li>
  <li>Imitation &amp; understanding other agents</li>
  <li>The giant value backup</li>
</ul>

<h2 id="18-rethinking-reinforcement-learning-from-the-perspective-of-generalization-chelsea-finn">18 Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)</h2>

<p>Meta-learning:</p>

<p>Learning to Learn with Gradients. Finn, PhD thesis, 2018</p>

<p>Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. ICML 2019</p>

<p>These algorithms only adapt to similar tasks; they cannot adapt to <strong>entirely new tasks</strong>!</p>

<p>If we want that, we need to make sure the meta-training task distribution matches the meta-test task distribution.</p>

<ul>
  <li>
    <p>Algorithms: learn something more general than a policy, e.g. from demonstrations and trials, or other supervision; see Meta-World</p>
  </li>
  <li>
    <p>Task representation: how do we specify the task, via language or via goals?</p>
  </li>
  <li>
    <p>Data: RoboNet</p>
  </li>
</ul>

]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="note" /><summary type="html"><![CDATA[Deep Reinforcement learning notes UBC]]></summary></entry></feed>