<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://dongdongbh.tech/feed.xml" rel="self" type="application/atom+xml" /><link href="https://dongdongbh.tech/" rel="alternate" type="text/html" /><updated>2026-03-22T02:03:43-04:00</updated><id>https://dongdongbh.tech/feed.xml</id><title type="html">Dongda’s homepage</title><subtitle>Homepage of Dongda Li, an amazing website.</subtitle><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><entry><title type="html">Mindwtr: The Best Free, Open-Source GTD App for All Platforms</title><link href="https://dongdongbh.tech/blog/mindwtr/" rel="alternate" type="text/html" title="Mindwtr: The Best Free, Open-Source GTD App for All Platforms" /><published>2025-12-10T00:00:00-05:00</published><updated>2026-03-22T02:03:37-04:00</updated><id>https://dongdongbh.tech/blog/mindwtr</id><content type="html" xml:base="https://dongdongbh.tech/blog/mindwtr/"><![CDATA[<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "Mindwtr",
  "applicationCategory": "ProductivityApplication",
  "operatingSystem": "Windows, macOS, Linux, Android, iOS",
  "description": "Free, open-source GTD (Getting Things Done) app. Local-first, cross-platform, no account required.",
  "url": "https://dongdongbh.tech/blog/mindwtr/",
  "downloadUrl": "https://github.com/dongdongbh/Mindwtr/releases",
  "codeRepository": "https://github.com/dongdongbh/Mindwtr",
  "installUrl": [
    "https://apps.apple.com/app/mindwtr/id6758597144",
    "https://play.google.com/store/apps/details?id=tech.dongdongbh.mindwtr",
    "https://apps.microsoft.com/detail/9n0v5b0b6frx"
  ],
  "screenshot": "https://dongdongbh.tech/assets/images/mindwtr-og.png",
  "keywords": ["GTD", "Getting Things Done", "task management", "productivity", "open source", "local-first"],
  "author": {
    "@type": "Person",
    "name": "Dongda Li",
    "url": "https://dongdongbh.tech"
  },
  "offers": {
    "@type": "Offer",
    "price": "0",
    "priceCurrency": "USD"
  },
  "license": "https://opensource.org/licenses/AGPL-3.0",
  "isAccessibleForFree": true,
  "featureList": [
    "GTD workflow (Capture, Clarify, Organize, Reflect, Engage)",
    "Local-first data model",
    "Cross-platform (Windows, macOS, Linux, Android, iOS, Web)",
    "WebDAV, Dropbox, and self-hosted sync",
    "AI copilot with BYOK and local LLM support",
    "Obsidian integration",
    "CLI, REST API, and MCP server"
  ]
}
</script>

<h2 id="what-is-mindwtr">What is Mindwtr?</h2>

<p><strong>Mindwtr is a free, open-source GTD (Getting Things Done) application</strong> that runs on Windows, macOS, Linux, Android, and iOS. It is local-first, requires no account, and implements the complete GTD methodology — from inbox capture to weekly review.</p>

<p>I built Mindwtr because I could not find a GTD app that matched how I actually live and think. I wanted something calm, fast, and honest. Not a product designed to maximize screen time. Not a tool that makes me pay forever just to keep access to my own tasks. And not an app that treats Getting Things Done like a checklist of trendy features.</p>

<p>I also needed true cross-platform support. My day moves across different devices and operating systems, so I wanted one GTD system that follows me everywhere instead of forcing me into one ecosystem.</p>

<p>I wanted a system I could trust for years:</p>
<ul>
  <li>my data stays mine</li>
  <li>the workflow stays clear</li>
  <li>the app stays useful even without a central hosted service</li>
  <li>the experience stays consistent across platforms</li>
</ul>

<p>So I started building Mindwtr.</p>

<h2 id="why-gtd-and-why-a-dedicated-gtd-app-matters">Why GTD, and why a dedicated GTD app matters</h2>

<p>For me, GTD is not about being “productive” in a social-media sense. It is about mental clarity.</p>

<p>My brain works better when it does not need to remember everything. The Getting Things Done method gives me a reliable loop:</p>
<ul>
  <li><strong>Capture</strong> what has my attention</li>
  <li><strong>Clarify</strong> what it means</li>
  <li><strong>Organize</strong> it in the right place</li>
  <li><strong>Reflect</strong> regularly</li>
  <li><strong>Engage</strong> with confidence</li>
</ul>

<p>Mindwtr is built around that full GTD workflow. If an app skips these parts, it becomes a simple list manager. I wanted a full GTD practice, not a bucket of todos. That is the key difference between a task management app and a true GTD application.</p>

<h2 id="how-mindwtr-compares-to-other-gtd-apps">How Mindwtr compares to other GTD apps</h2>

<p>When searching for the best GTD app, you will find options like Todoist, TickTick, OmniFocus, Nirvana, and Everdo. Here is how Mindwtr compares:</p>

<table>
  <thead>
    <tr>
      <th>Capability</th>
      <th>Mindwtr</th>
      <th>Todoist</th>
      <th>TickTick</th>
      <th>OmniFocus</th>
      <th>NirvanaHQ</th>
      <th>Everdo</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Open source</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
    </tr>
    <tr>
      <td>GTD-native workflow</td>
      <td>Yes</td>
      <td>Partial</td>
      <td>Partial</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>All major platforms (incl. Linux)</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>Apple only</td>
      <td>Web + mobile</td>
      <td>No mobile</td>
    </tr>
    <tr>
      <td>Local-first, no account required</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
      <td>Yes</td>
      <td>No</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>AI assistant (BYOK + local LLM)</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Flexible sync (WebDAV / Dropbox / self-hosted)</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>Partial</td>
    </tr>
    <tr>
      <td>Completely free</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
    </tr>
  </tbody>
</table>

<p>Mindwtr is the only GTD app that combines open source, full cross-platform support including Linux, local-first data, and a complete Getting Things Done workflow — all for free.</p>

<h2 id="key-features-of-mindwtr-as-a-gtd-application">Key features of Mindwtr as a GTD application</h2>

<ul>
  <li><strong>Full GTD workflow</strong>: Capture, Clarify, Organize, Reflect, Engage — end to end.</li>
  <li><strong>Focus view</strong>: combines a time-based agenda with context-filtered next actions.</li>
  <li><strong>Local-first data</strong>: file-based storage with optional WebDAV, Dropbox, or self-hosted cloud sync.</li>
  <li><strong>Obsidian integration</strong>: import tasks from your Obsidian vault with deep links on desktop.</li>
  <li><strong>AI copilot</strong> (optional): clarify, break down, and review tasks with BYOK AI (OpenAI, Gemini, Claude, or local LLMs).</li>
  <li><strong>Cross-platform</strong>: desktop apps (Tauri v2) for Windows, macOS, Linux; mobile apps (React Native) for Android and iOS; plus a PWA.</li>
  <li><strong>Automation</strong>: CLI, REST API, and MCP server for LLM-powered workflows.</li>
  <li><strong>Weekly review wizard</strong>: guided review with reminders to keep your GTD system current.</li>
  <li><strong>Pomodoro timer</strong>: optional focus timer integrated into the Focus view.</li>
  <li><strong>16 languages</strong>: English, Chinese, Spanish, Hindi, Arabic, German, Russian, Japanese, French, Portuguese, Polish, Korean, Italian, Turkish, Dutch, and more.</li>
</ul>

<h2 id="philosophy-a-calm-gtd-app">Philosophy: a calm GTD app</h2>

<p>Mindwtr follows a simple principle:
<strong>simple by default, powerful when needed.</strong></p>

<p>That means:</p>
<ul>
  <li>progressive disclosure: advanced options appear when they matter</li>
  <li>less by default: fewer knobs, less noise, less cognitive load</li>
  <li>avoid feature creep: clarity over clutter</li>
  <li>local-first foundation: your system should work even when the internet is unreliable</li>
  <li>practical cross-platform: desktop and mobile should feel like one trusted system</li>
</ul>

<p>I want Mindwtr to feel like a quiet workspace, not a cockpit.</p>

<h2 id="how-a-gtd-app-helps-in-daily-life">How a GTD app helps in daily life</h2>

<p>Most of the value is not dramatic. It is small, repeated relief.</p>

<ul>
  <li>In the morning, I can quickly see what deserves attention today.</li>
  <li>During the day, I can capture tasks before they disappear from memory.</li>
  <li>When I feel overloaded, I can process inbox items and turn ambiguity into clear next actions.</li>
  <li>In weekly review, I can reset direction instead of drifting.</li>
  <li>Across devices, I can keep one trusted system instead of scattered notes and reminders.</li>
</ul>

<p>That is the core promise of a good GTD application: less mental friction, better decisions, and more calm.</p>

<p>And because Mindwtr supports almost all major platforms, I do not have to rebuild my workflow when I switch devices.</p>

<h2 id="why-an-open-source-gtd-app-matters">Why an open-source GTD app matters</h2>

<p>Mindwtr is free and open source because this kind of tool should be inspectable, adaptable, and community-owned.</p>

<p>Open source means:</p>
<ul>
  <li>no lock-in by design</li>
  <li>transparent behavior</li>
  <li>contributions from real users</li>
  <li>long-term sustainability beyond one company roadmap</li>
</ul>

<p>If something feels wrong, anyone can report it. If something can be better, anyone can improve it. That keeps the project honest.</p>

<p>Most GTD apps on the market are proprietary and require monthly subscriptions. Mindwtr proves that a high-quality Getting Things Done application can be free, open, and community-driven.</p>

<h2 id="get-mindwtr--free-gtd-app-for-all-platforms">Get Mindwtr — free GTD app for all platforms</h2>

<p>Mindwtr started as a personal need, but it became a shared tool for people who want a practical GTD system without noise, lock-in, or subscription pressure.</p>

<p>Today it runs across almost all major platforms: <strong>Windows, macOS, Linux, Android, and iOS</strong>.</p>

<p><strong>Install Mindwtr:</strong></p>

<table>
  <thead>
    <tr>
      <th>Platform</th>
      <th>Install</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Windows</td>
      <td><a href="https://apps.microsoft.com/detail/9n0v5b0b6frx">Microsoft Store</a>, <a href="https://winstall.app/apps/dongdongbh.Mindwtr">Winget</a>, <a href="https://github.com/dongdongbh/homebrew-mindwtr">Scoop</a></td>
    </tr>
    <tr>
      <td>macOS</td>
      <td><a href="https://apps.apple.com/app/mindwtr/id6758597144">Mac App Store</a>, <a href="https://formulae.brew.sh/cask/mindwtr">Homebrew</a></td>
    </tr>
    <tr>
      <td>Linux</td>
      <td><a href="https://flathub.org/apps/tech.dongdongbh.mindwtr">Flathub</a>, <a href="https://aur.archlinux.org/packages/mindwtr-bin">AUR</a>, APT, DNF, AppImage</td>
    </tr>
    <tr>
      <td>Android</td>
      <td><a href="https://play.google.com/store/apps/details?id=tech.dongdongbh.mindwtr">Google Play</a>, <a href="https://apt.izzysoft.de/fdroid/index/apk/tech.dongdongbh.mindwtr">IzzyOnDroid</a></td>
    </tr>
    <tr>
      <td>iOS</td>
      <td><a href="https://apps.apple.com/app/mindwtr/id6758597144">App Store</a></td>
    </tr>
    <tr>
      <td>Web</td>
      <td>PWA with Docker self-hosting</td>
    </tr>
  </tbody>
</table>

<p><strong>Links:</strong></p>
<ul>
  <li>GitHub: <a href="https://github.com/dongdongbh/Mindwtr">https://github.com/dongdongbh/Mindwtr</a></li>
  <li>Wiki &amp; documentation: <a href="https://github.com/dongdongbh/Mindwtr/wiki">https://github.com/dongdongbh/Mindwtr/wiki</a></li>
  <li>Issues: <a href="https://github.com/dongdongbh/Mindwtr/issues">https://github.com/dongdongbh/Mindwtr/issues</a></li>
  <li>Discussions: <a href="https://github.com/dongdongbh/Mindwtr/discussions">https://github.com/dongdongbh/Mindwtr/discussions</a></li>
  <li>Discord: <a href="https://discord.gg/ahhFxuDBb4">https://discord.gg/ahhFxuDBb4</a></li>
</ul>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="productivity" /><category term="GTD" /><category term="open-source" /><category term="Getting Things Done" /><category term="task management" /><category term="GTD app" /><category term="cross-platform" /><summary type="html"><![CDATA[Mindwtr is a free, open-source GTD (Getting Things Done) app for Windows, macOS, Linux, Android, and iOS. Local-first, no account required, with full GTD workflow support.]]></summary></entry><entry><title type="html">From Docker to Singularity: Setting Up and Managing Tasks with HTCondor and Slurm</title><link href="https://dongdongbh.tech/blog/singularity/" rel="alternate" type="text/html" title="From Docker to Singularity: Setting Up and Managing Tasks with HTCondor and Slurm" /><published>2025-01-10T00:00:00-05:00</published><updated>2025-11-14T21:59:09-05:00</updated><id>https://dongdongbh.tech/blog/singularity</id><content type="html" xml:base="https://dongdongbh.tech/blog/singularity/"><![CDATA[<h3 id="background"><strong>Background</strong></h3>

<p>When I first started using my university’s computing cluster, I quickly realized I needed to set up custom environments for my tasks. Like many, I initially turned to <strong>Docker</strong>, a popular tool for containerization. However, I soon ran into challenges when using Docker on an HPC cluster. This led me to discover <strong>Singularity</strong>, a container solution specifically designed for HPC environments. In this post, I’ll explain why we need containers in HPC, the key differences between Docker and Singularity, and provide a step-by-step guide to managing tasks with Singularity, HTCondor, and Slurm.</p>

<hr />

<h3 id="why-do-we-need-containers-in-hpc"><strong>Why Do We Need Containers in HPC?</strong></h3>

<p>HPC clusters are shared environments where multiple users run diverse tasks. This can create conflicts:</p>
<ol>
  <li><strong>Dependency Issues</strong>: Programs often require specific libraries, compilers, or environments that may not be installed on the cluster.</li>
  <li><strong>Permission Restrictions</strong>: Most HPC systems don’t grant users <code class="language-plaintext highlighter-rouge">sudo</code> access, making it difficult to install system-level packages.</li>
  <li><strong>Reproducibility</strong>: Without containers, reproducing results across different systems can be challenging.</li>
</ol>

<p><strong>Containers</strong> solve these problems by bundling applications and their dependencies into portable environments. With a container, you can:</p>
<ul>
  <li>Install software that requires <code class="language-plaintext highlighter-rouge">sudo</code> inside the container.</li>
  <li>Run the container on any compatible system without worrying about the host environment.</li>
</ul>
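
<p>As a concrete sketch of that idea, here is a minimal Singularity definition file: everything that needs <code class="language-plaintext highlighter-rouge">sudo</code> is installed once in the <code class="language-plaintext highlighter-rouge">%post</code> section, and the resulting image runs anywhere Singularity is available. The package list is illustrative; swap in your own dependencies.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cuda_image.def (build with: sudo singularity build cuda_image.sif cuda_image.def)
Bootstrap: docker
From: nvidia/cuda:11.8.0-base-ubuntu20.04

%post
    # system-level setup that would require sudo on the host
    apt-get update &amp;&amp; apt-get install -y libjpeg-dev python3 python3-pip
    pip3 install torch torchvision
</code></pre></div></div>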

<hr />

<h3 id="why-use-singularity-for-hpc"><strong>Why Use Singularity for HPC?</strong></h3>

<p>As outlined above, containers solve the dependency, permission, and reproducibility problems of shared clusters. The remaining question is which container runtime is actually suitable for HPC.</p>

<p><strong>Why not Docker?</strong>
Docker isolates the container from the host and requires root privileges to run, which makes it unsuitable for shared HPC systems. <strong>Singularity</strong>, on the other hand:</p>
<ul>
  <li>Integrates seamlessly with the host system (e.g., mounts home directories by default).</li>
  <li>Runs without root privileges, making it safer and compatible with shared environments.</li>
  <li>Allows easy access to host-level resources like GPUs, shared filesystems, and user-installed environments (e.g., Conda).</li>
</ul>

<h3 id="key-difference-between-docker-and-singularity"><strong>Key Difference Between Docker and Singularity</strong></h3>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>Docker</th>
      <th>Singularity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Isolation</strong></td>
      <td>Containers are isolated from the host</td>
      <td>Integrates with the host system</td>
    </tr>
    <tr>
      <td><strong>Root Privileges</strong></td>
      <td>Requires root privileges to run</td>
      <td>Runs without root privileges</td>
    </tr>
    <tr>
      <td><strong>HPC Compatibility</strong></td>
      <td>Not designed for HPC</td>
      <td>Specifically designed for HPC</td>
    </tr>
    <tr>
      <td><strong>Filesystem Access</strong></td>
      <td>Host filesystem is not mounted</td>
      <td>Host home directory is mounted by default</td>
    </tr>
  </tbody>
</table>

<hr />

<h3 id="best-practices-for-singularity-in-hpc"><strong>Best Practices for Singularity in HPC</strong></h3>

<p>The key realization when using Singularity is that you <strong>only need to install software requiring <code class="language-plaintext highlighter-rouge">sudo</code></strong> inside the container. For everything else (e.g., user-level Python packages or Conda environments), you can use the host environment.</p>

<p>For example:</p>
<ol>
  <li>Use Singularity to install system-level dependencies (e.g., CUDA libraries).</li>
  <li>Use the host system for Conda environments, scripts, and datasets.</li>
</ol>
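
<p>The host side of that split needs nothing more than a regular user-level Conda setup, for example (environment name and packages are illustrative):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># on the host, outside any container (no sudo required)
conda create -n my_env python=3.10
conda activate my_env
pip install numpy wandb   # user-level packages live on the host
</code></pre></div></div>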

<hr />

<h3 id="step-by-step-guide-from-docker-to-singularity"><strong>Step-by-Step Guide: From Docker to Singularity</strong></h3>

<h4 id="1-save-a-running-docker-container-as-an-image"><strong>1. Save a Running Docker Container as an Image</strong></h4>
<ol>
  <li><strong>Run and Configure the Docker Container</strong>:
Start a Docker container:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">-it</span> nvidia/cuda:11.8.0-base-ubuntu20.04 bash
</code></pre></div>    </div>
    <p>Inside the container, install system-level dependencies:</p>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get update <span class="o">&amp;&amp;</span> apt-get <span class="nb">install</span> <span class="nt">-y</span> libjpeg-dev python3 python3-pip
pip <span class="nb">install </span>torch torchvision
</code></pre></div>    </div>
  </li>
  <li><strong>Save the Running Container as a Docker Image</strong>:
Get the container ID:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker ps
</code></pre></div>    </div>
    <p>Commit the running container:</p>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker commit &lt;container_id&gt; cuda_image
</code></pre></div>    </div>
  </li>
</ol>

<h4 id="2-convert-the-docker-image-to-a-singularity-sif-file"><strong>2. Convert the Docker Image to a Singularity SIF File</strong></h4>
<p>Use Singularity to convert the Docker image:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>singularity build cuda_image.sif docker-daemon://cuda_image:latest
</code></pre></div></div>
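
<p>If the image is already published to a registry, Singularity can also build the SIF directly from it, with no local Docker daemon involved (shown here with the same CUDA base image; adjust the tag as needed):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>singularity build cuda_image.sif docker://nvidia/cuda:11.8.0-base-ubuntu20.04
</code></pre></div></div>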


<h4 id="3-use-singularity-with-host-resources"><strong>3. Use Singularity with Host Resources</strong></h4>
<p>Run the Singularity container, binding the host’s home directory and using the host’s Conda environment:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>singularity <span class="nb">exec</span> <span class="nt">--bind</span> /home/user:/home/user cuda_image.sif bash <span class="nt">-c</span> <span class="s2">"
  source /home/user/miniconda3/etc/profile.d/conda.sh &amp;&amp;
  conda activate my_env &amp;&amp;
  python /home/user/code/train.py
"</span>
</code></pre></div></div>
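
<p>One detail worth adding for GPU jobs: Singularity only exposes the host’s NVIDIA driver and devices inside the container when you pass the <code class="language-plaintext highlighter-rouge">--nv</code> flag. A GPU variant of the command above looks like:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>singularity exec --nv --bind /home/user:/home/user cuda_image.sif \
  python /home/user/code/train.py
</code></pre></div></div>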

<hr />

<h3 id="running-tasks-with-htcondor"><strong>Running Tasks with HTCondor</strong></h3>

<h4 id="1-wrapper-script"><strong>1. Wrapper Script</strong></h4>
<p>The wrapper script is executed for each task submission. Ensure it uses the proper shebang:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/usr/bin/bash</span>

<span class="c"># Load Conda and activate the environment</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="s2">"/home/user/miniconda3/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>
<span class="nb">source</span> /home/user/miniconda3/etc/profile.d/conda.sh
conda activate my_env

<span class="c"># Run the Singularity container and execute the Python script</span>
singularity <span class="nb">exec</span> <span class="nt">--nv</span> <span class="nt">--bind</span> /home/user:/home/user cuda_image.sif bash <span class="nt">-c</span> <span class="s2">"
  source /home/user/miniconda3/etc/profile.d/conda.sh &amp;&amp;
  conda activate my_env &amp;&amp;
  python /home/user/code/train.py
"</span>
</code></pre></div></div>

<h4 id="2-htcondor-submit-file"><strong>2. HTCondor Submit File</strong></h4>
<p>Create a submission file for your task:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>executable = wrapper.sh
output     = output/task.out
error      = output/task.err
log        = output/task.log
request_gpus = 1
Requirements = (CUDADeviceName == "NVIDIA A100 80GB PCIe")
queue
</code></pre></div></div>

<p>Submit the task:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>condor_submit task.sub
</code></pre></div></div>
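
<p>HTCondor is also convenient for parameter sweeps: one submit file can queue many independent jobs, with the <code class="language-plaintext highlighter-rouge">$(Process)</code> macro expanding to 0, 1, 2, … for each job. A sketch (how <code class="language-plaintext highlighter-rouge">wrapper.sh</code> interprets its argument is up to you):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>executable = wrapper.sh
arguments  = $(Process)
output     = output/task_$(Process).out
error      = output/task_$(Process).err
log        = output/task.log
request_gpus = 1
queue 10
</code></pre></div></div>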

<h4 id="3-monitor-jobs"><strong>3. Monitor Jobs</strong></h4>
<p>Check the status of your jobs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>condor_q
</code></pre></div></div>
<p>Check GPU availability:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>condor_status <span class="nt">-constraint</span> <span class="s1">'CUDADeviceName == "NVIDIA A100 80GB PCIe"'</span>
</code></pre></div></div>
<p>Check GPU users:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>condor_status <span class="nt">-constraint</span> <span class="s1">'CUDADeviceName == "NVIDIA A100 80GB PCIe" &amp;&amp; State == "Claimed"'</span> <span class="nt">-af</span> Name RemoteOwner
</code></pre></div></div>
<hr />

<h3 id="managing-tasks-with-slurm"><strong>Managing Tasks with Slurm</strong></h3>

<p>Slurm is another workload manager, optimized for distributed training and tightly coupled tasks.</p>

<h4 id="1-slurm-script"><strong>1. Slurm Script</strong></h4>
<p>Write a Slurm submission script:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c">#SBATCH --job-name=my_task</span>
<span class="c">#SBATCH --output=task.out</span>
<span class="c">#SBATCH --error=task.err</span>
<span class="c">#SBATCH --gres=gpu:1</span>

singularity <span class="nb">exec</span> <span class="nt">--nv</span> /path/to/cuda_image.sif python /home/user/code/train.py
</code></pre></div></div>

<p>Submit the job:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sbatch task.slurm
</code></pre></div></div>
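
<p>Monitoring and control mirror the HTCondor commands above:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>squeue -u $USER      # list your pending and running jobs
scancel &lt;job_id&gt;     # cancel a job
sacct -j &lt;job_id&gt;    # accounting info for a finished job
</code></pre></div></div>
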
<h3 id="comparing-htcondor-and-slurm"><strong>Comparing HTCondor and Slurm</strong></h3>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>HTCondor</th>
      <th>Slurm</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Best Use Case</strong></td>
      <td>High-throughput, independent tasks</td>
      <td>Distributed, tightly coupled tasks</td>
    </tr>
    <tr>
      <td><strong>GPU Support</strong></td>
      <td>Yes</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td><strong>Ease of Use</strong></td>
      <td>Simple for independent jobs</td>
      <td>Better for multi-node configurations</td>
    </tr>
    <tr>
      <td><strong>Distributed Training</strong></td>
      <td>Not optimized for communication-heavy jobs</td>
      <td>Supports MPI, NCCL, Gloo</td>
    </tr>
  </tbody>
</table>

<hr />

<h3 id="distributed-training-with-singularity"><strong>Distributed Training with Singularity</strong></h3>

<p>For distributed training, tools like <strong>NCCL</strong>, <strong>Gloo</strong>, and <strong>MPI</strong> are critical:</p>
<ol>
  <li><strong>NCCL</strong>: Best for multi-GPU training on NVIDIA hardware.</li>
  <li><strong>Gloo</strong>: General-purpose communication for PyTorch.</li>
  <li><strong>MPI</strong>: High-performance communication for multi-node setups.</li>
</ol>

<p><strong>Why InfiniBand?</strong></p>
<ul>
  <li>Standard Ethernet can become a bottleneck in communication-heavy distributed training.</li>
  <li>InfiniBand provides high-speed, low-latency connections for scaling training across nodes.</li>
</ul>
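
<p>To see what interconnect and topology a node actually offers, a few quick checks help (availability of these tools varies by site):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip link              # look for ib0 / ibp* InfiniBand interfaces
ibstat               # adapter status, if infiniband-diags is installed
nvidia-smi topo -m   # GPU/NIC topology matrix on NVIDIA nodes
</code></pre></div></div>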

<hr />

<h3 id="conclusion"><strong>Conclusion</strong></h3>

<p>Singularity simplifies the process of running containerized tasks on HPC systems. By combining Singularity with HTCondor and Slurm, you can efficiently manage high-throughput and distributed workloads. Use Docker for building containers, but leverage Singularity for running them in HPC environments. And remember: only include system-level dependencies in the container, while keeping user-level tools and data on the host system.</p>

<p>For more information:</p>
<ul>
  <li><a href="https://sylabs.io/docs/">Singularity Documentation</a></li>
  <li><a href="https://docs.docker.com/">Docker Documentation</a></li>
  <li><a href="https://htcondor.readthedocs.io/">HTCondor Documentation</a></li>
  <li><a href="https://slurm.schedmd.com/documentation.html">Slurm Documentation</a></li>
</ul>

<hr />]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[From Docker to Singularity-Setting Up and Managing Tasks with HTCondor and Slurm]]></summary></entry><entry><title type="html">Using `torchrun` for Distributed Training</title><link href="https://dongdongbh.tech/blog/torchrun/" rel="alternate" type="text/html" title="Using `torchrun` for Distributed Training" /><published>2025-01-10T00:00:00-05:00</published><updated>2025-01-10T01:49:21-05:00</updated><id>https://dongdongbh.tech/blog/torchrun</id><content type="html" xml:base="https://dongdongbh.tech/blog/torchrun/"><![CDATA[<p><code class="language-plaintext highlighter-rouge">torchrun</code> is a utility provided by <strong>PyTorch</strong> to simplify launching distributed training jobs. It manages process spawning, inter-process communication, and resource allocation across multiple GPUs and nodes.</p>

<p>Here’s a detailed guide on how to use <code class="language-plaintext highlighter-rouge">torchrun</code> for distributed training:</p>

<hr />

<h3 id="1-understand-distributed-training-concepts"><strong>1. Understand Distributed Training Concepts</strong></h3>
<ul>
  <li><strong>Distributed Data Parallel (DDP)</strong>:
    <ul>
      <li>PyTorch’s <code class="language-plaintext highlighter-rouge">torch.nn.parallel.DistributedDataParallel</code> (DDP) is the backbone for distributed training.</li>
      <li>It splits data across GPUs and synchronizes gradients during training.</li>
    </ul>
  </li>
  <li><strong>Backend Options</strong>:
    <ul>
      <li><strong>NCCL</strong>: Recommended for GPU-based training (supports CUDA).</li>
      <li><strong>Gloo</strong>: Works for CPU-based training or smaller setups.</li>
      <li><strong>MPI</strong>: For large-scale multi-node clusters (requires MPI setup).</li>
    </ul>
  </li>
  <li><strong>Process Groups</strong>:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">torchrun</code> launches a group of processes that communicate with each other.</li>
      <li>Each GPU typically corresponds to one process.</li>
    </ul>
  </li>
</ul>

<hr />

<h3 id="2-install-necessary-dependencies"><strong>2. Install Necessary Dependencies</strong></h3>
<p>Ensure your PyTorch version supports <code class="language-plaintext highlighter-rouge">torchrun</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>torch torchvision torchaudio
</code></pre></div></div>

<p>For multi-node distributed training:</p>
<ul>
  <li><strong>NCCL</strong> is automatically installed with PyTorch.</li>
  <li>For <strong>MPI</strong>, install the required libraries:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nb">install </span>libopenmpi-dev
</code></pre></div>    </div>
  </li>
</ul>

<hr />

<h3 id="3-prepare-your-training-script"><strong>3. Prepare Your Training Script</strong></h3>
<p>Modify your PyTorch training script to use <code class="language-plaintext highlighter-rouge">DistributedDataParallel</code>.</p>

<h4 id="key-changes-in-trainpy">Key Changes in <code class="language-plaintext highlighter-rouge">train.py</code>:</h4>
<ol>
  <li><strong>Initialize Distributed Process Group</strong>:
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.distributed</span> <span class="k">as</span> <span class="n">dist</span>
<span class="kn">from</span> <span class="nn">torch.nn.parallel</span> <span class="kn">import</span> <span class="n">DistributedDataParallel</span> <span class="k">as</span> <span class="n">DDP</span>

<span class="k">def</span> <span class="nf">setup</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">world_size</span><span class="p">):</span>
    <span class="n">dist</span><span class="p">.</span><span class="n">init_process_group</span><span class="p">(</span><span class="s">"nccl"</span><span class="p">,</span> <span class="n">rank</span><span class="o">=</span><span class="n">rank</span><span class="p">,</span> <span class="n">world_size</span><span class="o">=</span><span class="n">world_size</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">cleanup</span><span class="p">():</span>
    <span class="n">dist</span><span class="p">.</span><span class="n">destroy_process_group</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li><strong>Wrap Your Model with DDP</strong>:
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">main</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">world_size</span><span class="p">):</span>
    <span class="n">setup</span><span class="p">(</span><span class="n">rank</span><span class="p">,</span> <span class="n">world_size</span><span class="p">)</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">MyModel</span><span class="p">().</span><span class="n">to</span><span class="p">(</span><span class="n">rank</span><span class="p">)</span>
    <span class="n">ddp_model</span> <span class="o">=</span> <span class="n">DDP</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">device_ids</span><span class="o">=</span><span class="p">[</span><span class="n">rank</span><span class="p">])</span>

    <span class="c1"># Training loop
</span>    <span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">ddp_model</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
        <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
        <span class="n">outputs</span> <span class="o">=</span> <span class="n">ddp_model</span><span class="p">(</span><span class="n">inputs</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">rank</span><span class="p">))</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="n">targets</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">rank</span><span class="p">))</span>
        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>

    <span class="n">cleanup</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li><strong>Spawn Processes (plain <code class="language-plaintext highlighter-rouge">python</code> launch only)</strong>:
Use <code class="language-plaintext highlighter-rouge">torch.multiprocessing.spawn</code> when you start the script directly with <code class="language-plaintext highlighter-rouge">python train.py</code>:
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">world_size</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">device_count</span><span class="p">()</span>
    <span class="n">torch</span><span class="p">.</span><span class="n">multiprocessing</span><span class="p">.</span><span class="n">spawn</span><span class="p">(</span><span class="n">main</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">world_size</span><span class="p">,),</span> <span class="n">nprocs</span><span class="o">=</span><span class="n">world_size</span><span class="p">,</span> <span class="n">join</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>    </div>
    <p>When launching with <code class="language-plaintext highlighter-rouge">torchrun</code> (next section), skip this step: <code class="language-plaintext highlighter-rouge">torchrun</code> spawns one process per GPU itself and sets the <code class="language-plaintext highlighter-rouge">RANK</code>, <code class="language-plaintext highlighter-rouge">WORLD_SIZE</code>, and <code class="language-plaintext highlighter-rouge">LOCAL_RANK</code> environment variables for your script to read.</p>
  </li>
</ol>

<hr />

<h3 id="4-launch-training-with-torchrun"><strong>4. Launch Training with <code class="language-plaintext highlighter-rouge">torchrun</code></strong></h3>
<p>Use <code class="language-plaintext highlighter-rouge">torchrun</code> to manage distributed training processes.</p>

<h4 id="single-node-multi-gpu-training"><strong>Single Node, Multi-GPU Training</strong></h4>
<p>For a single node with 4 GPUs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torchrun <span class="nt">--nproc_per_node</span><span class="o">=</span>4 train.py
</code></pre></div></div>

<ul>
  <li><code class="language-plaintext highlighter-rouge">--nproc_per_node</code>: Number of processes to launch (e.g., number of GPUs).</li>
</ul>

<h4 id="multi-node-distributed-training"><strong>Multi-Node Distributed Training</strong></h4>
<p>For multi-node setups:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torchrun <span class="nt">--nnodes</span><span class="o">=</span>2 <span class="nt">--nproc_per_node</span><span class="o">=</span>4 <span class="nt">--rdzv_backend</span><span class="o">=</span>c10d <span class="se">\</span>
         <span class="nt">--rdzv_endpoint</span><span class="o">=</span><span class="s2">"master_ip:29500"</span> train.py
</code></pre></div></div>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">--nnodes</code></strong>: Number of nodes participating in training.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">--nproc_per_node</code></strong>: Number of processes per node (typically number of GPUs per node).</li>
  <li><strong><code class="language-plaintext highlighter-rouge">--rdzv_backend</code></strong>: Rendezvous backend (<code class="language-plaintext highlighter-rouge">c10d</code> is the recommended choice).</li>
  <li><strong><code class="language-plaintext highlighter-rouge">--rdzv_endpoint</code></strong>: IP and port of the master node for communication.</li>
</ul>
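
<p>Before launching a multi-node run, it can save time to confirm each worker can actually reach the master’s rendezvous port over TCP. A quick check, assuming <code class="language-plaintext highlighter-rouge">master_ip</code> is the address used above:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nc -zv master_ip 29500   # should report the port as open / connection succeeded
</code></pre></div></div>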

<hr />

<h3 id="5-check-system-configuration"><strong>5. Check System Configuration</strong></h3>
<p>Ensure the environment is configured correctly:</p>
<ol>
  <li><strong>NCCL Settings</strong>:
    <ul>
      <li>For multi-node training:
        <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">NCCL_DEBUG</span><span class="o">=</span>INFO
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">NCCL_SOCKET_IFNAME</span><span class="o">=</span>eth0  <span class="c"># or your network interface</span>
</code></pre></div>        </div>
      </li>
    </ul>
  </li>
  <li><strong>CUDA and GPU Settings</strong>:
    <ul>
      <li>Confirm GPU visibility:
        <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nvidia-smi
</code></pre></div>        </div>
      </li>
      <li>Set <code class="language-plaintext highlighter-rouge">CUDA_VISIBLE_DEVICES</code> if needed:
        <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1,2,3
</code></pre></div>        </div>
      </li>
    </ul>
  </li>
</ol>

<hr />

<h3 id="6-monitor-and-debug"><strong>6. Monitor and Debug</strong></h3>
<ol>
  <li>Use verbose logging for debugging:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">TORCH_DISTRIBUTED_DEBUG</span><span class="o">=</span>DETAIL
torchrun <span class="nt">--nproc_per_node</span><span class="o">=</span>4 train.py
</code></pre></div>    </div>
  </li>
  <li>Check GPU utilization:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>watch <span class="nt">-n</span> 1 nvidia-smi
</code></pre></div>    </div>
  </li>
  <li>Debug communication issues (e.g., NCCL or Gloo):
    <ul>
      <li>Check network connectivity between nodes.</li>
      <li>Use <code class="language-plaintext highlighter-rouge">dmesg</code> or log files for hardware errors.</li>
    </ul>
  </li>
</ol>

<hr />

<h3 id="example-scenarios"><strong>Example Scenarios</strong></h3>

<h4 id="single-node-4-gpus"><strong>Single Node, 4 GPUs</strong></h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torchrun <span class="nt">--nproc_per_node</span><span class="o">=</span>4 train.py
</code></pre></div></div>

<h4 id="two-nodes-8-gpus-total-4-gpus-per-node"><strong>Two Nodes, 8 GPUs Total (4 GPUs Per Node)</strong></h4>
<ol>
  <li>Start on the master node:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torchrun <span class="nt">--nnodes</span><span class="o">=</span>2 <span class="nt">--nproc_per_node</span><span class="o">=</span>4 <span class="nt">--rdzv_endpoint</span><span class="o">=</span><span class="s2">"master_ip:29500"</span> train.py
</code></pre></div>    </div>
  </li>
  <li>Start on the worker node:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torchrun <span class="nt">--nnodes</span><span class="o">=</span>2 <span class="nt">--nproc_per_node</span><span class="o">=</span>4 <span class="nt">--rdzv_endpoint</span><span class="o">=</span><span class="s2">"master_ip:29500"</span> train.py
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h3 id="7-best-practices"><strong>7. Best Practices</strong></h3>
<ol>
  <li>Use <strong>batch normalization</strong> carefully in DDP: plain BatchNorm computes statistics per GPU, while <code class="language-plaintext highlighter-rouge">SyncBatchNorm</code> synchronizes them across GPUs at the cost of extra communication.</li>
  <li>Optimize network communication with <strong>InfiniBand</strong> if available.</li>
  <li>Profile your training script to identify bottlenecks using PyTorch’s profiler.</li>
</ol>

<hr />

<h3 id="references"><strong>References</strong></h3>
<ul>
  <li><a href="https://pytorch.org/docs/stable/distributed.html">PyTorch Distributed Training Documentation</a></li>
  <li><a href="https://developer.nvidia.com/nccl">NCCL Documentation</a></li>
  <li><a href="https://pytorch.org/docs/stable/elastic/run.html">Torchrun CLI Reference</a></li>
</ul>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[Using `torchrun` for Distributed Training]]></summary></entry><entry><title type="html">Setting Up a Nebula Overlay Network with Syncthing</title><link href="https://dongdongbh.tech/blog/nebula/" rel="alternate" type="text/html" title="Setting Up a Nebula Overlay Network with Syncthing" /><published>2025-01-01T00:00:00-05:00</published><updated>2025-01-03T02:39:45-05:00</updated><id>https://dongdongbh.tech/blog/nebula</id><content type="html" xml:base="https://dongdongbh.tech/blog/nebula/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In today’s interconnected world, managing secure and efficient data synchronization across multiple devices is crucial. Overlay networks provide a robust solution for creating secure communication channels over existing networks. This tutorial will guide you through setting up a <strong>Nebula overlay network</strong> and integrating it with <strong>Syncthing</strong> for seamless and secure file synchronization between your PC and Android phone.</p>

<hr />

<h2 id="what-is-an-overlay-network">What is an Overlay Network?</h2>

<p>An <strong>overlay network</strong> is a virtual network built on top of an existing physical network. It allows devices to communicate as if they are directly connected, regardless of their actual physical locations. Overlay networks are instrumental in enhancing security, managing network traffic, and enabling functionalities like:</p>

<ul>
  <li><strong>VPN Services</strong>: Creating secure tunnels between devices.</li>
  <li><strong>Peer-to-Peer Communication</strong>: Facilitating direct connections without centralized servers.</li>
  <li><strong>Network Segmentation</strong>: Isolating different parts of a network for security or performance reasons.</li>
</ul>

<hr />

<h3 id="nebula-overlay-network">Nebula Overlay Network</h3>

<p><a href="https://github.com/slackhq/nebula">Nebula</a> is an open-source, scalable overlay networking tool that enables secure communication between devices, regardless of their physical location or network configuration. It is ideal for private networking and secure communication.</p>

<h2 id="prerequisites">Prerequisites</h2>

<ul>
  <li>A PC running Linux (tested on Ubuntu/Debian).</li>
  <li>A public server to act as a Lighthouse.</li>
  <li>An Android phone with the Nebula app installed.</li>
  <li>Basic knowledge of networking and terminal commands.</li>
  <li>Root or administrative privileges on the devices.</li>
</ul>

<hr />

<h2 id="setting-up-nebula-overlay-network">Setting Up Nebula Overlay Network</h2>

<h3 id="1-download-nebula-software">1. Download Nebula Software</h3>

<p>On the PC and Lighthouse server, download and extract Nebula:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://github.com/slackhq/nebula/releases/download/v1.6.1/nebula-linux-amd64.tar.gz
<span class="nb">tar</span> <span class="nt">-xzf</span> nebula-linux-amd64.tar.gz
<span class="nb">sudo mv </span>nebula /usr/local/bin/
<span class="nb">sudo mv </span>nebula-cert /usr/local/bin/
</code></pre></div></div>

<h3 id="2-create-certificate-authority-ca">2. Create Certificate Authority (CA)</h3>

<p>Generate the CA certificate and private key:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nebula-cert ca <span class="nt">-name</span> <span class="s2">"Nebula Network"</span>
<span class="nb">sudo mkdir</span> <span class="nt">-p</span> /etc/nebula
<span class="nb">sudo mv </span>ca.crt ca.key /etc/nebula/
<span class="nb">sudo chmod </span>600 /etc/nebula/ca.key
</code></pre></div></div>

<h3 id="3-generate-certificates-and-keys-for-each-device">3. Generate Certificates and Keys for Each Device</h3>

<h4 id="a-pc-configuration">a. PC Configuration</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nebula-cert sign <span class="nt">-name</span> <span class="s2">"PC"</span> <span class="nt">-ip</span> <span class="s2">"192.168.100.2/24"</span>
<span class="nb">sudo mv </span>PC.crt PC.key /etc/nebula/
</code></pre></div></div>

<h4 id="b-lighthouse-server-configuration">b. Lighthouse Server Configuration</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nebula-cert sign <span class="nt">-name</span> <span class="s2">"Lighthouse"</span> <span class="nt">-ip</span> <span class="s2">"192.168.100.1/24"</span>
<span class="nb">sudo mkdir</span> <span class="nt">-p</span> /etc/nebula/pki
<span class="nb">sudo mv </span>Lighthouse.crt Lighthouse.key /etc/nebula/pki/
</code></pre></div></div>

<h4 id="c-android-phone-configuration">c. Android Phone Configuration</h4>

<ol>
  <li>Open the Nebula app on your Android phone to generate a public key.</li>
  <li>Transfer the public key to your PC and use it to create a signed certificate:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nebula-cert sign <span class="nt">-name</span> <span class="s2">"Phone"</span> <span class="nt">-ip</span> <span class="s2">"192.168.100.3/24"</span> <span class="nt">-in-pub-key</span> phone_public.key
</code></pre></div></div>

<ol start="3">
  <li>Transfer the <code class="language-plaintext highlighter-rouge">Phone.crt</code> and <code class="language-plaintext highlighter-rouge">ca.crt</code> back to your phone via the Nebula app.</li>
</ol>

<hr />

<h3 id="4-configure-nebula-on-each-device">4. Configure Nebula on Each Device</h3>

<h4 id="a-pc-configuration-file">a. PC Configuration File</h4>

<p>Create <code class="language-plaintext highlighter-rouge">/etc/nebula/config.yml</code> with the following content:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">pki</span><span class="pi">:</span>
  <span class="na">ca</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/ca.crt"</span>
  <span class="na">cert</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/PC.crt"</span>
  <span class="na">key</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/PC.key"</span>

<span class="na">static_host_map</span><span class="pi">:</span>
  <span class="s2">"</span><span class="s">192.168.100.1"</span><span class="err">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">xx.xx.xx.xx:4242"</span><span class="pi">]</span>

<span class="na">lighthouse</span><span class="pi">:</span>
  <span class="na">am_lighthouse</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">interval</span><span class="pi">:</span> <span class="m">60</span>
  <span class="na">hosts</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s2">"</span><span class="s">192.168.100.1"</span>

<span class="na">listen</span><span class="pi">:</span>
  <span class="na">host</span><span class="pi">:</span> <span class="s">0.0.0.0</span>
  <span class="na">port</span><span class="pi">:</span> <span class="m">4242</span>

<span class="na">punchy</span><span class="pi">:</span>
  <span class="na">punch</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">delay</span><span class="pi">:</span> <span class="s">1s</span>
  <span class="na">respond</span><span class="pi">:</span> <span class="no">true</span>

<span class="na">relay</span><span class="pi">:</span>
  <span class="na">relays</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">192.168.100.1</span>
  <span class="na">am_relay</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">use_relays</span><span class="pi">:</span> <span class="no">true</span>


<span class="na">tun</span><span class="pi">:</span>
  <span class="na">disabled</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">drop_local_broadcast</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">drop_multicast</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">tx_queue</span><span class="pi">:</span> <span class="m">500</span>
  <span class="na">dev</span><span class="pi">:</span> <span class="s">nebula1</span>
  <span class="na">mtu</span><span class="pi">:</span> <span class="m">1300</span>

<span class="na">firewall</span><span class="pi">:</span>
  <span class="na">outbound_action</span><span class="pi">:</span> <span class="s">drop</span>
  <span class="na">inbound_action</span><span class="pi">:</span> <span class="s">drop</span>
  <span class="na">inbound</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">port</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">proto</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">host</span><span class="pi">:</span> <span class="s">any</span>
  <span class="na">outbound</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">port</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">proto</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">host</span><span class="pi">:</span> <span class="s">any</span>
<span class="na">logging</span><span class="pi">:</span>
  <span class="na">level</span><span class="pi">:</span> <span class="s">info</span>
  <span class="na">format</span><span class="pi">:</span> <span class="s">text</span>
</code></pre></div></div>

<h4 id="b-lighthouse-configuration-file">b. Lighthouse Configuration File</h4>

<p>Create <code class="language-plaintext highlighter-rouge">/etc/nebula/lighthouse.yml</code> with the following content:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">pki</span><span class="pi">:</span>
  <span class="na">ca</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/pki/ca.crt"</span>
  <span class="na">cert</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/pki/Lighthouse.crt"</span>
  <span class="na">key</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/etc/nebula/pki/Lighthouse.key"</span>

<span class="na">static_host_map</span><span class="pi">:</span>
  <span class="s2">"</span><span class="s">192.168.100.1"</span><span class="err">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">xx.xx.xx.xx:4242"</span><span class="pi">]</span>

<span class="na">lighthouse</span><span class="pi">:</span>
  <span class="na">am_lighthouse</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">interval</span><span class="pi">:</span> <span class="m">60</span>

<span class="na">listen</span><span class="pi">:</span>
  <span class="na">host</span><span class="pi">:</span> <span class="s">0.0.0.0</span>
  <span class="na">port</span><span class="pi">:</span> <span class="m">4242</span>

<span class="na">punchy</span><span class="pi">:</span>
  <span class="na">punch</span><span class="pi">:</span> <span class="no">true</span>

<span class="na">relay</span><span class="pi">:</span>
  <span class="na">am_relay</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">use_relays</span><span class="pi">:</span> <span class="no">true</span>

<span class="na">firewall</span><span class="pi">:</span>
  <span class="na">inbound</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">port</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">proto</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">host</span><span class="pi">:</span> <span class="s">any</span>
  <span class="na">outbound</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">port</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">proto</span><span class="pi">:</span> <span class="s">any</span>
      <span class="na">host</span><span class="pi">:</span> <span class="s">any</span>


<span class="na">tun</span><span class="pi">:</span>
  <span class="na">disabled</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">drop_local_broadcast</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">drop_multicast</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">tx_queue</span><span class="pi">:</span> <span class="m">500</span>
  <span class="na">dev</span><span class="pi">:</span> <span class="s">nebula1</span>
  <span class="na">mtu</span><span class="pi">:</span> <span class="m">1300</span>

<span class="na">stats</span><span class="pi">:</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s">prometheus</span>
  <span class="na">listen</span><span class="pi">:</span> <span class="s">0.0.0.0:8080</span>
  <span class="na">path</span><span class="pi">:</span> <span class="s">/metrics</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">prometheusns</span>
  <span class="na">subsystem</span><span class="pi">:</span> <span class="s">nebula</span>
  <span class="na">interval</span><span class="pi">:</span> <span class="s">10s</span>

  <span class="na">message_metrics</span><span class="pi">:</span> <span class="no">false</span>

  <span class="na">lighthouse_metrics</span><span class="pi">:</span> <span class="no">false</span>

<span class="na">logging</span><span class="pi">:</span>
  <span class="na">level</span><span class="pi">:</span> <span class="s">info</span>
  <span class="na">format</span><span class="pi">:</span> <span class="s">text</span>
</code></pre></div></div>

<h4 id="c-android-phone-configuration-file">c. Android Phone Configuration File</h4>

<p>Follow the Nebula app instructions to import <code class="language-plaintext highlighter-rouge">Phone.crt</code> and <code class="language-plaintext highlighter-rouge">ca.crt</code>.</p>

<hr />

<h3 id="5-open-udp-port-4242-on-lighthouse">5. Open UDP Port 4242 on Lighthouse</h3>

<p>Ensure the Lighthouse server allows incoming UDP traffic on port 4242:</p>
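
<p>For example, with <code class="language-plaintext highlighter-rouge">ufw</code> (use the equivalent <code class="language-plaintext highlighter-rouge">firewalld</code> rule or cloud security-group entry if that is what your server uses):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo ufw allow 4242/udp
sudo ufw status
</code></pre></div></div>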

<hr />

<h3 id="6-start-nebula-services">6. Start Nebula Services</h3>

<p>Start Nebula on the PC and Lighthouse server (on the Lighthouse, point <code class="language-plaintext highlighter-rouge">-config</code> at <code class="language-plaintext highlighter-rouge">/etc/nebula/lighthouse.yml</code> instead):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nebula <span class="nt">-config</span> /etc/nebula/config.yml
</code></pre></div></div>
<p>To set Nebula as a service on a Linux system, you can create a <strong>systemd</strong> service file. This ensures that Nebula starts automatically on boot and can be managed like other system services.</p>

<h3 id="steps-to-set-up-nebula-as-a-service">Steps to Set Up Nebula as a Service</h3>

<ol>
  <li><strong>Create a Systemd Service File</strong>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nano /etc/systemd/system/nebula.service
</code></pre></div>    </div>
  </li>
  <li><strong>Add the Following Configuration</strong>
    <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[Unit]</span>
<span class="py">Description</span><span class="p">=</span><span class="s">Nebula Overlay Network</span>
<span class="py">After</span><span class="p">=</span><span class="s">network.target</span>

<span class="nn">[Service]</span>
<span class="py">Type</span><span class="p">=</span><span class="s">simple</span>
<span class="py">ExecStart</span><span class="p">=</span><span class="s">/usr/local/bin/nebula -config /etc/nebula/config.yml</span>
<span class="py">Restart</span><span class="p">=</span><span class="s">on-failure</span>
<span class="py">User</span><span class="p">=</span><span class="s">root</span>

<span class="nn">[Install]</span>
<span class="py">WantedBy</span><span class="p">=</span><span class="s">multi-user.target</span>
</code></pre></div>    </div>

    <p><strong>Explanation:</strong></p>
    <ul>
      <li><strong><code class="language-plaintext highlighter-rouge">ExecStart</code></strong> specifies the path to the Nebula binary and configuration file.</li>
      <li><strong><code class="language-plaintext highlighter-rouge">Restart=on-failure</code></strong> ensures Nebula restarts automatically if it crashes.</li>
      <li><strong><code class="language-plaintext highlighter-rouge">User=root</code></strong> runs Nebula with root privileges (required for managing network interfaces).</li>
    </ul>
  </li>
  <li><strong>Reload the Systemd Daemon</strong>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl daemon-reload
</code></pre></div>    </div>
  </li>
  <li><strong>Enable Nebula to Start on Boot</strong>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl <span class="nb">enable </span>nebula
</code></pre></div>    </div>
  </li>
  <li><strong>Start the Nebula Service</strong>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl start nebula
</code></pre></div>    </div>
  </li>
  <li><strong>Check the Service Status</strong>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl status nebula
</code></pre></div>    </div>

    <p><strong>Expected Output:</strong></p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>● nebula.service - Nebula Overlay Network
     Loaded: loaded (/etc/systemd/system/nebula.service; enabled; vendor preset: enabled)
     Active: active (running) since ...
</code></pre></div>    </div>
  </li>
  <li><strong>Stop or Restart the Service (Optional)</strong>
    <ul>
      <li>To stop the service:
        <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl stop nebula
</code></pre></div>        </div>
      </li>
      <li>To restart the service:
        <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl restart nebula
</code></pre></div>        </div>
      </li>
    </ul>
  </li>
</ol>

<h3 id="logs-and-debugging">Logs and Debugging</h3>

<ul>
  <li>View logs for the Nebula service:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>journalctl <span class="nt">-u</span> nebula
</code></pre></div>    </div>
  </li>
</ul>
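<p>To follow the log live while testing connectivity:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo journalctl -u nebula -f
</code></pre></div></div>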

<p>By configuring Nebula as a service, you can ensure it runs reliably and automatically, making your overlay network setup more robust and manageable.</p>

<hr />

<h2 id="setting-up-syncthing">Setting Up Syncthing</h2>

<h3 id="1-download-and-install-syncthing">1. Download and Install Syncthing</h3>

<p>On each device, download and install Syncthing from the <a href="https://syncthing.net/">official website</a>.</p>
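<p>On Debian and Ubuntu it is also available from the package manager; a quick sketch, assuming the distribution package is recent enough for your needs:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt install syncthing
# Run it as your normal user; the web GUI listens on http://127.0.0.1:8384 by default
syncthing
</code></pre></div></div>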

<h3 id="2-configure-syncthing-to-use-nebula-network">2. Configure Syncthing to Use Nebula Network</h3>

<p>In Syncthing, set the Nebula IPs (e.g., <code class="language-plaintext highlighter-rouge">tcp://192.168.100.3:22000</code>) as the device addresses.</p>

<h3 id="3-syncing-files-across-devices">3. Syncing Files Across Devices</h3>

<p>Add shared folders in Syncthing and start syncing files securely over the Nebula network.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>By setting up Nebula and Syncthing, you’ve created a secure overlay network for file synchronization across devices. This setup ensures privacy, flexibility, and efficient communication.</p>

<p><strong>Special statement: This tutorial is only for learning and research, thanks.</strong></p>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[Setting Up a Nebula Overlay Network with Syncthing]]></summary></entry><entry><title type="html">Transparent proxy with V2ray and clash</title><link href="https://dongdongbh.tech/blog/tproxy/" rel="alternate" type="text/html" title="Transparent proxy with V2ray and clash" /><published>2021-10-19T00:00:00-04:00</published><updated>2025-01-06T14:02:41-05:00</updated><id>https://dongdongbh.tech/blog/tproxy</id><content type="html" xml:base="https://dongdongbh.tech/blog/tproxy/"><![CDATA[<hr />

<h2 id="background">Background</h2>

<p>In my <a href="https://dongdongbh.tech/blog/vps/">previous post</a>, I demonstrated how to set up a proxy server and configure clients on various devices. However, configuring a proxy client on every device is cumbersome, and some users prefer to route <em>all</em> traffic through a proxy, regardless of whether it is HTTP, HTTPS, SOCKS5, or another protocol. In such cases, a <strong>transparent proxy</strong> is more convenient: it acts as a router, so every connected device (including the host itself) uses the proxy without per-device configuration. Transparent proxies are also called bypass gateways or soft routers, and they typically operate at the TCP/UDP layer.</p>

<p>This post will guide you step-by-step on setting up a transparent proxy on Linux. To follow along, you should have basic knowledge of Linux and networking.</p>

<p>There are various tools to achieve transparent proxy functionality, such as Proxifier (Windows), <em>Surge for Mac</em>, tun2socks, and dns2socks for Linux. The key to implementing a transparent proxy lies in correctly handling <strong>DNS resolution</strong>. In this post, I use the built-in DNS configurations of <strong>V2Ray</strong> and <strong>Clash</strong>. DNS requests are forwarded using <code class="language-plaintext highlighter-rouge">iptables</code> and <code class="language-plaintext highlighter-rouge">ip route</code>. Notably, Clash offers robust DNS settings.</p>

<hr />

<h2 id="basics">Basics</h2>

<h3 id="how-dns-works">How DNS Works</h3>

<p>When you make an HTTP request, the system first sends a DNS query (UDP port 53) with the domain name to the configured DNS server. The server responds with the corresponding IP address. The application then establishes a TCP connection to the target server using this IP and begins data exchange.</p>

<p>A typical connection flow looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            DNS Query
   ___________________________&gt;
APP&lt;---------------------------DNS Server
  |         Response (IP)
  |         TCP Data Exchange
  |-----------------------------------&gt;Website  
</code></pre></div></div>

<p>When using a proxy, the flow changes. Below are scenarios for different proxy types:</p>

<hr />

<h4 id="socks5-proxy">SOCKS5 Proxy</h4>

<p>In the SOCKS5 case, the application packs the domain name and TCP data into a SOCKS5 protocol packet. The proxy client forwards this packet to the proxy server, where DNS resolution occurs.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      SOCKS5 (Domain + TCP Data)                    SOCKS5 Data / Proxy Protocol                          DNS Query
APP---------------------------------&gt;Proxy Client-------------------------------------&gt;Proxy Server&lt;-------------------&gt;DNS Server
                                                                                          |     Response (IP)
                                                                                          |
                                                                                          |     TCP Data
                                                                                          |--------------------------------Website  
</code></pre></div></div>

<p>In <strong>global/transparent proxy</strong> setups, not all applications can handle SOCKS5. Thus, a program must intercept DNS requests and process them appropriately.</p>

<hr />

<h4 id="tun2socksredir">tun2socks/redir</h4>

<p><strong>tun2socks</strong> (part of BadVPN) picks up TCP connections routed to a TUN interface (regardless of destination IP) and forwards them to a SOCKS server. This lets applications without built-in SOCKS support use the proxy, even on a Linux router.</p>

<p>In this case, the application sends a DNS request, receives an IP (possibly inaccurate if the DNS server is “dirty”), and then establishes a TCP connection. The local proxy client intercepts the connection and repackages the destination (the original IP, or the domain if protocol sniffing can recover it) together with the TCP data into a proxy-protocol packet for forwarding.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         Requested IP
   &lt;_________________________________              Domain + Data (Proxy Protocol)                      DNS Query
APP---------------------------------&gt;Proxy Client&lt;-------------------------------------&gt;Proxy Server&lt;-------------------&gt;DNS Server
  |              DNS Query             /|\                                                  |     Response (IP)
  |                                     |                                                   |
  |       TCP Data                      |                                                   |     TCP Data     
  |-------------------------------------|                                                   |----------------------------&gt;Website
</code></pre></div></div>

<hr />

<h4 id="fake-ip-mode">Fake IP Mode</h4>

<p>In this mode, the local proxy client intercepts DNS requests and returns a “fake” IP to the application. The application establishes a connection with the fake IP, while the proxy client maps the fake IP to the actual domain and forwards the data to the proxy server.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>          Fake IP
   &lt;_________________________________                  Domain + Data (Proxy Protocol)                    DNS Query
APP---------------------------------&gt;Proxy Client&lt;-------------------------------------&gt;Proxy Server&lt;-------------------&gt;DNS Server
  |              DNS Query             /|\                                                  |     Response (IP)
  |                                     |                                                   |
  |   TCP Data                          |                                                   |     TCP Data     
  |-------------------------------------|                                                   |----------------------------&gt;Website
</code></pre></div></div>

<p><strong>Advantages &amp; Disadvantages</strong>:</p>

<ul>
  <li><strong>Fake IP Mode</strong>: Faster, because the proxy client answers DNS queries immediately with a fake IP instead of waiting for real resolution. The trade-off is that the application never learns the website’s real IP.</li>
  <li><strong>Real IP Mode</strong> (redir-host): Slower, but applications that genuinely need the destination’s actual IP get it.</li>
</ul>

<hr />

<h3 id="iptables"><code class="language-plaintext highlighter-rouge">iptables</code></h3>

<p>To manipulate traffic for transparent proxies, familiarity with <code class="language-plaintext highlighter-rouge">iptables</code> is essential. Here’s an overview:</p>

<p><img src="../../assets/images/tproxy/iptable.webp" alt="iptables Overview" /></p>

<p>Basic <code class="language-plaintext highlighter-rouge">iptables</code> commands:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>iptables <span class="nt">-L</span> <span class="nt">-t</span> <span class="o">{</span>nat,mangle<span class="o">}</span>   <span class="c"># List chains</span>
iptables <span class="nt">-N</span> XXXX              <span class="c"># Create chain</span>
iptables <span class="nt">-A</span> ...               <span class="c"># Add rule</span>
iptables <span class="nt">-D</span> ...               <span class="c"># Delete rule</span>
</code></pre></div></div>
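<p>When debugging the rules used later in this post, listing a chain with packet counters and line numbers shows what actually matches and lets you delete a rule by its number:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>iptables -t nat -L -n -v --line-numbers   # all nat chains, with packet counters
iptables -t nat -D PREROUTING 1           # delete rule number 1 of PREROUTING (example)
</code></pre></div></div>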

<hr />

<h2 id="requirements">Requirements</h2>

<ul>
  <li>A Linux machine.</li>
  <li>An operational V2Ray or Clash proxy.</li>
</ul>

<hr />

<h2 id="setting-up-a-v2ray-transparent-proxy">Setting Up a V2Ray Transparent Proxy</h2>

<h3 id="v2ray-configuration-configjson">V2Ray Configuration (<code class="language-plaintext highlighter-rouge">config.json</code>)</h3>

<p>The <code class="language-plaintext highlighter-rouge">dokodemo-door</code> inbound receives all traffic redirected by <code class="language-plaintext highlighter-rouge">iptables</code>. Traffic processed by V2Ray is marked with socket mark <code class="language-plaintext highlighter-rouge">255 (0xFF)</code> to avoid loopback.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"routing"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="err">...</span><span class="p">},</span><span class="w">
  </span><span class="nl">"inbounds"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="err">...</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"port"</span><span class="p">:</span><span class="w"> </span><span class="mi">12345</span><span class="p">,</span><span class="w">
      </span><span class="nl">"protocol"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dokodemo-door"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"settings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"tcp,udp"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"followRedirect"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
      </span><span class="p">},</span><span class="w">
      </span><span class="nl">"sniffing"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"enabled"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
        </span><span class="nl">"destOverride"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"http"</span><span class="p">,</span><span class="w"> </span><span class="s2">"tls"</span><span class="p">]</span><span class="w">
      </span><span class="p">},</span><span class="w">
      </span><span class="nl">"streamSettings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"sockopt"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="nl">"tproxy"</span><span class="p">:</span><span class="w"> </span><span class="s2">"redirect"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"outbounds"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="err">...</span><span class="w">
      </span><span class="nl">"streamSettings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="err">...</span><span class="w">
        </span><span class="nl">"sockopt"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="nl">"mark"</span><span class="p">:</span><span class="w"> </span><span class="mi">255</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<hr />

<h3 id="iptables-configuration"><code class="language-plaintext highlighter-rouge">iptables</code> Configuration</h3>

<p>Run these commands with root privileges (<code class="language-plaintext highlighter-rouge">sudo su</code>):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">lan_ipaddr</span><span class="o">=</span><span class="s2">"192.168.1.1"</span>  <span class="c"># Local router IP</span>
<span class="nv">proxy_server</span><span class="o">=</span><span class="s2">"123.123.123.123"</span>  <span class="c"># Proxy server IP</span>
<span class="nv">proxy_port</span><span class="o">=</span><span class="s2">"7892"</span>  <span class="c"># Transparent proxy port</span>

<span class="c"># Enable IP forwarding</span>
<span class="nb">echo </span>net.ipv4.ip_forward<span class="o">=</span>1 <span class="o">&gt;&gt;</span> /etc/sysctl.conf <span class="o">&amp;&amp;</span> sysctl <span class="nt">-p</span>

<span class="c"># Route for loopback</span>
ip rule add fwmark 1 table 100
ip route add <span class="nb">local </span>0.0.0.0/0 dev lo table 100

<span class="c"># Proxy local network</span>
iptables <span class="nt">-t</span> mangle <span class="nt">-N</span> V2RAY
iptables <span class="nt">-t</span> mangle <span class="nt">-A</span> V2RAY <span class="nt">-d</span> <span class="k">${</span><span class="nv">proxy_server</span><span class="k">}</span> <span class="nt">-j</span> RETURN
iptables <span class="nt">-t</span> mangle <span class="nt">-A</span> V2RAY <span class="nt">-d</span> 127.0.0.1/32 <span class="nt">-j</span> RETURN
...
</code></pre></div></div>

<p>This method uses the <strong>REDIRECT</strong> approach. For the <strong>TPROXY</strong> method, refer to <a href="https://toutyrater.github.io/app/tproxy.html">this guide</a>.</p>

<h2 id="setting-up-a-clash-transparent-proxy">Setting Up a Clash Transparent Proxy</h2>

<p>Clash is a powerful, rule-based proxy tool with features like high-level routing and DNS management. It is widely used due to its flexibility and robust capabilities.</p>

<h3 id="configuring-clash-as-a-bypass-gateway">Configuring Clash as a Bypass Gateway</h3>

<p>If you’re using a Raspberry Pi as a bypass gateway, you can set a static IP address for it and configure it to act as both a DHCP and DNS server. Alternatively, if your main router supports multiple gateways, you can configure two gateways:</p>

<ol>
  <li>One for traffic routed through the Raspberry Pi (for proxying).</li>
  <li>Another for direct routing (normal internet access).</li>
</ol>
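<p>As a sketch of the static-address option above: on Raspberry Pi OS releases that use <code class="language-plaintext highlighter-rouge">dhcpcd</code>, the address can be pinned in <code class="language-plaintext highlighter-rouge">/etc/dhcpcd.conf</code> (all addresses below are placeholders; substitute your own LAN layout):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>interface eth0
# the Pi itself
static ip_address=192.168.1.2/24
# the main WiFi router
static routers=192.168.1.1
# point DNS at the Pi, where Clash will listen
static domain_name_servers=192.168.1.2
</code></pre></div></div>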

<p>The network overview would look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Phone/PC/Pad
        |
      1 |
        |
+-------v-------+      2      +---------------+
|               |-------------&gt;               |
|  WiFi Router  |             |  Raspberry Pi |
|               &lt;-------------|               |
+------+--+-----+      3      +---------------+
       |  |
    3.1|  | 3.2
       |  +----------&gt;  Direct LAN
       v
   +---+---+
   | Proxy |
   +---+---+
       |
       |
       v
 Internet WAN
</code></pre></div></div>

<hr />

<h3 id="avoiding-loop-problems">Avoiding Loop Problems</h3>

<p>To prevent traffic loops, create a dedicated user <code class="language-plaintext highlighter-rouge">clash</code> and ensure that traffic originating from this user is excluded from the proxying rules.</p>
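<p>The exclusion itself is an <code class="language-plaintext highlighter-rouge">iptables</code> owner match, which only exists in the <code class="language-plaintext highlighter-rouge">OUTPUT</code> chain (traffic generated by the gateway itself). A minimal sketch, assuming the <code class="language-plaintext highlighter-rouge">clash</code> user created below and the redirect port from the script later in this post:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Traffic generated by the clash user skips the redirect, breaking the loop
iptables -t nat -A OUTPUT -p tcp -m owner --uid-owner clash -j RETURN
# Other local TCP traffic goes to Clash's redir port
# (in practice, RETURN private/loopback ranges first, as in the PREROUTING script)
iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports 7892
</code></pre></div></div>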

<h4 id="steps-to-create-a-user-for-clash">Steps to Create a User for Clash</h4>

<ol>
  <li>Create the <code class="language-plaintext highlighter-rouge">clash</code> user and its home directory:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>useradd <span class="nt">-U</span> clash
<span class="nb">sudo </span>mkhomedir_helper clash
<span class="nb">sudo chown </span>clash:clash /usr/local/bin/clash
</code></pre></div></div>

<ol>
  <li>Create or modify the service file at <code class="language-plaintext highlighter-rouge">/etc/systemd/system/clash.service</code> to define the user as <code class="language-plaintext highlighter-rouge">clash</code>:</li>
</ol>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Unit]
Description=clash
After=network.target

[Service]
User=clash
Group=clash
AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_NET_ADMIN
ExecStart=/usr/local/bin/clash -d /etc/clash
Restart=on-failure

[Install]
WantedBy=multi-user.target
</code></pre></div></div>

<ul>
  <li><code class="language-plaintext highlighter-rouge">CAP_NET_BIND_SERVICE</code>: Allows the Clash process to bind to privileged ports like 53 (DNS).</li>
  <li><code class="language-plaintext highlighter-rouge">CAP_NET_ADMIN</code>: Grants the Clash process permissions for network administration (necessary for UDP proxying).</li>
</ul>

<ol>
  <li>Reload the systemd configuration and enable Clash:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl daemon-reload
<span class="nb">sudo </span>systemctl <span class="nb">enable </span>clash
</code></pre></div></div>

<hr />

<h3 id="clash-dns-configuration">Clash DNS Configuration</h3>

<p>To ensure reliable DNS resolution, configure Clash to handle DNS queries. Below is a sample DNS configuration for <code class="language-plaintext highlighter-rouge">config.yaml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">dns</span><span class="pi">:</span>
  <span class="na">enable</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">ipv6</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">listen</span><span class="pi">:</span> <span class="s">0.0.0.0:1053</span>
  <span class="na">enhanced-mode</span><span class="pi">:</span> <span class="s">redir-host</span>       <span class="c1"># Modes: redir-host or fake-ip</span>
  <span class="na">use-hosts</span><span class="pi">:</span> <span class="no">true</span>                 <span class="c1"># Use hosts for resolution</span>
  <span class="na">nameserver</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">119.29.29.29</span>      <span class="c1"># DNSPod</span>
    <span class="pi">-</span> <span class="s">223.5.5.5</span>         <span class="c1"># Alibaba DNS</span>
  <span class="na">fallback</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">tls://8.8.8.8:853</span>         <span class="c1"># Google DNS over TLS</span>
    <span class="pi">-</span> <span class="s">tls://8.8.4.4:853</span>         <span class="c1"># Google DNS over TLS</span>
    <span class="pi">-</span> <span class="s">https://1.1.1.1/dns-query</span> <span class="c1"># Cloudflare DNS over HTTPS</span>
    <span class="pi">-</span> <span class="s">https://dns.google/dns-query</span> <span class="c1"># Google DNS over HTTPS</span>
  <span class="na">fallback-filter</span><span class="pi">:</span>
    <span class="na">geoip</span><span class="pi">:</span> <span class="no">true</span>
</code></pre></div></div>

<p>For <strong>fake-ip</strong> mode, update the configuration:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="na">enhanced-mode</span><span class="pi">:</span> <span class="s">fake-ip</span>
  <span class="na">fake-ip-range</span><span class="pi">:</span> <span class="s">198.18.0.1/16</span>  <span class="c1"># Fake IP pool</span>
</code></pre></div></div>

<hr />

<h3 id="clash-full-configuration-example">Clash Full Configuration Example</h3>

<p>Below is a complete example of <code class="language-plaintext highlighter-rouge">config.yaml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">port</span><span class="pi">:</span> <span class="m">7890</span>
<span class="na">socks-port</span><span class="pi">:</span> <span class="m">7891</span>
<span class="na">redir-port</span><span class="pi">:</span> <span class="m">7892</span>
<span class="na">allow-lan</span><span class="pi">:</span> <span class="no">true</span>
<span class="na">mode</span><span class="pi">:</span> <span class="s">Rule</span>
<span class="na">log-level</span><span class="pi">:</span> <span class="s">info</span>
<span class="na">external-controller</span><span class="pi">:</span> <span class="s">0.0.0.0:9090</span>
<span class="na">secret</span><span class="pi">:</span> <span class="s2">"</span><span class="s">"</span>
<span class="na">external-ui</span><span class="pi">:</span> <span class="s">dashboard</span>

<span class="c1"># Define proxies</span>
<span class="na">proxies</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Proxy"</span>
    <span class="na">type</span><span class="pi">:</span> <span class="s">http</span>
    <span class="na">server</span><span class="pi">:</span> <span class="s">your.proxy.server</span>
    <span class="na">port</span><span class="pi">:</span> <span class="m">1234</span>
    <span class="na">username</span><span class="pi">:</span> <span class="s">user</span>
    <span class="na">password</span><span class="pi">:</span> <span class="s">pass</span>

<span class="c1"># Proxy groups and rules</span>
<span class="na">proxy-groups</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Default"</span>
    <span class="na">type</span><span class="pi">:</span> <span class="s">select</span>
    <span class="na">proxies</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">Proxy</span>

<span class="na">rules</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">DOMAIN-SUFFIX,example.com,Default</span>
  <span class="pi">-</span> <span class="s">GEOIP,CN,DIRECT</span>
  <span class="pi">-</span> <span class="s">MATCH,Default</span>

<span class="na">dns</span><span class="pi">:</span>
  <span class="na">enable</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">ipv6</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">listen</span><span class="pi">:</span> <span class="s">0.0.0.0:1053</span>
  <span class="na">enhanced-mode</span><span class="pi">:</span> <span class="s">redir-host</span>
  <span class="na">nameserver</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">119.29.29.29</span>
    <span class="pi">-</span> <span class="s">223.5.5.5</span>
  <span class="na">fallback</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">tls://8.8.8.8:853</span>
    <span class="pi">-</span> <span class="s">tls://8.8.4.4:853</span>
    <span class="pi">-</span> <span class="s">https://1.1.1.1/dns-query</span>
    <span class="pi">-</span> <span class="s">https://dns.google/dns-query</span>
  <span class="na">fallback-filter</span><span class="pi">:</span>
    <span class="na">geoip</span><span class="pi">:</span> <span class="no">true</span>
</code></pre></div></div>

<hr />

<h3 id="configuring-iptables-for-clash">Configuring <code class="language-plaintext highlighter-rouge">iptables</code> for Clash</h3>

<ol>
  <li>Enable IP forwarding:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo </span>net.ipv4.ip_forward<span class="o">=</span>1 <span class="o">&gt;&gt;</span> /etc/sysctl.conf <span class="o">&amp;&amp;</span> sysctl <span class="nt">-p</span>
</code></pre></div></div>

<ol>
  <li>Create a script to set up <code class="language-plaintext highlighter-rouge">iptables</code> rules:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

<span class="nv">IPT</span><span class="o">=</span>/sbin/iptables
<span class="nv">lan_ipaddr</span><span class="o">=</span><span class="si">$(</span>/sbin/ip route | <span class="nb">awk</span> <span class="s1">'/default/ { print $3 }'</span><span class="si">)</span>
<span class="nv">dns_port</span><span class="o">=</span><span class="s2">"1053"</span>  
<span class="nv">proxy_port</span><span class="o">=</span><span class="s2">"7892"</span> 

<span class="c"># Flush existing rules</span>
<span class="nv">$IPT</span> <span class="nt">-F</span>

<span class="c"># Create chains</span>
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-N</span> CLASH_TCP_RULE
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-F</span> CLASH_TCP_RULE

<span class="c"># Exclude local addresses</span>
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-d</span> 10.0.0.0/8 <span class="nt">-j</span> RETURN
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-d</span> 127.0.0.0/8 <span class="nt">-j</span> RETURN
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-d</span> 192.168.0.0/16 <span class="nt">-j</span> RETURN
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-p</span> tcp <span class="nt">--dport</span> 22 <span class="nt">-j</span> RETURN
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-p</span> tcp <span class="nt">--dport</span> <span class="k">${</span><span class="nv">proxy_port</span><span class="k">}</span> <span class="nt">-j</span> RETURN

<span class="c"># Redirect remaining TCP traffic</span>
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> CLASH_TCP_RULE <span class="nt">-p</span> tcp <span class="nt">-j</span> REDIRECT <span class="nt">--to-ports</span> <span class="k">${</span><span class="nv">proxy_port</span><span class="k">}</span>

<span class="c"># DNS redirection</span>
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> PREROUTING <span class="nt">-p</span> udp <span class="nt">--dport</span> 53 <span class="nt">-j</span> REDIRECT <span class="nt">--to-port</span> <span class="k">${</span><span class="nv">dns_port</span><span class="k">}</span>

<span class="c"># Apply rules</span>
<span class="nv">$IPT</span> <span class="nt">-t</span> nat <span class="nt">-A</span> PREROUTING <span class="nt">-p</span> tcp <span class="nt">-j</span> CLASH_TCP_RULE
</code></pre></div></div>

<ol>
  <li>Save the script, make it executable, and run it:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">chmod</span> +x clash-iptables.sh
<span class="nb">sudo</span> ./clash-iptables.sh
</code></pre></div></div>

<ol>
  <li>To persist the rules after reboot, use <code class="language-plaintext highlighter-rouge">iptables-persistent</code>:</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>iptables-persistent
<span class="nb">sudo </span>iptables-save | <span class="nb">sudo </span>tee /etc/iptables/rules.v4 <span class="o">&gt;</span> /dev/null
</code></pre></div></div>

<p>Alternatively, add the script to <code class="language-plaintext highlighter-rouge">/etc/rc.local</code> for execution at startup.</p>

<hr />

<p><strong>Special statement: This tutorial is only for learning and research, thanks.</strong></p>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[Transparent proxy with V2ray and clash on Linux，bypass gateway, soft router]]></summary></entry><entry><title type="html">Pytorch distributed data parallel step by step</title><link href="https://dongdongbh.tech/ddp/" rel="alternate" type="text/html" title="Pytorch distributed data parallel step by step" /><published>2020-11-19T00:00:00-05:00</published><updated>2025-01-06T14:02:41-05:00</updated><id>https://dongdongbh.tech/ddp</id><content type="html" xml:base="https://dongdongbh.tech/ddp/"><![CDATA[<hr />

<h2 id="background">Background</h2>

<p>How can you speed up your training? What should you do when your model is too large to fit into a single GPU’s memory? How can you efficiently utilize multiple GPUs?</p>

<p><strong>Distributed training</strong> is designed to address these challenges. In PyTorch, two common approaches for distributed training are <strong>DataParallel</strong> and <strong>Distributed Data Parallel (DDP)</strong>.</p>

<hr />

<h3 id="dataparallel">DataParallel</h3>

<p>The <code class="language-plaintext highlighter-rouge">DataParallel</code> module splits a batch of data into smaller mini-batches, each assigned to a different GPU, and every GPU holds a replica of the model. Outputs from all GPUs are gathered on a master GPU, which computes the loss; during back-propagation, gradients are accumulated on the master GPU, which updates the model parameters. The updated parameters are then broadcast back to all GPUs.</p>

<p>However, there are key limitations with <code class="language-plaintext highlighter-rouge">DataParallel</code>:</p>

<ol>
  <li><strong>Communication Overhead</strong>: Gradients and updated model parameters must be shuttled between GPUs on every step, causing significant communication overhead.</li>
  <li><strong>Memory Bottleneck</strong>: The master GPU gathers all outputs and performs the parameter updates, so its memory fills up first while the other GPUs’ memory goes under-utilized.</li>
  <li><strong>Slower Training</strong>: Loss computation and parameter updates are funneled through a single GPU, inside a single Python process bound by the GIL, which slows training down.</li>
</ol>

<hr />

<h3 id="distributed-data-parallel-ddp">Distributed Data Parallel (DDP)</h3>

<p><strong>Distributed Data Parallel (DDP)</strong> is a more efficient solution that addresses the drawbacks of <code class="language-plaintext highlighter-rouge">DataParallel</code>. DDP runs one process per GPU and attaches autograd hooks to each parameter: as gradients become ready during back-propagation, they are synchronized across GPUs with the <code class="language-plaintext highlighter-rouge">AllReduce</code> operation, overlapping communication with computation. Each process then applies identical parameter updates locally, so no master GPU is needed.</p>

<p><strong>Key Advantages</strong>:</p>
<ul>
  <li><strong>Reduced Communication Overhead</strong>: Only gradients are synchronized, reducing data transfer costs.</li>
  <li><strong>Balanced Memory Usage</strong>: Each GPU handles its own back-propagation, resulting in similar memory usage across GPUs.</li>
  <li><strong>Scalability</strong>: DDP supports multi-node setups and peer-to-peer communication between GPUs.</li>
  <li><strong>Improved Performance</strong>: Multiple CPU processes are used, alleviating the limitations of Python’s Global Interpreter Lock (GIL).</li>
</ul>

<p>For more details, see <a href="https://pytorch.org/tutorials/beginner/dist_overview.html">PyTorch Distributed Overview</a>.</p>

<p>This guide focuses on implementing DDP for single-machine, multi-GPU setups.</p>

<hr />

<h2 id="getting-started-with-ddp">Getting Started with DDP</h2>

<h3 id="running-ddp">Running DDP</h3>

<p>The <code class="language-plaintext highlighter-rouge">torch.distributed.launch</code> utility spawns multiple processes for you. Set <code class="language-plaintext highlighter-rouge">nproc_per_node</code> to the number of GPUs on your machine so that each process corresponds to one GPU.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1 python <span class="nt">-m</span> torch.distributed.launch <span class="nt">--nproc_per_node</span><span class="o">=</span>2 main.py <span class="nv">$args</span>
</code></pre></div></div>
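<p>On PyTorch 1.10 and later, <code class="language-plaintext highlighter-rouge">torch.distributed.launch</code> is deprecated in favor of <code class="language-plaintext highlighter-rouge">torchrun</code>, which passes the local rank through the <code class="language-plaintext highlighter-rouge">LOCAL_RANK</code> environment variable instead of a <code class="language-plaintext highlighter-rouge">--local_rank</code> argument:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py $args
</code></pre></div></div>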

<hr />

<h3 id="preparing-data">Preparing Data</h3>

<h4 id="supervised-learning">Supervised Learning</h4>

<p>Use <code class="language-plaintext highlighter-rouge">DistributedSampler</code> to split the dataset among processes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_sampler</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">distributed</span><span class="p">.</span><span class="n">DistributedSampler</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">)</span>
<span class="n">train_loader</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="p">...,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">train_sampler</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="reinforcement-learning">Reinforcement Learning</h4>

<p>In reinforcement learning, run the environment in each rank process with <strong>different seeds</strong> to ensure diversity.</p>
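<p>A minimal sketch: offset a base seed by the process rank so every worker collects different trajectories (<code class="language-plaintext highlighter-rouge">env</code> stands in for your RL environment):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.distributed as dist

base_seed = 42                       # hypothetical experiment-level seed
rank = dist.get_rank()               # requires init_process_group (see below)
torch.manual_seed(base_seed + rank)  # per-rank PyTorch randomness
env.seed(base_seed + rank)           # env is a placeholder; newer Gym APIs
                                     # use env.reset(seed=base_seed + rank)
</code></pre></div></div>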

<hr />

<h3 id="ddp-initialization-with-nvidia-nccl-backend">DDP Initialization with NVIDIA NCCL Backend</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch.distributed</span> <span class="k">as</span> <span class="n">dist</span>
<span class="kn">from</span> <span class="nn">torch.nn.parallel</span> <span class="kn">import</span> <span class="n">DistributedDataParallel</span> <span class="k">as</span> <span class="n">DDP</span>
<span class="kn">import</span> <span class="nn">argparse</span>

<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">"--local_rank"</span><span class="p">,</span> <span class="n">default</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">local_rank</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">().</span><span class="n">local_rank</span>

<span class="c1"># Initialize DDP
</span><span class="n">dist</span><span class="p">.</span><span class="n">init_process_group</span><span class="p">(</span><span class="n">backend</span><span class="o">=</span><span class="s">'nccl'</span><span class="p">,</span> <span class="n">init_method</span><span class="o">=</span><span class="s">'env://'</span><span class="p">)</span>
<span class="n">rank</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">get_rank</span><span class="p">()</span>
<span class="n">world_size</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">get_world_size</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"My rank=</span><span class="si">{</span><span class="n">rank</span><span class="si">}</span><span class="s">, local_rank=</span><span class="si">{</span><span class="n">local_rank</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">set_device</span><span class="p">(</span><span class="n">local_rank</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h3 id="wrapping-the-model">Wrapping the Model</h3>

<p>Wrap your model with <code class="language-plaintext highlighter-rouge">DistributedDataParallel</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">DDP</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">device_ids</span><span class="o">=</span><span class="p">[</span><span class="n">local_rank</span><span class="p">],</span> <span class="n">output_device</span><span class="o">=</span><span class="n">local_rank</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h3 id="training">Training</h3>

<p>Synchronize the sampler for each epoch and perform training as usual:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">):</span>
    <span class="n">train_loader</span><span class="p">.</span><span class="n">sampler</span><span class="p">.</span><span class="n">set_epoch</span><span class="p">(</span><span class="n">epoch</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">data</span><span class="p">,</span> <span class="n">label</span> <span class="ow">in</span> <span class="n">train_loader</span><span class="p">:</span>
        <span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">label</span><span class="p">)</span>
        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span class="p">)</span>
        <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
</code></pre></div></div>

<hr />

<h3 id="logging-data">Logging Data</h3>

<p>Use <code class="language-plaintext highlighter-rouge">torch.distributed.reduce</code> to aggregate data across ranks. For example, summing the loss across GPUs and calculating the mean:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss</span> <span class="o">=</span> <span class="n">loss</span><span class="p">.</span><span class="n">clone</span><span class="p">().</span><span class="n">detach</span><span class="p">()</span>
<span class="n">dist</span><span class="p">.</span><span class="nb">reduce</span><span class="p">(</span><span class="n">loss</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">if</span> <span class="n">dist</span><span class="p">.</span><span class="n">get_rank</span><span class="p">()</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
    <span class="n">loss_mean</span> <span class="o">=</span> <span class="n">loss</span> <span class="o">/</span> <span class="n">dist</span><span class="p">.</span><span class="n">get_world_size</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch: </span><span class="si">{</span><span class="n">epoch</span><span class="si">}</span><span class="s">, Loss: </span><span class="si">{</span><span class="n">loss_mean</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h3 id="saving-and-loading-checkpoints">Saving and Loading Checkpoints</h3>

<h4 id="saving-checkpoints">Saving Checkpoints</h4>

<p>Only save checkpoints on rank 0:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">dist</span><span class="p">.</span><span class="n">get_rank</span><span class="p">()</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
    <span class="n">checkpoint_state</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">'iter_no'</span><span class="p">:</span> <span class="n">iter_no</span><span class="p">,</span>
        <span class="s">'model'</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">state_dict</span><span class="p">(),</span>
        <span class="s">'optimizer'</span><span class="p">:</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">state_dict</span><span class="p">(),</span>
    <span class="p">}</span>
    <span class="n">torch</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">checkpoint_state</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="loading-checkpoints">Loading Checkpoints</h4>

<p>Map the checkpoint to the current rank’s device:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">load_checkpoint</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">rank</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">):</span>
    <span class="n">map_location</span> <span class="o">=</span> <span class="p">{</span><span class="s">'cuda:%d'</span> <span class="o">%</span> <span class="mi">0</span><span class="p">:</span> <span class="s">'cuda:%d'</span> <span class="o">%</span> <span class="n">rank</span><span class="p">}</span>
    <span class="n">checkpoint_state</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">checkpoint_path</span><span class="p">,</span> <span class="n">map_location</span><span class="o">=</span><span class="n">map_location</span><span class="p">)</span>
    <span class="n">model</span><span class="p">.</span><span class="n">load_state_dict</span><span class="p">(</span><span class="n">checkpoint_state</span><span class="p">[</span><span class="s">'model'</span><span class="p">])</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">load_state_dict</span><span class="p">(</span><span class="n">checkpoint_state</span><span class="p">[</span><span class="s">'optimizer'</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">checkpoint_state</span><span class="p">[</span><span class="s">'iter_no'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre></div></div>

<hr />

<h3 id="handling-batchnorm">Handling BatchNorm</h3>

<p>To synchronize BatchNorm across GPUs, convert the model to use <code class="language-plaintext highlighter-rouge">SyncBatchNorm</code> before wrapping it with DDP:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">SyncBatchNorm</span><span class="p">.</span><span class="n">convert_sync_batchnorm</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">DDP</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">device_ids</span><span class="o">=</span><span class="p">[</span><span class="n">local_rank</span><span class="p">],</span> <span class="n">output_device</span><span class="o">=</span><span class="n">local_rank</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h3 id="common-issues-and-troubleshooting">Common Issues and Troubleshooting</h3>

<ol>
  <li><strong>Program Hangs</strong>: Ensure all ranks participate in collective operations like <code class="language-plaintext highlighter-rouge">reduce</code>.</li>
  <li><strong>NCCL Errors in Docker</strong>: Check for appropriate NCCL configurations or Docker flags.</li>
  <li><strong>Unused Parameters</strong>: Avoid having unused parameters, as they may cause synchronization issues.</li>
</ol>

<p>These issues will be covered in more detail in a future post.</p>

<hr />]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="tutorial" /><summary type="html"><![CDATA[Pytorch distributed data parallel]]></summary></entry><entry><title type="html">Docker container for machine learning environments</title><link href="https://dongdongbh.tech/docker/" rel="alternate" type="text/html" title="Docker container for machine learning environments" /><published>2020-08-12T00:00:00-04:00</published><updated>2026-03-22T01:36:28-04:00</updated><id>https://dongdongbh.tech/docker</id><content type="html" xml:base="https://dongdongbh.tech/docker/"><![CDATA[<hr />

<h2 id="docker-basics">Docker Basics</h2>

<p>Refer to the <a href="https://docs.docker.com/">Docker documentation</a> and use <code class="language-plaintext highlighter-rouge">docker --help</code> for more details. Here’s a great <a href="https://ropenscilabs.github.io/r-docker-tutorial/">Docker tutorial</a> to get started.</p>

<hr />

<h3 id="docker-image-operations">Docker Image Operations</h3>

<ul>
  <li><strong>Download</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker pull <span class="o">[</span>OPTIONS] NAME[:TAG|@DIGEST]
</code></pre></div>    </div>
  </li>
  <li><strong>Commit Changes</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker commit <span class="o">[</span>OPTIONS] CONTAINER <span class="o">[</span>REPOSITORY[:TAG]]
</code></pre></div>    </div>
  </li>
</ul>

<hr />

<h3 id="checking-docker-status">Checking Docker Status</h3>

<ul>
  <li><strong>List Running Containers</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker ps
</code></pre></div>    </div>
  </li>
  <li><strong>List Images</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker images
</code></pre></div>    </div>
  </li>
  <li><strong>Inspect a Container/Image</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker inspect
</code></pre></div>    </div>
  </li>
</ul>

<hr />

<h3 id="other-useful-commands">Other Useful Commands</h3>

<ul>
  <li><strong>Run a Container</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run
</code></pre></div>    </div>
  </li>
  <li><strong>Remove an Image</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker rmi
</code></pre></div>    </div>
  </li>
  <li><strong>Remove a Container</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker <span class="nb">rm</span>
</code></pre></div>    </div>
  </li>
  <li><strong>Copy Files</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker <span class="nb">cp</span> <span class="o">[</span>OPTIONS] SRC_PATH|- CONTAINER:DEST_PATH
</code></pre></div>    </div>
  </li>
</ul>

<hr />

<h3 id="switching-between-interactive-and-daemon-modes">Switching Between Interactive and Daemon Modes</h3>

<p>Press <code class="language-plaintext highlighter-rouge">&lt;Ctrl&gt; + p</code> followed by <code class="language-plaintext highlighter-rouge">&lt;Ctrl&gt; + q</code> to detach from a container running in interactive mode and switch it to daemon mode. To reattach, use:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker attach <span class="o">[</span>OPTIONS] CONTAINER
</code></pre></div></div>

<hr />

<h3 id="running-docker-without-sudo">Running Docker Without <code class="language-plaintext highlighter-rouge">sudo</code></h3>

<p>To allow running Docker commands without <code class="language-plaintext highlighter-rouge">sudo</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>groupadd docker
<span class="nb">sudo </span>usermod <span class="nt">-aG</span> docker <span class="nv">$USER</span>
newgrp docker
</code></pre></div></div>

<hr />

<h3 id="pushing-images-to-docker-hub">Pushing Images to Docker Hub</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker login
docker tag &lt;image_id&gt; yourhubusername/REPOSITORY_NAME:tag
docker push yourhubusername/REPOSITORY_NAME
</code></pre></div></div>

<hr />

<h3 id="writing-a-dockerfile">Writing a Dockerfile</h3>

<p>A basic <code class="language-plaintext highlighter-rouge">Dockerfile</code> template (refer to the <a href="https://docs.docker.com/engine/reference/builder/">Dockerfile documentation</a>):</p>

<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> ubuntu:18.04</span>
<span class="k">COPY</span><span class="s"> . /app</span>
<span class="k">EXPOSE</span><span class="s"> 9000</span>
<span class="k">RUN </span>make /app
<span class="k">CMD</span><span class="s"> python /app/app.py</span>
</code></pre></div></div>
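
<p>To use this template, a minimal sketch (the image name <code class="language-plaintext highlighter-rouge">myapp</code> is just a placeholder):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># build an image from the directory containing the Dockerfile
docker build -t myapp .
# publish the exposed port on the host and run it
docker run -p 9000:9000 myapp
</code></pre></div></div>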

<hr />

<h3 id="using-docker-composeyml">Using <code class="language-plaintext highlighter-rouge">docker-compose.yml</code></h3>

<p>A <code class="language-plaintext highlighter-rouge">docker-compose.yml</code> file example (refer to <a href="https://docs.docker.com/compose/compose-file/">Compose documentation</a>):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">version</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.8"</span>
<span class="na">services</span><span class="pi">:</span>
  <span class="na">webapp</span><span class="pi">:</span>
    <span class="na">build</span><span class="pi">:</span>
      <span class="na">context</span><span class="pi">:</span> <span class="s">./dir</span>
      <span class="na">dockerfile</span><span class="pi">:</span> <span class="s">Dockerfile-alternate</span>
      <span class="na">args</span><span class="pi">:</span>
        <span class="na">buildno</span><span class="pi">:</span> <span class="m">1</span>
</code></pre></div></div>
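
<p>With this file in place, the usual lifecycle commands are:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker-compose up -d --build   # build and start in the background
docker-compose logs -f webapp  # follow the service logs
docker-compose down            # stop and remove the containers
</code></pre></div></div>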

<p>See <a href="https://docs.docker.com/compose/django/">this example</a> for setting up a Django project.</p>

<hr />

<h2 id="docker-proxy-configuration">Docker Proxy Configuration</h2>

<ol>
  <li>
    <p><strong>Set Proxy for <code class="language-plaintext highlighter-rouge">docker pull</code></strong>:<br />
Refer to the <a href="https://docs.docker.com/config/daemon/systemd/#httphttps-proxy">Docker proxy documentation</a>.</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo mkdir</span> <span class="nt">-p</span> /etc/systemd/system/docker.service.d
<span class="nb">sudo </span>vim /etc/systemd/system/docker.service.d/http-proxy.conf
</code></pre></div>    </div>
  </li>
  <li>
    <p><strong>Add the Following Configuration</strong>:<br />
Replace <code class="language-plaintext highlighter-rouge">127.0.0.1:1080</code> with your proxy’s address.</p>

    <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[Service]</span>
<span class="py">Environment</span><span class="p">=</span><span class="s">"HTTP_PROXY=socks5://127.0.0.1:1080"</span>
<span class="py">Environment</span><span class="p">=</span><span class="s">"HTTPS_PROXY=socks5://127.0.0.1:1080"</span>
</code></pre></div>    </div>
  </li>
  <li><strong>Apply Changes</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl daemon-reload
<span class="nb">sudo </span>systemctl restart docker
</code></pre></div>    </div>
  </li>
  <li><strong>Verify</strong>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl show <span class="nt">--property</span><span class="o">=</span>Environment docker
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h2 id="using-docker-with-cuda">Using Docker with CUDA</h2>

<p>Refer to <a href="https://github.com/NVIDIA/nvidia-docker">NVIDIA Docker</a> for details.</p>

<h3 id="setup-nvidia-container-toolkit">Setup NVIDIA Container Toolkit</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">distribution</span><span class="o">=</span><span class="si">$(</span><span class="nb">.</span> /etc/os-release<span class="p">;</span><span class="nb">echo</span> <span class="nv">$ID$VERSION_ID</span><span class="si">)</span>
curl <span class="nt">-s</span> <span class="nt">-L</span> https://nvidia.github.io/nvidia-docker/gpgkey | <span class="nb">sudo </span>apt-key add -
curl <span class="nt">-s</span> <span class="nt">-L</span> https://nvidia.github.io/nvidia-docker/<span class="nv">$distribution</span>/nvidia-docker.list | <span class="nb">sudo tee</span> /etc/apt/sources.list.d/nvidia-docker.list

<span class="nb">sudo </span>apt-get update <span class="o">&amp;&amp;</span> <span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> nvidia-container-toolkit
<span class="nb">sudo </span>systemctl restart docker
</code></pre></div></div>

<h3 id="pull-and-run-a-cuda-image">Pull and Run a CUDA Image</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker pull nvidia/cuda:10.2-basic
docker run <span class="nt">--gpus</span> all <span class="nt">--ipc</span><span class="o">=</span>host <span class="nt">--net</span> host <span class="nt">-it</span> <span class="nt">--rm</span> <span class="se">\</span>
  <span class="nt">-v</span> /etc/localtime:/etc/localtime:ro <span class="se">\</span>
  <span class="nt">-v</span> /dev/shm:/dev/shm <span class="se">\</span>
  <span class="nt">-v</span> <span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span>:/workspace <span class="se">\</span>
  <span class="nt">--user</span> <span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span>:<span class="si">$(</span><span class="nb">id</span> <span class="nt">-g</span><span class="si">)</span> <span class="se">\</span>
  nvidia/cuda:10.2-runtime-ubuntu18.04
</code></pre></div></div>
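
<p>To verify that containers can actually see the GPU, run <code class="language-plaintext highlighter-rouge">nvidia-smi</code> inside one:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># should list the host GPUs from inside the container
docker run --gpus all --rm nvidia/cuda:10.2-base nvidia-smi
</code></pre></div></div>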

<h3 id="create-a-dockerfile-for-cuda">Create a Dockerfile for CUDA</h3>

<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ARG</span><span class="s"> DOCKER_BASE_IMAGE=nvidia/cuda:10.2-basic</span>
<span class="k">FROM</span><span class="s"> $DOCKER_BASE_IMAGE</span>

<span class="k">RUN </span><span class="nb">rm</span> /etc/apt/sources.list.d/cuda.list <span class="o">&amp;&amp;</span> <span class="se">\
</span>    <span class="nb">rm</span> /etc/apt/sources.list.d/nvidia-ml.list <span class="o">&amp;&amp;</span> <span class="se">\
</span>    apt-get update <span class="o">&amp;&amp;</span> apt-get <span class="nb">install</span> <span class="nt">-y</span> <span class="nb">sudo</span>

<span class="k">COPY</span><span class="s"> pre-install.sh .</span>
<span class="k">RUN </span>./pre-install.sh

<span class="k">ARG</span><span class="s"> UID=1000</span>
<span class="k">ARG</span><span class="s"> GID=1000</span>
<span class="k">ARG</span><span class="s"> USER=docker</span>
<span class="k">ARG</span><span class="s"> PW=docker</span>

<span class="k">RUN </span>useradd <span class="nt">-m</span> <span class="k">${</span><span class="nv">USER</span><span class="k">}</span> <span class="nt">--uid</span><span class="o">=</span><span class="k">${</span><span class="nv">UID</span><span class="k">}</span> <span class="nt">-s</span> /bin/bash <span class="o">&amp;&amp;</span> <span class="se">\
</span>    <span class="nb">echo</span> <span class="s2">"</span><span class="k">${</span><span class="nv">USER</span><span class="k">}</span><span class="s2">:</span><span class="k">${</span><span class="nv">PW</span><span class="k">}</span><span class="s2">"</span> | chpasswd <span class="o">&amp;&amp;</span> <span class="se">\
</span>    adduser <span class="k">${</span><span class="nv">USER</span><span class="k">}</span> <span class="nb">sudo</span>

<span class="k">USER</span><span class="s"> ${USER}</span>
<span class="k">WORKDIR</span><span class="s"> /home/${USER}</span>
</code></pre></div></div>
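
<p>Build it so the container user matches your host UID/GID (this is what the <code class="language-plaintext highlighter-rouge">ARG</code> lines are for); the tag <code class="language-plaintext highlighter-rouge">cuda-dev</code> is just an example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker build -t cuda-dev \
  --build-arg UID=$(id -u) \
  --build-arg GID=$(id -g) .
</code></pre></div></div>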

<hr />

<h2 id="container-using-host-proxy">Container Using Host Proxy</h2>

<h3 id="1-configure-proxy">1. Configure Proxy</h3>

<p>Add the following to <code class="language-plaintext highlighter-rouge">~/.docker/config.json</code>:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"proxies"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"default"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"httpProxy"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8118"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"httpsProxy"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8118"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"noProxy"</span><span class="p">:</span><span class="w"> </span><span class="s2">"localhost"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Alternatively, set the proxy in the <code class="language-plaintext highlighter-rouge">Dockerfile</code> or during build:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker build <span class="nt">--net</span> host ...
</code></pre></div></div>

<hr />

<h2 id="accessing-containers-via-ssh">Accessing Containers via SSH</h2>

<h3 id="ssh-from-the-host-machine">SSH from the Host Machine</h3>

<ol>
  <li>Ensure SSH is installed and running in the container (a minimal setup sketch follows this list).</li>
  <li>
    <p>Find the container’s IP address:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker inspect &lt;container_id&gt; | <span class="nb">grep</span> <span class="s2">"IPAddress"</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>SSH to the container:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh user@&lt;container_ip_address&gt;
</code></pre></div>    </div>
  </li>
</ol>
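
<p>For step 1, a minimal sketch for Debian/Ubuntu-based images (package and service names differ on other distributions):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># open a root shell in the running container
docker exec -it &lt;container_id&gt; bash
# inside the container: install and start the SSH server
apt-get update &amp;&amp; apt-get install -y openssh-server
service ssh start
</code></pre></div></div>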

<h3 id="direct-ssh-to-containers-on-remote-machines">Direct SSH to Containers on Remote Machines</h3>

<p>Map the container’s SSH port to the host:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">-p</span> 52022:22 container1
docker run <span class="nt">-p</span> 53022:22 container2
</code></pre></div></div>

<p>SSH to the container using the host’s IP and mapped port:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-p</span> 52022 user@&lt;host_ip&gt;
</code></pre></div></div>

<hr />

<h2 id="accessing-files-inside-containers">Accessing Files Inside Containers</h2>

<ol>
  <li><strong>Map Directories</strong>: Use volume mapping during <code class="language-plaintext highlighter-rouge">docker run</code>.</li>
  <li>
    <p><strong>Set Up a Web Server</strong>: Run a basic HTTP server in the container:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 <span class="nt">-m</span> http.server
</code></pre></div>    </div>
  </li>
  <li><strong>Use WebDAV</strong>: Set up <a href="https://www.comparitech.com/net-admin/webdav/">WebDAV</a> for collaborative access.</li>
</ol>

<h3 id="webdav-example">WebDAV Example</h3>

<ol>
  <li>
    <p>Install WebDAV:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>wsgidav cheroot
</code></pre></div>    </div>
  </li>
  <li>
    <p>Create a <code class="language-plaintext highlighter-rouge">wsgidav.yaml</code> configuration file.</p>
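
    <p>A minimal sketch of what that file can contain; the key names follow the WsgiDAV documentation, but check the schema for your installed version:</p>

    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>host: 0.0.0.0
port: 8000
provider_mapping:
  "/": "./share"     # directory to expose
simple_dc:
  user_mapping:
    "*": true        # anonymous access; define real users for anything sensitive
</code></pre></div>    </div>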
  </li>
  <li>
    <p>Run WebDAV:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wsgidav <span class="nt">--config</span><span class="o">=</span>wsgidav.yaml <span class="nt">--host</span><span class="o">=</span>0.0.0.0 <span class="nt">--port</span><span class="o">=</span>8000 <span class="nt">--root</span> ./share
</code></pre></div>    </div>
  </li>
  <li>
    <p>Set up an SSH tunnel:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-f</span> <span class="nt">-N</span> <span class="nt">-L</span> 9980:0.0.0.0:8000 <span class="nt">-p</span> 12345 user@&lt;jumper_ip&gt;
</code></pre></div>    </div>
  </li>
  <li>
    <p>Access the container’s files via WebDAV (<code class="language-plaintext highlighter-rouge">dav://localhost:9980/</code>).</p>
  </li>
</ol>

<p>Enjoy seamless file management directly from your file explorer!</p>

<hr />]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="tutorial" /><summary type="html"><![CDATA[Docker tutorial: setting up containers for machine learning environments, images, Dockerfiles, and common commands]]></summary></entry><entry><title type="html">Setting Up a File Server on VPS with Nginx</title><link href="https://dongdongbh.tech/blog/file-server/" rel="alternate" type="text/html" title="Setting Up a File Server on VPS with Nginx" /><published>2020-06-09T00:00:00-04:00</published><updated>2025-01-06T15:07:50-05:00</updated><id>https://dongdongbh.tech/blog/file-server</id><content type="html" xml:base="https://dongdongbh.tech/blog/file-server/"><![CDATA[<hr />

<p>This guide provides a detailed walkthrough to set up a web file server using <strong>Nginx</strong>, <strong>h5ai</strong>, <strong>Aria2</strong>, and <strong>AriaNG</strong> on a Debian-based VPS. Additionally, it explains how to enhance functionality with SSL and local development options.</p>

<hr />

<h2 id="background">Background</h2>

<p>Learn to host a file server with <a href="https://larsjung.de/h5ai/">h5ai</a>, manage downloads using <a href="https://aria2.github.io/">Aria2</a>, and set up configurations via <strong>Nginx</strong> on a Debian 9 VPS.</p>

<hr />

<h2 id="how-to">How to</h2>

<h3 id="1-basic-nginx-configuration">1. Basic Nginx Configuration</h3>

<p>Update the Nginx configuration to host your file server:</p>

<ol>
  <li>Open <code class="language-plaintext highlighter-rouge">/etc/nginx/sites-enabled/default</code> and add:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">server {</span>
    <span class="s">listen xxxx;</span> <span class="c1"># Replace xxxx with your desired port</span>
    <span class="s">server_name localhost;</span>
    <span class="s">root /home/bh/share;</span>

    <span class="s">location / {</span>
        <span class="s">autoindex on;</span>           <span class="c1"># Enable directory listing</span>
        <span class="s">autoindex_exact_size on;</span> <span class="c1"># Show file sizes</span>
        <span class="s">autoindex_localtime on;</span> <span class="c1"># Show local time for files</span>
    <span class="err">}</span>
<span class="err">}</span>
</code></pre></div>    </div>
  </li>
  <li>Reload Nginx:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>service nginx reload
</code></pre></div>    </div>
  </li>
  <li>Access the file server at <code class="language-plaintext highlighter-rouge">yourdomain.com:xxxx</code>.</li>
</ol>

<hr />

<h3 id="2-enhance-with-h5ai">2. Enhance with h5ai</h3>

<h4 id="install-h5ai">Install h5ai</h4>

<ol>
  <li>Install PHP:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>php
</code></pre></div>    </div>
  </li>
  <li>Update the Nginx configuration in <code class="language-plaintext highlighter-rouge">/etc/nginx/sites-enabled/default</code>:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">server {</span>
    <span class="s">listen xxxx;</span>
    <span class="s">server_name localhost;</span>
    <span class="s">root /home/bh/share;</span>
    <span class="s">index index.html /_h5ai/public/index.php;</span>

    <span class="s">location ~ \.php$ {</span>
        <span class="s">fastcgi_pass unix:/run/php/php7.4-fpm.sock;</span> <span class="c1"># Check your PHP socket path</span>
        <span class="s">include fastcgi_params;</span>
        <span class="s">fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;</span>
        <span class="s">fastcgi_param SCRIPT_NAME $fastcgi_script_name;</span>
    <span class="s">}</span>
<span class="err">}</span>
</code></pre></div>    </div>
  </li>
  <li>Reload Nginx:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>service nginx reload
</code></pre></div>    </div>
  </li>
</ol>
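
<p>The configuration above expects the h5ai files under <code class="language-plaintext highlighter-rouge">/_h5ai</code> in the web root, so download and unpack h5ai there as well. A sketch (the version number may be out of date; check the h5ai site):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd /home/bh/share   # the web root from the Nginx config
wget https://release.larsjung.de/h5ai/h5ai-0.30.0.zip
unzip h5ai-0.30.0.zip   # creates the _h5ai directory
</code></pre></div></div>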

<hr />

<h3 id="3-add-folder-password-protection">3. Add Folder Password Protection</h3>

<p>Protect specific folders with HTTP authentication:</p>

<ol>
  <li>Install <code class="language-plaintext highlighter-rouge">apache2-utils</code>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>apache2-utils
</code></pre></div>    </div>
  </li>
  <li>Create a password file and user:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>htpasswd <span class="nt">-c</span> /etc/nginx/passwd your-username
</code></pre></div>    </div>
  </li>
  <li>Update Nginx configuration:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">location /private {</span>
    <span class="s">autoindex on;</span>
    <span class="s">auth_basic "Restricted Access";</span>
    <span class="s">auth_basic_user_file /etc/nginx/passwd;</span>
<span class="err">}</span>
</code></pre></div>    </div>
  </li>
  <li>Reload Nginx:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>service nginx reload
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h3 id="4-integrate-aria2-and-ariang">4. Integrate Aria2 and AriaNG</h3>

<h4 id="install-aria2">Install Aria2</h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>aria2
</code></pre></div></div>

<h4 id="configure-aria2">Configure Aria2</h4>
<ol>
  <li>Create configuration files:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> ~/.aria2
vim ~/.aria2/aria2.conf
</code></pre></div>    </div>
  </li>
  <li>Add the following:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">dir=/home/your-username/aria2/download</span>
<span class="s">enable-rpc=true</span>
<span class="s">rpc-listen-all=true</span>
<span class="s">rpc-listen-port=6800</span>
<span class="s">rpc-secret=your_rpc_password</span>
<span class="s">file-allocation=none</span>
<span class="s">continue=true</span>
<span class="s">max-concurrent-downloads=10</span>
</code></pre></div>    </div>
  </li>
  <li>Run Aria2:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aria2c <span class="nt">--conf-path</span><span class="o">=</span><span class="s2">"/home/your-username/.aria2/aria2.conf"</span>
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h4 id="install-and-configure-ariang">Install and Configure AriaNG</h4>

<ol>
  <li>Download <a href="https://github.com/mayswind/AriaNg/releases">AriaNG</a>.</li>
  <li>Place files in <code class="language-plaintext highlighter-rouge">/home/your-username/aria2/AriaNG</code>.</li>
</ol>
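
<p>A sketch of those two steps; the release version below is an example, so check the releases page for the latest:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir -p /home/your-username/aria2/AriaNG
cd /home/your-username/aria2/AriaNG
wget https://github.com/mayswind/AriaNg/releases/download/1.3.7/AriaNg-1.3.7.zip
unzip AriaNg-1.3.7.zip
</code></pre></div></div>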

<hr />

<h4 id="configure-nginx-for-aria2">Configure Nginx for Aria2</h4>

<ol>
  <li>Create a new Nginx configuration in <code class="language-plaintext highlighter-rouge">/etc/nginx/sites-available/aria.conf</code>:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">server {</span>
    <span class="s">listen 443 ssl;</span>
    <span class="s">server_name your-domain.com;</span>

    <span class="s">root /home/your-username/aria2/AriaNG;</span>

    <span class="s">location ^~ /jsonrpc {</span>
        <span class="s">proxy_pass http://127.0.0.1:6800/jsonrpc;</span>
        <span class="s">proxy_set_header Host $http_host;</span>
        <span class="s">proxy_set_header X-Real-IP $remote_addr;</span>
        <span class="s">proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;</span>
    <span class="s">}</span>

    <span class="s">ssl_certificate /path/to/fullchain.pem;</span> <span class="c1"># Adjust paths</span>
    <span class="s">ssl_certificate_key /path/to/privkey.pem;</span>
<span class="err">}</span>
</code></pre></div>    </div>
  </li>
  <li>Enable the configuration:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo ln</span> <span class="nt">-s</span> /etc/nginx/sites-available/aria.conf /etc/nginx/sites-enabled/
<span class="nb">sudo </span>service nginx reload
</code></pre></div>    </div>
  </li>
  <li>Visit the web interface and configure RPC settings in AriaNG.</li>
</ol>

<hr />

<h3 id="5-set-up-ssl-with-certbot">5. Set Up SSL with Certbot</h3>

<ol>
  <li>Install Certbot:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>certbot
</code></pre></div>    </div>
  </li>
  <li>Obtain and configure an SSL certificate:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>certbot <span class="nt">--nginx</span>
</code></pre></div>    </div>
  </li>
  <li>Update Nginx configurations to use SSL.</li>
</ol>
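
<p>Let’s Encrypt certificates expire after 90 days. The Certbot package installs a renewal timer for you; you can verify that renewal will work with a dry run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>certbot renew <span class="nt">--dry-run</span>
</code></pre></div></div>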

<hr />

<h3 id="6-run-aria2-as-a-daemon">6. Run Aria2 as a Daemon</h3>

<ol>
  <li>Create a systemd service:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>vim /etc/systemd/system/aria2.service
</code></pre></div>    </div>
  </li>
  <li>Add:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>Unit]
<span class="nv">Description</span><span class="o">=</span>Aria2c download manager
<span class="nv">After</span><span class="o">=</span>network.target

<span class="o">[</span>Service]
<span class="nv">User</span><span class="o">=</span>your-username
<span class="nv">ExecStart</span><span class="o">=</span>/usr/bin/aria2c <span class="nt">--conf-path</span><span class="o">=</span>/home/your-username/.aria2/aria2.conf
<span class="nv">Restart</span><span class="o">=</span>on-failure

<span class="o">[</span>Install]
<span class="nv">WantedBy</span><span class="o">=</span>multi-user.target
</code></pre></div>    </div>
  </li>
  <li>Enable and start the service:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl <span class="nb">enable </span>aria2.service
<span class="nb">sudo </span>systemctl start aria2.service
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h3 id="7-local-development-options">7. Local Development Options</h3>

<ol>
  <li>Use <strong>Samba</strong> or <strong>NFS</strong> for local file sharing.</li>
  <li>For SSHFS:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sshfs user@host:/path/to/share /local/mount/point
</code></pre></div>    </div>
    <p>Unmount with:</p>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>umount /local/mount/point
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<p>By following this guide, you can successfully set up a secure, functional file server, integrate powerful tools like Aria2 and AriaNG, and enable seamless file management both locally and remotely.</p>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[Set up web file server, h5ai, Aria2 with nginx on debian VPS.]]></summary></entry><entry><title type="html">Accessing an Intranet Machine from Anywhere Using FRP</title><link href="https://dongdongbh.tech/blog/expose-Intranet/" rel="alternate" type="text/html" title="Accessing an Intranet Machine from Anywhere Using FRP" /><published>2020-06-07T00:00:00-04:00</published><updated>2025-01-06T14:57:46-05:00</updated><id>https://dongdongbh.tech/blog/expose-Intranet</id><content type="html" xml:base="https://dongdongbh.tech/blog/expose-Intranet/"><![CDATA[<hr />

<h2 id="background">Background</h2>

<p><strong>Scenario</strong>:<br />
You have an intranet machine (e.g., a computer in your company) without a public IP, and you want to access it from home or anywhere with internet connectivity. You might also want to host a website on this local machine.</p>

<p><strong>Requirement</strong>:<br />
You need access to a server with a public IP.</p>

<hr />

<h2 id="solution-overview">Solution Overview</h2>

<p>To achieve this, tools like <strong>FRP</strong>, <a href="https://ngrok.com/">ngrok</a>, <strong>NPS</strong>, and <strong>Zerotier</strong> can be used. Here, we focus on <strong>FRP (Fast Reverse Proxy)</strong>, an open-source tool. You can download the appropriate version for your operating system from the <a href="https://github.com/fatedier/frp/releases">FRP GitHub Releases</a>.</p>

<p>This guide demonstrates how to set up FRP for SSH access. For more features, refer to the <a href="https://github.com/fatedier/frp">FRP documentation</a>.</p>

<hr />

<h2 id="ssh-usage">SSH Usage</h2>

<h3 id="file-setup">File Setup</h3>

<ol>
  <li>Place <strong><code class="language-plaintext highlighter-rouge">frps</code></strong> and <strong><code class="language-plaintext highlighter-rouge">frps.ini</code></strong> on the public server.</li>
  <li>Place <strong><code class="language-plaintext highlighter-rouge">frpc</code></strong> and <strong><code class="language-plaintext highlighter-rouge">frpc.ini</code></strong> on the intranet machine.</li>
</ol>

<hr />

<h3 id="accessing-the-intranet-machine-via-ssh">Accessing the Intranet Machine via SSH</h3>

<h4 id="step-1-configure-the-public-server">Step 1: Configure the Public Server</h4>

<p>Edit the <code class="language-plaintext highlighter-rouge">frps.ini</code> file:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># frps.ini
</span><span class="nn">[common]</span>
<span class="py">bind_port</span> <span class="p">=</span> <span class="s">7000</span>
</code></pre></div></div>

<p>Start <code class="language-plaintext highlighter-rouge">frps</code> in the background:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">nohup</span> ./frps <span class="nt">-c</span> ./frps.ini <span class="o">&gt;</span> /dev/null 2&gt;&amp;1 &amp;
</code></pre></div></div>

<hr />

<h4 id="step-2-configure-the-intranet-machine">Step 2: Configure the Intranet Machine</h4>

<p>Edit the <code class="language-plaintext highlighter-rouge">frpc.ini</code> file. Replace <code class="language-plaintext highlighter-rouge">x.x.x.x</code> with the public server’s IP address:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># frpc.ini
</span><span class="nn">[common]</span>
<span class="py">server_addr</span> <span class="p">=</span> <span class="s">x.x.x.x</span>
<span class="py">server_port</span> <span class="p">=</span> <span class="s">7000</span>

<span class="nn">[ssh]</span>
<span class="py">type</span> <span class="p">=</span> <span class="s">tcp</span>
<span class="py">local_ip</span> <span class="p">=</span> <span class="s">127.0.0.1</span>
<span class="py">local_port</span> <span class="p">=</span> <span class="s">22</span>
<span class="py">remote_port</span> <span class="p">=</span> <span class="s">6000</span>
</code></pre></div></div>

<p>Start <code class="language-plaintext highlighter-rouge">frpc</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./frpc <span class="nt">-c</span> ./frpc.ini
</code></pre></div></div>
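
<p>As with <code class="language-plaintext highlighter-rouge">frps</code>, you can keep the client running in the background after closing the shell:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">nohup</span> ./frpc <span class="nt">-c</span> ./frpc.ini <span class="o">&gt;</span> /dev/null 2&gt;&amp;1 &amp;
</code></pre></div></div>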

<hr />

<h4 id="step-3-connect-to-the-intranet-machine">Step 3: Connect to the Intranet Machine</h4>

<p>From any external machine, connect via SSH. Replace <code class="language-plaintext highlighter-rouge">x.x.x.x</code> with the public server’s IP, and assume the username is <code class="language-plaintext highlighter-rouge">test</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-oPort</span><span class="o">=</span>6000 <span class="nb">test</span>@x.x.x.x
</code></pre></div></div>

<hr />

<h3 id="important-notes">Important Notes</h3>

<ol>
  <li>Ensure the ports used in FRP (e.g., <code class="language-plaintext highlighter-rouge">7000</code>, <code class="language-plaintext highlighter-rouge">6000</code>) are open on the public server’s firewall (see the example below).</li>
  <li>Each client machine needs a unique <strong><code class="language-plaintext highlighter-rouge">remote_port</code></strong> for mapping.</li>
</ol>
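
<p>For note 1: if the server uses <code class="language-plaintext highlighter-rouge">ufw</code>, opening the ports looks like this (cloud providers may additionally require a security-group rule):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo ufw allow 7000/tcp   # frps bind_port
sudo ufw allow 6000/tcp   # remote_port used for SSH
</code></pre></div></div>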

<hr />

<h3 id="extending-access-for-web-applications">Extending Access for Web Applications</h3>

<p>To access tools like <strong>Jupyter Notebook</strong> or <strong>TensorBoard</strong>, simply add additional port mappings in <code class="language-plaintext highlighter-rouge">frpc.ini</code>. For example:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[jupyter]</span>
<span class="py">type</span> <span class="p">=</span> <span class="s">tcp</span>
<span class="py">local_ip</span> <span class="p">=</span> <span class="s">127.0.0.1</span>
<span class="py">local_port</span> <span class="p">=</span> <span class="s">8888</span>
<span class="py">remote_port</span> <span class="p">=</span> <span class="s">8889</span>
</code></pre></div></div>

<p>Then, access the application at <code class="language-plaintext highlighter-rouge">http://x.x.x.x:8889</code> (plain HTTP, unless the service itself serves TLS).</p>

<hr />

<h2 id="using-ssh-on-a-mobile-phone">Using SSH on a Mobile Phone</h2>

<p>For <strong>iOS</strong>, you can use <strong>Termius</strong>, which offers basic SSH functionality for free.</p>

<h3 id="steps-to-set-up-ssh-in-termius">Steps to Set Up SSH in Termius</h3>

<ol>
  <li>Open Termius.</li>
  <li>Navigate to <strong>Hosts</strong> → <strong>Add New</strong> → Enter the remote IP, SSH username, and password, then save.</li>
  <li>Connect to the host.</li>
</ol>

<hr />

<h3 id="using-ssh-keys-in-termius">Using SSH Keys in Termius</h3>

<ol>
  <li>Open Termius.</li>
  <li>Go to <strong>Keychain</strong> → <strong>Add Key</strong> (or use an existing key).</li>
  <li>Copy the public key.</li>
  <li>Append the Termius public key to the server’s <code class="language-plaintext highlighter-rouge">~/.ssh/authorized_keys</code> file.</li>
  <li>In Termius, edit the host, attach the created key, and save.</li>
  <li>Connect to the server.</li>
</ol>

<p>Enjoy secure and efficient SSH access from your mobile device!</p>

<hr />

<p>This guide enables you to set up remote SSH access to an intranet machine using FRP and explains how to extend its functionality for web applications and mobile devices.</p>]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="Blog" /><category term="content" /><category term="tutorial" /><summary type="html"><![CDATA[Expose Intranet machine to outside]]></summary></entry><entry><title type="html">Deep Reinforcement learning notes (UBC)</title><link href="https://dongdongbh.tech/UBC-RL/" rel="alternate" type="text/html" title="Deep Reinforcement learning notes (UBC)" /><published>2019-11-27T00:00:00-05:00</published><updated>2025-11-13T18:07:35-05:00</updated><id>https://dongdongbh.tech/UBC-RL</id><content type="html" xml:base="https://dongdongbh.tech/UBC-RL/"><![CDATA[<h2 id="background">Background</h2>

<p>These are my notes for the UC Berkeley deep reinforcement learning course (CS294-112, now CS285), taught by <a href="https://people.eecs.berkeley.edu/~svlevine/">Sergey Levine</a>. The lecture videos can be found on YouTube. I have written two reinforcement learning notes before: one on <a href="https://dongdongbh.tech/RL-note/">basic RL</a> and one on the <a href="https://dongdongbh.tech/RL-courses/">David Silver course</a>.</p>

<p>Compared with those, this course takes a deeper theoretical view, covers more recent methods, and includes more advanced topics, especially model-based RL and meta-learning. It is well suited to readers interested in robotics control and a deeper understanding of reinforcement learning.</p>

<p>The class is fairly demanding, so make sure you keep up with the lectures.</p>

<p>Some of the math may not render properly in this post; <a href="../assets/pdf/UBC.pdf">download the PDF version of the notes</a> instead.</p>

<h4 id="table-of-contents">Table of contents</h4>

<p><a href="#Background">Background</a></p>

<p><a href="#1. Imitation learning">1. Imitation learning
 </a></p>

<p><a href="#2. Policy gradient">2. Policy gradient
 </a></p>

<p><a href="#3. Actor-critic method">3. Actor-critic method
 </a></p>

<p><a href="#4. Value based methods">4. Value based methods
 </a></p>

<p><a href="#5. Practical Q-learning">5. Practical Q-learning
 </a></p>

<p><a href="#6. Advanced Policy Gradients">6. Advanced Policy Gradients
 </a></p>

<p><a href="#7. Optimal Control and Planning">7. Optimal Control and Planning
 </a></p>

<p><a href="#8. Model-Based Reinforcement Learning (learning the model)">8. Model-Based Reinforcement Learning (learning the model)
 </a></p>

<p><a href="#9. Model-Based RL and Policy Learning">9. Model-Based RL and Policy Learning
 </a></p>

<p><a href="#10 Variational Inference and Generative Models">10 Variational Inference and Generative Models
 </a></p>

<p><a href="#11. Re-framing Control as an Inference Problem">11. Re-framing Control as an Inference Problem
 </a></p>

<p><a href="#12. Inverse Reinforcement Learning">12. Inverse Reinforcement Learning
 </a></p>

<p><a href="#13. Transfer and Multi-task Learning">13. Transfer and Multi-task Learning
 </a></p>

<p><a href="#14. Distributed RL">14. Distributed RL
 </a></p>

<p><a href="#15. Exploration">15. Exploration
 </a></p>

<p><a href="#16 Meta Reinforcement learning">16 Meta Reinforcement learning
 </a></p>

<p><a href="#17 Information theory, challenges, open problems">17 Information theory, challenges, open problems
 </a></p>

<p><a href="#18 Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)">18 Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)
 </a></p>

<h2 id="1-imitation-learning">1. Imitation learning</h2>

<h3 id="the-main-problem-of-imitation-distribution-drift">The main problem of imitation: distribution drift</h3>

<p>How can we make the training-data distribution match the distribution of observations seen under the learned policy?</p>

<p>DAgger</p>

<h4 id="dagger-dataset-aggregation">DAgger: Dataset Aggregation</h4>

<p>goal: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)$ !</p>

<p>how? just run $\pi_\theta(a_t \mid o_t)$</p>

<p>but need labels $a_t$ !</p>

<ol>
  <li>train $\pi_{\theta}(a_t \mid o_t)$ from human data $\mathcal{D}={o_1,a_1,…,o_N,a_N}$</li>
  <li>run $\pi_\theta(a_t \mid o_t)$ to get dataset $\mathcal{D}_\pi = {o_1,…,o_M}$</li>
  <li>ask human to label $\mathcal{D}_\pi$ with actions $a_t$</li>
  <li>aggregate: $\mathcal{D}\gets \mathcal{D}\cup\mathcal{D}_\pi$</li>
</ol>

<p>An alternative to collecting more labels: fit the expert so well that the policy does not drift.</p>

<h4 id="why-fail-to-fit-expert">why fail to fit expert?</h4>

<ol>
  <li>Non-Markovian behavior
    <ul>
      <li>use history observations</li>
    </ul>
  </li>
  <li>Multimodal behavior
    <ul>
      <li>for discrete actions this is fine, since the softmax outputs a probability over all actions</li>
      <li>for continuous actions, options include:
        <ul>
          <li>output a mixture of Gaussians</li>
          <li>latent variable models (inject noise into the network input)</li>
          <li>autoregressive discretization</li>
        </ul>
      </li>
    </ul>
  </li>
</ol>

<p>other problems of imitation learning</p>

<ul>
  <li>human labeled data is finite</li>
  <li>human not good at some problems</li>
</ul>

<h4 id="reward-function-of-imitation-learning">reward function of imitation learning</h4>

<p>reward function of imitation learning can be</p>

\[r(s,a) = \log {p(a=\pi^*(s) \mid s)}\]

<h2 id="mdp--rl-intro">MDP &amp; RL Intro</h2>

<h3 id="the-goal-of-rl">The goal of RL</h3>

<p>expected reward</p>

\[p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)\\
\theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\]

<p>where $p_\theta(\tau)$ is the distribution of the sequence</p>

<h3 id="q--v">Q &amp; V</h3>

\[V^\pi(s_t)=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t]\\
V^\pi(s_t)=E_{a_t\sim\pi(a_t \mid s_t)}[Q^\pi(s_t,a_t)]\]

<h3 id="types-of-rl-algorithms">Types of RL algorithms</h3>

<ul>
  <li>Policy gradient</li>
  <li>value-based</li>
  <li>Actor-critic</li>
  <li>model-based RL, where the learned model is used
    <ul>
      <li>for planning (optimal control, discrete planning)</li>
      <li>to improve a policy</li>
      <li>for something else (dynamic programming, generating simulated experience)</li>
    </ul>
  </li>
</ul>

<h4 id="trade-offs">trade-offs</h4>

<ul>
  <li>sample efficiency</li>
  <li>stability &amp; ease of use</li>
</ul>

<h4 id="assumptions">assumptions</h4>

<ul>
  <li>stochastic or deterministic</li>
  <li>continuous or discrete</li>
  <li>episodic or infinite horizon</li>
</ul>

<h4 id="sample-efficiency">sample efficiency</h4>

<ul>
  <li><strong>off policy</strong>: able to improve the policy without generating new samples from that policy</li>
  <li><strong>on policy</strong>: each time the policy is changed, even a little bit, we need to generate new samples</li>
</ul>

<h4 id="stability--ease-of-use">stability &amp; ease of use</h4>

<p><strong>convergence</strong> is a problem</p>

<p>Supervised learning almost <em>always</em> gradient descent</p>

<p>RL often <em>not</em> strictly gradient descent</p>

<h2 id="2-policy-gradient">2. Policy gradient</h2>

<h4 id="objective-function">Objective function</h4>

\[\theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\\
J(\theta)=E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\approx\frac{1}{N}\sum_i\sum_tr(s_{i,t},a_{i,t})\]

<h4 id="policy-differentiation">policy differentiation</h4>

<h5 id="log-derivative">log derivative</h5>

\[\begin{align}
\pi_\theta(\tau)\Delta \log \pi_\theta(\tau)&amp;=\pi_\theta(\tau)\frac{\Delta\pi_\theta(\tau)}{\pi_\theta(\tau)}=\Delta\pi_\theta(\tau)\\
\pi_\theta(\tau)&amp;=\pi_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)\\
\log \pi_\theta(\tau) &amp;=\log p(s_1) + \sum_{t=1}^T \log \pi_\theta (a_t \mid s_t) + \log p(s_{t+1} \mid s_t,a_t)\\
&amp;\Delta_\theta \left[\log p(s_1) + \sum_{t=1}^T \log \pi_\theta (a_t \mid s_t) + \log p(s_{t+1} \mid s_t,a_t)\right]= \sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t \mid s_t) 
\end{align}\]

<h5 id="objective-function-differentiation">Objective function differentiation</h5>

\[\begin{align}
\theta^*&amp;=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[{r}(\tau)]=\int\pi_\theta(\tau)r(\tau)d\tau\\
{r}(\tau)&amp;=\sum_t r(s_t,a_t)\\
\Delta_\theta J(\theta)&amp;=\int\Delta_\theta \pi_\theta(\tau)r(\tau)d\tau\\
&amp;=\int\pi_\theta(\tau)\Delta_\theta \log \pi_\theta(\tau)r(\tau) d\tau\\
&amp;=E_{r\sim\pi_\theta}[\Delta_\theta \log \pi_\theta(\tau)r(\tau)]\\
&amp;=E_{r\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t \mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]
\end{align}\]

<h5 id="evaluating-the-policy-gradient">evaluating the policy gradient</h5>

\[\begin{align}
\Delta_\theta J(\theta)&amp;=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t \mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]\\
\Delta_\theta J(\theta)&amp;\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t \mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\\
\theta &amp;\gets\theta+\alpha\Delta_\theta J(\theta)
\end{align}\]

<h4 id="reinforce-algorithm">REINFORCE algorithm</h4>

<ol>
  <li>sample ${\tau^i}$ from $\pi_\theta(a_t \mid s_t)$ (run the policy)</li>
  <li>$\Delta_\theta J(\theta)\approx\sum_{i}^N\left(\sum_{t} \Delta_\theta \log \pi_\theta (a_t \mid s_t) \right)\left(\sum_{t} r(s_t,a_t)\right)$</li>
  <li>$\theta \gets\theta+\alpha\Delta_\theta J(\theta)$</li>
</ol>

<h5 id="policy-gradient">policy gradient</h5>

\[\Delta_\theta J(\theta)\approx\frac{1}{N} \Delta_\theta \log \pi_\theta (\tau) r(\tau)\]

<h5 id="reduce-variance">Reduce variance</h5>

<p><strong>Causality</strong>: the policy at time $t'$ cannot affect the reward at time $t$ when $t&lt;t'$</p>

\[\begin{align}
\Delta_\theta J(\theta)&amp;\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\
&amp;\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\\
&amp;=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\hat{Q}_{i,t}
\end{align}\]

<p><strong>baseline</strong></p>

<p>\(b=\frac{1}{N}\sum_{i=1}^{N}r(\tau_i)\\
\Delta_\theta J(\theta)\approx\frac{1}{N} \Delta_\theta \log \pi_\theta (\tau) [r(\tau)-b]\)
Proof that subtracting a baseline leaves the gradient unbiased:
\(\begin{align}
E[\Delta_\theta \log \pi_\theta(\tau)b]&amp;=\int \pi_\theta(\tau) \Delta_\theta \log \pi_\theta(\tau)b \, d\tau \\
&amp;= \int\Delta_\theta \pi_\theta(\tau)b \, d\tau\\
&amp;=b\Delta_\theta\int\pi_\theta(\tau)d\tau\\
&amp;=b\Delta_\theta 1\\
&amp;=0
\end{align}\)</p>

<p>Here, $\tau$ means a whole <strong>episode</strong> sampled by the current policy.</p>

<p>We can prove that there is an optimal baseline that minimizes the variance, where $g(\tau)=\Delta_\theta \log \pi_\theta(\tau)$:</p>

\[b=\frac{E[g(\tau)^2r(\tau)]}{E[g(\tau)^2]}\]

<p>But in practice, we just use the average reward as the baseline to keep things simple.</p>

<blockquote>
  <p>policy gradient is <strong>on-policy</strong> algorithm</p>
</blockquote>

<h4 id="off-policy-learning--importance-sampling">Off-policy learning &amp; importance sampling</h4>

\[\theta^*=\underset{\theta}{\arg\max} J(\theta)\\
J(\theta)=E_{\tau\sim\pi_\theta(\tau)}[r(\tau)]\]

<p>what if we sample from $\bar{\pi}(\tau)$ instead?</p>

<p><strong>Importance sampling</strong></p>

\[\begin{align}
E_{x\sim p(x)}[f(x)]&amp;=\int p(x)f(x)dx\\
&amp;=\int \frac {q(x)}{q(x)}p(x)f(x)dx\\
&amp;=E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right]
\end{align}\]

<p>so apply this to our objective function, we have</p>

\[J(\theta)=E_{\tau\sim\bar{\pi}(\tau)}\left[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\right]\]

<p>and we have</p>

\[\pi_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)\\
\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}=\frac{p(s_1)\prod_{t=1}^T\pi_\theta(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)}=\frac{\prod_{t=1}^T\pi_\theta(a_t \mid s_t)}{\prod_{t=1}^T \bar{\pi}(a_t \mid s_t)}\]

<p>so we have</p>

\[\begin{align}
J(\theta')&amp;=E_{\tau\sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]\\
\Delta_{\theta'}J(\theta')&amp;=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\Delta_{\theta'}\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}r(\tau)\right]\\
&amp;=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\Delta_{\theta'} \log \pi_{\theta}(\tau)r(\tau)\right]
\end{align}\]

<p><strong>The off-policy policy gradient</strong></p>

\[\begin{align}
\Delta_{\theta'}J(\theta')&amp;=E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\Delta_{\theta'} \log \pi_{\theta}(\tau)r(\tau)\right]\\
&amp;=E_{\tau\sim\pi_\theta}\left[\left(\prod_{t=1}^T\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\right)\left(\sum_{t=1}^T \Delta_{\theta'} \log \pi_{\theta'} (a_t \mid s_t) \right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right]\\
&amp;=E_{\tau\sim\pi_\theta}\left[\left(\sum_{t=1}^T \Delta_{\theta'} \log \pi_{\theta'} (a_t \mid s_t) \right)\left(\prod_{t'=1}^t\frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}\right)\left(\sum_{t'=t}^T r(s_{t'},a_{t'})\left(\prod_{t''=t}^{T}\frac{\pi_{\theta'}(a_{t''} \mid s_{t''})}{\pi_\theta(a_{t''} \mid s_{t''})}\right)\right)\right]
\end{align}\]

<p>we can view state and action separately, then:</p>

\[\begin{align}
\theta^*&amp;=\underset{\theta}{\arg\max} \sum_{t=1}^TE_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]\\
J(\theta)&amp;=E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]\\
&amp;=E_{s_t\sim p_\theta(s_t)}\left[E_{a_t\sim \pi(a_t,s_t)}[r(s_t,a_t)]\right]\\
J(\theta')&amp;=E_{s_t\sim p_\theta(s_t)}\left[\cancel{\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}}E_{a_t\sim \pi(a_t,s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}r(s_t,a_t)\right]\right]
\end{align}\]

<p>If the ratio $\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}$ is bounded (the two policies stay close), we can drop it; this leads to the <strong>TRPO</strong> method, which we will discuss later.</p>

<p>For coding, we can use “pseudo-loss” as weighted maximum likelihood with automatic differentiation:</p>

\[\bar{J}(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log \pi_\theta (a_{i,t} \mid s_{i,t})\hat{Q}_{i,t}\]

<h5 id="policy-gradient-in-practice">policy gradient in practice</h5>

<ul>
  <li>the gradient has <strong>high variance</strong>
    <ul>
      <li>this isn’t the same as supervised learning!</li>
      <li>gradients will be really noisy!</li>
    </ul>
  </li>
  <li>consider using much <strong>larger batches</strong></li>
  <li>tweaking <strong>learning rates</strong> is very hard
    <ul>
      <li>adaptive step-size rules like ADAM can be OK-ish</li>
      <li>we will cover policy-gradient-specific learning-rate adjustment methods later</li>
    </ul>
  </li>
</ul>

<h2 id="3-actor-critic-method">3. Actor-critic method</h2>

<h3 id="basics">Basics</h3>

<p>recap policy gradient</p>

\[\begin{align}
\Delta_\theta J(\theta)&amp;\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\\
&amp;\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\\
&amp;=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\hat{Q}_{i,t}
\end{align}\]

<p>where $\hat{Q}_{i,t}$ is computed from a single sampled trajectory: an unbiased estimate, but one with high variance.</p>

<p>We can use expectation to reduce variance</p>

\[\hat{Q}_{i,t}\approx \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t,a_t]\]

<p>And we define</p>

\[\hat{Q}_{i,t}= \sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t,a_t]\\
V(s_t)=E_{a_t\sim\pi(a_t \mid s_t)}[Q(s_t,a_t)]\]

<p>then</p>

\[\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})(Q(s_{i,t}, a_{i,t})-V(s_{i,t}))\]

<h4 id="advantage">Advantage</h4>

\[A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)\\
\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})A^\pi(s_t,a_t)\]

<p>The better the estimate of $A^\pi(s_t,a_t)$, the lower the variance.</p>

<h4 id="value-function-fitting">Value function fitting</h4>

\[Q^\pi(s_t,a_t)=r(s_t,a_t)+E_{s_{t+1}\sim p(s_{t+1} \mid s_t,a_t)}[V^\pi(s_{t+1})]\]

<p>and we accept a little bias (a single-sample estimate of the next state) for convenience</p>

\[Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})\]

<p>so we have</p>

\[A^\pi(s_t,a_t) \approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)\]

<p>then we only need to fit $V^\pi(s)$ !</p>

<h4 id="policy-evaluation">Policy evaluation</h4>

\[V^\pi(s_t)=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t]\\
J(\theta)=E_{s_1\sim p(s_1)}[V^\pi(s_1)]\]

<p>Monte Carlo policy evaluation (this is what policy gradient does)</p>

\[V^\pi(s_t)\approx \sum_{t'=t}^Tr(s_{t'},a_{t'})\]

<p>We can average multiple samples <strong>if we can reset</strong> the environment to a previous state:</p>

\[V^\pi(s_t)\approx \frac{1}{N}\sum_{i=0}^N\sum_{t'=t}^Tr(s_{t'},a_{t'})\]

<p><strong>Monte Carlo evaluation with function approximation</strong></p>

<p>With function approximation, even a single sample per state from the trajectory works quite well.</p>

<p>training data: ${\left(s_{i,t},\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\right)}$</p>

<p>supervised regression: $\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2$</p>

<p>Ideal target:</p>

\[y_{i,t}=\sum_{t'=t}^T E_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t]\approx r(s_{i,t},a_{i,t})+V^\pi(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\]

<p>Monte Carlo target:</p>

\[y_{i,t}=\sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\]

<h4 id="tdbootstrapped">TD(bootstrapped)</h4>

<p>training data: $ {\left(s_{i,t},r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\right)} $</p>

<h3 id="actor-critic-algorithm">Actor-critic algorithm</h3>

<p>batch actor-critic algorithm:</p>

<ol>
  <li>sample ${s_i,a_i}$ from $\pi_\theta(a \mid s)$</li>
  <li>fit $\hat{V_\phi^\pi}(s)$ to sampled reward sums</li>
  <li>evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)$</li>
  <li>$\Delta_\theta J(\theta)\approx \sum_i \Delta_\theta \log \pi_\theta (a_{i} \mid s_{i})\hat{A}^\pi(s_i,a_i)$</li>
  <li>$\theta \gets \theta+\alpha\Delta_\theta J(\theta)$</li>
</ol>

\[V^\pi(s_{i,t})=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_{i,t}]\\
V^\pi(s_{i,t})\approx\sum_{t'=t}^Tr(s_{t'},a_{t'})\\
V^\pi(s_{i,t})\approx r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})\\
\mathcal{L}=\frac{1}{2}\sum_i\parallel \hat{V_\phi^\pi}(s_i)-y_i\parallel^2\]

<h4 id="aside-discount-factors">Aside: discount factors</h4>

<p>what if T (episode length) is $\infty$ ?</p>

<p>$\hat{V}_\phi^\pi$ can get infinitely large in many cases</p>

<p>simple trick: better to get rewards sooner than later</p>

\[V^\pi(s_{i,t})\approx r(s_{i,t},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1})\\
\gamma \in [0,1]\]
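
<p>A quick check of why the discount keeps values finite, assuming a constant reward $r$ at every step:</p>

\[\sum_{t=0}^{\infty}\gamma^t r=\frac{r}{1-\gamma}\]

<p>e.g. $r=1$ with $\gamma=0.99$ gives a value of $100$, whereas $\gamma=1$ diverges.</p>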

<p>actually we use discount in policy gradient as</p>

\[\Delta_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)\]

<p>Online actor-critic algorithm(can apply to every single step):</p>

<ol>
  <li>take action $a\sim\pi_\theta(a \mid s)$, get $(s,a,s',r)$</li>
  <li>update $\hat{V}_\phi^\pi(s)$ using target $r+\gamma\,\hat{V}_\phi^\pi(s')$</li>
  <li>evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\,\hat{V}_\phi^\pi(s')-\hat{V}_\phi^\pi(s)$</li>
  <li>$\Delta_\theta J(\theta)\approx \Delta_\theta \log \pi_\theta (a \mid s)\hat{A}^\pi(s,a)$</li>
  <li>$\theta \gets \theta+\alpha\Delta_\theta J(\theta)$</li>
</ol>

<h4 id="architecture-design">Architecture design</h4>

<p>network architecture choice</p>

<ul>
  <li>separate value and policy networks (more stable and simpler)</li>
  <li>partially shared value and policy networks (shared features)</li>
</ul>

<p>Online actor-critic works best with batched updates (e.g., from parallel workers).</p>

<h4 id="trade-off-and-balance">trade-off and balance</h4>

<p>policy gradient</p>

\[\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\left(\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-b\right)\]

<p>Actor-critic</p>

\[\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\left(r(s_i,a_i)+\hat{V}_\phi^\pi(s'_i)-\hat{V}_\phi^\pi(s_i)\right)\]

<p>Policy gradient is no bias but has higher variance</p>

<p>Actor-critic is lower variance but not unbiased</p>

<blockquote>
  <p>so can we combine these two things?</p>
</blockquote>

<p>Here we have <strong>critics as state-dependent baselines</strong></p>

\[\Delta_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t})\left(\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-\hat{V}_\phi^\pi(s_{i,t})\right)\]

<ul>
  <li>no bias</li>
  <li>lower variance</li>
</ul>

<p><strong>Eligibility traces &amp; n-step returns</strong></p>

<p>Critic and Monte Carlo critic</p>

\[\hat{A}^\pi_C(s_t,a_t)=r(s_t,a_t)+\gamma\hat{V}_\phi^\pi(s_{t+1})-\hat{V}_\phi^\pi(s_t) \\
\hat{A}^\pi_{MC}(s_t,a_t)=\sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_{t})\]

<blockquote>
  <p>combine these two?</p>
</blockquote>

<p>n-step returns</p>

\[\hat{A}^\pi_{n}(s_t,a_t)=\sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^n\hat{V}_\phi^\pi(s_{t+n})-\hat{V}_\phi^\pi(s_{t})\]

<p>Choosing $n&gt;1$ often works better!!!</p>

<p><strong>Generalized advantage estimation(GAE)</strong></p>

<blockquote>
  <p>Do we have to choose just one n?</p>
</blockquote>

<p>Cut everywhere all at once!</p>

\[\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{n=1}^\infty w_n\hat{A}(s_t,a_t)\]

<blockquote>
  <p>How to weight?</p>
</blockquote>

<p>Mostly prefer cutting earlier (less variance): $w_n\propto\lambda^{n-1}$, e.g. $\lambda=0.95$</p>

<p>and this leads to Eligibility traces</p>

\[\hat{A}^\pi_{GAE}(s_t,a_t)=\sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\delta_{t'}\\
\delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}_\phi^\pi(s_{t'+1})-\hat{V}_\phi^\pi(s_{t'})\]

<p>In this form, updating the estimate for a state requires the subsequent steps of experience to have been collected.</p>

<h2 id="4-value-based-methods">4. Value based methods</h2>

<p>$\underset{a_t}{\arg\max}A^\pi(s_t,a_t)$ : best action from $s_t$, if we then follow $\pi$</p>

<p>then:</p>

\[\pi'(a_t \mid s_t)=\begin{cases}1, &amp;if \quad a_t=\underset{a_t}{\arg\max}A^\pi(s_t,a_t) \cr 0, &amp;otherwise\end{cases}\]

<blockquote>
<p>this is at least as good as any $a_t \sim \pi(a_t \mid s_t)$</p>
</blockquote>

<h3 id="policy-iteration">Policy iteration</h3>

<ol>
  <li>evaluate $A^\pi(s,a)$</li>
  <li>set $\pi \gets \pi'$</li>
</ol>

<h3 id="dynamic-programming">Dynamic programming</h3>

<p>assume we know $p(s’ \mid s,a)$ and s and a are both discrete (and small)</p>

<p>bootstrapped update:</p>

\[V^\pi(s) \gets E_{a\sim\pi(a \mid s)}[r(s,a)+\gamma E_{s'\sim p(s' \mid s,a)}[V^\pi(s')]]\]

<p>with deterministic policy $\pi(s)=a$, we have</p>

\[V^\pi(s) \gets r(s,\pi(s))+\gamma E_{s'\sim p(s' \mid s,\pi(s))}[V^\pi(s')]\]

\[\underset{a_t}{\arg\max}A^\pi(s_t,a_t)=\underset{a_t}{\arg\max}Q^\pi(s_t,a_t)\\
Q^\pi(s,a)=r(s,a)+\gamma E[V^\pi(s')]\]

<p>So policy iteration become</p>

<ol>
  <li>set $Q^\pi(s,a)\gets r(s,a)+\gamma E[V^\pi(s')]$</li>
  <li>set $V(s)\gets \max_a Q(s,a)$</li>
</ol>

<h4 id="function-approximator">Function approximator</h4>

<p>$\mathcal{L}=\frac{1}{2}\sum_i\parallel V_\phi (s)-\max_a Q(s,a)\parallel^2$</p>

<h5 id="fitted-value-iteration">fitted value iteration</h5>

<p>fitted value iteration algorithm:</p>

<ol>
  <li>set $y_i \gets \max_{a_i}(r(s_i,a_i)+\gamma E[V_\phi(s'_i)])$</li>
  <li>set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel V_\phi (s_i)-y_i\parallel^2$</li>
</ol>

<p>But we cannot take the max over actions without knowing the dynamics, so we evaluate Q instead of V:</p>

\[Q^\pi(s,a) \gets r(s,a)+\gamma E_{s'\sim p(s' \mid s,a)}[Q^\pi(s',\pi(s'))]\]

<h5 id="fitted-q-iteration">fitted Q-iteration</h5>

<ol>
  <li>collect dataset ${(s_i, a_i,s_i',r_i)}$ using some policy</li>
  <li>set $y_i \gets r(s_i,a_i) +\gamma \max_{a_i'}Q_\phi(s_i',a_i')$</li>
  <li>set $\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel Q_\phi(s_i,a_i)-y_i\parallel^2$; repeat steps 2 and 3 K times, then return to step 1</li>
</ol>

<p>Q-learning is <strong>off-policy</strong>: it fits $Q(s,a)$ for all state-action pairs, and the target takes a <strong>max</strong> over actions rather than using an action sampled from the current policy, so transitions collected under any policy can be reused. Likewise, given $s$ and $a$, the reward $r(s,a)$ and the transition are independent of $\pi$.</p>

<h5 id="exploration">exploration</h5>

<ol>
  <li>epsilon-greedy</li>
</ol>

\[\pi(a_t \mid s_t)=\begin{cases}1-\epsilon , &amp;\text{if}\; a_t=\underset{a_t}{\arg\max}Q_\phi(s_t,a_t) 
 \cr \epsilon/( \mid \mathcal{A} \mid -1), &amp;\text{otherwise}\end{cases}\]

<ol start="2">
  <li>Boltzmann exploration</li>
</ol>

\[\pi(a_t \mid s_t) \propto \exp(Q_\phi(s_t,a_t))\]

<h4 id="value-function-learning-theory">Value function learning theory</h4>

<p>value iteration:</p>

<ol>
  <li>set $Q(s,a) \gets r(s,a)+\gamma E[V(s')]$</li>
  <li>set $V(s) \gets \max_a Q(s,a)$</li>
</ol>

<p>In the tabular case, value iteration converges.</p>

<p>In the non-tabular case (with function approximation), convergence is not guaranteed.</p>

<p>Actor-critic also needs to estimate $V$, and when it uses the bootstrapped update it has the same problem: convergence cannot be guaranteed.</p>

<h2 id="5-practical-q-learning">5. Practical Q-learning</h2>

<p>What’s wrong with online Q-learning?</p>

<blockquote>
  <p>Actually, it is not true gradient descent: it does not differentiate through the target Q inside $y$.</p>

  <p>And the samples are not i.i.d.: consecutive transitions are highly correlated.</p>
</blockquote>

<h3 id="replay-buffer-replay-samples-many-times">Replay buffer (replay samples many times)</h3>

<p>Q-learning with a replay buffer:</p>

<ol>
  <li>collect dataset ${(s_i,a_i,s_i',r_i)}$ using some policy, <strong>add</strong> it to $\mathcal{B}$</li>
  <li>sample a batch $(s_i,a_i,s_i',r_i)$ from $\mathcal{B} $</li>
  <li>$\phi \gets\phi-\alpha\sum_i\frac{d Q_\phi}{d \phi} (s_i,a_i) \left(Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a_i'}Q_\phi(s_i',a_i')]\right)$; repeat steps 2 and 3 $K$ times</li>
</ol>
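<p>A minimal ring-buffer implementation of $\mathcal{B}$, as a sketch (the capacity and field layout are illustrative):</p>

<pre><code class="language-python">import numpy as np

class ReplayBuffer:
    """Fixed-capacity buffer; old transitions are overwritten once full."""

    def __init__(self, capacity=100_000):
        self.storage, self.capacity, self.pos = [], capacity, 0

    def add(self, s, a, r, s2):
        if len(self.storage) &lt; self.capacity:
            self.storage.append((s, a, r, s2))
        else:
            self.storage[self.pos] = (s, a, r, s2)
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, rng=np.random.default_rng()):
        idx = rng.integers(len(self.storage), size=batch_size)
        s, a, r, s2 = zip(*(self.storage[i] for i in idx))
        return (np.array(s), np.array(a), np.array(r), np.array(s2))
</code></pre>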

<h3 id="target-network">Target network</h3>

<h4 id="dqn-target-networkreplay-buffer">DQN (Target network+Replay buffer)</h4>

<ol>
  <li>save target network parameters: $\phi' \gets \phi$</li>
  <li>collect dataset ${(s_i,a_i,s_i',r_i)}$ using some policy, <strong>add</strong> it to $\mathcal{B}$ ; do this N times</li>
  <li>sample a batch $(s_i,a_i,s_i',r_i)$ from $\mathcal{B} $</li>
  <li>$\phi \gets \arg\min_\phi \frac{1}{2}\sum_i\parallel Q_\phi(s_i,a_i)- [r(s_i,a_i) +\gamma \max_{a_i'}Q_{\phi'}(s_i',a_i')]\parallel ^2$ ; repeat steps 3 and 4 $K$ times, then return to step 1</li>
</ol>

<h4 id="alternative-target-network">Alternative target network</h4>

<p>Polyak averaging: soft update to avoid sudden target network update:</p>

<p>update $\phi'$: $\phi' \gets \tau \phi' + (1-\tau)\phi$ e.g. $\tau =0.999$</p>

<h3 id="double-q-learning">Double Q-learning</h3>

<h4 id="are-the-q-values-accurate">Are the Q-values accurate?</h4>

<p>It’s often much <strong>larger</strong> than the true value: the <strong>maximum</strong> over <strong>noisy</strong> Q estimates is biased upward (for noisy estimates, $E[\max(X_1,X_2)]\ge\max(E[X_1],E[X_2])$), so the Q-function overestimates.</p>

<p>Target value $y_j=r_j +\gamma \max_{a_j'}Q_{\phi'}(s_j',a_j')$</p>

\[\max_{a'}Q_{\phi'}(s',a') = Q_{\phi'}(s',\arg \max_{a'}Q_{\phi'}(s',a'))\]

<p>the value <em>also</em> comes from $Q_{\phi'}$, with the action selected according to $Q_{\phi'}$</p>

<p>How to address this?</p>

<h4 id="double-q-learning-1">Double Q-learning</h4>

<p>idea: don’t use the same network to choose the action and evaluate value! (<strong>de-correlate</strong> the noise)</p>

<p>use two networks:</p>

\[Q_{\phi_A}\gets r +\gamma Q_{\phi_B}(s',\arg \max_{a'}Q_{\phi_A}(s',a')) \\
Q_{\phi_B}\gets r +\gamma Q_{\phi_A}(s',\arg \max_{a'}Q_{\phi_B}(s',a'))\]

<p>each network’s target value comes from the <strong>other</strong> network!</p>

<h4 id="double-q-learning-in-practice">Double Q-learning in practice</h4>

<p>Just use the current and target networks as $\phi_A$ and $\phi_B$: the current network chooses the action, and the target network evaluates its Q value.</p>

<p>standard Q-learning: $y=r+\gamma Q_{\phi'}(s', \arg \max_{a'}Q_{\phi'}(s',a'))$</p>

<p>double Q-learning: $y=r+\gamma Q_{\phi'}(s', \arg \max_{a'}Q_{\phi}(s',a'))$</p>
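<p>A small numpy illustration of the two targets, with toy Q-tables standing in for the current and target networks (all numbers are made up):</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
Q_cur  = rng.normal(size=(5, 3))   # current network, as a (state, action) table
Q_targ = rng.normal(size=(5, 3))   # target network
gamma, r = 0.99, 1.0
s2 = np.array([0, 2, 4])           # batch of next states

# standard: the target network both selects and evaluates the action
y_std = r + gamma * Q_targ[s2].max(axis=1)

# double: the current network selects, the target network evaluates
a_sel = Q_cur[s2].argmax(axis=1)
y_dbl = r + gamma * Q_targ[s2, a_sel]
</code></pre>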

<h3 id="multi-step-returns">Multi-step returns</h3>

\[y_{j,t}=\sum_{t'=t}^{t+N-1}\gamma^{t'-t}r_{j,t'}+\gamma ^N \max_{a_{j,t+N}} Q_{\phi'}(s_{j,t+N},a_{j, t+N})\]

<p>In Q-learning, this is only actually correct when learning <strong>on-policy</strong>, because the summed rewards come from transitions generated by the behavior policy, which may differ from the current policy.</p>

<p>How to fix?</p>

<ul>
  <li>ignore the problem: often works well when N is small</li>
  <li>cut the trace: dynamically choose N so that only on-policy data is used; works well when the data is mostly on-policy and the action space is small</li>
  <li>importance sampling: see “Safe and efficient off-policy reinforcement learning” (Munos et al., 2016)</li>
</ul>

<h3 id="q-learning-with-continuous-actions">Q-learning with continuous actions</h3>

<p>How do we take the argmax in a continuous action space?</p>

<ol>
  <li>optimization</li>
</ol>

<ul>
  <li>gradient based optimization (e.g., SGD): a bit slow in the inner loop</li>
  <li>the action space is typically low-dimensional: what about stochastic optimization?</li>
</ul>

<p>a simple solution: sample actions from a discrete set</p>

<p>$\max_a Q(s,a)\approx \max\{Q(s,a_1),…,Q(s,a_N)\}$ with $a_1,…,a_N$ sampled from some distribution (e.g., uniform)</p>

<p>more accurate solutions:</p>

<ul>
  <li>cross-entropy method (CEM)</li>
  <li>simple iterative stochastic optimization</li>
  <li>CMA-ES</li>
</ul>

<ol start="2">
  <li>use function class that is easy to optimize</li>
</ol>

\[Q_{\phi}(s,a) = -\frac{1}{2}(a-\mu_\phi(s))^TP_{\phi}(s)(a-\mu_\phi(s))+V_\phi(s)\]

<p><strong>NAF</strong>: <strong>N</strong>ormalized <strong>A</strong>dvantage <strong>F</strong>unctions</p>

<p>Use a neural network to output $\mu,P,V$</p>

<p>Then</p>

\[\arg \max_aQ_\phi(s,a) =\mu_\phi(s)\; \; \max_aQ(s,a)=V_\phi(s)\]

<p><strong>but</strong> this loses some representational power</p>

<ol start="3">
  <li>learn an approximate maximizer</li>
</ol>

<p><strong>DDPG</strong></p>

\[\max_aQ_\phi(s,a)=Q_\phi(s,\arg\max_a Q_\phi(s,a))\]

<p>idea: train another network $\mu_\theta(s)$ such that $\mu_\theta(s)\approx \arg\max_aQ_\phi(s,a)$</p>

<p>how to train? solve $\theta \gets \arg \max_\theta Q_\phi(s,\mu_\theta(s))$</p>

\[\frac{dQ_\phi}{d\theta}=\frac{da}{d\theta}\frac{dQ_\phi}{da}\]

<p>DDPG:</p>

<ol>
  <li>take some action $a_i$ and observe $(s_i,a_i,s_i',r_i)$, add it to $\mathcal{B}$</li>
  <li>sample mini-batch ${s_j,a_j,s_j',r_j}$ from $\mathcal{B}$ uniformly</li>
  <li>compute $y_j=r_j+\gamma Q_{\phi'}(s_j',\mu_{\theta'}(s_j'))$ using target nets $Q_{\phi'}$ and $\mu_{\theta'}$</li>
  <li>$\phi \gets \phi - \alpha\sum_j\frac{dQ_\phi}{d\phi}(s_j,a_j)(Q_\phi(s_j,a_j)-y_j)$</li>
  <li>$\theta \gets \theta + \beta\sum_j\frac{d\mu}{d\theta}(s_j)\frac{dQ_\phi}{da}(s_j,\mu_\theta(s_j))$</li>
  <li>update $\phi’$ and $\theta’$ (e.g., Polyak averaging)</li>
</ol>
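<p>A compact PyTorch sketch of one DDPG update, implementing steps 3~6 on a synthetic mini-batch; the network sizes, learning rates, and batch contents are illustrative assumptions, not a reference implementation:</p>

<pre><code class="language-python">import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.999

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

q = mlp(obs_dim + act_dim, 1)                          # critic Q_phi
mu = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())   # actor mu_theta
q_t = mlp(obs_dim + act_dim, 1)                        # target critic
mu_t = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh()) # target actor
q_t.load_state_dict(q.state_dict())
mu_t.load_state_dict(mu.state_dict())
opt_q = torch.optim.Adam(q.parameters(), lr=1e-3)
opt_mu = torch.optim.Adam(mu.parameters(), lr=1e-4)

# a synthetic mini-batch standing in for samples from the replay buffer
s = torch.randn(32, obs_dim)
a = torch.rand(32, act_dim) * 2 - 1
r = torch.randn(32, 1)
s2 = torch.randn(32, obs_dim)

# step 3: y_j = r_j + gamma * Q_phi'(s'_j, mu_theta'(s'_j))
with torch.no_grad():
    y = r + gamma * q_t(torch.cat([s2, mu_t(s2)], dim=1))
# step 4: critic regression toward y
q_loss = ((q(torch.cat([s, a], dim=1)) - y) ** 2).mean()
opt_q.zero_grad(); q_loss.backward(); opt_q.step()
# step 5: actor ascends Q(s, mu(s))
mu_loss = -q(torch.cat([s, mu(s)], dim=1)).mean()
opt_mu.zero_grad(); mu_loss.backward(); opt_mu.step()
# step 6: Polyak-average both target networks
with torch.no_grad():
    for p, p_t in zip(list(q.parameters()) + list(mu.parameters()),
                      list(q_t.parameters()) + list(mu_t.parameters())):
        p_t.mul_(tau).add_((1 - tau) * p)
</code></pre>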

<h3 id="tips-for-q-learning">Tips for Q-learning</h3>

<ul>
  <li>Bellman error gradients can be big; clip gradients or use Huber loss instead of squared error</li>
</ul>

\[L_\delta(x)=\begin{cases}x^2/2 , &amp;\text{if} \; \mid x \mid \le\delta 
 \cr \delta \mid x \mid -\delta^2/2, &amp;\text{otherwise}\end{cases}\]
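<p>The same loss as a one-function numpy sketch:</p>

<pre><code class="language-python">import numpy as np

def huber(x, delta=1.0):
    a = np.abs(x)
    return np.where(a &lt;= delta, 0.5 * x ** 2, delta * a - 0.5 * delta ** 2)
</code></pre>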

<ul>
  <li>
    <p>Double Q-learning helps <em>a lot</em> in practice, simple and no downsides</p>
  </li>
  <li>
    <p>N-step returns also help a lot, but have some downsides</p>
  </li>
  <li>
    <p>Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too</p>
  </li>
  <li>
    <p>Run multiple random seeds, it’s very inconsistent between runs</p>
  </li>
</ul>

<h2 id="6-advanced-policy-gradients">6. Advanced Policy Gradients</h2>

<h3 id="basics-1">Basics</h3>

<h4 id="recap">Recap</h4>

<p>Recap: policy gradient</p>

<p><strong>REINFORCE</strong> algorithm</p>

<ol>
  <li>sample ${\tau^i}$ from $\pi_\theta(a_t \mid s_t)$ (run the policy)</li>
  <li>$\Delta_\theta J(\theta)\approx\sum_{i}\left(\sum_{t=1}^T \Delta_\theta \log \pi_\theta (a_t^i \mid s_t^i) \left(\sum_{t'=t}^T r(s_{t'},a_{t'})\right)\right)$</li>
  <li>$\theta \gets\theta+\alpha\Delta_\theta J(\theta)$</li>
</ol>

<p>Why does policy gradient work?</p>

<p>policy gradient as <strong>policy iteration</strong></p>

<p>$J(\theta)=E_{\tau \sim p_\theta(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]$</p>

\[\begin{align}
J(\theta')-J(\theta)&amp;=J(\theta')-E_{s_0 \sim p(s_0)}[V^{\pi_\theta}(s_0)]\\
&amp;=J(\theta')-E_{\tau \sim p_{\theta'}(\tau)}[V^{\pi_\theta}(s_0)]\\
&amp;=J(\theta')-E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tV^{\pi_\theta}(s_t)-\sum_{t=1}^\infty\gamma^tV^{\pi_\theta}(s_t)\right]\\
&amp;=J(\theta')+E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\
&amp;=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_t\gamma^tr(s_t,a_t)\right]+E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\
&amp;=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t(r(s_t,a_t)+\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t))\right]\\
&amp;=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]
\end{align}\]

<p>so we proved that:</p>

\[J(\theta')-J(\theta)=E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\]

<h4 id="the-goal-is-making-things-off-policy">The Goal is Making things off-policy</h4>

<p>But we <strong>want to sample</strong> from $\pi_\theta$ not $\pi_{\theta'}$, so we apply <strong>importance sampling</strong>:</p>

\[\begin{align}
E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]&amp;=\sum_{t=0}^{\infty}E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta'}(a_t \mid s_t)}\left[\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\
&amp;=\sum_{t=0}^{\infty}E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]
\end{align}\]

<p>but the states are <strong>still</strong> sampled from $p_{\theta'}(s_t)$; can we approximate this with $p_\theta(s_t)$, so that we can use $\hat{A}^\pi(s_t,a_t)$ to get an improved policy $\pi'$?</p>

<h3 id="bounding-the-objective-value">Bounding the objective value</h3>

<p>Here we can prove that:</p>

<p>$\pi_{\theta'}$ is close to $\pi_\theta$ if $ \mid \pi_{\theta'}(a_t \mid s_t)-\pi_\theta(a_t \mid s_t) \mid \le\epsilon$ for all $s_t$, and then</p>

<p>$ \mid p_{\theta'}(s_t)-p_\theta(s_t) \mid \le 2\epsilon t$</p>

<p>For the proof, refer to the lecture video or the <strong>TRPO</strong> paper.</p>

<p>It’s easy to prove that:</p>

\[\begin{align}
E_{p_{\theta'}}[f(s_t)]=\sum_{s_t}p_{\theta'}(s_t)f(s_t)&amp;\ge\sum_{s_t}p_\theta(s_t)f(s_t)- \sum_{s_t}\mid p_{\theta'}(s_t)-p_\theta(s_t) \mid \max_{s_t}f(s_t)\\
&amp;\ge\sum_{s_t}p_\theta(s_t)f(s_t)-2\epsilon t\max_{s_t}f(s_t)
\end{align}\]

<p>so</p>

\[\sum_t E_{s_t \sim p_{\theta'}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\
\ge\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\sum_t 2\epsilon t C\]

<p>where C is $O(Tr_{max})$ in the finite-horizon case, or $O(\frac{r_{max}}{1-\gamma})$ in the infinite-horizon discounted case</p>

<p>So, after all of this proving, what do we get?</p>

\[\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\
\text{such that}\:\: \mid \pi_{\theta'}(a_t \mid s_t)-\pi_\theta(a_t \mid s_t) \mid \le\epsilon\]

<p>For <strong>small enough</strong> $\epsilon$, this is <strong>guaranteed to improve</strong> $J(\theta')-J(\theta)$</p>

<p>A more convenient bound uses the KL divergence: $ \mid \pi_{\theta'}(a_t \mid s_t)-\pi_\theta(a_t \mid s_t) \mid \le \sqrt{\frac{1}{2}D_{KL}(\pi_{\theta'}(a_t \mid s_t) \mid \mid \pi_\theta(a_t \mid s_t))}$</p>

<p>$\Rightarrow D_{KL}(\pi_{\theta'}(a_t \mid s_t) \mid \mid \pi_\theta(a_t \mid s_t))$ bounds the state marginal difference, where</p>

\[D_{KL}(p_1(x) \mid \mid p_2(x))=E_{x\sim p_1(x)}\left[ \log \frac{p_1(x)}{p_2(x)}\right]\]

<p>Why use the $D_{KL}$ bound rather than $\epsilon$ directly?</p>

<blockquote>
  <p>KL divergence has some <strong>very convenient properties</strong> that make it much easier to approximate!</p>
</blockquote>

<p>So the optimization becomes:</p>

\[\theta' \gets \arg \max_{\theta'}\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\\text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t \mid s_t) \mid \mid \pi_\theta(a_t \mid s_t))\le\epsilon\]

<h3 id="solving-the-constrained-optimization-problem">Solving the constrained optimization problem</h3>

<p>How do we enforce the <strong>constraint</strong>?</p>

<p>By using <strong>dual gradient descent</strong>, we set the objective function as</p>

\[\mathcal{L}(\theta',\lambda)=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]-\lambda(D_{KL}(\pi_{\theta'}(a_t \mid s_t) \mid \mid \pi_\theta(a_t \mid s_t))-\epsilon)\]

<ol>
  <li>Maximize $\mathcal{L}(\theta', \lambda)$ with respect to $\theta'$</li>
  <li>$\lambda \gets \lambda + \alpha(D_{KL}(\pi_{\theta'}(a_t \mid s_t) \mid \mid \pi_\theta(a_t \mid s_t))-\epsilon)$</li>
</ol>

<p>How <strong>else</strong> can we optimize the objective?</p>

<p>define:</p>

\[\begin{align}
\bar{A}(\theta')&amp;=\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]\\
\bar{A}(\theta)&amp;=\sum_t E_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\right]
\end{align}\]

<p>applying <strong>First-order Taylor expansion</strong> and optimize</p>

\[\theta' \gets \arg \max_{\theta'}\Delta_\theta\bar A(\theta)^T(\theta'-\theta)\\
\text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t \mid s_t)\ \mid \pi(a_t \mid s_t))\le\epsilon\]

<p>and</p>

\[\begin{align}
\Delta_{\theta'}\bar A(\theta')&amp;=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\gamma^t \Delta_{\theta'}\log{\pi_{\theta'}(a_t \mid s_t)}A^{\pi_\theta}(s_t,a_t)\right]\right]\\
\Delta_{\theta}\bar A(\theta)&amp;=\sum_tE_{s_t \sim p_{\theta}(s_t)}\left[E_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[\gamma^t \Delta_{\theta}\log{\pi_{\theta}(a_t \mid s_t)}A^{\pi_\theta}(s_t,a_t)\right]\right]=\Delta_\theta J(\theta)
\end{align}\]

<p>so the optimization becomes</p>

\[\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\
\text{such that}\:\:D_{KL}(\pi_{\theta'}(a_t \mid s_t)\ \mid \pi(a_t \mid s_t))\le\epsilon\]

<p>and gradient ascent does this:</p>

\[\theta' \gets \arg \max_{\theta'}\Delta_\theta J(\theta)^T(\theta'-\theta)\\
\text{such that}\:\: \mid \mid \theta-\theta' \mid \mid \le\epsilon\]

<p>by updating like $\theta'=\theta+\sqrt{\frac{\epsilon}{ \mid \mid \Delta_\theta J(\theta) \mid \mid ^2}}\Delta_\theta J(\theta)$; this is what gradient ascent (plain policy gradient) actually does.</p>

<p>But the gradient ascent constraint is not a good one, since some parameters change the action probabilities much more than others, and what we really want is for the probability <em>distributions</em> to stay close.</p>

<p>Applying a second-order Taylor expansion to $D_{KL}$:</p>

\[D_{KL}(\pi_{\theta'} \mid \mid \pi_\theta)\approx\frac{1}{2}(\theta'-\theta)^T\pmb{F}(\theta'-\theta)\]

<p>where $\pmb{F}$ is the <strong>Fisher information matrix</strong>, which can be estimated from samples:</p>

\[\pmb{F}=E_{\pi_\theta}[\Delta_{\theta}\log\pi_\theta(a \mid s)\Delta_\theta\log \pi_\theta(a \mid s)^T]\]

<p>And if we use the following update</p>

\[\theta'=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)\\
\alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}^{-1}\Delta_\theta J(\theta)}}\]

<p>then the constraint will be satisfied; this is called the <strong>natural gradient</strong>.</p>

<blockquote>
  <p><em>Figure reference: the KL trust region diagram lives in the <a href="../assets/pdf/UBC.pdf">PDF version of these notes</a>.</em></p>
</blockquote>
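<p>A toy numpy sketch of one natural gradient step for a softmax policy over a handful of actions (a stateless bandit, so $\pmb{F}$ is small enough to form explicitly); the rewards, damping, and $\epsilon$ are illustrative assumptions:</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n_actions, eps = 4, 0.01
theta = np.zeros(n_actions)
r = np.array([1.0, 0.5, 0.2, 0.0])          # toy per-action rewards

pi = np.exp(theta - theta.max()); pi /= pi.sum()
acts = rng.choice(n_actions, size=5000, p=pi)

glogp = np.eye(n_actions)[acts] - pi        # grad log pi(a) for a softmax
g = (glogp * r[acts][:, None]).mean(axis=0) # policy gradient estimate
F = glogp.T @ glogp / len(acts)             # Fisher estimate from samples
F += 1e-3 * np.eye(n_actions)               # damping: the softmax Fisher is singular
nat = np.linalg.solve(F, g)                 # F^{-1} grad J
alpha = np.sqrt(2 * eps / (g @ nat))        # step size satisfying the KL constraint
theta = theta + alpha * nat
</code></pre>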

<h3 id="practical-methods-and-notes">Practical methods and notes</h3>

<ul>
  <li>natural policy gradient 							$\theta'=\theta+\alpha\pmb{F}^{-1}\Delta_\theta J(\theta)$</li>
  <li>Generally a good choice to stabilize policy gradient training</li>
  <li>See this paper for details: Peters &amp; Schaal, “Reinforcement learning of motor skills with policy gradients”</li>
  <li>Practical implementation: requires efficient Fisher-vector products; a bit non-trivial to do without computing the full matrix</li>
  <li>See: Schulman et al., “Trust region policy optimization”</li>
  <li>Trust region policy optimization (<strong>TRPO</strong>): choose $\alpha=\sqrt{\frac{2\epsilon}{\Delta_\theta J(\theta)^T\pmb{F}^{-1}\Delta_\theta J(\theta)}}$</li>
  <li>Or just use the IS (importance sampling) objective directly (use $\bar{A}$ as the objective)</li>
  <li>Use regularization to stay close to the old policy</li>
  <li>See: proximal policy optimization (<strong>PPO</strong>)</li>
</ul>

<p>So TRPO and PPO are two practical methods for solving this constrained optimization problem in the neural network setting.</p>

<h2 id="7-optimal-control-and-planning">7. Optimal Control and Planning</h2>

<p>Recap: the reinforcement learning objective</p>

\[p_\theta(s_1,a_1,...,s_T,a_T)=p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)\\
\theta^*=\underset{\theta}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\]

<p>In model-free RL, we do not know $p(s_{t+1} \mid s_t,a_t)$.</p>

<p>But actually, we sometimes do know the dynamics.</p>

<ul>
  <li>Often we do know the dynamics</li>
  <li>Often we can learn the dynamics</li>
</ul>

<p>If we know the dynamics, what can we do?</p>

<h3 id="model-based-reinforcement-learning">Model-based reinforcement learning</h3>

<ol>
  <li>
    <p>Model-based reinforcement learning: learn the transition dynamics, then figure out how to choose actions</p>
  </li>
  <li>
    <p>How can we make decisions if we know the dynamics?</p>
  </li>
</ol>

<p>a. How can we choose actions under perfect knowledge of the system dynamics?</p>

<p>b. Optimal control, trajectory optimization, planning</p>

<ol start="3">
  <li>
    <p>How can we learn the <em>unknown dynamics</em>?</p>
  </li>
  <li>
    <p>How can we then also learn policies? (e.g., by imitating optimal control)</p>
  </li>
</ol>

<h3 id="the-objective">The objective</h3>

\[\min_{a_1,...,a_T}\sum_{t=1}^Tc(s_t,a_t)\;\text{s.t.}\;s_t=f(s_{t-1},a_{t-1})\]

<h4 id="deterministic-case">Deterministic case</h4>

\[a_1,...,a_T=\arg\max_{a_1,...,a_T}\sum_{t=1}^Tr(s_t,a_t)\;\text{s.t.}\;s_t=f(s_{t-1},a_{t-1})\]

<h4 id="stochastic-open-loop-case">Stochastic open-loop case</h4>

\[p_\theta(s_1,...,s_T \mid a_1,...,a_T)=p(s_1)\prod_{t=1}^Tp(s_{t+1} \mid s_t,a_t)\\
a_1,...,a_T=\arg\max_{a_1,...,a_T}E\left[\sum_{t=1}^Tr(s_t,a_t) \mid a_1,...,a_T\right]\]

<p><strong>open-loop</strong>: commit to $a_1,\dots,a_T$ all at once, not step by step
<strong>closed-loop</strong>: at every step the agent gets feedback from the environment</p>

<h4 id="stochastic-closed-loop-case">Stochastic closed-loop case</h4>

\[p_\theta(s_1,a_1,...,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi(a_t \mid s_t)p(s_{t+1} \mid s_t,a_t)\\
\pi=\underset{\pi}{\arg\max} E_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\]

<h3 id="stochastic-optimization">Stochastic optimization</h3>

<p>optimal control/planning:</p>

\[a_1,...,a_t=\arg\max_{a_1,...,a_t}J(a_1,...,a_t)\\
A=\arg\max_AJ(A)\]

<h4 id="cross-entropy-method-cem">Cross-entropy method (CEM)</h4>

<p>Here $A$ is $a_1,…,a_t$</p>

<ol>
  <li>sample $A_1,…,A_n$ from $p(A)$</li>
  <li>evaluate $J(A_1),…,J(A_n)$</li>
  <li>pick M <em>elites</em> $A_{i_1},…,A_{i_M}$ with the highest value, where $M&lt;N$</li>
  <li>refit $p(A)$ to the elites $A_{i_1},…,A_{i_M}$</li>
</ol>
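<p>A numpy sketch of this loop on a toy objective (the target sequence and hyperparameters are made up for illustration):</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
H = 5                                  # planning horizon
target = np.linspace(-1.0, 1.0, H)     # toy optimum
J = lambda A: -np.sum((A - target) ** 2, axis=-1)

mu, sigma = np.zeros(H), np.ones(H)
N, M = 100, 10                         # samples and elites per iteration
for _ in range(20):
    A = rng.normal(mu, sigma, size=(N, H))             # 1. sample A_1..A_N from p(A)
    elites = A[np.argsort(J(A))[-M:]]                  # 2-3. evaluate, keep top M
    mu, sigma = elites.mean(0), elites.std(0) + 1e-6   # 4. refit p(A)
# mu is now close to `target`
</code></pre>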

<h4 id="monte-carlo-tree-search-mcts">Monte Carlo Tree Search (MCTS)</h4>

<p>Generic MCTS sketch</p>

<ol>
  <li>find a leaf $s_l$ using TreePolicy($s_1$)</li>
  <li>evaluate the leaf using DefaultPolicy($s_l$)</li>
  <li>update all values in the tree between $s_1$ and $s_l$</li>
</ol>

<p>take best action from $s_1$ and repeat</p>

<p>every node stores Q and N, Q is the estimated value and N is the visited number</p>

<p><strong>UCT</strong> TreePolicy($s_t$)</p>

<p>if $s_t$ is not fully expanded, choose a new $a_t$</p>

<p>else choose the child with the best Score($s_{t+1}$)</p>

\[Score(s_t) = \frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2\ln N(s_{t-1})}{N(s_t)}}\]

<p>For more about MCTS, see Browne et al., “A Survey of Monte Carlo Tree Search Methods” (2012)</p>
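<p>For concreteness, the score as a function (matching the formula above; here $Q$ is the node’s accumulated value and $N$ its visit count):</p>

<pre><code class="language-python">import numpy as np

def uct_score(Q, N, N_parent, C=1.0):
    # exploitation (average value) plus exploration bonus
    return Q / N + 2 * C * np.sqrt(2 * np.log(N_parent) / N)
</code></pre>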

<h3 id="optimal-control">Optimal control</h3>

<p>Here we show the optimization process when we know the environment dynamics. This is essentially material from control theory.</p>

<p><strong>Deterministic</strong> case</p>

\[\min_{u_1,...,u_T}\sum_{t=1}^Tc(x_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\
\min_{u_1,...,u_T}c(x_1,u_1)+c(f(x_1,u_1),u_2)+...+c(f(f(...)...),u_T)\]

<h4 id="shooting-methods-vs-collocation">Shooting methods vs collocation</h4>

<p>the CEM procedure above is actually a random shooting method.</p>

<p>collocation method: optimize over actions and states, with constraints.</p>

\[\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(x_t,u_t)\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\]

<h4 id="linear-case-lqr">Linear case: LQR</h4>

\[\min_{u_1,...,u_T}c(x_1,u_1)+c(f(x_1,u_1),u_2)+...+c(f(f(...)...),u_T)\]

<p>Linear case: the dynamics $f$ are a <strong>linear</strong> function and the cost is a <strong>quadratic</strong> function</p>

\[f(x_t,u_t)=F_t\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}+f_t\\
 c(x_t,u_t)=\frac{1}{2}\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}^TC_t\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}+\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}^Tc_t\]

<p>Where</p>

\[C_T=\begin{bmatrix} 
C_{x_T,x_T} &amp; C_{x_T,u_T} \\
C_{u_T,x_T} &amp; C_{u_T,u_T} 
\end{bmatrix}\\
c_T=\begin{bmatrix} 
c_{x_T}\\
c_{u_T}
\end{bmatrix}\]

<p>Base case: solve for $u_T$ only</p>

\[\begin{align}
Q(x_T,u_T)&amp;= \text{const}+\frac{1}{2}\begin{bmatrix}
 x_{T} \\
 u_{T} \\
 \end{bmatrix}^TC_T\begin{bmatrix}
 x_{T} \\
 u_{T} \\
 \end{bmatrix}+\begin{bmatrix}
 x_{T} \\
 u_{T} \\
 \end{bmatrix}^Tc_T\\
 \Delta_{u_T}Q(x_T,u_T)&amp;=C_{u_T,x_T}x_T+C_{u_T,u_T}u_T+c_{u_T}^T=0\\
 u_T&amp;=-C_{u_T,u_T}^{-1}(C_{u_T,x_T}x_T+c_{u_T})\\
 u_T&amp;=K_Tx_T+k_T\\
 K_T&amp;=-C_{u_T,u_T}^{-1}C_{u_T,x_T}\\
 k_T&amp;=-C_{u_T,u_T}^{-1}c_{u_T}
\end{align}\]

<p>We substitute $u_T=K_Tx_T+k_T$ to eliminate $u_T$</p>

\[\begin{align}
V(x_T)&amp;= \text{const}+\frac{1}{2}\begin{bmatrix}
 x_{T} \\
 K_Tx_T+k_T \\
 \end{bmatrix}^TC_T\begin{bmatrix}
 x_{T} \\
 K_Tx_T+k_T \\
 \end{bmatrix}+\begin{bmatrix}
 x_{T} \\
 K_Tx_T+k_T \\
 \end{bmatrix}^Tc_T\\
 V(x_T)&amp;=\text{const}+\frac{1}{2}x_T^TV_Tx_T+x_T^Tv_T
\end{align}\]

<p>Then solve for $u_{T-1}$ in terms of $x_{T-1}$</p>

\[\begin{align}
f(x_{T-1},u_{T-1})&amp;=x_T=F_{T-1}\begin{bmatrix}
 x_{T-1} \\
 u_{T-1} \\
 \end{bmatrix}+f_{T-1}\\
Q(x_{T-1},u_{T-1})&amp;=\frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^TC_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}+\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^Tc_{T-1}+V(f(x_{T-1},u_{T-1}))\\
V(f(x_{T-1},u_{T-1}))&amp;=\text{const}+\frac{1}{2}x_T^TV_Tx_T+x_T^Tv_T\\
&amp;\text{and then replace $x_T$ with the dynamics $f$}
\end{align}\]

<p>and then do the same thing as in the $T$ case, which yields analogous results.</p>

<h5 id="backward-recursion">backward recursion</h5>

<p>for $t=T$ to 1:</p>

\[\begin{align}
Q_t&amp;=C_t+F_t^TV_{t+1}F_t\\
q_t&amp;=c_t+F_t^TV_{t+1}f_t+F_t^Tv_{t+1}\\
Q(x_t,u_t)&amp;=\text{const}+\frac{1}{2}\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}^TQ_t\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}+\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}^Tq_t\\
u_t &amp;\gets \arg\min_{u_t}Q(x_t,u_t)=K_tx_t+k_t\\
K_t&amp;=-Q_{u_t,u_t}^{-1}Q_{u_t,x_t}\\
k_t&amp;=-Q_{u_t,u_t}^{-1}q_{u_t}\\
V_t&amp;=Q_{x_t,x_t}+Q_{x_t,u_t}K_t+K_t^TQ_{u_t,x_t}+K_t^TQ_{u_t,u_t}K_t\\
v_t&amp;=q_{x_t}+Q_{x_t,u_t}k_t+K_t^Tq_{u_t}+K_t^TQ_{u_t,u_t}k_t\\
V(x_t)&amp;=\text{const}+\frac{1}{2}x_t^T V_tx_t+x_t^Tv_t\\
V(x_t)&amp;=\min_{u_t} Q(x_t,u_t)
\end{align}\]

<h5 id="forward-recursion">forward recursion</h5>

<p>For $t=1$ to $T$:</p>

\[u_t=K_tx_t+k_t\\
x_{t+1}=f(x_t,u_t)\]
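<p>A numpy sketch of the backward and forward passes for a toy time-invariant linear-quadratic problem, following the recursion above; the dimensions, dynamics, and cost matrices are arbitrary stand-ins:</p>

<pre><code class="language-python">import numpy as np

nx, nu, T = 2, 1, 20
F = np.hstack([np.eye(nx) + 0.1 * np.eye(nx, k=1),   # dynamics x' = F [x; u] + f
               0.1 * np.ones((nx, nu))])
f = np.zeros(nx)
C = np.eye(nx + nu); C[nx:, nx:] *= 0.1              # quadratic cost weights
c = np.zeros(nx + nu)

# backward recursion: for t = T down to 1
V, v, Ks, ks = np.zeros((nx, nx)), np.zeros(nx), [], []
for t in reversed(range(T)):
    Q = C + F.T @ V @ F
    q = c + F.T @ V @ f + F.T @ v
    Qxx, Qxu = Q[:nx, :nx], Q[:nx, nx:]
    Qux, Quu = Q[nx:, :nx], Q[nx:, nx:]
    qx, qu = q[:nx], q[nx:]
    K = -np.linalg.solve(Quu, Qux)
    k = -np.linalg.solve(Quu, qu)
    V = Qxx + Qxu @ K + K.T @ Qux + K.T @ Quu @ K
    v = qx + Qxu @ k + K.T @ qu + K.T @ Quu @ k
    Ks.append(K); ks.append(k)
Ks, ks = Ks[::-1], ks[::-1]

# forward recursion: roll out u_t = K_t x_t + k_t
x = np.array([1.0, -0.5])
for t in range(T):
    u = Ks[t] @ x + ks[t]
    x = F @ np.concatenate([x, u]) + f
</code></pre>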

<h5 id="stochastic-dynamics">Stochastic dynamics</h5>

<p>If the transition probability is Gaussian with a linear mean and fixed covariance, then the same algorithm can be applied, thanks to the symmetry of the Gaussian.</p>

\[f(x_t,u_t)=F_t\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}+f_t\\
x_{t+1}\sim p(x_{t+1} \mid x_t,u_t)\\
 p(x_{t+1} \mid x_t,u_t)=\mathcal{N}\left(F_t\begin{bmatrix}
 x_{t} \\
 u_{t} \\
 \end{bmatrix}+f_t, \Sigma_t\right)\]

<h4 id="nonlinear-case-ddpiterative-lqr">Nonlinear case: DDP/iterative LQR</h4>

<p>approximate a nonlinear system as a linear-quadratic system using <strong>Taylor expansion</strong></p>

\[f(x_t,u_t)\approx f(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}f(\hat{x}_t,\hat{u}_t)\begin{bmatrix}
 x_{t}-\hat{x}_t \\
 u_{t}-\hat{u}_t \\
 \end{bmatrix}\\
c(x_t,u_t)\approx c(\hat{x}_t,\hat{u}_t)+\Delta_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix}
 x_{t}-\hat{x}_t \\
 u_{t}-\hat{u}_t \\
 \end{bmatrix}+\frac{1}{2}\begin{bmatrix}
 x_{t}-\hat{x}_t \\
 u_{t}-\hat{u}_t \\
 \end{bmatrix}^T\Delta^2_{x_t,u_t}c(\hat{x}_t,\hat{u}_t)\begin{bmatrix}
 x_{t}-\hat{x}_t \\
 u_{t}-\hat{u}_t \\
 \end{bmatrix}\]

\[\bar{f}(\delta x_t,\delta u_t)=F_t\begin{bmatrix}
 \delta x_t \\
 \delta u_t \\
 \end{bmatrix}\\
\bar{c}=\frac{1}{2}\begin{bmatrix}
 \delta x_{t} \\
 \delta u_{t} \\
 \end{bmatrix}^TC_t\begin{bmatrix}
 \delta x_{t} \\
 \delta u_{t} \\
 \end{bmatrix}+\begin{bmatrix}
 \delta x_{t} \\
 \delta u_{t} \\
 \end{bmatrix}^Tc_t\\
 \delta x_t= x_t-\hat{x}_t\\
 \delta u_t= u_t-\hat{u}_t\]

<p>In fact, this is just Newton’s method for trajectory optimization.</p>

<p>For more on Newton’s method for trajectory optimization, see the following papers:</p>

<ol>
  <li>Differential dynamic programming (1970)</li>
  <li>Synthesis and stabilization of complex behaviors through online trajectory optimization (2012)
    <ul>
      <li>practical guide for implementing non-linear iterative LQR.</li>
    </ul>
  </li>
  <li>Learning neural network policies with guided policy search under unknown dynamics (2014)
    <ul>
      <li>Probabilistic formulation and trust region alternative to deterministic line search.</li>
    </ul>
  </li>
</ol>

<h2 id="8-model-based-reinforcement-learning-learning-the-model">8. Model-Based Reinforcement Learning (learning the model)</h2>

<h3 id="basic">Basic</h3>

<p>Why learn the model?</p>

<blockquote>
  <p>If we knew $f(s_t,a_t)=s_{t+1}$, we could use the tools from the previous lecture.</p>

  <p>(or $p(s_{t+1} \mid s_t,a_t)$ in stochastic case)</p>
</blockquote>

<p>model-based reinforcement learning <strong>version 0.5</strong>:</p>

<ol>
  <li>run base policy $\pi_0(a_t \mid s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s')_i}$</li>
  <li>learn a dynamics model $f(s,a)$ to minimize $\sum_i \mid \mid f(s_i,a_i)-s_i' \mid \mid ^2$</li>
  <li>plan through $f(s,a)$ to choose actions</li>
</ol>

<p>Does it work?</p>

<ul>
  <li>This is how <strong>system identification</strong> works in classical robotics</li>
  <li>Some care should be taken to design a good base policy</li>
  <li>Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters</li>
  <li>The model is only fit to data from the base policy, but the final policy visits states beyond that distribution, which causes the <strong>distribution mismatch problem</strong>.</li>
</ul>

<h3 id="over-fitting-problem">Over-fitting problem</h3>

<h4 id="distribution-mismatch-problem">Distribution mismatch problem</h4>

<p>Can we do better?</p>

<p>can we make $p_{\pi_0}(s_t)=p_{\pi_f}(s_t)$?</p>

<p>model-based reinforcement learning <strong>version 1.0:</strong></p>

<ol>
  <li>run base policy $\pi_0(a_t \mid s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s')_i}$</li>
  <li>learn a dynamics model $f(s,a)$ to minimize $\sum_i \mid \mid f(s_i,a_i)-s_i' \mid \mid ^2$</li>
  <li>plan through $f(s,a)$ to choose actions</li>
  <li>execute those actions and add the resulting data ${(s,a,s')_j}$ to $\mathcal{D}$; repeat steps 2~4</li>
</ol>

<p>But the model has errors, so the plan may include some bad actions. How do we address that?</p>

<h4 id="mpc">MPC</h4>

<p>model-based reinforcement learning <strong>version 1.5</strong>:</p>

<ol>
  <li>run base policy $\pi_0(a_t \mid s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s')_i}$</li>
  <li>learn a dynamics model $f(s,a)$ to minimize $\sum_i \mid \mid f(s_i,a_i)-s_i' \mid \mid ^2$</li>
  <li>plan through $f(s,a)$ to choose actions</li>
  <li>execute the <strong>first</strong> planned action, observe resulting state $s'$ (<strong>MPC</strong>)</li>
  <li>append $(s,a,s')$ to dataset $\mathcal{D}$; repeat steps 3~5, and every N steps repeat steps 2~5</li>
</ol>
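<p>A toy end-to-end sketch of version 1.5 on a scalar system, with a linear model fit by least squares and random shooting as the planner; the dynamics, cost, and all constants are invented for illustration:</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
f_true = lambda s, a: 0.9 * s + 0.5 * a          # unknown "real" dynamics

# 1. base (random) policy collects D = {(s, a, s')}
S = rng.normal(size=200)
A = rng.uniform(-1.0, 1.0, size=200)
S2 = f_true(S, A)

s = 2.0
for t in range(50):
    # 2. fit a linear model f(s, a) = theta . [s, a] by least squares
    theta, *_ = np.linalg.lstsq(np.stack([S, A], axis=1), S2, rcond=None)
    # 3. plan: random shooting over H-step action sequences through the model
    H, n_seq = 5, 256
    seqs = rng.uniform(-1.0, 1.0, size=(n_seq, H))
    sim, cost = np.full(n_seq, s), np.zeros(n_seq)
    for h in range(H):
        sim = theta[0] * sim + theta[1] * seqs[:, h]
        cost += np.abs(sim)                       # cost: distance from 0
    a = seqs[np.argmin(cost), 0]                  # 4. execute only the first action
    s2 = f_true(s, a)
    S, A, S2 = np.append(S, s), np.append(A, a), np.append(S2, s2)  # 5. append
    s = s2
</code></pre>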

<h4 id="using-model-uncertainty">Using model uncertainty</h4>

<p>Can we do better by using model <strong>uncertainty</strong>?</p>

<p>How to get uncertainty?</p>

<ol>
  <li>use the output entropy (a bad idea: it measures noise in the prediction, not uncertainty about the model)</li>
  <li>estimate model uncertainty</li>
</ol>

\[\int p(s_{t+1} \mid s_t,a_t,\theta)p(\theta \mid \mathcal{D})d\theta\]

<ul>
  <li>one way to get this is by Bayesian neural networks (BNN) (introduce later)</li>
  <li>another way is to train multiple models and see if they agree with each other (<strong>bootstrap ensembles</strong>)</li>
</ul>

\[p(\theta \mid \mathcal{D})\approx\frac{1}{N}\sum_i\delta(\theta_i)\\
\int p(s_{t+1} \mid s_t,a_t,\theta)p(\theta \mid \mathcal{D})d\theta\approx\frac{1}{N}\sum_ip(s_{t+1} \mid s_t,a_t,\theta_i)\]

<p>How to train?</p>

<blockquote>
  <p>main idea: need to generate “independent” datasets to get “independent” models.</p>

  <p>can do this by re-sampling from the dataset with replacement, which gives datasets drawn from the same distribution but with different compositions</p>
</blockquote>

<p>Does this work?</p>

<blockquote>
  <p>This basically works</p>

  <p>Very crude approximation, because the number of models is usually small (&lt;10)</p>

  <p>Re-sampling with replacement is usually unnecessary, because SGD and random initialization usually makes the models sufficiently independent</p>
</blockquote>

<p>For candidate action sequence $a_1,…,a_H$:</p>

<ol>
  <li>sample $\theta\sim p(\theta \mid \mathcal{D})$</li>
  <li>at each time step $t$, sample $s_{t+1}\sim p(s_{t+1} \mid s_t,a_t,\theta)$</li>
  <li>calculate $R=\sum_tr(s_t,a_t)$</li>
  <li>repeat steps 1 to 3 and accumulate the average reward</li>
</ol>
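<p>A sketch of this evaluation with a toy bootstrap ensemble, where simple closures and a hand-written reward stand in for learned networks (all names and numbers are illustrative):</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
models = [lambda s, a, w=w: 0.9 * s + w * a for w in (0.4, 0.5, 0.6)]
reward = lambda s, a: -abs(s)

def score(action_seq, s0, n_eval=10):
    total = 0.0
    for _ in range(n_eval):
        f = models[rng.integers(len(models))]   # 1. sample theta ~ p(theta | D)
        s, ret = s0, 0.0
        for a in action_seq:
            s = f(s, a)                         # 2. roll the sampled model forward
            ret += reward(s, a)                 # 3. accumulate reward
        total += ret
    return total / n_eval                       # 4. average over repetitions
</code></pre>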

<h3 id="model-based-rl-with-images-pomdp">Model-based RL with images (POMDP)</h3>

<h4 id="model-based-rl-with-latent-space-models">Model-based RL with latent space models</h4>

<p>What about <strong>complex observations</strong>?</p>

<ul>
  <li>High dimensionality</li>
  <li>Redundancy</li>
  <li>Partial observability</li>
</ul>

\[\max_\phi\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(s_{t+1,i} \mid s_{t,i},a_{t,i})+\log p_\phi(o_{t,i} \mid s_{t,i})]\]

<p>learn <em>approximate</em> posterior $q_\psi(s_t \mid o_{1:t},a_{1:t})$</p>

<p>other choices:</p>

<ul>
  <li>$q_\psi(s_t,s_{t+1} \mid o_{1:t},a_{1:t})$</li>
  <li>$q_\psi(s_t \mid o_t)$</li>
</ul>

<p>here we only estimate $q_\psi(s_t \mid o_t)$</p>

<p>assume that $q_\psi(s_t \mid o_t)$ is <em>deterministic</em></p>

<p>stochastic case requires variational inference (later)</p>

<p><strong>Deterministic encoder</strong></p>

\[q_\psi(s_t \mid o_t)=\delta(s_t=g_\psi(o_t))\Rightarrow s_t=g_\psi(o_t)\]

<p>and the reward model may also need to be learned.</p>

\[\max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i}) \mid g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i} \mid g_\psi (o_{t,i}))]\\
\max_{\phi,\psi}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^TE[\log p_\phi(g_\psi(o_{t+1,i}) \mid g_\psi(o_{t,i}),a_{t,i})+\log p_\phi(o_{t,i} \mid g_\psi (o_{t,i}))+\log p_\phi(r_{t,i} \mid g_\psi(o_{t,i}))]\]

<p>Model-based RL with latent space models</p>

<ol>
  <li>run base policy $\pi_0(a_t \mid o_t)$ (e.g., random policy) to collect $\mathcal{D}={(o,a,o')_i}$</li>
  <li>learn $p_\phi(s_{t+1} \mid s_t,a_t), p_\phi(r_t \mid s_t), p_\phi(o_t \mid s_t), g_\psi(o_t)$</li>
  <li>plan through the model to choose actions</li>
  <li>execute the <strong>first</strong> planned action, observe the resulting observation $o'$ (<strong>MPC</strong>)</li>
  <li>append $(o,a,o')$ to dataset $\mathcal{D}$; repeat steps 3~5, and every N steps repeat steps 2~5</li>
</ol>

<h4 id="learn-directly-in-observation-space">Learn directly in observation space</h4>

<p>directly learn $p(o_{t+1} \mid o_t,a_t)$</p>

<p>do image prediction</p>

<p>learn reward or set the goal observation</p>

<h2 id="9-model-based-rl-and-policy-learning">9. Model-Based RL and Policy Learning</h2>

<h3 id="basic-1">Basic</h3>

<p>What if we want a policy rather than just optimal control?</p>

<ul>
  <li>Do not need to re-plan (faster)</li>
  <li>Potentially better generalization</li>
  <li>Closed loop control</li>
</ul>

<p>Back-propagate directly into the policy</p>

<p>model-based reinforcement learning <strong>version 2.0</strong>:</p>

<ol>
  <li>run base policy $\pi_0(a_t \mid s_t)$ (e.g., random policy) to collect $\mathcal{D}={(s,a,s')_i}$</li>
  <li>learn a dynamics model $f(s,a)$ to minimize $\sum_i \mid \mid f(s_i,a_i)-s_i' \mid \mid ^2$</li>
  <li>back-propagate through $f(s,a)$ into the policy to optimize $\pi_\theta(a_t \mid s_t)$</li>
  <li>run $\pi_\theta (a_t \mid s_t)$, appending the visited tuples $(s,a,s')$ to $\mathcal{D}$; repeat steps 2~4</li>
</ol>

<p>What’s the <strong>problem</strong>?</p>

<ul>
  <li>similar parameter sensitivity problems as shooting methods</li>
  <li>But we no longer have a convenient second-order LQR-like method, because the policy parameters <strong>couple</strong> all the time steps, so no dynamic programming</li>
  <li>Similar problem to training long RNNs with BPTT</li>
  <li>Vanishing and exploding gradients</li>
  <li>Unlike LSTM, we can’t just “choose” a simple dynamics, dynamics are chosen by nature</li>
</ul>

<h3 id="guided-policy-search">Guided policy search</h3>

\[\min_{u_1,...,u_T,x_1,...,x_T}\sum_{t=1}^Tc(x_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\]

\[\min_{u_1,...,u_T,x_1,...,x_T,\theta}\sum_{t=1}^Tc(x_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1}), u_t=\pi_\theta(x_t)\\
\min_{u_1,...,u_T,x_1,...,x_T,\theta}\sum_{t=1}^Tc(x_t,u_t)\:\: \text{s.t.}\:x_t=f(x_{t-1},u_{t-1})\\
\:\: \text{s.t.}\:\:u_t=\pi_\theta(x_t)\]

<p>How do we deal with the constraint?</p>

<h4 id="dual-gradient-decent-dgd">Dual gradient decent (DGD)</h4>

\[\min_xf(x)\:\:\text{s.t.}\:C(x)=0\:\:\:\:\:\: \mathcal{L}(x,\lambda)=f(x)+\lambda C(x)\\
g(\lambda)=\mathcal{L}(x^*(\lambda),\lambda)\\
x^*=\arg\min_x\mathcal{L}(x,\lambda)\\
\frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)\]

<ol>
  <li>Find $x^*\gets \arg\min_x\mathcal{L}(x,\lambda)$</li>
  <li>Compute $\frac{dg}{d\lambda}=\frac{d\mathcal{L}}{d\lambda}(x^*,\lambda)$</li>
  <li>$\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$</li>
</ol>
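<p>A tiny sketch of the loop on the toy problem $\min_x x^2$ s.t. $x-1=0$, where step 1 has the closed form $x^*=-\lambda/2$ (the problem and constants are invented for illustration):</p>

<pre><code class="language-python"># Dual gradient descent on: min_x x^2  s.t.  C(x) = x - 1 = 0
# Lagrangian: L(x, lam) = x^2 + lam * (x - 1)
lam, alpha = 0.0, 0.5
for _ in range(100):
    x_star = -lam / 2.0          # 1. x* = argmin_x L(x, lam)  (closed form here)
    dg = x_star - 1.0            # 2. dg/dlam = C(x*)
    lam = lam + alpha * dg       # 3. gradient step on the dual variable
print(x_star, lam)               # converges to x* = 1, lam = -2
</code></pre>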

<p>A small tweak to DGD: augmented Lagrangian</p>

\[\bar{\mathcal{L}}(x,\lambda)=f(x)+\lambda C(x)+\rho \mid \mid C(x) \mid \mid ^2\]

<ol>
  <li>Find $x^*\gets \arg\min_x\bar{\mathcal{L}}(x,\lambda)$</li>
  <li>Compute $\frac{dg}{d\lambda}=\frac{d\bar{\mathcal{L}}}{d\lambda}(x^*,\lambda)$</li>
  <li>$\lambda\gets\lambda+\alpha\frac{dg}{d\lambda}$</li>
</ol>

<p>When far from solution, quadratic term tends to improve stability</p>

<p>Constraining trajectory optimization with dual gradient descent
\(\min_{\tau,\theta}c(\tau)\:\:\text{s.t.}\:\:u_t=\pi_\theta(x_t)\\
\bar{\mathcal{L}}(\tau,\theta,\lambda)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2\)</p>

<h4 id="guided-policy-search-gps-discussion">Guided policy search (GPS) discussion</h4>

<ol>
  <li>Find $\tau \gets \arg\min_\tau \bar{\mathcal{L}}(\tau,\theta,\lambda)$ (e.g. via iLQR or other planning methods)</li>
  <li>Find $\theta \gets \arg\min_\theta\bar{\mathcal{L}}(\tau, \theta, \lambda)$ (e.g. via SGD)</li>
  <li>$\lambda \gets \lambda+\alpha \frac{dg}{d\lambda}$ and repeat</li>
</ol>

<ul>
  <li>Can be interpreted as constrained trajectory optimization method</li>
  <li>Can be interpreted as imitation of optimal control expert, since step 2 is just supervised learning</li>
  <li>The optimal control “teacher” adapts to the learner, and avoids actions that the learner can’t mimic</li>
</ul>

<p>General guided policy search scheme</p>

<ol>
  <li>Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$</li>
  <li>Optimize $\theta$ with respect to some supervised objective</li>
  <li>Increment or modify dual variables $\lambda$</li>
</ol>

<p>Need to choose:</p>

<ul>
  <li>form of $p(\tau)$ or $\tau$ (if deterministic)</li>
  <li>optimization method for $p(\tau)$ or $\tau$</li>
  <li>surrogate $\tilde{c}(x_t,u_t)$</li>
  <li>supervised objective for $\pi_\theta(u_t \mid x_t)$</li>
</ul>

<h5 id="deterministic-case-1">Deterministic case</h5>

\[\min_{\tau,\theta}c(\tau)\:\:\:\text{s.t.}\:\:\:u_t=\pi_\theta(x_t)\\
\bar{\mathcal{L}}(\tau,\theta,\lambda)=\tilde{c}(\tau)=c(\tau)+\sum_{t=1}^T\lambda_t(\pi_\theta(x_t)-u_t)+\sum_{t=1}^T\rho_t(\pi_\theta(x_t)-u_t)^2\]

\[\tilde{c}_{k+1,i}(x_t,u_t)=c(x_t,u_t)+\lambda_{k+1,i}\log \pi_\theta (u_t \mid x_t)\]

<ol>
  <li>Optimize $\tau$ with respect to some surrogate $\tilde{c}(x_t,u_t)$</li>
  <li>Optimize $\theta$ with respect to some supervised objective</li>
  <li>Increment or modify dual variables $\lambda$. repeat 1~3</li>
</ol>

<p>Learning with multiple trajectories</p>

\[\min_{\tau_1,...,\tau_N,\theta}\sum_{i=1}^{N}c(\tau_i)\:\:\:\text{s.t.}\:\:\:u_{t,i}=\pi_\theta(x_{t,i})\:\:\forall i\:\forall t\]

<ol>
  <li>Optimize each $\tau_i$ <em>in parallel</em> with respect to $\tilde{c}(x_t,u_t)$</li>
  <li>Optimize $\theta$ with respect to some supervised objective</li>
  <li>Increment or modify dual variables $\lambda$. repeat 1~3</li>
</ol>

<h5 id="stochastic-gaussian-gps">Stochastic (Gaussian) GPS</h5>

\[\begin{aligned}
\min_{p,\theta}\;E_{\tau\sim p(\tau)}[c(\tau)] &amp; \quad \text{s.t.}\quad p(u_t \mid x_t)=\pi_\theta(u_t \mid x_t) \\
p(u_t \mid x_t) &amp;=\mathcal{N}\big(K_t(x_t-\hat{x}_t)+k_t+\hat{u}_t,\Sigma_t\big)
\end{aligned}\]

<ol>
  <li>Optimize $p(\tau)$ with respect to some surrogate $\tilde{c}(x_t,u_t)$</li>
  <li>Optimize $\theta$ with respect to some supervised objective</li>
  <li>Increment or modify dual variables $\lambda$</li>
</ol>

<blockquote>
  <p>Here, unlike pure imitation learning that mimics a fixed optimal control result, the agent imitates the planning results; and if it cannot imitate them well, the optimization process adjusts the planning to fit the learned policy, since the policy is a constraint on the planning.</p>
</blockquote>

<p>Input Remapping Trick</p>

<script type="math/tex; mode=display">
\min_{p,\theta} E_{\tau\sim p(\tau)}[c(\tau)] \quad \text{s.t.}\quad p(u_t \mid x_t)=\pi_\theta(u_t \mid o_t)
</script>

<h3 id="imitation-optimal-control">Imitation optimal control</h3>

<h4 id="imitation-optimal-control-with-dagger">Imitation optimal control with DAgger</h4>

<ol>
  <li>from current state $s_t$, run MCTS to get $a_t,a_{t+1},…$</li>
  <li>add $(s_t,a_t)$ to dataset $\mathcal{D}$</li>
  <li>execute action $a_t\sim\pi(a_t \mid s_t)$ (not the MCTS action!); repeat 1~3 N times</li>
  <li>update the policy by training on $\mathcal{D}$</li>
</ol>

<p>Problems of the original DAgger</p>

<ul>
  <li>Asking a human to label states visited by another policy is hard</li>
  <li>Running the initial (poor) policy in the real world is dangerous in some applications</li>
</ul>

<p>We address the first problem with a planning method; what about the second problem?</p>

<h4 id="imitating-mpc-plato-algorithm">Imitating MPC: PLATO algorithm</h4>

<ol>
  <li>train $\pi_\theta(u_t \mid o_t)$ from labeled data $\mathcal{D}={o_1,u_1,…,o_N,u_N}$</li>
  <li>run $\hat{\pi}(u_t \mid o_t)$ to get dataset $\mathcal{D}_\pi={o_1,…,o_M}$</li>
  <li>Ask computer to label $\mathcal{D_\pi}$ with actions $u_t$</li>
  <li>Aggregate: $\mathcal{D}\gets\mathcal{D}\cup\mathcal{D}_\pi$</li>
</ol>

<p><strong>Simple</strong> stochastic policy: $\hat{\pi}(u_t \mid x_t)=\mathcal{N}(K_tx_t+k_t, \Sigma_{u_t})$</p>

\[\hat{\pi}(u_t \mid x_t)=\arg\min_{\hat{\pi}}\sum_{t'=t}^TE_{\hat{\pi}}[c(x_{t'},u_{t'})]+\lambda D_{KL}(\hat{\pi}(u_t \mid x_t) \mid \mid \pi_\theta(u_t \mid o_t))\]

<blockquote>
  <p>Here $\hat{\pi}$ is re-planned by an optimal control method. For simplicity, a Gaussian policy is chosen since it is easy to plan with LQR. The planning objective also includes the KL constraint, which keeps the behavior policy close to the learned policy while still moving actions away from very bad (dangerous) ones.</p>
</blockquote>

<h5 id="dagger-vs-gps">DAgger vs GPS</h5>

<ul>
  <li>DAgger does not require an adaptive expert</li>
  <li>Any expert will do, so long as states from learned policy can be labeled</li>
  <li>Assumes it is possible to match expert’s behavior up to bounded loss</li>
  <li>Not always possible (e.g. partially observed domains)</li>
  <li>GPS adapts the “expert” behavior</li>
  <li>Does not require bounded loss on initial expert (expert will change)</li>
</ul>

<h5 id="why-imitate">Why imitate?</h5>

<ul>
  <li>Relatively stable and easy to use</li>
  <li>Supervised learning works very well</li>
  <li>control/planning (usually) works very well</li>
  <li>The combination of two (usually) works very well</li>
  <li>Input remapping trick: can exploit availability of additional information at training time to learn policy from raw observations. (planning with state and learning policy with observations)</li>
  <li>overcomes optimization challenges of back-propagating into policy directly</li>
</ul>

<blockquote>
  <p><em>See the accompanying PDF for the illustrative rollout diagram used in class.</em></p>
</blockquote>

<h3 id="model-free-optimization-with-a-model">Model-free optimization with a model</h3>

<ul>
  <li>just use policy gradient (or another model-free RL method) even though you have a model (i.e., treat the model as a simulator)</li>
  <li>Sometimes better than using the gradients!</li>
</ul>

<h4 id="dyna">Dyna</h4>

<p>on-line Q-learning algorithm that performs model-free RL with a model</p>

<ol>
  <li>given state $s$, pick action $a$ using exploration policy</li>
  <li>observe $s'$ and $r$, to get transition $(s,a,s',r)$</li>
  <li>update model $\hat{p}(s' \mid s,a)$ and $\hat{r}(s,a)$ using $(s,a,s')$</li>
  <li>Q-update: $Q(s,a)\gets Q(s,a)+\alpha E_{s',r}[r+\gamma\max_{a'}Q(s',a')-Q(s,a)]$</li>
  <li>repeat $K$ times:</li>
  <li>sample $(s,a)\sim\mathcal{B}$ from buffer of past states and actions</li>
  <li>Q-update: $Q(s,a)\gets Q(s,a)+\alpha E_{s',r}[r+\gamma\max_{a'}Q(s',a')-Q(s,a)]$</li>
</ol>

<p>as the model becomes better, old states are re-evaluated and the estimates become more accurate.</p>

<h4 id="general-dyna-style-model-based-rl-recipe">General “Dyna-style” model-based RL recipe</h4>

<ol>
  <li>given state $s$, pick action $a$ using exploration policy</li>
  <li>learn model $\hat{p}(s' \mid s,a)$ (and optionally, $\hat{r}(s,a)$)</li>
  <li>repeat K times:</li>
  <li>sample $s\sim\mathcal{B}$ from buffer</li>
  <li>choose action a (from $\mathcal{B}$, from $\pi$, or random)</li>
  <li>simulate $s'\sim\hat{p}(s' \mid s,a)$ (and $r=\hat{r}(s,a)$)</li>
  <li>train on $(s,a,s’,r)$ with model-free RL</li>
  <li>(optional) take N more model-based steps</li>
</ol>

<p>This only requires short rollouts from the model (as few as one step), which accumulate very little model error.</p>

<h3 id="model-based-rl-algorithms-summary">Model-based RL algorithms summary</h3>

<h4 id="methods">Methods</h4>

<ul>
  <li>Learn model and plan (without policy)</li>
  <li>Iteratively collect more data to overcome distribution mismatch</li>
  <li>Re-plan every time step (MPC) to mitigate small model errors</li>
  <li>Learning policy</li>
  <li>Back-propagate into policy (e.g., PILCO)–simple but potentially unstable</li>
  <li>imitate optimal control in a constrained optimization framework (e.g., GPS)</li>
  <li>imitate optimal control via DAgger-like process (e.g., PLATO)</li>
  <li>Use a model-free algorithm with a model (Dyna, etc.)</li>
</ul>

<h4 id="limitation-of-model-based-rl">Limitation of model-based RL</h4>

<ul>
  <li>Need some kind of model</li>
  <li>Not always available</li>
  <li>Sometimes harder to learn than the policy</li>
  <li>Learning the model takes time &amp; data</li>
  <li>Sometimes expressive model classes (neural nets) are not fast</li>
  <li>Sometimes fast model classes (linear models) are not expressive</li>
  <li>Some kind of additional assumptions</li>
  <li>Linearizability/continuity</li>
  <li>Ability to reset the system (for local linear models)</li>
  <li>Smoothness (for GP-style global model)</li>
  <li>Etc.</li>
</ul>

<blockquote>
  <p>Here are some of my understandings of model-based RL:</p>

  <p>First, <strong>why</strong> we need model-based RL?</p>

  <p>Model-free RL learns everything from experience. The state space may be very large and learning starts from scratch, which requires a lot of exploration; otherwise it may be hard to converge, or likely to converge to a local optimum.</p>

  <p>But in model-based RL, the model is known or already learned, so the very hard exploration process shifts to planning, which can find decent directions that lead to good results, either via optimal control methods or via search by simulating with the model. After planning, promising trajectories have been generated, and the policy only has to learn to imitate these good trajectories, which removes a lot of random exploration.</p>

  <p>Second, why not just use optimal control rather than learning a policy?</p>

  <p>Actually, you can: just use optimal control methods, like traditional control methods or MPC.</p>

  <p>However, not every model admits an explicit optimal control method like LQR, since some models are hard to solve mathematically. In addition, a neural network policy may generalize better, and closed-loop control tends to be more robust.</p>

  <p>Third, what kinds of methods can I use in model-based RL?</p>

  <ul>
    <li>learn the model and just plan, without learning a policy</li>
    <li>Learn policy by guided policy search</li>
    <li>imitating optimal control with DAgger</li>
  </ul>
</blockquote>

<h3 id="what-kind-of-algorithm-should-i-use">What kind of algorithm should I use?</h3>

<p>ranked by the sample efficiency required (low to high), which also orders them by computational efficiency (high to low):</p>

<ul>
  <li>gradient-free methods (e.g. NES, CMA, etc)</li>
  <li>full on-line methods (e.g. A3C)</li>
  <li>policy gradient methods (e.g. TRPO)</li>
  <li>replay buffer value estimation methods (Q-learning, DDPG, NAF, SAC, etc.)</li>
  <li>model-based deep RL (e.g. PETS, guided policy search)</li>
  <li>model-based “shallow” RL (e.g. PILCO)</li>
</ul>

<blockquote>
  <p><em>The full pipeline sketch is preserved in the downloadable PDF version.</em></p>
</blockquote>

<h2 id="10-variational-inference-and-generative-models">10 Variational Inference and Generative Models</h2>

<h3 id="probabilistic-models">Probabilistic models</h3>

<h4 id="latent-variable-models">Latent variable models</h4>

\[p(x)=\sum_zp(x \mid z)p(z)\\
p(y \mid x)=\sum_zp(y \mid x,z)p(z)\]

<p>Latent variable models in general</p>

<p>feed Gaussian noise through a neural network to fit (approximately) any distribution</p>

\[p(x \mid z)=\mathcal{N}(\mu_{nn}(z),\sigma_{nn}(z))\\
p(x)=\int p(x \mid z)p(z)dz\]

<p>where $p(z)$ is Gaussian.</p>

<p>the neural network takes a Gaussian sample as input, and outputs the mean and variance of a Gaussian over $x$.</p>

<p>Latent variable models in RL: conditional latent variable models for <strong>multi-modal policies</strong></p>

<h4 id="how-to-train-latent-variable-models">How to train latent variable models?</h4>

<p>model to fit a distribution</p>

<p>the model: $p_\theta(x)$</p>

<p>the data: $\mathcal{D}={x_1,x_2,x_3,…,x_N}$</p>

<p>maximum likelihood fit: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i \log p_\theta(x_i)$</p>

<p>in latent variable model</p>

<p>the model: $p(x)=\int p(x \mid z)p(z)dz$</p>

<p>maximum likelihood fit: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i \log \left(\int p(x \mid z)p(z)dz\right)$</p>

<p>Estimating the log-likelihood</p>

<p>alternative: <em>expected</em> log-likelihood: $\theta\gets\arg\max_\theta \frac{1}{N} \sum_i E_{z\sim p(z \mid x_i)} \log p_\theta(x_i,z)$</p>

<h5 id="the-variational-approximation">The variational approximation</h5>

<p>approximate $p(z \mid x_i)$ with $q_i(z)=\mathcal{N}(\mu_i,\sigma_i)$</p>

\[\begin{align}
\log p(x_i) &amp;= \log \int_z p(x_i \mid z)p(z)dz\\
&amp;=\log \int_z p(x_i \mid z)p(z)\frac{q_i(z)}{q_i(z)}dz\\
&amp;=\log E_{z\sim q_i(z)}\left[\frac{p(x_i \mid z)p(z)}{q_i(z)}\right]\\
&amp;\ge E_{z \sim q_i(z)}\left[\log \frac{p(x_i \mid z)p(z)}{q_i(z)}\right]\\
&amp;= E_{z \sim q_i(z)}[\log p(x_i \mid z)+\log p(z)]- E_{z \sim q_i(z)}[\log q_i(z)]\\
&amp;= E_{z \sim q_i(z)}[\log p(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_i)
\end{align}\]

<p>Jensen’s inequality: $\log E[y]\ge E[\log y]$</p>

<p>Entropy: $\mathcal{H}(p)=-E_{x \sim p(x)}[\log p(x)]=-\int_x p(x)\log p(x)dx$</p>

<p>KL Divergence: $D_{KL}(q \mid \mid p)=E_{x \sim q(x)}\left[\log \frac{q(x)}{p(x)}\right]=E_{x \sim q(x)}[\log q(x)]- E_{x \sim q(x)}[\log p(x)] =-E_{x \sim q(x)}[\log p(x)]- \mathcal{H}(q)$</p>

<p>further analysis</p>

<p>\(\log p(x_i) \ge E_{z \sim q_i(z)}[\log p(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_i)=\mathcal{L}_i(p,q_i)\)</p>

<p>so what makes a good $q_i(z)$?</p>

<p>intuition: $q_i(z)$ should approximate $p(z \mid x_i)$</p>

<p>why?</p>

\[\begin{align}
D_{KL}(q_i(z) \mid \mid p(z \mid x_i))&amp;=E_{z\sim q_i(z)}\left[\log \frac{q_i(z)}{p(z \mid x_i)}\right]\\
&amp;=E_{z \sim q_i(z)}\left[\log \frac{q_i(z)p(x_i)}{p(x_i,z)}\right]\\
&amp;=-E_{z\sim q_i(z)}[\log p(x_i \mid z)+\log p(z)] +E_{z\sim q_i(z)}[\log q_i(z)]+E_{z \sim q_i(z)}[\log p(x_i)]\\
&amp;=-E_{z\sim q_i(z)}[\log p(x_i \mid z)+\log p(z)] -\mathcal{H}(q_i)+\log p(x_i)\\
&amp;=-\mathcal{L}_i(p,q_i)+\log p(x_i)
\end{align}\]

\[\begin{align}
\log p(x_i)&amp;=D_{KL}(q_i(z) \mid \mid p(z \mid x_i))+\mathcal{L}_i(p,q_i)\\
0 &amp;\le D_{KL}(q_i(z) \mid \mid p(z \mid x_i)) \\
\log p(x_i)&amp;\ge \mathcal{L}_i(p,q_i)
\end{align}\]

<blockquote>
  <p>So this also proves that $\mathcal{L}_i(p,q_i)$ is a lower bound, and the KL divergence is the bound gap; when $q_i(z)$ is close to $p(z \mid x_i)$, the bound becomes tight.</p>

  <p>Minimizing the KL divergence is the same as maximizing $\mathcal{L}_i(p,q_i)$, so we adjust $q_i(z)$ to maximize $\mathcal{L}_i(p,q_i)$.</p>
</blockquote>

<p>So all we need to do is:</p>

\[\theta \gets \arg \max_{\theta} \frac{1}{N}\sum_i \mathcal{L}_i(p,q_i)\]

\[\mathcal{L}_i(p,q_i)=E_{z \sim q_i(z)}[\log p_\theta(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_i)\]

<p>Algorithm:</p>

<p>for each $x_i$ (or mini-batch):</p>

<p>calculate $\Delta_\theta \mathcal{L}_i(p,q_i)$:</p>

<p>sample $z \sim q_i(z)$</p>

<p>$\Delta_\theta \mathcal{L}(p,q_i)\approx\Delta_\theta \log p_\theta(x_i \mid z)$</p>

<p>$\theta \gets \theta+\alpha \Delta_\theta\mathcal{L}(p,q_i)$</p>

<p>update $q_i$ to maximize $\mathcal{L}_i(p,q_i)$</p>

<p>How to update $q_i$?</p>

<p>let’s say $q_i(z)=\mathcal{N}(\mu_i,\sigma_i)$</p>

<p>use gradients $\Delta_{\mu_i}\mathcal{L}_i(p,q_i)$ and $\Delta_{\sigma_i}\mathcal{L}_i(p,q_i)$</p>

<p>gradient ascent on $\mu_i,\sigma_i$</p>

<p>What’s the problem?</p>

<p>every sample has a $\mu_i,\sigma_i$. When you have many samples, the total parameters are $ \lvert \theta \rvert + ( \lvert \mu_i \rvert + \lvert \sigma_i \rvert )N$</p>

<p>intuition: $q_i(z)$ should approximate $p(z \mid x_i)$</p>

<p>what if we learn a <em>network</em> $q_i(z)=q(z \mid x_i)\approx p(z \mid x_i)$ ?</p>

<p>so we have two networks: $p_\theta(x \mid z)$ and $q_\phi(z \mid x)$</p>

<h3 id="amortized-variational-inference">Amortized variational inference</h3>

\[q_\phi(z \mid x)=\mathcal{N}(\mu_\phi(x),\sigma_\phi(x))\]

\[\log p(x_i)\ge E_{z \sim q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_\phi(z \mid x_i))=\mathcal{L}(p_\theta(x_i \mid z),q_\phi(z \mid x_i))\]

<p>Algorithm:</p>

<p>for each $x_i$ (or mini-batch):</p>

<p>calculate $\mathcal{L}(p_\theta(x_i \mid z),q_\phi(z \mid x_i))$:</p>

<p>sample $z \sim q_\phi(z \mid x_i)$</p>

<p>$\Delta_\theta \mathcal{L}\approx\Delta_\theta \log p_\theta(x_i \mid z)$</p>

<p>$\theta \gets \theta+\alpha \Delta_\theta\mathcal{L}$</p>

<p>$\phi \gets \phi+\alpha \Delta_\phi\mathcal{L}$</p>

<p>how can we get $\Delta_\phi\mathcal{L}$ ?
\(\mathcal{L}_i=E_{z \sim q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_\phi(z \mid x_i))\\
J(\phi)= E_{z \sim q_\phi(z \mid x_i)}[r(x_i,z)]\\
\Delta_\phi J(\phi)\approx \frac{1}{M}\sum_j\Delta_\phi \log q_\phi(z_j \mid x_i)r(x_i,z_j)\)
one way is to just use the policy gradient trick, but it has high variance; the other way is to apply the re-parameterization trick.</p>

<h4 id="the-re-parameterization-trick">The re-parameterization trick</h4>

\[q_\phi(z \mid x)=\mathcal{N}(\mu_\phi(x),\sigma_\phi(x))\\
z=\mu_\phi(x)+\epsilon\sigma_\phi(x)\;\:\epsilon\sim \mathcal{N}(0,1)\]

\[\begin{align}
J(\phi)&amp;= E_{z \sim q_\phi(z \mid x_i)}[r(x_i,z)]\\
&amp;=E_{\epsilon \sim \mathcal{N}(0,1)}[r(x_i,\mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))]
\end{align}\]

<p>and then we can estimating $\Delta_\phi J(\phi)$:</p>

<p>sample $\epsilon_1,…,\epsilon_M$ from $\mathcal{N}(0,1)$ (even a single sample per data point works well!)</p>

<p>$\Delta_\phi J(\phi) \approx \frac{1}{M}\sum_j\Delta_\phi r(x_i,\mu_\phi(x_i)+\epsilon_j\sigma_\phi(x_i)) $</p>

<p>this has low variance, since it uses the gradient of $r$ rather than just samples of $r$.</p>
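<p>A numpy sketch of the estimator for a scalar Gaussian and a toy reward $r(z)=-(z-3)^2$; gradient ascent on $(\mu, \log\sigma)$ drives $\mu$ toward 3 and shrinks $\sigma$ (all constants are illustrative):</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
mu, log_sigma = 0.0, 0.0
dr = lambda z: -2.0 * (z - 3.0)          # dr/dz for r(z) = -(z - 3)^2

for _ in range(500):
    eps = rng.normal(size=16)            # eps ~ N(0, 1)
    sigma = np.exp(log_sigma)
    z = mu + eps * sigma                 # re-parameterized sample
    g_mu = dr(z).mean()                  # chain rule: dz/dmu = 1
    g_ls = (dr(z) * eps * sigma).mean()  # chain rule: dz/dlog_sigma = eps * sigma
    mu += 0.05 * g_mu
    log_sigma += 0.05 * g_ls
</code></pre>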

<p>Another way to look at it…</p>

\[\begin{align}
\mathcal{L}_i&amp;=E_{z \sim q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)+\log p(z)]+ \mathcal{H}(q_\phi(z \mid x_i))\\
&amp;=E_{z \sim q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)]+E_{z \sim q_\phi(z \mid x_i)}[\log p(z)]+ \mathcal{H}(q_\phi(z \mid x_i))\\
&amp;=E_{z \sim q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)]-D_{KL}(q_\phi(z \mid x_i) \mid \mid p(z))\\
&amp;=E_{\epsilon \sim \mathcal{N}(0,1)}[\log p_\theta(x_i \mid \mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))]-D_{KL}(q_\phi(z \mid x_i) \mid \mid p(z))\\
&amp;\approx \log p_\theta(x_i \mid \mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))-D_{KL}(q_\phi(z \mid x_i) \mid \mid p(z))
\end{align}\]

<h4 id="re-parameterization-trick-vs-policy-gradient">Re-parameterization trick vs. policy gradient</h4>

<ul>
  <li>policy gradient: $\Delta_\phi J(\phi)\approx \frac{1}{M}\sum_j\Delta_\phi \log q_\phi(z_j \mid x_i)r(x_i,z_j)$</li>
  <li>Can handle both discrete and continuous latent variables</li>
  <li>but has high variance, requires multiple samples &amp; small learning rates</li>
  <li>Re-parameterization trick $\Delta_\phi J(\phi) \approx \frac{1}{M}\sum_j\Delta_\phi r(x_i,\mu_\phi(x_i)+\epsilon_j\sigma_\phi(x_i)) $</li>
  <li>only continuous latent variables</li>
  <li>very simple to implement</li>
  <li>low variance</li>
</ul>

<h3 id="the-variational-auto-encoder-vae">The variational auto encoder (VAE)</h3>

<blockquote>
  <p><em>Consult the PDF export for the value-iteration schematic mentioned here.</em></p>
</blockquote>

<p>Conditional models
\(\mathcal{L}_i=E_{z \sim q_\phi(z \mid x_i,y_i)}[\log p_\theta(y_i \mid x_i,z)+\log p(z \mid x_i)]+ \mathcal{H}(q_\phi(z \mid x_i,y_i))\)
Applications of variational inference:</p>

<ul>
  <li>using RL\control+variational inference to model human behavior</li>
  <li>using generative models and variational inference for exploration</li>
</ul>

<blockquote>
  <p>this class is a little tough, here is some of my understanding:</p>

  <p>we want to represent the distribution of an object (with a neural network) and give it the ability to capture the object’s features (multi-modality). To achieve this, random variables ($z$) are used as input to the model. But what kind of random variables can achieve this? The mathematical argument shows that the distribution of the random variable should approximate $p(z \mid x_i)$; this resembles compressed sensing, with $z$ as the latent variable. Finally, this is used to build the variational auto-encoder (VAE).</p>

  <p>To make the network trainable, the re-parameterization trick is applied.</p>
</blockquote>

<h2 id="11-re-framing-control-as-an-inference-problem">11. Re-framing Control as an Inference Problem</h2>

<p>Get the objective function from the policy.</p>

<p>Human behavior is stochastic and suboptimal but overall good; how can we model and interpret this kind of behavior?</p>

<h3 id="a-probabilistic-graphical-model-of-decision-making">A probabilistic graphical model of decision making</h3>

\[p(\mathcal{O}_t \mid s_t,a_t)=\exp(r(s_t,a_t))\\
p(\tau \mid \mathcal{O}_{1:T})=\frac{p(\tau,\mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})}\propto p(\tau)\prod_t\exp(r(s_t,a_t))=p(\tau)\exp\left(\sum_t r(s_t,a_t)\right)\]

<p>$\mathcal{O}_t$ is a boolean variable indicating whether the agent is acting optimally (maximizing reward) at time $t$, rather than acting randomly</p>

<h3 id="inference">Inference</h3>

<h4 id="backward-massages">Backward massages</h4>

\[\begin{align}
\beta_t(s_t,a_t)&amp;=p(\mathcal{O}_{t:T} \mid s_t,a_t)\\
&amp;=\int p(\mathcal{O}_{t:T},s_{t+1} \mid s_t,a_t)ds_{t+1}\\
&amp;=\int p(\mathcal{O}_{t+1:T} \mid s_{t+1}) p(s_{t+1} \mid s_t,a_t)p(\mathcal{O}_t \mid s_t,a_t)ds_{t+1}\\
\end{align}\]

\[p(\mathcal{O}_{t+1:T} \mid s_{t+1})=\beta_{t+1}(s_{t+1})=\int p(\mathcal{O}_{t+1:T} \mid s_{t+1},a_{t+1})p(a_{t+1} \mid s_{t+1})d a_{t+1}\\
=\int \beta_{t+1}(s_{t+1},a_{t+1})p(a_{t+1} \mid s_{t+1})d a_{t+1}\]

<p>for $t=T-1$ to 1:</p>

<blockquote>
\[\begin{align}
\beta_t(s_t,a_t)&amp;=p(\mathcal{O}_t \mid s_t,a_t)E_{s_{t+1}\sim p(s_{t+1} \mid s_t,a_t)}[\beta_{t+1}(s_{t+1})]\\
\beta_t(s_t)&amp;=E_{a_t\sim p(a_t \mid s_t)}[\beta_t(s_t,a_t)]
\end{align}\]
</blockquote>

<p>let $V_t(s_t) =\log \beta_t(s_t)$</p>

<p>let $Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)$</p>

\[V_t(s_t)=\log\int \exp(Q_t(s_t,a_t))da_t\\
V_t(s_t) \to \max_{a_t}Q_t(s_t,a_t)\;\text{as}\;Q_t(s_t,a_t)\;\text{gets bigger!}\\
Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\\
\text{for deterministic transitions: }Q_t(s_t,a_t)=r(s_t,a_t)+V_{t+1}(s_{t+1})\]
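<p>A tabular sketch of this backward pass (a minimal illustration; the MDP arrays <code class="language-plaintext highlighter-rouge">R</code> and <code class="language-plaintext highlighter-rouge">P</code> and their sizes are made-up inputs). The last line anticipates the policy $\pi(a_t \mid s_t)=\exp(Q_t-V_t)$ derived in the next subsection:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(R, P, T):
    """Backward messages in log space for a small tabular MDP.
    R: (S, A) rewards; P: (S, A, S) transition probabilities."""
    S, A = R.shape
    V = np.zeros(S)            # log beta_T(s) = 0, i.e. beta_T = 1
    Q = None
    for t in range(T - 1, -1, -1):
        # Q_t = r + log E_{s'}[exp(V_{t+1})]: a soft max over the
        # dynamics, which is exactly the optimism problem noted below.
        Q = R + np.log(P @ np.exp(V))
        V = logsumexp(Q, axis=1)           # soft max over actions
    return Q, V

rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
Q1, V1 = soft_value_iteration(rng.random((S, A)), P, T=10)
pi = np.exp(Q1 - V1[:, None])              # pi(a|s) = exp(Q - V)
</code></pre></div></div>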

<h4 id="policy-computation">Policy computation</h4>

\[\begin{align*}
p(a_t \mid s_t,\mathcal{O}_{1:T})&amp;=\pi(a_t \mid s_t)\\
&amp;=p(a_t \mid s_t,\mathcal{O}_{t:T})\\
&amp;=\frac{p(a_t,s_t \mid \mathcal{O}_{t:T})}{p(s_t \mid \mathcal{O}_{t:T})}\\
&amp;=\frac{p(\mathcal{O}_{t:T} \mid a_t,s_t)p(a_t,s_t)/p(\mathcal{O}_{t:T})}{p(\mathcal{O}_{t:T} \mid s_t)p(s_t)/p(\mathcal{O}_{t:T})}\\
&amp;=\frac{p(\mathcal{O}_{t:T} \mid a_t,s_t)}{p(\mathcal{O}_{t:T} \mid s_t)}\frac{p(a_t,s_t)}{p(s_t)}\\
&amp;=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}p(a_t \mid s_t)\\
&amp;=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}
\end{align*}\]

<p>(the last step assumes a uniform action prior $p(a_t \mid s_t)$; a non-uniform prior can be folded into the reward)</p>

<h4 id="policy-computation-with-value-functions">Policy computation with value functions</h4>

<p>for $t=T-1$ to 1:</p>

<blockquote>
\[V_t(s_t)=\log\int \exp(Q_t(s_t,a_t))da_t\\
Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\]
</blockquote>

\[\pi(a_t \mid s_t)=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}\\
V_t(s_t) =\log \beta_t(s_t)\\
Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)\]

<p>So</p>

\[\pi(a_t \mid s_t)=\exp(Q_t(s_t,a_t)-V_t(s_t))=\exp(A_t(s_t,a_t))\]

<p>with temperature: $\pi(a_t \mid s_t)=\exp(\frac{1}{\alpha}Q_t(s_t,a_t)-\frac{1}{\alpha}V_t(s_t))=\exp(\frac{1}{\alpha}A_t(s_t,a_t))$. As $\alpha$ approaches zero the max action dominates and the policy becomes nearly deterministic; $\alpha$ near 1 gives a more stochastic policy.</p>

<ul>
  <li>Natural interpretation: better actions are more probable</li>
  <li>Random tie-breaking</li>
  <li>Analogous to Boltzmann exploration</li>
  <li>Approaches greedy policy as temperature decreases</li>
</ul>

<h4 id="forward-massages">Forward massages</h4>

\[\begin{align*}
\alpha_t(s_t)&amp;=p(s_t \mid \mathcal{O}_{1:t-1})\\
&amp;=\int p(s_t,s_{t-1},a_{t-1} \mid \mathcal{O}_{1:t-1})ds_{t-1}da_{t-1}\\
&amp;=\int p(s_t \mid s_{t-1},a_{t-1},\mathcal{O}_{1:t-1})p(a_{t-1} \mid s_{t-1},\mathcal{O}_{1:t-1})p(s_{t-1} \mid \mathcal{O}_{1:t-1})ds_{t-1}da_{t-1}\\
&amp;=\int p(s_t \mid s_{t-1},a_{t-1})p(a_{t-1} \mid s_{t-1},\mathcal{O}_{1:t-1})p(s_{t-1} \mid \mathcal{O}_{1:t-1})ds_{t-1}da_{t-1}
\end{align*}\]

\[\begin{align*}
p(a_{t-1} \mid s_{t-1},\mathcal{O}_{1:t-1})p(s_{t-1} \mid \mathcal{O}_{1:t-1})
&amp;=\frac{p(\mathcal{O}_{t-1} \mid s_{t-1},a_{t-1})p(a_{t-1} \mid s_{t-1})}{p(\mathcal{O}_{t-1} \mid s_{t-1})}\frac{p(\mathcal{O}_{t-1} \mid s_{t-1})p(s_{t-1} \mid \mathcal{O}_{1:t-2})}{p(\mathcal{O}_{t-1} \mid \mathcal{O}_{1:t-2})}\\
&amp;=p(\mathcal{O}_{t-1} \mid s_{t-1},a_{t-1})p(a_{t-1} \mid s_{t-1})\frac{\alpha_{t-1}(s_{t-1})}{p(\mathcal{O}_{t-1} \mid \mathcal{O}_{1:t-2})}
\end{align*}\]

<p>what if we want $p(s_t \mid \mathcal{O}_{1:T})$?</p>

\[\begin{align*}
p(s_t \mid \mathcal{O}_{1:T})&amp;=\frac{p(s_t,\mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})}\\
&amp;=\frac{p(\mathcal{O}_{t:T} \mid s_t)p(s_t,\mathcal{O}_{1:t-1})}{p(\mathcal{O}_{1:T})}\\
&amp;\propto \beta_t(s_t)p(s_t \mid \mathcal{O}_{1:t-1})p(\mathcal{O}_{1:t-1})\\
&amp;\propto \beta_t(s_t)\alpha_t(s_t)
\end{align*}\]

<h3 id="the-optimism-problem">The optimism problem</h3>

<p>for $t=T-1$ to 1:</p>

<blockquote>
\[\begin{align}
\beta_t(s_t,a_t)&amp;=p(\mathcal{O}_t \mid s_t,a_t)E_{s_{t+1}\sim p(s_{t+1} \mid s_t,a_t)}[\beta_{t+1}(s_{t+1})]\\
\beta_t(s_t)&amp;=E_{a_t\sim p(a_t \mid s_t)}[\beta_t(s_t,a_t)]
\end{align}\]
</blockquote>

<p>let $V_t(s_t) =\log \beta_t(s_t)$</p>

<p>let $Q_t(s_t,a_t)= \log \beta_t(s_t,a_t)$
\(Q_t(s_t,a_t)=r(s_t,a_t)+\log E[\exp(V_{t+1}(s_{t+1}))]\)
the next-state value enters as an optimistic soft max over the dynamics rather than an expectation; why did this happen?</p>

<p>The inference problem: $p(s_{1:T},a_{1:T} \mid \mathcal{O}_{1:T})$</p>

<p>marginalizing and conditioning, we get: $p(a_t \mid s_t,\mathcal{O}_{1:T})$ (the policy)</p>

<blockquote>
  <p>“given that you obtained high reward, what was your action probability?”</p>
</blockquote>

<p>marginalizing and conditioning, we get: $p(s_{t+1} \mid s_t,a_t,\mathcal{O}_{1:T})\ne p(s_{t+1} \mid s_t,a_t)$</p>

<blockquote>
  <p>“given that you obtained high reward, what was your transition probability?”</p>
</blockquote>

<p>Because we are asking about the transition probability conditioned on a good outcome, the inferred dynamics become optimistic; this is the optimism problem.</p>

<h4 id="addressing-the-optimism-problem">Addressing the optimism problem</h4>

<p>we actually want to ask “given that you obtained high reward, what was your action probability when the transition probability did not change?”</p>

<p>find another distribution $q(s_{1:T},a_{1:T})$ that is close to $p(s_{1:T},a_{1:T} \mid \mathcal{O}_{1:T})$ but has dynamics $p(s_{t+1} \mid s_t,a_t)$</p>

<p>Try variational inference!</p>

<p>let $\mathbf{x}=\mathcal{O}_{1:T}$ and $\mathbf{z}=(s_{1:T},a_{1:T})$; find $q(\mathbf{z})$ to approximate $p(\mathbf{z} \mid \mathbf{x})$</p>

<h4 id="control-via-variational-inference">Control via variational inference</h4>

\[q(s_{1:T},a_{1:T})=p(s_1)\prod_t p(s_{t+1} \mid s_t,a_t)q(a_t \mid s_t)\]

<p>The variational lower bound (last class)</p>

\[\begin{align}
\log p(x)&amp;\ge E_{z \sim q(z)}[\log p(x,z)-\log q(z)]\\
\log p(\mathcal{O}_{1:T}) &amp;\ge E_{s_{1:T},a_{1:T}\sim q}\Big[\log p(s_1)+\sum_{t=1}^T\log p(s_{t+1} \mid s_t,a_t)+\sum_{t=1}^T\log p(\mathcal{O}_t \mid s_t,a_t)\\
&amp;\qquad -\log p(s_1)-\sum_{t=1}^T\log p(s_{t+1} \mid s_t,a_t)-\sum_{t=1}^T\log q(a_t \mid s_t)\Big]\\
&amp;=E_{(s_{1:T},a_{1:T})\sim q}\left[\sum_t r(s_t,a_t)-\log q(a_t \mid s_t)\right]\\
&amp;=\sum_t E_{(s_t,a_t)\sim q}[r(s_t,a_t)+\mathcal{H}(q(a_t \mid s_t))]
\end{align}\]

<p>maximize the rewards and entropy</p>

<p>Optimizing the variational lower bound</p>

\[Q_t(s_t,a_t)=r(s_t,a_t)+E[V_{t+1}(s_{t+1})]\\
V_t(s_t)=\log \int \exp(Q_t(s_t,a_t))d a_t\]

<h4 id="backward-pass-variational">backward pass-variational</h4>

<p>for $t=T-1$ to 1:</p>

<blockquote>
\[V_t(s_t)=\log\int \exp(Q_t(s_t,a_t))da_t\\
Q_t(s_t,a_t)=r(s_t,a_t)+E[V_{t+1}(s_{t+1})]\]
</blockquote>

<p><strong>Variants:</strong></p>

<ul>
  <li>discounted SOC: $Q_t(s_t,a_t)=r(s_t,a_t)+\gamma E[V_{t+1}(s_{t+1})]$</li>
  <li>explicit temperature: $V_t(s_t)=\alpha \log \int \exp(\frac{1}{\alpha}Q_t(s_t,a_t))da_t$</li>
</ul>

<h4 id="soft-q-learning">Soft Q-learning</h4>

<p>soft Q-learning $\phi \gets \phi + \alpha \Delta_\phi Q_\phi(s,a)(r(s,a)+\gamma V(s’)-Q_\phi(s,a))$</p>

<p>target value: $V(s’)=\text{soft} \max_{a’}Q_\phi(s’,a’)=\log \int \exp (Q_\phi(s’,a’))da’$</p>

<p>$\pi(a \mid s)=\exp(Q_\phi(s,a)-V(s))=\exp(A(s,a))$</p>
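<p>A tabular sketch of this update (assuming a finite action set, so the soft max integral becomes a logsumexp; array shapes are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.special import logsumexp

def soft_q_step(Q, s, a, r, s_next, lr=0.1, gamma=0.99):
    """One tabular soft Q-learning step: the target replaces
    max_a' Q(s', a') with the soft max logsumexp_a' Q(s', a')."""
    v_next = logsumexp(Q[s_next])
    Q[s, a] += lr * (r + gamma * v_next - Q[s, a])
    return Q

def soft_policy(Q, s):
    """pi(a|s) = exp(Q(s,a) - V(s)) = exp(A(s,a))."""
    return np.exp(Q[s] - logsumexp(Q[s]))
</code></pre></div></div>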

<blockquote>
  <p><em>Detailed timeline graphic available in the PDF notes.</em></p>
</blockquote>

<h4 id="policy-gradient-with-soft-optimality">Policy gradient with soft optimality</h4>

<p>$\pi(a \mid s)=\exp(Q_\phi(s,a)-V(s))$ optimizes $\sum_t E_{\pi(s_t,a_t)}[r(s_t,a_t)]+E_{\pi(s_t)}[\mathcal{H}(\pi(a_t \mid s_t))]$</p>

<p><strong>intuition:</strong> $\pi(a \mid s)\propto \exp(Q_\phi(s,a))$ when $\pi$ minimizes $D_{KL}(\pi(a \mid s) \mid \mid \frac{1}{Z}\exp(Q(s,a)))$</p>

<h4 id="soft-policy-gradient-vs-soft-q-learning">Soft Policy gradient vs soft Q-learning</h4>

<p>policy gradient derivation:</p>

\[J(\theta)=\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)]+E_{\pi(s_t)}[\mathcal{H}(\pi(a \mid s_t))]\\
=\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)-\log \pi(a_t \mid s_t)]\\
\log \pi(a_t \mid s_t)=Q(s_t,a_t)-V(s_t)\]

\[\begin{align}
&amp;\Delta_\theta\left[\sum_tE_{\pi(s_t,a_t)}[r(s_t,a_t)-\log \pi(a_t \mid s_t)]\right]\\
&amp;\approx \frac{1}{N}\sum_i\sum_t\Delta_\theta \log \pi(a_t \mid s_t)\left(r(s_t,a_t)+\left(\sum_{t'=t+1}^Tr(s_{t'},a_{t'})-\log \pi(a_{t'} \mid s_{t'})\right)-\log \pi(a_t \mid s_t)-1\right)\\
&amp;\approx \frac{1}{N}\sum_i\sum_t(\Delta_\theta Q(s_t,a_t)-\Delta_\theta V(s_t))\left(r(s_t,a_t)+Q(s_{t+1},a_{t+1})-Q(s_t,a_t)+V(s_t)\right)\\
&amp;\approx \frac{1}{N}\sum_i\sum_t(\Delta_\theta Q(s_t,a_t)-\Delta_\theta V(s_t))\left(r(s_t,a_t)+Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right)
\end{align}\]

<p>soft Q-learning:</p>

\[- \frac{1}{N}\sum_i\sum_t\Delta_\theta Q(s_t,a_t)\left(r(s_t,a_t)+\text{soft}\max_{a_{t+1}}Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right)\]

<h4 id="benefits-of-soft-optimality">Benefits of soft optimality</h4>

<ul>
  <li>Improve exploration and prevent entropy collapse</li>
  <li>Easier to specialize (fine-tune) policies for more specific tasks</li>
  <li>Principled approach to break ties</li>
  <li>Better robustness (due to wider coverage of states)</li>
  <li>Can reduce to hard optimality as reward magnitude increases</li>
  <li>Good model for modeling human behavior</li>
</ul>

<h2 id="12-inverse-reinforcement-learning">12. Inverse Reinforcement Learning</h2>

<h3 id="why-should-we-worry-about-learning-rewards">Why should we worry about learning rewards</h3>

<h4 id="the-imitation-learning-perspective">The imitation learning perspective</h4>

<p>Standard imitation learning:</p>

<ul>
  <li>copy the action performed by the expert</li>
  <li>no reasoning about outcomes of actions</li>
</ul>

<p>Human imitation learning:</p>

<ul>
  <li>copy the <em>intent</em> of the expert</li>
  <li>might take very different actions</li>
</ul>

<h4 id="the-reinforcement-learning-perspective">The reinforcement learning perspective</h4>

<p>sometimes the reward function is complicated and hard to specify</p>

<h3 id="inverse-reinforcement-learning">Inverse reinforcement learning</h3>

<p>Infer reward functions from demonstrations</p>

<ul>
  <li>by itself, this is an underspecified problem</li>
  <li>many reward functions can explain the same behavior</li>
</ul>

<p>Inverse reinforcement learning:</p>

<p>Given:</p>

<ul>
  <li>states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$</li>
  <li>(sometimes) a transition model $p(s’ \mid s,a)$</li>
  <li>demonstrations $\{\tau_i\}$ drawn from $\pi^*(\tau)$</li>
</ul>

<p>Goal: learn $r_\psi(s,a)$ and then recover $\pi^*(a \mid s)$.</p>

<h4 id="learning-the-optimality-variable">Learning the optimality variable</h4>

<p>$p(\mathcal{O}_t \mid s_t,a_t,\psi)=\exp(r_\psi(s_t,a_t))$</p>

<p>$p(\tau \mid \mathcal{O}_{1:T},\psi) \propto p(\tau)\exp\left(\sum_t r_\psi(s_t,a_t)\right)$</p>

<p>Here the demonstrations are $\{\tau_i\}$ sampled from $\pi^*(\tau)$ and we perform maximum-likelihood learning:</p>

\[\max_\psi \frac{1}{N}\sum_{i=1}^N \log p(\tau_i \mid \mathcal{O}_{1:T},\psi)
  = \max_\psi \frac{1}{N}\sum_{i=1}^N r_{\psi}(\tau_i) - \log Z,\]

<p>where $\log Z$ ensures the trajectory probabilities sum to one.</p>

<h5 id="the-irl-partition-function">The IRL partition function</h5>

\[\max_\psi \frac{1}{N}\sum_{i=1}^Nr_{\psi}(\tau_i)-\log Z\\
Z=\int p(\tau)\exp(r_\psi(\tau))d\tau\\
\Delta_\psi \mathcal{L}=\frac{1}{N}\sum_{i=1}^N\Delta_\psi r_\psi(\tau_i)-\frac{1}{Z}\int p(\tau)\exp (r_\psi(\tau))\Delta_\psi r_\psi(\tau)d\tau\\
=E_{\tau\sim \pi^*(\tau)}[\Delta_\psi r_\psi(\tau)]-E_{\tau \sim p(\tau \mid \mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]\]

<h5 id="estimating-the-expectation">Estimating the expectation</h5>

<p>$p(s_t,a_t \mid \mathcal{O}_{1:T},\psi)=p(a_t \mid s_t,\mathcal{O}_{1:T},\psi)p(s_t \mid \mathcal{O}_{1:T},\psi)$</p>

<p>let $\mu_t(s_t,a_t)\propto\beta(s_t,a_t)\alpha(s_t)$</p>

\[\begin{align}
E_{\tau \sim p(\tau \mid \mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]&amp;= E_{\tau \sim p(\tau \mid \mathcal{O}_{1:T},\psi)}[\Delta_\psi \sum_{t=1}^Tr_\psi(s_t,a_t)]\\
&amp;=\sum_{t=1}^T E_{(s_t,a_t) \sim p(s_t,a_t \mid \mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(s_t,a_t)]\\
&amp;=\sum_{t=1}^T\vec{\mu}_t^T\Delta_\psi\vec{r}_\psi
\end{align}\]

<h5 id="the-maxent-irl-algorithm">The MaxEnt IRL algorithm</h5>

<ol>
  <li>Given $\psi$, compute backward message $\beta(s_t,a_t)$</li>
  <li>Given $\psi$, compute forward message $\alpha(s_t)$</li>
  <li>Compute $\mu_t(s_t,a_t) \propto \beta(s_t,a_t)\alpha(s_t)$</li>
  <li>Evaluate $\Delta_\psi \mathcal{L}=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\Delta_\psi r_\psi(s_{i,t},a_{i,t})-\sum_{t=1}^T\int\int\mu_t(s_t,a_t)\Delta_\psi r_\psi(s_t,a_t)ds_tda_t$</li>
  <li>$\psi \gets \psi +\eta \Delta_\psi\mathcal{L}$</li>
</ol>
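<p>A compact tabular sketch of steps 1 through 5 (a linear reward $r_\psi=\psi^Tf$ and known dynamics are assumed; all array shapes and names are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.special import logsumexp

def maxent_irl_step(psi, F, P, p0, demos, T, lr=0.01):
    """One MaxEnt IRL gradient step for a linear reward r = F @ psi.
    F: (S, A, d) features; P: (S, A, S) dynamics; p0: (S,) initial
    state distribution; demos: list of trajectories [(s, a), ...]."""
    R = F @ psi                                   # (S, A) reward table
    S, A, d = F.shape
    # 1-2. Backward messages via soft value iteration under r_psi.
    V, pis = np.zeros(S), []
    for _ in range(T):
        Q = R + np.log(P @ np.exp(V) + 1e-300)
        V = logsumexp(Q, axis=1)
        pis.append(np.exp(Q - V[:, None]))        # pi_t(a|s)
    pis = pis[::-1]
    # 3-4. Forward pass: visitation mu_t and model feature expectations.
    grad_model, p_s = np.zeros(d), p0.copy()
    for t in range(T):
        mu = p_s[:, None] * pis[t]                # mu_t(s, a)
        grad_model += np.einsum("sa,sad-&gt;d", mu, F)
        p_s = np.einsum("sa,sap-&gt;p", mu, P)
    # Expert feature expectations: average total counts per demo.
    grad_expert = np.mean(
        [sum(F[s, a] for (s, a) in tau) for tau in demos], axis=0)
    # 5. Gradient ascent on the likelihood.
    return psi + lr * (grad_expert - grad_model)
</code></pre></div></div>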

<p>Why MaxEnt?
in the case where $r_\psi(s_t,a_t)=\psi^Tf(s_t,a_t)$, we can show that it optimizes $\max\mathcal{H}(\pi^{r_\psi})$ such that $E_{\pi^{r_\psi}}[f]=E_{\pi^*}[f]$</p>

<p>paper: Ziebart et al. 2008, Maximum Entropy Inverse Reinforcement Learning</p>

<p>what’s missing so far?</p>

<ul>
  <li>MaxEnt IRL so far requires…
    <ul>
      <li>solving for the (soft) optimal policy in the inner loop</li>
      <li>enumerating all state-action tuples for visitation frequency and gradient</li>
    </ul>
  </li>
  <li>To apply this in practical problem settings, we need to handle…
    <ul>
      <li>large and continuous state and action spaces</li>
      <li>states obtained via sampling only</li>
      <li>unknown dynamics</li>
    </ul>
  </li>
</ul>

<p>recall:</p>

\[\Delta_\psi \mathcal{L}=E_{\tau\sim \pi^*(\tau)}[\Delta_\psi r_\psi(\tau_i)]-E_{\tau \sim p(\tau \mid \mathcal{O}_{1:T},\psi)}[\Delta_\psi r_\psi(\tau)]\]

<p>The first term is estimated from expert trajectories; the second is an expectation under the soft optimal policy for the current reward. A simple idea is to learn this soft policy with the methods from the previous lecture, but that is impractical: the inner policy must be trained to convergence at every reward update, which costs a lot of computation.</p>

<h4 id="more-efficient-sample-based-updates">More efficient sample-based updates</h4>

<h5 id="guided-cost-learning">Guided cost learning</h5>

\[\Delta_\psi \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{M}\sum_{j=1}^M\Delta_\psi r_\psi(\tau_j)\]

<p>instead of learning $p(a_t \mid s_t, \mathcal{O}_{1:T},\psi)$ with a max-ent RL algorithm until convergence and then running this policy to sample $\{\tau_j\}$, can we just run one or a few gradient steps and then sample?</p>

<p>Solution 1: use importance sampling</p>

\[\Delta_\psi\mathcal{L} \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{\sum_jw_j}\sum_{j=1}^Mw_j\Delta_\psi r_\psi(\tau_j)\;\;\;\;w_j=\frac{p(\tau)\exp(r_\psi(\tau_j))}{\pi(\tau_j)}\]

\[\begin{align}
w_j&amp;=\frac{p(\tau)\exp(r_\psi(\tau_j))}{\pi(\tau_j)}\\
&amp;=\frac{p(s_1)\prod_tp(s_{t+1} \mid s_t,a_t)\exp(r_\psi(s_t,a_t))}{p(s_1)\prod_tp(s_{t+1} \mid s_t,a_t)\pi(a_t \mid s_t)}\\
&amp;=\frac{\exp(\sum_t r_\psi(s_t,a_t))}{\prod_t\pi(a_t \mid s_t)}
\end{align}\]

<p>each policy update w.r.t. $r_\psi$ brings us closer to the target distribution!</p>

<p>paper: Guided Cost Learning. Finn et al., ICML ’16</p>

<p>This is actually like a game, i.e. a GAN: the reward update $\Delta_\psi\mathcal{L} \approx \frac{1}{N}\sum_{i=1}^N\Delta_\psi r_{\psi}(\tau_i)-\frac{1}{\sum_jw_j}\sum_{j=1}^Mw_j\Delta_\psi r_\psi(\tau_j)$ makes demos more likely and samples less likely, while the policy update $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)r_\psi(\tau_j)$ changes the policy to make its samples <em>harder</em> to distinguish from demos.</p>

<h5 id="inverse-rl-as-a-generative-adversarial-networks-gan">Inverse RL as a Generative adversarial Networks (GAN)</h5>

<p>In GAN, the best discriminator is $D^*(x)=\frac{p^*(x)}{p_\theta(x)+p^*(x)}$</p>

<p>For IRL, optimal policy approaches $\pi_\theta(\tau)\propto p(\tau)\exp(r_\psi(\tau))$</p>

<p>choose this parameterization for discriminator:</p>

\[\begin{align}
D_\psi(\tau)&amp;=\frac{p(\tau)\frac{1}{Z}\exp(r_\psi(\tau))}{\pi_\theta(\tau)+p(\tau)\frac{1}{Z}\exp(r_\psi(\tau))}\\
&amp;=\frac{\frac{1}{Z}\exp(r_\psi(\tau))}{\prod_t\pi_\theta(a_t \mid s_t)+\frac{1}{Z}\exp(r_\psi(\tau))}
\end{align}\]

<p>and train as in a GAN: $\psi \gets \arg \max_\psi E_{\tau \sim p^*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]$. This yields the same result as the previous inverse RL update rule, and here we do not need importance weights: we can simply optimize $Z$ with respect to the same objective as $\psi$, so the weighting is absorbed into $Z$.</p>

<p>so the discriminator update is $\psi \gets \arg \max_\psi E_{\tau \sim p^*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]$</p>

<p>the generator update is $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)r_\psi(\tau_j)$</p>

<p>After inverse RL, the reward has been learned; when the environment changes, the learned reward representation generalizes and can be used to learn a policy in the new environment.</p>

<h5 id="regular-discriminator">Regular discriminator</h5>

<p>Can we just use a regular discriminator?</p>

<p>\(\psi \gets \arg \max_\psi E_{\tau \sim p^*}[\log D_\psi(\tau)]+E_{\tau \sim \pi_\theta}[\log(1-D_\psi(\tau))]\)
and just parametrize $D_\psi(\tau)$ as a standard binary neural net classifier</p>

<p>the generator becomes $\Delta_\theta\mathcal{L}\approx\frac{1}{M}\sum_{j=1}^M\Delta_\theta\log\pi_\theta(\tau_j)\log D_\psi(\tau_j)$</p>

<ul>
  <li>this is simpler to set up and optimize</li>
  <li>but the discriminator knows nothing at convergence</li>
  <li>and we do not recover the reward representation; we only get the policy $\pi_\theta$</li>
</ul>

<blockquote>
  <p><em>The original slide for this section appears in the PDF; it is referenced here to avoid broken local paths.</em></p>
</blockquote>

<h2 id="13-transfer-and-multi-task-learning">13. Transfer and Multi-task Learning</h2>

<p>Use prior knowledge</p>

<p><strong>Transfer learning</strong>: using experience from one set of tasks for faster learning and better performance on a new task. In RL, a task is an MDP.</p>

<p><strong>shot</strong>: number of attempts in the target domain</p>

<p><strong>0-shot</strong>: just run a policy trained in the source domain</p>

<p><strong>1-shot</strong>: try the task once</p>

<p><strong>few shot</strong>: try the task a few times</p>

<ol>
  <li>“forward” transfer: train on one task, transfer to a new task
    <ul>
      <li>just try it and hope for the best</li>
      <li>fine-tune on the new task</li>
      <li>randomize the source domain</li>
    </ul>
  </li>
  <li>Multi-task transfer: train on many tasks, transfer to a new task
    <ul>
      <li>generate highly randomized source domains</li>
      <li>model-based reinforcement learning</li>
      <li>model distillation</li>
      <li>contextual policies</li>
      <li>modular policy networks</li>
    </ul>
  </li>
  <li>Multi-task meta-learning: learn to learn from many tasks
    <ul>
      <li>RNN-based meta-learning</li>
      <li>gradient-based meta-learning</li>
    </ul>
  </li>
</ol>

<p>This lecture is a fairly high-level introduction to transfer and multi-task learning. It lays out possible directions and gives many papers for further study; see the lecture slides for details on the recommended papers.</p>

<h2 id="14-distributed-rl">14. Distributed RL</h2>

<p>2013/2015: DQN: replay buffer</p>

<p>2015: GORILA</p>

<p>2016: A3C: one learner, multiple actors; each actor computes gradients and sends them to the learner</p>

<p>2018: IMPALA: several actors and learners; actors only act and generate data for the learners, and importance sampling (V-trace) corrects for policy lag</p>

<p>2018: Ape-X/R2D2: reintroduces the replay buffer</p>

<p>2019: R2D3</p>

<p>RLlib: Abstractions for Distributed Reinforcement Learning (ICML’18)</p>

<h2 id="15-exploration">15. Exploration</h2>

<h3 id="exploration-in-bandit">Exploration in bandit</h3>

<p>Regret</p>

\[Reg(T)=TE[r(a^*)]-\sum_{t=1}^Tr(a_t)\]

<h4 id="optimistic-exploration">Optimistic exploration</h4>

<p>optimistic estimate: $a=\arg\max \hat{\mu}_a + C \sigma_a$</p>

<p>Intuition: try each arm until you are sure it’s not great</p>

<p>example (UCB): $a=\arg \max_a \hat{\mu}_a+\sqrt{\frac{2\ln T}{N(a)}}$, which achieves $Reg(T)=O(\log T)$</p>
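<p>A minimal sketch of this rule (Gaussian arms and the constants are illustrative assumptions):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def ucb_bandit(true_means, T, seed=0):
    """UCB on a Gaussian bandit; empirical regret grows roughly
    like log T."""
    rng = np.random.default_rng(seed)
    n = len(true_means)
    counts, sums, total = np.zeros(n), np.zeros(n), 0.0
    for t in range(1, T + 1):
        if t &lt;= n:
            a = t - 1                        # pull each arm once first
        else:
            a = int(np.argmax(sums / counts
                              + np.sqrt(2 * np.log(t) / counts)))
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1; sums[a] += r; total += r
    return T * max(true_means) - total       # Reg(T)

print(ucb_bandit([0.1, 0.5, 0.9], T=10_000))
</code></pre></div></div>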

<h4 id="probability-matching-posterior-sampling">Probability matching /posterior sampling</h4>

<p>assume $r(a_i)\sim p_{\theta_i}(r_i)$</p>

<p>this defines a POMDP with $s=[\theta_1,…,\theta_n]$</p>

<p>belief state is $\hat{p}(\theta_1,…,\theta_n)$</p>

<p>idea: sample $\theta_1,…,\theta_n \sim\hat{p}(\theta_1,..,\theta_n)$</p>

<p>pretend the model $\theta_1,..,\theta_n$ is correct</p>

<p>take the optimal action</p>

<p>update the model and repeat the process</p>

<h4 id="information-gain">Information gain</h4>

<p>let $\mathcal{H}(\hat{p}(z))$ be the current entropy of our $z$ estimate</p>

<p>let $\mathcal{H}(\hat{p}(z) \mid y)$ be the entropy of our $z$ estimate after observation $y$</p>

<p>Information gain is</p>

\[IG(z,y)=E_y[\mathcal{H}(\hat{p}(z))-\mathcal{H}(\hat{p}(z) \mid y)]\\
IG(z,y \mid a)=E_y[\mathcal{H}(\hat{p}(z))-\mathcal{H}(\hat{p}(z) \mid y) \mid a]\]

<p>For bandit</p>

<p>$y= r_a, z=\theta_a$</p>

<p>$g(a)=IG(\theta_a,r_a \mid a)$</p>

<p>$\Delta(a)=E[r(a^*)-r(a)]$</p>

<p>choose $a$ according to $\arg\min_a \frac{\Delta(a)^2}{g(a)}$</p>

<h3 id="exploration-in-drl">Exploration in DRL</h3>

<h4 id="optimistic-exploration-in-rl">Optimistic exploration in RL</h4>

<p>UCB: $a=\arg \max \hat{\mu}_a+\sqrt{\frac{2\ln T}{N(a)}}$</p>

<p>In MDPs, count-based exploration: use $N(s,a)$ or $N(s)$ to add <em>exploration bonus</em></p>

<p>use $r^+(s,a)=r(s,a)+\mathcal{B}(N(s))$</p>

<p>use $r^+(s,a)$ instead of $r(s,a)$ with any model-free algorithm</p>

<p>but with raw counts we may never see exactly the same state twice, so we need a representation of the <strong>similarity</strong> of states.</p>

<p>idea: fit a density model $p_\theta(s)$ (or $p_\theta(s,a)$)</p>

<p>$p_\theta(s)$ might be high even for a new $s$ if $s$ is similar to previously seen states</p>

\[P(s)=\frac{N(s)}{n}\\
P'(s)=\frac{N(s)+1}{n+1}\]

<h5 id="exploring-with-pseudo-counts">Exploring with pseudo-counts</h5>

<p>fit model $p_\theta(s)$ to all states $\mathcal{D}$ seen so far</p>

<p>take a step $i$ and observe $s_i$</p>

<p>fit new model $p_{\theta’}(s)$ to $\mathcal{D}\cup s_i$</p>

<p>use $p_\theta(s_i)$ and $p_{\theta’}(s_i)$ to estimate $\hat{N}(s)$</p>

<p>set $r^+(s,a)=r(s,a)+\mathcal{B}(\hat{N}(s))$, and repeat!</p>

<p>How to get $\hat{N}(s)$? Use the two count equations above and solve:</p>

\[\hat{N}(s_i)=\hat{n}p_\theta(s_i)\\
\hat{n}=\frac{1-p_{\theta'}(s_i)}{p_{\theta'}(s_i)-p_\theta(s_i)}\]
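<p>Solving those two equations is a one-liner; a quick sketch with a sanity check against known counts:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def pseudo_count(p, p_new):
    """Solve p = N/n and p' = (N+1)/(n+1) for N:
    N = p (1 - p') / (p' - p)."""
    return p * (1.0 - p_new) / (p_new - p)

# Sanity check against true counts N=3, n=10:
# p = 3/10 and p' = 4/11 should recover N_hat = 3.0
print(pseudo_count(0.3, 4 / 11))
</code></pre></div></div>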

<h5 id="what-kind-of-bonus-to-use-many-chooses">What kind of bonus to use? many chooses</h5>

<p>UCB: $\mathcal{B}(N(s))=\sqrt{\frac{2\ln T}{N(s)}}$</p>

<p>MBIE-EB: $\mathcal{B}(N(s))=\sqrt{\frac{1}{N(s)}}$</p>

<p>BEB: $\mathcal{B}(N(s))=\frac{1}{N(s)}$</p>

<h5 id="what-kind-of-model-p_thetas-to-use">What kind of model ($p_\theta(s)$) to use?</h5>

<ul>
  <li>
    <p>Bellemare et al.: “CTS” model: condition each pixel on its top-left neighborhood</p>
  </li>
  <li>
    <p>Counting with hashes</p>
  </li>
</ul>

<p>idea: compress $s$ into a $k$-bit code via $\phi (s)$, then count $N(\phi(s))$</p>

<p>shorter codes = more hash collisions</p>

<p>Can use VAE compression to get hash</p>

<ul>
  <li>implicit density modeling with exemplar model</li>
</ul>

<p>explicitly compare to new state to past states</p>

<p>Intuition: the state is <strong>novel</strong> if it is <strong>easy</strong> to <strong>distinguish</strong> from all previously seen states by a classifier</p>

<p>for each observed state $s$, fit a classifier to distinguish that state from all past states $\mathcal{D}$, and use the classifier error to obtain a density</p>

\[p_\theta(s)=\frac{1-D_s(s)}{D_s(s)}\]

<p>In practice, just train one amortized model that takes the exemplar as input</p>

<p>for details see Fu et al., “EX2: Exploration with Exemplar Models”</p>

<ul>
  <li>Heuristic estimation of counts via errors</li>
</ul>

<p>idea: we do not need the densities, just something that tells us whether the state is novel!</p>

<p>let’s say we have some target function $f^*(s,a)$; it can be any function, just a fixed mapping of $(s,a)$.</p>

<p>given our buffer $\mathcal{D}=\{(s_i,a_i)\}$, fit $\hat{f}_\theta(s,a)$</p>

<p>use $\xi(s,a)= \mid \mid \hat{f}_\theta(s,a)-f^*(s,a) \mid \mid ^2$ as bonus</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>What should we use for $f^*(s,a)$?

One common choice: set $f^*(s,a)=s'$

Even simpler: $f^*(s,a)=f_\phi(s,a)$, where $\phi$ is a *random* parameter vector (random network distillation)
</code></pre></div></div>
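<p>A minimal random network distillation sketch (assuming PyTorch; all layer sizes are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

obs_dim, feat_dim = 8, 32                    # illustrative sizes
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                       nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)                  # f*: frozen random network
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                          nn.Linear(64, feat_dim))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def bonus(s):
    """xi(s) = ||f_hat(s) - f*(s)||^2: large on novel states."""
    return (predictor(s) - target(s)).pow(2).sum(-1)

def train_step(batch_states):
    """Fit f_hat to f* on visited states, shrinking their bonus."""
    loss = bonus(batch_states).mean()
    opt.zero_grad(); loss.backward(); opt.step()
</code></pre></div></div>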

<h4 id="posterior-sampling-in-deep-rl">Posterior sampling in deep RL</h4>

<ol>
  <li>sample a Q-function $Q$ from $p(Q)$</li>
  <li>act according to $Q$ for one episode</li>
  <li>update $p(Q)$</li>
</ol>

<p>Bootstrap</p>

<ol>
  <li>given a dataset $\mathcal{D}$, re-sample with replacement N times to get $\mathcal{D}_1,…,\mathcal{D}_N$</li>
  <li>train each model $f_{\theta_i}$ on $\mathcal{D}_i$</li>
  <li>to sample from $p(\theta)$, sample $i \in[1,…,N]$ and use $f_{\theta_i}$</li>
</ol>

<p>but training $N$ big neural nets is expensive, so we use a single shared base network with a separate head for each model. In practice we often skip the re-sampling, since different random initializations already give enough diversity.</p>

<h4 id="reasoning-about-information-gain">Reasoning about information gain</h4>

<p>approximations:</p>

<p>prediction gain: $\log p_{\theta’}(s)-\log p_\theta(s)$</p>

<p>intuition: if density changed a lot, the state was novel</p>

<p>variational inference:</p>

<p>IG can be equivalently written as $D_{KL}(p(z \mid y) \mid \mid p(z))$</p>

<p>learn about <em>transitions</em> $p_\theta(s_{t+1} \mid s_t,a_t): z=\theta$</p>

<p>$y = (s_t,a_t,s_{t+1})$, giving $D_{KL}(p(\theta \mid h,s_t,a_t,s_{t+1}) \mid \mid p(\theta \mid h))$, where $h$ is the history of all prior transitions</p>

<p>intuition: a transition is more informative if it causes belief over $\theta$ to change</p>

<p>idea: use variational inference to estimate $q(\theta \mid \phi)\approx p(\theta \mid h)$</p>

<p>given new transition $(s,a,s’)$, update $\phi$ to get $\phi’$</p>

<p>and use $D_{KL}(q(\theta \mid \phi’) \mid \mid q(\theta \mid \phi))$ as the approximate bonus</p>

<p>for more details, see Houthooft et al., “VIME”</p>

<h3 id="imitation-learning-vs-reinforcement-learning">Imitation learning vs. Reinforcement learning</h3>

<p>Imitation learning</p>

<ul>
  <li>requires demonstrations</li>
  <li>must address distributional shift</li>
  <li>Simple, stable supervised learning+</li>
  <li>Only as good as the demo</li>
</ul>

<p>Reinforcement learning</p>

<ul>
  <li>Require reward function</li>
  <li>Must address exploration</li>
  <li>Potential non-convergent RL</li>
  <li>Can become arbitrarily good+</li>
</ul>

<p>Can we get the best of both?</p>

<p>we have both demonstration and rewards</p>

<p>IRL already addresses distributional shift via RL, but it doesn’t use a known reward function!</p>

<h4 id="simplest-combination-pre-train-by-imitation--fine-tune-by-rl">Simplest combination: pre-train by imitation &amp; fine-tune by RL</h4>

<ol>
  <li>collect demonstration data $(s_i,a_i)$</li>
  <li>initialize $\pi_\theta$ as $\max_\theta \sum_i \log \pi_\theta(a_i \mid s_i)$</li>
  <li>run $\pi_\theta$ to collect experience</li>
  <li>improve $\pi_\theta$ with any RL algorithm and repeat 3 and 4</li>
</ol>

<p>but in step 3 the policy can be very bad due to distribution shift, so the first batch of bad data can destroy the initialization.</p>

<h4 id="off-policy-rl">Off-policy RL</h4>

<p>we can address this with off-policy RL. Off-policy RL can use any data, so we can keep the demonstrations in the buffer.</p>

<ul>
  <li>off-policy policy gradient (with importance sampling)</li>
  <li>off-policy Q-learning</li>
</ul>

<h5 id="policy-gradient-with-demonstrations">Policy gradient with demonstrations</h5>

\[\Delta_\theta J(\theta)=\sum_{\tau \in \mathcal{D}}\left[\sum_{t=1}^T\Delta_\theta\log\pi_\theta(a_t \mid s_t)\left(\prod_{t'=1}^t\frac{\pi_\theta(a_{t'} \mid s_{t'})}{q(a_{t'} \mid s_{t'})}\right)\left(\sum_{t'=t}^Tr(s_{t'},a_{t'})\right)\right]\]

<p>where the $\mathcal{D}$ includes both demo data and policy data</p>

<p>Problem 1: which distribution did the demonstrations come from?</p>

<ul>
  <li>
    <p>option 1: use supervised behavior cloning to approximate $\pi_{demo}$</p>
  </li>
  <li>
<p>option 2: assume a Dirac delta: $\pi_{demo}(\tau)=\frac{1}{N}\delta (\tau \in \mathcal{D})$. This works best with self-normalized importance sampling (see the sketch after this list):
$E_{p(x)}[f(x)]\approx\frac{1}{\sum_j\frac{p(x_j)}{q(x_j)}}\sum_i\frac{p(x_i)}{q(x_i)}f(x_i)$</p>
  </li>
</ul>
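<p>A tiny sketch of the self-normalized estimator from option 2 (the arguments are hypothetical callables returning log-densities and values):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def snis(f, xs, log_p, log_q):
    """Self-normalized importance sampling:
    E_p[f] ~ sum_i w_i f(x_i) / sum_j w_j, with w = p/q computed
    stably in log space (the shared max cancels in the ratio)."""
    log_w = log_p(xs) - log_q(xs)
    w = np.exp(log_w - log_w.max())
    return np.sum(w * f(xs)) / np.sum(w)
</code></pre></div></div>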

<p>Problem 2: what to do if we have multiple distributions</p>

<ul>
  <li><em>fusion</em> distribution: $q(x)=\frac{1}{M}\sum_iq_i(x)$</li>
</ul>

<h5 id="q-learning-with-demonstrations">Q-learning with demonstrations</h5>

<p>just drop the demonstration data into the replay buffer</p>

<p>What’s the problem?</p>

<p>importance sampling: recipe for getting stuck</p>

<p>Q-learning: good data alone is not enough; with only good data it is hard to fit accurate Q-values, because nothing shows what bad actions look like</p>

<p>more problems with this highly off-policy Q-learning:</p>

<p>this is highly off-policy, so we are no longer using the current policy to collect data; if the Q function makes a mistake, it is hard to fix it, and we can end up training on garbage.</p>

<p>to address this problem,</p>

\[Q(s,a) \gets r(s,a)+E_{a'\sim\pi_{new}}[Q(s',a')]\]

<p>How to pick $\pi_{new}(a \mid s)$?</p>

<p>option 1: stay close to $\beta$</p>

<ul>
  <li>
    <p>e.g. $D_{KL}(\pi_{new}(. \mid s) \mid \mid \beta(. \mid s))\le \epsilon$</p>
  </li>
  <li>
    <p>issue 1: we don’t know $\beta$</p>
  </li>
  <li>
    <p>issue 2: this is way too conservative</p>
  </li>
</ul>

<p>option 2: constrain to the support of $\beta$; see these two papers:</p>

<ul>
  <li>
<p>Kumar et al., Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction</p>
  </li>
  <li>
<p>Fujimoto et al., Off-Policy Deep Reinforcement Learning without Exploration</p>
  </li>
</ul>

<h4 id="imitation-as-an-auxiliary-loss-function">Imitation as an auxiliary loss function</h4>

<p>imitation objective: $\sum_{(s,a) \in \mathcal{D}_{demo}}\log \pi_\theta(a \mid s)$</p>

<p>RL objective: $E_{\pi_\theta}[r(s,a)]$</p>

<p>hybrid objective: $E_{\pi_\theta} [r(s,a)]+\lambda \sum_{(s,a) \in \mathcal{D}_{demo}} \log \pi_\theta(a \mid s)$</p>

<p>Hybrid Q-learning:</p>

\[J(Q)=J_{DQ}(Q)+\lambda_1J_n(Q)+\lambda_2J_E(Q)+\lambda_3J_{L2}(Q)\\
J_E(Q)=\max_{a\in A}[Q(s,a)+l(a_E,a)]-Q(s,a_E)\]

<p>where $J_{DQ}$ is the Q-learning loss, $J_n(Q)$ is the n-step Q-learning loss, $J_E$ is a large-margin imitation loss ($l(a_E,a)$ is zero for the expert action and positive otherwise), and $J_{L2}$ is a regularization loss</p>
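<p>A sketch of the large-margin term $J_E$ (assuming PyTorch and a discrete action space; the margin value is an illustrative constant):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def margin_loss(q_values, a_expert, margin=0.8):
    """Large-margin imitation loss J_E:
    max_a [Q(s,a) + l(a_E, a)] - Q(s, a_E),
    with l(a_E, a) = margin for a != a_E and 0 for the expert action."""
    l = torch.full_like(q_values, margin)
    l.scatter_(1, a_expert.unsqueeze(1), 0.0)   # zero margin at a_E
    best = (q_values + l).max(dim=1).values
    expert_q = q_values.gather(1, a_expert.unsqueeze(1)).squeeze(1)
    return (best - expert_q).mean()

q = torch.randn(4, 3, requires_grad=True)       # Q(s, .) for a batch
print(margin_loss(q, torch.tensor([0, 2, 1, 0])))
</code></pre></div></div>

<p>The loss is zero only when the expert action beats every other action by the margin, which pushes the Q-values of demonstrated actions above the rest.</p>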

<p>what’s the problem?</p>

<ul>
  <li>need to tune the weight</li>
  <li>The design of the objective, esp. for imitation, takes a lot of care</li>
  <li>Algorithm becomes problem-dependent</li>
</ul>

<h2 id="16-meta-reinforcement-learning">16 Meta Reinforcement learning</h2>

<p>This part introduced many meta-learning methods at a high level; find and read the papers later.</p>

<h2 id="17-information-theory-challenges-open-problems">17 Information theory, challenges, open problems</h2>

<h3 id="information-theory">Information theory</h3>

<p>entropy</p>

\[\mathcal{H}(p(x))=-E_{x\sim p(x)}[\log p(x)]\]

<p>mutual information</p>

\[\begin{align}
\mathcal{I}(x,y)&amp;=D_{KL}(p(x,y) \mid \mid p(x)p(y))\\
&amp;=E_{(x,y)\sim p(x,y)}\left[\log \frac{p(x,y)}{p(x)p(y)}\right]\\
&amp;=\mathcal{H}(p(y))-\mathcal{H}(p(y \mid x))
\end{align}\]

<p>define $\pi(s)$ as the state <em>marginal</em> distribution of policy $\pi$</p>

<p>$\mathcal{H}(\pi(s))$ state <em>marginal</em> entropy of policy $\pi$</p>

<p>empowerment: $\mathcal{I}(s_{t+1},a_t)=\mathcal{H}(s_{t+1})-\mathcal{H}(s_{t+1} \mid a_t)$</p>

<h3 id="learning-without-a-reward-function-by-reaching-goals">Learning without a reward function by reaching goals</h3>

<p>one way is to give goal states, e.g. using a VAE to generate goals</p>

<ol>
  <li>Propose goal: $z_g \sim p(z)$, $x_g \sim p_\theta(x_g \mid z_g)$</li>
  <li>Attempt to reach goal using $\pi(a \mid x,x_g)$, reach $\bar{x}$</li>
  <li>Use data to update $\pi$</li>
  <li>Use data to update $p_\theta(x_g \mid z_g)$, $q_\phi(z_g \mid x_g)$</li>
</ol>

<p>but how do we get diverse goals?</p>

<p>in step 4</p>

<p>the standard MLE: $\theta, \phi \gets \arg \max _{\theta, \phi}E[\log p(\bar{x})]$</p>

<p>the weighted MLE: $\theta, \phi \gets \arg \max _{\theta, \phi}E[w(\bar{x})\log p(\bar{x})]$</p>

<p>where $w(\bar{x})=p_\theta(\bar{x})^\alpha$</p>

<p>key result: for any $\alpha \in [-1,0)$, entropy $\mathcal{H}(p_\theta(x))$ increases!</p>

<p>This is actually doing $\max \mathcal{H}(p(G))$</p>

<p>and what does RL do?</p>

<p>$\pi(a \mid S,G)$ is trained to reach goal $G$; as $\pi$ gets better, the final state $S$ gets closer to $G$,</p>

<p>that means $p(G \mid S)$ becomes more deterministic!</p>

<p>so we are actually doing this:</p>

\[\max \mathcal{H}(p(G)) - \mathcal{H}(p(G \mid S)) = \max \mathcal{I}(S;G)\]

<h3 id="learning-diverse-skills">Learning diverse skills</h3>

<p>$\pi(a \mid s,z)$, where $z$ is the task (skill) index.</p>

<p>Intuition: different skill should visit different state-space regions</p>

<p>Diversity-promoting reward function</p>

\[\pi(a \mid s,z) = \arg\max_\pi \sum_z \mathbb{E}_{s \sim \pi(s \mid z)}[r(s,z)]\]

<p>where $r(s,z)= \log p(z \mid s)$, rewarding states that are unlikely for other $z’ \ne z$.</p>

<p>Here, once $z$ is sampled, learning simply maximizes the discriminability of the states this policy visits by tuning $\pi(s \mid z)$; it turns out that different values of $z$ acquire different skills. Ref. “Diversity Is All You Need” (ICLR 2019)</p>

<p>In fact this is also goal reaching.</p>

\[\mathcal{I}(z,s)=\mathcal{H}(z)-\mathcal{H}(z \mid s)\]

<p>we maximize $\mathcal{H}(z)$ by sampling $z$ uniformly from $p(z)$, and minimize $\mathcal{H}(z \mid s)$ with the algorithm above.</p>
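<p>A sketch of the diversity-promoting reward (assuming PyTorch; the discriminator architecture and sizes are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

n_skills, s_dim = 8, 4                       # illustrative sizes
disc = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                     nn.Linear(64, n_skills))    # models p(z | s)

def diversity_reward(s, z):
    """r(s, z) = log p(z | s): high when the skill z is easy to
    infer from the visited state s."""
    log_pz = torch.log_softmax(disc(s), dim=-1)
    return log_pz.gather(-1, z.unsqueeze(-1)).squeeze(-1)

r = diversity_reward(torch.randn(5, s_dim),
                     torch.randint(0, n_skills, (5,)))

# The discriminator itself is trained as an ordinary classifier on
# (s, z) pairs, which drives H(z | s) down, while sampling z uniformly
# keeps H(z) at its maximum.
</code></pre></div></div>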

<h3 id="unsupervised-reinforcement-learning-for-meta-learning">Unsupervised reinforcement learning for meta-learning</h3>

<p>use unsupervised reinforcement learning to propose tasks for meta-learning.</p>

<p>First, use unsupervised RL to generate tasks; then run meta-learning on those tasks with the reward functions obtained in the previous step.</p>

<h3 id="challenges-in-deep-reinforcement-learning">Challenges in deep reinforcement learning</h3>

<p>Core algorithm:</p>

<ul>
  <li>Stability</li>
  <li>Efficiency</li>
  <li>Generalization</li>
</ul>

<p>Assumptions:</p>

<ul>
  <li>Problem formulation</li>
  <li>Supervision</li>
</ul>

<h4 id="stability-and-hyper-parameter-tuning">Stability and hyper-parameter tuning</h4>

<ul>
  <li>Devising stable RL algorithms is very hard</li>
  <li>Q-learning/value function estimation
    <ul>
      <li>No guarantee of convergence</li>
      <li>Lots of parameters for stability: target network delay, replay buffer size, clipping, sensitivity to learning rates, etc.</li>
    </ul>
  </li>
  <li>Policy gradient/likelihood ratio/REINFORCE
    <ul>
      <li>Very high variance gradient estimator</li>
      <li>Lots of samples, complex baselines, etc.</li>
      <li>Parameters: batch size, learning rate, design of baseline</li>
    </ul>
  </li>
  <li>Model-based RL algorithms
    <ul>
      <li>Model class and fitting method</li>
      <li>Optimizing the policy w.r.t. the model is non-trivial due to back-propagation through time</li>
      <li>More subtle issue: the policy tends to <em>exploit</em> the model</li>
    </ul>
  </li>
</ul>

<p>The challenge with hyper-parameters is severe</p>

<ul>
  <li>Algorithms with favorable improvement and convergence properties
    <ul>
      <li>TRPO</li>
      <li>Safe reinforcement learning, high-confidence policy improvement [Thomas ‘15]</li>
    </ul>
  </li>
  <li>Algorithms that adaptively adjust parameters
    <ul>
      <li>Q-Prop</li>
    </ul>
  </li>
</ul>

<p><strong>Not great for beating benchmarks</strong>, but absolutely essential to make RL a viable tool for real-world problems.</p>

<h4 id="sample-complexity">Sample complexity</h4>

<ul>
  <li>real-world learning becomes difficult or impractical</li>
  <li>Precludes the use of expensive, high-fidelity simulators</li>
  <li>Limits applicability to real-world problems</li>
</ul>

<p>what can we do?</p>

<ul>
  <li>Better model-based RL algorithms</li>
  <li>Design faster algorithms
    <ul>
      <li>Addressing Function Approximation Error in Actor-Critic Algorithms [Fujimoto et al. ‘18]</li>
      <li>Soft Actor-Critic</li>
    </ul>
  </li>
  <li>Reuse prior knowledge to accelerate reinforcement learning
    <ul>
      <li>RL2 [Duan et al. ‘17]</li>
      <li>Learning to reinforcement learn [Wang et al. ‘17]</li>
      <li>MAML [Finn et al. ‘17]</li>
    </ul>
  </li>
</ul>

<h4 id="scaling--generalization">Scaling &amp; Generalization</h4>

<ul>
  <li>Small-scale</li>
  <li>Emphasizes mastery</li>
  <li>Evaluated on performance</li>
  <li>Where is the generalization?</li>
</ul>

<p>Reinforcement learning needs to re-collect data during training</p>

<h3 id="assumption-problems">Assumption problems</h3>

<p>Single task or multi-task</p>

<ul>
  <li>Train on multiple tasks, then try to generalize or fine-tune
    <ul>
      <li>policy distillation</li>
      <li>Actor-Mimic</li>
      <li>MAML</li>
    </ul>
  </li>
  <li>Unsupervised or weakly supervised learning of diverse behaviors
    <ul>
      <li>stochastic neural networks</li>
      <li>reinforcement learning with deep energy-based policies</li>
    </ul>
  </li>
</ul>

<p>Where does the supervision come from?</p>

<ul>
  <li>find some different tasks</li>
  <li>learn objectives/rewards from demonstration (IRL)</li>
  <li>Generate objectives automatically</li>
</ul>

<p>What is the role of the reward function?</p>

<p>Unsupervised reinforcement learning</p>

<ol>
  <li>
    <p>Interaction with the world without a reward function</p>
  </li>
  <li>
    <p>Learning something about the world</p>
  </li>
  <li>
    <p>Use what you learned to quickly solve new tasks</p>
  </li>
</ol>

<p>Other sources of supervision</p>

<ul>
  <li>Demonstrations</li>
  <li>Language</li>
  <li>Human preferences</li>
</ul>

<p>Where does the supervision signal come from?</p>

<ul>
  <li>Yann LeCun’s cake</li>
  <li>Unsupervised or self-supervised learning</li>
  <li>Model learning (predict the future)</li>
  <li>Generative modeling of the world</li>
  <li>Lots to do even before you accomplish your goal</li>
  <li>Imitation &amp; understanding other agents</li>
  <li>The giant value backup</li>
</ul>

<h2 id="18-rethinking-reinforcement-learning-from-the-perspective-of-generalization-chelsea-finn">18 Rethinking Reinforcement Learning from the Perspective of Generalization (Chelsea Finn)</h2>

<p>Meta-learning:</p>

<p>Learning to Learn with Gradients. Finn, PhD thesis, 2018</p>

<p>Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. ICML 2019</p>

<p>These algorithms only adapt to similar tasks; they cannot adapt to <strong>entirely new tasks</strong>!</p>

<p>If we want that, we need to make sure the meta-training task distribution matches the meta-test task distribution.</p>

<ul>
  <li>
    <p>Algorithms: learn something more general than a policy, e.g. from demonstrations and trials, or other supervision; see Meta-World</p>
  </li>
  <li>
    <p>Task representation: how do we specify the task, via language or via goals?</p>
  </li>
  <li>
    <p>Data: RoboNet</p>
  </li>
</ul>

]]></content><author><name>Dongda Li</name><email>dongdongbhbh@gmail.com</email></author><category term="note" /><summary type="html"><![CDATA[Deep Reinforcement learning notes UBC]]></summary></entry></feed>