Trust Region Policy Optimization
Abstract
Proposes an iterative procedure for policy optimization with guaranteed monotonic improvement, called Trust Region Policy Optimization (TRPO)
Experiments demonstrate its robust performance
Introduction
Policy optimization algorithms
- Policy iteration method
- alternate between estimating the value function under the current policy and improving the policy
- Policy gradient method
- use an estimator of the gradient of the expected return (total reward) obtained from sample trajectories
- Derivative-free optimization method
- treat the return as a black box function to be optimized in terms of the policy parameters
- e.g., CMA, cross-entropy method (CEM)
- Gradient-free algorithms often outperform gradient-based ones on simple tasks, but on complex tasks the opposite holds
Contributions
- Prove that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes
- Derive a practical algorithm called trust region policy optimization (TRPO)
- Single-path method: applicable in the model-free setting
- Vine method: requires the ability to restore the system to particular states (typically only possible in simulation)
Preliminaries
Notation:
- MDP: \(\left(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma\right)\)
- Transition probability distribution: \(P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}\)
- Reward function: \(r: \mathcal{S} \rightarrow \mathbb{R}\)
- Discount factor: \(\gamma \in(0,1)\)
- Distribution of the initial state \(s_0\): \(\rho_0: \mathcal{S} \rightarrow \mathbb{R}\)
- Stochastic policy: \(\pi: \mathcal{S} \times \mathcal{A} \rightarrow[0,1]\)
- Expected discounted reward: \(\eta(\pi)=\mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r\left(s_t\right)\right]\) (a short numeric sketch follows this list)
- Where \(s_0 \sim \rho_0\left(s_0\right), a_t \sim \pi\left(a_t \mid s_t\right), s_{t+1} \sim P\left(s_{t+1} \mid s_t, a_t\right)\)
- State-action value function \(Q_\pi\), value function \(V_\pi\)
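A tiny sketch of the definition above (toy rewards and \(\gamma\), not from the paper): compute \(\sum_t \gamma^t r(s_t)\) for one trajectory; averaging this over rollouts with \(s_0 \sim \rho_0\), \(a_t \sim \pi\) gives a Monte Carlo estimate of \(\eta(\pi)\).

```python
# Discounted return of a single sampled trajectory (toy numbers).
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**t * r_t for t, r_t in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```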
Main Idea:
It is difficult to optimize \(\eta(\pi)\) directly, so we look for a surrogate function that is:
- a lower bound on \(\eta(\pi)\)
- a good approximation of \(\eta\) at the current policy
- easy to optimize

Objective function
Lemma 1
- The expected return of another policy \(\tilde{\pi}\), expressed in terms of the current policy's return and advantage
\[ \eta(\tilde{\pi})=\eta(\pi)+\mathbb{E}_{s_0, a_0, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi\left(s_t, a_t\right)\right] \]
where the notation \(\mathbb{E}_{s_0, a_0, \cdots \sim \tilde{\pi}}[\ldots]\) indicates that actions are sampled \(a_t \sim \tilde{\pi}\left(\cdot \mid s_t\right)\)
Discounted visitation frequencies \(\rho_\pi\): \(\rho_\pi(s)=P\left(s_0=s\right)+\gamma P\left(s_1=s\right)+\gamma^2 P\left(s_2=s\right)+\ldots\)
Rewrite lemma 1: \[ \begin{aligned} \eta(\tilde{\pi}) & =\eta(\pi)+\sum_{t=0}^{\infty} \sum_s P\left(s_t=s \mid \tilde{\pi}\right) \sum_a \tilde{\pi}(a \mid s) \gamma^t A_\pi(s, a) \\ & =\eta(\pi)+\sum_s \sum_{t=0}^{\infty} \gamma^t P\left(s_t=s \mid \tilde{\pi}\right) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \\ & =\eta(\pi)+\sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \end{aligned} \]
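As a sanity check, here is a minimal numerical sketch of this identity on a small random tabular MDP (the sizes, the state-only reward \(r(s)\), and the helper names are illustrative assumptions, not the paper's code):

```python
# A minimal numerical check of Lemma 1 on a small random tabular MDP.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition probs
r = rng.standard_normal(nS)                     # reward r(s)
rho0 = np.ones(nS) / nS                         # initial state distribution

def value_fn(pi):
    """Solve V = r + gamma * P_pi V exactly."""
    P_pi = np.einsum("sa,sap->sp", pi, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r)

def advantage(pi):
    V = value_fn(pi)
    Q = r[:, None] + gamma * np.einsum("sap,p->sa", P, V)
    return Q - V[:, None]

def eta(pi):
    return rho0 @ value_fn(pi)                  # expected discounted return

def visitation(pi):
    """Unnormalized discounted visitation frequencies rho_pi."""
    P_pi = np.einsum("sa,sap->sp", pi, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)

pi = rng.dirichlet(np.ones(nA), size=nS)        # current policy pi[s, a]
pi_new = rng.dirichlet(np.ones(nA), size=nS)    # another policy pi~
lhs = eta(pi_new)
rhs = eta(pi) + visitation(pi_new) @ np.einsum("sa,sa->s", pi_new, advantage(pi))
print(lhs, rhs)                                 # agree up to numerical error
```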
Proof of Lemma 1
\[ \begin{aligned} & \underset{\tau \sim \tilde{\pi}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi\left(s_t, a_t\right)\right] \\ & =\underset{\tau \sim \tilde{\pi}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^t\left(R\left(s_t, a_t, s_{t+1}\right)+\gamma V^\pi\left(s_{t+1}\right)-V^\pi\left(s_t\right)\right)\right] \\ & =\eta(\tilde{\pi})+\underset{\tau \sim \tilde{\pi}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^{t+1} V^\pi\left(s_{t+1}\right)-\sum_{t=0}^{\infty} \gamma^t V^\pi\left(s_t\right)\right] \\ & =\eta(\tilde{\pi})+\underset{\tau \sim \tilde{\pi}}{\mathrm{E}}\left[\sum_{t=1}^{\infty} \gamma^t V^\pi\left(s_t\right)-\sum_{t=0}^{\infty} \gamma^t V^\pi\left(s_t\right)\right] \\ & =\eta(\tilde{\pi})-\underset{\tau \sim \tilde{\pi}}{\mathrm{E}}\left[V^\pi\left(s_0\right)\right] \\ & =\eta(\tilde{\pi})-\eta(\pi) \end{aligned} \]
How to ensure \(\sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \ge 0\) ?
Function L \[ \mathcal{L}_\pi(\tilde{\pi})=\eta(\pi)+\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \] Lemma 1 \[ \eta(\tilde{\pi})=\eta(\pi)+\sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \] If the difference between \(\pi\) and \(\tilde \pi\) is small, we can substitute \(\rho_{\pi}\) for \(\rho_{\tilde\pi}\), so \(\mathcal{L}_\pi(\tilde\pi)\) is a local approximation of \(\eta(\tilde\pi)\).
As shown in Kakade & Langford (2002), \(L\) matches \(\eta\) to first order: \[ \begin{aligned} L_{\pi_{\theta_0}}\left(\pi_{\theta_0}\right) & =\eta\left(\pi_{\theta_0}\right) \\ \left.\nabla_\theta L_{\pi_{\theta_0}}\left(\pi_\theta\right)\right|_{\theta=\theta_0} & =\left.\nabla_\theta \eta\left(\pi_\theta\right)\right|_{\theta=\theta_0} \end{aligned} \]
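A brief sketch of why these hold (a sketch, not the paper's proof): at \(\theta = \theta_0\) the correction term vanishes because \(\sum_a \pi_{\theta_0}(a \mid s) A_{\pi_{\theta_0}}(s, a)=0\), and since only \(\tilde{\pi}\) depends on \(\theta\) inside \(L\), \[ \left.\nabla_\theta L_{\pi_{\theta_0}}\left(\pi_\theta\right)\right|_{\theta_0}=\sum_s \rho_{\pi_{\theta_0}}(s) \sum_a \left.\nabla_\theta \pi_\theta(a \mid s)\right|_{\theta_0} A_{\pi_{\theta_0}}(s, a)=\left.\nabla_\theta \eta\left(\pi_\theta\right)\right|_{\theta_0}, \] where the last step is the policy gradient theorem (with \(V_{\pi_{\theta_0}}\) as a baseline, which is allowed because \(\sum_a \nabla_\theta \pi_\theta(a \mid s)=0\)).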
How to ensure the difference is not too large?
Kakade & Langford (2002)
Let \(\pi^{\prime}=\arg \max _{\pi^{\prime}} L_{\pi_{\text {old }}}\left(\pi^{\prime}\right)\) \[ \pi_{\text {new }}(a \mid s)=(1-\alpha) \pi_{\mathrm{old}}(a \mid s)+\alpha \pi^{\prime}(a \mid s) \] Lower bound: \[ \begin{aligned} \eta\left(\pi_{\text {new }}\right) & \geq L_{\pi_{\text {old }}}\left(\pi_{\text {new }}\right)-\frac{2 \epsilon \gamma}{(1-\gamma)^2} \alpha^2 \\ & \text { where } \epsilon=\max _s\left|\mathbb{E}_{a \sim \pi^{\prime}(a \mid s)}\left[A_\pi(s, a)\right]\right| \end{aligned} \] This bound only applies to mixture policies generated by the equation above, which is restrictive and unwieldy in practice.
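A toy illustration (random stand-in policies and an arbitrary \(\alpha\)) of how the mixture policy is formed:

```python
# Blend the old tabular policy with the surrogate maximizer pi' (toy values).
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, alpha = 4, 3, 0.2
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_prime = rng.dirichlet(np.ones(n_actions), size=n_states)

pi_new = (1 - alpha) * pi_old + alpha * pi_prime   # mixture policy
print(pi_new.sum(axis=1))                          # each row still sums to 1
```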
Monotonic Improvement Guarantee for General Stochastic Policies
Replace \(\alpha\) with a distance between \(\pi\) and \(\tilde \pi\), and change the constant \(\epsilon\) appropriately
Distance: total variation divergence \[ D_{T V}(p \| q)=\frac{1}{2} \sum_i\left|p_i-q_i\right| \] Define \(D_{\mathrm{TV}}^{\max }(\pi, \tilde{\pi})=\max _s D_{T V}(\pi(\cdot \mid s) \| \tilde{\pi}(\cdot \mid s))\)
Theorem 1:
Let \(\alpha=D_{\mathrm{TV}}^{\max }\left(\pi_{\mathrm{old}}, \pi_{\mathrm{new}}\right)\): \[ \begin{aligned} & \eta\left(\pi_{\text {new }}\right) \geq L_{\pi_{\text {old }}}\left(\pi_{\text {new }}\right)-\frac{4 \epsilon \gamma}{(1-\gamma)^2} \alpha^2 \\ & \text { where } \epsilon=\max _{s, a}\left|A_\pi(s, a)\right| \end{aligned} \] (Detailed proof in the paper's appendix.)
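A toy computation of the quantities in Theorem 1, which also checks the \(D_{TV}^2 \le D_{\mathrm{KL}}\) relation used in the next step (random policies and advantage values are stand-ins, not the paper's experiments):

```python
# alpha = D_TV^max, epsilon = max |A|, penalty = 4*eps*gamma/(1-gamma)^2 * alpha^2.
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 4, 3, 0.9
pi_old = rng.dirichlet(np.ones(nA), size=nS)
pi_new = rng.dirichlet(np.ones(nA), size=nS)
A_old = rng.standard_normal((nS, nA))           # stand-in advantages A_pi(s, a)

tv = 0.5 * np.abs(pi_old - pi_new).sum(axis=1)          # D_TV per state
kl = (pi_old * np.log(pi_old / pi_new)).sum(axis=1)     # D_KL per state
alpha, eps = tv.max(), np.abs(A_old).max()
penalty = 4 * eps * gamma / (1 - gamma) ** 2 * alpha ** 2
print(penalty, np.all(tv ** 2 <= kl))           # second value should be True
```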
Pollard (2000), Ch. 3: \(D_{T V}(p \| q)^2 \leq D_{\mathrm{KL}}(p \| q)\), which gives \[ \begin{aligned} \eta(\tilde{\pi}) & \geq L_\pi(\tilde{\pi})-C D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi}) \\ & \text { where } C=\frac{4 \epsilon \gamma}{(1-\gamma)^2} \end{aligned} \]
Algorithm 1
This algorithm generates a sequence of policies with non-decreasing expected return: \(\eta(\pi_0) \le \eta(\pi_1) \le \eta(\pi_2) \le \ldots\)
Let \(M_i(\pi) = L_{\pi_i}(\pi)-C D_{\mathrm{KL}}^{\max }\left(\pi_i, \pi\right)\), then \[ \begin{align} \eta\left(\pi_{i+1}\right) \geq & M_i\left(\pi_{i+1}\right) \\ \eta\left(\pi_i\right)= &M_i\left(\pi_i\right) \\ \eta\left(\pi_{i+1}\right)-\eta\left(\pi_i\right) \geq& M_i\left(\pi_{i+1}\right)-M_i\left(\pi_i\right) \end{align} \] so maximizing \(M_i\) at each iteration guarantees that \(\eta\) does not decrease. This is a minorization-maximization (MM) algorithm: \(M_i\) is the surrogate that minorizes \(\eta\) at \(\pi_i\).
Optimization of Parameterized Policies
TRPO is an approximation to Algorithm 1 which uses a constraint on the KL divergence rather than a penalty to robustly allow large updates.
Consider parameterized policies \(\pi_\theta(a|s)\) with parameter vector \(\theta\) \[ \begin{align} \eta(\theta)& := \eta(\pi_\theta) \\ L_\theta(\tilde{\theta})&:=L_{\pi_\theta}\left(\pi_{\tilde{\theta}}\right) \\ D_{\mathrm{KL}}(\theta \| \tilde{\theta})&:=D_{\mathrm{KL}}\left(\pi_\theta \| \pi_{\tilde{\theta}}\right) \end{align} \] Penalized problem (from the bound above): \[ \underset{\theta}{\operatorname{maximize}}\left[L_{\theta_{\mathrm{old}}}(\theta)-C D_{\mathrm{KL}}^{\max }\left(\theta_{\mathrm{old}}, \theta\right)\right] \] In practice the theoretical penalty coefficient \(C\) gives very small steps, so impose a trust-region constraint instead: \[ \begin{aligned} & \underset{\theta}{\operatorname{maximize}} L_{\theta_{\text {old }}}(\theta) \\ & \text { subject to } D_{\mathrm{KL}}^{\max }\left(\theta_{\text {old }}, \theta\right) \leq \delta \end{aligned} \] This imposes a KL constraint at every state, which is impractical due to the large number of constraints, so use the average KL divergence: \[ \bar{D}_{\mathrm{KL}}^\rho\left(\theta_1, \theta_2\right):=\mathbb{E}_{s \sim \rho}\left[D_{\mathrm{KL}}\left(\pi_{\theta_1}(\cdot \mid s) \| \pi_{\theta_2}(\cdot \mid s)\right)\right] \] Final problem: \[ \begin{aligned} & \underset{\theta}{\operatorname{maximize}} L_{\theta_{\text {old }}}(\theta) \\ & \text { subject to } \bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\text{old}}}}\left(\theta_{\text {old }}, \theta\right) \leq \delta \end{aligned} \]
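A minimal sketch of the averaged KL term for a discrete softmax policy, estimated over a batch of sampled states (the linear-softmax parameterization, random features, and random weights are illustrative assumptions):

```python
# Estimate E_{s~rho}[ D_KL(pi_theta_old(.|s) || pi_theta(.|s)) ] from a batch.
import numpy as np

rng = np.random.default_rng(3)
n_states, state_dim, n_actions = 32, 5, 3
S = rng.standard_normal((n_states, state_dim))              # sampled states
theta_old = rng.standard_normal((state_dim, n_actions))
theta = theta_old + 0.01 * rng.standard_normal((state_dim, n_actions))

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

p_old, p_new = softmax(S @ theta_old), softmax(S @ theta)
avg_kl = (p_old * np.log(p_old / p_new)).sum(axis=1).mean()
print(avg_kl)       # this is the quantity constrained to be <= delta
```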
Sample-Based Estimation of the Objective and Constraint
Approximate the objective and constraint functions using Monte Carlo simulation \[ \begin{array}{r} \underset{\theta}{\operatorname{maximize}} \sum_s \rho_{\theta_{\text {old }}}(s) \sum_a \pi_\theta(a \mid s) A_{\theta_{\text {old }}}(s, a) \\ \text { subject to } \bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\text {old }}}}\left(\theta_{\text {old }}, \theta\right) \leq \delta \end{array} \] The samples are not drawn from the new policy \(\pi_\theta(a \mid s)\), so use importance sampling with a sampling distribution \(q\): \[ \sum_a \pi_\theta\left(a \mid s_n\right) A_{\theta_{\text {old }}}\left(s_n, a\right)=\mathbb{E}_{a \sim q}\left[\frac{\pi_\theta\left(a \mid s_n\right)}{q\left(a \mid s_n\right)} A_{\theta_{\text {old }}}\left(s_n, a\right)\right] \]
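A quick numerical check of this identity for a single state: sampling actions from \(q\) and reweighting by \(\pi_\theta / q\) recovers the exact sum over actions (toy categorical distributions; in the single-path setting \(q\) would be the old policy):

```python
# Importance-sampling identity at one state s_n (toy distributions).
import numpy as np

rng = np.random.default_rng(4)
nA = 4
pi_theta = rng.dirichlet(np.ones(nA))       # new policy at state s_n
q = rng.dirichlet(np.ones(nA))              # sampling distribution at s_n
A = rng.standard_normal(nA)                 # advantage estimates A(s_n, a)

exact = np.sum(pi_theta * A)
a = rng.choice(nA, size=200_000, p=q)       # actions sampled from q
mc = np.mean(pi_theta[a] / q[a] * A[a])     # importance-weighted estimate
print(exact, mc)                            # agree up to Monte Carlo noise
```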
Replacing the sum over states with an expectation under \(\rho_{\theta_{\text{old}}}\) and the advantage \(A_{\theta_{\text{old}}}\) with \(Q_{\theta_{\text{old}}}\) (which changes the objective only by a constant) gives \[ \begin{aligned} & \underset{\theta}{\operatorname{maximize}} \mathbb{E}_{s \sim \rho_{\theta_{\text {old }}}, a \sim q}\left[\frac{\pi_\theta(a \mid s)}{q(a \mid s)} Q_{\theta_{\text {old }}}(s, a)\right] \\ & \text { subject to } \mathbb{E}_{s \sim \rho_{\theta_{\text {old }}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\text {old }}}(\cdot \mid s) \| \pi_\theta(\cdot \mid s)\right)\right] \leq \delta \end{aligned} \]
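A sketch of how the sample-based estimators might look for a batch of (state, action, \(\hat{Q}\)) tuples with discrete actions (toy arrays stand in for rollout data; a real implementation would differentiate the surrogate with respect to \(\theta\)):

```python
# Sample-based estimates of the surrogate objective and the mean KL constraint.
import numpy as np

rng = np.random.default_rng(5)
N, nA = 64, 3
probs_old = rng.dirichlet(np.ones(nA), size=N)      # pi_theta_old(.|s_n)
probs_new = rng.dirichlet(np.ones(nA), size=N)      # pi_theta(.|s_n)
actions = np.array([rng.choice(nA, p=p) for p in probs_old])
Q_hat = rng.standard_normal(N)                      # Q_theta_old(s_n, a_n)

idx = np.arange(N)
ratio = probs_new[idx, actions] / probs_old[idx, actions]
surrogate = np.mean(ratio * Q_hat)                  # objective estimate
mean_kl = np.mean((probs_old * np.log(probs_old / probs_new)).sum(axis=1))
print(surrogate, mean_kl)   # maximize the first, subject to the second <= delta
```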
Single Path and Vine
Pseudocode
Experiments
Supplementary Material
Proof of Lemma 1
