
Trust Region Policy Optimization

Abstract

  1. Proposes an iterative procedure for optimizing policies with guaranteed monotonic improvement, called Trust Region Policy Optimization (TRPO)

  2. Experiments demonstrate its robust performance

Introduction

Policy optimization algorithms

  • Policy iteration method
    • alternate between estimating the value function under the current policy and improving the policy
  • Policy gradient method
    • use an estimator of the gradient of the expected return (total reward) obtained from sample trajectories
  • Derivative-free optimization method
    • treat the return as a black box function to be optimized in terms of the policy parameters
    • e.g., CMA (covariance matrix adaptation) and CEM (cross-entropy method)
    • gradient-free algorithms can outperform gradient-based algorithms on simpler tasks, but on complex tasks the opposite holds

Contributions

  1. Prove that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes
  2. Derive a practical algorithm called Trust Region Policy Optimization (TRPO)
    • Single-path method: model-free setting
    • Vine method: requires a setting where the system can be restored to particular states (typically simulation)

Preliminaries

Notation:

  • MDP: \(\left(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma\right)\)
  • Transition probability distribution: \(P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}\)
  • Reward function: \(r: \mathcal{S} \rightarrow \mathbb{R}\)
  • Discount factor: \(\gamma \in(0,1)\)
  • Distribution of the initial state \(s_0\): \(\rho_0: \mathcal{S} \rightarrow \mathbb{R}\)
  • Stochastic policy: \(\pi: \mathcal{S} \times \mathcal{A} \rightarrow[0,1]\)
  • Expected discounted reward: \(\eta(\pi)=\mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r\left(s_t\right)\right]\)
    • Where \(s_0 \sim \rho_0\left(s_0\right), a_t \sim \pi\left(a_t \mid s_t\right), s_{t+1} \sim P\left(s_{t+1} \mid s_t, a_t\right)\)
  • State-action value function \(Q_\pi\), value function \(V_\pi\)
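
To make the notation concrete, here is a minimal Monte Carlo estimate of \(\eta(\pi)\) on a toy tabular MDP (the environment, policy, and all numbers below are made up for illustration and are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP (all values are made up for illustration).
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
r = rng.uniform(0.0, 1.0, size=n_states)                          # reward r(s), as in the notation above
rho0 = np.array([1.0, 0.0, 0.0])                                  # initial-state distribution rho_0
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # stochastic policy pi(a|s)

def sample_return(horizon=200):
    """One Monte Carlo sample of the discounted return sum_t gamma^t r(s_t)."""
    s = rng.choice(n_states, p=rho0)
    G, discount = 0.0, 1.0
    for _ in range(horizon):          # truncate the infinite sum; gamma^200 is negligible
        G += discount * r[s]
        a = rng.choice(n_actions, p=pi[s])
        s = rng.choice(n_states, p=P[s, a])
        discount *= gamma
    return G

eta_hat = np.mean([sample_return() for _ in range(2000)])
print("Monte Carlo estimate of eta(pi):", eta_hat)
```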

Main Idea:

It is difficult to optimize \(\eta(\pi)\) directly, so we look for a surrogate function that:

  • is a lower bound on \(\eta(\pi)\)
  • approximates \(\eta\) at the current policy
  • is easy to optimize
[Figure: a surrogate function that lower-bounds \(\eta(\pi)\) and matches it at the current policy.]
Image Source

Objective function

Lemma 1

  • Expresses the expected return of another policy \(\tilde\pi\) in terms of the current policy's return

\[ \eta(\tilde{\pi})=\eta(\pi)+\mathbb{E}_{s_0, a_0, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi\left(s_t, a_t\right)\right] \]

​ where the notation \(\mathbb{E}_{s_0, a_0, \cdots \sim \tilde{\pi}}[\ldots]\) indicates that actions are sampled \(a_t \sim \tilde{\pi}\left(\cdot \mid s_t\right)\)

  • Discounted visitation frequencies \(\rho_\pi\): \(\rho_\pi(s)=P\left(s_0=s\right)+\gamma P\left(s_1=s\right)+\gamma^2 P\left(s_2=s\right)+\ldots\), where \(s_0 \sim \rho_0\) and actions are chosen according to \(\pi\)

  • Rewrite lemma 1: \[ \begin{aligned} \eta(\tilde{\pi}) & =\eta(\pi)+\sum_{t=0}^{\infty} \sum_s P\left(s_t=s \mid \tilde{\pi}\right) \sum_a \tilde{\pi}(a \mid s) \gamma^t A_\pi(s, a) \\ & =\eta(\pi)+\sum_s \sum_{t=0}^{\infty} \gamma^t P\left(s_t=s \mid \tilde{\pi}\right) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \\ & =\eta(\pi)+\sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \end{aligned} \]
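
The identity is easy to verify exactly on a tiny tabular MDP. The sketch below is a toy check of my own (not from the paper; all quantities are made up): it computes \(V_\pi\), \(A_\pi\), and \(\rho_{\tilde\pi}\) in closed form and compares both sides of the rewritten Lemma 1.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9

# Random toy MDP and two policies (all values made up for illustration).
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a] = distribution over next states
r = rng.uniform(size=nS)                           # reward r(s)
rho0 = rng.dirichlet(np.ones(nS))                  # initial-state distribution
pi = rng.dirichlet(np.ones(nA), size=nS)           # current policy pi(a|s)
pi_t = rng.dirichlet(np.ones(nA), size=nS)         # "new" policy pi~(a|s)

def state_transition(policy):
    """P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)."""
    return np.einsum("sa,sax->sx", policy, P)

def value(policy):
    """Exact V_pi from the Bellman equation V = r + gamma * P_pi V."""
    return np.linalg.solve(np.eye(nS) - gamma * state_transition(policy), r)

def eta(policy):
    return rho0 @ value(policy)

# Advantage A_pi(s, a) = Q_pi(s, a) - V_pi(s), with Q_pi(s, a) = r(s) + gamma * E_{s'}[V_pi(s')].
V = value(pi)
Q = r[:, None] + gamma * np.einsum("sax,x->sa", P, V)
A = Q - V[:, None]

# Discounted visitation frequencies rho_pi~(s) = sum_t gamma^t P(s_t = s | pi~),
# obtained by solving rho = rho0 + gamma * P_pi~^T rho.
rho_t = np.linalg.solve(np.eye(nS) - gamma * state_transition(pi_t).T, rho0)

lhs = eta(pi_t)
rhs = eta(pi) + np.sum(rho_t * np.einsum("sa,sa->s", pi_t, A))
print(lhs, rhs)   # the two values agree up to floating-point error
```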

Proof of Lemma 1

\[ \begin{aligned} & \underset{\tau \sim \tilde{\pi}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi\left(s_t, a_t\right)\right] \\ & =\underset{\tau \sim \tilde{\pi}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^t\left(r\left(s_t\right)+\gamma V_\pi\left(s_{t+1}\right)-V_\pi\left(s_t\right)\right)\right] \\ & =\eta(\tilde{\pi})+\underset{\tau \sim \tilde{\pi}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^{t+1} V_\pi\left(s_{t+1}\right)-\sum_{t=0}^{\infty} \gamma^t V_\pi\left(s_t\right)\right] \\ & =\eta(\tilde{\pi})+\underset{\tau \sim \tilde{\pi}}{\mathrm{E}}\left[\sum_{t=1}^{\infty} \gamma^t V_\pi\left(s_t\right)-\sum_{t=0}^{\infty} \gamma^t V_\pi\left(s_t\right)\right] \\ & =\eta(\tilde{\pi})-\underset{\tau \sim \tilde{\pi}}{\mathrm{E}}\left[V_\pi\left(s_0\right)\right] \\ & =\eta(\tilde{\pi})-\eta(\pi) \end{aligned} \]

See the supplementary material for the full proof of Lemma 1.

How to ensure \(\sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \ge 0\) ?

Define the local approximation \[ \mathcal{L}_\pi(\tilde{\pi})=\eta(\pi)+\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \] and compare it with Lemma 1: \[ \eta(\tilde{\pi})=\eta(\pi)+\sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \] The only difference is that \(L_\pi\) uses \(\rho_\pi\) instead of \(\rho_{\tilde\pi}\); as long as \(\pi\) and \(\tilde\pi\) are close, substituting \(\rho_\pi\) for \(\rho_{\tilde\pi}\) is a good approximation.

Kakade & Langford (2002) show that \(L_\pi\) matches \(\eta\) to first order at the current policy: \[ \begin{aligned} L_{\pi_{\theta_0}}\left(\pi_{\theta_0}\right) & =\eta\left(\pi_{\theta_0}\right) \\ \left.\nabla_\theta L_{\pi_{\theta_0}}\left(\pi_\theta\right)\right|_{\theta=\theta_0} & =\left.\nabla_\theta \eta\left(\pi_\theta\right)\right|_{\theta=\theta_0} \end{aligned} \] So a sufficiently small step from \(\theta_0\) that improves \(L_{\pi_{\theta_0}}\) also improves \(\eta\), but this does not tell us how large a step we may take.
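
This first-order match can be checked numerically: for a softmax tabular policy, the finite-difference gradients of \(\eta\) and of \(L_{\pi_{\theta_0}}\) coincide at \(\theta_0\). (A toy check of my own, not from the paper; the MDP and parameterization below are made up.)

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # toy MDP, as before
r = rng.uniform(size=nS)
rho0 = rng.dirichlet(np.ones(nS))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def policy(theta):           # tabular softmax policy, theta has shape (nS, nA)
    return softmax(theta)

def P_pi(pi):
    return np.einsum("sa,sax->sx", pi, P)

def V_of(pi):
    return np.linalg.solve(np.eye(nS) - gamma * P_pi(pi), r)

def eta(theta):
    return rho0 @ V_of(policy(theta))

theta0 = rng.normal(size=(nS, nA))
pi0 = policy(theta0)
V0 = V_of(pi0)
A0 = r[:, None] + gamma * np.einsum("sax,x->sa", P, V0) - V0[:, None]
rho_pi0 = np.linalg.solve(np.eye(nS) - gamma * P_pi(pi0).T, rho0)

def L(theta):                # surrogate: rho and A are frozen at theta0
    return eta(theta0) + np.sum(rho_pi0[:, None] * policy(theta) * A0)

def fd_grad(f, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in np.ndindex(theta.shape):
        d = np.zeros_like(theta); d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

print(abs(L(theta0) - eta(theta0)))                                # ~0: L and eta agree at theta0
print(np.max(np.abs(fd_grad(eta, theta0) - fd_grad(L, theta0))))   # ~0: their gradients agree at theta0
```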

How to ensure the difference is not too large?

Kakade & Langford (2002)

Their conservative policy iteration update: let \(\pi^{\prime}=\arg \max _{\pi^{\prime}} L_{\pi_{\text {old }}}\left(\pi^{\prime}\right)\) and take a mixture step \[ \pi_{\text {new }}(a \mid s)=(1-\alpha) \pi_{\mathrm{old}}(a \mid s)+\alpha \pi^{\prime}(a \mid s) \] They proved the lower bound \[ \begin{aligned} \eta\left(\pi_{\text {new }}\right) & \geq L_{\pi_{\text {old }}}\left(\pi_{\text {new }}\right)-\frac{2 \epsilon \gamma}{(1-\gamma)^2} \alpha^2 \\ & \text { where } \epsilon=\max _s\left|\mathbb{E}_{a \sim \pi^{\prime}(a \mid s)}\left[A_\pi(s, a)\right]\right| \end{aligned} \] However, this bound only applies to mixture policies generated by the equation above, which is restrictive and unwieldy in practice.

Monotonic Improvement Guarantee for General Stochastic Policies

  • Replace \(\alpha\) with a distance between \(\pi\) and \(\tilde \pi\)

  • Change the constant \(\epsilon\) appropriately

Distance: total variation divergence \[ D_{T V}(p \| q)=\frac{1}{2} \sum_i\left|p_i-q_i\right| \] \(D_{TV}^{\text{max}}(\pi, \tilde \pi)\): \(D_{\mathrm{TV}}^{\max }(\pi, \tilde{\pi})=\max _s D_{T V}(\pi(\cdot \mid s) \| \tilde{\pi}(\cdot \mid s))\)

Theorem 1:

Let \(\alpha=D_{\mathrm{TV}}^{\max }\left(\pi_{\mathrm{old}}, \pi_{\mathrm{new}}\right)\). Then: \[ \begin{aligned} & \eta\left(\pi_{\text {new }}\right) \geq L_{\pi_{\text {old }}}\left(\pi_{\text {new }}\right)-\frac{4 \epsilon \gamma}{(1-\gamma)^2} \alpha^2 \\ & \text { where } \epsilon=\max _{s, a}\left|A_\pi(s, a)\right| \end{aligned} \] (A detailed proof is given in the paper's appendix.)

By Pollard (2000), Ch. 3, \(D_{T V}(p \| q)^2 \leq D_{\mathrm{KL}}(p \| q)\), so the bound can be written in terms of the KL divergence: \[ \begin{aligned} \eta(\tilde{\pi}) & \geq L_\pi(\tilde{\pi})-C D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi}) \\ & \text { where } C=\frac{4 \epsilon \gamma}{(1-\gamma)^2} \end{aligned} \]
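
Both the inequality \(D_{TV}(p \| q)^2 \leq D_{\mathrm{KL}}(p \| q)\) and the size of the penalty coefficient \(C\) are easy to spot-check numerically (a toy check of my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def tv(p, q):
    """Total variation divergence D_TV(p || q) = 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * np.sum(np.abs(p - q))

def kl(p, q):
    """KL divergence D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return np.sum(p * np.log(p / q))

# Spot-check D_TV(p||q)^2 <= D_KL(p||q) on random categorical distributions.
for _ in range(5):
    p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    assert tv(p, q) ** 2 <= kl(p, q)

# Penalty coefficient from the lower bound (epsilon = max_{s,a} |A_pi(s,a)| is problem-dependent).
gamma, eps = 0.99, 1.0
C = 4 * eps * gamma / (1 - gamma) ** 2
print("C =", C)   # with gamma close to 1, C is huge, so the theoretical penalty forces tiny steps
```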

Algorithm 1

[Algorithm 1 from the paper: at each iteration, maximize the penalized surrogate \(L_{\pi_i}(\pi)-C D_{\mathrm{KL}}^{\max }\left(\pi_i, \pi\right)\) over \(\pi\).]

This algorithm generates a sequence of policies with non-decreasing expected return: \(\eta(\pi_0) \le \eta(\pi_1) \le \eta(\pi_2) \le \ldots\)

Let \(M_i(\pi) = L_{\pi_i}(\pi)-C D_{\mathrm{KL}}^{\max }\left(\pi_i, \pi\right)\). Then \[ \begin{align} \eta\left(\pi_{i+1}\right) \geq & M_i\left(\pi_{i+1}\right) \\ \eta\left(\pi_i\right)= &M_i\left(\pi_i\right) \\ \eta\left(\pi_{i+1}\right)-\eta\left(\pi_i\right) \geq& M_i\left(\pi_{i+1}\right)-M_i\left(\pi_i\right) \end{align} \] The first line is the KL lower bound above; the second holds because \(D_{\mathrm{KL}}^{\max}(\pi_i, \pi_i)=0\) and \(L_{\pi_i}(\pi_i)=\eta(\pi_i)\). Thus maximizing \(M_i\) at each iteration guarantees that the true objective \(\eta\) does not decrease: this is a minorization-maximization (MM) algorithm, with \(M_i\) as the surrogate that minorizes \(\eta\).
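
For intuition, here is a minimal 1-D illustration of the MM pattern (my own toy example, unrelated to MDPs): a surrogate that touches the objective at the current iterate and lies below it everywhere is maximized in closed form, and the objective never decreases.

```python
import numpy as np

# Objective to maximize and its derivative (a toy 1-D stand-in for eta).
f  = lambda x: np.sin(x) - 0.1 * x ** 2
df = lambda x: np.cos(x) - 0.2 * x

# |f''(x)| = |sin(x) + 0.2| <= 1.2, so the quadratic
#   m_i(x) = f(x_i) + f'(x_i) (x - x_i) - (L / 2) (x - x_i)^2,  L = 1.2
# touches f at x_i and lies below f everywhere, mirroring eta >= M_i with equality at pi_i.
# Maximizing m_i in closed form gives the next iterate x_{i+1} = x_i + f'(x_i) / L.
L = 1.2
x = -1.0
for i in range(30):
    x_next = x + df(x) / L              # argmax of the surrogate m_i
    assert f(x_next) >= f(x) - 1e-12    # monotone improvement, as in Algorithm 1
    x = x_next
print(x, f(x))                          # converges to a local maximum of f
```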

Optimization of Parameterized Policies

TRPO is an approximation to Algorithm 1 which uses a constraint on the KL divergence rather than a penalty to robustly allow large updates.

Consider parameterized policies \(\pi_\theta(a|s)\) with parameter vector \(\theta\), and overload the notation: \[ \begin{align} \eta(\theta)& := \eta(\pi_\theta) \\ L_\theta(\tilde{\theta})&:=L_{\pi_\theta}\left(\pi_{\tilde{\theta}}\right) \\ D_{\mathrm{KL}}(\theta \| \tilde{\theta})&:=D_{\mathrm{KL}}\left(\pi_\theta \| \pi_{\tilde{\theta}}\right) \end{align} \] Penalized problem (Algorithm 1): \[ \underset{\theta}{\operatorname{maximize}}\left[L_{\theta_{\mathrm{old}}}(\theta)-C D_{\mathrm{KL}}^{\max }\left(\theta_{\mathrm{old}}, \theta\right)\right] \] In practice the theoretical penalty coefficient \(C\) yields very small steps, so use a trust-region constraint instead: \[ \begin{aligned} & \underset{\theta}{\operatorname{maximize}}\ L_{\theta_{\text {old }}}(\theta) \\ & \text { subject to } D_{\mathrm{KL}}^{\max }\left(\theta_{\text {old }}, \theta\right) \leq \delta \end{aligned} \] The max-KL constraint imposes one constraint per state, which is intractable, so replace it with the average KL divergence under the visitation distribution: \[ \bar{D}_{\mathrm{KL}}^\rho\left(\theta_1, \theta_2\right):=\mathbb{E}_{s \sim \rho}\left[D_{\mathrm{KL}}\left(\pi_{\theta_1}(\cdot \mid s) \| \pi_{\theta_2}(\cdot \mid s)\right)\right] \] Final trust-region problem: \[ \begin{aligned} & \underset{\theta}{\operatorname{maximize}}\ L_{\theta_{\text {old }}}(\theta) \\ & \text { subject to } \bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\text{old}}}}\left(\theta_{\text {old }}, \theta\right) \leq \delta \end{aligned} \]
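
The average-KL term can be estimated from states sampled under the old policy. Below is a minimal sketch for a categorical (softmax) policy; the linear parameterization, the state sampling, and all names are illustrative assumptions of mine, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(4)
n_actions, d = 4, 3

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def policy_probs(theta, states):
    """Toy categorical policy: pi_theta(.|s) = softmax(theta @ s). Shapes: theta (A, d), states (N, d)."""
    return softmax(states @ theta.T)

def mean_kl(theta_old, theta_new, states):
    """Estimate of E_{s ~ rho}[ KL(pi_old(.|s) || pi_new(.|s)) ] over sampled states."""
    p_old = policy_probs(theta_old, states)
    p_new = policy_probs(theta_new, states)
    kl_per_state = np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)
    return kl_per_state.mean()

# States visited while following pi_old stand in for s ~ rho_{theta_old}.
states = rng.normal(size=(256, d))
theta_old = rng.normal(size=(n_actions, d))
theta_new = theta_old + 0.05 * rng.normal(size=(n_actions, d))
print("average KL:", mean_kl(theta_old, theta_new, states))   # compare against the trust-region radius delta
```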

Sample-Based Estimation of the Objective and Constraint

Approximate the objective and constraint using samples (Monte Carlo estimates): \[ \begin{array}{r} \underset{\theta}{\operatorname{maximize}} \sum_s \rho_{\theta_{\text {old }}}(s) \sum_a \pi_\theta(a \mid s) A_{\theta_{\text {old }}}(s, a) \\ \text { subject to } \bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\text {old }}}}\left(\theta_{\text {old }}, \theta\right) \leq \delta \end{array} \] We cannot sample actions from the new policy \(\pi_\theta(a \mid s)\), so we introduce importance sampling with a sampling distribution \(q\): \[ \sum_a \pi_\theta\left(a \mid s_n\right) A_{\theta_{\text {old }}}\left(s_n, a\right)=\mathbb{E}_{a \sim q}\left[\frac{\pi_\theta\left(a \mid s_n\right)}{q\left(a \mid s_n\right)} A_{\theta_{\text {old }}}\left(s_n, a\right)\right] \] Replacing \(\sum_s \rho_{\theta_{\text {old }}}(s)(\ldots)\) by \(\frac{1}{1-\gamma}\mathbb{E}_{s \sim \rho_{\theta_{\text {old }}}}[\ldots]\) and \(A_{\theta_{\text {old }}}\) by \(Q_{\theta_{\text {old }}}\) (which only shifts the objective by a constant) yields the fully sample-based problem:

\[ \begin{aligned} & \underset{\theta}{\operatorname{maximize}} \mathbb{E}_{s \sim \rho_{\theta_{\text {old }}}, a \sim q}\left[\frac{\pi_\theta(a \mid s)}{q(a \mid s)} Q_{\theta_{\text {old }}}(s, a)\right] \\ & \text { subject to } \mathbb{E}_{s \sim \rho_{\theta_{\text {old }}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\text {old }}}(\cdot \mid s) \| \pi_\theta(\cdot \mid s)\right)\right] \leq \delta \end{aligned} \]
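
For the single-path case the paper takes \(q = \pi_{\theta_{\text{old}}}\). The sketch below is my own illustration of the resulting sample estimates of the objective and the constraint; the data standing in for rollouts and Q-value estimates is made up.

```python
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions = 6, 3

# Illustrative data, as if collected by rolling out pi_old (single path: q = pi_old).
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)   # pi_old(a|s), tabular for simplicity
states  = rng.integers(n_states, size=500)                  # s_n ~ rho_{theta_old}
actions = np.array([rng.choice(n_actions, p=pi_old[s]) for s in states])  # a_n ~ q(.|s_n)
q_values = rng.normal(size=500)                             # stand-in for Q_{theta_old}(s_n, a_n) estimates

def surrogate(pi_new):
    """Importance-sampled estimate of E[ pi_new(a|s) / q(a|s) * Q_old(s, a) ]."""
    ratio = pi_new[states, actions] / pi_old[states, actions]
    return np.mean(ratio * q_values)

def mean_kl(pi_new):
    """Estimate of E_{s ~ rho_old}[ KL(pi_old(.|s) || pi_new(.|s)) ] from the sampled states."""
    p, q = pi_old[states], pi_new[states]
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))

# A candidate update is acceptable only if the estimated average KL stays within delta.
pi_new = rng.dirichlet(np.ones(n_actions), size=n_states)
delta = 0.01
print(surrogate(pi_new), mean_kl(pi_new), mean_kl(pi_new) <= delta)
```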

Single Path and Vine

[Figure: illustration of the single path procedure (sample entire trajectories under the current policy) and the vine procedure (branch multiple short rollouts from states along sampled trajectories).]

Pseudocode

[Figure: pseudocode of the practical TRPO algorithm.]

Image Source
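
Since the pseudocode image is not reproduced here, below is a heavily simplified, self-contained sketch of the overall loop (collect samples with the current policy, estimate the surrogate, take a KL-constrained step). It is not the paper's algorithm: real TRPO computes a natural-gradient step with conjugate gradient and Fisher-vector products, whereas this toy uses a plain finite-difference gradient plus backtracking line search on a made-up tabular MDP.

```python
import numpy as np

rng = np.random.default_rng(6)
nS, nA, gamma, delta = 3, 2, 0.95, 0.01

# Made-up tabular MDP: P[s, a] is a next-state distribution, reward depends only on the state.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
r = rng.uniform(size=nS)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def rollout(pi, horizon=40):
    """Single-path sampling: one trajectory under pi, returning (s, a, discounted return-to-go)."""
    s = rng.integers(nS)
    traj = []
    for _ in range(horizon):
        a = rng.choice(nA, p=pi[s])
        traj.append((s, a, r[s]))
        s = rng.choice(nS, p=P[s, a])
    out, G = [], 0.0
    for s, a, rew in reversed(traj):
        G = rew + gamma * G
        out.append((s, a, G))       # G is a Monte Carlo estimate of Q_old(s, a)
    return out

def surrogate_and_kl(theta, theta_old, batch):
    pi, pi_old = softmax(theta), softmax(theta_old)
    S, A, Q = (np.array(x) for x in zip(*batch))
    L = np.mean(pi[S, A] / pi_old[S, A] * Q)                                  # importance-sampled objective
    kl = np.mean(np.sum(pi_old * (np.log(pi_old) - np.log(pi)), axis=-1)[S])  # average KL over visited states
    return L, kl

def fd_grad(f, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in np.ndindex(theta.shape):
        d = np.zeros_like(theta); d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

theta = np.zeros((nS, nA))
for it in range(20):
    batch = [x for _ in range(20) for x in rollout(softmax(theta))]
    L_old, _ = surrogate_and_kl(theta, theta, batch)
    g = fd_grad(lambda t: surrogate_and_kl(t, theta, batch)[0], theta)
    # Backtracking line search: shrink the step until the surrogate improves and the KL trust region holds.
    for step in 0.5 ** np.arange(10):
        L_new, kl = surrogate_and_kl(theta + step * g, theta, batch)
        if L_new > L_old and kl <= delta:
            theta = theta + step * g
            break                     # if no step is accepted, skip the update for this iteration
    print(it, round(L_old, 3))        # rough performance measure (mean return-to-go); should trend upward
```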

Experiments

[Figure: experimental results (learning curves).]

Supplementary Material

Proof of Lemma 1

[Figure: full proof of Lemma 1.]

References

  1. https://jonathan-hui.medium.com/rl-trust-region-policy-optimization-trpo-explained-a6ee04eeeee9
  2. https://www.zhihu.com/question/366605427/answer/1048153125
  3. https://www.cnblogs.com/xingzheai/p/16565686.html