Safe and efficient off-policy reinforcement learning

Abstract

The paper proposes Retrace(\(\lambda\)), a return-based off-policy method (where return refers to the sum of discounted rewards \(\sum_t\gamma^tr_t\)). It has three desirable properties:

  • low variance
  • efficient: it makes good use of full returns when the behaviour policy is close to the target policy
  • safe: it converges for an arbitrary behaviour policy

Notation

Operator \(P^\pi\): \[ \left(P^\pi Q\right)(x, a):=\sum_{x^{\prime} \in \mathcal{X}} \sum_{a^{\prime} \in \mathcal{A}} P\left(x^{\prime} \mid x, a\right) \pi\left(a^{\prime} \mid x^{\prime}\right) Q\left(x^{\prime}, a^{\prime}\right) \]

Value function: \[ Q^\pi:=\sum_{t \geq 0} \gamma^t\left(P^\pi\right)^t r \]
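A minimal tabular sketch of how \(P^\pi\) acts on \(Q\), assuming finite \(\mathcal{X}\) and \(\mathcal{A}\); the array names below are illustrative, not from the paper.

```python
import numpy as np

def apply_P_pi(P, pi, Q):
    """(P^pi Q)(x, a) = sum_{x'} sum_{a'} P(x' | x, a) * pi(a' | x') * Q(x', a').

    P  : array of shape (X, A, X), transition probabilities P(x' | x, a)
    pi : array of shape (X, A), target-policy probabilities pi(a | x)
    Q  : array of shape (X, A), action-value estimates
    """
    # Expected value of Q at the next state under pi: V(x') = sum_{a'} pi(a' | x') Q(x', a')
    V_next = (pi * Q).sum(axis=1)                       # shape (X,)
    # Average V over the next-state distribution for every (x, a)
    return (P * V_next[None, None, :]).sum(axis=2)      # shape (X, A)
```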

Off-Policy Algorithms

General operator

\[ \mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_\mu\left[\sum_{t \geq 0} \gamma^t\left(\prod_{s=1}^t c_s\right)\left(r_t+\gamma \mathbb{E}_\pi Q\left(x_{t+1}, \cdot\right)-Q\left(x_t, a_t\right)\right)\right] \]

Where: \[ \mathbb{E}_\pi Q(x, \cdot):=\sum_a \pi(a \mid x) Q(x, a) \] The trace coefficients \(c_s\) are non-negative, with the convention \(\prod_{s=1}^t c_s = 1\) when \(t=0\).
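The operator can be sketched along a single trajectory sampled from \(\mu\). The function below is a minimal illustration (names and signature are assumptions, not the paper's code); it takes the trace coefficients \(c_s\) as an input, so the same routine covers all of the choices discussed next.

```python
def corrected_q(Q, pi, xs, acts, rewards, cs, gamma):
    """R Q(x_0, a_0) = Q(x_0, a_0)
       + sum_t gamma^t (c_1 ... c_t) * (r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t)).

    Q       : (X, A) action-value array
    pi      : (X, A) target-policy probabilities
    xs      : states x_0, ..., x_T (length T + 1)
    acts    : actions a_0, ..., a_{T-1}
    rewards : rewards r_0, ..., r_{T-1}
    cs      : trace coefficients c_1, ..., c_{T-1}; the product is 1 for t = 0
    """
    target = Q[xs[0], acts[0]]
    discount, trace = 1.0, 1.0
    for t in range(len(rewards)):
        if t > 0:
            trace *= cs[t - 1]                                   # running product c_1 ... c_t
        exp_q_next = (pi[xs[t + 1]] * Q[xs[t + 1]]).sum()        # E_pi Q(x_{t+1}, .)
        td = rewards[t] + gamma * exp_q_next - Q[xs[t], acts[t]]
        target += discount * trace * td
        discount *= gamma
    return target
```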

Importance sampling (IS)

\[ c_s=\frac{\pi\left(a_s \mid x_s\right)}{\mu\left(a_s \mid x_s\right)} \]

This is the simplest way to correct for the discrepancy between \(\mu\) and \(\pi\).

However, it suffers from large variance, because the product of importance ratios along the trajectory can become arbitrarily large (or vanish).
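As a sketch, the IS traces for a sampled trajectory could be computed as follows (reusing the illustrative names from the sketch above); the running product of these ratios is exactly the term whose variance blows up.

```python
def is_traces(pi, mu, xs, acts):
    # c_s = pi(a_s | x_s) / mu(a_s | x_s) for s = 1, ..., T-1
    return [pi[x, a] / mu[x, a] for x, a in zip(xs[1:], acts[1:])]
```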

Off-policy \(Q^\pi(\lambda)\) and \(Q^*(\lambda)\)

\[ c_s=\lambda \]

Using a constant trace avoids the blow-up of the variance of the product of ratios encountered with IS.

However, \(\lambda\) is hard to choose: convergence is only guaranteed when \(\mu\) and \(\pi\) are sufficiently close (which constrains how large \(\lambda\) may be), so the method is not safe for an arbitrary behaviour policy.
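For completeness, the corresponding traces are simply a constant (a trivial sketch in the same illustrative style as above):

```python
def q_lambda_traces(n_steps, lam):
    # c_s = lambda, one coefficient per step s = 1, ..., T-1
    return [lam] * n_steps
```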

Tree-backup \(TB(\lambda)\)

\[ c_s=\lambda \pi\left(a_s \mid x_s\right) \]

The algorithm corrects for the discrepancy by multiplying each term of the sum by the product of target-policy probabilities.

Safe, but not efficient.

It cuts the traces unnecessarily in the near on-policy case, where longer returns could be exploited.
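A sketch of the corresponding traces, again with the illustrative names used above:

```python
def tb_traces(pi, xs, acts, lam):
    # c_s = lambda * pi(a_s | x_s) for s = 1, ..., T-1
    return [lam * pi[x, a] for x, a in zip(xs[1:], acts[1:])]
```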

Retrace(\(\lambda\))

\[ c_s=\lambda \min \left(1, \frac{\pi\left(a_s \mid x_s\right)}{\mu\left(a_s \mid x_s\right)}\right) \]
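A sketch of the Retrace(\(\lambda\)) traces, with a usage note showing how they plug into the general operator sketch above (all names are illustrative):

```python
def retrace_traces(pi, mu, xs, acts, lam):
    # c_s = lambda * min(1, pi(a_s | x_s) / mu(a_s | x_s)) for s = 1, ..., T-1
    return [lam * min(1.0, pi[x, a] / mu[x, a]) for x, a in zip(xs[1:], acts[1:])]

# Usage with the general operator sketch:
# cs = retrace_traces(pi, mu, xs, acts, lam=0.95)
# target = corrected_q(Q, pi, xs, acts, rewards, cs, gamma=0.99)
```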


Analysis of Retrace(\(\lambda\))

Policy Evaluation

Control

Online algorithms

Experimental Results

