Safe and efficient off-policy reinforcement learning
Abstract
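The paper expresses several return-based off-policy algorithms in a common form and derives from it Retrace(\(\lambda\)), a new return-based off-policy method.
(Here "return" refers to the sum of discounted rewards \(\sum_t\gamma^tr_t\).)
Retrace(\(\lambda\)) has three desired properties:
- low variance
- efficient: it makes good use of samples collected by near on-policy behaviour policies
- safe: it can use samples collected by any behaviour policy, however far off-policy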
Notation
Operator \(P^\pi\): \[ \left(P^\pi Q\right)(x, a):=\sum_{x^{\prime} \in \mathcal{X}} \sum_{a^{\prime} \in \mathcal{A}} P\left(x^{\prime} \mid x, a\right) \pi\left(a^{\prime} \mid x^{\prime}\right) Q\left(x^{\prime}, a^{\prime}\right) \]
Value function: \[ Q^\pi:=\sum_{t \geq 0} \gamma^t\left(P^\pi\right)^t r \]
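For a small tabular MDP, both definitions can be evaluated directly. The following is a minimal NumPy sketch, not code from the paper: the array layout (`P[x, a, x']`, `pi[x, a]`, `r[x, a]`) and the function names `apply_P_pi` and `q_pi` are assumptions made for illustration. It uses the fact that, for \(\gamma<1\), the series \(\sum_{t \geq 0} \gamma^t(P^\pi)^t r\) equals the solution of the linear system \(Q = r + \gamma P^\pi Q\).

```python
import numpy as np

def apply_P_pi(P, pi, Q):
    """(P^pi Q)(x, a) = sum_{x'} sum_{a'} P(x'|x,a) pi(a'|x') Q(x',a').

    P:  shape (X, A, X), P[x, a, x'] = P(x' | x, a)   (assumed layout)
    pi: shape (X, A),    pi[x, a]    = pi(a | x)
    Q:  shape (X, A)
    """
    v = (pi * Q).sum(axis=1)  # E_pi Q(x', .) for every x', shape (X,)
    return P @ v              # sum over x', result has shape (X, A)

def q_pi(P, pi, r, gamma):
    """Solve Q = r + gamma * P^pi Q, the closed form of sum_t gamma^t (P^pi)^t r."""
    X, A = r.shape
    # Matrix of P^pi acting on the flattened Q:
    # M[(x, a), (x', a')] = P(x' | x, a) * pi(a' | x')
    M = (P[:, :, :, None] * pi[None, None, :, :]).reshape(X * A, X * A)
    q_flat = np.linalg.solve(np.eye(X * A) - gamma * M, r.reshape(-1))
    return q_flat.reshape(X, A)
```

The result can be checked against the fixed-point equation: `q_pi(P, pi, r, gamma)` should equal `r + gamma * apply_P_pi(P, pi, Q)` up to floating-point error.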
Off-Policy Algorithms
General operator
\[ \mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_\mu\left[\sum_{t \geq 0} \gamma^t\left(\prod_{s=1}^t c_s\right)\left(r_t+\gamma \mathbb{E}_\pi Q\left(x_{t+1}, \cdot\right)-Q\left(x_t, a_t\right)\right)\right] \]
Where: \[ \mathbb{E}_\pi Q(x, \cdot):=\sum_a \pi(a \mid x) Q(x, a) \] The \(c_s\) are non-negative coefficients, called the traces of the operator, with the convention \(\prod_{s=1}^t c_s=1\) when \(t=0\).
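To make the operator concrete, the expectation over \(\mu\) can be replaced by a single finite trajectory sampled from \(\mu\). The sketch below is an illustration under that assumption, not the paper's implementation; the function name `return_based_target` and its argument layout are hypothetical. It computes \(Q(x_0,a_0)\) plus the discounted, trace-weighted sum of temporal-difference terms.

```python
import numpy as np

def return_based_target(q_sa, exp_q_next, rewards, c, gamma):
    """Single-trajectory estimate of R Q(x_0, a_0).

    q_sa[t]       = Q(x_t, a_t)
    exp_q_next[t] = E_pi Q(x_{t+1}, .)
    rewards[t]    = r_t
    c[t]          = trace coefficient c_t (c[0] is unused: the empty product is 1)
    """
    q_sa, exp_q_next, rewards, c = (
        np.asarray(v, dtype=float) for v in (q_sa, exp_q_next, rewards, c)
    )
    td = rewards + gamma * exp_q_next - q_sa             # r_t + gamma E_pi Q - Q
    trace = np.concatenate(([1.0], np.cumprod(c[1:])))   # prod_{s=1}^t c_s
    discounts = gamma ** np.arange(len(td))              # gamma^t
    return q_sa[0] + np.sum(discounts * trace * td)
```

With all \(c_s=0\) the estimate reduces to the one-step target \(r_0+\gamma\mathbb{E}_\pi Q(x_1,\cdot)\); larger traces let later rewards contribute.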
Importance sampling (IS)
\[ c_s=\frac{\pi\left(a_s \mid x_s\right)}{\mu\left(a_s \mid x_s\right)} \]
This is the simplest way to correct for the discrepancy between \(\mu\) and \(\pi\).
It suffers from large variance, because the product of ratios \(\prod_{s=1}^t c_s\) accumulated over time steps can become very large or very small.
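The growth of this variance can be computed exactly in a simplified setting. The sketch below assumes a single state and actions drawn i.i.d. from \(\mu\) (an assumption made only for illustration, as is the function name `is_product_variance`). In that case the product of ratios has mean 1 under \(\mu\), and its second moment is \(\left(\sum_a \pi(a)^2/\mu(a)\right)^t\), which grows geometrically in \(t\) whenever \(\pi\neq\mu\).

```python
import numpy as np

def is_product_variance(pi, mu, t):
    """Exact variance of prod_{s=1}^t pi(a_s)/mu(a_s) with a_s i.i.d. from mu.

    Assumes a single state and mu(a) > 0 for every action.
    """
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    second_moment = np.sum(pi ** 2 / mu)  # E_mu[(pi/mu)^2] for one step
    return second_moment ** t - 1.0       # the product has mean 1 under mu
```

For example, with \(\pi=(0.9, 0.1)\) and \(\mu=(0.5, 0.5)\) the one-step second moment is 1.64, so the variance after \(t\) steps is \(1.64^t-1\).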
Off-policy \(Q^\pi(\lambda)\) and \(Q^*(\lambda)\)
\[ c_s=\lambda \]
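This avoids the blow-up of the variance of the product of ratios encountered with IS.
However, it is hard to choose \(\lambda\): the traces ignore how far \(\mu\) is from \(\pi\), so the method is not safe and only works when the two policies are sufficiently close for the chosen \(\lambda\).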
Tree-backup \(TB(\lambda)\)
\[ c_s=\lambda \pi\left(a_s \mid x_s\right) \]
The algorithm corrects for the discrepancy by multiplying each term of the sum by the product of target policy probabilities.
It is safe but not efficient: it cuts the traces even in the near on-policy case, where doing so is unnecessary.
Retrace(\(\lambda\))
\[ c_s=\lambda \min \left(1, \frac{\pi\left(a_s \mid x_s\right)}{\mu\left(a_s \mid x_s\right)}\right) \]
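The four choices of \(c_s\) listed above differ only in how the per-step coefficient is computed from \(\pi(a_s \mid x_s)\), \(\mu(a_s \mid x_s)\) and \(\lambda\). A minimal sketch for comparing them side by side follows; the function name `trace_coefficients` and the method labels are hypothetical, chosen for illustration.

```python
import numpy as np

def trace_coefficients(pi_a, mu_a, lam, method):
    """Per-step trace coefficients c_s for a sampled trajectory.

    pi_a[s] = pi(a_s | x_s), mu_a[s] = mu(a_s | x_s), lam = lambda.
    """
    pi_a = np.asarray(pi_a, dtype=float)
    mu_a = np.asarray(mu_a, dtype=float)
    if method == "is":        # importance sampling
        return pi_a / mu_a
    if method == "qlambda":   # off-policy Q^pi(lambda) and Q^*(lambda)
        return np.full_like(pi_a, lam)
    if method == "tb":        # Tree-backup TB(lambda)
        return lam * pi_a
    if method == "retrace":   # Retrace(lambda): ratio truncated at 1
        return lam * np.minimum(1.0, pi_a / mu_a)
    raise ValueError(f"unknown method: {method}")
```

In the on-policy case (`pi_a == mu_a`) the Retrace(\(\lambda\)) coefficients all equal \(\lambda\), whereas the TB(\(\lambda\)) coefficients are \(\lambda\pi(a_s \mid x_s)\) and so still shrink the trace. Because the ratio is truncated at 1, the Retrace(\(\lambda\)) coefficients never exceed \(\lambda\), which keeps the product of traces bounded, unlike IS.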

Analysis of Retrace(\(\lambda\))
Policy Evaluation
Control
Online algorithms
Experimental Results

Reference
- https://zhuanlan.zhihu.com/p/56391653