Sutton & Barto Chap. 11: Off-policy Methods with Approximation
The challenge of off-policy learning comes in two parts:
- The target of the updates (not to be confused with the target policy)
    - Addressed with importance sampling, which may increase variance
- The distribution of the updates, which no longer matches the on-policy distribution
    - Addressed with importance sampling, or
    - by developing true gradient methods that do not rely on any special distribution for stability
Semi-gradient Methods
Per-step importance sampling ratio: \(\rho_t \doteq \rho_{t: t}=\frac{\pi\left(A_t \mid S_t\right)}{b\left(A_t \mid S_t\right)}\)
The one-step, state-value algorithm is semi-gradient off-policy TD(0): \[ \mathbf{w}_{t+1} \doteq \mathbf{w}_t+\alpha \rho_t \delta_t \nabla \hat{v}\left(S_t, \mathbf{w}_t\right) \] where the form of \(\delta_t\) depends on whether the problem is episodic and discounted, or continuing and undiscounted using average reward: \[ \begin{aligned} \delta_t &\doteq R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_t\right)-\hat{v}\left(S_t, \mathbf{w}_t\right), \quad \text{or} \\ \delta_t &\doteq R_{t+1}-\bar{R}_t+\hat{v}\left(S_{t+1}, \mathbf{w}_t\right)-\hat{v}\left(S_t, \mathbf{w}_t\right) \end{aligned} \] The one-step, action-value algorithm is semi-gradient Expected Sarsa: \[ \begin{aligned} \mathbf{w}_{t+1} &\doteq \mathbf{w}_t+\alpha \delta_t \nabla \hat{q}\left(S_t, A_t, \mathbf{w}_t\right) \\ \delta_t &\doteq R_{t+1}+\gamma \sum_a \pi\left(a \mid S_{t+1}\right) \hat{q}\left(S_{t+1}, a, \mathbf{w}_t\right)-\hat{q}\left(S_t, A_t, \mathbf{w}_t\right), \quad \text{or} \\ \delta_t &\doteq R_{t+1}-\bar{R}_t+\sum_a \pi\left(a \mid S_{t+1}\right) \hat{q}\left(S_{t+1}, a, \mathbf{w}_t\right)-\hat{q}\left(S_t, A_t, \mathbf{w}_t\right) \end{aligned} \] Note that this algorithm does not use importance sampling. In the tabular case that is appropriate, because the only action whose value is sampled is \(A_t\), and in learning its value we do not need to consider any other actions. With function approximation it is less clear whether this is the right thing to do.
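As a concrete illustration, here is a minimal sketch of the semi-gradient off-policy TD(0) update with linear function approximation; the function name and the feature vectors in the usage example are hypothetical, not from the book.

```python
import numpy as np

def semi_gradient_off_policy_td0(w, x_s, x_s_next, r, rho, alpha, gamma):
    """One semi-gradient off-policy TD(0) update for linear v_hat(s, w) = w @ x(s).

    rho is the per-step importance sampling ratio pi(a|s) / b(a|s).
    """
    delta = r + gamma * w @ x_s_next - w @ x_s   # TD error (episodic, discounted form)
    return w + alpha * rho * delta * x_s         # gradient of w @ x_s is x_s

# Usage with made-up numbers, purely illustrative:
w = np.zeros(4)
x_s, x_s_next = np.array([1., 0., 0., 1.]), np.array([0., 1., 0., 1.])
w = semi_gradient_off_policy_td0(w, x_s, x_s_next, r=1.0, rho=2.0, alpha=0.1, gamma=0.9)
```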
Examples of Off-policy Divergence
A simple example
Suppose there are two states whose feature vectors are scalars, in this case 1 and 2, so under linear function approximation their estimated values are \(w\) and \(2w\) for a single weight \(w\). Suppose a transition from the first state to the second occurs with reward 0.
TD error: \[ \delta_t=R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_t\right)-\hat{v}\left(S_t, \mathbf{w}_t\right)=0+\gamma 2 w_t-w_t=(2 \gamma-1) w_t \] Update of off-policy semi-gradient TD(0): \[ w_{t+1}=w_t+\alpha \rho_t \delta_t \nabla \hat{v}\left(S_t, w_t\right)=w_t+\alpha \cdot 1 \cdot(2 \gamma-1) w_t \cdot 1=(1+\alpha(2 \gamma-1)) w_t \] Here the importance sampling ratio is \(\rho_t=1\) (the target and behavior policies select this transition's action with the same probability). If \(\gamma>1/2\), then \(1+\alpha(2\gamma-1)>1\), and \(w\) grows every time this transition is updated; under off-policy training the behavior policy can keep producing this transition while its other actions have \(\rho_t=0\) and so provide no corrective updates, and \(w\) diverges.
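A minimal numeric check of this recursion, assuming for illustration \(\alpha=0.1\) and \(\gamma=0.9\) (both made-up values):

```python
# Iterate w_{t+1} = (1 + alpha*(2*gamma - 1)) * w_t from the two-state example.
alpha, gamma, w = 0.1, 0.9, 1.0
for t in range(5):
    w = (1 + alpha * (2 * gamma - 1)) * w
    print(t, round(w, 4))   # grows by a factor of 1.08 per repetition of the transition
```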
A simple complete example: Baird's counterexample
Consider the expected (DP-style) semi-gradient update, applied synchronously to all states with equal weight: \[ \mathbf{w}_{k+1} \doteq \mathbf{w}_k+\frac{\alpha}{|\mathcal{S}|} \sum_{s \in \mathcal{S}}\left(\mathbb{E}\left[R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_k\right) \mid S_t=s\right]-\hat{v}\left(s, \mathbf{w}_k\right)\right) \nabla \hat{v}\left(s, \mathbf{w}_k\right) \]
If we apply semi-gradient off-policy TD(0) to this problem, the weights diverge to infinity; the same instability occurs with the expected DP-style update above, even though it involves no sampling.
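The following is a minimal sketch of Baird's counterexample with semi-gradient off-policy TD(0), using the standard setup from the book (7 states, 8 weights, behavior policy taking the dashed action with probability 6/7, all rewards zero); the step size, discount, and initial weights follow the book's figure, but treat those specifics as assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Features: states 0..5 have value 2*w_i + w_7; state 6 has value w_6 + 2*w_7.
X = np.zeros((7, 8))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

gamma, alpha = 0.99, 0.01
w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])   # initial weights as in the book's figure

s = rng.integers(7)
for t in range(1000):
    if rng.random() < 6 / 7:                # behavior: dashed action with prob 6/7
        s_next, rho = rng.integers(6), 0.0  # target policy never takes dashed, so rho = 0
    else:                                   # behavior: solid action with prob 1/7
        s_next, rho = 6, 7.0                # rho = 1 / (1/7)
    r = 0.0                                 # all rewards are zero
    delta = r + gamma * w @ X[s_next] - w @ X[s]
    w += alpha * rho * delta * X[s]         # semi-gradient off-policy TD(0)
    s = s_next

print(w)   # the weight magnitudes keep growing as training continues
```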
The Deadly Triad
The danger of instability and divergence arises whenever we combine all three of the following elements.
- Function approximation
- Bootstrapping
- Off-policy training
The instability can be avoided if only two of the three elements are combined; it arises only when all three are present.
Relative indispensability: function approximation \(>\) bootstrapping \(>\) off-policy training (function approximation is the hardest to give up; off-policy training is the easiest).
Linear Value-function Geometry
Consider a state space \(\mathcal{S}=\left\{s_1,s_2,s_3 \right\}\) and a weight vector with two components, \(\mathbf{w}=(w_1, w_2)^\top\); the representable value functions then form a plane (subspace) inside the three-dimensional space of all value functions.
\(v_{\mathbf{w}}\): a value function in the representable subspace
\(v_{\pi}\): the true value function, which is generally too complex to be represented exactly by any \(v_{\mathbf{w}}\)
\(\|v\|_\mu^2 \doteq \sum_{s \in \mathcal{S}} \mu(s) v(s)^2\): the \(\mu\)-weighted squared norm, used to measure the distance between value functions
\(\overline{\mathrm{VE}}(\mathbf{w})=\left\|v_{\mathbf{w}}-v_\pi\right\|_\mu^2\): mean squared value error
Projection operator: \(\Pi v \doteq v_{\mathbf{w}} \quad \text { where } \quad \mathbf{w}=\underset{\mathbf{w} \in \mathbb{R}^d}{\arg \min }\left\|v-v_{\mathbf{w}}\right\|_\mu^2\)
Bellman error BE:
\[ \begin{aligned} v_\pi(s)&=\sum_a \pi(a \mid s) \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_\pi\left(s^{\prime}\right)\right], \quad \text { for all } s \in \mathcal{S} \\ \bar{\delta}_{\mathbf{w}}(s) & \doteq\left(\sum_a \pi(a \mid s) \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_{\mathbf{w}}\left(s^{\prime}\right)\right]\right)-v_{\mathbf{w}}(s) \\ & =\mathbb{E}_\pi\left[R_{t+1}+\gamma v_{\mathbf{w}}\left(S_{t+1}\right)-v_{\mathbf{w}}\left(S_t\right) \mid S_t=s, A_t \sim \pi\right] \end{aligned} \] Conclusion: The Bellman error is the expectation of the TD error.
Mean squared Bellman Error: \(\overline{\mathrm{BE}}(\mathbf{w})=\left\|\bar{\delta}_{\mathbf{w}}\right\|_\mu^2\)
The minimizer of \(\overline{\mathrm{BE}}\) is usually different from the minimizer of \(\overline{\mathrm{VE}}\).
Bellman operator \(B_\pi: \mathbb{R}^{|\mathcal{S}|} \rightarrow \mathbb{R}^{|\mathcal{S}|}\)
\[ \left(B_\pi v\right)(s) \doteq \sum_a \pi(a \mid s) \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v\left(s^{\prime}\right)\right] \] Therefore, \(\bar{\delta}_{\mathbf{w}}=B_\pi v_{\mathbf{w}}-v_{\mathbf{w}}\).
The true value function, \(v_\pi\), is the unique fixed point of the Bellman operator: \[ v_\pi=B_\pi v_\pi \] Mean squared projected Bellman error \(\overline{\mathrm{PBE}}\): \(\overline{\operatorname{PBE}}(\mathbf{w})=\left\|\Pi \bar{\delta}_{\mathbf{w}}\right\|_\mu^2\)
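To make the geometry concrete, here is a minimal sketch that computes \(\overline{\mathrm{VE}}\), \(\overline{\mathrm{BE}}\), and \(\overline{\mathrm{PBE}}\) for a small, made-up three-state Markov reward process with two linear features; the transition matrix, rewards, features, and state weighting are all assumptions of this example, not anything from the book.

```python
import numpy as np

# A made-up 3-state Markov reward process under a fixed policy pi.
P = np.array([[0.1, 0.6, 0.3],      # P[s, s'] = transition probabilities
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])
r = np.array([1.0, 0.0, 2.0])       # expected one-step reward from each state
gamma = 0.9
mu = np.array([0.4, 0.35, 0.25])    # state weighting (e.g., a behavior-state distribution)
D = np.diag(mu)
X = np.array([[1.0, 0.0],           # two linear features per state
              [0.0, 1.0],
              [1.0, 1.0]])

v_pi = np.linalg.solve(np.eye(3) - gamma * P, r)      # true values: v = r + gamma P v

def errors(w):
    v_w = X @ w
    bellman = r + gamma * P @ v_w - v_w               # Bellman error vector delta_bar_w
    Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D     # projection onto the feature subspace
    ve = (v_w - v_pi) @ D @ (v_w - v_pi)              # mean squared value error
    be = bellman @ D @ bellman                        # mean squared Bellman error
    pbe = (Pi @ bellman) @ D @ (Pi @ bellman)         # mean squared projected Bellman error
    return ve, be, pbe

print(errors(np.array([0.5, 0.5])))
```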
Gradient Descent in the Bellman Error
In true stochastic gradient descent (SGD), updates are made that, in expectation, equal the negative gradient of an objective function. Semi-gradient methods are not true SGD methods in this sense, and they may diverge under off-policy training and in contrived cases of nonlinear function approximation.
It is therefore natural to seek true SGD methods for reinforcement learning; the starting place is the choice of an error or objective function to descend.
Minimize TD error
One-step TD error with discounting: \[ \delta_t=R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_t\right)-\hat{v}\left(S_t, \mathbf{w}_t\right) \] Mean squared TD error: \[ \begin{aligned} \overline{\mathrm{TDE}}(\mathbf{w})&=\sum_{s \in \mathcal{S}} \mu(s) \mathbb{E}\left[\delta_t^2 \mid S_t=s, A_t \sim \pi\right] \\ & =\sum_{s \in \mathcal{S}} \mu(s) \mathbb{E}\left[\rho_t \delta_t^2 \mid S_t=s, A_t \sim b\right] \\ & =\mathbb{E}_b\left[\rho_t \delta_t^2\right] \end{aligned} \] The per-step SGD update is \[ \begin{aligned} \mathbf{w}_{t+1} & =\mathbf{w}_t-\frac{1}{2} \alpha \nabla\left(\rho_t \delta_t^2\right) \\ & =\mathbf{w}_t-\alpha \rho_t \delta_t \nabla \delta_t \\ & =\mathbf{w}_t+\alpha \rho_t \delta_t\left(\nabla \hat{v}\left(S_t, \mathbf{w}_t\right)-\gamma \nabla \hat{v}\left(S_{t+1}, \mathbf{w}_t\right)\right) \end{aligned} \] This is called the naive residual-gradient algorithm. Although it converges robustly, it converges to the minimum of \(\overline{\mathrm{TDE}}\), which is generally not where we want to be: the values it finds can differ from the true values even in simple problems.
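A minimal sketch of the naive residual-gradient update for linear \(\hat v\); the function and variable names are placeholders.

```python
import numpy as np

def naive_residual_gradient_step(w, x_s, x_s_next, r, rho, alpha, gamma):
    """SGD step on the squared TD error rho * delta^2 with linear v_hat = w @ x."""
    delta = r + gamma * w @ x_s_next - w @ x_s
    # Unlike semi-gradient TD(0), the gradient is taken through BOTH value estimates.
    return w + alpha * rho * delta * (x_s - gamma * x_s_next)
```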
Minimize Bellman error
Since the Bellman error is the expectation of the TD error, we can instead perform gradient descent on it. \[ \begin{aligned} \mathbf{w}_{t+1} & =\mathbf{w}_t-\frac{1}{2} \alpha \nabla\left(\mathbb{E}_\pi\left[\delta_t\right]^2\right) \\ & =\mathbf{w}_t-\frac{1}{2} \alpha \nabla\left(\mathbb{E}_b\left[\rho_t \delta_t\right]^2\right) \\ & =\mathbf{w}_t-\alpha \mathbb{E}_b\left[\rho_t \delta_t\right] \nabla \mathbb{E}_b\left[\rho_t \delta_t\right] \\ & =\mathbf{w}_t-\alpha \mathbb{E}_b\left[\rho_t\left(R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}\right)-\hat{v}\left(S_t, \mathbf{w}\right)\right)\right] \mathbb{E}_b\left[\rho_t \nabla \delta_t\right] \\ & =\mathbf{w}_t+\alpha\left[\mathbb{E}_b\left[\rho_t\left(R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}\right)\right)\right]-\hat{v}\left(S_t, \mathbf{w}\right)\right]\left[\nabla \hat{v}\left(S_t, \mathbf{w}\right)-\gamma \mathbb{E}_b\left[\rho_t \nabla \hat{v}\left(S_{t+1}, \mathbf{w}\right)\right]\right] \end{aligned} \] This is called the residual-gradient algorithm.
If we simply plug sample values into all the expectations, using the same next state \(S_{t+1}\) in both factors, the update reduces to the naive residual-gradient algorithm above and the product estimate is biased, because the two factors are correlated through \(S_{t+1}\). To obtain an unbiased estimate of the product we need two independent samples of the next state from \(S_t\) (double sampling), which is possible with a simulator, or trivially in deterministic environments, but not from a single stream of real experience. With double sampling, the residual-gradient algorithm is a true SGD method and converges robustly to the minimum of \(\overline{\mathrm{BE}}\).
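A minimal sketch of the residual-gradient update with double sampling, shown in the on-policy case (\(\rho_t=1\)) for simplicity; `sample_next` is a hypothetical simulator hook assumed to return an independent successor state on each call.

```python
def residual_gradient_step(w, x, s, a, r, alpha, gamma, sample_next):
    """On-policy residual-gradient step for linear v_hat = w @ x(s), with double sampling.

    sample_next(s, a) must draw an independent successor state each call,
    which in practice requires a simulator or a deterministic environment.
    """
    s1 = sample_next(s, a)                       # first independent next-state sample
    s2 = sample_next(s, a)                       # second independent next-state sample
    delta = r + gamma * w @ x(s1) - w @ x(s)     # TD error built from the first sample
    grad = x(s) - gamma * x(s2)                  # gradient factor built from the second sample
    return w + alpha * delta * grad
```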
Three disadvantages of minimizing \(\overline{\mathrm{BE}}\):
- Convergence is slow compared with semi-gradient methods
- It can still converge to the wrong values (minimizing \(\overline{\mathrm{BE}}\) is not the same as minimizing \(\overline{\mathrm{VE}}\))
- The \(\overline{\mathrm{BE}}\) objective is not learnable from data (next section)
The Bellman Error is Not Learnable
Unlearnable: a quantity that cannot be determined even from an infinite amount of observed data.
An example:
Consider two MRPs that generate exactly the same observable data: the left one has a single state, while the right one has two states that produce the same feature vector, so from the data alone we cannot tell whether there is one state or two.
Suppose \(\gamma=0\). The true state values are 1 (the single left state) and 0 and 2 (the two right states). If \(w=1\), then \(\overline{\mathrm{VE}}_{\text{left}}=0\), while \(\overline{\mathrm{VE}}_{\text{right}}=0.5(1-0)^2+0.5(1-2)^2=1\). The two MRPs produce the same data distribution but different \(\overline{\mathrm{VE}}\), so \(\overline{\mathrm{VE}}\) is not learnable. A similar counterexample shows that \(\overline{\mathrm{BE}}\), and even its minimizer, are not learnable, which is why minimizing the Bellman error is not viable when only data, and not the MDP itself, is available.
Gradient-TD Methods
In order to obtain an SGD method for \(\overline{\mathrm{PBE}}\), we need to get its gradient. Writing the objective in matrix form, with feature matrix \(\mathbf{X}\) and diagonal weighting \(\mathbf{D}=\operatorname{diag}(\mu)\): \[ \begin{aligned} \overline{\operatorname{PBE}}(\mathbf{w}) & =\left\|\Pi \bar{\delta}_{\mathbf{w}}\right\|_\mu^2 \\ & =\left(\Pi \bar{\delta}_{\mathbf{w}}\right)^{\top} \mathbf{D} \Pi \bar{\delta}_{\mathbf{w}} \\ & =\bar{\delta}_{\mathbf{w}}^{\top} \Pi^{\top} \mathbf{D} \Pi \bar{\delta}_{\mathbf{w}} \\ & =\bar{\delta}_{\mathbf{w}}^{\top} \mathbf{D} \mathbf{X}\left(\mathbf{X}^{\top} \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \mathbf{D} \bar{\delta}_{\mathbf{w}} \quad \left(\text{using the identity } \Pi^{\top} \mathbf{D} \Pi=\mathbf{D X}\left(\mathbf{X}^{\top} \mathbf{D X}\right)^{-1} \mathbf{X}^{\top} \mathbf{D}\right) \\ & =\left(\mathbf{X}^{\top} \mathbf{D} \bar{\delta}_{\mathbf{w}}\right)^{\top}\left(\mathbf{X}^{\top} \mathbf{D} \mathbf{X}\right)^{-1}\left(\mathbf{X}^{\top} \mathbf{D} \bar{\delta}_{\mathbf{w}}\right) \end{aligned} \] The gradient with respect to \(\mathbf{w}\) is \[ \nabla \overline{\mathrm{PBE}}(\mathbf{w})=2 \nabla\left[\mathbf{X}^{\top} \mathbf{D} \bar{\delta}_{\mathbf{w}}\right]^{\top}\left(\mathbf{X}^{\top} \mathbf{D} \mathbf{X}\right)^{-1}\left(\mathbf{X}^{\top} \mathbf{D} \bar{\delta}_{\mathbf{w}}\right) \] The last factor: \[ \mathbf{X}^{\top} \mathbf{D} \bar{\delta}_{\mathbf{w}}=\sum_s \mu(s) \mathbf{x}(s) \bar{\delta}_{\mathbf{w}}(s)=\mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right] \] The first factor: \[ \begin{aligned} \nabla \mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right]^{\top} & =\mathbb{E}\left[\rho_t \nabla \delta_t^{\top} \mathbf{x}_t^{\top}\right] \\ & =\mathbb{E}\left[\rho_t \nabla\left(R_{t+1}+\gamma \mathbf{w}^{\top} \mathbf{x}_{t+1}-\mathbf{w}^{\top} \mathbf{x}_t\right)^{\top} \mathbf{x}_t^{\top}\right] \quad (\text{using the episodic } \delta_t) \\ & =\mathbb{E}\left[\rho_t\left(\gamma \mathbf{x}_{t+1}-\mathbf{x}_t\right) \mathbf{x}_t^{\top}\right] \end{aligned} \] The middle factor: \[ \mathbf{X}^{\top} \mathbf{D} \mathbf{X}=\sum_s \mu(s) \mathbf{x}_s \mathbf{x}_s^{\top}=\mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^{\top}\right] \] In summary: \[ \nabla \overline{\mathrm{PBE}}(\mathbf{w})=2 \mathbb{E}\left[\rho_t\left(\gamma \mathbf{x}_{t+1}-\mathbf{x}_t\right) \mathbf{x}_t^{\top}\right] \mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^{\top}\right]^{-1} \mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right] \] The first and last factors both depend on the next feature vector \(\mathbf{x}_{t+1}\), so sampling their product naively would be biased. Instead, we estimate and store the product of the last two factors as a separate vector \(\mathbf{v}\): \[ \mathbf{v} \approx \mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^{\top}\right]^{-1} \mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right] \] This \(\mathbf{v}\) is the least-squares solution to the problem of approximating \(\rho_t \delta_t\) by \(\mathbf{v}^{\top}\mathbf{x}_t\), and it can be learned incrementally with the (importance-weighted) LMS rule: \[ \mathbf{v}_{t+1} \doteq \mathbf{v}_t+\beta \rho_t\left(\delta_t-\mathbf{v}_t^{\top} \mathbf{x}_t\right) \mathbf{x}_t \] The update rule for \(\mathbf{w}\) is then \[ \begin{aligned} \mathbf{w}_{t+1} & =\mathbf{w}_t-\frac{1}{2} \alpha \nabla \overline{\mathrm{PBE}}\left(\mathbf{w}_t\right) \\ & =\mathbf{w}_t-\frac{1}{2} \alpha\, 2\, \mathbb{E}\left[\rho_t\left(\gamma \mathbf{x}_{t+1}-\mathbf{x}_t\right) \mathbf{x}_t^{\top}\right] \mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^{\top}\right]^{-1} \mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right] \\ & =\mathbf{w}_t+\alpha \mathbb{E}\left[\rho_t\left(\mathbf{x}_t-\gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^{\top}\right] \mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^{\top}\right]^{-1} \mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right] \\ & \approx \mathbf{w}_t+\alpha \mathbb{E}\left[\rho_t\left(\mathbf{x}_t-\gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^{\top}\right] \mathbf{v}_t \\ & \approx \mathbf{w}_t+\alpha \rho_t\left(\mathbf{x}_t-\gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^{\top} \mathbf{v}_t \end{aligned} \] This algorithm is called GTD2.
A slightly better algorithm can be derived by doing a few more analytic steps before substituting in \(\mathbf{v}_t\). \[ \begin{aligned} \mathbf{w}_{t+1} & =\mathbf{w}_t+\alpha \mathbb{E}\left[\rho_t\left(\mathbf{x}_t-\gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^{\top}\right] \mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^{\top}\right]^{-1} \mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right] \\ & =\mathbf{w}_t+\alpha\left(\mathbb{E}\left[\rho_t \mathbf{x}_t \mathbf{x}_t^{\top}\right]-\gamma \mathbb{E}\left[\rho_t \mathbf{x}_{t+1} \mathbf{x}_t^{\top}\right]\right) \mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^{\top}\right]^{-1} \mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right] \\ & =\mathbf{w}_t+\alpha\left(\mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^{\top}\right]-\gamma \mathbb{E}\left[\rho_t \mathbf{x}_{t+1} \mathbf{x}_t^{\top}\right]\right) \mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^{\top}\right]^{-1} \mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right] \\ & =\mathbf{w}_t+\alpha\left(\mathbb{E}\left[\mathbf{x}_t \rho_t \delta_t\right]-\gamma \mathbb{E}\left[\rho_t \mathbf{x}_{t+1} \mathbf{x}_t^{\top}\right] \mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^{\top}\right]^{-1} \mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right]\right) \\ & \approx \mathbf{w}_t+\alpha\left(\mathbb{E}\left[\mathbf{x}_t \rho_t \delta_t\right]-\gamma \mathbb{E}\left[\rho_t \mathbf{x}_{t+1} \mathbf{x}_t^{\top}\right] \mathbf{v}_t\right) \\ & \approx \mathbf{w}_t+\alpha \rho_t\left(\delta_t \mathbf{x}_t-\gamma \mathbf{x}_{t+1} \mathbf{x}_t^{\top} \mathbf{v}_t\right) \end{aligned} \] This algorithm is called TDC, or GTD(0).
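A minimal sketch of both updates with linear features, following the final sampled forms above; \(\alpha\) and \(\beta\) are the two step sizes, and all function and variable names are placeholders.

```python
import numpy as np

def gtd2_step(w, v, x_t, x_next, r, rho, alpha, beta, gamma):
    """One GTD2 update for linear v_hat = w @ x, with secondary weights v."""
    delta = r + gamma * w @ x_next - w @ x_t
    w = w + alpha * rho * (x_t - gamma * x_next) * (x_t @ v)
    v = v + beta * rho * (delta - v @ x_t) * x_t            # LMS update for v
    return w, v

def tdc_step(w, v, x_t, x_next, r, rho, alpha, beta, gamma):
    """One TDC / GTD(0) update for linear v_hat = w @ x, with secondary weights v."""
    delta = r + gamma * w @ x_next - w @ x_t
    w = w + alpha * rho * (delta * x_t - gamma * x_next * (x_t @ v))
    v = v + beta * rho * (delta - v @ x_t) * x_t            # same LMS update for v
    return w, v
```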
Emphatic-TD Methods
One-step Emphatic-TD recovers, in effect, the distribution of updates under the target policy by emphasizing some states and de-emphasizing others: \[ \begin{aligned} \delta_t &= R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_t\right)-\hat{v}\left(S_t, \mathbf{w}_t\right) \\ \mathbf{w}_{t+1} &= \mathbf{w}_t+\alpha M_t \rho_t \delta_t \nabla \hat{v}\left(S_t, \mathbf{w}_t\right) \\ M_t &= \gamma \rho_{t-1} M_{t-1}+I_t \end{aligned} \] where \(I_t\) is the interest in state \(S_t\) and \(M_t\) is the emphasis, initialized with \(M_{t-1}=0\).
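A minimal sketch of the one-step Emphatic-TD update with linear features; setting the interest to 1 for every state is an assumption made here purely for illustration.

```python
import numpy as np

def emphatic_td_step(w, M_prev, rho_prev, x_t, x_next, r, rho, alpha, gamma, interest=1.0):
    """One-step Emphatic-TD update for linear v_hat = w @ x.

    M_prev and rho_prev are the emphasis and importance ratio from the previous
    step (initialize with M_prev = 0 at the start of learning).
    """
    M = gamma * rho_prev * M_prev + interest        # emphasis
    delta = r + gamma * w @ x_next - w @ x_t        # TD error
    w = w + alpha * M * rho * delta * x_t           # emphasis- and rho-weighted update
    return w, M
```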
Reducing Variance
Off-policy learning inherently has greater variance than on-policy learning. Ways to reduce it:
- Weighted importance sampling gives lower-variance estimates than ordinary importance sampling (though it is harder to adapt to function approximation)
- Tree backup algorithm
- Allow the target policy to be determined in part by the behavior policy
Summary
The first part of the challenge: correcting the targets of learning for the behavior policy
- Importance sampling
The second part of the challenge: the instability of semi-gradient TD methods that combine bootstrapping with function approximation and off-policy training (the deadly triad)
- Addressed by Gradient-TD methods (true SGD on \(\overline{\mathrm{PBE}}\)) and by Emphatic-TD methods