
Barto-Sutton Chap.10 On-policy Control with Approximation

Prediction problem: given a policy, estimate the state-value function.

  • Tabular solution: \(s_t \rightarrow V(s_t)\)

  • Approximate solution: \(s_t \rightarrow \hat V(s_t) \approx V(s_t)\)

  • Training samples have the form \(s_t \rightarrow U_t\), where the update target \(U_t\) depends on the method:

    • Monte Carlo: \(U_t = G_t\)
    • TD(0): \(U_t = r_t + \gamma \hat V(s_{t+1})\) (see the sketch after this list)
    • TD(\(\lambda\)): \(U_t = G_t^\lambda\) (the \(\lambda\)-return)
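
As a concrete illustration of how these targets drive learning with approximation, here is a minimal sketch of semi-gradient TD(0) prediction with a linear approximator \(\hat V(s) = w^\top x(s)\); `env`, `policy`, and `features` are assumed interfaces, not anything defined in the chapter.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      alpha=0.01, gamma=0.99, episodes=100):
    """Semi-gradient TD(0) prediction with a linear approximator
    V_hat(s) = w . x(s).  `env`, `policy`, and `features` are
    assumed interfaces, not part of the original post."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)   # assumed to return (state, reward, done)
            x = features(s)
            v = w @ x
            # TD(0) target: U_t = r_t + gamma * V_hat(s_{t+1})
            target = r + (0.0 if done else gamma * (w @ features(s_next)))
            w += alpha * (target - v) * x   # semi-gradient update
            s = s_next
    return w
```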

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Abstract

Two main challenges of policy gradient methods:

  • The large number of samples typically required (due to the high variance of gradient estimates)
  • The difficulty of obtaining stable and steady improvement despite the non-stationarity of the incoming data

Solutions:

  • Using a value function to reduce variance, at the cost of some bias (see the GAE sketch below)
  • Using a trust region optimization procedure for both the policy and the value function
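
A minimal sketch of the advantage estimator the paper introduces, \(\hat A_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}\) with \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\); it assumes `rewards` and `values` are plain arrays from a single rollout, with one extra bootstrap entry in `values`.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation:
    A_t = sum_l (gamma*lam)^l * delta_{t+l},
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    `values` has one extra entry (the bootstrap value of the last state)."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae        # recursive form of the discounted sum
        advantages[t] = gae
    return advantages
```

The parameter \(\lambda\) trades bias against variance: \(\lambda = 0\) reduces to the one-step TD residual (low variance, biased), while \(\lambda = 1\) recovers the Monte Carlo advantage (unbiased, high variance).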

Asynchronous Methods for Deep Reinforcement Learning

Diagram of A3C high-level architecture (Image Source)

Abstract

  • Propose a conceptually simple and lightweight framework for DRL
  • Present asynchronous variants of four standard reinforcement learning algorithms
  • The best performing method, an asynchronous variant of actor-critic, surpasses the current state of the art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU; the asynchronous worker pattern is sketched below
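
A minimal sketch of the asynchronous-update pattern (not the full A3C algorithm): several worker threads interact with their own copies of the environment and apply gradient updates to a shared parameter vector. `make_env` and `compute_gradient` are hypothetical stand-ins for an environment constructor and an actor-critic gradient computation.

```python
import threading
import numpy as np

shared_params = np.zeros(8)     # shared global parameters (toy size)
lock = threading.Lock()

def worker(worker_id, make_env, compute_gradient,
           steps=1000, alpha=1e-2):
    """One asynchronous actor-learner: act in a private env copy,
    compute a gradient from local experience, update the shared params."""
    env = make_env(worker_id)
    state = env.reset()
    for _ in range(steps):
        params = shared_params.copy()                 # local snapshot
        grad, state = compute_gradient(params, env, state)
        with lock:
            shared_params += alpha * grad             # asynchronous update

def run_async_workers(make_env, compute_gradient, n_workers=4):
    threads = [threading.Thread(target=worker,
                                args=(i, make_env, compute_gradient))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_params
```

The paper applies such updates Hogwild!-style without locking; the lock here only keeps the toy example obviously safe.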

Trust Region Policy Optimization

Abstract

  1. Propose an iterative procedure for policy optimization with guaranteed monotonic improvement, called TRPO (the constrained surrogate problem is written out after this list)

  2. Experiments demonstrate its robust performance
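
For reference, the constrained surrogate problem that the practical TRPO algorithm solves at each iteration (with \(\theta_{\text{old}}\) the current policy parameters, \(\hat A_t\) an advantage estimate, and \(\delta\) a bound on the average KL divergence):

\[
\max_{\theta} \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat A_t\right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big) \right] \le \delta .
\]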


Barto-Sutton Chap.11 Off-policy Methods with Approximation

The challenge of off-policy learning has two parts:

  • The target of the updates (not to be confused with the target policy)
    • Importance sampling (may increase variance; see the sketch after this list)
  • The distribution of the updates
    • Importance sampling
    • Developing true-gradient methods that do not rely on any special distribution for stability
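
A minimal sketch of the first remedy: reweighting the update target with a per-step importance-sampling ratio, as in semi-gradient off-policy TD(0). `target_prob`, `behavior_prob`, and `features` are hypothetical callables; note that this corrects the target but not the update distribution, which is exactly why stability is not guaranteed under function approximation.

```python
import numpy as np

def off_policy_td0_step(w, s, a, r, s_next, done,
                        target_prob, behavior_prob, features,
                        alpha=0.01, gamma=0.99):
    """One semi-gradient off-policy TD(0) update with a linear V_hat.
    rho = pi(a|s) / b(a|s) reweights the TD error; large ratios are
    what can blow up the variance of the updates."""
    rho = target_prob(s, a) / behavior_prob(s, a)
    x = features(s)
    v = w @ x
    target = r + (0.0 if done else gamma * (w @ features(s_next)))
    return w + alpha * rho * (target - v) * x
```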