
Barto-Sutton Chap.10 On-policy Control with Approximation

Prediction problem: given a policy, estimate the state-value function.

  • Tabular solution: \(s_t \rightarrow V(s_t)\)

  • Approximate solution: \(s_t \rightarrow \hat V(s_t) \approx V(s_t)\)

  • Training samples have the form \(s_t \rightarrow U_t\), where the update target \(U_t\) depends on the method:

    • Monte Carlo: \(U_t = G_t\)
    • TD(0): \(U_t = r_t + \gamma \hat V(s_{t+1})\) (see the sketch after this list)
    • TD(\(\lambda\)): \(U_t = G_t^\lambda\) (the \(\lambda\)-return)
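
As a concrete illustration of how these targets drive learning with approximation, here is a minimal sketch of semi-gradient TD(0) prediction with a linear approximator \(\hat V(s) = w^\top x(s)\); `env`, `policy`, and `features` are assumed interfaces, not anything defined in the chapter.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      alpha=0.01, gamma=0.99, episodes=100):
    """Semi-gradient TD(0) prediction with a linear approximator
    V_hat(s) = w . x(s).  `env`, `policy`, and `features` are
    assumed interfaces, not part of the original post."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)   # assumed to return (state, reward, done)
            x = features(s)
            v = w @ x
            # TD(0) target: U_t = r_t + gamma * V_hat(s_{t+1})
            target = r + (0.0 if done else gamma * (w @ features(s_next)))
            w += alpha * (target - v) * x   # semi-gradient update
            s = s_next
    return w
```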

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Abstract

Two main challenges of policy gradient methods:

  • The large number of samples typically required (due to the high variance of gradient estimates)
  • The difficulty of obtaining stable and steady improvement despite the non-stationarity of the incoming data

Solutions:

  • Using a value function to reduce variance, at the cost of some bias (see the GAE sketch below)
  • Using a trust region optimization procedure for both the policy and the value function
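
A minimal sketch of the advantage estimator the paper introduces, \(\hat A_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}\) with \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\); it assumes `rewards` and `values` are plain arrays from a single rollout, with one extra bootstrap entry in `values`.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation:
    A_t = sum_l (gamma*lam)^l * delta_{t+l},
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    `values` has one extra entry (the bootstrap value of the last state)."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae        # recursive form of the discounted sum
        advantages[t] = gae
    return advantages
```

The parameter \(\lambda\) trades bias against variance: \(\lambda = 0\) reduces to the one-step TD residual (low variance, biased), while \(\lambda = 1\) recovers the Monte Carlo advantage (unbiased, high variance).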

Asynchronous Methods for Deep Reinforcement Learning

Diagram of A3C high-level architecture (Image Source)

Abstract

  • Propose a conceptually simple and lightweight framework for DRL
  • Present asynchronous variants of four standard reinforcement learning algorithms
  • The best performing method, an asynchronous variant of actor-critic, surpasses the current state of the art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU; the asynchronous worker pattern is sketched below
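
A minimal sketch of the asynchronous-update pattern (not the full A3C algorithm): several worker threads interact with their own copies of the environment and apply gradient updates to a shared parameter vector. `make_env` and `compute_gradient` are hypothetical stand-ins for an environment constructor and an actor-critic gradient computation.

```python
import threading
import numpy as np

shared_params = np.zeros(8)     # shared global parameters (toy size)
lock = threading.Lock()

def worker(worker_id, make_env, compute_gradient,
           steps=1000, alpha=1e-2):
    """One asynchronous actor-learner: act in a private env copy,
    compute a gradient from local experience, update the shared params."""
    env = make_env(worker_id)
    state = env.reset()
    for _ in range(steps):
        params = shared_params.copy()                 # local snapshot
        grad, state = compute_gradient(params, env, state)
        with lock:
            shared_params += alpha * grad             # asynchronous update

def run_async_workers(make_env, compute_gradient, n_workers=4):
    threads = [threading.Thread(target=worker,
                                args=(i, make_env, compute_gradient))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_params
```

The paper applies such updates Hogwild!-style without locking; the lock here only keeps the toy example obviously safe.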

Trust Region Policy Optimization

Abstract

  1. Propose an iterative procedure for policy optimization with guaranteed monotonic improvement, called TRPO (the constrained surrogate problem is written out after this list)

  2. Experiments demonstrate its robust performance
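
For reference, the constrained surrogate problem that the practical TRPO algorithm solves at each iteration (with \(\theta_{\text{old}}\) the current policy parameters, \(\hat A_t\) an advantage estimate, and \(\delta\) a bound on the average KL divergence):

\[
\max_{\theta} \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat A_t\right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big) \right] \le \delta .
\]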


Barto-Sutton Chap.11 Off-policy Methods with Approximation

The challenge of off-policy learning has two parts:

  • The target of the updates (not to be confused with the target policy)
    • Importance sampling (may increase variance; see the sketch after this list)
  • The distribution of the updates
    • Importance sampling
    • Developing true-gradient methods that do not rely on any special distribution for stability
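
A minimal sketch of the first remedy: reweighting the update target with a per-step importance-sampling ratio, as in semi-gradient off-policy TD(0). `target_prob`, `behavior_prob`, and `features` are hypothetical callables; note that this corrects the target but not the update distribution, which is exactly why stability is not guaranteed under function approximation.

```python
import numpy as np

def off_policy_td0_step(w, s, a, r, s_next, done,
                        target_prob, behavior_prob, features,
                        alpha=0.01, gamma=0.99):
    """One semi-gradient off-policy TD(0) update with a linear V_hat.
    rho = pi(a|s) / b(a|s) reweights the TD error; large ratios are
    what can blow up the variance of the updates."""
    rho = target_prob(s, a) / behavior_prob(s, a)
    x = features(s)
    v = w @ x
    target = r + (0.0 if done else gamma * (w @ features(s_next)))
    return w + alpha * rho * (target - v) * x
```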