Asynchronous Methods for Deep Reinforcement Learning
Abstract
- Propose a conceptually simple and lightweight framework for DRL
- Present asynchronous variants of four standard reinforcement learning algorithms
- The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU
Results
- Learning speed: the asynchronous algorithms learn faster than DQN
- Robustness: A3C is robust across a wide range of learning rates and random initializations

- Data efficiency with different numbers of threads: A3C is the best method

- Training speed with different numbers of threads: A3C is the best method

Introduction
Because online RL data is non-stationary and strongly correlated, combining it with deep neural networks is unstable. Instead of stabilizing training with an experience replay buffer, the authors propose a different paradigm: asynchronous methods that run multiple agents in parallel.
- Run on a single machine with a standard multi-core CPU
- Takes much less training time than previous GPU-based algorithms, yet performs better
- A3C also masters a variety of continuous motor control tasks
Related Work
General Reinforcement Learning Architecture (Gorila) in a distributed setting. (Nair et al., 2015)
Parallelism used to speed up large matrix operations, not to parallelize the collection of experience or to stabilize learning. (Li & Schuurmans, 2011)
Multiple separate actor-learners to accelerate training; weights are updated using peer-to-peer communication. (Grounds & Kudenko, 2008)
Convergence properties of Q-learning in the asynchronous optimization setting. (Tsitsiklis, 1994)
Related problem of distributed dynamic programming. (Bertsekas, 1982)
Reinforcement Learning Background
Introduces the RL setting and Q-learning, with the standard Q-learning loss at iteration \(i\): \[ L_i\left(\theta_i\right)=\mathbb{E}\left[\left(r+\gamma \max_{a^{\prime}} Q\left(s^{\prime}, a^{\prime}; \theta_{i-1}\right)-Q\left(s, a; \theta_i\right)\right)^2\right] \]
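To make the objective concrete, here is a minimal sketch of the one-step Q-learning target and squared TD error for a single transition; `q_net`, `target_q_net`, and the transition format are illustrative assumptions, not the paper's code.

```python
import numpy as np

def q_learning_loss(q_net, target_q_net, transition, gamma=0.99):
    """One-step Q-learning squared error for a single transition.

    q_net / target_q_net are assumed to map a state to a vector of
    action values; `transition` is (s, a, r, s_next, done).
    """
    s, a, r, s_next, done = transition

    # Bootstrapped target uses the older parameters theta_{i-1}
    # (here: a separate target network) and the max over next actions.
    target = r if done else r + gamma * np.max(target_q_net(s_next))

    # Squared TD error, matching L_i(theta_i) above.
    td_error = target - q_net(s)[a]
    return td_error ** 2
```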
Asynchronous RL Framework
They use two main ideas to make all four algorithms practical given their design goal.
Four algorithms
1-step Sarsa, 1-step Q-Learning, n-step Q-Learning and advantage actor-critic
One-step and n-step Q-learning are off-policy; Sarsa and advantage actor-critic are on-policy
Two main ideas
- Use multiple CPU threads on a single machine instead of separate machines and a parameter server
  - Removes the communication cost
  - Enables Hogwild!-style lock-free updates (Recht et al., 2011), a parallel SGD method (see the sketch after this list)
- Drop the replay buffer and instead rely on parallel actors employing different exploration policies
  - Parallel actors decorrelate the data, playing a role similar to experience replay
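A minimal sketch of the Hogwild!-style setup in PyTorch, assuming a toy model and loss; `train_worker`, the worker count, and the hyperparameters are illustrative, not the paper's implementation. Each process updates the shared parameters in place without any locking.

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

def train_worker(rank, shared_model):
    # Each worker has its own optimizer but steps the *shared*
    # parameters directly and without locks (Hogwild!).
    torch.manual_seed(rank)                      # different behavior per worker
    optimizer = torch.optim.SGD(shared_model.parameters(), lr=1e-3)
    for _ in range(1000):
        x = torch.randn(8, 4)                    # stand-in for environment data
        loss = shared_model(x).pow(2).mean()     # stand-in for an RL loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                         # lock-free update of shared weights

if __name__ == "__main__":
    model = nn.Linear(4, 2)
    model.share_memory()                         # place parameters in shared memory
    workers = [mp.Process(target=train_worker, args=(rank, model)) for rank in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```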
Benefits of using multiple parallel actor-learners
- Reduces training time, roughly linearly in the number of parallel actor-learners
- Enables on-policy algorithms (Sarsa, actor-critic) without a replay buffer
- Stabilizes learning
Asynchronous one-step Q-learning (see the sketch after this list)
- Accumulating gradients over multiple timesteps before applying them is similar to using mini-batches
- Reduces the chance that actor-learners overwrite each other's updates
- Trades computational efficiency for data efficiency
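A simplified per-thread loop in the spirit of asynchronous one-step Q-learning, showing gradient accumulation over `I_async_update` steps before a shared update; the gym-like environment API, the epsilon-greedy choice, and all names here are assumptions for illustration, not the paper's pseudocode.

```python
import torch

def actor_learner(env, q_net, target_net, optimizer,
                  gamma=0.99, I_async_update=5, I_target=10_000, T_max=1_000_000):
    # q_net and target_net are assumed to be shared across threads;
    # optimizer steps the shared q_net parameters.
    t = 0
    accumulated_loss = 0.0
    s = env.reset()
    while t < T_max:
        # Epsilon-greedy action from the shared network (epsilon schedule omitted).
        with torch.no_grad():
            greedy = torch.rand(1).item() > 0.1
            a = q_net(s).argmax().item() if greedy else env.action_space.sample()
        s_next, r, done, _ = env.step(a)

        # One-step Q-learning target from the shared target network.
        with torch.no_grad():
            y = r if done else r + gamma * target_net(s_next).max()
        accumulated_loss = accumulated_loss + (y - q_net(s)[a]) ** 2
        t += 1
        s = env.reset() if done else s_next

        # Accumulating gradients acts like a mini-batch and reduces the
        # chance that threads overwrite each other's updates.
        if t % I_async_update == 0 or done:
            optimizer.zero_grad()
            accumulated_loss.backward()
            optimizer.step()
            accumulated_loss = 0.0
        if t % I_target == 0:
            target_net.load_state_dict(q_net.state_dict())
```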
Asynchronous advantage actor-critic (A3C)
Actor policy: \(\pi(a_t|s_t;\theta)\)
Value function: \(V(s_t;\theta_v)\)

The authors found that adding the entropy of the policy \(\pi\) to the objective improved exploration by discouraging premature convergence to suboptimal deterministic policies. The gradient for the actor becomes \[ \nabla_{\theta^{\prime}} \log \pi\left(a_t \mid s_t ; \theta^{\prime}\right)\left(R_t-V\left(s_t ; \theta_v\right)\right)+\beta \nabla_{\theta^{\prime}} H\left(\pi\left(s_t ; \theta^{\prime}\right)\right) \] This is the same idea as the entropy term in SAC.
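As a concrete illustration, here is a minimal sketch of the A3C objective for a single n-step rollout, combining the advantage-weighted policy gradient term, a squared-error critic term, and the entropy bonus weighted by \(\beta\); `policy_logits_and_value`, `value_coef`, and the rollout format are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def a3c_loss(policy_logits_and_value, states, actions, rewards,
             gamma=0.99, beta=0.01, value_coef=0.5):
    # R would be bootstrapped with V(s_last) if the rollout did not terminate.
    R = 0.0
    policy_loss, value_loss, entropy = 0.0, 0.0, 0.0
    # Walk the rollout backwards to form the n-step returns R_t.
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        logits, value = policy_logits_and_value(states[t])
        log_probs = F.log_softmax(logits, dim=-1)
        probs = log_probs.exp()
        advantage = R - value                    # R_t - V(s_t; theta_v)
        # Policy gradient term: log pi(a_t|s_t) * advantage (detached for the actor).
        policy_loss = policy_loss - log_probs[actions[t]] * advantage.detach()
        # Critic term: squared advantage.
        value_loss = value_loss + advantage.pow(2)
        # Entropy bonus H(pi(.|s_t)) discourages premature deterministic policies.
        entropy = entropy + -(probs * log_probs).sum()
    # Minimizing this loss ascends the policy gradient with the entropy bonus.
    return policy_loss + value_coef * value_loss - beta * entropy
```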