Asynchronous Methods for Deep Reinforcement Learning
Abstract
- Propose a conceptually simple and lightweight framework for DRL
- Present asynchronous variants of four standard reinforcement learning algorithms
- The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU
Results
- Learning speed: the asynchronous algorithms learn faster than DQN
- Robustness: A3C is robust across a wide range of learning rates and random initializations

- Data efficiency with different numbers of threads: A3C is the best method

- Training speed with different numbers of threads: A3C is the best method

Introduction
Because online RL data is non-stationary and strongly correlated, combining it with deep neural networks is unstable. Instead of stabilizing training with an experience replay buffer, the authors propose a different paradigm: asynchronous methods that run multiple agents in parallel.
- Run on a single machine with a standard multi-core CPU
- Takes much less training time than previous GPU-based algorithms, yet performs better
- A3C also masters a variety of continuous motor control tasks
Related Work
General Reinforcement Learning Architecture (Gorila) in a distributed setting. (Nair et al., 2015)
Parallelism used to speed up large matrix operations, not to parallelize the collection of experience or to stabilize learning. (Li & Schuurmans, 2011)
Multiple separate actor-learners to accelerate training; weights are updated using peer-to-peer communication. (Grounds & Kudenko, 2008)
Convergence properties of Q-learning in the asynchronous optimization setting. (Tsitsiklis, 1994)
Related problem of distributed dynamic programming. (Bertsekas, 1982)
Reinforcement Learning Background
Introduces the RL setting and Q-learning, with the standard Q-learning loss at iteration \(i\): \[ L_i\left(\theta_i\right)=\mathbb{E}\left[\left(r+\gamma \max_{a^{\prime}} Q\left(s^{\prime}, a^{\prime}; \theta_{i-1}\right)-Q\left(s, a; \theta_i\right)\right)^2\right] \]
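To make the objective concrete, here is a minimal sketch of the one-step Q-learning target and squared TD error for a single transition; `q_net`, `target_q_net`, and the transition format are illustrative assumptions, not the paper's code.

```python
import numpy as np

def q_learning_loss(q_net, target_q_net, transition, gamma=0.99):
    """One-step Q-learning squared error for a single transition.

    q_net / target_q_net are assumed to map a state to a vector of
    action values; `transition` is (s, a, r, s_next, done).
    """
    s, a, r, s_next, done = transition

    # Bootstrapped target uses the older parameters theta_{i-1}
    # (here: a separate target network) and the max over next actions.
    target = r if done else r + gamma * np.max(target_q_net(s_next))

    # Squared TD error, matching L_i(theta_i) above.
    td_error = target - q_net(s)[a]
    return td_error ** 2
```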
Asynchronous RL Framework
They use two main ideas to make all four algorithms practical given their design goal.
Four algorithms
1-step Sarsa, 1-step Q-Learning, n-step Q-Learning and advantage actor-critic
One-step and n-step Q-learning are off-policy; Sarsa and advantage actor-critic are on-policy
Two main ideas
- Use multiple CPU threads on a single machine instead of separate machines and a parameter server
  - Removes the communication cost
  - Enables Hogwild!-style lock-free updates (Recht et al., 2011), a parallel SGD method (see the sketch after this list)
- Drop the replay buffer and instead rely on parallel actors employing different exploration policies
  - Parallel actors decorrelate the data, playing a role similar to experience replay
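A minimal sketch of the Hogwild!-style setup in PyTorch, assuming a toy model and loss; `train_worker`, the worker count, and the hyperparameters are illustrative, not the paper's implementation. Each process updates the shared parameters in place without any locking.

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

def train_worker(rank, shared_model):
    # Each worker has its own optimizer but steps the *shared*
    # parameters directly and without locks (Hogwild!).
    torch.manual_seed(rank)                      # different behavior per worker
    optimizer = torch.optim.SGD(shared_model.parameters(), lr=1e-3)
    for _ in range(1000):
        x = torch.randn(8, 4)                    # stand-in for environment data
        loss = shared_model(x).pow(2).mean()     # stand-in for an RL loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                         # lock-free update of shared weights

if __name__ == "__main__":
    model = nn.Linear(4, 2)
    model.share_memory()                         # place parameters in shared memory
    workers = [mp.Process(target=train_worker, args=(rank, model)) for rank in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```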
Benefits of using multiple parallel actor-learners
- Reduces training time, roughly linearly in the number of parallel actor-learners
- Enables on-policy algorithms (Sarsa, actor-critic) without a replay buffer
- Stabilizes learning
Asynchronous one-step Q-learning (see the sketch after this list)
- Accumulating gradients over multiple timesteps before applying them is similar to using mini-batches
- Reduces the chance that actor-learners overwrite each other's updates
- Trades computational efficiency for data efficiency
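A simplified per-thread loop in the spirit of asynchronous one-step Q-learning, showing gradient accumulation over `I_async_update` steps before a shared update; the gym-like environment API, the epsilon-greedy choice, and all names here are assumptions for illustration, not the paper's pseudocode.

```python
import torch

def actor_learner(env, q_net, target_net, optimizer,
                  gamma=0.99, I_async_update=5, I_target=10_000, T_max=1_000_000):
    # q_net and target_net are assumed to be shared across threads;
    # optimizer steps the shared q_net parameters.
    t = 0
    accumulated_loss = 0.0
    s = env.reset()
    while t < T_max:
        # Epsilon-greedy action from the shared network (epsilon schedule omitted).
        with torch.no_grad():
            greedy = torch.rand(1).item() > 0.1
            a = q_net(s).argmax().item() if greedy else env.action_space.sample()
        s_next, r, done, _ = env.step(a)

        # One-step Q-learning target from the shared target network.
        with torch.no_grad():
            y = r if done else r + gamma * target_net(s_next).max()
        accumulated_loss = accumulated_loss + (y - q_net(s)[a]) ** 2
        t += 1
        s = env.reset() if done else s_next

        # Accumulating gradients acts like a mini-batch and reduces the
        # chance that threads overwrite each other's updates.
        if t % I_async_update == 0 or done:
            optimizer.zero_grad()
            accumulated_loss.backward()
            optimizer.step()
            accumulated_loss = 0.0
        if t % I_target == 0:
            target_net.load_state_dict(q_net.state_dict())
```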
Asynchronous advantage actor-critic (A3C)
Actor policy: \(\pi(a_t|s_t;\theta)\)
Value function: \(V(s_t;\theta_v)\)

The authors found that adding the entropy of the policy \(\pi\) to the objective improved exploration by discouraging premature convergence to suboptimal deterministic policies. The gradient for the actor becomes \[ \nabla_{\theta^{\prime}} \log \pi\left(a_t \mid s_t ; \theta^{\prime}\right)\left(R_t-V\left(s_t ; \theta_v\right)\right)+\beta \nabla_{\theta^{\prime}} H\left(\pi\left(s_t ; \theta^{\prime}\right)\right) \] This is the same idea as the entropy term in SAC.
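As a concrete illustration, here is a minimal sketch of the A3C objective for a single n-step rollout, combining the advantage-weighted policy gradient term, a squared-error critic term, and the entropy bonus weighted by \(\beta\); `policy_logits_and_value`, `value_coef`, and the rollout format are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def a3c_loss(policy_logits_and_value, states, actions, rewards,
             gamma=0.99, beta=0.01, value_coef=0.5):
    # R would be bootstrapped with V(s_last) if the rollout did not terminate.
    R = 0.0
    policy_loss, value_loss, entropy = 0.0, 0.0, 0.0
    # Walk the rollout backwards to form the n-step returns R_t.
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        logits, value = policy_logits_and_value(states[t])
        log_probs = F.log_softmax(logits, dim=-1)
        probs = log_probs.exp()
        advantage = R - value                    # R_t - V(s_t; theta_v)
        # Policy gradient term: log pi(a_t|s_t) * advantage (detached for the actor).
        policy_loss = policy_loss - log_probs[actions[t]] * advantage.detach()
        # Critic term: squared advantage.
        value_loss = value_loss + advantage.pow(2)
        # Entropy bonus H(pi(.|s_t)) discourages premature deterministic policies.
        entropy = entropy + -(probs * log_probs).sum()
    # Minimizing this loss ascends the policy gradient with the entropy bonus.
    return policy_loss + value_coef * value_loss - beta * entropy
```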