
Asynchronous Methods for Deep Reinforcement Learning


(Figure: diagram of the A3C high-level architecture)

Abstract

  • Propose a conceptually simple and lightweight framework for DRL
  • Present asynchronous variants of four standard reinforcement learning algorithms
  • The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU

Results

  • Learning speed: the asynchronous algorithms learn faster than DQN

  • Robustness: A3C is robust to the choice of learning rate and random initialization
  • Data efficiency with different numbers of actor-learner threads: A3C is the most data-efficient method
  • Training speed with different numbers of actor-learner threads: A3C trains the fastest

Introduction

Online RL data is non-stationary and strongly correlated, which makes training deep neural networks unstable. Rather than relying on an experience replay buffer, the authors propose a different paradigm: asynchronous methods.

  • Run on a single machine with a standard multi-core CPU
  • Takes much less training time than previous GPU-based algorithms while performing better
  • A3C masters a variety of continuous motor control tasks

Related work:
  1. General Reinforcement Learning Architecture (Gorila) in a distributed setting. (Nair et al., 2015)

  2. Uses parallelism to speed up large matrix operations, but not to parallelize the collection of experience or to stabilize learning. (Li & Schuurmans, 2011)

  3. Multiple separate actor-learners accelerate training, updating weights via peer-to-peer communication. (Grounds & Kudenko, 2008)

  4. Convergence properties of Q-learning in the asynchronous optimization setting. (Tsitsiklis, 1994)

  5. Related problem of distributed dynamic programming. (Bertsekas, 1982)

Reinforcement Learning Background

This section reviews the standard RL setting and Q-learning. The Q-network parameters \(\theta_i\) are trained by minimizing the sequence of loss functions \[ L_i\left(\theta_i\right)=\mathbb{E}\left[\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime} ; \theta_{i-1}\right)-Q\left(s, a ; \theta_i\right)\right)^2\right] \]
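
As a concrete illustration (not from the paper), here is a minimal numpy sketch that evaluates this loss for a single transition \((s, a, r, s')\), using small tabular arrays `q_old` and `q_new` as stand-ins for the network outputs under \(\theta_{i-1}\) and \(\theta_i\):

```python
import numpy as np

# Toy illustration of the one-step Q-learning loss for a single transition.
gamma = 0.99
q_old = np.array([[1.0, 2.0],
                  [0.5, 1.5]])             # Q(., .; theta_{i-1})
q_new = np.array([[1.1, 1.9],
                  [0.4, 1.6]])             # Q(., .; theta_i)

s, a, r, s_next = 0, 1, 1.0, 1
target = r + gamma * q_old[s_next].max()   # r + gamma * max_a' Q(s', a'; theta_{i-1})
loss = (target - q_new[s, a]) ** 2         # squared TD error for this sample
print(loss)
```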

Asynchronous RL Framework

They use two main ideas to make all four algorithms practical given their design goal.

  • Four algorithms

    • One-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor-critic

    • Covers both off-policy (Q-learning) and on-policy (Sarsa, actor-critic) methods

  • Two main ideas

    • Use multiple CPU threads on a single machine instead of separate machines and a parameter server
      • Removes the communication cost
      • Enables Hogwild!-style updates (Recht et al., 2011)
        • A lock-free parallel SGD method (a minimal sketch follows this list)
    • Drop the replay buffer and rely on parallel actors employing different exploration policies
      • Less correlated data
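
The toy Python sketch below (not from the paper) illustrates the Hogwild! idea under simple assumptions: several threads fit a linear model and apply gradient updates to a shared parameter vector without any locking. Python threads do not give true parallel compute because of the GIL, so this only illustrates the lock-free update pattern.

```python
import threading
import numpy as np

# Shared parameter vector; Hogwild!-style training lets every thread read and
# write it without locks, accepting occasional overwritten updates.
true_w = np.arange(10.0)          # hypothetical target weights for a toy linear model
theta = np.zeros(10)

def worker(theta, seed, steps=2000, lr=0.01):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x = rng.normal(size=theta.shape)
        y = x @ true_w + 0.1 * rng.normal()
        grad = 2 * (x @ theta - y) * x     # gradient of the squared error
        theta -= lr * grad                 # lock-free, in-place update on shared memory

threads = [threading.Thread(target=worker, args=(theta, s)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.round(theta, 2))  # close to true_w despite the unsynchronized updates
```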

Benefits of using multiple parallel actor-learners

  • Reducing the training time
  • Using on-policy algorithms without replay buffer
  • Stabilizing learning

Asynchronous one-step Q-learning

(Figure: pseudocode of asynchronous one-step Q-learning for a single actor-learner thread)

  • Accumulating gradients over multiple timesteps before applying them is similar to using mini-batches
    • Reduces the chance of actor-learners overwriting each other's updates
    • Trades off computational efficiency for data efficiency (a minimal sketch of one actor-learner thread follows this list)
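
The sketch below is a rough reconstruction of one asynchronous one-step Q-learning actor-learner thread under stated assumptions, not the paper's exact pseudocode: `q_values`, `loss_grad`, and the `env` interface are hypothetical helpers, and the paper additionally maintains a shared global step counter across threads.

```python
import numpy as np

def actor_learner(shared_theta, target_theta, env, q_values, loss_grad,
                  T_max=1_000_000, gamma=0.99, lr=1e-4, epsilon=0.1,
                  I_target=10_000, I_async_update=5):
    """One actor-learner thread; shared_theta and target_theta are shared arrays.

    Assumed helpers: q_values(theta, s) -> Q(s, .; theta) as a vector,
    loss_grad(theta, s, a, y) -> gradient of (y - Q(s, a; theta))^2.
    """
    rng = np.random.default_rng()
    accumulated_grad = np.zeros_like(shared_theta)
    t = 0                                   # thread-local step counter (simplification)
    s = env.reset()
    while t < T_max:
        # epsilon-greedy action from the shared online network
        if rng.random() < epsilon:
            a = int(rng.integers(env.num_actions))
        else:
            a = int(np.argmax(q_values(shared_theta, s)))
        s_next, r, done = env.step(a)
        # one-step target from the periodically updated target network
        y = r if done else r + gamma * np.max(q_values(target_theta, s_next))
        # accumulate gradients instead of applying them every step (mini-batch-like)
        accumulated_grad += loss_grad(shared_theta, s, a, y)
        s = env.reset() if done else s_next
        t += 1
        if t % I_target == 0:
            target_theta[:] = shared_theta          # refresh the target network
        if t % I_async_update == 0 or done:
            shared_theta -= lr * accumulated_grad   # Hogwild!-style asynchronous update
            accumulated_grad[:] = 0.0
```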

Asynchronous advantage actor-critic (A3C)

Actor policy: \(\pi(a_t|s_t;\theta)\)

Value function: \(V(s_t;\theta_v)\)
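
In the paper, the policy and value function typically share all non-output layers of the network. A minimal PyTorch sketch of such an architecture, assuming a flat observation vector (the paper uses a convolutional network for Atari), might look like:

```python
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared torso with separate policy and value heads (illustrative sketch)."""

    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, num_actions)  # logits of pi(a|s; theta)
        self.value_head = nn.Linear(hidden, 1)              # V(s; theta_v)

    def forward(self, obs):
        h = self.torso(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)
```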

(Figure: pseudocode of an A3C actor-learner thread)

The authors found that adding the entropy of the policy \(\pi\) to the objective improved exploration by discouraging premature convergence to suboptimal deterministic policies. The gradient of the full objective with respect to the policy parameters, including the entropy regularization term, is \[ \nabla_{\theta^{\prime}} \log \pi\left(a_t \mid s_t ; \theta^{\prime}\right)\left(R_t-V\left(s_t ; \theta_v\right)\right)+\beta \nabla_{\theta^{\prime}} H\left(\pi\left(s_t ; \theta^{\prime}\right)\right) \] This is the same idea as the entropy term used in SAC.
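
Below is a minimal PyTorch sketch (not the paper's code) of how the policy, value, and entropy terms are commonly combined into a single loss for one rollout. The combined-loss formulation and the `value_coef` weight are assumptions of this sketch; the paper computes the policy and value gradients separately.

```python
import torch
import torch.nn.functional as F

def a3c_loss(logits, values, actions, returns, beta=0.01, value_coef=0.5):
    """Illustrative A3C loss for one rollout of length T.

    logits:  (T, num_actions) policy logits for pi(.|s_t; theta')
    values:  (T,) value estimates V(s_t; theta_v)
    actions: (T,) integer actions a_t taken by the actor
    returns: (T,) n-step returns R_t computed from the rollout
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    advantages = returns - values.detach()    # R_t - V(s_t), treated as a constant
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    policy_loss = -(chosen_log_probs * advantages).mean()   # policy gradient term
    entropy = -(probs * log_probs).sum(dim=-1).mean()        # H(pi(.|s_t; theta'))
    value_loss = F.mse_loss(values, returns)                 # value regression to R_t

    # Entropy is subtracted: maximizing entropy discourages premature determinism.
    return policy_loss - beta * entropy + value_coef * value_loss
```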