Playing Atari with Deep Reinforcement Learning

This paper by Volodymyr Mnih, published at NIPS 2013, is roughly the founding work of DQN; the other one is the Nature 2015 paper “Human-level control through deep reinforcement learning.” Both papers count as milestones in the field of reinforcement learning. This article summarizes some of the pioneering ideas covered in the NIPS 2013 paper.

Abstract

This is the first paper to use a deep learning model with reinforcement learning to learn control policies directly from high-dimensional sensory input. The model is built on a CNN and is trained using a variant of Q-learning. The input is raw pixels, i.e. the current state, and the output is the Q-value—in essence, using a neural network to approximate the Q(s,a) function. The paper ran experiments on Atari 2600 games and achieved excellent results without any special tuning of the architecture or algorithm (outperforming both prior algorithms and human players).

Introduction

Learning control policies directly from high-dimensional sensory data (speech, images, etc.) has long been a challenge for RL. Most successful RL applications have required hand-crafted feature engineering, and the effectiveness of such applications obviously depends heavily on the quality of feature selection. Deep learning, however, makes it possible to extract features from high-dimensional data, so combining the two should yield very good results.

Yet using deep learning for RL also comes with many challenges. Today’s deep learning largely relies on large amounts of hand-labeled data, whereas RL does not have this kind of direct one-to-one relationship; instead, it requires many iterations before yielding a scalar reward value, and this reward is typically sparse, noisy, and significantly delayed. In RL, it often takes thousands of steps to obtain a reward, which is very sluggish compared with the conventional one-to-one mapping between data and labels. In addition, deep learning typically requires that the training data be mutually independent, whereas RL data consists of highly interdependent sequences of states, and the data distribution changes as RL learns new actions—which can have pathological effects on deep learning algorithms that require a fixed data distribution. The main problems are data correlation and data distribution.

The paper proposes a CNN-based network architecture that alleviates the data-correlation and data-distribution problems by introducing an experience replay mechanism. Experience replay smooths the data distribution by randomly sampling from transitions. During training, the model receives the same data as a human, including video input, reward, termination signals, and the set of possible actions.

Background

Training here uses the Atari emulator. Note that the reward—that is, the game score—depends on the prior sequence of actions and observations, and feedback may not arrive until thousands of steps later. Since a single screen observation cannot fully capture the current state, a sequence is needed to represent it, namely st=x1,a1,x2,...,at1,xts_{t}=x_{1},a_{1},x_{2},...,a_{t-1},x_{t}, using a sequence of screen observations and actions to represent the current state, and then relying on these sequences to learn the game policy. All sequences in the emulator are assumed to terminate within a finite number of time steps, so standard reinforcement learning methods can be applied to these sequences.

It is further assumed that the reward gradually decreases as time steps increase, i.e. a discount factor is used to process the reward. The cumulative reward is Rt=t=tTγttrtR_{t}=\sum^{T}_{t'=t}\gamma^{t'-t}r_{t'}, where TT is the time step at which the game terminates. Define Q(s,a)Q^{*}(s,a) as the Q-value function under the optimal policy, i.e. Q(s,a)=maxπE[Rtst=s,at=a,π]Q^{*}(s,a)=max_{\pi}E[R_{t}|s_{t}=s,a_{t}=a,\pi]. Moreover, the optimal Q function obeys the Bellman equation, with the following idea: if for the next sequence ss' and all possible actions aa' the value Q(s,a)Q(s',a') is known, then Q(s,a)Q^{*}(s,a) is the value that maximizes the expectation r+γQ(s,a)r+\gamma Q^{*}(s',a'), i.e.

Q(s,a)=Esε[r+γmaxaQ(s,a)s,a]Q^{*}(s,a)=E_{s'\sim \varepsilon}[r+\gamma max_{a'}Q^{*}(s',a')|s,a]

The idea behind many RL algorithms is to estimate the Q-value function. If the Bellman equation is applied iteratively, Qi+1(s,a)=E[r+γmaxaQi(s,a)s,a]Q_{i+1}(s,a)=E[r+\gamma max_{a'}Q_{i}(s',a')|s,a] then in the limit ii\to\infin it converges to QQ^*. In practice, however, this is impractical, because the Q function is estimated separately for each sequence and has no generalization ability. Instead, a function is used to approximate the optimal Q function, i.e. Q(s,a;θ)=Q(s,a)Q(s,a;\theta)=Q^*(s,a). Here a neural network is used to estimate it, with the loss function:

Li(θi)=Es,ap()[(yiQ(s,a;θi))2]L_{i}(\theta_{i})=E_{s,a\sim p(\cdot)}[(y_{i}-Q(s,a;\theta_{i}))^2]

where yi=Esε[r+γmaxaQ(s,a;θi1)s,a]y_{i}=E_{s'\sim \varepsilon}[r+\gamma max_{a'}Q(s',a';\theta_{i-1})|s,a]. In the code implementation, Q(s,a;θi)Q(s,a;\theta_{i}) and Q(s,a;θi1)Q(s,a;\theta_{i-1}) are usually realized with two different networks; the PyTorch tutorial implements them with two networks, policy_net and target_net. In addition, when selecting actions, a certain probability is used to balance “exploration” and “exploitation,” i.e. the ϵ\epsilon-greedy algorithm. At the start of training, high-reward actions are exploited as much as possible, while in later stages more exploration is performed—that is, ϵ\epsilon decays from large to small over time.

I’ll leave this section without too much detail for now, and add to it once the book I ordered arrives.

Deep Reinforcement Learning

This chapter mainly introduces the core algorithm. First, it’s worth recognizing that using deep learning to extract features is superior to hand-crafted feature processing. The pseudocode of the DQN algorithm framework is shown below.

Algorithm 1  Deep Q-learning with Experience Replay

Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialise sequence s_1 = {x_1} and preprocessed sequenced φ_1 = φ(s_1)
    for t = 1, T do
        With probability ε select a random action a_t
        otherwise select a_t = max_a Q*(φ(s_t), a; θ)
        Execute action a_t in emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j                                          for terminal φ_{j+1}
                  r_j + γ max_{a'} Q(φ_{j+1}, a'; θ)           for non-terminal φ_{j+1}
        Perform a gradient descent step on (y_j - Q(φ_j, a_j; θ))^2 according to equation 3
    end for
end for

Pseudocode of the DQN algorithm

Here are the benefits of replay memory again: 1. Each state transition can potentially serve as data for updating the network weights. 2. Because of the strong correlation between samples, learning directly from consecutive samples is inefficient; breaking this correlation through random sampling reduces the variance of updates. 3. When learning on-policy, the current parameters determine the next batch of samples used for parameter updates. For example, if the action with the maximum reward under the current parameters is “move left,” then subsequent samples will keep coming from the left side; and if the best action later becomes “move right,” the training distribution will shift accordingly. It’s easy to see how this can create a vicious cycle and get stuck in a local optimum. Replay averages the behavior distribution over many previous states and smooths the learning process, avoiding divergence. Hence we need to learn off-policy. The distinction lies in whether the policy that generates actions is the same as the policy used to update the network parameters: in Q-learning, the update uses a max operation, whereas the policy that generates actions does not necessarily take the action with the maximum reward. Q-learning is the classic off-policy learning algorithm.

For the neural network here, the state is taken as input, and the n output nodes correspond to the rewards produced by the n possible actions, so a single computation yields the reward corresponding to each action.

As for the later experiment chapter, it’s been a long time, so I won’t go into detail.