GAN Training Stability: Notes on Improved Techniques for Training GANs

This paper proposes several improvements to the architecture and training procedure of GAN models, validated mainly on two tasks: semi-supervised learning (using additional unsupervised samples to improve performance on a supervised task) and image generation.

Why GANs Are Hard to Train

Training a GAN is essentially the search for a Nash equilibrium of a non-convex game in a high-dimensional continuous space. The problem is that we usually optimize the loss function with gradient descent, and gradient descent was not designed to find Nash equilibria, so convergence is difficult.

While there are algorithms that can find Nash equilibria in certain settings, there is still no general solution for the high-dimensional, non-convex case that GANs present.

An intuitive example: take the value function xy, where one player controls x and minimizes xy while the other controls y and minimizes -xy. The only equilibrium is x = y = 0, but gradient descent on the two objectives settles into an orbit around that point instead of converging to it. The same kind of oscillation is one source of GAN training instability.
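The orbit can be reproduced in a few lines. The sketch below is an illustration only: the step size, the starting point, and the function name `alternating_gd` are assumptions, not taken from the paper. It applies alternating gradient steps to the xy game and records the distance from the equilibrium (0, 0).

```python
import math

def alternating_gd(x, y, lr=0.1, steps=1000):
    """Alternating gradient descent on the game V(x, y) = x * y.

    Player 1 moves x to decrease x*y; player 2 moves y to decrease -x*y.
    Returns the distance from the equilibrium (0, 0) after every step.
    """
    radii = []
    for _ in range(steps):
        x = x - lr * y          # d(xy)/dx = y
        y = y + lr * x          # d(-xy)/dy = -x, so descent adds lr * x
        radii.append(math.hypot(x, y))
    return radii

radii = alternating_gd(1.0, 1.0)
print(f"start: {math.hypot(1.0, 1.0):.3f}  min: {min(radii):.3f}  max: {max(radii):.3f}")
```

The distance stays in a narrow band around its starting value: the iterates circle the equilibrium without approaching it.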

Feature Matching

Feature matching adds an extra objective term to the generator—requiring the generated images to satisfy certain statistical properties of the real images. Concretely, the generator is made to match the feature representations of an intermediate layer of the discriminator, because the intermediate-layer features the discriminator learns during training are exactly the key information used to distinguish real images from generated ones.

Let f(x) be the activations of some intermediate layer of the discriminator. Feature matching adds the following loss to the generator:

$$\left\lVert \mathbb{E}_{\boldsymbol{x}\sim p_{\mathrm{data}}}\,\mathbf{f}(\boldsymbol{x}) - \mathbb{E}_{\boldsymbol{z}\sim p_{z}(\boldsymbol{z})}\,\mathbf{f}(G(\boldsymbol{z}))\right\rVert_{2}^{2}$$

This method offers no theoretical guarantee of convergence, but empirically it works quite well, especially on semi-supervised learning tasks.
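As a rough illustration of the loss above, the following NumPy sketch replaces each expectation with a mini-batch mean. The arrays `real` and `fake` are random placeholders standing in for f(x) and f(G(z)); in an actual GAN they would come from the discriminator's intermediate layer, and the loss would be differentiated with respect to the generator parameters.

```python
import numpy as np

def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between the batch means of two feature arrays.

    Both inputs have shape (batch, feature_dim) and stand in for f(x) and
    f(G(z)); the batch mean approximates the expectation in the formula.
    """
    mean_real = real_features.mean(axis=0)
    mean_fake = fake_features.mean(axis=0)
    return float(np.sum((mean_real - mean_fake) ** 2))

rng = np.random.default_rng(0)
real = rng.normal(loc=1.0, size=(64, 16))   # placeholder for f(x)
fake = rng.normal(loc=0.0, size=(64, 16))   # placeholder for f(G(z))
print(feature_matching_loss(real, fake))
print(feature_matching_loss(real, real))    # identical statistics give 0.0
```

Only the first moment of the features is matched here, which is what the formula asks for; the generator is free to differ from the data in anything the mean feature vector does not capture.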

Minibatch Discrimination

Minibatch discrimination lets the discriminator take into account information across the entire mini-batch when evaluating images, rather than scoring each image in isolation. The motivation is this: when the discriminator scores each image independently, nothing tells the generator that its outputs are too similar to one another. If the generator starts producing near-identical images, the discriminator's gradients for all of them point in similar directions, and the generator can collapse onto a single output (mode collapse). Making the discriminator aware of the overall diversity within a batch alleviates this problem.

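The concrete approach is to insert a special layer into the discriminator, applied to both real and generated images. For each sample x_i, the layer multiplies the intermediate feature vector f(x_i) by a learned tensor T to obtain a matrix M_i, computes the L1 distance between corresponding rows of M_i and M_j for every sample x_j in the batch, applies a negative exponential to each distance, and sums over j. The resulting similarity statistics are concatenated to f(x_i) and passed to the next layer.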
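A minimal NumPy sketch of such a layer, forward pass only, might look like the following. The shapes (A input features, B kernels of dimension C), the random tensor `T`, and the function name are assumptions made for illustration; in a real model T is learned together with the rest of the discriminator.

```python
import numpy as np

def minibatch_discrimination(features, T):
    """features: (n, A) intermediate activations; T: (A, B, C) tensor.

    Returns an (n, A + B) array: the input features with B batch-similarity
    statistics appended to every sample.
    """
    M = np.einsum("na,abc->nbc", features, T)             # (n, B, C)
    # L1 distance between row b of M_i and row b of M_j for all pairs (i, j)
    l1 = np.abs(M[:, None] - M[None, :]).sum(axis=-1)      # (n, n, B)
    similarity = np.exp(-l1)                               # values in (0, 1]
    stats = similarity.sum(axis=1)                         # sum over j -> (n, B)
    return np.concatenate([features, stats], axis=1)

rng = np.random.default_rng(0)
T = rng.normal(size=(16, 4, 3))                  # stand-in for learned weights
diverse = rng.normal(size=(8, 16))               # eight different samples
collapsed = np.repeat(diverse[:1], 8, axis=0)    # eight copies of one sample
print(minibatch_discrimination(diverse, T)[:, 16:].mean())
print(minibatch_discrimination(collapsed, T)[:, 16:].mean())
```

For the collapsed batch every pairwise distance is zero, so each statistic equals the batch size, while a diverse batch yields much smaller values. This gives the discriminator a direct signal that a batch lacks diversity.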

On image generation tasks, minibatch discrimination produces better visual results than feature matching; but on semi-supervised classification tasks, it actually performs worse than feature matching.

Historical Averaging

This technique likewise adds an extra loss term, pulling the current parameter values toward the historical average of the parameters:

$$\left\lVert \boldsymbol{\theta} - \tfrac{1}{t}\sum_{i=1}^{t} \boldsymbol{\theta}[i] \right\rVert^{2}$$
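One way to compute this term without storing every past parameter vector is to keep a running mean. The sketch below is an assumption about how it could be organized, not the paper's implementation; the class name `HistoricalAverage` and its methods are hypothetical.

```python
import numpy as np

class HistoricalAverage:
    """Tracks the running mean of past parameter vectors."""

    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.t = 0

    def update(self, theta):
        # Incremental mean: mean_t = mean_{t-1} + (theta - mean_{t-1}) / t
        self.t += 1
        self.mean += (theta - self.mean) / self.t

    def penalty(self, theta):
        # || theta - (1/t) * sum_i theta[i] ||^2
        return float(np.sum((theta - self.mean) ** 2))

hist = HistoricalAverage(dim=2)
# Parameters that keep circling the origin, as in an oscillating game
for theta in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    hist.update(np.array(theta))
print(hist.mean)
print(hist.penalty(np.array([1.0, 0.0])))
```

When the parameters orbit a point, their historical average sits near the center of the orbit, so the penalty pulls the current parameters inward.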

The idea is loosely inspired by fictitious play in game theory [16]. The paper reports that historical averaging can find equilibria of low-dimensional, continuous, non-convex games on which plain gradient descent fails. The figure below illustrates its effect:

[16] George W Brown. Iterative solution of games by fictitious play. Activity analysis of production and allocation, 13(1):374–376, 1951.

One-sided Label Smoothing

Ordinary label smoothing smooths both positive and negative samples, for example replacing the original labels 1 and 0 with 0.9 and 0.1.

If the positive targets are replaced with α and the negative targets with β, the optimal discriminator becomes:

$$D(\boldsymbol{x}) = \dfrac{\alpha\,p_{\mathrm{data}}(\boldsymbol{x}) + \beta\,p_{\mathrm{model}}(\boldsymbol{x})}{p_{\mathrm{data}}(\boldsymbol{x}) + p_{\mathrm{model}}(\boldsymbol{x})}$$

The approach here smooths only the positive samples (real images, for example α = 0.9), while β for the negative samples stays at 0.

The reason is that β multiplies p_model in the numerator. If the negative samples are also smoothed (β > 0), then in regions where p_data is close to zero and p_model is large, the optimal discriminator output stays near β instead of dropping to 0, so erroneous generated samples there have little incentive to move toward the real data distribution.
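The effect is easy to check numerically by evaluating the optimal-discriminator formula at a point that the generator covers but the data does not. The density values below are made up for illustration.

```python
def optimal_discriminator(p_data, p_model, alpha=0.9, beta=0.0):
    """Optimal D(x) when real targets are alpha and fake targets are beta."""
    return (alpha * p_data + beta * p_model) / (p_data + p_model)

# A point where the generator puts mass but the data has almost none.
p_data, p_model = 1e-6, 1.0
two_sided = optimal_discriminator(p_data, p_model, beta=0.1)
one_sided = optimal_discriminator(p_data, p_model, beta=0.0)
print(two_sided)   # close to 0.1
print(one_sided)   # close to 0
```

With β = 0 the discriminator can still reject such a sample outright; with β = 0.1 its best answer is pinned near 0.1 regardless of how far the sample is from the data.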

Virtual Batch Normalization

Standard batch normalization (BN) normalizes using the statistics of the current mini-batch, which means each sample's activations depend on the other samples in the same batch. Virtual BN works as follows: a reference batch is chosen once and fixed at the start of training, and each sample is normalized using statistics collected on this reference batch and on the sample itself, which removes the dependence on the other samples in the current mini-batch.

The cost is extra computation: forward propagation has to run on two mini-batches (the reference batch and the current one), so the paper uses it only in the generator, not the discriminator.
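The sketch below shows the normalization step for a single layer in NumPy. It is a simplification under stated assumptions: each sample is normalized with statistics over the reference rows plus that sample itself, the learned scale and shift of batch normalization are omitted, and the function name is hypothetical.

```python
import numpy as np

def virtual_batch_norm(x, ref_batch, eps=1e-5):
    """Normalize each row of x with statistics of ref_batch plus that row.

    x: (batch, features); ref_batch: (n_ref, features), fixed for training.
    """
    n = ref_batch.shape[0]
    ref_mean = ref_batch.mean(axis=0)
    ref_sq = (ref_batch ** 2).mean(axis=0)
    # Statistics over the n reference rows and the single row being normalized
    mean = (n * ref_mean + x) / (n + 1)
    mean_sq = (n * ref_sq + x ** 2) / (n + 1)
    var = mean_sq - mean ** 2
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
ref = rng.normal(size=(32, 4))       # reference batch, chosen once
batch = rng.normal(size=(8, 4))      # current mini-batch
full = virtual_batch_norm(batch, ref)
alone = virtual_batch_norm(batch[:1], ref)
print(np.allclose(full[0], alone[0]))
```

The final check confirms the property that motivates the technique: the output for a sample is the same whether it is processed alone or together with other samples.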

Inception Score: Image Quality Evaluation

The paper also proposes a metric for evaluating the quality of generated images: the Inception Score.

$$\exp\!\left(\mathbb{E}_{\boldsymbol{x}}\,\mathrm{KL}\big(p(y\mid \boldsymbol{x})\,\|\,p(y)\big)\right)$$

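Here p(y|x) is the label distribution that a pretrained Inception model predicts for a generated image x, and p(y) is its average over the generated images. The Inception Score measures how much information is gained from observing a generated image, i.e., the KL divergence between the conditional distribution and the marginal distribution. It is high when individual images are classified confidently (low-entropy p(y|x)) while the images as a whole cover many classes (high-entropy p(y)). Note that training the generator directly to optimize the Inception Score does not necessarily yield good results.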
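Given a matrix of predicted class probabilities, the score itself takes only a few lines. The sketch below assumes the probabilities have already been computed by a classifier; the two toy inputs are hand-built rather than outputs of an actual Inception network, and `eps` is an assumption added to avoid log(0).

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (n_images, n_classes), each row a distribution p(y | x)."""
    p_y = probs.mean(axis=0, keepdims=True)               # marginal p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

uniform = np.full((10, 10), 0.1)     # every image equally ambiguous
one_hot = np.eye(10)                 # confident, and all 10 classes covered
print(inception_score(uniform))      # approximately 1
print(inception_score(one_hot))      # approximately 10
```

The two cases mark the extremes for 10 classes: the score is 1 when the classifier learns nothing from any image, and equals the number of classes when every image is classified with certainty and the classes are evenly represented.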

Semi-supervised Learning

The approach in the semi-supervised setting is to add an extra class label to the discriminator, representing the “fake image” class, so that the discriminator simultaneously takes on two tasks: distinguishing real from fake images, and labeled classification.
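A minimal sketch of how one output layer can serve both tasks, assuming the discriminator emits K + 1 logits with the fake class in the last position (the position and the function name are illustrative choices): the softmax gives the class probabilities, and the probability of an image being real is one minus the probability of the fake class.

```python
import numpy as np

def class_and_real_probs(logits):
    """logits: (batch, K + 1); the last column is the extra "fake" class.

    Returns (class_probs, p_real): the (batch, K + 1) softmax and the
    probability that each input is real, i.e. 1 - p(fake).
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    class_probs = exp / exp.sum(axis=1, keepdims=True)
    p_real = 1.0 - class_probs[:, -1]
    return class_probs, p_real

logits = np.array([[4.0, 0.0, 0.0, 0.0],    # confidently real, class 0
                   [0.0, 0.0, 0.0, 4.0]])   # confidently fake
class_probs, p_real = class_and_real_probs(logits)
print(p_real)
```

Labeled images can then be trained with cross-entropy on the first K classes, while unlabeled and generated images only need the real-versus-fake probability.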

In this setting, feature matching works well, while minibatch discrimination performs poorly—the opposite of the conclusion on image generation tasks.

Another interesting observation is that introducing the semi-supervised task (i.e., adding classification label information) can improve the visual quality of generated images (according to human annotators’ evaluations). One possible explanation is that the human visual system is itself better at handling semantic information such as classification, rather than local statistical textures, so a generator trained jointly with a classification task learns features that better match human visual perception. The high agreement between human ratings and the Inception Score also supports this—Inception itself is used for classification, and it may have learned feature representations similar to those of the human visual system.

Understanding and evaluating generated images from the perspective of human visual perception is indeed a very interesting angle.