Generative Adversarial Networks

Motivation

A generative adversarial network (GAN) (Goodfellow et al. 2014) is a likelihood-free generative model. Instead of fitting an explicit density to data, a GAN trains a generator \(G_\theta\) that maps random noise to samples, supervised by a discriminator \(D_\phi\) that tries to distinguish generator samples from real data. The two networks play a minimax game; at the game’s equilibrium the generator’s distribution matches the data distribution, and the discriminator can do no better than guessing.

GANs were the dominant deep generative model from roughly 2015 to 2020. They produce sharper images than VAEs and are faster to sample from than diffusion models, at the cost of being notoriously hard to train. They have largely been displaced by diffusion for image generation but remain important historically and for applications where the implicit (no-likelihood) formulation is convenient.

The Setup

Two networks:

  • Generator \(G_\theta : \mathbb{R}^k \to \mathcal{X}\) maps a noise sample \(z \sim p(z)\) (typically standard Gaussian) to a generated example \(\tilde x = G_\theta(z)\). The implied distribution over \(\tilde x\) is \(p_\theta\).
  • Discriminator \(D_\phi : \mathcal{X} \to (0, 1)\) outputs the probability that its input is real.

Training is the minimax problem

\[ \min_\theta \max_\phi V(\theta, \phi) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D_\phi(G_\theta(z)))]. \]

The discriminator maximizes \(V\) by labeling real as \(1\) and fake as \(0\); the generator minimizes \(V\) by fooling the discriminator into labeling fakes as \(1\).

In practice the two networks are updated in alternation, one SGD step each, rather than solving the inner maximization to convergence.
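This alternating scheme can be sketched on a 1-D toy problem. Everything below is illustrative, not a recipe: a linear generator \(G(z) = az + b\), a logistic discriminator \(D(x) = \sigma(wx + c)\), data drawn from \(\mathcal{N}(3, 1)\), and gradients written out by hand.

```python
# Minimal 1-D GAN trained by alternating single SGD steps on V.
# Toy setup: G(z) = a*z + b, D(x) = sigmoid(w*x + c), data ~ N(3, 1).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

a, b = 1.0, 0.0                      # generator parameters
w, c = 0.0, 0.0                      # discriminator parameters
lr, batch = 0.05, 128
history = []

for step in range(2000):
    x = rng.normal(3.0, 1.0, batch)  # real samples
    z = rng.normal(0.0, 1.0, batch)  # noise
    g = a * z + b                    # generated samples

    # Discriminator: one gradient-ascent step on V.
    d_real = sigmoid(w * x + c)
    d_fake = sigmoid(w * g + c)
    # dV/d(logit) is (1 - D) on real inputs and -D on fake inputs.
    w += lr * (np.mean((1 - d_real) * x) - np.mean(d_fake * g))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator: one gradient-descent step on log(1 - D(G(z))).
    d_fake = sigmoid(w * g + c)
    # d/ds log(1 - sigmoid(s)) = -sigmoid(s); chain rule through s = w*g + c.
    a -= lr * np.mean(-d_fake * w * z)
    b -= lr * np.mean(-d_fake * w)
    history.append(b)

b_avg = float(np.mean(history[-1000:]))
print(b_avg)  # generator offset drifts toward the data mean of 3
```

Even in this toy, the dynamics orbit the equilibrium rather than settling at it, which is why the final offset is reported as an average over late iterations; the instability discussed later in this section is visible at the smallest scale.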

What the Optimum Computes

Hold \(\theta\) fixed and solve for the optimal discriminator. Maximizing the integrand of \(V\) pointwise in \(x\) gives

\[ D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}. \]

Plugging \(D^*\) back into \(V\) gives \(V(\theta, D^*) = -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_\theta)\): the generator is minimizing the Jensen-Shannon divergence between \(p_\theta\) and \(p_{\text{data}}\), up to an additive constant. The unique global optimum is \(p_\theta = p_{\text{data}}\), at which \(D^* \equiv 1/2\) and the discriminator is indifferent.
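The identity \(V(\theta, D^*) = -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_\theta)\), which follows from substituting \(D^*\) into \(V\), can be checked numerically. A NumPy sketch with two arbitrarily chosen 1-D Gaussians:

```python
# Verify numerically that V at the optimal discriminator equals
# -log 4 + 2 * JSD(p_data || p_theta), using two 1-D Gaussians
# and trapezoidal integration.
import numpy as np

trapz = getattr(np, "trapezoid", None) or np.trapz  # NumPy 2.x renamed trapz

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-10.0, 10.0, 200_001)
p_data = gauss(x, 0.0, 1.0)
p_gen = gauss(x, 1.0, 1.5)  # stand-in for p_theta

# Optimal discriminator, and the value function evaluated at it.
d_star = p_data / (p_data + p_gen)
v = trapz(p_data * np.log(d_star) + p_gen * np.log(1 - d_star), x)

# Jensen-Shannon divergence from its definition via the mixture m.
m = 0.5 * (p_data + p_gen)
kl = lambda p, q: trapz(p * np.log(p / q), x)
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_gen, m)

print(v, -np.log(4) + 2 * jsd)  # agree up to integration error

# At p_theta = p_data, D* = 1/2 everywhere and V = -log 4.
v_equal = trapz(2 * p_data * np.log(0.5), x)
print(v_equal, -np.log(4))
```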

This is the conceptual story. In practice the dynamics never reach this fixed point exactly and stability is the dominant engineering concern.

Training Difficulties

GAN training is famous for being fragile:

  • Mode collapse. The generator finds one mode (say, one image style) that fools the discriminator and stays there, ignoring the rest of the data distribution. The likelihood-free objective gives no signal that other modes are missing.
  • Vanishing generator gradient. When the discriminator is too strong — assigns near-zero probability to all generator samples — \(\log(1 - D(G(z)))\) saturates and gives no useful gradient. The standard fix is the non-saturating loss: train \(G\) to maximize \(\log D(G(z))\) instead of minimizing \(\log(1 - D(G(z)))\).
  • Discriminator overpowering. If \(D\) trains faster than \(G\), \(D\) classifies perfectly and \(G\) stops learning. Balancing the two updates is delicate.
  • No principled stopping criterion. Loss curves do not monotonically decrease as in supervised learning; visual inspection or downstream metrics (FID, IS) are needed.
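The vanishing-gradient point and the non-saturating fix can be seen directly by differentiating both generator losses with respect to the discriminator’s logit \(s\), where \(D = \sigma(s)\). The logit values below are illustrative:

```python
# Gradient of the two generator losses w.r.t. the discriminator logit s,
# where D = sigmoid(s). When the discriminator is confident a sample is
# fake (s very negative, D ~ 0), the saturating loss gives almost no
# gradient while the non-saturating loss does not.
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

s = np.array([-10.0, -4.0, 0.0])  # fake-sample logits; D = sigmoid(s)

# d/ds [ log(1 - sigmoid(s)) ] = -sigmoid(s)      (original, saturating)
grad_saturating = -sigmoid(s)
# d/ds [ -log(sigmoid(s)) ]   = sigmoid(s) - 1    (non-saturating flip)
grad_nonsaturating = sigmoid(s) - 1.0

print(grad_saturating)     # vanishes as D -> 0 (first entry ~ -4.5e-5)
print(grad_nonsaturating)  # stays near -1 where the saturating loss dies
```

Both losses point the generator in the same direction; the flip only rescales the gradient so that it is largest exactly where the generator is worst.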

Mitigations developed over the years include Wasserstein GANs (replace the JS divergence with the earth-mover distance), spectral normalization (constrain \(D\)’s Lipschitz constant), gradient penalties, and progressive growing of the generator and discriminator architectures.
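As one concrete example, spectral normalization can be sketched in NumPy with power iteration. The function name and hyperparameters here are illustrative, not the API of any library; real implementations amortize the cost by running a single power iteration per training step and reusing the vectors.

```python
# Sketch of spectral normalization: estimate the largest singular value
# of a weight matrix by power iteration, then divide the weights by it
# so the linear layer is (approximately) 1-Lipschitz.
import numpy as np

def spectral_normalize(W, n_iters=100, rng=None):
    rng = rng or np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):      # power iteration on W W^T
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v             # estimated top singular value
    return W / sigma, sigma

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 32))
W_sn, sigma = spectral_normalize(W)

print(sigma, np.linalg.norm(W, 2))  # power-iteration estimate vs exact norm
print(np.linalg.norm(W_sn, 2))      # ~ 1.0 after normalization
```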

Comparison to Other Generative Models

                      VAE                     GAN                      Diffusion
Training              ELBO maximization       Minimax game             Score / denoising
Likelihood            Lower bound available   None (implicit model)    Lower bound available
Sample quality        Blurry                  Sharp                    Sharp
Training stability    Stable                  Fragile                  Stable
Sampling speed        One forward pass        One forward pass         Many forward passes

The GAN’s strengths — sample sharpness, fast sampling — are useful, but its training instability and lack of a likelihood made it hard to extend to controlled generation in the way that diffusion’s likelihood-based formulation supports (classifier guidance, classifier-free guidance, ELBO weighting).

Where GANs Sit Now

Active uses include:

  • Image-to-image translation (CycleGAN, pix2pix). The implicit formulation makes it natural to learn maps between unpaired image domains.
  • Super-resolution and other low-level vision tasks where sample sharpness is the dominant metric.
  • Domain adaptation and representation learning via adversarial losses, without generation as the primary goal.

For unconditional image generation and text-to-image synthesis, diffusion models have largely won. The conceptual lesson — that a likelihood-free objective can produce a usable generative model — remains influential.

References

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, et al. 2014. “Generative Adversarial Nets.” Advances in Neural Information Processing Systems (NeurIPS), 2672–80. https://proceedings.neurips.cc/paper/2014/hash/f033ed80deb0234979a61f95710dbe25-Abstract.html.