Generative Adversarial Networks
Motivation
A generative adversarial network (GAN) (Goodfellow et al. 2014) is a likelihood-free generative model. Instead of fitting a density to data, a GAN trains a generator \(G_\theta\) that maps random noise to samples, supervised by a discriminator \(D_\phi\) that tries to distinguish generator samples from real data. The two networks play a minimax game; at convergence the discriminator cannot tell them apart, which means the generator’s distribution matches the data distribution.
GANs were the dominant deep generative model from roughly 2015 to 2020. They produce sharper images than VAEs and are faster to sample from than diffusion models, at the cost of being notoriously hard to train. They have largely been displaced by diffusion for image generation but remain important historically and for applications where the implicit (no-likelihood) formulation is convenient.
The Setup
Two networks:
- Generator \(G_\theta : \mathbb{R}^k \to \mathcal{X}\) maps a noise sample \(z \sim p(z)\) (typically standard Gaussian) to a generated example \(\tilde x = G_\theta(z)\). The implied distribution over \(\tilde x\) is \(p_\theta\).
- Discriminator \(D_\phi : \mathcal{X} \to (0, 1)\) outputs the probability that its input is real.
Training is the minimax problem
\[ \min_\theta \max_\phi V(\theta, \phi) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D_\phi(G_\theta(z)))]. \]
The discriminator maximizes \(V\) by labeling real as \(1\) and fake as \(0\); the generator minimizes \(V\) by fooling the discriminator into labeling fakes as \(1\).
In practice the two networks alternate single SGD (or Adam) steps rather than solving the inner maximization to convergence.
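A minimal sketch of this alternating loop, assuming PyTorch; the MLP architectures, 2-D toy data, `sample_data` stand-in, and hyperparameters are illustrative placeholders, not from the text:

```python
# Minimal GAN training loop sketch (PyTorch). Architectures, toy data, and
# hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

k, x_dim = 8, 2                                   # noise dim, data dim
G = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1))   # outputs a logit

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def sample_data(n):                               # stand-in for a real dataset
    return torch.randn(n, x_dim) * 0.5 + 2.0

for step in range(1000):
    x = sample_data(128)
    z = torch.randn(128, k)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))), i.e. minimize
    # binary cross-entropy with labels 1 = real, 0 = fake.
    d_loss = bce(D(x), torch.ones(128, 1)) + bce(D(G(z).detach()), torch.zeros(128, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: minimize E[log(1 - D(G(z)))], the literal minimax term.
    # (In practice the non-saturating variant discussed later replaces this.)
    z = torch.randn(128, k)
    g_loss = F.logsigmoid(-D(G(z))).mean()        # = log(1 - D(G(z))) for a logit
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```

Note that the generator's samples are detached in the discriminator step, so that step updates only \(\phi\); the generator step then backpropagates through \(D\) into \(\theta\).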
What the Optimum Computes
Hold \(\theta\) fixed and solve for the optimal discriminator. Writing \(V\) as an integral over \(x\), the integrand \(p_{\text{data}}(x)\log D + p_\theta(x)\log(1 - D)\) can be maximized pointwise in \(D\); setting its derivative to zero gives
\[ D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}. \]
Plugging back into \(V\), the generator is then minimizing the Jensen-Shannon divergence between \(p_\theta\) and \(p_{\text{data}}\) (up to an additive constant). The unique global optimum is \(p_\theta = p_{\text{data}}\), at which \(D^* \equiv 1/2\) — the discriminator is indifferent.
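Spelled out, the additive constant is \(-\log 4\): with the mixture \(m = \tfrac{1}{2}(p_{\text{data}} + p_\theta)\),
\[
V(\theta, \phi^*) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}\right] + \mathbb{E}_{x \sim p_\theta}\!\left[\log \frac{p_\theta(x)}{p_{\text{data}}(x) + p_\theta(x)}\right] = -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_\theta),
\]
since each expectation equals \(\mathrm{KL}(\cdot \,\|\, m) - \log 2\) and \(\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2}\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\mathrm{KL}(q \,\|\, m)\).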
This is the conceptual story. In practice the dynamics never reach this fixed point exactly and stability is the dominant engineering concern.
Training Difficulties
GAN training is famous for being fragile:
- Mode collapse. The generator finds one mode (say, one image style) that fools the discriminator and stays there, ignoring the rest of the data distribution. The likelihood-free objective gives no signal that other modes are missing.
- Vanishing generator gradient. When the discriminator is too strong (it assigns near-zero probability to all generator samples), \(\log(1 - D(G(z)))\) saturates and gives no useful gradient. The standard fix is the non-saturating loss: train \(G\) to maximize \(\log D(G(z))\) instead of minimizing \(\log(1 - D(G(z)))\) (see the sketch after this list).
- Discriminator overpowering. If \(D\) trains faster than \(G\), \(D\) classifies perfectly and \(G\) stops learning. Balancing the two updates is delicate.
- No principled stopping criterion. Loss curves do not monotonically decrease as in supervised learning; visual inspection or downstream metrics (FID, IS) are needed.
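To make the non-saturating fix concrete, the two generator losses can be written side by side as follows (a PyTorch-style sketch; `d_fake_logits`, the discriminator's logits on generated samples, is a name introduced here for illustration):

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits: torch.Tensor, non_saturating: bool = True) -> torch.Tensor:
    """Generator loss computed from discriminator logits on fake samples."""
    if non_saturating:
        # Maximize log D(G(z))  ->  minimize -log D(G(z)).
        # Gradient stays informative even when D confidently rejects the fakes.
        return -F.logsigmoid(d_fake_logits).mean()
    # Literal minimax term: minimize log(1 - D(G(z))).
    # Near-zero gradient once D(G(z)) is close to 0, exactly when G needs signal.
    return F.logsigmoid(-d_fake_logits).mean()
```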
Mitigations developed over the years include Wasserstein GANs (replace the JS divergence with the earth-mover / Wasserstein-1 distance), spectral normalization (constrain \(D\)'s Lipschitz constant), gradient penalties on the discriminator, and progressive growing of the generator and discriminator architectures.
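As one example among these mitigations, a WGAN-GP-style gradient penalty can be sketched as follows (assuming PyTorch; `D` is a critic producing one scalar per example, batches have shape `(N, d)` as in the earlier toy sketch, and the penalty weight is a conventional default rather than a value from the text):

```python
import torch

def gradient_penalty(D, real, fake, weight=10.0):
    # Interpolate between real and fake samples with a per-example coefficient.
    eps = torch.rand(real.size(0), 1)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    out = D(interp)
    grads, = torch.autograd.grad(outputs=out.sum(), inputs=interp, create_graph=True)
    # Penalize deviation of the per-example gradient norm from 1,
    # softly enforcing a ~1-Lipschitz discriminator/critic.
    return weight * ((grads.norm(2, dim=1) - 1) ** 2).mean()
```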
Comparison to Other Generative Models
| | VAE | GAN | Diffusion |
|---|---|---|---|
| Training | ELBO maximization | Minimax game | Score / denoising |
| Likelihood | Lower bound available | None (implicit model) | Lower bound available |
| Sample quality | Blurry | Sharp | Sharp |
| Training stability | Stable | Fragile | Stable |
| Sampling speed | One forward pass | One forward pass | Many forward passes |
The GAN’s strengths — sample sharpness, fast sampling — are useful, but its training instability and lack of a likelihood made it hard to extend to controlled generation in the way that diffusion’s likelihood-based formulation supports (classifier guidance, classifier-free guidance, ELBO weighting).
Where GANs Sit Now
Active uses include:
- Image-to-image translation (CycleGAN, pix2pix). The implicit formulation makes it natural to learn maps between unpaired image domains.
- Super-resolution and other low-level vision tasks where sample sharpness is the dominant metric.
- Domain adaptation and representation learning via adversarial losses, without generation as the primary goal.
For unconditional image generation and text-to-image synthesis, diffusion models have largely won. The conceptual lesson — that a likelihood-free objective can produce a usable generative model — remains influential.