Classifier-Free Guidance

Motivation

A conditional diffusion model trained on (image, label) pairs can in principle generate images conditioned on a label \(c\). In practice, naive conditional sampling produces images that are only weakly aligned with \(c\) — the model has learned the conditional distribution but generates samples spread over the whole conditional density, including its low-quality tails.

Classifier-free guidance (Ho and Salimans 2022) is a sampling-time trick that sharpens the conditional distribution at the cost of some sample diversity. It produces dramatically better-aligned samples — sharper, more faithful to the prompt — and is the technique that made text-to-image diffusion models actually usable.

The name is a contrast with classifier guidance (Dhariwal and Nichol 2021), an earlier method that requires a separately trained classifier on noisy data. Classifier-free guidance avoids that classifier by training a single diffusion model that can produce both conditional and unconditional scores, then combining them at sampling time.

The Trick

Train a single conditional model \(\varepsilon_\theta(x_t, t, c)\) where \(c\) is sometimes a real condition (a label, a text embedding) and sometimes a special “null” token \(\emptyset\) indicating unconditional generation. During training, drop the condition with probability \(p_{\text{drop}}\), typically in \([0.1, 0.2]\):

for each minibatch:
    sample (x_0, c)
    with probability p_drop: c <- null
    sample t, epsilon as in DDPM
    loss = || epsilon - eps_theta(x_t, t, c) ||^2

This gives a single model that can produce both \(\varepsilon_\theta(x_t, t, c)\) and \(\varepsilon_\theta(x_t, t, \emptyset)\) — equivalently, the conditional score \(\nabla_{x_t} \log p_t(x_t \mid c)\) and the unconditional score \(\nabla_{x_t} \log p_t(x_t)\).
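The dropout step above can be sketched concretely. This is a minimal runnable version in numpy with a stand-in `eps_theta` network and a toy linear noise schedule; the function names, the `NULL_TOKEN` encoding, and the schedule are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
P_DROP = 0.1      # condition-dropout probability from the text
NULL_TOKEN = -1   # hypothetical integer id standing in for the null token

def eps_theta(x_t, t, c):
    """Stand-in for the noise-prediction network (returns zeros here)."""
    return np.zeros_like(x_t)

def training_step(x0, c, num_timesteps=1000):
    # With probability p_drop, replace the condition with the null token.
    if rng.random() < P_DROP:
        c = NULL_TOKEN
    # Sample a timestep and noise as in DDPM, form x_t, compute the loss.
    t = rng.integers(num_timesteps)
    eps = rng.standard_normal(x0.shape)
    alpha_bar = 1.0 - t / num_timesteps  # toy schedule, not a real one
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    loss = np.mean((eps - eps_theta(x_t, t, c)) ** 2)
    return loss
```

A single network sees both conditioned and null-conditioned examples, which is all that sampling-time guidance will require.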

At sampling time, replace the model’s predicted noise with a guided combination:

\[ \tilde \varepsilon_\theta(x_t, t, c) = (1 + w) \, \varepsilon_\theta(x_t, t, c) - w \, \varepsilon_\theta(x_t, t, \emptyset), \]

where \(w \geq 0\) is the guidance scale. \(w = 0\) recovers ordinary conditional sampling; large \(w\) pushes the sampler toward regions where the conditional density is much higher than the unconditional density.
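In code the guided combination is a one-liner; a sketch (the function name is mine):

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Guided noise prediction: (1 + w) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# w = 0 recovers the ordinary conditional prediction.
e_c = np.array([0.5, -0.2])
e_u = np.array([0.1, 0.3])
assert np.allclose(guided_eps(e_c, e_u, 0.0), e_c)
```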

What This Computes

In score language, classifier-free guidance constructs a synthetic score

\[ \tilde s(x_t, t, c) = (1 + w) \nabla_{x_t} \log p_t(x_t \mid c) - w \nabla_{x_t} \log p_t(x_t). \]

Bayes’ rule gives \(\nabla_{x_t} \log p_t(c \mid x_t) = \nabla_{x_t} \log p_t(x_t \mid c) - \nabla_{x_t} \log p_t(x_t)\), so

\[ \tilde s = \nabla_{x_t} \log p_t(x_t \mid c) + w \nabla_{x_t} \log p_t(c \mid x_t). \]
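The two forms of the guided score are the same expression rearranged; a quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(1)
s_cond = rng.standard_normal(5)    # conditional score at some x_t
s_uncond = rng.standard_normal(5)  # unconditional score at the same x_t
w = 7.5

lhs = (1 + w) * s_cond - w * s_uncond        # guided score, first form
rhs = s_cond + w * (s_cond - s_uncond)       # conditional + w * implicit classifier score
assert np.allclose(lhs, rhs)
```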

This is the score of an implicitly tilted distribution

\[ \tilde p_t(x_t \mid c) \propto p_t(x_t \mid c) \cdot p_t(c \mid x_t)^w. \]

The \(w\) exponent on the implicit classifier \(p_t(c \mid x_t)\) sharpens the conditional — concentrating mass on \(x\) that the implicit classifier finds very aligned with \(c\), at the cost of diversity. This is structurally identical to what classifier guidance does using an external classifier, but without needing to train one.
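To see the sharpening concretely, take an assumed toy case (my construction, not from the paper) where both densities are 1-D Gaussians. Gaussian scores are linear in \(x\), so the guided score is also linear, i.e. the score of another Gaussian, and its variance can be read off directly:

```python
import numpy as np

# Assumed toy densities: conditional N(2, 1), unconditional N(0, 4).
mu_c, var_c = 2.0, 1.0
mu_u, var_u = 0.0, 4.0
w = 3.0

# Gaussian score: s(x) = -(x - mu) / var. The guided score
# (1 + w) * s_c(x) - w * s_u(x) is linear in x, so it is the score of
# another Gaussian; read off its precision, mean, and variance.
prec = (1 + w) / var_c - w / var_u
mean = ((1 + w) * mu_c / var_c - w * mu_u / var_u) / prec
var = 1.0 / prec

assert var < var_c   # sharpened: lower variance, less diversity
assert mean > mu_c   # pushed further from the unconditional mean
```

When the unconditional density is the broader one (as here), the guided precision exceeds the conditional precision: exactly the concentration-of-mass effect described above.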

The Trade-off

Larger \(w\) means:

  • More aligned samples. The conditional signal is amplified, so generations more faithfully follow the prompt.
  • Less diversity. The sharpened distribution has lower entropy; multiple samples with the same prompt look more similar.
  • Mode-seeking behavior. Heavy sharpening pushes mass toward the maximum of the implicit classifier — which can mean “stereotypical” or “averaged” outputs.
  • Risk of saturation. Very large \(w\) can produce oversaturated, high-contrast, somewhat unnatural images, especially in image generation.

Typical values in modern text-to-image models: \(w\) between \(5\) and \(15\). Values up to \(30\) are sometimes used for very specific prompts; values below \(3\) tend to produce off-topic generations. The right value is task-dependent and is one of the few sampling-time hyperparameters with a large effect on output quality.

Why It Works This Well

Two practical reasons classifier-free guidance dominates classifier guidance:

  • No separate classifier needed. A classifier on noisy data is its own training problem and is fragile to distribution shift between the diffusion training data and the classifier’s training data. Classifier-free guidance avoids this entirely.
  • Same network for both modes. \(\varepsilon_\theta(x_t, t, c)\) and \(\varepsilon_\theta(x_t, t, \emptyset)\) share weights, so the unconditional component is essentially free at training time and only requires one extra forward pass per sampling step.

The cost: each sampling step now does two forward passes (one with \(c\), one with \(\emptyset\)), so sampling is roughly \(2 \times\) slower. This is usually worth it; many production systems pay this tax by default.
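A common implementation detail (an observation about typical codebases, not something the paper prescribes): the two passes are often folded into a single batched network call by concatenating the conditional and null-conditioned inputs along the batch axis. A numpy sketch with a stand-in network:

```python
import numpy as np

def eps_theta_batch(x_t, t, c):
    """Stand-in network; a real model would batch over the leading axis."""
    # Returns 0.1 for null-conditioned rows (c < 0), 0.9 otherwise.
    return np.where(c[:, None] < 0, 0.1, 0.9) * np.ones_like(x_t)

def cfg_predict(x_t, t, c, null_c, w):
    # One batched call covers both the conditional and unconditional pass.
    x2 = np.concatenate([x_t, x_t], axis=0)
    c2 = np.concatenate([c, null_c], axis=0)
    eps = eps_theta_batch(x2, t, c2)
    eps_cond, eps_uncond = np.split(eps, 2, axis=0)
    return (1 + w) * eps_cond - w * eps_uncond
```

This does not reduce the compute (it is still two forward passes' worth of work), but it keeps the sampler to one network invocation per step and uses the hardware batch dimension efficiently.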

Where It Sits in the Pipeline

Classifier-free guidance is part of essentially every modern text-to-image diffusion system: Stable Diffusion, Imagen, DALL-E 2 and 3, Midjourney, Flux. The diffusion model is trained with random condition dropout, and the inference loop applies guidance with a user-controllable \(w\).

It also generalizes beyond text: any conditional diffusion model — class-conditional, layout-conditional, image-conditional (ControlNet), audio-conditional — uses some flavor of classifier-free guidance. The technique is small in line count but disproportionately important for sample quality. Without it the modern text-to-image generation experience would not exist.

References

Ho, Jonathan, and Tim Salimans. 2022. “Classifier-Free Diffusion Guidance.” arXiv Preprint arXiv:2207.12598. https://arxiv.org/abs/2207.12598.