Denoising Diffusion Probabilistic Models (DDPM)

Motivation

A denoising diffusion probabilistic model (Ho et al. 2020; Sohl-Dickstein et al. 2015) is a generative model that learns to invert a fixed Gaussian noising process. Given the forward noising process — a Markov chain that progressively corrupts data into pure noise — DDPM trains a network to predict, at every step of the chain, the noise that was added. At sampling time the model starts from random Gaussian noise and runs the chain in reverse, denoising step by step until a clean sample emerges.

DDPM unified two threads — the score-based generative modeling of Song & Ermon (2019) and the variational approach of Sohl-Dickstein et al. (2015) — into a single training recipe that is simple, stable, and produces state-of-the-art image samples. It is the basis of essentially all modern diffusion-based generation, including text-to-image models like Stable Diffusion and Imagen.

The Two Processes

Forward noising and reverse denoising chains

DDPMs train a neural network to reverse a fixed corruption process one small denoising step at a time.

[Figure: the forward process \(q\) adds noise, \(x_0 \to x_1 \to \cdots \to x_t \to \cdots \to x_T\); the learned reverse process \(p_\theta\) denoises in the opposite direction. Training predicts the noise (or score) needed for each reverse step.]

The model consists of two Markov chains over the same state space, run in opposite time directions.

Forward (fixed, no learning). Adds Gaussian noise according to a schedule \(\beta_1, \ldots, \beta_T\). The marginal at step \(t\) has closed form,

\[ q(x_t \mid x_0) = \mathcal{N}\!\left(x_t; \sqrt{\bar\alpha_t} \, x_0, \, (1 - \bar\alpha_t) I\right), \qquad \bar\alpha_t = \prod_{s=1}^t \alpha_s, \quad \alpha_s = 1 - \beta_s, \]

so we can sample \(x_t\) directly without simulating the chain. See forward noising process for the construction.
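As a concrete sketch, direct sampling from this marginal takes one line once the schedule is tabulated. A minimal PyTorch version, using the linear \(\beta\) schedule from Ho et al.; the function and variable names here are illustrative, not from the paper:

import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)           # linear schedule (Ho et al.'s choice)
alpha_bar = torch.cumprod(1.0 - beta, dim=0)   # \bar{alpha}_t for t = 1..T (zero-indexed)

def q_sample(x0, t, eps):
    # Draw x_t ~ q(x_t | x_0) in one shot -- no chain simulation needed.
    a = alpha_bar[t].view(-1, 1, 1, 1)         # broadcast over a (B, C, H, W) batch
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps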

Reverse (learned). A parametric Gaussian transition

\[ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1}; \mu_\theta(x_t, t), \, \Sigma_\theta(x_t, t)\right), \]

initialized at \(p(x_T) = \mathcal{N}(0, I)\). The learned \(\mu_\theta\) encodes how to back out one step of noise.

The defining choice in DDPM is to fix \(\Sigma_\theta(x_t, t) = \sigma_t^2 I\) to a non-learned schedule (Ho et al. use either \(\beta_t I\) or \(\tilde\beta_t I\)) and only learn the mean.
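Here \(\tilde\beta_t\) is the variance of the forward posterior \(q(x_{t-1} \mid x_t, x_0)\),

\[ \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t} \, \beta_t, \]

and Ho et al. report that both choices perform similarly in practice.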

The Variational Objective and the \(\varepsilon\)-Parameterization

The training objective is the variational lower bound on \(\log p_\theta(x_0)\) from the joint \(p_\theta(x_{0:T})\). The KL terms between forward and reverse transitions are between Gaussians and have closed form. After algebra (see DDPM loss as a weighted ELBO), each per-step KL becomes a quadratic in \(\mu_\theta(x_t, t) - \tilde\mu_t(x_t, x_0)\), where \(\tilde\mu_t\) is the closed-form mean of the forward posterior \(q(x_{t-1} \mid x_t, x_0)\).
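For reference, this posterior mean has the closed form

\[ \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}} \, \beta_t}{1 - \bar\alpha_t} \, x_0 + \frac{\sqrt{\alpha_t} \, (1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t} \, x_t, \]

so matching \(\mu_\theta\) to \(\tilde\mu_t\) is a tractable regression problem.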

The cleanest parameterization: predict the noise \(\varepsilon\) that was added to produce \(x_t\) from \(x_0\). Letting \(x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \varepsilon\), write the network as \(\varepsilon_\theta(x_t, t)\) and define

\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \!\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \varepsilon_\theta(x_t, t)\right). \]

With this parameterization, the per-step KL reduces to (up to a \(t\)-dependent weight) a simple MSE between the predicted and true noise. Ho et al. drop the weight and train

\[ \mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t \sim U\{1, T\}, \, x_0 \sim p_{\text{data}}, \, \varepsilon \sim \mathcal{N}(0, I)}\!\left[\|\varepsilon - \varepsilon_\theta(\sqrt{\bar\alpha_t} \, x_0 + \sqrt{1 - \bar\alpha_t} \, \varepsilon, t)\|^2\right]. \]

This is the core DDPM loss. It is a weighted ELBO (proof): dropping the weight down-weights the terms at very small noise levels, so training focuses on the harder denoising tasks at larger \(t\), which empirically improves sample quality.

Equivalently, by the denoising score matching identity of Vincent (2011), it is multi-scale denoising score matching. The connection to scores is \(s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t) / \sqrt{1 - \bar\alpha_t}\).
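The identity is immediate from the Gaussian marginal: since \(x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \varepsilon\),

\[ \nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t} \, x_0}{1 - \bar\alpha_t} = -\frac{\varepsilon}{\sqrt{1 - \bar\alpha_t}}, \]

so a network that predicts \(\varepsilon\) predicts the (conditional) score up to a known scale.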

The Training Loop

for each minibatch:
    sample x_0 from data
    sample t ~ Uniform{1, ..., T}
    sample epsilon ~ N(0, I)
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    loss = || epsilon - eps_theta(x_t, t) ||^2
    backprop, update theta

A single training step is a single forward and backward pass through \(\varepsilon_\theta\). Note in particular that \(T\) does not appear in the per-step compute — large \(T\) is essentially free at training time because of the closed-form marginal. Common choices are \(T = 1000\) for the original DDPM and continuous \(t \in [0, 1]\) in modern variants.
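A minimal PyTorch rendering of one training step, assuming the schedule tensor alpha_bar defined earlier and any network eps_theta(x_t, t); all names are illustrative:

import torch
import torch.nn.functional as F

def train_step(eps_theta, optimizer, x0, alpha_bar):
    # One DDPM training step on a batch x0 of shape (B, C, H, W).
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # uniform t (zero-indexed)
    eps = torch.randn_like(x0)                                 # target noise
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps               # closed-form marginal
    loss = F.mse_loss(eps_theta(x_t, t), eps)                  # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()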

Sampling: The Reverse Chain

At inference, run

x_T ~ N(0, I)
for t = T down to 1:
    z ~ N(0, I) if t > 1 else 0
    x_{t-1} = (1 / sqrt(alpha_t)) * (x_t - (beta_t / sqrt(1 - alpha_bar_t)) * eps_theta(x_t, t)) + sigma_t * z
return x_0
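A PyTorch sketch of the same loop, assuming the schedule tensors above; sigma is a free choice, e.g. sigma = beta.sqrt() as in Ho et al.:

import torch

@torch.no_grad()
def sample(eps_theta, shape, beta, alpha_bar, sigma, device="cpu"):
    x = torch.randn(shape, device=device)                      # x_T ~ N(0, I)
    for i in reversed(range(beta.shape[0])):                   # i = t - 1, zero-indexed
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        coef = beta[i] / (1.0 - alpha_bar[i]).sqrt()
        x = (x - coef * eps_theta(x, t)) / (1.0 - beta[i]).sqrt()  # posterior mean
        if i > 0:                                              # no noise on the final step
            x = x + sigma[i] * torch.randn_like(x)
    return x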

This is \(T\) forward passes through \(\varepsilon_\theta\) — sampling cost grows linearly with \(T\). For \(T = 1000\) and a large image-conditional U-Net, one sample takes seconds to minutes. This is the dominant practical complaint about diffusion models and the motivation for accelerated samplers.

Accelerated samplers include:

  • DDIM (Song et al., 2021) — reinterprets the same trained model as a non-Markovian deterministic process; samples in \(\sim 50\) steps with comparable quality (its update rule is sketched below).
  • DPM-Solver (Lu et al., 2022) — high-order ODE solver for the probability-flow ODE; samples in \(10\)–\(20\) steps.
  • Distillation (Salimans & Ho, 2022; consistency models, Song et al. 2023) — train a student that matches the teacher's many-step output in 1–4 steps.
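For reference, the deterministic DDIM update reuses the same \(\varepsilon_\theta\): it forms a clean-image estimate and re-noises it to the previous marginal,

\[ \hat x_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t} \, \varepsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}, \qquad x_{t-1} = \sqrt{\bar\alpha_{t-1}} \, \hat x_0 + \sqrt{1 - \bar\alpha_{t-1}} \, \varepsilon_\theta(x_t, t), \]

which can be evaluated on a coarse subsequence of timesteps rather than all \(T\).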

The Architecture

The network \(\varepsilon_\theta(x_t, t)\) takes a noisy image and a timestep and predicts the noise. Architectural choices:

  • U-Net backbone — encoder-decoder with skip connections at multiple resolutions. The convolutional inductive bias and multi-scale processing match the structure of the noising problem (denoising at fine and coarse scales simultaneously).
  • Time conditioning — sinusoidal embedding of \(t\) projected into per-block scale-and-shift parameters (FiLM, AdaGN); a sketch of this pattern follows the list. The network must condition on the noise level it is denoising.
  • Self-attention layers — typically inserted at lower resolutions for global context.
  • Conditioning — for class-conditional or text-to-image generation, additional conditioning is injected via cross-attention or concatenation. See classifier-free guidance for the trick that makes conditional sampling effective.
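A sketch of the time-conditioning pattern in PyTorch; the embedding dimension and the FiLM projection mlp are illustrative choices, not a specific model's API:

import math
import torch

def timestep_embedding(t, dim):
    # Sinusoidal embedding of integer timesteps: (B,) -> (B, dim).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# FiLM-style use inside a residual block (mlp is a hypothetical projection to 2 * channels):
#   scale, shift = mlp(timestep_embedding(t, dim)).chunk(2, dim=-1)
#   h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]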

Strengths and Weaknesses

Pros:

  • Sample quality. State-of-the-art on most image generation benchmarks since 2021, often by a wide margin.
  • Training stability. No adversarial dynamics; the loss is a simple regression objective with minimal hyperparameter sensitivity.
  • Likelihood available. A continuous-time bound or a discrete ELBO can be evaluated, unlike GANs.
  • Compositional generation. Classifier guidance, classifier-free guidance, inpainting, and image-to-image translation are all natural extensions of the basic sampler.

Cons:

  • Slow sampling. Many forward passes per sample, even with accelerated samplers; still slower than GANs by orders of magnitude.
  • Unstructured latent space. The diffusion runs over the high-dimensional data space itself, not a learned latent. Latent diffusion (Rombach et al., 2022) addresses this by diffusing in a VAE's latent space.

Connection to the Other Diffusion Threads

Three views of the same model:

  1. DDPM (variational). Lower bound on \(\log p_\theta(x)\), decomposed into per-step Gaussian KLs. The story above.
  2. NCSN (score-based). Multi-scale denoising score matching with annealed Langevin sampling.
  3. SDE (continuous-time). Forward and reverse stochastic differential equations with a learned score.

All three are equivalent up to parameterization. DDPM was the first to combine the \(\varepsilon\)-prediction reparameterization, the unweighted simple loss, and the U-Net backbone; the SDE view (Song et al., 2021) showed that the equivalence is exact and provided the framework for the ODE samplers and likelihood computations that followed.
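For concreteness, following Song et al. (2021), the DDPM chain is a discretization of the variance-preserving forward SDE, whose reverse-time counterpart substitutes the learned score:

\[ dx = -\tfrac{1}{2} \beta(t) \, x \, dt + \sqrt{\beta(t)} \, dw, \qquad dx = \left[ -\tfrac{1}{2} \beta(t) \, x - \beta(t) \, s_\theta(x, t) \right] dt + \sqrt{\beta(t)} \, d\bar w, \]

with \(\bar w\) a reverse-time Wiener process.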

References

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” Advances in Neural Information Processing Systems (NeurIPS), 6840–51. https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html.
Sohl-Dickstein, Jascha, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.” International Conference on Machine Learning (ICML), 2256–65. https://proceedings.mlr.press/v37/sohl-dickstein15.html.