Denoising Score Matching
Motivation
Score matching (Hyvärinen 2005) trains a model \(s_\theta(x) \approx \nabla_x \log p_{\text{data}}(x)\) without needing the density’s normalizing constant, but its practical implementation requires the trace of the model’s Jacobian — expensive in high dimensions. Denoising score matching (Vincent 2011) replaces this with a much cheaper objective: predict the noise that was added to a data sample. The resulting loss is a simple mean-squared error with no Jacobian, and it provably learns the score of the noise-corrupted data distribution.
This is the loss that powers diffusion models. The diffusion training objective is denoising score matching applied across a schedule of noise levels.
The Setup
Pick a noise distribution \(q_\sigma(\tilde x \mid x) = \mathcal{N}(\tilde x; x, \sigma^2 I)\) — Gaussian noise of fixed scale \(\sigma\) added to clean data. The corrupted marginal is
\[ q_\sigma(\tilde x) = \int q_\sigma(\tilde x \mid x) p_{\text{data}}(x) \, dx, \]
a smoothed version of \(p_{\text{data}}\). The score of the conditional has a closed form:
\[ \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x) = \frac{x - \tilde x}{\sigma^2}. \]
The denoising-score-matching objective fits the model to this conditional score:
\[ J_{\text{DSM}}(\theta) = \tfrac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}, \, \tilde x \sim q_\sigma(\cdot \mid x)}\!\left[\left\| s_\theta(\tilde x) - \frac{x - \tilde x}{\sigma^2} \right\|^2\right]. \]
This is computable: sample \(x\) from data, sample \(\tilde x = x + \sigma \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, I)\), evaluate the model at \(\tilde x\), compute MSE against \((x - \tilde x)/\sigma^2 = -\varepsilon/\sigma\).
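As a concrete sketch, here is one DSM step in NumPy. The "model" is a stand-in: for data drawn from \(\mathcal{N}(0, I)\), the corrupted marginal is \(\mathcal{N}(0, (1+\sigma^2) I)\), whose exact score \(-\tilde x/(1+\sigma^2)\) is the global minimizer of the loss; in practice \(s_\theta\) would be a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.normal(size=(4, 2))          # clean data batch, here ~ N(0, I)
eps = rng.normal(size=x.shape)       # standard Gaussian noise
x_tilde = x + sigma * eps            # corrupted samples

target = (x - x_tilde) / sigma**2    # conditional score; equals -eps / sigma

def s_theta(xt):
    # stand-in "model": the exact score of q_sigma = N(0, (1 + sigma^2) I)
    return -xt / (1.0 + sigma**2)

loss = 0.5 * np.mean(np.sum((s_theta(x_tilde) - target) ** 2, axis=1))
```

Note that `target` and `-eps / sigma` are the same array, which is why the loss can be computed without ever evaluating a density.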
What This Actually Fits
Vincent’s theorem (the Vincent identity) says that minimizing \(J_{\text{DSM}}\) is equivalent — up to a constant in \(\theta\) — to minimizing the explicit score-matching loss against the corrupted marginal \(q_\sigma\):
\[ J_{\text{DSM}}(\theta) = \tfrac{1}{2} \mathbb{E}_{\tilde x \sim q_\sigma}\!\left[\| s_\theta(\tilde x) - \nabla_{\tilde x} \log q_\sigma(\tilde x) \|^2\right] + \text{const}. \]
So training on noise prediction recovers the score of the noised data distribution, not the original. The smaller \(\sigma\) is, the closer \(q_\sigma\) is to \(p_{\text{data}}\) and the closer the learned score is to the data score.
This identity is non-obvious. The trick is that the intractable marginal score is an expectation of the tractable conditional score: expanding both squared losses, the \(\|s_\theta\|^2\) terms are identical, the cross terms coincide, and everything else is constant in \(\theta\).
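A compact sketch of why the cross terms coincide (using \(q_\sigma(\tilde x) \nabla_{\tilde x} \log q_\sigma(\tilde x) = \nabla_{\tilde x} q_\sigma(\tilde x)\) and differentiating under the integral defining the marginal):
\[ \mathbb{E}_{\tilde x \sim q_\sigma}\!\big[\langle s_\theta(\tilde x), \nabla_{\tilde x} \log q_\sigma(\tilde x)\rangle\big] = \int \langle s_\theta(\tilde x), \nabla_{\tilde x} q_\sigma(\tilde x)\rangle \, d\tilde x \]
\[ = \int\!\!\int p_{\text{data}}(x) \, \langle s_\theta(\tilde x), \nabla_{\tilde x} q_\sigma(\tilde x \mid x)\rangle \, dx \, d\tilde x = \mathbb{E}_{x,\, \tilde x \sim q_\sigma(\cdot \mid x)}\!\big[\langle s_\theta(\tilde x), \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)\rangle\big]. \]
Since only the cross term and \(\|s_\theta\|^2\) depend on \(\theta\), the two objectives differ by a constant.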
The Noise-Prediction Reparameterization
It is convenient to write the model as a noise predictor instead of a score:
\[ \varepsilon_\theta(\tilde x) = -\sigma \, s_\theta(\tilde x). \]
Then the loss becomes
\[ J_{\text{DSM}}(\theta) = \tfrac{1}{2 \sigma^2} \mathbb{E}_{x, \varepsilon}\!\left[\|\varepsilon_\theta(\tilde x) - \varepsilon\|^2\right], \qquad \tilde x = x + \sigma \varepsilon. \]
This is the form usually used in code: predict the noise that was added, weighted by \(1/\sigma^2\). DDPM's simplified loss drops this per-level weighting on the \(\varepsilon\)-MSE, which is why it optimizes a reweighted ELBO rather than the exact one.
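The two parameterizations can be checked against each other numerically; a minimal sketch, again with a linear stand-in model:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3
x = rng.normal(size=(8, 3))
eps = rng.normal(size=x.shape)
x_tilde = x + sigma * eps

def s_theta(xt):                 # any score model; a linear stand-in here
    return -xt / (1.0 + sigma**2)

def eps_theta(xt):               # the same model viewed as a noise predictor
    return -sigma * s_theta(xt)

score_form = 0.5 * np.mean(
    np.sum((s_theta(x_tilde) - (x - x_tilde) / sigma**2) ** 2, axis=1))
eps_form = np.mean(
    np.sum((eps_theta(x_tilde) - eps) ** 2, axis=1)) / (2 * sigma**2)
```

`score_form` and `eps_form` agree to floating-point precision for any model, not just this stand-in, since the substitution \(\varepsilon_\theta = -\sigma s_\theta\) is exact.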
Why It Beats Implicit Score Matching
Recall that implicit score matching requires computing \(\operatorname{tr}(\nabla_x s_\theta(x))\). For an image with \(d\) pixels, this is \(O(d)\) backward passes per training point — prohibitive.
Denoising score matching has no Jacobian term. Each training step is a single forward pass and a single backward pass through \(\varepsilon_\theta\). The total compute per step is the same as a supervised regression — just predict the noise.
The price: the learned score is the score of \(q_\sigma\), not \(p_{\text{data}}\). For small \(\sigma\) the difference is small but nonzero. Multi-scale training (training one model on a range of \(\sigma\) values, conditioned on \(\sigma\)) gives a model that knows the score at every noise level, which is what diffusion sampling needs.
Multi-Scale: From Single \(\sigma\) to a Schedule
A single \(\sigma\) produces a model of one noised distribution. To sample from \(p_{\text{data}}\) we need to invert the noising — start from pure noise and progressively denoise. This requires the score at every noise level on the path from pure noise to clean data.
The diffusion training loss is multi-scale denoising score matching:
\[ \mathcal{L}(\theta) = \mathbb{E}_{t \sim \text{schedule}}\!\left[\lambda(t) \cdot J_{\text{DSM}}^{(\sigma_t)}(\theta)\right], \]
with \(\lambda(t)\) a per-level weighting. Different choices of \(\lambda(t)\) recover different familiar objectives — the DDPM loss is a weighted ELBO.
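A sketch of the multi-scale loss, with an illustrative geometric schedule and the \(\lambda(\sigma) = \sigma^2\) weighting (the choice in Song & Ermon, which balances the \(1/\sigma^2\) blow-up of the per-level targets); the \(\sigma\)-conditioned model is again a stand-in:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmas = np.geomspace(0.01, 1.0, num=10)   # illustrative geometric schedule

def s_theta(xt, sigma):
    # stand-in sigma-conditioned model: exact score when data ~ N(0, I)
    return -xt / (1.0 + sigma**2)

def multiscale_dsm_loss(x):
    total = 0.0
    for sigma in sigmas:
        eps = rng.normal(size=x.shape)
        x_tilde = x + sigma * eps
        # per-level DSM loss: target (x - x_tilde)/sigma^2 equals -eps/sigma
        per_level = 0.5 * np.mean(
            np.sum((s_theta(x_tilde, sigma) + eps / sigma) ** 2, axis=1))
        total += sigma**2 * per_level      # lambda(sigma) = sigma^2 weighting
    return total / len(sigmas)

loss = multiscale_dsm_loss(rng.normal(size=(16, 2)))
```

In practice the level is sampled per example rather than summed over, but the expectation being estimated is the same.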
Sampling: Score Plus Langevin
Once trained, \(s_\theta(\tilde x; \sigma)\) knows the score of every noised distribution. Sampling runs Langevin dynamics from large \(\sigma\) to small \(\sigma\):
x ~ N(0, sigma_max^2 I)
for sigma in decreasing schedule:
    for k in 1..K:
        z ~ N(0, I)
        x = x + (eta/2) * s_theta(x; sigma) + sqrt(eta) * z   # Langevin step
This is annealed Langevin sampling, the original score-based sampler (Song & Ermon, 2019). Modern diffusion samplers (DDPM, DDIM, DPM-Solver) refine this with discretizations of the corresponding SDE or ODE, but the underlying object — a learned score across noise scales — is the same.
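A runnable toy version of annealed Langevin sampling, substituting the exact score of a Gaussian (mean `mu` chosen for illustration) for a learned \(s_\theta\), with a per-level step size proportional to \(\sigma^2\) in the spirit of Song & Ermon's scaling:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([3.0, -2.0])                 # toy data distribution: N(mu, I)

def score(x, sigma):
    # exact score of the sigma-noised marginal N(mu, (1 + sigma^2) I)
    return (mu - x) / (1.0 + sigma**2)

sigmas = np.geomspace(1.0, 0.01, num=20)   # decreasing noise schedule
x = sigmas[0] * rng.normal(size=(500, 2))  # start from N(0, sigma_max^2 I)

for sigma in sigmas:
    eta = 0.5 * sigma**2                   # step size scaled with sigma^2
    for _ in range(20):
        z = rng.normal(size=x.shape)
        x = x + 0.5 * eta * score(x, sigma) + np.sqrt(eta) * z

# samples now approximate the data distribution N(mu, I),
# up to the smallest noise level in the schedule
```

With a learned model, `score(x, sigma)` would be replaced by `s_theta(x, sigma)`; nothing else in the loop changes.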
Where DSM Fits
Denoising score matching is the bridge between:
- Score-based generative modeling (Song & Ermon, 2019) on one side: train scores at multiple noise levels and sample by annealed Langevin.
- DDPM (Ho et al., 2020) on the other: train noise-predictors at discrete timesteps and sample by reverse-time diffusion.
Both use the DSM loss; the differences are in the noise schedule, the sampler, and the parameterization. The unifying SDE picture (Song et al., 2021) makes the equivalence precise.