Denoising Score Matching

Motivation

Score matching (Hyvärinen 2005) trains a model \(s_\theta(x) \approx \nabla_x \log p_{\text{data}}(x)\) without needing the density’s normalizing constant, but its practical implementation requires the trace of the model’s Jacobian — expensive in high dimensions. Denoising score matching (Vincent 2011) replaces this with a much cheaper objective: predict the noise that was added to a data sample. The resulting loss is a simple mean-squared error with no Jacobian, and it provably learns the score of the noise-corrupted data distribution.

This is the loss that powers diffusion models. The diffusion training objective is denoising score matching applied across a schedule of noise levels.

The Setup

Pick a noise distribution \(q_\sigma(\tilde x \mid x) = \mathcal{N}(\tilde x; x, \sigma^2 I)\) — Gaussian noise of fixed scale \(\sigma\) added to clean data. The corrupted marginal is

\[ q_\sigma(\tilde x) = \int q_\sigma(\tilde x \mid x) p_{\text{data}}(x) \, dx, \]

a smoothed version of \(p_{\text{data}}\). The score of the conditional has a closed form:

\[ \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x) = \frac{x - \tilde x}{\sigma^2}. \]
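
This follows directly from the Gaussian log-density, whose normalizing constant does not depend on \(\tilde x\):

\[ \log q_\sigma(\tilde x \mid x) = -\frac{\|\tilde x - x\|^2}{2\sigma^2} + \text{const} \quad\Longrightarrow\quad \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x) = \frac{x - \tilde x}{\sigma^2}. \]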

The denoising-score-matching objective fits the model to this conditional score:

\[ J_{\text{DSM}}(\theta) = \tfrac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}, \, \tilde x \sim q_\sigma(\cdot \mid x)}\!\left[\left\| s_\theta(\tilde x) - \frac{x - \tilde x}{\sigma^2} \right\|^2\right]. \]

This is computable: sample \(x\) from data, sample \(\tilde x = x + \sigma \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, I)\), evaluate the model at \(\tilde x\), compute MSE against \((x - \tilde x)/\sigma^2 = -\varepsilon/\sigma\).
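
As a minimal sketch of one training step (PyTorch; `s_theta`, `x`, and `sigma` are illustrative names, with `x` a `(batch, d)` tensor and `s_theta` a network returning a tensor of the same shape):

import torch

def dsm_loss(s_theta, x, sigma):
    # Denoising score matching at a single noise level sigma.
    # x: clean data, shape (batch, d); s_theta maps noisy inputs to score estimates of the same shape.
    eps = torch.randn_like(x)            # eps ~ N(0, I)
    x_tilde = x + sigma * eps            # sample from q_sigma(. | x)
    target = -eps / sigma                # equals (x - x_tilde) / sigma^2
    return 0.5 * ((s_theta(x_tilde) - target) ** 2).sum(dim=-1).mean()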

What This Actually Fits

Vincent’s theorem (the Vincent identity) says that minimizing \(J_{\text{DSM}}\) is equivalent — up to a constant in \(\theta\) — to minimizing the explicit score-matching loss against the corrupted marginal \(q_\sigma\):

\[ J_{\text{DSM}}(\theta) = \tfrac{1}{2} \mathbb{E}_{\tilde x \sim q_\sigma}\!\left[\| s_\theta(\tilde x) - \nabla_{\tilde x} \log q_\sigma(\tilde x) \|^2\right] + \text{const}. \]

So training on noise prediction yields the score of the noised data distribution, not of the original one. The smaller the noise scale \(\sigma\), the closer \(q_\sigma\) is to \(p_{\text{data}}\) and the closer the learned score is to the true data score.

This identity is non-obvious: the trick is differentiating the marginal under the integral sign, which converts the unknown \(\nabla_{\tilde x} \log q_\sigma(\tilde x)\) into a posterior average of the tractable conditional score.
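
Concretely, pushing the gradient inside the integral that defines \(q_\sigma\) gives

\[ \nabla_{\tilde x} \log q_\sigma(\tilde x) = \frac{\nabla_{\tilde x}\, q_\sigma(\tilde x)}{q_\sigma(\tilde x)} = \frac{\int \nabla_{\tilde x}\, q_\sigma(\tilde x \mid x)\, p_{\text{data}}(x)\, dx}{q_\sigma(\tilde x)} = \mathbb{E}_{x \sim q_\sigma(x \mid \tilde x)}\!\left[ \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x) \right], \]

and substituting this into the cross term of the explicit loss shows that the two objectives differ only by terms independent of \(\theta\).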

The Noise-Prediction Reparameterization

It is convenient to write the model as a noise predictor instead of a score:

\[ \varepsilon_\theta(\tilde x) = -\sigma \, s_\theta(\tilde x). \]

Then the loss becomes

\[ J_{\text{DSM}}(\theta) = \tfrac{1}{2 \sigma^2} \mathbb{E}_{x, \varepsilon}\!\left[\|\varepsilon_\theta(\tilde x) - \varepsilon\|^2\right], \qquad \tilde x = x + \sigma \varepsilon. \]

This is the form usually used in code: predict the noise that was added, weighted by \(1/\sigma^2\). DDPM drops this weighting and uses the plain MSE at every noise level; that reweighting is part of what makes its objective a “weighted ELBO” rather than the exact one.
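
The same step in the noise-prediction form, continuing the sketch above; `eps_theta` is again an illustrative stand-in, and `weighted=False` gives the DDPM-style simple loss:

def noise_pred_loss(eps_theta, x, sigma, weighted=True):
    # Same objective in noise-prediction form: regress the added noise directly.
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps
    mse = ((eps_theta(x_tilde) - eps) ** 2).sum(dim=-1).mean()
    return mse / (2 * sigma ** 2) if weighted else mse   # weighted=False: unweighted simple loss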

Why It Beats Implicit Score Matching

Recall that implicit score matching requires computing \(\operatorname{tr}(\nabla_x s_\theta(x))\). For an image with \(d\) pixels, this is \(O(d)\) backward passes per training point — prohibitive.

Denoising score matching has no Jacobian term. Each training step is a single forward pass and a single backward pass through \(\varepsilon_\theta\). The total compute per step is the same as a supervised regression — just predict the noise.

The price: the learned score is the score of \(q_\sigma\), not \(p_{\text{data}}\). For small \(\sigma\) the difference is small but nonzero. Multi-scale training (training one model on a range of \(\sigma\) values, conditioned on \(\sigma\)) gives a model that knows the score at every noise level, which is what diffusion sampling needs.

Multi-Scale: From Single \(\sigma\) to a Schedule

A single \(\sigma\) produces a model of one noised distribution. To sample from \(p_{\text{data}}\) we need to invert the noising — start from pure noise and progressively denoise. This requires the score at every noise level on the path from pure noise to clean data.

The diffusion training loss is multi-scale denoising score matching:

\[ \mathcal{L}(\theta) = \mathbb{E}_{t \sim \text{schedule}}\!\left[\lambda(t) \cdot J_{\text{DSM}}^{(\sigma_t)}(\theta)\right], \]

with \(\lambda(t)\) a per-level weighting. Different choices of \(\lambda(t)\) recover different familiar objectives — the DDPM loss is a weighted ELBO.
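
A sketch of the multi-scale objective under the same assumptions, with a noise-conditional model `s_theta(x_tilde, sigma)` and a hypothetical `sample_sigma(n)` that draws noise levels from the schedule. Choosing \(\lambda(\sigma) = \sigma^2\) cancels the \(1/\sigma^2\) factor and recovers the DDPM-style unweighted noise-prediction MSE (up to the factor of \(\tfrac{1}{2}\)):

def multiscale_dsm_loss(s_theta, x, sample_sigma, lam=lambda s: s ** 2):
    # One random noise level per example; per-level DSM loss weighted by lam(sigma).
    sigma = sample_sigma(x.shape[0]).view(-1, 1)   # shape (batch, 1)
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps
    target = -eps / sigma
    per_example = 0.5 * ((s_theta(x_tilde, sigma) - target) ** 2).sum(dim=-1)
    return (lam(sigma).squeeze(-1) * per_example).mean()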

Sampling: Score Plus Langevin

Once trained, \(s_\theta(\tilde x; \sigma)\) knows the score of every noised distribution. Sampling runs Langevin dynamics from large \(\sigma\) to small \(\sigma\):

x ~ N(0, sigma_max^2 I)
for sigma in decreasing schedule:
    eta = eta_0 * sigma^2 / sigma_min^2          # step size scales with sigma^2 (annealed Langevin)
    for k in 1..K:
        z ~ N(0, I)                              # fresh Gaussian noise each inner step
        x = x + (eta/2) * s_theta(x; sigma) + sqrt(eta) * z   # Langevin step

This is annealed Langevin sampling, the original score-based sampler (Song & Ermon, 2019). Modern diffusion samplers (DDPM, DDIM, DPM-Solver) refine this with discretizations of the corresponding SDE or ODE, but the underlying object — a learned score across noise scales — is the same.

Where DSM Fits

Denoising score matching is the bridge between:

  • Score-based generative modeling (Song & Ermon, 2019) on one side: train scores at multiple noise levels and sample by annealed Langevin.
  • DDPM (Ho et al., 2020) on the other: train noise-predictors at discrete timesteps and sample by reverse-time diffusion.

Both use the DSM loss; the differences are in the noise schedule, the sampler, and the parameterization. The unifying SDE picture (Song et al., 2021) makes the equivalence precise.

References

Hyvärinen, Aapo. 2005. “Estimation of Non-Normalized Statistical Models by Score Matching.” Journal of Machine Learning Research 6: 695–709. https://www.jmlr.org/papers/v6/hyvarinen05a.html.
Vincent, Pascal. 2011. “A Connection Between Score Matching and Denoising Autoencoders.” Neural Computation 23 (7): 1661–74. https://doi.org/10.1162/neco_a_00142.