Score Matching
Motivation
Many generative-modeling problems reduce to fitting an unnormalized density \(p_\theta(x) = \tilde p_\theta(x) / Z(\theta)\), where the normalizing constant \(Z(\theta) = \int \tilde p_\theta(x) \, dx\) is intractable. Maximum likelihood fails because it requires evaluating \(Z(\theta)\). Score matching (Hyvärinen 2005) sidesteps this by fitting the gradient of the log-density,
\[ s_\theta(x) := \nabla_x \log p_\theta(x), \]
instead of the log-density itself. The gradient does not depend on \(Z(\theta)\):
\[ \nabla_x \log p_\theta(x) = \nabla_x \log \tilde p_\theta(x) - \nabla_x \log Z(\theta) = \nabla_x \log \tilde p_\theta(x), \]
since \(Z(\theta)\) is constant in \(x\). A model of the score therefore gets the effect of normalization “for free”: fitting it never requires computing \(Z(\theta)\).
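For instance, writing the unnormalized model as an energy-based model \(\tilde p_\theta(x) = e^{-E_\theta(x)}\) (a common parameterization, used here only as an illustration), the score is simply the negative energy gradient:
\[ \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) - \nabla_x \log Z(\theta) = -\nabla_x E_\theta(x), \]
so a network that outputs \(E_\theta(x)\), or its gradient directly, already parameterizes a valid score model without ever touching \(Z(\theta)\).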
Score matching is the foundation of diffusion models: the diffusion training objective is essentially denoising score matching applied at multiple noise levels, and the Vincent identity establishes the equivalence between the two formulations.
The Score and Why It Matters
[Figure: score field around a two-dimensional density. The score vector points in the direction where the log-density increases fastest; for a Gaussian-shaped density, the vectors point back toward the high-density region.]
The score of a density \(p\) is the gradient of its log,
\[ s(x) = \nabla_x \log p(x). \]
Two facts about scores make them useful as a substitute for densities:
- Normalization-free. \(\nabla_x \log p(x) = \nabla_x \log \tilde p(x)\), so the score of an unnormalized model is the score of its normalized counterpart.
- Enough for sampling. Langevin dynamics generates samples from \(p\) using only its score: \(x_{t+1} = x_t + \tfrac{\eta}{2} s(x_t) + \sqrt{\eta} \, \varepsilon_t\), \(\varepsilon_t \sim \mathcal{N}(0, I)\). As the step size \(\eta \to 0\) and the number of steps \(T \to \infty\), the samples converge in distribution to \(p\).
A model that knows the score therefore knows enough to generate samples and to compute relative log-densities along paths, even if it never knows the absolute density.
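As a small self-contained illustration of the Langevin update above, the following sketch samples from a 2-D standard Gaussian, whose score \(s(x) = -x\) is known in closed form. The function name, step size, and iteration count are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def langevin_sample(score, x0, step_size=1e-2, n_steps=2000, rng=None):
    """Unadjusted Langevin dynamics: x <- x + (eta/2) * s(x) + sqrt(eta) * noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score(x) + np.sqrt(step_size) * noise
    return x

# Standard 2-D Gaussian: log p(x) = -||x||^2 / 2 + const, so s(x) = -x.
samples = langevin_sample(lambda x: -x, x0=np.zeros((5000, 2)))
print(samples.mean(axis=0), samples.std(axis=0))  # roughly [0, 0] and [1, 1]
```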
Explicit Score Matching
The natural objective is to fit \(s_\theta\) to the data score \(s_{\text{data}}(x) = \nabla_x \log p_{\text{data}}(x)\) in expectation:
\[ J_{\text{ESM}}(\theta) = \tfrac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\| s_\theta(x) - \nabla_x \log p_{\text{data}}(x) \|^2 \right]. \]
The problem: we do not know \(p_{\text{data}}\) in closed form, so we cannot evaluate \(\nabla_x \log p_{\text{data}}(x)\) at training points. ESM as written is uncomputable.
Hyvärinen’s Identity (Implicit Score Matching)
Hyvärinen (2005) showed that under mild regularity conditions (\(p_{\text{data}}(x)\, s_\theta(x) \to 0\) as \(\|x\| \to \infty\), so the boundary term in integration by parts vanishes), the explicit score-matching objective equals an objective that makes no reference to \(p_{\text{data}}\)’s score:
\[ J_{\text{ESM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\tfrac{1}{2} \|s_\theta(x)\|^2 + \operatorname{tr}(\nabla_x s_\theta(x))\right] + \text{const}. \]
The constant does not depend on \(\theta\), so the right-hand side is a usable training loss. Both terms can be computed from the model alone, evaluated at samples from \(p_{\text{data}}\).
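The key step, sketched here for completeness, is integration by parts on the cross term. Expanding the square in \(J_{\text{ESM}}\), the only term involving the unknown data score is \(-\mathbb{E}[s_\theta(x)^\top \nabla_x \log p_{\text{data}}(x)]\); component by component,
\[ -\mathbb{E}_{p_{\text{data}}}\!\left[s_{\theta,i}(x)\, \partial_{x_i} \log p_{\text{data}}(x)\right] = -\int s_{\theta,i}(x)\, \partial_{x_i} p_{\text{data}}(x)\, dx = \int \partial_{x_i} s_{\theta,i}(x)\, p_{\text{data}}(x)\, dx, \]
where the boundary term vanishes by the decay assumption. Summing over \(i\) turns the cross term into \(\mathbb{E}[\operatorname{tr}(\nabla_x s_\theta(x))]\); the remaining \(\tfrac{1}{2}\mathbb{E}\!\left[\|\nabla_x \log p_{\text{data}}(x)\|^2\right]\) is the \(\theta\)-independent constant.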
The trace term \(\operatorname{tr}(\nabla_x s_\theta(x))\) is the sum of diagonal entries of the Jacobian of \(s_\theta\) — a divergence. For a \(d\)-dimensional input, naive computation requires \(d\) backward passes, which is prohibitive for high-dimensional data like images. This is the practical bottleneck of implicit score matching.
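A minimal PyTorch sketch of this implicit objective, with the trace computed exactly via one backward pass per input dimension (the cost described above); `score_model` is assumed to map a `(batch, d)` tensor to a `(batch, d)` score.

```python
import torch

def ism_loss(score_model, x):
    """Implicit score matching: E[ 0.5 * ||s(x)||^2 + tr(d s / d x) ]."""
    x = x.requires_grad_(True)
    s = score_model(x)                          # (batch, d)
    sq_norm = 0.5 * (s ** 2).sum(dim=1)
    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):                 # d backward passes -- the bottleneck
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]            # diagonal entry ds_i/dx_i
    return (sq_norm + trace).mean()
```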
Sliced and Denoising Variants
Two ways around the trace cost:
- Denoising score matching (Vincent, 2011) replaces the data score with the score of a noise-corrupted distribution, which has a closed form. The training loss becomes a simple mean-squared error against the added noise, with no Jacobian. This is the path that diffusion models follow.
- Sliced score matching (Song et al., 2019) estimates the trace via Hutchinson’s stochastic trace estimator: \(\operatorname{tr}(A) = \mathbb{E}_{v}[v^\top A v]\) for \(v\) with \(\mathbb{E}[vv^\top] = I\). Costs one backward pass per sample, regardless of \(d\).
Both make score matching practical for high-dimensional generative modeling. Denoising score matching is the dominant approach in modern diffusion models; sliced score matching is more general but less efficient when the noising trick is available.
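Under the same assumptions as the sketch above, the sliced variant replaces the exact trace with Hutchinson's estimate \(v^\top \nabla_x s_\theta(x)\, v\) using a single random probe \(v\) per sample, so only one backward pass is needed:

```python
import torch

def ssm_loss(score_model, x):
    """Sliced score matching: Hutchinson estimate of the trace term."""
    x = x.requires_grad_(True)
    v = torch.randn_like(x)                     # Gaussian probes; Rademacher also works
    s = score_model(x)                          # (batch, d)
    sq_norm = 0.5 * (s ** 2).sum(dim=1)
    # One vector-Jacobian product gives (ds/dx)^T v; dotting with v estimates the trace.
    grad = torch.autograd.grad((s * v).sum(), x, create_graph=True)[0]
    trace_est = (grad * v).sum(dim=1)
    return (sq_norm + trace_est).mean()
```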
Why It Works: Equivalence at the Population Level
If \(s_\theta(x) = s_{\text{data}}(x)\) for \(p_{\text{data}}\)-almost every \(x\), the implicit score-matching objective is minimized. Under regularity conditions, this also implies \(p_\theta = p_{\text{data}}\) — knowing the score everywhere determines the density up to a constant, and the constraint that both are densities (integrate to one) pins down the constant.
So fitting the score is, in principle, as informative as fitting the density. The practical content of score matching is that it is computable without knowing \(Z(\theta)\) or \(p_{\text{data}}\).
Connection to Diffusion
A diffusion model trains a network to predict the score of a noised version of the data distribution, at every noise level. The training loss is denoising score matching at each level, summed (or weighted) over levels. Sampling reverses the noising process by integrating annealed Langevin dynamics or the reverse-time SDE using the learned score.
The Vincent identity establishes that the simple noise-prediction loss used in DDPM is, up to a known weighting, denoising score matching — and therefore a consistent estimator of the score of the noised distribution.
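Concretely, for Gaussian corruption at a single noise level \(\sigma\) the corrupted sample is \(\tilde x = x + \sigma \varepsilon\), and the conditional score \(\nabla_{\tilde x} \log q(\tilde x \mid x) = -(\tilde x - x)/\sigma^2 = -\varepsilon/\sigma\) is available in closed form, so the denoising loss is a plain regression onto \(-\varepsilon/\sigma\) (equivalently, noise prediction up to scaling). A minimal single-level sketch, with illustrative names:

```python
import torch

def dsm_loss(score_model, x, sigma):
    """Denoising score matching at one Gaussian noise level sigma."""
    eps = torch.randn_like(x)
    x_noisy = x + sigma * eps
    target = -eps / sigma                       # score of q(x_noisy | x)
    s = score_model(x_noisy)
    return 0.5 * ((s - target) ** 2).sum(dim=1).mean()
```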