Denoising Score Matching Equals Explicit Score Matching (the Vincent Identity)
Claim
Let \(p_{\text{data}}\) be a data distribution on \(\mathbb{R}^d\), \(q_\sigma(\tilde x \mid x) = \mathcal{N}(\tilde x; x, \sigma^2 I)\) a Gaussian noise kernel, and
\[ q_\sigma(\tilde x) = \int q_\sigma(\tilde x \mid x) p_{\text{data}}(x) \, dx \]
the corresponding noise-corrupted marginal. Define the denoising score matching (DSM) and explicit score matching (ESM) objectives at noise level \(\sigma\):
\[ J_{\text{DSM}}(\theta) = \tfrac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}, \, \tilde x \sim q_\sigma(\cdot \mid x)}\!\left[\left\| s_\theta(\tilde x) - \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x) \right\|^2\right], \]
\[ J_{\text{ESM}}(\theta) = \tfrac{1}{2} \mathbb{E}_{\tilde x \sim q_\sigma}\!\left[\left\| s_\theta(\tilde x) - \nabla_{\tilde x} \log q_\sigma(\tilde x) \right\|^2\right]. \]
Then
\[ J_{\text{DSM}}(\theta) = J_{\text{ESM}}(\theta) + C, \]
where \(C\) does not depend on \(\theta\). The two objectives therefore have identical \(\theta\)-gradients, and minimizing \(J_{\text{DSM}}\) yields a score model of the corrupted marginal \(q_\sigma\), without ever evaluating \(q_\sigma\) or its score directly.
This is Vincent’s identity (Vincent 2011), the technical fact that makes score-based diffusion models tractable: the denoising loss is a noise-prediction MSE, but it actually fits the score of \(q_\sigma\).
Setup
Expand \(J_{\text{ESM}}\) by squaring out:
\[ J_{\text{ESM}}(\theta) = \tfrac{1}{2} \mathbb{E}_{\tilde x \sim q_\sigma}[\|s_\theta(\tilde x)\|^2] - \mathbb{E}_{\tilde x \sim q_\sigma}[s_\theta(\tilde x)^\top \nabla_{\tilde x} \log q_\sigma(\tilde x)] + \tfrac{1}{2} \mathbb{E}_{\tilde x \sim q_\sigma}[\|\nabla_{\tilde x} \log q_\sigma(\tilde x)\|^2]. \]
Expand \(J_{\text{DSM}}\) analogously:
\[ J_{\text{DSM}}(\theta) = \tfrac{1}{2} \mathbb{E}_{x, \tilde x}[\|s_\theta(\tilde x)\|^2] - \mathbb{E}_{x, \tilde x}[s_\theta(\tilde x)^\top \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)] + \tfrac{1}{2} \mathbb{E}_{x, \tilde x}[\|\nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)\|^2]. \]
The last terms in each expression do not depend on \(\theta\) and are absorbed into \(C\). The first terms are equal because, under the DSM expectation, the marginal distribution of \(\tilde x\) is exactly \(q_\sigma(\tilde x)\):
\[ \mathbb{E}_{x, \tilde x}[\|s_\theta(\tilde x)\|^2] = \int p_{\text{data}}(x) \int q_\sigma(\tilde x \mid x) \|s_\theta(\tilde x)\|^2 \, d\tilde x \, dx = \int q_\sigma(\tilde x) \|s_\theta(\tilde x)\|^2 \, d\tilde x = \mathbb{E}_{\tilde x \sim q_\sigma}[\|s_\theta(\tilde x)\|^2]. \]
So the two objectives differ only in their cross terms, and we need to show
\[ \mathbb{E}_{\tilde x \sim q_\sigma}[s_\theta(\tilde x)^\top \nabla_{\tilde x} \log q_\sigma(\tilde x)] = \mathbb{E}_{x, \tilde x}[s_\theta(\tilde x)^\top \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)]. \tag{$\dagger$} \]
The Cross-Term Identity
This is the heart of the proof. Start from \(\nabla_{\tilde x} q_\sigma(\tilde x) = \int p_{\text{data}}(x) \nabla_{\tilde x} q_\sigma(\tilde x \mid x) \, dx\); differentiating under the integral sign is justified because the Gaussian kernel is smooth in \(\tilde x\) with gradient bounded uniformly in \(x\) (see Caveats). Convert this to a score expression:
\[ \nabla_{\tilde x} \log q_\sigma(\tilde x) = \frac{\nabla_{\tilde x} q_\sigma(\tilde x)}{q_\sigma(\tilde x)} = \frac{\int p_{\text{data}}(x) \nabla_{\tilde x} q_\sigma(\tilde x \mid x) \, dx}{q_\sigma(\tilde x)} = \frac{\int p_{\text{data}}(x) q_\sigma(\tilde x \mid x) \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x) \, dx}{q_\sigma(\tilde x)}, \]
where the last step uses \(\nabla q = q \nabla \log q\). By Bayes' rule, \(p_{\text{data}}(x) q_\sigma(\tilde x \mid x) / q_\sigma(\tilde x)\) is the posterior density \(p(x \mid \tilde x)\) of the clean point given the noisy one, so the ratio above is a conditional expectation:
\[ \nabla_{\tilde x} \log q_\sigma(\tilde x) = \mathbb{E}_{x \sim p(\cdot \mid \tilde x)}[\nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)]. \tag{$\ddagger$} \]
The score of the marginal is the conditional expectation of the score of the conditional. Substitute into the LHS of \((\dagger)\):
\[ \mathbb{E}_{\tilde x \sim q_\sigma}\!\left[s_\theta(\tilde x)^\top \nabla_{\tilde x} \log q_\sigma(\tilde x)\right] = \mathbb{E}_{\tilde x \sim q_\sigma}\!\left[s_\theta(\tilde x)^\top \mathbb{E}_{x \mid \tilde x}[\nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)]\right] = \mathbb{E}_{x, \tilde x}\!\left[s_\theta(\tilde x)^\top \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)\right], \]
where the last step pulls \(s_\theta(\tilde x)\) inside the inner expectation (it is a function of \(\tilde x\) only, so it does not depend on \(x\)) and rewrites the iterated expectation as a joint expectation. This is exactly the RHS of \((\dagger)\).
Therefore \(J_{\text{DSM}}(\theta) - J_{\text{ESM}}(\theta) = C\) where \(C\) is the difference of the two \(\theta\)-independent third terms. \(\square\)
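The identity is easy to check numerically. Below is a minimal sketch in NumPy, under illustrative assumptions: a two-point data distribution \(p_{\text{data}} = \tfrac{1}{2}\delta_{-2} + \tfrac{1}{2}\delta_{+2}\) (so \(q_\sigma\) is a two-Gaussian mixture with a closed-form score) and a toy linear model \(s_\theta(\tilde x) = \theta \tilde x\). Every printed gap estimates the same constant \(C\), so the values should agree across \(\theta\) up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 0.7, 2_000_000

# p_data = 0.5*delta(-2) + 0.5*delta(+2), so q_sigma is a two-Gaussian mixture.
x = rng.choice([-2.0, 2.0], size=n)
xt = x + sigma * rng.standard_normal(n)        # xt ~ q_sigma(. | x)

def marginal_score(xt):
    # Direct derivative of log[0.5*N(xt; -2, s^2) + 0.5*N(xt; +2, s^2)]:
    # grad log q_sigma(xt) = (2*tanh(2*xt/s^2) - xt) / s^2.
    return (2.0 * np.tanh(2.0 * xt / sigma**2) - xt) / sigma**2

def conditional_score(xt, x):
    return (x - xt) / sigma**2                 # grad_xt log N(xt; x, s^2 I)

for theta in (-3.0, -1.0, 0.0, 1.0):
    s = theta * xt                             # toy score model s_theta(xt) = theta*xt
    J_dsm = 0.5 * np.mean((s - conditional_score(xt, x)) ** 2)
    J_esm = 0.5 * np.mean((s - marginal_score(xt)) ** 2)
    print(f"theta = {theta:+.1f}   J_DSM - J_ESM = {J_dsm - J_esm:.4f}")
```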
Why the Conditional Score Has a Closed Form
Vincent’s identity is useful because the conditional score \(\nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)\) is computable. For Gaussian noise \(q_\sigma(\tilde x \mid x) = \mathcal{N}(\tilde x; x, \sigma^2 I)\),
\[ \log q_\sigma(\tilde x \mid x) = -\frac{\|\tilde x - x\|^2}{2 \sigma^2} + \text{const}, \qquad \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x) = \frac{x - \tilde x}{\sigma^2}. \]
So \(J_{\text{DSM}}\) is exactly mean-squared-error regression of \(s_\theta(\tilde x)\) against \((x - \tilde x)/\sigma^2\) — with \(\tilde x = x + \sigma \varepsilon\) and \(\varepsilon \sim \mathcal{N}(0, I)\), this is equivalent to predicting \(-\varepsilon/\sigma\), i.e. predicting the noise that was added.
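In code, this fixed-\(\sigma\) regression is a few lines. A minimal PyTorch sketch, assuming `x` is a `(batch, d)` tensor and `score_model` is any callable from \(\mathbb{R}^d\) to \(\mathbb{R}^d\) (a hypothetical stand-in for a real network):

```python
import torch

def dsm_loss(score_model, x, sigma):
    """Denoising score matching at a fixed noise level sigma.

    score_model(xt) estimates grad_xt log q_sigma(xt); the regression
    target (x - xt)/sigma^2 equals -eps/sigma for xt = x + sigma*eps.
    """
    eps = torch.randn_like(x)           # eps ~ N(0, I)
    xt = x + sigma * eps                # xt ~ N(x, sigma^2 I)
    target = -eps / sigma               # = (x - xt) / sigma^2
    return 0.5 * ((score_model(xt) - target) ** 2).sum(dim=-1).mean()
```

Regressing against \(-\varepsilon\) instead and absorbing the \(1/\sigma\) into the model output gives the usual \(\varepsilon\)-prediction parameterization; the two targets differ only by the factor \(\sigma\).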
Implication for Diffusion Training
The identity \((\ddagger)\) has a clean independent reading. Plugging the Gaussian conditional score \((x - \tilde x)/\sigma^2\) into \((\ddagger)\) gives \(\nabla_{\tilde x} \log q_\sigma(\tilde x) = (\mathbb{E}[x \mid \tilde x] - \tilde x)/\sigma^2\). Rearranged: at noise level \(\sigma\), the optimal denoiser maps a noisy observation \(\tilde x\) to the posterior mean of the clean signal,
\[ \mathbb{E}[x \mid \tilde x] = \tilde x + \sigma^2 \nabla_{\tilde x} \log q_\sigma(\tilde x). \]
So a perfectly trained score model is equivalent to a perfectly trained MMSE denoiser. This is Tweedie's formula (a separate but related result, due to Robbins and Miyasawa and generalized by Efron), and it is the reason "noise prediction," "score estimation," and "MMSE denoising" can be used interchangeably in the diffusion literature.
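A quick sanity check of the formula: take \(p_{\text{data}} = \mathcal{N}(0, I)\). Then \(q_\sigma = \mathcal{N}(0, (1+\sigma^2) I)\), so
\[ \nabla_{\tilde x} \log q_\sigma(\tilde x) = -\frac{\tilde x}{1+\sigma^2}, \qquad \mathbb{E}[x \mid \tilde x] = \tilde x - \frac{\sigma^2 \tilde x}{1+\sigma^2} = \frac{\tilde x}{1+\sigma^2}, \]
which is the classical Gaussian posterior mean: \(\tilde x\) shrunk toward the prior mean by the factor \(1/(1+\sigma^2)\).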
Caveats
- Regularity. The proof interchanges differentiation and integration, which requires \(q_\sigma(\tilde x \mid x)\) to be smooth in \(\tilde x\) with derivatives dominated uniformly in \(x\). Gaussian kernels satisfy this; degenerate kernels (e.g. deterministic corruptions) do not.
- Score is of \(q_\sigma\), not \(p_{\text{data}}\). Minimizing \(J_{\text{DSM}}\) at fixed \(\sigma\) recovers the score of the corrupted distribution. Recovering the score of the data distribution itself requires \(\sigma \to 0\), but in that limit the regression target \((x - \tilde x)/\sigma^2\) blows up and the loss becomes ill-conditioned. The standard fix in diffusion is to train at multiple \(\sigma\) (a noise schedule, sketched below), which gives a model of \(q_\sigma\) at every level, and that is all sampling requires.
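For concreteness, a minimal multi-level sketch in the same PyTorch style as above, assuming a \(\sigma\)-conditioned `score_model` and the \(\lambda(\sigma) = \sigma^2\) loss weighting used in NCSN-style training (Song and Ermon 2019), which puts every noise level on roughly the same scale; names and shapes are illustrative:

```python
import torch

def multi_sigma_dsm_loss(score_model, x, sigmas):
    """DSM averaged over a noise schedule with lambda(sigma) = sigma^2
    weighting. Assumes x has shape (batch, d), sigmas is a 1-D tensor of
    noise levels, and score_model(xt, sigma) returns a (batch, d) score
    estimate at the given per-example noise levels.
    """
    idx = torch.randint(len(sigmas), (x.shape[0],))
    sigma = sigmas[idx].unsqueeze(-1)                  # (batch, 1)
    eps = torch.randn_like(x)
    xt = x + sigma * eps                               # xt ~ N(x, sigma^2 I)
    s = score_model(xt, sigma.squeeze(-1))
    # sigma^2 * ||s - (x - xt)/sigma^2||^2 = ||sigma*s + eps||^2
    return 0.5 * ((sigma * s + eps) ** 2).sum(dim=-1).mean()
```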