The Score-Based SDE View of Diffusion

Motivation

DDPM and the score-based generative models of Song & Ermon look superficially different — one has a discrete Markov chain and a variational objective, the other has a continuous Langevin sampler and a score-matching objective. Song et al. (2020) showed that both are discretizations of the same continuous-time stochastic differential equation, and that the trained networks in both frameworks compute the same object: the score \(\nabla_x \log p_t(x)\) of the noised data distribution at time \(t\).

The SDE view is the cleanest unifying picture. It exposes:

  • The forward and reverse processes as continuous-time SDEs with explicit drift and diffusion.
  • A deterministic ODE with the same marginals, which gives faster sampling and exact likelihoods.
  • A single learned object — the score function across time — that parameterizes them all.

The Forward SDE

The forward Gaussian noising process, in continuous time, satisfies a linear SDE. The two standard families:

Variance-preserving (VP) SDE — the continuous limit of DDPM:

\[ dx = -\tfrac{1}{2} \beta(t) x \, dt + \sqrt{\beta(t)} \, dW_t. \]

Variance-exploding (VE) SDE — the continuous limit of Song & Ermon’s NCSN:

\[ dx = \sqrt{\frac{d\sigma^2(t)}{dt}} \, dW_t. \]

Both push \(p_0 = p_{\text{data}}\) toward a known prior: the VP marginal converges to \(\mathcal{N}(0, I)\), while the VE marginal’s variance grows to \(\sigma^2(T)\), making the prior a wide Gaussian. Either way the forward process is fixed; no learning is needed.
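A minimal NumPy sketch of the VP transition kernel, assuming the common linear schedule \(\beta(t) = \beta_{\min} + (\beta_{\max} - \beta_{\min})\,t\) on \([0, 1]\) (a design choice, not the only one):

```python
import numpy as np

def vp_marginal_sample(x0, t, beta_min=0.1, beta_max=20.0, rng=None):
    r"""Draw x_t ~ p_t(x_t | x_0) for the VP SDE with linear beta(t) on [0, 1].

    The closed-form marginal is N(exp(-B(t)/2) * x0, (1 - exp(-B(t))) I),
    where B(t) = \int_0^t beta(s) ds.
    """
    rng = np.random.default_rng() if rng is None else rng
    B = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2  # integral of beta
    mean_coef = np.exp(-0.5 * B)
    std = np.sqrt(1.0 - np.exp(-B))
    eps = rng.standard_normal(np.shape(x0))
    return mean_coef * np.asarray(x0) + std * eps, eps
```

At \(t = 0\) this returns \(x_0\) unchanged; by \(t = 1\) the signal coefficient has decayed to near zero and the sample is essentially prior noise.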

The Reverse SDE

Anderson (1982) showed that the time-reversal of any Itô SDE is itself an SDE, with drift modified by a score term. For the forward SDE

\[ dx = f(x, t) \, dt + g(t) \, dW_t, \]

the reverse is

\[ dx = \!\left[f(x, t) - g(t)^2 \nabla_x \log p_t(x)\right] dt + g(t) \, d\bar W_t, \]

run backward in time. The new ingredient is \(\nabla_x \log p_t(x)\) — the score of the marginal at time \(t\).

This is the key fact: inverting the forward process requires only the score. If we have a network that knows the score at every \(t\), we can integrate the reverse SDE from \(t = T\) (start at the prior) to \(t = 0\) (a clean sample) by any standard SDE solver.
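The simplest such solver is Euler–Maruyama. A sketch, assuming `score`, `f`, and `g` are supplied as callables (in practice `score` would be the trained network, and `x_T` would be drawn from the prior):

```python
import numpy as np

def reverse_sde_sample(score, f, g, x_T, T=1.0, n_steps=500, rng=None):
    """Euler-Maruyama integration of the reverse SDE
        dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dW,
    run from t = T down to t = 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for i in range(n_steps):
        t = T - i * dt
        drift = f(x, t) - g(t) ** 2 * score(x, t)
        noise = rng.standard_normal(x.shape)
        x = x - drift * dt + g(t) * np.sqrt(dt) * noise  # one step backward in t
    return x
```

As a sanity check: for VP with \(\beta \equiv 1\) and data \(\mathcal{N}(0, I)\), the marginals are stationary, the exact score is \(-x\), and the sampler returns (approximately) standard normal samples.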

Training: One Score Network Across All \(t\)

The network is trained by continuous-time denoising score matching — the same objective as in DDPM, just continuous in \(t\):

\[ \mathcal{L}(\theta) = \mathbb{E}_{t \sim U[0, T]}\!\left[\lambda(t) \cdot \mathbb{E}_{x_0, x_t}\!\left[\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \|^2\right]\right], \]

with \(p_t(x_t \mid x_0)\) the closed-form Gaussian transition implied by the forward SDE. The \(\lambda(t)\) weighting is part of the design space; the DDPM “simple” loss corresponds to one specific choice.

In practice the network’s input is \((x_t, t)\) and the output is either the score itself, the noise, or the clean signal — these are all linear reparameterizations of each other and a matter of convenience.
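A sketch of one Monte-Carlo estimate of this objective for the VP SDE, assuming a linear \(\beta(t)\) schedule and the weighting \(\lambda(t) = \text{std}(t)^2\) (the choice under which the residual reduces to the DDPM-style noise-prediction loss); `score_model` is a stand-in for \(s_\theta\):

```python
import numpy as np

def dsm_loss(score_model, x0, beta_min=0.1, beta_max=20.0, rng=None):
    """Continuous-time denoising score matching for the VP SDE.

    The conditional score is grad log p_t(x_t | x_0) = -(x_t - mean)/std^2
    = -eps/std, so with lambda(t) = std(t)^2 the weighted residual becomes
    || std * s_theta(x_t, t) + eps ||^2.
    """
    rng = np.random.default_rng() if rng is None else rng
    t = rng.uniform(1e-5, 1.0, size=(x0.shape[0],) + (1,) * (x0.ndim - 1))
    B = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
    mean_coef, std = np.exp(-0.5 * B), np.sqrt(1.0 - np.exp(-B))
    eps = rng.standard_normal(x0.shape)
    x_t = mean_coef * x0 + std * eps
    return np.mean(np.sum((std * score_model(x_t, t) + eps) ** 2, axis=-1))
```

A real training loop would backpropagate this scalar through the network parameters; the estimator itself is framework-agnostic.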

The Probability Flow ODE

A second discovery of Song et al.: the SDE has a corresponding deterministic ODE with the same marginals:

\[ dx = \!\left[f(x, t) - \tfrac{1}{2} g(t)^2 \nabla_x \log p_t(x)\right] dt. \]

This is the probability flow ODE. Sampling by integrating the ODE backward in time produces a (deterministic) sample from \(p_0\). The advantages over SDE sampling:

  • Faster. ODE solvers (Heun, Runge-Kutta, DPM-Solver) reach high quality in 10–30 steps. The SDE typically needs 100–1000.
  • Exact likelihood. The change-of-variables formula along the ODE trajectory gives \(\log p_\theta(x_0)\) exactly (modulo solver error), which an SDE-based sampler cannot provide.
  • Deterministic and invertible. Given \(x_0\) we can recover the corresponding \(x_T\), useful for image editing and interpolation.

The cost: at any finite step size, the ODE and SDE samplers draw from slightly different distributions. The quality is generally similar, but the two can differ subtly in mode coverage. Some applications (image editing) benefit from the ODE’s determinism; others (best-quality samples) sometimes prefer the SDE.
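A Heun (second-order) sketch of ODE sampling, with `score`, `f`, and `g` again assumed to be supplied as callables:

```python
import numpy as np

def probability_flow_sample(score, f, g, x_T, T=1.0, n_steps=50):
    """Heun integration of the probability-flow ODE
        dx/dt = f(x, t) - 0.5 * g(t)^2 * score(x, t),
    backward from t = T to t = 0.  Deterministic: the same x_T
    always maps to the same x_0.
    """
    def v(x, t):
        return f(x, t) - 0.5 * g(t) ** 2 * score(x, t)
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for i in range(n_steps):
        t = T - i * dt
        d1 = v(x, t)
        x_euler = x - d1 * dt                 # Euler predictor
        d2 = v(x_euler, max(t - dt, 0.0))     # slope at the next time point
        x = x - 0.5 * (d1 + d2) * dt          # trapezoidal corrector
    return x
```

Note there is no noise injection anywhere in the loop, which is exactly what makes the map from \(x_T\) to \(x_0\) invertible.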

What This View Buys You

Conceptual clarity. Forward, reverse, and probability-flow are all visible in the same SDE framework. The trained model is the score across noise levels; everything else is a discretization choice.

Likelihood evaluation. The probability-flow ODE gives exact (up to solver error) log-likelihoods — turning diffusion into a tractable likelihood model, comparable to normalizing flows, without the architectural restrictions.
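To make the change-of-variables computation concrete, here is a fully closed-form 1-D toy (data \(\mathcal{N}(0, 4)\), VP SDE with constant \(\beta = 1\)). Everything analytic here would, in a real model, come from the score network plus a stochastic trace estimator for the divergence:

```python
import numpy as np

def toy_vp_log_likelihood(x0, T=1.0, n_steps=1000):
    r"""log p_0(x0) via the probability-flow ODE and the instantaneous
    change of variables:
        log p_0(x0) = log p_T(x(T)) + \int_0^T (d v / d x) dt.
    For data N(0, 4) under VP with beta = 1, p_t = N(0, 1 + 3 e^{-t}),
    so the score -x / var_t and the divergence are analytic.
    """
    var_t = lambda t: 1.0 + 3.0 * np.exp(-t)            # marginal variance
    v = lambda x, t: -0.5 * x * (1.0 - 1.0 / var_t(t))  # ODE velocity field
    div = lambda t: -0.5 * (1.0 - 1.0 / var_t(t))       # dv/dx (x-independent here)
    dt = T / n_steps
    x, logdet = float(x0), 0.0
    for i in range(n_steps):                            # Heun, forward in t
        t = i * dt
        d1x, d1l = v(x, t), div(t)
        d2x, d2l = v(x + d1x * dt, t + dt), div(t + dt)
        x += 0.5 * (d1x + d2x) * dt
        logdet += 0.5 * (d1l + d2l) * dt
    log_pT = -0.5 * np.log(2 * np.pi * var_t(T)) - x ** 2 / (2 * var_t(T))
    return log_pT + logdet
```

With the exact score, the result matches the analytic \(\log \mathcal{N}(x_0; 0, 4)\) up to solver error; swapping in a learned score gives the model likelihood.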

Architectural freedom. Any sufficiently expressive network can be the score model. No invertibility, no Jacobian, no decoder-encoder split.

Sampler design. A wide menu of off-the-shelf SDE/ODE solvers can be applied. Recent fast samplers (DPM-Solver, EDM, consistency models) all live in this view.

Where the SDE View Sits Now

The SDE view is the standard framework in modern diffusion research. New methods are typically formulated continuously in time, with a particular noise schedule and parameterization, and then discretized for training and sampling.

Two adjacent generalizations:

  • Flow matching (Lipman et al., 2023) and rectified flows (Liu et al., 2023): the same probability-flow ODE structure, but trained by directly regressing the velocity field along chosen interpolation paths instead of via score matching. Often produces straighter trajectories that need fewer ODE steps.
  • Consistency models (Song et al., 2023): train a network that maps any point on the trajectory directly to \(x_0\), allowing 1–4 step sampling with quality close to the multi-step teacher.

Both build on the score-SDE view: it is the conceptual foundation that makes them well-defined.
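For contrast with the score-matching objective, a sketch of the conditional flow-matching loss with straight-line interpolation paths (the rectified-flow choice); `velocity_model` is a placeholder for \(v_\theta\):

```python
import numpy as np

def cfm_loss(velocity_model, x1, rng=None):
    """Conditional flow matching with straight-line paths:
        x_t = (1 - t) * x0 + t * x1,  x0 ~ N(0, I) noise,  x1 data.
    The regression target is the path's constant velocity, x1 - x0.
    """
    rng = np.random.default_rng() if rng is None else rng
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1
    return np.mean(np.sum((velocity_model(x_t, t) - (x1 - x0)) ** 2, axis=-1))
```

Note the structural similarity to the DSM loss: sample a time, corrupt the data along a fixed path, regress against a known target. Only the path and the target change.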

References

Anderson, Brian D. O. 1982. “Reverse-Time Diffusion Equation Models.” Stochastic Processes and Their Applications 12 (3): 313–26.

Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2020. “Score-Based Generative Modeling Through Stochastic Differential Equations.” arXiv Preprint arXiv:2011.13456. https://arxiv.org/abs/2011.13456.