The Forward Gaussian Noising Process
Motivation
A diffusion model (Ho et al. 2020; Sohl-Dickstein et al. 2015) is built on a fixed, prescribed corruption process that maps clean data \(x_0 \sim p_{\text{data}}\) to pure Gaussian noise over a sequence of \(T\) steps. Each step adds a small amount of Gaussian noise; after enough steps the data distribution is washed out and the marginal becomes a standard Gaussian. The model never learns this forward process — it is a fixed reference distribution against which the reverse process is trained.
This article describes the forward process. The reverse process is the DDPM generator that learns to invert it.
The Markov Chain
Increasing noise levels
The forward process gradually mixes clean data with Gaussian noise until the sample is nearly indistinguishable from standard noise.
Pick a noise schedule \(\beta_1, \ldots, \beta_T \in (0, 1)\), monotonically increasing, with \(\beta_1\) tiny and \(\beta_T\) larger, chosen so that after \(T\) steps essentially no signal survives (in the standard linear schedule, \(\beta_1 = 10^{-4}\) and \(\beta_T = 0.02\) with \(T = 1000\)). The forward process is the Markov chain
\[ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \, \beta_t I\right), \qquad t = 1, \ldots, T. \]
Each step shrinks the previous sample by the factor \(\sqrt{1 - \beta_t}\) (just below \(1\), so the shrinkage is slight) and adds Gaussian noise of variance \(\beta_t\). The shrinkage is what keeps the marginal variance bounded: without it, purely additive noise would make the variance grow without bound over many steps.
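As a minimal sketch of one transition (NumPy; the function name `forward_step` is illustrative, not taken from any particular codebase):

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One forward transition: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise
```

Running this \(T\) times with the schedule \(\beta_1, \ldots, \beta_T\) simulates the whole chain, though the closed-form marginal below makes that unnecessary in practice.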
The Closed-Form Marginal
A useful property of the Gaussian noising chain: the marginal \(q(x_t \mid x_0)\) is itself Gaussian, with parameters available in closed form. Let
\[ \alpha_t := 1 - \beta_t, \qquad \bar\alpha_t := \prod_{s=1}^t \alpha_s. \]
Then
\[ q(x_t \mid x_0) = \mathcal{N}\!\left(x_t; \sqrt{\bar\alpha_t} \, x_0, \, (1 - \bar\alpha_t) I\right). \]
This means we can sample \(x_t\) for any \(t\) in one step, without simulating the chain:
\[ x_t = \sqrt{\bar\alpha_t} \, x_0 + \sqrt{1 - \bar\alpha_t} \, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I). \]
This is the closed form that makes diffusion training cheap: at each step of training, sample \(t\) uniformly, sample \(x_0\) from data, sample \(x_t\) in one shot from the formula above, predict the noise.
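A hedged sketch of that training step (NumPy; `eps_model` stands in for a hypothetical noise-prediction network and is not part of the text above):

```python
import numpy as np

def ddpm_training_loss(x0, alpha_bar, eps_model, rng):
    """One epsilon-prediction training example: pick t, jump straight to x_t, regress the noise.

    alpha_bar: length-T array of cumulative products bar{alpha}_t.
    eps_model: hypothetical network, eps_model(x_t, t) -> array shaped like x_t.
    """
    T = len(alpha_bar)
    t = rng.integers(T)                                   # t ~ Uniform{0, ..., T-1}
    eps = rng.standard_normal(x0.shape)                   # the noise the model must recover
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps  # one-shot sample of x_t
    return np.mean((eps_model(x_t, t) - eps) ** 2)        # simple squared-error loss
```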
The proof is induction on \(t\) using the standard fact that the sum of independent Gaussians is Gaussian.
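Concretely, the induction step: if \(x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1 - \bar\alpha_{t-1}}\, \varepsilon'\) with \(\varepsilon' \sim \mathcal{N}(0, I)\), then

\[ x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{\beta_t}\, \varepsilon = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{\alpha_t (1 - \bar\alpha_{t-1})}\, \varepsilon' + \sqrt{\beta_t}\, \varepsilon, \]

and the two independent Gaussian terms merge into a single one with variance \(\alpha_t (1 - \bar\alpha_{t-1}) + \beta_t = 1 - \alpha_t \bar\alpha_{t-1} = 1 - \bar\alpha_t\), which is exactly the marginal stated above.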
Why This Schedule
The construction \(q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)\) is engineered so that:
- Marginal variance is bounded. If \(x_0\) has unit variance, \(x_t\) also has variance approximately \(\bar\alpha_t \cdot 1 + (1 - \bar\alpha_t) \cdot 1 = 1\). The whole trajectory stays at unit variance.
- Marginal converges to standard normal. \(\bar\alpha_t\) decays geometrically toward \(0\) as \(t\) grows (each factor \(\alpha_s < 1\)), so for large \(T\) the marginal \(q(x_T \mid x_0)\) is essentially \(\mathcal{N}(0, I)\) no matter what \(x_0\) was. The reverse process can therefore start from \(\mathcal{N}(0, I)\) at \(t = T\) regardless of the data.
- Each step is small. \(\beta_t\) is small, so the chain has many small steps rather than a few large ones. This is what makes the reverse process — which the model has to learn — also a sequence of small Gaussian transitions.
The standard schedules are linear (\(\beta_t\) ramps from \(10^{-4}\) to \(0.02\)), cosine (Nichol & Dhariwal 2021, smoother near \(t = 0\) and \(t = T\)), and Karras (Karras et al. 2022, derived in continuous time and parameterized for image generation).
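A quick numerical check of the bounded-variance and convergence claims, using the linear schedule quoted above (NumPy; the endpoints and \(T = 1000\) follow standard DDPM practice):

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)      # linear schedule from 1e-4 to 0.02
alpha_bar = np.cumprod(1.0 - beta)     # bar{alpha}_t = prod_{s<=t} (1 - beta_s)

print(alpha_bar[0])    # ~0.9999: almost no corruption after one step
print(alpha_bar[-1])   # ~4e-5: q(x_T | x_0) is essentially N(0, I)

# empirical variance check on unit-variance toy data
rng = np.random.default_rng(0)
x0 = rng.standard_normal(100_000)
x_T = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * rng.standard_normal(x0.shape)
print(x_T.var())       # ~1.0: the trajectory stays at unit variance
```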
The Reverse Conditional Has a Closed Form Given \(x_0\)
A second useful property: even though the unconditional reverse \(q(x_{t-1} \mid x_t)\) has no closed form, the posterior given \(x_0\) does:
\[ q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1}; \tilde\mu_t(x_t, x_0), \, \tilde\beta_t I\right), \]
where
\[ \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}} \, \beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t} x_t, \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t} \beta_t. \]
This conditional is what the DDPM training loss compares the model’s reverse transition \(p_\theta(x_{t-1} \mid x_t)\) against. Because both are Gaussian, the per-step KL has a closed form.
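A small sketch of those posterior parameters (NumPy; the helper name and zero-based indexing into `beta` and `alpha_bar` are illustrative assumptions):

```python
import numpy as np

def posterior_mean_variance(x0, x_t, t, beta, alpha_bar):
    """Mean and variance of q(x_{t-1} | x_t, x_0), assuming t >= 1 so alpha_bar[t - 1] exists."""
    alpha_t = 1.0 - beta[t]
    coef_x0 = np.sqrt(alpha_bar[t - 1]) * beta[t] / (1.0 - alpha_bar[t])
    coef_xt = np.sqrt(alpha_t) * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
    mean = coef_x0 * x0 + coef_xt * x_t
    variance = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * beta[t]
    return mean, variance
```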
The Continuous-Time Limit (the SDE)
As \(T \to \infty\) and \(\beta_t \to 0\) jointly, the discrete chain converges to a stochastic differential equation:
\[ dx = -\tfrac{1}{2} \beta(t) x \, dt + \sqrt{\beta(t)} \, dW_t, \]
where \(\beta(t)\) is the continuous-time analog of the per-step variance and \(W_t\) is standard Brownian motion. This is an Ornstein-Uhlenbeck process with time-dependent rate \(\beta(t)\): a linear SDE that drives any initial distribution toward the standard normal.
The continuous-time view (Song et al., 2021) is conceptually cleaner: forward and reverse processes are SDEs whose drift and diffusion are explicit, and a single learned score function parameterizes them all. The discrete DDPM chain is one particular discretization.
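A minimal Euler-Maruyama sketch of the forward SDE (NumPy; the linear rate ramping from 0.1 to 20 is a common illustrative choice for the variance-preserving SDE, not something fixed by the discussion above):

```python
import numpy as np

def simulate_forward_sde(x0, beta_fn, n_steps=1000, rng=None):
    """Euler-Maruyama simulation of dx = -1/2 beta(t) x dt + sqrt(beta(t)) dW on t in [0, 1]."""
    rng = rng or np.random.default_rng(0)
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        x = x - 0.5 * beta_fn(t) * x * dt + np.sqrt(beta_fn(t) * dt) * rng.standard_normal(x.shape)
    return x

# start from a constant (far from N(0, I)); the drift and noise pull it toward standard normal
final = simulate_forward_sde(np.ones(100_000), lambda t: 0.1 + (20.0 - 0.1) * t)
print(final.mean(), final.var())   # ~0 and ~1
```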
Where the Forward Process Sits in Diffusion Modeling
The forward process is not learned. It is a fixed schedule, a reference against which the model is trained. The model learns the reverse: given \(x_t\), predict \(x_{t-1}\) (or equivalently, predict \(\varepsilon\) that was added, or predict the score \(\nabla \log q_t(x_t)\)). At sampling time the model runs the reverse chain from \(x_T \sim \mathcal{N}(0, I)\) down to \(x_0\).
Three properties of the forward process are what make this work:
- Closed-form marginals — train at any \(t\) in one step.
- Closed-form reverse posterior given \(x_0\) — gives a tractable target for the per-step loss.
- Convergence to a known prior — \(x_T \sim \mathcal{N}(0, I)\) regardless of data.
Without them, training would require simulating the chain step by step and would lack a tractable per-step target. These three properties are the technical reason the simple noise-prediction recipe works.