The DDPM Simple Loss Is a Weighted ELBO

Claim

Let \(\varepsilon_\theta(x_t, t)\) be the noise-prediction network of a DDPM, trained on the simple loss

\[ \mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t \sim U\{1, T\}, \, x_0 \sim p_{\text{data}}, \, \varepsilon \sim \mathcal{N}(0, I)}\!\left[\| \varepsilon - \varepsilon_\theta(\sqrt{\bar\alpha_t} \, x_0 + \sqrt{1 - \bar\alpha_t} \, \varepsilon, t) \|^2\right]. \]
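In code, one Monte Carlo estimate of this loss looks as follows (a NumPy sketch; the zero-predicting stand-in plays the role of \(\varepsilon_\theta\), and the linear \(\beta\)-schedule is the one from Ho et al. 2020):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear schedule (Ho et al. 2020)
alpha_bars = np.cumprod(1.0 - betas)      # bar-alpha_t for t = 1..T

def simple_loss(eps_model, x0):
    """One Monte Carlo estimate of L_simple for a batch x0 of shape (n, d)."""
    n, d = x0.shape
    t = rng.integers(1, T + 1, size=n)            # t ~ Uniform{1, ..., T}
    ab = alpha_bars[t - 1][:, None]               # bar-alpha_t per sample
    eps = rng.standard_normal((n, d))             # forward noise
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return np.mean(np.sum((eps - eps_model(x_t, t)) ** 2, axis=1))

# Stand-in "network" predicting zero noise; the loss is then ~ E||eps||^2 = d.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = simple_loss(zero_model, rng.standard_normal((4096, 8)))
```

With the zero predictor the loss concentrates near the data dimension \(d = 8\), since \(\mathbb{E}\|\varepsilon\|^2 = d\).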

Let \(\mathcal{L}_{\text{VLB}}(\theta) = -\mathrm{ELBO}(\theta)\) be the negative variational lower bound on \(\log p_\theta(x_0)\) implied by the diffusion’s variational objective. Then

\[ \mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_t[\, w_t \cdot L_{t-1}(\theta)\,] + \text{const} \]

for an explicit per-timestep weighting \(w_t\) that differs from the natural ELBO weights \(w_t^{\text{VLB}}\). In other words: the simple loss is the variational bound with the wrong (or rather, deliberately re-chosen) weights — uniform over \(t\) in the simple loss, but \(t\)-dependent in the natural ELBO.

This is the precise sense in which “DDPM is variational inference”: optimizing \(\mathcal{L}_{\text{simple}}\) is equivalent (up to weighting and a constant) to optimizing a lower bound on the data log-likelihood (Ho et al. 2020). The non-trivial reweighting — making the loss roughly uniform across timesteps instead of weighted by the natural KL coefficients — is what improves sample quality.

The Variational Bound

The DDPM joint \(p_\theta(x_{0:T})\) factors as

\[ p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t). \]

The forward Markov chain \(q(x_{1:T} \mid x_0)\) provides a variational distribution. The standard ELBO derivation gives

\[ -\log p_\theta(x_0) \leq \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] = \mathcal{L}_{\text{VLB}}(\theta; x_0). \]

After standard manipulations (Sohl-Dickstein et al. 2015; Ho et al. 2020), this decomposes into

\[ \mathcal{L}_{\text{VLB}} = \underbrace{\mathrm{KL}\!\left(q(x_T \mid x_0) \,\|\, p(x_T)\right)}_{L_T} + \sum_{t=2}^T \underbrace{\mathbb{E}_{q(x_t \mid x_0)}\!\left[\mathrm{KL}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)\right]}_{L_{t-1}} \;+\; \underbrace{\mathbb{E}_{q(x_1 \mid x_0)}[-\log p_\theta(x_0 \mid x_1)]}_{L_0}. \]

\(L_T\) does not depend on \(\theta\): the forward process is fixed, and for a well-chosen schedule \(q(x_T \mid x_0)\) is essentially the prior \(\mathcal{N}(0, I)\). \(L_0\) is a discretization detail at the boundary. The interesting terms are \(L_1, \ldots, L_{T-1}\): per-step KLs between the forward posterior and the learned reverse step.
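That \(L_T\) is negligible can be checked numerically: with the linear schedule and \(T = 1000\), \(\bar\alpha_T \approx 4 \times 10^{-5}\), so \(q(x_T \mid x_0)\) is essentially \(\mathcal{N}(0, I)\). A NumPy sketch (the schedule is an assumption; the per-dimension Gaussian KL is closed-form):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule (an assumption)
alpha_bar_T = np.prod(1.0 - betas)      # bar-alpha_T, roughly 4e-5 here

def kl_to_prior(x0, ab):
    """KL( N(sqrt(ab) * x0, 1 - ab) || N(0, 1) ) per dimension, closed form."""
    mean_sq, var = ab * x0**2, 1.0 - ab
    return 0.5 * (mean_sq + var - 1.0 - np.log(var))

L_T = kl_to_prior(x0=1.0, ab=alpha_bar_T)   # a few 1e-5 nats per dimension
```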

Each \(L_{t-1}\) Is a Squared Difference of Means

Both \(q(x_{t-1} \mid x_t, x_0)\) and \(p_\theta(x_{t-1} \mid x_t)\) are Gaussian with the same covariance matrix \(\sigma_t^2 I\) (DDPM fixes the reverse covariance). The KL between two Gaussians with the same covariance reduces to the squared difference of their means:

\[ \mathrm{KL}\!\left(\mathcal{N}(\mu_1, \sigma^2 I) \,\|\, \mathcal{N}(\mu_2, \sigma^2 I)\right) = \frac{\|\mu_1 - \mu_2\|^2}{2 \sigma^2}. \]
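A quick Monte Carlo sanity check of this identity (NumPy sketch with arbitrary means; the log-normalizers cancel because the covariances are equal):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1 = np.array([1.0, -0.5, 2.0])
mu2 = np.array([0.5, 0.0, 1.5])
sigma = 0.7

# Closed form: ||mu1 - mu2||^2 / (2 sigma^2)
kl_closed = np.sum((mu1 - mu2) ** 2) / (2 * sigma**2)

# Monte Carlo: E_{x ~ N(mu1, sigma^2 I)}[ log N(x; mu1) - log N(x; mu2) ]
x = mu1 + sigma * rng.standard_normal((200_000, 3))
log_ratio = (np.sum((x - mu2) ** 2, axis=1)
             - np.sum((x - mu1) ** 2, axis=1)) / (2 * sigma**2)
kl_mc = log_ratio.mean()
```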

Apply this:

\[ L_{t-1}(\theta) = \mathbb{E}_{q(x_t \mid x_0)}\!\left[\frac{\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2}{2 \sigma_t^2}\right], \]

where \(\tilde\mu_t(x_t, x_0)\) is the mean of \(q(x_{t-1} \mid x_t, x_0)\) — known in closed form from the forward noising process:

\[ \tilde\mu_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \varepsilon\right), \qquad x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \varepsilon. \]

(The second expression substitutes the closed-form noise relation, with \(\varepsilon\) the noise that produced \(x_t\) from \(x_0\).)
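This \(\varepsilon\)-form is algebraically identical to the perhaps more familiar \(x_0\)/\(x_t\)-form of the posterior mean, \(\tilde\mu_t = \frac{\sqrt{\bar\alpha_{t-1}} \beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t} x_t\); a numerical spot check (NumPy sketch, linear schedule assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule (an assumption)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

t = 500                                  # any 2 <= t <= T (1-indexed)
b, a = betas[t - 1], alphas[t - 1]
ab, ab_prev = abar[t - 1], abar[t - 2]

x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

# epsilon-form of the posterior mean
mu_eps = (x_t - b / np.sqrt(1.0 - ab) * eps) / np.sqrt(a)
# x0/x_t-form of the posterior mean
mu_x0 = (np.sqrt(ab_prev) * b * x0 + np.sqrt(a) * (1.0 - ab_prev) * x_t) / (1.0 - ab)
```

The two means agree to floating-point precision for any \(x_0\), \(\varepsilon\), and \(t \geq 2\).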

The \(\varepsilon\)-Parameterization

Parameterize \(\mu_\theta(x_t, t)\) in the same form as \(\tilde\mu_t\), replacing the true noise \(\varepsilon\) with a learned prediction \(\varepsilon_\theta(x_t, t)\):

\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \varepsilon_\theta(x_t, t)\right). \]

Then the difference \(\tilde\mu_t - \mu_\theta\) is the difference of \(\varepsilon\)-terms scaled by a \(t\)-dependent constant:

\[ \tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \cdot \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \!\left(\varepsilon_\theta(x_t, t) - \varepsilon\right). \]

Substitute into \(L_{t-1}\):

\[ L_{t-1}(\theta) = \mathbb{E}_{x_0, \varepsilon}\!\left[\frac{1}{2 \sigma_t^2} \cdot \frac{1}{\alpha_t} \cdot \frac{\beta_t^2}{1 - \bar\alpha_t} \cdot \|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right] = w_t^{\text{VLB}} \cdot \mathbb{E}_{x_0, \varepsilon}[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2], \]

with the natural-ELBO weight

\[ w_t^{\text{VLB}} = \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar\alpha_t)}. \]
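The shape of these weights is worth computing. For the linear schedule with the common choice \(\sigma_t^2 = \beta_t\) (both assumptions here), \(w_t^{\text{VLB}}\) is largest at small \(t\) and roughly two orders of magnitude smaller mid-schedule:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule (an assumption)
alphas = 1.0 - betas
abar = np.cumprod(alphas)
sigma2 = betas                           # common choice sigma_t^2 = beta_t

w_vlb = betas**2 / (2 * sigma2 * alphas * (1 - abar))

# w_1 is ~0.5; mid-schedule weights are roughly 100x smaller.
ratio = w_vlb[0] / w_vlb[T // 2 - 1]
```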

The Simple Loss Drops the Weight

The full variational training objective is \(\sum_t L_{t-1} = \sum_t w_t^{\text{VLB}} \mathbb{E}[\|\varepsilon - \varepsilon_\theta\|^2]\). The DDPM “simple” loss replaces \(w_t^{\text{VLB}}\) with a constant — equivalently, weight \(1\) for all \(t\):

\[ \mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_t \, \mathbb{E}_{x_0, \varepsilon}\!\left[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right]. \]

So \(\mathcal{L}_{\text{simple}}\) corresponds to per-step weights \(w_t = 1\) in the same per-step decomposition as the VLB:

\[ \mathcal{L}_{\text{simple}}(\theta) = \frac{1}{T} \sum_t \frac{1}{w_t^{\text{VLB}}} \, L_{t-1}(\theta) + \text{const} = \sum_t w_t \cdot L_{t-1}(\theta) + \text{const}, \]

with \(w_t = 1 / (T w_t^{\text{VLB}})\). (The constant absorbs the \(\theta\)-free \(L_T\) term; the \(t = 1\) term corresponds, up to a discretization approximation, to \(L_0\).) The implied weights \(w_t\) are large at intermediate \(t\), where \(w_t^{\text{VLB}}\) is small, and small at low \(t\), where \(w_t^{\text{VLB}}\) is large.

This is the formal equivalence: \(\mathcal{L}_{\text{simple}}\) is a per-step reweighting of the natural variational lower bound.
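The pointwise identity behind this equivalence, \(\|\tilde\mu_t - \mu_\theta\|^2 / (2\sigma_t^2) = w_t^{\text{VLB}} \|\varepsilon - \varepsilon_\theta\|^2\), can itself be verified numerically (NumPy sketch; eps_hat is an arbitrary stand-in prediction, and \(\sigma_t^2 = \beta_t\) is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

t = 300
b, a, ab = betas[t - 1], alphas[t - 1], abar[t - 1]
sigma2 = b                               # sigma_t^2 = beta_t (an assumption)

x0, eps = rng.standard_normal(4), rng.standard_normal(4)
x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
eps_hat = rng.standard_normal(4)         # arbitrary stand-in "prediction"

mu_tilde = (x_t - b / np.sqrt(1.0 - ab) * eps) / np.sqrt(a)
mu_theta = (x_t - b / np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(a)

kl_term = np.sum((mu_tilde - mu_theta) ** 2) / (2 * sigma2)
w_vlb = b**2 / (2 * sigma2 * a * (1 - ab))
weighted = w_vlb * np.sum((eps - eps_hat) ** 2)
```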

Why the Reweighting Helps

Empirically (Ho et al. 2020), training on \(\mathcal{L}_{\text{simple}}\) produces sharper, higher-quality samples than training on the natural \(\mathcal{L}_{\text{VLB}}\), even though only the latter directly optimizes the variational bound on \(\log p_\theta(x_0)\) and so yields better likelihoods. The intuition:

  • The natural weights \(w_t^{\text{VLB}}\) are very small at intermediate timesteps and large at small timesteps. The natural bound therefore concentrates training signal on getting the very-low-noise regime right (where the per-step KL is small but heavily weighted).
  • The simple loss spreads training signal more uniformly. Intermediate noise levels — where the model has the hardest job, since the data is half-corrupted — get more attention.
  • Sample quality is dominated by the model’s behavior at intermediate noise levels (early denoising sets up the structure; final denoising adds detail). The simple weighting reflects this practical observation.

So the trade-off is: a tighter likelihood bound (natural weights) versus better samples (uniform weights). DDPM chooses the latter. Improved DDPM (Nichol & Dhariwal, 2021) and continuous-time diffusion models refine this weighting further, via learned reverse variances and SDE-derived schedules, respectively.

What This Identity Buys You

  • Conceptual clarity. DDPM is a variational latent-variable model — the same family as the VAE — with a fixed (not learned) inference network and a Markov-chain decoder. The “simple loss” is a particular reweighting of the resulting ELBO.
  • Likelihood bounds. Even when training on \(\mathcal{L}_{\text{simple}}\), one can evaluate \(\mathcal{L}_{\text{VLB}}\) at test time to get a real (if loose) lower bound on \(\log p_\theta(x_0)\).
  • Connection to score matching. Up to the same constant, the simple loss is also multi-scale denoising score matching (Vincent identity). The two views — variational and score-based — coincide on the same training objective.
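To make the last point concrete: for the forward conditional, \(\nabla_{x_t} \log q(x_t \mid x_0) = -\varepsilon / \sqrt{1 - \bar\alpha_t}\), and by Vincent's identity regressing on this conditional score recovers the marginal score. An optimal noise predictor therefore doubles as a score estimator, up to a known scale:

\[ s_\theta(x_t, t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}} \approx \nabla_{x_t} \log q(x_t). \]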

References

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” Advances in Neural Information Processing Systems (NeurIPS), 6840–51. https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html.