The DDPM Simple Loss Is a Weighted ELBO
Claim
Let \(\varepsilon_\theta(x_t, t)\) be the noise-prediction network of a DDPM, trained on the simple loss
\[ \mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t \sim U\{1, T\}, \, x_0 \sim p_{\text{data}}, \, \varepsilon \sim \mathcal{N}(0, I)}\!\left[\| \varepsilon - \varepsilon_\theta(\sqrt{\bar\alpha_t} \, x_0 + \sqrt{1 - \bar\alpha_t} \, \varepsilon, t) \|^2\right]. \]
Let \(\mathcal{L}_{\text{VLB}}(\theta) = -\mathrm{ELBO}(\theta)\) be the negative variational lower bound on \(\log p_\theta(x_0)\) implied by the diffusion’s variational objective. Then
\[ \mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_t[\, w_t \cdot \mathcal{L}_t^{\text{VLB}}(\theta)\,] + \text{const} \]
for an explicit per-timestep weighting \(w_t\) that differs from the natural ELBO weights \(w_t^{\text{VLB}}\). In other words: the simple loss is the variational bound with the wrong (or rather, deliberately re-chosen) weights — uniform over \(t\) in the simple loss, but \(t\)-dependent in the natural ELBO.
This is the precise sense in which “DDPM is variational inference”: optimizing \(\mathcal{L}_{\text{simple}}\) is equivalent (up to weighting and a constant) to optimizing a lower bound on the data log-likelihood (Ho et al. 2020). The non-trivial reweighting — making the loss roughly uniform across timesteps instead of weighted by the natural KL coefficients — is what improves sample quality.
The Variational Bound
The DDPM joint \(p_\theta(x_{0:T})\) factors as
\[ p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t). \]
The forward Markov chain \(q(x_{1:T} \mid x_0)\) provides a variational distribution. The standard ELBO derivation gives
\[ -\log p_\theta(x_0) \leq \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] = \mathcal{L}_{\text{VLB}}(\theta; x_0). \]
After standard manipulations (Sohl-Dickstein et al. 2015; Ho et al. 2020), this decomposes into
\[ \mathcal{L}_{\text{VLB}} = \underbrace{\mathrm{KL}\!\left(q(x_T \mid x_0) \,\|\, p(x_T)\right)}_{L_T} + \sum_{t=2}^T \underbrace{\mathbb{E}_{q(x_t \mid x_0)}\!\left[\mathrm{KL}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)\right]}_{L_{t-1}} \;+\; \underbrace{\mathbb{E}_{q(x_1 \mid x_0)}[-\log p_\theta(x_0 \mid x_1)]}_{L_0}. \]
\(L_T\) does not depend on \(\theta\): the forward process is fixed, and \(q(x_T \mid x_0)\) approximately matches the prior \(p(x_T)\) for large \(T\) by construction of the noise schedule. \(L_0\) is a reconstruction term at the boundary (Ho et al. handle it with a discretized Gaussian decoder). The interesting terms are \(L_1, \ldots, L_{T-1}\): per-step KLs between the forward-process posterior and the learned reverse transition.
Each \(L_{t-1}\) Is a Squared Difference of Means
Both \(q(x_{t-1} \mid x_t, x_0)\) and \(p_\theta(x_{t-1} \mid x_t)\) are Gaussian with the same covariance \(\sigma_t^2 I\): DDPM fixes the reverse covariance rather than learning it, with \(\sigma_t^2 = \beta_t\) or \(\sigma_t^2 = \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\beta_t\) as the standard choices. The KL between two Gaussians with the same covariance reduces to the squared difference of their means:
\[ \mathrm{KL}\!\left(\mathcal{N}(\mu_1, \sigma^2 I) \,\|\, \mathcal{N}(\mu_2, \sigma^2 I)\right) = \frac{\|\mu_1 - \mu_2\|^2}{2 \sigma^2}. \]
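As a quick sanity check, the closed form can be compared against a Monte Carlo estimate of the KL (a minimal numpy sketch; the dimension and variance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma2 = 4, 0.3
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)

# Closed form: ||mu1 - mu2||^2 / (2 sigma^2).
kl_closed = np.sum((mu1 - mu2) ** 2) / (2 * sigma2)

# Monte Carlo: E_{x ~ N(mu1, sigma2 I)}[log q1(x) - log q2(x)];
# the normalizing constants cancel because the covariances are equal.
x = mu1 + np.sqrt(sigma2) * rng.normal(size=(200_000, d))
kl_mc = np.mean(
    (np.sum((x - mu2) ** 2, axis=1) - np.sum((x - mu1) ** 2, axis=1)) / (2 * sigma2)
)
```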
Apply this:
\[ L_{t-1}(\theta) = \mathbb{E}_{q(x_t \mid x_0)}\!\left[\frac{\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2}{2 \sigma_t^2}\right], \]
where \(\tilde\mu_t(x_t, x_0)\) is the mean of \(q(x_{t-1} \mid x_t, x_0)\) — known in closed form from the forward noising process:
\[ \tilde\mu_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \varepsilon\right), \qquad x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \varepsilon. \]
(Here \(\varepsilon\) is the noise that produced \(x_t\) from \(x_0\); the usual \(x_0\)-form of \(\tilde\mu_t\) has been rewritten in terms of \(\varepsilon\) via the closed-form relation for \(x_t\).)
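The two ways of writing \(\tilde\mu_t\), in terms of \((x_t, x_0)\) and in terms of \((x_t, \varepsilon)\), can be checked against each other numerically (a sketch assuming the linear \(\beta\) schedule of Ho et al.; the timestep and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule from Ho et al. 2020
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500                               # arbitrary interior timestep (t >= 2)
a_t, b_t = alphas[t - 1], betas[t - 1]
ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]

x0 = rng.normal(size=8)
eps = rng.normal(size=8)
xt = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps

# The standard q-posterior mean in terms of (x_t, x_0) ...
mu_x0 = (np.sqrt(ab_prev) * b_t * x0 + np.sqrt(a_t) * (1 - ab_prev) * xt) / (1 - ab_t)
# ... and the same mean rewritten in terms of (x_t, eps).
mu_eps = (xt - b_t / np.sqrt(1 - ab_t) * eps) / np.sqrt(a_t)
```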
The \(\varepsilon\)-Parameterization
Parameterize \(\mu_\theta(x_t, t)\) in the same form as \(\tilde\mu_t\), replacing the true noise \(\varepsilon\) with a learned prediction \(\varepsilon_\theta(x_t, t)\):
\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \varepsilon_\theta(x_t, t)\right). \]
Then the difference \(\tilde\mu_t - \mu_\theta\) is the difference of \(\varepsilon\)-terms scaled by a constant:
\[ \tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \cdot \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \!\left(\varepsilon_\theta(x_t, t) - \varepsilon\right). \]
Substitute into \(L_{t-1}\):
\[ L_{t-1}(\theta) = \mathbb{E}_{x_0, \varepsilon}\!\left[\frac{1}{2 \sigma_t^2} \cdot \frac{1}{\alpha_t} \cdot \frac{\beta_t^2}{1 - \bar\alpha_t} \cdot \|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right] = w_t^{\text{VLB}} \cdot \mathbb{E}_{x_0, \varepsilon}[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2], \]
with the natural-ELBO weight
\[ w_t^{\text{VLB}} = \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar\alpha_t)}. \]
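The substitution can be verified numerically: with both means in the \(\varepsilon\)-parameterization, the mean-difference KL equals \(w_t^{\text{VLB}}\) times the \(\varepsilon\)-residual (a sketch with a random stand-in for \(\varepsilon_\theta\), assuming \(\sigma_t^2 = \beta_t\) and the linear schedule):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 300
a, ab, b = alphas[t - 1], alpha_bars[t - 1], betas[t - 1]
s2 = b                                # assumed choice: sigma_t^2 = beta_t

x0 = rng.normal(size=8)
eps = rng.normal(size=8)
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps
eps_hat = rng.normal(size=8)          # random stand-in for eps_theta(x_t, t)

# Both means in the eps-parameterization.
mu_tilde = (xt - b / np.sqrt(1 - ab) * eps) / np.sqrt(a)
mu_theta = (xt - b / np.sqrt(1 - ab) * eps_hat) / np.sqrt(a)

# Mean-difference KL vs. the weighted eps-residual.
kl_term = np.sum((mu_tilde - mu_theta) ** 2) / (2 * s2)
w_vlb = b ** 2 / (2 * s2 * a * (1 - ab))
resid = w_vlb * np.sum((eps - eps_hat) ** 2)
```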
The Simple Loss Drops the Weight
The full variational training objective is \(\sum_t L_{t-1} = \sum_t w_t^{\text{VLB}} \mathbb{E}[\|\varepsilon - \varepsilon_\theta\|^2]\). The DDPM “simple” loss replaces \(w_t^{\text{VLB}}\) with a constant — equivalently, weight \(1\) for all \(t\):
\[ \mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_t \, \mathbb{E}_{x_0, \varepsilon}\!\left[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right]. \]
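The expectation above takes only a few lines to estimate (a numpy sketch; `eps_theta` is a zero placeholder for the actual network, and the schedule is assumed linear):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # Placeholder for the noise-prediction network (a U-Net in Ho et al.).
    return np.zeros_like(x_t)

def simple_loss(x0_batch):
    """One Monte Carlo estimate of L_simple: uniform t, reparameterized x_t."""
    n, d = x0_batch.shape
    t = rng.integers(1, T + 1, size=n)                     # t ~ U{1, ..., T}
    eps = rng.normal(size=(n, d))                          # eps ~ N(0, I)
    ab = alpha_bars[t - 1][:, None]
    x_t = np.sqrt(ab) * x0_batch + np.sqrt(1 - ab) * eps   # forward noising
    return np.mean(np.sum((eps - eps_theta(x_t, t)) ** 2, axis=1))

loss = simple_loss(rng.normal(size=(256, 8)))
```

With the zero placeholder the loss estimates \(\mathbb{E}\|\varepsilon\|^2 = d\); a trained network drives the residual below that baseline.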
So \(\mathcal{L}_{\text{simple}}\) puts uniform weight on the \(\varepsilon\)-residual at every timestep. Translated into the same per-step decomposition as the VLB:
\[ \mathcal{L}_{\text{simple}}(\theta) = \frac{1}{T} \sum_t \frac{1}{w_t^{\text{VLB}}} \, L_{t-1}(\theta) = \sum_t w_t \cdot L_{t-1}(\theta), \]
with \(w_t = 1 / (T \, w_t^{\text{VLB}})\): largest at intermediate \(t\), where \(w_t^{\text{VLB}}\) is small, and smallest near \(t = 1\), where \(w_t^{\text{VLB}}\) is large. (The constant in the Claim absorbs the \(\theta\)-independent \(L_T\) and the boundary treatment of \(L_0\).)
This is the formal equivalence: \(\mathcal{L}_{\text{simple}}\) is a per-step reweighting of the natural variational lower bound.
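The disparity in the natural weights is easy to see by evaluating \(w_t^{\text{VLB}}\) across all timesteps for a concrete schedule (a sketch assuming the linear schedule and \(\sigma_t^2 = \beta_t\)):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear schedule (assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
sigma2 = betas                         # assumed choice: sigma_t^2 = beta_t

# Natural ELBO weights across all timesteps.
w_vlb = betas ** 2 / (2 * sigma2 * alphas * (1 - alpha_bars))

print(w_vlb[0], w_vlb[T // 2], w_vlb[-1])   # peak at t = 1, then much smaller
```

With these choices the weight peaks at \(t = 1\) and falls by roughly two orders of magnitude by mid-schedule, which is exactly the asymmetry the simple loss removes.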
Why the Reweighting Helps
Empirically (Ho et al. 2020), training on \(\mathcal{L}_{\text{simple}}\) produces sharper, higher-quality samples than training on the natural \(\mathcal{L}_{\text{VLB}}\), even though the latter is a tighter lower bound on \(\log p_\theta(x_0)\). The intuition:
- The natural weights \(w_t^{\text{VLB}}\) are very small at intermediate timesteps and large at small timesteps. The natural bound therefore concentrates training signal on getting the very-low-noise regime right (where the per-step KL is small but heavily weighted).
- The simple loss spreads training signal more uniformly. Intermediate noise levels — where the model has the hardest job, since the data is half-corrupted — get more attention.
- Sample quality is dominated by the model’s behavior at intermediate noise levels (early denoising sets up the structure; final denoising adds detail). The simple weighting reflects this practical observation.
So the trade-off is: tighter likelihood bound (natural weights) vs. better samples (uniform weights). DDPM chooses the latter. Improved DDPM (Nichol & Dhariwal, 2021) refines this with learned reverse variances and a hybrid objective that mixes \(\mathcal{L}_{\text{simple}}\) with a scaled \(\mathcal{L}_{\text{VLB}}\); continuous-time formulations refine the weighting again via the underlying SDE.
What This Identity Buys You
- Conceptual clarity. DDPM is a variational latent-variable model — the same family as the VAE — with a fixed (not learned) inference network and a Markov-chain decoder. The “simple loss” is a particular reweighting of the resulting ELBO.
- Likelihood bounds. Even when training on \(\mathcal{L}_{\text{simple}}\), one can evaluate \(\mathcal{L}_{\text{VLB}}\) at test time to get a real (if loose) lower bound on \(\log p_\theta(x_0)\).
- Connection to score matching. Up to another per-timestep reweighting, the simple loss is also multi-scale denoising score matching (Vincent, 2011). The two views, variational and score-based, lead to the same \(\varepsilon\)-residual training objective.
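The score-matching connection can be made concrete. The forward marginal is \(q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\, x_0, (1 - \bar\alpha_t) I)\), so its conditional score is
\[ \nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1 - \bar\alpha_t} = -\frac{\varepsilon}{\sqrt{1 - \bar\alpha_t}}. \]
Identifying \(s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t) / \sqrt{1 - \bar\alpha_t}\) turns the denoising score matching objective at noise level \(t\) into \(\frac{1}{1 - \bar\alpha_t}\, \mathbb{E}\!\left[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right]\): the same \(\varepsilon\)-residual, under yet another per-timestep weighting.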