Evidence Lower Bound
Motivation
Many probabilistic models have an intractable log marginal likelihood
\[ \log p_\theta(x) = \log \int p_\theta(x, z) \, dz \]
because the integral over latents \(z\) has no closed form and is high-dimensional. The evidence lower bound (ELBO) is a tractable lower bound on \(\log p_\theta(x)\) that can be optimized in \(\theta\) — and, when paired with a variational family \(q_\phi(z)\), also in \(\phi\) (Kingma and Welling 2013; Dempster et al. 1977).
The ELBO is the workhorse of latent-variable inference. It powers expectation-maximization, variational autoencoders, the per-step objective in diffusion models, and variational inference more broadly.
Definition
For any distribution \(q(z)\) with the same support as the posterior \(p_\theta(z \mid x)\),
\[ \mathrm{ELBO}(q, \theta; x) = \mathbb{E}_{z \sim q}\!\left[\log p_\theta(x, z) - \log q(z)\right]. \]
Equivalently,
\[ \mathrm{ELBO}(q, \theta; x) = \mathbb{E}_q[\log p_\theta(x \mid z)] - \mathrm{KL}(q(z) \,\|\, p_\theta(z)). \]
The first form makes the bound’s structure visible; the second decomposes it into a “reconstruction” term and a KL regularizer that pulls \(q\) toward the prior.
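The first form also gives a direct Monte Carlo estimator: sample \(z \sim q\) and average \(\log p_\theta(x, z) - \log q(z)\). A minimal sketch in NumPy, using a toy conjugate model \(z \sim \mathcal{N}(0, 1)\), \(x \mid z \sim \mathcal{N}(z, 1)\) (the model and the helper names `log_normal`, `elbo_mc` are illustrative assumptions, not from the text):

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo_mc(x, q_mean, q_var, n_samples=100_000, seed=0):
    """Monte Carlo ELBO for z ~ N(0,1), x|z ~ N(z,1), q(z) = N(q_mean, q_var)."""
    rng = np.random.default_rng(seed)
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(n_samples)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(x, z)
    return np.mean(log_joint - log_normal(z, q_mean, q_var))

x = 1.5
# In this conjugate model p(x) = N(x; 0, 2) and p(z|x) = N(x/2, 1/2),
# so setting q to the exact posterior makes the bound tight:
print(elbo_mc(x, x / 2, 0.5), log_normal(x, 0.0, 2.0))
```

With \(q\) equal to the true posterior, every sample of \(\log p_\theta(x, z) - \log q(z)\) equals \(\log p_\theta(x)\) exactly, so the estimator has zero variance here; for any other \(q\) it averages to something strictly smaller.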
The Bound
For every \(q\), \(\theta\), and \(x\),
\[ \log p_\theta(x) \geq \mathrm{ELBO}(q, \theta; x). \]
The proof is one application of Jensen’s inequality.
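Spelled out, the Jensen step reads:
\[ \log p_\theta(x) = \log \mathbb{E}_{z \sim q}\!\left[\frac{p_\theta(x, z)}{q(z)}\right] \geq \mathbb{E}_{z \sim q}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right] = \mathrm{ELBO}(q, \theta; x), \]
where the first equality rewrites \(\int p_\theta(x, z)\, dz\) as an expectation under \(q\) (using that \(q\) covers the support), and the inequality is Jensen's, applied to the concave \(\log\).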
The gap between the log evidence and the ELBO is exactly the KL divergence from \(q\) to the true posterior:
\[ \log p_\theta(x) - \mathrm{ELBO}(q, \theta; x) = \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right). \]
This identity is the core of variational inference: maximizing the ELBO in \(q\) is equivalent to minimizing \(\mathrm{KL}(q \,\|\, p_\theta(z \mid x))\), i.e. fitting the variational distribution to the true posterior. The ELBO is a lower bound because \(\mathrm{KL} \geq 0\), and the bound is tight when \(q = p_\theta(z \mid x)\).
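The gap identity can be checked numerically in any model whose posterior is available in closed form. A sketch, again using a toy conjugate model \(z \sim \mathcal{N}(0, 1)\), \(x \mid z \sim \mathcal{N}(z, 1)\) (the model and helper names are illustrative, not from the text):

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def kl_normal(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)) between univariate Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1).
# Then p(x) = N(x; 0, 2) and p(z | x) = N(x/2, 1/2), both in closed form.
x, q_mean, q_var = 1.5, 0.2, 0.8

rng = np.random.default_rng(0)
z = q_mean + np.sqrt(q_var) * rng.standard_normal(500_000)
elbo = np.mean(log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
               - log_normal(z, q_mean, q_var))

gap = log_normal(x, 0.0, 2.0) - elbo          # log p(x) - ELBO
kl = kl_normal(q_mean, q_var, x / 2, 0.5)     # KL(q || posterior)
print(gap, kl)  # the two agree up to Monte Carlo error
```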
Two Uses
Coordinate ascent. Alternate maximization in \(\theta\) and \(q\) is the basis of EM. The E-step sets \(q(z) = p_\theta(z \mid x)\), making the bound tight; the M-step updates \(\theta\) to maximize \(\mathbb{E}_q[\log p_\theta(x, z)]\). Because the bound is tight after each E-step, a full EM iteration never decreases \(\log p_\theta(x)\).
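As an illustration of coordinate ascent, here is a sketch of EM for a two-component Gaussian mixture with known equal weights and unit variances, learning only the component means (a standard textbook setup assumed here, not taken from the text); the data log likelihood is non-decreasing across iterations:

```python
import numpy as np

# Two-component Gaussian mixture, weights 0.5/0.5, unit variances;
# only the component means mu are learned.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

def log_lik(x, mu):
    """Total log p(x) under the mixture."""
    comp = np.stack([-0.5 * ((x - m) ** 2 + np.log(2 * np.pi)) for m in mu])
    return np.logaddexp(comp[0], comp[1]).sum() + len(x) * np.log(0.5)

mu = np.array([-0.5, 0.5])          # crude initialization
lls = [log_lik(x, mu)]
for _ in range(20):
    # E-step: q(z) = p(z | x), the posterior responsibilities (bound tight).
    # Equal weights and variances, so normalizing constants cancel.
    logits = np.stack([-0.5 * (x - m) ** 2 for m in mu])
    r = np.exp(logits - np.logaddexp(logits[0], logits[1]))
    # M-step: maximize E_q[log p(x, z)] in the means -> responsibility-
    # weighted sample means.
    mu = (r * x).sum(axis=1) / r.sum(axis=1)
    lls.append(log_lik(x, mu))

print(mu)  # the estimates land near the generating means (-2, 3)
```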
Joint amortized optimization. Parameterize \(q_\phi(z \mid x)\) as a neural network (“amortized inference”) and optimize \(\theta\) and \(\phi\) together by stochastic gradient ascent on the ELBO. This is the VAE recipe; the reparameterization trick, which writes each sample as a deterministic, differentiable function of \(\phi\) and auxiliary noise, is what keeps the gradient with respect to \(\phi\) low-variance.
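A minimal sketch of the reparameterized gradient, using the toy model \(z \sim \mathcal{N}(0, 1)\), \(x \mid z \sim \mathcal{N}(z, 1)\) with \(q_\phi(z) = \mathcal{N}(\mu, \sigma^2)\) and only \(\mu\) learned (the model and constants here are illustrative assumptions, not from the text):

```python
import numpy as np

# Writing z = mu + sigma * eps with eps ~ N(0, 1) moves phi out of the
# sampling distribution, so the gradient flows through z itself.
rng = np.random.default_rng(0)
x, sigma, mu, lr = 1.5, np.sqrt(0.5), 0.0, 0.1

for _ in range(200):
    eps = rng.standard_normal(1_000)
    z = mu + sigma * eps                      # differentiable in mu
    # d/dz [log p(z) + log p(x|z)] = -z + (x - z); the entropy term of the
    # ELBO does not depend on mu, so this is the whole pathwise gradient.
    grad_mu = np.mean(-z + (x - z))
    mu += lr * grad_mu                        # stochastic gradient ascent

print(mu)  # approaches the posterior mean x/2 = 0.75
```

In expectation the gradient is \(x - 2\mu\), so ascent converges to \(\mu = x/2\), the exact posterior mean for this conjugate model.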
Why “Evidence” Lower Bound
In Bayesian terminology, \(p_\theta(x)\) is the model evidence — the probability the model assigns to the observed data after marginalizing latents. The ELBO is a lower bound on its log; hence the name. Some literature uses the alternative name variational free energy for \(-\mathrm{ELBO}\), by analogy with statistical-mechanics free energies.