Evidence Lower Bound

Motivation

Many probabilistic models have an intractable log marginal likelihood

\[ \log p_\theta(x) = \log \int p_\theta(x, z) \, dz \]

because the integral over latents \(z\) has no closed form and is high-dimensional. The evidence lower bound (ELBO) is a tractable lower bound on \(\log p_\theta(x)\) that can be optimized in \(\theta\) — and, when paired with a variational family \(q_\phi(z)\), also in \(\phi\) (Kingma and Welling 2013; Dempster et al. 1977).

The ELBO is the workhorse of latent-variable inference. It powers expectation-maximization, variational autoencoders, the per-step objective in diffusion models, and variational inference more broadly.

Definition

For any distribution \(q(z)\) with the same support as the posterior \(p_\theta(z \mid x)\),

\[ \mathrm{ELBO}(q, \theta; x) = \mathbb{E}_{z \sim q}\!\left[\log p_\theta(x, z) - \log q(z)\right]. \]

Equivalently,

\[ \mathrm{ELBO}(q, \theta; x) = \mathbb{E}_q[\log p_\theta(x \mid z)] - \mathrm{KL}(q(z) \,\|\, p_\theta(z)). \]

The first form makes the bound’s structure visible; the second decomposes it into a “reconstruction” term and a KL regularizer that pulls \(q\) toward the prior.
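
The two forms agree by one line of algebra, splitting the joint as \(p_\theta(x, z) = p_\theta(x \mid z)\, p_\theta(z)\):

\[ \mathbb{E}_q\!\left[\log p_\theta(x, z) - \log q(z)\right] = \mathbb{E}_q[\log p_\theta(x \mid z)] + \mathbb{E}_q\!\left[\log \frac{p_\theta(z)}{q(z)}\right] = \mathbb{E}_q[\log p_\theta(x \mid z)] - \mathrm{KL}(q(z) \,\|\, p_\theta(z)). \]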

The Bound

For every \(q\), \(\theta\), and \(x\),

\[ \log p_\theta(x) \geq \mathrm{ELBO}(q, \theta; x). \]

The proof is one application of Jensen’s inequality.
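
Concretely, write the marginal as an expectation under \(q\) and use the concavity of the logarithm:

\[ \log p_\theta(x) = \log \int q(z)\, \frac{p_\theta(x, z)}{q(z)} \, dz = \log \mathbb{E}_q\!\left[\frac{p_\theta(x, z)}{q(z)}\right] \geq \mathbb{E}_q\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right] = \mathrm{ELBO}(q, \theta; x). \]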

The gap between the log evidence and the ELBO is exactly the KL divergence from \(q\) to the true posterior:

\[ \log p_\theta(x) - \mathrm{ELBO}(q, \theta; x) = \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right). \]
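
To verify, use \(p_\theta(x, z) = p_\theta(z \mid x)\, p_\theta(x)\) and note that \(\log p_\theta(x)\) is constant under \(\mathbb{E}_q\):

\[ \log p_\theta(x) - \mathrm{ELBO}(q, \theta; x) = \mathbb{E}_q\!\left[\log p_\theta(x) - \log p_\theta(x, z) + \log q(z)\right] = \mathbb{E}_q\!\left[\log \frac{q(z)}{p_\theta(z \mid x)}\right] = \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right). \]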

This identity is the core of variational inference: maximizing the ELBO in \(q\) is equivalent to minimizing \(\mathrm{KL}(q \,\|\, p_\theta(z \mid x))\), i.e., fitting the variational distribution to the true posterior. Because \(\mathrm{KL} \geq 0\), the ELBO is indeed a lower bound, and the bound is tight precisely when \(q = p_\theta(z \mid x)\).

Two Uses

Coordinate ascent. Alternating maximization in \(\theta\) and \(q\) is the basis of EM. The E-step sets \(q(z) = p_\theta(z \mid x)\), making the bound tight; the M-step updates \(\theta\) to maximize \(\mathbb{E}_q[\log p_\theta(x, z)]\). Each full iteration never decreases \(\log p_\theta(x)\).
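
As a concrete sketch of this coordinate-ascent view (the two-component 1D Gaussian mixture, the function name, and the initialization are illustrative assumptions, not from the source), EM alternates exact posterior responsibilities with closed-form parameter updates:

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1D Gaussian mixture (illustrative sketch)."""
    # Crude initialization: equal weights, means at the data extremes, data variance.
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: q(z_n) = p(z_n | x_n), the exact posterior responsibilities,
        # which makes the bound tight at the current parameters.
        log_joint = np.log(w) - 0.5 * (np.log(2 * np.pi * var)
                                       + (x[:, None] - mu) ** 2 / var)
        resp = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize E_q[log p(x, z)] in the parameters (closed form here).
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

# Example: recover two well-separated components.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 0.5, 500)])
print(em_gmm_1d(data))
```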

Joint amortized optimization. Parameterize \(q_\phi(z \mid x)\) with a neural network that maps \(x\) to the parameters of the variational distribution (“amortized inference”) and optimize \(\theta\) and \(\phi\) jointly by stochastic gradient ascent on the ELBO. This is the VAE recipe; the reparameterization trick is what yields a low-variance gradient estimator with respect to \(\phi\).
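
A minimal PyTorch sketch of that recipe (the module names, layer sizes, and Bernoulli decoder are illustrative assumptions, not from the source): the encoder amortizes \(q_\phi(z \mid x)\), the reparameterized sample gives a one-sample Monte Carlo estimate of the reconstruction term, and the KL to a standard-normal prior is available in closed form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Illustrative Gaussian-encoder / Bernoulli-decoder VAE."""
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, 2 * z_dim))  # -> mean and log-variance
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))      # -> Bernoulli logits

    def elbo(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        # Reparameterization: z = mu + sigma * eps keeps the sample differentiable in phi.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # One-sample Monte Carlo estimate of E_q[log p_theta(x | z)].
        recon = -F.binary_cross_entropy_with_logits(self.dec(z), x, reduction="none").sum(-1)
        # KL(q_phi(z | x) || N(0, I)) in closed form.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
        return (recon - kl).mean()

# One joint gradient step on theta (decoder) and phi (encoder).
model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)      # placeholder batch with values in [0, 1]
opt.zero_grad()
loss = -model.elbo(x)        # ascend the ELBO = descend its negative
loss.backward()
opt.step()
```

Using a single sample of \(z\) per datapoint is the standard choice; the pathwise (reparameterized) gradient is typically far lower variance than the score-function alternative.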

Why “Evidence” Lower Bound

In Bayesian terminology, \(p_\theta(x)\) is the model evidence — the probability the model assigns to the observed data after marginalizing latents. The ELBO is a lower bound on its log; hence the name. Some literature uses the alternative name variational free energy for \(-\mathrm{ELBO}\), by analogy with statistical-mechanics free energies.

References

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society, Series B (Methodological) 39 (1): 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x.
Kingma, Diederik P., and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv Preprint arXiv:1312.6114. https://arxiv.org/abs/1312.6114.