ELBO Lower Bounds the Log Evidence

Claim

For any distribution \(q(z)\) whose support contains the support of \(p_\theta(z \mid x)\) (Dempster et al. 1977; Kingma and Welling 2013),

\[ \log p_\theta(x) \geq \mathbb{E}_{z \sim q}\!\left[\log p_\theta(x, z) - \log q(z)\right] = \mathrm{ELBO}(q, \theta; x). \]

The gap is

\[ \log p_\theta(x) - \mathrm{ELBO}(q, \theta; x) = \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right) \geq 0, \]

so the bound is tight if and only if \(q = p_\theta(z \mid x)\) almost everywhere.
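
A quick numeric sanity check helps here. The sketch below uses a hypothetical conjugate toy model (not from the text): prior \(z \sim \mathcal{N}(0, 1)\) and likelihood \(x \mid z \sim \mathcal{N}(z, 1)\), for which the marginal \(p(x) = \mathcal{N}(x; 0, 2)\) and posterior \(p(z \mid x) = \mathcal{N}(x/2, 1/2)\) are in closed form, so the bound can be verified directly against a Monte Carlo ELBO estimate.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical conjugate toy model (chosen for illustration):
#   prior       z ~ N(0, 1)
#   likelihood  x | z ~ N(z, 1)
# => marginal p(x) = N(x; 0, 2), posterior p(z|x) = N(x/2, 1/2).

rng = np.random.default_rng(0)
x = 1.3                                    # an arbitrary observation
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

# Variational q(z) = N(m, s^2), deliberately not equal to the posterior.
m, s = 0.0, 1.0
z = rng.normal(m, s, size=200_000)

# Monte Carlo estimate of ELBO = E_q[log p(x, z) - log q(z)].
log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, loc=z, scale=1.0)
elbo = np.mean(log_joint - norm.logpdf(z, m, s))

print(f"log p(x) = {log_px:.4f}")          # ≈ -1.688
print(f"ELBO     = {elbo:.4f}")            # strictly smaller; gap = KL
```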

Proof via Jensen’s Inequality

The function \(\log\) is concave, so for any positive random variable \(Y\) with finite mean,

\[ \log \mathbb{E}[Y] \geq \mathbb{E}[\log Y] \]

(Jensen’s inequality; concavity reverses the direction of the inequality relative to the convex case).

Write the marginal as an expectation under \(q\) by importance-weighting the joint:

\[ p_\theta(x) = \int p_\theta(x, z) \, dz = \int q(z) \cdot \frac{p_\theta(x, z)}{q(z)} \, dz = \mathbb{E}_{z \sim q}\!\left[\frac{p_\theta(x, z)}{q(z)}\right]. \]

Take the log of both sides and apply Jensen:

\[ \log p_\theta(x) = \log \mathbb{E}_q\!\left[\frac{p_\theta(x, z)}{q(z)}\right] \geq \mathbb{E}_q\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right] = \mathrm{ELBO}(q, \theta; x). \quad \square \]

The support condition ensures \(q(z) > 0\) wherever \(p_\theta(x, z) > 0\), so the importance ratio is well defined and no division-by-zero issues arise.
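
The Jensen step also has a direct numerical reading: the log of the average importance weight \(p_\theta(x, z)/q(z)\) is a consistent estimator of \(\log p_\theta(x)\), while the average of the log weights estimates the ELBO, and the difference is the Jensen gap. A sketch in the same hypothetical toy model as above:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Same hypothetical toy model: z ~ N(0,1), x | z ~ N(z,1), q = N(0,1).
rng = np.random.default_rng(1)
x, m, s = 1.3, 0.0, 1.0
z = rng.normal(m, s, size=200_000)

# log w = log p(x, z) - log q(z), the log importance weight.
log_w = (norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, loc=z, scale=1.0)
         - norm.logpdf(z, m, s))

# log E_q[w] (importance sampling -> log p(x)) vs E_q[log w] (= ELBO).
print("log-mean weight:", logsumexp(log_w) - np.log(log_w.size))  # ≈ -1.688
print("mean log-weight:", log_w.mean())                           # ≈ -2.264
```

Computing the log-mean via logsumexp keeps the estimate numerically stable when the weights span many orders of magnitude.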

Proof via the Posterior Identity

A second derivation makes the gap explicit. Factor the joint through the posterior, \(p_\theta(x, z) = p_\theta(z \mid x)\, p_\theta(x)\), so that

\[ \log p_\theta(x, z) - \log q(z) = \log p_\theta(z \mid x) + \log p_\theta(x) - \log q(z). \]

Take expectation under \(q\):

\[ \mathrm{ELBO}(q, \theta; x) = \log p_\theta(x) - \mathbb{E}_q\!\left[\log \frac{q(z)}{p_\theta(z \mid x)}\right] = \log p_\theta(x) - \mathrm{KL}\!\left(q \,\|\, p_\theta(\cdot \mid x)\right). \]

Rearranging,

\[ \log p_\theta(x) = \mathrm{ELBO}(q, \theta; x) + \mathrm{KL}\!\left(q \,\|\, p_\theta(\cdot \mid x)\right). \]

Since KL is non-negative, the ELBO is a lower bound. The gap is exactly the KL, which is zero iff \(q = p_\theta(\cdot \mid x)\) almost everywhere. \(\square\)
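
Because the hypothetical toy model above is conjugate, every term in this identity is available in closed form, so the decomposition can be checked exactly rather than by sampling. A minimal sketch under the same model:

```python
import numpy as np
from scipy.stats import norm

# Closed-form check of log p(x) = ELBO + KL(q || posterior) in the
# hypothetical toy model: posterior N(x/2, 1/2), marginal N(0, 2).
x, m, s = 1.3, 0.0, 1.0
mu_p, var_p = x / 2.0, 0.5

# KL(N(m, s^2) || N(mu_p, var_p)) for univariate Gaussians.
kl = np.log(np.sqrt(var_p) / s) + (s**2 + (m - mu_p) ** 2) / (2 * var_p) - 0.5

log_px = norm.logpdf(x, 0.0, np.sqrt(2.0))
elbo = log_px - kl                   # the identity, rearranged

print(f"ELBO      = {elbo:.6f}")
print(f"ELBO + KL = {elbo + kl:.6f}")   # equals log p(x)
print(f"log p(x)  = {log_px:.6f}")
```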

Remarks

  • The two proofs are not independent: the strict inequality in Jensen for non-degenerate \(Y = p_\theta(x, z) / q(z)\) corresponds exactly to \(\mathrm{KL}(q \,\|\, p_\theta(\cdot \mid x)) > 0\).
  • The bound holds for any \(q\), not just for \(q\) close to the posterior. Choosing a flexible variational family can make the bound tight; choosing a tractable family keeps it computable (see the sketch below).
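
To illustrate the flexibility point: in the hypothetical toy model the posterior is itself Gaussian, so a Gaussian variational family contains it, and maximizing the ELBO over \((m, s)\) drives the KL gap to zero. A minimal sketch, using scipy.optimize for the maximization:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Toy model again: the posterior N(x/2, 1/2) lies inside the Gaussian
# family, so the optimal q closes the gap and the ELBO reaches log p(x).
x = 1.3
log_px = norm.logpdf(x, 0.0, np.sqrt(2.0))

def neg_elbo(params):
    # KL(N(m, s^2) || N(x/2, 1/2)) in closed form; -ELBO = KL - log p(x).
    m, log_s = params
    s = np.exp(log_s)                      # parameterize s > 0
    kl = (np.log(np.sqrt(0.5) / s)
          + (s**2 + (m - x / 2.0) ** 2) / (2 * 0.5) - 0.5)
    return kl - log_px

res = minimize(neg_elbo, x0=[0.0, 0.0])
print("max ELBO :", -res.fun)              # -> log p(x)
print("log p(x) :", log_px)
print("optimal q:", res.x[0], np.exp(res.x[1]))  # -> (x/2, sqrt(1/2))
```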

References

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society Series B: Statistical Methodology 39 (1): 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x.
Kingma, Diederik P., and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv Preprint arXiv:1312.6114. https://arxiv.org/abs/1312.6114.