The VAE Objective Is the ELBO with a Continuous Latent Encoder

Claim

Let \(p_\theta(x, z) = p_\theta(x \mid z) p(z)\) be a latent-variable generative model with continuous latent \(z\), and let \(q_\phi(z \mid x)\) be an amortized variational encoder. Then

\[ \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right) =: \mathcal{L}_{\text{VAE}}(\theta, \phi; x). \tag{$\star$} \]

This is the variational autoencoder objective — the reconstruction term minus the KL regularizer (Kingma and Welling 2013). Equation \((\star)\) is the ELBO specialized to an amortized encoder \(q_\phi(z \mid x)\), written in the form that exposes its two interpretable pieces.

Derivation

Start from the ELBO with \(q = q_\phi(\cdot \mid x)\):

\[ \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right]. \]

Factor the joint as \(p_\theta(x, z) = p_\theta(x \mid z) p(z)\):

\[ \log p_\theta(x, z) = \log p_\theta(x \mid z) + \log p(z). \]

Substitute and split the expectation:

\[ \mathbb{E}_{q_\phi}\!\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right] = \mathbb{E}_{q_\phi}\!\left[\log p_\theta(x \mid z)\right] + \mathbb{E}_{q_\phi}\!\left[\log p(z) - \log q_\phi(z \mid x)\right]. \]

The second expectation is \(-\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))\) by definition. So

\[ \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right). \quad \square \]

What Each Term Means

  • \(\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)]\): reconstruction. Sample a code \(z\) from the encoder, decode it, and evaluate the log-likelihood of the original \(x\). For a Gaussian likelihood with fixed variance this reduces to negative MSE up to constants; for a Bernoulli likelihood it is the negative binary cross-entropy summed over pixels (both are sketched just after this list).
  • \(\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))\): regularization. Pulls the encoder toward the prior. This is what prevents the encoder from collapsing each \(q_\phi(\cdot \mid x)\) to a different delta function (which would maximize reconstruction but destroy the generative structure).
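To make the reconstruction term concrete, here is a minimal numpy sketch of the two likelihood choices; the function names are illustrative, and \(x\) and the decoder output are assumed to be flat arrays:

```python
import numpy as np

def gaussian_recon_loglik(x, x_hat, sigma=1.0):
    # log N(x | x_hat, sigma^2 I): a scaled negative squared error plus a constant.
    d = x.size
    return -0.5 * np.sum((x - x_hat) ** 2) / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

def bernoulli_recon_loglik(x, x_hat_probs, eps=1e-7):
    # log Bernoulli(x | x_hat_probs): the negative binary cross-entropy summed over pixels.
    p = np.clip(x_hat_probs, eps, 1.0 - eps)
    return np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))
```

Either function plays the role of \(\log p_\theta(x \mid z)\) once `x_hat` is produced by decoding a sample \(z \sim q_\phi(\cdot \mid x)\).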

Why This Decomposition Matters

The split \((\star)\) is more than algebraic convenience: each piece is independently estimable and differentiable.

  • The KL term has a closed form when \(q_\phi\) and \(p\) are both Gaussian — no Monte Carlo needed. For diagonal \(q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)^2))\) and \(p(z) = \mathcal{N}(0, I)\), \(\mathrm{KL} = \tfrac{1}{2} \sum_i (\mu_{\phi, i}^2 + \sigma_{\phi, i}^2 - \log \sigma_{\phi, i}^2 - 1)\).
  • The reconstruction term requires a Monte Carlo estimate over \(z \sim q_\phi(\cdot \mid x)\), which is where the reparameterization trick earns its keep: writing \(z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\) lets gradients flow through the sample, and in practice one sample per datapoint already gives a low-variance estimate (see the sketch below).
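A minimal sketch of both pieces, assuming a diagonal Gaussian encoder that outputs numpy arrays `mu` and `log_var`; `decode` and `recon_loglik` are placeholders for the decoder and one of the likelihood functions sketched earlier:

```python
import numpy as np

rng = np.random.default_rng(0)

def diag_gaussian_kl(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

def reparam_sample(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I); in an autodiff framework this is the
    # expression that lets gradients reach mu and log_var through the sample.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def elbo_estimate(x, mu, log_var, decode, recon_loglik):
    # Single-sample Monte Carlo ELBO: stochastic reconstruction term minus exact KL term.
    z = reparam_sample(mu, log_var)
    return recon_loglik(x, decode(z)) - diag_gaussian_kl(mu, log_var)
```

Only the reconstruction term is stochastic here; the KL term is exact, which is part of why the diagonal-Gaussian encoder with a standard-normal prior is the default pairing.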

The Tightness Gap

The ELBO equals \(\log p_\theta(x)\) minus the KL between the encoder and the true posterior:

\[ \log p_\theta(x) - \mathcal{L}_{\text{VAE}}(\theta, \phi; x) = \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right). \]

So the VAE is implicitly fitting \(q_\phi\) to the true posterior. With a Gaussian encoder this fit is structurally limited if the true posterior is multimodal — explaining why richer encoder families (normalizing flows, hierarchical encoders) tighten the bound and improve test-set log-likelihood.
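The identity is easy to verify numerically in a model small enough that every quantity has a closed form. Here is a sketch for a 1-D conjugate Gaussian model; the specific numbers are arbitrary:

```python
import numpy as np

# 1-D conjugate model: p(z) = N(0, 1), p(x | z) = N(z, s2).
s2, x = 0.5, 1.3            # likelihood variance and an observed datapoint (arbitrary values)
m, v = 0.4, 0.2             # an arbitrary Gaussian encoder q(z | x) = N(m, v)

# Exact marginal likelihood: p(x) = N(0, 1 + s2).
log_px = -0.5 * (np.log(2 * np.pi * (1 + s2)) + x**2 / (1 + s2))

# ELBO pieces, both available in closed form for this model.
recon = -0.5 * (np.log(2 * np.pi * s2) + ((x - m)**2 + v) / s2)   # E_q[log p(x | z)]
kl_prior = 0.5 * (m**2 + v - np.log(v) - 1)                        # KL(q || p(z))
elbo = recon - kl_prior

# Exact posterior p(z | x) = N(m_post, v_post) and KL(q || posterior).
m_post, v_post = x / (1 + s2), s2 / (1 + s2)
kl_post = 0.5 * ((v + (m - m_post)**2) / v_post - 1 + np.log(v_post / v))

print(np.isclose(log_px - elbo, kl_post))   # True: the slack in the bound is exactly this KL
```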

Comparison: EM vs. VAE on the Same Bound

Expectation-maximization and the VAE both maximize the ELBO. The difference is in how they handle \(q\):

  • EM uses \(q^{(t)}(z) = p_{\theta^{(t)}}(z \mid x)\), the true posterior at the current parameters. Tractable only when the posterior has closed form.
  • VAE uses a parametric \(q_\phi(z \mid x)\) and trains \(\phi\) jointly with \(\theta\). Sacrifices tightness for tractability — works for any decoder \(p_\theta\) as long as we can sample from \(q_\phi\) and evaluate its density.

This is the precise sense in which the VAE is “amortized variational EM”: one network’s worth of \(\phi\) replaces the per-datapoint posterior computation of the E-step.
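As a structural sketch of that difference (the callables `exact_posterior`, `m_step`, and `elbo_grads` are hypothetical placeholders for model-specific pieces, not a real API):

```python
def em_update(xs, theta, exact_posterior, m_step):
    # E-step: solve one exact inference problem per datapoint at the current theta...
    posteriors = [exact_posterior(x, theta) for x in xs]
    # ...then M-step: re-fit theta against those posteriors.
    return m_step(xs, posteriors)

def vae_update(xs, theta, phi, elbo_grads, lr=1e-3):
    # Amortized variational EM: one encoder (parameters phi) replaces the per-datapoint
    # E-step, and theta, phi take a joint gradient-ascent step on the Monte Carlo ELBO.
    g_theta, g_phi = elbo_grads(xs, theta, phi)
    return theta + lr * g_theta, phi + lr * g_phi
```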

References

Kingma, Diederik P., and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv Preprint arXiv:1312.6114. https://arxiv.org/abs/1312.6114.