The VAE Objective Is the ELBO with a Continuous Latent Encoder

Claim

Let \(p_\theta(x, z) = p_\theta(x \mid z) p(z)\) be a latent-variable generative model with continuous latent \(z\), and let \(q_\phi(z \mid x)\) be an amortized variational encoder. Then

\[ \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right) =: \mathcal{L}_{\text{VAE}}(\theta, \phi; x). \tag{$\star$} \]

This is the variational autoencoder objective — the reconstruction term minus the KL regularizer (Kingma and Welling 2013). Equation \((\star)\) is the ELBO specialized to an amortized encoder \(q_\phi(z \mid x)\), written in the form that exposes its two interpretable pieces.

Derivation

Start from the ELBO with \(q = q_\phi(\cdot \mid x)\):

\[ \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right]. \]

Factor the joint as \(p_\theta(x, z) = p_\theta(x \mid z) p(z)\):

\[ \log p_\theta(x, z) = \log p_\theta(x \mid z) + \log p(z). \]

Substitute and split the expectation:

\[ \mathbb{E}_{q_\phi}\!\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right] = \mathbb{E}_{q_\phi}\!\left[\log p_\theta(x \mid z)\right] + \mathbb{E}_{q_\phi}\!\left[\log p(z) - \log q_\phi(z \mid x)\right]. \]

The second expectation is \(-\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))\) by definition. So

\[ \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right). \quad \square \]

What Each Term Means

  • \(\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)]\): reconstruction. Sample a code \(z\) from the encoder, decode it, and evaluate the log-likelihood of the original \(x\). For a Gaussian likelihood with fixed variance this reduces to negative MSE up to constants; for a Bernoulli likelihood it is the negative binary cross-entropy summed over pixels (both are sketched just after this list).
  • \(\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))\): regularization. Pulls the encoder toward the prior. This is what prevents the encoder from collapsing each \(q_\phi(\cdot \mid x)\) to a different delta function (which would maximize reconstruction but destroy the generative structure).
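To make the reconstruction term concrete, here is a minimal numpy sketch of the two likelihood choices; the function names are illustrative, and \(x\) and the decoder output are assumed to be flat arrays:

```python
import numpy as np

def gaussian_recon_loglik(x, x_hat, sigma=1.0):
    # log N(x | x_hat, sigma^2 I): a scaled negative squared error plus a constant.
    d = x.size
    return -0.5 * np.sum((x - x_hat) ** 2) / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

def bernoulli_recon_loglik(x, x_hat_probs, eps=1e-7):
    # log Bernoulli(x | x_hat_probs): the negative binary cross-entropy summed over pixels.
    p = np.clip(x_hat_probs, eps, 1.0 - eps)
    return np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))
```

Either function plays the role of \(\log p_\theta(x \mid z)\) once `x_hat` is produced by decoding a sample \(z \sim q_\phi(\cdot \mid x)\).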

Why This Decomposition Matters

The split \((\star)\) is more than algebraic convenience: each piece is independently estimable and differentiable.

  • The KL term has a closed form when \(q_\phi\) and \(p\) are both Gaussian — no Monte Carlo needed. For diagonal \(q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)^2))\) and \(p(z) = \mathcal{N}(0, I)\), \(\mathrm{KL} = \tfrac{1}{2} \sum_i (\mu_{\phi, i}^2 + \sigma_{\phi, i}^2 - \log \sigma_{\phi, i}^2 - 1)\).
  • The reconstruction term requires a Monte Carlo estimate over \(z \sim q_\phi(\cdot \mid x)\), which is where the reparameterization trick earns its keep: writing \(z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\) lets gradients flow through the sample, and in practice one sample per datapoint already gives a low-variance estimate (see the sketch below).
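A minimal sketch of both pieces, assuming a diagonal Gaussian encoder that outputs numpy arrays `mu` and `log_var`; `decode` and `recon_loglik` are placeholders for the decoder and one of the likelihood functions sketched earlier:

```python
import numpy as np

rng = np.random.default_rng(0)

def diag_gaussian_kl(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

def reparam_sample(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I); in an autodiff framework this is the
    # expression that lets gradients reach mu and log_var through the sample.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def elbo_estimate(x, mu, log_var, decode, recon_loglik):
    # Single-sample Monte Carlo ELBO: stochastic reconstruction term minus exact KL term.
    z = reparam_sample(mu, log_var)
    return recon_loglik(x, decode(z)) - diag_gaussian_kl(mu, log_var)
```

Only the reconstruction term is stochastic here; the KL term is exact, which is part of why the diagonal-Gaussian encoder with a standard-normal prior is the default pairing.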

The Tightness Gap

The ELBO equals \(\log p_\theta(x)\) minus the KL between the encoder and the true posterior:

\[ \log p_\theta(x) - \mathcal{L}_{\text{VAE}}(\theta, \phi; x) = \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right). \]

So the VAE is implicitly fitting \(q_\phi\) to the true posterior. With a Gaussian encoder this fit is structurally limited if the true posterior is multimodal — explaining why richer encoder families (normalizing flows, hierarchical encoders) tighten the bound and improve test-set log-likelihood.
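The identity is easy to verify numerically in a model small enough that every quantity has a closed form. Here is a sketch for a 1-D conjugate Gaussian model; the specific numbers are arbitrary:

```python
import numpy as np

# 1-D conjugate model: p(z) = N(0, 1), p(x | z) = N(z, s2).
s2, x = 0.5, 1.3            # likelihood variance and an observed datapoint (arbitrary values)
m, v = 0.4, 0.2             # an arbitrary Gaussian encoder q(z | x) = N(m, v)

# Exact marginal likelihood: p(x) = N(0, 1 + s2).
log_px = -0.5 * (np.log(2 * np.pi * (1 + s2)) + x**2 / (1 + s2))

# ELBO pieces, both available in closed form for this model.
recon = -0.5 * (np.log(2 * np.pi * s2) + ((x - m)**2 + v) / s2)   # E_q[log p(x | z)]
kl_prior = 0.5 * (m**2 + v - np.log(v) - 1)                        # KL(q || p(z))
elbo = recon - kl_prior

# Exact posterior p(z | x) = N(m_post, v_post) and KL(q || posterior).
m_post, v_post = x / (1 + s2), s2 / (1 + s2)
kl_post = 0.5 * ((v + (m - m_post)**2) / v_post - 1 + np.log(v_post / v))

print(np.isclose(log_px - elbo, kl_post))   # True: the slack in the bound is exactly this KL
```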

Comparison: EM vs. VAE on the Same Bound

Expectation-maximization and the VAE both maximize the ELBO. The difference is in how they handle \(q\):

  • EM uses \(q^{(t)}(z) = p_{\theta^{(t)}}(z \mid x)\), the true posterior at the current parameters. Tractable only when the posterior has closed form.
  • VAE uses a parametric \(q_\phi(z \mid x)\) and trains \(\phi\) jointly with \(\theta\). Sacrifices tightness for tractability — works for any decoder \(p_\theta\) as long as we can sample from \(q_\phi\) and evaluate its density.

This is the precise sense in which the VAE is “amortized variational EM”: one network’s worth of \(\phi\) replaces the per-datapoint posterior computation of the E-step.
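As a structural sketch of that difference (the callables `exact_posterior`, `m_step`, and `elbo_grads` are hypothetical placeholders for model-specific pieces, not a real API):

```python
def em_update(xs, theta, exact_posterior, m_step):
    # E-step: solve one exact inference problem per datapoint at the current theta...
    posteriors = [exact_posterior(x, theta) for x in xs]
    # ...then M-step: re-fit theta against those posteriors.
    return m_step(xs, posteriors)

def vae_update(xs, theta, phi, elbo_grads, lr=1e-3):
    # Amortized variational EM: one encoder (parameters phi) replaces the per-datapoint
    # E-step, and theta, phi take a joint gradient-ascent step on the Monte Carlo ELBO.
    g_theta, g_phi = elbo_grads(xs, theta, phi)
    return theta + lr * g_theta, phi + lr * g_phi
```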

References

Kingma, Diederik P., and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv Preprint arXiv:1312.6114. https://arxiv.org/abs/1312.6114.