Variational Autoencoder

Motivation

A plain autoencoder learns a deterministic encoder-decoder pair, but the resulting latent space is not a generative model: there is no defined distribution over latents that one can sample from to produce new data. The variational autoencoder (VAE) (Kingma and Welling 2013; Rezende et al. 2014) closes this gap by making the encoder probabilistic and adding a KL regularizer that pulls the encoder’s distribution over latents toward a fixed prior. The result is a true generative model with tractable training and tractable sampling.

The VAE is also the canonical example of amortized variational inference: instead of optimizing a separate variational distribution for each datapoint (as in classical variational EM), a single neural-network encoder predicts the variational distribution for every input. Once trained, inference at any new \(x\) is a single forward pass through the encoder.

Setup

A latent-variable generative model:

\[ p(z) = \mathcal{N}(0, I), \qquad p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 I) \text{ or similar}. \]

The decoder \(\mu_\theta(z)\) is a neural network that maps a latent code \(z\) to a parameter of the data likelihood. Maximum-likelihood training would maximize \(\log p_\theta(x) = \log \int p_\theta(x \mid z) p(z) \, dz\) — intractable for non-trivial decoders.

The VAE maximizes the evidence lower bound instead, using a learned encoder \(q_\phi(z \mid x)\):

\[ \log p_\theta(x) \geq \mathrm{ELBO}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z)). \]

This two-term form follows directly from the general ELBO \(\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x, z) - \log q_\phi(z \mid x)]\) by factoring \(p_\theta(x, z) = p_\theta(x \mid z)\, p(z)\) (proof). The inequality holds because the gap equals \(\mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) \geq 0\) (proof).

The encoder is typically Gaussian: \(q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)^2))\), with \(\mu_\phi\) and \(\sigma_\phi\) outputs of an encoder network. The objective is jointly optimized in \(\phi\) and \(\theta\) by gradient ascent.
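A minimal sketch of this setup in PyTorch, assuming flattened 784-dimensional inputs and a 32-dimensional latent space; the architecture, layer sizes, and names here are illustrative, not prescribed by the papers:

    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        """Gaussian encoder q_phi(z|x) and decoder mean mu_theta(z)."""

        def __init__(self, x_dim=784, z_dim=32, h_dim=400):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.enc_mu = nn.Linear(h_dim, z_dim)      # mu_phi(x)
            self.enc_logvar = nn.Linear(h_dim, z_dim)  # log sigma_phi(x)^2
            self.dec = nn.Sequential(
                nn.Linear(z_dim, h_dim), nn.ReLU(),
                nn.Linear(h_dim, x_dim),               # mu_theta(z)
            )

        def encode(self, x):
            h = self.enc(x)
            return self.enc_mu(h), self.enc_logvar(h)

        def reparameterize(self, mu, logvar):
            eps = torch.randn_like(mu)                 # eps ~ N(0, I)
            return mu + torch.exp(0.5 * logvar) * eps  # z = mu + sigma * eps

        def forward(self, x):
            mu, logvar = self.encode(x)
            z = self.reparameterize(mu, logvar)
            return self.dec(z), mu, logvar

Predicting \(\log \sigma^2\) rather than \(\sigma\) keeps the standard deviation positive without an explicit constraint.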

Diagram: VAE forward pass with reparameterization

The encoder outputs \((\mu_\phi(x), \sigma_\phi(x))\) rather than a single \(z\). The reparameterization \(z = \mu + \sigma \odot \varepsilon\) moves the random draw out of the differentiable path so gradients can flow back to \(\phi\).

[Diagram: \(x \to\) encoder \(f_\phi(x) \to (\mu_\phi(x), \sigma_\phi(x))\); \(\varepsilon \sim \mathcal{N}(0, I)\); \(z = \mu + \sigma \odot \varepsilon \to\) decoder \(g_\theta(z)\). Loss \(= -\log p_\theta(x \mid z) + \mathrm{KL}(\mathcal{N}(\mu_\phi, \sigma_\phi^2) \,\|\, \mathcal{N}(0, I))\), i.e. reconstruction plus regularization to the prior. Dashed pink: the stochastic input \(\varepsilon\). Reparameterization keeps the rest of the graph differentiable in \(\phi\) and \(\theta\).]

The Two Terms

The ELBO decomposes into two terms with intuitive meanings:

  • Reconstruction: \(\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]\). Sample \(z\) from the encoder, decode it, and evaluate the likelihood of \(x\) under the decoder. For a Gaussian likelihood this reduces, up to additive constants, to a negative scaled MSE between \(x\) and \(\mu_\theta(z)\) (written out after this list).
  • Regularization: \(\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))\). Pulls the encoder’s distribution toward the prior. With Gaussian encoder and standard-normal prior, this is a closed form involving \(\mu_\phi\), \(\sigma_\phi\). See KL divergence.
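For concreteness, with the fixed-variance Gaussian likelihood \(p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 I)\) from the Setup section,

\[ \log p_\theta(x \mid z) = -\frac{1}{2\sigma^2} \|x - \mu_\theta(z)\|^2 - \frac{d}{2} \log(2\pi\sigma^2), \]

so maximizing the reconstruction term is minimizing a scaled squared error plus a constant.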

The KL term is what makes the VAE generative: it ensures that latent codes from different data points cluster together near the prior, so sampling \(z \sim p(z)\) at test time and decoding produces meaningful outputs.

The Reparameterization Trick

The reconstruction term has the form \(\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[\log p_\theta(x \mid z)]\). The expectation is over a distribution that depends on \(\phi\), so \(\nabla_\phi\) of this expectation is not the expectation of \(\nabla_\phi \log p_\theta(x \mid z)\) — there is an additional contribution from how the distribution itself depends on \(\phi\).

Two ways to handle this:

  1. Score-function estimator (REINFORCE-style). Treat the \(\phi\)-dependent sampling distribution like a policy and apply the log-derivative trick. Unbiased but high-variance.
  2. Reparameterization trick. Rewrite \(z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, I)\). Now the expectation is over \(\varepsilon\) — distribution does not depend on \(\phi\) — and gradients flow through \(z\) via the chain rule. Much lower variance. See reparameterization trick.

The reparameterization trick is what makes the VAE practical: the variance reduction over the score-function alternative is dramatic (proof).
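A small numerical check of that claim, on a toy expectation \(\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2]\) whose true gradient in \(\mu\) is \(2\mu\); the example and its numbers are illustrative, not from the papers:

    import torch

    torch.manual_seed(0)
    mu, sigma, n = 1.0, 0.5, 100_000
    eps = torch.randn(n)
    z = mu + sigma * eps  # z ~ N(mu, sigma^2); f(z) = z^2
    # True gradient: d/dmu E[z^2] = d/dmu (mu^2 + sigma^2) = 2*mu = 2.0

    # Score-function estimator: f(z) * d/dmu log N(z; mu, sigma^2)
    g_score = z**2 * (z - mu) / sigma**2
    # Reparameterized estimator: d/dmu f(mu + sigma*eps) = 2*(mu + sigma*eps)
    g_rep = 2 * z

    print(g_score.mean().item(), g_score.var().item())  # ~2.0, variance ~22
    print(g_rep.mean().item(), g_rep.var().item())      # ~2.0, variance ~1

Both estimators are unbiased, but the score-function variance is roughly twenty times larger here, and the gap widens in higher dimensions.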

Training

For each training example \(x\):

  1. Compute \(\mu_\phi(x)\), \(\sigma_\phi(x)\) from the encoder.
  2. Sample \(\varepsilon \sim \mathcal{N}(0, I)\) and compute \(z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon\).
  3. Compute \(\hat x = \mu_\theta(z)\) from the decoder.
  4. Compute the loss \[ L = -\log p_\theta(x \mid z) + \mathrm{KL}(\mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)^2)) \,\|\, \mathcal{N}(0, I)). \]
  5. Backpropagate, update \(\phi\) and \(\theta\).

The closed-form KL for diagonal Gaussian encoder vs. standard normal prior is

\[ \mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z)) = \tfrac{1}{2} \sum_{i=1}^d \left(\mu_{\phi, i}(x)^2 + \sigma_{\phi, i}(x)^2 - \log \sigma_{\phi, i}(x)^2 - 1\right). \]

The reconstruction term is sampled (one \(\varepsilon\) per training step is typical, sometimes more for variance reduction).
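Steps 1–5 and the closed-form KL above, assembled into a loss function. This sketch assumes the hypothetical VAE module from the Setup section and a unit-variance Gaussian likelihood, so the reconstruction term is half the squared error up to an additive constant:

    import torch
    import torch.nn.functional as F

    def vae_loss(model, x):
        """Negative ELBO for a batch: reconstruction + closed-form KL."""
        x_hat, mu, logvar = model(x)
        # -log p_theta(x|z) for a unit-variance Gaussian, up to constants
        recon = 0.5 * F.mse_loss(x_hat, x, reduction="sum")
        # KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims
        kl = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1)
        return (recon + kl) / x.shape[0]  # average over the batch

    # One gradient step:
    # opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # opt.zero_grad(); vae_loss(model, x).backward(); opt.step()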

Sampling

After training, generate new samples by:

  1. Sample \(z \sim p(z) = \mathcal{N}(0, I)\).
  2. Decode: \(\hat x = \mu_\theta(z)\) (or sample from \(p_\theta(x \mid z)\) for stochastic outputs).
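With the hypothetical VAE module from the Setup section, this is two lines:

    with torch.no_grad():
        z = torch.randn(64, 32)  # 64 draws from the prior N(0, I)
        x_hat = model.dec(z)     # decoded means mu_theta(z)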

The KL term ensures that samples from the prior land in regions where the decoder produces sensible outputs. Without that regularization, \(z\) from the prior would not match where training-time codes live and the decoder would produce garbage.

Strengths and Weaknesses

Pros:

  • Tractable training (single ELBO objective).
  • Tractable sampling.
  • Latent space with semantically meaningful structure (interpolations are smooth).
  • Good for representation learning even when generation is not the goal.

Cons:

  • Generated samples are typically blurry compared to GANs and diffusion models: the per-pixel Gaussian likelihood encourages averaging.
  • The KL regularizer can collapse the encoder (“posterior collapse”) so that \(q_\phi(z \mid x) \approx p(z)\) for all \(x\). The model then ignores \(z\) and the decoder degenerates to an unconditional density. Mitigations: warm up the KL weight, limit decoder capacity, or use a \(\beta\)-VAE with \(\beta < 1\).
  • The Gaussian encoder is a restrictive variational family; the bound can be loose if the true posterior is multimodal.

Variants

  • \(\beta\)-VAE (Higgins et al., 2017): scale the KL term by \(\beta > 1\) to encourage disentangled representations, or \(\beta < 1\) to mitigate posterior collapse (sketch after this list).
  • VQ-VAE (van den Oord et al., 2017): discrete latents via vector quantization. Used in DALL-E and several speech models.
  • Hierarchical / ladder VAEs: multiple latent layers, capturing structure at different scales.
  • Normalizing-flow encoders: replace the diagonal Gaussian with a flow-based variational distribution for tighter ELBO.
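Continuing the vae_loss sketch from the Training section, the \(\beta\)-VAE change is a single multiplication; beta is a hyperparameter, and the default here is only a placeholder:

    def beta_vae_loss(model, x, beta=4.0):
        """Negative beta-ELBO: reconstruction + beta * KL."""
        x_hat, mu, logvar = model(x)
        recon = 0.5 * F.mse_loss(x_hat, x, reduction="sum")
        kl = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1)
        return (recon + beta * kl) / x.shape[0]  # beta reweights the KL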

Where the VAE Sits Now

VAEs were the dominant deep generative model from 2014 to roughly 2018. GANs took over for image generation by producing sharper samples; diffusion models have since taken over for state-of-the-art generation by combining sample quality with the tractable training of likelihood-based models. The VAE remains useful for representation learning, for cases where a tractable likelihood is needed, and as the conceptual foundation for understanding what diffusion models actually optimize. The DDPM loss is itself a weighted ELBO.

References

Kingma, Diederik P., and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv Preprint arXiv:1312.6114. https://arxiv.org/abs/1312.6114.
Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014. “Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” International Conference on Machine Learning (ICML), 1278–86. https://proceedings.mlr.press/v32/rezende14.html.