Latent-Variable Generative Models
Motivation
A generative model specifies a probability distribution \(p_\theta(x)\) over the data space \(\mathcal{X}\) — images, sentences, molecules — and supports two operations: scoring (evaluate \(p_\theta(x)\) on a given \(x\)) and sampling (produce a fresh \(x \sim p_\theta\)). Modeling \(p_\theta(x)\) directly is hard because real data lies on complicated, low-dimensional manifolds inside a very high-dimensional ambient space; a flexible enough density would need a parameterization that respects this structure.
The latent-variable approach factors the model through an unobserved variable \(z\):
\[ p_\theta(x) = \int p_\theta(x \mid z)\, p(z) \, dz. \]
A simple prior \(p(z)\) — often standard Gaussian — passes through a learned conditional \(p_\theta(x \mid z)\) to produce the data distribution. The complicated shape of \(p_\theta(x)\) is offloaded onto the geometry of the map \(z \mapsto p_\theta(x \mid z)\), while the prior itself stays easy to sample. This is the framework that powers VAEs (Kingma and Welling 2013), diffusion models (Ho et al. 2020), normalizing flows (Rezende and Mohamed 2015), GANs (Goodfellow et al. 2014), and the classical density estimators that preceded them (mixtures, factor analysis, HMMs (Baum and Petrie 1966)).
The cost is that the integral over \(z\) is usually intractable, which forces every method in this family to commit to a strategy — bound it (ELBO), avoid it (flows, GANs), or rewrite it as a chain of small steps (diffusion).
The Framework
A latent-variable generative model has three ingredients:
- Latent space \(\mathcal{Z}\) with a fixed prior \(p(z)\). Common choices: \(\mathcal{N}(0, I)\) on \(\mathbb{R}^d\), categorical \(\{1, \ldots, K\}\), or a sequence space for time-series models.
- Conditional likelihood \(p_\theta(x \mid z)\), parameterized by \(\theta\) and typically realized as a neural network mapping \(z\) to the parameters of an exponential-family distribution over \(\mathcal{X}\).
- Marginal \(p_\theta(x) = \int p_\theta(x \mid z) p(z)\, dz\), which is the actual quantity of interest — the model’s distribution over the observed data.
The posterior \(p_\theta(z \mid x) = p_\theta(x \mid z) p(z) / p_\theta(x)\) is what you need to reason backward from a data point to its latent code. It is the second hard computation: it shares the intractable \(p_\theta(x)\) in its denominator.
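To make the template concrete, here is a minimal numpy sketch with toy choices for all three ingredients: a standard Gaussian prior on \(\mathbb{R}^2\) and a small made-up nonlinear decoder. Ancestral sampling from the marginal takes two lines; evaluating the marginal is where the trouble starts.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Prior p(z): standard Gaussian on R^2 (a common choice, per the text).
def sample_prior(n):
    return rng.standard_normal((n, 2))

# 2. Conditional likelihood p_theta(x | z): a toy nonlinear "decoder" that maps z
#    to the mean of a Gaussian over R^2 with fixed isotropic noise. The matrix A
#    and noise scale are made up for illustration.
A = np.array([[1.5, -0.3], [0.2, 0.8]])
SIGMA = 0.1

def sample_decoder(z):
    return np.tanh(z @ A.T) + SIGMA * rng.standard_normal(z.shape)

# 3. Marginal p_theta(x): defined by the integral over z. Sampling from it is
#    easy (ancestral sampling below); evaluating it is the hard part, because
#    the integral has no closed form once the decoder is nonlinear.
z = sample_prior(1000)
x = sample_decoder(z)          # 1000 draws from p_theta(x)
print(x.mean(axis=0), x.std(axis=0))
```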
Diagram: prior, decoder, and the marginal it induces
The prior \(p(z)\) is simple; the decoder \(p_\theta(x \mid z)\) warps it into a complicated marginal \(p_\theta(x)\). Reading right-to-left gives the posterior \(p_\theta(z \mid x)\) — the inverse problem.
Examples Across the Spectrum
Every model in this family fits the same template; they differ in what they choose for \(\mathcal{Z}\), \(p(z)\), and \(p_\theta(x \mid z)\).
Discrete latents: mixture models
A Gaussian mixture model has \(z \in \{1, \ldots, K\}\) with \(p(z = k) = \pi_k\) and \(p_\theta(x \mid z = k) = \mathcal{N}(\mu_k, \Sigma_k)\). The latent picks which component generated the point. The integral is a finite sum, so the marginal \(p_\theta(x) = \sum_k \pi_k \mathcal{N}(x; \mu_k, \Sigma_k)\) is tractable, and the posterior \(p_\theta(z \mid x)\) is the vector of normalized component responsibilities. This is the setting where both steps of EM can be carried out in closed form (Dempster et al. 1977).
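A short numpy sketch of the tractable case, with made-up mixture parameters: the marginal is the finite sum above, and the responsibilities are Bayes' rule applied componentwise.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy 3-component GMM in R^2 (parameters invented for illustration).
pis = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 2.0])]

def marginal_and_responsibilities(x):
    # Component likelihoods p(x | z = k)
    comp = np.array([multivariate_normal(mus[k], covs[k]).pdf(x)
                     for k in range(len(pis))])
    # Marginal is a finite sum: p(x) = sum_k pi_k N(x; mu_k, Sigma_k)
    p_x = np.sum(pis * comp)
    # Posterior responsibilities: p(z = k | x) = pi_k N(x; mu_k, Sigma_k) / p(x)
    resp = pis * comp / p_x
    return p_x, resp

p_x, resp = marginal_and_responsibilities(np.array([2.5, 2.8]))
print("p(x) =", p_x, " responsibilities =", resp)
```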
Continuous latents, linear decoder: factor analysis and probabilistic PCA
Take \(z \sim \mathcal{N}(0, I)\) on \(\mathbb{R}^k\) and \(p_\theta(x \mid z) = \mathcal{N}(W z + \mu, \sigma^2 I)\) with \(W \in \mathbb{R}^{d \times k}\). The marginal is again Gaussian, \(\mathcal{N}(\mu, W W^\top + \sigma^2 I)\), and the maximum-likelihood \(W\) spans the subspace of the top-\(k\) principal components of classical PCA (Pearson 1901); factor analysis is the same model with a diagonal rather than isotropic noise covariance. Probabilistic PCA is the linear, continuous-latent baseline; the VAE’s job is to generalize the decoder from linear to nonlinear.
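A sketch with toy \(W\), \(\mu\), and \(\sigma^2\): because everything is linear-Gaussian, the latent integrates out in closed form and scoring a point needs no integral at all.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Toy probabilistic-PCA parameters: d = 4 observed dims, k = 2 latent dims.
d, k = 4, 2
W = rng.standard_normal((d, k))
mu = np.zeros(d)
sigma2 = 0.25

# The linear-Gaussian structure integrates out in closed form:
#   p(x) = N(x; mu, W W^T + sigma^2 I)
marginal = multivariate_normal(mu, W @ W.T + sigma2 * np.eye(d))

# Sampling follows the generative story: z ~ N(0, I), x = W z + mu + noise.
z = rng.standard_normal(k)
x = W @ z + mu + np.sqrt(sigma2) * rng.standard_normal(d)
print("log p(x) =", marginal.logpdf(x))
```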
Continuous latents, nonlinear decoder: the VAE
The VAE (Kingma and Welling 2013; Rezende et al. 2014) uses \(z \sim \mathcal{N}(0, I)\) and a neural-network decoder \(p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 I)\) (or a Bernoulli decoder for binary images, etc.). The marginal is no longer tractable, and no finite-sum, linear-Gaussian, or dynamic-programming trick applies, so training falls back on the ELBO, which bounds \(\log p_\theta(x)\) from below because the KL divergence is nonnegative.
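A minimal sketch of a one-sample ELBO estimate with reparameterized sampling; the encoder and decoder here are fixed linear stand-ins for trained networks, and all names and shapes are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 4, 2

# Stand-ins for trained networks (assumed, for illustration only):
# encoder q_phi(z | x) = N(mu_q(x), diag(sigma_q^2)), decoder p_theta(x | z) = N(mu_p(z), I).
W_enc = rng.standard_normal((d_z, d_x))
W_dec = rng.standard_normal((d_x, d_z))

def encode(x):
    mu_q = W_enc @ x
    log_sigma_q = np.full(d_z, -1.0)    # fixed log-std, for simplicity
    return mu_q, log_sigma_q

def decode(z):
    return W_dec @ z                    # mean of a unit-variance Gaussian over x

def elbo(x):
    mu_q, log_sigma_q = encode(x)
    sigma_q = np.exp(log_sigma_q)
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so gradients
    # with respect to (mu, sigma) can flow through the sample.
    eps = rng.standard_normal(d_z)
    z = mu_q + sigma_q * eps
    # Reconstruction term: log p_theta(x | z) for a unit-variance Gaussian decoder.
    recon = -0.5 * np.sum((x - decode(z)) ** 2) - 0.5 * d_x * np.log(2 * np.pi)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
    kl = 0.5 * np.sum(mu_q**2 + sigma_q**2 - 2 * log_sigma_q - 1.0)
    return recon - kl    # a lower bound on log p_theta(x)

x = rng.standard_normal(d_x)
print("one-sample ELBO estimate:", elbo(x))
```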
A hierarchy of latents: diffusion
A diffusion model (Sohl-Dickstein et al. 2015; Ho et al. 2020) introduces an entire sequence of latents \(z_1, \ldots, z_T\) at increasing noise levels, with \(z_T \sim \mathcal{N}(0, I)\) and each \(z_{t-1}\) produced by a small denoising step from \(z_t\). The data is \(x = z_0\). This is still a latent-variable generative model — and its training objective is still derived from an ELBO — but the chain structure makes each step’s posterior close to tractable, which is what gives diffusion its training stability.
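A sketch of the chain's two key pieces under a made-up linear noise schedule: the closed-form forward marginal that produces a noisy latent \(z_t\) directly from \(x = z_0\), and the denoising regression loss that the (weighted) ELBO simplifies to. The noise predictor is a placeholder for the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule (an assumption; schedules vary across papers).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def noisy_latent(x0, t):
    # Closed-form forward marginal: z_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    eps = rng.standard_normal(x0.shape)
    z_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return z_t, eps

def eps_predictor(z_t, t):
    # Stand-in for the trained denoising network (hypothetical).
    return np.zeros_like(z_t)

def denoising_loss(x0, t):
    # The chained ELBO simplifies (up to per-step weights) to a regression:
    # predict the noise eps that turned x0 into z_t.
    z_t, eps = noisy_latent(x0, t)
    return np.mean((eps_predictor(z_t, t) - eps) ** 2)

x0 = rng.standard_normal(8)
print("denoising loss at t=500:", denoising_loss(x0, 500))
```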
Implicit decoder: the GAN
A GAN (Goodfellow et al. 2014) keeps the prior and the deterministic decoder \(x = G_\theta(z)\), but throws away the likelihood: the model defines a distribution over \(x\) only implicitly, through sampling. This sidesteps the marginal-likelihood integral entirely at the cost of giving up density evaluation.
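A sketch with a fixed linear map standing in for \(G_\theta\): sampling is a single push of prior draws through the generator, and there is simply no density to evaluate.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 2, 4

# Stand-in deterministic generator G_theta (a fixed linear map, for illustration).
G = rng.standard_normal((d_x, d_z))

def sample(n):
    # Sampling is trivial: push prior samples through the generator.
    z = rng.standard_normal((n, d_z))
    return z @ G.T

x = sample(5)
# There is no log_prob(x): the model defines p_theta(x) only implicitly,
# so density evaluation (and hence maximum-likelihood training) is unavailable.
print(x.shape)
```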
The Two Hard Computations
Most of the technical machinery in this chapter exists to deal with two quantities:
- The marginal likelihood \(p_\theta(x) = \int p_\theta(x \mid z) p(z)\, dz\). Needed for maximum-likelihood training. Tractable only when (a) \(\mathcal{Z}\) is finite (mixtures), (b) the integral has special structure (linear-Gaussian, sequential), or (c) the decoder is an invertible map (flows, where the change-of-variables formula replaces the integral with a Jacobian determinant).
- The posterior \(p_\theta(z \mid x) = p_\theta(x \mid z) p(z) / p_\theta(x)\). Needed for inference — assigning latent codes to observed data — and for the E-step of EM. Inherits the same intractable denominator.
For nonlinear-decoder models on high-dimensional data, both are out of reach in closed form. The chapter’s plan follows from this fact.
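One way to feel the difficulty: the marginal can always be estimated by importance sampling, as in the sketch below with a toy nonlinear decoder and the prior as proposal, but the estimator's variance explodes unless the proposal is close to the true posterior, and finding such a proposal is exactly the inference problem.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Toy nonlinear decoder p(x | z) = N(tanh(A z), 0.1^2 I)  (illustrative only).
A = np.array([[1.5, -0.3], [0.2, 0.8]])
SIGMA = 0.1
prior = multivariate_normal(np.zeros(2), np.eye(2))

def log_lik(x, z):
    mu = np.tanh(z @ A.T)
    return (-0.5 * np.sum((x - mu) ** 2, axis=-1) / SIGMA**2
            - x.shape[-1] * np.log(SIGMA * np.sqrt(2 * np.pi)))

def log_marginal_is(x, proposal, n=5000):
    # Importance sampling:
    # log p(x) ~= logsumexp(log p(x|z_i) + log p(z_i) - log q(z_i)) - log n
    z = proposal.rvs(size=n)
    log_w = log_lik(x, z) + prior.logpdf(z) - proposal.logpdf(z)
    return logsumexp(log_w) - np.log(n)

x = np.array([0.5, -0.2])
# With the prior as proposal the estimate is valid but high-variance; a proposal
# close to the true posterior p(z | x) would do far better, which is exactly
# what variational inference tries to learn.
print(log_marginal_is(x, prior))
```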
Strategies for Training
How each model family makes peace with the intractable marginal:
| Family | Strategy |
|---|---|
| Mixtures, HMMs | Exact: small or structured \(\mathcal{Z}\) makes the integral computable. |
| Probabilistic PCA, factor analysis | Exact: linear-Gaussian gives a closed-form marginal. |
| VAE | Bound: maximize the ELBO instead of the marginal. |
| Diffusion | Bound: a chained ELBO that simplifies to a denoising regression target. |
| Normalizing flows | Exact: invertible decoder converts the integral into a Jacobian. |
| GAN | Avoid: train by a minimax discriminator instead of by likelihood. |
Reading down the strategy column, the chapter’s other articles are the tools each strategy uses: KL divergence measures the gap from a variational approximation to the true posterior, the ELBO packages that measurement into a tractable objective, the reparameterization trick makes gradients through samples low-variance, and the VAE is the canonical model that puts them together.
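As a small illustration of the flow row, a single invertible affine layer with toy parameters: the change-of-variables formula turns density evaluation into a prior evaluation plus a log-determinant, with no integral anywhere.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 2
prior = multivariate_normal(np.zeros(d), np.eye(d))

# Toy invertible decoder x = A z + b (a single affine "flow layer").
A = np.array([[1.2, 0.3], [0.0, 0.7]])
b = np.array([0.5, -1.0])
A_inv = np.linalg.inv(A)
log_abs_det = np.log(np.abs(np.linalg.det(A)))

def log_prob(x):
    # Change of variables: log p(x) = log p(z) - log|det dx/dz|, with z = A^{-1}(x - b).
    z = (x - b) @ A_inv.T
    return prior.logpdf(z) - log_abs_det

def sample(n):
    return prior.rvs(size=n) @ A.T + b

print("exact log p(x):", log_prob(np.array([0.1, 0.2])))
```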
What Comes Next
The rest of the chapter assumes this framework and develops the tools to train models within it.
- KL divergence — the measure of distributional discrepancy that quantifies how far a variational posterior is from the true one.
- Evidence lower bound — the tractable surrogate for the intractable marginal log-likelihood.
- Variational autoencoder — the canonical instance: amortized inference with a Gaussian encoder, a neural decoder, and the ELBO as the objective.
- Reparameterization trick — the gradient estimator that makes the VAE practical.
The next chapter, diffusion, is the same framework applied to a hierarchy of noise-level latents — and its training objective, despite looking like a denoising regression, is itself a weighted ELBO.