Latent-Variable Generative Models

Motivation

A generative model specifies a probability distribution \(p_\theta(x)\) over the data space \(\mathcal{X}\) — images, sentences, molecules — and supports two operations: scoring (evaluate \(p_\theta(x)\) on a given \(x\)) and sampling (produce a fresh \(x \sim p_\theta\)). Modeling \(p_\theta(x)\) directly is hard because real data lies on complicated, low-dimensional manifolds inside a very high-dimensional ambient space; a flexible enough density would need a parameterization that respects this structure.

The latent-variable approach factors the model through an unobserved variable \(z\):

\[ p_\theta(x) = \int p_\theta(x \mid z)\, p(z) \, dz. \]

A simple prior \(p(z)\) — often standard Gaussian — passes through a learned conditional \(p_\theta(x \mid z)\) to produce the data distribution. The complicated shape of \(p_\theta(x)\) is offloaded onto the geometry of the map \(z \mapsto p_\theta(x \mid z)\), while the prior itself stays easy to sample. This is the framework that powers VAEs (Kingma and Welling 2013), diffusion models (Ho et al. 2020), normalizing flows (Rezende and Mohamed 2015), GANs (Goodfellow et al. 2014), and the classical density estimators that preceded them (mixtures, factor analysis, HMMs (Baum and Petrie 1966)).

The cost is that the integral over \(z\) is usually intractable, which forces every method in this family to commit to a strategy — bound it (ELBO), avoid it (flows, GANs), or rewrite it as a chain of small steps (diffusion).
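Before reaching for any of those strategies, it is worth seeing why the naive one fails. A plain Monte Carlo estimate of the marginal, averaging \(p_\theta(x \mid z)\) over prior samples, is easy to write down. The sketch below uses a toy 1-D model with made-up parameters, chosen so the true marginal is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D model: z ~ N(0, 1), x | z ~ N(z, 0.5^2).
# The true marginal is N(0, 1 + 0.25), so the estimate can be checked.
def likelihood(x, z, sigma=0.5):
    return np.exp(-0.5 * ((x - z) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def marginal_mc(x, n_samples=100_000):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by averaging over prior samples."""
    z = rng.standard_normal(n_samples)
    return likelihood(x, z).mean()

x = 0.7
true_marginal = np.exp(-0.5 * x**2 / 1.25) / np.sqrt(2 * np.pi * 1.25)
print(marginal_mc(x), true_marginal)  # close for this 1-D toy
```

In one dimension this converges quickly; in high dimensions almost all prior samples land where \(p_\theta(x \mid z)\) is negligible, and the estimator's variance explodes. That failure is the practical face of the intractability.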

The Framework

A latent-variable generative model has three ingredients:

  • Latent space \(\mathcal{Z}\) with a fixed prior \(p(z)\). Common choices: \(\mathcal{N}(0, I)\) on \(\mathbb{R}^d\), categorical \(\{1, \ldots, K\}\), or a sequence space for time-series models.
  • Conditional likelihood \(p_\theta(x \mid z)\), parameterized by \(\theta\) and typically realized as a neural network mapping \(z\) to the parameters of an exponential-family distribution over \(\mathcal{X}\).
  • Marginal \(p_\theta(x) = \int p_\theta(x \mid z) p(z)\, dz\), which is the actual quantity of interest — the model’s distribution over the observed data.

The posterior \(p_\theta(z \mid x) = p_\theta(x \mid z) p(z) / p_\theta(x)\) is what you need to reason backward from a data point to its latent code. It is the second hard computation: it shares the intractable \(p_\theta(x)\) in its denominator.

Diagram: prior, decoder, and the marginal it induces

The prior \(p(z)\) is simple; the decoder \(p_\theta(x \mid z)\) warps it into a complicated marginal \(p_\theta(x)\). Reading right-to-left gives the posterior \(p_\theta(z \mid x)\) — the inverse problem.

[Figure: latent z ~ p(z) (e.g. N(0, I)), the simple prior, passes through the decoder/likelihood p_θ(x | z) to produce data x ~ p_θ(x), the complicated marginal; the reverse arrow is the posterior p_θ(z | x), the inverse problem. Both the marginal p_θ(x) = ∫ p_θ(x | z) p(z) dz and the posterior involve this integral.]

Examples Across the Spectrum

Every model in this family fits the same template; they differ in what they choose for \(\mathcal{Z}\), \(p(z)\), and \(p_\theta(x \mid z)\).

Discrete latents: mixture models

A Gaussian mixture model has \(z \in \{1, \ldots, K\}\) with \(p(z = k) = \pi_k\) and \(p_\theta(x \mid z = k) = \mathcal{N}(\mu_k, \Sigma_k)\). The latent picks which component generated the point. The integral is a finite sum, so the marginal \(p_\theta(x) = \sum_k \pi_k \mathcal{N}(x; \mu_k, \Sigma_k)\) is tractable, and the posterior \(p_\theta(z \mid x)\) is the vector of normalized component responsibilities. This is the setting where EM is exact (Dempster et al. 1977).
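A minimal numerical version of the above, with illustrative 1-D parameters in plain NumPy:

```python
import numpy as np

# 1-D Gaussian mixture with K = 2 components (illustrative parameters).
pi = np.array([0.3, 0.7])       # mixing weights pi_k
mu = np.array([-2.0, 1.5])      # component means
sigma = np.array([0.5, 1.0])    # component standard deviations

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def marginal(x):
    """p(x) = sum_k pi_k N(x; mu_k, sigma_k^2) -- the integral is a finite sum."""
    return np.sum(pi * gauss(x, mu, sigma))

def responsibilities(x):
    """Posterior p(z = k | x): the normalized component responsibilities."""
    joint = pi * gauss(x, mu, sigma)   # pi_k * N(x; mu_k, sigma_k^2)
    return joint / joint.sum()         # divide by the marginal

r = responsibilities(0.0)
print(marginal(0.0), r, r.sum())  # responsibilities sum to 1
```

Note that the marginal appears twice: once as the quantity of interest and once as the posterior's denominator, exactly as in the general framework.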

Continuous latents, linear decoder: factor analysis and probabilistic PCA

Take \(z \sim \mathcal{N}(0, I)\) on \(\mathbb{R}^k\) and \(p_\theta(x \mid z) = \mathcal{N}(W z + \mu, \sigma^2 I)\) with \(W \in \mathbb{R}^{d \times k}\). The marginal is again Gaussian, \(\mathcal{N}(\mu, W W^\top + \sigma^2 I)\), and the maximum-likelihood \(W\) spans the same subspace as the top \(k\) principal components of classical PCA (Pearson 1901). Probabilistic PCA is the linear, continuous-latent baseline; the VAE's job is to generalize the decoder from linear to nonlinear.
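Because the model is linear-Gaussian, integrating out \(z\) gives the closed-form marginal \(\mathcal{N}(\mu, W W^\top + \sigma^2 I)\), which can be evaluated directly. A small sketch with arbitrary illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 2
W = rng.standard_normal((d, k))   # arbitrary loading matrix
mu = np.zeros(d)
sigma2 = 0.1

# Integrating out z ~ N(0, I) leaves a Gaussian marginal:
# p(x) = N(mu, W W^T + sigma^2 I).
C = W @ W.T + sigma2 * np.eye(d)

def log_marginal(x):
    diff = x - mu
    _, logdet = np.linalg.slogdet(C)          # stable log|C|
    quad = diff @ np.linalg.solve(C, diff)    # diff^T C^{-1} diff without forming C^{-1}
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

x = rng.standard_normal(d)
print(log_marginal(x))
```

This closed form is exactly what the nonlinear decoder destroys: replace \(Wz + \mu\) with a neural network and the integral no longer collapses to a Gaussian.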

Sequential discrete latents: hidden Markov models

In an HMM the latent is a sequence \(z_{1:T}\) that evolves as a Markov chain, and observations \(x_{1:T}\) are emitted state-by-state. The integral over \(z_{1:T}\) has \(K^T\) terms, but the chain structure lets forward-backward compute it in \(O(T K^2)\). HMMs are the classical example of a latent-variable model where the marginal is intractable in the naive form but exact via dynamic programming.
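The forward recursion fits in a few lines and can be checked against brute-force enumeration of all \(K^T\) paths. The HMM parameters below are made up for illustration:

```python
import numpy as np
from itertools import product

# Toy HMM: K = 2 states, 3 discrete observation symbols (illustrative parameters).
pi0 = np.array([0.6, 0.4])                  # initial state distribution
A = np.array([[0.7, 0.3], [0.2, 0.8]])      # A[i, j] = p(z_t = j | z_{t-1} = i)
B = np.array([[0.5, 0.4, 0.1],              # B[i, o] = p(x_t = o | z_t = i)
              [0.1, 0.3, 0.6]])

def forward_marginal(obs):
    """p(x_{1:T}) via the forward algorithm: O(T K^2)."""
    alpha = pi0 * B[:, obs[0]]              # alpha_1(i) = p(x_1, z_1 = i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]       # sum over the previous state, emit
    return alpha.sum()

def brute_force(obs):
    """Enumerate all K^T latent paths -- exponential, for checking only."""
    total = 0.0
    for path in product(range(2), repeat=len(obs)):
        p = pi0[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

obs = [0, 2, 1, 1]
print(forward_marginal(obs), brute_force(obs))  # agree
```

The recursion works because the chain structure lets the sum over \(z_{t-1}\) be pushed inside the product, one step at a time.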

Continuous latents, nonlinear decoder: the VAE

The VAE (Kingma and Welling 2013; Rezende et al. 2014) uses \(z \sim \mathcal{N}(0, I)\) and a neural-network decoder \(p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 I)\) (or a Bernoulli decoder for binary images, etc.). The marginal is no longer tractable, and there is no dynamic-programming shortcut to save it — so training falls back on the ELBO, which bounds \(\log p_\theta(x)\) from below because \(\mathrm{KL} \geq 0\).
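A single-sample ELBO estimate can be sketched without any training machinery. Everything below is an illustrative stand-in: a fixed linear map plays the decoder network, and the encoder outputs \((m, \log s^2)\) are supplied by hand rather than by an inference network:

```python
import numpy as np

rng = np.random.default_rng(2)

d_x, d_z = 4, 2
W_dec = rng.standard_normal((d_x, d_z))   # stand-in for the decoder network
sigma_x = 0.5                             # fixed observation noise

def log_normal(x, mean, var):
    """Log density of N(mean, var * I) at x, summed over dimensions."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo_estimate(x, m, log_s2):
    """One-sample estimate of E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    s = np.exp(0.5 * log_s2)
    eps = rng.standard_normal(d_z)
    z = m + s * eps                                            # reparameterized z ~ q(z|x)
    log_lik = log_normal(x, W_dec @ z, sigma_x ** 2)           # reconstruction term (one sample)
    kl = 0.5 * np.sum(np.exp(log_s2) + m ** 2 - 1 - log_s2)    # KL(q || N(0, I)), closed form
    return log_lik - kl

x = rng.standard_normal(d_x)
print(elbo_estimate(x, m=np.zeros(d_z), log_s2=np.zeros(d_z)))
```

The reparameterized draw `z = m + s * eps` is what makes this estimate differentiable in \((m, \log s^2)\); the closed-form Gaussian KL is the usual choice when both \(q\) and the prior are Gaussian.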

A hierarchy of latents: diffusion

A diffusion model (Sohl-Dickstein et al. 2015; Ho et al. 2020) introduces an entire sequence of latents \(z_1, \ldots, z_T\) at increasing noise levels, with \(z_T \sim \mathcal{N}(0, I)\) and each \(z_{t-1}\) produced by a small denoising step from \(z_t\). The data is \(x = z_0\). This is still a latent-variable generative model — and its training objective is still derived from an ELBO — but the chain structure makes each step’s posterior close to tractable, which is what gives diffusion its training stability.
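The forward (noising) half of the chain has a closed form: with a variance schedule \(\beta_t\) and \(\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)\), one can jump straight to \(q(z_t \mid z_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\, z_0,\ (1 - \bar\alpha_t) I)\). A sketch using the linear schedule of Ho et al. (2020):

```python
import numpy as np

rng = np.random.default_rng(3)

T = 1000
beta = np.linspace(1e-4, 0.02, T)      # linear variance schedule
alpha_bar = np.cumprod(1.0 - beta)     # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def noise_to_level(z0, t):
    """Sample z_t ~ q(z_t | z_0) in one shot, no step-by-step simulation needed."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

z0 = rng.standard_normal(8) * 3.0 + 1.0   # some "data" vector
z_mid = noise_to_level(z0, T // 2)
print(alpha_bar[-1])                       # nearly 0: z_T is essentially pure N(0, I) noise
```

At \(t = T\), \(\bar\alpha_T \approx 0\), so \(z_T\) is essentially prior noise, matching the claim that \(z_T \sim \mathcal{N}(0, I)\); the one-shot sampling formula is also what makes the per-step denoising targets cheap to construct during training.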

Implicit decoder: the GAN

A GAN (Goodfellow et al. 2014) keeps the prior and the deterministic decoder \(x = G_\theta(z)\), but throws away the likelihood: the model defines a distribution over \(x\) only implicitly, through sampling. This sidesteps the marginal-likelihood integral entirely at the cost of giving up density evaluation.

The Two Hard Computations

Most of the technical machinery in this chapter exists to deal with two quantities:

The marginal likelihood \(p_\theta(x) = \int p_\theta(x \mid z) p(z)\, dz\). Needed for maximum-likelihood training. Tractable only when (a) \(\mathcal{Z}\) is finite (mixtures), (b) the integral has special structure (linear-Gaussian, sequential), or (c) the decoder is an invertible map (flows, where the change-of-variables formula replaces the integral with a Jacobian determinant).

The posterior \(p_\theta(z \mid x) = p_\theta(x \mid z) p(z) / p_\theta(x)\). Needed for inference — assigning latent codes to observed data — and for the E-step of EM. Inherits the same intractable denominator.

For nonlinear-decoder models on high-dimensional data, both are out of reach in closed form. The structure of the rest of the chapter follows from this fact.

Strategies for Training

How each model family makes peace with the intractable marginal:

Family                              Strategy
Mixtures, HMMs                      Exact: a small or structured \(\mathcal{Z}\) makes the integral computable.
Probabilistic PCA, factor analysis  Exact: linear-Gaussian structure gives a closed-form marginal.
VAE                                 Bound: maximize the ELBO instead of the marginal.
Diffusion                           Bound: a chained ELBO that simplifies to a denoising regression target.
Normalizing flows                   Exact: the invertible decoder converts the integral into a Jacobian determinant.
GAN                                 Avoid: train with a minimax game against a discriminator instead of by likelihood.

Reading down the column, the chapter’s other articles are the tools each strategy uses: KL divergence measures the gap from a variational approximation to the true posterior, the ELBO packages that measurement into a tractable objective, the reparameterization trick makes gradients through samples low-variance, and the VAE is the canonical model that puts them together.

What Comes Next

The rest of the chapter assumes this framework and develops the tools to train models within it.

  • KL divergence — the measure of distributional discrepancy that quantifies how far a variational posterior is from the true one.
  • Evidence lower bound — the tractable surrogate for the intractable marginal log-likelihood.
  • Variational autoencoder — the canonical instance: amortized inference with a Gaussian encoder, a neural decoder, and the ELBO as the objective.
  • Reparameterization trick — the gradient estimator that makes the VAE practical.

The next chapter, diffusion, is the same framework applied to a hierarchy of noise-level latents — and its training objective, despite looking like a denoising regression, is itself a weighted ELBO.

References

Baum, Leonard E., and Ted Petrie. 1966. “Statistical Inference for Probabilistic Functions of Finite State Markov Chains.” The Annals of Mathematical Statistics 37 (6): 1554–63. https://doi.org/10.1214/aoms/1177699147.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society Series B: Statistical Methodology 39 (1): 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, et al. 2014. “Generative Adversarial Nets.” Advances in Neural Information Processing Systems (NeurIPS), 2672–80. https://proceedings.neurips.cc/paper/2014/hash/f033ed80deb0234979a61f95710dbe25-Abstract.html.
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” Advances in Neural Information Processing Systems (NeurIPS), 6840–51. https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html.
Kingma, Diederik P., and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv Preprint arXiv:1312.6114. https://arxiv.org/abs/1312.6114.
Pearson, Karl. 1901. “LIII. On Lines and Planes of Closest Fit to Systems of Points in Space.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11): 559–72. https://doi.org/10.1080/14786440109462720.
Rezende, Danilo Jimenez, and Shakir Mohamed. 2015. “Variational Inference with Normalizing Flows.” International Conference on Machine Learning (ICML), 1530–38. https://proceedings.mlr.press/v37/rezende15.html.
Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014. “Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” International Conference on Machine Learning (ICML), 1278–86. https://proceedings.mlr.press/v32/rezende14.html.
Sohl-Dickstein, Jascha, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.” International Conference on Machine Learning (ICML), 2256–65. https://proceedings.mlr.press/v37/sohl-dickstein15.html.