Normalizing Flows
Motivation
A normalizing flow (Rezende and Mohamed 2015) is a generative model defined by an invertible neural network: take a base distribution (typically standard Gaussian) over a latent \(z\), push it through a learned bijection \(f_\theta\), and read off the implied distribution over \(x = f_\theta(z)\). Because \(f_\theta\) is a bijection, the change-of-variables formula gives the exact density \(p_\theta(x)\) as a closed-form expression — no lower bound, no Monte Carlo. This is the distinctive feature that separates flows from VAEs (which give only an ELBO) and GANs (which give no density at all).
The price is architectural: \(f_\theta\) must be invertible and have a tractable Jacobian determinant, which restricts the network designs available.
The Change-of-Variables Identity
An invertible map from base noise to data
A normalizing flow starts with a simple density and warps space through invertible transformations while tracking the Jacobian determinant.
Let \(f_\theta : \mathbb{R}^d \to \mathbb{R}^d\) be a smooth bijection and \(z = f_\theta^{-1}(x)\). If \(z\) has density \(p(z)\), then \(x\) has density
\[ p_\theta(x) = p(z) \left| \det \frac{\partial f_\theta^{-1}}{\partial x}(x) \right| = p(f_\theta^{-1}(x)) \left| \det J_{f_\theta^{-1}}(x) \right|. \]
In log form,
\[ \log p_\theta(x) = \log p(f_\theta^{-1}(x)) + \log \left| \det J_{f_\theta^{-1}}(x) \right|. \]
Maximum-likelihood training is then direct: take a minibatch of \(x\), compute \(\log p_\theta(x)\) exactly, backpropagate. This is the simplest training story among modern deep generative models — closer to a logistic regression than to a VAE.
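A minimal sketch of that training step (PyTorch, with a toy learnable per-dimension affine map standing in for \(f_\theta\); any flow layer that exposes its inverse and log-determinant would slot in the same way):

```python
import torch

# Toy learnable bijection f_theta(z) = exp(log_a) * z + b (per dimension),
# chosen so the inverse and its log|det J| are available in closed form.
log_a = torch.zeros(2, requires_grad=True)   # log-scale
b = torch.zeros(2, requires_grad=True)       # shift

def log_prob(x):
    # Change of variables: log p(x) = log p(z) + log|det J_{f^-1}(x)|
    z = (x - b) * torch.exp(-log_a)          # z = f^{-1}(x)
    log_det = -log_a.sum()                   # log|det J_{f^-1}| = -sum(log_a)
    base = torch.distributions.Normal(0.0, 1.0)
    return base.log_prob(z).sum(dim=-1) + log_det

# One maximum-likelihood step: exact log-density, no bound, no estimator.
x = torch.randn(128, 2) * 3.0 + 1.0          # stand-in minibatch of data
loss = -log_prob(x).mean()
loss.backward()
```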
The Two Architectural Constraints
Invertibility. Each layer must be a bijection. Stacking \(L\) bijections gives a bijection, so \(f_\theta = f_L \circ \cdots \circ f_1\) inherits invertibility from its parts. The chain-rule version of the Jacobian determinant is
\[ \log |\det J_{f_\theta}| = \sum_\ell \log |\det J_{f_\ell}|. \]
Tractable Jacobian. Computing \(\det J\) for a generic dense Jacobian is \(O(d^3)\), which is prohibitive for image-sized \(d\). Flow architectures restrict each layer’s Jacobian to be triangular or block-triangular, where \(\det J = \prod_i J_{ii}\) is \(O(d)\).
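A quick numerical check of that shortcut (PyTorch; a random lower-triangular matrix stands in for a layer's Jacobian):

```python
import torch

d = 512
J = torch.randn(d, d).tril()          # stand-in for a lower-triangular Jacobian
J.diagonal().abs_().add_(0.1)         # keep the diagonal away from zero

# O(d): log|det J| as the sum of log|diagonal entries|
fast = J.diagonal().abs().log().sum()

# O(d^3): generic determinant of the same matrix
slow = torch.linalg.slogdet(J).logabsdet

print(fast.item(), slow.item())       # the two agree up to float error
```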
The two main families:
- Coupling layers (RealNVP (Dinh et al. 2017), Glow). Split \(z\) into halves \(z = (z_a, z_b)\); output \(x_a = z_a\) and \(x_b = z_b \odot \exp(s(z_a)) + t(z_a)\), where \(s\) and \(t\) are arbitrary neural networks of \(z_a\). The Jacobian is block-triangular with diagonal entries \(1\) on the identity block and \(\exp(s(z_a))\) on the transformed block, so \(\log|\det J| = \sum_i s_i(z_a)\). Composed with permutations of coordinates, coupling layers can mix all dimensions; see the sketch after this list.
- Autoregressive flows (MAF (Papamakarios et al. 2017), IAF (Kingma et al. 2016)). Each output dimension is an affine function of one input, conditioned on the preceding dimensions, so the Jacobian is triangular by construction. In IAF the conditioning is on the noise, \(x_i = z_i \exp(s_i(z_{<i})) + t_i(z_{<i})\); in MAF it is on the data, \(x_i = z_i \exp(s_i(x_{<i})) + t_i(x_{<i})\). MAF makes inference (likelihood evaluation) parallel and sampling sequential; IAF flips this trade-off.
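A minimal sketch of an affine coupling layer (PyTorch; the hidden size and the single MLP producing both \(s\) and \(t\) are illustrative choices, not RealNVP's exact configuration, and in practice the log-scale \(s\) is often bounded, e.g. with a tanh, for numerical stability):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """x_a = z_a;  x_b = z_b * exp(s(z_a)) + t(z_a);  log|det J| = sum_i s_i(z_a)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        # One MLP outputs both the log-scale s and the shift t.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, z):
        za, zb = z[:, :self.half], z[:, self.half:]
        s, t = self.net(za).chunk(2, dim=-1)
        xb = zb * torch.exp(s) + t
        return torch.cat([za, xb], dim=-1), s.sum(dim=-1)    # (x, log|det J_f|)

    def inverse(self, x):
        xa, xb = x[:, :self.half], x[:, self.half:]
        s, t = self.net(xa).chunk(2, dim=-1)
        zb = (xb - t) * torch.exp(-s)
        return torch.cat([xa, zb], dim=-1), -s.sum(dim=-1)   # (z, log|det J_{f^-1}|)
```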
Sampling vs. Likelihood
Both directions of the bijection are useful:
- Forward \(x = f_\theta(z)\) with \(z \sim p(z)\) generates a sample.
- Inverse \(z = f_\theta^{-1}(x)\) with the change-of-variables determinant evaluates the density.
Different flow architectures make different sides cheap. RealNVP is symmetric: both directions cost one pass through each coupling network. Autoregressive flows are asymmetric: one direction is parallel across dimensions, the other requires \(O(d)\) sequential steps. The choice depends on whether you train mostly by likelihood (need a cheap inverse) or generate mostly at inference (need a cheap forward pass).
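A sketch of the two directions through a stacked flow (reusing the AffineCoupling sketch above, with fixed random permutations between layers; all names are illustrative): sampling runs the layers forward, likelihood runs them in reverse and accumulates the per-layer log-determinants from the composition formula.

```python
import torch
import torch.nn as nn

class Flow(nn.Module):
    """Stack of coupling layers interleaved with fixed permutations."""
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList([AffineCoupling(dim) for _ in range(n_layers)])
        # Fixed random permutations mix dimensions between couplings (|det| = 1).
        self.perms = [torch.randperm(dim) for _ in range(n_layers)]
        self.base = torch.distributions.Normal(0.0, 1.0)

    def sample(self, n):
        # Forward direction: base noise z -> data x, one pass per layer.
        x = self.base.sample((n, self.dim))
        for layer, perm in zip(self.layers, self.perms):
            x = x[:, perm]
            x, _ = layer(x)
        return x

    def log_prob(self, x):
        # Inverse direction: data x -> noise z, summing per-layer log-dets.
        total = torch.zeros(x.shape[0])
        for layer, perm in zip(reversed(self.layers), reversed(self.perms)):
            x, log_det = layer.inverse(x)
            x = x[:, torch.argsort(perm)]   # undo the permutation
            total = total + log_det
        return self.base.log_prob(x).sum(dim=-1) + total
```

Training minimizes the negative of log_prob exactly as in the earlier sketch; generation only ever calls sample.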
Strengths and Weaknesses
Pros:
- Exact likelihood. Maximum-likelihood training without bounds or estimators.
- Exact inverse. Round-trip \(x \to z \to x\) is exact, so latent codes are usable for downstream tasks.
- Stable training. No adversarial dynamics, no posterior collapse.

Cons:
- Architectural restrictions. Invertibility constraints limit expressiveness compared to free-form decoders.
- Dimension preservation. \(\dim(z) = \dim(x)\) is forced by invertibility, so flows cannot have a low-dimensional latent space.
- Sample quality. Trails diffusion on image generation benchmarks, often by a wide margin.
Where Flows Sit Now
Flows have largely been overtaken by diffusion for unconditional image generation, but they remain important in several niches:
- Density estimation for tabular data and scientific applications where exact likelihoods matter.
- Variational inference, where a flow-based \(q_\phi(z \mid x)\) gives a tighter ELBO than a Gaussian encoder.
- Continuous normalizing flows (Neural ODEs, FFJORD) generalize the discrete-layer construction to continuous time. The same change-of-variables logic applies via the instantaneous trace formula (written out below). Flow matching and rectified flows (2022–2024) revive this direction and produce samples competitive with diffusion at much lower sampling cost.
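For reference, the instantaneous trace formula: with continuous-time dynamics \(\frac{dz(t)}{dt} = f_\theta(z(t), t)\), the log-density evolves as
\[ \frac{\partial \log p(z(t))}{\partial t} = -\operatorname{tr}\!\left( \frac{\partial f_\theta}{\partial z(t)} \right), \]
so the per-layer sum of log-determinants becomes an integral of a trace along the trajectory; FFJORD keeps this linear in \(d\) by estimating the trace with a Hutchinson estimator.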
The flow-matching revival is the main reason flows are still discussed: they retain the exact-likelihood story while bridging the architectural gap to diffusion-style generation.