Kullback-Leibler Divergence
Motivation
The Kullback-Leibler (KL) divergence measures how one probability distribution \(q\) differs from another distribution \(p\) (Kullback and Leibler 1951; Cover and Thomas 2005). It appears as the regularizer in variational autoencoders, as a term in the evidence lower bound decomposition used by EM, as the population-level objective for maximum-likelihood estimation, and as the trust-region penalty in policy-optimization methods like PPO.
Despite the name and the suggestive notation, KL divergence is not a metric: it is asymmetric, \(\mathrm{KL}(q \,\|\, p) \neq \mathrm{KL}(p \,\|\, q)\) in general, and it does not satisfy the triangle inequality. It is, however, non-negative and zero only when \(q = p\), which is enough to use it as a “distance” in optimization.
Definition
The asymmetric gap between two distributions
KL divergence averages the log gap under the reference distribution. Swapping the reference changes which regions matter most.
For discrete distributions \(p, q\) on a set \(\mathcal{X}\),
\[ \mathrm{KL}(q \,\|\, p) = \sum_{x \in \mathcal{X}} q(x) \log \frac{q(x)}{p(x)}. \]
For continuous distributions with densities,
\[ \mathrm{KL}(q \,\|\, p) = \int q(x) \log \frac{q(x)}{p(x)} \, dx. \]
Equivalently, \(\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{x \sim q}\!\left[\log q(x) - \log p(x)\right]\). The convention \(0 \log 0 = 0\) is standard. If \(q(x) > 0\) at some \(x\) where \(p(x) = 0\), the divergence is \(+\infty\); for the divergence to be finite, \(q\) must be absolutely continuous with respect to \(p\).
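Translating the discrete definition directly into code makes both conventions explicit. A minimal Python sketch (the function name and argument order are our own, not taken from any library):

```python
import math

def kl_divergence(q, p):
    """Compute KL(q || p) for discrete distributions given as probability lists."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0.0:
            continue          # 0 log 0 = 0 convention
        if px == 0.0:
            return math.inf   # q has mass where p has none
        total += qx * math.log(qx / px)
    return total
```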
Worked example: two biased coins
Let \(q\) be a coin that lands heads with probability \(0.8\), and let \(p\) be a coin that lands heads with probability \(0.5\). Using natural logarithms,
\[ \begin{aligned} \mathrm{KL}(q \,\|\, p) &= 0.8 \log \frac{0.8}{0.5} + 0.2 \log \frac{0.2}{0.5} \\ &= 0.8 \log 1.6 + 0.2 \log 0.4 \\ &\approx 0.8(0.470) + 0.2(-0.916) \\ &\approx 0.193 \text{ nats}. \end{aligned} \]
Reversing the arguments gives a different number:
\[ \mathrm{KL}(p \,\|\, q) = 0.5 \log \frac{0.5}{0.8} + 0.5 \log \frac{0.5}{0.2} \approx 0.223 \text{ nats}. \]
The two divergences are close here, but not equal. The first average is weighted by outcomes likely under \(q\); the second is weighted by outcomes likely under \(p\).
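The computation is easy to reproduce numerically; a quick check in Python:

```python
import math

q, p = [0.8, 0.2], [0.5, 0.5]

kl_qp = sum(qx * math.log(qx / px) for qx, px in zip(q, p))
kl_pq = sum(px * math.log(px / qx) for qx, px in zip(q, p))

print(f"KL(q || p) = {kl_qp:.3f} nats")  # 0.193
print(f"KL(p || q) = {kl_pq:.3f} nats")  # 0.223
```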
Basic Properties
Non-negativity (Gibbs’ inequality). \(\mathrm{KL}(q \,\|\, p) \geq 0\), with equality iff \(q = p\) (almost everywhere). This follows from Jensen’s inequality applied to the convex function \(-\log\):
\[ -\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\!\left[\log \frac{p(x)}{q(x)}\right] \leq \log \mathbb{E}_q\!\left[\frac{p(x)}{q(x)}\right] = \log 1 = 0. \]
Asymmetry. \(\mathrm{KL}(q \,\|\, p)\) penalizes \(q\) for placing mass where \(p\) has none (“zero-forcing”); \(\mathrm{KL}(p \,\|\, q)\) penalizes \(q\) for missing mass that \(p\) has (“zero-avoiding”). Variational inference typically minimizes \(\mathrm{KL}(q \,\|\, p)\) for tractability, which is why approximate posteriors tend to be mode-seeking.
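A small three-outcome example (toy numbers of our own choosing) makes the two failure modes concrete:

```python
import math

def kl(q, p):
    # Discrete KL(q || p) with the 0 log 0 = 0 convention.
    return sum(
        qx * math.log(qx / px) if px > 0 else math.inf
        for qx, px in zip(q, p)
        if qx > 0
    )

p = [0.5, 0.5, 0.0]  # reference with a hard zero on the third outcome
q = [0.4, 0.3, 0.3]  # approximation that puts mass on p's zero

print(kl(q, p))  # inf:    q placed mass where p has none
print(kl(p, q))  # ~0.367: finite, since q covers everywhere p has mass
```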
Relation to cross-entropy. Defining the entropy \(H(q) = -\mathbb{E}_q[\log q(x)]\) and cross-entropy \(H(q, p) = -\mathbb{E}_q[\log p(x)]\),
\[ \mathrm{KL}(q \,\|\, p) = H(q, p) - H(q). \]
Minimizing cross-entropy with respect to \(p\) — for fixed empirical distribution \(q\) — is therefore equivalent to minimizing KL divergence, which is in turn equivalent to maximum likelihood. This is why the standard classification loss is called “cross-entropy.”
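The identity is easy to confirm numerically (toy distributions of our own choosing):

```python
import math

q = [0.7, 0.2, 0.1]  # fixed "data" distribution
p = [0.5, 0.3, 0.2]  # model distribution

entropy_q = -sum(qx * math.log(qx) for qx in q)                   # H(q)
cross_ent = -sum(qx * math.log(px) for qx, px in zip(q, p))       # H(q, p)
kl        =  sum(qx * math.log(qx / px) for qx, px in zip(q, p))  # KL(q || p)

assert abs(kl - (cross_ent - entropy_q)) < 1e-12
```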
Closed Forms
Two cases come up constantly.
Two Gaussians in \(\mathbb{R}^d\). For \(q = \mathcal{N}(\mu_q, \Sigma_q)\) and \(p = \mathcal{N}(\mu_p, \Sigma_p)\),
\[ \mathrm{KL}(q \,\|\, p) = \tfrac{1}{2}\!\left[\operatorname{tr}(\Sigma_p^{-1} \Sigma_q) + (\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q) - d + \log \frac{\det \Sigma_p}{\det \Sigma_q}\right]. \]
For diagonal \(\Sigma_q = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_d^2)\) and \(p = \mathcal{N}(0, I)\),
\[ \mathrm{KL}(q \,\|\, p) = \tfrac{1}{2} \sum_{i=1}^d \left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right). \]
This is the closed form used in the VAE objective.
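A sketch of this term as it might appear in a VAE loss, checked against a Monte Carlo estimate of the definition (variable names are our own; assumes NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
mu = rng.normal(size=d)       # stand-ins for encoder means
log_var = rng.normal(size=d)  # stand-ins for encoder log-variances

# Closed form: KL( N(mu, diag(sigma^2)) || N(0, I) )
kl_closed = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

# Monte Carlo check of E_{z~q}[ log q(z) - log p(z) ]
z = mu + np.exp(0.5 * log_var) * rng.normal(size=(200_000, d))
log_q = -0.5 * np.sum((z - mu) ** 2 / np.exp(log_var) + log_var + np.log(2 * np.pi), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)

print(kl_closed, np.mean(log_q - log_p))  # agree up to Monte Carlo error
```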
Two categorical distributions. For \(q, p\) over \(K\) classes,
\[ \mathrm{KL}(q \,\|\, p) = \sum_{k=1}^K q_k \log \frac{q_k}{p_k}. \]
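In practice SciPy provides this directly, with the \(0 \log 0\) convention handled:

```python
import numpy as np
from scipy.special import rel_entr  # elementwise q * log(q / p)
from scipy.stats import entropy

q = np.array([0.8, 0.2])
p = np.array([0.5, 0.5])

print(rel_entr(q, p).sum())  # ≈ 0.193 nats, matching the coin example
print(entropy(q, p))         # same value: entropy(pk, qk) computes KL(pk || qk)
```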
Where It Appears
- Variational inference and the ELBO. The gap between the log marginal likelihood and the ELBO is exactly \(\mathrm{KL}(q(z) \,\|\, p(z \mid x))\), so maximizing the ELBO minimizes the KL.
- Variational autoencoders. The encoder is regularized by \(\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))\).
- Diffusion models. The training objective decomposes into per-timestep KL terms between Gaussian forward and reverse transitions.
- Maximum-likelihood estimation. Minimizing \(\mathrm{KL}(\hat p_{\text{data}} \,\|\, p_\theta)\) is equivalent to maximum likelihood.
- Policy optimization (PPO, TRPO). Constraining \(\mathrm{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)\) keeps policy updates within a trust region.
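In these methods the KL is usually estimated from sampled actions rather than computed in closed form. A minimal sketch of two standard Monte Carlo estimators, assuming the per-action log-probabilities that any PPO-style implementation already has at hand (function and variable names are our own):

```python
import numpy as np

def kl_estimates(logp_old, logp_new):
    """Estimate KL(pi_old || pi_new) from actions sampled under pi_old.

    Both estimators are unbiased; the second is nonnegative term-by-term
    (since log r <= r - 1) and typically has lower variance.
    """
    log_ratio = logp_new - logp_old                     # log r, r = pi_new / pi_old
    naive = -log_ratio.mean()                           # E[log(pi_old / pi_new)]
    low_var = (np.expm1(log_ratio) - log_ratio).mean()  # E[(r - 1) - log r]
    return naive, low_var
```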