Kullback-Leibler Divergence

Motivation

The Kullback-Leibler (KL) divergence measures how one probability distribution \(q\) differs from another distribution \(p\) (Kullback and Leibler 1951; Cover and Thomas 2005). It appears as the regularizer in variational autoencoders, as a step in the evidence lower bound decomposition used by EM, as the population-level objective for maximum-likelihood estimation, and as the trust-region penalty in policy-optimization methods like PPO.

Despite the name and the suggestive notation, KL divergence is not a metric: it is asymmetric, \(\mathrm{KL}(q \,\|\, p) \neq \mathrm{KL}(p \,\|\, q)\) in general, and it does not satisfy the triangle inequality. It is, however, non-negative and zero only when \(q = p\), which is enough to use it as a “distance” in optimization.

Definition

[Figure: the asymmetric gap between two distributions. Curves for \(p(x)\) and \(q(x)\) with the weighted log gap between them; \(\mathrm{KL}(p \,\|\, q)\) weights disagreement where \(p\) puts mass. KL divergence averages the log gap under the reference distribution, so swapping the reference changes which regions matter most.]

For discrete distributions \(p, q\) on a set \(\mathcal{X}\),

\[ \mathrm{KL}(q \,\|\, p) = \sum_{x \in \mathcal{X}} q(x) \log \frac{q(x)}{p(x)}. \]

For continuous distributions with densities,

\[ \mathrm{KL}(q \,\|\, p) = \int q(x) \log \frac{q(x)}{p(x)} \, dx. \]

Equivalently, \(\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{x \sim q}\!\left[\log q(x) - \log p(x)\right]\). The convention \(0 \log 0 = 0\) is standard. If \(q(x) > 0\) at some \(x\) where \(p(x) = 0\), the divergence is \(+\infty\); finiteness therefore requires \(q\) to be absolutely continuous with respect to \(p\).
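
For reference, here is a minimal NumPy sketch of the discrete definition, with the \(0 \log 0 = 0\) convention and the absolute-continuity check made explicit (the function name kl_divergence is just illustrative):

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as probability arrays."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    if np.any((q > 0) & (p == 0)):
        return np.inf          # q puts mass where p has none
    mask = q > 0               # 0 log 0 = 0: zero-probability terms drop out
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))
```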

Worked example: two biased coins

Let \(q\) be a coin that lands heads with probability \(0.8\), and let \(p\) be a coin that lands heads with probability \(0.5\). Using natural logarithms,

\[ \begin{aligned} \mathrm{KL}(q \,\|\, p) &= 0.8 \log \frac{0.8}{0.5} + 0.2 \log \frac{0.2}{0.5} \\ &= 0.8 \log 1.6 + 0.2 \log 0.4 \\ &\approx 0.8(0.470) + 0.2(-0.916) \\ &\approx 0.193 \text{ nats}. \end{aligned} \]

Reversing the arguments gives a different number:

\[ \mathrm{KL}(p \,\|\, q) = 0.5 \log \frac{0.5}{0.8} + 0.5 \log \frac{0.5}{0.2} \approx 0.223 \text{ nats}. \]

The two divergences are close here, but not equal. The first average is weighted by outcomes likely under \(q\); the second is weighted by outcomes likely under \(p\).
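
Both numbers are easy to verify numerically, for instance with scipy.stats.entropy, which computes the KL divergence (in nats, by default) when given two distributions:

```python
from scipy.stats import entropy  # entropy(q, p) computes KL(q || p)

q = [0.8, 0.2]  # heads, tails under q
p = [0.5, 0.5]  # heads, tails under p

print(entropy(q, p))  # ~0.193 nats
print(entropy(p, q))  # ~0.223 nats: reversing the arguments changes the value
```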

Basic Properties

Non-negativity (Gibbs’ inequality). \(\mathrm{KL}(q \,\|\, p) \geq 0\), with equality iff \(q = p\) (almost everywhere). This follows from Jensen’s inequality applied to the convex function \(-\log\):

\[ -\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\!\left[\log \frac{p(x)}{q(x)}\right] \leq \log \mathbb{E}_q\!\left[\frac{p(x)}{q(x)}\right] = \log 1 = 0. \]

Asymmetry. \(\mathrm{KL}(q \,\|\, p)\) penalizes \(q\) for placing mass where \(p\) has none (“zero-forcing”); \(\mathrm{KL}(p \,\|\, q)\) penalizes \(q\) for missing mass that \(p\) has (“zero-avoiding”). Variational inference typically minimizes \(\mathrm{KL}(q \,\|\, p)\) for tractability, which is why approximate posteriors tend to be mode-seeking.
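
A three-outcome toy example (numbers chosen here purely for illustration) makes the two behaviors concrete: let \(p\) spread mass over all three outcomes and let \(q\) drop one of them.

```python
import numpy as np

def kl(a, b):
    mask = a > 0
    if np.any(mask & (b == 0)):
        return np.inf
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.5, 0.5, 0.0])  # q drops the third outcome entirely

print(kl(q, p))  # ~0.223 nats: finite, q may ignore mass that p covers
print(kl(p, q))  # inf: q is punished for missing mass that p has
```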

Relation to cross-entropy. Defining the entropy \(H(q) = -\mathbb{E}_q[\log q(x)]\) and cross-entropy \(H(q, p) = -\mathbb{E}_q[\log p(x)]\),

\[ \mathrm{KL}(q \,\|\, p) = H(q, p) - H(q). \]

Minimizing cross-entropy with respect to \(p\) — for fixed empirical distribution \(q\) — is therefore equivalent to minimizing KL divergence, which is in turn equivalent to maximum likelihood. This is why the standard classification loss is called “cross-entropy.”
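
The identity can be checked directly on arbitrary distributions (the numbers below are illustrative only):

```python
import numpy as np

q = np.array([0.7, 0.2, 0.1])  # fixed "data" distribution
p = np.array([0.5, 0.3, 0.2])  # model distribution

H_q  = -np.sum(q * np.log(q))       # entropy H(q)
H_qp = -np.sum(q * np.log(p))       # cross-entropy H(q, p)
kl   =  np.sum(q * np.log(q / p))   # KL(q || p)

print(np.isclose(kl, H_qp - H_q))   # True
```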

Closed Forms

Two cases come up constantly.

Two Gaussians in \(\mathbb{R}^d\). For \(q = \mathcal{N}(\mu_q, \Sigma_q)\) and \(p = \mathcal{N}(\mu_p, \Sigma_p)\),

\[ \mathrm{KL}(q \,\|\, p) = \tfrac{1}{2}\!\left[\operatorname{tr}(\Sigma_p^{-1} \Sigma_q) + (\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q) - d + \log \frac{\det \Sigma_p}{\det \Sigma_q}\right]. \]
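
Transcribed directly into NumPy (a sketch; gaussian_kl is an illustrative name):

```python
import numpy as np

def gaussian_kl(mu_q, cov_q, mu_p, cov_p):
    """KL(N(mu_q, cov_q) || N(mu_p, cov_p)) for full covariance matrices."""
    d = mu_q.shape[0]
    cov_p_inv = np.linalg.inv(cov_p)
    diff = mu_p - mu_q
    return 0.5 * (np.trace(cov_p_inv @ cov_q)
                  + diff @ cov_p_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov_p) / np.linalg.det(cov_q)))
```

A production implementation would work with Cholesky factors or np.linalg.slogdet rather than inv and det for numerical stability; the direct form above simply mirrors the equation.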

For diagonal \(\Sigma_q = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_d^2)\) and \(p = \mathcal{N}(0, I)\),

\[ \mathrm{KL}(q \,\|\, p) = \tfrac{1}{2} \sum_{i=1}^d \left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right). \]

This is the closed form used in the VAE objective.
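
In VAE implementations this term is typically computed from the encoder’s mean and log-variance; a minimal sketch (the mu/logvar parameterization is the common convention, not part of the formula itself):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0)

# Sanity check: q = p = N(0, I) gives zero divergence.
print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))  # 0.0
```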

Two categorical distributions. For \(q, p\) over \(K\) classes,

\[ \mathrm{KL}(q \,\|\, p) = \sum_{k=1}^K q_k \log \frac{q_k}{p_k}. \]

Where It Appears

  • Variational inference and the ELBO. The gap between the log marginal likelihood and the ELBO is exactly \(\mathrm{KL}(q(z) \,\|\, p(z \mid x))\), so maximizing the ELBO minimizes the KL.
  • Variational autoencoders. The encoder is regularized by \(\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))\).
  • Diffusion models. The training objective decomposes into per-timestep KL terms between Gaussian forward and reverse transitions.
  • Maximum-likelihood estimation. Minimizing \(\mathrm{KL}(\hat p_{\text{data}} \,\|\, p_\theta)\) is equivalent to maximum likelihood.
  • Policy optimization (PPO, TRPO). Constraining \(\mathrm{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)\) keeps policy updates within a trust region; a sketch of this penalty follows the list.
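
To make the last bullet concrete, here is a schematic of the averaged categorical KL that trust-region methods penalize or constrain (a sketch under my own naming, not any particular library’s API):

```python
import numpy as np

def mean_policy_kl(probs_old, probs_new, eps=1e-12):
    """Mean over states of KL(pi_old(.|s) || pi_new(.|s)) for categorical policies.

    probs_old, probs_new: arrays of shape (num_states, num_actions) whose
    rows sum to one. eps guards the logarithm against exact zeros.
    """
    log_ratio = np.log((probs_old + eps) / (probs_new + eps))
    return np.mean(np.sum(probs_old * log_ratio, axis=1))
```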

References

Cover, Thomas M., and Joy A. Thomas. 2005. Elements of Information Theory. Wiley. https://doi.org/10.1002/047174882x.
Kullback, S., and R. A. Leibler. 1951. “On Information and Sufficiency.” The Annals of Mathematical Statistics 22 (1): 79–86. https://doi.org/10.1214/aoms/1177729694.