KL Divergence Is Non-Negative (Gibbs’ Inequality)
Claim
For any two probability distributions \(q\) and \(p\) on a common space \(\mathcal{X}\),
\[ \mathrm{KL}(q \,\|\, p) \geq 0, \]
with equality if and only if \(q = p\) (almost everywhere) (Cover and Thomas 2005; Kullback and Leibler 1951). This fact is known as Gibbs’ inequality, and it is what lets the KL divergence act as a “distance” in optimization even though it is not a metric.
Setup
Let \(q\) and \(p\) be probability distributions on \(\mathcal{X}\), either discrete (probability mass functions) or continuous (densities with respect to a common base measure). The divergence is
\[ \mathrm{KL}(q \,\|\, p) = \sum_{x \in \mathcal{X}} q(x) \log \frac{q(x)}{p(x)} \qquad\text{or}\qquad \int_{\mathcal{X}} q(x) \log \frac{q(x)}{p(x)} \, dx, \]
with the standard convention \(0 \log 0 = 0\).
Write \(\mathcal{S} = \{x : q(x) > 0\}\) for the support of \(q\). Two cases:
- Singular case. If \(q(x) > 0\) at some \(x\) where \(p(x) = 0\), then \(\mathrm{KL}(q \,\|\, p) = +\infty\) and the inequality holds trivially.
- Absolutely continuous case. Otherwise \(p(x) > 0\) for every \(x \in \mathcal{S}\), written \(q \ll p\). The argument below treats this case; the sum then ranges effectively over \(\mathcal{S}\), where every ratio \(p(x)/q(x)\) is finite and positive.
The argument is stated for the discrete case; replacing sums by integrals leaves every step unchanged.
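To make the definition and its conventions concrete, here is a minimal Python sketch of the discrete case. It drops terms with \(q(x) = 0\) (the \(0 \log 0 = 0\) convention) and returns \(+\infty\) in the singular case. The function name `kl_divergence` and the use of NumPy are our choices for illustration, not anything the proofs depend on.

```python
import numpy as np

def kl_divergence(q, p):
    """Discrete KL(q || p) with the conventions above: 0*log 0 = 0,
    and +inf in the singular case where q puts mass where p has none."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    support = q > 0                      # S = {x : q(x) > 0}
    if np.any(p[support] == 0):          # q is not absolutely continuous w.r.t. p
        return np.inf
    # Summing only over the support implements the 0*log 0 = 0 convention.
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

print(kl_divergence([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))  # log 2 ~ 0.693
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))              # inf (singular case)
print(kl_divergence([0.3, 0.7], [0.3, 0.7]))              # 0.0 (q = p)
```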
Proof via Jensen’s Inequality
The function \(\log\) is concave, so Jensen’s inequality gives \(\mathbb{E}[\log Y] \leq \log \mathbb{E}[Y]\) for any positive random variable \(Y\). Apply it to \(Y = p(X)/q(X)\) with \(X \sim q\):
\[ -\mathrm{KL}(q \,\|\, p) = \sum_{x \in \mathcal{S}} q(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim q}\!\left[\log \frac{p(x)}{q(x)}\right] \leq \log \, \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\right]. \]
The expectation inside the log is a sum of probabilities,
\[ \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\right] = \sum_{x \in \mathcal{S}} q(x) \cdot \frac{p(x)}{q(x)} = \sum_{x \in \mathcal{S}} p(x) \leq \sum_{x \in \mathcal{X}} p(x) = 1. \]
Therefore \(-\mathrm{KL}(q \,\|\, p) \leq \log 1 = 0\), i.e. \(\mathrm{KL}(q \,\|\, p) \geq 0\). \(\square\)
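The whole proof hinges on the single fact that \(\mathbb{E}_{x \sim q}[p(x)/q(x)]\) is at most \(1\). A quick numerical check (our own sketch, with full-support pairs drawn at random so the ratio is always defined) confirms both that expectation and the resulting sign of the divergence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random full-support pairs (q, p): the expectation the proof bounds,
# E_{x~q}[p(x)/q(x)], equals sum_x p(x) = 1 here, and KL(q || p) >= 0.
for _ in range(5):
    q = rng.random(6); q /= q.sum()
    p = rng.random(6); p /= p.sum()
    expectation = np.sum(q * (p / q))    # collapses to the sum of p
    kl = np.sum(q * np.log(q / p))
    print(f"E_q[p/q] = {expectation:.6f}   KL = {kl:.6f}")
```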
Proof via the Tangent-Line Inequality
A second proof avoids Jensen and makes the equality condition transparent. Because \(\log\) is concave, its graph lies below its tangent line at \(t = 1\), which gives the bound
\[ \log t \leq t - 1 \qquad \text{for all } t > 0, \]
with equality only at \(t = 1\). Apply it to \(t = p(x)/q(x)\) for each \(x \in \mathcal{S}\):
\[ -\mathrm{KL}(q \,\|\, p) = \sum_{x \in \mathcal{S}} q(x) \log \frac{p(x)}{q(x)} \leq \sum_{x \in \mathcal{S}} q(x)\!\left(\frac{p(x)}{q(x)} - 1\right) = \sum_{x \in \mathcal{S}} p(x) - \sum_{x \in \mathcal{S}} q(x). \]
Since \(q\) is a distribution supported on \(\mathcal{S}\), the second sum is \(1\), and the first is at most \(1\), so the whole expression is \(\leq 0\). Again \(\mathrm{KL}(q \,\|\, p) \geq 0\). \(\square\)
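Both steps of this proof can be checked term by term: the tangent-line bound holds at every ratio, and the resulting upper bound on \(-\mathrm{KL}\) collapses to \(0\) when \(p\) has full support. A short numerical sketch of our own, in Python:

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.random(5); q /= q.sum()
p = rng.random(5); p /= p.sum()

t = p / q                                  # the ratios fed to the bound
print(np.all(np.log(t) <= t - 1))          # tangent-line bound, termwise: True
# The chain of the proof: -KL <= sum_x q(x)(t - 1) = sum p - sum q,
# which is 0 here (up to floating-point rounding) since p has full support.
print(np.sum(q * np.log(t)), "<=", np.sum(q * (t - 1)))
```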
Equality Condition
Equality \(\mathrm{KL}(q \,\|\, p) = 0\) forces both inequalities in either proof to be tight.
- Pointwise tightness. The tangent-line bound \(\log t \leq t - 1\) is an equality only at \(t = 1\), so \(p(x)/q(x) = 1\) for every \(x \in \mathcal{S}\) — that is, \(p(x) = q(x)\) on the support of \(q\). (Equivalently, in the Jensen proof, strict concavity of \(\log\) makes the ratio \(p(X)/q(X)\) constant \(q\)-almost-everywhere.)
- No leaked mass. The bound \(\sum_{x \in \mathcal{S}} p(x) \leq 1\) is an equality only when \(p\) places all its mass on \(\mathcal{S}\), so \(p(x) = 0 = q(x)\) for every \(x \notin \mathcal{S}\).
Together these give \(p(x) = q(x)\) for all \(x\). Conversely, if \(p = q\) then every term is \(q(x) \log 1 = 0\), so \(\mathrm{KL}(q \,\|\, p) = 0\). Hence equality holds iff \(q = p\) almost everywhere. \(\square\)
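The equality condition is easy to see numerically: the divergence vanishes at \(p = q\) and becomes strictly positive under any perturbation that keeps \(p\) a valid distribution. A small sketch, using a full-support helper `kl` of our own:

```python
import numpy as np

def kl(q, p):
    # Full shared support assumed, so no singular-case handling is needed.
    return np.sum(q * np.log(q / p))

q = np.array([0.2, 0.3, 0.5])
print(kl(q, q))                           # 0.0: equality exactly at p = q

# Any valid perturbation away from q makes the divergence strictly positive.
p = q + np.array([0.05, -0.05, 0.0])      # still sums to 1, still positive
print(kl(q, p))                           # ~0.0101 > 0
```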
Why This Matters
Non-negativity is the single property that turns a log-ratio average into a usable objective. A few claims across the site rest directly on it:
- The ELBO is a lower bound. The gap between the log evidence and the evidence lower bound is exactly \(\mathrm{KL}(q(z) \,\|\, p_\theta(z \mid x)) \geq 0\), so maximizing the ELBO can never overshoot \(\log p_\theta(x)\) (proof).
- EM never decreases the likelihood. The expectation-maximization monotonicity argument splits the change in log-likelihood into a guaranteed-nonnegative improvement from the M-step plus a KL term that is itself \(\geq 0\) (proof).
- Cross-entropy bounds entropy. Since \(\mathrm{KL}(q \,\|\, p) = H(q, p) - H(q)\), non-negativity says the cross-entropy \(H(q, p)\) is minimized — and equals the entropy \(H(q)\) — exactly when \(p = q\). This is why minimizing cross-entropy is a sound training objective (proof).
- Mutual information is non-negative. Writing \(I(X; Y) = \mathrm{KL}(p_{X,Y} \,\|\, p_X p_Y)\) shows \(I(X; Y) \geq 0\), with equality iff \(X\) and \(Y\) are independent.
In each case the direction of the bound — and the fact that the bound is tight only at the intended optimum — is precisely the equality condition proved above.
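As a closing sanity check (again our own sketch, not part of the cited proofs), the cross-entropy identity from the third bullet and the mutual-information identity from the fourth can both be verified on small hand-picked distributions:

```python
import numpy as np

q = np.array([0.1, 0.6, 0.3])
p = np.array([0.3, 0.4, 0.3])

# Cross-entropy identity: KL(q || p) = H(q, p) - H(q), so H(q, p) >= H(q).
kl = np.sum(q * np.log(q / p))
cross_entropy = -np.sum(q * np.log(p))
entropy = -np.sum(q * np.log(q))
print(np.isclose(kl, cross_entropy - entropy))    # True

# Mutual information as a divergence: I(X;Y) = KL(p_XY || p_X p_Y) >= 0.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
print(mi)                                          # ~0.126 > 0
```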