The Bias-Variance Decomposition
Claim
Under squared-error loss, the expected prediction error of a learning algorithm at a fixed test point splits into three non-negative pieces — squared bias, variance, and irreducible noise (Geman et al. 1992; Hastie et al. 2009):
\[ \mathbb{E}_{\mathcal{D}, \varepsilon_0}\!\left[(y_0 - \hat f_\mathcal{D}(x_0))^2\right] = \underbrace{(\bar f(x_0) - f^*(x_0))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_\mathcal{D}\!\left[(\hat f_\mathcal{D}(x_0) - \bar f(x_0))^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}. \]
This is the identity that underlies the bias-variance trade-off between underfitting (bias-dominated error) and overfitting (variance-dominated error).
Setup
The target is generated with additive noise,
\[ y = f^*(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \mathrm{Var}(\varepsilon) = \sigma^2, \]
where \(f^*\) is the true regression function. A learning algorithm \(\mathcal{A}\) takes a training set \(\mathcal{D}\) of \(N\) i.i.d. samples and returns a predictor \(\hat f_\mathcal{D} = \mathcal{A}(\mathcal{D})\). Fix a test point \(x_0\) and observe \(y_0 = f^*(x_0) + \varepsilon_0\). Define the mean prediction across training-set draws,
\[ \bar f(x_0) = \mathbb{E}_\mathcal{D}\!\left[\hat f_\mathcal{D}(x_0)\right]. \]
Assumptions used. Only three, and they are mild:
- The loss is squared error. The clean additive three-way split is special to this loss.
- The test noise \(\varepsilon_0\) has mean zero, has variance \(\sigma^2\), and is independent of the training set \(\mathcal{D}\) (so \(\varepsilon_0\) is independent of \(\hat f_\mathcal{D}(x_0)\)).
- \(f^*(x_0)\) is a fixed (non-random) number, and \(\bar f(x_0)\) exists.
Nothing is assumed about the algorithm \(\mathcal{A}\), the hypothesis class, or the data distribution: the result is an algebraic identity, not an approximation.
Proof
Fix \(x_0\) and abbreviate \(\hat f = \hat f_\mathcal{D}(x_0)\), \(\bar f = \bar f(x_0)\), \(f^* = f^*(x_0)\), and \(y_0 = f^* + \varepsilon_0\). Two random sources are in play: the training set \(\mathcal{D}\) (which makes \(\hat f\) random) and the test noise \(\varepsilon_0\). The proof peels them off one at a time.
Step 1 — Separate the irreducible noise
Split the error into a model part and a noise part:
\[ y_0 - \hat f = (f^* - \hat f) + \varepsilon_0. \]
Square and expand:
\[ (y_0 - \hat f)^2 = (f^* - \hat f)^2 + 2\,\varepsilon_0\,(f^* - \hat f) + \varepsilon_0^2. \]
Take the expectation over both \(\mathcal{D}\) and \(\varepsilon_0\). In the cross term, \(\varepsilon_0\) is independent of \(\hat f\) and has mean zero, so
\[ \mathbb{E}\!\left[\varepsilon_0\,(f^* - \hat f)\right] = \mathbb{E}[\varepsilon_0]\;\mathbb{E}[f^* - \hat f] = 0, \]
and \(\mathbb{E}[\varepsilon_0^2] = \mathrm{Var}(\varepsilon_0) = \sigma^2\). Hence
\[ \mathbb{E}_{\mathcal{D}, \varepsilon_0}\!\left[(y_0 - \hat f)^2\right] = \mathbb{E}_\mathcal{D}\!\left[(\hat f - f^*)^2\right] + \sigma^2. \]
The \(\sigma^2\) term involves no model and cannot be reduced by any algorithm — it is the irreducible noise.
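Step 1 can be checked numerically. The sketch below uses purely illustrative choices not fixed by the claim: \(f^* = \sin\), \(\sigma = 0.3\), and a straight-line least-squares fit as the learning algorithm. It estimates both sides of the last display by Monte Carlo over training-set draws and fresh test noise:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, x0 = 0.3, 1.0        # noise std and fixed test point (illustrative)
N, trials = 30, 20000       # training-set size, number of training-set draws

preds = np.empty(trials)    # f_hat_D(x0) across draws of D
sq_err = np.empty(trials)   # (y0 - f_hat_D(x0))^2 with independent test noise
for t in range(trials):
    x = rng.uniform(0.0, np.pi, N)
    y = np.sin(x) + rng.normal(0.0, sigma, N)      # training targets
    coef = np.polyfit(x, y, 1)                     # the "algorithm": a line fit
    preds[t] = np.polyval(coef, x0)
    y0 = np.sin(x0) + rng.normal(0.0, sigma)       # test noise, independent of D
    sq_err[t] = (y0 - preds[t]) ** 2

lhs = sq_err.mean()                                # E[(y0 - f_hat)^2]
rhs = np.mean((preds - np.sin(x0)) ** 2) + sigma**2
print(lhs, rhs)   # the two sides agree up to Monte Carlo error
```

The gap between `lhs` and `rhs` shrinks as `trials` grows, and `lhs` never drops below \(\sigma^2 = 0.09\), the noise floor that Step 1 isolates.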
Step 2 — Split the reducible error into bias and variance
It remains to decompose the reducible error \(\mathbb{E}_\mathcal{D}[(\hat f - f^*)^2]\). Add and subtract the mean prediction \(\bar f\):
\[ \hat f - f^* = (\hat f - \bar f) + (\bar f - f^*). \]
Square and expand:
\[ (\hat f - f^*)^2 = (\hat f - \bar f)^2 + 2\,(\hat f - \bar f)(\bar f - f^*) + (\bar f - f^*)^2. \]
Take the expectation over \(\mathcal{D}\). The quantity \(\bar f - f^*\) is a constant (it does not depend on \(\mathcal{D}\)), and by the definition of \(\bar f\),
\[ \mathbb{E}_\mathcal{D}\!\left[\hat f - \bar f\right] = \bar f - \bar f = 0, \]
so the cross term vanishes. This leaves
\[ \mathbb{E}_\mathcal{D}\!\left[(\hat f - f^*)^2\right] = \underbrace{\mathbb{E}_\mathcal{D}\!\left[(\hat f - \bar f)^2\right]}_{\text{Variance}} + \underbrace{(\bar f - f^*)^2}_{\text{Bias}^2}. \]
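Step 2 is algebra, not statistics, and a sketch can make that visible: over any finite collection of training-set draws, the sample average of the squared reducible error equals the sample variance of the predictions plus the squared deviation of their sample mean, exactly, to floating-point precision. Reusing the same illustrative \(\sin\)/line-fit setup:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, x0 = 0.3, 1.0
N, trials = 30, 5000

preds = np.empty(trials)    # f_hat_D(x0) across training-set draws
for t in range(trials):
    x = rng.uniform(0.0, np.pi, N)
    y = np.sin(x) + rng.normal(0.0, sigma, N)
    preds[t] = np.polyval(np.polyfit(x, y, 1), x0)

reducible = np.mean((preds - np.sin(x0)) ** 2)     # E_D[(f_hat - f*)^2]
bias2 = (preds.mean() - np.sin(x0)) ** 2           # (f_bar - f*)^2
variance = preds.var()                             # ddof=0 matches the identity
gap = abs(reducible - (bias2 + variance))
print(gap)   # ~1e-17: the split is an exact identity, not an approximation
```

Any `gap` here is pure floating-point round-off, which is the point: no sampling error enters Step 2, only Step 1's noise term needs an expectation over \(\varepsilon_0\).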
Combine
Substituting Step 2 into Step 1 gives
\[ \mathbb{E}_{\mathcal{D}, \varepsilon_0}\!\left[(y_0 - \hat f)^2\right] = (\bar f - f^*)^2 + \mathbb{E}_\mathcal{D}\!\left[(\hat f - \bar f)^2\right] + \sigma^2, \]
which is the claimed decomposition. \(\square\)
Remarks
The estimator MSE identity, applied pointwise. For any estimator \(\hat\theta\) of a fixed quantity \(\theta\), the mean squared error obeys \(\mathbb{E}[(\hat\theta - \theta)^2] = (\mathbb{E}[\hat\theta] - \theta)^2 + \mathrm{Var}(\hat\theta)\) — exactly Step 2 with \(\hat\theta = \hat f\) and \(\theta = f^*\). The bias-variance decomposition is that classical identity evaluated at each input \(x_0\), plus the noise term from the stochastic target.
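A small numeric illustration of the estimator identity, with all numbers illustrative: shrinking the sample mean by a hypothetical factor \(c < 1\) trades variance for bias, and the Monte Carlo MSE matches both the sample-moment split and the analytic \(\text{bias}^2 + \text{variance}\):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n, c = 2.0, 1.0, 10, 0.8   # true value, noise, sample size, shrinkage
trials = 200_000

# Sampling distribution of the sample mean of n draws: N(theta, sigma^2 / n).
xbar = rng.normal(theta, sigma / np.sqrt(n), trials)
theta_hat = c * xbar                     # shrunk (biased, lower-variance) estimator

mse = np.mean((theta_hat - theta) ** 2)
bias2 = (theta_hat.mean() - theta) ** 2
var = theta_hat.var()
analytic = ((c - 1) * theta) ** 2 + (c * sigma) ** 2 / n   # closed form
print(mse, bias2 + var, analytic)
```

`mse` equals `bias2 + var` to floating-point precision (the identity again), and both approach the closed-form value \(0.224\) as `trials` grows.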
Averaging over test inputs. The decomposition above is pointwise at \(x_0\). Integrating over a test distribution \(x_0 \sim p_{\text{test}}\) gives the overall expected test error, with bias\(^2\) and variance replaced by their averages over \(x_0\) and the noise term replaced by \(\mathbb{E}_{x_0}[\sigma^2(x_0)]\) (just \(\sigma^2\) when the noise is homoscedastic). Each term stays non-negative, so the three-way split survives the average.
Why squared loss is special. The decomposition relies on the cross terms vanishing, which is a consequence of squared error being a quadratic whose mixed term is linear in the centered deviation \(\hat f - \bar f\). General losses — 0-1 loss, cross-entropy — admit bias-variance-style analyses, but they do not split into a clean, loss-valued sum of three independent non-negative pieces. The tidy additive form is a property of the squared-error geometry, not a universal law.
The decomposition is exact; the trade-off is empirical. Because the identity assumes nothing about \(\mathcal{A}\), it cannot by itself predict how bias and variance move as model capacity changes. The classical claim that bias falls and variance rises with complexity — and the modern double-descent refinement of it — are empirical observations layered on top of an identity that always holds. See bias-variance for that discussion.
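The empirical trade-off can be observed directly. The sketch below (same illustrative \(\sin\) target and noise level as before) sweeps polynomial degree and reports bias\(^2\) and variance averaged over a test grid; in this particular setup bias falls and variance rises with degree, but, as the remark stresses, the identity itself made no such promise:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.3
N, trials = 30, 2000
xs = np.linspace(0.2, np.pi - 0.2, 25)   # test grid inside the training range

results = {}
for deg in (0, 1, 3, 5):
    P = np.empty((trials, xs.size))      # predictions: one row per training draw
    for t in range(trials):
        x = rng.uniform(0.0, np.pi, N)
        y = np.sin(x) + rng.normal(0.0, sigma, N)
        P[t] = np.polyval(np.polyfit(x, y, deg), xs)
    bias2 = np.mean((P.mean(axis=0) - np.sin(xs)) ** 2)   # averaged over the grid
    var = np.mean(P.var(axis=0))
    results[deg] = (bias2, var)
    print(deg, round(bias2, 4), round(var, 4))
```

With these settings, bias\(^2\) drops steeply from degree 0 to degree 3 (a cubic already tracks \(\sin\) well on \([0, \pi]\)) while variance climbs roughly in proportion to the number of fitted coefficients.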