The Bias-Variance Decomposition
Claim
Under squared-error loss, the expected prediction error of a learning algorithm at a fixed test point splits into three non-negative pieces — squared bias, variance, and irreducible noise (Geman et al. 1992; Hastie et al. 2009):
\[ \mathbb{E}_{\mathcal{D}, \varepsilon_0}\!\left[(y_0 - \hat f_\mathcal{D}(x_0))^2\right] = \underbrace{(\bar f(x_0) - f^*(x_0))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_\mathcal{D}\!\left[(\hat f_\mathcal{D}(x_0) - \bar f(x_0))^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}. \]
This is the identity that underlies the bias-variance trade-off between underfitting (bias-dominated error) and overfitting (variance-dominated error).
Setup
The target is generated with additive noise,
\[ y = f^*(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \mathrm{Var}(\varepsilon) = \sigma^2, \]
where \(f^*\) is the true regression function. A learning algorithm \(\mathcal{A}\) takes a training set \(\mathcal{D}\) of \(N\) i.i.d. samples and returns a predictor \(\hat f_\mathcal{D} = \mathcal{A}(\mathcal{D})\). Fix a test point \(x_0\) and observe \(y_0 = f^*(x_0) + \varepsilon_0\). Define the mean prediction across training-set draws,
\[ \bar f(x_0) = \mathbb{E}_\mathcal{D}\!\left[\hat f_\mathcal{D}(x_0)\right]. \]
Assumptions used. Only three, and they are mild:
- The loss is squared error. The clean additive three-way split is special to this loss.
- The test noise \(\varepsilon_0\) has mean zero, has variance \(\sigma^2\), and is independent of the training set \(\mathcal{D}\) (so \(\varepsilon_0\) is independent of \(\hat f_\mathcal{D}(x_0)\)).
- \(f^*(x_0)\) is a fixed (non-random) number, and \(\bar f(x_0)\) exists.
Nothing is assumed about the algorithm \(\mathcal{A}\), the hypothesis class, or the data distribution: the result is an algebraic identity, not an approximation.
Proof
Fix \(x_0\) and abbreviate \(\hat f = \hat f_\mathcal{D}(x_0)\), \(\bar f = \bar f(x_0)\), \(f^* = f^*(x_0)\), and \(y_0 = f^* + \varepsilon_0\). Two random sources are in play: the training set \(\mathcal{D}\) (which makes \(\hat f\) random) and the test noise \(\varepsilon_0\). The proof peels them off one at a time.
Step 1 — Separate the irreducible noise
Split the error into a model part and a noise part:
\[ y_0 - \hat f = (f^* - \hat f) + \varepsilon_0. \]
Square and expand:
\[ (y_0 - \hat f)^2 = (f^* - \hat f)^2 + 2\,\varepsilon_0\,(f^* - \hat f) + \varepsilon_0^2. \]
Take the expectation over both \(\mathcal{D}\) and \(\varepsilon_0\). In the cross term, \(\varepsilon_0\) is independent of \(\hat f\) and has mean zero, so
\[ \mathbb{E}\!\left[\varepsilon_0\,(f^* - \hat f)\right] = \mathbb{E}[\varepsilon_0]\;\mathbb{E}[f^* - \hat f] = 0, \]
and \(\mathbb{E}[\varepsilon_0^2] = \mathrm{Var}(\varepsilon_0) = \sigma^2\). Hence
\[ \mathbb{E}_{\mathcal{D}, \varepsilon_0}\!\left[(y_0 - \hat f)^2\right] = \mathbb{E}_\mathcal{D}\!\left[(\hat f - f^*)^2\right] + \sigma^2. \]
The \(\sigma^2\) term involves no model and cannot be reduced by any algorithm — it is the irreducible noise.
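Step 1 can be checked numerically. The sketch below uses purely illustrative choices not fixed by the claim: \(f^* = \sin\), \(\sigma = 0.3\), and a straight-line least-squares fit as the learning algorithm. It estimates both sides of the last display by Monte Carlo over training-set draws and fresh test noise:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, x0 = 0.3, 1.0        # noise std and fixed test point (illustrative)
N, trials = 30, 20000       # training-set size, number of training-set draws

preds = np.empty(trials)    # f_hat_D(x0) across draws of D
sq_err = np.empty(trials)   # (y0 - f_hat_D(x0))^2 with independent test noise
for t in range(trials):
    x = rng.uniform(0.0, np.pi, N)
    y = np.sin(x) + rng.normal(0.0, sigma, N)      # training targets
    coef = np.polyfit(x, y, 1)                     # the "algorithm": a line fit
    preds[t] = np.polyval(coef, x0)
    y0 = np.sin(x0) + rng.normal(0.0, sigma)       # test noise, independent of D
    sq_err[t] = (y0 - preds[t]) ** 2

lhs = sq_err.mean()                                # E[(y0 - f_hat)^2]
rhs = np.mean((preds - np.sin(x0)) ** 2) + sigma**2
print(lhs, rhs)   # the two sides agree up to Monte Carlo error
```

The gap between `lhs` and `rhs` shrinks as `trials` grows, and `lhs` never drops below \(\sigma^2 = 0.09\), the noise floor that Step 1 isolates.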
Step 2 — Split the reducible error into bias and variance
It remains to decompose the reducible error \(\mathbb{E}_\mathcal{D}[(\hat f - f^*)^2]\). Add and subtract the mean prediction \(\bar f\):
\[ \hat f - f^* = (\hat f - \bar f) + (\bar f - f^*). \]
Square and expand:
\[ (\hat f - f^*)^2 = (\hat f - \bar f)^2 + 2\,(\hat f - \bar f)(\bar f - f^*) + (\bar f - f^*)^2. \]
Take the expectation over \(\mathcal{D}\). The quantity \(\bar f - f^*\) is a constant (it does not depend on \(\mathcal{D}\)), and by the definition of \(\bar f\),
\[ \mathbb{E}_\mathcal{D}\!\left[\hat f - \bar f\right] = \bar f - \bar f = 0, \]
so the cross term vanishes. This leaves
\[ \mathbb{E}_\mathcal{D}\!\left[(\hat f - f^*)^2\right] = \underbrace{\mathbb{E}_\mathcal{D}\!\left[(\hat f - \bar f)^2\right]}_{\text{Variance}} + \underbrace{(\bar f - f^*)^2}_{\text{Bias}^2}. \]
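Step 2 is algebra, not statistics, and a sketch can make that visible: over any finite collection of training-set draws, the sample average of the squared reducible error equals the sample variance of the predictions plus the squared deviation of their sample mean, exactly, to floating-point precision. Reusing the same illustrative \(\sin\)/line-fit setup:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, x0 = 0.3, 1.0
N, trials = 30, 5000

preds = np.empty(trials)    # f_hat_D(x0) across training-set draws
for t in range(trials):
    x = rng.uniform(0.0, np.pi, N)
    y = np.sin(x) + rng.normal(0.0, sigma, N)
    preds[t] = np.polyval(np.polyfit(x, y, 1), x0)

reducible = np.mean((preds - np.sin(x0)) ** 2)     # E_D[(f_hat - f*)^2]
bias2 = (preds.mean() - np.sin(x0)) ** 2           # (f_bar - f*)^2
variance = preds.var()                             # ddof=0 matches the identity
gap = abs(reducible - (bias2 + variance))
print(gap)   # ~1e-17: the split is an exact identity, not an approximation
```

Any `gap` here is pure floating-point round-off, which is the point: no sampling error enters Step 2, only Step 1's noise term needs an expectation over \(\varepsilon_0\).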
Combine
Substituting Step 2 into Step 1 gives
\[ \mathbb{E}_{\mathcal{D}, \varepsilon_0}\!\left[(y_0 - \hat f)^2\right] = (\bar f - f^*)^2 + \mathbb{E}_\mathcal{D}\!\left[(\hat f - \bar f)^2\right] + \sigma^2, \]
which is the claimed decomposition. \(\square\)
Remarks
The estimator MSE identity, applied pointwise. For any estimator \(\hat\theta\) of a fixed quantity \(\theta\), the mean squared error obeys \(\mathbb{E}[(\hat\theta - \theta)^2] = (\mathbb{E}[\hat\theta] - \theta)^2 + \mathrm{Var}(\hat\theta)\) — exactly Step 2 with \(\hat\theta = \hat f\) and \(\theta = f^*\). The bias-variance decomposition is that classical identity evaluated at each input \(x_0\), plus the noise term from the stochastic target.
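A small numeric illustration of the estimator identity, with all numbers illustrative: shrinking the sample mean by a hypothetical factor \(c < 1\) trades variance for bias, and the Monte Carlo MSE matches both the sample-moment split and the analytic \(\text{bias}^2 + \text{variance}\):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n, c = 2.0, 1.0, 10, 0.8   # true value, noise, sample size, shrinkage
trials = 200_000

# Sampling distribution of the sample mean of n draws: N(theta, sigma^2 / n).
xbar = rng.normal(theta, sigma / np.sqrt(n), trials)
theta_hat = c * xbar                     # shrunk (biased, lower-variance) estimator

mse = np.mean((theta_hat - theta) ** 2)
bias2 = (theta_hat.mean() - theta) ** 2
var = theta_hat.var()
analytic = ((c - 1) * theta) ** 2 + (c * sigma) ** 2 / n   # closed form
print(mse, bias2 + var, analytic)
```

`mse` equals `bias2 + var` to floating-point precision (the identity again), and both approach the closed-form value \(0.224\) as `trials` grows.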
Averaging over test inputs. The decomposition above is pointwise at \(x_0\). Integrating over a test distribution \(x_0 \sim p_{\text{test}}\) gives the overall expected test error, with bias\(^2\) and variance replaced by their averages over \(x_0\) and the noise term replaced by \(\mathbb{E}_{x_0}[\sigma^2(x_0)]\) (just \(\sigma^2\) when the noise is homoscedastic). Each term stays non-negative, so the three-way split survives the average.
Why squared loss is special. The decomposition relies on the cross terms vanishing, which is a consequence of squared error being a quadratic whose mixed term is linear in the centered deviation \(\hat f - \bar f\). General losses — 0-1 loss, cross-entropy — admit bias-variance-style analyses, but they do not split into a clean, loss-valued sum of three independent non-negative pieces. The tidy additive form is a property of the squared-error geometry, not a universal law.
The decomposition is exact; the trade-off is empirical. Because the identity assumes nothing about \(\mathcal{A}\), it cannot by itself predict how bias and variance move as model capacity changes. The classical claim that bias falls and variance rises with complexity — and the modern double-descent refinement of it — are empirical observations layered on top of an identity that always holds. See bias-variance for that discussion.
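The empirical trade-off can be observed directly. The sketch below (same illustrative \(\sin\) target and noise level as before) sweeps polynomial degree and reports bias\(^2\) and variance averaged over a test grid; in this particular setup bias falls and variance rises with degree, but, as the remark stresses, the identity itself made no such promise:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.3
N, trials = 30, 2000
xs = np.linspace(0.2, np.pi - 0.2, 25)   # test grid inside the training range

results = {}
for deg in (0, 1, 3, 5):
    P = np.empty((trials, xs.size))      # predictions: one row per training draw
    for t in range(trials):
        x = rng.uniform(0.0, np.pi, N)
        y = np.sin(x) + rng.normal(0.0, sigma, N)
        P[t] = np.polyval(np.polyfit(x, y, deg), xs)
    bias2 = np.mean((P.mean(axis=0) - np.sin(xs)) ** 2)   # averaged over the grid
    var = np.mean(P.var(axis=0))
    results[deg] = (bias2, var)
    print(deg, round(bias2, 4), round(var, 4))
```

With these settings, bias\(^2\) drops steeply from degree 0 to degree 3 (a cubic already tracks \(\sin\) well on \([0, \pi]\)) while variance climbs roughly in proportion to the number of fitted coefficients.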