Bias-Variance Decomposition

Motivation

A learning algorithm trained on a sample of data produces a different model on every sample. The expected test error of the algorithm — averaged over both the training data and the test point — decomposes into three terms: bias squared, variance, and irreducible noise. This decomposition is the textbook formalism for the trade-off between underfitting and overfitting and provides the language for diagnosing why a model is making errors (Hastie et al. 2009).

The classical reading: complex models have low bias and high variance; simple models have the opposite. The modern reading: large neural networks complicate this picture (the “double descent” phenomenon), but the decomposition itself remains correct and useful.

Setup

A regression target \(y\) is generated by

\[ y = f^*(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \mathrm{Var}(\varepsilon) = \sigma^2. \]

A learning algorithm \(\mathcal{A}\) takes a training set \(\mathcal{D}\) of \(N\) samples drawn i.i.d. from a distribution and produces a predictor \(\hat f_\mathcal{D} = \mathcal{A}(\mathcal{D})\). Fix a test point \(x_0\) with noisy label \(y_0 = f^*(x_0) + \varepsilon\). The expected squared error at \(x_0\), averaged over training sets and over the noise at the test point, is

\[ \mathbb{E}_{\mathcal{D}, \varepsilon}\!\left[(y_0 - \hat f_\mathcal{D}(x_0))^2\right]. \]

The Decomposition

Define \(\bar f(x_0) = \mathbb{E}_\mathcal{D}[\hat f_\mathcal{D}(x_0)]\) — the mean prediction at \(x_0\) across training-set draws. Then

\[ \mathbb{E}_{\mathcal{D}, \varepsilon}\!\left[(y_0 - \hat f_\mathcal{D}(x_0))^2\right] = \underbrace{(\bar f(x_0) - f^*(x_0))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_\mathcal{D}\!\left[(\hat f_\mathcal{D}(x_0) - \bar f(x_0))^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}. \]

The derivation: write \(y_0 - \hat f_\mathcal{D}(x_0)\) as the sum of \(\varepsilon\), \(f^*(x_0) - \bar f(x_0)\), and \(\bar f(x_0) - \hat f_\mathcal{D}(x_0)\); expand the square; the cross terms vanish because \(\mathbb{E}_\mathcal{D}[\hat f_\mathcal{D}(x_0) - \bar f(x_0)] = 0\) and \(\mathbb{E}[\varepsilon] = 0\), with \(\varepsilon\) independent of \(\mathcal{D}\).
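
Spelled out: since \(y_0 = f^*(x_0) + \varepsilon\), the error splits as

\[ y_0 - \hat f_\mathcal{D}(x_0) = \varepsilon + \big(f^*(x_0) - \bar f(x_0)\big) + \big(\bar f(x_0) - \hat f_\mathcal{D}(x_0)\big). \]

Squaring and averaging, the three squared terms become \(\sigma^2\), bias\(^2\), and variance. The cross terms linear in \(\varepsilon\) collect into \(2\,\varepsilon\,(f^*(x_0) - \hat f_\mathcal{D}(x_0))\), whose expectation factorizes to \(2\,\mathbb{E}[\varepsilon]\,\mathbb{E}_\mathcal{D}[f^*(x_0) - \hat f_\mathcal{D}(x_0)] = 0\) because the test noise is independent of \(\mathcal{D}\); the remaining cross term is \(2\,(f^*(x_0) - \bar f(x_0))\,\mathbb{E}_\mathcal{D}[\bar f(x_0) - \hat f_\mathcal{D}(x_0)] = 0\) by the definition of \(\bar f\).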

What each term measures:

  • Bias\(^2\) — how far the average model is from the truth. Large for under-expressive model classes (a linear model fitting a curve) or aggressive regularization. Reduced by using a richer hypothesis class or training longer.
  • Variance — how much the prediction at \(x_0\) jitters as the training set varies. Large for over-expressive models trained on small samples. Reduced by more data, more regularization, ensembling, or simpler models.
  • Noise \(\sigma^2\) — irreducible. No model can do better than this on a stochastic target.
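
To see the three terms computed rather than asserted, here is a minimal Monte Carlo sketch in Python (NumPy only). The sine target, the noise level \(\sigma = 0.3\), and the deliberately underfit degree-1 polynomial are illustrative assumptions, not part of the setup above; the recipe itself works for any algorithm: retrain many times, record the prediction at \(x_0\), and compare moments.

    import numpy as np

    rng = np.random.default_rng(0)

    f_star = lambda x: np.sin(2 * np.pi * x)  # "true" regression function (illustrative)
    sigma = 0.3                               # noise standard deviation
    N, R = 30, 2000                           # training-set size, number of retrainings
    x0 = 0.25                                 # the fixed test point
    degree = 1                                # deliberately underfit: a line vs. a sine

    preds = np.empty(R)
    for r in range(R):
        x = rng.uniform(0, 1, N)
        y = f_star(x) + sigma * rng.normal(size=N)  # draw a fresh training set D
        coeffs = np.polyfit(x, y, degree)           # the learning algorithm A(D)
        preds[r] = np.polyval(coeffs, x0)           # its prediction at x0

    bias_sq = (preds.mean() - f_star(x0)) ** 2
    variance = preds.var()
    noise = sigma ** 2
    print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, noise = {noise:.4f}")
    print(f"sum    = {bias_sq + variance + noise:.4f}")

    # Cross-check: estimate the expected squared error at x0 directly.
    y0 = f_star(x0) + sigma * rng.normal(size=R)
    print(f"direct = {np.mean((y0 - preds) ** 2):.4f}")

Up to Monte Carlo error the sum matches the direct estimate, and the degree-1 fit lands squarely in the high-bias column of the list above.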

Worked example: computing the three terms

At a fixed input \(x_0\), suppose the true regression function takes the value \(f^*(x_0) = 10\), the observation noise variance is \(\sigma^2 = 1\), and retraining the same algorithm on many datasets gives predictions with mean \(9\) and variance \(4\):

\[ \bar f(x_0) = 9, \qquad \operatorname{Var}(\hat f_\mathcal{D}(x_0)) = 4. \]

Then

\[ \text{bias}^2 = (9 - 10)^2 = 1, \qquad \text{variance} = 4, \qquad \text{noise} = 1. \]

The expected squared prediction error at \(x_0\) is therefore

\[ 1 + 4 + 1 = 6. \]

If regularization changed the retrained predictors so their mean became \(8\) but their variance fell to \(1\), the expected error would be \((8 - 10)^2 + 1 + 1 = 6\) again: lower variance exactly offsets higher bias in this toy calculation.
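
The same arithmetic as a two-line helper, purely a restatement of the numbers above:

    def expected_sq_error(mean_pred, truth, pred_var, noise_var):
        """Expected squared error at a point, assembled from the three terms."""
        return (mean_pred - truth) ** 2 + pred_var + noise_var

    print(expected_sq_error(9, 10, 4, 1))  # 1 + 4 + 1 = 6.0
    print(expected_sq_error(8, 10, 1, 1))  # 4 + 1 + 1 = 6.0: same total, different split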

The Classical Trade-off

For a fixed dataset size, the bias-variance trade-off is a U-shape: as model complexity grows, bias drops monotonically and variance rises monotonically; total error has a minimum somewhere in the middle. Underfitting is the high-bias regime (model too simple); overfitting is the high-variance regime (model too flexible relative to data).

Diagram: bias², variance, noise σ², and total error vs. model complexity (x-axis: model complexity; y-axis: expected test error). Total error = bias² + variance + noise; bias falls and variance rises with capacity, so the minimum, the “sweet spot” between the underfit (high-bias) and overfit (high-variance) regimes, sits in the middle.

Classical responses:

  • Regularization (\(\ell_2\), \(\ell_1\), dropout, early stopping) trades a small amount of added bias for a large reduction in variance.
  • More data reduces variance while leaving the bias of a fixed hypothesis class essentially unchanged.
  • Cross-validation selects the complexity level that minimizes estimated test error, the empirical stand-in for bias\(^2\) + variance + noise, without computing the three terms separately.

This is the framework that underlies decisions like choosing a regularization strength, deciding whether to add more features, and explaining why a model that fits the training data perfectly may still do poorly on test data.
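
As a sketch of the first of those decisions, picking a regularization strength by cross-validation, here is a minimal example that assumes scikit-learn is available; the synthetic data and the alpha grid are illustrative choices:

    import numpy as np
    from sklearn.linear_model import RidgeCV

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 20))                    # 20 features, mostly irrelevant
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)  # only 2 features matter

    # Candidate strengths span six orders of magnitude; RidgeCV keeps the one
    # with the lowest cross-validated error, i.e. the best empirical
    # bias-variance compromise.
    model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
    print(f"chosen alpha: {model.alpha_:.3g}")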

The Modern Picture: Double Descent

The classical U-shape predicts that increasing model capacity past the interpolation threshold (where the model fits training data exactly) should make test error worse. Empirically, for many neural networks, test error first rises near the interpolation threshold and then falls again as capacity grows further. The total curve has two minima, one in the under-parameterized regime and one in the over-parameterized regime, and in modern deep learning the over-parameterized minimum is often the better one.

This phenomenon, called double descent (Belkin et al. 2019), does not contradict the bias-variance decomposition. The decomposition is an identity. What it changes is the assumption that variance grows monotonically with model capacity: in the over-parameterized regime, the implicit regularization from gradient descent and architectural choices keeps variance under control even as the model can in principle interpolate the training data.
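
A minimal sketch of the phenomenon, assuming nothing beyond NumPy: minimum-norm least squares on random cosine features, with capacity \(P\) swept past the interpolation threshold \(P = N\). The target, feature scales, and seed are arbitrary, and how pronounced the peak is varies with them; the point is the mechanism, not an exact curve.

    import numpy as np

    rng = np.random.default_rng(2)
    N, n_test = 40, 500
    x = rng.uniform(-1, 1, N)
    x_test = rng.uniform(-1, 1, n_test)
    f = lambda t: np.sin(np.pi * t)
    y = f(x) + 0.1 * rng.normal(size=N)

    W = rng.normal(scale=3.0, size=600)   # shared random frequencies
    b = rng.uniform(0, 2 * np.pi, 600)    # shared random phases

    def phi(t, P):
        # Random cosine features; model capacity grows with P.
        return np.cos(np.outer(t, W[:P]) + b[:P])

    for P in [5, 10, 20, 40, 80, 200, 600]:   # interpolation threshold at P = N = 40
        # lstsq returns the minimum-norm solution once P > N: the implicit
        # regularization that tames variance in the over-parameterized regime.
        w = np.linalg.lstsq(phi(x, P), y, rcond=None)[0]
        # Error against the noiseless target (excludes the irreducible sigma^2).
        mse = np.mean((phi(x_test, P) @ w - f(x_test)) ** 2)
        print(f"P = {P:4d}   test MSE = {mse:.3f}")

Typically the printed errors fall, spike near \(P = N\), and fall again for large \(P\); plotted against \(P\), that is the two-minimum curve just described.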

The practical takeaway for current practice: classical complexity control (regularization, early stopping) still helps, but runaway overfitting from “too large a model” is rarely the binding concern. Most deep-learning models are over-parameterized by orders of magnitude relative to the dataset and often benefit from being made larger, not smaller.

In Practice

For diagnosing a real model:

  • Training error high, validation error similar. Underfit. High bias. Increase capacity, decrease regularization, train longer.
  • Training error low, validation error much higher. Overfit. High variance. Add data, increase regularization, decrease capacity, ensemble.
  • Training error low, validation error low. Working as intended.
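
That checklist compresses into a crude helper. The threshold below is an arbitrary placeholder; what counts as “high” depends on the task and on a sensible baseline error.

    def diagnose(train_err, val_err, tol=0.1):
        """Toy train/validation diagnostic; tol is a task-dependent placeholder."""
        if train_err > tol:
            return "high bias (underfit): more capacity, less regularization, train longer"
        if val_err - train_err > tol:
            return "high variance (overfit): more data, more regularization, ensemble"
        return "working as intended"

    print(diagnose(train_err=0.30, val_err=0.32))  # underfit
    print(diagnose(train_err=0.02, val_err=0.25))  # overfit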

The bias-variance language gives names to the two failure modes; the decomposition itself is rarely computed numerically in deep-learning practice.

References

Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. “Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off.” Proceedings of the National Academy of Sciences 116 (32): 15849–54. https://doi.org/10.1073/pnas.1903070116.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/.