Linear Regression

Motivation

The simplest supervised learning problem is predicting a real-valued output from a vector of features. Linear regression assumes the output is approximately a linear function of the inputs and finds the best-fitting linear function by minimizing squared error (Hastie et al. 2009). It is the foundational model in statistics and machine learning: interpretable, fast to fit, and often surprisingly competitive. It also introduces, in the simplest possible setting, the ideas of loss minimization, closed-form solutions, regularization, and the bias-variance trade-off that recur throughout machine learning.

Problem

Input. A dataset of \(n\) examples \(\{(\mathbf{x}_i, y_i)\}_{i=1}^n\) where \(\mathbf{x}_i \in \mathbb{R}^d\) is a feature vector and \(y_i \in \mathbb{R}\) is a real-valued target.

Model. A linear function parameterized by weights \(\mathbf{w} \in \mathbb{R}^d\) and bias \(b \in \mathbb{R}\):

\[ \hat{y} = \mathbf{w}^\top \mathbf{x} + b. \]

By appending \(1\) to every \(\mathbf{x}_i\) and including \(b\) as a component of \(\mathbf{w}\), we can absorb the bias and write simply \(\hat{y} = \mathbf{w}^\top \mathbf{x}\).

Objective. Find the \(\mathbf{w}\) minimizing the sum of squared residuals:

\[ \min_{\mathbf{w}} \sum_{i=1}^n (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 = \min_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|^2, \]

where \(X \in \mathbb{R}^{n \times d}\) is the design matrix (rows are \(\mathbf{x}_i^\top\)) and \(\mathbf{y} \in \mathbb{R}^n\) is the target vector.
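As a concrete sketch (assuming NumPy; the data values are purely illustrative), building the design matrix with the bias absorbed and evaluating the squared-error objective looks like:

    import numpy as np

    # Toy data: n = 4 examples, d = 2 raw features (values are illustrative only).
    X_raw = np.array([[0.5, 1.2],
                      [1.0, 0.3],
                      [1.5, 2.1],
                      [2.0, 0.8]])
    y = np.array([1.0, 1.5, 3.2, 2.9])

    # Absorb the bias: append a column of ones so the last weight plays the role of b.
    X = np.column_stack([X_raw, np.ones(len(X_raw))])

    def squared_error(w, X, y):
        """Sum of squared residuals ||Xw - y||^2."""
        r = X @ w - y
        return r @ r

    print(squared_error(np.zeros(3), X, y))  # loss of the all-zero weight vector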

Key Ideas

The squared-error objective \(f(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2\) is a convex quadratic in \(\mathbf{w}\), so every stationary point is a global minimum; when \(X\) has full column rank the minimizer is unique. Setting the gradient to zero gives the normal equations:

\[ X^\top X \, \mathbf{w} = X^\top \mathbf{y}. \]

When \(X^\top X\) is invertible (i.e., when the columns of \(X\) are linearly independent), this has the closed-form solution:

\[ \hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}. \]

This is the ordinary least squares (OLS) estimator. Geometrically, \(\hat{\mathbf{y}} = X\hat{\mathbf{w}}\) is the orthogonal projection of \(\mathbf{y}\) onto the column space of \(X\) — the closest point in that subspace.
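A minimal NumPy sketch of the OLS estimator on synthetic data; solving the normal equations with np.linalg.solve avoids forming the inverse explicitly, and np.linalg.lstsq reaches the same answer through a more numerically robust route:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                # synthetic full-column-rank design matrix
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=50)  # linear signal plus noise

    # Normal equations: solve (X^T X) w = X^T y.
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Same fit via a least-squares solver (preferred when X is ill-conditioned).
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(w_hat, w_lstsq))  # True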

Deriving the Normal Equations

Expand the objective:

\[ \|X\mathbf{w} - \mathbf{y}\|^2 = \mathbf{w}^\top X^\top X \mathbf{w} - 2\mathbf{w}^\top X^\top \mathbf{y} + \mathbf{y}^\top \mathbf{y}. \]

Taking the gradient with respect to \(\mathbf{w}\) and setting it to zero:

\[ \nabla_\mathbf{w}\|X\mathbf{w} - \mathbf{y}\|^2 = 2X^\top X\mathbf{w} - 2X^\top\mathbf{y} = \mathbf{0}. \]

This gives \(X^\top X \mathbf{w} = X^\top \mathbf{y}\), confirming the normal equations.
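As a quick sanity check on the derivation (a sketch on synthetic data, not part of the derivation itself), the analytic gradient \(2X^\top(X\mathbf{w} - \mathbf{y})\) can be compared against central finite differences:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 4))
    y = rng.normal(size=20)
    w = rng.normal(size=4)

    def f(w):
        r = X @ w - y
        return r @ r

    grad_analytic = 2 * X.T @ (X @ w - y)

    # Central finite differences, one coordinate at a time.
    eps = 1e-6
    grad_numeric = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        grad_numeric[j] = (f(w + e) - f(w - e)) / (2 * eps)

    print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True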

Example

Fit a line \(\hat{y} = w_0 + w_1 x\) to three points:

\(x\) \(y\)
\(1\) \(2\)
\(2\) \(4\)
\(3\) \(5\)

Design matrix (with intercept column):

\[ X = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} 2 \\ 4 \\ 5 \end{pmatrix}. \]

Normal equations:

\[ X^\top X = \begin{pmatrix} 3 & 6 \\ 6 & 14 \end{pmatrix}, \quad X^\top \mathbf{y} = \begin{pmatrix} 11 \\ 25 \end{pmatrix}. \]

Solving \((X^\top X)\hat{\mathbf{w}} = X^\top \mathbf{y}\):

\[ \hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y} = \frac{1}{6}\begin{pmatrix} 14 & -6 \\ -6 & 3 \end{pmatrix}\begin{pmatrix} 11 \\ 25 \end{pmatrix} = \begin{pmatrix} 2/3 \\ 3/2 \end{pmatrix}. \]

So the fitted line is \(\hat{y} = 2/3 + (3/2)x\).
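A few lines of NumPy reproduce the worked example:

    import numpy as np

    X = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
    y = np.array([2.0, 4.0, 5.0])

    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_hat)  # [0.6667, 1.5], i.e. (2/3, 3/2)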

[Figure: the three data points and the fitted line \(\hat{y} = 2/3 + 1.5x\), with vertical segments marking the residuals.]

Geometric Interpretation

The prediction \(\hat{\mathbf{y}} = X\hat{\mathbf{w}}\) lies in the column space of \(X\). Since \(\hat{\mathbf{w}}\) minimizes \(\|X\mathbf{w} - \mathbf{y}\|\), the fitted value \(\hat{\mathbf{y}}\) is the orthogonal projection of \(\mathbf{y}\) onto \(\operatorname{col}(X)\). The residual vector \(\mathbf{y} - \hat{\mathbf{y}}\) is perpendicular to every column of \(X\), which is exactly the normal equations: \(X^\top(\mathbf{y} - X\hat{\mathbf{w}}) = \mathbf{0}\).
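This orthogonality is easy to verify numerically; a small sketch on synthetic data:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(30, 3))
    y = rng.normal(size=30)

    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    residual = y - X @ w_hat

    # The residual is orthogonal to every column of X: X^T (y - X w_hat) = 0.
    print(X.T @ residual)  # entries near zero, up to floating-point error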

Ridge Regression

When \(X^\top X\) is singular or nearly singular (e.g., more features than examples, or correlated features), OLS is unstable. Ridge regression adds an \(\ell_2\) penalty:

\[ \min_\mathbf{w} \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2, \]

yielding the closed-form solution:

\[ \hat{\mathbf{w}}_\lambda = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}. \]

Adding \(\lambda I\) makes the matrix invertible for any \(\lambda > 0\), shrinks the weights toward zero, and reduces variance at the cost of some bias — the bias-variance trade-off.
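A minimal ridge sketch in NumPy; the data and the choice \(\lambda = 1\) are illustrative (in practice \(\lambda\) is tuned, e.g. by cross-validation), and one common convention not shown here is to leave the bias column unpenalized:

    import numpy as np

    def ridge_fit(X, y, lam):
        """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Two nearly identical columns: OLS is unstable here, ridge is not.
    rng = np.random.default_rng(3)
    x1 = rng.normal(size=40)
    X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=40)])
    y = x1 + 0.1 * rng.normal(size=40)

    w_ridge = ridge_fit(X, y, lam=1.0)
    print(w_ridge)  # the effect of x1 is split roughly evenly across the two columns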

Assumptions and Failure Modes

OLS is optimal (by the Gauss-Markov theorem, it has minimum variance among unbiased linear estimators) when:

  • the true relationship is linear, and
  • errors have zero mean and are independent with equal variance (homoskedasticity).

It fails to be the best choice when:

  • the true relationship is nonlinear (use feature transforms, kernels, or neural networks);
  • features are correlated (use ridge or lasso);
  • errors have unequal variance (use weighted least squares);
  • \(n \ll d\) without regularization (the solution overfits).

When to Use It

Linear regression is the right starting point for any regression task with structured numerical features. It is interpretable (each weight is the partial effect of a feature), fast to fit via the normal equations or gradient descent, and provides a strong baseline. If it underfits, switch to a more expressive model. If it overfits, add regularization.
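For completeness, a sketch of fitting by gradient descent on the same objective, using the gradient derived earlier (the step size and iteration count are illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

    w = np.zeros(5)
    lr = 0.001                # step size (illustrative; must be small enough to converge)
    for _ in range(5000):
        grad = 2 * X.T @ (X @ w - y)   # gradient of ||Xw - y||^2
        w -= lr * grad

    # Matches the closed-form normal-equations solution once converged.
    w_closed = np.linalg.solve(X.T @ X, X.T @ y)
    print(np.allclose(w, w_closed, atol=1e-6))  # True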

References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/.