Principal Components

Motivation

High-dimensional data often lies near a low-dimensional structure. Principal component analysis (PCA) (Pearson 1901) finds an orthogonal coordinate system aligned with the directions of greatest variance, enabling dimensionality reduction, visualization, and noise suppression.

Setup

Let \(\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d\) be observations with sample mean \(\boldsymbol{\mu} = \frac{1}{n}\sum_i \mathbf{x}_i\). Center the data by replacing each \(\mathbf{x}_i\) with \(\mathbf{x}_i - \boldsymbol{\mu}\).

The sample covariance matrix (here with the \(1/n\) normalization; some texts use \(1/(n-1)\)) is

\[ C = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top \in \mathbb{R}^{d \times d}. \]

\(C\) is symmetric and positive semidefinite, so it has an orthonormal eigenbasis \(\mathbf{w}_1, \ldots, \mathbf{w}_d\) with real eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0\).
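As a minimal NumPy sketch of these steps (NumPy is an assumption of this example; the random array X and all variable names are illustrative, not from the original):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # n = 200 illustrative observations in d = 5 dimensions

mu = X.mean(axis=0)                  # sample mean
Xc = X - mu                          # centered data
C = (Xc.T @ Xc) / len(Xc)            # covariance matrix with the 1/n convention used above

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so flip both outputs to get lam[0] >= lam[1] >= ... as in the text
lam, W = np.linalg.eigh(C)
lam, W = lam[::-1], W[:, ::-1]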

Definition

[Figure: principal axes of a scatter cloud. PCA chooses orthogonal directions ordered by the variance of the projected data: PC1 points along the largest spread, PC2 is orthogonal to it, and projecting onto PC1 preserves the longest spread.]

The \(k\)-th principal component is the unit vector \(\mathbf{w}_k\) that maximizes the variance of the projected data,

\[ \operatorname{Var}_k = \mathbf{w}_k^\top C \mathbf{w}_k, \]

subject to being orthogonal to all previous components \(\mathbf{w}_1, \ldots, \mathbf{w}_{k-1}\); the maximum attained is exactly \(\lambda_k\).

The principal components are therefore the eigenvectors of \(C\), ordered by decreasing eigenvalue.
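For intuition, a standard one-step sketch of the \(k = 1\) case: maximizing \(\mathbf{w}^\top C \mathbf{w}\) subject to \(\mathbf{w}^\top \mathbf{w} = 1\) with a Lagrange multiplier \(\lambda\) gives

\[ \nabla_{\mathbf{w}} \left( \mathbf{w}^\top C \mathbf{w} - \lambda (\mathbf{w}^\top \mathbf{w} - 1) \right) = 2 C \mathbf{w} - 2 \lambda \mathbf{w} = \mathbf{0}, \]

so every stationary point is an eigenvector with \(C \mathbf{w} = \lambda \mathbf{w}\), and the attained variance \(\mathbf{w}^\top C \mathbf{w} = \lambda\) is largest for the top eigenvalue. Repeating the argument within the orthogonal complement of the earlier components handles \(k > 1\).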

Projection and Reconstruction

To reduce to \(k\) dimensions, project each centered observation onto the top \(k\) components. Stacking the components as columns of \(W_k = [\mathbf{w}_1 \;\cdots\; \mathbf{w}_k] \in \mathbb{R}^{d \times k}\):

\[ \mathbf{z}_i = W_k^\top \mathbf{x}_i \in \mathbb{R}^k. \]

The reconstruction \(\hat{\mathbf{x}}_i = W_k \mathbf{z}_i\) minimizes the mean squared reconstruction error \(\frac{1}{n} \sum_i \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2\) among all projections onto \(k\)-dimensional subspaces.
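Continuing the illustrative sketch (W, Xc, and mu as computed earlier; the choice k = 2 is arbitrary):

k = 2
Wk = W[:, :k]                        # top-k principal components as columns
Z = Xc @ Wk                          # projected coordinates z_i, shape (n, k)
Xc_hat = Z @ Wk.T                    # reconstruction of the centered data
X_hat = Xc_hat + mu                  # add the mean back to return to the original scale

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # mean squared reconstruction error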

Variance Explained

The total variance in the data equals \(\operatorname{tr}(C) = \sum_{j=1}^d \lambda_j\). The fraction explained by the top \(k\) components is

\[ \frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j}. \]

Plotting this ratio against \(k\) (a scree plot) guides the choice of how many components to retain.
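With the eigenvalues lam from the sketch above, the ratio and a simple retention rule might read (the 90% threshold is an arbitrary illustration):

explained = lam / lam.sum()               # per-component fraction of total variance
cumulative = np.cumsum(explained)         # fraction explained by the top k components
k90 = int(np.searchsorted(cumulative, 0.90)) + 1   # smallest k capturing at least 90%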

Computation

In practice, PCA is computed via the singular value decomposition of the centered data matrix \(X \in \mathbb{R}^{n \times d}\):

\[ X = U \Sigma V^\top. \]

The columns of \(V\) are the principal components, and the singular values satisfy \(\sigma_k = \sqrt{n \lambda_k}\). This route avoids forming \(C\) explicitly and is numerically more stable.
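A sketch of the SVD route on the same illustrative centered matrix Xc; note that each component is only determined up to a sign flip, so the two routes are compared on eigenvalues:

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt.T                 # columns of V are the principal components
lam_svd = S ** 2 / len(Xc)   # sigma_k = sqrt(n * lambda_k)  <=>  lambda_k = sigma_k^2 / n

# agrees with the eigendecomposition route up to numerical error
assert np.allclose(lam_svd, lam)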

References

Pearson, Karl. 1901. “LIII. On Lines and Planes of Closest Fit to Systems of Points in Space.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11): 559–72. https://doi.org/10.1080/14786440109462720.