Principal Components
Motivation
High-dimensional data often lies near a low-dimensional structure. Principal component analysis (PCA) (Pearson 1901) finds an orthogonal coordinate system aligned with the directions of greatest variance, enabling dimensionality reduction, visualization, and noise suppression.
Setup
Let \(\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d\) be observations with sample mean \(\boldsymbol{\mu} = \frac{1}{n}\sum_i \mathbf{x}_i\). Center the data by replacing each \(\mathbf{x}_i\) with \(\mathbf{x}_i - \boldsymbol{\mu}\).
The sample covariance matrix is
\[ C = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top \in \mathbb{R}^{d \times d}. \]
\(C\) is symmetric and positive semidefinite, so it has an orthonormal eigenbasis \(\mathbf{w}_1, \ldots, \mathbf{w}_d\) with real eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0\).
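The setup above can be sketched in NumPy; the synthetic data and variable names here are illustrative, not part of the text:

```python
import numpy as np

# synthetic anisotropic cloud: 200 points in R^3 with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

mu = X.mean(axis=0)
Xc = X - mu                      # center the data
C = Xc.T @ Xc / len(Xc)          # sample covariance (1/n convention, as above)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
lam, W = np.linalg.eigh(C)
lam, W = lam[::-1], W[:, ::-1]   # reorder so lambda_1 >= ... >= lambda_d
```

The columns of `W` form the orthonormal eigenbasis, with `lam` the matching eigenvalues in decreasing order.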
Definition
(Figure: principal axes of a scatter cloud. PCA chooses orthogonal directions ordered by the variance of the projected data.)
The \(k\)-th principal component is the unit vector \(\mathbf{w}_k\) that maximizes the variance of the projected data,
\[ \operatorname{Var}_k = \mathbf{w}_k^\top C \mathbf{w}_k, \]
subject to being orthogonal to all previous components \(\mathbf{w}_1, \ldots, \mathbf{w}_{k-1}\); the maximum attained equals \(\lambda_k\).
The principal components are the eigenvectors of \(C\), ordered by decreasing eigenvalue.
Projection and Reconstruction
To reduce to \(k\) dimensions, project each centered observation onto the top \(k\) components. Stacking the components as columns of \(W_k = [\mathbf{w}_1 \;\cdots\; \mathbf{w}_k] \in \mathbb{R}^{d \times k}\):
\[ \mathbf{z}_i = W_k^\top \mathbf{x}_i \in \mathbb{R}^k. \]
The reconstruction \(\hat{\mathbf{x}}_i = W_k \mathbf{z}_i\) is the best rank-\(k\) linear reconstruction in mean squared error, and its error equals the discarded variance: \(\frac{1}{n}\sum_i \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2 = \sum_{j=k+1}^d \lambda_j\).
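Projection and reconstruction can be sketched as follows (NumPy; the random data is illustrative, and the mean squared reconstruction error is checked against the sum of the discarded eigenvalues):

```python
import numpy as np

# centered synthetic data with two dominant directions
rng = np.random.default_rng(1)
Xc = rng.normal(size=(100, 4)) @ np.diag([2.0, 1.5, 0.2, 0.1])
Xc -= Xc.mean(axis=0)

C = Xc.T @ Xc / len(Xc)
lam, W = np.linalg.eigh(C)
lam, W = lam[::-1], W[:, ::-1]   # descending eigenvalue order

k = 2
Wk = W[:, :k]                    # W_k: top-k components as columns
Z = Xc @ Wk                      # rows are z_i = W_k^T x_i
X_hat = Z @ Wk.T                 # rows are x_hat_i = W_k z_i
mse = np.sum((Xc - X_hat) ** 2) / len(Xc)
```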
Variance Explained
The total variance in the data equals \(\operatorname{tr}(C) = \sum_{j=1}^d \lambda_j\). The fraction explained by the top \(k\) components is
\[ \frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j}. \]
Plotting this ratio against \(k\) (a scree plot) guides the choice of how many components to retain.
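A minimal sketch of the explained-variance ratio, including one common (illustrative) rule of thumb of keeping enough components to capture 95% of the variance; the threshold is an assumption, not part of the text:

```python
import numpy as np

# synthetic data whose variance is concentrated in the first few directions
rng = np.random.default_rng(2)
Xc = rng.normal(size=(500, 5)) @ np.diag([4.0, 2.0, 1.0, 0.5, 0.1])
Xc -= Xc.mean(axis=0)

lam = np.linalg.eigvalsh(Xc.T @ Xc / len(Xc))[::-1]  # descending eigenvalues
explained = np.cumsum(lam) / lam.sum()  # fraction explained by top k, for each k

# smallest k whose cumulative ratio reaches 95% (hypothetical threshold)
k = int(np.searchsorted(explained, 0.95) + 1)
```

Plotting `explained` against `k` gives the scree plot described above.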
Computation
In practice, PCA is computed via the singular value decomposition of the centered data matrix \(X \in \mathbb{R}^{n \times d}\):
\[ X = U \Sigma V^\top. \]
The columns of \(V\) are the principal components and the singular values satisfy \(\sigma_k = \sqrt{n \lambda_k}\). This avoids forming \(C\) explicitly and is numerically more stable.
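A sketch of the SVD route, checking the stated relation \(\sigma_k = \sqrt{n \lambda_k}\) against the eigenvalues of the explicitly formed covariance (the random data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
Xc = X - X.mean(axis=0)          # centered data matrix, n x d
n = len(Xc)

# thin SVD: rows of Vt are the principal components
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
lam_svd = S**2 / n               # eigenvalues of C recovered from singular values

# reference: eigenvalues of the explicit covariance, in descending order
lam_eig = np.linalg.eigvalsh(Xc.T @ Xc / n)[::-1]
```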