Principal Components Are Eigenvectors

Claim

The principal components of a dataset — the directions of maximum variance — are the eigenvectors of the covariance matrix (Pearson 1901).

Setup

Let \(\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d\) be centered observations (zero mean). The sample covariance matrix is

\[ C = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top \in \mathbb{R}^{d \times d}. \]

\(C\) is symmetric and positive semidefinite. The variance of the data projected onto a unit vector \(\mathbf{w}\) is

\[ \operatorname{Var}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i)^2 = \mathbf{w}^\top C \mathbf{w}. \]
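This identity is easy to check numerically. The sketch below uses synthetic data; the array `X` and the direction `w` are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centered data: n = 500 observations in d = 3 dimensions.
X = rng.normal(size=(500, 3))
X -= X.mean(axis=0)                      # enforce zero mean

C = (X.T @ X) / len(X)                   # C = (1/n) sum_i x_i x_i^T

w = np.array([1.0, 2.0, -1.0])
w /= np.linalg.norm(w)                   # unit vector

var_direct = np.mean((X @ w) ** 2)       # (1/n) sum_i (w^T x_i)^2
var_quadratic = w @ C @ w                # w^T C w

print(np.isclose(var_direct, var_quadratic))
```

The two expressions agree to floating-point precision, as the derivation predicts.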

The first principal component is the unit vector that maximizes this variance:

\[ \mathbf{w}_1 = \operatorname*{arg\,max}_{\|\mathbf{w}\| = 1} \mathbf{w}^\top C \mathbf{w}. \]

Proof: First Principal Component

Introduce a Lagrange multiplier \(\lambda\) for the constraint \(\|\mathbf{w}\|^2 = 1\) and form

\[ \mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^\top C \mathbf{w} - \lambda(\mathbf{w}^\top \mathbf{w} - 1). \]

Setting the gradient to zero:

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2C\mathbf{w} - 2\lambda \mathbf{w} = \mathbf{0}. \]

This gives \(C\mathbf{w} = \lambda \mathbf{w}\), so any stationary point \(\mathbf{w}\) must be an eigenvector of \(C\).

At such a point, the objective equals

\[ \mathbf{w}^\top C \mathbf{w} = \mathbf{w}^\top (\lambda \mathbf{w}) = \lambda \|\mathbf{w}\|^2 = \lambda. \]

Maximizing the variance therefore means choosing the eigenvector corresponding to the largest eigenvalue \(\lambda_1\).
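A numerical sanity check of this conclusion, again on synthetic data: the top eigenvector of \(C\) attains variance \(\lambda_1\), and no random unit vector exceeds it. `numpy.linalg.eigh` is used because \(C\) is symmetric:

```python
import numpy as np

rng = np.random.default_rng(1)

# Anisotropic synthetic data so the leading direction is well separated.
X = rng.normal(size=(1000, 3)) * np.array([3.0, 1.0, 0.5])
X -= X.mean(axis=0)
C = (X.T @ X) / len(X)

# eigh is for symmetric matrices; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
w1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

# Variance along w1 equals the top eigenvalue lambda_1 ...
print(np.isclose(w1 @ C @ w1, eigvals[-1]))

# ... and no random unit direction captures more variance.
ws = rng.normal(size=(1000, 3))
ws /= np.linalg.norm(ws, axis=1, keepdims=True)
quad_forms = np.einsum('ij,jk,ik->i', ws, C, ws)   # w^T C w for each row
print(np.all(quad_forms <= eigvals[-1] + 1e-12))
```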

Proof: Subsequent Principal Components

The \(k\)-th principal component maximizes variance subject to being a unit vector orthogonal to all previous components \(\mathbf{w}_1, \ldots, \mathbf{w}_{k-1}\).

The Lagrangian with multipliers for each orthogonality constraint is

\[ \mathcal{L} = \mathbf{w}^\top C \mathbf{w} - \lambda(\mathbf{w}^\top \mathbf{w} - 1) - \sum_{j=1}^{k-1} \alpha_j \mathbf{w}_j^\top \mathbf{w}. \]

Setting \(\partial \mathcal{L} / \partial \mathbf{w} = \mathbf{0}\):

\[ 2C\mathbf{w} - 2\lambda \mathbf{w} - \sum_{j=1}^{k-1} \alpha_j \mathbf{w}_j = \mathbf{0}. \]

Left-multiplying by \(\mathbf{w}_j^\top\) and using \(C\mathbf{w}_j = \lambda_j \mathbf{w}_j\), orthogonality \(\mathbf{w}_j^\top \mathbf{w} = 0\), and orthonormality \(\mathbf{w}_j^\top \mathbf{w}_j = 1\):

\[ 2\lambda_j \mathbf{w}_j^\top \mathbf{w} - \alpha_j = 0 \implies \alpha_j = 0. \]

The stationarity condition reduces to \(C\mathbf{w} = \lambda \mathbf{w}\), so \(\mathbf{w}_k\) is again an eigenvector of \(C\) with objective value \(\lambda\). Among the eigenvectors orthogonal to \(\mathbf{w}_1, \ldots, \mathbf{w}_{k-1}\), this value is maximized by the eigenvector of the \(k\)-th largest eigenvalue \(\lambda_k\), which is therefore \(\mathbf{w}_k\).

Conclusion

The \(k\) principal components are the eigenvectors of \(C\) corresponding to its \(k\) largest eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k\), and the variance captured along the \(k\)-th component equals \(\lambda_k\).
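The full statement can be exercised at once: sort the eigenvectors of \(C\) by descending eigenvalue and check that the variance of the data projected onto each one equals the corresponding eigenvalue. A minimal sketch on synthetic correlated data (the mixing matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated synthetic features via a random mixing matrix.
X = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 4))
X -= X.mean(axis=0)
C = (X.T @ X) / len(X)

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]        # descending eigenvalue order
W = eigvecs[:, order]                    # columns are w_1, ..., w_d
lambdas = eigvals[order]

# Variance captured along the k-th component equals lambda_k.
# np.var with default ddof=0 matches the 1/n convention used above.
projected_var = np.var(X @ W, axis=0)
print(np.allclose(projected_var, lambdas))
```

Note that the \(1/n\) normalization in the definition of \(C\) must match the variance computation (`ddof=0`); with the \(1/(n-1)\) convention both sides rescale together and the eigenvectors are unchanged.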

References

Pearson, Karl. 1901. “LIII. On Lines and Planes of Closest Fit to Systems of Points in Space.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11): 559–72. https://doi.org/10.1080/14786440109462720.