Principal Components Are Eigenvectors
Claim
The principal components of a dataset — the directions of maximum variance — are the eigenvectors of the covariance matrix (Pearson 1901).
Setup
Let \(\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d\) be centered observations (zero mean). The sample covariance matrix is
\[ C = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top \in \mathbb{R}^{d \times d}. \]
\(C\) is symmetric and positive semidefinite. The variance of the data projected onto a unit vector \(\mathbf{w}\) is
\[ \operatorname{Var}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i)^2 = \mathbf{w}^\top C \mathbf{w}. \]
The first principal component is the unit vector that maximizes this variance:
\[ \mathbf{w}_1 = \operatorname*{arg\,max}_{\|\mathbf{w}\| = 1} \mathbf{w}^\top C \mathbf{w}. \]
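As a numerical sanity check of the identity above, the quadratic form \(\mathbf{w}^\top C \mathbf{w}\) can be compared directly against the average squared projection. The following is a minimal NumPy sketch; the data shape, mixing matrix, and seed are illustrative choices, not part of the claim.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n = 500 observations in d = 3 dimensions
# (shape, mixing matrix, and seed are arbitrary choices).
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 1.0, 0.0],
                                          [0.0, 1.0, 0.5],
                                          [0.0, 0.0, 0.2]])
X -= X.mean(axis=0)                 # center: each coordinate has zero mean

C = (X.T @ X) / len(X)              # C = (1/n) * sum_i x_i x_i^T

w = rng.normal(size=3)
w /= np.linalg.norm(w)              # arbitrary unit direction

# w^T C w agrees with the direct average of squared projections.
assert np.isclose(w @ C @ w, np.mean((X @ w) ** 2))
```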
Proof: First Principal Component
Introduce a Lagrange multiplier \(\lambda\) for the constraint \(\|\mathbf{w}\|^2 = 1\) and form
\[ \mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^\top C \mathbf{w} - \lambda(\mathbf{w}^\top \mathbf{w} - 1). \]
Setting the gradient with respect to \(\mathbf{w}\) to zero (the identity \(\nabla_{\mathbf{w}}\, \mathbf{w}^\top C \mathbf{w} = 2C\mathbf{w}\) uses the symmetry of \(C\)):
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2C\mathbf{w} - 2\lambda \mathbf{w} = \mathbf{0}. \]
This gives \(C\mathbf{w} = \lambda \mathbf{w}\), so any stationary point \(\mathbf{w}\) must be an eigenvector of \(C\).
At such a point, the objective equals
\[ \mathbf{w}^\top C \mathbf{w} = \mathbf{w}^\top (\lambda \mathbf{w}) = \lambda \|\mathbf{w}\|^2 = \lambda. \]
Maximizing the variance therefore means choosing the eigenvector corresponding to the largest eigenvalue \(\lambda_1\).
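This characterization is easy to check numerically. The sketch below (again with illustrative synthetic data) uses `np.linalg.eigh`, which returns the eigenvalues of a symmetric matrix in ascending order, and confirms that no random unit direction exceeds the variance \(\lambda_1\) attained by the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic centered data with one clearly dominant direction.
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])
X -= X.mean(axis=0)
C = (X.T @ X) / len(X)

# eigh handles symmetric matrices; eigenvalues come back in
# ascending order, so the last column is the top eigenvector.
eigvals, eigvecs = np.linalg.eigh(C)
w1, lam1 = eigvecs[:, -1], eigvals[-1]

# The variance along w1 equals lambda_1 ...
assert np.isclose(w1 @ C @ w1, lam1)

# ... and no random unit direction captures more variance.
for _ in range(1000):
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)
    assert w @ C @ w <= lam1 + 1e-12
```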
Proof: Subsequent Principal Components
The \(k\)-th principal component maximizes variance subject to being a unit vector orthogonal to all previous components \(\mathbf{w}_1, \ldots, \mathbf{w}_{k-1}\).
The Lagrangian with multipliers for each orthogonality constraint is
\[ \mathcal{L} = \mathbf{w}^\top C \mathbf{w} - \lambda(\mathbf{w}^\top \mathbf{w} - 1) - \sum_{j=1}^{k-1} \alpha_j \mathbf{w}_j^\top \mathbf{w}. \]
Setting \(\partial \mathcal{L} / \partial \mathbf{w} = \mathbf{0}\):
\[ 2C\mathbf{w} - 2\lambda \mathbf{w} - \sum_{j=1}^{k-1} \alpha_j \mathbf{w}_j = \mathbf{0}. \]
Left-multiplying by \(\mathbf{w}_j^\top\) and using the symmetry of \(C\) together with \(C\mathbf{w}_j = \lambda_j \mathbf{w}_j\) (so \(\mathbf{w}_j^\top C = \lambda_j \mathbf{w}_j^\top\)), along with orthonormality \(\mathbf{w}_j^\top \mathbf{w}_l = \delta_{jl}\):
\[ 2\lambda_j \mathbf{w}_j^\top \mathbf{w} - 2\lambda\, \mathbf{w}_j^\top \mathbf{w} - \alpha_j = 0. \]
Both eigen-terms vanish by the orthogonality constraint \(\mathbf{w}_j^\top \mathbf{w} = 0\), leaving \(\alpha_j = 0\) for every \(j\).
The stationarity condition reduces to \(C\mathbf{w} = \lambda \mathbf{w}\), so \(\mathbf{w}_k\) is again an eigenvector of \(C\), and as before the objective at a stationary point equals its eigenvalue. Among eigenvectors orthogonal to \(\mathbf{w}_1, \ldots, \mathbf{w}_{k-1}\), the variance is therefore maximized by the one with the largest remaining eigenvalue, namely the \(k\)-th largest eigenvalue \(\lambda_k\) (by induction on \(k\)).
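A corresponding check for the second component (same illustrative setup; the tolerance is an arbitrary guard against floating-point error): the second eigenvector of \(C\) is orthogonal to the first, captures variance \(\lambda_2\), and is not beaten by any unit direction orthogonal to \(\mathbf{w}_1\).

```python
import numpy as np

rng = np.random.default_rng(2)

X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])
X -= X.mean(axis=0)
C = (X.T @ X) / len(X)

eigvals, eigvecs = np.linalg.eigh(C)         # ascending eigenvalues
w1, w2 = eigvecs[:, -1], eigvecs[:, -2]
lam2 = eigvals[-2]

assert np.isclose(w1 @ w2, 0.0, atol=1e-12)  # eigenvectors are orthogonal
assert np.isclose(w2 @ C @ w2, lam2)         # variance along w2 is lambda_2

# No unit direction orthogonal to w1 beats lambda_2.
for _ in range(1000):
    w = rng.normal(size=3)
    w -= (w @ w1) * w1                       # remove the w1 component
    w /= np.linalg.norm(w)
    assert w @ C @ w <= lam2 + 1e-12
```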
Conclusion
The first \(k\) principal components are the eigenvectors of \(C\) corresponding to its \(k\) largest eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k\), and the variance captured along the \(j\)-th component equals \(\lambda_j\).
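Putting the pieces together, a minimal eigendecomposition-based PCA might look as follows; this is a sketch under the definitions above, with hypothetical function and variable names, not a reference implementation.

```python
import numpy as np

def pca(X: np.ndarray, k: int):
    """Top-k principal components of X (rows = observations) via the
    covariance eigenproblem, plus the variance captured by each."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = (Xc.T @ Xc) / len(Xc)               # sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)    # ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest
    return eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5])

W, lams = pca(X, k=3)
Xc = X - X.mean(axis=0)

# Variance along the j-th component equals the j-th eigenvalue.
for j in range(3):
    assert np.isclose(np.mean((Xc @ W[:, j]) ** 2), lams[j])
```

In practice the same components are usually computed from the singular value decomposition of the centered data matrix, which avoids forming \(C\) explicitly and is numerically more stable: if the centered data matrix has SVD \(U \Sigma V^\top\), the right singular vectors are exactly the eigenvectors of \(C\), with \(\lambda_j = \sigma_j^2 / n\).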