Covariance Matrix
Motivation
A single random variable has a variance, a measure of how spread out its values are. When you have multiple variables (features), you also need to know how they co-vary: does knowing that feature \(j\) is high tell you anything about feature \(k\)? The covariance matrix encodes all pairwise linear relationships in a dataset in a single matrix (Hastie et al. 2009). It is the central object in principal component analysis: PCA finds the directions of greatest spread in the data, and those directions are exactly the eigenvectors of the covariance matrix.
Variance and Covariance
Variance of a random variable \(X\) with mean \(\mu = \mathbb{E}[X]\):
\[ \operatorname{Var}(X) = \mathbb{E}[(X - \mu)^2]. \]
Covariance of two random variables \(X\) and \(Y\):
\[ \operatorname{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]. \]
- \(\operatorname{Cov}(X, Y) > 0\): when \(X\) is above its mean, \(Y\) tends to be above its mean.
- \(\operatorname{Cov}(X, Y) < 0\): when \(X\) is above its mean, \(Y\) tends to be below its mean.
- \(\operatorname{Cov}(X, Y) = 0\): the variables are uncorrelated (no linear relationship).
Note \(\operatorname{Cov}(X, X) = \operatorname{Var}(X)\).
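These definitions translate directly into a few lines of code. Below is a minimal sketch in Python with NumPy, estimating variance and covariance from paired samples with the same \(\frac{1}{n}\) averaging as the formulas above; the arrays are made-up illustrative data.

```python
import numpy as np

# A minimal sketch: estimating Var(X) and Cov(X, Y) from paired samples,
# using the same 1/n averaging as the formulas above. Data is made up.
x = np.array([1.0, 2.0, 4.0, 5.0])
y = np.array([3.0, 1.0, 5.0, 7.0])

mu_x, mu_y = x.mean(), y.mean()

var_x = np.mean((x - mu_x) ** 2)            # E[(X - mu_X)^2]
cov_xy = np.mean((x - mu_x) * (y - mu_y))   # E[(X - mu_X)(Y - mu_Y)]

print(var_x)    # 2.5
print(cov_xy)   # 3.0  (positive: X and Y tend to move together)
```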
Definition
For a random vector \(\mathbf{x} = (X_1, \ldots, X_d)^\top \in \mathbb{R}^d\) with mean \(\boldsymbol{\mu} = \mathbb{E}[\mathbf{x}]\), the covariance matrix is
\[ \Sigma = \mathbb{E}\!\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top\right] \in \mathbb{R}^{d \times d}. \]
Entry \((i, j)\) is \(\Sigma_{ij} = \operatorname{Cov}(X_i, X_j)\). The diagonal entries are variances; the off-diagonal entries are covariances.
\[ \Sigma = \begin{pmatrix} \operatorname{Var}(X_1) & \operatorname{Cov}(X_1, X_2) & \cdots & \operatorname{Cov}(X_1, X_d) \\ \operatorname{Cov}(X_2, X_1) & \operatorname{Var}(X_2) & \cdots & \operatorname{Cov}(X_2, X_d) \\ \vdots & & \ddots & \vdots \\ \operatorname{Cov}(X_d, X_1) & \cdots & \cdots & \operatorname{Var}(X_d) \end{pmatrix}. \]
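To make the expectation concrete, here is a small sketch for a discrete random vector: \(\Sigma\) is computed as a probability-weighted sum of outer products \((\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top\). The outcomes and probabilities below are invented for illustration.

```python
import numpy as np

# Sketch of the definition for a discrete random vector:
# Sigma = E[(x - mu)(x - mu)^T], computed as a probability-weighted
# sum of outer products. Outcomes and probabilities are illustrative.
outcomes = np.array([[0.0, 0.0],
                     [1.0, 2.0],
                     [2.0, 1.0]])       # possible values of the random vector
probs = np.array([0.5, 0.25, 0.25])     # their probabilities (sum to 1)

mu = probs @ outcomes                   # E[x]
centered = outcomes - mu
Sigma = sum(p * np.outer(c, c) for p, c in zip(probs, centered))

print(mu)
print(Sigma)    # symmetric d x d matrix; diagonal entries are variances
```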
Properties
Symmetric: \(\Sigma^\top = \Sigma\), because \(\operatorname{Cov}(X_i, X_j) = \operatorname{Cov}(X_j, X_i)\).
Positive semidefinite: for any vector \(\mathbf{v} \in \mathbb{R}^d\),
\[ \mathbf{v}^\top \Sigma \mathbf{v} = \operatorname{Var}(\mathbf{v}^\top \mathbf{x}) \geq 0. \]
This means all eigenvalues of \(\Sigma\) are non-negative. Because \(\Sigma\) is also symmetric, the spectral theorem applies: \(\Sigma\) has an orthonormal eigenbasis, and this is exactly the structure PCA exploits.
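Both properties are easy to check numerically. The sketch below estimates a covariance matrix from arbitrary random data and verifies symmetry and non-negative eigenvalues (up to floating-point error); the data and seed are arbitrary.

```python
import numpy as np

# Numerical check of the two properties on made-up data: a covariance
# matrix is symmetric and its eigenvalues are non-negative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # 200 samples, 3 features

Sigma_hat = np.cov(X, rowvar=False)     # sample covariance matrix (3 x 3)

print(np.allclose(Sigma_hat, Sigma_hat.T))    # True: symmetric
eigenvalues = np.linalg.eigvalsh(Sigma_hat)   # eigvalsh: for symmetric matrices
print(np.all(eigenvalues >= -1e-12))          # True: positive semidefinite
```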
Example: Two Variables
Suppose we observe \(n = 4\) data points:
| \(X_1\) | \(X_2\) |
|---|---|
| 2 | 1 |
| 4 | 3 |
| 6 | 5 |
| 8 | 7 |
Means: \(\bar{X}_1 = 5\), \(\bar{X}_2 = 4\). Centered values:
| \(X_1 - 5\) | \(X_2 - 4\) |
|---|---|
| \(-3\) | \(-3\) |
| \(-1\) | \(-1\) |
| \(1\) | \(1\) |
| \(3\) | \(3\) |
\[ \operatorname{Var}(X_1) = \frac{9 + 1 + 1 + 9}{4} = 5, \quad \operatorname{Var}(X_2) = \frac{9 + 1 + 1 + 9}{4} = 5, \]
\[ \operatorname{Cov}(X_1, X_2) = \frac{(-3)(-3) + (-1)(-1) + (1)(1) + (3)(3)}{4} = \frac{20}{4} = 5. \]
\[ \Sigma = \begin{pmatrix} 5 & 5 \\ 5 & 5 \end{pmatrix}. \]
The off-diagonal entry equals \(\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}\), so the correlation is exactly 1: \(X_1\) and \(X_2\) move perfectly together here.
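The same computation in NumPy, as a sketch that mirrors the hand calculation (`bias=True` makes `np.cov` use the \(\frac{1}{n}\) normalization used above):

```python
import numpy as np

# Reproducing the worked example: center the data, then average the
# outer products via (1/n) * X~^T X~.
data = np.array([[2.0, 1.0],
                 [4.0, 3.0],
                 [6.0, 5.0],
                 [8.0, 7.0]])

centered = data - data.mean(axis=0)           # subtract column means (5, 4)
Sigma = (centered.T @ centered) / len(data)   # (1/n) * X~^T X~

print(Sigma)                                  # [[5. 5.]
                                              #  [5. 5.]]
print(np.cov(data, rowvar=False, bias=True))  # same result via numpy's helper
```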
Sample Covariance Matrix
In practice we do not know the true distribution; we estimate \(\Sigma\) from \(n\) data points \(\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d\).
Step 1. Center the data. Compute the sample mean \(\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\) and subtract it:
\[ \tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}. \]
Step 2. Form the data matrix. Stack the centered points as rows: \(\tilde{X} \in \mathbb{R}^{n \times d}\).
Step 3. Compute the sample covariance matrix:
\[ \hat{\Sigma} = \frac{1}{n} \tilde{X}^\top \tilde{X}. \]
This is a \(d \times d\) matrix. Entry \((i, j)\) is the sample covariance between feature \(i\) and feature \(j\). (Some texts divide by \(n - 1\) instead of \(n\) to obtain an unbiased estimate; the difference is negligible for large \(n\).)
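The three steps fit naturally into a small function. The sketch below (the function name and test data are my own choices) checks the result against NumPy's `np.cov`, using `bias=True` since `np.cov` divides by \(n - 1\) by default.

```python
import numpy as np

# A sketch of the three-step recipe as a reusable function.
def sample_covariance(X):
    """X: (n, d) array with one data point per row. Returns a (d, d) matrix."""
    X_centered = X - X.mean(axis=0)            # Step 1: center the data
    n = X_centered.shape[0]                    # Step 2: X_centered is the data matrix
    return (X_centered.T @ X_centered) / n     # Step 3: (1/n) * X~^T X~

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))                  # made-up data: n=500, d=4
print(np.allclose(sample_covariance(X),
                  np.cov(X, rowvar=False, bias=True)))   # True
```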
The Link to PCA
The eigenvectors of \(\Sigma\) (or \(\hat{\Sigma}\)) are the principal components: the orthogonal directions of greatest variance in the data. The corresponding eigenvalues are the variances along those directions. PCA projects the data onto the top \(k\) eigenvectors, achieving dimensionality reduction with the smallest possible reconstruction error among rank-\(k\) linear projections (a consequence of the Eckart-Young theorem).
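A minimal PCA sketch under these conventions: center the data, form the sample covariance matrix, eigendecompose it, and project onto the top \(k\) eigenvectors. The data here is random and only for illustration.

```python
import numpy as np

# Minimal PCA via the eigendecomposition of the sample covariance matrix.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated made-up data

X_centered = X - X.mean(axis=0)
Sigma_hat = (X_centered.T @ X_centered) / len(X)

eigenvalues, eigenvectors = np.linalg.eigh(Sigma_hat)     # ascending order
order = np.argsort(eigenvalues)[::-1]                      # sort descending
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
components = eigenvectors[:, :k]        # top-k principal directions (columns)
X_reduced = X_centered @ components     # project onto them: shape (n, k)

print(eigenvalues)                      # variances along each principal direction
print(X_reduced.shape)                  # (300, 2)
```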