Covariance Matrix
Motivation
A single random variable has a variance, a measure of how spread out its values are. When you have multiple variables (features), you also need to know how they co-vary: does knowing that feature \(j\) is high tell you anything about feature \(k\)? The covariance matrix encodes all pairwise linear relationships in a dataset in a single matrix (Hastie et al. 2009). It is the central object in principal component analysis: PCA finds the directions of greatest spread in the data, and those directions are exactly the eigenvectors of the covariance matrix.
Variance and Covariance
Variance of a random variable \(X\) with mean \(\mu = \mathbb{E}[X]\):
\[ \operatorname{Var}(X) = \mathbb{E}[(X - \mu)^2]. \]
Covariance of two random variables \(X\) and \(Y\):
\[ \operatorname{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]. \]
- \(\operatorname{Cov}(X, Y) > 0\): when \(X\) is above its mean, \(Y\) tends to be above its mean.
- \(\operatorname{Cov}(X, Y) < 0\): when \(X\) is above its mean, \(Y\) tends to be below its mean.
- \(\operatorname{Cov}(X, Y) = 0\): the variables are uncorrelated (no linear relationship).
Note \(\operatorname{Cov}(X, X) = \operatorname{Var}(X)\).
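These definitions translate directly into a few lines of code. Below is a minimal sketch in Python with NumPy, estimating variance and covariance from paired samples with the same \(\frac{1}{n}\) averaging as the formulas above; the arrays are made-up illustrative data.

```python
import numpy as np

# A minimal sketch: estimating Var(X) and Cov(X, Y) from paired samples,
# using the same 1/n averaging as the formulas above. Data is made up.
x = np.array([1.0, 2.0, 4.0, 5.0])
y = np.array([3.0, 1.0, 5.0, 7.0])

mu_x, mu_y = x.mean(), y.mean()

var_x = np.mean((x - mu_x) ** 2)            # E[(X - mu_X)^2]
cov_xy = np.mean((x - mu_x) * (y - mu_y))   # E[(X - mu_X)(Y - mu_Y)]

print(var_x)    # 2.5
print(cov_xy)   # 3.0  (positive: X and Y tend to move together)
```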
Definition
For a random vector \(\mathbf{x} = (X_1, \ldots, X_d)^\top \in \mathbb{R}^d\) with mean \(\boldsymbol{\mu} = \mathbb{E}[\mathbf{x}]\), the covariance matrix is
\[ \Sigma = \mathbb{E}\!\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top\right] \in \mathbb{R}^{d \times d}. \]
Entry \((i, j)\) is \(\Sigma_{ij} = \operatorname{Cov}(X_i, X_j)\). The diagonal entries are variances; the off-diagonal entries are covariances.
\[ \Sigma = \begin{pmatrix} \operatorname{Var}(X_1) & \operatorname{Cov}(X_1, X_2) & \cdots & \operatorname{Cov}(X_1, X_d) \\ \operatorname{Cov}(X_2, X_1) & \operatorname{Var}(X_2) & \cdots & \operatorname{Cov}(X_2, X_d) \\ \vdots & & \ddots & \vdots \\ \operatorname{Cov}(X_d, X_1) & \cdots & \cdots & \operatorname{Var}(X_d) \end{pmatrix}. \]
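To make the expectation concrete, here is a small sketch for a discrete random vector: \(\Sigma\) is computed as a probability-weighted sum of outer products \((\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top\). The outcomes and probabilities below are invented for illustration.

```python
import numpy as np

# Sketch of the definition for a discrete random vector:
# Sigma = E[(x - mu)(x - mu)^T], computed as a probability-weighted
# sum of outer products. Outcomes and probabilities are illustrative.
outcomes = np.array([[0.0, 0.0],
                     [1.0, 2.0],
                     [2.0, 1.0]])       # possible values of the random vector
probs = np.array([0.5, 0.25, 0.25])     # their probabilities (sum to 1)

mu = probs @ outcomes                   # E[x]
centered = outcomes - mu
Sigma = sum(p * np.outer(c, c) for p, c in zip(probs, centered))

print(mu)
print(Sigma)    # symmetric d x d matrix; diagonal entries are variances
```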
Properties
Symmetric: \(\Sigma^\top = \Sigma\), because \(\operatorname{Cov}(X_i, X_j) = \operatorname{Cov}(X_j, X_i)\).
Positive semidefinite: for any vector \(\mathbf{v} \in \mathbb{R}^d\),
\[ \mathbf{v}^\top \Sigma \mathbf{v} = \operatorname{Var}(\mathbf{v}^\top \mathbf{x}) \geq 0. \]
This means all eigenvalues of \(\Sigma\) are non-negative. Because \(\Sigma\) is also symmetric, the spectral theorem applies: \(\Sigma\) has an orthonormal eigenbasis, and this is exactly the structure PCA exploits.
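Both properties are easy to check numerically. The sketch below estimates a covariance matrix from arbitrary random data and verifies symmetry and non-negative eigenvalues (up to floating-point error); the data and seed are arbitrary.

```python
import numpy as np

# Numerical check of the two properties on made-up data: a covariance
# matrix is symmetric and its eigenvalues are non-negative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # 200 samples, 3 features

Sigma_hat = np.cov(X, rowvar=False)     # sample covariance matrix (3 x 3)

print(np.allclose(Sigma_hat, Sigma_hat.T))    # True: symmetric
eigenvalues = np.linalg.eigvalsh(Sigma_hat)   # eigvalsh: for symmetric matrices
print(np.all(eigenvalues >= -1e-12))          # True: positive semidefinite
```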
Example: Two Variables
Suppose we observe \(n = 4\) data points:
| \(X_1\) | \(X_2\) |
|---|---|
| 2 | 1 |
| 4 | 3 |
| 6 | 5 |
| 8 | 7 |
Means: \(\bar{X}_1 = 5\), \(\bar{X}_2 = 4\). Centered values:
| \(X_1 - 5\) | \(X_2 - 4\) |
|---|---|
| \(-3\) | \(-3\) |
| \(-1\) | \(-1\) |
| \(1\) | \(1\) |
| \(3\) | \(3\) |
\[ \operatorname{Var}(X_1) = \frac{9 + 1 + 1 + 9}{4} = 5, \quad \operatorname{Var}(X_2) = \frac{9 + 1 + 1 + 9}{4} = 5, \]
\[ \operatorname{Cov}(X_1, X_2) = \frac{(-3)(-3) + (-1)(-1) + (1)(1) + (3)(3)}{4} = \frac{20}{4} = 5. \]
\[ \Sigma = \begin{pmatrix} 5 & 5 \\ 5 & 5 \end{pmatrix}. \]
The off-diagonal entry equals \(\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}\), so the correlation is exactly 1: \(X_1\) and \(X_2\) move perfectly together here.
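The same computation in NumPy, as a sketch that mirrors the hand calculation (`bias=True` makes `np.cov` use the \(\frac{1}{n}\) normalization used above):

```python
import numpy as np

# Reproducing the worked example: center the data, then average the
# outer products via (1/n) * X~^T X~.
data = np.array([[2.0, 1.0],
                 [4.0, 3.0],
                 [6.0, 5.0],
                 [8.0, 7.0]])

centered = data - data.mean(axis=0)           # subtract column means (5, 4)
Sigma = (centered.T @ centered) / len(data)   # (1/n) * X~^T X~

print(Sigma)                                  # [[5. 5.]
                                              #  [5. 5.]]
print(np.cov(data, rowvar=False, bias=True))  # same result via numpy's helper
```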
Sample Covariance Matrix
In practice we do not know the true distribution; we estimate \(\Sigma\) from \(n\) data points \(\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d\).
Step 1. Center the data. Compute the sample mean \(\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\) and subtract it:
\[ \tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}. \]
Step 2. Form the data matrix. Stack the centered points as rows: \(\tilde{X} \in \mathbb{R}^{n \times d}\).
Step 3. Compute the sample covariance matrix:
\[ \hat{\Sigma} = \frac{1}{n} \tilde{X}^\top \tilde{X}. \]
This is a \(d \times d\) matrix. Entry \((i, j)\) is the sample covariance between feature \(i\) and feature \(j\). (Some texts divide by \(n - 1\) instead of \(n\) to obtain an unbiased estimate; the difference is negligible for large \(n\).)
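The three steps fit naturally into a small function. The sketch below (the function name and test data are my own choices) checks the result against NumPy's `np.cov`, using `bias=True` since `np.cov` divides by \(n - 1\) by default.

```python
import numpy as np

# A sketch of the three-step recipe as a reusable function.
def sample_covariance(X):
    """X: (n, d) array with one data point per row. Returns a (d, d) matrix."""
    X_centered = X - X.mean(axis=0)            # Step 1: center the data
    n = X_centered.shape[0]                    # Step 2: X_centered is the data matrix
    return (X_centered.T @ X_centered) / n     # Step 3: (1/n) * X~^T X~

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))                  # made-up data: n=500, d=4
print(np.allclose(sample_covariance(X),
                  np.cov(X, rowvar=False, bias=True)))   # True
```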
The Link to PCA
The eigenvectors of \(\Sigma\) (or \(\hat{\Sigma}\)) are the principal components: the orthogonal directions of greatest variance in the data. The corresponding eigenvalues are the variances along those directions. PCA projects the data onto the top \(k\) eigenvectors, achieving dimensionality reduction with the smallest possible reconstruction error among rank-\(k\) linear projections (a consequence of the Eckart-Young theorem).
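A minimal PCA sketch under these conventions: center the data, form the sample covariance matrix, eigendecompose it, and project onto the top \(k\) eigenvectors. The data here is random and only for illustration.

```python
import numpy as np

# Minimal PCA via the eigendecomposition of the sample covariance matrix.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated made-up data

X_centered = X - X.mean(axis=0)
Sigma_hat = (X_centered.T @ X_centered) / len(X)

eigenvalues, eigenvectors = np.linalg.eigh(Sigma_hat)     # ascending order
order = np.argsort(eigenvalues)[::-1]                      # sort descending
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
components = eigenvectors[:, :k]        # top-k principal directions (columns)
X_reduced = X_centered @ components     # project onto them: shape (n, k)

print(eigenvalues)                      # variances along each principal direction
print(X_reduced.shape)                  # (300, 2)
```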