Low-Rank Approximation

Motivation

Many matrices in practice — images, ratings, term-document counts, covariance estimates — are large but have most of their “energy” concentrated in a small number of singular values. The matrix is approximately the sum of a few simple rank-one outer products plus a small remainder. A low-rank approximation replaces the original matrix \(A \in \mathbb{R}^{m \times n}\) by a matrix \(B\) of rank at most \(k\) chosen to minimize approximation error (Eckart and Young 1936; Trefethen and Bau 1997).

The payoffs:

  • Compression. Storing a rank-\(k\) matrix takes \(k(m + n)\) numbers instead of \(mn\).
  • Compute. Multiplying a vector by a rank-\(k\) matrix costs \(O(k(m + n))\) operations instead of \(O(mn)\).
  • Denoising. Truncating small singular values removes directions dominated by noise.
  • Structural insight. The top-\(k\) singular vectors capture the dominant patterns.

Low-rank approximation is the engine behind PCA, latent semantic indexing, matrix completion, and many compressed neural-network layers.

Setup

Let \(A \in \mathbb{R}^{m \times n}\) with rank \(r\). Its singular value decomposition is

\[ A = U \Sigma V^\top = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^\top, \]

with \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\), left singular vectors \(\mathbf{u}_i \in \mathbb{R}^m\) orthonormal, and right singular vectors \(\mathbf{v}_i \in \mathbb{R}^n\) orthonormal.

For a target rank \(k \leq r\), the rank-\(k\) truncation is

\[ A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top = U_k \Sigma_k V_k^\top, \]

where \(U_k\), \(V_k\) contain the first \(k\) singular vectors and \(\Sigma_k = \operatorname{diag}(\sigma_1, \ldots, \sigma_k)\).
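
In code, forming \(A_k\) takes only a few lines. A minimal sketch with NumPy (the helper name `truncate_svd` is ours, not a library routine):

```python
import numpy as np

def truncate_svd(A, k):
    """Rank-k truncation A_k of A via the SVD."""
    # numpy returns singular values in descending order, so the
    # first k triples are exactly U_k, Sigma_k, V_k.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```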

The Eckart–Young Theorem

The rank-\(k\) truncation \(A_k\) is the best rank-\(k\) approximation to \(A\) in both the Frobenius norm and the spectral (operator \(\ell_2\)) norm:

\[ \|A - A_k\|_F = \min_{\operatorname{rank}(B) \leq k} \|A - B\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}, \]

\[ \|A - A_k\|_2 = \min_{\operatorname{rank}(B) \leq k} \|A - B\|_2 = \sigma_{k+1}. \]

This is the Eckart–Young theorem. It says two things at once:

  1. SVD truncation is optimal — no other rank-\(k\) matrix is closer to \(A\) in these norms.
  2. The error is determined by the discarded singular values alone.

The second point is the practical one. If \(\sigma_{k+1}, \ldots, \sigma_r\) are small, truncation is nearly lossless. If they are not, no rank-\(k\) approximation will be good — the matrix is genuinely high-rank.
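
Both identities are easy to check numerically. A quick sketch on a random matrix (assuming NumPy; `np.linalg.norm` gives the Frobenius norm with `"fro"` and the spectral norm with `2`):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error = root-sum-of-squares of the discarded tail ...
assert np.isclose(np.linalg.norm(A - A_k, "fro"), np.sqrt(np.sum(s[k:] ** 2)))
# ... and spectral error = the first discarded singular value.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```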

When Low Rank Pays Off

Whether truncation is useful depends on how fast the singular values decay. Three regimes:

  • Fast decay: \(\sigma_1\) dominates and the tail is negligible; a small \(k\) works.
  • Moderate decay: the tail is nontrivial; accuracy trades off against \(k\).
  • Flat spectrum: all \(\sigma_i\) are of similar size; no useful low-rank approximation exists.

The first regime is the home territory of low-rank methods: a few directions explain almost everything. The third is the wrong tool — the matrix is essentially full-rank and any rank-\(k\) approximation throws away a constant fraction of the energy.

A useful heuristic: plot the singular values on a log scale. If they drop off cleanly, low-rank approximation is appropriate; if they form a flat or gently sloping line, it is not.
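
A minimal version of that diagnostic, assuming NumPy and Matplotlib (the Gaussian matrix here is a stand-in that deliberately shows the flat, wrong-tool regime; substitute your own \(A\)):

```python
import numpy as np
import matplotlib.pyplot as plt

A = np.random.default_rng(0).standard_normal((200, 100))  # stand-in matrix
s = np.linalg.svd(A, compute_uv=False)  # singular values only, descending

plt.semilogy(np.arange(1, s.size + 1), s, marker=".")
plt.xlabel("index i")
plt.ylabel("singular value (log scale)")
plt.show()
```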

Worked Example: A \(4 \times 4\) Matrix

Take

\[ A = \begin{pmatrix} 3 & 2 & 2 & 0 \\ 2 & 3 & 0 & 0 \\ 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.01 \end{pmatrix}. \]

Its singular values are approximately \(\sigma_1 \approx 5.40\), \(\sigma_2 \approx 1.82\), \(\sigma_3 \approx 1.22\), and \(\sigma_4 = 0.01\). The Frobenius norm satisfies \(\|A\|_F^2 = \sum_i \sigma_i^2 \approx 34.0\).

  • Rank-3 truncation discards \(\sigma_4 = 0.01\). Approximation error in Frobenius norm: \(0.01\). The fourth row/column is essentially zero noise.
  • Rank-2 truncation discards \(\sigma_3, \sigma_4\). Error: \(\sqrt{1.22^2 + 0.01^2} \approx 1.22\). Captures roughly \(96\%\) of the squared Frobenius norm.
  • Rank-1 truncation captures only the top direction. Error: \(\sqrt{1.82^2 + 1.22^2 + 0.01^2} \approx 2.19\).

The right choice of \(k\) depends on the tolerable error. If the tolerance is \(0.01\), \(k = 3\) saves nothing: storing the factors takes \(3 \cdot (4 + 4) = 24\) numbers, more than the \(16\) entries of \(A\). If roughly \(1.2\) units of error are acceptable, \(k = 2\) needs \(2 \cdot (4 + 4) = 16\) numbers, break-even here, but with larger \(m, n\) the savings grow as \(O(k(m + n))\) vs. \(O(mn)\).
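
The numbers above can be reproduced in a few lines (assuming NumPy):

```python
import numpy as np

A = np.array([[3, 2, 2, 0],
              [2, 3, 0, 0],
              [2, 0, 0, 0],
              [0, 0, 0, 0.01]])

s = np.linalg.svd(A, compute_uv=False)
print(s)              # ~ [5.40, 1.82, 1.22, 0.01]
print(np.sum(s**2))   # ~ 34.0 = ||A||_F^2
for k in (3, 2, 1):   # Frobenius error of the rank-k truncation
    print(k, np.sqrt(np.sum(s[k:] ** 2)))  # ~ 0.01, 1.22, 2.19
```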

Beyond the Frobenius Norm

The Eckart–Young theorem covers Frobenius and spectral norms. For other criteria the optimal low-rank matrix is not the SVD truncation:

  • Robust PCA (\(\ell_1\) loss). Splits \(A = L + S\) into low-rank \(L\) and sparse \(S\). Useful when corrupted entries should be modeled as outliers, not absorbed into \(L\).
  • Nonnegative matrix factorization (NMF). Imposes \(A \approx W H\) with \(W, H \geq 0\). Loses optimality but gains interpretability — factors look like “parts” rather than signed eigenvectors.
  • Weighted low-rank (\(\sum_{ij} w_{ij} (A_{ij} - B_{ij})^2\)). Standard SVD truncation is not optimal when weights vary. Solving the weighted version is nonconvex in general; alternating least squares is the practical workhorse.
  • Matrix completion. Most entries of \(A\) are unobserved; find a low-rank matrix that matches the observed entries. Closely related to weighted low-rank with weights \(w_{ij} \in \{0, 1\}\); a minimal ALS sketch follows this list.
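
To make the last two items concrete, here is a minimal alternating-least-squares sketch for completion (equivalently, weighted low-rank with \(0/1\) weights). The function name, ridge regularization, and iteration count are our illustrative choices, not a standard API:

```python
import numpy as np

def als_complete(A, mask, k, n_iters=50, reg=1e-2, seed=0):
    """Fit a rank-k model U @ V.T to the observed entries of A.

    A    : (m, n) array; only entries where mask is True are used.
    mask : boolean (m, n) array marking observed entries.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    for _ in range(n_iters):
        # With V fixed, each row of U solves a small ridge regression
        # over that row's observed entries; then symmetrically for V.
        for i in range(m):
            Vo = V[mask[i]]
            U[i] = np.linalg.solve(Vo.T @ Vo + reg * np.eye(k),
                                   Vo.T @ A[i, mask[i]])
        for j in range(n):
            Uo = U[mask[:, j]]
            V[j] = np.linalg.solve(Uo.T @ Uo + reg * np.eye(k),
                                   Uo.T @ A[mask[:, j], j])
    return U, V  # low-rank estimate: U @ V.T
```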

The SVD is the right tool for the standard problem; the variations require their own algorithms.

Computation

For dense matrices, the full SVD takes \(O(mn \min(m, n))\) time and can be heavy for large \(m, n\). Common shortcuts when only \(A_k\) is needed:

  • Truncated SVD via Lanczos / ARPACK. Iterative algorithms that compute only the top \(k\) singular triples in roughly \(O(mn k)\) time.
  • Randomized SVD (Halko et al.). Project \(A\) onto a small random subspace, then SVD the projection. Runs in \(O(mn \log k + (m + n) k^2)\) with strong approximation guarantees. The current default for large matrices; a sketch follows this list.
  • Power iteration. Repeatedly multiply by \(A^\top A\) (or \(A A^\top\)) and orthogonalize. Slow to converge if the spectral gap is small but easy to implement.
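
A minimal randomized SVD in the spirit of Halko et al., assuming NumPy; the oversampling and power-iteration defaults are illustrative, not prescribed:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power=2, seed=0):
    """Approximate top-k SVD of A via random projection."""
    rng = np.random.default_rng(seed)
    # Sketch the range of A with a random Gaussian test matrix.
    Y = A @ rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(Y)
    # Power iterations tighten the subspace when the spectrum decays
    # slowly; re-orthonormalizing keeps the basis well conditioned.
    for _ in range(n_power):
        Q, _ = np.linalg.qr(A @ (A.T @ Q))
    # Exact SVD of the small projected matrix, then lift back.
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```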

When to Use Low-Rank Approximation

Use it when:

  • The singular values decay quickly.
  • You need to compress, accelerate, or denoise a large matrix.
  • The relevant signal in \(A\) is genuinely a few latent factors plus residuals.

Skip it when:

  • The spectrum is flat — rank truncation just loses information.
  • You need exact reconstruction of every entry.
  • The matrix has hard structural constraints (e.g., sparsity patterns) that SVD does not respect — consider NMF or structured matrix decompositions instead.

Low-rank approximation is one of the most reused ideas in numerical linear algebra. The combination of optimality (Eckart–Young), interpretability (singular vectors as patterns), and fast computation (randomized SVD) makes it a near-universal first attack on large matrices that look approximately structured.

References

Eckart, Carl, and Gale Young. 1936. “The Approximation of One Matrix by Another of Lower Rank.” Psychometrika 1 (3): 211–18. https://doi.org/10.1007/bf02288367.
Trefethen, Lloyd N., and David Bau III. 1997. Numerical Linear Algebra. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719574.