Matrix Completion
Motivation
Many real datasets arrive as a matrix with most entries missing. A user-by-movie ratings matrix has billions of cells, but only a tiny fraction of (user, movie) pairs have been rated. A protein-by-disease matrix has measured values for a few pairs and is unknown for the rest. A sensor network’s readings are only partially logged. The task of filling in the unobserved entries, guessing the missing values from the observed ones, is matrix completion (Candès and Recht 2009; Koren et al. 2009).
It looks impossible on its face: an unconstrained matrix can take any values in its missing entries, so any guess is as good as any other. The reason completion is solvable in practice is that real matrices are not arbitrary. They are approximately low rank: a handful of latent factors (genres a movie belongs to, biological pathways a drug touches) explain most of the variation. If the matrix is constrained to be low rank, very few entries actually need to be observed to pin it down.
The Setup
Let \(M \in \mathbb{R}^{m \times n}\) be a target matrix. Define an observation set \(\Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}\): the indices \((i, j)\) where the entry \(M_{ij}\) is observed. The remaining entries are unknown.
The matrix completion problem: given \(\{M_{ij} : (i,j) \in \Omega\}\), recover \(M\).
Without further structure this is hopeless. The standard assumption: \(M\) has rank at most \(r\) for some small \(r \ll \min(m, n)\). Then the goal is to find a rank-\(r\) matrix \(\hat M\) agreeing with the observations:
\[ \hat M_{ij} \approx M_{ij} \quad \text{for } (i, j) \in \Omega. \]
The Counting Argument
A rank-\(r\) matrix \(M = U V^\top\) with \(U \in \mathbb{R}^{m \times r}\) and \(V \in \mathbb{R}^{n \times r}\) has \(r(m + n)\) parameters, minus an \(r^2\)-dimensional ambiguity (replacing \(U, V\) with \(U G, V G^{-\top}\) for any invertible \(G \in \mathbb{R}^{r \times r}\) leaves \(U V^\top\) unchanged), so really \(r(m + n) - r^2\) degrees of freedom. The number of observations \(|\Omega|\) must at least match this; fewer constraints leave the system underdetermined.
For an \(m = n = 10{,}000\) matrix with \(r = 10\), the parameter count is \(\sim 200{,}000\), only about \(0.2\%\) of the \(10^8\) entries. Observing far less than \(1\%\) of the matrix is in principle enough.
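A quick sanity check of those numbers in code (a minimal sketch of the arithmetic, nothing more):

```python
# Parameter count of a rank-r factorization versus the number of entries.
m = n = 10_000
r = 10

dof = r * (m + n) - r**2        # degrees of freedom of U V^T, minus the r x r ambiguity
total = m * n                   # total entries in the matrix
print(dof, total, dof / total)  # 199900, 100000000, ~0.002 -> about 0.2% of the entries
```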
Whether you can recover \(M\) anywhere near this information-theoretic limit depends on which entries are observed. Adversarial sampling (e.g., observing only one row) gives no information about the rest of the matrix. Uniformly random sampling, on the other hand, suffices: a line of work starting with Candès and Recht showed that \(O(n r \log^2 n)\) random observations recover an \(n \times n\) rank-\(r\) matrix exactly, with high probability, provided the singular vectors are “incoherent”, i.e., not too aligned with any single coordinate (Candès and Recht 2009; Candès and Tao 2010).
Optimization Formulations
Rank-constrained least squares
The most direct formulation:
\[ \min_{\hat M} \; \sum_{(i,j) \in \Omega} (M_{ij} - \hat M_{ij})^2 \quad \text{subject to } \operatorname{rank}(\hat M) \leq r. \]
This is nonconvex because rank is. It is, however, solvable in practice via factorization: parameterize \(\hat M = U V^\top\) with \(U \in \mathbb{R}^{m \times r}\) and \(V \in \mathbb{R}^{n \times r}\) and minimize over \(U, V\). Local optimization (gradient descent, alternating least squares) finds good solutions even though the landscape is not convex.
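A minimal sketch of the factorized approach, using plain gradient descent on \(U\) and \(V\) (the step size, iteration count, and initialization scale below are arbitrary illustrative choices, not tuned values):

```python
import numpy as np

def factored_gradient_descent(M, mask, r, lr=0.01, steps=2000, seed=0):
    """Fit U V^T to the observed entries of M by gradient descent.

    M    : (m, n) array; values in unobserved cells are ignored
    mask : (m, n) boolean array, True where M_ij is observed
    """
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = 0.1 * rng.standard_normal((m, r))
    V = 0.1 * rng.standard_normal((n, r))
    for _ in range(steps):
        R = np.where(mask, U @ V.T - M, 0.0)          # residual on observed entries only
        U, V = U - lr * (R @ V), V - lr * (R.T @ U)   # simultaneous gradient step on both factors
    return U @ V.T                                    # completed matrix: every cell filled in
```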
Nuclear norm relaxation
The convex relaxation: replace rank by the nuclear norm \(\|\hat M\|_* = \sum_i \sigma_i(\hat M)\) (sum of singular values). This gives
\[ \min_{\hat M} \; \|\hat M\|_* \quad \text{subject to } \hat M_{ij} = M_{ij} \text{ for } (i,j) \in \Omega. \]
Nuclear norm is to rank what \(\ell_1\) norm is to sparsity: a convex proxy that often achieves the same minimizer. Under the incoherence and sampling conditions above, this convex problem provably recovers \(M\) exactly. The price is computational: nuclear-norm minimization is convex but slower than direct factorization.
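One common way to attack nuclear-norm problems in practice is iterative soft-thresholding of singular values (a soft-impute style iteration). The sketch below targets a penalized, noise-tolerant variant rather than the exact equality-constrained program above; the threshold \(\tau\) and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def soft_impute(M, mask, tau=1.0, iters=200):
    """Low-rank completion by repeated SVD with soft-thresholded singular values."""
    X = np.where(mask, M, 0.0)                 # initialize the holes with zeros
    for _ in range(iters):
        filled = np.where(mask, M, X)          # keep observed values, current guess elsewhere
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - tau, 0.0)           # shrink singular values; small ones vanish
        X = (U * s) @ Vt                       # reassemble a low-rank estimate
    return X
```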
Regularized variants
In practice the observed entries are also noisy, so the exact-equality constraint is replaced by a regularized least-squares fit:
\[ \min_{U, V} \; \sum_{(i,j) \in \Omega} (M_{ij} - (U V^\top)_{ij})^2 + \lambda (\|U\|_F^2 + \|V\|_F^2). \]
The Frobenius regularizer on \(U, V\) shrinks the factors and reduces overfitting on noisy observations. This is the standard formulation used in production recommender systems (Koren et al. 2009).
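This objective is also what makes alternating updates convenient: holding \(V\) fixed, the minimizer over a single row \(U_i\) is just a ridge regression on user \(i\)'s observed entries (and symmetrically for rows of \(V\)):

\[ U_i \leftarrow \Big( \sum_{j : (i,j) \in \Omega} V_j V_j^\top + \lambda I_r \Big)^{-1} \sum_{j : (i,j) \in \Omega} M_{ij} V_j. \]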
Worked Example: A Sparse Ratings Matrix
Five users, four movies, five-star scale. Observed entries are shown; unobserved entries are marked ?:
| | Movie 1 | Movie 2 | Movie 3 | Movie 4 |
|---|---|---|---|---|
| Alice | 5 | ? | 4 | 1 |
| Bob | ? | 1 | ? | 5 |
| Carol | 4 | 2 | ? | ? |
| Dave | ? | ? | 5 | 1 |
| Eve | 1 | 5 | 2 | 4 |
If we believe ratings are explained by two latent factors (say “action-y” and “drama-y”), we fit a rank-\(2\) factorization \(\hat M = U V^\top\). Alternating least squares iterates:
- Fix \(V\). For each user \(i\), solve a least-squares problem over the user’s observed entries to update row \(U_i\).
- Fix \(U\). For each movie \(j\), solve a least-squares problem over the movie’s observed entries to update row \(V_j\).
After convergence, \(\hat M_{ij} = U_i \cdot V_j\) for every \((i, j)\), including the unobserved cells, which are now predictions. With \(r = 2\) here, the factor matrices have \(5 \cdot 2 + 4 \cdot 2 = 18\) parameters (\(14\) after removing the \(r^2 = 4\) ambiguity), while only \(13\) entries are observed: without the regularizer the fit is underdetermined, illustrating why \(r\) has to be chosen conservatively when observations are scarce.
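A minimal end-to-end sketch of this procedure on the toy table above (\(\lambda\), the number of sweeps, and the random seed are arbitrary illustrative choices):

```python
import numpy as np

nan = np.nan
M = np.array([                 # the 5 x 4 ratings table above; nan marks unobserved cells
    [5.0, nan, 4.0, 1.0],      # Alice
    [nan, 1.0, nan, 5.0],      # Bob
    [4.0, 2.0, nan, nan],      # Carol
    [nan, nan, 5.0, 1.0],      # Dave
    [1.0, 5.0, 2.0, 4.0],      # Eve
])
mask = ~np.isnan(M)

r, lam, sweeps = 2, 0.1, 50
rng = np.random.default_rng(0)
U = rng.standard_normal((5, r))
V = rng.standard_normal((4, r))

for _ in range(sweeps):
    for i in range(5):         # fix V, update each user row on that user's observed movies
        obs = mask[i]
        A = V[obs].T @ V[obs] + lam * np.eye(r)
        U[i] = np.linalg.solve(A, V[obs].T @ M[i, obs])
    for j in range(4):         # fix U, update each movie row on that movie's observed users
        obs = mask[:, j]
        A = U[obs].T @ U[obs] + lam * np.eye(r)
        V[j] = np.linalg.solve(A, U[obs].T @ M[obs, j])

M_hat = U @ V.T                # every cell predicted, including the ? entries
print(np.round(M_hat, 2))
```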
What Can Go Wrong
- Rank misspecified. Too high: the model fits noise in the observed entries and predicts arbitrarily on the unobserved ones. Too low: it underfits the structure. Use validation on held-out observed entries to choose \(r\) (a minimal sketch follows this list).
- Sampling not random. If user-movie pairs are observed because users chose to rate certain movies, the sampling is informative — popular and polarizing movies are over-represented. This violates the random-sampling assumption and biases the completion. Tools: explicit modeling of the sampling mechanism (missing-not-at-random methods), inverse-propensity weighting, debiasing.
- Cold start. A user (or item) with zero observations gets no information for the corresponding row of \(U\) (or column of \(V\)). Some side information (demographics, item features) is needed to initialize.
- Incoherence violated. If the true factors are heavily concentrated on a few entries (e.g., one user is the only one to like a particular niche film), random sampling may miss the entries that pin those factors down.
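For the rank-selection point in the first bullet, a minimal sketch of held-out validation on the observed entries; `fit_completion` is a stand-in for any of the solvers sketched earlier (e.g. the ALS loop above), not a library function:

```python
import numpy as np

def holdout_split(mask, frac=0.2, seed=0):
    """Move a random fraction of the observed entries into a validation set."""
    rng = np.random.default_rng(seed)
    idx = np.argwhere(mask)
    rng.shuffle(idx)
    n_val = int(frac * len(idx))
    val = np.zeros_like(mask)
    val[tuple(idx[:n_val].T)] = True
    return mask & ~val, val            # (training mask, validation mask)

def choose_rank(M, mask, ranks, fit_completion):
    """Pick the rank whose completion best predicts the held-out observed entries."""
    train, val = holdout_split(mask)
    errors = {r: np.mean((fit_completion(M, train, r)[val] - M[val]) ** 2) for r in ranks}
    return min(errors, key=errors.get), errors
```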
Connections
Matrix completion sits at a crossroads:
- Low-rank approximation: standard low-rank approximation assumes all entries are observed and finds the best rank-\(k\) fit. Matrix completion is the same problem when only a subset of the entries is observed.
- Recommender systems: the foundational technique behind collaborative filtering for ratings prediction; the Netflix Prize winners used variants of matrix factorization.
- Compressed sensing: the same theoretical machinery (convex relaxations of nonconvex sparsity-like constraints) underlies both.
- Denoising: denoising assumes all entries are observed but noisy; matrix completion assumes most are missing. Both rely on the prior that the clean signal is low rank.
When to Use Matrix Completion
Use it when:
- The matrix is large and only sparsely observed.
- A low-rank assumption is plausible (most variation explained by a few latent factors).
- Observations are roughly random or can be modeled as such.
Do not use it when:
- Most entries are observed — direct regression or imputation is simpler.
- The matrix is genuinely full rank (e.g., outputs of an arbitrary function on \((i, j)\) pairs).
- The downstream task has rich side information that a dedicated model can exploit better.
The core idea is small and surprisingly powerful: a low-rank constraint turns a wildly underdetermined inference problem into a well-posed one, and the right convex relaxation (nuclear norm) or local optimization (alternating least squares) does the rest.