Dimensionality Reduction
Motivation
Real datasets often live in spaces with hundreds or thousands of features, but only a few directions actually carry information. Pixels in a face image are highly correlated; words in a document obey strong co-occurrence patterns; sensor readings drift together. Dimensionality reduction is the family of techniques that maps high-dimensional data \(\mathbf{x} \in \mathbb{R}^d\) to a lower-dimensional representation \(\mathbf{z} \in \mathbb{R}^k\) with \(k \ll d\), while preserving as much of the useful structure as possible (Hastie et al. 2009; Goodfellow et al. 2016).
Why reduce dimensions at all?
- Storage and compute. Downstream algorithms scale poorly with \(d\). A nearest-neighbor query over \(10^6\) points in \(\mathbb{R}^{1000}\) is expensive; in \(\mathbb{R}^{10}\) it is cheap.
- Statistical efficiency. Models with more parameters need more data. Reducing \(d\) before fitting can be the difference between an underdetermined system and a well-posed one.
- Visualization. Plotting requires \(k = 2\) or \(k = 3\). Reductions like t-SNE and UMAP exist primarily to make high-dimensional structure visible.
- Denoising. If signal lies on a \(k\)-dimensional manifold and noise spreads across all \(d\) directions, projecting onto that manifold suppresses the noise.
- Decorrelation. Some learners (linear regression, naive Bayes) behave better when input features are uncorrelated. Reductions that diagonalize the covariance fix this.
The Setup
A dimensionality reduction defines two maps:
\[ f : \mathbb{R}^d \to \mathbb{R}^k, \qquad g : \mathbb{R}^k \to \mathbb{R}^d. \]
The encoder \(f\) takes a high-dimensional point to its compressed representation \(\mathbf{z} = f(\mathbf{x})\). The decoder \(g\) optionally maps codes back to the original space, producing a reconstruction \(\hat{\mathbf{x}} = g(\mathbf{z})\). Not every method defines a decoder explicitly — visualization methods often only need \(f\).
The quality of the reduction is measured by how much of the structure of interest survives the round trip. For reconstruction-based methods, the standard objective is the mean squared error:
\[ \mathbb{E}\!\left[\|\mathbf{x} - g(f(\mathbf{x}))\|^2\right]. \]
For neighborhood-preserving methods, it is the distortion of pairwise distances or local neighborhoods.
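A minimal sketch of this setup, assuming NumPy. The projection basis \(W\) here is an arbitrary orthonormal matrix rather than one fitted to the data (a real method such as PCA would supply its own \(W\)); the point is only the shape of the encoder/decoder pair and the reconstruction-error estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 5, 500
X = rng.normal(size=(n, d))

# An arbitrary orthonormal basis W in R^{d x k} via QR.
# A fitted method (e.g. PCA) would choose W from the data instead.
W, _ = np.linalg.qr(rng.normal(size=(d, k)))

def f(x):          # encoder: R^d -> R^k
    return x @ W

def g(z):          # decoder: R^k -> R^d
    return z @ W.T

# Empirical estimate of E[||x - g(f(x))||^2]
mse = np.mean(np.sum((X - g(f(X))) ** 2, axis=1))
```

Note that `g(f(·))` is a projection: applying it twice gives the same result as applying it once, so all reconstruction error comes from the first round trip.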
Taxonomy
Dimensionality reductions split along two main axes.
Linear vs. nonlinear
Linear methods restrict \(f\) and \(g\) to linear (or affine) maps: \(f(\mathbf{x}) = W^\top \mathbf{x}\) for some \(W \in \mathbb{R}^{d \times k}\). Principal component analysis is the canonical example. Linear methods are fast, interpretable, and have closed-form solutions, but they cannot capture curved manifolds.
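The closed-form solution for PCA can be sketched in a few lines of NumPy (on synthetic data constructed to be nearly one-dimensional): \(W\) is the matrix of top eigenvectors of the sample covariance, the encoder is a matrix product, and the decoder is its transpose plus the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 3-D data that is mostly 1-dimensional plus small noise
t = rng.normal(size=(400, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + 0.05 * rng.normal(size=(400, 3))

# PCA in closed form: eigenvectors of the sample covariance matrix
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)      # ascending order
order = np.argsort(eigvals)[::-1]           # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 1
W = eigvecs[:, :k]                          # d x k
Z = Xc @ W                                  # encoder: f(x) = W^T (x - mean)
X_hat = Z @ W.T + X.mean(axis=0)            # decoder: g(z) = W z + mean

explained = eigvals[:k].sum() / eigvals.sum()
```

Because the data is nearly rank one, a single component captures almost all of the variance.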
Nonlinear methods allow arbitrary \(f\) and \(g\). Autoencoders parameterize both as neural networks. Manifold learning methods like t-SNE, UMAP, Isomap, and locally linear embedding define \(f\) implicitly through an optimization over the embedded coordinates. These can find structure that linear methods miss but lose closed-form solutions and interpretability.
Reconstruction vs. neighborhood preservation
Reconstruction-based methods (PCA, autoencoders) optimize \(\|\mathbf{x} - g(f(\mathbf{x}))\|\). They produce a decoder you can run, and the pipeline has a clear operational meaning: project, then reconstruct.
Neighborhood-preserving methods (t-SNE, UMAP, MDS) optimize how well pairwise distances or local neighborhoods are reproduced in the embedding. They typically have no decoder. They are useful for visualization but should not be used as a preprocessing step for downstream prediction, because the embedding distorts global geometry to preserve local structure.
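The simplest member of this family to write down is classical (metric) MDS, which embeds points given only their pairwise distances. A NumPy-only sketch on synthetic data (double-center the squared-distance matrix, then take the top eigenvectors); note there is no decoder back to the original space:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))

# Squared pairwise distances
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

# Classical MDS: double-center -D2/2, then eigendecompose
n = len(D2)
J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
B = -0.5 * J @ D2 @ J                       # Gram matrix of centered points
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]

k = 2
Z = eigvecs[:, order[:k]] * np.sqrt(np.maximum(eigvals[order[:k]], 0))
# Z is the 2-D embedding; there is no g mapping Z back to R^10.
```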
*Diagram: a manifold and three views.*
Common Methods
| Method | Type | Decoder | Strength | Weakness |
|---|---|---|---|---|
| PCA | Linear, reconstruction | Yes | Closed form, interpretable | Misses curved structure |
| Random projection | Linear, distance-preserving | Yes | Almost free; data-oblivious | Loose distortion bounds |
| Autoencoder | Nonlinear, reconstruction | Yes | Captures curved manifolds | Needs lots of data and tuning |
| t-SNE | Nonlinear, neighborhood | No | Strong cluster visualization | No decoder; global geometry distorted |
| UMAP | Nonlinear, neighborhood | Partial | Faster than t-SNE, better global structure | Sensitive to hyperparameters |
| Isomap / LLE | Nonlinear, neighborhood | No | Recovers known manifold geometries | Brittle to noise and topology |
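Random projection deserves a quick demonstration, because it is the one method in the table that never looks at the data. A Gaussian projection scaled by \(1/\sqrt{k}\) preserves squared distances in expectation (the Johnson–Lindenstrauss phenomenon); the sketch below, on synthetic data, checks how tightly the pairwise-distance ratios concentrate around 1:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 200, 1000, 100
X = rng.normal(size=(n, d))

# Data-oblivious Gaussian random projection, scaled so that
# squared distances are preserved in expectation
R = rng.normal(size=(d, k)) / np.sqrt(k)
Z = X @ R

def pdist2(A):
    # Squared pairwise distances via the Gram matrix
    sq = np.sum(A ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

orig = pdist2(X)
proj = pdist2(Z)
mask = orig > 1e-9                  # exclude self-distances
ratios = proj[mask] / orig[mask]
# ratios concentrate near 1: distances survive the 10x reduction
```

Larger \(k\) tightens the concentration; the classic bound needs \(k = O(\log n / \varepsilon^2)\) for distortion \(1 \pm \varepsilon\).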
Choosing \(k\)
The target dimension \(k\) is a hyperparameter. Strategies:
- Variance explained. For PCA-like methods, plot the eigenvalues against their index (a scree plot) or the cumulative explained-variance fraction \(\sum_{i \leq k} \lambda_i / \sum_i \lambda_i\) against \(k\), and pick the smallest \(k\) whose cumulative fraction exceeds a threshold (often 90% or 95%).
- Downstream task performance. Treat \(k\) like any other hyperparameter: try several values and pick the one that minimizes validation error of whatever model uses \(\mathbf{z}\).
- Visualization constraints. If the reduction is for plotting, \(k \in \{2, 3\}\) is forced.
- Memory or speed budget. Pick the largest \(k\) that fits the downstream constraint.
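The variance-explained rule reduces to a one-liner once the eigenvalues are in hand. A sketch with a hand-picked spectrum (the eigenvalues here are illustrative, not from any real dataset):

```python
import numpy as np

# Illustrative eigenvalue spectrum, sorted descending
eigvals = np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])

cumfrac = np.cumsum(eigvals) / eigvals.sum()
# Smallest k whose cumulative variance fraction reaches 95%
k = int(np.searchsorted(cumfrac, 0.95) + 1)
```

For this spectrum the cumulative fractions are roughly 0.56, 0.78, 0.89, 0.94, 0.98, 1.0, so the 95% rule selects \(k = 5\).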
When to Use Dimensionality Reduction
Useful when:
- Features are correlated, noisy, or redundant.
- A downstream model overfits because \(d\) is comparable to or larger than the sample size.
- You need to visualize or cluster high-dimensional points.
- You want to compress data for storage or transmission with controlled error.
Probably not useful when:
- Features are already few and meaningful (e.g., tabular data with 10 well-chosen columns).
- The downstream model is itself capable of representation learning (deep networks on raw images and text).
- You need exact reconstruction or interpretability of individual features.
Dimensionality reduction is an unsupervised preprocessing step. It does not know about the prediction target, so it can discard directions that happen to carry the signal. When labels exist, supervised feature selection or bias-variance diagnostics on the full model are often a better starting point.
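The failure mode is easy to reproduce. In the synthetic sketch below (NumPy only), the label depends entirely on a low-variance direction, while an independent high-variance nuisance direction dominates the spectrum; PCA to \(k = 1\) keeps the nuisance and throws the signal away:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
signal = rng.normal(size=n)             # low variance, carries the label
nuisance = 10.0 * rng.normal(size=n)    # high variance, label-irrelevant
X = np.column_stack([nuisance, signal])
y = (signal > 0).astype(int)            # target depends only on signal

# PCA to k=1 keeps the highest-variance direction: the nuisance
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
z = Xc @ Vt[0]                          # 1-D PCA code

# The retained code tracks the nuisance, not the signal
corr_signal = abs(np.corrcoef(z, signal)[0, 1])
corr_nuisance = abs(np.corrcoef(z, nuisance)[0, 1])
```

Any classifier fit on `z` can do no better than chance on `y`, which is exactly the risk the paragraph above describes.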