Dimensionality Reduction

Motivation

Real datasets often live in spaces with hundreds or thousands of features, but only a few directions actually carry information. Pixels in a face image are highly correlated; words in a document obey strong co-occurrence patterns; sensor readings drift together. Dimensionality reduction is the family of techniques that maps high-dimensional data \(\mathbf{x} \in \mathbb{R}^d\) to a lower-dimensional representation \(\mathbf{z} \in \mathbb{R}^k\) with \(k \ll d\), while preserving as much of the useful structure as possible (Hastie et al. 2009; Goodfellow et al. 2016).

Why reduce dimensions at all?

  • Storage and compute. Downstream algorithms scale poorly with \(d\). A nearest-neighbor query over \(10^6\) points in \(\mathbb{R}^{1000}\) is expensive; in \(\mathbb{R}^{10}\) it is cheap.
  • Statistical efficiency. Models with more parameters need more data. Reducing \(d\) before fitting can be the difference between an underdetermined system and a well-posed one.
  • Visualization. Plotting requires \(k = 2\) or \(k = 3\). Reductions like t-SNE and UMAP exist primarily to make high-dimensional structure visible.
  • Denoising. If signal lies on a \(k\)-dimensional manifold and noise spreads across all \(d\) directions, projecting onto that manifold suppresses the noise.
  • Decorrelation. Some learners (linear regression, naive Bayes) behave better when input features are uncorrelated. Reductions that diagonalize the covariance fix this.

The Setup

A dimensionality reduction defines two maps:

\[ f : \mathbb{R}^d \to \mathbb{R}^k, \qquad g : \mathbb{R}^k \to \mathbb{R}^d. \]

The encoder \(f\) takes a high-dimensional point to its compressed representation \(\mathbf{z} = f(\mathbf{x})\). The decoder \(g\) optionally maps codes back to the original space, producing a reconstruction \(\hat{\mathbf{x}} = g(\mathbf{z})\). Not every method defines a decoder explicitly — visualization methods often only need \(f\).

The quality of the reduction is measured by how much of the structure of interest survives the round trip. For reconstruction-based methods, the standard score is mean squared error:

\[ \mathbb{E}\!\left[\|\mathbf{x} - g(f(\mathbf{x}))\|^2\right]. \]

For neighborhood-preserving methods, it is the distortion of pairwise distances or local neighborhoods.
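To make the round trip concrete, the following sketch (assuming NumPy; the encoder/decoder pair here is a toy linear projection onto a random orthonormal basis, not any particular method discussed later) encodes a dataset, decodes it, and evaluates the mean squared reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                 # n points in R^d
k = 5
Q, _ = np.linalg.qr(rng.normal(size=(50, k)))  # d x k basis with orthonormal columns

f = lambda X: X @ Q        # encoder f : R^d -> R^k
g = lambda Z: Z @ Q.T      # decoder g : R^k -> R^d

Z = f(X)                   # compressed representation
X_hat = g(Z)               # reconstruction
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(f"reconstruction MSE: {mse:.3f}")
```

Any method that supplies both maps can be dropped into the same skeleton and scored the same way.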

Taxonomy

Dimensionality reductions split along two main axes.

Linear vs. nonlinear

Linear methods restrict \(f\) and \(g\) to linear (or affine) maps: \(f(\mathbf{x}) = W^\top \mathbf{x}\) for some \(W \in \mathbb{R}^{d \times k}\). Principal component analysis is the canonical example. Linear methods are fast, interpretable, and have closed-form solutions, but they cannot capture curved manifolds.

Nonlinear methods allow arbitrary \(f\) and \(g\). Autoencoders parameterize both as neural networks. Manifold learning methods like t-SNE, UMAP, Isomap, and locally linear embedding define \(f\) implicitly through an optimization over the embedded coordinates. These can find structure that linear methods miss but lose closed-form solutions and interpretability.

Reconstruction vs. neighborhood preservation

Reconstruction-based methods (PCA, autoencoders) optimize \(\|\mathbf{x} - g(f(\mathbf{x}))\|\). They produce a decoder you can run, and the latent codes have a clear meaning: project, then reconstruct.

Neighborhood-preserving methods (t-SNE, UMAP, MDS) optimize how well pairwise distances or local neighborhoods are reproduced in the embedding. They typically have no decoder. They are useful for visualization but should not be used as a preprocessing step for downstream prediction, because the embedding distorts global geometry to preserve local structure.
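A small visualization-only example, assuming scikit-learn is available, shows the asymmetry: t-SNE's `fit_transform` optimizes 2-D coordinates for the specific points it was given, and there is no decoder and no transform for new points afterwards.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 points in R^64
Z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Z.shape)                         # (1797, 2); plot Z colored by y
```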

Diagram: a manifold and three views

[Figure: left, data in ℝ³ lying on a curved manifold; center, PCA to ℝ² flattens the manifold; right, UMAP to ℝ² unrolls along the manifold.]

Common Methods

Method | Type | Decoder | Strength | Weakness
PCA | Linear, reconstruction | Yes | Closed form, interpretable | Misses curved structure
Random projection | Linear, distance-preserving | Yes | Almost free; data-oblivious | Loose distortion bounds
Autoencoder | Nonlinear, reconstruction | Yes | Captures curved manifolds | Needs lots of data and tuning
t-SNE | Nonlinear, neighborhood | No | Strong cluster visualization | No decoder; global geometry distorted
UMAP | Nonlinear, neighborhood | Partial | Faster than t-SNE, better global structure | Sensitive to hyperparameters
Isomap / LLE | Nonlinear, neighborhood | No | Recovers known manifold geometries | Brittle to noise and topology
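The "data-oblivious" entry for random projection is worth unpacking with a short sketch (assuming NumPy): the projection matrix is drawn without looking at the data at all, yet pairwise distances are approximately preserved in the Johnson–Lindenstrauss sense.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 50
X = rng.normal(size=(n, d))

R = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection, chosen before seeing X
Z = X @ R                                  # encode: R^d -> R^k

i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]),         # original distance
      np.linalg.norm(Z[i] - Z[j]))         # projected distance, close up to JL distortion
```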

Choosing \(k\)

The target dimension \(k\) is a hyperparameter. Strategies:

  • Variance explained. For PCA-like methods, plot the cumulative eigenvalue fraction \(\sum_{i \leq k} \lambda_i / \sum_i \lambda_i\) against \(k\) and pick the smallest \(k\) above a threshold (often 90% or 95%); the related scree plot of individual eigenvalues shows where the spectrum flattens out. A sketch follows this list.
  • Downstream task performance. Treat \(k\) like any other hyperparameter: try several values and pick the one that minimizes validation error of whatever model uses \(\mathbf{z}\).
  • Visualization constraints. If the reduction is for plotting, \(k \in \{2, 3\}\) is forced.
  • Memory or speed budget. Pick the largest \(k\) that fits the downstream constraint.
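A sketch of the variance-explained rule from the first bullet, assuming NumPy and eigenvalues derived from the singular values of a centered data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30)) @ rng.normal(size=(30, 30))
X = X - X.mean(axis=0)

eigvals = np.linalg.svd(X, compute_uv=False) ** 2   # proportional to PCA eigenvalues
cumulative = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(cumulative, 0.95)) + 1      # smallest k with >= 95% variance
print(k, cumulative[k - 1])
```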

When to Use Dimensionality Reduction

Useful when:

  • Features are correlated, noisy, or redundant.
  • A downstream model overfits because \(d\) is comparable to or larger than the sample size.
  • You need to visualize or cluster high-dimensional points.
  • You want to compress data for storage or transmission with controlled error.

Probably not useful when:

  • Features are already few and meaningful (e.g., tabular data with 10 well-chosen columns).
  • The downstream model is itself capable of representation learning (deep networks on raw images and text).
  • You need exact reconstruction or interpretability of individual features.

Dimensionality reduction is an unsupervised preprocessing step. It does not know about the prediction target, so it can discard directions that happen to carry the signal. When labels exist, supervised feature selection or bias-variance diagnostics on the full model are often a better starting point.

References

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. https://www.deeplearningbook.org/.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/.