Linear Dimensionality Reduction

Motivation

A linear dimensionality reduction maps data from \(\mathbb{R}^d\) to \(\mathbb{R}^k\) using a linear (or affine) transformation. The map is a matrix multiplication and nothing more. Despite this strong restriction, linear methods are workhorses: they have closed-form or convex solutions, scale to large datasets, are easy to interpret, and provide the right baseline before reaching for nonlinear methods (Hastie et al. 2009; Strang 2016).

The linear restriction is the source of both its strengths and its limitations. Curved manifolds — a Swiss roll, the surface of a sphere, the manifold of natural images — cannot be unrolled by a single matrix. But a great deal of real data is approximately linear once centered, and even when it is not, a linear projection is often a useful first step.

The Setup

Let \(\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d\) be a dataset, stacked as rows of a matrix \(X \in \mathbb{R}^{n \times d}\). A linear dimensionality reduction is determined by two matrices:

\[ W \in \mathbb{R}^{d \times k}, \qquad V \in \mathbb{R}^{d \times k}. \]

The encoder produces low-dimensional codes

\[ \mathbf{z}_i = W^\top (\mathbf{x}_i - \boldsymbol{\mu}) \in \mathbb{R}^k, \]

where \(\boldsymbol{\mu}\) is typically the sample mean. The decoder reconstructs by

\[ \hat{\mathbf{x}}_i = \boldsymbol{\mu} + V \mathbf{z}_i. \]

When the goal is to minimize mean squared reconstruction error, the optimal choice takes \(V = W\) with orthonormal columns; this is the PCA solution.
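
To make the pipeline concrete, here is a minimal NumPy sketch of the encoder and decoder, with \(W\) taken from the top-\(k\) right singular vectors of the centered data (the PCA choice). The data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))   # n = 500 samples in d = 10 dimensions
k = 3

mu = X.mean(axis=0)              # center at the sample mean
Xc = X - mu

# PCA choice of W: top-k right singular vectors of the centered data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:k].T                     # shape (d, k), orthonormal columns

Z = Xc @ W                       # encoder: z_i = W^T (x_i - mu)
X_hat = mu + Z @ W.T             # decoder with V = W

print("reconstruction MSE:", np.mean((X - X_hat) ** 2))
```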

Geometry: Projection Onto a Subspace

A linear reduction with orthonormal \(W\) is orthogonal projection of centered data onto the \(k\)-dimensional subspace spanned by the columns of \(W\).

Figure: a point \(\mathbf{x}\), its orthogonal projection \(\hat{\mathbf{x}}\) onto the subspace \(\operatorname{span}(W)\), and the reconstruction error as the perpendicular distance to the subspace.

The reconstruction \(\hat{\mathbf{x}}_i = W W^\top (\mathbf{x}_i - \boldsymbol{\mu}) + \boldsymbol{\mu}\) is the closest point in the affine subspace \(\boldsymbol{\mu} + \operatorname{span}(W)\) to \(\mathbf{x}_i\). The error is the perpendicular distance.
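
To see why, fix \(\mathbf{x}_i\) and minimize the squared distance over codes. For orthonormal \(W\), setting the gradient to zero gives

\[ \min_{\mathbf{z}} \left\| (\mathbf{x}_i - \boldsymbol{\mu}) - W \mathbf{z} \right\|^2 \;\Longrightarrow\; W^\top \big( (\mathbf{x}_i - \boldsymbol{\mu}) - W \mathbf{z} \big) = \mathbf{0} \;\Longrightarrow\; \mathbf{z}^* = W^\top (\mathbf{x}_i - \boldsymbol{\mu}), \]

using \(W^\top W = I_k\); substituting \(\mathbf{z}^*\) back gives the reconstruction above.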

This geometric picture pins down what distinguishes the linear methods from one another: each chooses the subspace by a different criterion.

Methods

The methods below are distinguished by the criterion used to choose the subspace.

Principal Component Analysis (PCA)

PCA picks the subspace that maximizes the variance of the projected data, equivalently the subspace that minimizes squared reconstruction error. The columns of \(W\) are the top-\(k\) eigenvectors of the sample covariance \(C = \frac{1}{n} X_c^\top X_c\), where \(X_c\) is the centered data matrix. PCA is the canonical linear reduction.
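
A minimal sketch of the eigendecomposition route, including the fraction of variance the subspace captures; the function name and test data are illustrative.

```python
import numpy as np

def pca(X, k):
    """Top-k principal directions via eigendecomposition of the sample covariance."""
    Xc = X - X.mean(axis=0)
    C = (Xc.T @ Xc) / len(Xc)                # sample covariance, d x d
    eigvals, eigvecs = np.linalg.eigh(C)     # ascending for a symmetric matrix
    order = np.argsort(eigvals)[::-1]        # reorder descending by variance
    W = eigvecs[:, order[:k]]                # top-k eigenvectors as columns
    explained = eigvals[order[:k]].sum() / eigvals.sum()
    return W, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))  # correlated features
W, frac = pca(X, k=2)
print(f"fraction of variance captured: {frac:.2f}")
```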

Random Projection

\(W\) is drawn at random (e.g., \(W_{ij} \sim \mathcal{N}(0, 1/k)\)) with no reference to the data. The Johnson–Lindenstrauss lemma guarantees that, with high probability, pairwise Euclidean distances are preserved up to a multiplicative factor of \(1 \pm \varepsilon\) provided \(k = \Omega(\varepsilon^{-2} \log n)\). Random projection is cheap, oblivious to the data distribution, and surprisingly competitive for very high-dimensional sparse data.
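
A quick empirical check of the distance-preservation claim, with illustrative sizes; nothing here is fit to the data.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, k = 100, 10_000, 1_000
X = rng.normal(size=(n, d))

W = rng.normal(scale=1 / np.sqrt(k), size=(d, k))  # W_ij ~ N(0, 1/k)
Z = X @ W                                          # project; no centering, no fitting

# ratio of projected to original distance for every pair of points
ratios = [np.linalg.norm(Z[i] - Z[j]) / np.linalg.norm(X[i] - X[j])
          for i, j in combinations(range(n), 2)]
print(f"distance ratios: min {min(ratios):.3f}, max {max(ratios):.3f}")
```

All ratios should land close to 1, and the spread shrinks as \(k\) grows.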

Linear Discriminant Analysis (LDA)

Supervised: given class labels \(y_i\), LDA chooses \(W\) to maximize the ratio of between-class scatter to within-class scatter. It is useful when the goal is classification rather than reconstruction. The projection dimension is bounded by \(k \leq (\text{number of classes}) - 1\), because the between-class scatter matrix has rank at most one less than the number of classes.
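
A minimal sketch of the classical solution via the generalized eigenproblem \(S_w^{-1} S_b\), assuming the within-class scatter \(S_w\) is invertible; the function name is illustrative.

```python
import numpy as np

def lda(X, y, k):
    """Top-k discriminant directions: leading eigenvectors of S_w^{-1} S_b."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))                   # within-class scatter
    Sb = np.zeros((d, d))                   # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)     # rank-one contribution per class
    # Sb has rank <= (#classes - 1), capping the useful dimension k
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:k]].real
```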

Factor Analysis and Probabilistic PCA

Probabilistic PCA assumes a generative model \(\mathbf{x} = W \mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\varepsilon}\) with \(\mathbf{z} \sim \mathcal{N}(0, I)\) and isotropic Gaussian noise \(\boldsymbol{\varepsilon}\). Maximum likelihood recovers the same subspace as PCA but adds a noise variance estimate. Factor analysis is the same model with anisotropic diagonal noise — features are allowed to have their own noise variance. Both are useful when you want a probabilistic story rather than a deterministic projection.
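
The maximum likelihood solution for probabilistic PCA is closed-form. A sketch, assuming the trailing eigenvalues of the sample covariance estimate the noise floor; note the factor \(W\) is recovered only up to a rotation.

```python
import numpy as np

def ppca(X, k):
    """Closed-form maximum likelihood fit of probabilistic PCA."""
    Xc = X - X.mean(axis=0)
    C = (Xc.T @ Xc) / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(C)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order
    sigma2 = eigvals[k:].mean()                          # ML estimate of noise variance
    W = eigvecs[:, :k] * np.sqrt(eigvals[:k] - sigma2)   # ML factor, up to rotation
    return W, sigma2
```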

Canonical Correlation Analysis (CCA)

Suppose two datasets \(X \in \mathbb{R}^{n \times d_1}\) and \(Y \in \mathbb{R}^{n \times d_2}\) are measured on the same \(n\) observations. CCA finds linear projections \(W\) for \(X\) and \(U\) for \(Y\) that maximize the correlation between the projected views \(X W\) and \(Y U\). It is useful for cross-modal analysis (image-caption pairs, fMRI–stimulus pairs).
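
A minimal sketch via whitening each view and taking an SVD of the whitened cross-covariance; the small ridge term is an assumption added for numerical stability.

```python
import numpy as np

def cca(X, Y, k, reg=1e-8):
    """Top-k canonical directions and correlations for two views of the same samples."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])   # ridge keeps Cholesky well-posed
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    # whitened cross-covariance: Lx^{-1} Cxy Ly^{-T}
    M = np.linalg.solve(Lx, np.linalg.solve(Ly, Cxy.T).T)
    U, s, Vt = np.linalg.svd(M)
    W = np.linalg.solve(Lx.T, U[:, :k])              # directions for X
    V = np.linalg.solve(Ly.T, Vt[:k].T)              # directions for Y
    return W, V, s[:k]                               # s[:k] are canonical correlations
```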

Why Linear Is Often Enough

Three reasons a linear reduction frequently does the job:

  1. Local linearity. Smooth manifolds are locally linear. Even when global structure is curved, small neighborhoods are well-approximated by a tangent plane. PCA on a local neighborhood is the first step in several manifold-learning methods (a minimal sketch follows this list).

  2. Closed-form solutions. PCA, LDA, and CCA admit non-iterative solutions via eigendecomposition or singular value decomposition, and random projection requires no fitting at all. No initialization, no learning rate, no convergence concerns.

  3. Interpretability. Each column of \(W\) is a direction in the original feature space. You can read off which features it weights heavily. Nonlinear methods rarely offer this.
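
To illustrate point 1, a sketch of local PCA: estimate a tangent plane at one point by running PCA on its nearest neighbors. The function name and neighborhood size are illustrative.

```python
import numpy as np

def local_pca(X, i, n_neighbors=10, k=2):
    """Estimate a k-dimensional tangent plane at point i from its nearest neighbors."""
    dists = np.linalg.norm(X - X[i], axis=1)
    nbrs = X[np.argsort(dists)[:n_neighbors]]   # the n_neighbors closest points
    mu = nbrs.mean(axis=0)
    _, _, Vt = np.linalg.svd(nbrs - mu, full_matrices=False)
    return mu, Vt[:k].T                         # base point and tangent basis
```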

Limitations

A linear method cannot:

  • Unroll a curved manifold (Swiss roll, sphere). After projection, points that were far apart along the manifold may end up adjacent.
  • Discover nonlinear interactions between features. Pixel intensities and edges are linearly mixed; objects and parts are not.
  • Adapt the feature representation to a downstream task unless labels are supplied (LDA uses them; PCA never sees them).

When these matter, move to autoencoders or manifold-learning methods.

When to Use Linear Methods

Linear dimensionality reduction is the right default when:

  • \(d\) is large enough to cause problems but the underlying structure is approximately linear.
  • You need a closed-form, fast, deterministic preprocessing step.
  • Interpretability of the directions matters.
  • You have limited training data; linear methods need fewer samples to fit reliably than nonlinear ones.

Reach for nonlinear methods only after linear methods have been tried and found insufficient.

References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/.
Strang, Gilbert. 2016. Introduction to Linear Algebra. 5th ed. Wellesley-Cambridge Press.