Gram-Schmidt Algorithm

Motivation

Orthonormal bases are the most convenient kind of basis: coordinates and projections reduce to inner products, and the inverse of an orthogonal matrix is just its transpose. Most subspaces, however, arrive as the span of an arbitrary collection of vectors — the columns of a data matrix, the rows of an observation matrix, a list of features — with no orthogonality in sight.

The Gram-Schmidt algorithm (Gram 1883; Schmidt 1907) converts any linearly independent list of vectors into an orthonormal basis of the same span. It is the constructive proof that every finite-dimensional subspace has an orthonormal basis, and it is the workhorse behind the \(QR\) factorization, least-squares regression, and many numerical algorithms.

Problem

Input. A list of linearly independent vectors \(\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^n\).

Output. An orthonormal list \(\mathbf{q}_1, \ldots, \mathbf{q}_k \in \mathbb{R}^n\) such that

\[ \operatorname{span}(\mathbf{q}_1, \ldots, \mathbf{q}_j) = \operatorname{span}(\mathbf{a}_1, \ldots, \mathbf{a}_j) \quad \text{for every } j = 1, \ldots, k. \]

The matching-prefix property — each \(\mathbf{q}_j\) depends only on \(\mathbf{a}_1, \ldots, \mathbf{a}_j\) — distinguishes Gram-Schmidt from other orthonormalization procedures and is what links it to the \(QR\) factorization.

Key Ideas

Treat the input vectors one at a time, in order. At step \(j\):

  • The previous outputs \(\mathbf{q}_1, \ldots, \mathbf{q}_{j-1}\) are already orthonormal and span the same subspace as \(\mathbf{a}_1, \ldots, \mathbf{a}_{j-1}\).
  • The new input \(\mathbf{a}_j\) generally has a component lying in \(\operatorname{span}(\mathbf{q}_1, \ldots, \mathbf{q}_{j-1})\) and a component orthogonal to it.
  • Subtract off the in-span component using projections onto each \(\mathbf{q}_i\) — these are easy because the \(\mathbf{q}_i\) are orthonormal — leaving only the orthogonal residual.
  • Normalize the residual to unit length to obtain \(\mathbf{q}_j\).

The residual is nonzero because the input is linearly independent: if \(\mathbf{a}_j\) were already in \(\operatorname{span}(\mathbf{q}_1, \ldots, \mathbf{q}_{j-1})\), then \(\{\mathbf{a}_1, \ldots, \mathbf{a}_j\}\) would be dependent.

The key operation at every step is orthogonal projection onto a one-dimensional subspace spanned by a unit vector \(\mathbf{q}_i\):

\[ \operatorname{proj}_{\mathbf{q}_i}(\mathbf{a}) = (\mathbf{q}_i^\top \mathbf{a})\, \mathbf{q}_i. \]

Because the \(\mathbf{q}_i\) are mutually orthogonal, the projection onto their span is the sum of projections onto each one — no interaction terms, no system to solve.
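
To make this concrete, here is a minimal NumPy sketch (the function name and arguments are ours, for illustration): projection onto the span of orthonormal columns is a sum of one-dimensional projections, computable as two matrix-vector products.

    import numpy as np

    def project_onto_span(Q, a):
        # Columns of Q are assumed orthonormal. The projection of a onto
        # their span is sum_i (q_i^T a) q_i, which is exactly Q @ (Q.T @ a):
        # no Gram matrix to form, no linear system to solve.
        return Q @ (Q.T @ a)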

Algorithm

The classical Gram-Schmidt iteration:

gram_schmidt(a_1, ..., a_k):
    for j = 1, ..., k:
        u_j = a_j
        for i = 1, ..., j-1:
            u_j = u_j - (q_i^T a_j) * q_i        # subtract projection
        q_j = u_j / ||u_j||                       # normalize
    return q_1, ..., q_k

The line \(\mathbf{u}_j \leftarrow \mathbf{u}_j - (\mathbf{q}_i^\top \mathbf{a}_j)\, \mathbf{q}_i\) is the projection step. After the inner loop, \(\mathbf{u}_j\) is the component of \(\mathbf{a}_j\) orthogonal to \(\operatorname{span}(\mathbf{q}_1, \ldots, \mathbf{q}_{j-1})\).
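
For reference, a direct NumPy translation of the pseudocode (a runnable sketch; like the problem statement, it assumes the columns of A are linearly independent):

    import numpy as np

    def gram_schmidt(A):
        """Classical Gram-Schmidt on the columns of the n-by-k matrix A."""
        A = np.asarray(A, dtype=float)
        n, k = A.shape
        Q = np.zeros((n, k))
        for j in range(k):
            u = A[:, j].copy()
            for i in range(j):
                u -= (Q[:, i] @ A[:, j]) * Q[:, i]   # subtract projection onto q_i
            Q[:, j] = u / np.linalg.norm(u)          # normalize the residual
        return Q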

Walkthrough

Orthonormalize

\[ \mathbf{a}_1 = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \quad \mathbf{a}_2 = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}, \quad \mathbf{a}_3 = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}. \]

Step 1. \(\mathbf{u}_1 = \mathbf{a}_1 = (1, 1, 0)^\top\), \(\|\mathbf{u}_1\| = \sqrt{2}\).

\[ \mathbf{q}_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}. \]

Step 2. Project \(\mathbf{a}_2\) onto \(\mathbf{q}_1\):

\[ \mathbf{q}_1^\top \mathbf{a}_2 = \tfrac{1}{\sqrt{2}}(1 + 0 + 0) = \tfrac{1}{\sqrt{2}}. \]

Subtract:

\[ \mathbf{u}_2 = \mathbf{a}_2 - \tfrac{1}{\sqrt{2}} \mathbf{q}_1 = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} - \tfrac{1}{2}\begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1/2 \\ -1/2 \\ 1 \end{pmatrix}. \]

\(\|\mathbf{u}_2\| = \sqrt{1/4 + 1/4 + 1} = \sqrt{3/2}\), so

\[ \mathbf{q}_2 = \sqrt{\tfrac{2}{3}} \begin{pmatrix} 1/2 \\ -1/2 \\ 1 \end{pmatrix} = \tfrac{1}{\sqrt{6}} \begin{pmatrix} 1 \\ -1 \\ 2 \end{pmatrix}. \]

Step 3. Project \(\mathbf{a}_3\) onto \(\mathbf{q}_1\) and \(\mathbf{q}_2\):

\[ \mathbf{q}_1^\top \mathbf{a}_3 = \tfrac{1}{\sqrt{2}}(0 + 1 + 0) = \tfrac{1}{\sqrt{2}}, \qquad \mathbf{q}_2^\top \mathbf{a}_3 = \tfrac{1}{\sqrt{6}}(0 - 1 + 2) = \tfrac{1}{\sqrt{6}}. \]

Subtract both:

\[ \mathbf{u}_3 = \mathbf{a}_3 - \tfrac{1}{\sqrt{2}} \mathbf{q}_1 - \tfrac{1}{\sqrt{6}} \mathbf{q}_2 = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} - \tfrac{1}{2}\begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} - \tfrac{1}{6}\begin{pmatrix} 1 \\ -1 \\ 2 \end{pmatrix} = \begin{pmatrix} -2/3 \\ 2/3 \\ 2/3 \end{pmatrix}. \]

\(\|\mathbf{u}_3\| = \sqrt{4/9 + 4/9 + 4/9} = \tfrac{2}{\sqrt{3}}\), so

\[ \mathbf{q}_3 = \tfrac{\sqrt{3}}{2}\begin{pmatrix} -2/3 \\ 2/3 \\ 2/3 \end{pmatrix} = \tfrac{1}{\sqrt{3}}\begin{pmatrix} -1 \\ 1 \\ 1 \end{pmatrix}. \]

Verification. Pairwise inner products: \(\mathbf{q}_1^\top \mathbf{q}_2 = \tfrac{1}{\sqrt{12}}(1 - 1 + 0) = 0\), \(\mathbf{q}_1^\top \mathbf{q}_3 = \tfrac{1}{\sqrt{6}}(-1 + 1 + 0) = 0\), \(\mathbf{q}_2^\top \mathbf{q}_3 = \tfrac{1}{\sqrt{18}}(-1 -1 + 2) = 0\). All three are unit vectors.
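
The walkthrough can be reproduced with the gram_schmidt sketch above; the columns of A below are \(\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3\):

    A = np.array([[1., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 1.]])
    Q = gram_schmidt(A)
    print(np.round(Q[:, 2] * np.sqrt(3)))    # -> [-1.  1.  1.], i.e. sqrt(3) q_3
    print(np.allclose(Q.T @ Q, np.eye(3)))   # orthonormality check: True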

Correctness

Two invariants hold at the end of step \(j\):

Orthonormality. \(\mathbf{q}_1, \ldots, \mathbf{q}_j\) are pairwise orthogonal and have unit norm. Pairwise orthogonality holds for \(\mathbf{q}_i, \mathbf{q}_j\) with \(i < j\) because

\[ \mathbf{q}_i^\top \mathbf{u}_j = \mathbf{q}_i^\top \mathbf{a}_j - \sum_{\ell < j} (\mathbf{q}_\ell^\top \mathbf{a}_j)\, \mathbf{q}_i^\top \mathbf{q}_\ell = \mathbf{q}_i^\top \mathbf{a}_j - \mathbf{q}_i^\top \mathbf{a}_j = 0, \]

using the inductive hypothesis \(\mathbf{q}_i^\top \mathbf{q}_\ell = \delta_{i\ell}\) for \(i, \ell < j\). Normalizing \(\mathbf{u}_j\) preserves orthogonality and gives unit length.

Matching prefixes. \(\operatorname{span}(\mathbf{q}_1, \ldots, \mathbf{q}_j) = \operatorname{span}(\mathbf{a}_1, \ldots, \mathbf{a}_j)\). By construction \(\mathbf{q}_j\) is a linear combination of \(\mathbf{a}_j\) and the previous \(\mathbf{q}_i\)’s, which inductively are combinations of \(\mathbf{a}_1, \ldots, \mathbf{a}_{j-1}\). Conversely \(\mathbf{a}_j\) is a combination of the \(\mathbf{q}_i\)’s, so each span contains the other.

The denominator \(\|\mathbf{u}_j\|\) is nonzero exactly because \(\mathbf{a}_j \notin \operatorname{span}(\mathbf{a}_1, \ldots, \mathbf{a}_{j-1})\). If the input is dependent, the algorithm will produce \(\mathbf{u}_j = \mathbf{0}\) at some step and the normalization fails.
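
When independence is not guaranteed, one common remedy (our own variant, not part of the classical algorithm) is to detect a numerically zero residual and skip that column; the result is an orthonormal basis for the column space whose width is the numerical rank:

    def gram_schmidt_skip(A, tol=1e-12):
        """Gram-Schmidt that skips (nearly) dependent columns of A."""
        A = np.asarray(A, dtype=float)
        basis = []
        for j in range(A.shape[1]):
            u = A[:, j].copy()
            for q in basis:
                u -= (q @ u) * q                       # remove in-span component
            norm = np.linalg.norm(u)
            if norm > tol * np.linalg.norm(A[:, j]):   # residual is significant
                basis.append(u / norm)
        return np.column_stack(basis) if basis else np.zeros((A.shape[0], 0))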

Complexity and Tradeoffs

For \(k\) vectors in \(\mathbb{R}^n\), step \(j\) performs \(j - 1\) projections, each costing \(\Theta(n)\) work, followed by a normalization. The total cost is

\[ \sum_{j=1}^{k} \Theta(j n) = \Theta(k^2 n) \]

floating-point operations. For square inputs (\(k = n\)), this is \(\Theta(n^3)\) — the same order as Gaussian elimination.

Numerical stability. Classical Gram-Schmidt accumulates round-off errors that destroy the orthogonality of the output when the input columns are nearly dependent. The modified Gram-Schmidt variant (analyzed by Björck 1967) computes each projection coefficient against the current residual rather than the original column: the inner loop uses \(\mathbf{q}_i^\top \mathbf{u}_j\) in place of \(\mathbf{q}_i^\top \mathbf{a}_j\). It has the same arithmetic cost but much better numerical behavior, and it is the version used in practice.
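
In code the change is a single inner product: the coefficient is computed against the running residual u instead of the original column (a sketch under the same assumptions as gram_schmidt above):

    def modified_gram_schmidt(A):
        """Modified Gram-Schmidt: project against the current residual."""
        A = np.asarray(A, dtype=float)
        n, k = A.shape
        Q = np.zeros((n, k))
        for j in range(k):
            u = A[:, j].copy()
            for i in range(j):
                u -= (Q[:, i] @ u) * Q[:, i]   # q_i^T u, not q_i^T a_j
            Q[:, j] = u / np.linalg.norm(u)
        return Q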

For very ill-conditioned inputs, even modified Gram-Schmidt is not robust enough. Production numerical libraries use Householder reflections (Householder 1958) or Givens rotations to build the \(QR\) factorization with guaranteed backward stability.
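
In NumPy this is what the built-in factorization does: np.linalg.qr wraps LAPACK's Householder-based routines. Reusing the walkthrough matrix A:

    Q, R = np.linalg.qr(A)                            # reduced QR via Householder
    print(np.allclose(Q @ R, A))                      # True: A is reconstructed
    print(np.allclose(Q.T @ Q, np.eye(A.shape[1])))   # True to machine precision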

Connection to \(QR\) Factorization

Writing \(\mathbf{a}_j = \sum_{i \le j} r_{ij}\, \mathbf{q}_i\), where \(r_{ij} = \mathbf{q}_i^\top \mathbf{a}_j\) for \(i < j\) and \(r_{jj} = \|\mathbf{u}_j\|\), the \(k\) column equations collect into

\[ A = Q R, \]

with \(A = [\mathbf{a}_1 \;\cdots\; \mathbf{a}_k]\), \(Q = [\mathbf{q}_1 \;\cdots\; \mathbf{q}_k]\) having orthonormal columns, and \(R\) upper triangular with positive diagonal. Gram-Schmidt is one algorithm for computing this \(QR\) factorization — column by column — and the matching-prefix property of the algorithm corresponds directly to the upper-triangular structure of \(R\).
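
The entries of \(R\) are exactly the quantities the algorithm already computes, so returning them costs nothing extra. A sketch extending modified_gram_schmidt above (in the modified variant each \(r_{ij}\) is taken against the running residual, which agrees with \(\mathbf{q}_i^\top \mathbf{a}_j\) in exact arithmetic):

    def gram_schmidt_qr(A):
        """QR factorization via modified Gram-Schmidt."""
        A = np.asarray(A, dtype=float)
        n, k = A.shape
        Q = np.zeros((n, k))
        R = np.zeros((k, k))
        for j in range(k):
            u = A[:, j].copy()
            for i in range(j):
                R[i, j] = Q[:, i] @ u          # r_ij: coefficient of q_i in a_j
                u -= R[i, j] * Q[:, i]
            R[j, j] = np.linalg.norm(u)        # r_jj = ||u_j|| > 0
            Q[:, j] = u / R[j, j]
        return Q, R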

The \(QR\) factorization is the standard tool for solving least-squares problems \(\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|\) in numerical practice.
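
A sketch of that use, reusing the walkthrough matrix A (the right-hand side b is an arbitrary choice for illustration): since \(Q\) has orthonormal columns, \(\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|\) reduces to the triangular system \(R\mathbf{x} = Q^\top \mathbf{b}\).

    b = np.array([1., 2., 3.])
    Q, R = np.linalg.qr(A)
    x = np.linalg.solve(R, Q.T @ b)   # R is triangular; a dedicated triangular
                                      # solver would exploit that structure
    print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True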

When to Use It

Situation                                                       Method
--------------------------------------------------------------  -------------------------------------------------
Pedagogical understanding, small dimensions                     Classical Gram-Schmidt
Practical orthogonalization of well-conditioned inputs          Modified Gram-Schmidt
Ill-conditioned numerical work                                  Householder \(QR\)
Want both \(Q\) and the orthonormal-basis interpretation        Gram-Schmidt variants
Need only a basis of the column space, not an orthonormal one   Reduced row echelon form via Gaussian elimination

Gram-Schmidt is the right starting point when the goal is to construct an orthonormal basis you can reason about — the formulas are transparent and the matching-prefix property is sometimes essential (for example in building orthogonal polynomials, where the input order encodes degree).
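
For instance, applying Gram-Schmidt to samples of the monomials \(1, x, x^2, x^3\) on \([-1, 1]\) recovers, up to scaling and discretization error, the Legendre polynomials, precisely because the input order encodes degree. A discretized sketch (the grid and the plain-sum inner product are our choices) using gram_schmidt from above:

    x = np.linspace(-1.0, 1.0, 2001)
    V = np.column_stack([x**d for d in range(4)])   # columns: 1, x, x^2, x^3
    Q = gram_schmidt(V)
    p2 = Q[:, 2] / Q[-1, 2]                         # rescale so p2(1) = 1
    # p2 approximates the Legendre polynomial P_2(x) = (3x^2 - 1)/2, up to
    # the error of replacing the integral inner product by a grid sum.
    print(np.allclose(p2, (3 * x**2 - 1) / 2, atol=1e-3))   # True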

References

Björck, Åke. 1967. “Solving Linear Least Squares Problems by Gram-Schmidt Orthogonalization.” BIT Numerical Mathematics 7 (1): 1–21. https://doi.org/10.1007/BF01934122.
Gram, Jørgen Pedersen. 1883. “Über Die Entwicklung Reeller Funktionen in Reihen Mittelst Der Methode Der Kleinsten Quadrate.” Journal für Die Reine Und Angewandte Mathematik 94: 41–73.
Householder, Alston S. 1958. “Unitary Triangularization of a Nonsymmetric Matrix.” Journal of the ACM 5 (4): 339–42. https://doi.org/10.1145/320941.320947.
Schmidt, Erhard. 1907. “Zur Theorie Der Linearen Und Nichtlinearen Integralgleichungen. I. Teil: Entwicklung Willkürlicher Funktionen Nach Systemen Vorgeschriebener.” Mathematische Annalen 63 (4): 433–76. https://doi.org/10.1007/BF01449770.