Matrix Multiplication

Motivation

Multiplying two \(n \times n\) matrices using the definition takes \(\Theta(n^3)\) scalar multiplications: each of \(n^2\) output entries is a length-\(n\) inner product. For decades this was assumed to be optimal — matrix multiplication felt as basic as integer addition.

In 1969 Strassen showed it was not: the \(2 \times 2\) matrix product can be computed with \(7\) scalar multiplications instead of the obvious \(8\), and recursing on this trick gives an \(\Theta(n^{\log_2 7}) \approx \Theta(n^{2.807})\) algorithm (Strassen 1969). The result kicked off a 50-year race to lower the exponent. Today the best published bound is \(\omega < 2.3716\), and it is conjectured (but not proven) that \(\omega = 2\) is achievable.

For data-science workloads — neural-network training, principal components analysis, regression at scale — matrix multiplication is the dominant cost. Understanding both the schoolbook algorithm and the Strassen-style improvements is essential to understanding modern numerical performance.

Problem

Compute \(C = A \cdot B\) where \(A, B \in \mathbb{R}^{n \times n}\), with \(C_{ij} = \sum_{k=1}^n A_{ik} B_{kj}\).

The cost model counts scalar multiplications and additions, each at unit cost. Memory traffic and numerical-stability considerations matter in practice but are not reflected in this asymptotic count.
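Transcribed directly into code, the definition is a triple loop. A minimal plain-Python sketch (production code routes this through optimized BLAS; the function name is ours):

def schoolbook_multiply(A, B):
    # C[i][j] = sum over k of A[i][k] * B[k][j]
    # exactly n^3 scalar multiplications
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

assert schoolbook_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]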

Examples

Schoolbook on a 2 × 2

Take

\[A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}.\]

Each output entry is one inner product of length \(2\):

\[C_{11} = 1 \cdot 5 + 2 \cdot 7 = 19, \quad C_{12} = 1 \cdot 6 + 2 \cdot 8 = 22,\] \[C_{21} = 3 \cdot 5 + 4 \cdot 7 = 43, \quad C_{22} = 3 \cdot 6 + 4 \cdot 8 = 50.\]

Eight scalar multiplications for the four entries. The general pattern is \(n \cdot n^2 = n^3\) multiplications for \(n \times n\) matrices, giving \(\Theta(n^3)\).

Naive block decomposition does not help

Split each matrix into four \(n/2 \times n/2\) blocks:

\[ A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}. \]

The block-form product is

\[ C = \begin{pmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{pmatrix}. \]

Computing the four output blocks directly requires eight recursive sub-multiplications plus \(\Theta(n^2)\) block additions:

\[ T(n) = 8 \, T(n/2) + \Theta(n^2). \]

By the master theorem, \(T(n) = \Theta(n^{\log_2 8}) = \Theta(n^3)\). The recursion structure alone changes nothing — the cubic exponent comes from doing eight sub-multiplications, exactly mirroring how naive divide-and-conquer for integer multiplication preserves the \(\Theta(n^2)\) bound.
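This can be checked numerically by unrolling the recurrence. A small sketch with unit base cost and the merge cost taken as exactly \(n^2\) (the closed form here is \(T(n) = 2n^3 - n^2\)):

def T(n):
    # T(n) = 8 T(n/2) + n^2,  T(1) = 1
    if n <= 1:
        return 1
    return 8 * T(n // 2) + n * n

for k in range(1, 11):
    n = 2 ** k
    print(n, T(n) / n**3)   # 1.5, 1.75, 1.875, ... -> 2: cubic growth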

Key Ideas

The eight block sub-products are not independent. Strassen exhibits seven products \(M_1, \dots, M_7\) — formed from sums and differences of \(A\)- and \(B\)-blocks — from which all four output blocks can be reconstructed by additions and subtractions:

\[ \begin{aligned} M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}), \\ M_2 &= (A_{21} + A_{22}) B_{11}, \\ M_3 &= A_{11} (B_{12} - B_{22}), \\ M_4 &= A_{22} (B_{21} - B_{11}), \\ M_5 &= (A_{11} + A_{12}) B_{22}, \\ M_6 &= (A_{21} - A_{11})(B_{11} + B_{12}), \\ M_7 &= (A_{12} - A_{22})(B_{21} + B_{22}). \end{aligned} \]

The output blocks are

\[ \begin{aligned} C_{11} &= M_1 + M_4 - M_5 + M_7, \\ C_{12} &= M_3 + M_5, \\ C_{21} &= M_2 + M_4, \\ C_{22} &= M_1 - M_2 + M_3 + M_6. \end{aligned} \]

Seven multiplications instead of eight, at the cost of \(\Theta(n^2)\) extra block additions and subtractions per level. The saving is algebraic: the eight schoolbook block products contain redundant structure, and the sums and differences inside the \(M_i\) extract it, so one full sub-multiplication can be eliminated.
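The identities are easy to sanity-check numerically on random blocks. A sketch assuming numpy; the block size m is arbitrary:

import numpy as np

rng = np.random.default_rng(0)
m = 8
A11, A12, A21, A22 = rng.standard_normal((4, m, m))
B11, B12, B21, B22 = rng.standard_normal((4, m, m))

M1 = (A11 + A22) @ (B11 + B22)
M2 = (A21 + A22) @ B11
M3 = A11 @ (B12 - B22)
M4 = A22 @ (B21 - B11)
M5 = (A11 + A12) @ B22
M6 = (A21 - A11) @ (B11 + B12)
M7 = (A12 - A22) @ (B21 + B22)

# each reconstruction matches the schoolbook block formula
assert np.allclose(M1 + M4 - M5 + M7, A11 @ B11 + A12 @ B21)   # C11
assert np.allclose(M3 + M5,           A11 @ B12 + A12 @ B22)   # C12
assert np.allclose(M2 + M4,           A21 @ B11 + A22 @ B21)   # C21
assert np.allclose(M1 - M2 + M3 + M6, A21 @ B12 + A22 @ B22)   # C22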

The key consequence is that the recursion exponent drops to \(\log_2 7 \approx 2.807 < 3\). For matrix multiplication, the recursion exponent is the asymptotic exponent: every saved sub-multiplication compounds across all \(\log_2 n\) levels of the recursion.

Deriving the Solution

With seven sub-multiplications per recursive call and \(\Theta(n^2)\) overhead for the block sums:

\[ T(n) = 7 \, T(n/2) + \Theta(n^2), \qquad T(1) = \Theta(1). \]

The master theorem compares \(\log_b a = \log_2 7 \approx 2.807\) against \(c = 2\). Since \(\log_2 7 > 2\), the recursion is leaf-dominated and

\[ T(n) = \Theta(n^{\log_2 7}) \approx \Theta(n^{2.807}). \]

The per-level work grows as \(7^\ell \cdot \Theta((n/2^\ell)^2) = \Theta(n^2 \cdot (7/4)^\ell)\), a geometric series whose largest term is at the leaves (level \(\ell = \log_2 n\)), summing to \(\Theta(7^{\log_2 n}) = \Theta(n^{\log_2 7})\).
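Summing the series makes the leaf domination explicit; the last step uses \((7/4)^{\log_2 n} = 7^{\log_2 n} / 4^{\log_2 n} = n^{\log_2 7} / n^2\):

\[ \sum_{\ell=0}^{\log_2 n} n^2 \left(\frac{7}{4}\right)^{\ell} = \Theta\!\left( n^2 \left(\frac{7}{4}\right)^{\log_2 n} \right) = \Theta\!\left( n^2 \cdot \frac{n^{\log_2 7}}{n^2} \right) = \Theta\!\left( n^{\log_2 7} \right). \]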

Algorithm

In runnable form (Python with numpy; the base-case product A @ B dispatches to optimized BLAS):

import numpy as np

def strassen(A, B, n0=64):
    n = A.shape[0]
    if n <= n0:                          # base case: schoolbook via BLAS
        return A @ B
    if n % 2:                            # odd n: pad with a zero row and column
        Ap = np.pad(A, ((0, 1), (0, 1)))
        Bp = np.pad(B, ((0, 1), (0, 1)))
        return strassen(Ap, Bp, n0)[:n, :n]

    m = n // 2                           # partition into four n/2 x n/2 blocks
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]

    M1 = strassen(A11 + A22, B11 + B22, n0)
    M2 = strassen(A21 + A22, B11, n0)
    M3 = strassen(A11, B12 - B22, n0)
    M4 = strassen(A22, B21 - B11, n0)
    M5 = strassen(A11 + A12, B22, n0)
    M6 = strassen(A21 - A11, B11 + B12, n0)
    M7 = strassen(A12 - A22, B21 + B22, n0)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

The base-case threshold \(n_0\) is tuned per platform — typically \(n_0 \approx 64\) to \(128\), where schoolbook (running through optimized BLAS) outperforms recursive Strassen due to cache behavior and constants. For odd \(n\), pad with a zero row and column so the halves are exact.
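A quick sanity check of the sketch above against the library product (seed, sizes, and threshold are arbitrary; the odd size exercises the padding path):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((301, 301))   # odd n: triggers the zero-padding branch
B = rng.standard_normal((301, 301))
assert np.allclose(strassen(A, B, n0=64), A @ B)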

The structural map of which blocks contribute to which products, and which products land in which output blocks, shows the seven-vs-eight saving directly:

Product Left factor (A-blocks) Right factor (B-blocks) Used in
\(M_1\) \(A_{11} + A_{22}\) \(B_{11} + B_{22}\) \(C_{11}\,(+),\; C_{22}\,(+)\)
\(M_2\) \(A_{21} + A_{22}\) \(B_{11}\) \(C_{21}\,(+),\; C_{22}\,(-)\)
\(M_3\) \(A_{11}\) \(B_{12} - B_{22}\) \(C_{12}\,(+),\; C_{22}\,(+)\)
\(M_4\) \(A_{22}\) \(B_{21} - B_{11}\) \(C_{11}\,(+),\; C_{21}\,(+)\)
\(M_5\) \(A_{11} + A_{12}\) \(B_{22}\) \(C_{11}\,(-),\; C_{12}\,(+)\)
\(M_6\) \(A_{21} - A_{11}\) \(B_{11} + B_{12}\) \(C_{22}\,(+)\)
\(M_7\) \(A_{12} - A_{22}\) \(B_{21} + B_{22}\) \(C_{11}\,(+)\)

Seven multiplications cover all four \(C\)-blocks; schoolbook would have used eight.

Walkthrough

Apply Strassen to the same \(2 \times 2\) matrices used above:

\[A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}.\]

Treating each entry as a \(1 \times 1\) block (\(A_{11}=1, A_{12}=2, A_{21}=3, A_{22}=4\); \(B_{11}=5, B_{12}=6, B_{21}=7, B_{22}=8\)), the seven products are:

Product Expression Value
\(M_1\) \((1+4)(5+8)\) \(65\)
\(M_2\) \((3+4) \cdot 5\) \(35\)
\(M_3\) \(1 \cdot (6-8)\) \(-2\)
\(M_4\) \(4 \cdot (7-5)\) \(8\)
\(M_5\) \((1+2) \cdot 8\) \(24\)
\(M_6\) \((3-1)(5+6)\) \(22\)
\(M_7\) \((2-4)(7+8)\) \(-30\)

Reconstructing the output:

\[C_{11} = M_1 + M_4 - M_5 + M_7 = 65 + 8 - 24 - 30 = 19\] \[C_{12} = M_3 + M_5 = -2 + 24 = 22\] \[C_{21} = M_2 + M_4 = 35 + 8 = 43\] \[C_{22} = M_1 - M_2 + M_3 + M_6 = 65 - 35 - 2 + 22 = 50\]

Matches the schoolbook output exactly: seven multiplications produce the same four entries that the inner-product definition extracted with eight. Recursing on block sub-matrices compounds this saving across \(\log_2 n\) levels.
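The same computation, transcribed into plain Python with each entry as a \(1 \times 1\) block:

A11, A12, A21, A22 = 1, 2, 3, 4
B11, B12, B21, B22 = 5, 6, 7, 8

M1 = (A11 + A22) * (B11 + B22)   # 65
M2 = (A21 + A22) * B11           # 35
M3 = A11 * (B12 - B22)           # -2
M4 = A22 * (B21 - B11)           # 8
M5 = (A11 + A12) * B22           # 24
M6 = (A21 - A11) * (B11 + B12)   # 22
M7 = (A12 - A22) * (B21 + B22)   # -30

print(M1 + M4 - M5 + M7,   # 19
      M3 + M5,             # 22
      M2 + M4,             # 43
      M1 - M2 + M3 + M6)   # 50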

Correctness

Each output formula is an algebraic identity: substituting the \(M_i\) into \(C_{11} = M_1 + M_4 - M_5 + M_7\) and expanding gives exactly \(A_{11} B_{11} + A_{12} B_{21}\), the schoolbook block formula for \(C_{11}\). The other three identities check the same way. Crucially, the formulas never reorder an \(A\)-factor past a \(B\)-factor, so they require no commutativity of multiplication: they hold over any ring, in particular when the entries are themselves matrix blocks. This is exactly what makes the recursive block-wise application valid.
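Expanding \(C_{11}\) explicitly:

\[ \begin{aligned} M_1 + M_4 - M_5 + M_7 ={}& (A_{11}B_{11} + A_{11}B_{22} + A_{22}B_{11} + A_{22}B_{22}) \\ &+ (A_{22}B_{21} - A_{22}B_{11}) - (A_{11}B_{22} + A_{12}B_{22}) \\ &+ (A_{12}B_{21} + A_{12}B_{22} - A_{22}B_{21} - A_{22}B_{22}) \\ ={}& A_{11}B_{11} + A_{12}B_{21}. \end{aligned} \]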

Induction on \(n\) closes the recursive case: assuming the recursive calls return correct sub-products, the seven-term reconstruction returns the correct overall product.

Complexity and Tradeoffs

The recurrence \(T(n) = 7 \, T(n/2) + \Theta(n^2)\) resolves to

\[ T(n) = \Theta(n^{\log_2 7}) \approx \Theta(n^{2.807}) \]

scalar operations.

Auxiliary memory is \(\Theta(n^2)\) for the recursion stack and intermediate block buffers. Each recursive call allocates temporaries proportional to the block size; careful implementations reuse buffers to keep the overhead modest.

The lower bound is \(\omega \ge 2\): simply reading the inputs and writing \(n^2\) outputs requires \(\Omega(n^2)\) operations. The matrix-multiplication exponent

\[ \omega = \inf \{ \alpha : n \times n \text{ matrices can be multiplied in } O(n^{\alpha}) \} \]

is the central quantity of algebraic complexity. Determining its value, and in particular whether \(\omega = 2\), is a major open problem.

Important tradeoffs:

  • Constant factors and memory traffic: Strassen does extra block additions and allocates temporaries, hurting cache locality. The crossover with optimized BLAS schoolbook is around \(n \approx 100\); below that, schoolbook wins.
  • Numerical stability: the standard inner-product algorithm has a small backward error bound. Strassen’s bound is somewhat worse — adequate for most uses but a real concern in ill-conditioned numerical work.
  • Padding for odd dimensions: when \(n\) is not a power of two, the implementation pads with zero rows and columns. The padding is a constant-factor cost, not asymptotic.
  • Galactic algorithms: the post-Strassen improvements (Coppersmith–Winograd and successors) have constants so large that they only beat Strassen for matrices well beyond any conceivable hardware. They are theoretically important but unused in production.

When to Use It

For most numerical work, optimized schoolbook (vectorized, blocked, parallelized) is the right default — modern BLAS libraries make it extremely fast for the matrix sizes that actually appear in applications. Strassen is reserved for very large matrices in narrow contexts.

Situation Typical algorithm
Small matrix (\(n \lesssim 100\)) Schoolbook via BLAS — \(\Theta(n^3)\), optimal cache and SIMD use
Large matrix, accuracy-tolerant Strassen on top of a schoolbook base case — \(\Theta(n^{2.807})\)
Sparse matrix Specialized sparse-matrix algorithms — work proportional to nonzero count
Structured matrix (e.g. Toeplitz, low rank) Algorithms exploiting the structure, often \(\widetilde{O}(n^2)\) or better
Exact integer or polynomial entries Strassen is safe (no rounding); the speedup over schoolbook can be substantial
Theoretical exponent only Coppersmith–Winograd-style algorithms — galactic, not used in practice

The broader pattern is the same as for integer multiplication: when a recursive decomposition produces sub-products with shared algebraic content, exploiting that structure can reduce the number of recursive sub-problems, and with it the asymptotic exponent.

Variants

The matrix-multiplication exponent \(\omega\) is the smallest constant such that \(n \times n\) matrices can be multiplied in \(O(n^{\omega + \varepsilon})\) for every \(\varepsilon > 0\). Strassen’s result triggered a fifty-year sequence of refinements:

Year Author Exponent
1969 Strassen \(2.807\)
1978 Pan \(2.796\)
1979 Bini et al. \(2.780\)
1981 Coppersmith–Winograd \(2.495\)
1990 Coppersmith–Winograd \(2.376\)
2010 Stothers \(2.374\)
2014 Le Gall \(2.3729\)
2024 Vassilevska Williams, Xu, Xu, Zhou \(2.3716\)
  • Strassen–Winograd. A re-formulation of Strassen with the same \(7\) multiplications but only \(15\) additions (versus \(18\)), useful in practice; one standard schedule is sketched after this list.
  • Pan’s algorithm. Multiplies \(70 \times 70\) matrices with \(143{,}640\) scalar multiplications (versus \(70^3 = 343{,}000\)); recursing on this base case gives exponent \(\log_{70} 143{,}640 < 2.796\), beating Strassen’s.
  • Coppersmith–Winograd and successors. Use the laser method on tensor decompositions to drive \(\omega\) down. Constants are astronomical.
  • Approximate matrix multiplication. When approximate output suffices (e.g. randomized algorithms for ML), \(\widetilde{O}(n^2)\) algorithms exist via sketching.
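For reference, one standard schedule for the Strassen–Winograd variant, with the products renamed \(P_i\) to avoid clashing with the \(M_i\) above. First the eight input sums:

\[ \begin{aligned} S_1 &= A_{21} + A_{22}, & T_1 &= B_{12} - B_{11}, \\ S_2 &= S_1 - A_{11}, & T_2 &= B_{22} - T_1, \\ S_3 &= A_{11} - A_{21}, & T_3 &= B_{22} - B_{12}, \\ S_4 &= A_{12} - S_2, & T_4 &= T_2 - B_{21}. \end{aligned} \]

Then the seven products and the reconstruction:

\[ \begin{aligned} P_1 &= A_{11} B_{11}, & P_2 &= A_{12} B_{21}, & P_3 &= S_4 B_{22}, & P_4 &= A_{22} T_4, \\ P_5 &= S_1 T_1, & P_6 &= S_2 T_2, & P_7 &= S_3 T_3. \end{aligned} \]

\[ \begin{aligned} U_1 &= P_1 + P_6, \quad U_2 = U_1 + P_7, \quad U_3 = U_1 + P_5, \\ C_{11} &= P_1 + P_2, \quad C_{12} = U_3 + P_3, \quad C_{21} = U_2 - P_4, \quad C_{22} = U_2 + P_5. \end{aligned} \]

The four \(S\) sums, four \(T\) sums, three \(U\) accumulations, and four output combinations account for the fifteen additions and subtractions.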

References

Strassen, Volker. 1969. “Gaussian Elimination Is Not Optimal.” Numerische Mathematik 13 (4): 354–56. https://doi.org/10.1007/bf02165411.