Backpropagation

Motivation

Training a neural network requires the gradient of a scalar loss \(L\) with respect to every parameter \(\theta_i\). The naive approach — perturbing each parameter, recomputing the forward pass, and finite-differencing — costs one full forward pass per parameter. For a network with \(10^9\) parameters, that is computationally absurd.

Backpropagation (Rumelhart et al. 1986) computes all gradients \(\partial L / \partial \theta_i\) simultaneously in the same time as one forward pass (up to a small constant factor) by exploiting the chain rule and the structure of the network’s computational graph. It is the algorithm that makes modern deep learning feasible.

Mathematically, backpropagation is reverse-mode automatic differentiation specialized to neural networks. The procedure is exact (up to floating-point precision); it computes the same gradient that a perfect symbolic differentiator would.

Problem

Given a feedforward neural network expressing a scalar loss \(L\) as a composition of layers

\[ z_k = W_k a_{k-1} + b_k, \qquad a_k = \sigma(z_k), \qquad L = \ell(a_n, y), \]

compute the gradients \(\partial L / \partial W_k\) and \(\partial L / \partial b_k\) for all layers \(k\), plus the input gradient if needed, in time and memory comparable to the forward pass.

Key Ideas

Reverse-mode autodiff: one backward pass, all gradients

The naive approach to gradients — perturb each parameter and finite-difference — is \(O(P)\) forward passes for \(P\) parameters. Forward-mode AD does \(O(I)\) forward passes for \(I\) inputs (cheap when \(I\) is small). Reverse-mode AD does \(O(O)\) backward passes for \(O\) outputs. Since neural networks have one scalar output (the loss) and millions of parameters, reverse mode is essential.

The trick is that the chain rule contracts in two directions:

\[ \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdot \frac{\partial a_{n-1}}{\partial a_{n-2}} \cdots \frac{\partial a_1}{\partial \theta}. \]

Multiplying right-to-left (forward mode) computes Jacobian-vector products. Multiplying left-to-right (reverse mode) computes vector-Jacobian products. For a scalar output, the leftmost factor is a \(1 \times d\) row vector, so each multiplication with the next Jacobian yields another row vector, keeping every intermediate result a vector instead of a matrix.
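
A small NumPy sketch of the contrast (the three Jacobians are random stand-ins for the layer Jacobians in the product above):

import numpy as np

rng = np.random.default_rng(0)
d = 4
J3, J2, J1 = (rng.standard_normal((d, d)) for _ in range(3))  # stand-ins for the layer Jacobians da_k / da_{k-1}
dL_da3 = rng.standard_normal(d)                               # gradient of the scalar loss at the output

# Reverse mode: left-to-right vector-Jacobian products; every intermediate is a length-d vector.
v = dL_da3
for J in (J3, J2, J1):
    v = v @ J

# Same result via the full Jacobian chain, which materializes d x d matrices along the way.
assert np.allclose(v, dL_da3 @ (J3 @ J2 @ J1))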

Adjoints and local chain rule

Define the adjoint of activation \(a_k\):

\[ \delta_k = \frac{\partial L}{\partial a_k}. \]

This is the gradient that backpropagation passes around. The recurrence is the multivariable chain rule applied locally:

\[ \delta_{k-1} = W_k^\top (\delta_k \odot \sigma'(z_k)), \qquad \frac{\partial L}{\partial W_k} = (\delta_k \odot \sigma'(z_k)) \cdot a_{k-1}^\top, \qquad \frac{\partial L}{\partial b_k} = \delta_k \odot \sigma'(z_k). \]

Each layer’s parameter gradients fall out of the adjoint at that layer — no extra passes needed.
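
To see why, differentiate through one layer. Since \(a_k = \sigma(z_k)\) and \(z_k = W_k a_{k-1} + b_k\),

\[ \frac{\partial L}{\partial z_k} = \delta_k \odot \sigma'(z_k), \qquad \delta_{k-1} = \left(\frac{\partial z_k}{\partial a_{k-1}}\right)^{\!\top} \frac{\partial L}{\partial z_k} = W_k^\top \left(\delta_k \odot \sigma'(z_k)\right), \]

and because \(z_k\) is linear in \(W_k\) and \(b_k\), the parameter gradients are \(\partial L / \partial W_k = (\partial L / \partial z_k)\, a_{k-1}^\top\) and \(\partial L / \partial b_k = \partial L / \partial z_k\).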

Generalizes to any DAG of operations

Modern frameworks (PyTorch, JAX, TensorFlow) implement backpropagation generically over arbitrary directed acyclic graphs of operations. Each primitive operation \(y = f(x_1, x_2, \ldots)\) is registered with a vector-Jacobian product (VJP):

\[ \text{VJP}_f : (\bar{y}, x_1, x_2, \ldots) \mapsto \!\left(\bar{y}^\top \frac{\partial f}{\partial x_1},\, \bar{y}^\top \frac{\partial f}{\partial x_2},\, \ldots\right). \]

The forward pass executes the graph, recording operations and inputs. The backward pass walks the graph in reverse topological order, calling each VJP in turn. Backpropagation through a feedforward network is just the special case where the graph is a path.
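
A stripped-down sketch of that machinery (this is not any framework's real API; the VJP table, tape format, and node names here are invented for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each primitive is registered with a VJP: (output adjoint, forward inputs) -> input adjoints.
VJPS = {
    "matvec":  lambda ybar, W, x: (np.outer(ybar, x), W.T @ ybar),        # y = W @ x
    "add":     lambda ybar, a, b: (ybar, ybar),                            # y = a + b
    "sigmoid": lambda ybar, z: (ybar * sigmoid(z) * (1.0 - sigmoid(z)),),  # y = sigma(z)
}

def backward(tape, values, out_id):
    """Walk the recorded graph in reverse topological order, accumulating adjoints."""
    adjoints = {out_id: np.ones_like(values[out_id])}   # seed: dL/d(out) for L = sum(out)
    for op, in_ids, y_id in reversed(tape):
        ybar = adjoints[y_id]
        for x_id, xbar in zip(in_ids, VJPS[op](ybar, *(values[i] for i in in_ids))):
            adjoints[x_id] = adjoints.get(x_id, 0) + xbar   # sum contributions at fan-out
    return adjoints

# Forward pass for one layer a = sigma(W x + b), recording a tape of (op, input ids, output id).
rng = np.random.default_rng(0)
values = {"W": rng.standard_normal((3, 2)), "x": rng.standard_normal(2), "b": rng.standard_normal(3)}
values["u"] = values["W"] @ values["x"]
values["z"] = values["u"] + values["b"]
values["a"] = sigmoid(values["z"])
tape = [("matvec", ("W", "x"), "u"), ("add", ("u", "b"), "z"), ("sigmoid", ("z",), "a")]
grads = backward(tape, values, "a")   # grads["W"], grads["b"], grads["x"]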

Algorithm

forward(x, y):
    a_0 = x
    for k = 1 to n:
        z_k = W_k a_{k-1} + b_k
        a_k = sigma(z_k)
    L = loss(a_n, y)
    return L

backward():
    delta = grad_a_n loss(a_n, y)
    for k = n down to 1:
        d_zk = delta * sigma_prime(z_k)
        grad_W_k = d_zk · a_{k-1}^T
        grad_b_k = d_zk
        delta = W_k^T · d_zk

The forward pass stores activations \(a_0, z_1, a_1, \ldots, z_n, a_n\); the backward pass walks them in reverse to compute gradients. Each layer costs one matrix-vector product forward and, backward, one matrix-vector product plus one same-sized outer product, so the backward pass's big-O cost matches the forward.
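
The same algorithm in runnable form (a minimal NumPy sketch of a two-layer sigmoid network with a squared-error loss; the shapes and the finite-difference check are illustrative choices, not part of the pseudocode above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x, y):
    """Run the layers, storing z_k and a_k for the backward pass."""
    a, zs = [x], []
    for W, b in params:
        z = W @ a[-1] + b
        zs.append(z)
        a.append(sigmoid(z))
    L = 0.5 * np.sum((a[-1] - y) ** 2)           # squared-error loss
    return L, zs, a

def backward(params, zs, a, y):
    """Adjoint recurrence: one pass from the loss back to layer 1."""
    grads = [None] * len(params)
    delta = a[-1] - y                            # dL/da_n for the squared-error loss
    for k in reversed(range(len(params))):
        W, b = params[k]
        dz = delta * sigmoid(zs[k]) * (1 - sigmoid(zs[k]))   # delta ⊙ sigma'(z_k)
        grads[k] = (np.outer(dz, a[k]), dz)                  # (dL/dW_k, dL/db_k)
        delta = W.T @ dz                                     # dL/da_{k-1}
    return grads

# Sanity check against a finite difference on one weight entry.
rng = np.random.default_rng(0)
params = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
          (rng.standard_normal((2, 4)), rng.standard_normal(2))]
x, y = rng.standard_normal(3), rng.standard_normal(2)
L, zs, a = forward(params, x, y)
grads = backward(params, zs, a, y)

eps = 1e-6
params[0][0][1, 2] += eps
L_plus, _, _ = forward(params, x, y)
assert abs((L_plus - L) / eps - grads[0][0][1, 2]) < 1e-4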

Walkthrough

A scalar loss on a small computation graph

The graph computes \(L = (w \cdot x + b - y)^2\) for \(w = 2\), \(x = 3\), \(b = 1\), \(y = 4\), giving \(L = 9\). Forward values flow left-to-right (blue); backward gradients flow right-to-left (red).

[Figure: the computation graph for \(L = (w \cdot x + b - y)^2\). Forward values (blue): \(u = w \cdot x = 6\), \(z = u + b = 7\), \(r = z - y = 3\), \(L = 9\). Backward gradients (red, dashed): \(\partial L/\partial L = 1\), \(\partial L/\partial r = 2r = 6\), \(\partial L/\partial z = 6\), \(\partial L/\partial u = \partial L/\partial b = 6\), \(\partial L/\partial w = 18\), \(\partial L/\partial x = 12\). The forward pass (blue) computes values; the backward pass (red, dashed) computes gradients via the chain rule.]

Each backward step is a local chain-rule application: \(\partial L / \partial L = 1\) enters at the top right; \(\partial L / \partial r = 2r\) comes from \(L = r^2\); \(\partial L / \partial z\) equals \(\partial L / \partial r\) (the subtract node’s Jacobian is \(1\) with respect to \(z\)); \(\partial L / \partial u\) and \(\partial L / \partial b\) both equal \(\partial L / \partial z\) (the add node distributes). Finally \(\partial L / \partial w = (\partial L / \partial u) \cdot x = 18\) — the multiply node’s local Jacobian is \(x\) with respect to \(w\).
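
The same arithmetic as a few lines of Python, applying each local chain-rule step by hand (just a check of the numbers above):

w, x, b, y = 2.0, 3.0, 1.0, 4.0

# Forward pass
u = w * x          # 6
z = u + b          # 7
r = z - y          # 3
L = r ** 2         # 9

# Backward pass: one local chain-rule step per node, right to left
dL_dL = 1.0
dL_dr = 2 * r * dL_dL        # 6, from L = r^2
dL_dz = dL_dr                # subtract node: dr/dz = 1
dL_du = dL_dz                # add node distributes
dL_db = dL_dz                # 6
dL_dw = dL_du * x            # 18
dL_dx = dL_du * w            # 12

assert (dL_dw, dL_dx, dL_db) == (18.0, 12.0, 6.0)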

Correctness

Each backward step is exactly the multivariable chain rule applied to one layer (or one primitive in the DAG case). The result is the same gradient a perfect symbolic differentiator would compute — exact up to floating-point precision.

The key invariant is that when the backward pass reaches a node, it has accumulated all upstream contributions to that node’s adjoint. This requires processing nodes in reverse topological order — a node’s adjoint isn’t finished until every node that uses it has been processed.
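
Concretely, if an activation \(a\) fans out to two consumers \(f\) and \(g\),

\[ L = \ell\big(f(a),\, g(a)\big) \quad\Longrightarrow\quad \frac{\partial L}{\partial a} = \left(\frac{\partial f}{\partial a}\right)^{\!\top}\frac{\partial L}{\partial f} + \left(\frac{\partial g}{\partial a}\right)^{\!\top}\frac{\partial L}{\partial g}, \]

so the adjoint of \(a\) is a sum over its consumers, and it is complete only after both \(f\) and \(g\) have been processed.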

Complexity and Tradeoffs

Time. \(O(\text{forward time})\) — each primitive’s VJP costs within a constant factor of that primitive’s forward pass. For a linear layer with \(m \times n\) weights, both forward and backward are \(O(mn)\).

Memory. \(O(\text{depth} \times \text{activation size})\) — every intermediate activation must be stored until its VJP is computed. This is the dominant cost in deep networks.

Gradient checkpointing trades time for memory: store only a subset of activations and recompute the others during the backward pass. It is the standard tool for training very deep networks; for an \(n\)-layer network it reduces activation memory from \(O(n)\) to \(O(\sqrt{n})\) at the cost of roughly one extra forward computation (\(\sim 2\times\) forward time).
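
In PyTorch this is exposed through torch.utils.checkpoint; a minimal sketch (the layer sizes and the choice of which block to checkpoint are arbitrary illustrations):

import torch
from torch.utils.checkpoint import checkpoint

# Activations inside the checkpointed block are discarded after the forward pass;
# only its inputs are kept, and the block is re-run during backward to recover them.
block1 = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                             torch.nn.Linear(512, 512), torch.nn.ReLU())
block2 = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                             torch.nn.Linear(512, 10))

x = torch.randn(32, 512, requires_grad=True)
h = block2(checkpoint(block1, x, use_reentrant=False))
loss = h.sum()
loss.backward()   # block1's forward is recomputed here before its VJPs run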

Numerical issues

  • Vanishing gradients. Each backward step multiplies \(\delta\) by \(W_k^\top \operatorname{diag}(\sigma'(z_k))\). If each factor shrinks the vector it acts on, \(\delta\) decays exponentially with depth. Sigmoid and tanh saturate for large \(|z_k|\), where \(\sigma' \approx 0\), exacerbating this; ReLU variants (\(\sigma' \in \{0, 1\}\)) substantially mitigate it. The sketch after this list makes the effect concrete.
  • Exploding gradients. The same multiplicative chain can produce very large \(\delta\), especially in recurrent networks unrolled over long sequences. Gradient clipping (capping \(\|\delta\|\)) is the standard remedy.
  • Initialization. Careful weight initialization (Glorot, He) keeps activation and gradient magnitudes well-behaved at the start of training.
  • Normalization. Batch norm, layer norm, and residual connections all stabilize gradient flow in deep networks.
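
A quick way to watch the vanishing effect (a toy NumPy sketch; the width, depth, and random Gaussian weights are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
width, depth = 64, 50
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass through a deep stack of sigmoid layers, storing weights and pre-activations.
Ws, zs = [], []
a = rng.standard_normal(width)
for _ in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)  # roughly unit-gain initialization
    z = W @ a
    Ws.append(W)
    zs.append(z)
    a = sigmoid(z)

# Backward pass: push a stand-in output adjoint toward the input and watch its norm collapse.
delta = rng.standard_normal(width)
print("adjoint norm at the output:", np.linalg.norm(delta))
for W, z in zip(reversed(Ws), reversed(zs)):
    delta = W.T @ (delta * sigmoid(z) * (1 - sigmoid(z)))
print("adjoint norm at the input: ", np.linalg.norm(delta))   # many orders of magnitude smaller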

What backprop doesn’t do

Backpropagation computes gradients. It does not find good parameters — that is the job of an optimizer (SGD, Adam, etc.) using the gradients backprop provides. Many problems blamed on backprop (poor convergence, local minima, sharp minima) are properties of the loss landscape and the optimizer, not the gradient computation.

When to Use It

Situation → Approach
Train any neural network → Backpropagation — universal default.
Train an RNN over time → Backpropagation through time — the same algorithm on the unrolled graph.
Very deep network, memory-limited → Backprop with gradient checkpointing.
Function with one scalar input → Forward-mode AD (cheaper than reverse mode for \(I = 1\), \(O\) large).
Higher-order gradients (Hessians, etc.) → Backprop through backprop, or forward-over-reverse.
Compute the Jacobian of a vector output → Reverse mode for each output coordinate, or batched JVPs / VJPs in a framework.
Non-differentiable loss → Subgradients, REINFORCE-style score-function gradients, or differentiable relaxations.

Variants

  • Reverse-mode autodiff over arbitrary DAGs. The framework-level generalization. Modern PyTorch/JAX/TensorFlow implement backprop this way; the network-specific recurrences above are a special case.
  • Gradient checkpointing. Store only a fraction of activations and recompute the rest. Cuts the activation memory of training very deep networks by a factor of two to five at \(\sim 2\times\) forward-time cost.
  • Mixed-precision training. Compute forward and backward in float16/bfloat16 while keeping a float32 master copy of parameters. Cuts memory roughly in half on modern accelerators.
  • Synthetic gradients / decoupled neural interfaces. Predict the gradient locally instead of waiting for the global backward pass. Useful in pipeline-parallel training; introduces approximation error.
  • Forward-only training (Hinton’s forward-forward). Recent experiment in training without backprop, using a layerwise contrastive objective. Far from competitive with backprop on standard benchmarks.

References

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.