Backpropagation Computes Exact Gradients
Claim
Let \(L : \mathbb{R}^n \to \mathbb{R}\) be a scalar function defined by a directed acyclic computational graph \(G = (V, E)\) in which each node \(v\) computes \(v = f_v(\text{parents}(v))\) for a differentiable primitive \(f_v\). Let \(\theta \subseteq V\) be the set of input (“parameter”) nodes and \(L \in V\) the unique output node.
Backpropagation — propagating adjoints \(\delta_v = \partial L / \partial v\) from the output backward through the graph by the rule
\[ \delta_v = \sum_{w \,:\, v \to w} \delta_w \cdot \frac{\partial f_w}{\partial v} \]
— computes \(\partial L / \partial \theta_i\) for every parameter \(\theta_i \in \theta\), exactly (Rumelhart et al. 1986).
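To fix ideas, here is one minimal way such a graph might be represented in code. This is an illustrative sketch only (the names `Node`, `local_grads`, and `grad` are ours, not any particular library's): each node stores its forward value, the nodes it reads from, and the local partials \(\partial f_v / \partial u\) evaluated at the forward-pass values.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One vertex of the computational graph (illustrative sketch).

    value       -- forward-pass value f_v(parents(v))
    parents     -- the nodes this node reads from
    local_grads -- d f_v / d parent, one entry per parent, at the forward values
    grad        -- the adjoint delta_v, filled in by the backward pass
    """
    value: float
    parents: List["Node"] = field(default_factory=list)
    local_grads: List[float] = field(default_factory=list)
    grad: float = 0.0
```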
Setup
Notation. For each node \(v\), let \(\delta_v\) denote the adjoint computed by backpropagation: the adjoints are initialized with \(\delta_L = 1\) at the output and updated by the rule above as nodes are processed in reverse topological order. We will show \(\delta_v = \partial L / \partial v\) for every \(v\), where the derivative on the right-hand side is taken in the multivariable chain-rule sense, treating \(L\) as a function of \(v\) via every directed path from \(v\) to \(L\).
Reverse topological order. Process nodes such that whenever an edge \(v \to w\) exists, \(w\) is processed before \(v\). Such an ordering exists because the graph is acyclic. Backpropagation visits each node exactly once, in this order.
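One standard way to obtain such an ordering, assuming nodes expose their `parents` as in the sketch above, is a depth-first traversal from the output: record nodes in postorder, then reverse.

```python
def reverse_topological_order(output):
    """Order nodes so that every node appears before all of its parents.

    Equivalently: when a node is reached, every child (consumer) of it has
    already been processed, which is exactly what the adjoint update needs.
    Only nodes that the output actually depends on are visited.
    """
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)          # postorder: parents precede their consumers

    visit(output)
    return list(reversed(order))    # reversed: consumers precede their inputs
```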
The Multivariable Chain Rule
For a node \(v\) with children \(w_1, w_2, \ldots, w_k\) (so \(v \to w_i\) for each \(i\)), \(L\) depends on \(v\) only through these children. The multivariable chain rule states
\[ \frac{\partial L}{\partial v} = \sum_{i = 1}^{k} \frac{\partial L}{\partial w_i} \cdot \frac{\partial w_i}{\partial v}, \]
where \(\partial w_i / \partial v = \partial f_{w_i} / \partial v\) is the local Jacobian of the primitive \(f_{w_i}\) at the point determined by the forward pass.
This identity holds whenever the graph is acyclic and the primitives are differentiable at the forward-pass values — exactly the assumptions of the theorem.
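For a concrete instance, suppose \(v\) feeds two children, \(w_1 = v^2\) and \(w_2 = 3v\), and \(L = w_1 + w_2\). Then
\[ \frac{\partial L}{\partial v} = \frac{\partial L}{\partial w_1} \cdot \frac{\partial w_1}{\partial v} + \frac{\partial L}{\partial w_2} \cdot \frac{\partial w_2}{\partial v} = 1 \cdot 2v + 1 \cdot 3 = 2v + 3, \]
which agrees with differentiating \(L = v^2 + 3v\) directly. Summing over children is what accounts for every path from \(v\) to \(L\); keeping only one term would miss the other path.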
Proof
By induction over the nodes, processed in reverse topological order.
Base case: \(v = L\). Backpropagation initializes \(\delta_L = 1\). The “derivative of \(L\) with respect to \(L\)” is \(1\). So \(\delta_L = \partial L / \partial L = 1\). \(\checkmark\)
Inductive step: \(v \ne L\). Assume that for every child \(w\) of \(v\), \(\delta_w = \partial L / \partial w\). (This is the inductive hypothesis: children are processed before \(v\) in reverse topological order, so they have already been updated.) Backpropagation computes
\[ \delta_v = \sum_{w \,:\, v \to w} \delta_w \cdot \frac{\partial f_w}{\partial v}. \]
By the inductive hypothesis, \(\delta_w = \partial L / \partial w\). By the multivariable chain rule applied at \(v\),
\[ \sum_{w \,:\, v \to w} \frac{\partial L}{\partial w} \cdot \frac{\partial f_w}{\partial v} = \frac{\partial L}{\partial v}. \]
Combining:
\[ \delta_v = \frac{\partial L}{\partial v}. \qquad \checkmark \]
The induction therefore reaches every node, giving \(\delta_v = \partial L / \partial v\) for all \(v \in V\). In particular, \(\delta_{\theta_i} = \partial L / \partial \theta_i\) for every parameter \(\theta_i\). \(\blacksquare\)
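To see the proof in action, here is a minimal sketch that combines the `Node` representation and `reverse_topological_order` from the sketches above with a handful of primitives. It builds the graph for \(L = \sin(x)\,y + x^2\), runs the backward pass exactly as in the proof, and checks the adjoints against the hand-derived gradient \(\partial L/\partial x = y\cos x + 2x\), \(\partial L/\partial y = \sin x\). The primitive and function names are illustrative, not from any real autodiff library.

```python
import math

# Differentiable primitives: each builds a Node recording the forward value,
# the parents, and the local partials d f / d parent at the forward values.
def add(a, b): return Node(a.value + b.value, [a, b], [1.0, 1.0])
def mul(a, b): return Node(a.value * b.value, [a, b], [b.value, a.value])
def sin(a):    return Node(math.sin(a.value), [a], [math.cos(a.value)])
def square(a): return Node(a.value ** 2,      [a], [2.0 * a.value])

def backward(output):
    """The rule from the proof: delta_L = 1, then delta_v += delta_w * d f_w / d v,
    with every child w fully processed before its parent v."""
    output.grad = 1.0
    for node in reverse_topological_order(output):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local

# L = sin(x) * y + x^2  at (x, y) = (0.5, 2.0)
x, y = Node(0.5), Node(2.0)
L = add(mul(sin(x), y), square(x))
backward(L)

# The backpropagated adjoints match the hand-derived gradient.
assert abs(x.grad - (y.value * math.cos(x.value) + 2 * x.value)) < 1e-12
assert abs(y.grad - math.sin(x.value)) < 1e-12
```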
Cost Analysis
Each edge \(v \to w\) is touched exactly once during backpropagation: when \(w\) is processed, the contribution \(\delta_w \cdot \partial f_w / \partial v\) is added to \(\delta_v\). Each touch is a constant-time scalar (or vector–Jacobian) operation given the forward-pass values.
Hence the total work is \(\Theta(|E|)\) — proportional to the size of the computational graph, which is the same big-O as the forward pass. This is the cheap-gradient principle of reverse-mode automatic differentiation: a scalar-valued function with \(n\) parameters has its full gradient computed in time comparable to one function evaluation, regardless of \(n\).
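The edge count can be checked directly with the sketch above (again, illustrative names): instrument the backward loop to count multiply-accumulate operations and compare against \(|E|\).

```python
def backward_counting(output):
    """Same backward pass as above, but also count how many edges are touched."""
    touches = 0
    output.grad = 1.0
    for node in reverse_topological_order(output):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local
            touches += 1
    return touches

# A chain of `depth` unary ops has exactly `depth` edges, and the backward
# pass performs exactly one multiply-accumulate per edge.
node, depth = Node(1.0), 50
for _ in range(depth):
    node = square(node)
assert backward_counting(node) == depth
```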
Why “Backward” and Not “Forward”
There is a forward-mode dual: propagate tangents \(\dot{v} = \partial v / \partial \theta_i\) from each input forward through the graph. Forward mode also computes derivatives correctly, but each pass yields only the derivatives with respect to a single input. For a function with \(n\) inputs and \(1\) output (a scalar loss), forward mode costs \(n\) passes; reverse mode costs \(1\). For \(1\) input and \(m\) outputs, the trade-off reverses.
Neural-network training has \(n \gg m = 1\), so reverse mode (backpropagation) is the right choice. Both are exact; the choice is purely about computational efficiency.
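For contrast, here is a minimal forward-mode sketch using dual numbers (illustrative, not a real library): every value carries a tangent, and one pass yields the derivative with respect to whichever input was seeded with tangent \(1\). A full gradient of an \(n\)-input scalar function therefore takes \(n\) such passes, versus the single reverse pass above.

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    """A value paired with its tangent d(value)/d(seeded input)."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dsin(a):
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)

def f(x, y):
    return dsin(x) * y + x * x          # same L = sin(x) y + x^2 as above

# One pass per input: seed the tangent of the input being differentiated.
dL_dx = f(Dual(0.5, 1.0), Dual(2.0, 0.0)).dot   # = y cos x + 2x
dL_dy = f(Dual(0.5, 0.0), Dual(2.0, 1.0)).dot   # = sin x
```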
Caveats
- Differentiability. The proof requires each primitive to be differentiable at the forward-pass values. Operations like ReLU, \(\max\), and \(\arg\max\) fail this only on a measure-zero set of inputs; standard practice picks a subgradient there (e.g., defining \(\operatorname{ReLU}'(0)\) to be \(0\) or \(1\)), which keeps the computed gradient exact for almost every input and works well empirically (see the sketch after this list).
- Numerical precision. Floating-point arithmetic introduces small errors. The algorithm computes the gradient exactly in the mathematical sense — modulo finite-precision arithmetic, which can accumulate in very deep graphs.
- Stochastic operations. Sampling steps and dropout masks are not deterministic functions of their inputs, so they fall outside the theorem's assumptions. Specialized estimators (the reparameterization trick, the straight-through estimator) extend backpropagation past them, at the cost of bias or variance depending on the technique.
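As an illustration of the differentiability caveat (hypothetical code, not any framework's actual convention), here is a ReLU backward rule in which the value at \(0\) is an explicit choice:

```python
def relu(x: float) -> float:
    return x if x > 0.0 else 0.0

def relu_grad(x: float, at_zero: float = 0.0) -> float:
    """Local derivative used in the backward pass.

    For x != 0 this is the exact derivative (1 or 0). At x == 0 the derivative
    does not exist; `at_zero` picks a subgradient (0 and 1 are both common
    choices), which only matters on a measure-zero set of inputs.
    """
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return at_zero
```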