Vanishing and Exploding Gradients
Motivation
When a gradient is propagated through many layers of a neural network — either depth in a feedforward network, or time in a recurrent network — it traverses a product of layer Jacobians:
\[ \frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \prod_{t=1}^T \frac{\partial h_t}{\partial h_{t-1}}. \]
The norm of this product tends to grow or shrink geometrically (Pascanu et al. 2013). When it shrinks toward zero the gradient vanishes and early layers receive no learning signal; when it grows without bound the gradient explodes and parameters overshoot during the update. Both pathologies are easy to produce in plain deep networks and were the dominant obstacle to training deep models before about 2010.
The vanishing case is the more pernicious of the two: the network keeps training, but the early layers stay near their initialization. The exploding case at least announces itself with NaNs.
The Geometric Picture
Suppose every layer Jacobian \(J_t = \partial h_t / \partial h_{t-1}\) has spectral radius \(\rho\) — the largest eigenvalue magnitude. Then the product of \(T\) such Jacobians has spectral radius approximately \(\rho^T\):
- \(\rho < 1\): the gradient norm shrinks like \(\rho^T\). Vanishing.
- \(\rho > 1\): the gradient norm grows like \(\rho^T\). Exploding.
- \(\rho = 1\): the gradient norm is approximately preserved. The only stable regime.
The condition \(\rho \approx 1\) across all relevant directions is fragile to maintain through training and is what motivates the architectural and optimization fixes below.
Diagram: gradient norm versus layer depth for three regimes
Vertical axis is log of gradient norm; horizontal is layer depth from output (layer \(L\), far right) back to input (layer \(1\), far left).
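The diagram's behavior is easy to reproduce. Below is a minimal NumPy sketch of the three regimes: a unit gradient is multiplied by \(T\) Jacobians of spectral radius \(\rho\) and its norm is tracked. The dimensions, depth, and use of an orthogonal matrix are illustrative choices, not from the text.

```python
import numpy as np

# Sketch of the three regimes: multiply a unit gradient by T Jacobians of
# spectral radius rho and track its norm.
rng = np.random.default_rng(0)
d, T = 64, 50
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix

for rho in (0.9, 1.0, 1.1):
    J = rho * Q                          # Jacobian with spectral radius rho
    g = rng.standard_normal(d)
    g /= np.linalg.norm(g)               # unit gradient at the output
    for _ in range(T):
        g = J.T @ g                      # one backpropagation step
    print(f"rho = {rho}: ||g|| after {T} steps = {np.linalg.norm(g):.3e}")
# rho = 0.9 -> ~5e-3 (vanishing), 1.0 -> ~1 (preserved), 1.1 -> ~1e2 (exploding)
```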
Why Vanilla RNNs Have This Problem
For a tanh RNN, the Jacobian per step is \(J_t = \operatorname{diag}(\tanh'(z_t)) \cdot W_{hh}\). The diagonal factor has entries in \((0, 1]\), with saturated units producing \(\tanh' \ll 1\). So even if \(\|W_{hh}\|\) is large, the diagonal damps it; if \(\|W_{hh}\|\) is small, both factors damp. In practice the spectral radius drifts toward values that produce vanishing gradients, and gradients propagated more than a few dozen steps back in time are essentially zero. This means a vanilla RNN cannot learn dependencies longer than \(\sim 10\)–\(50\) steps regardless of sequence length.
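A short NumPy sketch of this damping, accumulating the Jacobian product \(\partial h_t / \partial h_0\) of a tanh RNN over \(T\) steps; the weight scale and input statistics are illustrative assumptions, not from the text.

```python
import numpy as np

# Sketch: accumulate the Jacobian product of a tanh RNN over T steps.
rng = np.random.default_rng(1)
d, T = 64, 100
W = rng.standard_normal((d, d))
W_hh = 1.2 * W / np.linalg.norm(W, 2)     # spectral norm pinned slightly above 1

h = np.zeros(d)
grad = np.eye(d)                          # accumulates dh_t/dh_0
for t in range(T):
    z = W_hh @ h + 0.5 * rng.standard_normal(d)
    h = np.tanh(z)
    J_t = (1.0 - h ** 2)[:, None] * W_hh  # diag(tanh'(z)) @ W_hh
    grad = J_t @ grad
    if (t + 1) % 25 == 0:
        print(f"step {t + 1:3d}: ||dh_t/dh_0|| = {np.linalg.norm(grad):.3e}")
# The tanh' factors drive the product toward zero even though ||W_hh|| > 1.
```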
Why Deep Feedforward Networks Have This Problem
Activations like sigmoid and tanh saturate: \(\sigma'(z) \to 0\) as \(|z| \to \infty\). For sigmoid, \(\sigma' \leq 1/4\); for tanh, \(\sigma' \leq 1\). Stacking \(L\) such layers multiplies these derivative factors \(L\) times, so for sigmoid the contribution of the nonlinearities alone is bounded above by \(4^{-L}\); unless the weight matrices amplify by a compensating amount, with twenty layers this is already \(\sim 10^{-12}\). This is the historical reason deep MLPs were considered untrainable through the 1990s and 2000s.
ReLU networks largely sidestep this in the active region because \(\sigma' = 1\) there — the gradient is preserved through the nonlinearity. The remaining concern is then \(W\) alone, which is what initialization schemes (He, Xavier) target.
Mitigations
Several classes of fixes:
1. Architecture: gating
LSTM (Hochreiter and Schmidhuber 1997) and GRU (Cho et al. 2014) introduce gated recurrences in which the state is updated additively; for the LSTM, the cell state follows
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t. \]
When the forget gate \(f_t \approx 1\), the gradient \(\partial c_t / \partial c_{t-1} \approx I\) — the identity Jacobian. The Jacobian product through such a chain is approximately \(I^T = I\), neither vanishing nor exploding.
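A minimal sketch of why the gated path survives: the Jacobian \(\partial c_t / \partial c_{t-1}\) is \(\operatorname{diag}(f_t)\), so the gradient through \(T\) steps is the elementwise product of forget gates. The gate values below are illustrative assumptions.

```python
import numpy as np

# Sketch: with c_t = f_t * c_{t-1} + i_t * c~_t, the Jacobian dc_t/dc_{t-1}
# is diag(f_t); the gradient over T steps is the product of the forget gates.
rng = np.random.default_rng(2)
d, T = 8, 100

grad_gated = np.ones(d)      # diagonal of the accumulated Jacobian, gated path
grad_tanh = np.ones(d)       # for contrast: per-step factors well below 1
for _ in range(T):
    f_t = rng.uniform(0.99, 1.0, d)     # forget gate held close to 1
    grad_gated *= f_t
    grad_tanh *= 0.5                    # e.g. a saturated tanh' per step
print(f"gated path after {T} steps:     {grad_gated.mean():.3f}")   # ~0.6, still usable
print(f"tanh-like path after {T} steps: {grad_tanh.mean():.1e}")    # ~8e-31, vanished
```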
2. Architecture: residual connections
A residual connection computes \(h_t = h_{t-1} + F(h_{t-1})\), giving a Jacobian \(\partial h_t / \partial h_{t-1} = I + \partial F / \partial h_{t-1}\). The identity term ensures the gradient has a direct path that does not pick up additional factors layer-by-layer. This is the depth-direction analog of the LSTM’s time-direction fix.
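A quick numerical contrast between the two Jacobians, plain \(\partial F/\partial h\) versus residual \(I + \partial F/\partial h\); the block sizes and weight scale are illustrative assumptions.

```python
import numpy as np

# Sketch: backpropagate a unit gradient through L plain layers versus L
# residual layers built from the same small blocks F.
rng = np.random.default_rng(3)
d, L = 64, 50
g0 = rng.standard_normal(d)
g0 /= np.linalg.norm(g0)
g_plain, g_res = g0.copy(), g0.copy()

for _ in range(L):
    J_F = rng.standard_normal((d, d)) * (0.5 / np.sqrt(d))  # Jacobian of the block F
    g_plain = J_F.T @ g_plain             # plain stack: h_t = F(h_{t-1})
    g_res = g_res + J_F.T @ g_res         # residual: Jacobian is I + dF/dh
print(f"plain:    ||g|| = {np.linalg.norm(g_plain):.3e}")   # shrinks geometrically
print(f"residual: ||g|| = {np.linalg.norm(g_res):.3e}")     # stays at a usable scale
```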
3. Architecture: normalization
Batch norm, layer norm, and weight norm constrain activation and Jacobian scales during training, preventing the slow drift toward exploding or vanishing regimes. They do not provide an identity path the way gating or residuals do, but they keep the regime stable.
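A minimal sketch of the normalization idea, using layer norm without the learned gain and bias: whatever scale the incoming activations have, the output is re-standardized, so downstream layers never see a drifting magnitude.

```python
import numpy as np

# Simplified layer norm: standardize the activation vector to zero mean, unit variance.
def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(4)
for scale in (1e-3, 1.0, 1e3):
    h = scale * rng.standard_normal(256)
    print(f"input std {scale:8.0e} -> output std {layer_norm(h).std():.3f}")  # ~1.0 each time
```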
4. Initialization
He / Kaiming initialization sets \(\mathrm{Var}(W_{ij}) = 2/d_{\text{in}}\) for ReLU networks, which preserves activation variance through layers at initialization. Xavier / Glorot does the analogous thing for tanh networks. Correct initialization is necessary but not sufficient — it ensures the start is well-conditioned but does not protect against drift during training.
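A sketch of the variance-preservation claim: push an input through a deep ReLU stack under He initialization versus a naive \(1/d_{\text{in}}\) variance. Depth and width are illustrative choices.

```python
import numpy as np

# Sketch: activation scale through a deep ReLU stack under He initialization
# (Var = 2/d_in) versus a naive Var = 1/d_in choice.
rng = np.random.default_rng(5)
d, L = 256, 30
x0 = rng.standard_normal(d)

for label, var in (("He, 2/d", 2.0 / d), ("naive, 1/d", 1.0 / d)):
    h = x0.copy()
    for _ in range(L):
        W = rng.standard_normal((d, d)) * np.sqrt(var)
        h = np.maximum(W @ h, 0.0)        # ReLU layer
    print(f"{label:12s} ||h_L|| / ||x_0|| = {np.linalg.norm(h) / np.linalg.norm(x0):.3e}")
# He keeps the ratio near 1; the naive choice shrinks it by roughly 2^(-L/2).
```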
5. Optimization: gradient clipping
Gradient clipping addresses exploding gradients directly. If \(\|g\| > c\), rescale \(g \to c \cdot g / \|g\|\). The threshold \(c\) is a hyperparameter (often \(1\) or \(5\)). This does not help with vanishing gradients but is essential to keep training stable when gradients occasionally spike.
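A minimal sketch of clipping by global norm over a list of gradient arrays; the example values are illustrative.

```python
import numpy as np

# Sketch: if the combined norm of all gradients exceeds c, rescale them so the
# direction is kept but the global norm becomes exactly c.
def clip_by_global_norm(grads, c):
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > c:
        grads = [g * (c / total) for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
print(clip_by_global_norm(grads, c=5.0))           # rescaled to global norm 5
```

In PyTorch the same operation is provided by `torch.nn.utils.clip_grad_norm_`.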
6. Optimization: truncated BPTT
For RNNs, only backpropagate gradients through a fixed window of \(K\) steps. This sidesteps the worst of the vanishing and exploding problem at the cost of being unable to learn dependencies longer than \(K\) steps directly through the gradient.
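A minimal PyTorch-style sketch of the windowing, assuming placeholder names (`cell`, `head`, `loss_fn`, `optimizer`, `batch_size`, `hidden_size`, and `(T, batch, d)` tensors `inputs`/`targets`) that are not from the original text.

```python
import torch

# Sketch of truncated BPTT: run the recurrence in windows of K steps and cut
# the computation graph between windows with .detach().
K = 32
h = torch.zeros(batch_size, hidden_size)
for start in range(0, inputs.size(0), K):
    optimizer.zero_grad()
    h = h.detach()                        # gradients stop here: window boundary
    window_loss = 0.0
    for t in range(start, min(start + K, inputs.size(0))):
        h = cell(inputs[t], h)            # e.g. a torch.nn.RNNCell
        window_loss = window_loss + loss_fn(head(h), targets[t])
    window_loss.backward()                # BPTT only through the last <= K steps
    optimizer.step()
```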
Diagnosing in Practice
- Loss not decreasing, parameters near initialization. Likely vanishing gradients. Check gradient norms per layer (see the logging sketch after this list); if early-layer norms are a factor of \(10^6\) smaller than late-layer norms, that confirms it. Switch to ReLU/GELU, add residual connections, or use a normalized architecture.
- Loss diverges, NaNs appear. Likely exploding gradients. Add gradient clipping, lower the learning rate, or check initialization.
- Loss decreases for short sequences, plateaus for long ones. Likely vanishing through time. Switch from vanilla RNN to LSTM or GRU, or use a transformer.
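A small PyTorch-style sketch of the per-layer gradient check mentioned above; `model` and `loss` are placeholders, not from the original text.

```python
import torch

# Sketch: after backprop, log the gradient norm of every parameter tensor so
# early and late layers can be compared directly.
loss.backward()
for name, p in model.named_parameters():
    if p.grad is not None:
        print(f"{name:40s} grad norm = {p.grad.norm().item():.3e}")
# A gap of ~10^6 between early and late layers points to vanishing gradients.
```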
The history of deep learning since 2010 is in large part the history of these mitigations being discovered and standardized.