Jacobian Product Bound for Vanishing and Exploding Gradients

Claim

Let \(h_0, h_1, \ldots, h_T\) be hidden states of a recurrent network with \(h_t = f(h_{t-1}, x_t; \theta)\), and let \(J_t = \partial h_t / \partial h_{t-1}\). The gradient propagated through time satisfies

\[ \frac{\partial h_T}{\partial h_0} = \prod_{t=1}^T J_t, \]

and the operator norm is bounded by

\[ \left\| \frac{\partial h_T}{\partial h_0} \right\| \leq \prod_{t=1}^T \|J_t\|. \]

If \(\|J_t\| \leq \beta\) for all \(t\), then \(\|\partial h_T / \partial h_0\| \leq \beta^T\). So:

  • If \(\beta < 1\), the gradient norm is bounded above by \(\beta^T \to 0\) exponentially fast (vanishing gradients).
  • If every \(J_t\) has an eigenvalue of magnitude at least \(\rho > 1\) along a common eigendirection, the gradient norm grows along that direction at least as \(\rho^T\) (exploding gradients).

(Pascanu et al. 2013)

Setup and Notation

The chain rule for composition of functions gives, for the unrolled recurrence,

\[ \frac{\partial h_T}{\partial h_0} = J_T J_{T-1} \cdots J_1. \]

Operator norms are submultiplicative: \(\|AB\| \leq \|A\| \cdot \|B\|\). So

\[ \left\| \prod_t J_t \right\| \leq \prod_t \|J_t\|. \]

This bound is tight in the worst case: it is achieved when the top output (left) singular vector of each \(J_t\) coincides with the top input (right) singular vector of \(J_{t+1}\).
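
A quick numerical check of the inequality (a minimal sketch; NumPy, the random \(J_t\)'s, and the sizes below are illustrative choices, not part of the argument):

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 20, 8

    # Random stand-ins for the Jacobians J_1, ..., J_T.
    Js = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(T)]

    # Accumulate the product J_T ... J_1 in time order.
    P = np.eye(d)
    for J in Js:
        P = J @ P

    lhs = np.linalg.norm(P, ord=2)                           # ||prod_t J_t||
    rhs = np.prod([np.linalg.norm(J, ord=2) for J in Js])    # prod_t ||J_t||
    print(lhs, rhs, lhs <= rhs)                              # lhs <= rhs always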

Vanishing Case

Suppose \(\|J_t\| \leq \beta < 1\) for all \(t\). Then for any unit-norm vector \(u\),

\[ \left\| \prod_t J_t \, u \right\| \leq \beta^T \to 0 \]

as \(T \to \infty\). Consequently, for a downstream loss \(L\) depending on \(h_T\),

\[ \left\| \frac{\partial L}{\partial h_0} \right\| = \left\| \frac{\partial L}{\partial h_T} \prod_t J_t \right\| \leq \left\| \frac{\partial L}{\partial h_T} \right\| \cdot \beta^T. \]

The gradient at \(h_0\) shrinks exponentially in \(T\). Gradient contributions that flow through \(h_0\) (equivalently, through the early steps of the sequence) carry vanishing signal, so long-range dependencies are effectively not learned. \(\square\)
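
A minimal sketch of the decay, using hypothetical random Jacobians rescaled so that \(\|J_t\| = \beta < 1\):

    import numpy as np

    rng = np.random.default_rng(1)
    d, beta = 8, 0.9

    P = np.eye(d)
    for t in range(1, 101):
        J = rng.normal(size=(d, d))
        J *= beta / np.linalg.norm(J, ord=2)   # force ||J_t|| = beta
        P = J @ P
        if t % 20 == 0:
            # The product norm sits at or below beta**t, shrinking exponentially.
            print(t, np.linalg.norm(P, ord=2), beta ** t)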

When does \(\|J_t\| < 1\) happen?

For a tanh RNN, \(J_t = \operatorname{diag}(\tanh'(z_t)) W_{hh}\). The diagonal factor has entries \(\tanh'(z_t) \in (0, 1]\), with \(\tanh'(0) = 1\) and rapidly smaller for \(|z_t| \gtrsim 1\). So

\[ \|J_t\| \leq \|\operatorname{diag}(\tanh'(z_t))\| \cdot \|W_{hh}\| \leq 1 \cdot \|W_{hh}\|. \]

Even if \(\|W_{hh}\|\) is exactly \(1\), in practice most units are at least partially saturated, so the diagonal entries are strictly less than \(1\) and \(\|J_t\| < 1\) generically. This is why gradients in vanilla tanh RNNs tend to vanish.
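
A sketch with assumed shapes: hidden size \(d\), a random \(W_{hh}\) rescaled to spectral norm exactly \(1\), and random vectors standing in for the input term \(W_{xh} x_t + b\):

    import numpy as np

    rng = np.random.default_rng(2)
    d, T = 16, 50

    W_hh = rng.normal(size=(d, d))
    W_hh /= np.linalg.norm(W_hh, ord=2)        # ||W_hh|| = 1 exactly

    h = rng.normal(size=d)
    P = np.eye(d)
    for t in range(T):
        z = W_hh @ h + rng.normal(size=d)      # random term stands in for W_xh x_t + b
        h = np.tanh(z)
        J = np.diag(1.0 - h ** 2) @ W_hh       # J_t = diag(tanh'(z_t)) W_hh
        P = J @ P

    # Saturation keeps ||J_t|| < 1 even though ||W_hh|| = 1, and the product collapses.
    print(np.linalg.norm(J, ord=2), np.linalg.norm(P, ord=2))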

Exploding Case

The submultiplicative inequality is only an upper bound: a product of matrices can have norm much smaller than the product of norms. To prove growth one needs a lower bound, which requires a directional argument.

Suppose every \(J_t\) has a common eigenvector \(v\) with \(\|v\| = 1\) and eigenvalue \(\rho_t\) satisfying \(|\rho_t| \geq \rho > 1\). Then \(\prod_t J_t \, v = \big(\prod_t \rho_t\big) v\), a vector of norm at least \(\rho^T\). The gradient, evaluated against this direction, grows at least proportionally to \(\rho^T\) whenever \(\partial L / \partial h_T\) has a nonzero component along \(v\).
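
A sketch of the lower bound, with Jacobians constructed for illustration to share the eigenvector \(e_1\) with eigenvalue \(\rho > 1\):

    import numpy as np

    rng = np.random.default_rng(3)
    d, T, rho = 8, 30, 1.1

    v = np.zeros(d)
    v[0] = 1.0                                 # shared eigendirection e_1

    P = np.eye(d)
    for t in range(T):
        J = 0.1 * rng.normal(size=(d, d))      # arbitrary behaviour off the shared direction
        J[:, 0] = rho * v                      # J_t e_1 = rho e_1
        P = J @ P

    # The product applied to v has norm exactly rho**T here.
    print(np.linalg.norm(P @ v), rho ** T)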

In practice the singular vectors do not align so cleanly, but the largest Lyapunov exponent

\[ \lambda = \lim_{T \to \infty} \frac{1}{T} \log \|J_T \cdots J_1\| \]

(when it exists, by Oseledec’s theorem) governs the asymptotic growth rate. Values \(\lambda > 0\) correspond to exploding gradients on average; \(\lambda < 0\) to vanishing.
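
A sketch of estimating \(\lambda\): propagate a vector through the Jacobians, accumulate log norms, and renormalize at each step to avoid overflow or underflow (the random Jacobians are placeholders):

    import numpy as np

    rng = np.random.default_rng(4)
    d, T = 16, 5000

    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    log_growth = 0.0
    for t in range(T):
        J = rng.normal(size=(d, d)) / np.sqrt(d)   # placeholder Jacobian J_t
        u = J @ u
        n = np.linalg.norm(u)
        log_growth += np.log(n)                    # accumulate the log stretch factor
        u /= n                                     # renormalize to stay finite

    lam = log_growth / T                           # estimate of the top Lyapunov exponent
    print(lam)                                     # > 0: exploding on average; < 0: vanishing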

When the Bound Is Loose

The bound \(\|\prod_t J_t\| \leq \prod_t \|J_t\|\) is loose when the dominant singular directions of the \(J_t\)'s are misaligned: the strongly amplified output direction of one step falls into a weakly amplified input direction of the next, so the product can have a much smaller norm than the worst case suggests.
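
A two-dimensional illustration: each factor has norm \(10\), but their dominant directions are orthogonal and the product has norm \(1\):

    import numpy as np

    # A stretches strongly along e_1, B stretches strongly along e_2;
    # A's amplified output direction lands in B's contracted input direction.
    A = np.diag([10.0, 0.1])
    B = np.diag([0.1, 10.0])

    print(np.linalg.norm(B @ A, ord=2))                          # 1.0
    print(np.linalg.norm(A, ord=2) * np.linalg.norm(B, ord=2))   # 100.0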

Gating in LSTM and GRU attacks the same product structure. When the LSTM forget gate \(f_t \approx 1\), the cell-state Jacobian is approximately the identity; a product of near-identities stays near the identity, with norm close to \(1\) regardless of \(T\).
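
A sketch of the cell-state path, keeping only the direct Jacobian \(\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)\) and ignoring the indirect dependence through the gates:

    import numpy as np

    rng = np.random.default_rng(5)
    d, T = 16, 1000

    P = np.eye(d)
    for t in range(T):
        f = 1.0 - 1e-4 * rng.random(size=d)    # forget gates close to 1
        P = np.diag(f) @ P                     # cell-state Jacobian diag(f_t)

    # After 1000 steps the product norm is still close to 1, not beta**T small.
    print(np.linalg.norm(P, ord=2))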

Implications for Training

The bound above is stated for the forward propagation of perturbations, \(\partial h_T / \partial h_0\), but by the reverse-mode chain rule the backward pass \(\partial L / \partial h_0\) involves the transpose of the same product, and \(\|A^\top\| = \|A\|\). Both directions therefore vanish or explode together.

So:

  • Vanishing forward means vanishing backward: early-time information cannot affect the late-time output, and the late-time loss provides essentially no training signal through the early steps.
  • Exploding forward means exploding backward: small input perturbations cause arbitrarily large output changes, and occasional gradient steps throw parameters outside the region where the local approximation is valid.

Both pathologies are equally serious. The mitigations — gating, residual connections, normalization, gradient clipping — target the Jacobian-product structure directly.
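
For concreteness, a minimal sketch of one of these mitigations, global-norm gradient clipping (the threshold and the example gradients are placeholders):

    import numpy as np

    def clip_by_global_norm(grads, max_norm):
        # Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.
        total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total > max_norm:
            grads = [g * (max_norm / total) for g in grads]
        return grads

    # Usage: clip just before the parameter update when the Jacobian product explodes.
    grads = [100.0 * np.random.randn(4, 4), 100.0 * np.random.randn(4)]
    clipped = clip_by_global_norm(grads, max_norm=5.0)
    print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # <= 5.0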

References

Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. 2013. “On the Difficulty of Training Recurrent Neural Networks.” International Conference on Machine Learning (ICML), 1310–18. https://proceedings.mlr.press/v28/pascanu13.html.