Gradient Clipping

Motivation

Training deep recurrent networks and large transformers occasionally produces gradients with enormous norm, orders of magnitude larger than the typical step. A single such gradient, applied at the standard learning rate, can move parameters far enough to destabilize training: the loss diverges and subsequent gradients come back NaN. Gradient clipping (Pascanu et al. 2013) caps the norm of the gradient before the optimizer step, trading exact gradient direction for stability.

The technique is small but essential. Almost every recurrent network and most transformers train with some form of clipping. Without it, an unlucky gradient spike on any minibatch can wreck a training run that has been progressing for hours.

Norm Clipping

The standard form. Compute the gradient \(g\) over the minibatch, then rescale it if its norm exceeds a threshold \(c\):

\[ g \leftarrow \begin{cases} g & \|g\| \leq c, \\ c \cdot \dfrac{g}{\|g\|} & \|g\| > c. \end{cases} \]

The norm is typically the global \(\ell_2\) norm across all parameters concatenated into one vector; this is what torch.nn.utils.clip_grad_norm_ computes. The direction of \(g\) is preserved; only the magnitude is bounded. This is the form to use by default.
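In PyTorch this is one call between backward() and the optimizer step. A minimal sketch, assuming a generic model, optimizer, and batch (all stand-ins for the real training setup); torch.nn.utils.clip_grad_norm_ is the actual library function.

    import torch
    import torch.nn as nn

    # Toy model, optimizer, and batch; placeholders for the real training setup.
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

    # Rescale the global l2 norm of all gradients to at most c = 1.0.
    # Must run after backward() and before step(), so the optimizer
    # sees the clipped gradient.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()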

Typical thresholds: \(c = 1\) for transformers, \(c \in [1, 5]\) for LSTMs and other RNNs. The right value depends on the architecture and on the typical gradient norm during stable training; a common heuristic is to log the unclipped norm for the first few hundred steps and pick \(c\) around the \(90\)th percentile.
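A sketch of that heuristic, assuming 300 measurement steps and the 90th percentile are reasonable for the run at hand (both are tunable); the norm computation matches the global \(\ell_2\) definition above.

    import torch
    import torch.nn as nn

    def global_grad_norm(parameters):
        # Global l2 norm over all gradients, the quantity clip_grad_norm_ bounds.
        norms = [p.grad.detach().norm(2) for p in parameters if p.grad is not None]
        return torch.norm(torch.stack(norms), 2).item()

    model = nn.Linear(10, 1)                     # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    observed = []
    for step in range(300):                      # "the first few hundred steps"
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        optimizer.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        observed.append(global_grad_norm(model.parameters()))
        optimizer.step()                         # no clipping while measuring

    c = torch.quantile(torch.tensor(observed), 0.90).item()
    print(f"suggested threshold: c = {c:.3f}")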

Diagram: norm clipping preserves direction

The original (red) gradient has norm \(\|g\| = 4\), exceeding the threshold \(c = 1.5\). Norm clipping rescales it to lie on the threshold ball with the same direction.

[Figure: from \(\theta_t\), the original gradient \(g\) (norm 4, drawn in red) extends past the threshold ball \(\|g\| = c = 1.5\); the clipped gradient (norm 1.5) lies on the ball along the same direction. Only the length is rescaled to \(c\); a per-coordinate value cap would distort the direction, which is why norm clipping is the modern default.]

Value Clipping

Cap each coordinate of the gradient independently:

\[ g_i \leftarrow \operatorname{clip}(g_i, -c, c). \]

This is the older, blunter approach. It distorts the direction of the gradient, not just its magnitude: coordinates above the cap are flattened to \(\pm c\) while the rest pass through unchanged, so the relative proportions of the coordinates change. Norm clipping has effectively replaced value clipping in modern practice.
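The distortion is easy to see on a two-coordinate example. A small sketch with made-up gradient values; in PyTorch the per-coordinate form is torch.nn.utils.clip_grad_value_.

    import torch

    g = torch.tensor([4.0, 0.5])       # made-up gradient with mixed coordinate scales
    c = 1.0

    value_clipped = g.clamp(-c, c)               # per-coordinate cap -> [1.0, 0.5]
    scale = (c / g.norm()).clamp(max=1.0)        # global rescale factor
    norm_clipped = g * scale                     # norm clipped to c

    # Value clipping rotates the gradient toward the small coordinate;
    # norm clipping leaves the direction untouched.
    print(g / g.norm())                          # tensor([0.9923, 0.1240])
    print(value_clipped / value_clipped.norm())  # tensor([0.8944, 0.4472])
    print(norm_clipped / norm_clipped.norm())    # tensor([0.9923, 0.1240])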

Why It Works

Gradient clipping addresses the exploding-gradient half of the vanishing-and-exploding-gradients problem. In an RNN unrolled over \(T\) steps, the gradient at time \(0\) involves a product of \(T\) Jacobians; if the largest singular value \(\rho\) of the recurrent Jacobian exceeds \(1\), the norm of this product can grow like \(\rho^T\). Most steps produce well-behaved gradients, but the occasional step that hits a sharp region of the loss surface produces a gradient orders of magnitude larger.
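A toy illustration of that growth, not part of the original analysis: take a Jacobian whose singular values all equal \(\rho = 1.2\) (a scaled random orthogonal matrix), so each application scales the norm by exactly \(\rho\), and watch a unit vector blow up like \(\rho^T\).

    import torch

    torch.manual_seed(0)
    rho = 1.2
    Q, _ = torch.linalg.qr(torch.randn(64, 64))  # random orthogonal matrix
    W = rho * Q                                  # every singular value equals rho

    v = torch.randn(64)
    v = v / v.norm()                             # unit vector
    for T in (10, 50, 100):
        u = v.clone()
        for _ in range(T):
            u = W @ u                            # apply the Jacobian T times
        print(f"T={T:3d}   norm = {u.norm().item():.3e}   rho**T = {rho**T:.3e}")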

Clipping limits the damage: instead of taking a step proportional to that huge gradient (which jumps far outside the locally valid trust region), the optimizer takes a step of bounded size. The direction is still informative; large-magnitude gradients still point roughly downhill, so the bounded step still makes progress.

Clipping does not address vanishing gradients; for those one needs architectural changes (LSTM, residuals) or different activations.

Convergence Considerations

Clipping introduces a non-uniform shrinkage of the gradient: the effective learning rate is smaller on steps where \(\|g\| > c\). This complicates the standard convergence analysis but does not break it. For the typical case of bounded variance and only-occasional clipping, SGD with clipping converges at the standard rate, and in practice the stability benefit outweighs the distortion of the occasional clipped step.
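Concretely, the clipped step can be read as SGD with a data-dependent learning rate, which is where the shrinkage enters the analysis:

\[ \theta_{t+1} = \theta_t - \eta \min\!\left(1, \frac{c}{\|g_t\|}\right) g_t, \qquad \eta_{\text{eff}} = \eta \min\!\left(1, \frac{c}{\|g_t\|}\right) \leq \eta. \]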

When to Use It

  • Always for RNNs. Vanilla RNNs, LSTMs, and GRUs all benefit; the gradient distribution is heavy-tailed.
  • Almost always for transformers. Even with stable architectures, occasional gradient spikes occur during early training. Standard practice is \(c = 1\).
  • Sometimes for CNNs. Plain image classifiers are usually fine without it. Adversarial training and GANs benefit.

Diagnostics

  • Log the unclipped gradient norm (a logging sketch follows this list). If the gradient is rarely clipped (say, on \(< 1\%\) of steps), the threshold is fine. If it is clipped most of the time, the threshold is too low and is acting as a per-step learning-rate cap rather than a safety net; raise it.
  • If clipping happens frequently for the first few hundred steps and then drops to nearly zero, that is healthy: early training is unstable and clipping prevents divergence.
  • If a training run that previously was stable starts clipping a lot, something has changed — usually a learning-rate increase or a data-distribution shift — and the clipping is masking the underlying issue.
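A sketch of the first diagnostic. It relies on one real detail of the PyTorch API, namely that clip_grad_norm_ returns the total norm computed before clipping; the model, data, and the 1000-step window are placeholders.

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                     # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    c, clipped, total = 1.0, 0, 1000

    for step in range(total):
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        optimizer.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()

        # clip_grad_norm_ clips in place and returns the pre-clipping
        # global norm, so one call both clips and produces the log value.
        pre_clip = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=c)
        if pre_clip > c:
            clipped += 1
        optimizer.step()

    print(f"clipped on {100 * clipped / total:.1f}% of steps")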

References

Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. 2013. “On the Difficulty of Training Recurrent Neural Networks.” International Conference on Machine Learning (ICML), 1310–18. https://proceedings.mlr.press/v28/pascanu13.html.