Residual Connections Preserve Gradient Norm at Initialization

Claim

Consider a stack of \(L\) residual blocks, \(h_\ell = h_{\ell-1} + F_\ell(h_{\ell-1})\), with each \(F_\ell\) initialized so that its Jacobian has small norm: \(\|\partial F_\ell / \partial h_{\ell-1}\| \leq \epsilon\) for some \(\epsilon \ll 1\). Then the gradient of a loss \(\mathcal{L}\) (written in script to avoid clashing with the depth \(L\)) with respect to the input satisfies

\[ (1 - \epsilon)^L \, \left\|\frac{\partial \mathcal{L}}{\partial h_L}\right\| \leq \left\|\frac{\partial \mathcal{L}}{\partial h_0}\right\| \leq (1 + \epsilon)^L \, \left\|\frac{\partial \mathcal{L}}{\partial h_L}\right\|. \]

In particular, for \(\epsilon = O(1/L)\), both bounds are \(\Theta(1)\) and the gradient norm is preserved through depth — neither vanishing nor exploding (He et al. 2016).

This is a structural property at initialization, and a key part of what makes very deep residual networks trainable. (Compare with vanilla feedforward networks, where the analogous gradient product satisfies only \(\|\prod_\ell J_\ell\| \leq \prod_\ell \|J_\ell\|\), with \(\|J_\ell\| < 1\) generically, so the bound decays exponentially. See the Jacobian-product bound.)
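
To make the contrast concrete, here is a minimal numerical sketch, assuming NumPy; the dimension, depth, seed, and the 0.9 contraction factor are illustrative choices, not parameters from any particular network.

```python
# Numerical sketch: a product of near-identity Jacobians keeps its norm
# near 1, while a product of generic contractive Jacobians decays
# geometrically. Dimension, depth, and seed are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 100
eps = 1.0 / L

def random_matrix_with_norm(norm):
    """Random d x d matrix rescaled to the given spectral norm."""
    A = rng.standard_normal((d, d))
    return A * (norm / np.linalg.norm(A, ord=2))

residual = np.eye(d)   # product of (I + A_l) with ||A_l|| = eps
vanilla = np.eye(d)    # product of J_l with ||J_l|| = 0.9
for _ in range(L):
    residual = (np.eye(d) + random_matrix_with_norm(eps)) @ residual
    vanilla = random_matrix_with_norm(0.9) @ vanilla

print("residual ||prod||:", np.linalg.norm(residual, ord=2))  # within [0.37, 2.72]
print("vanilla  ||prod||:", np.linalg.norm(vanilla, ord=2))   # at most 0.9**100 ~ 2.7e-5
```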

Setup

Each residual block has Jacobian

\[ \frac{\partial h_\ell}{\partial h_{\ell-1}} = I + \frac{\partial F_\ell}{\partial h_{\ell-1}}. \]

The chain rule gives

\[ \frac{\partial \mathcal{L}}{\partial h_0} = \frac{\partial \mathcal{L}}{\partial h_L} \prod_{\ell=1}^L \left(I + \frac{\partial F_\ell}{\partial h_{\ell-1}}\right). \]

Write \(A_\ell = \partial F_\ell / \partial h_{\ell-1}\) and assume \(\|A_\ell\| \leq \epsilon\). We want to bound the operator norm of \(\prod_\ell (I + A_\ell)\). (The chain-rule factors are ordered from \(\ell = L\) down to \(1\), but the norm bounds below do not depend on the order.)
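
As a concrete instance, consider a toy residual block with branch \(F(h) = W_2 \tanh(W_1 h)\), an illustrative choice rather than a prescribed architecture. Its block Jacobian is \(I + W_2 \operatorname{diag}(1 - \tanh^2(W_1 h)) W_1\), which a finite-difference check confirms:

```python
# Sketch: the Jacobian of one residual block h -> h + F(h) is I + dF/dh,
# checked by finite differences on a toy branch F(h) = W2 @ tanh(W1 @ h).
# All names and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
d = 8
W1 = rng.standard_normal((d, d)) * 0.1
W2 = rng.standard_normal((d, d)) * 0.1

def block(h):
    return h + W2 @ np.tanh(W1 @ h)

h0 = rng.standard_normal(d)

# Analytic block Jacobian: I + W2 @ diag(1 - tanh^2) @ W1.
s = np.tanh(W1 @ h0)
J_analytic = np.eye(d) + W2 @ np.diag(1.0 - s**2) @ W1

# Central finite differences, one basis direction per column.
delta = 1e-6
J_fd = np.stack(
    [(block(h0 + delta * e) - block(h0 - delta * e)) / (2 * delta)
     for e in np.eye(d)], axis=1)

print(np.allclose(J_analytic, J_fd, atol=1e-5))  # True
```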

Upper Bound

For any matrices \(A\), \(B\), \(\|AB\| \leq \|A\| \cdot \|B\|\) and \(\|I + A\| \leq 1 + \|A\|\) by the triangle inequality. Iterating,

\[ \left\|\prod_\ell (I + A_\ell)\right\| \leq \prod_\ell \|I + A_\ell\| \leq \prod_\ell (1 + \|A_\ell\|) \leq (1 + \epsilon)^L. \]

For \(\epsilon \leq 1/L\), \((1 + \epsilon)^L \leq (1 + 1/L)^L < e \approx 2.72\), a small constant independent of \(L\). So the gradient norm is bounded above by a constant times \(\|\partial \mathcal{L} / \partial h_L\|\), regardless of depth.
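
A quick empirical check of this chain of inequalities, on random factors whose norms vary below \(\epsilon\) (sizes and seed are illustrative):

```python
# Empirical check of the upper-bound chain:
# ||prod (I + A_l)|| <= prod (1 + ||A_l||) <= (1 + eps)^L < e.
import numpy as np

rng = np.random.default_rng(2)
d, L = 32, 50
eps = 1.0 / L

P = np.eye(d)
bound = 1.0
for _ in range(L):
    A = rng.standard_normal((d, d))
    A *= rng.uniform(0, eps) / np.linalg.norm(A, ord=2)  # ||A_l|| <= eps
    P = (np.eye(d) + A) @ P
    bound *= 1.0 + np.linalg.norm(A, ord=2)

print(np.linalg.norm(P, ord=2), "<=", bound, "<=", (1 + eps) ** L, "<", np.e)
```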

Lower Bound

This is the harder direction. We need

\[ \|(I + A_1) \cdots (I + A_L) v\| \geq (1 - \epsilon)^L \|v\| \]

for any vector \(v\). By the reverse triangle inequality, \(\|(I + A) v\| \geq \|v\| - \|A v\| \geq (1 - \|A\|) \|v\|\). Since this holds for every vector, it can be applied factor by factor to the output of the previous step. (The gradient is a row vector acting on the left, but the same bound applies there, because \(\|v^\top M\| = \|M^\top v\|\) and \(\sigma_{\min}(M^\top) = \sigma_{\min}(M)\).) Iterating,

\[ \|(I + A_1) \cdots (I + A_L) v\| \geq (1 - \epsilon)^L \|v\|. \]

For \(\epsilon \leq 1/L\), \((1 - \epsilon)^L \geq (1 - 1/L)^L\), which increases toward \(e^{-1} \approx 0.37\) and is at least \(1/4\) for every \(L \geq 2\): again a depth-independent constant. The gradient norm is bounded below by a constant times \(\|\partial \mathcal{L} / \partial h_L\|\).

Combining the two bounds, \(\|\partial \mathcal{L} / \partial h_0\|\) is bounded above and below by constants times \(\|\partial \mathcal{L} / \partial h_L\|\), uniformly in \(L\). \(\square\)
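
The two-sided bound can be sanity-checked through singular values: \(\sigma_{\max}\) is submultiplicative and \(\sigma_{\min}\) is supermultiplicative, so every singular value of the product should land in \([(1-\epsilon)^L, (1+\epsilon)^L]\). A NumPy sketch with illustrative sizes:

```python
# Check the two-sided bound via singular values: every singular value of
# M = prod_l (I + A_l) lies in [(1 - eps)^L, (1 + eps)^L] when each
# ||A_l|| <= eps. Sizes and seed are illustrative.
import numpy as np

rng = np.random.default_rng(3)
d, L = 32, 100
eps = 1.0 / L

M = np.eye(d)
for _ in range(L):
    A = rng.standard_normal((d, d))
    A *= eps / np.linalg.norm(A, ord=2)  # enforce ||A_l|| = eps
    M = (np.eye(d) + A) @ M

svals = np.linalg.svd(M, compute_uv=False)
lo, hi = (1 - eps) ** L, (1 + eps) ** L
print(lo, "<=", svals.min(), "and", svals.max(), "<=", hi)  # both hold
```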

When the Assumption \(\|A_\ell\| \leq \epsilon\) Holds

The assumption \(\|\partial F_\ell / \partial h_{\ell-1}\| \leq \epsilon\) at initialization requires that \(F_\ell\) be a small perturbation of zero. Standard initialization schemes approximate this only loosely:

  • He / Kaiming initialization for ReLU networks draws weights from \(\mathcal{N}(0, 2/d_{\text{in}})\). This keeps the operator norm of a single layer's Jacobian bounded at initialization, concretely \(\|\partial F / \partial h\| = O(1)\) in expectation, not \(O(1/L)\).
  • For very deep networks (\(L \gtrsim 100\)), one-over-\(L\) scaling is sometimes imposed explicitly: scale the residual branch by \(1/\sqrt{L}\) (or \(1/L\)) at initialization. Depth-dependent scaling of this kind is one of the ingredients that makes extreme depths, on the order of a thousand layers, trainable; see the sketch after this list.
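
A minimal sketch of such branch scaling, assuming He-initialized linear-plus-ReLU branches in NumPy; the sizes, seed, and the exact \(1/\sqrt{L}\) rule are illustrative assumptions, not a prescribed recipe:

```python
# Sketch of explicit depth-dependent branch scaling at initialization:
# He-initialized linear+ReLU branches, each multiplied by 1/sqrt(L).
# Sizes and seed are illustrative.
import numpy as np

rng = np.random.default_rng(4)
d, L = 64, 1000
scale = 1.0 / np.sqrt(L)

# He-style weights for each residual branch: N(0, 2/d_in).
Ws = [rng.standard_normal((d, d)) * np.sqrt(2.0 / d) for _ in range(L)]

h = rng.standard_normal(d)
h0_norm = np.linalg.norm(h)
for W in Ws:
    h = h + scale * (W @ np.maximum(0.0, h))  # scaled residual branch

# With the 1/sqrt(L) factor the norm grows by a depth-independent constant;
# without it, the same stack grows the norm geometrically in L.
print("||h_L|| / ||h_0|| =", np.linalg.norm(h) / h0_norm)
```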

In practice, the bound’s constants are loose at initialization, but the qualitative picture is correct: residual networks have Jacobian products that stay near the identity, and gradients propagate without geometric decay.

After Training

The argument above applies at initialization. As training progresses, the norms \(\|A_\ell\|\) grow and the bound loosens. Two empirical observations:

  • Effective depth is shorter than nominal depth. Veit, Wilber, and Belongie (2016) showed that most of the predictive signal in a trained ResNet flows through the shorter skip paths, not through deep residual chains. The network behaves like an ensemble of shallower networks.
  • Activation norms tend to grow with depth in trained networks. Without normalization this would cause exploding activations; batch normalization or layer normalization within each residual block keeps things stable.

Connection to LSTM Gating

The cell-state recurrence in an LSTM, \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t\), has Jacobian \(\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)\) when \(f_t\) does not directly depend on \(c_{t-1}\). When \(f_t \approx 1\), this is approximately the identity: the same structural form as the residual Jacobian \(I + A\).

The conceptual unity: an additive path with near-identity Jacobian preserves gradient norm. Residual connections solve the depth-direction case; LSTM gating solves the time-direction case. Both use the same trick.
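
A finite-difference check of the gating Jacobian, holding the gates fixed with respect to \(c_{t-1}\) as assumed above (toy NumPy sketch; gate values and sizes are illustrative):

```python
# Finite-difference check that d c_t / d c_{t-1} = diag(f_t) when the
# gates f_t, i_t and the candidate c~_t are held fixed with respect to
# c_{t-1}. Gate values and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(5)
d = 4

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

f_t = sigmoid(rng.standard_normal(d) + 2.0)  # forget gate biased toward 1
i_t = sigmoid(rng.standard_normal(d))
c_tilde = np.tanh(rng.standard_normal(d))

def step(c_prev):
    return f_t * c_prev + i_t * c_tilde

c0 = rng.standard_normal(d)
delta = 1e-6
J = np.stack([(step(c0 + delta * e) - step(c0 - delta * e)) / (2 * delta)
              for e in np.eye(d)], axis=1)

print(np.allclose(J, np.diag(f_t)))  # True: the Jacobian is diag(f_t)
# Over T steps the gradient scales elementwise by prod_t f_t, which stays
# near 1 when f_t is near 1. That is the time-direction analogue of I + A.
```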

References

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. https://doi.org/10.1109/cvpr.2016.90.

Veit, Andreas, Michael J. Wilber, and Serge Belongie. 2016. “Residual Networks Behave Like Ensembles of Relatively Shallow Networks.” Advances in Neural Information Processing Systems 29 (NIPS 2016).