Residual Connections Preserve Gradient Norm at Initialization
Claim
Consider a stack of \(L\) residual blocks, \(h_\ell = h_{\ell-1} + F_\ell(h_{\ell-1})\), with each \(F_\ell\) initialized so that its Jacobian has small norm: \(\|\partial F_\ell / \partial h_{\ell-1}\| \leq \epsilon\) for some \(\epsilon \ll 1\). Then the gradient of a loss \(\mathcal{L}\) (written in script to avoid a clash with the depth \(L\)) with respect to the input satisfies
\[ (1 - \epsilon)^L \, \left\|\frac{\partial \mathcal{L}}{\partial h_L}\right\| \leq \left\|\frac{\partial \mathcal{L}}{\partial h_0}\right\| \leq (1 + \epsilon)^L \, \left\|\frac{\partial \mathcal{L}}{\partial h_L}\right\|. \]
In particular, for \(\epsilon = O(1/L)\), both bounds are \(\Theta(1)\) and the gradient norm is preserved through depth — neither vanishing nor exploding (He et al. 2016).
This is a structural property at initialization. It is what makes residual networks trainable to thousands of layers. (Compare with vanilla feedforward networks, where the analogous gradient product satisfies only \(\|\prod_\ell J_\ell\| \leq \prod_\ell \|J_\ell\|\), with \(\|J_\ell\| < 1\) generically — exponential decay. See the Jacobian-product bound.)
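The contrast is easy to see numerically. Below is a minimal sketch (NumPy; the dimension \(d = 64\), depth \(L = 200\), and the vanilla per-layer norm \(0.9\) are arbitrary demo choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 200
eps = 1.0 / L

def with_spectral_norm(M, target):
    """Rescale M so its largest singular value equals `target`."""
    return M * (target / np.linalg.norm(M, 2))

residual = np.eye(d)   # prod_l (I + A_l), with ||A_l|| = eps
vanilla = np.eye(d)    # prod_l J_l,       with ||J_l|| = 0.9
for _ in range(L):
    A = with_spectral_norm(rng.standard_normal((d, d)), eps)
    J = with_spectral_norm(rng.standard_normal((d, d)), 0.9)
    residual = (np.eye(d) + A) @ residual
    vanilla = J @ vanilla

# Residual product stays between (1 - eps)^L ~ 0.37 and (1 + eps)^L ~ 2.7;
# vanilla product is at most 0.9^200 ~ 7e-10: geometric decay.
print(np.linalg.norm(residual, 2), np.linalg.norm(vanilla, 2))
```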
Setup
Each residual block has Jacobian
\[ \frac{\partial h_\ell}{\partial h_{\ell-1}} = I + \frac{\partial F_\ell}{\partial h_{\ell-1}}. \]
The chain rule gives
\[ \frac{\partial \mathcal{L}}{\partial h_0} = \frac{\partial \mathcal{L}}{\partial h_L} \prod_{\ell=1}^L \left(I + \frac{\partial F_\ell}{\partial h_{\ell-1}}\right). \]
Write \(A_\ell = \partial F_\ell / \partial h_{\ell-1}\) and assume \(\|A_\ell\| \leq \epsilon\). We want to bound the operator norm of \(\prod_\ell (I + A_\ell)\).
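As a sanity check on this factorization, the following sketch builds a tiny residual stack with a two-layer \(\tanh\) branch (an illustrative choice of \(F_\ell\), not anything from the text) and compares the accumulated product \(\prod_\ell (I + A_\ell)\) against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 5, 3
# each block's branch: F_l(h) = W2 @ tanh(W1 @ h), with small weights
blocks = [(0.1 * rng.standard_normal((d, d)), 0.1 * rng.standard_normal((d, d)))
          for _ in range(L)]

def forward(h):
    for W1, W2 in blocks:
        h = h + W2 @ np.tanh(W1 @ h)
    return h

def chain_rule_jacobian(h):
    """Accumulate prod_l (I + A_l) with A_l = W2 diag(1 - tanh^2(W1 h)) W1."""
    J = np.eye(d)
    for W1, W2 in blocks:
        A = W2 @ np.diag(1.0 - np.tanh(W1 @ h) ** 2) @ W1
        J = (np.eye(d) + A) @ J          # each new factor multiplies on the left
        h = h + W2 @ np.tanh(W1 @ h)     # advance to the next block's input
    return J

h0 = rng.standard_normal(d)
J = chain_rule_jacobian(h0)
fd = np.column_stack([                   # central finite differences, column j
    (forward(h0 + 1e-6 * e) - forward(h0 - 1e-6 * e)) / 2e-6
    for e in np.eye(d)
])
print(np.max(np.abs(J - fd)))            # agreement to ~1e-9
```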
Upper Bound
For any matrices \(A\), \(B\), \(\|AB\| \leq \|A\| \cdot \|B\|\) and \(\|I + A\| \leq 1 + \|A\|\) by the triangle inequality. Iterating,
\[ \left\|\prod_\ell (I + A_\ell)\right\| \leq \prod_\ell \|I + A_\ell\| \leq \prod_\ell (1 + \|A_\ell\|) \leq (1 + \epsilon)^L. \]
For \(\epsilon \leq 1/L\), \((1 + \epsilon)^L \leq e \approx 2.72\) — a small constant independent of \(L\). So the gradient norm is bounded above by a constant times \(\|\partial \mathcal{L} / \partial h_L\|\), regardless of depth.
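For concreteness, the constant can be checked directly:

```python
# (1 + 1/L)^L increases toward e from below, so it never exceeds e
for L in (10, 100, 1000, 10000):
    print(L, (1 + 1 / L) ** L)   # 2.5937..., 2.7048..., 2.7169..., 2.7181...
```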
Lower Bound
This is the harder direction. We need
\[ \|(I + A_1) \cdots (I + A_L) v\| \geq (1 - \epsilon)^L \|v\| \]
for any vector \(v\). By the reverse triangle inequality, \(\|(I + A) v\| \geq \|v\| - \|A v\| \geq (1 - \|A\|) \|v\|\). Iterating,
\[ \|(I + A_1) \cdots (I + A_L) v\| \geq (1 - \epsilon)^L \|v\|. \]
For \(\epsilon \leq 1/L\), \((1 - \epsilon)^L \geq (1 - 1/L)^L\), which is at least \(1/4\) for \(L \geq 2\) and increases toward \(e^{-1} \approx 0.37\): again a constant independent of depth. The gradient norm is bounded below by a constant times \(\|\partial \mathcal{L} / \partial h_L\|\).
Since the vector bound above holds for every \(v\), the smallest singular value of \(\prod_\ell (I + A_\ell)\) is at least \((1 - \epsilon)^L\); transposing preserves singular values, so the same bound applies when the gradient row vector multiplies the product from the right. Combining the two bounds, \(\|\partial \mathcal{L} / \partial h_0\|\) is bounded above and below by constants times \(\|\partial \mathcal{L} / \partial h_L\|\), uniformly in \(L\). \(\square\)
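Both constants can be verified at once on a random product, under the same assumption \(\|A_\ell\| = \epsilon = 1/L\) (sizes are illustrative): every singular value of \(\prod_\ell (I + A_\ell)\) must land in \([(1 - \epsilon)^L, (1 + \epsilon)^L]\).

```python
import numpy as np

rng = np.random.default_rng(2)
d, L = 32, 100
eps = 1.0 / L

P = np.eye(d)
for _ in range(L):
    A = rng.standard_normal((d, d))
    A *= eps / np.linalg.norm(A, 2)      # enforce ||A_l|| = eps exactly
    P = (np.eye(d) + A) @ P

s = np.linalg.svd(P, compute_uv=False)
lo, hi = (1 - eps) ** L, (1 + eps) ** L
print(lo, "<=", s.min(), "and", s.max(), "<=", hi)
# typical output: 0.366 <= ~0.99 and ~1.01 <= 2.705
assert lo <= s.min() and s.max() <= hi
```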
When the Assumption \(\|A_\ell\| \leq \epsilon\) Holds
The assumption \(\|\partial F_\ell / \partial h_{\ell-1}\| \leq \epsilon\) at initialization requires that \(F_\ell\) be a small perturbation of zero. Standard initialization schemes get partway there:
- He / Kaiming initialization for ReLU networks draws weights from \(\mathcal{N}(0, 2/d_{\text{in}})\). The Jacobian of a single layer then has bounded operator norm at initialization — concretely, \(\|\partial F / \partial h\| = O(1)\) in expectation, not \(O(1/L)\) — so on its own it gives a bounded but not small \(\epsilon\).
- For very deep networks (\(L \gtrsim 100\)), an explicit depth-dependent scaling is sometimes added: multiply the residual branch by \(1/\sqrt{L}\) (or \(1/L\), which recovers \(\epsilon = O(1/L)\) exactly) at initialization. This is part of what makes ResNet-1000-scale depths trainable; see the sketch after this list.
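Here is one way such a block might look (a hedged PyTorch sketch; the module layout, names, and the \(1/\sqrt{L}\) placement are illustrative assumptions, not the recipe of any specific paper):

```python
import math
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual block whose branch is shrunk by 1/sqrt(depth) at init."""
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        for m in self.branch:
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                nn.init.zeros_(m.bias)
        # 1/sqrt(L) shrinks the branch Jacobian; use 1.0 / depth instead to
        # land in the eps = O(1/L) regime of the claim above.
        self.scale = 1.0 / math.sqrt(depth)

    def forward(self, h):
        return h + self.scale * self.branch(h)   # skip path left untouched

# usage: a deep stack whose per-block Jacobian stays near the identity at init
L = 1000
net = nn.Sequential(*[ScaledResidualBlock(64, depth=L) for _ in range(L)])
```

Note that the scale multiplies only the branch, never the skip path: the bound relies on the identity term of \(I + A_\ell\) staying exactly the identity.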
In practice, the bound’s constants are loose at initialization but the asymptotic flavor is correct: residual networks have Jacobian products that stay near the identity, and gradients propagate without geometric decay.
After Training
The argument above is for initialization. As training progresses, the \(A_\ell\) matrices grow and the bound loosens. Two empirical observations:
- Effective depth is shorter than nominal depth. Veit, Wilber, and Belongie (2016) showed that most of the predictive signal in a trained ResNet flows through the shorter skip paths, not through deep residual chains. The network behaves like an ensemble of shallower networks.
- The norms of the residual branches tend to grow with depth in trained networks. Without normalization, this would cause exploding activations; batch normalization or layer normalization within each residual block keeps things stable.
Connection to LSTM Gating
The cell-state recurrence in an LSTM, \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t\), has Jacobian \(\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)\) when \(f_t\) does not directly depend on \(c_{t-1}\). When \(f_t \approx 1\), this is approximately the identity — the same structural form as the residual Jacobian \(I + A\).
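A quick numerical check of that Jacobian claim (NumPy sketch; the gate parameterization from a single concatenated input is a standard textbook form, and all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
x, h_prev = rng.standard_normal(d), rng.standard_normal(d)
Wf, Wi, Wc = (rng.standard_normal((d, 2 * d)) for _ in range(3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cell_update(c_prev):
    z = np.concatenate([x, h_prev])      # gates ignore c_prev (no peepholes)
    f, i = sigmoid(Wf @ z), sigmoid(Wi @ z)
    c_tilde = np.tanh(Wc @ z)
    return f * c_prev + i * c_tilde

c_prev = rng.standard_normal(d)
jac = np.column_stack([                  # finite-difference Jacobian, column j
    (cell_update(c_prev + 1e-6 * e) - cell_update(c_prev - 1e-6 * e)) / 2e-6
    for e in np.eye(d)
])
f = sigmoid(Wf @ np.concatenate([x, h_prev]))
print(np.max(np.abs(jac - np.diag(f))))  # ~1e-10: Jacobian is diag(f_t)
```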
The conceptual unity: an additive path with near-identity Jacobian preserves gradient norm. Residual connections solve the depth-direction case; LSTM gating solves the time-direction case. Both use the same trick.