Residual Connections

Motivation

Plain feedforward networks become harder to train as depth grows. Past about \(20\) layers, they show higher training error than shallower ones, which is counterintuitive: a deeper network can in principle simulate a shallower one by making the extra layers identity functions. The empirical fact is that gradient-based optimization fails to find such a solution, and depth makes things worse rather than better.

A residual connection (He et al. 2016a) makes the identity solution easy by adding the input of a block to its output:

\[ y = F(x) + x. \]

The block \(F\) now models the deviation from identity rather than the full mapping. Stacking residual blocks gives ResNets, which made networks of \(50\), \(100\), and even \(1000\) layers trainable. Residual connections also appear in transformers (around every attention and feedforward block), in U-Nets (long-range skip connections from encoder to decoder), and in essentially every modern deep architecture.

The Block

A standard residual block computes

\[ y = F(x; W) + x, \]

where \(F\) is one or two convolutional layers (with batch norm and ReLU). The addition requires \(F(x)\) and \(x\) to have the same shape; when they do not (e.g., across a downsample), a \(1 \times 1\) projection \(W_s\) is used:

\[ y = F(x; W) + W_s x. \]

This is the only weight on the skip path; otherwise the skip is the identity.
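As a concrete sketch, here is the post-activation block written against PyTorch (the framework is an assumption; the article names none). The optional \(1 \times 1\) projection on the skip path handles shape changes as in the equation above; putting batch norm on the projection follows common ResNet implementations.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Post-activation residual block: y = ReLU(F(x) + skip(x))."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Residual branch F(x): two 3x3 convolutions with batch norm.
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Skip path: identity when shapes match, 1x1 projection W_s otherwise.
        if stride != 1 or in_channels != out_channels:
            self.skip = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.skip(x))  # y = ReLU(F(x) + W_s x)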

Diagram: a residual block

The skip path carries \(x\) unchanged; the residual branch computes \(F(x)\). The two are summed at the end of the block.

[Figure: x feeds both a skip path (identity) and a residual branch (conv → BN → ReLU → conv → BN); their sum passes through a final ReLU, so y = ReLU(F(x) + x). If F(x) ≈ 0 the block is a near-identity, and backprop sees ∂y/∂x = I + ∂F/∂x, whose identity term keeps gradients flowing through deep stacks.]

Why It Helps Training

Two complementary explanations.

1. Identity is easy. If the optimal \(F\) for the block is approximately zero, the block reduces to \(y \approx x\). The optimizer can express this trivially by driving the weights of \(F\) toward zero. Without the residual connection, expressing “do nothing” requires the layer to learn an approximate identity, which depends sensitively on initialization and is not what gradient descent naturally finds.

2. Gradients have an additive path. The Jacobian of the block is

\[ \frac{\partial y}{\partial x} = \frac{\partial F}{\partial x} + I. \]

The identity term ensures that the gradient propagated from \(y\) back to \(x\) has a direct, unattenuated path:

\[ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \left( \frac{\partial F}{\partial x} + I \right). \]

In a stack of \(L\) residual blocks, the gradient at the input is

\[ \frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_L} \prod_{\ell=1}^L \left( I + \frac{\partial F_\ell}{\partial x_{\ell-1}} \right). \]

If the \(\partial F_\ell / \partial x\) matrices are small at initialization (which they are, with He initialization), the product is approximately the identity and the gradient norm is preserved: it neither vanishes nor explodes.
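A small numerical check of this argument, under illustrative assumptions (PyTorch, linear layers with tanh standing in for the convolutional blocks above, depth \(100\)): compare the input-gradient norm of a plain stack with that of a residual stack.

import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 100, 64

def input_grad_norm(residual):
    layers = [nn.Linear(dim, dim) for _ in range(depth)]
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for layer in layers:
        f = torch.tanh(layer(h))        # the block's residual branch F
        h = h + f if residual else f    # additive skip vs. plain stacking
    h.sum().backward()
    return x.grad.norm().item()

print("plain   :", input_grad_norm(residual=False))  # shrinks toward zero
print("residual:", input_grad_norm(residual=True))   # stays well away from zero

With the additive skips, every factor in the Jacobian product contains the identity term, so the gradient reaching the input keeps a healthy norm; without them, the product of contractive factors drives it toward zero.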

This is the depth-direction analog of LSTM gating’s solution to vanishing gradients in the time direction. The pattern — additive path with near-identity Jacobian — is the lasting design lesson.

Pre-Activation vs. Post-Activation

The original ResNet (He et al. 2016a) used a post-activation form: Conv → BN → ReLU → Conv → BN → Add → ReLU. The follow-up paper (He et al. 2016b) found that a pre-activation form, BN → ReLU → Conv → BN → ReLU → Conv → Add, trains better at very large depths. The intuition: with pre-activation, the skip path is exactly the identity (no nonlinearity intervening), which preserves the additive-gradient property strictly.

For new architectures, prefer the pre-activation form.
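A minimal sketch of the pre-activation ordering, under the same PyTorch assumption:

import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: y = x + F(x), F = BN-ReLU-Conv-BN-ReLU-Conv."""
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # No nonlinearity after the addition: the skip path is exactly the identity.
        return x + self.branch(x)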

Bottleneck Blocks

For very deep networks, ResNet’s “bottleneck” block reduces the channel count before the expensive \(3 \times 3\) convolution and expands it back:

1×1 conv (C → C/4)  -- reduce channels
3×3 conv (C/4 → C/4)
1×1 conv (C/4 → C)  -- expand channels

A bottleneck block operating on \(C\) channels costs roughly as much as a standard two-\(3 \times 3\) block operating on \(C/4\) channels, so it supports much wider effective channel counts at the same compute budget. Standard for ResNet-50 and deeper.
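A sketch of the bottleneck block in the same pre-activation style (PyTorch assumed; the reduction factor of \(4\) follows ResNet-50):

import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Bottleneck residual block: 1x1 reduce, 3x3 at reduced width, 1x1 expand."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, mid, 1, bias=False),        # reduce: C -> C/4
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False),  # 3x3 at reduced width
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),        # expand: C/4 -> C
        )

    def forward(self, x):
        return x + self.branch(x)

For \(C = 256\) the branch has about \(256 \cdot 64 + 9 \cdot 64^2 + 64 \cdot 256 \approx 70\text{k}\) weights, roughly matching a two-\(3 \times 3\) block at width \(64\) (\(2 \cdot 9 \cdot 64^2 \approx 74\text{k}\)), which is the sense in which the bottleneck buys a \(4\times\) wider block for the same budget.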

Where Residuals Appear

  • CNN backbones (ResNet, ResNeXt, RegNet). Standard for ImageNet-scale classification and as feature extractors for downstream tasks.
  • Transformers. Every attention block and every feedforward block is wrapped in x → x + F(LayerNorm(x)); a sketch of this wrapper follows the list. Without these residuals, large transformers would not train.
  • U-Nets. Long-range skip connections from the encoder side to the decoder side. Different from in-block residuals but the same conceptual role: provide a path that does not pass through many layers.
  • Recurrent networks (sometimes). Residuals between time steps’ hidden states; less standard than in feedforward networks.
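The transformer case, x → x + F(LayerNorm(x)), can be sketched as a generic wrapper (PyTorch assumed; the sublayer is a placeholder for attention or a feedforward network):

import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Pre-norm residual wrapper: x -> x + sublayer(LayerNorm(x))."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer  # e.g. a self-attention module or an MLP

    def forward(self, x):
        return x + self.sublayer(self.norm(x))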

Limits

Residual connections solve the gradient-flow problem but do not by themselves give better representations. Modern architectures combine them with normalization, large-scale data, and good hyperparameter choices. A residual network without batch norm or careful initialization still trains, but not nearly as well as one with the full set of modern tricks.

Skipping every residual branch (i.e., setting all \(F_\ell = 0\)) reduces a residual network to the identity function. Empirically, deep residual networks behave like an ensemble of shallow networks: most of the predictive signal at the output comes from paths that pass through only a few residual branches, rather than from the longest paths. This was Veit, Wilber, and Belongie’s “unraveled view” finding (2016) and complicates the naive picture of deep residual networks as truly deep models.
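The unraveled view is easy to see for two blocks. With \(x_1 = x_0 + F_1(x_0)\) and \(x_2 = x_1 + F_2(x_1)\),

\[ x_2 = x_0 + F_1(x_0) + F_2\bigl(x_0 + F_1(x_0)\bigr), \]

so the output already mixes a path that skips everything, a path through \(F_1\) alone, and paths through \(F_2\) whose input itself splits into a direct route and an \(F_1\) route. For \(L\) blocks there are \(2^L\) such paths, and the short ones carry most of the signal.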

References

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. https://doi.org/10.1109/cvpr.2016.90.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. “Identity Mappings in Deep Residual Networks.” In Computer Vision – ECCV 2016. Springer International Publishing. https://doi.org/10.1007/978-3-319-46493-0_38.