Pooling
Motivation
A pooling layer (LeCun et al. 1989) in a convolutional network downsamples a feature map by aggregating values within local windows. The two purposes are:
- Spatial reduction. Halve or quarter the spatial dimensions, growing the effective receptive field per parameter and reducing compute and memory in subsequent layers.
- Local translation invariance. Small spatial shifts of the input within a pooling window produce the same pooled output. This complements the translation equivariance of convolution: convolution passes shifts through (a shifted input yields a correspondingly shifted output), while pooling actively suppresses small ones.
Modern architectures sometimes replace pooling with strided convolutions, which serve the same downsampling purpose with learnable weights. But pooling remains the simplest and most common spatial downsampling primitive.
Max Pooling
Within a window of size \(k \times k\) (typically \(2 \times 2\)) and stride \(s\) (typically \(2\)),
\[ y[c, i, j] = \max_{u \in [0, k), v \in [0, k)} x[c, s \cdot i + u, s \cdot j + v]. \]
Max pooling is the most common choice and has no parameters. It preserves whichever input is largest, which makes it useful for activation maps, where the presence of a feature (a large activation anywhere in the window) is the relevant signal.
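As a reference, a direct NumPy transcription of the formula might look like the following. The function name and the no-padding convention are illustrative; real frameworks use fused kernels rather than Python loops.

```python
import numpy as np

def max_pool2d(x, k=2, s=2):
    """Max-pool a (C, H, W) array with window k and stride s, no padding."""
    C, H, W = x.shape
    H_out, W_out = (H - k) // s + 1, (W - k) // s + 1
    y = np.empty((C, H_out, W_out), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            # Reduce each k-by-k window to its largest value, per channel.
            y[:, i, j] = x[:, s*i:s*i+k, s*j:s*j+k].max(axis=(1, 2))
    return y
```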
Average Pooling
\[ y[c, i, j] = \frac{1}{k^2} \sum_{u, v} x[c, s \cdot i + u, s \cdot j + v]. \]
Averages within the window, giving smoother behavior than max pooling. Used at the very end of many CNNs as global average pooling (window equal to the entire feature map): it collapses the spatial dimensions of the final feature map into a per-channel summary, which is then fed to a classifier head. This trick, introduced in Network-in-Network (Lin et al. 2013) and adopted by GoogLeNet, ResNet, and successors, eliminates the enormous parameter count of final fully-connected layers.
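A matching NumPy sketch for average pooling and its global variant, under the same illustrative conventions as above:

```python
import numpy as np

def avg_pool2d(x, k=2, s=2):
    """Average-pool a (C, H, W) array with window k and stride s, no padding."""
    C, H, W = x.shape
    H_out, W_out = (H - k) // s + 1, (W - k) // s + 1
    y = np.empty((C, H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            # Replace each k-by-k window with its mean, per channel.
            y[:, i, j] = x[:, s*i:s*i+k, s*j:s*j+k].mean(axis=(1, 2))
    return y

def global_avg_pool(x):
    """Global average pooling: collapse (C, H, W) to a (C,) summary."""
    return x.mean(axis=(1, 2))
```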
Diagram: max vs. average pooling on a \(4 \times 4\) input
The input is split into four non-overlapping \(2 \times 2\) windows (highlighted by colour). Max pool keeps the largest value in each window; average pool replaces it with the mean.
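The same computation on a concrete \(4 \times 4\) array (values chosen for illustration, not taken from the figure), using a reshape trick that works for the non-overlapping \(2 \times 2\), stride-\(2\) case:

```python
import numpy as np

x = np.array([[1., 3., 2., 0.],
              [5., 4., 1., 1.],
              [0., 2., 6., 2.],
              [1., 1., 3., 7.]])

# Reshape to (2, 2, 2, 2): axes are (window row, in-window row,
# window col, in-window col); then reduce over the in-window axes.
windows = x.reshape(2, 2, 2, 2)
print(windows.max(axis=(1, 3)))   # [[5. 2.]
                                  #  [2. 7.]]
print(windows.mean(axis=(1, 3)))  # [[3.25 1.  ]
                                  #  [1.   4.5 ]]
```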
Stride and Window Size
Standard choice: \(k = 2\), \(s = 2\). Halves each spatial dimension. Larger windows (\(k = 3\) with \(s = 2\), as in AlexNet) make adjacent windows overlap, which AlexNet reported as a slight regularization benefit, at the cost of mild computational overhead.
Output spatial size with input \(H\), window \(k\), stride \(s\):
\[ H_{\text{out}} = \left\lfloor \frac{H - k}{s} \right\rfloor + 1. \]
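The formula is short enough to encode directly; a minimal helper (name illustrative), with the two standard configurations from above as checks:

```python
def pool_out_size(H, k, s):
    """Output spatial size for input H, window k, stride s (no padding)."""
    return (H - k) // s + 1

assert pool_out_size(224, k=2, s=2) == 112  # standard 2x2, stride 2: halves
assert pool_out_size(55, k=3, s=2) == 27    # AlexNet's overlapping 3x3, stride 2
```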
Backpropagation Through Pooling
- Max pool. The gradient is routed only to the input position that produced the max; every other input within the window receives zero gradient. Implementations store the argmax during the forward pass.
- Average pool. The gradient is uniformly distributed across all \(k^2\) inputs in the window: each receives \(1/k^2\) of the output gradient.
These are exact gradients of the corresponding forward operations. Note the asymmetry: the max-pool gradient is sparse (one nonzero input per window), whereas average pooling and strided convolutions spread gradient across the whole window through fixed or learned weights. Empirically, strided convolutions can replace max pooling with little loss in accuracy (Springenberg et al. 2015).
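To make the gradient routing concrete, here is a sketch of max pooling that caches the argmax in the forward pass and uses it to scatter gradient in the backward pass. The closure-based interface and names are illustrative, not any particular framework's API:

```python
import numpy as np

def max_pool2d_with_grad(x, k=2, s=2):
    """Forward max pool over (C, H, W); returns the output and a backward fn."""
    C, H, W = x.shape
    H_out, W_out = (H - k) // s + 1, (W - k) // s + 1
    y = np.empty((C, H_out, W_out))
    argmax = np.empty((C, H_out, W_out), dtype=int)
    for i in range(H_out):
        for j in range(W_out):
            window = x[:, s*i:s*i+k, s*j:s*j+k].reshape(C, -1)
            argmax[:, i, j] = window.argmax(axis=1)  # cache winner's index
            y[:, i, j] = window.max(axis=1)

    def backward(dy):
        # Route each output gradient to the single input that won the max.
        dx = np.zeros_like(x, dtype=float)
        for i in range(H_out):
            for j in range(W_out):
                u, v = np.divmod(argmax[:, i, j], k)  # unflatten window index
                dx[np.arange(C), s*i + u, s*j + v] += dy[:, i, j]
        return dx

    return y, backward
```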
Local Translation Invariance
If a shift moves the maximum around within the same pooling window, the max-pooled output is unchanged. So a one-pixel shift of an edge feature inside a \(2 \times 2\) window produces the same pooled output. This is the invariance property: the network's response is constant under small shifts.
Compare to convolution alone, which is equivariant: a shift of the input produces a corresponding shift of the output, but a different output. Pooling adds true local invariance on top.
This is sometimes argued to be why CNNs are robust to small image shifts. The argument has a caveat: pooling’s invariance is local within the pooling window only. Across windows (i.e., for shifts comparable to the stride), invariance is not achieved by pooling alone — the network must learn it.
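Both halves of this argument can be checked directly on an illustrative feature map with a single active pixel:

```python
import numpy as np

def max_pool(x, k=2, s=2):
    h, w = (x.shape[0] - k) // s + 1, (x.shape[1] - k) // s + 1
    return np.array([[x[s*i:s*i+k, s*j:s*j+k].max()
                      for j in range(w)] for i in range(h)])

x = np.zeros((4, 4))
x[0, 0] = 1.0                # a feature at (0, 0)
shifted = np.zeros((4, 4))
shifted[0, 1] = 1.0          # shifted one pixel, still in the same 2x2 window
crossed = np.zeros((4, 4))
crossed[0, 2] = 1.0          # shifted into the next window

print(np.array_equal(max_pool(x), max_pool(shifted)))  # True: invariant
print(np.array_equal(max_pool(x), max_pool(crossed)))  # False: invariance ends
```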
Modern Alternatives
Some architectures avoid explicit pooling layers:
- Strided convolutions. A \(3 \times 3\) stride-\(2\) convolution downsamples while learning the aggregation weights (sketched after this section). Standard in many ResNet variants.
- Reduced downsampling. Some dense-prediction models keep spatial resolution high by replacing strided layers with dilated (atrous) convolutions, which grow the receptive field at stride \(1\) while increasing channel depth (e.g., DeepLab).
- Attention pooling / readout heads. In transformers and hybrid models, a learned attention mechanism aggregates spatial information instead of pooling.
For new convolutional designs, strided convolutions and global average pooling are the modern defaults. Max pooling and average pooling within the network are still used and still work; they are simply not always the best choice.
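A sketch of the strided-convolution alternative (single channel, illustrative names). With uniform weights it reduces to overlapping average pooling; letting the weights be learned generalizes it:

```python
import numpy as np

def strided_conv2d(x, w, s=2):
    """Single-channel strided convolution (cross-correlation), no padding."""
    k = w.shape[0]
    H_out = (x.shape[0] - k) // s + 1
    W_out = (x.shape[1] - k) // s + 1
    return np.array([[(x[s*i:s*i+k, s*j:s*j+k] * w).sum()
                      for j in range(W_out)] for i in range(H_out)])

x = np.random.randn(8, 8)
w = np.full((3, 3), 1 / 9)         # uniform weights: ~overlapping average pool
print(strided_conv2d(x, w).shape)  # (3, 3): downsampled, like a pooling layer
```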