Receptive Fields

Motivation

A unit at some layer of a convolutional network does not see the full input; it sees only a local region. The receptive field (LeCun et al. 1989) of a unit is the region of the input that can affect its value — equivalently, the support of the gradient of the unit’s activation with respect to the input. Receptive field size determines what an architecture can “see” at a given depth, and it grows with depth, stride, and dilation.

For object detection a \(3 \times 3\) receptive field cannot see a face; for medical imaging a \(50 \times 50\) receptive field cannot see an organ; for satellite imagery a \(200 \times 200\) receptive field cannot see a building. Designing a CNN means designing the receptive field profile at each layer to match the spatial scale of the features the model needs to detect.

Receptive Field of a Single Convolution

A single convolution with a \(k \times k\) filter has receptive field \(k \times k\): each output pixel sees a \(k \times k\) window of the input.
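
The gradient-support definition from the motivation can be checked numerically. A minimal sketch, assuming PyTorch (the text itself is framework-agnostic): backpropagate from one fixed output unit of a single \(3 \times 3\) convolution and look at which input pixels receive nonzero gradient.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
x = torch.zeros(1, 1, 9, 9, requires_grad=True)

y = conv(x)
y[0, 0, 4, 4].backward()       # one fixed output unit at position (4, 4)

support = x.grad[0, 0] != 0    # nonzero almost surely under random init
print(int(support.sum()))      # 9: a 3x3 window around (4, 4)
```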

Stacking: Linear Growth

Stacking \(L\) layers of \(k \times k\) convolutions with stride \(1\) and no dilation grows the receptive field linearly:

\[ r_L = 1 + L (k - 1). \]

A stack of three \(3 \times 3\) convolutions has receptive field \(7 \times 7\). Six layers give \(13 \times 13\). Twenty layers give \(41 \times 41\). Linear growth is slow if the goal is to see a \(224 \times 224\) image.

Diagram: receptive field after one, two, and three \(3 \times 3\) convolutions (shaded regions of \(3 \times 3\), \(5 \times 5\), and \(7 \times 7\) pixels).

The shaded square shows the support of one fixed unit’s gradient back to the input. Each new \(3 \times 3\) convolution adds two pixels of receptive field in each dimension, so after \(L\) stacked layers (stride \(1\), no dilation) the receptive field is \(1 + 2L\). Inserting a stride-\(2\) layer would multiply all subsequent receptive-field contributions by \(2\): geometric growth.

Strided Layers: Multiplicative Growth

A stride-\(s\) layer multiplies the receptive-field contribution of every subsequent layer by \(s\), measured in the input’s coordinates. The receptive field at layer \(L\) is

\[ r_L = 1 + \sum_{\ell=1}^{L} (k_\ell - 1) \prod_{i=1}^{\ell - 1} s_i, \]

where \(k_\ell\) is the kernel size and \(s_\ell\) is the stride at layer \(\ell\). Each downsample by \(2\) doubles all subsequent receptive-field contributions.

This is why CNNs typically interleave \(3 \times 3\) convolutions with stride-\(2\) pooling or strided convolutions: the receptive field grows roughly geometrically with depth, reaching the full input scale in \(\log_2(H)\) stride-2 stages plus a handful of in-stage convolutions.
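
A minimal sketch of this formula in Python (the function name and the layer schedules below are illustrative, not from any library):

```python
def receptive_field(layers):
    """Receptive field of the last layer in a stack.

    `layers` is a list of (kernel_size, stride) pairs from input to output,
    implementing r_L = 1 + sum_l (k_l - 1) * prod_{i<l} s_i.
    """
    r, jump = 1, 1              # jump: product of strides seen so far
    for k, s in layers:
        r += (k - 1) * jump     # this layer's contribution, in input pixels
        jump *= s
    return r

print(receptive_field([(3, 1)] * 3))   # 7:  linear growth
print(receptive_field([(3, 1)] * 20))  # 41
print(receptive_field([(3, 1), (3, 1), (2, 2)] * 4))  # 76: geometric growth
```

Twelve stride-\(1\) \(3 \times 3\) layers alone would reach only \(25\); interleaving the stride-\(2\) downsamples pushes a comparable twelve-layer stack to \(76\).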

Dilation: Multiplicative Growth Without Downsampling

A dilated convolution with rate \(d\) has effective kernel size \(1 + (k - 1) d\), so it acts like a much larger filter without the parameter count. WaveNet and atrous-convolution segmentation models use exponentially increasing dilation rates (\(1, 2, 4, 8, \ldots\)) to grow the receptive field while preserving spatial resolution. Stacking \(L\) dilated layers with rates \(1, 2, 4, \ldots, 2^{L-1}\) gives a receptive field on the order of \(2^L\): exponential in depth.
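
The same recurrence covers dilation once the kernel size is replaced by the effective kernel size \(1 + (k - 1) d\); a sketch under the same assumptions as above:

```python
def receptive_field_dilated(layers):
    """`layers` is a list of (kernel_size, stride, dilation) triples."""
    r, jump = 1, 1
    for k, s, d in layers:
        k_eff = 1 + (k - 1) * d     # effective kernel size under dilation
        r += (k_eff - 1) * jump
        jump *= s
    return r

# WaveNet-style schedule: k = 3, stride 1, rates 1, 2, 4, ..., 2^(L-1).
for L in (4, 8):
    print(L, receptive_field_dilated([(3, 1, 2 ** i) for i in range(L)]))
    # L = 4 -> 31, L = 8 -> 511: exactly 2^(L+1) - 1
```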

The cost is potential gridding artifacts, since dilated convolutions skip input positions and may not see them in any layer. Standard practice mixes dilation rates carefully or uses hybrid dilated convolution schedules to ensure full coverage.

Theoretical vs. Effective Receptive Field

The formulas above give the theoretical receptive field — the set of input positions that can affect the unit, in the sense of having a non-zero gradient path. The effective receptive field is the distribution of those gradients: how strongly each input position actually influences the output.

Luo et al. (2016) showed empirically that the effective receptive field is much smaller than the theoretical one, with a Gaussian-like decay from the center. Concretely, a unit with theoretical receptive field \(r \times r\) has effective receptive field on the order of \(\sqrt{r} \times \sqrt{r}\). The reason is combinatorial: the number of paths through which an input position can influence the unit is largest at the center and falls off binomially toward the periphery, which is what produces the Gaussian-like profile and leaves central pixels dominating the gradient.
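
This measurement is straightforward to reproduce in sketch form (PyTorch again as an assumed framework; the 20-layer stack is illustrative): average the input-gradient magnitude from a center unit over random inputs and see how quickly it decays relative to the 41-pixel theoretical support.

```python
import torch
import torch.nn as nn

depth = 20                                  # theoretical receptive field: 1 + 2*20 = 41
convs = [nn.Conv2d(1, 16, 3, padding=1)]
convs += [nn.Conv2d(16, 16, 3, padding=1) for _ in range(depth - 1)]
net = nn.Sequential(*[m for c in convs for m in (c, nn.ReLU())])

grads = torch.zeros(64, 64)
for _ in range(32):                         # average over random inputs
    x = torch.randn(1, 1, 64, 64, requires_grad=True)
    net(x)[0, :, 32, 32].sum().backward()   # backprop from the center position
    grads += x.grad[0, 0].abs()

row = grads[32] / grads.max()               # gradient profile through the center
print(int((row > 0.01).sum()))              # effective width: typically well under 41
```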

Practical implication: the theoretical receptive field overestimates how much context the unit actually uses. Architectures with theoretical receptive fields larger than the input often have effective receptive fields that do not even cover the full input.

Designing the Receptive Field

Three rules of thumb:

  • Match the scale of the features you care about. A face detector for \(224 \times 224\) images wants units somewhere in the network with effective receptive field around \(50\)–\(100\) pixels: large enough to see a face, small enough to localize.
  • Think in stages. A typical CNN has a few “stages,” each at a different spatial resolution (say \(H/2, H/4, H/8, H/16\)). Within a stage, a few \(3 \times 3\) convolutions add to the receptive field linearly; between stages, a stride-\(2\) downsample doubles it (made concrete in the sketch after this list).
  • Use dilation when downsampling is undesirable. Semantic segmentation needs both large receptive fields and high-resolution outputs; dilated convolutions or encoder-decoder architectures with skip connections are the standard solutions.
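
To make the stage arithmetic concrete, a sketch of the per-stage receptive-field profile for an illustrative four-stage schedule (two \(3 \times 3\) convolutions, then a stride-\(2\) downsample, per stage):

```python
# Two 3x3 convs per stage, then a 2x2 stride-2 downsample between stages.
r, jump = 1, 1
for stage in range(1, 5):
    for k, s in [(3, 1), (3, 1), (2, 2)]:
        r += (k - 1) * jump
        jump *= s
    print(f"after stage {stage}: resolution H/{jump}, receptive field {r}")
# after stage 1: resolution H/2,  receptive field 6
# after stage 2: resolution H/4,  receptive field 16
# after stage 3: resolution H/8,  receptive field 36
# after stage 4: resolution H/16, receptive field 76
```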

Why Depth Helps

A common misreading: “deeper networks just have more parameters.” A more accurate one: deeper networks have larger receptive fields per parameter than wider ones. A \(7 \times 7\) filter has \(49\) weights and a \(7 \times 7\) receptive field. Three stacked \(3 \times 3\) filters have \(27\) weights and the same \(7 \times 7\) receptive field, with two extra nonlinearities along the way. This is the architectural argument for depth: more compositional capacity at the same receptive-field cost.
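
A quick check of the weight counts (PyTorch as an assumed framework; with \(C\) input and output channels the two counts scale as \(49C^2\) and \(27C^2\)):

```python
import torch.nn as nn

c = 64  # illustrative channel count
big = nn.Conv2d(c, c, kernel_size=7, bias=False)                           # 7x7 receptive field
small = nn.Sequential(*(nn.Conv2d(c, c, 3, bias=False) for _ in range(3)))  # also 7x7

def params(m):
    return sum(p.numel() for p in m.parameters())

print(params(big), params(small))  # 200704 vs 110592, i.e. 49*c*c vs 27*c*c
```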

VGG (Simonyan & Zisserman, 2014) made this trade-off explicit: replace large filters with stacks of \(3 \times 3\) ones. The trade-off has held up — modern architectures use \(3 \times 3\) convolutions almost exclusively, and depth (with residual connections to keep training stable) has been the dominant scaling axis.

References

LeCun, Y., B. Boser, J. S. Denker, et al. 1989. “Backpropagation Applied to Handwritten Zip Code Recognition.” Neural Computation 1 (4): 541–51. https://doi.org/10.1162/neco.1989.1.4.541.
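
Luo, W., Y. Li, R. Urtasun, and R. Zemel. 2016. “Understanding the Effective Receptive Field in Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems 29.

Simonyan, K., and A. Zisserman. 2014. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv preprint. arXiv:1409.1556.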