Convolution Layer
Motivation
A standard fully-connected layer applied to an image throws away every piece of structure that makes the image tractable. A \(224 \times 224 \times 3\) image is treated as an unstructured 150,528-dimensional vector; the model must learn from data that adjacent pixels matter and that the same edge feature can appear anywhere on the image.
A convolution layer (LeCun et al. 1989) bakes the right structure in. It applies a small filter — typically \(3 \times 3\) to \(7 \times 7\) — at every spatial position of the input, sharing weights across positions. The resulting layer has dramatically fewer parameters than a fully-connected one and has the right inductive biases for visual data: locality, weight sharing, and translation equivariance. Stacking convolutions builds receptive fields that grow with depth, letting deep networks see large image regions through composition of small operations.
The Operation
A 2D convolution layer with \(C\) input channels and \(C'\) output channels has filter weights \(W \in \mathbb{R}^{C' \times C \times k_h \times k_w}\) and bias \(b \in \mathbb{R}^{C'}\). For input \(x \in \mathbb{R}^{C \times H \times W}\), the output is
\[ y[c', i, j] = b[c'] + \sum_{c=1}^{C} \sum_{u=1}^{k_h} \sum_{v=1}^{k_w} W[c', c, u, v] \, x[c, i + u, j + v]. \]
(Strictly this is cross-correlation; every framework labels it “convolution.” The mathematical convolution mirrors the kernel, but the distinction does not matter when \(W\) is learned.)
Diagram: a horizontal-edge detector at one position
A \(5 \times 5\) binary input (top half black, bottom half white), a \(3 \times 3\) horizontal-edge kernel, and the resulting \(3 \times 3\) output. The blue patch on the input shows the receptive field of one output cell.
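To make the equation concrete, here is a minimal NumPy sketch of the layer (zero-based offsets, stride 1, no padding); the function name and the example kernel are illustrative, not taken from any library. The usage applies a horizontal-edge kernel like the one in the diagram.

```python
import numpy as np

def conv2d_naive(x, W, b):
    """Direct implementation of the layer equation (stride 1, no padding).

    x: input,   shape (C, H, W)
    W: filters, shape (C_out, C, k_h, k_w)
    b: bias,    shape (C_out,)
    """
    C_out, _, k_h, k_w = W.shape
    _, H, W_in = x.shape
    H_out, W_out = H - k_h + 1, W_in - k_w + 1
    y = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                # one output value: filter co against the window at (i, j)
                y[co, i, j] = b[co] + np.sum(W[co] * x[:, i:i + k_h, j:j + k_w])
    return y

# Horizontal-edge example: 5x5 single-channel input, top two rows 1, rest 0.
x = np.zeros((1, 5, 5))
x[0, :2, :] = 1.0
W = np.array([[[[ 1,  1,  1],
                [ 0,  0,  0],
                [-1, -1, -1]]]], dtype=float)   # one 3x3 horizontal-edge filter
print(conv2d_naive(x, W, np.zeros(1))[0])       # large values where the window straddles the edge
```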
The filter \(W[c', \cdot, \cdot, \cdot]\) is the same at every spatial \((i, j)\) — this is weight sharing. The number of parameters is
\[ C' \cdot C \cdot k_h \cdot k_w + C', \]
independent of the input’s spatial size. A \(224 \times 224\) image and a \(14 \times 14\) feature map use the same number of weights for the same filter shape.
Stride
Stride controls the spacing of output positions:
\[ y[c', i, j] = b[c'] + \sum_{c, u, v} W[c', c, u, v] \, x[c, s \cdot i + u, s \cdot j + v]. \]
Stride \(s = 1\) (the default) keeps the output the same spatial size as the input (up to padding); stride \(s = 2\) roughly halves each spatial dimension. Strided convolutions are a common alternative to pooling for spatial downsampling.
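As a quick shape check, a hedged sketch using PyTorch's nn.Conv2d (the input and channel sizes are only illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                             # (N, C, H, W)
same = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
down = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
print(same(x).shape)   # torch.Size([1, 64, 224, 224])
print(down(x).shape)   # torch.Size([1, 64, 112, 112])
```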
Padding
Padding adds zeros (or a reflection, or replication) around the input border before convolution to control output size. With \(k \times k\) filters and padding \(p\), the spatial output size is
\[ H_{\text{out}} = \left\lfloor \frac{H + 2p - k}{s} \right\rfloor + 1. \]
Two standard choices:
- “Valid” padding: \(p = 0\). Output shrinks by \(k - 1\) pixels per layer (with stride \(1\)).
- “Same” padding: \(p = (k - 1)/2\) for odd \(k\) and stride \(1\). Output keeps the same spatial size as input. Standard for most modern architectures.
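The output-size formula and the two padding choices, as a small sketch (the helper name conv_out_size is mine, not from any library):

```python
import math

def conv_out_size(H, k, s=1, p=0):
    """Spatial output size: floor((H + 2p - k) / s) + 1."""
    return math.floor((H + 2 * p - k) / s) + 1

print(conv_out_size(224, k=3))            # "valid": 222, shrinks by k - 1
print(conv_out_size(224, k=3, p=1))       # "same":  224
print(conv_out_size(224, k=7, s=2, p=3))  # 7x7 stride-2 stem (as in ResNet): 112
```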
Dilation
Dilation spreads the filter taps over a larger region without growing the parameter count:
\[ y[c', i, j] = b[c'] + \sum_{c, u, v} W[c', c, u, v] \, x[c, i + d \cdot u, j + d \cdot v]. \]
Dilation \(d = 1\) is the standard convolution; \(d = 2\) places the filter taps on every other pixel, \(d = 4\) on every fourth, so a \(k \times k\) kernel covers an effective region of \(d(k-1)+1\) pixels per side. Useful for enlarging the receptive field without losing resolution and for capturing long-range dependencies (e.g., in WaveNet for audio, or in semantic segmentation models like DeepLab).
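With padding \(p = d(k-1)/2\) the spatial size is preserved, so dilation grows the receptive field without downsampling. A hedged PyTorch check (sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)
for d in (1, 2, 4):
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=d, padding=d)
    span = d * (3 - 1) + 1      # effective spatial extent of the 3x3 kernel
    print(d, span, conv(x).shape)
# 1 3 torch.Size([1, 64, 56, 56])
# 2 5 torch.Size([1, 64, 56, 56])
# 4 9 torch.Size([1, 64, 56, 56])
```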
Parameter and Compute Cost
For a layer with input shape \(C \times H \times W\), output shape \(C' \times H' \times W'\), and filter shape \(k \times k\):
- Parameters: \(C' \cdot C \cdot k^2 + C'\).
- FLOPs: \(C' \cdot C \cdot k^2 \cdot H' \cdot W'\) multiply-accumulate operations (double this if multiplies and adds are counted separately).
So convolution is far more parameter-efficient than a fully-connected layer (which would need \(C \cdot H \cdot W \cdot C' \cdot H' \cdot W'\) parameters for the same input and output shapes), but the compute saving is smaller than the parameter saving: each filter is reused at every output position, so the FLOP count exceeds the parameter count by a factor of roughly \(H' \cdot W'\).
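The same counts as a short script; the 3-to-64-channel \(3 \times 3\) convolution on a \(224 \times 224\) input with "same" padding is just an illustrative configuration:

```python
def conv_cost(C_in, C_out, k, H_out, W_out):
    params = C_out * C_in * k * k + C_out
    macs = C_out * C_in * k * k * H_out * W_out   # multiply-accumulates
    return params, macs

def fc_cost(C_in, H, W, C_out, H_out, W_out):
    n_in, n_out = C_in * H * W, C_out * H_out * W_out
    return n_in * n_out + n_out, n_in * n_out     # params, MACs

print(conv_cost(3, 64, 3, 224, 224))       # (1792, 86704128)
print(fc_cost(3, 224, 224, 64, 224, 224))  # both roughly 4.8e11
```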
1×1 Convolutions
A \(1 \times 1\) convolution has no spatial extent — it is a per-position linear transformation across channels:
\[ y[c', i, j] = \sum_c W[c', c] \, x[c, i, j]. \]
Despite the trivial-sounding form, \(1 \times 1\) convolutions are a standard building block:
- Channel mixing. Combine information across channels without spatial blurring.
- Bottleneck layers. Reduce the channel count before an expensive \(3 \times 3\) convolution, then expand it back. Used in ResNet’s “bottleneck block” and in MobileNet/EfficientNet.
- Equivalent to per-position MLP. A \(1 \times 1\) conv is a fully-connected layer applied independently at each spatial position, with weights shared across positions; stacking several with nonlinearities gives a per-position MLP (see the sketch below).
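A quick check of that equivalence, assuming nothing beyond NumPy (names are illustrative): applying the same \(C' \times C\) matrix at every position reproduces the \(1 \times 1\) convolution.

```python
import numpy as np

C, C_out, H, W = 8, 4, 5, 5
x = np.random.randn(C, H, W)
W1 = np.random.randn(C_out, C)   # the 1x1 filter bank, viewed as a matrix

# 1x1 convolution: mix channels at every (i, j) with the same matrix
y_conv = np.einsum('dc,chw->dhw', W1, x)

# The same computation done as a per-position linear layer
y_lin = np.array([[W1 @ x[:, i, j] for j in range(W)] for i in range(H)])
assert np.allclose(y_conv, np.moveaxis(y_lin, -1, 0))   # (H, W, C') -> (C', H, W)
```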
Beyond 2D
The same construction generalizes:
- 1D convolutions for time series, audio waveforms, and (historically) text.
- 3D convolutions for video and volumetric data (medical CT/MRI).
- Graph convolutions for arbitrary relational structure.
The common thread is weight sharing across positions related by a symmetry. Images have translation symmetry; graphs have permutation symmetry of nodes. A convolutional layer is the natural way to encode such a symmetry into the model.
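For the 1D and 3D cases above, only the number of spatial axes changes; a quick shape sketch with PyTorch (sizes are arbitrary):

```python
import torch
import torch.nn as nn

audio = torch.randn(1, 16, 16000)         # (N, C, length)
video = torch.randn(1, 3, 16, 112, 112)   # (N, C, frames, H, W)
print(nn.Conv1d(16, 32, kernel_size=9, padding=4)(audio).shape)  # [1, 32, 16000]
print(nn.Conv3d(3, 8, kernel_size=3, padding=1)(video).shape)    # [1, 8, 16, 112, 112]
```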
Where It Sits
Convolution is the workhorse of convolutional networks. Modern CNN blocks combine convolution with an activation, batch normalization, and possibly a residual connection. Stacking such blocks with periodic spatial downsampling produces architectures like ResNet, VGG, and EfficientNet.