Convolution Layer
Motivation
A standard fully-connected layer applied to an image throws away every piece of structure that makes the image tractable. A \(224 \times 224 \times 3\) image is treated as an unstructured 150,528-dimensional vector; the model must learn from data that adjacent pixels matter and that the same edge feature can appear anywhere on the image.
A convolution layer (LeCun et al. 1989) bakes the right structure in. It applies a small filter — typically \(3 \times 3\) to \(7 \times 7\) — at every spatial position of the input, sharing weights across positions. The resulting layer has dramatically fewer parameters than a fully-connected one and has the right inductive biases for visual data: locality, weight sharing, and translation equivariance. Stacking convolutions builds receptive fields that grow with depth, letting deep networks see large image regions through composition of small operations.
The Operation
A 2D convolution layer with \(C\) input channels and \(C'\) output channels has filter weights \(W \in \mathbb{R}^{C' \times C \times k_h \times k_w}\) and bias \(b \in \mathbb{R}^{C'}\). For input \(x \in \mathbb{R}^{C \times H \times W}\), the output is
\[ y[c', i, j] = b[c'] + \sum_{c=1}^{C} \sum_{u=1}^{k_h} \sum_{v=1}^{k_w} W[c', c, u, v] \, x[c, i + u, j + v]. \]
(Strictly this is cross-correlation; every framework labels it “convolution.” The mathematical convolution mirrors the kernel, but the distinction does not matter when \(W\) is learned.)
Diagram: a horizontal-edge detector at one position
A \(5 \times 5\) binary input (top half black, bottom half white), a \(3 \times 3\) horizontal-edge kernel, and the resulting \(3 \times 3\) output. The blue patch on the input shows the receptive field of one output cell.
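To make the equation concrete, here is a minimal NumPy sketch of the layer (zero-based offsets, stride 1, no padding); the function name and the example kernel are illustrative, not taken from any library. The usage applies a horizontal-edge kernel like the one in the diagram.

```python
import numpy as np

def conv2d_naive(x, W, b):
    """Direct implementation of the layer equation (stride 1, no padding).

    x: input,   shape (C, H, W)
    W: filters, shape (C_out, C, k_h, k_w)
    b: bias,    shape (C_out,)
    """
    C_out, _, k_h, k_w = W.shape
    _, H, W_in = x.shape
    H_out, W_out = H - k_h + 1, W_in - k_w + 1
    y = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                # one output value: filter co against the window at (i, j)
                y[co, i, j] = b[co] + np.sum(W[co] * x[:, i:i + k_h, j:j + k_w])
    return y

# Horizontal-edge example: 5x5 single-channel input, top two rows 1, rest 0.
x = np.zeros((1, 5, 5))
x[0, :2, :] = 1.0
W = np.array([[[[ 1,  1,  1],
                [ 0,  0,  0],
                [-1, -1, -1]]]], dtype=float)   # one 3x3 horizontal-edge filter
print(conv2d_naive(x, W, np.zeros(1))[0])       # large values where the window straddles the edge
```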
The filter \(W[c', \cdot, \cdot, \cdot]\) is the same at every spatial \((i, j)\) — this is weight sharing. The number of parameters is
\[ C' \cdot C \cdot k_h \cdot k_w + C', \]
independent of the input’s spatial size. A \(224 \times 224\) image and a \(14 \times 14\) feature map use the same number of weights for the same filter shape.
Stride
Stride controls the spacing of output positions:
\[ y[c', i, j] = b[c'] + \sum_{c, u, v} W[c', c, u, v] \, x[c, s \cdot i + u, s \cdot j + v]. \]
Stride \(s = 1\) (the default) keeps the output the same spatial size as the input (up to padding); stride \(s = 2\) roughly halves each spatial dimension. Strided convolutions are a common alternative to pooling for spatial downsampling.
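As a quick shape check, a hedged sketch using PyTorch's nn.Conv2d (the input and channel sizes are only illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                             # (N, C, H, W)
same = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
down = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
print(same(x).shape)   # torch.Size([1, 64, 224, 224])
print(down(x).shape)   # torch.Size([1, 64, 112, 112])
```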
Padding
Padding adds zeros (or a reflection, or replication) around the input border before convolution to control output size. With \(k \times k\) filters and padding \(p\), the spatial output size is
\[ H_{\text{out}} = \left\lfloor \frac{H + 2p - k}{s} \right\rfloor + 1. \]
Two standard choices:
- “Valid” padding: \(p = 0\). Output shrinks by \(k - 1\) pixels per layer (with stride \(1\)).
- “Same” padding: \(p = (k - 1)/2\) for odd \(k\) and stride \(1\). Output keeps the same spatial size as input. Standard for most modern architectures.
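The output-size formula and the two padding choices, as a small sketch (the helper name conv_out_size is mine, not from any library):

```python
import math

def conv_out_size(H, k, s=1, p=0):
    """Spatial output size: floor((H + 2p - k) / s) + 1."""
    return math.floor((H + 2 * p - k) / s) + 1

print(conv_out_size(224, k=3))            # "valid": 222, shrinks by k - 1
print(conv_out_size(224, k=3, p=1))       # "same":  224
print(conv_out_size(224, k=7, s=2, p=3))  # 7x7 stride-2 stem (as in ResNet): 112
```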
Dilation
Dilation spreads the filter taps over a larger region without growing the parameter count:
\[ y[c', i, j] = b[c'] + \sum_{c, u, v} W[c', c, u, v] \, x[c, i + d \cdot u, j + d \cdot v]. \]
Dilation \(d = 1\) is the standard convolution; \(d = 2\) places the filter taps on every other pixel, \(d = 4\) on every fourth, so a \(k \times k\) kernel covers an effective region of \(d(k-1)+1\) pixels per side. Useful for enlarging the receptive field without losing resolution and for capturing long-range dependencies (e.g., in WaveNet for audio, or in semantic segmentation models like DeepLab).
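With padding \(p = d(k-1)/2\) the spatial size is preserved, so dilation grows the receptive field without downsampling. A hedged PyTorch check (sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)
for d in (1, 2, 4):
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=d, padding=d)
    span = d * (3 - 1) + 1      # effective spatial extent of the 3x3 kernel
    print(d, span, conv(x).shape)
# 1 3 torch.Size([1, 64, 56, 56])
# 2 5 torch.Size([1, 64, 56, 56])
# 4 9 torch.Size([1, 64, 56, 56])
```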
Parameter and Compute Cost
For a layer with input shape \(C \times H \times W\), output shape \(C' \times H' \times W'\), and filter shape \(k \times k\):
- Parameters: \(C' \cdot C \cdot k^2 + C'\).
- FLOPs: \(C' \cdot C \cdot k^2 \cdot H' \cdot W'\) multiply-accumulate operations (double this if multiplies and adds are counted separately).
So convolution is far more parameter-efficient than a fully-connected layer (which would need \(C \cdot H \cdot W \cdot C' \cdot H' \cdot W'\) parameters for the same input and output shapes), but the compute saving is smaller than the parameter saving: each filter is reused at every output position, so the FLOP count exceeds the parameter count by a factor of roughly \(H' \cdot W'\).
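The same counts as a short script; the 3-to-64-channel \(3 \times 3\) convolution on a \(224 \times 224\) input with "same" padding is just an illustrative configuration:

```python
def conv_cost(C_in, C_out, k, H_out, W_out):
    params = C_out * C_in * k * k + C_out
    macs = C_out * C_in * k * k * H_out * W_out   # multiply-accumulates
    return params, macs

def fc_cost(C_in, H, W, C_out, H_out, W_out):
    n_in, n_out = C_in * H * W, C_out * H_out * W_out
    return n_in * n_out + n_out, n_in * n_out     # params, MACs

print(conv_cost(3, 64, 3, 224, 224))       # (1792, 86704128)
print(fc_cost(3, 224, 224, 64, 224, 224))  # both roughly 4.8e11
```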
1×1 Convolutions
A \(1 \times 1\) convolution has no spatial extent — it is a per-position linear transformation across channels:
\[ y[c', i, j] = \sum_c W[c', c] \, x[c, i, j]. \]
Despite the trivial-sounding form, \(1 \times 1\) convolutions are a standard building block:
- Channel mixing. Combine information across channels without spatial blurring.
- Bottleneck layers. Reduce the channel count before an expensive \(3 \times 3\) convolution, then expand it back. Used in ResNet’s “bottleneck block” and in MobileNet/EfficientNet.
- Equivalent to per-position MLP. A \(1 \times 1\) conv is a fully-connected layer applied independently at each spatial position, with weights shared across positions; stacking several with nonlinearities gives a per-position MLP (see the sketch below).
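A quick check of that equivalence, assuming nothing beyond NumPy (names are illustrative): applying the same \(C' \times C\) matrix at every position reproduces the \(1 \times 1\) convolution.

```python
import numpy as np

C, C_out, H, W = 8, 4, 5, 5
x = np.random.randn(C, H, W)
W1 = np.random.randn(C_out, C)   # the 1x1 filter bank, viewed as a matrix

# 1x1 convolution: mix channels at every (i, j) with the same matrix
y_conv = np.einsum('dc,chw->dhw', W1, x)

# The same computation done as a per-position linear layer
y_lin = np.array([[W1 @ x[:, i, j] for j in range(W)] for i in range(H)])
assert np.allclose(y_conv, np.moveaxis(y_lin, -1, 0))   # (H, W, C') -> (C', H, W)
```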
Beyond 2D
The same construction generalizes:
- 1D convolutions for time series, audio waveforms, and (historically) text.
- 3D convolutions for video and volumetric data (medical CT/MRI).
- Graph convolutions for arbitrary relational structure.
The common thread is weight sharing across positions related by a symmetry. Images have translation symmetry; graphs have permutation symmetry of nodes. A convolutional layer is the natural way to encode such a symmetry into the model.
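For the 1D and 3D cases above, only the number of spatial axes changes; a quick shape sketch with PyTorch (sizes are arbitrary):

```python
import torch
import torch.nn as nn

audio = torch.randn(1, 16, 16000)         # (N, C, length)
video = torch.randn(1, 3, 16, 112, 112)   # (N, C, frames, H, W)
print(nn.Conv1d(16, 32, kernel_size=9, padding=4)(audio).shape)  # [1, 32, 16000]
print(nn.Conv3d(3, 8, kernel_size=3, padding=1)(video).shape)    # [1, 8, 16, 112, 112]
```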
Where It Sits
Convolution is the workhorse of convolutional networks. Modern CNN blocks combine convolution with an activation, batch normalization, and possibly a residual connection. Stacking such blocks with periodic spatial downsampling produces architectures like ResNet, VGG, and EfficientNet.