Convolutional Networks

Motivation

A standard fully-connected layer applied to an image throws away every piece of structure that makes the image tractable. A \(224 \times 224 \times 3\) image is treated as an unstructured 150,528-dimensional vector; the model must learn from data that adjacent pixels matter, that the same edge feature can appear anywhere in the image, and that small spatial translations of the input should not change the classification. Each of these facts takes an enormous amount of data to learn, and the parameter count is correspondingly enormous.

A convolutional network (CNN) (LeCun et al. 1989) bakes these structural facts into the architecture:

  • Local receptive fields. Each unit looks at a small spatial window, typically \(3 \times 3\) to \(7 \times 7\), instead of every pixel.
  • Weight sharing. The same filter is applied at every spatial position, so what is learned at one location transfers to all others.
  • Translation equivariance. Shifting the input shifts the output by the same amount.

The result: massively fewer parameters than a fully-connected network on the same input, with much stronger inductive biases for visual data.
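The parameter savings are easy to quantify. The sketch below compares a fully-connected layer against a \(3 \times 3\) convolution on the \(224 \times 224 \times 3\) input from above; the choice of 64 output features is an illustrative assumption, not tied to any particular model:

```python
# Parameter-count comparison on a 224x224x3 input.
H, W, C_in = 224, 224, 3
C_out = 64  # hypothetical number of output features / channels

# Fully-connected layer: every output unit sees every input value.
fc_params = (H * W * C_in) * C_out + C_out  # weights + biases

# 3x3 convolution: one shared 3x3xC_in filter per output channel.
conv_params = (3 * 3 * C_in) * C_out + C_out

print(fc_params)    # 9633856
print(conv_params)  # 1792
```

The convolution uses over five thousand times fewer parameters, and the gap widens as the input grows, since `conv_params` does not depend on \(H\) or \(W\) at all.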

The Standard Block

A CNN is a stack of blocks, each composed of a small set of primitives.

A typical block stacks a \(3 \times 3\) convolution, batch norm, and ReLU; deep networks chain hundreds of such blocks with periodic spatial downsampling.
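The block can be sketched in plain NumPy. This is a naive, illustrative implementation (real frameworks use optimized kernels, and batch norm has learned scale and shift parameters); the shapes and random weights here are arbitrary:

```python
import numpy as np

def conv3x3(x, w, b):
    """Naive 'same' 3x3 convolution. x: (H, W, C_in), w: (3, 3, C_in, C_out)."""
    H, W, C_in = x.shape
    C_out = w.shape[-1]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero-pad the spatial borders
    out = np.zeros((H, W, C_out))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 3, j:j + 3, :]                 # local receptive field
            out[i, j] = np.tensordot(patch, w, axes=3) + b  # shared weights
    return out

def batch_norm(x, eps=1e-5):
    """Per-channel normalization (no learned scale/shift, for brevity)."""
    mu = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv_block(x, w, b):
    """The standard block from the text: conv -> batch norm -> ReLU."""
    return np.maximum(batch_norm(conv3x3(x, w, b)), 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))            # toy 8x8 RGB input
w = rng.standard_normal((3, 3, 3, 16)) * 0.1  # 16 filters of size 3x3x3
y = conv_block(x, w, np.zeros(16))
print(y.shape)  # (8, 8, 16)
```

Note that the same `w` is applied at every `(i, j)`; that single line is where weight sharing and the local receptive field both live.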

Diagram: a small classifier pipeline

Spatial dimensions shrink (32→16→8→1) as channel count grows (3→16→32→10), then a fully-connected head produces class scores. This is the classic “pyramid” shape of a CNN classifier.

input (32 × 32 × 3) → conv block (32 × 32 × 16) → 2×2 pool (16 × 16 × 16) → conv block (16 × 16 × 32) → 2×2 pool (8 × 8 × 32) → flatten → fc head (2048 → 10) → softmax. Each conv block is conv → batch norm → ReLU; pooling (or strided conv) downsamples between blocks.
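The shape bookkeeping in this pipeline can be verified directly. A minimal check, assuming 'same' padding for the convolutions so they preserve height and width:

```python
# Trace the (H, W, C) volume through the classifier pipeline.
shape = (32, 32, 3)                               # input image
shape = (shape[0], shape[1], 16)                  # 3x3 conv block -> 32 x 32 x 16
shape = (shape[0] // 2, shape[1] // 2, shape[2])  # 2x2 pool      -> 16 x 16 x 16
shape = (shape[0], shape[1], 32)                  # 3x3 conv block -> 16 x 16 x 32
shape = (shape[0] // 2, shape[1] // 2, shape[2])  # 2x2 pool      ->  8 x  8 x 32
flat = shape[0] * shape[1] * shape[2]             # flatten for the FC head
print(shape, flat)  # (8, 8, 32) 2048
```

The flattened size of \(8 \cdot 8 \cdot 32 = 2048\) is exactly the input width of the fully-connected head.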

Receptive Field Growth

The receptive field of a unit is the region of the input that can affect its value. A single \(3 \times 3\) convolution has a receptive field of \(3 \times 3\). Stacking \(L\) such layers grows the receptive field linearly: \(1 + 2L\) in each dimension. Adding stride or pooling grows it multiplicatively.
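Both growth laws fall out of the standard receptive-field recurrence, which this sketch implements (layers are described as hypothetical `(kernel_size, stride)` pairs):

```python
def receptive_field(layers):
    """Receptive field along one dimension after a sequence of layers.

    Each layer is (kernel_size, stride). Standard recurrence:
    rf += (k - 1) * jump; jump *= stride, where jump is the input-pixel
    spacing between adjacent units at the current depth.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# L stacked stride-1 3x3 convs grow the receptive field linearly: 1 + 2L.
print(receptive_field([(3, 1)] * 5))  # 11

# Interleaving 2x2 stride-2 pooling grows it multiplicatively.
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # 18
```

With only five cheap layers, two pooling stages nearly double the receptive field relative to the purely stride-1 stack.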

This is the architectural justification for depth: deep networks can see a large receptive field — and therefore long-range image structure — through a sequence of small, computationally cheap layers.

Architectural Milestones

  • LeNet-5 (LeCun, 1998). Two convolutional + pooling stages followed by fully-connected layers. Read handwritten digits on bank checks. Defined the template.
  • AlexNet (Krizhevsky, 2012). Eight layers, ReLU activations, dropout, GPU training. Won ImageNet 2012 by a huge margin and started the deep learning era.
  • VGG (Simonyan & Zisserman, 2014). Demonstrated that depth alone (16–19 layers of \(3 \times 3\) convolutions) yields strong gains. Simple architecture, widely used as a feature extractor.
  • GoogLeNet/Inception (Szegedy, 2014). Multi-scale feature extraction via parallel branches with different filter sizes. Introduced \(1 \times 1\) “bottleneck” convolutions for parameter efficiency.
  • ResNet (He, 2015). Residual connections allowed training networks with 50–200 layers; opened the era of “as deep as you want.” Standard backbone for downstream tasks.
  • U-Net (Ronneberger, 2015) and feature pyramids. Encoder-decoder architectures with skip connections for dense prediction (segmentation, depth estimation).
  • Mask R-CNN, YOLO, SSD families. Object detection on top of CNN backbones.
  • Vision Transformer (Dosovitskiy, 2020). Replaces convolution with self-attention over image patches. With enough data, matches or exceeds CNN performance — challenging convolution’s dominance in image modeling.

Beyond Images

The same architecture pattern applies to any data with spatial or temporal structure:

  • 1D convolutions for time series, audio waveforms, and (historically) text.
  • 3D convolutions for video and volumetric data.
  • Graph convolutions for arbitrary relational structure.

The common thread is weight sharing across positions related by a symmetry. Images have translation symmetry; graphs have permutation symmetry of nodes; audio has translation along time. A convolutional architecture is the natural way to encode such a symmetry into the model.
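The symmetry claim is concrete enough to test. In the 1D (audio/time-series) case, sliding a shared filter over a shifted signal produces a correspondingly shifted output; a minimal demonstration (circular shift is an assumption made here to keep the edges simple):

```python
import numpy as np

def conv1d(x, w):
    """Valid 1D convolution (cross-correlation) with a shared filter w."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

rng = np.random.default_rng(1)
x = rng.standard_normal(16)
w = rng.standard_normal(3)

# Shift the input by 2 positions (circularly, to keep the edges simple)...
x_shift = np.roll(x, 2)
y = conv1d(x, w)
y_shift = conv1d(x_shift, w)

# ...and the output shifts by the same 2 positions (away from the borders).
print(np.allclose(y[:-2], y_shift[2:]))  # True
```

This is translation equivariance in one line of algebra: because `w` is shared across positions, shifting `x` commutes with `conv1d` (up to boundary effects).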

References

LeCun, Y., B. Boser, J. S. Denker, et al. 1989. “Backpropagation Applied to Handwritten Zip Code Recognition.” Neural Computation 1 (4): 541–51. https://doi.org/10.1162/neco.1989.1.4.541.