Translation Equivariance and Weight Sharing

Motivation

A convolutional network (LeCun et al. 1989) is convolutional because it shares the same filter weights across every spatial position of the input. Two facts follow:

  1. The number of parameters is independent of input spatial size.
  2. The convolution operation commutes with translations: shifting the input shifts the output the same way.

Property (2) is translation equivariance, and it is the formal statement of why CNNs are the right architecture for image data. The visual world has translation symmetry — a cat in the upper-left of an image is the same kind of object as a cat in the lower-right — and a model whose representations transform consistently under that symmetry has a structural advantage over one that has to learn the symmetry from data.

Equivariance and Invariance

Let \(T_v\) denote the translation operator that shifts an input by vector \(v\): \((T_v x)[i, j] = x[i - v_1, j - v_2]\).

A function \(f\) is translation equivariant if

\[ f(T_v x) = T_v f(x) \quad \text{for every shift } v. \]

Shifting the input shifts the output by the same amount.

A function \(f\) is translation invariant if

\[ f(T_v x) = f(x) \quad \text{for every shift } v. \]

Shifting the input does not change the output.

Equivariance and invariance are different. A face detector that returns “yes/no” should be invariant to position (a face is a face wherever it appears). A semantic segmentation model that returns per-pixel class labels should be equivariant (the labels should shift with the image). The convolution operation gives equivariance for free; invariance, when needed, must be obtained by explicit pooling, downsampling, or symmetry-breaking heads.
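
The two definitions are easy to check numerically. Below is a minimal sketch, assuming NumPy and circular (wrap-around) shifts so that \(T_v\) is well defined on a finite grid:

```python
# Minimal numerical sketch of equivariance vs. invariance,
# using circular (wrap-around) shifts on a finite grid.
import numpy as np

def shift(x, v):
    """T_v: shift a 2D array by v = (v1, v2), wrapping around the edges."""
    return np.roll(x, shift=v, axis=(0, 1))

def is_equivariant(f, x, v):
    """Check f(T_v x) == T_v f(x) for one input and one shift."""
    return np.allclose(f(shift(x, v)), shift(f(x), v))

def is_invariant(f, x, v):
    """Check f(T_v x) == f(x) for one input and one shift."""
    return np.allclose(f(shift(x, v)), f(x))

x = np.random.randn(8, 8)
v = (2, 3)

print(is_equivariant(lambda z: z, x, v))      # True: the identity map is equivariant
print(is_invariant(lambda z: z, x, v))        # False: but it is not invariant
print(is_invariant(lambda z: z.sum(), x, v))  # True: a global sum is invariant
```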

Why Convolution Is Equivariant

Consider the convolution (written, as deep-learning libraries conventionally do, in cross-correlation form)

\[ y[i, j] = \sum_{u, v} W[u, v] \, x[i + u, j + v]. \]

If we shift the input by \(\Delta\), \(\tilde x = T_\Delta x\), then

\[ \tilde y[i, j] = \sum_{u, v} W[u, v] \, \tilde x[i + u, j + v] = \sum_{u, v} W[u, v] \, x[i + u - \Delta_1, j + v - \Delta_2] = y[i - \Delta_1, j - \Delta_2]. \]

So \(\tilde y = T_\Delta y\): the output is shifted by exactly \(\Delta\). The argument relies on the fact that the same filter \(W\) is applied at every position, which is exactly weight sharing.

The rest of a typical CNN layer preserves equivariance: pointwise nonlinearities (ReLU, GELU) commute with translation, and per-channel batch normalization is also translation equivariant.
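
The derivation is easy to confirm numerically. Below is a minimal sketch, assuming PyTorch and circular padding so that the finite grid behaves like the boundary-free sums above (ordinary zero padding only satisfies this approximately, as the next section discusses):

```python
# Check that a circularly padded convolution commutes with circular shifts.
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)
x = torch.randn(1, 1, 16, 16)   # (batch, channel, height, width)
delta = (3, 5)

shift_then_conv = conv(torch.roll(x, shifts=delta, dims=(2, 3)))
conv_then_shift = torch.roll(conv(x), shifts=delta, dims=(2, 3))

print(torch.allclose(shift_then_conv, conv_then_shift, atol=1e-6))  # True
```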

What Breaks Equivariance

  • Pooling and strided downsampling. The output grid is coarser than the input grid, so a shift of the input that is not a multiple of the stride has no corresponding integer shift of the output; equivariance holds exactly only for shifts by whole strides and is approximate otherwise (see the sketch after this list).
  • Padding. Zero-padding creates a special boundary that breaks equivariance for shifts that move content into or out of the padded region.
  • Position-dependent layers. Any layer that explicitly depends on \((i, j)\) — for example, learned positional embeddings, or coord-conv that appends \((i, j)\) as input channels — breaks equivariance by design.
  • Final flattening + fully-connected layer. After the spatial dimensions are flattened, each spatial position gets its own classifier weights, so the output depends on absolute position. This is why most modern architectures end with global average pooling rather than flattening.
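
The first point is easy to see numerically. Below is a minimal sketch, assuming PyTorch: a stride-2 max pool commutes with a shift by a full stride (two pixels, which becomes a one-pixel shift of the output) but not with a one-pixel shift.

```python
# Strided pooling is only equivariant to shifts that are multiples of the stride.
import torch
import torch.nn as nn

torch.manual_seed(0)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 1, 16, 16)

# Shift by one pixel: no integer shift of the coarser output grid matches.
shift1_then_pool = pool(torch.roll(x, shifts=(1, 0), dims=(2, 3)))
print(torch.allclose(shift1_then_pool, pool(x)))  # False in general

# Shift by a full stride (two pixels): equals a one-pixel shift of the output.
shift2_then_pool = pool(torch.roll(x, shifts=(2, 0), dims=(2, 3)))
print(torch.allclose(shift2_then_pool,
                     torch.roll(pool(x), shifts=(1, 0), dims=(2, 3))))  # True
```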

Why Weight Sharing Is the Right Inductive Bias

A fully-connected network applied to images has a separate weight for every (input pixel, output unit) pair. It must independently learn that “edge filter at position \((10, 20)\)” and “edge filter at position \((50, 80)\)” are useful, and it must learn this separately at every position from data. Both the parameter count and the training data needed to fit it scale with the number of spatial positions times the number of features.

A convolutional network learns a single edge filter and applies it everywhere. The training signal for a feature at any one location informs the feature at every location. This is statistical efficiency: the same amount of data goes much further when the model bakes in the right symmetry.
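
A back-of-the-envelope count makes the gap concrete. The sizes below are illustrative assumptions: a hypothetical layer mapping a 1-channel \(224 \times 224\) image to 64 feature maps of the same spatial size.

```python
# Parameter counts: fully connected vs. convolutional version of the same mapping.
H = W = 224
in_units = H * W            # 50,176 input pixels
out_units = 64 * H * W      # 3,211,264 output units (64 feature maps)

fc_params = in_units * out_units + out_units   # weight matrix + biases
conv_params = 64 * (3 * 3 * 1) + 64            # 64 shared 3x3x1 filters + biases

print(f"fully connected: {fc_params:,}")   # about 1.6e11 parameters
print(f"convolutional:   {conv_params:,}") # 640 parameters
```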

The same argument applies in any domain with a useful symmetry:

  • Time (1D translation) → 1D convolutions for audio and time series.
  • Volume (3D translation) → 3D convolutions for video and volumetric data.
  • Graphs (node permutation) → graph neural networks with permutation-equivariant message passing.
  • Sets (full permutation) → DeepSets / set transformers with permutation-invariant aggregation.

The general principle — equivariance under the relevant symmetry group of the data — is the framework for designing architectures that exploit problem structure.

What Equivariance Does Not Give You

Translation equivariance does not provide:

  • Rotation invariance. A CNN can recognize a cat anywhere, but a rotated cat is different from an upright cat in feature space (see the sketch after this list). Data augmentation or rotation-equivariant networks (such as steerable CNNs) are needed for that.
  • Scale invariance. Similarly: a \(30\)-pixel cat and a \(300\)-pixel cat are different in feature space without scale-augmented training or multi-scale architectures.
  • Strict translation invariance for the whole network. Without explicit invariance-producing layers, the network’s output at the very end can still depend on absolute position — particularly through padding, downsampling, or the final classifier.
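
A quick check of the first point, assuming PyTorch and the same circularly padded convolution used earlier: rotating the input does not commute with the convolution, because the filter is not rotated along with it.

```python
# The shift-equivariant convolution is not rotation equivariant for a generic filter.
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)
x = torch.randn(1, 1, 16, 16)

rot = lambda z: torch.rot90(z, k=1, dims=(2, 3))
print(torch.allclose(conv(rot(x)), rot(conv(x))))  # False for a generic filter
```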

The CNN’s translation-equivariance inductive bias is one correct symmetry built into the architecture. Other symmetries — when relevant — must be added explicitly.

References

LeCun, Y., B. Boser, J. S. Denker, et al. 1989. “Backpropagation Applied to Handwritten Zip Code Recognition.” Neural Computation 1 (4): 541–51. https://doi.org/10.1162/neco.1989.1.4.541.