Convolution Is Translation Equivariant
Claim
Let \(T_\Delta\) denote the translation operator \((T_\Delta x)[i, j] = x[i - \Delta_1, j - \Delta_2]\). The 2D convolution operation
\[ (W * x)[i, j] = \sum_{u, v} W[u, v] \, x[i + u, j + v] \]
is translation equivariant: for every shift \(\Delta = (\Delta_1, \Delta_2)\),
\[ W * (T_\Delta x) = T_\Delta (W * x). \]
This is the formal statement that shifting the input shifts the output by the same amount, with no other change. It is the fundamental property that justifies the architecture of convolutional networks (LeCun et al. 1989; Cohen and Welling 2016). See translation equivariance for the broader context.
Proof
Direct computation. Let \(\tilde x = T_\Delta x\), so \(\tilde x[i, j] = x[i - \Delta_1, j - \Delta_2]\). Then
\[ (W * \tilde x)[i, j] = \sum_{u, v} W[u, v] \, \tilde x[i + u, j + v] = \sum_{u, v} W[u, v] \, x[i + u - \Delta_1, j + v - \Delta_2]. \]
Let \(i' = i - \Delta_1\) and \(j' = j - \Delta_2\). The right-hand side is
\[ \sum_{u, v} W[u, v] \, x[i' + u, j' + v] = (W * x)[i', j'] = (W * x)[i - \Delta_1, j - \Delta_2]. \]
So
\[ (W * \tilde x)[i, j] = (W * x)[i - \Delta_1, j - \Delta_2] = T_\Delta(W * x)[i, j]. \]
Equality holds at every \((i, j)\), hence \(W * (T_\Delta x) = T_\Delta(W * x)\). \(\square\)
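The identity can also be checked numerically. Below is a minimal NumPy sketch (the names conv2d and translate are illustrative, not from any particular library) that implements the convolution formula above with periodic indexing, so the infinite-domain assumption discussed under Boundary Effects below holds exactly.

```python
import numpy as np

def conv2d(W, x):
    """(W * x)[i, j] = sum_{u, v} W[u, v] * x[i + u, j + v], with periodic indexing."""
    y = np.zeros(x.shape)
    for u in range(W.shape[0]):
        for v in range(W.shape[1]):
            # np.roll(x, (-u, -v)) places x[i + u, j + v] (indices taken mod the
            # image size) at position [i, j], matching the formula above.
            y += W[u, v] * np.roll(x, shift=(-u, -v), axis=(0, 1))
    return y

def translate(x, delta):
    """(T_delta x)[i, j] = x[i - delta_1, j - delta_2], periodic."""
    return np.roll(x, shift=delta, axis=(0, 1))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # input image
W = rng.standard_normal((3, 3))      # 3x3 kernel
delta = (2, 5)                       # arbitrary shift

lhs = conv2d(W, translate(x, delta))    # W * (T_delta x)
rhs = translate(conv2d(W, x), delta)    # T_delta (W * x)
print(np.allclose(lhs, rhs))            # True: equivariance holds exactly
```

The later examples on this page continue this sketch and reuse conv2d, translate, x, W, and delta.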
Multi-Channel and Strided Convolution
The argument extends directly:
Multi-channel. With \(C\) input channels and \(C'\) output channels, \[ y[c', i, j] = \sum_{c, u, v} W[c', c, u, v] \, x[c, i + u, j + v]. \] A spatial translation \(T_\Delta\) acts on \((i, j)\) only and not on channels, so the same substitution gives equivariance per output channel.
Strided. A stride-\(s\) convolution gives \[ y[i, j] = \sum_{u, v} W[u, v] \, x[s i + u, s j + v]. \] This is equivariant to translations of the input by multiples of \(s\): shifting the input by \((s \Delta_1, s \Delta_2)\) shifts the output by \((\Delta_1, \Delta_2)\). For shifts smaller than \(s\), equivariance fails — the output is no longer just a shifted version of the unstrided output. This is the formal sense in which strided convolution and pooling are only approximately equivariant.
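Continuing the sketch above, a hypothetical strided variant illustrates both halves of the claim: shifts by multiples of the stride commute with the operation, while sub-stride shifts do not produce any shift of the original output.

```python
def strided_conv2d(W, x, s):
    """y[i, j] = sum_{u, v} W[u, v] * x[s*i + u, s*j + v] = (W * x)[s*i, s*j]."""
    return conv2d(W, x)[::s, ::s]

s, d = 2, (1, 3)
y = strided_conv2d(W, x, s)

# Shifting the input by (s*d1, s*d2) shifts the output by (d1, d2):
y_coarse = strided_conv2d(W, translate(x, (s * d[0], s * d[1])), s)
print(np.allclose(y_coarse, translate(y, d)))     # True

# A sub-stride shift (here by one pixel) yields an output that is not any
# circular shift of y, so equivariance fails:
y_fine = strided_conv2d(W, translate(x, (1, 0)), s)
print(min(np.max(np.abs(y_fine - np.roll(y, (a, b), axis=(0, 1))))
          for a in range(y.shape[0]) for b in range(y.shape[1])))   # > 0
```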
What Breaks the Argument
The substitution \(i' = i - \Delta_1\) relies on the convolution being applied with the same weights \(W\) at every position. If the weights varied with position — say \(W[i, j; u, v]\) depending on the location — the substitution would not produce a simple shifted output. So weight sharing across positions is the property that delivers translation equivariance. A fully-connected layer applied to flattened images has different weights at every position and is not translation equivariant: a shifted input produces an entirely different output.
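A small negative example makes this concrete. The sketch below (continuing the one above, and purely illustrative) applies a separate kernel at each output position; the same shift test that passed for conv2d now fails.

```python
def positionwise_conv2d(W_pos, x):
    """Like conv2d, but with a different kernel W_pos[i, j] at every position (i, j)."""
    H, Wd = x.shape
    kh, kw = W_pos.shape[2:]
    y = np.zeros((H, Wd))
    for i in range(H):
        for j in range(Wd):
            for u in range(kh):
                for v in range(kw):
                    y[i, j] += W_pos[i, j, u, v] * x[(i + u) % H, (j + v) % Wd]
    return y

W_pos = rng.standard_normal((8, 8, 3, 3))   # one 3x3 kernel per position
lhs = positionwise_conv2d(W_pos, translate(x, delta))
rhs = translate(positionwise_conv2d(W_pos, x), delta)
print(np.allclose(lhs, rhs))                # False: no weight sharing, no equivariance
```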
Bias and Activations
Adding a per-channel bias preserves equivariance because the bias is constant in space. Applying an elementwise nonlinearity \(\sigma\) to the output also preserves it: \(\sigma\) commutes with translation because translation only relabels indices and \(\sigma\) acts pointwise. So a layer of the form \(y = \sigma(W * x + b)\) is translation equivariant.
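In the same sketch, a full layer with a scalar bias and an elementwise nonlinearity (tanh here, standing in for any pointwise \(\sigma\)) passes the same test:

```python
def layer(W, x, b):
    """sigma(W * x + b): convolution, spatially constant bias, elementwise nonlinearity."""
    return np.tanh(conv2d(W, x) + b)

b = 0.3
print(np.allclose(layer(W, translate(x, delta), b),
                  translate(layer(W, x, b), delta)))   # True
```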
Boundary Effects
The proof above implicitly assumes the convolution is defined on an infinite domain or with periodic boundary conditions. With finite inputs and the standard “zero padding” convention, equivariance fails near the boundary: shifting an image so that some content moves into the padding region cannot be undone, since the output near the new boundary depends on padded zeros that are not a shifted version of the original boundary’s data.
The interior of the output is exactly equivariant; only a boundary band of width about \(k/2\), where \(k\) is the kernel size, is merely approximately so. For input sizes much larger than the kernel, the boundary effect is small.
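The boundary failure is easy to reproduce. The sketch below (standalone NumPy, again with illustrative names) implements the same formula with out-of-range indices reading as zero, and a non-periodic shift that fills vacated pixels with zeros; the two sides agree exactly on the interior but not near the boundary.

```python
import numpy as np

def conv2d_zero(W, x):
    """(W * x)[i, j] = sum_{u, v} W[u, v] * x[i + u, j + v], zeros outside the image."""
    H, Wd = x.shape
    kh, kw = W.shape
    xp = np.zeros((H + kh, Wd + kw))
    xp[:H, :Wd] = x                      # zero padding where i + u or j + v runs off the image
    y = np.zeros((H, Wd))
    for u in range(kh):
        for v in range(kw):
            y += W[u, v] * xp[u:u + H, v:v + Wd]
    return y

def translate_zero(x, delta):
    """Shift content by delta (both components >= 0 here), filling vacated pixels with zeros."""
    d1, d2 = delta
    out = np.zeros_like(x)
    out[d1:, d2:] = x[:x.shape[0] - d1, :x.shape[1] - d2]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
W = rng.standard_normal((3, 3))
delta = (2, 1)

lhs = conv2d_zero(W, translate_zero(x, delta))    # W * (T_delta x)
rhs = translate_zero(conv2d_zero(W, x), delta)    # T_delta (W * x)
err = np.abs(lhs - rhs)
print(np.allclose(err, 0))       # False: equivariance fails overall
print(err[4:-4, 4:-4].max())     # 0.0: the interior is exactly equivariant
```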
What Equivariance Is Not
Equivariance is not invariance: \(W * (T_\Delta x) \neq W * x\) in general; the output moves, just in the same way the input did. To get invariance, the architecture must combine equivariance with an explicit aggregation step (pooling, global average pooling, etc.) that collapses the spatial dimensions. See pooling and translation equivariance.
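As a final check in the NumPy sketch above, averaging the convolution output over space (a stand-in for global average pooling) gives a quantity that does not change when the input is shifted:

```python
feat = conv2d(W, x).mean()                           # global average pooling after conv
feat_shifted = conv2d(W, translate(x, delta)).mean()
print(np.isclose(feat, feat_shifted))                # True: equivariance + pooling = invariance
```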