Batch Normalization

Motivation

Training deep networks works much better when the activations at each layer remain at a roughly stable scale. Without that stability, small changes in early-layer parameters can cause large changes in late-layer activations, a problem Ioffe and Szegedy (2015) called internal covariate shift. Batch normalization (BatchNorm) addresses it by standardizing each unit’s pre-activation across the minibatch during training.

In practice, BatchNorm makes networks much easier to train: it allows substantially larger learning rates, reduces sensitivity to initialization, and acts as a mild regularizer. It is one of the few architectural ideas for image models that has stayed dominant since its introduction.

The Operation

For a feature in a layer with pre-activation \(z_b\) across minibatch elements \(b = 1, \ldots, B\), batch normalization computes the minibatch mean and variance,

\[ \mu = \frac{1}{B} \sum_{b=1}^B z_b, \qquad \sigma^2 = \frac{1}{B} \sum_{b=1}^B (z_b - \mu)^2, \]

normalizes,

\[ \hat z_b = \frac{z_b - \mu}{\sqrt{\sigma^2 + \varepsilon}}, \]

and then applies a learned per-feature affine transformation:

\[ y_b = \gamma \hat z_b + \beta. \]

Parameters \(\gamma\) and \(\beta\) are learned, with default initialization \(\gamma = 1\), \(\beta = 0\). The constant \(\varepsilon\) (typically \(10^{-5}\)) prevents division by zero. The procedure is applied independently per feature dimension.

For a convolutional layer, “feature” means channel: the mean and variance are computed across the batch and across spatial positions, so each channel has a single \(\mu\), \(\sigma^2\), \(\gamma\), \(\beta\). This is what preserves translation equivariance — every spatial position gets the same normalization.
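
As a concrete sketch, here is the training-mode computation in NumPy (the function name and shapes are illustrative, not a reference implementation):

```python
import numpy as np

def batchnorm_train(z, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm.

    z: (B, F) for a dense layer, or (B, C, H, W) for a conv layer.
    gamma, beta: shape (F,) in the dense case, (C, 1, 1) in the conv
    case, so they broadcast against the normalized input.
    """
    # Dense: one mean/variance per feature, across the batch.
    # Conv: one mean/variance per channel, across batch and spatial axes.
    axes = 0 if z.ndim == 2 else (0, 2, 3)
    mu = z.mean(axis=axes, keepdims=True)
    var = z.var(axis=axes, keepdims=True)
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta

# B = 4 samples, 3 features; gamma = 1, beta = 0 is the default init.
z = 3.0 + 2.0 * np.random.randn(4, 3)            # mean ~ 3, std ~ 2
y = batchnorm_train(z, np.ones(3), np.zeros(3))
print(y.mean(axis=0), y.std(axis=0))             # ~ 0 and ~ 1 per feature
```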

Diagram: pre-norm vs. post-norm activation distributions

Before BatchNorm, a feature’s pre-activation across the minibatch can sit at an arbitrary mean and scale. After BatchNorm with \(\gamma = 1, \beta = 0\), it is centred at \(0\) with unit variance.

[Figure: histograms of a feature’s pre-activation across the minibatch. Left, before BatchNorm: mean ≈ 3, std ≈ 2. Right, after subtracting \(\mu\) and dividing by \(\sigma\): mean = 0, std = 1.]

After standardization, the next layer sees an input with a predictable scale: easier to optimize, and larger learning rates are allowed.

Inference Mode

At inference time you cannot use the minibatch statistics — there is typically no batch, just a single example. Instead BatchNorm uses running averages of \(\mu\) and \(\sigma^2\) accumulated during training. These are typically maintained as exponential moving averages with a momentum hyperparameter:

\[ \mu_{\text{run}} \leftarrow (1 - \alpha) \mu_{\text{run}} + \alpha \mu, \qquad \sigma^2_{\text{run}} \leftarrow (1 - \alpha) \sigma^2_{\text{run}} + \alpha \sigma^2, \]

with \(\alpha\) around \(0.1\) or smaller. At inference,

\[ \hat z = \frac{z - \mu_{\text{run}}}{\sqrt{\sigma^2_{\text{run}} + \varepsilon}}, \qquad y = \gamma \hat z + \beta. \]

The training-mode and inference-mode behaviors are different, and this mismatch is the source of many bugs, particularly when fine-tuning on small datasets and in distributed training, where running statistics may be desynchronized across workers.
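
To make the two modes concrete, here is a minimal NumPy sketch of the bookkeeping (the class name and defaults are illustrative; `alpha` is the momentum from the update rule above):

```python
import numpy as np

class BatchNorm1d:
    """Minimal BatchNorm over (batch, features) with both modes."""

    def __init__(self, num_features, eps=1e-5, alpha=0.1):
        self.gamma = np.ones(num_features)     # learned scale
        self.beta = np.zeros(num_features)     # learned shift
        self.run_mean = np.zeros(num_features)
        self.run_var = np.ones(num_features)
        self.eps, self.alpha = eps, alpha

    def __call__(self, z, training):
        if training:
            mu, var = z.mean(axis=0), z.var(axis=0)
            # Exponential moving average of the minibatch statistics.
            self.run_mean = (1 - self.alpha) * self.run_mean + self.alpha * mu
            self.run_var = (1 - self.alpha) * self.run_var + self.alpha * var
        else:
            # Inference: frozen running statistics, no batch dependence.
            mu, var = self.run_mean, self.run_var
        return self.gamma * (z - mu) / np.sqrt(var + self.eps) + self.beta
```

Running the layer in the wrong mode, for example leaving training=True during evaluation, silently changes the output and keeps mutating the running statistics, which is exactly the failure mode described above.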

Why It Works

Several explanations have been proposed; the empirical effect is robust but the theoretical picture is unsettled.

  • Reduced internal covariate shift (the original argument). Layers see inputs with stable mean and variance, so they do not have to chase a moving target.
  • Smoothed loss landscape (Santurkar et al., 2018). BatchNorm makes the loss smoother, in the sense that the loss and its gradients change more slowly (smaller Lipschitz constants), which permits larger learning rates and faster convergence.
  • Implicit regularization. Each example’s normalization depends on the other examples in the minibatch, injecting noise that has a regularizing effect similar to dropout; the sketch after this list shows the dependence directly.
  • Decoupled scale and direction. \(\gamma\) and \(\beta\) separate the optimization of a layer’s output scale from the optimization of its weight directions, which makes the loss surface easier to navigate.
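
The batch dependence behind the implicit-regularization argument is easy to verify: normalize the same example in two different minibatch contexts and compare the results (a small NumPy demonstration; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))            # one fixed example
mates_a = rng.normal(size=(7, 3))      # two different sets of batch-mates
mates_b = rng.normal(size=(7, 3))

def standardize(z, eps=1e-5):
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

y_a = standardize(np.vstack([x, mates_a]))[0]
y_b = standardize(np.vstack([x, mates_b]))[0]
print(y_a - y_b)   # nonzero: same example, normalized differently
```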

The argument over which of these is the real reason continues. The practical fact is that BatchNorm dramatically improves training of CNNs.

Pros and Cons

Pros:

  • Allows much larger learning rates.
  • Reduces sensitivity to initialization.
  • Mild regularization effect.
  • Standard for ImageNet-scale CNNs.

Cons:

  • Behavior depends on batch size; small batches give noisy statistics, and below batch size \(\sim 8\) BatchNorm degrades.
  • Training/inference discrepancy: the running averages must be correct, which is fragile in distributed and fine-tuning settings.
  • Awkward in recurrent architectures and transformers.
  • Adds compute and memory overhead.

Alternatives

  • Layer normalization (Ba et al., 2016): normalize across the features of a single example, not across the batch. No batch dependence; identical training and inference. Standard in transformers and many sequence models.
  • Group normalization (Wu & He, 2018): normalize within groups of channels. Works well at small batch sizes; competitive with BatchNorm for vision.
  • Instance normalization: normalize per example, per channel — useful for style transfer.
  • Weight normalization: normalize the weights rather than the activations.

For convolutional vision models, BatchNorm remains the default. For transformers, layer norm. For small-batch or memory-constrained training, group norm or layer norm.
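
The practical difference between these schemes is just which axes the statistics reduce over. A NumPy sketch for a conv feature map (the shapes and group count are illustrative assumptions):

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(8, 32, 14, 14)            # (B, C, H, W)

batch_norm    = normalize(x, (0, 2, 3))       # per channel, across the batch
layer_norm    = normalize(x, (1, 2, 3))       # per example, across all features
instance_norm = normalize(x, (2, 3))          # per example, per channel

G = 4                                         # GroupNorm: split C into G groups
xg = x.reshape(8, G, 32 // G, 14, 14)
group_norm = normalize(xg, (2, 3, 4)).reshape(x.shape)
```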

Practical Notes

  • Place BatchNorm before the activation function: the standard order is Conv → BatchNorm → ReLU, as in the block sketched after this list. Some variants put it after the activation, but the original placement is the de facto standard.
  • When fine-tuning on a small dataset, freeze the running statistics by switching the BatchNorm layers to evaluation mode; a few small, unrepresentative batches are not enough to re-estimate the statistics and can corrupt the pretrained values.
  • Synchronize statistics across workers in distributed training (SyncBatchNorm) when per-worker batch sizes are small.
  • No bias before BatchNorm. A bias \(b\) in a convolution layer immediately followed by BatchNorm is redundant — the \(\beta\) parameter absorbs it, and the BatchNorm step subtracts the mean which kills any constant offset anyway. Most modern implementations explicitly disable the bias on convolutions that feed into BatchNorm.
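
A typical block following these conventions, sketched in PyTorch (layer sizes are illustrative):

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # bias=False: BatchNorm subtracts the per-channel mean, so a conv
    # bias would be cancelled anyway; BatchNorm's beta supplies the offset.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```

For fine-tuning, calling .eval() on the BatchNorm modules freezes their running statistics; nn.SyncBatchNorm.convert_sync_batchnorm(model) swaps in the synchronized variant when per-worker batches are small.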

References

Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. “Layer Normalization.” arXiv preprint arXiv:1607.06450.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” International Conference on Machine Learning (ICML), 448–56. https://proceedings.mlr.press/v37/ioffe15.html.

Santurkar, Shibani, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. 2018. “How Does Batch Normalization Help Optimization?” Advances in Neural Information Processing Systems (NeurIPS).

Wu, Yuxin, and Kaiming He. 2018. “Group Normalization.” European Conference on Computer Vision (ECCV).