Regularization in Deep Networks
Motivation
A neural network with millions or billions of parameters can easily memorize a small training set, achieving zero training error while making poor predictions on unseen data. Regularization comprises the techniques used to control this overfitting — to bias the optimizer toward solutions that generalize. Three core techniques cover the vast majority of practical use: weight decay, dropout, and early stopping. Each operates differently and they are typically combined.
The bias-variance framing: regularization adds a small amount of bias to obtain a much larger reduction in variance (proof). The classical setting predicts a sweet spot at intermediate model complexity; in deep networks the situation is complicated by overparameterization and “double descent,” but regularization remains useful empirically.
Weight Decay
Add an \(\ell_2\) penalty on the parameters to the loss:
\[ L_{\text{total}}(\theta) = L(\theta) + \frac{\lambda}{2} \|\theta\|_2^2. \]
The gradient gains a term \(\lambda \theta\), so the parameter update becomes
\[ \theta \leftarrow \theta - \eta \nabla L(\theta) - \eta \lambda \theta = (1 - \eta \lambda) \theta - \eta \nabla L(\theta). \]
Each step shrinks the parameters by a factor \((1 - \eta \lambda)\) before applying the gradient, hence the name “weight decay.” Typical \(\lambda\) values are \(10^{-4}\) to \(10^{-2}\).
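As a minimal sketch of that update in plain NumPy (the function name, toy loss, and hyperparameter values are illustrative, not from any particular library):

```python
import numpy as np

def sgd_step_with_weight_decay(theta, grad, lr=0.1, weight_decay=1e-4):
    """One step: theta <- (1 - lr * wd) * theta - lr * grad."""
    return (1.0 - lr * weight_decay) * theta - lr * grad

# toy example: L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta = np.array([1.0, -2.0, 3.0])
for _ in range(100):
    theta = sgd_step_with_weight_decay(theta, grad=theta)
```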
The Bayesian interpretation: weight decay corresponds to a zero-mean Gaussian prior on the weights with precision \(\lambda\) (variance \(1/\lambda\)), and minimizing \(L_{\text{total}}\) is MAP estimation. The frequentist interpretation: it shrinks parameters toward zero, reducing the function class’s effective capacity.
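To make the MAP correspondence explicit (treating the data loss as a negative log-likelihood, \(L(\theta) = -\log p(\mathcal{D} \mid \theta)\) up to a constant):
\[ -\log p(\theta \mid \mathcal{D}) = -\log p(\mathcal{D} \mid \theta) - \log p(\theta) + \text{const} = L(\theta) + \frac{\lambda}{2} \|\theta\|_2^2 + \text{const}, \]
since \(-\log p(\theta) = \frac{\lambda}{2}\|\theta\|_2^2 + \text{const}\) for the zero-mean Gaussian prior. This is exactly \(L_{\text{total}}\) above.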
With Adam: use AdamW. Standard Adam folds the weight-decay gradient term into the per-parameter adaptive scaling, so the decay applied to each weight is divided by the same running gradient statistics as the loss gradient; weights with a history of large gradients end up under-regularized. AdamW decouples weight decay from the adaptive update and is the modern default for any architecture trained with adaptive optimizers. See adaptive optimizers.
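A minimal PyTorch sketch of the decoupled version (the model, batch, and hyperparameter values are placeholders; torch.optim.AdamW applies the decay directly to the weights rather than through the adaptive gradient scaling):

```python
import torch

model = torch.nn.Linear(128, 10)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()       # weight decay is applied outside the adaptive scaling
optimizer.zero_grad()
```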
\(\ell_1\) regularization is also possible (encourages sparsity) but rarely used in deep learning; weight decay is the default.
Dropout
Dropout (Srivastava et al. 2014) randomly zeros out a fraction \(p\) of activations during training:
\[ \hat a_i = \begin{cases} a_i / (1 - p) & \text{with probability } 1 - p, \\ 0 & \text{with probability } p. \end{cases} \]
The factor \(1/(1 - p)\) (“inverted dropout”) rescales so that \(\mathbb{E}[\hat a_i] = a_i\), keeping the activation magnitudes consistent with inference. At inference, dropout is disabled — no random masking, no rescaling.
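A minimal NumPy sketch of inverted dropout (the function name and interface are hypothetical):

```python
import numpy as np

def inverted_dropout(a, p, training=True, rng=None):
    """Zero each activation with probability p; rescale survivors by 1/(1 - p)."""
    if not training or p == 0.0:
        return a                          # inference: identity, no masking or rescaling
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(a.shape) >= p       # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)
```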
Typical \(p\) values: \(0.1\) to \(0.5\) for hidden layers in MLPs, \(0.1\) to \(0.3\) in CNNs, \(0.1\) in transformers. Modern image models often use no dropout at all and rely on data augmentation and weight decay instead.
Effect. Dropout prevents co-adaptation: a unit cannot rely on the presence of any specific other unit. Empirically, this improves generalization, particularly in over-parameterized networks. One interpretation: the network’s output averages over an exponentially-large ensemble of subnetworks induced by all possible dropout masks.
Variants:
- Spatial dropout (drop entire feature maps, not individual pixels) for CNNs, where adjacent pixels are highly correlated and dropping individual ones has limited effect.
- DropPath / stochastic depth. Drop entire residual blocks at random during training. Used in EfficientNet, Vision Transformer, and other modern architectures; a sketch follows this list.
- Recurrent dropout. Apply the same dropout mask at every time step in an RNN (Gal & Ghahramani, 2016). Naive per-step dropout is too noisy.
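A minimal PyTorch sketch of per-sample DropPath as used inside a residual block (the module is illustrative, not a reference implementation):

```python
import torch
from torch import nn

class DropPath(nn.Module):
    """Randomly skip the residual branch for a subset of samples during training."""
    def __init__(self, drop_prob=0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return x
        keep_prob = 1.0 - self.drop_prob
        # one Bernoulli draw per sample, broadcast over the remaining dimensions
        shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = torch.rand(shape, device=x.device) < keep_prob
        return x * mask / keep_prob       # inverted rescaling, as with dropout

# usage inside a residual block: out = x + drop_path(branch(x))
```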
Early Stopping
Monitor validation loss during training and stop when it stops improving (or starts increasing). Conceptually simple, mechanically reliable. Usually combined with checkpointing: save the best-validation-loss model and restore it at the end.
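A minimal sketch of the bookkeeping, assuming a PyTorch-style model with state_dict/load_state_dict and placeholder train_one_epoch and validate functions:

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, validate, max_epochs=100, patience=5):
    """Stop once validation loss fails to improve for `patience` epochs; restore the best weights."""
    best_loss, best_state, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break
    model.load_state_dict(best_state)     # restore the best-validation-loss checkpoint
    return model
```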
Why it works. Training loss decreases essentially monotonically with optimization steps; validation loss decreases for a while (as the model fits the underlying signal) and then increases (as the model starts memorizing training noise). The minimum of the validation loss approximates the bias-variance optimum, and stopping there gives the best generalization available from that training run.
Early stopping is a form of implicit regularization: shorter training implicitly limits how complex a function the model has had time to learn. Bishop (1995) showed an approximate equivalence with \(\ell_2\) regularization for linear models — early stopping on a quadratic loss is equivalent to weight decay with a \(\lambda\) that depends on the number of training steps.
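A hedged sketch of that correspondence: for a quadratic loss minimized by gradient descent from \(\theta = 0\) with learning rate \(\eta\), stopping after \(\tau\) steps behaves, to first order in the directions of small curvature, like weight decay with
\[ \lambda \approx \frac{1}{\eta \tau}, \]
so longer training corresponds to weaker effective regularization.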
Other Regularizers
- Data augmentation. Random transformations of the training data (crops, flips, color jitter for images; word dropout, back-translation for text). Often the single most effective regularizer for large vision and language models.
- Label smoothing. Replace one-hot targets with \((1 - \varepsilon)\) on the true class and \(\varepsilon / (K - 1)\) on the others. Empirically improves calibration and generalization.
- Mixup / CutMix. Train on convex combinations of pairs of training examples (and their labels). Cheap, broadly effective; a sketch follows this list.
- Batch normalization. Acts as a mild regularizer due to the noise in batch statistics, in addition to its training-stability benefits. See batch normalization.
- Adversarial training. Train on inputs perturbed in a worst-case direction within an \(\ell_p\) ball. Improves adversarial robustness; as a side effect, sometimes also improves clean-data generalization.
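A minimal PyTorch sketch of mixup for classification (the Beta concentration and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.2):
    """Mix the batch with a shuffled copy of itself using a Beta-distributed coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1.0 - lam) * x[perm], y, y[perm], lam

def mixup_loss(logits, y_a, y_b, lam):
    """Mix the two losses with the same coefficient used to mix the inputs."""
    return lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)
```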
How to Combine Them
For a typical modern deep model:
- Always: weight decay (via AdamW for transformers; via SGD-with-momentum for CNNs), data augmentation, early stopping (or equivalently, training to a fixed budget chosen by validation).
- Often: dropout (especially in fully-connected layers and transformers), batch or layer normalization.
- Sometimes: label smoothing, mixup, stochastic depth.
These compose; using all of them is fine and often helpful. The dominant regularizer is usually data augmentation (for vision) or weight decay (for transformers); the others are smaller corrections.
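As an illustrative composition for a transformer-style classifier (the layer, hyperparameters, and values below are placeholders, not recommendations):

```python
import torch
from torch import nn

block = nn.TransformerEncoderLayer(d_model=256, nhead=8, dropout=0.1)          # dropout inside the block
optimizer = torch.optim.AdamW(block.parameters(), lr=3e-4, weight_decay=0.01)  # decoupled weight decay
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                           # label smoothing
# data augmentation lives in the input pipeline; early stopping in the training loop (see above)
```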
When Regularization Is Not Enough
For severe overfitting, regularization alone usually cannot save you. The fixes that work:
- More data. Reduces variance directly, with no bias cost.
- Architectural inductive biases. A CNN's locality and weight sharing regularize implicitly relative to an MLP on images; a transformer with relative position encodings carries a stronger inductive bias than one without.
- Pretraining and transfer learning. Initialize from a model trained on a large dataset and fine-tune on the small target dataset. Often more effective than any combination of explicit regularizers.