Adaptive Optimizers

Motivation

Plain stochastic gradient descent uses a single learning rate \(\eta\) for every parameter and does not exploit information from past gradients. Two ideas — momentum and per-parameter adaptive learning rates — address common failure modes of SGD and combine into the optimizers that train most modern neural networks.

The two main facts to know:

  • Momentum averages past gradients to smooth the update direction. It accelerates progress along consistent directions and damps oscillation across narrow valleys.
  • Adaptive methods (RMSProp, Adam, AdamW) scale each parameter’s update by a running estimate of its gradient magnitude, giving small effective learning rates to parameters with consistently large gradients and larger ones to parameters with consistently small gradients.

Momentum

Standard SGD with momentum maintains a velocity vector \(v\) and updates

\[ v_{t+1} = \mu v_t + g_t, \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}, \]

where \(g_t\) is the minibatch gradient and \(\mu \in [0, 1)\) is the momentum coefficient (typically \(0.9\) or \(0.99\)). The velocity is an exponentially weighted sum of past gradients: recent gradients dominate, and older ones decay geometrically with factor \(\mu\).

Geometric intuition: if the loss surface is a long narrow valley, the gradient at any point has a small component along the valley and a large component across it. Plain SGD oscillates across; momentum averages the across-direction components to near zero while the along-direction components add up coherently. The result is fast progress along the valley.

Nesterov momentum evaluates the gradient at the look-ahead point \(\theta_t - \eta \mu v_t\) instead of at \(\theta_t\). It converges slightly faster in practice, but the difference is usually marginal in deep learning.
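
For concreteness, here is a minimal NumPy sketch of one momentum step, with the Nesterov variant behind a flag. The function name, grad_fn, and the default hyperparameter values are illustrative placeholders, not taken from any particular library.

    import numpy as np

    def momentum_step(theta, v, grad_fn, lr=0.01, mu=0.9, nesterov=False):
        # theta: parameter vector; v: velocity from the previous step
        # grad_fn: callable returning the minibatch gradient at a given point
        if nesterov:
            # Nesterov: evaluate the gradient at the look-ahead point.
            g = grad_fn(theta - lr * mu * v)
        else:
            g = grad_fn(theta)
        v = mu * v + g                 # exponentially weighted sum of past gradients
        theta = theta - lr * v
        return theta, v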

RMSProp

RMSProp adapts the learning rate per parameter by tracking an exponential moving average of squared gradients:

\[ s_{t+1} = \beta s_t + (1 - \beta) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_{t+1}} + \varepsilon} g_t, \]

with operations elementwise. A parameter whose gradient has been consistently large gets a small effective learning rate; a parameter with consistently small gradient gets a large one. This automatic per-parameter scaling helps when the loss is poorly conditioned across coordinates — common in deep networks.

The \(\varepsilon\) (typically \(10^{-8}\)) prevents division by zero or by very small numbers when \(s\) is near zero, e.g., for parameters that have not received any gradient yet. \(\beta\) is typically \(0.9\) or \(0.99\).
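
A minimal sketch of one RMSProp step under the same conventions; theta, s, and g are NumPy arrays, and the defaults mirror the typical values quoted above.

    import numpy as np

    def rmsprop_step(theta, s, g, lr=1e-3, beta=0.9, eps=1e-8):
        # s: running average of squared gradients; g: current minibatch gradient
        s = beta * s + (1 - beta) * g**2              # elementwise
        theta = theta - lr * g / (np.sqrt(s) + eps)   # per-parameter effective step
        return theta, s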

Adam

Adam (Kingma and Ba 2015) combines momentum and RMSProp’s adaptive scaling. Maintain biased running averages of the gradient and squared gradient,

\[ m_{t+1} = \beta_1 m_t + (1 - \beta_1) g_t, \qquad v_{t+1} = \beta_2 v_t + (1 - \beta_2) g_t^2, \]

and bias-correct them:

\[ \hat m_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}}, \qquad \hat v_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{t+1}}. \]

Update:

\[ \theta_{t+1} = \theta_t - \eta \, \frac{\hat m_{t+1}}{\sqrt{\hat v_{t+1}} + \varepsilon}. \]

Default hyperparameters: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\varepsilon = 10^{-8}\), and \(\eta \in [10^{-4}, 10^{-3}]\) for most architectures. The bias correction matters for roughly the first \(1/(1-\beta_1)\) and \(1/(1-\beta_2)\) steps; without it, \(\hat m\) and \(\hat v\) would be biased toward zero because both running averages start at zero.
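
Putting the three equations together, here is a minimal sketch of one Adam step. The step counter t starts at 1 on the first call, matching the bias-correction exponents above; the defaults are the standard values just quoted.

    import numpy as np

    def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # m, v: running averages of the gradient and squared gradient (start at zero)
        # g: current minibatch gradient; t: step counter, 1 on the first call
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)     # bias correction: both averages start at zero
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v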

Adam is the most widely used optimizer in deep learning. It tolerates a wide range of learning rates, requires almost no tuning to get started, and is competitive on most architectures. It (or the AdamW variant below) is the default in transformer training.

Diagram: SGD vs. Adam on a narrow-valley loss surface

The contours are stretched along \(\theta_2\), so the minimum lies in a long thin trough. Plain SGD oscillates across the trough; Adam’s per-parameter adaptive scaling damps the high-curvature direction and accelerates progress along the valley.

[Figure: contour plot of the valley, with both trajectories starting at θ₀. The SGD path oscillates across the narrow direction; the Adam path, with each step divided by √(EMA of squared gradient), is smoother and progresses faster along the long shallow direction toward the minimum.]

AdamW

The standard \(\ell_2\) regularizer (“weight decay”) adds \(\frac{\lambda}{2} \|\theta\|^2\) to the loss; this contributes \(\lambda \theta\) to the gradient. With Adam, this contribution gets divided by \(\sqrt{\hat v}\) along with everything else, which is not what one usually wants — weights whose recent gradients have been large are decayed less than intended.

AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the gradient:

\[ \theta_{t+1} = \theta_t - \eta \left(\frac{\hat m_{t+1}}{\sqrt{\hat v_{t+1}} + \varepsilon} + \lambda \theta_t\right). \]

The weight-decay term scales linearly in \(\theta\) and is not divided by \(\sqrt{\hat v}\). Empirically AdamW substantially outperforms Adam-with-\(\ell_2\) on transformer architectures and is the default for modern large-model training.
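
Relative to the Adam sketch above, the change is a single line: the decay term \(\lambda \theta\) is added to the update directly instead of being folded into the gradient, so it bypasses the division by \(\sqrt{\hat v}\). A hedged sketch, with weight_decay as an illustrative default:

    import numpy as np

    def adamw_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        # Decoupled decay: weight_decay * theta is NOT divided by sqrt(v_hat).
        theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
        return theta, m, v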

When to Use Which

  • AdamW for transformers, language models, and most novel architectures. Default choice (typical instantiations are sketched after this list).
  • SGD with Nesterov momentum for ImageNet-scale CNNs. Often slightly outperforms Adam on this class of model after tuning. Standard baseline in vision.
  • Adam without weight decay tuning if reproducing older papers; otherwise prefer AdamW.
  • RMSProp in some reinforcement-learning training loops where it is the historical baseline (e.g., DQN). No strong reason to choose it over Adam for new work.
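
As a concrete starting point for the first two bullets, here is how these optimizers are typically instantiated in PyTorch. The toy model and every hyperparameter value below are placeholders to be tuned for the task at hand; in practice only one optimizer would be created.

    import torch

    model = torch.nn.Linear(784, 10)  # stand-in for a real model

    # AdamW: default choice for transformers and most new architectures.
    opt_adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

    # SGD with Nesterov momentum: the standard ImageNet-CNN baseline.
    opt_sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                              nesterov=True, weight_decay=1e-4)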

Diagnostics

If training diverges with Adam: check the learning rate (often too high), use warmup (linearly ramp \(\eta\) over the first few hundred steps), and check for exploding gradients. Each element of the bias-corrected ratio \(\hat m / \sqrt{\hat v}\) has magnitude of order one early in training, so with \(\eta \approx 10^{-4}\) each parameter moves by roughly \(10^{-4}\) per step — small and well-behaved unless something is wrong upstream.
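
A minimal sketch of the linear warmup mentioned above; the 500-step ramp and the base rate are illustrative values, and any decay schedule can take over after the ramp.

    def warmup_lr(step, base_lr=1e-4, warmup_steps=500):
        # Linearly ramp the learning rate from ~0 to base_lr, then hold it.
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        return base_lr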

References

Kingma, Diederik P., and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1412.6980.
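
Loshchilov, Ilya, and Frank Hutter. 2019. “Decoupled Weight Decay Regularization.” International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1711.05101.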