Stochastic Gradient Descent

Motivation

Training a neural network minimizes an empirical risk averaged over the dataset:

\[ L(\theta) = \frac{1}{N} \sum_{i=1}^N \ell(f_\theta(x_i), y_i). \]

Computing the full gradient \(\nabla L(\theta)\) requires a forward + backward pass (backpropagation; Rumelhart et al. 1986) over all \(N\) examples — fine for \(N = 1000\), infeasible for \(N = 10^9\). Stochastic gradient descent (SGD) estimates the gradient from a small minibatch of examples and takes a step in that direction. Each step is much cheaper than a full-batch gradient-descent step, and the resulting noisy trajectory converges to a minimum nearly as well — sometimes better, since the noise helps escape saddle points and shallow local minima.

SGD is the foundation of every neural-network optimizer in current use. Adam, RMSProp, momentum-based methods, and learning-rate schedules are all variants on the SGD update.

Problem

Given an objective \(L(\theta) = (1/N) \sum_{i=1}^N \ell_i(\theta)\) that decomposes as an average of per-example losses, find a local minimum \(\theta^*\) using only stochastic estimates of \(\nabla L\) — gradients computed on small subsets of the data.

Key Ideas

Unbiased minibatch gradient

Sampling a minibatch \(\mathcal{B}_t \subset \{1, \ldots, N\}\) uniformly at random and computing

\[ g_t = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla \ell_i(\theta^{(t)}) \]

gives an unbiased estimate of the full gradient: \(\mathbb{E}[g_t] = \nabla L(\theta^{(t)})\). So in expectation, SGD takes a gradient-descent step. The variance scales as \(O(1/B)\) — larger batches give lower-variance estimates but proportionally more cost per step.
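
As a quick numerical check of both claims, here is a minimal NumPy sketch on a synthetic least-squares problem (the data, parameter point, and batch sizes are illustrative assumptions, not from the text): the minibatch gradient matches the full gradient on average, and its variance shrinks roughly as \(1/B\).

import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
theta = rng.normal(size=d)

def per_example_grads(idx):
    # gradient of (x_i · θ − y_i)² w.r.t. θ, one row per example
    r = X[idx] @ theta - y[idx]
    return 2.0 * r[:, None] * X[idx]

full_grad = per_example_grads(np.arange(N)).mean(axis=0)

for B in (1, 10, 100):
    est = np.array([per_example_grads(rng.choice(N, size=B, replace=False)).mean(axis=0)
                    for _ in range(2_000)])
    bias = np.linalg.norm(est.mean(axis=0) - full_grad)
    variance = est.var(axis=0).sum()
    print(f"B = {B:3d}   |mean - full grad| = {bias:.3f}   total variance = {variance:.3f}")
# the bias stays near zero for every B; the variance drops by roughly 10x per 10x increase in B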

Noise as a feature, not a bug

The stochasticity has two surprising upsides:

  • Escaping saddle points. Non-convex losses have many saddle points; deterministic gradient descent can stall at them. SGD’s noise perturbs the iterates away.
  • Implicit regularization. SGD trajectories favor flatter minima, which tend to generalize better. The connection between noise and generalization is incompletely understood but well documented empirically.

For convex \(L\) with Lipschitz-continuous gradient, SGD converges in expectation under a suitably decaying step size: \(\mathbb{E}[L(\theta^{(t)})] - L^* = O(1/t)\) for strongly convex \(L\) with \(\eta_t = O(1/t)\), and \(O(1/\sqrt{t})\) for general convex \(L\) with \(\eta_t = O(1/\sqrt{t})\). Neural-network losses are non-convex, so these bounds do not apply directly, but the empirical fact is that SGD reliably finds good solutions for over-parameterized networks.

Trajectory is jittery but on track

[Figure: contours of \(L(\theta_1, \theta_2)\) with a full-batch GD path and an SGD path from \(\theta_0\) to \(\theta^*\). Each red step uses a noisy minibatch gradient — right on average, but jittery. Variance shrinks with batch size \(B\); the noise itself helps escape saddle points and shallow local minima.]

Concentric ellipses are contours of \(L(\theta_1, \theta_2)\). Each step in the SGD path uses a noisy gradient estimate, so the trajectory zigzags across the contours instead of following the steepest-descent line. Full-batch GD (dashed blue) takes the smooth route; SGD (red) wobbles but still reaches the minimum, and is cheaper per step.

Algorithm

Given learning rate \(\eta > 0\) and minibatch size \(B\):

initialize θ
for t = 0, 1, 2, ...:
    sample minibatch ℬ_t ⊂ {1, ..., N} of size B
    g_t = (1/B) Σ_{i ∈ ℬ_t} ∇ ℓ(f_θ(x_i), y_i)     # via backprop
    θ = θ - η · g_t

The key step is the stochastic update: replace the full-batch gradient with a minibatch estimate and take a small step in the negative direction.
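
As a concrete rendering of the loop above, here is a minimal NumPy sketch for a linear model with squared loss (the synthetic data, model, and hyperparameters are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
N, d = 5_000, 3
X = rng.normal(size=(N, d))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=N)

def minibatch_grad(theta, idx):
    # gradient of the squared loss averaged over the minibatch
    residual = X[idx] @ theta - y[idx]          # f_θ(x_i) − y_i
    return 2.0 * X[idx].T @ residual / len(idx)

theta = np.zeros(d)        # initialize θ
eta, B = 0.01, 32          # learning rate η, minibatch size B
for t in range(2_000):
    idx = rng.choice(N, size=B, replace=False)  # sample minibatch ℬ_t
    g = minibatch_grad(theta, idx)              # stochastic gradient g_t
    theta -= eta * g                            # θ ← θ − η · g_t

print(theta)  # ≈ true_theta, up to the noise level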

Walkthrough

Three SGD steps on linear regression

Minimize \(L(\theta) = (1/N) \sum_i (\theta x_i - y_i)^2\) for \(N = 4\) data points \((x, y) = (1, 1), (2, 2), (3, 4), (4, 4)\). The minimizer is \(\theta^* = \sum_i x_i y_i / \sum_i x_i^2 = 33/30 = 1.1\).

Per-example gradient: \(\nabla \ell_i(\theta) = 2 (\theta x_i - y_i) \, x_i\).

Start at \(\theta_0 = 0\). Use \(B = 2\), \(\eta = 0.02\).

| Step | Minibatch (random) | Per-example gradients | \(g_t\) | \(\theta\) after update |
|------|--------------------|-----------------------|---------|-------------------------|
| 1 | \(\{(1, 1), (3, 4)\}\) | \(2(0-1)(1) = -2\), \(2(0-4)(3) = -24\) | \(-13\) | \(0 - 0.02(-13) = 0.26\) |
| 2 | \(\{(2, 2), (4, 4)\}\) | \(2(0.52-2)(2) = -5.92\), \(2(1.04-4)(4) = -23.68\) | \(-14.8\) | \(0.26 + 0.296 = 0.556\) |
| 3 | \(\{(1, 1), (4, 4)\}\) | \(2(0.556-1)(1) = -0.888\), \(2(2.224-4)(4) = -14.21\) | \(-7.55\) | \(0.556 + 0.151 = 0.707\) |

After three steps \(\theta \approx 0.71\), still well below \(\theta^* \approx 1.1\). Many more steps would close the gap. The minibatch gradient varies a lot across steps — step 2 had a larger \(|g|\) than step 3 not because the loss landscape changed dramatically, but because the random minibatch happened to weight high-\(x\) examples differently.

This jitter is exactly what the SGD path picture shows: each step is approximately the right direction, but never exactly.
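
The three rows of the table are easy to reproduce in code; a minimal sketch (the minibatch choices are fixed to mirror the table rather than sampled at random):

# reproduce the walkthrough on L(θ) = (1/N) Σ_i (θ x_i − y_i)²
batches = [[(1, 1), (3, 4)], [(2, 2), (4, 4)], [(1, 1), (4, 4)]]  # as in the table

theta, eta = 0.0, 0.02
for step, batch in enumerate(batches, start=1):
    grads = [2 * (theta * x - y) * x for x, y in batch]  # per-example gradients
    g = sum(grads) / len(grads)                          # minibatch gradient g_t
    theta -= eta * g                                     # θ ← θ − η g_t
    print(f"step {step}: g_t = {g:.3f}, theta = {theta:.3f}")
# step 1: g_t = -13.000, theta = 0.260
# step 2: g_t = -14.800, theta = 0.556
# step 3: g_t = -7.548, theta = 0.707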

Correctness

For convex \(L\) with Lipschitz-continuous gradient and i.i.d. stochastic gradients with bounded variance \(\sigma^2\):

  • Strongly convex case with step size \(\eta_t = O(1/t)\): \(\mathbb{E}[L(\theta^{(t)})] - L^* = O(\sigma^2 / t)\).
  • General convex case with averaging and \(\eta = O(1/\sqrt{T})\): \(\mathbb{E}[L(\bar\theta^{(T)})] - L^* = O((1 + \sigma) / \sqrt{T})\).
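
The strongly convex rate is easy to verify numerically; a minimal sketch on the one-dimensional quadratic \(L(\theta) = \tfrac{\mu}{2}\theta^2\) with Gaussian gradient noise and \(\eta_t = 1/(\mu(t+1))\) (all constants here are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0           # strong-convexity constant μ, gradient-noise std σ
runs, T = 2_000, 10_000        # independent runs, steps per run
checkpoints = {100, 1_000, 10_000}

theta = np.full(runs, 5.0)     # start every run at θ = 5
for t in range(T):
    g = mu * theta + sigma * rng.normal(size=runs)   # stochastic gradient of L
    theta -= g / (mu * (t + 1))                      # η_t = 1 / (μ (t+1))
    if t + 1 in checkpoints:
        subopt = 0.5 * mu * (theta ** 2).mean()      # E[L(θ)] − L*, with L* = 0
        bound = sigma ** 2 / (2 * mu * (t + 1))      # the σ²/(2μt) reference level
        print(f"t = {t + 1:6d}   E[L] - L* ≈ {subopt:.5f}   σ²/(2μt) = {bound:.5f}")
# the measured suboptimality tracks the O(σ² / t) rate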

For non-convex \(L\) (the typical neural-network case), the formal guarantees are weaker — typically only convergence to a stationary point in gradient norm. The reliable empirical success of SGD on over-parameterized networks remains incompletely explained; implicit regularization from SGD noise is a leading candidate explanation.

Complexity and Tradeoffs

Per-step cost. \(O(B \cdot c)\) where \(c\) is the cost of one example’s forward + backward. With modern GPU batching, the wall-clock cost per step grows sub-linearly in \(B\) until the device is saturated — increasing \(B\) amortizes kernel-launch and data-movement overhead.

Choosing the minibatch size

  • Too small (\(B = 1\), “online SGD”). High-variance gradients; convergence is slow in iteration count, but each step is cheap.
  • Too large (\(B = N\)). Full-batch gradient descent. Each step is expensive and the noise that helps escape saddle points is gone.
  • Hardware-driven. Modern GPUs prefer batches that fill their compute units. Typical choices are powers of \(2\) from \(32\) to \(1024\) for vision, larger (\(1024\)–\(8192\)) for language with gradient accumulation.

Larger batches need larger learning rates. The empirical scaling rule for neural networks is linear: doubling the batch size lets you double the learning rate, up to a regime-dependent ceiling.
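
A minimal sketch of that linear scaling rule as a helper function (the base learning rate, reference batch size, and ceiling are illustrative assumptions):

def scaled_lr(batch_size, base_lr=0.1, base_batch=256, max_lr=1.6):
    # linear scaling rule: grow η proportionally with B, up to a ceiling
    # beyond which training typically destabilizes
    return min(base_lr * batch_size / base_batch, max_lr)

print(scaled_lr(256))    # 0.1  — the reference setting
print(scaled_lr(1024))   # 0.4  — 4x batch, 4x learning rate
print(scaled_lr(8192))   # 1.6  — hits the ceiling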

Learning-rate schedule

A constant learning rate is rarely optimal. Standard schedules:

  • Step decay. Drop by a factor (typically \(10\times\)) at fixed epochs.
  • Cosine annealing. Smooth decrease from initial \(\eta_0\) to a small final value over training, following half a cosine.
  • Warmup. Linearly ramp from \(0\) to \(\eta_0\) over the first few hundred or thousand steps. Standard for transformers; prevents the early-training divergence that occurs when full-magnitude gradients hit poorly-initialized parameters.
  • Reduce on plateau. Drop by a factor when validation loss stops improving. Less popular in modern practice than fixed schedules.

Once the initial learning rate is within a factor of \(\sim 3\) of a good value, the choice of schedule typically matters more than the precise value itself.
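
A minimal sketch of a warmup-then-cosine schedule as a function of the step index (the step counts and rates are illustrative assumptions):

import math

def lr_at(step, total_steps=100_000, warmup_steps=1_000, eta0=3e-4, eta_min=3e-6):
    if step < warmup_steps:
        # linear warmup from 0 to η₀
        return eta0 * step / warmup_steps
    # cosine annealing from η₀ down to η_min over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * progress))

print(lr_at(500))      # 1.5e-4 — halfway through warmup
print(lr_at(1_000))    # 3e-4   — warmup done, cosine phase begins
print(lr_at(100_000))  # 3e-6   — end of training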

Practical notes

  • Shuffle the data once per epoch. Without shuffling, gradients become correlated across consecutive steps and convergence suffers.
  • Normalize inputs. Inputs with very different scales make the loss surface ill-conditioned and SGD slow.
  • Monitor training and validation loss. Diverging training loss → learning rate too high. Stagnant training loss → learning rate too low or bad initialization.
  • Gradient clipping (see gradient clipping) is essential for recurrent networks and useful for transformers; less commonly needed for plain MLPs and CNNs.
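
A minimal sketch of global-norm gradient clipping, applied to the minibatch gradient just before the update (the threshold and example gradients are illustrative assumptions):

import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # rescale every gradient by a common factor if the global L2 norm exceeds max_norm
    global_norm = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]          # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)      # 13.0
print(clipped)   # each gradient scaled by ≈ 1/13, so the clipped global norm is ≈ 1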

When to Use It

| Situation | Approach |
|-----------|----------|
| Default for neural-network training | SGD with momentum or AdamW. |
| Vision / CNN | SGD with momentum is competitive, sometimes superior. |
| Language / transformer | AdamW with warmup + cosine schedule. |
| RNN / LSTM | SGD with momentum or Adam, plus gradient clipping. |
| Tiny dataset, low compute | Full-batch L-BFGS or other second-order method. |
| Need scale-invariance / heterogeneous gradients | Adam, RMSProp (adaptive optimizers). |
| Convex problem with strong regularity | Variance-reduced SGD (SVRG, SAGA) — see Variants. |

Variants

  • SGD with momentum. Accumulate an exponentially weighted sum of past gradients and update along it: \(v \leftarrow \beta v + g\), \(\theta \leftarrow \theta - \eta v\). Smoother trajectory, faster traversal of long flat valleys. The empirical default for CNN training (see the sketch after this list).
  • Nesterov momentum. Evaluate the gradient at \(\theta - \eta \beta v\) instead of \(\theta\). Slightly tighter convergence on convex problems; common in practice.
  • Adam. Per-parameter learning rates derived from running estimates of gradient mean and variance. Most popular optimizer for transformers and modern architectures.
  • AdamW. Adam with decoupled weight decay; recommended over Adam for any model where weight decay matters.
  • RMSProp. Per-parameter learning rates from the running gradient variance only. Predates Adam; still used in some RL pipelines.
  • Variance-reduced SGD (SVRG, SAGA). Periodically compute a full-batch gradient and use it as a control variate to reduce variance. Better theoretical rates on convex problems; rarely used in deep learning.
  • Polyak averaging. Track an exponential moving average of \(\theta\) during training and use it at evaluation. Stabilizes the final iterate. The basis of stochastic weight averaging (SWA).
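
As referenced in the momentum item above, a minimal sketch of the momentum update wrapped around plain SGD (the noisy quadratic objective and hyperparameters are illustrative assumptions):

import numpy as np

def sgd_momentum_step(theta, v, grad, eta=0.01, beta=0.9):
    # v ← β v + g ;  θ ← θ − η v
    v = beta * v + grad
    return theta - eta * v, v

rng = np.random.default_rng(3)
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for t in range(500):
    grad = theta + 0.1 * rng.normal(size=2)   # stochastic gradient of ½‖θ‖²
    theta, v = sgd_momentum_step(theta, v, grad)
print(theta)  # close to the minimum at the origin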

References

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.