Online Learning

Motivation

Standard machine learning assumes a training set is available all at once and the learner produces a model for future prediction. Online learning drops that assumption: predictions and feedback arrive one at a time, the learner updates as it goes, and there may be no underlying probability distribution at all — the data sequence can be adversarial.

This setting models many practical problems: spam classification (the spammers actively change tactics), online ad selection (contextual bandits), portfolio rebalancing, A/B test allocation, and competitive analysis of online algorithms. The right performance metric is regret — how much worse the learner did than the best fixed action chosen in hindsight — and the central insight of the field is that small regret is achievable even against an adversary, with no statistical assumptions whatsoever (Cesa-Bianchi and Lugosi 2006; Arora et al. 2012).

Online learning sits at the intersection of game theory (zero-sum games), linear programming, and multi-armed bandits. Many of its core algorithms (multiplicative weights, online gradient descent) are remarkably simple and apply far beyond machine learning.

Setup: prediction with expert advice

The clearest formulation is prediction with expert advice:

For \(t = 1, 2, \dots, T\):

  1. The learner has access to \(n\) “experts,” each providing a recommendation.
  2. The learner selects a distribution \(p_t \in \Delta^n\) over the experts (or equivalently, picks one at random per \(p_t\)).
  3. The adversary reveals a loss vector \(\ell_t \in [0, 1]^n\), one loss per expert.
  4. The learner suffers expected loss \(\langle p_t, \ell_t \rangle\).

The learner’s total loss is \(L_T = \sum_t \langle p_t, \ell_t \rangle\).

The benchmark is the best expert in hindsight: \(L_T^* = \min_i \sum_t \ell_{t, i}\).

Regret

The regret is the gap between the learner’s loss and the best expert’s loss: \[ R_T = L_T - L_T^* = \sum_t \langle p_t, \ell_t \rangle - \min_i \sum_t \ell_{t, i}. \]

The aim is to design an algorithm with sublinear regret: \(R_T = o(T)\). This means the per-round regret \(R_T / T \to 0\) — asymptotically, the learner does as well as the best fixed expert, even though it had to commit to choices before seeing each loss.
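To make the definitions concrete, here is a minimal sketch in Python (using NumPy; the function name and array layout are illustrative, not from any particular library) that computes \(L_T\), \(L_T^*\), and \(R_T\) from the learner's distributions and the adversary's loss vectors:

```python
import numpy as np

def regret(plays, losses):
    """Regret R_T = L_T - L_T^* for one run of the expert-advice protocol.

    plays:  (T, n) array; row t is the distribution p_t over the n experts.
    losses: (T, n) array; row t is the loss vector ell_t in [0, 1]^n.
    """
    learner_loss = float(np.sum(plays * losses))        # L_T = sum_t <p_t, ell_t>
    best_expert_loss = float(losses.sum(axis=0).min())  # L_T^* = min_i sum_t ell_{t,i}
    return learner_loss - best_expert_loss
```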

Why sublinear regret is non-trivial

A naive learner that picks the best expert so far can be tricked: the adversary can set \(\ell_t\) to penalize whichever expert the learner is currently following. The “follow the leader” strategy can suffer linear regret in the worst case.

The fix is randomization combined with smooth updates. The canonical algorithm — multiplicative weights — keeps a weight on each expert and updates it gently after each round, achieving regret \(O(\sqrt{T \log n})\) — sublinear, with only logarithmic dependence on the number of experts.
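A minimal sketch of the algorithm, assuming the common exponential-update form (sometimes called Hedge) with a fixed learning rate eta; choosing eta on the order of \(\sqrt{\log(n)/T}\) yields the \(O(\sqrt{T \log n})\) bound:

```python
import numpy as np

def multiplicative_weights(losses, eta):
    """Exponential-weights learner on a (T, n) loss matrix; returns total expected loss."""
    T, n = losses.shape
    weights = np.ones(n)
    total_loss = 0.0
    for t in range(T):
        p = weights / weights.sum()          # current distribution over experts
        total_loss += float(p @ losses[t])   # expected loss <p_t, ell_t> this round
        weights *= np.exp(-eta * losses[t])  # gently downweight experts that did badly
    return total_loss
```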

Example: regret comparison across three learners

Three experts, four rounds. Losses (\(n=3\), \(T=4\)):

Round   Expert 1   Expert 2   Expert 3
1       0          1          0
2       1          0          0
3       0          0          1
4       1          0          1
Total   2          1          2

Best fixed expert in hindsight: expert 2, with total loss \(L_T^* = 1\).

Three learner strategies and their total losses:

  • Follow-the-leader (always pick the expert with lowest total loss so far): per-round losses \(1/3, 0, 1, 0\) (uniform over the round-1 tie, then E3, E3, E2, breaking later ties favorably); total \(L_T = 4/3 \approx 1.33\), regret \(\approx 0.33\).
  • Follow-the-loser (always pick the expert with highest total loss so far): chases high-loss experts; both its total loss and its regret are high in general.
  • Uniform (always play \((1/3, 1/3, 1/3)\)): per-round losses \(1/3, 1/3, 1/3, 2/3\); total \(L_T = 5/3 \approx 1.67\), regret \(\approx 0.67\).
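The totals are easy to check numerically; here is a small sketch reusing the loss table above (the follow-the-leader figure is omitted here because it depends on how ties are broken):

```python
import numpy as np

# The 4-round, 3-expert loss table from the example.
losses = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1],
                   [1, 0, 1]], dtype=float)

best_expert = losses.sum(axis=0).min()   # L_T^* = 1.0 (expert 2)
uniform = losses.mean(axis=1).sum()      # uniform learner: 1/3 + 1/3 + 1/3 + 2/3 = 5/3
print(best_expert, uniform)
```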

Follow-the-leader (FTL) looks good here but can be manipulated: if the adversary sets \(\ell_t\) to penalize whichever expert FTL currently favors, FTL suffers loss \(\Theta(T)\) while the best expert suffers \(\Theta(1)\), giving \(\Theta(T)\) regret — linear and therefore unacceptable. Multiplicative weights, by contrast, maintains smooth distributions and achieves \(O(\sqrt{T \log n})\) regret on any sequence.
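The manipulation is easy to reproduce. The sketch below builds a two-expert sequence in which, after a half-loss opening round, the loss of 1 always lands on FTL's current leader (here FTL breaks ties by lowest index). FTL's total loss is about \(T\) while each expert's is about \(T/2\); running the multiplicative_weights sketch above on the same matrix keeps near-uniform weights and stays within \(O(\sqrt T)\) of the best expert.

```python
import numpy as np

def ftl_loss(losses):
    """Follow-the-leader: each round, play the expert with lowest loss so far."""
    T, n = losses.shape
    cumulative = np.zeros(n)
    total = 0.0
    for t in range(T):
        leader = int(np.argmin(cumulative))   # ties broken by lowest index
        total += losses[t, leader]
        cumulative += losses[t]
    return total

T = 1001
losses = np.zeros((T, 2))
losses[0] = [0.5, 0.0]                                   # opening round creates a leader
losses[1:] = [[0.0, 1.0], [1.0, 0.0]] * ((T - 1) // 2)   # always punish the current leader

print("FTL loss:        ", ftl_loss(losses))             # ~T
print("best expert loss:", losses.sum(axis=0).min())     # ~T/2
```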

Lower bound

Any online learning algorithm must suffer regret \(\Omega(\sqrt{T \log n})\) in the worst case (against an adversarial loss sequence with \(n\) experts and horizon \(T\)). So multiplicative weights is optimal up to constants. The proof is probabilistic: when every loss is an independent fair coin flip, any algorithm's expected loss is exactly \(T/2\), while the best of the \(n\) experts has expected loss about \(T/2 - \Theta(\sqrt{T \log n})\), so some fixed loss sequence must force that much regret.
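The gap is easy to see numerically. In the sketch below (a simulation under the assumption of i.i.d. fair-coin losses, not a proof), every algorithm's expected loss is \(T/2\) by symmetry, so the expected regret is the expected shortfall of the best expert below \(T/2\), which scales like \(\sqrt{T \log n}\):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, trials = 10_000, 16, 200

gaps = []
for _ in range(trials):
    losses = rng.integers(0, 2, size=(T, n))   # fair-coin losses for each expert
    gaps.append(T / 2 - losses.sum(axis=0).min())

print("mean shortfall of best expert:", np.mean(gaps))
print("sqrt(T log n), for scale:     ", np.sqrt(T * np.log(n)))
```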

Variants

  • Bandit feedback: the learner observes only the loss of the action it took, not the full \(\ell_t\). Harder; achievable regret is \(O(\sqrt{nT})\) rather than \(O(\sqrt{T \log n})\) — the cost of partial feedback. The non-stochastic multi-armed bandit is the special case of online learning with bandit feedback.
  • Online convex optimization: each \(\ell_t\) is a convex function over a convex set rather than a finite set of experts. Online gradient descent achieves \(O(\sqrt T)\) regret (a minimal sketch follows this list).
  • Stochastic setting: losses drawn i.i.d. from a fixed distribution. Easier; regret is \(O(\log T)\) for UCB-style algorithms in the bandit case.
  • Adversarial regret with switching cost: penalize the learner for changing actions across rounds. Different algorithms (e.g., shrinking dartboard) needed.
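For the online convex optimization variant, here is a minimal sketch of projected online gradient descent, assuming the feasible set is a Euclidean ball and a step size proportional to \(1/\sqrt t\) (the choice that yields the \(O(\sqrt T)\) bound under bounded gradients):

```python
import numpy as np

def project_to_ball(x, radius):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def online_gradient_descent(grad_fns, x0, radius=1.0, eta0=0.1):
    """Projected OGD with step size eta0 / sqrt(t).

    grad_fns[t](x) returns a (sub)gradient of the round-t convex loss at x;
    returns the list of points played, one per round.
    """
    x = project_to_ball(np.asarray(x0, dtype=float), radius)
    played = []
    for t, grad in enumerate(grad_fns, start=1):
        played.append(x.copy())
        x = project_to_ball(x - (eta0 / np.sqrt(t)) * grad(x), radius)
    return played
```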

Applications

  • Multi-armed bandits are the bandit-feedback special case of online learning.
  • Boosting (AdaBoost): the multiplicative-weights algorithm applied to weak learners as experts.
  • Approximating LPs: the multiplicative-weights framework solves packing/covering LPs (and zero-sum games) in near-linear time, much faster than general LP solvers.
  • Online portfolio selection: each “expert” is a fixed portfolio; the learner achieves performance comparable to the best constant-rebalanced portfolio in hindsight (Cover’s universal portfolios).

Connection to game theory

Multiplicative weights is the algorithm a player in a zero-sum game should run to converge to an optimal mixed strategy. If both players run multiplicative weights with appropriate learning rates, their average strategies converge to a Nash equilibrium — which gives a constructive proof of the minimax theorem and an alternative to LP-based equilibrium computation. This is a remarkable cross-link: an online-learning algorithm doubles as a game-solving algorithm.
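A minimal self-play sketch for rock-paper-scissors, encoded as a zero-sum game in which the row player minimizes \(x^\top A y\) and the column player maximizes it; both sides run the exponential-weights update, and the time-averaged strategies approach the uniform equilibrium (the matrix, horizon, and learning rate here are illustrative choices):

```python
import numpy as np

# A[i, j] is the row player's loss: 0 = win, 0.5 = tie, 1 = loss.
A = np.array([[0.5, 1.0, 0.0],    # rock     vs rock / paper / scissors
              [0.0, 0.5, 1.0],    # paper
              [1.0, 0.0, 0.5]])   # scissors

T, eta = 5000, 0.05
wx, wy = np.ones(3), np.ones(3)
avg_x, avg_y = np.zeros(3), np.zeros(3)

for _ in range(T):
    x, y = wx / wx.sum(), wy / wy.sum()
    avg_x += x / T
    avg_y += y / T
    wx *= np.exp(-eta * (A @ y))     # row player downweights high-loss actions
    wy *= np.exp(eta * (A.T @ x))    # column player upweights high-payoff actions

print("average strategies:", np.round(avg_x, 3), np.round(avg_y, 3))
print("value of averaged play:", round(float(avg_x @ A @ avg_y), 3))   # close to 0.5
```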

References

Arora, Sanjeev, Elad Hazan, and Satyen Kale. 2012. “The Multiplicative Weights Update Method: A Meta-Algorithm and Its Applications.” Theory of Computing 8 (1): 121–64. https://doi.org/10.4086/toc.2012.v008a006.
Cesa-Bianchi, Nicolò, and Gábor Lugosi. 2006. Prediction, Learning, and Games. Cambridge University Press. https://www.cambridge.org/core/books/prediction-learning-and-games/A05C9F6ABC752FAB8954C885D0065C8F.