Multiplicative Weights Update
Motivation
The multiplicative weights update (MWU) algorithm is a one-page algorithm with an unreasonable number of applications. It solves online learning with expert advice with optimal regret. It computes \(\epsilon\)-approximate Nash equilibria of zero-sum games. It approximates packing and covering linear programs in near-linear time. It is the engine behind AdaBoost. It appears (under different names) in derandomization, in network routing, in semidefinite programming, in computational geometry, and in machine learning (Arora et al. 2012; Freund and Schapire 1997).
The algorithm is so simple it can be stated in three lines: keep a weight on each “expert,” after each round multiply each weight by an exponential of its loss, and sample experts in proportion to the current weights. Yet it achieves the optimal \(O(\sqrt{T \log n})\) regret bound, matching the lower bound for online learning.
Problem
A learner faces \(T\) rounds and \(n\) “experts” providing advice. At round \(t\):
- The learner plays a distribution \(p_t\) over the experts.
- Nature reveals a loss vector \(\ell_t \in [0, 1]^n\) (one loss per expert).
- The learner incurs loss \(\langle p_t, \ell_t \rangle\).
The losses can be adversarial — no statistical assumptions. The learner’s goal is to minimize the regret
\[ R_T = \sum_{t=1}^T \langle p_t, \ell_t \rangle - \min_i \sum_{t=1}^T \ell_{t, i}, \]
the gap between the learner’s cumulative loss and the loss of the best expert in hindsight.
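For concreteness, here is the regret computation on a tiny hypothetical instance (two experts, two rounds; the numbers are illustrative only):

```python
import numpy as np

# hypothetical loss matrix: rows are rounds, columns are experts
losses = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
# suppose the learner played the uniform distribution both rounds
ps = np.array([[0.5, 0.5],
               [0.5, 0.5]])

learner_loss = (ps * losses).sum()      # sum_t <p_t, l_t> = 1.0
best_expert = losses.sum(axis=0).min()  # best cumulative loss in hindsight = 1.0
regret = learner_loss - best_expert     # 0.0: hedging was optimal on this sequence
```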
Key Ideas
Exponential reweighting tracks the best expert
The weight on each expert decays exponentially in its accumulated loss: an expert that has done well retains its weight, while an expert that has done badly sees its weight shrink. Sampling in proportion to the weights shifts probability mass toward whoever is winning — without ever committing to one expert and without ever pruning any expert entirely.
This is hedging by aggregation: the learner plays a mixture, and the mixture concentrates on the leader as the round count grows. The randomization is the source of adversarial robustness: the adversary cannot exploit a deterministic choice.
Multiplicative beats additive
The update \(w_{t+1, i} = w_{t, i} \cdot (1 - \eta \ell_{t, i})\) changes weights by a factor that depends only on each expert’s own loss, not on the others’. This scale-invariance is what gives MWU its \(O(\sqrt{T \log n})\) regret instead of the worse \(O(\sqrt{T n})\) that additive updates (gradient descent on weights) achieve.
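To make the contrast concrete, here is a minimal sketch of the two update rules side by side (hypothetical numbers; the additive rule would also need a projection back onto the probability simplex):

```python
import numpy as np

w = np.ones(3)
loss = np.array([0.0, 1.0, 0.5])  # hypothetical per-expert losses
eta = 0.1

w_mult = w * (1 - eta * loss)  # multiplicative: each weight scaled by its own loss
w_add = w - eta * loss         # additive: a gradient step (re-project to the simplex after)
```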
Robustness to adversarial input
No statistical assumptions are needed; the analysis goes through for any loss sequence. So MWU works in any setting where you can identify a finite set of “actions” and observe their losses — which is why the same algorithm appears in online learning, game-solving, LP approximation, and boosting.
A useful slogan: anything with a “best in hindsight” benchmark and an experts-style decomposition can be attacked by multiplicative weights.
Algorithm
There are \(n\) experts. The learner maintains a weight \(w_{t, i}\) for each expert \(i\), initially \(w_{1, i} = 1\). At each round \(t = 1, \dots, T\) it normalizes, plays, observes, and updates. In runnable form (NumPy):

```python
import numpy as np

def mwu(losses, eta):
    """Run MWU on a (T, n) array of losses in [0, 1]; return the learner's expected total loss."""
    w = np.ones(losses.shape[1])  # w_{1, i} = 1 for every expert
    total = 0.0
    for loss in losses:           # row t is the loss vector l_t in [0, 1]^n
        p = w / w.sum()           # p_{t, i} = w_{t, i} / W_t; sample from p or play the mixture
        total += p @ loss         # learner's expected loss <p_t, l_t>
        w *= 1 - eta * loss       # w_{t+1, i} = w_{t, i} * (1 - eta * l_{t, i})
    return total
```
\(\eta \in (0, 1/2]\) is the learning rate. An equivalent formulation uses \(w_{t+1, i} = w_{t, i} \cdot e^{-\eta \, \ell_{t, i}}\); the analysis is essentially identical.
That is the entire algorithm.
Walkthrough
Three experts over four rounds
Three experts, four rounds, learning rate \(\eta = 0.5\). Initial weights \(w_1 = (1, 1, 1)\), \(W_1 = 3\).
| Round | \(\ell_t\) (losses) | \(p_t\) (distribution) | Expected loss | Weights after update |
|---|---|---|---|---|
| 1 | \((0, 1, 0)\) | \((1/3,\ 1/3,\ 1/3)\) | \(1/3\) | \((1,\ 0.5,\ 1)\); \(W=2.5\) |
| 2 | \((1, 0, 0)\) | \((0.4,\ 0.2,\ 0.4)\) | \(0.4\) | \((0.5,\ 0.5,\ 1)\); \(W=2\) |
| 3 | \((0, 0, 1)\) | \((0.25,\ 0.25,\ 0.5)\) | \(0.5\) | \((0.5,\ 0.5,\ 0.5)\); \(W=1.5\) |
| 4 | \((1, 0, 1)\) | \((1/3,\ 1/3,\ 1/3)\) | \(2/3\) | — |
Learner’s total loss: \(1/3 + 2/5 + 1/2 + 2/3 = 57/30 = 1.9\).
Expert cumulative losses: expert 1 suffers \(0+1+0+1=2\), expert 2 suffers \(1+0+0+0=1\), expert 3 suffers \(0+0+1+1=2\). Best expert is expert 2 with total loss \(1\).
Regret \(= 1.9 - 1 = 0.9\). The bound gives \(2\sqrt{4 \ln 3} = 4\sqrt{\ln 3} \approx 4 \cdot 1.048 \approx 4.2\) — the actual regret is well below the worst-case bound because this loss sequence is not adversarial.
Note how MWU responds: after round 1 (expert 2 was penalized), its weight halves and the distribution shifts away from it in round 2. After round 2 penalizes expert 1, experts 1 and 2 equalize. The algorithm continuously hedges without committing to any single expert.
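These numbers can be reproduced with the `mwu` sketch from the Algorithm section:

```python
import numpy as np

losses = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1],
                   [1, 0, 1]], dtype=float)

total = mwu(losses, eta=0.5)               # 1.9
regret = total - losses.sum(axis=0).min()  # 1.9 - 1.0 = 0.9
```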
Correctness
Theorem. With learning rate \(\eta = \sqrt{\log n / T}\), MWU achieves expected regret \[ R_T \le 2 \sqrt{T \log n} \] against any adversarial loss sequence.
Proof sketch. The total weight \(W_t\) tracks the algorithm’s performance. After the update, \[ W_{t+1} = \sum_i w_{t, i} (1 - \eta \, \ell_{t, i}) = W_t \cdot (1 - \eta \langle p_t, \ell_t \rangle). \] Multiplying over \(t\), with \(W_1 = n\): \[ W_{T+1} = n \prod_{t=1}^T (1 - \eta \langle p_t, \ell_t \rangle). \] Take logs and use \(\log(1 - x) \le -x\): \[ \log W_{T+1} \le \log n - \eta \sum_t \langle p_t, \ell_t \rangle = \log n - \eta L_T. \] For the best expert \(i^*\): \[ W_{T+1} \ge w_{T+1, i^*} = \prod_t (1 - \eta \, \ell_{t, i^*}) \ge \exp\left( -\eta L_T^* - \eta^2 L_T^* \right), \] using \(\log(1 - x) \ge -x - x^2\) for \(x \in [0, 1/2]\) (which applies since \(\eta \le 1/2\) and \(\ell_{t, i^*} \le 1\)) together with \(\ell_{t, i^*}^2 \le \ell_{t, i^*}\). Combining: \[ -\eta L_T^* - \eta^2 L_T^* \le \log W_{T+1} \le \log n - \eta L_T, \] so \[ L_T \le L_T^* + \eta L_T^* + \frac{\log n}{\eta}. \] Choose \(\eta = \sqrt{\log n / T}\) and use \(L_T^* \le T\) to get \(R_T = L_T - L_T^* \le 2 \sqrt{T \log n}\).
This bound is tight up to constants — the online-learning lower bound is \(\Omega(\sqrt{T \log n})\).
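A quick empirical check of the bound (not part of the proof; assumes the `mwu` sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 10_000, 20
losses = rng.random((T, n))   # an arbitrary loss sequence in [0, 1]^n
eta = np.sqrt(np.log(n) / T)  # the theorem's learning rate, here ~0.017

regret = mwu(losses, eta) - losses.sum(axis=0).min()
assert regret <= 2 * np.sqrt(T * np.log(n))  # bound ~ 346; guaranteed for any sequence
```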
Complexity and Tradeoffs
Per-round cost: \(O(n)\) to normalize the weights, sample, and update. Memory: \(O(n)\).
The choice of \(\eta\) matters. Too large and the regret bound’s first term (\(\eta L_T^*\)) dominates; too small and the second term (\(\log n / \eta\)) dominates. The optimal \(\eta = \sqrt{\log n / T}\) requires knowing \(T\) in advance; in practice an anytime variant uses \(\eta_t = \sqrt{\log n / t}\) with a slight cost in constants.
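A minimal sketch of such an anytime variant (capping \(\eta_t\) at \(1/2\) keeps the update factor positive; this is a simplification, not the tuned variant from the literature):

```python
import numpy as np

def mwu_anytime(losses):
    """MWU with a decaying learning rate; the horizon T need not be known."""
    n = losses.shape[1]
    w = np.ones(n)
    total = 0.0
    for t, loss in enumerate(losses, start=1):
        eta_t = min(0.5, np.sqrt(np.log(n) / t))  # anytime rate, capped at 1/2
        p = w / w.sum()
        total += p @ loss
        w *= 1 - eta_t * loss
    return total
```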
The bound is in expectation; high-probability bounds follow from concentration inequalities. The dependence on \(n\) is only logarithmic, so MWU scales to enormous numbers of experts as long as each one can be evaluated cheaply.
When to Use It
| Situation | Approach |
|---|---|
| Online prediction with expert advice | MWU — optimal regret. |
| Adversarial losses, no statistical assumptions | MWU — distribution-free. |
| Approximate Nash equilibrium of a zero-sum game | MWU run by both players — see Variants. |
| Approximate packing/covering LP | MWU formulation — fast approximation. |
| Boosting (combining weak classifiers) | AdaBoost — MWU in classification disguise. |
| Stochastic bandit (rewards drawn i.i.d.) | UCB or Thompson sampling achieve \(O(\log T)\) regret. |
| Partial feedback (only chosen expert’s loss observed) | EXP3 — see Variants. |
Variants
- EXP3 (Auer et al. 2002). The bandit version: only the loss of the chosen expert is observed, not the full loss vector. Importance-weight the observed loss before updating (see the sketch after this list). Achieves \(O(\sqrt{T n \log n})\) regret; the worse \(n\) dependence relative to full-information MWU is the price of the limited feedback.
- Hedge. Equivalent formulation with \(w \leftarrow w \cdot e^{-\eta \ell}\). Mathematically identical to MWU up to constants in the regret bound.
- AdaHedge / adaptive \(\eta\). Adaptive learning rate that achieves optimal regret without knowing \(T\) or \(L_T^*\) in advance.
- Solving zero-sum games. Both players run MWU on the row/column experts; the time-averaged strategies \(\bar x_T, \bar y_T\) converge to an \(\epsilon\)-approximate equilibrium for \(T = O(\log(\max(m, n)) / \epsilon^2)\). A constructive proof of the minimax theorem falls out as a corollary.
- LP approximation. For a packing LP, MWU yields a \((1 - \epsilon)\)-approximate solution in time \(\tilde O(N / \epsilon^2)\), where \(N\) is the number of nonzeros in \(A\). Faster than general LP algorithms for packing/covering; basis for fast approximation algorithms in network design and resource allocation.
- AdaBoost. Weak classifiers are the “experts” and the loss vector is the indicator of misclassification. MWU upweights misclassified examples to focus future weak learners on hard cases. The training-error bound is the MWU regret bound dressed up in classification language (Freund and Schapire 1997).
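A minimal sketch of the EXP3 update referenced above (one round of the bandit version; `get_loss` is a hypothetical callback supplying the bandit feedback, `rng` a NumPy `Generator`, and the explicit uniform-exploration mixing used in some presentations is omitted):

```python
import numpy as np

def exp3_round(w, eta, get_loss, rng):
    """One EXP3 round: sample an expert, see only its loss, update via importance weighting."""
    n = len(w)
    p = w / w.sum()
    i = rng.choice(n, p=p)         # play a single sampled expert
    loss_i = get_loss(i)           # bandit feedback: only expert i's loss is revealed
    est = np.zeros(n)
    est[i] = loss_i / p[i]         # unbiased importance-weighted estimate of l_t
    return w * np.exp(-eta * est)  # exponential update on the estimated loss vector
```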