Multi-Armed Bandits
Motivation
A doctor must choose among several treatments without knowing which is most effective. A website must pick which ad to show without knowing which gets the most clicks. In both cases the agent learns about the options only by trying them. The multi-armed bandit problem (Lattimore and Szepesvári 2020) is the cleanest mathematical formalization of this exploration–exploitation tension: there is no notion of state, no transitions, and no long-term consequences — just repeated choices among unknown reward distributions.
Bandits are the simplest member of the sequential decision-making family. Stripping away state dynamics isolates one fundamental question: how should an agent allocate trials between gathering information and exploiting what it already knows?
Setup
Figure: three arms as reward distributions. A bandit hides one reward distribution behind each arm; the learner sees only samples from the arms it pulls, so early sample means can be misleading.
A \(K\)-armed bandit consists of \(K\) actions (“arms”) \(a \in \{1, \ldots, K\}\), each associated with an unknown reward distribution with mean \(\mu_a\). At each round \(t = 1, 2, \ldots, T\):
- The agent selects an arm \(a_t\).
- The environment returns a reward \(r_t\) drawn independently from the distribution of arm \(a_t\).
The optimal mean is \(\mu^* = \max_a \mu_a\) and the optimal arm is any \(a^* \in \arg\max_a \mu_a\).
This is a Markov decision process with a single state, an action set of size \(|A| = K\), and no discounting (or, equivalently, a finite horizon \(T\)).
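A minimal simulation of this setup helps fix notation. The sketch below assumes Gaussian reward distributions; the class name, means, and seed are illustrative choices, not part of the formal definition.

```python
import numpy as np

class GaussianBandit:
    """K-armed bandit with Gaussian rewards; the means are hidden from the learner."""

    def __init__(self, means, std=1.0, seed=0):
        self.means = np.asarray(means, dtype=float)  # the unknown mu_a
        self.std = std
        self.rng = np.random.default_rng(seed)

    @property
    def k(self):
        return len(self.means)

    def pull(self, arm):
        """Return a reward r_t drawn independently from the pulled arm's distribution."""
        return self.rng.normal(self.means[arm], self.std)

# Example: three arms with means 0.1, 0.5, 0.3, so mu* = 0.5 and a* = 1.
bandit = GaussianBandit(means=[0.1, 0.5, 0.3])
reward = bandit.pull(2)  # one sample from arm 2; the agent never observes mu_2 directly
```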
Regret
Because the agent does not know the means in advance, total reward is a poor objective — it conflates the difficulty of the problem with the agent’s skill. Instead, performance is measured by regret, the gap between what the agent earned and what an oracle who knew \(\mu^*\) would have earned:
\[ \text{Regret}(T) = T \mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right] = \sum_{a=1}^{K} \Delta_a \, \mathbb{E}[N_a(T)], \]
where \(\Delta_a = \mu^* - \mu_a\) is the suboptimality gap of arm \(a\) and \(N_a(T)\) is the number of times arm \(a\) was pulled. Regret accumulates only on suboptimal pulls, weighted by how suboptimal they are.
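As a quick illustration of the gap decomposition, the sketch below computes \(\sum_a \Delta_a N_a(T)\) for the illustrative three-arm instance above, with hypothetical pull counts.

```python
import numpy as np

def regret_from_counts(means, counts):
    """Gap decomposition of regret: sum_a Delta_a * N_a(T)."""
    means = np.asarray(means, dtype=float)
    counts = np.asarray(counts, dtype=float)
    gaps = means.max() - means          # Delta_a = mu* - mu_a
    return float(np.dot(gaps, counts))  # only suboptimal pulls contribute

# Hypothetical counts after T = 1000 rounds on means [0.1, 0.5, 0.3]:
counts = [50, 900, 50]
print(regret_from_counts([0.1, 0.5, 0.3], counts))  # 50*0.4 + 50*0.2 = 30.0
```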
Two qualitative regimes:
- Linear regret \(\Theta(T)\): the agent keeps pulling a suboptimal arm a constant fraction of the time. This is the failure mode.
- Sublinear regret \(o(T)\): the per-round mistake rate vanishes; the agent eventually concentrates on the best arm. This is the goal.
For \(K\)-armed stochastic bandits, the optimal worst-case regret rate is \(\Theta(\sqrt{KT})\), and the optimal instance-dependent rate is \(\Theta\!\left(\sum_{a : \Delta_a > 0} \frac{\log T}{\Delta_a}\right)\), up to distribution-dependent constants (Lai and Robbins 1985).
Exploration vs. Exploitation
The agent faces a fundamental tradeoff:
- Exploit: pull the arm with the highest estimated mean \(\hat{\mu}_a\). Maximizes immediate expected reward given current belief.
- Explore: pull an arm with high uncertainty. Refines the estimates so future exploitation is more likely to be correct.
Pure exploitation can lock onto a suboptimal arm forever if early samples were unlucky. Pure exploration spreads pulls evenly and never concentrates on the best arm. Good algorithms interleave the two so that exploration vanishes in the limit but never too quickly.
Algorithms
\(\varepsilon\)-greedy
With probability \(1 - \varepsilon\) pull the arm with the highest current empirical mean; otherwise pull a uniformly random arm.
- Constant \(\varepsilon\): linear regret \(\Theta(\varepsilon T)\) — keeps exploring forever at a fixed rate.
- Decaying \(\varepsilon_t = \min(1, c K / t)\): logarithmic regret \(O(K \log T)\) for appropriate \(c\).
Simple, hyperparameter-sensitive, and the standard baseline.
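A minimal sketch of the decaying-schedule variant, reusing the illustrative GaussianBandit environment from the setup section (the constant \(c\) and the seed are arbitrary):

```python
import numpy as np

def epsilon_greedy(bandit, T, c=1.0, seed=0):
    """Epsilon-greedy with the decaying schedule eps_t = min(1, c*K/t)."""
    rng = np.random.default_rng(seed)
    K = bandit.k
    counts = np.zeros(K)       # N_a(t), pulls of each arm so far
    mu_hat = np.zeros(K)       # empirical means
    for t in range(1, T + 1):
        eps = min(1.0, c * K / t)
        if rng.random() < eps:
            arm = int(rng.integers(K))        # explore: uniformly random arm
        else:
            arm = int(np.argmax(mu_hat))      # exploit: best empirical mean
        r = bandit.pull(arm)
        counts[arm] += 1
        mu_hat[arm] += (r - mu_hat[arm]) / counts[arm]  # incremental mean update
    return mu_hat, counts

mu_hat, counts = epsilon_greedy(bandit, T=10_000)  # counts should concentrate on arm 1
```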
Upper Confidence Bound (UCB)
For each arm, maintain the empirical mean \(\hat{\mu}_a\) and pull count \(N_a(t)\). At each round pull
\[ a_t = \operatorname*{arg\,max}_{a} \left[ \hat{\mu}_a + \sqrt{\frac{2 \log t}{N_a(t)}} \right]. \]
The bonus term is an upper confidence bound on \(\mu_a\) — high for under-pulled or genuinely promising arms — implementing the principle of optimism in the face of uncertainty. UCB1 matches the instance-dependent \(O\!\left(\sum_{a : \Delta_a > 0} \frac{\log T}{\Delta_a}\right)\) regret rate.
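The same rule in code, again against the illustrative GaussianBandit sketch from the setup section; each arm is pulled once up front so the bonus is well defined:

```python
import numpy as np

def ucb1(bandit, T):
    """UCB1: after one pull per arm, maximize mu_hat_a + sqrt(2 log t / N_a)."""
    K = bandit.k
    counts = np.ones(K)                        # each arm pulled once to initialize
    mu_hat = np.array([bandit.pull(a) for a in range(K)])
    for t in range(K + 1, T + 1):
        bonus = np.sqrt(2.0 * np.log(t) / counts)
        arm = int(np.argmax(mu_hat + bonus))   # optimism in the face of uncertainty
        r = bandit.pull(arm)
        counts[arm] += 1
        mu_hat[arm] += (r - mu_hat[arm]) / counts[arm]
    return mu_hat, counts
```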
Thompson Sampling
Maintain a posterior over each \(\mu_a\) (e.g., Beta posteriors for Bernoulli rewards). At each round:
- Sample \(\tilde{\mu}_a\) from the posterior of each arm.
- Pull \(a_t = \arg\max_a \tilde{\mu}_a\).
- Update the posterior of \(a_t\) using the observed reward.
Thompson sampling matches the optimal logarithmic regret rate, often performs better than UCB in practice, and extends naturally to richer settings (contextual, structured) by replacing the per-arm posterior with a posterior over a parameterized model.
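For Bernoulli rewards with Beta priors the posterior update is conjugate, so the whole loop is a few lines. A sketch, where the click-through probabilities and the reward_fn interface are illustrative assumptions:

```python
import numpy as np

def thompson_bernoulli(reward_fn, K, T, seed=0):
    """Thompson sampling for Bernoulli rewards with independent Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(K)   # 1 + observed successes per arm
    beta = np.ones(K)    # 1 + observed failures per arm
    for _ in range(T):
        theta = rng.beta(alpha, beta)    # one posterior sample per arm
        arm = int(np.argmax(theta))      # act greedily on the sampled means
        r = reward_fn(arm)               # observed reward in {0, 1}
        alpha[arm] += r                  # conjugate update for the pulled arm only
        beta[arm] += 1 - r
    return alpha, beta

# Example: three ads with hypothetical click rates 0.03, 0.05, 0.02.
p = [0.03, 0.05, 0.02]
env_rng = np.random.default_rng(1)
alpha, beta = thompson_bernoulli(lambda a: env_rng.binomial(1, p[a]), K=3, T=10_000)
```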
Variants
- Adversarial bandits. Rewards are chosen by an adversary rather than drawn i.i.d. The Exp3 algorithm achieves \(O(\sqrt{KT \log K})\) regret without distributional assumptions.
- Contextual bandits. Before each pull the agent observes a context \(x_t\); rewards depend on \((a, x)\). The agent learns a function from contexts to actions. This is the setting underlying personalized recommendation and online ad selection.
- Best-arm identification. The objective is to identify \(a^*\) with high probability using as few samples as possible, regardless of regret accumulated during exploration. Used for A/B testing.
- Structured bandits. Linear, combinatorial, or Lipschitz reward structure across arms allows pulls of one arm to inform estimates of others, breaking the \(\Omega(K)\) barrier.
Relationship to MDPs and Reinforcement Learning
A bandit is a Markov decision process with a single state. Once states and transitions are introduced, the agent must additionally reason about how today’s action affects tomorrow’s situation, not just tomorrow’s reward — this is the full reinforcement learning problem. The exploration–exploitation tradeoff studied in bandits remains central, but now interacts with credit assignment over time.