Upper Confidence Bound (UCB)
Motivation
A greedy agent acts on its point estimate \(\hat{\mu}_a\) of each action’s value. That estimate is wrong by an amount that depends on how many times the action has been tried: an arm pulled twice has a much wider plausible range than an arm pulled a thousand times.
Optimism in the face of uncertainty turns this observation into a decision rule. Instead of acting on the point estimate, act on an upper confidence bound \(U_a = \hat{\mu}_a + b_a\), where the bonus \(b_a\) shrinks as \(a\) accumulates pulls. Two arms can be optimistic for different reasons: a high empirical mean, or a wide confidence interval. Either qualifies as plausibly best, so either deserves a pull.
Optimism is a directed exploration strategy: arms whose upper bound is already below the current best are not pulled, even if they are under-sampled. This is the structural advantage over \(\varepsilon\)-greedy. See exploration vs. exploitation for the broader picture.
Problem
A \(K\)-armed stochastic bandit: at each round \(t\), pick an arm \(a_t \in \{1, \ldots, K\}\) and observe an i.i.d. reward \(r_t\) from an unknown distribution with mean \(\mu_{a_t}\) supported in \([0, 1]\). The goal is to minimize regret
\[ \text{Regret}(T) = T \mu^* - \mathbb{E}\!\left[\sum_{t=1}^T r_t\right] \]
where \(\mu^* = \max_a \mu_a\) — the gap between the best possible cumulative reward and what the algorithm actually accumulates.
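To make the definition concrete, here is a minimal sketch that measures the realized regret of an arbitrary (here uniformly random) policy on an assumed three-arm Bernoulli bandit; the means and horizon are illustration-only, and the realized reward sum stands in for the expectation in the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.7, 0.5, 0.2])       # assumed true means, for illustration only
T = 1000

arms = rng.integers(0, len(mu), size=T)      # a uniformly random policy
rewards = rng.binomial(1, mu[arms])          # Bernoulli rewards in {0, 1}
regret = T * mu.max() - rewards.sum()        # realized regret for this single run
print(f"realized regret over T={T} rounds: {regret:.1f}")
```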
Key Ideas
Optimism in the face of uncertainty
The core principle: when an arm could plausibly be best — either because its mean estimate is high or because its confidence interval is wide — treat it as if it is best, and pull it. Each pull tightens the interval, so over time the optimistic illusion either gets confirmed (the arm really is best) or disconfirmed (the interval shrinks and the arm drops in the ranking).
This is fundamentally different from \(\varepsilon\)-greedy’s uniform random exploration. Optimism is directed: an arm whose upper bound sits below the current best is not pulled that round, and it is revisited only once the slowly growing \(\log t\) term lifts its bound back into contention. Suboptimal arms get pulled only until their intervals tighten enough to rule them out.
Confidence bonuses shrink with pulls
The bonus \(b_a = \sqrt{2 \log t / N_a}\) is a confidence-interval radius. It decreases as \(N_a\) grows — the more pulls you’ve made, the narrower the plausible range. The \(\log t\) in the numerator grows slowly: it forces every arm to keep being checked occasionally, even as \(N_a\) piles up.
UCB picks the arm with the highest upper endpoint, not the highest point estimate. Arms pulled often have narrower intervals; arms pulled rarely get a free boost from their wide intervals.
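A quick worked example (round and pull counts chosen only for illustration) shows the scale. At \(t = 10^4\),
\[ \sqrt{\frac{2 \ln 10^4}{2}} \approx 3.0, \qquad \sqrt{\frac{2 \ln 10^4}{1000}} \approx 0.14, \]
so an arm pulled twice carries a bonus larger than the whole \([0, 1]\) reward range, while an arm pulled a thousand times is trusted to within about \(0.14\). And because \(\log t\) only doubles between \(t = 10^4\) and \(t = 10^8\), both bonuses grow by a mere factor of \(\sqrt{2}\) over that stretch.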
Bonus derivation from Hoeffding
The specific form \(\sqrt{2 \log t / N_a}\) comes from Hoeffding’s inequality. For rewards in \([0, 1]\) and \(n\) independent pulls,
\[ \Pr\!\left[ \hat{\mu}_a - \mu_a \le -\sqrt{\tfrac{2 \log t}{n}} \right] \le t^{-4}. \]
So with probability at least \(1 - t^{-4}\) the true mean satisfies \(\mu_a \le \hat{\mu}_a + \sqrt{2 \log t / n}\). The failure probability \(t^{-4}\) is summable over \(t\), so a union bound keeps the total probability that the bound is ever violated small. The \(\log t\) in the numerator is what buys this: it inflates the bonus slowly enough to stay cheap, yet steadily enough to force renewed exploration of any arm whose pull count has fallen far behind \(\log t\).
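The exponent is just Hoeffding’s bound \(\Pr[\hat{\mu}_a - \mu_a \le -\varepsilon] \le e^{-2n\varepsilon^2}\) evaluated at the chosen radius:
\[ \varepsilon = \sqrt{\frac{2 \log t}{n}} \quad\Longrightarrow\quad e^{-2n\varepsilon^2} = e^{-4 \log t} = t^{-4}. \]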
Algorithm
For a \(K\)-armed stochastic bandit with rewards in \([0, 1]\), the UCB1 rule (Auer et al. 2002) is
\[ a_t = \operatorname*{arg\,max}_a \left[ \hat{\mu}_a(t-1) + \sqrt{\frac{2 \log t}{N_a(t-1)}} \right], \]
where \(\hat{\mu}_a(t-1)\) is the empirical mean reward of arm \(a\) from its first \(N_a(t-1)\) pulls. Each arm is pulled once at the start so the bonus is well-defined.
# Initialization
for each arm a: pull a once, set μ̂_a = reward, N_a = 1

# Main loop
for t = K+1, K+2, ...:
    a_t = argmax_a (μ̂_a + sqrt(2 log t / N_a))
    pull a_t, observe r
    N_{a_t} += 1
    μ̂_{a_t} = μ̂_{a_t} + (r - μ̂_{a_t}) / N_{a_t}   # incremental mean update
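The same loop, written as a runnable sketch. The Bernoulli arms and their means are assumptions chosen purely for demonstration; `ucb1` itself is just the rule above.

```python
import math
import random

def ucb1(pull, K, T):
    """Run UCB1 for T rounds on a K-armed bandit; pull(a) returns a reward in [0, 1]."""
    counts = [0] * K      # N_a: number of pulls of each arm
    means = [0.0] * K     # empirical mean reward of each arm
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1     # initialization: pull every arm once
        else:
            a = max(range(K),
                    key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental mean update
    return means, counts

# Demo on assumed Bernoulli arms (true means chosen for illustration only).
true_means = [0.7, 0.5, 0.2]
means, counts = ucb1(lambda a: float(random.random() < true_means[a]), K=3, T=5000)
print("estimates:", [round(m, 3) for m in means])
print("pull counts:", counts)
```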
Walkthrough
Three-arm bandit, first six rounds
Three arms with true means \(\mu_A = 0.7\), \(\mu_B = 0.5\), \(\mu_C = 0.2\). Each arm pulled once for initialization in rounds 1-3.
Suppose initial pulls give: arm \(A\) rewards \(0.4\), arm \(B\) rewards \(0.6\), arm \(C\) rewards \(0.5\). So after round \(3\): \(\hat{\mu} = (0.4, 0.6, 0.5)\), \(N = (1, 1, 1)\).
| Round | \(\hat{\mu} + b\) for \(A\) | \(\hat{\mu} + b\) for \(B\) | \(\hat{\mu} + b\) for \(C\) | Pick | Reward | New state |
|---|---|---|---|---|---|---|
| 4 | \(0.4 + \sqrt{2 \ln 4 / 1} \approx 2.06\) | \(0.6 + 1.67 \approx 2.27\) | \(0.5 + 1.67 \approx 2.17\) | \(B\) | \(0.7\) | \(\hat{\mu}_B = 0.65, N_B = 2\) |
| 5 | \(0.4 + \sqrt{2 \ln 5} \approx 2.19\) | \(0.65 + \sqrt{2 \ln 5 / 2} \approx 1.92\) | \(0.5 + 1.79 \approx 2.29\) | \(C\) | \(0.0\) | \(\hat{\mu}_C = 0.25, N_C = 2\) |
| 6 | \(0.4 + \sqrt{2 \ln 6} \approx 2.29\) | \(0.65 + \sqrt{2 \ln 6 / 2} \approx 1.99\) | \(0.25 + \sqrt{2 \ln 6 / 2} \approx 1.59\) | \(A\) | \(0.9\) | \(\hat{\mu}_A = 0.65, N_A = 2\) |
By round 6, every arm has been pulled at least twice and the algorithm is starting to weigh its estimates against their uncertainties. After many more rounds, \(\hat{\mu}_A\) will stabilize near \(0.7\), while \(B\) and \(C\) are pulled only often enough to keep their shrinking intervals ruled out: the arm with the highest true mean gets pulled most often, exactly as the regret bound predicts.
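The table can be replayed in a few lines. The snippet below simply feeds in the three rewards observed in rounds 4-6, so it is a check of the arithmetic rather than a simulation.

```python
import math

mu_hat = {"A": 0.4, "B": 0.6, "C": 0.5}    # state after the three initialization pulls
N = {"A": 1, "B": 1, "C": 1}
observed = {4: 0.7, 5: 0.0, 6: 0.9}         # rewards seen in rounds 4-6 of the walkthrough

for t in range(4, 7):
    scores = {a: mu_hat[a] + math.sqrt(2 * math.log(t) / N[a]) for a in mu_hat}
    pick = max(scores, key=scores.get)      # arm with the highest upper bound
    r = observed[t]
    N[pick] += 1
    mu_hat[pick] += (r - mu_hat[pick]) / N[pick]
    print(t, {a: round(s, 2) for a, s in scores.items()}, "->", pick, "reward", r)
```

It prints the same picks (\(B\), \(C\), \(A\)) and the same scores as the table.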
Correctness
UCB1 achieves regret
\[ \text{Regret}(T) = O\!\left( \sum_{a : \Delta_a > 0} \frac{\log T}{\Delta_a} \right), \]
where \(\Delta_a = \mu^* - \mu_a\) is the gap of arm \(a\) from the optimum. This matches the Lai-Robbins lower bound up to constants (proof).
The argument is direct. By Hoeffding, with high probability the optimal arm’s upper bound never drops below \(\mu^*\), and no arm’s empirical mean exceeds its true mean by more than its bonus. A suboptimal arm \(a\) can therefore be pulled only while its bonus is large enough for it to appear better than the optimum:
\[ \sqrt{\frac{2 \log t}{N_a(t-1)}} \ge \frac{\Delta_a}{2}. \]
This forces \(N_a(T) = O(\log T / \Delta_a^2)\) pulls — the regret bound follows by summing \(\Delta_a \cdot N_a(T)\).
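Squaring the pull condition makes the count bound explicit:
\[ \sqrt{\frac{2 \log t}{N_a(t-1)}} \ge \frac{\Delta_a}{2} \iff N_a(t-1) \le \frac{8 \log t}{\Delta_a^2}, \]
so \(N_a(T) \le 8 \log T / \Delta_a^2 + O(1)\), and since each such pull costs \(\Delta_a\) in regret, summing over suboptimal arms gives \(\sum_{a : \Delta_a > 0} 8 \log T / \Delta_a\) up to additive constants.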
Complexity and Tradeoffs
Per-round cost: \(O(K)\) to compute the argmax over \(K\) arms; the \(\log t\) term changes every arm’s index each round, so the scan cannot simply be replaced by a priority queue. Memory is \(O(K)\) for the per-arm means and counts.
Limitations
- Distribution assumptions. UCB1’s \(\sqrt{2 \log t / n}\) bonus relies on bounded rewards; sub-Gaussian or heavy-tailed rewards need a different bonus.
- Constant tuning. The \(2\) in the bonus is a worst-case constant. In practice it is treated as a hyperparameter, especially in UCT.
- State-aggregation. In large or continuous state spaces, per-state-action counts \(N_a(s)\) are sparse; pseudo-counts and learned uncertainty estimates extend the principle to deep RL.
- Bayesian alternative often wins. Thompson sampling typically outperforms UCB1 in practice, despite the two having matching asymptotic regret guarantees.
When to Use It
| Situation | Approach |
|---|---|
| Stochastic bandit, bounded rewards, frequentist guarantees needed | UCB1. |
| Bernoulli rewards specifically | KL-UCB (tighter constants) — see Variants. |
| Contextual bandit (features per arm) | LinUCB — see Variants. |
| Monte Carlo tree search at every node | UCT — see Variants. |
| Posterior available, Bayesian setting | Thompson sampling — usually empirically stronger. |
| Non-stationary arms | UCB with a sliding window, or discounted UCB. |
| Pure exploration (find the best arm with confidence) | Best-arm-identification algorithms (LUCB, Successive Elimination). |
Variants
- UCB-V (Audibert et al. 2009). Uses an empirical-variance-aware bonus \(\sqrt{2 \hat{\sigma}_a^2 \log t / N_a} + 3 \log t / N_a\). Tightens regret when reward variances differ across arms.
- KL-UCB (Garivier and Cappé 2011). Replaces the Hoeffding bonus with a confidence interval based on KL divergence. Achieves the exact Lai-Robbins constant for Bernoulli rewards.
- LinUCB (Li et al. 2010). Contextual bandit version: maintains a linear model \(\mu_a(x) = x^\top \theta_a\) and uses an ellipsoidal confidence set on \(\theta_a\). Standard tool for online recommendation.
- Bayes-UCB (Kaufmann et al. 2012). Replaces the frequentist bound with a high quantile of the posterior.
- UCT (UCB applied to tree search) (Kocsis and Szepesvári 2006). Monte Carlo tree search treats every internal node as its own bandit problem; UCT applies UCB1 (with a tuning constant) at each node: \(a = \arg\max_c [\hat{V}(c) + C \sqrt{\log N(s) / N(c)}]\). Under standard assumptions, value estimates at the root converge to the minimax value as the simulation count grows (proof). UCT is the exploration backbone of AlphaGo and AlphaZero.
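A minimal sketch of the selection step UCT performs at a single node, using the formula above; the Node layout and the exploration constant \(C = 1.4\) are assumptions for illustration, not the AlphaGo/AlphaZero implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    value_sum: float = 0.0                            # sum of simulation returns through this node
    visits: int = 0                                   # N(c)
    children: dict = field(default_factory=dict)      # action -> child Node

def uct_select(node: Node, c: float = 1.4):
    """Return the action maximizing V̂(child) + c * sqrt(log N(s) / N(child))."""
    n_parent = sum(child.visits for child in node.children.values())   # N(s)
    def score(child: Node) -> float:
        if child.visits == 0:
            return float("inf")                       # unvisited children are tried first
        v_hat = child.value_sum / child.visits        # empirical value estimate
        return v_hat + c * math.sqrt(math.log(n_parent) / child.visits)
    return max(node.children, key=lambda a: score(node.children[a]))
```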