UCB1 Achieves Logarithmic Regret
Claim
Consider a \(K\)-armed bandit with rewards in \([0, 1]\), means \(\mu_1, \ldots, \mu_K\), optimal mean \(\mu^* = \mu_{a^*}\), and gaps \(\Delta_a = \mu^* - \mu_a\). The UCB1 algorithm pulls
\[ a_t = \operatorname*{arg\,max}_a \left[ \hat{\mu}_a(t-1) + \sqrt{\frac{2 \log t}{N_a(t-1)}} \right] \]
at each round \(t > K\) (after pulling each arm once). Then for every \(T \ge K\),
\[ \mathbb{E}[N_a(T)] \le \frac{8 \log T}{\Delta_a^2} + 1 + \frac{\pi^2}{3} \]
for every suboptimal arm \(a\), and consequently
\[ \text{Regret}(T) = \sum_{a : \Delta_a > 0} \Delta_a \, \mathbb{E}[N_a(T)] \le \sum_{a : \Delta_a > 0} \frac{8 \log T}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right) \sum_{a : \Delta_a > 0} \Delta_a. \]
This matches the \(\Omega(\log T)\) rate of the Lai-Robbins lower bound, though not its constant (Auer et al. 2002).
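To ground the index rule, here is a minimal NumPy sketch of UCB1 on Bernoulli arms. The function name, the demo instance, and the seed are illustrative choices, not part of the claim.

```python
import numpy as np

def ucb1(means, T, rng=None):
    """Run UCB1 on a Bernoulli bandit with the given arm means for T rounds.

    Returns the per-arm pull counts N_a(T). A minimal sketch of the index
    rule in the Claim; assumes T >= K and rewards in {0, 1}.
    """
    rng = np.random.default_rng(rng)
    K = len(means)
    counts = np.zeros(K, dtype=int)      # N_a(t-1)
    sums = np.zeros(K)                   # running reward sums
    # Initialization: pull each arm once (rounds t = 1..K).
    for a in range(K):
        sums[a] += rng.random() < means[a]
        counts[a] += 1
    # Main loop: rounds t = K+1..T, pull the arm maximizing the UCB index.
    for t in range(K + 1, T + 1):
        index = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
        a = int(np.argmax(index))
        sums[a] += rng.random() < means[a]
        counts[a] += 1
    return counts

# Demo: two-armed instance with gap 0.2.
pulls = ucb1([0.5, 0.3], T=10_000, rng=0)
print(pulls)  # the suboptimal arm's count should grow only logarithmically
```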
Setup
Write \(U_a(t) = \hat{\mu}_a(t-1) + \sqrt{2 \log t / N_a(t-1)}\) for the UCB index of arm \(a\) at round \(t\). Let \(a^*\) be any optimal arm and \(a\) a suboptimal arm. Fix an integer \(\ell\) to be chosen later.
For UCB1 to pull arm \(a\) at round \(t\), it must hold that \(U_a(t) \ge U_{a^*}(t)\). If, furthermore, arm \(a\) has already been pulled at least \(\ell\) times, then at least one of the following events occurs:
- (A) \(U_{a^*}(t) \le \mu^*\), i.e., the optimal arm’s UCB underestimates the truth, or
- (B) \(\hat{\mu}_a(t-1) \ge \mu_a + \sqrt{2 \log t / N_a(t-1)}\), i.e., arm \(a\)’s empirical mean overestimates by more than its bonus, or
- (C) \(\mu^* < \mu_a + 2 \sqrt{2 \log t / N_a(t-1)}\), i.e., arm \(a\)’s confidence interval is still wider than the gap.
The decomposition follows by taking the contrapositive of the pull condition \(U_a(t) \ge U_{a^*}(t)\).
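In detail, if none of (A), (B), (C) holds at round \(t\), then
\[ U_a(t) = \hat{\mu}_a(t-1) + \sqrt{\frac{2 \log t}{N_a(t-1)}} < \mu_a + 2\sqrt{\frac{2 \log t}{N_a(t-1)}} \le \mu^* < U_{a^*}(t), \]
where the three inequalities are \(\neg\)(B), \(\neg\)(C), and \(\neg\)(A) in turn; this contradicts the pull condition, so arm \(a\) is not pulled.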
Step 1: Concentration of (A) and (B)
By Hoeffding’s inequality, for any fixed \(n\),
\[ \Pr\!\left[ \hat{\mu}_a^{(n)} - \mu_a \ge \sqrt{\tfrac{2 \log t}{n}} \right] \le t^{-4}, \]
where \(\hat{\mu}_a^{(n)}\) denotes the empirical mean from \(n\) pulls. A union bound over \(n \in \{1, \ldots, t\}\) gives
\[ \Pr[\text{event (B) at round } t] \le t \cdot t^{-4} = t^{-3}, \]
and the symmetric lower-tail bound, applied to the optimal arm, gives the same for event (A). Summing over \(t \ge 1\):
\[ \sum_{t=1}^{T} \left( \Pr[\text{(A) at } t] + \Pr[\text{(B) at } t] \right) \le 2 \sum_{t=1}^{\infty} t^{-3} = 2\zeta(3) \le \frac{\pi^2}{3}. \]
This is the source of the \(\pi^2/3\) constant: the classical proof union-bounds over both arms' pull counts and obtains \(2 \sum_{t \ge 1} t^{-2} = \pi^2/3\) exactly, and the slightly tighter count here is conventionally rounded up to the same value.
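As a numerical sanity check of this step: the proof's radius \(\sqrt{2 \log t / n}\) pushes the tail below \(t^{-4}\), which is too rare to resolve by simulation, so the sketch below checks Hoeffding's inequality \(\Pr[\hat{\mu} - \mu \ge \varepsilon] \le e^{-2n\varepsilon^2}\) at a moderate radius instead. The distribution, \(\varepsilon\), and sample sizes are illustrative.

```python
import numpy as np

# Sanity-check Hoeffding for rewards in [0, 1]:
#   Pr[mu_hat - mu >= eps] <= exp(-2 n eps^2).
rng = np.random.default_rng(0)
mu, eps, trials = 0.5, 0.1, 100_000
for n in [50, 100, 200, 400]:
    mu_hat = (rng.random((trials, n)) < mu).mean(axis=1)   # Bernoulli(mu) samples
    emp = np.mean(mu_hat - mu >= eps)
    print(f"n={n:3d}  empirical={emp:.4f}  Hoeffding bound={np.exp(-2*n*eps**2):.4f}")
```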
Step 2: Choosing \(\ell\) to rule out (C)
Event (C) says
\[ \Delta_a < 2 \sqrt{\frac{2 \log t}{N_a(t-1)}} \iff N_a(t-1) < \frac{8 \log t}{\Delta_a^2}. \]
Choose
\[ \ell = \left\lceil \frac{8 \log T}{\Delta_a^2} \right\rceil. \]
Then for any \(t \le T\) with \(N_a(t-1) \ge \ell\), event (C) cannot occur, since \(N_a(t-1) \ge \ell \ge 8 \log T / \Delta_a^2 \ge 8 \log t / \Delta_a^2\). Once arm \(a\) has been pulled \(\ell\) times, the only way it can be pulled again at some \(t \le T\) is via event (A) or (B).
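For a sense of scale (an illustrative instance, with the natural logarithm): with \(\Delta_a = 0.1\) and \(T = 10^4\),
\[ \ell = \left\lceil \frac{8 \log 10^4}{0.1^2} \right\rceil = \left\lceil \frac{8 \times 9.21\ldots}{0.01} \right\rceil = 7369, \]
so for short horizons the bound is dominated by this burn-in term and only becomes informative once \(T \gg 8 \log T / \Delta_a^2\).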
Step 3: Combine
The total number of pulls of arm \(a\) up to time \(T\) is
\[ N_a(T) \le \ell + \sum_{t=1}^{T} \mathbb{1}\!\left[ \text{(A) or (B) at } t \right]. \]
Taking expectations, using Step 1, and noting \(\ell \le 8 \log T / \Delta_a^2 + 1\):
\[ \mathbb{E}[N_a(T)] \le \frac{8 \log T}{\Delta_a^2} + 1 + \frac{\pi^2}{3}. \]
Multiplying by \(\Delta_a\) and summing over suboptimal arms gives the regret bound.
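To see the bound in action, here is a sketch that replays the UCB1 loop on a two-armed Bernoulli instance and compares the suboptimal arm's average pull count to the bound above. The instance, seed, and replication count are arbitrary choices.

```python
import numpy as np

# Compare a suboptimal arm's empirical E[N_a(T)] to the bound
# 8 log T / Delta^2 + 1 + pi^2/3. Repeats the ucb1() loop from the
# Claim section so the snippet is self-contained.
def suboptimal_pulls(means, T, rng):
    K = len(means)
    counts = np.zeros(K, dtype=int)
    sums = np.zeros(K)
    for a in range(K):                      # initialization: one pull each
        sums[a] += rng.random() < means[a]
        counts[a] += 1
    for t in range(K + 1, T + 1):           # UCB1 index rule
        a = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
        sums[a] += rng.random() < means[a]
        counts[a] += 1
    return counts[1]                        # arm 1 is the suboptimal arm

rng = np.random.default_rng(0)
means, T, reps = [0.5, 0.3], 10_000, 50
delta = means[0] - means[1]
empirical = np.mean([suboptimal_pulls(means, T, rng) for _ in range(reps)])
bound = 8 * np.log(T) / delta**2 + 1 + np.pi**2 / 3
print(f"empirical E[N_a(T)] ~ {empirical:.0f}, bound ~ {bound:.0f}")
```

On instances like this one, the empirical count typically sits well below the bound, reflecting the slack in the union-bound constants.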
Remarks
- The constant \(8\) in \(8 \log T / \Delta_a^2\) comes from the squared bonus radius \((2 \sqrt{2 \log t / n})^2 = 8 \log t / n\). Tuned UCB variants and KL-UCB shrink this constant; KL-UCB attains the exact Lai-Robbins constant \(\Delta_a / \text{KL}(\mu_a, \mu^*)\) for Bernoulli rewards (Garivier and Cappé 2011). A minimal index sketch appears after these remarks.
- The argument is robust to mild changes in the bonus, e.g., \(\sqrt{(\alpha \log t)/n}\) for any \(\alpha > 1\), with the regret constants depending on \(\alpha\).
- The same proof structure applies to many derivatives — UCB-V, LinUCB, and the per-node analysis inside UCT — each replacing Hoeffding with the appropriate concentration inequality for their setting.
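For reference, here is a minimal sketch of the Bernoulli KL-UCB index mentioned in the first remark, computed by bisection. It drops the \(\log \log t\) refinement of Garivier and Cappé (2011), and the helper names and tolerances are ours.

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL(Bern(p) || Bern(q)), with the 0 log 0 = 0 convention."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * np.log(p / q)
    if p < 1:
        out += (1 - p) * np.log((1 - p) / (1 - q))
    return out

def kl_ucb_index(mu_hat, n, t, iters=30):
    """Largest q >= mu_hat with n * KL(mu_hat, q) <= log t, by bisection.

    Sketch of the Bernoulli KL-UCB index; the log log t term of
    Garivier and Cappe (2011) is omitted for simplicity.
    """
    target = np.log(t) / n
    lo, hi = mu_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mu_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

# The KL index is tighter than UCB1's for the same statistics:
print(kl_ucb_index(0.3, n=100, t=1000))        # ~0.48
print(0.3 + np.sqrt(2 * np.log(1000) / 100))   # ~0.67 (UCB1 index)
```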