Thompson Sampling Achieves Logarithmic Regret
Claim
Consider a \(K\)-armed Bernoulli bandit with means \(\mu_1, \ldots, \mu_K\), optimal mean \(\mu^*\), and gaps \(\Delta_a = \mu^* - \mu_a\). Run Thompson sampling with independent \(\text{Beta}(1, 1)\) priors. Then for every \(\epsilon > 0\),
\[ \mathbb{E}[N_a(T)] \le \frac{(1 + \epsilon) \log T}{\text{KL}(\mu_a, \mu^*)} + O\!\left( \frac{1}{\epsilon^2} \right), \]
so that
\[ \limsup_{T \to \infty} \frac{\text{Regret}(T)}{\log T} \le \sum_{a : \Delta_a > 0} \frac{\Delta_a}{\text{KL}(\mu_a, \mu^*)}. \]
This matches the Lai-Robbins lower bound exactly. The result is due to Kaufmann, Korda, and Munos (Kaufmann et al. 2012); an alternative analysis with weaker constants but a simpler proof is due to Agrawal and Goyal (Agrawal and Goyal 2012).
Setup
At round \(t\), Thompson sampling draws \(\tilde{\mu}_a \sim \text{Beta}(\alpha_a(t), \beta_a(t))\) for each arm, where \(\alpha_a(t) = 1 + S_a(t)\) and \(\beta_a(t) = 1 + N_a(t) - S_a(t)\) with \(S_a(t) = \sum_{s < t : a_s = a} r_s\) the number of successes observed on arm \(a\) before round \(t\). The arm pulled is \(a_t = \arg\max_a \tilde{\mu}_a\).
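For concreteness, here is a minimal Python sketch of this sampler. The arm means, horizon, and seed are illustrative choices of mine, not part of the source:

```python
import numpy as np

def thompson_sampling(means, T, rng=None):
    """Beta-Bernoulli Thompson sampling; returns per-arm pull counts.

    `means` holds the unknown Bernoulli parameters (hypothetical test
    values here); the algorithm itself only sees the sampled rewards.
    """
    rng = rng or np.random.default_rng(0)
    K = len(means)
    alpha = np.ones(K)  # Beta(1, 1) priors: alpha = 1 + successes
    beta = np.ones(K)   # beta = 1 + failures
    pulls = np.zeros(K, dtype=int)
    for _ in range(T):
        samples = rng.beta(alpha, beta)   # one posterior draw per arm
        a = int(np.argmax(samples))       # a_t = argmax_a tilde_mu_a
        r = rng.binomial(1, means[a])     # Bernoulli reward
        alpha[a] += r
        beta[a] += 1 - r
        pulls[a] += 1
    return pulls

# e.g. pulls = thompson_sampling([0.5, 0.4, 0.3], T=10_000)
```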
Let \(a^*\) be optimal and fix a suboptimal arm \(a\). The strategy is to bound \(\mathbb{E}[N_a(T)]\) by separately controlling two failure modes, made precise in the decomposition after this list:
- The posterior of arm \(a^*\) underestimates \(\mu^*\), allowing a suboptimal \(\tilde{\mu}_a\) to win.
- The posterior of arm \(a\) overestimates \(\mu_a\) enough to win against a well-calibrated \(a^*\).
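Writing out this split formally (a standard union-bound decomposition, stated here for completeness, with \(\mathcal{E}_t\) the optimism event defined in Step 2):
\[ \mathbb{E}[N_a(T)] \;\le\; \sum_{t=1}^{T} \Pr[\mathcal{E}_t^c] \;+\; \sum_{t=1}^{T} \Pr[a_t = a,\; \mathcal{E}_t]. \]
Step 2 bounds the first sum and Step 3 the second.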
Step 1: A Posterior Concentration Bound
For Beta posteriors with \(n\) observations, the standard concentration inequality is
\[ \Pr\!\left[ \tilde{\mu} \ge \mu + x \mid n, S \right] \le \exp(-n \cdot \text{KL}(\hat{\mu}, \mu + x)) \]
for \(\hat{\mu} = S / n \le \mu + x\). The KL on the right is the Bernoulli KL divergence. Symmetric bounds hold for the lower tail.
This is the Beta-posterior analogue of the Chernoff-Hoeffding bound in its tight KL form, and it is the source of the exact Lai-Robbins constant rather than the loose \(1/\Delta_a^2\) constant in UCB1.
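As a sanity check, the bound can be compared against the exact Beta tail by Monte Carlo. A minimal sketch, where the counts \(n\), \(S\), and the threshold are made-up illustrative values:

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), clipped to avoid log(0)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rng = np.random.default_rng(0)
n, S = 200, 90                 # hypothetical pulls and successes
threshold = 0.55               # the value mu + x, with S/n = 0.45 <= 0.55

# Exact posterior tail, estimated by Monte Carlo.
samples = rng.beta(1 + S, 1 + n - S, size=1_000_000)
empirical = np.mean(samples >= threshold)

# The Chernoff-style bound from Step 1.
bound = np.exp(-n * bernoulli_kl(S / n, threshold))
print(f"empirical tail {empirical:.4f} <= bound {bound:.4f}")
```

On these numbers the empirical tail comes out near 0.002 against a bound of about 0.018, so the inequality holds with room to spare.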
Step 2: Posterior of \(a^*\) is Almost Always Optimistic
Define the event
\[ \mathcal{E}_t = \{ \tilde{\mu}_{a^*}(t) \ge \mu^* - \delta \} \]
for a small \(\delta > 0\). Using Step 1 and a careful tracking of how often \(a^*\) has been pulled, one shows
\[ \sum_{t=1}^{T} \Pr[\mathcal{E}_t^c] = O(\log T) \]
with a constant that depends on \(\delta\) but not on \(T\). The key technical ingredient is that \(a^*\) gets pulled sufficiently often even though Thompson sampling has no explicit exploration bonus: the randomness of the posterior samples itself drives enough pulls.
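An empirical illustration of this claim (not part of the original analysis; the instance, \(\delta\), and horizons are hypothetical): run the sampler and count the rounds on which the optimal arm's posterior sample falls below \(\mu^* - \delta\).

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([0.6, 0.4])   # hypothetical two-arm instance; arm 0 optimal
mu_star, delta = means.max(), 0.05

def pessimistic_rounds(T):
    """Count rounds t <= T on which tilde_mu_{a*}(t) < mu* - delta (E_t^c)."""
    alpha, beta = np.ones(2), np.ones(2)
    bad = 0
    for _ in range(T):
        samples = rng.beta(alpha, beta)
        bad += samples[0] < mu_star - delta
        a = int(np.argmax(samples))
        r = rng.binomial(1, means[a])
        alpha[a] += r
        beta[a] += 1 - r
    return bad

for T in (1_000, 10_000, 100_000):
    # The count grows far slower than T, consistent with the O(log T) claim.
    print(T, pessimistic_rounds(T))
```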
Step 3: When \(a^*\) is Optimistic, \(a\) is Pulled Only Logarithmically Often
On the event \(\mathcal{E}_t\), arm \(a\) wins the round only if \(\tilde{\mu}_a \ge \mu^* - \delta\). By Step 1,
\[ \Pr\!\left[ \tilde{\mu}_a \ge \mu^* - \delta \mid N_a(t-1) = n \right] \le \exp\!\left( -n \cdot \text{KL}(\mu_a + \delta, \mu^* - \delta) \right) \]
once \(\hat{\mu}_a(t-1) \le \mu_a + \delta\); by a Chernoff bound, this fails on only \(O(1/\delta^2)\) pulls of arm \(a\) in expectation. Summing over \(t\), and choosing \(\delta = \Theta(\epsilon)\) small enough that \(\text{KL}(\mu_a + \delta, \mu^* - \delta) \ge \text{KL}(\mu_a, \mu^*)/(1 + \epsilon)\) (possible since the Bernoulli KL is continuous in both arguments), gives
\[ \mathbb{E}[N_a(T)] \le \frac{(1 + \epsilon) \log T}{\text{KL}(\mu_a, \mu^*)} + O\!\left( \frac{1}{\epsilon^2} \right), \]
matching the bound stated in the Claim.
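To see what the KL constant buys concretely, here is a quick numerical comparison of the two leading coefficients, with means chosen by me for illustration (they are not from the text):

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)), as in the Step 1 snippet."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

mu_a, mu_star = 0.01, 0.10          # hypothetical means near the boundary
gap = mu_star - mu_a

# Leading coefficients of log T in the two styles of bound.
print(f"KL constant:          {1 / bernoulli_kl(mu_a, mu_star):.1f}")  # ~14.0
print(f"squared-gap constant: {1 / (2 * gap**2):.1f}")                 # ~61.7
# Pinsker's inequality gives KL(p, q) >= 2 (p - q)^2, so the KL constant
# is never worse, and is much smaller when the means sit near 0 or 1.
```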
Step 4: Combine
Multiplying by \(\Delta_a\) and summing over suboptimal arms,
\[ \text{Regret}(T) \le (1 + \epsilon) \sum_{a : \Delta_a > 0} \frac{\Delta_a \log T}{\text{KL}(\mu_a, \mu^*)} + O(1). \]
Since \(\epsilon > 0\) is arbitrary, dividing by \(\log T\), taking the \(\limsup\), and letting \(\epsilon \to 0\) recovers the Lai-Robbins constant exactly.
Bayesian Regret via the Information Ratio
A complementary line of analysis (Russo and Van Roy 2016) sidesteps Beta-specific algebra by analyzing the information ratio
\[ \Gamma_t = \frac{\mathbb{E}[\Delta_t]^2}{\mathbb{I}_t(a^*; (a_t, r_t))}, \]
the squared expected single-step regret divided by the mutual information between the optimal arm and the round’s observation. For Thompson sampling, \(\Gamma_t \le K/2\) holds uniformly over rounds, and a Cauchy-Schwarz argument converts this into the Bayesian regret bound
\[ \text{BayesRegret}(T) \le \sqrt{\tfrac{1}{2} K T \cdot \mathcal{H}(a^*)} = O(\sqrt{KT \log K}). \]
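The chain of inequalities behind this is short enough to spell out (a standard argument, reproduced here for completeness): by Cauchy-Schwarz and the definition of \(\Gamma_t\),
\[ \text{BayesRegret}(T) = \sum_{t=1}^{T} \mathbb{E}[\Delta_t] \le \sqrt{T \sum_{t=1}^{T} \mathbb{E}[\Delta_t]^2} \le \sqrt{\Gamma\, T \sum_{t=1}^{T} \mathbb{I}_t(a^*; (a_t, r_t))} \le \sqrt{\Gamma\, T\, \mathcal{H}(a^*)}, \]
where \(\Gamma = K/2\) bounds every \(\Gamma_t\) and the per-round information gains telescope, by the chain rule, to at most the prior entropy \(\mathcal{H}(a^*) \le \log K\).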
The information-ratio argument generalizes cleanly to linear, combinatorial, and partial-monitoring bandits, where it often gives the sharpest known bounds.
Remarks
- The Beta-Bernoulli analysis hinges on the Bernoulli-KL tail bound of Step 1; for sub-Gaussian rewards a Gaussian posterior gives an analogous bound with the squared-gap constant of UCB (a sketch of that variant follows these remarks).
- The slack \(\epsilon\) cannot simply be set to zero in the finite-time bound: the additive term blows up as \(O(1/\epsilon^2)\). But since \(\epsilon > 0\) is arbitrary, the asymptotic constant still matches the lower bound.
- The same broad strategy — split into “\(a^*\) is optimistic” plus “\(a\) is not too optimistic” — works for Bayes-UCB and KL-UCB, which is why all three methods share the Lai-Robbins constant.
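As referenced in the first remark, here is a minimal sketch of the Gaussian-posterior variant for (sub-)Gaussian rewards, with posterior scale \(1/\sqrt{n+1}\) in the style of standard Gaussian Thompson sampling; the unit reward variance and the example arms are my own illustrative assumptions:

```python
import numpy as np

def gaussian_thompson(reward_fns, T, rng=None):
    """Thompson sampling with a N(mu_hat, 1/(n+1)) sample per arm,
    a standard variant for unit-variance (sub-)Gaussian rewards."""
    rng = rng or np.random.default_rng(0)
    K = len(reward_fns)
    n = np.zeros(K)        # pull counts
    sums = np.zeros(K)     # reward sums
    for _ in range(T):
        mu_hat = sums / np.maximum(n, 1)          # empirical means (0 if unpulled)
        samples = rng.normal(mu_hat, 1.0 / np.sqrt(n + 1))
        a = int(np.argmax(samples))
        sums[a] += reward_fns[a](rng)
        n[a] += 1
    return n

# e.g. two hypothetical Gaussian arms:
# pulls = gaussian_thompson(
#     [lambda r: r.normal(0.5, 1), lambda r: r.normal(0.3, 1)], T=10_000)
```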
References
- Kaufmann, E., Korda, N., and Munos, R. (2012). Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis. ALT.
- Agrawal, S. and Goyal, N. (2012). Analysis of Thompson Sampling for the Multi-Armed Bandit Problem. COLT.
- Russo, D. and Van Roy, B. (2016). An Information-Theoretic Analysis of Thompson Sampling. JMLR.
- Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.