Thompson Sampling

Motivation

Upper confidence bound algorithms explore by acting greedily with respect to an optimistic point estimate. Thompson sampling explores by acting greedily with respect to a random point estimate, drawn from the posterior. The randomness does the same job the optimistic bonus does in UCB: arms whose values are still uncertain get sampled often enough to refine the posterior, and arms whose posteriors have concentrated below the leader stop being chosen.

The rule (Thompson 1933) predates UCB by half a century. Its modern revival is driven by two practical advantages: it matches UCB’s logarithmic regret while typically performing better in practice, and it generalizes naturally to contextual and structured settings by replacing the per-arm posterior with a Bayesian model.

See exploration vs. exploitation for the broader context and multi-armed bandits for the regret framework.

Problem

A \(K\)-armed Bayesian bandit: each arm \(a\) has an unknown mean reward \(\mu_a\) with prior \(p(\mu_a)\). At each round \(t\), pick an arm \(a_t\) and observe reward \(r_t\). The history \(\mathcal{D}_t = \{(a_s, r_s) : s < t\}\) updates the posterior. Goal: minimize expected regret over \(T\) rounds.

Key Ideas

Sample from the posterior, then act greedily

Instead of acting on a point estimate (greedy) or an optimistic bound (UCB), draw a random point estimate from each arm’s posterior and pull the argmax. The arm that wins is the one whose posterior sample happens to be largest, not the one whose posterior mean is largest.

Probability mass in the tails of an uncertain arm’s posterior translates directly into exploration: an arm with a wide posterior has many rounds where its sample is the largest, even if its mean is moderate. As pulls accumulate, the posterior tightens and tail-sampling exploration naturally winds down.
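
A quick numerical sanity check of that claim (a minimal sketch; the Beta parameters below are arbitrary illustrative choices, not from the text): a tightly concentrated leader with a higher mean still loses the argmax to a wide, moderate-mean challenger a substantial fraction of the time.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Leader: mean 0.65, as if observed over roughly 100 pulls, so its posterior is narrow.
leader = rng.beta(65, 35, size=n)
# Challenger: mean 0.50, as if observed over only a couple of pulls, so its posterior is wide.
challenger = rng.beta(2, 2, size=n)

# Fraction of rounds in which the uncertain challenger's sample
# wins the argmax despite its lower posterior mean.
print((challenger > leader).mean())   # roughly 0.28 in this configuration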

Probability matching

The probability that Thompson sampling pulls arm \(a\) at round \(t\) equals the posterior probability that \(a\) is optimal:

\[ \Pr[a_t = a \mid \mathcal{D}_t] = \Pr[a = a^* \mid \mathcal{D}_t]. \]

This is probability matching. The agent allocates trials to each arm in proportion to its current credibility of being best — neither over-exploring arms already ruled out (as \(\varepsilon\)-greedy does) nor under-exploring promising-but-uncertain arms (as a greedy posterior-mean rule would).
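
Probability matching can be read off directly from simulation: drawing one sample per arm and taking the argmax is simultaneously how Thompson sampling picks its action and how one estimates the posterior probability that each arm is optimal. A minimal sketch with three hypothetical Beta posteriors (illustrative values only):

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Current posteriors for three hypothetical arms: (alpha, beta) pairs.
posteriors = [(8, 3), (5, 5), (2, 6)]

# Draw one sample per arm per simulated round and record the argmax.
samples = np.column_stack([rng.beta(a, b, size=n) for a, b in posteriors])
pull_freq = np.bincount(samples.argmax(axis=1), minlength=3) / n

# By probability matching, these pull frequencies are also Monte Carlo
# estimates of Pr[arm is optimal | data] under the current posteriors.
print(pull_freq)   # arm 1 dominates; arm 3 is almost never pulled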

Posteriors over success rates

For Bernoulli rewards, the conjugate \(\text{Beta}(\alpha_a, \beta_a)\) posterior has a closed-form update. Each arm carries one Beta distribution; a sample from each Beta is one round’s “guess” at the arm’s true probability.

[Figure: posteriors \(\text{Beta}(5,5)\) for arm 1 and \(\text{Beta}(8,3)\) for arm 2, each with one sampled value \(\theta_1\), \(\theta_2\) marked. Pull the arm whose posterior sample is largest. Successful pulls shift mass right; failures shift mass left.]

Algorithm

Maintain a posterior \(p(\mu_a \mid \mathcal{D}_t)\) for each arm. At each round:

for each round t:
    for each arm a:
        sample μ̃_a ~ p(μ_a | D_t)
    a_t = argmax_a μ̃_a
    pull a_t, observe r_t
    update p(μ_{a_t} | D_{t+1}) using r_t

For Bernoulli rewards with \(\text{Beta}(\alpha_a, \beta_a)\) posteriors, the update is the closed form

\[ \alpha_{a_t} \leftarrow \alpha_{a_t} + r_t, \qquad \beta_{a_t} \leftarrow \beta_{a_t} + 1 - r_t, \]

starting from \(\alpha_a = \beta_a = 1\) (uniform prior).
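
A minimal runnable sketch of this Beta-Bernoulli case (the function name, arm probabilities, and horizon are illustrative choices, not part of the original algorithm statement):

import numpy as np

def thompson_bernoulli(true_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling; returns per-arm pull counts."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    alpha = np.ones(k)   # uniform Beta(1, 1) priors
    beta = np.ones(k)

    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)              # one posterior sample per arm
        arm = int(np.argmax(theta))                # act greedily on the samples
        reward = rng.random() < true_probs[arm]    # Bernoulli reward
        alpha[arm] += reward                       # conjugate update
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

print(thompson_bernoulli([0.7, 0.4], horizon=1000))  # most pulls go to arm 0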

Walkthrough

Three rounds on a Beta-Bernoulli bandit

Two arms with true probabilities \(p_1 = 0.7\), \(p_2 = 0.4\). Start with \(\text{Beta}(1, 1)\) priors on both.

  • Round 1. Posteriors before: \(\text{Beta}(1,1)\), \(\text{Beta}(1,1)\). Samples drawn: \(\tilde{\mu}_1 = 0.83\), \(\tilde{\mu}_2 = 0.41\). Pick: arm 1. Reward: \(1\). Posteriors after: \(\text{Beta}(2,1)\), \(\text{Beta}(1,1)\).
  • Round 2. Posteriors before: \(\text{Beta}(2,1)\), \(\text{Beta}(1,1)\). Samples drawn: \(\tilde{\mu}_1 = 0.56\), \(\tilde{\mu}_2 = 0.73\). Pick: arm 2. Reward: \(0\). Posteriors after: \(\text{Beta}(2,1)\), \(\text{Beta}(1,2)\).
  • Round 3. Posteriors before: \(\text{Beta}(2,1)\), \(\text{Beta}(1,2)\). Samples drawn: \(\tilde{\mu}_1 = 0.71\), \(\tilde{\mu}_2 = 0.22\). Pick: arm 1. Reward: \(1\). Posteriors after: \(\text{Beta}(3,1)\), \(\text{Beta}(1,2)\).

In round 2 arm 2’s posterior sample happened to be larger than arm 1’s — even though arm 1 had observed evidence in its favor. This is the desired behavior: with so little data, arm 2 deserves an exploratory pull. The pull fails, arm 2’s posterior shifts left, and round 3 returns to arm 1.

Over many rounds, \(\text{Beta}(\alpha_1, \beta_1)\) concentrates near \(p_1 = 0.7\) and \(\text{Beta}(\alpha_2, \beta_2)\) near \(p_2 = 0.4\). The probability that a sample from the latter beats a sample from the former shrinks toward zero as both posteriors concentrate (roughly exponentially in the number of pulls, at a rate set by the gap), and the algorithm settles into pulling arm 1 almost exclusively.
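
That winding-down is easy to quantify. A sketch assuming a hypothetical posterior state after about 50 pulls of each arm, with empirical rates near the true \(p_1\) and \(p_2\):

import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Posteriors after ~50 pulls per arm: 35/15 successes-failures for arm 1,
# 20/30 for arm 2, on top of the uniform Beta(1, 1) priors.
arm1 = rng.beta(1 + 35, 1 + 15, size=n)   # Beta(36, 16), mean ~0.69
arm2 = rng.beta(1 + 20, 1 + 30, size=n)   # Beta(21, 31), mean ~0.40

# Chance that arm 2's sample still beats arm 1's.
print((arm2 > arm1).mean())   # on the order of 1e-3: exploratory pulls of arm 2 have nearly stopped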

Correctness

Thompson sampling matches the optimal logarithmic instance-dependent regret rate (Agrawal and Goyal 2012; Kaufmann et al. 2012):

\[ \text{Regret}(T) = O\!\left( \sum_{a : \Delta_a > 0} \frac{\log T}{\Delta_a} \right), \]

where \(\Delta_a = \mu^* - \mu_a\). This matches the Lai-Robbins lower bound up to constants.
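
As a back-of-the-envelope illustration (not part of the formal statement): on the two-arm walkthrough instance the only positive gap is \(\Delta_2 = 0.7 - 0.4 = 0.3\), so the bound gives regret on the order of \(\log(10^4)/0.3 \approx 31\) (up to constants) after \(T = 10^4\) rounds, compared with roughly \(\tfrac{1}{2}\Delta_2 T = 1500\) for uniformly random play.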

A complementary line of analysis bounds the Bayesian regret through the information ratio (Russo and Van Roy 2016), yielding \(O(\sqrt{KT \log K})\) worst-case Bayesian regret and clean extensions to structured problems.

The asymptotic guarantee is the same as UCB; the empirical performance is often better, with smaller constants on real problems.

Complexity and Tradeoffs

Per-round cost: \(O(K)\) for sampling and the argmax, plus the cost of posterior sampling (closed-form for conjugate families, otherwise an MCMC or variational step). Memory: \(O(K)\) posterior parameters in the simple case.

Limitations and considerations

  • Misspecified prior. Regret guarantees are robust to a wide range of priors, but exploration can be too aggressive if the prior is too vague, or too conservative if it is too tight. Mildly informative priors help.
  • Approximate posteriors. Outside conjugate families the posterior is approximated by Laplace, variational, ensemble, or dropout-based methods. Approximation error degrades the exploration guarantees but the algorithm remains usable in deep RL.
  • Delayed feedback. Thompson sampling is robust to delays in reward observations — pulls based on stale posteriors still match probability of optimality, just with slightly outdated estimates.
  • Comparison to UCB. Both achieve the same regret rate. UCB is deterministic given the data, easier to debug, and easier to combine with constraints; Thompson sampling generalizes more readily, often has smaller constants, and is more robust to delayed feedback.

When to Use It

Situation and suggested approach:

  • Bayesian bandit with conjugate posterior: Thompson sampling — usually outperforms UCB.
  • Online recommendation, A/B testing: Thompson sampling — the production default.
  • Linear / contextual bandit: linear Thompson sampling — see Variants.
  • Deep RL with neural Bayesian model: Thompson sampling with approximate posterior (ensemble / dropout).
  • Need deterministic action choice for debugging: UCB instead.
  • Best-arm identification (pure exploration): pure-exploration variants (top-two TS, BayesGap).
  • Non-stationary arms: discounted / windowed Thompson sampling.

Variants

  • Linear Thompson sampling (Agrawal and Goyal 2013). For a linear bandit with reward \(r = x_a^\top \theta + \varepsilon\) and a Gaussian prior on \(\theta\), the posterior is Gaussian. Sample \(\tilde{\theta} \sim \mathcal{N}(\hat{\theta}_t, \Sigma_t)\), pull \(a_t = \arg\max_a x_a^\top \tilde{\theta}\), update \(\hat{\theta}, \Sigma\). Sampling \(\tilde{\theta}\) once per round automatically correlates exploration across similar arms — a single posterior sample induces a coherent ranking. The same recipe extends to GLMs, neural Bayesian models, and structured action spaces. A code sketch follows this list.
  • Bootstrap Thompson sampling. Approximate the posterior by bootstrapping the observed data. Useful when conjugate posteriors aren’t available.
  • Ensemble / dropout TS. Train multiple value networks (or use Monte Carlo dropout) and treat their predictions as posterior samples. Standard approximate Thompson sampling in deep RL (e.g., bootstrapped DQN).
  • Top-two Thompson sampling. For best-arm identification: draw a posterior sample and take its best arm as the leader; with a fixed probability pull the leader, otherwise resample until a different arm tops the draw and pull that challenger instead. Forcing pulls onto close runners-up tightens the gap-detection rate.
  • Discounted / sliding-window TS. Discount older observations to handle non-stationarity.
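
The linear variant admits a similarly compact sketch. The following is one common formulation in the shape described above (Gaussian prior, ridge-style precision update, one joint posterior sample per round); the function name, noise level, scale \(v\), and toy arm set are all illustrative assumptions:

import numpy as np

def linear_thompson(features, theta_true, horizon, v=0.5, sigma=0.1, seed=0):
    """Linear Thompson sampling over a fixed finite arm set.

    features: (K, d) array of arm feature vectors x_a.
    The posterior over theta is kept as N(mu, v^2 * B^{-1}) with
    B = I + sum x x^T and mu = B^{-1} sum x r (a ridge-style update).
    """
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    B = np.eye(d)            # prior precision (identity => N(0, v^2 I) prior)
    f = np.zeros(d)          # running sum of x_a * r
    values = features @ theta_true
    best_val = values.max()
    regret = 0.0

    for _ in range(horizon):
        mu = np.linalg.solve(B, f)                        # posterior mean
        cov = v ** 2 * np.linalg.inv(B)                   # posterior covariance
        theta_tilde = rng.multivariate_normal(mu, cov)    # one joint posterior sample
        arm = int(np.argmax(features @ theta_tilde))      # act greedily on the sample
        x = features[arm]
        r = x @ theta_true + sigma * rng.standard_normal()  # noisy linear reward
        B += np.outer(x, x)                               # precision update
        f += r * x
        regret += best_val - values[arm]
    return regret

# Toy run: 5 arms in 3 dimensions (all values illustrative).
X = np.random.default_rng(1).standard_normal((5, 3))
print(linear_thompson(X, theta_true=np.array([1.0, -0.5, 0.2]), horizon=2000))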

References

Agrawal, Shipra, and Navin Goyal. 2012. “Analysis of Thompson Sampling for the Multi-Armed Bandit Problem.” Conference on Learning Theory (COLT). https://proceedings.mlr.press/v23/agrawal12.html.
Agrawal, Shipra, and Navin Goyal. 2013. “Thompson Sampling for Contextual Bandits with Linear Payoffs.” International Conference on Machine Learning (ICML), 127–35. https://proceedings.mlr.press/v28/agrawal13.html.
Kaufmann, Emilie, Nathaniel Korda, and Rémi Munos. 2012. “Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis.” In Algorithmic Learning Theory. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-34106-9_18.
Russo, Daniel, and Benjamin Van Roy. 2016. “An Information-Theoretic Analysis of Thompson Sampling.” Journal of Machine Learning Research 17 (68): 1–30. https://jmlr.org/papers/v17/14-087.html.
Thompson, W. R. 1933. “On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples.” Biometrika 25 (3-4): 285–94. https://doi.org/10.1093/biomet/25.3-4.285.