The Lai-Robbins Regret Lower Bound

Claim

Consider a \(K\)-armed stochastic bandit with reward distributions \(\nu_1, \ldots, \nu_K\) from a parameterized family \(\mathcal{F}\) with means \(\mu_1, \ldots, \mu_K\) and unique optimal mean \(\mu^* = \mu_{a^*}\). A bandit algorithm is consistent on \(\mathcal{F}\) if for every problem instance and every \(\alpha \in (0, 1)\),

\[ \mathbb{E}[N_a(T)] = o(T^\alpha) \quad \text{for every suboptimal arm } a. \]

Then for every consistent algorithm and every suboptimal arm \(a\),

\[ \liminf_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\log T} \ge \frac{1}{\text{KL}(\nu_a, \nu^*)}, \]

where \(\text{KL}(\nu_a, \nu^*)\) denotes the infimum of the Kullback-Leibler divergence \(\text{KL}(\nu_a, \nu')\) over distributions \(\nu' \in \mathcal{F}\) whose mean exceeds \(\mu^*\). Multiplying by the gap \(\Delta_a = \mu^* - \mu_a\) and summing over the suboptimal arms yields the regret lower bound

\[ \liminf_{T \to \infty} \frac{\text{Regret}(T)}{\log T} \ge \sum_{a : \Delta_a > 0} \frac{\Delta_a}{\text{KL}(\nu_a, \nu^*)}. \]

This is the Lai-Robbins lower bound (Lai and Robbins 1985; see also Lattimore and Szepesvári 2020 for a modern treatment). It says that no consistent bandit algorithm can do asymptotically better than logarithmic regret, with a constant determined by the KL divergences between the suboptimal arms and the best arm.
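
To make the constant concrete, here is a minimal Python sketch that evaluates the right-hand side of the regret bound for a hypothetical three-armed Bernoulli instance; the arm means and the helper name `bernoulli_kl` are illustrative assumptions, not part of the original result.

```python
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL(Ber(p), Ber(q)) for q strictly inside (0, 1), with 0 log 0 = 0."""
    if p == 0:
        return math.log(1 / (1 - q))
    if p == 1:
        return math.log(1 / q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

means = [0.9, 0.8, 0.5]   # assumed instance; the first arm is optimal
mu_star = max(means)

# Sum over suboptimal arms of Delta_a / KL(nu_a, nu*).
constant = sum(
    (mu_star - mu) / bernoulli_kl(mu, mu_star)
    for mu in means if mu < mu_star
)
print(f"Regret(T) >~ {constant:.3f} * log T asymptotically")
```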

Intuition

To declare arm \(a\) suboptimal the algorithm must distinguish \(\nu_a\) from a competing distribution that would have made \(a\) optimal. Large-deviations theory (Sanov’s theorem) says that \(n\) samples from \(\nu_a\) resemble samples from a competing distribution \(\nu'\) with probability roughly \(\exp(-n \, \text{KL}(\nu_a, \nu'))\). A consistent algorithm can afford to be fooled with probability only about \(1/T\), so the arm must be sampled at least \(\log T / \text{KL}(\nu_a, \nu^*)\) times. Each such pull contributes \(\Delta_a\) to the regret.
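
A quick numeric sketch of this heuristic, under assumed Bernoulli arms with means 0.4 and 0.5 and horizon \(T = 10^6\): by construction, \(n = \log T / \text{KL}\) pulls drive the fooling probability \(\exp(-n\,\text{KL})\) down to exactly \(1/T\).

```python
import math

mu_a, mu_star, T = 0.4, 0.5, 10**6   # assumed values for illustration
# KL(Ber(0.4), Ber(0.5))
kl = mu_a * math.log(mu_a / mu_star) \
     + (1 - mu_a) * math.log((1 - mu_a) / (1 - mu_star))

n = math.log(T) / kl            # pulls demanded by the heuristic (~686)
fooled = math.exp(-n * kl)      # fooling probability, = 1/T by construction
print(f"n ~ {n:.0f} pulls, fooling probability ~ {fooled:.0e}, "
      f"regret contribution ~ {(mu_star - mu_a) * n:.1f}")
```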

Proof sketch

The proof uses a change of measure between the true environment \(\nu\) and a perturbed environment \(\nu'\) that is identical to \(\nu\) except that arm \(a\)'s distribution is replaced by some \(\nu'_a \in \mathcal{F}\) whose mean exceeds \(\mu^*\), making \(a\) optimal. Let \(\mathbb{P}\) and \(\mathbb{P}'\) denote the laws of the algorithm’s full interaction history under \(\nu\) and \(\nu'\) respectively.

Step 1: Likelihood ratio

The log-likelihood ratio between the two history distributions factors over the pulls of arm \(a\): the algorithm’s action choices have the same conditional law under both measures, so only the rewards received from arm \(a\) contribute:

\[ \log \frac{d\mathbb{P}}{d\mathbb{P}'}(H_T) = \sum_{t : a_t = a} \log \frac{d\nu_a}{d\nu'_a}(r_t). \]

Taking expectations under \(\mathbb{P}\) and applying Wald’s identity (each pull of arm \(a\) contributes \(\text{KL}(\nu_a, \nu'_a)\) in expectation) gives the key identity

\[ \mathbb{E}\!\left[ \log \frac{d\mathbb{P}}{d\mathbb{P}'}(H_T) \right] = \mathbb{E}[N_a(T)] \cdot \text{KL}(\nu_a, \nu'_a). \]
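
The identity is easy to test empirically. Below is a Monte Carlo sketch for an assumed Bernoulli arm with mean 0.4 under \(\nu\) and 0.6 under \(\nu'\), played by a uniformly random policy; the identity holds for any policy, since only arm \(a\)'s reward law differs between the two measures.

```python
import math, random

p_a, p_a_alt = 0.4, 0.6      # arm a's mean under nu and nu' (assumed)
T, runs = 200, 5000
kl = p_a * math.log(p_a / p_a_alt) \
     + (1 - p_a) * math.log((1 - p_a) / (1 - p_a_alt))

llr_total, pulls_total = 0.0, 0
for _ in range(runs):
    for _ in range(T):
        if random.random() < 0.5:                   # the policy pulls arm a
            r = 1 if random.random() < p_a else 0   # reward drawn under nu
            # log-likelihood ratio of this reward under nu vs nu'
            llr_total += math.log(p_a / p_a_alt) if r \
                         else math.log((1 - p_a) / (1 - p_a_alt))
            pulls_total += 1

print(f"E[log dP/dP'] ~ {llr_total / runs:.3f}")    # Monte Carlo estimate
print(f"E[N_a] * KL   ~ {pulls_total / runs * kl:.3f}")  # should match
```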

Step 2: Data-processing inequality

For any event \(\mathcal{E}\),

\[ \text{KL}(\mathbb{P}, \mathbb{P}') \ge \text{KL}(\text{Ber}(\mathbb{P}(\mathcal{E})), \text{Ber}(\mathbb{P}'(\mathcal{E}))) \ge \mathbb{P}(\mathcal{E}) \log \frac{\mathbb{P}(\mathcal{E})}{\mathbb{P}'(\mathcal{E})} - \log 2. \]

Choose \(\mathcal{E} = \{N_a(T) \le T/2\}\). Under \(\nu\) (where \(a\) is suboptimal) consistency gives \(\mathbb{E}[N_a(T)] = o(T^\alpha)\), so by Markov’s inequality \(\mathbb{P}(\mathcal{E}) \to 1\). Under \(\nu'\) (where \(a\) is the unique optimal arm) every other arm is suboptimal, so consistency gives \(\mathbb{E}'[T - N_a(T)] = o(T^\alpha)\); Markov’s inequality then yields \(\mathbb{P}'(\mathcal{E}) = \mathbb{P}'(T - N_a(T) \ge T/2) \le \mathbb{E}'[T - N_a(T)]/(T/2) = o(T^{\alpha - 1})\).
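
The second inequality in the display above, \(\text{kl}(p, q) \ge p \log(p/q) - \log 2\) for the binary KL divergence, can be sanity-checked numerically; this sketch just scans a grid and assumes nothing beyond the displayed formula.

```python
import math

def binary_kl(p: float, q: float) -> float:
    """KL(Ber(p), Ber(q)) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

grid = [i / 100 for i in range(1, 100)]
slack = min(binary_kl(p, q) - (p * math.log(p / q) - math.log(2))
            for p in grid for q in grid)
print(f"minimum slack over the grid: {slack:.4f}")  # stays nonnegative
```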

Step 3: Combine

Substituting \(\mathbb{P}(\mathcal{E}) \to 1\) and \(\mathbb{P}'(\mathcal{E}) = o(T^{\alpha - 1})\) into Step 2, and noting that Step 1 identifies \(\text{KL}(\mathbb{P}, \mathbb{P}')\) with \(\mathbb{E}[N_a(T)] \cdot \text{KL}(\nu_a, \nu'_a)\),

\[ \mathbb{E}[N_a(T)] \cdot \text{KL}(\nu_a, \nu'_a) \ge \mathbb{P}(\mathcal{E}) \log \frac{\mathbb{P}(\mathcal{E})}{\mathbb{P}'(\mathcal{E})} - \log 2 \ge (1 - \alpha) \log T - O(1). \]

Dividing by \(\log T\), taking \(T \to \infty\), and letting \(\alpha \to 0\) gives \(\liminf_{T \to \infty} \mathbb{E}[N_a(T)] / \log T \ge 1 / \text{KL}(\nu_a, \nu'_a)\).

Finally, choosing \(\nu'_a \in \mathcal{F}\) with mean \(\mu^* + \epsilon\) and letting \(\epsilon \to 0\) replaces \(\text{KL}(\nu_a, \nu'_a)\) with the infimum \(\text{KL}(\nu_a, \nu^*)\), giving the claim.
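
For Bernoulli arms this limiting step is just continuity of the KL divergence in its second argument, as the following small sketch with assumed means illustrates:

```python
import math

def binary_kl(p: float, q: float) -> float:
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu_a, mu_star = 0.4, 0.5   # assumed means
for eps in [0.1, 0.01, 0.001, 1e-4]:
    print(f"eps = {eps:<6g}  KL(nu_a, mean mu*+eps) = "
          f"{binary_kl(mu_a, mu_star + eps):.6f}")
print(f"eps -> 0      KL(nu_a, nu*)           = "
      f"{binary_kl(mu_a, mu_star):.6f}")
```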

Consequences

  • UCB and Thompson sampling are tight. UCB1 and Thompson sampling achieve \(O(\log T / \Delta_a)\) regret per suboptimal arm, matching the lower bound up to a constant. KL-UCB matches the exact Lai-Robbins constant for Bernoulli rewards (Garivier and Cappé 2011).
  • Worst-case rate. The instance-dependent bound becomes vacuous as \(\Delta_a \to 0\). A complementary minimax analysis gives \(\Omega(\sqrt{KT})\) worst-case regret, achieved up to constants by MOSS and other variants.
  • The bound depends on the family. A larger family \(\mathcal{F}\) admits more alternative environments, so the infimum \(\text{KL}(\nu_a, \nu^*)\) shrinks and the lower bound grows: consistency over a richer family is more demanding. For Bernoulli arms with parameters in \([0, 1]\) the constant is \(\Delta_a / \text{KL}(\mu_a, \mu^*)\), with \(\text{KL}\) the binary KL divergence between the means, which can be much smaller than the \(1/\Delta_a\) factor in distribution-free bounds, as the sketch after this list illustrates.
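
A numeric sketch of the last point, with illustrative means: the Bernoulli constant \(\Delta_a / \text{KL}(\mu_a, \mu^*)\) compared with the \(1/(2\Delta_a)\) that Pinsker’s inequality (\(\text{KL} \ge 2\Delta_a^2\)) would allow at worst.

```python
import math

def binary_kl(p: float, q: float) -> float:
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu_star = 0.99   # assumed optimal mean, near the boundary of [0, 1]
for mu_a in [0.9, 0.5, 0.1]:
    delta = mu_star - mu_a
    print(f"mu_a = {mu_a}: Delta/KL = {delta / binary_kl(mu_a, mu_star):.3f}, "
          f"1/(2 Delta) = {1 / (2 * delta):.3f}")
```

Near the boundary of the parameter space the binary KL is much larger than \(2\Delta_a^2\), so the instance-dependent constant is far below the gap-based one.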

References

Garivier, Aurélien, and Olivier Cappé. 2011. “The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond.” Conference on Learning Theory (COLT). https://proceedings.mlr.press/v19/garivier11a.html.
Lai, T. L., and Herbert Robbins. 1985. “Asymptotically Efficient Adaptive Allocation Rules.” Advances in Applied Mathematics 6 (1): 4–22. https://doi.org/10.1016/0196-8858(85)90002-8.
Lattimore, Tor, and Csaba Szepesvári. 2020. Bandit Algorithms. Cambridge University Press. https://www.cambridge.org/core/books/bandit-algorithms/8E39FD004E6CE036680F90DD0C6F09FC.