Decaying Epsilon-Greedy Achieves Logarithmic Regret
Claim
Consider a \(K\)-armed bandit with rewards in \([0, 1]\), gaps \(\Delta_a = \mu^* - \mu_a\), and minimum gap \(d = \min_{a : \Delta_a > 0} \Delta_a\). Run the \(\varepsilon\)-greedy algorithm with the schedule
\[ \varepsilon_t = \min\!\left\{ 1, \, \frac{cK}{d^2 \, t} \right\} \]
for any constant \(c > 1\) (the original analysis takes \(c > 5\)). Then there is a constant \(C(c, d, K)\) such that
\[ \text{Regret}(T) \le C(c, d, K) \, K \log T. \]
This is the logarithmic-regret theorem of Auer, Cesa-Bianchi, and Fischer for \(\varepsilon\)-greedy (Auer et al. 2002).
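For reference, a minimal simulation sketch of the algorithm, assuming Bernoulli rewards and a known gap \(d\) (the function name and all parameter choices are illustrative, not from the original paper). It also records the split into exploration and exploitation pulls used in the analysis below.

```python
import numpy as np

def eps_greedy(means, T, c=5.0, rng=None):
    """Decaying eps-greedy with eps_t = min(1, c*K / (d^2 * t)).

    `means` are Bernoulli arm means; the schedule assumes the minimum
    gap d is known (see Remarks). Returns the pseudo-regret and the
    per-arm exploration/exploitation pull counts.
    """
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    K = len(means)
    gaps = means.max() - means
    d = gaps[gaps > 0].min()                 # minimum positive gap (assumes one suboptimal arm exists)

    counts = np.zeros(K)                     # N_a(t)
    sums = np.zeros(K)                       # running reward totals
    n_exp = np.zeros(K, dtype=int)           # exploration pulls N_a^exp
    n_xpt = np.zeros(K, dtype=int)           # exploitation pulls N_a^xpt
    regret = 0.0
    for t in range(1, T + 1):
        eps = min(1.0, c * K / (d**2 * t))
        if rng.random() < eps:               # exploration step: uniform arm
            a = int(rng.integers(K))
            n_exp[a] += 1
        else:                                # exploitation step: greedy on empirical means
            emp = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
            a = int(np.argmax(emp))
            n_xpt[a] += 1
        counts[a] += 1
        sums[a] += float(rng.random() < means[a])   # Bernoulli reward
        regret += gaps[a]                    # pseudo-regret increment
    return regret, n_exp, n_xpt
```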
Setup
Write \(a^*\) for an optimal arm, \(a\) for a suboptimal arm with gap \(\Delta_a\), and \(\hat{\mu}_a(t)\) for the empirical mean of arm \(a\) after the first \(t\) rounds. Decompose each round into two cases:
- Exploitation step (probability \(1 - \varepsilon_t\)): the agent pulls \(\arg\max_{a'} \hat{\mu}_{a'}(t-1)\).
- Exploration step (probability \(\varepsilon_t\)): the agent pulls a uniformly random arm.
Let \(N_a(T)\) denote the total pulls of arm \(a\) and \(N_a^{\text{exp}}(T)\), \(N_a^{\text{xpt}}(T)\) its decomposition into exploration and exploitation pulls.
Step 1: Exploration pulls contribute \(O(K \log T)\)
Each round contributes at most \(\varepsilon_t / K\) to \(\mathbb{E}[N_a^{\text{exp}}(T)]\), so
\[ \mathbb{E}[N_a^{\text{exp}}(T)] \le \frac{1}{K} \sum_{t=1}^{T} \varepsilon_t = \frac{1}{K} \sum_{t=1}^{T} \min\!\left\{ 1, \, \frac{cK}{d^2 t} \right\} \le \frac{c}{d^2} \log T + O(1). \]
These pulls cost \(\Delta_a \cdot \mathbb{E}[N_a^{\text{exp}}(T)] \le c \Delta_a \log T / d^2 \le c \log T / d^2\) in expectation (using \(\Delta_a \le 1\)), summing to \(O(K \log T / d^2)\) over suboptimal arms.
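A quick numerical check of this bound, with illustrative parameter values:

```python
import math

# Step 1 check: (1/K) * sum_t min(1, cK/(d^2 t)) versus (c/d^2) * log T.
c, K, d, T = 5.0, 10, 0.1, 10**6
exp_pulls = sum(min(1.0, c * K / (d**2 * t)) for t in range(1, T + 1)) / K
bound = (c / d**2) * math.log(T)
print(f"E[N_a^exp(T)] ~= {exp_pulls:.0f}, bound (c/d^2) log T = {bound:.0f}")
# prints roughly 3149 vs 6908 for these parameters
```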
Step 2: Exploration is sufficient
The exploration component alone forces every arm to be pulled \(\Omega(\log t)\) times by round \(t\) in expectation:
\[ \mathbb{E}[N_a^{\text{exp}}(t)] \ge \frac{c}{d^2} \log t - O(1). \]
The realized pull count concentrates around this expectation (the full proof makes this precise with a Bernstein-type bound). Conditioning on \(N_a^{\text{exp}}(t) \ge \tfrac{c}{d^2} \log t\) pulls of arm \(a\), Hoeffding's inequality gives
\[ \Pr\!\left[ \hat{\mu}_a(t) - \mu_a \ge \tfrac{\Delta_a}{2} \right] \le \exp(-\tfrac{1}{2} N_a^{\text{exp}}(t) \, \Delta_a^2) \le t^{-c \Delta_a^2 / (2 d^2)} \le t^{-c/2}. \]
The last step uses \(\Delta_a \ge d\). The same bound holds for the optimal arm with the inequality flipped. So with probability at least \(1 - 2 t^{-c/2}\), the empirical means satisfy
\[ \hat{\mu}_a(t) < \mu_a + \tfrac{\Delta_a}{2} = \mu^* - \tfrac{\Delta_a}{2} < \hat{\mu}_{a^*}(t), \]
and an exploitation step pulls \(a^*\) rather than \(a\).
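For concreteness, a worked instance of the bound with the illustrative values \(c = 5\), \(\Delta_a = d\), and \(t = 10^4\):
\[ \Pr\!\left[ \hat{\mu}_a(t) - \mu_a \ge \tfrac{\Delta_a}{2} \right] \le t^{-c/2} = (10^4)^{-5/2} = 10^{-10}, \]
so both bad events together occur with probability at most \(2 \cdot 10^{-10}\) at that round.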
Step 3: Exploitation pulls of suboptimal arms
By Step 2, the probability that an exploitation step at round \(t\) pulls suboptimal arm \(a\) is at most \(2 t^{-c/2}\). Summing,
\[ \mathbb{E}[N_a^{\text{xpt}}(T)] \le 2 \sum_{t=1}^{T} t^{-c/2}. \]
For any \(c > 2\) this sum is \(O(1)\), so exploitation contributes only constant regret. Refining the analysis to \(c > 1\) requires a slightly more careful concentration bound but produces the same asymptotic conclusion.
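For the reference value \(c = 5\) the sum is an explicit zeta value:
\[ \mathbb{E}[N_a^{\text{xpt}}(T)] \le 2 \sum_{t=1}^{\infty} t^{-5/2} = 2\,\zeta\!\left(\tfrac{5}{2}\right) \approx 2.68, \]
i.e. each suboptimal arm receives only a couple of mistaken exploitation pulls in expectation over the entire run.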
Step 4: Combine
Total expected pulls of suboptimal arm \(a\) are
\[ \mathbb{E}[N_a(T)] = \mathbb{E}[N_a^{\text{exp}}(T)] + \mathbb{E}[N_a^{\text{xpt}}(T)] \le \frac{c}{d^2} \log T + O(1). \]
Summing \(\Delta_a \, \mathbb{E}[N_a(T)]\) over suboptimal arms (again using \(\Delta_a \le 1\)) gives
\[ \text{Regret}(T) \le \frac{c K}{d^2} \log T + O(K). \]
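A minimal end-to-end check, reusing the `eps_greedy` sketch from the Claim section (two Bernoulli arms, a handful of seeds; all values illustrative):

```python
import numpy as np

# Compare simulated pseudo-regret against the (cK/d^2) log T bound as T grows.
means, c, K, d = [0.5, 0.3], 5.0, 2, 0.2
for T in (10**3, 10**4, 10**5):
    regrets = [eps_greedy(means, T, c=c, rng=np.random.default_rng(s))[0]
               for s in range(10)]
    bound = (c * K / d**2) * np.log(T)
    print(f"T={T:>6}: mean regret {np.mean(regrets):7.1f}   bound {bound:8.1f}")
```

The empirical regret should grow roughly logarithmically in \(T\) and sit well below the bound, which is loose by design.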
Remarks
- The schedule depends on the unknown minimum gap \(d\). Without prior knowledge of \(d\), the gap-free schedule \(\varepsilon_t = \min(1, cK/t)\) is used in practice; the logarithmic guarantee then survives only if \(c\) is set at least on the order of \(1/d^2\), so the constant in the bound still scales as \(1/d^2\). (The sketch after this list compares the schedules empirically.)
- The result is asymptotically optimal in \(T\) but has worse constants than UCB1 and Thompson sampling. The reason is structural: \(\varepsilon\)-greedy explores uniformly across all arms, whereas UCB and Thompson sampling concentrate exploration on plausibly-best arms.
- A constant \(\varepsilon\) schedule has linear regret \(\Theta(\varepsilon T)\). Decaying more slowly, say \(\varepsilon_t \propto t^{-\alpha}\) with \(0 < \alpha < 1\), still gives sublinear but polynomial regret \(O(T^{1-\alpha})\); decaying faster, say \(\varepsilon_t \propto 1/t^2\), supplies only \(O(1)\) total exploration, so the algorithm locks onto a suboptimal arm with constant probability and suffers linear regret. The \(1/t\) decay is the rate that balances the two and yields logarithmic regret.
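To illustrate the last two remarks, a small sketch comparing a constant schedule, the gap-free \(cK/t\) schedule, and the gap-aware \(cK/(d^2 t)\) schedule (all parameter choices illustrative):

```python
import numpy as np

def run(schedule, means, T, seed=0):
    """Eps-greedy with an arbitrary exploration schedule; returns pseudo-regret."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    K = len(means)
    gaps = means.max() - means
    counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, T + 1):
        if rng.random() < schedule(t):       # exploration step
            a = int(rng.integers(K))
        else:                                # exploitation step
            emp = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
            a = int(np.argmax(emp))
        counts[a] += 1
        sums[a] += float(rng.random() < means[a])
        regret += gaps[a]
    return regret

means, c, K, d, T = [0.5, 0.3], 5.0, 2, 0.2, 10**5
for name, sched in [
    ("constant eps=0.1", lambda t: 0.1),
    ("cK/t",             lambda t: min(1.0, c * K / t)),
    ("cK/(d^2 t)",       lambda t: min(1.0, c * K / (d**2 * t))),
]:
    print(f"{name:>18}: regret {run(sched, means, T):8.1f}")
```

The constant schedule's regret grows linearly with \(T\), while both decaying schedules flatten out; the gap-aware schedule simply explores more per round when \(d\) is small.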