Decaying Epsilon-Greedy Achieves Logarithmic Regret
Claim
Consider a \(K\)-armed bandit with rewards in \([0, 1]\), gaps \(\Delta_a = \mu^* - \mu_a\), and minimum gap \(d = \min_{a : \Delta_a > 0} \Delta_a\). Run the \(\varepsilon\)-greedy algorithm with the schedule
\[ \varepsilon_t = \min\!\left\{ 1, \, \frac{cK}{d^2 \, t} \right\} \]
for any constant \(c > 1\) (the original analysis takes \(c > 5\)). Then there is a constant \(C(c, d, K)\) such that
\[ \text{Regret}(T) \le C(c, d, K) \, K \log T. \]
This is the logarithmic-regret theorem of Auer, Cesa-Bianchi, and Fischer for \(\varepsilon\)-greedy (Auer et al. 2002).
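For reference, a minimal simulation sketch of the algorithm, assuming Bernoulli rewards and a known gap \(d\) (the function name and all parameter choices are illustrative, not from the original paper). It also records the split into exploration and exploitation pulls used in the analysis below.

```python
import numpy as np

def eps_greedy(means, T, c=5.0, rng=None):
    """Decaying eps-greedy with eps_t = min(1, c*K / (d^2 * t)).

    `means` are Bernoulli arm means; the schedule assumes the minimum
    gap d is known (see Remarks). Returns the pseudo-regret and the
    per-arm exploration/exploitation pull counts.
    """
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    K = len(means)
    gaps = means.max() - means
    d = gaps[gaps > 0].min()                 # minimum positive gap (assumes one suboptimal arm exists)

    counts = np.zeros(K)                     # N_a(t)
    sums = np.zeros(K)                       # running reward totals
    n_exp = np.zeros(K, dtype=int)           # exploration pulls N_a^exp
    n_xpt = np.zeros(K, dtype=int)           # exploitation pulls N_a^xpt
    regret = 0.0
    for t in range(1, T + 1):
        eps = min(1.0, c * K / (d**2 * t))
        if rng.random() < eps:               # exploration step: uniform arm
            a = int(rng.integers(K))
            n_exp[a] += 1
        else:                                # exploitation step: greedy on empirical means
            emp = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
            a = int(np.argmax(emp))
            n_xpt[a] += 1
        counts[a] += 1
        sums[a] += float(rng.random() < means[a])   # Bernoulli reward
        regret += gaps[a]                    # pseudo-regret increment
    return regret, n_exp, n_xpt
```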
Setup
Write \(a^*\) for an optimal arm, \(a\) for a suboptimal arm with gap \(\Delta_a\), and \(\hat{\mu}_a(t)\) for the empirical mean of arm \(a\) after the first \(t\) rounds. Decompose each round into two cases:
- Exploitation step (probability \(1 - \varepsilon_t\)): the agent pulls \(\arg\max_{a'} \hat{\mu}_{a'}(t-1)\).
- Exploration step (probability \(\varepsilon_t\)): the agent pulls a uniformly random arm.
Let \(N_a(T)\) denote the total pulls of arm \(a\) and \(N_a^{\text{exp}}(T)\), \(N_a^{\text{xpt}}(T)\) its decomposition into exploration and exploitation pulls.
Step 1: Exploration pulls contribute \(O(K \log T)\)
Each round contributes at most \(\varepsilon_t / K\) to \(\mathbb{E}[N_a^{\text{exp}}(T)]\), so
\[ \mathbb{E}[N_a^{\text{exp}}(T)] \le \frac{1}{K} \sum_{t=1}^{T} \varepsilon_t = \frac{1}{K} \sum_{t=1}^{T} \min\!\left\{ 1, \, \frac{cK}{d^2 t} \right\} \le \frac{c}{d^2} \log T + O(1). \]
These pulls cost \(\Delta_a \cdot \mathbb{E}[N_a^{\text{exp}}(T)] \le c \Delta_a \log T / d^2 \le c \log T / d^2\) in expectation (using \(\Delta_a \le 1\)), summing to \(O(K \log T / d^2)\) over suboptimal arms.
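A quick numerical check of this bound, with illustrative parameter values:

```python
import math

# Step 1 check: (1/K) * sum_t min(1, cK/(d^2 t)) versus (c/d^2) * log T.
c, K, d, T = 5.0, 10, 0.1, 10**6
exp_pulls = sum(min(1.0, c * K / (d**2 * t)) for t in range(1, T + 1)) / K
bound = (c / d**2) * math.log(T)
print(f"E[N_a^exp(T)] ~= {exp_pulls:.0f}, bound (c/d^2) log T = {bound:.0f}")
# prints roughly 3149 vs 6908 for these parameters
```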
Step 2: Exploration is sufficient
The exploration component alone forces every arm to be pulled \(\Omega(\log t)\) times by round \(t\) in expectation:
\[ \mathbb{E}[N_a^{\text{exp}}(t)] \ge \frac{c}{d^2} \log t - O(1). \]
The realized pull count concentrates around this expectation (the full proof makes this precise with a Bernstein-type bound). Conditioning on \(N_a^{\text{exp}}(t) \ge \tfrac{c}{d^2} \log t\) pulls of arm \(a\), Hoeffding's inequality gives
\[ \Pr\!\left[ \hat{\mu}_a(t) - \mu_a \ge \tfrac{\Delta_a}{2} \right] \le \exp(-\tfrac{1}{2} N_a^{\text{exp}}(t) \, \Delta_a^2) \le t^{-c \Delta_a^2 / (2 d^2)} \le t^{-c/2}. \]
The last step uses \(\Delta_a \ge d\). The same bound holds for the optimal arm with the inequality flipped. So with probability at least \(1 - 2 t^{-c/2}\), the empirical means satisfy
\[ \hat{\mu}_a(t) < \mu_a + \tfrac{\Delta_a}{2} = \mu^* - \tfrac{\Delta_a}{2} < \hat{\mu}_{a^*}(t), \]
and an exploitation step pulls \(a^*\) rather than \(a\).
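For concreteness, a worked instance of the bound with the illustrative values \(c = 5\), \(\Delta_a = d\), and \(t = 10^4\):
\[ \Pr\!\left[ \hat{\mu}_a(t) - \mu_a \ge \tfrac{\Delta_a}{2} \right] \le t^{-c/2} = (10^4)^{-5/2} = 10^{-10}, \]
so both bad events together occur with probability at most \(2 \cdot 10^{-10}\) at that round.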
Step 3: Exploitation pulls of suboptimal arms
By Step 2, the probability that an exploitation step at round \(t\) pulls suboptimal arm \(a\) is at most \(2 t^{-c/2}\). Summing,
\[ \mathbb{E}[N_a^{\text{xpt}}(T)] \le 2 \sum_{t=1}^{T} t^{-c/2}. \]
For any \(c > 2\) this sum is \(O(1)\), so exploitation contributes only constant regret. Refining the analysis to \(c > 1\) requires a slightly more careful concentration bound but produces the same asymptotic conclusion.
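For the reference value \(c = 5\) the sum is an explicit zeta value:
\[ \mathbb{E}[N_a^{\text{xpt}}(T)] \le 2 \sum_{t=1}^{\infty} t^{-5/2} = 2\,\zeta\!\left(\tfrac{5}{2}\right) \approx 2.68, \]
i.e. each suboptimal arm receives only a couple of mistaken exploitation pulls in expectation over the entire run.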
Step 4: Combine
Total expected pulls of suboptimal arm \(a\) are
\[ \mathbb{E}[N_a(T)] = \mathbb{E}[N_a^{\text{exp}}(T)] + \mathbb{E}[N_a^{\text{xpt}}(T)] \le \frac{c}{d^2} \log T + O(1). \]
Summing \(\Delta_a \, \mathbb{E}[N_a(T)]\) over suboptimal arms (again using \(\Delta_a \le 1\)) gives
\[ \text{Regret}(T) \le \frac{c K}{d^2} \log T + O(K). \]
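A minimal end-to-end check, reusing the `eps_greedy` sketch from the Claim section (two Bernoulli arms, a handful of seeds; all values illustrative):

```python
import numpy as np

# Compare simulated pseudo-regret against the (cK/d^2) log T bound as T grows.
means, c, K, d = [0.5, 0.3], 5.0, 2, 0.2
for T in (10**3, 10**4, 10**5):
    regrets = [eps_greedy(means, T, c=c, rng=np.random.default_rng(s))[0]
               for s in range(10)]
    bound = (c * K / d**2) * np.log(T)
    print(f"T={T:>6}: mean regret {np.mean(regrets):7.1f}   bound {bound:8.1f}")
```

The empirical regret should grow roughly logarithmically in \(T\) and sit well below the bound, which is loose by design.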
Remarks
- The schedule depends on the unknown minimum gap \(d\). Without prior knowledge of \(d\), the gap-free schedule \(\varepsilon_t = \min(1, cK/t)\) is used in practice; the logarithmic guarantee then survives only if \(c\) is set at least on the order of \(1/d^2\), so the constant in the bound still scales as \(1/d^2\). (The sketch after this list compares the schedules empirically.)
- The result is asymptotically optimal in \(T\) but has worse constants than UCB1 and Thompson sampling. The reason is structural: \(\varepsilon\)-greedy explores uniformly across all arms, whereas UCB and Thompson sampling concentrate exploration on plausibly-best arms.
- A constant \(\varepsilon\) schedule has linear regret \(\Theta(\varepsilon T)\). Decaying more slowly, say \(\varepsilon_t \propto t^{-\alpha}\) with \(0 < \alpha < 1\), still gives sublinear but polynomial regret \(O(T^{1-\alpha})\); decaying faster, say \(\varepsilon_t \propto 1/t^2\), supplies only \(O(1)\) total exploration, so the algorithm locks onto a suboptimal arm with constant probability and suffers linear regret. The \(1/t\) decay is the rate that balances the two and yields logarithmic regret.
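To illustrate the last two remarks, a small sketch comparing a constant schedule, the gap-free \(cK/t\) schedule, and the gap-aware \(cK/(d^2 t)\) schedule (all parameter choices illustrative):

```python
import numpy as np

def run(schedule, means, T, seed=0):
    """Eps-greedy with an arbitrary exploration schedule; returns pseudo-regret."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    K = len(means)
    gaps = means.max() - means
    counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, T + 1):
        if rng.random() < schedule(t):       # exploration step
            a = int(rng.integers(K))
        else:                                # exploitation step
            emp = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
            a = int(np.argmax(emp))
        counts[a] += 1
        sums[a] += float(rng.random() < means[a])
        regret += gaps[a]
    return regret

means, c, K, d, T = [0.5, 0.3], 5.0, 2, 0.2, 10**5
for name, sched in [
    ("constant eps=0.1", lambda t: 0.1),
    ("cK/t",             lambda t: min(1.0, c * K / t)),
    ("cK/(d^2 t)",       lambda t: min(1.0, c * K / (d**2 * t))),
]:
    print(f"{name:>18}: regret {run(sched, means, T):8.1f}")
```

The constant schedule's regret grows linearly with \(T\), while both decaying schedules flatten out; the gap-aware schedule simply explores more per round when \(d\) is small.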