Epsilon-Greedy
Motivation
The simplest way to break out of a greedy policy that may have locked onto a bad action is to occasionally ignore it. \(\varepsilon\)-greedy does exactly that: with probability \(1 - \varepsilon\) it picks the action with the highest current value estimate, and with probability \(\varepsilon\) it picks uniformly at random.
It is the default exploration rule in tabular reinforcement learning because it requires nothing beyond a value estimate — no confidence interval, no posterior, no model. The cost of that simplicity is that exploration is undirected: random pulls are spread evenly across all actions, including ones already known to be bad.
See exploration vs. exploitation for the broader context.
Problem
In a multi-armed bandit setting (or per-state in an MDP), maintain reward estimates \(\hat{\mu}_a\) for each action \(a \in \mathcal{A}\) and at each round pick an action to balance:
- Exploit the current best estimate to accumulate reward.
- Explore alternatives to refine the estimates and avoid locking onto a suboptimal arm.
Key Ideas
Coin flip between greedy and uniform
The whole mechanism is a single coin flip. With probability \(1 - \varepsilon\) act greedily; with probability \(\varepsilon\) pick uniformly at random. No statistics about which action is most worth exploring, no notion of confidence — just a fixed exploration rate distributed uniformly across all arms.
This minimalism is the algorithm’s strength: it works with any value estimator, scales to MDPs with no change, and has no per-arm bookkeeping beyond the value estimates themselves. The cost is that “exploration” is treated as a quota to spend uniformly, not as information-gathering targeted at uncertain arms.
The schedule controls the regret
The behavior of \(\varepsilon\)-greedy is governed entirely by the schedule \(\{\varepsilon_t\}\). The choice trades off short-term reward for long-term identification of the best arm.
A constant exploration rate pays a steady per-round tax, so exploratory pulls accumulate linearly; a decaying rate concentrates most of its exploratory pulls early. The schedule choice determines whether the algorithm’s regret is linear in \(T\) (constant \(\varepsilon\)) or logarithmic in \(T\) (suitably decaying \(\varepsilon\)).
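The size of the exploration tax is easy to compute directly: the expected number of exploratory pulls is just the sum of the schedule. A minimal sketch, with \(T\) and the constant rate as illustrative values:

```python
# Expected number of exploratory pulls after T rounds under two schedules.
T = 100_000
eps_const = 0.1

# Constant schedule: every round explores with probability eps_const,
# so exploratory pulls grow linearly in T.
explore_const = eps_const * T

# Decaying schedule eps_t = 1/t: the expected count is the harmonic sum,
# which grows like log T.
explore_decay = sum(1.0 / t for t in range(1, T + 1))

print(f"constant eps: ~{explore_const:.0f} exploratory pulls")  # ~10000
print(f"1/t decay:    ~{explore_decay:.1f} exploratory pulls")  # ~12.1
```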
GLIE: the RL convergence requirement
In reinforcement learning the standard convergence requirement for \(\varepsilon\)-greedy is Greedy in the Limit with Infinite Exploration (GLIE):
- Every state-action pair is visited infinitely often.
- The behavior policy converges to greedy: \(\varepsilon_t \to 0\) as \(t \to \infty\).
GLIE is the assumption under which tabular Q-learning converges to \(Q^*\). The schedule \(\varepsilon_t = 1/t\) satisfies it.
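To see why: \(\varepsilon_t = 1/t \to 0\) gives greediness in the limit, and in the bandit case each arm is explored with probability at least \(\varepsilon_t / K\) per round, so the expected number of pulls of any single arm diverges:
\[ \sum_{t=1}^{\infty} \frac{\varepsilon_t}{K} = \frac{1}{K} \sum_{t=1}^{\infty} \frac{1}{t} = \infty. \]
Because the exploration draws are independent across rounds, the second Borel–Cantelli lemma upgrades the divergent sum to infinitely many pulls with probability one. (In an MDP the same argument applies per state, provided every state is itself visited infinitely often.)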
Algorithm
At round \(t\), given value estimates \(\hat{\mu}_a\) (or \(\hat{Q}(s, a)\) in an MDP):
\[ a_t = \begin{cases} \arg\max_a \hat{\mu}_a & \text{with probability } 1 - \varepsilon_t, \\ \text{Uniform}(\mathcal{A}) & \text{with probability } \varepsilon_t. \end{cases} \]
```
on each round t:
    u ~ Uniform[0, 1]
    if u < ε_t:
        a_t = uniform random action
    else:
        a_t = argmax_a μ̂_a   # break ties arbitrarily
    pull a_t, observe reward, update μ̂_{a_t}
```
In an MDP setting the same rule is applied per state with \(\hat{Q}(s, \cdot)\) replacing \(\hat{\mu}\).
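A minimal runnable version of the same loop for the bandit case; the Gaussian reward noise and the arm means are illustrative assumptions, and the schedule is passed in so constant and decaying variants can be swapped:

```python
import random

def epsilon_greedy_bandit(true_means, T, epsilon_schedule):
    """Run epsilon-greedy with sample-average estimates; return the estimates."""
    K = len(true_means)
    counts = [0] * K      # pulls per arm
    mu_hat = [0.0] * K    # running-average reward estimates

    for t in range(1, T + 1):
        if random.random() < epsilon_schedule(t):
            a = random.randrange(K)                     # explore: uniform
        else:
            a = max(range(K), key=lambda i: mu_hat[i])  # exploit: argmax
        reward = random.gauss(true_means[a], 0.1)       # simulated pull
        counts[a] += 1
        mu_hat[a] += (reward - mu_hat[a]) / counts[a]   # O(1) incremental mean
    return mu_hat

# Constant schedule; swap in lambda t: min(1.0, 5 / t) for a GLIE-style decay.
print(epsilon_greedy_bandit([0.7, 0.5, 0.2], T=10_000,
                            epsilon_schedule=lambda t: 0.2))
```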
Walkthrough
Five rounds on a 3-arm bandit
Arms \(\{A, B, C\}\) with true means \(\mu_A = 0.7\), \(\mu_B = 0.5\), \(\mu_C = 0.2\). Initial estimates \(\hat{\mu} = (0, 0, 0)\). Use \(\varepsilon = 0.2\) throughout.
| Round | \(u\) | Decision | Action | Reward | Updated \(\hat{\mu}\) |
|---|---|---|---|---|---|
| 1 | 0.1 | explore (random) | \(C\) | \(0.3\) | \((0, 0, 0.30)\) |
| 2 | 0.5 | exploit (argmax) | \(C\) (unique argmax) | \(0.1\) | \((0, 0, 0.20)\) |
| 3 | 0.9 | exploit | \(C\) (still highest) | \(0.0\) | \((0, 0, 0.13)\) |
| 4 | 0.15 | explore | \(A\) | \(0.8\) | \((0.80, 0, 0.13)\) |
| 5 | 0.6 | exploit | \(A\) | \(0.6\) | \((0.70, 0, 0.13)\) |
The agent gets stuck on \(C\) for a few rounds: the round-1 exploration happened to land on \(C\), and its positive early estimate made it the greedy pick over the untouched zero estimates of \(A\) and \(B\). Exploration at round 4 happens to land on the actually-best arm \(A\), after which exploit takes over correctly.
This pattern is typical: \(\varepsilon\)-greedy can lock onto a misleading early estimate until a lucky exploration step rescues it. Higher \(\varepsilon\) reduces the risk but costs more reward; decaying \(\varepsilon\) tries to have both.
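The table can be replayed directly; the \(u\) draws and rewards below are hard-coded from the walkthrough rather than sampled:

```python
# Replay of the five-round walkthrough with the table's u draws and rewards.
arms = ["A", "B", "C"]
mu_hat = {a: 0.0 for a in arms}
counts = {a: 0 for a in arms}
eps = 0.2

# (u draw, arm chosen if exploring, observed reward)
rounds = [(0.10, "C", 0.3), (0.50, None, 0.1), (0.90, None, 0.0),
          (0.15, "A", 0.8), (0.60, None, 0.6)]

for u, explored_arm, reward in rounds:
    if u < eps:
        a = explored_arm                        # explore step from the table
    else:
        a = max(arms, key=lambda x: mu_hat[x])  # exploit: argmax
    counts[a] += 1
    mu_hat[a] += (reward - mu_hat[a]) / counts[a]
    print(u, a, {k: round(v, 2) for k, v in mu_hat.items()})
```

The printed estimates match the table row by row, including the three rounds spent on \(C\).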
Correctness
Under GLIE, \(\varepsilon\)-greedy guarantees that every arm is pulled infinitely often, which is exactly the condition needed for sample-average estimators \(\hat{\mu}_a\) to converge to the true means \(\mu_a\) by the law of large numbers. As \(\varepsilon_t \to 0\), the policy converges to greedy, which is optimal once the estimates have converged.
The convergence is asymptotic, not finite-time — there is no fixed \(T\) after which \(\varepsilon\)-greedy is guaranteed to be playing the right arm. Finite-time guarantees come in the form of regret bounds (next section).
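The asymptotic claim can be checked empirically by reusing the epsilon_greedy_bandit sketch from the Algorithm section with a decaying schedule (the constant 5 is an illustrative choice, not tuned to the gaps):

```python
# Longer runs pull every arm more often, so the estimates drift toward
# the true means (0.7, 0.5, 0.2) while the policy turns greedy.
for T in (1_000, 100_000):
    est = epsilon_greedy_bandit([0.7, 0.5, 0.2], T,
                                epsilon_schedule=lambda t: min(1.0, 5 / t))
    print(T, [round(m, 2) for m in est])
```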
Complexity and Tradeoffs
Per-round cost: \(O(K)\) to pick the argmax over \(K\) arms (a max-heap gives \(O(1)\) lookup at the cost of \(O(\log K)\) per update), and \(O(1)\) to update a single arm’s running average incrementally via \(\hat{\mu}_a \leftarrow \hat{\mu}_a + (r - \hat{\mu}_a)/n_a\).
Regret under different schedules
Constant \(\varepsilon\). Suboptimal arms continue to be pulled with probability at least \(\varepsilon / K\) per round, so the regret grows as
\[ \text{Regret}(T) = \Theta(\varepsilon T). \]
This is linear regret: the wrong asymptotic rate, but often acceptable when the environment is non-stationary and the agent must keep checking that the best arm has not changed.
Decaying \(\varepsilon\). Setting \(\varepsilon_t = \min\!\left(1, \tfrac{cK}{t}\right)\) for sufficiently large \(c\) gives logarithmic regret (Auer et al. 2002)
\[ \text{Regret}(T) = O(K \log T), \]
matching the optimal rate up to constants. The schedule must decay slowly enough that every arm is sampled infinitely often, but fast enough that exploration eventually stops dominating reward.
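A rough empirical comparison of the two regimes; single runs with illustrative constants, and \(c\) here is not tuned to the gaps as the theory requires:

```python
import random

def pseudo_regret(true_means, T, schedule):
    """Cumulative pseudo-regret of epsilon-greedy under a given schedule."""
    K, best = len(true_means), max(true_means)
    counts, mu_hat, total = [0] * K, [0.0] * K, 0.0
    for t in range(1, T + 1):
        if random.random() < schedule(t):
            a = random.randrange(K)                     # explore
        else:
            a = max(range(K), key=lambda i: mu_hat[i])  # exploit
        counts[a] += 1
        r = random.gauss(true_means[a], 0.1)
        mu_hat[a] += (r - mu_hat[a]) / counts[a]
        total += best - true_means[a]   # charge the true gap, not the noise
    return total

means = [0.7, 0.5, 0.2]
for T in (1_000, 10_000, 100_000):
    const = pseudo_regret(means, T, lambda t: 0.1)
    decay = pseudo_regret(means, T, lambda t: min(1.0, 15 / t))
    print(f"T={T}: constant ~{const:.0f}, decaying ~{decay:.0f}")
```

The constant schedule's regret grows roughly tenfold with each tenfold increase in \(T\); the decaying schedule's barely moves after the early exploration phase.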
Limitations
- Undirected. Once an arm is known with high confidence to be bad, a random exploration step still pulls it with probability \(1/K\). UCB and Thompson sampling avoid this waste by directing exploration toward plausibly-best arms only.
- Hyperparameter-sensitive. The constant \(c\) in \(\varepsilon_t = cK/t\) depends on the unknown gaps \(\Delta_a\). Choosing it too small can leave the agent stuck on the wrong arm; too large wastes reward on exploration long after the best arm is clear.
- Action-space scaling. Uniform random exploration is uninformative when \(|\mathcal{A}|\) is large or continuous; structured exploration (Boltzmann, posterior sampling, parameter noise) tends to scale better.
When to Use It
| Situation | Approach |
|---|---|
| Tabular RL with \(\varepsilon\)-greedy behavior policy | The default — works with Q-learning, SARSA, etc. |
| Need stronger regret guarantees, stationary bandit | UCB or Thompson sampling. |
| Non-stationary environment (arms change over time) | Constant \(\varepsilon\) — keeps re-checking forever. |
| Deep RL (DQN and friends) | \(\varepsilon\)-greedy with linear anneal from \(1.0\) to \(0.1\) over early training (the original DQN used the first million frames); see the sketch after this table. |
| Large or continuous action space | Boltzmann, parameter-space noise, or learned exploration. |
| Need a smooth (not random) action selection | Boltzmann exploration with temperature anneal. |
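For the deep-RL row, the anneal is just a clipped linear interpolation; a minimal sketch, where the step budget is a free choice (the original DQN used one million frames):

```python
def linear_anneal(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly interpolate epsilon from start to end, then hold at end."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)

# linear_anneal(0) == 1.0, linear_anneal(500_000) == 0.55,
# and everything past anneal_steps holds at 0.1.
```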
Variants
- Decaying \(\varepsilon\) schedules. \(\varepsilon_t = c/t\) (or \(cK/t\)) for log regret. Other schedules: exponential decay, step decay, linear anneal.
- State-dependent \(\varepsilon\). Higher \(\varepsilon\) in less-visited states. Approximates information-directed exploration without full UCB machinery.
- Optimistic initialization. Initialize \(\hat{\mu}_a\) to a value higher than any plausible reward. Pure greedy then explores naturally until each arm has been pulled enough to deflate its estimate.
- Boltzmann exploration. Replace uniform-random with sampling from a softmax over \(\hat{\mu}\). Smoother, but adds a temperature hyperparameter; see the sketch after this list.
- Noisy nets. In deep RL, inject parametric noise into the network weights instead of sampling at action time. Yields persistent exploration across episodes.
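A sketch of the Boltzmann variant from the list above; the temperature default is an arbitrary illustrative value:

```python
import math
import random

def boltzmann_action(mu_hat, temperature=0.1):
    """Sample an action index from a softmax over value estimates."""
    m = max(mu_hat)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in mu_hat]
    return random.choices(range(len(mu_hat)), weights=weights)[0]
```

Lower temperatures approach pure argmax; higher temperatures approach uniform random, recovering the two extremes \(\varepsilon\)-greedy mixes between.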