Boltzmann Exploration
Motivation
Epsilon-greedy explores by sometimes ignoring the value estimate entirely and picking uniformly at random. That throws away information: an action whose estimated value is barely worse than the best is treated identically to one whose estimated value is catastrophic.
Boltzmann exploration, also called softmax exploration (Sutton and Barto 2018), instead chooses each action with probability proportional to the exponential of its value estimate. Better-looking actions are more likely, worse-looking actions are less likely, and the gap between them is controlled by a temperature parameter. Exploration is graded rather than uniform.
This is the natural exploration rule whenever the agent already maintains a stochastic policy — most importantly in REINFORCE and the broader policy-gradient family — because the softmax appears as the policy itself rather than as a separate exploration wrapper.
See exploration vs. exploitation for the broader context.
Problem
At each round (or per-state in an MDP), given value estimates \(\hat{Q}(s, a)\) for each action, pick an action stochastically with probabilities that smoothly depend on the estimates. The mapping from estimates to probabilities should favor better-looking actions, never strictly outlaw any action, and let a single parameter dial between greedy and uniform.
Key Ideas
Probabilities proportional to exponentiated values
The action distribution is
\[ \pi(a \mid s) = \frac{\exp(\hat{Q}(s, a) / \tau)}{\sum_{a'} \exp(\hat{Q}(s, a') / \tau)}, \]
with temperature \(\tau > 0\). Higher-\(\hat{Q}\) actions get exponentially more probability, but every action retains positive mass. The softmax is smooth and differentiable in \(\hat{Q}\) — important for policy-gradient training and for any algorithm that needs the policy to vary continuously with parameter updates.
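As a minimal sketch (NumPy, with illustrative names; the max-logit shift anticipates the numerical caveat discussed under Limitations):

```python
import numpy as np

def boltzmann_policy(q_values: np.ndarray, tau: float) -> np.ndarray:
    """Action probabilities proportional to exp(Q/tau)."""
    logits = q_values / tau
    logits = logits - logits.max()   # shift by the max logit for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

rng = np.random.default_rng(0)
q_hat = np.array([1.0, 0.7, 0.0])
probs = boltzmann_policy(q_hat, tau=1.0)   # ~ [0.474, 0.351, 0.174]
action = rng.choice(len(q_hat), p=probs)   # graded, stochastic choice
```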
Temperature dials between exploration and exploitation
- \(\tau \to 0\): the distribution concentrates on the argmax — pure exploitation.
- \(\tau \to \infty\): the distribution becomes uniform — pure exploration.
- Intermediate \(\tau\): actions are sampled with probability proportional to \(\exp(\hat{Q}/\tau)\), grading exploration by estimated value.
A common schedule anneals \(\tau_t \to 0\) over training so that exploration vanishes asymptotically, analogous to a decaying \(\varepsilon\) in \(\varepsilon\)-greedy.
Maximum-entropy interpretation
Boltzmann distributions are the maximum-entropy distributions subject to a fixed expected value (Kaelbling et al. 1996). Choosing \(\pi(a \mid s) \propto \exp(Q(s, a)/\tau)\) is therefore the policy that maximizes
\[ \mathbb{E}_{a \sim \pi}\!\left[ Q(s, a) \right] + \tau \, \mathcal{H}(\pi(\cdot \mid s)), \]
where \(\mathcal{H}\) is Shannon entropy. The temperature \(\tau\) is the Lagrange multiplier on entropy.
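For completeness, the standard derivation: maximize the objective over the simplex with a multiplier \(\lambda\) enforcing normalization. Stationarity gives
\[ \frac{\partial}{\partial \pi(a)} \left[ \sum_{a'} \pi(a')\, Q(s, a') - \tau \sum_{a'} \pi(a') \log \pi(a') + \lambda \Big( \sum_{a'} \pi(a') - 1 \Big) \right] = Q(s, a) - \tau \big( \log \pi(a) + 1 \big) + \lambda = 0, \]
so \(\log \pi(a) = Q(s, a)/\tau + \text{const}\), i.e. \(\pi(a) \propto \exp(Q(s, a)/\tau)\).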
This is the formal basis of maximum-entropy reinforcement learning (soft \(Q\)-learning, soft actor-critic): the entropy bonus is exactly an exploration incentive that prefers higher-entropy policies among those that achieve a given expected return.
Algorithm
    on each round (or each state visit):
        compute logits ℓ_a = Q̂(s, a) / τ for each action a
        compute π(a | s) = softmax(ℓ)
        sample a_t ~ π(· | s)
        take a_t, observe the reward, update Q̂(s, a_t)
In a policy network with a softmax head, \(\pi_\theta(a \mid s) \propto \exp(f_\theta(s, a))\) already has this form — the temperature is folded into the network’s logits (\(\tau = 1\)).
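A runnable sketch of the loop above for the bandit case, assuming Bernoulli arms with made-up means and an illustrative annealing schedule:

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.3, 0.6, 0.5])    # hypothetical Bernoulli arm means
K = len(true_means)
q_hat = np.zeros(K)                       # sample-average value estimates
counts = np.zeros(K)

for t in range(1, 5001):
    tau = max(0.05, 1.0 / np.log(t + 1))  # annealed temperature with a floor
    logits = q_hat / tau
    weights = np.exp(logits - logits.max())
    probs = weights / weights.sum()       # pi(a) proportional to exp(Q/tau)
    a = rng.choice(K, p=probs)            # sample a_t ~ pi
    r = float(rng.random() < true_means[a])    # Bernoulli reward
    counts[a] += 1
    q_hat[a] += (r - q_hat[a]) / counts[a]     # incremental sample average

print(q_hat, counts)  # estimates approach the true means; arm 1 dominates
```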
Walkthrough
Sampling from three actions at three temperatures
Suppose \(\hat{Q} = (1.0,\ 0.7,\ 0.0)\) for actions \(A, B, C\). The action probabilities depend on the temperature:
| \(\tau\) | \(\exp(\hat{Q}/\tau)\) for \(A\) | \(\exp(\hat{Q}/\tau)\) for \(B\) | \(\exp(\hat{Q}/\tau)\) for \(C\) | Normalized | Behavior |
|---|---|---|---|---|---|
| \(\tau = 0.1\) | \(\exp(10) \approx 22026\) | \(\exp(7) \approx 1097\) | \(\exp(0) = 1\) | \((0.953,\ 0.047,\ 0.000)\) | nearly greedy |
| \(\tau = 1.0\) | \(\exp(1) \approx 2.72\) | \(\exp(0.7) \approx 2.01\) | \(\exp(0) = 1\) | \((0.474,\ 0.351,\ 0.174)\) | graded |
| \(\tau = 10\) | \(\exp(0.1) \approx 1.10\) | \(\exp(0.07) \approx 1.07\) | \(\exp(0) = 1\) | \((0.348,\ 0.338,\ 0.315)\) | nearly uniform |
As \(\tau\) decreases, the policy concentrates on the highest-\(\hat{Q}\) action. The middle action \(B\) — only slightly worse than \(A\) — is treated very differently from the clearly-bad \(C\) at moderate \(\tau\). This is the key contrast with \(\varepsilon\)-greedy, which would split the exploratory probability evenly between \(B\) and \(C\).
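These numbers can be reproduced in a few lines (a quick numerical check, not library code):

```python
import numpy as np

q_hat = np.array([1.0, 0.7, 0.0])
for tau in (0.1, 1.0, 10.0):
    weights = np.exp(q_hat / tau)   # safe here; see Limitations for very low tau
    print(tau, np.round(weights / weights.sum(), 3))
# 0.1  [0.953 0.047 0.   ]
# 1.0  [0.474 0.351 0.174]
# 10.0 [0.348 0.338 0.315]
```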
Correctness
Boltzmann exploration is everywhere-positive: every action has \(\pi(a \mid s) > 0\), so in a finite MDP with the policy held fixed, every reachable state-action pair is visited infinitely often. Combined with a Robbins-Monro learning-rate schedule, this is sufficient for the sample-average estimates \(\hat{Q}\) to converge to the true action values.
In the RL control setting, an annealing schedule \(\tau_t \to 0\) that satisfies GLIE (greedy in the limit with infinite exploration) yields convergence of Q-learning and other tabular methods to \(Q^*\). The annealing rate matters: too fast a decay starves apparently suboptimal actions of visits, violating the infinite-exploration condition; too slow leaves the agent exploring forever.
For stochastic bandits there is no clean logarithmic-regret bound under standard temperature schedules; Boltzmann exploration is preferred in practice for its smooth, differentiable form, not for its worst-case theory.
Complexity and Tradeoffs
Per-round cost: \(O(K)\) to compute the softmax over \(K\) actions, plus \(O(K)\) to sample from it. Memory \(O(K)\) for value estimates.
Limitations
- Sensitive to value scale. The softmax depends on \(\hat{Q} / \tau\), so multiplying all \(Q\)-values by a constant changes the effective temperature. In practice \(\tau\) must be tuned alongside the value scale, which makes transfer across environments awkward.
- Wastes exploration on confidently bad actions. Like \(\varepsilon\)-greedy, Boltzmann exploration uses only the point estimate \(\hat{Q}\), not its uncertainty. An action whose value is well-known and slightly suboptimal still receives non-trivial probability mass.
- No regret guarantees. Unlike UCB and Thompson sampling, there is no clean logarithmic-regret bound for Boltzmann exploration with a generic temperature schedule.
- Numerical issues at low \(\tau\). As \(\tau \to 0\), the logits explode and softmax overflows; standard implementations subtract the maximum logit before exponentiating to maintain stability (demonstrated in the sketch after this list).
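A quick demonstration of the failure mode and the standard max-logit shift (a sketch; exact overflow thresholds depend on float width):

```python
import numpy as np

q = np.array([1.0, 0.7, 0.0])
tau = 1e-3                                 # very low temperature

naive = np.exp(q / tau)                    # exp(1000) overflows to inf (NumPy warns)
print(np.round(naive / naive.sum(), 3))    # -> [nan 0. 0.] after inf/inf

shifted = q / tau - (q / tau).max()        # subtract the max logit first
stable = np.exp(shifted)                   # largest entry is exp(0) = 1
print(np.round(stable / stable.sum(), 3))  # -> [1. 0. 0.] (the greedy limit)
```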
When to Use It
| Situation | Approach |
|---|---|
| Policy-gradient method with softmax / Gaussian policy head | Use the policy distribution directly — Boltzmann exploration is the default. |
| Tabular Q-learning needing smooth action selection | Boltzmann exploration with an annealing schedule. |
| Max-entropy RL (soft AC, soft Q-learning) | Built-in — temperature is a hyperparameter of the algorithm. |
| Stochastic bandit with regret guarantees needed | UCB or Thompson sampling. |
| Simplest possible exploration | \(\varepsilon\)-greedy. |
| Continuous actions, deterministic policy preferred | Add parameter-space noise (e.g., DDPG with exploration noise). |
Variants
- Annealed temperature schedule. \(\tau_t = c / \log t\) or geometric cooling. Standard practice; the analogue of a decaying \(\varepsilon\).
- Gaussian policy. For continuous actions, replace the softmax with \(\pi(a \mid s) = \mathcal{N}(\mu_\theta(s),\, \sigma^2_\theta(s))\). Same maximum-entropy story; the variance plays the role of \(\tau\).
- Mellowmax. A log-sum-exp-based softmax variant that is non-expansive in \(\hat{Q}\), fixing some of Boltzmann’s instability when value estimates change rapidly.
- Max-entropy RL (soft Q-learning, soft actor-critic). Build the entropy bonus into the value function itself: \(V_\text{soft}(s) = \tau \log \sum_a \exp(Q(s,a)/\tau)\). Yields a principled algorithmic family where the Boltzmann policy is exactly optimal (a numeric illustration follows this list).
- Entropy regularization in policy gradients. Add \(-\beta \mathcal{H}(\pi)\) to the policy-gradient loss. Prevents premature collapse of the policy onto a single action, preserving exploration without an external \(\varepsilon\).
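As a small numeric illustration of the soft value function from the max-entropy variant (using the walkthrough's \(\hat{Q}\); the helper name is made up):

```python
import numpy as np

q = np.array([1.0, 0.7, 0.0])

def soft_value(q: np.ndarray, tau: float) -> float:
    """V_soft = tau * log sum_a exp(Q(a)/tau), via a stable log-sum-exp."""
    z = q / tau
    m = z.max()                            # shift by the max logit before exp
    return float(tau * (m + np.log(np.exp(z - m).sum())))

for tau in (0.01, 1.0, 10.0):
    print(tau, round(soft_value(q, tau), 3))
# 0.01 -> 1.0     (approaches the hard max of Q)
# 1.0  -> ~1.746  (max plus an entropy bonus)
# 10.0 -> ~11.56  (bonus dominated by tau * log 3)
```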