Rollouts

Motivation

To search well from a nonterminal state, a planner needs to estimate that state’s value — how good is it on average under optimal or near-optimal play? In many domains (board games, simulations) no hand-crafted evaluation function is available, and constructing one requires deep domain expertise. A rollout offers a domain-independent alternative: simply run the game to completion under a fixed default policy and use the terminal reward as the value estimate. This idea requires only a simulator and makes Monte Carlo tree search applicable to new domains without any domain-specific evaluation function.

Definition

A rollout (also called a playout or simulation) is a trajectory generated from a given state to a terminal state by following a fixed rollout policy \(\pi_0\) (also called the default policy). The terminal reward is then used as an estimate of the starting state’s value.

Formally, let \(s\) be a nonterminal state. A single rollout produces a trajectory

\[ s = s_0,\ a_1, s_1, a_2, s_2, \ldots, s_T \]

where each action \(a_t \sim \pi_0(\cdot \mid s_{t-1})\) and \(s_T\) is a terminal state with payoff \(R(s_T)\). Averaging over \(m\) independent rollouts gives an empirical value estimate:

\[ \hat{V}(s) = \frac{1}{m} \sum_{i=1}^{m} R(s_T^{(i)}). \]
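
To make the estimator concrete, here is a minimal Python sketch. The simulator interface (`legal_actions`, `step`, `is_terminal`, `reward`) and the uniform-random default policy are illustrative assumptions, not part of any particular library.

```python
import random

def random_policy(state, simulator):
    """Uniform-random default policy pi_0 over the legal actions in `state`."""
    return random.choice(simulator.legal_actions(state))

def rollout(state, simulator, policy=random_policy):
    """Follow the rollout policy from `state` to a terminal state s_T and
    return the terminal payoff R(s_T)."""
    while not simulator.is_terminal(state):
        action = policy(state, simulator)
        state = simulator.step(state, action)
    return simulator.reward(state)

def rollout_value(state, simulator, m=100, policy=random_policy):
    """Empirical value estimate: average of m independent rollouts from `state`."""
    return sum(rollout(state, simulator, policy) for _ in range(m)) / m
```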

Bias and Variance

The quality of a rollout estimate depends on the rollout policy:

Bias. If \(\pi_0\) is far from optimal, trajectories systematically miss high-reward regions of the game tree, so \(\hat{V}(s)\) is a biased estimate of the optimal (minimax) value \(V^*(s)\). A stronger rollout policy reduces this bias.

Variance. Rollout outcomes are noisy: a single game may be dominated by chance events or by one-off opponent blunders. More rollouts reduce variance at the cost of computation.

There is a fundamental tradeoff: a more sophisticated rollout policy improves accuracy but takes longer to execute, reducing the number of rollouts per unit time. In practice, lightweight rollout policies — uniform random play or a simple hand-crafted heuristic — often work well when the number of rollouts per decision is large.
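
One way to make the tradeoff concrete is to fix a wall-clock budget per decision rather than a rollout count: a slower but stronger rollout policy then simply yields fewer samples within the budget. The sketch below reuses the hypothetical `rollout` helper and simulator interface from above.

```python
import time

def rollout_value_budgeted(state, simulator, policy, budget_s=0.1):
    """Run as many rollouts as fit within `budget_s` seconds and average them.
    A more expensive policy lowers bias per rollout but fits fewer rollouts,
    so the variance of the average shrinks more slowly."""
    total, n = 0.0, 0
    deadline = time.monotonic() + budget_s
    while n == 0 or time.monotonic() < deadline:
        total += rollout(state, simulator, policy)
        n += 1
    return total / n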

Monte Carlo Return vs. Temporal Difference

Both rollouts and temporal difference (TD) learning estimate state values, but they differ in their update targets:

  • A rollout uses the actual terminal reward \(R(s_T)\) as its target (a Monte Carlo return). This is an unbiased sample of the return under the policy being followed, but it has high variance.
  • TD learning bootstraps: it uses a one-step reward plus a value-function estimate of the next state, \(r + \gamma \hat{V}(s')\). This introduces bias from the approximation but can have much lower variance.

The two approaches are endpoints on the TD(\(\lambda\)) spectrum (Sutton and Barto 2018): \(\lambda = 1\) corresponds to full Monte Carlo rollouts; \(\lambda = 0\) corresponds to one-step TD. Intermediate \(\lambda\) blends both.
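
For concreteness, here is a sketch of the TD(\(\lambda\)) target computed from a stored episode. The list-based representation is an illustrative assumption: `rewards[t]` is the reward observed after step \(t\), and `values[t]` is the current value estimate of the state visited at step \(t\) (non-terminal states only).

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return G_t^(n): discounted rewards for n steps, then bootstrap
    from the value estimate of the state reached (0 at or past the terminal)."""
    T = len(rewards)                      # episode ends after T steps
    steps = min(n, T - t)
    g = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:
        g += gamma**n * values[t + n]     # TD-style bootstrapped tail
    return g

def lambda_return(rewards, values, t, lam, gamma=1.0):
    """TD(lambda) target: a (1 - lam)-weighted mixture of n-step returns,
    with the full Monte Carlo return absorbing the remaining weight."""
    T = len(rewards)
    g = 0.0
    for n in range(1, T - t):
        g += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    g += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g
```

Setting `lam=0.0` recovers the one-step TD target \(r + \gamma \hat{V}(s')\), while `lam=1.0` recovers the full Monte Carlo return used by a rollout.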

References

Kocsis, Levente, and Csaba Szepesvári. 2006. “Bandit Based Monte-Carlo Planning.” In Machine Learning: ECML 2006. Springer Berlin Heidelberg. https://doi.org/10.1007/11871842_29.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. 2nd ed. MIT Press. https://mitpress.mit.edu/9780262039246/reinforcement-learning/.