Monte Carlo Tree Search

Motivation

Minimax search — even with alpha-beta pruning — collapses on games with very large branching factors or no good evaluation function. Go has \(b \approx 250\) and famously hard-to-evaluate positions; for decades the best Go programs fell far below human professional level. The breakthrough came from a different idea: instead of expanding the tree uniformly to a fixed depth and applying a hand-crafted evaluator at the leaves, sample trajectories through the game and let the average outcome estimate each move’s value.

Monte Carlo tree search (MCTS) (Kocsis and Szepesvári 2006) does exactly this, with a twist that makes it work: at each internal node it treats action selection as a multi-armed bandit problem, using UCB-style scores to balance exploiting promising moves with exploring under-sampled ones. The result is a search algorithm that needs only a simulator (no evaluation function), works on huge branching factors, and provably converges to the minimax value.

Problem

A two-player zero-sum game with finite states, actions, and depth, supplied as a simulator: a procedure that can play out any policy from any state to a terminal state and return a payoff. No static evaluation function is available, and the branching factor may be too large for any depth-limited exhaustive search.

The task is to choose a good action at a given root state, given a wall-clock or iteration budget rather than a fixed depth.

Key Ideas

MCTS replaces uniform depth expansion with adaptive sampling: visit promising branches more and let bad branches starve. To make this work, three design choices interlock.

Bandit-style action selection (UCT)

At each internal node, MCTS faces the same dilemma as a multi-armed bandit: each child action has an unknown true value that can only be estimated by sampling. The selection rule is UCB1 (Auer et al. 2002) applied per node — UCT in this setting:

\[ a^* = \operatorname*{arg\,max}_{a} \left[ Q(\text{child}_a) + c \sqrt{\frac{\ln N(\text{node})}{N(\text{child}_a)}} \right]. \]

The first term exploits — pick the child with the best estimated value. The second term explores — give a bonus to under-visited children that shrinks as they accumulate samples. The exploration constant \(c\) controls the balance; \(c = \sqrt{2}\) is theoretically motivated, but tuned values are common.
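
A minimal sketch of this score in Python (the function name ucb_score, the parameter layout, and the convention of giving unvisited children an infinite score so they are tried first are illustrative assumptions, not part of UCB1 itself):

import math

def ucb_score(parent_visits: int, child_visits: int, child_value: float,
              c: float = math.sqrt(2)) -> float:
    # UCT score of one child: exploitation term Q plus exploration bonus.
    if child_visits == 0:
        return float("inf")   # unvisited children get tried before any revisits
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return child_value + exploration

The node selects the child with the largest score, so heavily sampled children must keep a high mean value to remain chosen, while lightly sampled children retain a large bonus.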

Asymmetric tree growth

Each iteration deepens the tree by exactly one node, and only along the trajectory UCT picked. Over thousands of iterations, the tree grows deep along promising lines and shallow along bad ones — a sharp contrast with minimax, which expands every line to the same depth. This is what lets MCTS scale to large branching factors: it never has to pay for the bad branches.

Four-phase iteration

Each MCTS iteration repeats four steps, with each node \(n\) storing visit count \(N(n)\), total reward \(W(n)\), and average \(Q(n) = W(n)/N(n)\).

1. Selection. Starting at the root, recursively choose a child by UCT until reaching a leaf of the search tree, i.e., a node that still has untried actions (or is terminal).

2. Expansion. Add one new child to the leaf, representing an unexplored action.

3. Simulation (rollout). From the new child, play out a complete game using a default policy — typically uniform random or a lightweight heuristic — and observe the terminal reward.

4. Backup. Walk back up the path from the new child to the root, incrementing \(N\) and adding the reward to \(W\) at every node along the way. For two-player zero-sum games, alternate the sign of the reward at each level so that the value stored at each node is from the perspective of the player whose move led to that node; a parent can then maximize over its children exactly as in the UCT rule above (see the sketch after this list).
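
As a concrete sketch of this bookkeeping (the Node field names and the negamax-style sign flip are one implementation choice among several, assumed here for illustration), each node carries its visit count, total reward, and a parent pointer, and the backup flips the reward's sign at every level:

from dataclasses import dataclass, field

@dataclass
class Node:
    state: object                                    # game state at this node
    parent: "Node | None" = None
    children: dict = field(default_factory=dict)     # action -> child Node
    untried_actions: list = field(default_factory=list)
    N: int = 0                                       # visit count
    W: float = 0.0                                   # total backed-up reward

    @property
    def Q(self) -> float:
        # Mean value, stored from the perspective of the player whose move
        # led to this node, so a parent can simply maximize over its children.
        return self.W / self.N

def backup(node: "Node | None", reward: float) -> None:
    # `reward` is from the perspective of the player to move at `node.state`.
    while node is not None:
        node.N += 1
        reward = -reward   # flip to the perspective of the player who moved into `node`
        node.W += reward
        node = node.parent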

Algorithm

function MCTS(root_state, time_budget):
    root = Node(root_state)
    while time remaining:
        leaf = TREE-POLICY(root)
        reward = ROLLOUT(leaf.state)
        BACKUP(leaf, reward)
    return action of root's child with highest visit count

function TREE-POLICY(node):
    while node is not terminal:
        if node has unexpanded children:
            return EXPAND(node)
        node = UCB-SELECT(node)
    return node

After the budget elapses, the action returned is the most-visited child of the root — not the highest-\(Q\) child. Visit count is more robust because UCT drives the algorithm to visit moves it estimates as best, and the visit-count metric is less sensitive to noise in the value estimates.
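
Tying the pseudocode above to something executable, here is a sketch of the full loop in Python. It assumes the Node class, backup, and ucb_score helpers from the earlier sketches, plus a game-state object exposing legal_actions(), apply(action), is_terminal(), and terminal_reward() (the payoff from the perspective of the player to move in that terminal position); all of these interface names are assumptions for illustration, not a fixed API.

import random
import time

def ucb_select(node: Node) -> Node:
    # Pick the child with the highest UCT score (ucb_score from the sketch above).
    return max(node.children.values(),
               key=lambda ch: ucb_score(node.N, ch.N, ch.Q))

def expand(node: Node) -> Node:
    # Add one child for a previously untried action and return it.
    action = node.untried_actions.pop()
    next_state = node.state.apply(action)
    child = Node(state=next_state, parent=node,
                 untried_actions=list(next_state.legal_actions()))
    node.children[action] = child
    return child

def tree_policy(node: Node) -> Node:
    # Descend by UCT until reaching a node with untried actions (or a terminal node).
    while not node.state.is_terminal():
        if node.untried_actions:
            return expand(node)
        node = ucb_select(node)
    return node

def rollout(state, rng=random) -> float:
    # Uniform-random playout; returns the payoff from the perspective of the
    # player to move in the starting `state`.
    sign = 1
    while not state.is_terminal():
        state = state.apply(rng.choice(state.legal_actions()))
        sign = -sign
    return sign * state.terminal_reward()

def mcts(root_state, time_budget: float):
    root = Node(state=root_state,
                untried_actions=list(root_state.legal_actions()))
    deadline = time.monotonic() + time_budget
    while time.monotonic() < deadline:
        leaf = tree_policy(root)        # selection + expansion
        reward = rollout(leaf.state)    # simulation
        backup(leaf, reward)            # backup with sign alternation
    # Final move choice: the most-visited root child, not the highest-Q one.
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]

The sign handling in rollout and backup together implement the alternating-perspective convention from the backup step; implementations that instead tag each node with an absolute player index are equally common.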

Walkthrough

One MCTS iteration

Consider a small tree where root \(R\) has two children: \(A\), already expanded and visited once (through its child \(A_1\)), and \(B\), not yet expanded. One iteration of MCTS proceeds as follows:

1. Select. Starting from \(R\), UCB picks \(B\), the unexpanded leaf.

2. Expand. A new child \(B_1\) is added under \(B\) (one untried action).

3. Simulate. A random rollout from \(B_1\) returns a reward.

4. Backup. At \(B_1\), \(B\), and \(R\): \(N\) += 1 and the reward is added to \(W\).

The four phases form one iteration; thousands or millions are run before the algorithm picks a move. Each iteration deepens the tree by exactly one node (the expansion) and widens the visit counts on the path back to the root. Over time the tree grows asymmetrically — promising branches receive many visits and become deep, while bad branches stay shallow.
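
To exercise the sketch end to end, a toy take-away game can stand in for the simulator (this Nim class is purely illustrative and not part of the walkthrough above): players alternately remove one to three stones, and whoever takes the last stone wins.

class Nim:
    def __init__(self, stones: int = 10):
        self.stones = stones
    def legal_actions(self):
        return [k for k in (1, 2, 3) if k <= self.stones]
    def apply(self, k):
        return Nim(self.stones - k)
    def is_terminal(self):
        return self.stones == 0
    def terminal_reward(self):
        # The player to move faces an empty pile: the opponent took the last
        # stone and won, so the payoff for the player to move is -1.
        return -1.0

best = mcts(Nim(10), time_budget=1.0)
print(best)   # with enough iterations this should tend toward 2

From 10 stones the winning move is to leave a multiple of four, so a well-converged search should prefer taking 2.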

Correctness

Convergence to minimax (Kocsis and Szepesvári 2006). As the number of iterations \(n \to \infty\), the probability of selecting a suboptimal root action goes to zero, and the value estimate at the root converges to the minimax value.

The proof leverages UCB1’s regret guarantee at each node: with high probability, every suboptimal action is sampled \(O(\log n)\) times, so the empirical mean at each node concentrates on the true value. Backed up to the root, this gives consistent value estimates.
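
For concreteness, the finite-time UCB1 bound (Auer et al. 2002), stated for rewards in \([0, 1]\), limits the expected number of plays of a suboptimal arm \(a\) with suboptimality gap \(\Delta_a\) after \(n\) rounds:

\[ \mathbb{E}[N_a(n)] \le \frac{8 \ln n}{\Delta_a^2} + 1 + \frac{\pi^2}{3}. \]

This is the \(O(\log n)\) sampling rate that the UCT analysis adapts, with care for the non-stationary rewards seen at internal nodes, at every level of the tree.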

The bound is asymptotic and the constants are bad in the worst case — adversarial trees can force MCTS to spend exponential effort before concentrating on the optimal move. In practice, however, UCT converges quickly on real game trees, and the algorithm is anytime: more iterations always improve the answer, and it returns a usable result at any point.

Complexity and Tradeoffs

Unlike minimax, MCTS has no \(O(b^d)\) cost formula: each iteration runs one rollout, which is \(O(d)\) for game depth \(d\), plus selection and backup along the current tree path. Total work scales linearly with the number of iterations, and quality scales with the iteration budget — there is no fixed depth or branching factor in the analysis.

Strengths.

  • Domain-independent. Needs only a simulator. No domain-specific evaluation function.
  • Anytime. Returns a usable answer after any number of iterations.
  • Asymmetric. Spends iterations on promising branches, ignoring obviously bad moves entirely.
  • Handles large branching factors. Performance degrades gracefully where alpha-beta collapses.
  • Naturally parallel. Many threads can run rollouts concurrently with light coordination.

Weaknesses.

  • Sample inefficiency. Random rollouts waste most of their effort on irrelevant lines.
  • Tactical blindness. A long forced sequence (e.g., a chess combination) may be missed because rollouts dilute it with random play.
  • Sensitivity to the rollout policy. Better default policies dramatically improve quality.

When to Use It

Situation and recommendation:

  • Huge branching factor, no good evaluator (Go, general-game playing): yes — the default choice.
  • Moderate branching factor, strong evaluator available (chess): prefer alpha-beta; MCTS is only competitive with strong neural-network priors.
  • Anytime constraint, hard wall-clock deadline: yes — MCTS yields a best-effort answer at any budget.
  • Pure tactics, narrow forced sequences: risky — alpha-beta is more reliable.
  • Stochastic transitions, planning under uncertainty: yes — MCTS extends naturally to MDPs and POMDPs.

Variants

AlphaGo and AlphaZero. The single most influential MCTS application is DeepMind’s Go program. AlphaGo (2016) replaced the pieces MCTS handles poorly with neural networks: a value network supplements the random rollout with a learned position evaluation (the two are blended at the leaves), and a policy network biases the UCT tree policy toward moves the network rates highly. AlphaGo Zero (2017) and AlphaZero (2018) eliminated the rollout entirely: the value network alone supplies leaf evaluations, and the policy network shapes the prior. The training loop is itself MCTS — self-play games guided by MCTS produce training targets for the networks, which in turn improve future MCTS searches. The same architecture mastered Go, chess, and shogi from scratch.
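
A common way to write the prior-biased selection rule (the PUCT form used in the AlphaGo Zero line of work, with \(P(s, a)\) the policy network's prior and \(c_{\text{puct}}\) a tuned constant) is

\[ a^* = \operatorname*{arg\,max}_{a} \left[ Q(s, a) + c_{\text{puct}} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)} \right], \]

so the prior concentrates exploration on moves the network favors, while the \(1 + N(s, a)\) denominator still decays the bonus as a move accumulates visits.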

The general lesson: MCTS is a flexible scaffolding. Its abstract phases — select via bandit, expand, evaluate, back up — accommodate hand-coded heuristics, learned value functions, learned policies, or any combination, with the bandit-style selection providing a principled exploration mechanism throughout.

References

Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. “Finite-Time Analysis of the Multiarmed Bandit Problem.” Machine Learning 47 (2-3): 235–56. https://doi.org/10.1023/a:1013689704352.
Kocsis, Levente, and Csaba Szepesvári. 2006. “Bandit Based Monte-Carlo Planning.” In Machine Learning: ECML 2006. Springer Berlin Heidelberg. https://doi.org/10.1007/11871842_29.