UCT Converges to the Minimax Value

Claim

Consider a finite-depth game tree with bounded rewards in \([0, 1]\). At each internal node, UCT selects a child according to

\[ a = \operatorname*{arg\,max}_{c \in \text{children}(s)} \left[ \hat{V}(c) + C \sqrt{\frac{\log N(s)}{N(c)}} \right] \]

for a tuning constant \(C\) (with the convention that unvisited children are tried first). Let \(\hat{V}_n(s)\) denote UCT’s value estimate at node \(s\) after \(n\) simulations. Then

\[ \hat{V}_n(s) \xrightarrow{n \to \infty} V^*(s) \]

in probability for every node \(s\), where \(V^*\) is the minimax value. Moreover, the probability of failing to select an optimal action at the root decays polynomially in \(n\):

\[ \Pr\!\left[ a^*_{\text{root}} \notin \arg\max_{c} \hat{V}_n(c) \right] \to 0 \]

as \(n \to \infty\). This is the consistency theorem of Kocsis and Szepesvári (2006).
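
To make the rule concrete, here is a minimal Python sketch. The `Node` class and its field names are illustrative choices of ours, not from the original paper; the unvisited-first convention shows up as an explicit pre-pass.

```python
import math
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    value: float = 0.0   # running mean of backed-up returns, V-hat(c)
    visits: int = 0      # visit count N(c)
    children: list["Node"] = field(default_factory=list)

def uct_select(node: Node, c_explore: float) -> Node:
    """Return the child maximizing V-hat(c) + C * sqrt(log N(s) / N(c))."""
    # Convention from the claim: unvisited children are tried first
    # (equivalently, an unvisited child has an infinite bonus).
    for child in node.children:
        if child.visits == 0:
            return child
    log_n = math.log(node.visits)
    return max(
        node.children,
        key=lambda c: c.value + c_explore * math.sqrt(log_n / c.visits),
    )
```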

Setup

Each internal node of the tree is treated as its own bandit problem: the children are arms, and the “reward” of pulling child \(c\) is the return of a rollout that descends through \(c\).

The complication relative to plain UCB1 is non-stationarity. At a leaf, returns are i.i.d. and UCB1’s analysis applies directly. At an internal node, however, the return distribution of child \(c\) depends on the policy used inside the subtree below \(c\) — and that policy is itself UCT, which is changing as it learns. Standard concentration inequalities don’t apply because the rewards are not identically distributed across pulls.
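
The source of the drift is easiest to see in the recursion UCT actually runs: the return backed up through a child is generated by UCT's own, still-changing selection policy inside that child's subtree. Below is a minimal single-player sketch reusing `Node` and `uct_select` from above; the random leaf evaluation is a placeholder, and a two-player tree would additionally negate returns at opponent nodes.

```python
import random

def rollout(leaf: Node) -> float:
    # Placeholder leaf evaluation: an i.i.d. draw in [0, 1].
    return random.random()

def simulate(node: Node, c_explore: float) -> float:
    """Run one UCT simulation from `node` and back up the return."""
    if not node.children:
        ret = rollout(node)
    else:
        # The return seen through this child depends on which grandchildren
        # uct_select currently favors, and that preference shifts with the
        # counts: exactly the non-stationarity described above.
        child = uct_select(node, c_explore)
        ret = simulate(child, c_explore)
    node.visits += 1
    node.value += (ret - node.value) / node.visits  # incremental mean
    return ret
```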

The proof proceeds bottom-up by induction on tree depth.

Step 1: Leaves

The base case is a node whose children are all leaves: each child returns i.i.d. samples from a fixed reward distribution, so UCB1's analysis applies verbatim and gives

\[ \mathbb{E}[N_a(n)] \le \frac{8 \log n}{\Delta_a^2} + O(1) \]

for each suboptimal action \(a\), where \(\Delta_a\) is the gap between the optimal value and the value of \(a\). The value estimate then satisfies \(\hat{V}_n \to V^*\) at rate \(O(\sqrt{\log n / n})\), which is UCB1's logarithmic regret restated as an estimation rate.
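
To get a feel for the constants, here is a quick arithmetic check of the leading term under two assumed gap values:

```python
import math

def ucb1_pull_bound(n: int, gap: float) -> float:
    """Leading term of E[N_a(n)] <= 8 log n / gap^2 + O(1)."""
    return 8 * math.log(n) / gap**2

# A gap of 0.1 is resolved slowly: after a million simulations the bound
# still allows roughly 11,000 pulls of the suboptimal arm ...
print(ucb1_pull_bound(10**6, 0.1))   # ~ 11052.4
# ... while a gap of 0.5 needs only a few hundred.
print(ucb1_pull_bound(10**6, 0.5))   # ~ 442.1
```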

Step 2: Drift of the reward distribution at internal nodes

Suppose the consistency claim holds at every node of depth \(> d\). Consider a node \(s\) at depth \(d\) with child \(c\). The returns received from rolling out through \(c\) are not i.i.d., but their conditional means converge to the value \(V^*(c)\) at a known rate.

Concretely, after \(N\) pulls of child \(c\), the empirical mean satisfies

\[ \hat{V}(c) - V^*(c) = \frac{1}{N} \sum_{i=1}^{N} \left( X_i - V^*(c) \right), \]

where \(X_i\) is the return of the \(i\)-th rollout through \(c\). Each \(X_i\) has mean \(V^*(c) + \delta_i\) with \(|\delta_i| \to 0\) as the subtree below \(c\) matures.

The technical core of the argument is showing that the averaged drift \(\bar{\delta}_N = \frac{1}{N}\sum_{i=1}^{N} \delta_i\) also converges to zero, and at a rate compatible with the UCB confidence bonus. The original paper proves a drift-condition concentration inequality that bounds

\[ \Pr\!\left[ |\hat{V}(c) - V^*(c)| \ge \sqrt{\tfrac{C \log n}{N_c(n)}} \right] \]

by a polynomially decaying term, mirroring Hoeffding's inequality but with constants that depend on the depth. The constant under the square root is tied to the tuning constant in the selection rule, which is why \(C\) must be chosen large enough to dominate the drift (see Remarks).
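
The structure behind that bound can be written out. Conditioning each return on the history \(\mathcal{F}_{i-1}\) of play in the subtree (our notation, not the paper's), the deviation splits into a martingale part and the averaged drift:

\[ \hat{V}(c) - V^*(c) = \underbrace{\frac{1}{N} \sum_{i=1}^{N} \bigl( X_i - \mathbb{E}[X_i \mid \mathcal{F}_{i-1}] \bigr)}_{\text{martingale differences}} \;+\; \underbrace{\frac{1}{N} \sum_{i=1}^{N} \delta_i}_{\bar{\delta}_N}. \]

The first term concentrates at the Hoeffding rate by Azuma's inequality; the second is handled by the inductive hypothesis, and requiring the confidence bonus to dominate \(\bar{\delta}_N\) is where the depth-dependent constants enter.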

Step 3: UCB1 analysis with drifting rewards

With this concentration inequality in place, the UCB1 regret proof goes through with worse constants. Specifically, the suboptimal-pull bound becomes

\[ \mathbb{E}[N_a(n)] \le \frac{C(d) \log n}{\Delta_a^2} + O(\log n), \]

where \(C(d)\) grows with the tree depth through the bottom-up induction: each level inflates the constant in the concentration inequality. The extra \(O(\log n)\) over UCB1's \(O(1)\) remainder absorbs the drift contribution.

Multiplying each suboptimal visit count by its gap \(\Delta_a\) and dividing by \(n\) shows that the suboptimal children contribute only \(O(\log n / n)\) bias to the estimate at \(s\); the stochastic fluctuation from Step 2 dominates, giving

\[ \hat{V}_n(s) - V^*(s) = O\!\left( \sqrt{\tfrac{\log n}{n}} \right). \]
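
The step left implicit here is that \(\hat{V}_n(s)\) is the visit-weighted average of the children's estimates. Written out (our rearrangement, with \(a^*\) the optimal child and \(V^*(s) = V^*(a^*)\) at a max node):

\[ \hat{V}_n(s) - V^*(s) = \bigl( \hat{V}(a^*) - V^*(a^*) \bigr) + \sum_{a \neq a^*} \frac{N_a(n)}{n} \bigl( \hat{V}(a) - \hat{V}(a^*) \bigr). \]

The first term is \(O(\sqrt{\log n / n})\) by induction; each weight \(N_a(n)/n\) is \(O(\log n / n)\) by the visit bound above, and the bracketed differences are bounded because rewards lie in \([0, 1]\).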

Step 4: Failure probability of the root action

The root action chosen by UCT after \(n\) simulations is the most-visited child (or, equivalently for the analysis, the child with the highest empirical value). By Step 3, the empirical gap between the optimal child and any suboptimal child is at least \(\Delta - O(\sqrt{\log n / n})\), where \(\Delta\) is the smallest root-level value gap. For sufficiently large \(n\) this is positive, and the probability of selecting a suboptimal root action decays polynomially in \(n\).
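
In code, this final step is a counting argmax on top of the simulation loop from the Setup sketch (again illustrative names, not the paper's pseudocode):

```python
def plan(root: Node, n_sims: int, c_explore: float = 1.4) -> Node:
    """Run n_sims UCT simulations from `root`, then commit to a move."""
    # c_explore near sqrt(2) is a common default; the Remarks note it is
    # tuned per game in practice.
    for _ in range(n_sims):
        simulate(root, c_explore)
    # Most-visited child; once the estimates separate, this agrees with
    # the argmax over empirical values used in the analysis.
    return max(root.children, key=lambda c: c.visits)
```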

Remarks

  • The guarantee is much weaker than UCB1's: the failure probability decays only polynomially, and each level of the tree multiplies the constants in the concentration inequality, so deep trees converge slowly. In practice this is masked by the fact that UCT does not need accurate values at every node; only the root action matters for play.
  • The proof requires the tuning constant \(C\) to be chosen large enough to dominate the drift; in practice \(C\) is treated as a hyperparameter and tuned per game.
  • AlphaGo and AlphaZero use a UCT variant with a learned prior \(P(s, a)\) and value head \(V_\theta\), changing the bonus to \(C \cdot P(s, a) \sqrt{N(s)} / (1 + N(c))\) (PUCT). The consistency proof does not apply directly, but the same bottom-up argument can be adapted; see the sketch after this list.
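
For concreteness, here is the PUCT rule in the same sketch style. The `prior` field is a hypothetical extension of the `Node` above (in AlphaZero it comes from the policy network); this is an illustration, not AlphaZero's implementation.

```python
def puct_select(node: Node, c_puct: float) -> Node:
    """PUCT: bonus C * P(s, a) * sqrt(N(s)) / (1 + N(c))."""
    sqrt_n = math.sqrt(node.visits)
    return max(
        node.children,
        # Unlike UCB1's log-ratio bonus, the prior-weighted bonus never
        # diverges for unvisited children; the prior steers exploration.
        key=lambda c: c.value + c_puct * c.prior * sqrt_n / (1 + c.visits),
    )
```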

References

Kocsis, Levente, and Csaba Szepesvári. 2006. “Bandit Based Monte-Carlo Planning.” In Machine Learning: ECML 2006. Springer Berlin Heidelberg. https://doi.org/10.1007/11871842_29.
Lattimore, Tor, and Csaba Szepesvári. 2020. Bandit Algorithms. Cambridge University Press. https://www.cambridge.org/core/books/bandit-algorithms/8E39FD004E6CE036680F90DD0C6F09FC.