Policy Iteration

Motivation

Policy iteration (Puterman 1994) solves a Markov decision process by alternating between two steps: computing the exact value of the current policy, then improving the policy greedily with respect to that value. Each improvement step is guaranteed not to make the policy worse, and the algorithm converges to the optimal policy in a finite number of iterations.

Problem

Given a finite MDP with states \(S\), actions \(A\), transitions \(P(s' \mid s, a)\), rewards \(R(s, a)\), and discount \(\gamma \in [0, 1)\), find an optimal deterministic policy \(\pi^* : S \to A\) and its value function \(V^*\).

See Bellman Equations for the full MDP notation.

Key Ideas

Alternate evaluation and improvement

Policy iteration is a two-stage loop:

  • Policy evaluation computes the exact value \(V^\pi\) of the current policy by solving the linear Bellman expectation equations.
  • Policy improvement picks the greedy policy with respect to \(V^\pi\), which (by the policy improvement theorem) is at least as good as \(\pi\).

Strict improvement whenever the current policy is suboptimal, combined with the finite number of deterministic policies, guarantees termination at the optimum.

Diagram: evaluate \(\pi\) to compute \(V^\pi\); improve \(\pi\) to the greedy policy with respect to \(V^\pi\); repeat, stopping when the improvement step leaves \(\pi\) unchanged.

Policy evaluation: exact V via a linear solve

Given a fixed policy \(\pi\), the Bellman expectation equations

\[ V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s') \]

form a linear system with one equation per state. Two options:

  • Direct solve. \(V^\pi = (I - \gamma P^\pi)^{-1} R^\pi\), \(O(|S|^3)\) by Gaussian elimination — gives \(V^\pi\) exactly (see the sketch after this list).
  • Iterative evaluation. Apply the Bellman expectation operator \(\mathcal{T}^\pi\) repeatedly. Because \(\mathcal{T}^\pi\) is a \(\gamma\)-contraction, this converges geometrically.
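
A minimal sketch of the direct solve, assuming the MDP is stored as NumPy arrays; the names P, R, and evaluate_policy are illustrative, not fixed by anything above. Here P has shape \(|S| \times |A| \times |S|\) and R has shape \(|S| \times |A|\).

import numpy as np

def evaluate_policy(P, R, policy, gamma):
    # P[s, a, s'] = P(s' | s, a), R[s, a] = R(s, a), policy[s] = action chosen in s.
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), policy]   # (|S|, |S|) transition matrix under pi
    R_pi = R[np.arange(n_states), policy]   # (|S|,) reward vector under pi
    # Direct solve of (I - gamma P^pi) V = R^pi.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)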

Policy improvement: greedy is at least as good

Given \(V^\pi\), the greedy policy with respect to \(V^\pi\) is

\[ \pi'(s) = \operatorname*{arg\,max}_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \right] = \operatorname*{arg\,max}_{a} Q^\pi(s, a). \]

Policy improvement theorem. If \(\pi'\) is the greedy policy with respect to \(V^\pi\), then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s\). If \(V^{\pi'} = V^\pi\), then \(\pi\) is already optimal. (proof)
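
Continuing the NumPy sketch above, the greedy step reduces to one argmax per state over action values computed from \(V^\pi\); the helper name greedy_policy is illustrative.

import numpy as np

def greedy_policy(P, R, V, gamma):
    # Q[s, a] = R(s, a) + gamma * sum_s' P(s' | s, a) V(s'), computed for all (s, a) at once.
    Q = R + gamma * P @ V
    return np.argmax(Q, axis=1)   # greedy action index for each state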

Algorithm

initialize π arbitrarily
repeat:
    # Policy evaluation
    solve V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V^π(s')    for all s

    # Policy improvement
    π_new(s) = argmax_a [R(s, a) + γ Σ_{s'} P(s'|s, a) V^π(s')]   for all s
    if π_new == π: return π, V^π
    π = π_new

Each iteration solves a linear system (or runs iterative evaluation to convergence) and then updates the policy greedily.
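
Putting the two steps together, a compact sketch of the loop under the same assumed array layout (P of shape \(|S| \times |A| \times |S|\), R of shape \(|S| \times |A|\)); this is one way to realize the pseudocode above, not the only one.

import numpy as np

def policy_iteration(P, R, gamma):
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # Policy evaluation: exact solve of the linear Bellman expectation system.
        P_pi = P[np.arange(n_states), policy]
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

        # Policy improvement: act greedily with respect to V^pi.
        new_policy = np.argmax(R + gamma * P @ V, axis=1)

        if np.array_equal(new_policy, policy):        # greedy step changes nothing: optimal
            return policy, V
        policy = new_policy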

Walkthrough

Two-state MDP

Reuse the MDP from value iteration: \(S = \{0, 1\}\), \(A = \{\text{stay}, \text{switch}\}\), \(\gamma = 0.9\).

  • \(s = 0\), stay: reward \(1\), next state \(0\) with probability \(1\).
  • \(s = 0\), switch: reward \(0\), next state \(1\) with probability \(1\).
  • \(s = 1\), stay: reward \(0\), next state \(1\) with probability \(1\).
  • \(s = 1\), switch: reward \(0\), next state \(0\) with probability \(1\).

Initialize \(\pi_0(0) = \pi_0(1) = \text{stay}\).

Iteration 1.

  • Evaluate \(\pi_0\): From state \(0\), stay gives reward \(1\) and stays in \(0\) forever, so \(V^{\pi_0}(0) = 1 / (1 - 0.9) = 10\). From state \(1\), stay gives reward \(0\) and stays in \(1\) forever, so \(V^{\pi_0}(1) = 0\).
  • Improve: For \(s = 0\): \(Q^{\pi_0}(0, \text{stay}) = 1 + 0.9 \cdot 10 = 10\); \(Q^{\pi_0}(0, \text{switch}) = 0 + 0.9 \cdot 0 = 0\). Stay wins. For \(s = 1\): \(Q^{\pi_0}(1, \text{stay}) = 0\); \(Q^{\pi_0}(1, \text{switch}) = 0 + 0.9 \cdot 10 = 9\). Switch wins. So \(\pi_1(0) = \text{stay}\), \(\pi_1(1) = \text{switch}\).

Iteration 2.

  • Evaluate \(\pi_1\): \(V^{\pi_1}(0) = 10\) (unchanged), and \(V^{\pi_1}(1) = 0 + 0.9 \cdot 10 = 9\).
  • Improve: For \(s = 0\), stay still wins (\(10\) vs. \(0.9 \cdot 9 = 8.1\)). For \(s = 1\), switch still wins (\(9\) vs. \(0\)). \(\pi_2 = \pi_1\) — algorithm terminates.

Optimal policy: stay in \(0\), switch in \(1\). Optimal values: \(V^*(0) = 10\), \(V^*(1) = 9\). Value iteration needed many sweeps to reach the same answer; policy iteration converged in two iterations.
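
The two evaluation steps of the walkthrough can be checked numerically with the direct linear solve; a self-contained NumPy snippet with ad hoc variable names:

import numpy as np

gamma = 0.9
I = np.eye(2)

# pi_0: stay in both states.
P_pi0 = np.array([[1.0, 0.0], [0.0, 1.0]])
R_pi0 = np.array([1.0, 0.0])
print(np.linalg.solve(I - gamma * P_pi0, R_pi0))   # [10.  0.]

# pi_1: stay in state 0, switch in state 1.
P_pi1 = np.array([[1.0, 0.0], [1.0, 0.0]])
R_pi1 = np.array([1.0, 0.0])
print(np.linalg.solve(I - gamma * P_pi1, R_pi1))   # [10.  9.]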

Correctness

Termination. Policy iteration terminates in finitely many steps with the optimal policy \(\pi^*\) (proof). Because the improvement step is strict whenever \(\pi\) is suboptimal, and there are at most \(|A|^{|S|}\) deterministic policies, the algorithm cannot cycle and must reach the optimum.

The argument has two pieces. The policy improvement theorem guarantees \(V^{\pi_{k+1}} \ge V^{\pi_k}\) componentwise. If the new policy equals the old, then \(V^{\pi_k}\) satisfies the Bellman optimality equations (the max equals the expectation under \(\pi_k\)), so \(\pi_k = \pi^*\).
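
For the first piece, the standard unrolling argument runs as follows (a sketch; see the linked proof for details). Since \(V^\pi(s) = Q^\pi(s, \pi(s)) \le \max_a Q^\pi(s, a) = Q^\pi(s, \pi'(s))\), substituting the same bound on each successor state gives

\[ V^\pi(s) \le R(s, \pi'(s)) + \gamma \sum_{s'} P(s' \mid s, \pi'(s))\, V^\pi(s') \le R(s, \pi'(s)) + \gamma \sum_{s'} P(s' \mid s, \pi'(s)) \Big[ R(s', \pi'(s')) + \gamma \sum_{s''} P(s'' \mid s', \pi'(s'))\, V^\pi(s'') \Big] \le \cdots \le V^{\pi'}(s), \]

because in the limit the right-hand side is exactly the discounted return of following \(\pi'\) forever.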

Complexity and Tradeoffs

Each iteration: \(O(|S|^3)\) for policy evaluation (direct linear solve), plus \(O(|S|^2 |A|)\) for policy improvement. The number of iterations is typically small — often fewer than 20 even on large MDPs.

For very large state spaces the \(O(|S|^3)\) solve is the bottleneck; modified policy iteration replaces exact evaluation with a fixed number \(m\) of \(\mathcal{T}^\pi\) sweeps. Setting \(m = 1\) recovers value iteration, \(m = \infty\) recovers full policy iteration; intermediate \(m\) values often hit a sweet spot.
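
A sketch of the modified evaluation step, i.e. \(m\) applications of \(\mathcal{T}^\pi\) starting from the previous value estimate, under the same assumed array layout as the earlier sketches:

import numpy as np

def evaluate_policy_m_sweeps(P, R, policy, gamma, V, m):
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), policy]
    R_pi = R[np.arange(n_states), policy]
    for _ in range(m):
        V = R_pi + gamma * P_pi @ V     # one sweep of the Bellman expectation operator T^pi
    return V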

When to Use It

  • Small MDP with known dynamics: policy iteration, which needs fewer iterations than value iteration.
  • Medium MDP where the linear solve is expensive: modified policy iteration (finite-step evaluation; see Variants).
  • Iteration count doesn't matter and a simpler algorithm is preferred: value iteration.
  • Unknown dynamics, only sampled experience: Q-learning, SARSA, or actor-critic.
  • Continuous state or action spaces: policy gradient or actor-critic methods.

Comparison with Value Iteration

  • Per-iteration cost: \(O(|S|^2|A|)\) for value iteration vs. \(O(|S|^3 + |S|^2|A|)\) for policy iteration.
  • Iterations to convergence: many for value iteration (geometric rate \(\gamma\)) vs. few for policy iteration (often fewer than 20).
  • Output after each step: an approximation of \(V^*\) vs. the exact \(V^\pi\) of the current policy.
  • Termination: an \(\varepsilon\)-convergence criterion vs. exact termination in finitely many steps.

Policy iteration tends to converge in far fewer iterations than value iteration, but each iteration is more expensive due to the policy evaluation solve. For small state spaces the linear solve is cheap, making policy iteration the faster method overall.

Variants

  • Modified policy iteration. Replace exact evaluation with \(m\) sweeps of \(\mathcal{T}^\pi\). Setting \(m = 1\) recovers value iteration; \(m = \infty\) recovers policy iteration. Tuning \(m\) trades iterations against per-iteration cost.
  • Asynchronous policy iteration. Evaluate or improve at a subset of states per iteration; converges as long as every state is touched infinitely often.
  • Generalized policy iteration. Any algorithm that alternates partial policy evaluation with partial policy improvement — the abstraction that includes value iteration, actor-critic, and many practical RL algorithms as special cases.

References

Puterman, Martin L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9780470316887.