Policy Iteration
Motivation
Policy iteration (Puterman 1994) solves a Markov decision process by alternating between two steps: computing the exact value of the current policy, then improving the policy greedily with respect to that value. Each improvement step is guaranteed not to make the policy worse, and the algorithm converges to the optimal policy in a finite number of iterations.
Problem
Given a finite MDP with states \(S\), actions \(A\), transitions \(P(s' \mid s, a)\), rewards \(R(s, a)\), and discount \(\gamma \in [0, 1)\), find an optimal deterministic policy \(\pi^* : S \to A\) and its value function \(V^*\).
See Bellman Equations for the full MDP notation.
Key Ideas
Alternate evaluation and improvement
Policy iteration is a two-stage loop:
- Policy evaluation computes the exact value \(V^\pi\) of the current policy by solving the linear Bellman expectation equations.
- Policy improvement picks the greedy policy with respect to \(V^\pi\), which (by the policy improvement theorem) is at least as good as \(\pi\).
Improvement is strict whenever the current policy is suboptimal; combined with the finite number of deterministic policies, this guarantees termination at the optimum.
Policy evaluation: exact V via a linear solve
Given a fixed policy \(\pi\), the Bellman expectation equations
\[ V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s') \]
form a linear system with one equation per state. Two options:
- Direct solve. \(V^\pi = (I - \gamma P^\pi)^{-1} R^\pi\), \(O(|S|^3)\) by Gaussian elimination — gives \(V^\pi\) exactly.
- Iterative evaluation. Apply the Bellman expectation operator \(\mathcal{T}^\pi\) repeatedly. Because \(\mathcal{T}^\pi\) is a \(\gamma\)-contraction, this converges geometrically.
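Both evaluation strategies can be sketched in a few lines of NumPy. The array layout here, \(P^\pi\) as an \(|S| \times |S|\) matrix and \(R^\pi\) as a length-\(|S|\) vector, is an assumption of the sketch:

```python
import numpy as np

def evaluate_policy_direct(P_pi, R_pi, gamma):
    """Exact V^pi via the direct solve (I - gamma P^pi) V = R^pi."""
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def evaluate_policy_iterative(P_pi, R_pi, gamma, tol=1e-10):
    """Repeatedly apply the Bellman expectation operator T^pi.
    Converges geometrically because T^pi is a gamma-contraction."""
    V = np.zeros(len(R_pi))
    while True:
        V_new = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

The direct solve is exact in one shot; the iterative version trades exactness for \(O(|S|^2)\) work per sweep.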
Policy improvement: greedy is at least as good
Given \(V^\pi\), the greedy policy with respect to \(V^\pi\) is
\[ \pi'(s) = \operatorname*{arg\,max}_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \right] = \operatorname*{arg\,max}_{a} Q^\pi(s, a). \]
Policy improvement theorem. If \(\pi'\) is the greedy policy with respect to \(V^\pi\), then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s\). If \(V^{\pi'} = V^\pi\), then \(\pi\) is already optimal. (proof)
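The greedy step amounts to computing \(Q^\pi\) and taking a row-wise argmax. A minimal sketch, assuming the transitions are stored as an \((|S|, |A|, |S|)\) tensor `P` and rewards as an \((|S|, |A|)\) matrix `R` (hypothetical layouts for illustration):

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Greedy policy w.r.t. V: pi'(s) = argmax_a Q(s, a).

    Q(s, a) = R(s, a) + gamma * sum_t P(t | s, a) V(t).
    """
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    return np.argmax(Q, axis=1)
```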
Algorithm
initialize π arbitrarily
repeat:
    # Policy evaluation
    solve V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V^π(s') for all s
    # Policy improvement
    π_new(s) = argmax_a [R(s, a) + γ Σ_{s'} P(s'|s, a) V^π(s')] for all s
    if π_new == π: return π, V^π
    π = π_new
Each iteration solves a linear system (or runs iterative evaluation to convergence) and then updates the policy greedily.
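The loop can be sketched as a self-contained NumPy routine; the \((|S|, |A|, |S|)\) transition tensor and \((|S|, |A|)\) reward matrix are assumed layouts for this sketch, not a fixed convention:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration for a finite MDP.

    P: (S, A, S) transition tensor, R: (S, A) reward matrix (assumed layouts).
    Returns the optimal deterministic policy and its value function.
    """
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma P^pi) V = R^pi exactly.
        P_pi = P[np.arange(n_states), pi]     # (S, S)
        R_pi = R[np.arange(n_states), pi]     # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to V.
        Q = R + gamma * P @ V                 # (S, A)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):        # no change => optimal
            return pi, V
        pi = pi_new
```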
Walkthrough
Two-state MDP
Reuse the MDP from value iteration: \(S = \{0, 1\}\), \(A = \{\text{stay}, \text{switch}\}\), \(\gamma = 0.9\).
| \(s\) | \(a\) | \(R(s,a)\) | \(P(0 \mid s,a)\) | \(P(1 \mid s,a)\) |
|---|---|---|---|---|
| 0 | stay | 1 | 1 | 0 |
| 0 | switch | 0 | 0 | 1 |
| 1 | stay | 0 | 0 | 1 |
| 1 | switch | 0 | 1 | 0 |
Initialize \(\pi_0(0) = \pi_0(1) = \text{stay}\).
Iteration 1.
- Evaluate \(\pi_0\): From state \(0\), stay gives reward \(1\) and stays in \(0\) forever, so \(V^{\pi_0}(0) = 1 / (1 - 0.9) = 10\). From state \(1\), stay gives reward \(0\) and stays in \(1\) forever, so \(V^{\pi_0}(1) = 0\).
- Improve: For \(s = 0\): \(Q^{\pi_0}(0, \text{stay}) = 1 + 0.9 \cdot 10 = 10\); \(Q^{\pi_0}(0, \text{switch}) = 0 + 0.9 \cdot 0 = 0\). Stay wins. For \(s = 1\): \(Q^{\pi_0}(1, \text{stay}) = 0\); \(Q^{\pi_0}(1, \text{switch}) = 0 + 0.9 \cdot 10 = 9\). Switch wins. So \(\pi_1(0) = \text{stay}\), \(\pi_1(1) = \text{switch}\).
Iteration 2.
- Evaluate \(\pi_1\): \(V^{\pi_1}(0) = 10\) (unchanged), and \(V^{\pi_1}(1) = 0 + 0.9 \cdot 10 = 9\).
- Improve: For \(s = 0\), stay still wins (\(10\) vs. \(0.9 \cdot 9 = 8.1\)). For \(s = 1\), switch still wins (\(0.9 \cdot 10 = 9\) vs. \(0.9 \cdot 9 = 8.1\)). \(\pi_2 = \pi_1\), so the algorithm terminates.
Optimal policy: stay in \(0\), switch in \(1\). Optimal values: \(V^*(0) = 10\), \(V^*(1) = 9\). Value iteration needed many sweeps to reach the same answer; policy iteration converged in two iterations.
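The walkthrough can be checked numerically. The script below encodes the table above and replays iteration 1; the array layout (action 0 = stay, action 1 = switch) is an assumption of the sketch:

```python
import numpy as np

gamma = 0.9
# States 0 and 1; action 0 = stay, action 1 = switch.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # from state 0: stay, switch
              [[0.0, 1.0], [1.0, 0.0]]])  # from state 1: stay, switch
R = np.array([[1.0, 0.0],
              [0.0, 0.0]])

# Iteration 1: evaluate pi_0 = (stay, stay), then improve greedily.
P_pi0, R_pi0 = P[[0, 1], [0, 0]], R[[0, 1], [0, 0]]
V_pi0 = np.linalg.solve(np.eye(2) - gamma * P_pi0, R_pi0)
assert np.allclose(V_pi0, [10.0, 0.0])    # V^{pi_0} as in the walkthrough

Q_pi0 = R + gamma * P @ V_pi0
assert np.allclose(Q_pi0, [[10.0, 0.0], [0.0, 9.0]])
pi_1 = np.argmax(Q_pi0, axis=1)           # stay in 0, switch in 1
assert list(pi_1) == [0, 1]
```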
Correctness
Termination. Policy iteration terminates in finitely many steps with the optimal policy \(\pi^*\) (proof). Because the improvement step is strict whenever \(\pi\) is suboptimal, and there are at most \(|A|^{|S|}\) deterministic policies, the algorithm cannot cycle and must reach the optimum.
The argument has two pieces. The policy improvement theorem guarantees \(V^{\pi_{k+1}} \ge V^{\pi_k}\) componentwise. If the new policy equals the old, then \(V^{\pi_k}\) satisfies the Bellman optimality equations (the max equals the expectation under \(\pi_k\)), so \(\pi_k = \pi^*\).
Complexity and Tradeoffs
Each iteration: \(O(|S|^3)\) for policy evaluation (direct linear solve), plus \(O(|S|^2 |A|)\) for policy improvement. The number of iterations is typically small — often fewer than 20 even on large MDPs.
For very large state spaces the \(O(|S|^3)\) solve is the bottleneck; modified policy iteration replaces exact evaluation with a fixed number \(m\) of \(\mathcal{T}^\pi\) sweeps. Setting \(m = 1\) recovers value iteration, \(m = \infty\) recovers full policy iteration; intermediate \(m\) values often hit a sweet spot.
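One possible sketch of modified policy iteration, with the same assumed array layouts as before and a hypothetical stopping rule based on the change in \(V\):

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, m=5, tol=1e-8, max_iter=100_000):
    """Modified policy iteration: m sweeps of T^pi instead of an exact solve.

    P: (S, A, S), R: (S, A) (assumed layouts). m = 1 behaves like value
    iteration; large m approaches full policy iteration.
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Greedy policy with respect to the current V.
        Q = R + gamma * P @ V
        pi = np.argmax(Q, axis=1)
        # Partial evaluation: m applications of the Bellman expectation operator.
        P_pi = P[np.arange(n_states), pi]
        R_pi = R[np.arange(n_states), pi]
        V_new = V
        for _ in range(m):
            V_new = R_pi + gamma * P_pi @ V_new
        if np.max(np.abs(V_new - V)) < tol:
            return pi, V_new
        V = V_new
    return pi, V
```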
When to Use It
| Situation | Approach |
|---|---|
| Small MDP, dynamics known | Policy iteration — fewer iterations than value iteration. |
| Medium MDP where linear solve is expensive | Modified policy iteration — finite-step evaluation, see Variants. |
| Iteration count doesn’t matter, want simpler algorithm | Value iteration. |
| Unknown dynamics, sampled experience | Q-learning / SARSA / actor-critic. |
| Continuous state / action | Policy gradient or actor-critic methods. |
Comparison with Value Iteration
| | Value Iteration | Policy Iteration |
|---|---|---|
| Per-iteration cost | \(O(|S|^2|A|)\) | \(O(|S|^3 + |S|^2|A|)\) |
| Iterations to convergence | many (geometric rate \(\gamma\)) | few (often \(< 20\)) |
| Output after each step | approximate \(V^*\) | exact \(V^\pi\) for current \(\pi\) |
| Termination | \(\varepsilon\)-convergence criterion | exact, in finite steps |
Policy iteration tends to converge in far fewer iterations than value iteration, but each iteration is more expensive due to the policy evaluation solve. For small state spaces the linear solve is cheap, making policy iteration the faster method overall.
Variants
- Modified policy iteration. Replace exact evaluation with \(m\) sweeps of \(\mathcal{T}^\pi\). Setting \(m = 1\) recovers value iteration; \(m = \infty\) recovers policy iteration. Tuning \(m\) trades iterations against per-iteration cost.
- Asynchronous policy iteration. Evaluate or improve at a subset of states per iteration; converges as long as every state is touched infinitely often.
- Generalized policy iteration. Any algorithm that alternates partial policy evaluation with partial policy improvement — the abstraction that includes value iteration, actor-critic, and many practical RL algorithms as special cases.