Policy Iteration Converges to the Optimal Policy

Claims

Throughout, the MDP is finite, rewards are bounded, the discount factor satisfies \(\gamma \in [0, 1)\), and policies are deterministic and stationary.

  1. Policy improvement theorem. If \(\pi'\) is the greedy policy with respect to \(V^\pi\), then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s \in S\).

  2. Termination. Policy iteration terminates in finitely many steps at the optimal policy \(\pi^*\).

(Puterman 1994; Sutton and Barto 2018)
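For concreteness, here is a minimal sketch of tabular policy iteration in Python. The array layout (`P[s, a, s']` for transition probabilities, `R[s, a]` for rewards) and the helper names `policy_evaluation`, `greedy_policy`, and `policy_iteration` are illustrative choices, not notation from the references; ties in the argmax are broken by lowest action index.

```python
import numpy as np

def policy_evaluation(P, R, gamma, pi):
    """Exact evaluation: solve the finite linear system (I - gamma * P_pi) V = R_pi."""
    S = R.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, pi]                     # (S, S): row s is P(. | s, pi(s))
    R_pi = R[idx, pi]                     # (S,):   R(s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def greedy_policy(P, R, gamma, V):
    """One-step lookahead: argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    Q = R + gamma * P @ V                 # (S, A)
    return Q.argmax(axis=1)               # ties broken by lowest action index

def policy_iteration(P, R, gamma):
    S, _ = R.shape
    pi = np.zeros(S, dtype=int)           # arbitrary initial deterministic policy
    while True:
        V = policy_evaluation(P, R, gamma, pi)
        pi_new = greedy_policy(P, R, gamma, V)
        if np.array_equal(pi_new, pi):    # pi is its own greedy policy: done
            return pi, V
        pi = pi_new
```

Exact evaluation by a linear solve is what makes the finite-termination argument in Claim 2 apply directly; with approximate (iterative) evaluation the termination analysis changes.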

Proof of Claim 1

Since \(\pi'\) is greedy with respect to \(V^\pi\):

\[ Q^\pi(s, \pi'(s)) = \max_{a} Q^\pi(s, a) \geq Q^\pi(s, \pi(s)) = V^\pi(s) \quad \forall s. \]

Unroll the definition of \(Q^\pi\) one step:

\[ V^\pi(s) \leq Q^\pi(s, \pi'(s)) = R(s, \pi'(s)) + \gamma \sum_{s'} P(s' \mid s, \pi'(s))\, V^\pi(s'). \]

Apply the same inequality to \(V^\pi(s')\):

\[ V^\pi(s) \leq R(s, \pi'(s)) + \gamma \sum_{s'} P(s' \mid s, \pi'(s))\left[ R(s', \pi'(s')) + \gamma \sum_{s''} P(s'' \mid s', \pi'(s'))\, V^\pi(s'') \right]. \]

Continuing to unroll for \(T\) steps leaves a remainder term \(\gamma^T\, \mathbb{E}[V^\pi(s_T)]\), which vanishes as \(T \to \infty\) because \(\gamma < 1\) and \(V^\pi\) is bounded. Hence

\[ V^\pi(s) \leq \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi'(s_t)) \;\middle|\; s_0 = s\right] = V^{\pi'}(s). \quad \square \]
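The inequality chain can also be checked numerically. The sketch below uses a randomly generated finite MDP with the same illustrative array layout as above (`P[s, a, s']`, `R[s, a]`, discount `gamma` < 1): it verifies the one-step greedy inequality and then observes that repeatedly applying the Bellman operator for \(\pi'\) to \(V^\pi\) produces a nondecreasing sequence converging to \(V^{\pi'}\).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)       # make each P(. | s, a) a distribution
R = rng.random((S, A))
idx = np.arange(S)

def evaluate(pi):
    return np.linalg.solve(np.eye(S) - gamma * P[idx, pi], R[idx, pi])

pi = rng.integers(A, size=S)            # an arbitrary deterministic policy
V_pi = evaluate(pi)

Q_pi = R + gamma * P @ V_pi             # Q^pi(s, a)
pi_greedy = Q_pi.argmax(axis=1)         # greedy policy w.r.t. V^pi

# One-step inequality: Q^pi(s, pi'(s)) >= V^pi(s) for every state
assert np.all(Q_pi[idx, pi_greedy] >= V_pi - 1e-12)

# Unrolling: iterating the Bellman operator for pi' starting from V^pi
# gives a nondecreasing sequence that converges to V^{pi'}
V = V_pi.copy()
for _ in range(200):
    V_next = R[idx, pi_greedy] + gamma * P[idx, pi_greedy] @ V
    assert np.all(V_next >= V - 1e-12)  # monotone at every unrolling step
    V = V_next

assert np.allclose(V, evaluate(pi_greedy), atol=1e-6)
assert np.all(evaluate(pi_greedy) >= V_pi - 1e-12)   # Claim 1
```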

Proof of Claim 2

There are at most \(|A|^{|S|}\) deterministic policies, so the sequence \(\pi^{(0)}, \pi^{(1)}, \ldots\) takes values in a finite set.

Suppose at step \(k\) the improvement step produces \(\pi^{(k+1)} \neq \pi^{(k)}\). By Claim 1, \(V^{\pi^{(k+1)}}(s) \geq V^{\pi^{(k)}}(s)\) for all \(s\). Adopt the usual convention that the improvement step keeps \(\pi^{(k)}(s)\) whenever that action already attains the maximum; then a change of policy means there is at least one state \(s\) with

\[ Q^{\pi^{(k)}}(s, \pi^{(k+1)}(s)) = \max_a Q^{\pi^{(k)}}(s, a) > Q^{\pi^{(k)}}(s, \pi^{(k)}(s)) = V^{\pi^{(k)}}(s). \]

Combining this with the inequality chain from the proof of Claim 1 gives \(V^{\pi^{(k+1)}}(s) \geq Q^{\pi^{(k)}}(s, \pi^{(k+1)}(s)) > V^{\pi^{(k)}}(s)\) at that state, so \(V^{\pi^{(k+1)}} \neq V^{\pi^{(k)}}\): the value function strictly improves at at least one state whenever the policy changes.

Because each policy determines its value function uniquely and the value strictly improves whenever the policy changes, no policy can be visited twice. Since there are only finitely many deterministic policies, the algorithm terminates.
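A quick numerical illustration of this counting argument, again on a randomly generated MDP with the illustrative layout used above: it records every visited policy, checks that the value improves strictly at some state whenever the policy changes (ties are negligible with continuous random rewards), and confirms that no policy is visited twice.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 6, 4, 0.95
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
idx = np.arange(S)

def evaluate(pi):
    return np.linalg.solve(np.eye(S) - gamma * P[idx, pi], R[idx, pi])

pi = np.zeros(S, dtype=int)
visited = set()
iterations = 0
while True:
    V = evaluate(pi)
    assert tuple(pi) not in visited      # no policy is ever visited twice
    visited.add(tuple(pi))
    iterations += 1
    pi_new = (R + gamma * P @ V).argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break
    V_new = evaluate(pi_new)
    assert np.all(V_new >= V - 1e-12)    # weak improvement everywhere
    assert np.any(V_new > V + 1e-12)     # strict improvement at some state
    pi = pi_new

print(f"terminated after {iterations} evaluations; {A**S} deterministic policies exist")
```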

At termination \(\pi' = \pi\), meaning \(\pi\) is its own greedy policy:

\[ V^\pi(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \right] \quad \forall s. \]

This is exactly the Bellman optimality equation. Since the Bellman optimality operator is a \(\gamma\)-contraction, this equation has a unique solution \(V^*\); hence \(V^\pi = V^*\) and \(\pi = \pi^*\). \(\square\)
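As a final sanity check, the sketch below (same illustrative setup as the earlier snippets) runs policy iteration to termination and verifies that the resulting value function has numerically zero Bellman optimality residual and matches the fixed point computed by value iteration.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
idx = np.arange(S)

# Policy iteration, as sketched earlier
pi = np.zeros(S, dtype=int)
while True:
    V = np.linalg.solve(np.eye(S) - gamma * P[idx, pi], R[idx, pi])
    pi_new = (R + gamma * P @ V).argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new

# At termination, V(s) should equal max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
residual = np.max(np.abs(V - (R + gamma * P @ V).max(axis=1)))
assert residual < 1e-10

# Cross-check: value iteration converges to the unique fixed point V*
V_star = np.zeros(S)
for _ in range(2000):
    V_star = (R + gamma * P @ V_star).max(axis=1)
assert np.allclose(V, V_star, atol=1e-6)
```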

References

Puterman, Martin L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9780470316887.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. 2nd ed. MIT Press. https://mitpress.mit.edu/9780262039246/reinforcement-learning/.