Policy Iteration Converges to the Optimal Policy
Claims
Policy improvement theorem. If \(\pi'\) is the greedy policy with respect to \(V^\pi\), then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s \in S\).
Termination. Policy iteration terminates in finitely many steps at the optimal policy \(\pi^*\).
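The evaluate/improve loop these claims describe can be sketched in Python. Everything below (the array layout, the helper names, and the 2-state example MDP) is an illustrative assumption, not something from the text; it is a minimal sketch of tabular policy iteration, assuming known dynamics \(P\) and rewards \(R\).

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Exact V^pi: solve the linear system (I - gamma * P_pi) V = R_pi.
    P: (S, A, S) transition probabilities; R: (S, A) rewards; pi: (S,) actions."""
    S = R.shape[0]
    P_pi = P[np.arange(S), pi]            # (S, S) transitions under pi
    R_pi = R[np.arange(S), pi]            # (S,)  rewards under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def policy_iteration(P, R, gamma):
    S, A = R.shape
    pi = np.zeros(S, dtype=int)           # arbitrary initial deterministic policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)   # evaluation step
        Q = R + gamma * P @ V                    # Q^pi(s, a) for all s, a
        # Greedy improvement; keep the current action on ties so that any
        # change in the policy corresponds to a strict improvement.
        pi_new = np.where(Q.max(axis=1) > Q[np.arange(S), pi] + 1e-12,
                          Q.argmax(axis=1), pi)
        if np.array_equal(pi_new, pi):           # pi' = pi: terminate
            return pi, V
        pi = pi_new

# Illustrative 2-state, 2-action MDP: action 1 moves to state 1,
# which pays reward 1 under action 1; all other rewards are 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 0.0], [0.0, 1.0]])
pi_star, V_star = policy_iteration(P, R, gamma=0.9)
# pi_star = [1, 1]; V_star = [9, 10], since V*(1) = 1/(1 - 0.9) = 10
```

On this example the optimal policy heads to state 1 and stays there, with \(V^*(1) = 1/(1-\gamma) = 10\) and \(V^*(0) = \gamma \cdot 10 = 9\).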
Proof of Claim 1
Since \(\pi'\) is greedy with respect to \(V^\pi\):
\[ Q^\pi(s, \pi'(s)) = \max_{a} Q^\pi(s, a) \geq Q^\pi(s, \pi(s)) = V^\pi(s) \quad \forall s. \]
Unroll the definition of \(Q^\pi\) one step:
\[ V^\pi(s) \leq Q^\pi(s, \pi'(s)) = R(s, \pi'(s)) + \gamma \sum_{s'} P(s' \mid s, \pi'(s))\, V^\pi(s'). \]
Apply the same inequality to \(V^\pi(s')\):
\[ V^\pi(s) \leq R(s, \pi'(s)) + \gamma \sum_{s'} P(s' \mid s, \pi'(s))\left[ R(s', \pi'(s')) + \gamma \sum_{s''} P(s'' \mid s', \pi'(s'))\, V^\pi(s'') \right]. \]
Continuing to unroll for \(T\) steps leaves a remainder term \(\gamma^T\, \mathbb{E}[V^\pi(s_T)]\), which vanishes as \(T \to \infty\) because \(V^\pi\) is bounded and \(\gamma < 1\). In the limit:
\[ V^\pi(s) \leq \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi'(s_t)) \;\middle|\; s_0 = s\right] = V^{\pi'}(s). \quad \square \]
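Claim 1 can be checked numerically on a small example. The 2-state MDP below and the helper `evaluate` are illustrative assumptions, not from the text: we evaluate a deliberately poor policy \(\pi\), form its greedy policy \(\pi'\), and verify \(V^{\pi'} \geq V^\pi\) componentwise.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: action 1 moves to state 1,
# which pays reward 1 under action 1; all other rewards are 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 0.0], [0.0, 1.0]])
gamma = 0.9

def evaluate(pi):
    """Exact V^pi from (I - gamma * P_pi) V = R_pi."""
    P_pi = P[np.arange(2), pi]
    R_pi = R[np.arange(2), pi]
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

pi = np.array([0, 0])                  # a deliberately poor policy
V_pi = evaluate(pi)                    # here V^pi = [0, 0]
Q_pi = R + gamma * P @ V_pi            # Q^pi(s, a)
pi_greedy = Q_pi.argmax(axis=1)        # pi', greedy w.r.t. V^pi
V_greedy = evaluate(pi_greedy)
assert np.all(V_greedy >= V_pi)        # Claim 1, componentwise
```

Here the greedy policy chooses to stay in state 1, improving \(V\) at that state from 0 to \(1/(1-\gamma) = 10\) while leaving the other state no worse, exactly as the theorem guarantees.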
Proof of Claim 2
There are at most \(|A|^{|S|}\) deterministic policies, so the sequence \(\pi^{(0)}, \pi^{(1)}, \ldots\) takes values in a finite set.
Suppose at step \(k\) the improvement produces \(\pi^{(k+1)} \neq \pi^{(k)}\), where ties in the \(\arg\max\) are broken in favor of the current action. By Claim 1, \(V^{\pi^{(k+1)}}(s) \geq V^{\pi^{(k)}}(s)\) for all \(s\). Because the policy changed despite this tie-breaking rule, there exists a state \(s\) with
\[ \max_a Q^{\pi^{(k)}}(s, a) > Q^{\pi^{(k)}}(s, \pi^{(k)}(s)) = V^{\pi^{(k)}}(s). \]
Chaining this with the unrolling argument from Claim 1, \(V^{\pi^{(k+1)}}(s) \geq Q^{\pi^{(k)}}(s, \pi^{(k+1)}(s)) > V^{\pi^{(k)}}(s)\). The value function therefore strictly improves at some state whenever the policy changes.
Since the value strictly improves at each policy change, no policy can ever be revisited, so the iterates are all distinct. There are only finitely many deterministic policies, so the algorithm terminates.
At termination \(\pi' = \pi\), meaning \(\pi\) is its own greedy policy:
\[ V^\pi(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \right] \quad \forall s. \]
This is exactly the Bellman optimality equation, so \(V^\pi = V^*\) and \(\pi = \pi^*\). \(\square\)
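Both halves of Claim 2 can be exercised numerically: that no policy is ever revisited, and that the terminal \(V^\pi\) satisfies the Bellman optimality equation. The toy MDP and variable names below are illustrative assumptions, not from the text.

```python
import numpy as np

# Same kind of hypothetical 2-state, 2-action MDP as before: action 1
# moves to state 1, which pays reward 1 under action 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 0.0], [0.0, 1.0]])
gamma = 0.9

def evaluate(pi):
    """Exact V^pi from (I - gamma * P_pi) V = R_pi."""
    P_pi = P[np.arange(2), pi]
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R[np.arange(2), pi])

pi = np.array([0, 0])
seen = {tuple(pi)}                     # policies visited so far
while True:
    V = evaluate(pi)
    Q = R + gamma * P @ V
    # Keep the current action on ties so any change is a strict improvement.
    pi_new = np.where(Q.max(axis=1) > Q[np.arange(2), pi] + 1e-12,
                      Q.argmax(axis=1), pi)
    if np.array_equal(pi_new, pi):
        break
    assert tuple(pi_new) not in seen   # Claim 2: no policy is revisited
    seen.add(tuple(pi_new))
    pi = pi_new

# At termination, V^pi satisfies the Bellman optimality equation:
# V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ].
bellman_residual = np.max(np.abs(V - (R + gamma * P @ V).max(axis=1)))
assert bellman_residual < 1e-10
```

On this example the loop passes through three distinct policies before reaching the fixed point \(\pi = [1, 1]\), and the Bellman residual at termination is zero to machine precision.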