An Optimal Deterministic Policy Exists

Claim

For any finite MDP \((S, A, P, R, \gamma)\) with \(\gamma \in [0,1)\), there exists a deterministic policy \(\pi^* : S \to A\) such that \(V^{\pi^*}(s) = V^*(s)\) for all \(s \in S\) (Puterman 1994).

Proof

Step 1: \(V^*\) is well-defined.

The Bellman optimality operator \(\mathcal{T}\) is a \(\gamma\)-contraction on \(\mathbb{R}^S\) under the \(\ell^\infty\) norm (proof). By the Banach fixed-point theorem, \(\mathcal{T}\) has a unique fixed point \(V^* \in \mathbb{R}^S\) satisfying:

\[ V^*(s) = (\mathcal{T} V^*)(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right] \quad \forall s \in S. \]
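To make Step 1 concrete, here is a minimal sketch of value iteration, i.e. repeated application of \(\mathcal{T}\), on a hypothetical two-state, two-action MDP; the transition tensor `P`, the reward matrix `R`, and \(\gamma = 0.9\) below are made up for illustration. Because \(\mathcal{T}\) is a contraction, the iterates converge to \(V^*\) from any starting point.

```python
import numpy as np

# Hypothetical toy MDP (for illustration only): 2 states, 2 actions.
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def bellman_optimality(V):
    """One application of the Bellman optimality operator T."""
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
    Q = R + gamma * (P @ V)
    return Q.max(axis=1)

# Banach fixed-point iteration: ||T V - T W||_inf <= gamma * ||V - W||_inf,
# so the iterates converge geometrically to the unique fixed point V*.
V = np.zeros(len(R))
while True:
    V_next = bellman_optimality(V)
    if np.max(np.abs(V_next - V)) < 1e-12:  # sup-norm stopping rule
        break
    V = V_next
print("V* ~", V)
```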

Step 2: Define a greedy deterministic policy.

For each \(s \in S\), choose any action achieving the maximum above:

\[ \pi^*(s) \in \operatorname*{arg\,max}_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]. \]

The maximum is attained because \(A\) is finite, and the resulting policy is deterministic: it assigns exactly one action to each state (ties broken arbitrarily).
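A sketch of this extraction on the same made-up toy MDP as above: `np.argmax` returns a single maximizing action per state, which is one concrete way of breaking ties (it picks the first maximizer).

```python
import numpy as np

# Same hypothetical toy MDP as in the previous sketch.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# V* via value iteration, as in Step 1.
V = np.zeros(len(R))
for _ in range(2000):
    V = (R + gamma * (P @ V)).max(axis=1)

# Greedy deterministic policy: one action per state; np.argmax breaks
# ties by picking the first maximizer.
pi_star = (R + gamma * (P @ V)).argmax(axis=1)
print("pi* =", pi_star)
```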

Step 3: \(V^{\pi^*} = V^*\).

By the definition of \(\pi^*\), the fixed-point equation for \(V^*\) can be written as:

\[ V^*(s) = R(s, \pi^*(s)) + \gamma \sum_{s'} P(s' \mid s, \pi^*(s))\, V^*(s') \quad \forall s \in S. \]

This is exactly the Bellman expectation equation for the policy \(\pi^*\). The corresponding Bellman expectation operator \(\mathcal{T}^{\pi^*}\) is likewise a \(\gamma\)-contraction, with unique fixed point \(V^{\pi^*}\) (proof). Since \(V^*\) satisfies \(\mathcal{T}^{\pi^*} V^* = V^*\), uniqueness forces \(V^{\pi^*} = V^*\).
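Numerically, on the same toy MDP, \(V^{\pi^*}\) can be computed exactly by writing the Bellman expectation equation in matrix form, \((I - \gamma P_{\pi^*})\, V^{\pi^*} = R_{\pi^*}\), and solving the linear system; it agrees with the \(V^*\) from value iteration, as the uniqueness argument predicts. A sketch under the same made-up `P`, `R`, and \(\gamma\):

```python
import numpy as np

# Same hypothetical toy MDP as above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
n = len(R)

# V* and the greedy policy pi*, as in Steps 1-2.
V = np.zeros(n)
for _ in range(2000):
    V = (R + gamma * (P @ V)).max(axis=1)
pi_star = (R + gamma * (P @ V)).argmax(axis=1)

# Bellman expectation equation in matrix form: V = R_pi + gamma * P_pi V,
# i.e. (I - gamma * P_pi) V = R_pi, with unique solution V^{pi*}.
idx = np.arange(n)
P_pi = P[idx, pi_star]   # P_pi[s, s'] = P(s' | s, pi*(s))
R_pi = R[idx, pi_star]   # R_pi[s]     = R(s, pi*(s))
V_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

print(np.allclose(V_pi, V, atol=1e-8))  # True: V^{pi*} = V*
```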

Step 4: \(\pi^*\) is optimal.

Finally, \(V^*\) dominates the value of every policy. For any policy \(\pi\), \(\mathcal{T}^{\pi} V^* \le \mathcal{T} V^* = V^*\) componentwise, and since \(\mathcal{T}^{\pi}\) is monotone, iterating gives \(V^{\pi} = \lim_{k \to \infty} (\mathcal{T}^{\pi})^k V^* \le V^*\). Hence \(V^{\pi^*}(s) = V^*(s) = \max_\pi V^\pi(s)\) for all \(s \in S\): no policy achieves a higher value in any state. \(\blacksquare\)
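As a sanity check (not part of the proof), the toy MDP is small enough to enumerate all \(|A|^{|S|} = 4\) deterministic policies, evaluate each one exactly, and confirm that none exceeds \(V^*\) in any state. A sketch under the same made-up `P`, `R`, and \(\gamma\):

```python
import numpy as np
from itertools import product

# Same hypothetical toy MDP as above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape
idx = np.arange(n_states)

def evaluate(pi):
    """Exact V^pi by solving (I - gamma * P_pi) V = R_pi."""
    return np.linalg.solve(np.eye(n_states) - gamma * P[idx, pi], R[idx, pi])

# V* via value iteration, as in Step 1.
V = np.zeros(n_states)
for _ in range(2000):
    V = (R + gamma * (P @ V)).max(axis=1)

# No deterministic policy beats V* in any state (up to numerical tolerance).
for pi in product(range(n_actions), repeat=n_states):
    assert np.all(evaluate(np.array(pi)) <= V + 1e-8)
print("checked all", n_actions ** n_states, "deterministic policies")
```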

References

Puterman, Martin L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9780470316887.