An Optimal Deterministic Policy Exists
Claim
For any finite MDP \((S, A, P, R, \gamma)\) with \(\gamma \in [0,1)\), there exists a deterministic policy \(\pi^* : S \to A\) such that \(V^{\pi^*}(s) = V^*(s)\) for all \(s \in S\) (Puterman 1994).
Proof
Step 1: \(V^*\) is well-defined.
The Bellman optimality operator \(\mathcal{T}\) is a \(\gamma\)-contraction on \(\mathbb{R}^S\) under the \(\ell^\infty\) norm (proof). By the Banach fixed-point theorem, \(\mathcal{T}\) has a unique fixed point \(V^* \in \mathbb{R}^S\) satisfying:
\[ V^*(s) = (\mathcal{T} V^*)(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right] \quad \forall s \in S. \]
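Both facts in Step 1, that \(\mathcal{T}\) contracts by \(\gamma\) and that iterating it converges to a unique fixed point, can be checked numerically. The 2-state, 2-action MDP below (transitions `P`, rewards `R`, discount `gamma`) is an invented toy example for illustration, not part of the proof:

```python
import numpy as np

gamma = 0.9
# Toy MDP (illustrative numbers): P[a, s, s'] = P(s' | s, a), R[s, a] = R(s, a).
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions under action 0
    [[0.5, 0.5], [0.7, 0.3]],   # transitions under action 1
])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def bellman_opt(V):
    # (T V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s') ]
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.max(axis=1)

# Contraction: applying T shrinks sup-norm distances by a factor of gamma.
V1, V2 = np.array([1.0, -2.0]), np.array([3.0, 0.5])
d_before = np.abs(V1 - V2).max()
d_after = np.abs(bellman_opt(V1) - bellman_opt(V2)).max()
print(d_after <= gamma * d_before)  # True

# Banach: iterating T from any starting point converges to the unique fixed
# point V*; the fixed-point residual vanishes (up to machine precision).
V = np.zeros(2)
for _ in range(500):
    V = bellman_opt(V)
print(np.abs(bellman_opt(V) - V).max())  # ~0
```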
Step 2: Define a greedy deterministic policy.
For each \(s \in S\), choose any action achieving the maximum above:
\[ \pi^*(s) = \operatorname*{arg\,max}_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]. \]
This defines a deterministic policy: since \(A\) is finite, the maximum is attained and the argmax set is nonempty, and we select exactly one action per state (breaking ties arbitrarily).
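A minimal sketch of the greedy extraction in Step 2, reusing the same invented toy MDP as above. Note that `np.argmax` realizes one valid arbitrary tie-break by returning the first maximizer:

```python
import numpy as np

gamma = 0.9
# Same toy MDP as before (illustrative numbers only).
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.7, 0.3]],
])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def q_values(V):
    # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s')
    return R + gamma * np.einsum("ast,t->sa", P, V)

# Approximate V* by value iteration (Step 1).
V_star = np.zeros(2)
for _ in range(500):
    V_star = q_values(V_star).max(axis=1)

# Greedy deterministic policy: exactly one action per state.
pi_star = q_values(V_star).argmax(axis=1)
print(pi_star)  # → [0 1] for this toy MDP
```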
Step 3: \(V^{\pi^*} = V^*\).
By the definition of \(\pi^*\), the fixed-point equation for \(V^*\) can be written as:
\[ V^*(s) = R(s, \pi^*(s)) + \gamma \sum_{s'} P(s' \mid s, \pi^*(s))\, V^*(s') \quad \forall s \in S. \]
This is exactly the Bellman expectation equation for policy \(\pi^*\). The analogous Bellman expectation operator \(\mathcal{T}^{\pi^*}\) is also a \(\gamma\)-contraction with unique fixed point \(V^{\pi^*}\) (proof). Since \(V^*\) satisfies \(\mathcal{T}^{\pi^*} V^* = V^*\), uniqueness gives \(V^{\pi^*} = V^*\).
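Step 3 can also be verified directly: for a fixed policy the Bellman expectation equation is linear, so \(V^{\pi^*}\) solves \((I - \gamma P^{\pi^*})V = R^{\pi^*}\), and on the toy MDP from above it matches the value-iteration fixed point to machine precision:

```python
import numpy as np

gamma = 0.9
# Same toy MDP as before (illustrative numbers only).
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.7, 0.3]],
])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def q_values(V):
    return R + gamma * np.einsum("ast,t->sa", P, V)

# V* via value iteration, then the greedy policy pi* (Steps 1-2).
V_star = np.zeros(2)
for _ in range(500):
    V_star = q_values(V_star).max(axis=1)
pi_star = q_values(V_star).argmax(axis=1)

# Bellman expectation equation for pi* is linear: (I - gamma P_pi) V = R_pi.
states = np.arange(2)
P_pi = P[pi_star, states]          # P_pi[s, s'] = P(s' | s, pi*(s))
R_pi = R[states, pi_star]          # R_pi[s]     = R(s, pi*(s))
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

print(np.abs(V_pi - V_star).max())  # ~0: V^{pi*} coincides with V*
```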
Step 4: \(\pi^*\) is optimal.
It remains to note that \(V^* = \max_\pi V^\pi\). For any policy \(\pi\), \(\mathcal{T}^{\pi} V^* \le \mathcal{T} V^* = V^*\) componentwise, and \(\mathcal{T}^{\pi}\) is monotone, so iterating gives \(V^{\pi} = \lim_{k \to \infty} (\mathcal{T}^{\pi})^k V^* \le V^*\). Hence \(V^{\pi^*}(s) = V^*(s) = \max_{\pi} V^{\pi}(s)\) for all \(s\): no policy, deterministic or stochastic, achieves a higher value in any state. \(\blacksquare\)
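As a sanity check (not a proof) of Step 4, evaluating randomly sampled stochastic policies on the same toy MDP never exceeds \(V^*\) in any state:

```python
import numpy as np

gamma = 0.9
# Same toy MDP as before (illustrative numbers only).
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.7, 0.3]],
])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def evaluate(pi_probs):
    # Exact evaluation of a stochastic policy pi_probs[s, a] = pi(a | s):
    # V^pi solves the linear system (I - gamma P_pi) V = R_pi.
    P_pi = np.einsum("sa,ast->st", pi_probs, P)
    R_pi = (pi_probs * R).sum(axis=1)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# V* via value iteration.
V_star = np.zeros(2)
for _ in range(500):
    V_star = (R + gamma * np.einsum("ast,t->sa", P, V_star)).max(axis=1)

# No sampled stochastic policy beats V* in any state.
rng = np.random.default_rng(0)
dominated = all(
    np.all(evaluate(rng.dirichlet([1.0, 1.0], size=2)) <= V_star + 1e-9)
    for _ in range(100)
)
print(dominated)  # True
```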