The Bellman Operator Is a Contraction
Claim
Let \((S, A, P, R, \gamma)\) be an MDP with \(\gamma \in [0,1)\). Define the Bellman optimality operator \(\mathcal{T}\) and the Bellman expectation operator \(\mathcal{T}^\pi\) for a fixed policy \(\pi\) by
\[ (\mathcal{T} V)(s) = \max_{a \in A} \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \right], \]
\[ (\mathcal{T}^\pi V)(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V(s'). \]
Both operators are \(\gamma\)-contractions in the \(\ell^\infty\) norm (Puterman 1994):
\[ \|\mathcal{T} V - \mathcal{T} W\|_\infty \leq \gamma \|V - W\|_\infty, \qquad \|\mathcal{T}^\pi V - \mathcal{T}^\pi W\|_\infty \leq \gamma \|V - W\|_\infty. \]
Proof for \(\mathcal{T}^\pi\)
For any state \(s\):
\[ (\mathcal{T}^\pi V)(s) - (\mathcal{T}^\pi W)(s) = \gamma \sum_{s'} P(s' \mid s, \pi(s))\bigl[V(s') - W(s')\bigr]. \]
Taking absolute values and using \(\sum_{s'} P(s' \mid s, \pi(s)) = 1\):
\[ \bigl|(\mathcal{T}^\pi V)(s) - (\mathcal{T}^\pi W)(s)\bigr| \leq \gamma \sum_{s'} P(s' \mid s, \pi(s))\, \|V - W\|_\infty = \gamma \|V - W\|_\infty. \]
Taking the supremum over \(s\) gives \(\|\mathcal{T}^\pi V - \mathcal{T}^\pi W\|_\infty \leq \gamma \|V - W\|_\infty\). \(\square\)
Proof for \(\mathcal{T}\)
For any state \(s\), let \(a^*\) achieve the max in \((\mathcal{T} V)(s)\). Then:
\[ (\mathcal{T} V)(s) - (\mathcal{T} W)(s) \leq \left[R(s, a^*) + \gamma \sum_{s'} P(s' \mid s, a^*) V(s')\right] - \left[R(s, a^*) + \gamma \sum_{s'} P(s' \mid s, a^*) W(s')\right]. \]
The reward terms cancel and the same bound as above gives:
\[ (\mathcal{T} V)(s) - (\mathcal{T} W)(s) \leq \gamma \|V - W\|_\infty. \]
By symmetry (swapping \(V\) and \(W\)), \((\mathcal{T} W)(s) - (\mathcal{T} V)(s) \leq \gamma \|V - W\|_\infty\). Therefore:
\[ \bigl|(\mathcal{T} V)(s) - (\mathcal{T} W)(s)\bigr| \leq \gamma \|V - W\|_\infty \quad \forall s. \]
Taking the supremum: \(\|\mathcal{T} V - \mathcal{T} W\|_\infty \leq \gamma \|V - W\|_\infty\). \(\square\)
Consequence
Since \(\gamma < 1\), both \(\mathcal{T}\) and \(\mathcal{T}^\pi\) are contractions on the complete metric space of bounded functions \(S \to \mathbb{R}\) with the \(\ell^\infty\) norm. By the Banach fixed-point theorem, each has a unique fixed point (\(V^*\) and \(V^\pi\) respectively), and repeated application converges to that fixed point from any initialization.