The Bellman Operator Is a Contraction

Claim

Let \((S, A, P, R, \gamma)\) be a finite MDP with discount factor \(\gamma \in [0,1)\). Define the Bellman optimality operator \(\mathcal{T}\) and the Bellman expectation operator \(\mathcal{T}^\pi\) for a fixed deterministic policy \(\pi\) by

\[ (\mathcal{T} V)(s) = \max_{a \in A} \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \right], \]

\[ (\mathcal{T}^\pi V)(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V(s'). \]

Both operators are \(\gamma\)-contractions in the \(\ell^\infty\) norm (Puterman 1994):

\[ \|\mathcal{T} V - \mathcal{T} W\|_\infty \leq \gamma \|V - W\|_\infty, \qquad \|\mathcal{T}^\pi V - \mathcal{T}^\pi W\|_\infty \leq \gamma \|V - W\|_\infty. \]
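
To make the definitions concrete, here is a minimal NumPy sketch of both operators on a small randomly generated MDP. The array layout (`P[a, s, s']`, `R[s, a]`) and the names `bellman_optimality` and `bellman_expectation` are illustrative choices for this note, not a standard API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Toy MDP: transition kernel P[a, s, s'] = P(s' | s, a), each row normalized to sum to 1,
# and rewards R[s, a]. Purely illustrative data.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def bellman_optimality(V):
    """(T V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s') ]."""
    # (P @ V)[a, s] = sum_{s'} P[a, s, s'] V[s'], so Q has shape (n_states, n_actions).
    Q = R + gamma * (P @ V).T
    return Q.max(axis=1)

def bellman_expectation(V, policy):
    """(T^pi V)(s) for a deterministic policy given as an integer array policy[s]."""
    s = np.arange(n_states)
    # P[policy, s] stacks the row P(. | s, policy(s)) for each state s.
    return R[s, policy] + gamma * P[policy, s] @ V
```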

Proof for \(\mathcal{T}^\pi\)

For any state \(s\):

\[ (\mathcal{T}^\pi V)(s) - (\mathcal{T}^\pi W)(s) = \gamma \sum_{s'} P(s' \mid s, \pi(s))\bigl[V(s') - W(s')\bigr]. \]

Taking absolute values, bounding \(|V(s') - W(s')| \leq \|V - W\|_\infty\) for every \(s'\), and using \(\sum_{s'} P(s' \mid s, \pi(s)) = 1\):

\[ \bigl|(\mathcal{T}^\pi V)(s) - (\mathcal{T}^\pi W)(s)\bigr| \leq \gamma \sum_{s'} P(s' \mid s, \pi(s))\, \|V - W\|_\infty = \gamma \|V - W\|_\infty. \]

Taking the supremum over \(s\) gives \(\|\mathcal{T}^\pi V - \mathcal{T}^\pi W\|_\infty \leq \gamma \|V - W\|_\infty\). \(\square\)
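
A quick numerical spot check of this bound, continuing the toy-MDP sketch above (it reuses the hypothetical `bellman_expectation` helper together with `rng`, `gamma`, `n_states`, and `n_actions` defined there):

```python
# Pick an arbitrary deterministic policy and two arbitrary value functions.
policy = rng.integers(n_actions, size=n_states)
V, W = rng.normal(size=n_states), rng.normal(size=n_states)

lhs = np.abs(bellman_expectation(V, policy) - bellman_expectation(W, policy)).max()
rhs = gamma * np.abs(V - W).max()
print(f"||T^pi V - T^pi W||_inf = {lhs:.4f}  <=  gamma * ||V - W||_inf = {rhs:.4f}")
assert lhs <= rhs + 1e-12   # the contraction inequality, up to floating-point slack
```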

Proof for \(\mathcal{T}\)

For any state \(s\), let \(a^*\) achieve the max in \((\mathcal{T} V)(s)\). Since \(a^*\) is also a candidate (though not necessarily the maximizer) in the max defining \((\mathcal{T} W)(s)\), we have \((\mathcal{T} W)(s) \geq R(s, a^*) + \gamma \sum_{s'} P(s' \mid s, a^*)\, W(s')\), and therefore:

\[ (\mathcal{T} V)(s) - (\mathcal{T} W)(s) \leq \left[R(s, a^*) + \gamma \sum_{s'} P(s' \mid s, a^*) V(s')\right] - \left[R(s, a^*) + \gamma \sum_{s'} P(s' \mid s, a^*) W(s')\right]. \]

The reward terms cancel, and the same argument as in the proof for \(\mathcal{T}^\pi\) gives:

\[ (\mathcal{T} V)(s) - (\mathcal{T} W)(s) \leq \gamma \|V - W\|_\infty. \]

By symmetry (swapping \(V\) and \(W\)), \((\mathcal{T} W)(s) - (\mathcal{T} V)(s) \leq \gamma \|V - W\|_\infty\). Therefore:

\[ \bigl|(\mathcal{T} V)(s) - (\mathcal{T} W)(s)\bigr| \leq \gamma \|V - W\|_\infty \quad \forall s. \]

Taking the supremum: \(\|\mathcal{T} V - \mathcal{T} W\|_\infty \leq \gamma \|V - W\|_\infty\). \(\square\)
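
The same spot check for the optimality operator, again reusing the hypothetical helpers from the sketch above:

```python
V, W = rng.normal(size=n_states), rng.normal(size=n_states)

lhs = np.abs(bellman_optimality(V) - bellman_optimality(W)).max()
rhs = gamma * np.abs(V - W).max()
print(f"||T V - T W||_inf = {lhs:.4f}  <=  gamma * ||V - W||_inf = {rhs:.4f}")
assert lhs <= rhs + 1e-12   # holds despite the max, by the two-sided argument above
```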

Consequence

Since \(\gamma < 1\), both \(\mathcal{T}\) and \(\mathcal{T}^\pi\) are contractions on the complete metric space of bounded functions \(S \to \mathbb{R}\) with the \(\ell^\infty\) norm. By the Banach fixed-point theorem, each has a unique fixed point (\(V^*\) and \(V^\pi\) respectively), and repeated application converges to that fixed point from any initialization, with the \(\ell^\infty\) error shrinking by at least a factor of \(\gamma\) per application.
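
A short continuation of the sketch above illustrates this for the (hypothetical) `bellman_optimality` helper: iterating from an arbitrary starting point, the \(\ell^\infty\) distance to the fixed point decays geometrically.

```python
# Approximate the fixed point V* by iterating the operator to numerical convergence.
V = np.zeros(n_states)
for _ in range(1000):
    V = bellman_optimality(V)
V_star = V

# Start from an arbitrary initialization and watch the error shrink.
V = rng.normal(size=n_states)
for k in range(6):
    err = np.abs(V - V_star).max()
    print(f"iter {k}: ||V_k - V*||_inf = {err:.6f}")
    V = bellman_optimality(V)
```

Each printed error is at most \(\gamma\) times the previous one, which is exactly the contraction bound applied to the pair \((V_k, V^*)\).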

References

Puterman, Martin L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9780470316887.