Bellman Equations

Motivation

In a sequential decision problem an agent’s reward depends on the entire future trajectory, not just the next step. The Bellman equations (Bellman 1957) decompose this long-horizon objective recursively: the value of a state equals the immediate reward plus the discounted value of the successor state. This decomposition is the foundation of dynamic programming and reinforcement learning.

Markov Decision Process

A Markov decision process (MDP) is a tuple \((S, A, P, R, \gamma)\) (see Markov Decision Processes):

  • \(S\) — finite set of states
  • \(A\) — finite set of actions
  • \(P(s' \mid s, a)\) — transition probability of reaching \(s'\) from \(s\) under action \(a\)
  • \(R(s, a)\) — expected immediate reward for taking action \(a\) in state \(s\)
  • \(\gamma \in [0, 1)\) — discount factor
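
As a concrete sketch (the numbers here are made up purely for illustration), a two-state, two-action MDP can be stored as NumPy arrays indexed as P[s, a, s'] and R[s, a]:

    import numpy as np

    gamma = 0.9

    # P[s, a, s'] -- probability of landing in s' after taking action a in state s
    P = np.array([
        [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0 and 1
        [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0 and 1
    ])

    # R[s, a] -- expected immediate reward for taking action a in state s
    R = np.array([
        [1.0, 0.0],
        [0.0, 2.0],
    ])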

A policy \(\pi : S \to A\) specifies which action to take in each state. The state-value function under \(\pi\) (Puterman 1994) is the expected discounted return from state \(s\):

\[ V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s,\, a_t = \pi(s_t)\right]. \]
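
The infinite sum can be approximated by truncating a sampled trajectory once \(\gamma^t\) is negligible. A minimal sketch, reusing the hypothetical gamma from the example above:

    # hypothetical rewards R(s_t, a_t) sampled along one trajectory
    rewards = [1.0, 0.0, 2.0, 1.0]
    G = sum(gamma**t * r for t, r in enumerate(rewards))   # truncated discounted return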

The action-value function (Q-function) is the expected return from state \(s\) after taking action \(a\) and then following \(\pi\):

\[ Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^\pi(s'). \]
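
In the array layout above this is a single vectorized expression (a sketch; V_pi is any length-\(|S|\) estimate of \(V^\pi\)):

    def q_from_v(V_pi, P, R, gamma):
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * V_pi[s']
        return R + gamma * P @ V_pi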

Bellman Expectation Equations

Expanding the definition of \(V^\pi\) one step:

\[ V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, V^\pi(s'). \]

This is the Bellman expectation equation for policy \(\pi\). It is a system of \(|S|\) linear equations in the unknowns \(V^\pi(s)\) and can be solved directly by matrix inversion:

\[ V^\pi = (I - \gamma P^\pi)^{-1} R^\pi, \]

where \(P^\pi_{ss'} = P(s' \mid s, \pi(s))\) and \(R^\pi_s = R(s, \pi(s))\). The matrix \(I - \gamma P^\pi\) is invertible because \(P^\pi\) is a stochastic matrix with spectral radius 1, so \(\gamma < 1\) makes the spectral radius of \(\gamma P^\pi\) strictly less than 1.
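
A minimal sketch of exact policy evaluation with the hypothetical arrays from above. It solves the linear system with np.linalg.solve rather than forming the inverse explicitly, which is cheaper and numerically better behaved:

    def evaluate_policy(pi, P, R, gamma):
        n = P.shape[0]
        idx = np.arange(n)
        P_pi = P[idx, pi]          # P_pi[s, s'] = P(s' | s, pi(s))
        R_pi = R[idx, pi]          # R_pi[s]     = R(s, pi(s))
        return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

    pi = np.array([0, 1])          # a hypothetical deterministic policy
    V_pi = evaluate_policy(pi, P, R, gamma)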

Bellman Optimality Equations

One-step Bellman backup

A Bellman backup evaluates a state by looking one action ahead, then averaging over possible next states.

[Figure: one-step backup diagram. From state \(s\) the agent chooses action \(a\); the environment then transitions to \(s'_1\) with probability 0.6 or to \(s'_2\) with probability 0.4, so that]

\[ Q(s, a) = R(s, a) + \gamma \left[\, 0.6\, V(s'_1) + 0.4\, V(s'_2) \,\right]. \]
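
For example, with hypothetical values \(R(s, a) = 1\), \(V(s'_1) = 2\), \(V(s'_2) = 0\) and \(\gamma = 0.9\), the backup gives \(Q(s, a) = 1 + 0.9\,(0.6 \cdot 2 + 0.4 \cdot 0) = 2.08\).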

The optimal value function \(V^*(s) = \max_\pi V^\pi(s)\) satisfies:

\[ V^*(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]. \]

The optimal Q-function satisfies:

\[ Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^*(s', a'). \]

Given \(V^*\), the optimal policy is:

\[ \pi^*(s) = \operatorname*{arg\,max}_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]. \]
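
In code, this is an argmax over the one-step lookahead values (a sketch reusing the array layout from the earlier examples):

    def greedy_policy(V, P, R, gamma):
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
        Q = R + gamma * P @ V
        return np.argmax(Q, axis=1)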

Unlike the expectation equations, the optimality equations are nonlinear (due to the max) and cannot be solved by direct matrix inversion.

The Bellman Operator

Define the Bellman optimality operator \(\mathcal{T}\) acting on value functions \(V : S \to \mathbb{R}\):

\[ (\mathcal{T} V)(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \right]. \]
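
With the array layout used in the earlier sketches, the operator is one vectorized line:

    def bellman_operator(V, P, R, gamma):
        # (T V)(s) = max_a [ R[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s'] ]
        return np.max(R + gamma * P @ V, axis=1)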

\(V^*\) is the unique fixed point of \(\mathcal{T}\): \(\mathcal{T} V^* = V^*\).

\(\mathcal{T}\) is a \(\gamma\)-contraction in the \(\ell^\infty\) norm: for any two value functions \(V\) and \(W\),

\[ \|\mathcal{T} V - \mathcal{T} W\|_\infty \leq \gamma \|V - W\|_\infty. \]

(proof)
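
In brief, the key step is that the max operator is nonexpansive: for every state \(s\),

\[ |(\mathcal{T} V)(s) - (\mathcal{T} W)(s)| \leq \max_{a \in A} \gamma \sum_{s'} P(s' \mid s, a)\, |V(s') - W(s')| \leq \gamma \|V - W\|_\infty, \]

and taking the maximum over \(s\) gives the bound.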

The analogous operator \(\mathcal{T}^\pi\) for a fixed policy \(\pi\) is also a \(\gamma\)-contraction, with fixed point \(V^\pi\). (proof)
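
Because \(\mathcal{T}\) is a \(\gamma\)-contraction, iterating it from any starting point converges to \(V^*\) at a geometric rate. A minimal value-iteration sketch, reusing bellman_operator and greedy_policy from the earlier examples:

    def value_iteration(P, R, gamma, tol=1e-8):
        V = np.zeros(P.shape[0])
        while True:
            V_new = bellman_operator(V, P, R, gamma)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

    V_star = value_iteration(P, R, gamma)
    pi_star = greedy_policy(V_star, P, R, gamma)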

References

Bellman, Richard. 1957. Dynamic Programming. Princeton University Press. https://press.princeton.edu/books/paperback/9780691146683/dynamic-programming.
Puterman, Martin L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9780470316887.