Value Iteration

Motivation

Value iteration (Bellman 1957) solves the Bellman optimality equations by treating them as an update rule and applying the Bellman operator repeatedly until convergence. It trades the expensive policy evaluation step of policy iteration for many cheap sweeps, each of which moves every state’s value estimate closer to \(V^*\).

Problem

Given a finite Markov decision process (MDP) with states \(S\), actions \(A\), transition probabilities \(P(s' \mid s, a)\), rewards \(R(s, a)\), and discount factor \(\gamma \in [0, 1)\), compute the optimal value function \(V^* : S \to \mathbb{R}\) and an optimal policy \(\pi^* : S \to A\).

See Bellman Equations for the full MDP notation.
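
The code sketches below use a dense array encoding of the MDP; this convention is an assumption of the sketches, not part of the definition:

import numpy as np

# Assumed encoding for all sketches in this article:
#   P[s, a, s2] = P(s2 | s, a), shape (|S|, |A|, |S|), each P[s, a] sums to 1
#   R[s, a]     = R(s, a),      shape (|S|, |A|)
n_states, n_actions = 2, 2
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # uniform placeholder
R = np.zeros((n_states, n_actions))
gamma = 0.9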

Key Ideas

The Bellman operator as a fixed-point map

The optimal value function \(V^*\) satisfies the Bellman optimality equations

\[ V^*(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*(s') \right]. \]

Treating the right-hand side as a function of \(V\) gives the Bellman optimality operator

\[ (\mathcal{T} V)(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \right]. \]

\(V^*\) is the unique fixed point of \(\mathcal{T}\). Value iteration is the simplest possible algorithm to find it: start anywhere and iterate \(V \leftarrow \mathcal{T} V\).
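
As a minimal sketch under the array encoding above, one application of \(\mathcal{T}\) is a few vectorized lines:

import numpy as np

def bellman_operator(V, P, R, gamma):
    """One application of T. V: (|S|,), P: (|S|, |A|, |S|), R: (|S|, |A|)."""
    Q = R + gamma * (P @ V)   # Q[s, a] = R(s, a) + γ Σ_{s'} P(s' | s, a) V(s')
    return Q.max(axis=1)      # (T V)(s) = max_a Q[s, a]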

Contraction guarantees convergence

\(\mathcal{T}\) is a \(\gamma\)-contraction in the \(\ell^\infty\) norm: \(\|\mathcal{T} U - \mathcal{T} V\|_\infty \le \gamma \|U - V\|_\infty\) (proof). By the Banach fixed-point theorem, iterating \(\mathcal{T}\) from any starting \(V_0\) converges geometrically (with ratio \(\gamma\)) to \(V^*\). Smaller \(\gamma\) means faster convergence; \(\gamma\) near \(1\) means many iterations.
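
The contraction is easy to check empirically. A quick sanity test on a random MDP, reusing bellman_operator from above (the assert is the \(\gamma\)-contraction inequality with a small numerical tolerance):

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))          # random dynamics, |S| = 4, |A| = 2
R = rng.standard_normal((4, 2))
U, W = rng.standard_normal(4), rng.standard_normal(4)
gamma = 0.9
lhs = np.abs(bellman_operator(U, P, R, gamma) - bellman_operator(W, P, R, gamma)).max()
assert lhs <= gamma * np.abs(U - W).max() + 1e-12   # ||TU - TW||_inf <= γ ||U - W||_inf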

One-step lookahead is enough

Each sweep does a one-step lookahead: for every \((s, a)\), compute the expected return assuming the value of the next state is \(V_k(s')\). There is no need to look ahead multiple steps — repeated application of one-step Bellman backups grows the lookahead horizon automatically. Reward at the goal propagates outward exactly one step per sweep.

Algorithm

In Python, using the dense-array encoding sketched in the Problem section:

import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """P: (|S|, |A|, |S|) transition tensor; R: (|S|, |A|) reward matrix."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # Bellman backup: best one-step expected return from s
            V[s] = max(R[s, a] + gamma * P[s, a] @ V for a in range(n_actions))
            delta = max(delta, abs(v - V[s]))
        if delta < eps:   # stop once the largest update in a sweep is tiny
            break
    # extract the greedy policy from the converged values
    pi = np.array([max(range(n_actions),
                       key=lambda a: R[s, a] + gamma * P[s, a] @ V)
                   for s in range(n_states)])
    return V, pi

The key step is the Bellman backup — the max over actions of one-step expected return.

Walkthrough

A two-state MDP

States \(S = \{0, 1\}\), actions \(A = \{\text{stay}, \text{switch}\}\), \(\gamma = 0.9\), dynamics and rewards:

s   a        R(s, a)   P(0 | s, a)   P(1 | s, a)
0   stay     1         1             0
0   switch   0         0             1
1   stay     0         0             1
1   switch   0         1             0

Starting from \(V_0 = (0, 0)\):

\[ V_1(0) = \max\bigl[1 + 0.9 \cdot 0,\; 0 + 0.9 \cdot 0\bigr] = 1, \qquad V_1(1) = \max\bigl[0 + 0.9 \cdot 0,\; 0 + 0.9 \cdot 0\bigr] = 0. \]

\[ V_2(0) = \max\bigl[1 + 0.9 \cdot 1,\; 0 + 0.9 \cdot 0\bigr] = 1.9, \qquad V_2(1) = \max\bigl[0 + 0.9 \cdot 0,\; 0 + 0.9 \cdot 1\bigr] = 0.9. \]

Continuing the iteration, the values converge to \(V^*(0) = 10\) and \(V^*(1) = 9\): state \(0\) collects the geometric sum \(\sum_{k \ge 0} 0.9^k = 10\), while state \(1\) must first switch, so its value lags one discounted step behind, \(V^*(1) = 0.9 \cdot V^*(0) = 9\). The optimal policy is \(\pi^*(0) = \text{stay}\), \(\pi^*(1) = \text{switch}\).
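
A quick numerical check of the walkthrough, reusing value_iteration from the Algorithm section (action 0 = stay, action 1 = switch):

import numpy as np

P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0   # (s=0, stay)   -> s'=0
P[0, 1, 1] = 1.0   # (s=0, switch) -> s'=1
P[1, 0, 1] = 1.0   # (s=1, stay)   -> s'=1
P[1, 1, 0] = 1.0   # (s=1, switch) -> s'=0
R = np.array([[1.0, 0.0],
              [0.0, 0.0]])
V, pi = value_iteration(P, R, gamma=0.9)
print(V)    # ~ [10.  9.]
print(pi)   # [0 1]: stay in state 0, switch in state 1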

Gridworld values over iterations

Value iteration on a small gridworld with a unit reward at the goal cell and \(\gamma = 0.9\). The reward propagates outward one cell per sweep:

iteration   top-left   top-middle   top-right   bottom-left   bottom-middle   goal
0           0.00       0.00         0.00        0.00          0.00            1.00
1           0.00       0.00         0.90        0.00          0.90            1.00
2           0.00       0.81         0.90        0.81          0.90            1.00
3           0.73       0.81         0.90        0.81          0.90            1.00

After three sweeps the goal’s reward has reached the far corner. Each sweep adds one layer of states to the set with nonzero value — value iteration is essentially “BFS with discount.”
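
The exact grid isn’t specified above, so the following sketch makes assumptions chosen to reproduce the table: a 2×3 grid with the goal at the bottom-right, deterministic 4-connected moves, zero rewards, the goal’s value pinned at 1, and synchronous sweeps:

gamma, goal = 0.9, (1, 2)
neighbors = {   # assumed 2x3 layout, (row, col), goal at (1, 2)
    (0, 0): [(0, 1), (1, 0)],  (0, 1): [(0, 0), (0, 2), (1, 1)],  (0, 2): [(0, 1), (1, 2)],
    (1, 0): [(0, 0), (1, 1)],  (1, 1): [(1, 0), (0, 1), (1, 2)],  (1, 2): [(1, 1), (0, 2)],
}
V = {s: float(s == goal) for s in neighbors}   # iteration 0 from the table
for it in range(1, 4):   # synchronous sweeps: each new dict reads only the old V
    V = {s: 1.0 if s == goal else gamma * max(V[n] for n in neighbors[s])
         for s in neighbors}
    print(it, {s: round(v, 2) for s, v in V.items()})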

Correctness

Because \(\mathcal{T}\) is a \(\gamma\)-contraction, the Banach fixed-point theorem guarantees convergence to \(V^*\) from any \(V_0\). The error after \(k\) iterations satisfies

\[ \|V_k - V^*\|_\infty \leq \frac{\gamma^k}{1 - \gamma} \|V_1 - V_0\|_\infty. \]

(proof)

The convergence is geometric with ratio \(\gamma\). The greedy policy extracted from \(V_k\) is optimal once \(\|V_k - V^*\|_\infty < (1 - \gamma) \cdot \Delta / 2\), where \(\Delta\) is the minimum action-value gap — so policy convergence happens before value convergence is exact.

Complexity and Tradeoffs

Each iteration visits every \((s, a)\) pair and sums over all \(s'\): \(O(|S|^2 |A|)\) per sweep. The number of iterations to reach \(\varepsilon\)-accuracy is \(O\!\left(\frac{\log \varepsilon^{-1}}{1 - \gamma}\right)\), giving total cost

\[ O\!\left(\frac{|S|^2 |A|}{1-\gamma} \log \varepsilon^{-1}\right). \]

The \(1/(1-\gamma)\) factor is the effective horizon: when \(\gamma\) is close to \(1\), the algorithm needs many sweeps to propagate reward across long planning horizons.
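
For a concrete sense of scale (natural logarithm assumed in the bound):

\[ \gamma = 0.99,\ \varepsilon = 10^{-3}: \qquad \frac{\log \varepsilon^{-1}}{1 - \gamma} = \frac{\ln 10^3}{0.01} \approx 691 \text{ sweeps}. \]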

For large state spaces, exact value iteration is intractable and approximate methods — fitted value iteration with function approximators, deep RL, or sampling-based methods — are used instead.

When to Use It

Situation                                                     Approach
Small finite MDP, known dynamics                              Value iteration — simplest exact method.
Need policy quickly, willing to do extra work per iteration   Policy iteration — fewer iterations, each more expensive.
Dynamics unknown, samples available                           Q-learning or other TD methods.
Huge state space, function approximator available             Fitted value iteration / deep Q-learning.
Continuous state and action                                   Linear-quadratic problems → LQR; general nonlinear → policy-gradient or actor-critic methods.

Variants

  • Asynchronous value iteration. Update states one at a time instead of in lockstep sweeps. Converges to \(V^*\) as long as every state is updated infinitely often; allows priority queuing.
  • Prioritized sweeping. Always update the state whose Bellman error is currently largest. Useful when reward is sparse and most states have stable values; a sketch follows this list.
  • Gauss-Seidel updates. Use the latest values \(V[s]\) as soon as they’re computed, instead of buffering a separate array for a full sweep (the in-place sweep in the Algorithm section already does this). Often converges faster than synchronous updates.
  • Fitted value iteration. Represent \(V\) as a function approximator (linear, neural net) and minimize the Bellman residual at sampled states. The contraction property is lost but the algorithm scales.
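
A minimal sketch of the prioritized variant, under the same array encoding as earlier. It uses a lazy max-heap whose entries may go stale, so priorities are recomputed on pop; a production version would deduplicate queue entries and bound the total number of backups:

import heapq
import numpy as np

def prioritized_value_iteration(P, R, gamma, eps=1e-6):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)

    def backup(s):   # single-state Bellman backup value
        return max(R[s, a] + gamma * P[s, a] @ V for a in range(n_actions))

    # predecessors of t: states s with P(t | s, a) > 0 for some a
    preds = [np.nonzero(P[:, :, t].sum(axis=1) > 0)[0] for t in range(n_states)]
    heap = [(-abs(backup(s) - V[s]), s) for s in range(n_states)]
    heapq.heapify(heap)
    while heap:
        _, s = heapq.heappop(heap)
        err = abs(backup(s) - V[s])   # recompute: the queued priority may be stale
        if err < eps:
            continue
        V[s] = backup(s)
        for p in preds[s]:            # only predecessors' Bellman errors changed
            heapq.heappush(heap, (-abs(backup(p) - V[p]), p))
    return V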

References

Bellman, Richard. 1957. Dynamic Programming. Princeton University Press. https://press.princeton.edu/books/paperback/9780691146683/dynamic-programming.