Markov Decision Processes
Motivation
Many sequential decision problems share a common structure: an agent repeatedly observes its current state, selects an action, receives a reward, and transitions to a new state. A Markov decision process (MDP) (Puterman 1994) formalizes this structure precisely, yielding a well-defined optimization problem whose solution is a policy that maximizes expected cumulative reward.
Definition
A (discrete-time, discounted) MDP is a tuple \((S, A, P, R, \gamma)\):
- \(S\) — finite set of states
- \(A\) — finite set of actions
- \(P(s' \mid s, a)\) — transition probability: probability of reaching \(s'\) from \(s\) after taking action \(a\), with \(\sum_{s'} P(s' \mid s, a) = 1\)
- \(R(s, a)\) — expected immediate reward for taking action \(a\) in state \(s\)
- \(\gamma \in [0, 1)\) — discount factor
The Markov property is the defining assumption: the distribution over next states depends only on the current \((s, a)\), not on the history of prior states or actions. This is what makes the current state a sufficient statistic for future decisions.
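A common way to store a finite MDP in code is as dense tabular arrays indexed by state and action. The sketch below is only illustrative: the names `P`, `R`, the sizes, and the uniform placeholder transitions are assumptions, not part of the definition above.

```python
# Minimal tabular representation of a finite MDP: transition probabilities
# and expected rewards as NumPy arrays indexed by (state, action).
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

# P[s, a, s'] = probability of reaching s' after taking action a in state s.
# Placeholder: uniform transitions, so each row P[s, a, :] sums to 1.
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)

# R[s, a] = expected immediate reward for taking action a in state s.
R = np.zeros((n_states, n_actions))

assert np.allclose(P.sum(axis=2), 1.0)
```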
Policies
A deterministic policy \(\pi : S \to A\) maps each state to a single action. A stochastic policy \(\pi(a \mid s)\) maps each state to a distribution over actions. Every deterministic policy is a special case of a stochastic one, and in finite MDPs the added generality is unnecessary: an optimal deterministic policy always exists.
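In tabular form the two kinds of policy are just two array shapes; a minimal sketch, with illustrative sizes and values:

```python
import numpy as np

n_states, n_actions = 3, 2

# Deterministic policy: one action index per state.
pi_det = np.array([0, 1, 0])

# Stochastic policy: a distribution over actions for each state;
# each row must sum to 1.
pi_stoch = np.array([[1.0, 0.0],
                     [0.3, 0.7],
                     [0.5, 0.5]])
assert np.allclose(pi_stoch.sum(axis=1), 1.0)

# A deterministic policy is the special case where each row places all of
# its probability mass on a single action.
pi_det_as_stoch = np.eye(n_actions)[pi_det]
```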
Objective
Following policy \(\pi\) from state \(s_0\) produces a trajectory \(s_0, a_0, r_0, s_1, a_1, r_1, \ldots\) where \(a_t \sim \pi(\cdot \mid s_t)\) and \(s_{t+1} \sim P(\cdot \mid s_t, a_t)\). The discounted return is:
\[ G = \sum_{t=0}^{\infty} \gamma^t r_t. \]
The discount factor \(\gamma\) makes rewards received sooner worth more than rewards received later, and it ensures \(G\) is finite whenever rewards are bounded.
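Computing the return of a finite reward sequence is a one-liner; the reward values below are illustrative:

```python
# Discounted return: G = sum_t gamma^t * r_t.
gamma = 0.9
rewards = [1.0, 0.0, 5.0, 3.0]

G = sum(gamma**t * r for t, r in enumerate(rewards))

# Equivalent backward recursion: G_t = r_t + gamma * G_{t+1}.
G_back = 0.0
for r in reversed(rewards):
    G_back = r + gamma * G_back
assert abs(G - G_back) < 1e-12
```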
The state-value function of \(\pi\) (Sutton and Barto 2018) is the expected return starting from \(s\):
\[ V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s,\, \pi\right]. \]
The optimal value function is:
\[ V^*(s) = \max_{\pi}\, V^\pi(s). \]
A policy \(\pi^*\) is optimal if \(V^{\pi^*}(s) = V^*(s)\) for every \(s \in S\).
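For a finite MDP and a fixed policy, \(V^\pi\) satisfies the linear system \(V^\pi = R^\pi + \gamma P^\pi V^\pi\), so it can be computed exactly with one linear solve. A minimal sketch, assuming the tabular arrays `P[s, a, s']` and `R[s, a]` introduced above and a deterministic policy stored as one action index per state:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Exact V^pi for a deterministic tabular policy via a linear solve.

    P  : shape (S, A, S), P[s, a, s'] transition probabilities
    R  : shape (S, A), expected immediate rewards
    pi : shape (S,), action index chosen in each state
    """
    n_states = P.shape[0]
    # Transition matrix and reward vector induced by the policy.
    P_pi = P[np.arange(n_states), pi]   # shape (S, S)
    R_pi = R[np.arange(n_states), pi]   # shape (S,)
    # Solve (I - gamma * P_pi) V = R_pi.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
```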
Existence of an Optimal Policy
For any finite MDP with \(\gamma < 1\), an optimal deterministic policy exists. Given \(V^*\), the greedy policy
\[ \pi^*(s) = \operatorname*{arg\,max}_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*(s') \right] \]
is deterministic and satisfies \(V^{\pi^*} = V^*\).
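Given \(V^*\) in tabular form, extracting this greedy policy is a single maximization per state. A minimal sketch under the same assumed array conventions (`P[s, a, s']`, `R[s, a]`):

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Greedy (deterministic) policy with respect to a value function V."""
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
    Q = R + gamma * np.einsum("sat,t->sa", P, V)
    return Q.argmax(axis=1)
```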
Example
A tiny MDP
This two-state MDP has a safe action with a small immediate reward and a risky action that can move to a higher-value state.
Consider a two-state MDP with \(S = \{0, 1\}\), \(\gamma = 0.9\), and the following state-dependent actions:
| \(s\) | \(a\) | \(R(s,a)\) | \(P(0 \mid s,a)\) | \(P(1 \mid s,a)\) |
|---|---|---|---|---|
| 0 | safe | 1 | 1.0 | 0.0 |
| 0 | risky | 5 | 0.3 | 0.7 |
| 1 | stay | 3 | 0.0 | 1.0 |
| 1 | reset | 0 | 1.0 | 0.0 |
A hand calculation identifies the best action in each state. If the policy chooses stay in state 1, then
\[ V(1) = 3 + 0.9V(1) \quad \Rightarrow \quad V(1) = 30. \]
In state 0, the safe action is worth
\[ Q(0, \text{safe}) = 1 + 0.9V(0), \]
while the risky action is worth
\[ Q(0, \text{risky}) = 5 + 0.9(0.3V(0) + 0.7V(1)). \]
Assuming risky is chosen at state 0 and substituting \(V(1)=30\) gives
\[ V(0) = 5 + 0.9(0.3V(0) + 0.7 \cdot 30) = 23.9 + 0.27V(0), \]
so \(V(0) = 23.9 / 0.73 \approx 32.74\). Now check the competing actions:
\[ Q(0, \text{safe}) = 1 + 0.9(32.74) \approx 30.47 < 32.74, \]
and
\[ Q(1, \text{reset}) = 0.9(32.74) \approx 29.47 < 30. \]
Thus the optimal policy is \(\pi^*(0)=\text{risky}\) and \(\pi^*(1)=\text{stay}\). The point of the example is that immediate reward alone is not decisive: safe gives a guaranteed reward in state 0, but risky is better because it usually moves to state 1, whose long-run reward stream is worth more.
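The hand calculation can be double-checked numerically. The sketch below encodes the table (ordering actions as safe/risky in state 0 and stay/reset in state 1, an arbitrary choice), evaluates the candidate policy by solving the two-equation fixed point, and confirms the competing actions score lower; the variable names are illustrative:

```python
import numpy as np

gamma = 0.9
# Action index 0 = safe (state 0) / stay (state 1); index 1 = risky / reset.
R = np.array([[1.0, 5.0],
              [3.0, 0.0]])
P = np.array([[[1.0, 0.0],    # state 0, safe
               [0.3, 0.7]],   # state 0, risky
              [[0.0, 1.0],    # state 1, stay
               [1.0, 0.0]]])  # state 1, reset

# Candidate policy: risky in state 0, stay in state 1.
pi = np.array([1, 0])
P_pi = P[np.arange(2), pi]
R_pi = R[np.arange(2), pi]
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)  # approx [32.74, 30.0]

# Q-values of the competing actions under this V.
Q = R + gamma * P @ V
print(Q[0, 0], Q[1, 1])  # approx 30.47 and 29.47, both smaller
```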
Solving MDPs
The recursive structure of \(V^*\) is captured by the Bellman equations. Two standard algorithms for computing \(\pi^*\):
- Value iteration: repeatedly apply the Bellman optimality operator until convergence (a sketch follows this list)
- Policy iteration: alternate between exact policy evaluation and greedy policy improvement
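As a concrete reference, here is a minimal value iteration sketch under the same assumed tabular conventions (`P[s, a, s']`, `R[s, a]`). The repeated update is \(V(s) \leftarrow \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \right]\); the tolerance and iteration cap are illustrative choices.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iters=10_000):
    """Value iteration for a finite MDP given as tabular arrays.

    Returns the (approximately) optimal value function and a greedy policy.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # Bellman optimality backup: Q[s, a] = R[s, a] + gamma * E[V(s')].
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)
```

Run on the two-state example above, this recovers the same values and policy as the hand calculation: \(V \approx (32.74, 30)\) with risky in state 0 and stay in state 1.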