Markov Decision Processes

Motivation

Many sequential decision problems share a common structure: an agent repeatedly observes its current state, selects an action, receives a reward, and transitions to a new state. A Markov decision process (MDP) (Puterman 1994) formalizes this structure precisely, yielding a well-defined optimization problem whose solution is a policy that maximizes expected cumulative reward.

Definition

A (discrete-time, discounted) MDP is a tuple \((S, A, P, R, \gamma)\):

  • \(S\) — finite set of states
  • \(A\) — finite set of actions
  • \(P(s' \mid s, a)\) — transition probability: the probability of reaching \(s'\) from \(s\) after taking action \(a\), with \(\sum_{s'} P(s' \mid s, a) = 1\)
  • \(R(s, a)\) — expected immediate reward for taking action \(a\) in state \(s\)
  • \(\gamma \in [0, 1)\) — discount factor

The Markov property is the defining assumption: the distribution over next states depends only on the current \((s, a)\), not on the history of prior states or actions. This is what makes the current state a sufficient statistic for future decisions.
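
To make the tuple concrete, here is a minimal sketch of how a finite MDP might be encoded in Python with plain dictionaries; the states, actions, and numbers are illustrative placeholders, not taken from the text.

    # A finite MDP (S, A, P, R, gamma) encoded as plain Python dictionaries.
    # States, actions, and numbers here are illustrative placeholders.
    S = ["s0", "s1"]
    A = {"s0": ["a0", "a1"], "s1": ["a0"]}      # actions available in each state

    # P[(s, a)] maps next states s' to probabilities summing to 1.
    P = {
        ("s0", "a0"): {"s0": 1.0},
        ("s0", "a1"): {"s0": 0.5, "s1": 0.5},
        ("s1", "a0"): {"s1": 1.0},
    }

    # R[(s, a)] is the expected immediate reward for taking a in s.
    R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0, ("s1", "a0"): 2.0}

    gamma = 0.9                                 # discount factor in [0, 1)

    # Sanity check: each transition distribution sums to 1. The Markov property
    # is implicit in the encoding: P depends only on the current (s, a) pair.
    for (s, a), dist in P.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9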

Policies

A deterministic policy \(\pi : S \to A\) maps each state to a single action. A stochastic policy \(\pi(a \mid s)\) maps each state to a distribution over actions. Every deterministic policy is a special case of a stochastic one, and in finite MDPs the added generality is unnecessary: an optimal deterministic policy always exists.
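
As a small illustration under the dictionary encoding sketched above (all names assumed, not from the text), a deterministic policy can be a plain state-to-action map, and a stochastic policy a state-to-distribution map:

    # Deterministic policy: one action per state.
    pi_det = {"s0": "a1", "s1": "a0"}

    # Stochastic policy: a probability distribution over actions per state.
    pi_stoch = {
        "s0": {"a0": 0.3, "a1": 0.7},
        "s1": {"a0": 1.0},
    }

    # Every deterministic policy is the stochastic policy that puts
    # probability 1 on its chosen action.
    pi_det_as_stoch = {s: {a: 1.0} for s, a in pi_det.items()}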

Objective

Following policy \(\pi\) from state \(s_0\) produces a trajectory \(s_0, a_0, r_0, s_1, a_1, r_1, \ldots\) where \(a_t \sim \pi(\cdot \mid s_t)\) and \(s_{t+1} \sim P(\cdot \mid s_t, a_t)\). The discounted return is:

\[ G = \sum_{t=0}^{\infty} \gamma^t r_t. \]

The discount factor \(\gamma\) makes earlier rewards worth more than later ones, and ensures that \(G\) is finite whenever rewards are bounded.
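
A short numeric check of this formula (a sketch; the reward sequence is made up): for a constant reward \(r\), truncating the sum after enough steps approximates the geometric-series limit \(r/(1-\gamma)\).

    def discounted_return(rewards, gamma):
        """Discounted return of a finite reward sequence: sum of gamma**t * r_t."""
        return sum(gamma**t * r for t, r in enumerate(rewards))

    # A constant reward of 3 forever is worth 3 / (1 - gamma) = 30 when gamma = 0.9;
    # a long finite prefix comes out very close to that limit.
    print(discounted_return([3] * 500, 0.9))    # ~30.0
    print(3 / (1 - 0.9))                        # 30.0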

The state-value function of \(\pi\) (Sutton and Barto 2018) is the expected return starting from \(s\):

\[ V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s,\, \pi\right]. \]
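
One way to make this expectation concrete is to estimate it by sampling: roll out the policy many times, accumulate discounted rewards, and average. A minimal Monte Carlo sketch under the dictionary encoding above (assumed names; each rollout is truncated at a finite horizon, which is an approximation):

    import random

    def sample_next_state(P, s, a):
        """Draw s' ~ P(. | s, a) from a dict of probabilities."""
        next_states, probs = zip(*P[(s, a)].items())
        return random.choices(next_states, weights=probs)[0]

    def mc_value_estimate(P, R, pi, gamma, s0, episodes=2000, horizon=200):
        """Monte Carlo estimate of V^pi(s0) for a deterministic policy pi."""
        total = 0.0
        for _ in range(episodes):
            s, g, discount = s0, 0.0, 1.0
            for _ in range(horizon):    # truncation is fine when gamma**horizon is tiny
                a = pi[s]
                g += discount * R[(s, a)]
                discount *= gamma
                s = sample_next_state(P, s, a)
            total += g
        return total / episodes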

The optimal value function is:

\[ V^*(s) = \max_{\pi}\, V^\pi(s). \]

A policy \(\pi^*\) is optimal if \(V^{\pi^*}(s) = V^*(s)\) for every \(s \in S\).

Existence of an Optimal Policy

For any finite MDP with \(\gamma < 1\), an optimal deterministic policy exists. Given \(V^*\), the greedy policy

\[ \pi^*(s) = \operatorname*{arg\,max}_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*(s') \right] \]

is deterministic and satisfies \(V^{\pi^*} = V^*\).
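
A sketch of this greedy extraction under the same dictionary encoding (assumed names): for each state, pick the action maximizing the one-step lookahead value.

    def greedy_policy(S, A, P, R, gamma, V):
        """Deterministic policy that acts greedily with respect to a value function V."""
        def q(s, a):
            # One-step lookahead: immediate reward plus discounted expected value.
            return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
        return {s: max(A[s], key=lambda a: q(s, a)) for s in S}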

Example

Figure: a tiny MDP. This two-state MDP has a safe action with a small immediate reward and a risky action that can move to a higher-value state (edge labels: safe r=1; risky p=0.7, r=5; stay r=3; reset p=1, r=0).

Consider a two-state MDP with \(S = \{0, 1\}\), \(\gamma = 0.9\), and the following state-dependent actions:

\(s\)   \(a\)     \(R(s,a)\)   \(P(0 \mid s,a)\)   \(P(1 \mid s,a)\)
0       safe      1            1.0                 0.0
0       risky     5            0.3                 0.7
1       stay      3            0.0                 1.0
1       reset     0            1.0                 0.0

A hand calculation confirms the best action in each state. If the policy chooses stay in state 1, then

\[ V(1) = 3 + 0.9V(1) \quad \Rightarrow \quad V(1) = 30. \]

In state 0, the safe action is worth

\[ Q(0, \text{safe}) = 1 + 0.9V(0), \]

while the risky action is worth

\[ Q(0, \text{risky}) = 5 + 0.9(0.3V(0) + 0.7V(1)). \]

Assuming risky is chosen in state 0 and substituting \(V(1)=30\) gives

\[ V(0) = 5 + 0.9(0.3V(0) + 0.7 \cdot 30) = 23.9 + 0.27V(0), \]

so \(V(0) = 23.9 / 0.73 \approx 32.74\). Now check the competing actions:

\[ Q(0, \text{safe}) = 1 + 0.9(32.74) \approx 30.47 < 32.74, \]

and

\[ Q(1, \text{reset}) = 0.9(32.74) \approx 29.47 < 30. \]

Thus the optimal policy is \(\pi^*(0)=\text{risky}\) and \(\pi^*(1)=\text{stay}\). The point of the example is that the immediate reward is not enough: safe gives a guaranteed reward in state 0, but risky is better because it often reaches the long-run reward stream in state 1.
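
The hand calculation can be double-checked numerically. A minimal value-iteration loop on the same data (a sketch, using the dictionary encoding from earlier) converges to the same values and the same policy:

    # The example MDP, exactly as in the table above.
    S = [0, 1]
    A = {0: ["safe", "risky"], 1: ["stay", "reset"]}
    P = {
        (0, "safe"):  {0: 1.0},
        (0, "risky"): {0: 0.3, 1: 0.7},
        (1, "stay"):  {1: 1.0},
        (1, "reset"): {0: 1.0},
    }
    R = {(0, "safe"): 1.0, (0, "risky"): 5.0, (1, "stay"): 3.0, (1, "reset"): 0.0}
    gamma = 0.9

    def q(s, a, V):
        """One-step lookahead value of action a in state s under values V."""
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

    V = {s: 0.0 for s in S}
    for _ in range(1000):                       # plenty of sweeps for convergence
        V = {s: max(q(s, a, V) for a in A[s]) for s in S}

    pi = {s: max(A[s], key=lambda a: q(s, a, V)) for s in S}
    print(V)    # {0: ~32.74, 1: ~30.0}
    print(pi)   # {0: 'risky', 1: 'stay'}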

Solving MDPs

The recursive structure of \(V^*\) is captured by the Bellman equations. Two standard algorithms for computing \(\pi^*\):

  • Value iteration: repeatedly apply the Bellman operator until convergence
  • Policy iteration: alternate between exact policy evaluation and greedy policy improvement (a sketch appears below)
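
The inline loop in the example above is an instance of value iteration. A minimal policy-iteration sketch under the same tabular encoding (names assumed; numpy used only for the exact evaluation step):

    import numpy as np

    def policy_iteration(S, A, P, R, gamma):
        """Alternate exact policy evaluation and greedy improvement until stable."""
        idx = {s: i for i, s in enumerate(S)}
        pi = {s: A[s][0] for s in S}            # arbitrary initial policy
        while True:
            # Evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
            P_pi = np.zeros((len(S), len(S)))
            R_pi = np.zeros(len(S))
            for s in S:
                R_pi[idx[s]] = R[(s, pi[s])]
                for s2, p in P[(s, pi[s])].items():
                    P_pi[idx[s], idx[s2]] = p
            V = np.linalg.solve(np.eye(len(S)) - gamma * P_pi, R_pi)
            # Improvement: act greedily with respect to V.
            new_pi = {
                s: max(A[s], key=lambda a: R[(s, a)]
                       + gamma * sum(p * V[idx[s2]] for s2, p in P[(s, a)].items()))
                for s in S
            }
            if new_pi == pi:                    # a stable greedy policy is optimal
                return pi, {s: float(V[idx[s]]) for s in S}
            pi = new_pi

On the two-state example this returns the policy {0: 'risky', 1: 'stay'} with values matching the hand calculation.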

References

Puterman, Martin L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9780470316887.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. 2nd ed. MIT Press. https://mitpress.mit.edu/9780262039246/reinforcement-learning/.