Partially Observable Markov Decision Processes (POMDPs)
Motivation
A partially observable Markov decision process (POMDP) (Kaelbling et al. 1998) extends the MDP to settings where the agent cannot directly observe the true state. It is the standard model for sequential decision-making under uncertainty when perception is imperfect.
Model
A POMDP is a tuple \((S, A, \Omega, P, O, R, \gamma)\):
- \(S\) — finite set of states
- \(A\) — finite set of actions
- \(\Omega\) — finite set of observations
- \(P(s' \mid s, a)\) — transition probability
- \(O(o \mid s', a)\) — probability of observing \(o\) after arriving in \(s'\) via action \(a\)
- \(R(s, a)\) — expected immediate reward
- \(\gamma \in [0,1)\) — discount factor
See Partial Observability for motivation and Belief States for the belief update.
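As a concrete (purely illustrative) data layout, the tuple above can be stored as dense arrays indexed by action, state, and observation. The sketch below encodes the classic two-state tiger problem with its commonly quoted parameter values; the class and field names are my own, not from any particular library.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    """Dense tabular POMDP (S, A, Omega, P, O, R, gamma)."""
    P: np.ndarray   # P[a, s, s']  = P(s' | s, a)
    O: np.ndarray   # O[a, s', o]  = O(o | s', a)
    R: np.ndarray   # R[s, a]      = expected immediate reward
    gamma: float

# Two-state tiger problem: states {tiger-left, tiger-right},
# actions {listen, open-left, open-right}, observations {hear-left, hear-right}.
P = np.zeros((3, 2, 2))
P[0] = np.eye(2)                      # listening does not move the tiger
P[1] = P[2] = np.full((2, 2), 0.5)    # opening a door resets the problem
O = np.zeros((3, 2, 2))
O[0] = np.array([[0.85, 0.15],        # listen: hear the correct side w.p. 0.85
                 [0.15, 0.85]])
O[1] = O[2] = np.full((2, 2), 0.5)    # uninformative observations after opening
R = np.array([[-1.0, -100.0,   10.0],   # tiger-left:  listen, open-left, open-right
              [-1.0,   10.0, -100.0]])  # tiger-right
tiger = POMDP(P=P, O=O, R=R, gamma=0.95)
```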
Value Function over Beliefs
The POMDP is equivalent to an MDP over the belief simplex \(\Delta(S)\). The optimal value function \(V^* : \Delta(S) \to \mathbb{R}\) satisfies the Bellman equation:
\[V^*(b) = \max_{a \in A} \left[ \rho(b, a) + \gamma \sum_{o \in \Omega} P(o \mid b, a)\, V^*(b^{a,o}) \right],\]
where \(\rho(b, a) = \sum_s b(s) R(s, a)\) and \(b^{a,o}\) is the Bayes-filter update of \(b\) after action \(a\) and observation \(o\).
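A minimal sketch of the belief update \(b^{a,o}\) and of a single Bellman backup at one belief, assuming the dense array layout sketched above (`P[a, s, s']`, `O[a, s', o]`, `R[s, a]`); it illustrates the equation, not any particular solver.

```python
import numpy as np

def belief_update(m, b, a, o):
    """Bayes-filter update: b^{a,o}(s') proportional to O(o|s',a) * sum_s P(s'|s,a) b(s)."""
    pred = b @ m.P[a]                  # predicted distribution over next states s'
    unnorm = m.O[a, :, o] * pred       # weight by observation likelihood
    norm = unnorm.sum()                # = P(o | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnorm / norm, norm

def bellman_backup(m, b, V):
    """One lookahead step of the Bellman equation at belief b, given a callable V over beliefs."""
    best = -np.inf
    for a in range(m.R.shape[1]):
        q = float(b @ m.R[:, a])                  # rho(b, a)
        for o in range(m.O.shape[2]):
            p_o = float(m.O[a, :, o] @ (b @ m.P[a]))   # P(o | b, a)
            if p_o > 0.0:
                b_next, _ = belief_update(m, b, a, o)
                q += m.gamma * p_o * V(b_next)
        best = max(best, q)
    return best
```

With the tiger arrays above, `belief_update(tiger, np.array([0.5, 0.5]), 0, 0)` sharpens the uniform belief to (0.85, 0.15) in favor of tiger-left.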
Piecewise Linear and Convex Value Functions
A central structural result is that every finite-horizon optimal value function is piecewise linear and convex (PWLC) in \(b\); the infinite-horizon \(V^*\) is convex and can be approximated arbitrarily closely by PWLC functions. (proof) In the PWLC case this means:
\[V^*(b) = \max_{\alpha \in \Gamma}\, \alpha \cdot b\]
for a finite collection of \(\alpha\)-vectors \(\Gamma \subset \mathbb{R}^{|S|}\). Each \(\alpha\)-vector corresponds to a conditional plan — a strategy tree that maps every possible future observation sequence to an action. The value of a belief under plan \(\alpha\) is the dot product \(\alpha \cdot b\).
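Evaluating a PWLC value function is just a maximum over dot products. A minimal sketch, with `Gamma` stored as a (number of vectors, \(|S|\)) array (names are illustrative):

```python
import numpy as np

def pwlc_value(Gamma, b):
    """V(b) = max over alpha in Gamma of alpha . b."""
    return float(np.max(Gamma @ b))

def best_alpha(Gamma, b):
    """Index of the maximizing alpha-vector, i.e. the optimal conditional plan at b."""
    return int(np.argmax(Gamma @ b))
```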
Exact Solution via \(\alpha\)-Vector Iteration
Finite-horizon POMDP value iteration maintains a set \(\Gamma_k\) of \(\alpha\)-vectors representing \(V_k^*\):
- Initialize \(\Gamma_0 = \{\mathbf{0}\}\).
- For each horizon step, compute \(\Gamma_{k+1}\) from \(\Gamma_k\): for each action \(a\) and observation \(o\), back-project each \(\alpha \in \Gamma_k\) through \(O\) and \(P\); cross-sum the resulting sets across observations and add the immediate reward \(R(\cdot, a)\); take the union over actions.
- Prune dominated \(\alpha\)-vectors (those never optimal for any belief).
The number of \(\alpha\)-vectors can grow exponentially with the horizon in the worst case, making exact POMDP solution intractable for large problems.
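A compact sketch of one exact backup \(\Gamma_k \to \Gamma_{k+1}\) under the dense array layout assumed above. The cross-sum over observations is enumerated explicitly, and the pruning shown is only simple pointwise dominance (full pruning uses linear programs), so treat this as illustrative rather than as a practical solver.

```python
import numpy as np
from itertools import product

def exact_backup(m, Gamma):
    """One step of exact value iteration: Gamma_k -> Gamma_{k+1}."""
    A, S, _ = m.P.shape
    num_obs = m.O.shape[2]
    new_vectors = []
    for a in range(A):
        # Back-project each alpha through P and O, once per observation:
        # alpha^{a,o}(s) = gamma * sum_{s'} O(o|s',a) P(s'|s,a) alpha(s')
        projected = [
            [m.gamma * (m.P[a] @ (m.O[a, :, o] * alpha)) for alpha in Gamma]
            for o in range(num_obs)
        ]
        # Cross-sum: pick one projected vector per observation, add R(., a).
        for choice in product(*projected):
            new_vectors.append(m.R[:, a] + np.sum(choice, axis=0))
    return prune_pointwise(np.array(new_vectors))

def prune_pointwise(Gamma):
    """Drop vectors that another single vector dominates everywhere (crude pruning)."""
    keep = []
    for i, alpha in enumerate(Gamma):
        dominated = any(
            j != i and np.all(Gamma[j] >= alpha) and np.any(Gamma[j] > alpha)
            for j in range(len(Gamma))
        )
        if not dominated:
            keep.append(alpha)
    return np.array(keep)
```

Iterating `exact_backup` from \(\Gamma_0 = \{\mathbf{0}\}\) for \(k\) steps yields the horizon-\(k\) value function; even on a two-state problem the vector count grows quickly without stronger pruning.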
Relationship to MDPs
A fully observable MDP is the special case \(\Omega = S\) and \(O(o \mid s', a) = \mathbf{1}[o = s']\): each state produces a unique observation. The belief collapses to a point mass at the observed state, and the POMDP value function reduces to the MDP value function evaluated at that state.
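A quick sanity check of the collapse, reusing the Bayes update sketched earlier with an identity observation matrix (numbers are arbitrary):

```python
import numpy as np

# With Omega = S and O(o | s', a) = 1[o = s'], the update puts all mass
# on the observed state, regardless of the prior belief or transition model.
S = 3
P_a = np.full((S, S), 1.0 / S)    # any transition model
O_identity = np.eye(S)            # O[s', o] = 1[o = s']
b = np.array([0.2, 0.5, 0.3])

pred = b @ P_a                    # predict next-state distribution
o = 1                             # suppose state 1 is observed
b_next = O_identity[:, o] * pred
b_next /= b_next.sum()
print(b_next)                     # -> [0. 1. 0.]
```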
Approximate Methods
Exact \(\alpha\)-vector iteration is rarely practical beyond small problems. Common approximate approaches include:
- Point-based value iteration (PBVI): updates \(\alpha\)-vectors only at a sampled set of reachable beliefs (see the sketch after this list)
- POMCP: Monte Carlo tree search adapted to POMDPs, using particle filters to approximate beliefs
- Recurrent policies: neural networks that maintain a hidden state as an implicit belief
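The PBVI bullet above amounts to running the same backup, but generating only the single best vector at each belief in a fixed set \(B\). A hedged sketch, reusing the hypothetical array layout from earlier rather than PBVI's original pseudocode:

```python
import numpy as np

def point_based_backup(m, Gamma, beliefs):
    """PBVI-style backup: one new alpha-vector per belief in `beliefs`."""
    A, S, _ = m.P.shape
    num_obs = m.O.shape[2]
    new_Gamma = []
    for b in beliefs:
        best_vec, best_val = None, -np.inf
        for a in range(A):
            vec = m.R[:, a].copy()
            for o in range(num_obs):
                # Back-projected candidates for this (a, o); keep only the one best at b.
                candidates = np.array([
                    m.gamma * (m.P[a] @ (m.O[a, :, o] * alpha)) for alpha in Gamma
                ])
                vec += candidates[np.argmax(candidates @ b)]
            val = float(vec @ b)
            if val > best_val:
                best_vec, best_val = vec, val
        new_Gamma.append(best_vec)
    return np.array(new_Gamma)
```

Starting from a single zero vector and a small belief set (for example the corners of the simplex plus the uniform belief), repeated calls keep \(|\Gamma|\) bounded by \(|B|\), which is the point of the method.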