Partial Observability
Motivation
In a fully observable MDP the agent knows which state it occupies at every step. Many real problems violate this assumption: a robot does not know its exact position; a card player cannot see opponents’ hands; a medical system cannot directly observe a patient’s hidden physiological state. Partial observability captures this mismatch by separating the true world state from what the agent actually perceives (Russell and Norvig 2020).
Observation Model
A partially observable environment extends the MDP with:
- \(\Omega\) — finite set of observations
- \(O(o \mid s', a)\) — probability of observing \(o\) after transitioning to state \(s'\) via action \(a\)
The full model is the tuple \((S, A, \Omega, P, O, R, \gamma)\) (see Partially Observable Markov Decision Processes).
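As a schematic illustration, the tuple can be packaged directly as a data structure. The sketch below is one minimal Python encoding under assumed conventions: distributions are represented as dictionaries mapping outcomes to probabilities, and the names (`POMDP`, `P`, `O`, `R`) simply mirror the symbols above rather than any particular library's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Dist = Dict[str, float]  # a distribution as {outcome: probability}

@dataclass
class POMDP:
    """Container mirroring the tuple (S, A, Omega, P, O, R, gamma)."""
    states: Sequence[str]             # S
    actions: Sequence[str]            # A
    observations: Sequence[str]       # Omega
    P: Callable[[str, str], Dist]     # P(. | s, a): distribution over S
    O: Callable[[str, str], Dist]     # O(. | s', a): distribution over Omega
    R: Callable[[str, str], float]    # R(s, a)
    gamma: float                      # discount factor
```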
At each time step:
- The agent is in state \(s\) and takes action \(a\).
- The environment transitions to \(s' \sim P(\cdot \mid s, a)\).
- The agent receives observation \(o \sim O(\cdot \mid s', a)\).
- The agent collects reward \(R(s, a)\).
The agent never observes \(s\) or \(s'\) directly; it sees only the sequence of actions and observations.
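To make that information flow explicit, here is a sketch of a single environment step, reusing the hypothetical `POMDP` container above. The true state is threaded through only so the environment can carry it forward; the agent's view of the return value is just `(o, r)`.

```python
import random
from typing import Tuple

def env_step(pomdp: POMDP, s: str, a: str,
             rng: random.Random) -> Tuple[str, str, float]:
    """One time step: the agent-visible part of the return is (o, r) only."""
    # Environment transitions to s' ~ P(. | s, a).
    p = pomdp.P(s, a)
    s_next = rng.choices(list(p), weights=list(p.values()))[0]
    # Agent receives observation o ~ O(. | s', a).
    q = pomdp.O(s_next, a)
    o = rng.choices(list(q), weights=list(q.values()))[0]
    # Agent collects reward R(s, a).
    r = pomdp.R(s, a)
    return s_next, o, r  # s_next stays inside the environment
```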
Why the State Cannot Be Recovered
In general, the full action-observation history \(h_t = (a_0, o_1, a_1, o_2, \ldots, o_t)\) does not determine \(s_t\) exactly. Multiple states may be consistent with the same history. The best the agent can do is maintain a belief state — a probability distribution over possible current states — and act on that distribution (see Belief States).
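The standard way to maintain such a belief is the recursive Bayes update \(b'(s') \propto O(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)\): predict where the state could have moved, then reweight by how well each candidate explains the observation. A minimal sketch, again assuming the dictionary-based `POMDP` above:

```python
def belief_update(pomdp: POMDP, b: Dist, a: str, o: str) -> Dist:
    """Bayes filter: b'(s') is proportional to O(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    unnormalized = {}
    for s_next in pomdp.states:
        # Prediction step: probability of landing in s_next under belief b.
        predicted = sum(pomdp.P(s, a).get(s_next, 0.0) * b[s]
                        for s in pomdp.states)
        # Correction step: reweight by the likelihood of the observation.
        unnormalized[s_next] = pomdp.O(s_next, a).get(o, 0.0) * predicted
    z = sum(unnormalized.values())  # Pr(o | b, a); positive for any o actually observed
    return {s: w / z for s, w in unnormalized.items()}
```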
Fully Observable as a Special Case
A fully observable MDP is the special case where \(\Omega = S\) and \(O(o \mid s', a) = \mathbf{1}[o = s']\): each state produces a unique, deterministic observation. The agent can then recover the true state exactly from the observation, and the belief collapses to a point mass.
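Substituting this observation function into the belief update above makes the collapse explicit: the indicator zeroes out every state except the one just observed, so after normalization

\[
b'(s') \;\propto\; \mathbf{1}[o = s'] \sum_{s} P(s' \mid s, a)\, b(s)
\quad\Longrightarrow\quad
b'(s') = \mathbf{1}[s' = o].
\]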