Belief States
Motivation
Under partial observability the agent cannot know the true state. The belief state summarizes all information the agent has accumulated as a probability distribution over states (Russell and Norvig 2020). It is the correct input to planning in a POMDP because it is a sufficient statistic for the full action-observation history: knowing the belief is just as informative as knowing every past action and observation.
Definition
A belief is a probability distribution over states:
\[b : S \to [0, 1], \qquad \sum_{s \in S} b(s) = 1.\]
The set of all beliefs is the belief simplex \(\Delta(S) \subset \mathbb{R}^{|S|}\). An agent that starts in a known state \(s_0\) has initial belief \(b_0(s) = \mathbf{1}[s = s_0]\); if the initial state is unknown it uses a prior \(b_0(s) = P(s_0 = s)\).
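For concreteness, a belief can be stored as a vector on the simplex. A minimal sketch in NumPy, assuming states are indexed \(0, \dots, |S|-1\) (the variable names are illustrative):

```python
import numpy as np

n_states = 3

# Known initial state s0: all probability mass on a single state.
s0 = 1
b0_known = np.zeros(n_states)
b0_known[s0] = 1.0

# Unknown initial state: a prior over states, e.g. uniform.
b0_prior = np.full(n_states, 1.0 / n_states)

# Both vectors lie on the belief simplex: nonnegative, summing to 1.
assert np.isclose(b0_known.sum(), 1.0) and np.isclose(b0_prior.sum(), 1.0)
```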
Belief Update
After taking action \(a\) and receiving observation \(o\), the agent updates its belief by Bayes’ rule:
\[b'(s') = \frac{O(o \mid s', a)\displaystyle\sum_{s \in S} P(s' \mid s, a)\, b(s)}{P(o \mid b, a)},\]
where the normalizing constant is:
\[P(o \mid b, a) = \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s).\]
This update is the Bayes filter. It costs \(O(|S|^2)\) per step: for each of the \(|S|\) successor states \(s'\), it sums over all \(|S|\) current states \(s\).
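A sketch of this update in NumPy, assuming the transition and observation models are stored as dense arrays indexed by action (the function name and array layout are assumptions for illustration, not from the text):

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """One Bayes-filter step.

    b : (|S|,) current belief
    a : action index
    o : observation index
    P : (|A|, |S|, |S|) transitions, P[a, s, s2] = P(s2 | s, a)
    O : (|A|, |S|, |O|) observations, O[a, s2, o] = O(o | s2, a)
    """
    # Predict: push the belief through the transition model (the O(|S|^2) part).
    pred = b @ P[a]              # pred[s'] = sum_s P(s' | s, a) b(s)
    # Correct: weight each successor state by the observation likelihood.
    unnorm = O[a, :, o] * pred   # O(o | s', a) * pred[s']
    norm = unnorm.sum()          # P(o | b, a), the normalizing constant
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return unnorm / norm
```

Writing the inner sum as a matrix-vector product keeps the \(O(|S|^2)\) cost but avoids an explicit double loop over \(s\) and \(s'\).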
The Belief MDP
Because the belief is a sufficient statistic, the POMDP can be reformulated as a fully observable MDP over the continuous state space \(\Delta(S)\):
- State: belief \(b \in \Delta(S)\)
- Action: same action set \(A\)
- Transition: \(b \xrightarrow{a,\, o} b'\) is deterministic via the Bayes filter, but stochastic from the agent’s perspective because \(o\) is random
- Reward: \(\rho(b, a) = \displaystyle\sum_{s \in S} b(s)\, R(s, a)\)
The belief MDP has the same optimal value function as the original POMDP. Its state space is continuous (a simplex), so solving it exactly requires exploiting additional structure; see Partially Observable Markov Decision Processes.
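Under the same assumed array layout as above, one belief-MDP step can be sketched as follows, reusing belief_update from the earlier sketch: the observation \(o\) is sampled from \(P(o \mid b, a)\), after which the successor belief is deterministic.

```python
import numpy as np

def belief_mdp_step(b, a, P, O, R, rng):
    """One step of the belief MDP from the agent's perspective.

    R   : (|S|, |A|) rewards, R[s, a] = R(s, a)
    rng : np.random.Generator; other arguments as in belief_update.
    """
    # Belief reward: expected immediate reward under the current belief.
    rho = b @ R[:, a]            # rho(b, a) = sum_s b(s) R(s, a)
    # Induced observation distribution P(o | b, a).
    pred = b @ P[a]              # predicted next-state distribution
    p_obs = pred @ O[a]          # p_obs[o] = sum_{s'} O(o | s', a) pred[s']
    o = rng.choice(len(p_obs), p=p_obs)            # o is random ...
    return o, belief_update(b, a, o, P, O), rho    # ... b' is then deterministic
```

For example, `o, b_next, rho = belief_mdp_step(b0_prior, 0, P, O, R, np.random.default_rng(0))` simulates one step from a uniform prior, given model arrays `P`, `O`, `R` of the shapes assumed above.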