The Belief State Is a Sufficient Statistic for the History
Claim
Let \(h_t = (a_0, o_1, a_1, o_2, \ldots, a_{t-1}, o_t)\) be the action-observation history up to time \(t\) in a POMDP, and let \(b_t(s) = P(s_t = s \mid h_t)\) be the belief at time \(t\). Then:
- \(b_{t+1}\) can be computed from \((b_t, a_t, o_{t+1})\) alone by the Bayes filter, so \(b_t\) is obtained recursively from \(b_0\) without storing \(h_t\) itself.
- For any fixed sequence of future actions, the conditional distribution of any future event given \(h_t\) equals its conditional distribution given \(b_t\) alone.
Proof
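We prove by induction on \(t\) that \(b_t(s) = P(s_t = s \mid h_t)\) is obtained from \(b_0\) by applying the Bayes filter once for each pair \((a_0, o_1), \ldots, (a_{t-1}, o_t)\), and that \(s_t \perp h_{t-1} \mid b_t\); that is, once \(b_t\) is known, the earlier history carries no further information about \(s_t\).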
Base case (\(t = 0\)). The initial belief \(b_0(s) = P(s_0 = s)\) encodes all information about \(s_0\). There is no history, so the condition holds trivially.
Inductive step. Assume \(b_t(s) = P(s_t = s \mid h_t)\) and that \(s_t \perp h_{t-1} \mid b_t\). We show \(b_{t+1}(s') = P(s_{t+1} = s' \mid h_{t+1})\).
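The history \(h_{t+1} = (h_t, a_t, o_{t+1})\) extends \(h_t\) with action \(a_t\) and observation \(o_{t+1}\). In a POMDP the observation \(o_{t+1}\) depends on the past only through \((s_{t+1}, a_t)\), so \(P(o_{t+1} \mid s_{t+1} = s', h_t, a_t) = P(o_{t+1} \mid s_{t+1} = s', a_t)\). Applying Bayes’ rule to \(o_{t+1}\) and using this conditional independence: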
\[P(s_{t+1} = s' \mid h_{t+1}) = \frac{P(o_{t+1} \mid s_{t+1} = s', a_t)\, P(s_{t+1} = s' \mid h_t, a_t)}{P(o_{t+1} \mid h_t, a_t)}.\]
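The numerator’s second factor expands by the law of total probability over \(s_t\). The Markov property of the transition model gives \(P(s_{t+1} = s' \mid s_t = s, h_t, a_t) = P(s' \mid s, a_t)\), and because \(a_t\) is chosen from \(h_t\) alone it carries no additional information about \(s_t\), so \(P(s_t = s \mid h_t, a_t) = P(s_t = s \mid h_t)\):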
\[P(s_{t+1} = s' \mid h_t, a_t) = \sum_{s \in S} P(s' \mid s, a_t)\, P(s_t = s \mid h_t) = \sum_{s \in S} P(s' \mid s, a_t)\, b_t(s).\]
The observation probability is \(P(o_{t+1} \mid s_{t+1} = s', a_t) = O(o_{t+1} \mid s', a_t)\) by the POMDP observation model. Substituting:
\[P(s_{t+1} = s' \mid h_{t+1}) = \frac{O(o_{t+1} \mid s', a_t)\displaystyle\sum_{s} P(s' \mid s, a_t)\, b_t(s)}{P(o_{t+1} \mid b_t, a_t)} = b_{t+1}(s').\]
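The right-hand side is exactly the Bayes filter update. Its denominator is the normalizing constant \(P(o_{t+1} \mid h_t, a_t) = \sum_{s''} O(o_{t+1} \mid s'', a_t) \sum_{s} P(s'' \mid s, a_t)\, b_t(s)\), which depends on \(h_t\) only through \(b_t\) and can therefore be written \(P(o_{t+1} \mid b_t, a_t)\); it is positive for every history that occurs with positive probability. Hence \(b_{t+1}\) depends on \(h_{t+1}\) only through \((b_t, a_t, o_{t+1})\). Since \(b_t\) is determined by \(h_t\) by the inductive hypothesis, \(b_{t+1}\) is determined by \(h_{t+1}\). Moreover, every history that yields the same \(b_{t+1}\) yields the same posterior over \(s_{t+1}\), namely \(b_{t+1}\) itself, so \(s_{t+1} \perp h_t \mid b_{t+1}\), which completes the induction.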
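As a numerical sanity check on the inductive step, the sketch below compares the recursive Bayes filter with a brute-force computation of \(P(s_t = s \mid h_t)\) that sums over all state trajectories. The two-state POMDP, the tables `T` and `O`, and all function names are hypothetical and chosen only for illustration; the indexing convention `T[a][s][s2]` for \(P(s' \mid s, a)\) and `O[a][s2][o]` for \(O(o \mid s', a)\) is an assumption of this sketch.

```python
from itertools import product

# Hypothetical 2-state, 2-action, 2-observation POMDP (illustrative numbers).
# T[a][s][s2] = P(s2 | s, a);  O[a][s2][o] = O(o | s2, a)
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.4, 0.6]]]
O = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.6, 0.4], [0.1, 0.9]]]
b0 = [0.5, 0.5]


def belief_update(b, a, o, T, O):
    """One Bayes filter step: b_{t+1} from (b_t, a_t, o_{t+1})."""
    n = len(b)
    # Prediction: P(s' | h_t, a_t) = sum_s P(s' | s, a_t) b_t(s)
    predicted = [sum(T[a][s][s2] * b[s] for s in range(n)) for s2 in range(n)]
    # Correction: weight each s' by the observation likelihood O(o | s', a)
    unnorm = [O[a][s2][o] * predicted[s2] for s2 in range(n)]
    z = sum(unnorm)  # normalizer P(o_{t+1} | b_t, a_t)
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / z for u in unnorm]


def filter_history(b0, history, T, O):
    """Apply the Bayes filter along a history given as [(a_0, o_1), (a_1, o_2), ...]."""
    b = list(b0)
    for a, o in history:
        b = belief_update(b, a, o, T, O)
    return b


def posterior_by_enumeration(b0, history, T, O):
    """P(s_t | h_t) computed directly by summing over all state trajectories."""
    n = len(b0)
    weights = [0.0] * n
    for traj in product(range(n), repeat=len(history) + 1):
        p = b0[traj[0]]
        for k, (a, o) in enumerate(history):
            p *= T[a][traj[k]][traj[k + 1]] * O[a][traj[k + 1]][o]
        weights[traj[-1]] += p  # accumulate joint mass by final state
    z = sum(weights)
    return [w / z for w in weights]


history = [(0, 1), (1, 0), (0, 0)]
print(filter_history(b0, history, T, O))
print(posterior_by_enumeration(b0, history, T, O))
```

The two printed vectors should agree up to floating-point error, while the filter only ever stores the current belief rather than the whole history.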
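Future independence. Fix any \(\tau > t\) and any sequence of future actions \(a_t, \ldots, a_{\tau-1}\). By the Markov property, \(P(s_\tau \mid h_t, a_t, \ldots, a_{\tau-1}) = \sum_{s} P(s_\tau \mid s_t = s, a_t, \ldots, a_{\tau-1})\, b_t(s)\), so \(h_t\) influences \(s_\tau\) only through \(s_t\), whose conditional distribution is exactly \(b_t\). The same argument applies to future observations and rewards, which depend on the past only through the states and actions. Therefore the expected return of every continuation policy, and hence the optimal policy from time \(t\) onward, depends on \(h_t\) only through \(b_t\), and \(b_t\) is a sufficient statistic. \(\blacksquare\)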
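The future-independence step can also be checked numerically. The sketch below computes the probability of a future observation sequence in two ways: conditioning on the full history (a ratio of brute-force sequence likelihoods) and starting from \(b_t\) alone. It assumes a fixed, open-loop sequence of future actions; the POMDP tables and function names are hypothetical, with `T[a][s][s2]` standing for \(P(s' \mid s, a)\) and `O[a][s2][o]` for \(O(o \mid s', a)\).

```python
from itertools import product

# Hypothetical 2-state, 2-action, 2-observation POMDP (illustrative numbers).
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.4, 0.6]]]
O = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.6, 0.4], [0.1, 0.9]]]
b0 = [0.5, 0.5]


def sequence_likelihood(prior, steps, T, O):
    """P(observations | actions) for steps = [(a, o), ...], starting from `prior`.

    Computed by brute-force enumeration over all state trajectories.
    """
    n = len(prior)
    total = 0.0
    for traj in product(range(n), repeat=len(steps) + 1):
        p = prior[traj[0]]
        for k, (a, o) in enumerate(steps):
            p *= T[a][traj[k]][traj[k + 1]] * O[a][traj[k + 1]][o]
        total += p
    return total


def belief_after(prior, steps, T, O):
    """Bayes filter applied along steps = [(a, o), ...]."""
    n = len(prior)
    b = list(prior)
    for a, o in steps:
        unnorm = [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
                  for s2 in range(n)]
        z = sum(unnorm)
        b = [u / z for u in unnorm]
    return b


past = [(0, 1), (1, 0)]    # history h_t as (action, observation) pairs
future = [(1, 1), (0, 0)]  # fixed future actions with candidate observations

# Conditioning on the full history: P(past and future obs) / P(past obs)
from_history = (sequence_likelihood(b0, past + future, T, O)
                / sequence_likelihood(b0, past, T, O))

# Conditioning on the belief only: treat b_t as the prior for the future
b_t = belief_after(b0, past, T, O)
from_belief = sequence_likelihood(b_t, future, T, O)

print(from_history, from_belief)
```

The two printed numbers should coincide up to rounding, illustrating that \(b_t\) carries everything in \(h_t\) that matters for predicting the future.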