The Belief State Is a Sufficient Statistic for the History

Claim

Let \(h_t = (a_0, o_1, a_1, o_2, \ldots, a_{t-1}, o_t)\) be the action-observation history up to time \(t\) in a POMDP, and let \(b_t(s) = P(s_t = s \mid h_t)\) be the belief at time \(t\). Then:

  1. \(b_t\) is determined recursively from \(b_0\) by the Bayes filter: each update consumes only the most recent action and observation, so the full history need not be stored.
  2. The conditional distribution of any future event given \(h_t\) equals the conditional distribution given \(b_t\) alone.

(Kaelbling et al. 1998)

Proof

We prove by induction on \(t\) that \(b_t(s) = P(s_t = s \mid h_t)\) satisfies a recursion in which \(b_{t+1}\) is a function of \((b_t, a_t, o_{t+1})\) alone, so that \(b_t\) is determined from \(b_0\) by the Bayes filter. Sufficiency for future events (claim 2) then follows from the Markov structure.

Base case (\(t = 0\)). The history \(h_0\) is empty, and the initial belief \(b_0(s) = P(s_0 = s)\) is given as part of the POMDP specification, so the claim holds trivially.

Inductive step. Assume \(b_t(s) = P(s_t = s \mid h_t)\). We show that \(b_{t+1}(s') = P(s_{t+1} = s' \mid h_{t+1})\) is a function of \((b_t, a_t, o_{t+1})\) alone.

The history \(h_{t+1}\) extends \(h_t\) with action \(a_t\) and observation \(o_{t+1}\). By Bayes’ rule:

\[P(s_{t+1} = s' \mid h_{t+1}) = \frac{P(o_{t+1} \mid s_{t+1} = s', h_t, a_t)\, P(s_{t+1} = s' \mid h_t, a_t)}{P(o_{t+1} \mid h_t, a_t)}.\]

The numerator’s second factor expands by the Markov property of the transition model:

\[P(s_{t+1} = s' \mid h_t, a_t) = \sum_{s \in S} P(s' \mid s, a_t)\, P(s_t = s \mid h_t) = \sum_{s \in S} P(s' \mid s, a_t)\, b_t(s).\]
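This prediction step can be sketched in Python (a minimal illustration; the two-state numbers and the name `predict` are assumptions for the example, not notation from the paper):

```python
# Prediction step of the Bayes filter: push the belief b_t through the
# transition model for the chosen action a_t.
# trans[s][sp] = P(s_{t+1} = sp | s_t = s, a_t); each row sums to 1.

def predict(b, trans):
    """Return the predicted distribution P(s_{t+1} = s' | h_t, a_t)."""
    n = len(trans[0])
    return [sum(b[s] * trans[s][sp] for s in range(len(b))) for sp in range(n)]

# Two-state example: belief mostly on state 0, mildly noisy dynamics.
p = predict([0.8, 0.2], [[0.9, 0.1],
                         [0.3, 0.7]])
# p ≈ [0.78, 0.22]; the result is again a probability distribution.
```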

The observation model makes \(o_{t+1}\) conditionally independent of the history given the new state and the action, so \(P(o_{t+1} \mid s_{t+1} = s', h_t, a_t) = O(o_{t+1} \mid s', a_t)\). The denominator is the normalizer

\[P(o_{t+1} \mid h_t, a_t) = \sum_{s' \in S} O(o_{t+1} \mid s', a_t) \sum_{s \in S} P(s' \mid s, a_t)\, b_t(s),\]

which depends on \(h_t\) only through \(b_t\); we may therefore write it as \(P(o_{t+1} \mid b_t, a_t)\). Substituting:

\[P(s_{t+1} = s' \mid h_{t+1}) = \frac{O(o_{t+1} \mid s', a_t)\displaystyle\sum_{s} P(s' \mid s, a_t)\, b_t(s)}{P(o_{t+1} \mid b_t, a_t)} = b_{t+1}(s').\]

The right-hand side is exactly the Bayes filter update and depends on \(h_{t+1}\) only through \((b_t, a_t, o_{t+1})\). Since \(b_t\) is determined by \(h_t\) by the inductive hypothesis, \(b_{t+1}\) is determined by \(h_{t+1}\).
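Putting the prediction and correction steps together, one filter update can be sketched as follows (a minimal sketch; `belief_update`, `trans`, and `obs_lik` are hypothetical names, and the numbers are illustrative, not from the paper):

```python
# One Bayes-filter step: b_{t+1}(s') ∝ O(o_{t+1}|s',a_t) · Σ_s P(s'|s,a_t) b_t(s).
#   trans[s][sp] = P(s_{t+1} = sp | s_t = s, a_t)
#   obs_lik[sp]  = P(o_{t+1} | s_{t+1} = sp, a_t) for the observation received

def belief_update(b, trans, obs_lik):
    n = len(obs_lik)
    # Prediction: P(s_{t+1} = s' | h_t, a_t)
    pred = [sum(b[s] * trans[s][sp] for s in range(len(b))) for sp in range(n)]
    # Correction: weight each predicted state by its observation likelihood
    unnorm = [obs_lik[sp] * pred[sp] for sp in range(n)]
    # Normalizer: P(o_{t+1} | b_t, a_t)
    z = sum(unnorm)
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / z for u in unnorm]

# Two-state example: uniform prior, one update.
b1 = belief_update([0.5, 0.5],
                   [[0.9, 0.1], [0.3, 0.7]],   # transition matrix for a_t
                   [0.8, 0.1])                  # P(o_{t+1} | s', a_t)
# pred = [0.6, 0.4]; unnormalized = [0.48, 0.04]; b1 ≈ [0.923, 0.077]
```

Note that only \((b_t, a_t, o_{t+1})\) enter the computation, mirroring the derivation above.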

Future independence. For any \(\tau > t\) and any fixed sequence of future actions, the state \(s_\tau\) is conditionally independent of \(h_t\) given \(b_t\): the history influences \(s_\tau\) only through \(s_t\), whose distribution is exactly \(b_t\). The same holds for future observations and rewards, which depend on the history only through the states. Therefore the optimal policy from time \(t\) onward depends on \(h_t\) only through \(b_t\), and \(b_t\) is a sufficient statistic for the history. \(\blacksquare\)
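Sufficiency can be spot-checked numerically on a toy two-state POMDP (all models and numbers below are hypothetical): the recursively filtered belief agrees with the posterior \(P(s_t \mid h_t)\) obtained by brute-force enumeration over state sequences, confirming that nothing in the history beyond the running belief is needed.

```python
from itertools import product

# Hypothetical two-state POMDP: T[a][s][sp] = P(sp | s, a),
# O[a][sp][o] = P(o | sp, a), and a prior belief b0.
T = {0: [[0.9, 0.1], [0.3, 0.7]],
     1: [[0.5, 0.5], [0.5, 0.5]]}
O = {0: [[0.8, 0.2], [0.1, 0.9]],
     1: [[0.6, 0.4], [0.4, 0.6]]}
b0 = [0.7, 0.3]

def filter_belief(b0, history):
    """Recursive Bayes filter over (action, observation) pairs."""
    b = list(b0)
    for a, o in history:
        pred = [sum(b[s] * T[a][s][sp] for s in range(2)) for sp in range(2)]
        unnorm = [O[a][sp][o] * pred[sp] for sp in range(2)]
        z = sum(unnorm)
        b = [u / z for u in unnorm]
    return b

def direct_posterior(b0, history):
    """P(s_t | h_t) by summing joint probabilities over all state sequences."""
    t = len(history)
    post = [0.0, 0.0]
    for seq in product(range(2), repeat=t + 1):  # candidate (s_0, ..., s_t)
        p = b0[seq[0]]
        for i, (a, o) in enumerate(history):
            p *= T[a][seq[i]][seq[i + 1]] * O[a][seq[i + 1]][o]
        post[seq[-1]] += p
    z = sum(post)
    return [p / z for p in post]

h = [(0, 0), (1, 1), (0, 0)]  # (a_t, o_{t+1}) pairs
assert all(abs(x - y) < 1e-12
           for x, y in zip(filter_belief(b0, h), direct_posterior(b0, h)))
```

The enumeration costs \(O(|S|^{t+1})\) while the filter costs \(O(t\,|S|^2)\), which is precisely the practical payoff of the sufficiency result.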

References

Kaelbling, Leslie Pack, Michael L. Littman, and Anthony R. Cassandra. 1998. “Planning and Acting in Partially Observable Stochastic Domains.” Artificial Intelligence 101 (1-2): 99–134. https://doi.org/10.1016/s0004-3702(98)00023-x.