POMDP Value Functions Are Piecewise Linear and Convex

Claim

For a finite POMDP \((S, A, \Omega, P, O, R, \gamma)\), the finite-horizon optimal value function \(V_k^* : \Delta(S) \to \mathbb{R}\) is piecewise linear and convex (PWLC) in the belief \(b\) for every horizon \(k \geq 0\) (Kaelbling et al. 1998). That is, there exists a finite set \(\Gamma_k \subset \mathbb{R}^{|S|}\) of \(\alpha\)-vectors such that:

\[V_k^*(b) = \max_{\alpha \in \Gamma_k}\, \alpha \cdot b \qquad \forall b \in \Delta(S).\]
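
Before the proof, a concrete illustration of what the claim buys: evaluating a PWLC value function is just a matrix-vector product followed by a max. This is a minimal NumPy sketch with made-up numbers; the set `Gamma` and belief `b` are hypothetical, not from any particular POMDP.

```python
import numpy as np

# Hypothetical alpha-vector set for a 2-state POMDP:
# each row is one alpha in Gamma_k, indexed by state.
Gamma = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.6, 0.6]])

b = np.array([0.3, 0.7])       # a belief in Delta(S)

values = Gamma @ b             # alpha . b for every alpha at once
print(values.max())            # V_k^*(b) = 0.7
print(values.argmax())         # alpha-vector 1 is maximal at this belief
```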

Proof

By induction on \(k\).

Base case (\(k = 0\)). With no decisions remaining, \(V_0^*(b) = 0\) for all \(b\). This is PWLC with \(\Gamma_0 = \{\mathbf{0}\}\).

Inductive step. Assume \(V_k^*\) is PWLC with \(\alpha\)-vector set \(\Gamma_k\):

\[V_k^*(b) = \max_{\alpha \in \Gamma_k}\, \alpha \cdot b.\]

We show \(V_{k+1}^*\) is PWLC.

Step 1: Transform \(\alpha\)-vectors through one observation.

Fix action \(a \in A\) and observation \(o \in \Omega\). For each \(\alpha \in \Gamma_k\), define the backed-up \(\alpha\)-vector \(\hat{\alpha}^{a,o} \in \mathbb{R}^{|S|}\) by:

\[\hat{\alpha}^{a,o}_s = \gamma \sum_{s' \in S} O(o \mid s', a)\, P(s' \mid s, a)\, \alpha_{s'}.\]

This is a linear transformation of \(\alpha\) through the observation and transition probabilities. Note that the discount \(\gamma\) is folded into \(\hat{\alpha}^{a,o}\), so no separate discount factor appears once values are expressed through the backed-up vectors.
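
In matrix form the backup is a single product. A minimal sketch, assuming arrays laid out as `P_a[s, s'] = P(s' | s, a)` and `O_a[s', o] = O(o | s', a)` (these names and layouts are my own convention, not from the source):

```python
import numpy as np

def backup_alpha(alpha, P_a, O_a, o, gamma):
    """Back one alpha-vector up through action a and observation o:
    hat_alpha[s] = gamma * sum_{s'} O(o | s', a) P(s' | s, a) alpha[s'].
    """
    # Weight each successor state by its observation likelihood, then
    # push the result back through the transition matrix.
    return gamma * P_a @ (O_a[:, o] * alpha)
```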

Step 2: The discounted, probability-weighted value of one observation is PWLC in \(b\).

The Bayes filter gives \(b^{a,o}(s') \propto O(o \mid s', a) \sum_s P(s' \mid s, a)\, b(s)\), with normalization constant \(P(o \mid b, a) = \sum_{s'} O(o \mid s', a) \sum_s P(s' \mid s, a)\, b(s)\). Therefore:

\[\gamma\, P(o \mid b, a)\, V_k^*(b^{a,o}) = \gamma\, P(o \mid b, a) \max_{\alpha \in \Gamma_k}\, \alpha \cdot b^{a,o} = \max_{\alpha \in \Gamma_k}\, \hat{\alpha}^{a,o} \cdot b.\]

The first equality uses the inductive hypothesis; the nonnegative scalar \(\gamma\, P(o \mid b, a)\) passes through the max. The second holds because substituting the Bayes filter expression cancels the normalization: \(\gamma\, P(o \mid b, a)\, \alpha \cdot b^{a,o} = \gamma \sum_{s'} \alpha_{s'}\, O(o \mid s', a) \sum_s P(s' \mid s, a)\, b(s) = \hat{\alpha}^{a,o} \cdot b\).
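
The cancellation is easy to verify numerically. A sketch with arbitrary illustrative numbers, using the same array conventions as the Step 1 sketch:

```python
import numpy as np

gamma = 0.95
P_a = np.array([[0.9, 0.1],    # P_a[s, s'] = P(s' | s, a)
                [0.2, 0.8]])
O_a = np.array([[0.7, 0.3],    # O_a[s', o] = O(o | s', a)
                [0.4, 0.6]])
alpha = np.array([1.0, 2.0])
b = np.array([0.5, 0.5])
o = 0

unnorm = O_a[:, o] * (b @ P_a)       # O(o|s',a) * sum_s P(s'|s,a) b(s)
p_o = unnorm.sum()                   # normalization P(o | b, a)
b_next = unnorm / p_o                # Bayes-filtered belief b^{a,o}

hat_alpha = gamma * P_a @ (O_a[:, o] * alpha)
assert np.isclose(gamma * p_o * (alpha @ b_next), hat_alpha @ b)
```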

Step 3: Sum over observations.

The one-step value for a fixed action \(a\) is:

\[Q_{k+1}(b, a) = \rho(b, a) + \gamma \sum_{o \in \Omega} P(o \mid b, a)\, V_k^*(b^{a,o}) = r^a \cdot b + \sum_{o \in \Omega} \max_{\alpha \in \Gamma_k}\, \hat{\alpha}^{a,o} \cdot b,\]

where \(\rho(b, a) = \sum_s R(s, a)\, b(s) = r^a \cdot b\) is the expected immediate reward, i.e., \(r^a_s = R(s, a)\). For each fixed \(b\), the max in each summand selects one \(\alpha\)-vector per observation, and these choices are independent across observations. A sum of independent maxima therefore equals a single max over choice functions \(\sigma : \Omega \to \Gamma_k\) (one \(\alpha\)-vector per observation), writing \(\hat{\alpha}^{a,o}_{\sigma(o)}\) for the Step 1 backup of the vector \(\sigma(o)\):

\[Q_{k+1}(b, a) = \max_{\sigma : \Omega \to \Gamma_k} \left( r^a + \sum_{o \in \Omega} \hat{\alpha}^{a,o}_{\sigma(o)} \right) \cdot b.\]

Each candidate is a fixed vector dotted with \(b\), so \(Q_{k+1}(\cdot, a)\) is PWLC for each \(a\).
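
The max over \(\sigma\) is a cross-sum of per-observation vector sets, which `itertools.product` enumerates directly. A sketch under the same hypothetical array conventions as above:

```python
from itertools import product

import numpy as np

def q_vectors(Gamma_k, r_a, P_a, O_a, gamma):
    """All candidate vectors for Q_{k+1}(., a): one per sigma."""
    n_obs = O_a.shape[1]
    # hat[o] lists the Step 1 backup of every alpha in Gamma_k.
    hat = [[gamma * P_a @ (O_a[:, o] * alpha) for alpha in Gamma_k]
           for o in range(n_obs)]
    # Cross-sum: one backed-up vector per observation, plus r^a.
    return [r_a + sum(choice) for choice in product(*hat)]
```

Each returned vector corresponds to one choice function \(\sigma\); there are \(|\Gamma_k|^{|\Omega|}\) of them per action, matching the count in Step 4.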

Step 4: Max over actions.

\[V_{k+1}^*(b) = \max_{a \in A} Q_{k+1}(b, a).\]

A finite max of PWLC functions is PWLC. The combined \(\alpha\)-vector set is:

\[\Gamma_{k+1} = \left\{ r^a + \sum_{o \in \Omega} \hat{\alpha}^{a,o}_{\sigma(o)} \;\middle|\; a \in A,\; \sigma : \Omega \to \Gamma_k \right\},\]

which is finite (at most \(|A| \cdot |\Gamma_k|^{|\Omega|}\) vectors, pruned to remove dominated ones). Therefore \(V_{k+1}^*\) is PWLC. \(\blacksquare\)
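
Assembled into one backup, with a simple pointwise-dominance filter standing in for real pruning (exact solvers use linear programs to remove more vectors), and reusing `q_vectors` from the Step 3 sketch:

```python
import numpy as np

def backup(Gamma_k, R, P, Obs, gamma):
    """One exact value-iteration step: Gamma_k -> Gamma_{k+1}.
    R[a] is the reward vector r^a; P[a], Obs[a] as in earlier sketches.
    """
    cand = []
    for a in range(len(P)):
        cand.extend(q_vectors(Gamma_k, R[a], P[a], Obs[a], gamma))
    # Keep vectors not strictly pointwise-dominated by another candidate.
    return [v for i, v in enumerate(cand)
            if not any(i != j and np.all(cand[j] >= v) and np.any(cand[j] > v)
                       for j in range(len(cand)))]
```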

Consequence

Each \(\alpha\)-vector in \(\Gamma_k\) encodes the value of a specific \(k\)-step conditional plan: a complete decision tree mapping observation histories to actions. The PWLC structure enables exact POMDP algorithms: instead of representing \(V_k^*\) over the continuous belief simplex, one only needs to store the finite set \(\Gamma_k\). Without pruning, the number of \(\alpha\)-vectors grows to at most \(|A| \cdot |\Gamma_k|^{|\Omega|}\) per step, which compounds to doubly exponential growth in the horizon; pruning dominated vectors keeps small problems tractable.
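
As a small end-to-end check, the sketches above can be run on the tiger problem used by Kaelbling et al. (1998); the parameters below are the commonly quoted ones (0.85 listening accuracy, -1 listen cost, +10/-100 door rewards), and the discount 0.95 is an arbitrary illustrative choice:

```python
import numpy as np

# States {tiger-left, tiger-right}; actions {listen, open-left, open-right};
# observations {hear-left, hear-right}.
P = [np.eye(2),                    # listen: state unchanged
     np.full((2, 2), 0.5),         # open-left: tiger re-placed at random
     np.full((2, 2), 0.5)]         # open-right
Obs = [np.array([[0.85, 0.15],     # listen: correct hearing w.p. 0.85
                 [0.15, 0.85]]),
       np.full((2, 2), 0.5),       # opened doors give no information
       np.full((2, 2), 0.5)]
R = [np.array([-1.0, -1.0]),       # listen
     np.array([-100.0, 10.0]),     # open-left: ruinous if tiger-left
     np.array([10.0, -100.0])]     # open-right

Gamma = [np.zeros(2)]              # Gamma_0 = {0}
for _ in range(3):                 # three exact backups
    Gamma = backup(Gamma, R, P, Obs, gamma=0.95)

b = np.array([0.5, 0.5])
print(len(Gamma), max(alpha @ b for alpha in Gamma))
```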

References

Kaelbling, Leslie Pack, Michael L. Littman, and Anthony R. Cassandra. 1998. “Planning and Acting in Partially Observable Stochastic Domains.” Artificial Intelligence 101 (1-2): 99–134. https://doi.org/10.1016/s0004-3702(98)00023-x.