Information Gathering

Motivation

A robot localizing itself in a corridor may turn its head to look at a landmark even though the head turn earns no immediate reward. A doctor orders a diagnostic test before deciding on treatment. An autonomous car slows down at an occluded intersection to look before crossing. In each case the agent takes an action whose only purpose is to reduce uncertainty about the current state of the world, not to refine its estimate of any reward distribution.

This is information gathering: choosing actions to make the belief state more informative, even at the cost of immediate reward. It is distinct from exploration, which addresses uncertainty about the value of actions rather than uncertainty about the state.

In a partially observable MDP, information gathering arises naturally: the optimal policy is a function of the belief state, not the unknown true state, and actions that sharpen the belief state have value precisely because they enable better decisions later (Kaelbling et al. 1998).

Setting

Consider a POMDP \((S, A, T, R, \Omega, O, \gamma)\) with belief state \(b_t \in \Delta(S)\). The reward function \(R(s, a)\) depends on the unobserved state, so the agent’s expected reward at belief \(b_t\) is

\[ r(b_t, a) = \sum_s b_t(s) R(s, a). \]

After action \(a\) and observation \(o\), the belief updates to \(b_{t+1} = \tau(b_t, a, o)\). The optimal value function \(V^*(b)\) on belief space satisfies the Bellman equation

\[ V^*(b) = \max_a \left[ r(b, a) + \gamma \sum_o \Pr(o \mid b, a) \, V^*(\tau(b, a, o)) \right]. \]

Information-gathering actions are actions whose value comes mostly from the second term: they have low immediate reward \(r(b, a)\) but produce observations \(o\) that lead to higher-value successor beliefs.
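As a concrete reference, here is a minimal NumPy sketch of these quantities for a small discrete POMDP. The array names are assumptions introduced for illustration, not part of any particular library: \(T[a, s, s']\) holds transition probabilities, \(R[s, a]\) rewards, \(Z[a, s', o]\) observation probabilities, and `V` stands in for any (approximate) value function on beliefs.

```python
import numpy as np

def expected_reward(b, a, R):
    """r(b, a) = sum_s b(s) R(s, a)."""
    return float(b @ R[:, a])

def observation_prob(b, a, o, T, Z):
    """Pr(o | b, a): probability of observing o after taking a in belief b."""
    return float((b @ T[a]) @ Z[a, :, o])

def belief_update(b, a, o, T, Z):
    """tau(b, a, o): predict the state through the dynamics, then condition on o."""
    unnormalized = (b @ T[a]) * Z[a, :, o]
    return unnormalized / unnormalized.sum()

def one_step_lookahead(b, a, V, T, R, Z, gamma):
    """r(b, a) + gamma * sum_o Pr(o | b, a) V(tau(b, a, o)):
    one bracketed term of the Bellman equation above, for a given
    (approximate) value function V on beliefs."""
    value = expected_reward(b, a, R)
    for o in range(Z.shape[2]):
        p_o = observation_prob(b, a, o, T, Z)
        if p_o > 0.0:
            value += gamma * p_o * V(belief_update(b, a, o, T, Z))
    return value
```

In this sketch, an action whose `one_step_lookahead` value is dominated by the discounted sum over observations rather than by `expected_reward` is an information-gathering action in the sense just described.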

Value of information

The value of information of an action \(a\) at belief \(b\) is the increase in expected discounted return obtained by taking \(a\) now versus committing immediately to the best action under the current belief. Formally,

\[ \text{VoI}(a; b) = \mathbb{E}_o\!\left[ V^*(\tau(b, a, o)) \right] - V^*(b), \]

where the expectation is over observations \(o \sim \Pr(\cdot \mid b, a)\) and, by convention, the immediate reward of \(a\) is folded into \(V^*\) rather than written as a separate term. Pure information-gathering actions are those with low immediate reward but positive VoI.
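Under the same hypothetical array conventions as the sketch above, and given some approximation `V_star` of the optimal value function on beliefs, the definition can be computed directly:

```python
def value_of_information(b, a, V_star, T, Z):
    """VoI(a; b) = E_o[ V_star(tau(b, a, o)) ] - V_star(b),
    with the expectation over Pr(o | b, a). Reuses observation_prob and
    belief_update from the previous sketch."""
    expected_posterior_value = 0.0
    for o in range(Z.shape[2]):
        p_o = observation_prob(b, a, o, T, Z)
        if p_o > 0.0:
            expected_posterior_value += p_o * V_star(belief_update(b, a, o, T, Z))
    return expected_posterior_value - V_star(b)
```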

In a fully observable MDP — and a fortiori in a bandit — the belief state collapses to a point mass on the current state, all observations are deterministic given the state, and VoI is identically zero. Information gathering is therefore a phenomenon specific to partial observability.

Contrast with exploration

The two phenomena are easy to confuse but have different sources, different remedies, and different theoretical treatments.

|                      | Exploration | Information gathering |
|----------------------|-------------|-----------------------|
| What is uncertain?   | Reward function / value of actions | Hidden state of the environment |
| Resolved by          | Repeated trials with feedback | A single informative observation |
| Vanishes when        | The agent has tried each action enough | The belief becomes a point mass |
| Native setting       | Bandits, reinforcement learning | POMDPs with known dynamics |
| Canonical algorithms | UCB, Thompson sampling, \(\varepsilon\)-greedy | Belief-space planning, point-based value iteration |

A bandit requires exploration but no information gathering. A POMDP with known reward and dynamics requires information gathering but no exploration. Reinforcement learning in a POMDP requires both: the agent is uncertain both about what the world is doing right now and about what the optimal action would be even if it knew the state.

In practice

Information gathering shows up in:

  • Active sensing. Robotic gaze control, sensor scheduling, active SLAM.
  • Sequential experimental design. Choosing the next experiment to maximize expected information about a parameter of interest.
  • Active learning. Selecting unlabeled data points to query — a special case where the “state” is the unknown label.
  • Diagnostic decision-making. Choosing a test before treatment, where the test resolves uncertainty about the patient’s state.

In all of these settings, the value of an action is dominated by what it reveals about a hidden quantity rather than by the immediate reward it produces. Belief-space planning algorithms — point-based value iteration, partially observable Monte Carlo planning — solve them by reasoning explicitly over the future evolution of the belief, the same machinery that makes POMDP value functions piecewise linear and convex.
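The piecewise linear and convex structure mentioned here is what point-based methods exploit: the value function is stored as a finite set of alpha-vectors and evaluated as a maximum of dot products with the belief. A minimal sketch, assuming the vector set and its per-vector action labels come from whatever planner produced them:

```python
import numpy as np

def value_at_belief(b, alpha_vectors):
    """V(b) = max_alpha alpha . b, for a value function represented by a set of
    alpha-vectors stored as an array of shape [n_vectors, n_states]."""
    return float(np.max(alpha_vectors @ b))

def greedy_action(b, alpha_vectors, vector_actions):
    """Each alpha-vector carries the action of the conditional plan it represents;
    the greedy policy at b is the action of the maximizing vector."""
    return vector_actions[int(np.argmax(alpha_vectors @ b))]
```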

References

Kaelbling, Leslie Pack, Michael L. Littman, and Anthony R. Cassandra. 1998. “Planning and Acting in Partially Observable Stochastic Domains.” Artificial Intelligence 101 (1-2): 99–134. https://doi.org/10.1016/s0004-3702(98)00023-x.