Policies
Motivation
In sequential decision-making, an agent must decide what to do in every situation it might encounter. A policy (Sutton and Barto 2018) answers this once and for all: it is a rule mapping situations to actions. Solving a sequential decision problem ultimately means producing a policy.
Policies are the natural object of study because the state captures everything relevant about the past (under the Markov property), so a function of the current state alone is enough to act optimally. There is no need to remember history.
Definition
Let \(S\) be a set of states and \(A\) a set of actions. A deterministic policy is a function
\[ \pi : S \to A. \]
A stochastic policy maps each state to a probability distribution over actions:
\[ \pi(a \mid s), \qquad \sum_{a \in A} \pi(a \mid s) = 1 \text{ for all } s \in S. \]
Every deterministic policy can be viewed as a stochastic policy that puts all of its probability mass on a single action in each state.
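To make the two objects concrete, here is a minimal sketch in Python; the two-state, two-action problem and the particular probabilities are illustrative assumptions, not part of the definition.

```python
from typing import Dict

# Hypothetical finite sets of states and actions.
states = ["s0", "s1"]
actions = ["left", "right"]

# Deterministic policy: a function S -> A, represented here as a dict.
det_policy: Dict[str, str] = {"s0": "right", "s1": "left"}

# Stochastic policy: a probability distribution over actions for each state.
stoch_policy: Dict[str, Dict[str, float]] = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 1.0, "right": 0.0},  # all mass on one action: the deterministic special case
}

# Each conditional distribution must sum to 1.
for s in states:
    assert abs(sum(stoch_policy[s].values()) - 1.0) < 1e-9
```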
Stationary vs. Non-Stationary
A policy is stationary if it depends only on the current state, not on the time step:
\[ a_t = \pi(s_t) \qquad \text{(same } \pi \text{ at every } t\text{).} \]
A non-stationary policy may use a different rule at each step: \(\pi_t : S \to A\). For finite-horizon problems, optimal policies are generally non-stationary because the value of an action depends on how many decisions remain. For infinite-horizon discounted problems, the situation is much cleaner: an optimal stationary deterministic policy always exists. (proof)
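A non-stationary policy is just a sequence of per-step rules \(\pi_0, \pi_1, \ldots, \pi_{H-1}\). The sketch below illustrates this for an assumed horizon \(H = 3\) over the same toy state space as above; the particular action choices are arbitrary.

```python
# Sketch: a non-stationary policy as a list of per-step rules pi_0, ..., pi_{H-1}.
# Each pi_t is a dict from states to actions; the rule may change with t.
H = 3  # hypothetical horizon
nonstationary_policy = [
    {"s0": "right", "s1": "right"},  # pi_0: many decisions remain
    {"s0": "right", "s1": "left"},   # pi_1
    {"s0": "left",  "s1": "left"},   # pi_2: last step, act on immediate reward only
]

def act(t: int, s: str) -> str:
    """Action at time t in state s under the non-stationary policy."""
    return nonstationary_policy[t][s]

# A stationary policy would use the same dict at every t.
```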
Why Policies and Not Plans?
A plan is a fixed sequence of actions \(a_0, a_1, a_2, \ldots\) chosen up front. A policy is a contingent rule: it says what to do as a function of what happens.
Plans are sufficient only when the environment is deterministic and fully observable. As soon as transitions are stochastic, a plan cannot adapt to outcomes — but a policy can, because its next action is computed from the realized state. This is why policies are the central object in Markov decision processes rather than action sequences.
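The gap is easy to see in a small simulation. The sketch below assumes a toy chain with a slip probability (the chain, slip rate, and goal are illustrative assumptions): a fixed three-step plan reaches the goal only if no step slips, roughly \(0.8^3 \approx 0.51\) of the time, while a policy that keeps reacting to the realized state reaches it with probability 1.

```python
import random

# Assumed toy chain: states 0..3, goal = 3; action "fwd" advances one state
# with probability 0.8 and slips (stays put) with probability 0.2.
def step(s: int) -> int:
    return min(s + 1, 3) if random.random() < 0.8 else s

def run_plan(plan_length: int = 3) -> int:
    """A fixed sequence of plan_length 'fwd' actions, chosen up front."""
    s = 0
    for _ in range(plan_length):
        s = step(s)
    return s  # falls short of the goal whenever any step slips

def run_policy() -> int:
    """A contingent rule: keep acting on the realized state until the goal."""
    s = 0
    while s < 3:
        s = step(s)
    return s  # reaches the goal with probability 1

random.seed(0)
print(sum(run_plan() == 3 for _ in range(1000)) / 1000)    # about 0.8**3 = 0.512
print(sum(run_policy() == 3 for _ in range(1000)) / 1000)  # 1.0
```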
Evaluating a Policy
Following a policy \(\pi\) from state \(s\) produces a random trajectory and a random discounted return \(G = \sum_{t=0}^{\infty} \gamma^t r_t\), where \(\gamma \in [0, 1)\) is the discount factor. The state-value function is its expectation:
\[ V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s,\, \pi\right]. \]
The action-value function \(Q^\pi(s, a)\) is the expected return after taking action \(a\) in state \(s\) and then following \(\pi\). These functions satisfy the Bellman expectation equations and are the standard tool for comparing policies: \(\pi\) is at least as good as \(\pi'\) iff \(V^\pi(s) \geq V^{\pi'}(s)\) for every \(s\).
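A direct way to compute \(V^\pi\) is iterative policy evaluation: sweep the Bellman expectation backup until it converges. The sketch below uses a hypothetical randomly generated 3-state, 2-action MDP and a uniform random policy; the array conventions (\(P[a, s, s']\), \(R[s, a]\)) and tolerances are assumptions made for the example.

```python
import numpy as np

# Hypothetical random MDP: P[a, s, s'] = transition probability, R[s, a] = expected reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # random dynamics
R = rng.normal(size=(n_states, n_actions))                        # random rewards
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform random policy

# Bellman expectation backup:
# V(s) <- sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum("asx,x->sa", P, V)   # Q^pi(s, a) under current V
    V_new = np.einsum("sa,sa->s", pi, Q)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)  # approximate V^pi for the uniform policy
```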
Optimal Policies
A policy \(\pi^*\) is optimal if \(V^{\pi^*}(s) = V^*(s) := \max_\pi V^\pi(s)\) for every \(s\). In a finite MDP with \(\gamma < 1\):
- An optimal deterministic stationary policy exists.
- Every greedy policy with respect to \(V^*\) is optimal:
\[ \pi^*(s) \in \operatorname*{arg\,max}_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]. \]
(proof)
This reduces policy search to value computation: find \(V^*\) (e.g., via value iteration or policy iteration) and read off a greedy policy.
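As a sketch of that reduction, the following runs value iteration on the same kind of hypothetical random MDP as in the evaluation example, then reads off a greedy deterministic policy from the resulting \(Q\) values.

```python
import numpy as np

# Hypothetical random MDP with the same conventions as before: P[a, s, s'], R[s, a].
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum("asx,x->sa", P, V)  # Q(s, a) under current V
    V_new = Q.max(axis=1)                         # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)  # deterministic stationary policy: one action per state
print(V, greedy_policy)
```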
When Stochastic Policies Are Needed
For fully observable MDPs, restricting attention to deterministic policies loses nothing. Stochastic policies become essential in three settings:
- Partial observability. When the agent does not see the true state (POMDPs), randomization can outperform any deterministic memoryless rule.
- Multi-agent / adversarial settings. Game-theoretic equilibria typically require mixed strategies.
- Exploration during learning. When the model is unknown, the agent must try suboptimal-looking actions to gather information; policies like \(\varepsilon\)-greedy and softmax are stochastic by design (see the sketch after this list). This is the central concern in multi-armed bandits and reinforcement learning.
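A minimal sketch of the two exploration policies named in the last item; the action-value vector, \(\varepsilon\), and temperature below are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(q: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_policy(q: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Sample an action with probability proportional to exp(q / temperature)."""
    z = q / temperature
    probs = np.exp(z - z.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q), p=probs))

rng = np.random.default_rng(0)
q = np.array([1.0, 0.5, -0.2])  # hypothetical action-value estimates for one state
print(epsilon_greedy(q, epsilon=0.1, rng=rng), softmax_policy(q, temperature=0.5, rng=rng))
```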
Parameterized Policies
For large or continuous state spaces, tabulating \(\pi(s)\) for every \(s\) is infeasible. A parameterized policy \(\pi_\theta(a \mid s)\) — for example, the output of a neural network with weights \(\theta\) — represents the policy compactly. Policy gradient methods optimize \(\theta\) directly to maximize expected return, treating the policy itself as the primary object rather than deriving it from a value function.
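As a sketch, assuming a linear-softmax parameterization rather than a neural network, the following implements \(\pi_\theta(a \mid s)\) and the score function \(\nabla_\theta \log \pi_\theta(a \mid s)\) used by REINFORCE-style policy gradient updates; the feature dimension, step size, and sampled return are placeholders.

```python
import numpy as np

# Linear-softmax policy: preferences phi(s)^T theta, probabilities via softmax.
n_features, n_actions = 4, 3
theta = np.zeros((n_features, n_actions))  # policy parameters

def action_probs(phi_s: np.ndarray) -> np.ndarray:
    """pi_theta(. | s) as a softmax over linear action preferences."""
    prefs = phi_s @ theta
    prefs -= prefs.max()  # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(phi_s: np.ndarray, a: int) -> np.ndarray:
    """Score function d/dtheta log pi_theta(a | s) for the linear-softmax policy."""
    p = action_probs(phi_s)
    g = -np.outer(phi_s, p)   # column b gets phi_s * (1[a == b] - pi(b | s))
    g[:, a] += phi_s
    return g

# One REINFORCE-style update: theta <- theta + alpha * G * grad log pi(a | s).
rng = np.random.default_rng(0)
phi_s = rng.normal(size=n_features)                      # hypothetical state features
a = int(rng.choice(n_actions, p=action_probs(phi_s)))    # sample an action from pi_theta
G = 1.0                                                  # hypothetical sampled return
theta += 0.1 * G * grad_log_pi(phi_s, a)
```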