Policies
Motivation
In sequential decision-making, an agent must decide what to do in every situation it might encounter. A policy (Sutton and Barto 2018) answers this once and for all: it is a rule mapping situations to actions. Solving a sequential decision problem ultimately means producing a policy.
Policies are the natural object of study because the state captures everything relevant about the past (under the Markov property), so a function of the current state alone is enough to act optimally. There is no need to remember history.
Definition
Let \(S\) be a set of states and \(A\) a set of actions. A deterministic policy is a function
\[ \pi : S \to A. \]
A stochastic policy maps each state to a probability distribution over actions:
\[ \pi(a \mid s), \qquad \sum_{a \in A} \pi(a \mid s) = 1 \text{ for all } s \in S. \]
Every deterministic policy can be viewed as a stochastic policy that puts all of its probability mass on a single action in each state.
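To make the two objects concrete, here is a minimal sketch in Python; the two-state, two-action problem and the particular probabilities are illustrative assumptions, not part of the definition.

```python
from typing import Dict

# Hypothetical finite sets of states and actions.
states = ["s0", "s1"]
actions = ["left", "right"]

# Deterministic policy: a function S -> A, represented here as a dict.
det_policy: Dict[str, str] = {"s0": "right", "s1": "left"}

# Stochastic policy: a probability distribution over actions for each state.
stoch_policy: Dict[str, Dict[str, float]] = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 1.0, "right": 0.0},  # all mass on one action: the deterministic special case
}

# Each conditional distribution must sum to 1.
for s in states:
    assert abs(sum(stoch_policy[s].values()) - 1.0) < 1e-9
```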
Stationary vs. Non-Stationary
A policy is stationary if it depends only on the current state, not on the time step:
\[ a_t = \pi(s_t) \qquad \text{(same } \pi \text{ at every } t\text{).} \]
A non-stationary policy may use a different rule at each step: \(\pi_t : S \to A\). For finite-horizon problems, optimal policies are generally non-stationary because the value of an action depends on how many decisions remain. For infinite-horizon discounted problems, the situation is much cleaner: an optimal stationary deterministic policy always exists. (proof)
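A non-stationary policy is just a sequence of per-step rules \(\pi_0, \pi_1, \ldots, \pi_{H-1}\). The sketch below illustrates this for an assumed horizon \(H = 3\) over the same toy state space as above; the particular action choices are arbitrary.

```python
# Sketch: a non-stationary policy as a list of per-step rules pi_0, ..., pi_{H-1}.
# Each pi_t is a dict from states to actions; the rule may change with t.
H = 3  # hypothetical horizon
nonstationary_policy = [
    {"s0": "right", "s1": "right"},  # pi_0: many decisions remain
    {"s0": "right", "s1": "left"},   # pi_1
    {"s0": "left",  "s1": "left"},   # pi_2: last step, act on immediate reward only
]

def act(t: int, s: str) -> str:
    """Action at time t in state s under the non-stationary policy."""
    return nonstationary_policy[t][s]

# A stationary policy would use the same dict at every t.
```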
Why Policies and Not Plans?
A plan is a fixed sequence of actions \(a_0, a_1, a_2, \ldots\) chosen up front. A policy is a contingent rule: it says what to do as a function of what happens.
Plans are sufficient only when the environment is deterministic and fully observable. As soon as transitions are stochastic, a plan cannot adapt to outcomes — but a policy can, because its next action is computed from the realized state. This is why policies are the central object in Markov decision processes rather than action sequences.
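The gap is easy to see in a small simulation. The sketch below assumes a toy chain with a slip probability (the chain, slip rate, and goal are illustrative assumptions): a fixed three-step plan reaches the goal only if no step slips, roughly \(0.8^3 \approx 0.51\) of the time, while a policy that keeps reacting to the realized state reaches it with probability 1.

```python
import random

# Assumed toy chain: states 0..3, goal = 3; action "fwd" advances one state
# with probability 0.8 and slips (stays put) with probability 0.2.
def step(s: int) -> int:
    return min(s + 1, 3) if random.random() < 0.8 else s

def run_plan(plan_length: int = 3) -> int:
    """A fixed sequence of plan_length 'fwd' actions, chosen up front."""
    s = 0
    for _ in range(plan_length):
        s = step(s)
    return s  # falls short of the goal whenever any step slips

def run_policy() -> int:
    """A contingent rule: keep acting on the realized state until the goal."""
    s = 0
    while s < 3:
        s = step(s)
    return s  # reaches the goal with probability 1

random.seed(0)
print(sum(run_plan() == 3 for _ in range(1000)) / 1000)    # about 0.8**3 = 0.512
print(sum(run_policy() == 3 for _ in range(1000)) / 1000)  # 1.0
```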
Evaluating a Policy
Following a policy \(\pi\) from state \(s\) produces a random trajectory and a random discounted return \(G = \sum_{t=0}^{\infty} \gamma^t r_t\), where \(\gamma \in [0, 1)\) is the discount factor. The state-value function is its expectation:
\[ V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s,\, \pi\right]. \]
The action-value function \(Q^\pi(s, a)\) is the expected return after taking action \(a\) in state \(s\) and then following \(\pi\). These functions satisfy the Bellman expectation equations and are the standard tool for comparing policies: \(\pi\) is at least as good as \(\pi'\) iff \(V^\pi(s) \geq V^{\pi'}(s)\) for every \(s\).
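A direct way to compute \(V^\pi\) is iterative policy evaluation: sweep the Bellman expectation backup until it converges. The sketch below uses a hypothetical randomly generated 3-state, 2-action MDP and a uniform random policy; the array conventions (\(P[a, s, s']\), \(R[s, a]\)) and tolerances are assumptions made for the example.

```python
import numpy as np

# Hypothetical random MDP: P[a, s, s'] = transition probability, R[s, a] = expected reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # random dynamics
R = rng.normal(size=(n_states, n_actions))                        # random rewards
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform random policy

# Bellman expectation backup:
# V(s) <- sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum("asx,x->sa", P, V)   # Q^pi(s, a) under current V
    V_new = np.einsum("sa,sa->s", pi, Q)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)  # approximate V^pi for the uniform policy
```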
Optimal Policies
A policy \(\pi^*\) is optimal if \(V^{\pi^*}(s) = V^*(s) := \max_\pi V^\pi(s)\) for every \(s\). In a finite MDP with \(\gamma < 1\):
- An optimal deterministic stationary policy exists.
- Every greedy policy with respect to \(V^*\) is optimal:
\[ \pi^*(s) \in \operatorname*{arg\,max}_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]. \]
(proof)
This reduces policy search to value computation: find \(V^*\) (e.g., via value iteration or policy iteration) and read off a greedy policy.
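As a sketch of that reduction, the following runs value iteration on the same kind of hypothetical random MDP as in the evaluation example, then reads off a greedy deterministic policy from the resulting \(Q\) values.

```python
import numpy as np

# Hypothetical random MDP with the same conventions as before: P[a, s, s'], R[s, a].
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum("asx,x->sa", P, V)  # Q(s, a) under current V
    V_new = Q.max(axis=1)                         # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)  # deterministic stationary policy: one action per state
print(V, greedy_policy)
```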
When Stochastic Policies Are Needed
For fully observable MDPs, restricting attention to deterministic policies loses nothing. Stochastic policies become essential in three settings:
- Partial observability. When the agent does not see the true state (POMDPs), randomization can outperform any deterministic memoryless rule.
- Multi-agent / adversarial settings. Game-theoretic equilibria typically require mixed strategies.
- Exploration during learning. When the model is unknown, the agent must try suboptimal-looking actions to gather information; policies like \(\varepsilon\)-greedy and softmax are stochastic by design (see the sketch after this list). This is the central concern in multi-armed bandits and reinforcement learning.
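A minimal sketch of the two exploration policies named in the last item; the action-value vector, \(\varepsilon\), and temperature below are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(q: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_policy(q: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Sample an action with probability proportional to exp(q / temperature)."""
    z = q / temperature
    probs = np.exp(z - z.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q), p=probs))

rng = np.random.default_rng(0)
q = np.array([1.0, 0.5, -0.2])  # hypothetical action-value estimates for one state
print(epsilon_greedy(q, epsilon=0.1, rng=rng), softmax_policy(q, temperature=0.5, rng=rng))
```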
Parameterized Policies
For large or continuous state spaces, tabulating \(\pi(s)\) for every \(s\) is infeasible. A parameterized policy \(\pi_\theta(a \mid s)\) — for example, the output of a neural network with weights \(\theta\) — represents the policy compactly. Policy gradient methods optimize \(\theta\) directly to maximize expected return, treating the policy itself as the primary object rather than deriving it from a value function.
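As a sketch, assuming a linear-softmax parameterization rather than a neural network, the following implements \(\pi_\theta(a \mid s)\) and the score function \(\nabla_\theta \log \pi_\theta(a \mid s)\) used by REINFORCE-style policy gradient updates; the feature dimension, step size, and sampled return are placeholders.

```python
import numpy as np

# Linear-softmax policy: preferences phi(s)^T theta, probabilities via softmax.
n_features, n_actions = 4, 3
theta = np.zeros((n_features, n_actions))  # policy parameters

def action_probs(phi_s: np.ndarray) -> np.ndarray:
    """pi_theta(. | s) as a softmax over linear action preferences."""
    prefs = phi_s @ theta
    prefs -= prefs.max()  # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(phi_s: np.ndarray, a: int) -> np.ndarray:
    """Score function d/dtheta log pi_theta(a | s) for the linear-softmax policy."""
    p = action_probs(phi_s)
    g = -np.outer(phi_s, p)   # column b gets phi_s * (1[a == b] - pi(b | s))
    g[:, a] += phi_s
    return g

# One REINFORCE-style update: theta <- theta + alpha * G * grad log pi(a | s).
rng = np.random.default_rng(0)
phi_s = rng.normal(size=n_features)                      # hypothetical state features
a = int(rng.choice(n_actions, p=action_probs(phi_s)))    # sample an action from pi_theta
G = 1.0                                                  # hypothetical sampled return
theta += 0.1 * G * grad_log_pi(phi_s, a)
```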