Post-Training Reinforcement Learning
Motivation
Pretraining optimizes for next-token prediction. This produces fluent, knowledgeable models, but not necessarily aligned ones: they may ignore instructions, hallucinate confidently, or produce harmful content. Post-training uses human feedback and reinforcement learning to steer model behavior after pretraining, without degrading language ability.
The RLHF Pipeline
The canonical pipeline (Ouyang et al. 2022) has three stages:
Stage 1: Supervised Fine-Tuning
Collect a small dataset of (prompt, high-quality response) pairs written or selected by human annotators. Fine-tune the pretrained model on this data using standard cross-entropy loss. The result is the SFT model — a well-behaved starting point.
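As a concrete sketch, assuming a PyTorch model in the Hugging Face style (forward pass returns an object with `.logits`), the SFT objective is ordinary next-token cross-entropy, with prompt tokens masked so only the response contributes to the loss:

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Masked next-token cross-entropy for supervised fine-tuning.

    `labels` mirrors `input_ids` but holds -100 on prompt tokens,
    so only response tokens contribute to the loss.
    """
    logits = model(input_ids).logits     # (batch, seq, vocab)
    logits = logits[:, :-1, :]           # position t predicts token t+1
    targets = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```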
Stage 2: Reward Model Training
Present annotators with a prompt and two or more model responses; they rank or compare responses by quality. Train a reward model \(r_\phi(x, y)\) — typically the SFT model with a scalar head added — to predict human preference:
\[ \mathcal{L}_\text{RM} = -\mathbb{E} \left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right], \]
where \(y_w\) is preferred over \(y_l\). The reward model generalizes from annotated pairs to score any (prompt, response).
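The loss is a single line once the scalar head has scored both responses; a minimal sketch, with `r_chosen` and `r_rejected` standing in for \(r_\phi(x, y_w)\) and \(r_\phi(x, y_l)\) over a batch of preference pairs:

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss; both inputs have shape (batch,).

    logsigmoid is the numerically stable form of log(sigmoid(.)).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```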
Stage 3: RL Optimization
Treat the language model as a policy \(\pi_\theta\) and the reward model as the reward function. Use PPO (Schulman et al. 2017) to maximize the KL-regularized objective:
\[ \mathcal{J}(\theta) = \mathbb{E}\left[r_\phi(x, y)\right] - \beta \, \mathrm{KL}(\pi_\theta \| \pi_\text{SFT}), \]
where the KL term prevents the policy from drifting too far from the SFT model, reducing reward hacking: finding behaviors that score well on \(r_\phi\) but are not genuinely better.
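In most implementations the KL term is estimated per token on the sampled response and folded into the reward that PPO optimizes. A minimal sketch under that convention (names are illustrative; production code often assigns the penalty token by token rather than summing it):

```python
def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Sequence-level reward minus an estimated KL penalty.

    reward:      r_phi(x, y), shape (batch,)
    logp_policy: log pi_theta(y_t | x, y_<t), shape (batch, seq)
    logp_sft:    log pi_SFT(y_t | x, y_<t),   shape (batch, seq)

    Uses the sampled-token estimator KL ~ sum_t (logp_policy - logp_sft);
    beta trades reward maximization against drift from the SFT model.
    """
    kl_estimate = (logp_policy - logp_sft).sum(dim=-1)
    return reward - beta * kl_estimate
```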
Direct Preference Optimization
RLHF is complex: the RL step is unstable, sensitive to hyperparameters, and prone to reward hacking. DPO (Rafailov et al. 2023) bypasses RL entirely. Its key observation is that the KL-regularized objective above has a closed-form optimal policy, so the reward can be rewritten in terms of the policy itself, yielding a direct loss on preference pairs:
\[ \mathcal{L}_\text{DPO} = -\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{SFT}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{SFT}(y_l \mid x)}\right)\right]. \]
DPO is simpler and more stable than PPO-based RLHF, and requires no separately trained reward model. It is now widely used.
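Computing the loss needs only summed response log-probabilities under the policy and the frozen SFT reference; a sketch with illustrative argument names:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from response log-probabilities, all shapes (batch,).

    logp_w / logp_l:         log pi_theta(y_w | x), log pi_theta(y_l | x)
    ref_logp_w / ref_logp_l: the same quantities under the frozen SFT model
    Log-probabilities are summed over response tokens.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```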
Process vs. Outcome Reward Models
Standard RLHF rewards the final response — an outcome reward model (ORM). For tasks with long reasoning chains, outcome rewards provide a sparse signal: feedback comes only after generating hundreds of tokens, making credit assignment difficult.
A process reward model (PRM) scores each step in the reasoning chain individually. PRMs provide denser feedback, enable step-level rejection sampling, and discourage chains that reach correct final answers via incorrect intermediate steps. The trade-off: PRMs require step-level annotation, which is expensive.
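The contrast is easy to state in code. A sketch with hypothetical `orm` and `prm` scorers (the min-aggregation shown is one common choice for collapsing step scores to a single reward, not the only one):

```python
def outcome_score(orm, prompt, response):
    # ORM: a single scalar for the complete response.
    return orm(prompt, response)

def process_scores(prm, prompt, steps):
    # PRM: one scalar per reasoning step, each conditioned on the
    # prefix of steps so far.
    step_scores = [prm(prompt, steps[: i + 1]) for i in range(len(steps))]
    # One common aggregation: the chain is only as good as its worst step.
    return min(step_scores), step_scores
```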
Group Relative Policy Optimization
GRPO (Shao et al. 2024) is a variant designed for reasoning tasks. For each prompt it samples a group of \(G\) responses \(y_1, \ldots, y_G\), scores each with a verifier or reward model, and normalizes advantages by the group mean and standard deviation:
\[ A_i = \frac{r_i - \mu_G}{\sigma_G}, \qquad \mu_G = \frac{1}{G}\sum_{j=1}^G r_j. \]
Policy gradient updates then use these group-normalized advantages. GRPO dispenses with the learned value function (critic) that PPO uses as a baseline, substituting the group mean instead, which cuts memory and compute overhead; it has been applied effectively to mathematical reasoning tasks.
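The advantage computation is a few lines; a sketch in PyTorch (the `eps` term is an assumption here, guarding against a degenerate group where every response receives the same reward):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for G responses to one prompt.

    rewards: shape (G,), one scalar per sampled response.
    """
    mu = rewards.mean()
    sigma = rewards.std()
    return (rewards - mu) / (sigma + eps)
```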
Relationship to Policy Gradient
Post-training RL is an application of policy gradient methods. The policy is the language model, the action space is the vocabulary at each step, and the environment is a reward model or a verifiable task (math, code). The key differences from classical RL are the enormous action and state spaces and the critical importance of staying close to the pretrained initialization — without the KL penalty, the policy quickly degrades language quality while optimizing reward.
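To make the correspondence concrete, a REINFORCE-style sketch of the surrogate loss (illustrative, not any particular library's API; the advantages might come from a reward model or from GRPO's group normalization):

```python
def policy_gradient_loss(logp_policy, logp_sft, advantages, beta=0.1):
    """REINFORCE surrogate with a KL penalty toward the SFT model.

    logp_policy: log pi_theta(y_t | x, y_<t), shape (batch, seq)
    logp_sft:    frozen reference log-probs,  shape (batch, seq)
    advantages:  one scalar per sequence,     shape (batch,)
    """
    seq_logp = logp_policy.sum(dim=-1)
    pg_term = -(advantages.detach() * seq_logp).mean()    # maximize reward
    kl_term = (logp_policy - logp_sft).sum(dim=-1).mean() # stay near pi_SFT
    return pg_term + beta * kl_term
```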