Post-Training Reinforcement Learning

Motivation

Pretraining optimizes for next-token prediction. This produces fluent, knowledgeable models, but not necessarily aligned ones: they may ignore instructions, hallucinate confidently, or produce harmful content. Post-training uses human feedback and reinforcement learning to steer model behavior after pretraining, without degrading language ability.

The RLHF Pipeline

The canonical pipeline (Ouyang et al. 2022) has three stages:

Stage 1: Supervised Fine-Tuning

Collect a small dataset of (prompt, high-quality response) pairs written or selected by human annotators. Fine-tune the pretrained model on this data using standard cross-entropy loss. The result is the SFT model — a well-behaved starting point.
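
A minimal sketch of the SFT step, assuming a causal language model that returns per-position logits for a tokenized (prompt, response) pair; the tensor names and the prompt-masking convention are illustrative, not taken from Ouyang et al.:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, prompt_len):
    """Next-token cross-entropy computed on response tokens only.

    logits: (seq_len, vocab) outputs of the causal LM on prompt + response
    labels: (seq_len,) input token ids (targets are the inputs shifted by one)
    prompt_len: number of prompt tokens excluded from the loss
    """
    # Position t predicts token t + 1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]
    per_token = F.cross_entropy(shift_logits, shift_labels, reduction="none")
    # Mask prompt positions so only the demonstrated response is imitated.
    mask = (torch.arange(shift_labels.shape[0]) >= prompt_len - 1).float()
    return (per_token * mask).sum() / mask.sum()
```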

Stage 2: Reward Model Training

Present annotators with a prompt and two or more model responses; they rank or compare responses by quality. Train a reward model \(r_\phi(x, y)\) — typically the SFT model with a scalar head added — to predict human preference:

\[ \mathcal{L}_\text{RM} = -\mathbb{E} \left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right], \]

where \(y_w\) is preferred over \(y_l\). The reward model generalizes from annotated pairs to score any (prompt, response).
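
The preference loss above is a logistic loss on reward margins. A minimal sketch, assuming the reward model has already produced scalar scores for the preferred and dispreferred response in each pair:

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    r_chosen, r_rejected: (batch,) scalar rewards for the preferred (y_w)
    and dispreferred (y_l) responses to the same prompts.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```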

Stage 3: RL Optimization

Treat the language model as a policy \(\pi_\theta\) and the reward model as the reward function. Use PPO (Schulman et al. 2017) to maximize the KL-regularized objective:

\[ J(\theta) = \mathbb{E}\left[r_\phi(x, y) - \beta \, \mathrm{KL}(\pi_\theta \| \pi_\text{SFT})\right], \]

where the KL term prevents the policy from drifting too far from the SFT model (reducing reward hacking — finding behaviors that score well on \(r_\phi\) but are not genuinely better).
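
As a sketch, the per-response quantity being maximized can be computed from the reward-model score and a Monte Carlo estimate of the KL term on the sampled tokens; summing per-token log-ratios is a common implementation choice, assumed here rather than taken from the original paper:

```python
def kl_regularized_reward(reward, logprobs_policy, logprobs_sft, beta=0.1):
    """Reward-model score minus a KL penalty against the frozen SFT model.

    reward: (batch,) scores r_phi(x, y) for the sampled responses
    logprobs_policy, logprobs_sft: (batch, resp_len) per-token log-probs of the
        sampled response tokens under the current policy and the SFT model
    beta: KL penalty coefficient
    """
    # Summing log pi_theta - log pi_sft over the sampled tokens is a Monte
    # Carlo estimate of KL(pi_theta || pi_sft) for the whole sequence.
    kl_estimate = (logprobs_policy - logprobs_sft).sum(dim=-1)
    return reward - beta * kl_estimate
```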

Direct Preference Optimization

RLHF is complex: the RL step is unstable, sensitive to hyperparameters, and prone to reward hacking. DPO (Rafailov et al. 2023) bypasses RL entirely by directly optimizing the policy from preference pairs:

\[ \mathcal{L}_\text{DPO} = -\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{SFT}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{SFT}(y_l \mid x)}\right)\right]. \]

DPO is simpler and more stable, and it avoids training an explicit reward model. It is now widely used.
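
A minimal sketch of the DPO loss, assuming the per-response log-probabilities (summed over response tokens) have already been computed under the current policy and the frozen SFT reference:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, sft_logp_w, sft_logp_l, beta=0.1):
    """DPO loss from sequence log-probs log pi(y | x) summed over response tokens.

    *_w / *_l: (batch,) log-probs of the preferred / dispreferred responses
    beta: scales the implicit reward, i.e. the log-ratio against the SFT model
    """
    chosen_logratio = policy_logp_w - sft_logp_w      # implicit reward of y_w
    rejected_logratio = policy_logp_l - sft_logp_l    # implicit reward of y_l
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```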

Process vs. Outcome Reward Models

Standard RLHF rewards the final response — an outcome reward model (ORM). For tasks with long reasoning chains, outcome rewards provide a sparse signal: feedback comes only after generating hundreds of tokens, making credit assignment difficult.

A process reward model (PRM) scores each step in the reasoning chain individually. PRMs provide denser feedback, enable step-level rejection sampling, and discourage chains that reach correct final answers via incorrect intermediate steps. The trade-off: PRMs require step-level annotation, which is expensive.
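
One way a PRM is used at inference time is step-level rejection sampling: sample several candidate next steps, score each with the PRM, and keep the best. The sketch below assumes hypothetical `generate_step` and `prm_score` callables standing in for a sampler and a trained PRM; neither is a specific library API:

```python
def best_of_n_step(prompt, chain_so_far, generate_step, prm_score, n=8):
    """Step-level rejection sampling with a process reward model.

    generate_step(prompt, chain) -> one candidate next reasoning step (str)
    prm_score(prompt, chain, step) -> scalar step score from the PRM
    Both callables are hypothetical stand-ins for illustration.
    """
    candidates = [generate_step(prompt, chain_so_far) for _ in range(n)]
    scores = [prm_score(prompt, chain_so_far, step) for step in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```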

Group Relative Policy Optimization

GRPO is a variant of PPO designed for reasoning tasks. For each prompt it samples a group of \(G\) responses \(y_1, \ldots, y_G\), scores each with a verifier or reward model, and normalizes advantages by the group mean and standard deviation:

\[ A_i = \frac{r_i - \mu_G}{\sigma_G}, \qquad \mu_G = \frac{1}{G}\sum_{j=1}^G r_j. \]

Policy gradient updates then use these group-normalized advantages. GRPO drops the learned value function that PPO uses as a baseline, relying on the group mean instead, which reduces memory and compute overhead; it has been applied effectively to mathematical reasoning tasks.
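
A minimal sketch of the group-normalized advantage computation, assuming the \(G\) responses for one prompt have already been scored:

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for G responses sampled from the same prompt.

    rewards: (G,) verifier or reward-model scores for the group
    Returns (G,) advantages normalized by the group mean and std; the group
    mean plays the role of the baseline instead of a learned value function.
    """
    mu = rewards.mean()
    sigma = rewards.std()
    return (rewards - mu) / (sigma + eps)
```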

Relationship to Policy Gradient

Post-training RL is an application of policy gradient methods. The policy is the language model, the action space is the vocabulary at each step, and the environment is a reward model or a verifiable task (math, code). The key differences from classical RL are the enormous action and state spaces and the critical importance of staying close to the pretrained initialization — without the KL penalty, the policy quickly degrades language quality while optimizing reward.
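
To make the mapping concrete, a simplified REINFORCE-style surrogate for one sampled response is sketched below; PPO adds importance ratios and clipping on top of this, so the sketch only illustrates the basic policy-gradient structure, with the advantage supplied externally (e.g. the KL-regularized reward from the earlier sketch, minus a baseline):

```python
def reinforce_loss(logprobs_policy, advantage):
    """REINFORCE surrogate for one sampled response.

    logprobs_policy: (resp_len,) per-token log-probs of the sampled tokens
        under the current policy (each generated token is one "action")
    advantage: scalar treated as a constant (no gradient flows through it)
    """
    # Maximizing advantage * log pi(y | x) means minimizing its negation.
    return -advantage * logprobs_policy.sum()
```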

References

Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems (NeurIPS), 27730–44. https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html.
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” arXiv Preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347.