Fine-Tuning

Motivation

A pretrained language model knows a great deal about language and the world, but its default behavior may not match deployment requirements: it may answer confidently but incorrectly, fail to follow instructions, or produce output in the wrong format. Fine-tuning adapts the pretrained model with additional training, and the main approaches differ in the kind of supervision they use.

Supervised Fine-Tuning

Supervised fine-tuning (SFT) trains the model on a curated dataset of (prompt, desired response) pairs using the standard language modeling objective — maximizing the log-likelihood of the response given the prompt:

\[ \mathcal{L}_\text{SFT} = -\sum_{t} \log p_\theta(y_t \mid x, y_{<t}), \]

where \(x\) is the prompt and \(y\) is the target response.
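As a concrete illustration, here is a minimal PyTorch sketch of this loss for a single example. It assumes the prompt and response have already been tokenized into one sequence and that `prompt_len` marks where the response begins; the masking convention (ignore index -100) is a common implementation choice, not part of the definition.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Negative log-likelihood of the response tokens only.

    logits:     (seq_len, vocab) model outputs for one example
    input_ids:  (seq_len,) prompt tokens followed by response tokens
    prompt_len: number of prompt tokens; their positions are masked out
    """
    # Position t predicts token t+1, so shift logits and targets by one.
    shift_logits = logits[:-1]            # (seq_len - 1, vocab)
    shift_labels = input_ids[1:].clone()  # (seq_len - 1,)
    # Mask positions whose target is still part of the prompt.
    shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```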

SFT requires humans (or a capable model) to write or select good responses. The resulting model follows the format and style of the training demonstrations. SFT alone is often sufficient for task-specific applications but may not produce genuinely helpful, safe, or honest behavior across open-ended conversation.

Reinforcement Learning from Human Feedback

RLHF (Ouyang et al. 2022) is a three-stage process:

Stage 1: Supervised fine-tuning. Fine-tune the pretrained model on a small set of high-quality demonstrations to produce an initial SFT model.

Stage 2: Reward model training. Present human annotators with pairs of model responses to the same prompt and ask which is better. Train a reward model \(r_\phi(x, y)\) to predict the human preference:

\[ \mathcal{L}_\text{RM} = -\mathbb{E}_{(x, y_w, y_l)} \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)), \]

where \(y_w\) is the preferred (winning) response and \(y_l\) is the less-preferred one.
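In code, this pairwise objective is nearly a one-liner. The sketch below assumes the reward model has already produced scalar scores for each response; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_w - r_l), averaged over the batch.

    r_chosen, r_rejected: (batch,) scalar scores from r_phi for the preferred
    and less-preferred responses to the same prompts.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```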

Stage 3: RL optimization. Optimize the SFT model with reinforcement learning (typically PPO (Schulman et al. 2017)), using the reward model as the reward function and maximizing the KL-regularized objective:

\[ \mathcal{J}_\text{RL} = \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)}\left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{SFT}(y \mid x)} \right], \]

where the KL penalty with coefficient \(\beta\) prevents the policy from drifting too far from the SFT model, reducing reward hacking.
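A minimal sketch of the KL-penalized reward that PPO would then maximize, using the common single-sample estimate of the KL term from summed log-probabilities. All names here are illustrative, and real implementations typically apply the penalty per token rather than per sequence.

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        logprob_policy: torch.Tensor,
                        logprob_sft: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Sequence-level RLHF reward: r_phi(x, y) minus a scaled KL estimate.

    reward:         (batch,) reward-model scores for sampled responses
    logprob_policy: (batch,) sum over response tokens of log pi_theta(y_t | x, y_<t)
    logprob_sft:    (batch,) the same sums under the frozen SFT model
    beta:           KL coefficient; larger values keep the policy closer to SFT
    """
    kl_estimate = logprob_policy - logprob_sft  # single-sample log-ratio estimate
    return reward - beta * kl_estimate
```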

Direct Preference Optimization

DPO (Rafailov et al. 2023) eliminates the separate reward model. It observes that the KL-regularized RLHF objective above has a closed-form optimal policy in terms of the reward function and the SFT model; inverting that relationship expresses the reward in terms of the policy, which yields a loss that directly optimizes the policy parameters on preference pairs:

\[ \mathcal{L}_\text{DPO} = -\mathbb{E}_{(x, y_w, y_l)} \log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{SFT}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{SFT}(y_l \mid x)}\right). \]

DPO is simpler to implement and more stable to train than PPO-based RLHF, and it sidesteps overoptimization of a separately trained reward model. It has become one of the most widely used preference-learning methods.
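Because the loss needs only summed log-probabilities of each response under the policy and the frozen SFT reference, it reduces to a few lines. The sketch below assumes those log-probs are computed elsewhere; function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed response log-probs under the policy and SFT reference.

    Each argument is a (batch,) tensor of sum_t log p(y_t | x, y_<t) over the
    response tokens of the winning (w) or losing (l) response.
    """
    # Implicit rewards are beta times the log-ratio against the frozen reference.
    margin = beta * (policy_logp_w - ref_logp_w) - beta * (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(margin).mean()
```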

Instruction Tuning

Instruction tuning is SFT on a diverse collection of tasks expressed as natural-language instructions. By training across many tasks, the model generalizes to follow instructions on unseen tasks zero-shot. Instruction-tuned models are the starting point for most modern chat assistants.
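To make the data format concrete, here is a small mixture of tasks cast as (prompt, response) pairs. The examples are invented for illustration; real instruction-tuning collections span thousands of tasks.

```python
# Hypothetical instruction-tuning examples (invented for illustration):
# heterogeneous tasks all cast as (prompt, response) pairs so that one
# SFT objective covers translation, summarization, classification, etc.
instruction_data = [
    {"prompt": "Translate to French: The weather is nice today.",
     "response": "Il fait beau aujourd'hui."},
    {"prompt": "Summarize in one sentence:\n<article text>",
     "response": "<one-sentence summary>"},
    {"prompt": "Classify the sentiment as positive or negative: 'I loved it.'",
     "response": "Positive"},
]
```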

Parameter-Efficient Fine-Tuning

Full fine-tuning updates all model parameters, which is expensive at the scale of modern language models. LoRA (low-rank adaptation; Hu et al. 2021) instead adds small trainable matrices alongside the frozen pretrained weights: each adapted weight matrix \(W \in \mathbb{R}^{d \times d}\) gains an additive term \(\Delta W = AB\) with \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times d}\), and rank \(r \ll d\). Only \(A\) and \(B\) are trained, cutting the number of trainable parameters by orders of magnitude with little quality loss.
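A minimal PyTorch sketch of a LoRA-adapted linear layer under the notation above. The zero initialization of \(B\) makes \(\Delta W = 0\) at the start of training, and the \(\alpha/r\) scaling follows the original paper; the class and argument names are illustrative, and the frozen weight here is a random stand-in for a pretrained matrix.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen d-by-d linear map with a trainable low-rank update W + AB."""

    def __init__(self, d: int, r: int, alpha: float = 1.0):
        super().__init__()
        # Stand-in for a pretrained weight matrix; frozen during fine-tuning.
        self.weight = nn.Parameter(torch.randn(d, d) * 0.02, requires_grad=False)
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # trainable, d x r
        self.B = nn.Parameter(torch.zeros(r, d))         # trainable, r x d
        self.scale = alpha / r  # scaling used in the original paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base path through frozen W plus the low-rank adapter path x A B.
        # B starts at zero, so Delta W = AB = 0 at the beginning of training.
        return x @ self.weight + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(d=16, r=4)
y = layer(torch.randn(2, 16))  # (2, 16); only layer.A and layer.B get gradients
```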

References

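Hu, Edward J., Yelong Shen, Phillip Wallis, et al. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv Preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685.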
Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems (NeurIPS), 27730–44. https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html.
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” arXiv Preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347.