Pretraining
Motivation
Training a neural network from scratch on a downstream task requires large labeled datasets, which are expensive to collect. Language has an enormous advantage: text on the internet constitutes billions of unlabeled training examples, and the structure of language provides a free supervision signal — predict what comes next, or predict what was masked out. Pretraining exploits this: train a model with a self-supervised objective on massive unlabeled data, then adapt it to downstream tasks with far less supervision.
This paradigm replaced the prior practice of training task-specific architectures from random initialization and now dominates natural language processing.
Objectives
Autoregressive Language Modeling
Given a sequence of tokens \((x_1, x_2, \ldots, x_T)\), the model learns to predict each token from all preceding tokens:
\[ \mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \ldots, x_{t-1}). \]
The model sees only left context; in a Transformer decoder this is enforced by causal masking: the attention mask prevents position \(t\) from attending to positions \(> t\).
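A minimal sketch of this objective, assuming PyTorch; causal_mask and autoregressive_loss are illustrative helper names, not part of any particular codebase:

```python
import torch
import torch.nn.functional as F

def causal_mask(T: int) -> torch.Tensor:
    # True where attention is allowed: position t may attend only to positions <= t.
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

def autoregressive_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # logits: (batch, T, vocab), where logits[:, t] must depend only on tokens[:, :t+1]
    # (this is exactly what the causal attention mask enforces inside the Transformer).
    # tokens: (batch, T) integer token ids.
    pred = logits[:, :-1, :]               # predictions made at positions 0 .. T-2
    target = tokens[:, 1:]                 # the next tokens those positions should predict
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```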
Autoregressive pretraining, as in GPT (Brown et al. 2020), produces models that generate coherent text and can be prompted to perform tasks without any gradient updates, a capability known as in-context learning. The dominant architecture for large language models (GPT, Claude, Llama, Gemini) is autoregressive.
Masked Language Modeling
BERT (Devlin et al. 2019) randomly selects 15% of tokens, replaces most of them with a special [MASK] token, and trains the model to predict the original tokens at the masked positions:
\[ \mathcal{L} = -\sum_{t \in \mathcal{M}} \log p_\theta(x_t \mid x_{\setminus \mathcal{M}}), \]
where \(\mathcal{M}\) is the set of masked positions. Because the model sees both left and right context, it produces bidirectional representations. BERT-style models are strong for classification, question answering, and NER, but cannot generate text autoregressively.
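A simplified sketch of the masking and loss computation, again assuming PyTorch; mask_tokens and mlm_loss are illustrative names, and real BERT preprocessing additionally replaces some selected tokens with random or unchanged tokens rather than [MASK]:

```python
import torch
import torch.nn.functional as F

def mask_tokens(tokens: torch.Tensor, mask_id: int, mask_prob: float = 0.15):
    # Choose ~15% of positions at random. (BERT replaces the chosen tokens with
    # [MASK] only 80% of the time, a random token 10%, and leaves them unchanged
    # 10%; this sketch always uses [MASK].)
    mask = torch.rand(tokens.shape) < mask_prob
    inputs = tokens.clone()
    inputs[mask] = mask_id                 # corrupt the input at the chosen positions
    targets = tokens.clone()
    targets[~mask] = -100                  # -100 = ignored by cross_entropy below
    return inputs, targets

def mlm_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Loss is computed only at the masked positions (ignore_index skips the rest).
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)
```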
Scaling Laws
Empirically, the pretraining loss falls as a smooth power law in model size \(N\), dataset size \(D\), and training compute \(C\) (Kaplan et al. 2020). In terms of \(N\) and \(D\),
\[ L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty, \]
where \(L_\infty\) is the irreducible entropy of the data. Over many orders of magnitude, each 10× increase in compute yields a predictable improvement in loss. This predictability means the performance of very large, expensive models can be forecast from smaller experiments.
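A toy evaluation of this functional form; the constants below are illustrative placeholders chosen to produce plausible loss values, not the fitted values reported in the cited papers:

```python
# Evaluate the power-law loss form above with placeholder constants.
def scaling_loss(N, D, N_c=5e7, D_c=2.2e9, alpha_N=0.34, alpha_D=0.28, L_inf=1.7):
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D + L_inf

# Scaling N and D together shrinks each reducible term by a fixed factor
# (10**-alpha_N and 10**-alpha_D per decade), which is what makes it possible
# to extrapolate from small runs to much larger ones.
for scale in (1, 10, 100):
    print(f"scale {scale:>3}x  loss = {scaling_loss(1e9 * scale, 2e10 * scale):.3f}")
```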
Hoffmann et al. (2022) (the “Chinchilla” paper) refined this analysis and showed that, under a fixed compute budget, the optimal allocation trains on roughly \(D \approx 20 N\) tokens. Many earlier large models were undertrained relative to their size; optimal scaling trains smaller models on more data.
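As a rough worked example, combining the \(D \approx 20N\) rule with the standard approximation that training a dense Transformer costs about \(6ND\) FLOPs gives a compute-optimal split for any budget \(C\); the sketch below is just that arithmetic, not the paper's fitted frontier:

```python
import math

def chinchilla_optimal(C: float) -> tuple:
    # Compute-optimal sizes under the rule of thumb D ~= 20 N, combined with the
    # standard approximation that training costs C ~= 6 N D FLOPs.
    N = math.sqrt(C / (6 * 20))    # solve C = 6 * N * (20 N) for N
    D = 20 * N
    return N, D

# Example: a 1e24-FLOP budget suggests on the order of 9e10 parameters trained
# on roughly 1.8e12 tokens (figures from this rule of thumb, not a fitted curve).
N, D = chinchilla_optimal(1e24)
print(f"{N:.2e} parameters, {D:.2e} tokens")
```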
Transfer to Downstream Tasks
After pretraining, the model is adapted to downstream tasks:
- Full fine-tuning. All parameters are updated on labeled data for the target task. Effective but compute-intensive.
- Prompting / in-context learning. The frozen pretrained model is prompted with a task description and examples; no gradient update occurs. Unique to autoregressive models.
- Parameter-efficient fine-tuning. A small number of adapter parameters are added and trained; the pretrained weights are frozen. LoRA (low-rank adaptation) adds trainable matrices \(\Delta W = AB\) with rank \(r \ll d\) alongside the frozen weights, reducing trainable parameters by orders of magnitude with little quality loss (see the sketch after this list).
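A minimal sketch of a LoRA-style linear layer, assuming PyTorch; LoRALinear and its initialization details are illustrative, not the API of an existing library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # A linear layer whose pretrained weight W is frozen and whose update
    # Delta W = A @ B is low rank and trainable.
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Placeholder init; in practice W is loaded from the pretrained checkpoint.
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02, requires_grad=False)
        self.A = nn.Parameter(torch.randn(d_out, r) * 0.01)  # trainable, (d_out, r)
        self.B = nn.Parameter(torch.zeros(r, d_in))           # trainable, (r, d_in); zero init so Delta W starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T                # frozen pretrained path
        lora = (x @ self.B.T) @ self.A.T        # applies Delta W = A @ B without materializing it
        return base + self.scale * lora

# Only A and B receive gradients: for d_in = d_out = 768 and r = 8 that is
# 2 * 768 * 8 = 12,288 trainable parameters versus 589,824 in the frozen weight.
layer = LoRALinear(d_in=768, d_out=768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)
```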
The key insight is that pretraining instills broad linguistic and factual knowledge; adaptation steers this knowledge toward a specific task. More capable pretraining generally produces better downstream performance, which is why scaling the pretraining phase has been so productive.