LLM Inference
Motivation
A large language model defines a conditional distribution over tokens — but deploying the model requires turning that distribution into text, efficiently, at scale. Inference is the process of generating sequences from this distribution, and it has two independent design axes: how to select the next token (the decoding strategy), and how to compute the forward pass efficiently as the sequence grows. Neither axis is fixed by training. Understanding inference is necessary for interpreting model behavior (outputs are shaped by decoding, not just by the weights), for serving models cost-effectively, and for building systems where latency and throughput matter.
Autoregressive Generation
All standard large language models generate text autoregressively: one token at a time, left to right. At each step \(t\):
- Compute the distribution \(p_\theta(\cdot \mid x_1, \ldots, x_{t-1})\) over the vocabulary.
- Select token \(x_t\) according to the decoding strategy.
- Append \(x_t\) to the context and repeat until an end-of-sequence token or a length limit is reached.
Generation is inherently sequential on the output axis: step \(t\) cannot begin until step \(t-1\) completes. This sequential dependency is the main latency bottleneck in serving large models.
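Concretely, the loop below is a minimal sketch of this procedure, assuming a hypothetical `model` callable that maps a token-ID prefix to next-token logits and a `sample` function implementing one of the decoding strategies discussed next.

```python
import numpy as np

def generate(model, sample, prompt_ids, eos_id, max_new_tokens=128):
    """Autoregressive generation: one token per forward pass, left to right.

    `model(ids)` is assumed to return a logits vector over the vocabulary for
    the next token; `sample(logits)` implements a decoding strategy.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(np.array(ids))   # forward pass over the current context
        next_id = sample(logits)        # decoding strategy picks the next token
        ids.append(next_id)             # extend the context with the chosen token
        if next_id == eos_id:           # stop at the end-of-sequence token
            break
    return ids
```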
Decoding Strategies
Greedy Decoding
Select the highest-probability token at each step:
\[ x_t = \arg\max_v \, p_\theta(v \mid x_{<t}). \]
Fast and deterministic. Tends to produce repetitive, low-diversity outputs and can get stuck in loops.
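As a one-line sketch over a raw logits vector (NumPy assumed):

```python
import numpy as np

def greedy(logits):
    # Deterministic: always take the single highest-scoring token.
    return int(np.argmax(logits))
```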
Temperature Scaling
Divide the logits \(z_v\) by temperature \(\tau > 0\) before softmax:
\[ p_\tau(v \mid x_{<t}) = \frac{\exp(z_v / \tau)}{\sum_{v'} \exp(z_{v'} / \tau)}. \]
\(\tau < 1\) sharpens the distribution (more confident, less diverse); \(\tau > 1\) flattens it (more diverse, less coherent). \(\tau = 1\) leaves the model unchanged.
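A minimal sampling sketch with temperature applied to raw logits (\(\tau = 0.8\) is an illustrative value, not a recommendation from the text):

```python
import numpy as np

def sample_with_temperature(logits, tau=0.8):
    # Scale the logits by 1/tau, apply a numerically stable softmax, then sample.
    scaled = logits / tau
    scaled = scaled - scaled.max()                  # stability: subtract the max logit
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(probs), p=probs))
```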
Top-\(k\) Sampling
Restrict sampling to the \(k\) highest-probability tokens and renormalize. Prevents generating very low-probability tokens, but uses a fixed vocabulary size regardless of how concentrated the distribution is.
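A sketch of top-\(k\) sampling over raw logits (\(k = 50\) is illustrative, not prescribed by the text):

```python
import numpy as np

def sample_top_k(logits, k=50):
    # Keep only the k highest-scoring tokens, renormalize, and sample among them.
    top = np.argsort(logits)[-k:]                   # indices of the k largest logits
    scaled = logits[top] - logits[top].max()        # stability: subtract the max logit
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(top, p=probs))
```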
Nucleus (Top-\(p\)) Sampling
Restrict sampling to the smallest set of tokens whose cumulative probability reaches \(p\) (Holtzman et al. 2020):
\[ \mathcal{V}_p(x_{<t}) = \arg\min_{\substack{V' \subseteq V \\ \sum_{v \in V'} p_\theta(v \mid x_{<t}) \geq p}} |V'|. \]
The nucleus adapts to the distribution: when the model is confident, the nucleus is small; when it is uncertain, the nucleus is larger. Nucleus sampling with \(p \approx 0.9\) is a common default for open-ended generation, and in Holtzman et al.'s evaluations it outperforms fixed top-\(k\).
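A sketch of nucleus sampling following the definition above, again assuming raw logits and NumPy:

```python
import numpy as np

def sample_top_p(logits, p=0.9):
    # Sort tokens by probability and keep the smallest prefix whose mass reaches p.
    scaled = logits - logits.max()
    probs = np.exp(scaled) / np.exp(scaled).sum()
    order = np.argsort(probs)[::-1]                    # descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))
```

Each of these samplers maps logits to a token ID, so any of them could be passed as the `sample` argument of the generation loop sketched earlier.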
Beam Search
Maintain \(b\) partial hypotheses (the beam) in parallel. At each step, extend each hypothesis with every candidate next token, score each extended sequence by its cumulative log-probability, and keep the top \(b\). Return the highest-scoring completed sequence.
Beam search approximately finds the most probable sequence under the model. It produces more grammatical output for structured tasks (translation, summarization), but it is deterministic and, without length normalization, tends toward generic or overly short outputs in open-ended generation.
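The sketch below is a simplified, length-unnormalized beam search, again assuming a hypothetical `model` callable that returns next-token logits:

```python
import numpy as np

def beam_search(model, prompt_ids, eos_id, beam_size=4, max_new_tokens=32):
    """Length-unnormalized beam search sketch.

    Each hypothesis is a (token_ids, cumulative_log_prob) pair.
    """
    beams = [(list(prompt_ids), 0.0)]
    finished = []
    for _ in range(max_new_tokens):
        candidates = []
        for ids, score in beams:
            logits = model(np.array(ids))
            shifted = logits - logits.max()                    # stable log-softmax
            log_probs = shifted - np.log(np.exp(shifted).sum())
            # Expanding only each hypothesis's beam_size best continuations is
            # equivalent to expanding by every token and then pruning.
            for tok in np.argsort(log_probs)[-beam_size:]:
                candidates.append((ids + [int(tok)], score + float(log_probs[tok])))
        # Keep the beam_size highest-scoring extended hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for ids, score in candidates[:beam_size]:
            (finished if ids[-1] == eos_id else beams).append((ids, score))
        if not beams:                                          # every hypothesis ended
            break
    finished.extend(beams)                 # unfinished hypotheses still compete
    return max(finished, key=lambda c: c[1])[0]
```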
KV Caching
Without caching, each new token requires recomputing the key and value projections for all previous tokens in every layer: \(O(n)\) work per layer per step, or \(O(n^2)\) in total across a sequence of length \(n\). KV caching stores the key and value tensors from all previous tokens. Each new step computes keys and values only for the new token and reads the rest from the cache, so the per-step projection cost drops to \(O(1)\) per layer (the attention computation itself still touches all \(n\) cached positions), at the cost of memory proportional to context length × layers × hidden dimension.
KV caching is essential in any deployment: without it, generating long sequences is prohibitively expensive.
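The sketch below shows the idea for a single attention head in one layer, assuming illustrative weight matrices `W_q`, `W_k`, `W_v` and a dict-based cache; real implementations are batched, multi-headed, and preallocate the cache.

```python
import numpy as np

def attend_with_cache(x_new, W_q, W_k, W_v, cache):
    """One decoding step of single-head attention with a KV cache (sketch).

    x_new: hidden state of the new token, shape (d,).
    cache: {"k": (t-1, d) array, "v": (t-1, d) array} holding all previous steps.
    """
    q = x_new @ W_q                               # query for the new token only
    k_new = x_new @ W_k                           # O(1) key/value projection work this step
    v_new = x_new @ W_v
    cache["k"] = np.vstack([cache["k"], k_new])   # append instead of recomputing
    cache["v"] = np.vstack([cache["v"], v_new])
    # Attention itself still touches every cached position: O(t) per step.
    scores = cache["k"] @ q / np.sqrt(len(q))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["v"]

# Example cache initialization for hidden size d:
# cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
```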
Speculative Decoding
A small, fast draft model generates several tokens autoregressively. The large target model then verifies all draft tokens in a single parallel forward pass. Each draft token is accepted or rejected by a rejection-sampling test against the target model's probabilities: accepted tokens are kept unchanged, and at the first rejection a replacement token is sampled from an adjusted residual of the target model's distribution. When most draft tokens are accepted this yields a substantial speedup, and the procedure provably leaves the target model's output distribution unchanged.
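The sketch below shows one verification round under the standard accept/reject rule, assuming the draft's and target's per-position distributions are already available (the target's coming from a single parallel forward pass over the drafted tokens):

```python
import numpy as np

def speculative_step(target_probs, draft_probs, draft_tokens):
    """One verification round of speculative decoding (sketch).

    draft_tokens: tokens proposed by the draft model.
    draft_probs[i], target_probs[i]: each model's distribution at drafted
    position i; target_probs also holds one extra distribution for the
    position after the last draft token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]   # q > 0: the draft sampled tok
        if np.random.rand() < min(1.0, p / q):
            out.append(tok)                                # accepted: keep the draft token
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized,
            # which is what preserves the target model's output distribution.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            out.append(int(np.random.choice(len(residual), p=residual)))
            return out
    # All drafts accepted: sample one bonus token from the target's next distribution.
    bonus = target_probs[len(draft_tokens)]
    out.append(int(np.random.choice(len(bonus), p=bonus)))
    return out
```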
Batching and Serving
At scale, inference servers batch requests from multiple users to amortize the cost of each forward pass. Continuous batching allows new requests to join an in-progress batch mid-generation, keeping GPU utilization high even when sequences have different lengths.
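As a toy illustration of the scheduling idea only (not a real serving stack), assuming a hypothetical `model_step` that advances every active sequence by one token and reports which ones finished:

```python
from collections import deque

def serve(model_step, requests, max_batch_size=8):
    """Toy continuous-batching loop (sketch).

    `model_step(batch)` is assumed to advance every active sequence by one
    token and return the subset of sequences that finished this step.
    """
    pending = deque(requests)
    active = []
    while active or pending:
        # Admit new requests into free batch slots between decode steps,
        # rather than waiting for the whole batch to drain.
        while pending and len(active) < max_batch_size:
            active.append(pending.popleft())
        finished = model_step(active)        # one decode step for the whole batch
        active = [seq for seq in active if seq not in finished]
```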