Phrase-Based Translation
Motivation
Word-by-word translation fails because languages differ at the phrasal level: a single word or fixed expression in one language often corresponds to a multi-word unit in another, and the mapping is not compositional at the word level (“kick the bucket” cannot be translated bucket by bucket). Phrase-based machine translation (Koehn et al. 2003) extends the statistical MT framework to phrase pairs, learning translation patterns over short contiguous word sequences. This yielded a significant empirical improvement over word-based models and remained the dominant statistical approach until neural sequence-to-sequence models arrived.
The Noisy-Channel Model
Statistical MT frames translation as Bayesian inference. Given a source sentence \(f\), find the target sentence \(e\) that maximizes
\[ \hat{e} = \arg\max_{e} \, p(e \mid f) = \arg\max_{e} \, \frac{p(f \mid e) \, p(e)}{p(f)} = \arg\max_{e} \, p(f \mid e) \, p(e), \]
where Bayes' rule is applied and the denominator \(p(f)\) is dropped because it is constant with respect to \(e\). The objective decomposes into:
- Translation model \(p(f \mid e)\): how likely is \(f\) as a translation of \(e\)?
- Language model \(p(e)\): how fluent is \(e\) as a target-language sentence?
This factorization allows the two components to be trained independently — the translation model from a bilingual corpus, the language model from a large monolingual corpus — and combined at decoding time.
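As a minimal illustration of the factorization, decoding picks the candidate \(e\) that maximizes \(\log p(f \mid e) + \log p(e)\). The sketch below uses hypothetical toy lookup tables standing in for a real translation model and language model:

```python
# Toy noisy-channel scoring: both component models are hypothetical stand-ins.
# translation_logprob(f, e) plays the role of log p(f | e);
# lm_logprob(e) plays the role of log p(e).

def translation_logprob(f: str, e: str) -> float:
    # Placeholder: a real system would score f given e with a trained translation model.
    table = {("la maison", "the house"): -0.2, ("la maison", "house the"): -0.2}
    return table.get((f, e), -10.0)

def lm_logprob(e: str) -> float:
    # Placeholder: a real system would use an n-gram language model over e.
    table = {"the house": -1.0, "house the": -6.0}
    return table.get(e, -20.0)

def noisy_channel_best(f: str, candidates: list[str]) -> str:
    # argmax_e  log p(f | e) + log p(e)
    return max(candidates, key=lambda e: translation_logprob(f, e) + lm_logprob(e))

print(noisy_channel_best("la maison", ["the house", "house the"]))  # -> "the house"
```

Both candidates are equally likely under the toy translation model; the language model breaks the tie in favor of the fluent word order, which is exactly the division of labor the factorization is meant to achieve.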
Phrase Extraction
A phrase pair \((\bar{f}, \bar{e})\) is a pair of contiguous spans such that no word inside the source span is aligned to a word outside the target span, and vice versa (the consistency constraint).
Phrase pairs are extracted from a word-aligned bilingual corpus (a code sketch of the extraction step follows the list):
- Align words in both directions (source→target and target→source) using IBM word-alignment models.
- Symmetrize the two alignment sets using a heuristic such as grow-diag-final.
- Extract all consistent phrase pairs up to a maximum length (typically 7 words).
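The extraction step can be written down compactly. The sketch below is a simplified version of the consistency check (extensions to unaligned boundary words are omitted), applied to a hypothetical toy sentence pair and alignment:

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=7):
    """Extract consistent phrase pairs from one word-aligned sentence pair.

    alignment: set of (i, j) pairs meaning src[i] is aligned to tgt[j].
    Simplified: the unaligned-word extensions used in practice are omitted.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            tgt_points = {j for (i, j) in alignment if i1 <= i <= i2}
            if not tgt_points:
                continue
            j1, j2 = min(tgt_points), max(tgt_points)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no word inside the target span may align outside the source span.
            consistent = all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2)
            if consistent:
                pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

# Toy example (hypothetical alignment): "la maison bleue" <-> "the blue house"
src = ["la", "maison", "bleue"]
tgt = ["the", "blue", "house"]
alignment = {(0, 0), (1, 2), (2, 1)}
for pair in extract_phrase_pairs(src, tgt, alignment):
    print(pair)
```

On this toy input the extractor returns both single-word pairs and the larger pairs ("maison bleue", "blue house") and ("la maison bleue", "the blue house"), while rejecting inconsistent spans such as ("la maison", "the blue").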
Each extracted phrase pair \((\bar{f}, \bar{e})\) is scored with several features, including phrase translation probabilities in both directions \(p(\bar{f} \mid \bar{e})\) and \(p(\bar{e} \mid \bar{f})\), and lexical weights that smooth the phrase probabilities using word-level translation probabilities.
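The phrase translation probabilities are typically estimated by relative frequency over the pairs pooled from the whole corpus. A minimal sketch of that counting step (lexical weighting is omitted here):

```python
from collections import Counter

def estimate_phrase_probs(extracted_pairs):
    """Relative-frequency phrase translation probabilities.

    extracted_pairs: (f_phrase, e_phrase) tuples pooled over the whole corpus,
    one entry per extraction, so repeated pairs raise the count.
    Returns p(e_phrase | f_phrase) and p(f_phrase | e_phrase) as dicts.
    """
    extracted_pairs = list(extracted_pairs)
    pair_counts = Counter(extracted_pairs)
    f_counts = Counter(f for f, _ in extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)
    p_e_given_f = {(f, e): c / f_counts[f] for (f, e), c in pair_counts.items()}
    p_f_given_e = {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}
    return p_e_given_f, p_f_given_e
```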
The Log-Linear Model
A phrase-based system combines multiple feature functions in a log-linear model (Koehn et al. 2003):
\[ p(e \mid f) \propto \exp\Big( \sum_i \lambda_i h_i(e, f) \Big), \]
where each \(h_i\) is a feature function (translation model components, language model score, phrase count, word count, distortion penalty) and \(\lambda_i\) is a weight tuned on a development set; the normalization constant can be ignored at decoding time because it does not depend on \(e\). Minimum error rate training (MERT) tunes the weights by directly optimizing a translation quality metric such as BLEU.
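Concretely, scoring a candidate under the log-linear model is just a weighted sum of feature values. The sketch below uses an illustrative, hypothetical feature inventory and hand-set weights; in a real system the weights come from tuning (e.g. with MERT) on a development set:

```python
# Illustrative feature values for one candidate translation (log-domain where applicable).
features = {
    "log_p_f_given_e": -4.2,   # translation model, forward direction
    "log_p_e_given_f": -3.8,   # translation model, reverse direction
    "log_lm": -6.1,            # language model score
    "word_count": 7,           # word penalty feature
    "phrase_count": 3,         # phrase penalty feature
    "distortion": -2,          # negated total reordering distance
}

# Weights lambda_i; hand-set here, normally tuned on a development set.
weights = {
    "log_p_f_given_e": 1.0, "log_p_e_given_f": 0.8, "log_lm": 1.2,
    "word_count": -0.3, "phrase_count": -0.2, "distortion": 0.6,
}

def loglinear_score(features, weights):
    # Unnormalized model score: sum_i lambda_i * h_i(e, f).
    return sum(weights[name] * value for name, value in features.items())

print(loglinear_score(features, weights))
```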
Decoding as Beam Search
Finding the highest-scoring translation is NP-hard in general because the number of possible reorderings grows exponentially with sentence length. Phrase-based decoders therefore use beam search with a left-to-right coverage model (a simplified decoder is sketched in code after the list):
- Start with an empty hypothesis (no source words covered, empty target).
- At each step, extend a hypothesis by translating an uncovered contiguous span of source words with a matching target phrase from the phrase table, appending the target phrase to the output.
- Score the extension using the log-linear model.
- Prune: keep only the top-\(b\) hypotheses in each stack, where hypotheses are grouped by the number of source words covered.
- Return the highest-scoring complete hypothesis (all source words covered).
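Putting the pieces together, the sketch below is a deliberately simplified stack decoder: hypotheses are grouped by the number of source words covered, each stack is pruned to the top \(b\), and scores combine a toy phrase table, a toy language-model stub, and a distance-based distortion penalty. All toy components are assumptions for illustration; real decoders add hypothesis recombination, future-cost estimation, and richer models.

```python
from heapq import nlargest
from typing import NamedTuple

class Hyp(NamedTuple):
    covered: frozenset   # indices of covered source words
    last_end: int        # index just past the last translated source phrase
    output: tuple        # target words produced so far
    score: float         # accumulated log-linear score

def decode(src, phrase_table, lm_logprob, beam=5, max_phrase_len=3, dist_weight=0.5):
    n = len(src)
    stacks = [[] for _ in range(n + 1)]            # stack k holds hypotheses covering k words
    stacks[0].append(Hyp(frozenset(), 0, (), 0.0))
    for k in range(n):
        for hyp in nlargest(beam, stacks[k], key=lambda h: h.score):   # beam pruning
            for i in range(n):
                for j in range(i, min(i + max_phrase_len, n)):
                    span = frozenset(range(i, j + 1))
                    if span & hyp.covered:
                        continue                   # these source words are already translated
                    f_phrase = " ".join(src[i:j + 1])
                    for e_phrase, tm_logprob in phrase_table.get(f_phrase, []):
                        e_words = tuple(e_phrase.split())
                        score = (hyp.score + tm_logprob
                                 + lm_logprob(hyp.output, e_words)
                                 - dist_weight * abs(i - hyp.last_end))   # distortion penalty
                        new = Hyp(hyp.covered | span, j + 1, hyp.output + e_words, score)
                        stacks[len(new.covered)].append(new)
    # Assumes full coverage is reachable with the given phrase table.
    best = max(stacks[n], key=lambda h: h.score)
    return " ".join(best.output), best.score

# Toy example with a hypothetical phrase table and a trivial LM stub.
phrase_table = {
    "la": [("the", -0.1)],
    "maison": [("house", -0.2)],
    "maison bleue": [("blue house", -0.3)],
    "bleue": [("blue", -0.2)],
}
lm_stub = lambda prefix, new_words: -0.5 * len(new_words)   # placeholder for log p(e)
print(decode(["la", "maison", "bleue"], phrase_table, lm_stub))
```

On the toy input the decoder prefers covering "maison bleue" with the multi-word phrase "blue house", since the alternative word-by-word path would need a reordering jump that the distortion penalty makes more expensive.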
A distortion model penalizes large jumps in source word position, keeping reordering local. Lexicalized reordering models condition the distortion penalty on the specific phrase pair, learning whether a given phrase tends to be translated monotonically, swapped with its neighbor, or discontinuously.
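For concreteness, a common distance-based form of the distortion feature (the notation \(\mathrm{start}_i\) and \(\mathrm{end}_i\) is introduced here for illustration) sums the jump sizes between consecutively translated source spans:
\[ h_{\text{dist}}(e, f) = -\sum_{i=1}^{I} \bigl|\, \mathrm{start}_i - \mathrm{end}_{i-1} - 1 \,\bigr|, \qquad \mathrm{end}_0 = 0, \]
where \(\mathrm{start}_i\) and \(\mathrm{end}_i\) are the source positions of the first and last words covered by the \(i\)-th phrase used. A monotone translation incurs zero penalty, and every skipped or revisited source position adds to the cost.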
Limitations and Legacy
Phrase-based MT was superseded by sequence-to-sequence neural models, which learn soft alignments via attention rather than hard word alignments, integrate a neural language model, and scale better with data. Nevertheless, phrase-based MT established several enduring concepts: the noisy-channel decomposition, log-linear feature combination, and beam-search decoding — ideas that influenced the design of evaluation metrics and training objectives still used in neural MT.