Phrase-Based Translation

Motivation

Word-by-word translation fails because languages differ at the phrasal level: a phrase in one language often corresponds to a multi-word unit in another, and the mapping is not compositional at the word level (“kick the bucket” does not translate bucket by bucket). Phrase-based machine translation (Koehn et al. 2003) extends the statistical MT framework to phrase pairs, learning translation patterns over short contiguous sequences — a significant empirical improvement over word-based models, and the dominant statistical approach until neural sequence-to-sequence models arrived.

The Noisy-Channel Model

Statistical MT frames translation as Bayesian inference. Given a source sentence \(f\), find the target sentence \(e\) that maximizes

\[ \hat{e} = \arg\max_{e} \, p(e \mid f) = \arg\max_{e} \, p(f \mid e) \, p(e), \]

decomposing into:

  • Translation model \(p(f \mid e)\): how likely is \(f\) as a translation of \(e\)?
  • Language model \(p(e)\): how fluent is \(e\) as a target-language sentence?

This factorization allows the two components to be trained independently — the translation model from a bilingual corpus, the language model from a large monolingual corpus — and combined at decoding time.
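As a toy illustration of the decomposition (all probabilities below are invented for this sketch), decoding reduces to scoring each candidate translation by the sum of translation-model and language-model log-probabilities:

```python
import math

# Hypothetical translation-model probabilities p(f | e), invented for illustration.
TM = {
    ("maison bleue", "blue house"): 0.6,
    ("maison bleue", "house blue"): 0.7,
}
# Hypothetical language-model probabilities p(e).
LM = {
    "blue house": 0.05,
    "house blue": 0.001,
}

def score(f, e):
    """Noisy-channel score: log p(f | e) + log p(e)."""
    return math.log(TM[(f, e)]) + math.log(LM[e])

def decode(f, candidates):
    """Return the candidate e maximizing p(f | e) * p(e)."""
    return max(candidates, key=lambda e: score(f, e))
```

Here the fluent ordering "blue house" wins even though its translation-model score is lower, which is exactly the division of labor the factorization is designed to exploit.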

Phrase Extraction

A phrase pair \((\bar{f}, \bar{e})\) is a pair of contiguous spans such that no word inside the source span is aligned to a word outside the target span, and vice versa, and at least one word in the source span is aligned to a word in the target span (the consistency constraint).

Phrase pairs are extracted from a word-aligned bilingual corpus:

  1. Align words in both directions (source→target and target→source) using IBM word-alignment models.
  2. Symmetrize the two alignment sets using a heuristic such as grow-diag-final.
  3. Extract all consistent phrase pairs up to a maximum length (typically 7 words).
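The consistency check in step 3 can be sketched as follows. This is a simplified version that, unlike the full extraction algorithm, does not extend spans over unaligned boundary words; the sentence pair and alignment in the example are invented:

```python
def extract_phrases(n_f, n_e, alignment, max_len=3):
    """Extract consistent phrase pairs as ((f1, f2), (e1, e2)) index spans.

    alignment: set of (i, j) pairs, source word i aligned to target word j.
    """
    pairs = []
    for f1 in range(n_f):
        for f2 in range(f1, min(f1 + max_len, n_f)):
            # Target positions aligned to any word in the source span.
            js = {j for (i, j) in alignment if f1 <= i <= f2}
            if not js:
                continue  # no alignment link inside the span
            e1, e2 = min(js), max(js)
            if e2 - e1 + 1 > max_len:
                continue
            # Consistency: no target word in [e1, e2] aligns outside [f1, f2].
            if any(e1 <= j <= e2 and not (f1 <= i <= f2)
                   for (i, j) in alignment):
                continue
            pairs.append(((f1, f2), (e1, e2)))
    return pairs

# Toy example: f = "maison bleue", e = "blue house",
# with links maison<->house and bleue<->blue; this yields three
# consistent pairs, including the whole-sentence pair ((0, 1), (0, 1)).
links = {(0, 1), (1, 0)}
```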

Each extracted phrase pair \((\bar{f}, \bar{e})\) is scored with several features, including phrase translation probabilities in both directions \(p(\bar{f} \mid \bar{e})\) and \(p(\bar{e} \mid \bar{f})\), and lexical weights that smooth the phrase probabilities using word-level translation probabilities.
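The forward phrase probability \(p(\bar{e} \mid \bar{f})\), for instance, is estimated by relative frequency over the extracted pairs (the counts below are invented for this sketch):

```python
from collections import Counter

# Hypothetical phrase-pair counts from a toy extraction run.
pair_counts = Counter({
    ("maison", "house"): 8,
    ("maison", "home"): 2,
    ("maison bleue", "blue house"): 3,
})

def p_e_given_f(e_bar, f_bar):
    """Relative-frequency estimate: count(f_bar, e_bar) / count(f_bar)."""
    total = sum(c for (f, _), c in pair_counts.items() if f == f_bar)
    return pair_counts[(f_bar, e_bar)] / total
```

The reverse direction \(p(\bar{f} \mid \bar{e})\) is estimated the same way with the roles swapped; lexical weights then back these estimates off to word-level probabilities, which matters for rare phrase pairs whose relative frequencies are unreliable.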

The Log-Linear Model

A phrase-based system combines multiple feature functions in a log-linear model (Koehn et al. 2003):

\[ p(e \mid f) \propto \exp \sum_i \lambda_i h_i(e, f), \]

where each \(h_i\) is a feature (translation model components, language model score, phrase count, word count, distortion penalty) and \(\lambda_i\) is a weight tuned on a development set. Minimum error rate training (MERT) tunes weights by directly optimizing a translation quality metric such as BLEU.
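A sketch of the log-linear comparison between two hypotheses (feature values and weights below are invented; in a real system the \(\lambda_i\) come from MERT):

```python
# Hypothetical feature weights; in practice these are tuned by MERT.
weights = {"tm": 1.0, "lm": 0.8, "word_count": -0.3}

def loglinear_score(features):
    """Weighted sum of feature values: sum_i lambda_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

# Two competing hypotheses with toy feature values
# (log-probabilities for tm/lm, a count for word_count).
hyp_a = {"tm": -2.1, "lm": -3.0, "word_count": 5}
hyp_b = {"tm": -1.5, "lm": -6.0, "word_count": 5}
```

Hypothesis A wins here despite its weaker translation-model score, because the language-model weight favors its fluency; shifting the weights shifts that trade-off, which is what tuning on a development set adjusts.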

Limitations and Legacy

Phrase-based MT was superseded by sequence-to-sequence neural models, which learn soft alignments via attention rather than hard word alignments, integrate a neural language model, and scale better with data. Nevertheless, phrase-based MT established several enduring concepts: the noisy-channel decomposition, log-linear feature combination, and beam-search decoding — ideas that influenced the design of evaluation metrics and training objectives still used in neural MT.

References

Koehn, Philipp, Franz Josef Och, and Daniel Marcu. 2003. “Statistical Phrase-Based Translation.” North American Chapter of the Association for Computational Linguistics (NAACL), 127–33. https://aclanthology.org/N03-1017/.