# Information Extraction

## Motivation
Most knowledge in text is expressed in free-form natural language, not structured databases. Information extraction (IE) converts unstructured text into structured representations: entities, their types, and the relations among them. IE is the foundation of knowledge graph construction, question answering over documents, and many downstream NLP applications.
## Named Entity Recognition
Named entity recognition (NER) identifies spans of text that refer to named entities and classifies each span into a category such as person, organization, location, date, or quantity.
For the sentence “Apple was founded by Steve Jobs in Cupertino”:
| Span | Type |
|---|---|
| Apple | ORG |
| Steve Jobs | PER |
| Cupertino | LOC |
### Sequence Labeling
NER is typically formulated as sequence labeling: each token receives a tag from a scheme such as BIO:
- B-TYPE: beginning of a named entity of the given type.
- I-TYPE: continuation of a named entity of the given type.
- O: not part of any entity.
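The BIO scheme can be decoded back into typed spans with a simple scan. Below is a minimal sketch (the tag sequence for the earlier example sentence is assumed, not produced by a trained model):

```python
# BIO tags for "Apple was founded by Steve Jobs in Cupertino".
tokens = ["Apple", "was", "founded", "by", "Steve", "Jobs", "in", "Cupertino"]
tags = ["B-ORG", "O", "O", "O", "B-PER", "I-PER", "O", "B-LOC"]

def bio_to_spans(tags):
    """Decode a BIO tag sequence into (start, end, type) spans; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close any span still open
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and etype == tag[2:]:
            continue                        # valid continuation of the open span
        else:                               # "O", or an I-tag with no matching open span
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:                   # span runs to the end of the sentence
        spans.append((start, len(tags), etype))
    return spans

print(bio_to_spans(tags))  # [(0, 1, 'ORG'), (4, 6, 'PER'), (7, 8, 'LOC')]
```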
A bidirectional LSTM followed by a conditional random field (CRF) output layer (Lample et al. 2016) was the dominant architecture before transformers. The CRF models dependencies between adjacent output labels — for example, enforcing that an I-tag must follow a B-tag or I-tag of the same type — improving span-level consistency. Modern systems instead feed contextual embeddings from a pretrained transformer encoder into a linear head or CRF, achieving substantially higher accuracy with less task-specific engineering.
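The hard label-transition constraint that a CRF layer enforces can be written as a predicate; in practice it is realized as a mask of -inf scores over forbidden transitions rather than an explicit check, but the rule itself is just this (a sketch, not any particular library's API):

```python
def is_valid_transition(prev_tag: str, tag: str) -> bool:
    """CRF-style hard constraint: I-X may only follow B-X or I-X of the same type."""
    if tag.startswith("I-"):
        entity_type = tag[2:]
        return prev_tag in (f"B-{entity_type}", f"I-{entity_type}")
    return True  # B-* and O may follow anything
```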
### Evaluation
NER is evaluated at the span level: a prediction is correct only if both the span boundaries and the entity type match the gold annotation. The primary metrics are precision, recall, and F1 computed over spans. Token-level F1 (which credits partial span matches) is sometimes reported but is a looser measure.
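Exact-match span scoring reduces to set operations over (start, end, type) tuples. A minimal sketch, with made-up gold and predicted spans:

```python
def span_prf(gold, pred):
    """Exact-match precision, recall, and F1 over (start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # boundaries AND type must match
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(0, 1, "ORG"), (4, 6, "PER"), (7, 8, "LOC")]
pred = [(0, 1, "ORG"), (4, 6, "LOC")]           # right boundaries, wrong type
print(span_prf(gold, pred))                     # (0.5, 0.333..., 0.4)
```

The second prediction gets no credit despite perfect boundaries, which is exactly the strictness that distinguishes span-level from token-level F1.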
## Relation Extraction
Relation extraction (RE) identifies semantic relations between pairs of named entities. Given entities \(e_1\) and \(e_2\) appearing in the same sentence or document, the task is to determine whether a relation \(r\) holds and, if so, which one.
Example: “Marie Curie was born in Warsaw” → (Marie Curie, bornIn, Warsaw).
RE is often pipelined after NER: first identify entity mentions, then classify pairs. End-to-end models that jointly extract entities and relations have become more common with transformer-based encoders, avoiding the error propagation that affects pipelined systems.
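The pipelined setting can be sketched end to end: take NER output as given, enumerate ordered entity pairs, and classify each pair. A toy pattern matcher over the text between the two mentions stands in for the learned relation classifier a real system would use:

```python
import re

def extract_relations(text, entities, patterns):
    """Pipelined RE sketch: (mention, start, end) entities in, (head, relation, tail) triples out.
    `patterns` maps a relation name to a regex tested against the text between the mentions."""
    triples = []
    for head, h_start, h_end in entities:
        for tail, t_start, t_end in entities:
            if h_end <= t_start:                    # consider pairs where head precedes tail
                between = text[h_end:t_start]
                for rel, pat in patterns.items():
                    if re.search(pat, between):
                        triples.append((head, rel, tail))
    return triples

text = "Marie Curie was born in Warsaw"
entities = [("Marie Curie", 0, 11), ("Warsaw", 24, 30)]   # NER output (character offsets)
patterns = {"bornIn": r"\bwas born in\b"}
print(extract_relations(text, entities, patterns))
# [('Marie Curie', 'bornIn', 'Warsaw')]
```

Note how an NER mistake (a missed or mistyped mention) silently removes candidate pairs here; that is the error propagation joint models avoid.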
## Slot Filling and Template Filling
Slot filling extracts values for a fixed schema from text. Given a template such as
{PERSON} was born on {DATE} in {PLACE}.
the system fills the slots by identifying the appropriate spans. This is a restricted form of RE useful for well-defined extraction tasks: résumé parsing, event detection, and product attribute extraction.
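Once spans have been extracted, filling the template is mechanical. A minimal sketch (the extracted spans are assumed; slots with no extracted value are left as placeholders):

```python
def fill_template(template, spans):
    """Substitute {SLOT} placeholders with extracted span text; unfilled slots stay as-is."""
    out = template
    for slot, value in spans.items():
        out = out.replace("{" + slot + "}", value)
    return out

spans = {"PERSON": "Marie Curie", "PLACE": "Warsaw"}   # no DATE was extracted
print(fill_template("{PERSON} was born on {DATE} in {PLACE}.", spans))
# Marie Curie was born on {DATE} in Warsaw.
```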
## Document-Level Information Extraction
Entities and relations often span paragraphs and documents, not just individual sentences. Document-level IE must resolve coreference — recognizing that “she”, “the researcher”, and “Marie Curie” all refer to the same entity — and aggregate evidence across multiple mentions to populate a knowledge base.
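Given coreference clusters, the aggregation step amounts to mapping every mention to a canonical name and deduplicating the resulting triples. A minimal sketch, with hypothetical clusters and triples (a real system would get both from learned coreference and RE components):

```python
def aggregate(triples, clusters):
    """Canonicalize mentions via coreference clusters, then deduplicate triples.
    `clusters` maps a canonical name to the list of mentions that corefer with it."""
    canon = {m: rep for rep, mentions in clusters.items() for m in mentions}
    kb = set()
    for head, rel, tail in triples:
        kb.add((canon.get(head, head), rel, canon.get(tail, tail)))
    return kb

clusters = {"Marie Curie": ["Marie Curie", "she", "the researcher"]}
triples = [("she", "bornIn", "Warsaw"),          # extracted from different sentences
           ("Marie Curie", "bornIn", "Warsaw")]
print(aggregate(triples, clusters))
# {('Marie Curie', 'bornIn', 'Warsaw')}
```

The two sentence-level extractions collapse into a single knowledge-base fact once "she" resolves to "Marie Curie".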
Modern document-level systems use long-context transformer encoders whose attention allows evidence from different sentences to interact, enabling extraction of relations that are only inferable by reading across the document.