Knowledge Graphs

Motivation

A knowledge graph is a directed, labeled graph whose nodes denote real-world entities and whose edges denote relations between them (Hogan et al. 2021). The representation is intentionally minimal: a single fact is a triple

\[(\text{subject},\ \text{predicate},\ \text{object}),\]

such as \((\text{Boston},\ \text{locatedIn},\ \text{Massachusetts})\). Stacking enough triples gives a graph that can be traversed, queried, and reasoned over.

The appeal is operational. Web-scale extraction pipelines can convert a sentence into a triple, so building a knowledge graph from text and tables is mechanical compared with handcrafting a first-order theory. Once the graph exists, search engines use it to answer entity queries, recommenders use it to find related items, and language models use it to ground their outputs in checkable facts. Knowledge graphs are how the semantic-network idea survives at modern scale — see knowledge representation for where they fit in the broader picture.

What a Triple Says

A triple \((s, p, o)\) asserts that the relation \(p\) holds from \(s\) to \(o\):

(Boston,            locatedIn,    Massachusetts)
(Massachusetts,     locatedIn,    UnitedStates)
(Boston,            populationOf, 650706)
(Boston,            mayor,        WuMichelle)
(WuMichelle,        bornIn,       1985)

Subjects and predicates are typically identified by URIs (in RDF / linked-data systems) or by internal IDs (in a property graph). Objects are either other entities or literals (strings, numbers, dates).

A knowledge graph is just a set of such triples. The graph view comes from reading each triple as an edge: a node for \(s\), a node or literal for \(o\), and a directed edge labeled \(p\).
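This set-of-triples view is concrete enough to execute. A minimal sketch in Python, using short string identifiers in place of full URIs (the helper name edges_from is hypothetical):

```python
# A knowledge graph as a plain set of (subject, predicate, object) triples.
# Short strings stand in for URIs; objects may be entities or literals.
kg = {
    ("Boston", "locatedIn", "Massachusetts"),
    ("Massachusetts", "locatedIn", "UnitedStates"),
    ("Boston", "populationOf", 650706),
    ("Boston", "mayor", "WuMichelle"),
    ("WuMichelle", "bornIn", 1985),
}

def edges_from(subject, predicate, graph):
    """The graph view: all objects one `predicate` edge away from `subject`."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}

edges_from("Boston", "locatedIn", kg)  # → {"Massachusetts"}
```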

A small example

       locatedIn          locatedIn
Boston ─────────▶ Massachusetts ─────────▶ UnitedStates
  │
  │ mayor
  ▼
WuMichelle ──bornIn──▶ 1985

A path query “what country contains the city Boston?” traverses the locatedIn edges and answers UnitedStates. Expressed in first-order logic this requires recursion (transitive closure is not first-order definable), but in the knowledge-graph view it is a plain graph-reachability question, and that change of view is most of what knowledge graphs buy you.
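The traversal is an ordinary breadth-first search restricted to edges with one label. A self-contained sketch over the toy graph above (the adjacency encoding and helper name are illustrative choices, not a standard API):

```python
from collections import deque

# Toy graph: successor lists keyed by (node, relation-label).
EDGES = {
    ("Boston", "locatedIn"): ["Massachusetts"],
    ("Massachusetts", "locatedIn"): ["UnitedStates"],
}

def reachable(start, relation):
    """All nodes reachable from `start` via one or more `relation` edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in EDGES.get((node, relation), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

reachable("Boston", "locatedIn")  # → {"Massachusetts", "UnitedStates"}
```

Each edge is visited at most once, which is the “linear in the visited subgraph” cost that graph databases rely on.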

Schemas, Open World, and Identity

A knowledge graph by itself imposes no schema. Three design choices come up immediately:

  • Open vs. closed world. Under the open-world assumption (OWA), a fact not in the graph is unknown, not false. If the graph doesn’t list a sibling for some person, the person may still have one. Closed-world reasoning (the convention in relational databases) treats the absence of a fact as its negation; this is rarely safe for web-scale graphs assembled from incomplete sources.
  • Unique-name assumption. Two different identifiers may or may not refer to the same entity. The web of Linked Data does not assume unique names — dbpedia:Barack_Obama and wikidata:Q76 denote the same person, and owl:sameAs triples record the equivalence.
  • Schema-on-read vs. schema-on-write. Some knowledge graphs admit any triple and validate against an ontology only when reasoning; others reject triples that violate predicate domain/range constraints up front.
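Dropping the unique-name assumption has a concrete operational cost: before querying, identifiers linked by owl:sameAs must be collapsed into equivalence classes. Since sameAs is symmetric and transitive, a union-find structure computes the closure cheaply. A sketch, with the identifier strings taken from the example above:

```python
class UnionFind:
    """Collapse identifiers declared equivalent by owl:sameAs triples."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
same_as = [("dbpedia:Barack_Obama", "wikidata:Q76")]
for a, b in same_as:
    uf.union(a, b)

# Both identifiers now map to one canonical representative.
uf.find("dbpedia:Barack_Obama") == uf.find("wikidata:Q76")  # → True
```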

The first two choices make knowledge graphs a poor fit for relational-database habits but align them with classical logical entailment, where anything not entailed by the stated facts is simply unknown.

Reasoning Over a Knowledge Graph

Three kinds of inference recur:

Path queries

“Find all \(x\) such that \((\text{Boston},\ \text{locatedIn}^+,\ x)\)” — apply locatedIn one or more times. SPARQL property paths and Cypher’s variable-length matches both express this directly. Reachability is linear in the size of the visited subgraph and is the bread-and-butter operation of graph databases.

Rule-based deduction

Datalog-style rules over the graph derive new triples:

\[(x,\ \text{locatedIn},\ y) \land (y,\ \text{locatedIn},\ z) \;\Rightarrow\; (x,\ \text{locatedIn},\ z).\]

This is forward chaining over a function-free Horn fragment, so it terminates and, for a fixed rule set, runs in time polynomial in the size of the fact base (Datalog's data complexity). Most production knowledge graphs ship with a small core of such rules (subClassOf transitivity, sameAs symmetry, inverse and transitive predicate declarations) and materialize the closure on ingest.
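The transitivity rule above can be materialized by naive fixpoint iteration: join the predicate with itself and add the derived triples until nothing new appears. A minimal, self-contained sketch (the helper name is hypothetical):

```python
def materialize_transitive(triples, predicate):
    """Forward-chain (x,p,y) ∧ (y,p,z) ⇒ (x,p,z) to a fixpoint."""
    closed = set(triples)
    while True:
        derived = {
            (x, predicate, z)
            for (x, p1, y1) in closed if p1 == predicate
            for (y2, p2, z) in closed if p2 == predicate and y2 == y1
        }
        new = derived - closed
        if not new:            # fixpoint reached: no rule fires anymore
            return closed
        closed |= new

facts = {("Boston", "locatedIn", "Massachusetts"),
         ("Massachusetts", "locatedIn", "UnitedStates")}
closure = materialize_transitive(facts, "locatedIn")
# closure now also contains ("Boston", "locatedIn", "UnitedStates")
```

Production engines use semi-naive evaluation (re-joining only the newly derived triples each round) rather than this quadratic rescan, but the fixpoint logic is the same.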

Description-logic reasoning

Pair the graph with an ontology of class definitions and constraints, and a description-logic reasoner can decide subsumption (“is City a subclass of Place?”) and instance classification (“which class does Boston belong to?”). The OWL standard is the canonical syntax for these constraints.

Knowledge Graph Embeddings

A second tradition treats the triple set not as a logical theory but as a training set for a learned model. Each entity \(e\) and each relation \(r\) get an embedding vector \(\mathbf{e}, \mathbf{r}\), and a scoring function \(f(\mathbf{s}, \mathbf{r}, \mathbf{o})\) rates how plausible the triple \((s, r, o)\) is.

The simplest scoring function, TransE (Bordes et al. 2013), treats a relation as a translation in embedding space:

\[f_{\text{TransE}}(s, r, o) = -\|\mathbf{s} + \mathbf{r} - \mathbf{o}\|.\]

Training maximizes \(f\) on observed triples and minimizes it on corrupted negatives. Once trained, the model scores every (subject, relation, object) combination, not only those in the graph — turning a sparse, discrete graph into a dense, continuous predictor over the same schema.

This is what enables link prediction: given \((s, r, ?)\), rank candidate objects by score. Embeddings recover transitive and compositional patterns by geometry alone — if locatedIn is approximately additive, then Boston + locatedIn + locatedIn ≈ UnitedStates falls out without an explicit rule. Embeddings are also where knowledge graphs meet neurosymbolic AI: a model that scores plausibility on the same schema a symbolic reasoner uses can supply candidate facts for the reasoner to check.
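The scoring function and the ranking step fit in a few lines of numpy. The embeddings below are random draws for illustration only; a trained model would learn them from positive triples and corrupted negatives:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Illustrative, untrained embeddings; real values come from training.
entities = {e: rng.normal(size=dim)
            for e in ["Boston", "Massachusetts", "UnitedStates"]}
relations = {"locatedIn": rng.normal(size=dim)}

def transe_score(s, r, o):
    """TransE plausibility: -||s + r - o||; higher (closer to 0) is better."""
    return -np.linalg.norm(entities[s] + relations[r] - entities[o])

def predict_object(s, r):
    """Link prediction for (s, r, ?): rank every entity by score."""
    return sorted(entities, key=lambda o: transe_score(s, r, o), reverse=True)
```

With random vectors the ranking is meaningless; the point is the shape of the computation: once trained, `predict_object` scores every candidate, including pairs never seen in the graph.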

Sources of Triples

Real knowledge graphs are constructed, not designed. Common sources:

  • Curated structured data. Wikidata is edited by humans and bots; the schema is enforced socially. Its triples are the closest a web-scale knowledge graph gets to “ground truth.”
  • Structured extracts from semi-structured sources. Wikipedia infoboxes (DBpedia, YAGO) and Wikidata identifiers anchor most large public graphs.
  • Information extraction from text. A pipeline runs entity linking, relation classification, and consistency filtering over a corpus; see information extraction. Quality is uneven, and the open-world assumption protects against treating missing extractions as negations.
  • Programmatic ingestion. Internal enterprise knowledge graphs ingest from ERP systems, document repositories, and product catalogs through schema-mapping pipelines.

Most modern systems are hybrid: a curated core (Wikidata-style) joined to a long tail of extracted triples, with provenance attached to every fact so a downstream reasoner can weigh its confidence.

Strengths and Limitations

Knowledge graphs hit a useful sweet spot:

  • Strengths. Schema is light; ingestion scales; queries are graph operations that mature engines (Neo4j, RDF stores, custom property-graph engines) execute fast; alignment with web identifiers (URIs) makes cross-source integration possible.
  • Limitations. Triples cannot natively represent quantified statements (“every mammal nurses its young”); ternary or higher-arity facts have to be reified through helper nodes; temporal validity (“Boston had mayor \(X\) from 2014 to 2022”) requires either reified statements or named graphs.
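Reification in practice: the time-scoped mayoral fact becomes a helper node with one ordinary triple per argument. A sketch with hypothetical identifiers ("stmt1" is the helper node; "MayorX" stands in for the unnamed mayor \(X\) from the example):

```python
# Reifying "Boston had mayor X from 2014 to 2022" as a statement node.
reified = {
    ("stmt1", "subject",   "Boston"),
    ("stmt1", "predicate", "mayor"),
    ("stmt1", "object",    "MayorX"),   # placeholder, as in the text
    ("stmt1", "validFrom", 2014),
    ("stmt1", "validTo",   2022),
}

def statement_value(graph, stmt, role):
    """Read one argument of a reified statement back out."""
    return next(o for (s, p, o) in graph if s == stmt and p == role)

statement_value(reified, "stmt1", "validFrom")  # → 2014
```

The cost is visible: one fact became five triples, and every query over mayoral terms must now join through the helper node. Named graphs and RDF-star are the standard mitigations.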

The triple representation is a deliberate retreat from full first-order logic, in exchange for cheap operations at scale. When the additional expressiveness is needed, the standard move is to keep the knowledge graph and add an ontology on top — leaving the bulk of the facts as triples and reserving the heavier formalism for the rules that govern them.

References

Bordes, Antoine, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana Yakhnenko. 2013. “Translating Embeddings for Modeling Multi-Relational Data.” Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper_files/paper/2013/hash/1cecc7a77928ca8133fa24680a88d2f9-Abstract.html.
Hogan, Aidan, Eva Blomqvist, Michael Cochez, et al. 2021. “Knowledge Graphs.” ACM Computing Surveys 54 (4): 1–37. https://doi.org/10.1145/3447772.