Designing Data-Intensive Applications

Data Models — Relational, Document, and Graph

data-models relational document graph schema-on-read declarative-queries

Key Principle

"Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem." (Chapter 2) Every model embeds assumptions that make some operations natural and others awkward. The choice is not about capability — one model can always emulate another — but about the proportional complexity cost of that emulation. The wrong model does not block you; it taxes you continuously.

Applications are built by stacking data models in layers: application objects modeling reality, general-purpose data models (JSON/relational/graph), storage engine byte representations, and hardware. Each layer hides complexity below via a clean interface. This layered abstraction is not just a technical convenience — it is an organizational enabler that lets different groups (database vendors, application developers) work independently. (Chapter 2)

Why This Matters

Applications that start with clean one-to-many tree structures tend to accumulate many-to-many relationships as features grow. Choosing a document model bets that data will stay tree-shaped, and that bet often loses. When it does, join logic migrates into application code, where it is harder to optimize: an instance of DDIA's recurring theme that complexity moves rather than disappears. The relational model's query optimizer is expensive to build, but its cost is amortized across all applications: "You only need to build a query optimizer once, and then all applications that use the database can benefit from it." (Chapter 2) This separation of logical queries from physical access paths was the relational model's decisive advantage over the network model (CODASYL).
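
The migration of join logic can be illustrated with a minimal sketch. The people/organizations data, the table names, and both function names below are hypothetical and chosen only for illustration; SQLite stands in for any relational database. The first function states the join declaratively and leaves the access path to the database, while the second re-implements the same join by hand over tree-shaped documents.

```python
import sqlite3

# Hypothetical normalized schema with a many-to-many link table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orgs (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE positions (person_id INTEGER, org_id INTEGER, title TEXT);
    INSERT INTO people VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orgs VALUES (10, 'Acme'), (11, 'Initech');
    INSERT INTO positions VALUES
        (1, 10, 'Engineer'), (2, 10, 'Director'), (2, 11, 'Advisor');
""")


def colleagues_via_sql(org_name):
    """Declarative join: the database decides how to execute it."""
    rows = conn.execute(
        """SELECT p.name
           FROM people p
           JOIN positions pos ON pos.person_id = p.id
           JOIN orgs o ON o.id = pos.org_id
           WHERE o.name = ?
           ORDER BY p.name""",
        (org_name,),
    )
    return [name for (name,) in rows]


# The same data as tree-shaped documents that reference organizations by ID.
person_docs = [
    {"id": 1, "name": "Ada",
     "positions": [{"org_id": 10, "title": "Engineer"}]},
    {"id": 2, "name": "Grace",
     "positions": [{"org_id": 10, "title": "Director"},
                   {"org_id": 11, "title": "Advisor"}]},
]
org_docs = {10: {"name": "Acme"}, 11: {"name": "Initech"}}


def colleagues_via_app_join(org_name):
    """Application-side join: the lookup and the scan are hand-written."""
    org_ids = {oid for oid, org in org_docs.items() if org["name"] == org_name}
    return sorted(
        doc["name"]
        for doc in person_docs
        if any(pos["org_id"] in org_ids for pos in doc["positions"])
    )
```

Both functions return the same names for a given organization, but only the SQL version can benefit from a new index or a better query plan without a code change; the hand-written version always scans every person document.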

Good Examples

  1. Schema-on-Read vs. Schema-on-Write: Document databases shift schema enforcement from write time to read time, analogous to dynamic vs. static type checking. "Document databases are sometimes called schemaless, but that's misleading, as the code that reads the data usually assumes some kind of structure." (Chapter 2) Schema-on-read suits heterogeneous data (many object types, externally determined structures). Schema-on-write provides enforcement guarantees for uniform data, at the price of explicit migrations: most relational databases run ALTER TABLE quickly, but MySQL is cited as an exception that copies the entire table, which can mean minutes or hours of downtime on a large table unless workaround tools are used. The schema always exists; the question is who bears the validation cost and when errors surface.

  2. Graph Models for Dense Relationships: When data has dense, heterogeneous many-to-many relationships, graph models fit best: any vertex can connect to any other, both incoming and outgoing edges can be traversed efficiently, and different edge labels let multiple relationship types coexist. Facebook, for example, maintains a single graph with many vertex and edge types: vertices for people, locations, events, checkins, and comments; edges for friendships, where a checkin happened, who attended which event, and so on. "Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application's data structures." (Chapter 2) Variable-length path traversal, the capability that most clearly separates graph queries from relational ones, takes 4 lines in Cypher vs. 29 lines of SQL with recursive common table expressions in the chapter's example.

  3. Declarative vs. Imperative Queries: SQL specifies what results you want; imperative code specifies how to compute them. Because a declarative query hides execution details such as ordering and access paths, the engine can reorganize storage, parallelize across cores, and introduce new indexes without touching application code. This matters increasingly because CPUs now get faster by adding cores rather than by raising clock speeds, and imperative code that spells out operations in a fixed order is very hard to parallelize automatically. "The fact that SQL is more limited in functionality gives the database much more room for automatic optimizations." (Chapter 2)
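
The schema-on-read vs. schema-on-write contrast from the first example can be sketched as follows. This is a minimal illustration, assuming a hypothetical format change in which a `first_name` field is split out of a full `name`; the function name `read_user`, the sample documents, and the `users` table are all made up for the sketch, and SQLite stands in for a relational database.

```python
import json
import sqlite3


# Schema-on-read: old and new document shapes coexist; the reader adapts.
def read_user(raw):
    user = json.loads(raw)
    # Documents written before the (hypothetical) format change only
    # carry "name", so the reading code derives the missing field.
    if "first_name" not in user and user.get("name"):
        user["first_name"] = user["name"].split(" ")[0]
    return user


old_doc = '{"name": "Ada Lovelace"}'
new_doc = '{"name": "Grace Hopper", "first_name": "Grace"}'

# Schema-on-write: the structure is declared up front and migrated explicitly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
# Backfill existing rows with the text before the first space; rows
# without a space are left as NULL rather than guessed at.
conn.execute(
    """UPDATE users
       SET first_name = substr(name, 1, instr(name, ' ') - 1)
       WHERE instr(name, ' ') > 0"""
)
```

In the first half the implicit schema lives in `read_user`, and a malformed document only surfaces as a problem when it is read. In the second half the database rejects a row without a `name` at write time, and the format change is a one-off migration instead of a permanent branch in the reading code.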

Counterpoints

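  1. Document locality has harsh trade-offs: A document is usually stored as one continuous string (JSON, XML, or a binary variant), which avoids multi-table index lookups when the entire entity is needed. But the database typically loads the whole document even when only a small part is read, and most updates rewrite the whole document (only changes that keep the encoded size the same are easy to apply in place), so documents should stay fairly small and size-increasing writes infrequent. Locality is also not unique to the document model: Spanner's interleaved tables, Oracle's multi-table index cluster tables, and Bigtable-style column families (as in Cassandra and HBase) group related data together in a similar way. (Chapter 2)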

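  2. MapReduce reinvents SQL in disguise: MapReduce sits between declarative and imperative. Its map and reduce functions must be pure so the framework can run them anywhere, in any order, and rerun them on failure, but it still leaves the engine less room for optimization than SQL does. MongoDB's trajectory illustrates this: after shipping MapReduce, it added a declarative aggregation pipeline (in version 2.2), in part because writing two carefully coordinated functions is often harder than writing a single query. "A NoSQL system may find itself accidentally reinventing SQL, albeit in disguise." The declarative-imperative axis and the distributed-local axis are independent: being declarative does not confine SQL to a single machine, and MapReduce has no monopoly on distributed execution. (Chapter 2)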

  3. Historical patterns repeat: IMS (1968, hierarchical model) had the same strengths and weaknesses as today's document databases. The network model (CODASYL) required cursor-based navigation. Graph databases succeed where CODASYL failed because of four structural differences: any-to-any connectivity, index-based lookup, unordered edges, and declarative query languages. Declarative languages are the recurring enabler. (Chapter 2)
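
The second counterpoint can be made concrete with a small sketch that computes the same aggregate twice: once with a pair of pure map and reduce functions driven by a toy single-process runner, and once as a declarative GROUP BY. The observation records, the function names, and the runner are assumptions made for illustration; this is not MongoDB's API, and a real MapReduce framework would distribute the two functions across machines.

```python
import sqlite3
from collections import defaultdict

# Illustrative animal sightings; the values are made up for this sketch.
observations = [
    {"date": "1995-12-25", "family": "Sharks", "num_animals": 3},
    {"date": "1995-12-12", "family": "Sharks", "num_animals": 4},
    {"date": "1996-01-05", "family": "Sharks", "num_animals": 2},
    {"date": "1996-01-07", "family": "Whales", "num_animals": 1},
]


def map_fn(doc):
    """Pure function: emits (month, count) pairs for one document."""
    if doc["family"] == "Sharks":
        yield doc["date"][:7], doc["num_animals"]


def reduce_fn(key, values):
    """Pure function: combines all values emitted for one key."""
    return sum(values)


def run_mapreduce(docs):
    """Toy runner: group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations (date TEXT, family TEXT, num_animals INTEGER)"
)
conn.executemany(
    "INSERT INTO observations VALUES (:date, :family, :num_animals)",
    observations,
)


def run_sql():
    """Declarative equivalent: filter, group, and sum in one statement."""
    rows = conn.execute(
        """SELECT substr(date, 1, 7) AS month, sum(num_animals)
           FROM observations
           WHERE family = 'Sharks'
           GROUP BY month"""
    )
    return dict(rows)
```

Both `run_mapreduce(observations)` and `run_sql()` produce a month-to-total mapping for shark sightings. The map/reduce version needs two functions that agree on the key format, whereas the SQL version states the grouping once and leaves the execution strategy to the engine.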

Key Quotes

"Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem." — Martin Kleppmann, Chapter 2

"If the same query can be written in 4 lines in one query language but requires 29 lines in another, that just shows that different data models are designed to satisfy different use cases." — Martin Kleppmann, Chapter 2

"The advantage of using an ID is that because it has no meaning to humans, it never needs to change." — Martin Kleppmann, Chapter 2

Rules of Thumb

  • Choose document model when data is predominantly one-to-many with rare cross-references
  • Choose relational when many-to-many relationships exist or are likely to emerge
  • Choose graph when anything can potentially relate to everything and relationship traversal depth is variable
  • Prefer declarative query interfaces — they decouple intent from execution, enabling optimizer improvements without query rewrites
  • Meaningless identifiers (opaque IDs) are stable identifiers: store each human-meaningful value once and reference it by ID, which is the key idea behind normalization
  • Data interconnection grows over time; plan for relationship density to increase
  • One model can emulate another, but emulation creates disproportionate and continuous pain
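
The rule about variable traversal depth can be sketched with a recursive common table expression, the relational tool for following a chain of edges whose length is not known in advance. The vertices/edges layout, the place names, and the function name `enclosing_regions` are assumptions for illustration, and the query is a simplified fragment: it only follows `within` edges upward from one place and is much shorter than the 29-line query mentioned above.

```python
import sqlite3

# A property graph emulated in two relational tables (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edges (tail INTEGER, head INTEGER, label TEXT);
    INSERT INTO vertices VALUES
        (1, 'Idaho'), (2, 'United States'), (3, 'North America'),
        (4, 'London'), (5, 'England'), (6, 'United Kingdom'), (7, 'Europe');
    INSERT INTO edges VALUES
        (1, 2, 'within'), (2, 3, 'within'),
        (4, 5, 'within'), (5, 6, 'within'), (6, 7, 'within');
""")


def enclosing_regions(place_name):
    """Follow 'within' edges for as many hops as the data requires."""
    rows = conn.execute(
        """WITH RECURSIVE enclosing(id) AS (
               -- Base case: the starting vertex.
               SELECT id FROM vertices WHERE name = ?
               UNION
               -- Recursive step: one more 'within' hop from any vertex
               -- found so far; UNION drops duplicates, so cycles terminate.
               SELECT e.head
               FROM edges e
               JOIN enclosing ON e.tail = enclosing.id
               WHERE e.label = 'within'
           )
           SELECT v.name
           FROM vertices v
           JOIN enclosing ON v.id = enclosing.id
           WHERE v.name <> ?
           ORDER BY v.name""",
        (place_name, place_name),
    )
    return [name for (name,) in rows]
```

Calling `enclosing_regions("London")` walks three hops while `enclosing_regions("Idaho")` walks two, with no change to the query. A graph query language expresses the same idea as a variable-length path pattern, which is why the emulation is possible in SQL but noticeably more verbose.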

Related References