Designing Data-Intensive Applications

Martin Kleppmann, 2017 (14 references)

Kleppmann's principles for designing reliable, scalable, and maintainable data systems — storage engines, replication, transactions, distributed coordination, batch/stream processing, and data integration architectures.

distributed-systems databases replication transactions stream-processing data-architecture systems-design

Overview

The Core Framework

  • Reliability: Systems must work correctly under adversity — hardware faults, software bugs, human error. Design for fault tolerance, not fault prevention.
  • Scalability: Describe load with parameters (requests/sec, fan-out ratio), measure performance with percentiles (p50, p99, p999), then choose scaling strategies.
  • Maintainability: Operability (easy to run), simplicity (manageable complexity via good abstractions), evolvability (easy to change).
  • Trade-offs are inescapable: Every choice involves tension — read vs. write speed, consistency vs. availability, correctness vs. performance.
  • Principles over tools: The same patterns (append-only logs, sorted merge, quorum voting) recur across storage, replication, batch, and stream processing.
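To make the Scalability point about percentiles concrete, here is a minimal sketch in Python. The response times are made-up illustrative values, and the `percentile` helper is a hypothetical nearest-rank implementation rather than anything prescribed by the book; it only shows how a mean can hide the tail that p99 exposes.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing to keep the arithmetic exact for integer p.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: 98 fast requests, 2 slow ones.
response_times_ms = [20] * 98 + [1500, 3000]

mean_ms = statistics.mean(response_times_ms)
p50_ms = percentile(response_times_ms, 50)
p99_ms = percentile(response_times_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

In this made-up sample the typical request takes 20 ms and the mean is about 65 ms, which describes no real request, while p99 shows that one request in a hundred waits 1500 ms or more.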

Quick Lookup

| Situation | Do This | Avoid This |
|---|---|---|
| Choosing a storage engine | Match to workload: LSM-trees for write-heavy, B-trees for read-heavy | Assuming one engine fits all workloads |
| Schema design | Use document model for self-contained documents; relational for highly interconnected data | Calling document DBs "schemaless" (it's schema-on-read) |
| Replication strategy | Start with single-leader; use multi-leader only for multi-datacenter or offline clients | Multi-leader for single-datacenter (adds conflict complexity for little gain) |
| Transaction isolation | Know exactly what your DB provides; don't trust "ACID" labels | Assuming "serializable" means the same across databases |
| Distributed locks | Always use fencing tokens; never rely on timeouts alone | Trusting a lock lease without fencing (process pauses can outlast leases) |
| Data integration | Use CDC with a single source of truth; derive all other views | Dual writes to multiple systems (race conditions, partial failures) |
| Performance measurement | Use percentiles (p99, p999), not averages | Arithmetic mean of response times (hides tail latency) |
| Handling failures | Design immutable inputs + replaceable outputs for recoverability | Mutable state that can't be rebuilt from source |
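The distributed-locks advice can be sketched as follows. This is a simplified, single-process illustration: `LockService`, `FencedStorage`, and `StaleTokenError` are hypothetical names invented for the example, and a real system would run them as separate networked services. The idea shown is that the lock service issues a strictly increasing token with every grant, and the storage side rejects any write carrying a token older than one it has already seen.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a fencing token older than one already seen."""


class LockService:
    """Hypothetical lock service: every grant returns a strictly increasing token."""

    def __init__(self):
        self._next_token = 0

    def acquire(self):
        self._next_token += 1
        return self._next_token


class FencedStorage:
    """Hypothetical storage service that checks the fencing token on every write."""

    def __init__(self):
        self._highest_token = 0
        self.data = {}

    def write(self, key, value, token):
        if token < self._highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self._highest_token}"
            )
        self._highest_token = token
        self.data[key] = value


locks = LockService()
storage = FencedStorage()

token_a = locks.acquire()  # client A gets token 1, then stalls (e.g. a long GC pause)
token_b = locks.acquire()  # A's lease expires; client B is granted token 2
storage.write("file", "from B", token_b)

# Client A wakes up, still believing it holds the lock, and tries to write.
try:
    storage.write("file", "from A", token_a)
    stale_write_rejected = False
except StaleTokenError:
    stale_write_rejected = True

print(stale_write_rejected, storage.data["file"])
```

The check lives in the storage service rather than in the client because a paused client cannot know that its lease has expired; only the resource being protected can reliably refuse the late write.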

The Key Insight

"Technology is a powerful force in our society... But they can also be used for good: to make underrepresented people's voices heard, to create opportunities for everyone, and to avert disasters." — Martin Kleppmann, Dedication

References