Designing Data-Intensive Applications
Martin Kleppmann, 2017
Kleppmann's principles for designing reliable, scalable, and maintainable data systems — storage engines, replication, transactions, distributed coordination, batch/stream processing, and data integration architectures.
distributed-systems databases replication transactions stream-processing data-architecture systems-design
Overview
The Core Framework
- Reliability: Systems must work correctly under adversity — hardware faults, software bugs, human error. Design for fault tolerance, not fault prevention.
- Scalability: Describe load with parameters (requests/sec, fan-out ratio), measure performance with percentiles (p50, p99, p999), then choose scaling strategies.
- Maintainability: Operability (easy to run), simplicity (manageable complexity via good abstractions), evolvability (easy to change).
- Trade-offs are inescapable: Every choice involves tension — read vs. write speed, consistency vs. availability, correctness vs. performance.
- Principles over tools: The same patterns (append-only logs, sorted merge, quorum voting) recur across storage, replication, batch, and stream processing.
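The scalability point about percentiles can be made concrete with a small sketch. It uses the nearest-rank definition of a percentile, which is one of several common definitions. The `percentile` helper and the sample latencies are illustrative assumptions, not taken from the book.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: nine fast requests, one very slow.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 950]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

With this data the mean is about 107 ms, a figure no request actually experienced. The median shows the typical request (14 ms) and p99 exposes the outlier (950 ms).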
Quick Lookup
| Situation | Do This | Avoid This |
|---|---|---|
| Choosing a storage engine | Match to workload: LSM-trees for write-heavy, B-trees for read-heavy | Assuming one engine fits all workloads |
| Schema design | Use document model for self-contained documents; relational for highly interconnected data | Calling document DBs "schemaless" — it's schema-on-read |
| Replication strategy | Start with single-leader; use multi-leader only for multi-datacenter or offline clients | Multi-leader for single-datacenter (adds conflict complexity for little gain) |
| Transaction isolation | Know exactly what your DB provides; don't trust "ACID" labels | Assuming "serializable" means the same across databases |
| Distributed locks | Always use fencing tokens; never rely on timeouts alone | Trusting a lock lease without fencing — process pauses can outlast leases |
| Data integration | Use CDC with a single source of truth; derive all other views | Dual writes to multiple systems (race conditions, partial failures) |
| Performance measurement | Use percentiles (p99, p999), not averages | Arithmetic mean of response times (hides tail latency) |
| Handling failures | Keep inputs immutable and treat outputs as replaceable, so any derived output can be discarded and rebuilt from source | Mutable state that can't be rebuilt from source |
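The distributed-locks row is easier to follow with a toy model of fencing tokens. This is a minimal in-process sketch, not a real lock service. `LockService`, `FencedStore`, and `StaleTokenError` are hypothetical names, and lease expiry is simulated simply by granting the lock again.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LockService:
    """Hypothetical lock service: every grant returns a strictly larger token."""

    def __init__(self):
        self._last_token = 0

    def acquire(self):
        # Lease expiry is not modeled; a new grant simply supersedes the old one.
        self._last_token += 1
        return self._last_token


class FencedStore:
    """Hypothetical storage service that checks the token on every write."""

    def __init__(self):
        self._highest_seen = 0
        self.data = {}

    def write(self, key, value, token):
        # Reject any write whose token is older than the newest one seen so far.
        if token < self._highest_seen:
            raise StaleTokenError(
                f"token {token} is older than {self._highest_seen}"
            )
        self._highest_seen = token
        self.data[key] = value


locks = LockService()
store = FencedStore()

token_a = locks.acquire()  # client A is granted the lock
# ... client A stalls (for example a long GC pause) and its lease expires ...
token_b = locks.acquire()  # client B is granted the lock with a larger token
store.write("report", "written by B", token_b)

try:
    # Client A wakes up and still believes it holds the lock.
    store.write("report", "written by A", token_a)
    rejected = False
except StaleTokenError:
    rejected = True

print(rejected, store.data["report"])
```

The check lives in the storage service, not in the client. A paused client cannot know its lease has expired, so the resource being protected has to enforce the ordering.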
The Key Insight
"Technology is a powerful force in our society... But they can also be used for good: to make underrepresented people's voices heard, to create opportunities for everyone, and to avert disasters." — Martin Kleppmann, Dedication