Rules of Thumb for Data-Intensive Applications - Designing Data-Intensive Applications

Key Principle

Every data system design decision involves trade-offs that cannot be optimized away -- only shifted between layers. These collected heuristics, drawn from across all twelve chapters, provide decision shortcuts grounded in engineering constraints rather than fashion.

Why This Matters

Engineers repeatedly face the same categories of decision: how to measure, what storage engine to pick, how to replicate, when to use transactions, how to coordinate distributed nodes, and how to integrate heterogeneous systems. Having battle-tested rules of thumb prevents reinventing analysis from scratch and guards against vendor marketing that obscures real trade-offs. As Kleppmann warns, "Rather than blindly relying on tools, we need to develop a good understanding of the kinds of concurrency problems that exist, and how to prevent them" (Ch. 7).

Good Examples

Percentiles over averages (Ch. 1): Amazon observed that the customers with the slowest requests are often those with the most data in their accounts because they have made many purchases -- i.e., the most valuable customers. Tracking high percentiles such as p99 (Amazon specifies internal service targets at the 99.9th percentile) instead of the mean exposes tail behavior that averages conceal, and that changes which optimizations are worth making.
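A minimal sketch of the difference, using made-up response times and a nearest-rank percentile helper (the `percentile` function and the sample data are illustrative assumptions, not taken from the book):

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds: 97 fast requests and 3 very slow ones.
response_times_ms = [20] * 97 + [1500, 2000, 2500]

mean_ms = statistics.mean(response_times_ms)
p50_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)
p99_ms = percentile(response_times_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

With this data the mean lands well above what a typical request sees and well below what the slowest users see, while p99 points straight at the slow tail.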

CDC over dual writes (Ch. 11): Two concurrent clients writing values A and B to a database and a search index can arrive in different orders at each system, leaving them "permanently inconsistent with each other, even though no error occurred" (Ch. 11). CDC eliminates this by making one database the leader and all derived systems followers, reusing the single-leader replication pattern from Ch. 5.
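The race can be sketched with two in-memory stores standing in for the database and the search index. The interleaving below is one hypothetical delivery order, and `apply_in_order` is an illustrative helper rather than any real CDC API:

```python
def apply_in_order(writes):
    """Apply last-write-wins updates to a fresh key-value store, in the order given."""
    store = {}
    for key, value in writes:
        store[key] = value
    return store


write_a = ("x", "A")  # from client 1
write_b = ("x", "B")  # from client 2

# Dual writes: each client writes to both systems, and the network delivers
# the two writes to each system in a different order.
database = apply_in_order([write_a, write_b])
search_index = apply_in_order([write_b, write_a])
dual_write_consistent = database == search_index

# CDC: the database's change log fixes one order, and every derived
# system replays that same log.
change_log = [write_a, write_b]
cdc_database = apply_in_order(change_log)
cdc_search_index = apply_in_order(change_log)
cdc_consistent = cdc_database == cdc_search_index

print(dual_write_consistent, cdc_consistent)
```

Neither dual write failed, yet the two stores disagree; replaying a single log removes the disagreement because there is only one order to follow.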

Fencing tokens over timeouts (Ch. 8): A process holding a distributed lock may pause (GC, page fault) past the lock's timeout. When it resumes, it believes it still holds the lock. Fencing tokens -- monotonically increasing numbers checked by the storage layer -- let the system reject stale writes without relying on timing assumptions.
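A minimal sketch of the check, assuming a lock service that hands out increasing tokens and a storage layer that remembers the highest token it has accepted. `LockService`, `FencedStorage`, and `StaleTokenError` are hypothetical names for illustration, and lease expiry itself is not modeled:

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class LockService:
    """Hypothetical lock service: every grant gets a strictly larger token."""

    def __init__(self):
        self._last_token = 0

    def acquire(self):
        self._last_token += 1
        return self._last_token


class FencedStorage:
    """Hypothetical storage layer that remembers the highest token it has seen."""

    def __init__(self):
        self._highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self._highest_token:
            raise StaleTokenError(f"token {token} is older than {self._highest_token}")
        self._highest_token = token
        self.data[key] = value


locks = LockService()
storage = FencedStorage()

token_1 = locks.acquire()  # client 1 gets the lock, then stalls (e.g. a long GC pause)
token_2 = locks.acquire()  # the lease has expired, so client 2 is granted the lock
storage.write(token_2, "report", "written by client 2")

try:
    storage.write(token_1, "report", "written by client 1")  # client 1 wakes up
    stale_write_rejected = False
except StaleTokenError:
    stale_write_rejected = True

print(stale_write_rejected, storage.data)
```

The rejection depends only on comparing token numbers, so it holds no matter how long client 1 was paused.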

Counterpoints

"Use an ACID database for important data" is insufficient. Many relational databases default to weak isolation (read committed). PostgreSQL's "repeatable read" is actually snapshot isolation; Oracle's "serializable" is also snapshot isolation. Trusting the label instead of verifying behavior has caused financial losses and data corruption (Ch. 7).

"Quorums guarantee consistency" is misleading. Even with w + r > n, variable network delays allow a later read to return an older value than an earlier read. Making Dynamo-style quorums linearizable requires synchronous read repair and full consensus -- in practice, most deployments skip this for performance (Ch. 9).

"Automatic rebalancing is always better" ignores cascading failure. An overloaded node responding slowly can be misread as dead by automatic failure detection, triggering rebalancing that adds load to an already-stressed system. Human-in-the-loop rebalancing trades speed for safety (Ch. 6).

Key Quotes

"A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user." -- Martin Kleppmann, Chapter 1

"The term 'eventually' is deliberately vague: in general, there is no limit to how far a replica can fall behind." -- Martin Kleppmann, Chapter 5

"Make a system appear as if there were only one copy of the data, and all operations on it are atomic." -- Martin Kleppmann, Chapter 9

Rules of Thumb

Measurement

Measure with percentiles (p50, p95, p99), not averages -- tail latency reveals real user experience (Ch. 1)
Define load parameters specific to your system before choosing architecture -- Twitter's fan-out ratio mattered more than requests/second (Ch. 1)

Storage Choices

LSM-trees for write-heavy workloads; B-trees for read-heavy -- understand write amplification before choosing (Ch. 3)
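If you need binary compactness, go all the way to schema-based encoding (Protobuf/Avro); half-measures like MessagePack lose readability for little size gain -- in the book's example record, 66 bytes versus 81 for JSON, while Protocol Buffers and Avro need about 33 (Ch. 4)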
OLTP and OLAP workloads need different storage engines; do not force one to serve both (Ch. 3)
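To see where the savings of a schema-based encoding come from, the sketch below compares compact JSON with a hand-rolled, length-prefixed layout in which field names and types live in the code (the "schema") instead of in the bytes. This layout is an assumption for illustration only; it is not the Protocol Buffers or Avro wire format:

```python
import json
import struct

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}


def encode_str(text):
    raw = text.encode("utf-8")
    return struct.pack("B", len(raw)) + raw  # 1-byte length prefix (max 255 bytes)


def encode_with_schema(rec):
    """Field order and types are fixed by the schema, so no names are written."""
    out = encode_str(rec["userName"])
    out += struct.pack("<i", rec["favoriteNumber"])  # 32-bit little-endian integer
    out += struct.pack("B", len(rec["interests"]))   # 1-byte item count
    for item in rec["interests"]:
        out += encode_str(item)
    return out


def decode_with_schema(buf):
    pos = 0

    def read_str():
        nonlocal pos
        length = buf[pos]
        pos += 1
        text = buf[pos:pos + length].decode("utf-8")
        pos += length
        return text

    name = read_str()
    (number,) = struct.unpack_from("<i", buf, pos)
    pos += 4
    count = buf[pos]
    pos += 1
    interests = [read_str() for _ in range(count)]
    return {"userName": name, "favoriteNumber": number, "interests": interests}


json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")
schema_bytes = encode_with_schema(record)

print(len(json_bytes), len(schema_bytes))
```

Most of the difference is the field names, which a self-describing format has to repeat in every record and a schema-based one never writes.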

Replication

Default to asynchronous replication for availability; accept that replication lag has no upper bound (Ch. 5)
Read-after-write consistency is scoped per-user and is usually sufficient -- do not pay for linearizability unless you need leader election, hard uniqueness, or cross-channel coordination (Ch. 5, 9)
Prefer logical (row-based) replication logs over WAL shipping -- they decouple storage format from replication and enable CDC (Ch. 5, 11)

Transactions

Do not trust isolation level names -- verify actual database behavior against the anomalies it permits (Ch. 7)
Use atomic operations (UPDATE ... SET value = value + 1) over read-modify-write cycles whenever expressible (Ch. 7)
Every field added to a schema after initial deployment must be optional or have a default value; a new required field makes new code fail when it reads records written before the field existed (Ch. 4)
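The atomic-operation rule above can be sketched with Python's built-in sqlite3 module. The `counters` table is hypothetical, and the two "clients" are simulated on one connection by interleaving their steps by hand, so this shows the lost-update pattern rather than real concurrency:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER NOT NULL)")
conn.executemany("INSERT INTO counters VALUES (?, ?)", [("likes", 0), ("views", 0)])


def read(key):
    return conn.execute("SELECT value FROM counters WHERE key = ?", (key,)).fetchone()[0]


# Read-modify-write: both clients read the same value before either writes.
seen_by_client_1 = read("likes")
seen_by_client_2 = read("likes")
conn.execute("UPDATE counters SET value = ? WHERE key = 'likes'", (seen_by_client_1 + 1,))
conn.execute("UPDATE counters SET value = ? WHERE key = 'likes'", (seen_by_client_2 + 1,))
lost_update_result = read("likes")  # one of the two increments has been overwritten

# Atomic operation: the database computes the new value itself, so each
# increment builds on the latest stored value.
for _ in range(2):
    conn.execute("UPDATE counters SET value = value + 1 WHERE key = 'views'")
atomic_result = read("views")

print(lost_update_result, atomic_result)
```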

Distributed Coordination

Use fencing tokens with any distributed lock; timeouts alone cannot prevent split-brain (Ch. 8)
Linearizable compare-and-set, total order broadcast, and consensus are equivalent problems -- a solution to one can be converted into a solution to the others (Ch. 9)
Prefer human-in-the-loop for leader failover and partition rebalancing; fully automatic approaches risk cascading failure (Ch. 5, 6)

Data Integration

Prefer CDC with a single source of truth over dual writes to multiple systems (Ch. 11)
Build derived data as replaceable outputs from immutable inputs -- enables human fault tolerance and reprocessing (Ch. 10, 11)
Apply the end-to-end argument: verify correctness at application boundaries, not just infrastructure layers (Ch. 12)
Complexity migrates rather than disappears -- eliminating a coordination service moves agreement logic into each node (Ch. 5, 6, 9)
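The "replaceable outputs from immutable inputs" rule can be sketched as a derived view that is rebuilt by replaying an append-only log. The event log and both derivation functions are hypothetical:

```python
from collections import Counter

# Immutable input: a hypothetical append-only event log (a tuple, so it cannot be mutated).
EVENT_LOG = (
    ("page_view", "/home"),
    ("click", "/home"),
    ("page_view", "/pricing"),
    ("page_view", "/home"),
)


def build_view_counts_buggy(event_log):
    """First version of the derivation, with a bug: it counts clicks as views."""
    return dict(Counter(path for _kind, path in event_log))


def build_view_counts(event_log):
    """Fixed derivation: only page_view events are counted."""
    return dict(Counter(path for kind, path in event_log if kind == "page_view"))


buggy_view = build_view_counts_buggy(EVENT_LOG)  # the bad output reaches production
rebuilt_view = build_view_counts(EVENT_LOG)      # discard it and rebuild from the same log

print(buggy_view, rebuilt_view)
```

Because the buggy code only produced a derived output and never touched the log, recovery is a matter of fixing the function and rerunning it.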

Related References

Implementation Playbook for Data System Decisions - action sequences for applying these heuristics to concrete decisions