Key Principle
Data system design is a sequence of constrained choices, not an open canvas. Each decision (storage engine, replication topology, isolation level, integration strategy) narrows the option space for subsequent decisions. This playbook provides action sequences and checklists for the most common decision points.
Why This Matters
Engineers often select technologies based on familiarity or trend rather than workload fit. Kleppmann demonstrates that choices compound: a storage engine choice constrains replication options, which constrain consistency guarantees, which constrain how derived data systems can be integrated. Working through decisions systematically -- with awareness of downstream consequences -- prevents architectural rework that surfaces months later. "You are now not only an application developer, but also a data system designer" (Ch. 1).
Good Examples
Storage engine selection by workload (Ch. 3): A team building a time-series ingestion pipeline chose a B-tree-based database because it was familiar. A B-tree writes each change at least twice (once to the write-ahead log, once to the tree page) and rewrites a whole page even when only a few bytes change, so write amplification saturated disk throughput at moderate load. Switching to an LSM-tree engine, which turns random writes into sequential appends followed by background compaction, resolved the bottleneck. The decision tree: Is the workload write-heavy or read-heavy? Write-heavy points to LSM-trees; read-heavy to B-trees. Mixed workloads require benchmarking write amplification against read latency.
Replication topology for multi-datacenter (Ch. 5): Single-leader replication across datacenters adds a cross-datacenter round-trip to every write. Multi-leader replication allows local writes but introduces conflict resolution complexity. The checklist: Can you tolerate conflict resolution? If yes, multi-leader reduces write latency. If no (e.g., financial transactions), single-leader with leader pinned to one datacenter is safer.
Stream fault tolerance strategy (Ch. 11): A team using Spark Streaming for event processing chose 1-second microbatches, which implicitly gave them tumbling windows equal to the batch size, windowed by processing time rather than event time. Switching to Flink's checkpointing decoupled fault tolerance from windowing semantics, enabling event-time-based session windows. The decision: Does your windowing strategy match microbatch boundaries? If not, prefer checkpointing, which does not force a particular window size.
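To make the difference between the two window types concrete, here is a minimal sketch in Python. It is not how Spark or Flink implement windowing; the function names and the `gap` parameter are illustrative assumptions, and the session function sorts its input up front, whereas a real stream processor has to cope with out-of-order events as they arrive.

```python
from typing import Iterable, List, Tuple


def tumbling_window_start(event_time: float, size: float) -> float:
    """Return the start of the fixed-size window that contains event_time."""
    return event_time - (event_time % size)


def session_windows(event_times: Iterable[float], gap: float) -> List[Tuple[float, float]]:
    """Group event timestamps into sessions.

    A session ends when more than `gap` seconds pass without an event.
    Returns (first_event_time, last_event_time) for each session.
    """
    windows: List[Tuple[float, float]] = []
    for t in sorted(event_times):
        if windows and t - windows[-1][1] <= gap:
            # Within the inactivity gap: extend the current session.
            windows[-1] = (windows[-1][0], t)
        else:
            # Gap exceeded (or first event): start a new session.
            windows.append((t, t))
    return windows
```

Session boundaries depend on the data, not on a clock, which is why they cannot be expressed as batches of a fixed length.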
Counterpoints
Choosing serializable isolation by default wastes resources. Most applications can tolerate read committed or snapshot isolation for the majority of operations. Serializable isolation (actual serial execution, 2PL, or SSI) is only necessary when write skew or phantoms across multiple objects must be prevented. Identify the specific invariants that need protection rather than applying the strongest level everywhere (Ch. 7).
Treating schema evolution as an afterthought breaks rolling upgrades. If your encoding format cannot handle backward and forward compatibility (e.g., language-specific serialization such as Java's java.io.Serializable), every schema change requires a coordinated deployment of all readers and writers, which defeats the purpose of rolling upgrades. Choose schema-based binary formats (Protobuf, Avro, Thrift) from the start (Ch. 4).
Assuming log-based brokers replace traditional brokers entirely ignores workload fit. Log-based brokers (Kafka) excel at high-throughput workloads where each message is fast to process and ordering matters. Traditional brokers (AMQP/JMS) are better when messages are expensive to process, ordering matters less, and you want message-by-message parallelism with per-message acknowledgment. The workload determines the broker, not the trend (Ch. 11).
Key Quotes
"Readers never block writers, and writers never block readers." -- Martin Kleppmann, Chapter 7
"Change data capture makes one database the leader (the one from which the changes are captured), and turns the others into followers." -- Martin Kleppmann, Chapter 11
Rules of Thumb
Choosing a Storage Engine
- Characterize workload: read-heavy, write-heavy, or mixed
- For write-heavy: evaluate LSM-tree engines (RocksDB, LevelDB, Cassandra); accept higher read latency and compaction overhead
- For read-heavy: evaluate B-tree engines (PostgreSQL, MySQL/InnoDB); accept write amplification
- For analytics: use column-oriented storage with compression; do not run OLAP on an OLTP engine (Ch. 3)
- Verify: measure write amplification and compaction impact under realistic load
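The checklist above can be sketched as a small helper. This is only an illustration: the `Workload` fields, the function names, and especially the `skew` threshold of 3.0 are assumptions made for the example, not figures from the book. The write amplification function uses the usual definition (bytes written to disk divided by bytes written by the application).

```python
from dataclasses import dataclass


@dataclass
class Workload:
    writes_per_sec: float
    reads_per_sec: float
    analytical: bool = False  # True for OLAP-style scans over many rows


def suggest_engine_family(w: Workload, skew: float = 3.0) -> str:
    """Map a workload profile to an engine family worth evaluating first.

    `skew` is an arbitrary illustrative ratio for calling a workload
    write-heavy or read-heavy; tune it or replace it with benchmarks.
    """
    if w.analytical:
        return "column-oriented"
    if w.writes_per_sec >= skew * w.reads_per_sec:
        return "lsm-tree"
    if w.reads_per_sec >= skew * w.writes_per_sec:
        return "b-tree"
    return "benchmark-both"


def write_amplification(bytes_written_to_disk: float, bytes_written_by_app: float) -> float:
    """Ratio of physical disk writes to logical application writes."""
    if bytes_written_by_app <= 0:
        raise ValueError("bytes_written_by_app must be positive")
    return bytes_written_to_disk / bytes_written_by_app
```

The output is a starting point for evaluation, not a verdict; the final "Verify" step still has to be run against a realistic load.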
Choosing a Replication Strategy
- Single datacenter with strong consistency needs: single-leader with synchronous follower (Ch. 5)
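- Multi-datacenter or offline-capable: multi-leader, but implement conflict resolution (LWW as a minimum, accepting that it silently discards concurrent writes; CRDTs when no write may be lost) (Ch. 5)
- High availability with eventual consistency acceptable: leaderless with quorum reads and writes; sloppy quorums raise write availability further but weaken the guarantee that a read sees the latest write (Ch. 5)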
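- Cross-check: does your application need read-after-write consistency? If so, route reads of data the user may have modified to the leader, or track the time of each user's last write and read from the leader for a short period afterwards (Ch. 5)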
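The read-after-write cross-check can be sketched as a router that remembers when each user last wrote. The class and method names are hypothetical, the in-memory dictionary stands in for whatever session or metadata store an application would really use, and the 60-second default is an assumed value that should exceed the worst replication lag you expect.

```python
import time
from typing import Dict, Optional


class ReadRouter:
    """Send a user's reads to the leader for a while after that user writes."""

    def __init__(self, leader_window_s: float = 60.0) -> None:
        self.leader_window_s = leader_window_s
        self._last_write: Dict[str, float] = {}

    def record_write(self, user_id: str, now: Optional[float] = None) -> None:
        # `now` is injectable so the behaviour can be tested deterministically.
        self._last_write[user_id] = time.monotonic() if now is None else now

    def choose_replica(self, user_id: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        last = self._last_write.get(user_id)
        if last is not None and now - last < self.leader_window_s:
            return "leader"    # a follower may not have the user's write yet
        return "follower"      # no recent write by this user: any replica will do
```

Other users' reads still go to followers, so the leader only absorbs the extra read load from users who have just written.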
Choosing an Isolation Level
- Default (read committed): sufficient for most reads; prevents dirty reads/writes (Ch. 7)
- Long-running reads (backups, analytics): use snapshot isolation to avoid read skew (Ch. 7)
- Concurrent read-modify-write: use atomic operations or explicit locking; verify whether your database detects lost updates automatically (MySQL/InnoDB's repeatable read does not) (Ch. 7)
- Cross-row invariants: require serializable isolation -- choose between actual serial execution (single-threaded, partition-scoped), 2PL (pessimistic, blocks), or SSI (optimistic, aborts on stale premises) (Ch. 7)
- Verify: test with concurrent load; do not trust vendor isolation level names (Ch. 7)
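For the read-modify-write item, the two application-level defences can be sketched with Python's built-in sqlite3 module. The table and function names are made up for the example, and SQLite is used only because it runs anywhere. Whether compare-and-set is safe on a given database depends on whether its WHERE clause can read from an old snapshot, so check before relying on it.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER NOT NULL)")
    conn.execute("INSERT INTO counters VALUES ('likes', 0)")
    conn.commit()
    return conn


def read_value(conn: sqlite3.Connection, key: str) -> int:
    return conn.execute("SELECT value FROM counters WHERE key = ?", (key,)).fetchone()[0]


def atomic_increment(conn: sqlite3.Connection, key: str) -> None:
    # The database performs the read-modify-write itself, in one statement.
    conn.execute("UPDATE counters SET value = value + 1 WHERE key = ?", (key,))
    conn.commit()


def compare_and_set(conn: sqlite3.Connection, key: str, expected: int, new: int) -> bool:
    # The update only applies if the value is still what the caller read earlier.
    cur = conn.execute(
        "UPDATE counters SET value = ? WHERE key = ? AND value = ?",
        (new, key, expected),
    )
    conn.commit()
    return cur.rowcount == 1  # False means another writer got there first: re-read and retry
```

If two clients both read 0 and both try to write 1, the second `compare_and_set` returns False instead of silently overwriting the first update.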
Choosing an Integration Architecture
- Identify the single source of truth for each data entity (Ch. 11)
- Derive all secondary representations (search indexes, caches, warehouses) via CDC from that source (Ch. 11)
- Use log compaction so that a new consumer can rebuild its state by reading the compacted log from the beginning, instead of coordinating a separate periodic snapshot (Ch. 11)
- For stream joins: maintain local state copies via CDC rather than querying remote databases per event (Ch. 11)
- Apply end-to-end checks: use unique request IDs or idempotency keys at application boundaries to catch duplicates that infrastructure cannot prevent (Ch. 12)
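The end-to-end check in the last item can be sketched as a handler that remembers which request IDs it has already applied. `IdempotentHandler` and its in-memory dictionary are hypothetical stand-ins. In a real system the request ID and the effect would need to be committed in the same transaction (for example via a uniqueness constraint on the request ID) so that a crash cannot separate them.

```python
from typing import Any, Callable, Dict


class IdempotentHandler:
    """Apply each request at most once, keyed by a client-generated request ID."""

    def __init__(self, apply: Callable[[Any], Any]) -> None:
        self._apply = apply
        self._results: Dict[str, Any] = {}  # request_id -> result of the first attempt

    def handle(self, request_id: str, payload: Any) -> Any:
        if request_id in self._results:
            # Retry or duplicate delivery: return the stored result, no second effect.
            return self._results[request_id]
        result = self._apply(payload)
        self._results[request_id] = result
        return result
```

The client must generate the ID once and reuse it on every retry; an ID minted per attempt would defeat the check.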
Encoding and Schema Evolution Checklist
- Never use language-specific serialization formats (Java Serializable, Ruby Marshal, etc.) for persistence or inter-service communication (Ch. 4)
- Choose a schema-based binary format (Protobuf, Avro, Thrift) for internal services (Ch. 4)
- Every new field must be optional or have a default value (Ch. 4)
- Never reuse a deleted field's tag number in Protobuf or Thrift; old data may still carry that tag (Ch. 4)
- Test backward compatibility (new code reads old data) and forward compatibility (old code reads new data) before each deploy (Ch. 4)
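The compatibility rules in this checklist can be illustrated with a toy tag-based encoding. This is not Protobuf's or Thrift's wire format: a plain dictionary from tag number to value stands in for the encoded bytes, and the schemas and field names are invented for the example.

```python
from typing import Any, Dict, Tuple

# tag number -> (field name, default value)
Schema = Dict[int, Tuple[str, Any]]

SCHEMA_V1: Schema = {1: ("user_name", ""), 2: ("favorite_number", 0)}
# V2 adds a field under a new tag and gives it a default.
SCHEMA_V2: Schema = {1: ("user_name", ""), 2: ("favorite_number", 0), 3: ("interests", ())}


def encode(record: Dict[str, Any], schema: Schema) -> Dict[int, Any]:
    """Replace field names with tag numbers (stand-in for a binary encoder)."""
    tag_by_name = {name: tag for tag, (name, _default) in schema.items()}
    return {tag_by_name[name]: value for name, value in record.items()}


def decode(wire: Dict[int, Any], schema: Schema) -> Dict[str, Any]:
    """Decode with the reader's schema, tolerating missing and unknown tags."""
    # Backward compatibility: fields absent from old data get their defaults.
    out = {name: default for name, default in schema.values()}
    for tag, value in wire.items():
        if tag in schema:
            out[schema[tag][0]] = value
        # Forward compatibility: tags this reader does not know are skipped.
    return out
```

Reusing tag 3 for a different field later would make old records decode into the wrong field, which is the failure the tag-number rule guards against.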
Stream Processing Fault Tolerance
- If windowing aligns with fixed time intervals: microbatching (Spark Streaming) is simplest (Ch. 11)
- If windowing is event-time-based or session-based: use checkpointing (Flink) (Ch. 11)
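- For external side effects (DB writes, emails): make processing idempotent, for example by storing the Kafka offset alongside the output in the same transaction so that replayed messages can be skipped; use fencing tokens to reject writes from zombie nodes (Ch. 8, 11)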
- For stateful processors: choose between remote state stores (simple, slow) and local state with periodic snapshots (fast, complex recovery) (Ch. 11)
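Storing offsets alongside output can be sketched with sqlite3 standing in for the external database. The table names, the `(offset, key, amount)` message shape, and the function names are assumptions for the example; no Kafka client is involved, and the messages are plain tuples.

```python
import sqlite3
from typing import Iterable, Tuple


def make_sink() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE totals (key TEXT PRIMARY KEY, total INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE progress (partition_id INTEGER PRIMARY KEY, last_offset INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def apply_batch(conn: sqlite3.Connection, partition_id: int,
                messages: Iterable[Tuple[int, str, int]]) -> int:
    """Apply (offset, key, amount) messages; return how many were newly applied."""
    row = conn.execute(
        "SELECT last_offset FROM progress WHERE partition_id = ?", (partition_id,)
    ).fetchone()
    last = row[0] if row else -1
    applied = 0
    with conn:  # one transaction: output and offset commit together or not at all
        for offset, key, amount in messages:
            if offset <= last:
                continue  # already applied before a crash or replay: skip
            conn.execute("INSERT OR IGNORE INTO totals (key, total) VALUES (?, 0)", (key,))
            conn.execute("UPDATE totals SET total = total + ? WHERE key = ?", (amount, key))
            last = offset
            applied += 1
        conn.execute(
            "INSERT OR REPLACE INTO progress (partition_id, last_offset) VALUES (?, ?)",
            (partition_id, last),
        )
    return applied


def get_total(conn: sqlite3.Connection, key: str) -> int:
    return conn.execute("SELECT total FROM totals WHERE key = ?", (key,)).fetchone()[0]
```

Replaying a batch after a failure then has no additional effect, because every message at or below the stored offset is skipped. This only works for sinks that can commit the offset and the output atomically; it does not help with effects such as sent emails.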
Related References
- Rules of Thumb for Data-Intensive Applications - the heuristics underlying these decision sequences