Distributed Systems Problems - Designing Data-Intensive Applications

Key Principle

Distributed systems suffer from partial failures that are nondeterministic -- some components fail while others continue, and you may not know whether an operation succeeded. Six distinct failure modes (request lost, request queued, remote crash, remote pause, response lost, response delayed) all produce the identical symptom: silence. Because asynchronous networks offer no upper bound on delivery time, the sender cannot disambiguate these cases. This ambiguity is the root cause of every hard problem in distributed systems: leader election, consensus, and exactly-once delivery all trace back to it (Ch. 8).

Why This Matters

Variable network delays and clock inaccuracies are not bugs to fix but consequences of an economic trade-off: packet switching dynamically shares bandwidth, achieving higher utilization at the cost of unbounded delays. "Variable delays in networks are not a law of nature, but simply the result of a cost/benefit trade-off" (Ch. 8). Understanding this reframes distributed systems design from fighting unreliability to choosing where to absorb uncertainty. The correct design heuristic is suspicion: assume every network call can fail silently, assume every node can pause arbitrarily, and design the recovery path before the happy path.

Good Examples

Fencing tokens: Each lock grant carries a monotonically increasing number; the storage layer rejects writes bearing a token lower than the highest it has seen. "It is not sufficient to rely on clients checking their lock status themselves" (Ch. 8) -- the resource must enforce the check because a paused client cannot know it is stale. ZooKeeper's zxid/cversion serve as practical fencing tokens. HBase suffered exactly this bug without them (Ch. 8).
Spanner's TrueTime: Returns [earliest, latest] confidence bounds rather than a point timestamp, then waits for the interval length before committing, guaranteeing non-overlapping intervals for globally consistent snapshot isolation. Google deploys GPS receivers or atomic clocks per datacenter to achieve ~7 ms synchronization (Ch. 8).
Cascading failure from premature death declaration: An overloaded-but-alive node is declared dead via timeout. Its work transfers to remaining nodes, overloading them, triggering further death declarations -- a positive feedback loop where the fault-detection mechanism causes total system failure (Ch. 8).

Counterpoints

Using time-of-day clocks for durations: Time-of-day clocks (CLOCK_REALTIME) can jump backward on NTP correction or leap seconds. Only monotonic clocks (CLOCK_MONOTONIC) are safe for measuring elapsed time. Using wall clocks for timeouts is a bug that NTP resets will eventually surface (Ch. 8).
Trusting LWW ordering with physical clocks: Clock drift is silent -- no error signal. A lagging-clock node's writes are silently dropped; sequential writes become indistinguishable from concurrent ones. Physical clocks conflate wall-clock time with causal order, which are independent quantities in a distributed system. Use logical clocks instead (Ch. 8).
Assuming check-then-act is safe: A node that checks a lease, then pauses (GC, VM live migration, disk I/O, SIGSTOP), can resume believing it still holds the lease after expiry. "A node must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function" (Ch. 8).

Key Quotes

"In distributed systems, suspicion, pessimism, and paranoia pay off." -- Martin Kleppmann, Chapter 8

"If we only know the time +/- 100 ms, the microsecond digits in the timestamp are essentially meaningless." -- Martin Kleppmann, Chapter 8

"Reliable behavior is achievable, even if the underlying system model provides very few guarantees." -- Martin Kleppmann, Chapter 8

Rules of Thumb

Use monotonic clocks for elapsed-time measurement; never use wall clocks for timeouts or ordering
Prefer logical clocks (Lamport timestamps, version vectors) over physical timestamps for causal ordering
Require fencing tokens on all lock-protected resources; the resource, not the client, must enforce validity
Use adaptive failure detectors (Phi Accrual, used by Akka/Cassandra) rather than fixed timeouts
Treat GC pauses as planned brief outages: drain in-flight requests before collection
Monitor clock synchronization actively; declare drifting nodes dead and remove them
Test fault-handling paths with deliberate fault injection (Chaos Monkey); "if the error handling of network faults is not defined and tested, arbitrarily bad things could happen" (Ch. 8)
A node cannot trust its own judgment; truth requires a majority quorum (>N/2)

Related References

Replication - clock unreliability makes LWW dangerous; failover is fundamentally a consensus problem
Transactions - distributed transactions face all these partial-failure modes; 2PC is distinct from 2PL
Partitioning - request routing requires agreement on partition ownership, which is a consensus problem