Key Principle
Linearizability, total order broadcast, and consensus are formally equivalent problems -- solving any one gives you the other two (Ch. 9). This equivalence means consensus is unavoidable whenever you need a system to behave as if there is a single copy of data with atomic operations. Practical algorithms (Raft, Paxos, Zab) implement total order broadcast by running repeated consensus rounds, producing a gapless ordered log that all nodes agree on. The gapless property is what separates total order broadcast from Lamport timestamps: a node can detect missing messages and wait, whereas Lamport timestamps have gaps and a node can never know if a lower-numbered message is still in flight.
Why This Matters
Many distributed system failures trace back to ad-hoc attempts to enforce ordering or uniqueness without recognizing that the problem requires consensus. Leader election, distributed locks, unique constraints, and atomic commit all belong to the same equivalence family. Recognizing membership in this family prevents reinventing consensus badly -- and "implementing consensus from scratch has a poor success record" (Ch. 9).
The cost of consensus is real but bounded: voting is synchronous replication, a strict majority is required (3 nodes tolerate 1 failure, 5 tolerate 2), and timeout-based failure detection can cause false leader elections under variable network delays. Raft has known edge cases where unreliable links cause leadership bouncing, preventing progress. These costs determine when to accept weaker guarantees instead. Causal consistency is the strongest model that remains available during partitions and avoids the network delay penalty that linearizability imposes on every operation -- not just during faults (Attiya and Welch, 1994).
Good Examples
ZooKeeper as outsourced consensus. Rather than running majority votes across thousands of application nodes, a small coordination cluster (3-5 nodes) provides linearizable atomic operations, total ordering via monotonic zxid, failure detection via ephemeral nodes, and change notifications (watches). HBase, Kafka, YARN, and OpenStack Nova use it for slow-changing metadata like partition leadership -- consensus for metadata, not for every data operation (Ch. 9).
Epoch numbering breaks the leader-election paradox. The apparent circularity -- "electing a leader requires consensus, but consensus requires a leader" -- is resolved by guaranteeing leader uniqueness within each epoch (term/ballot number). Two overlapping quorum votes (one for election, one for proposal approval) ensure no higher-epoch leader has emerged when a proposal succeeds. This is weaker than a permanent unique leader but sufficient (Ch. 9).
Total order broadcast as log abstraction. ZooKeeper's zxid is a concrete instance: a sequence number from a total order broadcast log used as a fencing token. This unifies replication logs, write-ahead logs, and transaction logs under one abstraction. The mechanism for linearizable writes: append a claim to the log; total order resolves conflicts deterministically -- first message wins, all nodes agree (Ch. 9).
Counterpoints
Confusing linearizability with serializability. Linearizability is a recency guarantee on individual registers; serializability is a transaction isolation property preventing write skew. SSI is serializable but explicitly not linearizable (reads from snapshots). Both together produce "strict serializability." Confusing them leads to deploying the wrong guarantee (Ch. 9).
Assuming Dynamo-style quorums provide linearizability. Even strict quorums (w+r>n) allow race conditions due to variable network delays. Making them linearizable requires synchronous read repair and reading the latest quorum state before writes -- and linearizable compare-and-set still needs full consensus. Cassandra loses linearizability under concurrent writes due to last-write-wins with physical clocks (Ch. 9).
Treating 2PC as a good consensus algorithm. 2PC requires unanimity with no recovery path. If the coordinator crashes after participants vote "yes," they hold locks indefinitely -- "the only way 2PC can complete is by waiting for the coordinator to recover" (Ch. 9). The emergency escape of heuristic decisions "probably breaks atomicity." 2PC converts any single-component failure into a system-wide stall, which is the inverse of fault tolerance. MySQL distributed transactions are reported to be over 10x slower than single-node transactions (Ch. 9).
Key Quotes
"Linearizability is slow -- and this is true all the time, not only during a network fault." -- Martin Kleppmann, Chapter 9
"A large-scale outage can stop the system from being able to process requests, but it cannot corrupt the consensus system by causing it to make invalid decisions." -- Martin Kleppmann, Chapter 9
"In many cases, systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently." -- Martin Kleppmann, Chapter 9
"The problem here is that the total order of operations only emerges after you have collected all of the operations." -- Martin Kleppmann, Chapter 9
Rules of Thumb
- If you need a unique constraint, a lock, or leader election -- you need consensus; do not reinvent it
- Causal consistency is the strongest model available during partitions; substitute it for linearizability when possible
- Outsource consensus to ZooKeeper/etcd rather than embedding it in application logic
- 2PC amplifies failures; prefer log-based coordination (Ch. 11-12) for cross-system consistency
- Safety properties (agreement, integrity) hold even under total failure; liveness (termination) requires a functioning majority
- Full linearizable reads require one of: sequencing reads through the log, querying latest log position and waiting (ZooKeeper sync()), or reading from a synchronously updated replica
Related References
- Stream Processing - log-based alternatives to 2PC for cross-system coordination
- Future Architecture - Unbundled Databases and End-to-End Correctness - end-to-end correctness without distributed transactions
- Batch Processing - immutable inputs as a correctness strategy complementing consensus