Library
Designing Data-Intensive Applications · 7 of 14
Designing Data-Intensive Applications
AI Software Development HIGH

Future Architecture - Unbundled Databases and End-to-End Correctness

unbundling write-path read-path end-to-end-argument timeliness integrity ethics dataflow

Key Principle

No single tool handles all access patterns, so the organization's entire dataflow becomes "one huge database" where stream processors are triggers, batch jobs maintain materialized views, and derived stores are specialized indexes (Ch. 12). The central integration problem -- keeping multiple systems consistent -- is solved by funneling all writes through a single ordered log rather than dual writes or distributed transactions. Every derived dataset sits on a spectrum between eager precomputation (write path) and lazy on-demand computation (read path).

Why This Matters

Distributed transactions (2PC) amplify failures globally: any participant failure aborts all. Log-based systems contain faults locally via buffering -- "a fault in one part of the system can be contained locally" (Ch. 12). This architectural choice determines whether your system degrades gracefully or collapses under partial failure.

The end-to-end argument (Saltzer, Reed, Clark 1984) applied to data correctness means that duplicate suppression, integrity checks, and idempotency can only be fully achieved at application endpoints, not by any single infrastructure layer. A bank transfer retried after a dropped connection can double-charge even with serializable transactions -- the fix requires a unique request ID passed end-to-end from client to database (Ch. 12).

Good Examples

  1. Timeliness vs. integrity unbundling. "Violations of timeliness are 'eventual consistency,' whereas violations of integrity are 'perpetual inconsistency'" (Ch. 12). ACID transactions bundle both; event-based systems decouple them. Credit card statements tolerate 24-hour lag but the balance must equal the sum of transactions. Most applications need integrity without timeliness, avoiding the coordination cost of linearizability (Ch. 12).

  2. Uniqueness via log-based messaging without 2PC. Partition a log by hash of the constrained value. A stream processor reads sequentially, tracks state locally, emits accept/reject. This is equivalent to implementing linearizable storage using total order broadcast (Ch. 9). Scales by adding independently processed partitions (Ch. 12).

  3. Gradual schema migration via derived views. Maintain old and new schemas as independently derived views, shifting users incrementally -- analogous to dual-gauge railway tracks. "Every stage of the process is easily reversible if something goes wrong: you always have a working system to go back to" (Ch. 12). Extends Ch. 4's schema evolution to full data-model migration with zero downtime.

Counterpoints

  1. Premature unbundling. "If there is a single technology that does everything you need, you're most likely best off simply using that product" (Ch. 12). Unbundling only pays off when no single system covers all requirements. The complexity migrates from inside a monolithic database to the integration layer.

  2. Trusting infrastructure blindly. ACID culture trained developers to neglect auditability. MySQL has exhibited uniqueness constraint bugs; PostgreSQL's serializable isolation has shown write skew anomalies. Even mature databases fail -- "just because an application uses a data system that provides comparatively strong safety properties does not mean the application is guaranteed to be free from data loss or corruption" (Ch. 12).

  3. Ignoring the causality gap in partitioned logs. When causally related events land in different partitions, downstream consumers may process them out of order. The unfriend-then-message example: notification service may send a message to an unfriended person because unfriend and message events live in different partitions. Partial solutions exist but none solve external side effects -- emails sent cannot be unsent (Ch. 12).

Key Quotes

"Violations of timeliness are 'eventual consistency,' whereas violations of integrity are 'perpetual inconsistency.'" -- Martin Kleppmann, Chapter 12

"The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system." -- Saltzer, Reed, Clark (1984), cited in Chapter 12

"We engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in." -- Martin Kleppmann, Chapter 12

Rules of Thumb

  • Separate timeliness from integrity; enforce integrity always, tolerate timeliness violations where the business allows
  • Apply synchronous coordination surgically -- only where recovery from violation is impossible
  • Pass unique request IDs end-to-end from client to database for idempotent operations
  • Total order broadcast scales only to single-node throughput; partition logs for higher throughput and accept ambiguous cross-partition ordering
  • Audit continuously: immutable event logs plus deterministic derivations enable reprocessing as a verification mechanism
  • Data collection decisions are ethical decisions -- data is a toxic asset, not merely a valuable one

Related References

  • Consistency and Consensus - the formal equivalence (TOB = consensus = linearizability) that underpins log-based correctness
  • Stream Processing - CDC, event sourcing, and log-based coordination as the implementation layer
  • Batch Processing - immutable inputs and human fault tolerance as foundational principles