Stream Processing - Designing Data-Intensive Applications

Key Principle

"The truth is the log. The database is a cache of a subset of the log." (Pat Helland, Ch. 11). Stream processing treats events as the primary data representation -- mutable state and append-only logs are dual: state is the integral of events over time, a change stream is the derivative of state. Log-based message brokers (Kafka, Kinesis) unify durable storage with low-latency notification, making consumption a non-destructive read-only scan rather than the destructive acknowledgment model of traditional brokers.

Why This Matters

Dual writes -- application code explicitly writing to multiple systems -- create race conditions that cause permanent inconsistency "even though no error occurred" (Ch. 11). Change data capture (CDC) solves this by reframing multi-system consistency as a replication problem: one database is the leader, all derived systems (search indexes, caches, warehouses) are followers consuming its change stream. This is structurally identical to single-leader replication (Ch. 5), preserving ordering and eliminating race conditions.

Event sourcing goes further by recording user intent ("student cancelled enrollment") rather than mechanical state changes ("row deleted"), preserving the freedom to derive new side effects later. The trade-off: CDC allows log compaction, but event sourcing does not because intent-level events are not self-contained state snapshots.

Good Examples

The three stream join types share one structure. Maintain state from one input, query it when a message arrives on the other. Stream-stream joins use bounded time windows (click attribution). Stream-table joins use an infinite window where newer records overwrite older (event enrichment via local CDC copy). Table-table joins maintain a materialized view updated by changes to either side -- the Twitter timeline fan-out from Ch. 1, now solved incrementally (Ch. 11).
Three-timestamp technique for untrusted device clocks. Log (1) event time per device clock, (2) send time per device clock, (3) receipt time per server clock. The offset (3)-(2) approximates clock skew and corrects (1). This separates event time from processing time, preventing artificial spikes from processing delays during backlog replay (Ch. 11).
Log compaction eliminates snapshot bootstrapping. New derived systems can scan from offset 0 to reconstruct a full database copy without periodic snapshot coordination. Disk space depends on current database contents, not total write history -- the same principle as Ch. 3's log-structured storage engines (Ch. 11).

Counterpoints

Conflating event time with processing time. Processing delays (queueing, restarts, backlog replay) cause divergence. Windowing by processing time creates artificial spikes. Straggler events force a choice: ignore late arrivals (tracking drop rate) or publish corrections. Neither is free (Ch. 11).
Treating stream processing as inherently approximate. "There is nothing inherently approximate about stream processing, and probabilistic algorithms are merely an optimization" (Ch. 11). Bloom filters and HyperLogLog are memory trade-offs, not fundamental limitations of the paradigm.
Assuming "exactly-once" means events are processed once. Microbatching and checkpointing guarantee exactly-once only within the framework. External side effects (DB writes, emails) escape this boundary. "Effectively-once" is more accurate -- the guarantee concerns visible effects, not execution count. Idempotence with metadata (storing Kafka offset with each DB write) extends the boundary outward (Ch. 11).

Key Quotes

"The truth is the log. The database is a cache of a subset of the log." -- Pat Helland, cited in Chapter 11

"The two systems are now permanently inconsistent with each other, even though no error occurred." -- Martin Kleppmann, Chapter 11, on dual writes

"There is nothing inherently approximate about stream processing, and probabilistic algorithms are merely an optimization." -- Martin Kleppmann, Chapter 11

Rules of Thumb

Every messaging system is defined by two questions: what happens when producers outpace consumers, and what happens on crash
Use log-based brokers when throughput is high, per-message processing is fast, and ordering matters; use traditional brokers for expensive per-message processing needing message-level parallelism
Consumer parallelism in log-based brokers is bounded by partition count -- this is a hard architectural constraint
Commands must be validated before becoming events; all integrity checks happen synchronously at the write boundary
A stream-table join is actually a stream-stream join with an infinite window where newer records overwrite older ones
Slowly Changing Dimension (unique version IDs) trades log compaction for join determinism

Related References

Consistency and Consensus - total order broadcast equivalence underlying log-based coordination
Batch Processing - batch as the bounded-input predecessor; shared immutability principle
Future Architecture - Unbundled Databases and End-to-End Correctness - write-path/read-path spectrum and end-to-end correctness built on streams