Batch Processing - Designing Data-Intensive Applications

Key Principle

Batch processing inherits the Unix philosophy -- immutable inputs, uniform interfaces, composable single-purpose operators -- and applies it at datacenter scale (Ch. 10). The composability rests on two interlocking decisions: a uniform interface (file descriptors / HDFS paths) and separation of logic from wiring (stdin/stdout / mapper-reducer contracts). The core insight is that treating inputs as immutable and avoiding side effects yields not just performance but human fault tolerance: the ability to recover from buggy code by re-running on unchanged inputs. Read-write databases lack this property -- rolling back code does not un-corrupt data.

Why This Matters

MapReduce democratized large-scale processing on commodity hardware, but its eager disk materialization between every stage creates compounding costs for complex workflows (50-100 chained jobs for recommendation systems). Downstream jobs block on upstream stragglers, subsequent mappers redundantly re-partition data just written by reducers, and temporary files get replicated across nodes.

Dataflow engines (Spark, Tez, Flink) model entire workflows as single DAGs of operators connected by three exchange modes: repartition-and-sort, repartition-only (hash joins), or broadcast. They sort only where needed, pipeline intermediate data through memory, and co-locate producer/consumer tasks. The result is often orders-of-magnitude speedup -- not from changing what is computed, but how intermediate state flows. The fault tolerance trade-off: without HDFS-materialized intermediates, lost data must be recomputed from upstream. Spark tracks lineage via RDDs; Flink checkpoints operator state. Both require deterministic operators for correct recovery.

Good Examples

Build-then-swap output pattern. Writing batch output directly to a production database is an anti-pattern for three compounding reasons: per-record network requests are orders of magnitude slower than batch throughput, parallel mappers can overwhelm the database and degrade live queries, and external writes break MapReduce's all-or-nothing guarantee. The correct pattern: construct immutable output files, bulk-copy to serving nodes, atomically switch. Rollback is instant -- just switch back. This mirrors LSM-tree compaction from Ch. 3 (Ch. 10).
Join strategy taxonomy driven by locality. All strategies exist because locality determines throughput. Reduce-side sort-merge join is the universal fallback (no assumptions). Map-side broadcast hash join avoids the shuffle when the small dataset fits in memory. Partitioned hash join reduces memory when both inputs are co-partitioned. Map-side merge join works even when inputs exceed memory if both are co-partitioned and sorted. Secondary sort ensures dimension records arrive before fact records, so only one dimension record occupies memory at a time (Ch. 10).
Skew handling via random distribution. Hot keys (celebrity accounts) create straggler reducers that bottleneck the entire pipeline. Pig samples to identify hot keys, scatters their records across random reducers, and replicates the other join input to all those reducers. Hive uses a hybrid: map-side join for hot keys, reduce-side for the rest. All three break the single-reducer bottleneck by trading replication cost for parallelism (Ch. 10).

Counterpoints

Querying a remote database from within a batch job. This introduces nondeterminism (data may change mid-job) and per-record network round-trips that destroy throughput. MapReduce always performs full table scans -- no indexes. Bring the data to the computation via HDFS locality, not the other way around (Ch. 10).
Assuming MapReduce fault tolerance is about hardware reliability. At Google, a 1-hour task has ~5% preemption risk in shared clusters -- an order of magnitude above hardware failure. For 100 tasks of 10 minutes each, >50% chance at least one is killed. Fault tolerance is an economic design for multi-tenant resource sharing, not a reliability hedge. Operational realities shape system architecture more than theoretical failure modes (Ch. 10).
Assuming distributed graph processing always beats single-machine. Graph partitioning is hard; vertices are often assigned to machines arbitrarily, causing heavy cross-machine communication. If the graph fits in one machine's memory, single-machine algorithms often outperform distributed Pregel because network overhead exceeds parallelism gains (Ch. 10).

Key Quotes

"By treating inputs as immutable and avoiding side effects (such as writing to external databases), batch jobs not only achieve good performance but also become much easier to maintain." -- Martin Kleppmann, Chapter 10

"This is why MapReduce is designed to tolerate frequent unexpected task termination: it's not because the hardware is particularly unreliable, it's because the freedom to arbitrarily terminate processes enables better resource utilization in a computing cluster." -- Martin Kleppmann, Chapter 10

"Simply making data available quickly -- even if it is in a quirky, difficult-to-use, raw format -- is often more valuable than trying to decide on the ideal data model up front." -- Martin Kleppmann, Chapter 10

Rules of Thumb

When distinct keys exceed RAM, use sort-based aggregation; it degrades gracefully via sequential disk I/O
Declarative APIs let the optimizer choose join algorithms -- prefer them over hand-coded MapReduce
The mapper is a message router: the key acts as the destination address for the value
Schema-on-read (collect raw, defer modeling) trades data quality at write time for collection speed
Stateless callbacks are a safety contract: the framework can retry safely because mappers have no external side effects
More assumptions about input properties (co-partitioning, sorting) enable cheaper join strategies

Related References

Stream Processing - extends batch concepts to unbounded data with lower latency
Future Architecture - Unbundled Databases and End-to-End Correctness - combining batch and stream for reliable, scalable applications
Consistency and Consensus - consensus provides the ordering guarantees batch assumes from a single leader