Library
Designing Data-Intensive Applications · 9 of 14
Designing Data-Intensive Applications
AI Software Development HIGH

Partitioning

partitioning sharding hot-spots secondary-indexes rebalancing

Key Principle

Partitioning splits data so each record belongs to exactly one partition on a different node, enabling near-linear throughput scaling for single-partition queries. "Each partition is a small database of its own" (Ch. 6), which is why cross-partition operations are inherently costlier -- coordination, network hops, and partial failure handling. The fundamental tension is locality vs. distribution: random assignment distributes load perfectly but destroys lookup ability (must scatter-gather across all nodes), while structured assignment enables efficient queries but risks hot spots. Partitioning is an information-organization problem, not merely load-balancing.

Why This Matters

Replication handles fault tolerance and read scaling, but a single node still caps storage capacity and write throughput. Partitioning is where the book's "complexity migrates" theme becomes concrete -- you trade single-node simplicity for distributed throughput, and the complexity moves into partition design, routing, and rebalancing.

Skew is not a one-time design error but an ongoing operational concern. "In an extreme case, all the load could end up on one partition, so 9 out of 10 nodes are idle and your bottleneck is the single busy node" (Ch. 6). A partition scheme balanced at deployment can become skewed as usage evolves, making rebalancing a continuous operational task rather than a solved problem.

The two partitioning strategies -- key-range and hash -- represent a fundamental trade-off between query efficiency and load distribution. Key-range preserves sorted order for efficient range scans but is vulnerable to hot spots when sequential keys align with write patterns. Hash partitioning fixes distribution but "destroys sort order," making range queries hit all partitions (Ch. 6).

Good Examples

  • Cassandra's compound key hybrid: In a compound primary key (partition_key, clustering_columns...), only the first column is hashed for partition assignment; remaining columns form a sorted index within that partition. A key like (user_id, update_timestamp) distributes users across partitions while enabling efficient time-range queries within a single user -- recovering locality where it matters most (Ch. 6).

  • Application-level key splitting for hot keys: Append a random two-digit number to a hot key, spreading writes across ~100 partitions. The cost: reads must fan out to all split keys and recombine, and the application must track which keys are split. "Most data systems are not able to automatically compensate for such a highly skewed workload" (Ch. 6). This is complexity migration: the database pushes skew management to the application.

  • Three rebalancing strategies: Fixed partitions (Riak, Elasticsearch, Couchbase) fix partition count and vary size. Dynamic partitions (HBase, MongoDB) fix max partition size and split/merge. Node-proportional (Cassandra, 256 per node) ties partition count to cluster size. Each addresses the unknown-future-volume problem differently (Ch. 6).

  • ZooKeeper for request routing: Externalizes the consensus problem of agreeing on partition ownership. The authoritative mapping lives in ZooKeeper; subscribers get notifications on changes. This avoids reimplementing distributed consensus inside every database node (Ch. 6).

Counterpoints

  • Using hash(key) mod N: Seems elegant but fails the minimal-movement requirement. When N changes by one, nearly every key remaps. hash(123456) goes from node 6 with N=10 to node 3 with N=11 -- the mod function is hypersensitive to its divisor (Ch. 6).

  • Assuming secondary indexes "just work": Document-partitioned (local) indexes require scatter/gather across all partitions for reads, prone to tail latency amplification since the slowest partition determines overall latency. Term-partitioned (global) indexes avoid scatter/gather but require distributed transactions or accept staleness on writes (Ch. 6).

  • Fully automatic rebalancing: Can trigger cascading failure. An overloaded node is declared dead; its partitions transfer to remaining nodes, overloading them, triggering further death declarations. "Such automation can be dangerous in combination with automatic failure detection... making the situation worse and potentially causing a cascading failure" (Ch. 6). Kleppmann advocates human-in-the-loop: the system suggests reassignments, an operator approves.

  • Rolling your own secondary index in application code: Seems tempting but is dangerous -- race conditions and partial write failures silently desynchronize data and index. This is exactly the partial-failure problem that motivates transactions (Ch. 7).

Key Quotes

"Each partition is a small database of its own." -- Martin Kleppmann, Chapter 6

"In an extreme case, all the load could end up on one partition, so 9 out of 10 nodes are idle and your bottleneck is the single busy node." -- Martin Kleppmann, Chapter 6

"Such automation can be dangerous in combination with automatic failure detection... making the situation worse and potentially causing a cascading failure." -- Martin Kleppmann, Chapter 6

Rules of Thumb

  • Choose key-range partitioning when range queries dominate; hash partitioning when write distribution matters more
  • Use compound keys to recover locality within hash-distributed partitions
  • Prefer fixed partition counts when dataset size is predictable; dynamic splitting when highly variable
  • Favor document-partitioned (local) indexes for write-heavy workloads; term-partitioned (global) indexes for read-heavy secondary-index queries
  • Use a coordination service (ZooKeeper) for partition-to-node mapping rather than reimplementing consensus
  • Monitor for skew continuously -- balanced partitions at launch can drift as access patterns evolve
  • Dynamic partition splitting has a cold-start problem: pre-split if you know the key distribution
  • The hash function need not be cryptographic but must be deterministic across processes (Java's Object.hashCode() is not)

Related References

  • Replication - replication and partitioning are orthogonal and composable; a node can be leader for one partition and follower for another
  • Transactions - cross-partition writes motivate transactions; serial execution converts concurrency into a partitioning problem
  • Distributed Systems Problems - request routing is a consensus problem; partition ownership agreement requires distributed coordination