Library
Designing Data-Intensive Applications · 12 of 14
Designing Data-Intensive Applications
AI Software Development CRITICAL

Storage Engines — LSM-Trees, B-Trees, and Analytical Storage

storage-engines lsm-trees b-trees oltp olap column-storage write-amplification

Key Principle

"Well-chosen indexes speed up read queries, but every index slows down writes." (Chapter 3) This single trade-off explains why databases never auto-index everything. Storage engine design is a spectrum between two schools: log-structured engines (LSM-trees) that turn random writes into sequential writes for higher write throughput, and update-in-place engines (B-trees) that overwrite fixed-size pages for faster reads. OLTP and OLAP workloads have fundamentally different physical bottlenecks — disk seek time vs. disk bandwidth — demanding physically separate storage architectures.

Why This Matters

Storage engine choice is an architectural decision, not an implementation detail. Choosing incorrectly imposes performance costs that no amount of application-level tuning can recover. Developers — not just DBAs — must understand storage internals because modern applications compose multiple storage tools (database + cache + search index), and each tool's engine characteristics determine system-level behavior. The OLTP/OLAP split is an instance of workload isolation: conflicting access patterns are resolved by physical separation rather than clever sharing.

Good Examples

  1. The LSM-Tree Progression: Each layer solves a specific failure of the previous design. Append-only log (fast writes, O(n) reads) adds segmentation + compaction (bounds disk), adds hash index (O(1) reads, but keys must fit in RAM), evolves to SSTables (sorted segments enabling mergesort-style merging, sparse in-memory index, block compression), then adds memtable + WAL (in-memory tree accepting writes in any order, flushing sorted to disk) and Bloom filters (short-circuiting lookups for nonexistent keys). This mirrors actual history: log-structured filesystems to Bigtable to LevelDB/RocksDB. (Chapter 3)

  2. Column-Oriented Storage: Row-oriented engines waste disk bandwidth loading 100+ columns when a query needs only 4-5. Column storage inverts this by storing each column in a separate file with shared row ordering. Bitmap encoding turns queries into bitwise operations; run-length encoding compresses sorted columns from billions of rows to kilobytes. Compression is synergistic with CPU efficiency — compressed columns fit more rows into L1 cache, enabling SIMD vectorized processing. (Chapter 3)

  3. Multiple Sort Orders (Vertica): Since data must be replicated for fault tolerance anyway, store each replica in a different sort order. "You might as well store that redundant data sorted in different ways so that when you're processing a query, you can use the version that best fits the query pattern." Mandatory redundancy becomes a query optimization. (Chapter 3)

Counterpoints

  1. LSM-tree compaction is a hidden runaway failure: If compaction falls behind write throughput, segments accumulate unboundedly — disk fills and reads degrade. Most SSTable engines do not self-throttle incoming writes. "There is no quick and easy rule for determining which type of storage engine is better for your use case." Benchmarks are "often inconclusive and sensitive to details of the workload." (Chapter 3)

  2. In-memory speed comes from avoiding serialization, not avoiding disk reads: "The performance advantage of in-memory databases is not due to the fact that they don't need to read from disk... Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk." OS page caching already eliminates most disk I/O for hot data. (Chapter 3)

  3. Cassandra/HBase column families are NOT column-oriented: "Within each column family, they store all columns from a row together, along with a row key, and they do not use column compression. Thus, the Bigtable model is still mostly row-oriented." A common and consequential misconception. (Chapter 3)

Key Quotes

"Well-chosen indexes speed up read queries, but every index slows down writes." — Martin Kleppmann, Chapter 3

"One write to the database resulting in multiple writes to the disk over the course of the database's lifetime." — Martin Kleppmann, Chapter 3 (on write amplification)

"Log-structured storage engines systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs." — Martin Kleppmann, Chapter 3

Rules of Thumb

  • LSM-trees generally sustain higher write throughput; B-trees generally deliver faster reads
  • Write amplification caps write throughput and shortens SSD lifespan — measure total I/O, not just user-facing latency
  • Monitor LSM compaction lag explicitly — unbounded segment growth is the core operational danger
  • OLTP is bottlenecked by disk seek time (indexes matter); OLAP is bottlenecked by disk bandwidth (column storage and compression matter)
  • Star schemas capture events at finest granularity because coarser aggregation discards analytical optionality forever
  • B-tree single-location-per-key enables range locks; LSM-tree multi-segment duplication complicates transactional semantics
  • Column store writes use LSM-tree techniques (accumulate in memory, bulk-merge to disk) — storage paradigms can be hybridized

Related References