The primary reason databases seem "slow" isn’t usually the network or the CPU; it’s how their storage engines organize and manage data on disk.
Let’s see what this looks like in practice. Imagine a simple key-value store. You want to PUT a new record: my_db.put("user:123", "{\"name\": \"Alice\"}").
If your storage engine is B-tree-based, like InnoDB (which in practice uses a B+tree with data stored in the leaf pages), it’ll traverse the index to find the right leaf page, check whether that page is full, split it if necessary, and then write the new record into the page. That’s a handful of disk seeks and writes. A GET of the same record, my_db.get("user:123"), traverses the index to the correct leaf page and reads the data.
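The leaf-page part of that write path can be sketched in a few lines. This is a toy model, not InnoDB’s actual implementation (real pages are 16 KB on-disk structures with row formats, fill factors, and latching); `PAGE_CAPACITY` and the class names are illustrative:

```python
import bisect

PAGE_CAPACITY = 4  # max records per leaf page (an assumption for the demo)

class LeafPage:
    def __init__(self):
        self.keys, self.values = [], []

    def is_full(self):
        return len(self.keys) >= PAGE_CAPACITY

    def insert(self, key, value):
        # Keep records sorted within the page, as a B-tree leaf does.
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.values.insert(i, value)

    def split(self):
        # Move the upper half of the records to a new sibling page;
        # the parent node would then get a pointer to the sibling.
        mid = len(self.keys) // 2
        right = LeafPage()
        right.keys, right.values = self.keys[mid:], self.values[mid:]
        self.keys, self.values = self.keys[:mid], self.values[:mid]
        return right

page = LeafPage()
for k in ["a", "b", "c", "d"]:
    page.insert(k, f"value-{k}")
# The page is now at capacity; the next insert would first split it:
if page.is_full():
    sibling = page.split()
```

The point of the split is that each page stays within a fixed on-disk size, which is what makes in-place updates and predictable read paths possible.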
Now, consider RocksDB, a Log-Structured Merge-tree (LSM-tree). When you PUT, it doesn’t immediately try to find and update a specific place on disk. Instead, it writes the new key-value pair to an in-memory buffer called a MemTable. Only when the MemTable is full does it flush its contents to disk as a new, immutable file (an SSTable). Reads are more complex: RocksDB checks the MemTable first, then a series of SSTable files on disk, starting with the most recent.
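That write and read path can be captured in a minimal sketch. This is illustrative only, not RocksDB’s API: the MemTable is a plain dict, an "SSTable" is just a sorted, frozen copy of it, and there is no compaction:

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}            # in-memory write buffer (MemTable)
        self.sstables = []            # flushed, immutable runs; newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # writes never touch old data on disk
        if len(self.memtable) >= self.memtable_limit:
            # Flush: sort the buffer and persist it as an immutable SSTable.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # 1. check the MemTable first
            return self.memtable[key]
        for sst in reversed(self.sstables):  # 2. then SSTables, newest first
            if key in sst:
                return sst[key]
        return None

db = TinyLSM()
db.put("user:123", '{"name": "Alice"}')
db.put("user:456", '{"name": "Bob"}')   # fills the MemTable, triggers a flush
print(db.get("user:123"))               # now served from the flushed SSTable
```

Note how a PUT never searches the existing data at all; the cost of locating a record is deferred entirely to the read path, which is the defining LSM trade-off.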
WiredTiger, the default storage engine in MongoDB, can operate in a variety of modes but commonly uses a B-tree structure similar to InnoDB’s. However, it employs sophisticated caching and write-ahead logging strategies. When you PUT, WiredTiger adds the data to its internal cache and writes a write-ahead log (WAL) record for durability; the actual data page on disk is updated asynchronously, at checkpoint time. Reads consult the cache first, then fall back to disk.
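The "log first, update the cache, write the page back later" ordering is the heart of that design. Here is a sketch of such a write path under simplified assumptions (one JSON record per WAL line, no checkpointing, no recovery); it is not WiredTiger’s internals, just the general WAL pattern:

```python
import json
import os
import tempfile

class WalStore:
    def __init__(self, wal_path):
        self.wal = open(wal_path, "a", encoding="utf-8")
        self.cache = {}        # in-memory page cache
        self.dirty = set()     # keys whose pages await asynchronous write-back

    def put(self, key, value):
        # 1. Append to the WAL and fsync: once this returns, the write
        #    survives a crash even though no data page has been touched.
        self.wal.write(json.dumps({"op": "put", "k": key, "v": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2. Update the cache; the data page is written back later.
        self.cache[key] = value
        self.dirty.add(key)

    def get(self, key):
        # Cache first; a real engine would fall back to reading the page
        # from disk on a miss.
        return self.cache.get(key)

path = os.path.join(tempfile.mkdtemp(), "store.wal")
db = WalStore(path)
db.put("user:123", '{"name": "Alice"}')
print(db.get("user:123"))
```

The sequential WAL append is cheap; the expensive random page write is deferred and can be batched, which is why this pattern shows up in nearly every durable storage engine.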
The core difference lies in how they handle writes and data organization on disk. InnoDB, being a B-tree, aims for in-place updates. When data changes, it tries to modify the existing block on disk. This is great for read-heavy workloads where data locality is key. RocksDB, with its LSM-tree architecture, prioritizes write performance by appending new data and merging old data in background processes. This means writes are fast (just append to memory/new file), but reads can involve checking multiple files. WiredTiger aims for a balance, offering flexible configurations that can lean towards B-tree or LSM-like behavior, with strong emphasis on caching and efficient journaling.
The real magic of these engines isn’t just the basic data structure; it’s how they manage concurrency, durability, and performance. InnoDB uses a sophisticated lock manager and buffer pool to handle concurrent access and cache frequently used data. RocksDB relies on its multi-version concurrency control (MVCC) implementation to allow readers to access older versions of data while writers are modifying it, all managed through its tiered file structure. WiredTiger’s concurrency control is built around a variety of techniques including optimistic concurrency control and fine-grained locking, coupled with its advanced cache management that can intelligently evict pages.
The key lever you control is often the configuration. For InnoDB, tuning innodb_buffer_pool_size is paramount; it determines how much data and index content can be held in RAM, dramatically reducing disk I/O for reads. For RocksDB, write_buffer_size (the MemTable size), block_size, and num_levels shape the trade-off between write amplification and read performance. For WiredTiger under MongoDB, tuning usually starts with the cache size (storage.wiredTiger.engineConfig.cacheSizeGB) and eviction behavior to match your workload.
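As a concrete illustration of the InnoDB knob, here is what that tuning looks like in a my.cnf; the value is arbitrary and should be sized to your hardware (a common rule of thumb is 50–75% of RAM on a dedicated database host):

```ini
# my.cnf (illustrative values, not recommendations)
[mysqld]
innodb_buffer_pool_size = 8G
```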
What most people don’t realize is how much the write amplification of an LSM-tree like RocksDB can vary. It’s not just about how many writes you do to the database; it’s how many times that data gets rewritten to disk as it’s compacted across different levels of SSTable files. High write amplification means you’re burning through disk I/O and potentially SSD wear at a much higher rate than your initial writes suggest.
The next logical step is understanding how these engines integrate with query planners and transaction managers in larger database systems.