Log-Structured Merge (LSM) Trees are a family of data structures and algorithms designed for extremely fast writes in database systems.
Core Ideas #
- Write-ahead logs for durability
- Memtables for fast, in-memory writes
- SSTables for persisting memtables to disk
- Compaction to merge sets of SSTables into smaller sets, speeding up reads and reducing storage bloat.
We won't cover implementation details for memtables and SSTables in much depth here, but in short:
- Memtables are in-memory, most often implemented as skip lists, sometimes as balanced trees.
- SSTables are on-disk and immutable. Besides the core data (key -> value changesets), they contain:
  - Bloom filters for pruning the set of tables to read for a given key.
  - An index (on the primary key) for fast key -> offset lookups during reads.
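As a rough sketch (the class and method names here are illustrative, not from any particular engine), an SSTable can be modeled as sorted key/value pairs plus a bloom filter and a key index:

```python
import bisect

class SSTable:
    """Toy immutable SSTable: sorted entries, a bloom filter, and a key index.
    Real SSTables live on disk with block-based sparse indexes."""

    def __init__(self, entries):
        # Freeze a dict of key -> value into sorted, immutable order.
        self.keys = sorted(entries)
        self.values = [entries[k] for k in self.keys]
        # Tiny bloom filter: one 64-bit word, two hash probes per key.
        self.bloom = 0
        for k in self.keys:
            self.bloom |= self._probes(k)

    @staticmethod
    def _probes(key):
        h = hash(key)
        return (1 << (h % 64)) | (1 << ((h >> 6) % 64))

    def might_contain(self, key):
        # No false negatives, occasional false positives.
        p = self._probes(key)
        return self.bloom & p == p

    def get(self, key):
        if not self.might_contain(key):
            return None  # bloom filter prunes this table entirely
        i = bisect.bisect_left(self.keys, key)  # index: key -> offset
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None
```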
Compaction #
So, we're optimizing writes by storing values, or partial updates to values (depending on the data model: key-value or key-row). That means fragmentation will keep growing, right?
Say a key is updated 10 times over a day. Depending on how those updates are spaced, the writes end up spread across several SSTables. A future read then has to go through all of the SSTables containing updates for this key (pruned, of course, by the bloom filters built into each SSTable) to compute the current value. Not very optimal.
This is how we arrive at compaction. Periodically, an LSM tree takes a subset of its SSTables, merges them to eliminate old invalidated updates and tombstones, and writes the merged SSTable* to replace the merged subset.
The exact mechanics of compaction are nuanced and not something I'll go into right now (because I'm not too familiar myself), but in short, heuristics like merging across generations (levels) are used to balance read cost against write and I/O cost.
Note: a compaction may produce one or more merged SSTables, possibly split across key ranges.
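A full compaction pass can be sketched as a merge that keeps only the newest version of each key and drops tombstones (`TOMBSTONE` here is an illustrative sentinel, not a real API):

```python
TOMBSTONE = object()  # illustrative deletion marker

def compact(sstables):
    """Merge SSTables (ordered oldest -> newest) into one, keeping only the
    newest version of each key and dropping tombstoned keys entirely."""
    merged = {}
    for table in sstables:
        merged.update(table)  # newer tables overwrite older entries
    # Tombstones have served their purpose: drop those keys outright.
    return {k: v for k, v in merged.items() if v is not TOMBSTONE}
```

Dropping tombstones outright is only safe when the compaction covers every SSTable that could hold an older version of the key; leveled strategies keep tombstones around until they reach the bottom level.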
Tombstones #
How do you delete a value in an LSM tree? Well, SSTables are immutable, so you can't go into them and delete values. What you do instead: write a special entry, e.g. k -> DELETE. This marks a deletion, a tombstone. At compaction time, the LSM tree can then drop the key from the compacted table.
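A minimal sketch of how a tombstone shadows older values at read time (`DELETE` is a hypothetical marker):

```python
DELETE = object()  # tombstone marker (illustrative)

def read(key, tables_new_to_old):
    """Scan tables newest-first: a tombstone in a newer table shadows
    any live value in an older one."""
    for table in tables_new_to_old:
        if key in table:
            value = table[key]
            return None if value is DELETE else value
    return None  # key never written
```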
Write Path #
Write -> WAL -> Memtable -> flush (batch, async) -> SSTable
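The flow above can be sketched as follows (a toy, synchronous version; real engines flush in the background and the names are mine, not any engine's API):

```python
class LSMWriter:
    """Write path sketch: append to a WAL, update the memtable, and flush the
    memtable to an (in-memory stand-in for an) SSTable once it grows too big."""

    def __init__(self, flush_threshold=2):
        self.wal = []        # stand-in for an append-only log file
        self.memtable = {}   # real engines use a skip list
        self.sstables = []   # newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # 1. durability: sequential append
        self.memtable[key] = value      # 2. fast in-memory write
        if len(self.memtable) >= self.flush_threshold:
            self._flush()               # 3. batched flush (synchronous here)

    def _flush(self):
        # Freeze the memtable into an immutable, sorted "SSTable".
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.wal = []  # entries are now durable in the SSTable
```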
Read Path #
Memtable -> SSTables (new -> old, prune with bloom filter)
- Reading every SSTable would be expensive, right? Thank bloom filters: each SSTable carries one for quickly checking whether it might contain a given key. So, in practice, we only hit a small fraction of SSTables.
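The read path can be sketched under some simplifying assumptions (a plain `dict` as the memtable, and a callable standing in for each table's bloom filter; a real bloom filter would also admit false positives):

```python
def lsm_read(key, memtable, sstables_new_to_old):
    """Read path sketch: check the memtable first, then SSTables newest-first,
    skipping any table whose bloom filter rules the key out."""
    if key in memtable:
        return memtable[key]
    for bloom_check, data in sstables_new_to_old:
        if not bloom_check(key):
            continue  # definitely absent: skip this table's disk read
        if key in data:
            return data[key]
    return None
```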
Advantages and tradeoffs #
- Super fast writes! You're just appending to a WAL and modifying an in-memory memtable; very minimal synchronous disk I/O.
- Flushes are fast too! Sequential writes, no random I/O.
- Good for horizontal scaling: just introduce a partitioning key (to route to a node) and a sort key (to organise data within a node/LSM tree).
- Reads are not as simple as in a B-tree, as they're amplified over the full set of relevant SSTables (pruned by their bloom filters).
- Reads are not as flexible as in a traditional model: they MUST use the primary key or do full scans.
- Secondary keys are not so simple in database systems that use LSM trees (DynamoDB, Apache Cassandra, Google Bigtable, HBase).
  - While secondary indexes exist, they're not as simple or efficient as in other systems.
- Compaction => write amplification (asynchronous).
- Compaction can be tuned to the workload and constraints: I/O and CPU cost vs. read cost.