[!NOTE] This blog is 100% handwritten, no LLMs involved.
Log-Structured-Merge (LSM) Trees are a set of data structures and algorithms for extremely fast writes in database systems.
Most common database management systems and workloads use B-Tree indices as the default choice: a fantastic data structure that scales to billions of rows in production systems. And yet, B-Trees are not well suited to very write-heavy use-cases (think 100K+ writes per second).
In the last 15 years, a number of systems using LSM Trees as the foundation layer have emerged, with Apache Cassandra, Google's Bigtable, Amazon's DynamoDB, and ScyllaDB as the most popular examples.
Through this blog, we explore the idea from first principles, starting with B-Trees.
B-Trees and the Write Problem #
B-Trees are the absolute standard index type in modern database systems. There are a few properties which make them so ideal:
- Ordered for search, like BSTs. So a search will take, at worst, time on the order of log n, where n is the number of items in the tree.
- However, BSTs have a glaring flaw: chasing their tiny nodes means one random disk seek per pointer, and disk I/O is the slowest possible operation in database systems, 2-5 orders of magnitude slower than reads from memory! With a large enough database, it's simply not cost-efficient to store everything in memory, just as it's not possible to store an entire massive BST index as a single blob, so we need a solution that balances disk read throughput against loading predictable pages of the data structure into memory.
- B-Trees inherit the fundamental design insight from BSTs while solving the disk I/O problem. By fanning each node out to N values instead of 1, we load large enough segments of our data structure into memory while still retaining the ability to do an O(log n) search for a value. We're also solving for depth: by fanning out we reduce tree depth to 2-3 levels even for millions of items, as the quick estimate below shows!
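To make the depth claim concrete, here's a back-of-the-envelope estimate. The fanout of 500 is my own illustrative number for an 8-16 KB page, not a figure from any specific system:

```python
import math

# A tree with fanout f holds roughly f^d keys at depth d, so d ≈ ceil(log_f(n)).
# fanout=500 is a hypothetical but plausible value for an 8-16 KB page.
fanout, n = 500, 100_000_000
depth = math.ceil(math.log(n, fanout))
print(depth)  # 3 -- a hundred million keys reachable in three page reads
```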
While we won't go deeper into the design of B-Trees in this blog, it's very important to understand why they're not the best idea for writes. Let's consider a write path for some value X. Because the B-Tree is broken up into large nodes, each pointing at a number of child nodes, writes involve:
- Searching the B-Tree for the target leaf.
- Inserting the value into the target leaf. Now, the tree has to be kept balanced...
- If the leaf has overflowed (size > leaf size threshold), we split it by deciding on a split key and propagating it upwards to the parent.
- Repeating the overflow/split process with the parent until we're at the root, where we may create a new root if it also overflows (see the sketch after this list).
- Finally, if the item is new, we create a new Row tuple and point the index to it; otherwise, we resolve the tuple pointer and update the Row tuple with the new value(s).
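Here is a minimal, purely illustrative sketch of that overflow/split propagation. It models an in-memory B-Tree with a tiny node size; real systems size nodes to disk pages, store values alongside keys, and handle concurrency, none of which is shown here:

```python
import bisect

MAX_KEYS = 4  # tiny on purpose; real nodes hold hundreds of keys per page

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []        # sorted keys
        self.children = []    # child nodes (internal nodes only)

def insert(node, key):
    """Insert key under node; return (split_key, right_node) if node overflowed."""
    if node.leaf:
        bisect.insort(node.keys, key)
    else:
        i = bisect.bisect_right(node.keys, key)      # pick the child to descend into
        split = insert(node.children[i], key)
        if split:                                    # child split: absorb the new key
            split_key, right = split
            node.keys.insert(i, split_key)
            node.children.insert(i + 1, right)
    if len(node.keys) > MAX_KEYS:                    # overflow: split this node too
        mid = len(node.keys) // 2
        right = Node(leaf=node.leaf)
        if node.leaf:
            split_key = node.keys[mid]               # copy of first right key moves up
            right.keys, node.keys = node.keys[mid:], node.keys[:mid]
        else:
            split_key = node.keys[mid]               # middle key moves up
            right.keys, right.children = node.keys[mid + 1:], node.children[mid + 1:]
            node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        return split_key, right
    return None

def tree_insert(root, key):
    """Insert into the tree rooted at root; return the (possibly new) root."""
    split = insert(root, key)
    if split:                                        # root overflowed: grow a new root
        split_key, right = split
        new_root = Node(leaf=False)
        new_root.keys = [split_key]
        new_root.children = [root, right]
        return new_root
    return root
```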
Now, of course, these writes are not all happening on disk. A real database would load the necessary pages into memory until it finds the target leaf and then write to the in-memory pages alongside a WAL entry. The disk flush is otherwise async, so the only write in the sync path is sequential (the WAL). However, a write also causes:
- Synchronous non-contiguous block reads from the disk for search.
- Asynchronous non-contiguous block writes to the disk for persistence: both for indices and the row itself.
In very write-heavy environments (say event streams like messages, etc), B-Trees are simply not the best choice. And yet, the read advantages they offer cannot be ignored.
The Log #
At this point we must talk about the primitive log. The log is the best data structure for writes. While disk I/O is inherently slow, it's actually not all that slow if we don't need to seek! By writing a contiguous data stream to the disk, the log maximises throughput, which is as high as 2-5 GB/s on modern NVMe SSDs. On the same disk, write throughput can drop by a factor of up to 20x to <100 MB/s for random writes, say in the case of B-Trees! As an illustration, we look at user benchmarks for a Seagate SSD on userbenchmark.com.
Now you might be wondering how the log is relevant to us at all. After all, a write must propagate to the target row, right?
Not necessarily. We can always compute the current state of a row by aggregating all the updates to it in reverse chronological order. And deletes? Also represented as updates, known more formally as tombstones.
This is expensive, yes, but it's a starting point that makes writes very cheap for us in exchange for making reads much more expensive.
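As a quick illustration, here is how reconstructing a row from a raw update log might look. The (timestamp, key, column, value) entry format and the None-as-tombstone convention are my own, just for the sketch:

```python
TOMBSTONE = None  # deletes are just updates whose value marks a tombstone

# hypothetical log entries: (timestamp, key, column, value)
log = [
    (1, "user:42", "name",  "Ada"),
    (2, "user:42", "email", "ada@example.com"),
    (3, "user:42", "email", TOMBSTONE),        # delete the email column
    (4, "user:42", "name",  "Ada L."),
]

def read_row(log, key):
    """Aggregate updates newest-first; the first write we see per column wins."""
    row = {}
    for _, k, column, value in reversed(log):  # reverse chronological order
        if k == key and column not in row:
            row[column] = value
    # columns whose most recent update is a tombstone are gone
    return {c: v for c, v in row.items() if v is not TOMBSTONE}

print(read_row(log, "user:42"))  # {'name': 'Ada L.'}
```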
So on one end we have a log, optimal to write to but not so good to read from. On the other, we have B-Tree-backed rows with near Pareto-optimal reads but not-so-ideal writes. Can we find a middle ground?
The Log-Structured-Merge Tree, from First Principles #
The main bottlenecks to faster writes in B-Trees are:
- Searching for the right leaf on disk.
- Writing affected leaf(s) to disk.
In the previous section, we saw that a log solves these write bottlenecks in exchange for read cost. Can we find a middle ground?
Let's start with the primitive log. How do we make reads cheaper? One could see with reason that reads would be much cheaper if the log were sorted by a clustering key (which can also be seen as the primary key for the row), and within a cluster sorted by update timestamp. But this is expensive in practice! A log is an unbounded stream of updates. We cannot reasonably keep it sorted by anything except time in its primitive form.
Now, if the entire log were in memory, it could be feasible to keep a sorted representation by using a data structure like a Binary Search Tree, as memory is 3-5 orders of magnitude faster at such operations than disk. But memory is much more limited, so we cannot possibly keep the entire thing in memory.
Let's take this further: what if we only stored a recent log buffer in memory, keeping our sort invariants satisfied? Suddenly, this is much more practical!
But what about older logs? Well, we could periodically flush our log to the disk. Let's call our in-memory table segment a MemTable, and likewise our on-disk segments SSTables, for Sorted-String Tables.
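A minimal sketch of this split, with made-up names and a toy size threshold. A real engine keeps the MemTable sorted at all times (skip list or balanced tree) and writes the SSTable out to disk; here we cheat and sort only at flush time to keep the sketch dependency-free:

```python
MEMTABLE_LIMIT = 4  # toy threshold; real limits are tens to hundreds of MB

class MemTable:
    """Recent updates buffered in memory; frozen into an SSTable when full."""
    def __init__(self):
        self.entries = {}          # key -> list of (timestamp, value) updates

    def put(self, key, ts, value):
        self.entries.setdefault(key, []).append((ts, value))

    def is_full(self):
        return len(self.entries) >= MEMTABLE_LIMIT

def flush(memtable):
    """Produce an immutable, key-sorted segment (an SSTable, here just a list)."""
    return [(key, sorted(updates, reverse=True))     # newest update first per key
            for key, updates in sorted(memtable.entries.items())]
```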
With these ideas in place, our reads are looking much faster. Say we want to compute the value of the row for clustering key K: we need to, in reverse chronological order, aggregate updates to K across our MemTable as well as all our SSTables. Within each of these segments search is optimised, as the sort invariants are maintained.
But we can do even better!
Refining the SSTable #
First, let's consider that we're wasting a ton of disk I/O by searching every SSTable for K, while in practice only a small subset of these tables may contain relevant updates. Here, the first instinct is to use a set to satisfy the relevance check instead of a full search over the table. But sets are not that much better. Bloom filters can answer almost the same question much more cheaply, in exchange for a small probabilistic cost.
In addition, searching entire SSTables to find a value can also be expensive. After all, we want these tables to be reasonably big for optimum I/O throughput. At the same time, we want to find our update entries as fast as possible. What if we have a sparse index within the SSTable, to quickly jump to a point close to our target?
And so, we arrive at some important optimisations for the SSTable:
- Prefix with a bloom filter to prune search space for a key to only relevant tables. We pay a small cost for false positives in exchange for very cheap answers to this question.
- After the bloom filter, also add a sparse index to use as a jump-table into the log.
- After the sparse index, we have our K-sorted set of updates.
With these properties, our search path within an SSTable T for key K is:
- Bloom filter: does T contain K? Small chance of a false positive, no false negatives.
- Sparse index: where should I seek to reach a point close to K? Gives a seek location S as well as the next indexed location S1.
- Sequential scan: from S to S1, until we arrive at K. If we reach S1 without finding K, we got a false positive from the filter.
The bloom filter component may be small enough that we can load a large number of them into memory to avoid I/O for search path pruning.
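Putting the pieces together, here is a rough sketch of that search path. The bloom filter is deliberately simplistic (a Python int used as a bit array, hashed with SHA-256), and names like SPARSE_EVERY and might_contain are illustrative, not any real engine's API:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, hashes=3):
        self.size, self.hashes, self.bits = size_bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # may return a false positive, never a false negative
        return all((self.bits >> p) & 1 for p in self._positions(key))

class SSTable:
    SPARSE_EVERY = 64   # index one key out of every 64

    def __init__(self, sorted_entries):           # [(key, value), ...] sorted by key
        self.entries = sorted_entries
        self.bloom = BloomFilter()
        self.sparse = []                           # [(key, offset), ...]
        for i, (k, _) in enumerate(sorted_entries):
            self.bloom.add(k)
            if i % self.SPARSE_EVERY == 0:
                self.sparse.append((k, i))

    def get(self, key):
        if not self.bloom.might_contain(key):      # definite miss, no scan needed
            return None
        # find the last sparse entry with key <= target (a real index is binary-searched)
        start = 0
        for k, off in self.sparse:
            if k <= key:
                start = off
            else:
                break
        # bounded scan between two sparse index points
        for k, v in self.entries[start:start + self.SPARSE_EVERY]:
            if k == key:
                return v
            if k > key:
                break
        return None                                # bloom filter false positive
```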
Compaction #
We've refined the design of our SSTable. It's much cheaper than the naive ordered log. We could stop here, but notice that we keep producing SSTables periodically:
- A database may run for days, for weeks, for months, even for years, and as it runs the cost for reads will keep growing unless we do something about our search space.
- Space bloat: updates are superseded by newer ones, and values are deleted with tombstones, but we never get rid of the stale entries. It's not hard to imagine an unbounded, many-fold growing drift between the logical information stored and the physical space consumed.
We have an answer to both of these: compaction. The basic idea: instead of writing once and reading forever, our LSM Tree periodically takes a subset of its SSTables and merges them to produce one or more merged tables. Updates are aggregated, redundant updates are removed, and tombstones invalidate any earlier writes.
Advantages:
- By reducing the number of tables, we're reducing our search space for reads.
- Space amplification is reduced as we get rid of deleted keys and redundant point-in-time states.
- Further optimisation opportunities are presented. While at creation these tables are not partitioned in any way, during compaction we may partition our key-space so that certain tables only contain certain contiguous key ranges. This would make search even faster, enabling us to skip even bloom filters by simply looking at table name/metadata.
There are also some drawbacks, starting with more complexity and increased write amplification: more IOPS, more CPU cycles. Nonetheless, compaction emerges as an important optimisation for long-term practical use of our LSM Tree.
The exact mechanics of compaction are nuanced and deserve a dedicated page of their own, but in short we have a lot of room to play with when it comes to deciding what to merge, how to structure our table space in terms of keys and levels, and how to segment compacted tables.
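For a flavour of what a single compaction pass does, here is a sketch assuming each SSTable is a list of (key, timestamp, value) tuples and value=None marks a tombstone. Note that dropping tombstones outright, as below, is only safe when the compaction covers every table that could still hold older versions of a key; otherwise the tombstone must be carried forward:

```python
import heapq

TOMBSTONE = None

def compact(sstables):
    """Merge SSTables into one: newest write per key wins, tombstones are purged."""
    # order each run by (key, -timestamp) so the newest entry for a key comes first
    runs = [sorted(t, key=lambda e: (e[0], -e[1])) for t in sstables]
    merged, last_key = [], object()           # sentinel that equals no real key
    for key, ts, value in heapq.merge(*runs, key=lambda e: (e[0], -e[1])):
        if key == last_key:
            continue                          # an older update for a key already emitted
        last_key = key
        if value is not TOMBSTONE:            # drop deleted keys outright
            merged.append((key, ts, value))
    return merged
```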
Write Path #
Write -> WAL -> Memtable -> flush (batch, async) -> SSTable
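A toy end-to-end sketch of this flow. The file name, threshold, and inline flush-on-full are all simplifications of my own; real engines flush asynchronously and fsync the WAL according to their durability settings:

```python
import json, time

class LSMWriter:
    def __init__(self, wal_path="wal.log", memtable_limit=1000):
        self.wal = open(wal_path, "a")     # append-only write-ahead log
        self.memtable = {}                 # key -> (timestamp, value)
        self.memtable_limit = memtable_limit
        self.sstables = []                 # flushed segments, newest first

    def put(self, key, value):
        ts = time.time()
        self.wal.write(json.dumps([ts, key, value]) + "\n")
        self.wal.flush()                   # sequential append: the only sync disk work
        self.memtable[key] = (ts, value)
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                  # done inline here; async in real engines

    def _flush(self):
        sstable = sorted(self.memtable.items())   # immutable, key-sorted segment
        self.sstables.insert(0, sstable)
        self.memtable = {}
```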
Read Path #
Memtable -> SSTables (new -> old, prune with bloom filter)
- Reading each SSTable would be expensive, right? Thanks to bloom filters, each SSTable has one for quickly checking whether it contains some key or not. So, in practice, we'll only hit a small fraction of SSTables.
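And a matching read-path sketch, reusing the hypothetical shapes from the earlier sketches (a memtable mapping key to (timestamp, value), and SSTables exposing a bloom filter and a get):

```python
def lsm_get(memtable, sstables, key):
    """MemTable first, then SSTables from newest to oldest, bloom-pruned."""
    if key in memtable:                          # freshest data always wins
        return memtable[key][1]
    for table in sstables:                       # newest -> oldest
        if not table.bloom.might_contain(key):   # pruned with zero disk I/O
            continue
        value = table.get(key)
        if value is not None:
            return value
    return None                                  # never written (or deleted)
```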
Advantages and tradeoffs #
- Super fast writes! You're just appending to a WAL and modifying an in-memory memtable, with very minimal sync disk I/O.
- Flush is fast too! Sequential, no random writes.
- Good for horizontal scaling: just introduce a partitioning key (to route to a node) and a clustering key (to organise within a node/LSM tree).
- Reads are not as simple as in a B-Tree, as they're amplified over a set of relevant SSTables (pruned with bloom filters and compaction-driven partitioning).
- Reads are not as flexible as in a traditional model: they MUST use the primary key or do full SCANs.
- Secondary keys are not so simple in database systems that use LSM trees (DynamoDB, Apache Cassandra, Google Bigtable, HBase).
- While secondary indexes exist, they're not as simple and efficient as in other systems.
- Compaction => write amplification (async)
- Compaction can be tuned to workload and constraints: I/O, CPU vs read cost.