LSM Trees from First Principles

· Ditsuke.


[!NOTE] This blog is 100% handwritten, with no LLMs involved.

Log-Structured-Merge (LSM) Trees are a set of data structures and algorithms for extremely fast writes in database systems.

Most common database management systems and workloads use B-Tree indices as the default choice: a fantastic data structure that scales to billions of rows in production systems. And yet, B-Trees are not well suited to very write-heavy use-cases (think 100K+ writes per second).

Over the last two decades, a number of systems using LSM Trees as the foundation layer have emerged, with Apache Cassandra, Google's Bigtable, Amazon's DynamoDB, and ScyllaDB among the most popular examples.

In this blog, we explore the idea from first principles, starting with B-Trees.

B-Trees and the Write Problem #

B-Trees are the standard index type in modern database systems. A few properties make them such a good default:

- Keys are kept sorted, so point lookups, inserts, and range scans all run in logarithmic time.
- High fanout keeps the tree shallow, so a lookup touches only a handful of nodes.
- Node sizes map naturally onto disk pages, making each I/O count.

While we won't go deeper into the design of B-Trees in this blog, it's very important to understand why they're not the best idea for writes. Let's consider the write path for some value X. Because the B-Tree is broken up into large pages pointing at any number of other large pages, a write involves:

- Traversing from the root to the target leaf page, following child pointers (a series of random reads).
- Writing X into the leaf page (a random write).
- Potentially splitting a full leaf and propagating the split up through parent pages.

Now of course these writes are not all happening on disk. A real database would load the necessary pages into memory until it finds the target leaf, then write to the in-memory pages alongside a write-ahead log (WAL). The disk flush is otherwise async, so the only write in the sync path is sequential (the WAL). However, a write also causes:

- Dirty pages that must eventually be flushed to disk as random writes.
- Write amplification: a whole page is rewritten even when only a single row changed.
- Page splits, which dirty yet more pages and cascade up the tree.

In very write-heavy environments (say, high-volume event streams like messages), B-Trees are simply not the best choice. And yet, the read advantages they offer cannot be ignored.

The Log #

At this point we must talk about the primitive log. The log is the best data-structure for writes. While disk I/O is inherently slow, it's actually not all that slow if we don't need to seek! By writing a contiguous data stream to the disk, the log maximises throughput, which can be as high as 2-5 GB/s on modern NVMe SSDs. On the same disk, write throughput can drop by a factor of up to 20x, to <100 MB/s, for random writes, say in the case of B-Trees! As an illustration, we can look at user benchmarks for a Seagate SSD on userbenchmark.com.

Now you might be wondering how the log is relevant to us at all. After all, a write must propagate to the target row, right?

Not necessarily. We can always compute the current state of a row by aggregating all the updates to it in reverse chronological order. And deletes? They are represented as updates too, formally called tombstones.

This is expensive, yes, but it's a starting point: it makes writes very cheap for us in exchange for making reads much more expensive.
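To make this concrete, here's a minimal sketch in Python. It assumes updates are (timestamp, key, value) triples and that deletes are written as updates carrying a TOMBSTONE sentinel; all names here are illustrative, not from any real system.

```python
# A toy append-only log: updates are (timestamp, key, value) triples,
# and a delete is just an update whose value is the TOMBSTONE marker.
TOMBSTONE = object()

def read_row(log, key):
    """Reconstruct the current value of `key` by scanning the log in
    reverse chronological order: the newest entry for the key wins."""
    for _, k, value in sorted(log, key=lambda e: e[0], reverse=True):
        if k == key:
            return None if value is TOMBSTONE else value
    return None  # never written

log = [
    (1, "user:1", "alice"),
    (2, "user:2", "bob"),
    (3, "user:1", "alicia"),   # newer update shadows timestamp 1
    (4, "user:2", TOMBSTONE),  # delete
]
assert read_row(log, "user:1") == "alicia"
assert read_row(log, "user:2") is None
```

Every read here is a full scan of the log, which is exactly the read cost the rest of this post works on whittling down.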

So on one end we have a log, optimal to write to but not so good to read from. On the other, we have B-Tree-backed rows with near-Pareto-optimal reads but not-so-ideal writes. Can we find a middle ground?

The Log-Structured-Merge Tree, from First Principles #

The main bottlenecks to faster writes in B-Trees are:

- Random I/O: finding the target leaf and flushing dirty pages both involve seeks scattered across the disk.
- Write amplification: whole pages are rewritten for small changes, and page splits multiply the work.

In the previous section, we saw that a log solves both of these write bottlenecks in exchange for read cost. Let's go looking for that middle ground.

Let's start with the primitive log. How do we make reads cheaper? It stands to reason that reads would be much cheaper if the log were sorted by a clustering key (which can also be seen as the primary key for the row), and within a cluster sorted by update timestamp. But this is expensive in practice! A log is an unbounded stream of updates. We cannot reasonably keep it sorted by anything except time in its primitive form.

Now, if the entire log were in memory, it could be feasible to keep a sorted representation using a data-structure like a binary search tree, as memory is 3-5 orders of magnitude faster than disk for such operations. But memory is much more limited, so we cannot possibly keep the entire thing in memory.

Let's take this further: what if we only stored a recent log buffer in memory, keeping our sort invariants satisfied? Suddenly, this is much more practical!

But what about older entries? Well, we could periodically flush our in-memory buffer to disk. Let's call our in-memory table segment a MemTable, and likewise our on-disk segments SSTables, for Sorted String Tables.
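Here's a minimal sketch of the idea, with all names (MemTable, flush, the JSON-lines file format) chosen purely for illustration. A real MemTable would use a balanced tree or skip list so iteration stays sorted at all times; for brevity, this one sorts lazily at flush time.

```python
import json

class MemTable:
    """In-memory write buffer. Keys map to their latest (timestamp,
    value) pair; flushing writes a key-sorted run to disk, and that
    sorted run is our SSTable."""

    def __init__(self, max_entries=1024):
        self.entries = {}  # key -> (timestamp, value)
        self.max_entries = max_entries

    def put(self, key, timestamp, value):
        self.entries[key] = (timestamp, value)

    def is_full(self):
        return len(self.entries) >= self.max_entries

    def flush(self, path):
        # One line per record, sorted by key. (A real format would also
        # need an encoding for tombstones, checksums, compression, ...)
        with open(path, "w") as f:
            for key in sorted(self.entries):
                ts, value = self.entries[key]
                f.write(json.dumps({"k": key, "ts": ts, "v": value}) + "\n")
        self.entries.clear()
```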

With these ideas in place, our reads are looking much better. Say we want to compute the value of the row for clustering key K: we need to aggregate updates to K, in reverse chronological order, across our MemTable as well as all our SSTables. Within each of these segments, search is efficient because the sort invariants are maintained.

But we can do even better!

Refining the SSTable #

First, let's consider that we're wasting a ton of disk I/O by searching every SSTable for K, while in practice only a small subset of these tables may contain relevant updates. Here, the first instinct is to use a set to satisfy the relevance check instead of a full search over the table. But a set of every key is nearly as large as the keys themselves. Bloom filters can answer almost the same question ("possibly present" instead of "definitely present") far more cheaply, in exchange for a small probabilistic false-positive rate.
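A minimal Bloom filter sketch, with the bit count and hash count picked arbitrarily for illustration (real systems size them from the expected key count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Answers "definitely not present" or "possibly present".
    False positives are possible; false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive the k bit positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:1")
assert bf.might_contain("user:1")  # always true once added
# bf.might_contain("user:999") is usually False, occasionally True.
```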

In addition, searching an entire SSTable to find a value can also be expensive. After all, we want these tables to be reasonably big for optimum I/O throughput, yet at the same time we want to find our update entries as fast as possible. What if we kept a sparse index within the SSTable, so we can quickly jump to a point close to our target?

And so, we arrive at some important optimisations for the SSTable:

- A per-table Bloom filter, so tables that definitely don't contain K are skipped without any disk I/O.
- A sparse index over the table's sorted keys, so a lookup can jump close to K instead of scanning from the start.

With these properties, our search path within an SSTable T for key K is:

1. Check T's Bloom filter for K; if it says "definitely not", skip T entirely.
2. Binary-search T's sparse index for the greatest indexed key ≤ K.
3. Scan forward from that offset until K is found or passed.

The Bloom filter component may be small enough that we can load a large number of them into memory, avoiding I/O for search-path pruning altogether.
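Here's a sketch of steps 2 and 3, assuming the sparse index is a list of (key, offset) pairs sampled every N records (with the first record always indexed), and using an in-memory list as a stand-in for the on-disk run:

```python
import bisect

def search_sstable(sparse_index, records, key):
    """Find `key` in a key-sorted SSTable via its sparse index, reading
    only the slice between two adjacent index points."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1  # greatest indexed key <= target
    if i < 0:
        return None  # key sorts before the table's first record
    start = sparse_index[i][1]
    end = sparse_index[i + 1][1] if i + 1 < len(sparse_index) else len(records)
    for k, v in records[start:end]:  # short scan within one block
        if k == key:
            return v
    return None

records = [("a", 1), ("d", 2), ("g", 3), ("k", 4), ("p", 5)]
sparse_index = [("a", 0), ("g", 2)]  # every other record indexed
assert search_sstable(sparse_index, records, "k") == 4
assert search_sstable(sparse_index, records, "b") is None
```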

Compaction #

We've refined the design of our SSTable, and it's much cheaper than the naive ordered log. We could stop here, but notice that we keep producing SSTables periodically:

- Reads degrade over time, since every new SSTable is one more segment a lookup may have to visit.
- Obsolete updates and tombstoned rows accumulate on disk, wasting space that's never reclaimed.

We have an answer to both of these: compaction. The basic idea: instead of writing once and reading forever, our LSM Tree periodically takes a subset of its SSTables and merges them to produce one or more merged tables. Updates will be aggregated, redundant updates will be removed, and tombstones will invalidate any earlier writes.
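A sketch of the merge itself, assuming each input table is sorted by key and holds at most one entry per key (exactly what a MemTable flush from the earlier sketch produces):

```python
import heapq

TOMBSTONE = object()  # delete marker, as in the earlier sketches

def compact(sstables):
    """Merge key-sorted SSTables (lists of (key, timestamp, value))
    into one. For each key, only the newest update survives."""
    # Sorting by (key, -timestamp) makes the newest entry for each
    # key arrive first in the merged stream.
    merged = heapq.merge(*sstables, key=lambda e: (e[0], -e[1]))
    out, last_key = [], None
    for key, ts, value in merged:
        if key == last_key:
            continue  # an older, shadowed update: drop it
        last_key = key
        # Dropping tombstones outright is only safe if this compaction
        # covers every older table that might still hold the key; real
        # systems retain them until that is guaranteed.
        if value is not TOMBSTONE:
            out.append((key, ts, value))
    return out

old = [("a", 1, "x"), ("c", 2, "y")]
new = [("a", 3, "z"), ("c", 4, TOMBSTONE)]
assert compact([old, new]) == [("a", 3, "z")]
```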

Advantages:

- Fewer SSTables means fewer segments to check on every read.
- Redundant updates and tombstoned rows are dropped, reclaiming disk space.

There are also some drawbacks, starting with more complexity and increased write amplification: more IOPS, more CPU cycles. Nonetheless, compaction emerges as an important optimisation for long-term practical use of our LSM Tree.

The exact mechanics of compaction are nuanced and something that deserves a dedicated page of its own, but in short we have a lot of playing room when it comes to deciding what to merge, how to structure our table space in terms of keys and levels, and deciding how to segment compacted tables.
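As a taste of that playing room, here's a sketch of one possible candidate-selection policy, loosely in the spirit of size-tiered compaction (as in Cassandra's STCS); the ratio and threshold here are arbitrary illustrative values:

```python
def pick_compaction_candidates(sstable_sizes, bucket_ratio=1.5, min_tables=4):
    """Group tables of similar size into buckets and nominate the first
    bucket that has accumulated enough tables to be worth merging."""
    buckets = []
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            if size <= bucket[0] * bucket_ratio:  # similar enough in size
                bucket.append(size)
                break
        else:
            buckets.append([size])
    for bucket in buckets:
        if len(bucket) >= min_tables:
            return bucket  # merge these together
    return None  # nothing worth compacting yet

# Four small tables of similar size get picked; the big one is left alone.
assert pick_compaction_candidates([100, 110, 120, 105, 5000]) == [100, 105, 110, 120]
```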

Write Path #

Write -> WAL -> MemTable -> flush (batched, async) -> SSTable
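Tying the earlier sketches together, the synchronous part of a write is just a sequential WAL append plus an in-memory insert (the MemTable class and JSON-lines format come from the sketch above):

```python
import json

def write(wal, memtable, key, timestamp, value):
    """Durable write: append to the WAL (sequential I/O only), then
    insert into the MemTable. SSTable flushes happen out of band."""
    wal.write(json.dumps({"k": key, "ts": timestamp, "v": value}) + "\n")
    wal.flush()  # the only disk write on the sync path
    memtable.put(key, timestamp, value)
    if memtable.is_full():
        # Shown inline for simplicity; a real system flushes in the
        # background and swaps in a fresh MemTable.
        memtable.flush(f"sstable-{timestamp}.jsonl")
```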

Read Path #

MemTable -> SSTables (new -> old, pruned with Bloom filters)
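And the corresponding read path, again reusing TOMBSTONE, BloomFilter, and search_sstable from the earlier sketches; `sstables` is assumed to be a newest-first list of (bloom, sparse_index, records) triples:

```python
def read(memtable, sstables, key):
    """Check the MemTable, then each SSTable from newest to oldest,
    skipping tables whose Bloom filter rules the key out."""
    if key in memtable.entries:
        _, value = memtable.entries[key]
        return None if value is TOMBSTONE else value
    for bloom, sparse_index, records in sstables:  # newest first
        if not bloom.might_contain(key):
            continue  # pruned without touching disk
        value = search_sstable(sparse_index, records, key)
        if value is not None:
            # The newest segment containing the key wins; a tombstone
            # here means the row was deleted.
            return None if value is TOMBSTONE else value
    return None
```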

Advantages and tradeoffs #

To sum up: the LSM Tree turns every write into a sequential WAL append plus an in-memory insert, which is what makes the 100K+ writes-per-second regime practical. In exchange, a read may have to consult several segments (read amplification), and compaction continuously spends background I/O and CPU (write amplification). Bloom filters, sparse indices, and a good compaction strategy are what keep those costs in check.
