My question is in the title. A trie seems well suited to string indexing, so why do no mainstream databases use it as an indexing strategy?
Disks and SSDs are read in blocks, and the B+Tree indexes that databases use are optimized for that structure. A B+Tree minimizes the average number of blocks you have to read to perform a lookup. It also lets you update the index without changing too many blocks, and it makes the most of the cache.
Tries don't have these advantages. The one advantage they do provide is compressed storage of common prefixes, but for the short strings that are usually used as DB keys, that isn't much of an advantage. Sometimes specialized index structures are built to compress common prefixes, but again they're designed around the block structure of the storage.
In Designing Data Intensive Applications, Martin introduces a data structure called LSM-trees.
There are mainly 3 parts: an in-memory memtable (usually a red-black tree), an in-memory sparse index, and on-disk SSTables (aka segments). They work together like this:
When a write happens, it first goes to the memtable, and when the memtable becomes full, all the data is flushed into a new segment (with all the keys sorted).
When a read happens, it first looks up the memtable. If the key doesn't exist there, it looks up the sparse index to learn which segment the key may reside in. See figure 1.
Periodically, compaction happens that merges multiple segments into one. See figure 2.
As you can tell from figure 2, keys are sorted within a segment, but keys are NOT sorted across segments. This makes me wonder: how do we maintain the sparse index such that keys in the index have increasing offsets?
A typical approach is to have a separate index per segment file, and this index is re-generated during compaction/merging of segment files. When reading a key, we then have to check multiple current segment files that may contain the key, and return the value that appears in the most recent of those segments.
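As a rough illustration of the per-segment index idea, here is a minimal Python sketch. The `Segment` class, the `every` parameter and the `lookup` helper are names made up for this example, and the segments are plain in-memory lists standing in for files on disk.

```python
import bisect


class Segment:
    """An immutable, key-sorted run of (key, value) pairs with a sparse index."""

    def __init__(self, records, every=2):
        # records must already be sorted by key, as in an SSTable.
        self.records = list(records)
        # Sparse index: remember the position of every `every`-th key only.
        self.index_keys = [k for k, _ in self.records[::every]]
        self.index_pos = list(range(0, len(self.records), every))

    def get(self, key):
        # Find the last indexed key <= key, then scan forward from there.
        i = bisect.bisect_right(self.index_keys, key) - 1
        if i < 0:
            return None  # key sorts before everything in this segment
        start = self.index_pos[i]
        if i + 1 < len(self.index_pos):
            end = self.index_pos[i + 1]
        else:
            end = len(self.records)
        for k, v in self.records[start:end]:
            if k == key:
                return v
        return None


def lookup(segments, key):
    """segments is ordered oldest -> newest; the newest hit wins."""
    for segment in reversed(segments):
        value = segment.get(key)
        if value is not None:
            return value
    return None


old = Segment([("a", 1), ("c", 3), ("e", 5), ("g", 7)])
new = Segment([("b", 20), ("c", 30)])
print(lookup([old, new], "c"))  # 30, the newer segment shadows the older one
print(lookup([old, new], "g"))  # 7, found only in the older segment
print(lookup([old, new], "z"))  # None
```

Each segment's index only has to be sorted within that segment, so nothing needs to be kept in order across segments.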
It's not possible to tell just from looking at the index whether a particular segment contains a particular key. To avoid having to do a disk read for every segment, a common optimisation is to have a Bloom filter (or similar data structure such as a Cuckoo filter) for each segment that summarises the keys contained within that segment. That allows the read operation to only make a disk read for those segments that actually contain the desired key (with a small probability of making unnecessary disk reads due to Bloom filter false positives).
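To make the Bloom filter idea concrete, here is a small sketch in Python. It is not how any particular database implements it: the class name, the bit-array size and the trick of salting SHA-256 to get several hash functions are all assumptions chosen to keep the example short.

```python
import hashlib


class BloomFilter:
    """Summarises a set of keys; answers "definitely not" or "maybe"."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key is certainly absent; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


segment_keys = ["apple", "banana", "cherry"]
bloom = BloomFilter()
for k in segment_keys:
    bloom.add(k)

# Only segments whose filter says "maybe" need a disk read.
for k in segment_keys:
    print(k, bloom.might_contain(k))
```

A key that was added always yields True, so a segment that holds the key is never skipped. The only cost of a false positive is one wasted disk read.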
Are these two terms used interchangeably?
I have read about how SSTable works, and usually, articles just start mentioning LSM Tree.
However, they seem to be the same thing.
When should I use one term over the other?
Probably one of the best explanations of SSTables and LSM-Trees for mortals is given by Martin Kleppmann in his "Designing Data-Intensive Applications" book. These data structures are explained in chapter 3, "Storage and Retrieval", pages 69 through 79. It's a really great read, I would recommend the whole book!
Impatient ones could find my synopsis of the topic below.
Everything starts with a very dumb key-value database implemented as just two Bash functions:
db_set () {
    echo "$1,$2" >> database
}
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
The idea is to store the data in a CSV-like file:
$ source database.sh
$ db_set 1 'Anakin Skywalker'
$ db_set 2 'Luke Skywalker'
$ db_set 1 'Darth Vader'
$ cat database
1,Anakin Skywalker
2,Luke Skywalker
1,Darth Vader
$ db_get 1
Darth Vader
Note that the first value for the key 1 is overridden by the subsequent write.
This database has pretty good write performance: db_set just appends the data to a file, which is generally fast. But reads are inefficient, especially on huge data sets: db_get scans the entire file. Thus, writes are O(1) and reads are O(n).
Next, indices are introduced. An index is a data structure derived from the data itself. Maintaining an index always incurs additional costs, thus, indices always degrade write performance with the benefit of improving the reads.
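One of the simplest possible indices is a hash index. This index is nothing more than a dictionary holding the byte offsets of the records in the database. Continuing the previous example, and assuming every char (including the newline that ends each record) is one byte, the hash index would map key 1 to offset 36, the start of its most recent record "1,Darth Vader", and key 2 to offset 19.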
Whenever you write data into the database, you also update the index. When you want to read a value for a given key, you could quickly look up an offset in the database file. Having the offset, you could use a "seek" operation to jump straight to the data location. Depending on the particular index implementation you could expect a logarithmic complexity for both reads and writes.
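Here is a minimal Python sketch of that hash index, offered as an illustration rather than the book's code. The class name `HashIndexedLog` is made up, an in-memory `io.BytesIO` stands in for the database file, and keys are assumed not to contain commas.

```python
import io


class HashIndexedLog:
    def __init__(self):
        self.log = io.BytesIO()   # stands in for the append-only database file
        self.index = {}           # key -> byte offset of the latest record

    def set(self, key, value):
        offset = self.log.seek(0, io.SEEK_END)   # current end of the file
        self.log.write(f"{key},{value}\n".encode())
        self.index[key] = offset                 # update the index on every write

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)                    # jump straight to the record
        line = self.log.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]


db = HashIndexedLog()
db.set("1", "Anakin Skywalker")
db.set("2", "Luke Skywalker")
db.set("1", "Darth Vader")
print(db.index)     # {'1': 36, '2': 19}
print(db.get("1"))  # Darth Vader
```

The read no longer scans the file: it does one dictionary lookup, one seek and one line read.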
Next, Martin deals with the storage efficiency. Appending data to a database file exhausts disk space quickly. The fewer distinct keys you have, the more inefficient this append-only storage engine is. The solution to this problem is compaction:
When a database file grows to a certain size, you stop appending to it, create a new file (called a segment) and redirect all the writes to this new file.
Segments are immutable in the sense that they are never used to append any new data. The only way to modify a segment is to write its content into a new file, possibly with some transformations in between.
So, the compaction creates new segments containing only the most recent records for each key. Another possible enhancement at this step is merging multiple segments into a single one. Compaction and merging could be done, of course, in background. Old segments are just thrown away.
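A toy version of compaction plus merging might look like the following Python sketch. The function name `compact` and the sample records are assumptions made for this example, and segments are modelled as in-memory lists of (key, value) pairs in write order.

```python
def compact(segments):
    """Merge segments (ordered oldest -> newest) into one new segment,
    keeping only the most recent value for each key."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value   # later records overwrite earlier ones
    return list(latest.items())


segment_1 = [("1", "Anakin Skywalker"), ("2", "Luke Skywalker")]
segment_2 = [("1", "Darth Vader"), ("3", "Leia Organa")]

merged = compact([segment_1, segment_2])
print(merged)
# [('1', 'Darth Vader'), ('2', 'Luke Skywalker'), ('3', 'Leia Organa')]
```

This version needs the set of distinct keys to fit in memory, which is the same constraint the hash index has.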
Every segment, including the one being written to, has its own index. So, when you want to find the value for a given key, you search those indices in reverse chronological order: from the most recent, to the oldest.
So far we have a data structure having these pros:
✔️ Sequential writes are generally faster than random ones
✔️ Concurrency is easy to control having a single writer process
✔️ Crash recovery is easy to implement: just read all the segments sequentially, and store the offsets in the in-memory index
✔️ Merging and compaction help to avoid data fragmentation
However, there are some limitations as well:
❌ Crash recovery could be time-consuming if segments are large and numerous
❌ Hash index must fit in memory. Implementing on-disk hash tables is much more difficult
❌ Range queries (BETWEEN) are virtually impossible
Now, with this background, let's move to the SSTables and LSM-trees. By the way, these abbreviations mean "Sorted String Tables" and "Log-Structured Merge Trees" respectively.
SSTables are very similar to the "database" that we've seen previously. The only improvement is that we require records in segments to be sorted by key. This might seem to break the ability to use append-only writes, but that's what LSM-Trees are for. We'll see in a moment!
SSTables have some advantages over those simple segments we had previously:
✔️ Merging segments is more efficient because the records are pre-sorted. All you have to do is compare the segment "heads" on each iteration and choose the lowest one. If multiple segments contain the same key, the value from the most recent segment wins. This compact & merge process also preserves the sort order of the keys.
✔️ With keys sorted, you don't need to have every single key in the index anymore. If the key B is known to be somewhere between keys A and C, you could just do a scan. This also means that range queries are possible!
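The "compare the heads" merge can be sketched with Python's standard `heapq.merge`, which lazily picks the smallest head among several sorted inputs. The function name `merge_sstables` and the age-tagging trick are assumptions made for this illustration, not something taken from the book.

```python
import heapq


def merge_sstables(segments):
    """segments: key-sorted lists of (key, value) pairs, oldest first.
    Returns one key-sorted list with the newest value for each key."""
    # Tag each record with -age so that, for equal keys, the newest sorts first.
    tagged = (
        [(key, -age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    )
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))   # first occurrence is the newest
    return merged


older = [("a", 1), ("c", 3)]
newer = [("b", 20), ("c", 30)]
print(merge_sstables([older, newer]))  # [('a', 1), ('b', 20), ('c', 30)]
```

Because every input is already sorted, the merge only ever looks at the current head of each segment, so it can stream segments much larger than memory.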
The final question is: how do you get the data sorted by key?
The idea, described by Patrick O'Neil et al. in their "The Log-Structured Merge-Tree (LSM-Tree)", is simple: there are in-memory data structures, such as red-black trees or AVL-trees, that are good at sorting data. So, you split writes into two stages. First, you write the data into the in-memory balanced tree. Second, you flush that tree to the disk. Actually, there may be more than two stages, with deeper ones being bigger and "slower" than the upper ones (as shown in the other answer).
When a write comes, you add it to the in-memory balanced tree, called memtable.
When the memtable grows big, it is flushed to the disk. It is already sorted, so it naturally creates an SSTable segment.
Meanwhile, writes are processed by a fresh memtable.
Reads are looked up first in the memtable, then in the segments, from the most recent one to the oldest.
Segments are compacted and merged from time to time in background as described previously.
The scheme is not perfect: it could suffer from sudden crashes, because the memtable, being an in-memory data structure, is lost. This issue could be solved by maintaining another append-only file that basically duplicates the contents of the memtable. The database only needs to read it after a crash to re-create the memtable.
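The whole write path, including the memtable backup, can be sketched in a few dozen lines of Python. Everything here is an assumption made for illustration: the `TinyLSM` name, the `wal.log` file name, the tiny memtable limit, a plain dict sorted at flush time instead of a balanced tree, and segments kept as in-memory lists where a real engine would write them to disk.

```python
import json
import os
import tempfile


class TinyLSM:
    def __init__(self, directory, memtable_limit=2):
        self.wal_path = os.path.join(directory, "wal.log")
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.segments = []   # key-sorted lists, oldest first
        self._recover()

    def _recover(self):
        # After a restart, replay the log to rebuild the memtable.
        if os.path.exists(self.wal_path):
            with open(self.wal_path) as wal:
                for line in wal:
                    key, value = json.loads(line)
                    self.memtable[key] = value

    def put(self, key, value):
        with open(self.wal_path, "a") as wal:   # append-only backup first
            wal.write(json.dumps([key, value]) + "\n")
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Sorting here plays the role of walking a balanced tree in order.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}
        os.remove(self.wal_path)   # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment first
            for k, v in segment:
                if k == key:
                    return v
        return None


with tempfile.TemporaryDirectory() as d:
    db = TinyLSM(d)
    db.put("b", 2)
    db.put("a", 1)   # the memtable reaches its limit and is flushed, sorted
    db.put("c", 3)   # lives only in the memtable and the log
    flushed = db.segments[0]
    recovered = TinyLSM(d).memtable   # simulate a restart after a crash

print(flushed)     # [('a', 1), ('b', 2)]
print(recovered)   # {'c': 3}
```

Only the unflushed write has to be replayed after the simulated crash, which is why recovery stays cheap.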
And that's it! Note that all the issues of a simple append-only storage, described above, are now solved:
✔️ Now there is only one file to read in case of a crash: the memtable backup
✔️ Indices could be sparse, thus fitting the RAM is easier
✔️ Range queries are now possible
TLDR: An SSTable is a key-sorted, append-only key-value store. An LSM-tree is a layered data structure, based on a balanced tree, that lets SSTables be both sorted and append-only without contradiction.
Congrats, you've finished this long read! If you enjoyed the explanation, make sure to upvote not only this post but some of Martin's answers here as well. Remember: all credit goes to him!
It is explained very well in the paper "LSM-based storage techniques: a survey", in sections 1 and 2.2.1.
An LSM-tree consists of some memory components and some disk components. Basically, an SSTable is just one implementation of the disk component of an LSM-tree.
The SSTable is explained in the above-mentioned paper:
An SSTable (Sorted String Table) contains a list of data blocks and an
index block; a data block stores key-value pairs ordered by keys, and
the index block stores the key ranges of all data blocks.
A Sorted Strings Table (SSTable) is a file of key/value string pairs, sorted by key.
However, LSM Tree is different:
In computer science, the log-structured merge-tree (or LSM tree) is a
data structure with performance characteristics that make it
attractive for providing indexed access to files with high insert
volume, such as transactional log data. LSM trees, like other search
trees, maintain key-value pairs. LSM trees maintain data in two or
more separate structures, each of which is optimized for its
respective underlying storage medium; data is synchronized between the
two structures efficiently, in batches.
https://en.wikipedia.org/wiki/Log-structured_merge-tree
Actually, the term LSM tree was made official by Patrick O'Neil's paper The Log-Structured Merge-Tree (LSM-Tree).
This was published in 1996.
The term SSTable was coined by Google's Bigtable: A Distributed Storage System for Structured Data in 2006
Conceptually, an SSTable is something that provides indexing for (mostly) LSM-tree-based storage engines (e.g. Lucene). It's not so much about a difference as about how concepts can exist in academia for a long time and only get named later.
Going through the above two papers will tell you a lot.
I have heard statements such as: given the height of an AVL tree and the maximum number of keys an AVL tree node can contain, searching an AVL tree will be time-consuming because of disk I/O.
However, imagine that an index file contains the whole AVL tree structure and that the size of the index file is less than a fan size; then we can just read the whole AVL tree in a single disk I/O.
It seems that using an AVL tree does not incur extra disk I/O, so how do you explain that a B-tree is better?
Databases use balanced trees (B+ trees). An AVL tree is only a special case of a balanced tree, so there is no need for it.
we can just read the whole AVL tree in a single disk I/O
Yes, it could work like that. Essentially, the whole data structure would be brought into memory. IO would no longer be a concern.
Some databases use this strategy. For example, SQL Server In-Memory "Hekaton" does this and delivers ~100x the normal throughput for OLTP.
Hekaton uses two index data structures: hash tables and trees. I think the trees are called Bw-trees and are similar to B-trees.
For general purpose database workloads it is very desirable to not need everything in memory. B-trees are a great design tradeoff in those cases.
It's because B-trees usually hold a large number of keys in a single node, which reduces the depth of the search. In record indexing, following links takes longer the deeper the tree is, so the tree is made wide rather than deep: multiple keys are stored in an array within each node, which improves cache locality and makes lookups comparatively quick.
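To put rough numbers on the depth argument, here is a small Python calculation. The one billion keys and the fan-out of 500 keys per page-sized node are assumptions picked for illustration, and the binary case uses a perfectly balanced tree, which is the best an AVL tree can do.

```python
def levels_needed(num_keys, fanout):
    """Smallest number of levels such that fanout**levels >= num_keys."""
    levels, capacity = 0, 1
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


n = 1_000_000_000                       # assumed number of keys
binary_levels = levels_needed(n, 2)     # balanced binary tree
btree_levels = levels_needed(n, 500)    # assumed ~500 keys per node

print(binary_levels)  # 30
print(btree_levels)   # 4
```

If each level that is not already cached costs one block read, a lookup touches about 4 blocks in the wide tree versus about 30 in the binary one.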
Using two databases to illustrate this example: CouchDB and Cassandra.
CouchDB
CouchDB uses a B+ Tree for document indexes (using a clever modification to work in their append-only environment) - more specifically, as documents are modified (insert/update/delete) they are appended to the running database file, and the full Leaf -> Node path of all the B+ tree nodes affected by the updated revision is appended right after the document.
These piece-mealed index revisions are inlined right alongside the modifications such that the full index is a union of the most recent index modifications appended at the end of the file along with additional pieces further back in the data file that are still relevant and haven't been modified yet.
Searching the B+ tree is O(logn).
Cassandra
Cassandra keeps record keys sorted, in-memory, in tables (let's think of them as arrays for this question) and writes them out as separate (sorted) sorted-string tables from time to time.
We can think of the collection of all of these tables as the "index" (from what I understand).
Cassandra is required to compact/combine these sorted-string tables from time to time, creating a more complete file representation of the index.
Searching a sorted array is O(logn).
Question
Assuming a similar level of complexity between either maintaining partial B+ tree chunks in CouchDB versus partial sorted-string indices in Cassandra and given that both provide O(logn) search time which one do you think would make a better representation of a database index and why?
I am specifically curious if there is an implementation detail about one over the other that makes it particularly attractive or if they are both a wash and you just pick whichever data structure you prefer to work with/makes more sense to the developer.
Thank you for the thoughts.
When comparing a BTree index to an SSTable index, you should consider the write complexity:
When writing randomly to a copy-on-write BTree, you will incur random reads (to copy the leaf node and its path). So while the writes may be sequential on disk, for datasets larger than RAM these random reads will quickly become the bottleneck. For an SSTable-like index, no such read occurs on write; there are only sequential writes.
You should also consider that in the worst case, every update to a BTree could incur log_b N IOs; that is, you could end up writing 3 or 4 blocks for every key. If the key size is much less than the block size, this is extremely expensive. For an SSTable-like index, each write IO will contain as many fresh keys as it can, so the IO cost for each key is more like 1/B.
In practice, this makes SSTable-like indexes thousands of times faster (for random writes) than BTrees.
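A back-of-the-envelope calculation shows where a factor of that size can come from. The block size, entry size and key count below are assumptions chosen for illustration, and the sketch ignores the later cost of merging SSTables.

```python
block_size = 4096           # bytes per block (assumed)
entry_size = 16             # bytes per key/value entry (assumed)
num_keys = 1_000_000_000    # N (assumed)

entries_per_block = block_size // entry_size   # B

# Worst case for a copy-on-write BTree: rewrite one block per level.
levels, capacity = 0, 1
while capacity < num_keys:
    capacity *= entries_per_block
    levels += 1
btree_ios_per_key = levels

# An SSTable flush packs B fresh keys into every block it writes.
sstable_ios_per_key = 1 / entries_per_block

ratio = btree_ios_per_key / sstable_ios_per_key
print(entries_per_block)   # 256
print(btree_ios_per_key)   # 4
print(ratio)               # 1024.0
```

With these numbers a random BTree write costs about a thousand times more block writes per key than a batched SSTable flush. Smaller entries or larger blocks push the ratio higher.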
When considering implementation details, we have found it a lot easier to implement SSTable-like indexes (almost) lock-free, whereas locking strategies for BTrees have become quite complicated.
You should also reconsider your read costs. You are correct that a BTree is O(log_b N) random IOs for random point reads, but an SSTable-like index is actually O(#sstables * log_b N). Without a decent merge scheme, #sstables is proportional to N. There are various tricks to get round this (using Bloom filters, for instance), but these don't help with small, random range queries. This is what we found with Cassandra:
Cassandra under heavy write load
This is why Castle, our (GPL) storage engine, does merges slightly differently, and can achieve much better (O(log^2 N)) range query performance with a slight trade-off in write performance (O(log^2 N / B)). In practice we find it to be quicker than Cassandra's SSTable index for writes as well.
If you want to know more about this, I've given a talk about how it works:
podcast
slides
Some things that should also be mentioned about each approach:
B-trees
The read/write operations are supposed to be logarithmic O(logn). However, a single database write can lead to multiple writes in the storage system. For example, when a node is full, it has to be split and that means that there will be 2 writes for the 2 new nodes and 1 additional write for updating the parent node. You can see how that could increase if the parent node was also full.
Usually, B-trees are stored in such a way that each node has the size of a page. This creates a phenomenon called write amplification, where a whole page is written even if only a single byte needs to be updated.
Writes are usually random (not sequential), thus slower especially for magnetic disks.
SSTables
SSTables are usually used in the following approach. There is an in-memory structure, called memtable, as you described. Every once in a while, this structure is flushed to disk to an SSTable. As a result, all the writes go to the memtable, but the reads might not be in the current memtable, in which case they are searched in the persisted SSTables.
As a result, writes are O(logn). However, always bear in mind that they are done in memory, so they should be orders of magnitude faster than the logarithmic operations in disk of B-trees. For the sake of completeness, we should mention that writes are also written to a write-ahead log for crash recovery. But, given that these are all sequential writes, they are expected to be much more efficient than the random writes of B-trees.
When served from memory (from the memtable), reads are expected to be much faster as well. But when there is a need to look in the older, disk-based SSTables, reads can become considerably slower than with B-trees. There are several optimisations around that, such as the use of Bloom filters to check whether an SSTable contains a value without performing disk reads.
As you mentioned, there's also a background process, called compaction, used to merge SSTables. This helps remove deleted values and prevent fragmentation, but it can cause significant write load, affecting the write throughput of the incoming operations.
As it becomes evident, a comparison between these 2 approaches is much more complicated. In an extremely simplified attempt to provide a concrete comparison, I think we could say that:
SSTables provide a much better write throughput than B-trees. However, they are expected to have less stable behaviour, because of ongoing compactions. An example of this can be seen in this benchmark comparison.
B-trees are usually preferred for use cases where transaction semantics are needed. This is because each key can be found in only a single place (in contrast to SSTables, where it could exist in multiple SSTables with obsolete values in some of them), and also because one can represent a range of values as part of the tree. This makes it easier to implement key-level and range-level locking mechanisms.
References
[1] A Performance Comparison of LevelDB and MySQL
[2] Designing Data-intensive Applications
I think fractal trees, as used by Tokutek, are a better index for a database. They offer real-world 20x to 80x improvements over b-trees.
There are excellent explanations of how fractal tree indices work here.
LSM-trees are better than B-trees as a storage engine structure.
In a way, they convert random writes into append-only file (AOF) writes.
Here is an LSM-tree source:
https://github.com/shuttler/lsmtree