How to maintain the sparse index in an LSM-tree?

In Designing Data-Intensive Applications, Martin Kleppmann introduces a data structure called the LSM-tree.
There are three main parts: an in-memory memtable (usually a red-black tree), an in-memory sparse index, and on-disk SSTables (aka segments). They work together like this:
When a write happens, it first goes to the memtable, and when the memtable becomes full, all of its data is flushed into a new segment (with all the keys sorted).
When a read happens, it first looks up the memtable. If the key doesn't exist there, it looks up the sparse index to learn which segment the key may reside in. See figure 1.
Periodically, compaction happens that merges multiple segments into one. See figure 2.
As you can tell from figure 2, keys are sorted within a segment; however, keys are NOT sorted between segments. This makes me wonder: how do we maintain the sparse index such that keys in the index have increasing offsets?

A typical approach is to have a separate index per segment file, and this index is re-generated during compaction/merging of segment files. When reading a key, we then have to check multiple current segment files that may contain the key, and return the value that appears in the most recent of those segments.
It's not possible to tell just from looking at the index whether a particular segment contains a particular key. To avoid having to do a disk read for every segment, a common optimisation is to have a Bloom filter (or similar data structure such as a Cuckoo filter) for each segment that summarises the keys contained within that segment. That allows the read operation to only make a disk read for those segments that actually contain the desired key (with a small probability of making unnecessary disk reads due to Bloom filter false positives).
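A minimal sketch of that read path in Python (illustrative only: an in-memory list stands in for the on-disk segment file, and a plain set stands in for the Bloom filter, so it never produces false positives):
import bisect

class Segment:
    """One immutable segment: key-sorted records, a sparse index, and a key summary."""
    def __init__(self, records, every=100):
        self.records = records                       # sorted list of (key, value) pairs
        self.summary = {k for k, _ in records}       # stands in for a Bloom/Cuckoo filter
        # sparse index: every Nth key and its position ("offset") within the segment
        self.index = [(records[i][0], i) for i in range(0, len(records), every)]
        self.every = every

    def get(self, key):
        if key not in self.summary:                  # filter check, avoids touching the "disk"
            return None
        pos = bisect.bisect_right([k for k, _ in self.index], key) - 1
        if pos < 0:
            return None
        start = self.index[pos][1]
        for k, v in self.records[start:start + self.every]:   # bounded scan ("one disk read")
            if k == key:
                return v
        return None

def read(segments, key):
    """segments is ordered newest-first; the first segment containing the key wins."""
    for seg in segments:
        value = seg.get(key)
        if value is not None:
            return value
    return None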

Related

What is the difference between the terms SSTable and LSM Tree?

Are these two terms used interchangeably?
I have read about how SSTable works, and usually, articles just start mentioning LSM Tree.
However, they seem to be the same thing.
When should I use one term over the other?
Probably one of the best explanations of SSTables and LSM-Trees for mortals is given by Martin Kleppmann in his "Designing Data-Intensive Applications" book. These data structures are explained in chapter 3, "Storage and Retrieval", pages 69 through 79. It's a really great read, I would recommend the whole book!
Impatient ones could find my synopsis of the topic below 👇
Everything starts with a very dumb key-value database implemented as just two Bash functions:
db_set () {
  # append the key-value pair to the end of the database file
  echo "$1,$2" >> database
}
db_get () {
  # scan the whole file for the key; tail -n 1 returns the most recent value
  grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
The idea is to store the data in a CSV-like file:
$ source database.sh
$ db_set 1 'Anakin Skywalker'
$ db_set 2 'Luke Skywalker'
$ db_set 1 'Darth Vader'
$ cat database
1,Anakin Skywalker
2,Luke Skywalker
1,Darth Vader
$ db_get 1
Darth Vader
Note that the first value for the key 1 is overridden by the subsequent write.
This database has pretty good write performance: db_set just appends the data to a file, which is generally fast. But reads are inefficient, especially on huge data sets: db_get scans the entire file. Thus, writes are O(1) and reads are O(n).
Next, indices are introduced. An index is a data structure derived from the data itself. Maintaining an index always incurs additional costs; thus, indices always degrade write performance, with the benefit of improving reads.
One of the simplest possible indices is a hash index. This index is nothing more than a dictionary holding byte offsets of the records in a database. Continuing the previous example, assuming every char is one byte, the hash index would look like this:
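Something like this, with the byte offsets worked out from the example above (one byte per character, newline included; the layout is only an illustration):
# "1,Anakin Skywalker\n"  occupies bytes  0..18
# "2,Luke Skywalker\n"    occupies bytes 19..35
# "1,Darth Vader\n"       occupies bytes 36..49
hash_index = {
    "1": 36,   # was 0, overwritten when "Darth Vader" was appended
    "2": 19,
}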
Whenever you write data into the database, you also update the index. When you want to read a value for a given key, you quickly look up its offset in the index. Having the offset, you can use a "seek" operation to jump straight to the data location in the database file. Depending on the particular index implementation, you can expect constant-time (hash) or logarithmic (tree-based) complexity for both reads and writes.
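A minimal sketch of such an indexed read in Python (db_get_indexed and the file name "database" are just illustrative):
def db_get_indexed(key, index, path="database"):
    """Jump straight to the record using the byte offset stored in the hash index."""
    with open(path, "rb") as f:
        f.seek(index[key])                      # the "seek" operation
        record = f.readline().decode()          # read exactly one record
        return record.split(",", 1)[1].rstrip("\n")

# db_get_indexed("1", hash_index)  ->  'Darth Vader'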
Next, Martin deals with storage efficiency. Appending data to a database file exhausts disk space quickly. The fewer distinct keys you have, the more inefficient this append-only storage engine is. The solution to this problem is compaction:
When a database file grows to a certain size, you stop appending to it, create a new file (called segment) and redirect all the writes to this new file.
Segments are immutable in the sense that they are never used to append any new data. The only way to modify a segment is to write its content into a new file, possibly with some transformations in between.
So, compaction creates new segments containing only the most recent records for each key. Another possible enhancement at this step is merging multiple segments into a single one. Compaction and merging could be done, of course, in the background. Old segments are just thrown away.
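A minimal sketch of compaction and merging in Python, assuming each segment is just the list of (key, value) pairs in the order they were appended:
def compact_and_merge(segments):
    """Merge append-only segments, oldest first, keeping only the most recent value per key."""
    latest = {}
    for segment in segments:              # segments ordered oldest -> newest
        for key, value in segment:        # within a segment, later records win too
            latest[key] = value
    return list(latest.items())           # contents of the new, compacted segment

old_segments = [
    [("1", "Anakin Skywalker"), ("2", "Luke Skywalker")],
    [("1", "Darth Vader")],
]
print(compact_and_merge(old_segments))    # [('1', 'Darth Vader'), ('2', 'Luke Skywalker')]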
Every segment, including the one being written to, has its own index. So, when you want to find the value for a given key, you search those indices in reverse chronological order: from the most recent, to the oldest.
So far we have a data structure having these pros:
βœ”οΈ Sequential writes are generally faster than random ones
βœ”οΈ Concurrency is easy to control having a single writer process
βœ”οΈ Crash recovery is easy to implement: just read all the segments sequentially, and store the offsets in the in-memory index
βœ”οΈ Merging and compaction help to avoid data fragmentation
However, there are some limitations as well:
❗ Crash recovery could be time-consuming if segments are large and numerous
❗ Hash index must fit in memory. Implementing on-disk hash tables is much more difficult
❗ Range queries (BETWEEN) are virtually impossible
Now, with this background, let's move on to SSTables and LSM-trees. By the way, these abbreviations stand for "Sorted String Tables" and "Log-Structured Merge Trees" respectively.
SSTables are very similar to the "database" that we've seen previously. The only improvement is that we require records in segments to be sorted by key. This might seem to break the ability to use append-only writes, but that's what LSM-trees are for. We'll see in a moment!
SSTables have some advantages over those simple segments we had previously:
βœ”οΈ Merging segments is more efficient due to the records being pre-sorted. All you have to do is to compare segment "heads" on each iteration and choose the lowest one. If multiple segments contain the same key, the value from the most recent segment wins. This compact & merge process also holds the sorting of the keys.
βœ”οΈ With keys sorted, you don't need to have every single key in the index anymore. If the key B is known to be somewhere between keys A and C you could just do a scan. This also means that range queries are possible!
The final question is: how do you get the data sorted by key?
The idea, described by Patrick O'Neil et al. in their paper "The Log-Structured Merge-Tree (LSM-Tree)", is simple: there are in-memory data structures, such as red-black trees or AVL trees, that are good at sorting data. So, you split writes into two stages. First, you write the data into the in-memory balanced tree. Second, you flush that tree to disk. Actually, there may be more than two stages, with deeper ones being bigger and "slower" than the upper ones (as shown in the other answer).
When a write comes, you add it to the in-memory balanced tree, called memtable.
When the memtable grows big, it is flushed to the disk. It is already sorted, so it naturally creates an SSTable segment.
Meanwhile, writes are processed by a fresh memtable.
Reads are first looked up in the memtable, then in the segments, from the most recent one to the oldest.
Segments are compacted and merged from time to time in background as described previously.
The scheme is not perfect: it can suffer from sudden crashes, because the memtable, being an in-memory data structure, is lost. This issue is solved by maintaining another append-only file that basically duplicates the contents of the memtable. The database only needs to read it after a crash to re-create the memtable.
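Putting it together, a minimal sketch of the write path in Python (a plain dict stands in for the balanced tree, and names like MEMTABLE_LIMIT and memtable.log are made up for the illustration):
MEMTABLE_LIMIT = 4                        # flush threshold; real engines use a size in bytes

memtable = {}                             # stands in for the red-black/AVL tree
segments = []                             # flushed segments, newest last; each is a sorted list of (key, value)
wal = open("memtable.log", "a")           # append-only backup of the current memtable

def db_set(key, value):
    wal.write(f"{key},{value}\n")         # 1. append to the backup log first (sequential write)
    wal.flush()
    memtable[key] = value                 # 2. then update the in-memory table
    if len(memtable) >= MEMTABLE_LIMIT:
        flush_memtable()

def flush_memtable():
    segments.append(sorted(memtable.items()))   # sorted() plays the tree's role: the segment is key-sorted
    memtable.clear()
    wal.truncate(0)                       # the backup only needs to cover the live memtable

def db_get(key):
    if key in memtable:                   # 1. look in the memtable first
        return memtable[key]
    for segment in reversed(segments):    # 2. then in the segments, newest to oldest
        for k, v in segment:              # (a real engine would use the sparse index instead of scanning)
            if k == key:
                return v
    return None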
And that's it! Note that all the issues of a simple append-only storage, described above, are now solved:
βœ”οΈ Now there is only one file to read in a case of a crash: the memtable backup
βœ”οΈ Indices could be sparse, thus fitting the RAM is easier
βœ”οΈ Range queries are now possible
TLDR: An SSTable is a key-sorted, append-only key-value storage. An LSM-tree is a layered data structure, based on a balanced tree, that allows SSTables to exist without the contradiction of being both sorted and append-only at the same time.
Congrats, you've finished this long read! If you enjoyed the explanation, make sure to upvote not only this post but some of Martin's answers here as well. Remember: all credit goes to him!
It is explained very well in the paper LSM-based Storage Techniques: A Survey, sections 1 and 2.2.1.
An LSM-tree consists of some memory components and some disk components. Basically, an SSTable is just one implementation of a disk component for an LSM-tree.
The SSTable is explained in the above-mentioned paper:
An SSTable (Sorted String Table) contains a list of data blocks and an index block; a data block stores key-value pairs ordered by keys, and the index block stores the key ranges of all data blocks.
A Sorted Strings Table (SSTable) is a file of key/value string pairs, sorted by keys.
However, LSM Tree is different:
In computer science, the log-structured merge-tree (or LSM tree) is a data structure with performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. LSM trees, like other search trees, maintain key-value pairs. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches.
https://en.wikipedia.org/wiki/Log-structured_merge-tree
Actually, the term LSM tree was made official by Patrick O'Neil's paper The Log-Structured Merge-Tree (LSM-Tree), published in 1996.
The term SSTable was coined in Google's 2006 paper Bigtable: A Distributed Storage System for Structured Data.
Conceptually, an SSTable is something that provides indexing for (mostly) LSM-tree-based storage engines (e.g., Lucene). It's not so much about the difference as about how, in academia, concepts can exist for a long time but only get named later on.
Going through the above two papers will tell you a lot.

Why does adding an index to a database field speed up searching over that field?

I'm new to databases and have been reading that adding an index to a field you need to search over can dramatically speed up search times. I understand this reality, but am curious as to how it actually works. I've searched a bit on the subject, but haven't found any good, concise, and not overly technical answer to how it works.
I've read the analogy of it being like an index at the back of a book, but in the case of a data field of unique elements (such as e-mail addresses in a user database), using the back-of-the-book analogy would seem to provide the same linear look-up time as a non-indexed search.
What is going on here to speed up search times so much? I've read a little bit about searching using B+-trees, but the descriptions were a bit too in-depth. What I'm looking for is a high-level overview of what is going on, something to help my conceptual understanding of it, not technical details.
Expanding on the search algorithm efficiencies, a key area in database performance is how fast the data can be accessed.
In general, reading data from a disk is a lot slower than reading data from memory.
To illustrate the point, let's assume everything is stored on disk. If you need to search through every row of data in a table looking for certain values in a field, you still need to read the entire row of data from the disk to see if it matches - this is commonly referred to as a 'table scan'.
If your table is 100MB, that's 100MB you need to read from disk.
If you now index the column you want to search on, in simplistic terms the index will store each unique value of the data and a reference to the exact location of the corresponding full row of data. This index may now only be 10MB compared to 100MB for the entire table.
Reading 10MB of data from the disk (and maybe a bit extra to read the full row data for each match) is roughly 10 times faster than reading the 100MB.
Different databases will store indexes or data in memory in different ways to make these things much faster. However, if your data set is large and doesn't fit in memory then the disk speed can have a huge impact and indexing can show huge gains.
In memory there can still be large performance gains (amongst other efficiencies).
In general, that's why you may not notice any tangible difference with indexing a small dataset which easily fits in memory.
The underlying details will vary between systems and actually will be a lot more complicated, but I've always found the disk reads vs. memory reads an easily understandable way of explaining this.
Okay, after a bit of research and discussion, here is what I have learned:
Conceptually, an index is a sorted copy of the data field it is indexing, where each index value points to its original (unsorted) row. Because the database knows how the values are sorted, it can apply more sophisticated search algorithms than just looking for the value from start to finish. The binary search algorithm is a simple example of a searching algorithm for sorted lists and reduces the maximum search time from O(n) to O(log n).
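For instance, with a hypothetical index on an email column stored as a sorted list of (email, row id) pairs, the lookup becomes a binary search rather than a scan:
import bisect

email_index = [                           # sorted copy of the column, each entry pointing to its row
    ("alice@example.com", 17),
    ("bob@example.com", 3),
    ("carol@example.com", 42),
]

def find_row(email):
    """Binary search: O(log n) instead of reading every row."""
    keys = [e for e, _ in email_index]
    i = bisect.bisect_left(keys, email)
    if i < len(email_index) and email_index[i][0] == email:
        return email_index[i][1]          # location of the full row
    return None

print(find_row("bob@example.com"))        # 3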
As a side note: a decent sorting algorithm generally takes O(n log n) to complete, which means (as we've all probably heard before) you should only put indexes on fields you will search often, as it's a bit more expensive to add the index (which includes a sort) than it is to do a full search a few times. For example, in a large database of >1,000,000 entries it's on the order of 20x more expensive to sort than to do a single full (linear) search, since log2(1,000,000) is roughly 20.
Edit:
See Jarod Elliott's answer for a more in-depth look at search efficiencies, specifically in regard to read-from-disk operations.
To continue your back-of-the-book analogy, if the pages were in order by that element it would be the same look-up time as a non-indexed search, yes.
However, what if your book were a list of book reviews ordered by author, but you only knew the ISBN. The ISBN is unique, yes, but you'd still have to scan each review to find the one you are looking for.
Now, add an index at the back of the book, sorted by ISBN. Boom, fast search time. This is analogous to the database index, going from the index key (ISBN) to the actual data row (in this case a page number of your book).

Optimize I/O to a large file in C

I have written a C program that reads data from a huge file (>3 GB). Each record in the file is a key-value pair. Whenever a query comes, the program searches for the key and retrieves the corresponding value, and similarly for updating the value.
The queries are coming at a fast rate so this technique will eventually fail.
The worst case access time is too large. Creating an in-memory object will again be a bad idea, because of the size.
Is there any way in which this problem can be sorted out?
Sure seems to me a file of that size wrapping a series of name-value pairs is begging to be migrated to an actual database; failing that, I'd probably at least explore the idea of a memory-mapped file, with only portions resident at any given time...
How large are the keys, in comparison to their corresponding values? If they are significantly smaller, you might try creating a table in memory between the keys and the corresponding locations within the file of their values.
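A rough sketch of that idea (in Python rather than C, just to show the shape of it; the one-record-per-line "key,value" layout is an assumption):
def build_offset_index(path):
    """One pass over the big file: remember the byte offset where each key's
    latest record starts. Only keys and offsets are held in memory."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            key = line.split(b",", 1)[0]
            index[key] = offset           # later records for the same key overwrite the offset
            offset += len(line)
    return index

# to answer a query: f.seek(index[key]) and read just that one record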

Sorted String Table (SSTable) or B+ Tree for a Database Index?

Using two databases to illustrate this example: CouchDB and Cassandra.
CouchDB
CouchDB uses a B+ tree for document indexes (using a clever modification to work in their append-only environment). More specifically, as documents are modified (insert/update/delete), they are appended to the running database file, along with a full leaf-to-node path from the B+ tree covering all the nodes affected by the updated revision, right after the document.
These piecemeal index revisions are inlined right alongside the modifications, such that the full index is a union of the most recent index modifications appended at the end of the file along with additional pieces further back in the data file that are still relevant and haven't been modified yet.
Searching the B+ tree is O(log n).
Cassandra
Cassandra keeps record keys sorted, in memory, in tables (let's think of them as arrays for this question) and writes them out as separate sorted-string tables from time to time.
We can think of the collection of all of these tables as the "index" (from what I understand).
Cassandra is required to compact/combine these sorted-string tables from time to time, creating a more complete file representation of the index.
Searching a sorted array is O(log n).
Question
Assuming a similar level of complexity between maintaining partial B+ tree chunks in CouchDB and partial sorted-string indices in Cassandra, and given that both provide O(log n) search time, which one do you think would make a better representation of a database index, and why?
I am specifically curious if there is an implementation detail about one over the other that makes it particularly attractive or if they are both a wash and you just pick whichever data structure you prefer to work with/makes more sense to the developer.
Thank you for the thoughts.
When comparing a BTree index to an SSTable index, you should consider the write complexity:
When writing randomly to a copy-on-write BTree, you will incur random reads (to do the copy of the leaf node and path). So while the writes may be sequential on disk, for datasets larger than RAM these random reads will quickly become the bottleneck. For an SSTable-like index, no such read occurs on write: there will only be the sequential writes.
You should also consider that in the worst case, every update to a BTree could incur log_b N IOs - that is, you could end up writing 3 or 4 blocks for every key. If the key size is much less than the block size, this is extremely expensive. For an SSTable-like index, each write IO will contain as many fresh keys as it can, so the IO cost for each key is more like 1/B.
In practice, this makes SSTable-like indexes thousands of times faster (for random writes) than BTrees.
When considering implementation details, we have found it a lot easier to implement SSTable-like indexes (almost) lock-free, whereas locking strategies for BTrees have become quite complicated.
You should also re-consider your read costs. You are correct that a BTree is O(log_b N) random IOs for random point reads, but an SSTable-like index is actually O(#sstables · log_b N). Without a decent merge scheme, #sstables is proportional to N. There are various tricks to get round this (using Bloom filters, for instance), but these don't help with small, random range queries. This is what we found with Cassandra:
Cassandra under heavy write load
This is why Castle, our (GPL) storage engine, does merges slightly differently, and can achieve much better range query performance (O(log^2 N)) with a slight trade-off in write performance (O(log^2 N / B)). In practice we find it to be quicker than Cassandra's SSTable index for writes as well.
If you want to know more about this, I've given a talk about how it works:
podcast
slides
Some things that should also be mentioned about each approach:
B-trees
The read/write operations are supposed to be logarithmic, O(log n). However, a single database write can lead to multiple writes in the storage system. For example, when a node is full, it has to be split, and that means there will be 2 writes for the 2 new nodes and 1 additional write for updating the parent node. You can see how that could increase if the parent node was also full.
Usually, B-trees are stored in such a way that each node has the size of a page. This creates a phenomenon called write amplification, where even if a single byte needs to be updated, a whole page is written.
Writes are usually random (not sequential), thus slower especially for magnetic disks.
SSTables
SSTables are usually used in the following approach. There is an in-memory structure, called memtable, as you described. Every once in a while, this structure is flushed to disk to an SSTable. As a result, all the writes go to the memtable, but the reads might not be in the current memtable, in which case they are searched in the persisted SSTables.
As a result, writes are O(log n). However, always bear in mind that they are done in memory, so they should be orders of magnitude faster than the logarithmic disk operations of B-trees. For the sake of completeness, we should mention that writes are also written to a write-ahead log for crash recovery. But, given that these are all sequential writes, they are expected to be much more efficient than the random writes of B-trees.
When served from memory (from the memtable), reads are expected to be much faster as well. But, when there's a need to look in the older, disk-based SSTables, reads can potentially become quite a bit slower than with B-trees. There are several optimisations around that, such as the use of Bloom filters, to check whether an SSTable contains a value without performing disk reads.
As you mentioned, there's also a background process, called compaction, used to merge SSTables. This helps remove deleted values and prevent fragmentation, but it can cause significant write load, affecting the write throughput of the incoming operations.
As it becomes evident, a comparison between these 2 approaches is much more complicated. In an extremely simplified attempt to provide a concrete comparison, I think we could say that:
SSTables provide a much better write throughput than B-trees. However, they are expected to have less stable behaviour, because of ongoing compactions. An example of this can be seen in this benchmark comparison.
B-trees are usually preferred for use cases where transaction semantics are needed. This is because each key can be found in only a single place (in contrast to the SSTable approach, where it could exist in multiple SSTables, with obsolete values in some of them) and also because one could represent a range of values as part of the tree. This means that it's easier to implement key-level and range-level locking mechanisms.
References
[1] A Performance Comparison of LevelDB and MySQL
[2] Designing Data-intensive Applications
I think fractal trees, as used by Tokutek, are a better index for a database. They offer real-world 20x to 80x improvements over b-trees.
There are excellent explanations of how fractal tree indices work here.
LSM-trees are better suited than B-trees as a storage engine structure.
In a way, they convert random writes into append-only (sequential) writes.
Here is an LSM-tree implementation:
https://github.com/shuttler/lsmtree

How do databases deal with data tables that cannot fit in memory?

Suppose you have a really large table, say a few billion unordered rows, and now you want to index it for fast lookups. Or maybe you are going to bulk load it and order it on the disk with a clustered index. Obviously, when you get to a quantity of data this size you have to stop assuming that you can do things like sorting in memory (well, not without going to virtual memory and taking a massive performance hit).
Can anyone give me some clues about how databases handle large quantities of data like this under the hood? I'm guessing there are algorithms that use some form of smart disk caching to handle all the data but I don't know where to start. References would be especially welcome. Maybe an advanced databases textbook?
Multiway merge sort is the keyword for sorting huge amounts of data.
As far as I know, most indexes use some form of B-tree, which does not need to keep everything in memory. You can simply put nodes of the tree in a file, and then jump to various positions in the file. This can also be used for sorting.
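A minimal sketch of a multiway (external) merge sort in Python, assuming the input is a text file of newline-terminated lines:
import heapq
import itertools
import tempfile

def external_sort(input_path, output_path, max_lines_in_memory=100_000):
    """Sort memory-sized chunks into sorted runs on disk, then k-way merge the runs."""
    runs = []
    with open(input_path) as src:
        while True:
            chunk = list(itertools.islice(src, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()                              # only this chunk has to fit in memory
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*runs))            # holds only one line per run in memory
    for run in runs:
        run.close()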
Are you building a database engine?
Edit: I built a disc based database system back in the mid '90's.
Fixed size records are the easiest to work with because your file offset for locating a record can be easily calculated as a multiple of the record size. I also had some with variable record sizes.
My system needed to be optimized for reading. The data was actually stored on CD-ROM, so it was read-only. I created binary search tree files for each column I wanted to search on. I took an open source in-memory binary search tree implementation and converted it to do random access of a disc file. Sorted reads from each index file were easy and then reading each data record from the main data file according to the indexed order was also easy. I didn't need to do any in-memory sorting and the system was way faster than any of the available RDBMS systems that would run on a client machine at the time.
For fixed-record-size data, the index can just keep track of the record number. For variable-length data records, the index just needs to store the offset within the file where the record starts, and each record needs to begin with a structure that specifies its length.
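For the fixed-size case the arithmetic is trivial; a rough Python sketch (the 64-byte record size is just an assumed example):
RECORD_SIZE = 64                          # assumed fixed record size, in bytes

def read_record(data_file, record_number):
    """With fixed-size records, the file offset is a simple multiple of the record size."""
    data_file.seek(record_number * RECORD_SIZE)
    return data_file.read(RECORD_SIZE)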
You would have to partition your data set in some way. Spread out each partition on a separate server's RAM. If I had a billion 32-bit ints, that's 4 GB of RAM right there. And that's only your index.
For low-cardinality data, such as Gender (which has only two values: Male, Female), you can represent each index entry in less than a byte. Oracle uses a bitmap index in such cases.
Hmm... Interesting question.
I think that most widely used database management systems rely on the operating system's memory-management mechanisms, and when physical memory runs out, in-memory tables get swapped to disk.

Resources