Handling duplicates in RocksDB

I want to use RocksDB and wanted to know explicitly how it handles duplicates.
The documentation says:
The entire database is stored in a set of sstfiles. When a memtable is full,
its content is written out to a file in Level-0 (L0). RocksDB removes
duplicate and overwritten keys in the memtable when it is flushed to a file in L0.
Now, in the case of an environment with multiple databases, I couldn't find a description of the behavior. Are keys environment-wide unique in this case, or does every database have its own unique keys?

Short answer to your question: there's a background process called compaction, which periodically merges a number of sst-files into a single sorted run (this sorted run can be represented as multiple sst-files, but each with a disjoint key-range). During this compaction process, duplicate keys are handled.
Here's the long answer to your question:
RocksDB is an LSM database. When a key-value pair is written to RocksDB, RocksDB simply creates a data entry for it and appends it to an in-memory buffer called the MemTable.
When the MemTable becomes full, RocksDB sorts all its keys and flushes them as a single sst-file. As we keep writing more data, more sst-files get flushed, and these sst-files usually have overlapping key-ranges. Now suppose we have N sst-files and a read request comes in. This read request must check all N sst-files to see whether any of them contains the requested key, since each sst-file can have an overlapping key-range. As a result, without any process to reorganize these sst-files, reads become slower as we keep writing more data.
The process that reorganizes these sst-files is called compaction. It is essentially a multi-way merge-sort-like operation that takes multiple sst-files as input and outputs a single sorted run. During compaction, RocksDB sorts all the keys from the input sst-files, merges duplicate data entries so that only the newest version of each key survives, and drops data entries when it finds a matching deletion entry.
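To make that merge concrete, here is a minimal Python sketch of the idea (not RocksDB's actual code). Each input run is assumed to be a list of (key, sequence number, value) tuples sorted by key; a higher sequence number means a newer write, and a value of None stands in for a deletion entry. Real compactions only drop deletion entries once no older data can still exist in lower levels; this sketch drops them unconditionally for brevity.
import heapq

def compact(runs):
    # merge all runs by key, newest version of each key first
    merged = heapq.merge(*runs, key=lambda e: (e[0], -e[1]))
    output, last_key = [], object()
    for key, seqno, value in merged:
        if key == last_key:
            continue                        # older duplicate of a key we already emitted
        last_key = key
        if value is not None:               # keep live entries, drop deleted keys
            output.append((key, seqno, value))
    return output                           # a single sorted run

# compact([[("a", 1, "x"), ("b", 2, "y")], [("a", 3, "z"), ("b", 4, None)]])
# -> [("a", 3, "z")]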

Related

Is duplicated key a legitimate concern when scanning a LSM-tree based KV database?

Hi, I learned that one key might show up more than once in an LSM-tree based database. This is because a key is written to disk by appending instead of overwriting.
I understand that if we want to read a key's value, we can simply read the data files in reverse time order and just use the first value we meet.
However, what if we want to scan the entire database for some analytical query? In this case, we have to scan all data files on disk because we cannot ignore any key. But if we scan all data files, and a key shows up more than once, won't this duplication cause a correctness violation?
Thanks.
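For reference, the newest-first point lookup described above can be sketched like this (illustrative code; segments is assumed to be a list of dict-like on-disk files ordered from oldest to newest):
def get(segments, key):
    for segment in reversed(segments):      # reverse time order: newest file first
        if key in segment:
            return segment[key]             # the first value we meet is the newest one
    return None                             # key not present anywhere
A full scan avoids the duplication problem the same way compaction does: the segments are merged in key order and, for every key, only the value from the most recent segment is emitted, so older duplicates never reach the query.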

How to maintain the sparse index in a LSM-tree?

In Designing Data-Intensive Applications, Martin introduces a data structure called the LSM-tree.
There are mainly 3 parts: an in-memory memtable (usually a red-black tree), an in-memory sparse index, and on-disk SSTables (aka segments). They work together like this:
When a write happens, it first goes to the memtable, and when the memtable becomes full, all the data is flushed into a new segment (with all the keys sorted).
When a read happens, it first looks up the memtable. If the key doesn't exist there, it looks up the sparse index, to learn which segment the key may reside. See figure 1.
Periodically, compaction happens that merges multiple segments into one. See figure 2.
As you can tell from figure 2, keys are sorted within a segment; however, keys are NOT sorted between segments. This makes me wonder: how do we maintain the sparse index such that keys in the index have increasing offsets?
A typical approach is to have a separate index per segment file, and this index is re-generated during compaction/merging of segment files. When reading a key, we then have to check multiple current segment files that may contain the key, and return the value that appears in the most recent of those segments.
It's not possible to tell just from looking at the index whether a particular segment contains a particular key. To avoid having to do a disk read for every segment, a common optimisation is to have a Bloom filter (or similar data structure such as a Cuckoo filter) for each segment that summarises the keys contained within that segment. That allows the read operation to only make a disk read for those segments that actually contain the desired key (with a small probability of making unnecessary disk reads due to Bloom filter false positives).
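A minimal Python sketch of that arrangement, using a plain set as a stand-in for a real Bloom filter and a sorted list of (key, position) samples as the per-segment sparse index (all names here are illustrative, and segments are assumed non-empty):
import bisect

class Segment:
    def __init__(self, sorted_items, every=16):
        # sorted_items: list of (key, value) pairs, sorted by key
        self.items = sorted_items
        self.filter = {k for k, _ in sorted_items}               # stand-in Bloom filter
        self.index = [(k, i) for i, (k, _) in enumerate(sorted_items)][::every]

    def get(self, key):
        if key not in self.filter:                               # cheap negative check, no disk read
            return None
        keys = [k for k, _ in self.index]
        i = max(bisect.bisect_right(keys, key) - 1, 0)           # nearest indexed key <= key
        for k, v in self.items[self.index[i][1]:]:               # short scan from there
            if k == key:
                return v
            if k > key:
                break
        return None

def lookup(segments, key):
    for seg in reversed(segments):                               # most recent segment first
        value = seg.get(key)
        if value is not None:
            return value
    return None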

What is the differences between the term SSTable and LSM Tree

Are these two terms used interchangeably?
I have read about how SSTable works, and usually, articles just start mentioning LSM Tree.
However, they seem to be the same thing.
When should I use one term over the other?
Probably one of the best explanations of SSTables and LSM-Trees for mortals is given by Martin Kleppmann in his "Designing Data-Intensive Applications" book. These data structures are explained in chapter 3, "Storage and Retrieval", pages 69 through 79. It's a really great read, I would recommend the whole book!
Impatient ones could find my synopsis of the topic below 👇
Everything starts with a very dumb key-value database implemented as just two Bash functions:
db_set () {
    echo "$1,$2" >> database                              # append "key,value" to the data file
}
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1  # the last match for the key wins
}
The idea is to store the data in a CSV-like file:
$ source database.sh
$ db_set 1 'Anakin Skywalker'
$ db_set 2 'Luke Skywalker'
$ db_set 1 'Darth Vader'
$ cat database
1,Anakin Skywalker
2,Luke Skywalker
1,Darth Vader
$ db_get 1
Darth Vader
Note that the first value for the key 1 is overridden by the subsequent write.
This database has pretty good write performance: db_set just appends the data to a file, which is generally fast. But reads are inefficient, especially on huge data sets: db_get scans the entire file. Thus, writes are O(1) and reads are O(n).
Next, indices are introduced. An index is a data structure derived from the data itself. Maintaining an index always incurs additional costs, thus, indices always degrade write performance with the benefit of improving the reads.
One of the simplest possible indices is a hash index. This index is nothing more than a dictionary holding byte offsets of the records in a database. Continuing the previous example, and assuming every character is one byte, the hash index would map each key to the offset of its most recent record: key 1 → offset 36 (the line 1,Darth Vader) and key 2 → offset 19 (the line 2,Luke Skywalker).
Whenever you write data into the database, you also update the index. When you want to read the value for a given key, you quickly look up its offset in the index. Having the offset, you use a "seek" operation to jump straight to the data location. With an in-memory hash index, both lookups and index updates are roughly constant time; a tree-based index would give logarithmic complexity instead.
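A Python sketch of the same idea (illustrative, not from the book): an in-memory dict maps each key to the byte offset of its latest record in the append-only file.
index = {}                                         # key -> byte offset of the latest record

def db_set(key, value, path="database"):
    with open(path, "ab") as f:
        index[key] = f.tell()                      # remember where this record starts
        f.write(f"{key},{value}\n".encode("utf-8"))

def db_get(key, path="database"):
    offset = index.get(key)
    if offset is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)                             # jump straight to the record
        line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]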
Next, Martin deals with the storage efficiency. Appending data to a database file exhausts disk space quickly. The fewer distinct keys you have — the more inefficient this append-only storage engine is. The solution to this problem is compaction:
When a database file grows to a certain size, you stop appending to it, create a new file (called segment) and redirect all the writes to this new file.
Segments are immutable in the sense that they are never used to append any new data. The only way to modify a segment is to write its content into a new file, possibly with some transformations in between.
So, compaction creates new segments containing only the most recent record for each key. Another possible enhancement at this step is merging multiple segments into a single one. Compaction and merging can, of course, be done in the background. Old segments are just thrown away.
Every segment, including the one being written to, has its own index. So, when you want to find the value for a given key, you search those indices in reverse chronological order: from the most recent, to the oldest.
So far we have a data structure having these pros:
✔️ Sequential writes are generally faster than random ones
✔️ Concurrency is easy to control having a single writer process
✔️ Crash recovery is easy to implement: just read all the segments sequentially, and store the offsets in the in-memory index
✔️ Merging and compaction help to avoid data fragmentation
However, there are some limitations as well:
❗ Crash recovery could be time-consuming if segments are large and numerous
❗ Hash index must fit in memory. Implementing on-disk hash tables is much more difficult
❗ Range queries (BETWEEN) are virtually impossible
Now, with this background, let's move on to SSTables and LSM-trees. By the way, these abbreviations stand for "Sorted String Tables" and "Log-Structured Merge Trees" respectively.
SSTables are very similar to the "database" that we've seen previously. The only improvement is that we require records in segments to be sorted by key. This might seem to break the ability to use append-only writes, but that's what LSM-trees are for. We'll see in a moment!
SSTables have some advantages over those simple segments we had previously:
✔️ Merging segments is more efficient due to the records being pre-sorted. All you have to do is compare the segments' "heads" on each iteration and choose the lowest one. If multiple segments contain the same key, the value from the most recent segment wins. This compact-and-merge process also preserves the sorting of the keys.
✔️ With keys sorted, you don't need to have every single key in the index anymore. If the key B is known to be somewhere between keys A and C you could just do a scan. This also means that range queries are possible!
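As an illustration, here is how a range query over one sorted segment could use such a sparse index (a sorted list of (key, position) pairs for every N-th record; the code is a sketch, not from the book, and assumes a non-empty index):
import bisect

def range_query(segment, sparse_index, lo, hi):
    keys = [k for k, _ in sparse_index]
    i = max(bisect.bisect_right(keys, lo) - 1, 0)    # nearest indexed key <= lo
    results = []
    for key, value in segment[sparse_index[i][1]:]:  # scan forward from that position
        if key > hi:
            break                                    # keys are sorted, so we can stop early
        if key >= lo:
            results.append((key, value))
    return results

# range_query([("a", 1), ("c", 2), ("f", 3)], [("a", 0), ("f", 2)], "b", "f")
# -> [("c", 2), ("f", 3)]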
The final question is: how do you get the data sorted by key?
The idea, described by Patrick O'Neil et al. in their paper "The Log-Structured Merge-Tree (LSM-Tree)", is simple: there are in-memory data structures, such as red-black trees or AVL trees, that are good at keeping data sorted. So, you split writes into two stages. First, you write the data into the in-memory balanced tree. Second, you flush that tree to disk. Actually, there may be more than two stages, with deeper ones being bigger and "slower" than the upper ones (as shown in the other answer).
When a write comes, you add it to the in-memory balanced tree, called memtable.
When the memtable grows big, it is flushed to the disk. It is already sorted, so it naturally creates an SSTable segment.
Meanwhile, writes are processed by a fresh memtable.
Reads are first looked up in the memtable, then in the segments, from the most recent one to the oldest.
Segments are compacted and merged from time to time in background as described previously.
The scheme is not perfect: it can suffer from sudden crashes, because the memtable, being an in-memory data structure, is lost. This issue is solved by maintaining another append-only file (a write-ahead log) that simply duplicates the contents of the memtable. After a crash, the database only needs to read that file to re-create the memtable.
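A compact Python sketch of this write path, using a plain dict sorted at flush time as a stand-in for the balanced tree, plus an append-only log file for crash recovery (all file names and class names are illustrative):
import json, os

class MemTable:
    def __init__(self, wal_path="wal.log"):
        self.data = {}                                    # stand-in for a red-black/AVL tree
        self.wal_path = wal_path
        self._recover()

    def put(self, key, value):
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps([key, value]) + "\n")    # make the write durable first ...
        self.data[key] = value                            # ... then apply it in memory

    def flush(self, segment_path):
        # write a key-sorted segment (an SSTable), then start with an empty log again
        with open(segment_path, "w", encoding="utf-8") as seg:
            for key in sorted(self.data):
                seg.write(json.dumps([key, self.data[key]]) + "\n")
        self.data.clear()
        os.remove(self.wal_path)

    def _recover(self):
        # after a crash, replay the log to re-create the memtable
        if os.path.exists(self.wal_path):
            with open(self.wal_path, encoding="utf-8") as wal:
                for line in wal:
                    key, value = json.loads(line)
                    self.data[key] = value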
And that's it! Note that all the issues of a simple append-only storage, described above, are now solved:
✔️ Now there is only one file to read in case of a crash: the memtable backup
✔️ Indices can be sparse, so fitting them in RAM is easier
✔️ Range queries are now possible
TLDR: An SSTable is a key-sorted, append-only key-value storage. An LSM-tree is a layered data structure, based on a balanced tree, that allows SSTables to exist without the contradiction of being both sorted and append-only at the same time.
Congrats, you've finished this long read! If you enjoyed the explanation, make sure not only to upvote this post, but some of Martin's answers here as well. Remember: all credit goes to him!
It is very well explained in the paper LSM-based Storage Techniques: A Survey, in sections 1 and 2.2.1.
An LSM-tree consists of memory components and disk components. Basically, an SSTable is just one implementation of a disk component for an LSM-tree.
The SSTable is explained in the above-mentioned paper:
An SSTable (Sorted String Table) contains a list of data blocks and an
index block; a data block stores key-value pairs ordered by keys, and
the index block stores the key ranges of all data blocks.
Sorted Strings Table (SSTable) is a key/value string pair based file, sorted by keys.
However, LSM Tree is different:
In computer science, the log-structured merge-tree (or LSM tree) is a
data structure with performance characteristics that make it
attractive for providing indexed access to files with high insert
volume, such as transactional log data. LSM trees, like other search
trees, maintain key-value pairs. LSM trees maintain data in two or
more separate structures, each of which is optimized for its
respective underlying storage medium; data is synchronized between the
two structures efficiently, in batches.
https://en.wikipedia.org/wiki/Log-structured_merge-tree
Actually, the term LSM tree was made official by Patrick O'Neil's paper The Log-Structured Merge-Tree (LSM-Tree), published in 1996.
The term SSTable was coined by Google's Bigtable: A Distributed Storage System for Structured Data in 2006.
Conceptually, an SSTable is something that provides indexing for a (mostly) LSM-tree based storage engine (e.g. Lucene). It's not so much about a difference between the two terms, but about how concepts can exist in academia for a long time and only get named later on.
Going through the above two papers will tell you a lot.

Understanding SSTable immutability

I'm trying to better understand the immutability of SSTables in Cassandra. It's very clear what happens in an insert operation, or in an update/delete operation when the data still exists in the memtable. But it's not clear what happens when I want to modify data that has already been flushed.
So I understand the simple scenario: I execute an insert operation and the data is written to a memtable. When the memtable is full, it's flushed to an SSTable.
Now, how does modification of data occur? What happens when I execute a delete or update command (when the data has already been flushed)? If the SSTable is immutable, how will the data get deleted/updated? And how does the memtable work for delete and update commands on data that does not exist in it because it has been flushed? What will the memtable contain?
In Cassandra / Scylla you ALWAYS append. Any operation, whether it's an insert / update / delete, will create a new entry for that partition containing the new data and a new timestamp. In the case of a delete operation, the new entry will actually be a tombstone with the new timestamp (indicating that the previous data was deleted). This applies whether the data is still in memory (memtable) or already flushed to disk -> SSTable created.
Several "versions" of the same partition, with different data and different timestamps, can reside in multiple SSTables (and even in memory) at the same time. SSTables are merged during compaction, and there are several compaction strategies that can be applied.
When gc_grace_seconds (default: 10 days, tunable) has expired, on the next compaction that tombstone will be removed, meaning that the deleted data and the tombstone marking the latest action (the delete) will not get merged into the new SSTable.
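To illustrate the reconciliation (this is a sketch, not Cassandra's or Scylla's actual code): each version of a cell is a (timestamp, value) pair, a tombstone is a version whose value is None, and a tombstone is only purged once it is older than the grace period.
import time

GC_GRACE_SECONDS = 10 * 24 * 3600                 # default grace period of 10 days

def reconcile(versions, now=None):
    # versions: list of (timestamp, value); value=None is a tombstone
    now = now if now is not None else time.time()
    ts, value = max(versions, key=lambda v: v[0])  # the newest timestamp wins
    if value is None:                              # the latest action was a delete
        if now - ts > GC_GRACE_SECONDS:
            return None                            # tombstone old enough: drop everything
        return (ts, None)                          # keep the tombstone for now
    return (ts, value)

# reconcile([(100.0, "v1"), (200.0, None)], now=300.0) -> (200.0, None)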
The internal implementation of the memtables might be slightly different between Scylla and Cassandra but for the sake of simplicity let's assume it is the same.
You are welcome to read more about the architecture in the following documentation:
SStables
Compaction strategies

detecting when data has changed

Ok, so the story is like this:
-- I have lots of files (pretty big, around 25GB) that are in a particular format and need to be imported into a datastore
-- these files are continuously updated with data, sometimes new, sometimes the same data
-- I am trying to figure out an algorithm to detect whether something has changed for a particular line in a file, in order to minimize the time spent updating the database
-- the way it currently works is that I drop all the data in the database each time and then reimport it, but this won't work anymore, since I'll need a timestamp for when an item has changed
-- the files contain strings and numbers (titles, orders, prices etc.)
The only solutions I could think of are:
-- compute a hash for each row in the database, compare it against the hash of the row from the file, and if they're different, update the database
-- keep 2 copies of the files (the previous ones and the current ones), diff them (which is probably faster than updating the db), and update the db based on those diffs
Since the amount of data is very big, I am kind of out of options for now. In the long run, I'll get rid of the files and data will be pushed straight into the database, but the problem still remains.
Any advice will be appreciated.
Problem definition as I understand it:
Let’s say your file contains
ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40
As you stated, rows can be added / updated, hence the file becomes
ID,Name,Age
1,Jim,20 -- to be discarded
2,Tim,35 -- to be updated
3,Kim,40 -- to be discarded
4,Zim,30 -- to be inserted
Now the requirement is to update the database by inserting / updating only the above 2 records, in two SQL queries or 1 batch query containing two SQL statements.
I am making the following assumptions here:
You cannot modify the existing process that creates the files.
You are using some batch processing [Reading from file - Processing in Memory - Writing to DB]
to upload the data into the database.
Store the hash value of each record [Name,Age] against its ID in an in-memory map, where ID is the key and the value is the hash [if you require scalability, use Hazelcast].
Your batch framework that loads the data [again, assuming it treats one line of the file as one record] needs to check the computed hash value against the hash stored for that ID in the in-memory map. The first-time population of the map can also be done by your batch framework while reading the files.
If the ID is present:
--- compare the hashes
--- if they are the same, discard the record
--- if they differ, create an update SQL statement
If the ID is not present in the in-memory map, create an insert SQL statement and store the hash value
You might go for parallel processing, chunk processing and in-memory data partitioning using Spring Batch and Hazelcast.
http://www.hazelcast.com/
http://static.springframework.org/spring-batch/
Hope this helps.
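For illustration, a self-contained Python sketch of this approach (a plain dict stands in for the Hazelcast map, and the SQL strings are placeholders):
import csv
import hashlib

def row_hash(name, age):
    return hashlib.md5(f"{name}|{age}".encode("utf-8")).hexdigest()

def detect_changes(path, hashes):
    # hashes: in-memory map of ID -> hash of [Name, Age] from the previous run
    statements = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):                      # expects an ID,Name,Age header
            h = row_hash(row["Name"], row["Age"])
            old = hashes.get(row["ID"])
            if old == h:
                continue                                   # unchanged: discard
            if old is None:
                statements.append(f"INSERT ... ({row['ID']})")           # new row
            else:
                statements.append(f"UPDATE ... WHERE id = {row['ID']}")  # changed row
            hashes[row["ID"]] = h
    return statements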
Instead of computing the hash for each row from the database on demand, why don't you store the hash value?
Then you could just compute the hash value of the file in question and compare it against the database stored ones.
Update:
Another option that came to my mind is to store the Last Modified date/time information on the database and then compare it against that of the file in question. This should work, provided the information cannot be changed either intentionally or by accident.
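A minimal sketch of that check, assuming the last seen modification time has been stored somewhere (the file name and variable here are hypothetical):
import os

def file_changed(path, last_seen_mtime):
    current = os.path.getmtime(path)            # the file's current modification time
    return current > last_seen_mtime, current

# changed, new_mtime = file_changed("orders.csv", stored_mtime)
# if changed: reprocess the file and persist new_mtime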
Well, regardless of what you use, your worst case is going to be O(n), which with n ~ 25GB of data is not so pretty, unless you can modify the process that writes to the files.
Since you are not updating all of the 25GB all of the time, that is your biggest potential for saving cycles.
1. Don't write randomly
Why don't you make the process that writes the data append only? This way you'll have more data, but you'll have full history and you can track which data you already processed (what you already put in the datastore).
2. Keep a list of changes if you must write randomly
Alternatively, if you really must do the random writes, you could keep a list of updated rows. This list can then be processed as in #1, and you can track which changes you have processed. If you want to save some space, you can keep a list of blocks in which the data changed (where a block is a unit that you define).
Furthermore you can keep checksums/hashes of changed block/lines. However this might not be very interesting - it is not so cheap to compute and direct comparison might be cheaper (if you have free CPU cycles during writing it might save you some reading time later, YMMV).
Note(s)
Both #1 and #2 are interesting only if you can make adjustments to the process that writes the data to disk.
If you cannot modify the process that writes the 25GB of data, then I don't see how checksums/hashes can help: you have to read all the data anyway to compute the hashes (since you don't know what changed), so you can directly compare while you read and come up with a list of rows to update/add (or update/add directly).
Using diff algorithms might be suboptimal: a diff algorithm will not only look for the lines that changed, but also search for the minimal edit distance between the two text files given certain formatting options (in diff, this can be controlled with --minimal or -H to work slower or faster, i.e. to search for the exact minimal solution or to use a heuristic algorithm, which if I recall correctly is O(n log n); that is not bad, but still slower than the O(n) available to you if you do a direct line-by-line comparison).
Practically, it's the kind of problem that has to be solved by backup software, so why not use some of their standard solutions?
The best one would be to hook the WriteFile calls so that you receive a callback on each update. This would work pretty well with binary records.
Something that I cannot understand: the files are actually text files that are not just appended to, but updated? That is highly inefficient (together with the idea of keeping 2 copies of the files, because it will make file caching work even worse).
