In the open source version, Scylla recommends keeping up to 50% of disk space free for "compactions". At the same time, the documentation states that each table is compacted independently of the others. Logically, this suggests that in an application with several (or even dozens of) tables, there's only a small chance that many compactions will coincide.
Is there a mathematical model for calculating how compactions might overlap in an application with several tables? Based on a cursory analysis, it seems that the likelihood of multiple overlapping compactions is small, especially when we are dealing with dozens of independent tables.
You're absolutely right:
With the size-tiered compaction strategy, a compaction may temporarily double the disk requirements. But it doesn't double the entire disk usage, only that of the sstables involved in the compaction (see also my blog post on size-tiered compaction and its space amplification). There is indeed a difference between "the entire disk usage" and just "the sstables involved in this compaction", for two reasons:
As you noted in your question, if you have 10 tables of similar size, compacting just one of them will work on just 10% of the data, so the temporary disk usage during compaction might be 10% of the disk usage, not 100%.
Additionally, Scylla is sharded, meaning that different CPUs handle their sstables, and compactions, completely independently. If you have 8 CPUs on your machines, each CPU only handles 1/8th of the data, so when it does compaction, the maximum temporary overhead will be 1/8th of the table's size - not the full table size.
The second reason cannot be counted on: since shards choose when to compact independently, if you're unlucky, all shards may decide to compact the same table at exactly the same time, and worse, may happen to do their biggest compactions all at the same time. This "unluckiness" is also guaranteed to happen if you start a "major compaction" (nodetool compact).
The first reason, the one you asked about, is indeed more useful and reliable: beyond it being unlikely that all shards will choose to compact all sstables at exactly the same time, there is an important detail in Scylla's compaction algorithm which helps here: each shard only does one compaction of a (roughly) given size at a time. So if you have many roughly-equal-sized tables, no shard can be doing a full compaction of more than one of those tables at a time. This is guaranteed; it's not a matter of probability.
Of course, this "trick" only helps if you really have many roughly-equal-sized tables. If one table is much bigger than the rest, or the tables have very different sizes, it won't do much to limit the maximum temporary disk usage.
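To make the "mathematical model" from the question concrete, here is a minimal sketch of my own (a simplification, not Scylla's actual scheduler): treat each of N similar tables as independently being in a big compaction some fraction p of the time, and ask how likely it is that k or more of them overlap.

```python
# A toy binomial model of compaction overlap (my assumption, not Scylla's
# scheduler): N equal-sized tables, each independently compacting a fraction
# p of the time; what's the chance that k or more compact simultaneously?
from math import comb

def overlap_probability(n_tables: int, p_compacting: float, k: int) -> float:
    """P(at least k of n independent tables are compacting at the same time)."""
    return sum(
        comb(n_tables, i) * p_compacting**i * (1 - p_compacting)**(n_tables - i)
        for i in range(k, n_tables + 1)
    )

# With 20 tables each compacting 5% of the time, the chance that 5 or more
# overlap (i.e., 25% of the data being rewritten at once) is small:
print(overlap_probability(20, 0.05, 5))   # ~ 0.0026 (well under 1%)
```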
In issue https://github.com/scylladb/scylla/issues/2871 I proposed an idea for how Scylla could guarantee that when disk space is low, sharding (the second reason above) is also used to reduce temporary disk space usage. We haven't implemented this idea, but instead implemented a better one: "incremental compaction strategy", which does huge compactions in pieces ("incrementally") to avoid most of the temporary disk usage. See this blog post for how this new compaction strategy works, and for graphs demonstrating how it lowers the temporary disk usage. Note that Incremental Compaction Strategy is currently part of the Scylla Enterprise version (it's not in the open-source version).
I'm trying to better understand the immutability of sstables in Cassandra. It's very clear what happens in an insert operation, and in an update/delete operation when the data still exists in the memtable. But it's not clear what happens when I want to modify data that has already been flushed out.
So I understand the simple scenario: I execute an insert operation and the data is written to a memtable. When the memtable is full, it's flushed to an sstable.
Now, how does modification of data occur? What happens when I execute a delete or update command on data that has already been flushed out? If the sstable is immutable, how does the data get deleted/updated? And how does the memtable work for delete and update commands on data that does not exist in it because it has been flushed out? What will the memtable contain?
In Cassandra / Scylla you ALWAYS append. Any operation, whether it's an insert, update, or delete, will create a new entry for that partition containing the new data and a new timestamp. In the case of a delete operation, the new entry will actually be a tombstone with the new timestamp (indicating that the previous data was deleted). This applies whether the data is still in memory (memtable) or has already been flushed to disk (an sstable has been created).
Several "versions" of the same partition with different data and different timestamps can reside in multiple sstables (and even in memory) at the same time. SStables will be merged duration compaction and there are several compaction strategies that can be applied.
When gc_grace_seconds (default: 10 days, tunable) has expired, the next compaction will remove that tombstone, meaning that neither the deleted data nor the tombstone indicating the latest action (the delete) will be merged into the new sstable.
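To illustrate the merge rule described above, here is a toy sketch (my own simplification in Python, not actual Scylla/Cassandra code): every write is an append carrying a timestamp, a delete is a tombstone, and compaction keeps only the newest version of each key, dropping tombstones whose grace period has expired.

```python
# Toy illustration of append-only writes plus compaction (not real code).
import time

TOMBSTONE = object()               # sentinel marking a delete
GC_GRACE_SECONDS = 10 * 24 * 3600  # default grace period: 10 days

def compact(sstables, now=None):
    """Merge sstables (dicts of key -> (timestamp, value)) into one."""
    now = now if now is not None else time.time()
    merged = {}
    for table in sstables:                      # order doesn't matter:
        for key, (ts, value) in table.items():  # timestamps decide
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)       # last write wins
    # Drop tombstones (and the data they shadow) once grace has expired.
    return {
        k: v for k, v in merged.items()
        if not (v[1] is TOMBSTONE and now - v[0] > GC_GRACE_SECONDS)
    }

old = {"user:1": (100.0, "alice"), "user:2": (100.0, "bob")}
new = {"user:1": (200.0, TOMBSTONE)}            # delete issued later
# Grace period not yet expired, so the tombstone is kept for user:1:
print(compact([old, new], now=300.0))
```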
The internal implementation of the memtables might be slightly different between Scylla and Cassandra but for the sake of simplicity let's assume it is the same.
You are welcome to read more about the architecture in the following documentation:
SStables
Compaction strategies
For a large Cassandra partition, read latencies are usually huge.
But does write latency get impacted in this case? Since Cassandra is a column-oriented database and holds immutable data, shouldn't the write (which appends data at the end of the row) take less time?
In all the experiments I have conducted with Cassandra, I have noticed that write throughput is not affected by data size, while read performance takes a big hit if your SSTables are too big or concurrent_reads is set too low (check with nodetool tpstats whether ReadStage is going into a pending state, and increase concurrent_reads in the cassandra.yaml file if so). Using LeveledCompaction seems to help, as data for the same key remains in the same SSTable. Make sure your data is distributed evenly across all nodes. Cassandra optimization is tricky, and you may have to implement "hacks" to obtain the desired performance on the minimum possible hardware.
If my index is, say, 80% fragmented and is used in joins, can the overall performance be worse than if that index didn't exist? And if so, why?
Your question is too vague to answer consistently, or even to know what you're actually after, but consider this:
A fragmented index means you'll incur a lot more actual disk activity than a given query strictly needs.
Take a look at DBCC SHOWCONTIG
Among other useful information, it shows you a figure for Scan Density. A very low "hit rate" here can imply that you're doing far more IO than you would with a properly maintained index. This could even exceed the amount of IO needed to perform a table scan, but it all depends on the size of your objects and your data access pattern.
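As a rough back-of-the-envelope illustration (the numbers and the simple inverse relationship are my own assumptions, not SQL Server internals), low scan density translates into extra page reads roughly like this:

```python
# Rough arithmetic only (illustrative assumption, not SQL Server internals):
# treat scan density as the fraction of extent/page accesses that are useful,
# so a badly fragmented index multiplies the IO needed for the same rows.
pages_in_table = 10_000
scan_density = 0.20          # a very low "hit rate" from DBCC SHOWCONTIG

table_scan_ios = pages_in_table                       # contiguous, read once
fragmented_index_ios = pages_in_table / scan_density  # ~5x the page accesses

print(table_scan_ios, fragmented_index_ios)  # 10000 vs 50000.0: worse than a scan
```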
One area where a poorly maintained (= highly fragmented) index will hurt you twice over is that it hurts performance of inserts, updates AND selects.
With this in mind, it's a pretty common practice for ETL processes to drop indexes before and recreate them after processing large batches of information. In the meantime, the indexes would only hurt write performance and be too fragmented to help lookups.
Besides that: it's easy to do index maintenance. I'd recommend deploying Ola Hallengren's index maintenance solution and no longer worry about it.
Using two databases to illustrate this example: CouchDB and Cassandra.
CouchDB
CouchDB uses a B+ tree for document indexes (with a clever modification to make it work in their append-only environment). More specifically, as documents are modified (insert/update/delete), they are appended to the running database file, followed immediately by the full leaf-to-root path of all the B+ tree nodes affected by the updated revision.
These piecemeal index revisions are inlined right alongside the modifications, such that the full index is the union of the most recent index modifications appended at the end of the file plus the additional pieces further back in the data file that are still relevant and haven't been modified yet.
Searching the B+ tree is O(log n).
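A heavily simplified sketch of that append-only idea (two fixed leaves and one root; a real CouchDB B+ tree is far more involved): nodes are only ever appended, and an update writes a new leaf plus a new root that points at the new leaf while reusing the untouched old one.

```python
# Toy append-only, copy-on-write index (illustrative, not CouchDB's format).
append_only_file = []  # stands in for the on-disk file; offsets are indices

def write_node(node):
    append_only_file.append(node)
    return len(append_only_file) - 1           # "offset" of the appended node

def update(root_off, key, value):
    root = append_only_file[root_off]
    side = "left" if key < root["split"] else "right"
    old_leaf = append_only_file[root[side]]
    new_leaf = {**old_leaf, key: value}         # copy-on-write: old leaf untouched
    new_root = dict(root)
    new_root[side] = write_node(new_leaf)       # append the new leaf...
    return write_node(new_root)                 # ...then a new root referencing it

left = write_node({"a": 1})
right = write_node({"m": 2})
root = write_node({"split": "m", "left": left, "right": right})
root = update(root, "a", 42)                    # appends 2 nodes, overwrites none
print(append_only_file[append_only_file[root]["left"]])   # {'a': 42}
```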
Cassandra
Cassandra keeps record keys sorted in memory in tables (let's think of them as arrays for this question) and periodically writes them out as separate sorted-string tables.
We can think of the collection of all of these tables as the "index" (from what I understand).
Cassandra is required to compact/combine these sorted-string tables from time to time, creating a more complete file representation of the index.
Searching a sorted array is O(log n).
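A toy model of the structure just described (illustrative Python, not Cassandra's implementation): a sorted in-memory table that is periodically flushed, with point lookups binary-searching each flushed table from newest to oldest.

```python
# Toy memtable + sorted-string tables with an O(log n) per-table lookup.
import bisect

memtable = {}   # in-memory writes
sstables = []   # flushed tables, each a sorted list of (key, value) pairs

def flush():
    sstables.append(sorted(memtable.items()))  # write out one sorted run
    memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):           # newest table wins
        i = bisect.bisect_left(table, (key,))  # O(log n) binary search
        if i < len(table) and table[i][0] == key:
            return table[i][1]
    return None

memtable["k1"] = "v1"; flush()
memtable["k1"] = "v2"; flush()                 # newer version in a later table
print(get("k1"))                               # 'v2'
```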
Question
Assuming a similar level of complexity between maintaining partial B+ tree chunks in CouchDB and maintaining partial sorted-string indices in Cassandra, and given that both provide O(log n) search time, which one do you think would make a better representation of a database index, and why?
I am specifically curious whether there is an implementation detail about one over the other that makes it particularly attractive, or whether they are both a wash and you just pick whichever data structure you prefer to work with or makes more sense to the developer.
Thank you for the thoughts.
When comparing a BTree index to an SSTable index, you should consider the write complexity:
When writing randomly to a copy-on-write BTree, you will incur random reads (to do the copy of the leaf node and path). So while the writes may be sequential on disk, for datasets larger than RAM these random reads will quickly become the bottleneck. For an SSTable-like index, no such read occurs on write; there will only be the sequential writes.
You should also consider that in the worst case, every update to a BTree could incur log_b N IOs; that is, you could end up writing 3 or 4 blocks for every key. If the key size is much smaller than the block size, this is extremely expensive. For an SSTable-like index, each write IO will contain as many fresh keys as it can, so the IO cost per key is more like 1/B.
In practice, this makes SSTable-like indexes thousands of times faster (for random writes) than BTrees.
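Back-of-the-envelope arithmetic for that claim, with illustrative numbers of my own rather than benchmark data:

```python
# Per-key write IO: BTree (~log_B N block writes per update) vs. an
# SSTable-like index (~1/B, since each sequential block carries B fresh keys).
import math

N = 10**9          # keys in the index
block = 4096       # bytes per IO
key = 16           # bytes per key entry
B = block // key   # keys per block = 256

btree_ios_per_key = math.log(N, B)     # worst-case block writes per update
sstable_ios_per_key = 1 / B            # amortized cost per key

print(btree_ios_per_key)               # ~3.7 IOs per key
print(sstable_ios_per_key)             # ~0.004 IOs per key
print(btree_ios_per_key / sstable_ios_per_key)  # ~957x: roughly "thousands"
```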
When considering implementation details, we have found it a lot easier to implement SSTable-like indexes (almost) lock-free, whereas locking strategies for BTrees have become quite complicated.
You should also reconsider your read costs. You are correct that a BTree is O(log_b N) random IOs for random point reads, but an SSTable-like index is actually O(#sstables . log_b N). Without a decent merge scheme, #sstables is proportional to N. There are various tricks to get around this (using Bloom filters, for instance), but these don't help with small, random range queries. This is what we found with Cassandra:
Cassandra under heavy write load
This is why Castle, our (GPL) storage engine, does merges slightly differently, and can achieve much better range-query performance (O(log^2 N)) with a slight trade-off in write performance (O(log^2 N / B)). In practice we find it to be quicker than Cassandra's SSTable index for writes as well.
If you want to know more about this, I've given a talk about how it works:
podcast
slides
Some things that should also be mentioned about each approach:
B-trees
The read/write operations are supposed to be logarithmic, O(log n). However, a single database write can lead to multiple writes in the storage system. For example, when a node is full, it has to be split, which means 2 writes for the 2 new nodes and 1 additional write for updating the parent node. You can see how that could escalate if the parent node were also full.
Usually, B-trees are stored in such a way that each node has the size of a page. This creates a phenomenon called write amplification, where even if a single byte needs to be updated, a whole page is written.
Writes are usually random (not sequential) and thus slower, especially on magnetic disks.
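A toy node split (not a real B-tree implementation) just to count the page writes that a single logical insert can trigger, as described above:

```python
# Count page writes caused by one insert into a toy B-tree leaf.
ORDER = 4  # max keys per node

def insert_with_split(parent, leaf, key):
    """Insert into a leaf; on overflow, split it and update the parent."""
    leaf.append(key)
    leaf.sort()
    if len(leaf) <= ORDER:
        return ["leaf"]                         # 1 page write, the common case
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]        # split into two new nodes
    parent.append(right[0])                     # promote the split key
    parent.sort()
    return ["new-left", "new-right", "parent"]  # 3 page writes for 1 insert

parent, leaf = [], [10, 20, 30, 40]
print(insert_with_split(parent, leaf, 25))      # ['new-left', 'new-right', 'parent']
```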
SSTables
SSTables are usually used in the following approach. There is an in-memory structure, called a memtable, as you described. Every once in a while, this structure is flushed to disk as an SSTable. As a result, all writes go to the memtable, but a read might not find its data in the current memtable, in which case the persisted SSTables are searched.
As a result, writes are O(log n). However, always bear in mind that they are done in memory, so they should be orders of magnitude faster than the logarithmic disk operations of B-trees. For the sake of completeness, we should mention that writes are also recorded in a write-ahead log for crash recovery. But, given that these are all sequential writes, they are expected to be much more efficient than the random writes of B-trees.
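A minimal sketch of that write path (illustrative Python, not any real engine): appends go to a write-ahead log, then to the memtable, which is flushed as one sorted run when it fills up.

```python
# Toy LSM write path: WAL append + memtable update + periodic flush.
MEMTABLE_LIMIT = 2   # tiny limit so the example triggers a flush

wal, memtable, sstables = [], {}, []

def put(key, value):
    wal.append((key, value))       # sequential append: cheap and crash-safe
    memtable[key] = value          # in-memory update (a real memtable is sorted)
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.append(sorted(memtable.items()))  # one sequential flush
        memtable.clear()
        wal.clear()                # log no longer needed once data is persisted

put("a", 1); put("b", 2)           # second put triggers a flush
put("a", 3)                        # newer value lives in the memtable for now
print(sstables, memtable)          # [[('a', 1), ('b', 2)]] {'a': 3}
```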
When served from memory (from the memtable), reads are expected to be much faster as well. But when there's a need to look in the older, disk-based SSTables, reads can potentially become quite a bit slower than B-trees. There are several optimisations around that, such as the use of Bloom filters, which check whether an SSTable contains a value without performing disk reads.
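For illustration, here is a minimal Bloom filter sketch (a generic construction of my own, not the one Cassandra actually uses): a compact bitset that can answer "definitely not present" without touching disk, at the cost of occasional false positives.

```python
# Minimal Bloom filter: k hash positions set in a bitset per key.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, hashes=3):
        self.size, self.hashes = size_bits, hashes
        self.bits = 0                       # bitset stored as a big integer

    def _positions(self, key):
        for i in range(self.hashes):        # derive k positions per key
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:1")
print(bf.might_contain("user:1"))    # True (present, or a false positive)
print(bf.might_contain("user:999"))  # almost certainly False: skip this SSTable
```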
As you mentioned, there's also a background process, called compaction, used to merge SSTables. This helps remove deleted values and prevent fragmentation, but it can cause significant write load, affecting the write throughput of the incoming operations.
As is evident, a comparison between these 2 approaches is much more complicated. In an extremely simplified attempt to provide a concrete comparison, I think we could say that:
SSTables provide a much better write throughput than B-trees. However, they are expected to have less stable behaviour, because of ongoing compactions. An example of this can be seen in this benchmark comparison.
B-trees are usually preferred for use-cases where transaction semantics are needed. This is because each key can be found in only a single place (in contrast to the SSTable approach, where a key could exist in multiple SSTables, with obsolete values in some of them), and also because a range of values can be represented as part of the tree. This makes it easier to implement key-level and range-level locking mechanisms.
References
[1] A Performance Comparison of LevelDB and MySQL
[2] Designing Data-intensive Applications
I think fractal trees, as used by Tokutek, are a better index for a database. They offer real-world 20x to 80x improvements over B-trees.
There are excellent explanations of how fractal tree indices work here.
LSM-trees are better than B-trees as a storage engine structure.
In a way, they convert random writes into append-only (sequential) writes.
Here is an LSM-tree implementation:
https://github.com/shuttler/lsmtree