What are the differences between wide partition and data skew in Cassandra?

As I understand it, both mean that the amount of data in a specific partition should not be much larger than in other partitions, so we should choose proper partition key(s) to avoid these problems. But what exactly is the difference between these two terms?

While both can occur for the same reasons (data model and partition key cardinality), data imbalance between nodes can also occur for other reasons.
If a partition key is not selective enough, the amount of data in a partition can grow too large; the recommended maximum is about 100 MB per partition, and ideally no more than 10 MB.
While having a low-cardinality partition key can result in some skew, you can also get skew in the allocation of tokens to the ring. The RandomPartitioner tends to produce a more unbalanced result than the Murmur3Partitioner, but even Murmur3 can be improved by using allocate_tokens_for_keyspace / allocate_tokens_for_local_replication_factor. The same setting has different names depending on the C* or DSE version being used, but the idea is to give the partitioner more information about the intended replication factor so that it produces a more balanced allocation.
A further way data can become unbalanced is through topology choices: if you create a cluster with keyspaces using NetworkTopologyStrategy (as is recommended) and multiple racks, then unless the number of nodes per rack is the same, the data will not be balanced.
For example (to demonstrate the result, not something you would actually do):
Rack 1 = 5 nodes
Rack 2 = 5 nodes
Rack 3 = 2 nodes.
With an RF of 3 and 100 GB of data, each rack will hold a full replica. Nodes in racks 1 and 2 will hold roughly 20 GB each, while nodes in rack 3 will hold roughly 50 GB each.
This is why the normal advice when using racks is to increase the node count by 3 per DC (one per rack) as the cluster expands.
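A quick back-of-envelope check of those numbers, as a sketch (the rack sizes, RF, and data volume are the ones from the example above; nothing here is measured):

# Per-node data volume when racks are unevenly sized and each rack holds a
# full replica (NetworkTopologyStrategy, RF equal to the number of racks).
total_data_gb = 100
racks = {"rack1": 5, "rack2": 5, "rack3": 2}   # nodes per rack, from the example

for rack, nodes in racks.items():
    # One complete copy of the data set is spread over this rack's nodes.
    print(f"{rack}: {nodes} nodes, ~{total_data_gb / nodes:.0f} GB per node")

# rack1: 5 nodes, ~20 GB per node
# rack2: 5 nodes, ~20 GB per node
# rack3: 2 nodes, ~50 GB per node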

Related

What decides the number of partitions in a DynamoDB table?

I'm a beginner with DynamoDB, my online instructor doesn't answer his Q&A, and I've been confused about this.
I know that the partition key decides the partition in which the item will be placed.
I also know that the number of partitions is calculated based on throughput or storage using the famous formulas.
So let's say a table has user_id as its partition key, with 200 user_ids. Does that automatically mean that we have 200 partitions? If so, why didn't we calculate the number of partitions based on the famous formulas?
Thanks
Let's establish 2 things.
A DynamoDB partition can support 3,000 read operations and 1,000 write operations per second. Reads and writes are accounted for separately so they do not interfere with each other. If you had a table configured to support 18,000 reads and 6,000 writes, you'd have at least 12 partitions, but probably a few more for some head room.
A provisioned capacity table has 1 partition by default, but an on-demand table has 4 partitions by default.
So, to answer your question directly: just because you have 200 items does not mean you have 200 partitions. It is quite possible for those 200 items to sit in just one partition if your table is in provisioned capacity mode. If the configuration of the table changes or it takes on more traffic, those items might move around to new partitions.
There are a few distinct times when DynamoDB will add partitions.
When a partition grows larger than 10 GB in storage. DynamoDB might see that you are taking on data and split proactively, but 10 GB is the cut-off.
When your table needs to support more operations per second than it is currently doing. This can happen manually because you configured your table to support 20,000 reads/sec where before it only supported 2,000; DynamoDB would have to add partitions and move data to be able to handle that 20,000 reads/sec. Or it can happen automatically because you configured floor and ceiling values in DynamoDB auto-scaling, and DynamoDB senses your ops/sec is climbing and therefore adjusts the number of partitions in response to capacity exceptions.
Your table is in on-demand capacity mode and DynamoDB attempts to automatically keep 2x your previous high-water mark of capacity. For example, say your table just reached 10,000 RCU for the first time. DynamoDB would see that you passed your previous high-water mark and start adding partitions as it tries to keep 2x that capacity at the ready in case you peak again like you just did.
DynamoDB actively monitors your table, and if it sees that one or more items are being hit particularly hard (hot keys) and live in the same partition, that can create a hot partition. If that is happening, DynamoDB might split the partition to isolate those items and prevent or fix the hot-partition situation.
There are one or two other, rarer edge cases, but you'd likely be talking to AWS Support if you encountered them.
Note: Once DynamoDB creates partitions, the number of partitions never shrinks and this is ok. Throughput dilution is no longer a thing in DynamoDB.
The partition key value is hashed to determine the actual partition to place the data item into.
Thus the number of distinct partition key values has no effect on the number of physical partitions.
The only things that affect the physical number of partitions are RCUs/WCUs (throughput) and the amount of data stored.
Number of partitions (throughput): Pt = RCU/3000 + WCU/1000
Number of partitions (storage): Ps = GB/10
Unless one of the above is more than 1.0, there will likely be only a single partition. But I'm sure the split happens as you approach the limits; exactly when is something only AWS knows.
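As a rough sketch of how those formulas can be applied (the 3,000 RCU / 1,000 WCU / 10 GB limits are the ones quoted above; taking the larger of the two estimates and rounding up is my assumption about how they combine, and the example inputs are made up):

import math

def estimated_partitions(rcu, wcu, size_gb):
    """Rough DynamoDB partition estimate from throughput and storage, using the
    per-partition limits quoted above (3,000 RCU, 1,000 WCU, 10 GB).
    Combining the two by taking the larger value and rounding up is an
    assumption; the real allocator is internal to DynamoDB."""
    by_throughput = rcu / 3000 + wcu / 1000   # Pt
    by_storage = size_gb / 10                 # Ps
    return max(1, math.ceil(max(by_throughput, by_storage)))

# The 18,000 reads / 6,000 writes example from the first answer above:
print(estimated_partitions(rcu=18_000, wcu=6_000, size_gb=5))    # -> 12
# 200 small items with modest provisioned throughput fit in a single partition:
print(estimated_partitions(rcu=200, wcu=200, size_gb=0.01))      # -> 1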

Index performance on PostgreSQL for big tables

I have been searching for good information about index benchmarking on PostgreSQL and have found nothing really good.
I need to understand how PostgreSQL behaves while handling a huge amount of records.
Let's say 2000M records on a single non-partitioned table.
Theoretically, B-trees are O(log n) for reads and writes, but in practice I think that's an ideal scenario that does not consider things like HUGE indexes not fitting entirely in memory (swapping?) and maybe other things I am not seeing at this point.
There are no JOIN operations, which is fine, but note this is not an analytical database, and response times below 150 ms (the lower the better) are required. All searches are expected to be done using indexes, of course. We have 2-3 indexes:
UUID PK index
timestamp index
VARCHAR(20) index (non-unique but high cardinality)
My concern is how writes and reads will perform once the table reaches its expected top capacity (2500M records)
... so specific questions might be:
May "adding more iron" achieve reasonable performance in such scenario?
NOTE this is non-clustered DB so this is vertical scaling.
What would be the main sources of time consumption either for reads and writes?
What would be the amount of records on a table that we can consider "too much" for this standard setup on PostgreSql (no cluster, no partitions, no sharding)?
I know this amount of records suggests taking some alternative (sharding, partitioning, etc) but this question is about learning and understanding PostgreSQL capabilities more than other thing.
There should be no performance degradation inserting into or selecting from a table, even if it is large. The cost of an index access grows with the logarithm of the table size, but the base of the logarithm is large, so the depth of the index should not be more than 5 or 6. The upper levels of the index are probably cached, so you end up with only a handful of blocks read when accessing a single table row via an index scan. Note that you don't have to cache the whole index.
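To make that depth figure concrete, here is a rough sketch for the 2,500M-row table from the question; the fan-out is an assumption (roughly what an 8 kB B-tree page holds for a small key such as a UUID), not a measured value:

import math

rows = 2_500_000_000     # expected top capacity from the question
fanout = 250             # assumed index entries per 8 kB page (small key, e.g. UUID)

# Height of a B-tree with that fan-out: each level multiplies capacity by ~fanout.
depth = math.ceil(math.log(rows, fanout))
print(depth)             # -> 4 levels with these assumptions

# Even with a pessimistic fan-out of 100, the tree stays shallow:
print(math.ceil(math.log(rows, 100)))   # -> 5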

Efficiency of Querying 10 Billion Rows (with High Cardinality) in ScyllaDB

Suppose I have a table with ten billion rows spread across 100 machines. The table has the following structure:
PK1 PK2 PK3 V1 V2
Where PK represents a partition key and V represents a value. So in the above example, the partition key consists of 3 columns.
Scylla requires you to specify all columns of the partition key in the WHERE clause.
If you want to execute a query specifying only some of the columns, you'd get a warning, as this requires a full table scan:
SELECT V1, V2 FROM table WHERE PK1 = X AND PK2 = Y
In the above query, we only specify 2 out of 3 columns. Suppose the query matches 1 billion out of 10 billion rows - what is a good mental model to think about the cost/performance of this query?
My assumption is that the cost is high: it is equivalent to executing ten billion separate queries on the data set, since 1) there is no logical association between the rows in the way they are stored on disk, as each row has a different partition key (high cardinality), and 2) in order for Scylla to determine which rows match the query, it has to scan all 10 billion rows (even though only 1 billion rows match).
Assuming a single server can process 100K transactions per second (well within the range advertised by ScyllaDB folks) and the data resides on 100 servers, the (estimated) time to process this query can be calculated as: 100K * 100 = 10 million queries per second. 10 billion divided by 10M = 1,000 seconds. So it would take the cluster approx. 1,000 seconds to process the query (consuming all of the cluster resources).
Is this correct? Or is there any flaw in my mental model of how Scylla processes such queries?
Thanks
As you suggested yourself, Scylla (and everything I will say in my answer also applies to Cassandra) keeps the partitions hashed by the full partition key - containing three columns. So Scylla has no efficient way to scan only the matching partitions. It has to scan all the partitions and check, for each of them, whether its partition key matches the request.
However, this doesn't mean that it's as grossly inefficient as "executing ten billion separate queries on the data". A scan of ten billion partitions is usually (when each row's data isn't very large) much more efficient than executing ten billion random-access reads, each reading a single partition individually. A lot of work goes into a random-access read: Scylla needs to reach a coordinator, which then sends the request to replicas; each replica needs to find the specific position in its on-disk data files (often multiple files), often needs to over-read from the disk (as disk and compression alignments require), and so on. Compare this to a scan, which can read long contiguous swathes of data sorted by token (partition-key hash) from disk and can return many rows fairly quickly with fewer I/O operations and less CPU work.
So if your example setup can do 100,000 random-access reads per node, it can probably read a lot more than 100,000 rows per second during a scan. I don't know which exact number to give you, but in the blog post https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one-billion-rows-a-second/ we (full disclosure: I am a ScyllaDB developer) showed an example use case scanning a billion (!) rows per second with just 83 nodes - that's 12 million rows per second on each node instead of your estimate of 100,000. So your example use case could potentially be over in just 8.3 seconds instead of the 1,000 seconds you calculated.
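Putting the two estimates side by side (the 12M rows/s per node is the figure from that blog post's benchmark, not from your hardware, so treat it purely as an illustration):

rows  = 10_000_000_000   # rows in the table
nodes = 100              # servers in the cluster

# The question's model: every row costs one random-access read, 100K reads/s/node.
print(rows / (100_000 * nodes))      # -> 1000.0 seconds

# The blog post's measured scan rate: ~12M rows/s per node for a sequential scan.
print(rows / (12_000_000 * nodes))   # -> ~8.3 seconds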
Finally, please don't forget (this is also mentioned in the aforementioned blog post) that if you do a large scan you should explicitly parallelize, i.e., split the token range into pieces and scan them in parallel. First of all, obviously no single client will be able to handle the results of scanning a billion partitions per second, so this parallelization is more or less unavoidable. Second, scanning returns partitions in token order, which (as I explained above) sit contiguously on individual replicas - great for peak throughput, but it also means that only one node (or even one CPU) will be active at any time during the scan. So it's important to split the scan into pieces and do all of them in parallel. We also had a blog post about the importance of parallel scans and how to do them: https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/.
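Here is a minimal sketch of such a parallelized token-range scan using the Python cassandra-driver (which also works with Scylla). The contact point, keyspace, table, and the pk1/pk2 values being filtered for are placeholders, the split and thread counts are arbitrary, and filtering client-side is just one way to handle the partial partition key:

# Minimal parallel full-table scan by token range (placeholders throughout).
from concurrent.futures import ThreadPoolExecutor
from cassandra.cluster import Cluster

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1    # Murmur3 token ring boundaries
N_SPLITS = 512                              # how many pieces to scan in parallel
WANTED_PK1, WANTED_PK2 = "X", "Y"           # placeholder filter values

cluster = Cluster(["scylla-node1"])         # placeholder contact point
session = cluster.connect("my_keyspace")    # placeholder keyspace

# Each sub-query reads one contiguous slice of the token ring sequentially.
stmt = session.prepare(
    "SELECT pk1, pk2, v1, v2 FROM my_table "
    "WHERE token(pk1, pk2, pk3) >= ? AND token(pk1, pk2, pk3) <= ?"
)

def scan_slice(i):
    step = (MAX_TOKEN - MIN_TOKEN) // N_SPLITS
    lo = MIN_TOKEN + i * step
    hi = MAX_TOKEN if i == N_SPLITS - 1 else lo + step - 1
    matched = 0
    for row in session.execute(stmt, (lo, hi)):
        # The pk1/pk2 restriction is applied client-side, since the server
        # cannot use the partition index for a partial partition key.
        if row.pk1 == WANTED_PK1 and row.pk2 == WANTED_PK2:
            matched += 1
    return matched

with ThreadPoolExecutor(max_workers=32) as pool:
    total = sum(pool.map(scan_slice, range(N_SPLITS)))
print("matching rows:", total)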
Another option is to move one of the partition key columns to become a clustering key. That way, if you have the first two partition key columns, you'll be able to locate the partition and just search within it.

Number of reads compared to writes for database indexing

It is known that database indexing only makes sense if you have large tables and more reads than writes, since maintaining the indices adds write overhead: each modification of the database also leads to a modification of the indices.
If we assume that the indexing structure of the database is
a) a B+ tree
b) a hashtable
what is a rule of thumb for the number of reads compared to the number of writes at which it starts to make sense to implement database indexing?
For information on how database indexing works, check out How does database indexing work?
There are many factors involved here. For example, is the index data in the cache? Are the data blocks in the cache? How many rows are retrieved by the query? How many rows are in the table? Etc.
I have been asked this question (or a variant of it) many times. Rather than give a number, I prefer to give some information so that the person asking can make an informed decision. Here is a "back of the envelope" calculation example with some stated assumptions.
Let's say the table has 100M rows.
Let's say a row is 200 bytes on average.
That means the table consumes about 19 GB (ignoring any compression).
Let's assume that the index is fully cached and takes zero time.
Let's assume that the data has to be read from disk, at 5 ms per read.
Let's assume that your I/O subsystem can deliver data at 10 GB/s.
Let's assume that you need to read 50,000 rows from your table, and that the rows are spread out such that only two of the required rows live in the same block.
OK, now we can do some math.
Scenario 1. Using an index
Index reads take zero time (fully cached, per the assumption above)
Data block reads: 25,000 blocks @ 5 ms each is 125 s
Scenario 2. Scanning the table
19 GB scanned at 10 GB/s is about 2 s
Therefore in this case, scanning the data is much faster than using an index.
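The same arithmetic as a small sketch, using exactly the assumptions listed above:

rows            = 100_000_000   # rows in the table
row_bytes       = 200           # average row size
rows_needed     = 50_000        # rows the query must return
rows_per_block  = 2             # required rows sharing one data block
read_latency_s  = 0.005         # 5 ms per random block read
scan_rate_gbs   = 10            # sequential scan throughput, GB/s

table_gb = rows * row_bytes / 1e9            # ~20 GB (the answer rounds to 19 GB)
index_plan_s = (rows_needed / rows_per_block) * read_latency_s   # index itself assumed free
full_scan_s  = table_gb / scan_rate_gbs

print(f"index plan: {index_plan_s:.0f} s")   # 125 s
print(f"full scan : {full_scan_s:.1f} s")    # 2.0 s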
Cost formulas
No index:
insert: O(1)
search: O(n)
cost search = n db lookup operations
cost insert = 1 db insert
a) B+ tree of order b
insert: O(log n)
search: O(log n)
cost search = log_b n lookup operations + db lookup
cost insert = log_b n lookup operations + db insert
With increasing order, the number of lookup operations goes down, but the cost per lookup operation increases.
b) Hashtable
insert: O(1)
search: O(1)
cost search = hash calculation [+ extra buckets when collisions happen] + db lookup
cost insert = hash calculation [+ extra buckets when collisions happen] + db insert
Cost at n = 1000
No index:
cost search = 1000 db lookup operations
cost insert = 1 db insert
a1) B+ tree of order 2
cost search = 10 lookup operations + 1 db lookup
cost insert = 10 lookup operations + 1 db insert
a2) B+ tree of order 10
cost search = 3 lookup operations + 1 db lookup
cost insert = 3 lookup operations + 1 db insert
b) Hashtable
cost search = hash calculation [+ extra buckets when collisions happen] + db lookup
cost insert = hash calculation [+ extra buckets when collisions happen] + db insert
Cost at n = 1000000
No index:
cost search = 1000000 db lookup operations
cost insert = 1 db insert
a1) B+ tree of order 2
cost search = 20 lookup operations + 1 db lookup
cost insert = 20 lookup operations + 1 db insert
a2) B+ tree of order 10
cost search = 6 lookup operations + 1 db lookup
cost insert = 6 lookup operations + 1 db insert
b) Hashtable
cost search = hash calculation [+ extra buckets when collisions happen] + db lookup
cost insert = hash calculation [+ extra buckets when collisions happen] + db insert
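The lookup counts in the tables above are just ⌈log_b n⌉; a small sketch reproduces them:

def btree_lookups(n, order):
    """Smallest depth d with order**d >= n, i.e. ceil(log_order(n)) node visits."""
    depth, capacity = 0, 1
    while capacity < n:
        capacity *= order
        depth += 1
    return depth

for n in (1_000, 1_000_000):
    for order in (2, 10):
        print(f"n={n:>9,}  order={order:>2}: "
              f"{btree_lookups(n, order)} lookup operations + 1 db lookup")
# n=    1,000  order= 2: 10 lookup operations + 1 db lookup
# n=    1,000  order=10: 3 lookup operations + 1 db lookup
# n=1,000,000  order= 2: 20 lookup operations + 1 db lookup
# n=1,000,000  order=10: 6 lookup operations + 1 db lookup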
There are also quite large cost gains to be had from hitting the hardware caches on subsequent accesses. The exact gains depend on the hardware your database is installed on.
The costs are not all easily comparable. The hash calculation is generally more expensive than the lookups, but it stays the same as n gets large. B+ tree lookups and sequential DB lookups are likely to hit hardware caches (due to entire blocks being loaded).
The larger a table, the more important it is to have an index (see cost at n=1000 vs n=1000000). So the number of writes vs reads will vary with the size of your table.
You should also take your specific queries into account. For example, hash tables are not ordered while B trees are. So if a query needs to pick up all values between a minimum and a maximum value, a B tree would perform better than a hash table (a good hash is uniformly distributed).
In general, you would have to measure performance of the queries and inserts you would be using in practice. You could start without an index and then add one when you need it. Indexes can be added later without having to change the queries of the programs using your DB (the response time of the queries will change though, reads will get faster if they use the index, writes will get slower).
If you are in the specific case where you have a load operation with a lot of inserts followed by a period of reads, you can temporarily turn off the index and recalculate it after the load.
References:
B trees vs hash tables:
https://cs.stackexchange.com/questions/270/hash-tables-versus-binary-trees
Cache
http://www.moreprocess.com/devices/computer-memory-hierarchy-internal-register-cache-ram-hard-disk-magnetic-tape
The point of using a database is usually to cut the time spent looking for something in the data, so it is mostly about "reading". I am very sure that 90% of people use a database for "reading", if not 100%.
Let's think of a few cases:
Writes only, no reads: what's the point? Just for backup? Well, the backup will be used for reads when it is restored.
Many writes, few reads: I can't think of such a situation, but if you do have it, which do you value more: presenting the data swiftly, or saving the data quickly? If you value how fast the data is saved, you can remove the index and run a nightly job to summarize and present the data (it is probably big enough to need a nightly run, given how fast the data is being saved). That gives you near-real-time writes and reads that are an hour or more stale, which raises the question: what's the point of such a configuration?
Few writes, many reads: this is the most common situation.
No writes, only reads: you can't really produce this situation; with no writes, what would there be to read?
Sorry for my English, hope it helps.

DSE SOLR OOMing

We have had a 3-node DSE Solr cluster running and recently added a new core. After about a week of running fine, all of the Solr nodes are now OOMing. They fill up both the JVM heap (set at 8 GB) and the system memory. They are also constantly flushing the memtables to disk.
The cluster is DSE 3.2.5 with RF=3
here is the solrconfig from the new core:
http://pastie.org/8973780
How big is your Solr index relative to the amount of system memory available for the OS to cache file system pages? Basically, your Solr index needs to fit in the OS file system cache (the amount of system memory available after DSE has started but has not yet processed any significant amount of data).
Also, how many Solr documents (Cassandra rows) and how many fields (Cassandra columns) are populated on each node? There is no hard limit, but 40 to 100 million is a good guideline as an upper limit - per node.
And, how much system memory and how much JVM heap is available if you restart DSE, but before you start putting load on the server?
For RF=N, where N is the total number of nodes in the cluster or at least the search data center, all of the data will be stored on all nodes, which is okay for smaller datasets, but not okay for larger datasets.
For RF=n, this means that each node will have X*n/N rows or documents, where X is the total number of rows or documents across all column families in the data center, N is the number of nodes, and n is the replication factor. X*n/N is the number that you should try to keep below 100 million. That's not a hard limit - some datasets and hardware might be able to handle substantially more, and some datasets and hardware might not even be able to hold that much. You'll have to discover the number that works best for your own app, but the 40 million to 100 million range is a good start.
In short, the safest estimate is to keep X*n/N under 40 million for Solr nodes. 100 million may be fine for some data sets and beefier hardware.
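A small sketch of that estimate (the document count, node count, and RF here are made-up example values, not from the question):

def docs_per_node(total_docs, nodes, rf):
    """Approximate Solr documents (Cassandra rows) per node:
    each document is stored rf times, spread across the nodes."""
    return total_docs * rf / nodes

# Hypothetical search DC: 600M documents, 12 nodes, RF=3.
per_node = docs_per_node(total_docs=600_000_000, nodes=12, rf=3)
print(f"~{per_node / 1e6:.0f}M docs per node")   # -> ~150M: above the 40-100M guideline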
As far as tuning, one common source of using lots of heap is heavy use of Solr facets and filter queries.
One technique is to use "DocValues" fields for facets since DocValues can be stored off-heap.
Filter queries can be marked as cache=false to save heap memory.
Also, the various Solr caches can be reduced in size or even set to zero. That's in solrconfig.xml.
