Efficiency of Querying 10 Billion Rows (with High Cardinality) in ScyllaDB

Suppose I have a table with ten billion rows spread across 100 machines. The table has the following structure:
PK1 PK2 PK3 V1 V2
Where PK represents a partition key and V represents a value. So in the above example, the partition key consists of 3 columns.
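For concreteness, a minimal CQL sketch of such a table might look like the following (the table is called mytable here and the column types are guesses, since the question doesn't give them):

CREATE TABLE mytable (
    PK1 bigint,
    PK2 bigint,
    PK3 bigint,
    V1  text,
    V2  text,
    PRIMARY KEY ((PK1, PK2, PK3))   -- composite partition key, no clustering columns
);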
Scylla requires you to specify all columns of the partition key in the WHERE clause.
If you want to execute a query that specifies only some of those columns, you get a warning, because it requires a full table scan:
SELECT V1, V2 FROM table WHERE PK1 = X AND PK2 = Y
In the above query, we only specify 2 out of 3 columns. Suppose the query matches 1 billion out of 10 billion rows - what is a good mental model to think about the cost/performance of this query?
My assumption is that the cost is high: it is equivalent to executing ten billion separate queries on the data set, since 1) there is no logical association between the rows in the way they are stored on disk, as each row has a different partition key (high cardinality), and 2) in order for Scylla to determine which rows match the query it has to scan all 10 billion rows (even though the result set only contains 1 billion rows).
Assuming a single server can process 100K transactions per second (well within the range advertised by ScyllaDB folks) and the data resides on 100 servers, the (estimated) time to process this query can be calculated as: 100K * 100 = 10 million queries per second. 10 billion divided by 10M = 1,000 seconds. So it would take the cluster approx. 1,000 seconds to process the query (consuming all of the cluster resources).
Is this correct? Or is there any flaw in my mental model of how Scylla processes such queries?
Thanks

As you suggested yourself, Scylla (and everything I will say in my answer also applies to Cassandra) keeps the partitions hashed by the full partition key - containing three columns. So Scylla has no efficient way to scan only the matching partitions. It has to scan all the partitions, and check for each of them whether its partition key matches the request.
However, this doesn't mean that it's as grossly inefficient as "executing ten billion separate queries on the data". A scan of ten billion partitions is usually (when each row's data itself isn't very large) much more efficient than executing ten billion random-access reads, each reading a single partition individually. There's a lot of work that goes into a random-access read - Scylla needs to reach a coordinator, which then sends the request to replicas; each replica needs to find the specific position in its on-disk data files (often multiple files), often needs to over-read from the disk (as disk and compression alignments require), and so on. Compare this to a scan - which can read long contiguous swathes of data sorted by token (the partition-key hash) from disk and can return many rows fairly quickly, with fewer I/O operations and less CPU work.
So if your example setup can do 100,000 random-access reads per second per node, it can probably read a lot more than 100,000 rows per second during a scan. I don't know which exact number to give you, but in the blog post https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one-billion-rows-a-second/ we (full disclosure: I am a ScyllaDB developer) showed an example use case scanning a billion (!) rows per second with just 83 nodes - that's 12 million rows per second on each node, instead of your estimate of 100,000. So your example scan could potentially be over in just 8.3 seconds, instead of the 1,000 seconds you calculated.
Finally, please don't forget (this is also mentioned in the aforementioned blog post) that if you do a large scan you should explicitly parallelize, i.e., split the token range into pieces and scan them in parallel. First of all, obviously no single client will be able to handle the results of scanning a billion partitions per second, so this parallelization is more-or-less unavoidable. Second, a scan returns partitions in token (storage) order, which (as I explained above) sit contiguously on individual replicas - great for peak throughput, but it also means that only one node (or even one CPU) would be active at any time during the scan. So it's important to split the scan into pieces and run all of them in parallel. We also had a blog post about the importance of parallel scans, and how to do them: https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/.
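To make the token-range splitting concrete, here is a rough sketch using the hypothetical table from the question (the sub-range boundaries are arbitrary). Each sub-range is an ordinary CQL query that a separate worker can run concurrently; the PK1/PK2 condition from the question would then be applied on top of each sub-range (with ALLOW FILTERING, or on the client side):

SELECT PK1, PK2, PK3, V1, V2
  FROM mytable
 WHERE token(PK1, PK2, PK3) >= -9223372036854775808
   AND token(PK1, PK2, PK3) <  -9200000000000000000;
-- ...and so on for each remaining sub-range, up to the maximum token
-- (9223372036854775807), with all sub-range queries running in parallel.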

Another option is to move one of the partition key columns to be a clustering key instead; this way, if you have the first two PKs, you'll be able to locate the partition and just search within it.
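A hedged sketch of that alternative, with hypothetical types: PK3 becomes a clustering column, so PK1 and PK2 alone identify a partition and the query from the question becomes a single-partition read:

CREATE TABLE mytable_by_pk1_pk2 (
    PK1 bigint,
    PK2 bigint,
    PK3 bigint,
    V1  text,
    V2  text,
    PRIMARY KEY ((PK1, PK2), PK3)   -- (PK1, PK2) is the partition key, PK3 clusters within it
);

-- Hits exactly one partition, then searches within it by PK3 if needed:
SELECT V1, V2 FROM mytable_by_pk1_pk2 WHERE PK1 = ? AND PK2 = ?;

The trade-off is that all rows sharing a (PK1, PK2) pair now live in one partition, so this only works well if those partitions stay reasonably sized.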

Related

What decides the number of partitions in a DynamoDB table?

I'm a beginner with DynamoDB, my online instructor doesn't answer questions in his Q/A (lol), and I've been confused about this.
I know that the partition key decides the partition in which the item will be placed.
I also know that the number of partitions is calculated based on throughput or storage using the famous formulas
So let's say a table has user_id as its partition Key, with 200 user_ids. Does that automatically mean that we have 200 partitions? If so, why didn't we calculate the no. of partitions based on the famous formulas?
Thanks
Let's establish 2 things.
A DynamoDB partition can support 3,000 read operations and 1,000 write operations per second. It keeps a divider between read and write ops so they do not interfere with each other. If you had a table configured to support 18,000 reads and 6,000 writes, you'd have at least 12 partitions, but probably a few more for some headroom.
A provisioned capacity table has 1 partition by default, but an on-demand table has 4 partitions by default.
So, to answer your question directly: just because you have 200 distinct partition key values does not mean you have 200 partitions. It is very possible for those 200 items to be in just one partition if your table is in provisioned capacity mode. If the configuration of the table changes or it takes on more traffic, those items might move around to new partitions.
There are a few distinct times where DynamoDB will add partitions.
When a partition grows larger than 10GB of storage. DynamoDB might see that you are taking on data and try to do this proactively, but 10GB is the cutoff.
When your table needs to support more operations per second than it currently does. This can happen manually, because you configured your table to support 20,000 reads/sec where before it only supported 2,000 - DynamoDB has to add partitions and move data to be able to handle those 20,000 reads/sec. Or it can happen automatically, because you configured floor and ceiling values in DynamoDB auto-scaling and DynamoDB senses your ops/sec is climbing, so it adjusts the number of partitions in response to capacity exceptions.
Your table is in on-demand capacity mode and DynamoDB attempts to automatically keep 2x your previous high water mark of capacity. For example, say your table just reached 10,000 RCU for the first time. DynamoDB would see that is past your previous high water mark and start adding more partitions as it tries to keep 2x the capacity at the ready in case you peak up again like you just did.
DynamoDB actively monitors your table, and if it sees that one or more items are being hit particularly hard (hot keys) and sit in the same partition, that might create a hot partition. If that is happening, DynamoDB might split the partition to help isolate those items and prevent or fix a hot-partition situation.
There are one or two other more rare edge cases, but you'd likely be talking to AWS Support if you encountered this.
Note: Once DynamoDB creates partitions, the number of partitions never shrinks and this is ok. Throughput dilution is no longer a thing in DynamoDB.
The partition key value is hashed to determine the actual partition to place the data item into.
Thus the number of distinct partition key values has zero effect on the number of physical partitions.
The only things that affect the physical number of partitions are RCUs/WCUs (throughput) and the amount of data stored.
Nbr Partitions (throughput): Pt = RCU/3000 + WCU/1000
Nbr Partitions (storage): Ps = GB/10
Unless one of the above is more than 1.0, there will likely be only a single partition. I'm sure the split actually happens as you approach the limits; exactly when is something only AWS knows.
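As a made-up example of the arithmetic: a provisioned table with 9,000 RCU, 3,000 WCU and 25 GB of stored data would, by these rough formulas, need about Pt = 9000/3000 + 3000/1000 = 6 partitions by throughput and Ps = 25/10 = 2.5 partitions by storage, so roughly max(6, 2.5) = 6 partitions.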

Is Cassandra / ScyllaDB capable of handling millions of very wide data rows?

A new business need has emerged in our firm, where a relatively "big" data set needs to be accessed by online processes (with typical latency of up to 1 second). There is only one key, with high granularity / a row count measured in the tens of millions, and the expected number of columns / fields / value columns will likely exceed hundreds of thousands.
The key column is shared among all value columns, so key-value storage, while scalable, seems rather wasteful here. Is there any hope for using Cassandra / ScyllaDB (to which we gradually narrowed down our search) for such a wide data set, while ideally reducing also data storage needs by half (by storing the common key only once)?
If I understand your use case correctly, you will have tens of millions of partitions (what you called rows), each with hundreds of thousands of different values (each of those would be a clustering row in modern CQL - CQL no longer supports schema-less wide rows). This is a fairly reasonable data set for Scylla and Cassandra.
But I want to add that I'm not sure the storage saving you are hoping for will really be there. Yes, Scylla/Cassandra will not need to store the partition key multiple times, but unless the partition key is very long, this will often be negligible compared to the other overheads of storing the data on disk.
Another thing you should consider is your expected queries. How will you read from this database? If you'll want to read all 100,000 columns of a particular key, or a contiguous range of them, then the data model you described is perfect. However, if the expected use case is that you always plan to read a single column from a specific key, then this data model will be inefficient - a random-access read from the middle of a long partition is slower than reading the value from a short partition.
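A minimal CQL sketch of that data model (names and types are assumptions): the shared key becomes the partition key and the "column name" becomes a clustering column, so all the values of one key are stored together and sorted:

CREATE TABLE wide_data (
    key        bigint,
    field_name text,
    value      text,
    PRIMARY KEY (key, field_name)   -- key = partition key, field_name = clustering key
);

-- Reading all fields (or a contiguous range of them) for one key is the efficient case:
SELECT field_name, value FROM wide_data WHERE key = ?;

-- Reading one field from the middle of a long partition is the slower, random-access case:
SELECT value FROM wide_data WHERE key = ? AND field_name = ?;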

Does the number of records have any impact on performance

Does the number of records in a db affect the speed of SELECT queries?
I mean, if one db has 50 records and another has 5 million records, will the selects from the 2nd one be slower, assuming I have all the indexes in the right place?
Yes, but it doesn't have to be a large penalty.
At the most basic level an index is a b-tree, and performance is somewhat correlated with the number of levels in the tree. Treating it as a binary tree for simplicity, a 5-record database has about 2 levels and a 5-million-record database about 22 levels; doubling to 10 million rows only adds one more level (23). And really, index access times are typically not the problem in performance tuning - the usual problem is tables that aren't indexed properly.
As noted by odedsh, caching is also a large contributor, and small databases will be cached well. Sqlite stores records in primary key sequence, so picking a primary key that allows records that are commonly used together to be stored together can be a big benefit.
Yeah it matters for the reasons the others said.
There are other things that can affect the speed of SELECT statements too, such as how many columns you're grabbing data from.
I once did some speed tests on a table with over 150 columns, where I needed only about 40 of the columns, and I needed all 20,000+ records. While the speed differences were very minimal (we're talking 20 to 40 milliseconds), it was actually faster to grab the data from all the columns with a 'SELECT *' than to list them with 'SELECT Field1, Field2, etc.'.
I assume the more records and columns in your table, the greater the speed difference this example will net you, but I never had a need to test it any farther in more extreme cases like 5 million records in a table.
Yes.
If a table is tiny and the entire db is tiny, then when you select anything from the table it is very likely that all the data is already in memory and the result can be returned immediately.
If the table is huge but you have an index and you are doing a simple select on the indexed columns, then the index can be scanned, the correct blocks read from disk, and the result returned.
If there is no index that can be used then the db will do a full table scan reading the table block by block looking for matches.
If there is a partial match between the index columns and the columns in the select query, then the db can try to minimize the number of blocks that need to be read. A lot of thought can go into properly choosing the index structure and type (bitmap / regular).
And this is just for the most basic SQL that selects from a single table without any calculations.
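To make the index/no-index cases above concrete, a small hedged sketch with hypothetical table and column names:

CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Can use the index: only the matching blocks need to be read.
SELECT order_id, total FROM orders WHERE customer_id = 42;

-- No usable index on the predicate: the database falls back to a full table scan.
SELECT order_id, total FROM orders WHERE note LIKE '%overdue%';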

How do indexes and disk seeks play well together?

I have another question, but I'll be more specific.
I see that selecting from a million-row table takes < 1 second. What I don't understand is how it can do this with indexes. A seek seems to take about 10 ms, so to finish in under 1 second it must do fewer than 100 seeks. If there is an index entry per row, then 1M rows needs at least 1K blocks just to store the index (actually it's higher if it's 8 bytes per row: a 32-bit index value plus a 32-bit key offset). Then we would need to actually travel to the rows and collect the data. How do databases keep the seeks low and pull the data as fast as they do?
One way is something called a 'clustered index', where the rows of the table are physically ordered according to the clustered index's sort. Then when you want to read in a range of values along the indexed field, you find the first one, and you can just read it all in at once with no extra IO.
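For instance (SQL Server syntax, hypothetical names), with a clustered index on a timestamp column, a range query along that column reads a physically contiguous run of pages:

CREATE CLUSTERED INDEX cix_events_ts ON events (event_ts);

-- The matching rows sit next to each other on disk, so few seeks are needed:
SELECT event_id, payload
  FROM events
 WHERE event_ts >= '2024-01-01' AND event_ts < '2024-01-02';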
Also:
1) When reading an index, a large chunk of the index will be read in at once. If descending the B-tree (or moving along the children at the bottom, once you've found your range) moves you to another node already read into memory, you've saved an IO.
2) If the number of records that the SQL server statistically expects to retrieve is so high that the random-access requirement of going from the index to the underlying rows would require so many IO operations that it would be faster to do a table scan, then it will do a table scan instead. You can see this e.g. using the query planner in SQL Server or PostgreSQL. But for small ranges the index is usually better, and the query plan will reflect this.
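You can watch the planner make this choice yourself; e.g. in PostgreSQL (hypothetical table), EXPLAIN shows whether it picked an index scan or a sequential scan for a given predicate:

EXPLAIN SELECT * FROM events WHERE event_ts < '2020-01-01';
-- wide range: the plan will likely show a "Seq Scan" on events

EXPLAIN SELECT * FROM events WHERE event_ts = '2024-06-01 12:00:00';
-- narrow predicate: the plan will typically show an "Index Scan"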

SQL Server Profiler - Evaluating Reads. What is considered 'good' or 'bad'?

I'm profiling (SQL Server 2008) some of our views and queries to determine their efficiency with regards to CPU usage and Reads. I understand Reads are the number of logical disk reads in 8KB pages. But I'm having a hard time determining what I should be satisfied with.
For example, when I query one of our views, which in turn joins with another view and has three OUTER APPLYs with table-valued UDFs, I get a Reads value of 321 with a CPU value of 0. My first thought is that I should be happy with this. But how do I evaluate the value of 321? This tells me 2,629,632 bytes of data were logically read to satisfy the query (which returned a single row and 30 columns).
How would some of you go about determining if this is good enough, or requires more fine tuning? What criteria would you use?
Also, I'm curious what is included in the 2,629,632 bytes of logical data read. Does this include all the data contained in the 30 columns of the single row returned?
The 2.5MB includes all data in the 321 pages, including the other rows in the same pages as those retrieved for your query, as well as the index pages retrieved to find your data. Note that these are logical reads, not physical reads - a read from a cached page is much 'cheaper' - so take CPU and the profiler's cost indicators into account as well when optimising.
w.r.t. How to determine an optimum 'target' for reads.
FWIW I compare the actual reads with an optimum value, which I think of as the minimum number of pages needed to return the data in your query in a 'perfect' world.
e.g. if you calculate roughly 5 rows per page from table x, and your query returns 20 rows, the 'perfect' number of reads would be 4, plus some overhead of navigating indexes (assuming of course that the rows are clustered 'perfectly' for your query) - so utopia would be around say 5-10 pages.
For a performance critical query, you can use the actual reads vs 'utopian' reads to micro-optimise, e.g.:
Whether I can fit more rows per page in the cluster (table), e.g. replacing non-searched strings with varchar instead of char, using varchar instead of nvarchar, or using smaller integer types, etc.
Whether the clustered index could be changed such that fewer pages would need to be fetched (e.g. if the 20 rows for the above query were scattered across different pages, then reads would be > 4)
Failing that (since you can have only one clustered index), whether covering indexes could remove the need to go to the table data (cluster) at all, since a covering index fitting your query will have higher 'row' density (see the sketch after this list)
And for indexes, density improvements such as fill factors or narrower index keys can mean fewer index reads
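A hedged sketch of such a covering index (SQL Server syntax, hypothetical names); the INCLUDE columns let the query below be answered from the index alone, without touching the clustered table at all:

CREATE NONCLUSTERED INDEX ix_orders_customer_cover
    ON orders (customer_id)
    INCLUDE (order_date, total);

-- Satisfied entirely from the covering index:
SELECT order_date, total FROM orders WHERE customer_id = 42;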
You might find this article useful
HTH!
321 reads with a CPU value of 0 sounds pretty good, but it all depends.
How often is this query run? Why are table-returning UDFs used instead of just doing joins? What is the context of database use (how many users, number of transactions per second, database size, is it OLTP or data warehousing)?
The extra data reads come from:
All the other data in the pages needed to satisfy the reads done in the execution plan. Note this includes clustered and nonclustered indexes. Examining the execution plan will give you a better idea of what exactly is being read. You'll see references to all sorts of indexes and tables, and whether a seek or scan was required. Note that a scan means every page in the whole index or table was read. That is why seeks are desirable over scans.
All the related data in tables INNER JOINed to in the views regardless of whether these JOINs are needed to give correct results for the query you're performing, since the optimizer doesn't know that these INNER JOINs will or won't exclude/include rows until it JOINs them.
If you provide the queries and execution plans, as requested, I would probably be able to give you better advice. Since you're using table-valued UDFs, I would also need to see the UDFs themselves, or at least the execution plan of the UDFs (which is only possible by tearing out their meat and running it outside a function context, or converting them to stored procedures).
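If you want to see where those 321 reads actually go, one simple way (hypothetical object names) is to turn on I/O statistics before running the query; SQL Server then reports the logical reads for each table and index touched by the plan:

SET STATISTICS IO ON;
SELECT * FROM dbo.YourView WHERE SomeKey = 123;
-- The Messages tab then lists "logical reads" per table/index used by the plan.
SET STATISTICS IO OFF;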
