Deciding the clustering key in Snowflake - snowflake-cloud-data-platform

As per the documentation, we can select a column for clustering based on its cardinality (the number of distinct values) and on whether the column is used in join conditions. Below is the output of the clustering information for one of the tables in my select query; scanning this table alone takes more than 80% of the total execution time. FYI, I collected the output below for this table using the column that appears in the join condition.
Relating the output to my understanding, the points below make me feel that clustering the table on this column will help performance:
ratio of total_partition_count (20955) to average_overlaps (17151.4681)
ratio of total_partition_count (20955) to average_depth (16142.2524)
1. Correct me if my understanding is wrong: based on the facts below, is this table a good candidate for clustering or not?
Please also help with the points below:
2. If I opt to cluster the table, will it need any downtime, and does clustering add to my bill?
3. Does clustering impact future DML operations?
4. I see the select query returns 23 rows after scanning 37 GB of data. Other than clustering, what would be the best way to improve the performance of this query?
Let me know if any further details are required.
select SYSTEM$CLUSTERING_INFORMATION('tablename','(columnname)');
{
"cluster_by_keys" : "LINEAR(column used in join condition)",
"total_partition_count" : 20955,
"total_constant_partition_count" : 2702,
"average_overlaps" : 17151.4681,
"average_depth" : 16142.2524,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 1933,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"08192" : 2,
"16384" : 3,
"32768" : 19017
}
}

A table is a good candidate for clustering if the operation you care about most returns only a small number of rows compared with the total partitions read, many of which are dropped by the filtering you would cluster by. In other words, you are wasting money reading data you don't want.
But you can only have one clustering key on a table, so if you make one query better you might make others worse.
Auto-clustering does its work in the background; think of it like defragmenting your hard drive. It is real work, so yes, you pay for it.
Future DML is not directly impacted by clustering, if by that you mean "will a future insert be N times slower because clustering is on". But given you are altering data, there are two ways DML can impact your clustering: if the data you insert arrives in random order (with respect to your clustering key), that data will need to be sorted; and if you do high-frequency inserts, they can interfere with the background clustering operations. Also, because new data is inserted into new partitions, the larger the set grows, the more reclustering is needed.
Alternatively, you can rewrite the table with an ORDER BY and do the clustering yourself.
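A minimal sketch of that manual approach, using placeholder names (my_table and join_col are illustrative, not the asker's real objects): build a fully sorted copy, then swap it in.
-- Rebuild a fully sorted copy of the table, ordered by the would-be clustering key
CREATE OR REPLACE TABLE my_table_sorted AS
SELECT * FROM my_table ORDER BY join_col;
-- Atomically exchange the sorted copy with the original, then drop the old, unsorted data
ALTER TABLE my_table_sorted SWAP WITH my_table;
DROP TABLE my_table_sorted;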

Analysis of SYSTEM$CLUSTERING_INFORMATION data.
"average_overlaps" : 17151.4681 >
Average number of overlapping micro-partitions for each micro-partition in the table. A high number indicates the table is not well-clustered.
"average_depth" : 16142.2524 >
Average overlap depth of each micro-partition in the table. A high number indicates the table is not well-clustered.
The buckets 00000 through 32768 describe how many micro-partitions (similar in concept to files) fall into each overlap-depth range for your cluster key.
"00000" : 0 >
No micro-partitions (out of 20955 total) have an overlap depth of 0.
"32768" : 19017 > The bucket label 32768 is a depth range: 19017 micro-partitions each have an overlap depth of up to 32768 (here, effectively close to the total of 20955), meaning the cluster key values they contain also appear in thousands of other micro-partitions. That is bad, because to find one of those cluster key values we would need to scan nearly all of these micro-partitions.
https://docs.snowflake.com/en/sql-reference/functions/system_clustering_information.html
As mentioned earlier, a table is a good candidate for clustering if the clustering keys are used in your queries, for example in selective filters or join predicates.
https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html#strategies-for-selecting-clustering-keys
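If the table does turn out to be a good candidate, defining the key is a single statement; the names below are just the placeholders from the SYSTEM$CLUSTERING_INFORMATION call above.
-- Define the clustering key; Automatic Clustering then maintains it in the background
ALTER TABLE tablename CLUSTER BY (columnname);
-- The background service can be suspended and resumed if you need to control its credit usage
ALTER TABLE tablename SUSPEND RECLUSTER;
ALTER TABLE tablename RESUME RECLUSTER;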
2. Automatic Clustering works in the background and no downtime is required. As far as billing is concerned, there will be some additional charges.
3. Clustering does not impact future DML operations as such, but if there is constant DML activity on the table, then as soon as the number of new micro-partitions reaches a certain threshold, Automatic Clustering will start working to keep the table well clustered.
As mentioned by Simeon, you can also opt to rewrite the table with an ORDER BY and do the clustering yourself.

Related

Time complexity of Cursor Pagination

I have read different articles saying that a cursor-pagination query has time complexity O(1) or O(limit), where limit is the LIMIT value in the SQL. Some example articles:
https://uxdesign.cc/why-facebook-says-cursor-pagination-is-the-greatest-d6b98d86b6c0 and
https://dev.to/jackmarchant/offset-and-cursor-pagination-explained-b89
But I cannot find any references explaining why the time complexity is O(limit). Say I have a table consisting of 3 columns,
id, name, created_at, where id is the primary key.
If I use created_at as the cursor (which is unique and sequential), can someone explain why the time complexity is O(limit)?
Is it related to the data structure used to store created_at?
After some reading, I think the time complexity refers to the work done after locating the starting point, i.e. the cost of producing the final required records.
In the offset case, the database has to walk through and discard x rows, where x is the offset, before returning y rows (where y = limit), so the time complexity is O(offset + limit).
In the cursor case, the rows matching the cursor WHERE condition are located directly, and then y rows (where y = limit) are returned, so the time complexity is O(limit).
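To make the difference concrete, here is a rough sketch against an assumed items(id, name, created_at) table with an index on created_at (table name, cursor value and limits are illustrative, not from the question):
-- Offset pagination: the engine must still produce and discard the first 10000 rows -> O(offset + limit)
SELECT id, name, created_at
FROM items
ORDER BY created_at
LIMIT 20 OFFSET 10000;
-- Cursor pagination: the index on created_at lets the engine seek directly to the cursor
-- value and read the next 20 rows in order -> O(limit)
SELECT id, name, created_at
FROM items
WHERE created_at > '2023-05-01 08:30:00'  -- cursor = created_at of the last row on the previous page
ORDER BY created_at
LIMIT 20;
So the O(limit) claim implicitly relies on the index (typically a B-tree) over the cursor column: the seek to the cursor value is what replaces the O(offset) walk.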

Snowflake query pruning by Column

In the Snowflake docs it says:
First, prune micro-partitions that are not needed for the query.
Then, prune by column within the remaining micro-partitions.
What is meant with the second step?
Let's take the example table t1 shown in the link. In this example table I use the following query:
SELECT * FROM t1
WHERE
Date = '11/3' AND
Name = 'C'
Because of Date = '11/3' it would only scan micro-partitions 2, 3 and 4. Because of Name = 'C' it can prune even more and only scan micro-partitions 2 and 4.
So in the end only micro-partitions 2 and 4 would be scanned.
But where does the second step come into play? What is meant with prune by column within the remaining micro partitions?
Does it mean that only rows 4, 5 and 6 in micro-partition 2 and row 1 in micro-partition 4 are scanned, because Date is my clustering key and is sorted, so you can prune even further with the date?
So in the end only 4 rows would be scanned?
But where does the second step come into play? What is meant with prune by column within the remaining micro partitions?
Benefits of Micro-partitioning:
Columns are stored independently within micro-partitions, often referred to as columnar storage.
This enables efficient scanning of individual columns; only the columns referenced by a query are scanned.
It is recommended to avoid SELECT * and specify required columns explicitly.
It simply means to only select the columns that are required for the query. So in your example it would be:
SELECT col_1, col_2 FROM t1
WHERE
Date = '11/3' AND
Name = 'C'

Snowflake Partitioning Vs Manual Clustering

I have 2 large tables in Snowflake (~1 and ~15 TB resp.) that store click events. They live in two different schemas but have the same columns and structure; just different sources.
The data is dumped/appended into these tables on a monthly basis, and both tables have a time_id integer field which represents the number of days since 1999-12-31 on which the click event took place.
Question is -- Should I leave it up to Snowflake to optimize the partitioning --OR-- Is this a good candidate for manually assigning a clustering key? And say, I do decide to add a clustering key to it, would re-clustering after next insert be just for the incremental data? --OR-- Would it be just as expensive as the initial clustering?
In case it helps, here is some clustering info on the larger of the 2 tables
select system$clustering_information( 'table_name', '(time_id)')
{
"cluster_by_keys" : "LINEAR(time_id)",
"total_partition_count" : 1151026,
"total_constant_partition_count" : 130556,
"average_overlaps" : 4850.673,
"average_depth" : 3003.3745,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 127148,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"01024" : 984,
"02048" : 234531,
"04096" : 422451,
"08192" : 365912
}
}
A typical query I would run against these tables
select col1, col_2, col3, col4, time_id
from big_table
where time_id between 6000 and 7600;
Should I leave it up to Snowflake to optimize the partitioning? Is this a good candidate for manually assigning a clustering key?
Yes, it seems it's a good candidate to assign a clustering key (size + update intervals + query filters)
And say, I do decide to add a clustering key to it, would re-clustering after next insert be just for the incremental data?
After the initial reclustering, if you do not insert data belonging to earlier days, the existing partitions will be in a "constant" state, so reclustering will process only the new data/micro-partitions.
https://docs.snowflake.com/en/user-guide/tables-auto-reclustering.html#optimal-efficiency
Would it be just as expensive as the initial clustering?
Under normal conditions, it should not be.
Mostly a long-winded comment on the question asked under Gokhan's answer:
This is helpful! Just so I have a sense of cost and time, how long do you think it'll take to run the clustering?
I would suggest you do a one-off rebuild of the table with an ORDER BY on the time column rather than leaving auto-clustering to incrementally sort a table this large.
I say this because we had a collection of tables with about 3B rows each (there were roughly 30 of these tables), and we did a GDPR-related PII clean-up every month that removed one month's data via an UPDATE command. As the UPDATE has no ORDER BY, the ordering was destroyed for about 1/3 of each table, which auto-clustering would then "fix" over the following day.
Our auto-clustering bill was normally ~100 credits a day, but on those days we were using ~300 credits, which implies ~6 credits per table, whereas a full table re-create with an ORDER BY would take maybe 15 minutes on a Medium warehouse, so ~1 credit.
This is not to deride auto-clustering, but when a table gets randomly scrambled, its "a little at a time" approach is too passive/costly, imho.
On the other hand, if you cannot block the insert process for N minutes while you recreate the table, auto-clustering might be your only option; the counterpoint is that if you are always writing to the table, auto-clustering will back off a lot due to failed writes. But this is more a "general case" detail to watch out for, given that you state you do monthly loads.
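A sketch of that one-off rebuild for the table in question (this assumes the monthly load can be paused while it runs, and is worth verifying on a clone first; the CREATE OR REPLACE ... ORDER BY plus SWAP pattern shown in the first answer is an alternative):
-- Re-sort the table in place on the filter column; OVERWRITE truncates the target and reloads it in a single statement
INSERT OVERWRITE INTO big_table
SELECT * FROM big_table ORDER BY time_id;
-- Then define the clustering key so Automatic Clustering only has to maintain the new monthly loads
ALTER TABLE big_table CLUSTER BY (time_id);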

creating an index did not change my query cost

I was trying to decrease the cost of query execution by creating an index on the rating column. The table has 2680 tuples.
SELECT * from cup_matches WHERE rating*3 > 20
However, when I used pgAdmin to view the query cost before and after indexing, it remained the same. I thought it would decrease, since an index should reduce the cost of reading data from disk into memory (reducing I/O cost). Can someone tell me why it stayed the same?
The cost did not diminish because you are applying a computation to the column inside the WHERE clause, so the query cannot use the index. Removing the "*3" operation should do the trick.
SELECT * from cup_matches WHERE rating > 20
should show the performance increase, because you are no longer transforming the rating value. When the indexed column is wrapped in an expression, the database has to do a complete table scan in order to evaluate the comparisons.
The index is on rating, not on rating*3, so the planner cannot use it for the original predicate. To use your current index, rewrite the condition so the bare column is compared against a constant:
SELECT * from cup_matches WHERE rating > 20/3
(Note that 20/3 is integer division and evaluates to 6; for an integer rating column, rating > 6 is equivalent to rating*3 > 20, but for a non-integer column you would write rating > 20.0/3 to keep the same semantics.)
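A further option, not mentioned in the answers above: PostgreSQL supports expression indexes, which let the planner match the predicate exactly as originally written (a sketch; the index name is arbitrary).
-- Index the computed expression itself (PostgreSQL requires the extra parentheses around the expression)
CREATE INDEX cup_matches_rating3_idx ON cup_matches ((rating * 3));
-- The planner can now consider an index scan for the original query
SELECT * from cup_matches WHERE rating*3 > 20;
That said, with only 2680 rows the planner may well decide a sequential scan is cheaper than any index scan, which can also leave the estimated cost unchanged.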

Advice on sql server index performance

I have "UserLog" table with 15 millions rows.
On this table I have a cluster index on User_ID field which is of type bigint identity, and a non clustered index on User_uid field which is of type varchar(35) (a fake uniqueidentifier).
On my application we can have 2 categories of users connection. 1 of them concerns only 0.007% of rows (about 1150 raws over 15 millions) and the 2nd concerns the remaining rows (99%).
The objective is to improve performance of the 0.007% users connection.
That's why I create a 'split' field "Userlog_ID" with type bit with default value of 0. So for each user connection we insert a new row in Userlog (with 0 as a value for User_log).
This field (User_Log) will then be update and it will take either 0 (for more then 99% of rows) or 1 (for 0.007% of rows) depending on the user category.
I create then a non clustered index on this field (User_log).
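For concreteness, a rough sketch of the two index variants being described (this is my reading of the "normal or filtered" options mentioned later, not DDL taken from the question):
-- Plain non-clustered index on the split flag
CREATE NONCLUSTERED INDEX IX_UserLog_UserLog ON dbo.UserLog (User_Log);
-- Filtered variant: only the ~1150 rows of the rare category (User_Log = 1) are indexed
CREATE NONCLUSTERED INDEX IX_UserLog_UserUID_Cat1
ON dbo.UserLog (User_UID)
WHERE User_Log = 1;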
The SELECT statement I want to optimize is:
SELECT User_UID, User_LastAuthentificationDate,
Language_ID,User_SecurityString
FROM dbo.UserLog
WHERE User_Active = 1
AND User_UID = '00F5AA38-C288-48C6-8ED1922601819486'
So the idea now is to add a filter on the User_Log field to improve performance (specifically the index seek operator), but only when the user belongs to category 1 (the 0.007%):
SELECT User_UID, User_LastAuthentificationDate, Language_ID,User_SecurityString
FROM dbo.UserLog
WHERE User_Active = 1
AND User_UID = '00F5AA38-C288-48C6-8ED1922601819486'
and User_Log = 1
In my mind, since we add this filter, the index seek should perform better because we now have a smaller result set.
Unfortunately, when I compare the 2 queries with the estimated execution plan, I get 50% for each query. For both queries the optimizer uses an index seek on the User_UID non-clustered index and then a key lookup on the clustered index (User_ID).
So in conclusion, by adding the split field and a non-clustered index (either normal or filtered) on it, I don't improve performance.
Can anyone explain why? Maybe my reasoning and my interpretation are totally wrong.
Thank you
