Snowflake: Performance is slow with millions of rows - snowflake-cloud-data-platform

Requirement: to speed up query performance in Snowflake.
Issue: it's taking a lot of time even to read data. I have defined a clustering key on the table for these columns:
create or replace TABLE table_A (
YYYYMM NUMBER(38,0),
ID NUMBER(38,0),
.....(lot of other columns)
......
SURROGATE_KEY VARCHAR(16777216)
) cluster by (ID, yyyymm);
Table has 70,825,139,352 rows
If an ID was inserted into the table within the last 60 minutes, we want to delete any previous version of that ID if it falls within the last 3 months.
Below is the query:
select
surrogate_key,
SUBSTR(surrogate_key, 1, CHARINDEX('|', surrogate_key) - 1)::bigint as original_id,
array_agg(distinct yyyymm) as yyyymms,
max(extraction_ts) as max_extraction_ts
from table_A
where (ID, surrogate_key) IN (
select ID, surrogate_key from table_A where create_time >= dateadd(minute, -60, current_timestamp)
)
and yyyymm >= to_char(dateadd(month, -3, current_timestamp), 'YYYYMM')::bigint
and yyyymm <= to_char(dateadd(month, -0, current_timestamp), 'YYYYMM')::bigint
group by surrogate_key
;
Then I tried to just get the rows for the last 3 months; even this takes a lot of time:
select yyyymm, ID,
surrogate_key, create_time, extraction_ts
from table_A
where yyyymm >= to_char(dateadd(month, -3, current_timestamp), 'YYYYMM')::bigint
and yyyymm <= to_char(dateadd(month, -0, current_timestamp), 'YYYYMM')::bigint
When I checked the query explain plan, it looks like it's scanning the entire table instead of only the filtered data.
I am not sure how to optimize the query performance; I must be missing something here.
I also found, as shown below, that most of the time was spent scanning partitions:
Pruning
Partitions scanned: 275,445
Partitions total: 945,526
EDIT: I have now tried a WITH clause; it is somewhat faster than the original query but still takes 9 minutes to return the data:
with tbl as (select ID, surrogate_key from table_A where create_time >= dateadd(minute, -60, current_timestamp))
select
surrogate_key,
SUBSTR(surrogate_key, 1, CHARINDEX('|', surrogate_key) - 1)::bigint as original_id,
array_agg(distinct yyyymm) as yyyymms,
max(extraction_ts) as max_extraction_ts
from table_A
where (ID, surrogate_key) IN (select ID, surrogate_key from tbl)
and yyyymm >= to_char(dateadd(month, -3, current_timestamp), 'YYYYMM')::bigint
and yyyymm <= to_char(dateadd(month, -0, current_timestamp), 'YYYYMM')::bigint
group by surrogate_key
;
I tried changing the cluster key as suggested by Eric Lin, but it took the same amount of time.
EDIT: Output of system$clustering_information
Original: (ID, yyyymm)
{
"cluster_by_keys" : "LINEAR(ID, yyyymm)",
"total_partition_count" : 946321,
"total_constant_partition_count" : 766438,
"average_overlaps" : 57.6508,
"average_depth" : 30.1231,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 764362,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 1,
"00015" : 3,
"00016" : 1,
"00032" : 17,
"00064" : 32263,
"00128" : 43131,
"00256" : 88619,
"00512" : 17449,
"01024" : 475
}
}
Changed clustering to (yyyymm, ID)
{
"cluster_by_keys" : "LINEAR(yyyymm,ID)",
"total_partition_count" : 953033,
"total_constant_partition_count" : 769276,
"average_overlaps" : 33.2017,
"average_depth" : 18.5576,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 768630,
"00002" : 0,
"00003" : 15,
"00004" : 129,
"00005" : 611,
"00006" : 1589,
"00007" : 3128,
"00008" : 4235,
"00009" : 5374,
"00010" : 6404,
"00011" : 6176,
"00012" : 5809,
"00013" : 5397,
"00014" : 4034,
"00015" : 3007,
"00016" : 2287,
"00032" : 18517,
"00064" : 18992,
"00128" : 43803,
"00256" : 43519,
"00512" : 11377
}
}
DISTINCT DATA
yyyymm: 1,076 distinct values
ID: 179,030 distinct values

Sometimes this depends on the cardinality of the clustered columns; I think this is pointed out in earlier comments.
Clustering keys work like partition variables, so ideally they should be defined on columns with relatively low cardinality (though not so low that filtering on them prunes very little).
See: https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html
What you can do is check the depth and overlap of the partitions, as the link above illustrates; the closer you are to a depth of 1 and an overlap of 0, the better.
Use this command to check the clustering of the columns, see: https://docs.snowflake.com/en/sql-reference/functions/system_clustering_information.html
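For example, a quick check against the table and key from this question might look like this (a minimal sketch; the table and column names are simply taken from the post above):
-- Report how well table_A is currently clustered for a (yyyymm, ID) key
select system$clustering_information('table_A', '(yyyymm, ID)');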
Always look at the table structure first!
There are two types of filtering Snowflake uses when it analyzes a query to minimize the table scan (from your pruning statistics it appears this is where most of the time is spent in your query):
Static pruning - filters: make sure you do not apply functions to the column itself; only apply functions to the static values in your query (see the sketch after this list).
Dynamic pruning - joins: try to use equijoins and conjunctive predicates. Explicit column joins make performance much better.
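As an illustration of the static pruning point, here is a minimal sketch (table_A and create_time come from the question; the literal dates are placeholders):
-- Pruning-unfriendly: the function wraps the column, so partition metadata cannot be used
select count(*) from table_A
where to_char(create_time, 'YYYY-MM-DD') = '2022-05-01';
-- Pruning-friendly: the column is compared directly; functions only touch the literals
select count(*) from table_A
where create_time >= to_timestamp('2022-05-01')
and create_time < to_timestamp('2022-05-02');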
Next is an appropriately sized virtual warehouse. On the right-hand side of the query profile you should look for things like spillage and cache usage. Spillage indicates that the warehouse is not sized appropriately: if it is too small, intermediate results spill first to local and then to remote storage.
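If you do see remote spillage, resizing the warehouse is a quick experiment (a sketch; the warehouse name is a placeholder, and you can scale back down if it does not help):
-- Try one size up for this workload, then re-run the query and compare
alter warehouse my_wh set warehouse_size = 'LARGE';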

Related

Reclustering in Snowflake

I am working on Snowflake and need to apply clustering to a table on every run of the application. If the clustering information has changed, the run changes the cluster keys, which also triggers reclustering. But what happens when the clustering information has not changed? That is, if the columns are the same as the current cluster keys and we apply the cluster keys again with an ALTER statement, will it still recluster?
E.g. consider tableA. I added a cluster key using alter table tableA cluster by (name).
Now after some time I reapply the same statement; will it result in reclustering?
@Manish, you seem to be confused about cluster keys. Let's assume you have a fact table where most of the queries look like this:
select ...
from big_table
where date_id between <Date Start> and <Date End>;
You might consider altering the table and creating a cluster key using:
alter table big_table
cluster by date_id;
In the background, the automatic clustering service will cluster your table by DATE_ID.
There is no need to apply the cluster key again.
You need to be careful however. Keep in mind the following advice from Snowflake:
Only consider cluster keys on tables of 1 TB or more.
Only cluster if you find that PARTITIONS_SCANNED is close to PARTITIONS_TOTAL, i.e. you are currently not getting partition elimination and this leads to poor query response times.
Ensure the cluster key appears as a predicate in the WHERE clause of queries.
Be wary of placing cluster keys on tables where a significant proportion of the partitions are frequently updated. This may lead to a high cost of reclustering as updates can disrupt the clustering sequence.
Check the existing clustering on the table using:
select system$clustering_information('big_table');
If the results you get look like this - your table is VERY well clustered:
select system$clustering_information('ORDERS_BY_DAY', '(O_ORDERDATE)');
{
"cluster_by_keys" : "LINEAR(O_ORDERDATE)",
"total_partition_count" : 6648,
"total_constant_partition_count" : 6648,
"average_overlaps" : 0.0,
"average_depth" : 1.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 6648,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0
}
}
If however your table looks like this, it is BADLY clustered, and you should consider creating a cluster key.
select system$clustering_information('snowflake_sample_data.tpcds_sf100tcl.web_sales','ws_web_page_sk');
{
"cluster_by_keys" : "LINEAR(ws_web_page_sk)",
"total_partition_count" : 300112,
"total_constant_partition_count" : 0,
"average_overlaps" : 300111.0,
"average_depth" : 300112.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 0,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"524288" : 300112
}
}
The key indicator you need to look for is "average_depth". This shows the average number of partitions a query will scan for a lookup on a given value.
For example:
select ...
from big_table
where date_id = to_date('22-May-2022','DD-Mon-YYYY');
If you executed the above and it returned:
"average_depth" : 300112.0
This indicates that, on average, the above query will need to read around 300,000 partitions to find the values. If however it says:
"average_depth" : 10
This indicates fewer than 10 partition reads, which on a large table (with over 300,000 partitions) is VERY well clustered.
Provided your "average_depth" is 10 or under, you're fine. However, keep in mind we're assuming that most queries filter by DATE_ID.
In conclusion: if you think you've identified a valid case for a cluster key, it should be created once and the cost then monitored.
You should also check that query performance improves on queries which hit the table and filter by the cluster key - in this case DATE_ID.
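One way to monitor that cost is the AUTOMATIC_CLUSTERING_HISTORY table function (a sketch; the table name and the 7-day window are placeholders):
-- Credits consumed by automatic clustering on BIG_TABLE over the last 7 days
select *
from table(information_schema.automatic_clustering_history(
date_range_start => dateadd('day', -7, current_timestamp()),
table_name => 'BIG_TABLE'));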
Thank you for the question.
So to rephrase, you mean to say there is a table A with a clustering key on the column "name", and now you add another clustering key, say on the column "class".
If my understanding above is correct then yes, it will definitely recluster. Think of it this way: the data is stored in micro-partitions and arranged based on the clustering key. If another clustering key is added, the data has to be sorted and re-arranged in the micro-partitions again based on the new clustering keys.
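Changing the key later is just another ALTER statement (a sketch, reusing the hypothetical tableA and columns from the question); the automatic clustering service then re-arranges the micro-partitions to match the new key in the background:
-- Replace the existing clustering key on tableA; data is re-sorted by the clustering service over time
alter table tableA cluster by (class);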

Using SYSTEM$CLUSTERING_INFORMATION to identify potential clustering keys

Queries run on my event-based database currently scan all rows even when a certain event is filtered, which leads to long scan times. Event_type is something I would often use in filters, which is why I think it might be a good thing to cluster on. The table is already clustered by date_id and app_title. I used SYSTEM$CLUSTERING_INFORMATION to see if clustering on the additional event_type column would be useful. The results were bad. Does this mean that this would be a bad choice? Or does it just mean that the current table is poorly clustered on this key? Would creating a table with these three cluster keys lead to different results?
(I changed some names and values in the query/results below)
select system$clustering_information('materialized_view', '(date_id, app_title, event_type)');
{
"cluster_by_keys" : "LINEAR(date_id, app_title, event_type)",
"total_partition_count" : <more than 100k>,
"total_constant_partition_count" : 0,
"average_overlaps" : ~500
"average_depth" : ~500,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 0,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"00064" : 30,
"00128" : 3218,
"00256" : 22146,
"00512" : 94367,
"01024" : 134114
}
}
This is showing the current state of the clustering, which is not good. That means creating a cluster key the way you have it defined may help.
The order of the columns (or expressions) in the cluster key is very important. You want to go from lower cardinality to higher cardinality. If, for example, you have only five event types then it should probably be the first in the list of columns.
The APP_TITLE column is harder to judge without more context. If it has high cardinality (which the name of the column seems to suggest), you can limit the cardinality using an expression such as left(APP_TITLE, 2).
Remember, if you need to set a key on a very high cardinality or unique column, reduce the cardinality using an expression. You can see which functions Snowflake supports in cluster keys this way:
show functions;
-- Look at the "valid_for_clustering" column to see which are allowed.
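Putting those two points together, the cluster key might end up looking like the following (a sketch only; it assumes event_type really is the lowest-cardinality column, that the object is a regular table, and it reuses the names from the question):
-- Lowest-cardinality column first, with an expression to tame the high-cardinality one
alter table materialized_view
cluster by (event_type, date_id, left(app_title, 2));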

Nested IF ELSE in a derived column

I have the following logic to store the date in BI_StartDate as below:
If UpdatedDate is not null, then BI_StartDate = UpdatedDate.
Otherwise BI_StartDate takes the EntryDate value; if EntryDate is also null,
then BI_StartDate = CreatedDate.
If CreatedDate is also null, then BI_StartDate = GETDATE().
I am using a derived column as seen below:
ISNULL(UpdatedDateODS) ? EntryDateODS : (ISNULL(EntryDateODS) ? CreatedDateODS :
(ISNULL(CreatedDateODS) ? GETDATE() ))
I am getting this error:
The expression "ISNULL(UpdatedDateODS) ? EntryDateODS :
(ISNULL(EntryDateODS) ? CreatedDateODS :(ISNULL(CreatedDateODS) ?
GETDATE() ))" on "Derived Column.Outputs[Derived Column
Output].Columns[Derived Column 1]" is not valid.
You are looking for the first non-null value, which is a COALESCE, and that doesn't exist in the SSIS Data Flow (Derived Column).
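For reference, the equivalent rule in plain T-SQL is a single COALESCE (a sketch to illustrate the logic only; the source table name is a placeholder, not something from the original question):
-- First non-null of the three dates, falling back to the current date/time
SELECT COALESCE(UpdatedDateODS, EntryDateODS, CreatedDateODS, GETDATE()) AS BI_StartDate
FROM dbo.SourceTable; -- placeholder table name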
Inside the Data Flow itself, I'd suggest a very simple script component:
Row.BIStartDate = Row.UpdateDate ?? Row.EntryDate ?? Row.CreatedDate ?? DateTime.Now;
On the Input Columns screen, select the three date columns; on the Inputs and Outputs tab, add the BIStartDate output column (the screenshots from the original answer are omitted here). Then add the above code to the row-processing method:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
/*
* Add your code here
*/
Row.BIStartDate = Row.UpdateDate ?? Row.EntryDate ?? Row.CreatedDate ?? DateTime.Now;
}
From a syntax perspective, the nested if-else condition is not written well: you have to make sure that all possible outputs have the same data type, and you did not provide the last "else" branch:
ISNULL(UpdatedDateODS) ? EntryDateODS : (ISNULL(EntryDateODS) ? CreatedDateODS :
(ISNULL(CreatedDateODS) ? GETDATE() : **<missing>** ))
From a logical perspective, the expression may throw an exception, since you use the EntryDateODS column when ISNULL(UpdatedDateODS) is true, whereas you should check that EntryDateODS is not null before using it. I suggest an expression like the following:
ISNULL(UpdatedDateODS) ? (ISNULL(EntryDateODS) ? (ISNULL(CreatedDateODS) ? GETDATE() : CreatedDateODS)
: EntryDateODS) : UpdatedDateODS
As mentioned above, if UpdatedDateODS, EntryDateODS, CreatedDateODS and GETDATE() do not have the same data type, you should cast them to a unified data type, for example:
ISNULL(UpdatedDateODS) ? (ISNULL(EntryDateODS) ? (ISNULL(CreatedDateODS) ? (DT_DATE)GETDATE() : (DT_DATE)CreatedDateODS)
: (DT_DATE)EntryDateODS) : (DT_DATE)UpdatedDateODS

How to count multiple fields with group by another field in solr

I have Solr documents like the following:
agentId : 100
emailDeliveredDate : 2018-02-08,
emailSentDate : 2018-02-07
agentId : 100
emailSentDate : 2018-02-06
agentId : 101
emailDeliveredDate : 2018-02-08,
emailSentDate : 2018-02-07
I need a result like below.
agentId : 100
emailDeliveredDate : 1,
emailSentDate : 2
agentId : 101
emailDeliveredDate : 1,
emailSentDate : 1
In MySQL it would be:
select count(emailDeliveredDate),count(emailSentDate) group by agentId;
I need help doing this in Solr.
I did not find any way in Solr that does this directly, so I used faceting with facet.pivot, which gave me half of the result. The remaining calculation I did in Java.

Converting rows to XML format in SQL Server

I have a requirement like the one below.
The DDL and DML script for the sample data is:
CREATE TABLE #example
([CCP_DETAILS_SID] int, [ACCOUNT_GROWTH] int, [PRODUCT_GROWTH] int, [PROJECTION_SALES] numeric(22,6), [PROJECTION_UNITS] numeric(22,6), [PERIOD_SID] int)
;
INSERT INTO #example
([CCP_DETAILS_SID], [ACCOUNT_GROWTH], [PRODUCT_GROWTH], [PROJECTION_SALES], [PROJECTION_UNITS], [PERIOD_SID])
VALUES
(30001, 0, 0, 1505384.695, 18487.25251, 1801),
(30001, 0, 0, 1552809.983, 18695.75536, 1802),
(30001, 0, 0, 1595642.121, 18834.75725, 1803),
(30002, 0, 0, 10000.32, 18834.75725, 1801),
(30002, 0, 0, 1659124.98, 18834.75725, 1802),
(30002, 0, 0, 465859546.6, 18834.75725, 1803)
;
I have to convert the above results to XML format like below (output):
ccp_details_sid xml_format_string
30001 <period>
<period_sid period_sid="1801">
<PROJECTION_SALES>1505384.695</PROJECTION_SALES>
<PROJECTION_UNITS>18487.25251</PROJECTION_UNITS>
<ACCOUNT_GROWTH>0</ACCOUNT_GROWTH>
<PRODUCT_GROWTH>0</PRODUCT_GROWTH>
</period_sid>
<period_sid period_sid="1802">
<PROJECTION_SALES>1552809.983</PROJECTION_SALES>
<PROJECTION_UNITS>18695.75536</PROJECTION_UNITS>
<ACCOUNT_GROWTH>0</ACCOUNT_GROWTH>
<PRODUCT_GROWTH>0</PRODUCT_GROWTH>
</period_sid>
<period_sid period_sid="1803">
<PROJECTION_SALES>1595642.121</PROJECTION_SALES>
<PROJECTION_UNITS>18834.75725</PROJECTION_UNITS>
<ACCOUNT_GROWTH>0</ACCOUNT_GROWTH>
<PRODUCT_GROWTH>0</PRODUCT_GROWTH>
</period_sid>
</period>
30002 Same as above
I am new to XML, so I couldn't do this quickly. I tried Marc_s's solution with CROSS APPLY but was not able to achieve it.
Note: my main goal is that, as the sample data above shows, there are three records for a single ccp_details_sid, and I want to collapse them into one row per ccp_details_sid using XML (as shown above).
The following will work for you:
SELECT t.CCP_DETAILS_SID,
( SELECT PERIOD_SID AS [@period_sid],
x.PROJECTION_SALES,
x.PROJECTION_UNITS,
x.ACCOUNT_GROWTH,
x.PRODUCT_GROWTH
FROM #Example AS x
WHERE x.CCP_DETAILS_SID = t.CCP_DETAILS_SID
FOR XML PATH('period_sid'), TYPE, ROOT('period')
) AS xml_format_string
FROM #Example AS t
GROUP BY t.CCP_DETAILS_SID;
It essentially gets all your unique values for CCP_DETAILS_SID using:
SELECT t.CCP_DETAILS_SID
FROM #Example AS t
GROUP BY t.CCP_DETAILS_SID;
Then for each of these values it uses the correlated subquery to form the XML, with the key points being:
Use @ in front of the alias to create an attribute, e.g. AS [@period_sid]
Use PATH('period_sid') to name the container for each row
Use ROOT('period') to name the outer node.
Example on DBFiddle
