Using SYSTEM$CLUSTERING_INFORMATION to identify potential clustering keys - query-optimization

Queries on my event-based database currently scan all rows even when a specific event is filtered, leading to long scan times. Event_type is a column I often use in filters, which is why I think it might be a good candidate to cluster on. The table is already clustered by date_id and app_title. I used SYSTEM$CLUSTERING_INFORMATION to see whether clustering on the additional event_type column would be useful, and the results were bad. Does this mean event_type would be a bad choice? Or does it just mean that the current table is poorly clustered on this key? Would creating a table with these three cluster keys lead to different results?
(I changed some names and values in the query/results below)
select system$clustering_information('materialized_view', '(date_id, app_title, event_type)');
{
"cluster_by_keys" : "LINEAR(date_id, app_title, event_type)",
"total_partition_count" : <more than 100k>,
"total_constant_partition_count" : 0,
"average_overlaps" : ~500
"average_depth" : ~500,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 0,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"00064" : 30,
"00128" : 3218,
"00256" : 22146,
"00512" : 94367,
"01024" : 134114
}
}

This output shows the current state of the clustering, which is not good; it does not mean the key itself is a bad choice. Creating a cluster key the way you have it defined may therefore help.
The order of the columns (or expressions) in the cluster key is very important: you want to go from lower cardinality to higher cardinality. If, for example, you have only five event types, then event_type should probably come first in the list of columns.
The APP_TITLE column is harder to judge without more context. If it has high cardinality (which the column name seems to suggest), you can limit the cardinality using an expression such as left(APP_TITLE, 2).
Remember, if you need to set a key on a very high-cardinality or unique column, reduce its cardinality using an expression. You can see which functions Snowflake supports in cluster keys this way:
show functions;
-- Look at the "valid_for_clustering" column to see which are allowed.
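Putting that together, a minimal sketch of how the reordered key could look (the table name here is hypothetical; confirm the actual cardinality order in your data and that the expression is allowed per SHOW FUNCTIONS):
-- assumed example: lowest-cardinality column first, a truncated app_title
-- to cap its cardinality, then the date column
alter table event_facts
cluster by (event_type, left(app_title, 2), date_id);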

Related

Reclustering in Snowflake

I am working with Snowflake and need to apply clustering to a table on every run of an application. If the clustering information changes, that changes the cluster keys and also triggers reclustering. But what happens when the clustering information does not change, meaning the columns are the same as the current cluster keys and we add the cluster keys again using an ALTER statement: would it still recluster?
E.g.
consider tableA; I added a cluster key using alter table tableA cluster by (name)
Now, after some time, I reapply the same statement. Will it result in reclustering?
@Manish, you seem to be confused about cluster keys. Let's assume you have a fact table where most of the queries look like this:
select ...
from big_table
where date_id between <Date Start> and <Date End>;
You might consider altering the table and creating a CLUSTER_KEY using:
alter table big_table
cluster by date_id;
In the background, the automatic clustering service will cluster your table by DATE_ID.
There is no need to apply the cluster key again.
You need to be careful however. Keep in mind the following advice from Snowflake:
Only consider cluster keys on tables of 1 TB or more.
Only cluster if you find that PARTITIONS_SCANNED is close to PARTITIONS_TOTAL, i.e. you currently get no partition elimination and this leads to poor query response times.
Ensure the cluster key appears as a predicate in the WHERE clause of queries.
Be wary of placing cluster keys on tables where a significant proportion of the partitions are frequently updated. This may lead to a high cost of reclustering as updates can disrupt the clustering sequence.
Check the existing clustering on the table using:
select system$clustering_information('big_table');
If the results you get look like this - your table is VERY well clustered:
select system$clustering_information('ORDERS_BY_DAY', '(O_ORDERDATE)');
{
"cluster_by_keys" : "LINEAR(O_ORDERDATE)",
"total_partition_count" : 6648,
"total_constant_partition_count" : 6648,
"average_overlaps" : 0.0,
"average_depth" : 1.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 6648,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0
}
}
If, however, your table looks like this, it is BADLY clustered, and you should consider creating a cluster key.
select system$clustering_information('snowflake_sample_data.tpcds_sf100tcl.web_sales','ws_web_page_sk');
{
"cluster_by_keys" : "LINEAR(ws_web_page_sk)",
"total_partition_count" : 300112,
"total_constant_partition_count" : 0,
"average_overlaps" : 300111.0,
"average_depth" : 300112.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 0,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"524288" : 300112
}
}
The key indicator you need to look for is "average_depth". This shows the average number of partitions a query will scan for a lookup on a given value.
For example:
select ...
from big_table
where date_id = to_date('22-May-2022','DD-Mon-YYYY');
If you executed the above and it returned:
"average_depth" : 300112.0
This indicates that, on average, the above query will need to read around 300,000 partitions to find the values. If, however, it says:
"average_depth" : 10
This indicates only around 10 partition reads, which on a large table (with over 300,000 partitions) means it is VERY well clustered.
Provided your "average_depth" is 10 or under, you're fine. Keep in mind, however, that we're assuming most queries filter by DATE_ID.
In conclusion: if you think you've identified a valid case for a cluster key, it should be created once and its costs then monitored.
You should also check that query performance improves on queries which hit the table and filter by the cluster key - in this case DATE_ID.
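One hedged way to do that cost monitoring (assuming your role can read the SNOWFLAKE.ACCOUNT_USAGE share; verify the view and column names against the Snowflake documentation) is to query the automatic clustering history:
-- sketch: reclustering credits per table over the last 7 days
-- note: ACCOUNT_USAGE views typically lag real time by a few hours
select table_name,
       sum(credits_used)          as credits_used,
       sum(num_bytes_reclustered) as bytes_reclustered
from snowflake.account_usage.automatic_clustering_history
where start_time >= dateadd(day, -7, current_timestamp())
group by table_name
order by credits_used desc;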
Thank you for the question.
So, to rephrase, you mean there is a table A with a clustering key on the column "name", and now you add another clustering key, say on the column "class".
If my understanding above is correct, then yes, it will definitely recluster. Think of it this way: the data is stored in micro-partitions and arranged based on the clustering key. If another clustering key is added, the data has to be sorted and re-arranged in the micro-partitions again, based on the new clustering keys.

Snowflake: Performance is slow with millions of rows

Requirement: to speed up performance in Snowflake.
Issue: it's taking a lot of time even to read data. I have created a cluster key on the table for these columns:
create or replace TABLE table_A cluster by (ID, yyyymm)(
YYYYMM NUMBER(38,0),
ID NUMBER(38,0),
.....(lot of other columns)
......
SURROGATE_KEY VARCHAR(16777216)
);
Table has 70,825,139,352 rows
If an ID was inserted into the table within the last 60 minutes, we want to delete any previous version of that ID if it falls within the last 3 months.
Below is the query
select
surrogate_key,
SUBSTR(surrogate_key, 1, CHARINDEX('|', surrogate_key) - 1)::bigint as original_id,
array_agg(distinct yyyymm) as yyyymms,
max(extraction_ts) as max_extraction_ts
from table_A
where (ID, surrogate_key) IN (
select ID, surrogate_key from table_A where create_time >= dateadd(minute, -60, current_timestamp)
)
and yyyymm >= to_char(dateadd(month, -3, current_timestamp), 'YYYYMM')::bigint
and yyyymm <= to_char(dateadd(month, -0, current_timestamp), 'YYYYMM')::bigint
group by surrogate_key
;
Then I tried to just get rows for the last 3 months; even this is taking a lot of time:
select yyyymm, ID,
surrogate_key, create_time, extraction_ts
from table_A
where yyyymm >= to_char(dateadd(month, -3, current_timestamp), 'YYYYMM')::bigint
and yyyymm <= to_char(dateadd(month, -0, current_timestamp), 'YYYYMM')::bigint
When I checked the query explain plan, it looks like it's scanning the entire table instead of only the filtered data.
I am not sure how to optimize the query performance; I am missing something here.
I also found out, as shown below, that most of the time was spent scanning partitions:
Pruning
Partitions scanned: 275,445
Partitions total: 945,526
EDIT: UPDATED
I now tried a WITH clause; it is somewhat faster than the original query, but still takes 9 minutes to get the data:
with tbl as (select ID, surrogate_key from table_A where create_time >= dateadd(minute, -60, current_timestamp))
select
surrogate_key,
SUBSTR(surrogate_key, 1, CHARINDEX('|', surrogate_key) - 1)::bigint as original_id,
array_agg(distinct yyyymm) as yyyymms,
max(extraction_ts) as max_extraction_ts
from table_A
where (ID, surrogate_key) IN (select ID, surrogate_key from tbl)
and yyyymm >= to_char(dateadd(month, -3, current_timestamp), 'YYYYMM')::bigint
and yyyymm <= to_char(dateadd(month, -0, current_timestamp), 'YYYYMM')::bigint
group by surrogate_key
;
I tried changing the cluster key as suggested by Eric Lin, but it took the same time.
EDIT: Output of system$clustering_information
Original: (ID, yyyymm)
{
"cluster_by_keys" : "LINEAR(ID, yyyymm)",
"total_partition_count" : 946321,
"total_constant_partition_count" : 766438,
"average_overlaps" : 57.6508,
"average_depth" : 30.1231,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 764362,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 1,
"00015" : 3,
"00016" : 1,
"00032" : 17,
"00064" : 32263,
"00128" : 43131,
"00256" : 88619,
"00512" : 17449,
"01024" : 475
}
}
Changed clustering to (yyyymm, ID)
{
"cluster_by_keys" : "LINEAR(yyyymm,ID)",
"total_partition_count" : 953033,
"total_constant_partition_count" : 769276,
"average_overlaps" : 33.2017,
"average_depth" : 18.5576,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 768630,
"00002" : 0,
"00003" : 15,
"00004" : 129,
"00005" : 611,
"00006" : 1589,
"00007" : 3128,
"00008" : 4235,
"00009" : 5374,
"00010" : 6404,
"00011" : 6176,
"00012" : 5809,
"00013" : 5397,
"00014" : 4034,
"00015" : 3007,
"00016" : 2287,
"00032" : 18517,
"00064" : 18992,
"00128" : 43803,
"00256" : 43519,
"00512" : 11377
}
}
DISTINCT DATA
yyyymm: 1,076 distinct values
ID: 179,030 distinct values
Sometimes this depends on the cardinality of the clustered columns; I think this is pointed out in earlier comments.
Clustering keys work like partition variables, so ideally they should be defined for columns with a low cardinality of values.
See: https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html
What you can do is check the depth and overlap of the clustering columns, as the link above illustrates. The closer you are to a depth of 1 and an overlap of 0, the better.
Use this command to check the clustering of the columns; see: https://docs.snowflake.com/en/sql-reference/functions/system_clustering_information.html
Always look at the table structure first!
There are two types of filtering Snowflake applies when it analyzes a query to minimize the table scan (from your screenshot, this appears to be where most of the time is spent in your query):
Static pruning - filters: make sure you do not apply functions to the column itself; where needed, apply functions to the static value side of your predicate instead (see the sketch after this list).
Dynamic pruning - joins: try to use equijoins and conjunctive predicates. Explicit column joins make performance much better.
Next is an appropriately sized virtual warehouse: on the right-hand side of the query profile, look for things like spillage and cache usage. Spillage indicates that the warehouse is not appropriately sized; too small, and the query's intermediate data spills to remote storage.
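To illustrate the static-pruning point above, here is a minimal sketch using the yyyymm column from the question (names are examples only, not your exact query):
-- pruning-unfriendly: the function wraps the column, so partition min/max
-- metadata cannot be compared against a constant
select count(*) from table_A
where to_char(yyyymm) = to_char(dateadd(month, -1, current_timestamp()), 'YYYYMM');
-- pruning-friendly: the column stays bare and the function is applied
-- only to the constant side
select count(*) from table_A
where yyyymm = to_char(dateadd(month, -1, current_timestamp()), 'YYYYMM')::bigint;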

Query for condition in array of JSON objects in PostgreSQL

Let's assume we have a PostgreSQL database with a table whose rows look like this:
id | doc
---+-----------------
1 | JSON Object
2 | JSON Object
3 | JSON Object
...
The JSON has the following structure:
{
'header' : {
'info' : 'foo'},
'data' :
[{'a' : 1, 'b' : 123},
{'a' : 2, 'b' : 234},
{'a' : 1, 'b' : 543},
...
{'a' : 1, 'b' : 123},
{'a' : 4, 'b' : 452}]
}
with arbitrary values for 'a' and 'b' in 'data' across all rows of the table.
First question: how do I query for rows in the table where the following condition holds:
there exists an object in the array under the 'data' key where a == i and b > j.
For example, for i=1 and j=400 the condition would be fulfilled for the example above and the respective row would be returned.
Second question:
In my problem I have to deal with time-series data in JSON. Every measurement is represented by one JSON document and therefore one row in the table. I want to identify measurements where certain events occurred. In case the above structure is unsuitable for easy querying: what could such a time series look like in order to be more easily queryable?
Thanks a lot!
I believe a query like this should answer your first question:
select distinct id, doc
from (
select id, doc, jsonb_array_elements(doc->'data') as elem
from docs
) as docelem
where (elem->>'a')::int = 4 and (elem->>'b')::int > 400
db<>fiddle here
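A small variation on the same idea (assuming the same docs table with a jsonb doc column): an EXISTS subquery returns each matching row at most once, so the outer DISTINCT is not needed.
-- hedged alternative sketch: the correlated subquery expands the 'data'
-- array and the row is returned as soon as one element matches
select id, doc
from docs d
where exists (
  select 1
  from jsonb_array_elements(d.doc -> 'data') as elem
  where (elem ->> 'a')::int = 1
    and (elem ->> 'b')::int > 400
);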

Converting rows to XML format in SQL Server

I have a requirement like the one below.
The DDL and DML script for the above image is:
CREATE TABLE #example
([CCP_DETAILS_SID] int, [ACCOUNT_GROWTH] int, [PRODUCT_GROWTH] int, [PROJECTION_SALES] numeric(22,6), [PROJECTION_UNITS] numeric(22,6), [PERIOD_SID] int)
;
INSERT INTO #example
([CCP_DETAILS_SID], [ACCOUNT_GROWTH], [PRODUCT_GROWTH], [PROJECTION_SALES], [PROJECTION_UNITS], [PERIOD_SID])
VALUES
(30001, 0, 0, 1505384.695, 18487.25251, 1801),
(30001, 0, 0, 1552809.983, 18695.75536, 1802),
(30001, 0, 0, 1595642.121, 18834.75725, 1803),
(30002, 0, 0, 10000.32, 18834.75725, 1801),
(30002, 0, 0, 1659124.98, 18834.75725, 1802),
(30002, 0, 0, 465859546.6, 18834.75725, 1803)
;
And I have to convert the above results to XML format like below (output):
ccp_details_sid xml_format_string
30001 <period>
<period_sid period_sid="1801">
<PROJECTION_SALES>1505384.695</PROJECTION_SALES>
<PROJECTION_UNITS>18487.25251</PROJECTION_UNITS>
<ACCOUNT_GROWTH>0</ACCOUNT_GROWTH>
<PRODUCT_GROWTH>0</PRODUCT_GROWTH>
</period_sid>
<period_sid period_sid="1802">
<PROJECTION_SALES>1552809.983</PROJECTION_SALES>
<PROJECTION_UNITS>18695.75536</PROJECTION_UNITS>
<ACCOUNT_GROWTH>0</ACCOUNT_GROWTH>
<PRODUCT_GROWTH>0</PRODUCT_GROWTH>
</period_sid>
<period_sid period_sid="1803">
<PROJECTION_SALES>1595642.121</PROJECTION_SALES>
<PROJECTION_UNITS>18834.75725</PROJECTION_UNITS>
<ACCOUNT_GROWTH>0</ACCOUNT_GROWTH>
<PRODUCT_GROWTH>0</PRODUCT_GROWTH>
</period_sid>
</period>
30002 Same as above
I am new to XML, so I couldn't do it quickly. I have tried Marc_s's solution with CROSS APPLY but couldn't achieve it.
Note: my main goal is that, as the image above shows, there are three records for a single ccp_details_sid, and I want to convert them into one row using XML (as shown above).
The following will work for you:
SELECT t.CCP_DETAILS_SID,
( SELECT PERIOD_SID AS [@period_sid],
x.PROJECTION_SALES,
x.PROJECTION_UNITS,
x.ACCOUNT_GROWTH,
x.PRODUCT_GROWTH
FROM #Example AS x
WHERE x.CCP_DETAILS_SID = t.CCP_DETAILS_SID
FOR XML PATH('period_sid'), TYPE, ROOT('period')
) AS xml_format_string
FROM #Example AS t
GROUP BY t.CCP_DETAILS_SID;
It essentially gets all your unique values for CCP_DETAILS_SID using:
SELECT t.CCP_DETAILS_SID
FROM #Example AS t
GROUP BY t.CCP_DETAILS_SID;
Then, for each of these values, it uses the correlated subquery to form the XML. The key points are:
Use @ in front of the alias to create an attribute, e.g. AS [@period_sid]
Use PATH('period_sid') to name the container for each row
Use ROOT('period') to name the outer nodes.
Example on DBFiddle

Duplicate identity code in SQL for vb.net winforms

Good day,
I am not too sure why this can happen.
I have a large code base in which I insert some values into some SQL tables. The program runs on 4-5 separate machines that access the same SQL database, which is held on another server.
Try : codfact = CInt(Me.CFTableAdapter.codigoQuery) + 1 : Catch ex As Exception : codfact = 1 : End Try
There I get the new id for the table and save it in codfact.
CFTableAdapter.codigoQuery is a SELECT that gets the MAX(id) of my table.
And then I do the insert:
While Not exit4
Try
Me.CFTableAdapter.InsertQuery(codfact, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, -1, -1, Me.codcom, 0, False, -1, -1, -1, -1, 0, selprop)
exit4 = True
Catch ex As Exception
If ex.Message.Contains("PRIMARY KEY") Then
Try : codfact = CInt(Me.CF.codigoQuery) + 1 : Catch : codfact = 1 : End Try
Else
MessageBox.Show(ex.Message, "ERROR", MessageBoxButtons.OK, MessageBoxIcon.Error)
End
End If
End Try
End While
That is the code.
What happens is that sometimes both programs keep working with the same codfact, which should be the unique identity ID in the SQL table.
When I try to reproduce this error I always get the "PRIMARY KEY" error, and the code then gets the next codfact to work with (the correct way).
That codfact is used in other tables as a foreign key; that way I get a single codfact line instead of 2 or 3.
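For context on why this happens: two clients can both read the same MAX(id) before either one inserts, so they compute the same codfact. A minimal, hedged T-SQL sketch of the usual fix (the sequence and table names below are hypothetical) is to let the server hand out the number instead:
-- a SEQUENCE issues each caller a distinct value, so concurrent clients
-- can no longer compute the same MAX(id) + 1
CREATE SEQUENCE dbo.seq_codfact START WITH 1 INCREMENT BY 1;
DECLARE @codfact int = NEXT VALUE FOR dbo.seq_codfact;
INSERT INTO dbo.CF_table (codfact /*, other columns */)
VALUES (@codfact /*, other values */);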
