Snowflake CHANGES | Why does it need to perform a self-join? Why is it slower than a join using another unique column?

I was facing issues with a MERGE statement over large tables.
The source table for the merge is basically a clone of the target table, with some DML applied.
For example, in the statements below, PUBLIC.customer is the target and STAGING.customer is the source.
CREATE OR REPLACE TABLE STAGING.customer CLONE PUBLIC.customer;
MERGE INTO STAGING.customer TARGET USING (SELECT * FROM NEW_CUSTOMER) AS SOURCE ON TARGET.ID = SOURCE.ID
WHEN MATCHED AND SOURCE.DELETEFLAG=TRUE THEN DELETE
WHEN MATCHED AND TARGET.ROWMODIFIED < SOURCE.ROWMODIFIED THEN UPDATE SET TARGET.AGE = SOURCE.AGE, ...
WHEN NOT MATCHED THEN INSERT (AGE, DELETEFLAG, ID, ...) VALUES (SOURCE.AGE, SOURCE.DELETEFLAG, SOURCE.ID, ...);
Currently, we simply merge STAGING.customer back into PUBLIC.customer at the end.
This final merge statement is very costly for some of the large tables.
While looking for a way to reduce the cost, I discovered Snowflake's CHANGES mechanism. As per the documentation:
Currently, at least one of the following must be true before change tracking metadata is recorded for a table:
Change tracking is enabled on the table (using ALTER TABLE … CHANGE_TRACKING = TRUE).
A stream is created for the table (using CREATE STREAM).
Both options add hidden columns to the table which store change tracking metadata. The columns consume a small amount of storage.
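For context, either option is a one-line statement; a minimal sketch applied to the staging table from this question (the stream name is just an example):
ALTER TABLE STAGING.customer SET CHANGE_TRACKING = TRUE;
CREATE STREAM STAGING.customer_changes ON TABLE STAGING.customer;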
I assumed that the metadata added to the table would be equivalent to the result set of a SELECT statement using the CHANGES clause, but that doesn't seem to be the case.
INSERT INTO PUBLIC.CUSTOMER (AGE, ...)
SELECT AGE, ...
FROM STAGING.CUSTOMER
  CHANGES (INFORMATION => DEFAULT)
  AT (TIMESTAMP => 1675772176::TIMESTAMP)
WHERE METADATA$ACTION = 'INSERT';
The SELECT statement using the CHANGES clause is much slower than the MERGE statement I am currently using.
I checked the execution plan and found that Snowflake performs a self-join (of sorts) on the table at two different timestamps.
Is this really the expected behaviour, or am I missing something? I was hoping for better performance, assuming Snowflake would scan the table once and then simply insert the new records, which should be faster than the MERGE.
Also, even if it does a self-join, why does the MERGE query perform better than this? The MERGE is also joining similar volumes of data.
I was also hoping to use the same mechanism for deletes/updates on the source table.
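For completeness, this is roughly the query shape I had in mind for the delete/update cases as well; METADATA$ISUPDATE is the documented metadata column that distinguishes updates from plain inserts and deletes (column list shortened):
SELECT ID, AGE, METADATA$ACTION, METADATA$ISUPDATE
FROM STAGING.CUSTOMER
  CHANGES (INFORMATION => DEFAULT)
  AT (TIMESTAMP => 1675772176::TIMESTAMP);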

Related

Snowflake updating duplicate records

I have staging tables in Snowflake into which I copy data from AWS S3 using Snowpipe. Some of those records represent a "create" event followed by multiple "update" events; for the same event there will be one create and several updates in chronological order. I want to move those records into another table, so a create event should insert a record and the subsequent update events should update that record accordingly. I tried Snowflake's MERGE, but it does not suit my use case well: if my target table does not yet have a record, it creates a new record for every create and every update.
The following SQL will work if any update is a completely new version of the original event and can completely replace the previous one, so that you really only have to apply the last of many updates.
It is considerably harder if you have to apply all the updates to an event in sequence to get a correct result. You do not present any details, so that leaves us guessing.
MERGE INTO event_tab old USING (
SELECT * FROM new_events
QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_ts DESC) = 1
) new ON old.event_id = new.event_id
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
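Purely as a guess at the harder case: if each update row is partial (NULL meaning "column unchanged"), one way to collapse the chain before merging is LAST_VALUE ... IGNORE NULLS per event. The status and amount columns below are invented for illustration:
SELECT DISTINCT
    event_id,
    LAST_VALUE(status IGNORE NULLS) OVER (
        PARTITION BY event_id ORDER BY event_ts
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS status,
    LAST_VALUE(amount IGNORE NULLS) OVER (
        PARTITION BY event_id ORDER BY event_ts
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS amount
FROM new_events;
This result could then take the place of the QUALIFY subquery in the MERGE above.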

Use of inserted and deleted tables for logging - is my concept sound?

I have a table with a simple identity column primary key. I have written a 'For Update' trigger that, among other things, is supposed to log the changes of certain columns to a log table. Needless to say, this is the first time I've tried this.
Essentially as follows:
Declare Cursor1 Cursor for
select a.*, b.*
from inserted a
inner join deleted b on a.OrderItemId = b.OrderItemId
(where OrderItemId is the actual name of the primary identity key).
I then do the usual: open the cursor and go into a FETCH NEXT loop. With the columns I want to test, I do:
if Update(Field1)
begin
..... do some logging
end
The columns include varchars, bits, and datetimes. It works, sometimes. The problem is that the logging code writes the a (inserted) and b (deleted) values of the field to a log table, and in some cases the before and after values appear to be identical.
I have three questions:
Am I using the Update function correctly?
Am I accessing the before and after values correctly?
Is there a better way?
If you are using SQL Server 2016 or higher, I would recommend skipping this trigger entirely and instead using system-versioned temporal tables.
Not only will it eliminate the need for (and performance issues around) the trigger, it'll be easier to query the historical data.
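For reference, a minimal sketch of such a temporal table, with names adapted from the question (the data types and history table name are assumptions):
CREATE TABLE dbo.OrderItem
(
    OrderItemId INT IDENTITY PRIMARY KEY,
    Field1      VARCHAR(100) NULL,
    ValidFrom   DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo     DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.OrderItemHistory));

-- Every prior version of a row is then one clause away:
SELECT * FROM dbo.OrderItem FOR SYSTEM_TIME ALL WHERE OrderItemId = 42;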

Why is the merge statement much slower on a partitioned table?

I have a big table (14 million records) and I need to apply MERGE statements (basically to update/insert/delete some data). As the table is quite big, this has been my strategy:
INSERT INTO #ProjectUnitsCacheDetailExisting (ProjectUnitsCacheId, UniverseCode, CitiCode)
SELECT ProjectUnitsCacheId, UniverseCode, CitiCode
FROM dbo.ProjectUnitsCacheDetail
WHERE ProjectUnitsCacheId = @CacheId;

MERGE #ProjectUnitsCacheDetailExisting AS T
USING #ProjectUnitsCacheDetail AS S
    ON (T.UniverseCode = S.UniverseCode AND T.CitiCode = S.CitiCode)
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ActionType, ProjectUnitsCacheId, UniverseCode, CitiCode)
    VALUES ('INSERT', @CacheId, S.UniverseCode, S.CitiCode);

INSERT INTO ProjectUnitsCacheDetail (ProjectUnitsCacheId, UniverseCode, CitiCode)
SELECT @CacheId, UniverseCode, CitiCode
FROM #ProjectUnitsCacheDetailExisting
WHERE ActionType = 'INSERT';
Basically, I work out what needs to be added, updated and deleted in a temp table first, and then delete/add/update the data. This works much faster than applying the MERGE statement directly on my 14-million-record table.
I then learned about partitioned tables and thought they could be a good fit for me. So I created a table partitioned into 10 partitions (the partition key is ProjectUnitsCacheId % 10) and applied the MERGE statement directly on the new table. However, it became much slower.
MERGE ProjectUnitsCacheDetailTest AS T
USING #ProjectUnitsCacheDetail AS S
    ON (T.UniverseCode = S.UniverseCode AND T.CitiCode = S.CitiCode) AND T.ProjectUnitsCacheId = @CacheId
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProjectUnitsCacheId, UniverseCode, CitiCode)
    VALUES (@CacheId, S.UniverseCode, S.CitiCode)
-- ... delete action
-- ... update action
This method is 10 times slower than the temp-table approach. If I do a direct SELECT with @CacheId as the parameter, the partitioned table does return data more quickly. So what could be the problem?
The problem is that now that your table is partitioned, its contents are split across different disk locations and indexes. So unless the data you are comparing against falls within a single partition, operations such as updates, inserts or deletes will most likely be slower than on the full, non-partitioned counterpart.
Partitioned tables are good for querying data by the partition column and operating on one partition at a time. If you tend to operate across all the partitions, you may want to review the partition key, or consider not partitioning the table at all.
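One way to sanity-check this is to ask SQL Server which partition a given key maps to and compare that against the MERGE's actual execution plan. The partition function and scheme below are hypothetical stand-ins for the setup described in the question (10 partitions keyed on ProjectUnitsCacheId % 10):
CREATE PARTITION FUNCTION pf_CacheMod10 (INT)
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8);

CREATE PARTITION SCHEME ps_CacheMod10
    AS PARTITION pf_CacheMod10 ALL TO ([PRIMARY]);

-- $PARTITION shows where a key lands; if the plan touches all 10 partitions
-- instead of just this one, the join predicates are not enabling elimination.
DECLARE @CacheId INT = 123;
SELECT $PARTITION.pf_CacheMod10(@CacheId % 10) AS partition_number;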

Update SQL Table Based On Composite Primary Key

I have an ETL process (CSV to SQL database) that runs daily, but the data in the source sometimes changes, so I want to have it run again the next day with an updated file.
How do I write a SQL statement to find all the differences?
For example, let's say Table_1 has a composite PRIMARY KEY consisting of FK_1, FK_2 and FK_3.
Do I do this in SQL or in the ETL process?
Thanks.
Edit
I realize now this question is too broad. Disregard.
You can use EXCEPT to find which keys are missing. For example:
SELECT FK_1, FK_2, FK_3
FROM new_data_table
EXCEPT
SELECT FK_1, FK_2, FK_3
FROM current_data_table;
It may be better (from a performance perspective) to materialize these keys and then join that result back to new_data_table in order to insert all of the columns, as sketched below.
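A sketch of that two-step variant with a temp table for the missing keys (the temp table name is made up; columns follow the example above):
SELECT FK_1, FK_2, FK_3
INTO #missing_keys
FROM (
    SELECT FK_1, FK_2, FK_3 FROM new_data_table
    EXCEPT
    SELECT FK_1, FK_2, FK_3 FROM current_data_table
) AS d;

INSERT INTO current_data_table
SELECT n.*
FROM new_data_table AS n
JOIN #missing_keys AS k
    ON n.FK_1 = k.FK_1
   AND n.FK_2 = k.FK_2
   AND n.FK_3 = k.FK_3;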
If you need to do this in one query, you can use a simple LEFT JOIN. For example:
INSERT INTO current_data_table
SELECT A.*
FROM new_data_table A
LEFT JOIN current_data_table B
ON A.FK_1 = B.FK_1
AND A.FK_2 = B.FK_2
AND A.FK_3 = B.FK_3
WHERE B.[FK_1] IS NULL;
The idea is to get all records in new_data_table for which there is no match in current_data_table (WHERE B.[FK_1] IS NULL).
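For what it's worth, the same anti-join can also be written with NOT EXISTS, which reads a little more directly as "rows with no match" and behaves the same way here:
INSERT INTO current_data_table
SELECT A.*
FROM new_data_table A
WHERE NOT EXISTS (
    SELECT 1
    FROM current_data_table B
    WHERE B.FK_1 = A.FK_1
      AND B.FK_2 = A.FK_2
      AND B.FK_3 = A.FK_3
);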

Merge query using two tables in SQL server 2012

I am very new to SQL and SQL server, would appreciate any help with the following problem.
I am trying to update a share price table with new prices.
The table has three columns: share code, date, price.
The share code + date = PK
As you can imagine, if you have thousands of share codes and 10 years' data for each, the table can get very big. So I have created a separate table called a share ID table, and use a share ID instead in the first table (I was reliably informed this would speed up the query, as searching by integer is faster than string).
So, to summarise, I have two tables as follows:
Table 1 = Share_code_ID (int), Date, Price
Table 2 = Share_code_ID (int), Share_name (string)
So let's say I want to update the table/s with today's price for share ZZZ. I need to:
Look for the Share_code_ID corresponding to 'ZZZ' in table 2
If it is found, update table 1 with the new price for that date, using the Share_code_ID I just found
If the Share_code_ID is not found, update both tables
Let's ignore for now how the Share_code_ID is generated for a new code, I'll worry about that later.
I'm trying to use a merge query loosely based on the following structure, but have no idea what I am doing:
MERGE INTO [Table 1]
USING (VALUES (1,23-May-2013,1000)) AS SOURCE (Share_code_ID,Date,Price)
{ SEEMS LIKE THERE SHOULD BE AN INNER JOIN HERE OR SOMETHING }
ON Table 2 = 'ZZZ'
WHEN MATCHED THEN UPDATE SET Table 1.Price = 1000
WHEN NOT MATCHED THEN INSERT { TO BOTH TABLES }
Any help would be appreciated.
http://msdn.microsoft.com/library/bb510625(v=sql.100).aspx
You use Table1 as the target table and Table2 as the source table.
You want to take action when a given ID is not found in Table2, i.e. in the source table.
In the documentation that you have already read, that corresponds to the clause
WHEN NOT MATCHED BY SOURCE ... THEN <merge_matched>
and the latter corresponds to
<merge_matched>::=
{ UPDATE SET <set_clause> | DELETE }
Ergo, you cannot insert into the source table there.
You could use a trigger for auto-insertion when you insert something into Table1, but it would not be able to insert the proper Share_name; the trigger just won't know it.
So I guess you have two options.
1) Write a T-SQL code block; look into stored procedures. I think there is also a construct for executing an anonymous code block in MS SQL, like the EXECUTE BLOCK command in Firebird SQL Server, but I don't know that for sure.
2) Create an updatable SQL view joining Table1 and Table2 to show the most current date, so that when you insert a row into this view, the view's on-insert trigger actually inserts rows into both tables, and when you update data in the view, the on-update trigger modifies the data.
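A rough sketch of option 1 as a stored procedure, using the table and column names from the question (the data types, the ID generation via IDENTITY, and the procedure name are all assumptions, not the asker's schema):
CREATE PROCEDURE dbo.UpsertSharePrice
    @ShareName VARCHAR(20),
    @PriceDate DATE,
    @Price     DECIMAL(18, 4)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @ShareCodeId INT =
        (SELECT Share_code_ID FROM dbo.Table2 WHERE Share_name = @ShareName);

    -- Unknown share: create the lookup row first (ID generation glossed over,
    -- as in the question).
    IF @ShareCodeId IS NULL
    BEGIN
        INSERT INTO dbo.Table2 (Share_name) VALUES (@ShareName);
        SET @ShareCodeId = SCOPE_IDENTITY();
    END;

    MERGE INTO dbo.Table1 AS t
    USING (VALUES (@ShareCodeId, @PriceDate, @Price))
          AS s (Share_code_ID, [Date], Price)
       ON t.Share_code_ID = s.Share_code_ID AND t.[Date] = s.[Date]
    WHEN MATCHED THEN UPDATE SET t.Price = s.Price
    WHEN NOT MATCHED THEN INSERT (Share_code_ID, [Date], Price)
                          VALUES (s.Share_code_ID, s.[Date], s.Price);
END;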
