Snowflake updating duplicate records

I have staging tables in Snowflake into which I copy data from AWS S3 using Snowpipe. The records describe events: for each event there is one "create" record followed by multiple "update" records, in chronological order. I want to move these records into another table, so that a create event inserts a row and each subsequent update event updates that row accordingly. I tried Snowflake's MERGE, but it does not suit my use case: when the target table has no matching record, it inserts a new row for every create and every update.

The following SQL will work if each update is a complete new version of the original event and can completely replace the previous one, so that you really only have to apply the last update of many.
It is considerably harder if you have to apply all the updates to an event in sequence to get a correct result. You do not present any details, so that leaves us guessing.
MERGE INTO event_tab old USING (
    SELECT * FROM new_events
    QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_ts DESC) = 1
) new ON old.event_id = new.event_id
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
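For concreteness, here is the same statement fully spelled out. The question gives no schema, so the payload column and the exact column list below are assumptions:
MERGE INTO event_tab old USING (
    SELECT * FROM new_events
    QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_ts DESC) = 1
) new ON old.event_id = new.event_id
-- payload stands in for whatever event columns actually exist
WHEN MATCHED THEN UPDATE SET event_ts = new.event_ts, payload = new.payload
WHEN NOT MATCHED THEN INSERT (event_id, event_ts, payload)
    VALUES (new.event_id, new.event_ts, new.payload);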


Snowflake CHANGES | Why does it need to perform a self-join? Why is it slower than a join on another unique column?

I am facing issues with a MERGE statement over large tables.
The source table for the merge is basically a clone of the target table after applying some DML.
E.g. in the example below, PUBLIC.customer is the target and STAGING.customer is the source.
CREATE OR REPLACE TABLE STAGING.customer CLONE PUBLIC.customer;
MERGE INTO STAGING.customer TARGET USING (SELECT * FROM NEW_CUSTOMER) AS SOURCE ON TARGET.ID = SOURCE.ID
WHEN MATCHED AND SOURCE.DELETEFLAG = TRUE THEN DELETE
WHEN MATCHED AND TARGET.ROWMODIFIED < SOURCE.ROWMODIFIED THEN UPDATE SET TARGET.AGE = SOURCE.AGE, ...
WHEN NOT MATCHED THEN INSERT (AGE, DELETEFLAG, ID, ...) VALUES (SOURCE.AGE, SOURCE.DELETEFLAG, SOURCE.ID, ...);
Currently, we are simply merging the STAGING.customer back to PUBLIC.customer at the end.
This final merge statement is very costly for some of the large tables.
While looking for a solution to reduce the cost, I discovered Snowflake's "CHANGES" mechanism. As per the documentation:
Currently, at least one of the following must be true before change tracking metadata is recorded for a table:
Change tracking is enabled on the table (using ALTER TABLE … CHANGE_TRACKING = TRUE).
A stream is created for the table (using CREATE STREAM).
Both options add hidden columns to the table which store change tracking metadata. The columns consume a small amount of storage.
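For reference, the two options look like this in SQL (the stream name is a placeholder):
-- Option 1: enable change tracking directly on the table
ALTER TABLE PUBLIC.CUSTOMER SET CHANGE_TRACKING = TRUE;
-- Option 2: create a stream, which also enables change tracking
CREATE OR REPLACE STREAM CUSTOMER_STREAM ON TABLE PUBLIC.CUSTOMER;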
I assumed that the metadata added to the table would be equivalent to the result set of a SELECT statement using the CHANGES clause, which does not seem to be the case.
INSERT INTO PUBLIC.CUSTOMER (AGE, ...)
SELECT AGE, ...
FROM STAGING.CUSTOMER
    CHANGES (INFORMATION => DEFAULT)
    AT (TIMESTAMP => 1675772176::TIMESTAMP)
WHERE "METADATA$ACTION" = 'INSERT';
The SELECT statement using the CHANGES clause is far slower than the MERGE statement I am using currently.
I checked the execution plan and found that Snowflake performs a self-join (of sorts) on the table at the two different timestamps.
Should this really be the behaviour, or am I missing something here? I was hoping for better performance, assuming Snowflake would scan the table once and then simply insert the new records, which should be faster than the MERGE statement.
Also, even if it does a self-join, why does the MERGE query perform better than this? The MERGE query is also doing a join on similar volumes.
I was also hoping to use the same mechanism for deletes/updates on the source table.
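For illustration, a stream-based sketch of the same idea, using the tables from this question (the stream name and the exact column list are assumptions):
-- Record a change-tracking offset on the source table; stream name is a placeholder
CREATE OR REPLACE STREAM CUSTOMER_DELTA ON TABLE STAGING.CUSTOMER;

-- ... apply the DML to STAGING.CUSTOMER here ...

-- Consume only the genuinely new rows from the stream
INSERT INTO PUBLIC.CUSTOMER (ID, AGE, DELETEFLAG, ROWMODIFIED)
SELECT ID, AGE, DELETEFLAG, ROWMODIFIED
FROM CUSTOMER_DELTA
WHERE METADATA$ACTION = 'INSERT'
  AND NOT METADATA$ISUPDATE;
Whether this avoids the expensive join in the plan is worth verifying with the query profile, since streams are built on the same change-tracking metadata.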

SQL Server trigger failing for row inserts in quick succession

I have looked around on SO and found many similar questions:
SQL Server A trigger to work on multiple row inserts
SQL trigger multiple insert update
Trigger to handle multiple row inserts and updates
update multiple rows with trigger after insert (sql server)
Trigger not working when inserting multiple records
But I am still having issues getting my trigger to update multiple rows when inserting multiple rows into a table.
Outline of code
I have a Reservation table with ReservationID and TourComponentID columns. When I insert into the Reservation table, the following trigger updates the TourComponent table, setting its CurrentReservationId from the row just inserted into the Reservation table with the matching TourComponentID:
CREATE TRIGGER [TR_Reservation_CurrentReservation] ON [Reservation] AFTER INSERT AS
UPDATE tc
SET tc.[CurrentReservationId] = I.ReservationID
FROM [tour].[TourComponent] tc
JOIN INSERTED I ON I.TourComponentID = tc.TourComponentID
This trigger works perfectly when updating one TourComponent (inserting one row into the Reservation table). However, if I try to update multiple tour components by inserting multiple rows into the Reservation table, only the first tour component gets updated, not the rest.
Other answers and research have shown me that:
Triggers are NOT executed once per row but rather as a set-based
operation, so executed only ONCE for the entire DML operation. So you
need to treat it like any other UPDATE with a JOIN statement.
So I would have expected my join on the INSERTED table to handle multiple rows, or have I misunderstood this?
Interestingly, if I log the trigger variables for TourComponentID, ReservationID and the INSERTED row count to a temp table FOO, I can see that two records are inserted into my temp table, each with a row count of 1.
Using SQL Profiler to catch the actual SQL executed at runtime and running it manually against the database, I get two rows updated, as desired. It is only when using Entity Framework to update the database, i.e. running the application, that I find only one row is updated.
I have tried logging the values to a table FOO in the trigger
INSERT INTO FOO (TourComponentID, ReservationID, Rowcounts )
SELECT I.TourComponentID, I.ReservationID, 1 --#ReservationId
FROM
INSERTED I
This logs two rows each time, with a row count of 1 and the correct TourComponentID and ReservationID, but the TourComponent table still only has one row updated.
Any suggestions greatly appreciated.
UPDATE
Tour component IDs are passed as strings in an Ajax POST to the MVC action, where tour component models are populated and then passed on to be updated one at a time in the code:
public void UpdateTourComponents(IEnumerable<TourComponent> tourComponents)
{
    foreach (var tourComponent in tourComponents)
    {
        UpdateTourComponent(tourComponent);
    }
}
here is the call to UpdateTourComponent
public int UpdateTourComponent(TourComponent tourComponent)
{
    return TourComponentRepository.Update(tourComponent);
}
and the final call to Update
public virtual int Update(TObject TObject)
{
    Dictionary<string, List<string>> newChildKeys;
    return Update(TObject, null, out newChildKeys);
}
So the inserts are happening one at a time, hence my trigger is being called once per TourComponent. This is why, when I count the @@ROWCOUNT in INSERTED and log it to FOO, I get a value of 1. When I run the inserts manually I get the correct, expected results, so I agree with @Slava Murygin's tests that the issue is probably not with the trigger itself. I thought it might be a speed issue, as we are firing the requests one after the other, so I put a wait in the trigger and in the code, but this did not fix it.
Update 2
I have used SQL Profiler to capture the SQL that is run when only the first insert's trigger works.
Interestingly, when the EXACT same SQL is then run in SQL Server Management Studio, the trigger works as expected and both tour components are updated with the reservation ID.
Worth mentioning also that all constraints have been removed from all tables.
Any other suggestions what might be causing this issue?
You have a different problem than that particular trigger. Look at the name of the table you are updating: "[tour].[TourComponent]" or "[dbo].[TourComponent]".
I've tried your trigger and it works perfectly:
use TestDB
GO
IF object_id('Reservation') is not null DROP TABLE Reservation;
GO
IF object_id('TourComponent') is not null DROP TABLE TourComponent;
GO
CREATE TABLE Reservation (
ReservationID INT IDENTITY(1,1),
TourComponentID INT
);
GO
CREATE TABLE TourComponent (
CurrentReservationId INT,
TourComponentID INT
);
GO
CREATE TRIGGER [TR_Reservation_CurrentReservation] ON [Reservation] AFTER INSERT AS
UPDATE tc
SET tc.[CurrentReservationId] = I.ReservationID
FROM [TourComponent] tc
JOIN INSERTED I on I.TourComponentID = tc.TourComponentID
GO
INSERT INTO TourComponent(TourComponentID)
VALUES (1),(2),(3),(4),(5),(6)
GO
INSERT INTO Reservation(TourComponentID)
VALUES (1),(2),(3),(4),(5),(6)
GO
SELECT * FROM Reservation
SELECT * FROM TourComponent
So the underlying problem came down to Entity Framework.
this.Property(t => t.CurrentReservationId).HasColumnName("CurrentReservationId");
is one property mapping in the SQL data access layer. This mapping was being cached, which caused the data read out of the DB not to be the latest values; so with one insert after another into the Reservations table, the second insert was overwritten by the cached values, which in my case were NULL.
Changing the line to this resolves the problem and makes the trigger work as expected.
this.Property(t => t.CurrentReservationId).HasColumnName("CurrentReservationId").HasDatabaseGeneratedOption(DatabaseGeneratedOption.Computed);
See more info on HasDatabaseGeneratedOption

Create column to constantly update with rownum in SQL Server

I have a table in a SQL Server database. It has a column testrownum. I want the testrownum column to be updated with the row number automatically whenever a row is created in the table.
Is there any setting that I can turn on to achieve this?
Perhaps it is best to get a row number at the time you SELECT your data, by using the ROW_NUMBER() function.
As others have pointed out in the comments, the data in a table is essentially unordered. You can, however, assign a row number for a certain ordering when you select data, as follows (suppose a table that has only one column, name):
SELECT
    name,
    rownr = ROW_NUMBER() OVER (ORDER BY name)
FROM
    name_table
You can create a trigger.
A trigger is code that is called every time data is inserted, updated or deleted in a table (you specify when you want it to fire when creating the trigger). So you can add two triggers, one for insert and one for delete, and just increment/decrement the value you want.
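Rather than incrementing and decrementing counters, a simpler (if heavier) variant just recomputes the numbering on every change. A minimal sketch folding both cases into one trigger, reusing the hypothetical name_table from the answer above and assuming it also has the testrownum column:
CREATE TRIGGER TR_name_table_rownum
ON name_table
AFTER INSERT, DELETE
AS
-- Renumber every row according to the chosen ordering;
-- assumes name is unique so the join matches exactly one row
UPDATE t
SET t.testrownum = x.rownr
FROM name_table t
JOIN (
    SELECT name, ROW_NUMBER() OVER (ORDER BY name) AS rownr
    FROM name_table
) x ON x.name = t.name;
Note that this rewrites testrownum for the whole table on every insert or delete, which is exactly why computing ROW_NUMBER() at SELECT time, as suggested above, is usually the better option.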

Using Triggers in SQL Server to keep a history

I am using SQL Server 2012
I have a table called AMOUNTS and a table called AMOUNTS_HIST
Both tables have identical columns:
CHANGE_DATE
AMOUNT
COMPANY_ID
EXP_ID
SPOT
UPDATE_DATE [system date]
The primary key of AMOUNTS is (COMPANY_ID, EXP_ID).
The primary key of AMOUNTS_HIST is (COMPANY_ID, EXP_ID, CHANGE_DATE).
Whenever I add a row to the AMOUNTS table, I would like to create a copy of it in the AMOUNTS_HIST table. [Theoretically, each time a row is added to AMOUNTS, the combination of COMPANY_ID, EXP_ID and CHANGE_DATE will be unique. Practically, if it is not, the relevant row in AMOUNTS_HIST would need to be overwritten. The code below does not take the overwriting into account.]
I created a trigger as follows:
CREATE TRIGGER [MYDB].[update_history] ON [MYDB].[AMOUNTS]
FOR UPDATE
AS
INSERT MYDB.AMOUNTS_HIST (
    CHANGE_DATE,
    COMPANY_ID,
    EXP_ID,
    SPOT,
    UPDATE_DATE
)
SELECT e.CHANGE_DATE,
       e.COMPANY_ID,
       e.EXP_ID,
       e.SPOT,
       e.UPDATE_DATE
FROM MYDB.AMOUNTS e
JOIN inserted ON inserted.company_id = e.company_id
             AND inserted.exp_id = e.exp_id
I don't understand why it does nothing at all in my AMOUNTS_HIST table.
Can anyone help?
Thanks,
Probably because the trigger, the way it's currently written, will only get fired when an Update is done, not an insert.
Try changing it to:
CREATE TRIGGER [MYDB].[update_history] ON [MYDB].[AMOUNTS]
FOR UPDATE, INSERT
I just wanted to chime in: have you looked at CDC (Change Data Capture)?
http://msdn.microsoft.com/en-us/library/bb522489(v=sql.105).aspx
"Change data capture is designed to capture insert, update, and delete activity applied to SQL Server tables, and to make the details of the changes available in an easily consumed relational format. The change tables used by change data capture contain columns that mirror the column structure of a tracked source table, along with the metadata needed to understand the changes that have occurred.
Change data capture is available only on the Enterprise, Developer, and Evaluation editions of SQL Server."
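For reference, enabling CDC is two calls, run inside the target database; the schema name below is taken from your trigger definition, and @role_name = NULL skips gating access by role. (CDC also needs SQL Server Agent running to capture changes.)
-- Enable CDC at the database level
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for the AMOUNTS table
EXEC sys.sp_cdc_enable_table
    @source_schema = N'MYDB',
    @source_name   = N'AMOUNTS',
    @role_name     = NULL;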
As far as your trigger goes, when you update [MYDB].[AMOUNTS], does the trigger throw any errors?
Also, I believe you can get all your data from the inserted table without needing to join back to MYDB.AMOUNTS.
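A sketch of that simplification, combining the FOR UPDATE, INSERT fix from the first answer with a read straight from inserted (column list taken from the question, including AMOUNT; use ALTER TRIGGER if the original trigger already exists):
CREATE TRIGGER [MYDB].[update_history] ON [MYDB].[AMOUNTS]
FOR UPDATE, INSERT
AS
-- inserted already holds the new values, so no join back to the base table is needed
INSERT INTO MYDB.AMOUNTS_HIST
    (CHANGE_DATE, AMOUNT, COMPANY_ID, EXP_ID, SPOT, UPDATE_DATE)
SELECT i.CHANGE_DATE, i.AMOUNT, i.COMPANY_ID, i.EXP_ID, i.SPOT, i.UPDATE_DATE
FROM inserted i;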

Updating redundant/denormalized data automatically in SQL Server

I use a high level of redundant, denormalized data in my DB designs to improve performance. I'll often store data that would normally need to be joined or calculated. For example, if I have a User table and a Task table, I would store the Username and UserDisplayName redundantly in every Task record. Another example of this is storing aggregates, such as storing the TaskCount in the User table.
User
UserID
Username
UserDisplayName
TaskCount
Task
TaskID
TaskName
UserID
UserName
UserDisplayName
This is great for performance, since the app has many more reads than insert, update or delete operations, and since some values, like Username, change rarely. However, the big drawback is that integrity has to be enforced via application code or triggers. This can be very cumbersome with updates.
My question is: can this be done automatically in SQL Server 2005/2010, maybe via a persisted/permanent view? Would anyone recommend another possible solution or technology? I've heard document-based DBs such as CouchDB and MongoDB can handle denormalized data more effectively.
You might want to first try an Indexed View before moving to a NoSQL solution:
http://msdn.microsoft.com/en-us/library/ms187864.aspx
and:
http://msdn.microsoft.com/en-us/library/ms191432.aspx
Using an Indexed View would allow you to keep your base data in properly normalized tables and maintain data integrity while giving you the denormalized "view" of that data. I would not recommend this for highly transactional tables, but you said it was heavier on reads than writes, so you might want to see if this works for you.
Based on your two example tables, one option is:
1) Add a column to the User table defined as:
TaskCount INT NOT NULL DEFAULT (0)
2) Add a Trigger on the Task table defined as:
CREATE TRIGGER UpdateUserTaskCount
ON dbo.Task
AFTER INSERT, DELETE
AS
;WITH added AS
(
    SELECT ins.UserID, COUNT(*) AS [NumTasks]
    FROM INSERTED ins
    GROUP BY ins.UserID
)
UPDATE usr
SET usr.TaskCount = (usr.TaskCount + added.NumTasks)
FROM dbo.[User] usr
INNER JOIN added
        ON added.UserID = usr.UserID;

;WITH removed AS
(
    SELECT del.UserID, COUNT(*) AS [NumTasks]
    FROM DELETED del
    GROUP BY del.UserID
)
UPDATE usr
SET usr.TaskCount = (usr.TaskCount - removed.NumTasks)
FROM dbo.[User] usr
INNER JOIN removed
        ON removed.UserID = usr.UserID;
GO
3) Then do a View that has:
SELECT u.UserID,
u.Username,
u.UserDisplayName,
u.TaskCount,
t.TaskID,
t.TaskName
FROM User u
INNER JOIN Task t
ON t.UserID = u.UserID
And then follow the recommendations from the links above (WITH SCHEMABINDING, a unique clustered index, etc.) to make it "persisted". While it is normally inefficient to do an aggregation in a subquery in the SELECT, this specific case is intended to be denormalized in a situation that has more reads than writes. Doing the Indexed View will keep the entire structure, including the aggregation, physically stored, so each read does not recalculate it.
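A sketch of what that might look like for the view in step 3 (the view and index names are made up; indexed views also require two-part object names):
CREATE VIEW dbo.UserTasks
WITH SCHEMABINDING            -- required for indexed views
AS
SELECT u.UserID,
       u.Username,
       u.UserDisplayName,
       u.TaskCount,
       t.TaskID,
       t.TaskName
FROM dbo.[User] u
INNER JOIN dbo.Task t
        ON t.UserID = u.UserID;
GO

-- the unique clustered index is what actually materializes ("persists") the view
CREATE UNIQUE CLUSTERED INDEX IX_UserTasks
    ON dbo.UserTasks (TaskID);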
Now, if a LEFT JOIN is needed because some Users do not have any Tasks, the Indexed View will not work, due to the many restrictions on creating one. In that case, you can create a real table (UserTask) as your denormalized structure and have it populated via either a trigger on just the User table (assuming you use the trigger shown above, which updates the User table based on changes in the Task table), or you can skip the TaskCount field in the User table and have triggers on both tables populate the UserTask table. In the end, this is basically what an Indexed View does, just without you having to write the synchronization trigger(s).
