Best approach for MERGE in Azure Synapse - sql-server

I have two tables (src, tgt) on a dedicated SQL pool in Azure Synapse, both containing approximately 200 million records. Let's say they have columns A and B. I'm trying to update the field tgt.A with the value from src.A when tgt.B matches src.B. If src.B does not exist in tgt.B, insert the new record.
I tried using the MERGE statement as below:
MERGE tgt
USING src
    ON tgt.B = src.B
WHEN MATCHED AND tgt.A <> src.A
    THEN UPDATE SET tgt.A = src.A
WHEN NOT MATCHED
    THEN INSERT (A, B)
    VALUES (src.A, src.B);
I also tried the insert and update statements separately as below:
INSERT INTO tgt (A, B)
SELECT A, B
FROM src
WHERE NOT EXISTS (SELECT 1 FROM tgt WHERE tgt.B = src.B);

UPDATE tgt
SET tgt.A = src.A
FROM tgt
JOIN src
    ON tgt.B = src.B;
For testing purposes, I took a smaller subset of the tgt table containing 6 million records and a source containing only 400k records. Still, with either of the above approaches, the stored procedure keeps executing beyond 1 hour, at which point I cancel the execution. I noticed that with even smaller samples of 1k records in each table it executes within a minute, but it struggles with a large number of records. Please recommend the best approach to tackle the issue. Can this be optimized?
Note: The source table is populated and dropped within the stored procedure and only the target table is saved with the updated records.
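On a dedicated SQL pool, large upserts of this kind are often faster when the target is rebuilt with CTAS rather than modified in place with MERGE or UPDATE. A minimal sketch of that pattern, assuming B is unique in src and that hash-distributing both tables on B is acceptable (the new table name and the distribution/index options are illustrative):
CREATE TABLE tgt_new
WITH (DISTRIBUTION = HASH(B), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT COALESCE(s.A, t.A) AS A,    -- take src.A where a match on B exists
       COALESCE(t.B, s.B) AS B     -- keep every B from either side
FROM tgt AS t
FULL OUTER JOIN src AS s
    ON t.B = s.B;

RENAME OBJECT tgt TO tgt_old;
RENAME OBJECT tgt_new TO tgt;
DROP TABLE tgt_old;
Keeping tgt and src hash-distributed on the join column B also helps the MERGE and UPDATE variants, since it avoids data movement during the join.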

Related

Snowflake CHANGES | Why does it need to perform a self join? Why is it slower than join using other unique column?

I was facing issues with a merge statement over large tables.
The source table for the merge is basically a clone of the target table after applying some DML.
E.g. in the example below, PUBLIC.customer is the target and STAGING.customer is the source.
CREATE OR REPLACE TABLE STAGING.customer CLONE PUBLIC.customer;

MERGE INTO STAGING.customer TARGET
USING (SELECT * FROM NEW_CUSTOMER) AS SOURCE
    ON TARGET.ID = SOURCE.ID
WHEN MATCHED AND SOURCE.DELETEFLAG = TRUE THEN DELETE
WHEN MATCHED AND TARGET.ROWMODIFIED < SOURCE.ROWMODIFIED THEN UPDATE SET TARGET.AGE = SOURCE.AGE, ...
WHEN NOT MATCHED THEN INSERT (AGE, DELETEFLAG, ID, ...) VALUES (SOURCE.AGE, SOURCE.DELETEFLAG, SOURCE.ID, ...);
Currently, we are simply merging the STAGING.customer back to PUBLIC.customer at the end.
This final merge statement is very costly for some of the large tables.
While looking for a solution to reduce the cost, I discovered the Snowflake "CHANGES" mechanism. As per the documentation:
Currently, at least one of the following must be true before change tracking metadata is recorded for a table:
Change tracking is enabled on the table (using ALTER TABLE … CHANGE_TRACKING = TRUE).
A stream is created for the table (using CREATE STREAM).
Both options add hidden columns to the table which store change tracking metadata. The columns consume a small amount of storage.
I assumed that the metadata added to the table is equivalent to the result set of the SELECT statement using the "changes" clause, which doesn't seem to be the case.
INSERT INTO PUBLIC.CUSTOMER (AGE, ...)
SELECT AGE, ...
FROM STAGING.CUSTOMER CHANGES (INFORMATION => DEFAULT) AT (TIMESTAMP => 1675772176::timestamp)
WHERE "METADATA$ACTION" = 'INSERT';
The SELECT statement using the "changes" clause is way slower than the merge statement that I am using currently.
I checked the execution plan and found that Snowflake performs a self-join (of sorts) on the table at two different timestamps.
Should this really be the behaviour, or am I missing something here? I was hoping to get better performance, expecting Snowflake to scan the table once and then simply insert the new records, which should be faster than the merge statement.
Also, even if it does a self-join, why does the merge query perform better than this? The merge query is also doing a join on similar volumes.
I was also hoping to use the same mechanism for deletes/updates on the source table.
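For reference, the stream-based route from the quoted documentation would look roughly like the sketch below; the stream name is illustrative, and the stream has to exist before the DML runs so that it captures the changes:
CREATE STREAM STAGING.customer_changes ON TABLE STAGING.customer;

-- ... apply the DML (the MERGE from NEW_CUSTOMER) to STAGING.customer ...

INSERT INTO PUBLIC.customer (AGE, DELETEFLAG, ID)
SELECT AGE, DELETEFLAG, ID
FROM STAGING.customer_changes
WHERE METADATA$ACTION = 'INSERT' AND METADATA$ISUPDATE = FALSE;
Consuming the stream in a DML statement advances its offset, so the same changes are not returned again on the next read.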

Insert New Records Into Table Using If/Else Statement

I have two SQL Server tables where I need to add records from one table to the other. If the unique identifier already exists in the target table, then update the record with the data coming from the source table; if the unique identifier doesn't exist, then insert the entire new record into the target table.
I seem to have gotten the initial part to work, where I update the records in the target table, but the part where I would INSERT new records does not seem to be working.
if exists (
    select 1
    from SCM_Top_Up_Operational O
    join SCM_Top_Up_Rolling R on O.String = R.String
)
begin
    update O
    set O.Date_Added = R.Date_Added,
        O.Real_Exfact = R.Real_Exfact,
        O.Excess_Top_Up = R.Excess_Top_Up
    from SCM_Top_Up_Operational O
    join SCM_Top_Up_Rolling R on O.String = R.String
    where O.String = R.String and R.Date_Added > O.Date_Added
end
else
begin
    insert into SCM_Top_Up_Operational (String, Date_Added, Real_Exfact, Article_ID, Excess_Top_Up, Plant)
    select String, Date_Added, Real_Exfact, Article_ID, Excess_Top_Up, Plant
    from SCM_Top_Up_Rolling
end
If I followed you correctly, you should be able to solve this with a single SQL query, using SQL Server MERGE syntax, available since SQL Server 2008.
From the documentation:
Runs insert, update, or delete operations on a target table from the results of a join with a source table. For example, synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table.
Consider the following query:
MERGE SCM_Top_Up_Operational O
USING SCM_Top_Up_Rolling R
    ON (O.String = R.String)
WHEN MATCHED THEN UPDATE SET
    O.Date_Added = R.Date_Added,
    O.Real_Exfact = R.Real_Exfact,
    O.Excess_Top_Up = R.Excess_Top_Up
WHEN NOT MATCHED BY TARGET THEN INSERT
    (String, Date_Added, Real_Exfact, Article_ID, Excess_Top_Up, Plant)
    VALUES (R.String, R.Date_Added, R.Real_Exfact, R.Article_ID, R.Excess_Top_Up, R.Plant);
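Note that the original IF/ELSE attempt only updated rows where R.date_added > O.date_added; if that behaviour should be preserved, the matched branch can carry the same predicate - a sketch reusing the aliases above:
WHEN MATCHED AND R.Date_Added > O.Date_Added
THEN UPDATE SET
    O.Date_Added = R.Date_Added,
    O.Real_Exfact = R.Real_Exfact,
    O.Excess_Top_Up = R.Excess_Top_Up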

Delete vs Rollback Strategy - ETL Load

I am loading data into a table in the following manner:
DECLARE @srcRc INT;
DECLARE @dstRc INT;

SET @srcRc = (SELECT COUNT(*) FROM A);

INSERT INTO t
    (Col1
    ,Col2
    ,Col3
    )
SELECT A.Col1
    ,A.Col2
    ,B.Col3
FROM A
JOIN B
    ON A.Id = B.Id;

SET @dstRc = @@ROWCOUNT;
Now I am comparing the variables @srcRc and @dstRc. The row counts must be the same. If they are not, the inserted rows need to be deleted.
Q1: What would be the best strategy to roll back the inserted rows?
I have a couple of ideas:
1) Run the load in a transaction and roll back if the rowcount does not match (a sketch of this follows the code below).
2) Add a flag column (bit) to the destination table called toBeDeleted, run the load, and if the rowcount does not match, update the toBeDeleted column to 1 to flag the rows as candidates for deletion. Then delete them in batch mode (while loop). Or do not delete them, but always exclude deletion candidates from queries when working with the t table.
3) Before inserting the rows, compare the rowcounts first. If they do not match, don't start the load.
DECLARE @srcRc INT;
DECLARE @dstRc INT;
SET @srcRc = (SELECT COUNT(1) FROM A);
SET @dstRc = (SELECT COUNT(1) FROM A JOIN B ON A.Id = B.Id);
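A minimal sketch of option 1 (the whole load in one transaction, rolled back when the counts differ), reusing the variables and tables from the first code block above:
BEGIN TRANSACTION;

SET @srcRc = (SELECT COUNT(*) FROM A);

INSERT INTO t (Col1, Col2, Col3)
SELECT A.Col1, A.Col2, B.Col3
FROM A
JOIN B
    ON A.Id = B.Id;

SET @dstRc = @@ROWCOUNT;

IF @srcRc <> @dstRc
    ROLLBACK TRANSACTION;
ELSE
    COMMIT TRANSACTION;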
Q2: What would be the better solution for a higher number of rows, let's say 10-100 million?
Q3: Or is there a better strategy for a similar case?
OK, assuming:
You need the rollback to work at some later date, when the content of tables A and B may have changed.
There may also be other rows in T which you don't want to delete as part of the rollback.
Then you MUST keep a list of the rows you inserted, as you cannot reliably regenerate that list from A and B, and you can't just delete everything from T.
You could do this in two ways:
Change your import so that it first inserts the rows into an import table, and keep the import table around until you are sure you don't need it anymore.
Add an extra column [importId] to T, into which you put a uniquely identifying value for each load (see the sketch after this list).
Obviously the first strategy uses a lot more disk space. So the longer you keep the data and the more data there is, the better the extra column looks.
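A minimal sketch of the extra-column approach, assuming an importId column has been added to the target table t from the question (names are illustrative):
DECLARE @importId UNIQUEIDENTIFIER = NEWID();

INSERT INTO t (Col1, Col2, Col3, importId)
SELECT A.Col1, A.Col2, B.Col3, @importId
FROM A
JOIN B
    ON A.Id = B.Id;

-- Rolling back this particular load later only needs the saved @importId value:
DELETE FROM t WHERE importId = @importId;
For very large loads the DELETE can be run in TOP (N) batches inside a while loop, as option 2 in the question already suggests.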
Another option would be to generate the list of imported data separately and have your transaction SQL be a bulk insert with all the data hard-coded into the SQL.
This works well for small lists, initial setup data and the like.
Edit:
From your comments it sounds like you don't want a rollback per se, but rather the best way to apply business logic around the import process.
In this case your third option is the best: don't do the import when you know the source data is incorrect.

Merge query using two tables in SQL server 2012

I am very new to SQL and SQL Server, and would appreciate any help with the following problem.
I am trying to update a share price table with new prices.
The table has three columns: share code, date, price.
The share code + date = PK
As you can imagine, if you have thousands of share codes and 10 years' data for each, the table can get very big. So I have created a separate table called a share ID table, and use a share ID instead in the first table (I was reliably informed this would speed up the query, as searching by integer is faster than string).
So, to summarise, I have two tables as follows:
Table 1 = Share_code_ID (int), Date, Price
Table 2 = Share_code_ID (int), Share_name (string)
So let's say I want to update the table/s with today's price for share ZZZ. I need to:
Look for the Share_code_ID corresponding to 'ZZZ' in table 2
If it is found, update table 1 with the new price for that date, using the Share_code_ID I just found
If the Share_code_ID is not found, update both tables
Let's ignore for now how the Share_code_ID is generated for a new code, I'll worry about that later.
I'm trying to use a merge query loosely based on the following structure, but have no idea what I am doing:
MERGE INTO [Table 1]
USING (VALUES (1,23-May-2013,1000)) AS SOURCE (Share_code_ID,Date,Price)
{ SEEMS LIKE THERE SHOULD BE AN INNER JOIN HERE OR SOMETHING }
ON Table 2 = 'ZZZ'
WHEN MATCHED THEN UPDATE SET Table 1.Price = 1000
WHEN NOT MATCHED THEN INSERT { TO BOTH TABLES }
Any help would be appreciated.
http://msdn.microsoft.com/library/bb510625(v=sql.100).aspx
You use Table 1 as the target table and Table 2 as the source table.
You want to take action when a given ID is not found in Table 2, that is, in the source table.
In the documentation that you have already read, that corresponds to the clause
WHEN NOT MATCHED BY SOURCE ... THEN <merge_matched>
and the latter corresponds to
<merge_matched>::=
{ UPDATE SET <set_clause> | DELETE }
Ergo, you cannot insert into the source table there.
You could use triggers for auto-insertion when you insert something into Table 1, but the trigger will not be able to insert the proper Share_name - it just won't know it.
So you have two options, I guess:
1) Write a T-SQL code block - look at stored procedures (a sketch follows below). I think there is also a construct to execute an anonymous code block in MS SQL, like the EXECUTE BLOCK command in Firebird SQL Server, but I don't know that for sure.
2) Create an updatable SQL VIEW joining Table 1 and Table 2 to show the most recent date, so that when you insert a row into this view, the view's on-insert trigger actually inserts rows into both tables, and when you update the data in the view, the on-update trigger modifies the data.
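A rough sketch of option 1 as a plain T-SQL block, following the tables from the question (the new Share_code_ID generation is deliberately simplistic, since the question leaves it open):
DECLARE @ShareCodeId INT;

SELECT @ShareCodeId = Share_code_ID
FROM [Table 2]
WHERE Share_name = 'ZZZ';

IF @ShareCodeId IS NULL
BEGIN
    -- New share: create its ID row first (naive ID generation, for illustration only)
    SELECT @ShareCodeId = ISNULL(MAX(Share_code_ID), 0) + 1 FROM [Table 2];
    INSERT INTO [Table 2] (Share_code_ID, Share_name)
    VALUES (@ShareCodeId, 'ZZZ');
END;

UPDATE [Table 1]
SET Price = 1000
WHERE Share_code_ID = @ShareCodeId AND [Date] = '2013-05-23';

IF @@ROWCOUNT = 0
    INSERT INTO [Table 1] (Share_code_ID, [Date], Price)
    VALUES (@ShareCodeId, '2013-05-23', 1000);
Wrapped in a stored procedure with the share name, date and price as parameters, this covers both the "found" and "not found" paths without needing MERGE at all.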

Updating two columns in a table containing millions of rows

I am updating 2 columns in a table that contains millions (85 million) of rows. To update these I am using an update command like:
UPDATE Table1
SET Table1.column1 = Table2.column1,
    Table1.column2 = Table2.column2
FROM <tables and join conditions>;
Now my problem is that it is taking 23 hours. Even after using a batch size there is not much change in the time taken.
But I need to update it in less than 5 hours. Is that possible? What approach should I take to achieve it?
SQL UPDATE statements have to keep all the rows in the log file so they can be rolled back on failure. As explained by this guy, the best way to handle millions of rows is to forget about atomicity and batch your updates into 50,000 rows (or whatever):
--Declare a variable for the row count
DECLARE @rc INT;
SET @rc = 50000;

WHILE @rc = 50000
BEGIN
    BEGIN TRANSACTION;

    --Use TOP (50000) to limit the number of updates
    --performed in each batch to 50K rows.
    --Use TABLOCKX and HOLDLOCK to obtain and hold
    --an immediate exclusive table lock. This usually
    --speeds up the update because only one lock is needed.
    UPDATE TOP (50000) mt
    SET mt.UpdFlag = 0
    FROM MyTable mt WITH (TABLOCKX, HOLDLOCK)
    JOIN ControlTable ct
        ON mt.KeyCol = ct.PK
    --Add criteria to avoid updating rows that
    --were updated in a previous pass
    WHERE mt.UpdFlag <> 0;

    --Get the number of rows updated.
    --The process will continue until fewer than 50000 rows are updated.
    SELECT @rc = @@ROWCOUNT;

    --Commit the transaction
    COMMIT;
END
This still has some problems, in that you need to know which rows you've already handled; perhaps someone smarter than this guy (and me!) can figure out something nicer with more MSSQL magic, but this should be a start.
I used SSIS for this task.
First I took the source table in which I had to update the two columns. Then I added a Lookup task that maps the source columns to the destination table columns from which I get the data for updating the source table columns. Finally I added an OLE DB Destination that fills the table based on the join conditions from the lookup.
This process was much faster than executing an update script.
