I have this legacy MERGE code:
MERGE [Salesforce_Lead] AS target
USING (
    SELECT lead_id,
        salesforce_id,
        createdbyid,
        email,
        updatedate
    FROM Source.Leads) AS source
ON (target.lead_id = source.lead_id)
WHEN MATCHED
    AND (Checksum(target.lead_id, target.salesforce_id) <> Checksum(source.lead_id, source.salesforce_id)
        OR Checksum(target.lead_id, target.createdbyid) <> Checksum(source.lead_id, source.createdbyid)
        OR Checksum(target.lead_id, target.email) <> Checksum(source.lead_id, source.email)
        OR Checksum(target.lead_id, target.updatedate) <> Checksum(source.lead_id, source.updatedate))
THEN
    UPDATE
    SET target.salesforce_id = source.salesforce_id,
        target.createdbyid = source.createdbyid,
        target.email = source.email,
        target.updatedate = source.updatedate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (
        lead_id,
        salesforce_id,
        createdbyid,
        email,
        updatedate
    )
    VALUES (
        source.lead_id,
        source.salesforce_id,
        source.createdbyid,
        source.email,
        source.updatedate
    );
I want to replace it with the code below:
DELETE L
FROM [dbo].[Salesforce_Lead] AS L
INNER JOIN Source.Leads AS t
    ON L.lead_id = t.lead_id;

INSERT INTO [dbo].[Salesforce_Lead]
SELECT *
FROM Source.Leads;
Reasons:
- Shorter code, easier to maintain.
- I thought MERGE was supposed to be used if you were deleting from the source as well, or using the OUTPUT clause…
- There are not many updates; usually it's a plain insert.
Am I missing anything? The performance I would gain through the WHEN MATCHED clause is the only reason I should use MERGE, but as I said, most of them are inserts. Is there any issue if I replace MERGE with DELETE/INSERT?
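For what it's worth, here is the same replacement wrapped in a single transaction so the swap stays as atomic as the MERGE it replaces (a sketch, assuming the INSERT should target [dbo].[Salesforce_Lead]):

BEGIN TRANSACTION;
DELETE L
FROM [dbo].[Salesforce_Lead] AS L
INNER JOIN Source.Leads AS t
    ON L.lead_id = t.lead_id;
-- Explicit column list so a schema change in Source.Leads fails loudly
-- instead of silently misaligning columns.
INSERT INTO [dbo].[Salesforce_Lead] (lead_id, salesforce_id, createdbyid, email, updatedate)
SELECT lead_id, salesforce_id, createdbyid, email, updatedate
FROM Source.Leads;
COMMIT TRANSACTION;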
The thing is that you can use a MERGE statement to implement both initial and incremental loading. With a plain INSERT, incremental loading gets awkward even when you only ever insert: do you really want to truncate the target table and reload everything from scratch every time you run an incremental load?
Another issue is type-2 SCD: implementing it with plain INSERT and UPDATE statements gets complex and messy very quickly, whereas a MERGE statement handles it efficiently.
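To illustrate the type-2 SCD point, here is a minimal sketch using a single MERGE plus a composable OUTPUT clause; the DimCustomer dimension, its tracking columns (IsCurrent, ValidFrom, ValidTo), and the StagingCustomer table are hypothetical names:

-- The MERGE inserts brand-new members and expires changed ones;
-- the outer INSERT then re-inserts a fresh current row for each
-- member the MERGE expired (composable DML over OUTPUT).
INSERT INTO dbo.DimCustomer (CustomerId, CustomerName, IsCurrent, ValidFrom, ValidTo)
SELECT CustomerId, CustomerName, 1, GETDATE(), NULL
FROM (
    MERGE INTO dbo.DimCustomer AS tgt
    USING dbo.StagingCustomer AS src
        ON tgt.CustomerId = src.CustomerId AND tgt.IsCurrent = 1
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, CustomerName, IsCurrent, ValidFrom, ValidTo)
        VALUES (src.CustomerId, src.CustomerName, 1, GETDATE(), NULL)
    WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
        UPDATE SET tgt.IsCurrent = 0, tgt.ValidTo = GETDATE()
    OUTPUT $action AS MergeAction, src.CustomerId, src.CustomerName
) AS changes (MergeAction, CustomerId, CustomerName)
WHERE MergeAction = 'UPDATE';

Doing the same thing with separate INSERT and UPDATE statements requires at least three passes over the staging data.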
Related
We've set up a stream on a table that is continuously loaded via Snowpipe.
We're consuming this data with a task that runs every minute and merges into another table. There is a possibility of duplicate keys, so we use a ROW_NUMBER() window function ordered by the file created timestamp descending and keep only row_num = 1. This way we always get the latest insert.
Initially we used a standard task with the MERGE statement, but we noticed that in some instances, since Snowpipe does not guarantee loading in order of when the files were staged, we were updating rows with older data. As such, on the WHEN MATCHED section we added a condition so that a row is only updated when the incoming file created timestamp is greater than the existing one.
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the MATCHED clause would interfere with the NOT MATCHED clause.
My theory was that the extra clause added a bit of time to the task run, so some runs were skipped or the next run happened almost immediately after the last one completed. The idea being that the missing rows arrived in the middle and the stream offset advanced before they could be consumed.
As such, we changed the task to call a stored procedure which uses an explicit transaction, because the docs seem to suggest that using a transaction will lock the stream. However, even with this we can see that new inserts are still missing. We're talking very small numbers, e.g. 8 out of 100,000s.
Any ideas what might be happening?
Example task code below (not the SP version; the task name is a placeholder):
CREATE OR REPLACE TASK my_merge_task
    WAREHOUSE = TASK_WH
    SCHEDULE = '1 minute'
    WHEN SYSTEM$STREAM_HAS_DATA('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;
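And a minimal sketch of the shape of the SP version with the explicit transaction (Snowflake SQL Scripting; the procedure name is a placeholder, and the payload is reduced to id and file_created for brevity):

CREATE OR REPLACE PROCEDURE merge_my_stream()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    -- Consume the stream inside an explicit transaction so the stream
    -- offset only advances if the MERGE commits.
    BEGIN TRANSACTION;
    MERGE INTO processed_data pd USING (
        SELECT *
        FROM my_stream
        QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY file_created DESC) = 1
    ) ms ON ms.id = pd.id
    WHEN NOT MATCHED THEN INSERT (id, file_created) VALUES (ms.id, ms.file_created)
    WHEN MATCHED AND ms.file_created >= pd.file_created THEN
        UPDATE SET file_created = ms.file_created;
    COMMIT;
    RETURN 'done';
END;
$$;

The task body then becomes just CALL merge_my_stream();.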
I am not fully sure what is going wrong here, but Snowflake does publish a recommendation about the file created time: the file created timestamp is calculated in the cloud services layer and may be a bit different than you think. There is another recommendation related to Snowpipe and data ingestion: the queue service takes about a minute to consume data from the pipe, and if you have a lot of data flowing in within that minute, you may end up with this issue. Review your implementation, test whether pushing data at one-minute intervals solves the issue, and don't rely on the file create time.
The condition "AND ms.file_created >= pd.file_created" seems to be added as a mechanism to avoid updating the same row multiple times.
Alternative approach could be using IS DISTINCT FROM to compare source against target columns(except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating a row when nothing has changed.
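As a quick illustration of why IS DISTINCT FROM is the safer comparison here: unlike <>, it treats NULL as an ordinary value, so a change to or from NULL still registers as a difference:

-- <> yields NULL (not TRUE) when either side is NULL, so an update
-- condition written with <> silently skips such rows; IS DISTINCT FROM does not.
SELECT NULL <> 'x' AS neq_result,                       -- NULL
       NULL IS DISTINCT FROM 'x' AS is_distinct_result; -- TRUE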
I have two tables, Application and Application_temp. I wrote a MERGE query to merge data from Application_temp into Application.
But the UPDATE branch is not working. The query is inserting the whole set of records again instead of updating; I suspect it is because of the date columns. It would be great if anyone could help me with this.
I tried multiple suggestions from Google, but nothing helped. I tried changing various match conditions, but that didn't work either.
CREATE PROCEDURE [dbo].[MergeApplication_tempToApplication]
AS
--CREATE UNIQUE INDEX cmdbid ON Application_Temp(cmdb_id) WITH (IGNORE_DUP_KEY = OFF)
MERGE INTO Application AS Target
USING Application_temp AS Source
ON (
--Target.CMDB_ID=Source.CMDB_ID
--AND Target.Application_Name=Source.Application_Name
Target.maas_application_id = source.maas_application_id
)
WHEN MATCHED
AND (
Target.CMDB_ID <> Source.CMDB_ID
OR Target.Application_Name <> Source.Application_Name
OR Target.Fss_Portfolio <> Source.Fss_Portfolio
OR Target.Managed_By <> Source.Managed_By
--Target.Date_Created <> Source.Date_Created OR
--Target.Date_Updated <> Source.Date_Updated
)
THEN
UPDATE
SET Target.cmdb_id = Source.cmdb_id
,Target.Application_Name = Source.Application_Name
,Target.Fss_Portfolio = Source.Fss_Portfolio
,Target.Managed_By = Source.Managed_By
,Target.Date_Created = Source.Date_Created
,Target.Date_Updated = Source.Date_Updated
WHEN NOT MATCHED
THEN
INSERT (
Application_Name
,cmdb_id
,Fss_Portfolio
,Managed_By
,Date_Created
,Date_Updated
)
VALUES (
source.Application_Name
,source.cmdb_id
,source.Fss_Portfolio
,source.Managed_By
,source.Date_Created
,source.Date_Updated
);
As I said, the UPDATE statement is not working. Can any SQL Server expert help me with this? I have been struggling with it for the past month.
I've got a table with data named energydata.
It has just three columns:
(webmeterID, DateTime, kWh)
I have a new set of updated data in a table temp_energydata.
The DateTime and the webmeterID stay the same, but the kWh values need updating from the temp_energydata table.
How do I write the T-SQL for this the correct way?
Assuming you want an actual SQL Server MERGE statement:
MERGE INTO dbo.energydata WITH (HOLDLOCK) AS target
USING dbo.temp_energydata AS source
ON target.webmeterID = source.webmeterID
AND target.DateTime = source.DateTime
WHEN MATCHED THEN
UPDATE SET target.kWh = source.kWh
WHEN NOT MATCHED BY TARGET THEN
INSERT (webmeterID, DateTime, kWh)
VALUES (source.webmeterID, source.DateTime, source.kWh);
If you also want to delete records in the target that aren't in the source:
MERGE INTO dbo.energydata WITH (HOLDLOCK) AS target
USING dbo.temp_energydata AS source
ON target.webmeterID = source.webmeterID
AND target.DateTime = source.DateTime
WHEN MATCHED THEN
UPDATE SET target.kWh = source.kWh
WHEN NOT MATCHED BY TARGET THEN
INSERT (webmeterID, DateTime, kWh)
VALUES (source.webmeterID, source.DateTime, source.kWh)
WHEN NOT MATCHED BY SOURCE THEN
DELETE;
Because this has become a bit more popular, I feel like I should expand this answer a bit with some caveats to be aware of.
First, there are several blogs which report concurrency issues with the MERGE statement in older versions of SQL Server. I do not know if this issue has ever been addressed in later editions. Either way, this can largely be worked around by specifying the HOLDLOCK or SERIALIZABLE lock hint:
MERGE INTO dbo.energydata WITH (HOLDLOCK) AS target
[...]
You can also accomplish the same thing with more restrictive transaction isolation levels.
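For example, a minimal sketch of the isolation-level alternative using the same tables as above; it gives the same protection as the per-statement HOLDLOCK hint, but for the whole transaction:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
MERGE INTO dbo.energydata AS target
USING dbo.temp_energydata AS source
    ON target.webmeterID = source.webmeterID
    AND target.DateTime = source.DateTime
WHEN MATCHED THEN
    UPDATE SET target.kWh = source.kWh
WHEN NOT MATCHED BY TARGET THEN
    INSERT (webmeterID, DateTime, kWh)
    VALUES (source.webmeterID, source.DateTime, source.kWh);
COMMIT TRANSACTION;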
There are several other known issues with MERGE. (Note that since Microsoft nuked Connect and didn't link issues in the old system to issues in the new system, these older issues are hard to track down. Thanks, Microsoft!) From what I can tell, most of them are not common problems or can be worked around with the same locking hints as above, but I haven't tested them.
As it is, even though I've never had any problems with the MERGE statement myself, I always use the WITH (HOLDLOCK) hint now, and I prefer to use the statement only in the most straightforward of cases.
I have often used Bacon Bits' great answer, as I just cannot memorize the syntax.
But I usually add a CTE to make the DELETE part more useful, because very often you will want to apply the merge to only part of the target table.
WITH target as (
SELECT * FROM dbo.energydata WHERE DateTime > GETDATE()
)
MERGE INTO target WITH (HOLDLOCK)
USING dbo.temp_energydata AS source
ON target.webmeterID = source.webmeterID
AND target.DateTime = source.DateTime
WHEN MATCHED THEN
UPDATE SET target.kWh = source.kWh
WHEN NOT MATCHED BY TARGET THEN
INSERT (webmeterID, DateTime, kWh)
VALUES (source.webmeterID, source.DateTime, source.kWh)
WHEN NOT MATCHED BY SOURCE THEN
DELETE;
If you just need to update your records in energydata based on data in temp_energydata, assuming that temp_energydata doesn't contain any new records, then try this:
UPDATE e SET e.kWh = t.kWh
FROM energydata e INNER JOIN
temp_energydata t ON e.webmeterID = t.webmeterID AND
e.DateTime = t.DateTime
Here is a working sqlfiddle.
But if temp_energydata contains new records that you need to insert into energydata, preferably with one statement, then you should definitely go with the answer that Bacon Bits gave.
UPDATE ed
SET ed.kWh = ted.kWh
FROM energydata ed
INNER JOIN temp_energydata ted ON ted.webmeterID = ed.webmeterID
    AND ted.DateTime = ed.DateTime
UPDATE energydata SET kWh = (SELECT temp.kWh FROM temp_energydata AS temp
    WHERE temp.webmeterID = energydata.webmeterID AND temp.DateTime = energydata.DateTime)
WHERE EXISTS (SELECT 1 FROM temp_energydata AS temp
    WHERE temp.webmeterID = energydata.webmeterID AND temp.DateTime = energydata.DateTime)
The correct way is:
UPDATE test1
SET test1.data = test2.data
FROM test1
INNER JOIN test2 ON test1.id = test2.id
Recently I ran into an issue where multiple concurrent client requests were causing a performance issue in the database. I set up a test scenario and, as it turned out, when I run the same SELECT query 6 to 7 times concurrently (it gets worse with more), performance degrades and execution takes a long time. This is the query I tried:
SELECT TOP (100) COUNT(DISTINCT([Doc_Number])) AS "Expression"
FROM (
SELECT *
FROM "dbo"."Dummy_Table" "table_alias"
WHERE ((CAST("table_alias"."ID" AS NVARCHAR)) NOT IN
(
SELECT "PrimaryKey" AS ExceptionKey
FROM dbo.exceptions inner_exceptionStatus
LEFT JOIN dbo.Workflow inner_workflowStates ON
(inner_exceptionStatus."Status"= inner_workflowStates."UUID" AND
inner_exceptionStatus."UUID"= 'CA1662D6-73A2-4692-A765-E7E3EDB66062')
WHERE ("inner_workflowStates"."RemoveFromRecordSet" = 1 AND
"inner_workflowStates"."IsDeleted" = 0) AND
("inner_exceptionStatus"."IsArchived" IS NULL OR
"inner_exceptionStatus"."IsArchived" = 0)))) wrapperQuery
The query takes around 1 second when run alone, but if we run several in parallel, each one takes a weirdly long time or leads to a timeout.
The only thing that bothers me here is that SELECT queries should be non-blocking, and even with shared locks they should get along easily.
I am not sure if there is anything wrong in the query itself that adds to the situation.
Any help is deeply appreciated!
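One way to check whether the parallel SELECTs are genuinely blocking each other, or just waiting on resources, is to look at the active requests while the slowdown is happening; a minimal sketch against the standard DMV:

-- A non-zero blocking_session_id means real blocking; wait types such as
-- CXPACKET or RESOURCE_SEMAPHORE point at CPU or memory-grant pressure instead.
SELECT session_id, status, blocking_session_id, wait_type, wait_time, cpu_time
FROM sys.dm_exec_requests
WHERE session_id <> @@SPID AND session_id > 50; -- skip system sessions and this query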
Try it this way:
SELECT Count(DISTINCT( [Doc_Number] )) AS Expression
FROM dbo.Dummy_Table table_alias
WHERE NOT EXISTS (SELECT 1
FROM dbo.exceptions inner_exceptionStatus
INNER JOIN dbo.Workflow inner_workflowStates
ON ( inner_exceptionStatus.Status = inner_workflowStates.UUID
AND inner_exceptionStatus.UUID = 'CA1662D6-73A2-4692-A765-E7E3EDB66062' )
WHERE inner_workflowStates.RemoveFromRecordSet = 1
AND inner_workflowStates.IsDeleted = 0
AND ( inner_exceptionStatus.IsArchived IS NULL
OR inner_exceptionStatus.IsArchived = 0 )
AND table_alias.ID = PrimaryKey)
I made a couple of changes:
Changed NOT IN to NOT EXISTS.
Removed the CAST on "table_alias"."ID", because it prevents the use of any index on that column. If the conversion is really required, add it back.
Removed TOP (100): since there is no GROUP BY, the query returns a single record anyway.
If the query is still running slow, post the execution plan and make sure the statistics are up to date.
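For the statistics part, refreshing them on the tables involved is a one-liner each (table names taken from the query above):

-- Refresh optimizer statistics so cardinality estimates stay accurate.
UPDATE STATISTICS dbo.Dummy_Table;
UPDATE STATISTICS dbo.exceptions;
UPDATE STATISTICS dbo.Workflow;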
You can simplify your query like this:
SELECT COUNT(DISTINCT(Doc_Number)) AS Expression
FROM dbo.Dummy_Table dmy
WHERE not exists
(
SELECT *
FROM dbo.exceptions ies
INNER JOIN dbo.Workflow iws ON ies.Status= iws.UUID AND ies.UUID= 'CA1662D6-73A2-4692-A765-E7E3EDB66062'
WHERE iws.RemoveFromRecordSet = 1 AND iws.IsDeleted = 0 AND (ies.IsArchived IS NULL OR ies.IsArchived = 0)
and dmy.ID=PrimaryKey
)
As prdp says:
Changed NOT IN to NOT EXISTS.
Removed the CAST on "table_alias"."ID", because it prevents the use of any index on that column. If the conversion is really required, add it back.
Removed TOP (100): since there is no GROUP BY, the query returns a single record anyway.
I add:
Removed your derived table wrapperQuery.
You can use an INNER JOIN, because the WHERE clause tests RemoveFromRecordSet = 1 and so filters out the NULL rows a LEFT JOIN would keep.
Removed unneeded quotes, brackets, and parentheses in the WHERE clause.
I am new to building ETL packages, and I'm learning by exploring the packages in Visual Studio.
I was given a task that involves taking the current logic in the ETL from the MERGE statement and redesigning the ETL to have a functioning initial and incremental load path. I have searched the internet but couldn't find a proper answer. Please direct me on how to proceed with this.
I pasted the MERGE query below, with all the attributes removed except three, for clarity.
How do I get rid of this MERGE and use an initial (INIT) and incremental (INCR) load, which I want to add in the data flow?
Let me know if any further details or info are missing. Sorry if this is a noob question.
The Merge statement is of the format:
MERGE INTO <DestTableName> AS DIM
USING (
SELECT
Cast(EF.Form_PK as nvarchar(50)) FormSourceId
,EF.FormName
,CASE WHEN EF.FormStatus = 1 THEN N'Active' ELSE N'Inactive' END FormStatus
FROM <Tables and Joins>
) AS SRC
ON SRC.FormSourceID = DIM.FormSourceID
WHEN MATCHED And (
Coalesce(SRC.FormName, '') <> Coalesce(DIM.FormName, '')
OR Coalesce(SRC.FormStatus, '') <> Coalesce(DIM.FormStatus, '')
)
THEN
UPDATE SET
DIM.FormName = SRC.FormName
,DIM.FormStatus = SRC.FormStatus
WHEN NOT MATCHED BY TARGET THEN
INSERT (
FormSourceId
,FormName
,FormStatus
)
VALUES (
SRC.FormSourceId
,SRC.FormName
,SRC.FormStatus
)
WHEN NOT MATCHED BY SOURCE AND DimEvaluationId <> -1 THEN
DELETE;
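For what it's worth, a minimal sketch of one common way to split this into the two paths; it reuses the placeholder names from the MERGE above, with <source query> standing for the whole SELECT inside USING:

-- Initial load (INIT): the dimension is empty, so insert everything.
INSERT INTO <DestTableName> (FormSourceId, FormName, FormStatus)
SELECT FormSourceId, FormName, FormStatus
FROM (<source query>) AS SRC;

-- Incremental load (INCR): replay the three MERGE branches as separate statements.
UPDATE DIM
SET DIM.FormName = SRC.FormName,
    DIM.FormStatus = SRC.FormStatus
FROM <DestTableName> AS DIM
JOIN (<source query>) AS SRC ON SRC.FormSourceID = DIM.FormSourceID
WHERE Coalesce(SRC.FormName, '') <> Coalesce(DIM.FormName, '')
   OR Coalesce(SRC.FormStatus, '') <> Coalesce(DIM.FormStatus, '');

INSERT INTO <DestTableName> (FormSourceId, FormName, FormStatus)
SELECT SRC.FormSourceId, SRC.FormName, SRC.FormStatus
FROM (<source query>) AS SRC
WHERE NOT EXISTS (SELECT 1 FROM <DestTableName> AS DIM
                  WHERE DIM.FormSourceID = SRC.FormSourceID);

DELETE DIM
FROM <DestTableName> AS DIM
WHERE DIM.DimEvaluationId <> -1
  AND NOT EXISTS (SELECT 1 FROM (<source query>) AS SRC
                  WHERE SRC.FormSourceID = DIM.FormSourceID);

In SSIS terms, the INIT path is a straight source-to-destination data flow, while the INCR path typically uses a Lookup against the dimension to route rows to insert or update outputs.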