Changing ETL to incremental Load - sql-server

I am new to building ETL packages, and I'm learning by exploring the existing packages in Visual Studio.
I was given a task that involves taking the current logic in the ETL, which is based on a MERGE statement, and redesigning the ETL to have functioning initial and incremental load paths. I have searched the internet but couldn't find a proper answer. Please point me toward how to proceed with this.
I've pasted the MERGE query below, stripped down to three attributes so it's easier to follow.
How do I get rid of this MERGE and use initial (INIT) and incremental (INCR) load paths, which I want to add in the data flow?
Let me know if any further details or info are missing. Sorry if this is a beginner question.
The Merge statement is of the format:
MERGE INTO <DestTableName> AS DIM
USING (
SELECT
Cast(EF.Form_PK as nvarchar(50)) FormSourceId
,EF.FormName
,CASE WHEN EF.FormStatus = 1 THEN N'Active' ELSE N'Inactive' END FormStatus
FROM <Tables and Joins>
) AS SRC
ON SRC.FormSourceID = DIM.FormSourceID
WHEN MATCHED And (
Coalesce(SRC.FormName, '') <> Coalesce(DIM.FormName, '')
OR Coalesce(SRC.FormStatus, '') <> Coalesce(DIM.FormStatus, '')
)
THEN
UPDATE SET
DIM.FormName = SRC.FormName
,DIM.FormStatus = SRC.FormStatus
WHEN NOT MATCHED BY TARGET THEN
INSERT (
FormSourceId
,FormName
,FormStatus
)
VALUES (
SRC.FormSourceId
,SRC.FormName
,SRC.FormStatus
)
WHEN NOT MATCHED BY SOURCE AND DimEvaluationId <> -1 THEN
DELETE;
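A common way to split a MERGE like this into separate initial and incremental paths is to keep the same source query and branch on whether the dimension is empty: a straight insert for the initial load, and separate insert/update (and, if needed, delete) statements for the incremental load. Below is a minimal T-SQL sketch of that split, reusing the placeholders from the query above; the branching condition and statement structure are assumptions, not part of the existing package.
-- Initial load: only when the dimension is still empty.
IF NOT EXISTS (SELECT 1 FROM <DestTableName>)
BEGIN
    INSERT INTO <DestTableName> (FormSourceId, FormName, FormStatus)
    SELECT CAST(EF.Form_PK AS nvarchar(50)),
           EF.FormName,
           CASE WHEN EF.FormStatus = 1 THEN N'Active' ELSE N'Inactive' END
    FROM <Tables and Joins>;
END
ELSE
BEGIN
    -- Incremental load, step 1: insert rows that are new in the source.
    INSERT INTO <DestTableName> (FormSourceId, FormName, FormStatus)
    SELECT SRC.FormSourceId, SRC.FormName, SRC.FormStatus
    FROM ( /* same SELECT as the MERGE source */ ) AS SRC
    WHERE NOT EXISTS (SELECT 1 FROM <DestTableName> DIM
                      WHERE DIM.FormSourceId = SRC.FormSourceId);

    -- Incremental load, step 2: update rows whose attributes changed.
    UPDATE DIM
    SET DIM.FormName   = SRC.FormName,
        DIM.FormStatus = SRC.FormStatus
    FROM <DestTableName> DIM
    JOIN ( /* same SELECT as the MERGE source */ ) AS SRC
         ON SRC.FormSourceId = DIM.FormSourceId
    WHERE COALESCE(SRC.FormName, '')   <> COALESCE(DIM.FormName, '')
       OR COALESCE(SRC.FormStatus, '') <> COALESCE(DIM.FormStatus, '');

    -- The WHEN NOT MATCHED BY SOURCE ... DELETE branch would become a
    -- third statement (or a separate path) here.
END
In an SSIS data flow, the same split roughly maps to a Lookup on FormSourceId: no-match rows go to the destination as inserts, and matched rows go to an update path (for example a staging table followed by a set-based UPDATE in an Execute SQL Task).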

Related

Streams + tasks missing inserts?

We've set up a stream on a table that is continuously loaded via Snowpipe.
We're consuming this data with a task that runs every minute and merges into another table. There is a possibility of duplicate keys, so we use a ROW_NUMBER() window function ordered by the file-created timestamp descending and keep only row_num = 1. This way we always get the latest insert.
Initially we used a standard task with the MERGE statement, but we noticed that in some instances, since Snowpipe does not guarantee loading in the order the files were staged, we were updating rows with older data. As such, we added a condition to the WHEN MATCHED section so that a row is only updated when the incoming file-created timestamp is greater than the existing one.
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the MATCHED clause would interfere with the NOT MATCHED clause.
My theory was that the extra clause added a bit of time to the task run, so some runs were skipped or the next run started almost immediately after the last one completed, and the missing rows were caught in the middle: the stream offset advanced before they could be consumed.
As such, we changed the task to call a stored procedure that uses an explicit transaction, because the docs seem to suggest that using a transaction will lock the stream. However, even with this, new inserts are still missing. We're talking very small numbers, e.g. 8 out of hundreds of thousands.
Any ideas what might be happening?
Example task code below (not the stored procedure version):
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$stream_has_data('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;
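For comparison, the stored procedure variant described above might look roughly like the sketch below. This is only a sketch under assumptions: the procedure name merge_my_stream and the col1..col3 column list are placeholders, the source query is simplified, and the body just wraps the same kind of MERGE in an explicit BEGIN TRANSACTION / COMMIT as the question describes.
-- Hypothetical stored-procedure wrapper (Snowflake Scripting); names are placeholders.
CREATE OR REPLACE PROCEDURE merge_my_stream()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    BEGIN TRANSACTION;
    MERGE INTO processed_data pd USING (
        SELECT ms.*
        FROM my_stream ms
        QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY file_created DESC) = 1
    ) ms ON ms.id = pd.id
    WHEN NOT MATCHED THEN INSERT (col1, col2, col3) VALUES (ms.col1, ms.col2, ms.col3)
    WHEN MATCHED AND ms.file_created >= pd.file_created
        THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3;
    COMMIT;
    RETURN 'done';
END;
$$;

-- The task then only calls the procedure.
CREATE OR REPLACE TASK my_task
    WAREHOUSE = TASK_WH
    SCHEDULE = '1 minute'
    WHEN SYSTEM$STREAM_HAS_DATA('my_stream')
AS
    CALL merge_my_stream();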
I am not fully sure what is going wrong here, but Snowflake does publish a recommendation about relying on the file-created time. It suggests that the file-created timestamp is calculated by the cloud services layer and may be a bit different from what you expect. There is another recommendation related to Snowpipe and data ingestion: the queue service can take around a minute to consume data from the pipe, and if a lot of data is flowing in within that minute you can end up with this issue. Review your implementation, test whether pushing data at one-minute intervals resolves the problem, and don't rely on the file-created time.
The condition "AND ms.file_created >= pd.file_created" seems to be added as a mechanism to avoid updating the same row multiple times.
Alternative approach could be using IS DISTINCT FROM to compare source against target columns(except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating a row when nothing has changed.

Merge..USING..WHEN SQL Server equivalent in PostgreSQL

Right now, we are trying to migrate a stored procedure from SQL Server to PostgreSQL. In it, we came across a MERGE..USING..WHEN statement and couldn't find an equivalent. Is there any way to replicate this functionality? The actual query is given below.
WITH phase_list AS (
SELECT @id_plant AS id_plant
, id_phase
, date_started
FROM OPENJSON(@phase_list)
WITH (
id_phase INT '$.id_phase',
date_started DATE '$.date_started'
)
WHERE date_started IS NOT NULL
)
MERGE grow.plant_phase AS t
USING phase_list AS s
ON s.id_plant = t.id_plant AND s.id_phase = t.id_phase
WHEN MATCHED AND (t.date_started <> s.date_started) THEN
UPDATE
SET t.date_started=s.date_started,
t.date_updated=getutcdate(),
t.updated_by=@id_user
WHEN NOT MATCHED BY TARGET THEN
INSERT (id_plant, id_phase, date_started, created_by, updated_by)
VALUES (s.id_plant, s.id_phase, s.date_started, @id_user, @id_user)
WHEN NOT MATCHED BY SOURCE AND t.id_plant = @id_plant THEN
DELETE;
Can we replicate the same thing using a join with some if/else conditions, or is there any other approach?
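For reference, on PostgreSQL 15+ there is a native MERGE, although it has no WHEN NOT MATCHED BY SOURCE branch (that arrived in v17), so the delete has to be a separate statement; json_to_recordset can stand in for OPENJSON. Below is a rough sketch under those assumptions, with _id_plant, _phase_list and _id_user as placeholder function parameters rather than names from the original procedure.
-- PostgreSQL 15+ sketch; parameter names (_id_plant, _phase_list, _id_user) are placeholders.
WITH phase_list AS (
    SELECT _id_plant AS id_plant,
           p.id_phase,
           p.date_started
    FROM json_to_recordset(_phase_list::json) AS p(id_phase int, date_started date)
    WHERE p.date_started IS NOT NULL
)
MERGE INTO grow.plant_phase AS t
USING phase_list AS s
   ON s.id_plant = t.id_plant AND s.id_phase = t.id_phase
WHEN MATCHED AND t.date_started <> s.date_started THEN
    UPDATE SET date_started = s.date_started,
               date_updated = now() AT TIME ZONE 'utc',
               updated_by   = _id_user
WHEN NOT MATCHED THEN
    INSERT (id_plant, id_phase, date_started, created_by, updated_by)
    VALUES (s.id_plant, s.id_phase, s.date_started, _id_user, _id_user);

-- Equivalent of WHEN NOT MATCHED BY SOURCE ... DELETE:
DELETE FROM grow.plant_phase t
WHERE t.id_plant = _id_plant
  AND NOT EXISTS (
      SELECT 1
      FROM json_to_recordset(_phase_list::json) AS p(id_phase int, date_started date)
      WHERE p.date_started IS NOT NULL
        AND p.id_phase = t.id_phase
  );
On older PostgreSQL versions without MERGE, the usual substitute is INSERT ... ON CONFLICT DO UPDATE for the insert/update part, plus the same separate DELETE.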

Set values on a new table from a historical table, before adding new table back into historical table

I would like to create a historical view of alerts in an application. To do this, I am grabbing all events and timestamping them, then uploading them into an MS SQL table. I would also like to be able to exempt certain objects from the total count by flagging either the finding (to exclude the finding across all systems) or the object in the finding (to exclude the object from all findings).
The idea is, I will have all previous alerts in the main table, then I will set an 'exemptobject' or 'exemptfinding' bit column in the row. When I re-run the script weekly, I will upload the results directly into a temporary table and then I would like to compare either the object or the finding for each object in the temporary table to the main database's 'object' or 'finding' and set the respective 'exemptobject' or 'exemptfinding' bit. Once all the temporary table's objects have any exemption bits set, insert the temporary table into the main table and drop the temporary table to keep a historical record.
This will give me duplicate findings and objects, so I am having difficulty with the merge command:
BEGIN TRANSACTION
MERGE INTO [dbo].[temp_table]
USING [dbo].[historical]
ON [dbo].[temp_table].[object] = [dbo].[historical].[object] OR
[dbo].[temp_table].[finding] = [dbo].[historical].[finding]
WHEN MATCHED THEN
UPDATE
SET [exemptfinding] = [dbo].[historical].[exemptfinding]
,[exemptobject] = [dbo].[historical].[exemptobject]
,[exemptdate] = [dbo].[historical].[exemptdate]
,[comments] = [dbo].[historical].[comments];
COMMIT
This seems to do what I want, but I see that the results are going to grow exponentially and I think it won't be sustainable for long.
BEGIN TRANSACTION
UPDATE temp
SET temp.[exemptfinding] = historical.[exemptfinding]
,temp.[exemptobject] = historical.[exemptobject]
,temp.[exemptdate] = historical.[exemptdate]
,temp.[comments] = historical.[comments]
FROM [dbo].[temp] temp
INNER JOIN [dbo].[historical] historical
ON (
[temp].[finding] = [historical].[finding] OR
[temp].[object] = [historical].[object]
) AND
(
[historical].[exemptfinding] = 1 OR
[historical].[exemptobject] = 1
)
COMMIT
I feel like I need to normalize the database, but I can't think of a way to separate things out and be able to:
See a count of each finding based on date the script was run
Be able to drill down into each day and see all the findings, objects and recommendations for each
Control the count shown for each finding by removing 'exempted' findings OR objects.
I feel like there's something obvious I'm missing or I'm thinking about this incorrectly. Any help would be greatly appreciated!
EDIT - The following seems to do what I want, but as soon as I add an additional WHERE condition to the final result, the query time goes from 7 seconds to 90 seconds, so I fear it will not scale.
BEGIN TRANSACTION
UPDATE [dbo].[temp]
SET [dbo].[temp].[exemptrecommendation] = [historical].[exemptrecommendation]
,[dbo].[temp].[exemptfinding] = [historical].[exemptfinding]
,[dbo].[temp].[exemptobject] = [historical].[exemptobject]
,[dbo].[temp].[exemptdate] = [historical].[exemptdate]
,[dbo].[temp].[comments] = [historical].[comments]
FROM (
SELECT *
FROM historical h
WHERE EXISTS (
SELECT id
,recommendation
FROM temp t
WHERE (
t.id = h.id OR
t.recommendation = h.recommendation
)
)
) historical
WHERE [dbo].[temp].[recommendation] = [historical].[recommendation] OR
[dbo].[temp].[id] = [historical].[id]
COMMIT
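One thing that often helps with this kind of join (a sketch of a common workaround, not a tested fix for your data): an OR across two different columns usually defeats index seeks, so splitting the work into two passes, one matching on finding and one on object, lets each pass use its own index:
BEGIN TRANSACTION
-- Pass 1: carry exemptions over by matching finding.
UPDATE t
SET t.[exemptfinding] = h.[exemptfinding]
   ,t.[exemptdate]    = h.[exemptdate]
   ,t.[comments]      = h.[comments]
FROM [dbo].[temp] t
INNER JOIN [dbo].[historical] h ON h.[finding] = t.[finding]
WHERE h.[exemptfinding] = 1;

-- Pass 2: carry exemptions over by matching object.
UPDATE t
SET t.[exemptobject] = h.[exemptobject]
   ,t.[exemptdate]   = h.[exemptdate]
   ,t.[comments]     = h.[comments]
FROM [dbo].[temp] t
INNER JOIN [dbo].[historical] h ON h.[object] = t.[object]
WHERE h.[exemptobject] = 1;
COMMIT
With indexes on [historical].[finding] and [historical].[object], each pass can seek instead of scanning, which tends to hold up better as the historical table grows.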

Update statement of Merge query is not working

I have two tables, Application and Application_temp. I wrote a MERGE query to merge data from Application_temp into Application.
But the UPDATE branch is not working: the query is inserting the whole set of records again instead of updating. I suspect it is because of the date columns. It would be great if anyone could help me with this.
I tried multiple sources from Google and nothing helped. I also tried changing various match conditions, but that didn't work.
CREATE PROCEDURE [dbo].[MergeApplication_tempToApplication]
AS
--CREATE UNIQUE INDEX cmdbid ON Application_Temp(cmdb_id) WITH (IGNORE_DUP_KEY = OFF)
MERGE INTO Application AS Target
USING Application_temp AS Source
ON (
--Target.CMDB_ID=Source.CMDB_ID
--AND Target.Application_Name=Source.Application_Name
Target.maas_application_id = source.maas_application_id
)
WHEN MATCHED
AND (
Target.CMDB_ID <> Source.CMDB_ID
OR Target.Application_Name <> Source.Application_Name
OR Target.Fss_Portfolio <> Source.Fss_Portfolio
OR Target.Managed_By <> Source.Managed_By
--Target.Date_Created <> Source.Date_Created OR
--Target.Date_Updated <> Source.Date_Updated
)
THEN
UPDATE
SET Target.cmdb_id = Source.cmdb_id
,Target.Application_Name = Source.Application_Name
,Target.Fss_Portfolio = Source.Fss_Portfolio
,Target.Managed_By = Source.Managed_By
,Target.Date_Created = Source.Date_Created
,Target.Date_Updated = Source.Date_Updated
WHEN NOT MATCHED
THEN
INSERT (
Application_Name
,cmdb_id
,Fss_Portfolio
,Managed_By
,Date_Created
,Date_Updated
)
VALUES (
source.Application_Name
,source.cmdb_id
,source.Fss_Portfolio
,source.Managed_By
,source.Date_Created
,source.Date_Updated
);
As I said, the UPDATE statement is not working. Can any SQL Server expert help me with this? I have been struggling with it for the past month.
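One thing worth ruling out first: a <> comparison returns UNKNOWN when either side is NULL, so rows whose old or new value is NULL never satisfy a WHEN MATCHED ... AND col <> col condition, and if the ON columns themselves are NULL (or typed differently in the two tables) the rows never match at all and get re-inserted. A NULL-safe way to express "something changed" is to compare the rows with EXCEPT inside EXISTS; here is a sketch of the same MERGE with only that change:
MERGE INTO Application AS Target
USING Application_temp AS Source
    ON Target.maas_application_id = Source.maas_application_id
WHEN MATCHED AND EXISTS (
        -- EXCEPT treats two NULLs as equal, so this is true only when
        -- at least one column value genuinely differs.
        SELECT Source.CMDB_ID, Source.Application_Name, Source.Fss_Portfolio,
               Source.Managed_By, Source.Date_Created, Source.Date_Updated
        EXCEPT
        SELECT Target.CMDB_ID, Target.Application_Name, Target.Fss_Portfolio,
               Target.Managed_By, Target.Date_Created, Target.Date_Updated
    )
    THEN UPDATE
         SET Target.cmdb_id          = Source.cmdb_id
            ,Target.Application_Name = Source.Application_Name
            ,Target.Fss_Portfolio    = Source.Fss_Portfolio
            ,Target.Managed_By       = Source.Managed_By
            ,Target.Date_Created     = Source.Date_Created
            ,Target.Date_Updated     = Source.Date_Updated
WHEN NOT MATCHED
    THEN INSERT (Application_Name, cmdb_id, Fss_Portfolio, Managed_By, Date_Created, Date_Updated)
         VALUES (Source.Application_Name, Source.cmdb_id, Source.Fss_Portfolio,
                 Source.Managed_By, Source.Date_Created, Source.Date_Updated);
If rows are still inserted instead of updated, check maas_application_id directly in both tables (matching values, same data type), because that is the only condition the WHEN NOT MATCHED branch depends on.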

Issue if I replace merge with delete/insert?

I have this legacy code… Merge:
MERGE [Salesforce_Lead] AS target
using (
SELECT lead_id,
salesforce_id,
createdbyid,
email,
updatedate
FROM Source.Leads ) AS source
ON (
target.lead_id = source.lead_id)
WHEN matched
AND Checksum(target.lead_id, target.salesforce_id) <> Checksum(source.lead_id,source.salesforce_id)
OR Checksum(target.lead_id, target.createdbyid) <> Checksum(source.lead_id,source.createdbyid)
OR Checksum(target.lead_id, target.email) <> Checksum(source.lead_id,source.email)
OR Checksum(target.lead_id, target.updatedate) <> Checksum(source.lead_id,source.updatedate)
THEN
UPDATE
SET target.salesforce_id = source.salesforce_id,
target.createdbyid = source.createdbyid,
target.email = source.email,
target.updatedate = source.updatedate
when not matched BY target THEN
INSERT
(
lead_id,
salesforce_id,
createdbyid,
email,
updatedate
)
VALUES
(
source.lead_id,
source.salesforce_id,
source.createdbyid,
source.email,
source.updatedate
);
And I want to replace it with the code below:
DELETE FROM [dbo].[Salesforce_Lead]
FROM [dbo].[Salesforce_Lead] AS L
INNER JOIN Source.Leads AS t
ON L.lead_id = t.lead_id;
INSERT INTO [dbo].[Salesforce_Lead]
SELECT *
FROM Source.Leads
Reasons:
-shorter code, easier to maintain.
-I thought MERGE was supposed to be used if you were deleting from source as well, or using the OUTPUT clause…
-There are not many updates; usually it's a plain insert.
Am I missing anything? The performance I would gain through the "when matched" branch is the only reason I should use MERGE, but as I said, most of the rows are inserts. Is there any issue if I replace merge with delete/insert?
The thing is that you can use a MERGE statement to implement both initial and incremental loading. If you switch to a plain insert, things can get awkward even for the simple case: do you really want to truncate the target table and reload everything every time you run an incremental load?
Another issue is that if you are trying to implement a type-2 SCD, plain INSERT and UPDATE statements become very complex and messy, whereas a MERGE statement handles it efficiently.
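One more thing to watch if you do go with delete/insert: MERGE is a single atomic statement, while the proposed replacement is two statements, so a failure between them (or a concurrent reader) can see the table with the matching rows already deleted but not yet re-inserted. A minimal sketch of wrapping both in one transaction, with an explicit column list instead of SELECT *:
BEGIN TRANSACTION;

DELETE L
FROM [dbo].[Salesforce_Lead] AS L
INNER JOIN Source.Leads AS t
    ON L.lead_id = t.lead_id;

INSERT INTO [dbo].[Salesforce_Lead] (lead_id, salesforce_id, createdbyid, email, updatedate)
SELECT lead_id, salesforce_id, createdbyid, email, updatedate
FROM Source.Leads;

COMMIT;
The explicit column list also keeps the load working if either table gains a column later, which SELECT * would silently break.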
