UPSERT in SSIS

I am writing an SSIS package to run on SQL Server 2008. How do you do an UPSERT in SSIS?
IF KEY NOT EXISTS
    INSERT
ELSE
    IF DATA CHANGED
        UPDATE
    ENDIF
ENDIF

See SQL Server 2008 - Using Merge From SSIS. I've implemented something like this, and it was very easy. Just using the BOL page Inserting, Updating, and Deleting Data using MERGE was enough to get me going.
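For reference, a minimal MERGE-based upsert along the lines of the question's pseudocode might look like this (the table and column names are made up for illustration):
MERGE dbo.TargetTable AS tgt
USING dbo.SourceTable AS src
    ON tgt.KeyCol = src.KeyCol
WHEN MATCHED AND tgt.DataCol <> src.DataCol THEN -- key exists and the data changed
    UPDATE SET tgt.DataCol = src.DataCol
WHEN NOT MATCHED BY TARGET THEN                  -- key not present
    INSERT (KeyCol, DataCol) VALUES (src.KeyCol, src.DataCol);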

Apart from T-SQL based solutions (and this is not even tagged as sql/tsql), you can use an SSIS Data Flow Task with a Merge Join as described here (and elsewhere).
The crucial part is the Full Outer Join in the Merge Join (if you only want to insert/update and not delete, a Left Outer Join works as well) of your sorted sources,
followed by a Conditional Split to decide what to do next: insert into the destination (which is also my source here), update it (via SQL Command), or delete from it (again via SQL Command), as sketched after the list below.
INSERT: if the gid is found only in the source (left)
UPDATE: if the gid exists in both the source and the destination
DELETE: if the gid is not found in the source but exists in the destination (right)
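As a rough sketch, the three Conditional Split conditions could be written in SSIS expression syntax as follows (the column names Source_gid and Dest_gid are assumptions about the Merge Join output, not from the original post):
Insert branch: ISNULL(Dest_gid)
Update branch: !ISNULL(Source_gid) && !ISNULL(Dest_gid)
Delete branch: ISNULL(Source_gid)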

I would suggest you have a look at Mat Stephen's weblog on SQL Server's upsert.
SQL 2005 - UPSERT: In nature but not by name; but at last!

Another way to create an upsert in sql (if you have pre-stage or stage tables):
--Insert Portion
INSERT INTO FinalTable
( Columns )
SELECT T.TempColumns
FROM TempTable T
WHERE
(
SELECT 'Bam'
FROM FinalTable F
WHERE F.Key(s) = T.Key(s)
) IS NULL
--Update Portion
UPDATE FinalTable
SET NonKeyColumn(s) = T.TempNonKeyColumn(s)
FROM TempTable T
WHERE FinalTable.Key(s) = T.Key(s)
AND CHECKSUM(FinalTable.NonKeyColumn(s)) <> CHECKSUM(T.NonKeyColumn(s))
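To make the schematic code above concrete, here is a minimal sketch with assumed names (a single key column Id and one non-key column Name; both are illustrative, not from the original post). Note that CHECKSUM can miss changes on rare hash collisions, so treat the comparison as an optimization:
-- Insert rows whose key is not yet in the final table
INSERT INTO dbo.FinalTable (Id, Name)
SELECT T.Id, T.Name
FROM dbo.TempTable T
WHERE NOT EXISTS (SELECT 1 FROM dbo.FinalTable F WHERE F.Id = T.Id);

-- Update rows whose non-key data has changed
UPDATE F
SET F.Name = T.Name
FROM dbo.FinalTable F
JOIN dbo.TempTable T ON F.Id = T.Id
WHERE CHECKSUM(F.Name) <> CHECKSUM(T.Name);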

The basic Data Manipulation Language (DML) commands that have been in use over the years are Update, Insert and Delete. They do exactly what you expect: Insert adds new records, Update modifies existing records and Delete removes records.
An UPSERT modifies existing records and, if a record is not present, inserts it as a new record.
The functionality of an UPSERT statement can be achieved with two set operators that were new to T-SQL at the time:
EXCEPT
INTERSECT
EXCEPT: returns any distinct values from the query to the left of the EXCEPT operand that are not also returned by the query on the right.
INTERSECT: returns any distinct values that are returned by both the query on the left and the query on the right of the INTERSECT operand.
Example: let's say we have two tables, Table_1 and Table_2.
Table_1, column Number (datatype int):
----------
1
2
3
4
5
Table_2, column Number (datatype int):
----------
1
2
5
SELECT * FROM TABLE_1 EXCEPT SELECT * FROM TABLE_2
will return 3 and 4, as they are present in Table_1 but not in Table_2.
SELECT * FROM TABLE_1 INTERSECT SELECT * FROM TABLE_2
will return 1, 2 and 5, as they are present in both Table_1 and Table_2.
All the pains of complex joins are now eliminated :-)
To use this functionality in SSIS, all you need to do is add an Execute SQL task and put the code in there.
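As a hedged sketch of how these operators can drive an upsert (the tables dbo.Staging and dbo.Final with columns Id and Val are assumptions for illustration): EXCEPT isolates the new-or-changed rows, the join restricts the update to keys that already exist, and NOT EXISTS finds the brand-new keys.
-- Update existing rows whose data differs (EXCEPT yields new-or-changed rows;
-- joining back on Id keeps only the ones that already exist)
UPDATE F
SET F.Val = S.Val
FROM dbo.Final F
JOIN (
    SELECT Id, Val FROM dbo.Staging
    EXCEPT
    SELECT Id, Val FROM dbo.Final
) S ON F.Id = S.Id;

-- Insert rows whose key is entirely new
INSERT INTO dbo.Final (Id, Val)
SELECT Id, Val
FROM dbo.Staging S
WHERE NOT EXISTS (SELECT 1 FROM dbo.Final F WHERE F.Id = S.Id);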

I usually prefer to let the SSIS engine manage the delta merge: only new items are inserted and changed items are updated.
If your destination server does not have enough resources to handle a heavy query, this method lets you use the resources of your SSIS server instead.

We can use the Slowly Changing Dimension component in SSIS to upsert.
https://learn.microsoft.com/en-us/sql/integration-services/data-flow/transformations/configure-outputs-using-the-slowly-changing-dimension-wizard?view=sql-server-ver15

I would use the Slowly Changing Dimension task.


snowflake merge statement using golden gate json as source table

While loading the target table in Snowflake, using JSON data as the source table:
merge into cust tgt using (
select parse_json(s.$1):application_num as application num
from prd_json s qualify
row_number() over(partition application
order_by application desc)=1) src
on tgt.application =src.application
when not matched and op_type='I' then
insert(application) values (src.application );
The QUALIFY clause removes the duplicate data and returns only unique records, but once I add the join it shows fewer records than the equivalent plain SELECT statement.
For example:
select distinct application
from prd_json where op_type='I';
--15000 rows are there
With the join it reports that there are no matching records in the target. If nothing matches, it should insert all 15000 rows, but only 8500 rows are inserted even though they are not duplicates. Is there a way to insert the records without using QUALIFY? If I leave QUALIFY out, I get a DML duplication error. Please guide me if anyone knows.
How about using SELECT DISTINCT?
Your demo SQL does not compile, and your use of $1 means it's also hard to guess the names of your columns and to know how the ROW_NUMBER is working.
So it's hard to nail down the problem.
But with the following SQL you can replace ROW_NUMBER with DISTINCT:
CREATE TABLE cust(application INT);

CREATE OR REPLACE TABLE prd_json AS
SELECT parse_json(column1) AS application, column2 AS op_type
FROM VALUES
    ('{"application_num":1,"other":1}', 'I'),
    ('{"application_num":1,"other":2}', 'I'),
    ('{"application_num":2,"other":3}', 'I'),
    ('{"application_num":1,"other":1}', 'U');
MERGE INTO cust AS tgt
USING (
    SELECT DISTINCT
        parse_json(s.$1):application_num::int AS application,
        s.op_type
    FROM prd_json AS s
) AS src
ON tgt.application = src.application
WHEN NOT MATCHED AND src.op_type = 'I' THEN
    INSERT (application) VALUES (src.application);
number of rows inserted
2
SELECT * FROM cust;
APPLICATION
1
2
running the MERGE code a second time gives:
number of rows inserted
0
Now if I truncate CUST and swap to using this SQL for the inner part:
SELECT --DISTINCT
    parse_json(s.$1):application_num::int AS application,
    s.op_type
FROM prd_json AS s
QUALIFY row_number() OVER (PARTITION BY application ORDER BY application DESC) = 1
I get three rows inserted, because the PARTITION BY application effectively binds to s.application rather than to the output application, and there are 3 different "applications" because of the other values.
The reason I wrote my code this way is that your
select distinct application
from prd_json where op_type='I';
implies there is already something called application in the table, and thus it runs the chance of being picked up by the ROW_NUMBER statement.
Anyway, there is one larger potential problem: if you also have update data (the 'U' rows, I assume) in your transaction block, you will want an ORDER BY in the sub-select so you never apply an Insert/Update pair of actions in Update/Insert order, and you should think about whether you want all update operations when there are many of them; I will stop there. But if you do not have updates, the sub-select should filter on op_type = 'I' to keep the non-insert operations out, or possibly worse again, from replacing the inserts in your ROW_NUMBER pattern, which I suspect is the underlying cause of your problem. A sketch follows.
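As a hedged sketch (reusing the hypothetical cust/prd_json tables from above), filtering to insert operations inside the sub-select would look like:
MERGE INTO cust AS tgt
USING (
    SELECT DISTINCT
        parse_json(s.$1):application_num::int AS application
    FROM prd_json AS s
    WHERE s.op_type = 'I'   -- keep only the insert operations
) AS src
ON tgt.application = src.application
WHEN NOT MATCHED THEN
    INSERT (application) VALUES (src.application);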

TSQL Operators IN vs INNER JOIN

Using SQL Server 2014:
Is there any performance difference between the following statements?
DELETE FROM MyTable where PKID IN (SELECT PKID FROM #TmpTableVar)
AND
DELETE MyTable FROM MyTable INNER JOIN #TmpTableVar t ON MyTable.PKID = t.PKID
In your given example the execution plans will be the same (most probably).
But having same execution plans doesn't mean that they are the best execution plans you can possibly have for this statement.
The problem I see in both of your queries is the use of the table variable.
SQL Server always assumes that there is only 1 row in a table variable; only in SQL Server 2014 and later versions was this assumption changed to 100 rows.
So no matter how many rows you actually have in the table variable, SQL Server will always assume that fixed number of rows in #TmpTableVar.
You can change your code slightly to give SQL Server a better idea of how many rows there will be by replacing the table variable with a temporary table, and since PKID is a key column you can also create an index on that table, to give SQL Server the best chance of coming up with the best possible execution plan for this query.
SELECT PKID INTO #Temp
FROM #TmpTableVar
-- Create some index on the temp table here .....
DELETE FROM MyTable
WHERE EXISTS (SELECT 1
FROM #Temp t
WHERE MyTable.PKID = t.PKID)
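For the index placeholder above, a hedged example (assuming PKID is unique in the staged data; adjust if it is not) would be:
-- A unique clustered index gives the optimizer exact cardinality and ordering
CREATE UNIQUE CLUSTERED INDEX IX_Temp_PKID ON #Temp (PKID);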
Note:
The IN operator will work fine here since PKID is a primary key column in the table variable, but if you ever use IN (or especially NOT IN) on a nullable column, the results may surprise you: it goes all pear-shaped as soon as it encounters NULL values in the column it is checking, as the sketch below shows.
I personally prefer the EXISTS operator for such queries, but inner joins should also work just fine; avoid the IN operator if you can.
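A quick sketch of the NULL trap with NOT IN (hypothetical throwaway tables):
CREATE TABLE #a (id INT);        -- rows we want to filter
CREATE TABLE #b (id INT NULL);   -- filter list that contains a NULL
INSERT INTO #a VALUES (1), (2);
INSERT INTO #b VALUES (1), (NULL);

-- Returns no rows at all: id NOT IN (1, NULL) can never evaluate to TRUE
SELECT * FROM #a WHERE id NOT IN (SELECT id FROM #b);

-- NOT EXISTS behaves as expected and returns id = 2
SELECT * FROM #a WHERE NOT EXISTS (SELECT 1 FROM #b WHERE #b.id = #a.id);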

sql server linked server to oracle returns no data found when data exists

I have a linked server setup in SQL Server to hit an Oracle database. I have a query in SQL Server that joins on the Oracle table using dot notation. I am getting a “No Data Found” error from Oracle. On the Oracle side, I am hitting a table (not a view) and no stored procedure is involved.
First, when there is no data I should just get zero rows and not an error.
Second, there should actually be data in this case.
Third, I have only seen the ORA-01403 error in PL/SQL code; never in SQL.
This is the full error message:
OLE DB provider "OraOLEDB.Oracle" for linked server "OM_ORACLE" returned message "ORA-01403: no data found".
Msg 7346, Level 16, State 2, Line 1
Cannot get the data of the row from the OLE DB provider "OraOLEDB.Oracle" for linked server "OM_ORACLE".
Here are some more details, but it probably does not mean anything since you don’t have my tables and data.
This is the query with the problem:
select *
from eopf.Batch b join eopf.BatchFile bf
on b.BatchID = bf.BatchID
left outer join [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] du
on bf.ReferenceID = du.documentUploadID;
I can’t understand why I get a “no data found” error. The query below uses the same Oracle table and returns no data but I don’t get an error - I just get no rows returned.
select * from [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] where documentUploadID = -1
The query below returns data. I just removed one of the SQL Server tables from the join. But removing the batch table does not change the rows returned from batchFile (271 rows in both cases – all rows in batchFile have a batch entry). It should still be joining the same batchFile rows to the same Oracle rows.
select *
from eopf.BatchFile bf
left outer join [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] du
on bf.ReferenceID = du.documentUploadID;
And this query returns 5 rows. It should be the same 5 as from the original query. (I can't use this because I need data from the Batch and BatchFile tables.)
select *
from [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] du
where du.documentUploadId
in
(
select bf.ReferenceID
from eopf.Batch b join eopf.BatchFile bf
on b.BatchID = bf.BatchID);
Has anyone experienced this error?
Today I experienced the same problem with an inner join. As creating a table-valued function (suggested by codechurn), using a temporary table (suggested by user1935511), or changing the join types (suggested by cymorg) are not options for me, I would like to share my solution.
I used join hints to drive the query optimizer in the right direction, as the problem seems to arise from the nested loops join strategy used against the remote table. For me the HASH, MERGE and REMOTE join hints worked.
For you, REMOTE will not be an option because it can be used only for inner join operations, so using something like the following should work:
select *
from eopf.Batch b
join eopf.BatchFile bf
on b.BatchID = bf.BatchID
left outer merge join [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] du
on bf.ReferenceID = du.documentUploadID;
I've had the same problem.
Solution 1: load the data from the Oracle database into a temp table, then join to that temp table instead, as sketched below.
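A hedged sketch of that approach, reusing the names from the question (the exact OPENQUERY column list is an assumption):
-- Pull the Oracle rows into a local temp table first
SELECT *
INTO #du
FROM OPENQUERY(OM_ORACLE, 'SELECT documentUploadID FROM OM.DOCUMENT_UPLOAD');

-- Then join locally instead of against the linked server
SELECT *
FROM eopf.Batch b
JOIN eopf.BatchFile bf ON b.BatchID = bf.BatchID
LEFT OUTER JOIN #du du ON bf.ReferenceID = du.documentUploadID;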
From this post you can find out that the problem can come from using a LEFT JOIN.
I checked this against my problem, and after changing my query it was solved.
In my case I had a complex view made from a linked table, 3 views based on the linked table and a local table. I was using Inner Joins throughout and this problem manifested. Changing the joins to Left and Right Outer Joins (as appropriate) resolved the issue.
Another way to work around the problem is to pull the Oracle data back into a table-valued function. This will cause SQL Server to go out and retrieve all of the data from Oracle and put it into a resultant table variable. For all intents and purposes, the Oracle data is now "local" to SQL Server if you use the resultant table-valued function in a query.
I believe the original problem is that SQL Server is trying to optimize the execution of your compound query which includes the remote Oracle query results in-line. By using a Table Valued Function to wrap the Oracle call, SQL Server will optimize the compound query on the resultant table variable returned from the function and not the results from the remote query execution.
CREATE function [dbo].[documents]()
returns @results TABLE (
DOCUMENT_ID INT NOT NULL,
TITLE VARCHAR(6) NOT NULL,
LEGALNAME VARCHAR(50) NOT NULL,
AUTHOR_ID INT NOT NULL,
DOCUMENT_TYPE VARCHAR(1) NOT NULL,
LAST_UPDATE DATETIME
) AS
BEGIN
INSERT INTO @results
SELECT CAST(DOCUMENT_ID AS INT) AS DOCUMENT_ID, TITLE, LEGALNAME, CAST(AUTHOR_ID AS INT) AS AUTHOR_ID, DOCUMENT_TYPE, LAST_UPDATE
FROM OPENQUERY(ORACLE_SERVER,
'select DOCUMENT_ID, TITLE, LEGALNAME, AUTHOR_ID, DOCUMENT_TYPE, FUNDTYPE, LAST_UPDATE
from documents')
return
END
You can then use the table-valued function as if it were a table in your SQL queries:
SELECT * FROM DOCUMENTS()
I resolved it by avoiding the = operator. Try using this instead:
select * from [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] where documentUploadID < 0

How to force SQL Server to process CONTAINS clauses before WHERE clauses?

I have a SQL query that uses both standard WHERE clauses and full text index CONTAINS clauses. The query is built dynamically from code and includes a variable number of WHERE and CONTAINS clauses.
In order for the query to be fast, it is very important that the full text index be searched before the rest of the criteria are applied.
However, SQL Server chooses to process the WHERE clauses before the CONTAINS clauses and that causes tables scans and the query is very slow.
I'm able to rewrite this using two queries and a temporary table. When I do so, the query executes 10 times faster. But I don't want to do that in the code that creates the query because it is too complex.
Is there an a way to force SQL Server to process the CONTAINS before anything else? I can't force a plan (USE PLAN) because the query is built dynamically and varies a lot.
Note: I have the same problem on SQL Server 2005 and SQL Server 2008.
You can signal your intent to the optimiser like this
SELECT *
FROM
(
    SELECT *
    FROM ...
    WHERE CONTAINS(...)
) T1
WHERE
    (normal conditions)
However, SQL is declarative: you say what you want, not how to do it. So the optimiser may decide to ignore the nesting above.
You can force the derived table with CONTAINS to be materialised before the classic WHERE clause is applied. I won't guarantee performance.
SELECT *
FROM
(
    SELECT TOP 2000000000 *
    FROM ....
    WHERE CONTAINS(...)
    ORDER BY SomeID
) T1
WHERE
    (normal conditions)
Try doing it with 2 queries without temp tables:
SELECT *
FROM table
WHERE id IN (
    SELECT id
    FROM table
    WHERE contains_criteria
)
AND further_where_clauses
As I noted above, this is NOT as clean a way to "materialize" the derived table as the TOP clause that @gbn proposed, but a loop join hint forces an order of evaluation, and it has worked for me in the past (admittedly usually with two different tables involved). There are a couple of problems though:
The query is ugly
You still don't get any guarantee that the other WHERE parameters aren't evaluated until after the join (I'll be interested to see what you get)
Here it is though, given that you asked:
SELECT OriginalTable.XXX
FROM (
    SELECT XXX
    FROM OriginalTable
    WHERE CONTAINS(XXX)
) AS ContainsCheck
INNER LOOP JOIN OriginalTable
    ON ContainsCheck.PrimaryKeyColumns = OriginalTable.PrimaryKeyColumns
    AND OriginalTable.OtherWhereConditions = OtherValues

Merge query in SQL Server 2008

I have a scenario of loading data from a source table into a target table. If a row from the source is not present in the target, I need to insert it. If it is already present in the target table, I need to update the status of the existing row to 'expired' and insert the incoming data as a new row. I used a MERGE query to do this. I can insert if not exists, and I can update, but when I try to insert in the matched case it says INSERT is not allowed in the WHEN MATCHED clause.
Please help me. Thanks in advance.
If you want to perform multiple actions for a single row of source data, you need to duplicate that row somehow.
Something like the following (making up table names, etc):
;WITH Source AS (
    SELECT Col1, Col2, Col3, t.Dupl
    FROM SourceTable, (SELECT 0 UNION ALL SELECT 1) t(Dupl)
)
MERGE INTO Target t
USING Source s ON t.Col1 = s.Col1 AND s.Dupl = 0 /* Key columns here */
WHEN MATCHED THEN UPDATE SET Expired = 1
WHEN NOT MATCHED AND s.Dupl = 1 THEN INSERT (Col1, Col2, Col3) VALUES (s.Col1, s.Col2, s.Col3);
You always want the s.Dupl condition in the not matched branch, because otherwise source rows which don't match any target rows would be inserted twice.
From the example you posted as a comment, I'd change:
MERGE target AS tar USING source AS src ON src.id = tar.id
WHEN MATCHED THEN UPDATE SET D_VALID_TO = @nowdate - 1, C_IS_ACTIVE = 'N', D_LAST_UPDATED_DATE = @nowdate
WHEN NOT MATCHED THEN INSERT (col1, col2, col3) VALUES (tar.col1, tar.col2, tar.col3);
into:
;WITH SourceDupl AS (
    SELECT id, col1, col2, col3, t.Dupl
    FROM source, (SELECT 0 UNION ALL SELECT 1) t(Dupl)
)
MERGE target AS tar USING SourceDupl AS src ON src.id = tar.id AND Dupl = 0
WHEN MATCHED THEN UPDATE SET D_VALID_TO = @nowdate - 1, C_IS_ACTIVE = 'N', D_LAST_UPDATED_DATE = @nowdate
WHEN NOT MATCHED AND Dupl = 1 THEN INSERT (col1, col2, col3) VALUES (src.col1, src.col2, src.col3);
I've changed the values in the VALUES clause, since in a NOT MATCHED branch, the tar table doesn't have a row to select values from.
Check out one of these many links:
Using SQL Server 2008's MERGE Statement
MERGE on Technet
Introduction to MERGE statement
SQL Server 2008 MERGE
Without actually knowing what your database tables look like, we cannot be of more help - you need to read those articles and figure out for yourself how to apply this to your concrete situation.
