upsert in multiple large tables using ssis - sql-server

I have 40 tables, each with a different structure, in one DB on one server that is being updated by a data provider.
I want to create an SSIS package that would pull data from that data provider DB and insert, update, or delete (merge) data into the development, Test, UAT, and prod DBs.
The tables have 1m-3m rows and 20-30 columns each, and all the DBs are on the SQL Server platform on different servers.
The business requirement is to load the data every day at a particular time, and I have to use SSIS for this. I am new to SSIS and want your suggestions for a better design.

I don't know about SSIS.
There are packaged solutions to sync databases.
In general, with just T-SQL, it's a delete, an update, and an insert:

delete a
from TableA a
where not exists (select 1 from TableB b where b.PK = a.PK)

update a
set ...
from TableA a
join TableB b
on a.PK = b.PK

insert into TableA (columns)
select columns
from TableB b
where not exists (select 1 from TableA a where b.PK = a.PK)

It's a very broad question, so I can help you with pointers; follow them and ask questions when you get stuck. I'll describe the steps for one table. You will have to do the same in parallel for the others:
Create a source OLE DB connection and a destination OLE DB connection. These will be used to copy from the source to the staging tables, where the actual data warehouse sits.
Create a Data Flow Task that simply copies the source DB to the staging tables. You'll have to implement incremental loading logic; for instance, store the last source Id and load data from that Id onwards to the latest (see the sketch after these steps).
Once you have data in staging, create another Data Flow Task where you apply a Lookup transformation to insert and update data while loading the destination table.
Deletion won't work here, so you'll have to apply deletions in a separate step (preferably via an Execute SQL Task).
The steps above are guidelines. You'll have multiple sequence containers working in parallel, each containing the above DFTs and working on separate tables.
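A minimal sketch of that incremental logic and the deletion step, assuming a hypothetical watermark table dbo.LoadWatermark(TableName, LastId) and hypothetical source/staging/destination table names (none of these names come from the answer itself):

-- OLE DB Source query: pull only rows past the stored watermark
SELECT s.*
FROM dbo.SourceTable AS s
WHERE s.Id > (SELECT w.LastId
              FROM dbo.LoadWatermark AS w
              WHERE w.TableName = 'SourceTable');

-- Execute SQL Task: delete destination rows that no longer exist in staging
DELETE d
FROM dbo.DestinationTable AS d
WHERE NOT EXISTS (SELECT 1 FROM dbo.StagingTable AS st WHERE st.Id = d.Id);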

Related

SSIS Change Data Capture on Table with Child Relations

I am new to SSIS and am having a hard time finding information on this. My source is a SQL DB and my target is a SQL DB.
In my source I have a table with an FK reference to a different table. I want to extract the parent with the child from the source and load them into a single flat table in my target:
Select * From Table1 Left Join Table2 on Table1.Table2Id = Table2.Id
Within my Data Flow I have an OLE DB Source where I am using the Data Access Mode of SQL Command with the query above.
However, I am looking to use Change Data Capture so I can track incremental changes and not do wipe-and-loads. From my understanding, Change Data Capture tracks changes per table, so if I were to set it up on Table1, it would only track the changes to Table1. How would I be able to import Table1 and Table2 into the flat table in my target with Change Data Capture?
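That understanding is right: CDC is enabled per table, so feeding a joined load means enabling capture on both tables and reconciling the two change sets yourself. A minimal sketch of the enablement, using the real sys.sp_cdc_enable_db / sys.sp_cdc_enable_table procedures (parameter values are illustrative):

-- enable CDC at the database level, then per table
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'Table1', @role_name = NULL;
EXEC sys.sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'Table2', @role_name = NULL;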

Monitor SSIS dataflow with equivalent of SET STATISTICS

I'm trying to compare statistics between SQL Server T-SQL and SSIS.
Say I have the following script:
INSERT INTO [myDB].dbo.finalTable WITH (TABLOCK)
(id, description, value)
SELECT a.id, a.description, b.value
FROM [anotherDB].dbo.sourceA a
INNER JOIN [anotherDB].dbo.sourceB b ON a.id = b.id
So, it just joins a couple of tables from a separate database and writes some data to finalTable.
If I want to look at scans, reads, writes, CPU time, elapsed time and IO I can just use:
SET STATISTICS IO ON
SET STATISTICS TIME ON
Now, suppose I take an entirely different approach and create an SSIS package with a Data Flow Task.
Then I add the source ([anotherDB]) and destination ([myDB]) connections.
Then I just add the T-SQL above as the source query and map everything.
How do I monitor the same statistics?
Thanks

Joining tables from two Oracle databases in SAS

I am joining two tables together that are located in two separate oracle databases.
I am currently doing this in sas by creating two libname connections to each database and then simply using something like the below.
libname dbase_a oracle user= etc... ;
libname dbase_b oracle user= etc... ;
proc sql;
create table t1 as
select a.*, b.*
from dbase_a.table1 a inner join dbase_b.table2 b
on a.id = b.id;
quit;
However, the query is painfully slow. Can you suggest any better options to speed up such a query (short of creating a database link)?
Many thanks for looking at this.
If those two databases are on the same server and you are able to execute cross-database queries in Oracle, you could try using SQL pass-through:
proc sql;
connect to oracle (user= password= <...>);
create table t1 as
select * from connection to oracle (
select a.*, b.*
from dbase_a.schema_a.table1 a
inner join dbase_b.schema_b.table2 b
on a.id = b.id
);
disconnect from oracle;
quit;
I think that, in most cases, SAS attempts as much as possible to have the query executed on the database server, even if pass-through was not explicitly specified. However, when that query references tables that are on different servers, or different databases on a system that does not allow cross-database queries, or when the query contains SAS-specific functions that SAS is not able to translate into something valid on the DBMS, SAS will indeed resort to 'downloading' the complete tables and processing the query locally, which can evidently be painfully inefficient.
The select is for all columns from each table, and the inner join is on the id values only. Because the join criteria are evaluated against data coming from disparate sources, the baggage of all those columns could be a big factor in the timing, because even non-matching rows must be downloaded (by the libname engine, within the SQL execution context) during the ON evaluation.
One approach would be to:
Select only the id from each table
Find the intersection
Upload the intersection to each server (as a scratch table)
Utilize the intersection on each server as pass-through selection criteria within the final join in SAS
There are a couple of variations depending on the expected number of id matches, the number of distinct ids in each table, and whether you can characterize table-1 and table-2 as SMALL and BIG. For a large number of id matches that need to be transferred back to a server you will probably want to use some form of bulk copy. For a relatively small number of ids in the intersection you might get away with enumerating them directly in a SQL statement using the construct IN (). The size of a SQL statement could be limited by the database, the SAS/ACCESS to ORACLE engine, or the SAS macro system.
Consider a data scenario in which it has been determined the potential number of matching ids would be too large for a construct in (id-1,...id-n). In such a case the list of matching ids are dealt with in a tabular manner:
libname SOURCE1 ORACLE ....;
libname SOURCE2 ORACLE ....;
libname SCRATCH1 ORACLE ... must specify a scratch schema ...;
libname SCRATCH2 ORACLE ... must specify a scratch schema ...;
proc sql;
connect using SOURCE1 as PASS1;
connect using SOURCE2 as PASS2;
* compute intersection from only id data sent to SAS;
create table INTERSECTION as
(select id from connection to PASS1 (select id from table1))
intersect
(select id from connection to PASS2 (select id from table2))
;
* upload intersection to each server;
create table SCRATCH1.ids as select id from INTERSECTION;
create table SCRATCH2.ids as select id from INTERSECTION;
* compute inner join from only data that matches intersection;
create table INNERJOIN as select ONE.*, TWO.* from
(select * from connection to PASS1 (
select * from oracle-path-to-schema.table1
where id in (select id from oracle-path-to-scratch.ids)
)) as ONE
JOIN
(select * from connection to PASS2 (
select * from oracle-path-to-schema.table2
where id in (select id from oracle-path-to-scratch.ids)
)) as TWO
on ONE.id = TWO.id;
...
For the case where both table-1 and table-2 have very large numbers of ids that exceed the resource capacity of your SAS platform, you will also have to iterate the approach over ranges of id values. Techniques for determining the range criteria for each iteration are a tale for another day.

Find dependencies between tables in sql database

I have a SQL database with data. I have been asked to populate a fresh, identical database with all the required master data so that the application is up and running for a new customer.
First approach
Delete all the data from the database and run the application; surely I won't even be able to log in. Observe the errors, identify the tables which need master data (the User table at least, for sure), and insert data. Then, assuming I can access a module, it will still give me errors without some master data. Observe the errors, identify the tables which need master data, insert data.
But this doesn't seem practical.
Second approach
While keeping the data in the database, take one table at a time and, using queries or SQL Server Management Studio tools, find all its dependent tables. Keep the parent table data and delete the child table data. Do this for all tables. In a second round, consider the remaining parent tables. Some tables' data is inserted by the application; identify those and delete it. This way I can end up with the relevant master data. But I don't know how to approach this.
These are all just my thoughts. Surely there are many more approaches that are more precise and easier than these. I am confused about what to do. Please guide me. Thanks!
Here are a few queries you can use to figure out which table and column references which table and column...
select * from INFORMATION_SCHEMA.KEY_COLUMN_USAGE
select * from INFORMATION_SCHEMA.columns
select * from INFORMATION_SCHEMA.tables
select * from sys.foreign_keys
select * from sys.foreign_key_columns
select * from [sys].[objects] where [name] = 'your_tablename'
For more, open Object Explorer (View Menu) and expand:
Databases/System Databases/Master/Views/System Views.
Also, check out any database diagrams there might be in Object Explorer:
Databases/Your_db_name/Database Diagrams.
How big is the database?
No matter what, you have to produce proper documentation, so better to start with that.
List all the tables one by one and identify whether each is a master table.
Remember the difference between DELETE and TRUNCATE.
While documenting, the queries above will come in handy.
Save the queries and the document for future need.
Most importantly, there should not be any errors, even if some of the tables are empty.
To find foreign key dependencies between tables you can use:
SELECT FKT.name 'Parent table', CHT.name 'Child table' FROM sys.foreign_keys FK
JOIN sys.tables CHT ON FK.parent_object_id = CHT.object_id
JOIN sys.tables FKT ON FK.referenced_object_id = FKT.object_id
There are also ways to find dependencies on database views using the system views.
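For example, a sketch using the catalog view sys.sql_expression_dependencies to list the objects (views included) that reference a given table; 'your_tablename' is a placeholder as above:

SELECT OBJECT_SCHEMA_NAME(d.referencing_id) AS referencing_schema,
       OBJECT_NAME(d.referencing_id) AS referencing_object,
       d.referenced_entity_name
FROM sys.sql_expression_dependencies d
WHERE d.referenced_entity_name = 'your_tablename';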

UPSERT in SSIS

I am writing an SSIS package to run on SQL Server 2008. How do you do an UPSERT in SSIS?
IF KEY NOT EXISTS
    INSERT
ELSE
    IF DATA CHANGED
        UPDATE
    ENDIF
ENDIF
See SQL Server 2008 - Using Merge From SSIS. I've implemented something like this, and it was very easy. Just using the BOL page Inserting, Updating, and Deleting Data using MERGE was enough to get me going.
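For illustration, a minimal MERGE sketch in the spirit of those links, assuming hypothetical tables dbo.TargetTable and dbo.SourceTable that share a key column PK and a single payload column Col1 (none of these names come from the question):

-- update changed rows, insert missing rows, delete rows no longer in the source
MERGE dbo.TargetTable AS t
USING dbo.SourceTable AS s
    ON t.PK = s.PK
WHEN MATCHED AND t.Col1 <> s.Col1 THEN
    UPDATE SET Col1 = s.Col1
WHEN NOT MATCHED BY TARGET THEN
    INSERT (PK, Col1) VALUES (s.PK, s.Col1)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;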
Apart from T-SQL based solutions (and this is not even tagged as sql/tsql), you can use an SSIS Data Flow Task with a Merge Join as described here (and elsewhere).
The crucial part is the Full Outer Join in the Merge Join (if you only want to insert/update and not delete, a Left Outer Join works as well) of your sorted sources,
followed by a Conditional Split to decide what to do next: insert into the destination (which is also my source here), update it (via SQL Command), or delete from it (again via SQL Command).
INSERT: if the gid is found only in the source (left)
UPDATE: if the gid exists in both the source and the destination
DELETE: if the gid is not found in the source but exists in the destination (right)
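As a sketch, the Conditional Split conditions could look like this, where src_gid and dst_gid are hypothetical names for the gid columns coming from the source and destination sides of the join:

INSERT output: ISNULL(dst_gid)
UPDATE output: !ISNULL(src_gid) && !ISNULL(dst_gid)
DELETE output: ISNULL(src_gid)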
I would suggest you have a look at Mat Stephen's weblog post on SQL Server's upsert.
SQL 2005 - UPSERT: In nature but not by name; but at last!
Another way to create an upsert in sql (if you have pre-stage or stage tables):
--Insert Portion
INSERT INTO FinalTable
( Columns )
SELECT T.TempColumns
FROM TempTable T
WHERE
(
SELECT 'Bam'
FROM FinalTable F
WHERE F.Key(s) = T.Key(s)
) IS NULL
--Update Portion
UPDATE FinalTable
SET NonKeyColumn(s) = T.TempNonKeyColumn(s)
FROM TempTable T
WHERE FinalTable.Key(s) = T.Key(s)
AND CHECKSUM(FinalTable.NonKeyColumn(s)) <> CHECKSUM(T.NonKeyColumn(s))
The basic Data Manipulation Language (DML) commands that have been in use over the years are UPDATE, INSERT and DELETE. They do exactly what you expect: INSERT adds new records, UPDATE modifies existing records and DELETE removes records.
An UPSERT statement modifies existing records and, if a record is not present, INSERTs a new one.
The functionality of an UPSERT statement can be achieved with two (at the time, new) T-SQL set operators:
EXCEPT
INTERSECT
EXCEPT: returns any distinct values from the query to the left of the EXCEPT operand that are not also returned by the query on the right.
INTERSECT: returns any distinct values that are returned by both the query on the left and the query on the right of the INTERSECT operand.
Example: let's say we have two tables, Table_1 and Table_2, each with a single int column named Number.

Table_1 (Number int)
----------
1
2
3
4
5

Table_2 (Number int)
----------
1
2
5
SELECT * FROM TABLE_1 EXCEPT SELECT * FROM TABLE_2
will return 3, 4, as they are present in Table_1 but not in Table_2.
SELECT * FROM TABLE_1 INTERSECT SELECT * FROM TABLE_2
will return 1, 2, 5, as they are present in both Table_1 and Table_2.
All the pains of complex joins are now eliminated :-)
To use this functionality in SSIS, all you need to do is add an Execute SQL Task and put the code in there.
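For instance, a minimal sketch of the insert half against the Table_1/Table_2 example above, giving Table_2 the 3 and 4 it is missing:

INSERT INTO Table_2 (Number)
SELECT Number FROM Table_1
EXCEPT
SELECT Number FROM Table_2;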
I usually prefer to let the SSIS engine manage the delta merge: only new items are inserted and changed items are updated.
If your destination server does not have enough resources to manage a heavy query, this method lets you use the resources of your SSIS server instead.
We can use the Slowly Changing Dimension component in SSIS to upsert:
https://learn.microsoft.com/en-us/sql/integration-services/data-flow/transformations/configure-outputs-using-the-slowly-changing-dimension-wizard?view=sql-server-ver15
I would use the 'Slowly Changing Dimension' task.
