I'm after some advice. I'm using SSIS / SQL Server 2014. I have a nightly SSIS package that pulls data from non-SQL Server databases into a single table (the SQL table is truncated beforehand each time), and I then extract from this table to create a daily CSV file.
Going forward, I only want to extract to CSV on a daily basis the records that have changed, i.e. the deltas.
What is the best approach? I was thinking of using CDC in SSIS, but as I'm truncating the SQL table before each load, will this be the best method? Or will I need to have a master table in SQL with an initial load, then import into another table and just extract the rows that differ? For info, the table in SQL contains a primary key.
I just want to double check as CDC assumes the tables are all in SQL Server, whereas my data is coming from outside SQL Server first.
Thanks for any help.
The primary key on that table is your saving grace here. Obviously enough, the SQL Server database that you're pulling the disparate data into won't know from one table flush to the next which records have changed, but if you add two additional tables, and modify the existing table with an additional column, it should be able to figure it out by leveraging HASHBYTES.
For this example, I'll call the new table SentRows, but you can use a more meaningful name in practice. We'll call the new column in the old table HashValue.
Add the column HashValue to your table as a varbinary data type. NOT NULL as well.
Create your SentRows table with columns for all the columns in the main table's primary key, plus the HashValue column.
Create a RowsToSend table that's structurally identical to your main table, including the HashValue.
Modify your queries to create the HashValue by applying HASHBYTES to all of the non-key columns in the table. (This will be horribly tedious. Sorry about that. A sketch is shown after these steps.)
Send out your full data set.
Now move all of the key values and HashValues to the SentRows table. Truncate your main table.
On the next pull, compare the key values and HashValues from SentRows to the new data in the main table.
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
Pull out any changes you need to send to the RowsToSend table.
Send the changes from RowsToSend.
Move the key values and HashValues to your SentRows table. Update hashes for changed key values, insert new rows, and decide how you're going to handle deletes, if you have to deal with deletes.
Truncate the RowsToSend table to get ready for tomorrow.
If you'd like (and you'll thank yourself later if you do), add a datetime column to the SentRows table with a default of GETDATE(), which will tell you when the row was added.
And away you go. Nothing but deltas from now on.
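For illustration only (the table and column names below are hypothetical, not from the question), one way to populate HashValue after the nightly load is a simple UPDATE. CONCAT is used because it converts each value to a string and treats NULL as an empty string, so a single NULL column doesn't blank out the whole expression:

-- Hypothetical main table: dbo.MainTable(CustomerID int PK, FirstName, LastName, Balance, HashValue varbinary(32))
-- Any supported algorithm works; SHA2_256 is shown here.
UPDATE dbo.MainTable
SET HashValue = HASHBYTES('SHA2_256',
                    CONCAT(FirstName, '|', LastName, '|', Balance));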
Edit 2019-10-31:
Step by step (or TL;DR):
1) Flush and Fill MainTable.
2) Compare keys and hashes on MainTable to keys and hashes on SentRows to identify new/changed rows (see the sketch below).
3) Move new/changed rows to RowsToSend.
4) Send the rows that are in RowsToSend.
5) Move all the rows from RowsToSend to SentRows.
6) Truncate RowsToSend.
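A rough sketch of steps 2 and 3, assuming a single-column key called CustomerID (hypothetical); in practice the join would use every column of the primary key, and the column lists would be spelled out rather than using *:

-- New or changed rows only
INSERT INTO dbo.RowsToSend
SELECT m.*
FROM dbo.MainTable AS m
LEFT JOIN dbo.SentRows AS s
       ON s.CustomerID = m.CustomerID
WHERE s.CustomerID IS NULL            -- key not seen before: new row
   OR s.HashValue <> m.HashValue;     -- key seen, hash differs: updated row
-- Deleted rows would be the reverse: keys in SentRows with no match in MainTable.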
Related
I am altering the column datatype for a table with around 100 Million records using the below query:
ALTER TABLE dbo.TARGETTABLE
ALTER COLUMN XXX_DATE DATE
The column values are in the right date format as I inserted original date from a valid data source.
However, the query has been running for a long time, and even when I attempt to cancel it, the cancellation seems to take forever.
Can anyone explain what is happening behind the scenes in SQL Server when an ALTER TABLE statement is executed, and why it requires so many resources?
There are a lot of variables that will make these ALTER statements make multiple passes through your table and make heavy use of TempDB, and depending on the efficiency of TempDB it could be very slow.
Examples include whether or not the column you are changing is in an index (especially the clustered index, since non-clustered indexes carry the clustering key).
Instead of altering the table, here is one simple example you can try (a rough sketch follows below).
Suppose your table name is tblTarget1:
Create another table (tblTarget2) with the same structure.
Change the data type of the relevant column in tblTarget2.
Copy the data from tblTarget1 to tblTarget2 using an INSERT INTO ... SELECT query.
Drop the original table (tblTarget1).
Rename tblTarget2 to tblTarget1.
The main reason this is faster is that changing the data type in place involves a lot of data transfer and data page realignment.
For more information you can follow this link.
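A minimal sketch of those steps, reusing the column name from the question (XXX_DATE); the other column is hypothetical, and any indexes, constraints and permissions on tblTarget1 would need to be recreated on the new table:

-- tblTarget2 has the same structure as tblTarget1, but with the new data type already in place
CREATE TABLE dbo.tblTarget2 (
    Id       INT  NOT NULL,   -- hypothetical other column(s)
    XXX_DATE DATE NULL        -- already the new type
);

INSERT INTO dbo.tblTarget2 (Id, XXX_DATE)
SELECT Id, XXX_DATE
FROM dbo.tblTarget1;

DROP TABLE dbo.tblTarget1;
EXEC sp_rename 'dbo.tblTarget2', 'tblTarget1';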
Another approach to do this is the following:
Add new column to the table - [_date] date
Using a batched update you can transfer the values from the old column to the new one without blocking the table for other users.
Then in one transaction do the following:
update any values that were inserted or changed after the batch update finished
drop the old column
rename the new column
Note, if you have an index on this field you need to drop it before deleting the old column and create if after renaming the new one.
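A rough sketch of that approach against the table from the question (dbo.TARGETTABLE with column XXX_DATE); the batch size is arbitrary:

-- 1) Add the new column
ALTER TABLE dbo.TARGETTABLE ADD [_date] DATE NULL;
GO

-- 2) Copy values across in small batches so other users aren't blocked for long
WHILE 1 = 1
BEGIN
    UPDATE TOP (50000) dbo.TARGETTABLE
    SET [_date] = XXX_DATE
    WHERE [_date] IS NULL
      AND XXX_DATE IS NOT NULL;

    IF @@ROWCOUNT = 0 BREAK;
END;
GO

-- 3) In one transaction: catch up any stragglers, drop the old column, rename the new one
BEGIN TRAN;
    UPDATE dbo.TARGETTABLE
    SET [_date] = XXX_DATE
    WHERE [_date] IS NULL AND XXX_DATE IS NOT NULL;

    ALTER TABLE dbo.TARGETTABLE DROP COLUMN XXX_DATE;
    EXEC sp_rename 'dbo.TARGETTABLE._date', 'XXX_DATE', 'COLUMN';
COMMIT;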
I am working on creating a unique key to find the rows that have changed since the last refresh of the table. My approach is to take the PK of the table and also create an MD5 column for each row, and then, based on the PK and MD5, check whether any rows in the table have changed since last time.
What is the best method to create the MD5 in MS SQL based on the query itself, in a way that takes care of all the data types and NULL columns as well?
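Not an authoritative answer, but a common pattern is HASHBYTES('MD5', ...) over a CONCAT of the non-key columns, since CONCAT converts each value to a string and treats NULL as an empty string. The table and column names here are purely illustrative:

SELECT OrderID,                                    -- the PK
       HASHBYTES('MD5',
           CONCAT(CustomerName, '|', OrderDate, '|', Amount)) AS RowMD5
FROM dbo.Orders;
-- Note: before SQL Server 2016, HASHBYTES input is limited to 8000 bytes.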
I am new to SSIS and I hope someone can point me in the right direction!
I need to move data from one database to another. I have written a query that takes data from a number of tables (SOURCE). I then use a conditional split (Condition: Id = id) to a number of tables in the destination database. Here is my problem: I need another table populated, one which takes the 'id' values from the three tables and uses them as attributes in a fourth table, along with additional data from SOURCE.
I think I need to pass the id values to parameters, but there does not seem to be a way to do this when inserting to an ADO NET Destination.
The fourth table will have the inserted id values (auto-incremented) from table1, table2 and table3.
Am I going about this correctly or is there a better way?
Thanks in advance!
I know of no way to get the IDENTITY values of rows inserted in a Dataflow destination for use in the same Dataflow.
Probably the way to do what you want to do is to make a fourth branch in your dataflow inserting the columns that you have into the fourth table, and leaving the foreign keys (the ids from the other 3 tables) blank.
Then after the Dataflow, use an ExecuteSQL task to call a stored procedure that populates the missing columns in the fourth table by looking up their ids in the other three tables.
If your fourth table doesn't have the values you need to lookup the ids in the other three tables, then you can have the dataflow go to a staging table that does have those values, and populate the fourth table from the staging table while looking up the ids from the corresponding values.
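As a loose illustration of that last step (every name here is hypothetical), the stored procedure called from the Execute SQL Task could look up the ids roughly like this:

UPDATE f
SET f.Table1Id = t1.Id,
    f.Table2Id = t2.Id,
    f.Table3Id = t3.Id
FROM dbo.FourthTable AS f
JOIN dbo.Table1 AS t1 ON t1.BusinessKey = f.Table1Key
JOIN dbo.Table2 AS t2 ON t2.BusinessKey = f.Table2Key
JOIN dbo.Table3 AS t3 ON t3.BusinessKey = f.Table3Key
WHERE f.Table1Id IS NULL;   -- only rows whose foreign keys haven't been filled in yet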
I'm importing more than 600.000.000 rows from an old database/table that has no primary key set; this table is in a SQL Server 2005 database. I created a tool to import this data into a new database with a very different structure. The problem is that I want to be able to resume the process from where it stopped for any reason, like an error or network error. As this table doesn't have a primary key, I can't check whether a row was already imported or not. Does anyone know how to identify each row so I can check if it was already imported? This table has duplicate rows, and I already tried to compute the hash of all the columns, but it's not working due to the duplicated rows...
thanks!
I would bring the rows into a staging table if this is coming from another database -- one that has an identity set on it. Then you can identify the rows where all the other data is the same except for the id and remove the duplicates before trying to put it into your production table.
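A sketch of the de-duplication step, assuming a staging table dbo.Staging with an identity column Id and, purely for illustration, data columns ColA and ColB:

;WITH Numbered AS (
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY ColA, ColB ORDER BY Id) AS rn
    FROM dbo.Staging
)
DELETE FROM Numbered
WHERE rn > 1;   -- keep one row from each group of duplicates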
So: you are loading umpteen bazillion rows of data, the rows cannot be uniquely identified, the load can (and, apparently, will) be interrupted at any point at any time, and you want to be able to resume such an interrupted load from where you left off, despite the fact that for all practical purposes you cannot identify where you left off. Ok.
Loading into a table containing an additional identity column would work, assuming that however and whenever the data load is started, it always starts at the same item and loads items in the same order. Wildly inefficient, since you have to read through everything every time you launch.
Another clunky option would be to first break the data you are loading into manageably-sized chunks (perhaps 10,000,000 rows). Load them chunk by chunk, keeping track of which chunk you have loaded. Use a staging table, so that you know and can control when a chunk has been "fully processed". If/when interrupted, you only have to toss the chunk you were working on when interrupted, and resume work with that chunk.
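The bookkeeping for the chunked approach can be as simple as a table recording which chunks have been fully processed (names here are hypothetical):

CREATE TABLE dbo.LoadedChunks (
    ChunkNumber INT      NOT NULL PRIMARY KEY,
    LoadedAt    DATETIME NOT NULL DEFAULT (GETDATE())
);

-- After a chunk has been staged, validated and moved to the production table:
DECLARE @ChunkNumber INT = 42;   -- the chunk that just finished
INSERT INTO dbo.LoadedChunks (ChunkNumber) VALUES (@ChunkNumber);

-- On restart, resume from the first chunk number not present in dbo.LoadedChunks.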
With duplicate rows, even row_number() is going to get you nowhere, as the order can change between queries (due to the way MSSQL stores data). You need to either bring it into a landing table with an identity column or add a new identity column onto the existing table (alter table oldTbl add NewId int identity(1,1)).
You could use row_number(), and then back out the last n rows if they have more than the count in the new database for them, but it would be more straightforward to just use a landing table.
Option 1: duplicates can be dropped
Try to find a somewhat unique field combination (duplicates are allowed) and join on it together with a hash of the rest of the fields, which you store in the destination table.
Assume these tables (t_x is the source, t_y is the destination, which stores a hash of the remaining fields):
create table t_x(id int, name varchar(50), description varchar(100))
create table t_y(id int, name varchar(50), description varchar(100), hash varbinary(8000))

-- rows in the source that are not yet present (by id + hash) in the destination
select *
from t_x x
where not exists(select *
                 from t_y y
                 where x.id = y.id
                   and hashbytes('sha1', x.name + '~' + x.description) = y.hash)
-- note: with + concatenation, a NULL in any field makes the whole hash NULL,
-- so wrap the fields in isnull() if they can be NULL
The reason to try to include as many fields as possible in the join is to reduce the chance of hash collisions, which are a real concern on a dataset with 600.000.000 records.
Option 2: duplicates are important
If you really need the duplicate rows, you should add a unique id column to your big table. To achieve this in a performant way, you should do the following steps (a sketch follows below):
Alter the table and add a uniqueidentifier or int field
update the table with the newsequentialid() function or a row_number()
create an index on this field
add the id field to your destination table.
once all the data is moved over, the field can be dropped.
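A sketch of those steps on a hypothetical dbo.BigTable, using the row_number() variant:

ALTER TABLE dbo.BigTable ADD RowId INT NULL;
GO

WITH Numbered AS (
    SELECT RowId,
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM dbo.BigTable
)
UPDATE Numbered SET RowId = rn;
GO

CREATE INDEX IX_BigTable_RowId ON dbo.BigTable (RowId);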
I have a quick question for you SQL gurus. I have existing tables without a primary key column, and Identity is not set. Now I am trying to modify those tables by making an existing integer column the primary key and adding identity values for that column. My question is: should I first copy all the records from the table to a temp table before making those changes? Do I lose all the previous records if I run the T-SQL command to make the primary key and add the identity column on those tables? What approach should I take, such as:
1) Create temp table to copy all the records from the table to be modified
2) Load all the records to the temptable
3) Make changes on the table schema
4) Finally load the records from the temp table to the original table.
Or
are there better ways than this? I really appreciate your help.
Thanks
Tools>Options>Designers>Table and Database Designers
Uncheck "Prevent saving changes that require table re-creation"
[Edit] I've tried this with populated tables and I didn't lose data, but I don't really know much about this.
Hopefully you don't have too many records in the table. What happens if you use Management Studio to change an existing field to an identity is that it creates another table with the identity field set. It turns identity insert on and inserts the records from the original table, then turns identity insert off. Then it drops the old table and renames the table it just created. This can be quite a lengthy process if you have many records, in which case I would script this out and then do it in a job that runs during off hours, because the table will be completely locked while you do this.
Just do all of your changes in Management Studio and copy/paste the generated script into a file. DON'T SAVE CHANGES at this point. Look over and edit that script as necessary; it will probably do almost exactly what you are thinking (it will drop the original table and rename the temp one to the original's name), but it handles all constraints and FKs as well.
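For reference, the script SSMS generates follows roughly this pattern (table and column names here are illustrative, and the real script also recreates constraints, defaults and foreign keys):

CREATE TABLE dbo.Tmp_MyTable (
    MyId      INT         NOT NULL IDENTITY (1, 1),
    SomeValue VARCHAR(50) NULL
);
GO
SET IDENTITY_INSERT dbo.Tmp_MyTable ON;
GO
-- preserve the existing integer values while making the column an identity
INSERT INTO dbo.Tmp_MyTable (MyId, SomeValue)
SELECT MyId, SomeValue
FROM dbo.MyTable WITH (HOLDLOCK TABLOCKX);
GO
SET IDENTITY_INSERT dbo.Tmp_MyTable OFF;
GO
DROP TABLE dbo.MyTable;
GO
EXEC sp_rename 'dbo.Tmp_MyTable', 'MyTable';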
If your existing integer column is unique and suitable, there should be no problem converting it to a PK.
Another alternative, if you don't want to use the existing column: you can add a new PK column to the main table, populate and seed it, then run update statements to update all other tables with the new PK.
Whatever way you do it, make sure you do a back-up first!!
You can always add the IDENTITY column after you have finished copying your data around. You can also then reset the IDENTITY seed to the max integer + 1. That should solve your problems.
DBCC CHECKIDENT ('MyTable', RESEED, n)
Where n is the number you want the identity to start at.