Creating MD5 for all the rows in the table - sql-server

I am working on creating a unique key to find the rows that have changed since the last refresh of the table. My approach is to take the primary key of the table and also create an MD5 column for each row; based on the PK and the MD5, I can check whether any rows in the table have changed since last time.
What is the best method to create the MD5 in MS SQL based on the query itself, one that handles all the data types and NULL columns as well?
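A common pattern (a minimal sketch; the table and column names here are placeholders, not from the question) is HASHBYTES with CONCAT, which implicitly converts every argument to a string and treats NULL as an empty string:

SELECT
    Id,                                          -- the primary key
    HASHBYTES('MD5',
        CONCAT(Col1, '|', Col2, '|',             -- '|' delimiter so ('ab','c') hashes differently from ('a','bc')
               CONVERT(varchar(30), Col3, 126))  -- explicit, deterministic date/time format
    ) AS RowHash
FROM dbo.SourceTable;

One caveat: before SQL Server 2016, HASHBYTES input is limited to 8000 bytes, so very wide rows may need to be hashed in pieces.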

Related

SSIS - Insert records and update records with lookup transformation

I am new to SSIS and have been tasked with taking records from a source table and inserting new or updating existing records in the target. I estimate around 10-15 records per day at most.
I have investigated the Lookup Transformation object and this looks like it will do the job.
Both source and target table columns are identical. However, there is no unique key/value in target or source to perform the lookup on. The existing fields I have to work with and do the lookup on are Date or Load_Date...
How can I achieve only adding new records or updating existing records in the target without a key/ID field? Do I really need to add another ID column to compare source and target with? And if so, can someone tell me how to do this? The target table must not have a key/ID field, so if one were used, it would need to be dropped after any inserts/updates are done.
Could this be achieved using Load_Date? Currently all Load_Date values in the target table are NULL, so would it be as simple as matching on Load_Date: if Load_Date is already populated, do not load; if Load_Date is NULL, load the record?
Thanks.

SSIS only extract Delta changes

After some advice. I'm using SSIS/SQL Server 2014. I have a nightly SSIS package that pulls data from non-SQL Server databases into a single table (the SQL table is truncated beforehand each time), and I then extract from this table to create a daily CSV file.
Going forward, I only want to extract to CSV on a daily basis the records that have changed, i.e. the deltas.
What is the best approach? I was thinking of using CDC in SSIS, but as I'm truncating the SQL table before each load, would that be the best method? Or will I need a master table in SQL with an initial load, then import into another table and extract only where they differ? For info, the table in SQL contains a primary key.
I just want to double-check, as CDC assumes the tables are all in SQL Server, whereas my data comes from outside SQL Server first.
Thanks for any help.
The primary key on that table is your saving grace here. Obviously enough, the SQL Server database you're pulling the disparate data into won't know from one table flush to the next which records have changed, but if you add two additional tables and one extra column on the existing table, it can figure it out by leveraging HASHBYTES.
For this example, I'll call the new table SentRows, but you can use a more meaningful name in practice. We'll call the new column in the old table HashValue.
Add the column HashValue to your table as a varbinary data type. NOT NULL as well.
Create your SentRows table with columns for all the columns in the main table's primary key, plus the HashValue column.
Create a RowsToSend table that's structurally identical to your main table, including the HashValue.
Modify your queries to create the HashValue by applying HASHBYTES to all of the non-key columns in the table. (This will be horribly tedious. Sorry about that.)
Send out your full data set.
Now move all of the key values and HashValues to the SentRows table. Truncate your main table.
On the next pull, compare the key values and HashValues from SentRows to the new data in the main table (a query sketch follows the four cases below):
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
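A minimal sketch of that comparison, assuming a single-column primary key named Id (in practice, join on every column of the main table's primary key):

SELECT
    COALESCE(m.Id, s.Id) AS Id,
    CASE
        WHEN s.Id IS NULL               THEN 'New'
        WHEN m.Id IS NULL               THEN 'Deleted'
        WHEN m.HashValue <> s.HashValue THEN 'Updated'
        ELSE 'Unchanged'
    END AS RowStatus
FROM dbo.MainTable AS m                 -- today's freshly loaded data
FULL OUTER JOIN dbo.SentRows AS s       -- keys and hashes from the last send
    ON s.Id = m.Id;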
Pull out any changes you need to send to the RowsToSend table.
Send the changes from RowsToSend.
Move the key values and HashValues to your SentRows table. Update hashes for changed key values, insert new rows, and decide how you're going to handle deletes, if you have to deal with deletes.
Truncate the RowsToSend table to get ready for tomorrow.
If you'd like (and you'll thank yourself later if you do), add a column to the SentRows table with a default of GETDATE(), which will tell you when each row was added.
And away you go. Nothing but deltas from now on.
Edit 2019-10-31:
Step by step (or TL;DR):
1) Flush and Fill MainTable.
2) Compare keys and hashes on MainTable to keys and hashes on SentRows to identify new/changed rows.
3) Move new/changed rows to RowsToSend.
4) Send the rows that are in RowsToSend.
5) Move all the rows from RowsToSend to SentRows.
6) Truncate RowsToSend.

Behind the scene operations for ALTER COLUMN statement in SQL Server

I am altering the column data type for a table with around 100 million records using the query below:
ALTER TABLE dbo.TARGETTABLE
ALTER COLUMN XXX_DATE DATE
The column values are in the right date format, as I inserted the original dates from a valid data source.
However, the query has been running for a long time, and even when I attempt to cancel the query, it seems to take forever.
Can anyone explain what happens behind the scenes in SQL Server when an ALTER TABLE statement is executed, and why it requires such resources?
There are a lot of variables that can make these ALTER statements take multiple passes through your table and make heavy use of TempDB, and depending on the efficiency of TempDB it could be very slow. Examples include whether or not the column you are changing is in an index (especially the clustered index, since every nonclustered index carries the clustering key).
Instead of altering the table, here is a simple alternative you can try (sketched in T-SQL below).
Suppose your table is named tblTarget1:
Create another table (tblTarget2) with the same structure.
Change the data type of the column in tblTarget2.
Copy the data from tblTarget1 to tblTarget2 using an INSERT INTO ... SELECT query.
Drop the original table (tblTarget1).
Rename tblTarget2 to tblTarget1.
The main reason the ALTER is slow is that changing the data type requires a lot of data transfer and data page realignment.
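A minimal sketch of those steps (the real tblTarget1 will have more columns; the Id column and varchar source type here are assumptions):

-- Hypothetical original schema: tblTarget1(Id int, XXX_DATE varchar(30))
CREATE TABLE dbo.tblTarget2
(
    Id       int  NOT NULL,
    XXX_DATE date NULL          -- the new data type
);

INSERT INTO dbo.tblTarget2 (Id, XXX_DATE)
SELECT Id, CONVERT(date, XXX_DATE)
FROM dbo.tblTarget1;

DROP TABLE dbo.tblTarget1;

EXEC sp_rename 'dbo.tblTarget2', 'tblTarget1';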
Another approach is the following:
Add a new column to the table: [_date] date
Using batched updates, you can transfer the values from the old column to the new one without blocking the table for other users.
Then, in one transaction, do the following:
update any values inserted after the batch update finished
drop the old column
rename the new column
Note: if you have an index on this field, you need to drop it before deleting the old column and recreate it after renaming the new one (a sketch of the batch loop follows).
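A sketch of the batched copy, assuming the old column is XXX_DATE (a varchar) and the new one is [_date]; the batch size of 10,000 is arbitrary:

-- Copy values across in small batches so each transaction stays short
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) dbo.TARGETTABLE
    SET [_date] = CONVERT(date, XXX_DATE)
    WHERE [_date] IS NULL
      AND XXX_DATE IS NOT NULL;

    IF @@ROWCOUNT = 0 BREAK;    -- nothing left to copy
END;

-- Then, in one short transaction: catch stragglers, drop, rename
BEGIN TRANSACTION;
    UPDATE dbo.TARGETTABLE
    SET [_date] = CONVERT(date, XXX_DATE)
    WHERE [_date] IS NULL AND XXX_DATE IS NOT NULL;

    ALTER TABLE dbo.TARGETTABLE DROP COLUMN XXX_DATE;
    EXEC sp_rename 'dbo.TARGETTABLE._date', 'XXX_DATE', 'COLUMN';
COMMIT;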

Extract and load only new records

I have a simple SSIS data flow which extracts records from table A and loads them into table B. Table A and table B each have a unique key.
What is the best way to extract and load only new records?
A) If records are on a sequential unique key, check the max key in table B and then select records from table A greater than that value (sketched below).
B) After each row of iteration, save the unique key in some table/XML file, and use this value at the start of the task to select records from table A.
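A minimal sketch of option A (table and column names are placeholders):

INSERT INTO dbo.TableB (Id, Col1, Col2)
SELECT a.Id, a.Col1, a.Col2
FROM dbo.TableA AS a
WHERE a.Id > (SELECT ISNULL(MAX(b.Id), 0) FROM dbo.TableB AS b);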
I have found the LOOKUP-task solution, but I am still searching for other good solutions.

Making primary key and identity column after data has been loaded

I have a quick question for you SQL gurus. I have existing tables without a primary key column, and identity is not set. Now I am trying to modify those tables by making an existing integer column the primary key and adding identity values for that column. My question is: should I first copy all the records from the table to a temp table before making those changes? Do I lose all the previous records if I run the T-SQL command to create the primary key and add the identity column on those tables? Should I take an approach such as:
1) Create a temp table to copy all the records from the table to be modified
2) Load all the records into the temp table
3) Make the changes to the table schema
4) Finally, load the records from the temp table back into the original table
Or are there better ways than this? I really appreciate your help.
Thanks
Tools>Options>Designers>Table and Database Designers
Uncheck "Prevent saving changes that require table re-creation"
[Edit] I've tried this with populated tables and I didn't lose data, but I don't really know much about this.
Hopefully you don't have too many records in the table. What happens if you use Management Studio to change an existing field to an identity is that it creates another table with the identity field set, turns identity insert on, inserts the records from the original table, and then turns identity insert off. Then it drops the old table and renames the table it just created. This can be quite a lengthy process if you have many records, so in that case I would script it out and run it in a job during off hours, because the table will be completely locked while you do this.
Just do all of your changes in Management Studio and copy/paste the generated script into a file. DON'T SAVE CHANGES at this point. Look over and edit that script as necessary; it will probably do almost exactly what you are thinking (it will drop the original table and rename the temp one to the original's name), but it handles all constraints and FKs as well.
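A condensed sketch of the pattern that generated script follows (table and column names are hypothetical; the real script also recreates any constraints, indexes, and foreign keys):

CREATE TABLE dbo.Tmp_MyTable
(
    Id   int IDENTITY(1, 1) NOT NULL,    -- the column being converted
    Name varchar(50)        NULL
);

SET IDENTITY_INSERT dbo.Tmp_MyTable ON;   -- preserve the existing integer values

INSERT INTO dbo.Tmp_MyTable (Id, Name)
SELECT Id, Name FROM dbo.MyTable;

SET IDENTITY_INSERT dbo.Tmp_MyTable OFF;

DROP TABLE dbo.MyTable;
EXEC sp_rename 'dbo.Tmp_MyTable', 'MyTable';

ALTER TABLE dbo.MyTable
    ADD CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (Id);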
If your existing integer column is unique and suitable, there should be no problem converting it to a PK.
Another alternative, if you don't want to use the existing column: you can add a new PK column to the main table, populate it and seed it, then run UPDATE statements to bring all other tables in line with the new PK.
Whichever way you do it, make sure you do a backup first!
You can always add the IDENTITY column after you have finished copying your data around. You can then reseed the identity to the current maximum value, so that the next insert gets max + 1. That should solve your problems.
DBCC CHECKIDENT ('MyTable', RESEED, n)
Where n is the current maximum; the next row inserted will get n + 1.
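For example (a sketch; dbo.MyTable and its Id column are placeholders):

DECLARE @n int = (SELECT MAX(Id) FROM dbo.MyTable);
DBCC CHECKIDENT ('dbo.MyTable', RESEED, @n);   -- next insert gets @n + 1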
