Delete vs Rollback Strategy - ETL Load - sql-server

I am loading data to table in the following manner:
DECLARE #srcRc INT;
DECLARE #dstRc INT;
SET #srcRc = ( SELECT COUNT(*) FROM A )
INSERT INTO t
(Col1
,Col2
,Col3
)
SELECT A.Col1
,A.Col2
,B.Col3
FROM A
JOIN B
ON A.Id = B.Id;
SET #dstRc = ##ROWCOUNT
Now I am comparing the variables #srcRc and #dstRc. The ROWCOUNT must be the same. If it is not, the inserted rows need to be deleted.
Q1: What would be the best strategy to rollback the inserted rows?
I have couple of ideas:
1) Run the load in transaction and rollback if the rowcount does not match.
2) Add flag column (bit) to the destination table called toBeDeleted, run the load and if the rowcount does not match, update the toBeDeleted column with 1 value to flag it as candidate for deletion. Then delete in batch mode (while-loop). Or do not delete them, but always exclude deletion candidates from query when working with t table.
3) Before inserting the rows, compare the the rowcount first. If it does not match, don't start the load.
DECLARE #srcRc INT;
DECLARE #dstRc INT;
SET #srcRc = ( SELECT COUNT(1) FROM A );
SET #dstRc = ( SELECT COUNT(1) FROM A JOIN B ON A.Id = B.Id );
Q2: What would be better solution for higher amount of rows, let's say 10-100 mil.?
Q3: Or is there any better strategy for similar case?

OK, Assuming :
You need the roll back to work at some later date when the content of tables A and B may have changed
There may also be other rows in T which you don't want to delete as part of the rollback.
Then you MUST keep a list of the rows you inserted, as you are unable to reliably regenerate that list from A and B and you cant just delete everything from T
You could do this in two ways
Change your import, so that it first inserts the rows to an import table, keep the import table hanging around until you are sure you don't need it anymore.
Add an extra column to T [importId] into which you put a uniquely identifying value
Obviously the first strategy uses a lot more disc space. So the longer your keep the data and the more data there is, the better the extra column looks.
Another option, would be to generate the list of imported data separately and have your transaction sql be a bulk insert with all the data hard coded into the sql.
This works well for small lists, initial setup data and the like.
Edit:
from your comments it sounds like you don't want a roll back per-se. But the best way to apply business logic around the import process.
In this case your 3rd answer is the best. Don't do the import when you know the source data is incorrect.

Related

SSIS data flow - copy new data or update existing

I queried some data from table A(Source) based on certain condition and insert into temp table(Destination) before upsert into Crm.
If data already exist in Crm I dont want to query the data from table A and insert into temp table(I want this table to be empty) unless there is an update in that data or new data was created. So basically I want to query only new data or if there any modified data from table A which already existed in Crm. At the moment my data flow is like this.
clear temp table - delete sql statement
Query from source table A and insert into temp table.
From temp table insert into CRM using script component.
In source table A I have audit columns: createdOn and modifiedOn.
I found one way to do this. SSIS DataFlow - copy only changed and new records but no really clear on how to do so.
What is the best and simple way to achieve this.
The link you posted is basically saying to stage everything and use a MERGE to update your table (essentially an UPDATE/INSERT).
The only way I can really think of to make your process quicker (to a significant degree) by partially selecting from table A would be to add a "last updated" timestamp to table A and enforcing that it will always be up to date.
One way to do this is with a trigger; see here for an example.
You could then select based on that timestamp, perhaps keeping a record of the last timestamp used each time you run the SSIS package, and then adding a margin of safety to that.
Edit: I just saw that you already have a modifiedOn column, so you could use that as described above.
Examples:
There are a few different ways you could do it:
ONE
Include the modifiedOn column on in your final destination table.
You can then build a dynamic query for your data flow source in a SSIS string variable, something like:
"SELECT * FROM [table A] WHERE modifiedOn >= DATEADD(DAY, -1, '" + #[User::MaxModifiedOnDate] + "')"
#[User::MaxModifiedOnDate] (string variable) would come from an Execute SQL Task, where you would write the result of the following query to it:
SELECT FORMAT(CAST(MAX(modifiedOn) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DestinationTable
The DATEADD part, as well as the CAST to a certain degree, represent your margin of safety.
TWO
If this isn't an option, you could keep a data load history table that would tell you when you need to load from, e.g.:
CREATE TABLE DataLoadHistory
(
DataLoadID int PRIMARY KEY IDENTITY
, DataLoadStart datetime NOT NULL
, DataLoadEnd datetime
, Success bit NOT NULL
)
You would begin each data load with this (Execute SQL Task):
CREATE PROCEDURE BeginDataLoad
#DataLoadID int OUTPUT
AS
INSERT INTO DataLoadHistory
(
DataLoadStart
, Success
)
VALUES
(
GETDATE()
, 0
)
SELECT #DataLoadID = SCOPE_IDENTITY()
You would store the returned DataLoadID in a SSIS integer variable, and use it when the data load is complete as follows:
CREATE PROCEDURE DataLoadComplete
#DataLoadID int
AS
UPDATE DataLoadHistory
SET
DataLoadEnd = GETDATE()
, Success = 1
WHERE DataLoadID = #DataLoadID
When it comes to building your query for table A, you would do it the same way as before (with the dynamically generated SQL query), except MaxModifiedOnDate would come from the following query:
SELECT FORMAT(CAST(MAX(DataLoadStart) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DataLoadHistory WHERE Success = 1
So the DataLoadHistory table, rather than your destination table.
Note that this would fail on the first run, as there'd be no successful entries on the history table, so you'd need you insert a dummy record, or find some other way around it.
THREE
I've seen it done a lot where, say your data load is running every day, you would just stage the last 7 days, or something like that, some margin of safety that you're pretty sure will never be passed (because the process is being monitored for failures).
It's not my preferred option, but it is simple, and can work if you're confident in how well the process is being monitored.

How to copy large number of data from one table to another in same database?

I have two tables with same column structure in the same database: TableA and TableB.
TableA doesn't have any indexes, but TableB has a non-clustered unique index.
TableA has 290 Million rows of data that needs to be copied to TableB.
As they both have same structure, I've tried
INSERT INTO TableB
SELECT *
FROM TableA;
It was executing for hours and produced a huge log file that filled the disk. As a result the disk ran out of space and the query was killed.
I can shrink the log file. How can I copy these many rows of data to another table efficiently?
First of all, disable the index on TableB before inserting the rows. You can do it using T-SQL:
ALTER INDEX IX_Index_Name ON dbo.TableB DISABLE;
Make sure to disable all the constraints (foreign keys, check constraints, unique indexes) on your destination table.
Re-enable (and rebuild) them after the load is complete.
Now, there's a couple of approaches to solve the problem:
You have to be OK with a slight chance of data loss: use the INSERT INTO ... SELECT ... FROM ... syntax you have but switch your database to Bulk-logged recovery mode first (read before switching). Won't help if you're already in Bulk-logged or Simple.
With exporting the data first: you can use the BCP utility to export/import the data. It supports loading data in batches. Read more about using the BCP utility here.
Fancy, with exporting the data first: With SQL 2012+ you can try exporting the data into binary file (using the BCP utility) and load it by using the BULK INSERT statement, setting ROWS_PER_BATCH option.
Old-school "I don't give a damn" method: to prevent the log from filling up you will need to perform the
inserts in batches of rows, not everything at once. If your database
is running in Full recovery mode you will need to keep log backups
running, maybe even trying to increase the frequency of the job.
To batch-load your rows you will need a WHILE (don't use them in
day-to-day stuff, just for batch loads), something like the
following will work if you have an identifier in the dbo.TableA
table:
DECLARE #RowsToLoad BIGINT;
DECLARE #RowsPerBatch INT = 5000;
DECLARE #LeftBoundary BIGINT = 0;
DECLARE #RightBoundary BIGINT = #RowsPerBatch;
SELECT #RowsToLoad = MAX(IdentifierColumn) dbo.FROM TableA
WHILE #LeftBoundary < #RowsToLoad
BEGIN
INSERT INTO TableB (Column1, Column2)
SELECT
tA.Column1,
tB.Column2
FROM
dbo.TableA as tA
WHERE
tA.IdentifierColumn > #LeftBoundary
AND tA.IdentifierColumn <= #RightBoundary
SET #LeftBoundary = #LeftBoundary + #RowsPerBatch;
SET #RightBoundary = #RightBoundary + #RowsPerBatch;
END
For this to work effectively you really want to consider creating an
index on dbo.TableA (IdentifierColumn) just for the time you're
running the load.

Updating two columns in a table containing millions of rows

I am updating 2 columns in a table that contains millions (85 million) of rows. Now to update these I am using a update command like,
UPDATE Table1
SET Table1.column1 = Table2.column1 ,
Table1.column2 = Table2.column2
FROM
Tables and with a Join-conditions;
Now my problem is, it is taking 23 hours for that. Even after using the batch size there is not much change in the time taken.
But I need to update it in less than 5 hours. Is that possible. What approach should I take to achieve it ?
SQL Update statements have to keep all the rows in the log file so it can roll-back on failure. As explained by this guy, the best way to handle millions of rows is to forget about atomicity and batch your updates into 50,000 rows (or whatever):
--Declare variable for row count
Declare #rc int
Set #rc=50000
While #rc=50000
Begin
Begin Transaction
--Use Top (50000) to limit number of updates
--performed in each batch to 50K rows.
--Use tablockx and holdlock to obtain and hold
--an immediate exclusive table lock. This unusually
--speeds the update because only one lock is needed.
Update Top (50000) MyTable With (tablockx, holdlock)
Set UpdFlag = 0
From MyTable mt
Join ControlTable ct
On mt.KeyCol=ct.PK
--Add criteria to avoid updating rows that
--were updated in previous pass
Where m.UpdFlag <> 0
--Get number of rows updated
--Process will continue until less than 50000
Select #rc=##rowcount
--Commit the transaction
Commit
End
This still has some problems in that you need to know which rows you've already handled, perhaps someone smarter than this guy (and me!) can figure something nicer with more MSSQL magic; but this should be a start.
I have used SSIS for doing this task.
First I have taken the source table in which I have to update the 2-columns. Then I have taken Look-Up task in which I have to mapped source columns to the destination table columns from which I have to get the data to update source table columns. Finally added OLEDB destination from where I'll fill the table basing on the joining conditions from the look-up.
This process was really fast than executing an update script.

T-SQL Insert or update

I have a question regarding performance of SQL Server.
Suppose I have a table persons with the following columns: id, name, surname.
Now, I want to insert a new row in this table. The rule is the following:
If id is not present in the table, then insert the row.
If id is present, then update.
I have two solutions here:
First:
update persons
set id=#p_id, name=#p_name, surname=#p_surname
where id=#p_id
if ##ROWCOUNT = 0
insert into persons(id, name, surname)
values (#p_id, #p_name, #p_surname)
Second:
if exists (select id from persons where id = #p_id)
update persons
set id=#p_id, name=#p_name, surname=#p_surname
where id=#p_id
else
insert into persons(id, name, surname)
values (#p_id, #p_name, #p_surname)
What is a better approach? It seems like in the second choice, to update a row, it has to be searched two times, whereas in the first option - just once. Are there any other solutions to the problem? I am using MS SQL 2000.
Both work fine, but I usually use option 2 (pre-mssql 2008) since it reads a bit more clearly. I wouldn't stress about the performance here either...If it becomes an issue, you can use NOLOCK in the exists clause. Though before you start using NOLOCK everywhere, make sure you've covered all your bases (indexes and big picture architecture stuff). If you know you will be updating every item more than once, then it might pay to consider option 1.
Option 3 is to not use destructive updates. It takes more work, but basically you insert a new row every time the data changes (never update or delete from the table) and have a view that selects all the most recent rows. It's useful if you want the table to contain a history of all its previous states, but it can also be overkill.
Option 1 seems good. However, if you're on SQL Server 2008, you could also use MERGE, which may perform good for such UPSERT tasks.
Note that you may want to use an explicit transaction and the XACT_ABORT option for such tasks, so that the transaction consistency remains in the case of a problem or concurrent change.
I tend to use option 1. If there is record in a table, you save one search. If there isn't, you don't loose anything. Moreover, in the second option you may run into funny locking and deadlocking issues related to locks incompatibility.
There's some more info on my blog:
http://sqlblogcasts.com/blogs/piotr_rodak/archive/2010/01/04/updlock-holdlock-and-deadlocks.aspx
You could just use ##RowCount to see if the update did anything. Something like:
UPDATE MyTable
SET SomeData = 'Some Data' WHERE ID = 1
IF ##ROWCOUNT = 0
BEGIN
INSERT MyTable
SELECT 1, 'Some Data'
END
Aiming to be a little more DRY, I avoid writing out the values list twice.
begin tran
insert into persons (id)
select #p_id from persons
where not exists (select * from persons where id = #p_id)
update persons
set name=#p_name, surname=#p_surname
where id = #p_id
commit
Columns name and surname have to be nullable.
The transaction means no other user will ever see the "blank" record.
Edit: cleanup

Delete large amount of data in sql server

Suppose that I have a table with 10000000 record. What is difference between this two solution?
delete data like :
DELETE FROM MyTable
delete all of data with a application row by row :
DELETE FROM MyTable WHERE ID = #SelectedID
Is the first solution has best performance?
what is the impact on log and performance?
If you need to restrict to what rows you need to delete and not do a complete delete, or you can't use TRUNCATE TABLE (e.g. the table is referenced by a FK constraint, or included in an indexed view), then you can do the delete in chunks:
DECLARE #RowsDeleted INTEGER
SET #RowsDeleted = 1
WHILE (#RowsDeleted > 0)
BEGIN
-- delete 10,000 rows a time
DELETE TOP (10000) FROM MyTable [WHERE .....] -- WHERE is optional
SET #RowsDeleted = ##ROWCOUNT
END
Generally, TRUNCATE is the best way and I'd use that if possible. But it cannot be used in all scenarios. Also, note that TRUNCATE will reset the IDENTITY value for the table if there is one.
If you are using SQL 2000 or earlier, the TOP condition is not available, so you can use SET ROWCOUNT instead.
DECLARE #RowsDeleted INTEGER
SET #RowsDeleted = 1
SET ROWCOUNT 10000 -- delete 10,000 rows a time
WHILE (#RowsDeleted > 0)
BEGIN
DELETE FROM MyTable [WHERE .....] -- WHERE is optional
SET #RowsDeleted = ##ROWCOUNT
END
If you have that many records in your table and you want to delete them all, you should consider truncate <table> instead of delete from <table>. It will be much faster, but be aware that it cannot activate a trigger.
See for more details (this case sql server 2000):
http://msdn.microsoft.com/en-us/library/aa260621%28SQL.80%29.aspx
Deleting the table within the application row by row will end up in long long time, as your dbms can not optimize anything, as it doesn't know in advance, that you are going to delete everything.
The first has clearly better performance.
When you specify DELETE [MyTable] it will simply erase everything without doing checks for ID. The second one will waste time and disk operation to locate a respective record each time before deleting it.
It also gets worse because every time a record disappears from the middle of the table, the engine may want to condense data on disk, thus wasting time and work again.
Maybe a better idea would be to delete data based on clustered index columns in descending order. Then the table will basically be truncated from the end at every delete operation.
Option 1 will create a very large transaction and have a big impact on the log / performance, as well as escalating locks so that the table will be unavailable.
Option 2 will be slower, although it will generate less impact on the log (assuming bulk / full mode)
If you want to get rid of all the data, Truncate Table MyTable would be faster than both, although it has no facility to filter rows, it does a meta data change at the back and basically drops the IAM on the floor for the table in question.
The best performance for clearing a table would bring TRUNCATE TABLE MyTable. See http://msdn.microsoft.com/en-us/library/ms177570.aspx for more verbose explaination
Found this post on Microsoft TechNet.
Basically, it recommends:
By using SELECT INTO, copy the data that you want to KEEP to an intermediate table;
Truncate the source table;
Copy back with INSERT INTO from intermediate table, the data to the source table;
..
BEGIN TRANSACTION
SELECT *
INTO dbo.bigtable_intermediate
FROM dbo.bigtable
WHERE Id % 2 = 0;
TRUNCATE TABLE dbo.bigtable;
SET IDENTITY_INSERT dbo.bigTable ON;
INSERT INTO dbo.bigtable WITH (TABLOCK) (Id, c1, c2, c3)
SELECT Id, c1, c2, c3 FROM dbo.bigtable_intermediate ORDER BY Id;
SET IDENTITY_INSERT dbo.bigtable OFF;
ROLLBACK TRANSACTION
The first will delete all the data from the table and will have better performance that your second who will delete only data from a specific key.
Now if you have to delete all the data from the table and you don't rely on using rollback think of the use a truncate table

Resources