Speeding Up SSIS Package (Insert and Update) - sql-server

Referred here by #sqlhelp on Twitter (Solved - See the solution at the end of the post).
I'm trying to speed up an SSIS package that inserts 29 million rows of new data, then updates those rows with 2 additional columns. So far the package loops through a folder containing files, inserts the flat files into the database, then performs the update and archives the file. Added (thanks to @billinkc): the SSIS order is Foreach Loop, Data Flow, Execute SQL Task, File Task.
What doesn't take long: The loop, the file move and truncating the tables (stage).
What takes long: inserting the data and running the statements below:
UPDATE dbo.Stage
SET Number = REPLACE(Number,',','')
and this script in the Execute SQL Task:
-- Creates temp table for State and Date
CREATE TABLE #Ref (Path VARCHAR(255))
INSERT INTO #Ref VALUES(?)
-- Variables for insert
DECLARE @state AS VARCHAR(2)
DECLARE @date AS VARCHAR(12)
SET @state = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),12,2) FROM #Ref)
SET @date = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),1,10) FROM #Ref)
SELECT @state
SELECT @date
-- Inserts the values into main table
INSERT INTO dbo.MainTable (Phone,State,Date)
SELECT d.Number, #state, #date
FROM Stage d
-- Clears the Reference and Stage table
DROP TABLE #Ref
TRUNCATE TABLE Stage
Note that I've toyed with upping Rows per batch on the insert and Max insert commit size, but neither has affected the package speed.
Solved and Added:
For those interested in the numbers: the OP package time was 11.75 minutes; with William's technique (see the answer below) it dropped to 9.5 minutes. Granted, with 29 million rows on a slower server this is to be expected, but hopefully that shows you the actual data behind how effective this is. The key is to keep as much of the processing as possible in the Data Flow task, because updating the data after the data flow consumed a significant portion of the time.
Hopefully that helps anyone else out there with a similar problem.
Update two: I added an IF statement and that reduced it from 9 minutes to 4 minutes. Final code for Execute SQL Task:
-- Creates temp table for State and Date
CREATE TABLE #Ref (Path VARCHAR(255))
INSERT INTO #Ref VALUES(?)
DECLARE @state AS VARCHAR(2)
DECLARE @date AS VARCHAR(12)
DECLARE @validdate datetime
SET @state = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),12,2) FROM #Ref)
SET @date = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),1,10) FROM #Ref)
SET @validdate = DATEADD(DD,-30,getdate())
IF @date < @validdate
BEGIN
TRUNCATE TABLE dbo.Stage
TRUNCATE TABLE #Ref
END
ELSE
BEGIN
-- Inserts new values
INSERT INTO dbo.MainTable (Number,State,Date)
SELECT d.Number, #state, #date
FROM Stage d
-- Clears the Reference and Stage table after the insert
DROP TABLE #Ref
TRUNCATE TABLE Stage
END

As I understand it, you are reading ~29,000,000 rows from flat files and writing them into a staging table, then running a SQL script that updates (reads/writes) the same 29,000,000 rows in the staging table, then moves those 29,000,000 records (read from staging, then write to NAT) to the final table.
Couldn't you read from your flat files, use SSIS transformations to clean your data and add your two additional columns, then write directly into the final table? You would then work on each distinct set of data only once, rather than the three (six if you count reads and writes as distinct) times that your process does.
I would change your data flow to perform the needed transformations in the flow and write directly into the final table.
edit
From the SQL in your question it appears you are transforming the data by removing commas from the PHONE field, retrieving the STATE and the Date from specific portions of the path of the file currently being processed, and then storing those three data points in the NAT table. Those things can be done with the Derived Column transformation in your Data Flow.
For the State and Date columns, set up two new variables called State and Date. Use expressions in the variable definitions to set them to the correct values (like you did in your SQL). When the Path variable updates (in your loop, I assume), the State and Date variables will update as well.
In the Derived Column Transformation, drag the State Variable into the Expression field and create a new column called State.
Repeat for Date.
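For reference, a rough sketch of what those variable expressions could look like, assuming the loop writes the full path of the current file into a string variable named User::Path and that State and Date have EvaluateAsExpression set to True (the variable name is an assumption):
State (mirrors the SUBSTRING(..., 12, 2) in your SQL):
SUBSTRING(RIGHT(@[User::Path], FINDSTRING(REVERSE(@[User::Path]), "\\", 1) - 1), 12, 2)
Date (mirrors the SUBSTRING(..., 1, 10) in your SQL):
SUBSTRING(RIGHT(@[User::Path], FINDSTRING(REVERSE(@[User::Path]), "\\", 1) - 1), 1, 10)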
For the PHONE column, in the Derived Column transformation create an expression like the following:
REPLACE( [Phone], ",", "" )
Set the Derived Column field to Replace 'Phone'
For your output, create a destination to your NAT table and link Phone, State, and Date columns in your data flow to the appropriate columns in the NAT table.
If there are additional columns in your input, you can choose not to bring them in from your source, since it appears that you are only acting on the Phone column from the original data.
/edit

Related

Update on key violation in Stored Procedure using BULK INSERT & Trigger

I have a stored procedure that performs a bulk insert of a large number of DNS log entries. I wish to summarise this raw data in a new table for analysis. The new table takes a given log entry for FQDN and Record Type and holds one record only with a hitcount.
Source table might include 100 rows of:
FQDN, Type
www.microsoft.com,A
Destination table would have:
FQDN, Type, HitCount
www.microsoft.com, A, 100
The SP establishes a unique ID made up of [FQDN] +'|'+ [Type], which is then used as the primary key in the destination table.
My plan was to have the SP fire a trigger that did an UPDATE...IF @@ROWCOUNT=0...INSERT. However, that of course failed because the trigger receives all the [inserted] rows as a single set so always throws a key violation error.
I'm having trouble getting my head around a solution and need some fresh eyes and better skills to take a look. The bulk insert SP works just fine and the raw data is exactly as desired. However trying to come up with a suitable method to create the summary data is beyond my present skills/mindset.
I have several tens of TB of data to process, so I don't see the summary as something we could do dynamically with a SELECT COUNT - which is why I started down the trigger route.
The relevant code in the SP is driven by a cursor consisting of a list of compressed log files needing to be decompressed and bulk-inserted, and is as follows:
-- Bulk insert to a view because bulk insert cannot populate the UID field
SET @strDynamicSQL = 'BULK INSERT [DNS_Raw_Logs].[dbo].[vwtblRawQueryLogData] FROM ''' + @strTarFolder + '\' + @strLogFileName + ''' WITH (FIRSTROW = 1, FIELDTERMINATOR = '' '', ROWTERMINATOR = ''0x0a'', ERRORFILE = ''' + @strTarFolder + '\' + @strErrorFile + ''', TABLOCK)'
--PRINT @strDynamicSQL
EXEC (@strDynamicSQL)
-- Update [UID] field after the bulk insert
UPDATE [DNS_Raw_Logs].[dbo].[tblRawQueryLogData]
SET [UID] = [FQDN] + '|' + [Type]
FROM [tblRawQueryLogData]
WHERE [UID] IS NULL
I know that the UPDATE...IF @@ROWCOUNT=0...INSERT solution is wrong because it assumes that the input data is a single row. I'd appreciate help on a way to do this.
Thank you
First, at that scale make sure you understand columnstore tables. They are very highly compressed and fast to scan.
Then write a query that reads from the raw table and returns the summarized data:
create or alter view DnsSummary
as
select FQDN, Type, count(*) HitCount
from tblRawQueryLogData
group by FQDN, Type
Then if querying that view directly is too expensive, write a stored procedure that loads a table after each bulk insert. Or make the view into an indexed view.
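If the indexed view route appeals, here is a minimal sketch (indexed views need SCHEMABINDING, two-part table names, and COUNT_BIG(*); this assumes the table lives in the dbo schema):
CREATE VIEW dbo.DnsSummary
WITH SCHEMABINDING
AS
SELECT FQDN, Type, COUNT_BIG(*) AS HitCount
FROM dbo.tblRawQueryLogData
GROUP BY FQDN, Type
GO
-- The unique clustered index is what materializes (and auto-maintains) the summary
CREATE UNIQUE CLUSTERED INDEX IX_DnsSummary ON dbo.DnsSummary (FQDN, Type)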
Thanks for the answer David, obvious when someone else looks at it!
I ran the view-based solution with 14M records (about 4 hours' worth) and it took 40 seconds to return, so I think I'll modify the SP to drop and re-create the summary table each time it runs the bulk insert.
The source table also includes a timestamp for each entry. I would like to grab the earliest and latest times associated with each UID and add that to the summary.
My current summary query (courtesy of David) looks like this:
SELECT [UID], [FQDN], [Type], COUNT([UID]) AS [HitCount]
FROM [DNS_Raw_Logs].[dbo].tblRawQueryLogData
GROUP BY [UID], [FQDN], [Type]
ORDER BY COUNT([UID]) DESC
And returns:
UID, FQDN, Type, HitCount
www.microsoft.com|A, www.microsoft.com, A, 100
If I wanted to grab the earliest and latest times then I think I'm looking at nesting 3 queries to grab the earliest time (SELECT TOP N...ORDER BY... ASC), the latest time (SELECT TOP N...ORDER BY... DESC) and the hitcount. Is there a more efficient way of doing this, before I try and wrap my head around this route?
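A sketch of the single-pass result I'm hoping is possible, with QueryTime standing in for the real name of the timestamp column:
SELECT [UID], [FQDN], [Type],
       COUNT([UID]) AS [HitCount],
       MIN([QueryTime]) AS [FirstSeen],
       MAX([QueryTime]) AS [LastSeen]
FROM [DNS_Raw_Logs].[dbo].tblRawQueryLogData
GROUP BY [UID], [FQDN], [Type]
ORDER BY COUNT([UID]) DESC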

SSIS data flow - copy new data or update existing

I query some data from table A (source) based on certain conditions and insert it into a temp table (destination) before upserting into CRM.
If the data already exists in CRM, I don't want to query it from table A and insert it into the temp table (I want this table to stay empty) unless that data has been modified or new data has been created. So basically I want to query only new data, or data from table A that already exists in CRM but has since been modified. At the moment my data flow is like this:
clear temp table - delete sql statement
Query from source table A and insert into temp table.
From temp table insert into CRM using script component.
In source table A I have audit columns: createdOn and modifiedOn.
I found one way to do this (SSIS DataFlow - copy only changed and new records) but I'm not really clear on how to do it.
What is the best and simplest way to achieve this?
The link you posted is basically saying to stage everything and use a MERGE to update your table (essentially an UPDATE/INSERT).
The only way I can really think of to make your process quicker (to a significant degree) by partially selecting from table A would be to add a "last updated" timestamp to table A and enforcing that it will always be up to date.
One way to do this is with a trigger; see here for an example.
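A rough sketch of such a trigger, assuming table A has an Id key and a LastUpdated column (both names are assumptions here):
CREATE TRIGGER trg_TableA_SetLastUpdated
ON dbo.TableA
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Stamp the rows touched by the update with the current time
    -- (a DEFAULT of GETDATE() on the column covers new rows on insert)
    UPDATE a
    SET LastUpdated = GETDATE()
    FROM dbo.TableA AS a
    INNER JOIN inserted AS i ON i.Id = a.Id;
END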
You could then select based on that timestamp, perhaps keeping a record of the last timestamp used each time you run the SSIS package, and then adding a margin of safety to that.
Edit: I just saw that you already have a modifiedOn column, so you could use that as described above.
Examples:
There are a few different ways you could do it:
ONE
Include the modifiedOn column in your final destination table.
You can then build a dynamic query for your data flow source in an SSIS string variable, something like:
"SELECT * FROM [table A] WHERE modifiedOn >= DATEADD(DAY, -1, '" + @[User::MaxModifiedOnDate] + "')"
@[User::MaxModifiedOnDate] (string variable) would come from an Execute SQL Task, where you would write the result of the following query to it:
SELECT FORMAT(CAST(MAX(modifiedOn) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DestinationTable
The DATEADD part, as well as the CAST to a certain degree, represent your margin of safety.
TWO
If this isn't an option, you could keep a data load history table that would tell you when you need to load from, e.g.:
CREATE TABLE DataLoadHistory
(
DataLoadID int PRIMARY KEY IDENTITY
, DataLoadStart datetime NOT NULL
, DataLoadEnd datetime
, Success bit NOT NULL
)
You would begin each data load with this (Execute SQL Task):
CREATE PROCEDURE BeginDataLoad
@DataLoadID int OUTPUT
AS
INSERT INTO DataLoadHistory
(
DataLoadStart
, Success
)
VALUES
(
GETDATE()
, 0
)
SELECT @DataLoadID = SCOPE_IDENTITY()
You would store the returned DataLoadID in an SSIS integer variable, and use it when the data load is complete as follows:
CREATE PROCEDURE DataLoadComplete
@DataLoadID int
AS
UPDATE DataLoadHistory
SET
DataLoadEnd = GETDATE()
, Success = 1
WHERE DataLoadID = @DataLoadID
When it comes to building your query for table A, you would do it the same way as before (with the dynamically generated SQL query), except MaxModifiedOnDate would come from the following query:
SELECT FORMAT(CAST(MAX(DataLoadStart) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DataLoadHistory WHERE Success = 1
So the DataLoadHistory table, rather than your destination table.
Note that this would fail on the first run, as there'd be no successful entries in the history table, so you'd need to insert a dummy record, or find some other way around it.
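A simple workaround is to seed the history table once with a dummy successful load dated far enough back to cover all existing data, e.g.:
-- One-off seed row so the very first load has a starting point
INSERT INTO DataLoadHistory (DataLoadStart, DataLoadEnd, Success)
VALUES ('19000101', '19000101', 1)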
THREE
I've seen it done a lot where, say, your data load runs every day and you just stage the last 7 days, or something like that: some margin of safety that you're pretty sure will never be exceeded (because the process is being monitored for failures).
It's not my preferred option, but it is simple, and can work if you're confident in how well the process is being monitored.
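In that case the source query is just a fixed look-back window, for example (7 days being an arbitrary margin):
SELECT *
FROM [table A]
WHERE modifiedOn >= DATEADD(DAY, -7, CAST(GETDATE() AS date))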

SQL server GetDate in trigger called sequentially has the same value

I have a trigger on a table for insert, delete and update that on its first line gets the current date with the GetDate() function.
The trigger compares the deleted and inserted tables to determine which field has been changed, and stores the id, datetime and changed field in another table. This combination must be unique.
A stored procedure does an insert and an update sequentially on the table. Sometimes I get a violation of primary key and I suspect that GetDate() returns the same value.
How can I make GetDate() return different values in the trigger?
EDIT
Here is the code of the trigger
CREATE TRIGGER dbo.TR
ON table
FOR DELETE, INSERT, UPDATE
AS
BEGIN
SET NoCount ON
DECLARE @dt Datetime
SELECT @dt = GetDate()
insert tableLog (id, date, field, old, new)
select I.id, @dt, 'field', D.field, I.field
from INSERTED I LEFT JOIN DELETED D ON I.id=D.id
where IsNull(I.field, -1) <> IsNull(D.field, -1)
END
and the code of the calls
...
insert into table ( anotherfield)
values (@anotherfield)
if @@rowcount=1 SET @ID=@@Identity
...
update table
set field = @field
where Id = @ID
...
Sometimes GetDate() between the 2 calls (insert and update) differs by 7 milliseconds, and sometimes it returns the same value.
That's not exactly a full solution, but try using SYSDATETIME instead, and of course make sure that the target table stores datetime2 with enough precision (up to microseconds).
Note that you can't force different datetimes regardless of precision (unless you start counting ticks), as things can simply happen at the same time within a given precision.
If stretching to microseconds won't solve the issue on a practical level, I think you will have to either redesign this logging schema (perhaps add an identity column on top of what you have) or resort to a dirty trick, like doing the insert in a TRY...CATCH block and adding a microsecond (nanosecond?) in a loop until it succeeds. Definitely not something I would recommend.
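Applied to the trigger above, that would look roughly like this (assuming the date column in tableLog is widened to datetime2(7); drop and recreate any key or index that includes [date] first, and adjust nullability to match the existing column):
-- Widen the log column so the extra precision isn't truncated
ALTER TABLE tableLog ALTER COLUMN [date] datetime2(7)

-- In the trigger, capture the timestamp with ~100 ns precision instead of GETDATE()'s ~3 ms
DECLARE @dt datetime2(7)
SELECT @dt = SYSDATETIME()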
Look at this answer: SQL Server: intrigued by GETDATE()
If you are inserting multiple ROWS, they will all use the same value of GetDate(), so you can try wrapping it in a UDF to get unique values. But as I said, this is just a guess unless you post the code of your trigger so we can see what you are actually doing.
It sounds like you're trying to create an audit trail - but now you want to forge some of the entries?
I'd suggest instead adding a rowversion column to the table and including that in your uniqueness criteria - either instead of or as well as the datetime value that is being recorded.
In this way, even if two rows are inserted with identical date/time data, you can still tell the actual insertion order.
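A minimal sketch of that idea (the column name is just an example):
-- rowversion values are generated automatically and are strictly increasing per database,
-- so they preserve insertion order even when two rows get identical GetDate() values
ALTER TABLE tableLog ADD RowVer rowversion

-- then include RowVer in the unique key/index instead of, or as well as, the datetime column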

How to copy large number of data from one table to another in same database?

I have two tables with same column structure in the same database: TableA and TableB.
TableA doesn't have any indexes, but TableB has a non-clustered unique index.
TableA has 290 Million rows of data that needs to be copied to TableB.
As they both have same structure, I've tried
INSERT INTO TableB
SELECT *
FROM TableA;
It was executing for hours and produced a huge log file that filled the disk. As a result the disk ran out of space and the query was killed.
I can shrink the log file. How can I copy these many rows of data to another table efficiently?
First of all, disable the index on TableB before inserting the rows. You can do it using T-SQL:
ALTER INDEX IX_Index_Name ON dbo.TableB DISABLE;
Make sure to disable all the constraints (foreign keys, check constraints, unique indexes) on your destination table.
Re-enable (and rebuild) them after the load is complete.
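Re-enabling a disabled index is done by rebuilding it, e.g.:
-- Rebuilding re-enables the index and recreates it from the loaded data
ALTER INDEX IX_Index_Name ON dbo.TableB REBUILD;

-- or rebuild everything on the table in one go
ALTER INDEX ALL ON dbo.TableB REBUILD;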
Now, there are a couple of approaches to solve the problem:
You have to be OK with a slight chance of data loss: use the INSERT INTO ... SELECT ... FROM ... syntax you have but switch your database to Bulk-logged recovery mode first (read before switching). Won't help if you're already in Bulk-logged or Simple.
With exporting the data first: you can use the BCP utility to export/import the data. It supports loading data in batches. Read more about using the BCP utility here.
Fancy, with exporting the data first: with SQL 2012+ you can try exporting the data into a binary file (using the BCP utility) and loading it with the BULK INSERT statement, setting the ROWS_PER_BATCH option (see the sketch after the batching example below).
Old-school "I don't give a damn" method: to prevent the log from filling up you will need to perform the inserts in batches of rows, not everything at once. If your database is running in Full recovery mode you will need to keep log backups running, maybe even increasing the frequency of the job. To batch-load your rows you will need a WHILE loop (don't use them in day-to-day stuff, just for batch loads); something like the following will work if you have an identifier in the dbo.TableA table:
DECLARE @RowsToLoad BIGINT;
DECLARE @RowsPerBatch INT = 5000;
DECLARE @LeftBoundary BIGINT = 0;
DECLARE @RightBoundary BIGINT = @RowsPerBatch;

SELECT @RowsToLoad = MAX(IdentifierColumn) FROM dbo.TableA

WHILE @LeftBoundary < @RowsToLoad
BEGIN
    -- Load one slice of rows at a time so each batch is a small transaction
    INSERT INTO TableB (Column1, Column2)
    SELECT
        tA.Column1,
        tA.Column2
    FROM
        dbo.TableA AS tA
    WHERE
        tA.IdentifierColumn > @LeftBoundary
        AND tA.IdentifierColumn <= @RightBoundary

    -- Move the window forward
    SET @LeftBoundary = @LeftBoundary + @RowsPerBatch;
    SET @RightBoundary = @RightBoundary + @RowsPerBatch;
END
For this to work effectively you really want to consider creating an index on dbo.TableA (IdentifierColumn), just for the time you're running the load.
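For the BCP/BULK INSERT route above, the shape of it would be roughly this (server name, database name and file path are placeholders):
-- Export TableA in native format (run from a command prompt)
-- bcp MyDb.dbo.TableA out C:\Temp\TableA.dat -n -S MyServer -T

-- Load the file into TableB
BULK INSERT dbo.TableB
FROM 'C:\Temp\TableA.dat'
WITH (
    DATAFILETYPE = 'native',
    ROWS_PER_BATCH = 290000000,  -- hint at the total row count so the load is planned sensibly
    TABLOCK
);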

Delete vs Rollback Strategy - ETL Load

I am loading data to a table in the following manner:
DECLARE @srcRc INT;
DECLARE @dstRc INT;
SET @srcRc = ( SELECT COUNT(*) FROM A )
INSERT INTO t
(Col1
,Col2
,Col3
)
SELECT A.Col1
,A.Col2
,B.Col3
FROM A
JOIN B
ON A.Id = B.Id;
SET @dstRc = @@ROWCOUNT
Now I am comparing the variables @srcRc and @dstRc. The row counts must be the same. If they are not, the inserted rows need to be deleted.
Q1: What would be the best strategy to roll back the inserted rows?
I have a couple of ideas:
1) Run the load in a transaction and roll back if the row counts do not match (see the sketch after the pre-check code below).
2) Add a flag column (bit) called toBeDeleted to the destination table, run the load, and if the row counts do not match, update the toBeDeleted column to 1 to mark those rows as candidates for deletion. Then delete them in batch mode (a while-loop). Or do not delete them at all, but always exclude deletion candidates from queries when working with the t table.
3) Before inserting the rows, compare the row counts first. If they do not match, don't start the load.
DECLARE @srcRc INT;
DECLARE @dstRc INT;
SET @srcRc = ( SELECT COUNT(1) FROM A );
SET @dstRc = ( SELECT COUNT(1) FROM A JOIN B ON A.Id = B.Id );
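For idea 1, the load would be wrapped roughly like this (a sketch without the error handling you would want in production, assuming @srcRc has been captured as above):
BEGIN TRANSACTION;

INSERT INTO t (Col1, Col2, Col3)
SELECT A.Col1, A.Col2, B.Col3
FROM A
JOIN B ON A.Id = B.Id;

SET @dstRc = @@ROWCOUNT;

IF @dstRc <> @srcRc
    ROLLBACK TRANSACTION;   -- undo the whole load
ELSE
    COMMIT TRANSACTION;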
Q2: What would be a better solution for a higher number of rows, let's say 10-100 million?
Q3: Or is there any better strategy for similar case?
OK, assuming:
You need the rollback to work at some later date, when the content of tables A and B may have changed
There may also be other rows in T which you don't want to delete as part of the rollback
Then you MUST keep a list of the rows you inserted, as you are unable to reliably regenerate that list from A and B, and you can't just delete everything from T.
You could do this in two ways
Change your import, so that it first inserts the rows to an import table, keep the import table hanging around until you are sure you don't need it anymore.
Add an extra column [importId] to T, into which you put a value that uniquely identifies each load
Obviously the first strategy uses a lot more disc space. So the longer your keep the data and the more data there is, the better the extra column looks.
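A sketch of the second way, assuming T gets an [importId] column and every load is tagged with a fresh value:
DECLARE @importId uniqueidentifier = NEWID();

-- Tag every row belonging to this load with the same import id
INSERT INTO t (Col1, Col2, Col3, importId)
SELECT A.Col1, A.Col2, B.Col3, @importId
FROM A
JOIN B ON A.Id = B.Id;

-- Record @importId somewhere (a load-history table, an SSIS variable) so that a later
-- rollback is simply: DELETE FROM t WHERE importId = @importId;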
Another option would be to generate the list of imported data separately and have your transaction SQL be a bulk insert with all the data hard-coded into the SQL.
This works well for small lists, initial setup data and the like.
Edit:
From your comments it sounds like you don't want a rollback per se, but rather the best way to apply business logic around the import process.
In this case your 3rd idea is the best: don't do the import when you know the source data is incorrect.
