Ignore duplicate records in SSIS' OLE DB destination - sql-server

I'm using an OLE DB Destination to populate a table with values from a web service.
The package will be scheduled to run in the early AM for the prior day's activity. However, if that run fails, the package can be executed manually.
My concern is that if the operator chooses a date range that overlaps existing data, the whole package will fail (verified).
I would like it to:
INSERT the missing values (works as expected if there are no duplicates)
ignore the duplicates, not cause the package to fail, and raise an exception that can be captured by the Windows application log (logged as a warning)
collect the number of successfully inserted records and the number of duplicates
If it matters, I'm using Data access mode = Table or view - fast load.
Suggestions on how to achieve this are appreciated.

That's not a feature.
If you don't want errors (duplicates), then you need to defend against them - much as you'd do in your favorite language. Instead of relying on error handling, you test for the existence of the error-inducing thing (a Lookup Transform to identify whether a row already exists in the destination) and then filter the duplicates out (redirect the No Match Output).
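As a rough sketch of the Lookup side of that, assuming the destination is dbo.TargetTable and the natural key is BusinessKey (both placeholder names), the Lookup query only needs the key column(s):

-- Lookup Transform query: pull just the key column(s) so the lookup cache stays small.
-- Match incoming rows on BusinessKey and send the "Lookup No Match Output"
-- on to the OLE DB Destination; matched rows (duplicates) are simply not connected.
SELECT BusinessKey
FROM dbo.TargetTable;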
The technical solution you absolutely should not implement
Change the access mode from "Table or View Name - Fast Load" to "Table or View Name". This changes the insert from a bulk/set-based operation to singleton inserts. By inserting one row at a time, the SSIS package can evaluate the success or failure of each row's save. You then need to go into the Advanced Editor (your screenshot) and change the Error disposition from Fail Component to Ignore Failure.
This solution should not be used because it yields poor performance, generates unnecessary workload, and has the potential to mask save errors beyond just duplicates - referential integrity violations, for example.

Here's how I would do it:
Point your SSIS Destination to a staging table that will be empty when the package is run.
Insert all rows into the staging table.
Run a stored procedure that uses SQL to import records from the staging table to the final destination table, WHERE the records don't already exist in the destination table.
Collect the desired meta-data and do whatever you want with it.
Empty the staging table for the next use.
(Those last 3 steps would all be done in the same stored procedure).
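A hedged sketch of that stored procedure, assuming a staging table dbo.StagingTable, a destination dbo.TargetTable and a natural key column BusinessKey (all placeholder names):

CREATE PROCEDURE dbo.usp_LoadFromStaging
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @inserted INT, @duplicates INT;

    -- Insert only the rows that are not already in the destination.
    INSERT INTO dbo.TargetTable (BusinessKey, SomeValue)
    SELECT s.BusinessKey, s.SomeValue
    FROM dbo.StagingTable AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.TargetTable AS t
                      WHERE t.BusinessKey = s.BusinessKey);

    SET @inserted = @@ROWCOUNT;

    -- Anything left in staging that was not inserted is a duplicate.
    SELECT @duplicates = COUNT(*) - @inserted
    FROM dbo.StagingTable;

    -- Report the counts however you prefer: a log table, output parameters
    -- mapped to package variables, or an informational RAISERROR like this one.
    RAISERROR('Inserted %d rows, skipped %d duplicates.', 10, 1, @inserted, @duplicates);

    -- Empty the staging table for the next run.
    TRUNCATE TABLE dbo.StagingTable;
END;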

Related

Backing up a table before deleting all the records and reloading it in SSIS

I have a table named abcTbl; the data in it is populated from other tables in a different database. Every time I load data into abcTbl, I delete everything in it and then load the buffer data into it.
This package runs daily. My question is how do I avoid losing the data in abcTbl if we fail to load new data into it? My first step is deleting all the data in abcTbl, then selecting the data from various sources into a buffer, and then loading the buffer data into abcTbl.
We can encounter issues like failed connections, the package stopping prematurely, supernatural forces trying to stop/break my package from running smoothly, etc., which can leave the package losing all the data in the buffer after I have already deleted the data from abcTbl.
My first intuition was to save the data from abcTbl into a backup table and then delete the data in abcTbl, but my DBAs wouldn't be too thrilled about creating a backup table in every environment just for this package, and giving me the rights to create backup tables on the fly and then delete them again is out of the question too. This data is not business critical and can be repopulated if lost.
But, what is the best approach here? What are the best practices for this issue?
For backing up your table, instead of loading data from one table (original) to another (backup), you can just rename your original table to a backup name, create the original table again with the same structure as the backup, and drop the renamed table only when your data load succeeds. This can save the time it takes to transfer data from one table to another. You may want to test which approach is faster for you, depending on your data and table structure, but this is also one way to do it. If you have a lot of data in that table, the approach below may be faster:
EXEC sp_rename 'abcTbl', 'abcTbl_bkp';
Re-create abcTbl with the same structure as abcTbl_bkp.
Load your new data into abcTbl.
DROP TABLE abcTbl_bkp;
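As a rough end-to-end sketch, one way to re-create the empty table is SELECT ... INTO with a false predicate; note that this copies only the column definitions, not indexes, constraints or defaults, so script those separately if you need them:

-- 1. Keep the current data by renaming the table.
EXEC sp_rename 'abcTbl', 'abcTbl_bkp';

-- 2. Re-create an empty abcTbl with the same column structure.
SELECT *
INTO abcTbl
FROM abcTbl_bkp
WHERE 1 = 0;

-- 3. Load the new data into abcTbl here.

-- 4. Drop the backup only after the load has succeeded.
DROP TABLE abcTbl_bkp;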
Trying to figure this out, but I think what you are asking for is a method to capture the older data before loading the new data. I would agree with your DBAs that a separate table for every reload would be extremely messy and not very usable if you ever need it.
Instead, create a table that copies your load table but adds a single DateTime field (say history_date). On each load, flow all the data in your primary table into the backup table, using a Derived Column task in the Data Flow to add the history_date value.
Once the backup table is complete, either truncate or delete the contents of the current table. Then load the new data.
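If you would rather do the archiving step in T-SQL than in the Data Flow, a minimal sketch, assuming a history table named abcTbl_history whose columns mirror abcTbl followed by the history_date column (placeholder name):

-- Archive the current contents with a load timestamp, then clear the table.
-- The SELECT relies on abcTbl_history having abcTbl's columns in the same order,
-- with history_date as the last column.
INSERT INTO abcTbl_history
SELECT t.*, GETDATE() AS history_date
FROM abcTbl AS t;

TRUNCATE TABLE abcTbl;
-- The new load can now repopulate abcTbl; a failed load still leaves the archive intact.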
Instead of creating additional tables, you can set the package to execute as a single transaction. If any component fails, all the tasks that have already executed are rolled back and subsequent ones will not run. To do this, set TransactionOption to Required on the package so that the package begins a transaction. Then set TransactionOption to Supported on all the components that you want to succeed or fail together; the Supported level has those tasks join the transaction already started by the parent container, in this case the package. If there are other components in the package that you want to commit or roll back independently of these tasks, you can place the related objects in a Sequence Container and apply the Required level to that container instead. An important thing to note: if anything performs a TRUNCATE, then all other components that access the truncated object will need ValidateExternalMetadata set to false to avoid the known blocking issue that results from this.

SSIS Deadlock with a Slowly Changing Dimension

I am running an SSIS package that contains many (7) reads from a single flat file uploaded from an external source. There is consistently a deadlock, in every environment (Test, Pre-Production, and Production), on one of the data flows that uses a Slowly Changing Dimension to update an existing SQL table with both new and changed rows.
I have three outputs coming off the SCD:
-The Inferred Member Updates Output goes directly to an OLE DB Command that runs an UPDATE.
-The Historical Attribute output goes to a Derived Column box that sets a delete date, then to an OLE DB Command that runs an UPDATE, and then to a Union All where it is combined with the New Output.
-The New Output goes into the Union All along with the Historical output, then to a Derived Column box that adds an update/create date, and the rows are then inserted into the same SQL table that the Inferred Member Output's OLE DB Command updates.
The only error I am getting in my log looks like this:
"Transaction (Process ID 170) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."
I could put NOLOCK hints into the OLE DB commands, but I have read that this isn't the way to go.
I am using SQL Server 2012 Data Tools to investigate and edit the Package, but I am unsure where to go from here to find the issue.
I want to put out there that I am a novice in terms of SSIS programming... with that out of the way, any help would be greatly appreciated, even if it is just pointing me to a place I haven't looked for help.
Adding an index on the column used in the WHERE condition may resolve your issue. With the index in place, the transactions complete faster, which reduces the chance of a deadlock.
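For example, assuming the updates filter on a business key column of the dimension table (both names below are placeholders):

-- A nonclustered index on the column(s) in the UPDATE's WHERE clause lets
-- SQL Server seek instead of scan, take narrower locks, and finish sooner.
CREATE NONCLUSTERED INDEX IX_DimTable_BusinessKey
    ON dbo.DimTable (BusinessKey);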

How to control which rows were sent via SSIS

I'm trying to create an SSIS package which will periodically send data to another database. I want to send only new records (I need to keep track of which records were sent), so I created a status column in my source table.
I want my package to update this column after successfully sending the data, but I can't update all rows with "unsent" status because some rows may have been added during package execution, and I also can't use transactions (I mean isolation levels that would solve my problem: I can't use Serializable because I mustn't prevent users from adding new rows, and the Sequence Container doesn't support Snapshot).
My next idea was to use a Recordset and, after sending the data to the other db, use it to get the IDs of the sent rows, but I couldn't find a way to use it as a data source.
I don't think I should set the status to "to send" and then update it to "sent"; I believe that would be too costly.
Now I'm thinking about using a temporary table, but I'm not convinced that this is the right way to do it. Am I missing something?
A Recordset is a destination; you cannot use it as a source in a Data Flow task.
But since the data is saved to a variable, it is available in the Control Flow.
After the Data Flow completes, go back to the Control Flow and create a Foreach Loop that iterates over the Recordset variable.
Read each Recordset value into a variable and use it to run an update query.
Also, see if the Lookup Transform can be useful to you: it can split the rows into those that match and those that don't.
I will improve the answer based on discussions
What you have here is a very typical data mirroring problem. To start with, I would not simply have a boolean that signifies that a record was "sent" to the destination (mirror) database. At the very least, I would put a LastUpdated datetime column in the source table, and have triggers on that table, on insert and update, that put the system date into that column. Then, every day I would execute an SSIS package that reads the records updated in the last week, checks to see if those records exist in the destination, splitting the datastream into records already existing and records that do not exist in the destination. For those that do exist, if the LastUpdated in the destination is less than the LastUpdated in the source, then update them with the values from the source. For those that do not exist in the destination, insert the record from the source.
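A minimal sketch of the LastUpdated plumbing, assuming a source table dbo.SourceTable with primary key Id (placeholder names):

-- Add the tracking column; the default covers inserts, the trigger covers updates.
ALTER TABLE dbo.SourceTable
    ADD LastUpdated DATETIME NOT NULL
        CONSTRAINT DF_SourceTable_LastUpdated DEFAULT (GETDATE());
GO

CREATE TRIGGER dbo.trg_SourceTable_LastUpdated
ON dbo.SourceTable
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE s
    SET LastUpdated = GETDATE()
    FROM dbo.SourceTable AS s
    INNER JOIN inserted AS i ON i.Id = s.Id;
END;
GO

-- The daily package can then read recent changes with something like:
-- SELECT ... FROM dbo.SourceTable WHERE LastUpdated >= DATEADD(DAY, -7, GETDATE());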
It gets a little more interesting if you also have to deal with record deletions.
I know it may seem wasteful to read and check a week's worth every day, but your database should hardly feel it; it provides a lot of good double-checking and saves you a lot of headaches by giving you a simple, error-tolerant algorithm. If some record does not get transferred because of a hiccup on the network, no worries: it gets picked up the next day.
I would still set up the SSIS package as a server task that sends me an email with any errors, so that I can keep track. Most days you get no errors, and when there are errors, you can wait a day or resolve the cause and let the next day's run pick up the problems.
I am doing a similar thing; in my case, I have a status column on the source record.
I read in all records with a status of "new".
Then I use an OLE DB Command to execute SQL on each row, changing the status to "In Progress" (in your WHERE clause, enter a ? as the value in the Component Properties tab; you can then configure it as a parameter from the table row, such as an ID or some PK, on the Column Mappings tab).
Once the records are processed, you can change all "In Progress" records to "Success" or something similar using another OLE DB Command.
Depending on what you are doing, you can also use the status to mark records that errored at some point and require further attention.
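For reference, those two statements could look roughly like this, assuming the source table is dbo.SourceTable with primary key Id and a Status column (placeholder names); the ? is mapped to the row's Id on the Column Mappings tab:

-- Per-row OLE DB Command inside the Data Flow: mark each row as picked up.
UPDATE dbo.SourceTable SET Status = 'In Progress' WHERE Id = ?;

-- After processing, one set-based statement (e.g. in an Execute SQL Task).
UPDATE dbo.SourceTable SET Status = 'Success' WHERE Status = 'In Progress';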

SSIS packages fail when a table doesn't exist, even though it will never be accessed

In SSIS 2008 I have a Script Task that checks if a table exists in a database and sets a boolean variable.
In my Data Flow I do a Conditional Split based on that variable, so that I can do the appropriate OLE DB Commands based on whether that table exists or not.
If the table does exist, the package runs correctly. But if the table doesn't exist, SSIS checks the metadata on the OLE DB Command that isn't being run, determines the table isn't there, and fails with an error before doing anything.
There doesn't seem to be any way to catch or ignore that error (e.g. I tried increasing MaximumErrorCount and various different ErrorRowDescription settings), or to stop it ever validating the command (ValidateExternalMetadata only seems to affect the designer, by design).
I don't have access to create stored procedures to wrap this kind of test, and OLE DB Commands do not let you use IF OBJECT_ID('') IS NOT NULL prefixes on any statements you're doing (in this case, a DELETE FROM TableName WHERE X = ?).
Is there any other way around this, short of using a script component to fire off the DELETE command row-by-row manually?
You can use a Script Component to execute a DELETE statement for each row in the input path, but that might be very slow depending on the number of rows to be deleted.
Instead, you can:
Store the PKs of the records that should be deleted in a database table (for instance: TBL_TO_DEL).
Add an Execute SQL Task with a SQL query that deletes records by joining TBL_TO_DEL to the table you want to delete records from (see the sketch after this list).
Put a precedence constraint, based on your variable, on the path between your data flow and the Execute SQL Task.
This solution is much faster than deleting row by row.
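A hedged sketch of that Execute SQL Task statement, assuming the target is dbo.TargetTable and both tables share a key column Id (placeholder names apart from TBL_TO_DEL):

-- Set-based delete driven by the staged keys; far cheaper than row-by-row deletes.
DELETE t
FROM dbo.TargetTable AS t
INNER JOIN dbo.TBL_TO_DEL AS d ON d.Id = t.Id;

-- Optionally clear the staging table for the next run.
TRUNCATE TABLE dbo.TBL_TO_DEL;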
If for some reason you can't create a new table, check my answer on SSIS Pass Datasource Between Control Flow Tasks for other ways to pass data to the next data flow, where you can use an OLE DB Source and OLE DB Command. Whichever way you choose, the key is the precedence constraint that will or will not execute the following task (Execute SQL Task or data flow) depending on the value of the variable.
Note that the Execute SQL Task will not validate the query, and as such it will only fail at runtime if the constraint is satisfied and the table doesn't exist. If you use another Data Flow instead of an Execute SQL Task, set its DelayedValidation property to true. That means the task is validated just before it executes, not any earlier.

Error when copying a check constraint using DTS

I have a DTS package that is raising an error in a "Copy SQL Server Objects" task. The task copies a table plus its data from one SQL Server 2000 SP4 server to another (same version) and gives the error: -
Could not find CHECK constraint for 'dbo.MyTableName', although the table is flagged as having one.
The source table has one check constraint defined, which appears to cause the problem. After running the DTS package, everything appears to have worked properly - the table, all constraints and the data ARE created on the destination server - yet the error above is raised, causing subsequent steps not to run.
Any idea why this error is raised?
This indicates that the metadata in the sys tables has gotten out of sync with your actual schema. If you aren't seeing any other signs of more generalized corruption, rebuilding the table - copying it to another table (select * into newtable from oldtable), dropping the old table, renaming the new one and re-creating the constraints - will help. This is similar to how Enterprise Manager for 2000 does things when you insert a column that isn't at the end of the table, so inserting a new column in the middle of the table and then removing it will achieve the same thing if you don't want to write the queries by hand.
I would be somewhat concerned by the state of the database as a whole if you see other occurrences of this kind of error. (I'm assuming here that you have already done CHECKDB commands and that the error is persisting...)
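A rough sketch of that rebuild, using MyTableName from the question; remember that SELECT ... INTO does not carry over indexes, other constraints or defaults, so script and re-apply those as well (the check constraint definition below is only a placeholder):

-- Copy the data into a new table, swap the names, then re-create the constraint.
SELECT *
INTO dbo.MyTableName_rebuild
FROM dbo.MyTableName;

DROP TABLE dbo.MyTableName;
EXEC sp_rename 'dbo.MyTableName_rebuild', 'MyTableName';

-- Re-apply the check constraint (placeholder definition - use your real one).
ALTER TABLE dbo.MyTableName
    ADD CONSTRAINT CK_MyTableName_NewColumn CHECK (NewColumn IS NOT NULL);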
This error started when a new column (with a check constraint) was added to an existing table. To investigate I have: -
Copied the table to a different destination SQL Server and got the same error.
Created a new table with exactly the same structure but different name and copied with no error.
Dropped and re-created the check constraint on the problem table but still get the same error.
dbcc checktable ('MyTableName') with ALL_ERRORMSGS gives no errors.
dbcc checkdb in the source and destination database gives no errors.
Interestingly the DTS package appears to: -
Copy the table.
Copy the data.
Create the constraints
The check constraint's create time is 7 minutes after the table's create time, i.e. it creates the check constraint AFTER it has moved the data. That makes sense, as it then does not have to check the data while copying, presumably improving performance.
As Godeke suggests, I think something has become corrupt in the system tables, since a new table with the same columns works. Even though the DBCC statements give no errors?
