Insert rows into a SQL Server table with a primary key using SSIS - sql-server

I have an SSIS package that I use to pass data from an Excel workbook into a SQL Server table.
My Excel file constantly grows with new records, so I've defined a primary key on the SQL Server table to avoid inserting duplicates; however, I'm essentially inserting the whole workbook each time.
I now have a problem: either the whole package fails because it attempts to insert duplicate values into a table with a primary key, or, if I set the Error Output of the destination to "Redirect row", the package executes successfully with the following message:
Data Flow Task, SSIS.Pipeline: "OLE DB Destination" wrote 90 rows
but no new rows are actually added to the table.
If I remove the PK constraint and add a trigger to remove duplicates on insert, it works, but I would like to know the proper way to do this.

To make the current design "work" with an error table, change the batch commit size to 1 in the OLE DB Destination. What's happening is that the destination tries to commit all 90 rows, and because at least one bad row is in the batch, the whole batch fails.
The better approach is to add a Lookup Component between the data conversion and the destination. The output of the Lookup will be the "No Match Found" output path, and that is what will feed into the OLE DB Destination. The logic is that you attempt to look up the incoming key in the target table; No Match Found means exactly what it sounds like: the row doesn't exist yet, so you can insert it without a PK conflict*.
* "But I still got a PK conflict, and the key isn't in the target table." In that case you have duplicate/repeated keys in your source data, and the same batch-size issue is obscuring it. Say we're adding two rows with PK 50: PK 50 doesn't exist in the target, so both rows pass the lookup, but the default batch size means both are inserted in a single commit, which violates the primary key constraint and gets the whole batch rolled back.
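For reference, the Lookup only needs the key column(s) of the destination to decide match vs. no match. A minimal sketch of the kind of lookup query you could use (the table and column names here are hypothetical, not taken from the question):

    -- Hypothetical destination table and primary key column; adjust to your schema.
    -- Selecting only the key keeps the Lookup cache small.
    SELECT MyKeyColumn
    FROM dbo.MyDestinationTable;

With the Lookup configured to redirect unmatched rows to the No Match Found output, only keys that are not already in the table reach the OLE DB Destination; duplicate keys within the source itself still get through, which is the situation described in the footnote above.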

Related

SSIS flat file with joins

I have a flat file which has the following columns:
Device Name
Device Type
Device Location
Device Zone
I need to insert these into a SQL Server table called Devices.
The Devices table has the following structure:
DeviceName
DeviceTypeId (foreign key from DeviceType table)
DeviceLocationId (foreign key from DeviceLocation table)
DeviceZoneId (foreign key from DeviceZone table)
DeviceType, DeviceLocation and DeviceZone tables are already prepopulated.
Now I need to write an ETL process that reads the flat file and, for each row, gets the DeviceTypeId, DeviceLocationId and DeviceZoneId from the corresponding tables and inserts the row into the Devices table.
I am sure this is nothing new, but it's been a while since I worked on SSIS packages like this, and any help would be appreciated.
Load the flat content into a staging table and write a stored procedure to handle the inserts and updates in T-SQL.
Having FK relationships between the destination tables can cause a lot of trouble with a single data flow and a multicast.
The problem is that you have no control over the order of the inserts, so a child record could be inserted before its parent.
Also, with identity columns on the tables, you cannot retrieve the identity value from one stream and use it in another without subsequent merge joins.
The simplest way to do that is to use a Lookup Transformation to get the ID for each value. Be aware that duplicates can cause problems: make sure each value appears only once in the foreign (lookup) tables.
Also, make sure to redirect rows that have no match into a staging table so you can check them later.
You can refer to the following article for a step-by-step guide to the Lookup Transformation:
An Overview of the LOOKUP TRANSFORMATION in SSIS
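As a rough T-SQL sketch of the staging-table approach mentioned above: assume the flat file has been loaded into a hypothetical staging table dbo.StagingDevices holding the four text columns, and assume each lookup table exposes its Id plus a name column (DeviceTypeName, DeviceLocationName and DeviceZoneName are guesses, not taken from the question).

    -- All staging and name-column identifiers below are assumptions for illustration.
    INSERT INTO dbo.Devices (DeviceName, DeviceTypeId, DeviceLocationId, DeviceZoneId)
    SELECT s.DeviceName,
           dt.DeviceTypeId,
           dl.DeviceLocationId,
           dz.DeviceZoneId
    FROM dbo.StagingDevices AS s
    JOIN dbo.DeviceType     AS dt ON dt.DeviceTypeName     = s.DeviceType
    JOIN dbo.DeviceLocation AS dl ON dl.DeviceLocationName = s.DeviceLocation
    JOIN dbo.DeviceZone     AS dz ON dz.DeviceZoneName     = s.DeviceZone;

Note that the inner joins silently drop rows whose type, location or zone has no match; switching to LEFT JOINs and checking for NULL IDs lets you route those rows to a review table instead, in the spirit of the redirect-no-match advice above.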

SSIS only extract Delta changes

I'm after some advice. I'm using SSIS / SQL Server 2014. I have a nightly SSIS package that pulls data from non-SQL Server databases into a single table (the SQL Server table is truncated beforehand each time), and I then extract from this table to create a daily CSV file.
Going forward, I only want to extract to CSV, on a daily basis, the records that have changed, i.e. the deltas.
What is the best approach? I was thinking of using CDC in SSIS, but as I'm truncating the SQL table before the load each time, will this be the best method? Or will I need to keep a master table in SQL with an initial load, then import into another table and extract only where there are differences? For info, the table in SQL contains a primary key.
I just want to double-check, as CDC assumes the tables are all in SQL Server, whereas my data comes from outside SQL Server first.
Thanks for any help.
The primary key on that table is your saving grace here. Obviously enough, the SQL Server database that you're pulling the disparate data into won't know from one table flush to the next which records have changed, but if you add two additional tables, and modify the existing table with an additional column, it should be able to figure it out by leveraging HASHBYTES.
For this example, I'll call the new tables SentRows and RowsToSend, but you can use more meaningful names in practice. We'll call the new column in the existing table HashValue.
Add the column HashValue to your table as a varbinary data type. NOT NULL as well.
Create your SentRows table with columns for all the columns in the main table's primary key, plus the HashValue column.
Create a RowsToSend table that's structurally identical to your main table, including the HashValue.
Modify your queries to create the HashValue by applying HASHBYTES to all of the non-key columns in the table. (This will be horribly tedious. Sorry about that.)
Send out your full data set.
Now move all of the key values and HashValues to the SentRows table. Truncate your main table.
On the next pull, compare the key values and HashValues from SentRows to the new data in the main table.
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
Pull out any changes you need to send to the RowsToSend table.
Send the changes from RowsToSend.
Move the key values and HashValues to your SentRows table. Update hashes for changed key values, insert new rows, and decide how you're going to handle deletes, if you have to deal with deletes.
Truncate the RowsToSend table to get ready for tomorrow.
If you'd like (and you'll thank yourself later if you do), add a column to the SentRows table with a default of GETDATE(), which will tell you when the row was added.
And away you go. Nothing but deltas from now on.
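To make that concrete, here is a minimal sketch of the HashValue column and the HASHBYTES update described above, assuming a main table dbo.MainTable with key column Id and two hypothetical non-key columns Col1 and Col2 (all of those names are assumptions):

    -- Table and column names are assumptions for illustration only.
    -- 32 bytes is enough to hold a SHA2_256 hash.
    ALTER TABLE dbo.MainTable
        ADD HashValue varbinary(32) NOT NULL
            CONSTRAINT DF_MainTable_HashValue DEFAULT 0x;

    -- After each nightly load, hash the non-key columns.
    -- The delimiter stops adjacent values from running together.
    UPDATE dbo.MainTable
    SET HashValue = HASHBYTES('SHA2_256', CONCAT(Col1, '|', Col2));

One caveat: on SQL Server 2014, HASHBYTES input is limited to 8000 bytes, so very wide rows may need to be hashed in pieces.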
Edit 2019-10-31:
Step by step (or TL;DR):
1) Flush and Fill MainTable.
2) Compare keys and hashes on MainTable to keys and hashes on SentRows to identify new/changed rows (see the sketch after this list).
3) Move new/changed rows to RowsToSend.
4) Send the rows that are in RowsToSend.
5) Move all the rows from RowsToSend to SentRows.
6) Truncate RowsToSend.
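Continuing with the same assumed names (Id as the key, Col1 and Col2 as the payload), step 2 and the deleted-row check could be sketched as:

    -- New rows: key exists in the fresh load but not in SentRows.
    -- Changed rows: key exists in both, but the hashes differ.
    INSERT INTO dbo.RowsToSend (Id, Col1, Col2, HashValue)
    SELECT m.Id, m.Col1, m.Col2, m.HashValue
    FROM dbo.MainTable AS m
    LEFT JOIN dbo.SentRows AS s
        ON s.Id = m.Id
    WHERE s.Id IS NULL                  -- new row
       OR s.HashValue <> m.HashValue;   -- updated row

    -- Deleted rows: key was sent before but is missing from the new load.
    SELECT s.Id
    FROM dbo.SentRows AS s
    LEFT JOIN dbo.MainTable AS m
        ON m.Id = s.Id
    WHERE m.Id IS NULL;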

SQL Server: split records pointing to a unique varbinary value

I have an interesting problem for the smart people out there.
I have an external application, which I cannot modify, that writes pictures into a SQL Server table. The pictures are often non-unique, but they are linked to unique rows in other tables.
The table MyPictures looks like this (simplified):
Unique (ID)    FileName (varchar)    Picture (varbinary)
-----------    ------------------    -------------------
xxx-xx-xxx1    MyPicture             0x66666666
xxx-xx-xxx2    MyPicture             0x66666666
xxx-xx-xxx3    MyPicture             0x66666666
This causes the same data to be stored over and over again, blowing up my database (85% of my DB is just this table).
Is there something I can do at the SQL level to store the data only once when the filename and picture already exist in my table?
The only thing I can think of is to treat the current destination table as a 'staging' table: allow all the rows the upstream process wants to write to it, but then have a second process that copies only distinct rows to the table(s) you're using on the SQL side and deletes the duplicate rows from the staging table to reclaim your space.
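A rough T-SQL sketch of that second process, assuming the deduplicated destination is a hypothetical table dbo.Pictures with the same FileName and Picture columns:

    -- Copy only filename/picture combinations that aren't stored yet.
    INSERT INTO dbo.Pictures (FileName, Picture)
    SELECT DISTINCT s.FileName, s.Picture
    FROM dbo.MyPictures AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.Pictures AS p
                      WHERE p.FileName = s.FileName
                        AND p.Picture  = s.Picture);

    -- Then remove the copied rows from the staging table to reclaim space.
    DELETE s
    FROM dbo.MyPictures AS s
    WHERE EXISTS (SELECT 1
                  FROM dbo.Pictures AS p
                  WHERE p.FileName = s.FileName
                    AND p.Picture  = s.Picture);

If comparing the varbinary column directly turns out to be too slow for large pictures, a stored HASHBYTES value over the Picture column could serve as the comparison key instead.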

Error 0xc002f210

I have just started learning SQL Server. I am trying to use the Import/Export Wizard to import data from an Excel file into one of the tables in the database, but I am getting error 0xc002f210.
I only understood that it is treating the length of the Excel column as 255 while the SQL Server column length is different. I am unable to understand why this is happening.
Validating (Error)
Warning 0x802092a7: Data Flow Task 1: Truncation may occur due to inserting data from data flow column "Name" with a length of 255 to database column "Name" with a length of 50.
(SQL Server Import and Export Wizard)
Warning 0x802092a7: Data Flow Task 1: Truncation may occur due to inserting data from data flow column "GroupName" with a length of 255 to database column "GroupName" with a length of 50.
(SQL Server Import and Export Wizard)
Error 0xc002f210: Preparation SQL Task 1: Executing the query "TRUNCATE TABLE [HumanResources].[Department]
" failed with the following error: "Cannot truncate table 'HumanResources.Department' because it is being referenced by a FOREIGN KEY constraint.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.
(SQL Server Import and Export Wizard)
So the error you are receiving is different from the warnings that you are getting. The warnings are simply informing you that you are taking a larger column and putting it into a smaller column, so there is the possibility of data truncation.
The error gives you some insight into the purpose of using a foreign key constraint. The TechNet article linked below will give you an in-depth understanding of it if you read through it.
TechNet
But essentially, referential integrity by way of foreign key constraints ensures that the "link" between data cannot be broken on the primary key side (the Department table). In order to delete data from the primary key table, you must first either remove the foreign key constraint or delete the data from the referencing (foreign key) table.
You should re-evaluate two things:
1) Whether you actually should be truncating the primary key table.
2) Which table has a foreign key linked to the Department table's primary key (a query for finding this is sketched below).
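For point 2, a quick way to find the referencing table is to query the catalog views. A sketch, using the HumanResources.Department name from the error message:

    -- Lists every foreign key that references HumanResources.Department
    -- and the table each one belongs to.
    SELECT fk.name                          AS ForeignKeyName,
           OBJECT_NAME(fk.parent_object_id) AS ReferencingTable
    FROM sys.foreign_keys AS fk
    WHERE fk.referenced_object_id = OBJECT_ID(N'HumanResources.Department');

Note that TRUNCATE TABLE is blocked by the mere existence of a referencing foreign key, even when the referencing table is empty, whereas a plain DELETE only fails if referencing rows actually exist.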

Changing columns to identity (SQL Server)

My company has an application with a bunch of database tables that used to use a sequence table to determine the next value to use. Recently, we switched this to using an identity property. The problem is that in order to upgrade a client to the latest version of the software, we have to change about 150 tables to identity. To do this manually, you can right click on a table, choose design, change (Is Identity) to "Yes" and then save the table. From what I understand, in the background, SQL Server exports this to a temporary table, drops the table and then copies everything back into the new table. Clients may have their own unique indexes and possibly other things specific to the client, so making a generic script isn't really an option.
It would be really awesome if there was a stored procedure for scripting this task rather than doing it in the GUI (which takes FOREVER). We made a macro that can go through and do this, but even then, it takes a long time to run and is error prone. Something like: exec sp_change_to_identity 'table_name', 'column name'
Does something like this exist? If not, how would you handle this situation?
Update: This is SQL Server 2008 R2.
This is what SSMS seems to do:
1) Obtain and drop all the foreign keys pointing to the original table.
2) Obtain the indexes, triggers, foreign keys and statistics of the original table.
3) Create a temp_table with the same schema as the original table, but with the identity field.
4) Insert into temp_table all the rows from the original table (with IDENTITY_INSERT ON).
5) Drop the original table (this also drops its indexes, triggers, foreign keys and statistics).
6) Rename temp_table to the original table name.
7) Recreate the foreign keys obtained in (1).
8) Recreate the objects obtained in (2).
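A heavily simplified sketch of steps 3 to 6 for a single hypothetical table dbo.MyTable with key column Id and one other column Col1 (all names assumed; the real procedure would also have to script out and recreate the foreign keys, indexes, triggers and statistics covered by steps 1, 2, 7 and 8):

    -- 3) New table with the same shape, but Id defined as an identity column.
    CREATE TABLE dbo.MyTable_tmp
    (
        Id   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        Col1 varchar(50)       NOT NULL
    );

    -- 4) Copy the existing rows, keeping the current key values.
    SET IDENTITY_INSERT dbo.MyTable_tmp ON;

    INSERT INTO dbo.MyTable_tmp (Id, Col1)
    SELECT Id, Col1
    FROM dbo.MyTable;

    SET IDENTITY_INSERT dbo.MyTable_tmp OFF;

    -- 5) and 6) Swap the new table in under the original name.
    DROP TABLE dbo.MyTable;
    EXEC sp_rename 'dbo.MyTable_tmp', 'MyTable';

A sp_change_to_identity-style procedure would essentially have to generate statements like these dynamically for each table; the full data copy is also what makes the designer approach so slow on large tables.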
