SSIS - combine results only if key doesn't exist in first dataset - sql-server

I am trying to combine two inventory sources with SSIS. The first contains inventory information from our new system, while the second contains legacy data. I am getting the data from the sources just fine.
Both data sets have the same columns, but I only want to get the results from the second data set if the ItemCode value for that record doesn't exist in the first data set.
Which transform would I need to use to achieve this?
Edit - here is what I have so far in my data flow.
I need to add a transform to the Extract Legacy Item Data source so that it will remove records whose item codes already exist in the Extract New Item Data source.
The two sources are on different servers so I cannot resolve by amending the query. I would also like to avoid running the same query that is run in the Extract New Item Data source.

If both sources are SQL Server databases stored on the same server, you can use a SQL command as the source to achieve that:
SELECT Inventory2.*
FROM Inventory2 LEFT JOIN Inventory1
  ON Inventory2.ItemCode = Inventory1.ItemCode
WHERE Inventory1.ItemCode IS NULL
OR
SELECT *
FROM Inventory2
WHERE NOT EXISTS (SELECT 1 FROM Inventory1 WHERE Inventory2.ItemCode = Inventory1.ItemCode)

An example of this is below. Using a SQL Server Destination will work fine, but it only allows loading to a local SQL Server instance, something you may want to keep in mind for the future. While a Lookup typically performs better, a Merge Join can be beneficial in certain circumstances, such as when many additional columns are introduced into the Data Flow, as may happen with your data sets. It looks like @Hadi has covered how to do this with a Lookup, so you may want to test both approaches in a non-production environment that mimics production, then assess the results to determine the better option.
Start off by creating a staging table that is an exact clone of one of the tables. Either table will work since they have the same definition. Make sure all columns in the staging table allow null values.
Add an Execute SQL Task that clears the staging table before the Data Flow Task runs, either by truncating it or by dropping and re-creating it.
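A rough sketch of both steps is below; the staging table name ItemStaging and every column other than ItemCode are placeholders, since only ItemCode is known from the question.
-- Staging clone with all columns nullable; adjust to match your real inventory table definition.
CREATE TABLE dbo.ItemStaging
(
    ItemCode    NVARCHAR(50)  NULL,  -- join key present in both sources
    Description NVARCHAR(255) NULL,  -- placeholder column
    Quantity    INT           NULL   -- placeholder column
);

-- Run this in the Execute SQL Task so the staging table is empty before each load.
TRUNCATE TABLE dbo.ItemStaging;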
Since ItemCode is unique, sort on this column in each OLE DB Source. If you haven't already, change the Data Access Mode to SQL command in both OLE DB Sources and add an ORDER BY clause on ItemCode. Then right-click each OLE DB Source and go to Show Advanced Editor > Input and Output Properties > OLE DB Source Output > Output Columns > select ItemCode and set the SortKeyPosition property to 1 (assuming you sort ascending in the SQL statement).
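For example, the SQL command in each OLE DB Source might look like the following; the table name and the columns other than ItemCode are assumptions.
-- The ORDER BY must match the SortKeyPosition/IsSorted settings made in the Advanced Editor.
SELECT ItemCode, Description, Quantity
FROM dbo.Inventory
ORDER BY ItemCode ASC;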
Next add a Merge Join in the Data Flow Task. This transformation requires both inputs to be sorted, which is why the sources were sorted above. You can set this up either way, but for this example use the OLE DB Source that should only be used when the ItemCode does not exist as the left input of the Merge Join. Use a left outer join and the ItemCode column as the join key by dragging a line from one column to the other in the GUI. Add all the columns from the OLE DB Source that you want to use when the same ItemCode is in both data sets (from what I can tell this is Extract New Item Data; adjust if it isn't) by checking the check-box next to them in the Merge Join editor. Use an output alias prefix that will help you distinguish these, for example X_ItemCode for the matching rows.
After the Merge Join add a Conditional Split. This divides the records based on whether X_ItemCode was found. For the expression of the first output, use the ISNULL function to test whether there was a match from the left outer join. For example, ISNULL(X_ItemCode) != TRUE indicates that the ItemCode exists in both data sets. You can call this output Matching Rows. The default output will contain the non-matches; to make it easier to distinguish, you can rename the default output Non-Matching Rows.
Connect the Matching Rows output to the destination table. In this destination, map only the columns from the source you want to use when ItemCode exists in both data sets, i.e. the X_ prefixed columns such as X_ItemCode.
Add another SQL Server Destination in the Data Flow and connect the Non-Matching Rows output to it, with all the columns mapped from the rows that did not match, i.e. the ones without the X_ prefix in this example.
Back on the Control Flow in the package, add another Data Flow Task after this one. Use the staging table as the OLE DB Source and the destination table as the SQL Server Destination. Sorting isn't necessary here.

First of all, concerning the fact that you are using a SQL Server Destination, I suggest reading the following answer from the SSIS guru @billinkc:
Should SSIS packages and SQL database be on same server?
I will provide different methods to achieve that:
(1) Using Lookup transformation
You should add a Data Flow Task, where you add the second inventory (legacy) as the source.
Add a Lookup transformation, selecting the first inventory source as the lookup table.
Map the source and the lookup table on the ItemCode column.
In the Lookup transformation, select Redirect rows to no match output from the drop-down list.
Use the Lookup no match output to get the desired rows (those not found in the first inventory source).
You can refer to the link below; it contains a step-by-step tutorial.
Helpful link
UNDERSTAND SSIS LOOKUP TRANSFORMATION WITH AN EXAMPLE STEP BY STEP
Old Versions of SSIS
If you are using an older version of SSIS, you will not find the Redirect rows to no match output drop-down list. Instead, go to the Lookup Error Output, select the Redirect Row option for the No Match situation, and use the error output to get the desired rows.
(2) Using Linked Servers
On the second inventory server, create a Linked Server that connects to the first server. You will then be able to use a SQL command that selects only the rows not found in the first source:
SELECT *
FROM Inventory2
WHERE NOT EXISTS (SELECT 1 FROM <Linked Server>.<database>.<schema>.Inventory1 Inv1 WHERE Inventory2.ItemCode = Inv1.ItemCode)
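If the linked server does not exist yet, a minimal sketch of creating one is below; the alias ALPHA_SRV and the data source name are placeholders for your first server, and security should be mapped to suit your environment.
EXEC sp_addlinkedserver
     @server = N'ALPHA_SRV',        -- local alias used in four-part names
     @srvproduct = N'',
     @provider = N'SQLNCLI',
     @datasrc = N'FirstServerName'; -- network name of the first server

EXEC sp_addlinkedsrvlogin
     @rmtsrvname = N'ALPHA_SRV',
     @useself = N'TRUE';            -- or map explicit credentials instead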
(3) Staging table + MERGE, MERGE JOIN, or UNION ALL transformation
In each source's SQL command, add a fixed-value column that contains the source id (1, 2), for example:
SELECT *, 1 as SourceID FROM Inventory
You can combine both sources into one staging destination using one of the transformations listed above, then add a second Data Flow Task to import the distinct data from the staging table into the destination based on the ItemCode column, for example:
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY ItemCode ORDER BY SourceID) rn
    FROM StagingTable ) s
WHERE s.rn = 1
This returns all rows from SourceID = 1 plus the new rows from SourceID = 2.
To learn more about Merge, Merge Join and UNION ALL transformation you can refer to one of the following links:
Learn SSIS : MERGE, MERGE JOIN and UNION ALL
SSIS Do I Union All or Merge??
Using the SSIS Merge Join
How to get unmatched data between two sources in SSIS Data Flow?
Note: check the answer provided by @userfl89; it contains very detailed information about using the Merge Join transformation and describes another approach that can help. Now you have to test which approach fits your needs. Good luck.

Related

OLE DB Source to Multiple destination with Different Column

What would be the best way to handle a use case where:
I have an old DB source containing 10 columns
This source data needs to go to three places, each with different fields from the source
Excel 1 (5 fields from the source)
Excel 2 with different fields than the previous Excel
A SQL Server table with another combination of fields
Using a Script Component to choose columns seems to be an option. Multicast does not provide the ability to pick and choose specific columns.
Please see the picture for my solution. I need to know if there is another option to achieve this.
Here are some tips that may help:
Avoid Script Components
Instead of adding script components to select specific columns, in each OLE DB Destination, just don't map these columns.
Example (image reference: how to assign a constant value to a column in oledb destination in ssis)
Select specific columns in the OLEDB Source
If there are columns in the OLE DB Source that won't be used in any of the destinations, it is better to change the Access Mode to SQL Command instead of Table or View and specify only the needed columns in the SELECT query. For example, if the table contains 5 columns [Col1], [Col2], ... [Col5] and you only need [Col1] and [Col2], use the following query:
Select [Col1],[Col2] From [Table]
Instead of selecting the table name.
For more information:
SSIS OLE DB Source Editor Data Access Mode: “SQL command” vs “Table or view”
There isn't a better approach than what you have. Instead of adding Script Components, in each OLE DB Destination just don't map the columns that you don't want to use in that destination.

Iterative UPDATE loop in SQL Server

I would really like to find some kind of automation for this issue I am facing.
A client has had a database attached to their front end site for a few years now, and until this date has been inputting certain location information as a numeric code (i.e. County/State data).
They would now like to replace these values with their corresponding nvarchar values (e.g. instead of having '8' in their County column, they want it to read 'Clermont County', and so on, for upwards of 90 separate entries).
I have been provided with a 2-column excel sheet, one with the old county numeric code and one with the text equivalent they request. I have imported this to a temp table, but cannot find a fast way of iteratively matching and updating these values.
I don't really want to write a 90-line CASE WHEN block and type out each county name manually; that opens the door to human error, etc.
Is there something much simpler I don't know about what I can do here?
I realize that it might be a bit late, but just in case someone else is searching and comes across this answer...
There are two ways to handle this: In Excel, or in SQL Server.
1. In Excel
Create a concatenated string in one of the available columns that meets your criteria, i.e.
=CONCATENATE("UPDATE some_table SET some_field = '",B2,"' WHERE some_field = ",A2)
You can then auto-fill this column all the way down the list, and thus get 90 different update statements which you can then copy and paste into a query window and run. Each one will say
UPDATE some_table SET some_field = 'MyCounty' WHERE some_field = X
Each one will be specific to a case; therefore, you can run them sequentially and get the desired result, or...
2. In SQL Server
If you can import the data to a table then all you need to do is write a simple query with a JOIN which handles the case, i.e.
UPDATE T1
SET T1.County_Name = T2.Name
FROM Some_Table T1 -- The original table to be updated
INNER JOIN List_Table T2 -- The imported table from an Excel spreadsheet
ON T1.CountyCode = T2.Code
;
In this case, each row of your original Some_Table is joined to the imported data on the county code (CountyCode = Code), and the name field is updated with the name for that code from the imported data, giving you the same result as the Excel option, minus a bit of typing.

Mule - Record cannot be mapped as it contains multiple columns with the same label

I need to run a join query against a MS SQL Server 2014 DB based on a column name value. The same query runs fine when querying the DB directly, but when running the query through Mule I'm getting an error. The query looks something like this:
SELECT * FROM sch.emple JOIN sch.dept on sch.emple.empid = sch.dept.empid;
The above query works fine when run directly against the MS SQL Server DB, but gives the following error through MuleSoft.
Record cannot be mapped as it contains multiple columns with the same label. Define column aliases to solve this problem (java.lang.IllegalArgumentException). Message payload is of type: String
Please help me out.
Specify the column list explicitly:
SELECT e.<col1>, e.<col2>, ...., d.<col1>,...
FROM sch.emple AS e
JOIN sch.dept AS d
ON e.empid = d.empid;
Remarks:
You could use aliases instead of schema.table_name
SELECT * in production code is bad practice in 95% of cases.
The duplicated column is empid (there may be others). You could add aliases for it, e.g. e.empid AS emple_empid and d.empid AS dept_empid, or just specify e.empid only once (see the sketch below).
To avoid specifying all columns manually, you can drag and drop the column list from Object Explorer into the query pane, as in Drag and Drop Column List into query window.
A second way is to use a plugin like Redgate SQL Prompt to expand SELECT *:
Image from: https://www.simple-talk.com/sql/sql-tools/sql-server-intellisense-vs.-red-gate-sql-prompt/
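As a concrete sketch of the aliasing remark above, the query could be rewritten as follows; every column name except empid is a placeholder for whatever your emple and dept tables actually contain.
SELECT e.empid AS emple_empid,   -- duplicate column, disambiguated
       d.empid AS dept_empid,    -- duplicate column, disambiguated
       e.name,                   -- placeholder for the remaining emple columns
       d.deptname                -- placeholder for the remaining dept columns
FROM sch.emple AS e
JOIN sch.dept AS d
  ON e.empid = d.empid;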
Addendum
But the same query works directly.
It works because you don't bind the columns. Please read carefully the link I provided for the SELECT * antipattern, and especially:
Binding Problems
When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions of SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
by Dave Markle

SSIS dynamic columns validation

I'm trying to use Dynamic Column mapping by selecting the destination table using the Variable Name option in the OLEDB destination. I'm getting the error: "OLE DB Destination" failed validation and returned validation status "VS_NEEDSNEWMETADATA".
I understand from what I've read that Dynamic column validation is not possible in SSIS. But then, why is it possible to select table destination in OLEDB using a variable name? Isn't it dynamic column mapping?
What I'm trying to do is create a foreach loop to read a list of tables and import these tables from the source DB to the staging area. Using the Variable Name destination within OLEDB seems perfect to me, but it does not work, even after enabling DelayValidation on the data flow.
Thanks,
Rodrigo
Why would I use a TableName from Variable for my OLE DB Destination?
I automate the heck out of my SSIS package development. Instead of having to specify each table name, I have a variable called FullyQualifiedName that I populate once and then reuse throughout my package. Think of a truncate-and-reload pattern: an Execute SQL Task to clear out the target table, a Foreach Loop to load all the files (either because the names are dynamic or because I have multiple days' worth of data to load), and then archive the file. I'd need to reference that table at least twice in that scenario. By having the table name in a variable, I can define it once and reference it in many different locations.
I have worked in environments where we physically isolate data based on the customer, e.g. Blackstone.Sales, Yampas.Sales, Ranger.Sales, etc. When a customer logs in, their account can only access data in their schema. The tables are identical in structure but they have different names to ensure isolation. For a scenario like that, you could be matching file name to target table and therefore want to use a variable to control which table is written to.
As you've already determined, you cannot accomplish dynamic column mapping in the manner you are attempting. If it's a straight copy from source to your staging environment, I'd just use a technology like Biml to generate the packages and be done with it.
I have faced and worked on such requests. No, SSIS won't allow you dynamic column mappings, so I tried something along the lines of the following:
You first need to use your knowledge of the system to put together a sort of configuration table that tells you the following things (a sketch of such a table follows this list):
- Source table (SourceTable)
- Columns to be extracted from the source table (SourceQuery)
  HINT: a SELECT query, e.g. SELECT ID, Name, Salary FROM dbo.tblEmployee
- Destination table (DestinationTable)
- Columns which need to be fed from the source
- A few other details like server name/connection properties, etc.
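A sketch of what such a configuration table might look like; the name ETLMappings (referenced again below) and the exact columns are illustrative only and should be adapted to your environment.
CREATE TABLE dbo.ETLMappings
(
    MappingID        INT IDENTITY(1,1) PRIMARY KEY,
    SourceTable      NVARCHAR(256) NOT NULL,
    SourceQuery      NVARCHAR(MAX) NOT NULL, -- e.g. SELECT ID, Name, Salary FROM dbo.tblEmployee
    DestinationTable NVARCHAR(256) NOT NULL,
    DestinationQuery NVARCHAR(MAX) NOT NULL, -- e.g. INSERT INTO DestTable1 SELECT ... FROM StgData
    ServerName       NVARCHAR(256) NULL      -- other connection details as needed
);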
You would need to later traverse through the rows of this table using a ForEach Loop container.
Next, identify the maximum number of columns and the maximum length of the data types across the source columns that might be up for extraction. You will need this information to create a table in the next step.
Create a sort of staging table let's say StgData. I will create this table with 50 columns, all of data type NVARCHAR(MAX). The CREATE statement should look like:
CREATE TABLE StgData
(
Column1 NVARCHAR(MAX),
Column2 NVARCHAR(MAX),
Column3 NVARCHAR(MAX),
....
Column50 NVARCHAR(MAX)
)
The raw data would be loaded onto StgData.
Now have a ForEach Loop container traverse the rows of the configuration table (ETLMappings).
Inside it, use an INSERT statement in an Execute SQL Task to load the data.
The statement inside the task would look like:
INSERT INTO dbo.StgData
?
The ? corresponds to the SourceQuery column (which should be captured by the ForEach container).
Once StgData is loaded, it should be used to load the DestinationTable (also captured in the ForEach Loop container).
Again, you need a good understanding of the schema and column mapping. The configuration table should have a column which stores a SQL query of the form
INSERT INTO DestTable1 SELECT Col1, CAST(Col2 as float) Col2 FROM StgData
Something on those lines.
This is just a basic structure. Of course, a lot of formatting and customization has to be added.

How do you get an SSIS package to only insert new records when copying data between servers

I am copying some user data from one SqlServer to another. Call them Alpha and Beta. The SSIS package runs on Beta and it gets the rows on Alpha that meet a certain condition. The package then adds the rows to Beta's table. Pretty simple and that works great.
The problem is that I only want to add new rows into Beta. Normally I would just do something simple like....
INSERT INTO BetaPeople
SELECT * From AlphaPeople
where ID NOT IN (SELECT ID FROM BetaPeople)
But this doesn't work in an SSIS package. At least I don't know how and that is the point of this question. How would one go about doing this across servers?
Your example seems simple; it looks like you are adding only new people, not looking for changed data in existing records. In this case, store the last transferred ID in the DB.
CREATE TABLE dbo.LAST (RW int, LastID int)
GO
INSERT INTO dbo.LAST (RW, LastID) VALUES (1, 0)
Now you can use this to store the last ID of the rows transferred.
UPDATE dbo.LAST SET LastID = @myLastID WHERE RW = 1
When selecting OLEDB source, set data access mode to SQL Command and use
DECLARE @Last int
SET @Last = (SELECT LastID FROM dbo.LAST WHERE RW = 1)
SELECT * FROM AlphaPeople WHERE ID > @Last;
Note, I do assume that you are using ID int IDENTITY for your PK.
If you have to monitor existing records for data changes, then add a "last changed" column to every table and store the time of the last transfer.
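A minimal sketch of that approach, assuming a LastChanged column on the source table and a one-row control table holding the time of the last transfer; both names are illustrative.
DECLARE @LastTransfer datetime =
    (SELECT LastTransferTime FROM dbo.TransferControl WHERE RW = 1);

SELECT *
FROM AlphaPeople
WHERE LastChanged > @LastTransfer;  -- only rows changed since the previous run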
A different technique would involve setting-up a linked server on Beta to Alpha and running your example without using SSIS. I would expect this to be way slower and more resource intensive than the SSIS solution.
INSERT INTO dbo.BetaPeople
SELECT * FROM [Alpha].[myDB].[dbo].[AlphaPeople]
WHERE ID NOT IN (SELECT ID FROM dbo.BetaPeople)
Add a Lookup between your source and destination.
Right-click the Lookup box to open the Lookup Transformation Editor.
Choose [Redirect rows to no match output].
Open Columns and map your key columns.
Add an entry with the table key in the lookup column, lookup operation as
Connect the Lookup box to the destination, choosing [Lookup no Match Output].
The simplest method I have used is as follows:
Query Alpha in a Source task in a Dataflow and bring in records to the data flow.
Perform any needed Transformations.
Before writing to the Destination (Beta), perform a lookup matching the ID column from Alpha to those in Beta. On the first page of the Lookup Transformation Editor, make sure you select "Redirect rows to no match output" from the drop-down list "Specify how to handle rows with no matching entries".
Link the Lookup task to the Destination. This will give you a prompt where you can specify that it is the unmatched rows that you want to insert.
This is the classical delta detection issue. The best solution is to use Change Data Capture, with or without SSIS. If what you are looking for is a once-in-a-lifetime activity, there is no need to go for SSIS; use other means such as a linked server and compare with the existing records.
The following should solve the issue of loading changed and new records using SSIS:
Extract data from the Source using a Data Flow.
Extract data from the Target.
Match on the primary key, then split the records into matched and unmatched records from the Source and matched records from the Target; call them Matched_Source, Unmatched_Source and Matched_Target.
Compare Matched_Source and Matched_Target and split Matched_Source into Changed and Unchanged.
Null-load (truncate) the TempChanged table.
Add the Changed records to TempChanged.
Execute a SQL script/stored proc to delete records from the Target whose primary key is in TempChanged, then add the records in TempChanged to the Target (see the sketch after this list).
Add Unmatched_Source to the Target.
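A hedged sketch of that Execute SQL step; the table name Target and the key column ID are placeholders for your actual destination table and primary key.
-- Remove the rows that have changed...
DELETE t
FROM dbo.Target AS t
WHERE EXISTS (SELECT 1 FROM dbo.TempChanged AS c WHERE c.ID = t.ID);

-- ...then re-insert their current versions from the temp table.
INSERT INTO dbo.Target
SELECT * FROM dbo.TempChanged;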
Another solution would be to use a temporary table.
In the properties of Beta's connection manager, change RetainSameConnection to true (by default SSIS runs each query in its own connection, which would mean the temporary table would be dropped as soon as it had been created).
Create a SQL Task using Beta's connection and use the following SQL to create your temporary table:
SELECT TOP 0 *
INTO ##beta_temp
FROM BetaPeople
Next create a data flow that pulls data from Alpha and loads it into ##beta_temp (you will need to run the SQL statement above in SSMS first so that Visual Studio can see the table at design time, and you will also need to set the DelayValidation property to true on the Data Flow task).
Now you have two tables on the same server and you can just use your example SQL modified to use the temporary table.
INSERT INTO BetaPeople
SELECT * FROM ##beta_temp
WHERE ID NOT IN (SELECT ID FROM BetaPeople)
