Using Merge Join in SSIS to merge two similar tables - sql-server

I have two tables on two different servers that have identical schemas. For simplicity, let's say each table has 3 fields. I am trying to create an SSIS package that takes the data from both tables and merges it into one recordset.
I added two OLE DB sources, which get the same three fields from the two tables. I then have a Sort transformation on each, that then flows into a Merge Join. I set the join type to "Full Outer Join" on the Merge Join. I can select all six of the fields and see the output using a Data Viewer coming out of the Merge Join.
Let's say there were 25 records in each source table. In the Data Viewer, I end up getting 50 records: 25 with NULL in the last three fields, and 25 with NULL in the first three fields. I would like the output to be 50 records with data in only three fields. What am I doing wrong here? Should I be using some other sort of merge option?
I would greatly appreciate any suggestions on how to resolve what should be a simple task. Thanks!

The output from the Merge Join is correct for a full outer join: the columns of the two inputs are placed side by side, and a row with no partner on the other side is padded with NULLs. To fix your problem, use a Merge transformation instead of a Merge Join. It combines your two sorted data flows into one sorted data flow with the same three columns. From your description, the rest of your data flow is already set up correctly (two sources, each followed by a Sort, feeding the Merge).
Documentation can be found here: https://msdn.microsoft.com/en-us/library/ms141703.aspx
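To make the difference concrete, here is a minimal Python sketch (not SSIS code) that contrasts the two behaviours. The sample rows and the function names `full_outer_join` and `merge_sorted` are made up for illustration; the sketch assumes the first field is the sort key and that keys are unique within each input.

```python
import heapq

# Hypothetical rows (key, name, qty) from the two source tables, already sorted by key.
server_a = [(1, "bolt", 10), (3, "nut", 5)]
server_b = [(2, "gear", 7), (4, "cog", 2)]


def full_outer_join(left, right):
    """Join-style output: columns of both inputs sit side by side, padded with None."""
    left_by_key = {row[0]: row for row in left}
    right_by_key = {row[0]: row for row in right}
    empty = (None, None, None)
    return [
        left_by_key.get(key, empty) + right_by_key.get(key, empty)
        for key in sorted(left_by_key.keys() | right_by_key.keys())
    ]


def merge_sorted(left, right):
    """Merge-style output: rows are interleaved in key order, column count is unchanged."""
    return list(heapq.merge(left, right))


joined = full_outer_join(server_a, server_b)  # six columns per row, half of them None
merged = merge_sorted(server_a, server_b)     # three columns per row
```

With no keys in common, the join-style result has one six-column row per input row, which is the pattern seen in the Data Viewer, while the merge-style result keeps three columns.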

Related

SSIS - combine results only if key doesn't exist in first dataset

I am trying to combine two inventory sources with SSIS. The first contains inventory information from our new system, while the second contains legacy data. I am getting the data from the sources just fine.
Both data sets have the same columns, but I only want to get the results from the second data set if the ItemCode value for that record doesn't exist in the first data set.
Which transform would I need to use to achieve this?
Edit - here is what I have so far in my data flow.
I need to add a transform to the Extract Legacy Item Data source so that it will remove records whose item codes already exist in the Extract New Item Data source.
The two sources are on different servers so I cannot resolve by amending the query. I would also like to avoid running the same query that is run in the Extract New Item Data source.
If both sources are SQL databases stored on the same server, you can use a SQL command as the source to achieve that:
SELECT Inverntory2.*
FROM Inverntory2 LEFT JOIN Inverntory1
ON Inverntory2.ItemCode = Inverntory1.ItemCode
WHERE Inverntory1.ItemCode IS NULL
or:
SELECT *
FROM Inverntory2
WHERE NOT EXISTS (SELECT 1 FROM Inverntory1 WHERE Inverntory2.ItemCode = Inverntory1.ItemCode)
An example using a Merge Join is below. A SQL Server Destination will work, but it can only load into a local SQL Server instance, which you may want to keep in mind for the future. While a Lookup typically performs better, a Merge Join can be beneficial in certain circumstances, such as when many additional columns are introduced into the Data Flow, as may be the case with your data sets. It looks like @Hadi has covered how to do this with a Lookup, so you may want to test both approaches in a non-production environment that mimics production, then compare the results to determine the better option.
Start off by creating a staging table which is an exact clone of one of the tables. Either table will work since they have the same definition. Make sure all columns in the staging table allow null values.
Add an Execute SQL Task to clear the staging table before the Data Flow Task by either truncating or dropping and then creating the table.
Since ItemCode is unique, sort on this column in each OLE DB Source. If you aren't already doing so, change the Data Access Mode to SQL command in both OLE DB Sources and add an ORDER BY clause on ItemCode. Then tell SSIS that the output is sorted: right-click the OLE DB Source and go to Show Advanced Editor > Input and Output Properties > OLE DB Source Output, set the IsSorted property to True, then under Output Columns select ItemCode and set its SortKeyPosition property to 1 (assuming the SQL statement sorts in ascending order).
Next add a Merge Join in the Data Flow Task. This requires both inputs to be sorted, which is why the inputs are now sorted. You can do this either way, but for this example use the OLE DB Source that will only be used when ItemCode does not exist as the Merge Join left input. Use a left outer join and the ItemCode column as the join key by dragging a line from one to the other in the GUI. Add all the columns from the OLE DB Source that you want to use when the same ItemCode is in both data sets (from what I could tell this is Extract New Item Data, please adjust this if it isn't) by checking the check-box next to them in the Merge Join editor. Use an output alias prefix that will help you distinguish these, for example X_ItemCode for the matching rows.
After the Merge Join add a Conditional Split. This divides the records based on whether X_ItemCode was found. For the expression of the first output, use the ISNULL function to test if there was a match from the left outer join. For example, ISNULL(X_ItemCode) != TRUE indicates that the ItemCode exists in both data sets. You can call this output Matching Rows. The default output will contain the non-matches. To make it easier to distinguish, you can rename the default output Non-Matching Rows.
Connect the Matching Rows output to the destination table. In the mappings, use only the columns from the source you want to keep when ItemCode exists in both data sets, i.e. the X_ prefixed columns such as X_ItemCode.
Add another SQL Server Destination in the Data Flow and connect the Non-Matching Rows output to it, mapping the columns from the rows that did not match, i.e. the ones without the X_ prefix in this example.
Back on the Control Flow in the package, add another Data Flow Task after this one. Use the staging table as the OLE DB Source and the destination table as the SQL Server Destination. Sorting isn't necessary here.
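The left outer join followed by the Conditional Split can be sketched in Python as below. This is only an illustration of the row logic, not SSIS code: the function names and sample rows are hypothetical, and a dictionary lookup stands in for the sorted merge that the Merge Join actually performs.

```python
def left_outer_join(left_rows, right_rows, key, prefix="X_"):
    """Attach prefixed right-side columns to each left row; None when there is no match."""
    right_by_key = {row[key]: row for row in right_rows}
    right_columns = list(right_rows[0]) if right_rows else [key]
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        extra = {prefix + c: (match[c] if match else None) for c in right_columns}
        joined.append({**row, **extra})
    return joined


def conditional_split(rows, flag_column="X_ItemCode"):
    """Route rows by whether the prefixed key column was populated by the join."""
    matching = [r for r in rows if r[flag_column] is not None]
    non_matching = [r for r in rows if r[flag_column] is None]
    return matching, non_matching


# Hypothetical data: legacy is the left input, new is the right input.
legacy = [{"ItemCode": "B2", "Qty": 9}, {"ItemCode": "C3", "Qty": 1}]
new = [{"ItemCode": "A1", "Qty": 5}, {"ItemCode": "B2", "Qty": 3}]

matching, non_matching = conditional_split(left_outer_join(legacy, new, "ItemCode"))
```

Here B2 lands in the matching output with its X_ columns filled from the new source, and C3 lands in the non-matching output with the X_ columns empty.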
First of all, since you are using a SQL Server Destination, I suggest reading the following answer from the SSIS guru @billinkc:
Should SSIS packages and SQL database be on same server?
I will provide different methods to achieve that:
(1) Using Lookup transformation
Add a Data Flow Task, and in it add the second inventory (legacy) as the source.
Add a Lookup transformation and select the first inventory source as the lookup table.
Map the source to the lookup table on the ItemCode column.
In the Lookup transformation, select Redirect rows to no match output from the drop-down list.
Use the Lookup No Match Output to get the desired rows (those not found in the first inventory source).
You can refer to the link below, which contains a step-by-step tutorial.
Helpful link
UNDERSTAND SSIS LOOKUP TRANSFORMATION WITH AN EXAMPLE STEP BY STEP
Old Versions of SSIS
If you are using old versions of SSIS, then you will not find the Redirect rows to no match output drop down list. Instead you should go to the Lookup Error output, select Redirect Row option for No Match situation, and use the error output to get the desired rows.
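As a rough illustration of what the no match output gives you, here is a small Python sketch of the same row logic. It is not SSIS code; `lookup_split` and the sample rows are hypothetical, and the set of cached keys only loosely mirrors a Lookup that loads its reference data up front.

```python
def lookup_split(source_rows, reference_rows, key):
    """Split source_rows into (match, no_match) by whether key exists in reference_rows."""
    cached_keys = {row[key] for row in reference_rows}  # reference keys loaded once
    match, no_match = [], []
    for row in source_rows:
        (match if row[key] in cached_keys else no_match).append(row)
    return match, no_match


# Hypothetical data: the new system is the lookup table, legacy is the source.
new_items = [{"ItemCode": "A1", "Qty": 5}, {"ItemCode": "B2", "Qty": 3}]
legacy_items = [{"ItemCode": "B2", "Qty": 9}, {"ItemCode": "C3", "Qty": 1}]

matched, legacy_only = lookup_split(legacy_items, new_items, "ItemCode")
```

Only `legacy_only` (here the C3 row) would be passed on and combined with the rows from the new system.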
(2) Using Linked Servers
On the second inventory server, create a Linked Server so that it can connect to the first server. You will then be able to use a SQL command that selects only the rows not found in the first source:
SELECT *
FROM Inverntory2
WHERE NOT EXISTS (SELECT 1 FROM <Linked Server>.<database>.<schema>.Inverntory1 Inv1 WHERE Inverntory2.ItemCode = Inv1.ItemCode)
(3) Staging table + MERGE, MERGE JOIN, or UNION ALL transformation
In each source SQL command, add a fixed-value column that contains the source id (1 or 2), for example:
SELECT *, 1 as SourceID FROM Inventory
You can combine both sources into one staging table using one of the transformations listed above, then add a second Data Flow Task to import distinct data from the staging table into the destination based on the ItemCode column, for example:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY ItemCode ORDER BY SourceID) rn
FROM StagingTable ) s
WHERE s.rn = 1
Then it will return all rows from SourceId =1 and the new rows from SourceId = 2
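The effect of the ROW_NUMBER() filter can be sketched in Python as follows. This is only an illustration of the logic with made-up rows; `keep_preferred` is a hypothetical helper, not part of SSIS or SQL Server.

```python
def keep_preferred(rows):
    """Keep, per ItemCode, the row with the lowest SourceID (the row where rn = 1)."""
    best = {}
    for row in rows:
        code = row["ItemCode"]
        if code not in best or row["SourceID"] < best[code]["SourceID"]:
            best[code] = row
    return sorted(best.values(), key=lambda r: r["ItemCode"])


# Hypothetical staging rows: B2 exists in both sources.
staging = [
    {"ItemCode": "A1", "SourceID": 1},
    {"ItemCode": "B2", "SourceID": 1},
    {"ItemCode": "B2", "SourceID": 2},
    {"ItemCode": "C3", "SourceID": 2},
]

result = keep_preferred(staging)
```

For B2 the row from source 1 wins, while C3 is kept from source 2 because it has no counterpart in source 1.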
To learn more about Merge, Merge Join and UNION ALL transformation you can refer to one of the following links:
Learn SSIS : MERGE, MERGE JOIN and UNION ALL
SSIS Do I Union All or Merge??
Using the SSIS Merge Join
How to get unmatched data between two sources in SSIS Data Flow?
Note: check the answer provided by @userfl89. It contains very detailed information about using the Merge Join transformation and describes another approach that can help. You will have to test which approach fits your needs. Good luck.

SSIS Join Recordset With Table

I have an SSIS package in which I'm reading the records from a Flat File and storing them in a recordset. Is it possible to compare the values in the recordset with the values in a database table and update the table?
I'm using SQL Server 2008 R2 and the same version of SSIS.
Leran2002's answer is right in general: the most straightforward way is to set up a Lookup component with Redirect rows to no match output, followed by a destination and an OLE DB Command.
However depending on the size of the result sets, this might be slow, since the lookup component will check each row one-by-one and if your destination table has lots of records, this will take some time. Furthermore, depending on your cache settings in the lookup component, it can use lots of memory.
There are two more ways to achieve this:
Merge Join
Using your file source and your destination table as sources, you can use a Merge Join. The logic in the DFT is a bit more complex, but this is a more set-based approach and it works better with large result sets.
You'll have to implement the logic which record has to be updated, inserted, deleted or discarded from the file using a conditional split component.
I highly recommend this question (not exactly your problem, but a good comparison in my opinion): What are the differences between Merge Join and Lookup transformations in SSIS?
Staging table
Another way is to use a staging table to temporarily store the records from a file. In this case, your DFT just loads the records from a file into the staging table, then with one or more Execute SQL Task you can do the merging of the two data sets. (UPDATE, INSERT, DELETE, MERGE, you can use what fits your needs).
Usually I use the Lookup component with the option Redirect rows to no match output.
After that you can use the two rowsets named Lookup No Match Output and Lookup Match Output.
PS. I have three articles about SSIS, but they are in Russian (they do contain a lot of SQL scripts and pictures).
If you are interested, you can look at the following link: https://habrahabr.ru/post/330618/

SSIS Merge Join on two different DB in different sql servers won't join all rows

I have two databases residing in two different SQL servers (Server A and Server B) and I am trying to run a MERGE join in SSIS on a common column called "Name". I got the two tables sorted by "Name" and I did set the SortKey as 1 for "Name" column in both the Source OLE DB Output properties. I then selected the columns from both the tables to display and used INNER join and selected a destination empty table (with both column names from the two source servers) in Server C as Destination OLE DB Server. Everything looks good and package executes successfully without any errors and warnings.
But, out of 542 rows, only 35 rows match and it should match 405. When I specify LEFT JOIN in Merge Join transformation, I get 542 rows with 507 rows having NULL values from Server B (which again means it found a match only for 35 rows and not all 405).
Have tried using RTRIM on Name column from both the sources without any success.
Have tried using UPPER case on Name column from both the sources without any success as well :(
I don't get this issue when I do JOINS on same 2 databases in powershell using Invoke-SqlCommand, but when I do SSIS way, it only JOINS on 35 rows.
Can someone suggest what could be the issue?
Found out the issue. The two source tables were not sorted, so I had to write a SQL statement to ORDER BY Name on both sources and it worked perfectly. Hope it helps someone!
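Why unsorted inputs lose matches without raising any error can be seen in a small Python sketch of a single-pass merge join. This is a simplified illustration, not the SSIS implementation; the function name and the sample names are made up, and it assumes unique keys.

```python
def merge_inner_join(left, right):
    """Single-pass inner join that assumes both lists are sorted ascending with unique keys."""
    i = j = 0
    matches = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            matches.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1  # left value is smaller, so it can no longer find a partner
        else:
            j += 1  # right value is smaller, so it can no longer find a partner
    return matches


# Hypothetical data: the same four names on both sides, but names_a is not sorted.
names_a = ["delta", "alpha", "charlie", "bravo"]
names_b = ["alpha", "bravo", "charlie", "delta"]

unsorted_matches = merge_inner_join(names_a, names_b)          # only "delta" is found
sorted_matches = merge_inner_join(sorted(names_a), names_b)    # all four are found
```

Because the join only ever moves forward, it skips past rows it would have matched had the data really been in order. Marking a column with a sort key tells the component to trust the order; it does not sort the data.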

Mule - Record cannot be mapped as it contains multiple columns with the same label

I need to run a join query against an MS SQL Server 2014 DB based on a column value. The query runs when executed directly against the DB, but I get an error when running it through Mule. The query looks something like this:
SELECT * FROM sch.emple JOIN sch.dept on sch.emple.empid = sch.dept.empid;
The above query works fine when run directly against the MS SQL Server DB, but gives the following error through MuleSoft.
Record cannot be mapped as it contains multiple columns with the same label. Define column aliases to solve this problem (java.lang.IllegalArgumentException). Message payload is of type: String
Please help me out.
Specify the column list explicitly:
SELECT e.<col1>, e.<col2>, ...., d.<col1>,...
FROM sch.emple AS e
JOIN sch.dept AS d
ON e.empid = d.empid;
Remarks:
You could use aliases instead of schema.table_name.
SELECT * in production code is bad practice in 95% of cases.
The column with a duplicate label is empid (there may be more). You could alias it as e.empid AS emple_empid and d.empid AS dept_empid, or just select e.empid once.
To avoid typing all the columns manually, you could drag and drop them from Object Explorer into the query pane, as described in Drag and Drop Column List into query window.
A second way is to use a plugin like Redgate SQL Prompt to expand SELECT *:
Image from: https://www.simple-talk.com/sql/sql-tools/sql-server-intellisense-vs.-red-gate-sql-prompt/
Addendum
But the same query works directly.
It works because you don't bind them. Please read carefully the link I provided on the SELECT * antipattern, especially:
Binding Problems
When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
by Dave Markle
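The label collision is easy to reproduce outside Mule. The sketch below uses Python's built-in sqlite3 module purely as a stand-in database; the tables mirror the question's emple and dept without the sch schema, and the extra columns name and dept_name are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emple (empid INTEGER, name TEXT);
    CREATE TABLE dept  (empid INTEGER, dept_name TEXT);
    INSERT INTO emple VALUES (1, 'Ann');
    INSERT INTO dept  VALUES (1, 'Sales');
""")


def labels(sql):
    """Return the column labels a consumer would see for this query."""
    cursor = conn.execute(sql)
    return [column[0] for column in cursor.description]


def has_duplicates(names):
    return len(names) != len(set(names))


# SELECT * returns empid from both tables under the same label.
star_labels = labels("SELECT * FROM emple JOIN dept ON emple.empid = dept.empid")

# Explicit columns with aliases give every column a distinct label.
aliased_labels = labels(
    "SELECT e.empid AS emple_empid, e.name, d.empid AS dept_empid, d.dept_name "
    "FROM emple AS e JOIN dept AS d ON e.empid = d.empid"
)
```

A query tool can display two columns with the same label, but a consumer that maps each record to a structure keyed by label, as the Mule error message suggests, cannot.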

finding duplicates between my list and what's in a database

I would like some advice on how to implement a solution to the following problem.
I have a list of objects (hundreds of elements, around 500-1000 or more).
I have a table in the database with records for such objects. The database has millions of records.
I need to send the list of objects to the database and get back the list of duplicates, if any are found.
The initial solution, loading everything from the database into Java and then comparing the lists, is a bad one: we run out of memory trying to load all the millions of records from the database.
Is there some identifier in the object by which you can look it up in the database?
If yes, you can do the following:
Get the identifiers for your list of objects
Put them into a SELECT statement to see which are already in the database
Put the objects that are not yet in the table into an INSERT statement
If the list you get in 1 is too big for a SELECT, you can also put them into a temporary table and do a JOIN statement with the table of objects.
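The steps above can be sketched as follows. The question is about Java, so treat this Python version as an illustration of the query pattern only; it uses the built-in sqlite3 module as a stand-in database, and the table name objects, the column id, and the helper `find_existing` are all hypothetical.

```python
import sqlite3


def find_existing(conn, candidate_ids, batch_size=500):
    """Return the candidate ids that already exist in the objects table."""
    found = set()
    # Query in batches so no single statement carries too many parameters.
    for start in range(0, len(candidate_ids), batch_size):
        batch = candidate_ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            f"SELECT id FROM objects WHERE id IN ({placeholders})", batch
        )
        found.update(row[0] for row in rows)
    return found


# Stand-in table holding the even ids from 0 to 998.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO objects VALUES (?)", [(n,) for n in range(0, 1000, 2)])

candidates = [1, 2, 3, 4, 999, 1000]
duplicates = find_existing(conn, candidates, batch_size=4)
new_ids = [c for c in candidates if c not in duplicates]
```

Only the identifiers travel to the database and only the matches come back, so memory use depends on the size of your list rather than on the size of the table.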
Cheers

Resources