SSIS Lookup transformation to prevent duplicate entries - sql-server

I am trying to import some excel files to a sql server table using SSIS.
But problem is like when we consolidate data from all excel files to one then there is a chance that it may contain duplicate record.
to solve this I used Lookup transformation with "Lookup no match output" but no luck.
Can someone explain how to make it work with Lookup transform?
please refer attached image

Could you use a sort task and check the 'Remove rows with duplicate sort values' box?

Related

Count how many fields where cleansed and which fields on SSIS

I'm doing an exercise in which I have to clean data from a Flat File Source and write it on my Database. I have already managed to clean all of the fields by using some data quality rules for each field and also generate error codes which I write to a different table when a rule is broken.
My problem is that for the final step of the exercise I have to generate some Power BI graphics in which it shows how many fields were fixed from the source and which fields where cleansed. The only thing that I have thought compares the DB table to the flat file source or maybe do something with script components but I don't really think that those are really good solutions.
Has anybody encountered this problem? if somebody could point me out for info for something like this, it would be great. Thanks!
If I am facing a similar issue, I will do this in three steps:
Importing data without any transformation to a staging table
Cleaning data and loading it into the destination table
Comparing staging and destination table to get how many values were fixed.
From design standpoint - establishing a key is central before starting to clean.
Use could use SSIS derived column transformation to create a business key that is a concatenation of available fields to create a unique key, using FindString function and string functions.
Similar to the above step add a column in your staging table or use a derived column (depending on if you are using sql cleanup or ssis tasks to cleanup) to indicate if it was cleaned or not.

Ignoring column from Excel file while importing to SQL Server

I have multiple Excel files that have the same format. I need to import them into SQL Server.
The issue I currently have is that there are two text columns that I need to ignore completely as they are free text and the character length for some rows exceeds what the server allows me to import which results in a truncation error.
Because I don't need these columns for my analysis, the table I'm importing to doesn't include these columns but for some reason the SSIS packages still picks up those columns and cuts the import job halfway through.
I tried using max character length for those columns which still results in the truncation error.
I need to create an SSIS package that ignores the two columns completely without deleting the columns from Excel.
You can specify which columns you need to ignore from the Edit Mappings dialog.
I have added the image for your reference:
If you just create the SSIS package in SSDT the Excel file can be queried to return only the required columns. In the package, create an Excel Connection Manager using the Excel file. Then on the Control Flow of the package add a Data Flow Task that has an Excel Source component in it. On this source, change the data access mode to SQL command and the file can then be queried similar to SQL. In the following example TabName is the name of the Excel tab containing the data that will be returned. If either the tab or any column names contain spaces they will need to be enclosed in square brackets, i.e. TabName would be [Tab Name].
Import/Export Wizard
Since you mentioned in the comments that you are using SQL Server Import/Export Wizard. You can solve that if you have a fixed columns (range) that you are looking to import (example: first 10 columns).
In Import/Export wizard, after selecting destination options you will be asked if you want to read from tables or query:
Select the query option, then use a simple select query and specify the columns range after the sheet name. As example:
SELECT * FROM [Sheet1$A:C]
The query above will read from the first 3 columns in Sheet1 since A:C represent the range between first column A and third column C.
Now, you can check the columns from the Edit Mappings dialog:
SSIS
You can use the same logic within SSIS package, just write the same SQL command in the Excel Source after changing the Access Mode to SQL Command.
The solution is simple. I needed to write a query that will exclude the columns. So instead of selecting "Copy data from one or more tables" you select "write a query" and exclude the columns you don't need. This one worked 100%

Excel Source SSIS

I have an SSIS package with an Excel Source reads an Excel table. I currently am using the Table or View Data Access Mode and it is literally reading every row in the worksheet, 1,048,576 which is the maximum.
The source worksheet has an Excel table on it named PSA_DATA. Why isn't this table in the Table or View drop down? There is an option for the worksheet followed by _FilterDatabase but this fails when I run the package even though it pulls the correct data when I press Preview. Wouldn't this make more sense than using the SQL Command and SELECT * FROM [fact_PSA$Ax:Bx]? The whole reason we use Named Ranges and Tables in Excel is because they are dynamic! Now I have to hard code the range in every time with rows numbers?
What am I missing here? Is there an easier way I am missing? I just want to move an Excel table into a SQL table! Why don't doesn't the most ubiquitous piece of software in the world easily talk to the second most ubiquitous piece of software in the world!?!?!
If the sheet name is not shown in Table or view combobox, it is not a bad idea to use a Sql Command.
But When using SQL Comand to read from excel it is not necessary to specify a range, OLEDB will take used range by default just use the following command
SELECT * FROM [fact_PSA$]
Workaround
you can try reading your excel file from a script task or a script component, you can follow one of the following links to achieve this:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/2d45f180-9fd0-4224-a298-cb99e2b2100a/how-to-read-the-contents-of-excel-file-through-ssis-script-task-without-the-headers?forum=sqlintegrationservices
https://msdn.microsoft.com/en-us/library/ms403358.aspx
http://billfellows.blogspot.com/2013/04/ssis-excel-source-via-script.html
Side Note: there are many links you can follow to import data from excel to SQL using SSIS:
http://www.sqlshack.com/using-ssis-packages-import-ms-excel-data-database/
https://www.mssqltips.com/sqlservertip/2770/importing-data-from-excel-using-ssis--part-1/
https://www.simple-talk.com/sql/ssis/moving-data-from-excel-to-sql-server-10-steps-to-follow/
https://www.simple-talk.com/sql/ssis/importing-excel-data-into-sql-server-via-ssis-questions-you-were-too-shy-to-ask/
I appreciate the links to work-arounds, but I didn't really get an answer to my question. Why can't we reference an EXCEL TABLE (not a worksheet) from the SSIS Excel Source???
I ended up using the SQL Command data access mode with this query:
SELECT * FROM [fact_PSA$A:W]
WHERE fact_PSA_ID IS NOT NULL
Somehow, using SQL stopped it from reading every possible row in the worksheet even though the range provided is set for "A:W" which is every row. I guess the "WHERE fact_PSA_ID" limits the rows read before it hits the SSIS source.

How to read and change series numbers in columns SSIS?

I'm trying to manipulate a column in SSIS which looks like below after i removed unwanted rows with derived column and conditional split in my data flow task. The source for this is a flatfile.
XXX008001161022061116030S1TVCO3057
XXX008002161022061146015S1PUAG1523
XXX009001161022063116030S1DVLD3002
XXX009002161022063146030S1TVCO3057
XXX009003161022063216015S1PUAG1523
XXX010001161022065059030S1MVMA3020
XXX010002161022065129030S1TVCO3057
XXX01000316102206515901551PPE01504
The first three numbers from the left (starting with "008" first row) represent a series, and the next three ("001") represent another number within the series. what i need is to change all of the first three numbers starting from "001" to the end.
The desired reslut would thus look like:
XXX001001161022061116030S1TVCO3057
XXX001002161022061146015S1PUAG1523
XXX002001161022063116030S1DVLD3002
XXX002002161022063146030S1TVCO3057
XXX002003161022063216015S1PUAG1523
XXX003001161022065059030S1MVMA3020
XXX003002161022065129030S1TVCO3057
XXX00300316102206515901551PPE01504
...
My potential solution would be to load the file to a temporary database table and query it with SQL from there, but i am trying to avoid this.
The final destination is a flatfile.
Does anybody have any ideas how to pull this off in SSIS? Other solutions are appreciated also.
Thanks in advance
I would definitely use the staging table approach and use windows functions to accomplish this. I could see a use case if SSIS was on another machine than the database engine and there was a need to offload the processing to the SSIS box.
In that case I would create a script transformation. You can process each row and make the necessary changes before passing the row to the output. You can use C# or VB.
There are many examples out there. Here is MSDN article - https://msdn.microsoft.com/en-us/library/ms136114.aspx

excel powerpivot update error "object was not found in the cube"

I have an PowerPivot file that pulls data directly from a SQL data warehouse. Next it is fed into pivot tables. When I try and update I get the following error:
Query (20,3916) The level '&[Desktop]' object was not found in the cube when the string, [OfficeFlatFile].TopicLevel2Name]&[Desktop], was parsed.
I checked my data source and found that the member "Desktop" was no longer available (no surprise there). But I can't get the file to update now. I tried updating the PowerPivot data connection first but that didn't work either.
This is the most recent info I could find, and it doesn't help.
https://connect.microsoft.com/SQLServer/feedback/details/756691/powerpivot-data-could-not-be-retrieved-from-the-exteral-data-source
Does anyone know a solution apart from rebuilding the file?
You know, xlsx (xlsm) files are set of xml files zipped.
Try to open your Excel file with WinRar (7zip, etc) program.
Then go to xl/pivotTables folder. There you should find pivotTable1.xml file.
Then manually delete corresponding item from the .xml
Then save the changes you made and open your Excel file with the pivotTable.
Since you manually deleted the "Desktop" item there will be no error.
Remove the . in column names that are in the join. Reference.
Removing the "." in the column's name worked for me.
The cleanest solution I found up to now is to use your previous working model (the one which worked fine before the update) and find all the pivots where you were filtered on "Desktop". Set these filters to "All" and then run your update.
This way you don't lose your pivot table, which sometimes is a big rework to rebuild, specially when you had charts and other dependencies linked to such pivot.
in my case I had many sheets with power pivot reports.
one of them caused the error.
removing this excel sheet and setting filters to all in other reports solved the problem.
I used the option of clear filter in Pivot table analyse tab and after that this error is solved
I was getting the error:
The query did not run or the Data Model could not be accessed. Here's the message we got:
Query (x,y) the level '<column Name.' object> was not found in the cube when the string was parsed.
Solution:
If there is a Slicer connected to the pivot table, disconnect it from the pivot table and remove all filters from the pivot table.
If there is not Slicer connected to the pivot table, just remove all filters from the pivot table.
Please check in the database if the values in the filter column (of the pivot table) have changed. If you have the older working version of the pivot table report, compare it to check the difference.
Thanks

Resources