Count how many fields were cleansed and which fields in SSIS - sql-server

I'm doing an exercise in which I have to clean data from a Flat File Source and write it to my database. I have already managed to clean all of the fields by applying data quality rules to each field, and I also generate error codes, which I write to a different table when a rule is broken.
My problem is that, for the final step of the exercise, I have to generate some Power BI graphics that show how many fields from the source were fixed and which fields were cleansed. The only approaches I have thought of are comparing the DB table to the flat file source or maybe doing something with Script Components, but I don't think those are really good solutions.
Has anybody encountered this problem? If somebody could point me to information on something like this, it would be great. Thanks!

If I were facing a similar issue, I would do this in three steps:
Importing data without any transformation to a staging table
Cleaning data and loading it into the destination table
Comparing staging and destination table to get how many values were fixed.

From a design standpoint, establishing a key is essential before you start cleaning.
You could use an SSIS Derived Column transformation to build a business key by concatenating the available fields into a unique value, using FINDSTRING and other string functions.
Similarly, add a column to your staging table, or use a Derived Column (depending on whether you do the cleanup in SQL or in SSIS tasks), to indicate whether each row was cleaned.
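To make the comparison step concrete, here is a minimal sketch of the logic in Python, outside SSIS. It assumes the staging and destination rows have already been read into dictionaries and share a business key; the function name, field names and sample values are hypothetical, and the same idea could be written as a SQL join between the two tables.

```python
from collections import Counter


def count_fixed_fields(staging_rows, clean_rows, key):
    """Count, per field, how many values differ between staging and destination.

    Rows are matched on the business key; staging rows without a match
    in the destination are skipped.
    """
    clean_by_key = {row[key]: row for row in clean_rows}
    fixed = Counter()
    for raw in staging_rows:
        cleaned = clean_by_key.get(raw[key])
        if cleaned is None:
            continue  # row was rejected or never loaded
        for field, raw_value in raw.items():
            if field != key and cleaned.get(field) != raw_value:
                fixed[field] += 1
    return fixed


# Hypothetical sample data
staging = [
    {"id": 1, "name": " bob ", "city": "NY"},
    {"id": 2, "name": "Al", "city": "la"},
]
destination = [
    {"id": 1, "name": "Bob", "city": "NY"},
    {"id": 2, "name": "Al", "city": "LA"},
]
fixed_counts = count_fixed_fields(staging, destination, key="id")
```

The resulting per-field counts (and their sum) are the kind of figures that could be written to a small summary table and charted in Power BI.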

Related

Importing Excel Data Seems to Randomly Give Null Values

Using SSIS for Visual Studio 2017 for some Excel file imports.
I've created a package with several loop containers that call specific packages to handle some files. I have an issue with one particular package: it seemingly randomly decides that the data for certain columns is NULL, depending on the Excel file. I was/am under the impression that this is related to the TypeGuessRows registry setting (changed initially to 0, then to 1000 as a test), located at
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\14.0\Access Connectivity Engine\Engines\Excel
The reason I think this is that the various files being brought in generally have the same data, but if the first few rows of a column in the source data contain only numbers, the rows with mixed values are not brought in correctly. All other columns seem fine.
Looking at the source files, all have the same datatype.
I've tried changing the registry TypeGuessRows value and ensured that the output column property was string-based instead of numerical.
The connection string has IMEX=1
So I fixed it. Or at least found a sufficient workaround that should help anyone in my situation. I think it has to do with the cache of SSIS.
I ended up putting a sort on the problem column so that the records being read as NULL for having a different data type are read first and are no longer treated as outliers. I will say, I tried this initially and it didn't work.
Through a little experiment of making a new data flow in the same package, I discovered that this solution actually does work, hence my thinking that the cache was the issue.
If anyone has any further questions on this, let me know.
This issue is related to the OLE DB provider used to read Excel files. Since Excel is not a database in which each column has a specific data type, the OLE DB provider tries to identify the dominant data type in each column and replaces all values that cannot be parsed as that type with NULLs.
There are many articles found online discussing this issue and giving several workarounds (links listed below).
But after using SSIS for years, I can say that the best practice is to convert Excel files to CSV files and read them using Flat File components.
Or, if converting Excel to flat files is not an option, you can force the Excel connection manager not to treat the first row as headers by adding HDR=NO to the connection string, and add IMEX=1 to tell the OLE DB provider to determine data types from the first row (which is the header row, so usually all strings). In this case all columns are imported as strings and no values are replaced with NULLs, but you lose the column headers and get an additional row (the header row is imported as data).
If you cannot ignore the header row, just add a dummy row containing dummy string values (for example: aaa) after the header row and add IMEX=1 to the connection string.
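For reference, this is roughly what such a connection string looks like. The small Python helper below only assembles the string; the helper name and the file path are hypothetical, and the provider name (Microsoft.ACE.OLEDB.12.0) and the "Excel 12.0 Xml" property are assumptions that fit .xlsx files, so adjust them to whatever your own connection manager already uses.

```python
def excel_connection_string(path, header_row=False, imex=True):
    """Build an Excel OLE DB connection string with HDR and IMEX settings.

    header_row=False emits HDR=NO, so the first row is read as data.
    imex=True emits IMEX=1 (import mode for mixed-type columns).
    """
    props = ["Excel 12.0 Xml", "HDR=YES" if header_row else "HDR=NO"]
    if imex:
        props.append("IMEX=1")
    extended = ";".join(props)
    return (
        "Provider=Microsoft.ACE.OLEDB.12.0;"
        f"Data Source={path};"
        f'Extended Properties="{extended}";'
    )


# Hypothetical file path
conn_str = excel_connection_string(r"C:\data\input.xlsx")
```

In SSIS the resulting value would go into the ConnectionString property of the Excel connection manager (directly or through an expression).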
Helpful links
SSIS Excel Data Import - Mixed data type in Rows
Mixed data types in Excel column
Importing data from Excel having Mixed Data Types in a column (SSIS)
Why SSIS always gets Excel data types wrong, and how to fix it!
EXCEL IN SSIS: FIXING THE WRONG DATA TYPES
IMEX= 1 extended properties in ssis

Reduce column size and also trim data, in production database, handle constraints/dependencies on same column

I have a scenario where a Java developer has changed the variable that is used to transfer data from column col of table tbl.
Now I have to change the column from varchar(15) to varchar(10). But before making this change, I have to handle the existing data and the constraints/dependencies on that column.
What would be the best sequence for doing so?
I am thinking of checking the constraints first, then trimming the existing data, and then altering the table.
Please suggest how to handle the constraints/dependencies and, before handling them, how to check for such dependencies.
Schema evolution (the DDL changes that happen over time to tables and columns in a database, while preserving existing data and functionality) is a well-understood topic with several solutions, some of which are RDBMS-independent, while others are built into the RDBMS.
A key requirement for production environments is to have both a forward change and a backout, each of which can be run unattended.
Many open source advocates use Liquibase, which also has a commercial variant.
Db2 for Linux/Unix/Windows also offers a built-in stored procedure, SYSPROC.ALTOBJ, which helps to automate various schema-evolution alterations, including decreasing the size of a column. You would need to study its documentation carefully and test it fully on non-production environments until you are satisfied. Read about it here:
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0011934.html
You can of course write your own script in whatever language you prefer, including SQL, but remember that you should also build and test a backout script.
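If you go the home-grown route, one check worth running before the forward change is which existing values would actually be affected, and whether shortening them would create duplicates (which matters if the column has a unique or primary key constraint). Below is a minimal sketch of that check in Python. It assumes "trimming" means keeping the first 10 characters and that the column values have already been fetched into a list; the function name and sample values are hypothetical.

```python
from collections import defaultdict


def truncation_report(values, new_length):
    """Report what shortening a column to new_length characters would do.

    Returns (too_long, collisions):
      too_long   - values longer than new_length (these would be cut)
      collisions - truncated value -> distinct originals that collapse onto it
    """
    too_long = [v for v in values if v is not None and len(v) > new_length]
    by_trimmed = defaultdict(set)
    for v in values:
        if v is not None:
            by_trimmed[v[:new_length]].add(v)
    collisions = {
        trimmed: sorted(originals)
        for trimmed, originals in by_trimmed.items()
        if len(originals) > 1
    }
    return too_long, collisions


# Hypothetical column contents, including a NULL
values = ["ABCDEFGHIJKL", "ABCDEFGHIJXY", "SHORT", None]
too_long, collisions = truncation_report(values, 10)
```

Saving the original values of the affected rows before the change is also what a backout script would need in order to restore them.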

How to read and change series numbers in columns SSIS?

I'm trying to manipulate a column in SSIS which looks like the sample below after I removed unwanted rows with a Derived Column and a Conditional Split in my Data Flow Task. The source for this is a flat file.
XXX008001161022061116030S1TVCO3057
XXX008002161022061146015S1PUAG1523
XXX009001161022063116030S1DVLD3002
XXX009002161022063146030S1TVCO3057
XXX009003161022063216015S1PUAG1523
XXX010001161022065059030S1MVMA3020
XXX010002161022065129030S1TVCO3057
XXX01000316102206515901551PPE01504
The first three digits from the left (starting with "008" in the first row) represent a series, and the next three ("001") represent another number within the series. What I need is to renumber the series so that they start from "001" and run consecutively to the end.
The desired result would thus look like:
XXX001001161022061116030S1TVCO3057
XXX001002161022061146015S1PUAG1523
XXX002001161022063116030S1DVLD3002
XXX002002161022063146030S1TVCO3057
XXX002003161022063216015S1PUAG1523
XXX003001161022065059030S1MVMA3020
XXX003002161022065129030S1TVCO3057
XXX00300316102206515901551PPE01504
...
My potential solution would be to load the file into a temporary database table and query it with SQL from there, but I am trying to avoid this.
The final destination is a flat file.
Does anybody have any ideas how to pull this off in SSIS? Other solutions are appreciated also.
Thanks in advance
I would definitely use the staging table approach and use window functions to accomplish this. I could see a use case for doing it in SSIS if SSIS were on a different machine from the database engine and there were a need to offload the processing to the SSIS box.
In that case I would create a script transformation. You can process each row and make the necessary changes before passing the row to the output. You can use C# or VB.
There are many examples out there. Here is an MSDN article: https://msdn.microsoft.com/en-us/library/ms136114.aspx
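The row-by-row logic itself is small. Here is a sketch of it in Python rather than C# or VB, just to show the idea; it assumes the rows arrive in file order, that the series occupies characters 4 to 6 of each row (as in the sample data), and the function name is hypothetical. Inside a script transformation the same counter and "previous series" variables would live as class-level fields.

```python
def renumber_series(rows, start=3, width=3):
    """Replace the series number in each row with a running counter.

    The counter starts at 1 and increases by one every time the original
    series value changes from the previous row.
    """
    renumbered = []
    previous = None
    counter = 0
    for row in rows:
        series = row[start:start + width]
        if series != previous:
            counter += 1
            previous = series
        new_series = str(counter).zfill(width)
        renumbered.append(row[:start] + new_series + row[start + width:])
    return renumbered


# Rows taken from the question
sample = [
    "XXX008001161022061116030S1TVCO3057",
    "XXX008002161022061146015S1PUAG1523",
    "XXX009001161022063116030S1DVLD3002",
    "XXX009002161022063146030S1TVCO3057",
    "XXX009003161022063216015S1PUAG1523",
    "XXX010001161022065059030S1MVMA3020",
]
result = renumber_series(sample)
```

Only the three series digits are rewritten; the rest of each row, including the number within the series, is passed through unchanged.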

SSIS common destination, multiple file inputs, different structures

I'm sure this question is a common one, but I'm having a real challenging time coming up with a modular design for what should not be an impossible task. I have a situation where I have common destination tables, about five or six of them, but multiple input files which need to be massaged into a certain format for insertion. I've been tasked with making the design modular so as we work with new data providers with different formats, the pieces of the package that handle the insertion don't change, nor the error reporting, etc., just the input side. I've suggested using a common file format which would mean taking the source files and then transforming them and running the rest of the common import process on them. It was suggested that I consider using tables for this process on the input side.
I guess what strikes me about this process is the fact that the package can be saved as a template and I can use the common pieces over and over and set up new connections as we work with other data providers. Outside of that, I could see resorting to custom code in a script task to ensure a common format to be inserted into common input tables, but that's as far as I've gotten.
If anyone has ever dealt with a situation as such, I would appreciate design suggestions to accommodate functionality for now and in the future.
Update: I think the layered architectural design that is being emphasized in this particular instance would be as such (which is why I find it confusing):
There would be six layers. They are as follows:
A. File acquisition
B. File Preparation
C. Data Translation to common file format (in XML)
D. Transformation of data to destination format (XML - preparation for insertion into database)
E. Insert into database
F. Post processing (reporting and output of erred out
Since we will be dealing with several different data providers, the steps would be the same for processing the data, but the individual steps themselves may differ between providers, if that makes sense. Example: we may get data from provider A as zipped CSV files, while provider B's files would be uncompressed XML. One provider might send files to us, while for another we may have to go and pick the files up (this would take place in the file acquisition step above).
So my questions are:
A. Is it more important to follow the architectural pattern here or to combine things where possible? One suggestion was to combine all the connection items in a single package as the top layer, so that a single package would handle things like making a service call, SFTP, FTP, and anything else that was needed. I'm not quite sure how one would do multiple connections for different providers when a schedule is needed. It just seems to complicate things... I'm thinking of a connection layer, but one that is specific to the provider, not a be-all and end-all.
B. Same thing with the file preparation layer.
I think a modular design is key, but stacking things into control tasks seems to make things more complicated in design than they should be. Any feedback or suggestions would be helpful in this area.
I would do what was suggested in another comment and import each file into the appropriate temp table first and then union them all later. This will give you more modularity and make adding or removing input files easier and make debugging easier because you can easily see which section failed. Here is an outline of what I would do:
Step 1 SQL Task:
Create the temp table (repeat as needed for each task with a unique table):
IF EXISTS (SELECT * FROM sys.objects
WHERE object_id = OBJECT_ID(N'[dbo].[tbl_xxxxxx_temp]')
AND type in (N'U'))
DROP TABLE [dbo].[tbl_xxxxxx_temp]
GO
CREATE TABLE [dbo].[tbl_xxxxxx_temp](
(columns go here)
) ON [PRIMARY]
GO
Step 2: Data Flow Task
Create a Data Flow Task and import each file into its own temp table, which you created above.
Step 3: Data Flow Task
Create a second DFT and connect each temp table to a Union All transformation (convert or derive columns as needed), and then connect the output to your static database table.
Step 4: SQL Task: Drop temp tables
DROP TABLE tbl_xxxxxx_temp
Please note it is necessary to set "DelayValidation" to True in each Data Flow Task in order for this to work.
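The underlying design idea (one small provider-specific piece per input format, all producing the same common shape, so that everything after it never changes) can be sketched outside SSIS as follows. This is only an illustration in Python: the provider layouts, field names and function names are all hypothetical, and in the package each parser would correspond to a provider-specific Data Flow Task loading its own temp table.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Common shape every provider-specific parser must produce (hypothetical)
COMMON_FIELDS = ("customer_id", "amount")


def parse_provider_a(text):
    """Provider A (hypothetical layout): CSV with columns CustID, Amt."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"customer_id": r["CustID"], "amount": r["Amt"]} for r in reader]


def parse_provider_b(text):
    """Provider B (hypothetical layout): <record id="..."><amount>..</amount>."""
    root = ET.fromstring(text)
    return [
        {"customer_id": rec.get("id"), "amount": rec.findtext("amount")}
        for rec in root.findall("record")
    ]


# Adding a provider means adding one parser and one entry here
PARSERS = {"A": parse_provider_a, "B": parse_provider_b}


def to_common(provider, text):
    """Run the provider-specific parser and check the common shape."""
    rows = PARSERS[provider](text)
    for row in rows:
        if set(row) != set(COMMON_FIELDS):
            raise ValueError(f"provider {provider} produced unexpected fields")
    return rows


rows_a = to_common("A", "CustID,Amt\n1,9.50\n")
rows_b = to_common("B", "<records><record id='2'><amount>3</amount></record></records>")
```

Everything downstream (the union, the insert, the error reporting) only ever sees the common fields, which is what keeps it unchanged when a new provider is added.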

Bad Data in Excel source does not generate error in SSIS

I have a quick question regarding SSIS. I am developing a package that performs a Data Flow Task from an Excel Source into an OLE DB connection. The columns in the database should allow nulls. However, there is a problem: when I enter bad data into the numeric columns in the Excel spreadsheet, it does not cause the Data Flow Task to fail as I would like it to. I tried to remedy this by explicitly converting the numeric columns in the Derived Column step, but the same thing occurs: if I enter abc into the Excel numeric column, it just turns out as NULL in the db after the package runs. I do want to allow NULLs, but I'd like the package to fail if the data is corrupt.
Any advice would be appreciated :)
I've just tried this, and the Ignore/Redirect/Fail setting doesn't appear to have any effect; NULLs get written to the database regardless.
If you didn't want NULLs, I would suggest amending the definition of your destination table to specify a NOT NULL constraint on the columns you wish to be numeric. That way the database update and the package will fail.
But since you want nullable columns, the only thing I can suggest is that you write a script task or script component to read and validate the data before accepting it.
Alternatively, read the Excel file into a staging area where all the columns are VARCHAR and then validate it via SQL.
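Whichever way you do it, the validation rule is the same: an empty value is fine (it becomes NULL), but a non-empty value that does not parse as a number should be reported so the package can fail. A minimal sketch of that rule in Python, assuming every column has been read as text; the function and column names are hypothetical:

```python
from decimal import Decimal, InvalidOperation


def find_bad_numeric(rows, numeric_columns):
    """Return (row_index, column, raw_value) for values that are not numeric.

    Missing or blank values are accepted because NULLs are allowed.
    """
    bad = []
    for index, row in enumerate(rows):
        for column in numeric_columns:
            raw = row.get(column)
            if raw is None or raw.strip() == "":
                continue  # NULL is allowed
            try:
                value = Decimal(raw.strip())
            except InvalidOperation:
                bad.append((index, column, raw))
                continue
            if not value.is_finite():
                # Decimal also parses "NaN" and "Infinity"; treat them as bad
                bad.append((index, column, raw))
    return bad


# Hypothetical staged rows, all text
staged = [{"qty": "12"}, {"qty": "abc"}, {"qty": ""}, {"qty": None}, {"qty": "NaN"}]
problems = find_bad_numeric(staged, ["qty"])
```

If the returned list is not empty, the script can raise an error so that the task, and with it the package, fails instead of silently loading NULLs.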
If you edit your SSIS task where you define the import you can choose the error handling for each column. There you can choose to set it to fail and stop, to ignore and go on, etc.
These links should help you handle it according to your needs:
http://sqlblog.com/blogs/rushabh_mehta/archive/2008/04/24/gracefully-handing-task-error-in-ssis-package.aspx
and
http://sqlserver360.blogspot.de/2011/03/error-handling-in-ssis.html
and
http://msdn.microsoft.com/en-us/library/ms141679.aspx
