SSIS: ETL Data Validation -> XML to SQL (xml data validation) - sql-server

I want to create an SSIS ETL package. I am new to SSIS, but I have learned the basics from the internet and started working on it.
Source = XML.
Destination = Microsoft SQL Server database, with 2 separate tables: one for good records and another for bad records.
My final result should look like this: each specific column with an error should go into the table below. If there are 20 bad fields in a single XML row, then 20 separate bad records should be written to that table.
Bad record structure:
[Slno]
[LoanNumber] = this is the primary key in my source, so it needs to be inserted for every bad data column.
[ErrorField] = the name of the input field in the XML that has the error.
[ErrorFieldValue] = the value of the field in error.
[ErrorMessage] = an error message based on the validation.
Input XML data = it has 5 rows of data, and each row has 100 data fields.
I need to validate each and every data field in the XML before putting it into the SQL database table.
I tried to validate using a Data Conversion transformation. For example, if (Amount) from the input data source is not a float, redirect the error to the SQL error table destination. But while mapping, I have to either map all the good fields or select one specific column; if there are 20-30 error fields in the input data, I am not able to validate them all and map the error values.
My required validations are length, alphanumeric and date. The length of a field should not be > 10, otherwise it should move to the error table; likewise, an amount should not be alphanumeric, and a date should be valid.
Please help me solve this.

I would suggest first generating an XSD that encodes all your business rules. Use this XSD to validate the XML, and then process only the valid XML in the Data Flow.
Here is a blog that talks about how to validate XML data using a Script Task. You can tweak that code further to capture the errors you are looking for, adding the appropriate OleDBConnection to write that data into your error tables.
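To make the "one bad record per bad field" requirement concrete, here is a minimal sketch in Python rather than in an SSIS Script Task. It is only an illustration of the logic: the field names (CustomerName, Amount, StartDate), the date format and the error messages are assumptions, not taken from the question. Only the output columns (LoanNumber, ErrorField, ErrorFieldValue, ErrorMessage) and the length limit of 10 come from the question.

```python
from datetime import datetime

MAX_LEN = 10  # length limit stated in the question


def check_length(value):
    """Return an error message if the value is longer than MAX_LEN."""
    return None if len(value) <= MAX_LEN else f"Length exceeds {MAX_LEN}"


def check_amount(value):
    """Return an error message if the value cannot be read as a number."""
    try:
        float(value)
        return None
    except ValueError:
        return "Amount is not numeric"


def check_date(value):
    """Return an error message if the value is not a valid date.

    The yyyy-mm-dd format is an assumption made for this sketch.
    """
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return None
    except ValueError:
        return "Date is not valid"


# Hypothetical mapping of field name -> checks to apply to that field.
RULES = {
    "CustomerName": [check_length],
    "Amount": [check_amount],
    "StartDate": [check_date],
}


def validate_row(row):
    """Return one error record per failed check; an empty list means a good row."""
    errors = []
    for field, checks in RULES.items():
        value = row.get(field, "")
        for check in checks:
            message = check(value)
            if message:
                errors.append({
                    "LoanNumber": row["LoanNumber"],
                    "ErrorField": field,
                    "ErrorFieldValue": value,
                    "ErrorMessage": message,
                })
    return errors
```

A row with three bad fields yields three error records, each carrying the same LoanNumber, which is the shape the bad-record table expects. Rows that return an empty list can go to the good-record table.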

Related

Read single text file and based on a particular value of a column load that record into its respective table

I have been searching the internet for a solution to my problem, but I cannot seem to find any info. I have a large single text file (10 million rows), and I need to create an SSIS package to load these records into different tables based on the transaction group assigned to each record. That is, Tx_grp1 would go into the Tx_Grp1 table, Tx_Grp2 would go into the Tx_Grp2 table, and so forth. There are 37 different transaction groups in the single delimited text file, and records are written to this file in the order in which they actually occurred (by time). Also, each transaction group has a different number of fields.
Sample data file
date|tx_grp1|field1|field2|field3
date|tx_grp2|field1|field2|field3|field4
date|tx_grp10|field1|field2
.......
Any suggestion on how to proceed would be greatly appreciated.
This task can be solved with SSIS; it just takes some experience. Here are the main steps and discussion:
Define a Flat File data source for your file, describing all columns. A possible problem here is that fields may have different data types depending on the tx_group value. If this is the case, I would declare all fields as sufficiently long strings and convert their types later in the data flow.
Create an OLE DB Connection Manager for the DB you will use to store the results.
Create a main data flow where you will process the file, and add a Flat File Source.
Add a Conditional Split to the output of the Flat File Source, and define as many filters and outputs there as you have transaction groups.
For each transaction group output, add a Data Conversion for fields if necessary. Note that you cannot change the data type of an existing column; if you need to cast a string to an int, create a new column.
For each destination table, add an OLE DB Destination. Connect it to the proper transaction group output and map the fields.
Basically, you are done. Test the package thoroughly on a test DB before using it on a production DB.
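The routing that the Conditional Split performs in the steps above can be sketched outside SSIS as follows. This is a Python illustration only; the function name and the "_unmatched" bucket are hypothetical, and the layout (date first, transaction group second, pipe-delimited) is taken from the sample file in the question.

```python
from collections import defaultdict


def split_by_group(lines, delimiter="|"):
    """Group records by transaction group, keeping every field as a string."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        if len(fields) < 2:
            # No transaction group present: keep the row aside for inspection.
            groups["_unmatched"].append(fields)
            continue
        date, group, *rest = fields
        # Each group keeps its own number of fields, as in the source file.
        groups[group.lower()].append([date, *rest])
    return dict(groups)
```

Each key of the returned dictionary plays the role of one Conditional Split output, and the type conversion for each group would happen afterwards, per destination table.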

SSIS lookup with numeric value not working

I am new to using SSIS and am on my third package. We are taking data from Oracle into SQL Server. On my Oracle table, the unique key is called recnum and is numeric(12,0). In this particular package, I am trying to take the record from Oracle, look it up in a SQL Server table to see if that unique key is found, and if not, add the record to the SQL Server table. My issue is that it wouldn't find a match. After much testing, I came up with the following method that works. But I don't understand why I had to do this.
How I currently have it working:
I get the data from oracle. In my next step, I added a derived column that uses the oracle column. (The expression is just that field, no other formatting.) Then in the lookup I use the derived column instead of the column from Oracle.
We had already done this on another table where the unique key was numeric(8,0) and it worked ok without needing a derived column.
SSIS is very fussy about data types; lookups only work nicely if the data types match.
Double-click the Data Path lines between Data Flow objects to check the data types. I use Data Conversion transformations or CAST statements to force matching data types when I use lookups.
Hope this helps.
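As a language-neutral illustration of why a lookup can miss when the key types differ, here is a small Python sketch. It does not reproduce SSIS internals; it only shows that a key that looks identical fails to match when one side is text and the other is a number, and that casting both sides to one type fixes it. The function name is hypothetical.

```python
from decimal import Decimal


def find_new_records(source_keys, existing_keys):
    """Return the source keys that are missing from existing_keys.

    Both sides are cast to Decimal first, so '123' and 123 compare as equal.
    """
    existing = {Decimal(str(key)) for key in existing_keys}
    return [key for key in source_keys if Decimal(str(key)) not in existing]


# Without the cast, the same key held as text does not match the number.
naive_match = "123456789012" in {123456789012}
```

Here naive_match is False, while find_new_records reports only the keys that are genuinely absent.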

Bad Data in Excel source does not generate error in SSIS

I have a quick question regarding SSIS. I am developing a package that performs a Data Flow task from an Excel Source into an OLE DB Connection. The columns in the database should allow nulls. However, when I enter bad data into the numeric columns in the Excel spreadsheet, it does not cause the Data Flow task to fail as I would like it to. I tried to remedy this by explicitly converting the numeric columns in the Derived Column step, but the same thing occurs: if I enter abc into the Excel numeric column, it just turns out as NULL in the db after the package runs. I do want to allow for NULLs, but I'd like the package to fail if the data is corrupt.
Any advice would be appreciated :)
I've just tried this, and the Ignore/Redirect/Fail setting doesn't appear to have any effect; NULLs get written to the database regardless.
If you didn't want NULLs I would suggest that you amend the definition of your destination table to specify a NOT NULL constraint on the columns you wish to be numeric. That way the database update and the package will fail.
But since you want null columns the only thing I can suggest is that you write a script task or script component to read and validate the data before accepting it.
Alternatively, read the Excel file into a staging area where all the columns are VARCHAR and then validate it via SQL.
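The validation rule wanted here (blank means NULL and is allowed, anything else must be a number, and corrupt data must fail the load) can be sketched as below. This is a Python illustration of the logic a Script Component or a staging-table check would apply, not SSIS code; the function names are hypothetical.

```python
def parse_nullable_number(text):
    """Return None for a blank cell, a float for a number; raise ValueError otherwise."""
    if text is None or text.strip() == "":
        return None
    return float(text)


def validate_staged_column(values):
    """Parse a staged text column, failing loudly if any cell is corrupt."""
    parsed, bad = [], []
    for row_number, text in enumerate(values, start=1):
        try:
            parsed.append(parse_nullable_number(text))
        except ValueError:
            bad.append((row_number, text))
    if bad:
        # Report every offending cell rather than silently storing NULL.
        raise ValueError(f"Corrupt numeric data: {bad}")
    return parsed
```

The point of reading everything as text first is that the distinction between "empty" and "not a number" is still visible at validation time, instead of both arriving as NULL.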
If you edit the SSIS task where you define the import, you can choose the error handling for each column. There you can set it to fail and stop, to ignore and continue, and so on.
These links should help you handle it according to your needs:
http://sqlblog.com/blogs/rushabh_mehta/archive/2008/04/24/gracefully-handing-task-error-in-ssis-package.aspx
and
http://sqlserver360.blogspot.de/2011/03/error-handling-in-ssis.html
and
http://msdn.microsoft.com/en-us/library/ms141679.aspx

Recommended way of adding daily data to database

I receive new data files every day. Right now, I'm building the database with all the required tables to import the data and perform the required calculations.
Should I just append each new day's data to my current tables? Each file contains a date column, which would allow for a "WHERE" query in the future if I need to analyze data for one particular day. Or should I be creating a new set of tables for every day?
I'm new to database design (coming from Excel). I will be using SQL Server for this.
Assuming that the structure of the data being received is the same, you should only need one set of tables rather than creating new tables each day.
I'd recommend storing the value of the date column from your incoming data in your database, and also having a 'CreateDate' column in your tables, with a default value of 'GetDate()' so that it automatically gets populated with the current date when the row is inserted.
You may also want to have another column to store the data filename that the row was imported from, but if you're already storing the value of the date column and the date that the row was inserted, this shouldn't really be necessary.
In the past, when doing this type of activity using a custom data loader application, I've also found it useful to create log files to log success/error/warning messages, including some type of unique key of the source data and target database, i.e. if coming from an Excel file and going into a database column, you could store the row index from Excel and the primary key of the inserted row. This helps in tracking down any problems later on.
You might want to consider having a look at SSIS (SQL Server Integration Services). It's the SQL Server tool for doing ETL activities.
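The single-table layout described above could be sketched like this. SQLite (through Python's standard sqlite3 module) stands in for SQL Server so the example is self-contained, so CURRENT_TIMESTAMP replaces GetDate(); the table name, column names and helper functions are all hypothetical.

```python
import sqlite3


def create_schema(conn):
    """One table for all days; create_date is filled in automatically on insert."""
    conn.execute(
        """
        CREATE TABLE daily_data (
            id          INTEGER PRIMARY KEY,
            data_date   TEXT NOT NULL,   -- date column from the incoming file
            value       REAL,
            source_file TEXT,            -- optional: file the row came from
            create_date TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )


def load_file(conn, source_file, rows):
    """Append one day's rows, given as (data_date, value) pairs."""
    conn.executemany(
        "INSERT INTO daily_data (data_date, value, source_file) VALUES (?, ?, ?)",
        [(data_date, value, source_file) for data_date, value in rows],
    )
    conn.commit()


def values_for_day(conn, data_date):
    """Analyse a single day with a WHERE clause on the date column."""
    cursor = conn.execute(
        "SELECT value FROM daily_data WHERE data_date = ? ORDER BY id",
        (data_date,),
    )
    return [row[0] for row in cursor]
```

Each new file is appended with load_file, and a report for any period is a single query with a different WHERE condition on data_date.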
Yes, append each day's data to the tables; one set of tables for all data.
Yes, use a date column to identify the day that the data was loaded.
Maybe have another table with a date column and a CLOB column (VARCHAR(MAX) in SQL Server): the date to hold the load date and the CLOB to hold the file that you imported.
Good question. You most definitely should have a single set of tables and append the data daily. Consider this: if you create a new set of tables each day, what would, say, a monthly report query look like? A quarterly report query? It would be a mess, with UNIONs and JOINs all over the place.
A single set of tables with a WHERE clause makes the querying and reporting manageable.
You might do a little reading on relational database theory. Wikipedia is a good place to start. The basics are pretty straightforward if you have the knack for it.
I would load the data into a staging table regardless and append it to the main tables afterwards. Once a week I would then refresh all data in the main table to ensure that it remains correct as per the source.
Marcus

How do I format dd-mmm-yy values in flat file to smalldatetime during data import?

I have a flat file which is imported into SQL Server via an existing SSIS package. I need to make a change to the package to accommodate a new field in the flat file. The new field is a date field which is in the format dd-mmm-yy (e.g. 25-AUG-11). The date field in the flat file will either be empty (e.g. a space/whitespace) or populated with a date. I don’t have any control over the date format in the flat file.
I need to import the date field in the flat file into an existing SQL Server table and the target field data type is smalldatetime.
I was proposing to import the date as a string into a load table and then convert it to smalldatetime when taking the data from the load table. But is there another way to parse the date format dd-mmm-yy and load it straight into a smalldatetime field, without having to convert from the load table? I can’t quite think how to parse the date format, particularly the month. Any suggestions welcome.
Here is an example that might give you an idea of what you can do. Ideally, in an SSIS package or in any ETL job, you should take into account that data may not be exactly what you would like it to be. You need to take appropriate steps to handle the incorrect or invalid data that might pop up now and then. That's why SSIS comes with lots of transformations within the Data Flow Task that you can use to clean up the data.
In your case, you can make use of Derived Column transformation or Data conversion transformation to achieve your requirements.
The example was created in SSIS 2008 R2. It shows how to read a flat file containing the dates and load into an SQL table.
I created a simple SQL table to import the flat file data.
On the SSIS package, I have a connection manager to SQL and one for Flat file. Flat file connection is configured as shown below.
On the SSIS package, I placed a Data Flow Task on the Control Flow tab. Inside the Data Flow Task, I have a Flat File Source, a Derived Column transformation and an OLE DB Destination. Since the Flat File Source and OLE DB Destination are straightforward, I will leave those out here. The Derived Column transformation creates a new column with the expression (DT_DBDATE)SmallDate. Note that you can also use a Data Conversion transformation to do the same. This new column, SmallDateTimeValue, should be mapped to the database column in the OLE DB Destination.
If you execute this package, it will fail because not all the values in the file are valid.
In your case, it fails because the invalid data is inserted directly into the table, and the table throws an exception that causes the package to fail. In this example, the package fails because the default setting on the Derived Column transformation is to fail the component if there is any error. So, let's place a dummy transformation to redirect the error rows. We will use a Multicast transformation for this purpose. It won't really do anything. Ideally, you should redirect the error rows to another table using an OLE DB Destination or another destination component of your choice, so you can analyze the data that causes the errors.
Drag the red arrow from the Derived Column transformation and connect it to the Multicast transformation. This will pop up the Configure Error Output dialog. Change the values under the Error and Truncation columns from Fail component to Redirect row. This will redirect any error rows to the Multicast transformation, and they will not get into the table.
Now, if we execute the package, it will run successfully. Note the number of rows displayed in each direction.
Here is the data that got into the table. Only 2 rows were valid. You can look at the first screenshot that showed the data in the file and you can see only 2 rows were valid.
Hope that gives you an idea to implement your requirement in the SSIS package.
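For the parsing part of the question (the dd-mmm-yy format, particularly the month), here is a minimal sketch in Python rather than in an SSIS expression. The function name is hypothetical. It assumes English month abbreviations (the default when no other locale has been set) and relies on Python's own two-digit-year pivot, which may not match the rule you need.

```python
from datetime import datetime


def parse_dd_mmm_yy(text):
    """Parse a dd-mmm-yy value such as '25-AUG-11'.

    Returns None for an empty or whitespace-only field and raises
    ValueError for anything that is not a valid date.
    """
    text = text.strip()
    if not text:
        return None
    # %b matches the abbreviated month name regardless of letter case.
    return datetime.strptime(text, "%d-%b-%y")
```

An empty field maps to None (NULL in the table), while an invalid value raises an error that can be handled the same way as the redirected error rows above.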
It should load straight into a SMALLDATETIME field as it is. Remember, dates are just numbers in SQL Server, which are presented to the user in the desired date/time format. The SSIS package should read 25-AUG-2011 just fine as a date data type, and insert it into a SMALLDATETIME field without issues.
Was the package throwing an error or something?
