SSIS - Convert Multiple Column Values To Null - sql-server

I am using SSIS (SQL Server 2008 R2) to transform an input CSV file into an SQL table. Five columns in the input file (reals - e.g. 19.54271) occasionally have a bad value (strings - e.g. "NAN") that causes the package to fail.
What is the simplest way to check these 5 columns for the bad value "NAN", convert that into either a NULL value or known bad numeric (-9999), and write the corrected values into the same final SQL table?
I have the following mess going so far, and finally decided to ask if there is a simpler way...
My current conditional logic: [screenshot not preserved]
My Case1 Derived Column Conversion: [screenshot not preserved]
Note: Still not sure if I can combine the other derived columns into one instance, but since my destination can have only one input, I suspect I will need to...
TIA

SSIS expressions get hairy and hard to read when the logic is complex or if there are multiple evaluations. In your case you're going to wind up with a bunch of tasks that, individually, do very little.
I'd bundle this up into a script component. That way you could use basic VB or C# functions to evaluate whether all of your columns convert properly to numeric and assign defaults when they don't. Additionally, you can implement a try/catch scenario and gracefully send errors to a different output buffer.
Here are some examples of how to use the script component as a transformation:
http://www.bimonkey.com/2009/09/the-script-transformation-part-1-a-simple-transformation/
http://www.sqlis.com/sqlis/post/The-Script-Component-as-a-Transformation.aspx
http://www.codeproject.com/Articles/193855/An-indespensible-SSIS-transformation-component-Scr

It looks like using a script component would be the best way to proceed if my logic had been more complex than simply converting bad values into nulls.
However, the logic with transformation objects is fairly straightforward, so hopefully this can help someone else:
The package (note that I redirect rows for suspect columns in the datasource): [screenshot not preserved]
The conditional split logic: [screenshot not preserved]
[EDIT: I found that every case condition requires a separate processing path. If you are evaluating multiple expressions, you can do so in one case by appending them with the || operator.]
The derived column logic: [screenshot not preserved]

Related

SQL Server Except operator - how to identify culprit columns

Just to give some background: we have created new SSIS packages to load feeds into SQL Server tables.
To make sure the new SSIS packages load data the same way the current live process does, we have written a comparison script to compare the Live and Test tables, which will run in parallel for a month.
I have used the EXCEPT operator to get the differences, and it does return the rows that differ.
The problem is that I have around 50 columns, and I need to check the data in each column and compare it with the other table to find the culprit.
In some cases I get around 40,000 differing rows, and identifying the right columns is too cumbersome.
In most cases the difference was caused by a NULL value in Live and a blank value in Test. I have done the following to identify the columns:
Limit the number of columns.
Look at the values for a few columns, then change the SQL statement to include the ISNULL function, and so on.
This is time consuming.
Is there a good way to get the names of the columns that are causing the differing rows?
It would be good to have a better way to handle this.

How to read and change series numbers in columns SSIS?

I'm trying to manipulate a column in SSIS that looks like the data below, after removing unwanted rows with a Derived Column and a Conditional Split in my data flow task. The source is a flat file.
XXX008001161022061116030S1TVCO3057
XXX008002161022061146015S1PUAG1523
XXX009001161022063116030S1DVLD3002
XXX009002161022063146030S1TVCO3057
XXX009003161022063216015S1PUAG1523
XXX010001161022065059030S1MVMA3020
XXX010002161022065129030S1TVCO3057
XXX01000316102206515901551PPE01504
The first three digits from the left (starting with "008" in the first row) represent a series, and the next three ("001") represent a sequence number within that series. What I need is to renumber the series so that they start from "001" and continue in order to the end of the file.
The desired result would thus look like:
XXX001001161022061116030S1TVCO3057
XXX001002161022061146015S1PUAG1523
XXX002001161022063116030S1DVLD3002
XXX002002161022063146030S1TVCO3057
XXX002003161022063216015S1PUAG1523
XXX003001161022065059030S1MVMA3020
XXX003002161022065129030S1TVCO3057
XXX00300316102206515901551PPE01504
...
My potential solution would be to load the file into a temporary database table and query it with SQL from there, but I am trying to avoid this.
The final destination is a flat file.
Does anybody have any ideas on how to pull this off in SSIS? Other solutions are appreciated too.
Thanks in advance
I would definitely use the staging table approach and use window functions to accomplish this. I could see a case for doing it in SSIS if SSIS were on a different machine from the database engine and you needed to offload the processing to the SSIS box.
In that case I would create a script transformation. You can process each row and make the necessary changes before passing the row to the output. You can use C# or VB.
There are many examples out there. Here is the MSDN article: https://msdn.microsoft.com/en-us/library/ms136114.aspx
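A minimal sketch of the staging-table approach, assuming the file has been loaded one line per row into a hypothetical temp table #Staging with a single column LineText (both names are made up for illustration). It also assumes the series numbers are zero-padded and already ascend through the file (008, 009, 010), so ranking them as strings yields 001, 002, 003. If the file order can differ from the sort order, you would need to load a line number as well and rank on that instead.

```sql
-- Hypothetical staging table: one row per line of the flat file.
CREATE TABLE #Staging (LineText varchar(100) NOT NULL);

INSERT INTO #Staging (LineText) VALUES
    ('XXX008001161022061116030S1TVCO3057'),
    ('XXX008002161022061146015S1PUAG1523'),
    ('XXX009001161022063116030S1DVLD3002'),
    ('XXX010001161022065059030S1MVMA3020');

SELECT
    LEFT(LineText, 3)                                   -- fixed 'XXX' prefix
    + RIGHT('000' + CAST(DENSE_RANK() OVER (
          ORDER BY SUBSTRING(LineText, 4, 3)            -- old series number
      ) AS varchar(10)), 3)                             -- new series, zero-padded to 3 digits
    + SUBSTRING(LineText, 7, LEN(LineText))             -- rest of the line, unchanged
    AS Renumbered
FROM #Staging
ORDER BY SUBSTRING(LineText, 4, 6);                     -- old series + sequence within series

-- First row returned: XXX001001161022061116030S1TVCO3057
-- Third row returned: XXX002001161022063116030S1DVLD3002
```

DENSE_RANK gives every row in the same old series the same new number and leaves no gaps between series. The query could serve as the source for the flat file destination. Note that RIGHT(..., 3) would truncate the new number if there were more than 999 series.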

How to apply same data manipulation codes to a group of SSIS components' inputs?

I am new to SSIS.
I have a number of MS Access tables to transform into SQL tables. Some of these tables have datetime fields that need to have some rules applied before they land in their respective SQL tables. I want to use a Script Component that handles these fields and converts them to the desired values.
Since all of these fields need the same modification rules, I want to apply the same code base to all of them and so avoid code duplication. What would be the best option for this scenario?
I know I can't use a single Script Component and direct all of those datasets' outputs to it, because unfortunately it doesn't support multiple inputs. So the question is: is it possible to apply a set of generic data manipulation rules to fields from a group of different datasets without repeating the rules? I could use a Script Component for each OLE DB input and apply the same rule in each one, but that would not be an efficient way of doing it.
Any help would be highly appreciated.
SQL Server Integration Services has a specific component to suit this need, called the Data Conversion transformation. The conversion can be done at the data source or via this transformation, as noted here.
You can also use the Derived Column transformation to convert data. This transformation is also simple: select an input column, then choose whether to replace that column or create a new output column. Then you apply an expression for the output column.
So why use one over the other?
The Data Conversion transformation takes an input, converts the type, and provides a new output column. The Derived Column transformation lets you apply an expression to the data, which allows more complex manipulations.

Define a String constant in SQL Server?

Is it possible in SQL Server to define a String constant? I am rewriting some queries to use stored procedures and each has the same long string as part of an IN statement [a], [b], [c] etc.
It isn't expected to change, but could at some point in future. It is also a very long string (a few hundred characters) so if there is a way to define a global constant for this that would be much easier to work with.
If this is possible I would also be interested to know if it works in this scenario. I had tried to pass this String as a parameter, so I could control it from a single point within my application but the Stored Procedure didn't like it.
You can create a table with a single column and row and disallow writes on it.
Use that as your global string constant (or additional constants, if you wish).
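As a rough sketch of this idea: the table name dbo.GlobalString, the column name Value and the sample string are all made up for illustration, and DENY is just one possible way to disallow writes.

```sql
-- Hypothetical single-row, single-column table holding the constant.
CREATE TABLE dbo.GlobalString (Value varchar(1000) NOT NULL);
INSERT INTO dbo.GlobalString (Value) VALUES ('[a], [b], [c]');

-- One way to disallow writes. DENY does not apply to the object owner
-- or to members of the sysadmin role, so this is a guard rail, not a lock.
DENY INSERT, UPDATE, DELETE ON dbo.GlobalString TO public;
GO

-- Reading the constant, for example inside a stored procedure:
DECLARE @constant varchar(1000);
SELECT @constant = Value FROM dbo.GlobalString;
SELECT @constant AS GlobalStringValue;
```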
You are asking for one thing (a string constant in MS SQL), but appear to maybe need something else. The reason I say this is because you have given a few hints at your ultimate objective, which appears to be using the same IN clause in multiple stored procedures.
The biggest clue is in the last sentence:
"I had tried to pass this String as a parameter, so I could control it from a single point within my application but the Stored Procedure didn't like it."
Without details of your SQL scripts, I am going to attempt to use some psychic debugging techniques to see if I can get you to what I believe is your actual goal, and not necessarily your stated goal.
Given your Stored Procedure "didn't like that" when you tried to pass in a string as a parameter, I am guessing the composition of the string was simply a delimited list of values, something like "10293, 105968, 501940" or "Juice, Milk, Donuts" (pay no attention to the actual list values - the important part is the delimited list itself). And your SQL may have looked something like this (again, ignore the specific names and focus on the general concept):
SELECT Column1, Column2, Column3
FROM UnknownTable
WHERE Column1 IN (@parameterString);
If this approximately describes the path you tried to take, then you will need to reconsider your approach. In a regular T-SQL statement you cannot pass a delimited string of values to an IN clause: the whole string is treated as a single value, not as a list.
There are alternatives, however:
Dynamic SQL - you can build up the whole SQL statement, parameters and all, then execute that in the SQL database. This probably is not what you are trying to achieve, since you are moving script to stored procedures. But it is listed here for completeness.
Table of values - you can create a single-column table that holds the specific values you are interested in. Then your Stored Procedure can simply use the column from this table for the IN clause. This way, there is no Dynamic SQL required. Since you indicate that the values are not likely to change, you may just need to populate the table once, and use it wherever appropriate.
String Parsing to derive the list of values - you can pass the list of values as a string, then implement code to parse the list into a table structure on the fly. An alternative form of this technique is to pass an XML structure containing the values, and use MS SQL Server's XML functionality to derive the table.
Define a table-valued function that returns the values to use - I have not tried this one, so I may be missing something, but you should be able to define the values in a table-valued function (possibly using a bunch of UNION statements or something), and call that function in the IN clause. Again, this is an untested suggestion and would need to be worked through to determine its feasibility.
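The table of values, XML, and table-valued function alternatives can be sketched as follows. UnknownTable and Column1 to Column3 come from the example query above; every other object name is hypothetical, and the sample values are the illustrative ones used earlier. Treat this as a sketch under those assumptions, not a tested solution for your schema.

```sql
-- Table of values: populate once, reuse in every procedure.
CREATE TABLE dbo.AllowedValue (Value varchar(50) NOT NULL PRIMARY KEY);
INSERT INTO dbo.AllowedValue (Value) VALUES ('Juice'), ('Milk'), ('Donuts');
GO

CREATE PROCEDURE dbo.GetRowsByAllowedValue
AS
BEGIN
    SET NOCOUNT ON;
    SELECT Column1, Column2, Column3
    FROM dbo.UnknownTable
    WHERE Column1 IN (SELECT Value FROM dbo.AllowedValue);
END;
GO

-- XML variant: the caller passes the list, the procedure shreds it into rows.
CREATE PROCEDURE dbo.GetRowsByXmlList
    @values xml   -- e.g. N'<v>Juice</v><v>Milk</v><v>Donuts</v>'
AS
BEGIN
    SET NOCOUNT ON;
    SELECT Column1, Column2, Column3
    FROM dbo.UnknownTable
    WHERE Column1 IN (
        SELECT x.v.value('.', 'varchar(50)')
        FROM @values.nodes('/v') AS x(v)
    );
END;
GO

-- Inline table-valued function variant: the values live in the function body.
CREATE FUNCTION dbo.AllowedValues()
RETURNS TABLE
AS
RETURN (
    SELECT 'Juice' AS Value
    UNION ALL SELECT 'Milk'
    UNION ALL SELECT 'Donuts'
);
GO

SELECT Column1, Column2, Column3
FROM dbo.UnknownTable
WHERE Column1 IN (SELECT Value FROM dbo.AllowedValues());
```

The table and the function keep the list in one place inside the database, while the XML variant keeps control of the list in the calling application.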
I hope that helps (assuming I have guessed your underlying quandary).
For future reference, it would be extremely helpful if you could include SQL script showing your table structure and stored procedure logic so we can see what you have actually attempted. This will considerably improve the effectiveness of the answers you receive. Thanks.
P.S. The link for String Parsing actually includes a large variety of techniques for passing arrays (i.e. lists) of information to Stored Procedures - it is a very good resource for this kind of thing.
In addition to string-constants tables as Oded suggests, I have used scalar functions to encapsulate some constants. That would be better for fewer constants, of course, but their use is simple.
Perhaps a combination - string constants table with a function that takes a key and returns the string. You could even use that for localization by having the function take a 'region' and combine that with a key to return a different string!
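A sketch of both ideas, with all names and sample values invented for illustration: a scalar function that hard-codes one constant, and a keyed constants table wrapped by a lookup function that also takes a region.

```sql
-- A single constant wrapped in a scalar function.
CREATE FUNCTION dbo.CategoryListConstant()
RETURNS varchar(1000)
AS
BEGIN
    RETURN '[a], [b], [c]';
END;
GO

-- Constants table keyed by region and name.
CREATE TABLE dbo.StringConstant (
    Region      varchar(10)   NOT NULL,
    ConstantKey varchar(50)   NOT NULL,
    Value       varchar(1000) NOT NULL,
    CONSTRAINT PK_StringConstant PRIMARY KEY (Region, ConstantKey)
);
GO

-- Lookup function: returns NULL when no matching row exists.
CREATE FUNCTION dbo.GetStringConstant (@region varchar(10), @key varchar(50))
RETURNS varchar(1000)
AS
BEGIN
    RETURN (
        SELECT Value
        FROM dbo.StringConstant
        WHERE Region = @region AND ConstantKey = @key
    );
END;
GO

-- Usage: scalar functions must be called with their schema name.
SELECT dbo.CategoryListConstant()                 AS CategoryList,
       dbo.GetStringConstant('en-US', 'Greeting') AS Greeting;
```

A function that returns a delimited string still yields a single value, so it does not by itself turn the string into a list for an IN clause.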

Replace certain pattern in a long string in MS SQL using T-SQL

I have a table in my MS SQL database that has some incomplete data in one field. The field in question is a varchar field holding about 1000 characters. The string consists of word segments, each written as a forward slash, then the segment, then a closing forward slash (e.g. /p/). The segments are separated by spaces. The problem is that some of these segments are missing the closing forward slash (e.g. /p). I need to write a T-SQL script that corrects this.
I know I will need to use an UPDATE statement to do that, and I have the WHERE clause too. The problem is what to set the field equal to. Since the string has about 1000 characters, I don't want to type out the actual string just to correct the problematic segments. My question is: is there a "RegEx replace function" that would change only the problematic segments and leave the rest of the string alone?
Your help will be greatly appreciated.
Thanks in advance,
Monte
SQL doesn't support RegEx natively. You could write a SQL CLR function, pipe the data through it, and have it correct any problem data and return the corrected version to SQL.
UPDATE YourTable
SET YourColumn = dbo.YourClrProc(YourColumn);
If you have Windows Scripting Host installed (most machines do), then you can use this method to call into the VBScript.RegExp object from T-SQL.
There is REPLACE, but it is nothing close to RegEx.
If this is a one-time operation, consider exporting the table, fixing the data with a tool you're familiar with, like sed or grep, and then importing the modified data back. It will probably be faster and more correct than trying to do this in T-SQL.
On the other hand, if this is a planned maintenance operation that you'll need to repeat often to keep the data clean, then I concur with mrdenny: a CLR function is probably the best choice.
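For this particular pattern, nested REPLACE calls may be enough without any RegEx. The sketch below assumes that segments are separated by exactly one space and that no segment contains a space. YourTable and YourColumn are the placeholder names used above, and the sample string is invented.

```sql
DECLARE @s varchar(1000);
SET @s = '/p/ /q /r/ /st';   -- '/q' and '/st' are missing the closing slash

-- 1. Trim, then append one space so the last segment is also followed by a separator.
-- 2. Put a slash before every space: good segments now end in '//', bad ones in '/'.
-- 3. Collapse '// ' back to '/ ', so every segment ends in exactly one slash.
-- 4. Trim the space added in step 1.
SELECT RTRIM(REPLACE(REPLACE(RTRIM(@s) + ' ', ' ', '/ '), '// ', '/ ')) AS Fixed;
-- Fixed = '/p/ /q/ /r/ /st/'

-- The same expression in an UPDATE (add your existing WHERE clause):
UPDATE YourTable
SET YourColumn = RTRIM(REPLACE(REPLACE(RTRIM(YourColumn) + ' ', ' ', '/ '), '// ', '/ '));
```

Segments that are already well formed come out unchanged, so the statement is safe to rerun. Each repaired segment adds one character, so check that the corrected string still fits the column's declared length.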
