SSIS Script Component - get raw row data in data flow

I am processing a flat file in SSIS and one of the requirements is that if a given row contains an incorrect number of delimiters, fail the row but continue processing the file.
My plan is to load the rows into a single column in SQL Server, but during the load I'd like to test each row in the data flow to see whether it has the right number of delimiters, and add a derived column value to store the result of that comparison.
I'm thinking I could do that with a Script Component, but I'm wondering if anyone has done this before and what the best method would be. If a Script Component is the way to go, how do I access the raw row, with its delimiters, inside the script?
SOLUTION:
I ended up going with a modified version of Holder's answer as I found that TOKENCOUNT() will not count null values per this SO answer. When two delimiters are not separated by a value, it will result in an incorrect count (at least for my purposes).
I used the following expression instead:
LEN(EntireRow) - LEN(REPLACE(EntireRow, "|", ""))
This results in the correct count of delimiters in the row, regardless of whether there's a value in a given field or not.
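For example, a minimal sketch, assuming a well-formed row has ten columns and therefore nine pipe delimiters (adjust the literal to your layout; the column name DelimiterCountOK is illustrative), the derived column becomes a boolean flag:
DelimiterCountOK: LEN(EntireRow) - LEN(REPLACE(EntireRow, "|", "")) == 9
Rows where this flag is false can then be failed or redirected downstream.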

My suggestion is to use a Derived Column transformation to do your test,
and then add a Conditional Split to decide whether or not to insert the rows.
Something like this:
Use the TOKENCOUNT function in the Derived Column editor to get the number of columns, like this: TOKENCOUNT(EntireRow,"|")
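The matching Conditional Split condition could then be, for example (a sketch assuming ten expected columns, hence the literal 10; adjust to your file):
TOKENCOUNT(EntireRow,"|") == 10
Rows that fail the condition go to a separate output instead of the destination.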

How to add values in cells ONLY when other columns contain data from a query result

Link to example file:
https://docs.google.com/spreadsheets/d/1dCQSHWjndejkyyw-chJkBjfHgzEGYoRdXmPTNKu7ykg/edit?usp=sharing
The tab "Source data" contains the data to be used in the query on the tab "Query output". The tab "Desired result" shows what I would like the end result to look like.
The goal I'm trying to achieve is to have the formula in cell A2 on the tab "Query output" populate the data in all four of the columns, so that it looks exactly like the "Desired result" tab. I know I can get the same result simply by entering additional formulas in C2 and D2, but this is not the objective; I need the results to come specifically from the single formula in A2.
The information in the "Additional data 1" column should simply repeat the word "Test" for every row that contains data in the first two columns. The information in the "Additional data 2" column should simply repeat the data from cell 'Source data'!A1 for every row that contains data in the first two columns.
Please feel free to edit the example file as it only contains dummy data. If you like, you can copy the tab "Query output" to create your own working formula for illustrative purposes.
EDIT:
I'm thinking along the lines of creating an array that consists of the required data for the columns "Additional data 1" and "Additional data 2", and then combining that array with the array of the query result which provides the first two columns. I've been experimenting with this in various ways, but so far the only result I have achieved is an error on the first cell of the query results. I also have no idea yet how I could make sure that the second array contains a number of rows equal to the query result.
You can add static data into the query:
=QUERY('Source data'!A3:B,"SELECT A,B, 'Test', '" & 'Source data'!A1 &"' WHERE A IS NOT NULL LABEL A '', B '', 'Test' '', '" & 'Source data'!A1 &"' ''")
Many thanks to @basic for the provided assistance! The insights were a great help in solving my issue. That said, I have muddled along a bit, and I've come up with a slightly different solution which I find better suited, as it gives true blank values instead of a column filled with spaces.
First of all, instead of querying directly on the source data, I built an array and queried on that. I used the two existing columns (A and B) from the source data and added a third column to the array which does not exist in the source data. In order to make sure that the third column would consist of blank values, I used the IFERROR formula.
=IFERROR(0/0)
The formula above returns a blank because dividing by zero forces an error, and the IFERROR function returns a blank unless an alternative return value is specified.
In order to be able to use this formula in an array however, it had to be tweaked slightly, because as it is it would only return a single blank cell value instead of a column of blank values. To do this, I used an already existing column from the source data, and then encapsulated it in an ARRAYFORMULA.
=ARRAYFORMULA(IFERROR('Source data'!A3:A/0))
Using this, the formula for the resulting array looks like the following.
=ARRAYFORMULA({'Source data'!A3:A,'Source data'!B3:B,IFERROR('Source data'!A3:A/0)})
This creates an array consisting of the two original columns A and B from the source data, plus an additional third column filled with blank values. This array can now be queried upon, and using the tricks previously provided by @basic, the desired result as specified in the original question can be achieved.
Because the query now runs against a user-defined array, the columns in the SELECT statement have to be referred to as Col1, Col2, Col3 instead of A, B, C. The final formula looks like this.
=QUERY(ARRAYFORMULA({'Source data'!A3:A,'Source data'!B3:B,IFERROR('Source data'!A3:A/0)}),"SELECT Col1,Col2,'Test',Col3,'"&'Source data'!A1&"' WHERE Col1 IS NOT NULL LABEL 'Test' '','"&'Source data'!A1&"' ''")
I hope this information may prove of use to someone else as well.

Loop inside Pentaho Data Integration Transformation

I have a transformation where a text file is read and passed to a Read Company step.
For each of the t_cmp values in the text file (the number of t_cmp values is not known in advance), I want to execute Read Company,
but it is giving an error.
Can anyone please tell me where I am going wrong?
You need to pass 3 rows, each with 1 field, instead of a single row with 3 fields.
The number of fields must match the number of parameters of your query.
So, in short, transpose your data. Either:
read the line as a single field and then use Split field to rows,
or read it as you do now and use Row Normalizer.
Both approaches should work.

Snowflake: Export data in multiple delimiter format

Requirement:
I need the file to be exported in the format below, where gender, age, and interest are columns and the value after each colon is the data for that column. Can this be achieved in Snowflake? If not, is it possible to export the data using Python?
User1234^gender:male;age:18-24;interest:fishing
User2345^gender:female
User3456^age:35-44
User4567^gender:male;interest:fishing,boating
EDIT 1: The solution as given by @demircioglu displays NULL instead of the row values whenever one of the concatenated columns is NULL. This happens when I run the query below against the tempdw.EMPLOYEES table:
SELECT 'EMP_ID'||EMP_ID||'^'||'FIRST_NAME'||':'||FIRST_NAME||';'||'LAST_NAME'||':'||LAST_NAME FROM tempdw.EMPLOYEES ;
Create your SQL with the desired format and write it to a file
COPY INTO @~/stage_data
FROM
(
SELECT 'User'||User||'^'||'gender'||':'||gender||';'||'age'||':'||age||';'||'interest'||':'||interest FROM table
)
file_format = (TYPE=CSV compression='gzip')
The file format here is not important: each line will be treated as a single field because of your delimiter requirements.
Edit:
CONCAT function (aliased with ||) returns NULL if you have a NULL value.
In order to eliminate NULLs you can use the NVL2 function.
So your SQL will have a series of NVL2s.
NVL2 checks the first parameter: if it's not NULL it returns the first expression, and if it's NULL it returns the second expression.
So for the User column,
'User'||User||'^' will turn into
NVL2(User,'User','')||NVL2(User,User,'')||NVL2(User,'^','')
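For illustration, a minimal sketch of the assembled SELECT under those rules (table and column names come from the question; this assumes User itself is never NULL, and uses RTRIM to drop the trailing separator when later fields are NULL):
SELECT 'User' || User || '^' ||
       RTRIM(
           -- one NVL2 per optional column; empty string when the column is NULL
           NVL2(gender,   'gender:'   || gender   || ';', '') ||
           NVL2(age,      'age:'      || age      || ';', '') ||
           NVL2(interest, 'interest:' || interest,       '')
       , ';')
FROM table;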
P.S. I am leaving it up to you to refine the rest of the SQL, because Stack Overflow's purpose is to help find the solution, not to spoon-feed it.
No, I do not believe multiple delimiters like this are supported in Snowflake at this time. Multiple byte and multiple character delimiters are supported, but they will need to be specified as the same delimiter repeated for either record or line.
Yes, it may be possible to do some post-processing or use Python scripts to achieve this, or even transformative SQL statements. This is not really my area of expertise, so if someone has an example for you, I'll let them add to the discussion.

Handling truncation error in derived column in data flow task

I have a data flow task which contains a derived column. The derived column transforms a CSV file column, let's say A, which is an order number, to data type char with length 10.
This works perfectly fine when the text file column is equal to or less than 10 characters; of course, it throws an error when the column A order number is more than 10 characters.
Sample values for column A (including the error-prone rows):
12PR567890
254W895X98
ABC 56987K5239
485P971259 SPTGER
459745WERT
I would like to catch the error-prone records and extract the order number only.
I can already configure the error output from the derived column, but this just ignores the error records and processes the others.
The expected output should process the ABC 56987K5239 and 485P971259 SPTGER order numbers as 56987K5239 and 485P971259 respectively. How the unexpected characters are removed is not important; what matters is how to achieve this at run time in the derived column (stripping and processing the data in case of error).
If the valid order number always starts with a digit and has a length of exactly 10, you could use a Script Component (Transformation) together with a regular expression to transform the source data.
Drag and drop the Script Component as Transformation
Connect the source to the Script Component
In the Script Component Edit window, check the Order column under Input Columns and set its usage type to ReadWrite
In the script, add: using System.Text.RegularExpressions;
The matching code goes in the Input0_ProcessInputRow method:
string pattern = "[0-9].{9}";
Row.Order = Regex.Match(Row.Order, pattern).Value;
The output going to the destination should then be the matched 10 characters starting with a digit.
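For reference, a minimal sketch of the whole method, assuming the default Input0Buffer class and an input column named Order (both are the names SSIS generates by default, but they may differ in your package):
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Ten characters starting with a digit, anywhere in the value.
    Match m = Regex.Match(Row.Order, "[0-9].{9}");
    if (m.Success)
        Row.Order = m.Value;   // e.g. "ABC 56987K5239" becomes "56987K5239"
}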

Remove Duplicate adjacent Sub-String from String in Microsoft SQL Server

I am using SQL Server 2008 and I have a column in a table, which has values like below. It basically shows departure and arrival information.
-->Heathrow/Dublin*Dublin/Heathrow
-->Gatwick/Liverpool*Liverpool/Carlisle *Carlisle/Gatwick
-->Heathrow/Dublin*Liverpool/Heathrow
(The 3rd example shown above is slightly different: the person did not depart from Dublin, but instead departed from Liverpool.)
This makes the column too lengthy, and I want to remove only the adjacent duplicates, so the information can be shown like below:
-->Heathrow/Dublin/Heathrow
-->Gatwick/Liverpool/Carlisle/Gatwick
-->Heathrow/Dublin***Liverpool/Heathrow
So, this would still show the correct travel route, but omits only the contiguous duplicates. Also, in the 3rd case, since the departure and arrival locations are not the same, I would like to show it as ***.
I found a post here that removes all duplicates (Find and Remove Repeated Substrings) but this is slightly different from the solution that I need.
Could someone share any thoughts please?
The first step is to adapt the process defined in the following link so that it splits based on /:
T-SQL split string
This returns a table, which you would then loop through, checking whether the value contains an *. In that case you would get the text values before and after the * and compare them. Use CHARINDEX to get the position of the *, and SUBSTRING to get the values before and after. Once you have those, check both values and append to your output string accordingly. A sketch of that extraction follows.
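A minimal T-SQL sketch of the extraction around a single * (for a value with one leg on each side of the *; the variable names are illustrative):
DECLARE @s VARCHAR(100) = 'Heathrow/Dublin*Dublin/Heathrow';
DECLARE @star INT = CHARINDEX('*', @s);
SELECT
    -- the city between the '/' preceding the '*' and the '*' itself
    SUBSTRING(@s, CHARINDEX('/', @s) + 1, @star - CHARINDEX('/', @s) - 1) AS arrival_before,
    -- the city between the '*' and the next '/'
    SUBSTRING(@s, @star + 1, CHARINDEX('/', @s, @star) - @star - 1)       AS departure_after;
If arrival_before equals departure_after, the stop is an adjacent duplicate and can be collapsed.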
So you have a database column that contains this text string? Is your concern to display the data to the user in a new format, or to update the data in your database table with a new value?
Do you have access to the original data from which this text string was built? It would probably be easier to re-create the string in the format you desire than it would be to edit the existing string programmatically.
If you don't have access to this data, it would probably be a lot simpler to update your data (or reformat it for display) if you do the string manipulation in a high-level language such as c# or java.
If you're reformatting it for display, write the string manipulation code in whatever language is appropriate, right before displaying it. If you're updating your table, you could write a program to process the table, reading each record, building the replacement string, and updating the record before moving on to the next one.
The bottom line is that T-SQL is just not a good language for doing this sort of string examination and manipulation. If you can build a fresh string from the original data, or do your manipulation in a high-level language (see the sketch below), you'll have an easier job of it and end up with more maintainable code.
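For illustration, a rough C# sketch of that manipulation (CollapseRoute is a hypothetical helper; it assumes legs are separated by * and each leg has the form From/To):
static string CollapseRoute(string route)
{
    string[] legs = route.Split('*');              // each leg looks like "From/To"
    string result = legs[0].Trim();
    for (int i = 1; i < legs.Length; i++)
    {
        string prevArrival   = legs[i - 1].Split('/')[1].Trim();
        string nextDeparture = legs[i].Split('/')[0].Trim();
        // Adjacent duplicate: keep only the onward city; otherwise mark the gap.
        result += prevArrival == nextDeparture
            ? "/" + legs[i].Split('/')[1].Trim()
            : "***" + legs[i].Trim();
    }
    return result;
}
For example, CollapseRoute("Gatwick/Liverpool*Liverpool/Carlisle *Carlisle/Gatwick") returns "Gatwick/Liverpool/Carlisle/Gatwick", and the mismatched third example yields "Heathrow/Dublin***Liverpool/Heathrow".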
I wrote code for the first example you gave; you will still need to improve it for the rest ...
DECLARE @STR VARCHAR(50) = 'Heathrow/Dublin*Dublin/Heathrow'

IF (SELECT SUBSTRING(@STR, CHARINDEX('/', @STR) + 1, CHARINDEX('*', @STR) - CHARINDEX('/', @STR) - 1)) =
   (SELECT SUBSTRING(@STR, CHARINDEX('*', @STR) + 1, LEN(SUBSTRING(@STR, CHARINDEX('/', @STR) + 1, CHARINDEX('*', @STR) - CHARINDEX('/', @STR) - 1))))
BEGIN
    -- Adjacent duplicate: remove the '*' and the repeated city.
    SELECT STUFF(@STR, CHARINDEX('*', @STR), LEN(SUBSTRING(@STR, CHARINDEX('/', @STR) + 1, CHARINDEX('*', @STR) - CHARINDEX('/', @STR) - 1)) + 1, '')
END
ELSE
BEGIN
    -- No duplicate: replace just the '*' with '***'.
    SELECT STUFF(@STR, CHARINDEX('*', @STR), 1, '***')
END
