SSIS Merge Varying Columns - sql-server

Using SSIS, I am importing a .txt file, which for the most part is straightforward.
The file being imported has a set number of columns up to a point, but there is a free-text/comments field, which can repeat to an unknown length, similar to below:
"000001","J Smith","Red","Free text here"
"000002","A Ball","Blue","Free text here","but can","continue"
"000003","W White","Green","Free text here","but can","continue","indefinitely"
"000004","J Roley","Red","Free text here"
What I would ideally like to do (within SSIS) is to keep the first three columns as singular columns, but to merge any free-text ones into a single column, i.e. merge/concatenate anything that appears after the 'colour' column.
So when I load this into a SQL Server table, it appears like:
000001 | J Smith | Red | Free text here |
000002 | A Ball | Blue | Free text here but can continue |
000003 | W White | Green | Free text here but can continue indefinitely |
000004 | J Roley | Red | Free text here |

I do not see any easy solution. You can try something like below:
1. Load the complete raw data to a temp table (without any delimiter):
Steps:
Create the temp table in an Execute SQL Task (a minimal sketch is shown after this list)
Create a Data Flow Task with a Flat File Source (Ragged Right format) and an
OLE DB Destination (using the #temp table created in the previous task)
Set DelayValidation=True for the connection manager and the DFT
Set RetainSameConnection=True for the connection manager
Refer to this for creating the temp table and using it.
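A minimal sketch of that Execute SQL Task statement (the column name [Val] matches the query in step 2; size the column to fit your longest line):
IF OBJECT_ID('tempdb..#temp') IS NOT NULL
    DROP TABLE #temp;
CREATE TABLE #temp ([Val] VARCHAR(4000));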
2. Create T-SQL to separate the 3 columns (something like below)
with col1 as (
    select
        [Val],
        substring([Val], 1, charindex(',', [Val]) - 1) as col1,
        len(substring([Val], 1, charindex(',', [Val]))) + 1 as col1Len
    from #temp
), col2 as (
    select
        [Val],
        col1,
        substring([Val], col1Len, charindex(',', [Val], col1Len) - col1Len) as col2,
        charindex(',', [Val], col1Len) + 1 as col2Len
    from col1
)
select col1, col2, substring([Val], col2Len, 200) as col3
from col2;
T-SQL Output:
col1 col2 col3
"000001" "J Smith" "Red","Free text here"
"000002" "A Ball" "Blue","Free text here","but can","continue"
"000003" "W White" "Green","Free text here","but can","continue","indefinitely"
3. Use the above query in an OLE DB Source in a different data flow task.
Replace the double quotes (") as per your requirement, for example as sketched below.
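For instance, the final SELECT of the query above could strip the quotes and turn the "," separators into spaces (a sketch; the col2 CTE and column names come from step 2):
select replace(col1, '"', '') as col1,
       replace(col2, '"', '') as col2,
       replace(replace(substring([Val], col2Len, 200), '","', ' '), '"', '') as col3
from col2;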

This was a fun exercise:
Add a data flow
Add a Script Component (select Source)
Add 4 columns to Output0: ID, Name, Color, FreeText, all of type string
Edit the script:
Paste the following namespaces up top:
using System.Text.RegularExpressions;
using System.Linq;
Paste the following code into CreateNewOutputRows:
string strPath = @"a:\test.txt"; // put your file path in here
var lines = System.IO.File.ReadAllLines(strPath);
foreach (string line in lines)
{
    // Code I stole to read CSV
    string delimiter = ",";
    Regex rgx = new Regex(String.Format("(\"[^\"]*\"|[^{0}])+", delimiter));
    var cols = rgx.Matches(line)
                  .Cast<Match>()
                  .Select(m => m.Value.Trim().Trim('"'))
                  .Where(v => !string.IsNullOrWhiteSpace(v));

    // create a column counter
    int ctr = 0;
    Output0Buffer.AddRow();

    // Preset FreeText to empty string
    string FreeTextBuilder = String.Empty;

    foreach (string col in cols)
    {
        switch (ctr)
        {
            case 0:
                Output0Buffer.ID = col;
                break;
            case 1:
                Output0Buffer.Name = col;
                break;
            case 2:
                Output0Buffer.Color = col;
                break;
            default:
                FreeTextBuilder += col + " ";
                break;
        }
        ctr++;
    }
    Output0Buffer.FreeText = FreeTextBuilder.Trim();
}

Related

How to transform data when we have comma separated values in csv format file in snowflake

I have an Excel CSV format data set with the following data:
Columns: id, product_name, sales, quantity, Profit
Data: 1, "Novimex Executive Leather Armchair, Black","$3,709.40", 9, -$288.77
When I try to insert these records from the stage into the Snowflake table, the data gets shifted out of the product_name column because of the embedded comma before "Black", and the data in the following columns shifts as well. After loading, the data looks like this:
+----+-------------------------------------+--------+----------+---------+
| id | product_name | sales | quantity | Profit |
+----+-------------------------------------+--------+----------+---------+
| 1 | "Novimex Executive Leather Armchair | Black" | $3 | 709.40" |
+----+-------------------------------------+--------+----------+---------+
Query used:
copy into orders_staging (id,Product_Name,Sales,Quantity,Profit)
from
(select $1,$2,$3,$4,$5
from @sales_data_stage)
file_format = (type = csv field_delimiter = ',' skip_header = 1 ENCODING = 'iso-8859-1');
Use Field Enclosure.
FIELD_OPTIONALLY_ENCLOSED_BY='"'
If you have any issues with accounting-style numbers, remember to put quotes (" ") around them too.
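Applied to the COPY from the question, that looks something like this (a sketch; the table, stage, and remaining options are taken from the question as-is):
copy into orders_staging (id, Product_Name, Sales, Quantity, Profit)
from
(select $1, $2, $3, $4, $5
 from @sales_data_stage)
file_format = (type = csv
               field_delimiter = ','
               skip_header = 1
               field_optionally_enclosed_by = '"'
               encoding = 'iso-8859-1');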
https://community.snowflake.com/s/question/0D50Z00008pDcoRSAS/copying-csv-files-delimited-by-commas-where-commas-are-also-enclosed-in-strings
Additional documentation for COPY INTO <table>:
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#type-csv
Additional documentation on CREATE FILE FORMAT:
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html

Cannot insert Array in Snowflake

I have a CSV file with the following data:
eno | phonelist | shots
"1" | "['1112223333','6195551234']" | "[[11,12]]"
The DDL statement I have used to create the table in Snowflake is as follows:
CREATE TABLE ArrayTable (eno INTEGER, phonelist array,shots array);
I need to insert the data from the CSV into the Snowflake table and the method I have used is:
create or replace stage ArrayTable_stage file_format = (TYPE=CSV)
put file://ArrayTable @ArrayTable_stage auto_compress=true
copy into ArrayTable from @ArrayTable_stage/ArrayTable.gz
file_format = (TYPE=CSV FIELD_DELIMITER='|' FIELD_OPTIONALLY_ENCLOSED_BY='\"\')
But when I try to run the code, I get the error:
Copy to table failed: 100069 (22P02): Error parsing JSON:
('1112223333','6195551234')
How to resolve this?
FIELD_OPTIONALLY_ENCLOSED_BY='\"\': based on the row you have, that should just be '\"'.
select parse_json('[\'1112223333\',\'6195551234\']');
works (the backslashes are there to get around the SQL parser),
but your output has parens ( ), which is different.
SELECT column2, TRY_PARSE_JSON(column2) as j
FROM @ArrayTable_stage/ArrayTable.gz
file_format = (TYPE=CSV FIELD_DELIMITER='|' FIELD_OPTIONALLY_ENCLOSED_BY='\"\')
WHERE j is null;
will show which values are failing to parse.
Failing that, you might want to use TO_ARRAY to parse column2 yourself and insert the selected/transformed data into your table, since the automatic transformation is what is failing; a rough sketch follows.
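A rough sketch of that TO_ARRAY route, reusing the table, stage, and corrected file format from above (whether these functions behave as expected inside a COPY transformation for your data should be verified):
copy into ArrayTable (eno, phonelist, shots)
from (
    select $1::integer,
           to_array(parse_json($2)),  -- phonelist
           to_array(parse_json($3))   -- shots
    from @ArrayTable_stage/ArrayTable.gz
)
file_format = (TYPE=CSV FIELD_DELIMITER='|' FIELD_OPTIONALLY_ENCLOSED_BY='"');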

Loading CSV data to Snowflake table

The column splits into multiple columns when trying to load the following data into a Snowflake table, since it is a CSV file.
Column data:
{"Department":"Mens
Wear","Departmentid":"10.1;20.1","customername":"john4","class":"tops wear","subclass":"sweat shirts","product":"North & Face 2 Bangle","style":"Sweat shirt hoodie - Large - Black"}
Is there any other way to load the data into a single column?
The best solution would be to use a different delimiter instead of a comma in your CSV file. If that's not possible, then you can ingest the data using a non-existent delimiter to get the whole line as one column, and then parse it. Of course it won't be as efficient as native loading:
cat test.csv
1,2020-10-12,Gokhan,{"Department":"Mens Wear","Departmentid":"10.1;20.1","customername":"john4","class":"tops wear","subclass":"sweat shirts","product":"North & Face 2 Bangle","style":"Sweat shirt hoodie - Large - Black"}
create file format csvfile type=csv FIELD_DELIMITER='NONEXISTENT';
select $1 from @my_stage (file_format => csvfile);
create table testtable( id number, d1 date, name varchar, v variant );
copy into testtable from (
    select
        split( split($1,',{')[0], ',' )[0],
        split( split($1,',{')[0], ',' )[1],
        split( split($1,',{')[0], ',' )[2],
        parse_json( '{' || split($1,',{')[1] )
    from @my_stage (file_format => csvfile)
);
select * from testtable;
+----+------------+--------+-----------------------------------------------------------------+
| ID | D1 | NAME | V |
+----+------------+--------+-----------------------------------------------------------------+
| 1 | 2020-10-12 | Gokhan | { "Department": "Mens Wear", "Departmentid": "10.1;20.1", ... } |
+----+------------+--------+-----------------------------------------------------------------+

SSIS Combine multiple rows into single row

I have a flat file that has 6 columns: NoteID, Sequence, FileNumber, EntryDte, NoteType, and NoteText. The NoteText column holds 200 characters, and if a note is longer than 200 characters then a second row in the file contains the continuation of the note. It looks something like this:
|NoteID | Sequence | NoteText |
---------------------------------------------
|1234 | 1 | start of note text... |
|1234 | 2 | continue of note.... |
|1234 | 3 | more continuation of first note... |
|1235 | 1 | start of new note.... |
How can I, in SSIS, combine the multiple rows of NoteText into one row so the result looks like this:
| NoteID | Sequence | NoteText |
---------------------------------------------------
|1234 | 1 | start of note text... continue of note... more continuation of first note... |
|1235 | 1 | start of new note.... |
I would greatly appreciate any help.
Update: Changing the SynchronousInputID to None exposed the Output0Buffer and I was able to use it. Below is what I have in place now.
Dim NoteID As String = "-1"
Dim NoteString As String = ""
Dim IsFirstRow As Boolean = True
Dim NoteBlob As Byte()
Dim enc As New System.Text.ASCIIEncoding()

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    If Row.NoteID.ToString() = NoteID Then
        NoteString += Row.NoteHTML
        IsFirstRow = True
    Else
        If IsFirstRow Then
            Output0Buffer.AddRow()
            IsFirstRow = False
        End If
        NoteID = Row.NoteID.ToString()
        NoteString = Row.NoteHTML.ToString()
    End If

    NoteBlob = enc.GetBytes(NoteString)
    Output0Buffer.SingleNoteHTML.AddBlobData(NoteBlob)
    Output0Buffer.ClaimID = Row.ClaimID
    Output0Buffer.UserID = Row.UserID
    Output0Buffer.NoteTypeLookupID = Row.NoteTypeLookupID
    Output0Buffer.DateCreatedUTC = Row.DateCreated
    Output0Buffer.ActivityDateUTC = Row.ActivityDate
    Output0Buffer.IsPublic = Row.IsPublic
End Sub
My problem now is that I had to convert the output column from WStr(4000) to NText because some of the notes are so long. When it imports into my SQL table, it is just gibberish characters and not the actual notes.
In SQL Server Management Studio (using SQL), you could easily combine your NoteText rows into a single column using the STUFF function with FOR XML PATH, like this:
select distinct
    noteid,
    min(sequence) over (partition by n.noteid order by n.sequence) as sequence,
    stuff((select ' ' + NoteText
           from notes n1
           where n.noteid = n1.noteid
           order by n1.sequence
           for xml path ('')
          ), 1, 1, '') as NoteText
from notes n;
You will probably want to look into something along those lines that does a similar thing in SSIS. Check out this link on how to create a script component in SSIS to do something similar: SSIS Script Component - concat rows
SQL Fiddle Demo

Bulk insert into SQL Server a CSV with line breaks in fields

I have a csv that looks like this:
"blah","blah, blah, blah
ect, ect","column 3"
"foo","foo, bar, baz
more stuff on another line", "another column 3"
Is it possible to import this directly into SQL Server?
Every row in your file finishes with a new line (\n), but the actual rows you want to get finish with a quotation mark and a new line. Set ROWTERMINATOR in the BULK INSERT command to:
ROWTERMINATOR = '"\n'
EDITED: I think the bigger problem will be the commas in the text. SQL Server does not use text enclosures, so the row will be divided on commas without checking whether the comma is inside quotation marks or not.
You may do like this:
BULK INSERT newTable
FROM 'c:\file.txt'
WITH
(
FIELDTERMINATOR ='",',
ROWTERMINATOR = '"\n'
)
This will give you the following result:
col1 | col2 | col3
----------------------------------------------------------------
"blah | "blah, blah, blah ect, ect | "column 3
"foo | "foo, bar, baz more stuff on another line | "another column 3
All you have to do is get rid of the quotation marks at the beginning of each cell.
For example:
UPDATE newTable
SET col1 = RIGHT(col1,LEN(col1)-1),
col2 = RIGHT(col2,LEN(col2)-1),
col3 = RIGHT(col3,LEN(col3)-1)
I think you can also do this using the bcp utility with a format file.
