Concatenating multiple columns in SSIS - sql-server

I'm doing a simple concatenation in SSIS.
For example I have a table like this:
+--------+-----------+------------+--------+
| ID     | COL_A     | COL_B      | COL_C  |
+--------+-----------+------------+--------+
| 110-99 |           | APPLE      | Orange |
+--------+-----------+------------+--------+
| 111-01 | Mango     | Palm       |        |
+--------+-----------+------------+--------+
| 111-02 |           | Strawberry |        |
+--------+-----------+------------+--------+
| 111-05 | Pineapple | Guava      | Lemon  |
+--------+-----------+------------+--------+
I'm doing this in an SSIS Derived Column transformation, concatenating the three columns with a pipe (|):
COL_A + "|" + COL_B + "|" + COL_C
Actual Result:
|APPLE|Orange
Mango|Palm|
|Strawberry|
Pineapple|Guava|Lemon
Expected Result:
APPLE|Orange
Mango|Palm
Strawberry
Pineapple|Guava|Lemon
I'm not sure how to remove those extra | characters when a value is empty. I have tried using CASE, but it doesn't work; actually, I don't know how to use CASE in a Derived Column expression.

You execute conditional logic in SSIS expressions using the ?: syntax; see ? : (Conditional) (SSIS Expression) in the documentation. It works much like an inline IIF.
Along these lines:
boolean_expression ? returnIfTrue : returnIfFalse
In order to get your desired results, I think I'd use two Derived Column transformations: in the first, build the pipe-delimited string; in the second, trim off the trailing pipe if one is left after building the string. Otherwise, the conditional logic gets pretty hairy in order to avoid leaving a trailing delimiter.
Step one: if a column is NULL or an empty string, return an empty string; otherwise, return the contents of the column with a pipe concatenated to it:
(ISNULL(COL_A) || COL_A == "") ? "" : COL_A + "|"
Repeat that logic for all three columns, putting this expression into your derived column (line breaks added here for readability):
((ISNULL(COL_A) || COL_A == "") ? "" : COL_A + "|") +
((ISNULL(COL_B) || COL_B == "") ? "" : COL_B + "|") +
((ISNULL(COL_C) || COL_C == "") ? "" : COL_C)
No pipe is appended to COL_C, since nothing follows it.
Then, in the second transformation, trim the trailing pipe left over when the last column or two were empty (for the 111-01 row, this turns Mango|Palm| into Mango|Palm):
(RIGHT(NewColumnFromAbove,1)=="|") ? LEFT(NewColumnFromAbove,LEN(NewColumnFromAbove)-1) : NewColumnFromAbove
On the other hand, if there are lots of columns, or if performance gets bogged down, I would strongly consider writing the concatenation into a stored procedure using CONCAT_WS, and invoking that from an Execute SQL Task instead.

In SQL Server, one option is concat_ws(), which ignores NULL values by design. If you have empty strings, you can turn them into NULLs with nullif().
concat_ws(
    ' | ',
    nullif(col_a, ''),
    nullif(col_b, ''),
    nullif(col_c, '')
)
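As a minimal end-to-end sketch (the table name fruits is assumed from the question; CONCAT_WS requires SQL Server 2017 or later, and a bare '|' separator is used here to match the expected output):
-- concat_ws skips NULLs, and nullif turns empty strings into NULLs,
-- so empty columns contribute neither a value nor a delimiter
select id,
       concat_ws('|',
                 nullif(col_a, ''),
                 nullif(col_b, ''),
                 nullif(col_c, '')) as fruit_list
from fruits;
-- 110-99 -> APPLE|Orange
-- 111-02 -> Strawberry
-- 111-05 -> Pineapple|Guava|Lemon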

Related

How to transform data when we have comma-separated values in a CSV format file in Snowflake

I have an Excel CSV format data set with the following data:
Columns: id, product_name, sales, quantity, Profit
Data: 1, "Novimex Executive Leather Armchair, Black","$3,709.40", 9, -$288.77
When I try to load these records from the stage into the Snowflake table, the data shifts at the product_name column because the value contains a comma (", Black"), and the data in the following columns shifts accordingly. After loading, the data looks like this:
+----+-------------------------------------+--------+----------+---------+
| id | product_name | sales | quantity | Profit |
+----+-------------------------------------+--------+----------+---------+
| 1 | "Novimex Executive Leather Armchair | Black" | $3 | 709.40" |
+----+-------------------------------------+--------+----------+---------+
Query used:
copy into orders_staging (id, Product_Name, Sales, Quantity, Profit)
from (
    select $1, $2, $3, $4, $5
    from @sales_data_stage
)
file_format = (type = csv field_delimiter = ',' skip_header = 1 ENCODING = 'iso-8859-1');
Use field enclosure:
FIELD_OPTIONALLY_ENCLOSED_BY='"'
If you have any issues with accounting-style numbers, remember to put " " around them too.
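Applied to the COPY statement from the question, a sketch (stage and table names taken from the question) might look like:
copy into orders_staging (id, Product_Name, Sales, Quantity, Profit)
from (
    select $1, $2, $3, $4, $5
    from @sales_data_stage
)
file_format = (
    type = csv
    field_delimiter = ','
    skip_header = 1
    field_optionally_enclosed_by = '"'  -- keeps "…, Black" and "$3,709.40" as single fields
    encoding = 'iso-8859-1'
);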
https://community.snowflake.com/s/question/0D50Z00008pDcoRSAS/copying-csv-files-delimited-by-commas-where-commas-are-also-enclosed-in-strings
Additional documentation for COPY INTO table:
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#type-csv
Additional documentation on CREATE FILE FORMAT:
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html

Loading CSV data to Snowflake table

The column splits into multiple columns when trying to load the following data into a Snowflake table, since it's a CSV file.
Column Data :
{"Department":"Mens
Wear","Departmentid":"10.1;20.1","customername":"john4","class":"tops wear","subclass":"sweat shirts","product":"North & Face 2 Bangle","style":"Sweat shirt hoodie - Large - Black"}
Is there any other way to load the data into a single column?
The best solution would be to use a different delimiter instead of a comma in your CSV file. If that's not possible, you can ingest the data using a non-existent delimiter to get the whole line as one column, and then parse it. Of course, it won't be as efficient as native loading:
cat test.csv
1,2020-10-12,Gokhan,{"Department":"Mens Wear","Departmentid":"10.1;20.1","customername":"john4","class":"tops wear","subclass":"sweat shirts","product":"North & Face 2 Bangle","style":"Sweat shirt hoodie - Large - Black"}
create file format csvfile type=csv FIELD_DELIMITER='NONEXISTENT';
select $1 from @my_stage (file_format => csvfile);
create table testtable( id number, d1 date, name varchar, v variant );
copy into testtable from (
    select
        split( split($1,',{')[0], ',' )[0],
        split( split($1,',{')[0], ',' )[1],
        split( split($1,',{')[0], ',' )[2],
        parse_json( '{' || split($1,',{')[1] )
    from @my_stage (file_format => csvfile)
);
select * from testtable;
+----+------------+--------+-----------------------------------------------------------------+
| ID | D1 | NAME | V |
+----+------------+--------+-----------------------------------------------------------------+
| 1 | 2020-10-12 | Gokhan | { "Department": "Mens Wear", "Departmentid": "10.1;20.1", ... } |
+----+------------+--------+-----------------------------------------------------------------+

How can I update many rows with different values for the same column?

I have a table with a column containing a path to a file. The path is an absolute path, and values for this column look like this: C:\CI\Media\animal.jpg.
The table looks like so, except there are many rows so editing by hand is not practical:
+----+-----------------------------------+
| ID | Path                              |
+----+-----------------------------------+
| 1  | C:\CI\Media\sushi.jpg             |
| 2  | C:\CI\Media\animal.jpg            |
| 3  | C:\CI\Media\Tuscany Trip\pisa.png |
+----+-----------------------------------+
Path is an nvarchar(260)
And what I'd like to do is run a query that will update each record so the path for each record replaces C:\CI with C:\CI\Net, ending up with a table that looks like so:
+----+---------------------------------------+
| ID | Path                                  |
+----+---------------------------------------+
| 1  | C:\CI\Net\Media\sushi.jpg             |
| 2  | C:\CI\Net\Media\animal.jpg            |
| 3  | C:\CI\Net\Media\Tuscany Trip\pisa.png |
+----+---------------------------------------+
Is there a way to write a query that will update every record based on its existing value (replacing the C:\CI portion with C:\CI\Net while maintaining the rest of the value), instead of setting each column to the same value like a normal Update table set column = value?
Gosh you almost wrote the code yourself.
Update YourTable
set path = replace(path, 'C:\CI', 'C:\CI\Net')
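If the statement might be run more than once, or if other paths happen to start with C:\CI (say, C:\CIx), a slightly more defensive sketch (same assumed table name) is:
update YourTable
set path = replace(path, 'C:\CI\', 'C:\CI\Net\')  -- trailing backslash avoids partial matches
where path like 'C:\CI\%'
  and path not like 'C:\CI\Net\%';                -- skip rows already migrated, so reruns are safe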

Importing a JSON array into Hive

I'm trying to import the following JSON into Hive:
[{"time":1521115600,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":21.70905,"pm25":16.5,"pm10":14.60085,"gas1":0,"gas2":0.12,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115659,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":24.34045,"pm25":18.5,"pm10":16.37065,"gas1":0,"gas2":0.08,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115720,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":23.6826,"pm25":18,"pm10":15.9282,"gas1":0,"gas2":0,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115779,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":25.65615,"pm25":19.5,"pm10":17.25555,"gas1":0,"gas2":0.04,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0}]
CREATE TABLE json_serde (
s array<struct<time: timestamp, latitude: string, longitude: string, pm1: string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.value' = 'value'
)
STORED AS TEXTFILE
location '/user/hduser';
The import works, but if I try
Select * from json_serde;
it returns only the first element per file from every document under hadoop /user/hduser.
Is there good documentation on working with JSON arrays?
If I may suggest another approach: just load the whole JSON string into a single String column of an external table. The only restriction is to define LINES TERMINATED BY properly. E.g., if each JSON document sits on one line, you can create the table as below:
CREATE EXTERNAL TABLE json_data_table (
json_data String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION '/path/to/json';
Use Hive's get_json_object to extract individual columns. This command supports a basic XPath-like query into the JSON string. For example, if the json_data column holds the JSON string below:
{"store":
{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red"}
},
"email":"amy#only_for_json_udf_test.net",
"owner":"amy"
}
then the below query:
SELECT get_json_object(json_data, '$.owner') FROM json_data_table;
returns amy.
In this way you can extract each JSON element as a column from the table.
You have an array of structs. What you pasted is only one line.
If you want to see all the elements, you need to use inline
SELECT inline(s) FROM json_table;
Alternatively, you need to rewrite your files such that each object within that array is a single JSON object on its own line of the file
Also, I don't see a value field in your data, so I'm not sure what you're mapping in the SerDe properties.
The JSON that you provided is not in the shape this SerDe expects. Each record should be a single JSON object, starting with an opening curly brace "{" and ending with a closing curly brace "}" (a top-level array is valid JSON, but it does not map to a Hive record here).
So, the first thing to look at is the shape of your JSON.
Your JSON should have been like the one below:
{"key":[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2""},{"key1":"value1","key2":"value2"}]}
And, the second thing is that you have declared the data type of the time field as timestamp, but the bare epoch value (1521115600) gets interpreted here as milliseconds, which is why the rows below show 1970-01-18. The timestamp data type needs data in the format YYYY-MM-DD HH:MM:SS[.fffffffff].
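As an aside, if those epoch values are actually Unix seconds rather than milliseconds, a small HiveQL sketch for converting them at query time would be:
-- from_unixtime expects seconds and formats in the session time zone
select from_unixtime(1521115600);
-- 2018-03-15 12:06:40 (UTC), i.e. the March 2018 date you'd expect from seconds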
So, your data should ideally be in the below format:
{"myjson":[{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":21.70905,"pm25":16.5,"pm10":14.60085,"gas1":0,"gas2":0.12,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":24.34045,"pm25":18.5,"pm10":16.37065,"gas1":0,"gas2":0.08,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":23.6826,"pm25":18,"pm10":15.9282,"gas1":0,"gas2":0,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":25.65615,"pm25":19.5,"pm10":17.25555,"gas1":0,"gas2":0.04,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0}]}
Now you can use a query to select the records from the table:
hive> select * from json_serde;
OK
[{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"21.70905"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"24.34045"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"23.6826"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"25.65615"}]
Time taken: 0.069 seconds, Fetched: 1 row(s)
hive>
If you want each value separately displayed in tabular format, you can use the below query.
select b.* from json_serde a lateral view outer inline (a.myjson) b;
The result of the above query would be like this:
+------------------------+-------------+--------------+-----------+--+
| b.time | b.latitude | b.longitude | b.pm1 |
+------------------------+-------------+--------------+-----------+--+
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 21.70905 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 24.34045 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 23.6826 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 25.65615 |
+------------------------+-------------+--------------+-----------+--+
Beautiful, isn't it?
Happy learning.
If you cannot change your input file format, you can import the data directly in Spark and work with it there; once the data is finalized, write it back to the Hive table.
scala> val myjs = spark.read.format("json").option("path","file:///root/tmp/test5").load()
myjs: org.apache.spark.sql.DataFrame = [altitude: bigint, gas1: bigint ... 13 more fields]
scala> myjs.show()
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
|altitude|gas1|gas2|gas3|gas4|humidity|latitude|longitude|noise| pm1| pm10|pm25|pressure|temperature| time|
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
| 53| 0|0.12| 0| 0| 0| 44.3959| 26.1025| 0|21.70905|14.60085|16.5| 0| null|1521115600|
| 53| 0|0.08| 0| 0| 0| 44.3959| 26.1025| 0|24.34045|16.37065|18.5| 0| null|1521115659|
| 53| 0| 0.0| 0| 0| 0| 44.3959| 26.1025| 0| 23.6826| 15.9282|18.0| 0| null|1521115720|
| 53| 0|0.04| 0| 0| 0| 44.3959| 26.1025| 0|25.65615|17.25555|19.5| 0| null|1521115779|
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
scala> myjs.write.json("file:///root/tmp/test_output")
Alternatively, you can write directly to a Hive table:
scala> myjs.createOrReplaceTempView("myjs")
scala> spark.sql("select * from myjs").show()
scala> spark.sql("create table tax.myjs_hive as select * from myjs")

SSIS Combine multiple rows into single row

I have a flat file that has 6 columns: NoteID, Sequence, FileNumber, EntryDte, NoteType, and NoteText. The NoteText column holds 200 characters; if a note is longer than 200 characters, a second row in the file contains the continuation of the note. It looks something like this:
| NoteID | Sequence | NoteText                           |
----------------------------------------------------------
| 1234   | 1        | start of note text...              |
| 1234   | 2        | continue of note....               |
| 1234   | 3        | more continuation of first note... |
| 1235   | 1        | start of new note....              |
How can I combine the multiple rows of NoteText into a single row in SSIS, so the result would look like this:
| NoteID | Sequence | NoteText                                                                      |
-----------------------------------------------------------------------------------------------------
| 1234   | 1        | start of note text... continue of note... more continuation of first note... |
| 1235   | 1        | start of new note....                                                         |
I'd greatly appreciate any help.
Update: Changing the SynchronousInputID to None exposed the Output0Buffer and I was able to use it. Below is what I have in place now.
Dim NoteID As String = "-1"
Dim NoteString As String = ""
Dim IsFirstRow As Boolean = True
Dim NoteBlob As Byte()
Dim enc As New System.Text.ASCIIEncoding()
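' Note (added): an NText (DT_NTEXT) output column stores UTF-16, so
' ASCII-encoded bytes will surface as gibberish in SQL Server; swapping
' in System.Text.UnicodeEncoding here is the usual fix for the symptom
' described below.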
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    If Row.NoteID.ToString() = NoteID Then
        NoteString += Row.NoteHTML
        IsFirstRow = True
    Else
        If IsFirstRow Then
            Output0Buffer.AddRow()
            IsFirstRow = False
        End If
        NoteID = Row.NoteID.ToString()
        NoteString = Row.NoteHTML.ToString()
    End If
    NoteBlob = enc.GetBytes(NoteString)
    Output0Buffer.SingleNoteHTML.AddBlobData(NoteBlob)
    Output0Buffer.ClaimID = Row.ClaimID
    Output0Buffer.UserID = Row.UserID
    Output0Buffer.NoteTypeLookupID = Row.NoteTypeLookupID
    Output0Buffer.DateCreatedUTC = Row.DateCreated
    Output0Buffer.ActivityDateUTC = Row.ActivityDate
    Output0Buffer.IsPublic = Row.IsPublic
End Sub
My problem now is that I had to convert the output column from WSTR(4000) to NText because some of the notes are so long. When it imports into my SQL table, it is just gibberish characters and not the actual notes.
In SQL Server Management Studio (using SQL), you can easily combine your NoteText rows using the STUFF function with FOR XML PATH to merge the row values into a single column; the ORDER BY in the subquery keeps the note fragments in sequence:
select distinct
    noteid,
    min(sequence) over (partition by n.noteid) as sequence,
    stuff((select ' ' + NoteText
           from notes n1
           where n.noteid = n1.noteid
           order by n1.sequence
           for xml path ('')
          ), 1, 1, '') as NoteText
from notes n;
You will probably want to look into something along those lines that does a similar thing in SSIS. Check out this link on how to create a Script Component in SSIS to do something similar: SSIS Script Component - concat rows
SQL Fiddle Demo
