Loading CSV data to Snowflake table

The column splits into multiple columns when I try to load the following data into a Snowflake table, since it is a CSV file.
Column data:
{"Department":"Mens Wear","Departmentid":"10.1;20.1","customername":"john4","class":"tops wear","subclass":"sweat shirts","product":"North & Face 2 Bangle","style":"Sweat shirt hoodie - Large - Black"}
Is there any other way to load the data into a single column?

The best solution would be to use a different delimiter instead of the comma in your CSV file. If that's not possible, you can ingest the data using a non-existent delimiter to read the whole line as one column, and then parse it. Of course it won't be as efficient as native loading:
cat test.csv
1,2020-10-12,Gokhan,{"Department":"Mens Wear","Departmentid":"10.1;20.1","customername":"john4","class":"tops wear","subclass":"sweat shirts","product":"North & Face 2 Bangle","style":"Sweat shirt hoodie - Large - Black"}
-- a delimiter that never appears in the data makes Snowflake read each line as one column
create file format csvfile type=csv FIELD_DELIMITER='NONEXISTENT';

select $1 from @my_stage (file_format => csvfile);

create table testtable( id number, d1 date, name varchar, v variant );

copy into testtable from (
  select
    -- split the line at ',{' to separate the scalar columns from the JSON part
    split( split($1,',{')[0], ',' )[0],
    split( split($1,',{')[0], ',' )[1],
    split( split($1,',{')[0], ',' )[2],
    -- re-attach the stripped '{' and parse the JSON into a variant
    parse_json( '{' || split($1,',{')[1] )
  from @my_stage (file_format => csvfile)
);
select * from testtable;
+----+------------+--------+-----------------------------------------------------------------+
| ID | D1         | NAME   | V                                                               |
+----+------------+--------+-----------------------------------------------------------------+
| 1  | 2020-10-12 | Gokhan | { "Department": "Mens Wear", "Departmentid": "10.1;20.1", ... } |
+----+------------+--------+-----------------------------------------------------------------+
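Once loaded, the variant column can be queried with Snowflake's path syntax as a quick check that the parse worked (field names taken from the JSON above):

-- field access on the variant column
select v:Department::string as department,
       v:style::string      as style
from testtable;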

Related

Store a list of values as a string when creating a table in snowflake

I am trying to create a table with 5 columns. COLUMN #2 (PROGRESS) is a comma-separated list (i.e. 1,2,3,4 etc.), but when trying to create this table as either a string, variant or varchar, Snowflake refuses to allow this. Any advice on how I can load a comma-separated list from a CSV? I tried to import the data as a TSV, XML, as well as a JSON file, but no success.
create or replace TABLE AD_HOC.TEMP.NEW_DATA (
  VISITOR_ID VARCHAR(16777216),
  PROGRESS VARCHAR(16777216),
  DATE DATETIME,
  ROLE VARCHAR(16777216),
  FIRST_VISIT DATETIME
) COMMENT='Interaction data';
Goal:
VISITOR_ID | PROGRESS  | DATE      | ROLE  | FIRST_VISIT
111        | [1,2,3]   | 1/1/2022  | OWNER | 1/1/2021
123        | [1]       | 1/2/2022  | ADMIN | 2/2/2021
23321      | [1,2,3,4] | 2/22/2022 | USER  | 3/12/2021
I encoded the column in Python and loaded the data into Snowflake!
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# one-hot encode the PROGRESS labels; assumes PROGRESS holds iterable lists of labels
mlb = MultiLabelBinarizer()
df = doc_data.join(pd.DataFrame(mlb.fit_transform(doc_data.pop('PROGRESS')),
                                columns=mlb.classes_,
                                index=doc_data.index))
df
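For reference, if the goal table above is wanted with a real array column instead of one-hot columns, a pure-SQL sketch (NEW_DATA_ARR is a hypothetical name; assumes PROGRESS was loaded as plain text such as '1,2,3'):

-- SPLIT returns an ARRAY in Snowflake, giving the [1,2,3] shape from the goal table
create or replace table AD_HOC.TEMP.NEW_DATA_ARR as
select VISITOR_ID,
       split(PROGRESS, ',') as PROGRESS,  -- '1,2,3' becomes ["1","2","3"]
       DATE,
       ROLE,
       FIRST_VISIT
from AD_HOC.TEMP.NEW_DATA;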

How to transform data when we have comma separated values in csv format file in snowflake

I have an Excel CSV data set with the following data:
Columns: id, product_name, sales, quantity, Profit
Data: 1, "Novimex Executive Leather Armchair, Black","$3,709.40", 9, -$288.77
When I try to insert these records from the stage into the Snowflake table, the data shifts out of the product_name column because the value contains a comma (", Black"), and the following columns shift in the same way. After loading, the data looks like this:
+----+-------------------------------------+--------+----------+---------+
| id | product_name                        | sales  | quantity | Profit  |
+----+-------------------------------------+--------+----------+---------+
| 1  | "Novimex Executive Leather Armchair | Black" | $3       | 709.40" |
+----+-------------------------------------+--------+----------+---------+
Query used:
copy into orders_staging (id,Product_Name,Sales,Quantity,Profit)
from (
  select $1, $2, $3, $4, $5
  from @sales_data_stage
)
file_format = (type = csv field_delimiter = ',' skip_header = 1 ENCODING = 'iso-8859-1');
Use field enclosure:
FIELD_OPTIONALLY_ENCLOSED_BY='"'
If you have any issues with accounting-style numbers, remember to put quotes around them too.
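Applied to the COPY statement from the question, a minimal sketch:

copy into orders_staging (id, Product_Name, Sales, Quantity, Profit)
from (
  select $1, $2, $3, $4, $5
  from @sales_data_stage
)
file_format = (type = csv
               field_delimiter = ','
               skip_header = 1
               field_optionally_enclosed_by = '"'  -- keeps ", Black" and "$3,709.40" inside one field
               encoding = 'iso-8859-1');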
https://community.snowflake.com/s/question/0D50Z00008pDcoRSAS/copying-csv-files-delimited-by-commas-where-commas-are-also-enclosed-in-strings
Additional documentation for COPY INTO <table>:
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#type-csv
Additional documentation on CREATE FILE FORMAT:
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html

Check a value in an array inside a JSON object in PostgreSQL 9.5

I have a JSON object containing an array and other properties.
I need to check the first value of the array for each line of my table.
Here is an example of the JSON:
{"objectID2":342,"objectID1":46,"objectType":["Demand","Entity"]}
So I need, for example, to get all lines with objectType[0] = 'Demand' and objectID1 = 46.
These are the table columns:
id | relationName | content
The content column contains the JSON.
Just query them, like:
t=# with table_name(id, rn, content) as (values(1,null,'{"objectID2":342,"objectID1":46,"objectType":["Demand","Entity"]}'::json))
select * From table_name
where content->'objectType'->>0 = 'Demand' and content->>'objectID1' = '46';
 id | rn |                              content
----+----+--------------------------------------------------------------------
  1 |    | {"objectID2":342,"objectID1":46,"objectType":["Demand","Entity"]}
(1 row)
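Against the real table, the same predicates apply directly; a sketch, with my_table standing in for the actual table name and content typed json:

-- quote "relationName" only if the column was created with quotes
select id, "relationName", content
from my_table
where content->'objectType'->>0 = 'Demand'
  and content->>'objectID1' = '46';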

SSIS Merge Varying Columns

Using SSIS, I am importing a .txt file, which for the most part is straightforward.
The file being imported has a set amount of columns up to a point, but there is a free text/comments field, which can repeat to unknown length, similar to below.
"000001","J Smith","Red","Free text here"
"000002","A Ball","Blue","Free text here","but can","continue"
"000003","W White","Green","Free text here","but can","continue","indefinitely"
"000004","J Roley","Red","Free text here"
What I would ideally like to do (within SSIS) is to keep the first three columns as singular columns, but to merge any free-text ones into a single column. i.e. Merge/concatenate anything which appears after the 'colour' column.
So when I load this into an SSMS table, it appears like:
000001 | J Smith | Red   | Free text here                                |
000002 | A Ball  | Blue  | Free text here but can continue               |
000003 | W White | Green | Free text here but can continue indefinitely  |
000004 | J Roley | Red   | Free text here                                |
I do not see any easy solution. You can try something like below:
1. Load the complete raw data to a temp table (without any delimiter):
Steps:
Create the temp table in an Execute SQL Task
Create a data flow task with a flat file source (with Ragged Right format) and an OLEDB destination (using the #temp table created in the previous task)
Set DelayValidation=True for the connection manager and the DFT
Set RetainSameConnection=True for the connection manager
Refer to this to create the temp table and use it.
2. Create T-SQL to separate the 3 columns (something like below):
with col1 as (
  select
    [Val],
    -- everything before the first comma
    substring([Val], 1, charindex(',', [Val]) - 1) col1,
    len(substring([Val], 1, charindex(',', [Val]))) + 1 col1Len
  from #temp
), col2 as (
  select
    [Val],
    col1,
    -- everything between the first and second commas
    substring([Val], col1Len, charindex(',', [Val], col1Len) - col1Len) as col2,
    charindex(',', [Val], col1Len) + 1 col2Len
  from col1
)
select col1, col2, substring([Val], col2Len, 200) as col3
from col2
T-SQL Output:
col1       col2        col3
"000001"   "J Smith"   "Red","Free text here"
"000002"   "A Ball"    "Blue","Free text here","but can","continue"
"000003"   "W White"   "Green","Free text here","but can","continue","indefinitely"
3. Use the above query in an OLEDB source in a different data flow task.
Replace the double quotes (") as per your requirement; for example, the final SELECT in step 2 could strip them, as sketched below.
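A minimal sketch of that quote stripping (reuses the col2 CTE and col2Len from step 2; untested):

select replace(col1, '"', '') as col1,
       replace(col2, '"', '') as col2,
       replace(substring([Val], col2Len, 200), '"', '') as col3
from col2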
This was a fun exercise:
Add a data flow
Add a Script Component (select Source)
Add 4 columns to Output0: ID, Name, Color, FreeText, all of type string
Edit the script:
Paste the following namespaces up top:
using System.Text.RegularExpressions;
using System.Linq;
Paste the following code into CreateNewOutputRows:
string strPath = @"a:\test.txt"; // put your file path in here
var lines = System.IO.File.ReadAllLines(strPath);
foreach (string line in lines)
{
    // Code I stole to read CSV: matches quoted strings or runs of non-delimiter chars
    string delimiter = ",";
    Regex rgx = new Regex(String.Format("(\"[^\"]*\"|[^{0}])+", delimiter));
    var cols = rgx.Matches(line)
                  .Cast<Match>()
                  .Select(m => m.Value.Trim().Trim('"'))
                  .Where(v => !string.IsNullOrWhiteSpace(v));

    // create a column counter
    int ctr = 0;
    Output0Buffer.AddRow();

    // Preset FreeText to empty string
    string FreeTextBuilder = String.Empty;
    foreach (string col in cols)
    {
        switch (ctr)
        {
            case 0:
                Output0Buffer.ID = col;
                break;
            case 1:
                Output0Buffer.Name = col;
                break;
            case 2:
                Output0Buffer.Color = col;
                break;
            default:
                // everything after the colour column is concatenated into FreeText
                FreeTextBuilder += col + " ";
                break;
        }
        ctr++;
    }
    Output0Buffer.FreeText = FreeTextBuilder.Trim();
}

Hive lateral view not working AWS Athena

I'm working on an AWS CloudTrail log analysis process and I'm getting stuck extracting JSON from a row.
This is my table definition.
CREATE EXTERNAL TABLE cloudtrail_logs (
eventversion STRING,
eventName STRING,
awsRegion STRING,
requestParameters STRING,
elements STRING,
additionalEventData STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://XXXXXX/CloudTrail'
If I run select elements from cl1 limit 1, it returns this result:
{"groupId":"sg-XXXX","ipPermissions":{"items":[{"ipProtocol":"tcp","fromPort":22,"toPort":22,"groups":{},"ipRanges":{"items":[{"cidrIp":"0.0.0.0/0"}]},"prefixListIds":{}}]}}
I need to show this result as virtual columns, like:
| groupId | ipProtocol | fromPort | toPort | ipRanges.items.cidrIp |
|---------|------------|----------|--------|-----------------------|
| -1      | 0          |          |        |                       |
I'm using AWS Athena, and I tried LATERAL VIEW and get_json_object, but they do not work in Athena.
It's an external table.
select json_extract_scalar(i.item,'$.ipProtocol') as ipProtocol
,json_extract_scalar(i.item,'$.fromPort') as fromPort
,json_extract_scalar(i.item,'$.toPort') as toPort
from cloudtrail_logs
cross join unnest (cast(json_extract(elements,'$.ipPermissions.items')
as array(json))) as i (item)
;
ipProtocol | fromPort | toPort
------------+----------+--------
"tcp" | 22 | 22
