COPY INTO Snowflake Table with Extra Columns as a defined value

I've got a table defined in Snowflake as:
table1
COL1 VARCHAR(16777216)
COL2 VARCHAR(16777216)
tablename VARCHAR(16777216)
and a CSV file (which has only 2 columns) as below:
table1.csv
COL1
COL2
I want to copy all columns from "table1.csv" into the "table1" Snowflake table, so the data for COL1 and COL2 gets copied from the CSV, and the "tablename" column should contain "table1" (the file name).
My copy into command looks like this:
copy into table1
from @snowflake_stage
file_format = (type = 'CSV' skip_header = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"'
validate_utf8 = FALSE error_on_column_count_mismatch=false)
on_error = CONTINUE;
Data is loading successfully for COL1 and COL2 from the CSV to the Snowflake table, but the tablename column is NULL. I want the tablename column to have the value "table1" for all the rows.

Snowflake automatically generates metadata columns for staged files.
You can use METADATA$FILENAME.
It holds the full path, so you can either store the full path or use a SUBSTRING expression to get the file name without the path or the extension.
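A minimal sketch of how that could look, assuming the stage is @snowflake_stage and the two data columns arrive as $1 and $2 (the REGEXP_REPLACE call is just one way to strip the path and the .csv extension):
copy into table1 (COL1, COL2, tablename)
from (
    select
        $1,
        $2,
        -- keep only the bare file name: drop the directory path and the ".csv" suffix
        regexp_replace(metadata$filename, '.*/|[.]csv$', '')
    from @snowflake_stage
)
file_format = (type = 'CSV' skip_header = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"'
               error_on_column_count_mismatch = false)
on_error = CONTINUE;
With a file named table1.csv, every loaded row then gets 'table1' in the tablename column.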

Related

Is there an equivalent of SQL "FOR XML" in Snowflake?

I have a SQL query that combines multiple rows from a table into a single-row, ordered-list result set.
TableA
Col1
ABC
DEF
select * from TableA for xml raw(''), root('ol'), elements, type
Output:
<ol><li>ABC</li><li>DEF</li></ol>
I would like to achieve the same result in Snowflake.
There's no built-in XML constructor in Snowflake, but for simple XML formats you can use listagg and concatenation to produce the XML:
create or replace temp table T1(COL1 string);
insert into T1 (COL1) values ('ABC'), ('DEF');
select '<ol><li>' || listagg(COL1, '</li><li>') || '</li></ol>' from T1;
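If the element order matters, note that LISTAGG makes no ordering guarantee unless you add a WITHIN GROUP clause, e.g.:
select '<ol><li>' || listagg(COL1, '</li><li>') within group (order by COL1) || '</li></ol>' from T1;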

How to insert a few columns from multiple CSV files on an S3 bucket into a Snowflake table at a time (in a single query)

I want to insert specific columns from multiple CSV files at an S3 location into a Snowflake table. Suppose the 1st column of the 1st CSV file goes to the 1st column of the Snowflake table, the 5th column of the 2nd CSV file goes to the 2nd column of the Snowflake table, etc. Is it possible to create a query for this?
From the little you provided as requirements, that can be achieved by using $1, $2, ... as the column references for the files on S3.
To give you an idea:
copy into table from (
select $1,'',...
from file1
union
select '',$5,...
from file2
)
You need to provide more info =)
Generally, how would you link/relate the info in the two CSV files? You need at least a key of some sort that's available from both sources.
I would think of it in steps and with an ELT mindset instead of ETL:
Load CSV1 into Table1
Load CSV2 into Table2
CREATE or REPLACE Table3 (CommonKey datatype, Column1 datatype, Column2 datatype)
INSERT INTO Table3
SELECT T1.CommonKey, Column1, Column2
FROM Table1 T1
JOIN Table2 T2 ON T1.CommonKey = T2.CommonKey
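A minimal sketch of the two load steps ("Load CSV1 into Table1" and "Load CSV2 into Table2"), assuming an external stage named @my_s3_stage over the bucket and a plain CSV file format (both names are placeholders):
-- land each file in its own staging table first
copy into Table1 from @my_s3_stage/file1.csv
    file_format = (type = 'CSV' skip_header = 1);

copy into Table2 from @my_s3_stage/file2.csv
    file_format = (type = 'CSV' skip_header = 1);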
I don't think there is a way to load data for one table from multiple files at a time. Instead, you can do the following.
You can specify the column order in the COPY command:
copy into table (col1, col2, ..., coln) from (
    select $1, $2, ..., $n
    from file1
)
file_format = (format_name = 'my_file_format')
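Applied to the example in the question, that becomes one COPY per file; a sketch, with the stage, table, and file-format names assumed:
-- 1st column of the 1st CSV into col1
copy into my_table (col1) from (
    select $1
    from @my_stage/file1.csv
)
file_format = (format_name = 'my_file_format');

-- 5th column of the 2nd CSV into col2
copy into my_table (col2) from (
    select $5
    from @my_stage/file2.csv
)
file_format = (format_name = 'my_file_format');
Note that each COPY inserts its own rows, so if the two files need to line up on a key, the staging-table-and-JOIN approach above is the safer route.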

How to enter NULL for missing columns in a CSV?

I am trying to perform a bulk insert from a CSV file.
My CSV file has 7 columns but the table contains 8 columns.
I can perform the bulk insert with the query below only when the table's columns match the file's columns.
BULK INSERT Table_Name FROM 'E:\file\input.csv' WITH (ROWTERMINATOR = '0x0A',CODEPAGE = 'ACP',FIELDTERMINATOR = ',',KEEPNULLS, ROWS_PER_BATCH = 10000)
But my CSV contains only 7 columns, which leads to the error below:
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 8 (datecolumn).
Can anyone suggest a way to resolve this without using a format file?
Create a view with the 7 columns and insert into that view instead.
Example with fewer columns:
CREATE TABLE test_table(col1 int, col2 int, col3 int)
go
CREATE VIEW v_test_table
as
SELECT col1, col2
FROM test_table
go
INSERT v_test_table
SELECT 1,2
go
SELECT * FROM test_table
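Applied to the question, the BULK INSERT would then target a view exposing only the 7 columns present in the CSV, so the 8th column (datecolumn) is left NULL in the base table. A sketch, with the view name assumed:
-- v_seven_cols lists only the 7 columns that appear in input.csv
BULK INSERT v_seven_cols
FROM 'E:\file\input.csv'
WITH (ROWTERMINATOR = '0x0A', CODEPAGE = 'ACP', FIELDTERMINATOR = ',',
      KEEPNULLS, ROWS_PER_BATCH = 10000)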

Combining multiple queries & returns

I have a .csv file with a list of 100 product_ids. I have a SQL Server table which (amongst others) has a product_id column (4000+ ids) and a product_description column.
What I want to do is take the .CSV list of product_ids and run a query on the table to return a list of the relevant product_description.
So my simple query would be
select product_description
from tablename
where product_id = xxxxxxx.
But how do I supply the xxxxxx as a list (perhaps I just separate them with commas?) and dump the output into another CSV?
Can't you just copy + paste the ids from the CSV into a query?
SELECT
product_id
, product_description
FROM <table>
WHERE product_id in (<<list of values from CSV>>).
Since they're already in a CSV, they should be comma delimited, so you can easily plug them into your query (if you open your file with a text editor).
Alternatively, you could do everything from SQL, like this:
CREATE TABLE #TempTable (
ID integer
, col2 ..
, col3 ..
etc. )
GO
BULK INSERT #TempTable
FROM 'C:\..\file.csv'
WITH
(
FIRSTROW = 2, -- in case the first row contains headers (otherwise just remove this line)
FIELDTERMINATOR = ',', -- default CSV field delimiter
ROWTERMINATOR = '\n',
ERRORFILE = 'C:\CSVDATA\SchoolsErrorRows.csv',
TABLOCK
)
And then just run:
SELECT
product_id
, product_description
FROM <table>
WHERE product_id in (SELECT ID FROM #TempTable)
If you want to export this result to another CSV (here assumed to be D:\output.csv, which must already exist with a matching header row, since the text provider will not create it), then:
INSERT INTO OPENROWSET(
    'Microsoft.ACE.OLEDB.12.0'
    , 'Text;Database=D:\;HDR=YES;FMT=Delimited'
    -- the destination file name is an assumption; it must already exist in D:\
    , 'SELECT product_id, product_description FROM [output.csv]'
)
SELECT
    product_id
    , product_description
FROM <table>
WHERE product_id in (SELECT ID FROM #TempTable)

Remove duplicates from a staging file

I have a staging table which contains a whole series of rows of data which were taken from a data file.
Each row details a change to a row in a remote system; the rows are effectively snapshots of the source row taken after every change. Each row contains metadata timestamps for creation and updates.
I am now trying to build an update table from these data files, which contain all of the updates. I need a way to remove rows with duplicate keys, keeping only the row with the latest "update" timestamp.
I am aware I can use the SSIS "sort" transform to remove duplicates by sorting on the key field and telling it to remove duplicates, but how do I ensure that the row it keeps is the one with the latest timestamp?
This will remove rows that match on Col1, Col2, etc. and have an UpdateDate that is NOT the most recent:
DELETE D
FROM MyTable AS D
JOIN MyTable AS T
ON T.Col1 = D.Col1
AND T.Col2 = D.Col2
...
AND T.UpdateDate > D.UpdateDate
If Col1 and Col2 need to be considered "matching" if they are both NULL then you would need to use:
ON (T.Col1 = D.Col1 OR (T.Col1 IS NULL AND D.Col1 IS NULL))
AND (T.Col2 = D.Col2 OR (T.Col2 IS NULL AND D.Col2 IS NULL))
...
Edit: If you need to make a case-sensitive test on a case-insensitive database, then on VARCHAR and TEXT columns use:
ON (T.Col1 = D.Col1 COLLATE Latin1_General_BIN
OR (T.Col1 IS NULL AND D.Col1 IS NULL))
...
You can use the Sort Transform in SSIS to sort your data set by more than one column. Simply sort by your primary key (or ID field) followed by your timestamp column in descending order.
See the following article for more details on working with the Sort transformation:
http://msdn.microsoft.com/en-us/library/ms140182.aspx
Make sense?
Cheers, John
Does it make sense to just ignore the duplicates when moving from staging to final table?
You have to do this anyway, so why not issue one query against the staging table rather than two?
INSERT final
(key, col1, col2)
SELECT
s.key, s.col1, s.col2
FROM
staging s
JOIN
(SELECT key, MAX(datetimestamp) AS maxdt FROM staging GROUP BY key) ms
ON s.key = ms.key AND s.datetimestamp = ms.maxdt
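An equivalent single-statement alternative, if window functions are available, is to keep the latest snapshot per key with ROW_NUMBER (a sketch using the same assumed column names):
INSERT final
(key, col1, col2)
SELECT key, col1, col2
FROM (
    SELECT key, col1, col2,
           -- number each key's snapshots from newest to oldest
           ROW_NUMBER() OVER (PARTITION BY key ORDER BY datetimestamp DESC) AS rn
    FROM staging
) AS latest
WHERE rn = 1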
