Snowflake External Table Partition - Granular Path - snowflake-cloud-data-platform

Experts,
We have our JSON files stored in the below folder structure in S3 as
/appname/lob/2020/07/24/12,/appname/lob/2020/07/24/13,/appname/lob/2020/07/24/14
stage #SFSTG = /appname/lob/
We need to create a external table with partition based on the hours. We can derive the partition part from the metadata$filename. However question here is should the partition column should be created as timestamp or varchar?
Which partition datatype helps us in better performance when accessing the file from snowflake using External table.

Snowflake's recommendation is the following:
date_part date as to_date(substr(metadata$filename, 14, 10), 'YYYY/MM/DD'),
*Double check 14 is the correct start of your partition in your stage url I may have it incorrect here.
Full example:
CREATE OR REPLACE EXTERNAL TABLE Database.Schema.ExternalTableName(
date_part date as to_date(substr(metadata$filename, 14, 10), 'YYYY/MM/DD'),
col1 varchar AS (value:col1::varchar)),
col2 varchar AS (value:col2::varchar))
PARTITION BY (date_part)
INTEGRATION = 'YourIntegration'
LOCATION=#SFSTG/appname/lob/
AUTO_REFRESH = true
FILE_FORMAT = (TYPE = JSON);

Related

How to load default colums while loading parquet file in snowflake

I am loading parquet file into snowflake using copy command. Parquet file has 10 column and Snowflake target table has 12 column( 2 with default dates - create date and update date)
SQL compilation error: Insert value list does not match column list expecting 12 but got 10
Is there any way i can load default values into snowflake table while loading the data through parquet file with less or more colums
any help would be greatly appreciated
You must give all columns in table specification.
If we use only copy into table_name from (select column_names from #stage), then stage file needs to match all column present in table.
copy into <table>(<col1>,<col2>,....<col10>) from
(select
$1:<col_1_parq_file>::<format_specifier>,
$1:<col_2_parq_file>::<format_specifier>,
$1:<col_3_parq_file>::<format_specifier>,
.
.
.
$1:<col_10_parq_file>::<format_specifier>
from #<stage_with_parquest_file>);

Snowflake - Keeping target table schema in sync with source table variant column value

I ingest data into a table source_table with AVRO data. There is a column in this table say "avro_data" which will be populated with variant data.
I plan to copy data into a structured table target_table where columns have the same name and datatype as the avro_data fields in the source table.
Example:
select avro_data from source_table
{"C1":"V1", "C2", "V2"}
This will result in
select * from target_table
------------
| C1 | C2 |
------------
| V1 | V2 |
------------
My question is when schema of the avro_data evolves and new fields get added, how can I keep schema of the target_table in sync by adding equivalent columns in the target table?
Is there anything out of the box in snowflake to achieve this or if someone has created any code to do something similar?
Here's something to get you started. It shows how to take a variant column and parse out the internal columns. This uses a table in the Snowflake sample data database, which is not always the same. You can to adjust the table name and column name.
SELECT DISTINCT regexp_replace(regexp_replace(f.path,'\\\\[(.+)\\\\]'),'(\\\\w+)','\"\\\\1\"') AS path_name, -- This generates paths with levels enclosed by double quotes (ex: "path"."to"."element"). It also strips any bracket-enclosed array element references (like "[0]")
DECODE (substr(typeof(f.value),1,1),'A','ARRAY','B','BOOLEAN','I','FLOAT','D','FLOAT','STRING') AS attribute_type, -- This generates column datatypes of ARRAY, BOOLEAN, FLOAT, and STRING only
REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\\\[(.+)\\\\]'),'[^a-zA-Z0-9]','_') AS alias_name -- This generates column aliases based on the path
FROM
"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."JCUSTOMER",
LATERAL FLATTEN("CUSTOMER", RECURSIVE=>true) f
WHERE TYPEOF(f.value) != 'OBJECT'
AND NOT contains(f.path, '[');
This is a snippet of code modified from here: https://community.snowflake.com/s/article/Automating-Snowflake-Semi-Structured-JSON-Data-Handling. The blog author attributes credit to a colleague for this section of code.
While the current incarnation of the stored procedure will create a view from the internal columns in a variant, an alternate version could create and/or alter a table to keep it in sync with changes.

snowflake parquet load schema generation

Working on loading parquet file to snowflake table from S3 location. This is what I am doing:
created target table
CREATE TABLE myschema.target_table(
col1 DATE,
col2 VARCHAR);
Created stage table using the following command
CREATE OR REPLACE TEMPORARY STAGE myschema.stage_table
url = 's3://mybucket/myfolder1/'
storage_integration = My_int
fileformat = (type = 'parquet')
Load the target table from the stage table
COPY INTO myschema.target_table FROM(
SELECT $1:col1::date,
$1:col2:varchar
FROM myschema.stage_table)
This works fine, my issue is, I have 10s of tables with 10s of columns. Is there any way to optimize the step 3, where I dont have to explicitly mention column names, so that code will become generic:
COPY INTO myschema.target_table FROM(
SELECT *
FROM myschema.stage_table)
Did you try
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE | CASE_INSENSITIVE | NONE
Document: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#type-parquet

Valid partition columns for external tables

I am trying to create an external table with various partition columns.
It works to do the following, for instance:
create or replace external table mytable(
myday date as to_date(substr(metadata$filename, 35, 10), 'YYYY-MM-DD'))
partition by (myday)
location = #mys3stage
file_format = (type = parquet);
However, I would like to use regex_substr instead of character indexing, as I won't always have consistent character indices for all partitioning columns. I would like to do this:
create or replace external table mytable(
myday date as to_date(regexp_substr(metadata$filename, 'day=[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'), 'day=YYYY-MM-DD'))
partition by (myday)
location = #mys3stage
file_format = (type = parquet);
This gives me an error Defining expression for partition column MYDAY is invalid. I can run the regexp_substr clause successfully in a select statement outside of the external table creation, getting the same results as the substr approach.
How can I use regex string matching in my external table partition column definition?
REGEX_SUBSTR is not a function that is on the list of currently supported partition key functions. Please use the following link to see the list of acceptable functions:
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html#partitioning-parameters
I am not completely sure I understand how your folder structure wouldn't be consistent. Perhaps if you provided an example, this community can offer a more concise response. However, if you are unable to come up with a parsing mechanism that works, perhaps you could leverage the CASE statement to handle each unique folder structure that you may come across in your environment.

SQL Update master table with new table data hourly based on no match on Composite PK

Using SQL Server 2008
I have an SSIS task that downloads a CSV file from FTP and renames the file every hour. After that I'm doing a bulk insert of the data into a new table called NEWFTPDATA.
The data in this file is for the current day up to the current hour. The table has a composite primary key consisting of 4 different columns.
The next step I need to complete is, using T-SQL, compare this new table to my existing master archive table and insert any rows that do not already exist based on matching (or rather not-matching on those 4 columns)
Since I'll be downloading this file hourly (for real-time reporting) for each subsequent run there will be duplicate data which I will not want to insert into the master table to avoid duplicating data.
I've found ways to do this based off of the existence of one particular column, but I can't seem to figure out how to do it based off of 4 columns needing to match.
The workflow should be as follows
Update MASTERTABLE from NEWFTPDATA where newftpdata.column1, newftpdata.column2, newftpdata.column3, newftpdata.column4 do not exist in MASTERTABLE
Hopefully I've supplied substantial information for this question. If any further details are required please let me know. Thank you.
you can use MERGE
MERGE MasterTable as dest
using newftpdata as src
on
dest.column1 = src.column1
and
dest.column2 = src.column2
and
dest.column3 = src.column3
and
dest.column4 = src.column4
WHEN NOT MATCHED then
INSERT (column1, column2, ...)
values ( Src.column1, Src.column2,....)

Resources