How to load default columns while loading a parquet file in Snowflake

I am loading a parquet file into Snowflake using the COPY command. The parquet file has 10 columns and the Snowflake target table has 12 columns (2 with default dates: create date and update date).
SQL compilation error: Insert value list does not match column list expecting 12 but got 10
Is there any way I can load default values into the Snowflake table while loading the data through a parquet file with fewer or more columns?
Any help would be greatly appreciated.

You must list the target columns explicitly in the COPY statement. If you use only copy into table_name from (select column_names from @stage), then the staged file needs to match all columns present in the table.
copy into <table> (<col1>, <col2>, ... <col10>)
from (
  select
    $1:<col_1_parq_file>::<format_specifier>,
    $1:<col_2_parq_file>::<format_specifier>,
    $1:<col_3_parq_file>::<format_specifier>,
    ...
    $1:<col_10_parq_file>::<format_specifier>
  from @<stage_with_parquet_file>
);
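For example, a minimal sketch (table, stage, and column names here are hypothetical): the two audit columns are omitted from the COPY column list, so Snowflake fills them from their DEFAULT expressions.
-- Hypothetical target: 10 data columns plus 2 defaulted audit columns
create table my_table (
  id          number,
  name        varchar,
  -- ... 8 more data columns ...
  create_date timestamp_ntz default current_timestamp(),
  update_date timestamp_ntz default current_timestamp()
);

-- List only the 10 columns present in the parquet file;
-- create_date and update_date are populated by their defaults.
copy into my_table (id, name /* , ... 8 more columns */)
from (
  select
    $1:id::number,
    $1:name::varchar
    -- , ... remaining parquet columns ...
  from @my_parquet_stage
);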

Best method to get today's data through a view in Snowflake

My setup details:
warehouse: XS
reading data from S3 into Snowflake via external tables
refresh structure: SNS
I have the S3 folder structure as below
S3://eveningdtaa/2022-06-07/files -- contains parquet format
S3://eveningdtaa/2022-06-08/files -- contains parquet format
S3://eveningdtaa/2022-06-09/files -- contains parquet format
I am using external tables to read the data from S3 into Snowflake.
So: tables hold historical information,
views hold daily data.
My view definition is as below:
create view result_view as (
  select *
  from table1
  where date_part = (select max(date_part) from table1)
);
My question: our daily views are running slowly, and the table has only 70k rows. Is there a way to rewrite my view to pick only the latest data instead of taking the max of the date? Or to make this view run faster through some kind of index?
Thanks,
Xi
It may be rewritten using QUALIFY:
create view result_view
as
select *
from table1
qualify date_part=max(date_part) over();
It is also worth partitioning the external table on the date column: see Partitioning Parameters in the CREATE EXTERNAL TABLE documentation.
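A hedged sketch of what that could look like, assuming the date-in-path layout shown above (the stage name @my_stage is hypothetical): the partition column is derived from METADATA$FILENAME, so the view's filter on date_part can prune whole partitions.
create or replace external table table1 (
  -- derive the partition column from the folder name, e.g. 2022-06-07;
  -- adjust the split_part index to where the date sits relative to the stage location
  date_part date as to_date(split_part(metadata$filename, '/', 1), 'YYYY-MM-DD')
  -- ... value columns defined as $1:<field>::<type> expressions ...
)
partition by (date_part)
location = @my_stage
auto_refresh = true
file_format = (type = parquet);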

Have a table with 1000+ columns; how to load data into a Snowflake table from parquet files without explicitly specifying the column names

Per the Snowflake documentation, for a table with 3 columns loaded from a parquet file you can use:
copy into cities
from (
  select
    $1:continent::varchar,
    $1:country:name::varchar,
    $1:country:city::variant
  from @sf_tut_stage/cities.parquet
);
If I have 1000+ columns, can I avoid listing every column like $1:col1, $1:col2 ... $1:col1000?
You may want to check out our INFER_SCHEMA function to dynamically obtain the columns/datatypes: https://docs.snowflake.com/en/sql-reference/functions/infer_schema.html
The EXPRESSION column should be able to get you 95% of the way there.
select *
from table(
  infer_schema(
    location => '@mystage',
    file_format => 'my_parquet_format'
  )
);
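Building on that, a minimal sketch (stage and format names as above; my_wide_table is hypothetical) that creates the table from the inferred schema and then loads it by column name, so none of the 1000+ columns have to be typed out:
-- Create the table directly from the inferred schema
create table my_wide_table using template (
  select array_agg(object_construct(*))
  from table(
    infer_schema(
      location => '@mystage',
      file_format => 'my_parquet_format'
    )
  )
);

-- Load it without listing any columns
copy into my_wide_table
from @mystage
file_format = (format_name = 'my_parquet_format')
match_by_column_name = case_insensitive;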

Snowflake join table with stage file

I have CSV files with a varying number of columns. Sometimes it can be 2, sometimes it can be 43. I have mapped these columns to the Snowflake metadata table. I want to insert values into the target table, but sometimes the column name in the CSV files can be different: for example subject, subject_name, or subject_names. In the target table I have only one column for this, called subject_name. So if the column "subject" in the CSV file is null, I need to check the "subject_name" column, and if "subject_name" is null, I need to check "subject_names". Is there any way to check whether these columns have null values? I must add that the columns in the CSV are not always in the same position, so I can't use select $1 from @stage.
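The null-fallback logic described here maps naturally to COALESCE. A hedged sketch, assuming the file has already been landed into a wide table whose columns carry the CSV header names (landing_table, target_table, and the column names are all hypothetical):
-- Take the first non-null of the three candidate header spellings
insert into target_table (subject_name)
select coalesce(subject, subject_name, subject_names)
from landing_table;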

snowflake parquet load schema generation

Working on loading parquet files into Snowflake tables from an S3 location. This is what I am doing:
Created the target table:
CREATE TABLE myschema.target_table (
  col1 DATE,
  col2 VARCHAR
);
Created the stage using the following command:
CREATE OR REPLACE TEMPORARY STAGE myschema.stage_table
  url = 's3://mybucket/myfolder1/'
  storage_integration = My_int
  file_format = (type = 'parquet');
Loaded the target table from the stage:
COPY INTO myschema.target_table FROM (
  SELECT
    $1:col1::date,
    $1:col2::varchar
  FROM @myschema.stage_table
);
This works fine. My issue is that I have tens of tables with tens of columns each. Is there any way to optimize step 3 so that I don't have to explicitly mention column names and the code becomes generic:
COPY INTO myschema.target_table FROM (
  SELECT *
  FROM @myschema.stage_table
);
Did you try
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE | CASE_INSENSITIVE | NONE
Documentation: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#type-parquet
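With that option, step 3 collapses to a plain COPY. A minimal sketch against the stage from step 2, assuming the parquet field names match the table's column names (case-insensitively):
COPY INTO myschema.target_table
FROM @myschema.stage_table
FILE_FORMAT = (TYPE = 'parquet')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;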

SQL Update master table with new table data hourly based on no match on Composite PK

Using SQL Server 2008
I have an SSIS task that downloads a CSV file from FTP and renames the file every hour. After that I'm doing a bulk insert of the data into a new table called NEWFTPDATA.
The data in this file is for the current day up to the current hour. The table has a composite primary key consisting of 4 different columns.
The next step I need to complete is, using T-SQL, to compare this new table to my existing master archive table and insert any rows that do not already exist, based on matching (or rather not matching) on those 4 columns.
Since I'll be downloading this file hourly (for real-time reporting), each subsequent run will contain duplicate data, which I do not want to insert into the master table.
I've found ways to do this based on the existence of one particular column, but I can't seem to figure out how to do it when 4 columns need to match.
The workflow should be as follows
Update MASTERTABLE from NEWFTPDATA where newftpdata.column1, newftpdata.column2, newftpdata.column3, newftpdata.column4 do not exist in MASTERTABLE
Hopefully I've supplied substantial information for this question. If any further details are required please let me know. Thank you.
You can use MERGE:
MERGE MasterTable AS dest
USING newftpdata AS src
   ON dest.column1 = src.column1
  AND dest.column2 = src.column2
  AND dest.column3 = src.column3
  AND dest.column4 = src.column4
WHEN NOT MATCHED THEN
  INSERT (column1, column2, ...)
  VALUES (src.column1, src.column2, ...);
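If MERGE feels heavyweight for an insert-only flow, the same not-matched check can be written with NOT EXISTS. A sketch using the tables from the question (the column lists here are illustrative; extend them to the real schema):
-- Insert only rows whose 4-column key is absent from the master table
INSERT INTO MasterTable (column1, column2, column3, column4 /* , ... */)
SELECT s.column1, s.column2, s.column3, s.column4 /* , ... */
FROM newftpdata AS s
WHERE NOT EXISTS (
  SELECT 1
  FROM MasterTable AS m
  WHERE m.column1 = s.column1
    AND m.column2 = s.column2
    AND m.column3 = s.column3
    AND m.column4 = s.column4
);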
