snowflake parquet load schema generation - snowflake-cloud-data-platform

I am working on loading parquet files from an S3 location into a Snowflake table. This is what I am doing:
1. Created the target table:
CREATE TABLE myschema.target_table(
col1 DATE,
col2 VARCHAR);
2. Created the stage using the following command:
CREATE OR REPLACE TEMPORARY STAGE myschema.stage_table
url = 's3://mybucket/myfolder1/'
storage_integration = My_int
file_format = (type = 'parquet');
3. Loaded the target table from the stage:
COPY INTO myschema.target_table FROM(
SELECT $1:col1::date,
$1:col2::varchar
FROM myschema.stage_table);
This works fine. My issue is that I have tens of tables with tens of columns each. Is there any way to optimize step 3 so that I don't have to explicitly mention the column names and the code becomes generic:
COPY INTO myschema.target_table FROM(
SELECT *
FROM myschema.stage_table)

Did you try
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE | CASE_INSENSITIVE | NONE
Document: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#type-parquet
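For example, a minimal sketch using the stage and table from the question: the COPY loads straight from the stage and matches Parquet field names to table column names, so no per-column SELECT is needed.
-- Load Parquet fields into the table columns whose names match (ignoring case);
-- this avoids hand-writing the $1:col::type projection for every table.
COPY INTO myschema.target_table
FROM @myschema.stage_table
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;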

Related

Redo a create or replace in snowflake

I have the following problem:
I have used the function:
CREATE OR REPLACE TABLE myschema.public.table1 as (SELECT * FROM myschema.public.table1 BEFORE(OFFSET => -60*4*15) WHERE MARKET = 'ES'
)
I still had the filter MARKET = 'ES' in the statement, and now all the entries that do not match MARKET = 'ES' are gone. Can I still undo this?
As you issued CREATE OR REPLACE, the original table was dropped. So you need to rename your existing table and undrop the original table. Then you may re-run your time travel command:
alter table table1 rename to table1_bak;
undrop table table1;
CREATE OR REPLACE TABLE myschema.public.table1 as (SELECT * FROM myschema.public.table1 BEFORE(OFFSET => -60*4*15));
https://docs.snowflake.com/en/sql-reference/sql/undrop-table.html

How to load default columns while loading parquet file in snowflake

I am loading a parquet file into Snowflake using the COPY command. The parquet file has 10 columns and the Snowflake target table has 12 columns (2 with default dates: create date and update date).
SQL compilation error: Insert value list does not match column list expecting 12 but got 10
Is there any way I can load default values into the Snowflake table while loading the data through a parquet file with fewer or more columns?
Any help would be greatly appreciated.
You must specify the column list of the target table in the COPY statement.
If we use only copy into table_name from (select column_names from @stage), then the staged file needs to match all columns present in the table.
copy into <table>(<col1>,<col2>,....<col10>) from
(select
$1:<col_1_parq_file>::<format_specifier>,
$1:<col_2_parq_file>::<format_specifier>,
$1:<col_3_parq_file>::<format_specifier>,
.
.
.
$1:<col_10_parq_file>::<format_specifier>
from @<stage_with_parquet_file>);
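As a concrete sketch (the table, stage, and column names here are hypothetical, not from the question): if the two extra columns have DEFAULT expressions, listing only the loaded columns should let Snowflake populate the remaining two from their defaults, as I read the COPY documentation.
-- Hypothetical target: 10 data columns plus 2 defaulted audit columns.
CREATE TABLE mytable (
col1 VARCHAR,
col2 NUMBER,
-- ... col3 through col10 ...
create_date TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
update_date TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Only the columns coming from the parquet file are listed (col3 .. col10 elided
-- here for brevity); create_date and update_date are omitted from the column list,
-- so their DEFAULT expressions are applied on load.
COPY INTO mytable (col1, col2)
FROM (SELECT
$1:col1::VARCHAR,
$1:col2::NUMBER
FROM @my_parquet_stage);
If the defaults do not behave as expected for your target, the timestamps can also be generated directly in the SELECT list (e.g. CURRENT_TIMESTAMP()).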

Snowflake - Keeping target table schema in sync with source table variant column value

I ingest data into a table source_table with AVRO data. There is a column in this table, say "avro_data", which will be populated with variant data.
I plan to copy data into a structured table target_table where columns have the same name and datatype as the avro_data fields in the source table.
Example:
select avro_data from source_table
{"C1":"V1", "C2", "V2"}
This will result in
select * from target_table
------------
| C1 | C2 |
------------
| V1 | V2 |
------------
My question is: when the schema of avro_data evolves and new fields get added, how can I keep the schema of the target_table in sync by adding equivalent columns in the target table?
Is there anything out of the box in Snowflake to achieve this, or has someone created any code to do something similar?
Here's something to get you started. It shows how to take a variant column and parse out the internal columns. This uses a table in the Snowflake sample data database, which is not always the same across accounts, so you may need to adjust the table name and column name.
SELECT DISTINCT regexp_replace(regexp_replace(f.path,'\\[(.+)\\]'),'(\\w+)','"\\1"') AS path_name, -- This generates paths with levels enclosed by double quotes (ex: "path"."to"."element"). It also strips any bracket-enclosed array element references (like "[0]")
DECODE (substr(typeof(f.value),1,1),'A','ARRAY','B','BOOLEAN','I','FLOAT','D','FLOAT','STRING') AS attribute_type, -- This generates column datatypes of ARRAY, BOOLEAN, FLOAT, and STRING only
REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\[(.+)\\]'),'[^a-zA-Z0-9]','_') AS alias_name -- This generates column aliases based on the path
FROM
"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."JCUSTOMER",
LATERAL FLATTEN("CUSTOMER", RECURSIVE=>true) f
WHERE TYPEOF(f.value) != 'OBJECT'
AND NOT contains(f.path, '[');
This is a snippet of code modified from here: https://community.snowflake.com/s/article/Automating-Snowflake-Semi-Structured-JSON-Data-Handling. The blog author attributes credit to a colleague for this section of code.
While the current incarnation of the stored procedure will create a view from the internal columns in a variant, an alternate version could create and/or alter a table to keep it in sync with changes.
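Building on that, here is a rough sketch (not part of the linked article; it reuses source_table, avro_data, and target_table from the question) of how the discovered attributes could be turned into ALTER TABLE statements for columns that do not exist yet:
-- Generate one ALTER TABLE ... ADD COLUMN statement per top-level attribute of
-- avro_data that is not already a column of target_table. The generated DDL
-- would still need to be executed, e.g. from a stored procedure.
SELECT 'ALTER TABLE target_table ADD COLUMN ' || attr_name || ' ' || attr_type || ';' AS ddl
FROM (
    SELECT DISTINCT UPPER(f.key) AS attr_name,
           DECODE(substr(typeof(f.value),1,1),
                  'A','ARRAY','B','BOOLEAN','I','FLOAT','D','FLOAT','STRING') AS attr_type
    FROM source_table,
         LATERAL FLATTEN(avro_data) f
) new_cols
WHERE attr_name NOT IN (
    SELECT column_name
    FROM information_schema.columns
    WHERE table_schema = CURRENT_SCHEMA()
      AND table_name = 'TARGET_TABLE'
);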

Snowflake External Table Partition - Granular Path

Experts,
We have our JSON files stored in the below folder structure in S3 as
/appname/lob/2020/07/24/12,/appname/lob/2020/07/24/13,/appname/lob/2020/07/24/14
stage @SFSTG = /appname/lob/
We need to create an external table with a partition based on the hours. We can derive the partition part from metadata$filename. However, the question here is: should the partition column be created as timestamp or varchar?
Which partition datatype gives better performance when accessing the files from Snowflake through the external table?
Snowflake's recommendation is the following:
date_part date as to_date(substr(metadata$filename, 14, 10), 'YYYY/MM/DD'),
*Double-check that 14 is the correct start of your partition in your stage URL; I may have it incorrect here.
Full example:
CREATE OR REPLACE EXTERNAL TABLE Database.Schema.ExternalTableName(
date_part date as to_date(substr(metadata$filename, 14, 10), 'YYYY/MM/DD'),
col1 varchar AS (value:col1::varchar),
col2 varchar AS (value:col2::varchar))
PARTITION BY (date_part)
INTEGRATION = 'YourIntegration'
LOCATION=@SFSTG/appname/lob/
AUTO_REFRESH = true
FILE_FORMAT = (TYPE = JSON);
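For what it's worth, the benefit of the date partition column shows up once it is used in filters; a small usage sketch against the example table above:
-- Filtering on the partition column lets Snowflake prune the external files
-- under the other day/hour folders instead of scanning all of them.
SELECT col1, col2
FROM Database.Schema.ExternalTableName
WHERE date_part = TO_DATE('2020/07/24', 'YYYY/MM/DD');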

How to efficiently replace long strings by their index for SQL Server inserts?

I have a very large DataTable object which I need to import from a client into a Microsoft SQL Server database via ODBC.
The original DataTable has two columns:
* First column is the Office Location (quite a long string)
* Second column is a booking value (integer)
Now I am looking for the most efficient way to insert this data into an external SQL Server. My goal is to replace each office location automatically by an index instead of using the full string, because each location occurs VERY often in the initial table.
Is this possible via a trigger or via a view on the SQL Server?
In the end I want to insert the data without touching it in my script, because that is very slow for this large amount of data, and leave the optimization to SQL Server.
I expect that if I INSERT the data including the office location, SQL Server looks up an index for an already imported location and then uses just this index. And if the location does not already exist in the index table/view, it should create a new entry there and then use the new index.
Here is a sample of the data I need to import via ODBC into SQL Server:
OfficeLocation | BookingValue
EU-Germany-Hamburg-Ostend1 | 12
EU-Germany-Hamburg-Ostend1 | 23
EU-Germany-Hamburg-Ostend1 | 34
EU-France-Paris-Eifeltower | 42
EU-France-Paris-Eifeltower | 53
EU-France-Paris-Eifeltower | 12
What I need on the SQL Server side is something like these 2 tables as a result:
OId|BookingValue        OfficeLocation             |OId
1|12                    EU-Germany-Hamburg-Ostend1 | 1
1|23                    EU-France-Paris-Eifeltower | 2
1|34
2|42
2|53
2|12
My initial idea was to write the data into a temp table and have something like an intelligent TRIGGER (or a VIEW?) react to any INSERT into this table to create the 2 desired (optimized) tables.
Any hints are more than welcome!
Yes, you can create a view with an INSERT trigger to handle this. Something like:
CREATE TABLE dbo.Locations (
OId int IDENTITY(1,1) not null PRIMARY KEY,
OfficeLocation varchar(500) not null UNIQUE
)
GO
CREATE TABLE dbo.Bookings (
OId int not null,
BookingValue int not null
)
GO
CREATE VIEW dbo.CombinedBookings
WITH SCHEMABINDING
AS
SELECT
OfficeLocation,
BookingValue
FROM
dbo.Bookings b
INNER JOIN
dbo.Locations l
ON
b.OId = l.OId
GO
CREATE TRIGGER CombinedBookings_Insert
ON dbo.CombinedBookings
INSTEAD OF INSERT
AS
INSERT INTO Locations (OfficeLocation)
SELECT OfficeLocation
FROM inserted where OfficeLocation not in (select OfficeLocation from Locations)
INSERT INTO Bookings (OId,BookingValue)
SELECT OId, BookingValue
FROM
inserted i
INNER JOIN
Locations l
ON
i.OfficeLocation = l.OfficeLocation
As you can see, we first add to the locations table any missing locations and then populate the bookings table.
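A quick usage sketch with the sample rows from the question (assuming the objects above exist):
-- Inserting through the view fires the INSTEAD OF trigger, which resolves or
-- creates the location ids behind the scenes.
INSERT INTO dbo.CombinedBookings (OfficeLocation, BookingValue)
VALUES ('EU-Germany-Hamburg-Ostend1', 12),
       ('EU-Germany-Hamburg-Ostend1', 23),
       ('EU-France-Paris-Eifeltower', 42);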
A similar trigger can cope with Updates. I'd generally let the Locations table just grow and not attempt to clean it up (for no longer referenced locations) with triggers. If growth is a concern, a periodic job will usually be good enough.
Be aware that some tools (such as bulk inserts) may not invoke triggers, so those will not be usable with the above view.
