Snowflake create partition based on file column - snowflake-cloud-data-platform

Is it possible to partition an external table in Snowflake on a column from the file, instead of on the file path (metadata$filename), when the file folder isn't partitioned?
I have tried creating an external table like the example below:
create or replace external table "db"."schema".EXT_TABLE (
"year" NUMBER(38,0) AS YEAR(TO_TIMESTAMP((PARSE_JSON(VALUE:"c3"):"time":"$date"::varchar)))
)
partition by ("year")
partition_type = user_specified
location=@stage
file_format=file_format;
But it returns the error: Function GET is not supported in an external table partition column expression.
I have also tried using parse_json(metadata$external_table_partition), as the documentation shows, but this column always returns empty.
Does anyone have any tips to make this work?

Related

Problems importing timestamp from Parquet files

I'm exporting data into Parquet files and importing it into Snowflake. The export is done with python (using to_parquet from pandas) on a Windows Server machine.
The exported file has several timestamp columns. Here's the metadata of one of these columns (ParquetViewer):
I'm having weird issues trying to import the timestamp columns into Snowflake.
Attempt 1 (using the copy into):
create or replace table STAGING.DIM_EMPLOYEE(
"EmployeeID" NUMBER(38,0),
"ExitDate" TIMESTAMP_NTZ(9)
);
copy into STAGING.DIM_EMPLOYEE
from @S3
pattern='dim_Employee_.*.parquet'
file_format = (type = parquet)
match_by_column_name = case_insensitive;
select * from STAGING.DIM_EMPLOYEE;
The timestamp column is not imported correctly:
It seems that Snowflake assumes that the value in the column is in seconds and not in microseconds and therefore converts incorrectly.
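To see why the unit matters, here is a minimal Python sketch (not Snowflake code) that reads the same integer at different scales. The helper name from_epoch and the sample value are hypothetical, chosen only for illustration; the actual values in the file are not shown in the question.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch(value: int, scale: int) -> datetime:
    """Convert an integer epoch value to a UTC datetime.

    scale is the number of fractional digits carried by the integer:
    0 = seconds, 3 = milliseconds, 6 = microseconds.
    """
    if not 0 <= scale <= 6:
        raise ValueError("this sketch only handles scales 0 through 6")
    seconds, fraction = divmod(value, 10 ** scale)
    microseconds = fraction * 10 ** (6 - scale)
    return EPOCH + timedelta(seconds=seconds, microseconds=microseconds)


# Hypothetical value: 2021-06-30 00:00:00 UTC stored as microseconds.
exit_date_us = 1_625_011_200_000_000

# Read with the correct scale, the value is an ordinary date.
print(from_epoch(exit_date_us, 6))

# Read as seconds, the same integer lands far outside any sensible range.
try:
    print(from_epoch(exit_date_us, 0))
except OverflowError as exc:
    print("not representable when read as seconds:", exc)
```

The point is only that the integer itself carries no unit, so whoever reads it has to be told (or has to guess) the scale.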
Attempt 2 (using the external tables):
Then I created an external table:
create or replace external table STAGING.EXT_DIM_EMPLOYEE(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" TIMESTAMP_NTZ(9) AS (CAST(GET($1, 'ExitDate') AS TIMESTAMP_NTZ(9)))
)
location=@S3
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
SELECT * FROM STAGING.EXT_DIM_EMPLOYEE;
The data is still incorrect - still the same issue (seconds instead of microseconds):
Attempt 3 (using the external tables, with modified TO_TIMESTAMP):
I then modified the external table definition to state explicitly that microseconds are used, by calling TO_TIMESTAMP_NTZ with a scale parameter of 6:
create or replace external table STAGING.EXT_DIM_EMPLOYEE_V2(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" TIMESTAMP_NTZ(9) AS (TO_TIMESTAMP_NTZ(TO_NUMBER(GET($1, 'ExitDate')), 6))
)
location=@CHICOREE_D365_BI_STAGE/
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
SELECT * FROM STAGING.EXT_DIM_EMPLOYEE_V2;
Now the data is correct:
But now the "weird" issue appears:
I can load the data into a table, but the load is quite slow and I get a Querying (repair) message during the load. The query does finish in the end, albeit slowly:
I want to load the data from a stored procedure, using a SQL script. When I execute the statement with EXECUTE IMMEDIATE, an error is returned:
DECLARE
SQL STRING;
BEGIN
SET SQL := 'INSERT INTO STAGING.DIM_EMPLOYEE ("EmployeeID", "ExitDate") SELECT "EmployeeID", "ExitDate" FROM STAGING.EXT_DIM_EMPLOYEE_V2;';
EXECUTE IMMEDIATE :SQL;
END;
I have also tried to define the timestamp column in an external table as a NUMBER, import it and later convert it into timestamp. This generates the same issue (returning SQL execution internal error in SQL script).
Has anyone experienced an issue like this - it seems to me like a bug?
Basically, my goal is to generate insert/select statements dynamically and execute them (in stored procedures). I have a lot of files (with different schemas) that need to be imported, and I want to create a "universal logic" to load these Parquet files into Snowflake.
As confirmed in the Snowflake Support ticket you opened, this issue got resolved when the Snowflake Support team enabled an internal configuration for Parquet timestamp logical types.
If anyone encounters a similar issue please submit a Snowflake Support ticket.
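On the dynamic-generation goal from the question: the statement text itself can be assembled from a column list. Below is a minimal Python sketch, assuming the column names are already known for each file; the helper names quote_ident and build_insert_select are hypothetical, and this only builds the string, it does not talk to Snowflake.

```python
from typing import Sequence


def quote_ident(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_insert_select(target: str, source: str, columns: Sequence[str]) -> str:
    """Build an INSERT ... SELECT statement that copies the given columns.

    target and source are used as written (assumed to be trusted,
    already-qualified table names); only column names are quoted.
    """
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(quote_ident(c) for c in columns)
    return f"INSERT INTO {target} ({column_list}) SELECT {column_list} FROM {source};"


# Reproduces the statement from the question for its two columns.
statement = build_insert_select(
    "STAGING.DIM_EMPLOYEE",
    "STAGING.EXT_DIM_EMPLOYEE_V2",
    ["EmployeeID", "ExitDate"],
)
print(statement)
```

Quoting every column keeps mixed-case names such as "EmployeeID" intact, which matters here because the tables in the question were created with quoted identifiers.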

How to load Parquet/AVRO into multiple columns in Snowflake with schema auto detection?

When trying to load a Parquet/AVRO file into a Snowflake table I get the error:
PARQUET file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.
But I don't want to load these files into a new one column table — I need the COPY command to match the columns of the existing table.
What can I do to get schema auto detection?
Good news: that error message is outdated, as Snowflake now supports schema detection and COPY INTO multiple columns.
To reproduce the error:
create or replace table hits3 (
WatchID BIGINT,
JavaEnable SMALLINT,
Title TEXT
);
copy into hits3
from @temp.public.my_ext_stage/files/
file_format = (type = parquet);
-- PARQUET file format can produce one and only one column of type variant or object or array.
-- Use CSV file format if you want to load more than one column.
To fix the error and have Snowflake match the columns from the table and Parquet/AVRO files just add the option MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE (or MATCH_BY_COLUMN_NAME=CASE_SENSITIVE):
copy into hits3
from @temp.public.my_ext_stage/files/
file_format = (type = parquet)
match_by_column_name = case_insensitive;
Docs:
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
https://docs.snowflake.com/en/user-guide/data-load-overview.html?#detection-of-column-definitions-in-staged-semi-structured-data-files

How do I put a CSV into an external table in Snowflake?

I have a staged file and I am trying to query the first line/row of it because it contains the column headers of the file. Is there a way I can create an external table using this file so that I can query the first line?
I am able to query the staged file using
SELECT a.$1
FROM @my_stage (FILE_FORMAT=>'my_file_format',PATTERN=>'my_file_path') a
and then to create the table I tried doing
CREATE EXTERNAL TABLE MY_FILE_TABLE
WITH
LOCATION='my_file_path'
FILE_FORMAT = my_file_format;
Reading headers from CSV is not supported; however, this answer from StackOverflow gives a workaround.

How to pass optional column in TABLE VALUE TYPE in SQL from ADF

I have the following table value type in SQL, which is used in Azure Data Factory to import data from a flat file in a bulk copy activity via a stored procedure. File 1 has all three columns in it, so this works fine. File 2 only has Column1 and Column2, but NOT Column3. I figured that since the column was defined as NULL it would be OK, but ADF complains that it's attempting to pass in 2 columns when the table type expects 3. Is there a way to reuse this type for both files and make Column3 optional?
CREATE TYPE [dbo].[TestType] AS TABLE(
Column1 varchar(50) NULL,
Column2 varchar(50) NULL,
Column3 varchar(50) NULL
)
Operation on target LandSource failed:
ErrorCode=SqlOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A
database operation failed with the following error: 'Trying to pass a
table-valued parameter with 2 column(s) where the corresponding
user-defined table type requires 3 column(s)
It would be nice if the copy activity behaved consistently regardless of whether a stored procedure with a table type or native BCP is used in the activity. When not using the table type and using the default bulk insert, missing columns in the source file end up being NULL in the target table without error (assuming the column is NULLABLE).
This will cause a mapping error in ADF.
In the Copy activity, every column needs to be mapped.
If the source file has only two columns, the mapping fails.
So I suggest creating two separate Copy activities and a second, two-column table type.
You can pass an optional column. I tested this successfully, but the steps are a bit complex. In my case, File 1 has all three columns and File 2 has only Column1 and Column2, but NOT Column3. The pipeline uses the Get Metadata, Set Variable, ForEach, and If Condition activities.
Please follow my steps:
You need to define a variable FileName for use inside the ForEach loop.
In the Get Metadata1 activity, I specified the file path.
In the ForEach1 activity, use @activity('Get Metadata1').output.childItems to iterate over the file list. It needs to be Sequential.
Inside the ForEach1 activity, use Set Variable1 to set the FileName variable.
In the Get Metadata2, use item().name to specify the file.
In the Get Metadata2, use Column count to get the column count from the file.
In the If Condition1, use @greater(activity('Get Metadata2').output.columnCount,2) to determine whether the file has more than two columns.
In the True activity, use variable FileName to specify the file.
In the False activity, use Additional columns to add a Column.
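The branching in these steps can be sketched in plain Python. This is not ADF code, only an illustration of the logic, and it assumes the added column carries NULL; the function names and the sample file names are hypothetical.

```python
from typing import List, Optional, Sequence

# Column order of dbo.TestType from the question.
TABLE_TYPE_COLUMNS = ["Column1", "Column2", "Column3"]


def needs_additional_column(column_count: int) -> bool:
    """Mirror the If Condition: more than two columns takes the True branch,
    anything else takes the False branch, where a column is added."""
    return not column_count > 2


def pad_row(row: Sequence[Optional[str]]) -> List[Optional[str]]:
    """Append None for each column the source row lacks, so every row
    matches the three-column table type."""
    if len(row) > len(TABLE_TYPE_COLUMNS):
        raise ValueError("row has more columns than the table type")
    missing = len(TABLE_TYPE_COLUMNS) - len(row)
    return list(row) + [None] * missing


# Hypothetical files: one with three columns, one with two.
files = {
    "file1.csv": [["a", "b", "c"]],
    "file2.csv": [["d", "e"]],
}

# Process files one at a time, as the Sequential ForEach does.
for name, rows in files.items():
    column_count = len(rows[0])
    if needs_additional_column(column_count):
        rows = [pad_row(r) for r in rows]
    print(name, rows)
```

Either way, every row reaches the stored procedure with three values, which is what the table type requires.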
When I run debug, the result shows:

Create a Table in SnowFlake dynamically (Using JSON data from the staging area)

Is there a way to create a table( with columns) dynamically by using the JSON file from the staging area?
I used the command: 'copy into TableName from @StageName;'
This put all the different rows in my json file into a single column.
However, I want different columns. For example column1 should be "IP", column 2 should be "OS" and so on.
Thank you in advance!!
I have implemented the same thing in my project.
It's a two-step process.
1st Step - Create a staging table with a single VARIANT column and copy into it from the stage, which I can see you have already done.
2nd Step - Create either a table or a view that reads the data directly from this variant column (since Snowflake is very fast, a view is the way to go for this dynamic extract of JSON data), something like this:
create or replace view schema.vw_tablename copy grants as
SELECT
v:Duration::int Duration,
v:Connectivity::string Connectivity
...
from public.tablename
If your JSON has an array of structures, use the following:
create or replace view schema.vw_tablename copy grants as
SELECT
v:Duration::int Duration,
v:Connectivity::string Connectivity,
f.value:Time::int as Event_Time
from public.tablename,
table(flatten(v:arrayname)) f
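To make the effect of the flatten join concrete, here is a small Python sketch of what the second view produces for one JSON document: one output row per array element, with the top-level fields repeated. The sample document is made up, the key names follow the view above, and the sketch assumes Duration, Connectivity and Time are present in the data.

```python
import json
from typing import Dict, List


def flatten_events(v: Dict, array_key: str = "arrayname") -> List[Dict]:
    """Return one row per element of v[array_key], repeating the
    top-level Duration and Connectivity values on every row."""
    rows = []
    for element in v.get(array_key) or []:
        rows.append(
            {
                "Duration": int(v["Duration"]),
                "Connectivity": str(v["Connectivity"]),
                "Event_Time": int(element["Time"]),
            }
        )
    return rows


# Hypothetical document, shaped like one row of the variant column.
document = json.loads(
    '{"Duration": "12", "Connectivity": "wifi",'
    ' "arrayname": [{"Time": 100}, {"Time": 200}]}'
)

for row in flatten_events(document):
    print(row)
```

A document with an empty or missing array yields no rows in this sketch, so only documents that actually contain array elements show up in the result.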
