I am populating a table from a file using the COPY command. The table includes timestamp data in multiple formats. I have run alter session set TIMESTAMP_INPUT_FORMAT = 'dd-mon-yyyy hh24.mi.ss.ff6'; which handles the values in that format, but other timestamp values in the source file are formatted differently. To cope with this I am doing, e.g.:
copy into <table> (
<timestamp_column_1>,
<timestamp_column_2>
...
) from (
SELECT
$1,
TO_TIMESTAMP_TZ(t.$2, 'DD-MON-YY'),
...
FROM @<stage> t
);
This works, but the VALIDATE function does not support COPY statements that use transformations, so my current validation method is unreliable.
Is there a way I can achieve what I want in my load without using transformations?
Related
When trying to load a Parquet/AVRO file into a Snowflake table I get the error:
PARQUET file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.
But I don't want to load these files into a new one-column table; I need the COPY command to match the columns of the existing table.
What can I do to get schema auto detection?
Good news: that error message is outdated, as Snowflake now supports schema detection and loading multiple columns with COPY INTO.
To reproduce the error:
create or replace table hits3 (
WatchID BIGINT,
JavaEnable SMALLINT,
Title TEXT
);
copy into hits3
from @temp.public.my_ext_stage/files/
file_format = (type = parquet);
-- PARQUET file format can produce one and only one column of type variant or object or array.
-- Use CSV file format if you want to load more than one column.
To fix the error and have Snowflake match the columns from the table and the Parquet/AVRO files, just add the option MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE (or MATCH_BY_COLUMN_NAME=CASE_SENSITIVE):
copy into hits3
from @temp.public.my_ext_stage/files/
file_format = (type = parquet)
match_by_column_name = case_insensitive;
Docs:
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
https://docs.snowflake.com/en/user-guide/data-load-overview.html?#detection-of-column-definitions-in-staged-semi-structured-data-files
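If you also want Snowflake to build the target table from the Parquet schema instead of defining it by hand, a minimal sketch using INFER_SCHEMA looks like this (the file format name my_parquet_format and the table name hits3_auto are just illustrative):
create or replace file format my_parquet_format type = parquet;
-- create the table from the schema detected in the staged Parquet files
create or replace table hits3_auto
  using template (
    select array_agg(object_construct(*))
    from table(
      infer_schema(
        location => '@temp.public.my_ext_stage/files/',
        file_format => 'my_parquet_format'
      )
    )
  );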
I have a csv with example data as:
61| MXN| Mexican Peso| 2017/02/23
I'm trying to insert this into snowflake using the following commands:
create or replace stage table_stage file_format = (TYPE=CSV,ENCODING = 'WINDOWS1252');
copy into table from @table_stage/table.csv.gz file_format = (TYPE=CSV FIELD_DELIMITER='|' error_on_column_count_mismatch=false, ENCODING = 'WINDOWS1252');
put file://table.csv @table_stage auto_compress=true;
But I get the error:
Date '2017/02/23' not recognized
Using alter session set date_input_format = 'YYYY/MM/DD' to change the date format fixes it.
But what can I add in the create stage or the copy command itself to change the date format?
Snowflake has a session parameter, DATE_INPUT_FORMAT, that controls the input format for the DATE data type.
The default value, AUTO, specifies that Snowflake attempts to automatically detect the format of dates during the session, meaning the COPY INTO <table> command tries to match each date string in the staged data files against one of the formats listed in Supported Formats for AUTO Detection.
To guarantee correct loading of data, Snowflake strongly recommends explicitly setting the file format options for data loading (as explained in the documentation).
To solve your issue, set the DATE_INPUT_FORMAT parameter to the format of the dates in your staged files.
Just set the date format in the file format: https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
Use the DATE_FORMAT parameter in the file_format options.
You can read more here: COPY INTO
copy into table
from @table_stage/table.csv.gz
file_format = (TYPE=CSV
FIELD_DELIMITER='|'
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
ENCODING = 'WINDOWS1252'
DATE_FORMAT = 'YYYY/MM/DD');
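Since the question also asks what can go in the CREATE STAGE itself, the same DATE_FORMAT can live in a named file format attached to the stage; a sketch (the format name my_pipe_csv is just illustrative):
create or replace file format my_pipe_csv
  type = csv
  field_delimiter = '|'
  error_on_column_count_mismatch = false
  encoding = 'WINDOWS1252'
  date_format = 'YYYY/MM/DD';
-- attach the named format to the stage so every COPY from it picks it up
create or replace stage table_stage file_format = (format_name = 'my_pipe_csv');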
I am trying to copy a csv.gz file into a table I created to start analyzing location data for a map. I was running into an error that says that there are too many characters, and I should add a on_error option. However, I am not sure if that will help load the data, can you take a look?
Data source: https://data.world/cityofchicago/array-of-things-locations
SELECT * FROM @staged/array-of-things-locations-1.csv.gz
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude number, longitude number, location_2 variant, location variant);
COPY INTO ARRAYLOC
FROM @staged/array-of-things-locations-1.csv.gz;
CREATE OR REPLACE FILE FORMAT t_csv
TYPE = "CSV"
COMPRESSION = "GZIP"
FILE_EXTENSION= 'csv.gz';
CREATE OR REPLACE STAGE staged
FILE_FORMAT='t_csv';
COPY INTO ARRAYLOC FROM @~/staged file_format = (format_name = 't_csv');
Error message:
Number of columns in file (8) does not match that of the corresponding table (9), use file format option error_on_column_count_mismatch=false to ignore this error File '@~/staged/array-of-things-locations-1.csv.gz', line 2, character 1 Row 1 starts at line 1, column "ARRAYLOC"["LOCATION_2":8] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
Solved:
The real issue was that I needed to clean the data I was staging better; that was my error. What I ended up changing: the column types, switching the file's quoting from " to ', and splitting one column because of a comma in the middle of the data.
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude float, longitude varchar, location varchar);
COPY INTO ARRAYLOC
FROM @staged/array-of-things-locations-1.csv.gz;
CREATE or Replace FILE FORMAT r_csv
TYPE = "CSV"
COMPRESSION = "GZIP"
FILE_EXTENSION= 'csv.gz'
SKIP_HEADER = 1
ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE
EMPTY_FIELD_AS_NULL = TRUE;
create or replace stage staged
file_format='r_csv';
copy into ARRAYLOC from @~/staged
file_format = (format_name = 'r_csv');
SELECT * FROM ARRAYLOC LIMIT 10;
Your error doesn't say that you have too many characters but that your file has 8 columns and your table has 9 columns, so it doesn't know how to align the columns from the file to the columns in the table.
You can list out the columns specifically using a subquery in your COPY INTO statement.
Notes:
Columns from the file are positional, so $1 is the first column in the file, $2 is the second, etc.
You can put the columns from the file in any order you need to match your table.
You'll need to find the column that has no data coming from the file and fill it with NULL or some default value. In my example, I assume it is the last column and fill it with the current timestamp.
It helps to list out the columns of the table after the table name, but this is not required.
Example:
COPY INTO ARRAYLOC (COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8,COLUMN9)
FROM (
SELECT $1
,$2
,$3
,$4
,$5
,$6
,$7
,$8
,CURRENT_TIMESTAMP()
FROM @staged/array-of-things-locations-1.csv.gz
);
I would advise against changing the ERROR_ON_COLUMN_COUNT_MISMATCH parameter; doing so could result in data ending up in the wrong column of the table. I would also advise against changing the ON_ERROR parameter, as I believe it is best to be alerted to such errors rather than suppress them.
Yes, setting that option should help. From the documentation:
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE | FALSE
Use: Data loading only
Definition: Boolean that specifies whether to generate a parsing error if the number of delimited columns (i.e. fields) in an input file does not match the number of columns in the corresponding table.
If set to FALSE, an error is not generated and the load continues. If the file is successfully loaded:
If the input file contains records with more fields than columns in the table, the matching fields are loaded in order of occurrence in the file and the remaining fields are not loaded.
If the input file contains records with fewer fields than columns in the table, the non-matching columns in the table are loaded with NULL values.
This option assumes all the records within the input file are the same length (i.e. a file containing records of varying length returns an error regardless of the value specified for this parameter).
So assuming you are okay with getting NULL values for the missing column in your input data, you can use ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE to load the file successfully.
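Applied to the COPY from the question, that could look roughly like this (inline options are shown purely for illustration; setting the option in the named file format, as in the "Solved" version above, works the same way):
copy into ARRAYLOC
from @~/staged
file_format = (type = csv
               compression = gzip
               skip_header = 1
               error_on_column_count_mismatch = false);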
When viewing that table directly on data.world, there are columns named both location and location_2 with identical data. It looks like that display is erroneous, because when downloading the CSV, it has only a single location column.
I suspect if you change your CREATE OR REPLACE statement with the following statement that omits the creation of location_2, you'll get to where you want to go:
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude number, longitude number, location variant);
I am trying to get the NULL_IF parameter of a file format working when applied to an external table.
I have a source CSV file containing NULL values in some columns. NULLS in the source file appear in the format "\N" (all non numeric values in the file are quoted). Here is an example line from the raw csv where the ModifiedOn value is NULL in the source system:
"AirportId" , "IATACode" , "CreatedOn" , "ModifiedOn"
1 , "ACU" , "2015-08-25 16:58:45" , "\N"
I have a file format defined including the parameter NULL_IF = "\\N"
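For reference, a file format along these lines reproduces the setup (only NULL_IF is taken from the question; the other options are assumptions):
CREATE OR REPLACE FILE FORMAT CSV_1
  TYPE = CSV
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  NULL_IF = ('\\N');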
The following select statement successfully interprets the correct rows as holding NULL values.
SELECT $8
FROM @MyS3Bucket
(
file_format => 'CSV_1',
pattern => '.*MyFileType.*.csv.gz'
)
However if I use the same file format with an external table like this:
CREATE OR REPLACE EXTERNAL TABLE MyTable
(
  MyColumn varchar as (value:c8::varchar)
)
WITH LOCATION = @MyS3Bucket
FILE_FORMAT = (FORMAT_NAME = 'CSV_1')
PATTERN = '.*MyFileType_.*.csv.gz';
Each row holds \N as a value rather than NULL.
I assume this is caused by external tables providing a single VARIANT output that is then split, rather than directly presenting the individual columns of the CSV file.
One solution is to code the NULL handling into the external view like this:
CREATE OR REPLACE EXTERNAL TABLE MyTable
(
  MyColumn varchar as (NULLIF(value:c8::varchar, '\\N'))
)
WITH LOCATION = @MyS3Bucket
FILE_FORMAT = (FORMAT_NAME = 'CSV_1')
PATTERN = '.*MyFileType_.*.csv.gz';
However this leaves me at risk of having to rewrite a lot of external table code if the file format changes, whereas the file format could/should centralise that NULL definition. It would also mean the NULL conversion has to be handled column by column rather than file by file, increasing code complexity.
Is there a way that I can have the NULL values appear through an external table without handling them explicitly through column definitions?
Ideally this would be applied through a file format object but changes to the format of the raw file are not impossible.
I am able to reproduce the issue, and it seems like a bug. If you have access to Snowflake support, it would be best to submit a support case regarding this issue, so you can easily follow the process.
(Submitting on behalf of a Snowflake User)
I have a csv file that has two different timestamp format.
For example:
time_stmp1: 2019-07-01 00:03:17.000 EDT
time_stmp2: 2019-06-30 21:03:17 PDT
In the copy command I am able to specify only one format.
How should I proceed to load both columns in TIMESTAMP_LTZ data type?
Any recommendations?
You could use the SELECT form of COPY INTO, where you transform the date fields individually, something like:
COPY INTO MY_TABLE (NAME, DOB, DOD, HAIR_COLOUR)
FROM (
SELECT $1, TO_DATE($2,'YYYYMMDD'), TO_DATE($3,'MM-DD-YYYY'), $4
FROM @MY_STAGE/mypeeps (file_format => 'MY_CSV_FORMAT')
)
ON_ERROR = CONTINUE;
Currently, Snowflake does not allow loading data with different date formats from a single file.
If the data in the file is just a date, use the DATE data type and, in the FILE FORMAT, define the date format as AUTO.
If the data includes date and time, use the TIMESTAMP data type and define the timestamp format in the FILE FORMAT to match the data file, e.g.:
DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'YY/MM/DD HH24:MI:SS'
If there are multiple date formats in the file, for example MM/DD/YY and MM/DD/YY HH:MI:SS, it does not load correctly; you may need to split the file and load the parts separately, or update all the date values to a single common format and then load the file.
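If splitting the file is not practical, one workaround in the same spirit as the transform approach above is to try each expected format per value inside the COPY transform; the table, column, stage, and format names below are placeholders:
copy into MY_TABLE (EVENT_TS)
from (
  select coalesce(
           try_to_timestamp($1, 'MM/DD/YY HH24:MI:SS'),  -- rows with date and time
           try_to_timestamp($1, 'MM/DD/YY')              -- rows with date only
         )
  from @MY_STAGE/myfile.csv (file_format => 'MY_CSV_FORMAT')
);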