Skipping header while Parquet file loading into Snowflake - snowflake-cloud-data-platform

need to skip the header while loading parquet file into Snowflake, could someone help on this?
Thanks!

For Parquet, file format options are:
COMPRESSION = AUTO | SNAPPY | NONE
BINARY_AS_TEXT = TRUE | FALSE
TRIM_SPACE = TRUE | FALSE
NULL_IF = ( '<string>' [ , '<string>' ... ] )
More details: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#type-parquet
Also, you can review the loading Parquet tutorial at the below link:
https://docs.snowflake.com/en/user-guide/script-data-load-transform-parquet.html#script-loading-and-unloading-parquet-data

Related

Found character ':' instead of field delimiter ','

Again I am facing an issue with loading a file into snowflake.
My file format is:
TYPE = CSV
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
NULL_IF = ''
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
[ COMMENT = '<string_literal>' ]
Now by running the:
copy into trips from #citibike_trips
file_format=CSV;
I am receiving the following error:
Found character ':' instead of field delimiter ','
File 'citibike-trips-json/2013-06-01/data_01a304b5-0601-4bbe-0045-e8030021523e_005_7_2.json.gz', line 1, character 41
Row 1, column "TRIPS"["STARTTIME":2]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
I am a little confused about the file I am trying to load. Actually, I got the file from a tutorial on YouTube and in the video, it works properly. However, inside the file, there are not only CSV datasets, but also JSON, and parquet. I think this could be the problem, but I am not sure to solve it, since the command code above is having the file_format = CSV.
Remove FIELD_OPTIONALLY_ENCLOSED_BY = '\042' , recreate the file format and run the copy statement again.
You're trying to import a JSON file using a CSV file format. In most cases all you need to do is specify JSON as the file type in the COPY INTO statement.
FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>' |
TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
You're using CSV, but it should be JSON:
FILE_FORMAT = (TYPE = JSON)
If you're more comfortable using a named file format, use the builder to create a named file format that's of type JSON:
I found a thread in the Snowflake Community forum that explains what I think you might have been facing. There are now three different kinds of files in the stage - CSV, parquet, and JSON. The copy process given in the tutorial expects there to be only CSV. You can use this syntax to exclude non-CSV files from the copy:
copy into trips from #citibike_trips
on_error = skip_file
pattern = '.*\.csv\.gz$'
file_format = csv;
Using the PATTERN option with a regular expression you can filter only the csv files to be loaded.
https://community.snowflake.com/s/feed/0D53r0000AVKgxuCQD
And if you also run into an error related to timestamps, you will want to set this file format before you do the copy:
create or replace file format
citibike.public.csv
type = 'csv'
field_optionally_enclosed_by = '\042'
S3 to Snowflake ( loading csv data in S3 to Snowflake table throwing following error)

Snowflake Parquet Copy Into NULL_IF

I have a staged parquet file in and s3 location. I am attempting to parse the parquet file into a relational table, the field i'm having an issue with is a timestamp_ntz field.
In the file, there is a field called "due_date", and while most of the time it is populated with data, on occasion there is an empty string like below:
"due_date":""
The error that i'm receiving is 'Failed to cast variant value "" to TIMESTAMP_NTZ.'
Using the NULL_IF parameter in the copy into is not yielding any results and is set to:
file_format = (TYPE='PARQUET' COMPRESSION = SNAPPY BINARY_AS_TEXT = true TRIM_SPACE = false NULL_IF = ('\\N','NULL','NUL','','""'))
I have seen other users replace the NULL's in the SELECT portion of the COPY INTO statement, but this would be a hard to implement option due to the fields being dynamic.
Could anyone shed any light on this, other than the knowledge that empty strings shouldn't form part of parquet?
Full query below: USE SCHEMA MY_SCHEMA; COPY INTO MY_SCHEMA.MY_TABLE(LOAD_DATE,ACCOUNTID,APPID,CREATED_AT,CREATED_ON,DATE,DUE_DATE,NUMEVENTS,NUMMINUTES,REMOTEIP,SERVER,TIMESTAMP,TRACKNAME,TRACKTYPEID,TRANSACTION_DATE,TYPE,USERAGENT,VISITORID) FROM (SELECT CURRENT_TIMESTAMP(),$1:accountId,$1:appId,$1:created_at,$1:created_on,$1:date,$1:due_date,$1:numEvents,$1:numMinutes,$1:remoteIp,$1:server,$1:timestamp,$1:trackName,$1:trackTypeId,$1:transaction_date,$1:type,$1:userAgent,$1:visitorId FROM #my_stage ) PATTERN = '.*part.*' file_format = (TYPE='PARQUET' COMPRESSION = SNAPPY BINARY_AS_TEXT = true TRIM_SPACE = false NULL_IF = ('\\N','NULL','NUL','','""'));
You can use TRY_TO_TIMESTAMP. Since TRY_TO_TIMESTAMP does not accept variant, you need to cast it to string first:
TRY_TO_TIMESTAMP($1:due_date::string)
instead of just
$1:due_date
If the due_date is empty, the result will be NULL in the timestamp field in the target table after insert.

Error when copying data into Variant table from AVRO file

I am completing a snowflake university workshop but I have run into a problem. The course has provided an AVRO file and asked us to insert the data into a Variant column table. However when I run the COPY INTO commamd I get this error:
Number of columns in file (11) does not match that of the corresponding table (1), use file format option error_on_column_count_mismatch=false to ignore this error File 'iot_files/iot_files_sample_output.avro', line 1, character 827 Row 1, column "IOT_AVRO_DATA"[11] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
These are the instructions given by the course:
CREATE OR REPLACE TABLE IOT_AVRO_DATA
(mycolumn VARIANT);
copy INTO IOT_AVRO_DATA
FROM #GOOGLE_BUCKET_SFHOL/iot_files/iot_files_sample_output.avro;
FILE_FORMAT = (type = AVRO);
It looks like there is a mismatch between the number of columns in the file and in the table.
Any help advice would be appreciated, tried reaching out to snowflake via the workshop but they have not responded.
Are you sure your AVRO file is not corrupted?
The following works fine for me:
Upload to my stage a sample avro file (userdata1.avro taken from here)
spanaite#(no warehouse)#SERGIU_DB.(no schema)>put file:///Users/spanaite/Downloads/userdata1.avro #~;
+----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
| source | target | source_size | target_size | source_compression | target_compression | status | message |
|----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------|
| userdata1.avro | userdata1.avro.gz | 93561 | 79248 | NONE | GZIP | UPLOADED | |
+----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
1 Row(s) produced. Time Elapsed: 3.026s
spanaite#(no warehouse)#SERGIU_DB.(no schema)>
Create a table and load the avro file:
create or replace table test_avro(mycolumn VARIANT);
copy into test_avro from #~/userdata1.avro.gz file_format = (type = AVRO);
select * from test_avro;
Try with one of the sample files from the link I posted above.

How to create a csv file format definition to load data into snowflake table

I have a CSV file, a sample of it looks like this:
Image of CSV file
Snowpipe is failing to load this CSV file with the following error:
Number of columns in file (5) does not match that of the corresponding table (3), use file format option error_on_column_count_mismatch=false to ignore this error
Can someone advise me csv file format definition to accomodate load without fail ?
The issue is that the data you are trying to load contains commas (,) inside the data itself. Snowflake thinks that those commas represent new columns which is why it thinks there are 5 columns in your file. It is then trying to load these 5 columns into a table with only 3 columns resulting in an error.
You need to tell Snowflake that anything inside double-quotes (") should be loaded as-is, and not to interpret commas inside quotes as column delimiters.
When you create your file-format via the web interface there is an option which allows you to tell Snowflake to do this. Set the "Field optionally enclosed by" dropdown to "Double Quote" like in this picture:
Alternatively, if you're creating your file-format with SQL then there is an option called FIELD_OPTIONALLY_ENCLOSED_BY that you can set to \042 which does the same thing:
CREATE FILE FORMAT "SIMON_DB"."PUBLIC".sample_file_format
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = '\042' # <---------------- Set to double-quote
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO';
If possible share the file format and one sample record to figure out the issue. Seems issue with number of column, Can you include field_optionally_enclosed_by option into your copy statement and try it once.
When the TAB character is unlikely to occur, I tend to use TAB delimited files - which, also - together with a Header -, make the source files more human-readable in case they need to be open for troubleshooting loading failures:
FIELD_DELIMITER = '\t'
Also (although a bit off-topic), note that Snowflake suggests files to be compressed: https://docs.snowflake.com/en/user-guide/data-load-prepare.html#data-file-compression
I mostly use GZip compression type:
COMPRESSION = GZIP
A (working) example:
CREATE FILE FORMAT Public.CSV_GZIP_TABDELIMITED_WITHHEADER_QUOTES_TRIM
FIELD_DELIMITER = '\t'
SKIP_HEADER = 1
TRIM_SPACE = TRUE
NULL_IF = ('NULL')
COMPRESSION = GZIP
;

using a variable for a stage path in snowflake sql

I'd like to keep my stage path in a variable when I run queries. It looks like there is support to get this working for tables (link), but I can't get it working for stages. Is this supported? Thanks.
CREATE STAGE "MY_DB"."EXTERNAL".AZURE_BLOBS
URL = 'azure://example.blob.core.windows.net/my-csv-container'
CREDENTIALS = (AZURE_SAS_TOKEN = '****');
CREATE FILE FORMAT "INSIGHT_ETL"."EXTERNAL".CSV_GZ
TYPE = 'CSV'
COMPRESSION = 'GZIP'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
//This works
SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, A.$1
FROM '#AZURE_BLOBS/' (FILE_FORMAT => CSV_GZ) A
limit 10;
SET StagePath = '#AZURE_BLOBS/';
//This gets a compile error
SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, A.$1
FROM $StagePath (FILE_FORMAT => CSV_GZ) A
limit 10;
Feel free to have the community correct me if I am wrong, but I don't believe this is supported today due to special parsing and handling for the stage path expression.
You can submit a feature request on the Snowflake Ideas page so that users can vote on it. The Snowflake Product Management Team actively monitors this page and features with a lot of votes often get priority consideration in a future release:
https://community.snowflake.com/s/ideas

Resources