goal: confirm that the file naming behavior in the picture below is expected for the Snowflake S3 integration, given the code I am using
code:
copy into s3://bucket-name/ga_sessions_YYYYMMDD from EDW_DEV.GOOGLE_ANALYTICS.GA_TO_S3 file_format = google_analytics_json_format storage_integration = SNOWFLAKE_TO_S3;
expected result is suffix values that ascend in order:
ga_sessions.YYYYMMDD_0_0_0
ga_sessions.YYYYMMDD_0_0_1
ga_sessions.YYYYMMDD_0_0_2
what I'm actually getting:
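For reference, a minimal sketch (option values are illustrative, not a confirmed fix) of the same unload with the copy options that determine how many output files, and therefore how many numeric suffixes, Snowflake writes:
copy into s3://bucket-name/ga_sessions_YYYYMMDD
from EDW_DEV.GOOGLE_ANALYTICS.GA_TO_S3
file_format = google_analytics_json_format
storage_integration = SNOWFLAKE_TO_S3
single = false             -- default: parallel unload writes multiple files, each with its own numeric suffix
max_file_size = 536870912  -- bytes per output file; larger values generally mean fewer files
overwrite = true;          -- overwrite files with the same names from a previous run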
Related
I am using the Snowflake Data Load Wizard to upload a CSV file to a Snowflake table. The Snowflake table structure identifies a few columns as 'NOT NULL' (non-nullable). The problem is that the wizard treats empty strings as NULL, and the Data Load Wizard issues the following error:
Unable to copy files into table. NULL result in a non-nullable column
File '#<...../load_data.csv', line 2, character 1
Row 1, column "<TABLE_NAME>"["PRIMARY_CONTACT_ROLE":19]
I'm sharing my File Format parameters from the wizard:
I then updated the DDL of the table by removing the "NOT NULL" declaration from the PRIMARY_CONTACT_ROLE column and re-created the table, and this time the data load of 20K records succeeded.
How do we fix the file format in the wizard so that Snowflake does not treat empty strings as NULLs?
The option you have to set is EMPTY_FIELD_AS_NULL = FALSE. Unfortunately, modifying this option is not possible in the wizard. You have to create your file format, or alter your existing file format, manually in a worksheet as follows:
CREATE FILE FORMAT my_csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1
EMPTY_FIELD_AS_NULL = FALSE;
This causes empty strings to be loaded as empty strings rather than being treated as NULL values.
The relevant documentation can be found at https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#type-csv.
Let me know if you need a full example of how to upload a CSV file with the SnowSQL CLI.
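In case a quick sketch helps, this is roughly what such a load could look like from SnowSQL, assuming a hypothetical local file and target table (the path /tmp/load_data.csv and the table my_table are placeholders, not from the original post):
-- Stage the local file into the table's stage, then load it with the file format above.
PUT file:///tmp/load_data.csv @%my_table;

COPY INTO my_table
FROM @%my_table
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');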
Just to add: there are additional ways you can load your CSV into Snowflake without having to specify a file format.
You can use pre-built third-party modeling tools to upload your raw file, adjust the default column types to your preference, and push your modeled data back into Snowflake.
My team and I are working on such a tool, Datameer; feel free to check it out here:
https://www.datameer.com/upload-csv-to-snowflake/
I am trying to import data exported from BigQuery as AVRO and compressed with DEFLATE. The only compression codec common to both BigQuery and Snowflake, besides NONE, is DEFLATE.
I am exporting one of the publicly available datasets, bigquery-public-data:covid19_open_data.covid19_open_data, with 13,343,598 rows. I am using the following command to export:
bq extract --destination_format=AVRO --compression=DEFLATE bigquery-public-data:covid19_open_data.covid19_open_data gs://staging/covid19_open_data/avro_deflate/covid19_open_data_2_*.avro
The command creates 17 files in GCP. When I query the data in the files with this command:
SELECT count(*) FROM @shared.data_warehouse_ext_stage/covid19_open_data/avro_deflate;
I only get a count of 6,845,021 rows. To troubleshoot the error in the pipe, I issue this command:
SELECT * from table(information_schema.copy_history(table_name=>'covid19_open_data', start_time=> dateadd(hours, -1, current_timestamp())));
The error reported by the pipeline is as follows:
Invalid data encountered during decompression for file: 'covid19_open_data_3_000000000006.avro',compression type used: 'DEFLATE', cause: 'data error'
The SQL for the File Format command is:
CREATE OR REPLACE FILE FORMAT monitoring_blocking.dv_avro_deflate_format TYPE = AVRO COMPRESSION = DEFLATE;
I know the problem is only related to the compression being DEFLATE. There are only two compression codecs for AVRO that are common to both BigQuery and Snowflake: NONE and DEFLATE. I also created two more pipes, one with an AVRO file format and NONE compression, and the second with CSV and GZIP. They both load data into the table. The two AVRO pipes are mirrors of each other except for the file format. Here is a snippet of the SQL for the pipe:
CREATE OR REPLACE PIPE covid19_open_data_avro
AUTO_INGEST = TRUE
INTEGRATION = 'GCS_PUBSUB_DATA_WAREHOUSE_NOTIFICATION_INT' AS
COPY INTO covid19_open_data(
location_key
,date
,place_id
,wikidata_id
...
)
FROM
(SELECT
$1:location_key
,$1:date AS date
,$1:place_id AS place_id
,$1:wikidata_id AS wikidata_id
...
FROM @shared.staging/covid19_open_data/avro_deflate)
FILE_FORMAT = monitoring_blocking.dv_avro_deflate_format;
The problem lies within the Snowflake file format. When we changed the compression in the FILE FORMAT definition from DEFLATE to AUTO, it worked:
CREATE OR REPLACE FILE FORMAT my_schema.avro_compressed_format
TYPE = AVRO
COMPRESSION = DEFLATE;
to
CREATE OR REPLACE FILE FORMAT my_schema.avro_compressed_format
TYPE = AVRO
COMPRESSION = AUTO;
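For completeness, a hedged sketch of how that fix might be applied and verified, reusing the object names from this thread. If the pipe does not pick up the replaced file format, it may also need to be recreated with the same CREATE OR REPLACE PIPE statement shown above; REFRESH only re-queues recently staged files that have not already been loaded.
-- Recreate the file format so Snowflake auto-detects the AVRO codec.
CREATE OR REPLACE FILE FORMAT monitoring_blocking.dv_avro_deflate_format
TYPE = AVRO
COMPRESSION = AUTO;

-- Ask the pipe to re-scan the stage for files that were not loaded.
ALTER PIPE covid19_open_data_avro REFRESH;

-- Confirm the loads in copy history (same query as earlier in the question).
SELECT *
FROM table(information_schema.copy_history(
  table_name => 'covid19_open_data',
  start_time => dateadd(hours, -1, current_timestamp())));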
I've created an S3 [external] stage and uploaded csv files into \stage*.csv folder.
I can see the stage content by doing list @my_stage.
If I query the stage with
select $1,$2,$3,$4,$5,$6 from @my_s3_stage
it looks like I'm randomly picking up files.
So I'm trying to select from a specific file by adding a pattern:
PATTERN => job.csv
This returns no results.
Note: I've used Snowflake for all of 5 hours, so I'm pretty new to the syntax.
For a pattern, you can use:
select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern=>'.*data.*[.]csv.gz') t;
The pattern is a regular expression.
For a specific file, you have to add the file name to the query, like this:
select t.$1, t.$2 from @mystage/data1.csv.gz;
If your file format is set in your stage definition, you don't need the file_format parameter.
More info can be found here: https://docs.snowflake.com/en/user-guide/querying-stage.html
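Applied to the file from the question, either form below should work. @my_s3_stage is the asker's stage; 'my_csv_format' is a hypothetical file format name to substitute with your own.
-- Pattern form: the pattern is a regular expression matched against the file path.
select t.$1, t.$2, t.$3 from @my_s3_stage (file_format => 'my_csv_format', pattern => '.*job[.]csv') t;

-- Direct form: reference the file by name right after the stage.
select t.$1, t.$2, t.$3 from @my_s3_stage/job.csv (file_format => 'my_csv_format') t;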
I am pretty new to the Snowflake Cloud offering and was just trying to load a simple .csv file from an AWS S3 staging area to a table in Snowflake using the COPY command.
Here is what I used as the command:
copy into "database name"."schema"."table name"
from @S3_ACCESS
file_format = (format_name = format name);
When I run the above code, I get the following error: Numeric value '63' is not recognized
Please see the attached image. I'm not sure what this error means, and I'm not able to find any lead in the Snowflake UI itself about what could be wrong with the value.
Thanks in Advance!
The error says it was expecting a numeric value, but it got "63", and that value cannot be converted to a numeric value.
From the image you shared, I can see that there are some weird characters around the 6 and the 3. There could be an issue with the file encoding, or the data is corrupted.
Please check the ENCODING option for the file format:
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#format-type-options-formattypeoptions
By the way, I recommend always using UTF-8.
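If the file turns out not to be UTF-8 (for example Windows-1252, or UTF-8 with a byte order mark), the encoding can be declared on the file format. A minimal sketch; the format name here is a placeholder, not the one from the question:
-- Declare the source encoding so the bytes around the '63' are decoded correctly.
-- 'UTF8' is the default; a mis-encoded file may instead need e.g. 'WINDOWS1252'.
CREATE OR REPLACE FILE FORMAT my_csv_format
TYPE = CSV
ENCODING = 'UTF8';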
I have created a simple table in Snowflake:
Column name Type
S VARCHAR
N NUMBER
Both columns are nullable.
Now I want to load partially bad data into the table from CSV and JSON files.
CSV
s, n
hello, 1
bye, 2
nothing, zero
The third line is "bad": its second element is not a number.
The command that I use to load this file:
COPY INTO "DEMO_DB"."PUBLIC"."TEST5" FROM @my_s3_stage1 files=('2good-1bad.csv') file_format = (type = csv field_delimiter = ',' skip_header = 1) ON_ERROR = CONTINUE;
A SnowflakeSQLException is thrown:
errorCode = 200038
SQLState = 0A000
message: Cannot convert value in the driver from type:12 to type:int, value=PARTIALLY_LOADED.
Two "good" lines are written into table; the "bad" one is ignored. This result is expected.
However when I am using the following JSON lines file:
{"s":"hello", "n":1}
{"s":"bye", "n":2}
{"s":"nothing", "n":"zero"}
with this command:
COPY INTO "DEMO_DB"."PUBLIC"."TEST5" FROM #my_s3_stage1 files=('2good-1bad.json') file_format = (type = json)
MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE
ON_ERROR = CONTINUE
I get the following SnowflakeSQLException:
errorCode = 100071
SQLState = 22000
message: Failed to cast variant value "zero" to FIXED
and nothing is written to the DB.
The question is "What's wrong?"
Why ON_ERROR = CONTINUE does not work with my JSON file?
PS:
wrapping CONTINUE with single quotes does not help
using lower case does not help
actually I do not need CONTINUE, I need SKIP_FILE_<num>; however, this does not work with JSON either.
actually we are using Avro in our production environment, so that is more relevant; I am using JSON for tests because it is easier.
You are correct that ON_ERROR is not supported with non-CSV file formats. I've seen folks work around this by specifying the CSV file type with FIELD_DELIMITER = 'none'.
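A sketch of that workaround applied to the example above: each JSON line is read as a single CSV column, then parsed and cast in the COPY transformation, so ON_ERROR applies to the load. Whether a failed cast inside the transformation is skipped or still aborts the load is worth verifying on your own data before relying on it.
-- Workaround sketch: read each line of the JSON file as one CSV "column"
-- (FIELD_DELIMITER = 'none'), then parse and cast it in the transformation.
COPY INTO "DEMO_DB"."PUBLIC"."TEST5" (s, n)
FROM (
  SELECT parse_json($1):s::varchar, parse_json($1):n::number
  FROM @my_s3_stage1/2good-1bad.json
)
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = 'none')
ON_ERROR = CONTINUE;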
I have seen a couple of folks request that this option work for semi-structured files, and you are welcome to submit a feature request as well to create more demand for it:
https://community.snowflake.com/s/ideas
The documentation doesn't really spell out that this isn't supported (feel free to submit docs feedback using the button at the bottom of the page), but you can see it hinted at here:
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
"You can use the corresponding file format (e.g. JSON), but any error in the transformation will stop the COPY operation, even if you set the ON_ERROR option to continue or skip the file."