Snowflake Parquet COPY INTO NULL_IF

I have a staged Parquet file in an S3 location. I am attempting to parse the Parquet file into a relational table; the field I'm having an issue with is a TIMESTAMP_NTZ field.
In the file there is a field called "due_date", and while most of the time it is populated with data, on occasion it is an empty string, like below:
"due_date":""
The error I'm receiving is 'Failed to cast variant value "" to TIMESTAMP_NTZ.'
Using the NULL_IF parameter in the COPY INTO is not yielding any results; it is set to:
file_format = (TYPE='PARQUET' COMPRESSION = SNAPPY BINARY_AS_TEXT = true TRIM_SPACE = false NULL_IF = ('\\N','NULL','NUL','','""'))
I have seen other users replace the NULLs in the SELECT portion of the COPY INTO statement, but this would be a hard option to implement because the fields are dynamic.
Could anyone shed any light on this, other than the knowledge that empty strings shouldn't form part of Parquet?
Full query below:
USE SCHEMA MY_SCHEMA;
COPY INTO MY_SCHEMA.MY_TABLE (LOAD_DATE, ACCOUNTID, APPID, CREATED_AT, CREATED_ON, DATE, DUE_DATE, NUMEVENTS, NUMMINUTES, REMOTEIP, SERVER, TIMESTAMP, TRACKNAME, TRACKTYPEID, TRANSACTION_DATE, TYPE, USERAGENT, VISITORID)
FROM (
  SELECT CURRENT_TIMESTAMP(), $1:accountId, $1:appId, $1:created_at, $1:created_on, $1:date, $1:due_date, $1:numEvents, $1:numMinutes, $1:remoteIp, $1:server, $1:timestamp, $1:trackName, $1:trackTypeId, $1:transaction_date, $1:type, $1:userAgent, $1:visitorId
  FROM @my_stage
)
PATTERN = '.*part.*'
file_format = (TYPE='PARQUET' COMPRESSION = SNAPPY BINARY_AS_TEXT = true TRIM_SPACE = false NULL_IF = ('\\N','NULL','NUL','','""'));

You can use TRY_TO_TIMESTAMP. Since TRY_TO_TIMESTAMP does not accept a VARIANT, you need to cast it to a string first:
TRY_TO_TIMESTAMP($1:due_date::string)
instead of just
$1:due_date
If due_date is empty, the result will be NULL in the timestamp field of the target table after the insert.
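A minimal sketch of the pattern in context, narrowed down to two columns for brevity (the real statement keeps all of the columns from the question):

-- Hypothetical narrow example; only the due_date expression changes.
COPY INTO MY_SCHEMA.MY_TABLE (LOAD_DATE, DUE_DATE)
FROM (
  SELECT
    CURRENT_TIMESTAMP(),
    TRY_TO_TIMESTAMP($1:due_date::string)  -- empty strings become NULL instead of failing the cast
  FROM @my_stage
)
PATTERN = '.*part.*'
file_format = (TYPE='PARQUET' COMPRESSION = SNAPPY BINARY_AS_TEXT = true TRIM_SPACE = false);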

Related

Found character ':' instead of field delimiter ','

Again I am facing an issue with loading a file into Snowflake.
My file format is:
TYPE = CSV
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
NULL_IF = ''
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
[ COMMENT = '<string_literal>' ]
Now by running the:
copy into trips from @citibike_trips
file_format=CSV;
I am receiving the following error:
Found character ':' instead of field delimiter ','
File 'citibike-trips-json/2013-06-01/data_01a304b5-0601-4bbe-0045-e8030021523e_005_7_2.json.gz', line 1, character 41
Row 1, column "TRIPS"["STARTTIME":2]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
I am a little confused about the file I am trying to load. I got the file from a tutorial on YouTube, and in the video it works properly. However, the staged data contains not only CSV datasets but also JSON and Parquet files. I think this could be the problem, but I am not sure how to solve it, since the command above uses file_format = CSV.
Remove FIELD_OPTIONALLY_ENCLOSED_BY = '\042', recreate the file format, and run the copy statement again.
You're trying to import a JSON file using a CSV file format. In most cases all you need to do is specify JSON as the file type in the COPY INTO statement.
[ FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>' |
                    TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
You're using CSV, but it should be JSON:
FILE_FORMAT = (TYPE = JSON)
If you're more comfortable using a named file format, use the builder to create a named file format that's of type JSON:
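A minimal sketch, using my_json_format as an illustrative name:

CREATE OR REPLACE FILE FORMAT my_json_format
  TYPE = JSON;

-- Reference the named format in the COPY:
COPY INTO trips
  FROM @citibike_trips
  FILE_FORMAT = (FORMAT_NAME = 'my_json_format');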
I found a thread in the Snowflake Community forum that explains what I think you might have been facing. There are now three different kinds of files in the stage - CSV, parquet, and JSON. The copy process given in the tutorial expects there to be only CSV. You can use this syntax to exclude non-CSV files from the copy:
copy into trips from @citibike_trips
on_error = skip_file
pattern = '.*\.csv\.gz$'
file_format = csv;
Using the PATTERN option with a regular expression you can filter only the csv files to be loaded.
https://community.snowflake.com/s/feed/0D53r0000AVKgxuCQD
And if you also run into an error related to timestamps, you will want to set this file format before you do the copy:
create or replace file format
citibike.public.csv
type = 'csv'
field_optionally_enclosed_by = '\042'

How to create a csv file format definition to load data into snowflake table

I have a CSV file; a sample of it looks like this:
[Image of CSV file]
Snowpipe is failing to load this CSV file with the following error:
Number of columns in file (5) does not match that of the corresponding table (3), use file format option error_on_column_count_mismatch=false to ignore this error
Can someone advise me on a CSV file format definition to accommodate the load without failure?
The issue is that the data you are trying to load contains commas (,) inside the data itself. Snowflake thinks that those commas represent new columns which is why it thinks there are 5 columns in your file. It is then trying to load these 5 columns into a table with only 3 columns resulting in an error.
You need to tell Snowflake that anything inside double-quotes (") should be loaded as-is, and not to interpret commas inside quotes as column delimiters.
When you create your file format via the web interface there is an option which allows you to tell Snowflake to do this: set the "Field optionally enclosed by" dropdown to "Double Quote".
Alternatively, if you're creating your file-format with SQL then there is an option called FIELD_OPTIONALLY_ENCLOSED_BY that you can set to \042 which does the same thing:
CREATE FILE FORMAT "SIMON_DB"."PUBLIC".sample_file_format
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = '\042' -- <---------------- set to double quote
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO';
If possible, share the file format and one sample record to figure out the issue. It seems like an issue with the number of columns. Can you include the FIELD_OPTIONALLY_ENCLOSED_BY option in your COPY statement and try it once?
When the TAB character is unlikely to occur in the data, I tend to use TAB-delimited files, which (together with a header) also make the source files more human-readable in case they need to be opened to troubleshoot loading failures:
FIELD_DELIMITER = '\t'
Also (although a bit off-topic), note that Snowflake suggests that files be compressed: https://docs.snowflake.com/en/user-guide/data-load-prepare.html#data-file-compression
I mostly use GZip compression type:
COMPRESSION = GZIP
A (working) example:
CREATE FILE FORMAT Public.CSV_GZIP_TABDELIMITED_WITHHEADER_QUOTES_TRIM
FIELD_DELIMITER = '\t'
SKIP_HEADER = 1
TRIM_SPACE = TRUE
NULL_IF = ('NULL')
COMPRESSION = GZIP
;
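For reference, a COPY that points at such a named format might look like this (the table name, stage name, and pattern are placeholders):

COPY INTO my_table
  FROM @my_stage
  FILE_FORMAT = (FORMAT_NAME = 'Public.CSV_GZIP_TABDELIMITED_WITHHEADER_QUOTES_TRIM')
  PATTERN = '.*\.gz';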

How to get the header names of a staged csv file on snowflake?

Is it possible to get the header of a staged CSV file on Snowflake into an array?
I need to loop over all fields to insert data into our data vault model, so I really need to get these column names into an array.
Actually it was solved by using the following query over a staged file in a JavaScript stored procedure:
var get_length_and_columns_array = "select array_size(split($1,',')) as NO_OF_COL, "+
"split($1,',') as COLUMNS_ARRAY from "+FILE_FULL_PATH+" "+
"(file_format=>"+ONE_COLUMN_FORMAT_FILE+") limit 1";
The ONE_COLUMN_FORMAT_FILE format puts all fields into one column in order to make this query work:
CREATE FILE FORMAT ONE_COLUMN_FILE_FORMAT
TYPE = 'CSV' COMPRESSION = 'AUTO' FIELD_DELIMITER = '|' RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
TRIM_SPACE = FALSE ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');
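For reference, the SQL that the procedure assembles looks roughly like this (the stage and file names are placeholders; the format name is the one created above):

SELECT ARRAY_SIZE(SPLIT($1, ',')) AS no_of_col,
       SPLIT($1, ',')             AS columns_array
FROM @my_stage/myfile.csv
  (FILE_FORMAT => 'ONE_COLUMN_FILE_FORMAT')
LIMIT 1;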
Yes, you can query the following metadata of your staged files:
METADATA$FILENAME: Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER: Row number for each record in the container staged data file.
So that alone is not enough information. But there is the SKIP_HEADER parameter that can be used in your COPY INTO command, so my suggestion for a workaround is (a sketch follows below):
Copy your data into a table without skipping the header row, so that the header is loaded into your table as regular column values
Query the first row, which contains the column names
Use this as input for further processing
More info about the parameter in the COPY INTO command: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
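A minimal sketch of that workaround (the table, stage, and column names are illustrative):

-- Load everything, including the header row, into a single-column table
CREATE OR REPLACE TABLE raw_lines (line VARCHAR);

COPY INTO raw_lines
  FROM @my_stage/myfile.csv
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = NONE SKIP_HEADER = 0);

-- The header row can then be split into an array of column names
-- (in practice, keep METADATA$FILE_ROW_NUMBER during the load to identify it reliably)
SELECT SPLIT(line, ',') AS header_columns
FROM raw_lines
LIMIT 1;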
Dynamically generating the column list from a CSV file is not currently available in Snowflake, or in most platforms AFAIK.
CSV is not the ideal format for this kind of schema-on-read operation.
If you are able to work with your input files, I would suggest converting the CSV to JSON. If you use JSON instead, you can then use Snowflake to process the file.
Here is some context:
Load CSV files with dynamic column headers Not Supported
Example of loading json data
Example Converting CSV to JSON with Pandas
import pandas as pd

filepath = '/home/username/data/sales.csv'
jsonfilepath = filepath.replace('.csv', '.json')

# Read the CSV and write it back out as JSON records with ISO-formatted dates
df = pd.read_csv(filepath)
# df.to_json(jsonfilepath, orient="table", date_format="iso", index=False)
df.to_json(jsonfilepath, orient="records", date_format="iso")
print("Input File: {}\r\nOutput File: {}".format(filepath, jsonfilepath))
Example Converting CSV to JSON with csvkit
csvjson -i 4 '/home/username/data/sales.csv' > '/home/username/data/sales.csvkit.json'
Querying Semi-Structured Data in Snowflake
Loading JSON Data into Snowflake
/* Create a target relational table for the JSON data. The table is temporary, meaning it persists only for */
/* the duration of the user session and is not visible to other users. */
create or replace temporary table home_sales (
city string,
zip string,
state string,
type string default 'Residential',
sale_date timestamp_ntz,
price string
);
/* Create a named file format with the file delimiter set as none and the record delimiter set as the new */
/* line character. */
/* */
/* When loading semi-structured data (e.g. JSON), you should set CSV as the file format type (default value). */
/* You could use the JSON file format, but any error in the transformation would stop the COPY operation, */
/* even if you set the ON_ERROR option to continue or skip the file. */
create or replace file format sf_tut_csv_format
field_delimiter = none
record_delimiter = '\\n';
/* Create a temporary internal stage that references the file format object. */
/* Similar to temporary tables, temporary stages are automatically dropped at the end of the session. */
create or replace temporary stage sf_tut_stage
file_format = sf_tut_csv_format;
/* Stage the data file. */
/* */
/* Note that the example PUT statement references the macOS or Linux location of the data file. */
/* If you are using Windows, execute the following statement instead: */
-- PUT file://%TEMP%/sales.json @sf_tut_stage;
put file:///tmp/sales.json @sf_tut_stage;
/* Load the JSON data into the relational table. */
/* */
/* A SELECT query in the COPY statement identifies a numbered set of columns in the data files you are */
/* loading from. Note that all JSON data is stored in a single column ($1). */
copy into home_sales(city, state, zip, sale_date, price)
from (select substr(parse_json($1):location.state_city,4), substr(parse_json($1):location.state_city,1,2),
parse_json($1):location.zip, to_timestamp_ntz(parse_json($1):sale_date), parse_json($1):price
from @sf_tut_stage/sales.json.gz t)
on_error = 'continue';
/* Query the relational table */
select * from home_sales;

Data load in Snowflake: NULL result in a non-nullable column

I am getting the error message "NULL result in a non-nullable column" when loading my Parquet files into Snowflake.
I have NOT NULL columns in Snowflake, for example NAME2 and NAME3, but the values for them in the Parquet files are empty strings.
So my question is how can I resolve this constraint without changing my table definition or without removing not null constraint?
COPY INTO "DB_STAGE"."SCH_ABC_INIT"."T_TAB" FROM (
SELECT
$1:OPSYS::VARCHAR,
$1:MANDT::VARCHAR,
$1:LIFNR::VARCHAR,
$1:LAND1::VARCHAR,
$1:NAME1::VARCHAR,
$1:NAME2::VARCHAR,
$1:NAME3::VARCHAR,
$1:NAME4::VARCHAR,
..
..
$1:OPTYPE::VARCHAR
FROM @DB_STAGE.SCH_ABC_INIT.initial_load_stage_ABC)
file_format = (type = 'parquet', NULL_IF=('NULL','',' ','NULL','NULL','//N'))
pattern = '.*/ABC-TAB-prod/.*snappy.parquet';
I believe that this line
file_format = (type = 'parquet', NULL_IF=('NULL','',' ','NULL','NULL','//N'))
is explicitly asking to take the empty strings and make them into NULL values, which obviously won't work going into fields that are NOT NULL in your table. You should probably try something like this:
file_format = (type = 'parquet', NULL_IF=('NULL','//N'))
Your other option is to remove the NOT NULL in your table and allow the conversion to NULL.
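A minimal sketch of the adjusted statement (the column list stays exactly as in the question; only the NULL_IF list changes):

COPY INTO "DB_STAGE"."SCH_ABC_INIT"."T_TAB" FROM (
  SELECT
    $1:OPSYS::VARCHAR,
    $1:MANDT::VARCHAR,
    -- ... remaining columns as in the question ...
    $1:OPTYPE::VARCHAR
  FROM @DB_STAGE.SCH_ABC_INIT.initial_load_stage_ABC)
  file_format = (type = 'parquet', NULL_IF=('NULL','//N'))  -- empty strings are no longer mapped to NULL
  pattern = '.*/ABC-TAB-prod/.*snappy.parquet';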
Looking at these options, there is some guidance on NULLs, particularly for Parquet:
NULL_IF = ( 'string1' [ , 'string2' , ... ] )
Use
Data loading only
Definition
String used to convert to and from SQL NULL. Snowflake replaces these strings in the data load source with SQL NULL. To specify more than one string, enclose the list of strings in parentheses and use commas to separate each value.
This file format option is applied to the following actions only when loading Parquet data into separate columns using the MATCH_BY_COLUMN_NAME copy option.
Note that Snowflake converts all instances of the value to NULL, regardless of the data type. For example, if 2 is specified as a value, all instances of 2 as either a string or number are converted.
For example:
NULL_IF = ('\N', 'NULL', 'NUL', '')
Note that this option can include empty strings.
Default
\N (i.e. NULL, which assumes the ESCAPE_UNENCLOSED_FIELD value is \)
from the docs: https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
There's also more content I found under communities: https://community.snowflake.com/s/question/0D50Z00009Vw7ktSAB/how-can-i-get-schema-file-when-copy-into-file-returns-empty-row
https://community.snowflake.com/s/question/0D50Z00008UE4MKSA1/while-loading-a-parquet-file-to-snowflake-all-the-optional-field-in-parquet-schema-are-coming-as-null-any-idea-why-it-is-happening-all-other-fields-which-are-mandatory-in-parquet-schema-as-coming-as-expected
Let me know if these help
Try setting EMPTY_FIELD_AS_NULL = FALSE in the file format.

COPY INTO query on Snowflake returns TABLE does not exist error

I am trying to load data from Azure Blob Storage.
The data has already been staged.
But, the issue is when I try to run
copy into random_table_name
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv'
Below is the error I encounter:
raise error_class(
snowflake.connector.errors.ProgrammingError: 001757 (42601): SQL compilation error:
Table 'random_table_name' does not exist
Basically, it says the table does not exist, which it does not, but the syntax on the website is the same as mine.
In my case the table name is case-sensitive. Snowflake seems to convert everything to upper case. I changed the database/schema/table names to all upper-case and it started working.
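A quick illustration of the case-sensitivity point (object names are examples):

-- Unquoted identifiers are stored upper-case, so these two statements refer to the same table:
create table random_table_name (c1 varchar);
copy into RANDOM_TABLE_NAME from @stage_name_i_created file_format = (type = 'csv');

-- A quoted lower-case name is a different, case-sensitive identifier and must be quoted everywhere:
create table "random_table_name" (c1 varchar);
copy into "random_table_name" from @stage_name_i_created file_format = (type = 'csv');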
First, run the query below to fetch the column headers:
select $1 FROM @stage_name_i_created/filename.csv limit 1
Assuming below is the header line from your CSV file:
id;first_name;last_name;email;age;location
Create a CSV file format:
create or replace file format semicolon
type = 'CSV'
field_delimiter = ';'
skip_header=1;
Then you should define the data types and field names as below:
create or replace table <yourtable> as
select $1::varchar as id
,$2::varchar as first_name
,$3::varchar as last_name
,$4::varchar as email
,$5::int as age
,$6::varchar as location
FROM @stage_name_i_created/yourfile.csv
(file_format => semicolon );
The table must exist prior to running a COPY INTO command. In your post, you say that the table does not exist...so that is your issue.
If your table exists, try forcing the full table path like this:
copy into <database>.<schema>.<random_table_name>
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv'
or by steps like this:
use database <database_name>;
use schema <schema_name>;
copy into database.schema.random_table_name
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv';
rbachkaniwala, what do you mean by 'How do I create a table? (according to Snowflake syntax it is not possible to create empty tables)'?
You can just do the below to create a table:
CREATE TABLE random_table_name (FIELD1 VARCHAR, FIELD2 VARCHAR)
The table does need to exist. You should check the documentation for COPY INTO.
Other areas to consider are
do you have the right context set for the database & schema
does the user / role have access to the table or object.
It basically seems like you don't have the table defined yet. You should
ensure the table is created
ensure all columns in the CSV exist as columns in the table
ensure the order of the columns is the same as in the CSV
I'd check data types too.
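A minimal sketch of that sequence (names and column types are illustrative):

-- 1. Create the target table with columns matching the CSV, in the same order
CREATE TABLE random_table_name (
  id         VARCHAR,
  first_name VARCHAR,
  last_name  VARCHAR
);

-- 2. Then run the load
COPY INTO random_table_name
  FROM @stage_name_i_created
  FILE_FORMAT = (TYPE = 'csv')
  PATTERN = '.*\.csv';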
"COPY INTO" is not a query command, it is the actual data transfer execution from source to destination, which both must exist as others commented here but If you want just to query without loading the files then run the following SQL:
//Display list of files in the stage to verify stage
LIST @stage_name_i_created;
//Create a file format
CREATE OR REPLACE FILE FORMAT RANDOM_FILE_CSV
type = csv
COMPRESSION = 'GZIP' FIELD_DELIMITER = ',' RECORD_DELIMITER = '\n' SKIP_HEADER = 0 FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = 'NONE' DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
//Now select the data in the files
Select $1 as first_col, $2 as second_col //add as many columns as necessary ...etc
from @stage_name_i_created
(FILE_FORMAT => RANDOM_FILE_CSV)
More information can be found in the documentation link here
https://docs.snowflake.com/en/user-guide/querying-stage.html
