Snowflake Copy Into failing when insert Null in timestamp column - snowflake-cloud-data-platform

Trying to load file data into Snowflake using COPY INTO. The table has a timestamp column, and the file contains only empty strings ("") in that column.
When running COPY INTO with the file format's timestamp option set to AUTO, the statement fails with: Can't parse '' as timestamp.
Is there any way to handle this?

Use the NULL_IF option:
NULL_IF = ( 'string1' [ , 'string2' ... ] )
String used to convert to and from SQL NULL. Snowflake replaces these strings in the data load source with SQL NULL. To specify more than one string, enclose the list of strings in parentheses and use commas to separate each value.
NULL_IF = ('\\N', '')
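For the timestamp case in the question, the option can go directly in the COPY statement's file format; a minimal sketch (table and stage names are illustrative, not from the question):
COPY INTO my_table
FROM @my_stage/data.csv
FILE_FORMAT = (
    TYPE = CSV
    NULL_IF = ('\\N', '')   -- empty strings load as SQL NULL instead of failing to parse as timestamps
);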

Related

Date '2017/02/23' not recognized in SnowFlake

I have a csv with example data as:
61| MXN| Mexican Peso| 2017/02/23
I'm trying to insert this into Snowflake using the following commands:
create or replace stage table_stage file_format = (TYPE=CSV, ENCODING='WINDOWS1252');
copy into table from @table_stage/table.csv.gz file_format = (TYPE=CSV, FIELD_DELIMITER='|', ERROR_ON_COLUMN_COUNT_MISMATCH=false, ENCODING='WINDOWS1252');
put file://table.csv @table_stage auto_compress=true;
But I get the error as
Date '2017/02/23' not recognized
Using alter session set date_input_format = 'YYYY/MM/DD' to change the date format fixes it.
But what can I add in the create stage or the copy command itself to change the date format?
Snowflake has a session parameter, DATE_INPUT_FORMAT, that controls the input format for the DATE data type.
The default value AUTO specifies that Snowflake attempts to automatically detect the format of dates stored in the system during the session, meaning the COPY INTO <table> command attempts to match all date strings in the staged data files against the formats listed in Supported Formats for AUTO Detection.
To guarantee correct loading of data, Snowflake strongly recommends explicitly setting the file format options for data loading (as explained in the documentation).
To solve your issue, set the DATE_INPUT_FORMAT parameter to the expected format of the dates in your staged files.
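The session-level change, matching the slash-separated dates in the sample data, might look like:
ALTER SESSION SET DATE_INPUT_FORMAT = 'YYYY/MM/DD';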
Just set the date format in the file format: https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
Use the DATE_FORMAT parameter in file_format condition.
You can read more here: COPY INTO
copy into table
from @table_stage/table.csv.gz
file_format = (TYPE=CSV
FIELD_DELIMITER='|'
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
ENCODING = 'WINDOWS1252'
DATE_FORMAT = 'YYYY/MM/DD');

Staged Internal file csv.gz giving error that file does not match size of corresponding table?

I am trying to copy a csv.gz file into a table I created, to start analyzing location data for a map. I ran into an error saying there are too many characters and that I should add an ON_ERROR option. However, I am not sure that will help load the data; can you take a look?
Data source: https://data.world/cityofchicago/array-of-things-locations
SELECT * FROM @staged/array-of-things-locations-1.csv.gz;
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude number, longitude number, location_2 variant, location variant);
COPY INTO ARRAYLOC
FROM @staged/array-of-things-locations-1.csv.gz;
CREATE OR REPLACE FILE FORMAT t_csv
TYPE = "CSV"
COMPRESSION = "GZIP"
FILE_EXTENSION = 'csv.gz';
CREATE OR REPLACE STAGE staged
FILE_FORMAT = 't_csv';
COPY INTO ARRAYLOC FROM @~/staged file_format = (format_name = 't_csv');
Error message:
Number of columns in file (8) does not match that of the corresponding table (9), use file format option error_on_column_count_mismatch=false to ignore this error File '@~/staged/array-of-things-locations-1.csv.gz', line 2, character 1 Row 1 starts at line 1, column "ARRAYLOC"["LOCATION_2":8] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
Solved:
The real issue was that I needed to clean the data I was staging more thoroughly; the error was mine. I ended up changing the column types, changing the quoting in the file from " to ', and splitting one column because of a comma in the middle of the data.
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude float, longitude varchar, location varchar);
COPY INTO ARRAYLOC
FROM @staged/array-of-things-locations-1.csv.gz;
CREATE or Replace FILE FORMAT r_csv
TYPE = "CSV"
COMPRESSION = "GZIP"
FILE_EXTENSION= 'csv.gz'
SKIP_HEADER = 1
ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE
EMPTY_FIELD_AS_NULL = TRUE;
create or replace stage staged
file_format='r_csv';
copy into ARRAYLOC from @~/staged
file_format = (format_name = 'r_csv');
SELECT * FROM ARRAYLOC LIMIT 10;
Your error doesn't say that you have too many characters but that your file has 8 columns and your table has 9 columns, so it doesn't know how to align the columns from the file to the columns in the table.
You can list out the columns specifically using a subquery in your COPY INTO statement.
Notes:
Columns from the file are positional based, so $1 is the first column in the file, $2 is the second, etc....
You can put the columns from the file in any order that you need to match your table.
You'll need to find the column that doesn't have data coming in from the file and either fill it with null or some default value. In my example, I assume it is the last column and in it I will put the current timestamp.
It helps to list out the columns of the table behind the table name, but this is not required.
Example:
COPY INTO ARRAYLOC (COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8,COLUMN9)
FROM (
SELECT $1
,$2
,$3
,$4
,$5
,$6
,$7
,$8
,CURRENT_TIMESTAMP()
FROM @staged/array-of-things-locations-1.csv.gz
);
I would advise against changing the ERROR_ON_COLUMN_COUNT_MISMATCH parameter; doing so could result in data ending up in the wrong column of the table. I would also advise against changing the ON_ERROR parameter, as I believe it is best to be alerted of such errors rather than suppressing them.
Yes, setting that option should help. From the documentation:
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE | FALSE
Use: Data loading only
Definition: Boolean that specifies whether to generate a parsing error
if the number of delimited columns (i.e. fields) in an input file does
not match the number of columns in the corresponding table.
If set to FALSE, an error is not generated and the load continues. If
the file is successfully loaded:
If the input file contains records with more fields than columns in
the table, the matching fields are loaded in order of occurrence in
the file and the remaining fields are not loaded.
If the input file contains records with fewer fields than columns in
the table, the non-matching columns in the table are loaded with NULL
values.
This option assumes all the records within the input file are the same
length (i.e. a file containing records of varying length returns an
error regardless of the value specified for this parameter).
So assuming you are okay with getting NULL values for the missing column in your input data, you can use ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE to load the file successfully.
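A minimal sketch of that approach, reusing the stage and table from the question:
COPY INTO ARRAYLOC
FROM @staged/array-of-things-locations-1.csv.gz
FILE_FORMAT = (TYPE = CSV
               SKIP_HEADER = 1
               ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE);
With 8 fields in the file and 9 columns in the table, the unmatched trailing column is loaded with NULL.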
When viewing that table directly on data.world, there are columns named both location and location_2 with identical data. It looks like that display is erroneous, because when downloading the CSV, it has only a single location column.
I suspect if you change your CREATE OR REPLACE statement with the following statement that omits the creation of location_2, you'll get to where you want to go:
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude number, longitude number, location variant);

NULL Value Handling for CSV Files Via External Tables in Snowflake

I am trying to get the NULL_IF parameter of a file format working when applied to an external table.
I have a source CSV file containing NULL values in some columns. NULLs in the source file appear as "\N" (all non-numeric values in the file are quoted). Here is an example line from the raw CSV where the ModifiedOn value is NULL in the source system:
"AirportId" , "IATACode" , "CreatedOn" , "ModifiedOn"
1 , "ACU" , "2015-08-25 16:58:45" , "\N"
I have a file format defined including the parameter NULL_IF = "\\N"
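For reference, the file format is defined along these lines (the options other than NULL_IF are assumptions based on the quoting in the sample):
CREATE OR REPLACE FILE FORMAT CSV_1
    TYPE = CSV
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    NULL_IF = ('\\N');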
The following select statement successfully interprets the correct rows as holding NULL values.
SELECT $8
FROM @MyS3Bucket
(
file_format => 'CSV_1',
pattern => '.*MyFileType.*.csv.gz'
)
However if I use the same file format with an external table like this:
CREATE OR REPLACE EXTERNAL TABLE MyTable
(
MyColumn varchar as (value:c8::varchar)
)
WITH LOCATION = @MyS3Bucket
FILE_FORMAT = (FORMAT_NAME = 'CSV_1')
PATTERN = '.*MyFileType_.*.csv.gz';
Each row holds \N as a value rather than NULL.
I assume this is caused by external tables providing a single variant output that can then be further split rather than directly presenting individual columns in the csv file.
One solution is to code the NULL handling into the external view like this:
CREATE OR REPLACE EXTERNAL TABLE MyTable
(
MyColumn varchar as (NULLIF(value:c8::varchar, '\\N'))
)
WITH LOCATION = @MyS3Bucket
FILE_FORMAT = (FORMAT_NAME = 'CSV_1')
PATTERN = '.*MyFileType_.*.csv.gz';
However, this leaves me at risk of having to re-write a lot of external table code if the file format changes, whereas the file format could/should centralise that NULL definition. It would also mean the NULL conversion would have to be handled column by column rather than file by file, increasing code complexity.
Is there a way that I can have the NULL values appear through an external table without handling them explicitly through column definitions?
Ideally this would be applied through a file format object but changes to the format of the raw file are not impossible.
I am able to reproduce the issue, and it seems like a bug. If you have access to Snowflake support, it would be best to submit a support case for this issue so you can easily follow its progress.
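Until that is resolved, one way to keep the NULL definition in a single place (rather than repeating NULLIF in every column expression) is to leave the external table raw and do the conversion once in a view; a sketch based on the table above, with a hypothetical view name:
CREATE OR REPLACE VIEW MyTableView AS
SELECT NULLIF(MyColumn, '\\N') AS MyColumn
FROM MyTable;
If the NULL marker in the files ever changes, only the view needs updating.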

hive "\n" value in records

I am processing a large 120 GB file using Hive. The data is first loaded from a SQL Server table to AWS S3 as a CSV file (tab separated), and then a Hive external table is created on top of this file. I have encountered a problem while querying data from the Hive external table. I noticed that the CSV contains \n in many column fields (which was actually "null" in SQL Server). When I create the Hive table, a \n appearing in any record pushes Hive to a new record and generates NULL for the rest of the columns in that record. I tried lines terminated by '\001' but with no success; I get an error that Hive only supports "lines terminated by '\n'". My question is: if Hive supports only \n as the line separator, how would you handle columns that contain \n values?
Any suggestions?
This is how I am creating my external table:
DROP TABLE IF EXISTS IMPT_OMNITURE__Browser;
CREATE EXTERNAL TABLE IMPT_OMNITURE__Browser (
ID int, Region string, Description string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://abm-dw/data-import/omniture/Browser/';
You could alter the table with the command below, or add the property to the create statement's table properties:
ALTER TABLE table SET SERDEPROPERTIES ('serialization.null.format' = '');
This would make those values in the file be read as NULL.
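Applied to the create statement from the question, the property can also be set at creation time; the empty-string value here mirrors the ALTER TABLE above:
CREATE EXTERNAL TABLE IMPT_OMNITURE__Browser (
ID int, Region string, Description string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://abm-dw/data-import/omniture/Browser/'
TBLPROPERTIES ('serialization.null.format' = '');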

LOAD TABLE statement with NULLable dates

I'm looking to do a batch load into a table, called temp_data, where some of the columns are NULLable dates.
Here is what I have so far:
LOAD TABLE some.temp_data
(SomeIntegerColumn ',', SomeDateColumn DATE('YYYYMMDD') NULL('NULL'), FILLER(1), SomeStringColumn ',')
USING CLIENT FILE '{0}' ESCAPES OFF DELIMITED BY ',' ROW DELIMITED BY '#'
and I m trying to load the following file
500,NULL,Monthly#
500,NULL,Monthly#
500,NULL,Monthly#
Unfortunately the error I get is:
ERROR [07006] [Sybase][ODBC Driver][Sybase IQ]Cannot convert NULL,Mon
to a date (column SomeDateColumn)
Any ideas why this wouldn't work?
It appears that it's reading the 8 characters following the first delimiter and trying to interpret them as a date.
Try switching to FORMAT BCP. The following example could work on your sample file:
LOAD TABLE some.temp_data (
SomeIntegerColumn
, SomeDateColumn NULL('NULL')
, SomeStringColumn
)
USING CLIENT FILE '{0}'
ESCAPES OFF
DELIMITED BY ','
ROW DELIMITED BY '#'
FORMAT BCP
In addition, FORMAT BCP has the advantage of not requiring trailing delimiters.
