Import data using special escape character - snowflake-cloud-data-platform

I'm trying to import data into Snowflake using the copy command.
I have a file format defined as follows:
CREATE FILE FORMAT mydb.schema1.myFileFormat
TYPE = CSV
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
ESCAPE = '\241'
ESCAPE_UNENCLOSED_FIELD = NONE
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N')
COMMENT = '¡ used as escape character';
There's nothing special about the file format, except it's using ¡ as an escape character.
When importing data with this file format, it seems Snowflake is not recognizing the escape character, and it's throwing an error saying "Found character 'XYZ' instead of field delimiter ','".
I tried creating a file with 1 line, like the following:
"ABC123","584382","2","01","01/22/2019","02/08/2019","02/08/2019","04/03/2019","04/03/2019","TEST","Unknown","Unknown","01-884400","Unknown","DACRON CONNECTIONS 15¡"1/2 DIA. X 11¡" LONG FOR EXHAUST DAMPER","","0.0","0.0","0.0","0.0","192.0","USD","2.0","2.0","0","0","96.00000","1","","","","","07882-0047","ASDF","ASDF","02/27/2019","04/06/2021","01/01/1970","0"
This file fails on line 1, char 167, which is right after the first escape character (before the 1 in the following text: CONNECTIONS 15¡"1/2)
Any idea why this is happening?
This is the code I'm running to do the copy
copy into mydb.schema1.mytable from @mydb.schema1.mystage/file-path/2021-05-26/test.txt
file_format = mydb.schema1.myFileFormat
validation_mode = 'return_all_errors';

Short Answer
It looks like Snowflake only allows a single-byte character as the ESCAPE character in a file format. The character you're using (¡) takes two bytes in UTF-8, so the file format does not accept it as an escape character.
You can, however, use multi-byte characters for field and record delimiters, so I'm not sure why Snowflake doesn't allow them for the escape character as well.
Longer Answer
The character you're trying to use as the escape character (¡) is two bytes long in UTF-8, with a hex value of \xC2\xA1. Snowflake rejects it, as the error after the following statement shows:
CREATE OR REPLACE FILE FORMAT myFileFormat
TYPE = CSV
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = '\x22' -- Double quotes (")
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
ESCAPE = '\xC2\xA1' -- Inverted exclamation point (¡)
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N')
invalid value ['\xC2\xA1'] for parameter 'ESCAPE'
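A quick way to check how many bytes a candidate escape character occupies is to encode it outside Snowflake. The Python sketch below is only an illustration (the helper name utf8_len is made up here, not part of any Snowflake tooling). It shows that ¡ takes two bytes in UTF-8 while ~ takes one, and that the octal escape \241 from the question corresponds to the single-byte Latin-1 encoding of the same character.

```python
def utf8_len(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


inverted_bang = "\u00a1"  # the character ¡
tilde = "~"

print(inverted_bang.encode("utf-8"))    # b'\xc2\xa1'
print(utf8_len(inverted_bang))          # 2
print(utf8_len(tilde))                  # 1
print(inverted_bang.encode("latin-1"))  # b'\xa1'
print(oct(0xA1))                        # 0o241
```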
On the other hand, if I use the last visible single-byte character available, the tilde (~), with a hex value of \x7E, it works fine. You might expect the single-byte range to end at \xFF, but UTF-8 encodes only code points up to \x7F in one byte; everything above that takes two or more bytes. I tested this with a file and a copy command, and it works without issue.
CREATE OR REPLACE FILE FORMAT myFileFormat
TYPE = CSV
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = '\x22' -- Double quotes (")
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
ESCAPE = '\x7E' -- Tilde (~)
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N')
[2021-05-26 23:49:21] completed in 149 ms
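If the source file cannot be regenerated with a single-byte escape character, one possible workaround is to rewrite it before staging. The sketch below is an assumption-laden illustration, not something tested against Snowflake in this thread: it assumes ¡ only ever appears directly before a double quote, and it relies on the documented behavior that a quote inside a field enclosed by FIELD_OPTIONALLY_ENCLOSED_BY can be escaped by doubling it. The function name double_escaped_quotes is hypothetical.

```python
import csv
import io

ESCAPED_QUOTE = "\u00a1\""  # the two-character sequence ¡" used in the source file


def double_escaped_quotes(line: str) -> str:
    """Replace every ¡" with "" so the quote is escaped by doubling instead."""
    return line.replace(ESCAPED_QUOTE, '""')


sample = '"01-884400","DACRON CONNECTIONS 15\u00a1"1/2 DIA. X 11\u00a1" LONG","0.0"'
fixed = double_escaped_quotes(sample)
print(fixed)

# Sanity check with Python's csv module, which also treats "" as an escaped quote.
row = next(csv.reader(io.StringIO(fixed)))
print(row[1])  # DACRON CONNECTIONS 15"1/2 DIA. X 11" LONG
```

With the file rewritten this way, the ESCAPE option would no longer be needed for these quotes. The file has to be read and written as UTF-8 for the replacement to match.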

Related

How to create a file format of TSV in snowflake?

I have to create a TSV file format in Snowflake.
If anyone knows how, could you please share some sample code?

Using a comma is so common for delimited files, the term for any delimited file format in Snowflake is CSV. You can create a TSV file format by specifying a type of CSV and a delimiter of tab:
CREATE FILE FORMAT TSV_FILE_FORMAT TYPE = 'CSV' COMPRESSION = 'AUTO'
FIELD_DELIMITER = '\t' RECORD_DELIMITER = '\n' SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE' TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134' DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');
Your specific parameters may vary depending on the specific way the TSV handles things like escaping tab characters, etc., but this is a good start.
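To try the file format out, a small test file helps. The Python sketch below writes and re-reads a tab-delimited sample using settings that roughly mirror the format above: no enclosing quotes, and a backslash escaping tabs inside unenclosed fields. The sample rows are made up, and whether Snowflake parses the escaped tab identically should be confirmed with a real load.

```python
import csv
import io

# Made-up sample rows; the second value contains a literal tab character.
rows = [["1", "plain value"], ["2", "value\twith tab"]]

buf = io.StringIO()
writer = csv.writer(
    buf,
    delimiter="\t",
    quoting=csv.QUOTE_NONE,  # comparable to FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
    escapechar="\\",         # comparable to ESCAPE_UNENCLOSED_FIELD = '\134'
    lineterminator="\n",
)
writer.writerows(rows)
tsv_text = buf.getvalue()
print(repr(tsv_text))

# Read the text back with the same dialect settings to confirm a clean round trip.
parsed = list(
    csv.reader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
        escapechar="\\",
    )
)
print(parsed == rows)
```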

snowflake handle null while reading csv file

I am trying to load a CSV file from S3 that has a null value in a field that maps to an integer column in the Snowflake table.
I tried to use the IFNULL function but got this error:
Numeric value 'null' is not recognized.
For example when I try
select IFNULL(null,0)
I get the answer as 0.
But the same expression fails when I read the CSV file from the stage:
select $1,$2,ifnull($2,0)
from
@stage/path
(file_format => csv)
I get the 'null' not recognized error whenever $2 is null.
My csv format is as below.
create FILE FORMAT CSV
COMPRESSION = 'AUTO' FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n' SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE ESCAPE = '\134'
ESCAPE_UNENCLOSED_FIELD = '\134' DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');
Basically, I am just trying to convert null to 0, when reading from the stage.

The null string literal could be handled by setting NULL_IF:
CREATE FILE FORMAT CSV
...
NULL_IF = ('null', '\\N');
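The reason this helps is that the file contains the four-character string null, not a SQL NULL, so IFNULL never sees a NULL and the numeric conversion fails on the text. The Python sketch below is only a rough model of that idea, not Snowflake's implementation; the function names are hypothetical.

```python
from typing import Optional

# Mirrors NULL_IF = ('null', '\\N'): the second entry is a backslash followed by N.
NULL_IF = ("null", r"\N")


def apply_null_if(raw: str) -> Optional[str]:
    """Map any configured null marker to a real null (None)."""
    return None if raw in NULL_IF else raw


def to_int_or_zero(raw: str) -> int:
    """Rough equivalent of IFNULL(value, 0) applied after NULL_IF."""
    value = apply_null_if(raw)
    return 0 if value is None else int(value)


print(to_int_or_zero("null"))  # 0
print(to_int_or_zero("42"))    # 42

# Without the NULL_IF step, converting the literal text fails,
# much like the "Numeric value 'null' is not recognized" error.
try:
    int("null")
except ValueError as exc:
    print("conversion failed:", exc)
```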

I used the second option listed in the Snowflake documentation, specifying FIELD_OPTIONALLY_ENCLOSED_BY = NONE and EMPTY_FIELD_AS_NULL = FALSE, in which case I need to provide a value to be used for NULLs (NULL_IF = ('NULL')):
https://docs.snowflake.com/en/user-guide/data-unload-considerations.html
"Leave string fields unenclosed by setting the FIELD_OPTIONALLY_ENCLOSED_BY option to NONE (default), and set the EMPTY_FIELD_AS_NULL value to FALSE to unload empty strings as empty fields.
If you choose this option, make sure to specify a replacement string for NULL data using the NULL_IF option, to distinguish NULL values from empty strings in the output file. If you later choose to load data from the output files, you will specify the same NULL_IF value to identify the NULL values in the data files."
So my query looked something like the following:
COPY INTO @~/unload/table FROM (
SELECT * FROM table
)
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP'
FIELD_DELIMITER = '\u0001'
EMPTY_FIELD_AS_NULL = FALSE
FIELD_OPTIONALLY_ENCLOSED_BY = NONE
NULL_IF=('NULL'))
OVERWRITE = TRUE;
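The point of the replacement string is that NULL and the empty string stay distinguishable in the output file. The Python sketch below models that round trip for a single record using the same \u0001 delimiter and NULL token. It is a simplified illustration with hypothetical helper names: it does no escaping, and a real string value equal to NULL would collide with the token.

```python
from typing import List, Optional

DELIM = "\u0001"     # same delimiter as FIELD_DELIMITER = '\u0001'
NULL_TOKEN = "NULL"  # same token as NULL_IF = ('NULL')


def unload_line(row: List[Optional[str]]) -> str:
    """Serialize one record, writing the token for nulls and nothing for empty strings."""
    return DELIM.join(NULL_TOKEN if value is None else value for value in row)


def load_line(line: str) -> List[Optional[str]]:
    """Parse one record, turning the token back into a null."""
    return [None if field == NULL_TOKEN else field for field in line.split(DELIM)]


record = ["a", None, ""]
line = unload_line(record)
print(repr(line))                 # 'a\x01NULL\x01'
print(load_line(line) == record)  # True
```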

Snowflake Copy Into - Multiple column escape handling

I have a unique situation while loading data from a csv file into Snowflake.
I have multiple columns that need some re-work:
Columns enclosed in " that contain commas - these are handled properly
Columns that are enclosed in " but also contain " within the data, i.e. ( "\"DataValue\"")
My File Format is as such:
ALTER FILE FORMAT DB.SCHEMA.FF_CSV_TEST
SET COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
ESCAPE = NONE
ESCAPE_UNENCLOSED_FIELD = 'NONE'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
My columns enclosed in " that contain commas are being handled fine. However the remaining columns that resemble ( "\"DataValue\"") are returning errors:
Found character 'V' instead of field delimiter ','
Are there any ways to handle this?
I have attempted using a select against the stage itself:
select t.$1, t.$2, t.$3, t.$4, t.$5, TRIM(t.$6,'"')
from @STAGE_TEST/file.csv.gz t
LIMIT 1000;
with t.$5 being the column enclosed with " and containing commas
and t.$6 being the ( "\"DataValue\"")
Are there any other options than developing python (or other) code that strips out this before processing into Snowflake?

Add the backslash (\) to your ESCAPE parameter (for example, ESCAPE = '\\' instead of ESCAPE = NONE). It looks like your quote values are properly escaped, so that should take care of those quotes.
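To see what a backslash escape is expected to do with a value like "\"DataValue\"", the same rule can be tried in Python's csv module. This is only an analogy for the parsing rule, not Snowflake's parser, and the sample line is made up to resemble the data described in the question.

```python
import csv
import io

# One made-up record: a quoted field containing a comma, and a quoted field
# whose inner quotes are escaped with a backslash.
line = 'id-1,"has, comma","\\"DataValue\\""'

reader = csv.reader(
    io.StringIO(line),
    quotechar='"',      # comparable to FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    escapechar="\\",    # comparable to ESCAPE = '\\'
    doublequote=False,  # quotes are escaped with the backslash, not by doubling
)
row = next(reader)
print(row)  # ['id-1', 'has, comma', '"DataValue"']
```

With the escape character set, the inner quotes are kept as data and the field ends at the final quote, so the parser no longer runs into a stray character where it expects the delimiter.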

Snowflake - Escape character parameter not working in Copy statement

Problem Statement -
When loading data with the copy command and the escape property defined, the statement does not remove the escape character from the data.
Example:
I'm trying to load data from a CSV file. One of the columns holds data in the following format: ('EC F&G BREWER\'S INT\'L BEER'). I expect the escape character in the data (the backslash '\') to be removed from the field after the load.
Following is the copy statement I'm using -
COPY INTO SANDBOX.TEST_SBX.TEST_20200512
FROM @STAGE_NAME/TEST_FILE/
FILE_FORMAT = (FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = '\047'
TRIM_SPACE = FALSE
ESCAPE ='\134'
ESCAPE_UNENCLOSED_FIELD='\134'
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('NULL', 'null', '') )
PATTERN='.*.gz.*'
PURGE=FALSE
ON_ERROR = ABORT_STATEMENT
FORCE = FALSE
RETURN_FAILED_ONLY = FALSE
;
There are many columns in this file that have the backslash (\) in the character field, and the escape is being ignored for all of them during the load.
I cannot switch to manual column selection and use a regexp to replace the escape character; I have to use the copy command without switching to column selection mode.
I expected the escape character configured in the file format to be handled during the load itself, without a separate transformation, as it is in other data processing and loading engines.
Please suggest what I can do about this.

This is the code I have implemented
--Data File
'EC F&G ANSHUL\'s Public COMPANY'
'YB MARTHA\'S VINEYARD LOUNGE'
'EC F&G BREWER\'S INT\'L BEER'
COPY INTO test_so FROM @file_format_stage/Test_File.csv.gz
FILE_FORMAT = (FIELD_DELIMITER = ',' RECORD_DELIMITER = '\n'
SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '\047'
TRIM_SPACE = FALSE
ESCAPE ='\\'
--ESCAPE_UNENCLOSED_FIELD=NONE
NULL_IF = ('NULL', 'null', '')
SKIP_BYTE_ORDER_MARK = False)
force=true
;
And the result

I tried replicating exactly the same steps but am still not getting the same result. My understanding is that there is either a problem with the file or something wrong in the account-level parameters that is causing the problem.
CREATE OR REPLACE TABLE SANDBOX.AAGRA018_SBX.TEXT_FILE_TST
(COL1 VARCHAR(50)
);
TRUNCATE TABLE SANDBOX.AAGRA018_SBX.TEXT_FILE_TST;
COPY INTO SANDBOX.AAGRA018_SBX.TEXT_FILE_TST
FROM @~/Test_File.txt.gz
FILE_FORMAT = (FIELD_DELIMITER = ',' RECORD_DELIMITER = '\n'
SKIP_HEADER = 0 FIELD_OPTIONALLY_ENCLOSED_BY = '\047'
TRIM_SPACE = FALSE ESCAPE ='\\' --ESCAPE_UNENCLOSED_FIELD='\134'
NULL_IF = ('NULL', 'null', '')
)
force=true;
select * from SANDBOX.AAGRA018_SBX.TEXT_FILE_TST;
Data File -
YB MARTHA\'S VINEYARD LOUNGE
EC F&G BREWER\'S INT\'L BEER
This is not giving the desired result
[Snowflake UI Image copy][1]
[1]: https://i.stack.imgur.com/IBgNz.png
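Since the file itself is one suspect, it may be worth looking at its raw bytes before staging it. The Python sketch below is a hypothetical helper (preview_raw_lines is not part of any Snowflake tooling) that prints the first lines of a gzip file as bytes, which makes a byte-order mark, CRLF line endings, or a typographic apostrophe in place of ' easy to spot. The demo file is created on the fly and only mimics the data shown above.

```python
import gzip
import os
import tempfile
from typing import List


def preview_raw_lines(path: str, limit: int = 5) -> List[bytes]:
    """Return up to `limit` lines of a gzip file as raw, undecoded bytes."""
    lines = []
    with gzip.open(path, "rb") as handle:
        for line in handle:
            lines.append(line)
            if len(lines) >= limit:
                break
    return lines


# Build a throwaway demo file that mimics the sample data.
demo_path = os.path.join(tempfile.mkdtemp(), "Test_File.txt.gz")
with gzip.open(demo_path, "wb") as handle:
    handle.write(b"YB MARTHA\\'S VINEYARD LOUNGE\n")
    handle.write(b"EC F&G BREWER\\'S INT\\'L BEER\n")

for raw in preview_raw_lines(demo_path):
    # Look for b'\xef\xbb\xbf' at the start, b'\r\n' at the end, or bytes >= 0x80.
    print(raw)
```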

My CSV file with double quotes enclosed fields - numeric value ' "12131" ' not recognized

I staged a CSV file in which all fields are enclosed in double quotes (" ") and separated by commas, and rows are separated by a newline character. The values in the enclosed fields also contain newline characters (\n).
I am using the default FILE FORMAT = CSV. When using COPY INTO, I see a column mismatch error in this case.
I solved this first error by adding the FIELD_OPTIONALLY_ENCLOSED_BY attribute to the file format in the SQL below.
However, when I try to import NUMBER values from the CSV file, it doesn't work even though I already set FIELD_OPTIONALLY_ENCLOSED_BY='"'. I get a "Numeric value '"3922000"' is not recognized" error.
A sample of my .csv file looks like this:
"3922000","14733370","57256","2","3","2","2","2019-05-23
14:14:44",",00000000",",00000000",",00000000",",00000000","1000,00000000","1000,00000000","1317,50400000","1166,50000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000","","tcllVeEFPD"
My COPY INTO statement is below:
COPY INTO '..'
FROM '...'
FILE_FORMAT = (TYPE = CSV
STRIP_NULL_VALUES = TRUE
FIELD_DELIMITER = ','
SKIP_HEADER = 1
error_on_column_count_mismatch=false
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
ON_ERROR = "ABORT_STATEMENT";
I get the feeling that the NUMBER value is being interpreted as a STRING.
Does anyone have a solution for this?

Try using a subquery in the FROM clause of the COPY command in which each column is listed and the appropriate columns are cast.
For example:
COPY INTO '...'
FROM (
SELECT $1::INTEGER,
$2::FLOAT
...
)
FILE_FORMAT = (TYPE = CSV
STRIP_NULL_VALUES = TRUE
FIELD_DELIMITER = ','
SKIP_HEADER = 1
error_on_column_count_mismatch=false
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
ON_ERROR = "ABORT_STATEMENT";
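One detail in the sample data is worth checking before relying on a cast: values such as ",00000000" and "1317,50400000" use a comma as the decimal separator. Whether a plain ::FLOAT cast accepts them is not confirmed in this thread, so they may need a replace step first. The Python sketch below only illustrates the parsing involved, using a shortened, made-up version of the sample row; parse_decimal_comma is a hypothetical helper.

```python
import csv
import io
from decimal import Decimal


def parse_decimal_comma(text: str) -> Decimal:
    """Convert a number written with a decimal comma, e.g. '1317,50400000'."""
    return Decimal(text.replace(",", "."))


# Shortened sample row: quoted fields, one of which contains a newline.
sample = '"3922000","2019-05-23\n14:14:44",",00000000","1317,50400000"\n'

row = next(csv.reader(io.StringIO(sample)))
print(row)

print(int(row[0]))                  # 3922000
print(parse_decimal_comma(row[2]))  # zero with eight decimal places
print(parse_decimal_comma(row[3]))  # 1317.50400000
```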