I have a CSV file that is being ingested into Snowflake daily. All of a sudden, Snowflake decided to complain about an invalid UTF-8 character (which is þ).
I checked the target table, and the file was ingested many times before with no issues. Did something change on the Snowflake side? Is there a better way to solve this other than using REPLACE_INVALID_CHARACTERS = TRUE, which will convert the character into a �?
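One common cause worth ruling out: the file may not actually be UTF-8. þ (U+00FE) is the single byte 0xFE in ISO-8859-1/Windows-1252, and that byte on its own is invalid in UTF-8. If the upstream export switched to Latin-1, declaring the real encoding lets Snowflake transcode instead of mangling the character. A sketch, assuming the file really is Latin-1 (the format and table names here are hypothetical; check the exact encoding label against Snowflake's supported-encodings list):

```sql
-- Declare the actual source encoding instead of replacing bad bytes.
-- Snowflake transcodes ISO-8859-1 (which includes þ as 0xFE) to UTF-8 on load.
CREATE OR REPLACE FILE FORMAT my_csv_latin1   -- hypothetical name
  TYPE = 'CSV'
  ENCODING = 'ISO-8859-1';

COPY INTO my_table
  FROM @my_stage/daily.csv
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_latin1');
```

With this, þ survives intact rather than becoming �, since the bytes are decoded under the correct charset rather than discarded as invalid UTF-8.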
I've got a Copy Data Activity in Data Factory which takes a table from a SQL Server Managed Instance and puts it into a Snowflake Instance.
The activity uses a temporary staging BLOB account.
When debugging the pipeline it's failing.
The error comes up as "Found character 't' instead of record delimiter '\r\n'".
It looks like it's caused by escape characters, but there are no options available to deal with escape characters on a temporary stage.
I think I could fix this with two activities: one moving Managed Instance to BLOB and one moving BLOB to Snowflake, but I would prefer to handle it with just one if possible.
I have tried to add to the user properties;
{
"name": "escapeQuoteEscaping",
"value": "true"
}
Is there anything else I could add in here?
Thanks,
Dan
It's the file format where you specify the details of the file being ingested, not the stage.
There are many options including the specification of delimiters and special characters within the data. The message
Found character 't' instead of record delimiter
suggests that you may have a tab-delimited file, so you could set \t as the delimiter in the file format.
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
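For example, a file format declaring tab as the field delimiter might look like this (the format name is illustrative):

```sql
-- Illustrative: a CSV-type file format that treats tab as the field delimiter.
-- Any of the delimiter/escape options from the linked docs can be set here.
CREATE OR REPLACE FILE FORMAT my_tsv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '\t'
  RECORD_DELIMITER = '\r\n'
  SKIP_HEADER = 1;
```

The COPY (or the Data Factory sink) then references this named format rather than relying on the stage's defaults.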
I unloaded a table from Redshift to S3. The table is 212 columns wide. Some fields in some rows contain Arabic text.
Here's the Redshift UNLOAD command I used:
unload ('select * from dataw.testing')
to 's3://uarchive-live/rpt_all/rpt_all.txt'
iam_role 'arn:aws:iam::12345678988:role/service-role'
GZIP
DELIMITER '\t'
null as ''
;
When I attempt to COPY this file into Snowflake an error occurs.
End of record reached while expected to parse column '"RPT_ALL"["AUTO_TRAF_RETR_CNT":211]' File 'rpt_all_250/rpt_all.txt0000_part_113.gz', line 9684, character 1187 Row 9684, column "RPT_ALL"["AUTO_TRAF_RETR_CNT":211]
The field name referenced in the error is not the last field in the record; there are two more after it.
I removed the Arabic text from the fields and left them blank, then attempted the COPY again, and this time it copied with no errors.
Here's the Snowflake File Format I'm using:
CREATE FILE FORMAT IF NOT EXISTS "DEV"."PUBLIC"."ff_noheader" TYPE = 'CSV' RECORD_DELIMITER = '\n' FIELD_DELIMITER = '\t' SKIP_HEADER = 0 COMPRESSION = 'GZIP' TIMESTAMP_FORMAT = 'AUTO' TRIM_SPACE = TRUE REPLACE_INVALID_CHARACTERS = TRUE;
Here's the Snowflake Copy command I'm using:
COPY INTO "DEV"."PUBLIC"."RPT_ALL" FROM @"stg_All"/snowflk_test.csv FILE_FORMAT="DEV"."PUBLIC"."ff_noheader";
What do I need to configure in Snowflake to accept this Arabic text so that the end of record is not corrupted?
Thanks
I'm not a Snowflake expert, but I have used it and I have debugged a lot of issues like this.
My initial thought as to why you are getting an unexpected end-of-record (EOR), which is \n, is that your data contains \n. If your data has \n in it, then this will look like an EOR when the data is read. I don't believe there is a way to change the EOR in the Redshift UNLOAD command, so you need the ESCAPE option in the Redshift UNLOAD command to add a backslash before characters like \n. You will also need to tell Snowflake what the escape character is: ESCAPE = '\' (I think you need a double backslash in this statement). [There's a chance you may need to quote your fields as well, but you will know that when you hit any issues hidden by this one.]
The other way would be to use a different unload format that doesn't suffer from overloaded character meanings.
There's a chance that the issue is in character encodings related to your Arabic text, but I expect not, since both Redshift and Snowflake are UTF-8 based systems. Possible but not likely.
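As a sketch, the change needs to be made on both sides: ESCAPE on the Redshift UNLOAD, and a matching ESCAPE on the Snowflake file format (reusing the names from the question; verify the exact option syntax against each product's docs):

```sql
-- Redshift side: ESCAPE prefixes embedded newlines, tabs, backslashes,
-- and the delimiter with a backslash so they don't break record parsing.
UNLOAD ('select * from dataw.testing')
TO 's3://uarchive-live/rpt_all/rpt_all.txt'
IAM_ROLE 'arn:aws:iam::12345678988:role/service-role'
GZIP
DELIMITER '\t'
ESCAPE
NULL AS '';

-- Snowflake side: tell the file format what the escape character is
-- (the backslash itself must be doubled in the SQL literal).
ALTER FILE FORMAT "DEV"."PUBLIC"."ff_noheader"
  SET ESCAPE = '\\';
```

With both in place, an embedded \n in the Arabic text fields arrives as an escaped literal and no longer terminates the record early.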
I have a file dump which needs to be imported into SQL Server on a daily basis, and I have created a scheduled task to do this unattended. All CSV files are delimited by ',' and use Windows CR/LF line endings, encoded with UTF-8.
To import data from these CSV files, I mainly use OpenRowset. It works well until I ran into a file containing the value "S7". If the file contains the value "S7", that column gets recognized as a numeric datatype during the OpenRowset import, which causes the other alphabetic values in the column to fail to import, leaving only NULL values.
Here's what I have tried so far:
Using IMEX=1: openrowset('Microsoft.ACE.OLEDB.15.0','text;IMEX=1;HDR=Yes;
Using text driver: OpenRowset('MSDASQL','Driver=Microsoft Access Text Driver (*.txt, *.csv);
Using Bulk Insert with or without a format file.
The interesting part is that if I use Bulk Insert, it gives me a warning of an unexpected end of file. To solve this, I have tried various row terminator indicators like '0x0a', '\n', '\r\n', or none at all, but they all failed. Finally I managed to import some of the records using a row terminator of ',\n'. However, the original file contains around 1000 records and only 100 were imported, without any errors or warnings.
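For reference, the BULK INSERT variant described above, written out with an explicit terminator and code page (the table and file names are placeholders; CODEPAGE = '65001' for UTF-8 requires a reasonably recent SQL Server, roughly 2014 SP2 or later):

```sql
-- Placeholder names; 0x0d0a matches the Windows CR/LF endings described above,
-- and CODEPAGE '65001' tells SQL Server the file bytes are UTF-8.
BULK INSERT dbo.MyStagingTable
FROM 'C:\data\daily_dump.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '0x0d0a',
    CODEPAGE        = '65001',
    FIRSTROW        = 2          -- skip the header row
);
```

If a terminator like ',\n' imports only a fraction of the rows, that usually means the real terminators are mixed or the last line lacks one, which fits the "unexpected end of file" warning.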
Any tips or helps would be much appreciated.
Edit 1:
The file ends with a newline character, which I can tell from Notepad++. I managed to import files that gave me the "unexpected end of file" error by removing the last record from those files. However, even with this method I still cannot import all records; only some of them can be imported.
I want to read data from text file (.csv), truncate one of the column to 1000 characters and push into SQL table using SSIS Package.
The input (DT_TEXT) is of length 11,000 characters but my Challenge is ...
SSIS can convert to (DT_STR) only if Max length is 8,000 characters.
String operations cannot be performed on Stream (DT_TEXT data type)
Got a workaround/solution now:
I truncated the text in the Flat File Source and selected the option to ignore the error.
Please share if you find a better solution!
FYI:
To help anyone else that finds this, I applied a similar concept more generally in a data flow when consuming a text stream [DT_TEXT] in a Derived Column Transformation task to transform it to [DT_WSTR] type to my defined length. This more easily calls out the conversion taking place.
Expression: (DT_WSTR,1000)(DT_STR,1000,1252)myLargeTextColumn
Data Type: Unicode string [DT_WSTR]
Length: 1000
*I used 1252 codepage since my DT_TEXT is UTF-8 encoded.
For this Derived Column, I also set the TruncationRowDisposition to RD_IgnoreFailure in the Advanced Editor (or it can be done in the Configure Error Output, setting Truncation to "Ignore failure").
(I'd post images but apparently I need to boost my rep)
I am using PostgreSQL 9.0 and am trying to store a bytea file which contains certain special characters (regional language characters - UTF8 encoded). But I am not able to store the data as input by the user.
For example :
what I get in request while debugging:
<sp_first_name_gu name="sp_first_name_gu" value="ઍયેઍ"></sp_first_name_gu><sp_first_name name="sp_first_name" value="aaa"></sp_first_name>
This is what is stored in DB:
<sp_first_name_gu name="sp_first_name_gu" value="\340\252\215\340\252\257\340\253\207\340\252\215"></sp_first_name_gu><sp_first_name name="sp_first_name" value="aaa"></sp_first_name>
Note the difference in value tag. With this issue I am not able to retrieve the proper text input by the user.
Please suggest what I need to do.
PS: My DB is UTF8 encoded.
The value is stored correctly, but is escaped into octal escape sequences upon retrieval.
To fix that, change the settings of the DB driver or choose a different encoding/escaping for bytea.
Or just use a proper field type for the XML data, like varchar or xml.
Your string \340\252\215\340\252\257\340\253\207\340\252\215 is exactly ઍયેઍ in octal encoding, so Postgres stores your data correctly. PostgreSQL escapes all non-printable characters; for more details see the PostgreSQL documentation, especially section 8.4.2 on bytea escape format.
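You can verify this for yourself: the octal escapes are just bytea's textual output format, and the stored bytes decode back to the original UTF-8 string. A minimal sketch (the table and column names are hypothetical):

```sql
-- Decode the escaped bytea back to text: the bytes are the UTF-8 encoding of ઍયેઍ.
SELECT convert_from(
    E'\\340\\252\\215\\340\\252\\257\\340\\253\\207\\340\\252\\215'::bytea,
    'UTF8');

-- For character data like this, a text column sidesteps the escaping entirely
-- (hypothetical table/column):
-- ALTER TABLE my_table ALTER COLUMN sp_first_name_gu TYPE text
--   USING convert_from(sp_first_name_gu, 'UTF8');
```

In other words, nothing is lost in storage; the driver simply needs to decode bytea output, or the column type should match the character data it holds.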