Is there a query in Snowflake to identify the characters in a file that are invalid utf8 [duplicate] - snowflake-cloud-data-platform

This question already has answers here:
How to find rows with non utf8 characters in Snowflake?
(2 answers)
Closed 2 years ago.
I have a file that when loading into Snowflake gets an error for invalid UTF-8 characters, I have managed to load it into a table using another encoding, by creating a file format with option ENCODING = 'iso-8859-1' but I would like to find a way to get those characters queried.
I've tried TO_BINARY(col,'UTF-8') function hoping it will fail on the col that has invalid UTF-8 but was not able to get a valid result to capture those characters, has anyone faced the same issue?

Please note that ALL character data within Snowflake is encoded using UTF-8. There is no other option. A while back, this was not strictly true, and it was possible to have character data in Snowflake that was NOT valid UTF-8. But that should not be possible now.
Specifying the ENCODING = 'iso-8859-1' option instructed Snowflake (during the COPY INTO operation) to perform character set translation on the file (which was then interpreted as being encoded in ISO-8859-1), mapping all characters into their UTF-8 equivalent as it was written into Snowflake. As a result, all data in Snowflake is UTF-8 encoded, and therefore there should not be ANY non-UTF-8 characters to discover. That said, the result of character set translation might not end up translating to the correct/expected UTF-8 characters if the underlying (source) file was not truly encoded with the encoding that you specified during the COPY INTO (in this case, ISO-8859-1).
Given this, what is the ultimate problem that you are trying to resolve here? Did you load a source file with ENCODING = 'iso-8859-1' that was not actually ISO-8859-1? Or are you saying that the source file WAS truly encoded as ISO-8859-1, and yet somehow the resulting characters in Snowflake are either (1) incorrect or (2) invalid UTF-8? Or are you trying to determine the actual encoding of a source file (ignoring the whole ISO-8859-1 aspect altogether)?

Detailed answer found here How to find rows with non utf8 characters in Snowflake?
Should mark my question as duplicate and refer to the link, please.

Related

(Text-) File that supports encoding

The project I'm working on takes xml files and input streams and converts them to pdf's and text. In the unit tests I compare this generated text with a .txt file that has the expected output.
I'm now facing the issue of these .txt files not being encoded in UTF-8 and been written without persisting this information (namely umlauts).
I have read few articles on the topic of persisting and encoding .txt files. Including correcting the encoding, saving and opening files in Visual Studio with encoding, and some more.
I was wondering if there is a text file format that supports meta information about encoding like xml or html for example does.
I'm looking for a solution that is:
Easy adaptable to any coworker on the same team
It being persitant and not depending on me choosing an encoding in an editor
Does not require any additional exotic program
Can be read without or only little modification of the File class and it's input reading of C#
Does at least support UTF-8 encoding
A Unicode Byte Order Mark (BOM) is sometimes used for this purpose. Systems that process Unicode are required to strip off this metadata when passing on the text. File.ReadAllText etc do this. A BOM should exist only at the beginning of files and streams.
A BOM is sometimes conflated with encoding because both affect the file format and BOM applies only to Unicode encodings. In Visual Studio, with UTF-8, it's called "Unicode (UTF-8 with signature) - Codepage 65001".
Some C# code that demonstrates these concepts:
var path = Path.GetTempFileName() + ".txt";
File.WriteAllText(path, "Test", new UTF8Encoding(true, true));
Debug.Assert(File.ReadAllBytes(path).Length == 7);
Debug.Assert(File.ReadAllText(path).Length == 4); // slightly mushy encoding detection
However, this doesn't get anyone past the agreement required when using text files. The fundamental rule is that a text file must be read with the same encoding it was written with. A BOM is not a communication that suffices as a complete agreement for text files in general.
Test editors almost universally adopt the principle that they should guess a file's character encoding first, and—for the most part—allow users to correct them later. Some IDEs with project systems allow recording which encoding a file actually uses.
A reasonable text editor would preserve both the encoding and the presence of a Unicode BOM for existing files.
It seems that you're after a universal strategy. Unfortunately, the history of the concept of a text file doesn't allow one.

Strange Characters in database text: Ã, Ã, ¢, â‚ €,

I'm not certain when this first occured.
I have a new drop-shipping affiliate website, and receive an exported copy of the product catalog from the wholesaler. I format and import this into Prestashop 1.4.4.
The front end of the website contains combinations of strange characters inside product text: Ã, Ã, ¢, â‚ etc. They appear in place of common characters like , - : etc.
These characters are present in about 40% of the database tables, not just product specific tables like ps_product_lang.
Another website thread says this same problem occurs when the database connection string uses an incorrect character encoding type.
In /config/setting.inc, there is no character encoding string mentioned, just the MySQL Engine, which is set to InnoDB, which matches what I see in PHPMyAdmin.
I exported ps_product_lang, replaced all instances of these characters with correct characters, saved the CSV file in UTF-8 format, and reimported them using PHPMyAdmin, specifying UTF-8 as the language.
However, after doing a new search in PHPMyAdmin, I now have about 10 times as many instances of these bad characters in ps_product_lang than I started with.
If the problem is as simple as specifying the correct language attribute in the database connection string, where/how do I set this, and what to?
Incidently, I tried running this command in PHPMyAdmin mentioned in this thread, but the problem remains:
SET NAMES utf8
UPDATE: PHPMyAdmin says:
MySQL charset: UTF-8 Unicode (utf8)
This is the same character set I used in the last import file, which caused more character corruptions. UTF-8 was specified as the charset of the import file during the import process.
UPDATE2
Here is a sample:
people are truly living untetheredâ€ïâ€Â
Ã‚ï† buying and renting movies online, downloading software, and
sharing and storing files on the web.
UPDATE3
I ran an SQL command in PHPMyAdmin to display the character sets:
character_set_client utf8
character_set_connection utf8
character_set_database latin1
character_set_filesystem binary
character_set_results utf8
character_set_server latin1
character_set_system utf8
So, perhaps my database needs to be converted (or deleted and recreated) to UTF-8. Could this pose a problem if the MySQL server is latin1?
Can MySQL handle the translation of serving content as UTF8 but storing it as latin1? I don't think it can, as UTF8 is a superset of latin1. My web hosting support has not replied in 48 hours. Might be too hard for them.
If the charset of the tables is the same as it's content try to use mysql_set_charset('UTF8', $link_identifier). Note that MySQL uses UTF8 to specify the UTF-8 encoding instead of UTF-8 which is more common.
Check my other answer on a similar question too.
This is surely an encoding problem. You have a different encoding in your database and in your website and this fact is the cause of the problem. Also if you ran that command you have to change the records that are already in your tables to convert those character in UTF-8.
Update: Based on your last comment, the core of the problem is that you have a database and a data source (the CSV file) which use different encoding. Hence you can convert your database in UTF-8 or, at least, when you get the data that are in the CSV, you have to convert them from UTF-8 to latin1.
You can do the convertion following this articles:
Convert latin1 to UTF8
http://wordpress.org/support/topic/convert-latin1-to-utf-8
This appears to be a UTF-8 encoding issue that may have been caused by a double-UTF8-encoding of the database file contents.
This situation could happen due to factors such as the character set that was or was not selected (for instance when a database backup file was created) and the file format and encoding database file was saved with.
I have seen these strange UTF-8 characters in the following scenario (the description may not be entirely accurate as I no longer have access to the database in question):
As I recall, there the database and tables had a "uft8_general_ci" collation.
Backup is made of the database.
Backup file is opened on Windows in UNIX file format and with ANSI encoding.
Database is restored on a new MySQL server by copy-pasting the contents from the database backup file into phpMyAdmin.
Looking into the file contents:
Opening the SQL backup file in a text editor shows that the SQL backup file has strange characters such as "sÃ¥". On a side note, you may get different results if opening the same file in another editor. I use TextPad here but opening the same file in SublimeText said "så" because SublimeText correctly UTF8-encoded the file -- still, this is a bit confusing when you start trying to fix the issue in PHP because you don't see the right data in SublimeText at first. Anyways, that can be resolved by taking note of which encoding your text editor is using when presenting the file contents.
The strange characters are double-encoded UTF-8 characters, so in my case the first "Ã" part equals "Ã" and "Â¥" = "¥" (this is my first "encoding"). THe "Ã¥" characters equals the UTF-8 character for "å" (this is my second encoding).
So, the issue is that "false" (UTF8-encoded twice) utf-8 needs to be converted back into "correct" utf-8 (only UTF8-encoded once).
Trying to fix this in PHP turns out to be a bit challenging:
utf8_decode() is not able to process the characters.
// Fails silently (as in - nothing is output)
$str = "så";
$str = utf8_decode($str);
printf("\n%s", $str);
$str = utf8_decode($str);
printf("\n%s", $str);
iconv() fails with "Notice: iconv(): Detected an illegal character in input string".
echo iconv("UTF-8", "ISO-8859-1", "så");
Another fine and possible solution fails silently too in this scenario
$str = "så";
echo html_entity_decode(htmlentities($str, ENT_QUOTES, 'UTF-8'), ENT_QUOTES , 'ISO-8859-15');
mb_convert_encoding() silently: #
$str = "så";
echo mb_convert_encoding($str, 'ISO-8859-15', 'UTF-8');
// (No output)
Trying to fix the encoding in MySQL by converting the MySQL database characterset and collation to UTF-8 was unsuccessfully:
ALTER DATABASE myDatabase CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE myTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
I see a couple of ways to resolve this issue.
The first is to make a backup with correct encoding (the encoding needs to match the actual database and table encoding). You can verify the encoding by simply opening the resulting SQL file in a text editor.
The other is to replace double-UTF8-encoded characters with single-UTF8-encoded characters. This can be done manually in a text editor. To assist in this process, you can manually pick incorrect characters from Try UTF-8 Encoding Debugging Chart (it may be a matter of replacing 5-10 errors).
Finally, a script can assist in the process:
$str = "så";
// The two arrays can also be generated by double-encoding values in the first array and single-encoding values in the second array.
$str = str_replace(["Ã","Â¥"], ["Ã","¥"], $str);
$str = utf8_decode($str);
echo $str;
// Output: "så" (correct)
I encountered today quite a similar problem : mysqldump dumped my utf-8 base encoding utf-8 diacritic characters as two latin1 characters, although the file itself is regular utf8.
For example : "é" was encoded as two characters "é". These two characters correspond to the utf8 two bytes encoding of the letter but it should be interpreted as a single character.
To solve the problem and correctly import the database on another server, I had to convert the file using the ftfy (stands for "Fixes Text For You). (https://github.com/LuminosoInsight/python-ftfy) python library. The library does exactly what I expect : transform bad encoded utf-8 to correctly encoded utf-8.
For example : This latin1 combination "é" is turned into an "é".
ftfy comes with a command line script but it transforms the file so it can not be imported back into mysql.
I wrote a python3 script to do the trick :
#!/usr/bin/python3
# coding: utf-8
import ftfy
# Set input_file
input_file = open('mysql.utf8.bad.dump', 'r', encoding="utf-8")
# Set output file
output_file = open ('mysql.utf8.good.dump', 'w')
# Create fixed output stream
stream = ftfy.fix_file(
input_file,
encoding=None,
fix_entities='auto',
remove_terminal_escapes=False,
fix_encoding=True,
fix_latin_ligatures=False,
fix_character_width=False,
uncurl_quotes=False,
fix_line_breaks=False,
fix_surrogates=False,
remove_control_chars=False,
remove_bom=False,
normalization='NFC'
)
# Save stream to output file
stream_iterator = iter(stream)
while stream_iterator:
try:
line = next(stream_iterator)
output_file.write(line)
except StopIteration:
break
Apply these two things.
You need to set the character set of your database to be utf8.
You need to call the mysql_set_charset('utf8') in the file where you made the connection with the database and right after the selection of database like mysql_select_db use the mysql_set_charset. That will allow you to add and retrieve data properly in whatever the language.
The error usually gets introduced while creation of CSV. Try using Linux for saving the CSV as a TextCSV. Libre Office in Ubuntu can enforce the encoding to be UTF-8, worked for me.
I wasted a lot of time trying this on Mac OS. Linux is the key. I've tested on Ubuntu.
Good Luck

How to bake UTF-8 files with CakePHP's console?

For some reason, every file that I bake with CakePHP's console is regarded as ISO-8859-1 encoded by my IDE Dreamweaver. This works fine up to the point where I end up typing a special character, which will be wrongly displayed by the browser, since its encoding (by the editor) differs from the overall rendering.
How can I force the console to produce UTF-8 files, with a BOM if necessary?
I've already tried converting the template files that are used to bake the standard scaffolding pages, but with no luck.
I have the same problem - baked files are NOT UTF-8 but ASCII. (use notepad++ editor which allows easily convert, save files in another format).
Once bake generates files I have to convert them to UTF-8 one by one, to be able to work with Polish local characters.
I tried changing template files to UTF-8 but somehow this does not help. This may have something to do with the fact that the default file does not contain any non ascii character, therefore even if saved as UTF they stay ASCII.
The simplest way I found to overcome this is to modify template file eg.
cake\console\templates\default\classes\model.ctp
to include utf-8 character somewhere, e.g.:
//'message' => 'Your custom message here ł',
(notice last non ASCII character at the end of line.
then converting and saving as UTF-8 makes sure template file is utf-8.
now, model files are generated as UTF-8.
The baked files are UTF-8, or rather, they only contain basic ASCII characters which are identical to the basic UTF-8 range, so can be regarded as either. It's Dreamweaver's problem, not a problem with bake. Check the Dreamweaver settings (or code in a decent editor ;-P).
You do not want to include a BOM, it'll screw you over later.
Use the Bake_UTF8 plugin =]
http://www.github.com/pedroelsner/bake_utf8
I hope this be helpfull.
Pedro Elsner
Another way to achieve this is to open PHP files that are producing UTF-8 content (without BOM) and then saving them in format UTF-8 with BOM using Notepad++ (Encoding->Encode in UTF-8)
In my case I had Excel CSV file:
/patients/exportFirstReport/atskaite1-25-10-2013.csv
Then I had to convert encoding of PHP files down the stack:
\index.php
\app\Controller\PatientsController.php
\app\View\Patients\csv\export_first_report.ctp
\app\View\Layouts\csv\default.ctp
After conversion of encoding of these files it produces readable UTF8 excel files

Easy way to inspect BCP .dat file?

I'm getting BCP error "Unexpected EOF encountered in BCP data-file" during import, which is probably misleading. I strongly suspect that some field has been added to the table or there's some offending character is in the file.
How would I go about inspecting the contents of .dat file visually?
Are there any good hex viewers where I can quickly try to adjust row length to see the data in tabular manner?
Other suggestions are also appreciated.
I guess it depends on your input format. Is it binary input? if so, it's gonna be hard. I use visual studio to open a file in the binary viewer but it's far from easy. The usual suspects are CRLF's in a text field or text that contains your field delimiter or EOL character.

Linux & C-Programming: How can I write utf-8 encoded text to a file?

I am interested in writing utf-8 encoded strings to a file.
I did this with low level functions open() and write().
In the first place I set the locale to a utf-8 aware character set with
setlocale("LC_ALL", "de_DE.utf8").
But the resulting file does not contain utf-8 characters, only iso8859 encoded umlauts. What am I doing wrong?
Addendum: I don't know if my strings are really utf-8 encoded in the first place. I just keep them in the source file in this form: char *msg = "Rote Grütze";
See screenshot for content of the textfile:
alt text http://img19.imageshack.us/img19/9791/picture1jh9.png
Changing the locale won't change the actual data written to the file using write(). You have to actually produce UTF-8 characters to write them to a file. For that purpose you can use libraries as ICU.
Edit after your edit of the question: UTF-8 characters are only different from ISO-8859 in the "special" symbols (ümlauts, áccénts, etc.). So, for all the text that doesn't have any of this symbols, both are equivalent. However, if you include in your program strings with those symbols, you have to make sure your text editor treats the data as UTF-8. Sometimes you just have to tell it to.
To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.
Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv:
iconv -f latin1 -t utf8 file.c
This will convert all your latin-1 strings to utf8, and when you print them they will be definitely in UTF-8. If iconv encounters a strange character, or you see the output strings with strange characters, then your strings were in UTF-8 already.
Regards,
Yes, you can do it with glibc. They call it multibyte instead of UTF-8, because it can handle more than one encoding type. Check out this part of the manual.
Look for functions that start with the prefix mb, and also function with wc prefix, for converting from multibyte to wide char. You'll have to set the locale first with setlocale() to UTF-8 so it chooses this implementation of multibyte support.
If you are coming from an Unicode file I believe the function you looking for is wcstombs().
Can you open up the file in a hex editor and verify, with a simple input example, that the written bytes are not the values of Unicode characters that you passed to write(). Sometimes, there is no way for a text editor to determine character set and your text editor may have assumed an ISO8859-1 character set.
Once you have done this, could you edit your original post to add the pertinent information?

Resources