I received a bunch of CSV files from a client (that appear to be a database dump), and many of the columns have weird characters like this:
Alain Lefèvre
Angèle Dubeau & La PietÃÂÂ
That seems like an awful lot of characters to represent an é. Does anyone know what encoding would produce that many characters for é? I have no idea where they're getting these CSV files from, but assuming I can't get them in a better format, how would I convert them to something like UTF-8?
It looks like UTF-8 that has been mis-decoded and re-encoded more than once. It may be possible to recover the data by opening it as UTF-8, saving it as Latin-1 (perhaps), and opening it as UTF-8 again.
It looks like it's been through a corruption process where the data was written as UTF-8 but read back in as cp1252, and this happened three times. This might be recoverable (I don't know if it will work for every character, but at least for some) by putting the corrupted data through the reverse transformation: read it in as UTF-8, write it out as cp1252, repeat. There are plenty of ways of doing that kind of conversion: using a text editor as Tordek suggests, using command-line tools as below, or using the encoding features built into your database or programming language.
unix shell prompt> echo Alain Lefèvre |
iconv -f utf-8 -t cp1252 |
iconv -f utf-8 -t cp1252 |
iconv -f utf-8 -t cp1252
Alain Lefèvre
unix shell prompt>
That seems like an awful lot of characters to represent an é.
Remember, character ≠ byte. What you're seeing in the output is characters; you'll need to do something unusual to actually see the bytes. (I suggest ‘xxd’, a tool that is installed with the Vim application; or ‘od’, one of the core utilities of the GNU operating system.)
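For example, you can dump the bytes behind a single character like this (a sketch run in a UTF-8 terminal; substitute whatever string you are investigating for the é):
$ printf 'é' | xxd
00000000: c3a9  ..
$ printf 'é' | od -An -tx1
 c3 a9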
Does anyone know what encoding would produce that
One tool that is good at guessing the character encoding of a byte stream is ‘enca’, the Extremely Naive Charset Analyser.
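For instance, something like this (a sketch: suspect.csv is a placeholder file name, and -L none asks enca for a language-independent guess; check enca's man page, since the available options may differ on your system):
$ enca -L none suspect.csv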
I'm trying to upload files containing special characters to our platform via the exec command, but the special characters are always mangled and the copy fails.
For example if I try to upload a mémo.txt file I get the following error:
/bin/cp: cannot create regular file `/path/to/dir/m\351mo.txt': No such file or directory
UTF-8 is correctly configured on the system, and if I run the command in the shell directly it works fine.
Here is the TCL code:
exec /bin/cp $tmp_filename $dest_path
How can I make it work?
The core of the problem is what encoding is being used to communicate with the operating system. For exec and filenames, that encoding is whatever is returned by the encoding system command (Tcl makes a pretty good guess at the correct value when the Tcl library starts up, but very occasionally gets it wrong). On my computer, that command returns utf-8, which says (correctly!) that strings passed to (and received from) the OS are UTF-8.
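A quick way to compare what Tcl thinks with what your shell thinks is something like this (a sketch; it assumes tclsh is on your PATH, and the values shown are simply what a correctly configured UTF-8 system would report):
$ echo 'puts [encoding system]' | tclsh
utf-8
$ locale
LANG="de_AT.UTF-8"
...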
You should be able to use the file copy command instead of doing exec /bin/cp, which will be helpful here as it has fewer layers of trickiness (it avoids going through an external program, which can impose its own problems). We'll assume that that's being done:
set tmp_filename "foobar.txt"; # <<< fill in the right value, of course
set dest_path "/path/to/dir/mémo.txt"
file copy $tmp_filename $dest_path
If that fails, we need to work out why. The most likely problems relate to the encoding though, and can go wrong in multiple ways that interact horribly. Alas, the details matter. In particular, the encoding for a path depends on the actual filesystem (it's formally a parameter when the filesystem is created) and can vary on Unix between parts of a path when you have a mount within another mount.
If the worst comes to the worst, you can put Tcl into ISO 8859-1 mode and then do all the encoding yourself (as ISO 8859-1 is the “just use the bytes I tell you” encoding); encoding convertto is also useful in this case. Be aware that this can generate filenames that cause trouble for other programs, but it's at least able to let you get at it.
encoding system iso8859-1
file copy $tmp_filename [encoding convertto utf-8 $dest_path]
Care might be needed to convert different parts of the path correctly in this case: you're taking full responsibility for what's going on.
If you're on Windows, please just let Tcl handle the details. Tcl uses the Wide (Unicode) Windows API directly so you can pretend that none of these problems exist. (There are other problems instead.)
On macOS, please leave encoding system alone as it is correct. Macs have a very opinionated approach to encodings.
I already tried the file copy command but it says error copying
"/tmp/file7k5kqg" to "/path/to/dir/mémo.txt": no such file or
directory
My reading of your problem is that, for some reason, your Tcl is set to iso8859-1 ([encoding system]), while the executing environment (shell) is set to utf-8. This explains why Donal's suggestion works for you:
encoding system iso8859-1
file copy $tmp_filename [encoding convertto utf-8 $dest_path]
This will safely pass a UTF-8 encoded bytearray down to any syscall: é or \xc3\xa9 or \u00e9. Watch:
% binary encode hex [encoding convertto utf-8 é]
c3a9
% encoding system iso8859-1; exec xxd << [encoding convertto utf-8 é]
00000000: c3a9 ..
This is equivalent to [encoding system] also being set to utf-8 (as to be expected in an otherwise utf-8 environment):
% encoding system
utf-8
% exec xxd << é
00000000: c3a9 ..
What you are experiencing (without any intervention) seems to be a re-coding of the Tcl internal encoding to iso8859-1 on the way out from Tcl (because of [encoding system], as Donal describes), and a follow-up (and faulty) re-coding of this iso8859-1 value into the utf-8 environment.
Watch the difference (\xe9 vs. \xc3\xa9):
% encoding system iso8859-1
% encoding system
iso8859-1
% exec xxd << é
00000000: e9
The problem, it then seems, is that \xe9 has to be interpreted in your otherwise UTF-8 environment, like:
$ locale
LANG="de_AT.UTF-8"
...
$ echo -ne '\xe9'
?
$ touch `echo -ne 'm\xe9mo.txt'`
touch: m?mo.txt: Illegal byte sequence
$ touch mémo.txt
$ ls mémo.txt
mémo.txt
$ cp `echo -ne 'm\xe9mo.txt'` b.txt
cp: m?mo.txt: No such file or directory
But:
$ cp `echo -ne 'm\xc3\xa9mo.txt'` b.txt
$ ls b.txt
b.txt
Your options:
(1) You need to find out why Tcl picks up iso8859-1, to begin with. How did you obtain your installation? Self-compiled? What are the details (version)?
(2) You may proceed as Donal suggests, or alternatively, set encoding system utf-8 explicitly.
encoding system utf-8
file copy $tmp_filename $dest_path
I use Lua within REAPER, an audio recording/mixing application with scripting capabilities.
I have managed to write an ANSI-encoded file from a script, using the standard I/O model.
How can I write/output a UTF-8 encoded file?
I don't see anything about that in the documentation.
Thanks for your help!
If you have access to a Linux system, you can try the iconv command, which allows you to perform that kind of conversion.
This task can easily be done by writing:
iconv -f WINDOWS-1252 -t UTF-8 ANSI_file > UTF-8_file
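You can then double-check the result with file (a sketch; UTF-8_file is the output file from the command above, and the exact MIME type reported may vary):
$ file -bi UTF-8_file
text/plain; charset=utf-8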
I have a file file.dat which contains: CNBC: America¿s Gun: The Rise of the AR–15
Unfortunately I got some special characters which didn't get converted properly by the iconv function on Unix.
$ file -bi file.dat
text/plain; charset=utf-8
$ cat file.dat | cut -c14 | od -x
0000000 bfc2 000a
0000003
Can you please help me convert the special character?
Thanks in advance
-Praveen
Your file is basically fine, it's in proper UTF-8 and the character you are looking at is an INVERTED QUESTION MARK (U+00BF) (though you seem to be using some legacy 8-bit character set to view the file, and the output of od -x is word-oriented little-endian, so you get the hex backwards -- the sequence is 0xC2 0xBF, not the other way around).
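You can reproduce the word-swapping and see the bytes in their real order by asking od for one byte per column with -tx1 (a sketch run in a UTF-8 terminal):
$ printf '¿\n' | od -x
0000000 bfc2 000a
0000003
$ printf '¿\n' | od -An -tx1
 c2 bf 0a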
This article explains that when Oracle tries to export to an unknown character set, it will replace characters it cannot convert with upside-down question marks. So I guess that's what happened here. The only proper fix is to go back to your Oracle database and export in a proper format where curly apostrophes are representable (which I imagine the character really should be).
If the file came from somebody else's Oracle database, ask them to do the export again, or ask them what the character should be, or ignore the problem, or guess what character to put there, and use your editor. If there are just a few problem characters, just do it manually. If there are lots, maybe you can use context-sensitive substitution rules like
it¿s => it’s
dog¿s => dog’s
¿problem¿ => ‘‘problem’’
na¿ve => naïve
¿yri¿ispy¿rykk¿ => äyriäispyörykkä (obviously!)
The use of ¿ as a placeholder for "I don't know" is problematic, but Unicode actually has a solution: the REPLACEMENT CHARACTER (U+FFFD). I guess you're not going to like this, but the only valid (context-free) replacement you can perform programmatically is s/\u{00BF}/\u{FFFD}/g (this is Perl-ish pseudocode, but use whatever you like).
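Spelled out as an actual Perl one-liner, that could look something like this (a sketch; -CSD tells Perl to treat the standard streams and default I/O layers as UTF-8, and fixed.dat is just a placeholder output name):
$ perl -CSD -pe 's/\x{00BF}/\x{FFFD}/g' file.dat > fixed.dat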
I ran the following command in a software repository I have access to:
find . -not -name ".svn" -type f -exec file "{}" \;
and saw many output lines like
./File.java: ISO-8859 C++ program text
What does that mean? ISO-8859 is a family of encodings, not one specific encoding. I expected all files to be UTF-8, but most are reported with the encoding shown. Is ISO-8859 a proper subset of UTF-8, too?
Is it possible for me to convert all those files safely, using ISO-8859-1 as the source encoding while translating them into UTF-8 with iconv, for example?
I am afraid that the Unix file program is rather bad at this. It just means that the file is in some byte encoding. It does not mean that it is ISO-8859-1. It might even be in a non-ISO byte encoding, although it usually figures that out.
I have a system that does much better than file, but it is trained on an English-language corpus, so it might not do as well on German.
The short answer is that the result of file is not reliable. You have to know the real encoding to up-convert it.
The charset detection used by file is rather simplistic. It recognizes UTF-8. And it distinguishes between "ISO-8859" and "non-ISO extended-ASCII" by looking for bytes in the 0x80-0x9F range, where the ISO 8859 encodings have "holes". But it makes no attempt to determine which ISO 8859 encoding is in use, which is why it just says ISO-8859 instead of ISO-8859-1 or ISO-8859-15.
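If you do establish by other means that the files really are in a single ISO 8859 variant (say ISO-8859-1), a bulk conversion could look roughly like this (a sketch: the *.java pattern and the in-place overwrite are assumptions, and any file that is already UTF-8 would be double-encoded by it, so inspect a few files first and work on a copy of the tree):
$ find . -name '*.java' -type f -exec sh -c 'iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.new" && mv "$1.new" "$1"' _ {} \;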
I am interested in writing utf-8 encoded strings to a file.
I did this with the low-level functions open() and write().
First, I set the locale to a UTF-8 aware character set with
setlocale(LC_ALL, "de_DE.utf8").
But the resulting file does not contain utf-8 characters, only iso8859 encoded umlauts. What am I doing wrong?
Addendum: I don't know if my strings are really utf-8 encoded in the first place. I just keep them in the source file in this form: char *msg = "Rote Grütze";
See this screenshot for the content of the text file: http://img19.imageshack.us/img19/9791/picture1jh9.png
Changing the locale won't change the actual data written to the file using write(). You have to actually produce UTF-8 encoded bytes to write them to a file. For that purpose you can use libraries such as ICU.
Edit after your edit of the question: UTF-8 characters are only different from ISO-8859 in the "special" symbols (ümlauts, áccénts, etc.). So, for all the text that doesn't have any of this symbols, both are equivalent. However, if you include in your program strings with those symbols, you have to make sure your text editor treats the data as UTF-8. Sometimes you just have to tell it to.
To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.
Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv:
iconv -f latin1 -t utf8 file.c
This will convert all your Latin-1 strings to UTF-8, and when you print them they will definitely be in UTF-8. If iconv complains about a strange character, or you see strange characters in the output, then your strings were in UTF-8 already.
Regards,
Yes, you can do it with glibc. They call it multibyte instead of UTF-8, because it can handle more than one encoding type. Check out this part of the manual.
Look for functions that start with the prefix mb, and also functions with the wc prefix, for converting from multibyte to wide char. You'll have to set the locale first with setlocale() to a UTF-8 locale so that it chooses this implementation of multibyte support.
If you are coming from a Unicode file, I believe the function you are looking for is wcstombs().
Can you open up the file in a hex editor and verify, with a simple input example, that the written bytes are not the values of the Unicode characters you passed to write()? Sometimes there is no way for a text editor to determine the character set, and your text editor may have assumed an ISO-8859-1 character set.
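For instance, for the "Rote Grütze" string from the question, a hex dump of the written file makes the difference obvious (a sketch; output.txt stands for whatever file your program wrote). If the ü shows up as the UTF-8 pair c3 bc, the data is UTF-8; if it shows up as the single byte fc, it went out as ISO-8859-1:
$ xxd output.txt
00000000: 526f 7465 2047 72c3 bc74 7a65            Rote Gr..tze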
Once you have done this, could you edit your original post to add the pertinent information?