Output UTF-8 Encoded File - file

I use Lua within REAPER, an audio recording/mixing software with scripting capabilities.
I can already write an ANSI file from a script, using the standard I/O model.
How can I write a UTF-8 encoded file?
I don't see anything about that in the documentation.
Thanks for your help !

If you have access to a Linux system, you could try the iconv command, which performs this kind of conversion.
The task can be done by writing:
iconv -f WINDOWS-1252 -t UTF-8 ANSI_file > UTF-8_file
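If iconv is not at hand, the same re-encoding can be sketched in a few lines of Python. This is only an illustration of the idea, not REAPER's Lua API; the function name convert_to_utf8 and the file names are hypothetical, and the source encoding is assumed to be Windows-1252, as in the iconv command above.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode a text file as UTF-8.

    source_encoding is an assumption supplied by the caller;
    nothing here detects the real encoding of the input.
    """
    raw = Path(src).read_bytes()          # bytes exactly as stored on disk
    text = raw.decode(source_encoding)    # bytes -> characters
    Path(dst).write_bytes(text.encode("utf-8"))  # characters -> UTF-8 bytes


if __name__ == "__main__":
    import os
    import tempfile

    folder = tempfile.mkdtemp()
    ansi_file = os.path.join(folder, "ANSI_file")
    utf8_file = os.path.join(folder, "UTF-8_file")

    # 0xE9 is "é" in Windows-1252.
    Path(ansi_file).write_bytes(b"m\xe9mo")
    convert_to_utf8(ansi_file, utf8_file)
    print(Path(utf8_file).read_bytes())
```

The point carried over to any language, Lua included, is that a "UTF-8 file" is simply a file whose bytes are the UTF-8 encoding of the text.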

Related

How to copy files with special characters in their names with TCL's exec?

I'm trying to upload files containing special characters on our platform via the exec command but the characters are always interpreted and it fails.
For example if I try to upload a mémo.txt file I get the following error:
/bin/cp: cannot create regular file `/path/to/dir/m\351mo.txt': No such file or directory
UTF-8 is correctly configured on the system, and if I run the command in the shell it works fine.
Here is the TCL code:
exec /bin/cp $tmp_filename $dest_path
How can I make it work?
The core of the problem is what encoding is being used to communicate with the operating system. For exec and filenames, that encoding is whatever is returned by the encoding system command (Tcl has a pretty good guess at what the correct value for that is when the Tcl library starts up, but very occasionally gets it wrong). On my computer, that command returns utf-8 which says (correctly!) that strings passed to (and received from) the OS are UTF-8.
You should be able to use the file copy command instead of doing exec /bin/cp, which will be helpful here as it has fewer layers of trickiness (it avoids going through an external program, which can impose its own problems). We'll assume that that's being done:
set tmp_filename "foobar.txt"; # <<< fill in the right value, of course
set dest_path "/path/to/dir/mémo.txt"
file copy $tmp_filename $dest_path
If that fails, we need to work out why. The most likely problems relate to the encoding though, and can go wrong in multiple ways that interact horribly. Alas, the details matter. In particular, the encoding for a path depends on the actual filesystem (it's formally a parameter when the filesystem is created) and can vary on Unix between parts of a path when you have a mount within another mount.
If the worst comes to the worst, you can put Tcl into ISO 8859-1 mode and then do all the encoding yourself (as ISO 8859-1 is the “just use the bytes I tell you” encoding); encoding convertto is also useful in this case. Be aware that this can generate filenames that cause trouble for other programs, but it's at least able to let you get at it.
encoding system iso8859-1
file copy $tmp_filename [encoding convertto utf-8 $dest_path]
Care might be needed to convert different parts of the path correctly in this case: you're taking full responsibility for what's going on.
If you're on Windows, please just let Tcl handle the details. Tcl uses the Wide (Unicode) Windows API directly so you can pretend that none of these problems exist. (There are other problems instead.)
On macOS, please leave encoding system alone as it is correct. Macs have a very opinionated approach to encodings.
I already tried the file copy command but it says:
error copying "/tmp/file7k5kqg" to "/path/to/dir/mémo.txt": no such file or directory
My reading of your problem is that, for some reason, your Tcl is set to iso8859-1 ([encoding system]), while the executing environment (shell) is set to utf-8. This explains why Donal's suggestion works for you:
encoding system iso8859-1
file copy $tmp_filename [encoding convertto utf-8 $dest_path]
This will safely pass a utf-8 encoded byte array down to any syscall: é or \xc3\xa9 or \u00e9. Watch:
% binary encode hex [encoding convertto utf-8 é]
c3a9
% encoding system iso8859-1; exec xxd << [encoding convertto utf-8 é]
00000000: c3a9 ..
This is equivalent to [encoding system] also being set to utf-8 (as to be expected in an otherwise utf-8 environment):
% encoding system
utf-8
% exec xxd << é
00000000: c3a9 ..
What you are experiencing (without any intervention) seems to be a re-coding of the Tcl internal encoding to iso8859-1 on the way out from Tcl (because of [encoding system], as Donal describes), and a follow-up (and faulty) re-coding of this iso8859-1 value into the utf-8 environment.
Watch the difference (\xe9 vs. \xc3\xa9):
% encoding system iso8859-1
% encoding system
iso8859-1
% exec xxd << é
00000000: e9
The problem then seems to be that \xe9 gets interpreted in your otherwise utf-8 environment, like so:
$ locale
LANG="de_AT.UTF-8"
...
$ echo -ne '\xe9'
?
$ touch `echo -ne 'm\xe9mo.txt'`
touch: m?mo.txt: Illegal byte sequence
$ touch mémo.txt
$ ls mémo.txt
mémo.txt
$ cp `echo -ne 'm\xe9mo.txt'` b.txt
cp: m?mo.txt: No such file or directory
But:
$ cp `echo -ne 'm\xc3\xa9mo.txt'` b.txt
$ ls b.txt
b.txt
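The byte-level difference shown above can also be checked outside Tcl. The following is a minimal Python sketch, not part of the original answer; the helper is_valid_utf8 is a hypothetical name introduced for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "mémo.txt", written with an escape so this source stays plain ASCII.
name = "m\u00e9mo.txt"

latin1_name = name.encode("iso-8859-1")  # what iso8859-1 sends to the OS
utf8_name = name.encode("utf-8")         # what a utf-8 environment expects

print(latin1_name, is_valid_utf8(latin1_name))
print(utf8_name, is_valid_utf8(utf8_name))

# 0xE9 in octal is 351, the \351 that appears in the /bin/cp error message.
print(oct(0xE9))
```

The lone 0xE9 byte is not a valid UTF-8 sequence, which fits the "Illegal byte sequence" error above.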
Your options:
(1) You need to find out why Tcl picks up iso8859-1, to begin with. How did you obtain your installation? Self-compiled? What are the details (version)?
(2) You may proceed as Donal suggests, or alternatively, set encoding system utf-8 explicitly.
encoding system utf-8
file copy $tmp_filename $dest_path

Creating and writing to a UTF-8 file with C++

I have to create an m3u8 playlist in my HTTP streaming C++ server code. An m3u8 file is nothing but a UTF-8 encoded m3u file.
How do I create a UTF-8 file (i.e. how do I write UTF-8 characters to a file)? Maybe with the open() function or some other function in C++ on Linux?
int fd = open("myplaylist.m3u8", O_WRONLY | O_APPEND);
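The way you are planning to do the open() is fine, provided the file already exists; to create it as well, add O_CREAT to the flags and pass a mode as the third argument.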
What determines whether a file is utf-8 is the characters you write to it. Provided you encode the relevant characters as utf-8, everything will work as expected.
If you plan on converting from a given encoding (say ISO-8859-1) to utf-8, a good way to do it is libiconv, which lets you do exactly that.
Whether a file "is" utf-8 depends on the content. As long as you write() the correct byte sequences in there you will be fine.
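To illustrate that the file's content is what makes it UTF-8, here is a minimal sketch in Python rather than C++. The function name write_playlist and the entry names are hypothetical; the only format detail assumed is the #EXTM3U header line used by extended m3u playlists.

```python
def write_playlist(path, entries):
    """Write a playlist whose bytes are the UTF-8 encoding of the text."""
    # encoding="utf-8" makes every write() emit UTF-8 byte sequences;
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as playlist:
        playlist.write("#EXTM3U\n")
        for entry in entries:
            playlist.write(entry + "\n")


if __name__ == "__main__":
    import os
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "myplaylist.m3u8")
    write_playlist(path, ["m\u00e9mo_000.ts", "m\u00e9mo_001.ts"])
    with open(path, "rb") as playlist:
        print(playlist.read())
```

In C or C++ the equivalent is to pass the already-encoded UTF-8 bytes to write(); no special open() flag marks a file as UTF-8.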

What does ISO-8859 in `file` mean?

I ran the following command in a software repository I have access to:
find . -not -name ".svn" -type f -exec file "{}" \;
and saw many output lines like
./File.java: ISO-8859 C++ program text
What does that mean? ISO-8859 is a family of encodings, not one specific encoding. I expected all files to be UTF-8, but most are reported with the encoding shown above. Is ISO-8859 a proper subset of UTF-8, too?
Is it possible for me to convert all those files safely by using ISO-8859-1 as source encoding while translating it into UTF-8 with iconv for example?
I am afraid that the Unix file program is rather bad at this. The result just means the file is in some byte encoding. It does not mean that it is ISO-8859-1. It might even be in a non-ISO byte encoding, although file usually figures that out.
I have a system that does much better than file, but it is trained on an English-language corpus, so it might not do as well on German.
The short answer is that the result of file is not reliable. You have to know the real encoding to up-convert it.
The charset detection used by file is rather simplistic. It recognizes UTF-8, and it distinguishes between "ISO-8859" and "non-ISO extended-ASCII" by looking for bytes in the 0x80-0x9F range, where the ISO 8859 encodings have "holes". But it makes no attempt to determine which ISO 8859 encoding is in use, which is why it just says ISO-8859 instead of ISO-8859-1 or ISO-8859-15.
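The heuristic described above can be approximated in a few lines of Python. This is a simplified sketch of the idea only, not the actual implementation inside file; the function name guess_charset_family and its return strings are assumptions chosen to mirror the labels quoted above.

```python
def guess_charset_family(data: bytes) -> str:
    """Rough classification in the spirit of the heuristic described above."""
    if all(byte < 0x80 for byte in data):
        return "ASCII"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # The ISO 8859 encodings leave 0x80-0x9F without printable characters,
    # so any byte in that range suggests some other 8-bit encoding.
    if any(0x80 <= byte <= 0x9F for byte in data):
        return "non-ISO extended-ASCII"
    return "ISO-8859"


print(guess_charset_family(b"plain text"))
print(guess_charset_family("caf\u00e9".encode("utf-8")))
print(guess_charset_family(b"caf\xe9"))
print(guess_charset_family(b"\x93quoted\x94"))
```

Note that the last two answers name only a family: the bytes alone do not say whether 0xE9 was meant as ISO-8859-1, ISO-8859-15, or something else, so the real source encoding still has to be known before converting.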

unable to send binary data over unix tcp socket

I'm trying to implement ftp commands GET and PUT thru a UNIX socket for file transfer using usual functions like fread(), fwrite(), send() and recv().
It works fine for text files, but fails for binary files (diff says: "binary files differ")
Any suggestions regarding the following will be appreciated:
Are there any specific commands to read and write binary data?
Can diff be used to compare binary files?
Is it possible to send the binary parts in chunks of memory?
The FTP protocol has two modes of operation: text and binary.
Try it in any FTP client; I believe the commands for switching between them are ASCII and BIN. From what I recall, though, text mode only affects CR/LF pairs.
If you're reading from a file and then writing the file's data to the socket, make sure you open the file in binary mode.
Yes, diff can be used to compare binary files, typically with the -q option to suppress the actual printing of differences, which rarely makes sense for binary files. You can also use md5 or cmp if you have them.
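As an illustration of reading in binary mode and working in chunks, here is a minimal Python sketch that compares two files byte for byte, roughly what cmp reports. The function name files_identical and the chunk size are arbitrary choices for this example.

```python
def files_identical(path_a, path_b, chunk_size=65536):
    """Compare two files chunk by chunk without any text translation."""
    # "rb" opens the files in binary mode, so no newline or charset
    # conversion is applied to the data being read.
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


if __name__ == "__main__":
    import os
    import tempfile

    folder = tempfile.mkdtemp()
    sent = os.path.join(folder, "sent.bin")
    received = os.path.join(folder, "received.bin")
    for path in (sent, received):
        with open(path, "wb") as handle:
            handle.write(bytes(range(256)) * 4)
    print(files_identical(sent, received))
```

The same chunked loop is the usual shape for sending a file over a socket: read a block of bytes, send exactly the number of bytes read, and repeat until the read returns nothing.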

What character set is this?

I received a bunch of CSV files from a client (that appear to be a database dump), and many of the columns have weird characters like this:
Alain Lefèvre
Angèle Dubeau & La Pietà
That seems like an awful lot of characters to represent an é. Does anyone know what encoding would produce that many characters for é? I have no idea where they're getting these CSV files from, but assuming I can't get them in a better format, how would I convert them to something like UTF-8?
It seems like it's a double-re-misdecoded UTF-8. It may be possible to recover the data by opening it as utf-8, saving it as Latin-1 (perhaps), and opening it as UTF-8 again.
It looks like it's been through a corruption process where the data was written as utf-8 but read in as cp1252, and this happened three times. This might be recoverable (I don't know if it will work for every character, but at least for some) by putting the corrupted data through the reverse transformation - read in as utf8, write out as cp1252, repeat. There are plenty of ways of doing that kind of conversion - using a text editor as Tordek suggests, using commandline tools as below, or using the encoding features built in to your database or programming language.
unix shell prompt> echo Alain Lefèvre |
iconv -f utf-8 -t cp1252 |
iconv -f utf-8 -t cp1252 |
iconv -f utf-8 -t cp1252
Alain Lefèvre
unix shell prompt>
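The corruption and its reversal can be reproduced in Python to see why three passes are needed. This is a sketch under the assumption stated in the answer above (UTF-8 bytes misread as cp1252, three times); the function names mangle and unmangle are hypothetical.

```python
def mangle(text, rounds=3):
    """Simulate the damage: UTF-8 bytes misread as cp1252, repeatedly."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("cp1252")
    return text


def unmangle(text, rounds=3):
    """Reverse it: write out as cp1252, read back as UTF-8, repeatedly."""
    for _ in range(rounds):
        text = text.encode("cp1252").decode("utf-8")
    return text


original = "Alain Lef\u00e8vre"
damaged = mangle(original)

print(len(original), len(damaged))    # the damaged string is much longer
print(unmangle(damaged) == original)  # the round trip restores the name
```

Python's cp1252 codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so a string whose damage passes through one of those bytes raises an error instead of round-tripping, which fits the caveat that the recovery may not work for every character.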
That seems like an awful lot of characters to represent an é.
Remember, character ≠ byte. What you're seeing in the output is characters; you'll need to do something unusual to actually see the bytes. (I suggest ‘xxd’, a tool that is installed with the Vim application; or ‘od’, one of the core utilities of the GNU operating system.)
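The distinction between characters and bytes can be made concrete with a short Python sketch; the helper name describe is hypothetical and simply reports both counts plus a hex dump, similar in spirit to what xxd or od would show.

```python
def describe(text, encoding="utf-8"):
    """Return (character count, byte count, hex dump) for a string."""
    data = text.encode(encoding)
    return len(text), len(data), data.hex()


# One character, but a different number of bytes depending on the encoding.
print(describe("\u00e9"))                # é encoded as UTF-8
print(describe("\u00e9", "iso-8859-1"))  # é encoded as ISO-8859-1
```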
Does anyone know what encoding would produce that
One tool that is good at guessing the character encoding of a byte stream is ‘enca’, the Extremely Naive Charset Analyser.
