Linux & C-Programming: How can I write utf-8 encoded text to a file? - c

I am interested in writing utf-8 encoded strings to a file.
I did this with low level functions open() and write().
In the first place I set the locale to a utf-8 aware character set with
setlocale("LC_ALL", "de_DE.utf8").
But the resulting file does not contain utf-8 characters, only iso8859 encoded umlauts. What am I doing wrong?
Addendum: I don't know if my strings are really utf-8 encoded in the first place. I just keep them in the source file in this form: char *msg = "Rote Grütze";
See screenshot for content of the textfile:
alt text http://img19.imageshack.us/img19/9791/picture1jh9.png

Changing the locale won't change the actual data written to the file using write(). You have to actually produce UTF-8 characters to write them to a file. For that purpose you can use libraries as ICU.
Edit after your edit of the question: UTF-8 characters are only different from ISO-8859 in the "special" symbols (ümlauts, áccénts, etc.). So, for all the text that doesn't have any of this symbols, both are equivalent. However, if you include in your program strings with those symbols, you have to make sure your text editor treats the data as UTF-8. Sometimes you just have to tell it to.
To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.
Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv:
iconv -f latin1 -t utf8 file.c
This will convert all your latin-1 strings to utf8, and when you print them they will be definitely in UTF-8. If iconv encounters a strange character, or you see the output strings with strange characters, then your strings were in UTF-8 already.
Regards,

Yes, you can do it with glibc. They call it multibyte instead of UTF-8, because it can handle more than one encoding type. Check out this part of the manual.
Look for functions that start with the prefix mb, and also function with wc prefix, for converting from multibyte to wide char. You'll have to set the locale first with setlocale() to UTF-8 so it chooses this implementation of multibyte support.
If you are coming from an Unicode file I believe the function you looking for is wcstombs().

Can you open up the file in a hex editor and verify, with a simple input example, that the written bytes are not the values of Unicode characters that you passed to write(). Sometimes, there is no way for a text editor to determine character set and your text editor may have assumed an ISO8859-1 character set.
Once you have done this, could you edit your original post to add the pertinent information?

Related

Is it possible to prevent adding BOM to output UTF-8 file? (Visual Studio 2005)

I need some help.
I'm writing a program that opens 2 source files in UTF-8 encoding without BOM. The first contains English text and some other information, including ID. The second contains only string ID and translation. The program changes every string from the first file by replacing English chars to Russian translation from the second one and writes these strings to output file. Everything seems to be ok, but there is BOM appears in destination file. And i want to create file without BOM, like source.
I open files with fopen function in text mode with ccs=UTF-8
read string with fgetws function to wchar_t buffer
and write with fputws function to output file
Don't use text mode, don't use the MS ccs= extension to fopen, and don't use fputws. Instead use fopen in binary mode and write the correct UTF-8 yourself.

How to output foreign characters in console?

How can I print foreign characters on the screen using C?
Here's my code, which doesn't work:
#include <stdio.h>
#include <locale.h>
int main(){
setlocale(LC_ALL,"Turkish");
printf("İ ş ğ ü ö ı");
system("pause");
return 0;
}
On my Windows there is no such characters in the 'Terminal' font. I think you can't print them.
But I suggest you to check this font yourself. Maybe you have a different version of it.
If you're using a narrow charset then you need to make sure that the terminal/console is using the same charset and the source code file is encoded in the correct encoding, otherwise of course the system will misinterpret the character codes
To set the charset in the console run chcp. For example to use code page Windows-1254 run chcp 1254. You can use SetConsoleOutputCP to set the code page programmatically, like SetConsoleOutputCP(1254)
However you should avoid the legacy ANSI code pages and use Unicode instead. The current preferred way on Windows is to output Unicode characters as wide char with wprintf. You may need to set the mode to wide first with
int result = _setmode(_fileno(stdout), _O_U16TEXT);
then
wprintf(L"İ ş ğ ü ö ı");
See also wprintf manual in Windows, Linux or Mac. However on POSIX systems UTF-8 is preferred
On older Windows UTF-8 support on console is not very good, but it's increasingly getting better, and Windows 10 even supports UTF-8 as a locale so you can just call SetConsoleOutputCP(CP_UTF8); or SetConsoleOutputCP(65001); (or run chcp 65001 in the console) and it'll work immediately, provided that you saved the source code as UTF-8. Remember to also set the font to the one that supports those characters like Lucida Console or Consolas. The default raster font contains very a limited number of characters and appears with a lot of aliasing. It also doesn't work well on modern hidpi displays
There are already lots of questions about outputting Unicode on this site like Output unicode strings in Windows console app or UTF-8 character in .NET Console Application. Please have a look and try to see which one fits you.
Edit
When you use
wchar_t c=L'ğ';
fputwc(c,ptr);
you're printing to a file and not the console. In that case just the stream of bytes is saved into the file. When you open the file again, it's the job of the editor to treat the bytes in the correct charset and print it correctly. For example the character "ğ" is stored as c4 9f in UTF-8 and when open the file as UTF-8, the editor knows that it represents the char "ğ" to display
Unfortunately there's no character encoding information embedded in a text file so the editor must choose one. Remember There Ain't No Such Thing As Plain Text (must read). A simple editor may just choose to open the file as ANSI in the current Windows codepage and the characters won't be displayed correctly if the original encoding is not that one and you'll just see garbage
Some more advanced editors like Notepad++ or MS Word will try to guess the encoding of the file. But as with any guessing, it can be wrong and the result is again a file with garbage
The simplest solution is to add a BOM to the beginning of the file so the editor can recognize the encoding easily. If your files doesn't contain a BOM you need to tell the editor to read the file in the correct encoding if the encoding is wrong (for wchar_t on Windows like that it's UTF-16LE). For example in Notepad++ it's this menu
Unfortunately the OP didn't edit the question to show what was tried, there's nothing more I can explain
Your code works: http://ideone.com/K9hrv5
setlocale(LC_ALL,"Turkish");
printf("İ ş ğ ü ö ı");
The only issue is that you have to set your terminal locale as well before executing your c program's output binary.
Setting your terminal locale works:
setlocale only affects the runtime locale. It doesn't make your compiler support extra source file characters.
You may need to specify the non-ASCII characters in your source file by using character constants (e.g. \xF1 for the character with code 241).

Parsing rtf file without encoded special character

I have an rtf file which contains special character as it is without decode to hexadecimal or unicode. I am using lib rtf in c language.
Below is the content from the rtf file. It has not mentioned encoded mode.
\b0\fi-380\li1800 · Going to home
While parsing i will get some junk value for the character. How to get proper character from the rtf file if it contains any character in specula mode only.
Regards,
Try this:
\b0\fi-380\li1800 \u183? Going to home
How to do it:
https://stackoverflow.com/a/20117501/1543816

Do binary files have encoding? Confused

Suppose I write the following C program and save it in a text file called Hello.c
#include<stdio.h>
int main()
{
printf("Hello there");
return 0;
}
The Hello.c file will probably get saved in a UTF8 encoded format.
Now, I compile this file to create a binary file called Hello
Now, this binary file should in some way store the text "Hello there". The question is what encoding is used to store this text?
As far as I'm aware, vanilla C doesn't have any concept of encoding, although if you correctly keep track of multi-byte characters, you can probably use an encoding. By default, ASCII is used to map characters to single-byte characters.
You are correct about the string "Hello there" being stored in the executable itself. The string literal is put into global memory and replaced with a pointer in the call to printf, so you can see the string literal in the data segment of the binary.
If you have access to a hex editor, try compiling your program and opening the binary in the editor. Here is a screenshot from when I did this. You can see that each character of the string literal is represented by a single byte, followed by a 0 (NULL). This is ASCII.

Creating and writing to an UTF-8 file with C++

I have to create a m3u8 playlist in my http streaming C++ server code. m3u8 is nothing but UTF-8 m3u file.
How do I create an UTF-8 file (i.e. how to write UTF-8 characters to a file)? Maybe with the open() function or some other function in C++ on Linux?
int fd = open("myplaylist.m3u8", O_WRONLY | O_APPEND);
The way you are planning to do the open() is fine.
What will determine if a file is in utf-8, is the characters you're going to write to it. Provided you encode the relevant characters as utf-8, everything will work as expected.
If you plan on converting a given encoding (say ISO-8859-1) to utf-8, a good way to achieve it is to use libiconv which allows to do exactly that.
Whether a file "is" utf-8 depends on the content. As long as you write() the correct byte sequences in there you will be fine.

Resources