How to Print UTF-16 Characters in C? - c

i have a file containing UTF-16 characters. i read in the file and can store the characters either in a uint16_t array or a char array (any better choice?)
But how do i print those characters?

I'm assuming you want to print to stdout or stderr. One method would be to use libiconv to convert from UTF-16 to UTF-32 (also known as UCS-4) into a wide-character string (wchar_t). You could then use wprintf and friends to print to the standard streams.

Related

Will gcc functions in string.h break UTF-8 string?

I don't know the following cases in GCC, who can help me?
Whether a valid UTF-8 character (except code point 0) still contains zero byte? If so, I think function such as strlen will break that UTF-8 character.
Whether a valid UTF-8 character contains a byte whose value is equal to '\n'? If so, I think function such as "gets" will break that UTF-8 character.
Whether a valid UTF-8 character contains a byte whose value is equal to ' ' or '\t'? If so, I think function such as scanf("%s%s") will break that UTF-8 character and be interpreted as two or more words.
The answer to all your questions are the same: No.
It's one of the advantages of UTF-8: all ASCII bytes do not occur when encoding non-ASCII code points into UTF-8.
For example, you can safely use strlen on a UTF-8 string, only that its result is the number of bytes instead of UTF-8 code points.

C scan unicode character from string

I've wchar_t type string which includes unicode characters like "ş, ç, ü,.."
I need to take this character one by one from string but I can't read them with sscanf. I couldn't found alternative function. So what should I do?

What's the use of universal characters on POSIX system?

In C one can pass unicode characters to printf() like this:
printf("some unicode char: %c\n", "\u00B1");
But the problem is that on POSIX compliant systems `char' is always 8 bits and most of UTF-8 character such as the above are wider and don't fit into char and as the result nothing is printed on the terminal. I can do this to achieve this effect however:
printf("some unicode char: %s\n", "\u00B1");
%s placeholder is expanded automatically and a unicode character is printed on the terminal. Also, in a standard it says:
If the hexadecimal value for a universal character name is less than
0x20 or in the range 0x7F-0x9F (inclusive), or if the universal
character name designates a character in the basic source character
set, then the program is illformed.
When I do this:
printf("letter a: %c\n", "\u0061");
gcc says:
error: \u0061 is not a valid universal character
So this technique is also unusable for printing ASCII characters. In this article on Wikipedia http://en.wikipedia.org/wiki/Character_(computing)#cite_ref-3 it says:
A char in the C programming language is a data type with the size of
exactly one byte, which in turn is defined to be large enough to
contain any member of the basic execution character set and UTF-8 code
units.
But is this doable on POSIX systems?
Use of universal characters in byte-based strings is dependent on the compile-time and run-time character encodings matching, so it's generally not a good idea except in certain situations. However they work very well in wide string and wide character literals: printf("%ls", L"\u00B1"); or printf("%lc", L'\00B1'); will print U+00B1 in the correct encoding for your locale.

Trouble comparing UTF-8 characters using wchar.h

I am in the process of making a small program that reads a file, that contains UTF-8 elements, char by char. After reading a char it compares it with a few other characters and if there is a match it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;
fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");
while (!feof(fpi)) {
fread(&sample, sizeof(wchar_t*), 1, fpi);
if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0) ) {
fwrite(L"_", sizeof(wchar_t*), 1, fpo);
} else {
fwrite(&sample, sizeof(wchar_t*), 1, fpo);
}
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words γειά σου κόσμε. I would want it to return this:
γει_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome.
I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.
C has two different kinds of characters: multibyte characters and wide characters.
Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.
Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code-points, so must be (at least) 4 bytes long (technically, it must be at least 21-bits long).
So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).
Unfortunately, there is more to Unicode than this.
ά can be represented as a single Unicode codepoint, or as two separate codepoints:
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 char's.
U+03B1 GREEK SMALL LETTER ALPHA U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 char's.
U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 char's.
All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).
First of all, please do take the time to read this great article, which explains UTF8 vs Unicode and lots of other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html
What you are trying to do in your code is read in unicode character by character, and do comparisons with those. That's won't work if the input stream is UTF8, and it's not really possible to do with quite this structure.
In short: Fully unicode strings can be encoded in several ways. One of them is using a series of equally-sized "wide" chars, one for each character. That is what the wchar_t type (sometimes WCHAR) is for. Another way is UTF8, which uses a variable number of raw bytes to encode each character, depending on the value of the character.
UTF8 is just a stream of bytes, which can encode a unicode string, and is commonly used in files. It is not the same as a string of WCHARs, which are the more common in-memory representation. You can't poke through a UTF8 stream reliably, and do character replacements within it directly. You'll need to read the whole thing in and decode it, and then loop through the WCHARs that result to do your comparisons and replacement, and then map that result back to UTF8 to write to the output file.
On Win32, use MultiByteToWideChar to do the decoding, and you can use the corresponding WideCharToMultiByte to go back.
When you use a "string literal" with regular quotes, you're creating a nul-terminated ASCII string (char*), which does not support Unicode. The L"string literal" with the L prefix will create a nul-terminated string of WCHARs (wchar_t *), which you can use in string or character comparisons. The L prefix also works with single-quote character literals, like so: L'ε'
As a commenter noted, when you use fread/fwrite, you should be using sizeof(wchar_t) and not its pointer type, since the amount you are trying to read/write is an actual wchar, not the size of a pointer to one. This advice is just code feedback independent of the above-- you don't want to be reading the input character by character anyways.
Note too that when you do string comparisons (wcscmp), you should use actual wide strings (which are terminated with a nul wide char)-- not use single characters in memory as input. If (when) you want to do character-to-character comparisons, you don't even need to use the string functions. Since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {}.

printf UTF8 characters with printf from Hexadecimal ints

Kind of trivial thing but ...
I want to print japanese characters using plain C from Hexadecimals
From this table, I know that, the first char in the table, あ's Entity is &# 12353 and its Hex Entity is x3041, etc.
But how do I use this two numbers in order to get printed all characters in the command line?
If your terminal is set to UTF-8 and locale is set correctly, you may write:
char s[]="あ";
you can also try
char s[]={0xe3,0x81,0x82,0x0}
(the last is the Unicode UTF-8 encoding for "あ"), and then just printf("%s",s);
If __STDC_ISO_10646__ is defined, wchar_t is in Unicode, and you can do something like:
printf("%lc", (wchar_t)0x3041);

Resources