Using sprintf with unicode characters - c

I wanted to print out depictions of playing cards using Unicode.
Code snippet:
void printCard(int card) {
    char strCard[10];
    sprintf(strCard, "\U0001F0A%x", (card % 13) + 1);
    printf("%s\n", strCard);
}
Since \U requires 8 hex digits after it, I get the following error when compiling:
error: incomplete universal character name \U0001F0A
I could create a bunch of if/else statements and print out the card that way but I was hoping for a way that wouldn't make me explicitly write out every card's Unicode encoding.

Universal character names (like \U0001F0A1) are resolved by the compiler. If you use one in a format string, printf will see the UTF-8 representation of the character; it has no idea how to handle backslash escapes. (The same is true of \n and \x2C; those are single characters resolved by the compiler.) So you certainly cannot compute the UCN at runtime.
The most readable solution would be to use an array of strings to hold the 13 different card symbols.
That will avoid hard-wiring knowledge about Unicode and UTF-8 encoding into the program. If you knew that the active locale was a UTF-8 locale, you could compute the codepoints as a wchar_t and then use the wide-character-to-multibyte standard library functions to produce the UTF-8 version. But I'm not at all convinced that it would be worthwhile.
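For illustration, a minimal sketch of that table approach, assuming a C99 (or later) compiler whose execution character set is UTF-8, so the universal character names below end up as UTF-8 bytes in the string literals; the array name is my own, and I chose to skip the knight card (U+1F0AC):

#include <stdio.h>

/* One string per rank, ace of spades (U+1F0A1) through king of spades
   (U+1F0AE), skipping the knight (U+1F0AC). */
static const char *spadeCards[13] = {
    "\U0001F0A1", "\U0001F0A2", "\U0001F0A3", "\U0001F0A4", "\U0001F0A5",
    "\U0001F0A6", "\U0001F0A7", "\U0001F0A8", "\U0001F0A9", "\U0001F0AA",
    "\U0001F0AB", "\U0001F0AD", "\U0001F0AE"
};

void printCard(int card) {
    printf("%s\n", spadeCards[card % 13]);
}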

A quick and dirty UTF-8 solution:
void printCard(int card) {
    printf("\xF0\x9F\x82%c\n", 0xA1 + card % 13);
}
The UTF-8 representation of \U0001F0A1 is F0 9F 82 A1. The above code will handle all 13 cards correctly, provided your terminal supports UTF-8 and non-BMP code points (iTerm2 on OS X does, for example).
Alternative solutions involving wide-char conversion to multibyte character sets are complicated to use and would not work on platforms where wchar_t is limited to 16 bits.

Related

How to check if a character is an extended ascii character in C?

Every time I do something similar to the condition below I get a multi-character warning.
char str[] = "León";
if (str[2] == 'ó') printf("true");
How can I solve this?
Unless the encoding on your platform is such that 'ó' can fit into a char, 'ó' is a multi-character constant. Judging by the message you get, it is the latter on your platform. The values of multi-character constants are implementation defined; in other words, the choice of numeric value is up to the implementation, with some constraints (e.g. it must be outside the char range on your platform).
Sadly, in your case, when you write char str[] = "León";, the ó is either converted to a single char via a narrowing conversion or decomposed into more than one char element of the array. Either way, attempts to compare str[2] to 'ó' will be futile.
If you want to use the extended ASCII characters, use their octal value. Going by the table at http://www.asciitable.com/, I guess the value you require is 162 decimal, which is 242 in octal, so use char str[] = "Le\242n"; and use the same escape in the comparison.
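A minimal sketch of that suggestion, purely for illustration; it assumes the single-byte code page from that table really is the execution character set (on a UTF-8 system the comparison will not be true):

#include <stdio.h>

int main(void) {
    /* 0242 octal == 162 decimal, the code for ó in the table referenced above */
    char str[] = "Le\242n";
    if (str[2] == '\242')
        printf("true\n");
    return 0;
}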
You'll need to use the wchar_t type, or a unicode library. wchar_t is infamous for having many gotchas and easy bugs to hit, but it is the best primitive type available to C++ compilers.
You need to use variants of everything that support wchar_t, such as std::wcout or wprintf.
EDIT: wchar_t has largely been superseded by char16_t and char32_t. The Unicode Standard 4.0 suggests their use whenever code must be portable between platforms, because wchar_t varies in size from platform to platform (as int does).
I recommend finding a good unicode library to handle comparison between the many characters that are made of multiple codepoints!
The other option is to stick entirely to the native char type, which is generally interpreted as some locale-specific, ASCII-compatible encoding.
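For the wide-character route suggested above, a minimal sketch; it assumes the source file is UTF-8 (with ó as a single precomposed code point), a compiler that understands that encoding, and a runtime locale capable of representing ó:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");        /* pick up the user's locale for wide output */
    wchar_t str[] = L"León";
    if (str[2] == L'ó')           /* ó is a single wchar_t element here */
        wprintf(L"true\n");
    return 0;
}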
ASCII is a 7-bit character encoding that numbers characters 0 ... 127. An ASCII-compatible encoding preserves the meanings of these bytes. Any character encoded as c < 0 or c > 127 cannot be an ASCII character; such characters sometimes go by various confusing names such as "Extended ASCII".
In Unicode, the ASCII characters are still the characters 0 ... 127 of the Unicode code point range.
The problem is not so much that ó is an extended character; it is that your source file is actually in UTF-8, so ó is encoded as 2 bytes. char in C stands for what is generally called a byte elsewhere.
C also supports wide-character strings, where each element is a UTF-16, UCS-2, UTF-32 or other code unit. There your ó would (most probably) be a single wchar_t.
Unfortunately you're opening a can of worms here, because the symbol ó can be written in Unicode in 2 separate ways: as the single code point ó (U+00F3), or as the letter o followed by the combining acute accent (U+0301). Both carry the same semantic information, but they consist of different bytes, and even when converted to wchar_t strings they are still different sequences. The C standard library doesn't handle Unicode at all, except that C11 adds some support for string literals explicitly in UTF-8. The C standard still doesn't provide a portable way to convert UTF-8 encoded text to wchar_t; nor can it perform normalizations such as ó to o plus combining accent, or vice versa.
You could do something like
if (sizeof("ó") > 2) ...
If ó fits in one char, the size of that string literal is 2: one for the character and one for the terminating 0. Otherwise the compiler allocates a longer array, so the expression distinguishes the two cases.
When you hand your source file to the compiler, you have to tell it which character encoding your editor used (the source charset). My guess is that it is UTF-8, which encodes ó as 0xC3 0xB3. That part seems to be going right.
But 'ó' then becomes an integer with a value outside the range of char on your platform (see <limits.h>), hence the warning on the == comparison.
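For completeness, a minimal sketch of comparing the UTF-8 bytes directly, assuming both the source and execution character sets are UTF-8:

#include <stdio.h>
#include <string.h>

int main(void) {
    char str[] = "León";
    /* In UTF-8, ó is the two bytes 0xC3 0xB3, stored in str[2] and str[3]. */
    if (memcmp(&str[2], "\xC3\xB3", 2) == 0)
        printf("true\n");
    return 0;
}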
By the way, there is some meaning in "Extended ASCII", but not much. An "Extended ASCII" character set must encode each of its code points in one byte, so UTF-8 is not an encoding for one of the many "Extended ASCII" character sets.

using regular expression with unicode string in C

I'm currently using regular expressions on Unicode strings, but I only need to match ASCII characters, effectively ignoring all Unicode characters, and so far the functions in regex.h work fine (I'm on Linux, so the encoding is UTF-8). Can someone confirm whether it's really OK to do this, or do I need a Unicode-aware regex library (like ICU)?
UTF-8 is a variable-length encoding; some characters are 1 byte, some 2, others 3 or 4. You know how many bytes to read from the bit prefix of each character's lead byte: 0 for 1 byte, 110 for 2 bytes, 1110 for 3 bytes, 11110 for 4 bytes.
If you try to read a UTF-8 string as ASCII, or any other fixed-width encoding, things will go very wrong... unless that UTF-8 string contains nothing but 1-byte characters, in which case it matches ASCII.
However, since UTF-8 never uses a zero byte except to encode NUL, and none of the continuation bytes can be confused with ASCII, if you really are only matching ASCII you might be able to get away with it... but I wouldn't recommend it, because there are much better regex options than POSIX, they're easy to use, and why leave a hidden encoding bomb in your code for some sucker to deal with later? (Note: that sucker may be you.)
Instead, use a Unicode-aware regex library like Perl Compatible Regular Expressions (PCRE). PCRE2 becomes Unicode aware when you pass the PCRE2_UTF flag to pcre2_compile. PCRE regex syntax is more powerful and more widely understood than POSIX regexes, and PCRE has more features. PCRE is also bundled with GLib, which itself provides a feast of very handy C functions.
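A minimal PCRE2 sketch along those lines (an assumption-laden example: it uses the 8-bit PCRE2 library, typically linked with -lpcre2-8, and the pattern and subject are just placeholders):

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>

int main(void) {
    int errcode;
    PCRE2_SIZE erroffset;

    /* PCRE2_UTF makes both the pattern and the subject be treated as UTF-8. */
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)"a.b", PCRE2_ZERO_TERMINATED,
                                   PCRE2_UTF, &errcode, &erroffset, NULL);
    if (re == NULL) return 1;

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    PCRE2_SPTR subject = (PCRE2_SPTR)"a\xC3\xA8""b";   /* "aèb" in UTF-8 */
    int rc = pcre2_match(re, subject, PCRE2_ZERO_TERMINATED, 0, 0, md, NULL);
    printf("%s\n", rc > 0 ? "match" : "no match");     /* matches: under PCRE2_UTF, . is one character */

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}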
You need to be careful about your patterns and about the text you're going to match.
As an example, given the expression a.b:
"axb" matches
"aèb" does NOT match
The reason is that è is two bytes long when UTF-8 encoded but . would only match the first one.
So as long as you only match sequences of ASCII characters you're safe. If you mix ASCII and non ASCII characters, you're in trouble.
You can try to match a single UTF-8 encoded "character" with something like:
([\xC0-\xDF].|[\xE0-\xEF]..|\xF0...|.)
but this assumes that the text is encoded correctly (and, frankly, I never tried it).
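To make the "ASCII-only patterns are safe" point concrete, here is a small regex.h sketch. It deliberately does not call setlocale, so matching stays byte-oriented (the behaviour described above); a locale-aware regexec running in a UTF-8 locale may instead treat è as a single character:

#include <regex.h>
#include <stdio.h>

static void try_match(const regex_t *re, const char *subject) {
    int rc = regexec(re, subject, 0, NULL, 0);
    printf("%s: %s\n", subject, rc == 0 ? "matches" : "does NOT match");
}

int main(void) {
    regex_t re;
    if (regcomp(&re, "a.b", REG_EXTENDED | REG_NOSUB) != 0) return 1;

    try_match(&re, "axb");              /* every character is a single byte  */
    try_match(&re, "a\xC3\xA8""b");     /* "aèb": è is two bytes, . eats one */

    regfree(&re);
    return 0;
}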

Unicode Character 'SPEAKER WITH THREE SOUND WAVES' (U+1F50A) in c source code

I want to print the Unicode character 'SPEAKER WITH THREE SOUND WAVES' (U+1F50A), UTF-16 encoding "\uD83D\uDD0A", in C source code, but I get these errors:
error: \uDD0A is not a valid universal character
error: \uD83D is not a valid universal character
\u notation (with four hexadecimal digits) refers to the UCS-2 encoding, i.e. you can encode only characters from the BMP (Basic Multilingual Plane, basically U+0000 through U+FFFF).
U+1F50A is beyond the BMP, and thus cannot be encoded in 16 bits. UTF-16 uses surrogate pairs for such characters beyond the BMP (values in the 0xD800 - 0xDFFF range, which are not used in UCS-2), but they are explicitly forbidden in \u notation.
You need \U notation (with eight hexadecimal digits) for that.
Also note that the conversion from either \u or \U notation to whatever actually ends up in the string is locale-dependent, so what might work on one platform might not work on another... if you want to be really portable and ensure e.g. UTF-8 or UTF-16 encoding in the string, you need to:
do the encoding manually via hexadecimal \x... or octal \...;
use third-party libraries with proper Unicode support (ICU).
While we're at it (and because many people are unaware of this), the above points straight at why Microsoft's 16-bit version of wchar_t is broken when you want Unicode: it stems from a time when there was only the BMP, and 16-bit UCS-2 was plenty. Since that is no longer sufficient to encode all defined Unicode characters, you can use it to hold UTF-16 code values, but wchar_t -- and by extension, std::wstring as well as L"" string literals -- isn't really wide as the name implies, but multibyte at best.
Good that C++ introduced explicit char16_t and char32_t, plus the locale-independent u"", U"" and u8"" string literals. Too bad MSVC doesn't yet support them AFAIK.
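A small sketch of both routes for U+1F50A, assuming a UTF-8 terminal and a compiler whose execution character set is UTF-8 (gcc and clang default to this):

#include <stdio.h>

int main(void) {
    /* \U notation: the compiler converts the code point to the execution
       character set, here producing the UTF-8 bytes F0 9F 94 8A. */
    printf("\U0001F50A\n");

    /* Manual encoding: spell out the UTF-8 bytes of U+1F50A yourself. */
    printf("\xF0\x9F\x94\x8A\n");
    return 0;
}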

Get the Unicode value of a TCHAR

I need to get the Unicode value of a TCHAR.
e.g. if the TCHAR is 'A', I want to find the Unicode value 0x41.
Is it safe to cast to an int, or is there an API function I should be using?
Your question is a little malformed. A TCHAR can be either an 8 bit or a 16 bit character element. Knowing how wide the character is is not, by itself, enough; you also need to know how it is encoded. For example:
If you have an 8 bit ASCII encoded character, then its numeric value is the Unicode code point.
If you have an 8 bit Windows ANSI encoded character from a single byte character set you convert to UTF-16 with MultiByteToWideChar. The numeric value of the UTF-16 element is the Unicode code point.
If you have an 8 bit Windows ANSI encoded character element from a double byte or multi byte character set, that 8 bit char does not, in general, define a character. In general, you need multiple char elements.
Likewise for a 16 bit UTF-16 encoded character element. Again UTF-16 is a variable width encoding and a single character element does not in general define a Unicode code point.
So, in order to proceed you must become clear as to how your character is encoded.
Before even doing that, you need to know how wide it is. TCHAR can be 8 bit or 16 bit, depending on how you compile. That flexibility is how we handled single-source development for Win 9x and Win NT; the former had no Unicode support. Nowadays Win 9x is, thankfully, long forgotten, and so too should TCHAR be. Sadly it lives on in countless MSDN examples, but you should ignore that. On Windows the native character element is wchar_t.
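As an illustration of the single-byte ANSI case described above, a Windows-only sketch (it assumes the char really does come from the system ANSI code page, CP_ACP):

#include <windows.h>
#include <stdio.h>

int main(void) {
    char ansi = 'A';    /* one char element from the system ANSI code page */
    wchar_t wide[2];

    /* Convert the single char element to UTF-16; for a single-byte character
       the result is one UTF-16 element whose value is the Unicode code point. */
    int n = MultiByteToWideChar(CP_ACP, 0, &ansi, 1, wide, 2);
    if (n == 1)
        printf("U+%04X\n", (unsigned)wide[0]);   /* prints U+0041 */
    return 0;
}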
Well, I just guess you want UTF-32 numbers. Arx said it already: TCHAR can be char or wchar_t.
If you have a char array, it will probably contain data in your system's default single-byte charset (UTF-8 is possible too). As dealing with many different charsets is difficult and Windows has built-in conversion facilities, use MultiByteToWideChar to get a wchar_t array from your char array.
If you have a wchar_t array, it's most likely UTF-16 (LE, without BOM) on Windows. I don't know of a built-in function to get UTF-32 from it, but writing your own conversion is not that hard (otherwise, use some library); see http://en.wikipedia.org/wiki/UTF-16. Some bit-fiddling, but nothing more.
(What TCHAR is is a preprocessor thing, so you could implement different behaviour based on the #defines too, or on sizeof, or ...)
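A sketch of that bit-fiddling: decoding one UTF-16 code unit (or surrogate pair) into a UTF-32 value, assuming the input is well-formed UTF-16:

#include <stdint.h>

/* Decode one code point from a UTF-16 string; returns the number of
   16-bit units consumed (1 or 2). Assumes well-formed input. */
static int utf16_to_utf32(const uint16_t *in, uint32_t *out) {
    if (in[0] >= 0xD800 && in[0] <= 0xDBFF) {        /* high surrogate */
        *out = 0x10000u
             + (((uint32_t)in[0] - 0xD800u) << 10)
             +  ((uint32_t)in[1] - 0xDC00u);         /* low surrogate  */
        return 2;
    }
    *out = in[0];                                    /* BMP code point */
    return 1;
}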

How to convert Unicode escaped characters to utf8?

I saw the other questions about the subject but all of them were missing important details:
I want to convert \u00252F\u00252F\u05de\u05e8\u05db\u05d6 to UTF-8. I understand that you look through the stream for \u followed by four hex digits, which you convert to bytes. The problems are as follows:
I heard that sometimes you look for 4 bytes after the \u and sometimes 6; is this correct? If so, how do you determine which it is? E.g. is \u00252F 4 or 6 bytes?
In the case of \u0025 this maps to one byte instead of two (0x25), why? Is the four hex supposed to represent utf16 which i am supposed to convert to utf8?
How do I know whether the text is supposed to be the literal characters \u0025 or the unicode sequence? Does that mean that all backslashes must be escaped in the stream?
Lastly, am I being stupid in doing this by hand when I can use iconv to do this for me?
If you have the iconv interfaces at your disposal, you can simply convert the \u0123\uABCD etc. sequences to an array of bytes 01 23 AB CD ..., replacing any unescaped ASCII characters with a 00 byte followed by the ASCII byte, then run the array through iconv with a conversion descriptor obtained from iconv_open("UTF-8", "UTF-16BE").
Of course you can also do it much more efficiently working directly with the input yourself, but that requires reading and understanding the Unicode specification of UTF-16 and UTF-8.
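A minimal sketch of the iconv step, assuming the escapes have already been parsed into big-endian UTF-16 bytes (the buffer below is just the \u05de\u05e8 part of the question's input) and that your iconv accepts the name "UTF-16BE" (glibc and GNU libiconv do):

#include <iconv.h>
#include <stdio.h>

int main(void) {
    /* Big-endian UTF-16 for \u05de\u05e8 (two Hebrew letters). */
    char in[] = { 0x05, (char)0xDE, 0x05, (char)0xE8 };
    char out[16];

    char *inp = in, *outp = out;
    size_t inleft = sizeof in, outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");
    if (cd == (iconv_t)-1) return 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) return 1;
    iconv_close(cd);

    fwrite(out, 1, sizeof out - outleft, stdout);   /* the UTF-8 bytes */
    putchar('\n');
    return 0;
}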
In some conventions (like C++11 string literals), you parse a specific number of hex digits: four after \u and eight after \U. That may or may not be the convention of the input you have, but it seems a reasonable guess. Other styles, like C++'s \x, parse as many hex digits as can be found after the \x, which means you have to jump through some hoops if you want to put a literal hex digit immediately after one of these escaped characters.
Once you have all the values, you need to know what encoding they're in (e.g., UTF-16 or UTF-32) and what encoding you want (e.g., UTF-8). You then use a function to create a new string in the new encoding. You can write such a function (if you know enough about both encoding formats), or you can use a library. Some operating systems may provide such a function, but you might want to use a third-party library for portability.
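A hedged sketch of the parsing step described above: read \u followed by exactly four hex digits into UTF-16 code units, copying everything else through unchanged (error handling and surrogate pairing are left out; the function name is my own):

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse "\uXXXX" escapes into UTF-16 code units; other bytes are copied
   through as single code units. Returns the number of units written. */
static size_t parse_u_escapes(const char *s, uint16_t *out) {
    size_t n = 0;
    while (*s) {
        if (s[0] == '\\' && s[1] == 'u'
                && isxdigit((unsigned char)s[2]) && isxdigit((unsigned char)s[3])
                && isxdigit((unsigned char)s[4]) && isxdigit((unsigned char)s[5])) {
            char hex[5] = { s[2], s[3], s[4], s[5], 0 };
            out[n++] = (uint16_t)strtoul(hex, NULL, 16);
            s += 6;
        } else {
            out[n++] = (uint16_t)(unsigned char)*s++;
        }
    }
    return n;
}

int main(void) {
    uint16_t units[64];
    size_t n = parse_u_escapes("\\u05de\\u05e8", units);
    for (size_t i = 0; i < n; i++)
        printf("U+%04X\n", (unsigned)units[i]);
    return 0;
}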
