The wcstombs documentation says that it "converts the sequence of wide-character codes to multibyte string", but it never says what a "wide character" is.
Is the encoding implicit (say, it converts UTF-16 to UTF-8), or is the conversion defined by some environment variable?
Also, what is the typical use case for wcstombs?
You use the setlocale() standard function with the LC_CTYPE (or LC_ALL) category to set the mapping the library uses between wchar_t characters and multibyte characters. The actual locale name passed to setlocale() is implementation defined, so you'll need to look it up in your compiler's docs.
For example, with MSVC you might use
setlocale( LC_ALL, ".1252" );
to set the C runtime to use codepage 1252 as the multibyte character set. Note that the MSVC docs explicitly indicate that the locale cannot be set to UTF-7 or UTF-8 for the multibyte character set:
The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL.
The "wide-character" wchar_t type is intended to be able to support any character set the system supports - the standard doesn't define the size of a wchar_t type (it could be as small as a char or any of the larger integer types). On Windows it's the system's 'internal' Unicode encoding, which is UTF-16 (UCS-2 before WinXP). Honestly, I can't find a direct quote on that in the MSVC docs, though. Strictly speaking, the implementation should call this out, but I can't find it.
It converts whatever your platform uses for a "wide char" (which I'm led to believe is indeed UCS2 on Windows, but is usually UCS4 on UNIX) into your current locale's default multibyte character encoding. If your locale is a UTF-8 one, then that is the multibyte encoding that will be used - but note that there are other possibilities, like JIS.
According to the C standard, wchar_t type is "capable of representing any character in the current locale". The standard doesn't say what the encoding for wchar_t is. In fact, the minimum required limits on WCHAR_MIN and WCHAR_MAX are only [0, 255] or [-127, 127], depending upon whether wchar_t is unsigned or signed, so a conforming wchar_t may be as small as a char.
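As a rough sketch of the conversion described above: the helper below (wide_to_multibyte is a made-up name, not a standard function) converts a wide string into whatever multibyte encoding the current LC_CTYPE locale selects. It sizes the buffer with MB_CUR_MAX, the largest number of bytes one character can take in the current locale, so it does not rely on any implementation-specific behaviour of wcstombs.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a freshly allocated
 * multibyte string in the encoding of the current LC_CTYPE locale.
 * Returns NULL if allocation fails or if some wide character has no
 * representation in that locale. The caller frees the result. */
char *wide_to_multibyte(const wchar_t *ws)
{
    /* Each wide character needs at most MB_CUR_MAX bytes; +1 for the NUL. */
    size_t cap = wcslen(ws) * MB_CUR_MAX + 1;
    char *out = malloc(cap);
    if (out == NULL)
        return NULL;

    if (wcstombs(out, ws, cap) == (size_t)-1) {
        free(out);              /* unrepresentable character */
        return NULL;
    }
    return out;                 /* NUL-terminated, since cap exceeds the worst case */
}
```

A program would normally call setlocale(LC_CTYPE, "") (or pass an explicit locale name) before using this; without that call the program runs in the "C" locale, where only the basic character set is guaranteed to convert.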
A multibyte character can use more than one byte. A multibyte string is made of one or more multibyte characters. In a multibyte string, each character need not be of equal number of bytes (UTF-8 is an example). Whereas, an object of type wchar_t has a fixed size (in a given implementation, of course).
As an aside, I can also find the following in my copy of the C99 draft:
__STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month.
So, if I understood correctly, if __STDC_ISO_10646__ is defined, then wchar_t can store Unicode characters.
Wide character strings are composed of wchar_t elements, each typically wider than a byte, whereas the normal C string is a char* - a sequence of byte-wide characters. Wchars are not the same thing as Unicode on all platforms, though Unicode representations are typically based on wchar_t.
I've seen wchars used in embedded systems like phones, where you want filenames with special characters but don't necessarily want to support all the glory and complexity of Unicode.
Typical usage would be converting a string of 2-byte characters to a regular C string, and vice versa.
What prerequisites are needed to do strict Unicode programming?
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
And what is the role played by multibyte character sequences in this scenario?
C99 or earlier
The C standard (C99) provides for wide characters and multi-byte characters, but since there is no guarantee about what those wide characters can hold, their value is somewhat limited. For a given implementation, they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.
Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU - International Components for Unicode - library) is sound, IMO.
The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.
Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.
One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format - see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are multi-byte formats.
UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of characters in a string reliably, starting at any point in the string. Single-byte characters have the high-bit set to zero. Multi-byte characters have the first character starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), with subsequent bytes always starting 10. The continuation characters are always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the minimum possible format. One consequence of these rules is that the bytes 0xC0 and 0xC1 (also 0xF5..0xFF) cannot appear in valid UTF-8 data.
U+0000 .. U+007F 1 byte 0xxx xxxx
U+0080 .. U+07FF 2 bytes 110x xxxx 10xx xxxx
U+0800 .. U+FFFF 3 bytes 1110 xxxx 10xx xxxx 10xx xxxx
U+10000 .. U+10FFFF 4 bytes 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
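The table above maps fairly directly onto code. This is a minimal sketch of an encoder, assuming a hypothetical function name utf8_encode; it rejects surrogates and values above U+10FFFF, which is why it can never emit the forbidden bytes mentioned earlier.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns the count, or returns 0
 * for values that are not valid scalar values (surrogates, > U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x80) {                       /* 0xxx xxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110x xxxx 10xx xxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110 xxxx 10xx xxxx 10xx xxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, U+00F3 (ó) comes out as the two bytes 0xC3 0xB3, and U+20AC (the euro sign) as 0xE2 0x82 0xAC.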
Originally, it was hoped that Unicode would be a 16-bit code set and everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.
UTF-16 thus is a single unit (16-bit word) code set for the 'Basic Multilingual Plane', meaning the characters with Unicode code points U+0000 .. U+FFFF, but uses two units (32-bits) for characters outside this range. Thus, code that works with the UTF-16 encoding must be able to handle variable width encodings, just like UTF-8 must. The codes for the double-unit characters are called surrogates.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
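To make the surrogate arithmetic concrete, here is a small sketch (utf16_encode is an illustrative name, not a library function). A code point above U+FFFF has 0x10000 subtracted, and the remaining 20 bits are split into two 10-bit halves that are added to the bases of the high and low surrogate ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16 code units.
 * Returns 1 for BMP characters, 2 for a surrogate pair, and 0 for
 * input that is not a valid scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {                     /* single unit, BMP */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                          /* now a 20-bit value */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high (leading) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low (trailing) surrogate */
    return 2;
}
```

With this, U+1F600 becomes the pair 0xD83D 0xDE00, and U+10000 (the first code point outside the BMP) becomes 0xD800 0xDC00.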
UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.
You can find a lot more information at the ICU and Unicode web sites.
C11 and <uchar.h>
The C11 standard changed the rules, but not all implementations have caught up with the changes even now (mid-2017). The C11 standard summarizes the changes for Unicode support as:
Unicode characters and strings (<uchar.h>) (originally specified in ISO/IEC TR 19769:2004)
What follows is a bare minimal outline of the functionality. The specification includes:
6.4.3 Universal character names
Syntax
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
hex-quad:
hexadecimal-digit hexadecimal-digit
hexadecimal-digit hexadecimal-digit
7.28 Unicode utilities <uchar.h>
The header <uchar.h> declares types and functions for manipulating Unicode characters.
The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (also described in 7.20.1.2).
(Translating the cross-references: <stddef.h> defines size_t,
<wchar.h> defines mbstate_t,
and <stdint.h> defines uint_least16_t and uint_least32_t.)
The <uchar.h> header also defines a minimal set of (restartable) conversion functions:
mbrtoc16()
c16rtomb()
mbrtoc32()
c32rtomb()
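For a feel of how these functions are called, here is a minimal sketch wrapping c32rtomb(). The wrapper name char32_to_multibyte is my own; it assumes a platform that actually ships <uchar.h>, and the output encoding is whatever the current LC_CTYPE locale uses, so non-ASCII characters only convert if the locale can represent them.

```c
#include <limits.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical wrapper: convert one char32_t to the multibyte encoding
 * of the current locale. Returns the number of bytes written to out,
 * or (size_t)-1 if the character cannot be represented. */
size_t char32_to_multibyte(char32_t c, char out[MB_LEN_MAX])
{
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* start in the initial conversion state */
    return c32rtomb(out, c, &state);
}
```

A caller would typically run setlocale(LC_CTYPE, "") first and check the result against (size_t)-1 before using the bytes.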
There are rules about which Unicode characters can be used in identifiers using the \unnnn or \U00nnnnnn notations. You may have to explicitly enable support for such characters in identifiers. For example, GCC requires -fextended-identifiers to allow these in identifiers.
Note that macOS Sierra (10.12.5), to name but one platform, does not support <uchar.h>.
Note that this is not about "strict unicode programming" per se, but some practical experience.
What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).
Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
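I can't show the actual wrappers, but the idea can be sketched roughly as follows. Both function names are invented for illustration, and both assume the input is already valid UTF-8. The copy routine backs up to a character boundary instead of cutting a sequence in half, and the counter simply skips continuation bytes (those of the form 10xx xxxx).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical strncpy replacement: copy src into dst (capacity dstsize,
 * including the NUL) without splitting a UTF-8 sequence. Always
 * NUL-terminates when dstsize > 0. Returns the number of bytes copied. */
size_t utf8_copy_truncated(char *dst, size_t dstsize, const char *src)
{
    size_t len;

    if (dstsize == 0)
        return 0;

    len = strlen(src);
    if (len > dstsize - 1) {
        len = dstsize - 1;
        /* src[len] is the first byte that would be dropped. If it is a
         * continuation byte, the cut falls inside a sequence, so back
         * up until the cut sits just before a lead byte. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}

/* Hypothetical helper: count code points by counting non-continuation bytes. */
size_t utf8_count_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

Note that this counts code points, not graphemes; a letter followed by a combining accent still counts as two.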
When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).
We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).
This FAQ is a wealth of info. Between that page and this article by Joel Spolsky, you'll have a good start.
One conclusion I came to along the way:
wchar_t is 16 bits on Windows, but not necessarily 16 bits on other platforms. I think it's a necessary evil on Windows, but probably can be avoided elsewhere. The reason it's important on Windows is that you need it to use files that have non-ASCII characters in the name (along with the W version of functions).
Note that Windows APIs that take wchar_t strings expect UTF-16 encoding. Note also that this is different than UCS-2. Take note of surrogate pairs. This test page has enlightening tests.
If you're programming on Windows, you can't use fopen(), fread(), fwrite(), etc. since they only take char * and don't understand UTF-8 encoding. Makes portability painful.
To do strict Unicode programming:
Only use string APIs that are Unicode aware (NOT strlen, strcpy, ... but their wide-string counterparts wcslen, wcscpy, ...)
When dealing with a block of text, use an encoding that allows storing Unicode chars (utf-7, utf-8, utf-16, ucs-2, ...) without loss.
Check that your OS default character set is Unicode compatible (ex: utf-8)
Use fonts that are Unicode compatible (e.g. arial_unicode)
Multi-byte character sequences are an encoding that pre-dates the UTF-16 encoding (the one normally used with wchar_t), and it seems to me they are rather Windows-only.
I've never heard of wint_t.
The most important thing is to always make a clear distinction between text and binary data. Try to follow the model of Python 3.x str vs. bytes or SQL TEXT vs. BLOB.
Unfortunately, C confuses the issue by using char for both "ASCII character" and int_least8_t. You'll want to do something like:
typedef char UTF8; // for code units of UTF-8 strings
typedef unsigned char BYTE; // for binary data
You might want typedefs for UTF-16 and UTF-32 code units too, but this is more complicated because the encoding of wchar_t is not defined. You'll need to use preprocessor #ifs. Some useful macros in C and C++0x are:
__STDC_UTF_16__ — If defined, the type _Char16_t exists and is UTF-16.
__STDC_UTF_32__ — If defined, the type _Char32_t exists and is UTF-32.
__STDC_ISO_10646__ — If defined, then wchar_t is UTF-32.
_WIN32 — On Windows, wchar_t is UTF-16, even though this breaks the standard.
WCHAR_MAX — Can be used to determine the size of wchar_t, but not whether the OS uses it to represent Unicode.
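Putting the macros listed above together, a sketch of the preprocessor logic might look like this. The typedef names, the enum and guess_wchar_encoding are all made up for illustration, and the extra WCHAR_MAX check is my own precaution so that a 16-bit wchar_t is never reported as UTF-32.

```c
#include <stdint.h>
#include <wchar.h>

/* Code-unit types that do not depend on wchar_t at all. */
typedef uint_least16_t utf16_unit;
typedef uint_least32_t utf32_unit;

enum wchar_encoding { WCHAR_ENC_UNKNOWN, WCHAR_ENC_UTF16, WCHAR_ENC_UTF32 };

/* Hypothetical helper: best compile-time guess at what wchar_t holds. */
enum wchar_encoding guess_wchar_encoding(void)
{
#if defined(_WIN32)
    return WCHAR_ENC_UTF16;            /* Windows APIs expect UTF-16 */
#elif defined(__STDC_ISO_10646__) && WCHAR_MAX >= 0x10FFFF
    return WCHAR_ENC_UTF32;            /* wchar_t values are Unicode code points */
#else
    return WCHAR_ENC_UNKNOWN;          /* could be a locale-specific encoding */
#endif
}
```

Code that gets WCHAR_ENC_UNKNOWN back should avoid interpreting wchar_t values as Unicode and convert through its own UTF-8/16/32 types instead.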
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
See also:
UTF-8 or UTF-16 or UTF-32 or UCS-2
Is wchar_t needed for Unicode support?
No. UTF-8 is a perfectly valid Unicode encoding that uses char* strings. It has the advantage that if your program is transparent to non-ASCII bytes (e.g., a line ending converter which acts on \r and \n but passes through other characters unchanged), you'll need to make no changes at all!
If you go with UTF-8, you'll need to change all the assumptions that char = character (e.g., don't call toupper in a loop) or char = screen column (e.g., for text wrapping).
If you go with UTF-32, you'll have the simplicity of fixed-width characters (but not fixed-width graphemes), but will need to change the type of all of your strings.
If you go with UTF-16, you'll have to discard both the assumption of fixed-width characters and the assumption of 8-bit code units, which makes this the most difficult upgrade path from single-byte encodings.
I would recommend actively avoiding wchar_t because it's not cross-platform: Sometimes it's UTF-32, sometimes it's UTF-16, and sometimes it's a pre-Unicode East Asian encoding. I'd recommend using typedefs.
Even more importantly, avoid TCHAR.
I wouldn't trust any standard library implementation. Just roll your own unicode types.
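#include <windows.h>
/* Hand-rolled code-unit types; these widths assume the Windows data model. */
typedef unsigned char utf8_t;
typedef unsigned short utf16_t;
typedef unsigned long utf32_t;
int main ( int argc, char *argv[] )
{
    /* Greek alpha, beta, gamma, delta separated by tabs, as UTF-16 code units. */
    utf16_t lpText[] = { 0x03B1, 0x0009, 0x03B2, 0x0009, 0x03B3, 0x0009, 0x03B4, 0x0000 };
    /* Compiles as C with MSVC, where wchar_t is a typedef for unsigned short. */
    utf16_t lpCaption[] = L"Greek Characters";
    unsigned int uType = MB_OK;
    int msgBoxId = MessageBoxW( NULL, lpText, lpCaption, uType );
    (void)argc; (void)argv; (void)msgBoxId; /* unused in this demo */
    return 0;
}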
From what I know, wchar_t is implementation dependent (as can be seen from this wiki article). And it's not unicode.
You basically want to deal with strings in memory as wchar_t arrays instead of char. When you do any kind of I/O (like reading/writing files) you can encode/decode using UTF-8 (this is probably the most common encoding) which is simple enough to implement. Just google the RFCs. So in-memory nothing should be multi-byte. One wchar_t represents one character. When you come to serializing however, that's when you need to encode to something like UTF-8 where some characters are represented by multiple bytes.
You'll also have to write new versions of strcmp etc. for the wide character strings, but this isn't a big issue. The biggest problem will be interop with libraries/existing code that only accept char arrays.
And when it comes to sizeof(wchar_t) (you will need 4 bytes if you want to do it right) you can always redefine it to a larger size with typedef/macro hacks if you need to.
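The decode step mentioned above (UTF-8 bytes coming in from I/O, one code point per element in memory) could be sketched like this. utf8_decode is an illustrative name; the sketch stores the result in a uint32_t rather than wchar_t so that it does not depend on how wide wchar_t is, and it rejects overlong forms, surrogates and out-of-range values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from s, where n bytes are
 * available. Stores the code point in *cp and returns the number of
 * bytes consumed, or returns 0 on invalid or truncated input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each sequence length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t v;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; v = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; v = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; v = s[0] & 0x07;
    } else {
        return 0;                       /* stray continuation or 0xF8..0xFF */
    }

    if (n < len)
        return 0;                       /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }

    if (v < min_for_len[len])
        return 0;                       /* overlong encoding */
    if (v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;                       /* outside Unicode or a surrogate */

    *cp = v;
    return len;
}
```

A reader loop would call this repeatedly, advancing by the returned length and treating a return of 0 as a decoding error.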
Every time I do something similar to the condition below, I get a multi-character warning.
char str[] = "León";
if (str[2] == 'ó') printf("true");
How can I solve this?
Unless the encoding on your platform is such that 'ó' can fit into a char, 'ó' is a multi-character constant. It seems to be the latter on your platform, judging by the message you get. The values of multi-character constants are implementation defined. In other words, the choice of numeric value is up to the implementation, with some constraints (e.g. it must be outside the char range on your platform).
Sadly in your case when you write char str[] = "León";, the third element will be converted to a char, using a narrowing conversion, or decomposed into more than one char and concatenated to the char[] array. So attempts to compare it to 'ó' will be futile.
If you want to use the extended ASCII characters, use their octal value.
I am using the table http://www.asciitable.com/ and I guess the value you require is 162 (decimal) = 242 (octal). So use str[] = "Le\242n";
And use the same in the comparison.
You'll need to use the wchar_t type, or a unicode library. wchar_t is infamous for having many gotchas and easy bugs to hit, but it is the best primitive type available to C++ compilers.
You need to use variants of everything that support wchar_t, such as std::wcout or wprintf.
EDIT: wchar_t has been replaced by char16_t and char32_t. The Unicode Standard 4.0 suggests their use whenever code must be portable between platforms, because wchar_t varies in size depending on platform (like int does).
I recommend finding a good unicode library to handle comparison between the many characters that are made of multiple codepoints!
The other option is to stick entirely to the native char type which is generally interpreted as some locale-specific ASCII.
ASCII is a 7-bit character coding that numbers characters 0 ... 127. An ASCII-compatible encoding preserves the meanings of these bytes. Any character encoded as c < 0 or c > 127 cannot be an ASCII character. Such characters are sometimes called by various confusing names such as "Extended ASCII" or the like.
In Unicode, the ASCII characters are still the characters 0 ... 127 of the Unicode codepoint range.
The problem is not so much that ó is an extended character; it is that your source file is actually in UTF-8, and therefore ó is encoded as 2 bytes. char in C stands for the thing generally called a byte elsewhere.
C also supports wide-character strings, where each character is a UTF-16, UCS-2, UTF-32, or some other code point. There your ó would (most probably) be a single wchar_t.
Unfortunately you're opening a can of worms here, because the symbol ó can also be written in Unicode in 2 separate ways: It can be written as one code point ó or as the letter o followed by the combining acute accent (U+0301); both carry the same semantic information, but they consist of different bytes. And even if converted to wchar_t strings, these would still be different sequences. The C standard library doesn't handle Unicode at all, except in C11, where there is some support for string literals explicitly in UTF-8. The C standard still doesn't present a portable way of converting UTF-8 encoded textual data to wchar_t; neither can it do normalizations such as ó to o followed by U+0301 or vice versa.
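To see the two spellings side by side, here is a tiny sketch with the UTF-8 bytes written out as escapes (the constant and function names are just for illustration). A plain byte comparison treats them as different strings, which is exactly why a normalization step from a Unicode library is needed before comparing user-visible text.

```c
#include <string.h>

/* U+00F3 LATIN SMALL LETTER O WITH ACUTE, precomposed: 2 bytes in UTF-8. */
static const char O_ACUTE_PRECOMPOSED[] = "\xC3\xB3";

/* U+006F 'o' followed by U+0301 COMBINING ACUTE ACCENT: 3 bytes in UTF-8. */
static const char O_ACUTE_DECOMPOSED[] = "o\xCC\x81";

/* Byte-wise equality, which is all that strcmp can offer. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here same_bytes(O_ACUTE_PRECOMPOSED, O_ACUTE_DECOMPOSED) is false even though both render as ó.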
You could do something like
if (sizeof("ó") > 2) ...
If this is just one char the length of your string is 2, one for the character and one for the terminating 0. Otherwise if it doesn't fit the compiler will allocate a longer sequence.
When you give your source file to the compiler you have to tell it which character encoding you used in your source editor (the source charset). My guess is that it is UTF-8, which encodes ó as 0xC3 0xB3. This seems to be going right.
But 'ó' then becomes an integer with a value outside your char range (see your <limits.h>). Therefore the warning on the == between them.
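One way around the warning, assuming the string really is UTF-8, is to compare byte sequences instead of a single char against a character constant. The helper name below is invented for the sketch, and the bytes for ó are spelled out as escapes so the result does not depend on the source file's encoding.

```c
#include <string.h>

/* Hypothetical helper: does the string s begin with the byte sequence seq? */
int utf8_starts_with(const char *s, const char *seq)
{
    return strncmp(s, seq, strlen(seq)) == 0;
}
```

With char str[] = "Le\xC3\xB3n"; the test becomes utf8_starts_with(str + 2, "\xC3\xB3"). The offset 2 is a byte offset, which only coincides with the character index here because the first two letters are ASCII.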
BTW—there is some meaning in "Extended ASCII" but not much. An "Extended ASCII" character set must encode each of its codepoints in one byte. So, UTF-8 is not an encoding for one of the many "Extended ASCII" character sets.
C and POSIX both require only a very limited set of characters be present in the C/POSIX locale, but allow additional characters to exist. This leaves a great deal of freedom to the implementation; for instance, supporting all of Unicode (as UTF-8) in the C locale is conforming behavior. However, most historical implementations treat the C locale as having an "8-bit-clean" single-byte character encoding, either ISO-8859-1 (Latin-1) or a sort of "abstract 8-bit character set" where the non-ASCII bytes are abstract characters with no particular identity. (However, in the latter case, if the compiler defines __STDC_ISO_10646__, they normatively correspond to Unicode characters, usually the Latin-1 range.)
Another conforming option that seems much less popular is to treat all non-ASCII bytes as non-characters, i.e. respond to them with an EILSEQ error.
What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale. Are there implementations where attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?
From your comment to the previous answer:
The ways in which the assumption could be wrong are basically that bytes outside the portable character set could be illegal non-character bytes (EILSEQ) or make up some multibyte encoding (UTF-8 or a stateless legacy CJK encoding)
Here you can find one example.
Plan 9 only supports the "C" locale. As you can see in utf.c and rune.c, when it finds a rune outside the portable characters, it simply handles it as a character from a different encoding.
Other candidates could be Minix and the *BSD family (insofar as they use Citrus). In the Minix source code I've also found the file command looking for a new encoding when the character size is not 8 bits.
Amusingly, I just found that the most widely-used implementation, glibc, is an example of what I'm looking for. Consider this simple program:
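#include <stdlib.h>
#include <stdio.h>
int main(void)
{
    /* No setlocale() call, so the program runs in the default "C" locale. */
    wchar_t wc = 0;
    /* Try to convert the single high byte 0x80 to a wide character. */
    int n = mbtowc(&wc, "\x80", 1);
    printf("%d %.4x\n", n, (unsigned)wc);
    return 0;
}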
On glibc, it prints -1 0000. If the byte 0x80 were an extended character in the implementation's C/POSIX locale, it would print 1 followed by some nonzero character number.
Thus, the "common knowledge" that the C/POSIX locale is "8-bit-clean" on glibc is simply false. What's going on is that there's a gross inconsistency; despite the fact that all the standard utilities, regular expression matching, etc. are specified to operate on (multibyte) characters as if read by mbrtowc, the implementations of these utilities/functions are taking a shortcut when they see MB_CUR_MAX==1 or LC_CTYPE containing "C" (or similar) and reading char values directly instead of processing input with mbrtowc or similar. This is leading to an inconsistency between the specified behavior (which, as their implementation of the C/POSIX locale is defined, would have to treat high bytes as illegal sequences) and the implementation behavior (which is bypassing the locale system entirely).
With all that said, I am still looking for other implementations with the properties requested in the question.
"What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale."
This question is very difficult to answer because it mixes the "C Locale", which I'm assuming refers to the C Standard limited character set mentioned above, with "other unusual options", which I'm assuming refers to how the specific implementation handles characters outside the (limited) C locale. Every C Implementation must implement the C Locale; I don't think there are any unusual options surrounding that.
Let's assume for argument that the question is: "...unusual options in implementing additional/extended characters beyond the C locale." Now this becomes an implementation-dependent question, and as you have already mentioned, it "leaves a great deal of freedom to the implementation." So without knowing the target compiler/hardware, it would still be difficult to answer definitively.
Now the last part:
"...attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?"
Instead of converting high bytes while in the C Locale, you might be able to set the Locale in your program as in this SO Question: Does the underlying character set depend only on the C implementation?
This way you can ensure that your characters will be treated in the Locale that you expect.
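As a rough sketch of what setting the locale looks like in practice: setlocale() with LC_CTYPE selects the multibyte encoding, and MB_CUR_MAX then tells you the maximum number of bytes per character in that encoding. The helper name ctype_locale_max_bytes is made up for this example, and any locale name other than "C", "POSIX" and "" is implementation-defined, so check what your system actually provides.

```c
#include <locale.h>
#include <stdlib.h>

/* ctype_locale_max_bytes is a hypothetical helper name.
 * Selects `name` as the LC_CTYPE locale ("" means: take it from the
 * environment) and returns MB_CUR_MAX for that locale, or -1 if the
 * implementation does not provide the requested locale. */
int ctype_locale_max_bytes(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;              /* previous locale stays in effect */
    return (int)MB_CUR_MAX;     /* max bytes per multibyte character */
}
```

Calling ctype_locale_max_bytes("") early in main picks up the user's environment. A result of 1 means a single-byte encoding; a larger value means a multibyte encoding such as UTF-8 or a CJK encoding is in effect.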
It is my understanding that the C Locale only concerns itself with the first 7 bits (of an 8-bit char type), based on the sources below:
http://www.cprogramming.com/tutorial/unicode.html
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
http://www.in-ulm.de/~mascheck/locale/
The terms "high bytes" and "Unicode" and "UTF-8" are in the class of multi-byte or wide-character encodings, and are very locale specific (and beyond the range of the minimal C Locale). I'm not clear on how it would be possible to "convert high bytes" in the (pure) C Locale. It's quite possible that implementations would pick a default (extended) locale if none was explicitly set (or pull it from the OS environment settings as stated in one of the links above).
The POSIX standard is quite clear in this regard.
The introduction to character sets in POSIX.1-2017 says:
6.2 Character Encoding
The POSIX locale shall contain 256 single-byte characters including the characters in Portable Character Set and Non-Portable Control Characters, which have the properties listed in LC_CTYPE. It is unspecified whether characters not listed in those two tables are classified as punct or cntrl, or neither. Other locales shall contain the characters in Portable Character Set and may contain any or all of the control characters identified in Non-Portable Control Characters; the presence, meaning, and representation of any additional characters are locale-specific.
(emphasis mine)
The page for mbtowc() says:
The mbtowc() function shall fail if:
[EILSEQ]
An invalid character sequence is detected. In the POSIX locale an [EILSEQ] error cannot occur since all byte values are valid characters.
Note that the POSIX locale is defined to be identical to the C locale.
So if an operating system conforms to POSIX, mbtowc is a no-op in the POSIX locale. Characters 128–255 are passed through just as characters 0–127 are. Implementations that operate differently are in violation of the standard.
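If you want to see what a particular libc actually does, a small probe like the following may help. It is only a sketch: count_single_byte_chars is a name invented for this example, and it simply counts how many byte values in a range mbtowc() accepts as a complete one-byte character in the "C" locale.

```c
#include <locale.h>
#include <stdlib.h>

/* Count how many byte values in lo..hi (inclusive) mbtowc() accepts as a
 * complete single-byte character in the "C" locale.
 * count_single_byte_chars is a hypothetical helper name. */
int count_single_byte_chars(int lo, int hi)
{
    int accepted = 0;
    setlocale(LC_CTYPE, "C");
    for (int b = lo; b <= hi; b++) {
        char buf[1] = { (char)b };
        wchar_t wc;
        (void)mbtowc(NULL, NULL, 0);     /* reset any conversion state */
        int n = mbtowc(&wc, buf, 1);
        /* The null byte is valid but makes mbtowc() return 0, not 1. */
        if (n == 1 || (b == 0 && n == 0))
            accepted++;
    }
    return accepted;
}
```

By the reading above, a conforming implementation should give 128 for count_single_byte_chars(128, 255); a smaller number means some high bytes are being rejected.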
According to the gcc manual, the option -fwide-exec-charset specifies the wide character set of wide string and character constants at compile time.
But what is the wide character set when converting a multi-byte character to a wide character by calling mbtowc() at run time? The POSIX standard says that the character set of multi-byte characters is determined by the LC_CTYPE category of the current locale, but says nothing about the wide character set. I don't have a C standard at hand now so I don't know what the C standard says about this.
Does the gcc option -fwide-exec-charset determine the wide character set used by mbtowc(), just as it does at compile time?
Short answer: the character set used for wide strings gets determined by the characteristics of wchar_t known at compile time. As mbtowc is a library function, this happens when libc is being built.
mbtowc reads a single character from a string encoded in an external charset and writes it out to a wchar_t value able to represent any character. Likewise, mbstowcs converts an externally encoded C string into a simple array of wchar_t. From the system's point of view, it doesn't make sense to specify the "charset" of the resulting wide character/string, because changing its output encoding in any way would break the usage of the resulting wide string as an array of wchar_t.
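To make that concrete, here is a minimal sketch of a wrapper around mbstowcs. The name to_wide and its buffer convention are assumptions of this example; the input is interpreted in whatever LC_CTYPE locale is current, and the output is just an array of wchar_t.

```c
#include <stdlib.h>

/* Convert a multibyte string (in the current LC_CTYPE encoding) into a
 * null-terminated wide string in dst, which has room for cap elements.
 * Returns the number of wide characters stored, not counting the
 * terminator, or (size_t)-1 if cap is 0 or the input is not valid.
 * to_wide is a hypothetical helper name. */
size_t to_wide(wchar_t *dst, size_t cap, const char *src)
{
    if (cap == 0)
        return (size_t)-1;
    /* Leave one slot free so the result can always be terminated. */
    size_t n = mbstowcs(dst, src, cap - 1);
    if (n == (size_t)-1) {
        dst[0] = L'\0';         /* invalid multibyte sequence in src */
        return n;
    }
    dst[n] = L'\0';
    return n;
}
```

Input longer than the buffer is silently truncated here; a real program would probably want to detect that case instead.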
You can describe mbstowcs as producing fixed-width Unicode encodings such as UCS-2 or UCS-4 (or more precisely UTF-16 or UTF-32) if the wide chars correspond to ISO 10646 code points, and depending on the width of wchar_t. You can also describe it as little-endian or big-endian depending on the endianness of the processor's representation of wchar_t. But those are properties of the platform, which you can't change at run-time any more than you can change endianness, or ASCII to EBCDIC.
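Since these are compile-time properties, you can query them from the program itself. A small sketch, with helper names made up for this example: sizeof gives the width of wchar_t, and the standard macro __STDC_ISO_10646__ is defined when the implementation promises that wchar_t values are ISO 10646 code points.

```c
#include <limits.h>
#include <stddef.h>

/* Width of wchar_t in bits on this platform, fixed at compile time.
 * wchar_bits is a hypothetical helper name. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Returns 1 if the implementation defines __STDC_ISO_10646__, i.e. it
 * promises that wchar_t values are ISO 10646 (Unicode) code points,
 * and 0 if it makes no such promise.
 * wchar_is_iso10646 is a hypothetical helper name. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

A result of 0 from wchar_is_iso10646() does not mean wchar_t is not Unicode on that platform, only that the implementation does not promise it.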
-fwide-exec-charset serves to explicitly specify to the compiler the charset that corresponds to the internal representation of array-of-wchar_t. This is useful when it differs from the representation the compiler would normally generate (because you are cross-compiling, or because the compiler was misconfigured). This is why the manual goes on to warn that "you will have problems with encodings that do not fit exactly in wchar_t."