What prerequisites are needed to do strict Unicode programming?
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
And what is the role played by multibyte character sequences in this scenario?
C99 or earlier
The C standard (C99) provides for wide characters and multi-byte characters, but since there is no guarantee about what those wide characters can hold, their value is somewhat limited. For a given implementation, they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.
Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU - International Components for Unicode - library) is sound, IMO.
The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.
Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.
One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format - see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are multi-byte formats.
UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of characters in a string reliably, starting at any point in the string. Single-byte characters have the high bit set to zero. Multi-byte characters have a first byte starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), with subsequent bytes always starting 10. These continuation bytes are always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the shortest possible form. One consequence of these rules, together with the U+10FFFF upper limit, is that the bytes 0xC0 and 0xC1 (also 0xF5..0xFF) cannot appear in valid UTF-8 data.
U+0000  .. U+007F     1 byte    0xxx xxxx
U+0080  .. U+07FF     2 bytes   110x xxxx   10xx xxxx
U+0800  .. U+FFFF     3 bytes   1110 xxxx   10xx xxxx   10xx xxxx
U+10000 .. U+10FFFF   4 bytes   1111 0xxx   10xx xxxx   10xx xxxx   10xx xxxx
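To make the table concrete, here is a minimal sketch of an encoder that follows those four rows. The function name utf8_encode is my own invention (it is not part of any standard library), and I have chosen to reject surrogates and values above U+10FFFF by returning 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint_least32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp <= 0x7F) {                       /* 0xxx xxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110x xxxx 10xx xxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110 xxxx + 2 continuation bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 1111 0xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

For example, U+20AC (the euro sign) comes out as the three bytes 0xE2 0x82 0xAC.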
Originally, it was hoped that Unicode would be a 16-bit code set and everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.
UTF-16 thus is a single unit (16-bit word) code set for the 'Basic Multilingual Plane', meaning the characters with Unicode code points U+0000 .. U+FFFF, but uses two units (32 bits) for characters outside this range. Thus, code that works with the UTF-16 encoding must be able to handle variable width encodings, just like UTF-8 must. The code units that make up such a two-unit character are called surrogates.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
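As an illustration of how a surrogate pair is built and taken apart, here is a small sketch. The names utf16_encode and utf16_combine are hypothetical, not from any library; the arithmetic is the usual one: subtract 0x10000, then put the top 10 bits into the high surrogate and the bottom 10 bits into the low surrogate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * itself a surrogate or lies above U+10FFFF. */
size_t utf16_encode(uint_least32_t cp, uint_least16_t out[2])
{
    assert(out != NULL);
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    if (cp <= 0xFFFF) {                     /* BMP: a single unit */
        out[0] = (uint_least16_t)cp;
        return 1;
    }
    cp -= 0x10000;                          /* leaves a 20-bit value */
    out[0] = (uint_least16_t)(0xD800 | (cp >> 10));    /* high (leading) surrogate */
    out[1] = (uint_least16_t)(0xDC00 | (cp & 0x3FF));  /* low (trailing) surrogate */
    return 2;
}

/* Hypothetical helper: combine a high and a low surrogate into a code point. */
uint_least32_t utf16_combine(uint_least16_t hi, uint_least16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000 + (((uint_least32_t)(hi - 0xD800) << 10)
                      | (uint_least32_t)(lo - 0xDC00));
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00, and combining that pair gives U+1F600 back.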
UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.
You can find a lot more information at the ICU and Unicode web sites.
C11 and <uchar.h>
The C11 standard changed the rules, but not all implementations have caught up with the changes even now (mid-2017). The C11 standard summarizes the changes for Unicode support as:
Unicode characters and strings (<uchar.h>) (originally specified in ISO/IEC TR 19769:2004)
What follows is a bare minimal outline of the functionality. The specification includes:
6.4.3 Universal character names
Syntax
universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad
hex-quad:
    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
7.28 Unicode utilities <uchar.h>
The header <uchar.h> declares types and functions for manipulating Unicode characters.
The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (also described in 7.20.1.2).
(Translating the cross-references: <stddef.h> defines size_t, <wchar.h> defines mbstate_t, and <stdint.h> defines uint_least16_t and uint_least32_t.)
The <uchar.h> header also defines a minimal set of (restartable) conversion functions:
mbrtoc16()
c16rtomb()
mbrtoc32()
c32rtomb()
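As a rough sketch of how one of these restartable functions is used, the following counts the characters in a multibyte string by decoding it with mbrtoc32(). The helper name mb_count_chars is made up for illustration. It assumes your platform actually ships <uchar.h>, and the result depends on the current LC_CTYPE locale (so you would call setlocale() first to get UTF-8 decoding where the platform offers such a locale).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: count the characters in a NUL-terminated multibyte
 * string, decoding with mbrtoc32() under the current LC_CTYPE locale.
 * Returns (size_t)-1 if the string is invalid or ends mid-character. */
size_t mb_count_chars(const char *s)
{
    assert(s != NULL);
    mbstate_t state;
    memset(&state, 0, sizeof state);        /* initial conversion state */
    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0) {
        char32_t c32;
        size_t n = mbrtoc32(&c32, s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;              /* invalid or incomplete sequence */
        if (n == (size_t)-3) {              /* character produced, no input consumed */
            count++;
            continue;
        }
        if (n == 0)
            break;                          /* decoded a NUL; not expected here */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

In the default "C" locale this is only guaranteed to handle plain ASCII, which is one reason a library such as ICU is attractive when you need predictable behaviour.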
There are rules about which Unicode characters can be used in identifiers using the \unnnn or \U00nnnnnn notations. You may have to explicitly enable support for such characters in identifiers. For example, older versions of GCC require -fextended-identifiers to allow them.
Note that macOS Sierra (10.12.5), to name but one platform, does not support <uchar.h>.
Note that this is not about "strict unicode programming" per se, but some practical experience.
What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).
Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
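I can't share our actual wrappers, but a minimal sketch of the idea looks like the following. The names utf8_copy_truncate and utf8_count_code_points are invented for this example, the code assumes the input is already valid UTF-8, and it only protects code point boundaries (it knows nothing about combining characters or graphemes).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xx xxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical strncpy() replacement: copy src into dst (capacity dst_size
 * bytes, terminator included), truncating if needed but never in the
 * middle of a UTF-8 sequence. Always NUL-terminates when dst_size > 0.
 * Returns the number of bytes copied, excluding the terminator. */
size_t utf8_copy_truncate(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);
    if (dst_size == 0)
        return 0;
    size_t len = strlen(src);
    if (len > dst_size - 1) {
        len = dst_size - 1;
        /* src[len] is the first byte that will be dropped. If it is a
         * continuation byte, the cut is inside a sequence, so back up
         * to the lead byte and drop the whole sequence. */
        while (len > 0 && utf8_is_continuation((unsigned char)src[len]))
            len--;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}

/* Hypothetical helper: count code points by counting non-continuation bytes. */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (!utf8_is_continuation((unsigned char)*s))
            n++;
    return n;
}
```

With a 4-byte destination and the 5-byte string "a", euro sign, "b", the copy keeps only "a" rather than leaving a broken partial euro sign behind.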
When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).
We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).
This FAQ is a wealth of info. Between that page and this article by Joel Spolsky, you'll have a good start.
One conclusion I came to along the way:
wchar_t is 16 bits on Windows, but not necessarily 16 bits on other platforms. I think it's a necessary evil on Windows, but probably can be avoided elsewhere. The reason it's important on Windows is that you need it to use files that have non-ASCII characters in the name (along with the W version of functions).
Note that Windows APIs that take wchar_t strings expect UTF-16 encoding. Note also that this is different than UCS-2. Take note of surrogate pairs. This test page has enlightening tests.
If you're programming on Windows, you can't simply pass UTF-8 file names to fopen(), since it takes a char * that is interpreted in the current ANSI code page rather than as UTF-8; you need the wide-character variant _wfopen() instead. Makes portability painful.
To do strict Unicode programming:
Only use string APIs that are Unicode aware (NOT strlen, strcpy, ... but their wide-string counterparts wcslen, wcscpy, ...)
When dealing with a block of text, use an encoding that allows storing Unicode chars (UTF-8, UTF-16, UTF-32, ...) without loss. Note that UCS-2 only covers the Basic Multilingual Plane, so it is not lossless for all of Unicode.
Check that your OS default character set is Unicode compatible (e.g. UTF-8)
Use fonts that are Unicode compatible (e.g. arial_unicode)
Multi-byte character sequences are an encoding approach that pre-dates the UTF-16 encoding (the one normally used with wchar_t on Windows), and it seems to me to be rather Windows-only.
I've never heard of wint_t.
The most important thing is to always make a clear distinction between text and binary data. Try to follow the model of Python 3.x str vs. bytes or SQL TEXT vs. BLOB.
Unfortunately, C confuses the issue by using char for both "ASCII character" and int_least8_t. You'll want to do something like:
typedef char UTF8; // for code units of UTF-8 strings
typedef unsigned char BYTE; // for binary data
You might want typedefs for UTF-16 and UTF-32 code units too, but this is more complicated because the encoding of wchar_t is not defined. You'll need to use preprocessor #ifs. Some useful macros in C and C++0x are:
__STDC_UTF_16__: If defined, values of type char16_t are UTF-16 encoded.
__STDC_UTF_32__: If defined, values of type char32_t are UTF-32 encoded.
__STDC_ISO_10646__: If defined, then wchar_t is UTF-32.
_WIN32: On Windows, wchar_t is UTF-16, even though this breaks the standard.
WCHAR_MAX: Can be used to determine the size of wchar_t, but not whether the OS uses it to represent Unicode.
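Something along these lines is what I have in mind for the #ifs. This is only a sketch: the type names utf8_unit, utf16_unit and utf32_unit, the WCHAR_ENCODING macro and the wchar_encoding() function are all names I made up, and the checks only cover the macros listed above.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <wchar.h>    /* WCHAR_MAX */

/* Hypothetical code unit types that do not depend on wchar_t at all. */
typedef unsigned char  utf8_unit;
typedef uint_least16_t utf16_unit;
typedef uint_least32_t utf32_unit;

/* Work out what wchar_t holds on this platform, as far as the macros tell us. */
#if defined(_WIN32)
   static_assert(sizeof(wchar_t) == 2, "expected a 16-bit wchar_t on Windows");
#  define WCHAR_ENCODING "UTF-16"
#elif defined(__STDC_ISO_10646__) && WCHAR_MAX >= 0x10FFFF
#  define WCHAR_ENCODING "UTF-32"
#else
#  define WCHAR_ENCODING "unknown"   /* could be a pre-Unicode encoding */
#endif

/* Report the detected encoding of wchar_t as a string. */
const char *wchar_encoding(void)
{
    return WCHAR_ENCODING;
}
```

Code that needs a definite encoding can then stick to the utf*_unit types and only touch wchar_t at the edges, where the platform API demands it.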
See also:
UTF-8 or UTF-16 or UTF-32 or UCS-2
Is wchar_t needed for Unicode support?
"Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?"
No. UTF-8 is a perfectly valid Unicode encoding that uses char* strings. It has the advantage that if your program is transparent to non-ASCII bytes (e.g., a line ending converter which acts on \r and \n but passes through other characters unchanged), you'll need to make no changes at all!
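To show what "transparent to non-ASCII bytes" means in practice, here is a sketch of such a line ending converter. The function name crlf_to_lf is just for illustration. It only ever compares bytes against \r and \n, and since every byte of a multi-byte UTF-8 sequence is 0x80 or above, those sequences pass through untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: convert CRLF line endings to LF in place.
 * Returns the new length of the data in buf. Lone CR bytes are kept. */
size_t crlf_to_lf(char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n')
            continue;           /* drop the CR; the LF is copied on the next pass */
        buf[out++] = buf[i];    /* everything else, UTF-8 bytes included, is copied as-is */
    }
    return out;
}
```

Nothing in this function had to know about UTF-8, which is exactly the point.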
If you go with UTF-8, you'll need to change all the assumptions that char = character (e.g., don't call toupper in a loop) or char = screen column (e.g., for text wrapping).
If you go with UTF-32, you'll have the simplicity of fixed-width characters (but not fixed-width graphemes), but will need to change the type of all of your strings.
If you go with UTF-16, you'll have to discard both the assumption of fixed-width characters and the assumption of 8-bit code units, which makes this the most difficult upgrade path from single-byte encodings.
I would recommend actively avoiding wchar_t because it's not cross-platform: Sometimes it's UTF-32, sometimes it's UTF-16, and sometimes it's a pre-Unicode East Asian encoding. I'd recommend using typedefs.
Even more importantly, avoid TCHAR.
I wouldn't trust any standard library implementation. Just roll your own unicode types.
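#include <windows.h>
/* Code unit types. These widths hold on Windows, where short is 16 bits and long is 32 bits. */
typedef unsigned char utf8_t;
typedef unsigned short utf16_t;
typedef unsigned long utf32_t;
int main ( int argc, char *argv[] )
{
    int msgBoxId;
    /* Greek alpha, beta, gamma, delta separated by tabs, as UTF-16 code units. */
    utf16_t lpText[] = { 0x03B1, 0x0009, 0x03B2, 0x0009, 0x03B3, 0x0009, 0x03B4, 0x0000 };
    /* Compiled as C on Windows, wchar_t is unsigned short, so a wide literal can initialize a utf16_t array. */
    utf16_t lpCaption[] = L"Greek Characters";
    unsigned int uType = MB_OK;
    /* The W variant of the API expects UTF-16 strings. */
    msgBoxId = MessageBoxW( NULL, lpText, lpCaption, uType );
    return 0;
}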
From what I know, wchar_t is implementation dependent (as can be seen from this wiki article). And it's not unicode.
You basically want to deal with strings in memory as wchar_t arrays instead of char. When you do any kind of I/O (like reading/writing files) you can encode/decode using UTF-8 (this is probably the most common encoding) which is simple enough to implement. Just google the RFCs. So in-memory nothing should be multi-byte. One wchar_t represents one character. When you come to serializing however, that's when you need to encode to something like UTF-8 where some characters are represented by multiple bytes.
You'll also have to write new versions of strcmp etc. for the wide character strings, but this isn't a big issue. The biggest problem will be interop with libraries/existing code that only accept char arrays.
And when it comes to sizeof(wchar_t) (you will need 4 bytes if you want to do it right), you can always define your own wider character type with a typedef if wchar_t is too small on your platform.
Related
What prerequisites are needed to do strict Unicode programming?
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
And what is the role played by multibyte character sequences in this scenario?
C99 or earlier
The C standard (C99) provides for wide characters and multi-byte characters, but since there is no guarantee about what those wide characters can hold, their value is somewhat limited. For a given implementation, they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.
Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU - International Components for Unicode - library) is sound, IMO.
The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.
Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.
One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format - see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are multi-byte formats.
UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of characters in a string reliably, starting at any point in the string. Single-byte characters have the high-bit set to zero. Multi-byte characters have the first character starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), with subsequent bytes always starting 10. The continuation characters are always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the minimum possible format. One consequence of these rules is that the bytes 0xC0 and 0xC1 (also 0xF5..0xFF) cannot appear in valid UTF-8 data.
U+0000 .. U+007F 1 byte 0xxx xxxx
U+0080 .. U+07FF 2 bytes 110x xxxx 10xx xxxx
U+0800 .. U+FFFF 3 bytes 1110 xxxx 10xx xxxx 10xx xxxx
U+10000 .. U+10FFFF 4 bytes 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
Originally, it was hoped that Unicode would be a 16-bit code set and everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.
UTF-16 thus is a single unit (16-bit word) code set for the 'Basic Multilingual Plane', meaning the characters with Unicode code points U+0000 .. U+FFFF, but uses two units (32-bits) for characters outside this range. Thus, code that works with the UTF-16 encoding must be able to handle variable width encodings, just like UTF-8 must. The codes for the double-unit characters are called surrogates.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.
You can find a lot more information at the ICU and Unicode web sites.
C11 and <uchar.h>
The C11 standard changed the rules, but not all implementations have caught up with the changes even now (mid-2017). The C11 standard summarizes the changes for Unicode support as:
Unicode characters and strings (<uchar.h>) (originally specified in
ISO/IEC TR 19769:2004)
What follows is a bare minimal outline of the functionality. The specification includes:
6.4.3 Universal character names
Syntax
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
hex-quad:
hexadecimal-digit hexadecimal-digit
hexadecimal-digit hexadecimal-digit
7.28 Unicode utilities <uchar.h>
The header <uchar.h> declares types and functions for manipulating Unicode characters.
The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (also described in 7.20.1.2).
(Translating the cross-references: <stddef.h> defines size_t,
<wchar.h> defines mbstate_t,
and <stdint.h> defines uint_least16_t and uint_least32_t.)
The <uchar.h> header also defines a minimal set of (restartable) conversion functions:
mbrtoc16()
c16rtomb()
mbrtoc32()
c32rtomb()
There are rules about which Unicode characters can be used in identifiers using the \unnnn or \U00nnnnnn notations. You may have to actively activate the support for such characters in identifiers. For example, GCC requires -fextended-identifiers to allow these in identifiers.
Note that macOS Sierra (10.12.5), to name but one platform, does not support <uchar.h>.
Note that this is not about "strict unicode programming" per se, but some practical experience.
What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).
Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).
We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).
This FAQ is a wealth of info. Between that page and this article by Joel Spolsky, you'll have a good start.
One conclusion I came to along the way:
wchar_t is 16 bits on Windows, but not necessarily 16 bits on other platforms. I think it's a necessary evil on Windows, but probably can be avoided elsewhere. The reason it's important on Windows is that you need it to use files that have non-ASCII characters in the name (along with the W version of functions).
Note that Windows APIs that take wchar_t strings expect UTF-16 encoding. Note also that this is different than UCS-2. Take note of surrogate pairs. This test page has enlightening tests.
If you're programming on Windows, you can't use fopen(), fread(), fwrite(), etc. since they only take char * and don't understand UTF-8 encoding. Makes portability painful.
To do strict Unicode programming:
Only use string APIs that are Unicode aware (NOT strlen, strcpy, ... but their widestring counterparts wstrlen, wsstrcpy, ...)
When dealing with a block of text, use an encoding that allows storing Unicode chars (utf-7, utf-8, utf-16, ucs-2, ...) without loss.
Check that your OS default character set is Unicode compatible (ex: utf-8)
Use fonts that are Unicode compatible (e.g. arial_unicode)
Multi-byte character sequences is an encoding that pre-dates the UTF-16 encoding (the one used normally with wchar_t) and it seems to me it is rather Windows-only.
I've never heard of wint_t.
The most important thing is to always make a clear distinction between text and binary data. Try to follow the model of Python 3.x str vs. bytes or SQL TEXT vs. BLOB.
Unfortunately, C confuses the issue by using char for both "ASCII character" and int_least8_t. You'll want to do something like:
typedef char UTF8; // for code units of UTF-8 strings
typedef unsigned char BYTE; // for binary data
You might want typedefs for UTF-16 and UTF-32 code units too, but this is more complicated because the encoding of wchar_t is not defined. You'll need to just a preprocessor #ifs. Some useful macros in C and C++0x are:
__STDC_UTF_16__ — If defined, the type _Char16_t exists and is UTF-16.
__STDC_UTF_32__ — If defined, the type _Char32_t exists and is UTF-32.
__STDC_ISO_10646__ — If defined, then wchar_t is UTF-32.
_WIN32 — On Windows, wchar_t is UTF-16, even though this breaks the standard.
WCHAR_MAX — Can be used to determine the size of wchar_t, but not whether the OS uses it to represent Unicode.
Does this imply that my code should
not use char types anywhere and that
functions need to be used that can
deal with wint_t and wchar_t?
See also:
UTF-8 or UTF-16 or UTF-32 or UCS-2
Is wchar_t needed for Unicode support?
No. UTF-8 is a perfectly valid Unicode encoding that uses char* strings. It has the advantage that if your program is transparent to non-ASCII bytes (e.g., a line ending converter which acts on \r and \n but passes through other characters unchanged), you'll need to make no changes at all!
If you go with UTF-8, you'll need to change all the assumptions that char = character (e.g., don't call toupper in a loop) or char = screen column (e.g., for text wrapping).
If you go with UTF-32, you'll have the simplicity of fixed-width characters (but not fixed-width graphemes, but will need to change the type of all of your strings).
If you go with UTF-16, you'll have to discard both the assumption of fixed-width characters and the assumption of 8-bit code units, which makes this the most difficult upgrade path from single-byte encodings.
I would recommend actively avoiding wchar_t because it's not cross-platform: Sometimes it's UTF-32, sometimes it's UTF-16, and sometimes its a pre-Unicode East Asian encoding. I'd recommend using typedefs
Even more importantly, avoid TCHAR.
I wouldn't trust any standard library implementation. Just roll your own unicode types.
#include <windows.h>
typedef unsigned char utf8_t;
typedef unsigned short utf16_t;
typedef unsigned long utf32_t;
int main ( int argc, char *argv[] )
{
int msgBoxId;
utf16_t lpText[] = { 0x03B1, 0x0009, 0x03B2, 0x0009, 0x03B3, 0x0009, 0x03B4, 0x0000 };
utf16_t lpCaption[] = L"Greek Characters";
unsigned int uType = MB_OK;
msgBoxId = MessageBoxW( NULL, lpText, lpCaption, uType );
return 0;
}
From what I know, wchar_t is implementation dependent (as can be seen from this wiki article). And it's not unicode.
You basically want to deal with strings in memory as wchar_t arrays instead of char. When you do any kind of I/O (like reading/writing files) you can encode/decode using UTF-8 (this is probably the most common encoding) which is simple enough to implement. Just google the RFCs. So in-memory nothing should be multi-byte. One wchar_t represents one character. When you come to serializing however, that's when you need to encode to something like UTF-8 where some characters are represented by multiple bytes.
You'll also have to write new versions of strcmp etc. for the wide character strings, but this isn't a big issue. The biggest problem will be interop with libraries/existing code that only accept char arrays.
And when it comes to sizeof(wchar_t) (you will need 4 bytes if you want to do it right) you can always redefine it to a larger size with typedef/macro hacks if you need to.
What prerequisites are needed to do strict Unicode programming?
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
And what is the role played by multibyte character sequences in this scenario?
C99 or earlier
The C standard (C99) provides for wide characters and multi-byte characters, but since there is no guarantee about what those wide characters can hold, their value is somewhat limited. For a given implementation, they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.
Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU - International Components for Unicode - library) is sound, IMO.
The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.
Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.
One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format - see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are multi-byte formats.
UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of characters in a string reliably, starting at any point in the string. Single-byte characters have the high-bit set to zero. Multi-byte characters have the first character starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), with subsequent bytes always starting 10. The continuation characters are always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the minimum possible format. One consequence of these rules is that the bytes 0xC0 and 0xC1 (also 0xF5..0xFF) cannot appear in valid UTF-8 data.
U+0000 .. U+007F 1 byte 0xxx xxxx
U+0080 .. U+07FF 2 bytes 110x xxxx 10xx xxxx
U+0800 .. U+FFFF 3 bytes 1110 xxxx 10xx xxxx 10xx xxxx
U+10000 .. U+10FFFF 4 bytes 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
Originally, it was hoped that Unicode would be a 16-bit code set and everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.
UTF-16 thus is a single unit (16-bit word) code set for the 'Basic Multilingual Plane', meaning the characters with Unicode code points U+0000 .. U+FFFF, but uses two units (32-bits) for characters outside this range. Thus, code that works with the UTF-16 encoding must be able to handle variable width encodings, just like UTF-8 must. The codes for the double-unit characters are called surrogates.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.
You can find a lot more information at the ICU and Unicode web sites.
C11 and <uchar.h>
The C11 standard changed the rules, but not all implementations have caught up with the changes even now (mid-2017). The C11 standard summarizes the changes for Unicode support as:
Unicode characters and strings (<uchar.h>) (originally specified in
ISO/IEC TR 19769:2004)
What follows is a bare minimal outline of the functionality. The specification includes:
6.4.3 Universal character names
Syntax
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
hex-quad:
hexadecimal-digit hexadecimal-digit
hexadecimal-digit hexadecimal-digit
7.28 Unicode utilities <uchar.h>
The header <uchar.h> declares types and functions for manipulating Unicode characters.
The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (also described in 7.20.1.2).
(Translating the cross-references: <stddef.h> defines size_t,
<wchar.h> defines mbstate_t,
and <stdint.h> defines uint_least16_t and uint_least32_t.)
The <uchar.h> header also defines a minimal set of (restartable) conversion functions:
mbrtoc16()
c16rtomb()
mbrtoc32()
c32rtomb()
There are rules about which Unicode characters can be used in identifiers using the \unnnn or \U00nnnnnn notations. You may have to actively activate the support for such characters in identifiers. For example, GCC requires -fextended-identifiers to allow these in identifiers.
Note that macOS Sierra (10.12.5), to name but one platform, does not support <uchar.h>.
Note that this is not about "strict unicode programming" per se, but some practical experience.
What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).
Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).
We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).
This FAQ is a wealth of info. Between that page and this article by Joel Spolsky, you'll have a good start.
One conclusion I came to along the way:
wchar_t is 16 bits on Windows, but not necessarily 16 bits on other platforms. I think it's a necessary evil on Windows, but probably can be avoided elsewhere. The reason it's important on Windows is that you need it to use files that have non-ASCII characters in the name (along with the W version of functions).
Note that Windows APIs that take wchar_t strings expect UTF-16 encoding. Note also that this is different than UCS-2. Take note of surrogate pairs. This test page has enlightening tests.
If you're programming on Windows, you can't use fopen(), fread(), fwrite(), etc. since they only take char * and don't understand UTF-8 encoding. Makes portability painful.
To do strict Unicode programming:
Only use string APIs that are Unicode aware (NOT strlen, strcpy, ... but their widestring counterparts wstrlen, wsstrcpy, ...)
When dealing with a block of text, use an encoding that allows storing Unicode chars (utf-7, utf-8, utf-16, ucs-2, ...) without loss.
Check that your OS default character set is Unicode compatible (ex: utf-8)
Use fonts that are Unicode compatible (e.g. arial_unicode)
Multi-byte character sequences is an encoding that pre-dates the UTF-16 encoding (the one used normally with wchar_t) and it seems to me it is rather Windows-only.
I've never heard of wint_t.
The most important thing is to always make a clear distinction between text and binary data. Try to follow the model of Python 3.x str vs. bytes or SQL TEXT vs. BLOB.
Unfortunately, C confuses the issue by using char for both "ASCII character" and int_least8_t. You'll want to do something like:
typedef char UTF8; // for code units of UTF-8 strings
typedef unsigned char BYTE; // for binary data
You might want typedefs for UTF-16 and UTF-32 code units too, but this is more complicated because the encoding of wchar_t is not defined. You'll need to just a preprocessor #ifs. Some useful macros in C and C++0x are:
__STDC_UTF_16__: If defined, values of type char16_t (declared in <uchar.h>) are UTF-16 encoded.
__STDC_UTF_32__: If defined, values of type char32_t (declared in <uchar.h>) are UTF-32 encoded.
__STDC_ISO_10646__: If defined, wchar_t holds Unicode code points (effectively UTF-32).
_WIN32: On Windows, wchar_t is UTF-16, even though this breaks the standard.
WCHAR_MAX: Can be used to determine the size of wchar_t, but not whether the OS uses it to represent Unicode.
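The macro checks above could be combined roughly as follows. This is only a sketch: the typedef names and the WCHAR_ENCODING macro are invented for this example, and the "unknown" branch is where a real program would have to fall back on its own conversion code.

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical code-unit typedefs that do not depend on wchar_t. */
typedef unsigned char  utf8_unit;
typedef uint_least16_t utf16_unit;
typedef uint_least32_t utf32_unit;

/* Best-effort guess at what wchar_t holds on this implementation. */
#if defined(__STDC_ISO_10646__) && WCHAR_MAX >= 0x10FFFF
#  define WCHAR_ENCODING "UTF-32"
#elif defined(_WIN32)
#  define WCHAR_ENCODING "UTF-16"
#else
#  define WCHAR_ENCODING "unknown"
#endif

/* Returns the guessed wchar_t encoding as a string. */
const char *wchar_encoding_name(void)
{
    return WCHAR_ENCODING;
}
```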
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
See also:
UTF-8 or UTF-16 or UTF-32 or UCS-2
Is wchar_t needed for Unicode support?
No. UTF-8 is a perfectly valid Unicode encoding that uses char* strings. It has the advantage that if your program is transparent to non-ASCII bytes (e.g., a line ending converter which acts on \r and \n but passes through other characters unchanged), you'll need to make no changes at all!
If you go with UTF-8, you'll need to change all the assumptions that char = character (e.g., don't call toupper in a loop) or char = screen column (e.g., for text wrapping).
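As an illustration of why char is not the same as a character in UTF-8, the following sketch counts code points by skipping continuation bytes (those of the form 10xxxxxx). The function name is hypothetical, and the code assumes the input is already valid UTF-8; it does no validation.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string.
   Assumes valid UTF-8: every byte that is not a continuation byte
   (10xxxxxx) starts a new code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the two Greek letters "\xCE\xB1\xCE\xB2", strlen reports 4 bytes while this function reports 2 code points. Note that code points are still not screen columns or graphemes.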
If you go with UTF-32, you'll have the simplicity of fixed-width characters (but not fixed-width graphemes), but you will need to change the type of all of your strings.
If you go with UTF-16, you'll have to discard both the assumption of fixed-width characters and the assumption of 8-bit code units, which makes this the most difficult upgrade path from single-byte encodings.
I would recommend actively avoiding wchar_t because it's not cross-platform: sometimes it's UTF-32, sometimes it's UTF-16, and sometimes it's a pre-Unicode East Asian encoding. I'd recommend using typedefs.
Even more importantly, avoid TCHAR.
I wouldn't trust any standard library implementation. Just roll your own unicode types.
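#include <windows.h>
typedef unsigned char utf8_t;
typedef unsigned short utf16_t; /* 16 bits, same width as wchar_t on Windows */
typedef unsigned long utf32_t; /* 32 bits on Windows; may be 64 bits on other platforms */
int main ( int argc, char *argv[] )
{
int msgBoxId;
/* alpha, tab, beta, tab, gamma, tab, delta, then the terminating NUL code unit */
utf16_t lpText[] = { 0x03B1, 0x0009, 0x03B2, 0x0009, 0x03B3, 0x0009, 0x03B4, 0x0000 };
/* compiles as C on Windows, where wchar_t is a typedef for unsigned short */
utf16_t lpCaption[] = L"Greek Characters";
unsigned int uType = MB_OK;
msgBoxId = MessageBoxW( NULL, lpText, lpCaption, uType );
return 0;
}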
From what I know, wchar_t is implementation dependent (as can be seen from this wiki article). And it's not unicode.
You basically want to deal with strings in memory as wchar_t arrays instead of char. When you do any kind of I/O (like reading/writing files) you can encode/decode using UTF-8 (this is probably the most common encoding) which is simple enough to implement. Just google the RFCs. So in-memory nothing should be multi-byte. One wchar_t represents one character. When you come to serializing however, that's when you need to encode to something like UTF-8 where some characters are represented by multiple bytes.
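The encoding step at serialization time is indeed short. Below is a minimal sketch of a UTF-8 encoder for a single code point; the function name and the convention of returning 0 for invalid input are choices made for this example, not something prescribed by the RFC.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Note that this takes a 32-bit code point as input; if wchar_t is only 16 bits wide, surrogate pairs have to be combined before calling it.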
You'll also have to write new versions of strcmp etc. for the wide character strings, but this isn't a big issue. The biggest problem will be interop with libraries/existing code that only accept char arrays.
And when it comes to sizeof(wchar_t) (you will need 4 bytes if you want to do it right) you can always redefine it to a larger size with typedef/macro hacks if you need to.
In C11, support for the portable wide char types char16_t and char32_t is added for UTF-16 and UTF-32 respectively.
However, in the technical report, there is no mention of endianness for these two types.
For example, the following snippet in gcc-4.8.4 on my x86_64 computer when compiled with -std=c11:
#include <stdio.h>
#include <uchar.h>
int main(void)
{
char16_t utf16_str[] = u"十六"; // U+5341 U+516D
unsigned char *chars = (unsigned char *) utf16_str; // view the code units as raw bytes
printf("Bytes: %X %X %X %X\n", chars[0], chars[1], chars[2], chars[3]);
return 0;
}
will produce
Bytes: 41 53 6D 51
Which means that it's little-endian.
But is this behaviour platform/implementation dependent: does it always adhere to the platform's endianness or may some implementation choose to always implement char16_t and char32_t in big-endian?
char16_t and char32_t do not guarantee Unicode encoding. (That is a C++ feature.) The macros __STDC_UTF_16__ and __STDC_UTF_32__, respectively, indicate that Unicode code points actually determine the fixed-size character values. See C11 §6.10.8.2 for these macros.
(By the way, __STDC_ISO_10646__ indicates the same thing for wchar_t, and it also reveals which Unicode edition is implemented via wchar_t. Of course, in practice, the compiler simply copies code points from the source file to strings in the object file, so it doesn't need to know much about particular characters.)
Given that Unicode encoding is in effect, code point values stored in char16_t or char32_t must have the same object representation as uint_least16_t and uint_least32_t, because they are defined to be typedef aliases to those types, respectively (C11 §7.28). This is again somewhat in contrast to C++, which makes those types distinct but explicitly requires compatible object representation.
The upshot is that yes, there is nothing special about char16_t and char32_t. They are ordinary integers in the platform's endianness.
However, your test program has nothing to do with endianness. It simply uses the values of the wide characters without inspecting how they map to bytes in memory.
However, in the technical report, there is no mention of endianness for these two types.
Indeed. The C standard doesn't specify much regarding the representation of multibyte characters in source files.
char16_t utf16_str[] = u"十六"; // U+5341 U+516D
printf("U+%X U+%X\n", utf16_str[0], utf16_str[1]);
will produce
U+5341 U+516D
Which means that it's little-endian.
But is this behaviour platform/implementation dependent: does it always adhere to the platform's endianness or may some implementation choose to always implement char16_t and char32_t in big-endian?
Yes, the behaviour is implementation dependent, as you call it. See C11 §5.1.1.2:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.
That is, whether the multibyte characters in your source code are considered big endian or little endian is implementation-defined. I would advise using something like u"\u5341\u516d", if portability is an issue.
UTF-16 and UTF-32 do not have an endianness defined. They are usually encoded in the host's native byte ordering. This is why there is the byte order mark (BOM), which can be inserted at the beginning of the string to indicate the endianness of a UTF-16 or UTF-32 string.
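For data that leaves the program (files, network), the byte order therefore has to be handled explicitly. Here is a small sketch of detecting a UTF-16 BOM and reading a code unit in a chosen byte order, independent of the host's endianness. The enum and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical result type for BOM detection. */
typedef enum { UTF16_UNKNOWN, UTF16_BE, UTF16_LE } utf16_order;

/* Inspect the first two bytes of a buffer for a UTF-16 BOM (U+FEFF).
   Caveat: FF FE is also the start of the UTF-32LE BOM (FF FE 00 00),
   which this sketch does not try to distinguish. */
utf16_order utf16_detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return UTF16_BE;
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return UTF16_LE;
    }
    return UTF16_UNKNOWN;
}

/* Read one 16-bit code unit from two bytes in the given order.
   Any order other than UTF16_BE is read as little-endian. */
unsigned utf16_read_unit(const unsigned char *p, utf16_order order)
{
    return order == UTF16_BE ? (((unsigned)p[0] << 8) | p[1])
                             : (((unsigned)p[1] << 8) | p[0]);
}
```

With the bytes FF FE 41 53, the BOM indicates little-endian and the following unit reads as U+5341, matching the byte dump in the question.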
I want to print the Unicode character 'SPEAKER WITH THREE SOUND WAVES' (U+1F50A), written as "\uD83D\uDD0A" in C source code, but I get this output:
error: \uDD0A is not a valid universal character
error: \uD83D is not a valid universal character
\u notation (with four hexadecimal digits) can only name code points that fit in 16 bits, as in UCS-2; i.e. you can encode only characters from the BMP (Basic Multilingual Plane, U+0000 through U+FFFF).
U+1F50A is beyond the BMP, and thus cannot be encoded in 16 bits. UTF-16 uses surrogate pairs for such characters beyond the BMP (values in the 0xD800 - 0xDFFF range, which are not used in UCS-2), but they are explicitly forbidden in \u notation.
You need \U notation (with eight hexadecimal digits) for that.
Also note that the conversion from either \u or \U notation to whatever actually ends up in the string is locale-dependent, so what might work on one platform might not work on another... if you want to be really portable and ensure e.g. UTF-8 or UTF-16 encoding in the string, you need to:
do the encoding manually via hexadecimal \x... or octal \...;
use third-party libraries with proper Unicode support (ICU).
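As a sketch of the manual route for this particular character: the UTF-8 form of U+1F50A is the four bytes F0 9F 94 8A, which can be written with \x escapes. The example below also shows the C11 u8 prefix combined with \U, which is specified to produce UTF-8 regardless of locale; it assumes a C11 (or later) compiler, and the variable and function names are made up for illustration.

```c
#include <string.h>

/* U+1F50A written out by hand as UTF-8 bytes. */
const char speaker_hex[] = "\xF0\x9F\x94\x8A";

/* The same character via a C11 u8 literal and \U with eight hex digits. */
const char speaker_u8[] = u8"\U0001F50A";

/* Returns 1 if both spellings produce identical bytes, including the
   terminating NUL. */
int speaker_literals_match(void)
{
    return sizeof speaker_hex == sizeof speaker_u8
        && memcmp(speaker_hex, speaker_u8, sizeof speaker_hex) == 0;
}
```

Whether the character then displays correctly depends on the terminal or file viewer interpreting the output as UTF-8.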
While we're at it (and because many people are unaware of this), the above points straight at why Microsoft's 16bit version of wchar_t is broken when you want Unicode: It stems from a time when there was only the BMP, and 16bit UCS-2 was plenty enough. Since it is no longer sufficient to encode all defined Unicode characters, you can use it to hold UTF-16 code values, but wchar_t -- and by extension, std::wstring as well as L"" string literals -- isn't really wide as the name implies, but multibyte at best.
Good that C++ introduced explicit char16_t and char32_t, plus the locale-independent u"", U"" and u8"" string literals. Too bad MSVC doesn't yet support them AFAIK.
I came across this in the book:
wscanf(L"%lf", &variable);
where the first parameter is of type of wchar_t *.
This is different from scanf("%lf", &variable); where the first parameter is of type char *.
So what is the difference, then? I have never heard of a "wide character string" before. I have heard of something called raw string literals, which print the string as it is (no need for things like escape sequences), but that was not in C.
The exact nature of wide characters is (purposefully) left implementation defined.
When they first invented the concept of wchar_t, ISO 10646 and Unicode were still competing with each other (whereas they now mostly cooperate). Rather than try to decree that an international character would be one or the other (or possibly something else entirely), they simply provided a type (and some functions) that the implementation could define to support international character sets as they chose.
Different implementations have exercised that potential for variation. For example, if you use Microsoft's compiler on Windows, wchar_t will be a 16-bit type holding UTF-16 Unicode (originally it held UCS-2 Unicode, but that's now officially obsolete).
On Linux, wchar_t will more often be a 32-bit type, holding UCS-4/UTF-32 encoded Unicode. Ports of gcc to at least some other operating systems do the same, though I've never tried to confirm that it's always the case.
There is, however, no guarantee of that. At least in theory an implementation on Linux could use 16 bits, or one on Windows could use 32 bits, or either one could decide to use 64 bits (though I'd be a little surprised to see that in reality).
In any case, the general idea of how things are intended to work, is that a single wchar_t is sufficient to represent a code point. For I/O, the data is intended to be converted from the external representation (whatever it is) into wchar_ts, which (is supposed to) make them relatively easy to manipulate. Then during output, they again get transformed into the encoding of your choice (which may be entirely different from the encoding you read).
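The input half of that model can be sketched with the standard mbstowcs function, which converts from the current locale's multibyte encoding to wchar_t. The wrapper name below is hypothetical; a real program would normally call setlocale(LC_ALL, "") first so that the external encoding is the user's rather than that of the default "C" locale.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string (in the current locale's encoding) into
   a wide string. Writes at most cap - 1 wide characters plus a
   terminating L'\0'. Returns the number of wide characters written,
   or (size_t)-1 on an invalid sequence or if cap is 0. */
size_t decode_to_wide(const char *mb, wchar_t *out, size_t cap)
{
    size_t n;
    if (cap == 0)
        return (size_t)-1;
    n = mbstowcs(out, mb, cap - 1);
    if (n == (size_t)-1)
        return n;               /* invalid multibyte sequence */
    out[n] = L'\0';             /* mbstowcs may stop without a terminator */
    return n;
}
```

The reverse direction on output would use wcstombs in the same way.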
"Wide character string" is referring to the encoding of the characters in the string.
From Wikipedia:
A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.
UTF-16 is one of the most commonly used wide character encodings.
Further, wchar_t is defined by Microsoft as an unsigned short (16-bit) data object. The definition can be, and most likely is, different on other operating systems or in other languages.
Taken from the Wikipedia article from the comment below:
"The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."