Related
So I want to parse IDv3.4 file. There are 4 types of text encoding in format specification: ISO-8859-1, UTF-16 with BOM, UTF-16BE and UTF-8. I already written code that can obtains bytes of strings.
And my question is how to print UTF-16 with BOM and UTF-16BE bytes to console.
And also one important condition: I can use only C libraries. I can't use C++ libraries. I even can't use third-party C libraries.
In general (NOT specifically for parsing IDv3.4 files alone) you will want to choose a common character encoding that your code will use internally; then convert from any other character encoding into your chosen character encoding (for input data - e.g. from user or files or network) and convert back again (for output, to user or files or network).
For choosing a common character encoding:
you want something that minimizes "nonconvertible cases" - e.g. you wouldn't want to choose ASCII because there's far too much in far too many other character encodings that can't be converted to ASCII. This mostly means that you'll want a Unicode encoding.
you want something that is convenient. For Unicode encoding, this only really gives you 2 choices - UTF-8 (because you don't have to care about endian issues, and it's relatively efficient for space/memory consumption, and C functions like strlen() can still work) and versions of UTF-32 (because each codepoint takes up a fixed amount of space and it makes conversion a little simpler). Of these, the benefits of UTF-32 are mostly unimportant (unless you're doing a font rendering engine).
the "whatever random who-knows-what" character encoding that the C compiler uses is irrelevant (for both char and w_char), because it's implementation specific and not portable.
the "whatever random who-knows-what" character encoding that the terminal uses is irrelevant (the terminal should be considered "just another flavor of input/output, where conversion is involved").
Assuming you choose UTF-8:
You might be able to force the compiler to treat string literals as UTF-8 for you (e.g. like u8"hello" in C++, except I can't seem to find any sane standard for C). Otherwise you'll need to do it yourself where necessary.
I'd recommend using the uint8_t type for storing strings; partly because char is "signed or unsigned, depending on which way the wind is blowing" (which makes conversions to/from other character encodings painful due to "shifting a signed/negative number right" problems), and partly because it help to find "accidentally used something that isn't UTF-8" bugs (e.g. warnings from compiler about "conversion from signed to unsigned").
Conversion between UTF-8 and UTF-32LE, UTF_32BE, UTF-16LE, UTF_16BE is fairly trivial (the relevant wikipedia articles are enough to describe how it works).
"UTF-16 with BOM" means that the first 2 bytes will tell you if it's UTF-16LE or UTF-16BE, so (after you add support for UTF-16LE and UTF-16BE) it's trivial. "UTF-32 with BOM" is similar (first 4 bytes tell you if it's UTF32-BE or UTF32-BE).
Conversion to/from ISO-8859-1 to UTF-8 is fairly trivial, because the characters match Unicode codepoints with the same value. However, often people get it wrong (e.g. say it's ISO-8859-1 when the data is actually encoded as Windows-1252 instead); and for the conversion from UTF-8 to ISO-8859-1 you will need to deal with "nonconvertible" codepoints.
What prerequisites are needed to do strict Unicode programming?
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
And what is the role played by multibyte character sequences in this scenario?
C99 or earlier
The C standard (C99) provides for wide characters and multi-byte characters, but since there is no guarantee about what those wide characters can hold, their value is somewhat limited. For a given implementation, they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.
Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU - International Components for Unicode - library) is sound, IMO.
The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.
Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.
One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format - see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are multi-byte formats.
UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of characters in a string reliably, starting at any point in the string. Single-byte characters have the high-bit set to zero. Multi-byte characters have the first character starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), with subsequent bytes always starting 10. The continuation characters are always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the minimum possible format. One consequence of these rules is that the bytes 0xC0 and 0xC1 (also 0xF5..0xFF) cannot appear in valid UTF-8 data.
U+0000 .. U+007F 1 byte 0xxx xxxx
U+0080 .. U+07FF 2 bytes 110x xxxx 10xx xxxx
U+0800 .. U+FFFF 3 bytes 1110 xxxx 10xx xxxx 10xx xxxx
U+10000 .. U+10FFFF 4 bytes 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
Originally, it was hoped that Unicode would be a 16-bit code set and everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.
UTF-16 thus is a single unit (16-bit word) code set for the 'Basic Multilingual Plane', meaning the characters with Unicode code points U+0000 .. U+FFFF, but uses two units (32-bits) for characters outside this range. Thus, code that works with the UTF-16 encoding must be able to handle variable width encodings, just like UTF-8 must. The codes for the double-unit characters are called surrogates.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.
You can find a lot more information at the ICU and Unicode web sites.
C11 and <uchar.h>
The C11 standard changed the rules, but not all implementations have caught up with the changes even now (mid-2017). The C11 standard summarizes the changes for Unicode support as:
Unicode characters and strings (<uchar.h>) (originally specified in
ISO/IEC TR 19769:2004)
What follows is a bare minimal outline of the functionality. The specification includes:
6.4.3 Universal character names
Syntax
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
hex-quad:
hexadecimal-digit hexadecimal-digit
hexadecimal-digit hexadecimal-digit
7.28 Unicode utilities <uchar.h>
The header <uchar.h> declares types and functions for manipulating Unicode characters.
The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (also described in 7.20.1.2).
(Translating the cross-references: <stddef.h> defines size_t,
<wchar.h> defines mbstate_t,
and <stdint.h> defines uint_least16_t and uint_least32_t.)
The <uchar.h> header also defines a minimal set of (restartable) conversion functions:
mbrtoc16()
c16rtomb()
mbrtoc32()
c32rtomb()
There are rules about which Unicode characters can be used in identifiers using the \unnnn or \U00nnnnnn notations. You may have to actively activate the support for such characters in identifiers. For example, GCC requires -fextended-identifiers to allow these in identifiers.
Note that macOS Sierra (10.12.5), to name but one platform, does not support <uchar.h>.
Note that this is not about "strict unicode programming" per se, but some practical experience.
What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).
Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).
We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).
This FAQ is a wealth of info. Between that page and this article by Joel Spolsky, you'll have a good start.
One conclusion I came to along the way:
wchar_t is 16 bits on Windows, but not necessarily 16 bits on other platforms. I think it's a necessary evil on Windows, but probably can be avoided elsewhere. The reason it's important on Windows is that you need it to use files that have non-ASCII characters in the name (along with the W version of functions).
Note that Windows APIs that take wchar_t strings expect UTF-16 encoding. Note also that this is different than UCS-2. Take note of surrogate pairs. This test page has enlightening tests.
If you're programming on Windows, you can't use fopen(), fread(), fwrite(), etc. since they only take char * and don't understand UTF-8 encoding. Makes portability painful.
To do strict Unicode programming:
Only use string APIs that are Unicode aware (NOT strlen, strcpy, ... but their widestring counterparts wstrlen, wsstrcpy, ...)
When dealing with a block of text, use an encoding that allows storing Unicode chars (utf-7, utf-8, utf-16, ucs-2, ...) without loss.
Check that your OS default character set is Unicode compatible (ex: utf-8)
Use fonts that are Unicode compatible (e.g. arial_unicode)
Multi-byte character sequences is an encoding that pre-dates the UTF-16 encoding (the one used normally with wchar_t) and it seems to me it is rather Windows-only.
I've never heard of wint_t.
The most important thing is to always make a clear distinction between text and binary data. Try to follow the model of Python 3.x str vs. bytes or SQL TEXT vs. BLOB.
Unfortunately, C confuses the issue by using char for both "ASCII character" and int_least8_t. You'll want to do something like:
typedef char UTF8; // for code units of UTF-8 strings
typedef unsigned char BYTE; // for binary data
You might want typedefs for UTF-16 and UTF-32 code units too, but this is more complicated because the encoding of wchar_t is not defined. You'll need to just a preprocessor #ifs. Some useful macros in C and C++0x are:
__STDC_UTF_16__ — If defined, the type _Char16_t exists and is UTF-16.
__STDC_UTF_32__ — If defined, the type _Char32_t exists and is UTF-32.
__STDC_ISO_10646__ — If defined, then wchar_t is UTF-32.
_WIN32 — On Windows, wchar_t is UTF-16, even though this breaks the standard.
WCHAR_MAX — Can be used to determine the size of wchar_t, but not whether the OS uses it to represent Unicode.
Does this imply that my code should
not use char types anywhere and that
functions need to be used that can
deal with wint_t and wchar_t?
See also:
UTF-8 or UTF-16 or UTF-32 or UCS-2
Is wchar_t needed for Unicode support?
No. UTF-8 is a perfectly valid Unicode encoding that uses char* strings. It has the advantage that if your program is transparent to non-ASCII bytes (e.g., a line ending converter which acts on \r and \n but passes through other characters unchanged), you'll need to make no changes at all!
If you go with UTF-8, you'll need to change all the assumptions that char = character (e.g., don't call toupper in a loop) or char = screen column (e.g., for text wrapping).
If you go with UTF-32, you'll have the simplicity of fixed-width characters (but not fixed-width graphemes, but will need to change the type of all of your strings).
If you go with UTF-16, you'll have to discard both the assumption of fixed-width characters and the assumption of 8-bit code units, which makes this the most difficult upgrade path from single-byte encodings.
I would recommend actively avoiding wchar_t because it's not cross-platform: Sometimes it's UTF-32, sometimes it's UTF-16, and sometimes its a pre-Unicode East Asian encoding. I'd recommend using typedefs
Even more importantly, avoid TCHAR.
I wouldn't trust any standard library implementation. Just roll your own unicode types.
#include <windows.h>
typedef unsigned char utf8_t;
typedef unsigned short utf16_t;
typedef unsigned long utf32_t;
int main ( int argc, char *argv[] )
{
int msgBoxId;
utf16_t lpText[] = { 0x03B1, 0x0009, 0x03B2, 0x0009, 0x03B3, 0x0009, 0x03B4, 0x0000 };
utf16_t lpCaption[] = L"Greek Characters";
unsigned int uType = MB_OK;
msgBoxId = MessageBoxW( NULL, lpText, lpCaption, uType );
return 0;
}
From what I know, wchar_t is implementation dependent (as can be seen from this wiki article). And it's not unicode.
You basically want to deal with strings in memory as wchar_t arrays instead of char. When you do any kind of I/O (like reading/writing files) you can encode/decode using UTF-8 (this is probably the most common encoding) which is simple enough to implement. Just google the RFCs. So in-memory nothing should be multi-byte. One wchar_t represents one character. When you come to serializing however, that's when you need to encode to something like UTF-8 where some characters are represented by multiple bytes.
You'll also have to write new versions of strcmp etc. for the wide character strings, but this isn't a big issue. The biggest problem will be interop with libraries/existing code that only accept char arrays.
And when it comes to sizeof(wchar_t) (you will need 4 bytes if you want to do it right) you can always redefine it to a larger size with typedef/macro hacks if you need to.
What prerequisites are needed to do strict Unicode programming?
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
And what is the role played by multibyte character sequences in this scenario?
C99 or earlier
The C standard (C99) provides for wide characters and multi-byte characters, but since there is no guarantee about what those wide characters can hold, their value is somewhat limited. For a given implementation, they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.
Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU - International Components for Unicode - library) is sound, IMO.
The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.
Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.
One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format - see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are multi-byte formats.
UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of characters in a string reliably, starting at any point in the string. Single-byte characters have the high-bit set to zero. Multi-byte characters have the first character starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), with subsequent bytes always starting 10. The continuation characters are always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the minimum possible format. One consequence of these rules is that the bytes 0xC0 and 0xC1 (also 0xF5..0xFF) cannot appear in valid UTF-8 data.
U+0000 .. U+007F 1 byte 0xxx xxxx
U+0080 .. U+07FF 2 bytes 110x xxxx 10xx xxxx
U+0800 .. U+FFFF 3 bytes 1110 xxxx 10xx xxxx 10xx xxxx
U+10000 .. U+10FFFF 4 bytes 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
Originally, it was hoped that Unicode would be a 16-bit code set and everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.
UTF-16 thus is a single unit (16-bit word) code set for the 'Basic Multilingual Plane', meaning the characters with Unicode code points U+0000 .. U+FFFF, but uses two units (32-bits) for characters outside this range. Thus, code that works with the UTF-16 encoding must be able to handle variable width encodings, just like UTF-8 must. The codes for the double-unit characters are called surrogates.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.
You can find a lot more information at the ICU and Unicode web sites.
C11 and <uchar.h>
The C11 standard changed the rules, but not all implementations have caught up with the changes even now (mid-2017). The C11 standard summarizes the changes for Unicode support as:
Unicode characters and strings (<uchar.h>) (originally specified in
ISO/IEC TR 19769:2004)
What follows is a bare minimal outline of the functionality. The specification includes:
6.4.3 Universal character names
Syntax
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
hex-quad:
hexadecimal-digit hexadecimal-digit
hexadecimal-digit hexadecimal-digit
7.28 Unicode utilities <uchar.h>
The header <uchar.h> declares types and functions for manipulating Unicode characters.
The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (also described in 7.20.1.2).
(Translating the cross-references: <stddef.h> defines size_t,
<wchar.h> defines mbstate_t,
and <stdint.h> defines uint_least16_t and uint_least32_t.)
The <uchar.h> header also defines a minimal set of (restartable) conversion functions:
mbrtoc16()
c16rtomb()
mbrtoc32()
c32rtomb()
There are rules about which Unicode characters can be used in identifiers using the \unnnn or \U00nnnnnn notations. You may have to actively activate the support for such characters in identifiers. For example, GCC requires -fextended-identifiers to allow these in identifiers.
Note that macOS Sierra (10.12.5), to name but one platform, does not support <uchar.h>.
Note that this is not about "strict unicode programming" per se, but some practical experience.
What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).
Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).
We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).
This FAQ is a wealth of info. Between that page and this article by Joel Spolsky, you'll have a good start.
One conclusion I came to along the way:
wchar_t is 16 bits on Windows, but not necessarily 16 bits on other platforms. I think it's a necessary evil on Windows, but probably can be avoided elsewhere. The reason it's important on Windows is that you need it to use files that have non-ASCII characters in the name (along with the W version of functions).
Note that Windows APIs that take wchar_t strings expect UTF-16 encoding. Note also that this is different than UCS-2. Take note of surrogate pairs. This test page has enlightening tests.
If you're programming on Windows, you can't use fopen(), fread(), fwrite(), etc. since they only take char * and don't understand UTF-8 encoding. Makes portability painful.
To do strict Unicode programming:
Only use string APIs that are Unicode aware (NOT strlen, strcpy, ... but their widestring counterparts wstrlen, wsstrcpy, ...)
When dealing with a block of text, use an encoding that allows storing Unicode chars (utf-7, utf-8, utf-16, ucs-2, ...) without loss.
Check that your OS default character set is Unicode compatible (ex: utf-8)
Use fonts that are Unicode compatible (e.g. arial_unicode)
Multi-byte character sequences is an encoding that pre-dates the UTF-16 encoding (the one used normally with wchar_t) and it seems to me it is rather Windows-only.
I've never heard of wint_t.
The most important thing is to always make a clear distinction between text and binary data. Try to follow the model of Python 3.x str vs. bytes or SQL TEXT vs. BLOB.
Unfortunately, C confuses the issue by using char for both "ASCII character" and int_least8_t. You'll want to do something like:
typedef char UTF8; // for code units of UTF-8 strings
typedef unsigned char BYTE; // for binary data
You might want typedefs for UTF-16 and UTF-32 code units too, but this is more complicated because the encoding of wchar_t is not defined. You'll need to just a preprocessor #ifs. Some useful macros in C and C++0x are:
__STDC_UTF_16__ — If defined, the type _Char16_t exists and is UTF-16.
__STDC_UTF_32__ — If defined, the type _Char32_t exists and is UTF-32.
__STDC_ISO_10646__ — If defined, then wchar_t is UTF-32.
_WIN32 — On Windows, wchar_t is UTF-16, even though this breaks the standard.
WCHAR_MAX — Can be used to determine the size of wchar_t, but not whether the OS uses it to represent Unicode.
Does this imply that my code should
not use char types anywhere and that
functions need to be used that can
deal with wint_t and wchar_t?
See also:
UTF-8 or UTF-16 or UTF-32 or UCS-2
Is wchar_t needed for Unicode support?
No. UTF-8 is a perfectly valid Unicode encoding that uses char* strings. It has the advantage that if your program is transparent to non-ASCII bytes (e.g., a line ending converter which acts on \r and \n but passes through other characters unchanged), you'll need to make no changes at all!
If you go with UTF-8, you'll need to change all the assumptions that char = character (e.g., don't call toupper in a loop) or char = screen column (e.g., for text wrapping).
If you go with UTF-32, you'll have the simplicity of fixed-width characters (but not fixed-width graphemes, but will need to change the type of all of your strings).
If you go with UTF-16, you'll have to discard both the assumption of fixed-width characters and the assumption of 8-bit code units, which makes this the most difficult upgrade path from single-byte encodings.
I would recommend actively avoiding wchar_t because it's not cross-platform: Sometimes it's UTF-32, sometimes it's UTF-16, and sometimes its a pre-Unicode East Asian encoding. I'd recommend using typedefs
Even more importantly, avoid TCHAR.
I wouldn't trust any standard library implementation. Just roll your own unicode types.
#include <windows.h>
typedef unsigned char utf8_t;
typedef unsigned short utf16_t;
typedef unsigned long utf32_t;
int main ( int argc, char *argv[] )
{
int msgBoxId;
utf16_t lpText[] = { 0x03B1, 0x0009, 0x03B2, 0x0009, 0x03B3, 0x0009, 0x03B4, 0x0000 };
utf16_t lpCaption[] = L"Greek Characters";
unsigned int uType = MB_OK;
msgBoxId = MessageBoxW( NULL, lpText, lpCaption, uType );
return 0;
}
From what I know, wchar_t is implementation dependent (as can be seen from this wiki article). And it's not unicode.
You basically want to deal with strings in memory as wchar_t arrays instead of char. When you do any kind of I/O (like reading/writing files) you can encode/decode using UTF-8 (this is probably the most common encoding) which is simple enough to implement. Just google the RFCs. So in-memory nothing should be multi-byte. One wchar_t represents one character. When you come to serializing however, that's when you need to encode to something like UTF-8 where some characters are represented by multiple bytes.
You'll also have to write new versions of strcmp etc. for the wide character strings, but this isn't a big issue. The biggest problem will be interop with libraries/existing code that only accept char arrays.
And when it comes to sizeof(wchar_t) (you will need 4 bytes if you want to do it right) you can always redefine it to a larger size with typedef/macro hacks if you need to.
This intrigues me, so I'm going to ask - for what reason is wchar_t not used so widely on Linux/Linux-like systems as it is on Windows? Specifically, the Windows API uses wchar_t internally whereas I believe Linux does not and this is reflected in a number of open source packages using char types.
My understanding is that given a character c which requires multiple bytes to represent it, then in a char[] form c is split over several parts of char* whereas it forms a single unit in wchar_t[]. Is it not easier, then, to use wchar_t always? Have I missed a technical reason that negates this difference? Or is it just an adoption problem?
wchar_t is a wide character with platform-defined width, which doesn't really help much.
UTF-8 characters span 1-4 bytes per character. UCS-2, which spans exactly 2 bytes per character, is now obsolete and can't represent the full Unicode character set.
Linux applications that support Unicode tend to do so properly, above the byte-wise storage layer. Windows applications tend to make this silly assumption that only two bytes will do.
wchar_t's Wikipedia article briefly touches on this.
The first people to use UTF-8 on a Unix-based platform explained:
The Unicode Standard [then at version 1.1]
defines an
adequate character set but an
unreasonable representation [UCS-2]. It states
that all characters are 16 bits wide [no longer true]
and are communicated and stored in 16-bit units.
It also reserves a pair
of characters (hexadecimal FFFE and
FEFF) to detect byte order in
transmitted text, requiring state in
the byte stream. (The Unicode
Consortium was thinking of files, not
pipes.) To adopt this encoding, we
would have had to convert all text
going into and out of Plan 9 between
ASCII and Unicode, which cannot be
done. Within a single program, in
command of all its input and output,
it is possible to define characters as
16-bit quantities; in the context of a
networked system with hundreds of
applications on diverse machines by
different manufacturers [italics mine], it is
impossible.
The italicized part is less relevant to Windows systems, which have a preference towards monolithic applications (Microsoft Office), non-diverse machines (everything's an x86 and thus little-endian), and a single OS vendor.
And the Unix philosophy of having small, single-purpose programs means fewer of them need to do serious character manipulation.
The source for our tools and
applications had already been
converted to work with Latin-1, so it
was ‘8-bit safe’, but the conversion
to the Unicode Standard and UTF[-8] is
more involved. Some programs needed no
change at all: cat, for instance,
interprets its argument strings,
delivered in UTF[-8], as file names
that it passes uninterpreted to the
open system call, and then just copies
bytes from its input to its output; it
never makes decisions based on the
values of the bytes...Most programs,
however, needed modest change.
...Few tools actually need to operate
on runes [Unicode code points]
internally; more typically they need
only to look for the final slash in a
file name and similar trivial tasks.
Of the 170 C source programs...only 23
now contain the word Rune.
The programs that do store runes
internally are mostly those whose
raison d’être is character
manipulation: sam (the text editor),
sed, sort, tr, troff, 8½ (the window
system and terminal emulator), and so
on. To decide whether to compute using
runes or UTF-encoded byte strings
requires balancing the cost of
converting the data when read and
written against the cost of converting
relevant text on demand. For programs
such as editors that run a long time
with a relatively constant dataset,
runes are the better choice...
UTF-32, with code points directly accessible, is indeed more convenient if you need character properties like categories and case mappings.
But widechars are awkward to use on Linux for the same reason that UTF-8 is awkward to use on Windows. GNU libc has no _wfopen or _wstat function.
UTF-8, being compatible to ASCII, makes it possible to ignore Unicode somewhat.
Often, programs don't care (and in fact, don't need to care) about what the input is, as long as there is not a \0 that could terminate strings. See:
char buf[whatever];
printf("Your favorite pizza topping is which?\n");
fgets(buf, sizeof(buf), stdin); /* Jalapeños */
printf("%s it shall be.\n", buf);
The only times when I found I needed Unicode support is when I had to have a multibyte character as a single unit (wchar_t); e.g. when having to count the number of characters in a string, rather than bytes. iconv from utf-8 to wchar_t will quickly do that. For bigger issues like zero-width spaces and combining diacritics, something more heavy like icu is needed—but how often do you do that anyway?
wchar_t is not the same size on all platforms. On Windows it is a UTF-16 code unit that uses two bytes. On other platforms it typically uses 4 bytes (for UCS-4/UTF-32). It is therefore unlikely that these platforms would standardize on using wchar_t, since it would waste a lot of space.
What prerequisites are needed to do strict Unicode programming?
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
And what is the role played by multibyte character sequences in this scenario?
C99 or earlier
The C standard (C99) provides for wide characters and multi-byte characters, but since there is no guarantee about what those wide characters can hold, their value is somewhat limited. For a given implementation, they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.
Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU - International Components for Unicode - library) is sound, IMO.
The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.
Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.
One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format - see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are multi-byte formats.
UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of characters in a string reliably, starting at any point in the string. Single-byte characters have the high-bit set to zero. Multi-byte characters have the first character starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), with subsequent bytes always starting 10. The continuation characters are always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the minimum possible format. One consequence of these rules is that the bytes 0xC0 and 0xC1 (also 0xF5..0xFF) cannot appear in valid UTF-8 data.
U+0000 .. U+007F 1 byte 0xxx xxxx
U+0080 .. U+07FF 2 bytes 110x xxxx 10xx xxxx
U+0800 .. U+FFFF 3 bytes 1110 xxxx 10xx xxxx 10xx xxxx
U+10000 .. U+10FFFF 4 bytes 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
Originally, it was hoped that Unicode would be a 16-bit code set and everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.
UTF-16 thus is a single unit (16-bit word) code set for the 'Basic Multilingual Plane', meaning the characters with Unicode code points U+0000 .. U+FFFF, but uses two units (32-bits) for characters outside this range. Thus, code that works with the UTF-16 encoding must be able to handle variable width encodings, just like UTF-8 must. The codes for the double-unit characters are called surrogates.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.
You can find a lot more information at the ICU and Unicode web sites.
C11 and <uchar.h>
The C11 standard changed the rules, but not all implementations have caught up with the changes even now (mid-2017). The C11 standard summarizes the changes for Unicode support as:
Unicode characters and strings (<uchar.h>) (originally specified in
ISO/IEC TR 19769:2004)
What follows is a bare minimal outline of the functionality. The specification includes:
6.4.3 Universal character names
Syntax
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
hex-quad:
hexadecimal-digit hexadecimal-digit
hexadecimal-digit hexadecimal-digit
7.28 Unicode utilities <uchar.h>
The header <uchar.h> declares types and functions for manipulating Unicode characters.
The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (also described in 7.20.1.2).
(Translating the cross-references: <stddef.h> defines size_t,
<wchar.h> defines mbstate_t,
and <stdint.h> defines uint_least16_t and uint_least32_t.)
The <uchar.h> header also defines a minimal set of (restartable) conversion functions:
mbrtoc16()
c16rtomb()
mbrtoc32()
c32rtomb()
There are rules about which Unicode characters can be used in identifiers using the \unnnn or \U00nnnnnn notations. You may have to actively activate the support for such characters in identifiers. For example, GCC requires -fextended-identifiers to allow these in identifiers.
Note that macOS Sierra (10.12.5), to name but one platform, does not support <uchar.h>.
Note that this is not about "strict unicode programming" per se, but some practical experience.
What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).
Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).
We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).
This FAQ is a wealth of info. Between that page and this article by Joel Spolsky, you'll have a good start.
One conclusion I came to along the way:
wchar_t is 16 bits on Windows, but not necessarily 16 bits on other platforms. I think it's a necessary evil on Windows, but probably can be avoided elsewhere. The reason it's important on Windows is that you need it to use files that have non-ASCII characters in the name (along with the W version of functions).
Note that Windows APIs that take wchar_t strings expect UTF-16 encoding. Note also that this is different than UCS-2. Take note of surrogate pairs. This test page has enlightening tests.
If you're programming on Windows, you can't use fopen(), fread(), fwrite(), etc. since they only take char * and don't understand UTF-8 encoding. Makes portability painful.
To do strict Unicode programming:
Only use string APIs that are Unicode aware (NOT strlen, strcpy, ... but their widestring counterparts wstrlen, wsstrcpy, ...)
When dealing with a block of text, use an encoding that allows storing Unicode chars (utf-7, utf-8, utf-16, ucs-2, ...) without loss.
Check that your OS default character set is Unicode compatible (ex: utf-8)
Use fonts that are Unicode compatible (e.g. arial_unicode)
Multi-byte character sequences is an encoding that pre-dates the UTF-16 encoding (the one used normally with wchar_t) and it seems to me it is rather Windows-only.
I've never heard of wint_t.
The most important thing is to always make a clear distinction between text and binary data. Try to follow the model of Python 3.x str vs. bytes or SQL TEXT vs. BLOB.
Unfortunately, C confuses the issue by using char for both "ASCII character" and int_least8_t. You'll want to do something like:
typedef char UTF8; // for code units of UTF-8 strings
typedef unsigned char BYTE; // for binary data
You might want typedefs for UTF-16 and UTF-32 code units too, but this is more complicated because the encoding of wchar_t is not defined. You'll need to just a preprocessor #ifs. Some useful macros in C and C++0x are:
__STDC_UTF_16__ — If defined, the type _Char16_t exists and is UTF-16.
__STDC_UTF_32__ — If defined, the type _Char32_t exists and is UTF-32.
__STDC_ISO_10646__ — If defined, then wchar_t is UTF-32.
_WIN32 — On Windows, wchar_t is UTF-16, even though this breaks the standard.
WCHAR_MAX — Can be used to determine the size of wchar_t, but not whether the OS uses it to represent Unicode.
Does this imply that my code should
not use char types anywhere and that
functions need to be used that can
deal with wint_t and wchar_t?
See also:
UTF-8 or UTF-16 or UTF-32 or UCS-2
Is wchar_t needed for Unicode support?
No. UTF-8 is a perfectly valid Unicode encoding that uses char* strings. It has the advantage that if your program is transparent to non-ASCII bytes (e.g., a line ending converter which acts on \r and \n but passes through other characters unchanged), you'll need to make no changes at all!
If you go with UTF-8, you'll need to change all the assumptions that char = character (e.g., don't call toupper in a loop) or char = screen column (e.g., for text wrapping).
If you go with UTF-32, you'll have the simplicity of fixed-width characters (but not fixed-width graphemes, but will need to change the type of all of your strings).
If you go with UTF-16, you'll have to discard both the assumption of fixed-width characters and the assumption of 8-bit code units, which makes this the most difficult upgrade path from single-byte encodings.
I would recommend actively avoiding wchar_t because it's not cross-platform: Sometimes it's UTF-32, sometimes it's UTF-16, and sometimes its a pre-Unicode East Asian encoding. I'd recommend using typedefs
Even more importantly, avoid TCHAR.
I wouldn't trust any standard library implementation. Just roll your own unicode types.
#include <windows.h>
typedef unsigned char utf8_t;
typedef unsigned short utf16_t;
typedef unsigned long utf32_t;
int main ( int argc, char *argv[] )
{
int msgBoxId;
utf16_t lpText[] = { 0x03B1, 0x0009, 0x03B2, 0x0009, 0x03B3, 0x0009, 0x03B4, 0x0000 };
utf16_t lpCaption[] = L"Greek Characters";
unsigned int uType = MB_OK;
msgBoxId = MessageBoxW( NULL, lpText, lpCaption, uType );
return 0;
}
From what I know, wchar_t is implementation dependent (as can be seen from this wiki article). And it's not unicode.
You basically want to deal with strings in memory as wchar_t arrays instead of char. When you do any kind of I/O (like reading/writing files) you can encode/decode using UTF-8 (this is probably the most common encoding) which is simple enough to implement. Just google the RFCs. So in-memory nothing should be multi-byte. One wchar_t represents one character. When you come to serializing however, that's when you need to encode to something like UTF-8 where some characters are represented by multiple bytes.
You'll also have to write new versions of strcmp etc. for the wide character strings, but this isn't a big issue. The biggest problem will be interop with libraries/existing code that only accept char arrays.
And when it comes to sizeof(wchar_t) (you will need 4 bytes if you want to do it right) you can always redefine it to a larger size with typedef/macro hacks if you need to.