How does gcc decide the wide character set when calling `mbtowc()`?

According to the gcc manual, the option -fwide-exec-charset specifies the wide character set of wide string and character constants at compile time.
But what is the wide character set when converting a multi-byte character to a wide character by calling mbtowc() at run time? The POSIX standard says that the character set of multi-byte characters is determined by the LC_CTYPE category of the current locale, but says nothing about the wide character set. I don't have a C standard at hand now so I don't know what the C standard says about this.
Does the gcc option -fwide-exec-charset determine the wide character set used by mbtowc(), just as it does at compile time?

Short answer: the character set used for wide strings gets determined by the characteristics of wchar_t known at compile time. As mbtowc is a library function, this happens when libc is being built.
mbtowc reads a single character from a string encoded in an external charset and writes it out to a wchar_t value able to represent any character. Likewise, mbstowcs converts an externally encoded C string into a simple array of wchar_t. From the system's point of view, it doesn't make sense to specify the "charset" of the resulting wide character/string, because changing its output encoding in any way would break the usage of the resulting wide string as array of wchar_t.
You can describe mbstowcs as producing fixed-width Unicode encodings such as UCS-2 or UCS-4 (or more precisely UTF-16 or UTF-32) if the wide chars correspond to ISO 10646 code points, and depending on the width of wchar_t. You can also describe it as little-endian or big-endian depending on the endianness of the processor's representation of wchar_t. But those are properties of the platform, which you can't change at run time any more than you can change endianness, or ASCII to EBCDIC.
-fwide-exec-charset serves to explicitly tell the compiler which charset corresponds to the internal representation of an array of wchar_t. This is useful when it differs from the representation the compiler would normally generate (because you are cross-compiling, or because the compiler was misconfigured). This is why the manual goes on to warn that "you will have problems with encodings that do not fit exactly in wchar_t."
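As a rough illustration of the point above, here is a minimal sketch in which only the input side of mbtowc is governed by the locale, while the output is simply a wchar_t value. The helper name decode_first is made up for this example and is not part of any library.

```c
#include <stdlib.h>
#include <wchar.h>

/* Decode the first multibyte character of s under the current LC_CTYPE
   locale. Returns the number of bytes consumed, 0 for the null character,
   or -1 if the bytes are not valid in the locale's encoding.
   (decode_first is a hypothetical helper name.) */
int decode_first(const char *s, wchar_t *out)
{
    mbtowc(NULL, NULL, 0);              /* reset any shift state */
    return mbtowc(out, s, MB_CUR_MAX);  /* look at one character at most */
}
```

Calling setlocale(LC_CTYPE, "") before decode_first changes how the input bytes are interpreted. Nothing at run time changes what the resulting wchar_t value means; that is fixed by the platform.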

Related

Converting to wide characters on z/OS

I have the following code on Linux:-
rc = iconv_open("WCHAR_T", SourceCode);
prior to using iconv to convert the data into a wide character string (wchar_t).
I am now compiling this on z/OS. I do not know what value to use in place of "WCHAR_T". I have found that codepages are represented by 5-digit character strings on z/OS, e.g Codepage 500 would be "00500", so I am happy enough with what to put into my SourceCode variable above, I just can't find a value that will successfully work as the first parameter to iconv_open.
wchar_t is 4 bytes long on z/OS (when compiling 64-bit as I am), so I assume that I would need some variant of an EBCDIC equivalent to UTF32 or UCS4 perhaps, but I cannot find something that works. Every combination I have tried to date has failed with an errno of 121 (EINVAL: The parameter is incorrect).
If anyone familiar with how the above code works on Linux could give a summary of what it does, that might also help. What does it mean to iconv into "WCHAR_T"? Is this a combination perhaps, of some data conversion and additionally a type change to wchar_t?
Alternatively, can anyone answer the question, "What is the internal representation of wchar_t on z/OS?"
wchar_t is an implementation defined data type. On z/OS it is 2 bytes in 31-bit mode and 4 bytes in 64-bit mode.
There is no single representation of wchar_t on z/OS. The encoding associated with the wchar_t data is dependent on the locale in which the application is running. It could be an IBM-939 Japanese DBCS code page or any of the other DBCS code pages that are used in countries like China, Korea, etc.
Wide string literals and character constants i.e. those defined as L"abc" or L'x' are converted to the implementation defined encoding used to implement wchar_t data type. This encoding is locale sensitive and can be manipulated using wide character run time library functions.
The conversion of multi byte string literals to wide string literals is typically done by calling one of the mbtowc run time library functions which respect the encoding associated with the locale in which the application is running.
iconv on the other hand can be used to convert any string literals to any one of the supported destination code pages including double byte code pages or any of the Unicode formats (UTF8, UTF16, UTF32). The operation of iconv is independent of wchar_t type.
Universal coded character set converters may be the answer to your question.
The closest to Unicode on z/OS would be UTF-EBCDIC but it requires defining locales that are based on UTF-EBCDIC.
If running as an ASCII application is an option, you could use UTF-32 as the internal encoding and provide iconv converters to/from any of the EBCDIC code pages your application needs to support. This would be better served by char32_t data type to avoid opacity of wchar_t.
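To make the Linux side of the question concrete, here is a small sketch of what converting into "WCHAR_T" amounts to with glibc's iconv: the output bytes are written in the layout of the platform's wchar_t, so the destination buffer can be a wchar_t object. The function name utf8_char_to_wchar is hypothetical, and the "WCHAR_T" encoding name is an assumption that holds for glibc but, judging by the question, not for z/OS.

```c
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Convert one UTF-8 encoded character into a wchar_t via iconv.
   Returns 0 on success, -1 on failure. "WCHAR_T" is an encoding name
   accepted by glibc's iconv; other platforms may not recognise it.
   (utf8_char_to_wchar is a hypothetical helper name.) */
int utf8_char_to_wchar(const char *utf8, size_t len, wchar_t *out)
{
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;                      /* encoding pair not supported */

    char *in = (char *)utf8;            /* iconv takes a non-const pointer */
    size_t in_left = len;
    char *dst = (char *)out;            /* write straight into the wchar_t */
    size_t dst_left = sizeof *out;

    size_t rc = iconv(cd, &in, &in_left, &dst, &dst_left);
    iconv_close(cd);

    /* Success only if exactly one wchar_t worth of output was produced. */
    return (rc == (size_t)-1 || dst_left != 0) ? -1 : 0;
}
```

On a platform where wchar_t holds ISO 10646 code points, passing the two bytes 0xC3 0xB3 should store 0xF3 in the output. On z/OS a named target code page would have to take the place of "WCHAR_T".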

How to check if a character is an extended ascii character in C?

Every time I do something similar to the condition below I get a multi-character constant warning.
char str[] = "León";
if(str[2] == 'ó') printf("true\n");
How can I solve this?
Unless the encoding on your platform is such that 'ó' can fit into a char, 'ó' is a multi-character constant. It seems to be the latter on your platform, judging by the message you get. The values of multi-character constants are implementation defined. In other words, the choice of numeric value is up to the implementation, with some constraints (e.g. it must be outside the char range on your platform).
Sadly in your case when you write char str[] = "León";, the third element will be converted to a char, using a narrowing conversion, or decomposed into more than one char and concatenated to the char[] array. So attempts to compare it to 'ó' will be futile.
If you want to use the extended ASCII characters, use their octal value.
I am using the table at http://www.asciitable.com/ and I guess the value you require is 162 (decimal) = 242 (octal). So use str[] = "Le\242n";
And use the same in the comparison.
You'll need to use the wchar_t type, or a unicode library. wchar_t is infamous for having many gotchas and easy bugs to hit, but it is the best primitive type available to C++ compilers.
You need to use variants of everything that support wchar_t, such as std::wcout or wprintf.
EDIT: wchar_t has been replaced by char16_t and char32_t. The Unicode Standard 4.0 suggests their use whenever code must be portable between platforms, because wchar_t varies in size depending on platform (like int does).
I recommend finding a good unicode library to handle comparison between the many characters that are made of multiple codepoints!
The other option is to stick entirely to the native char type which is generally interpreted as some locale-specific ASCII.
The ASCII is a 7-bit character coding that numbers characters 0 ... 127. An ASCII-compatible encoding preserves the meanings of these bytes. Any character encoded as c < 0 or c > 127 cannot be an ASCII character. These sometimes can be called by various confusing names such as "Extended ASCII" or alike.
In Unicode, the ASCII characters are still the characters 0 ... 127 of the Unicode codepoint range.
The problem is not so much that ó is an extended character; it is that your source file is actually in UTF-8, and therefore ó is encoded as 2 bytes. char in C stands for the thing generally called a byte elsewhere.
C also supports wide-character strings, where each character is a UTF-16, UCS-2, UTF-32, or some other code point. There your ó would (most probably) be a single wchar_t.
Unfortunately you're opening a can of worms here, because the symbol ó can also be written in Unicode in 2 separate ways: it can be written as one code point ó or as the letter o followed by the combining acute accent: ́; both carry the same semantic information, but they consist of different bytes. And even if converted to wchar_t strings, these would still be different sequences. The C standard library doesn't handle Unicode at all, except in C11, where there is some support for character literals explicitly in UTF-8. The C standard still doesn't present a portable way of converting UTF-8 encoded textual data to wchar_t; neither can it do normalizations such as ó to o ́ or vice versa.
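A short sketch of the two spellings, assuming a compiler that accepts universal character names in wide string literals (C99 and later). The identifiers are invented for the example.

```c
#include <wchar.h>

/* Precomposed form: the single code point U+00F3. */
static const wchar_t O_ACUTE_COMPOSED[] = L"\u00F3";
/* Decomposed form: 'o' followed by U+0301 COMBINING ACUTE ACCENT. */
static const wchar_t O_ACUTE_DECOMPOSED[] = L"o\u0301";

/* Element-by-element equality, which is all the C library offers here;
   it performs no Unicode normalization.
   (same_code_points is a hypothetical helper name.) */
int same_code_points(const wchar_t *a, const wchar_t *b)
{
    return wcscmp(a, b) == 0;
}
```

same_code_points(O_ACUTE_COMPOSED, O_ACUTE_DECOMPOSED) reports a mismatch even though both render as the same letter, which is why a Unicode library is needed for this kind of comparison.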
You could do something like
if (sizeof("ó") > 2) ...
If this fits in one char, the size of the literal is 2: one byte for the character and one for the terminating 0. Otherwise the compiler will allocate a longer sequence.
When you give your source file to the compiler, you have to tell it which character encoding you used in your source editor (the source charset). My guess is that it is UTF-8, which encodes ó as 0xC3 0xB3. This seems to be going right.
But 'ó' then becomes an integer with a value outside your char range (see your <limits.h>). Therefore the warning on the == between them.
BTW—there is some meaning in "Extended ASCII" but not much. An "Extended ASCII" character set must encode each of its codepoints in one byte. So, UTF-8 is not an encoding for one of the many "Extended ASCII" character sets.
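Pulling the answers above together, one way to sidestep the multi-character constant is to compare bytes instead of a character literal. This is only a sketch and assumes the string is UTF-8 encoded; the names are invented for illustration.

```c
#include <string.h>

/* UTF-8 encoding of U+00F3 (o with acute accent), spelled as bytes so the
   result does not depend on the source file's encoding. */
static const char O_ACUTE_UTF8[] = "\xC3\xB3";

/* True if the byte is in the 7-bit ASCII range 0..127. */
int is_ascii_byte(char c)
{
    return (unsigned char)c < 0x80;
}

/* True if the UTF-8 string s has U+00F3 starting at byte offset pos.
   The caller must ensure pos <= strlen(s).
   (has_o_acute_at is a hypothetical helper name.) */
int has_o_acute_at(const char *s, size_t pos)
{
    return strncmp(s + pos, O_ACUTE_UTF8, sizeof O_ACUTE_UTF8 - 1) == 0;
}
```

With char str[] = "Le\xC3\xB3n", has_o_acute_at(str, 2) holds and is_ascii_byte(str[2]) does not. A string in a single-byte code page would need a different byte value in place of O_ACUTE_UTF8.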

Behavior of extended bytes/characters in C/POSIX locale

C and POSIX both require only a very limited set of characters be present in the C/POSIX locale, but allow additional characters to exist. This leaves a great deal of freedom to the implementation; for instance, supporting all of Unicode (as UTF-8) in the C locale is conforming behavior. However, most historical implementations treat the C locale as having an "8-bit-clean" single-byte character encoding, either ISO-8859-1 (Latin-1) or a sort of "abstract 8-bit character set" where the non-ASCII bytes are abstract characters with no particular identity. (However, in the latter case, if the compiler defines __STDC_ISO_10646__, they normatively correspond to Unicode characters, usually the Latin-1 range.)
Another conforming option that seems much less popular is to treat all non-ASCII bytes as non-characters, i.e. respond to them with an EILSEQ error.
What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale. Are there implementations where attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?
From your comment to the previous answer:
The ways in which the assumption could be wrong are basically that bytes outside the portable character set could be illegal non-character bytes (EILSEQ) or make up some multibyte encoding (UTF-8 or a stateless legacy CJK encoding)
Here you can find one example.
Plan 9 only supports the "C" locale. As you can see in utf.c and rune.c, when it finds a rune outside the portable characters, it simply handles it as a character from a different encoding.
Other candidates could be Minix and the *BSD family (as far as they use citrus). In the Minix source code I've also found the file command looking for a new encoding when the character size is not 8 bits.
Amusingly, I just found that the most widely-used implementation, glibc, is an example of what I'm looking for. Consider this simple program:
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    wchar_t wc = 0;
    int n = mbtowc(&wc, "\x80", 1);  /* try to convert the single byte 0x80 */
    printf("%d %.4x\n", n, (int)wc);
    return 0;
}
On glibc, it prints -1 0000. If the byte 0x80 were an extended character in the implementation's C/POSIX locale, it would print 1 followed by some nonzero character number.
Thus, the "common knowledge" that the C/POSIX locale is "8-bit-clean" on glibc is simply false. What's going on is that there's a gross inconsistency; despite the fact that all the standard utilities, regular expression matching, etc. are specified to operate on (multibyte) characters as if read by mbrtowc, the implementations of these utilities/functions are taking a shortcut when they see MB_CUR_MAX==1 or LC_CTYPE containing "C" (or similar) and reading char values directly instead of processing input with mbrtowc or similar. This is leading to an inconsistency between the specified behavior (which, as their implementation of the C/POSIX locale is defined, would have to treat high bytes as illegal sequences) and the implementation behavior (which is bypassing the locale system entirely).
With all that said, I am still looking for other implementations with the properties requested in the question.
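For anyone who wants to probe other implementations, here is a sketch of the same experiment using mbrtowc, which separates "invalid" from "incomplete". The enum and function names are made up for this example.

```c
#include <string.h>
#include <wchar.h>

enum byte_class { BYTE_IS_CHARACTER, BYTE_IS_INVALID, BYTE_IS_INCOMPLETE };

/* Classify a single byte under the current LC_CTYPE locale by feeding it
   to mbrtowc with a fresh conversion state.
   (classify_byte is a hypothetical helper name.) */
enum byte_class classify_byte(unsigned char byte)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */

    char c = (char)byte;
    wchar_t wc;
    size_t n = mbrtowc(&wc, &c, 1, &state);

    if (n == (size_t)-1)
        return BYTE_IS_INVALID;         /* errno is set to EILSEQ */
    if (n == (size_t)-2)
        return BYTE_IS_INCOMPLETE;      /* start of a longer sequence */
    return BYTE_IS_CHARACTER;           /* n is 0 (null byte) or 1 */
}
```

Looping over 0x80 through 0xFF without calling setlocale shows which option a given C library takes in the C locale. The answer can differ between libraries and between versions of the same library, so treat any single result as a data point and not as a rule.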
"What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale."
This question is very difficult to answer because it mixes the "C Locale", which I'm assuming refers to the C Standard limited character set mentioned above, with "other unusual options", which I'm assuming refers to how the specific implementation handles characters outside the (limited) C locale. Every C Implementation must implement the C Locale; I don't think there's any unusual options surrounding that.
Let's assume for argument that the question is: "...unusual options in implementing additional/extended characters beyond the C locale." Now this becomes an implementation-dependent question, and as you have already mentioned, it "leaves a great deal of freedom to the implementation." So without knowing the target compiler/hardware, it would still be difficult to answer definitively.
Now the last part:
"...attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?"
Instead of converting high bytes while in the C Locale, you might be able to set the Locale in your program as in this SO Question: Does the underlying character set depend only on the C implementation?
This way you can ensure that your characters will be treated in the Locale that you expect.
It is my understanding that the C Locale only concerns itself with the first 7-bits (of an 8-bit char type), based on the sources below:
http://www.cprogramming.com/tutorial/unicode.html
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
http://www.in-ulm.de/~mascheck/locale/
The terms "high bytes" and "Unicode" and "UTF-8" are in the class of multi-byte or wide-character encodings, and are very locale specific (and beyond the range of the minimal C Locale). I'm not clear on how it would be possible to "convert high bytes" in the (pure) C Locale. It's quite possible that implementations would pick a default (extended) locale if none was explicitly set (or pull it from the OS environment settings as stated in one of the links above).
The POSIX standard is quite clear in this regard.
The introduction to character sets in POSIX.1-2017 says:
6.2 Character Encoding
The POSIX locale shall contain 256 single-byte characters including the characters in Portable Character Set and Non-Portable Control Characters, which have the properties listed in LC_CTYPE. It is unspecified whether characters not listed in those two tables are classified as punct or cntrl, or neither. Other locales shall contain the characters in Portable Character Set and may contain any or all of the control characters identified in Non-Portable Control Characters; the presence, meaning, and representation of any additional characters are locale-specific.
(emphasis mine)
The page for mbtowc() says:
The mbtowc() function shall fail if:
[EILSEQ]
 An invalid character sequence is detected. In the POSIX locale an [EILSEQ] error cannot occur since all byte values are valid characters.
Note that the POSIX locale is defined to be identical to the C locale.
So if an operating system conforms to POSIX, mbtowc is a no-op in the POSIX locale. Characters 128–255 are passed through just as characters 0–127 are. Implementations that operate differently are in violation of the standard.

What is a "wide character string" in C language?

I came across this in the book:
wscanf(L"%lf", &variable);
where the first parameter is of type of wchar_t *.
This is different from scanf("%lf", &variable); where the first parameter is of type char *.
So what is the difference, then? I have never heard of a "wide character string" before. I have heard of something called raw string literals, which print the string as it is (no need for things like escape sequences), but that was not in C.
The exact nature of wide characters is (purposefully) left implementation defined.
When they first invented the concept of wchar_t, ISO 10646 and Unicode were still competing with each other (whereas they now mostly cooperate). Rather than try to decree that an international character would be one or the other (or possibly something else entirely), they simply provided a type (and some functions) that the implementation could define to support international character sets as they chose.
Different implementations have exercised that potential for variation. For example, if you use Microsoft's compiler on Windows, wchar_t will be a 16-bit type holding UTF-16 Unicode (originally it held UCS-2 Unicode, but that's now officially obsolete).
On Linux, wchar_t will more often be a 32-bit type, holding UCS-4/UTF-32 encoded Unicode. Ports of gcc to at least some other operating systems do the same, though I've never tried to confirm that it's always the case.
There is, however, no guarantee of that. At least in theory an implementation on Linux could use 16 bits, or one on Windows could use 32 bits, or either one could decide to use 64 bits (though I'd be a little surprised to see that in reality).
In any case, the general idea of how things are intended to work, is that a single wchar_t is sufficient to represent a code point. For I/O, the data is intended to be converted from the external representation (whatever it is) into wchar_ts, which (is supposed to) make them relatively easy to manipulate. Then during output, they again get transformed into the encoding of your choice (which may be entirely different from the encoding you read).
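The convert, manipulate, convert back model described above might look roughly like this. It is a sketch with a fixed-size buffer, and uppercase_multibyte is a name invented for the example.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Uppercase a multibyte string by going through wchar_t.
   Returns 0 on success, -1 if a conversion fails or a buffer is too small.
   (uppercase_multibyte is a hypothetical helper name.) */
int uppercase_multibyte(const char *in, char *out, size_t out_size)
{
    wchar_t wide[256];                       /* fixed buffer for the sketch */
    size_t n = mbstowcs(wide, in, 256);      /* external -> wide */
    if (n == (size_t)-1 || n >= 256)
        return -1;                           /* invalid input or too long */

    for (size_t i = 0; i < n; i++)           /* one element per character */
        wide[i] = (wchar_t)towupper((wint_t)wide[i]);

    size_t m = wcstombs(out, wide, out_size);  /* wide -> external */
    if (m == (size_t)-1 || m >= out_size)
        return -1;                           /* unencodable or truncated */
    return 0;
}
```

The loop can treat each array element as one character only because of the assumption stated above, that a single wchar_t represents a code point. Where wchar_t holds UTF-16, that assumption breaks for characters outside the Basic Multilingual Plane.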
"Wide character string" is referring to the encoding of the characters in the string.
From Wikipedia:
A wide character is a computer character datatype that generally has a
size greater than the traditional 8-bit character. The increased
datatype size allows for the use of larger coded character sets.
UTF-16 is one of the most commonly used wide character encodings.
Further, wchar_t is defined by Microsoft as an unsigned short (16-bit) data object. This could be, and most likely is, a different definition in other operating systems or languages.
Taken from the Wikipedia article from the comment below:
"The width of wchar_t is compiler-specific and can be as small as 8
bits. Consequently, programs that need to be portable across any C or
C++ compiler should not use wchar_t for storing Unicode text. The
wchar_t type is intended for storing compiler-defined wide characters,
which may be Unicode characters in some compilers."

wcstombs: character encoding?

wcstombs documentation says, it "converts the sequence of wide-character codes to multibyte string". But it never says what is a "wide-character".
Is it implicit, like say it converts utf-16 to utf-8 or the conversion is defined by some environment variable?
Also what is the typical use case of wcstombs?
You use the setlocale() standard function with the LC_CTYPE (or LC_ALL) category to set the mapping the library uses between wchar_t characters and multibyte characters. The actual locale name passed to setlocale() is implementation defined, so you'll need to look it up in your compiler's docs.
For example, with MSVC you might use
setlocale( LC_ALL, ".1252" );
to set the C runtime to use codepage 1252 as the multibyte character set. Note that MSVC docs explicitly indicates that the locale cannot be set to UTF-7 or UTF8 for the multibyte character sets:
The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL.
The "wide-character" wchar_t type is intended to be able to support any character set the system supports - the standard doesn't define the size of a wchar_t type (it could be as small as a char or any of the larger integer types). On Windows it's the system's 'internal' Unicode encoding, which is UTF-16 (UCS-2 before WinXP). Honestly, I can't find a direct quote on that in the MSVC docs, though. Strictly speaking, the implementation should call this out, but I can't find it.
It converts whatever your platform uses for a "wide char" (which I'm led to believe is indeed UCS2 on Windows, but is usually UCS4 on UNIX) into your current locale's default multibyte character encoding. If your locale is a UTF-8 one, then that is the multibyte encoding that will be used - but note that there are other possibilities, like JIS.
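To see the locale dependence directly, one option is to ask how many bytes a wide string would need in the current locale. The sketch below uses wcsrtombs with a null destination, which standard C defines as a length query. The name multibyte_length is hypothetical.

```c
#include <string.h>
#include <wchar.h>

/* Number of bytes ws occupies in the current locale's multibyte encoding,
   excluding the terminator, or (size_t)-1 if some character cannot be
   represented in that encoding.
   (multibyte_length is a hypothetical helper name.) */
size_t multibyte_length(const wchar_t *ws)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);         /* initial conversion state */
    return wcsrtombs(NULL, &ws, 0, &state);  /* NULL dst: count only */
}
```

The same wide string can yield different lengths after different setlocale calls. For instance, L"\u00F3" would be expected to need 2 bytes under a UTF-8 locale, assuming such a locale is installed and wchar_t holds Unicode code points, while other locales may need 1 byte or reject it.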
According to the C standard, wchar_t type is "capable of representing any character in the current locale". The standard doesn't say what the encoding for wchar_t is. In fact, the minimum ranges it requires for WCHAR_MIN and WCHAR_MAX are only [0, 255] or [-127, 127], depending upon whether wchar_t is unsigned or signed.
A multibyte character can use more than one byte. A multibyte string is made of one or more multibyte characters. In a multibyte string, each character need not be of equal number of bytes (UTF-8 is an example). Whereas, an object of type wchar_t has a fixed size (in a given implementation, of course).
As an aside, I can also find the following in my copy of the C99 draft:
__STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month.
So, if I understood correctly, if __STDC_ISO_10646__ is defined, then wchar_t can store Unicode characters.
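A program can check both properties at compile time. This is a small sketch with function names made up for illustration.

```c
#include <limits.h>
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t values are
   ISO 10646 (Unicode) code points, 0 if it makes no such promise.
   (wchar_is_iso10646 is a hypothetical helper name.) */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Width of wchar_t in bits on this implementation.
   (wchar_bits is a hypothetical helper name.) */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}
```

Note that a 0 from wchar_is_iso10646 does not prove wchar_t is something other than Unicode; it only means the implementation does not make the promise.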
Wide character strings are sequences of wchar_t elements, each typically wider than a byte, whereas the normal C string is a char* - a sequence of byte-wide characters. Wchars are not the same thing as Unicode on all platforms, though Unicode representations are typically based on wchar_t.
I've seen wchars used in embedded systems like phones, where you want filenames with special characters but don't necessarily want to support all the glory and complexity of Unicode.
Typical usage would be converting a 2-byte based string to a regular C string, and vice versa.

Resources