Behavior of extended bytes/characters in C/POSIX locale


C and POSIX both require only a very limited set of characters be present in the C/POSIX locale, but allow additional characters to exist. This leaves a great deal of freedom to the implementation; for instance, supporting all of Unicode (as UTF-8) in the C locale is conforming behavior. However, most historical implementations treat the C locale as having an "8-bit-clean" single-byte character encoding, either ISO-8859-1 (Latin-1) or a sort of "abstract 8-bit character set" where the non-ASCII bytes are abstract characters with no particular identity. (However, in the latter case, if the compiler defines __STDC_ISO_10646__, they normatively correspond to Unicode characters, usually the Latin-1 range.)
Another conforming option that seems much less popular is to treat all non-ASCII bytes as non-characters, i.e. respond to them with an EILSEQ error.
What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale. Are there implementations where attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?

From your comment to the previous answer:
The ways in which the assumption could be wrong are basically that bytes outside the portable character set could be illegal non-character bytes (EILSEQ) or make up some multibyte encoding (UTF-8 or a stateless legacy CJK encoding)
Here you can find one example.
Plan 9 only supports the "C" locale. As you can see in utf.c and rune.c, when it finds a rune outside the portable characters, it simply handles it as a character from a different encoding.
Other candidates could be Minix and the *BSD family (insofar as they use citrus). In the Minix source code I've also found the file command looking for a new encoding when the character size is not 8 bits.

Amusingly, I just found that the most widely-used implementation, glibc, is an example of what I'm looking for. Consider this simple program:
#include <stdlib.h>
#include <stdio.h>
int main(void)
{
    wchar_t wc = 0;
    /* Convert the single byte 0x80 in the default ("C") locale. */
    int n = mbtowc(&wc, "\x80", 1);
    printf("%d %.4x\n", n, (unsigned)wc);
    return 0;
}
On glibc, it prints -1 0000. If the byte 0x80 were an extended character in the implementation's C/POSIX locale, it would print 1 followed by some nonzero character number.
Thus, the "common knowledge" that the C/POSIX locale is "8-bit-clean" on glibc is simply false. What's going on is that there's a gross inconsistency; despite the fact that all the standard utilities, regular expression matching, etc. are specified to operate on (multibyte) characters as if read by mbrtowc, the implementations of these utilities/functions are taking a shortcut when they see MB_CUR_MAX==1 or LC_CTYPE containing "C" (or similar) and reading char values directly instead of processing input with mbrtowc or similar. This is leading to an inconsistency between the specified behavior (which, as their implementation of the C/POSIX locale is defined, would have to treat high bytes as illegal sequences) and the implementation behavior (which is bypassing the locale system entirely).
With all that said, I am still looking for other implementations with the properties requested in the question.
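To check a given implementation beyond the single byte 0x80, the same test can be extended to every high byte. The following is a minimal sketch, not taken from any library; the helper names probe_byte and count_rejected_high_bytes are made up for illustration. It assumes the program never calls setlocale(), so the "C" locale is in effect.

```c
#include <string.h>
#include <wchar.h>

/* Decodes the single byte b with mbrtowc() in the current locale.
   Returns mbrtowc()'s result; on success *out holds the wide character. */
size_t probe_byte(unsigned char b, wchar_t *out)
{
    mbstate_t st;
    char in = (char)b;

    memset(&st, 0, sizeof st);  /* start from the initial conversion state */
    return mbrtowc(out, &in, 1, &st);
}

/* Counts the bytes in 0x80..0xFF that do not decode as one complete
   single-byte character (an error or an incomplete multibyte sequence). */
int count_rejected_high_bytes(void)
{
    int rejected = 0;

    for (int b = 0x80; b <= 0xFF; b++) {
        wchar_t wc;
        if (probe_byte((unsigned char)b, &wc) != 1)
            rejected++;
    }
    return rejected;
}
```

A result of 0 would indicate an 8-bit-clean C locale, while an implementation that rejects every high byte, as the test above suggests for glibc, would give 128. Values in between would point to a multibyte encoding such as UTF-8.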

"What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale."
This question is very difficult to answer because it mixes the "C Locale", which I'm assuming refers to the limited character set of the C Standard mentioned above, with "other unusual options", which I'm assuming refers to how the specific implementation handles characters outside the (limited) C locale. Every C implementation must implement the C Locale; I don't think there are any unusual options surrounding that.
Let's assume for argument that the question is: "...unusual options in implementing additional/extended characters beyond the C locale." Now this becomes an implementation-dependent question, and as you have already mentioned, it "leaves a great deal of freedom to the implementation." So without knowing the target compiler/hardware, it would still be difficult to answer definitively.
Now the last part:
"...attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?"
Instead of converting high bytes while in the C Locale, you might be able to set the Locale in your program as in this SO Question: Does the underlying character set depend only on the C implementation?
This way you can ensure that your characters will be treated in the Locale that you expect.
It is my understanding that the C Locale only concerns itself with the first 7 bits (of an 8-bit char type), based on the sources below:
http://www.cprogramming.com/tutorial/unicode.html
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
http://www.in-ulm.de/~mascheck/locale/
The terms "high bytes" and "Unicode" and "UTF-8" are in the class of multi-byte or wide-character encodings, and are very locale specific (and beyond the range of the minimal C Locale). I'm not clear on how it would be possible to "convert high bytes" in the (pure) C Locale. It's quite possible that implementations would pick a default (extended) locale if none was explicitly set (or pull it from the OS environment settings as stated in one of the links above).

The POSIX standard is quite clear in this regard.
The introduction to character sets in POSIX.1-2017 says:
6.2 Character Encoding
The POSIX locale shall contain 256 single-byte characters including the characters in Portable Character Set and Non-Portable Control Characters, which have the properties listed in LC_CTYPE. It is unspecified whether characters not listed in those two tables are classified as punct or cntrl, or neither. Other locales shall contain the characters in Portable Character Set and may contain any or all of the control characters identified in Non-Portable Control Characters; the presence, meaning, and representation of any additional characters are locale-specific.
(emphasis mine)
The page for mbtowc() says:
The mbtowc() function shall fail if:
[EILSEQ]
 An invalid character sequence is detected. In the POSIX locale an [EILSEQ] error cannot occur since all byte values are valid characters.
Note that the POSIX locale is defined to be identical to the C locale.
So if an operating system conforms to POSIX, mbtowc is a no-op in the POSIX locale. Characters 128–255 are passed through just as characters 0–127 are. Implementations that operate differently are in violation of the standard.

Related

How does C uppercase letters?

I see this code in glibc-2.33/ctype/ctype.c:
// [...]
#define __ctype_toupper \
  ((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)
// [...]
int
toupper (int c)
{
  return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
}
libc_hidden_def (toupper)
I understand that it's checking whether c is between -128 and 255 (inclusive) and returns the character as-is if it's outside that range, but what does _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128 mean, and where do I actually find the source code of how letters are uppercased? This seems to be looking up the current locale; I am only interested in en_US.UTF-8. Also, how can a character be negative?
I don't care about glibc specifically, I just want to know how all the ASCII characters (all as in from NUL to DEL) are uppercased in C.

"C" doesn't convert characters to upper case. The C standard only mandates that there be a function in the standard library which does so correctly according to the current locale, and that it does so in a particular way in the "C" locale (which is the only locale which is guaranteed to exist).
Library implementations are free to accomplish that task as the implementers see fit, and they all do it in different ways. Even radically different ways. Some C libraries don't support locales other than the "C" locale with an ASCII character set. An example of such a C library is musl and it is hard to beat the simplicity of its implementation:
int toupper(int c)
{
    if (islower(c)) return c & 0x5f;
    return c;
}
As you can see, the above code depends on islower. Here it is:
int islower(int c)
{
    return (unsigned)c-'a' < 26;
}
Because of the call to islower, toupper returns unchanged any argument outside of the range of lower case characters, even arguments not in the valid range for toupper. Since the standard doesn't define the behaviour of toupper for arguments outside of the valid range (essentially values which might be returned by fgetc), just returning invalid arguments unchanged is certainly as acceptable as any other behaviour. Glibc's toupper function will often segfault on invalid arguments, since it uses the argument as an index into an array (as you can see in the code you cite). That behaviour is also acceptable according to the standard.
The Glibc implementation is a lot more complicated. And behind the scenes it depends on the locale data which is compiled from locale definition files, a process which is completely outside of the C standard and somewhat defined by the Posix standard (although the GNU implementation diverges in some way from Posix).
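The "+ 128" in the macro quoted in the question can be read as biasing the table pointer so that negative indices are valid. Here is a toy sketch of that idea. It is not glibc's code or data layout: toy_table, toy_table_init and toy_toupper are hypothetical names, the table is filled by hand instead of from locale data, and an ASCII-compatible character set is assumed.

```c
/* Toy table covering indices -128..255 (384 entries). This is NOT
   glibc's data layout, only an illustration of the "+ 128" bias. */
static int toy_table[384];
static int toy_table_ready = 0;

static void toy_table_init(void)
{
    for (int i = -128; i < 256; i++) {
        /* Identity mapping, except that 'a'..'z' map to 'A'..'Z'. */
        int upper = (i >= 'a' && i <= 'z') ? i - ('a' - 'A') : i;
        toy_table[i + 128] = upper;
    }
    toy_table_ready = 1;
}

int toy_toupper(int c)
{
    if (!toy_table_ready)
        toy_table_init();

    /* Biasing the base pointer by 128 makes biased[-128] the first
       element, so negative char values (and EOF) index safely. */
    const int *biased = toy_table + 128;
    return (c >= -128 && c < 256) ? biased[c] : c;
}
```

With this layout a signed char holding a high byte, such as -96, lands inside the table instead of before it, which is one way a character value can be negative and still be looked up.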
But here's the scoop: If you're using single byte characters in a UTF-8 locale, none of glibc's complicated code makes the slightest difference. The musl implementation works precisely as required in a UTF-8 locale, because the only alphabetic characters representable in a single byte UTF-8 representation are the 52 characters in the "Roman" alphabet. All the other Unicode characters are only representable in wide characters and multibyte sequences.
Furthermore, environments which use a single-byte encoding other than UTF-8 are increasingly rare. There are certainly a lot of us who had to learn this stuff because our programs ran on a variety of platforms which used different ISO-8859-x code pages. Or different single-byte Windows codepages. But in the end, Unicode won out. (And many of us breathed huge sighs of relief.) So most of this apparatus is no longer really necessary except in legacy environments.
But that's not to say that Unicode magically solves all the complications involved in managing the huge variety of alphabets in use in the world. Far from it. What Unicode does do is two-fold: it clarifies what the complications are (most of which is not captured by C/Posix locales), and it provides some basic standards for implementations.
And, as a side effect, UTF-8 standardises single-byte codes to basically conform with the original ASCII 7-bit standard. So if you're only dealing with 7-bit characters (which, these days, is probably less than ideal), you don't need anything beyond musl-style implementations. And if you are dealing with "all the world's character sets", you'll be looking for a library which actually conforms to Unicode, and which uses something other than char to represent characters.
But one complication is going to remain forever, sadly: the fact that C does not standardise the signedness of char. On platforms on which char is signed (Unix x86 and Windows, for two major examples), (char)0xA0 is (a) implementation-defined and (b) probably -96, which is what a single-byte 0xA0 represents in 2's complement. So if you write code which uses the various functions in ctype.h and don't take care of negative char values, and then you try to use that code with a UTF-8 encoded string which includes characters outside of the single-byte domain, then you will end up passing negative numbers to functions which might not be expecting them.
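The usual way to take care of negative char values is to cast each char to unsigned char before passing it to a ctype.h function. A minimal sketch follows; upcase_bytes is a name made up for illustration.

```c
#include <ctype.h>

/* Upper-cases a string byte by byte. The cast to unsigned char keeps
   every argument to toupper() in the range 0..UCHAR_MAX, even on
   platforms where plain char is signed. */
void upcase_bytes(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}
```

In the "C" locale this leaves the bytes of a multibyte UTF-8 sequence untouched, because only a-z are classified as lower case there.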

If you go back at the root and look for _NL_CTYPE_TOUPPER you will find a commit where it is written
[..] (ctype_output): Support for alternate locale format: Computation of
nelems changes. _NL_CTYPE_TOUPPER32 [...]
So basically _NL_CTYPE_TOUPPER is the macro for the 8-bit table (as opposed to _NL_CTYPE_TOUPPER32); in French, for example, À is the uppercase version of à.
Following this link you will find the header file langinfo.h that has this enum starting at line 43 and with _NL_CTYPE_TOUPPER defined at line 259.
LC_CTYPE category: character classification.
This information is accessed by the functions in <ctype.h>.
LC_CTYPE is defined for each language, see for example for French:
fr_FR:2000"
Note that it doesn't make a lot of sense to call this function on accented characters, since they are not contained in the ASCII table, but since this function handles both UTF-8 and ASCII input, that's how it works.

Restrictions to Unicode escape sequences in C11

Why is there a restriction for Unicode escape sequences (\unnnn and \Unnnnnnnn) in C11 such that only those characters outside of the basic character set may be represented? For example, the following code results in the compiler error: \u000A is not a valid universal character. (Some Unicode "dictionary" sites even give this invalid format as canon for the C/C++ languages, though admittedly these are likely auto-generated):
static inline int test_unicode_single() {
    return strlen(u8"\u000A") > 1;
}
While I understand that it's not exactly necessary for these basic characters to be supported, is there a technical reason why they're not? Something like not being able to represent the same character in more than one way?

It's precisely to avoid alternative spellings.
The primary motivations for adding Universal Character Names (UCNs) to C and C++ were to:
allow identifiers to include letters outside of the basic source character set (like ñ, for example).
allow portable mechanisms for writing string and character literals which include characters outside of the basic source character set.
Furthermore, there was a desire that the changes to existing compilers be as limited as possible, and in particular that compilers (and other tools) could continue to use their established (and often highly optimised) lexical analysis functions.
That was a challenge, because there are huge differences in the lexical analysis architectures of different compilers. Without going into all the details, it appeared that two broad implementation strategies were possible:
The compiler could internally use some single universal encoding, such as UTF-8. All input files in other encodings would be transcribed into this internal encoding very early in the input pipeline. Also, UCNs (wherever they appeared) would be converted to the corresponding internal encoding. This latter transformation could be conducted in parallel with continuation line processing, which also requires detecting backslashes, thus avoiding an extra test on every input character for a condition which very rarely turns out to be true.
The compiler could internally use strict (7-bit) ASCII. Input files in encodings allowing other characters would be transcribed into ASCII with non-ASCII characters converted to UCNs prior to any other lexical analysis.
In effect, both of these strategies would be implemented in Phase 1 (or equivalent), which is long before lexical analysis has taken place. But note the difference: strategy 1 converts UCNs to an internal character coding, while strategy 2 converts non-representable characters to UCNs.
What these two strategies have in common is that once the transcription is finished, there is no longer any difference between a character entered directly into the source stream (in whatever encoding the source file uses) and a character described with a UCN. So if the compiler allows UTF-8 source files, you could enter an ñ as either the two bytes 0xc3, 0xb1 or as the six-character sequence \u00F1, and they would both end up as the same byte sequence. That, in turn, means that every identifier has only one spelling, so no change is necessary (for example) to symbol table lookup.
Typically, compilers just pass variable names through the compilation pipeline, leaving them to be eventually handled by assemblers or linkers. If these downstream tools do not accept extended character encodings or UCNs (depending on implementation strategy) then names containing such characters need to be "mangled" (transcribed) in order to make them acceptable. But even if that's necessary, it's a minor change and can be done at a well-defined interface.
Rather than resolve arguments between compiler vendors whose products (or development teams) had clear preferences between the two strategies, the C and C++ standards committees chose mechanisms and restrictions which make both strategies compatible. In particular, both committees forbid the use of UCNs which represent characters which already have an encoding in the basic source character set. That avoids questions like:
What happens if I put \u0022 inside a string literal:
const char* quote = "\u0022";
If the compiler translates UCNs to the characters they represent, then by the time the lexical analyser sees that line, "\u0022" will have been converted to """, which is a lexical error. On the other hand, a compiler which retains UCNs until the end would happily accept that as a string literal. Banning the use of a UCN which represents a quotation mark avoids this possible non-portability.
Similarly, would '\u005cn' be a newline character? Again, if the UCN is converted to a backslash in Phase 1, then in Phase 3 the character literal would definitely be treated as a newline. But if the UCN is converted to a character value only after the character literal token has been identified as such, then the resulting character literal would contain two characters (an implementation-defined value).
And what about 2 \u002B 2? Is that going to look like an addition, even though UCNs aren't supposed to be used for punctuation characters? Or will it look like an identifier starting with a non-letter code?
And so on, for a large number of similar issues.
All of these details are avoided by the simple expedient of requiring that UCNs cannot be used to spell characters in the basic source character set. And that's what was embodied in the standards.
Note that the "basic source character set" does not contain every ASCII character. It does not contain the majority of the control characters, and nor does it contain the ASCII characters $, @ and `. These characters (which have no meaning in a C or C++ program outside of string and character literals) can be written as the UCNs \u0024, \u0040 and \u0060 respectively.
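The permitted UCNs can be checked with a small sketch like the one below. The function names are made up for illustration, and the checks assume a C11 or later compiler whose execution character set is ASCII-compatible (UTF-8 being the usual case).

```c
#include <string.h>

/* Returns 1 if the UCN \u00F1 (ñ) in a u8 literal produces the UTF-8
   bytes 0xC3 0xB1 followed by the terminator. */
int n_tilde_ucn_is_utf8(void)
{
    const char expected[] = { (char)0xC3, (char)0xB1, '\0' };
    return memcmp(u8"\u00F1", expected, sizeof expected) == 0;
}

/* Returns 1 if the three permitted low UCNs spell $, @ and `
   (assumes an ASCII-compatible execution character set). */
int low_ucns_match(void)
{
    return strcmp("\u0024\u0040\u0060", "$@`") == 0;
}
```

Replacing \u0024 with, say, \u0041 (the letter A) is expected to be rejected by a compiler enforcing the C11 rule, as the \u000A example in the question was.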
Finally, in order to see what sort of knots you need to untie in order to correctly lexically analyse C (or C++), consider the following snippet:
const char* s = "\\
n";
Because continuation lines are dealt with in Phase 2, prior to lexical analysis, and that phase only looks for the two-character sequence consisting of a backslash followed by a newline, that line is the same as
const char* s = "\n";
But that might not have been obvious looking at the original code.

What is a "wide character string" in C language?

I came across this in the book:
wscanf(L"%lf", &variable);
where the first parameter is of type of wchar_t *.
This is different from scanf("%lf", &variable); where the first parameter is of type char *.
So what is the difference, then? I have never heard of a "wide character string" before. I have heard of something called raw string literals, which print the string as it is (no need for things like escape sequences), but that was not in C.

The exact nature of wide characters is (purposefully) left implementation defined.
When they first invented the concept of wchar_t, ISO 10646 and Unicode were still competing with each other (whereas they now mostly cooperate). Rather than try to decree that an international character would be one or the other (or possibly something else entirely), they simply provided a type (and some functions) that the implementation could define to support international character sets as they chose.
Different implementations have exercised that potential for variation. For example, if you use Microsoft's compiler on Windows, wchar_t will be a 16-bit type holding UTF-16 Unicode (originally it held UCS-2 Unicode, but that's now officially obsolete).
On Linux, wchar_t will more often be a 32-bit type, holding UCS-4/UTF-32 encoded Unicode. Ports of gcc to at least some other operating systems do the same, though I've never tried to confirm that it's always the case.
There is, however, no guarantee of that. At least in theory an implementation on Linux could use 16 bits, or one on Windows could use 32 bits, or either one could decide to use 64 bits (though I'd be a little surprised to see that in reality).
In any case, the general idea of how things are intended to work, is that a single wchar_t is sufficient to represent a code point. For I/O, the data is intended to be converted from the external representation (whatever it is) into wchar_ts, which (is supposed to) make them relatively easy to manipulate. Then during output, they again get transformed into the encoding of your choice (which may be entirely different from the encoding you read).
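As a rough illustration of that model, the sketch below reports how wide wchar_t is on the current implementation and how many wide characters a multibyte string would decode to. The function names are hypothetical, and the result for non-ASCII input depends on the locale the program has selected with setlocale().

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Width of wchar_t in bits on this implementation. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Number of wide characters the multibyte string decodes to under the
   current LC_CTYPE, or -1 if it contains an invalid sequence. */
long wide_length(const char *mb)
{
    mbstate_t st;
    const char *p = mb;
    size_t n;

    memset(&st, 0, sizeof st);       /* initial conversion state */
    n = mbsrtowcs(NULL, &p, 0, &st); /* NULL destination: count only */
    return n == (size_t)-1 ? -1L : (long)n;
}
```

The byte length and the wide-character length only differ once the input contains multibyte characters and a matching locale is active.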

"Wide character string" is referring to the encoding of the characters in the string.
From Wikipedia:
A wide character is a computer character datatype that generally has a
size greater than the traditional 8-bit character. The increased
datatype size allows for the use of larger coded character sets.
UTF-16 is one of the most commonly used wide character encodings.
Further, wchar_t is defined by Microsoft as an unsigned short (16-bit) data object. The definition can be, and most likely is, different in other operating systems or languages.
Taken from the Wikipedia article from the comment below:
"The width of wchar_t is compiler-specific and can be as small as 8
bits. Consequently, programs that need to be portable across any C or
C++ compiler should not use wchar_t for storing Unicode text. The
wchar_t type is intended for storing compiler-defined wide characters,
which may be Unicode characters in some compilers."

isLetter with accented characters in C

I'd like to create (or find) a C function to check if a char c is a letter...
I can do this for a-z and A-Z easily of course.
However, I get an error if testing c == á, ã, ô, ç, ë, etc.
Probably those special characters are stored in more than one char...
I'd like to know:
How are these special characters stored, which arguments does my function need to receive, and how do I do it?
I'd also like to know whether there is any standard function that already does this.

I think you're looking for the iswalpha() routine:
#include <wctype.h>
int iswalpha(wint_t wc);
DESCRIPTION
The iswalpha() function is the wide-character equivalent of
the isalpha(3) function. It tests whether wc is a wide
character belonging to the wide-character class "alpha".
It does depend upon the LC_CTYPE of the current locale(7), so its use in a program that is supposed to handle multiple types of input correctly simultaneously might not be ideal.

If you are working with single-byte codesets such as ISO 8859-1 or 8859-15 (or any of the other 8859-x codesets), then the isalpha() function will do the job if you also remember to use setlocale(LC_ALL, ""); (or some other suitable invocation of setlocale()) in your program. Without this, the program runs in the C locale, which only classifies the ASCII characters (8859-x characters in the range 0x00..0x7F).
If you are working with multibyte or wide character codesets (such as UTF8 or UTF16), then you need to look to the wide character functions found in <wchar.h> and <wctype.h>.
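For the multibyte case, one possible approach is to decode the string one character at a time with mbrtowc() and classify each result with iswalpha(). This is only a sketch: count_letters is a hypothetical helper, and it relies on the caller having selected a suitable locale beforehand, for example with setlocale(LC_ALL, "").

```c
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Counts the alphabetic characters in a multibyte string, decoding it
   according to the current LC_CTYPE. Returns -1 on an invalid or
   truncated sequence. */
long count_letters(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);
    long letters = 0;

    memset(&st, 0, sizeof st);  /* initial conversion state */
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;          /* invalid or incomplete sequence */
        if (n == 0)
            break;              /* decoded a terminating null */
        if (iswalpha((wint_t)wc))
            letters++;
        s += n;
        left -= n;
    }
    return letters;
}
```

Without a setlocale() call the program stays in the C locale, where only the ASCII letters are guaranteed to count as alphabetic.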

How these characters are stored is locale-dependent. On most UNIX systems they'll be stored as UTF-8, whereas a Win32 machine will likely represent them as UTF-16. UTF-8 stores each character as a variable number of chars, whereas UTF-16 stores each character in one 16-bit unit, or in two (a surrogate pair) for characters outside the Basic Multilingual Plane. Incidentally, sizeof(wchar_t) on Windows is only 2 (vs 4 on *nix), so there you'll need 2 wchar_t units to store a single character whenever a surrogate pair is required.
As was mentioned, the iswalpha() routine will do this for you, and is documented here. It should take care of locale-specific issues for you.

You probably want http://site.icu-project.org/. It provides a portable library with APIs for this.

wcstombs: character encoding?

wcstombs documentation says, it "converts the sequence of wide-character codes to multibyte string". But it never says what is a "wide-character".
Is it implicit, like say it converts utf-16 to utf-8 or the conversion is defined by some environment variable?
Also what is the typical use case of wcstombs?

You use the setlocale() standard function with the LC_CTYPE (or LC_ALL) category to set the mapping the library uses between wchar_t characters and multibyte characters. The actual locale name passed to setlocale() is implementation defined, so you'll need to look it up in your compiler's docs.
For example, with MSVC you might use
setlocale( LC_ALL, ".1252" );
to set the C runtime to use codepage 1252 as the multibyte character set. Note that the MSVC docs explicitly indicate that the locale cannot be set to UTF-7 or UTF-8 for the multibyte character sets:
The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL.
The "wide-character" wchar_t type is intended to be able to support any character set the system supports - the standard doesn't define the size of a wchar_t type (it could be as small as a char or any of the larger integer types). On Windows it's the system's 'internal' Unicode encoding, which is UTF-16 (UCS-2 before WinXP). Honestly, I can't find a direct quote on that in the MSVC docs, though. Strictly speaking, the implementation should call this out, but I can't find it.
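A typical call pattern for wcstombs() might look like the sketch below. The wrapper wide_to_multibyte is a hypothetical helper, and the output encoding is whatever the current LC_CTYPE selects; in the default "C" locale only the basic characters are guaranteed to convert.

```c
#include <stdlib.h>
#include <wchar.h>

/* Converts ws to the current locale's multibyte encoding. Returns the
   number of bytes written (excluding the terminator), or -1 if a
   character is not representable or the buffer is too small. */
long wide_to_multibyte(const wchar_t *ws, char *out, size_t cap)
{
    size_t n = wcstombs(out, ws, cap);

    if (n == (size_t)-1)
        return -1;  /* unrepresentable wide character */
    if (n >= cap)
        return -1;  /* buffer filled: no room left for the '\0' */
    return (long)n;
}
```

The second check matters because wcstombs() does not write a terminating null byte when the converted text fills the buffer exactly.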

It converts whatever your platform uses for a "wide char" (which I'm led to believe is indeed UCS-2 on Windows, but is usually UCS-4 on UNIX) into your current locale's default multibyte character encoding. If your locale is a UTF-8 one, then that is the multibyte encoding that will be used, but note that there are other possibilities, like JIS.

According to the C standard, the wchar_t type is "capable of representing any character in the current locale". The standard doesn't say what the encoding for wchar_t is. In fact, the minimum required limits on WCHAR_MIN and WCHAR_MAX are only [0, 255] or [-127, 127], depending upon whether wchar_t is unsigned or signed.
A multibyte character can use more than one byte. A multibyte string is made of one or more multibyte characters. In a multibyte string, each character need not be of equal number of bytes (UTF-8 is an example). Whereas, an object of type wchar_t has a fixed size (in a given implementation, of course).
As an aside, I can also find the following in my copy of the C99 draft:
__STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month.
So, if I understood correctly, if __STDC_ISO_10646__ is defined, then wchar_t can store Unicode characters.
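Whether a given implementation makes that promise can be checked at compile time. The sketch below uses made-up function names and only inspects the one macro quoted above; it tests a single character (U+00F1, ñ) as an example rather than the whole required set.

```c
#include <wchar.h>

/* Returns 1 if the implementation defines __STDC_ISO_10646__, else 0. */
int wchar_values_are_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* When the macro is defined, the wide literal for U+00F1 must have the
   value 0xF1. Returns 1 if that holds, or if the macro is absent and
   nothing is promised. */
int n_tilde_check(void)
{
#ifdef __STDC_ISO_10646__
    return L'\u00F1' == 0x00F1;
#else
    return 1;
#endif
}
```

On an implementation that does not define the macro, the numeric value of a wide character is left to the implementation and cannot be assumed to be a Unicode code point.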

Wide character strings are composed of multi-byte characters, whereas the normal C string is a char* - a sequence of byte-wide characters. Wchars are not the same thing as Unicode on all platforms, though Unicode representations are typically based on wchar_t.
I've seen wchars used in embedded systems like phones, where you want filenames with special characters but don't necessarily want to support all the glory and complexity of Unicode.
Typical usage would be converting a 2-byte-based string to a regular C string, and vice versa.