Restrictions to Unicode escape sequences in C11 - c

Why is there a restriction for Unicode escape sequences (\unnnn and \Unnnnnnnn) in C11 such that only those characters outside of the basic character set may be represented? For example, the following code results in the compiler error: \u000A is not a valid universal character. (Some Unicode "dictionary" sites even give this invalid format as canon for the C/C++ languages, though admittedly these are likely auto-generated):
static inline int test_unicode_single() {
    return strlen(u8"\u000A") > 1;
}
While I understand that it's not strictly necessary for these basic characters to be supported, is there a technical reason why they're not? Something like not being able to represent the same character in more than one way?

It's precisely to avoid alternative spellings.
The primary motivations for adding Universal Character Names (UCNs) to C and C++ were to:
allow identifiers to include letters outside of the basic source character set (like ñ, for example).
allow portable mechanisms for writing string and character literals which include characters outside of the basic source character set.
Furthermore, there was a desire that the changes to existing compilers be as limited as possible, and in particular that compilers (and other tools) could continue to use their established (and often highly optimised) lexical analysis functions.
That was a challenge, because there are huge differences in the lexical analysis architectures of different compilers. Without going into all the details, it appeared that two broad implementation strategies were possible:
The compiler could internally use some single universal encoding, such as UTF-8. All input files in other encodings would be transcribed into this internal encoding very early in the input pipeline. Also, UCNs (wherever they appeared) would be converted to the corresponding internal encoding. This latter transformation could be conducted in parallel with continuation line processing, which also requires detecting backslashes, thus avoiding an extra test on every input character for a condition which very rarely turns out to be true.
The compiler could internally use strict (7-bit) ASCII. Input files in encodings allowing other characters would be transcribed into ASCII with non-ASCII characters converted to UCNs prior to any other lexical analysis.
In effect, both of these strategies would be implemented in Phase 1 (or equivalent), which is long before lexical analysis has taken place. But note the difference: strategy 1 converts UCNs to an internal character coding, while strategy 2 converts non-representable characters to UCNs.
What these two strategies have in common is that once the transcription is finished, there is no longer any difference between a character entered directly into the source stream (in whatever encoding the source file uses) and a character described with a UCN. So if the compiler allows UTF-8 source files, you could enter an ñ as either the two bytes 0xc3, 0xb1 or as the six-character sequence \u00F1, and they would both end up as the same byte sequence. That, in turn, means that every identifier has only one spelling, so no change is necessary (for example) to symbol table lookup.
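For illustration (my addition, not part of the original answer), here is a sketch assuming a compiler that accepts both UTF-8 source files and UCNs in identifiers, such as a recent GCC or Clang in C11 mode:

int a\u00F1o = 2024;              /* identifier spelled with a UCN            */
int leer(void) { return año; }    /* the same identifier, spelled with raw ñ  */

Whichever internal strategy the compiler uses, both spellings end up as the same identifier, so the return statement refers to the variable declared on the first line.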
Typically, compilers just pass variable names through the compilation pipeline, leaving them to be eventually handled by assemblers or linkers. If these downstream tools do not accept extended character encodings or UCNs (depending on implementation strategy) then names containing such characters need to be "mangled" (transcribed) in order to make them acceptable. But even if that's necessary, it's a minor change and can be done at a well-defined interface.
Rather than resolve arguments between compiler vendors whose products (or development teams) had clear preferences between the two strategies, the C and C++ standards committees chose mechanisms and restrictions which make both strategies compatible. In particular, both committees forbid the use of UCNs which represent characters which already have an encoding in the basic source character set. That avoids questions like:
What happens if I put \u0022 inside a string literal:
const char* quote = "\u0022";
If the compiler translates UCNs to the characters they represent, then by the time the lexical analyser sees that line, "\u0022" will have been converted to """, which is a lexical error. On the other hand, a compiler which retains UCNs until the end would happily accept that as a string literal. Banning the use of a UCN which represents a quotation mark avoids this possible non-portability.
Similarly, would '\u005cn' be a newline character? Again, if the UCN is converted to a backslash in Phase 1, then in Phase 3 the character literal would definitely be treated as a newline. But if the UCN is converted to a character value only after the character literal token has been identified as such, then the resulting character literal would contain two characters (an implementation-defined value).
And what about 2 \u002B 2? Is that going to look like an addition, even though UCNs aren't supposed to be used for punctuation characters? Or will it look like an identifier starting with a non-letter code?
And so on, for a large number of similar issues.
All of these details are avoided by the simple expedient of requiring that UCNs cannot be used to spell characters in the basic source character set. And that's what was embodied in the standards.
Note that the "basic source character set" does not contain every ASCII character. It does not contain the majority of the control characters, and nor does it contain the ASCII characters $, # and `. These characters (which have no meaning in a C or C++ program outside of string and character literals) can be written as the UCNs \u0024, \u0040 and \u0060 respectively.
Finally, in order to see what sort of knots you need to untie in order to correctly lexically analyse C (or C++), consider the following snippet:
const char* s = "\\
n";
Because continuation lines are dealt with very early (translation phase 2 in the standard's model), prior to lexical analysis, and that step only looks for the two-character sequence consisting of a backslash followed by a newline, that line is the same as
const char* s = "\n";
But that might not have been obvious looking at the original code.
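To make that concrete, here is a small sketch of my own (not part of the original answer). Once splicing and escape processing are done, the string holds a single newline character, so the program prints 1:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "\\
n";
    printf("%zu\n", strlen(s));   /* prints 1: the spliced literal is "\n" */
    return 0;
}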

Related

Do strcmp and strstr test binary equivalence?

https://learn.microsoft.com/en-us/windows/win32/intl/security-considerations--international-features
This webpage makes me wonder.
Apparently some Windows APIs may consider two strings equal when they are actually different byte sequences.
I want to know how the C standard library behaves in this respect.
in other words, does strcmp(a,b)==0 imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0?
and what about other string functions, including wide character versions?
edit:
for example, CompareStringW equates L"\x00C5" and L"\x212B"
printf("%d\n",CompareStringW(LOCALE_INVARIANT,0,L"\x00C5",-1,L"\x212B",-1)==CSTR_EQUAL); outputs 1
what I'm asking is whether C library functions never behave like this
Two strings using different encodings can represent the same text even though their byte representations are different.
The standard library's strcmp compares plain "character" (byte) strings, and in this case strcmp(a,b)==0 does imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0.
Functions like wcscmp require both strings to use the same encoding, so equal strings have the same byte representation.
The regular string functions operate byte-by-byte. The specification says:
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
strcmp() and memcmp() perform the same comparison. The only difference is the limit: strcmp() stops at the strings' null terminators, memcmp() takes an explicit length parameter, and strncmp() takes a length parameter and stops at whichever comes first, the terminator or the limit.
The wide string function specification says:
Unless explicitly stated otherwise, the functions described in this subclause order two wide characters the same way as two integers of the underlying integer type designated by wchar_t.
wcscmp() doesn't say otherwise, so it's also comparing the wide characters numerically, not by converting their encodings to some common character representations. wcscmp() is to wmemcmp() as strcmp() is to memcmp().
On the other hand, wcscoll() compares the strings as interpreted according to the LC_COLLATE category of the current locale. So this may not be equivalent to memcmp().
For other functions you should check the documentation to see whether they reference the locale.
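A short sketch of my own (assuming the usual 8-bit char) that exercises both points above: the unsigned-char ordering rule, and the fact that strcmp equality coincides with length-plus-memcmp equality:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The differing bytes are compared as unsigned char, so 0xFF
       sorts after 0x01 even on platforms where char is signed. */
    printf("%d\n", strcmp("\xFF", "\x01") > 0);              /* 1 */

    /* strcmp equality is pure byte equality. */
    const char *a = "abc", *b = "abc";
    printf("%d\n", strcmp(a, b) == 0 &&
                   strlen(a) == strlen(b) &&
                   memcmp(a, b, strlen(a)) == 0);            /* 1 */
    return 0;
}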
Apparently some Windows APIs may consider two strings equal when they are actually different byte sequences.
Depending on context and where you got those strings from, that would actually be the semantically correct behavior.
There are multiple ways to encode certain characters. The German 'ä', for example. In Unicode, this could be U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, or it could be the sequence U+0061 LATIN SMALL LETTER A followed by U+0308 COMBINING DIAERESIS. You could desire a comparison function that actually compares these equal. Or you could have them not compare equal, but have a standalone function that turns one representation into the other ("normalization").
You could want a comparison function that compares '6' (six) as equal to '๖' (also six, just in Thai). ("Canonicalization")
The byte string functions (strcmp() etc.) are not capable of any of that. They only deal in byte sequences, and are unaware of anything I wrote above.
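For instance, a minimal sketch of mine (assuming UTF-8 byte sequences) showing that the two Unicode spellings of 'ä' mentioned above do not compare equal byte-wise:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *precomposed = "\xC3\xA4";    /* UTF-8 for U+00E4             */
    const char *decomposed  = "a\xCC\x88";   /* 'a' plus UTF-8 for U+0308    */
    printf("%d\n", strcmp(precomposed, decomposed) != 0);   /* 1: not equal  */
    return 0;
}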
As for the wide string functions (wcscmp() etc.), well... they are not that either, really.
in other words, does strcmp(a,b)==0 imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0? and what about other string functions, including wide character versions?
Either will test for binary equivalence, as there are no mechanics in the C Standard Library to normalize or canonicalize strings.[1]
If you are actually processing strings (as opposed to just passing them through, for which C byte strings and wide strings are adequate), you should use the ICU library, the de facto standard for C/C++ Unicode handling. It looks daunting, but it actually needs to be that complex to handle all these things correctly.
Basically, any C/C++ API that promises to do the same is either using the ICU library itself, or is very likely not doing what it advertises.
[1]: Strictly speaking, strcoll() / strxfrm() and wcscoll() / wcsxfrm() provide enough wiggle room to squeeze in proper Unicode collation mechanics, but I don't know of an implementation that actually bothers to do so.

How does C uppercase letters?

I see this code in glibc-2.33/ctype/ctype.c:
// [...]
#define __ctype_toupper \
  ((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)
// [...]
int
toupper (int c)
{
  return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
}
libc_hidden_def (toupper)
I understand that it's checking whether c is within -128 to 255 and returning the character as-is if it's outside that range, but what does _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128 mean, and where do I actually find the source code of how letters are uppercased? This seems to be looking up the current locale; I am only interested in en_US.UTF-8. Also, how can a character be negative?
I don't care about glibc specifically, I just want to know how all the ASCII characters (all as in from NUL to DEL) are uppercased in C.
"C" doesn't convert characters to upper case. The C standard only mandates that there be a function in the standard library which does so correctly according to the current locale, and that it does so in a particular way in the "C" locale (which is the only locale which is guaranteed to exist).
Library implementations are free to accomplish that task as the implementers see fit, and they all do it in different ways. Even radically different ways. Some C libraries don't support locales other than the "C" locale with an ASCII character set. An example of such a C library is musl and it is hard to beat the simplicity of its implementation:
int toupper(int c)
{
    if (islower(c)) return c & 0x5f;
    return c;
}
As you can see, the above code depends on islower. Here it is:
int islower(int c)
{
    return (unsigned)c-'a' < 26;
}
Because of the call to islower, toupper returns unchanged any argument outside of the range of lower case characters, even arguments not in the valid range for toupper. Since the standard doesn't define the behaviour of toupper for arguments outside of the valid range (essentially values which might be returned by fgetc), just returning invalid arguments unchanged is certainly as acceptable as any other behaviour. Glibc's toupper function will often segfault on invalid arguments, since it uses the argument as an index into an array (as you can see in the code you cite). That behaviour is also acceptable according to the standard.
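As an aside (my addition, relying only on ASCII facts): the c & 0x5f trick works because ASCII lower-case letters differ from their upper-case counterparts only in bit 5 (value 0x20), so clearing that bit performs the conversion:

#include <assert.h>

int main(void)
{
    assert(('a' & 0x5f) == 'A');   /* 0x61 & 0x5f == 0x41 */
    assert(('z' & 0x5f) == 'Z');   /* 0x7a & 0x5f == 0x5a */
    return 0;
}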
The Glibc implementation is a lot more complicated. And behind the scenes it depends on the locale data which is compiled from locale definition files, a process which is completely outside of the C standard and somewhat defined by the Posix standard (although the GNU implementation diverges in some way from Posix).
But here's the scoop: If you're using single byte characters in a UTF-8 locale, none of glibc's complicated code makes the slightest difference. The musl implementation works precisely as required in a UTF-8 locale, because the only alphabetic characters representable in a single byte UTF-8 representation are the 52 characters in the "Roman" alphabet. All the other Unicode characters are only representable in wide characters and multibyte sequences.
Furthermore, environments which use a single-byte encoding other than UTF-8 are increasingly rare. There are certainly a lot of us who had to learn this stuff because our programs ran on a variety of platforms which used different ISO-8859-x code pages. Or different single-byte Windows codepages. But in the end, Unicode won out. (And many of us breathed huge sighs of relief.) So most of this apparatus is no longer really necessary except in legacy environments.
But that's not to say that Unicode magically solves all the complications involved in managing the huge variety of alphabets in use in the world. Far from it. What Unicode does do is two-fold: it clarifies what the complications are (most of which are not captured by C/Posix locales), and it provides some basic standards for implementations.
And, as a side effect, UTF-8 standardises single-byte codes to basically conform with the original ASCII 7-bit standard. So if you're only dealing with 7-bit characters (which, these days, is probably less than ideal), you don't need anything beyond musl-style implementations. And if you are dealing with "all the world's character sets", you'll be looking for a library which actually conforms to Unicode, and which uses something other than char to represent characters.
But one complication is going to remain forever, sadly: the fact that C does not standardise the signedness of char. On platforms where char is signed (x86 Unix and Windows, for two major examples), (char)0xA0 is (a) unspecified and (b) probably -96, which is what the single byte 0xA0 represents in 2's complement. So if you write code which uses the various functions in ctype.h and don't take care of negative char values, and then you try to use that code with a UTF-8 encoded string which includes characters outside of the single-byte domain, then you will end up passing negative numbers to functions which might not be expecting them.
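The conventional defence (a sketch of mine, not from the original answer) is to cast through unsigned char before calling any ctype.h function, since those functions are only defined for values representable as unsigned char or equal to EOF:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    const char *s = "\xC3\xA9";   /* UTF-8 bytes of 'é'; s[0] is negative if char is signed */

    /* Undefined behaviour on signed-char platforms: toupper(s[0]) */

    /* Portable: convert to unsigned char first; in the C locale the
       byte is not a lower-case letter, so it is returned unchanged. */
    printf("%d\n", toupper((unsigned char)s[0]));
    return 0;
}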
If you go back through the source history and look for _NL_CTYPE_TOUPPER, you will find a commit in which it is written
[..] (ctype_output): Support for alternate locale format: Computation of
nelems changes. _NL_CTYPE_TOUPPER32 [...]
So basically _NL_CTYPE_TOUPPER is the macro for the 8-bit case (as opposed to _NL_CTYPE_TOUPPER32); in French, for example, À is the uppercase version of à.
In glibc's header file langinfo.h you will find this enum starting at line 43, with _NL_CTYPE_TOUPPER defined at line 259.
LC_CTYPE category: character classification.
This information is accessed by the functions in <ctype.h>.
LC_CTYPE is defined for each language; see for example the French locale definition, fr_FR.
Note that it may seem like overkill to go through this machinery for plain ASCII, since accented characters are not contained in the ASCII table, but the same function handles both UTF-8 and ASCII, so that is how it works.

Contradiction in C18 standard (regarding character sets)?

We read in the C18 standard:
5.1.1.2 Translation phases
The precedence among the syntax rules of translation is specified by the following phases.
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.
Meaning that the source file character set is decoded and mapped to the source character set.
But then you can read:
5.2.1 Character sets
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set).
Meaning that the source file character set is the source character set.
So the question is: which one did I understand wrong, or which one is actually wrong?
EDIT: Actually I was wrong. See my answer below.
Meaning that the source file character set is decoded and mapped to the source character set.
No, it does not mean that. My take is that the source is already assumed to be written in the source character set - how exactly would it make sense to "map the source character set to the source character set"? Either they are part of the set or they aren't. If you pick the wrong encoding for your source code, it will simply be rejected before the preprocessing even starts.
Translation phase 1 does two things not quite related to this at all:
Resolves trigraphs, which are standardized three-character sequences.
Maps multibyte characters into the source character set (defined in 5.2.1).
The source character set consists of the basic character set which is essentially the Latin alphabet plus various common symbols (5.2.1/3), and an extended character set, which is locale- and implemention-specific.
The definition of multibyte characters is found at 5.2.1.2:
The source character set may contain multibyte characters, used to represent members of
the extended character set. The execution character set may also contain multibyte
characters, which need not have the same encoding as for the source character set.
Meaning various locale-specific oddball special cases, such as locale-specific trigraphs.
All of this multibyte madness goes back to the first standardization in 1990 - according to anecdotes from those who were part of that committee, this was because members from various European countries weren't able to use various symbols on their national keyboards.
(I'm not sure how widespread the AltGr key on such keyboards was at the time. It remains a key subject to some serious button mashing when writing C on non-English keyboards anyway, to get access to {}[] symbols etc.)
Well, after all it seems I was wrong. After contacting David Keaton from the WG14 group (which is in charge of the C standard), I got this clarifying reply:
There is a subtle distinction. The source character set is the
character set in which source files are written. However, the source
character set is just the list of characters available, which does not
say anything about the encoding.
Phase 1 maps the multibyte encoding of the source character set onto
the abstract source characters themselves.
In other words, a character that looks like this:
<byte 1><byte 2>
is mapped to this:
<character 1>
The first is an encoding that represents a character in the source
character set in which the program was written. The second is the
abstract character in the source character set.
You have encountered cross compiling, where a program is compiled on one architecture and executed on another, and the two architectures have different character sets.
5.1.1.2 applies early, as the input file is read and converted into the compiler's single internal character set, which clearly must contain all of the characters required by a C program.
However when cross compiling, the execution character set may be different. 5.2.1 is allowing for this possibility. When the compiler emits code, it must translate all character and string constants to the target platform's character set. On modern platforms, this is a no-op, but on some ancient platforms it wasn't.

C translation phases concrete examples

According to the C11 standard (5.1.1.2 Translation phases) there are 8 translation phases.
Can anyone give a concrete example for each of the phases?
For example at phase 1 there is:
Physical source file multibyte characters are mapped, in an
implementation-defined manner, to the source character set...
so can I have an example of what happens when that mapping is executed, and so on for the other phases?
Well, one example of phase one would be storing your source code into a record-oriented format, such as in z/OS on the mainframe.
These data sets have fixed record sizes, so if your data set specification was FB80 (fixed, blocked, record length of 80), the "line":
int main (void)
would be stored as those fifteen characters followed by sixty-five spaces, and no newline.
Phase one translation would read in the record, possibly strip off the trailing spaces, and add a newline character, before passing the line on to the next phase.
As per the standard, this is also the phase that handles trigraphs, such as converting ??( into [ on a 3270 terminal that has no support for the [ character.
An example of phase five is if you're writing your code on z/OS (using EBCDIC) but cross-compiling it for Linux/x86 (using ASCII/Unicode).
In that case the source characters within string literals and character constants must have the ASCII representation rather than the EBCDIC one. Otherwise, you're likely to get some truly bizarre output on your Linux box.
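For the later phases, here is a tiny sketch of my own (not taken from the answers above) that touches line splicing (phase 2), macro expansion (phase 4) and adjacent string-literal concatenation (phase 6):

#include <stdio.h>

#define GREETING "Hello"          /* replaced during phase 4 */

int main(void)
{
    /* Phase 2 deletes the backslash-newline below; phase 6 then
       concatenates the adjacent string literals into one. */
    const char *s = GREETING ", " "wor\
ld";
    puts(s);                      /* prints: Hello, world */
    return 0;
}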

Behavior of extended bytes/characters in C/POSIX locale

C and POSIX both require only a very limited set of characters be present in the C/POSIX locale, but allow additional characters to exist. This leaves a great deal of freedom to the implementation; for instance, supporting all of Unicode (as UTF-8) in the C locale is conforming behavior. However, most historical implementations treat the C locale as having an "8-bit-clean" single-byte character encoding, either ISO-8859-1 (Latin-1) or a sort of "abstract 8-bit character set" where the non-ASCII bytes are abstract characters with no particular identity. (However, in the latter case, if the compiler defines __STDC_ISO_10646__, they normatively correspond to Unicode characters, usually the Latin-1 range.)
Another conforming option that seems much less popular is to treat all non-ASCII bytes as non-characters, i.e. respond to them with an EILSEQ error.
What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale. Are there implementations where attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?
From your comment to the previous answer:
The ways in which the assumption could be wrong are basically that bytes outside the portable character set could be illegal non-character bytes (EILSEQ) or make up some multibyte encoding (UTF-8 or a stateless legacy CJK encoding)
Here you can find one example.
Plan 9 only supports the "C" locale. As you can see in utf.c and rune.c, when it finds a rune outside the portable character set, it simply handles it as a character from a different encoding.
Other candidates could be Minix and the *BSD family (insofar as they use Citrus). In the Minix source code I've also found the file command looking for a different encoding when the character size is not 8 bits.
Amusingly, I just found that the most widely-used implementation, glibc, is an example of what I'm looking for. Consider this simple program:
#include <stdlib.h>
#include <stdio.h>
int main()
{
    wchar_t wc = 0;
    int n = mbtowc(&wc, "\x80", 1);
    printf("%d %.4x\n", n, (int)wc);
}
On glibc, it prints -1 0000. If the byte 0x80 were an extended character in the implementation's C/POSIX locale, it would print 1 followed by some nonzero character number.
Thus, the "common knowledge" that the C/POSIX locale is "8-bit-clean" on glibc is simply false. What's going on is that there's a gross inconsistency; despite the fact that all the standard utilities, regular expression matching, etc. are specified to operate on (multibyte) characters as if read by mbrtowc, the implementations of these utilities/functions are taking a shortcut when they see MB_CUR_MAX==1 or LC_CTYPE containing "C" (or similar) and reading char values directly instead of processing input with mbrtowc or similar. This is leading to an inconsistency between the specified behavior (which, as their implementation of the C/POSIX locale is defined, would have to treat high bytes as illegal sequences) and the implementation behavior (which is bypassing the locale system entirely).
With all that said, I am still looking for other implementations with the properties requested in the question.
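For contrast, here is a sketch of my own (assuming a glibc system with an en_US.UTF-8 locale installed) showing the same kind of byte sequence being rejected in the C locale but accepted once a UTF-8 locale is selected:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    wchar_t wc = 0;

    /* Default C/POSIX locale: glibc rejects the non-ASCII sequence (-1). */
    printf("C locale:     %d\n", mbtowc(&wc, "\xC3\xA9", 2));

    /* Assumes en_US.UTF-8 is available; the two bytes decode to U+00E9. */
    if (setlocale(LC_ALL, "en_US.UTF-8") != NULL) {
        wc = 0;
        printf("UTF-8 locale: %d U+%04X\n",
               mbtowc(&wc, "\xC3\xA9", 2), (unsigned)wc);
    }
    return 0;
}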
"What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale."
This question is very difficult to answer because it mixes the "C Locale", which I'm assuming refers to the C Standard limited character set mentioned above, with "other unusual options", which I'm assuming refers to how the specific implementation handles characters outside the (limited) C locale. Every C Implementation must implement the C Locale; I don't think there's any unusual options surrounding that.
Let's assume for argument that the question is: "...unusual options in implementing additional/extended characters beyond the C locale." Now this becomes an implementation-dependent question, and as you have already mentioned, it "leaves a great deal of freedom to the implementation." So without knowing the target compiler/hardware, it would still be difficult to answer definitively.
Now the last part:
"...attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?"
Instead of converting high bytes while in the C Locale, you might be able to set the Locale in your program as in this SO Question: Does the underlying character set depend only on the C implementation?
This way you can ensure that your characters will be treated in the Locale that you expect.
It is my understanding that the C Locale only concerns itself with the first 7 bits (of an 8-bit char type), based on the sources below:
http://www.cprogramming.com/tutorial/unicode.html
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
http://www.in-ulm.de/~mascheck/locale/
The terms "high bytes" and "Unicode" and "UTF-8" are in the class of multi-byte or wide-character encodings, and are very locale specific (and beyond the range of the minimal C Locale). I'm not clear on how it would be possible to "convert high bytes" in the (pure) C Locale. It's quite possible that implementations would pick a default (extended) locale if none was explicitly set (or pull it from the OS environment settings as stated in one of the links above).
The POSIX standard is quite clear in this regard.
The introduction to character sets in POSIX.1-2017 says:
6.2 Character Encoding
The POSIX locale shall contain 256 single-byte characters including the characters in Portable Character Set and Non-Portable Control Characters, which have the properties listed in LC_CTYPE. It is unspecified whether characters not listed in those two tables are classified as punct or cntrl, or neither. Other locales shall contain the characters in Portable Character Set and may contain any or all of the control characters identified in Non-Portable Control Characters; the presence, meaning, and representation of any additional characters are locale-specific.
(emphasis mine)
The page for mbtowc() says:
The mbtowc() function shall fail if:
[EILSEQ]
 An invalid character sequence is detected. In the POSIX locale an [EILSEQ] error cannot occur since all byte values are valid characters.
Note that the POSIX locale is defined to be identical to the C locale.
So if an operating system conforms to POSIX, mbtowc is a no-op in the POSIX locale. Characters 128–255 are passed through just as characters 0–127 are. Implementations that operate differently are in violation of the standard.

Resources