Are multi-character character constants valid in C? Maybe in MS VC? - c

While reviewing some WINAPI code intended to compile in MS Visual C++, I found the following (simplified):
char buf[4];
// buf gets filled ...
switch ((buf[0] << 8) + buf[1]) {
case 'CT':
    /* ... */
case 'SY':
    /* ... */
default:
    break;
}
I can see why buf[0] is shifted and buf[1] added: the two chars are packed into one 16-bit value. What I don't get is how the comparisons in the case clauses are intended to work.
I don't have access to Visual C++ and, of course, those yield multi-character character constant [-Wmultichar] warnings on gcc/MingW.

This is a non-portable way of storing more than one char in one int. The comparison then happens on the int values, as usual.
Note: think of the final int value as the concatenation of the ASCII values of the individual chars.
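As a concrete, hedged illustration: assuming MSVC-style packing, where 'CT' evaluates to ('C' << 8) + 'T' (0x4354 in ASCII), the switch above works because the packed key equals the multi-char constant. This packing is implementation-defined, so the assert below is an assumption, not a guarantee.

#include <assert.h>

int main(void)
{
    char buf[4] = { 'C', 'T', 0, 0 };     /* stand-in for the real fill */
    int key = (buf[0] << 8) + buf[1];     /* pack the two chars into one int */

    /* Assumed MSVC-style value of the multi-char constant 'CT': ('C' << 8) + 'T' == 0x4354 */
    assert(key == (('C' << 8) + 'T'));
    return 0;
}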
From the Wikipedia article (emphasis mine):
[...] Multi-character constants (e.g. 'xy') are valid, although rarely useful — they let one store several characters in an integer (e.g. 4 ASCII characters can fit in a 32-bit integer, 8 in a 64-bit one). Since the order in which the characters are packed into an int is not specified, portable use of multi-character constants is difficult.
Related, C11, chapter §6.4.4.4/p10
An integer character constant has type int. The value of an integer character constant
containing a single character that maps to a single-byte execution character is the
numerical value of the representation of the mapped character interpreted as an integer.
The value of an integer character constant containing more than one character (e.g.,
'ab'), or containing a character or escape sequence that does not map to a single-byte
execution character, is implementation-defined. [....]

Yes, they are valid; their type is int and their value is implementation-defined.
From C11 draft, 6.4.4.4p10:
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. The
value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined.
(emphasis added)
GCC is being cautious and warns so that you know, in case you used one unintentionally.

Related

How do I compare single multibyte character constants cross-platform in C?

In my previous post I found a solution to do this using C++ strings, but I wonder if there would be a solution using char's in C as well.
My current solution uses str.compare() and size() of a character string as seen in my previous post.
Now, since I only use one (multibyte) character in the std::string, would it be possible to achieve the same using a char?
For example, if( str[i] == '¶' )? How do I achieve that using chars?
(edit: fixed a typo in the comparison operator on SO, as pointed out in the comments)
How do I compare single multibyte character constants cross-platform in C?
You seem to mean an integer character constant expressed using a single multibyte character. The first thing to recognize, then, is that in C, integer character constants (examples: 'c', '¶') have type int, not char. The primary relevant section of C17 is paragraph 6.4.4.4/10:
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
(Emphasis added.)
Note well that "implementation defined" implies limited portability from the get-go. Even if we rule out implementations defining perverse behavior, we still have alternatives such as
the implementation rejects integer character constants containing multibyte source characters; or
the implementation rejects integer character constants that do not map to a single-byte execution character; or
the implementation maps source multibyte characters via a bytewise identity mapping, regardless of the byte sequence's significance in the execution character set.
That is not an exhaustive list.
You can certainly compare integer character constants with each other, but if they map to multibyte execution characters then you cannot usefully compare them to individual chars.
Inasmuch as your intended application appears to be to locate individual multibyte characters in a C string, the most natural thing to do is to implement a C analog of your C++ approach, using the standard strstr() function. Example:
char str[] = "Some string ¶ some text ¶ to see";
char char_to_compare[] = "¶";
int char_size = sizeof(char_to_compare) - 1; // don't count the string terminator

for (char *location = strstr(str, char_to_compare);
     location;
     location = strstr(location + char_size, char_to_compare)) {
    puts("Found!");
}
That will do the right thing in many cases, but it still might be wrong for some characters in some execution character encodings, such as those encodings featuring multiple shift states.
If you want robust handling for characters outside the basic execution character set, then you would be well advised to take control of the in-memory encoding, and to perform appropriate conversions to, operations on, and conversions from that encoding. This is largely what ICU does, for example.
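For illustration only, here is a minimal sketch of that "take control of the encoding" idea using the standard wide-character machinery rather than ICU. It assumes the locale's multibyte encoding matches the encoding of the string literals (e.g. UTF-8 source on a UTF-8 locale), and that the wide execution character set contains '¶'.

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    setlocale(LC_ALL, "");                   /* use the environment's multibyte encoding */

    const char *str = "Some string ¶ some text ¶ to see";
    wchar_t wbuf[128];
    size_t n = mbstowcs(wbuf, str, 128);     /* convert to a fixed-width representation */
    if (n == (size_t)-1)
        return 1;                            /* invalid multibyte sequence */

    for (size_t i = 0; i < n; i++)
        if (wbuf[i] == L'¶')                 /* per-character comparison now works */
            printf("Found at character index %zu\n", i);
    return 0;
}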
I believe you meant something like this:
char a = '¶';
char b = '¶';
if (a == b) /*do something*/;
The above may or may not work. If the value of '¶' is bigger than the char range, it will be truncated, so a and b will store a different value than that of '¶'; however, since both are truncated the same way, they may well still end up equal to each other.
Remember, the char type is simply a one-byte (typically 8-bit) integer, so to work with multibyte characters and avoid that truncation you just have to use a wider integer type (short, int, long...):
short a = '¶';
short b = '¶';
if (a == b) /*do something*/;
From personal experience, I've also noticed that sometimes your environment may try to use a different character encoding than the one you need. For example, trying to print the 'á' character will actually produce something else.
unsigned char x = 'á';
putchar(x); //actually prints character 'ß' in console.
putchar(160); //will print 'á'.
This happens because the console uses an Extended ASCII encoding, while my coding environment uses Unicode, so 'á' is parsed as the value 225 instead of the 160 that I want.

Using int to print character constants [duplicate]

I wrote the following program,
#include <stdio.h>
int main(void)
{
    int i = 'A';
    printf("i=%c", i);
    return 0;
}
and I got the result as,
i=A
So I tried another program,
#include <stdio.h>
int main(void)
{
    int i = 'ABC';
    printf("i=%c", i);
    return 0;
}
As I understand it, since 32 bits are used to store an int value, and each of 'A', 'B' and 'C' has an 8-bit ASCII code, totalling 24 bits, those 24 bits should fit in a 32-bit unit. So I expected the output to be
i=ABC
but the output instead was
i=C
and I can't understand why?
'ABC' in this case is an integer character constant, as per section 6.4.4.4, paragraph 10 of the standard.
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. The
value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined. If an integer character constant contains a
single character or escape sequence, its value is the one that results
when an object with type char whose value is that of the single
character or escape sequence is converted to type int.
In this case, 'A'==0x41, 'B'==0x42, 'C'==0x43, and your compiler then gives i the value 0x414243. As said in the other answer, this value is implementation-dependent.
When you print it with %c, the value is converted to unsigned char, so the upper bytes are discarded and you are left with 0x43, which is 'C'.
To get more insight into it, read the answers to this question as well.
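To see both the packed value and the byte that %c keeps, here is a minimal sketch; the value 0x414243 is an assumption that holds for GCC and MSVC style packing (and compilers will typically warn about the multi-char constant):

#include <stdio.h>

int main(void)
{
    int i = 'ABC';                      /* implementation-defined; often 0x414243 */
    printf("i = %#x\n", (unsigned)i);   /* show the packed value */
    printf("i = %c\n", i);              /* %c converts to unsigned char: only the low byte survives */
    return 0;
}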
The conversion specifier c used in this call
printf("i=%c",i);
in fact extracts one character from the integer argument. So with this specifier you cannot, in any case, get three characters as the output.
From the C Standard (7.21.6.1 The fprintf function)
c If no l length modifier is present, the int argument is converted to
an unsigned char, and the resulting character is written
Take into account that the internal representation of a multi-byte character constant is implementation defined. From the C Standard (6.4.4.4 Character constants)
...The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined.
'ABC' is an integer character constant. Depending on the code set (overwhelmingly ASCII), endianness, and int width (apparently 32 bits in OP's case), it may have a value like one of the below. It is implementation-defined behavior.
'ABC'
0x41424300
0x434241
or others.
The "%c" directs printf() to take the int value, cast it to unsigned char and print the associated character. This is the main reason for apparent loss of information.
In OP's case, it appears that i took on the value 0x414243.
int i = 'ABC';
printf("i=%c", i);          // --> 'C'
// same as
printf("i=%c", 0x414243);   // --> 'C'
If you want i to contain 3 characters, you need to use an array that contains 3 characters:
char i[3];
i[0] = 'A';
i[1] = 'B';
i[2] = 'C';
The '' can contain only one char. Your code converts the integer i into a character, or rather, it stores a converted 8-bit character in your 32-bit integer. But I think you want to separate the 32 bits into 8-bit containers, so make a char array like char i[3]. Then you will see that
int j = i;
results in an error, because you cannot convert a char array into an integer like that.
In C, 'A' is an int constant that's guaranteed to fit into a char.
'ABC' is a multicharacter constant. It has type int, but an implementation-defined value. When you print it with %c, printf converts the int argument to unsigned char, so at most the low byte of that value comes out.

This source code is switching on a string in C. How does it do that?

I'm reading through some emulator code and I've countered something truly odd:
switch (reg){
case 'eax':
/* and so on*/
}
How is this possible? I thought you could only switch on integral types. Is there some macro trickery going on?
(Only you can answer the "macro trickery" part - unless you paste up more code. But there's not much here for macros to work on - formally you are not allowed to redefine keywords; the behaviour on doing that is undefined.)
In order to achieve program readability, the witty developer is exploiting implementation defined behaviour. 'eax' is not a string, but a multi-character constant. Note very carefully the single quotation characters around eax. Most likely it is giving you an int in your case that's unique to that combination of characters. (Quite often each character occupies 8 bits in a 32 bit int). And everyone knows you can switch on an int!
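For comparison, here is a hedged sketch of how the same dispatch could be written without relying on implementation-defined packing; MKTAG is a hypothetical helper, not something from the emulator's source:

#include <stdio.h>

/* Pack three characters with a specified order, so the value is the same on every platform. */
#define MKTAG(a, b, c) (((unsigned)(a) << 16) | ((unsigned)(b) << 8) | (unsigned)(c))

int main(void)
{
    unsigned reg = MKTAG('e', 'a', 'x');   /* stand-in for however reg is actually filled */

    switch (reg) {
    case MKTAG('e', 'a', 'x'):
        puts("eax");
        break;
    default:
        puts("other");
        break;
    }
    return 0;
}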
Finally, a standard reference:
The C99 standard says:
6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or
escape sequence that does not map to a single-byte execution
character, is implementation-defined."
According to the C Standard (6.8.4.2 The switch statement)
3 The expression of each case label shall be an integer constant
expression...
and (6.6 Constant expressions)
6 An integer constant expression shall have integer type and shall
only have operands that are integer constants, enumeration constants,
character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of
casts. Cast operators in an integer constant expression shall only
convert arithmetic types to integer types, except as part of an
operand to the sizeof operator.
Now what is 'eax'?
The C Standard (6.4.4.4 Character constants)
2 An integer character constant is a sequence of one or more
multibyte characters enclosed in single-quotes, as in 'x'...
So 'eax' is an integer character constant according to the paragraph 10 of the same section
...The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined.
So according to the first mentioned quote it can be an operand of an integer constant expression that may be used as a case label.
Note that a character constant (enclosed in single quotes) has type int and is not the same as a string literal (a sequence of characters enclosed in double quotes), which has the type of a character array.
As others have said, this is an int constant and its actual value is implementation-defined.
I assume the rest of the code looks something like
if (SOMETHING)
    reg = 'eax';
...
switch (reg){
case 'eax':
    /* and so on*/
}
You can be sure that 'eax' in the first part has the same value as 'eax' in the second part, so it all works out, right? ... wrong.
In a comment, @Davislor lists some possible values for 'eax':
... 0x65, 0x656178, 0x65617800, 0x786165, 0x6165, or something else
Notice the first potential value? That is just 'e', ignoring the other two characters. The problem is the program probably uses 'eax', 'ebx',
and so on. If all these constants have the same value as 'e' you end up with
switch (reg){
case 'e':
...
case 'e':
...
...
}
This doesn't look too good, does it?
The good part about "implementation-defined" is that the programmer can check the documentation of their compiler and see if it does something sensible with these constants. If it does, home free.
The bad part is that some other poor fellow can take the code and try to compile it using some other compiler. Instant compile error. The program is not portable.
As @zwol pointed out in the comments, the situation is not quite as bad as I thought: in the bad case the code doesn't compile. This will at least give you an exact file name and line number for the problem. Still, you will not have a working program.
The code fragment uses a historical oddity called a multi-character character constant, also referred to as a multi-char.
'eax' is an integer constant whose value is implementation defined.
Here is an interesting page on multi-chars and how they can be used but should not:
http://www.zipcon.net/~swhite/docs/computers/languages/c_multi-char_const.html
Looking further back in the rearview mirror, here is how the original C manual by Dennis Ritchie from the good old days ( https://www.bell-labs.com/usr/dmr/www/cman.pdf ) specified character constants.
2.3.2 Character constants
A character constant is 1 or 2 characters enclosed in single quotes ‘‘ ' ’’. Within a character constant a single quote must be preceded by a back-slash ‘‘\’’. Certain non-graphic characters, and ‘‘\’’ itself, may be escaped according to the following table:
BS \b
NL \n
CR \r
HT \t
ddd \ddd
\ \\
The escape ‘‘\ddd’’ consists of the backslash followed by 1, 2, or 3 octal digits which are taken to specify the value of the desired character. A special case of this construction is ‘‘\0’’ (not followed by a digit) which indicates a null character.
Character constants behave exactly like integers (not, in particular, like objects of character type). In conformity with the addressing structure of the PDP-11, a character constant of length 1 has the code for the given character in
the low-order byte and 0 in the high-order byte; a character constant of length 2 has the code for the first character in the low byte and that for the second character in the high-order byte. Character constants with more than one character are inherently machine-dependent and should be avoided.
The last phrase is all you need to remember about this curious construction: Character constants with more than one character are inherently machine-dependent and should be avoided.

C99 Standard - fprintf - s conversion with precision

Let's assume that only the C99 Standard paper is available, and that the printf library function needs to be implemented according to this standard for a platform that works with UTF-16 encoding. Could you please clarify the expected behavior of the s conversion when a precision is specified?
C99 Standard (7.19.6.1) for s conversion says:
If no l length modifier is present, the argument shall be a pointer to the initial element of an array of character type. Characters from the array are written up to (but not including) the terminating null character. If the precision is specified, no more than that many bytes are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character.
If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are converted to multibyte characters (each as if by a call to the wcrtomb function, with the conversion state described by an mbstate_t object initialized to zero before the first wide character is converted) up to and including a terminating null wide character. The resulting multibyte characters are written up to (but not including) the terminating null character (byte). If no precision is specified, the array shall contain a null wide character. If a precision is specified, no more than that many bytes are written (including shift sequences, if any), and the array shall contain a null wide character if, to equal the multibyte character sequence length given by the precision, the function would need to access a wide character one past the end of the array. In no case is a partial multibyte character written.
I don't quite understand this paragraph in general and the statement "If a precision is specified, no more than that many bytes are written" in particular.
For example, let's take UTF-16 string "TEST" (byte sequence: 0x54, 0x00, 0x45, 0x00, 0x53, 0x00, 0x54, 0x00).
What is expected to be written to the output buffer in the following cases:
If precision is 3
If precision is 9 (one byte more than string length)
If precision is 12 (several bytes more than string length)
Then there's also "Wide characters from the array are converted to multibyte characters". Does it mean UTF-16 should be converted to UTF-8 first? This is pretty strange, given that I expect to work with UTF-16 only.
Converting a comment into a slightly expanded answer.
What is the value of CHAR_BIT in your implementation?
If CHAR_BIT == 8, you can't handle UTF-16 with %s; you'd use %ls and you'd pass a wchar_t * as the corresponding argument. You'd then have to read the second paragraph of the specification.
If CHAR_BIT == 16, then you can't have an odd number of octets in the data. You then need to know about how wchar_t relates to char (are they the same size? do they have the same signedness?) and interpret both paragraphs to come up with a uniform effect — unless you decided to have wchar_t represent UTF-32.
The key point is that UTF-16 cannot be handled as a C string if CHAR_BIT == 8 because there are too many useful characters that are encoded with one byte holding zero, but those zero bytes mark the end of a null-terminated string. To handle UTF-16, either the plain char type has to be a 16-bit (or larger) type (so CHAR_BIT > 8), or you have to use wchar_t (and sizeof(wchar_t) > sizeof(char)).
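To make the embedded zero byte problem concrete, here is a minimal sketch, assuming CHAR_BIT == 8 and the UTF-16LE byte order from the question:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* UTF-16LE bytes for "TEST": every second byte is zero, so as a char
       string it looks like the one-character string "T". */
    const char utf16le_test[] = { 0x54, 0x00, 0x45, 0x00, 0x53, 0x00, 0x54, 0x00 };
    printf("strlen = %zu\n", strlen(utf16le_test));   /* prints 1 */
    return 0;
}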
Note that the specification expects that wide characters will be converted to a suitable multibyte representation.
If you want wide characters output natively, you have to use the fwprintf() and related function from <wchar.h>, first defined in C99. The specification there has a lot in common with the specification of fprintf(), but there are (unsurprisingly) important differences.
7.29.2.1 The fwprintf function
…
s
If no l length modifier is present, the argument shall be a pointer to the initial
element of a character array containing a multibyte character sequence
beginning in the initial shift state. Characters from the array are converted as
if by repeated calls to the mbrtowc function, with the conversion state
described by an mbstate_t object initialized to zero before the first
multibyte character is converted, and written up to (but not including) the
terminating null wide character. If the precision is specified, no more than
that many wide characters are written. If the precision is not specified or is
greater than the size of the converted array, the converted array shall contain a
null wide character.
If an l length modifier is present, the argument shall be a pointer to the initial
element of an array of wchar_t type. Wide characters from the array are
written up to (but not including) a terminating null wide character. If the
precision is specified, no more than that many wide characters are written. If
the precision is not specified or is greater than the size of the array, the array
shall contain a null wide character.
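To illustrate the difference in practice, here is a minimal sketch (not from the standard) using fwprintf with a hypothetical wide string; note that for fwprintf's %ls the precision counts wide characters rather than bytes, unlike the byte-oriented fprintf:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");          /* pick up the environment's locale */

    wchar_t msg[] = L"TEST";
    /* The stream becomes wide-oriented after the first wide operation,
       so don't mix this with byte output (fprintf) on the same stream. */
    fwprintf(stdout, L"%.3ls\n", msg);   /* precision 3 wide characters: prints "TES" */
    return 0;
}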
wchar_t is not meant to be used for UTF-16, only for implementation-defined fixed-width encodings depending on the current locale. There's simply no sane way to support a variable-length encoding with the wide character API. Likewise, the multi-byte representation used by functions like printf or wcrtomb is implementation-defined. If you want to write portable code using Unicode, you can't rely on the wide character API. Use a library or roll your own code.
To answer your question: fprintf with the l modifier accepts a wide character string in the implementation-defined encoding specified by the current locale. If wchar_t is 16 bits, this encoding might be a bastardization of UTF-16, but as I mentioned above, there's no way to properly support UTF-16 surrogates. This wchar_t string is then converted to a multi-byte char string in an implementation-defined encoding. This might or might not be UTF-8. The specified precision limits the number of chars in the output string with the added restriction that no partial multi-byte characters are written.
Here's an example. Let's assume that the wide character encoding is UTF-32 with 32-bit wchar_t and that the multi-byte encoding is UTF-8 (like on Linux with an appropriate locale). The following code
wchar_t w[] = { 0x1F600, 0 }; // U+1F600 GRINNING FACE
printf("%.3ls", w);
will print nothing at all, since the resulting UTF-8 sequence is four bytes long. The character is only printed if you specify a precision of at least four:
printf("%.4ls", w);
EDIT: To answer your second question, no, printf should never write a null character. The sentence only means that in certain cases, a null character is required to specify the end of the string and avoid buffer over-reads.

What does \x mean in C/C++?

Example:
char arr[] = "\xeb\x2a";
BTW, are the following the same?
"\xeb\x2a" vs. '\xeb\x2a'
\x indicates a hexadecimal character escape. It's used to specify characters that aren't typeable (like a null '\x00').
And "\xeb\x2a" is a literal string (type is char *, 3 bytes, null-terminated), and '\xeb\x2a' is a character constant (type is int, 2 bytes, not null-terminated, and is just another way to write 0xEB2A or 60202 or 0165452). Not the same :)
As others have said, the \x is an escape sequence that starts a "hexadecimal-escape-sequence".
Some further details from the C99 standard:
When used inside a set of single-quotes (') the characters are part of an "integer character constant" which is (6.4.4.4/2 "Character constants"):
a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'.
and
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.
So the sequence in your example of '\xeb\x2a' is an implementation defined value. It's likely to be the int value 0xeb2a or 0x2aeb depending on whether the target platform is big-endian or little-endian, but you'd have to look at your compiler's documentation to know for certain.
When used inside a set of double-quotes (") the characters specified by the hex-escape-sequence are part of a null-terminated string literal.
From the C99 standard 6.4.5/3 "String literals":
The same considerations apply to each element of the sequence in a character string literal or a wide string literal as if it were in an integer character constant or a wide character constant, except that the single-quote ' is representable either by itself or by the escape sequence \', but the double-quote " shall be represented by the escape sequence \".
Additional info:
In my opinion, you should avoid using 'multi-character' constants. There are only a few situations where they provide any value over using a regular, old int constant. For example, '\xeb\x2a' could more portably be specified as 0xeb2a or 0x2aeb, depending on which value you really wanted.
One area that I've found multi-character constants to be of some use is to come up with clever enum values that can be recognized in a debugger or memory dump:
enum CommandId {
    CMD_ID_READ  = 'read',
    CMD_ID_WRITE = 'writ',
    CMD_ID_DEL   = 'del ',
    CMD_ID_FOO   = 'foo '
};
There are few portability problems with the above (other than platforms that have small ints or warnings that might be spewed). Whether the characters end up in the enum values in little- or big-endian form, the code will still work (unless you're doing something else unholy with the enum values). If the characters end up in the value using an endianness that wasn't what you expected, it might make the values less easy to read in a debugger, but the 'correctness' isn't affected.
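For instance, here is a sketch of how those bytes might look in a memory dump, assuming ASCII and GCC/MSVC-style packing where 'read' evaluates to 0x72656164 (again, expect a multi-char constant warning):

#include <stdio.h>
#include <string.h>

enum CommandId { CMD_ID_READ = 'read' };   /* implementation-defined value */

int main(void)
{
    enum CommandId id = CMD_ID_READ;
    unsigned char bytes[sizeof id];
    memcpy(bytes, &id, sizeof id);

    /* Typically prints "daer" on a little-endian machine and "read" on a
       big-endian one; either way the enum still compares equal to itself. */
    for (size_t i = 0; i < sizeof bytes; i++)
        putchar(bytes[i]);
    putchar('\n');
    return 0;
}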
When you say:
BTW, are these the same:
"\xeb\x2a" vs '\xeb\x2a'
They are in fact not. The first creates a string literal, terminated with a zero byte, containing the two characters whose hex representations you provide. The second creates an integer character constant.
It's an escape that indicates the characters following it are the hexadecimal code for a character.
http://www.austincc.edu/rickster/COSC1320/handouts/escchar.htm
The \x means it's a hex character escape. So \xeb would mean character eb in hex, or 235 in decimal. See http://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx for more information.
As for the second, no, they are not the same. The double-quotes, ", means it's a string of characters, a null-terminated character array, whereas a single quote, ', means it's a single character, the byte that character represents.
\x allows you to specify the character by its hexadecimal code.
This allows you to specify characters that are normally not printable (some of which have predefined escape sequences, such as '\n' = newline, '\t' = tab, and '\b' = backspace).
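A tiny sketch of the idea, assuming an ASCII execution character set (so '\x41' is the same character as 'A'):

#include <stdio.h>

int main(void)
{
    char c = '\x41';              /* the escape just spells the character code in hex */
    printf("%c %d\n", c, c);      /* prints: A 65 */
    return 0;
}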
A useful website is here.
And I quote:
x Unsigned hexadecimal integer
That way, your \xeb is like 235 in decimal.

Resources