Suppose that we write in C the following character constant:
'\xFFFFAA'
What is its numerical value?
The standard C99 says:
Character constants have type int.
Hexadecimal character constants can be represented as an unsigned char.
The value of a basic character constant is non-negative.
The value of any character constant fits in the range of char.
Besides:
The range of values of signed char is contained in the range of values of int.
The sizes of char, unsigned char and signed char are the same: 1 byte.
The size of a byte is given by CHAR_BIT, whose value is at least 8.
Let's suppose that we have the typical situation with CHAR_BIT == 8.
Also, let's suppose that char is signed char for us.
By following the rules: the constant '\xFFFFAA' has type int, but its value must be representable as an unsigned char, although its actual value is the one that fits in a char.
From these rules, an example such as '\xFF' would give us:
(int)(char)(unsigned char)'\xFF' == -1
The 1st cast, to unsigned char, comes from the "can be represented as unsigned char" requirement.
The 2nd cast, to char, comes from the "the value fits in a char" requirement.
The 3rd cast, to int, comes from the "has type int" requirement.
However, the constant '\xFFFFAA' is too big, and cannot be "represented" as an unsigned char.
What is its value?
I think that the value is the result of (char)(0xFFFFAA % 256), since the standard says, more or less, the following:
For unsigned integer types, if a value is bigger than the maximum M that can be represented by the type, the value obtained is the remainder after reduction modulo M+1.
Am I right with this conclusion?
EDIT I have been convinced by @KeithThompson: he says that, according to the standards, a too-large hexadecimal character constant is a constraint violation.
So, I will accept that answer.
However: for example, with GCC 4.8 (MinGW), the compiler triggers a warning message and then compiles the program following the behaviour I have described. Thus, a constant like '\x100020' was accepted, and its value was 0x20.
The C standard defines the syntax and semantics in section 6.4.4.4. I'll cite the N1570 draft of the C11 standard.
Paragraph 6:
The hexadecimal digits that follow the backslash and the letter x in a
hexadecimal escape sequence are taken to be part of the construction
of a single character for an integer character constant or of a single
wide character for a wide character constant. The numerical value of
the hexadecimal integer so formed specifies the value of the desired
character or wide character.
Paragraph 9:
Constraints
The value of an octal or hexadecimal escape sequence shall be in the
range of representable values for the corresponding type:
followed by a table saying that with no prefix, the "corresponding type" is unsigned char.
So, assuming that 0xFFFFAA is outside the representable range for type unsigned char, the character constant '\xFFFFAA' is a constraint violation, requiring a compile-time diagnostic. A compiler is free to reject your source file altogether.
If your compiler doesn't at least warn you about this, it's failing to conform to the C standard.
Yes, the standard does say that unsigned types have modular (wraparound) semantics, but that only applies to arithmetic expressions and some conversions, not to the meanings of constants.
(If CHAR_BIT >= 24 on your system, it's perfectly valid, but that's rare; usually CHAR_BIT == 8.)
If a compiler chooses to issue a mere warning and then continue to compile your source, the behavior is undefined (simply because the standard doesn't define the behavior).
On the other hand, if you had actually meant 'xFFFFAA', that's not interpreted as hexadecimal. (I see it was merely a typo, and the question has been edited to correct it, but I'm going to leave this here anyway.) Its value is implementation-defined, as described in paragraph 10:
The value of an integer character constant containing more than one
character (e.g.,
'ab'), ..., is implementation-defined.
Character constants containing more than one character are a nearly useless language feature, used by accident more often than they're used intentionally.
Yes, the value of \xFFFFAA should be representable by unsigned char.
6.4.4.4 9 Constraints
The value of an octal or hexadecimal escape sequence shall be in the
range of representable values for the type unsigned char for an
integer character constant.
But C99 also says,
6.4.4.4 10 Semantics
The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined.
So the resulting value should be in the range of unsigned char ([0, 255] if CHAR_BIT == 8). But as to which value exactly, it depends on the compiler, the architecture, etc.
Related
6.4.4.4/10 ...If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
I'm having trouble understanding this paragraph. After this paragraph standard gives the example below:
Example 2: Consider implementations that use two’s complement representation for
integers and eight bits for objects that have type char. In an
implementation in which type char has the same range of values as
signed char, the integer character constant '\xFF' has the value −1;
if type char has the same range of values as unsigned char, the
character constant '\xFF' has the value +255.
What I understand from the expression "value of an object with type char" is the value we get when we interpret the object's contents as type char. But when we look at the example, it seems to talk about the object's value in pure binary notation. Is my understanding wrong? Does an object's value always mean the bits in that object?
All "integer character constants" (the stuff between ' and ') have type int, out of tradition and for compatibility reasons. But they are mostly meant to be used together with char, so 6.4.4.4/10 needs to make a distinction between the types, basically to patch up the broken C language: we have cases such as *"\xFF" that results in type char but '\xFF' that results in type int, which is very confusing.
The value '\xFF' = 255 will always fit in an int on any implementation, but not necessarily in a char, which has implementation-defined signedness (another inconsistency in the language). The behavior of the escape sequence should be as if we stored the character constant in a char, as done in my string literal example *"\xFF".
This need for consistency with char type even though the value is stored in an int is what 6.4.4.4/10 describes. That is, printf("%d", '\xFF'); should behave just as char ch = 255; printf("%d", (int)ch);
The example is describing one possible implementation, where char is either signed or unsigned and the system uses 2's complement. Generally, the value of an object with integer type refers to decimal notation. char is an integer type, so it can have a negative decimal value (whether the symbol table has a matching index for the value -1 is another story). But "raw binary" cannot have a negative value: 1111 1111 can only be said to be -1 if you say that the memory cell should be interpreted as 8-bit 2's complement. That is, if you know that a signed char is stored there. If you know that an unsigned char is stored there, then the value is 255.
What does the C 2018 standard specify for the value of a hexadecimal escape sequence such as '\xFF'?
Consider a C implementation in which char is signed and eight bits.
Clause 6.4.4.4 tells us about character constants. In paragraph 6, it discusses hexadecimal escape sequences:
The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape sequence are taken to be part of the construction of a single character for an integer character constant or of a single wide character for a wide character constant. The numerical value of the hexadecimal integer so formed specifies the value of the desired character or wide character.
The hexadecimal integer is “FF”. By the usual rules of hexadecimal notation, its value1 is 255. Note that, so far, we do not have a specific type: A “character” is a “member of a set of elements used for the organization, control, or representation of data” (3.7) or a “bit representation that fits in a byte” (3.7.1). When \xFF is used in '\xFF', it is a c-char in the grammar (6.4.4.4 1), and '\xFF' is an integer character constant. Per 6.4.4.4 2, “An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in ’x’.”
6.4.4.4 9 specifies constraints on character constants:
The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the corresponding type:
That is followed by a table that, for character constants with no prefix, shows the corresponding type is unsigned char.
So far, so good. Our hexadecimal escape sequence has value 255, which is in the range of an unsigned char.
Then 6.4.4.4 10 purports to tell us the value of the character constant. I quote it here with its sentences separated and labeled for reference:
(i) An integer character constant has type int.
(ii) The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer.
(iii) The value of an integer character constant containing more than one character (e.g., ’ab’ ), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.
(iv) If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
If 255 maps to an execution character, (ii) applies, and the value of '\xFF' is the value of that character. This is the first use of “maps” in the standard; it is not defined elsewhere. Should it mean anything other than a map from the value derived so far (255) to an execution character with the same value? If so, for (ii) to apply, there must be an execution character with the value 255. Then the value of '\xFF' would be 255.
Otherwise (iii) applies, and the value of '\xFF' is implementation-defined.
Regardless of whether (ii) or (iii) applies, (iv) also applies. It says the value of '\xFF' is the value of a char object whose value is 255, subsequently converted to int. But, since char is signed and eight-bit, there is no char object whose value is 255. So the fourth sentence states an impossibility.
Footnote
1 3.19 defines “value” as “precise meaning of the contents of an object when interpreted as having a specific type,” but I do not believe that technical term is being used here. “The numerical value of the hexadecimal integer” has no object to discuss yet. This appears to be a use of the word “value” in an ordinary sense.
Your demonstration leads to an interesting conclusion:
There is no portable way to write character constants with values outside the range 0 .. CHAR_MAX. This is not necessarily a problem for single characters as one can use integers in place of character constants, but there is no such alternative for string constants.
It seems type char should always be unsigned by default for consistency with many standard C library functions:
fgetc() returns an int: the negative value EOF on failure, or the value of an unsigned char if a byte was successfully read. Hence the meaning and effect of fgetc() == '\xFF' is implementation-defined.
the functions from <ctype.h> accept an int argument with the same values as those returned by fgetc(). Passing a negative char value has undefined behavior.
strcmp() compares strings based on the values of characters converted to unsigned char.
'\xFF' may have the value -1, which is completely unintuitive and potentially identical to the value of EOF.
The only reason to make or keep char signed by default is compatibility with older compilers, for historical code that relies on this behavior and was written before the advent of signed char, some 30 years ago!
I strongly advise programmers to use -funsigned-char to make char unsigned by default and use signed char or better int8_t if one needs signed 8-bit variables and structure members.
As hyde commented, to avoid portability problems, char values should be cast as (unsigned char) where the signedness of char may pose problems: for example:
char str[] = "Hello world\n";
for (int i = 0; str[i]; i++)
str[i] = tolower((unsigned char)str[i]);
I'm curious if I can compile
int map [] = { [ /*(unsigned char)*/ 'a' ]=1 };
regardless of platform or if it's better to cast character constants to unsigned char prior to using them as indices.
A character constant is a positive value of type int if it is based on a member of the basic execution character set.
Since a is in that basic character set, we know that 'a' is required to be positive.
On the other hand, for example, '\xFF' might not be positive. The FF value will be regarded as the bit pattern for a char†, which could be signed, giving us a -1 due to two's complement. Similar reasoning will apply if instead of a numeric escape, we use a character that corresponds to a negative value of type char, like characters corresponding to the 0x80-0xFF byte range on 8-bit systems.
It was like this in ANSI C89 and C90, where I'm relying on my memory; but the requirements persist through newer drafts and standards. In the n1570 draft, we have these items:
6.4.4.4 Character Constants, paragraph 10: "If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int."
6.2.5 Types, paragraph 3: "If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative."
A character constant is not a "char object", but the requirements in 6.4.4.4 specify that the value of a character constant is determined using the char representation: "... one that results when an object with type char whose value ...".
† The numeric escape sequences for unprefixed character constants and those prefixed with L have an associated "corresponding type" which is unsigned, and they are required to be in that type's range (6.4.4.4 9). The idea is that character values are specified as an unsigned value, which gives their bit-wise representation, which is then interpreted as char. This intent is also conveyed in Example 2 (6.4.4.4 13).
I'm curious if I can compile
int map [] = { [ /*(unsigned char)*/ 'a' ]=1 };
regardless of platform or if it's better to cast character constants
to unsigned char prior to using them as indices.
Your specific code is safe.
'a' is an integer character constant. The language specifies of these that
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. [...]
If an integer character constant contains a
single character or escape sequence, its value is the one that results
when an object with type char whose value is that of the single
character or escape sequence is converted to type int.
(C2011, paragraph 6.4.4.4/10)
It furthermore specifies that
If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
(C2011, paragraph 6.2.5/3)
and it requires of every implementation that both the basic source and basic execution character sets contain, among other characters, the lowercase Latin letters, including 'a'. (C2011, paragraph 5.2.1/3)
You should take care, however: an integer character constant for a character that is not a member of the basic execution character set, including a multibyte character, or for a multi-character integer character constant, need not be nonnegative. Some of those could, in principle, be negative even on machines where default char is an unsigned type.
Moreover, again considering multibyte characters, the cast to unsigned char is not necessarily safe either, in that you could produce collisions that way. To be sure to avoid collisions, you would need to convert to unsigned int, but that could produce much larger arrays than you expect. If you stick to the basic character sets then you're ok. If you stick to single-byte characters then you're ok with a cast. If you must accommodate multibyte characters then for portability, you should probably choose a different approach.
According to C11 WG14 draft version N1570:
The header <ctype.h> declares several functions useful for classifying
and mapping characters. In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the argument has any other value,
the behavior is undefined.
Is it undefined behaviour?:
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
int main(void) {
char c = CHAR_MIN; /* let's assume that char is signed and CHAR_MIN < 0 */
return isspace(c) ? EXIT_FAILURE : EXIT_SUCCESS;
}
Does the standard allow to pass char to isspace() (char to int)? In other words, is char after conversion to int representable as an unsigned char?
Here's how wiktionary defines "representable":
Capable of being represented.
Is char capable of being represented as unsigned char? Yes. §6.2.6.1/4:
Values stored in non-bit-field objects of any other object type
consist of n × CHAR_BIT bits, where n is the size of an object of that
type, in bytes. The value may be copied into an object of type
unsigned char [n] (e.g., by memcpy); the resulting set of bytes is
called the object representation of the value.
sizeof(char) == 1 therefore its object representation is unsigned char[1] i.e., char is capable of being represented as an unsigned char. Where am I wrong?
Concrete example: I can represent [-2, -1, 0, 1] as [0, 1, 2, 3]. If I can't, then why not?
Related: According to §6.3.1.3 isspace((unsigned char)c) is portable if INT_MAX >= UCHAR_MAX otherwise it is implementation-defined.
What does representable in a type mean?
Reformulated: a type is a convention for what the underlying bit patterns mean. A value is thus representable in a type if that type assigns some bit pattern that meaning.
A conversion (which might need a cast) is a mapping from a value (represented in a specific type) to a value (possibly different) represented in the target type.
Under the given assumption (that char is signed), CHAR_MIN is certainly negative, and the text you quoted leaves no room for interpretation:
Yes, it is undefined behavior, as unsigned char cannot represent any negative numbers.
If that assumption did not hold, your program would be well-defined, because CHAR_MIN would be 0, a valid value for unsigned char.
Thus, we have a case where it is implementation-defined whether the program is undefined or well-defined.
As an aside, there is no guarantee that sizeof(int) > 1 or that INT_MAX >= UCHAR_MAX, so int might not be able to represent all values possible for unsigned char.
As conversions are defined to be value-preserving, a signed char can always be converted to int.
But if it is negative, that does not change the impossibility of representing a negative value as an unsigned char. (The conversion itself is defined, since conversion from any integer type to any unsigned integer type is always defined; it simply changes the value.)
Under the assumption that char is signed then this would be undefined behavior, otherwise it is well defined since CHAR_MIN would have the value 0. It is easier to see the intention and meaning of:
the value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF
if we read section 7.4 Character handling <ctype.h> from the Rationale for International Standard—Programming Languages—C which says (emphasis mine going forward):
Since these functions are often used primarily as macros, their domain
is restricted to the small positive integers representable in an
unsigned char, plus the value of EOF. EOF is traditionally -1, but may
be any negative integer, and hence distinguishable from any valid
character code. These macros may thus be efficiently implemented by
using the argument as an index into a small array of attributes.
So valid values are:
Positive integers that can fit into unsigned char
EOF which is some implementation defined negative number
Even though this is C99 rationale since the particular wording you are referring to does not change from C99 to C11 and so the rationale still fits.
We can also find why the interface uses int as an argument, as opposed to char, in section 7.1.4 Use of library functions, which says:
All library prototypes are specified in terms of the “widened” types:
an argument formerly declared as char is now written as int. This
ensures that most library functions can be called with or without a
prototype in scope, thus maintaining backwards compatibility with
pre-C89 code. Note, however, that since functions like printf and
scanf use variable-length argument lists, they must be called in the
scope of a prototype.
The revealing quote (for me) is §6.3.1.3/1:
if the value can be represented by the new type, it is unchanged.
i.e., if the value has to be changed then the value can't be represented by the new type.
Therefore an unsigned type can't represent a negative value.
To answer the question in the title: "representable" refers to "can be represented" from §6.3.1.3 and unrelated to "object representation" from §6.2.6.1.
It seems trivial in retrospect. I might have been confused by the habit of treating b'\xFF', 0xff, 255, -1 as the same byte in Python:
>>> (255).to_bytes(1, 'big')
b'\xff'
>>> int.from_bytes(b'\xFF', 'big')
255
>>> 255 == 0xff
True
>>> (-1).to_bytes(1, 'big', signed=True)
b'\xff'
and the disbelief that it is undefined behavior to pass a character to a character classification function, e.g., isspace(CHAR_MIN).
I tried
printf("%d, %d\n", sizeof(char), sizeof('c'));
and got 1, 4 as output. If size of a character is one, why does 'c' give me 4? I guess it's because it's an integer. So when I do char ch = 'c'; is there an implicit conversion happening, under the hood, from that 4 byte value to a 1 byte value when it's assigned to the char variable?
In C, 'a' is an integer character constant (!?!), so 4 is correct for your architecture. It is implicitly converted to char for the assignment. sizeof(char) is always 1 by definition, and sizeof measures in bytes, where a byte is CHAR_BIT bits (at least 8, but not necessarily exactly 8).
The C standard says that a character literal like 'a' is of type int, not type char. It therefore has (on your platform) sizeof == 4. See this question for a fuller discussion.
It is the normal behavior of the sizeof operator (See Wikipedia):
For a datatype, sizeof returns the size of the datatype. For char, you get 1.
For an expression, sizeof returns the size of the type of the variable or expression. As a character literal is typed as int, you get 4.
This is covered in ISO C11 6.4.4.4 Character constants though it's largely unchanged from earlier standards. That states, in paragraph /10:
An integer character constant has type int. The value of an integer character constant
containing a single character that maps to a single-byte execution character is the
numerical value of the representation of the mapped character interpreted as an integer.
According to the ANSI C standards, a char gets promoted to an int in contexts where integers are used; you used an integer format specifier in the printf, hence the different values. A char is always exactly 1 byte (sizeof(char) == 1), but the number of bits in a byte (CHAR_BIT) is implementation-defined.