Are character constants always positive? - c

I'm curious if I can compile
int map [] = { [ /*(unsigned char)*/ 'a' ]=1 };
regardless of platform or if it's better to cast character constants to unsigned char prior to using them as indices.

A character constant is a positive value of type int if it is based on a member of the basic execution character set.
Since a is in that basic character set, we know that 'a' is required to be positive.
On the other hand, for example, '\xFF' might not be positive. The FF value is regarded as the bit pattern for a char†, which could be signed, giving us -1 under two's complement. Similar reasoning applies if, instead of a numeric escape, we use a character that corresponds to a negative value of type char, such as characters in the 0x80-0xFF byte range on systems with an 8-bit, signed char.
It was like this in ANSI C89 and ISO C90 (here I'm relying on memory), and the requirements persist through newer drafts and standards. In the N1570 draft, we have these items:
6.4.4.4 Character Constants, paragraph 10: "If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int."
6.2.5 Types, paragraph 3: "If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative."
A character constant is not a "char object", but the requirements in 6.4.4.4 specify that the value of a character constant is determined using the char representation: "... one that results when an object with type char whose value ...".
† The numeric escape sequences for unprefixed character constants and those prefixed with L have an associated "corresponding type" which is unsigned, and they are required to be in that type's range (6.4.4.4 9). The idea is that character values are specified as an unsigned value, which gives their bit-wise representation, which is then interpreted as char. This intent is also conveyed in Example 2 (6.4.4.4 13).

I'm curious if I can compile
int map [] = { [ /*(unsigned char)*/ 'a' ]=1 };
regardless of platform or if it's better to cast character constants
to unsigned char prior to using them as indices.
Your specific code is safe.
'a' is an integer character constant. The language specifies of these that
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. [...]
If an integer character constant contains a
single character or escape sequence, its value is the one that results
when an object with type char whose value is that of the single
character or escape sequence is converted to type int.
(C2011, paragraph 6.4.4.4/10)
It furthermore specifies that
If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
(C2011, paragraph 6.2.5/3)
and it requires of every implementation that both the basic source and basic execution character sets contain, among other characters, the lowercase Latin letters, including 'a'. (C2011, paragraph 5.2.1/3)
You should take care, however: an integer character constant for a character that is not a member of the basic execution character set (including a multibyte character), or a multi-character integer character constant, need not be nonnegative. Some of those could, in principle, be negative even on machines where plain char is an unsigned type.
Moreover, again considering multibyte characters, the cast to unsigned char is not necessarily safe either, in that you could produce collisions that way. To be sure to avoid collisions, you would need to convert to unsigned int, but that could produce much larger arrays than you expect. If you stick to the basic character sets then you're ok. If you stick to single-byte characters then you're ok with a cast. If you must accommodate multibyte characters then for portability, you should probably choose a different approach.

Is a 64-bit character literal possible in C?

The following code compiles fine:
uint32_t myfunc32() {
    uint32_t var = 'asdf';
    return var;
}
The following code gives the warning, "character constant too long for its type":
uint64_t myfunc64() {
    uint64_t var = 'asdfasdf';
    return var;
}
Indeed, the 64-bit character literal gets truncated to a 32-bit constant by GCC. Are 64-bit character literals not a feature of C? I can't find any good info on this.
Edit: I am doing some more testing. It turns out that another compiler, MetroWerks CodeWarrior, can compile the 64-bit character literals as expected. If this is not already a feature of GCC, it really ought to be.
Are 64-bit character literals not a feature of C?
Indeed they are not. As per C99 §6.4.4.4 point 10 (page 73 here):
An integer character constant has type int. The value of an integer character constant
containing a single character that maps to a single-byte execution character is the
numerical value of the representation of the mapped character interpreted as an integer.
The value of an integer character constant containing more than one character (e.g.,
'ab'), or containing a character or escape sequence that does not map to a single-byte
execution character, is implementation-defined.
So, character constants have type int, which on most modern platforms means int32_t. On the other hand, the actual value of the int resulting from a multi-byte character constant is implementation-defined, so you can't really expect much from int x = 'abc'; unless you are targeting a specific compiler and compiler version. You should avoid using such statements in sane C code.
As per GCC-specific behavior, from the GCC documentation we have:
The numeric value of character constants in preprocessor expressions.
The preprocessor and compiler interpret character constants in the same way; i.e. escape sequences such as ‘\a’ are given the values they would have on the target machine.
The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed, regardless of whether single characters are signed or not. If there are more characters in the constant than would fit in the target int the compiler issues a warning, and the excess leading characters are ignored.
For example, 'ab' for a target with an 8-bit char would be interpreted as ‘(int) ((unsigned char) 'a' * 256 + (unsigned char) 'b')’, and '\234a' as ‘(int) ((unsigned char) '\234' * 256 + (unsigned char) 'a')’.

Value of character constants in C

6.4.4.4/10 ...If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
I'm having trouble understanding this paragraph. After this paragraph standard gives the example below:
Example 2: Consider implementations that use two’s complement representation for
integers and eight bits for objects that have type char. In an
implementation in which type char has the same range of values as
signed char, the integer character constant '\xFF' has the value −1;
if type char has the same range of values as unsigned char, the
character constant '\xFF' has the value +255.
What I understand from the expression "value of an object with type char" is the value we get when we interpret the object's contents as type char. But when we look at the example, it seems to talk about the object's value in pure binary notation. Is my understanding wrong? Does an object's value always mean the bits in that object?
All "integer character constants" (the stuff between ' and ') have type int, for tradition and compatibility reasons. But they are mostly meant to be used together with char, so 6.4.4.4/10 needs to make a distinction between the types. Basically, it patches up an inconsistency in the C language: we have cases such as *"\xFF" that results in type char but '\xFF' that results in type int, which is very confusing.
The value '\xFF' = 255 will always fit in an int on any implementation, but not necessarily in a char, which has implementation-defined signedness (another inconsistency in the language). The behavior of the escape sequence should be as if we stored the character constant in a char, as done in my string literal example *"\xFF".
This need for consistency with char type even though the value is stored in an int is what 6.4.4.4/10 describes. That is, printf("%d", '\xFF'); should behave just as char ch = 255; printf("%d", (int)ch);
The example is describing one possible implementation, where char is either signed or unsigned and the system uses two's complement. Generally, the value of an object with integer type refers to decimal notation. char is an integer type, so it can have a negative decimal value (whether the character set has a symbol matching the value -1 is another story). But "raw binary" cannot have a negative value; 1111 1111 can only be said to be -1 if you say that the memory cell should be interpreted as 8-bit two's complement. That is, if you know that a signed char is stored there. If you know that an unsigned char is stored there, then the value is 255.

C programming: Type char variable Is For Just One Letter Or Number?

In the textbook "The C Programming Language," page 9 has the line below.
"char character--a single byte"
Is this meaning the type "char" variable can keep just one letter, number or symbol?
I also want to understand the term's precise definition.
My understanding is here. Is this correct?
Character: Any letter, number or symbol.
Character string: Several characters.
If it is wrong, I want the correct definition.
Thank you, all members of the community, for your everyday support.
The formal C standard definition of character sets (5.2.1):
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
The basic character set is specified to contain:
the 26 uppercase letters of the Latin alphabet /--/
the 26 lowercase letters of the Latin alphabet /--/
the 10 decimal digits /--/
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and form feed.
In the basic execution character set, there shall be
control characters representing alert, backspace, carriage return, and new line.
The representation of each member of the source and execution basic
character sets shall fit in a byte.
Then 6.2.5 says:
An object declared as type char is large enough to store any member of the basic execution character set.
The formal definition of a byte is very similar (3.6):
byte
addressable unit of data storage large enough to hold any member of the basic character set of the execution environment
Furthermore, it is specified that a char is always 1 byte large (6.5.3.4):
The sizeof operator yields the size (in bytes) of its operand /--/
When sizeof is applied to an operand that has type char, unsigned char, or
signed char, (or a qualified version thereof) the result is 1.
The C standard does not however specify the number of bits in a byte, only that it has to be 8 bits or more.
The standard (draft n1570 for C11) says:
An object declared as type char is large enough to store any member of the basic
execution character set. If a member of the basic execution character set is stored in a
char object, its value is guaranteed to be nonnegative.
As the basic character set contains all uppercase and lowercase letters, the decimal digits, and some other characters, it needs at least 7 bits to be represented. Anyway, the standard mandates that a char be at least 8 bits:
[The] implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
— number of bits for smallest object that is not a bit-field (byte)
CHAR_BIT 8
A char must be individually addressable. For that reason a char is said to be a byte, and by definition sizeof(char) is 1 whatever the exact number of bits - some old mainframes used 12- or 16-bit characters.
unsigned char and signed char are integer types that use the same storage size as char. They are distinct types, yet the conversions between the three types are perfectly defined and never change the representation. Even though char is a distinct type, the standard requires:
The implementation shall define char to have the same range,
representation, and behavior as either signed char or unsigned char.
On common architectures, a char uses 8 bits. All values in the range 0-127 represent the ASCII character set (NB: this is not mandated by the standard, and other representations like EBCDIC were used). Values in the other range (-128 to -1, or 128-255) are called extended chars and can represent either an ISO-8859-x (Latin) charset, or bytes in a multi-byte encoding like UTF-8 or UCS-2 (the subset of UTF-16 for Unicode characters in the 0-FFFF range). ISO-8859-1, or Latin-1, is a single-byte charset representing Unicode characters in the range 0-255. It used to be a de facto standard, and Windows still uses CP1252 (a close variation) for West European language systems.
TL/DR: to directly answer your question:
a char represents some symbols, at least those in the basic execution character set
a character string is, by convention, a null-terminated char array. The represented symbols depend on the charset used, and for multi-byte charsets (like UTF-8) there is no one-to-one relation between a char and a symbol
A char takes one byte of storage and, if signed, can represent a value between -128 and +127. It is commonly used to hold a single ASCII character. In the ASCII encoding, all printable characters are assigned values from 32 (space) to 126 (tilde, '~'), with non-printable characters assigned to the rest of the codes.
Note that unlike the Java char (which can hold any UTF-16 code unit), the C char will not be able to represent all Latin characters, such as accented letters, in a single byte portably.
Typically char is a one-byte variable type, and as a byte is made of 8 bits, the value range for char is 0-255, or -128 to 127 if signed (one bit is used for sign indication).
Those 256 values are used to represent, in the case of a char, a symbol, letter, or digit (or some special characters) from the ASCII table.
If you want, for example, to store a Japanese letter or an emoji, which requires more than one byte (as there are many more characters than 256, as you know), you will have to use a type that supports such a size - for Unicode, something like wchar_t.

What does the C standard specify for the value of a character constant with a hexadecimal escape sequence?

What does the C 2018 standard specify for the value of a hexadecimal escape sequence such as '\xFF'?
Consider a C implementation in which char is signed and eight bits.
Clause 6.4.4.4 tells us about character constants. In paragraph 6, it discusses hexadecimal escape sequences:
The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape sequence are taken to be part of the construction of a single character for an integer character constant or of a single wide character for a wide character constant. The numerical value of the hexadecimal integer so formed specifies the value of the desired character or wide character.
The hexadecimal integer is “FF”. By the usual rules of hexadecimal notation, its value1 is 255. Note that, so far, we do not have a specific type: A “character” is a “member of a set of elements used for the organization, control, or representation of data” (3.7) or a “bit representation that fits in a byte” (3.7.1). When \xFF is used in '\xFF', it is a c-char in the grammar (6.4.4.4 1), and '\xFF' is an integer character constant. Per 6.4.4.4 2, “An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in ’x’.”
6.4.4.4 9 specifies constraints on character constants:
The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the corresponding type:
That is followed by a table that, for character constants with no prefix, shows the corresponding type is unsigned char.
So far, so good. Our hexadecimal escape sequence has value 255, which is in the range of an unsigned char.
Then 6.4.4.4 10 purports to tell us the value of the character constant. I quote it here with its sentences separated and labeled for reference:
(i) An integer character constant has type int.
(ii) The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer.
(iii) The value of an integer character constant containing more than one character (e.g., ’ab’ ), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.
(iv) If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
If 255 maps to an execution character, (ii) applies, and the value of '\xFF' is the value of that character. This is the first use of “maps” in the standard; it is not defined elsewhere. Should it mean anything other than a map from the value derived so far (255) to an execution character with the same value? If so, for (ii) to apply, there must be an execution character with the value 255. Then the value of '\xFF' would be 255.
Otherwise (iii) applies, and the value of '\xFF' is implementation-defined.
Regardless of whether (ii) or (iii) applies, (iv) also applies. It says the value of '\xFF' is the value of a char object whose value is 255, subsequently converted to int. But, since char is signed and eight-bit, there is no char object whose value is 255. So the fourth sentence states an impossibility.
Footnote
1 3.19 defines “value” as “precise meaning of the contents of an object when interpreted as having a specific type,” but I do not believe that technical term is being used here. “The numerical value of the hexadecimal integer” has no object to discuss yet. This appears to be a use of the word “value” in an ordinary sense.
Your demonstration leads to an interesting conclusion:
There is no portable way to write character constants with values outside the range 0 .. CHAR_MAX. This is not necessarily a problem for single characters as one can use integers in place of character constants, but there is no such alternative for string constants.
It seems type char should always be unsigned by default for consistency with many standard C library functions:
fgetc() returns an int with a negative value EOF for failure and the value of an unsigned char if a byte was successfully read. Hence the meaning and effect of fgetc() == '\xFF' is implementation defined.
the functions from <ctype.h> accept an int argument with the same values as those returned by fgetc(). Passing a negative char value has undefined behavior.
strcmp() compares strings based on the values of characters converted to unsigned char.
'\xFF' may have the value -1 which is completely unintuitive and is potentially identical to the value of EOF.
The only reason to make or keep char signed by default is compatibility with older compilers, for historical code that relies on this behavior and was written before the advent of signed char, some 30 years ago!
I strongly advise programmers to use -funsigned-char to make char unsigned by default and use signed char or better int8_t if one needs signed 8-bit variables and structure members.
As hyde commented, to avoid portability problems, char values should be cast as (unsigned char) where the signedness of char may pose problems: for example:
#include <ctype.h>

char str[] = "Hello world\n";
for (int i = 0; str[i]; i++)
    str[i] = tolower((unsigned char)str[i]);

Which is the value of a "big" character hexadecimal constant in C?

Suppose that we write in C the following character constant:
'\xFFFFAA'
Which is its numerical value?
The standard C99 says:
Character constants have type int.
Hexadecimal character constants can be represented as an unsigned char.
The value of a basic character constant is non-negative.
The value of any character constant fits in the range of char.
Besides:
The range of values of signed char is contained in the range of values of int.
The size (in bits) of char, unsigned char and signed char are the same: 1 byte.
The size of a byte is given by CHAR_BIT, whose value is at least 8.
Let's suppose that we have the typical situation with CHAR_BIT == 8.
Also, let's suppose that char is signed char for us.
By following the rules: the constant '\xFFFFAA' has type int, but its value can be represented in an unsigned char, although its real value fits in a char.
From these rules, an example as '\xFF' would give us:
(int)(char)(unsigned char)'\xFF' == -1
The 1st cast unsigned char comes from the "can be represented as unsigned char" requirement.
The 2nd cast char comes from the "the value fits in a char" requirement.
The 3rd cast int comes from the "has type int" requirement.
However, the constant '\xFFFFAA' is too big and cannot be "represented" as an unsigned char.
Which is its value?
I think that the value is the resulting of (char)(0xFFFFAA % 256) since the standard says, more or less, the following:
For unsigned integer types, if a value is bigger than the maximum M that can be represented by the type, the value obtained is the remainder modulo M+1.
Am I right with this conclusion?
EDIT: I have been convinced by @KeithThompson: he says that, according to the standard, a big hexadecimal character constant is a constraint violation.
So, I will accept that answer.
However: for example, with GCC 4.8 on MinGW, the compiler triggers a warning message, and the program compiles following the behaviour I have described. Thus, a constant like '\x100020' was accepted, and its value was 0x20.
The C standard defines the syntax and semantics in section 6.4.4.4. I'll cite the N1570 draft of the C11 standard.
Paragraph 6:
The hexadecimal digits that follow the backslash and the letter x in a
hexadecimal escape sequence are taken to be part of the construction
of a single character for an integer character constant or of a single
wide character for a wide character constant. The numerical value of
the hexadecimal integer so formed specifies the value of the desired
character or wide character.
Paragraph 9:
Constraints
The value of an octal or hexadecimal escape sequence shall be in the
range of representable values for the corresponding type:
followed by a table saying that with no prefix, the "corresponding type" is unsigned char.
So, assuming that 0xFFFFAA is outside the representable range for type unsigned char, the character constant '\xFFFFAA' is a constraint violation, requiring a compile-time diagnostic. A compiler is free to reject your source file altogether.
If your compiler doesn't at least warn you about this, it's failing to conform to the C standard.
Yes, the standard does say that unsigned types have modular (wraparound) semantics, but that only applies to arithmetic expressions and some conversions, not to the meanings of constants.
(If CHAR_BIT >= 24 on your system, it's perfectly valid, but that's rare; usually CHAR_BIT == 8.)
If a compiler chooses to issue a mere warning and then continue to compile your source, the behavior is undefined (simply because the standard doesn't define the behavior).
On the other hand, if you had actually meant 'xFFFFAA', that's not interpreted as hexadecimal. (I see it was merely a typo, and the question has been edited to correct it, but I'm going to leave this here anyway.) Its value is implementation-defined, as described in paragraph 10:
The value of an integer character constant containing more than one
character (e.g.,
'ab'), ..., is implementation-defined.
Character constants containing more than one character are a nearly useless language feature, used by accident more often than they're used intentionally.
Yes, the value of \xFFFFAA should be representable by unsigned char.
6.4.4.4 9 Constraints
The value of an octal or hexadecimal escape sequence shall be in the
range of representable values for the type unsigned char for an
integer character constant.
But C99 also says,
6.4.4.4 10 Semantics
The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined.
So the resulting value should be in the range of unsigned char ([0, 255] if CHAR_BIT == 8). But as to which one, it depends on the compiler, architecture, etc.
