Value of character constants in C

6.4.4.4/10 ...If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
I'm having trouble understanding this paragraph. After this paragraph standard gives the example below:
Example 2: Consider implementations that use two’s complement representation for
integers and eight bits for objects that have type char. In an
implementation in which type char has the same range of values as
signed char, the integer character constant '\xFF' has the value −1;
if type char has the same range of values as unsigned char, the
character constant '\xFF' has the value +255.
What I understand from the expression "value of an object with type char" is the value we get when we interpret the object's content with type char. But when we look at the example, it seems to talk about the object's value in pure binary notation. Is my understanding wrong? Does an object's value always mean the bits in that object?

All "integer character constants" (the stuff between ' and ') have type int out of tradition and compatibility reasons. But they are mostly meant to be used together with char, so 6.4.4.4/10 needs to make a distinction between the types. Basically patch up the broken C language - we have cases such as *"\xFF" that results in type char but '\xFF' results in type int, which is very confusing.
The value '\xFF' = 255 will always fit in an int on any implementation, but not necessarily in a char, which has implementation-defined signedness (another inconsistency in the language). The behavior of the escape sequence should be as if we stored the character constant in a char, as done in my string literal example *"\xFF".
This need for consistency with char type even though the value is stored in an int is what 6.4.4.4/10 describes. That is, printf("%d", '\xFF'); should behave just as char ch = 255; printf("%d", (int)ch);
The example is describing one possible implementation, where char is either signed or unsigned and the system uses 2's complement. Generally the value of an object with integer type refers to decimal notation. char is an integer type, so it can have a negative decimal value (whether the symbol table has a matching index for the value -1 is another story). But "raw binary" cannot have a negative value: 1111 1111 can only be said to be -1 if you say that the memory cell should be interpreted as 8 bit 2's complement. That is, if you know that a signed char is stored there. If you know that an unsigned char is stored there, then the value is 255.
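The rule can be checked with a small sketch. The function names below are made up for illustration, and the expected values assume CHAR_BIT == 8 and two's complement, as in the standard's Example 2:

```c
#include <limits.h>

/* Value of the integer character constant '\xFF' (type int in C). */
int hex_constant_value(void)
{
    return '\xFF';
}

/* The same byte read from a string literal (type char), then converted to int. */
int string_element_value(void)
{
    return (int)*"\xFF";
}

/* What Example 2 predicts: -1 if plain char is signed, 255 if it is unsigned.
   Assumes CHAR_BIT == 8 and two's complement. */
int expected_value(void)
{
    return (CHAR_MIN < 0) ? -1 : 255;
}
```

On a conforming implementation all three functions should return the same number; which number that is depends only on the signedness of plain char.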

Related

Operations with Characters and Integers in C

I am new to programming and I need your help here.
I am studying someone's code and I came across these expressions. My doubt is: how is the operation done here, given that a character and an integer are two different data types?
How will the integer type hold the character value?
Thanks
int line, col;
char ch;
scanf("%d%c", &line, &ch);
//line--;
col = ch - 'A';
my doubt is how is the operation done here, given that a character and an integers are two different data types?
I'm unsure how well this question will be received here, given it being about some fairly basic behavior of the language, but I commend you for thinking about type and type matching. Keep doing that!
The first thing to understand is that in C, char and its signed and unsigned variants are among the integer data types, so there is no mismatch of type category, just a question of possibly-different range and signedness. Characters are represented by integer codes (as, indeed, is pretty much everything in the computer's memory).
The second thing to understand is that C supports all manner of arithmetic operations on operands of mixed types. It defines a set of "usual arithmetic conversions" that are used to choose a common type for the operands and the result of each arithmetic operation. The operands are automatically converted to that type. I won't cover all the details here, but basically, floating-point types win over integer types, and wider types win over narrower types.
The third thing to understand is that C does not in any case directly define arithmetic on integer types narrower than (technically, having integer conversion rank less than that of) int. When a narrower value appears in an arithmetic expression, it is automatically converted to int (if int can represent all values of the original type) or to unsigned int. These automatic conversions are called the "integer promotions", and they are a subset of the usual arithmetic conversions.
A fourth thing that is sometimes important to know is that in C "integer character constants" such as 'A' have type int, not type char (C++ differs here).
So, to evaluate this ...
col = ch - 'A';
... the usual arithmetic conversions are first applied to ch and 'A'. This involves performing the integer promotions on the value of ch, resulting in the same numeric value, but as an int. The constant 'A' already has type int, so these now match, and their difference can be computed without any further conversions. The result is an int, which is the same type as col, so no conversion is required to assign the result, either.
How will the integer type hold the character value?
Character values are integer values. Type int can accommodate all values that type char can accommodate.* Nothing special is happening in that regard.
*Technically, int can accommodate all values that can be represented by signed char, unsigned int can accommodate all values that can be represented by type unsigned char, and at least one of the two can accommodate all values that can be represented by (default) char. You are fairly unlikely to run across a C implementation where there are char values that int cannot accommodate, and the above assumes that you are not working with such an implementation, but these are allowed and some may exist.
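As a rough illustration of the steps above, here is a minimal sketch. The function names are hypothetical, and the index computation assumes a character set such as ASCII in which the uppercase letters are contiguous, which the C standard does not guarantee:

```c
#include <stddef.h>

/* Convert an uppercase column letter to a zero-based index.
   ch is promoted to int; 'A' already has type int in C. */
int column_index(char ch)
{
    return ch - 'A';
}

/* Size of the type of the expression ch - 'A'.
   The operand of sizeof is not evaluated; only its type matters. */
size_t difference_size(char ch)
{
    return sizeof(ch - 'A');
}
```

Here difference_size() reports sizeof(int) whatever character is passed, which reflects that the subtraction is carried out in int rather than in char.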
At the fundamental level, every type in C (be it char, int, uint32_t, short, long...) is represented by bytes, and is 'numerical' in form. You can subtract them from each other / add them together in whichever combination you like, as long as you store the resulting value in a variable of a type which is big enough to hold it. Otherwise the value is converted to fit: for an unsigned destination it wraps around, and for a signed destination the result is implementation-defined.
In your example, a char occupies a single byte and an int is wider (typically 4 bytes), so the result of this subtraction comfortably fits in an int. If the expression yields a negative value, the int is typically represented in two's complement; look into that if you're interested.
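What happens when a result does not fit in the destination type can be sketched as follows. The helper names are invented for this example, and the value 44 assumes CHAR_BIT == 8:

```c
/* The addition itself is done in int, after integer promotion. */
int add_wide(unsigned char a, unsigned char b)
{
    return a + b;               /* 200 + 100 gives 300 */
}

/* Storing the same sum in an unsigned char keeps only the value
   modulo UCHAR_MAX + 1. */
unsigned char add_and_narrow(unsigned char a, unsigned char b)
{
    int sum = a + b;            /* computed in int */
    return (unsigned char)sum;  /* 300 mod 256 is 44 when CHAR_BIT == 8 */
}
```

No memory outside the destination variable is touched in either case; only the stored value changes.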
When you subtract two characters and put the result in a variable of integer type, it is the character codes (ASCII codes on most systems) that are subtracted.
For example when you have:
int col = 'D' - 'A';
The value of col is equal to 3
Because the ASCII code of D is 68 and the ASCII code of A is 65, col is 3, even though D and A are characters.

What does the C standard specify for the value of a character constant with a hexadecimal escape sequence?

What does the C 2018 standard specify for the value of a hexadecimal escape sequence such as '\xFF'?
Consider a C implementation in which char is signed and eight bits.
Clause 6.4.4.4 tells us about character constants. In paragraph 6, it discusses hexadecimal escape sequences:
The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape sequence are taken to be part of the construction of a single character for an integer character constant or of a single wide character for a wide character constant. The numerical value of the hexadecimal integer so formed specifies the value of the desired character or wide character.
The hexadecimal integer is “FF”. By the usual rules of hexadecimal notation, its value¹ is 255. Note that, so far, we do not have a specific type: A “character” is a “member of a set of elements used for the organization, control, or representation of data” (3.7) or a “bit representation that fits in a byte” (3.7.1). When \xFF is used in '\xFF', it is a c-char in the grammar (6.4.4.4 1), and '\xFF' is an integer character constant. Per 6.4.4.4 2, “An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in ’x’.”
6.4.4.4 9 specifies constraints on character constants:
The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the corresponding type:
That is followed by a table that, for character constants with no prefix, shows the corresponding type is unsigned char.
So far, so good. Our hexadecimal escape sequence has value 255, which is in the range of an unsigned char.
Then 6.4.4.4 10 purports to tell us the value of the character constant. I quote it here with its sentences separated and labeled for reference:
(i) An integer character constant has type int.
(ii) The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer.
(iii) The value of an integer character constant containing more than one character (e.g., ’ab’ ), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.
(iv) If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
If 255 maps to an execution character, (ii) applies, and the value of '\xFF' is the value of that character. This is the first use of “maps” in the standard; it is not defined elsewhere. Should it mean anything other than a map from the value derived so far (255) to an execution character with the same value? If so, for (ii) to apply, there must be an execution character with the value 255. Then the value of '\xFF' would be 255.
Otherwise (iii) applies, and the value of '\xFF' is implementation-defined.
Regardless of whether (ii) or (iii) applies, (iv) also applies. It says the value of '\xFF' is the value of a char object whose value is 255, subsequently converted to int. But, since char is signed and eight-bit, there is no char object whose value is 255. So the fourth sentence states an impossibility.
Footnote
¹ 3.19 defines “value” as “precise meaning of the contents of an object when interpreted as having a specific type,” but I do not believe that technical term is being used here. “The numerical value of the hexadecimal integer” has no object to discuss yet. This appears to be a use of the word “value” in an ordinary sense.
Your demonstration leads to an interesting conclusion:
There is no portable way to write character constants with values outside the range 0 .. CHAR_MAX. This is not necessarily a problem for single characters as one can use integers in place of character constants, but there is no such alternative for string constants.
It seems type char should always be unsigned by default for consistency with many standard C library functions:
fgetc() returns an int with a negative value EOF for failure and the value of an unsigned char if a byte was successfully read. Hence the meaning and effect of fgetc() == '\xFF' is implementation defined.
the functions from <ctype.h> accept an int argument with the same values as those returned by fgetc(). Passing a negative char value has undefined behavior.
strcmp() compares strings based on the values of characters converted to unsigned char.
'\xFF' may have the value -1 which is completely unintuitive and is potentially identical to the value of EOF.
The only reason to make or keep char signed by default is compatibility with older compilers, for historical code that relies on this behavior and was written before the advent of signed char, some 30 years ago!
I strongly advise programmers to use -funsigned-char to make char unsigned by default and use signed char or better int8_t if one needs signed 8-bit variables and structure members.
As hyde commented, to avoid portability problems, char values should be cast to (unsigned char) where the signedness of char may pose problems, for example:
char str[] = "Hello world\n";
for (int i = 0; str[i]; i++)
str[i] = tolower((unsigned char)str[i]);
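The same cast helps with the fgetc() comparison mentioned above. The following is only a sketch with invented function names; it compares against (unsigned char)'\xFF' so that the test does not depend on the signedness of plain char:

```c
#include <stdio.h>

/* Returns 1 if c, a value as returned by fgetc(), is the byte 0xFF.
   The cast turns '\xFF' into 255 whether plain char is signed or not. */
int is_byte_ff(int c)
{
    return c == (unsigned char)'\xFF';
}

/* Counts the 0xFF bytes in a stream opened by the caller. */
long count_ff(FILE *fp)
{
    long n = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        if (is_byte_ff(c))
            n++;
    }
    return n;
}
```

With this form, EOF is never mistaken for a data byte, since fgetc() reports data bytes as nonnegative unsigned char values and EOF is negative.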

Are character constants always positive?

I'm curious if I can compile
int map [] = { [ /*(unsigned char)*/ 'a' ]=1 };
regardless of platform or if it's better to cast character constants to unsigned char prior to using them as indices.
A character constant is a positive value of type int if it is based on a member of the basic execution character set.
Since a is in that basic character set, we know that 'a' is required to be positive.
On the other hand, for example, '\xFF' might not be positive. The FF value will be regarded as the bit pattern for a char†, which could be signed, giving us a -1 due to two's complement. Similar reasoning will apply if instead of a numeric escape, we use a character that corresponds to a negative value of type char, like characters corresponding to the 0x80-0xFF byte range on 8-bit systems.
It was like this in ANSI C89 and C90, where I'm relying on my memory; but the requirements persist through newer drafts and standards. In the n1570 draft, we have these items:
6.4.4.4 Character Constants, paragraph 10: "If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int."
6.2.5 Types, paragraph 3: "If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative."
A character constant is not a "char object", but the requirements in 6.4.4.4 specify that the value of a character constant is determined using the char representation: "... one that results when an object with type char whose value ...".
† The numeric escape sequences for unprefixed character constants and those prefixed with L have an associated "corresponding type" which is unsigned, and they are required to be in that type's range (6.4.4.4 9). The idea is that character values are specified as an unsigned value, which gives their bit-wise representation, which is then interpreted as char. This intent is also conveyed in Example 2 (6.4.4.4 13).
I'm curious if I can compile
int map [] = { [ /*(unsigned char)*/ 'a' ]=1 };
regardless of platform or if it's better to cast character constants
to unsigned char prior to using them as indices.
Your specific code is safe.
'a' is an integer character constant. The language specifies of these that
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. [...]
If an integer character constant contains a
single character or escape sequence, its value is the one that results
when an object with type char whose value is that of the single
character or escape sequence is converted to type int.
(C2011, paragraph 6.4.4.4/10)
It furthermore specifies that
If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
(C2011, paragraph 6.2.5/3)
and it requires of every implementation that both the basic source and basic execution character sets contain, among other characters, the lowercase Latin letters, including 'a'. (C2011, paragraph 5.2.1/3)
You should take care, however: an integer character constant for a character that is not a member of the basic execution character set, including a multibyte character, or a multi-character integer character constant, need not be nonnegative. Some of those could, in principle, be negative even on machines where default char is an unsigned type.
Moreover, again considering multibyte characters, the cast to unsigned char is not necessarily safe either, in that you could produce collisions that way. To be sure to avoid collisions, you would need to convert to unsigned int, but that could produce much larger arrays than you expect. If you stick to the basic character sets then you're ok. If you stick to single-byte characters then you're ok with a cast. If you must accommodate multibyte characters then for portability, you should probably choose a different approach.
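For the single-byte case, indexing with a cast might look like the sketch below. The function names are hypothetical, and the table is sized with UCHAR_MAX so that every possible byte value has a slot:

```c
#include <limits.h>
#include <stddef.h>

/* Tallies how often each byte value occurs in s.
   counts must have UCHAR_MAX + 1 elements, zeroed by the caller. */
void count_bytes(const char *s, size_t counts[UCHAR_MAX + 1])
{
    for (; *s != '\0'; s++)
        counts[(unsigned char)*s]++;   /* the cast keeps the index nonnegative */
}

/* Convenience wrapper: how many times does ch occur in s? */
size_t occurrences(const char *s, char ch)
{
    size_t counts[UCHAR_MAX + 1] = {0};
    count_bytes(s, counts);
    return counts[(unsigned char)ch];
}
```

Without the cast, a byte such as '\xFF' could produce a negative index on implementations where plain char is signed.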

how an integer and character is stored in c

#include <stdio.h>
int main(void)
{
    int a = 65;
    char d = 'A';
    if (a == d)
        printf("both are same");
    return 0;
}
The output is "both are same". Here a is an integer, so 65 is stored in 32 bits, and d is a char, which is stored in 8 bits. How can they be the same, given that in a computer everything is converted to binary for any operation?
The computer is able to compare a char to an int on a binary level because of Implicit type promotion rules.
If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions.
This means your char is promoted to an int before your processor compares the two.
C is a very flawed language, so there are many dirty, irrational things going on between the lines here:
char has implementation-defined signedness, so how it stores data depends on the compiler. See: Is char signed or unsigned by default?
'A' is a character literal, and as it happens, character literals are actually of type int in C. This doesn't make any sense, but that's just how it is.
In the line char d='A';, the literal 'A' (type int) gets converted to char. Which may or may not be signed. Signedness shouldn't in practice affect the basic character set A to Z though.
Most likely 'A' will be stored as the value 65, although this is not guaranteed by the standard. For that reason it is better to always write 'A' and never 65 (the former is also most readable).
In the expression a==d, the character operand is a small integer type. Small integer types undergo an implicit promotion to int when used in most expressions. This integer promotion is part of a set of rules for how expressions are balanced, to ensure that both operands of an operator are always of the same type. These rules are called the usual arithmetic conversions. For details see: Implicit type promotion rules
The internal storage is the compiler's decision, and often depends on the target architecture.
However, this has nothing to do with the result your code shows; in the comparison, the char gets promoted to an int before comparing (because you can't compare apples with oranges; read the language rules). Therefore, it compares an int with an int, and they are equal.
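The promotion, and the way the signedness of char can change the outcome for values outside the basic character set, can be sketched like this. The function names are made up, and the second case assumes CHAR_BIT == 8:

```c
#include <limits.h>

/* Returns 1 if a and d compare equal; d is promoted to int before the comparison. */
int equal_after_promotion(int a, char d)
{
    return a == d;
}

/* Returns 1 if plain char is unsigned on this implementation. */
int char_is_unsigned(void)
{
    return CHAR_MIN == 0;
}
```

Comparing 'A' with 'A' this way always yields 1. Comparing 255 with a char holding the byte 0xFF yields 1 only where char is unsigned, because a signed char with that bit pattern is promoted to the int value -1.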

Which is the value of a "big" character hexadecimal constant in C?

Suppose that we write in C the following character constant:
'\xFFFFAA'
Which is its numerical value?
The standard C99 says:
Character constants have type int.
Hexadecimal character constants can be represented as an unsigned char.
The value of a basic character constant is non-negative.
The value of any character constant fits in the range of char.
Besides:
The range of values of signed char is contained in the range of values of int.
The size (in bits) of char, unsigned char and signed char are the same: 1 byte.
The size of a byte is given by CHAR_BIT, whose value is at least 8.
Let's suppose that we have the typical situation with CHAR_BIT == 8.
Also, let's suppose that char is signed char for us.
By following the rules: the constant '\xFFFFAA' has type int, but its value can be represented in an unsigned char, although its real value fits in a char.
From these rules, an example as '\xFF' would give us:
(int)(char)(unsigned char)'\xFF' == -1
The 1st cast unsigned char comes from the "can be represented as unsigned char" requirement.
The 2nd cast char comes from the "the value fits in a char" requirement.
The 3rd cast int comes from the "has type int" requirement.
However, the constant '\xFFFFAA' is too big, and cannot be "represented" as unsigned char.
Which is its value?
I think that the value is the resulting of (char)(0xFFFFAA % 256) since the standard says, more or less, the following:
For unsigned integer types, if a value is bigger than the maximum M that can be represented by the type, the value is the one obtained after taking the remainder modulo M + 1.
Am I right with this conclusion?
EDIT I have been convinced by @KeithThompson: he says that, according to the standards, a big hexadecimal character constant is a constraint violation.
So, I will accept that answer.
However: for example, with GCC 4.8 (MinGW), the compiler issues a warning message, and the program compiles following the behaviour I have described. Thus, a constant like '\x100020' was accepted as valid, and its value was 0x20.
The C standard defines the syntax and semantics in section 6.4.4.4. I'll cite the N1570 draft of the C11 standard.
Paragraph 6:
The hexadecimal digits that follow the backslash and the letter x in a
hexadecimal escape sequence are taken to be part of the construction
of a single character for an integer character constant or of a single
wide character for a wide character constant. The numerical value of
the hexadecimal integer so formed specifies the value of the desired
character or wide character.
Paragraph 9:
Constraints
The value of an octal or hexadecimal escape sequence shall be in the
range of representable values for the corresponding type:
followed by a table saying that with no prefix, the "corresponding type" is unsigned char.
So, assuming that 0xFFFFAA is outside the representable range for type unsigned char, the character constant '\xFFFFAA' is a constraint violation, requiring a compile-time diagnostic. A compiler is free to reject your source file altogether.
If your compiler doesn't at least warn you about this, it's failing to conform to the C standard.
Yes, the standard does say that unsigned types have modular (wraparound) semantics, but that only applies to arithmetic expressions and some conversions, not to the meanings of constants.
(If CHAR_BIT >= 24 on your system, it's perfectly valid, but that's rare; usually CHAR_BIT == 8.)
If a compiler chooses to issue a mere warning and then continue to compile your source, the behavior is undefined (simply because the standard doesn't define the behavior).
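The difference between a conversion, which wraps, and an escape sequence, which must already be in range, can be illustrated with a short sketch. The function name is invented, and the results assume CHAR_BIT == 8:

```c
/* Converting an out-of-range value to unsigned char is well defined:
   the result is the value modulo UCHAR_MAX + 1. */
unsigned char low_byte(unsigned long value)
{
    return (unsigned char)value;   /* 0xFFFFAA becomes 0xAA when CHAR_BIT == 8 */
}
```

This reproduces the value that the compiler in the question happened to pick for the oversized escape, but here the standard guarantees it. For '\xFFFFAA' itself no such guarantee exists, because the constraint is violated before any conversion takes place.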
On the other hand, if you had actually meant 'xFFFFAA', that's not interpreted as hexadecimal. (I see it was merely a typo, and the question has been edited to correct it, but I'm going to leave this here anyway.) Its value is implementation-defined, as described in paragraph 10:
The value of an integer character constant containing more than one
character (e.g.,
'ab'), ..., is implementation-defined.
Character constants containing more than one character are a nearly useless language feature, used by accident more often than they're used intentionally.
Yes, the value of \xFFFFAA should be representable by unsigned char.
6.4.4.4 9 Constraints
The value of an octal or hexadecimal escape sequence shall be in the
range of representable values for the type unsigned char for an
integer character constant.
But C99 also says,
6.4.4.4 10 Semantics
The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined.
So the resulting value should be in the range of unsigned char ([0, 255] if CHAR_BIT == 8). But as to which one, it depends on the compiler, architecture, etc.
