C: Explain sizeof behaviour [duplicate] - c

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Size of character ('a') in C/C++
Can someone explain why in C sizeof(char) = 1 and sizeof(name[0]) = 1 but sizeof('a') = 4?
name[0] in this case would be char name[1] = {'a'};
I've tried to read through C's documentation to get this but I simply don't get it! if sizeof('a') and sizeof(name[0]) were both 4 I would get it, if they were both 1 that would make sense... but I don't get the discrepancy!

In C, character literals such as 'a' have type int, and hence sizeof('a') is equal to sizeof(int).
In C++, character literals have type char, and thus sizeof('a') is equal to sizeof(char).
References:
C99 Standard: 6.4.4.4 Character constants
Para 2:
An integer character constant is a sequence of one or more multibyte characters enclosed
in single-quotes, as in ’x’ or ’ab’.
C++03 Standard: 2.13.2 Character literals
Para 1:
A character literal is one or more characters enclosed in single quotes, as in ’x’, optionally preceded by the letterL, as in L’x’. A character literal that does not begin with L is an ordinary character literal, also referred to as a narrow-character literal. An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set.

In c, sizeof operator consider 'a' as integer so you are getting 4 as a size

Related

How are u'123' or U'\n\t' or L'abc' integer constants in c? [duplicate]

This question already has answers here:
Multicharacter literal in C and C++
(6 answers)
Closed 5 years ago.
While making a compiler for C language, I was looking for its grammar. I stumbled upon ANSI C Grammar (Lex). There I found the following regular expression (RE):
{CP}?"'"([^'\\\n]|{ES})+"'" { return I_CONSTANT; }
where CP and ES are as follows
CP (u|U|L)
ES (\\(['"\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))
I understand that ES is RE for escape sequence.
If I understand the regular expression correctly then, u'123' or U'\n\t' or L'abc' are valid I_CONSTANTs.
I wrote the following small program to see what constant values they represent.
#include <stdio.h>
int main(void) {
printf("%d %d %d\n", u'123', U'\n\t', L'abc');
return 0;
}
This gave the following output.
51 9 99
I deciphered that they represent the ASCII value of the right-most character inside single quotes.
However, what I fail to understand is the use and importance of this kind of integer constant.
These are multicharacter literals, and their value is implementation-defined.
From C11 6.4.4.4 p10-11:
The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.
...
The value of a wide character constant containing more than one multibyte character or a single multibyte character that maps to multiple members of the extended execution character set, or containing a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined.
From your testing, it looks like GCC chooses to ignore all but the rightmost character on wide multicharacter literals. However, if you don't specify L, u, or U, GCC will combine the characters given as different bytes of the resulting integer, in an order depending on endianess. This behavior should not be relied on in portable code.

Using int to print character constants [duplicate]

This question already has answers here:
Multi-character constant warnings
(6 answers)
Print decimal value of a char
(5 answers)
Closed 5 years ago.
I wrote the following program,
#include<stdio.h>
int main(void)
{
int i='A';
printf("i=%c",i);
return 0;
}
and I got the result as,
i=A
So I tried another program,
#include<stdio.h>
int main(void)
{
int i='ABC';
printf("i=%c",i);
return 0;
}
According to me, since 32 bits are used to store an int value and each of 'A', 'B' and 'C' have 8 bit ASCII codes which totals to 24 bits therefore 24 bits were stored in a 32 bit unit. So I expected the output to be,
i=ABC
but the output instead was
i=C
and I can't understand why?
'ABC' in this case is a integer character constant as per section 6.4.4.4.10 of the standard.
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. The
value of an integer character constant containing more than one
character (e.g.,'ab'), or containing a character or escape sequence
that does not map to a single-byteexecution character, is
implementation-defined. If an integer character constant contains a
single character or escape sequence, its value is the one that results
when an object with type char whose value is that of the single
character or escape sequence is converted to type int.
In this case, 'A'==0x41, 'B'==0x42, 'C'==0x43, and your compiler then interprets i to be 0x414243. As said in the other answer, this value is implementation dependent.
When you try to access it using '%c', the overflown part will be cut and you are only left with 0x43, which is 'C'.
To get more insight to it, read the answers to this question as well.
The conversion specifier c used in this call
printf("i=%c",i);
in fact extracts one character from the integer argument. So using this specifier you in any case can not get three characters as the output.
From the C Standard (7.21.6.1 The fprintf function)
c If no l length modifier is present, the int argument is converted to
an unsigned char, and the resulting character is written
Take into account that the internal representation of a multi-byte character constant is implementation defined. From the C Standard (6.4.4.4 Character constants)
...The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined.
'ABC' is an integer character constant. Depending on code set (overwhelming it is ASCII), endian, int width (apparently 32 bits in OP's case), it may have the same value like below. It is implementation defined behavior.
'ABC'
0x41424300
0x434241
or others.
The "%c" directs printf() to take the int value, cast it to unsigned char and print the associated character. This is the main reason for apparent loss of information.
In OP's case, it appears that i took on the value of 0x434241.
int i='A';
printf("i=%c",i); --> 'A'
// same as
printf("i=%c",0x434241); --> 'A'
if you want i to contain 3 characters you need to init a array that contains 3 characters
char i[3];
i[0]= 'A';
i[1]= 'B';
i[2]='C';
the ' ' can contain only one char your code converts the integer i into a character or better you store in your 32 bit intiger a converted 8 bit character. But i think You want to seperate the 32 bits into 8 bit containers make a char array like char i[3]. and then you will see that
int j=i;
this will result in an error because you are unable to convert a char array into a integer.
In C, 'A' is an int constant that's guaranteed to fit into a char.
'ABC' is a multicharacter constant. It has an int type, but an implementation defined value. The behaviour on using %c to print that in printf is possibly undefined if the value cannot fit into a char.

This source code is switching on a string in C. How does it do that?

I'm reading through some emulator code and I've countered something truly odd:
switch (reg){
case 'eax':
/* and so on*/
}
How is this possible? I thought you could only switch on integral types. Is there some macro trickery going on?
(Only you can answer the "macro trickery" part - unless you paste up more code. But there's not much here for macros to work on - formally you are not allowed to redefine keywords; the behaviour on doing that is undefined.)
In order to achieve program readability, the witty developer is exploiting implementation defined behaviour. 'eax' is not a string, but a multi-character constant. Note very carefully the single quotation characters around eax. Most likely it is giving you an int in your case that's unique to that combination of characters. (Quite often each character occupies 8 bits in a 32 bit int). And everyone knows you can switch on an int!
Finally, a standard reference:
The C99 standard says:
6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or
escape sequence that does not map to a single-byte execution
character, is implementation-defined."
According to the C Standard (6.8.4.2 The switch statement)
3 The expression of each case label shall be an integer constant
expression...
and (6.6 Constant expressions)
6 An integer constant expression shall have integer type and shall
only have operands that are integer constants, enumeration constants,
character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of
casts. Cast operators in an integer constant expression shall only
convert arithmetic types to integer types, except as part of an
operand to the sizeof operator.
Now what is 'eax'?
The C Standard (6.4.4.4 Character constants)
2 An integer character constant is a sequence of one or more
multibyte characters enclosed in single-quotes, as in 'x'...
So 'eax' is an integer character constant according to the paragraph 10 of the same section
...The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined.
So according to the first mentioned quote it can be an operand of an integer constant expression that may be used as a case label.
Pay attention to that a character constant (enclosed in single quotes) has type int and is not the same as a string literal (a sequence of characters enclosed in double quotes) that has a type of a character array.
As other have said, this is an int constant and its actual value is implementation-defined.
I assume the rest of the code looks something like
if (SOMETHING)
reg='eax';
...
switch (reg){
case 'eax':
/* and so on*/
}
You can be sure that 'eax' in the first part has the same value as 'eax' in the second part, so it all works out, right? ... wrong.
In a comment #Davislor lists some possible values for 'eax':
... 0x65, 0x656178, 0x65617800, 0x786165, 0x6165, or something else
Notice the first potential value? That is just 'e', ignoring the other two characters. The problem is the program probably uses 'eax', 'ebx',
and so on. If all these constants have the same value as 'e' you end up with
switch (reg){
case 'e':
...
case 'e':
...
...
}
This doesn't look too good, does it?
The good part about "implementation-defined" is that the programmer can check the documentation of their compiler and see if it does something sensible with these constants. If it does, home free.
The bad part is that some other poor fellow can take the code and try to compile it using some other compiler. Instant compile error. The program is not portable.
As #zwol pointed out in the comments, the situation is not quite as bad as I thought, in the bad case the code doesn't compile. This will at least give you an exact file name and line number for the problem. Still, you will not have a working program.
The code fragment uses an historical oddity called multi-character character constant, also referred to as multi-chars.
'eax' is an integer constant whose value is implementation defined.
Here is an interesting page on multi-chars and how they can be used but should not:
http://www.zipcon.net/~swhite/docs/computers/languages/c_multi-char_const.html
Looking back further away into the rearview mirror, here is how the original C manual by Dennis Ritchie from the good old days ( https://www.bell-labs.com/usr/dmr/www/cman.pdf ) specified character constants.
2.3.2 Character constants
A character constant is 1 or 2 characters enclosed in single quotes ‘‘ ' ’’. Within a character constant a single quote must be preceded by a back-slash ‘‘\’’. Certain non-graphic characters, and ‘‘\’’ itself, may be escaped according to the following table:
BS \b
NL \n
CR \r
HT \t
ddd \ddd
\ \\
The escape ‘‘\ddd’’ consists of the backslash followed by 1, 2, or 3 octal digits which are taken to specify the value of the desired character. A special case of this construction is ‘‘\0’’ (not followed by a digit) which indicates a null character.
Character constants behave exactly like integers (not, in particular, like objects of character type). In conformity with the addressing structure of the PDP-11, a character constant of length 1 has the code for the given character in
the low-order byte and 0 in the high-order byte; a character constant of length 2 has the code for the first character in the low byte and that for the second character in the high-order byte. Character constants with more than one character are inherently machine-dependent and should be avoided.
The last phrase is all you need to remember about this curious construction: Character constants with more than one character are inherently machine-dependent and should be avoided.

Why is an element of an array bigger than the type?

I am fairly new to C and during one of my exercises I encountered something I couldn't wrap my head around.
When I check the size of an element of tabel (which here is 'b') than I get 4. However if I were to check 'char' than I get 1. How come?
# include <stdio.h>
int main(){
char tabel[10] = {'b','f','r','o','a','u','v','t','o'};
int size_tabel = (sizeof(tabel));
int size_char = (sizeof('b'));
/*edit the above line to sizeof(char) to get 1 instead of 4*/
int length_tabel = size_tabel/size_char;
printf("size_tabel = %i, size_char = %i, lengte_tabel= %i",size_tabel,
size_char,length_tabel);
}
'b' is not of type char. 'b' is a literal and it's type is int.
From C11 Standard Draft (ISO/IEC 9899:201x): 6.4.4.4 Character constants: Description
An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'.
The 'b' literal is an int. And on your current platform, an int is 4 bytes.
An integer character constant, e.g., 'b' has type int in C.
From the C Standard:
(c11, 6.4.4.4p10) "An integer character constant has type int. [...] If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int."
This is different than C++ where an integer character constant has type char:
(c++11, 2.14.3) "An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set."
sizeof(tabel)
This will return the size of table which is sizeof(char) * 10
sizeof('b')
This will return the sizeof(char) which is one.

why '\97' ascii value equals 55

Just like C code:
#include<stdio.h>
int main(void) {
char c = '\97';
printf("%d",c);
return 0;
}
the result is 55,but I can't understand how to calculate it.
I know the Octal number or hex number follow the '\', does the 97 is hex number?
\ is a octal escape sequence but 9 is not a valid octal digit so instead of interpreting it as octal it is being interpreted as a multi-character constant a \9 and a 1 whose value is implementation defined. Without any warning flags gcc provides the following warnings by default:
warning: unknown escape sequence: '\9' [enabled by default]
warning: multi-character character constant [-Wmultichar]
warning: overflow in implicit constant conversion [-Woverflow]
The C99 draft standard in section 6.4.4.4 Character constants paragraph 10 says (emphasis mine):
An integer character constant has type int. The value of an integer character constant
containing a single character that maps to a single-byte execution character is the
numerical value of the representation of the mapped character interpreted as an integer.
The value of an integer character constant containing more than one character (e.g.,
'ab'), or containing a character or escape sequence that does not map to a single-byte
execution character, is implementation-defined.
For example gcc implementation is documented here and is as follows:
The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed, regardless of whether single characters are signed or not (a slight change from versions 3.1 and earlier of GCC). If there are more characters in the constant than would fit in the target int the compiler issues a warning, and the excess leading characters are ignored.
For example, 'ab' for a target with an 8-bit char would be interpreted as ‘(int) ((unsigned char) 'a' * 256 + (unsigned char) 'b')’, and '\234a' as ‘(int) ((unsigned char) '\234' * 256 + (unsigned char) 'a')’.
As far as I can tell this is being interpreted as:
char c = ((unsigned char)'\71')*256 + '7' ;
which results in 55, which is consistent with the multi-character constant implementation above although the translation of \9 to \71 is not obvious.
Edit
I realized later on what is really happening is the \ is being dropped and so \9 -> 9, so what we really have is:
c = ((unsigned char)'9')*256 + '7' ;
which seems more reasonable but still arbitrary and not clear to me why this is not a straight out error.
Update
From reading The Annotated C++ Reference Manual we find out that in Classic C and older versions of C++ when backslash followed character was not defined as an scape sequence it was equal to the numeric value of the character. ARM section 2.5.2:
This differs from the interpretation by Classic C and early versions of C++, where the value of a sequence of a blackslash followed by a character in the source character set, if not defined as an escape sequence, was equal to the numeric value of the character. For example '\q' would be equal to 'q'.
\9 is not a valid escape, so the compiler ignores it and ascii '7' is 55.
I would not depend on this behavior, it's probably undefined. But that's where the 55 came from.
edit: Shafik points out it's not undefined, it's implementation defined. See his answer for the references.
First of all, I'm going to assume your code should read this, because it matches your title.
#include<stdio.h>
int main(void) {
char c = '\97';
printf("%d",c);
return 0;
}
\9 isn't valid, thus let's just assume the character is actually 7. 7 is ascii 55, which is the answer that was printed out.
I'm not sure what you wanted, but \97 isn't it...
\9 isn't a valid escape sequence, so it's likely falling back to a plain 9 character.
This means that it's the same thing as '97', which is undefined implementation defined (see Shafik Yaghmour's answer) behavior (2 characters can't fit into 1 character...).
To avoid things like this in the future, consider cranking up the warnings on your compiler. For example, a minimum for gcc should be -Wall -Wextra -pedantic.

Resources