Question regarding C argument promotions [closed] - c

I've been studying how to use loops so that a block of code can be repeated without typing it over and over, and after trying out what I've learned so far, I feel it's time to move on to the next chapter and learn how control statements let a program make decisions.
But before I move on, I still have a few questions about earlier material that I'd like an expert's help with. They are about data types.
A. Character Type
I took the following from the book C Primer Plus, 5th ed.:
Somewhat oddly, C treats character constants as type int rather than char. For example, on an ASCII system with a 32-bit int and an 8-bit char, the code:
char grade = 'B';
represents 'B' as the numerical value 66 stored in a 32-bit unit, grade winds up with 66 stored in an 8-bit unit. This characteristic of character constants makes it possible to define a character constant such as 'FATE', with four separate 8-bit ASCII codes stored in a 32-bit unit. However, attempting to assign such a character constant to a char variable results in only the last 8 bits being used, so the variable gets the value 'E'.
So the next thing I did after reading this was, of course, to try it: I stored 'FATE' in a char variable named grade, compiled, and printed it with printf(). But instead of the character 'E', I got 'F'.
Does this mean there's a mistake in the book, or is there something I misunderstood?
The passage above also says that C treats character constants as type int. To try that out, I assigned a number bigger than 255 (e.g. 356) to a char variable.
Since 356 is within the range of a 32-bit int (I'm running Windows 7), I expected it to print 356 when I used the %d specifier.
But instead of 356, it printed 100, which is the value of the last 8 bits.
Why does this happen? I thought char == int == 32 bits? (Although the book did mention earlier that a char is only a byte.)
B. Int and Floating Type
I understand that when a number stored in a variable of type short is passed to a variadic function, or to any function without a prototype, it is automatically promoted to int.
The same happens with floating-point types: when a float is passed, it is converted to double, which is why there is no specifier for float, only %f for double and %Lf for long double.
But why is there a specifier for short, which is also promoted, but not for float? Why not give float a specifier with a modifier, like %hf or something? Is there a logical or technical reason behind this?

A lot of questions in one question... Here are answers to a couple:
This characteristic of character constants makes it possible to define a character constant such as 'FATE', with four separate 8-bit ASCII codes stored in a 32-bit unit. However, attempting to assign such a character constant to a char variable results in only the last 8 bits being used, so the variable gets the value 'E'.
This is actually implementation defined behavior. So yes, there's a mistake in the book. Many books on C are written with the assumption that the only C compiler in the world is the one the author used when testing the examples.
The compiler the author used treated the characters in 'FATE' as the bytes of an integer, with 'F' being the most significant byte and 'E' the least significant. Your compiler treats the characters in the literal as the bytes of an integer with 'F' being the least significant byte and 'E' the most significant. For example, the first method is how MSVC treats the value, while MinGW (a GCC compiler targeting Windows) treats the literal in the second way.
As for there being no printf() format specifier that expects a float, only specifiers that expect a double: this is because the values passed to printf() for formatting are part of the variable argument list (the ... in printf()'s prototype). There is no type information about these arguments, so as you mentioned, the compiler must always promote them (from C99 6.5.2.2/6 "Function calls"):
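To see how the two byte orders lead to 'E' in one case and 'F' in the other, the packing can be imitated by hand. This is only a sketch: the helper names (pack_first_char_high, pack_first_char_low, low_byte_as_char) are made up for illustration, and it assumes an 8-bit char, ASCII, and an unsigned int of at least 32 bits. It mimics the two conventions; it does not show what any particular compiler does internally.

```c
#include <stddef.h>

/* Hypothetical helper: the first character lands in the most significant
   byte, so the last character ('E' in "FATE") sits in the low 8 bits. */
unsigned pack_first_char_high(const char *s, size_t n)
{
    unsigned value = 0;
    for (size_t i = 0; i < n; i++)
        value = (value << 8) | (unsigned char)s[i];
    return value;
}

/* Hypothetical helper: the first character lands in the least significant
   byte, so the first character ('F' in "FATE") sits in the low 8 bits. */
unsigned pack_first_char_low(const char *s, size_t n)
{
    unsigned value = 0;
    for (size_t i = n; i > 0; i--)
        value = (value << 8) | (unsigned char)s[i - 1];
    return value;
}

/* Keeping only the low 8 bits mimics what survives in an 8-bit char. */
char low_byte_as_char(unsigned value)
{
    return (char)(value & 0xFFu);
}
```

With "FATE" and a length of 4, the first packing leaves 'E' in the low byte and the second leaves 'F', which matches the two results described above.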
If the expression that denotes the called function has a type that does not include a prototype, the integer promotions are performed on each argument, and arguments that have type float are promoted to double. These are called the default argument promotions.
And C99 6.5.2.2/7 "Function calls"
The ellipsis notation in a function prototype declarator causes argument type conversion to stop after the last declared parameter. The default argument promotions are performed on trailing arguments.
So in effect, it's impossible to pass a float to printf() - it will always be promoted to a double. That's why the format specifiers for floating point values expect a double.
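The same promotion applies to any variadic function you write yourself, which makes it easy to try out. Below is a minimal sketch; sum_floating is a hypothetical helper, not part of any library. Because a float argument has already been promoted by the time the function runs, va_arg has to ask for double.

```c
#include <stdarg.h>

/* Hypothetical helper: sums `count` floating-point arguments.
   Callers may pass float or double; a float is promoted to double
   before the call, so va_arg must always request double. */
double sum_floating(int count, ...)
{
    va_list args;
    double total = 0.0;

    va_start(args, count);
    for (int i = 0; i < count; i++)
        total += va_arg(args, double);
    va_end(args);
    return total;
}
```

Calling sum_floating(2, f, 2.5) with float f = 1.5f should give 4.0, even though the first argument was declared as float.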
Also due to the automatic promotion that would be applied to short, I'm honestly not sure if the h specifier for formatting a short is strictly necessary (though it is necessary for use with the n specifier if you want to get the count of characters written to the stream placed in a short). It might be in C because it needs to be there to support the n specifier, historical reasons, or something that I'm just not thinking of.

First, a char is by definition exactly 1 byte wide. Then the standard more or less says that the sizes should be:
sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long)
The exact sizes vary (except for char) by system and compiler but on a 32 bit Windows the sizes with GCC and VC are (AFAIK):
sizeof(short) == 2 (byte)
sizeof(int) == sizeof(long) == 4 (byte)
Your observation of 'F' versus 'E' in this case is a typical endianness issue (little vs. big endian, how a "word" is stored in memory).
Now what happens to your value? You have a variable that is 8 bits wide. You assign a bigger value ('FATE' or 356), but the compiler knows it is only allowed to store 8 bits, so it cuts off all the other bits.

To A:
3.) This is due to the different byte orderings of big- and little-endian CPU architectures. You get the first byte on a little-endian CPU (e.g. x86) and the last byte on a big-endian CPU (e.g. PPC). Actually, you always get the lowest 8 bits when the conversion from int to char is done, but the characters in the int are stored in reversed order.
7.) A char can only hold 8 bits, so everything else gets truncated the moment you assign the int to a char variable and can never be restored from the char variable later.
To B:
3.) You might sometimes want to print only the lowest 16 bits of an int variable regardless of what is in the higher half. It is not uncommon to pack multiple integer values into a single variable for certain optimizations. This works well for integer types but does not make much sense for floating-point types, which don't support bitwise operations directly. That might be the reason why there is no separate type specifier for float in printf.
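As a rough illustration of printing only the low 16 bits of a packed value, here is a small sketch. The helper names low16 and format_low16 are made up, and the example assumes a 16-bit unsigned short. It masks explicitly before formatting rather than relying on the h modifier alone to do the narrowing.

```c
#include <stdio.h>
#include <stddef.h>

/* Keep only the low 16 bits of a packed value. */
unsigned short low16(unsigned value)
{
    return (unsigned short)(value & 0xFFFFu);
}

/* Hypothetical helper: writes the low 16 bits as hex digits into buf and
   returns what snprintf returns. The h modifier tells snprintf that the
   argument started out as an unsigned short. */
int format_low16(char *buf, size_t size, unsigned value)
{
    return snprintf(buf, size, "%hx", low16(value));
}
```

For the value 0x12345678 this should write "5678" into the buffer.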

char is 1 byte long. A byte need not be 8 bits wide: it can be 8, 16, or 32 bits, for example. On general-purpose computers a char is usually 8 bits wide. So the maximum number a char can represent depends on its bit width. To check the bit width of a char, look in the limits.h header file, where it is defined as CHAR_BIT.
char x = 'FATE' will probably depend on the byte ordering with which the machine/compiler interprets 'FATE'. So this depends on the system/compiler. Someone please confirm/correct this.
If your system has 8-bit bytes, then when you do c = 356, only the lower 8 bits of the binary representation of 356 will be stored in the variable, because char data is always allocated 1 byte of storage. So %d will print 100, because the upper bits were lost when you assigned the value to the variable, and what is left is only the lower 8 bits.
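The truncation can be checked with a few lines of C. This sketch uses unsigned char on purpose, since converting an out-of-range value to a signed char is implementation-defined, while the unsigned conversion is fully specified. The names store_in_char and bits_in_char are made up for the example.

```c
#include <limits.h>

/* Converting to unsigned char keeps the value modulo (UCHAR_MAX + 1),
   which means only the low CHAR_BIT bits survive. */
unsigned char store_in_char(int value)
{
    return (unsigned char)value;
}

/* Number of bits in a char on this implementation (at least 8). */
int bits_in_char(void)
{
    return CHAR_BIT;
}
```

On a system where CHAR_BIT is 8, store_in_char(356) gives 100, because 356 is 256 + 100 and only the low 8 bits are kept.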


Use of format specifiers for conversions

I am unable to deduce the internal happenings inside the machine when we print data using format specifiers.
I was trying to understand the concept of signed and unsigned integers and found the following:
unsigned int b=-12;
printf("%d\n",b); //prints -12
printf("%u\n\n",b); //prints 4294967284
I am guessing that b actually stores the binary version of -12 as 11111111111111111111111111110100.
So, since b is unsigned, b technically stores 4294967284.
But still, the format specifier %d causes the binary value of b to be printed as its signed version, i.e. -12.
However,
printf("%f\n",2); //prints 0.000000
printf("%f\n",100); //prints 0.000000
printf("%d\n",3.2); //prints 2147483639
printf("%d\n",3.1); //prints 2147483637
I kind of expected the 2 to be printed as 2.00000 and 3.2 to be printed as 3 as per type conversion norms.
Why does this not happen and what exactly takes place at machine level ?
Mismatching format specifier and argument type (like using the floating point specifier "%f" to print an int value) leads to undefined behavior.
Remember that 2 is an integer value, and vararg functions (like printf) don't really know the types of their arguments. The printf function has to rely on the format specifier and assume the argument is of the specified type.
To better understand how you get the results you get, to understand "the internal happenings", we first must make two assumptions:
The system uses 32 bits for the int type
The system uses 64 bits for the double type
Now what happens with
printf("%f\n",2); //prints 0.000000
is that the printf function sees the "%f" specifier and fetches the next argument as a 64-bit double value. Since the int value you provided in the argument list is only 32 bits, half of the bits in the double value will be unknown. The printf function will then print the (invalid) double value. If you're unlucky, some of the unknown bits might make the value a trap value, which can cause a crash.
Similarly with
printf("%d\n",3.2); //prints 2147483639
the printf function fetches the next argument as a 32-bit int value, losing half of the bits in the 64-bit double value provided as the actual argument. Exactly which 32 bits are copied into the internal int value depends on endianness. Integers don't have trap values, so no crash happens; an unexpected value is simply printed.
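If what you actually want is the converted value, the conversion has to be written out with a cast so that the argument type matches the format specifier. A minimal sketch follows; the helper names are made up, and snprintf is used instead of printf only so the result can be inspected.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: the cast makes the argument a real double,
   which is what %f expects. */
int format_int_as_double(char *buf, size_t size, int value)
{
    return snprintf(buf, size, "%f", (double)value);
}

/* Hypothetical helper: the cast truncates toward zero and makes the
   argument a real int, which is what %d expects. */
int format_double_as_int(char *buf, size_t size, double value)
{
    return snprintf(buf, size, "%d", (int)value);
}
```

In the default "C" locale, the first helper turns 2 into "2.000000" and the second turns 3.2 into "3", which are the results the question expected.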
what exactly takes place at machine level ?
The stdio.h functions are quite far from the machine level. They provide a standardized abstraction layer on top of various OS API. Whereas "machine level" would refer to the generated assembler. The behavior you experience is mostly related to details of the C language rather than the machine.
On the machine level, there exists no signed numbers, but everything is treated as raw binary data. The compiler can turn raw binary data into a signed number by using an instruction that tells the CPU: "use what's stored at this location and treat it as a signed number". Specifically, as a two's complement signed number on all common computers. But this is irrelevant when explaining why your code misbehaves.
The integer constant 12 is of type int. When we write -12 we apply the unary - operator on that. The result is still of type int but now of value -12.
Then you attempt to store this negative number in an unsigned int. This triggers an implicit conversion to unsigned int, which should be carried out according to the C standard:
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type
The maximum value of a 32-bit unsigned int is 2^32 - 1 = 4294967295. "One more than the maximum" gives 2^32 = 4294967296. If we calculate -12 + 4294967296 we get 4294967284. This is in range of an unsigned int and is the result you see later.
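The rule can be checked both ways in code: once by letting the language do the conversion, and once by doing the "add one more than the maximum" step by hand. This is a sketch with made-up helper names, and the manual version assumes long long is wider than unsigned int.

```c
#include <limits.h>

/* The language-level conversion: the value is reduced modulo UINT_MAX + 1. */
unsigned convert_to_unsigned(int value)
{
    return (unsigned)value;
}

/* The same step spelled out by hand for a negative input.
   Assumes long long is wider than unsigned int. */
long long add_one_more_than_max(int value)
{
    return (long long)value + (long long)UINT_MAX + 1;
}
```

With a 32-bit unsigned int, both give 4294967284 for an input of -12.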
Now as it happens, the printf family of functions is very unsafe. If you provide a format specifier that doesn't match the type of the argument, they might crash, display the wrong result, and so on: the program invokes undefined behavior.
So when you use %d or %i reserved for signed int, but pass an unsigned int, anything can happen. "Anything" includes the compiler trying to convert the passed type to match the passed format specifier. That's what happened when you used %d.
When you pass values of types completely mismatching the format specifier, the program just prints gibberish though. Because you are still invoking undefined behavior.
I kind of expected the 2 to be printed as 2.00000 and 3.2 to be printed as 3 as per type conversion norms.
The reason why the printf family can't do anything intelligent like assuming that 2 should be converted to 2.0, is because they are variadic (variable argument) functions. Meaning they can take any number of arguments. In order to make that possible, the parameters are essentially passed as raw binary through something called va_list, and all type information is lost. The printf implementation is therefore left with no type information but the format string you gave it. This is why variadic functions are so unsafe to use.
A regular function, by contrast, has more type safety: if you declare void foo (float f) and pass the integer constant 2 (type int), the compiler will implicitly convert it from int to float, and perhaps also give a conversion warning.
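For comparison, here is what the prototyped case looks like. The function names are made up for this sketch; the point is only that the int argument is converted at the call site because the parameter type is known.

```c
/* A prototyped function: the compiler knows the parameter type is float. */
float halve(float f)
{
    return f / 2.0f;
}

/* Hypothetical caller: the int constant 2 is converted to 2.0f
   at the call, so halve receives a proper float. */
float halve_two(void)
{
    return halve(2);
}
```

Here halve_two() returns 1.0f. No such conversion can happen for the trailing arguments of printf, because their expected types are not part of the prototype.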
The behaviors you observe are the result of printf interpreting the bits given to it as the type specified by the format specifier. In particular, at least for your system:
The bits for an int argument and an unsigned argument in the same position within the argument list would be passed in the same place, so when you give printf one and tell it to format the other, it uses the bits you give it as if they were the bits of the other.
The bits for an int argument and a double argument would be passed in different places—possibly a general register for the int argument and a special floating-point register for the double argument, so when you give printf one and tell it to format the other, it does not get the bits for the double to use for the int; it gets completely unrelated bits that were left lying around by previous operations.
Whenever a function is called, values for its arguments must be placed in certain places. These places vary according to the software and hardware used, and they vary by the type and number of arguments. However, for any particular argument type, argument position, and specific software and hardware used, there is a specific place (or combination of places) where the bits of that argument should be stored to be passed to the function. The rules for this are part of the Application Binary Interface (ABI) for the software and hardware being used.
First, let us neglect any compiler optimization or transformation and examine what happens when the compiler implements a function call in source code directly as a function call in assembly language. The compiler will take the arguments you provide for printf and write them to the places designated for those types of arguments. When printf executes, it examines the format string. When it sees a format specifier, it figures out what type of argument it should have, and it looks for the value of that argument in the place for that type of argument.
Now, there are two things that can happen. Say you passed an unsigned but used a format specifier for int, like %d. In every ABI I have seen, an unsigned and an int argument (in the same position within the list of arguments) are passed in the same place. So, when printf looks for the bits for the int it expects, it will get the bits for the unsigned you passed.
Then printf will interpret those bits as if they encoded the value for an int, and it will print the results. In other words, the bits of your unsigned value are reinterpreted as the bits of an int (see footnote 1).
This explains why you see “-12” when you pass the unsigned value 4,294,967,284 to printf to be formatted with %d. When the bits 11111111111111111111111111110100 are interpreted as an unsigned, they represent the value 4,294,967,284. When they are interpreted as an int, they represent the value −12 on your system. (This encoding system is called two’s complement. Other encoding systems include one’s complement and sign-and-magnitude, in which these bits would represent −11 and −2,147,483,636, respectively. Those systems are rare for plain integer types these days.)
That is the first of two things that can happen, and it is common when you pass the wrong type but it is similar to the correct type in size and nature, so that it is passed in the same place as the correct type. The second thing that can happen is that the argument you pass is passed in a different place than the argument that is expected. For example, if you pass a double as an argument, it is, in many systems, placed in a separate set of registers for floating-point values. When printf goes looking for an int argument for %d, it will not find the bits of your double at all. Instead, what it finds in the place where it looks for an int argument might be whatever bits happened to be left in a register or memory location from previous operations, or it might be the bits of the next argument in the list of arguments. In any case, this means that the value printf prints for the %d will have nothing to do with the double value you passed, because the bits of the double are not involved in any way; a completely different set of bits is used.
This is also part of the reason the C standard says it does not define the behavior when the wrong argument type is passed for a printf conversion. Once you have messed up the argument list by passing double where an int should have been, all the following arguments may be in the wrong places too. They might be in different registers from where they are expected, or they might be in different stack locations from where they are expected. printf has no way to recover from this mistake.
As stated, all of the above neglects compiler optimization. The rules of C arose out of various needs, such as accommodating the problems above and making C portable to a variety of systems. However, once those rules are written, compilers can take advantage of them to allow optimization. The C standard permits a compiler to make any transformation of a program as long as the changed program has the same behavior as the original program under the rules of the C standard. This permission allows compilers to speed up programs tremendously in some circumstances. But a consequence is that, if your program has behavior not defined by the C standard (and not defined by any other rules the compiler follows), it is allowed to transform your program into anything. Over the years, compilers have grown increasingly aggressive about their optimizations, and they continue to grow. This means, aside from the simple behaviors described above, when you pass incorrect arguments to printf, the compiler is allowed to produce completely different results. Therefore, although you may commonly see the behaviors I describe above, you may not rely on them.
Footnote
1 Note that this is not a conversion. A conversion is an operation whose input is one type and whose output is another type but has the same value (or as nearly the same as is possible, in some sense, as when we convert a double 3.5 to an int 3). In some cases, a conversion does not require any change to the bits—an unsigned 3 and an int 3 use the same bits to represent 3, so the conversion does not change the bits, and the result is the same as a reinterpretation. But they are conceptually different.
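The difference between the two can be made concrete with a well-defined stand-in for what printf ends up doing. In this sketch (helper names are made up), memcpy copies the bits unchanged, which is a reinterpretation, while a cast performs a conversion. The −12 result assumes a two's complement int with no padding bits.

```c
#include <string.h>

/* Reinterpretation: copy the bits unchanged and read them as an int. */
int reinterpret_as_int(unsigned bits)
{
    int result;
    memcpy(&result, &bits, sizeof result);
    return result;
}

/* Conversion: produce the nearest representable value in the new type
   (for double to int, the fractional part is discarded). */
int convert_to_int(double value)
{
    return (int)value;
}
```

On a typical system, reinterpret_as_int applied to the unsigned value 4,294,967,284 gives −12, while convert_to_int(3.5) gives 3. For small values such as 3, reinterpretation and conversion between unsigned and int happen to give the same result.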

Why is the 'sizeof' operator returning a value of 4 for a character? [duplicate]

In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.
In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.
discussion on same subject
"More specifically the integral promotions. In K&R C it was virtually (?)
impossible to use a character value without it being promoted to int first,
so making character constant int in the first place eliminated that step.
There were and still are multi character constants such as 'abcd' or however
many will fit in an int."
The original question is "why?"
The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.
In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.
This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.
This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.
When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.
When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage is not the case in C.
This is why they are different. Evolution...
I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:
void print(int);
void print(char);
print('a');
You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.
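C has no overloading, but C11's _Generic selects a branch based on the type of an expression, so it can serve as a rough way to observe the same distinction from the C side. This is a sketch assuming a C11 compiler; TYPE_NAME and the two functions are made-up names.

```c
/* C11 _Generic picks a branch from the type of its operand, a rough
   C counterpart to the overload selection in the C++ example. */
#define TYPE_NAME(x) _Generic((x), char: "char", int: "int", default: "other")

const char *type_of_char_literal(void)
{
    return TYPE_NAME('a');   /* character constants have type int in C */
}

const char *type_of_char_variable(void)
{
    char c = 'a';
    return TYPE_NAME(c);     /* a char object really has type char */
}
```

Compiled as C, the first function returns "int" and the second returns "char".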
using gcc on my MacBook, I try:
#include <stdio.h>
/* sizeof yields a size_t, so print it with %zu rather than %i */
#define test(A) do{printf(#A":\t%zu\n",sizeof(A));}while(0)
int main(void){
    test('a');
    test("a");
    test("");
    test(char);
    test(short);
    test(int);
    test(long);
    test((char)0x0);
    test((short)0x0);
    test((int)0x0);
    test((long)0x0);
    return 0;
}
which when run gives:
'a': 4
"a": 2
"": 1
char: 1
short: 2
int: 4
long: 4
(char)0x0: 1
(short)0x0: 2
(int)0x0: 4
(long)0x0: 4
which suggests that a character is 8 bits, like you suspect, but a character literal is an int.
Back when C was being written, the PDP-11's MACRO-11 assembly language had:
MOV #'A, R0 // 8-bit character encoding for 'A' into 16 bit register
This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:
MOV #"AB, R0 // 16-bit character encoding for 'A' (low byte) and 'B'
This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.
So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:
address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'
If you want to read just an 'A' from this main memory into a register, which one would you read?
Some CPUs may only directly support reading a 16 bit value into a 16 bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.
So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).
My guess is that C's simply carried this level of CPU-centric behaviour over, thinking of character constants occupying register sizes of memory, bearing out the common assessment of C as a "high level assembler".
(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)
I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. What the code did was to put the read character into an int, then test for EOF, then convert to a char if it wasn't.
I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.
int r; /* int, not char, so that EOF stays distinguishable from every character */
char buffer[1024], *p;
p = buffer;
/* stop at EOF or when the buffer is full */
while (p < buffer + sizeof buffer && (r = getc(file)) != EOF)
{
*(p++) = (char) r;
}
I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):
In C, the type of a character literal such as 'a' is int.
Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems.
Except for the pathological example sizeof('a'), every construct that can be expressed
in both C and C++ gives the same result.
So for the most part, it should cause no problems.
The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.
That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one’s-complement math instead of two’s-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-bit chunks but not four-bit nibbles.
I don't know, but I'm going to guess it was easier to implement it that way and it didn't really matter. It wasn't until C++ when the type could determine which function would get called that it needed to be fixed.
This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.
(x86 wonks may note that there is eg a native addh op that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend sign, like an add/extsh pair on the PowerPC)
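The wraparound described above is easy to reproduce. In this sketch (function names are made up, and an 8-bit char is assumed), the addition itself is done in int because of the integer promotions, and the wrap only appears when the result is stored back into an unsigned char.

```c
/* In the expression a + b, both operands are promoted to int first,
   so the sum is computed without wrapping. */
int add_promoted(unsigned char a, unsigned char b)
{
    return a + b;
}

/* Storing the sum back into an unsigned char keeps only the low bits. */
unsigned char add_stored(unsigned char a, unsigned char b)
{
    return (unsigned char)(a + b);
}
```

With 254 and 2, add_promoted gives 256 and add_stored gives 0.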
This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).
EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:
Character literals have type int and
they get there by following the rules
for promotion from type char. This is
too briefly covered in K&R 1, on page
39 where it says:
Every char in an expression is
converted into an int....Notice that
all float's in an expression are
converted to double....Since a
function argument is an expression,
type conversions also take place when
arguments are passed to functions: in
particular, char and short become int,
float becomes double.
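The promotion in expressions can be observed with sizeof, which reports the size of the type an expression would have without evaluating it. A minimal sketch, with made-up function names:

```c
#include <stddef.h>

/* A char object by itself occupies exactly one byte. */
size_t size_of_char_object(void)
{
    char c = 'a';
    return sizeof c;
}

/* In c + c both operands are promoted to int, so the expression
   has type int and sizeof reports sizeof(int). */
size_t size_of_char_sum(void)
{
    char c = 'a';
    return sizeof(c + c);
}
```

The first function returns 1, and the second returns sizeof(int), which is 4 on most current desktop systems.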

A char = 1 byte, but why it is stored with 4 bytes? [duplicate]

In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.
In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.
discussion on same subject
"More specifically the integral promotions. In K&R C it was virtually (?)
impossible to use a character value without it being promoted to int first,
so making character constant int in the first place eliminated that step.
There were and still are multi character constants such as 'abcd' or however
many will fit in an int."
The original question is "why?"
The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.
In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.
This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.
This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.
When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.
When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage is not the case in C.
This is why they are different. Evolution...
I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:
void print(int);
void print(char);
print('a');
You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.
Using gcc on my MacBook, I tried:
#include <stdio.h>
/* Print the stringified argument and its size; sizeof yields a size_t, hence %zu */
#define test(A) do { printf(#A ":\t%zu\n", sizeof(A)); } while (0)
int main(void) {
    test('a');
    test("a");
    test("");
    test(char);
    test(short);
    test(int);
    test(long);
    test((char)0x0);
    test((short)0x0);
    test((int)0x0);
    test((long)0x0);
    return 0;
}
which when run gives:
'a': 4
"a": 2
"": 1
char: 1
short: 2
int: 4
long: 4
(char)0x0: 1
(short)0x0: 2
(int)0x0: 4
(long)0x0: 4
which suggests that a char occupies one byte (8 bits here), as you suspected, but a character literal has the size of an int.
Back when C was being written, the PDP-11's MACRO-11 assembly language had:
MOV #'A, R0 // 8-bit character encoding for 'A' into 16 bit register
This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:
MOV #"AB, R0 // 16-bit character encoding for 'A' (low byte) and 'B'
This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.
So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:
address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'
If you want to read just an 'A' from this main memory into a register, which one would you read?
Some CPUs may only directly support reading a 16 bit value into a 16 bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.
So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).
My guess is that C simply carried this level of CPU-centric behaviour over, thinking of character constants as occupying a register-sized piece of memory, which bears out the common assessment of C as a "high level assembler".
(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)
I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since every char value can legitimately appear in a file or input stream, EOF cannot be any char value. What the code did was store the character it read in an int, test it against EOF, and only then convert it to a char.
I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character constants to be sizeof(int) if EOF was.
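int r;
char buffer[1024], *p = buffer;
/* file is assumed to be an already opened FILE *; stop one short of the end to leave room for '\0' */
while (p < buffer + sizeof buffer - 1 && (r = getc(file)) != EOF)
{
    *p++ = (char) r; /* r is known not to be EOF here, so narrowing it is safe */
}
*p = '\0';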
I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):
In C, the type of a character literal such as 'a' is int.
Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems.
Except for the pathological example sizeof('a'), every construct that can be expressed
in both C and C++ gives the same result.
So for the most part, it should cause no problems.
The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.
That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. Until C23, C implementations were also allowed to use one’s-complement math instead of two’s-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-bit chunks but not four-bit nibbles.
I don't know, but I'm going to guess it was easier to implement it that way and it didn't really matter. It wasn't until C++ when the type could determine which function would get called that it needed to be fixed.
This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.
(x86 wonks may note that there is eg a native addh op that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend sign, like an add/extsh pair on the PowerPC)
This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).
EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:
Character literals have type int and
they get there by following the rules
for promotion from type char. This is
too briefly covered in K&R 1, on page
39 where it says:
Every char in an expression is
converted into an int....Notice that
all float's in an expression are
converted to double....Since a
function argument is an expression,
type conversions also take place when
arguments are passed to functions: in
particular, char and short become int,
float becomes double.



Resources