How do table mappings work in C? - c

I hope this question makes sense! I'm currently learning C (go easy!) and I'm interested in how table mappings work.
I'm using the extended ASCII table as an experiment. (http://www.ascii-code.com)
For example I can create a char and set its value to a tilde like so:
char charSymbol = '~';
And I can also specify the exact same value like so:
char charDec = 126;
char charHex = 0x7E;
char charOct = 0176;
char charBin = 0b01111110;
Regardless of which of the above declarations I choose (if I'm understanding things correctly) the value that's held in memory for each of these variables is always exactly the same. That is, the binary representation (01111110)
My question is; does the compiler hold the extended ASCII table and perform the binary value lookup during compilation? And if that's the case, does the machine the program is running on also hold the extended ASCII table to know that when the program is asked to print 01111110 to screen that it's to print a "~" ?

For most of the code in your question, no ASCII lookup table is needed.
Note that in C, char is an integer type, just like int, but narrower. A character constant like 'x' (for historical reasons) has type int, and on an ASCII-based system x is pretty much identical to 120.
char charDec = 126;
char charHex = 0x7E;
char charOct = 0176;
char charBin = 0b01111110;
(Standard C does not support binary constants like 0b01111110; that's a gcc extension.)
When the compiler sees an integer constant like 126 it computes an integer value from it. For this, it needs to know that 1, 2, and 6 are decimal digits, and what their values are.
char charSymbol = '~';
For this, the compiler just needs to recognize that ~ is a valid character.
The compiler reads all these characters from a text file, your C source. Each character in that file is stored as a sequence of 8 bits, which represent a number from 0 to 255.
So if your C source code contains:
putchar('~');
(and ~ happens to have the value 126), then all the compiler needs to know is that 126 is a valid character value. It generates code that sends the value 126 to the putchar() function. At run time, putchar sends that value to the standard output stream. If standard output is going to a file, the value 126 is stored in that file. If it's going to a terminal, the terminal software will do some kind of lookup to map the number 126 to the glyph that displays as the tilde character.
Compilers have to recognize specific character values. They have to recognize that + is the plus character, which is used to represent the addition operator. But for input and output, no ASCII mapping is needed, because each ASCII character is represented as a number at all stages of processing, from compilation to execution.
So how does a compiler recognize the '+' character? C compilers are typically written in C. Somewhere in the compiler's own sources, there's probably something like:
switch (c) {
...
case '+':
/* code to handle + character */
...
}
So the compiler recognizes + in its input because there's a + in its own source code -- and that + (stored in the compiler source code as the 8-bit number 43) resulted in the number 43 being stored in the compiler's own executable machine code.
Obviously the first C compiler wasn't written in C, because there was nothing to compile it. Early C compilers may have been written in B, or in BCPL, or in assembly language -- each of which is processed by a compiler or assembler that probably recognizes + because there's a + in its own source code. Each generation of C compiler passes on the "knowledge" that + to the next C compiler that it compiles. The "knowledge" that + is 43 is not necessarily written in the source code; it's propagated each time a new compiler is compiled using an old one.
For a discussion of this, see Ken Thompson's article "Reflections on Trusting Trust".
On the other hand, you can also have, for example, a compiler running on an ASCII-based system that generates code for an EBCDIC-based system, or vice versa. Such a compiler would have to have a lookup table mapping from one character set to the other.

Actually, technically speaking your text editor is the one with the ASCII (or Unicode) table. The file is saved simply as a sequence of bytes; a compiler doesn't actually need to have an ASCII table, it just needs to know which bytes do what. (Yes, the compiler logically interprets the bytes as ASCII, but if you looked at the compiler's machine code all you'd see is a bunch of comparisons of the bytes against fixed byte values).
On the flip side, the executing computer has an ASCII table somewhere to map the bytes output by the program into readable characters. This table is probably in your terminal emulator.

C language has pretty weak type-safety and that is why you could always assign an integer to a character variable.
You used different representations of an integer to assign to the character variable - and that is supported in C programming language.
When you typed a "~" in a text file in your C program, your text editor actually converted the key-strokes and stored its ASCII equivalent. Therefore when the compiler parsed the C- code, it did not sense that what is written is a ~ (tilde). While parsing, when compiler encountered ASCII equivalent of ' (i.e single quotes) it went into a mode to read next byte as something that fits in a a char variable followed by another ' (single quote) . Since a char variable can have 0-255 different values it covers whole ASCII set, with extended char set included.
This is same when you use an assembler.
Printing on to screen in entirely different game - That is part of I/O system.
When you key-in a specific character on keyboard, a pulse of a mapped integer goes in and settles in memory of the reading program. Similarly, when you print a specific integer on a printer or screen, that integer takes the shape of corresponding character.
Therefore if you want to print an integer in an int variable there are routines that convert each of its digits and send the ASCII code for each of them, and I/O system converts them into characters.

All those values are exactly equal to each other - they're just different representations of the same value, so the compiler sees them all in exactly the same way after translation from your written text into the byte value.

Related

How does atof.c work? Subtracting an ASCII zero from an ASCII digit makes it an int? Am I missing something?

So as part of my C classes, for our first homework we are supposed to implement our own atof.c function, and then use it for some tasks. So, being the smart stay-at-home student I am I decided to look at the atof.c source code and adapt it to meet my needs. I think i'm on board with most of the operations that this function does, like counting the digits before and after the decimal point, however there is one line of code that I do not understand. I'm assuming this is the line that actually converts the ASCII digit into a digit of type int. Posting it here:
frac1 = 10*frac1 + (c - '0');
in the source code, c is the digit that they are processing, and frac1 is an int that stores some of the digits from the incoming ASCII string. but why does c- '0' work?? And as a followup, is there another way of achieving the same result?
There is no such thing as "text" in C. Just APIs that happen to treat integer values as text information. char is an integer type, and you can do math with it. Character literals are actually ints in C (in C++ they're char, but they're still usable as numeric values even there).
'0' is a nice way for humans to write "the ordinal value of the character for zero"; in ASCII, that's the number 48. Since the digits appear in order from 0 to 9 in all encodings I'm aware of, you can convert from the ordinal value in the encoding (e.g. ASCII) to actual numeric values by subtracting away '0' to get actual int values from 0 to 9.
You could just as easily subtract 48 directly (when compiled, it would be impossible to tell which option you used; 48 and ASCII '0' are indistinguishable), it would just be less obvious what you were doing to other people reading your source code.
The ASCII value of '0' is the 48'th character in code page 437 (IBM default character set). Similarly, '1' is the 49'th etc. Subtracting '0' instead of a magic number such as 48 is much clearer as far as self-documentation goes.

Why is the 'sizeof' operator returning a value of 4 for a character? [duplicate]

In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.
In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.
discussion on same subject
"More specifically the integral promotions. In K&R C it was virtually (?)
impossible to use a character value without it being promoted to int first,
so making character constant int in the first place eliminated that step.
There were and still are multi character constants such as 'abcd' or however
many will fit in an int."
The original question is "why?"
The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.
In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.
This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.
This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.
When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.
When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage is not the case in C.
This is why they are different. Evolution...
I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:
void print(int);
void print(char);
print('a');
You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.
using gcc on my MacBook, I try:
#include <stdio.h>
#define test(A) do{printf(#A":\t%i\n",sizeof(A));}while(0)
int main(void){
test('a');
test("a");
test("");
test(char);
test(short);
test(int);
test(long);
test((char)0x0);
test((short)0x0);
test((int)0x0);
test((long)0x0);
return 0;
};
which when run gives:
'a': 4
"a": 2
"": 1
char: 1
short: 2
int: 4
long: 4
(char)0x0: 1
(short)0x0: 2
(int)0x0: 4
(long)0x0: 4
which suggests that a character is 8 bits, like you suspect, but a character literal is an int.
Back when C was being written, the PDP-11's MACRO-11 assembly language had:
MOV #'A, R0 // 8-bit character encoding for 'A' into 16 bit register
This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:
MOV #"AB, R0 // 16-bit character encoding for 'A' (low byte) and 'B'
This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.
So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:
address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'
If you want to read just an 'A' from this main memory into a register, which one would you read?
Some CPUs may only directly support reading a 16 bit value into a 16 bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.
So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).
My guess is that C's simply carried this level of CPU-centric behaviour over, thinking of character constants occupying register sizes of memory, bearing out the common assessment of C as a "high level assembler".
(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)
I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. What the code did was to put the read character into an int, then test for EOF, then convert to a char if it wasn't.
I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.
int r;
char buffer[1024], *p; // don't use in production - buffer overflow likely
p = buffer;
while ((r = getc(file)) != EOF)
{
*(p++) = (char) r;
}
I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):
In C, the type of a character literal such as 'a' is int.
Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems.
Except for the pathological example sizeof('a'), every construct that can be expressed
in both C and C++ gives the same result.
So for the most part, it should cause no problems.
The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.
That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one’s-complement math instead of two’s-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-byte chunks but not four-byte nibbles.
I don't know, but I'm going to guess it was easier to implement it that way and it didn't really matter. It wasn't until C++ when the type could determine which function would get called that it needed to be fixed.
This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.
(x86 wonks may note that there is eg a native addh op that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend sign, like an add/extsh pair on the PowerPC)
This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).
EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:
Character literals have type int and
they get there by following the rules
for promotion from type char. This is
too briefly covered in K&R 1, on page
39 where it says:
Every char in an expression is
converted into an int....Notice that
all float's in an expression are
converted to double....Since a
function argument is an expression,
type conversions also take place when
arguments are passed to functions: in
particular, char and short become int,
float becomes double.

A char = 1 byte, but why it is stored with 4 bytes? [duplicate]

In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.
In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.
discussion on same subject
"More specifically the integral promotions. In K&R C it was virtually (?)
impossible to use a character value without it being promoted to int first,
so making character constant int in the first place eliminated that step.
There were and still are multi character constants such as 'abcd' or however
many will fit in an int."
The original question is "why?"
The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.
In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.
This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.
This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.
When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.
When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage is not the case in C.
This is why they are different. Evolution...
I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:
void print(int);
void print(char);
print('a');
You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.
using gcc on my MacBook, I try:
#include <stdio.h>
#define test(A) do{printf(#A":\t%i\n",sizeof(A));}while(0)
int main(void){
test('a');
test("a");
test("");
test(char);
test(short);
test(int);
test(long);
test((char)0x0);
test((short)0x0);
test((int)0x0);
test((long)0x0);
return 0;
};
which when run gives:
'a': 4
"a": 2
"": 1
char: 1
short: 2
int: 4
long: 4
(char)0x0: 1
(short)0x0: 2
(int)0x0: 4
(long)0x0: 4
which suggests that a character is 8 bits, like you suspect, but a character literal is an int.
Back when C was being written, the PDP-11's MACRO-11 assembly language had:
MOV #'A, R0 // 8-bit character encoding for 'A' into 16 bit register
This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:
MOV #"AB, R0 // 16-bit character encoding for 'A' (low byte) and 'B'
This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.
So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:
address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'
If you want to read just an 'A' from this main memory into a register, which one would you read?
Some CPUs may only directly support reading a 16 bit value into a 16 bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.
So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).
My guess is that C's simply carried this level of CPU-centric behaviour over, thinking of character constants occupying register sizes of memory, bearing out the common assessment of C as a "high level assembler".
(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)
I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. What the code did was to put the read character into an int, then test for EOF, then convert to a char if it wasn't.
I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.
int r;
char buffer[1024], *p; // don't use in production - buffer overflow likely
p = buffer;
while ((r = getc(file)) != EOF)
{
*(p++) = (char) r;
}
I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):
In C, the type of a character literal such as 'a' is int.
Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems.
Except for the pathological example sizeof('a'), every construct that can be expressed
in both C and C++ gives the same result.
So for the most part, it should cause no problems.
The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.
That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one’s-complement math instead of two’s-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-byte chunks but not four-byte nibbles.
I don't know, but I'm going to guess it was easier to implement it that way and it didn't really matter. It wasn't until C++ when the type could determine which function would get called that it needed to be fixed.
This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.
(x86 wonks may note that there is eg a native addh op that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend sign, like an add/extsh pair on the PowerPC)
This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).
EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:
Character literals have type int and
they get there by following the rules
for promotion from type char. This is
too briefly covered in K&R 1, on page
39 where it says:
Every char in an expression is
converted into an int....Notice that
all float's in an expression are
converted to double....Since a
function argument is an expression,
type conversions also take place when
arguments are passed to functions: in
particular, char and short become int,
float becomes double.

sizeof operator different behaviour? [duplicate]

In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.
In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.
discussion on same subject
"More specifically the integral promotions. In K&R C it was virtually (?)
impossible to use a character value without it being promoted to int first,
so making character constant int in the first place eliminated that step.
There were and still are multi character constants such as 'abcd' or however
many will fit in an int."
The original question is "why?"
The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.
In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.
This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.
This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.
When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.
When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage is not the case in C.
This is why they are different. Evolution...
I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:
void print(int);
void print(char);
print('a');
You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.
using gcc on my MacBook, I try:
#include <stdio.h>
#define test(A) do{printf(#A":\t%i\n",sizeof(A));}while(0)
int main(void){
test('a');
test("a");
test("");
test(char);
test(short);
test(int);
test(long);
test((char)0x0);
test((short)0x0);
test((int)0x0);
test((long)0x0);
return 0;
};
which when run gives:
'a': 4
"a": 2
"": 1
char: 1
short: 2
int: 4
long: 4
(char)0x0: 1
(short)0x0: 2
(int)0x0: 4
(long)0x0: 4
which suggests that a character is 8 bits, like you suspect, but a character literal is an int.
Back when C was being written, the PDP-11's MACRO-11 assembly language had:
MOV #'A, R0 // 8-bit character encoding for 'A' into 16 bit register
This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:
MOV #"AB, R0 // 16-bit character encoding for 'A' (low byte) and 'B'
This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.
So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:
address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'
If you want to read just an 'A' from this main memory into a register, which one would you read?
Some CPUs may only directly support reading a 16 bit value into a 16 bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.
So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).
My guess is that C's simply carried this level of CPU-centric behaviour over, thinking of character constants occupying register sizes of memory, bearing out the common assessment of C as a "high level assembler".
(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)
I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. What the code did was to put the read character into an int, then test for EOF, then convert to a char if it wasn't.
I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.
int r;
char buffer[1024], *p; // don't use in production - buffer overflow likely
p = buffer;
while ((r = getc(file)) != EOF)
{
*(p++) = (char) r;
}
I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):
In C, the type of a character literal such as 'a' is int.
Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems.
Except for the pathological example sizeof('a'), every construct that can be expressed
in both C and C++ gives the same result.
So for the most part, it should cause no problems.
The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.
That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one’s-complement math instead of two’s-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-byte chunks but not four-byte nibbles.
I don't know, but I'm going to guess it was easier to implement it that way and it didn't really matter. It wasn't until C++ when the type could determine which function would get called that it needed to be fixed.
This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.
(x86 wonks may note that there is eg a native addh op that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend sign, like an add/extsh pair on the PowerPC)
This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).
EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:
Character literals have type int and
they get there by following the rules
for promotion from type char. This is
too briefly covered in K&R 1, on page
39 where it says:
Every char in an expression is
converted into an int....Notice that
all float's in an expression are
converted to double....Since a
function argument is an expression,
type conversions also take place when
arguments are passed to functions: in
particular, char and short become int,
float becomes double.

what is the value of sizeof 'A'? [duplicate]

In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.
In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.
discussion on same subject
"More specifically the integral promotions. In K&R C it was virtually (?)
impossible to use a character value without it being promoted to int first,
so making character constant int in the first place eliminated that step.
There were and still are multi character constants such as 'abcd' or however
many will fit in an int."
The original question is "why?"
The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.
In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.
This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.
This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.
When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.
When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage is not the case in C.
This is why they are different. Evolution...
I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:
void print(int);
void print(char);
print('a');
You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.
using gcc on my MacBook, I try:
#include <stdio.h>
#define test(A) do{printf(#A":\t%i\n",sizeof(A));}while(0)
int main(void){
test('a');
test("a");
test("");
test(char);
test(short);
test(int);
test(long);
test((char)0x0);
test((short)0x0);
test((int)0x0);
test((long)0x0);
return 0;
};
which when run gives:
'a': 4
"a": 2
"": 1
char: 1
short: 2
int: 4
long: 4
(char)0x0: 1
(short)0x0: 2
(int)0x0: 4
(long)0x0: 4
which suggests that a character is 8 bits, like you suspect, but a character literal is an int.
Back when C was being written, the PDP-11's MACRO-11 assembly language had:
MOV #'A, R0 // 8-bit character encoding for 'A' into 16 bit register
This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:
MOV #"AB, R0 // 16-bit character encoding for 'A' (low byte) and 'B'
This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.
So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:
address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'
If you want to read just an 'A' from this main memory into a register, which one would you read?
Some CPUs may only directly support reading a 16 bit value into a 16 bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.
So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).
My guess is that C's simply carried this level of CPU-centric behaviour over, thinking of character constants occupying register sizes of memory, bearing out the common assessment of C as a "high level assembler".
(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)
I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. What the code did was to put the read character into an int, then test for EOF, then convert to a char if it wasn't.
I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.
int r;
char buffer[1024], *p; // don't use in production - buffer overflow likely
p = buffer;
while ((r = getc(file)) != EOF)
{
*(p++) = (char) r;
}
I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):
In C, the type of a character literal such as 'a' is int.
Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems.
Except for the pathological example sizeof('a'), every construct that can be expressed
in both C and C++ gives the same result.
So for the most part, it should cause no problems.
The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.
That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one’s-complement math instead of two’s-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-byte chunks but not four-byte nibbles.
I don't know, but I'm going to guess it was easier to implement it that way and it didn't really matter. It wasn't until C++ when the type could determine which function would get called that it needed to be fixed.
This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.
(x86 wonks may note that there is eg a native addh op that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend sign, like an add/extsh pair on the PowerPC)
This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).
EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:
Character literals have type int and
they get there by following the rules
for promotion from type char. This is
too briefly covered in K&R 1, on page
39 where it says:
Every char in an expression is
converted into an int....Notice that
all float's in an expression are
converted to double....Since a
function argument is an expression,
type conversions also take place when
arguments are passed to functions: in
particular, char and short become int,
float becomes double.

Resources