In C, there appear to be differences between various values of zero -- NULL, NUL and 0.
I know that the ASCII character '0' evaluates to 48 or 0x30.
The NULL pointer is usually defined as:
#define NULL 0
Or
#define NULL (void *)0
In addition, there is the NUL character '\0' which seems to evaluate to 0 as well.
Are there times when these three values cannot be equal?
Is this also true on 64-bit systems?
Note: This answer applies to the C language, not C++.
Null Pointers
The integer constant literal 0 has different meanings depending upon the context in which it's used. In all cases, it is still an integer constant with the value 0; it is just described in different ways.
If a pointer is being compared to the constant literal 0, then this is a check to see if the pointer is a null pointer. This 0 is then referred to as a null pointer constant. The C standard defines that 0 cast to the type void * is both a null pointer and a null pointer constant.
Additionally, to help readability, the macro NULL is provided in the header file stddef.h. Depending upon your compiler it might be possible to #undef NULL and redefine it to something wacky.
Therefore, here are some valid ways to check for a null pointer:
if (pointer == NULL)
NULL is defined to compare equal to a null pointer. It is implementation defined what the actual definition of NULL is, as long as it is a valid null pointer constant.
if (pointer == 0)
0 is another representation of the null pointer constant.
if (!pointer)
This if statement implicitly checks "is not 0", so we reverse that to mean "is 0".
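To see all three valid forms side by side, here is a minimal compilable sketch (the variable names are just for illustration):
#include <stddef.h>
#include <stdio.h>
int main(void) {
    int value = 42;
    int *pointer = NULL;              /* pointer currently points at nothing */
    /* The three equivalent null checks from above: */
    if (pointer == NULL) printf("pointer == NULL\n");
    if (pointer == 0)    printf("pointer == 0\n");
    if (!pointer)        printf("!pointer\n");
    pointer = &value;                 /* now it points at an object */
    if (pointer != NULL) printf("pointer now holds %d\n", *pointer);
    return 0;
}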
The following are INVALID ways to check for a null pointer:
int mynull = 0;
<some code>
if (pointer == mynull)
To the compiler this is not a check for a null pointer, but an equality check on two variables. This might work if mynull never changes in the code and the compiler optimizations constant fold the 0 into the if statement, but this is not guaranteed and the compiler has to produce at least one diagnostic message (warning or error) according to the C Standard.
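For completeness, here is a sketch contrasting the constant 0 with an int variable holding 0; the variable comparison is left commented out because it violates a constraint and typical compilers reject it or warn about it:
#include <stddef.h>
#include <stdio.h>
int main(void) {
    int *pointer = NULL;
    int mynull = 0;
    (void)mynull;              /* silence the unused-variable warning in this sketch */
    if (pointer == 0) {        /* fine: this 0 is a null pointer constant */
        printf("null pointer\n");
    }
    /* if (pointer == mynull)  -- comparing a pointer with an int variable is a
       constraint violation; the compiler must issue a diagnostic for it, so the
       line is left commented out here. */
    return 0;
}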
Note that the internal value of a null pointer on the underlying architecture does not matter to the C language. If the underlying architecture has a null pointer value defined as address 0xDEADBEEF, then it is up to the compiler to sort this mess out.
As such, even on this funny architecture, the following ways are still valid ways to check for a null pointer:
if (!pointer)
if (pointer == NULL)
if (pointer == 0)
The following are INVALID ways to check for a null pointer:
#define MYNULL (void *) 0xDEADBEEF
if (pointer == MYNULL)
if (pointer == 0xDEADBEEF)
as these are seen by a compiler as normal comparisons.
Null Characters
'\0' is defined to be a null character - that is a character with all bits set to zero. '\0' is (like all character literals) an integer constant, in this case with the value zero. So '\0' is completely equivalent to an unadorned 0 integer constant - the only difference is in the intent that it conveys to a human reader ("I'm using this as a null character.").
'\0' has nothing to do with pointers. However, you may see something similar to this code:
if (!*char_pointer)
checks if the char pointer is pointing at a null character.
if (*char_pointer)
checks if the char pointer is pointing at a non-null character.
Don't get these confused with null pointers. Even though the bit representation happens to be the same, which allows for some convenient crossover cases, they are not really the same thing.
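As an illustration of the null character idiom (not the null pointer one), here is a sketch of a strlen-style loop; my_strlen is just a made-up name for the example:
#include <stdio.h>
/* Counts characters up to, but not including, the terminating '\0'. */
static int my_strlen(const char *char_pointer) {
    int length = 0;
    while (*char_pointer) {    /* true while pointing at a non-null character */
        char_pointer++;
        length++;
    }
    return length;             /* here *char_pointer is the null character '\0' */
}
int main(void) {
    printf("%d\n", my_strlen("hello"));   /* prints 5 */
    return 0;
}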
References
See Question 5.3 of the comp.lang.c FAQ for more.
See this pdf for the C standard. Check out sections 6.3.2.3 Pointers, paragraph 3.
It appears that a number of people misunderstand what the differences between NULL, '\0' and 0 are. So, to explain, and in attempt to avoid repeating things said earlier:
An integer constant expression with the value 0, or such an expression cast to type void *, is a null pointer constant, which if converted to a pointer becomes a null pointer. It is guaranteed by the standard to compare unequal to a pointer to any object or function.
NULL is a macro, defined in <stddef.h> as a null pointer constant.
\0 is a construction used to represent the null character, used to terminate a string.
A null character is a byte which has all its bits set to 0.
All three define the meaning of zero in different contexts.
pointer context - NULL is used and means the pointer holds a null pointer value, independent of whether the platform is 32-bit or 64-bit (4 bytes of zeroes in one case, 8 in the other).
string context - the character representing the digit zero has a hex value of 0x30, whereas the NUL character has hex value of 0x00 (used for terminating strings).
These three are always different when you look at the memory:
NULL - 0x00000000 or 0x00000000'00000000 (32 vs 64 bit)
NUL - 0x00 or 0x0000 (an 8-bit character vs. a 2-byte wide character)
'0' - 0x30 (in ASCII)
I hope this clarifies it.
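A small program can make the distinction concrete. This sketch assumes an ASCII execution character set and a typical all-zero null pointer representation; the exact pointer width and bytes are platform-dependent:
#include <stddef.h>
#include <stdio.h>
int main(void) {
    char *p = NULL;
    char nul = '\0';
    char zero = '0';
    printf("NULL pointer        : %p\n", (void *)p);               /* typically all zero bits */
    printf("NUL character '\\0'  : 0x%02X\n", (unsigned char)nul);  /* 0x00 */
    printf("digit character '0' : 0x%02X\n", (unsigned char)zero); /* 0x30 in ASCII */
    printf("sizeof(char *)      : %zu\n", sizeof p);               /* 4 on 32-bit, 8 on 64-bit */
    return 0;
}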
The entry If NULL and 0 are equivalent as null pointer constants, which should I use? in the C FAQ list addresses this issue as well:
C programmers must understand that NULL and 0 are interchangeable in pointer contexts, and that an uncast 0 is perfectly acceptable. Any usage of NULL (as opposed to 0) should be considered a gentle reminder that a pointer is involved; programmers should not depend on it (either for their own understanding or the compiler's) for distinguishing pointer 0's from integer 0's.
It is only in pointer contexts that NULL and 0 are equivalent. NULL should not be used when another kind of 0 is required, even though it might work, because doing so sends the wrong stylistic message. (Furthermore, ANSI allows the definition of NULL to be ((void *)0), which will not work at all in non-pointer contexts.) In particular, do not use NULL when the ASCII null character (NUL) is desired. Provide your own definition
#define NUL '\0'
if you must.
What is the difference between NULL, '\0' and 0
"null character (NUL)" is easiest to rule out. '\0' is a character literal.
In C, it is implemented as int, so it's the same as 0 and has the size of an int (INT_TYPE_SIZE in GCC's terms). In C++, a character literal is implemented as char, which is 1 byte; its size is therefore normally different from that of NULL or 0.
Next, NULL is a pointer value that indicates a variable does not point to any address space. Setting aside the fact that it is usually represented as all zeros, it must be able to express the full address space of the architecture. Thus, on a 32-bit architecture NULL is (likely) 4 bytes and on a 64-bit architecture 8 bytes. This is up to the implementation of C.
Finally, the literal 0 is of type int, which is of size INT_TYPE_SIZE. This size can differ depending on the architecture.
Apple wrote:
The 64-bit data model used by Mac OS X is known as "LP64". This is the common data model used by other 64-bit UNIX systems from Sun and SGI as well as 64-bit Linux. The LP64 data model defines the primitive types as follows:
ints are 32-bit
longs are 64-bit
long-longs are also 64-bit
pointers are 64-bit
Wikipedia 64-bit:
Microsoft's VC++ compiler uses the LLP64 model.
64-bit data models
Data model   short   int   long   long long   pointers   Sample operating systems
LLP64        16      32    32     64          64         Microsoft Win64 (X64/IA64)
LP64         16      32    64     64          64         Most Unix and Unix-like systems (Solaris, Linux, etc.)
ILP64        16      64    64     64          64         HAL
SILP64       64      64    64     64          64         ?
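To check which of these data models a given platform uses, you can simply print the sizes; a minimal sketch whose output naturally varies by platform:
#include <stdio.h>
int main(void) {
    printf("short     : %zu bytes\n", sizeof(short));
    printf("int       : %zu bytes\n", sizeof(int));
    printf("long      : %zu bytes\n", sizeof(long));
    printf("long long : %zu bytes\n", sizeof(long long));
    printf("void *    : %zu bytes\n", sizeof(void *));
    return 0;
}
On an LP64 system this prints 2, 4, 8, 8, 8; on an LLP64 system it prints 2, 4, 4, 8, 8.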
Edit:
Added more on the character literal.
#include <stdio.h>
int main(void) {
printf("%d", sizeof('\0'));
return 0;
}
The above code prints 4 when compiled as C with gcc and 1 when compiled as C++ with g++ (on a platform where sizeof(int) is 4).
One good piece which helped me when starting with C (taken from Expert C Programming by Peter van der Linden):
The One 'l' nul and the Two 'l' null
Memorize this little rhyme to recall the correct terminology for pointers and ASCII zero:
The one "l" NUL ends an ASCII string,
The two "l" NULL points to no thing.
Apologies to Ogden Nash, but the three "l" nulll means check your spelling.
The ASCII character with the bit pattern of zero is termed a "NUL".
The special pointer value that means the pointer points nowhere is "NULL".
The two terms are not interchangeable in meaning.
A one-L NUL, it ends a string.
A two-L NULL points to no thing.
And I will bet a golden bull
That there is no three-L NULLL.
How do you deal with NUL?
"NUL" is not 0, but refers to the ASCII NUL character. At least, that's how I've seen it used. The null pointer is often defined as 0, but this depends on the environment you are running in, and the specification of whatever operating system or language you are using.
In ANSI C, the null pointer is specified as the integer value 0. So any world where that's not true is not ANSI C compliant.
A byte with a value of 0x00 is, on the ASCII table, the special character called NUL or NULL. In C, since you shouldn't embed control characters in your source code, this is represented in C strings with an escaped 0, i.e., \0.
But a true NULL is not a value. It is the absence of a value. For a pointer, it means the pointer has nothing to point to. In a database, it means there is no value in a field (which is not the same thing as saying the field is blank, 0, or filled with spaces).
The actual value a given system or database file format uses to represent a NULL isn't necessarily 0x00.
A null pointer is not guaranteed to be represented by all-zero bits -- its exact representation is architecture-dependent. Most compilers define the NULL macro as (void *)0.
'\0' will always equal 0, because that is how byte 0 is encoded in a character literal.
C compilers are not required to use ASCII, so '0' is not guaranteed to equal 48 (in EBCDIC, for example, it is 240). Regardless, it's unlikely you'll ever encounter an alternative character set unless you're working on very obscure systems.
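One place where '0' can be used safely regardless of character set is digit conversion: the standard guarantees '0' through '9' are contiguous, so subtracting '0' works on ASCII and EBCDIC alike. A tiny sketch:
#include <stdio.h>
int main(void) {
    char digit = '7';
    /* '0'..'9' are guaranteed contiguous, so this works in any conforming character set. */
    int value = digit - '0';
    printf("'%c' has numeric value %d\n", digit, value);   /* prints: '7' has numeric value 7 */
    return 0;
}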
The sizes of the various types will differ on 64-bit systems, but the integer values will be the same.
Some commenters have expressed doubt that NULL could compare equal to 0 and yet not be represented by all-zero bits. Here is an example program, along with the output it could produce on such a system:
#include <stdio.h>
int main(void) {
    size_t ii;
    int *ptr = NULL;
    unsigned char *null_value = (unsigned char *)&ptr;   /* inspect ptr one byte at a time */
    if (NULL == 0) {
        printf("NULL == 0\n");
    }
    printf("NULL = 0x");
    for (ii = 0; ii < sizeof(ptr); ii++) {
        printf("%02X", null_value[ii]);
    }
    printf("\n");
    return 0;
}
That program could print:
NULL == 0
NULL = 0x00000001
(void*) 0 is NULL, and '\0' represents the end of a string.
Related
I want to understand the following code:
//...
#define _C 0x20
extern const char *_ctype_;
//...
__only_inline int iscntrl(int _c)
{
    return (_c == -1 ? 0 : ((_ctype_ + 1)[(unsigned char)_c] & _C));
}
It originates from the file ctype.h in the OpenBSD operating system source code. This function checks if a char is a control character or a printable letter inside the ASCII range. This is my current chain of thought:
iscntrl('a') is called and 'a' is converted to its integer value
first check if _c is -1, then return 0, else...
increment the address the undefined pointer points to by 1
declare this address as a pointer to an array of length (unsigned char)((int)'a')
apply the bitwise AND operator to _C (0x20) and the array (???)
Somehow, strangely, it works: every time 0 is returned, the given char _c is not a printable character. Otherwise, when it's printable, the function just returns an integer value that's not of any special interest. My problem of understanding is in steps 3, 4 (a bit) and 5.
Thank you for any help.
_ctype_ appears to be a restricted internal character lookup table, and I'm guessing the + 1 is because they didn't bother storing index 0 of it, since that one isn't printable. Or possibly they are using a 1-indexed table instead of the 0-indexed one that is customary in C.
The C standard dictates this for all ctype.h functions:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF
Going through the code step by step:
int iscntrl(int _c) The int types are really characters, but all ctype.h functions are required to handle EOF, so they must be int.
The check against -1 is a check against EOF, since it has the value -1.
_ctype_ + 1 is pointer arithmetic to get the address of an array item.
[(unsigned char)_c] is simply an array access of that array, where the cast is there to enforce the standard requirement that the parameter be representable as unsigned char. Note that char can actually hold a negative value, so this is defensive programming. The result of the [] array access is a single entry from their internal character table.
The & masking is there to pick out a certain group of characters from the table. Apparently all entries with bit 5 set (mask 0x20) correspond to control characters. There's no making sense of this without viewing the table.
Anything with bit 5 set will return the value masked with 0x20, which is a non-zero value. This satisfies the requirement that the function return non-zero in case of boolean true.
_ctype_ is a pointer to a global array of 257 bytes. I don't know what _ctype_[0] is used for. _ctype_[1] through _ctype_[256] represent the character categories of characters 0, …, 255 respectively: _ctype_[c + 1] represents the category of the character c. This is the same thing as saying that _ctype_ + 1 points to an array of 256 characters where (_ctype_ + 1)[c] represents the category of the character c.
(_ctype_ + 1)[(unsigned char)_c] is not a declaration. It's an expression using the array subscript operator. It's accessing position (unsigned char)_c of the array that starts at (_ctype_ + 1).
The cast of _c from int to unsigned char is not strictly necessary: ctype functions take char values cast to unsigned char (char is signed on OpenBSD): a correct call is char c; … iscntrl((unsigned char)c). The cast has the advantage of guaranteeing that there is no buffer overflow: if the application calls iscntrl with a value that is outside the range of unsigned char and isn't -1, this function returns a value which may not be meaningful but at least won't cause a crash or a leak of private data that happened to be at an address outside of the array bounds. The value is even correct if the function is called as char c; … iscntrl(c), as long as c isn't -1.
The reason for the special case with -1 is that it's EOF. Many standard C functions that operate on a char, for example getchar, represent the character as an int value which is the char value wrapped to a positive range, and use the special value EOF == -1 to indicate that no character could be read. For functions like getchar, EOF indicates the end of the file, hence the name end-of-file. Eric Postpischil suggests that the code was originally just return _ctype_[_c + 1], and that's probably right: _ctype_[0] would be the value for EOF. This simpler implementation leads to a buffer overflow if the function is misused, whereas the current implementation avoids this, as discussed above.
If v is the value found in the array, v & _C tests if the bit at 0x20 is set in v. The values in the array are masks of the categories that the character is in: _C is set for control characters, _U is set for uppercase letters, etc.
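To make the table-lookup pattern concrete, here is a toy sketch of the same technique. The names (my_table, my_iscntrl, MY_CTRL, MY_UPPER) and the table contents are invented for the example; this is not OpenBSD's actual table:
#include <stdio.h>
#define MY_CTRL  0x20   /* invented "control character" bit, standing in for _C */
#define MY_UPPER 0x01   /* invented "uppercase letter" bit, standing in for _U */
/* 1 + 256 entries: index 0 is reserved for EOF (-1), entries 1..256 describe
   characters 0..255. Only two entries are filled in for this demo. */
static unsigned char my_table[257];
static int my_iscntrl(int c) {
    return c == -1 ? 0 : ((my_table + 1)[(unsigned char)c] & MY_CTRL);
}
int main(void) {
    my_table[1 + '\n'] = MY_CTRL;    /* mark newline as a control character */
    my_table[1 + 'A']  = MY_UPPER;   /* mark 'A' as an uppercase letter */
    printf("my_iscntrl('\\n') -> %d\n", my_iscntrl('\n'));   /* non-zero */
    printf("my_iscntrl('A')  -> %d\n", my_iscntrl('A'));     /* 0 */
    return 0;
}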
I'll start with step 3:
increment the address the undefined pointer points to by 1
The pointer is not undefined. It's just defined in some other compilation unit. That is what the extern part tells the compiler. So when all files are linked together, the linker will resolve the references to it.
So what does it point to?
It points to an array with information about each character. Each character has its own entry. An entry is a bitmap representation of characteristics for the character. For example: If bit 5 is set, it means that the character is a control character. Another example: If bit 0 is set, it means that the character is an uppercase character.
So something like (_ctype_ + 1)['x'] will get the characteristics that apply to 'x'. Then a bitwise and is performed to check if bit 5 is set, i.e. check whether it is a control character.
The reason for adding 1 is probably that the real index 0 is reserved for some special purpose.
All information here is based on analyzing the source code (and programming experience).
The declaration
extern const char *_ctype_;
tells the compiler that there is a pointer to const char somewhere named _ctype_.
(4) This pointer is accessed as an array.
(_ctype_ + 1)[(unsigned char)_c]
The cast (unsigned char)_c makes sure the index value is in the range of an unsigned char (0..255).
The pointer arithmetic _ctype_ + 1 effectively shifts the array position by 1 element. I don't know why they implemented the array this way. Using the range _ctype_[1].._ctype_[256] for the character values 0..255 leaves the value _ctype_[0] unused for this function. (The offset of 1 could be implemented in several alternative ways.)
The array access retrieves a value (of type char, to save space) using the character value as array index.
(5) The bitwise AND operation extracts a single bit from the value.
Apparently the value from the array is used as a bit field where bit 5 (counting from 0 at the least significant bit, i.e. 0x20) is a flag for "is a control character". So the array contains bit-field values describing the properties of the characters.
The key here is to understand what the expression (_ctype_ + 1)[(unsigned char)_c] does (the result is then fed to the bitwise AND operation, & 0x20, to get the final answer).
Short answer: It returns element _c + 1 of the array pointed-to by _ctype_.
How?
First, although you seem to think _ctype_ is undefined it actually isn't! The header declares it as an external variable - but it is defined in (almost certainly) one of the run-time libraries that your program is linked with when you build it.
To illustrate how the syntax corresponds to array indexing, try working through (even compiling) the following short program:
#include <stdio.h>
int main() {
    // Code like the following two lines will be defined somewhere in the run-time
    // libraries with which your program is linked, only using _ctype_ in place of _qlist_ ...
    const char list[] = "abcdefghijklmnopqrstuvwxyz";
    const char* _qlist_ = list;
    // These two lines show how expressions like (a)[b] and (a+1)[b] just boil down to
    // a[b] and a[b+1], respectively ...
    char p = (_qlist_)[6];
    char q = (_qlist_ + 1)[6];
    printf("p = %c q = %c\n", p, q);
    return 0;
}
Feel free to ask for further clarification and/or explanation.
The functions declared in ctype.h accept arguments of type int. For characters used as arguments it is assumed that they have been cast to unsigned char beforehand. The character is used as an index into a table that determines its characteristics.
It seems the check _c == -1 is there for the case when _c contains the value EOF. If it is not EOF, then _c is cast to unsigned char and used as an index into the table pointed to by the expression _ctype_ + 1. And if the bit specified by the mask 0x20 is set, then the character is a control character.
To understand the expression
(_ctype_ + 1)[(unsigned char)_c]
take into account that array subscripting is a postfix operator defined as
postfix-expression [ expression ]
You may not write it like
_ctype_ + 1[(unsigned char)_c]
because this expression is equivalent to
_ctype_ + ( 1[(unsigned char)_c] )
So the expression _ctype_ + 1 is enclosed in parentheses to get a primary expression.
So in fact you have
pointer[integral_expression]
that yields the element of the array at the index computed from integral_expression, where pointer is (_ctype_ + 1) (this is where the pointer arithmetic is used) and the index integral_expression is the expression (unsigned char)_c.
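Since a[b] is defined as *(a + b), the equivalence is easy to demonstrate with a short sketch (the string s is just an example):
#include <stdio.h>
int main(void) {
    const char *s = "abcdef";
    /* a[b] means *(a + b), so all three of these read the same element: */
    printf("%c %c %c\n", s[2], *(s + 2), 2[s]);    /* prints: c c c */
    /* (s + 1)[2] shifts the whole view by one element, i.e. it is s[3]: */
    printf("%c %c\n", (s + 1)[2], s[3]);           /* prints: d d */
    return 0;
}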
In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.
In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.
discussion on same subject
"More specifically the integral promotions. In K&R C it was virtually (?)
impossible to use a character value without it being promoted to int first,
so making character constant int in the first place eliminated that step.
There were and still are multi character constants such as 'abcd' or however
many will fit in an int."
The original question is "why?"
The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.
In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.
This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.
This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.
When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.
When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage is not the case in C.
This is why they are different. Evolution...
I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:
void print(int);
void print(char);
print('a');
You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.
using gcc on my MacBook, I try:
#include <stdio.h>
#define test(A) do { printf(#A ":\t%zu\n", sizeof(A)); } while (0)
int main(void) {
    test('a');
    test("a");
    test("");
    test(char);
    test(short);
    test(int);
    test(long);
    test((char)0x0);
    test((short)0x0);
    test((int)0x0);
    test((long)0x0);
    return 0;
}
which when run gives:
'a': 4
"a": 2
"": 1
char: 1
short: 2
int: 4
long: 4
(char)0x0: 1
(short)0x0: 2
(int)0x0: 4
(long)0x0: 4
which suggests that a character is 8 bits, like you suspect, but a character literal is an int.
Back when C was being written, the PDP-11's MACRO-11 assembly language had:
MOV #'A, R0 // 8-bit character encoding for 'A' into 16 bit register
This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:
MOV #"AB, R0 // 16-bit character encoding for 'A' (low byte) and 'B'
This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.
So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:
address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'
If you want to read just an 'A' from this main memory into a register, which one would you read?
Some CPUs may only directly support reading a 16 bit value into a 16 bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.
So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).
My guess is that C's simply carried this level of CPU-centric behaviour over, thinking of character constants occupying register sizes of memory, bearing out the common assessment of C as a "high level assembler".
(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)
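In C terms, you can watch a character constant occupy an int-sized object and see where its byte lands depending on endianness; a small sketch (the 41 in the comment assumes ASCII):
#include <stdio.h>
int main(void) {
    int c = 'A';                              /* a character constant stored in an int-sized object */
    unsigned char *bytes = (unsigned char *)&c;
    printf("sizeof 'A' as stored here: %zu bytes\n", sizeof c);
    printf("byte layout: ");
    for (size_t i = 0; i < sizeof c; i++) {
        printf("%02X ", bytes[i]);            /* 41 00 00 00 on little-endian, 00 00 00 41 on big-endian */
    }
    printf("\n");
    return 0;
}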
I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. What the code did was to put the read character into an int, then test for EOF, then convert to a char if it wasn't.
I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.
int r;
char buffer[1024], *p; // don't use in production - buffer overflow likely
p = buffer;
while ((r = getc(file)) != EOF)
{
*(p++) = (char) r;
}
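A self-contained variant of the same idiom, reading from stdin and with a bounds check added (the buffer size and names are chosen arbitrarily for the sketch):
#include <stdio.h>
int main(void) {
    int r;                       /* int, not char, so EOF (-1) can be distinguished
                                    from every valid character value */
    char buffer[1024];
    size_t n = 0;
    while ((r = getchar()) != EOF && n < sizeof buffer - 1) {
        buffer[n++] = (char) r;  /* safe to narrow once we know it isn't EOF */
    }
    buffer[n] = '\0';            /* NUL-terminate what was read */
    printf("read %zu characters\n", n);
    return 0;
}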
I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):
In C, the type of a character literal such as 'a' is int. Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems. Except for the pathological example sizeof('a'), every construct that can be expressed in both C and C++ gives the same result.
So for the most part, it should cause no problems.
The historical reason for this is that C, and its predecessor B, grew up on DEC PDP minicomputers with various word sizes, which supported ASCII characters but could only perform arithmetic on full registers. (B was developed on the PDP-7; the byte-addressable PDP-11, on which C itself took shape, came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.
That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one's-complement math instead of two's-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-bit chunks but not four-bit nibbles.
I don't know, but I'm going to guess it was easier to implement it that way and it didn't really matter. It wasn't until C++ when the type could determine which function would get called that it needed to be fixed.
This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.
(x86 wonks may note that there is, for example, a native 16-bit add that operates on the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then sign-extend, like an add/extsh pair on the PowerPC.)
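The promotion-then-truncation behaviour described above is easy to observe directly; a small sketch:
#include <stdio.h>
int main(void) {
    unsigned char c = 254;
    /* In the expression c + 2, c is promoted to int, so the result is the int 256... */
    int promoted = c + 2;
    /* ...but storing it back into an unsigned char wraps it modulo 256. */
    unsigned char stored = (unsigned char)(c + 2);
    printf("promoted result: %d\n", promoted);   /* 256 */
    printf("stored result  : %d\n", stored);     /* 0 */
    return 0;
}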
This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).
EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets. It confirms that in C a character literal has type int, and explains that this came about historically by following the promotion rules for char. The following is quoted from the book:
Character literals have type int and they get there by following the rules for promotion from type char. This is too briefly covered in K&R 1, on page 39 where it says:
Every char in an expression is converted into an int.... Notice that all float's in an expression are converted to double.... Since a function argument is an expression, type conversions also take place when arguments are passed to functions: in particular, char and short become int, float becomes double.