I want to understand the following code:
//...
#define _C 0x20
extern const char *_ctype_;
//...
__only_inline int iscntrl(int _c)
{
return (_c == -1 ? 0 : ((_ctype_ + 1)[(unsigned char)_c] & _C));
}
It originates from the file ctype.h in the OpenBSD operating system source code. This function checks whether a char is a control character or a printable letter inside the ASCII range. This is my current chain of thought:
iscntrl('a') is called and 'a' is converted to its integer value
first check if _c is -1 then return 0 else...
increment the address the undefined pointer points to by 1
declare this address as a pointer to an array of length (unsigned char)((int)'a')
apply the bitwise and operator to _C (0x20) and the array (???)
Somehow, strangely, it works, and every time 0 is returned the given char _c is not a printable character. Otherwise, when it's printable, the function just returns an integer value that's not of any special interest. My problem of understanding lies in steps 3, 4 (a bit) and 5.
Thank you for any help.
_ctype_ appears to be a restricted internal version of the symbol table and I'm guessing the + 1 is that they didn't bother saving index 0 of it since that one isn't printable. Or possibly they are using a 1-indexed table instead of 0-indexed as is custom in C.
The C standard dictates this for all ctype.h functions:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF
Going through the code step by step:
int iscntrl(int _c) The int types are really characters, but all ctype.h functions are required to handle EOF, so they must be int.
The check against -1 is a check against EOF, since it has the value -1.
_ctype_ + 1 is pointer arithmetic to get the address of an array item.
[(unsigned char)_c] is simply an array access of that array, where the cast is there to enforce the standard requirement of the parameter being representable as unsigned char. Note that char can actually hold a negative value, so this is defensive programming. The result of the [] array access is a single character from their internal symbol table.
The & masking is there to get a certain group of characters from the symbol table. Apparently all characters with bit 5 set (mask 0x20) are control characters. There's no making sense of this without viewing the table.
Anything with bit 5 set will return the value masked with 0x20, which is a non-zero value. This satisfies the requirement of the function returning non-zero in case of boolean true.
_ctype_ is a pointer to a global array of 257 bytes. I don't know what _ctype_[0] is used for. _ctype_[1] through _ctype_[256] represent the character categories of characters 0, …, 255 respectively: _ctype_[c + 1] represents the category of the character c. This is the same thing as saying that _ctype_ + 1 points to an array of 256 characters where (_ctype_ + 1)[c] represents the category of the character c.
(_ctype_ + 1)[(unsigned char)_c] is not a declaration. It's an expression using the array subscript operator. It's accessing position (unsigned char)_c of the array that starts at (_ctype_ + 1).
The cast of _c from int to unsigned char is not strictly necessary: ctype functions take char values cast to unsigned char (char is signed on OpenBSD): a correct call is char c; … iscntrl((unsigned char)c). It has the advantage of guaranteeing that there is no buffer overflow: if the application calls iscntrl with a value that is outside the range of unsigned char and isn't -1, this function returns a value which may not be meaningful but at least won't cause a crash or a leak of private data that happened to be at the address outside of the array bounds. The value is even correct if the function is called as char c; … iscntrl(c) as long as c isn't -1.
The reason for the special case with -1 is that it's EOF. Many standard C functions that operate on a char, for example getchar, represent the character as an int value which is the char value wrapped to a positive range, and use the special value EOF == -1 to indicate that no character could be read. For functions like getchar, EOF indicates the end of the file, hence the name end-of-file. Eric Postpischil suggests that the code was originally just return _ctype_[_c + 1], and that's probably right: _ctype_[0] would be the value for EOF. This simpler implementation leads to a buffer overflow if the function is misused, whereas the current implementation avoids this as discussed above.
If v is the value found in the array, v & _C tests if the bit at 0x20 is set in v. The values in the array are masks of the categories that the character is in: _C is set for control characters, _U is set for uppercase letters, etc.
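To make the table-plus-mask idea concrete, here is a minimal standalone sketch; the table my_table and the masks MY_CTRL and MY_UPPER are made up for illustration and are not OpenBSD's actual _ctype_ data:
#include <stdio.h>

#define MY_CTRL  0x20   /* "is a control character" bit (mirrors _C)   */
#define MY_UPPER 0x01   /* "is an uppercase letter" bit (made up)      */

/* Hypothetical classification table: index 0 is reserved for EOF,
   indexes 1..256 describe characters 0..255. Only a few entries
   are filled in here for illustration. */
static const char my_table[257] = {
    [0]         = 0,            /* EOF              */
    ['\n' + 1]  = MY_CTRL,      /* newline: control */
    ['\t' + 1]  = MY_CTRL,      /* tab: control     */
    ['A' + 1]   = MY_UPPER,     /* 'A': uppercase   */
    ['a' + 1]   = 0,            /* 'a': neither     */
};

static int my_iscntrl(int c)
{
    return (c == -1 ? 0 : ((my_table + 1)[(unsigned char)c] & MY_CTRL));
}

int main(void)
{
    /* prints: 32 0 0 -- any nonzero value means "control character" */
    printf("%d %d %d\n", my_iscntrl('\n'), my_iscntrl('a'), my_iscntrl(-1));
    return 0;
}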
I'll start with step 3:
increment the address the undefined pointer points to by 1
The pointer is not undefined. It's just defined in some other compilation unit. That is what the extern part tells the compiler. So when all files are linked together, the linker will resolve the references to it.
So what does it point to?
It points to an array with information about each character. Each character has its own entry. An entry is a bitmap representation of characteristics for the character. For example: If bit 5 is set, it means that the character is a control character. Another example: If bit 0 is set, it means that the character is an uppercase character.
So something like (_ctype_ + 1)['x'] will get the characteristics that apply to 'x'. Then a bitwise and is performed to check if bit 5 is set, i.e. check whether it is a control character.
The reason for adding 1 is probably that the real index 0 is reserved for some special purpose.
All information here is based on analyzing the source code (and programming experience).
The declaration
extern const char *_ctype_;
tells the compiler that there is a pointer to const char somewhere named _ctype_.
(4) This pointer is accessed as an array.
(_ctype_ + 1)[(unsigned char)_c]
The cast (unsigned char)_c makes sure the index value is in the range of an unsigned char (0..255).
The pointer arithmetic _ctype_ + 1 effectively shifts the array position by 1 element. I don't know why they implemented the array this way. Using the range _ctype_[1].._ctype_[256] for the character values 0..255 leaves the value _ctype_[0] unused for this function. (The offset of 1 could be implemented in several alternative ways.)
The array access retrieves a value (of type char, to save space) using the character value as array index.
(5) The bitwise AND operation extracts a single bit from the value.
Apparently the value from the array is used as a bit field where the bit 5 (counting from 0 starting at least significant bit, = 0x20) is a flag for "is a control character". So the array contains bit field values describing the properties of the characters.
The key here is to understand what the expression (_ctype_ + 1)[(unsigned char)_c] does (it is then fed to the bitwise AND operation, & 0x20, to get the result).
Short answer: It returns element _c + 1 of the array pointed-to by _ctype_.
How?
First, although you seem to think _ctype_ is undefined, it actually isn't! The header declares it as an external variable - but it is defined in (almost certainly) one of the run-time libraries that your program is linked with when you build it.
To illustrate how the syntax corresponds to array indexing, try working through (even compiling) the following short program:
#include <stdio.h>
int main() {
// Code like the following two lines will be defined somewhere in the run-time
// libraries with which your program is linked, only using _ctype_ in place of _qlist_ ...
const char list[] = "abcdefghijklmnopqrstuvwxyz";
const char* _qlist_ = list;
// These two lines show how expressions like (a)[b] and (a+1)[b] just boil down to
// a[b] and a[b+1], respectively ...
char p = (_qlist_)[6];
char q = (_qlist_ + 1)[6];
printf("p = %c q = %c\n", p, q);
return 0;
}
Feel free to ask for further clarification and/or explanation.
The functions declared in ctype.h accept arguments of type int. For characters used as arguments it is assumed that they are cast beforehand to the type unsigned char. This character is used as an index into a table that determines the characteristics of the character.
It seems the check _c == -1 handles the case when _c contains the value of EOF. If it is not EOF, then _c is cast to the type unsigned char, which is used as an index into the table pointed to by the expression _ctype_ + 1. And if the bit specified by the mask 0x20 is set, then the character is a control character.
To understand the expression
(_ctype_ + 1)[(unsigned char)_c]
take into account that the array subscripting is a postfix operator that is defined like
postfix-expression [ expression ]
You may not write like
_ctype_ + 1[(unsigned char)_c]
because this expression is equivalent to
_ctype_ + ( 1[(unsigned char)_c] )
So the expression _ctype_ + 1 is enclosed in parentheses to get a primary expression.
So in fact you have
pointer[integral_expression]
that yields the element of the array at the index given by integral_expression, where pointer is (_ctype_ + 1) (this is where the pointer arithmetic is used) and the index integral_expression is the expression (unsigned char)_c.
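As a small standalone illustration of why the parentheses matter:
#include <stdio.h>

int main(void)
{
    const char *p = "ABCDE";

    /* (p + 1)[2] indexes the array starting at p + 1, i.e. p[3]. */
    printf("%c\n", (p + 1)[2]);   /* prints D */

    /* Without parentheses, p + 1[2] parses as p + (1[2]), and 1[2]
       would try to subscript the integer 1, which does not compile.
       That is why (_ctype_ + 1) is written in parentheses. */
    return 0;
}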
I wrote the following program,
#include<stdio.h>
int main(void)
{
int i='A';
printf("i=%c",i);
return 0;
}
and I got the result as,
i=A
So I tried another program,
#include<stdio.h>
int main(void)
{
int i='ABC';
printf("i=%c",i);
return 0;
}
As I understood it, since 32 bits are used to store an int value and each of 'A', 'B' and 'C' has an 8-bit ASCII code, which totals 24 bits, those 24 bits would be stored in a 32-bit unit. So I expected the output to be
i=ABC
but the output instead was
i=C
and I can't understand why?
'ABC' in this case is an integer character constant as per section 6.4.4.4, paragraph 10 of the standard.
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. The
value of an integer character constant containing more than one
character (e.g.,'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined. If an integer character constant contains a
single character or escape sequence, its value is the one that results
when an object with type char whose value is that of the single
character or escape sequence is converted to type int.
In this case, 'A'==0x41, 'B'==0x42, 'C'==0x43, and your compiler then interprets i to be 0x414243. As said in the other answer, this value is implementation-defined.
When you print it using %c, the higher-order bytes are discarded and you are left with only 0x43, which is 'C'.
To get more insight to it, read the answers to this question as well.
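To see what value the compiler actually produced, you can print the constant in hexadecimal; the exact value is implementation-defined, and the values in the comments below are only what a typical gcc-style encoding would give:
#include <stdio.h>

int main(void)
{
    int i = 'ABC';                       /* implementation-defined value; gcc warns here */
    printf("i = %#x\n", (unsigned)i);    /* e.g. 0x414243 with gcc                       */
    printf("i = %c\n", i);               /* %c keeps only the low-order byte: 'C'        */
    return 0;
}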
The conversion specifier c used in this call
printf("i=%c",i);
in fact extracts one character from the integer argument. So with this specifier you cannot, in any case, get three characters as the output.
From the C Standard (7.21.6.1 The fprintf function)
c If no l length modifier is present, the int argument is converted to
an unsigned char, and the resulting character is written
Take into account that the internal representation of a multi-byte character constant is implementation defined. From the C Standard (6.4.4.4 Character constants)
...The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined.
'ABC' is an integer character constant. Depending on the code set (overwhelmingly it is ASCII), endianness, and int width (apparently 32 bits in OP's case), it may have a value like one of the below. It is implementation-defined behavior.
'ABC'
0x41424300
0x434241
or others.
The "%c" directs printf() to take the int value, cast it to unsigned char and print the associated character. This is the main reason for apparent loss of information.
In OP's case, it appears that i took on the value 0x414243, whose low-order byte 0x43 prints as 'C'.
int i = 'A';
printf("i=%c", i);          // prints 'A'
// same result, because %c uses only the low-order byte:
printf("i=%c", 0x434241);   // also prints 'A'
If you want i to contain 3 characters you need to declare an array that holds 3 characters:
char i[3];
i[0] = 'A';
i[1] = 'B';
i[2] = 'C';
Single quotes can hold only one character; your code stores a multi-character constant, converted to an int, in your 32-bit integer. If you want to keep the characters as separate 8-bit units, make a char array like char i[3]. Then you will see that
int j = i;
results in a compiler diagnostic, because a char array (which decays to a pointer) cannot be implicitly converted to an integer.
In C, 'A' is an int constant that's guaranteed to fit into a char.
'ABC' is a multicharacter constant. It has type int, but an implementation-defined value. When it is printed with %c, printf converts the int argument to unsigned char, so only the low-order byte is used.
I am recently reading The C Programming Language by Kernighan.
There is an example which defines a variable as int but uses getchar() to store a character in it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think of is ASCII and Unicode.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char value, it will silently be promoted to int anyway. And that includes character constants like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
Storing an int to a char has implementation defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
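A minimal sketch of the usual pattern: keep getchar()'s result in an int, test against EOF, and only then treat it as a character:
#include <stdio.h>

int main(void)
{
    int c;                      /* int, not char, so EOF fits alongside all byte values */

    while ((c = getchar()) != EOF) {
        putchar(c);             /* here c is a valid unsigned char value */
    }
    return 0;
}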
getchar is an old C standard function and the philosophy back then was closer to how the language gets translated to assembly than type correctness and readability. Keep in mind that compilers were not optimizing code as much as they are today. In C, int is the default return type (i.e. if you don't have a declaration of a function in C, compilers will assume that it returns int), and returning a value is done using a register - therefore returning a char instead of an int actually generates additional implicit code to mask out the extra bytes of your value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>. Here is somebody else's limits.h.
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (that can be worked with by hardware); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number 1. (For historical reasons, lots of control characters came first - so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining UC letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
Strictly speaking, you can do arithmetic directly on a char in C, because it is promoted to int in arithmetic expressions; treating it as an int just makes that explicit. So, to convert all LC letters to UC, you can do something like:
char letter;
....
if (letter >= 'a' && letter <= 'z') {   /* letter is lower case */
letter = (int) letter - 32;
}
Some C compilers may warn when the int result is assigned back to the narrower char, which is why the explicit (int) cast is sometimes written out.
But, in the end, the type char is really just a small integer type, since ASCII assigns a unique integer to each letter.
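A short, self-contained sketch of that idea; it assumes an ASCII-compatible character set, where 'a' - 'A' == 32:
#include <stdio.h>

int main(void)
{
    char msg[] = "hello, world";

    for (int i = 0; msg[i] != '\0'; i++) {
        if (msg[i] >= 'a' && msg[i] <= 'z') {
            msg[i] = msg[i] - ('a' - 'A');   /* subtract 32 in ASCII */
        }
    }
    printf("%s\n", msg);   /* prints HELLO, WORLD */
    return 0;
}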
Based on the following snippet in C
int c1,c2;
printf("%d ",&c1-&c2);
Output : -1
Why does this code not produce a warning saying the format %d expects an int but it's getting a (void *) instead?
Why does it return -1 as an answer(and not 1)? Even if it's subtracting addresses it should be -4 and not -1. When I change the printf statement to printf("%d ",&c2 - &c1) I get 1 and not any random value! Why?
If I change the printf statement as printf("%d ",(int)&c1 - (int)&c2) am I typecasting an address to an integer value? Does that mean the value of the address stored as hexadecimal is now converted to int and then subtracted?
1) It's getting an int. Pointer-to-something minus pointer-to-something is an integer value (type ptrdiff_t). It tells you how many elements apart two pointers pointing into the same array are.
2) As the two pointers do not point into the same array, the difference is undefined. Any value can be obtained.
3) Yes. But the "hexadecimal" part is incorrect. Addresses are stored in bits/binary (as are integers). What changes is the interpretation by your program. This is independent of the representation (hex/dec/oct/...).
There are multiple cases of undefined behavior here.
If we go to the draft C99 standard section 6.5.6 Additive operators it says (emphasis mine):
When two pointers are subtracted, both shall point to elements of the
same array object, or one past the last element of the array object;
the result is the difference of the subscripts of the two array
elements. The size of the result is implementation-defined, and its
type (a signed integer type) is ptrdiff_t defined in the <stddef.h>
header. If the result is not representable in an object of that type,
the behavior is undefined.
As it stands, what you have is undefined behavior, since the pointers do not point to elements of the same array.
The correct format specifier when using printf for ptrdiff_t would be %td which gives us a second case of undefined behavior since you are specifying the incorrect format specifier.
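For contrast, here is a minimal sketch of a well-defined pointer subtraction, where both pointers point into the same array and the result is printed with %td:
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    int arr[10];
    int *p = &arr[7];
    int *q = &arr[2];
    ptrdiff_t d = p - q;    /* difference in elements, not bytes */

    printf("%td\n", d);     /* prints 5 */
    return 0;
}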
1) A void * is nothing but an address, and an address is just a number. Here, the pointer difference is implicitly passed where an int is expected.
2) In memory, your variables aren't necessarily stored in the same order as in
your code ;). Furthermore, for the same reason:
int a[2];
a[0] = 3;
*(a + 1) = 5; // same that "a[1] = 5;"
This code will put a 5 in the second element, because the pointer arithmetic actually computes the byte address a + 1 * sizeof(*a).
3) Hexadecimal is just a number representation. It can be stored in an int! Example:
int a = 0xFF;
printf("%d\n", a); // print 255
I hope that I have answered your questions.
1) Some compilers do issue warnings for malformed printf arguments, but since printf is a variadic function, the runtime has no way of checking that the arguments have the types specified by the format string. Any mismatch results in undefined behaviour as the function attempts to read an argument as an incorrect type.
2) You say the result should be -4, but that's not correct. Pointer subtraction gives a difference in elements, not bytes, and in any case only the elements of an array are guaranteed to be contiguous; you cannot assume that c2 is at (&c1 + 1).
3) (int)&c1 is converting the address of c1 to an int. That's again, in general, undefined behaviour since you don't know that int is big enough to hold the pointer address value. (int might be 32 bits on a 64 bit chipset). You should use intptr_t in place of the int.
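A sketch of the intptr_t approach mentioned above (the printed value is just a raw address difference and has no portable meaning):
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    int c1, c2;
    intptr_t a1 = (intptr_t)&c1;    /* integer type wide enough for a pointer */
    intptr_t a2 = (intptr_t)&c2;

    /* A byte-level difference of two unrelated addresses: fine to look at,
       but not something the standard assigns any meaning to. */
    printf("%" PRIdPTR "\n", a1 - a2);
    return 0;
}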
#include <stdio.h>
int main()
{
printf("%s", (1)["abcd"]+"efg"-'b'+1);
}
Can someone please explain why the output of this code is:
fg
I know (1)["abcd"] points to "bcd" but why +"efg"-'b'+1 is even a valid syntax ?
I know (1)["abcd"] points to "bcd"
No. (1)["abcd"] is a single char (b).
So (1)["abcd"]+"efg"-'b'+1 is: 'b' + "egf" - 'b' + 1 and if you simplify it, it becomes "efg" + 1. Hence it prints fg.
Note: The above answer explains only the observed behaviour which is not strictly legal as per the C language specification. Here's why.
case 1: 'b' < 0 or 'b' > 4
In this case, the expression (1)["abcd"] + "efg" - 'b' + 1 will lead to undefined behaviour, due to the sub-expression (1)["abcd"] + "efg", which is 'b' + "efg" producing an invalid pointer expression (C11, 6.5.6 Additive operators -- quote below).
On the widely used ASCII character set, 'b' is 98 in decimal; on the not-so-widely used EBCDIC character set, 'b' is 130 in decimal. So the sub-expression (1)["abcd"] + "efg" would cause undefined behaviour on a system using either of these two.
So barring a weird architecture, where 'b' <= 4 and 'b' >= 0, this program would cause undefined behaviour due to how the
C language is defined:
C11, 5.1.2.3 Program execution
The semantic descriptions in this International Standard describe the
behavior of an abstract machine in which issues of optimization are
irrelevant. [...] In the abstract machine, all expressions are
evaluated as specified by the semantics. An actual implementation need
not evaluate part of an expression if it can deduce that its value is
not used and that no needed side effects are produced.
which categorically states that whole standard has been defined based on the abstract machine's behaviour.
So in this case, it does cause undefined behaviour.
case 2: 'b' >= 0 and 'b' <= 4 (This is quite imaginary, but in theory, it's possible).
In this case, the subexpression (1)["abcd"] + "efg" can be valid (and in turn, the whole expression (1)["abcd"] + "efg" - 'b' + 1).
The string literal "efg" consists of 4 chars, which is an array type (of type char[N] in C) and and the C standard guarantees (as quoted above) that the pointer expression evaluating to one-past the end of an array doesn't overflow or cause undefined behaviour.
The following are the possible sub-expressions and they are valid:
(1) "efg"+0 (2) "efg"+1 (3) "efg"+2 (4) "efg"+3 and (5) "efg"+4 because C standard states that:
C11, 6.5.6 Additive operators
When an expression that has integer type is added to or subtracted
from a pointer, the result has the type of the pointer operand. If the
pointer operand points to an element of an array object, and the array
is large enough, the result points to an element offset from the
original element such that the difference of the subscripts of the
resulting and original array elements equals the integer expression.
In other words, if the expression P points to the i-th element of an
array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N
(where N has the value n) point to, respectively, the i+n-th and
i−n-th elements of the array object, provided they exist. Moreover, if
the expression P points to the last element of an array object, the
expression (P)+1 points one past the last element of the array object,
and if the expression Q points one past the last element of an array
object, the expression (Q)-1 points to the last element of the array
object. If both the pointer operand and the result point to elements
of the same array object, or one past the last element of the array
object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined. If the result points one past the last element
of the array object, it shall not be used as the operand of a unary *
operator that is evaluated.
So it's not causing undefined behaviour in this case.
Thanks #zch & #Keith Thompson for digging out the relevant parts of C standard :)
There seems to be some confusion about the difference between the other two answers. Here's what happens, step by step:
(1)["abcd"]+"efg"-'b'+1
The first part, (1)["abcd"] takes advantage of the way arrays are processed in C. Let's look at the following:
int a[5] = { 0, 10, 20, 30, 40 };
printf("%d %d\n", a[2], 2[a]);
The output will be 20 20. Why? Because the name of an array of int decays, in this expression, to a pointer to its first element. Referring to an element of the integer array tells C to add an offset to the address of the array and evaluate the result as type int. But this means C doesn't care about the order: a[2] is exactly the same as 2[a].
Similarly, since a is the address of the array, a + 1 is the address of the element at the first offset into the array. Of course, that's equivalent to 1 + a.
A string in C is just another, human-friendly, way of representing an array of type char. So (1)["abcd"] is the same as returning the element at the first offset into an array of the characters a, b, c, d, \0 ... which is the character b.
In C, every character has an integral value (generally its ASCII code). The value of 'b' happens to be 98. The remainder of the evaluation, therefore, involves calculations with integers and an array: the character string "efg".
We have the address of the string. We add and subtract 98 (the ASCII value of the character b), and we add 1. The b's cancel each other, so the net result is one more than the address of the first character in the string, which is the address of the character f.
The %s conversion in the printf() tells C to treat the address as the first character in a string, and to print the entire string until it encounters the null character at the end.
So it prints fg, which is the part of the string "efg" that starts at the f.
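The two identities used above can be checked directly with a tiny sketch:
#include <stdio.h>

int main(void)
{
    /* Array subscripting is symmetric: a[i] is *(a + i) is i[a]. */
    printf("%c %c\n", "abcd"[1], (1)["abcd"]);   /* prints: b b */

    /* Adding 1 to a string literal skips its first character. */
    printf("%s\n", "efg" + 1);                   /* prints: fg */
    return 0;
}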
In C, there appear to be differences between various values of zero -- NULL, NUL and 0.
I know that the ASCII character '0' evaluates to 48 or 0x30.
The NULL pointer is usually defined as:
#define NULL 0
Or
#define NULL (void *)0
In addition, there is the NUL character '\0' which seems to evaluate to 0 as well.
Are there times when these three values can not be equal?
Is this also true on 64 bit systems?
Note: This answer applies to the C language, not C++.
Null Pointers
The integer constant literal 0 has different meanings depending upon the context in which it's used. In all cases, it is still an integer constant with the value 0, it is just described in different ways.
If a pointer is being compared to the constant literal 0, then this is a check to see if the pointer is a null pointer. This 0 is then referred to as a null pointer constant. The C standard defines that 0 cast to the type void * is both a null pointer and a null pointer constant.
Additionally, to help readability, the macro NULL is provided in the header file stddef.h. Depending upon your compiler it might be possible to #undef NULL and redefine it to something wacky.
Therefore, here are some valid ways to check for a null pointer:
if (pointer == NULL)
NULL is defined to compare equal to a null pointer. It is implementation defined what the actual definition of NULL is, as long as it is a valid null pointer constant.
if (pointer == 0)
0 is another representation of the null pointer constant.
if (!pointer)
This if statement implicitly checks "is not 0", so we reverse that to mean "is 0".
The following are INVALID ways to check for a null pointer:
int mynull = 0;
<some code>
if (pointer == mynull)
To the compiler this is not a check for a null pointer, but an equality check on two variables. This might work if mynull never changes in the code and the compiler optimizations constant fold the 0 into the if statement, but this is not guaranteed and the compiler has to produce at least one diagnostic message (warning or error) according to the C Standard.
Note that the value of a null pointer in the C language does not matter on the underlying architecture. If the underlying architecture has a null pointer value defined as address 0xDEADBEEF, then it is up to the compiler to sort this mess out.
As such, even on this funny architecture, the following ways are still valid ways to check for a null pointer:
if (!pointer)
if (pointer == NULL)
if (pointer == 0)
The following are INVALID ways to check for a null pointer:
#define MYNULL (void *) 0xDEADBEEF
if (pointer == MYNULL)
if (pointer == 0xDEADBEEF)
as these are seen by a compiler as normal comparisons.
Null Characters
'\0' is defined to be a null character - that is a character with all bits set to zero. '\0' is (like all character literals) an integer constant, in this case with the value zero. So '\0' is completely equivalent to an unadorned 0 integer constant - the only difference is in the intent that it conveys to a human reader ("I'm using this as a null character.").
'\0' has nothing to do with pointers. However, you may see something similar to this code:
if (!*char_pointer)
checks if the char pointer is pointing at a null character.
if (*char_pointer)
checks if the char pointer is pointing at a non-null character.
Don't get these confused with null pointers. Just because the bit representation is the same, which allows for some convenient crossover cases, they are not really the same thing.
References
See Question 5.3 of the comp.lang.c FAQ for more.
See this pdf for the C standard. Check out section 6.3.2.3 Pointers, paragraph 3.
It appears that a number of people misunderstand what the differences between NULL, '\0' and 0 are. So, to explain, and in attempt to avoid repeating things said earlier:
A constant expression of type int with the value 0, or an expression of this type, cast to type void * is a null pointer constant, which if converted to a pointer becomes a null pointer. It is guaranteed by the standard to compare unequal to any pointer to any object or function.
NULL is a macro, defined in <stddef.h> as a null pointer constant.
\0 is a construction used to represent the null character, used to terminate a string.
A null character is a byte which has all its bits set to 0.
All three define the meaning of zero in different context.
pointer context - NULL is used and means the value of the pointer is 0, independent of whether it is 32bit or 64bit (one case 4 bytes the other 8 bytes of zeroes).
string context - the character representing the digit zero has a hex value of 0x30, whereas the NUL character has hex value of 0x00 (used for terminating strings).
These three are always different when you look at the memory:
NULL - 0x00000000 or 0x00000000'00000000 (32 vs 64 bit)
NUL - 0x00 or 0x0000 (ASCII vs 2-byte Unicode)
'0' - 0x30 (in ASCII)
I hope this clarifies it.
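A small sketch putting the three zeros side by side (the printed codes assume an ASCII execution character set; the variable names are just for illustration):
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    char *p = NULL;        /* null pointer: points to nothing       */
    char nul = '\0';       /* null character: integer value 0       */
    char zero = '0';       /* digit zero: value 48 (0x30) in ASCII  */

    printf("p == NULL: %d\n", p == NULL);               /* 1         */
    printf("nul  = %d\n", nul);                         /* 0         */
    printf("zero = %d (0x%x)\n", zero, (unsigned)zero); /* 48 (0x30) */
    return 0;
}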
The entry "If NULL and 0 are equivalent as null pointer constants, which should I use?" in the C FAQ list addresses this issue as well:
C programmers must understand that NULL and 0 are interchangeable in pointer contexts, and that an uncast 0 is perfectly acceptable. Any usage of NULL (as opposed to 0) should be considered a gentle reminder that a pointer is involved; programmers should not depend on it (either for their own understanding or the compiler's) for distinguishing pointer 0's from integer 0's.
It is only in pointer contexts that NULL and 0 are equivalent. NULL should not be used when another kind of 0 is required, even though it might work, because doing so sends the wrong stylistic message. (Furthermore, ANSI allows the definition of NULL to be ((void *)0), which will not work at all in non-pointer contexts.) In particular, do not use NULL when the ASCII null character (NUL) is desired.
Provide your own definition
#define NUL '\0'
if you must.
"null character (NUL)" is easiest to rule out. '\0' is a character literal.
In C, a character literal has type int, so '\0' is the same as a plain 0 and has the size of an int. In C++, a character literal has type char, which is 1 byte. This is normally a different size from NULL or 0.
Next, NULL is a pointer value that specifies that a variable does not point to any address space. Setting aside the fact that it is usually represented as all zero bits, it must be able to express the full address space of the architecture. Thus, on a 32-bit architecture NULL is (likely) 4 bytes and on a 64-bit architecture 8 bytes. This is up to the implementation of C.
Finally, the literal 0 is of type int, whose size could differ depending on the architecture (commonly 4 bytes).
Apple wrote:
The 64-bit data model used by Mac OS X is known as "LP64". This is the common data model used by other 64-bit UNIX systems from Sun and SGI as well as 64-bit Linux. The LP64 data model defines the primitive types as follows:
ints are 32-bit
longs are 64-bit
long-longs are also 64-bit
pointers are 64-bit
Wikipedia 64-bit:
Microsoft's VC++ compiler uses the LLP64 model.
64-bit data models
Data model  short  int  long  long long  pointers  Sample operating systems
LLP64       16     32   32    64         64        Microsoft Win64 (X64/IA64)
LP64        16     32   64    64         64        Most Unix and Unix-like systems (Solaris, Linux, etc.)
ILP64       16     64   64    64         64        HAL
SILP64      64     64   64    64         64        ?
Edit:
Added more on the character literal.
#include <stdio.h>
int main(void) {
printf("%d", sizeof('\0'));
return 0;
}
The above code prints 4 when compiled as C with gcc and 1 when compiled as C++ with g++.
One good piece which helps me when starting with C (taken from the Expert C Programming by Linden)
The One 'l' nul and the Two 'l' null
Memorize this little rhyme to recall the correct terminology for pointers and ASCII zero:
The one "l" NUL ends an ASCII string,
The two "l" NULL points to no thing.
Apologies to Ogden Nash, but the three "l" nulll means check your spelling.
The ASCII character with the bit pattern of zero is termed a "NUL".
The special pointer value that means the pointer points nowhere is "NULL".
The two terms are not interchangeable in meaning.
A one-L NUL, it ends a string.
A two-L NULL points to no thing.
And I will bet a golden bull
That there is no three-L NULLL.
How do you deal with NUL?
"NUL" is not 0, but refers to the ASCII NUL character. At least, that's how I've seen it used. The null pointer is often defined as 0, but this depends on the environment you are running in, and the specification of whatever operating system or language you are using.
In ANSI C, the null pointer constant is specified as the integer constant 0 (the in-memory representation of a null pointer, however, need not be all zero bits). So any world where that's not true is not ANSI C compliant.
A byte with a value of 0x00 is, on the ASCII table, the special character called NUL or NULL. In C, since you shouldn't embed control characters in your source code, this is represented in C strings with an escaped 0, i.e., \0.
But a true NULL is not a value. It is the absence of a value. For a pointer, it means the pointer has nothing to point to. In a database, it means there is no value in a field (which is not the same thing as saying the field is blank, 0, or filled with spaces).
The actual value a given system or database file format uses to represent a NULL isn't necessarily 0x00.
A null pointer is not guaranteed to be represented by all-zero bits -- its exact representation is architecture-dependent. Most major implementations define NULL as (void*)0.
'\0' will always equal 0, because that is how byte 0 is encoded in a character literal.
I don't remember whether C compilers are required to use ASCII -- if not, '0' might not always equal 48. Regardless, it's unlikely you'll ever encounter a system which uses an alternative character set like EBCDIC unless you're working on very obscure systems.
The sizes of the various types will differ on 64-bit systems, but the integer values will be the same.
Some commenters have expressed doubt that NULL could compare equal to 0 and yet not be represented as all-zero bits. Here is an example program, along with the expected output on such a (hypothetical) system:
#include <stdio.h>
int main () {
size_t ii;
int *ptr = NULL;
unsigned char *null_value = (unsigned char *)&ptr;
if (NULL == 0) {
printf ("NULL == 0\n"); }
printf ("NULL = 0x");
for (ii = 0; ii < sizeof (ptr); ii++) {
printf ("%02X", null_value[ii]); }
printf ("\n");
return 0;
}
That program could print:
NULL == 0
NULL = 0x00000001
(void*) 0 is NULL, and '\0' represents the end of a string.