C Pointer Arithmetic for Unusual Architectures

I'm trying to get a better understanding of the C standard. In particular I am interested in how pointer arithmetic might work in an implementation for an unusual machine architecture.
Suppose I have a processor with 64 bit wide registers that is connected to RAM where each address corresponds to a cell 8 bits wide. An implementation for C for this machine defines CHAR_BIT to be equal to 8. Suppose I compile and execute the following lines of code:
char *pointer = 0;
pointer = pointer + 1;
After execution, pointer is equal to 1. This gives one the impression that in general data of type char corresponds to the smallest addressable unit of memory on the machine.
Now suppose I have a processor with 12 bit wide registers that is connected to RAM where each address corresponds to a cell 4 bits wide. An implementation of C for this machine defines CHAR_BIT to be equal to 12. Suppose the same lines of code are compiled and executed for this machine. Would pointer be equal to 3?
More generally, when you increment a pointer to a char, is the address equal to CHAR_BIT divided by the width of a memory cell on the machine?

Would pointer be equal to 3?
Well, the standard doesn't say how pointers are implemented. The standard tells what is to happen when you use a pointer in a specific way but not what the value of a pointer shall be.
All we know is that adding 1 to a char pointer makes it point at the next char object, wherever that is. The standard says nothing about the pointer's value.
So when you say that
pointer = pointer + 1;
will make the pointer equal 1, it's wrong. The standard doesn't say anything about that.
On most systems a char is 8 bits and pointers are (virtual) memory addresses referencing an 8-bit addressable memory location. On such systems, incrementing a char pointer increases the pointer value (that is, the memory address) by 1. On unusual architectures, however, there is no way to tell.
But if you have a system where each memory address references 4 bits and a char is 12 bits, it seems a good guess that ++pointer will increase the pointer by three.
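As a rough model of that guess (a hypothetical helper, not anything the standard defines): if the raw address counts memory cells and a char spans a whole number of cells, the raw step per char would be the char width divided by the cell width.

```c
#include <assert.h>

/* Hypothetical model only: the C standard does not define raw pointer values.
   Assumes raw addresses count memory cells and that a char occupies a whole
   number of cells. */
unsigned raw_step_per_char(unsigned char_bits, unsigned cell_bits)
{
    assert(cell_bits != 0 && char_bits % cell_bits == 0);
    return char_bits / cell_bits;
}
```

Under those assumptions, raw_step_per_char(12, 4) is 3 for the machine in the question and raw_step_per_char(8, 8) is 1 for a conventional one.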

Pointers are incremented by at least the width of the datatype they "point to", but are not guaranteed to increment by exactly that width.
For memory alignment purposes, there are many times where a pointer might increment to the next memory word alignment past the minimum width.
So, in general, you cannot assume this pointer to be equal to 3. It very well may be 3, 4, or some larger number.
Here is an example.
struct char_three {
char a;
char b;
char c;
};
struct char_three* my_pointer = 0;
my_pointer++;
/* I'd be shocked if my_pointer was now 3 */
Memory alignment is machine specific. One cannot generalize about it, except that most machines define a WORD as the first address that can be aligned to a memory fetch on the bus. Some machines can specify addresses that don't align with the bus fetches. In such a case, selecting two bytes that span the alignment may result in loading two WORDS.
Most systems don't accept WORD loads on non-aligned boundaries without complaining. This means that a bit of boilerplate assembly is applied to translate the fetch to the preceding WORD boundary, if maximum density is desired.
Most compilers prefer speed to maximum density of data, so they align their structured data to take advantage of WORD boundaries, avoiding the extra calculations. This means that in many cases, data that is not carefully aligned might contain "holes" of bytes that are not used.
If you are interested in details of the above summary, you can read up on Data Structure Alignment which will discuss alignment (and as a consequence) padding.
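One way to see how padding interacts with pointer stepping, as a minimal sketch: whatever padding the compiler adds is counted in sizeof, so the step between consecutive array elements, measured in chars, equals sizeof of the element type. The struct mixed type and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct char_three { char a; char b; char c; };
struct mixed { char tag; int value; };  /* illustrative type, not from the answer */

/* Distance in chars between consecutive elements of an array of struct mixed. */
size_t mixed_stride(void)
{
    struct mixed pair[2];
    return (size_t)((char *)&pair[1] - (char *)&pair[0]);
}

/* Same measurement for the three-char struct from the example above. */
size_t char_three_stride(void)
{
    struct char_three pair[2];
    return (size_t)((char *)&pair[1] - (char *)&pair[0]);
}

/* Bytes in struct mixed that belong to no member (the "holes"). */
size_t mixed_padding(void)
{
    return sizeof(struct mixed) - (sizeof(char) + sizeof(int));
}
```

How many padding bytes there are depends on the implementation, so the sketch deliberately does not promise specific numbers.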

char *pointer = 0;
After execution, pointer is equal to 1
Not necessarily.
This special case gives you a null pointer, since 0 is a null pointer constant. Strictly speaking, such a pointer is not supposed to point at a valid object. If you look at the actual address stored in the pointer, it could be anything.
Null pointers aside, the C language expects you to do pointer arithmetic by first pointing at an array. Or in case of char, you can also point at a chunk of generic data such as a struct. Everything else, like your example, is undefined behavior.
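For contrast, here is a small sketch of pointer arithmetic that stays inside what the language defines: the pointers start at an array element and go no further than one past the last element. The function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Counts elements by walking from the first element to one past the last.
   Forming the one-past-the-end pointer is allowed; dereferencing it is not,
   and this loop never dereferences anything. */
size_t count_until_end(const char *first, const char *one_past_last)
{
    size_t n = 0;
    for (const char *p = first; p != one_past_last; ++p)
        ++n;
    return n;
}
```

With char buf[4], calling count_until_end(buf, buf + 4) walks four elements.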
An implementation of C for this machine defines CHAR_BIT to be equal to 12
The C standard defines char to be equal to a byte, so your example is a bit weird and contradictory. Pointer arithmetic will always increase the pointer to point at the next object in the array. The standard doesn't really speak of the representation of addresses at all, but your fictional example would sensibly increase the address by 12 bits, because that's the size of a char.
Fictional computers are quite meaningless to discuss even from a learning point-of-view. I'd advise to focus on real-world computers instead.

When you increment a pointer to a char, is the address equal to CHAR_BIT divided by the width of a memory cell on the machine?
On a "conventional" machine -- indeed on the vast majority of machines where C runs -- CHAR_BIT simply is the width of a memory cell on the machine, so the answer to the question is vacuously "yes" (since CHAR_BIT / CHAR_BIT is 1.).
A machine with memory cells smaller than CHAR_BIT would be very, very strange -- arguably incompatible with C's definition.
C's definition says that:
sizeof(char) is exactly 1.
CHAR_BIT, the number of bits in a char, is at least 8. That is, as far as C is concerned, a byte may not be smaller than 8 bits. (It may be larger, and this is a surprise to many people, but it does not concern us here.)
There is a strong suggestion (if not an explicit requirement) that char (or "byte") is the machine's "minimum addressable unit" or some such.
So for a machine that can address 4 bits at a time, we would have to pick unnatural values for sizeof(char) and CHAR_BIT (which would otherwise probably want to be 2 and 4, respectively), and we would have to ignore the suggestion that type char is the machine's minimum addressable unit.
C does not mandate the internal representation (the bit pattern) of a pointer. The closest a portable C program can get to doing anything with the internal representation of a pointer value is to print it out using %p -- and that's explicitly defined to be implementation-defined.
So I think the only way to implement C on a "4 bit" machine would involve having the code
char a[10];
char *p = a;
p++;
generate instructions which actually incremented the address behind p by 2.
It would then be an interesting question whether %p should print the actual, raw pointer value, or the value divided by 2.
It would also be lots of fun to watch the ensuing fireworks as too-clever programmers on such a machine used type punning techniques to get their hands on the internal value of pointers so that they could increment them by actually 1 -- not the 2 that "proper" additions of 1 would always generate -- such that they could amaze their friends by accessing the odd nybble of a byte, or confound the regulars on SO by asking questions about it. "I just incremented a char pointer by 1. Why is %p showing a value that's 2 greater?"
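For reference, a minimal sketch of the portable way to look at a pointer with %p. The text it produces is implementation-defined, so the sketch only reports how many characters were written; the function name is an assumption for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Formats a pointer with %p into out and returns snprintf's result.
   %p expects a void *, hence the cast. What the text looks like
   (hex, raw value, scaled value) is up to the implementation. */
int format_pointer(char *out, size_t out_size, const void *p)
{
    return snprintf(out, out_size, "%p", (void *)p);
}
```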

Seems like the confusion in this question comes from the fact that the word "byte" in the C standard doesn't have the typical definition (which is 8 bits). Specifically, the word "byte" in the C standard means a collection of bits, where the number of bits is specified by the implementation-defined constant CHAR_BIT. Furthermore, a "byte" as defined by the C standard is the smallest addressable object that a C program can access.
This leaves open the question as to whether there is a one-to-one correspondence between the C definition of "addressable", and the hardware's definition of "addressable". In other words, is it possible that the hardware can address objects that are smaller than a "byte"? If (as in the OP) a "byte" occupies 3 addresses, then that implies that "byte" accesses have an alignment restriction. Which is to say that 3 and 6 are valid "byte" addresses, but 4 and 5 are not. This is prohibited by section 6.2.8 which discusses the alignment of objects.
Which means that the architecture proposed by the OP is not supported by the C specification. In particular, an implementation may not have pointers that point to 4-bit objects when CHAR_BIT is equal to 12.
Here are the relevant sections from the C standard:
§3.6 The definition of "byte" as used in the standard
[A byte is an] addressable unit of data storage large enough to hold
any member of the basic character set of the execution environment.
NOTE 1 It is possible to express the address of each individual byte
of an object uniquely.
NOTE 2 A byte is composed of a contiguous sequence of bits, the number
of which is implementation-defined. The least significant bit is
called the low-order bit; the most significant bit is called the
high-order bit.
§5.2.4.2.1 describes CHAR_BIT as the
number of bits for smallest object that is not a bit-field (byte)
§6.2.6.1 restricts all objects that are larger than a char to be a multiple of CHAR_BIT bits:
[...]
Except for bit-fields, objects are composed of contiguous sequences of
one or more bytes, the number, order, and encoding of which are either
explicitly specified or implementation-defined.
[...] Values stored in non-bit-field objects of any other object type
consist of n × CHAR_BIT bits, where n is the size of an object of that
type, in bytes.
§6.2.8 restricts the alignment of objects
Complete object types have alignment requirements which place
restrictions on the addresses at which objects of that type may be
allocated. An alignment is an implementation-defined integer value
representing the number of bytes between successive addresses at which
a given object can be allocated.
Valid alignments include only those values returned by an _Alignof
expression for fundamental types, plus an additional
implementation-defined set of values, which may be empty. Every
valid alignment value shall be a nonnegative integral power of two.
§6.5.3.4 specifies the sizeof a char, and hence a "byte"
When sizeof is applied to an operand that has type char, unsigned
char, or signed char, (or a qualified version thereof) the result is
1.
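The quoted rules can be written down as compile-time checks. This is a minimal sketch assuming a C11 compiler (for _Static_assert); storage_bits is a made-up helper name.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Both conditions hold on every conforming implementation. */
_Static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
_Static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

/* Number of bits of storage occupied by an object of the given size in bytes,
   following the "n x CHAR_BIT bits" wording of 6.2.6.1. */
size_t storage_bits(size_t size_in_bytes)
{
    return size_in_bytes * CHAR_BIT;
}
```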

The following code fragment demonstrates an invariant of C pointer arithmetic -- no matter what CHAR_BIT is, no matter what the hardware least addressable unit is, and no matter what the actual bit representation of pointers is,
#include <assert.h>
int main(void)
{
T x[2]; // for any object type T whatsoever
assert(&x[1] - &x[0] == 1); // must be true
}
And since sizeof(char) == 1 by definition, this also means that
#include <assert.h>
int main(void)
{
T x[2]; // again for any object type T whatsoever
char *p = (char *)&x[0];
char *q = (char *)&x[1];
assert(q - p == sizeof(T)); // must be true
}
However, if you convert to integers before performing the subtraction, the invariant evaporates:
#include <assert.h>
#include <inttypes.h>
int main(void)
{
T x[2];
uintptr_t p = (uintptr_t)&x[0];
uintptr_t q = (uintptr_t)&x[1];
assert(q - p == sizeof(T)); // implementation-defined whether true
}
because the transformation performed by converting a pointer to an integer of the same size, or vice versa, is implementation-defined. I think it's required to be bijective, but I could be wrong about that, and it is definitely not required to preserve any of the above invariants.

Related

unexpected byte order after casting pointer-to-char into pointer-to-int

unsigned char tab[4] = {0, 0, 0, 14};
If I print as individual bytes...
printf("tab[0] : %u\n", tab[0]); // output: 0
printf("tab[1] : %u\n", tab[1]); // output: 0
printf("tab[2] : %u\n", tab[2]); // output: 0
printf("tab[3] : %u\n", tab[3]); // output: 14
If I print as an integer...
unsigned int fourbyte;
fourbyte = *((unsigned int *)tab);
printf("fourbyte : %u\n", fourbyte); // output: 234881024
My output in binary is : 00001110 00000000 00000000 00000000, which is the data I wanted but in this order tab[3] tab[2] tab[1] tab[0].
Any explanation of that, why the unsigned int pointer points to the last byte instead of the first ?
The correct answer here is that you should not have expected any relationship, order or otherwise. Except for unions, the C standard does not define a linear address space in which objects of different types can overlap. On many architecture and compiler tool-chain combinations these coincidences do occur from time to time, but you should never rely on them. The fact that casting a pointer to a suitable scalar type yields a number comparable to others of the same type in no way implies that the number is any particular memory address.
So:
int* p;
int z = 3;
int* pz = &z;
size_t cookie = (size_t)pz;
p = (int*)cookie;
printf("%d", *p); // Prints 3.
Works because the standard says it must work when cookie is derived from the same type of pointer that it is being converted to. Converting to any other type is undefined behavior. Pointers do not represent memory, they reference 'storage' in the abstract. They are merely references to objects or NULL, and the standard defines how pointers to the same object must behave and how they can be converted to scalar values and back again.
Given:
char array[5] = "five";
The standard says that &(array[0]) < &(array[1]) and that (&(array[0]) + 1) == &(array[1]), but it is silent on how elements in array are ordered in memory. The compiler writers are free to use whatever machine codes and memory layouts they deem appropriate for the target architecture.
In the case of unions, which provides for some overlap of objects in storage, the standard only says that each of its fields must be suitably aligned for their types, but just about everything else about them is implementation defined. The key clause is 6.2.6.1 p7:
When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values.
The gist of all of this is that the C standard defines an abstract machine. The compiler generates an architecture specific simulation of that machine based on your code. You cannot understand the C abstract machine through simple empirical means because implementation details bleed into your data set. You must limit your observations to those that are relevant to the abstraction. Therefore, avoid undefined behavior and be very aware of implementation defined behaviors.
Your example code is running on a computer that is Little-Endian. This term means that the "first byte" of an integer contains the least significant bits. By contrast, a Big-Endian computer stores the most significant bits in the first byte.
Edited to add: the way that you've demonstrated this is decidedly unsafe, as it relies upon undefined behavior to get "direct access" to the memory. There is a safer demonstration here
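A minimal sketch of a demonstration that avoids the cast: build the integer from the individual bytes with shifts, so the byte order is chosen by the code rather than by the machine. It assumes 8-bit bytes and that uint32_t is available; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Combines four bytes with the first byte most significant (big-endian order). */
uint32_t load_be32(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Combines four bytes with the first byte least significant (little-endian order). */
uint32_t load_le32(const unsigned char b[4])
{
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}

/* Returns 1 if the first byte of a uint32_t object holds its least
   significant bits. Inspecting an object through unsigned char is allowed. */
int is_little_endian(void)
{
    uint32_t probe = 1;
    return *(unsigned char *)&probe == 1;
}
```

For the bytes 0, 0, 0, 14 from the question, load_be32 gives 14 and load_le32 gives 234881024, which is the value the cast produced on the asker's little-endian machine.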

Does pointer equality imply integer equality?

For int *a, int *b, does a == b imply (intptr_t)a == (intptr_t)b? I know that it's true for example on a modern X86 CPU, but does the C standard or POSIX or any other standard give a guarantee for this?
This is not guaranteed by the C standard. (This answer does not address what POSIX or other standards say about intptr_t.) What the C standard (2011, draft N1570) says about intptr_t is:
7.20.1.4 1 The following type designates a signed integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer: intptr_t
As a theoretical proof, one counterexample is a system that has 24-bit addresses, where the high eight bits are unused, but the available integer types are 8-bit, 16-bit, and 32-bit. In this case, the C implementation could make intptr_t a 32-bit integer, and it could convert a pointer to intptr_t by copying the 24-bit address into the 32-bit integer and neglecting the high eight bits. Those bits might be left over from whatever was lying around previously. When the intptr_t value is converted back to a pointer, the compiler discards the high eight bits, which results in the original address. In this system, when a == b is evaluated for pointers a and b, the compiler implements this by comparing only the 24 bits of the address. Thus, if a and b point to the same object a == b will be true, but (intptr_t) a == (intptr_t) b may evaluate to false because of the neglected high bits. (Note that, strictly, a and b should be pointers to void or should be converted to pointers to void before being converted to intptr_t.)
Another example would be a system which uses some base and offset addressing. In this system, a pointer might consist of 16 bits that specify some base address and 16 bits that specify an offset. The base might be in multiples of 64 bytes, so the actual address represented by base and offset is base•64 + offset. In this system, if pointer a has base 2 and offset 10, it represents the same address as pointer b with base 1 and offset 74. When comparing pointers, the compiler would evaluate base•64 + offset for each pointer and compare the results, so a == b evaluates to true. However, when converting to intptr_t, the compiler might simply copy the bits, thus producing 131,082 (2•65536 + 10) for (intptr_t) a and 65,610 (1•65536 + 74) for (intptr_t) b. Then (intptr_t) a == (intptr_t) b evaluates to false. But the rule that converting an intptr_t back to a pointer type produces the original pointer still holds, as the compiler will simply copy the bits again.
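The base and offset example can be modelled in ordinary C to check the arithmetic. This is only a simulation of the hypothetical pointer layout described above, with made-up names; it says nothing about how any real compiler represents pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pointer layout: a 16-bit base in units of 64 bytes
   plus a 16-bit offset. */
struct seg_ptr {
    uint16_t base;
    uint16_t offset;
};

/* The address the pointer actually designates: base * 64 + offset. */
uint32_t seg_linear(struct seg_ptr p)
{
    return (uint32_t)p.base * 64u + p.offset;
}

/* What a bit-copying conversion to an integer would produce. */
uint32_t seg_raw_bits(struct seg_ptr p)
{
    return ((uint32_t)p.base << 16) | p.offset;
}

/* Pointer equality as the hypothetical compiler would implement it. */
int seg_equal(struct seg_ptr a, struct seg_ptr b)
{
    return seg_linear(a) == seg_linear(b);
}
```

With a = {2, 10} and b = {1, 74}, seg_equal reports equality (both designate address 138) while seg_raw_bits yields 131082 and 65610.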
Rather than trying to specify all the guarantees that should be upheld by quality implementations on commonplace platforms, the Standard instead seeks to avoid mandating any guarantees that might be expensive or problematic on any conceivable platform unless they are so valuable as to justify any possible cost. The authors expected (reasonably at the time) that quality compilers for platforms which could offer stronger guarantees at essentially no cost would do so, and thus saw no need to explicitly mandate things compilers were going to do anyway.
If one looks at the actual guarantee offered by the Standard, it's absurdly wimpy. It specifies that converting a void* to a uintptr_t and then back to a void* will yield a pointer that may be compared to the original, and that the comparison will report that the two pointers are equal. It says nothing about what will happen if code does anything else with a round-trip-converted pointer. A conforming implementation could perform integer-to-pointer conversions in a way that ignores the integer value (unless it is a Null Pointer Constant) and yields some arbitrary bit pattern that doesn't match any valid or null pointer, and then have its pointer-equality operators report "equal" whenever either operand holds that special bit pattern. No quality implementation should behave in such a fashion of course, but nothing in the Standard would forbid it.
In the absence of optimizations, it would be reasonable to assume that on any platform which uses "linear" pointers that are the same size as uintptr_t, quality compilers will process conversion of pointers to uintptr_t such that conversion of equal pointers will yield the same numeric value, and that given uintptr_t u;, if u==(uintptr_t)&someObject, then *(typeOfObject*)u may be used to access someObject, at least between the time the address of someObject was converted to a uintptr_t and the next time someObject is accessed via other means, without regard for how u came to hold its value. Unfortunately, some compilers are too primitive to recognize that conversion of an address to a uintptr_t would suggest that a pointer formed from a uintptr_t might be capable of identifying the same object.
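The one thing the Standard does promise fits in a few lines. A minimal sketch, assuming the implementation provides the optional type uintptr_t; the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Converts a void * to uintptr_t and back, then compares with the original.
   7.20.1.4 guarantees the comparison is true for any valid pointer to void.
   It guarantees nothing about the integer value in between. */
int survives_round_trip(void *p)
{
    uintptr_t as_integer = (uintptr_t)p;
    void *back = (void *)as_integer;
    return back == p;
}
```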

Is Using 'sizeof(char)' When Dynamically Allocating A 'char' Redundant?

When dynamically allocating chars, I've always done it like this:
char *pCh = malloc(NUM_CHARS * sizeof(char));
I've recently been told, however, that using sizeof(char) is redundant and unnecessary because, "by definition, the size of a char is one byte," so I should/could write the above line like this:
char *pCh = malloc(NUM_CHARS);
My understanding is the size of a char depends on the native character set that is being used on the target computer. For example, if the native character set is ASCII, a char is one byte (8 bits), and if the native character set is UNICODE a char will necessarily require more bytes (> 8 bits).
To provide maximum portability, wouldn't it be necessary to use sizeof(char), as malloc simply allocates 8-bit bytes? Am I misunderstanding malloc and sizeof(char)?
Yes, it is redundant since the language standard specifies that sizeof (char) is 1. This is because that is the unit in which things are measured, so of course the size of the unit itself must be 1.
Life becomes strange with units defined in terms of themselves, that simply doesn't make any sense. Many people seem to "want" to assume that "there are 8-bit bytes, and sizeof tells me how many such there are in a particular value". That is wrong, that's simply not how it works. It's true that there can be platforms with larger characters than 8 bits, that's why we have CHAR_BIT.
Typically you always "know" when you're allocating characters anyway, but if you really want to include sizeof, you should really consider making it use the pointer, instead:
char *pCh = malloc(NUM_CHARS * sizeof *pCh);
This "locks" the unit size of the thing being allocated to the pointer that is used to store the result of the allocation. These two types should match; if you ever see code like this:
int *numbers = malloc(42 * sizeof (float));
that is a huge warning signal; by using the pointer from the left-hand side in the sizeof you make that type of error impossible which I consider a big win:
int *numbers = malloc(42 * sizeof *numbers);
Also, it's likely that if you change the name of the pointer, the malloc() won't compile which it would if you had the name of the (wrong) basic type in there. There is a slight risk that if you forget the asterisk (and write sizeof numbers instead of sizeof *numbers) you'll not get what you want. In practice (for me) this seems to never happen, since the asterisk is pretty well established as part of this pattern, to me.
Also, this usage relies on (and emphasizes) the fact that sizeof is not a function, since no ()s are needed around the pointer de-referencing expression. This is a nice bonus, since many people seem to want to deny this. :)
I find this pattern highly satisfying and recommend it to everyone.
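Put together as a complete function, the pattern might look like the sketch below. The function name and the fill loop are just for illustration, and overflow of the size multiplication is not checked here.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates count ints and fills them with 0, 1, 2, ...
   Returns NULL if the allocation fails; the caller frees the result. */
int *make_sequence(size_t count)
{
    /* The element size comes from the pointer itself, so changing the
       type of numbers automatically changes the allocation size. */
    int *numbers = malloc(count * sizeof *numbers);
    if (numbers == NULL)
        return NULL;
    for (size_t i = 0; i < count; ++i)
        numbers[i] = (int)i;
    return numbers;
}
```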
The C99 draft standard section 6.5.3.4 The sizeof operator paragraph 3 states:
When applied to an operand that has type char, unsigned char, or signed char,
(or a qualified version thereof) the result is 1. [...]
In the C11 draft standard it is paragraph 4 but the wording is the same. So NUM_CHARS * sizeof(char) should be equivalent to NUM_CHARS.
We can see from the definition of byte in 3.6 that it is a:
addressable unit of data storage large enough to hold any member of the basic character
set of the execution environment
and Note 2 says:
A byte is composed of a contiguous sequence of bits, the number of which is implementation defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit.
The C specification states that sizeof(char) is 1, so as long as you are dealing with conforming implementations of C it is redundant.
The size unit used by malloc is the same. malloc(120) allocates space for 120 char.
A char must be at least 8 bits, but may be larger.
sizeof(char) will always return 1, so it doesn't matter whether you use it or not; it will not change. You may be confusing this with UNICODE wide characters, which often occupy more than one byte, but they have a different type, wchar_t, so you should use sizeof in that case.
If you are working on a system where a byte is defined to have 16 bits, then sizeof(char) would still return 1 as this is what the underlying architecture would allocate. 1 Byte with 16 bits.
Allocation sizes are always measured in units of char, which has size 1 by definition. If you are on a 9-bit machine, malloc understands its argument as a number of 9-bit bytes.
sizeof(char) is always 1, but not because char is always one byte (it needn't be), but rather because the sizeof operator returns the object/type size in units of char.
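To show the case where the size factor does matter, here is a small sketch that copies a wide string. For char the factor would be 1; for wchar_t it is whatever the implementation chose. The function name is made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Returns a heap copy of a wide string, or NULL if allocation fails.
   The caller frees the result. */
wchar_t *copy_wide(const wchar_t *text)
{
    size_t length = wcslen(text) + 1;              /* include the terminator */
    wchar_t *copy = malloc(length * sizeof *copy); /* sizeof *copy may exceed 1 */
    if (copy != NULL)
        wmemcpy(copy, text, length);
    return copy;
}
```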

Is pointer tagging in C undefined according to the standard?

Some dynamically-typed languages use pointer tagging as a quick way to identify or narrow down the runtime type of the value being represented. A classic way to do this is to convert pointers to a suitably sized integer, and add a tag value over the least significant bits which are assumed to be zero for aligned objects. When the object needs to be accessed, the tag bits are masked away, the integer is converted to a pointer, and the pointer is dereferenced as normal.
This by itself is all in order, except it all hinges on one colossal assumption: that the aligned pointer will convert to an integer guaranteed to have zero bits in the right places.
Is it possible to guarantee this according to the letter of the standard?
Although standard section 6.3.2.3 (references are to the C11 draft) says that the result of a conversion from pointer to integer is implementation-defined, what I'm wondering is whether the pointer arithmetic rules in 6.5.2.1 and 6.5.6 effectively constrain the result of pointer->integer conversion to follow the same predictable arithmetic rules that many programs already assume. (6.3.2.3 note 67 seemingly suggests that this is the intended spirit of the standard anyway, not that that means much.)
I'm specifically thinking of the case where one might allocate a large array to act as a heap for the dynamic language, and therefore the pointers we're talking about are to elements of this array. I'm assuming that the start of the C-allocated array itself can be placed at an aligned position by some secondary means (by all means discuss this too though). Say we have an array of eight-byte "cons cells"; can we guarantee that the pointer to any given cell will convert to an integer with the lowest three bits free for a tag?
For instance:
typedef Cell ...; // such that sizeof(Cell) == 8
Cell heap[1024]; // such that ((uintptr_t)&heap[0]) & 7 == 0
((char *)&heap[11]) - ((char *)&heap[10]); // == 8
(Cell *)(((char *)&heap[10]) + 8); // == &heap[11]
&(&heap[10])[0]; // == &heap[10]
0[heap]; // == heap[0]
// So...
&((char *)0)[(uintptr_t)&heap[10]]; // == &heap[10] ?
&((char *)0)[(uintptr_t)&heap[10] + 8]; // == &heap[11] ?
// ...implies?
(Cell *)((uintptr_t)&heap[10] + 8); // == &heap[11] ?
(If I understand correctly, if an implementation provides uintptr_t then the undefined behaviour hinted at in 6.3.2.3 paragraph 6 is irrelevant, right?)
If all of these hold, then I would assume that it means that you can in fact rely on the low bits of any converted pointer to an element of an aligned Cell array to be free for tagging. Do they && does it?
(As far as I'm aware this question is hypothetical since the normal assumption holds for common platforms anyway, and if you found one where it didn't, you probably wouldn't want to look to the C standard for guidance rather than the platform docs; but that's beside the point.)
This by itself is all in order, except it all hinges on one colossal
assumption: that the aligned pointer will convert to an integer
guaranteed to have zero bits in the right places.
Is it possible to guarantee this according to the letter of the
standard?
It's possible for an implementation to guarantee this. The result of converting a pointer to an integer is implementation-defined, and an implementation can define it any way it likes, as long as it meets the standard's requirements.
The standard absolutely does not guarantee this in general.
A concrete example: I've worked on a Cray T90 system, which had a C compiler running under a UNIX-like operating system. In the hardware, an address is a 64-bit word containing the address of a 64-bit word; there were no hardware byte addresses. Byte pointers (void*, char*) were implemented in software by storing a 3-bit offset in the otherwise unused high-order 3 bits of a 64-bit word pointer.
All pointer-to-pointer, pointer-to-integer, and integer-to-pointer conversions simply copied the representation.
Which means that a pointer to an 8-byte aligned object, when converted to an integer, could have any bit pattern in its low-order 3 bits.
Nothing in the standard forbids this.
The bottom line: A scheme like the one you describe, that plays games with pointer representations, can work if you make certain assumptions about how the current system represents pointers -- as long as those assumptions happen to be valid for the current system.
But no such assumptions can be 100% reliable, because the standard says nothing about how pointers are represented (other than that they're of a fixed size for each pointer type, and that the representation can be viewed as an array of unsigned char).
(The standard doesn't even guarantee that all pointers are the same size.)
You're right about the relevant parts of the standard. For reference:
An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.
Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.
Since the conversions are implementation defined (except when the integer type is too small, in which case it's undefined), there's nothing the standard is going to tell you about this behaviour. If your implementation makes the guarantees you want, you're set. Otherwise, too bad.
I guess the answer to your explicit question:
Is it possible to guarantee this according to the letter of the standard?
Is "yes", since the standard punts on this behaviour and says the implementation has to define it. Arguably, "no" is just as good an answer for the same reason.
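For concreteness, a minimal tagging sketch that makes the assumption explicit instead of silently relying on it: it checks at run time that the converted pointer really has its low three bits clear. It assumes uintptr_t exists and that masking the integer and converting back yields a usable pointer, which is exactly the implementation-defined part discussed above. All names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_MASK ((uintptr_t)7)  /* low three bits: an assumption, not a guarantee */

/* Returns 1 if this implementation happens to convert p to an integer
   whose low three bits are zero. */
int low_bits_free(const void *p)
{
    return ((uintptr_t)p & TAG_MASK) == 0;
}

/* Stores a small tag in the low bits of the converted pointer. */
uintptr_t tag_pointer(void *p, unsigned tag)
{
    assert(low_bits_free(p) && tag <= TAG_MASK);
    return (uintptr_t)p | tag;
}

/* Extracts the tag again. */
unsigned tag_of(uintptr_t tagged)
{
    return (unsigned)(tagged & TAG_MASK);
}

/* Clears the tag bits and converts back to a pointer. */
void *untag_pointer(uintptr_t tagged)
{
    return (void *)(tagged & ~TAG_MASK);
}
```

On a platform like the Cray described above, low_bits_free could return 0 even for 8-byte-aligned objects, and the scheme would have to be adapted to that platform's pointer format.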

How are pointers stored in memory?

I'm a little confused about this.
On my system, if I do this:
printf("%zu", sizeof(int*));
this will just yield 4. Now, the same happens for sizeof(int). Conclusion: if both integers and pointers are 4 bytes, a pointer can be safely "converted" to an int
(i.e. the memory it points to could be stored in an int). However, if I do this:
int* x;
printf("%p", x);
The returned hex address is far beyond the int scope, and thus any attempt to store the value in an int fails obviously.
How is this possible? If the pointer takes 4 bytes of memory, how can it store more than 2^32?
EDIT:
As suggested by a few users, I'm posting the code and the output:
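#include <stdio.h>
int main(void)
{
printf("%zu\n", sizeof(int)); // sizeof yields a size_t, so use %zu
printf("%zu\n", sizeof(int*));
int *x; // never initialized, so the value printed below is indeterminate
printf("%zu\n", sizeof(x));
printf("%p\n", (void *)x); // %p expects a void *
}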
The output:
4
4
4
0xb7778000
C11, 6.3.2.3, paragraphs 5 and 6:
An integer may be converted to any pointer type. Except as previously specified, the
result is implementation-defined, might not be correctly aligned, might not point to an
entity of the referenced type, and might be a trap representation.
Any pointer type may be converted to an integer type. Except as previously specified, the
result is implementation-defined. If the result cannot be represented in the integer type,
the behavior is undefined. The result need not be in the range of values of any integer
type.
So the conversions are allowed, but the result is implementation defined (or undefined if the result cannot be stored in an integer type). (The "previously specified" is referring to NULL.)
In regards to your print statement for a pointer printing something larger than what 4 bytes of data can represent, this is not true, as 0xb7778000 is within range of a 32 bit integral type.
The returned hex address is far beyond the int scope, and thus any attempt to store the value in an int fails obviously.
4
4
4
0xb7778000
And 0xb7778000 is a 32-bit value, so an object of 4 bytes can hold it.
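The likely source of the confusion can be checked with two range tests: the printed value fits in 32 bits when read as unsigned, but it is larger than the biggest value of a signed 32-bit int. A minimal sketch, assuming the fixed-width types from stdint.h are available; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if value is representable in an unsigned 32-bit object.
   The parameter is wider so that out-of-range inputs can be tested. */
int fits_unsigned_32(uint64_t value)
{
    return value <= UINT32_MAX;
}

/* Returns 1 if value is representable in a signed 32-bit object. */
int fits_signed_32(uint64_t value)
{
    return value <= (uint64_t)INT32_MAX;
}
```

For 0xb7778000 the first test succeeds and the second fails, which matches both observations: four bytes are enough to hold it, yet it is outside the positive range of a 32-bit int.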
No, they cannot be "safely" converted. Certainly they use the same amount of storage space, but there is no guarantee that they interpret a number of set bits in the same manner.
As for the second question (and one question per question please), there is no guaranteed size for int, or for a pointer. An int is roughly the optimum size of data transfer on the bus (also known as a word). It can differ on different platforms, but must be at least as large as a short or char. This is why there is a standard definition for INT_MAX, but not a standard "value" for the definition.
A pointer is roughly as many bits wide as necessary to access a memory location. The original PC's 8088 had an 8-bit data bus and 16-bit registers, but formed 20-bit addresses (by shifting a segment value and adding an offset) to extend its memory range past its register size.

Resources