This is taken from C, and is based on that.
Let's imagine we have a 32 bit pointer
char* charPointer;
It points to some place in memory that contains some data. It knows that incrementing this pointer moves it in steps of 1 byte.
On the other hand,
int* intPointer;
also points to some place in memory, and it knows that adding 1 to it should move it up by 4 bytes.
The question is: how are we able to address the full 32 bits of addressable space (2^32 bytes, or 4 gigabytes) with those pointers, if they obviously contain some information that allows them to be told apart, for example char* versus int*? That would leave us with not 32 bits, but fewer.
While typing this question I started thinking: maybe it is all syntactic sugar, and really only for the compiler? Maybe a raw pointer is just 32 bits and doesn't care about the type? Is that the case?
You might be confused by compile time versus run time.
During compilation, gcc (or any C compiler) knows the type of a pointer, and in particular the type of the data pointed to by that pointer variable. So gcc can emit the right machine code. An increment of an int * variable (on a 32-bit machine with 32-bit int) is translated to an increment of 4 (bytes), while an increment of a char * variable is translated to an increment of 1.
During runtime, the compiled executable (it does not care or need gcc) is only dealing with machine pointers, usually addresses of bytes (or of the start of some word).
Types (in C programs) are not known during runtime.
Some other languages (Lisp, Python, Javascript, ....) require the types to be known at runtime. In recent C++ (but not C) some objects (those having virtual functions) may have RTTI.
It is indeed syntactic sugar. Consider the following code fragment:
int t[2];
int a = t[1];
The second line is equivalent to:
int a = *(t + 1); // pointer addition
which itself is equivalent to:
int a = *(int*)((char*)t + 1 * sizeof(int)); // integer addition
After the compiler has checked the types it drops the casts and works only with addresses, lengths and integer addition.
Yes. Raw pointer is 32 bits of data (or 16 or 64 bits, depending on architecture), and does not contain anything else. Whether it's int *, char *, struct sockaddr_in * is just information for compiler, to know what is the number to actually add when incrementing, and for the type it's going to have when you dereference it.
Your hypothesis is correct: to see how different kinds of pointer are handled, try running this program:
#include <stdio.h>
int main(void)
{
char * pc = 0;
int * pi = 0;
printf("%p\n", (void *)(pc + 1)); /* %p expects a void * argument */
printf("%p\n", (void *)(pi + 1));
return 0;
}
You will note that adding one to a char* increased its numeric value by 1, while doing the same to the int* increased by 4 (which is the size of an int on my machine).
It's exactly as you say in the end - types in C are just a compile-time concept that tells to the compiler how to generate the code for the various operations you can perform on variables.
In the end pointers just boil down to the address they point to, the semantic information doesn't exist anymore once the code is compiled.
Incrementing an int* pointer is different from incrementing a char* solely because the pointer variable is declared as int*. You can cast an int* to char* and then it will increment by 1 byte.
So, yes, it is all just syntactic sugar. It makes some kinds of array processing easier and confuses void* users.
I've been doing some pointers testing in C, and I was just curious if the addresses of a function's parameters are always in a difference of 4 bytes from one another.
I've tried to run the following code:
#include <stdio.h>
void func(long a, long b);
int main(void)
{
func(1, 2);
getchar();
return 0;
}
void func(long a, long b)
{
printf("%d\n", (int)&b - (int)&a);
}
This code seems to always print 4, no matter what the type of func's parameters is.
I was just wondering if it's ALWAYS 4, because if so it can be useful for something I'm trying to do (but if it isn't necessarily 4 I guess I could just use va_list for my function or something).
So: Is it necessarily 4 bytes?
Absolutely not, in so many ways that it would be hard to count them all.
First and foremost, the memory layout of arguments is simply not specified by the C language. Full stop. It is not specified. Thus the answer is "no" immediately.
va_list exists because there was a need to navigate a list of variadic arguments, and nothing else about their layout was specified. va_list is intentionally very limited, so that it works on platforms where the shape of the stack does not match your intuition.
Other reasons it can't always be 4:
What if you pass an object of length 8?
What if the compiler optimizes a reference to actually point at the object in another frame?
What if the compiler adds padding, perhaps to align a 64-bit number on a 64-bit boundary?
What if the stack is built in the opposite direction (such that the difference would be -4 instead of +4)?
The list goes on and on. C does not specify the relative addresses between arguments.
As the other answers correctly say:
No.
Furthermore, even trying to determine whether the addresses differ by 4 bytes, depending on how you do it, probably has undefined behavior, which means the C standard says nothing about what your program does.
void func(long a, long b)
{
printf("%d\n", (int)&b - (int)&a);
}
&a and &b are expressions of type long*. Converting a pointer to int is legal, but the result is implementation-defined, and "If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type."
It's very likely that pointers are 64 bits and int is 32 bits, so the conversions could lose information.
Most likely the conversions will give you values of type int, but they don't necessarily have any meaning, nor does their difference.
Now you can subtract pointer values directly, with a result of the signed integer type ptrdiff_t (which, unlike int, is probably big enough to hold the result).
printf("%td\n", &b - &a);
But "When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object; the result is the difference of the subscripts of the two array elements." Pointers to distinct objects cannot be meaningfully compared or subtracted.
Having said all that, it's likely that the implementation you're using has a memory model that's reasonably straightforward, and that pointer values are in effect represented as indices into a monolithic memory space. Comparing &b vs. &a is not permitted by the C language, but examining the values can provide some insight about what's going on behind the curtain -- which can be especially useful if you're tracking down a bug.
Here's something you can do portably to examine the addresses:
printf("&a = %p\n", (void*)&a);
printf("&b = %p\n", (void*)&b);
The result you're seeing for the subtraction (4) suggests that type long is probably 4 bytes (32 bits) on your system. I'd guess you're on Windows. It also suggests something about the way function parameters are allocated -- something that you as a programmer should almost never have to care about, but is worth understanding anyway.
[...] I was just curious if the addresses of a function's parameters are always in a difference of 4 bytes from one another.
The greatest error in your reasoning is to think that the parameters exist in memory at all.
I am running this program on x86-64:
#include <stdio.h>
#include <stdint.h>
void func(long a, long b)
{
printf("%d\n", (int)((intptr_t)&b - (intptr_t)&a));
}
int main(void)
{
func(1, 2);
}
When I compile it with gcc -O3, it prints 8, proving that your guess is absolutely wrong. Except... when I compile it without optimization, it prints -8.
X86-64 SYSV calling convention says that the arguments are passed in registers instead of being passed in memory. a and b do not have an address until you take their address with & - then the compiler is caught with its pants down from cheating the as-if rule and it quickly pulls up its pants and stuffs them into some memory location so that they can have their address taken, but it is in no way consistent in where they're stored.
As asked in "How does pointer incrementation work?", I have a follow-up question.
How does a pointer know the underlying size of the data it points to? Do pointers store a size of the underlying type so they can know how to increment?
I'd expect that the following code would move a pointer forward one byte:
int intarr[] = { ... };
int *intptr = intarr;
intptr = intptr + 1;
printf("intarr[1] = %d\n", *intptr);
According to the accepted answer on the linked site, having a pointer increment by bytes and not by the underlying sizeof the pointed element would cause mass hysteria, confusion, and chaos.
While I understand that this would probably be an inevitable outcome, I still don't understand how pointers work in this regard. Couldn't I declare a void pointer to some struct[] type array, and if I did so, how would the void pointer know to increment by sizeof(struct mytype)?
Edit: I believe that I've worked most of the difficulties out that I'm having, but I'm not quite there as far as demonstrating it in code.
See here: http://codepad.org/0d8veP4K
#include <stdio.h>
int main(int argc, char *argv[])
{
int intarr[] = { 0, 5, 10 };
int *intptr = intarr;
// get the value where the pointer points
printf("intptr(%p): %d\n", (void *)intptr, *intptr);
printf("intptr(%p): %d\n", (void *)(intptr + 1), *(intptr + 1));
printf("intptr(%p): %d\n", (void *)(intptr + 2), *(intptr + 2));
// the difference between the pointer value should be same as sizeof(int)
printf("intptr[0]: %p | intptr[1]: %p | difference: %td | expected: %zu",
(void *)intptr, (void *)(intptr + 1), (intptr + 1) - intptr, sizeof(int));
return 0;
}
It is in the type declaration. p1 knows the size of the type because it is sizeof(*p1) or sizeof(int). p2 does not know as sizeof(void) is not defined.
int *p1;
void *p2;
p1++; // OK
p2++; // Not valid in standard C: sizeof(void) is not defined
Do pointers store a size of the underlying type so they can know how to increment?
This question suggests that type information needs to be kept with the object at runtime to make correct decisions on how to perform the correct operations for the type. That's not true. Type information becomes part of the code.
It may be easier to understand if we add a third type into the mix: floating point.
Consider this sample program:
int a,b,c;
float x,y,z;
void f(void)
{
c = a+b*3;
z = x+y*3;
}
(I ask you to think about the float vs. int case first not because it's simpler but because it's more complex. The extra complexity prevents you from taking shortcuts that are tempting but wrong.)
The compiler must translate f into some assembly code that performs two different kinds of addition and multiplication. Although the same operators (+ and *) appear twice in the C code, the assembly code won't look so symmetric. The first half will use the processor's integer registers, integer addition instruction, and integer multiplication instruction, and the second half will use floating point registers, floating point addition, and floating point multiplication. Even the constant 3 will be represented differently in the two places it appears.
At the assembly level, the memory where a, b, c, x, y, and z are stored doesn't need to be tagged because the type information is implicit in the instructions that access that memory. The loads and stores of the integer registers will only be targeted at the memory locations holding a, b, and c.
The C arithmetic operators are overloaded. When translating from a language with an overloaded operator to a language without a corresponding overloaded operator, the type information from the first language becomes part of the name of the operator in the second language. ("Name mangling" when translating from C++ to C is the same thing happening at another level. You could say that assembly language "ADD" (integer) and "FADD" (floating point) instructions are name-mangled + operators.)
Now, about pointer arithmetic. Pointers are just another type to overload. If the expression a=a+1 can generate two different varieties of assembly code depending on whether a is int or float, why not a third variety when a is int *, another when a is struct tm *, and so on?
In the C code, type information is contained in the variable declarations. In the compiler's intermediate representation, the type of every expression is known. In the compiler's output, the necessary pieces of type information are implicit in the machine instructions.
Kind of a crude answer, but it's worth noting at the machine level that data types, as we know them in C, don't exist. We might have arithmetical instructions that operate on integers stored in some general-purpose register, e.g., but there's nothing stored to identify that the contents of some register is actually an int. All the machine sees is a bunch of bits and bytes in various types of memory.
So you might even wonder how it's possible for a compiler to know how to do this:
int z = x + y;
How can it know to do an integer addition here if there's nothing stored when the program is running to identify that the memory regions storing the contents of x and y and z are ints?
And the short/crude answer is that the machine doesn't know once the program is running. Yet it had this information available when it generated the instructions that would be used to run the program.
It's the same case with pointers:
int intarr[] = { ... };
int *intptr = intarr;
Doing something like intptr + 1 here can be done to increment the pointer address by sizeof(int). The compiler knows to do this based on the information provided by you, the programmer, in this C code. If you did this instead:
int intarr[] = { ... };
void *voidptr = intarr;
... then trying to perform any arithmetic on voidptr would result in an error, since we aren't giving the information necessary for the compiler to know what machine instructions to generate.
Couldn't I declare a void pointer to some struct[] type array, and if
I did so, how would the void pointer know to increment by
sizeof(struct mytype)?
It can't. The void pointer would equate to a loss of compile-time information that would prevent the compiler from being able to generate the appropriate instructions. If you don't provide the info, the compiler doesn't know how to do the pointer arithmetic. And this is why functions which accept a void pointer like memcpy need a byte size to be specified. The pointee contents don't provide that kind of info, only the programmer can provide it since this kind of information is not stored in the memory used by the program at runtime.
In your example:
sizeof(pointer) is 4 bytes
sizeof(int) is 4 bytes
and your program's output is:
intptr(0xffcbf5dc): 0
intptr(0xffcbf5e0): 5
intptr(0xffcbf5e4): 10
intptr[0]: 0xffcbf5dc | intptr[1]: 0xffcbf5e0 | difference: 1 | expected: 4
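If you subtract the raw addresses, 0xffcbf5e0 - 0xffcbf5dc = 4 (hex subtraction), and this is sizeof(int). The pointer difference prints 1 because pointer subtraction counts elements, not bytes.
About void*: you can use a void* to hold the address, but you must convert it to a typed pointer before doing arithmetic on it.
About your structure: you can use sizeof(yourStructure) to get its size.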
With a 32-bit OS, we know that the pointer size is 4 bytes, so sizeof(char*) is 4 and sizeof(int*) is 4, etc. We also know that when you increment a char*, the byte address (offset) changes by sizeof(char); when you increment an int*, the byte address changes by sizeof(int).
My question is:
How does the OS know how much to increment the byte address for sizeof(YourType)?
The compiler only knows how to increment a pointer of type YourType * if it knows the size of YourType, which is the case if and only if the complete definition of YourType is known to the compiler at this point.
For example, if we have:
struct YourType *a;
struct YourOtherType *b;
struct YourType {
int x;
char y;
};
Then you are allowed to do this:
a++;
but you are not allowed to do this:
b++;
...since struct YourType is a complete type, but struct YourOtherType is an incomplete type.
The error given by gcc for the line b++; is:
error: arithmetic on pointer to an incomplete type
The OS doesn't really have anything to do with that; it's the compiler's job (as @zneak mentioned).
The compiler knows because it just compiled that struct or class. In the struct case, the size is pretty much the sum of the sizes of all the struct's members, plus any padding needed for alignment.
It is primarily an issue for the C (or C++) compiler, and not primarily an issue for the OS per se.
The compiler knows its alignment rules for the basic types, and applies those rules to any type you create. It can therefore establish the alignment requirement and size of YourType, and it will ensure that it increments any YourType* variable by the correct value. The alignment rules vary by hardware (CPU), and the compiler is responsible for knowing which rules to apply.
One key point is that the size of YourType must be such that when you have an array:
YourType array[20];
then &array[1] == &array[0] + 1. The byte address of &array[1] must be incremented by sizeof(YourType), and (assuming YourType is a structure), each of the elements of array[1] must be properly aligned, just as the elements of array[0] must be properly aligned.
Also remember types are defined in your compiled code to match the hardware you are working on. It is entirely up to the source code that is used to work this out.
So a C program targeting a low-end 16-bit chipset might need to define its types differently from one targeting a 32-bit system.
The programming language and compiler are what govern your types. Not the OS or hardware.
Although of course trying to stick a 32 bit number into a 16 bit register could be a problem!
C pointers are typed, unlike in some old languages like PL/1. This not only allows the size of the object to be known, but also lets widening operations and formatting be carried out. For example, when getting the data at *p, is that a float, a double, or a char? The compiler needs to know (think of divisions, for example).
Of course we do have a typeless pointer, a void *, which you cannot do any arithmetic with simply because the compiler has no idea how much to add to the address.
The problem is simple. As I understand, GCC maintains that chars will be byte-aligned and ints 4-byte-aligned in a 32-bit environment. I am also aware of C99 standard 6.3.2.3 which says that casting between misaligned pointer-types results in undefined operations. What do the other standards of C say about this? There are also many experienced coders here - any view on this will be appreciated.
int *iptr1, *iptr2;
char *cptr1, *cptr2;
iptr1 = (int *) cptr1;
cptr2 = (char *) iptr2;
There is only one standard for C (the one by ISO), with two versions (1989 and 1999), plus some pretty minor revisions. All versions and revisions agree on the following:
all data memory is byte-addressable, and chars are bytes
thus a char* will be able to address any data
void* is the same as char* except conversions to and from it do not require type casts
converting from int* to char* always works, as does converting back to int*
converting an arbitrary char* to int* is not guaranteed to work
The reasons char pointers are guaranteed to work like this is so that you can, for example, copy integers from anywhere in memory to elsewhere in memory or disk, and back, which turns out to be a pretty useful thing to do in low-level programming, e.g., graphics libraries.
CPUs can be big-endian or little-endian, so the result depends on the platform.
For example, reading the int value 0x01234567 through a char pointer could give 0x01 or 0x67 for the first byte.
You can try doing:
iptr1 = atoi(cptr1); // val now = pointed by cptr1
cptr2 = atoi(iptr2); // val now = pointed by iptr2
This worked for me in DevCpp!
I would like to know architectures which violate the assumptions I've listed below. Also, I would like to know if any of the assumptions are false for all architectures (that is, if any of them are just completely wrong).
sizeof(int *) == sizeof(char *) == sizeof(void *) == sizeof(func_ptr *)
The in-memory representation of all pointers for a given architecture is the same regardless of the data type pointed to.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture.
Multiplication and division of pointer data types are only forbidden by the compiler. NOTE: Yes, I know this is nonsensical. What I mean is - is there hardware support to forbid this incorrect usage?
All pointer values can be casted to a single integer. In other words, what architectures still make use of segments and offsets?
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer. If p is an int32* then p+1 is equal to the memory address 4 bytes after p.
I'm most used to pointers being used in a contiguous, virtual memory space. For that usage, I can generally get by thinking of them as addresses on a number line. See Stack Overflow question Pointer comparison.
I can't give you concrete examples of all of these, but I'll do my best.
sizeof(int *) == sizeof(char *) == sizeof(void *) == sizeof(func_ptr *)
I don't know of any systems where I know this to be false, but consider:
Mobile devices often have some amount of read-only memory in which program code and such is stored. Read-only values (const variables) may conceivably be stored in read-only memory. And since the ROM address space may be smaller than the normal RAM address space, the pointer size may be different as well. Likewise, pointers to functions may have a different size, as they may point to this read-only memory into which the program is loaded, and which can otherwise not be modified (so your data can't be stored in it).
So I don't know of any platforms on which I've observed that the above doesn't hold, but I can imagine systems where it might be the case.
The in-memory representation of all pointers for a given architecture is the same regardless of the data type pointed to.
Think of member pointers vs regular pointers. They don't have the same representation (or size). A member pointer consists of a this pointer and an offset.
And as above, it is conceivable that some CPU's would load constant data into a separate area of memory, which used a separate pointer format.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture.
Depends on how that bit length is defined. :)
An int on many 64-bit platforms is still 32 bits. But a pointer is 64 bits.
As already said, CPU's with a segmented memory model will have pointers consisting of a pair of numbers. Likewise, member pointers consist of a pair of numbers.
Multiplication and division of pointer data types are only forbidden by the compiler.
Ultimately, pointer data types only exist in the compiler. What the CPU works with is not pointers, but integers and memory addresses. So there is nowhere else where these operations on pointer types could be forbidden. You might as well ask for the CPU to forbid concatenation of C++ string objects. It can't do that because the C++ string type only exists in the C++ language, not in the generated machine code.
However, to answer what you mean, look up the Motorola 68000 CPUs. I believe they have separate registers for integers and memory addresses. Which means that they can easily forbid such nonsensical operations.
All pointer values can be casted to a single integer.
You're safe there. The C and C++ standards guarantee that this is always possible, no matter the memory space layout, CPU architecture and anything else. Specifically, they guarantee an implementation-defined mapping. In other words, you can always convert a pointer to an integer, and then convert that integer back to get the original pointer. But the C/C++ languages say nothing about what the intermediate integer value should be. That is up to the individual compiler, and the hardware it targets.
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer.
Again, this is guaranteed. If you consider that conceptually, a pointer does not point to an address, it points to an object, then this makes perfect sense. Adding one to the pointer will then obviously make it point to the next object. If an object is 20 bytes long, then incrementing the pointer will move it 20 bytes, so that it moves to the next object.
If a pointer was merely a memory address in a linear address space, if it was basically an integer, then incrementing it would add 1 to the address -- that is, it would move to the next byte.
Finally, as I mentioned in a comment to your question, keep in mind that C++ is just a language. It doesn't care which architecture it is compiled to. Many of these limitations may seem obscure on modern CPU's. But what if you're targeting yesteryear's CPU's? What if you're targeting the next decade's CPU's? You don't even know how they'll work, so you can't assume much about them. What if you're targeting a virtual machine? Compilers already exist which generate bytecode for Flash, ready to run from a website. What if you want to compile your C++ to Python source code?
Staying within the rules specified in the standard guarantees that your code will work in all these cases.
I don't have specific real world examples in mind but the "authority" is the C standard. If something is not required by the standard, you can build a conforming implementation that intentionally fails to comply with any other assumptions. Some of these assumptions are true most of the time just because it's convenient to implement a pointer as an integer representing a memory address that can be directly fetched by the processor, but this is just a consequence of "convenience" and can't be held as a universal truth.
Not required by the standard (see this question). For instance, sizeof(int*) can be unequal to sizeof(double*). void* is guaranteed to be able to store any pointer value.
Not required by the standard. By definition, size is a part of representation. If the size can be different, the representation can be different too.
Not necessarily. In fact, "the bit length of an architecture" is a vague statement. What is a 64-bit processor, really? Is it the address bus? Size of registers? Data bus? What?
It doesn't make sense to "multiply" or "divide" a pointer. It's forbidden by the compiler but you can of course multiply or divide the underlying representation (which doesn't really make sense to me) and that results in undefined behavior.
Maybe I don't understand your point but everything in a digital computer is just some kind of binary number.
Yes; kind of. It's guaranteed to point to a location that's a sizeof(pointer_type) farther. It's not necessarily equivalent to arithmetic addition of a number (i.e. farther is a logical concept here. The actual representation is architecture specific)
For 6.: a pointer is not necessarily a memory address. See for example "The Great Pointer Conspiracy" by Stack Overflow user jalf:
Yes, I used the word “address” in the comment above. It is important to realize what I mean by this. I do not mean “the memory address at which the data is physically stored”, but simply an abstract “whatever we need in order to locate the value”. The address of i might be anything, but once we have it, we can always find and modify i.
And:
A pointer is not a memory address! I mentioned this above, but let’s say it again. Pointers are typically implemented by the compiler simply as memory addresses, yes, but they don’t have to be.
Some further information about pointers from the C99 standard:
6.2.5 §27 guarantees that void* and char* have identical representations, i.e. they can be used interchangeably without conversion, i.e. the same address is denoted by the same bit pattern (which doesn't have to be true for other pointer types)
6.3.2.3 §1 states that any pointer to an incomplete or object type can be cast to (and from) void* and back again and still be valid; this doesn't include function pointers!
6.3.2.3 §6 states that void* can be cast to (and from) integers and 7.18.1.4 §1 provides appropriate types intptr_t and uintptr_t; the problem: these types are optional - the standard explicitly mentions that there need not be an integer type large enough to actually hold the value of the pointer!
sizeof(char*) != sizeof(void(*)(void)) ? - Not on x86 in 36 bit addressing mode (supported on pretty much every Intel CPU since Pentium 1)
"The in-memory representation of a pointer is the same as an integer of the same bit length" - there's no in-memory representation on any modern architecture; tagged memory has never caught on and was already obsolete before C was standardized. Memory in fact doesn't even hold integers, just bits and arguably words (not bytes; most physical memory doesn't allow you to read just 8 bits.)
"Multiplication of pointers is impossible" - 68000 family; address registers (the ones holding pointers) didn't support that IIRC.
"All pointers can be cast to integers" - Not on PICs.
"Incrementing a T* is equivalent to adding sizeof(T) to the memory address" - true by definition. Also equivalent to &pointer[1].
I don't know about the others, but for DOS, the assumption in #3 is untrue. DOS is 16 bit and uses various tricks to map many more than 16 bits worth of memory.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture.
I think this assumption is false because on the 80186, for example, a 32-bit pointer is held in two registers (an offset register and a segment register), and which half-word went in which register matters during access.
Multiplication and division of pointer data types are only forbidden by the compiler.
You can't multiply or divide types. ;P
I'm unsure why you would want to multiply or divide a pointer.
All pointer values can be casted to a single integer. In other words, what architectures still make use of segments and offsets?
The C99 standard allows pointers to be stored in intptr_t, which is an integer type. So, yes.
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer. If p is an int32* then p+1 is equal to the memory address 4 bytes after p.
x + y where x is a T * and y is an integer is equivalent to (T *)((intptr_t)x + y * sizeof(T)) as far as I know. Alignment may be an issue, but padding may be provided in the sizeof. I'm not really sure.
In general, the answer to all of the questions is "yes", and it's because only those machines that implement popular languages directly saw the light of day and persisted into the current century. Although the language standards reserve the right to vary these "invariants", or assertions, it hasn't ever happened in real products, with the possible exception of items 3 and 4 which require some restatement to be universally true.
It's certainly possible to build segmented MMU designs, which correspond roughly with the capability-based architectures that were popular academically in past years, but no such system has typically seen common use with such features enabled. Such a system might have conflicted with the assertions as it would probably have had large pointers.
In addition to segmented/capability MMUs, which often have large pointers, more extreme designs have tried to encode data types in pointers. Few of these were ever built. (This question brings up all of the alternatives to the basic word-oriented, a pointer-is-a-word architectures.)
Specifically:
The in-memory representation of all pointers for a given architecture is the same regardless of the data type pointed to. True except for extremely wacky past designs that tried to implement protection not in strongly-typed languages but in hardware.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture. Maybe, certainly some sort of integral type is the same, see LP64 vs LLP64.
Multiplication and division of pointer data types are only forbidden by the compiler. Right.
All pointer values can be casted to a single integer. In other words, what architectures still make use of segments and offsets? Nothing uses segments and offsets today, but a C int is often not big enough, you may need a long or long long to hold a pointer.
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer. If p is an int32* then p+1 is equal to the memory address 4 bytes after p. Yes.
It is interesting to note that every Intel Architecture CPU, i.e., every single PeeCee, contains an elaborate segmentation unit of epic, legendary, complexity. However, it is effectively disabled. Whenever a PC OS boots up, it sets the segment bases to 0 and the segment lengths to ~0, nulling out the segments and giving a flat memory model.
There were lots of "word addressed" architectures in the 1950s, 1960s and 1970s. But I cannot recall any mainstream examples that had a C compiler. I recall the ICL / Three Rivers PERQ machines in the 1980s that were word addressed and had a writable control store (microcode). One of its instantiations had a C compiler and a flavor of Unix called PNX, but the C compiler required special microcode.
The basic problem is that char* types on word addressed machines are awkward, however you implement them. You often end up with sizeof(int *) != sizeof(char *) ...
Interestingly, before C there was a language called BCPL in which the basic pointer type was a word address; that is, incrementing a pointer gave you the address of the next word, and ptr!1 gave you the word at ptr + 1. There was a different operator for addressing a byte: ptr%42 if I recall.
EDIT: Don't answer questions when your blood sugar is low. Your brain (certainly, mine) doesn't work as you expect. :-(
Minor nitpick:
p is an int32* then p+1
is wrong, it needs to be unsigned int32, otherwise it will wrap at 2GB.
Interesting oddity - I got this from the author of the C compiler for the Transputer chip - he told me that for that compiler, NULL was defined as -2GB. Why? Because the Transputer had a signed address range: -2GB to +2GB. Can you believe that? Amazing, isn't it?
I've since met various people that have told me that defining NULL like that is broken. I agree, but if you don't, you end up with NULL pointers being in the middle of your address range.
I think most of us can be glad we're not working on Transputers!
I would like to know architectures which violate the assumptions I've
listed below.
I see that Stephen C mentioned PERQ machines, and MSalters mentioned 68000s and PICs.
I'm disappointed that no one else actually answered the question by naming any of the weird and wonderful architectures that have standards-compliant C compilers that don't fit certain unwarranted assumptions.
sizeof(int *) == sizeof(char *) == sizeof(void *) == sizeof(func_ptr
*) ?
Not necessarily. Some examples:
Most compilers for Harvard-architecture 8-bit processors -- PIC and 8051 and M8C -- make sizeof(int *) == sizeof(char *),
but different from the sizeof(func_ptr *).
Some of the very small chips in those families have 256 bytes of RAM (or less) but several kilobytes of PROGMEM (Flash or ROM), so compilers often make sizeof(int *) == sizeof(char *) equal to 1 (a single 8-bit byte), but sizeof(func_ptr *) equal to 2 (two 8-bit bytes).
Compilers for many of the larger chips in those families with a few kilobytes of RAM and 128 or so kilobytes of PROGMEM make sizeof(int *) == sizeof(char *) equal to 2 (two 8-bit bytes), but sizeof(func_ptr *) equal to 3 (three 8-bit bytes).
A few Harvard-architecture chips can store exactly a full 2^16 ("64KByte") of PROGMEM (Flash or ROM), and another 2^16 ("64KByte") of RAM + memory-mapped I/O.
The compilers for such a chip make sizeof(func_ptr *) always be 2 (two bytes);
but often have a way to make the other kinds of pointers sizeof(int *) == sizeof(char *) == sizeof(void *) into a "long ptr" 3-byte generic pointer that has the extra magic bit that indicates whether that pointer points into RAM or PROGMEM.
(That's the kind of pointer you need to pass to a "print_text_to_the_LCD()" function when you call that function from many different subroutines, sometimes with the address of a variable string in buffer that could be anywhere in RAM, and other times with one of many constant strings that could be anywhere in PROGMEM).
Such compilers often have special keywords ("short" or "near", "long" or "far") to let programmers specifically indicate three different kinds of char pointers in the same program -- constant strings that only need 2 bytes to indicate where in PROGMEM they are located, non-constant strings that only need 2 bytes to indicate where in RAM they are located, and the kind of 3-byte pointers that "print_text_to_the_LCD()" accepts.
Most computers built in the 1950s and 1960s use a 36-bit word length or an 18-bit word length, with an 18-bit (or less) address bus.
I hear that C compilers for such computers often use 9-bit bytes,
with sizeof(int *) == sizeof(func_ptr *) == 2 which gives 18 bits, since all integers and functions have to be word-aligned; but sizeof(char *) == sizeof(void *) == 4 to take advantage of special PDP-10 instructions that store such pointers in a full 36-bit word.
That full 36-bit word includes a 18-bit word address, and a few more bits in the other 18-bits that (among other things) indicate the bit position of the pointed-to character within that word.
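If you want to see what your own compiler does, a small sketch like the following collects the sizes in one place. The func_ptr typedef and the pointer_sizes struct are names I made up for illustration. The only relation the C standard promises here is that char * and void * share a representation; everything else is up to the implementation.

```c
#include <stddef.h>

/* Hypothetical function-pointer type, used only to measure its size. */
typedef void (*func_ptr)(void);

/* Sizes, in C bytes, of four kinds of pointer on this implementation. */
struct pointer_sizes {
    size_t int_ptr;
    size_t char_ptr;
    size_t void_ptr;
    size_t func;
};

struct pointer_sizes measure_pointer_sizes(void)
{
    struct pointer_sizes s = {
        sizeof(int *), sizeof(char *), sizeof(void *), sizeof(func_ptr)
    };
    return s;
}
```

On a mainstream 64-bit desktop all four fields usually come out as 8; on the small Harvard-architecture chips described above, the func field would be the one to differ.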
The in-memory representation of all pointers for a given architecture
is the same regardless of the data type pointed to?
Not necessarily. Some examples:
On any one of the architectures I mentioned above, pointers come in different sizes. So how could they possibly have "the same" representation?
Some compilers on some systems use "descriptors" to implement character pointers and other kinds of pointers.
Such a descriptor is different for a pointer pointing to the first "char" in a "char big_array[4000]" than for a pointer pointing to the first "char" in a "char small_array[10]", which are arguably different data types, even when the small array happens to start at exactly the same location in memory previously occupied by the big array.
Descriptors allow such machines to catch and trap the buffer overflows that cause such problems on other machines.
The "Low-Fat Pointers" used in the SAFElite and similar "soft processors" have analogous "extra information" about the size of the buffer that the pointer points into. Low-Fat pointers have the same advantage of catching and trapping buffer overflows.
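The descriptor idea can be imitated in software. The sketch below is only an analogy, not how any particular descriptor machine or Low-Fat pointer scheme is actually encoded: the struct char_descriptor and the function checked_store are hypothetical names, and the "trap" is modelled as an error return.

```c
#include <stddef.h>

/* Hypothetical software "descriptor": a base address plus the number
   of chars in the buffer it refers to. */
struct char_descriptor {
    char  *base;
    size_t length;
};

/* Store one char through the descriptor.
   Returns 0 on success, -1 if index is outside the buffer. */
int checked_store(struct char_descriptor d, size_t index, char value)
{
    if (index >= d.length)
        return -1;          /* descriptor hardware would trap here */
    d.base[index] = value;
    return 0;
}
```

A descriptor built for "char small_array[10]" rejects index 10, while one built for "char big_array[4000]" at the same address would accept it, which is the sense in which the two pointers differ.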
The in-memory representation of a pointer is the same as an integer of
the same bit length as the architecture?
Not necessarily. Some examples:
In "tagged architecture" machines, each word of memory has some bits that indicate whether that word is an integer, or a pointer, or something else.
With such machines, looking at the tag bits would tell you whether that word was an integer or a pointer.
I hear that Nova minicomputers have an "indirection bit" in each word which inspired "indirect threaded code". It sounds like storing an integer clears that bit, while storing a pointer sets that bit.
Multiplication and division of pointer data types are only forbidden
by the compiler. NOTE: Yes, I know this is nonsensical. What I mean is
- is there hardware support to forbid this incorrect usage?
Yes, some hardware doesn't directly support such operations.
As others have already mentioned, the "multiply" instructions in the 68000 and the 6809 only work with (some) "data registers"; they can't be directly applied to values in "address registers".
(It would be pretty easy for a compiler to work around such restrictions -- to MOV those values from an address register to the appropriate data register, and then use MUL).
All pointer values can be casted to a single data type?
Yes.
In order for memcpy() to work right, the C standard mandates that every pointer value of every kind can be cast to a void pointer ("void *").
The compiler is required to make this work, even for architectures that still use segments and offsets.
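A minimal sketch of that guarantee for object pointers (function pointers are a separate matter that I'm not addressing here). The function names are made up for illustration.

```c
#include <string.h>

/* Convert an int * to void * and back; the standard requires the
   result to compare equal to the original pointer. */
int roundtrip_through_void(void)
{
    int value = 42;
    void *generic = &value;   /* implicit conversion, no cast needed */
    int *back = generic;      /* and back again */
    return back == &value && *back == 42;
}

/* memcpy() takes void * parameters, so any object pointer can be
   passed to it without a cast. */
int copy_with_memcpy(void)
{
    double source = 1.5;
    double destination = 0.0;
    memcpy(&destination, &source, sizeof source);
    return destination == 1.5;
}
```

Both functions return 1 on a conforming implementation.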
All pointer values can be casted to a single integer? In other words,
what architectures still make use of segments and offsets?
I'm not sure.
I suspect that all pointer values can be cast to the "size_t" and "ptrdiff_t" integral data types defined in "<stddef.h>".
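For what it's worth, the integer types for which the standard actually spells out a round-trip guarantee are "intptr_t" and "uintptr_t" from "<stdint.h>" (both optional, but present on every common platform I know of). As far as I know, size_t and ptrdiff_t carry no such promise, even though they happen to be wide enough on most flat-memory systems. A minimal sketch, with a made-up function name:

```c
#include <stdint.h>

/* Convert a void * to uintptr_t and back. Where uintptr_t exists,
   the standard requires the restored pointer to compare equal to
   the original. */
int survives_integer_roundtrip(void *original)
{
    uintptr_t as_integer = (uintptr_t)original;
    void *restored = (void *)as_integer;
    return restored == original;
}
```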
Incrementing a pointer is equivalent to adding sizeof(the pointed data
type) to the memory address stored by the pointer. If p is an int32*
then p+1 is equal to the memory address 4 bytes after p.
It is unclear what you are asking here.
Q: If I have an array of some kind of structure or primitive data type (for example, a "#include <stdint.h> ... int32_t example_array[1000]; ..."), and I increment a pointer that points into that array (for example, "int32_t *p = &example_array[99]; ... p++; ..."), does the pointer now point to the very next consecutive member of that array, which is sizeof(the pointed data type) bytes further along in memory?
A: Yes, the compiler must make the pointer, after incrementing it once, point at the next independent consecutive int32_t in the array, sizeof(the pointed data type) bytes further along in memory, in order to be standards compliant.
Q: So, if p is an int32* , then p+1 is equal to the memory address 4 bytes after p?
A: When sizeof( int32_t ) is actually equal to 4, yes. Otherwise, such as for certain word-addressable machines including some modern DSPs where sizeof( int32_t ) may equal 2 or even 1, then p+1 is equal to the memory address 2 or even 1 "C bytes" after p.
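The distance can be measured portably in "C bytes" by viewing both addresses as char pointers. This is a minimal sketch; the function names are made up for illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Distance from p to p + 1 for an int32_t *p, measured in C bytes
   (units of char). Always equals sizeof(int32_t). */
size_t step_in_c_bytes(void)
{
    int32_t pair[2] = {0, 0};
    char *first  = (char *)&pair[0];
    char *second = (char *)&pair[1];
    return (size_t)(second - first);
}

/* Number of bits in one C byte: 8 on most machines, but 16 or 32
   on some word-addressed DSPs. */
int bits_per_c_byte(void)
{
    return CHAR_BIT;
}
```

Whatever the machine, step_in_c_bytes() times bits_per_c_byte() is 32, because int32_t has exactly 32 bits and no padding; it is only the split between the two factors that varies.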
Q: So if I take the pointer, and cast it into an "int" ...
A: One type of "All the world's a VAX heresy".
Q: ... and then cast that "int" back into a pointer ...
A: Another type of "All the world's a VAX heresy".
Q: So if I take the pointer p which is a pointer to an int32_t, and cast it into some integral type that is plenty big enough to contain the pointer, and then add sizeof( int32_t ) to that integral type, and then later cast that integral type back into a pointer -- when I do all that, the resulting pointer is equal to p+1?
Not necessarily.
Lots of DSPs and a few other modern chips have word-oriented addressing, rather than the byte-oriented processing used by 8-bit chips.
Some of the C compilers for such chips cram 2 characters into each word, but it takes 2 such words to hold an int32_t -- so they report that sizeof( int32_t ) is 4.
(I've heard rumors that there's a C compiler for the 24-bit Motorola 56000 that does this).
The compiler is required to arrange things such that doing "p++" with a pointer to an int32_t increments the pointer to the next int32_t value.
There are several ways for the compiler to do that.
One standards-compliant way is to store each pointer to an int32_t as a "native word address".
Because it takes 2 words to hold a single int32_t value, the C compiler compiles "int32_t * p; ... p++" into some assembly language that increments that pointer value by 2.
On the other hand, if one does "int32_t * p; ... int x = (int)p; x += sizeof( int32_t ); p = (int32_t *)x;", that C compiler for the 56000 will likely compile it to assembly language that increments the pointer value by 4.
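The mismatch is easier to see in a toy model. The following is plain arithmetic on made-up word addresses, not real 56000 code; the constants and function names are assumptions chosen to match the description above (2 chars per word, 2 words per int32_t, so sizeof( int32_t ) is 4).

```c
/* Toy model of a word-addressed machine: addresses count machine
   words, one word holds 2 chars, an int32_t occupies 2 words. */
enum {
    CHARS_PER_WORD  = 2,
    SIZEOF_INT32    = 4,                             /* in C bytes */
    WORDS_PER_INT32 = SIZEOF_INT32 / CHARS_PER_WORD  /* 2 words    */
};

/* What the compiler emits for p++ on an int32_t *:
   advance the native word address by one whole int32_t. */
unsigned long increment_pointer(unsigned long word_address)
{
    return word_address + WORDS_PER_INT32;
}

/* What the program gets when it adds sizeof(int32_t) to the
   pointer's raw integer value instead. */
unsigned long add_sizeof_to_integer(unsigned long word_address)
{
    return word_address + SIZEOF_INT32;
}
```

Starting from word address 100, p++ lands on 102 while the integer detour lands on 104, so the two pointers no longer agree.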
I'm most used to pointers being used in a contiguous, virtual memory
space.
Several PIC and 8086 and other systems have non-contiguous RAM --
a few blocks of RAM at addresses that "made the hardware simpler",
with memory-mapped I/O or nothing at all attached to the gaps in address space between those blocks.
It's even more awkward than it sounds.
In some cases -- such as with the bit-banding hardware used to avoid problems caused by read-modify-write -- the exact same bit in RAM can be read or written using 2 or more different addresses.
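As one example of such aliasing, here is a sketch of the address calculation for the SRAM bit-band region as I understand it from ARM's Cortex-M3 documentation: each bit in the region starting at 0x20000000 also appears as a whole 32-bit word in an alias region starting at 0x22000000. The macro and function names are made up, and other chips with bit-banding may use a different layout.

```c
#include <stdint.h>

/* Assumed Cortex-M3 SRAM bit-band layout. */
#define SRAM_BITBAND_BASE 0x20000000u  /* ordinary byte addresses */
#define SRAM_ALIAS_BASE   0x22000000u  /* one 32-bit word per bit */

/* Address of the alias word that maps to bit bit_number (0..7) of
   the byte at byte_address inside the bit-band region. */
uint32_t bitband_alias_address(uint32_t byte_address, unsigned bit_number)
{
    uint32_t byte_offset = byte_address - SRAM_BITBAND_BASE;
    return SRAM_ALIAS_BASE + byte_offset * 32u + bit_number * 4u;
}
```

So bit 0 of the byte at 0x200FFFFF can be reached either through that byte address or through the word at 0x23FFFFE0, two different pointer values for the same storage.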