Pointer comparisons in C. Are they signed or unsigned? - c

Hi I'm sure this must be a common question but I can't find the answer when I search for it. My question basically concerns two pointers. I want to compare their addresses and determine if one is bigger than the other. I would expect all addresses to be unsigned during comparison. Is this true, and does it vary between C89, C99 and C++? When I compile with gcc the comparison is unsigned.
If I have two pointers that I'm comparing like this:
char *a = (char *) 0x80000000; //-2147483648 or 2147483648 ?
char *b = (char *) 0x1;
Then a is greater. Is this guaranteed by a standard?
Edit to update on what I am trying to do. I have a situation where I would like to determine that if there's an arithmetic error it will not cause a pointer to go out of bounds. Right now I have the start address of the array and the end address. And if there's an error and the pointer calculation is wrong, and outside of the valid addresses of memory for the array, I would like to make sure no access violation occurs. I believe I can prevent this by comparing the suspect pointer, which has been returned by another function, and determining if it is within the acceptable range of the array. The question of negative and positive addresses has to do with whether I can make the comparisons, as discussed above in my original question.
I appreciate the answers so far. Based on my edit would you say that what I'm doing is undefined behavior in gcc and msvc? This is a program that will run on Microsoft Windows only.
Here's an over simplified example:
char letters[26];
char *do_not_read = &letters[26];
char *suspect = somefunction_i_dont_control(letters,26);
if( (suspect >= letters) && (suspect < do_not_read) )
printf("%c", suspect);
Another edit, after reading AndreyT's answer it appears to be correct. Therefore I will do something like this:
char letters[26];
uintptr_t begin = letters;
uintptr_t toofar = begin + sizeof(letters);
char *suspect = somefunction_i_dont_control(letters,26);
if( ((uintptr_t)suspect >= begin) && ((uintptr_t)suspect < toofar ) )
printf("%c", suspect);
Thanks everyone!

Pointer comparisons cannot be signed or unsigned. Pointers are not integers.
C language (as well as C++) defines relative pointer comparisons only for pointers that point into the same aggregate (struct or array). The ordering is natural: the pointer that points to an element with smaller index in an array is smaller. The pointer that points to a struct member declared earlier is smaller. That's it.
You can't legally compare arbitrary pointers in C/C++. The result of such comparison is not defined. If you are interested in comparing the numerical values of the addresses stored in the pointers, it is your responsibility to manually convert the pointers to integer values first. In that case, you will have to decide whether to use a signed or unsigned integer type (intptr_t or uintptr_t). Depending on which type you choose, the comparison will be "signed" or "unsigned".

The integer-to-pointer conversion is wholly implementation defined, so it depends on the implementation you are using.
That said, you are only allowed to relationally compare pointers that point to parts of the same object (basically, to subobjects of the same struct or elements of the same array). You aren't allowed to compare two pointers to arbitrary, wholly unrelated objects.

From a draft C++ Standard 5.9:
If two pointers p and q of the same type point to different objects
that are not members of the same object or elements of the same array
or to different functions, or if only one of them is null, the results
of p<q, p>q, p<=q, and p>=q are unspecified.
So, if you cast numbers to pointers and compare them, C++ gives you unspecified results. If you take the address of elements you can validly compare, the results of comparison operations are specified independently of the signed-ness of the pointer types.
Note unspecified is not undefined: it's quite possible to compare pointers to different objects of the same type that aren't in the same structure or array, and you can expect some self-consistent result (otherwise it'd be impossible to use such pointers as keys in trees, or to sort a vector of such pointers, binary search the vector etc., where a consistent intuitive overall < ordering is needed).
Note that in very old C++ Standards the behaviour was undefined - like the 2005 WG14/N1124 draft andrewdski links to under James McNellis's answer -

To complement the other answers, comparison between pointers that point to different objects depends on the standard.
In C99 (ISO/IEC 9899:1999 (E)), §6.5.8:
5 [...] In all other cases, the behavior is undefined.
In C++03 (ISO/IEC 14882:2003(E)), §5.9:
-Other pointer comparisons are unspecified.

I know several of the answers here say you cannot compare pointers unless they point to within the same structure, but that's a red herring and I'll try to explain why. One of your pointers points to the start of your array, the other to the end, so they are pointing to the same structure. A language lawyer could say that if your third pointer points outside of the object, the comparison is undefined, so x >= array.start might be true for all x. But this is no issue, since at the point of comparison C++ cannot know if the array isn't embedded in an even bigger structure. Furthermore, if your address space is linear, like it's bound to be these days, your pointer comparison will be implemented as an (un)signed integer comparison, since any other implementation would be slower. Even in the times of segments and offsets, (far) pointer comparison was implemented by first normalising the pointer and then comparing them as integers.
What this all boils down to then, is that if your compiler is okay, comparing the pointers without worrying about the signs should work, if all you care about is that the pointer points within the array, since the compiler should make the pointers signed or unsigned depending on which of the two boundaries a C++ object may straddle.
Different platforms behave differently in this matter, which is why C++ has to leave it up to the platform. There are even platforms in which both addresses near 0 and 80..00h are not mappable or already taken at process start-up. In that case, it doesn't matter, as long as you're consistent about it.
Sometimes this can cause compatibility issues. As an example, in Win32 pointers are unsigned. Now, it used to be the case that of the 4GB address space only the lower half (more precisely 10000h ... 7FFFFFFFh, because of the NULL-Pointer Assignment Partition) was available to applications; high addresses were only available to the kernel. This caused some people to put addresses in signed variables, and their programs would keep working since the high bit was always 0. But then came /3GB switch, which made almost 3 GB available to applications (more precisely 10000h ... BFFFFFFFh) and the application would crash or behave erratically.
You explicitly state your program will be Windows-only, which uses unsigned pointers. However, maybe you'll change your mind in the future, and using intptr_t or uintptr_t is bad for portability. I also wonder if you should be doing this at all... if you're indexing into an array it might be safer to compare indices instead. Suppose for example that you have a 1 GB array at 1500000h ... 41500000h, consisting of 16,384 elements of 64 kB each. Suppose you accidentally look up index 80,000 – clearly out of range. The pointer calculation will yield 39D00000h, so your pointer check will allow it, even though it shouldn't.

Related

Does C have an equivalent of std::less from C++?

I was recently answering a question on the undefined behaviour of doing p < q in C when p and q are pointers into different objects/arrays. That got me thinking: C++ has the same (undefined) behaviour of < in this case, but also offers the standard library template std::less which is guaranteed to return the same thing as < when the pointers can be compared, and return some consistent ordering when they cannot.
Does C offer something with similar functionality which would allow safely comparing arbitrary pointers (to the same type)? I tried looking through the C11 standard and didn't find anything, but my experience in C is orders of magnitude smaller than in C++, so I could have easily missed something.
On implementations with a flat memory model (basically everything), casting to uintptr_t will Just Work.
(But see Should pointer comparisons be signed or unsigned in 64-bit x86? for discussion of whether you should treat pointers as signed or not, including issues of forming pointers outside of objects which is UB in C.)
But systems with non-flat memory models do exist, and thinking about them can help explain the current situation, like C++ having different specs for < vs. std::less.
Part of the point of < on pointers to separate objects being UB in C (or at least unspecified in some C++ revisions) is to allow for weird machines, including non-flat memory models.
A well-known example is x86-16 real mode where pointers are segment:offset, forming a 20-bit linear address via (segment << 4) + offset. The same linear address can be represented by multiple different seg:off combinations.
C++ std::less on pointers on weird ISAs might need to be expensive, e.g. "normalize" a segment:offset on x86-16 to have offset <= 15. However, there's no portable way to implement this. The manipulation required to normalize a uintptr_t (or the object-representation of a pointer object) is implementation-specific.
But even on systems where C++ std::less has to be expensive, < doesn't have to be. For example, assuming a "large" memory model where an object fits within one segment, < can just compare the offset part and not even bother with the segment part. (Pointers inside the same object will have the same segment, and otherwise it's UB in C. C++17 changed to merely "unspecified", which might still allow skipping normalization and just comparing offsets.) This is assuming all pointers to any part of an object always use the same seg value, never normalizing. This is what you'd expect an ABI to require for a "large" as opposed to "huge" memory model. (See discussion in comments).
(Such a memory model might have a max object size of 64kiB for example, but a much larger max total address space that has room for many such max-sized objects. ISO C allows implementations to have a limit on object size that's lower than the max value (unsigned) size_t can represent, SIZE_MAX. For example even on flat memory model systems, GNU C limits max object size to PTRDIFF_MAX so size calculation can ignore signed overflow.) See this answer and discussion in comments.
If you want to allow objects larger than a segment, you need a "huge" memory model that has to worry about overflowing the offset part of a pointer when doing p++ to loop through an array, or when doing indexing / pointer arithmetic. This leads to slower code everywhere, but would probably mean that p < q would happen to work for pointers to different objects, because an implementation targeting a "huge" memory model would normally choose to keep all pointers normalized all the time. See What are near, far and huge pointers? - some real C compilers for x86 real mode did have an option to compile for the "huge" model where all pointers defaulted to "huge" unless declared otherwise.
x86 real-mode segmentation isn't the only non-flat memory model possible, it's merely a useful concrete example to illustrate how it's been handled by C/C++ implementations. In real life, implementations extended ISO C with the concept of far vs. near pointers, allowing programmers to choose when they can get away with just storing / passing around the 16-bit offset part, relative to some common data segment.
But a pure ISO C implementation would have to choose between a small memory model (everything except code in the same 64kiB with 16-bit pointers) or large or huge with all pointers being 32-bit. Some loops could optimize by incrementing just the offset part, but pointer objects couldn't be optimized to be smaller.
If you knew what the magic manipulation was for any given implementation, you could implement it in pure C. The problem is that different systems use different addressing and the details aren't parameterized by any portable macros.
Or maybe not: it might involve looking something up from a special segment table or something, e.g. like x86 protected mode instead of real mode where the segment part of the address is an index, not a value to be left shifted. You could set up partially-overlapping segments in protected mode, and the segment selector parts of addresses wouldn't necessarily even be ordered in the same order as the corresponding segment base addresses. Getting a linear address from a seg:off pointer in x86 protected mode might involve a system call, if the GDT and/or LDT aren't mapped into readable pages in your process.
(Of course mainstream OSes for x86 use a flat memory model so the segment base is always 0 (except for thread-local storage using fs or gs segments), and only the 32-bit or 64-bit "offset" part is used as a pointer.)
You could manually add code for various specific platforms, e.g. by default assume flat, or #ifdef something to detect x86 real mode and split uintptr_t into 16-bit halves for seg -= off>>4; off &= 0xf; then combine those parts back into a 32-bit number.
I once tried to find a way around this and I did find a solution that works for overlapping objects and in most other cases assuming the compiler does the "usual" thing.
You can first implement the suggestion in How to implement memmove in standard C without an intermediate copy? and then if that doesn't work cast to uintptr (a wrapper type for either uintptr_t or unsigned long long depending on whether uintptr_t is available) and get a most-likely accurate result (although it probably wouldn't matter anyway):
#include <stdint.h>
#ifndef UINTPTR_MAX
typedef unsigned long long uintptr;
#else
typedef uintptr_t uintptr;
#endif
int pcmp(const void *p1, const void *p2, size_t len)
{
const unsigned char *s1 = p1;
const unsigned char *s2 = p2;
size_t l;
/* Check for overlap */
for( l = 0; l < len; l++ )
{
if( s1 + l == s2 || s1 + l == s2 + len - 1 )
{
/* The two objects overlap, so we're allowed to
use comparison operators. */
if(s1 > s2)
return 1;
else if (s1 < s2)
return -1;
else
return 0;
}
}
/* No overlap so the result probably won't really matter.
Cast the result to `uintptr` and hope the compiler
does the "usual" thing */
if((uintptr)s1 > (uintptr)s2)
return 1;
else if ((uintptr)s1 < (uintptr)s2)
return -1;
else
return 0;
}
Does C offer something with similar functionality which would allow safely comparing arbitrary pointers.
No
First let us only consider object pointers. Function pointers bring in a whole other set of concerns.
2 pointers p1, p2 can have different encodings and point to the same address so p1 == p2 even though memcmp(&p1, &p2, sizeof p1) is not 0. Such architectures are rare.
Yet conversion of these pointer to uintptr_t does not require the same integer result leading to (uintptr_t)p1 != (uinptr_t)p2.
(uintptr_t)p1 < (uinptr_t)p2 itself is well legal code, by may not provide the hoped for functionality.
If code truly needs to compare unrelated pointers, form a helper function less(const void *p1, const void *p2) and perform platform specific code there.
Perhaps:
// return -1,0,1 for <,==,>
int ptrcmp(const void *c1, const void *c1) {
// Equivalence test works on all platforms
if (c1 == c2) {
return 0;
}
// At this point, we know pointers are not equivalent.
#ifdef UINTPTR_MAX
uintptr_t u1 = (uintptr_t)c1;
uintptr_t u2 = (uintptr_t)c2;
// Below code "works" in that the computation is legal,
// but does it function as desired?
// Likely, but strange systems lurk out in the wild.
// Check implementation before using
#if tbd
return (u1 > u2) - (u1 < u2);
#else
#error TBD code
#endif
#else
#error TBD code
#endif
}
The C Standard explicitly allows implementations to behave "in a documented manner characteristic of the environment" when an action invokes "Undefined Behavior". When the Standard was written, it would have been obvious to everyone that implementations intended for low-level programming on platforms with a flat memory model should do precisely that when processing relational operators between arbitrary pointers. It also would have been obvious that implementations targeting platforms whose natural means of pointer comparisons would never have side effects should perform comparisons between arbitrary pointers in ways that don't have side effects.
There are three general circumstances where programmers might perform relational operators between pointers:
Pointers to unrelated objects will never be compared.
Code may compare pointers within an object in cases where the results would matter, or between unrelated objects in cases where the results wouldn't matter. A simple example of this would be an operation that can act upon possibly-overlapping array segments in either ascending or descending order. The choice of ascending or descending order would matter in cases where the objects overlap, but either order would be equally valid when acting upon array segments in unrelated objects.
Code relies upon comparisons yielding a transitive ordering consistent with pointer equality.
The third type of usage would seldom occur outside of platform-specific code, which would either know that relational operators would simply work, or would know a platform-specific alternative. The second type of usage could occur in code which should be mostly portable, but almost all implementations could support the second type of usage just as cheaply as the first and there would be no reasons for them to do otherwise. The only people who should have any reason to care about whether the second usage was defined would be people writing compilers for platforms where such comparisons would be expensive or those seeking to ensure that their programs would be compatible with such platforms. Such people would be better placed than the Committee to judge the pros and cons of upholding a "no side effects" guarantee, and thus the Committee leaves the question open.
To be sure, the fact that there would be no reason for a compiler not to process a construct usefully is no guarantee that a "Gratuitously Clever Compiler" won't use the Standard as an excuse to do otherwise, but the reason the C Standard doesn't define a "less" operator is that the Committee expected that "<" would be adequate for almost all programs on almost all platforms.

Pointer Subtraction and Comparison Rule Confusion

I was playing around with pointer arithmetic and came across two rules in C Standard regarding pointer subtraction and comparison.
Rule 1: When two pointers are subtracted, both must point to elements of the same array object or just one past the last element of the array object (C Standard, 6.5.6); the result is the difference of the subscripts of the two array elements. Otherwise, the operation is undefined behavior (48).
Rule 2: Similarly, comparing pointers using the relational operators <, <=, >=, and > gives the positions of the pointers relative to each other. Pointers that do not point to the same aggregate or union (nor just beyond the same array object) are compared using relational operators (6.5.8). Otherwise, the operation is undefined behavior (53).
Subtracting or comparing pointers that do not refer to the same array is undefined behavior.
Question 1: As per rule 1 mentioned above, the behavior is undefined, but the program does print an address as output. The program crashes when I try to dereference the the variable containing the address. How come there is an address which exists but the value the address points to does not exist?
Question 2: As per rule 2 mentioned above, using relational operator to compare two pointers which refer two different arrays is undefined behavior, the program should crash, but I end up with an output? How is that possible?
Can someone please help me regarding this rule confusion? I have posted the code below:
#include <stdio.h>
int main()
{
char *pointer_1;
char *pointer_2;
char *difference;
int counter=0;
char string[20]={"Pointer Arithmetic"};
char str[30]={"Substraction and Comparison"};
pointer_1=string;
pointer_2=str;
difference=(char *)(pointer_2-pointer_1);
printf("%p\n",difference); Address exists
/*printf("%c\n",difference);*/ Dereferencing leads to program crash
while(pointer_1>pointer_2) Is one is allowed to use relational operators on
pointers which point to two different arrays?
{
{
counter++;
pointer_2++;
}
}
printf("%d",counter);
}
Subtracting two pointers does not result in a pointer ("address"), it results in integer which is the "distance" between those pointers. This only makes sense if the pointers both point into the same array. Likewise, comparing pointers only makes sense if they are in the same array.
When it doesn't make sense, the result is undefined--that does NOT mean that the program will fail, or crash, or produce an error of any kind. It means ANYTHING may happen, and you have no right to complain.
You cannot test for undefined behavior, because the result is, eh, totally undefined.
Possible outcomes include getting the expected result, crashing, getting an unexpected result, or causing time travel.
In this case, and on common desktop computers, a possible outcome of comparing two unrelated pointers pointer_1>pointer_2 is that it compares the addresses stored in the two pointers. So it "works".
On computers with a segmented memory, like a 286, a memory address is composed of a pair segment:offset. Comparing two segment values doesn't say anything about which one represents the highest memory address, as it is just an index into a segment descriptor table containing the real address.
So if your arrays are in two different segments (very common, because the segments were small), comparing the pointers for order didn't make much sense and subtracting them didn't even work.
So the language standard says that you cannot do that, because sometimes it doesn't work.

Pointer to Array of Bytes

I'm having some trouble with a pointer declaration that one of my co-workers wants to use because of Misra C requirements. Misra (Safety Critical guideline) won't let us mere Programmers use pointers, but will let us operate on arrays bytes. He intends to procur a pointer to an array of bytes (so we don't pass the actual array on the stack.)
// This is how I would normally do it
//
void Foo(uint8_t* pu8Buffer, uint16_t u16Len)
{
}
// This is how he has done it
//
void Foo(uint8_t (*pu8Buffer)[], uint16_t u16Len)
{
}
The calling function looks something like;
void Bar(void)
{
uint8_t u8Payload[1024]
uint16_t u16PayloadLen;
// ...some code to fill said array...
Foo(u8Payload, u16PayloadLen);
}
But, when pu8Buffer is accessed in Foo(), the array is wrong. Obviously not passing what it is expecting. The array is correct in the calling function, but not inside Foo()
I think he has created an array of pointers to bytes, not a pointer to an array of bytes.
Anyone care to clarify? Foo(&u8Payload, u16PayloadLen); doesn't work either.
In void Foo(uint8_t (*pu8Buffer)[], uint16_t u16Len), pu8Buffer is a pointer to an (incomplete) array of uint8_t. pu8Buffer has an incomplete type; it is a pointer to an array whose size is unknown. It may not be used in expressions where the size is required (such as pointer arithmetic; pu8Buffer+1 is not allowed).
Then *pu8Buffer is an array whose size is unknown. Since it is an array, it is automatically converted in most situations to a pointer to its first element. Thus, *pu8Buffer becomes a pointer to the first uint8_t of the array. The type of the converted *pu8Buffer is complete; it is a pointer to uint8_t, so it may be used in address arithmetic; *(*pu8Buffer + 1), (*pu8Buffer)[1], and 1[*pu8Buffer] are all valid expressions for the uint8_t one beyond *pu8Buffer.
I take it you are referring to MISRA-C:2004 rule 17.4 (or 2012 rule 18.4). Even someone like me who is a fan of MISRA finds this rule to be complete nonsense. The rationale for the rule is this (MISRA-C:2012 18.4):
"Array indexing using the array subscript syntax, ptr[expr], is the
preferred form of pointer arithmetic because it is often clearer and
hence less error prone than pointer manipulation. Any explicitly
calculated pointer value has the potential to access unintended or
invalid memory addresses. Such behavior is also possible with array
indexing, but the subscript syntax may ease the task of manual review.
Pointer arithmetic in C can be confusing to the novice The expression
ptr+1 may be mistakenly interpreted as the addition of 1 to the
address held in ptr. In fact the new memory address depends on the
size in bytes of the pointer's target. This misunderstanding can lead
to unexpected behaviour if sizeof is applied incorrectly."
So it all boils down to MISRA worrying about beginner programmers confusing ptr+1 to have the outcome we would have when writing (uint8_t*)ptr + 1. The solution, in my opinion, is to educate the novice programmers, rather than to restrict the professional ones (but then if you hire novice programmers to write safety-critical software with MISRA compliance, understanding pointer arithmetic is probably the least of your problems anyhow).
Solve this by writing a permanent deviation from this rule!
If you for reasons unknown don't want to deviate, but to make your current code MISRA compliant, simply rewrite the function as
void Foo(uint8_t pu8Buffer[], uint16_t u16Len)
and then replace all pointer arithmetic with pu8Buffer[something]. Then suddenly the code is 100% MISRA compatible according to the MISRA:2004 exemplar suite. And it is also 100% functionally equivalent to what you already have.

When printf is an address of a variable, why use void*?

I saw some usage of (void*) in printf().
If I want to print a variable's address, can I do it like this:
int a = 19;
printf("%d", &a);
I think, &a is a's address which is just an integer, right?
Many articles I read use something like this:
printf("%p", (void*)&a);
What does %p stand for? (A pointer?)
Why use (void*)? Can't I use (int)&a instead?
Pointers are not numbers. They are often internally represented that way, but they are conceptually distinct.
void* is designed to be a generic pointer type. Any pointer value (other than a function pointer) may be converted to void* and back again without loss of information. This typically means that void* is at least as big as other pointer types.
printfs "%p" format requires an argument of type void*. That's why an int* should be cast to void* in that context. (There's no implicit conversion because it's a variadic function; there's no declared parameter, so the compiler doesn't know what to convert it to.)
Sloppy practices like printing pointers with "%d", or passing an int* to printf with a "%p" format, are things that you can probably get away with on most current systems, but they render your code non-portable. (Note that it's common on 64-bit systems for void* and int to be different sizes, so printing pointers with %d" is really non-portable, not just theoretically.)
Incidentally, the output format for "%p" is implementation-defined. Hexadecimal is common, (in upper or lower case, with or without a leading "0x" or "0X"), but it's not the only possibility. All you can count on is that, assuming a reasonable implementation, it will be a reasonable way to represent a pointer value in human-readable form (and that scanf will understand the output of printf).
The article you read is entirely correct. The correct way to print an int* value is
printf("%p", (void*)&a);
Don't take the lazy way out; it's not at all difficult to get it right.
Suggested reading: Section 4 of the comp.lang.c FAQ. (Further suggested reading: All the other sections.
EDIT:
In response to Alcott's question:
There is still one thing I don't quite understand. int a = 10; int *p = &a;, so p's value is a's address in mem, right? If right, then p's value will range from 0 to 2^32-1 (if cpu is 32-bit), and an integer is 4-byte on 32-bit OS, right? then What's the difference between the p's value and an integer? Can p's value go out of the range?
The difference is that they're of different types.
Assume a system on which int, int*, void*, and float are all 32 bits (this is typical for current 32-bit systems). Does the fact that float is 32 bits imply that its range is 0 to 232-1? Or -231 to 231-1? Certainly not; the range of float (assuming IEEE representation) is approximately -3.40282e+38 to +3.40282e+38, with widely varying resolution across the range, plus exotic values like negative zero, subnormalized numbers, denormalized numbers, infinities, and NaNs (Not-a-Number). int and float are both 32 bits, and you can take the 32 bits of a float object and treat it as an int representation, but the result won't have any straightforward relationship to the value of the float. The second low-order bit of an int, for example, has a specific meaning; it contributes 0 to the value if it's 0, and 2 to the value if it's 1; the corresponding bit of a float has a meaning, but it's quite different (it contributes a value that depends on the value of the exponent).
The situation with pointers is quite similar. A pointer value has a meaning: it's the address of some object (or any of several other things, but we'll set that aside for now). On most current systems, interpreting the bits of a pointer object as if it were an integer gives you something that makes sense on the machine level. But the language itself does not guarantee, or even hint, that that's the case.
Pointers are not numbers.
A concrete example: some years ago, I ran across some code that tried to compute the difference in bytes between two addresses by casting to integers. It was something like this:
unsigned char *p0;
unsigned char *p1;
long difference = (unsigned long)p1 - (unsigned long)p0;
If you assume that pointers are just numbers, representing addresses in a linear monolithic address space, then this code makes sense. But that assumption is not supported by the language. And in fact, there was a system on which that code was intended to run (the Cray T90) on which it simply would not have worked. The T90 had 64-bit pointers pointing to 64-bit words. Byte pointers were synthesized in software by storing an offset in the 3 high-order bits of a pointer object. Subtracting two pointers in the above manner, if they both had 0 offsets, would give you the number of words, not bytes, between the addresses. And if they had non-0 offsets, it would give you meaningless garbage. (Conversion from a pointer to an integer would just copy the bits; it could have done the work to give you a meaningful byte index, but it didn't.)
The solution was simple: drop the casts and use pointer arithmetic:
long difference = p1 - p0;
Other addressing schemes are possible. For example, an address might consist of a descriptor that (perhaps indirectly) references a block of memory, plus an offset within that block.
You can assume that addresses are just numbers, that the address space is linear and monolithic, that all pointers are the same size and have the same representation, that a pointer can be safely converted to int, or to long, and back again without loss of information. And the code you write based on those assumptions will probably work on most current systems. But it's entirely possible that some future systems will again use a different memory model, and your code will break.
If you avoid making any assumptions beyond what the language actually guarantees, your code will be far more future-proof. And even leaving portability issues aside, it will probably be cleaner.
So much insanity present here...
%p is generally the correct format specifier to use if you just want to print out a representation of the pointer. Never, ever use %d.
The length of an int and the length of a pointer (void* or otherwise) have no relationship. Most data models on i386 just happen to have 32-bit ints AND 32-bit pointers -- other platforms, including x86-64, are not the same! (This is also historically known as "all the world's a VAX syndrome".) http://en.wikipedia.org/wiki/64-bit#64-bit_data_models
If for some reason you want to hold a memory address in an integral variable, use the right types! intptr_t and uintptr_t. They're in stdint.h. See http://en.wikipedia.org/wiki/Stdint.h#Integers_wide_enough_to_hold_pointers
In C void * is an un-typed pointer. void does not mean void... it means anything. Thus casting to void * would be the same as casting to "pointer" in another language.
Using (int *)&a should work too... but the stylistic point of saying (void *) is to say -- I don't care about the type -- just that it is a pointer.
Note: It is possible for an implementation of C to cause this construct to fail and still meet the requirements of the standards. I don't know of any such implementations, but it is possible.
Although it the vast majority of C implementations store pointers to all kinds of objects using the same representation, the C Standard does not require that all implementations do so, nor does it even provide any means by which a program which would exploit commonality of representations could test whether an implementation follows the common practice and refuse to run if an implementation doesn't.
If on some particular platform, an int* held a word address, while both char* and void* combine a word address with a word that identifies a byte within a word, passing an int* to a function that is expecting to retrieve a variadic argument of type char* or void* would result in that function trying to fetch more data from the stack (a word address plus the supplemental word) than had been pushed (just the word address). This could cause the system to malfunction in unpredictable ways.
Many compilers for commonplace platforms that use the same representation for all pointers will process an action which passes a non-void pointer precisely the same way as they would process an action which casts the pointer to void* before passing it. They thus have no reason to care about whether the pointer type that is passed as a variadic argument will precisely match the pointer type expected by the recipient. Although the Standard could have specified that such implementations which would have no reason to care about pointer types should behave as though the pointers were cast to void*, the authors of C89 Standard avoided describing anything which wouldn't be common to all conforming compilers. The Standard's terminology for a construct that 99% of implementations should process identically, but 1% would might process unpredictably, is "Undefined Behavior". Implementations may, and often should, extend the semantics of the language by specifying how they will treat such constructs, but that's a Quality of Implementation issue outside the Standard's jurisdiction.

Pointer implementation details in C

I would like to know architectures which violate the assumptions I've listed below. Also, I would like to know if any of the assumptions are false for all architectures (that is, if any of them are just completely wrong).
sizeof(int *) == sizeof(char *) == sizeof(void *) == sizeof(func_ptr *)
The in-memory representation of all pointers for a given architecture is the same regardless of the data type pointed to.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture.
Multiplication and division of pointer data types are only forbidden by the compiler. NOTE: Yes, I know this is nonsensical. What I mean is - is there hardware support to forbid this incorrect usage?
All pointer values can be casted to a single integer. In other words, what architectures still make use of segments and offsets?
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer. If p is an int32* then p+1 is equal to the memory address 4 bytes after p.
I'm most used to pointers being used in a contiguous, virtual memory space. For that usage, I can generally get by thinking of them as addresses on a number line. See Stack Overflow question Pointer comparison.
I can't give you concrete examples of all of these, but I'll do my best.
sizeof(int *) == sizeof(char *) == sizeof(void *) == sizeof(func_ptr *)
I don't know of any systems where I know this to be false, but consider:
Mobile devices often have some amount of read-only memory in which program code and such is stored. Read-only values (const variables) may conceivably be stored in read-only memory. And since the ROM address space may be smaller than the normal RAM address space, the pointer size may be different as well. Likewise, pointers to functions may have a different size, as they may point to this read-only memory into which the program is loaded, and which can otherwise not be modified (so your data can't be stored in it).
So I don't know of any platforms on which I've observed that the above doesn't hold, but I can imagine systems where it might be the case.
The in-memory representation of all pointers for a given architecture is the same regardless of the data type pointed to.
Think of member pointers vs regular pointers. They don't have the same representation (or size). A member pointer consists of a this pointer and an offset.
And as above, it is conceivable that some CPU's would load constant data into a separate area of memory, which used a separate pointer format.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture.
Depends on how that bit length is defined. :)
An int on many 64-bit platforms is still 32 bits. But a pointer is 64 bits.
As already said, CPU's with a segmented memory model will have pointers consisting of a pair of numbers. Likewise, member pointers consist of a pair of numbers.
Multiplication and division of pointer data types are only forbidden by the compiler.
Ultimately, pointers data types only exist in the compiler. What the CPU works with is not pointers, but integers and memory addresses. So there is nowhere else where these operations on pointer types could be forbidden. You might as well ask for the CPU to forbid concatenation of C++ string objects. It can't do that because the C++ string type only exists in the C++ language, not in the generated machine code.
However, to answer what you mean, look up the Motorola 68000 CPUs. I believe they have separate registers for integers and memory addresses. Which means that they can easily forbid such nonsensical operations.
All pointer values can be casted to a single integer.
You're safe there. The C and C++ standards guarantee that this is always possible, no matter the memory space layout, CPU architecture and anything else. Specifically, they guarantee an implementation-defined mapping. In other words, you can always convert a pointer to an integer, and then convert that integer back to get the original pointer. But the C/C++ languages say nothing about what the intermediate integer value should be. That is up to the individual compiler, and the hardware it targets.
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer.
Again, this is guaranteed. If you consider that conceptually, a pointer does not point to an address, it points to an object, then this makes perfect sense. Adding one to the pointer will then obviously make it point to the next object. If an object is 20 bytes long, then incrementing the pointer will move it 20 bytes, so that it moves to the next object.
If a pointer was merely a memory address in a linear address space, if it was basically an integer, then incrementing it would add 1 to the address -- that is, it would move to the next byte.
Finally, as I mentioned in a comment to your question, keep in mind that C++ is just a language. It doesn't care which architecture it is compiled to. Many of these limitations may seem obscure on modern CPU's. But what if you're targeting yesteryear's CPU's? What if you're targeting the next decade's CPU's? You don't even know how they'll work, so you can't assume much about them. What if you're targeting a virtual machine? Compilers already exist which generate bytecode for Flash, ready to run from a website. What if you want to compile your C++ to Python source code?
Staying within the rules specified in the standard guarantees that your code will work in all these cases.
I don't have specific real world examples in mind but the "authority" is the C standard. If something is not required by the standard, you can build a conforming implementation that intentionally fails to comply with any other assumptions. Some of these assumption are true most of the time just because it's convenient to implement a pointer as an integer representing a memory address that can be directly fetched by the processor but this is just a consequent of "convenience" and can't be held as a universal truth.
Not required by the standard (see this question). For instance, sizeof(int*) can be unequal to size(double*). void* is guaranteed to be able to store any pointer value.
Not required by the standard. By definition, size is a part of representation. If the size can be different, the representation can be different too.
Not necessarily. In fact, "the bit length of an architecture" is a vague statement. What is a 64-bit processor, really? Is it the address bus? Size of registers? Data bus? What?
It doesn't make sense to "multiply" or "divide" a pointer. It's forbidden by the compiler but you can of course multiply or divide the underlying representation (which doesn't really make sense to me) and that results in undefined behavior.
Maybe I don't understand your point but everything in a digital computer is just some kind of binary number.
Yes; kind of. It's guaranteed to point to a location that's a sizeof(pointer_type) farther. It's not necessarily equivalent to arithmetic addition of a number (i.e. farther is a logical concept here. The actual representation is architecture specific)
For 6.: a pointer is not necessarily a memory address. See for example "The Great Pointer Conspiracy" by Stack Overflow user jalf:
Yes, I used the word “address” in the com­ment above. It is impor­tant to real­ize what I mean by this. I do not mean “the mem­ory address at which the data is phys­i­cally stored”, but sim­ply an abstract “what­ever we need in order to locate the value. The address of i might be any­thing, but once we have it, we can always find and mod­ify i."
And:
A pointer is not a mem­ory address! I men­tioned this above, but let’s say it again. Point­ers are typ­i­cally imple­mented by the com­piler sim­ply as mem­ory addresses, yes, but they don’t have to be."
Some further information about pointers from the C99 standard:
6.2.5 §27 guarantees that void* and char* have identical representations, ie they can be used interchangably without conversion, ie the same address is denoted by the same bit pattern (which doesn't have to be true for other pointer types)
6.3.2.3 §1 states that any pointer to an incomplete or object type can be cast to (and from) void* and back again and still be valid; this doesn't include function pointers!
6.3.2.3 §6 states that void* can be cast to (and from) integers and 7.18.1.4 §1 provides apropriate types intptr_t and uintptr_t; the problem: these types are optional - the standard explicitly mentions that there need not be an integer type large enough to actually hold the value of the pointer!
sizeof(char*) != sizeof(void(*)(void) ? - Not on x86 in 36 bit addressing mode (supported on pretty much every Intel CPU since Pentium 1)
"The in-memory representation of a pointer is the same as an integer of the same bit length" - there's no in-memory representation on any modern architecture; tagged memory has never caught on and was already obsolete before C was standardized. Memory in fact doesn't even hold integers, just bits and arguably words (not bytes; most physical memory doesn't allow you to read just 8 bits.)
"Multiplication of pointers is impossible" - 68000 family; address registers (the ones holding pointers) didn't support that IIRC.
"All pointers can be cast to integers" - Not on PICs.
"Incrementing a T* is equivalent to adding sizeof(T) to the memory address" - true by definition. Also equivalent to &pointer[1].
I don't know about the others, but for DOS, the assumption in #3 is untrue. DOS is 16 bit and uses various tricks to map many more than 16 bits worth of memory.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture.
I think this assumption is false because on the 80186, for example, a 32-bit pointer is held in two registers (an offset register an a segment register), and which half-word went in which register matters during access.
Multiplication and division of pointer data types are only forbidden by the compiler.
You can't multiply or divide types. ;P
I'm unsure why you would want to multiply or divide a pointer.
All pointer values can be casted to a single integer. In other words, what architectures still make use of segments and offsets?
The C99 standard allows pointers to be stored in intptr_t, which is an integer type. So, yes.
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer. If p is an int32* then p+1 is equal to the memory address 4 bytes after p.
x + y where x is a T * and y is an integer is equivilent to (T *)((intptr_t)x + y * sizeof(T)) as far as I know. Alignment may be an issue, but padding may be provided in the sizeof. I'm not really sure.
In general, the answer to all of the questions is "yes", and it's because only those machines that implement popular languages directly saw the light of day and persisted into the current century. Although the language standards reserve the right to vary these "invariants", or assertions, it hasn't ever happened in real products, with the possible exception of items 3 and 4 which require some restatement to be universally true.
It's certainly possible to build segmented MMU designs, which correspond roughly with the capability-based architectures that were popular academically in past years, but no such system has typically seen common use with such features enabled. Such a system might have conflicted with the assertions as it would probably have had large pointers.
In addition to segmented/capability MMUs, which often have large pointers, more extreme designs have tried to encode data types in pointers. Few of these were ever built. (This question brings up all of the alternatives to the basic word-oriented, a pointer-is-a-word architectures.)
Specifically:
The in-memory representation of all pointers for a given architecture is the same regardless of the data type pointed to. True except for extremely wacky past designs that tried to implement protection not in strongly-typed languages but in hardware.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture. Maybe, certainly some sort of integral type is the same, see LP64 vs LLP64.
Multiplication and division of pointer data types are only forbidden by the compiler. Right.
All pointer values can be casted to a single integer. In other words, what architectures still make use of segments and offsets? Nothing uses segments and offsets today, but a C int is often not big enough, you may need a long or long long to hold a pointer.
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer. If p is an int32* then p+1 is equal to the memory address 4 bytes after p. Yes.
It is interesting to note that every Intel Architecture CPU, i.e., every single PeeCee, contains an elaborate segmentation unit of epic, legendary, complexity. However, it is effectively disabled. Whenever a PC OS boots up, it sets the segment bases to 0 and the segment lengths to ~0, nulling out the segments and giving a flat memory model.
There were lots of "word addressed" architectures in the 1950s, 1960s and 1970s. But I cannot recall any mainstream examples that had a C compiler. I recall the ICL / Three Rivers PERQ machines in the 1980s that was word addressed and had a writable control store (microcode). One of its instantiations had a C compiler and a flavor of Unix called PNX, but the C compiler required special microcode.
The basic problem is that char* types on word addressed machines are awkward, however you implement them. You often up with sizeof(int *) != sizeof(char *) ...
Interestingly, before C there was a language called BCPL in which the basic pointer type was a word address; that is, incrementing a pointer gave you the address of the next word, and ptr!1 gave you the word at ptr + 1. There was a different operator for addressing a byte: ptr%42 if I recall.
EDIT: Don't answer questions when your blood sugar is low. Your brain (certainly, mine) doesn't work as you expect. :-(
Minor nitpick:
p is an int32* then p+1
is wrong, it needs to be unsigned int32, otherwise it will wrap at 2GB.
Interesting oddity - I got this from the author of the C compiler for the Transputer chip - he told me that for that compiler, NULL was defined as -2GB. Why? Because the Transputer had a signed address range: -2GB to +2GB. Can you beleive that? Amazing isn't it?
I've since met various people that have told me that defining NULL like that is broken. I agree, but if you don't you end up NULL pointers being in the middle of your address range.
I think most of us can be glad we're not working on Transputers!
I would like to know architectures which violate the assumptions I've
listed below.
I see that Stephen C mentioned PERQ machines, and MSalters mentioned 68000s and PICs.
I'm disappointed that no one else actually answered the question by naming any of the weird and wonderful architectures that have standards-compliant C compilers that don't fit certain unwarranted assumptions.
sizeof(int *) == sizeof(char *) == sizeof(void *) == sizeof(func_ptr
*) ?
Not necessarily. Some examples:
Most compilers for Harvard-architecture 8-bit processors -- PIC and 8051 and M8C -- make sizeof(int *) == sizeof(char *),
but different from the sizeof(func_ptr *).
Some of the very small chips in those families have 256 bytes of RAM (or less) but several kilobytes of PROGMEM (Flash or ROM), so compilers often make sizeof(int *) == sizeof(char *) equal to 1 (a single 8-bit byte), but sizeof(func_ptr *) equal to 2 (two 8-bit bytes).
Compilers for many of the larger chips in those families with a few kilobytes of RAM and 128 or so kilobytes of PROGMEM make sizeof(int *) == sizeof(char *) equal to 2 (two 8-bit bytes), but sizeof(func_ptr *) equal to 3 (three 8-bit bytes).
A few Harvard-architecture chips can store exactly a full 2^16 ("64KByte") of PROGMEM (Flash or ROM), and another 2^16 ("64KByte") of RAM + memory-mapped I/O.
The compilers for such a chip make sizeof(func_ptr *) always be 2 (two bytes);
but often have a way to make the other kinds of pointers sizeof(int *) == sizeof(char *) == sizeof(void *) into a a "long ptr" 3-byte generic pointer that has the extra magic bit that indicates whether that pointer points into RAM or PROGMEM.
(That's the kind of pointer you need to pass to a "print_text_to_the_LCD()" function when you call that function from many different subroutines, sometimes with the address of a variable string in buffer that could be anywhere in RAM, and other times with one of many constant strings that could be anywhere in PROGMEM).
Such compilers often have special keywords ("short" or "near", "long" or "far") to let programmers specifically indicate three different kinds of char pointers in the same program -- constant strings that only need 2 bytes to indicate where in PROGMEM they are located, non-constant strings that only need 2 bytes to indicate where in RAM they are located, and the kind of 3-byte pointers that "print_text_to_the_LCD()" accepts.
Most computers built in the 1950s and 1960s use a 36-bit word length or an 18-bit word length, with an 18-bit (or less) address bus.
I hear that C compilers for such computers often use 9-bit bytes,
with sizeof(int *) == sizeof(func_ptr *) = 2 which gives 18 bits, since all integers and functions have to be word-aligned; but sizeof(char *) == sizeof(void *) == 4 to take advantage of special PDP-10 instructions that store such pointers in a full 36-bit word.
That full 36-bit word includes a 18-bit word address, and a few more bits in the other 18-bits that (among other things) indicate the bit position of the pointed-to character within that word.
The in-memory representation of all pointers for a given architecture
is the same regardless of the data type pointed to?
Not necessarily. Some examples:
On any one of the architectures I mentioned above, pointers come in different sizes. So how could they possibly have "the same" representation?
Some compilers on some systems use "descriptors" to implement character pointers and other kinds of pointers.
Such a descriptor is different for a pointer pointing to the first "char" in a "char big_array[4000]" than for a pointer pointing to the first "char" in a "char small_array[10]", which are arguably different data types, even when the small array happens to start at exactly the same location in memory previously occupied by the big array.
Descriptors allow such machines to catch and trap the buffer overflows that cause such problems on other machines.
The "Low-Fat Pointers" used in the SAFElite and similar "soft processors" have analogous "extra information" about the size of the buffer that the pointer points into. Low-Fat pointers have the same advantage of catching and trapping buffer overflows.
The in-memory representation of a pointer is the same as an integer of
the same bit length as the architecture?
Not necessarily. Some examples:
In "tagged architecture" machines, each word of memory has some bits that indicate whether that word is an integer, or a pointer, or something else.
With such machines, looking at the tag bits would tell you whether that word was an integer or a pointer.
I hear that Nova minicomputers have an "indirection bit" in each word which inspired "indirect threaded code". It sounds like storing an integer clears that bit, while storing a pointer sets that bit.
Multiplication and division of pointer data types are only forbidden
by the compiler. NOTE: Yes, I know this is nonsensical. What I mean is
- is there hardware support to forbid this incorrect usage?
Yes, some hardware doesn't directly support such operations.
As others have already mentioned, the "multiply" instruction in the 68000 and the 6809 only work with (some) "data registers"; they can't be directly applied to values in "address registers".
(It would be pretty easy for a compiler to work around such restrictions -- to MOV those values from an address register to the appropriate data register, and then use MUL).
All pointer values can be casted to a single data type?
Yes.
In order for memcpy() to work right, the C standard mandates that every pointer value of every kind can be cast to a void pointer ("void *").
The compiler is required to make this work, even for architectures that still use segments and offsets.
All pointer values can be casted to a single integer? In other words,
what architectures still make use of segments and offsets?
I'm not sure.
I suspect that all pointer values can be cast to the "size_t" and "ptrdiff_t" integral data types defined in "<stddef.h>".
Incrementing a pointer is equivalent to adding sizeof(the pointed data
type) to the memory address stored by the pointer. If p is an int32*
then p+1 is equal to the memory address 4 bytes after p.
It is unclear what you are asking here.
Q: If I have an array of some kind of structure or primitive data type (for example, a "#include <stdint.h> ... int32_t example_array[1000]; ..."), and I increment a pointer that points into that array (for example, "int32_t p = &example_array[99]; ... p++; ..."), does the pointer now point to the very next consecutive member of that array, which is sizeof(the pointed data type) bytes further along in memory?
A: Yes, the compiler must make the pointer, after incrementing it once, point at the next independent consecutive int32_t in the array, sizeof(the pointed data type) bytes further along in memory, in order to be standards compliant.
Q: So, if p is an int32* , then p+1 is equal to the memory address 4 bytes after p?
A: When sizeof( int32_t ) is actually equal to 4, yes. Otherwise, such as for certain word-addressable machines including some modern DSPs where sizeof( int32_t ) may equal 2 or even 1, then p+1 is equal to the memory address 2 or even 1 "C bytes" after p.
Q: So if I take the pointer, and cast it into an "int" ...
A: One type of "All the world's a VAX heresy".
Q: ... and then cast that "int" back into a pointer ...
A: Another type of "All the world's a VAX heresy".
Q: So if I take the pointer p which is a pointer to an int32_t, and cast it into some integral type that is plenty big enough to contain the pointer, and then add sizeof( int32_t ) to that integral type, and then later cast that integral type back into a pointer -- when I do all that, the resulting pointer is equal to p+1?
Not necessarily.
Lots of DSPs and a few other modern chips have word-oriented addressing, rather than the byte-oriented processing used by 8-bit chips.
Some of the C compilers for such chips cram 2 characters into each word, but it takes 2 such words to hold a int32_t -- so they report that sizeof( int32_t ) is 4.
(I've heard rumors that there's a C compiler for the 24-bit Motorola 56000 that does this).
The compiler is required to arrange things such that doing "p++" with a pointer to an int32_t increments the pointer to the next int32_t value.
There are several ways for the compiler to do that.
One standards-compliant way is to store each pointer to a int32_t as a "native word address".
Because it takes 2 words to hold a single int32_t value, the C compiler compiles "int32_t * p; ... p++" into some assembly language that increments that pointer value by 2.
On the other hand, if that one does "int32_t * p; ... int x = (int)p; x += sizeof( int32_t ); p = (int32_t *)x;", that C compiler for the 56000 will likely compile it to assembly language that increments the pointer value by 4.
I'm most used to pointers being used in a contiguous, virtual memory
space.
Several PIC and 8086 and other systems have non-contiguous RAM --
a few blocks of RAM at addresses that "made the hardware simpler".
With memory-mapped I/O or nothing at all attached to the gaps in address space between those blocks.
It's even more awkward than it sounds.
In some cases -- such as with the bit-banding hardware used to avoid problems caused by read-modify-write -- the exact same bit in RAM can be read or written using 2 or more different addresses.

Resources