I thought it should be as simple as:
uint32_t getCrc(void)
{
uint32_t expectedCrc = *(uint32_t*)0x27FF0;
return expectedCrc;
}
And if the memory location has the following value:
Then expectedCrc should be equal to 0xECD8743D
But surprisingly, the value is: 0x82828282
I tried defining a pointer to uint32_t and assigning the memory address to it as follows:
uint32_t getCrc(void)
{
uint32_t *ptr = (uint32_t*)0x27FF0;
uint32_t expectedCrc = *ptr;
return expectedCrc;
}
But the value of the pointer itself was 0xFFFF and expectedCrc was equal to 0x82828282.
I found these two values in a different memory address:
I also tried the same with char *ptr = (char*)0x27FF0 but it gave the same values.
Finally, I tried to check what is the size of a pointer to char in this controller using uint8_t size = sizeof(char*); and the answer was 0xb0 which equals 176.
I think it has something to do with the 24-bit memory address and the CPU architecture. I'm working on stm8 controller.
Could someone explain why this happens?
UPDATE
I tried replacing the address 0x27FF0 with 0xFFF0 and it worked fine. So the problem is with the long address. I want to write the CRC value at the last address to avoid overwriting it with the code itself in case the program grows. How can I handle this?
From the Cosmic compiler datasheet https://www.cosmic-software.com/pdf/cxstm8_pd.pdf
cxstm8 provides 2 different memory models depending on the size of the application.
For applications smaller than 64k, the “section 0” memory model provides the best code density by defaulting function calls and pointers to 2 bytes.
For applications bigger than 64k, the standard memory model provides the best flexibility for using easily the linear addressing space. Each model comes with its own set of libraries.
This may be the cause of your problem. If you want to directly access a memory location above the 16-bit address range, you need to use the correct memory model.
As @Raje answered, it is about the memory models. I did further reading in the COSMIC user's guide and found the following:
The STM8 compiler supports two memory models for application
larger than 64K, allowing you to choose the most efficient behavior
depending on your processor configuration and your application. All
these models allow the code to be larger than 64K and then function
pointers are defaulted to @far pointers (3 bytes). Data pointers are
defaulted to @near pointers (2 bytes) unless explicitly declared with
the @far modifier.
Therefore, the solution was to add @far to the pointer's type as follows:
uint32_t calculatedCrc = 0;
expectedCrc = *(@far uint32_t*)0x27FF0;
This got the problem solved.
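To see why a 2-byte data pointer cannot reach 0x27FF0, here is a small portable sketch. It assumes, purely for illustration, that a near pointer keeps only the low 16 bits of the address; the value the Cosmic compiler actually produces may differ. Both helpers are hypothetical and are not part of any compiler API.

```c
#include <stdint.h>

/* Hypothetical helper: models a 16-bit (near) data pointer by keeping
 * only the low 16 bits of a wider address. This is an assumption made
 * for illustration, not documented compiler behaviour. */
uint16_t truncate_to_near(uint32_t address)
{
    return (uint16_t)(address & 0xFFFFu);
}

/* Returns 1 if the address survives the trip through a near pointer. */
int fits_in_near_pointer(uint32_t address)
{
    return (uint32_t)truncate_to_near(address) == address;
}
```

Under that assumption, 0xFFF0 fits in a near pointer while 0x27FF0 does not, which is consistent with the observation that only the short address worked without the far modifier.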
I want an array of pointers and I want to set byte values in the memory addresses where the pointers (of the array) are pointing.
Would this work:
unsigned int *pointer[4] = {(unsigned int *) 0xFF200020, (unsigned int *) 0xFF20001C, (unsigned int *) 0xFF200018, (unsigned int *) 0xFF200014};
*pointer[0] = 0b0111111; // the value is correct for the address
Or is the syntax somehow different?
EDIT:
I'm coding for an SOC board and these are memory addresses that contain the case of some UI elements.
unsigned int *element1 = (unsigned int *) 0xFF200020;
*element1 = 0b0111111;
works so I'm just interested about the C syntax of this.
EDIT2: There was one 0 too much in ... = 0b0...
Short answer:
Everything you've written is fine.
Thoughts:
I'm a big fan of using the types from stdint.h. This would let you write uint32_t which is more clearly a 32 bit unsigned number than unsigned long.
You'll often see people write macros to refer to these registers:
#define REG_IRQ (*(volatile uint32_t *)(0xFF200020))
REG_IRQ = 0x42;
It's possible that you actually want these pointers to be to volatile integers. You want it to be volatile if the value can change outside of the execution of your program. That is, if that memory position doesn't act strictly like a piece of memory. (For example, it's a register that stores the interrupt flags).
With most compilers I've used on embedded platforms, you'll have problems from ignoring volatile once optimizations have been enabled.
0b00111111 is, sadly, non-standard before C23 (it is a common compiler extension, for example in GCC). In older standards you can use octal, decimal, or hexadecimal.
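Putting the points above together, here is a minimal sketch of an array of pointers to volatile 32-bit registers. The function name write_all is made up for illustration, and the board addresses in the comment are the ones from the question; whether they are really 32-bit registers is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes `value` through every pointer in `regs`. The pointees are
 * volatile, so the compiler may not drop or merge the stores. */
void write_all(volatile uint32_t *const regs[], size_t count, uint32_t value)
{
    for (size_t i = 0; i < count; i++) {
        *regs[i] = value;
    }
}

/* On the target board this could be used as follows:
 *
 *   volatile uint32_t *const elements[4] = {
 *       (volatile uint32_t *)0xFF200020, (volatile uint32_t *)0xFF20001C,
 *       (volatile uint32_t *)0xFF200018, (volatile uint32_t *)0xFF200014
 *   };
 *   write_all(elements, 4, 0x3F);    (0x3F is 0b0111111 in hexadecimal)
 */
```

On a desktop machine the same function can be exercised with pointers to ordinary variables, which avoids touching addresses the process does not own.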
Sure, this should work, providing you can find addresses in your own segment.
Most probably you'll get a segmentation fault when running this code, because 0xFF200020 has very little chance of being in your program's segment.
This will not throw any error and will work fine, but hard-coding the memory address a pointer points to is not a good idea. Dereferencing an unknown or non-existent memory location will cause a segmentation fault, but if you are sure about the memory location, hard-coding the address as done here is totally fine.
A common situation while coding in C is to be writing functions which return pointers. In case some error occurred within the written function during runtime, NULL may be returned to indicate an error. NULL is just the special memory address 0x0, which is never used for anything but to indicate the occurrence of a special condition.
My question is, are there any other special memory addresses which never will be used for userland application data?
The reason I want to know this is because it could effectively be used for error handling. Consider this:
#include <stdlib.h>
#include <stdio.h>
#define ERROR_NULL 0x0
#define ERROR_ZERO 0x1
int *example(int *a) {
if (*a < 0)
return ERROR_NULL;
if (*a == 0)
return (void *) ERROR_ZERO;
return a;
}
int main(int argc, char **argv) {
if (argc != 2) return -1;
int *result;
int a = atoi(argv[1]);
switch ((int) (result = example(&a))) {
case ERROR_NULL:
printf("Below zero!\n");
break;
case ERROR_ZERO:
printf("Is zero!\n");
break;
default:
printf("Is %d!\n", *result);
break;
}
return 0;
}
Knowing some special span of addresses which never will be used by userland applications could effectively be utilized for more efficient and cleaner condition handling. If you know about this, for which platforms does it apply?
I guess spans would be operating system specific. I'm mostly interested in Linux, but it would be nice to know for OS X, Windows, Android and other systems as well.
NULL is just the special memory address 0x0, which is never used for anything but to indicate the occurrence of a special condition.
That is not exactly right: there are computers where NULL pointer is not a zero internally (link).
are there any other special memory addresses which never will be used for userland applications?
Even NULL is not universal; there are no other universally unused memory addresses, which is not surprising, considering the number of different platforms programmable in C.
However, nobody stops you from defining your own special address in memory, setting it in a global variable, and treating it as your error indicator. This will work on all platforms, and would not require a special address location.
In the header:
extern void* ERROR_ADDRESS;
In a C file:
static int UNUSED;
void *ERROR_ADDRESS = &UNUSED;
At this point, ERROR_ADDRESS points to a globally unique location (i.e. the location of UNUSED, which is local to the compilation unit where it is defined), which you can use in testing pointers for equality.
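As a minimal sketch of how such a sentinel might be used, the question's function could look like this. The names unused_marker, ERROR_ZERO_ADDRESS and classify are made up for illustration; the sentinel is only ever compared with ==, never dereferenced.

```c
#include <stddef.h>

/* A private object whose only purpose is to own a unique address. */
static int unused_marker;

/* Hypothetical sentinel, playing the role of ERROR_ADDRESS above. */
void *const ERROR_ZERO_ADDRESS = &unused_marker;

/* Returns NULL for negative input, the sentinel for zero, else `a`. */
int *classify(int *a)
{
    if (*a < 0)
        return NULL;
    if (*a == 0)
        return ERROR_ZERO_ADDRESS;
    return a;
}
```

A caller checks the result against NULL and ERROR_ZERO_ADDRESS before dereferencing it.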
The answer depends a lot on your C compiler and on your CPU and OS, where your compiled C program is going to run.
Your userland applications typically will never be able to access data or code through pointers pointing to the OS kernel data and code. And the OS usually does not return such pointers to applications.
Typically they will also never get a pointer pointing to a location that's not backed up by physical memory. You can only get such pointers through an error (a code bug) or by purposefully constructing such a pointer.
The C standard does not anyhow define what a valid range for pointers is and isn't. In C valid pointers are either NULL pointers or pointers to objects whose lifetime hasn't ended yet and those can be your global and local variables and those created in malloc()'d memory and functions. The OS may extend this range by returning:
pointers to code or data objects not explicitly defined in your C program at its source code level (the OS may let apps access some of its code or data directly, but this is uncommon, or the OS may let apps access some of their parts that are either created by the OS when the app loads or created by the compiler when the app was compiled, one example would be Windows letting apps examine their executable PE image, you can ask Windows where the image starts in the memory)
pointers to data buffers allocated by the OS for/on behalf of apps (here, usually, the OS would use its own APIs and not your app's malloc()/free(), and you'd be required to use the appropriate OS-specific function to release this memory)
OS-specific pointers that can't be dereferenced and only serve as error indicators (e.g. you could have more than just one undereferenceable pointer like NULL and your ERROR_ZERO is a possible candidate)
I would generally discourage use of hard-coded and magic pointers in programs.
If for some reason, a pointer is the only way to communicate error conditions and there are more than one of them, you could do this:
char ErrorVars[5] = { 0 };
void* ErrorPointer1 = &ErrorVars[0];
void* ErrorPointer2 = &ErrorVars[1];
...
void* ErrorPointer5 = &ErrorVars[4];
You can then return ErrorPointer1 through ErrorPointer5 on different error conditions and then compare the returned value against them. There's a caveat here, though. You cannot legally compare a returned pointer with an arbitrary pointer using >, >=, <, <=. That's only legal when both pointers point to or into the same object. So, if you wanted a quick check like this:
if ((char*)(p = myFunction()) >= (char*)ErrorPointer1 &&
(char*)p <= (char*)ErrorPointer5)
{
// handle the error
}
else
{
// success, do something else
}
it would only be legal if p equals one of those 5 error pointers. If it's not, your program can legally behave in any imaginable and unimaginable way (this is because the C standard says so). To avoid this situation you'll have to compare the pointer against each error pointer individually:
if ((p = myFunction()) == ErrorPointer1)
HandleError1();
else if (p == ErrorPointer2)
HandleError2();
else if (p == ErrorPointer3)
HandleError3();
...
else if (p == ErrorPointer5)
HandleError5();
else
DoSomethingElse();
Again, what a pointer is and what its representation is, is compiler- and OS/CPU-specific. The C standard itself does not mandate any specific representation or range of valid and invalid pointers, so long as those pointers function as prescribed by the C standard (e.g. pointer arithmetic works with them). There's a good question on the topic.
So, if your goal is to write portable C code, don't use hard-coded and "magic" pointers and prefer using something else to communicate error conditions.
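If the chain of individual comparisons feels repetitive, it can be folded into a loop that still uses only ==, which is legal for any pair of pointers. This is a sketch with made-up names (error_vars, error_pointer, error_index), not part of the answer's original code.

```c
#include <assert.h>
#include <stddef.h>

#define ERROR_COUNT 5

/* One byte per error condition; only the addresses matter. */
static char error_vars[ERROR_COUNT];

/* Returns the sentinel pointer for error number `index` (0-based). */
void *error_pointer(size_t index)
{
    assert(index < ERROR_COUNT);
    return &error_vars[index];
}

/* Returns the 0-based error number if `p` is one of the sentinels,
 * or -1 otherwise. Only == is used, so `p` may point anywhere. */
int error_index(const void *p)
{
    for (size_t i = 0; i < ERROR_COUNT; i++) {
        if (p == (const void *)&error_vars[i])
            return (int)i;
    }
    return -1;
}
```

The caller can then switch on the value returned by error_index and treat -1 as success.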
It completely depends on both the computer and the operating system. For example, on a computer with memory-mapped IO like the Game Boy Advance, you probably don't want to confuse the address for "what color is the upper left pixel" with userland data:
http://www.coranac.com/tonc/text/hardware.htm#sec-memory
You should not be worrying about addresses as a programmer, because they differ between platforms, and there are quite a few layers between actual hardware addresses and your application. Physical-to-virtual translation is one of the big ones: the virtual address space is mapped into memory, and on most modern operating systems each process has its own address space, protected at the hardware level from other processes.
What you are specifying here are just hexadecimal values, they aren't interpreted as addresses. A pointer set to NULL is essentially saying it doesn't point to anything, not even address zero. It's just NULL. Whatever the value of that may be, depends on platform, compiler and a lot of other things.
Setting a pointer to any other value is not defined. A pointer is a variable that stores the address of another, what you're trying to do is give this pointer some other value than what is valid.
This code:
#define ERROR_NULL 0x0
#define ERROR_ZERO 0x1
int *example(int *a) {
if (*a < 0)
return ERROR_NULL;
if (*a == 0)
return (void *) ERROR_ZERO;
return a;
}
defines a function example that takes input parameter a and returns the output as a pointer to int. At the same time, when the error occurs, this function abuses cast to void* to return the error code to the caller in the same way it returns the correct output data. This approach is wrong, because the caller must know that sometimes valid output is received, but it doesn't actually contain the desired output but the error code instead.
are there any other special memory addresses which never will be used ... ?
... it could effectively be used for error handling
Don't make any assumptions about the possible address that might be returned. When you need to pass a return code to the caller, you should do it in more straightforward way. You could take the pointer to the output data as a parameter and return the error code that identifies success or failure:
#define SUCCESS 0x0
#define ERROR_NULL 0x1
#define ERROR_ZERO 0x2
int example(int *a, int** out) {
if (...)
return ERROR_NULL;
if (...)
return ERROR_ZERO;
*out = a;
return SUCCESS;
}
...
int* out = NULL;
int retVal = example(..., &out);
if (retVal != SUCCESS)
...
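For reference, a complete version of this pattern might look as follows. The two conditions are borrowed from the question's code (negative and zero input); the enum and the function name example_checked are made up for illustration.

```c
#include <stddef.h>

enum status { STATUS_OK = 0, STATUS_NEGATIVE = 1, STATUS_ZERO = 2 };

/* Stores `a` in *out on success; *out is left untouched on failure. */
enum status example_checked(int *a, int **out)
{
    if (*a < 0)
        return STATUS_NEGATIVE;
    if (*a == 0)
        return STATUS_ZERO;
    *out = a;
    return STATUS_OK;
}
```

The return value carries only the status and the pointer carries only data, so no pointer value has to be reserved as an error marker.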
Actually NULL(0) is a valid address. But it's not an address that you can typically write to.
From memory, NULL could be a different value on some old VAX hardware with some very old c compiler. Maybe someone can confirm that. It will always be 0 now as the C standard defines it - see this question Is NULL always false?
Typically the way errors are returned from functions is to set errno. You could piggyback on this if the error codes make sense in the particular situation. However, if you need your own errors then you could do the same thing as the errno method.
Personally I prefer to not return void* but make the function take a void** and return the result there. Then you can return an error code directly where 0 = success.
e.g.
int posix_memalign(void **memptr, size_t alignment, size_t size);
Note the allocated memory is returned in memptr. The result code is returned by the function call. Unlike malloc.
void *malloc(size_t size)
On Linux, on 64-bit and when using the x86_64 architecture (either from Intel or AMD), only 48 bits of the total 64-bit address space are used (hardware limitation AFAIK). Basically, any address after 2⁴⁷ until 2⁶² can be used now as it will not be allocated.
For some background, the virtual address space of a Linux process is made of a user and a kernel space. On the above-mentioned architecture, the first 47 bits (128 TB) are used for the user space. The kernel space is at the end of the spectrum, so the last 128 TB of the full 64-bit address space. In between is terra incognita. That could change at any time in the future, though, and this is not portable.
But I could think of many other ways to return an error than your method, so I do not see the advantage of using such a hack.
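The arithmetic behind those ranges can be written down as a short sketch. It assumes x86-64 with 48-bit virtual addresses; processors and kernels using wider virtual addresses (5-level paging) have different boundaries. All names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumes x86-64 with 48-bit virtual addresses. */
#define USER_SPACE_END     (UINT64_C(1) << 47)                  /* 128 TiB */
#define KERNEL_SPACE_START (~UINT64_C(0) - USER_SPACE_END + 1)  /* last 128 TiB */

/* Lower half: addresses a user process can be given. */
bool in_user_half(uint64_t addr)
{
    return addr < USER_SPACE_END;
}

/* Upper half: addresses reserved for the kernel. */
bool in_kernel_half(uint64_t addr)
{
    return addr >= KERNEL_SPACE_START;
}

/* Everything in between is never handed out under this assumption. */
bool in_unused_gap(uint64_t addr)
{
    return !in_user_half(addr) && !in_kernel_half(addr);
}
```

With these boundaries the user half ends just below 0x0000800000000000 and the kernel half starts at 0xFFFF800000000000.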
TL;DR:
Use -1 if you want just one more error condition beside NULL
For more special conditions just set the least significant bit(s), because the returned value from malloc() family or new is guaranteed to be aligned for any fundamental alignment and will have the low bits always zero, so they're free for use (like in a tagged pointer)
If allocation succeeds, returns a pointer that is suitably aligned for any object type with fundamental alignment.
https://en.cppreference.com/w/c/memory/malloc
Pointers to types wider than char are also always aligned. If you point to a char or a char array on stack then just align as necessary with alignas
For even more conditions you can limit the range of allocated addresses. This needs platform-specific code and there won't be a portable solution
As others said, it highly depends. However if you're on a platform with dynamic allocation then -1 is (extremely likely) a safe value.
That's because the memory allocator gives out memory in BIG BLOCKS instead of just single bytes§. Therefore the last address that can be returned would be -block_size. For example if block_size is 4 then the last block will span across the addresses { -4, -3, -2, -1 }, and the last possible address will be -4 = 0xFFFF...FFFC. As a result, -1 will never be returned by the malloc() family
Various system functions on Linux also return -1 for an invalid pointer instead of NULL, for example mmap() and shmat(). Win32 APIs that return a handle can also return NULL (0) or INVALID_HANDLE_VALUE (-1) for a failure case or an ill-formed handle. They have to do that because sometimes NULL is a valid memory address. In fact if you're on a Harvard architecture then location zero in the data space is quite usable. And even on von Neumann architectures then what you said
"NULL is just the special memory address 0x0, which is never used for anything but to indicate the occurrence of a special condition"
is still wrong, because the address 0 is also valid. It's just that most modern OSes map the page zero somehow to make it trap when user space code dereferences it. Yet the page is accessible from within kernel code. There were some exploits related to NULL pointer dereference bug in Linux kernel
In fact, quite contrary to the zero page's original preferential use, some modern operating systems such as FreeBSD, Linux and Microsoft Windows actually make the zero page inaccessible to trap uses of NULL pointers. This is useful, as NULL pointers are the method used to represent the value of a reference that points to nothing
https://en.wikipedia.org/wiki/Zero_page
In MSVC and GCC, a NULL pointer to member is also represented as the bit pattern 0xFFFFFFFF on a 32-bit machine. And in AMD GCN NULL pointer also has a value of -1
You can go even further and return a lot more error codes by exploiting the fact that pointers are normally aligned. For example malloc always "aligns memory suitable for any object type (which, in practice, means that it is aligned to alignof(max_align_t))"
how does malloc understand alignment?
Which guarantees does malloc make about memory alignment?
Nowadays the default alignment for malloc is 8 or 16 bytes depending on whether you're on a 32 or 64-bit OS, which means you'll have at least 3 bits available for error reporting or any purposes of yours. And if you use a pointer to a type wider than char then it's always aligned. So generally there's nothing to worry about unless you want to return a char pointer that's not output from malloc (in which case you can align easily). Just check the least significant bit to see whether it's a valid pointer or not
int* result = func();
if ((uintptr_t)result & 1)
error_happened(); // now the high bits can be examined to check the error condition
In case of 16-byte alignment, the last 4 bits of a valid address are always 0s, and the number of valid addresses is only ¹⁄₁₆ of the total number of bit patterns, which means you can return at most ¹⁵⁄₁₆ × 2⁶⁴ error codes with a 64-bit pointer. Then there's aligned_alloc if you want more least significant bits.
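A minimal sketch of the low-bit trick follows. It assumes that every valid result is at least 2-byte aligned (so its lowest bit is clear) and that converting through uintptr_t behaves like a flat address, which is common but not guaranteed by the standard. The helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Encodes a small error code as a pointer whose lowest bit is set.
 * The result must never be dereferenced. */
void *make_error(unsigned code)
{
    return (void *)(((uintptr_t)code << 1) | 1u);
}

/* True if the pointer carries an error code instead of an address. */
bool is_error(const void *p)
{
    return ((uintptr_t)p & 1u) != 0;
}

/* Recovers the error code stored by make_error(). */
unsigned error_code(const void *p)
{
    return (unsigned)((uintptr_t)p >> 1);
}
```

A function would return make_error(n) on failure, and the caller would test is_error before using the result as a real pointer.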
That trick has been used for storing some information in the pointer itself. On many 64-bit platforms you can also use the high bits to store more data. See Using the extra 16 bits in 64-bit pointers
You can even go to the far extreme by limiting the range of the allocated pointers with some help from the OS. For example if you specify that the pointers must be allocated in the range 2-3GB then any addresses below 2GB and above 3GB will be available for you to indicate an error condition. On how to do that see:
Allocating Memory Within A 2GB Range
How can I ensure that the virtual memory address allocated by VirtualAlloc is between 2-4GB
Allocate at low memory address
How to malloc in address range > 4 GiB
Custom heap/memory allocation ranges
See also
Is ((void *) -1) a valid address?
§ That's obvious since some bookkeeping information about each allocated block needs to be stored; therefore the block size must be much larger than the metadata, otherwise the metadata alone would take more space than the RAM available. Thus if you call malloc(1), it still has to reserve a full block for you.
int main()
{
int *p,*q;
p=(int *)1000;
q=(int *)2000;
printf("%d:%d:%d",q,p,(q-p));
}
output
2000:1000:250
1. I cannot understand the p=(int *)1000; line. Does this mean that p is pointing to address location 1000? What if I do *p=22: is this value stored at address 1000, overwriting the existing value? If it overwrites the value, what if another program is working with address 1000?
2. How is q-p=250?
EDIT: I tried printf("%u:%u:%u",q,p,(q-p)); the output is the same
int main()
{
int *p;
int i=5;
p=&i;
printf("%u:%d",p,i);
return 0;
}
the output
3214158860:5
does this mean the addresses used by compiler are integers? there is no difference between normal integers and address integers?
does this mean that p is pointing to 1000 address location?
Yes.
what if I do *p=22
It's invoking undefined behavior - your program will most likely crash with a segfault.
Note that in modern OSes, addresses are virtual - you can't overwrite another process's address space like this, but you can attempt to write to an invalid memory location in your own process's address space.
how q-p=250?
Because pointer arithmetic works like this (in order to be compatible with array indexing). The difference of two pointers is the difference of their value divided by sizeof(*ptr). Similarly, adding n to a pointer ptr of type T results in a numeric value ptr + n * sizeof(T).
Read this on pointers.
does this mean the addresses used by compiler are integers?
That "used by compiler" part is not even necessary. Addresses are integers, it's just an abstraction in C that we have nice pointers to ease our life. If you were coding in assembly, you would just treat them as unsigned integers.
By the way, writing
printf("%u:%d", p, i);
is also undefined behavior - the %u format specifier expects an unsigned int, and not a pointer. To print a pointer, use %p:
printf("%p:%d", (void *)p, i);
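The division by sizeof can be observed without inventing addresses by using two pointers into the same array, which is the only case where pointer subtraction is well defined. This is a small sketch; the function names are made up for illustration.

```c
#include <stddef.h>

/* Distance between two pointers into the same array, in elements. */
ptrdiff_t element_distance(const int *from, const int *to)
{
    return to - from;
}

/* The same distance measured in bytes, via char pointers. */
ptrdiff_t byte_distance(const int *from, const int *to)
{
    return (const char *)to - (const char *)from;
}
```

For elements 0 and 250 of an int array, element_distance gives 250 while byte_distance gives 250 * sizeof(int), which is 1000 when int is 4 bytes wide, matching the numbers in the question.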
Yes, with *p=22 you write to 1000 address.
q-p is 250 because the size of int is 4, so it's (2000-1000)/4 = 250
The meaning of p = (int *) 1000 is implementation-defined. But yes, in a typical implementation it will make p to point to address 1000.
Doing *p = 22 afterwards will indeed attempt to store 22 at address 1000. However, in general case this attempt will lead to undefined behavior, since you are not allowed to just write data to arbitrary memory locations. You have to allocate memory in one way or another in order to be able to use it. In your example you didn't make any effort to allocate anything at address 1000. This means that most likely your program will simply crash, because it attempted to write data to a memory region that was not properly allocated. (Additionally, on many platforms in order to access data through pointers these pointers must point to properly aligned locations.)
Even if you somehow succeed in writing your 22 at address 1000, it does not mean that it will in any way affect "other programs". On some old platforms it would (DOS, for one example). But modern platforms implement independent virtual memory for each running program (process). This means that each running process has its own separate address 1000 and it cannot see the other program's address 1000.
Yes, p is pointing to virtual address 1000. If you use *p = 22;, you are likely to get a segmentation fault; quite often, the whole first 1024 bytes are invalid for reading or writing. It can't affect another program assuming you have virtual memory; each program has its own virtual address space.
The value of q - p is the number of units of sizeof(*p) or sizeof(*q) or sizeof(int) between the two addresses.
Casting an arbitrary integer to a pointer gives an implementation-defined result, and dereferencing such a pointer is undefined behavior. Anything can happen, including nothing, a segmentation fault or silently overwriting other processes' memory (unlikely in modern virtual memory models).
But we used to use absolute addresses like this back in the real mode DOS days to access interrupt tables and BIOS variables :)
About q-p == 250, it's the result of semantics of pointer arithmetic. Apparently sizeof int is 4 in your system. So when you add 1 to an int pointer it actually gets incremented by 4 so it points to the next int not the next byte. This behavior helps with array access.
does this mean that p is pointing to 1000 address location?
Yes. But address 1000 may belong to some other process's address space. In that case you are illegally accessing another process's memory, which may result in a segmentation fault.
I would like to know architectures which violate the assumptions I've listed below. Also, I would like to know if any of the assumptions are false for all architectures (that is, if any of them are just completely wrong).
sizeof(int *) == sizeof(char *) == sizeof(void *) == sizeof(func_ptr *)
The in-memory representation of all pointers for a given architecture is the same regardless of the data type pointed to.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture.
Multiplication and division of pointer data types are only forbidden by the compiler. NOTE: Yes, I know this is nonsensical. What I mean is - is there hardware support to forbid this incorrect usage?
All pointer values can be casted to a single integer. In other words, what architectures still make use of segments and offsets?
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer. If p is an int32* then p+1 is equal to the memory address 4 bytes after p.
I'm most used to pointers being used in a contiguous, virtual memory space. For that usage, I can generally get by thinking of them as addresses on a number line. See Stack Overflow question Pointer comparison.
I can't give you concrete examples of all of these, but I'll do my best.
sizeof(int *) == sizeof(char *) == sizeof(void *) == sizeof(func_ptr *)
I don't know of any systems where I know this to be false, but consider:
Mobile devices often have some amount of read-only memory in which program code and such is stored. Read-only values (const variables) may conceivably be stored in read-only memory. And since the ROM address space may be smaller than the normal RAM address space, the pointer size may be different as well. Likewise, pointers to functions may have a different size, as they may point to this read-only memory into which the program is loaded, and which can otherwise not be modified (so your data can't be stored in it).
So I don't know of any platforms on which I've observed that the above doesn't hold, but I can imagine systems where it might be the case.
The in-memory representation of all pointers for a given architecture is the same regardless of the data type pointed to.
Think of member pointers vs regular pointers. They don't have the same representation (or size). A member pointer consists of a this pointer and an offset.
And as above, it is conceivable that some CPU's would load constant data into a separate area of memory, which used a separate pointer format.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture.
Depends on how that bit length is defined. :)
An int on many 64-bit platforms is still 32 bits. But a pointer is 64 bits.
As already said, CPU's with a segmented memory model will have pointers consisting of a pair of numbers. Likewise, member pointers consist of a pair of numbers.
Multiplication and division of pointer data types are only forbidden by the compiler.
Ultimately, pointers data types only exist in the compiler. What the CPU works with is not pointers, but integers and memory addresses. So there is nowhere else where these operations on pointer types could be forbidden. You might as well ask for the CPU to forbid concatenation of C++ string objects. It can't do that because the C++ string type only exists in the C++ language, not in the generated machine code.
However, to answer what you mean, look up the Motorola 68000 CPUs. I believe they have separate registers for integers and memory addresses. Which means that they can easily forbid such nonsensical operations.
All pointer values can be casted to a single integer.
You're safe there. The C and C++ standards guarantee that this is always possible, no matter the memory space layout, CPU architecture and anything else. Specifically, they guarantee an implementation-defined mapping. In other words, you can always convert a pointer to an integer, and then convert that integer back to get the original pointer. But the C/C++ languages say nothing about what the intermediate integer value should be. That is up to the individual compiler, and the hardware it targets.
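A minimal sketch of that round trip in C uses uintptr_t, which is optional in the standard, so the sketch refuses to compile where the type is missing. The function names are made up for illustration.

```c
#include <stdint.h>

#ifndef UINTPTR_MAX
#error "This sketch needs uintptr_t, which the C standard makes optional."
#endif

/* Converts a pointer to an integer. The numeric value is
 * implementation-defined; do not assume it is a flat address. */
uintptr_t pointer_to_integer(void *p)
{
    return (uintptr_t)p;
}

/* Converts the integer back. For values produced by
 * pointer_to_integer() the result compares equal to the original. */
void *integer_to_pointer(uintptr_t value)
{
    return (void *)value;
}
```

Only the round trip is relied upon here; nothing is assumed about what the intermediate integer looks like.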
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer.
Again, this is guaranteed. If you consider that conceptually, a pointer does not point to an address, it points to an object, then this makes perfect sense. Adding one to the pointer will then obviously make it point to the next object. If an object is 20 bytes long, then incrementing the pointer will move it 20 bytes, so that it moves to the next object.
If a pointer was merely a memory address in a linear address space, if it was basically an integer, then incrementing it would add 1 to the address -- that is, it would move to the next byte.
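The 20-byte case can be checked directly. This sketch measures, through char pointers, how far the address moves when a pointer to a 20-byte object is incremented. The struct and function names are made up for illustration.

```c
#include <stddef.h>

/* A 20-byte object, matching the size used in the text. */
struct record {
    char bytes[20];
};

/* Number of bytes the address advances when a record pointer is
 * incremented by one. */
size_t step_in_bytes(const struct record *p)
{
    const struct record *next = p + 1;   /* points at the following record */
    return (size_t)((const char *)next - (const char *)p);
}
```

The step always equals sizeof(struct record), whatever that size happens to be on a given platform.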
Finally, as I mentioned in a comment to your question, keep in mind that C++ is just a language. It doesn't care which architecture it is compiled to. Many of these limitations may seem obscure on modern CPU's. But what if you're targeting yesteryear's CPU's? What if you're targeting the next decade's CPU's? You don't even know how they'll work, so you can't assume much about them. What if you're targeting a virtual machine? Compilers already exist which generate bytecode for Flash, ready to run from a website. What if you want to compile your C++ to Python source code?
Staying within the rules specified in the standard guarantees that your code will work in all these cases.
I don't have specific real world examples in mind but the "authority" is the C standard. If something is not required by the standard, you can build a conforming implementation that intentionally fails to comply with any other assumptions. Some of these assumptions are true most of the time just because it's convenient to implement a pointer as an integer representing a memory address that can be directly fetched by the processor, but this is just a consequence of "convenience" and can't be held as a universal truth.
Not required by the standard (see this question). For instance, sizeof(int*) can be unequal to sizeof(double*). void* is guaranteed to be able to store any pointer value.
Not required by the standard. By definition, size is a part of representation. If the size can be different, the representation can be different too.
Not necessarily. In fact, "the bit length of an architecture" is a vague statement. What is a 64-bit processor, really? Is it the address bus? Size of registers? Data bus? What?
It doesn't make sense to "multiply" or "divide" a pointer. It's forbidden by the compiler but you can of course multiply or divide the underlying representation (which doesn't really make sense to me) and that results in undefined behavior.
Maybe I don't understand your point but everything in a digital computer is just some kind of binary number.
Yes; kind of. It's guaranteed to point to a location that's a sizeof(pointer_type) farther. It's not necessarily equivalent to arithmetic addition of a number (i.e. farther is a logical concept here. The actual representation is architecture specific)
For 6.: a pointer is not necessarily a memory address. See for example "The Great Pointer Conspiracy" by Stack Overflow user jalf:
Yes, I used the word “address” in the comment above. It is important to realize what I mean by this. I do not mean “the memory address at which the data is physically stored”, but simply an abstract “whatever we need in order to locate the value”. The address of i might be anything, but once we have it, we can always find and modify i.
And:
A pointer is not a memory address! I mentioned this above, but let’s say it again. Pointers are typically implemented by the compiler simply as memory addresses, yes, but they don’t have to be.
Some further information about pointers from the C99 standard:
6.2.5 §27 guarantees that void* and char* have identical representations, i.e. they can be used interchangeably without conversion, i.e. the same address is denoted by the same bit pattern (which doesn't have to be true for other pointer types)
6.3.2.3 §1 states that any pointer to an incomplete or object type can be cast to (and from) void* and back again and still be valid; this doesn't include function pointers!
6.3.2.3 §6 states that void* can be cast to (and from) integers and 7.18.1.4 §1 provides appropriate types intptr_t and uintptr_t; the problem: these types are optional - the standard explicitly mentions that there need not be an integer type large enough to actually hold the value of the pointer!
sizeof(char*) != sizeof(void(*)(void)) ? - Not on x86 in 36 bit addressing mode (supported on pretty much every Intel CPU since Pentium 1)
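Because uintptr_t is optional, portable code can check for it before relying on it. The following is a small sketch, assuming the usual idiom that UINTPTR_MAX is defined only when the type exists. The function name roundtrip_through_uintptr and its -1 return value are made up for this illustration.

```c
#include <stdint.h>

/* Returns 1 if p survives a round trip through uintptr_t, 0 if it does
   not, and -1 if this implementation does not provide uintptr_t at all. */
int roundtrip_through_uintptr(void *p)
{
#ifdef UINTPTR_MAX
    uintptr_t bits = (uintptr_t)p;  /* pointer to integer */
    void *back = (void *)bits;      /* integer back to pointer */
    return back == p;
#else
    (void)p;                        /* no suitable integer type here */
    return -1;
#endif
}
```

Note that the guarantee only covers the round trip itself. What the integer value looks like in between is implementation-defined.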
"The in-memory representation of a pointer is the same as an integer of the same bit length" - there's no in-memory representation on any modern architecture; tagged memory has never caught on and was already obsolete before C was standardized. Memory in fact doesn't even hold integers, just bits and arguably words (not bytes; most physical memory doesn't allow you to read just 8 bits.)
"Multiplication of pointers is impossible" - 68000 family; address registers (the ones holding pointers) didn't support that IIRC.
"All pointers can be cast to integers" - Not on PICs.
"Incrementing a T* is equivalent to adding sizeof(T) to the memory address" - true by definition. Also equivalent to &pointer[1].
I don't know about the others, but for DOS, the assumption in #3 is untrue. DOS is 16 bit and uses various tricks to map many more than 16 bits worth of memory.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture.
I think this assumption is false because on the 80186, for example, a 32-bit pointer is held in two registers (an offset register and a segment register), and which half-word went in which register matters during access.
Multiplication and division of pointer data types are only forbidden by the compiler.
You can't multiply or divide types. ;P
I'm unsure why you would want to multiply or divide a pointer.
All pointer values can be casted to a single integer. In other words, what architectures still make use of segments and offsets?
The C99 standard allows pointers to be stored in intptr_t, which is an integer type. So, yes.
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer. If p is an int32* then p+1 is equal to the memory address 4 bytes after p.
x + y where x is a T * and y is an integer is equivalent to (T *)((intptr_t)x + y * sizeof(T)) as far as I know. Alignment may be an issue, but padding may be provided in the sizeof. I'm not really sure.
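The intptr_t form above depends on how the implementation maps pointers to integers, which the standard leaves implementation-defined. A variant that stays inside defined behaviour does the byte arithmetic on a char * instead. This is a sketch only, and advance_by_bytes is a name made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Advance p by n elements using byte arithmetic on char *.
   Arithmetic on char * is defined in units of one C byte, so adding
   n * sizeof *p lands on the same element as p + n, provided the
   result stays inside the same array (or one past its end). */
int32_t *advance_by_bytes(int32_t *p, size_t n)
{
    return (int32_t *)((char *)p + n * sizeof *p);
}
```

For an array int32_t a[4], advance_by_bytes(a, 3) and a + 3 compare equal on any conforming implementation, whatever the underlying address format is.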
In general, the answer to all of the questions is "yes", and it's because only those machines that implement popular languages directly saw the light of day and persisted into the current century. Although the language standards reserve the right to vary these "invariants", or assertions, it hasn't ever happened in real products, with the possible exception of items 3 and 4 which require some restatement to be universally true.
It's certainly possible to build segmented MMU designs, which correspond roughly with the capability-based architectures that were popular academically in past years, but no such system has typically seen common use with such features enabled. Such a system might have conflicted with the assertions as it would probably have had large pointers.
In addition to segmented/capability MMUs, which often have large pointers, more extreme designs have tried to encode data types in pointers. Few of these were ever built. (This question brings up all of the alternatives to the basic word-oriented, a pointer-is-a-word architectures.)
Specifically:
The in-memory representation of all pointers for a given architecture is the same regardless of the data type pointed to. True except for extremely wacky past designs that tried to implement protection not in strongly-typed languages but in hardware.
The in-memory representation of a pointer is the same as an integer of the same bit length as the architecture. Maybe, certainly some sort of integral type is the same, see LP64 vs LLP64.
Multiplication and division of pointer data types are only forbidden by the compiler. Right.
All pointer values can be casted to a single integer. In other words, what architectures still make use of segments and offsets? Nothing uses segments and offsets today, but a C int is often not big enough, you may need a long or long long to hold a pointer.
Incrementing a pointer is equivalent to adding sizeof(the pointed data type) to the memory address stored by the pointer. If p is an int32* then p+1 is equal to the memory address 4 bytes after p. Yes.
It is interesting to note that every Intel Architecture CPU, i.e., every single PeeCee, contains an elaborate segmentation unit of epic, legendary, complexity. However, it is effectively disabled. Whenever a PC OS boots up, it sets the segment bases to 0 and the segment lengths to ~0, nulling out the segments and giving a flat memory model.
There were lots of "word addressed" architectures in the 1950s, 1960s and 1970s. But I cannot recall any mainstream examples that had a C compiler. I recall the ICL / Three Rivers PERQ machines in the 1980s, which were word addressed and had a writable control store (microcode). One of its instantiations had a C compiler and a flavor of Unix called PNX, but the C compiler required special microcode.
The basic problem is that char* types on word addressed machines are awkward, however you implement them. You often end up with sizeof(int *) != sizeof(char *) ...
Interestingly, before C there was a language called BCPL in which the basic pointer type was a word address; that is, incrementing a pointer gave you the address of the next word, and ptr!1 gave you the word at ptr + 1. There was a different operator for addressing a byte: ptr%42 if I recall.
EDIT: Don't answer questions when your blood sugar is low. Your brain (certainly, mine) doesn't work as you expect. :-(
Minor nitpick:
p is an int32* then p+1
is wrong, it needs to be unsigned int32, otherwise it will wrap at 2GB.
Interesting oddity - I got this from the author of the C compiler for the Transputer chip - he told me that for that compiler, NULL was defined as -2GB. Why? Because the Transputer had a signed address range: -2GB to +2GB. Can you believe that? Amazing isn't it?
I've since met various people that have told me that defining NULL like that is broken. I agree, but if you don't you end up with NULL pointers being in the middle of your address range.
I think most of us can be glad we're not working on Transputers!
I would like to know architectures which violate the assumptions I've
listed below.
I see that Stephen C mentioned PERQ machines, and MSalters mentioned 68000s and PICs.
I'm disappointed that no one else actually answered the question by naming any of the weird and wonderful architectures that have standards-compliant C compilers that don't fit certain unwarranted assumptions.
sizeof(int *) == sizeof(char *) == sizeof(void *) == sizeof(func_ptr *) ?
Not necessarily. Some examples:
Most compilers for Harvard-architecture 8-bit processors -- PIC and 8051 and M8C -- make sizeof(int *) == sizeof(char *),
but different from the sizeof(func_ptr *).
Some of the very small chips in those families have 256 bytes of RAM (or less) but several kilobytes of PROGMEM (Flash or ROM), so compilers often make sizeof(int *) == sizeof(char *) equal to 1 (a single 8-bit byte), but sizeof(func_ptr *) equal to 2 (two 8-bit bytes).
Compilers for many of the larger chips in those families with a few kilobytes of RAM and 128 or so kilobytes of PROGMEM make sizeof(int *) == sizeof(char *) equal to 2 (two 8-bit bytes), but sizeof(func_ptr *) equal to 3 (three 8-bit bytes).
A few Harvard-architecture chips can store exactly a full 2^16 ("64KByte") of PROGMEM (Flash or ROM), and another 2^16 ("64KByte") of RAM + memory-mapped I/O.
The compilers for such a chip make sizeof(func_ptr *) always be 2 (two bytes);
but often have a way to make the other kinds of pointers sizeof(int *) == sizeof(char *) == sizeof(void *) into a "long ptr" 3-byte generic pointer that has the extra magic bit that indicates whether that pointer points into RAM or PROGMEM.
(That's the kind of pointer you need to pass to a "print_text_to_the_LCD()" function when you call that function from many different subroutines, sometimes with the address of a variable string in buffer that could be anywhere in RAM, and other times with one of many constant strings that could be anywhere in PROGMEM).
Such compilers often have special keywords ("short" or "near", "long" or "far") to let programmers specifically indicate three different kinds of char pointers in the same program -- constant strings that only need 2 bytes to indicate where in PROGMEM they are located, non-constant strings that only need 2 bytes to indicate where in RAM they are located, and the kind of 3-byte pointers that "print_text_to_the_LCD()" accepts.
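The 3-byte generic pointer described above can be modelled in portable C. This is only a sketch of the idea, not how any particular compiler lays out its pointers: the struct, the enum, the two arrays standing in for RAM and PROGMEM, and the function names are all made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical address spaces on a Harvard-architecture chip. */
enum mem_space { SPACE_RAM, SPACE_PROGMEM };

/* Model of a 3-byte generic pointer: one tag byte plus a 16-bit offset. */
struct generic_ptr {
    uint8_t  space;   /* which address space the offset refers to */
    uint16_t offset;  /* 2-byte address within that space */
};

/* Stand-ins for the two physical memories. */
static char       ram_model[]     = "variable text";
static const char progmem_model[] = "constant text";

struct generic_ptr make_generic_ptr(enum mem_space space, uint16_t offset)
{
    struct generic_ptr p;
    p.space = (uint8_t)space;
    p.offset = offset;
    return p;
}

/* Read one character, dispatching on the tag.
   Returns '\0' when the offset is outside the modelled memory. */
char generic_read(struct generic_ptr p)
{
    if (p.space == SPACE_PROGMEM)
        return p.offset < sizeof progmem_model ? progmem_model[p.offset] : '\0';
    return p.offset < sizeof ram_model ? ram_model[p.offset] : '\0';
}
```

The same offset value 0 names two different characters depending on the tag, which is why a plain 2-byte pointer is not enough for a function that must accept strings from either memory.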
Most computers built in the 1950s and 1960s use a 36-bit word length or an 18-bit word length, with an 18-bit (or less) address bus.
I hear that C compilers for such computers often use 9-bit bytes,
with sizeof(int *) == sizeof(func_ptr *) == 2 which gives 18 bits, since all integers and functions have to be word-aligned; but sizeof(char *) == sizeof(void *) == 4 to take advantage of special PDP-10 instructions that store such pointers in a full 36-bit word.
That full 36-bit word includes a 18-bit word address, and a few more bits in the other 18-bits that (among other things) indicate the bit position of the pointed-to character within that word.
The in-memory representation of all pointers for a given architecture
is the same regardless of the data type pointed to?
Not necessarily. Some examples:
On any one of the architectures I mentioned above, pointers come in different sizes. So how could they possibly have "the same" representation?
Some compilers on some systems use "descriptors" to implement character pointers and other kinds of pointers.
Such a descriptor is different for a pointer pointing to the first "char" in a "char big_array[4000]" than for a pointer pointing to the first "char" in a "char small_array[10]", which are arguably different data types, even when the small array happens to start at exactly the same location in memory previously occupied by the big array.
Descriptors allow such machines to catch and trap the buffer overflows that cause such problems on other machines.
The "Low-Fat Pointers" used in the SAFElite and similar "soft processors" have analogous "extra information" about the size of the buffer that the pointer points into. Low-Fat pointers have the same advantage of catching and trapping buffer overflows.
The in-memory representation of a pointer is the same as an integer of
the same bit length as the architecture?
Not necessarily. Some examples:
In "tagged architecture" machines, each word of memory has some bits that indicate whether that word is an integer, or a pointer, or something else.
With such machines, looking at the tag bits would tell you whether that word was an integer or a pointer.
I hear that Nova minicomputers have an "indirection bit" in each word which inspired "indirect threaded code". It sounds like storing an integer clears that bit, while storing a pointer sets that bit.
Multiplication and division of pointer data types are only forbidden
by the compiler. NOTE: Yes, I know this is nonsensical. What I mean is
- is there hardware support to forbid this incorrect usage?
Yes, some hardware doesn't directly support such operations.
As others have already mentioned, the "multiply" instruction in the 68000 and the 6809 only works with (some) "data registers"; it can't be directly applied to values in "address registers".
(It would be pretty easy for a compiler to work around such restrictions -- to MOV those values from an address register to the appropriate data register, and then use MUL).
All pointer values can be casted to a single data type?
Yes.
In order for memcpy() to work right, the C standard mandates that every object pointer value of every kind can be converted to a void pointer ("void *"). (Function pointers are not covered by this guarantee.)
The compiler is required to make this work, even for architectures that still use segments and offsets.
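A short sketch of what that guarantee buys: memcpy() takes void * parameters, so pointers to unrelated object types can be passed to it without a cast. The struct sample type and the function copies_match are made up for this illustration.

```c
#include <string.h>

struct sample {
    int    id;
    double value;
};

/* Copy an int and a struct through the same void * interface.
   Returns 1 if both copies arrived intact. */
int copies_match(void)
{
    int a = 42, b = 0;
    struct sample s = {7, 2.5}, t = {0, 0.0};

    memcpy(&b, &a, sizeof a);  /* int * converts implicitly to void * */
    memcpy(&t, &s, sizeof s);  /* struct sample * converts the same way */

    return b == 42 && t.id == 7 && t.value == 2.5;
}
```

How the compiler represents each of those pointers internally is its own business, as long as the conversion to void * and back preserves the value.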
All pointer values can be casted to a single integer? In other words,
what architectures still make use of segments and offsets?
I'm not sure.
I suspect that all pointer values can be cast to the "size_t" and "ptrdiff_t" integral data types defined in "<stddef.h>".
Incrementing a pointer is equivalent to adding sizeof(the pointed data
type) to the memory address stored by the pointer. If p is an int32*
then p+1 is equal to the memory address 4 bytes after p.
It is unclear what you are asking here.
Q: If I have an array of some kind of structure or primitive data type (for example, a "#include <stdint.h> ... int32_t example_array[1000]; ..."), and I increment a pointer that points into that array (for example, "int32_t *p = &example_array[99]; ... p++; ..."), does the pointer now point to the very next consecutive member of that array, which is sizeof(the pointed data type) bytes further along in memory?
A: Yes, the compiler must make the pointer, after incrementing it once, point at the next independent consecutive int32_t in the array, sizeof(the pointed data type) bytes further along in memory, in order to be standards compliant.
Q: So, if p is an int32* , then p+1 is equal to the memory address 4 bytes after p?
A: When sizeof( int32_t ) is actually equal to 4, yes. Otherwise, such as for certain word-addressable machines including some modern DSPs where sizeof( int32_t ) may equal 2 or even 1, then p+1 is equal to the memory address 2 or even 1 "C bytes" after p.
Q: So if I take the pointer, and cast it into an "int" ...
A: One type of "All the world's a VAX heresy".
Q: ... and then cast that "int" back into a pointer ...
A: Another type of "All the world's a VAX heresy".
Q: So if I take the pointer p which is a pointer to an int32_t, and cast it into some integral type that is plenty big enough to contain the pointer, and then add sizeof( int32_t ) to that integral type, and then later cast that integral type back into a pointer -- when I do all that, the resulting pointer is equal to p+1?
Not necessarily.
Lots of DSPs and a few other modern chips have word-oriented addressing, rather than the byte-oriented processing used by 8-bit chips.
Some of the C compilers for such chips cram 2 characters into each word, but it takes 2 such words to hold an int32_t -- so they report that sizeof( int32_t ) is 4.
(I've heard rumors that there's a C compiler for the 24-bit Motorola 56000 that does this).
The compiler is required to arrange things such that doing "p++" with a pointer to an int32_t increments the pointer to the next int32_t value.
There are several ways for the compiler to do that.
One standards-compliant way is to store each pointer to an int32_t as a "native word address".
Because it takes 2 words to hold a single int32_t value, the C compiler compiles "int32_t * p; ... p++" into some assembly language that increments that pointer value by 2.
On the other hand, if one does "int32_t * p; ... int x = (int)p; x += sizeof( int32_t ); p = (int32_t *)x;", that C compiler for the 56000 will likely compile it to assembly language that increments the pointer value by 4.
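The mismatch described above can be modelled with plain integers. This is a sketch of the arithmetic only, not the behaviour of any real 56000 compiler. It assumes the machine described in the text: native addresses count words, an int32_t occupies 2 words, sizeof( int32_t ) reports 4, and a pointer-to-integer cast keeps the raw word address. All names are made up for this illustration.

```c
#include <stdint.h>

/* Assumed properties of the modelled word-addressed machine. */
enum {
    MODEL_WORDS_PER_INT32 = 2,  /* native words occupied by one int32_t */
    MODEL_SIZEOF_INT32    = 4   /* what sizeof(int32_t) reports there */
};

/* Stands in for the raw value of an int32_t * on the modelled machine. */
typedef uint32_t word_addr;

/* What the compiler would emit for "p++": step over one int32_t. */
word_addr model_pointer_increment(word_addr p)
{
    return p + MODEL_WORDS_PER_INT32;
}

/* What "x = (int)p; x += sizeof(int32_t); p = (int32_t *)x;" does if
   the casts leave the raw word address unchanged. */
word_addr model_integer_increment(word_addr p)
{
    return p + MODEL_SIZEOF_INT32;
}
```

Starting from word address 100, the first function yields 102 and the second yields 104, so the integer route skips an element.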
I'm most used to pointers being used in a contiguous, virtual memory
space.
Several PIC and 8086 and other systems have non-contiguous RAM --
a few blocks of RAM at addresses that "made the hardware simpler".
With memory-mapped I/O or nothing at all attached to the gaps in address space between those blocks.
It's even more awkward than it sounds.
In some cases -- such as with the bit-banding hardware used to avoid problems caused by read-modify-write -- the exact same bit in RAM can be read or written using 2 or more different addresses.
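As one concrete case of bit-banding, ARM Cortex-M3 and Cortex-M4 parts map every bit of a 1 MB SRAM region to its own 32-bit word in a separate alias region. The sketch below computes that alias address. The base addresses are the ones I know from the Cortex-M3/M4 memory map and are not taken from the text above, so check them against the reference manual for the part in question. The function name bitband_alias is made up for this illustration.

```c
#include <stdint.h>

/* Assumed Cortex-M3/M4 memory map for the SRAM bit-band region. */
#define SRAM_BITBAND_BASE 0x20000000u  /* start of the bit-band region */
#define SRAM_ALIAS_BASE   0x22000000u  /* start of the alias region */

/* Address of the alias word for bit `bit` (0..7) of the byte at
   `byte_addr`. Each byte expands to 8 alias words of 4 bytes each,
   hence the factor of 32. */
uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t byte_offset = byte_addr - SRAM_BITBAND_BASE;
    return SRAM_ALIAS_BASE + byte_offset * 32u + bit * 4u;
}
```

So bit 3 of the byte at 0x20000001 can be reached both through 0x20000001 (with masking) and through the alias word at 0x2200002C, two addresses for the same physical bit.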