Consider the following situations:
The National Semiconductor SC/MP has pointers which, when you keep incrementing them, will roll from 0x0FFF to 0x0000 because the increment circuit does not propagate the carry past the lower nybble of the higher byte. So if, for example, I want to do while(*ptr++) to traverse a null-terminated string, then I might wind up with ptr pointing outside of the array.
On the PDP-10, because a machine word is longer than an address¹, a pointer may have tags and other data in the upper half of the word containing the address. In this situation, if incrementing a pointer causes an overflow, that other data might get altered. The same goes for very early Macintoshes, before the ROMs were 32-bit clean.
So my question is about whether the C standard says what incrementing a pointer really means. As far as I can tell, the C standard assumes that it should work bit-for-bit the same way as incrementing an integer. But that doesn't always hold, as we have seen.
Can a standards-conforming C compiler emit a simple adda a0, 1² to increment a pointer, without checking whether the presence or lack of carry propagation will lead to weirdness?
1: On the PDP-10, an address is 18 bits wide, but a machine word is 36 bits wide. A machine word may hold either two pointers (handy for Lisp) or one pointer, plus bitfields which mean things like "add another level of indirection", segments, offsets etc. Or a machine word may of course contain no pointers, but that's not relevant to this question.
2: Add one to an address. That's 68000 assembler.
Behavior of pointer arithmetic is specified by the C standard only as long as the result points to a valid object or just past a valid object. More than that, the standard does not say what the bits of a pointer look like; an implementation may arrange them to suit its own purposes.
So, no, the standard does not say what happens when a pointer is incremented so far that the address rolls over.
If the while loop you refer to only proceeds one element past the end of the array, it is safe in C. (Per the standard, if ptr has been incremented to one element beyond the end of the array, and x points to any element in the array, including the first, then x < ptr must be true. So, if ptr has rolled over internally, the C implementation is responsible for ensuring the comparison still works.)
If your while loop may increment ptr more than one element beyond the end of the array, the C standard does not define the behavior.
People often ask, "Why does C have undefined behavior, anyway?" This is a great example of one of the big reasons why.
Let's stick with the NS SC/MP example. If the hardware dictates that incrementing the pointer value 0x0FFF doesn't work quite right, we have two choices:
Translate the code p++ to the equivalent of if(p == 0x0FFF) p = 0x1000; else p++;.
Translate p++ to a straight increment, but rig things up so that no properly-allocated object ever spans the 0x0FFF boundary. Then, if anyone ever writes code that ends up taking the pointer value 0x0FFF, adding one to it, and getting a bizarre answer, you can say "that's undefined, so anything can happen".
If you take approach #1, the generated code is bigger and slower. If you take approach #2, the generated code is maximally efficient. And if someone complains about the bizarre behavior and asks why the compiler couldn't have emitted code that did something "more reasonable", you can simply say, "our mandate was to be as efficient as possible."
A significant number of platforms have addressing methods which cannot index "easily" across certain boundaries. The C Standard allows implementations two general approaches for handling this (which may be, but typically aren't, used together):
Refrain from having the compiler, linker, or malloc-style functions place any objects in a way that would straddle any problematic boundaries.
Perform address computations in a way that can index across arbitrary boundaries, even when it would be less efficient than address-computation code that can't.
In most cases, approach #1 will lead to code which is faster and more compact, but code may be limited in its ability to use memory effectively. For example, if code needs many objects of 33,000 bytes each, a machine with 4 MiB of heap space subdivided into "rigid" 64 KiB chunks would be limited to creating 64 of them (one per chunk), even though there should be space for 127 of them. Approach #2 will often yield much slower code, but such code may be able to make more effective use of heap space.
Interestingly, imposing 16-bit or 32-bit alignment requirements would allow many 8-bit processors to generate more efficient code than allowing arbitrary alignment (since they could omit page-crossing logic when indexing between the bytes of a word) but I've never seen any 8-bit compilers provide an option to impose and exploit such alignments even on platforms where it could offer considerable advantages.
The C standard does not know anything about the implementation, and it does not care about the implementation. It only says what the effect of pointer arithmetic is.
C allows something called undefined behavior. C does not care whether the result of pointer arithmetic makes any sense (i.e., whether it is out of bounds, or whether the actual implementation-defined storage wrapped around). If that happens, it is UB. It is up to the programmer to prevent UB, and C does not have any standard mechanism for detecting or preventing it.
I have assigned some random address to a pointer of a particular data type. Then I stored a value in that particular address. When I run the program, it terminates abruptly.
char *c=2000;
*c='A';
printf("%u",c);
printf("%d",*c);
I was able to print the value of c with the first printf statement, but I couldn't fetch the value stored at that address with the second one. I executed this with the Cygwin GCC compiler and also on the online ideone.com compiler; ideone.com reports a runtime error. What's the reason behind this?
When you assign the address 2000 to the pointer c, you are assuming that it will be a valid address. Generally, though, it is not. You can't choose addresses at random and expect the compiler (and operating system) to have allocated that memory for you to use. In particular, the first page of memory (often 4 KiB, usually at least 1 KiB) is completely off limits; attempts to read or write there are usually indicative of bugs rather than intentional behaviour, and the MMU (memory management unit) is configured to reject attempts to access that memory.
If you're using an embedded microprocessor, the rules might well be different, but on a general purpose o/s like Windows with Cygwin, addresses under 0x1000 (4 KiB) are usually verboten.
You can print the address (you did it unreliably, but presumably your compiler didn't warn you; mine would have warned me about using a format for a 4-byte integer quantity to print an 8-byte address). But you can't reliably read or write the data at the address. There could be machines (usually mainframes) where simply reading an invalid address (even without accessing the memory it points at) generates a memory fault.
So, as Acme said in their answer, you've invoked undefined behaviour. You've taken over from the compiler the responsibility for assigning a valid address to your pointer, but you chose an invalid value. A crash is the consequence of your bad decision.
char *c=2000;
Assignment (and initialization) of integer values to pointers is implementation-defined behavior.
Implementation-defined behavior is defined by the ISO C Standard in section 3.4.1 as:

unspecified behavior where each implementation documents how the choice is made

EXAMPLE An example of implementation-defined behavior is the propagation of the high-order bit when a signed integer is shifted right.

Any code that relies on implementation-defined behaviour is only guaranteed to work under a specific platform and/or compiler. Portable programs should try to avoid such behaviour.
What forms of memory address spaces have been used?
Today, a large flat virtual address space is common. Historically, more complicated address spaces have been used, such as a pair of a base address and an offset, a pair of a segment number and an offset, a word address plus some index for a byte or other sub-object, and so on.
From time to time, various answers and comments assert that C (or C++) pointers are essentially integers. That is an incorrect model for C (or C++), since the variety of address spaces is undoubtedly the cause of some of the C (or C++) rules about pointer operations. For example, not defining pointer arithmetic beyond an array simplifies support for pointers in a base and offset model. Limits on pointer conversion simplify support for address-plus-extra-data models.
That recurring assertion motivates this question. I am looking for information about the variety of address spaces to illustrate that a C pointer is not necessarily a simple integer and that the C restrictions on pointer operations are sensible given the wide variety of machines to be supported.
Useful information may include:
Examples of computer architectures with various address spaces and descriptions of those spaces.
Examples of various address spaces still in use in machines currently being manufactured.
References to documentation or explanation, especially URLs.
Elaboration on how address spaces motivate C pointer rules.
This is a broad question, so I am open to suggestions on managing it. I would be happy to see collaborative editing on a single generally inclusive answer. However, that may fail to award reputation as deserved. I suggest up-voting multiple useful contributions.
Just about anything you can imagine has probably been used. The first major division is between byte addressing (all modern architectures) and word addressing (pre-IBM 360/PDP-11, but I think modern Unisys mainframes are still word addressed). In word addressing, char* and void* would often be bigger than an int*; even if they were not bigger, the "byte selector" would be in the high order bits, which were required to be 0, or would be ignored for anything other than bytes. (On a PDP-10, for example, if p was a char*, (int)p < (int)(p+1) would often be false, even though int and char* had the same size.)
Among byte-addressed machines, the major variants are segmented and non-segmented architectures. Both are still widespread today, although in the case of Intel 32-bit (a segmented architecture with 48-bit addresses), some of the more widely used OSs (Windows and Linux) artificially restrict user processes to a single segment, simulating flat addressing.
Although I've no recent experience, I would expect even more variety in embedded processors. In particular, in the past it was frequent for embedded processors to use a Harvard architecture, where code and data were in independent address spaces (so that a function pointer and a data pointer, cast to a large enough integral type, could compare equal).
I would say you are asking the wrong question, except as historical curiosity.
Even if your system happens to use a flat address space -- indeed, even if every system from now until the end of time uses a flat address space -- you still cannot treat pointers as integers.
The C and C++ standards leave all sorts of pointer arithmetic "undefined". That can impact you right now, on any system, because compilers will assume you avoid undefined behavior and optimize accordingly.
For a concrete example, three months ago a very interesting bug turned up in Valgrind:
https://sourceforge.net/p/valgrind/mailman/message/29730736/
(Click "View entire thread", then search for "undefined behavior".)
Basically, Valgrind was using less-than and greater-than on pointers to try to determine whether an automatic variable was within a certain range. Because comparisons between pointers into different aggregates are "undefined", Clang simply optimized away all of the comparisons to return a constant true (or false; I forget).
This bug itself spawned an interesting StackOverflow question.
So while the original pointer arithmetic definitions may have catered to real machines, and that might be interesting for its own sake, it is actually irrelevant to programming today. What is relevant today is that you simply cannot assume that pointers behave like integers, period, regardless of the system you happen to be using. "Undefined behavior" does not mean "something funny happens"; it means the compiler can assume you do not engage in it. When you do, you introduce a contradiction into the compiler's reasoning; and from a contradiction, anything follows... It only depends on how smart your compiler is.
And they get smarter all the time.
There are various forms of bank-switched memory.
I worked on an embedded system that had 128 KB of total memory: 64 KB of RAM and 64 KB of EPROM. Pointers were only 16-bit, so a pointer into the RAM could have the same value as a pointer into the EPROM, even though they referred to different memory locations.
The compiler kept track of the type of the pointer so that it could generate the instruction(s) to select the correct bank before dereferencing a pointer.
You could argue that this was like segment + offset, and at the hardware level, it essentially was. But the segment (or more correctly, the bank) was implicit from the pointer's type and not stored as the value of a pointer. If you inspected a pointer in the debugger, you'd just see a 16-bit value. To know whether it was an offset into the RAM or the ROM, you had to know the type.
For example, Foo * could only be in RAM and const Bar * could only be in ROM. If you had to copy a Bar into RAM, the copy would actually be a different type. (It wasn't as simple as const/non-const: everything in ROM was const, but not all consts were in ROM.)
This was all in C, and I know we used non-standard extensions to make this work. I suspect a 100% compliant C compiler probably couldn't cope with this.
From a C programmer's perspective, there are three main kinds of implementation to worry about:
Those which target machines with a linear memory model, and which are designed and/or configured to be usable as a "high-level assembler"--something the authors of the Standard have expressly said they did not wish to preclude. Most implementations behave in this way when optimizations are disabled.
Those which are usable as "high-level assemblers" for machines with unusual memory architectures.
Those whose design and/or configuration make them suitable only for tasks that do not involve low-level programming, including clang and gcc when optimizations are enabled.
Memory-management code targeting the first type of implementation will often be compatible with all implementations of that type whose targets use the same representations for pointers and integers. Memory-management code for the second type of implementation will often need to be specifically tailored for the particular hardware architecture. Platforms that don't use linear addressing are sufficiently rare, and sufficiently varied, that unless one needs to write or maintain code for some particular piece of unusual hardware (e.g. because it drives an expensive piece of industrial equipment for which more modern controllers aren't available) knowledge of any particular architecture isn't likely to be of much use.
Implementations of the third type should be used only for programs that don't need to do any memory-management or systems-programming tasks. Because the Standard doesn't require that all implementations be capable of supporting such tasks, some compiler writers--even when targeting linear-address machines--make no attempt to support any of the useful semantics thereof. Even principles like "an equality comparison between two valid pointers will--at worst--either yield 0 or 1 chosen in possibly-unspecified fashion" don't apply to such implementations.
I've recently been pointed into one of my C programs that, should the start address of the memory block be low enough, one of my tests would fail as a consequence of wrapping around zero, resulting in a crash.
At first I thought, "this is a nasty potential bug", but then I wondered: can this case actually happen? I've never seen it. To be fair, this program has already run millions of times on a myriad of systems, and it has never happened so far.
Therefore, my question is :
What is the lowest possible memory address that a call to malloc() may return? To the best of my knowledge, I've never seen addresses such as 0x00000032, for example.
I'm only interested in "modern" environments, such as Linux, BSD and Windows. This code is not meant to run on a C64 nor whatever hobby/research OS.
First of all, since that's what you asked for, I'm only going to consider modern systems. That means they're using paged memory and have a faulting page at 0 to handle null pointer dereferences.
Now, the smallest page size I'm aware of on any real system is 4k (4096 bytes). That means you will never have valid addresses below 0x1000; anything lower would be part of the page containing the zero address, and thus would preclude having null pointer dereferences fault.
In the real world, good systems actually keep you from going that low; modern Linux even prevents applications from intentionally mapping pages below a configurable default (64k, I believe). The idea is that you want even moderately large offsets from a null pointer (e.g. p[n] where p happens to be a null pointer) to fault (and in the case of Linux, they want code in kernelspace to fault if it tries to access such addresses to avoid kernel null-pointer-dereference bugs which can lead to privilege elevation vulns).
With that said, it's undefined behavior to perform pointer arithmetic outside of the bounds of the array the pointer points into. Even if the address doesn't wrap, there are all sorts of things a compiler might do (either for hardening your code, or just for optimization) where the undefined behavior could cause your program to break. Good code should follow the rules of the language it's written in, i.e. not invoke undefined behavior, even if you expect the UB to be harmless.
You probably mean that you are computing &a - 1 or something similar.
Please, do not do this, even if pointer comparison is currently implemented as unsigned comparison on most architectures, and you know that (uintptr_t)&a is larger than some arbitrary bound on current systems. Compilers will take advantage of undefined behavior for optimization. They do it now, and if they do not take advantage of it now, they will in the future, regardless of “guarantees” you might expect from the instruction set or platform.
See this well-told anecdote for more.
In a completely different register, you might think that signed overflow is undefined in C because it used to be that there were different hardware choices such as ones' complement and sign-magnitude. Therefore, if you knew that the platform was two's complement, an expression such as (x+1) > x could be used to detect whether x equals INT_MAX.
This may be the historical reason, but the reasoning no longer holds. The expression (x+1) > x (with x of type int) is optimized to 1 by modern compilers, because signed overflow is undefined. Compiler authors do not care that the original reason for undefinedness used to be the variety of available architectures. And whatever undefined thing you are doing with pointers is next on their list. Your program will break tomorrow if you invoke undefined behavior, not because the architecture changed, but because compilers are more and more aggressive in their optimizations.
Dynamic allocations are performed on the heap. The heap resides in a process address space just after the text (the program code), initialized data and uninitialized data sections; see here: http://www.cprogramming.com/tutorial/virtual_memory_and_heaps.html . So the minimal possible address in the heap depends on the sizes of these three segments, and there is no absolute answer, since it depends on the particular program.
I feel this might be a weird/stupid question, but here goes...
In the question Is NULL in C required/defined to be zero?, it has been established that the NULL pointer points to an unaddressable memory location, and also that NULL is 0.
Now, supposedly a 32-bit processor can address 2^32 memory locations.
2^32 is only the number of distinct numbers that can be represented using 32 bits. Among those numbers is 0. But since 0, that is, NULL, is supposed to point to nothing, shouldn't we say that a 32-bit processor can only address 2^32 - 1 memory locations (because the 0 is not supposed to be a valid address)?
If a 32-bit processor can address 2^32 memory locations, that simply means that a C pointer on that architecture can refer to 2^32 - 1 locations plus NULL.
the NULL pointer points to an unaddressable memory location
This is not true. From the accepted answer in the question you linked:
Notice that, because of how the rules for null pointers are formulated, the value you use to assign/compare null pointers is guaranteed to be zero, but the bit pattern actually stored inside the pointer can be any other thing
Most platforms of which I am aware do in fact handle this by marking the first few pages of address space as invalid. That doesn't mean the processor can't address such things; it's just a convenient way of making low values a non valid pointer. For instance, several Windows APIs use this to distinguish between a resource ID and a pointer to actual data; everything below a certain value (65k if I recall correctly) is not a valid pointer, but is a valid resource ID.
Finally, just because C says something doesn't mean that the CPU needs to be restricted that way. Sure, C says accessing the null pattern is undefined -- but there's no reason someone writing in assembly need be subject to such limitations. Real machines typically can do much more than the C standard says they have to. Virtual memory, SIMD instructions, and hardware IO are some simple examples.
First, let's note the difference between the linear address (AKA the value of the pointer) and the physical address. While the linear address space is, indeed, 32 bits (AKA 2^32 different bytes), the physical address that goes to the memory chip is not the same. Parts ("pages") of the linear address space might be mapped to physical memory, or to a page file, or to an arbitrary file, or marked as inaccessible and not backed by anything. The zeroth page happens to be the latter. The mapping mechanism is implemented on the CPU level and maintained by the OS.
That said, the zero address being unaddressable memory is just a C convention that's enforced by every protected-mode OS since the first Unices. In MS-DOS-era real-mode operating systems, a null far pointer (0000:0000) was perfectly addressable; however, writing there would ruin system data structures and bring nothing but trouble. A null near pointer (DS:0000) was also perfectly accessible, but the run-time library would typically reserve some space around zero to protect against accidental null pointer dereferencing. Also, in real mode (as in DOS) the address space was not a flat 32-bit one; it was effectively 20-bit.
It depends upon the operating system. It is related to virtual memory and address spaces.
In practice (at least on 32-bit x86 Linux), addresses are byte "numbers", but most are used for 4-byte words, so they are often multiples of 4.
More importantly, as seen from a Linux application, at most 3 GB out of the 4 GB address space is visible; a whole gigabyte of address space (including the first and last pages, near the null pointer) is unmapped. In practice the process sees much less than that. See its /proc/self/maps pseudo-file (e.g. run cat /proc/self/maps to see the address map of the cat command on Linux).