Are arrays guaranteed to be contiguous in virtual memory? - c

int main() {
    char a[3] = {1, 2, 3};
    return sizeof(a);
}
Is a guaranteed to be in consecutive bytes in virtual memory?
I know it may not be consecutive in physical memory as the mapping is done behind the scene by the MMU.
If the compiler notices I'm not taking the address of any of the elements, is it then free to put them at non-consecutive addresses in memory, or even put them in registers?
Let's assume the optimizer will not fully get rid of it in my example.

Is a guaranteed to be in consecutive bytes in virtual memory? I know it may not be consecutive in physical memory as the mapping is done behind the scene by the MMU.
Your code is guaranteed to behave as if the array were in consecutive bytes of address space.
If the compiler notices I'm not taking the address of any of the elements, is it then free to put them at non-consecutive addresses in memory, or even put them in registers?
It is free to do so even if you do take the address. The compiler can compile your code any way it wants to make it efficient so long as the code doesn't break.
Let's assume the optimizer will not fully get rid of it in my example.
Okay. But it's allowed to. C has an "as-if" rule, which means that all the rules are only about the observable behavior your code has to produce; they don't constrain how the compiler (or the machine) gets that behavior.
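For instance, sizeof(a) in the snippet above is a compile-time constant, so under the as-if rule an optimizing compiler may never materialize the array at all. A minimal sketch of that point (what any particular compiler actually emits is, of course, its own business):
#include <stdio.h>

int main(void)
{
    char a[3] = {1, 2, 3};

    /* sizeof(a) is computed at compile time, so a compiler applying the
     * as-if rule is free to never place a in memory at all; only the
     * observable output (printing 3) is pinned down by the standard. */
    printf("%zu\n", sizeof a);
    return 0;
}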

The non-pedantic answer is that, yes, for all intents and purposes, arrays are guaranteed to be stored contiguously in C. This is no accident, it's pretty much a fundamental part of the definition of an array.
The whole point of an array is that you can access any element in constant (that is, O(1)) time, notionally by computing an address, and without having to chase any links as you would with almost any other data structure. So if, as some obscure and invisible implementation detail, an array were somehow not actually stored contiguously, it would always have to act exactly as if it were stored contiguously. And since contiguity is the defining property of an array, I don't think there's any harm in thinking consciously about that property, and assuming that it's always true.
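A small illustration of that defining property (just a sketch; nothing here relies on any particular implementation): element addresses step by exactly sizeof(element), which is also what lets a single memcpy treat the whole array as one block.
#include <stdio.h>
#include <string.h>

int main(void)
{
    int a[4] = {10, 20, 30, 40};
    int copy[4];

    /* &a[i] is always a + i; consecutive elements are sizeof(int) apart. */
    for (int i = 0; i < 4; i++)
        printf("&a[%d] = %p\n", i, (void *)&a[i]);

    /* Contiguity is what makes one memcpy over the whole array well defined. */
    memcpy(copy, a, sizeof a);
    printf("copy[3] = %d\n", copy[3]);
    return 0;
}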

Well, in this particular code snippet, a will most likely be stored on the stack, and yes, in this particular case it would be contiguous.
However, if you were to compile at a higher optimization level, most compilers would not even store a; they would just do constant propagation and return the size of a.
If you were to allocate memory on the heap, then it depends on the allocator: with a buddy or slab allocator, successive allocations would more often than not be contiguous, but with a naive free-list allocator they may not be when there are multiple heap allocation calls. For an array this is not an issue, as you will most likely make a single call, so it would still be contiguous.
Tools like objdump, opt from LLVM, gdb, etc. are great if you want to check the disassembly and see how the compiler lays out the code at different optimization levels.
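As a small check consistent with the above (assuming nothing about the allocator's internal strategy): one malloc call hands back one contiguous object, so its element addresses step by sizeof(int) regardless of how the allocator found the space.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* One allocation call = one contiguous block, whether the allocator is
     * buddy, slab, or a naive free list. Separate calls make no such promise. */
    int *heap_array = malloc(4 * sizeof *heap_array);
    if (!heap_array)
        return 1;

    for (int i = 0; i < 4; i++)
        printf("&heap_array[%d] = %p\n", i, (void *)&heap_array[i]);

    free(heap_array);
    return 0;
}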

C is defined in terms of an abstract machine. The bytes are guaranteed to be contiguous in the abstract machine.
The real machine can do literally anything so long as the observable behaviour of the program matches an allowed output of the program in the abstract machine.
So, there are no guarantees about any placement of data in the real machine.

Related

Undefined behavior beyond the max index of an array

Situation:
I'm taking a crash course to get familiar with C, and I've noticed that the author of this course can print array values beyond the array's last index and be confident that the value will be 0 each time.
Code from crash course below:
int arrayVar[] = {45, 67, 34, 23};
printf("This array index value is %d", arrayVar[4]);
Output from code:
This array index value is 0
It's been my experience, during this tinkering/testing of C, that once you go beyond the array's max index, you're entering undefined behavior territory, where anything can happen, so how can he be so confident (and right) about seeing a 0 value every time?
If I print values beyond the array's max index, I see "random" values (or, values that were left there in memory, right?).
Why is my experience different from what I'm seeing in this course? Is this just a difference in C standards? Or does this indicate a difference in compilers? Or both?
Environment info: I am using the C11 standard, and I'm using the compiler that (I'm pretty sure) came by default with Ubuntu, located at /usr/bin/cc.
EDIT: For anyone interested in seeing what I'm seeing, here's a link to the course (you'll probably be prompted to login to Udemy): https://www.udemy.com/c-fast-crash-course-introduction/learn/lecture/12868540#questions
The author of the course is wrong.
It's that simple.
Undefined does not mean random. In many cases, undefined behavior leads to some default behavior and hence may go unnoticed for a long time. Memory is commonly initialized with zeros, so accessing uninitialized memory often yields zeros. That is why some memory-debugging libraries will fill allocated memory with uncommon values such as 0xDEADBEEF, which have a better chance of triggering problems.
Memory allocation is nontrivial. The underlying libraries need to keep track of what is allocated vs. free, there are different kinds of allocations (stack vs. heap, data segment, BSS, ...), and libraries may have optimized strategies for allocating certain small objects, etc. You don't call into the OS to allocate 16 bytes; the situation is more complicated. When you allocate 16 bytes, your C library likely asks for several megabytes (if it didn't do so before), the kernel pretends it gives all this memory to the application (assuming that quite often not all of it is ever used), and the library then cuts off a chunk with your 16 bytes plus some overhead for memory management, usually aligned to an 8-byte boundary, because micromanaging memory at the byte level is a bad idea for multiple reasons. So the next integer may already be inside this megabyte-sized region that was allocated and cleared for future use.
(Although in this particular case the array supposedly is in the data section and never allocated, the idea is similar - there probably is some static variable next to it that happens to be zero. You may want to look at a dump of the binary's data segment layout.)
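To see why 0 is a plausible (but never guaranteed) result, here is a rough sketch: objects with static storage duration and no initializer really are guaranteed to start out as zero, so if whatever happens to sit after arrayVar in the layout is such an object, an out-of-bounds read may well find a zero. The neighbouring variable below is hypothetical; the actual layout is entirely up to the toolchain, and arrayVar[4] remains undefined behaviour either way.
#include <stdio.h>

int arrayVar[] = {45, 67, 34, 23};
static int maybe_next_in_memory;  /* hypothetical neighbour; zero-initialized per the standard */

int main(void)
{
    printf("last valid element: %d\n", arrayVar[3]);                /* 23 */
    printf("an uninitialized static: %d\n", maybe_next_in_memory);  /* guaranteed 0 */
    /* printf("%d\n", arrayVar[4]);  <- still undefined behaviour, even if it happens to print 0 */
    return 0;
}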

How do pointers stay valid when objects move in memory?

Imagine in C I allocate two structs on the heap. One of the structs has a field which holds a pointer to the other struct.
As far as I know, data in the heap may move, thus addresses of things change. For example, defragmentation on the heap may occur, moving the second struct to a different place in the heap.
This helps in understanding what I'm talking about:
https://en.m.wikibooks.org/wiki/Memory_Management/Memory_Compacting
The pointer to this struct would now be wrong (i.e. hold the wrong memory address).
I don't mean this question as specific to C, but more general: at any time, the platform may decide to move things around. How do pointers stay valid?
The key concept here is virtual memory. Your pointers do not point to a physical address, but rather to a virtual one in the virtual address space of your process. What you said is correct: data may get moved around, even swapped out to disk and then mapped back into physical memory at a different frame, but the virtual address that your pointer points to always stays the same.
The C standard does not permit the implementation to (spontaneously) move things around in such a way that would invalidate an existing pointer. It's possible that an implementation could exist that does "defragment" the heap, but I don't know of any implementations that do.
I said "spontaneously" because realloc() calls in your code may actually cause the object to move; that's why realloc returns a pointer. If the pointer returned by realloc is different from the original pointer, the original pointer (and any pointers that aliased it) are invalid. But this is something you have to keep track of in your own code.
Managed languages (Java, C#, Python, whatever) may (or may not) deal with heap fragmentation by adding an additional level of indirection and/or keeping track of pointers into the heap. That way the language runtime can update all the pointers to object X when X moves to a different place. That would be taken care of by the garbage collection system.
It would be somewhat unusual for a C implementation to provide a garbage collector, and probably can't be done in a standards conformant way due to all the things you can (safely) do with pointers. So the premise of your question, that the heap may be spontaneously defragmented by the implementation, is not valid.
When you see a pointer in C, you observe something that looks like a memory address but, in practice, there will be one if not two levels of abstraction between this and the physical memory address.
Not only does this help make operating systems more secure, but it also allows the system to perform any defragmentation tasks (whatever they are) without changing what is observed by your C program.

Garbage values in a multiprocess operating system

Does the allocated memory hold the garbage value since the start of the OS session? Did it have some significance before we call it a garbage value in our program's runtime session? If so, then why?
I also need some advice on study materials regarding Linux kernel programming and device driver programming, and I want to develop an understanding of how computer devices actually work. I get stuck in situations like this "garbage value" one and feel like I have to study something else as well for a better understanding of the programming language. I am studying by myself and running into a lot of confusing situations. Any advice will be really helpful.
"Garbage value" is a slang term, meaning "I don't know what value is there, or why, and for that reason I will not use the value". It is "garbage" in the sense of "useless nonsense", and sometimes it is also "garbage" in the sense of "somebody else's leavings".
Formally, uninitialized memory in C takes "indeterminate values". This might be some special value written there by the C implementation, or it might be something "left over" by an earlier user of the same memory. So, for example:
A debug version of the C runtime might fill newly-allocated memory with an eye-catcher value, so that if you see it in the debugger when you were expecting your own stored data, you can reasonably conclude that either you forgot to initialize it or you're looking in the wrong place.
The kernel of a "proper" operating system will overwrite memory when it is first assigned to a process, to avoid one process seeing data that "belongs" to another process and that for security reasons should not leak across process boundaries. Typically it will overwrite it with some known value, like 0.
If you malloc memory, write something in it, then free it and malloc some more memory, you might get the same memory again with its previous contents largely intact. But formally your newly-allocated buffer is still "uninitialized" even though it happens to have the same contents as when you freed it, because formally it's a brand new array of characters that just so happens to have the same address as the old one.
One reason not to use an "indeterminate value" in C is that the standard permits it to be a "trap representation". Some machines notice when you load certain impossible values of certain types into a register, and you'd get a hardware fault. So if the memory was previously used for, say, an int, but then that value is read as a float, who is to say whether the left-over bit pattern represents a so-called "signalling NaN", that would halt the program? The same could happen if you read a value as a pointer and it's mis-aligned for the type. Even integer types are permitted to have "parity bits", meaning that reading garbage values as int could have undefined behavior. In practice, I don't think any implementation actually does have trap representations of int, and I doubt that any will check for mis-aligned pointers if you just read the pointer value -- although they might if you dereference it. But C programmers are nothing if not cautious.
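A short sketch of the defensive pattern this implies: never read storage before writing it, for example by zero-filling explicitly or by asking calloc to do it for you.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int stack_buf[8];                             /* indeterminate ("garbage") until written */
    memset(stack_buf, 0, sizeof stack_buf);       /* now every byte has a defined value */

    int *heap_buf = calloc(8, sizeof *heap_buf);  /* calloc zero-fills the allocation */
    if (!heap_buf)
        return 1;

    printf("%d %d\n", stack_buf[0], heap_buf[0]); /* both 0, and well defined */
    free(heap_buf);
    return 0;
}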
What is garbage value?
When you encounter values at a memory location and cannot conclusively say what those values should be, then those values are garbage values for you, i.e. the value is indeterminate.
Most commonly, when you use a variable and do not initialize it, the variable has an indeterminate value and is said to possess a garbage value. Note that using an uninitialized variable leads to undefined behavior, which means the program is not a valid C/C++ program and may show (literally) any behavior.
Why does the particular value exist at that location?
Most operating systems today use the concept of virtual memory. The memory address a user program sees is a virtual memory address and not the physical address. Implementations of virtual memory divide a virtual address space into pages, blocks of contiguous virtual memory addresses that are usually at least 4 kilobytes. Once a process is done using them, these pages are not explicitly wiped of their contents; they are only marked as free for reuse, and hence they still contain the old contents if not properly initialized.
On a typical OS, your userspace application only sees a range of virtual memory. It is up to the kernel to map this virtual memory to actual, physical memory.
When a process requests a piece of (virtual) memory, it will initially hold whatever is left in it -- it may be a reused piece of memory that another part of the process was using earlier, or it may be memory that a completely different process had been using... or it may never have been touched at all and be in whatever state it was when you powered on the machine.
Usually nobody goes and wipes a memory page with zeros (or any other equally arbitrary value) on your behalf, because there'd be no point. It's entirely up to your application to use the memory in whatever way you please, and if you're going to write to it anyway, then you don't care what was in it before.
Consequently, in C it is simply not allowed to read a variable before you have written to it, under pain of undefined behaviour.
If you declare a variable without initialising it to a particular value, it may contain a value which was previously assigned by a different program that has since released that piece of memory, or it may simply be a random value from when the computer was booted (iirc, PCs used to initialise all RAM to 0 on bootup because early versions of DOS required it, but new computers no longer do this). You can't assume the value will be zero, for instance.
Garbage value, e.g. in C, typically refers to the fact that if you just reserve memory but never initialize it, it will hold random values, since it simply is not initialized yet (C doesn't do that for you automatically; it would just be overhead, and C is designed for as little overhead as possible).
The random values in the memory are leftovers from whatever was in there before.
These previous values are left there because usually there is not much use in setting memory to zero - or any other value - that will later be overwritten again anyway. For the general case, there is no use in reading uninitialized memory (except if you e.g. want to exploit possible security issues - see the special cases where memory is actually zeroed: Kernel zeroes memory?).

Need an explanation of how memory addresses work in this C program

I have a very simple C program where I am (out of my own curiosity) investigating which memory addresses are used to allocate local variables. My program is:
#include <stdio.h>
int main()
{
    char buffer_1[8], buffer_2[8], buffer_3[8];
    printf("address of buffer_1 %p\n", (void *)buffer_1);
    printf("address of buffer_2 %p\n", (void *)buffer_2);
    printf("address of buffer_3 %p\n", (void *)buffer_3);
    return 0;
}
output is as follows:
address of buffer_1 0x7fff5fbfec30
address of buffer_2 0x7fff5fbfec20
address of buffer_3 0x7fff5fbfec10
My question is: why do the addresses seem to be getting smaller? Is there some logic to this? Thank you.
The compiler is allowed to do whatever it wants with your automatic variables. In this case it just looks like it's putting them consecutively on the stack. On most popular systems in use today, stacks grow downwards.
Most compilers allocate stack memory for local variables in one step, at the very beginning of the function. The memory is allocated as a single contiguous block. Under these circumstances, the compiler, obviously, is free to use absolutely any memory layout for the local variables inside that block. It can put them there so that the addresses increase in the order of declaration. Or decrease. Or are arranged randomly. It is an implementation detail, and there's not much logic behind it.
It is quite possible that in your case the compiler tried to "pretend" that the memory for the arrays was allocated on the stack sequentially and independently (even though that was not the case). If on your platform the stack grows downwards (as it does on many platforms), then it is expected that objects declared later will have smaller addresses.
But again, functions don't allocate local objects individually. And on top of that, the language makes no guarantees about any relationship between local object addresses. So there's no real reason to prefer one ordering over the other.
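If you are curious, a rough (and purely informational) probe is to compare the address of a local in a caller with the address of a local in its callee; the result describes this particular implementation only, not anything the language promises.
#include <stdio.h>
#include <stdint.h>

/* Rough probe of stack growth direction. Converting the addresses to
 * uintptr_t and comparing them says something about this implementation
 * only; the C standard makes no guarantee here. */
static void callee(uintptr_t caller_local)
{
    int inner;
    printf("stack appears to grow %s\n",
           (uintptr_t)&inner < caller_local ? "downwards" : "upwards");
}

int main(void)
{
    int outer;
    callee((uintptr_t)&outer);
    return 0;
}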
The output of your C program is platform-dependent and compiler-dependent.
There cannot be just one perfect answer because the address arrangements vary based on:
Whether the system is little or big endian.
What kind of OS you are compiling on.
What kind of memory architecture you are compiling for.
What kind of compiler you are using (and compilers might have bugs too).
Whether you are on 64-bit or 32-bit platform.
And so much more.
But most important of all is the type of processor architecture. :)
Here is a list of stack growth strategies per processor:
x86, PDP-11: downwards.
System z: in a linked-list fashion, downwards, mostly.
ARM: selectable, and can grow either upward or downward.
Mostek 6502: downwards (but only 256 bytes).
SPARC: in a circular fashion with a sliding window; a limited-depth stack.
RCA 1802A: subject to the SCRT (Standard Call and Return Technique) implementation.
But, in general, your compiler at compile time maps those addresses into the generated binary file. Then at run time, the binary file may occupy (or may pretend to occupy) a sequential set of memory addresses. In your case, the addresses printed by your C program show that the stack is growing downward.
Basically, the compiler is responsible for allocating memory to all the variables.
The array gets an address on the stack, but that has nothing to do with the output you are getting.
The thing is, the compiler found a contiguous chunk of memory empty at that time and hence allocated it to your program.

defining a simple array of integers

Why can I do this?
int array[4]; // I assume this creates an array with 4 indexes
array[25] = 1; // I have assigned an index larger than the int declaration
NSLog(@"result: %i", array[25]); // this prints "1" to the screen
Why does this work, if the index exceeds the declaration? What is the significance of the number in the declaration if it has no effect on what you can actually do with the array?
Thanks.
You are getting undefined behavior. It could print anything, it could crash, it could burst into singing (okay that isn't likely but you get the idea).
If it happens to write to a location that is mapped with the adequate permissions it will work. Until one day when it won't because of a different layout.
It is undefined. Some OSes will give you a segmentation fault, while some tolerate this. Anyhow, exceeding the array's size should be avoided.
In most expressions, an array name decays to a pointer to the start of a contiguous, allocated block of memory.
In this case, you have allocated 4 ints' worth of memory.
So if you wrote array[2], it would mean "the memory at array + sizeof(int) * 2".
Change the 2 to 25, and you're just looking at 25 ints' worth of memory past the start. Since there are no checks to verify you're in bounds (either when assigning or printing), it works.
There's nothing to say something else isn't allocated there, though!
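Here is a small in-bounds demonstration of that address arithmetic (staying inside the array, so the comparison is well defined):
#include <stdio.h>

int main(void)
{
    int array[4] = {0};

    /* array[2] is the int located sizeof(int) * 2 bytes past the start. */
    int *direct   = &array[2];
    int *computed = (int *)((char *)array + 2 * sizeof(int));

    printf("%s\n", direct == computed ? "same address" : "different address");
    return 0;
}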
The number in the declaration determines how much memory should be reserved, in this case 4 * sizeof(int).
Writing to memory out of bounds is possible but not recommended. In C you can access any point in memory available to your program, but if you write to a spot that isn't reserved for that purpose, you can cause memory corruption. Array names decay to pointers (but not the other way around).
The behavior depends on the compiler, the platform and some randomness. Don't do it.
It's doing very bad things. If the array is declared locally to a function, you're probably writing on stack locations above the current function's stack frame. If it is declared globally, you're probably writing on memory allocated to adjacent variables. Either way, it is "working" by pure luck.
It is possible your C compiler had padded your array for memory alignment purposes, and by luck your array overrun just happens to still be within the rounded-up allocation. Don't rely on it though.
This is unsafe programming. It really should be avoided, because the program may not crash, and a crash is really the best thing you could hope for. It could instead give you garbage results, which are unpredictable and could really screw up your program. And since you don't notice anything is wrong (because it isn't crashing), it will quietly ruin the integrity of your data. Since there is no try/catch in C, you really should check inputs. Remember that scanf returns an int.
C by design does not perform array bounds checking. It was designed as a systems level language, and the overhead of explicit run-time bounds checking could be prohibitive in a great many cases. Consequently C will allow 'dangerous' code and must be used carefully. If you need something 'safer' then C# or Java may be more appropriate, but there is a definite performance hit.
Automated static analysis tools may help, and there are run-time bounds checking tools for use in development and debugging.
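In the absence of built-in checks, the usual idiom is an explicit bounds test before indexing; a minimal sketch:
#include <stdio.h>

#define LEN(a) (sizeof (a) / sizeof (a)[0])

int main(void)
{
    int array[4] = {1, 2, 3, 4};
    size_t i = 25;                 /* imagine this index came from untrusted input */

    if (i < LEN(array))
        printf("array[%zu] = %d\n", i, array[i]);
    else
        fprintf(stderr, "index %zu is out of range\n", i);
    return 0;
}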
In C an array is a contiguous block of memory. By accessing the array out of bounds, you are simply accessing memory beyond the end of the array. What accessing such memory will do is non-deterministic; it may be junk, it may belong to an adjacent variable, or it may belong to the variables of the calling function or those above it. It may be a return address for the current function or a calling function. In a memory-protected OS such as Windows or Linux, if you access so far out of bounds as to be no longer within the address range assigned to the process, a fault exception will occur.
