Undefined behavior beyond the max index of an array - c

Situation:
I'm taking a crash course to get familiar with C, and I've noticed that the author of this course can print values beyond the array's last index and be confident that the value will be 0 each time.
Code from crash course below:
int arrayVar[] = {45, 67, 34, 23};
printf("This array index value is %d", arrayVar[4]);
Output from code:
This array index value is 0
It's been my experience, during this tinkering/testing of C, that once you go beyond the array's max index you're entering undefined behavior territory, where anything can happen. So how can he be so confident (and right) about seeing a 0 value every time?
If I print values beyond the array's max index, I see "random" values (or, values that were left there in memory, right?).
Why is my experience different from what I'm seeing in this course? Is this just a difference in C standards? Or does this indicate a difference in compilers? Or both?
Environment info: I am using the C11 standard, and I'm using the compiler that (I'm pretty sure) came by default with Ubuntu, located at /usr/bin/cc.
EDIT: For anyone interested in seeing what I'm seeing, here's a link to the course (you'll probably be prompted to login to Udemy): https://www.udemy.com/c-fast-crash-course-introduction/learn/lecture/12868540#questions

The author of the course is wrong.
It's that simple.

Undefined does not mean random. In many cases, undefined behavior will lead to some default behavior and hence may go unnoticed for a long time. Memory is commonly initialized with zeros, so accessing uninitialized memory often yields zeros, which is why some memory debugger libraries fill allocated memory with uncommon values such as 0xDEADBEEF that have a better chance of triggering problems.
Memory allocation is nontrivial. The underlying libraries need to keep track of what is allocated vs. free, there are different kinds of allocations (stack vs. heap, data segment, BSS, ...), and libraries may have optimized strategies for allocating certain small objects, etc. You don't call into the OS to allocate 16 bytes; "the situation is complicated". When you allocate 16 bytes, your C library likely asks the kernel for several megabytes (if it didn't do so before), the kernel pretends it gives all this memory to the application (assuming that quite often not all of it is ever used), and the library then cuts off a chunk with your 16 bytes plus some overhead for memory management, usually aligned to an 8-byte boundary, because micromanaging memory at the byte level is a bad idea for multiple reasons. So the next integer may well be within these megabytes that are already allocated and zeroed for future use.
(Although in this particular case the array supposedly is in the data section and never dynamically allocated, the idea is similar - there probably is some static variable next to it that happens to be zero. You may want to look at a dump of the binary's data segment layout.)
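As a hedged illustration of that last point (the variable name nextVar below is my own, and nothing guarantees the linker actually places it right after the array), a neighbouring zero-initialized static object is one plausible reason such an out-of-bounds read keeps producing 0 - while still being undefined behavior:
#include <stdio.h>

/* Hypothetical layout: both objects have static storage duration.
   nextVar is zero-initialized, so if it happens to sit right after
   arrayVar in memory, the out-of-bounds read below "sees" a 0.
   This is still undefined behavior - neither the layout nor the
   value is guaranteed. */
int arrayVar[] = {45, 67, 34, 23};
int nextVar;   /* zero-initialized, typically placed in BSS */

int main(void) {
    printf("in bounds:     %d\n", arrayVar[3]);
    printf("out of bounds: %d\n", arrayVar[4]);   /* undefined behavior */
    return 0;
}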

Related

Are arrays guaranteed to be contiguous in virtual memory?

int main() {
    char a[3] = {1, 2, 3};
    return sizeof(a);
}
Is a guaranteed to be in consecutive bytes in virtual memory?
I know it may not be consecutive in physical memory as the mapping is done behind the scene by the MMU.
If the compiler notices I'm not taking the address of any of the elements, is it then free to put them at non-consecutive addresses in memory or even put them in a register?
Let's assume the optimizer will not fully get rid of it in my example.
Is a guaranteed to be in consecutive bytes in virtual memory? I know it may not be consecutive in physical memory as the mapping is done behind the scene by the MMU.
Your code is guaranteed to behave as-if the array was in consecutive bytes of address space.
If the compiler notices I'm not taking the address of any of the elements, is it then free to put them at non-consecutive addresses in memory or even put them in a register?
It is free to do so even if you do take the address. The compiler can compile your code any way it wants to make it efficient so long as the code doesn't break.
Let's assume the optimizer will not fully get rid of it in my example.
Okay. But it's allowed to. C has an "as-if" rule, which means that the rules only concern the observable behavior of your code; they don't constrain how the compiler (or the machine) gets that behavior.
The non-pedantic answer is that, yes, for all intents and purposes, arrays are guaranteed to be stored contiguously in C. This is no accident, it's pretty much a fundamental part of the definition of an array.
The whole point of an array is that you can access any element in constant (that is, O(1)) time, notionally by computing an address, and without having to chase any links as you would with almost any other data structure. So if, as some obscure and invisible implementation detail, an array were somehow not actually stored contiguously, it would always have to act exactly as if it were stored contiguously. And since contiguity is the defining property of an array, I don't think there's any harm in thinking consciously about that property, and assuming that it's always true.
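If you want to see the contiguity for yourself, a quick sketch (the exact addresses will differ from run to run, but each element sits exactly sizeof(char) past the previous one):
#include <stdio.h>

int main(void) {
    char a[3] = {1, 2, 3};

    /* Print the address of each element; they come out consecutive. */
    for (int i = 0; i < 3; i++)
        printf("&a[%d] = %p\n", i, (void *)&a[i]);

    /* Equivalently, &a[i] == a + i for any valid index i. */
    return 0;
}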
Well, in this particular code snippet a will most likely be stored on the stack, and yes, in this particular case it would be contiguous.
However, if you were to compile at a higher optimization level, most likely any compiler would not even store a; it would just do constant propagation and return the size of a directly.
If you were to allocate memory on the heap, then it depends on the allocator: with a buddy or slab allocator the blocks would more often than not be contiguous, but with a naive free-list allocator they may not be contiguous across multiple heap allocation calls. For an array this is not an issue, as you will most likely make a single call, so it would still be contiguous.
Tools like objdump, opt from LLVM, gdb, etc. are great if you want to check the disassembly and see how the compiler lays out the code across different optimization levels.
C is defined in terms of an abstract machine. The bytes are guaranteed to be contiguous in the abstract machine.
The real machine can do literally anything so long as the observable behaviour of the program matches an allowed output of the program in the abstract machine.
So, there are no guarantees about any placement of data in the real machine.

Finding Allocated Memory

Platform: x86 Linux 3.2.0 (Debian 7.1)
Compiler: GCC 4.7.2 (Debian 4.7.2-5)
I am writing a function that generates a "random" integer by reading allocated portions of memory for "random" values. The idea is based on the fact that uninitialized variables have undefined values. My initial idea was to allocate an array using malloc() and then use its uninitialized elements to generate a random number. But malloc() tends to return NULL blocks of memory, so I cannot guarantee that there is anything to read. So I thought about reading a separate process's memory in order to almost guarantee values other than NULL. My current idea is to somehow find the first valid memory address and read from there down, but I do not know how to do this. I tried initializing a pointer to NULL and then incrementing it by one, but if I attempt to print the referenced memory location a segmentation fault occurs. So my question is: how do I read a separate process's memory? I do not need to do anything with the memory other than read it.
The idea is based on the fact that uninitialized variables have undefined values.
No, you can't rely on that. They have garbage values, which means whatever happens to be in that memory; they are not random.
You can't read a separate process's memory; the kernel protects you from doing that, because when it happens it is usually due to an error in setting up your pointers. Even if it were possible, you wouldn't be getting anything near a random integer. Why not read from /dev/random instead?
Random numbers have certain special properties. Computer memory in general doesn't satisfy those properties.
If I sampled computer memory, tons of it would be quite similar, and certain numbers would have such a low probability of existing that they might not even be found within the entire memory of a computer.
That's not to mention that if I read a bit of memory that's outside of the memory allocated to a program, the OS will kill me dead with a SEGFAULT.
It's a bad idea, on many levels. Use a proper random number generator.
Generating random numbers in computers by software is HARD (there are hardware random number generators). Memory in a new program is a terrible source, especially early on, as the OS has zeroed all of memory before it starts the program. Any non-zeros you see are left over from initialization code leaving its dirt behind.
Assuming you want some "do it yourself" numbers, the micro/nanosecond digits of the time are an old-style solution... the theory is shown below... play with your own numbers. Modulo with a large prime would be good. Just be sure to discard anything above 1/1,000th of a second.
(long long)(nano * 1E10) % 1000
This assumes you are starting from a manual command rather than a scheduled job.
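A minimal sketch of that time-based idea, assuming a POSIX system with clock_gettime (the modulus 997, a prime, is arbitrary, and this is nowhere near a proper random number generator):
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;

    /* Read the wall clock with nanosecond resolution (POSIX). */
    if (clock_gettime(CLOCK_REALTIME, &ts) != 0) {
        perror("clock_gettime");
        return 1;
    }

    /* Keep only the fast-moving sub-second digits, folded by a prime. */
    long noise = ts.tv_nsec % 997;
    printf("pseudo-random value: %ld\n", noise);
    return 0;
}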
If you are running on UNIX look into reading a few bytes from /dev/urandom, or with proper care,/dev/random (read the man page).
Windows has its own API. In Perl,
new Win32::API "advapi$b32","CryptAcquireContextA",'PNNNN','N' ||
die "$^E\n"; # Use MS crypto or die
The serious work that good random number generators do to get good numbers is beyond a quick response here; they usually rely on hardware, such as timestamping interrupts.
The idea is based on the fact that uninitialized variables have undefined values.
They are undefined insofar as you cannot predict what they contain. What they really contain is mostly OS-dependent.
Back in old DOS days, you could maybe rely on the fact that if you executed several programs in the current session, there was garbage in the memory. But even then the data wasn't a reliable source of randomness.
Nowadays, things are different.
If you have variables on the stack, and in the current program run you were never as deep on the stack as now, your local variables are 0. Otherwise, they contain the data from previous function calls.
If you malloc() and the libc takes the returned memory from the pool of already used memory, it might contain garbage as well. But if it newly gets it from the OS, it is zeroed.
My initial idea was to allocate an array using malloc() and then use its uninitialized elements to generate a random number. But malloc() tends to return NULL blocks of memory so I cannot guarantee that there is anything to read.
(Not NULL, but 0 or NUL.)
See my last point: it depends on the history of the malloc()ed area.
So I thought about reading a separate processes memory in order to almost guarantee values other than NULL.
You cannot, as processes are separated and shielded from each other.
As others said, there are better sources of randomness. /dev/random if you definitely need real entropy, /dev/urandom otherwise.
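A minimal sketch of that suggestion, assuming Linux/UNIX (error handling kept to a bare minimum): read a few bytes from /dev/urandom and treat them as an integer.
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    unsigned int value;

    /* /dev/urandom hands out kernel-gathered randomness on Linux/UNIX. */
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fread(&value, sizeof value, 1, f) != 1) {
        fprintf(stderr, "short read from /dev/urandom\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("random value: %u\n", value);
    return 0;
}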

Garbage values in a multiprocess operating system

Does the allocated memory hold the garbage value since the start of the OS session? Does it have some significance before we name it a garbage value in our program's runtime session? If so, then why?
I need some advice on study materials regarding Linux kernel programming and device driver programming, and I also want to develop an understanding of how computer devices actually work. I get stuck in situations like this "garbage value" one and feel like I have to study something else as well to understand the programming language better. I am studying by myself and running into a lot of confusing situations. Any advice will be really helpful.
"Garbage value" is a slang term, meaning "I don't know what value is there, or why, and for that reason I will not use the value". It is "garbage" in the sense of "useless nonsense", and sometimes it is also "garbage" in the sense of "somebody else's leavings".
Formally, uninitialized memory in C takes "indeterminate values". This might be some special value written there by the C implementation, or it might be something "left over" by an earlier user of the same memory. So, for example:
A debug version of the C runtime might fill newly-allocated memory with an eye-catcher value, so that if you see it in the debugger when you were expecting your own stored data, you can reasonably conclude that either you forgot to initialize it or you're looking in the wrong place.
The kernel of a "proper" operating system will overwrite memory when it is first assigned to a process, to avoid one process seeing data that "belongs" to another process and that for security reasons should not leak across process boundaries. Typically it will overwrite it with some known value, like 0.
If you malloc memory, write something in it, then free it and malloc some more memory, you might get the same memory again with its previous contents largely intact. But formally your newly-allocated buffer is still "uninitialized" even though it happens to have the same contents as when you freed it, because formally it's a brand new array of characters that just so happens to have the same address as the old one.
One reason not to use an "indeterminate value" in C is that the standard permits it to be a "trap representation". Some machines notice when you load certain impossible values of certain types into a register, and you'd get a hardware fault. So if the memory was previously used for, say, an int, but then that value is read as a float, who is to say whether the left-over bit pattern represents a so-called "signalling NaN", that would halt the program? The same could happen if you read a value as a pointer and it's mis-aligned for the type. Even integer types are permitted to have "parity bits", meaning that reading garbage values as int could have undefined behavior. In practice, I don't think any implementation actually does have trap representations of int, and I doubt that any will check for mis-aligned pointers if you just read the pointer value -- although they might if you dereference it. But C programmers are nothing if not cautious.
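To make the malloc/free point above concrete, here is a small sketch. Whether the second allocation actually reuses the same block is entirely up to the allocator, and even when the addresses match, reading the new block before writing to it would mean reading an indeterminate value:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *first = malloc(32);
    if (!first) return 1;
    strcpy(first, "hello");
    printf("first  block at %p\n", (void *)first);
    free(first);

    /* A same-sized request often (but not always) gets the block back. */
    char *second = malloc(32);
    if (!second) return 1;
    printf("second block at %p\n", (void *)second);

    /* Even if the addresses match, 'second' is formally uninitialized:
       reading it before writing to it is not valid C. */
    free(second);
    return 0;
}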
What is garbage value?
When you encounter values at a memory location and cannot conclusively say what those values should be, then those values are garbage values for you, i.e. the value is indeterminate.
Most commonly, when you use a variable and do not initialize it, the variable has an indeterminate value and is said to possess a garbage value. Note that using an uninitialized variable leads to undefined behavior, which means the program is not a valid C/C++ program and may show (literally) any behavior.
Why the particular value exists at that location?
Most operating systems today use the concept of virtual memory. The memory address a user program sees is a virtual memory address, not the physical address. Implementations of virtual memory divide a virtual address space into pages, blocks of contiguous virtual memory addresses that are usually at least 4 kilobytes. Once their user is done with them, these pages are not explicitly wiped of their contents; they are only marked as free for reuse, and hence they still contain the old contents if not properly initialized.
On a typical OS, your userspace application only sees a range of virtual memory. It is up to the kernel to map this virtual memory to actual, physical memory.
When a process requests a piece of (virtual) memory, it will initially hold whatever is left in it -- it may be a reused piece of memory that another part of the process was using earlier, or it may be memory that a completely different process had been using... or it may never have been touched at all and be in whatever state it was when you powered on the machine.
Usually nobody goes and wipes a memory page with zeros (or any other equally arbitrary value) on your behalf, because there'd be no point. It's entirely up to your application to use the memory in whatever way you please, and if you're going to write to it anyway, then you don't care what was in it before.
Consequently, in C it is simply not allowed to read a variable before you have written to it, under pain of undefined behaviour.
If you declare a variable without initialising it to a particular value, it may contain a value which was previously assigned by a different program that has since released that piece of memory, or it may simply be a random value from when the computer was booted (iirc, PCs used to initialise all RAM to 0 on bootup because early versions of DOS required it, but new computers no longer do this). You can't assume the value will be zero, for instance.
Garbage value, e.g. in C, typically refers to the fact that if you just reserve memory but never initialize it, it will hold "random" values, since it simply is not initialized yet (C doesn't do that for you automatically; it would just be overhead, and C is designed for as little overhead as possible).
The random values in the memory are leftovers from whatever was in there before.
These previous values are left in there because usually there is not much use in going around setting memory to zero - or any other value - when it will later be overwritten again anyway. For the general case, there is no use in reading uninitialized memory (except if you e.g. want to exploit possible security issues - see the special cases where memory is actually zeroed: Kernel zeroes memory?).

defining a simple array of integers

Why can I do this?
int array[4]; // I assume this creates an array with 4 indexes
array[25]=1; // I have assigned an index larger than the int declaration
NSLog(@"result: %i", array[25]); // this prints "1" to the screen
Why does this work, if the index exceeds the declaration? What is the significance of the number in the declaration if it has no effect on what you can actually do with the array?
Thanks.
You are getting undefined behavior. It could print anything, it could crash, it could burst into song (okay, that isn't likely, but you get the idea).
If it happens to write to a location that is mapped with the adequate permissions it will work. Until one day when it won't because of a different layout.
It is undefined. Some OSes will give you a segmentation fault, while some tolerate this. Anyhow, exceeding the array's size should be avoided.
An array name, in most expressions, behaves like a pointer to the start of a contiguous, allocated block of memory.
In this case, you have allocated 4 ints worth of memory.
So if you went array[2] it would think "the memory at array + sizeof(int) * 2"
Change the 2 to 25, and you're just looking at 25 ints' worth of memory past the start. Since there are no checks to verify you're in bounds (either when assigning or printing), it works.
There's nothing to say something else isn't allocated there, though!
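That "array + sizeof(int) * 2" description is really just C's indexing rule: a[i] is defined as *(a + i), where the + i already moves in whole elements. A tiny in-bounds illustration:
#include <stdio.h>

int main(void) {
    int array[4] = {10, 20, 30, 40};

    /* array[2] and *(array + 2) are the same access by definition. */
    printf("%d %d\n", array[2], *(array + 2));

    /* The address steps by sizeof(int), not by single bytes. */
    printf("%p %p\n", (void *)&array[2], (void *)(array + 2));
    return 0;
}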
The number in the declaration determines how much memory should be reserved, in this case 4 * sizeof(int).
Writing to memory out of bounds is possible but not recommended. In C you can access any point in memory available to your program, but if you write somewhere that isn't reserved for that purpose, you can cause memory corruption. (Array names decay to pointers in most expressions, but arrays are not pointers.)
The behavior depends on the compiler, the platform and some randomness. Don't do it.
It's doing very bad things. If the array is declared locally to a function, you're probably writing on stack locations above the current function's stack frame. If it is declared globally, you're probably writing on memory allocated to adjacent variables. Either way, it is "working" by pure luck.
It is possible your C compiler had padded your array for memory alignment purposes, and by luck your array overrun just happens to still be within the rounded-up allocation. Don't rely on it though.
This is unsafe programming. It really should be avoided, because it may not crash your program - and a crash is really the best thing you could hope for. It could instead give you garbage results, which are unpredictable and could really screw up your program. And because it doesn't crash, you don't know anything is wrong, so it quietly ruins the integrity of your data. Since there is no try/catch in C, you really should check inputs yourself. Remember that scanf returns an int.
C by design does not perform array bounds checking. It was designed as a systems level language, and the overhead of explicit run-time bounds checking could be prohibitive in a great many cases. Consequently C will allow 'dangerous' code and must be used carefully. If you need something 'safer' then C# or Java may be more appropriate, but there is a definite performance hit.
Automated static analysis tools may help, and there are run-time bounds checking tools for use in development and debugging.
In C an array is a contiguous block of memory. By accessing the array out of bounds, you are simply accessing memory beyond the end of the array. What accessing such memory will do is non-deterministic: it may be junk, it may belong to an adjacent variable, or it may belong to the variables of the calling function or above. It may be a return address for the current function or a calling function. In a memory-protected OS such as Windows or Linux, if you access so far out of bounds as to be no longer within the address range assigned to the process, a fault exception will occur.
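A heavily hedged sketch of the kind of silent corruption described above - the result is not guaranteed at all; with another compiler, optimization level, or stack-protector setting you may see untouched values, different values, a warning, or a crash:
#include <stdio.h>

int main(void) {
    int before = 111;
    int small[2] = {1, 2};
    int after = 222;

    small[2] = 999;   /* one past the end: undefined behavior */

    /* Depending on how the compiler laid out the stack frame, 'before'
       or 'after' (or neither) may now hold 999 - or the program may
       have crashed before reaching this line. */
    printf("before=%d after=%d\n", before, after);
    return 0;
}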

Why can I write and read memory when I haven't allocated space?

I'm trying to build my own Hash Table in C from scratch as an exercise and I'm doing one little step at a time. But I'm having a little issue...
I'm declaring the Hash Table structure as a pointer so I can initialize it with the size I want and increase its size whenever the load factor is high.
The problem is that I'm creating a table with only 2 elements (it's just for testing purposes); I'm allocating memory for just those 2 elements, but I'm still able to write to memory locations that I shouldn't. I can also read memory locations that I haven't written to.
Here's my current code:
#include <stdio.h>
#include <stdlib.h>

#define HASHSIZE 2

typedef char *HashKey;
typedef int HashValue;

typedef struct sHashTable {
    HashKey key;
    HashValue value;
} HashEntry;

typedef HashEntry *HashTable;

void hashInsert(HashTable table, HashKey key, HashValue value) {
}

void hashInitialize(HashTable *table, int tabSize) {
    *table = malloc(sizeof(HashEntry) * tabSize);
    if(!*table) {
        perror("malloc");
        exit(1);
    }
    (*table)[0].key = "ABC";
    (*table)[0].value = 45;
    (*table)[1].key = "XYZ";
    (*table)[1].value = 82;
    (*table)[2].key = "JKL";
    (*table)[2].value = 13;
}

int main(void) {
    HashTable t1 = NULL;
    hashInitialize(&t1, HASHSIZE);
    printf("PAIR(%d): %s, %d\n", 0, t1[0].key, t1[0].value);
    printf("PAIR(%d): %s, %d\n", 1, t1[1].key, t1[1].value);
    printf("PAIR(%d): %s, %d\n", 2, t1[2].key, t1[2].value);
    printf("PAIR(%d): %s, %d\n", 3, t1[3].key, t1[3].value);
    return 0;
}
You can easily see that I haven't allocated space for (*table)[2].key = "JKL"; nor (*table)[2].value = 13;. I also shouldn't be able to read the memory locations in the last 2 printfs in main().
Can someone please explain this to me and if I can/should do anything about it?
EDIT:
Ok, I've realized a few things about my code above, which is a mess... But I have a class right now and can't update my question. I'll update this when I have the time. Sorry about that.
EDIT 2:
I'm sorry, but I shouldn't have posted this question, because I don't want my code to be like what I posted above. I want to do things slightly differently, which makes this question a bit irrelevant. So I'm just going to assume this was a question I needed an answer for and accept one of the correct answers below. I'll then post my proper questions...
Just don't do it, it's undefined behavior.
It might accidentally work because you write/read some memory the program doesn't actually use. Or it can lead to heap corruption because you overwrite metadata used by the heap manager for its own purposes. Or you can overwrite some other unrelated variable and then have a hard time debugging the program that goes nuts because of that. Or anything else harmful - either obvious or subtle yet severe - can happen.
Just don't do it - only read/write memory you legally allocated.
Generally speaking (implementations differ between platforms), when malloc or a similar heap-based allocation call is made, the underlying library translates it into a system call. When the library does that, it generally allocates space in regions that are equal to or larger than the amount the program requested.
Such an arrangement is done to prevent frequent system calls into the kernel for allocation and to satisfy program requests for heap memory faster (this is certainly not the only reason - others exist as well).
A side effect of such an arrangement is the problem you are observing. Once again, it's not guaranteed that your program will be able to write to a non-allocated zone without crashing or seg-faulting every time - that depends on the particular binary's memory arrangement. Try writing to an even higher array offset - your program will eventually fault.
As for what you should/should-not do - people who have responded above have summarized fairly well. I have no better answer except that such issues should be prevented and that can only be done by being careful while allocating memory.
One way of understanding this is through a crude example: when you request 1 byte in userspace, the kernel has to allocate at least a whole page (which would be 4 KB on some Linux systems, for example - the most granular allocation at kernel level). To improve efficiency by reducing frequent calls, the kernel assigns this whole page to the calling library, which the library can then hand out as more requests come in. Thus, read or write requests to such a region may not necessarily generate a fault. They would just yield garbage.
In C, you can read from any address that is mapped, and you can write to any address that is mapped as read-write.
In practice, the OS gives a process memory in chunks (pages) of normally 8K (but this is OS-dependent). The C library then manages these pages and maintains lists of what is free and what is allocated, giving the user addresses within these blocks when asked via malloc.
So when you get a pointer back from malloc(), you are pointing to an area within a page that is read-writable. This area may contain garbage, or it may contain other malloc'd memory; it may contain the memory used for stack variables, or it may even contain the memory used by the C library to manage the lists of free/allocated memory!
So you can imagine that writing to addresses beyond the range you have malloc'ed can really cause problems:
Corruption of other malloc'ed data
Corruption of stack variables, or the call stack itself, causing crashes when a function returns
Corruption of the C-library's malloc/free management memory, causing crashes when malloc() or free() are called
All of which are a real pain to debug, because the crash usually occurs much later than when the corruption occurred.
Only when you read or write from/to an address which does not correspond to a mapped page will you get a crash, e.g. reading from address 0x0 (NULL).
malloc, free, and pointers are very fragile in C (and to a slightly lesser degree in C++), and it is very easy to shoot yourself in the foot accidentally.
There are many third-party tools for memory checking which wrap each memory allocation/free/access with checking code. They do tend to slow your program down, depending on how much checking is applied.
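As a sketch of the "wrap each allocation with checking code" idea, here is a hypothetical debug_malloc/debug_check pair (my own names, not a real library) that plants a known pattern just past the requested size so a small overrun can be detected later. Real tools such as Valgrind or AddressSanitizer are far more thorough; this only shows the principle.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 8
#define GUARD_BYTE 0xAB

/* Allocate a little extra and fill the tail with a known pattern. */
static void *debug_malloc(size_t n) {
    unsigned char *p = malloc(n + GUARD_SIZE);
    if (!p) return NULL;
    memset(p + n, GUARD_BYTE, GUARD_SIZE);
    return p;
}

/* Verify that the pattern just past the requested size is intact. */
static void debug_check(const void *ptr, size_t n) {
    const unsigned char *p = ptr;
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (p[n + i] != GUARD_BYTE) {
            fprintf(stderr, "overrun detected just past %zu bytes\n", n);
            return;
        }
    }
}

int main(void) {
    char *buf = debug_malloc(4);
    if (!buf) return 1;

    strcpy(buf, "hi!");   /* 4 bytes including the terminator: fits */
    debug_check(buf, 4);  /* guard intact, prints nothing */

    buf[4] = 'X';         /* one byte too far (into our own guard area) */
    debug_check(buf, 4);  /* now reports an overrun */

    free(buf);
    return 0;
}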
Think of memory as being a great big blackboard divided into little squares. Writing to a memory location is equivalent to erasing a square and writing a new value there. The purpose of malloc generally isn't to bring memory (blackboard squares) into existence; rather, it's to identify an area of memory (a group of squares) that's not being used for anything else, and take some action to ensure that it won't be used for anything else until further notice. Historically, it was pretty common for microprocessors to expose all of the system's memory to an application. A piece of code Foo could in theory pick an arbitrary address and store its data there, but with a couple of major caveats:
Some other code `Bar` might have previously stored something there with the expectation that it would remain. If `Bar` reads that location expecting to get back what it wrote, it will erroneously interpret the value written by `Foo` as its own. For example, if `Bar` had stored the number of widgets that were received (23), and `Foo` stored the value 57, the earlier code would then believe it had received 57 widgets.
If `Foo` expects the data it writes to remain for any significant length of time, its data might get overwritten by some other code (basically the flip-side of the above).
Newer systems include more monitoring to keep track of what processes own what areas of memory, and kill off processes that access memory that they don't own. In many such systems, each process will often start with a small blackboard and, if attempts are made to malloc more squares than are available, processes can be given new chunks of blackboard area as needed. Nonetheless, there will often be some blackboard area available to each process which hasn't yet been reserved for any particular purposes. Code could in theory use such areas to store information without bothering to allocate it first, and such code would work if nothing happened to use the memory for any other purpose, but there would be no guarantee that such memory areas wouldn't be used for some other purpose at some unexpected time.
Usually malloc will allocate more memory than you asked for, for alignment purposes. Also, the process really does have read/write access to the heap memory region, so reading a few bytes outside the allocated region seldom triggers any errors.
But you still should not do it. Since the memory you're writing to may be regarded as unoccupied, or may in fact be occupied by something else, anything can happen: e.g. the 2nd and 3rd key/value pairs may become garbage later, or some unrelated but vital function may crash due to invalid data you've stomped onto its malloc'ed memory.
(Also, if your keys will ever come from local buffers rather than string literals, either use a char array of sufficient size as the key type or malloc the key, because a key pointing into a stack buffer will become invalid later.)
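Putting the advice from these answers together, here is one way the initialization could stay within its allocation and own its keys. This is only a sketch; hashSet and copyKey are my own helper names, not anything from the original code or a standard API:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASHSIZE 2

typedef char *HashKey;
typedef int HashValue;

typedef struct sHashTable {
    HashKey key;
    HashValue value;
} HashEntry;

typedef HashEntry *HashTable;

/* Copy a key into freshly allocated storage so the table never points
   at memory it does not own. */
static char *copyKey(const char *s) {
    char *p = malloc(strlen(s) + 1);
    if (p) strcpy(p, s);
    return p;
}

void hashInitialize(HashTable *table, int tabSize) {
    *table = calloc((size_t)tabSize, sizeof(HashEntry)); /* zeroed slots */
    if (!*table) {
        perror("calloc");
        exit(1);
    }
}

/* Store a pair only if the index is inside the allocated range. */
int hashSet(HashTable table, int tabSize, int i, const char *key, HashValue value) {
    if (i < 0 || i >= tabSize)
        return -1;                    /* refuse out-of-bounds writes */
    table[i].key = copyKey(key);
    if (!table[i].key)
        return -1;
    table[i].value = value;
    return 0;
}

int main(void) {
    HashTable t1 = NULL;
    hashInitialize(&t1, HASHSIZE);

    hashSet(t1, HASHSIZE, 0, "ABC", 45);
    hashSet(t1, HASHSIZE, 1, "XYZ", 82);
    if (hashSet(t1, HASHSIZE, 2, "JKL", 13) != 0)    /* index 2 rejected */
        fprintf(stderr, "index 2 is outside a table of size %d\n", HASHSIZE);

    for (int i = 0; i < HASHSIZE; i++)
        printf("PAIR(%d): %s, %d\n", i, t1[i].key, t1[i].value);

    for (int i = 0; i < HASHSIZE; i++)
        free(t1[i].key);
    free(t1);
    return 0;
}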

Resources