Why does it work if the size of buffer is fewer than nbyte? [duplicate] - c

This question already has answers here:
Undefined, unspecified and implementation-defined behavior
(9 answers)
Closed 9 years ago.
The codes are like these:
#define BUFSIZ 5
#include <stdio.h>
#include <sys/syscall.h>
main()
{
char buf[BUFSIZ];
int n;
n = read(0, buf, 10);
printf("%d",n);
printf("%s",buf);
return 0;
}
I inputabcdefg then and the output is:
8abcdefg
In the read(0, buf, 10);, the 10 is larger than 5, which is the size of buf. But it doesn't seem to lead to a wrong result.. Does anyone have ideas about this? Thanks!

This is a quirk of how allocation in C works. You have a buffer allocated on the stack, which is really just a chunk of contiguous memory that you can read and write. The fact that you're allowed to write off the end of this array means that in this case it just so happens to work. Perhaps on your machine with your particular compiler and stack layout, you don't end up overwriting anything important :-)
Relying on this behavior being the same between compiler versions is not advised.

You can in principle1 read from and write to any address, but it is only safe and meaningful to access data in an organized, well-defined manner.
The purpose of memory allocation (explicit or implicit) is to bring order into chaos. When you declare your buf array, a small block of memory is reserved on the stack.
Usually, allocations have a certain alignment (and sometimes a certain minimum size, also the operating system can only detect wrong accesses on a very coarse level), so there will often be small gaps in between your allocated memory blocks and small areas that you can write to and read from, seemingly without "anything bad" happening -- but you should pretend that this isn't the case, and you should not even think about using these implementation details to your advantage.
Your code example "works" because you were unlucky enough not to hit an unallocated or write-protected memory page, and you didn't overwrite another vital stack value that would have caused the application to crash (such as the function's return address).
I am purposely saying "unlucky", not "lucky" as the fact that it appears to work is not a good thing. It's incorrect code2, and such code should crash early, so you can detect and fix the problem. It may otherwise lead to very hard to diagnose problems that appear to occur at an entirely unrelated time or location. Even if it works now, you have no guarantee whatsoever that it will work tomorrow (or, on a different computer, or with a different compiler, or with ever so slightly different code).
Memory allocation is generally a three-step process. It is an allocation request to the operating system done by the C library (which usually does not directly correspond to your requests) followed by some bookkeeping done in the library, and a promise made by you. At the operating system level, the actual physical allocation on a page level happens on demand as you access memory for the first time, supposed that the C library has requested allocation for the accessed location earlier.
In the case of stack allocation, the process is somewhat easier on the library level, since it really only has to decrement one special register, but this is mostly irrelevant for you. The concept remains the same.
The promise you make is that you will only ever read from or write to the agreed area, and this is the primary thing that is important for you.
It can happen that you break your promise (deliberately or by accident) and it still "works", but that is pure coincidence.
On the stack, you will sooner or later overwrite either the store of some local variables (which may go undetected if they're cached in a register) and finally the return addresses, which will almost certainly cause a crash (or similar undesired behavior) when the function returns. On the heap, you may overwrite some other program data or access a page that hasn't been communicated to the operating system as being reserved. In that case, the program will be terminated immediately.
1 Let's not consider virtual memory and page protections for an instant.
2 Strictly speaking, it's not incorrect code, but code that invokes undefined behavior. However, overwriting unallocated memory is in my opinion serious enough to merit the label "incorrect".

Related

What happens when you do array[-1]? [duplicate]

How dangerous is accessing an array outside of its bounds (in C)? It can sometimes happen that I read from outside the array (I now understand I then access memory used by some other parts of my program or even beyond that) or I am trying to set a value to an index outside of the array. The program sometimes crashes, but sometimes just runs, only giving unexpected results.
Now what I would like to know is, how dangerous is this really? If it damages my program, it is not so bad. If on the other hand it breaks something outside my program, because I somehow managed to access some totally unrelated memory, then it is very bad, I imagine.
I read a lot of 'anything can happen', 'segmentation might be the least bad problem', 'your hard disk might turn pink and unicorns might be singing under your window', which is all nice, but what is really the danger?
My questions:
Can reading values from way outside the array damage anything
apart from my program? I would imagine just looking at things does
not change anything, or would it for instance change the 'last time
opened' attribute of a file I happened to reach?
Can setting values way out outside of the array damage anything apart from my
program? From this
Stack Overflow question I gather that it is possible to access
any memory location, that there is no safety guarantee.
I now run my small programs from within XCode. Does that
provide some extra protection around my program where it cannot
reach outside its own memory? Can it harm XCode?
Any recommendations on how to run my inherently buggy code safely?
I use OSX 10.7, Xcode 4.6.
As far as the ISO C standard (the official definition of the language) is concerned, accessing an array outside its bounds has "undefined behavior". The literal meaning of this is:
behavior, upon use of a nonportable or erroneous program construct or
of erroneous data, for which this International Standard imposes no
requirements
A non-normative note expands on this:
Possible undefined behavior ranges from ignoring the situation
completely with unpredictable results, to behaving during translation
or program execution in a documented manner characteristic of the
environment (with or without the issuance of a diagnostic message), to
terminating a translation or execution (with the issuance of a
diagnostic message).
So that's the theory. What's the reality?
In the "best" case, you'll access some piece of memory that's either owned by your currently running program (which might cause your program to misbehave), or that's not owned by your currently running program (which will probably cause your program to crash with something like a segmentation fault). Or you might attempt to write to memory that your program owns, but that's marked read-only; this will probably also cause your program to crash.
That's assuming your program is running under an operating system that attempts to protect concurrently running processes from each other. If your code is running on the "bare metal", say if it's part of an OS kernel or an embedded system, then there is no such protection; your misbehaving code is what was supposed to provide that protection. In that case, the possibilities for damage are considerably greater, including, in some cases, physical damage to the hardware (or to things or people nearby).
Even in a protected OS environment, the protections aren't always 100%. There are operating system bugs that permit unprivileged programs to obtain root (administrative) access, for example. Even with ordinary user privileges, a malfunctioning program can consume excessive resources (CPU, memory, disk), possibly bringing down the entire system. A lot of malware (viruses, etc.) exploits buffer overruns to gain unauthorized access to the system.
(One historical example: I've heard that on some old systems with core memory, repeatedly accessing a single memory location in a tight loop could literally cause that chunk of memory to melt. Other possibilities include destroying a CRT display, and moving the read/write head of a disk drive with the harmonic frequency of the drive cabinet, causing it to walk across a table and fall onto the floor.)
And there's always Skynet to worry about.
The bottom line is this: if you could write a program to do something bad deliberately, it's at least theoretically possible that a buggy program could do the same thing accidentally.
In practice, it's very unlikely that your buggy program running on a MacOS X system is going to do anything more serious than crash. But it's not possible to completely prevent buggy code from doing really bad things.
In general, Operating Systems of today (the popular ones anyway) run all applications in protected memory regions using a virtual memory manager. It turns out that it is not terribly EASY (per se) to simply read or write to a location that exists in REAL space outside the region(s) that have been assigned / allocated to your process.
Direct answers:
Reading will almost never directly damage another process, however it can indirectly damage a process if you happen to read a KEY value used to encrypt, decrypt, or validate a program / process. Reading out of bounds can have somewhat adverse / unexpected affects on your code if you are making decisions based on the data you are reading
The only way your could really DAMAGE something by writing to a loaction accessible by a memory address is if that memory address that you are writing to is actually a hardware register (a location that actually is not for data storage but for controlling some piece of hardware) not a RAM location. In all fact, you still wont normally damage something unless you are writing some one time programmable location that is not re-writable (or something of that nature).
Generally running from within the debugger runs the code in debug mode. Running in debug mode does TEND to (but not always) stop your code faster when you have done something considered out of practice or downright illegal.
Never use macros, use data structures that already have array index bounds checking built in, etc....
ADDITIONAL
I should add that the above information is really only for systems using an operating system with memory protection windows. If writing code for an embedded system or even a system utilizing an operating system (real-time or other) that does not have memory protection windows (or virtual addressed windows) that one should practice a lot more caution in reading and writing to memory. Also in these cases SAFE and SECURE coding practices should always be employed to avoid security issues.
Not checking bounds can lead to to ugly side effects, including security holes. One of the ugly ones is arbitrary code execution. In classical example: if you have an fixed size array, and use strcpy() to put a user-supplied string there, the user can give you a string that overflows the buffer and overwrites other memory locations, including code address where CPU should return when your function finishes.
Which means your user can send you a string that will cause your program to essentially call exec("/bin/sh"), which will turn it into shell, executing anything he wants on your system, including harvesting all your data and turning your machine into botnet node.
See Smashing The Stack For Fun And Profit for details on how this can be done.
You write:
I read a lot of 'anything can happen', 'segmentation might be the
least bad problem', 'your harddisk might turn pink and unicorns might
be singing under your window', which is all nice, but what is really
the danger?
Lets put it that way: load a gun. Point it outside the window without any particular aim and fire. What is the danger?
The issue is that you do not know. If your code overwrites something that crashes your program you are fine because it will stop it into a defined state. However if it does not crash then the issues start to arise. Which resources are under control of your program and what might it do to them? I know at least one major issue that was caused by such an overflow. The issue was in a seemingly meaningless statistics function that messed up some unrelated conversion table for a production database. The result was some very expensive cleanup afterwards. Actually it would have been much cheaper and easier to handle if this issue would have formatted the hard disks ... with other words: pink unicorns might be your least problem.
The idea that your operating system will protect you is optimistic. If possible try to avoid writing out of bounds.
Not running your program as root or any other privileged user won't harm any of your system, so generally this might be a good idea.
By writing data to some random memory location you won't directly "damage" any other program running on your computer as each process runs in it's own memory space.
If you try to access any memory not allocated to your process the operating system will stop your program from executing with a segmentation fault.
So directly (without running as root and directly accessing files like /dev/mem) there is no danger that your program will interfere with any other program running on your operating system.
Nevertheless - and probably this is what you have heard about in terms of danger - by blindly writing random data to random memory locations by accident you sure can damage anything you are able to damage.
For example your program might want to delete a specific file given by a file name stored somewhere in your program. If by accident you just overwrite the location where the file name is stored you might delete a very different file instead.
NSArrays in Objective-C are assigned a specific block of memory. Exceeding the bounds of the array means that you would be accessing memory that is not assigned to the array. This means:
This memory can have any value. There's no way of knowing if the data is valid based on your data type.
This memory may contain sensitive information such as private keys or other user credentials.
The memory address may be invalid or protected.
The memory can have a changing value because it's being accessed by another program or thread.
Other things use memory address space, such as memory-mapped ports.
Writing data to unknown memory address can crash your program, overwrite OS memory space, and generally cause the sun to implode.
From the aspect of your program you always want to know when your code is exceeding the bounds of an array. This can lead to unknown values being returned, causing your application to crash or provide invalid data.
You may want to try using the memcheck tool in Valgrind when you test your code -- it won't catch individual array bounds violations within a stack frame, but it should catch many other sorts of memory problem, including ones that would cause subtle, wider problems outside the scope of a single function.
From the manual:
Memcheck is a memory error detector. It can detect the following problems that are common in C and C++ programs.
Accessing memory you shouldn't, e.g. overrunning and underrunning heap blocks, overrunning the top of the stack, and accessing memory after it has been freed.
Using undefined values, i.e. values that have not been initialised, or that have been derived from other undefined values.
Incorrect freeing of heap memory, such as double-freeing heap blocks, or mismatched use of malloc/new/new[] versus free/delete/delete[]
Overlapping src and dst pointers in memcpy and related functions.
Memory leaks.
ETA: Though, as Kaz's answer says, it's not a panacea, and doesn't always give the most helpful output, especially when you're using exciting access patterns.
If you ever do systems level programming or embedded systems programming, very bad things can happen if you write to random memory locations. Older systems and many micro-controllers use memory mapped IO, so writing to a memory location that maps to a peripheral register can wreak havoc, especially if it is done asynchronously.
An example is programming flash memory. Programming mode on the memory chips is enabled by writing a specific sequence of values to specific locations inside the address range of the chip. If another process were to write to any other location in the chip while that was going on, it would cause the programming cycle to fail.
In some cases the hardware will wrap addresses around (most significant bits/bytes of address are ignored) so writing to an address beyond the end of the physical address space will actually result in data being written right in the middle of things.
And finally, older CPUs like the MC68000 can locked up to the point that only a hardware reset can get them going again. Haven't worked on them for a couple of decades but I believe it's when it encountered a bus error (non-existent memory) while trying to handle an exception, it would simply halt until the hardware reset was asserted.
My biggest recommendation is a blatant plug for a product, but I have no personal interest in it and I am not affiliated with them in any way - but based on a couple of decades of C programming and embedded systems where reliability was critical, Gimpel's PC Lint will not only detect those sort of errors, it will make a better C/C++ programmer out of you by constantly harping on you about bad habits.
I'd also recommend reading the MISRA C coding standard, if you can snag a copy from someone. I haven't seen any recent ones but in ye olde days they gave a good explanation of why you should/shouldn't do the things they cover.
Dunno about you, but about the 2nd or 3rd time I get a coredump or hangup from any application, my opinion of whatever company produced it goes down by half. The 4th or 5th time and whatever the package is becomes shelfware and I drive a wooden stake through the center of the package/disc it came in just to make sure it never comes back to haunt me.
I'm working with a compiler for a DSP chip which deliberately generates code that accesses one past the end of an array out of C code which does not!
This is because the loops are structured so that the end of an iteration prefetches some data for the next iteration. So the datum prefetched at the end of the last iteration is never actually used.
Writing C code like that invokes undefined behavior, but that is only a formality from a standards document which concerns itself with maximal portability.
More often that not, a program which accesses out of bounds is not cleverly optimized. It is simply buggy. The code fetches some garbage value and, unlike the optimized loops of the aforementioned compiler, the code then uses the value in subsequent computations, thereby corrupting theim.
It is worth catching bugs like that, and so it is worth making the behavior undefined for even just that reason alone: so that the run-time can produce a diagnostic message like "array overrun in line 42 of main.c".
On systems with virtual memory, an array could happen to be allocated such that the address which follows is in an unmapped area of virtual memory. The access will then bomb the program.
As an aside, note that in C we are permitted to create a pointer which is one past the end of an array. And this pointer has to compare greater than any pointer to the interior of an array.
This means that a C implementation cannot place an array right at the end of memory, where the one plus address would wrap around and look smaller than other addresses in the array.
Nevertheless, access to uninitialized or out of bounds values are sometimes a valid optimization technique, even if not maximally portable. This is for instance why the Valgrind tool does not report accesses to uninitialized data when those accesses happen, but only when the value is later used in some way that could affect the outcome of the program. You get a diagnostic like "conditional branch in xxx:nnn depends on uninitialized value" and it can be sometimes hard to track down where it originates. If all such accesses were trapped immediately, there would be a lot of false positives arising from compiler optimized code as well as correctly hand-optimized code.
Speaking of which, I was working with some codec from a vendor which was giving off these errors when ported to Linux and run under Valgrind. But the vendor convinced me that only several bits of the value being used actually came from uninitialized memory, and those bits were carefully avoided by the logic.. Only the good bits of the value were being used and Valgrind doesn't have the ability to track down to the individual bit. The uninitialized material came from reading a word past the end of a bit stream of encoded data, but the code knows how many bits are in the stream and will not use more bits than there actually are. Since the access beyond the end of the bit stream array does not cause any harm on the DSP architecture (there is no virtual memory after the array, no memory-mapped ports, and the address does not wrap) it is a valid optimization technique.
"Undefined behavior" does not really mean much, because according to ISO C, simply including a header which is not defined in the C standard, or calling a function which is not defined in the program itself or the C standard, are examples of undefined behavior. Undefined behavior doesn't mean "not defined by anyone on the planet" just "not defined by the ISO C standard". But of course, sometimes undefined behavior really is absolutely not defined by anyone.
Besides your own program, I don't think you will break anything, in the worst case you will try to read or write from a memory address that corresponds to a page that the kernel didn't assign to your proceses, generating the proper exception and being killed (I mean, your process).
Arrays with two or more dimensions pose a consideration beyond those mentioned in other answers. Consider the following functions:
char arr1[2][8];
char arr2[4];
int test1(int n)
{
arr1[1][0] = 1;
for (int i=0; i<n; i++) arr1[0][i] = arr2[i];
return arr1[1][0];
}
int test2(int ofs, int n)
{
arr1[1][0] = 1;
for (int i=0; i<n; i++) *(arr1[0]+i) = arr2[i];
return arr1[1][0];
}
The way gcc will processes the first function will not allow for the possibility that an attempt to write arr[0][i] might affect the value of arr[1][0], and the generated code is incapable of returning anything other than a hardcoded value of 1. Although the Standard defines the meaning of array[index] as precisely equivalent to (*((array)+(index))), gcc seems to interpret the notion of array bounds and pointer decay differently in cases which involve using [] operator on values of array type, versus those which use explicit pointer arithmetic.
I just want to add some practical examples to this questions - Imagine the following code:
#include <stdio.h>
int main(void) {
int n[5];
n[5] = 1;
printf("answer %d\n", n[5]);
return (0);
}
Which has Undefined Behaviour. If you enable for example clang optimisations (-Ofast) it would result in something like:
answer 748418584
(Which if you compile without will probably output the correct result of answer 1)
This is because in the first case the assignment to 1 is never actually assembled in the final code (you can look in the godbolt asm code as well).
(However it must be noted that by that logic main should not even call printf so best advice is not to depend on the optimiser to solve your UB - but rather have the knowledge that sometimes it may work this way)
The takeaway here is that modern C optimising compilers will assume undefined behaviour (UB) to never occur (which means the above code would be similar to something like (but not the same):
#include <stdio.h>
#include <stdlib.h>
int main(void) {
int n[5];
if (0)
n[5] = 1;
printf("answer %d\n", (exit(-1), n[5]));
return (0);
}
Which on contrary is perfectly defined).
That's because the first conditional statement never reaches it's true state (0 is always false).
And on the second argument for printf we have a sequence point after which we call exit and the program terminates before invoking the UB in the second comma operator (so it's well defined).
So the second takeaway is that UB is not UB as long as it's never actually evaluated.
Additionally I don't see mentioned here there is fairly modern Undefined Behaviour sanitiser (at least on clang) which (with the option -fsanitize=undefined) will give the following output on the first example (but not the second):
/app/example.c:5:5: runtime error: index 5 out of bounds for type 'int[5]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /app/example.c:5:5 in
/app/example.c:7:27: runtime error: index 5 out of bounds for type 'int[5]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /app/example.c:7:27 in
Here is all the samples in godbolt:
https://godbolt.org/z/eY9ja4fdh (first example and no flags)
https://godbolt.org/z/cGcY7Ta9M (first example and -Ofast clang)
https://godbolt.org/z/cGcY7Ta9M (second example and UB sanitiser on)
https://godbolt.org/z/vE531EKo4 (first example and UB sanitiser on)

C- Why does my multidimensional array only allow 3 user inputs before terminating [duplicate]

How dangerous is accessing an array outside of its bounds (in C)? It can sometimes happen that I read from outside the array (I now understand I then access memory used by some other parts of my program or even beyond that) or I am trying to set a value to an index outside of the array. The program sometimes crashes, but sometimes just runs, only giving unexpected results.
Now what I would like to know is, how dangerous is this really? If it damages my program, it is not so bad. If on the other hand it breaks something outside my program, because I somehow managed to access some totally unrelated memory, then it is very bad, I imagine.
I read a lot of 'anything can happen', 'segmentation might be the least bad problem', 'your hard disk might turn pink and unicorns might be singing under your window', which is all nice, but what is really the danger?
My questions:
Can reading values from way outside the array damage anything
apart from my program? I would imagine just looking at things does
not change anything, or would it for instance change the 'last time
opened' attribute of a file I happened to reach?
Can setting values way out outside of the array damage anything apart from my
program? From this
Stack Overflow question I gather that it is possible to access
any memory location, that there is no safety guarantee.
I now run my small programs from within XCode. Does that
provide some extra protection around my program where it cannot
reach outside its own memory? Can it harm XCode?
Any recommendations on how to run my inherently buggy code safely?
I use OSX 10.7, Xcode 4.6.
As far as the ISO C standard (the official definition of the language) is concerned, accessing an array outside its bounds has "undefined behavior". The literal meaning of this is:
behavior, upon use of a nonportable or erroneous program construct or
of erroneous data, for which this International Standard imposes no
requirements
A non-normative note expands on this:
Possible undefined behavior ranges from ignoring the situation
completely with unpredictable results, to behaving during translation
or program execution in a documented manner characteristic of the
environment (with or without the issuance of a diagnostic message), to
terminating a translation or execution (with the issuance of a
diagnostic message).
So that's the theory. What's the reality?
In the "best" case, you'll access some piece of memory that's either owned by your currently running program (which might cause your program to misbehave), or that's not owned by your currently running program (which will probably cause your program to crash with something like a segmentation fault). Or you might attempt to write to memory that your program owns, but that's marked read-only; this will probably also cause your program to crash.
That's assuming your program is running under an operating system that attempts to protect concurrently running processes from each other. If your code is running on the "bare metal", say if it's part of an OS kernel or an embedded system, then there is no such protection; your misbehaving code is what was supposed to provide that protection. In that case, the possibilities for damage are considerably greater, including, in some cases, physical damage to the hardware (or to things or people nearby).
Even in a protected OS environment, the protections aren't always 100%. There are operating system bugs that permit unprivileged programs to obtain root (administrative) access, for example. Even with ordinary user privileges, a malfunctioning program can consume excessive resources (CPU, memory, disk), possibly bringing down the entire system. A lot of malware (viruses, etc.) exploits buffer overruns to gain unauthorized access to the system.
(One historical example: I've heard that on some old systems with core memory, repeatedly accessing a single memory location in a tight loop could literally cause that chunk of memory to melt. Other possibilities include destroying a CRT display, and moving the read/write head of a disk drive with the harmonic frequency of the drive cabinet, causing it to walk across a table and fall onto the floor.)
And there's always Skynet to worry about.
The bottom line is this: if you could write a program to do something bad deliberately, it's at least theoretically possible that a buggy program could do the same thing accidentally.
In practice, it's very unlikely that your buggy program running on a MacOS X system is going to do anything more serious than crash. But it's not possible to completely prevent buggy code from doing really bad things.
In general, Operating Systems of today (the popular ones anyway) run all applications in protected memory regions using a virtual memory manager. It turns out that it is not terribly EASY (per se) to simply read or write to a location that exists in REAL space outside the region(s) that have been assigned / allocated to your process.
Direct answers:
Reading will almost never directly damage another process, however it can indirectly damage a process if you happen to read a KEY value used to encrypt, decrypt, or validate a program / process. Reading out of bounds can have somewhat adverse / unexpected affects on your code if you are making decisions based on the data you are reading
The only way your could really DAMAGE something by writing to a loaction accessible by a memory address is if that memory address that you are writing to is actually a hardware register (a location that actually is not for data storage but for controlling some piece of hardware) not a RAM location. In all fact, you still wont normally damage something unless you are writing some one time programmable location that is not re-writable (or something of that nature).
Generally running from within the debugger runs the code in debug mode. Running in debug mode does TEND to (but not always) stop your code faster when you have done something considered out of practice or downright illegal.
Never use macros, use data structures that already have array index bounds checking built in, etc....
ADDITIONAL
I should add that the above information is really only for systems using an operating system with memory protection windows. If writing code for an embedded system or even a system utilizing an operating system (real-time or other) that does not have memory protection windows (or virtual addressed windows) that one should practice a lot more caution in reading and writing to memory. Also in these cases SAFE and SECURE coding practices should always be employed to avoid security issues.
Not checking bounds can lead to to ugly side effects, including security holes. One of the ugly ones is arbitrary code execution. In classical example: if you have an fixed size array, and use strcpy() to put a user-supplied string there, the user can give you a string that overflows the buffer and overwrites other memory locations, including code address where CPU should return when your function finishes.
Which means your user can send you a string that will cause your program to essentially call exec("/bin/sh"), which will turn it into shell, executing anything he wants on your system, including harvesting all your data and turning your machine into botnet node.
See Smashing The Stack For Fun And Profit for details on how this can be done.
You write:
I read a lot of 'anything can happen', 'segmentation might be the
least bad problem', 'your harddisk might turn pink and unicorns might
be singing under your window', which is all nice, but what is really
the danger?
Lets put it that way: load a gun. Point it outside the window without any particular aim and fire. What is the danger?
The issue is that you do not know. If your code overwrites something that crashes your program you are fine because it will stop it into a defined state. However if it does not crash then the issues start to arise. Which resources are under control of your program and what might it do to them? I know at least one major issue that was caused by such an overflow. The issue was in a seemingly meaningless statistics function that messed up some unrelated conversion table for a production database. The result was some very expensive cleanup afterwards. Actually it would have been much cheaper and easier to handle if this issue would have formatted the hard disks ... with other words: pink unicorns might be your least problem.
The idea that your operating system will protect you is optimistic. If possible try to avoid writing out of bounds.
Not running your program as root or any other privileged user won't harm any of your system, so generally this might be a good idea.
By writing data to some random memory location you won't directly "damage" any other program running on your computer as each process runs in it's own memory space.
If you try to access any memory not allocated to your process the operating system will stop your program from executing with a segmentation fault.
So directly (without running as root and directly accessing files like /dev/mem) there is no danger that your program will interfere with any other program running on your operating system.
Nevertheless - and probably this is what you have heard about in terms of danger - by blindly writing random data to random memory locations by accident you sure can damage anything you are able to damage.
For example your program might want to delete a specific file given by a file name stored somewhere in your program. If by accident you just overwrite the location where the file name is stored you might delete a very different file instead.
NSArrays in Objective-C are assigned a specific block of memory. Exceeding the bounds of the array means that you would be accessing memory that is not assigned to the array. This means:
This memory can have any value. There's no way of knowing if the data is valid based on your data type.
This memory may contain sensitive information such as private keys or other user credentials.
The memory address may be invalid or protected.
The memory can have a changing value because it's being accessed by another program or thread.
Other things use memory address space, such as memory-mapped ports.
Writing data to unknown memory address can crash your program, overwrite OS memory space, and generally cause the sun to implode.
From the aspect of your program you always want to know when your code is exceeding the bounds of an array. This can lead to unknown values being returned, causing your application to crash or provide invalid data.
You may want to try using the memcheck tool in Valgrind when you test your code -- it won't catch individual array bounds violations within a stack frame, but it should catch many other sorts of memory problem, including ones that would cause subtle, wider problems outside the scope of a single function.
From the manual:
Memcheck is a memory error detector. It can detect the following problems that are common in C and C++ programs.
Accessing memory you shouldn't, e.g. overrunning and underrunning heap blocks, overrunning the top of the stack, and accessing memory after it has been freed.
Using undefined values, i.e. values that have not been initialised, or that have been derived from other undefined values.
Incorrect freeing of heap memory, such as double-freeing heap blocks, or mismatched use of malloc/new/new[] versus free/delete/delete[]
Overlapping src and dst pointers in memcpy and related functions.
Memory leaks.
ETA: Though, as Kaz's answer says, it's not a panacea, and doesn't always give the most helpful output, especially when you're using exciting access patterns.
If you ever do systems level programming or embedded systems programming, very bad things can happen if you write to random memory locations. Older systems and many micro-controllers use memory mapped IO, so writing to a memory location that maps to a peripheral register can wreak havoc, especially if it is done asynchronously.
An example is programming flash memory. Programming mode on the memory chips is enabled by writing a specific sequence of values to specific locations inside the address range of the chip. If another process were to write to any other location in the chip while that was going on, it would cause the programming cycle to fail.
In some cases the hardware will wrap addresses around (most significant bits/bytes of address are ignored) so writing to an address beyond the end of the physical address space will actually result in data being written right in the middle of things.
And finally, older CPUs like the MC68000 can locked up to the point that only a hardware reset can get them going again. Haven't worked on them for a couple of decades but I believe it's when it encountered a bus error (non-existent memory) while trying to handle an exception, it would simply halt until the hardware reset was asserted.
My biggest recommendation is a blatant plug for a product, but I have no personal interest in it and I am not affiliated with them in any way - but based on a couple of decades of C programming and embedded systems where reliability was critical, Gimpel's PC Lint will not only detect those sort of errors, it will make a better C/C++ programmer out of you by constantly harping on you about bad habits.
I'd also recommend reading the MISRA C coding standard, if you can snag a copy from someone. I haven't seen any recent ones but in ye olde days they gave a good explanation of why you should/shouldn't do the things they cover.
Dunno about you, but about the 2nd or 3rd time I get a coredump or hangup from any application, my opinion of whatever company produced it goes down by half. The 4th or 5th time and whatever the package is becomes shelfware and I drive a wooden stake through the center of the package/disc it came in just to make sure it never comes back to haunt me.
I'm working with a compiler for a DSP chip which deliberately generates code that accesses one past the end of an array out of C code which does not!
This is because the loops are structured so that the end of an iteration prefetches some data for the next iteration. So the datum prefetched at the end of the last iteration is never actually used.
Writing C code like that invokes undefined behavior, but that is only a formality from a standards document which concerns itself with maximal portability.
More often that not, a program which accesses out of bounds is not cleverly optimized. It is simply buggy. The code fetches some garbage value and, unlike the optimized loops of the aforementioned compiler, the code then uses the value in subsequent computations, thereby corrupting theim.
It is worth catching bugs like that, and so it is worth making the behavior undefined for even just that reason alone: so that the run-time can produce a diagnostic message like "array overrun in line 42 of main.c".
On systems with virtual memory, an array could happen to be allocated such that the address which follows is in an unmapped area of virtual memory. The access will then bomb the program.
As an aside, note that in C we are permitted to create a pointer which is one past the end of an array. And this pointer has to compare greater than any pointer to the interior of an array.
This means that a C implementation cannot place an array right at the end of memory, where the one plus address would wrap around and look smaller than other addresses in the array.
Nevertheless, access to uninitialized or out of bounds values are sometimes a valid optimization technique, even if not maximally portable. This is for instance why the Valgrind tool does not report accesses to uninitialized data when those accesses happen, but only when the value is later used in some way that could affect the outcome of the program. You get a diagnostic like "conditional branch in xxx:nnn depends on uninitialized value" and it can be sometimes hard to track down where it originates. If all such accesses were trapped immediately, there would be a lot of false positives arising from compiler optimized code as well as correctly hand-optimized code.
Speaking of which, I was working with some codec from a vendor which was giving off these errors when ported to Linux and run under Valgrind. But the vendor convinced me that only several bits of the value being used actually came from uninitialized memory, and those bits were carefully avoided by the logic.. Only the good bits of the value were being used and Valgrind doesn't have the ability to track down to the individual bit. The uninitialized material came from reading a word past the end of a bit stream of encoded data, but the code knows how many bits are in the stream and will not use more bits than there actually are. Since the access beyond the end of the bit stream array does not cause any harm on the DSP architecture (there is no virtual memory after the array, no memory-mapped ports, and the address does not wrap) it is a valid optimization technique.
"Undefined behavior" does not really mean much, because according to ISO C, simply including a header which is not defined in the C standard, or calling a function which is not defined in the program itself or the C standard, are examples of undefined behavior. Undefined behavior doesn't mean "not defined by anyone on the planet" just "not defined by the ISO C standard". But of course, sometimes undefined behavior really is absolutely not defined by anyone.
Besides your own program, I don't think you will break anything, in the worst case you will try to read or write from a memory address that corresponds to a page that the kernel didn't assign to your proceses, generating the proper exception and being killed (I mean, your process).
Arrays with two or more dimensions pose a consideration beyond those mentioned in other answers. Consider the following functions:
char arr1[2][8];
char arr2[4];
int test1(int n)
{
arr1[1][0] = 1;
for (int i=0; i<n; i++) arr1[0][i] = arr2[i];
return arr1[1][0];
}
int test2(int ofs, int n)
{
arr1[1][0] = 1;
for (int i=0; i<n; i++) *(arr1[0]+i) = arr2[i];
return arr1[1][0];
}
The way gcc will processes the first function will not allow for the possibility that an attempt to write arr[0][i] might affect the value of arr[1][0], and the generated code is incapable of returning anything other than a hardcoded value of 1. Although the Standard defines the meaning of array[index] as precisely equivalent to (*((array)+(index))), gcc seems to interpret the notion of array bounds and pointer decay differently in cases which involve using [] operator on values of array type, versus those which use explicit pointer arithmetic.
I just want to add some practical examples to this questions - Imagine the following code:
#include <stdio.h>
int main(void) {
int n[5];
n[5] = 1;
printf("answer %d\n", n[5]);
return (0);
}
Which has Undefined Behaviour. If you enable for example clang optimisations (-Ofast) it would result in something like:
answer 748418584
(Which if you compile without will probably output the correct result of answer 1)
This is because in the first case the assignment to 1 is never actually assembled in the final code (you can look in the godbolt asm code as well).
(However it must be noted that by that logic main should not even call printf so best advice is not to depend on the optimiser to solve your UB - but rather have the knowledge that sometimes it may work this way)
The takeaway here is that modern C optimising compilers will assume undefined behaviour (UB) to never occur (which means the above code would be similar to something like (but not the same):
#include <stdio.h>
#include <stdlib.h>
int main(void) {
int n[5];
if (0)
n[5] = 1;
printf("answer %d\n", (exit(-1), n[5]));
return (0);
}
Which on contrary is perfectly defined).
That's because the first conditional statement never reaches it's true state (0 is always false).
And on the second argument for printf we have a sequence point after which we call exit and the program terminates before invoking the UB in the second comma operator (so it's well defined).
So the second takeaway is that UB is not UB as long as it's never actually evaluated.
Additionally I don't see mentioned here there is fairly modern Undefined Behaviour sanitiser (at least on clang) which (with the option -fsanitize=undefined) will give the following output on the first example (but not the second):
/app/example.c:5:5: runtime error: index 5 out of bounds for type 'int[5]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /app/example.c:5:5 in
/app/example.c:7:27: runtime error: index 5 out of bounds for type 'int[5]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /app/example.c:7:27 in
Here is all the samples in godbolt:
https://godbolt.org/z/eY9ja4fdh (first example and no flags)
https://godbolt.org/z/cGcY7Ta9M (first example and -Ofast clang)
https://godbolt.org/z/cGcY7Ta9M (second example and UB sanitiser on)
https://godbolt.org/z/vE531EKo4 (first example and UB sanitiser on)

How am I writing on some spot of memory that I didnt allocated? [duplicate]

How dangerous is accessing an array outside of its bounds (in C)? It can sometimes happen that I read from outside the array (I now understand I then access memory used by some other parts of my program or even beyond that) or I am trying to set a value to an index outside of the array. The program sometimes crashes, but sometimes just runs, only giving unexpected results.
Now what I would like to know is, how dangerous is this really? If it damages my program, it is not so bad. If on the other hand it breaks something outside my program, because I somehow managed to access some totally unrelated memory, then it is very bad, I imagine.
I read a lot of 'anything can happen', 'segmentation might be the least bad problem', 'your hard disk might turn pink and unicorns might be singing under your window', which is all nice, but what is really the danger?
My questions:
Can reading values from way outside the array damage anything
apart from my program? I would imagine just looking at things does
not change anything, or would it for instance change the 'last time
opened' attribute of a file I happened to reach?
Can setting values way out outside of the array damage anything apart from my
program? From this
Stack Overflow question I gather that it is possible to access
any memory location, that there is no safety guarantee.
I now run my small programs from within XCode. Does that
provide some extra protection around my program where it cannot
reach outside its own memory? Can it harm XCode?
Any recommendations on how to run my inherently buggy code safely?
I use OSX 10.7, Xcode 4.6.
As far as the ISO C standard (the official definition of the language) is concerned, accessing an array outside its bounds has "undefined behavior". The literal meaning of this is:
behavior, upon use of a nonportable or erroneous program construct or
of erroneous data, for which this International Standard imposes no
requirements
A non-normative note expands on this:
Possible undefined behavior ranges from ignoring the situation
completely with unpredictable results, to behaving during translation
or program execution in a documented manner characteristic of the
environment (with or without the issuance of a diagnostic message), to
terminating a translation or execution (with the issuance of a
diagnostic message).
So that's the theory. What's the reality?
In the "best" case, you'll access some piece of memory that's either owned by your currently running program (which might cause your program to misbehave), or that's not owned by your currently running program (which will probably cause your program to crash with something like a segmentation fault). Or you might attempt to write to memory that your program owns, but that's marked read-only; this will probably also cause your program to crash.
That's assuming your program is running under an operating system that attempts to protect concurrently running processes from each other. If your code is running on the "bare metal", say if it's part of an OS kernel or an embedded system, then there is no such protection; your misbehaving code is what was supposed to provide that protection. In that case, the possibilities for damage are considerably greater, including, in some cases, physical damage to the hardware (or to things or people nearby).
Even in a protected OS environment, the protections aren't always 100%. There are operating system bugs that permit unprivileged programs to obtain root (administrative) access, for example. Even with ordinary user privileges, a malfunctioning program can consume excessive resources (CPU, memory, disk), possibly bringing down the entire system. A lot of malware (viruses, etc.) exploits buffer overruns to gain unauthorized access to the system.
(One historical example: I've heard that on some old systems with core memory, repeatedly accessing a single memory location in a tight loop could literally cause that chunk of memory to melt. Other possibilities include destroying a CRT display, and moving the read/write head of a disk drive with the harmonic frequency of the drive cabinet, causing it to walk across a table and fall onto the floor.)
And there's always Skynet to worry about.
The bottom line is this: if you could write a program to do something bad deliberately, it's at least theoretically possible that a buggy program could do the same thing accidentally.
In practice, it's very unlikely that your buggy program running on a MacOS X system is going to do anything more serious than crash. But it's not possible to completely prevent buggy code from doing really bad things.
In general, Operating Systems of today (the popular ones anyway) run all applications in protected memory regions using a virtual memory manager. It turns out that it is not terribly EASY (per se) to simply read or write to a location that exists in REAL space outside the region(s) that have been assigned / allocated to your process.
Direct answers:
Reading will almost never directly damage another process, however it can indirectly damage a process if you happen to read a KEY value used to encrypt, decrypt, or validate a program / process. Reading out of bounds can have somewhat adverse / unexpected affects on your code if you are making decisions based on the data you are reading
The only way your could really DAMAGE something by writing to a loaction accessible by a memory address is if that memory address that you are writing to is actually a hardware register (a location that actually is not for data storage but for controlling some piece of hardware) not a RAM location. In all fact, you still wont normally damage something unless you are writing some one time programmable location that is not re-writable (or something of that nature).
Generally running from within the debugger runs the code in debug mode. Running in debug mode does TEND to (but not always) stop your code faster when you have done something considered out of practice or downright illegal.
Never use macros, use data structures that already have array index bounds checking built in, etc....
ADDITIONAL
I should add that the above information is really only for systems using an operating system with memory protection windows. If writing code for an embedded system or even a system utilizing an operating system (real-time or other) that does not have memory protection windows (or virtual addressed windows) that one should practice a lot more caution in reading and writing to memory. Also in these cases SAFE and SECURE coding practices should always be employed to avoid security issues.
Not checking bounds can lead to to ugly side effects, including security holes. One of the ugly ones is arbitrary code execution. In classical example: if you have an fixed size array, and use strcpy() to put a user-supplied string there, the user can give you a string that overflows the buffer and overwrites other memory locations, including code address where CPU should return when your function finishes.
Which means your user can send you a string that will cause your program to essentially call exec("/bin/sh"), which will turn it into shell, executing anything he wants on your system, including harvesting all your data and turning your machine into botnet node.
See Smashing The Stack For Fun And Profit for details on how this can be done.
You write:
I read a lot of 'anything can happen', 'segmentation might be the
least bad problem', 'your harddisk might turn pink and unicorns might
be singing under your window', which is all nice, but what is really
the danger?
Lets put it that way: load a gun. Point it outside the window without any particular aim and fire. What is the danger?
The issue is that you do not know. If your code overwrites something that crashes your program you are fine because it will stop it into a defined state. However if it does not crash then the issues start to arise. Which resources are under control of your program and what might it do to them? I know at least one major issue that was caused by such an overflow. The issue was in a seemingly meaningless statistics function that messed up some unrelated conversion table for a production database. The result was some very expensive cleanup afterwards. Actually it would have been much cheaper and easier to handle if this issue would have formatted the hard disks ... with other words: pink unicorns might be your least problem.
The idea that your operating system will protect you is optimistic. If possible try to avoid writing out of bounds.
Not running your program as root or any other privileged user won't harm any of your system, so generally this might be a good idea.
By writing data to some random memory location you won't directly "damage" any other program running on your computer as each process runs in it's own memory space.
If you try to access any memory not allocated to your process the operating system will stop your program from executing with a segmentation fault.
So directly (without running as root and directly accessing files like /dev/mem) there is no danger that your program will interfere with any other program running on your operating system.
Nevertheless - and probably this is what you have heard about in terms of danger - by blindly writing random data to random memory locations by accident you sure can damage anything you are able to damage.
For example your program might want to delete a specific file given by a file name stored somewhere in your program. If by accident you just overwrite the location where the file name is stored you might delete a very different file instead.
NSArrays in Objective-C are assigned a specific block of memory. Exceeding the bounds of the array means that you would be accessing memory that is not assigned to the array. This means:
This memory can have any value. There's no way of knowing if the data is valid based on your data type.
This memory may contain sensitive information such as private keys or other user credentials.
The memory address may be invalid or protected.
The memory can have a changing value because it's being accessed by another program or thread.
Other things use memory address space, such as memory-mapped ports.
Writing data to unknown memory address can crash your program, overwrite OS memory space, and generally cause the sun to implode.
From the aspect of your program you always want to know when your code is exceeding the bounds of an array. This can lead to unknown values being returned, causing your application to crash or provide invalid data.
You may want to try using the memcheck tool in Valgrind when you test your code -- it won't catch individual array bounds violations within a stack frame, but it should catch many other sorts of memory problem, including ones that would cause subtle, wider problems outside the scope of a single function.
From the manual:
Memcheck is a memory error detector. It can detect the following problems that are common in C and C++ programs.
Accessing memory you shouldn't, e.g. overrunning and underrunning heap blocks, overrunning the top of the stack, and accessing memory after it has been freed.
Using undefined values, i.e. values that have not been initialised, or that have been derived from other undefined values.
Incorrect freeing of heap memory, such as double-freeing heap blocks, or mismatched use of malloc/new/new[] versus free/delete/delete[]
Overlapping src and dst pointers in memcpy and related functions.
Memory leaks.
ETA: Though, as Kaz's answer says, it's not a panacea, and doesn't always give the most helpful output, especially when you're using exciting access patterns.
If you ever do systems level programming or embedded systems programming, very bad things can happen if you write to random memory locations. Older systems and many micro-controllers use memory mapped IO, so writing to a memory location that maps to a peripheral register can wreak havoc, especially if it is done asynchronously.
An example is programming flash memory. Programming mode on the memory chips is enabled by writing a specific sequence of values to specific locations inside the address range of the chip. If another process were to write to any other location in the chip while that was going on, it would cause the programming cycle to fail.
In some cases the hardware will wrap addresses around (most significant bits/bytes of address are ignored) so writing to an address beyond the end of the physical address space will actually result in data being written right in the middle of things.
And finally, older CPUs like the MC68000 can locked up to the point that only a hardware reset can get them going again. Haven't worked on them for a couple of decades but I believe it's when it encountered a bus error (non-existent memory) while trying to handle an exception, it would simply halt until the hardware reset was asserted.
My biggest recommendation is a blatant plug for a product, but I have no personal interest in it and I am not affiliated with them in any way - but based on a couple of decades of C programming and embedded systems where reliability was critical, Gimpel's PC Lint will not only detect those sort of errors, it will make a better C/C++ programmer out of you by constantly harping on you about bad habits.
I'd also recommend reading the MISRA C coding standard, if you can snag a copy from someone. I haven't seen any recent ones but in ye olde days they gave a good explanation of why you should/shouldn't do the things they cover.
Dunno about you, but about the 2nd or 3rd time I get a coredump or hangup from any application, my opinion of whatever company produced it goes down by half. The 4th or 5th time and whatever the package is becomes shelfware and I drive a wooden stake through the center of the package/disc it came in just to make sure it never comes back to haunt me.
I'm working with a compiler for a DSP chip which deliberately generates code that accesses one past the end of an array out of C code which does not!
This is because the loops are structured so that the end of an iteration prefetches some data for the next iteration. So the datum prefetched at the end of the last iteration is never actually used.
Writing C code like that invokes undefined behavior, but that is only a formality from a standards document which concerns itself with maximal portability.
More often that not, a program which accesses out of bounds is not cleverly optimized. It is simply buggy. The code fetches some garbage value and, unlike the optimized loops of the aforementioned compiler, the code then uses the value in subsequent computations, thereby corrupting theim.
It is worth catching bugs like that, and so it is worth making the behavior undefined for even just that reason alone: so that the run-time can produce a diagnostic message like "array overrun in line 42 of main.c".
On systems with virtual memory, an array could happen to be allocated such that the address which follows is in an unmapped area of virtual memory. The access will then bomb the program.
As an aside, note that in C we are permitted to create a pointer which is one past the end of an array. And this pointer has to compare greater than any pointer to the interior of an array.
This means that a C implementation cannot place an array right at the end of memory, where the one plus address would wrap around and look smaller than other addresses in the array.
Nevertheless, access to uninitialized or out of bounds values are sometimes a valid optimization technique, even if not maximally portable. This is for instance why the Valgrind tool does not report accesses to uninitialized data when those accesses happen, but only when the value is later used in some way that could affect the outcome of the program. You get a diagnostic like "conditional branch in xxx:nnn depends on uninitialized value" and it can be sometimes hard to track down where it originates. If all such accesses were trapped immediately, there would be a lot of false positives arising from compiler optimized code as well as correctly hand-optimized code.
Speaking of which, I was working with some codec from a vendor which was giving off these errors when ported to Linux and run under Valgrind. But the vendor convinced me that only several bits of the value being used actually came from uninitialized memory, and those bits were carefully avoided by the logic.. Only the good bits of the value were being used and Valgrind doesn't have the ability to track down to the individual bit. The uninitialized material came from reading a word past the end of a bit stream of encoded data, but the code knows how many bits are in the stream and will not use more bits than there actually are. Since the access beyond the end of the bit stream array does not cause any harm on the DSP architecture (there is no virtual memory after the array, no memory-mapped ports, and the address does not wrap) it is a valid optimization technique.
"Undefined behavior" does not really mean much, because according to ISO C, simply including a header which is not defined in the C standard, or calling a function which is not defined in the program itself or the C standard, are examples of undefined behavior. Undefined behavior doesn't mean "not defined by anyone on the planet" just "not defined by the ISO C standard". But of course, sometimes undefined behavior really is absolutely not defined by anyone.
Besides your own program, I don't think you will break anything, in the worst case you will try to read or write from a memory address that corresponds to a page that the kernel didn't assign to your proceses, generating the proper exception and being killed (I mean, your process).
Arrays with two or more dimensions pose a consideration beyond those mentioned in other answers. Consider the following functions:
char arr1[2][8];
char arr2[4];
int test1(int n)
{
arr1[1][0] = 1;
for (int i=0; i<n; i++) arr1[0][i] = arr2[i];
return arr1[1][0];
}
int test2(int ofs, int n)
{
arr1[1][0] = 1;
for (int i=0; i<n; i++) *(arr1[0]+i) = arr2[i];
return arr1[1][0];
}
The way gcc will processes the first function will not allow for the possibility that an attempt to write arr[0][i] might affect the value of arr[1][0], and the generated code is incapable of returning anything other than a hardcoded value of 1. Although the Standard defines the meaning of array[index] as precisely equivalent to (*((array)+(index))), gcc seems to interpret the notion of array bounds and pointer decay differently in cases which involve using [] operator on values of array type, versus those which use explicit pointer arithmetic.
I just want to add some practical examples to this questions - Imagine the following code:
#include <stdio.h>
int main(void) {
int n[5];
n[5] = 1;
printf("answer %d\n", n[5]);
return (0);
}
Which has Undefined Behaviour. If you enable for example clang optimisations (-Ofast) it would result in something like:
answer 748418584
(Which if you compile without will probably output the correct result of answer 1)
This is because in the first case the assignment to 1 is never actually assembled in the final code (you can look in the godbolt asm code as well).
(However it must be noted that by that logic main should not even call printf so best advice is not to depend on the optimiser to solve your UB - but rather have the knowledge that sometimes it may work this way)
The takeaway here is that modern C optimising compilers will assume undefined behaviour (UB) to never occur (which means the above code would be similar to something like (but not the same):
#include <stdio.h>
#include <stdlib.h>
int main(void) {
int n[5];
if (0)
n[5] = 1;
printf("answer %d\n", (exit(-1), n[5]));
return (0);
}
Which on contrary is perfectly defined).
That's because the first conditional statement never reaches it's true state (0 is always false).
And on the second argument for printf we have a sequence point after which we call exit and the program terminates before invoking the UB in the second comma operator (so it's well defined).
So the second takeaway is that UB is not UB as long as it's never actually evaluated.
Additionally I don't see mentioned here there is fairly modern Undefined Behaviour sanitiser (at least on clang) which (with the option -fsanitize=undefined) will give the following output on the first example (but not the second):
/app/example.c:5:5: runtime error: index 5 out of bounds for type 'int[5]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /app/example.c:5:5 in
/app/example.c:7:27: runtime error: index 5 out of bounds for type 'int[5]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /app/example.c:7:27 in
Here is all the samples in godbolt:
https://godbolt.org/z/eY9ja4fdh (first example and no flags)
https://godbolt.org/z/cGcY7Ta9M (first example and -Ofast clang)
https://godbolt.org/z/cGcY7Ta9M (second example and UB sanitiser on)
https://godbolt.org/z/vE531EKo4 (first example and UB sanitiser on)

NULL automatically appended in C string arrays? [duplicate]

How dangerous is accessing an array outside of its bounds (in C)? It can sometimes happen that I read from outside the array (I now understand I then access memory used by some other parts of my program or even beyond that) or I am trying to set a value to an index outside of the array. The program sometimes crashes, but sometimes just runs, only giving unexpected results.
Now what I would like to know is, how dangerous is this really? If it damages my program, it is not so bad. If on the other hand it breaks something outside my program, because I somehow managed to access some totally unrelated memory, then it is very bad, I imagine.
I read a lot of 'anything can happen', 'segmentation might be the least bad problem', 'your hard disk might turn pink and unicorns might be singing under your window', which is all nice, but what is really the danger?
My questions:
Can reading values from way outside the array damage anything
apart from my program? I would imagine just looking at things does
not change anything, or would it for instance change the 'last time
opened' attribute of a file I happened to reach?
Can setting values way out outside of the array damage anything apart from my
program? From this
Stack Overflow question I gather that it is possible to access
any memory location, that there is no safety guarantee.
I now run my small programs from within XCode. Does that
provide some extra protection around my program where it cannot
reach outside its own memory? Can it harm XCode?
Any recommendations on how to run my inherently buggy code safely?
I use OSX 10.7, Xcode 4.6.
As far as the ISO C standard (the official definition of the language) is concerned, accessing an array outside its bounds has "undefined behavior". The literal meaning of this is:
behavior, upon use of a nonportable or erroneous program construct or
of erroneous data, for which this International Standard imposes no
requirements
A non-normative note expands on this:
Possible undefined behavior ranges from ignoring the situation
completely with unpredictable results, to behaving during translation
or program execution in a documented manner characteristic of the
environment (with or without the issuance of a diagnostic message), to
terminating a translation or execution (with the issuance of a
diagnostic message).
So that's the theory. What's the reality?
In the "best" case, you'll access some piece of memory that's either owned by your currently running program (which might cause your program to misbehave), or that's not owned by your currently running program (which will probably cause your program to crash with something like a segmentation fault). Or you might attempt to write to memory that your program owns, but that's marked read-only; this will probably also cause your program to crash.
That's assuming your program is running under an operating system that attempts to protect concurrently running processes from each other. If your code is running on the "bare metal", say if it's part of an OS kernel or an embedded system, then there is no such protection; your misbehaving code is what was supposed to provide that protection. In that case, the possibilities for damage are considerably greater, including, in some cases, physical damage to the hardware (or to things or people nearby).
Even in a protected OS environment, the protections aren't always 100%. There are operating system bugs that permit unprivileged programs to obtain root (administrative) access, for example. Even with ordinary user privileges, a malfunctioning program can consume excessive resources (CPU, memory, disk), possibly bringing down the entire system. A lot of malware (viruses, etc.) exploits buffer overruns to gain unauthorized access to the system.
(One historical example: I've heard that on some old systems with core memory, repeatedly accessing a single memory location in a tight loop could literally cause that chunk of memory to melt. Other possibilities include destroying a CRT display, and moving the read/write head of a disk drive with the harmonic frequency of the drive cabinet, causing it to walk across a table and fall onto the floor.)
And there's always Skynet to worry about.
The bottom line is this: if you could write a program to do something bad deliberately, it's at least theoretically possible that a buggy program could do the same thing accidentally.
In practice, it's very unlikely that your buggy program running on a MacOS X system is going to do anything more serious than crash. But it's not possible to completely prevent buggy code from doing really bad things.
In general, Operating Systems of today (the popular ones anyway) run all applications in protected memory regions using a virtual memory manager. It turns out that it is not terribly EASY (per se) to simply read or write to a location that exists in REAL space outside the region(s) that have been assigned / allocated to your process.
Direct answers:
Reading will almost never directly damage another process, however it can indirectly damage a process if you happen to read a KEY value used to encrypt, decrypt, or validate a program / process. Reading out of bounds can have somewhat adverse / unexpected affects on your code if you are making decisions based on the data you are reading
The only way your could really DAMAGE something by writing to a loaction accessible by a memory address is if that memory address that you are writing to is actually a hardware register (a location that actually is not for data storage but for controlling some piece of hardware) not a RAM location. In all fact, you still wont normally damage something unless you are writing some one time programmable location that is not re-writable (or something of that nature).
Generally running from within the debugger runs the code in debug mode. Running in debug mode does TEND to (but not always) stop your code faster when you have done something considered out of practice or downright illegal.
Never use macros, use data structures that already have array index bounds checking built in, etc....
ADDITIONAL
I should add that the above information is really only for systems using an operating system with memory protection windows. If writing code for an embedded system or even a system utilizing an operating system (real-time or other) that does not have memory protection windows (or virtual addressed windows) that one should practice a lot more caution in reading and writing to memory. Also in these cases SAFE and SECURE coding practices should always be employed to avoid security issues.
Not checking bounds can lead to to ugly side effects, including security holes. One of the ugly ones is arbitrary code execution. In classical example: if you have an fixed size array, and use strcpy() to put a user-supplied string there, the user can give you a string that overflows the buffer and overwrites other memory locations, including code address where CPU should return when your function finishes.
Which means your user can send you a string that will cause your program to essentially call exec("/bin/sh"), which will turn it into shell, executing anything he wants on your system, including harvesting all your data and turning your machine into botnet node.
See Smashing The Stack For Fun And Profit for details on how this can be done.
You write:
I read a lot of 'anything can happen', 'segmentation might be the
least bad problem', 'your harddisk might turn pink and unicorns might
be singing under your window', which is all nice, but what is really
the danger?
Lets put it that way: load a gun. Point it outside the window without any particular aim and fire. What is the danger?
The issue is that you do not know. If your code overwrites something that crashes your program you are fine because it will stop it into a defined state. However if it does not crash then the issues start to arise. Which resources are under control of your program and what might it do to them? I know at least one major issue that was caused by such an overflow. The issue was in a seemingly meaningless statistics function that messed up some unrelated conversion table for a production database. The result was some very expensive cleanup afterwards. Actually it would have been much cheaper and easier to handle if this issue would have formatted the hard disks ... with other words: pink unicorns might be your least problem.
The idea that your operating system will protect you is optimistic. If possible try to avoid writing out of bounds.
Not running your program as root or any other privileged user won't harm any of your system, so generally this might be a good idea.
By writing data to some random memory location you won't directly "damage" any other program running on your computer as each process runs in it's own memory space.
If you try to access any memory not allocated to your process the operating system will stop your program from executing with a segmentation fault.
So directly (without running as root and directly accessing files like /dev/mem) there is no danger that your program will interfere with any other program running on your operating system.
Nevertheless - and probably this is what you have heard about in terms of danger - by blindly writing random data to random memory locations by accident you sure can damage anything you are able to damage.
For example your program might want to delete a specific file given by a file name stored somewhere in your program. If by accident you just overwrite the location where the file name is stored you might delete a very different file instead.
NSArrays in Objective-C are assigned a specific block of memory. Exceeding the bounds of the array means that you would be accessing memory that is not assigned to the array. This means:
This memory can have any value. There's no way of knowing if the data is valid based on your data type.
This memory may contain sensitive information such as private keys or other user credentials.
The memory address may be invalid or protected.
The memory can have a changing value because it's being accessed by another program or thread.
Other things use memory address space, such as memory-mapped ports.
Writing data to unknown memory address can crash your program, overwrite OS memory space, and generally cause the sun to implode.
From the aspect of your program you always want to know when your code is exceeding the bounds of an array. This can lead to unknown values being returned, causing your application to crash or provide invalid data.
You may want to try using the memcheck tool in Valgrind when you test your code -- it won't catch individual array bounds violations within a stack frame, but it should catch many other sorts of memory problem, including ones that would cause subtle, wider problems outside the scope of a single function.
From the manual:
Memcheck is a memory error detector. It can detect the following problems that are common in C and C++ programs.
Accessing memory you shouldn't, e.g. overrunning and underrunning heap blocks, overrunning the top of the stack, and accessing memory after it has been freed.
Using undefined values, i.e. values that have not been initialised, or that have been derived from other undefined values.
Incorrect freeing of heap memory, such as double-freeing heap blocks, or mismatched use of malloc/new/new[] versus free/delete/delete[]
Overlapping src and dst pointers in memcpy and related functions.
Memory leaks.
ETA: Though, as Kaz's answer says, it's not a panacea, and doesn't always give the most helpful output, especially when you're using exciting access patterns.
If you ever do systems level programming or embedded systems programming, very bad things can happen if you write to random memory locations. Older systems and many micro-controllers use memory mapped IO, so writing to a memory location that maps to a peripheral register can wreak havoc, especially if it is done asynchronously.
An example is programming flash memory. Programming mode on the memory chips is enabled by writing a specific sequence of values to specific locations inside the address range of the chip. If another process were to write to any other location in the chip while that was going on, it would cause the programming cycle to fail.
In some cases the hardware will wrap addresses around (most significant bits/bytes of address are ignored) so writing to an address beyond the end of the physical address space will actually result in data being written right in the middle of things.
And finally, older CPUs like the MC68000 can locked up to the point that only a hardware reset can get them going again. Haven't worked on them for a couple of decades but I believe it's when it encountered a bus error (non-existent memory) while trying to handle an exception, it would simply halt until the hardware reset was asserted.
My biggest recommendation is a blatant plug for a product, but I have no personal interest in it and I am not affiliated with them in any way - but based on a couple of decades of C programming and embedded systems where reliability was critical, Gimpel's PC Lint will not only detect those sort of errors, it will make a better C/C++ programmer out of you by constantly harping on you about bad habits.
I'd also recommend reading the MISRA C coding standard, if you can snag a copy from someone. I haven't seen any recent ones but in ye olde days they gave a good explanation of why you should/shouldn't do the things they cover.
Dunno about you, but about the 2nd or 3rd time I get a coredump or hangup from any application, my opinion of whatever company produced it goes down by half. The 4th or 5th time and whatever the package is becomes shelfware and I drive a wooden stake through the center of the package/disc it came in just to make sure it never comes back to haunt me.
I'm working with a compiler for a DSP chip which deliberately generates code that accesses one past the end of an array out of C code which does not!
This is because the loops are structured so that the end of an iteration prefetches some data for the next iteration. So the datum prefetched at the end of the last iteration is never actually used.
Writing C code like that invokes undefined behavior, but that is only a formality from a standards document which concerns itself with maximal portability.
More often that not, a program which accesses out of bounds is not cleverly optimized. It is simply buggy. The code fetches some garbage value and, unlike the optimized loops of the aforementioned compiler, the code then uses the value in subsequent computations, thereby corrupting theim.
It is worth catching bugs like that, and so it is worth making the behavior undefined for even just that reason alone: so that the run-time can produce a diagnostic message like "array overrun in line 42 of main.c".
On systems with virtual memory, an array could happen to be allocated such that the address which follows is in an unmapped area of virtual memory. The access will then bomb the program.
As an aside, note that in C we are permitted to create a pointer which is one past the end of an array. And this pointer has to compare greater than any pointer to the interior of an array.
This means that a C implementation cannot place an array right at the end of memory, where the one plus address would wrap around and look smaller than other addresses in the array.
Nevertheless, access to uninitialized or out of bounds values are sometimes a valid optimization technique, even if not maximally portable. This is for instance why the Valgrind tool does not report accesses to uninitialized data when those accesses happen, but only when the value is later used in some way that could affect the outcome of the program. You get a diagnostic like "conditional branch in xxx:nnn depends on uninitialized value" and it can be sometimes hard to track down where it originates. If all such accesses were trapped immediately, there would be a lot of false positives arising from compiler optimized code as well as correctly hand-optimized code.
Speaking of which, I was working with some codec from a vendor which was giving off these errors when ported to Linux and run under Valgrind. But the vendor convinced me that only several bits of the value being used actually came from uninitialized memory, and those bits were carefully avoided by the logic.. Only the good bits of the value were being used and Valgrind doesn't have the ability to track down to the individual bit. The uninitialized material came from reading a word past the end of a bit stream of encoded data, but the code knows how many bits are in the stream and will not use more bits than there actually are. Since the access beyond the end of the bit stream array does not cause any harm on the DSP architecture (there is no virtual memory after the array, no memory-mapped ports, and the address does not wrap) it is a valid optimization technique.
"Undefined behavior" does not really mean much, because according to ISO C, simply including a header which is not defined in the C standard, or calling a function which is not defined in the program itself or the C standard, are examples of undefined behavior. Undefined behavior doesn't mean "not defined by anyone on the planet" just "not defined by the ISO C standard". But of course, sometimes undefined behavior really is absolutely not defined by anyone.
Besides your own program, I don't think you will break anything, in the worst case you will try to read or write from a memory address that corresponds to a page that the kernel didn't assign to your proceses, generating the proper exception and being killed (I mean, your process).
Arrays with two or more dimensions pose a consideration beyond those mentioned in other answers. Consider the following functions:
char arr1[2][8];
char arr2[4];
int test1(int n)
{
arr1[1][0] = 1;
for (int i=0; i<n; i++) arr1[0][i] = arr2[i];
return arr1[1][0];
}
int test2(int ofs, int n)
{
arr1[1][0] = 1;
for (int i=0; i<n; i++) *(arr1[0]+i) = arr2[i];
return arr1[1][0];
}
The way gcc will processes the first function will not allow for the possibility that an attempt to write arr[0][i] might affect the value of arr[1][0], and the generated code is incapable of returning anything other than a hardcoded value of 1. Although the Standard defines the meaning of array[index] as precisely equivalent to (*((array)+(index))), gcc seems to interpret the notion of array bounds and pointer decay differently in cases which involve using [] operator on values of array type, versus those which use explicit pointer arithmetic.
I just want to add some practical examples to this questions - Imagine the following code:
#include <stdio.h>
int main(void) {
int n[5];
n[5] = 1;
printf("answer %d\n", n[5]);
return (0);
}
Which has Undefined Behaviour. If you enable for example clang optimisations (-Ofast) it would result in something like:
answer 748418584
(Which if you compile without will probably output the correct result of answer 1)
This is because in the first case the assignment to 1 is never actually assembled in the final code (you can look in the godbolt asm code as well).
(However it must be noted that by that logic main should not even call printf so best advice is not to depend on the optimiser to solve your UB - but rather have the knowledge that sometimes it may work this way)
The takeaway here is that modern C optimising compilers will assume undefined behaviour (UB) to never occur (which means the above code would be similar to something like (but not the same):
#include <stdio.h>
#include <stdlib.h>
int main(void) {
int n[5];
if (0)
n[5] = 1;
printf("answer %d\n", (exit(-1), n[5]));
return (0);
}
Which on contrary is perfectly defined).
That's because the first conditional statement never reaches it's true state (0 is always false).
And on the second argument for printf we have a sequence point after which we call exit and the program terminates before invoking the UB in the second comma operator (so it's well defined).
So the second takeaway is that UB is not UB as long as it's never actually evaluated.
Additionally I don't see mentioned here there is fairly modern Undefined Behaviour sanitiser (at least on clang) which (with the option -fsanitize=undefined) will give the following output on the first example (but not the second):
/app/example.c:5:5: runtime error: index 5 out of bounds for type 'int[5]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /app/example.c:5:5 in
/app/example.c:7:27: runtime error: index 5 out of bounds for type 'int[5]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /app/example.c:7:27 in
Here is all the samples in godbolt:
https://godbolt.org/z/eY9ja4fdh (first example and no flags)
https://godbolt.org/z/cGcY7Ta9M (first example and -Ofast clang)
https://godbolt.org/z/cGcY7Ta9M (second example and UB sanitiser on)
https://godbolt.org/z/vE531EKo4 (first example and UB sanitiser on)

C program help: Insufficient memory allocation but still works...why? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
behaviour of malloc(0)
I'm trying to understand memory allocation in C. So I am experimenting with malloc. I allotted 0 bytes for this pointer but yet it can still hold an integer. As a matter of fact, no matter what number I put into the parameter of malloc, it can still hold any number I give it. Why is this?
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
int *ptr = (int*)malloc(0);
*ptr = 9;
printf("%i", *ptr); // 9
free(ptr);
return 0;
}
It still prints 9, what's up with that?
If size is 0, then malloc() returns either NULL, or a unique pointer
value that can later be successfully passed to free().
I guess you are hitting the 2nd case.
Anyway that pointer just by mistake happens to be in an area where you can write without generating segmentation fault, but you are probably writing in the space of some other variable messing up its value.
A lot of good answers here. But it is definitely undefined behavior. Some people declare that undefined behavior means that purple dragons may fly out of your computer or something like that... there's probably some history behind that outrageous claim that I'm missing, but I promise you that purple dragons won't appear regardless of what the undefined behavior will be.
First of all, let me mention that in the absence of an MMU, on a system without virtual memory, your program would have direct access to all of the memory on the system, regardless of its address. On a system like that, malloc() is merely the guy who helps you carve out pieces of memory in an ordered manner; the system can't actually enforce you to use only the addresses that malloc() gave you. On a system with virtual memory, the situation is slightly different... well, ok, a lot different. But within your program, any code in your program can access any part of the virtual address space that's mapped via the MMU to real physical memory. It doesn't matter whether you got an address from malloc() or whether you called rand() and happened to get an address that falls in a mapped region of your program; if it's mapped and not marked execute-only, you can read it. And if it isn't marked read-only, you can write it as well. Yes. Even if you didn't get it from malloc().
Let's consider the possibilities for the malloc(0) undefined behavior:
malloc(0) returns NULL.
OK, this is simple enough. There really is a physical address 0x00000000 in most computers, and even a virtual address 0x00000000 in all processes, but the OS intentionally doesn't map any memory to that address so that it can trap null pointer accesses. There's a whole page (generally 4KB) there that's just never mapped at all, and maybe even much more than 4KB. Therefore if you try to read or write through a null pointer, even with an offset from it, you'll hit these pages of virtual memory that aren't even mapped, and the MMU will throw an exception (a hardware exception, or interrupt) that the OS catches, and it declares a SIGSEGV (on Linux/Unix), or an illegal access (on Windows).
malloc(0) returns a valid address to previously unallocated memory of the smallest allocable unit.
With this, you actually get a real piece of memory that you can legally call your own, of some size you don't know. You really shouldn't write anything there (and probably not read either) because you don't know how big it is, and for that matter, you don't know if this is the particular case you're experiencing (see the following cases). If this is the case, the block of memory you were given is almost guaranteed to be at least 4 bytes and probably is 8 bytes or perhaps even larger; it all depends on whatever the size is of your implementation's minimum allocable unit.
malloc(0) intentionally returns the address of an unmapped page of
memory other than NULL.
This is probably a good option for an implementation, as it would allow you or the system to track & pair together malloc() calls with their corresponding free() calls, but in essence, it's the same as returning NULL. If you try to access (read/write) via this pointer, you'll crash (SEGV or illegal access).
malloc(0) returns an address in some other mapped page of memory
that may be used by "someone else".
I find it highly unlikely that a commercially-available system would take this route, as it serves to simply hide bugs rather than bring them out as soon as possible. But if it did, malloc() would be returning a pointer to somewhere in memory that you do not own. If this is the case, sure, you can write to it all you want, but you'd be corrupting some other code's memory, though it would be memory in your program's process, so you can be assured that you're at least not going to be stomping on another program's memory. (I hear someone getting ready to say, "But it's UB, so technically it could be stomping on some other program's memory. Yes, in some environments, like an embedded system, that is right. No modern commercial OS would let one process have access to another process's memory as easily as simply calling malloc(0) though; in fact, you simply can't get from one process to another process's memory without going through the OS to do it for you.) Anyway, back to reality... This is the one where "undefined behavior" really kicks in: If you're writing to "someone else's memory" (in your own program's process), you'll be changing the behavior of your program in difficult-to-predict ways. Knowing the structure of your program and where everything is laid out in memory, it's fully predictable. But from one system to another, things would be laid out in memory (appearing a different locations in memory), so the effect on one system would not necessarily be the same as the effect on another system, or on the same system at a different time.
And finally.... No, that's it. There really, truly, are only those four
possibilities. You could argue for special-case subset points for
the last two of the above, but the end result will be the same.
For one thing, your compiler may be seeing these two lines back to back and optimizing them:
*ptr = 9;
printf("%i", *ptr);
With such a simplistic program, your compiler may actually be optimizing away the entire memory allocate/free cycle and using a constant instead. A compiler-optimized version of your program could end up looking more like simply:
printf("9");
The only way to tell if this is indeed what is happening is to examine the assembly that your compiler emits. If you're trying to learn how C works, I recommend explicitly disabling all compiler optimizations when you build your code.
Regarding your particular malloc usage, remember that you will get a NULL pointer back if allocation fails. Always check the return value of malloc before you use it for anything. Blindly dereferencing it is a good way to crash your program.
The link that Nick posted gives a good explanation about why malloc(0) may appear to work (note the significant difference between "works" and "appears to work"). To summarize the information there, malloc(0) is allowed to return either NULL or a pointer. If it returns a pointer, you are expressly forbidden from using it for anything other than passing it to free(). If you do try to use such a pointer, you are invoking undefined behavior and there's no way to tell what will happen as a result. It may appear to work for you, but in doing so you may be overwriting memory that belongs to another program and corrupting their memory space. In short: nothing good can happen, so leave that pointer alone and don't waste your time with malloc(0).
The answer to the malloc(0)/free() calls not crashing you can find here:
zero size malloc
About the *ptr = 9, is just like overflowing a buffer (like malloc'ing 10 bytes and access the 11th), you are writing to memory you don't own, and doing that is looking for trouble. In this particular implementation malloc(0) happens to return a pointer instead of NULL.
Bottom line, it is wrong even if it seems to work on a simple case.
Some memory allocators have the notion of "minimum allocatable size". So, even if you pass zero, this will return pointer to the memory of word-size, for example. You need to check up with your system allocator documentation. But if it does return pointer to some memory it'd be wrong to rely on it as the pointer is only supposed to be passed either to be passed realloc() or free().

Resources