Memcpy() works on out of bounds memory? - c

I have been playing around with the idea that memcpy() could be used for malevolent purposes. I made several test applications to see if I could "steal" data in memory from different regions. I have tested three regions so far: heap, stack, and constant (read-only) memory. The constant memory was the only one to crash in my tests, provoking an error from MinGW.
Here is an example to illustrate my latest test :
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void removeTerminatingCharacters( char ** string, const int length )
{
    int i = 0;
    for ( ; i < length; ++i )
        if ( !( *string )[i] )
            ( *string )[i] = '0';
    return;
}

int main()
{
    int * naive = malloc( sizeof( int ) );
    *naive = 0;

    char * stolenData = malloc( 2000 );
    memset( stolenData, 0, 2000 );
    memcpy( stolenData, naive, 1999 );

    removeTerminatingCharacters( &stolenData, 2000 );

    printf( "%s\n", stolenData );

    free( stolenData );

    return 0;
}
Output :
0000-0:0Væ1lDk¦#:00000[æ0`Dk00p,:0-0:0MAIN=Computer0USERNAME=JohnDoe0USERPRO
FILE=C:\Users\JohnDoe0WATCOM=C:\watcom0windir=C:\Windows00?æ1+Ik?000S?000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00??????????????????????????000000 00000000 000000?0?0?
00000000000 0 0 ?0000000000 0000000000 0000 00000???????????????????????0???????
0 00000000000000000000000000000000000000000000000
000000000000000000abcdefghijklmnopqrstuvwxyz000000ABCDEFGHIJKLMNOPQRSTUVWXYZ0000
0000â000000Ü0£0P00000000000è0î0Ä 0000000000¬0000000000¦0000¦00000aßGpSsµtFTOd8fe
ä?:0ú?:0-?:0T?:0)?:0`?:0£?:0-?:0p?:0?:0??:0+?:08?:0T?:0¦?:0µ?:0²?:0¶?:03?:0D?:0R
?:0ä :0- :0+ :0a :0v :0?!:0^!:0q!:0ë!:0ñ!:0+!:0±!:0¤":0P":0g":0ó":0¦":0+":0¦":0?
#:0.#:0B#:0W#:0x#:0ë#:00000ALLUSERSPROFILE=C:\ProgramData0APPDATA=C:\Users\Chris
topher\AppData\Roaming0asl.log=Destination=file0CLASSPATH=.;C:\Program Files (x8
6)\Java\jre6\lib\ext\QTJava.zip0CommonProgramFiles=C:\Program Files (x86)\Common
Files0CommonProgramFiles(x86)=C:\Program Files (x86)\Common Files0CommonProgram
W6432=C:\Program Files\Common Files0COMPUTERNAME=COMPUTER0ComSpec=C:\Windows\sys
tem32\cmd.exe0FPPUILang=en-US0FP_NO_HOST_CHECK=NO0HOMEDRIVE=C:0HOMEPATH=\Users\C
hristopher0HuluDesktopPath=C:\Users\JohnDoe\AppData\Local\HuluDesktop\instan
ces\0.9.13.1\HuluDesktop.exe0LOCALAPPDATA=C:\Users\JohnDoe\AppData\Local0LOG
ONSERVER=\\COMPUTER0NUMBER_OF_PROCESSORS=20OnlineServices=Online Services0OOBEUI
Lang=en-US0OS=Windows_NT0Path=.;F:\CodeBlocks\MinGW\bin;F:\CodeBlocks\MinGW;C:\M
inGW\bin;C:\MinGW;C:\Windows\System32;C:\Windows;C:\Windows\System32\wbem;C:\Pro
gram Files\Common Files\Microsoft Shared\Windows Live;C:\Program Files (x86)\Com
mon Files\Microsoft Shared\Windows Live;C:\Windows\System32\WindowsPowerShell\v1
.0;c:\Program Files (x86)\ATI Technologies\ATI.ACE\Core-Static;c:\Program Files
(x86)\Common F0Uô1m+k
My code isn't pretty, but it demonstrates my point. As you can see, the data is mostly garbage values, but there are a few interesting strings thrown in from the heap.
My primary question is why doesn't this action cause a memory access violation error?

Memory access violation errors are caught by virtual memory hardware when you access an unmapped page. Not every out-of-bounds address lies in an unmapped page. Pages are generally of equal size; a typical page size is 4096 bytes. (Page sizes are hardware-specific: some chips have memory management that allows for programmable page sizes, and even mixtures of different page sizes for different areas of memory.) Sometimes only part of a page contains valid data. It's not possible for just a fragment of a page to be unmapped, so the part which contains garbage is also mapped. Also, memory managers like malloc do not always give memory back to the operating system; they keep freed areas around for re-use. Those areas are valid memory (mapped pages). Finally, when making up pointer values you can, by fluke, go out of bounds and still "land" in memory that corresponds to a valid object.
This is just how it works on your PC, with its virtual memory OS. Virtual memory is not ubiquitous. On computers without virtual memory (nowadays, small embedded systems), you can access any location in memory. However, accessing certain areas may have side effects that change the state of hardware (namely I/O registers). Some address ranges may trigger a "bus error" type CPU exception because no hardware exists for that range, and so the access request times out. Other than that, as far as valid program memory goes, it is not protected from out-of-bounds accesses.
Operating systems without memory protection were used for early interactive systems in the 1960's. That history then repeated itself when personal microcomputers appeared: again, they had operating systems without memory protection, due to small memories and unsophisticated CPU's. On these operating systems, applications often stomped over each other's memory spaces, leading to frequent crashes. (Imagine that with your memcpy, you not only copy your own out-of-bounds area, such as some malloc block which was previously freed, but also an area from a completely different running program.) Users sometimes spotted patterns: when certain programs were loaded into memory in certain orders, there were fewer problems, or fewer so-called "conflicts" between applications.

Yes, it can be used for malevolent purposes, but not in the way you may be thinking.
The memory your application reads and writes is your process's virtual memory, not physical memory. You can't even know at what physical address your application (or any other) really resides; only the system kernel knows that.
You cannot interact with the active memory of other processes without the proper permissions and the use of syscalls as well.
You can, however, read the garbage left in memory areas that were once used by other processes, and you may even stumble upon sensitive information left there, like passwords, certificate keys, or personal data the user typed at some point. But you are on highly volatile ground: the information there is most likely corrupted, and there is no easy way to seek out specific pieces of information.
Here is an article about that: https://security.stackexchange.com/questions/29019/are-passwords-stored-in-memory-safe

You must not try to access out-of-bounds memory.
Accessing memory you must not access results in undefined behavior.
Anything may happen.

You may be able to "steal" the memory, but that's not something you should rely on or base your code on.

memcpy() doesn't do any bounds checking; the programmer has to take care of that. In your case, it causes a heap buffer overflow (here, an out-of-bounds read).
This pretty much sums up the consequences of a heap overflow:
https://www.owasp.org/index.php/Heap_overflow
Description
A heap overflow condition is a buffer overflow, where the buffer that
can be overwritten is allocated in the heap portion of memory,
generally meaning that the buffer was allocated using a routine such
as the POSIX malloc() call.
Consequences
Availability: Buffer overflows generally lead to crashes. Other attacks leading to lack of availability are possible, including
putting the program into an infinite loop.
Access control (memory and instruction processing): Buffer overflows often can be used to execute arbitrary code, which is
usually outside the scope of a program's implicit security policy.
Other: When the consequence is arbitrary code execution, this can often be used to subvert any other security service.
Avoidance and mitigation
Pre-design: Use a language or compiler that performs automatic bounds checking.
Design: Use an abstraction library to abstract away risky APIs. Not a complete solution.
Pre-design through Build: Canary style bounds checking, library changes which ensure the validity of chunk data, and other such fixes
are possible, but should not be relied upon.
Operational: Use OS-level preventative functionality. Not a complete solution.
Discussion
Heap overflows are usually just as dangerous as stack overflows.
Besides important user data, heap overflows can be used to overwrite
function pointers that may be living in memory, pointing it to the
attacker's code.
Even in applications that do not explicitly use
function pointers, the run-time will usually leave many in memory. For
example, object methods in C++ are generally implemented using
function pointers. Even in C programs, there is often a global offset
table used by the underlying runtime.

To answer your question of "why doesn't this action cause a memory access violation error?": it's because the memory in the area around the naive allocation has already been allocated to the process. It's probably memory that the C runtime has prepared for further possible malloc() requests.
On most desktop systems, the smallest unit of memory protection is a page, which can vary in size but will often be in the range of 4 KB or so. You might well have run into a memory access violation if your read request were larger (but still maybe not: the heap might be much larger than your small allocation). Then again, your allocation could come near the end of a page and hit an access error right away (debug allocators will sometimes do this purposely to help catch errors).

memcpy can't realistically be used for "malevolent" purposes. The only memory that memcpy can access is the memory belonging to your own process (which you own, anyway). If you screw up your own memory, your program takes the fall, but the OS will protect all other processes (and you cannot touch their memory spaces without mmap or similar).
Most of those "interesting strings" are environment variables set by the OS and passed to your program.


Is there a performance cost to using large mmap calls that go beyond expected memory usage?

For data structures that are both persistent for the duration of the program and require a dynamic amount of memory, is there any reason not to mmap an upper bound from the start?
An example is an array that persists for the entire program's life but whose final size is unknown. The approach I am most familiar with is something along the lines of:
type * array = malloc(size);
and when the array has reached capacity doubling it with:
array = realloc(array, 2 * size);
size *= 2;
I understand this is probably the best way to do this if the array might be freed mid-execution so that its VM can be reused, but if it is persistent, is there any reason not to just initialize the array as follows:
array = mmap(0,
             huge_size,
             PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE,
             -1, 0);
so that the elements never need to be copied.
Edit: Specifically for an OS that uses on-demand paging.
Don't try to be smarter than the standard library, unless you 100% know what you are doing.
malloc() already does this for you. If you request a large amount of memory, malloc() will mmap() you a dedicated memory area. If what you are concerned about is the performance hit coming from doing size *= 2; realloc(old, size), then just malloc(huge_size) at the beginning, and then keep track of the actual used size in your program. There really is no point in doing an mmap() unless you explicitly need it for some specific reason: it isn't faster or better in any particular way, and if malloc() thinks it's needed, it will do it for you.
It's fine to allocate upper bounds as long as:
You're building a 64-bit program: 32-bit ones have a restricted virtual address space, even on 64-bit CPUs
Your upper bounds don't approach 2^47, as a mathematically derived one might
You're fine with crashing as your out-of-memory failure mode
You'll only run on systems where overcommit is enabled
As a side note, an end user application doing this may want to borrow a page from GHC's book and allocate 1TB up front even if 10GB would do. This unrealistically large amount will ensure that users don't confuse virtual memory usage with physical memory usage.
If you know for a fact that wasting a chunk of memory (most likely an entire page, typically 4096 bytes) will not cause your program or the other programs running on your system to run out of memory, AND you know for a fact that your program will only ever be compiled and run on UNIX machines, then this approach is not incorrect, but it is not good programming practice for the following reasons:
The <stdlib.h> file you #include to use malloc() and free() in your C programs is specified by the C standard, but it is specifically implemented for your architecture by the writers of the operating system. This means that your specific system was kept in-mind when these functions were written, so finding a sneaky way to improve efficiency for memory allocation is unlikely unless you know the inner workings of memory management in your OS better than those who wrote it.
Furthermore, the <sys/mman.h> file you include to mmap() stuff is not part of the C standard, and will only compile on UNIX machines, which reduces the portability of your code.
There's also a really good chance (assuming a UNIX environment) that malloc() and realloc() already use mmap() behind-the-scenes to allocate memory for your process anyway, so it's almost certainly better to just use them. (read that as "realloc doesn't necessarily actively allocate more space for me, because there's a good chance there's already a chunk of memory that my process has control of that can satisfy my new memory request without calling mmap() again")
Hope that helps!

Why am I not getting a segmentation fault?

I have
x = (int *)malloc(sizeof(int) * 1);
but still I am able to read x[20] or x[4].
How am I able to access those values? Shouldn't I be getting a segmentation fault when accessing that memory?
The basic premise is that of Sourav Ghosh's answer: accessing memory returned from malloc beyond the size you asked for is undefined behavior, so a conforming implementation is allowed to do pretty much anything, including happily returning bizarre values.
But given a "normal" implementation on mainstream operating systems on "normal" machines (gcc/MSVC/clang, Linux/Windows/macOS, x86/ARM) why do you sometimes get segmentation faults (or access violations), and sometimes not?
Pretty much every "regular" C implementation doesn't perform any kind of memory check when reading/writing through pointers1; these loads/stores get generally translated straight to the corresponding machine code, which accesses the memory at a given location without much regard for the size of the "abstract C machine" objects.
However, on these machines the CPU doesn't access the physical memory (RAM) of the PC directly; instead, a translation layer (the MMU) is introduced2. Whenever your program tries to access an address, the MMU checks whether anything has been mapped there, and whether your process has permission to write there. In case any of those checks fail3, you get a segmentation fault and your process gets killed. This is why uninitialized and NULL pointer values generally give nice segfaults: some memory at the beginning of the virtual address space is deliberately kept unmapped just to spot NULL dereferences, and in general, if you throw a dart at random into a 32-bit address space (or even better, a 64-bit one), you are most likely to find zones of memory that have never been mapped to anything.
As good as it is, the MMU cannot catch all your memory errors for several reasons.
First of all, the granularity of memory mappings is quite coarse compared to most "run of the mill" allocations; on PCs, memory pages (the smallest unit of memory that can be mapped and given protection attributes) are generally 4 KB in size. There is of course a tradeoff here: very small pages would require a lot of memory themselves (as there's a target physical address plus protection attributes associated with each page, and those have to be stored somewhere) and would slow down the MMU operation4. So, if you access memory out of "logical" boundaries but still within the same memory page, the MMU cannot help you: as far as the hardware is concerned, you are still accessing valid memory.
Besides, even if you go outside of the last page of your allocation, it may be that the page that follows is "valid" as far as the hardware is concerned; indeed, this is pretty common for memory you get from the so-called heap (malloc & friends).
This comes from the fact that malloc, for smaller allocations, doesn't ask the OS for "new" blocks of memory (which in theory may be allocated keeping a guard page at both ends); instead, the allocator in the C runtime asks the OS for memory in big sequential chunks, and logically partitions them in smaller zones (usually kept in linked lists of some kind), which are handed out on malloc and returned back by free.
Now, when in your program you step outside the boundaries of the requested memory, you probably don't get any error as:
the memory chunk you are using isn't near a page boundary, so your out-of-bounds read doesn't trigger an access violation;
even if it was at the end of a page, the page that follows is still mapped, as it still belongs to the heap; it may either be memory that has been given to some other code of your process (so you are reading data of some unrelated part of your code), or a free memory zone (so you are reading whatever garbage happened to be left by the previous owner of the block when it freed it), or a zone used by the allocator to keep its bookkeeping data (so you are reading parts of such data).
In all these cases except for the "free block" one, even if you were to write there you wouldn't get a segmentation fault, but you could corrupt unrelated data or the data structures of the heap (which generally results in crashes later, as the allocator finds inconsistencies in its data).
Notes
1. Although modern compilers provide special instrumented builds to trap some of these errors; gcc and clang, in particular, provide the so-called "address sanitizer".
2. This allows the OS to introduce transparent paging (swapping memory zones that aren't actively used out to disk when physical memory runs low) and, most importantly, memory protection and address space separation (when a user-mode process is running, it "sees" a full virtual address space containing only its own stuff, and nothing from the other processes or the kernel).
3. And it's not a fault put there on purpose by the operating system, to be notified that the process is trying to access memory that has been swapped out.
4. Given that each access to memory needs to go through the MMU, the mapping must be very fast, so the most-used page mappings are kept in a cache; if you make the pages very small and the cache can hold just as many entries, you effectively have a smaller memory range covered by the cache.
No. Accessing invalid memory is undefined behavior, and a segmentation fault is just one of the many possible side effects of UB; it is not guaranteed.
That said,
Always check for the success of malloc() by comparing the returned pointer against NULL before using it.
Please see this: Do I cast the result of malloc?

Accessing memory below the stack on linux

This program accesses memory below the stack.
I would expect a segfault, or just NUL bytes, when going out of stack bounds, but I see actual data. (This is assuming 100 KB below the stack pointer is beyond the stack bounds.)
Or is the system actually letting me see memory below the stack? Weren't there supposed to be kernel level protections against this, or does that only apply to allocated memory?
Edit: With 1024*127 below the char pointer it randomly segfaults or runs, so the stack doesn't seem to be a fixed 8 MB, and there seems to be a bit of randomness to it, too.
#include <stdio.h>

int main(void)
{
    char *x;
    int a;

    for (x = (char *)&x - 1024 * 127; x < (char *)(&x + 1); x++) {
        a = *x & 0xFF;
        printf("%p = 0x%02x\n", (void *)x, a);
    }
}
Edit: Another weird thing. The first program segfaults at only 1024*127, but if I printf downwards away from the stack, I don't get a segfault and all the memory seems to be empty (all 0x00):
#include <stdio.h>

int main(void)
{
    char *x;
    int a;

    for (x = (char *)(&x); x > (char *)&x - 1024 * 1024; x--) {
        a = *x & 0xFF;
        printf("%p = 0x%02x\n", (void *)x, a);
    }
}
When you access memory, you're accessing the process address space.
The process address space is divided into pages (typically 4 KB on x86). These are virtual pages: their contents are held elsewhere. The kernel manages a mapping from virtual pages to their contents. Contents can be provided by:
A physical page, for pages that are currently backed by physical RAM. Accesses to these happen directly (via the memory management hardware).
A page that's been swapped out to disk. Accessing this will cause a page fault, which the kernel handles. It needs to fill a physical page with the on-disk contents, so it finds a free physical page (perhaps swapping that page's contents out to disk), reads in the contents from disk, and updates the mapping to state that "virtual page X is in physical page Y".
A file (i.e. a memory mapped file).
Hardware devices (i.e. hardware device registers). These don't usually concern us in user space.
Suppose that we have a 4 GB virtual address space, split into 4 KB pages, giving us 1048576 virtual pages. Some of these will be mapped by the kernel; others will not. When the process starts (i.e. when main() is invoked), the virtual address space will contain, amongst other things:
Program code. These pages are usually readable and executable.
Program data (i.e. for initialised variables). This usually has some read-only pages and some read-write pages.
Code and data from libraries that the program depends on.
Some pages for the stack.
These things are all mapped as pages in the 4 GB address space. You can see what's mapped by looking at /proc/(pid)/maps, as one of the comments has pointed out. The precise contents and location of these pages depend on (a) the program in question, and (b) address space layout randomisation (ASLR), which makes locations of things harder to guess, thereby making certain security exploitation techniques more difficult.
You can access any particular location in memory by defining a pointer and dereferencing it:
*(unsigned char *)0x12345678
If this happens to point to a mapped page, and that page is readable, then the access will succeed and yield whatever's mapped at that address. If not, then you'll receive a SIGSEGV from the kernel. You could handle that (which is useful in some cases, such as JIT compilers), but normally you don't, and the process will be terminated. As noted above, due to ASLR, if you do this in a program and run the program several times then you'll get non-deterministic results for some addresses.
There is usually quite a bit of accessible memory below the stack pointer, because that memory is used when you grow the stack normally. The stack itself is only controlled by the value of the stack pointer - it is a software entity, not a hardware entity.
However, system code may assume typical stack usage. I.e., on some systems, the stack is used to store state for a context switch, while a signal handler runs, etc. This also depends on whether the hardware automatically switches stack pointers when leaving user mode. If the system does use your stack for this, it will clobber the data you stored there, and that can happen at any point in your program.
So it is not safe to manipulate stack memory below the stack pointer. It's not even safe to assume that a value that has successfully been written will still be the same in the next line of code. Only the portion above the stack pointer is guaranteed not to be touched by the runtime/kernel.
It goes without saying that this code invokes undefined behavior. The pointer arithmetic is invalid, because the address &x - 1024*127 is not part of the allocation for the variable x, so dereferencing this pointer invokes undefined behavior.
This is undefined behavior in C. You're accessing a random memory address which, depending on the platform, may or may not be on the stack. It may or may not be in memory this user can access; if not you will get a segfault or similar. There are absolutely no promises either way.
Actually, it's not undefined behaviour, it's pretty well defined. Accessing memory locations through pointers is and was always defined since C is as close to the hardware as it can be.
I however agree that accessing hardware through pointers when you don't know exactly what you're doing is a dangerous thing to do.
Don't Do That. (If you're one of the five or six people who has a legitimate reason to do this, you already know it and don't need our advice.)
It would be a poor world with only five or six people legitimately programming operating systems, embedded devices and drivers (although it sometimes appears as if the latter is the case...).

Why does it work if the size of buffer is fewer than nbyte? [duplicate]

This question already has answers here:
Undefined, unspecified and implementation-defined behavior
(9 answers)
Closed 9 years ago.
The codes are like these:
#define BUFSIZ 5
#include <stdio.h>
#include <unistd.h>

int main()
{
    char buf[BUFSIZ];
    int n;

    n = read(0, buf, 10);
    printf("%d", n);
    printf("%s", buf);
    return 0;
}
I then input abcdefg and the output is:
8abcdefg
In the read(0, buf, 10);, the 10 is larger than 5, which is the size of buf, but it doesn't seem to lead to a wrong result. Does anyone have ideas about this? Thanks!
This is a quirk of how allocation in C works. You have a buffer allocated on the stack, which is really just a chunk of contiguous memory that you can read and write. The fact that you're allowed to write off the end of this array means that in this case it just so happens to work. Perhaps on your machine with your particular compiler and stack layout, you don't end up overwriting anything important :-)
Relying on this behavior being the same between compiler versions is not advised.
You can in principle1 read from and write to any address, but it is only safe and meaningful to access data in an organized, well-defined manner.
The purpose of memory allocation (explicit or implicit) is to bring order into chaos. When you declare your buf array, a small block of memory is reserved on the stack.
Usually, allocations have a certain alignment (and sometimes a certain minimum size; also, the operating system can only detect wrong accesses at a very coarse level), so there will often be small gaps in between your allocated memory blocks, and small areas that you can write to and read from, seemingly without "anything bad" happening -- but you should pretend that this isn't the case, and you should not even think about using these implementation details to your advantage.
Your code example "works" because you were unlucky enough not to hit an unallocated or write-protected memory page, and you didn't overwrite another vital stack value that would have caused the application to crash (such as the function's return address).
I am purposely saying "unlucky", not "lucky" as the fact that it appears to work is not a good thing. It's incorrect code2, and such code should crash early, so you can detect and fix the problem. It may otherwise lead to very hard to diagnose problems that appear to occur at an entirely unrelated time or location. Even if it works now, you have no guarantee whatsoever that it will work tomorrow (or, on a different computer, or with a different compiler, or with ever so slightly different code).
Memory allocation is generally a three-step process. It is an allocation request to the operating system made by the C library (which usually does not directly correspond to your requests), followed by some bookkeeping done in the library, and a promise made by you. At the operating system level, the actual physical allocation at the page level happens on demand as you access memory for the first time, provided that the C library has requested allocation for the accessed location earlier.
In the case of stack allocation, the process is somewhat easier on the library level, since it really only has to decrement one special register, but this is mostly irrelevant for you. The concept remains the same.
The promise you make is that you will only ever read from or write to the agreed area, and this is the primary thing that is important for you.
It can happen that you break your promise (deliberately or by accident) and it still "works", but that is pure coincidence.
On the stack, you will sooner or later overwrite the storage of some local variables (which may go undetected if they're cached in a register) and eventually the return address, which will almost certainly cause a crash (or similar undesired behavior) when the function returns. On the heap, you may overwrite some other program data, or access a page that hasn't been communicated to the operating system as being reserved; in the latter case, the program will be terminated immediately.
1 Let's not consider virtual memory and page protections for an instant.
2 Strictly speaking, it's not incorrect code, but code that invokes undefined behavior. However, overwriting unallocated memory is in my opinion serious enough to merit the label "incorrect".

Why can I write and read memory when I haven't allocated space?

I'm trying to build my own Hash Table in C from scratch as an exercise and I'm doing one little step at a time. But I'm having a little issue...
I'm declaring the Hash Table structure as a pointer so I can initialize it with the size I want and increase its size whenever the load factor is high.
The problem is that I'm creating a table with only 2 elements (it's just for testing purposes), I'm allocating memory for just those 2 elements but I'm still able to write to memory locations that I shouldn't. And I also can read memory locations that I haven't written to.
Here's my current code:
#include <stdio.h>
#include <stdlib.h>

#define HASHSIZE 2

typedef char *HashKey;
typedef int HashValue;

typedef struct sHashTable {
    HashKey key;
    HashValue value;
} HashEntry;

typedef HashEntry *HashTable;

void hashInsert(HashTable table, HashKey key, HashValue value) {
}

void hashInitialize(HashTable *table, int tabSize) {
    *table = malloc(sizeof(HashEntry) * tabSize);
    if(!*table) {
        perror("malloc");
        exit(1);
    }

    (*table)[0].key = "ABC";
    (*table)[0].value = 45;
    (*table)[1].key = "XYZ";
    (*table)[1].value = 82;
    (*table)[2].key = "JKL";
    (*table)[2].value = 13;
}

int main(void) {
    HashTable t1 = NULL;

    hashInitialize(&t1, HASHSIZE);

    printf("PAIR(%d): %s, %d\n", 0, t1[0].key, t1[0].value);
    printf("PAIR(%d): %s, %d\n", 1, t1[1].key, t1[1].value);
    printf("PAIR(%d): %s, %d\n", 2, t1[2].key, t1[2].value);
    printf("PAIR(%d): %s, %d\n", 3, t1[3].key, t1[3].value);

    return 0;
}
You can easily see that I haven't allocated space for (*table)[2].key = "JKL"; nor (*table)[2].value = 13;. I also shouldn't be able read the memory locations in the last 2 printfs in main().
Can someone please explain this to me and if I can/should do anything about it?
EDIT:
Ok, I've realized a few things about my code above, which is a mess... But I have a class right now and can't update my question. I'll update this when I have the time. Sorry about that.
EDIT 2:
I'm sorry, but I shouldn't have posted this question, because I don't want my code to be like I posted above. I want to do things slightly differently, which makes this question a bit irrelevant. So, I'm just going to assume this was a question that I needed an answer for and accept one of the correct answers below. I'll then post my proper questions...
Just don't do it, it's undefined behavior.
It might accidentally work because you write/read some memory the program doesn't actually use. Or it can lead to heap corruption, because you overwrite metadata used by the heap manager for its own purposes. Or you can overwrite some other unrelated variable and then have a hard time debugging the program that goes nuts because of it. Or anything else harmful, either obvious or subtle yet severe, can happen.
Just don't do it - only read/write memory you legally allocated.
Generally speaking (implementations differ between platforms), when malloc or a similar heap-based allocation call is made, the underlying library translates it into a system call. When the library does that, it generally allocates space in regions equal to or larger than the amount the program requested.
Such an arrangement is made to avoid frequent system calls into the kernel for allocation, and to satisfy the program's heap requests faster (this is certainly not the only reason; others exist as well).
A side effect of such an arrangement is the behavior you are observing. Again, it's not guaranteed that your program can write to a non-allocated zone without crashing/seg-faulting every time - that depends on the particular binary's memory layout. Try writing to an even higher array offset - your program will eventually fault.
As for what you should/should not do - the people who have responded above have summarized it fairly well. I have no better answer except that such issues should be prevented, and that can only be done by being careful while allocating memory.
One way of understanding this is through a crude example: when you request 1 byte in userspace, the kernel has to allocate at least a whole page (which would be 4 KB on many Linux systems, for example - the most granular allocation at kernel level). To improve efficiency by reducing frequent calls, the kernel assigns this whole page to the calling library, which the library can parcel out as more requests come in. Thus, reads or writes into such a region may not necessarily generate a fault. You would just get garbage.
In C, you can read from any address that is mapped, and you can also write to any address that is mapped with read-write permissions.
In practice, the OS gives a process memory in chunks (pages), commonly 4K (but this is OS-dependent). The C library then manages these pages, maintains lists of what is free and what is allocated, and hands the user addresses within these blocks when asked via malloc.
So when you get a pointer back from malloc(), you are pointing into a page that is read-writable. That page may contain garbage, or it may contain other malloc'd memory; it may contain the memory used for stack variables, or even the memory used by the C library itself to manage the lists of free/allocated blocks!
So you can imagine that writing to addresses beyond the range you have malloc'ed can really cause problems:
Corruption of other malloc'ed data
Corruption of stack variables, or the call stack itself, causing crashes when a function returns
Corruption of the C-library's malloc/free management memory, causing crashes when malloc() or free() are called
All of which are a real pain to debug, because the crash usually occurs much later than when the corruption occurred.
Only when you read or write an address that does not correspond to a mapped page will you get a crash - e.g. reading from address 0x0 (NULL).
malloc, free, and pointers are very fragile in C (and to a slightly lesser degree in C++), and it is very easy to shoot yourself in the foot accidentally.
There are many third-party tools for memory checking which wrap each memory allocation/free/access with checking code. They do tend to slow your program down, depending on how much checking is applied.
Think of memory as a great big blackboard divided into little squares. Writing to a memory location is equivalent to erasing a square and writing a new value there. The purpose of malloc generally isn't to bring memory (blackboard squares) into existence; rather, it's to identify an area of memory (a group of squares) that's not being used for anything else, and take some action to ensure it won't be used for anything else until further notice. Historically, it was pretty common for microprocessors to expose all of the system's memory to an application. A piece of code `Foo` could in theory pick an arbitrary address and store its data there, but with a couple of major caveats:
Some other code `Bar` might have previously stored something there with the expectation that it would remain. If `Bar` reads that location expecting to get back what it wrote, it will erroneously interpret the value written by `Foo` as its own. For example, if `Bar` had stored the number of widgets that were received (23), and `Foo` stored the value 57, the earlier code would then believe it had received 57 widgets.
If `Foo` expects the data it writes to remain for any significant length of time, its data might get overwritten by some other code (basically the flip-side of the above).
Newer systems include more monitoring to keep track of what processes own what areas of memory, and kill off processes that access memory that they don't own. In many such systems, each process will often start with a small blackboard and, if attempts are made to malloc more squares than are available, processes can be given new chunks of blackboard area as needed. Nonetheless, there will often be some blackboard area available to each process which hasn't yet been reserved for any particular purposes. Code could in theory use such areas to store information without bothering to allocate it first, and such code would work if nothing happened to use the memory for any other purpose, but there would be no guarantee that such memory areas wouldn't be used for some other purpose at some unexpected time.
Usually malloc will allocate more memory than you require, for alignment purposes. Also, the process really does have read/write access to the whole heap memory region, so reading a few bytes outside of an allocated block seldom triggers any errors.
But you still should not do it. Since the memory you're writing to may be regarded as unoccupied, or may in fact be occupied by something else, anything can happen: e.g. the 2nd and 3rd key/value pairs may turn to garbage later, or an unrelated but vital function may crash because of invalid data you've stomped onto its malloc'd memory.
(Also, either make the key a char array of at least 4 bytes or malloc the key's storage, because if the key ever points to a stack buffer it will become invalid once that frame is gone.)