Too many calls to mprotect - c

I am working on a parallel app (C, pthread). I traced the system calls because at some point I have bad parallel performances. My traces shown that my program calls mprotect() many many times ... enough to significantly slow down my program.
I do allocate a lot of memory (with malloc()) but there is only a reasonable number of calls to brk() to increase the heap size. So why so many calls to mprotect() ?!

Are you creating and destroying lots of threads?
Most pthread implementations will add a "guard page" when allocating a threads stack. It's an access protected memory page used to detect stack overflows. I'd expect at least one call to mprotect each time a thread is created or terminated to (un)protect the guard page. If this is the case, there are several obvious strategies:
Set the guard page size to zero using pthread_attr_setguardsize() before creating threads.
Use a thread-pool (of as many threads as processors say). Once a thread is done with a task, return it to the pool to get a new task rather than terminate and create a new thread.
Another explanation might be that you're on a platform where a thread's stack will be grown if overflow is detected. I don't think this is implemented on Linux with GCC/Glibc as yet, but there have been some proposals along these lines recently. If you use a lot of stack space whilst processing, you might explicitely increase the initial/minimum stack size using pthread_attr_setstacksize.
Or it might be something else entirely!

If you can, run your program under a debug libc and break on mprotect(). Look at the call stack, see what your code is doing that's leading to the mprotect() calls.

glibc library that has ptmalloc2 for its malloc uses mprotect() internally for micromanagement of heap for threads other than main thread (for main thread, sbrk() is used instead.) malloc() firstly allocates large chunk of memory with mmap() for the thread if a heap area seems to have contention, and then it changes the protection bits of unnecessary portion to make it accessible with mprotect(). Later, when it needs to grow the heap, it changes the protection to read/writable with mprotect() again. Those mprotect() calls are for heap growth and shrink in multithreaded applications.
http://www.blackhat.com/presentations/bh-usa-07/Ferguson/Whitepaper/bh-usa-07-ferguson-WP.pdf
explains this in a bit more detailed way.

The 'valgrind' suite has a tool called 'callgrind' that will tell you what is calling what. If you run the application under 'callgrind', you can then view the resulting profile data with 'kcachegrind' (it can analyze profiles made by 'cachegrind' or 'callgrind'). Then just double-click on 'mprotect' in the left pane and it will show you what code is calling it and how many times.

Related

Does terminating a program reclaim memory in the same way as free()?

I saw this answer to a stack overflow question that says that freeing memory at the very end of a c program is actually harmful because it moves variables that wouldn't be used again into system memory.
I'm confused why the free() method in C would do anything different than the operating system reclaiming the heap at the end of the program.
Does anyone know if there is a real difference between free() and termination in terms of memory management and if so how the operating system may treat these two differently?
e.g.
would anything different happen between these two short programs?
void main() {
int* mem = malloc(1);
return 0;
}
void main() {
int* mem = malloc(1);
free(mem);
return 0;
}
No, terminating a program, as with exit or abort, does not reclaim memory in the same way as free. Using free causes some activity that ultimately has no effect when the operating system discards the data maintained by malloc and free.
exit has some complications, as it does not immediately terminate the program. For now, let’s just consider the effect of immediately terminating the program and consider the complications later.
In a general-purpose multi-user operating system, when a process is terminated, the operating system releases the memory it was using to make it available for other purposes.1 In large part, this simply means the operating system does some accounting operations.
In contrast, when you call free, software inside the program runs, and it has to look up the size of the memory you are freeing and then insert information about that memory into the pool of memory it is maintaining. There could be thousands or tens of thousands (or more) of such allocations. A program that frees all its data may have to execute many thousands of calls to free. Yet, in the end, when the program exits, all of the changes produced by free will vanish, as the operating system will discard all the data about that pool of memory—all of the data is in memory pages the operating system does not preserve.
So, in this regard, the answer you link to is correct, calling free is a waste. And, as it points out, the necessity of going through all the data structures in the program to fetch the pointers in them so the memory they point to can be freed causes all those data structures to be read into memory if they had been swapped out to disk. For large programs, it can take a considerable amount of time and other resources.
On the other hand, it is not clear it is easy to avoid many calls to free. This is because releasing memory is not the only thing a terminating program has to clean up. A program may want to write final data to files or send final messages to network connections. Furthermore, a program may not have established all of this context directly. Most large programs rely on layers of software, and each software package may have set up its own context, and often no way is provided to tell other software “I want to exit now. Finish the valuable context, but skip all the freeing of memory.” So all the desired clean-up tasks may be interwined with the free-memory tasks, and there may be no good way to untangle them.
Software should generally be written so that nothing terrible happens if a program is suddenly aborted (since this can happen from a loss of power, not just deliberate user action). But even though a program might be able to tolerate an abort, there can still be value in a graceful exit.
Getting back to exit, calling the C exit routine does not exit the program immediately. Exit handlers (registered with atexit) are called, stream buffers are flushed, and streams are closed. Any software libraries you called may have set up their own exit handlers so that they can finish up when the program is exiting. So, if you want to be sure libraries you have used in your program are not calling free when you end the program, you have to call abort, not exit. But it is generally preferred to end a program gracefully, not by aborting. Calling abort will not call exit handlers, flush streams, close streams, or perform other wind-down code that exit does—data can be lost when a program calls abort.
Footnote
1 Releasing memory does not mean it is immediately available for other purposes. The specific result of this depends on each page of memory. For example:
If the memory is shared with other processes, it is still needed for them, so releasing it from use by this process only decrements the number of processes using the memory. It is not immediately available for any other use.
If the memory is not in use by any other processes but contains data mapped from a file on disk, the operating system might mark it as available when needed but leave it alone for the moment. This is because you might run the same program again, and it would be nice if the data were still in memory, so why not just leave it in place just in case? The data might even be used by a different program that uses the same file. (For example, many programs might use the same shared library.)
If the memory is not in use by any other processes and was just used by the program as a work area, not mapped from a file, then system may mark it as immediately available and not containing anything useful.
would anything different happen between these two short programs?
The simple answer is: it makes no difference, the memory is released to the system in both cases. Calling free() is not strictly necessary and does incur an infinitesimal overhead but may prove useful when trying to track memory leaks in more complex programs.
Does terminating a program reclaim memory in the same way as free?
Not exactly:
Terminating a program releases the memory used by the program, be it for the program code, data, stack or heap. It also releases some other resources such as file handles, device handles, network sockets... All this is done efficiently, no matter how many blocks of memory have been allocated with malloc().
Conversely, free() makes the block of memory available for further use by the program for later calls to malloc() or realloc(). Depending on its size and the implementation of the heap, this freed block may or may not be returned to the OS for use by other programs. Also worth noting it the fragmentation problem, where small blocks of freed memory may not be usable for a larger allocation because they are surrounded by allocated blocks. The C heap does not perform packing or de-fragmentation, it merely coalesces adjacent free blocks. Freeing all allocated blocks before leaving the program may be useful for debugging purposes, but may be complicated and time consuming, while not necessary for the memory to be reused by the system after the program terminates.
free() is a user level memory management function and depends on malloc implementation you are currently using. The user-level allocator might maintain a linked-list of memory chunk and malloc/free will take the chunk of appropropriate size/put it back.
exit() Destroys an address space and all regions.
This is related to malloced heap as well as some other regions and in-kernel data structures used for managing address space of the process:
Each address space consists of a number of page-aligned regions
of memory that are in use. They never overlap and represent a set
of addresses which contain pages that are related to each other in
terms of protection and purpose. These regions are represented by
a struct vm_area_struct and are roughly analogous to the
vm_map_entry struct in BSD. For clarity, a region may represent the
process heap for use with malloc(), a memory mapped file such as
a shared library or a block of anonymous memory allocated with
mmap(). The pages for this region may still have to be allocated, be
active and resident or have been paged out
Reference: https://www.kernel.org/doc/gorman/html/understand/understand007.html
The reason well-designed programs free memory at exit is to check for memory leaks. If your application-level memory allocation does not go to zero after your last deallocation, you know that you have a memory memory that is not being managed properly and probably have a memory leak in your code.
would anything different happen between these two short programs?
YES
I'm confused why the free() method in C would do anything different than the operating system reclaiming the heap at the end of the program.
The operating system allocates memory in pages. Heap managers (such as malloc/free implementations) allocate pages from the operating system and subdivide the pages into smaller allocations. Calls to free() normally return memory to the heap. They do not return the pages to the operating system.

Why do i need to free dynamic memory manually? [duplicate]

Let's say I have the following C code:
int main () {
int *p = malloc(10 * sizeof *p);
*p = 42;
return 0; //Exiting without freeing the allocated memory
}
When I compile and execute that C program, ie after allocating some space in memory, will that memory I allocated be still allocated (ie basically taking up space) after I exit the application and the process terminates?
It depends on the operating system. The majority of modern (and all major) operating systems will free memory not freed by the program when it ends.
Relying on this is bad practice and it is better to free it explicitly. The issue isn't just that your code looks bad. You may decide you want to integrate your small program into a larger, long running one. Then a while later you have to spend hours tracking down memory leaks.
Relying on a feature of an operating system also makes the code less portable.
In general, modern general-purpose operating systems do clean up after terminated processes. This is necessary because the alternative is for the system to lose resources over time and require rebooting due to programs which are poorly written or simply have rarely-occurring bugs that leak resources.
Having your program explicitly free its resources anyway can be good practice for various reasons, such as:
If you have additional resources that are not cleaned up by the OS on exit, such as temporary files or any kind of change to the state of an external resource, then you will need code to deal with all of those things on exit, and this is often elegantly combined with freeing memory.
If your program starts having a longer lifetime, then you will not want the only way to free memory to be to exit. For example, you might want to convert your program into a server (daemon) which keeps running while handling many requests for individual units of work, or your program might become a small part of a larger program.
However, here is a reason to skip freeing memory: efficient shutdown. For example, suppose your application contains a large cache in memory. If when it exits it goes through the entire cache structure and frees it one piece at a time, that serves no useful purpose and wastes resources. Especially, consider the case where the memory pages containing your cache have been swapped to disk by the operating system; by walking the structure and freeing it you're bringing all of those pages back into memory all at once, wasting significant time and energy for no actual benefit, and possibly even causing other programs on the system to get swapped out!
As a related example, there are high-performance servers that work by creating a process for each request, then having it exit when done; by this means they don't even have to track memory allocation, and never do any freeing or garbage collection at all, since everything just vanishes back into the operating system's free memory at the end of the process. (The same kind of thing can be done within a process using a custom memory allocator, but requires very careful programming; essentially making one's own notion of “lightweight processes” within the OS process.)
My apologies for posting so long after the last post to this thread.
One additional point. Not all programs make it to graceful exits. Crashes and ctrl-C's, etc. will cause a program to exit in uncontrolled ways. If your OS did not free your heap, clean up your stack, delete static variables, etc, you would eventually crash your system from memory leaks or worse.
Interesting aside to this, crashes/breaks in Ubuntu, and I suspect all other modern OSes, do have problems with "handled' resources. Sockets, files, devices, etc. can remain "open" when a program ends/crashes. It is also good practice to close anything with a "handle" or "descriptor" as part of your clean up prior to graceful exit.
I am currently developing a program that uses sockets heavily. When I get stuck in a hang I have to ctrl-c out of it, thus, stranding my sockets. I added a std::vector to collect a list of all opened sockets and a sigaction handler that catches sigint and sigterm. The handler walks the list and closes the sockets. I plan on making a similar cleanup routine for use before throw's that will lead to premature termination.
Anyone care to comment on this design?
What's happening here (in a modern OS), is that your program runs inside its own "process." This is an operating system entity that is endowed with its own address space, file descriptors, etc. Your malloc calls are allocating memory from the "heap", or unallocated memory pages that are assigned to your process.
When your program ends, as in this example, all of the resources assigned to your process are simply recycled/torn down by the operating system. In the case of memory, all of the memory pages that are assigned to you are simply marked as "free" and recycled for the use of other processes. Pages are a lower-level concept than what malloc handles-- as a result, the specifics of malloc/free are all simply washed away as the whole thing gets cleaned up.
It's the moral equivalent of, when you're done using your laptop and want to give it to a friend, you don't bother to individually delete each file. You just format the hard drive.
All this said, as all other answerers are noting, relying on this is not good practice:
You should always be programming to take care of resources, and in C that means memory as well. You might end up embedding your code in a library, or it might end up running much longer than you expect.
Some OSs (older ones and maybe some modern embedded ones) may not maintain such hard process boundaries, and your allocations might affect others' address spaces.
Yes. The OS cleans up resources. Well ... old versions of NetWare didn't.
Edit: As San Jacinto pointed out, there are certainly systems (aside from NetWare) that do not do that. Even in throw-away programs, I try to make a habit of freeing all resources just to keep up the habit.
Yes, the operating system releases all memory when the process ends.
It depends, operating systems will usually clean it up for you, but if you're working on for instance embedded software then it might not be released.
Just make sure you free it, it can save you a lot of time later when you might want to integrate it in to a large project.
That really depends on the operating system, but for all operating systems you'll ever encounter, the memory allocation will disappear when the process exits.
I think direct freeing is best. Undefined behaviour is the worst thing, so if you have access while it's still defined in your process, do it, there are lots of good reasons people have given for it.
As to where, or whether, I found that in W98, the real question was 'when' (I didn't see a post emphasising this). A small template program (for MIDI SysEx input, using various malloc'd spaces) would free memory in the WM_DESTROY bit of the WndProc, but when I transplanted this to a larger program it crashed on exit. I assumed this meant I was trying to free what the OS had already freed during a larger cleanup. If I did it on WM_CLOSE, then called DestroyWindow(), it all worked fine, instant clean exit.
While this isn't exactly the same as MIDI buffers, there is similarity in that it is best to keep the process intact, clean up fully, then exit. With modest memory chunks this is very fast. I found that many small buffers worked faster in operation and cleanup than fewer large ones.
Exceptions may exist, as someone said when avoiding hauling large memory chunks back out of a swap file on disk, but even that may be minimised by keeping more, and smaller, allocated spaces.

set stack size for threads using setrlimit

I'm using a library which creates a pthread using the default stack size of 8MB. Is it possible to programatically reduce the stack size of the thread the library creates? I tried using setrlimit(RLIMIT_STACK...) inside my main() function, but that doesn't seem to have any effect. ulimit -s seems to do the job, but I don't want to set the stack size before my program is executed.
Any ideas what I can do? Thanks
Update 1:
Seems like I'm going to give up on being able to set the stack size using setrlimit(RLIMIT_STACK,...). I checked the resident memory and found it's a lot less than the virtual memory. That's a good enough reason for me to give up on trying to limit the stack size.
I think you are out of luck. If the library you are using does not provide a way to set the stack limit, then you can't change it after the thread has been created. setrlimit and shell limits effects the main thread's stack.
Threads are created within the processes memory space so their stacks are allocated when the threads are created. On Unix I believe the stack will be mapped to RAM on demand, so you may not actually use 8 Megs of RAM if you don't need it (virtual vs resident memory).
There are a couple aspects to answering this question.
First, as stated in the comments, pthread_attr_setstacksize is The Right Way to do this. If the library calling pthread_create doesn't have a way to let you do this, fixing the library would be the ideal solution. If the thread is purely internal to the library (not calling code from the calling application) it really should set its own preference for the stack size based on something like PTHREAD_STACK_MIN + ITS_OWN_NEEDS. If it's calling back to your code, it should let you request how much stack space you need.
Second, as an implementation detail, glibc uses the stack limit from setrlimit/ulimit to derive the stack size for threads created by pthread_create. You can perhaps influence the size this way, but it's not portable, and as you've found, not reliable even there (it's not working when you call setrlimit from within the process itself). It's possible that glibc only probes the limit once when the relevant code is first initialized, so I would try moving the setrlimit call as early as possible in main to see if this helps.
Finally, the stack size for threads may not even be relevant to your application. Even if the stack size is 8MB, only the pages which have actually been modified (probably 4k or at most 8k unless you have big arrays on the stack) are actually using physical memory. The rest is just tying up virtual address space (of which you always have at least 2-3 GB) and possibly commit charge. By default, Linux enables overcommit, so commit charge will not be strictly enforced, and therefore the fact that glibc is requesting too much may not even matter. You could make the overcommit checking even less strict by writing a 1 to /proc/sys/vm/overcommit_memory, but this will cause you to loose information about when you're "running out of memory" and make your program crash instead. On such a constrained system you may prefer even stricter overcommit accounting, but then you have to fix the thread stack size problem...

malloc()/free() behavior differs between Debian and Redhat

I have a Linux app (written in C) that allocates large amount of memory (~60M) in small chunks through malloc() and then frees it (the app continues to run then). This memory is not returned to the OS but stays allocated to the process.
Now, the interesting thing here is that this behavior happens only on RedHat Linux and clones (Fedora, Centos, etc.) while on Debian systems the memory is returned back to the OS after all freeing is done.
Any ideas why there could be the difference between the two or which setting may control it, etc.?
I'm not certain why the two systems would behave differently (probably different implementations of malloc from different glibc's). However, you should be able to exert some control over the global policy for your process with a call like:
mallopt(M_TRIM_THRESHOLD, bytes)
(See this linuxjournal article for details).
You may also be able to request an immediate release with a call like
malloc_trim(bytes)
(See malloc.h). I believe that both of these calls can fail, so I don't think you can rely on them working 100% of the time. But my guess is that if you try them out you will find that they make a difference.
Some mem handler dont present the memory as free before it is needed. It instead leaves the CPU to do other things then finalize the cleanup. If you wish to confirm that this is true, then just do a simple test and allocate and free more memory in a loop more times than you have memeory available.

Do threads have a distinct heap?

As far as I know each thread gets a distinct stack when the thread is created by the operating system. I wonder if each thread has a heap distinct to itself also?
No. All threads share a common heap.
Each thread has a private stack, which it can quickly add and remove items from. This makes stack based memory fast, but if you use too much stack memory, as occurs in infinite recursion, you will get a stack overflow.
Since all threads share the same heap, access to the allocator/deallocator must be synchronized. There are various methods and libraries for avoiding allocator contention.
Some languages allow you to create private pools of memory, or individual heaps, which you can assign to a single thread.
By default, C has only a single heap.
That said, some allocators that are thread aware will partition the heap so that each thread has it's own area to allocate from. The idea is that this should make the heap scale better.
One example of such a heap is Hoard.
Depends on the OS. The standard c runtime on windows and unices uses a shared heap across threads. This means locking every malloc/free.
On Symbian, for example, each thread comes with its own heap, although threads can share pointers to data allocated in any heap. Symbian's design is better in my opinion since it not only eliminates the need for locking during alloc/free, but also encourages clean specification of data ownership among threads. Also in that case when a thread dies, it takes all the objects it allocated along with it - i.e. it cannot leak objects that it has allocated, which is an important property to have in mobile devices with constrained memory.
Erlang also follows a similar design where a "process" acts as a unit of garbage collection. All data is communicated between processes by copying, except for binary blobs which are reference counted (I think).
Each thread has its own stack and call stack.
Each thread shares the same heap.
It depends on what exactly you mean when saying "heap".
All threads share the address space, so heap-allocated objects are accessible from all threads. Technically, stacks are shared as well in this sense, i.e. nothing prevents you from accessing other thread's stack (though it would almost never make any sense to do so).
On the other hand, there are heap structures used to allocate memory. That is where all the bookkeeping for heap memory allocation is done. These structures are sophisticatedly organized to minimize contention between the threads - so some threads might share a heap structure (an arena), and some might use distinct arenas.
See the following thread for an excellent explanation of the details: How does malloc work in a multithreaded environment?
Typically, threads share the heap and other resources, however there are thread-like constructions that don't. Among these thread-like constructions are Erlang's lightweight processes, and UNIX's full-on processes (created with a call to fork()). You might also be working on multi-machine concurrency, in which case your inter-thread communication options are considerably more limited.
Generally speaking, all threads use the same address space and therefore usually have just one heap.
However, it can be a bit more complicated. You might be looking for Thread Local Storage (TLS), but it stores single values only.
Windows-Specific:
TLS-space can be allocated using TlsAlloc and freed using TlsFree (Overview here). Again, it's not a heap, just DWORDs.
Strangely, Windows support multiple Heaps per process. One can store the Heap's handle in TLS. Then you would have something like a "Thread-Local Heap". However, just the handle is not known to the other threads, they still can access its memory using pointers as it's still the same address space.
EDIT: Some memory allocators (specifically jemalloc on FreeBSD) use TLS to assign "arenas" to threads. This is done to optimize allocation for multiple cores by reducing synchronization overhead.
On FreeRTOS Operating system, tasks(threads) share the same heap but each one of them has its own stack. This comes in very handy when dealing with low power low RAM architectures,because the same pool of memory can be accessed/shared by several threads, but this comes with a small catch , the developer needs to keep in mind that a mechanism for synchronizing malloc and free is needed, that is why it is necessary to use some type of process synchronization/lock when allocating or freeing memory on the heap, for example a semaphore or a mutex.

Resources