Performance graph
I'm unsure if my code contains a memory leak or not. In order to verify this, I'm running it in an infinite loop which gives the performance graph shown here. The yellow plot shows the process' memory usage, and the magenta plot is the process' I/O (I'm outputting the loop's iteration number at every cycle). This process is running in Windows 8.1, and I used Qt Creator to compile it, in Release Mode.
The initial data point of the yellow plot is 732KB and the final one shows 924KB (this is after some 3000 iterations of the loop), data points are taken every second here.
It's a little hard to tell but the memory usage seems to be following a log progression with sudden jumps, i.e. it plateaus and suddenly jumps by about 10KB after a few hundred iterations. Moreover, looking at the I/O graph, it also seems that the execution speed of the loop is uneven.
This behavior seems a little odd because the memory leaks I'm used to dealing with show a clear linear memory usage over time. I'm not exactly sure what to make of this, I'm assuming it is either a very small memory leak or it could it be the OS allocating more memory to the process in order to optimize its performance? Any advice?
Related
I am trying to figure out why my resident memory for one version of a program ("new") is much higher (5x) than another version of the same program ("baseline"). The program is running on a Linux cluster with E5-2698 v3 CPUs and written in C++. The baseline is a multiprocess program, and the new one is a multithreaded program; they are both fundamentally doing the same algorithm, computation, and operating on the same input data, etc. In both, there are as many processes or threads as cores (64), with threads pinned to CPUs. I've done a fair amount of heap profiling using both Valgrind Massif and Heaptrack, and they show that the memory allocation is the same (as it should be). The RSS for both the baseline and new version of the program are larger than the LLC.
The machine has 64 cores (hyperthreads). For both versions, I straced relevant processes and found some interesting results. Here's the strace command I used:
strace -k -p <pid> -e trace=mmap,munmap,brk
Here are some details about the two versions:
Baseline Version:
64 processes
RES is around 13 MiB per process
using hugepages (2MB)
no malloc/free-related syscalls were made from the strace call listed above (more on this below)
top output
New Version
2 processes
32 threads per process
RES is around 2 GiB per process
using hugepages (2MB)
this version does a fair amount of memcpy calls of large buffers (25MB) with default settings of memcpy (which, I think, is supposed to use non-temporal stores but I haven't verified this)
in release and profile builds, many mmap and munmap calls were generated. Curiously, none were generated in debug mode. (more on that below).
top output (same columns as baseline)
Assuming I'm reading this right, the new version has 5x higher RSS in aggregate (entire node) and significantly more page faults as measured using perf stat when compared to the baseline version. When I run perf record/report on the page-faults event, it's showing that all of the page faults are coming from a memset in the program. However, the baseline version has that memset as well and there are no pagefaults due to it (as verified using perf record -e page-faults). One idea is that there's some other memory pressure for some reason that's causing the memset to page-fault.
So, my question is how can I understand where this large increase in resident memory is coming from? Are there performance monitor counters (i.e., perf events) that can help shed light on this? Or, is there a heaptrack- or massif-like tool that will allow me to see what is the actual data making up the RES footprint?
One of the most interesting things I noticed while poking around is the inconsistency of the mmap and munmap calls as mentioned above. The baseline version didn't generate any of those; the profile and release builds (basically, -march=native and -O3) of the new version DID issue those syscalls but the debug build of the new version DID NOT make calls to mmap and munmap (over tens of seconds of stracing). Note that the application is basically mallocing an array, doing compute, and then freeing that array -- all in an outer loop that runs many times.
It might seem that the allocator is able to easily reuse the allocated buffer from the previous outer loop iteration in some cases but not others -- although I don't understand how these things work nor how to influence them. I believe allocators have a notion of a time window after which application memory is returned to the OS. One guess is that in the optimized code (release builds), vectorized instructions are used for the computation and it makes it much faster. That may change the timing of the program such that the memory is returned to the OS; although I don't see why this isn't happening in the baseline. Maybe the threading is influencing this?
(As a shot-in-the-dark comment, I'll also say that I tried the jemalloc allocator, both with default settings as well as changing them, and I got a 30% slowdown with the new version but no change on the baseline when using jemalloc. I was a bit surprised here as my previous experience with jemalloc was that it tends to produce some some speedup with multithreaded programs. I'm adding this comment in case it triggers some other thoughts.)
In general: GCC can optimize malloc+memset into calloc which leaves pages untouched. If you only actually touch a few pages of a large allocation, that not happening could account for a big diff in page faults.
Or does the change between versions maybe let the system use transparent hugepages differently, in a way that happens to not be good for your workload?
Or maybe just different allocation / free is making your allocator hand pages back to the OS instead of keeping them in a free list. Lazy allocation means you get a soft page fault on the first access to a page after getting it from the kernel. strace to look for mmap / munmap or brk system calls.
In your specific case, your strace testing confirms that your change led to malloc / free handing pages back to the OS instead of keeping them on a free list.
This fully explains the extra page faults. A backtrace of munmap calls could identify the guilty free calls. To fix it, see https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html / http://man7.org/linux/man-pages/man3/mallopt.3.html, especially M_MMAP_THRESHOLD (perhaps raise it to get glibc malloc not to use mmap for your arrays?). I haven't played with the parameters before. The man page mentions something about a dynamic mmap threshold.
It doesn't explain the extra RSS; are you sure you aren't accidentally allocating 5x the space? If you aren't, perhaps better alignment of the allocation lets the kernel use transparent hugepages where it didn't before, maybe leading to wasting up to 1.99 MiB at the end of an array instead of just under 4k? Or maybe Linux wouldn't use a hugepage if you only allocated the first couple of 4k pages past a 2M boundary.
If you're getting the page faults in memset, I assume these arrays aren't sparse and that you are touching every element.
I believe allocators have a notion of a time window after which application memory is returned to the OS
It would be possible for an allocator to check the current time every time you call free, but that's expensive so it's unlikely. It's also very unlikely that they use a signal handler or separate thread to do a periodic check of free-list size.
I think glibc just uses a size-based heuristic that it evaluates on every free. As I said, the man page mentions something about heuristics.
IMO actually tuning malloc (or finding a different malloc implementation) that's better for your situation should probably be a different question.
After my kernel calls the AHCIInit() function inside of the ArchInit() function, I get a page fault in one of the MemAllocate() calls, and this only happens in real machines, as I tried replicating it on VirtualBox, VMWare and QEMU.
I tried debugging the code, unit testing the memory allocator and removing everything from the kernel, with the exception from the memory manager and the AHCI driver itself, the only thing that I discovered is that something is corrupting the allocation blocks, making the MemAllocate() page fault.
The whole kernel source is at https://github.com/CHOSTeam/CHicago-Kernel, but the main files where the problem probably occours are:
https://github.com/CHOSTeam/CHicago-Kernel/blob/master/mm/alloc.c
https://github.com/CHOSTeam/CHicago-Kernel/blob/master/arch/x86/io/ahci.c
I expected the AHCIInit() to detect and initialize all the AHCI devices and the boot to continues until it reaches the session manager or the kernel shell, but in real computers it page faults before even initializing the scheduler (so no, the problem isn't my scheduler).
If it works in emulators but doesn't work on real hardware; then the first things I'd suspect are:
bugs in physical memory management. For example, physical memory manager initialization not rounding "starting address of usable RAM area" up to a page boundary or not rounding "ending address of usable RAM area" down to a page boundary, causing a "half usable RAM and half not usable RAM" page to be allocated by the heap later (where it works on emulators because the memory map provided by firmware happens to describe areas that are nicely aligned anyway).
a bug where RAM is assumed to contain zeros but may not be (where it works on emulators because they tend to leave almost all RAM full of zeros).
a race condition (where different timing causes different behavior).
However; this is a monolithic kernel, which means that you'll continually be facing "any piece of code in kernel-space did something that caused problems for any other piece of code anywhere else"; and there's a bunch of common bugs with memory usage (e.g. accidentally writing past the end of what you allocated). For this reason I'd want better tools to help diagnose problems, especially for heap.
Specifically, for the heap I'd start with canaries (e.g. put a magic number like 0xFEEDFACE before each block of memory in the heap, and another different number after each block of memory in the heap; and then check that the magic numbers are still present and correct where convenient - e.g. when blocks are freed or resized). Then I'd write a "check_heap()" function that scans through everything checking as much as possible (the canaries, if statistics like "number of free blocks" are actually correct, etc). The idea being that (whenever you suspect something might have corrupted the heap) you can insert a call to the "check_heap()" function, and move that call around until you find out which piece of code caused the heap corruption. I'd also suggest having a "what" parameter in your "kmalloc() or equivalent" (e.g. so you can do things like myFooStructure = kmalloc("Foo Structure", sizeof(struct foo));), where the provided "what string" is stored in the allocated block's meta-data, so that later on (when you find out the heap was corrupted) you can display the "what string" associated with the block before the corruption, and so that you can (e.g.) list how many of each type of thing there currently is to help determine what is leaking memory (e.g. if the number of "Foo Structure" blocks is continually increasing). Of course these things can be (should be?) enabled/disabled by compile time options (e.g. #ifdef DEBUG_HEAP).
The other thing I'd recommend is self tests. These are like unit tests, but built directly into the kernel itself and always present. For example, you could write code to pound the daylights out of the heap (e.g. allocate random sized pieces of memory and fill them with something until you run out of memory, then free half of them, then allocate more until you run out of memory again, etc; while calling the "check_heap()" function between each step); where this code could/should take a "how much pounding" parameter (so you could spend a small amount of time doing the self test, or a huge amount of time doing the self test). You could also write code to pound the daylights out of the virtual memory manager, and the physical memory manager (and the scheduler, and ...). Then you could decide to always do a small amount of self testing each time the kernel boots and/or provide a special kernel parameter/option to enable "extremely thorough self test mode".
Don't forget that eventually (if/when the OS is released) you'll probably have to resort to "remote debugging via. email" (e.g. where someone without any programming experience, who may not know very much English, sends you an email saying "OS not work"; and you have to try to figure out what is going wrong before the end user's "amount of hassle before giving up and not caring anymore" counter is depleted).
I need to give some details about what I am doing before asking my question. I hope my English and my explanations are clear and concise enough.
I am currently working on a massive parallelization of an initially written C code. The reason I was interested in CUDA is the large sizes of the arrays I was dealing with : the code is a simulation of fluid mechanics and I needed to launch a "time loop" with five to six successive operations on arrays as big as 3.10^9 or 19.10^9 double variables. I went through various tutorials and documentation and I finally managed to write a not-so-bad CUDA code.
Without going through the details of the code, I used relatively small 2D-blocks. The number of threads is 18 or 57 (which is awkwardly done since my wraps are not fully occupied).
The kernels call a "big" 3D-grid, which describes my physical geometry (the maximal desired size is 1000 value per dimension, that means I want to deal with a 3D grid with a 1 billion blocks).
Okay so now, my five to six kernels which are doing correctly the job are making good use of the shared memory advantages, since global memory is read ounce and written ounce for each kernel (the size of my blocks was actually determined in accordance with the adequate needed amount of shared memory).
Some of my kernels are launched concurrently, asynchronously called, but most of them need to be successive. There are several memcpy from device to host, but the ratio of memcpys over kernels calls is significantly low. I am mostly executing operations on my arrays values.
Here is my question :
If I understood correctly, all of my blocks are doing the job on the arrays at the same time. So that means dealing with a 10-blocks grid, a 100-blocks grid or a billion will take the same amount of time? The answer is obviously no, since the compuation time is significantly more important when I am dealing with large grids. Why is that?
I am using a relatively modest NVIDIA device (NVS 5200M). I was trying to get used to CUDA before getting bigger/more efficient devices.
Since I went through all the optimization and CUDA programming advices/guides by myself, I may have completely misunderstood some points. I hope my question is not too naive...
Thanks!
If I understood correctly, all of my blocks are doing the job on the arrays at the same time.
No they don't run at the same time! How many thread blocks can run concurrently depends on several things, all effected on the compute capability of your device - NVS 5200M should be cc2.1.
A CUDA enabled gpu has an internal scheduler, that manages where and when which thread block and warps of the blocks will run. Where means on which streaming multiprocessor (SM) the block will be launched.
Every SM has a limited amount of resources - shared memory and registers for example. A good overview for these limitations gives the Programming Guide or the Occupancy Calculator.
The first limitation is, that for cc2.1 a SM can run up to 8 thread blocks at the same time. Depending on your usage of registers, shared memory... the number will possible decrease.
If I remind me right a SM of cc2.1 exists of 96 cuda cores and therefore your NVS 5200M should have one SM. Let's assume with your kernel setup N (N<=8) thread blocks fit into the SM at the same time. The internal scheduler will be launched the first N blocks and queue up all other thread blocks. If one thread block has finished his work, the next one from the queue will be launched. So if you will launch in total 1 until N blocks, the used time for the kernel will be very equal. If you run the kernel with N+1 blocks, than the used time will be increased.
I run my C program on Suse Linux Enterprise that compresses several thousand large files (between 10MB and 100MB in size), and the program gets slower and slower as the program runs (it's running multi-threaded with 32 threads on a Intel Sandy Bridge board). When the program completes, and it's run again, it's still very slow.
When I watch the program running, I see that the memory is being depleted while the program runs, which you would think is just a classic memory leak problem. But, with a normal malloc()/free() mismatch, I would expect all the memory to return when the program terminates. But, most of the memory doesn't get reclaimed when the program completes. The free or top command shows Mem: 63996M total, 63724M used, 272M free when the program is slowed down to a halt, but, after the termination, the free memory only grows back to about 3660M. When the program is rerun, the free memory is quickly used up.
The top program only shows that the program, while running, is using at most 4% or so of the memory.
I thought that it might be a memory fragmentation problem, but, I built a small test program that simulates all the memory allocation activity in the program (many randomized aspects were built in - size/quantity), and it always returns all the memory upon completion. So, I don't think that's it.
Questions:
Can there be a malloc()/free() mismatch that will lose memory permanently, i.e. even after the process completes?
What other things in a C program (not C++) can cause permanent memory loss, i.e. after the program completes, and even the terminal window closes? Only a reboot brings the memory back. I've read other posts about files not being closed causing problems, but, I don't think I have that problem.
Is it valid to be looking at top and free for the memory statistics, i.e. do they accurately describe the memory situation? They do seem to correspond to the slowness of the program.
If the program only shows a 4% memory usage, will something like valgrind find this problem?
Can there be a malloc()/free() mismatch that will lose memory permanently, i.e. even after the process completes?
No, malloc and free, and even mmap are harmless in this respect, and when the process terminates the OS (SUSE Linux in this case) claims all their memory back (unless it's shared with some other process that's still running).
What other things in a C program (not C++) can cause permanent memory loss, i.e. after the program completes, and even the terminal window closes? Only a reboot brings the memory back. I've read other posts about files not being closed causing problems, but, I don't think I have that problem.
Like malloc/free and mmap, files opened by the process are automatically closed by the OS.
There are a few things which cause permanent memory leaks like big pages but you would certainly know about it if you were using them. Apart from that, no.
However, if you define memory loss as memory not marked 'free' immediately, then a couple of things can happen.
Writes to disk or mmap may be cached for a while in RAM. The OS must keep the pages around until it synchs them back to disk.
Files READ by the process may remain in memory if the OS has nothing else to use that RAM for right now - on the reasonable assumption that it might need them soon and it's quicker to read the copy that's already in RAM. Again, if the OS or another process needs some of that RAM, it can be discarded instantly.
Note that as someone who paid for all my RAM, I would rather the OS used ALL of it ALL the time, if it helps in even the smallest way. Free RAM is wasted RAM.
The main problem with having little free RAM is when it is overcommitted, which is to say there are more processes (and the OS) asking for or using RAM right now than is available on the system. It sounds like you are using about 4Gb of RAM in your processes, which might be a problem - (and remember the OS needs a good chunk too. But it sounds like you have plenty of RAM! Try running half the number of processes and see if it gets better.
Sometimes a memory leak can cause temporary overcommitment - it's a good idea to look into that. Try plotting the memory use of your program over time - if it rises continuously, then it may well be a leak.
Note that forking a process creates a copy that shares the memory the original allocated - until both are closed or one of them 'exec's. But you aren't doing that.
Is it valid to be looking at top and free for the memory statistics, i.e. do they accurately describe the memory situation? They do seem to correspond to the slowness of the program.
Yes, top and ps are perfectly reasonable ways to look at memory, in particular observe the RES field. Ignore the VIRT field for now. In addition:
To see what the whole system is doing with memory, run:
vmstat 10
While your program is running and for a while after. Look at what happens to the ---memory--- columns.
In addition, after your process has finished, run
cat /proc/meminfo
And post the results in your question.
If the program only shows a 4% memory usage, will something like valgrind find this problem?
Probably, but it can be extremely slow, which might be impractical in this case. There are plenty of other tools which can help such as electricfence and others which do not slw your program down noticeably. I've even rolled my own in the past.
malloc()/free() work on the heap. This memory is guaranteed to be released to the OS when the process terminates. It is possible to leak memory even after the allocating process terminates using certain shared memory primitives (e.g. System V IPC). However, I don't think any of this is directly relevant.
Stepping back a bit, here's output from a lightly-loaded Linux server:
$ uptime
03:30:56 up 72 days, 8:42, 2 users, load average: 0.06, 0.17, 0.27
$ free -m
total used free shared buffers cached
Mem: 24104 23452 652 0 15821 978
-/+ buffers/cache: 6651 17453
Swap: 3811 5 3806
Oh no, only 652 MB free! Right? Wrong.
Whenever Linux accesses a block device (say, a hard drive), it looks for any unused memory, and stores a copy of the data there. After all, why not? The data's already in RAM, some program clearly wanted that data, and RAM that's unused can't do anyone any good. If a program comes along and asks for more memory, the cached data is discarded to make room -- until then, might as well hang onto it.
The key to this free output is not the first line, but the second. Yes, 23.4 GB of RAM is being used -- but 17.4 GB is available for programs that want it. See Help! Linux ate my RAM! for more.
I can't say why the program is getting slower, but having the "free memory" metric steadily drop down to nothing is entirely normal and not the cause.
The operating system only makes as much memory free as it absolutely needs. Making memory free is wasted effort if the memory is later used normally -- it's more efficient to just directly transition the memory from one use to another than to make the memory free just to have to make it unfree later.
The only thing the system needs free memory for is operations that require memory that can't switch used memory from one purpose to another. This is a very small set of unusual operations such as servicing network interrupts.
If you type this command sysctl vm.min_free_kbytes, the system will tell you the number of KB it needs free. It's likely less than 100MB. So having any amount more than that free is perfectly fine.
If you want more of your memory free, remove it from the computer. Otherwise, the operating system assumes that there is zero cost to using it, and thus zero benefit to making it free.
For example, consider the data you wrote to disk. The operating system could make the memory that was holding that data free. But that's a double loss. If the data you wrote to disk is later read, it will have to read it from disk rather than just grabbing it from memory. And if that memory is later needed for some other purpose, it will just have to undo all the work it went through making it free. Yuck. So if the system doesn't absolutely need free memory, it won't make it free.
My guess would be the problem is not in your program, but in the operating system. The OS keeps a cache of recently used files in memory on the assumption that you are going to access them again. It does not know with certainty what files are going to be needed, so it can end up deciding to keep the wrong ones at the expense of the ones you wish it was keeping.
It may be keeping the output files of the first run cached when you do your second run, which prevents it from effectively using the cache on the second run. You can test this theory by deleting all files from the first run (which should free them from cache) and seeing if that makes the second run go faster.
If that doesn't work, try deleting all the input files for the first run as well.
Answers
Yes there is no requirement in C or C++ to release memory that is not freed back to the OS
Do you have memory mapped files, open file handles for deleted files etc. Linux will not delete a file until all references to is a deallocated. Also linux will cache the file in memory in case it needs to be read again - file cache memory usage can be ignored as the OS will deal with it
No
Maybe valgrind will highlight cases where memory is not
So I was asking myself what would happen if I tried to do a heap overflow on Windows XP, and I was surprise to see that, once the program "ate" all the RAM (this happens instantly, by the way), the size of the process in the task manager goes down to 5MB and doesn't move afterwards. The computer memory usage is still growing, however.
So why is Windows not able to see that my software takes GB of memory ? I feel like it can be a security problem because once a software ate all the memory, it can "hide" in the small process groups (and maybe I'm a little bit paranoid).
Note : nothing happens when the heap is full, the cpu just jumps to 100% because my for(;;) loop runs like crazy once malloc fails.
Edit : Ok! Never knew that you could tweak the task manager columns. I learnt something today :D.
Interesting experiment .. by default the Task Manager shows default working set. There are other memory fields, such as Paged and Unpaged pools and Working sets. Page faults can also tell you that the program is trying allocate memory but failing.
Try opening Task Manager and going to View > Select Columns... then toggle on more of the memory columns. It may well be that the program is using far more memory but not of the type that you are viewing in Task Manager
I think under XP there may be a Virtual Memory column which will be of interest to you