Bug in OS X 10.5 malloc? - c

I'm writing a program in C. I have two main development machines, both Macs. One is running OS X 10.5 and is a 32-bit machine; the other is running OS X 10.6 and is 64-bit. The program works fine when compiled and run on the 64-bit machine. However, when I compile the exact same program on the 32-bit machine, it runs for a while and then crashes somewhere inside malloc. Here's the backtrace:
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xeeb40fe0
0x9036d598 in small_malloc_from_free_list ()
(gdb) bt
#0 0x9036d598 in small_malloc_from_free_list ()
#1 0x90365286 in szone_malloc ()
#2 0x903650b8 in malloc_zone_malloc ()
#3 0x9036504c in malloc ()
#4 0x0000b14c in xmalloc (s=2048) at Common.h:185
...
xmalloc is my custom wrapper which just calls exit if malloc returns NULL, so it's not running out of memory.
If I link the same code with -ltcmalloc it works fine, so I strongly suspect that it's a bug somewhere inside OS X 10.5's default allocator. It may be that my program is causing some memory corruption somewhere and that tcmalloc somehow doesn't get tripped up by it. I tried to reproduce the failure by doing the same sequence of mallocs and frees in a different program but that worked fine.
So my questions are:
Has anyone seen this bug before? Or, alternatively
How can I debug something like this? E.g., is there a debug version of OS X's malloc?
BTW, these are the linked libraries:
$ otool -L ./interp
./interp:
/usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 111.1.5)
Update: Yeah, it's heap corruption due to writing past the end of an array; it's working now. I should have run valgrind before posting the question. I was nevertheless interested in techniques (other than valgrind) for protecting against this kind of corruption, so thanks for that.
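(For anyone curious, this is not my real code, just a reduced illustration of the general shape of the bug: one element written past the end of a malloc'd block, with the crash typically showing up much later inside malloc() or free() rather than at the bad write itself.)
#include <stdlib.h>

int main(void)
{
    size_t n = 2048;
    int *buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        exit(1);
    for (size_t i = 0; i <= n; i++)   /* <= writes one element past the end */
        buf[i] = 0;
    free(buf);                        /* corruption usually surfaces later, */
    return 0;                         /* inside the allocator, not here     */
}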

Have you read the manual page for malloc() on MacOS X? In part, it says:
DEBUGGING ALLOCATION ERRORS
A number of facilities are provided to aid in debugging allocation errors in applications. These
facilities are primarily controlled via environment variables. The recognized environment variables
and their meanings are documented below.
ENVIRONMENT
The following environment variables change the behavior of the allocation-related functions.
MallocLogFile <f>
Create/append messages to the given file path instead of writing to
the standard error.
MallocGuardEdges
If set, add a guard page before and after each large block.
MallocDoNotProtectPrelude
If set, do not add a guard page before large blocks, even if the
MallocGuardEdges environment variable is set.
MallocDoNotProtectPostlude
If set, do not add a guard page after large blocks, even if the
MallocGuardEdges environment variable is set.
MallocStackLogging
If set, record all stacks, so that tools like leaks can be used.
MallocStackLoggingNoCompact
If set, record all stacks in a manner that is compatible with the
malloc_history program.
MallocStackLoggingDirectory
If set, records stack logs to the directory specified instead of saving
them to the default location (/tmp).
MallocScribble
If set, fill memory that has been allocated with 0xaa bytes. This
increases the likelihood that a program making assumptions about the contents of freshly allocated memory will fail. Also if set, fill memory
that has been deallocated with 0x55 bytes. This increases the likelihood
that a program will fail due to accessing memory that is no longer allocated.
MallocCheckHeapStart <s>
If set, specifies the number of allocations <s> to wait before beginning
periodic heap checks every <n> as specified by MallocCheckHeapEach. If
MallocCheckHeapStart is set but MallocCheckHeapEach is not specified, the
default check repetition is 1000.
MallocCheckHeapEach <n>
If set, run a consistency check on the heap every <n> operations.
MallocCheckHeapEach is only meaningful if MallocCheckHeapStart is also
set.
MallocCheckHeapSleep <t>
Sets the number of seconds to sleep (waiting for a debugger to attach)
when MallocCheckHeapStart is set and a heap corruption is detected. The
default is 100 seconds. Setting this to zero means not to sleep at all.
Setting this to a negative number means to sleep (for the positive number
of seconds) only the very first time a heap corruption is detected.
MallocCheckHeapAbort <b>
When MallocCheckHeapStart is set and this is set to a non-zero value,
causes abort(3) to be called if a heap corruption is detected, instead of
any sleeping.
MallocErrorAbort
If set, causes abort(3) to be called if an error was encountered in
malloc(3) or free(3), such as calling free(3) on a pointer previously
freed.
MallocCorruptionAbort
Similar to MallocErrorAbort but will not abort in out of memory conditions, making it more useful to catch only those errors which will cause
memory corruption. MallocCorruptionAbort is always set on 64-bit processes.
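For example, here is a reduced, hypothetical use-after-free that MallocScribble makes much easier to spot: with it set, the freed block is filled with 0x55 bytes, so the stale read produces obvious garbage (or a crash) instead of quietly appearing to work.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p = malloc(32);
    if (p == NULL)
        return 1;
    strcpy(p, "hello");
    free(p);
    /* Use-after-free: run with MallocScribble=1 and the block is filled
       with 0x55 ('U') bytes on free, so this prints garbage instead of
       quietly printing "hello". */
    printf("%s\n", p);
    return 0;
}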
That said, I'd still use valgrind first.

Has anyone seen this bug before?
Yes, this is a common programming bug and it is almost certainly in your code. See http://www.efnetcpp.org/wiki/Heap_Corruption
How can I debug something like this?
See the Tools section of the above link.

Related

Profiling resident memory usage and many page faults in C++ program on linux

I am trying to figure out why my resident memory for one version of a program ("new") is much higher (5x) than another version of the same program ("baseline"). The program is running on a Linux cluster with E5-2698 v3 CPUs and written in C++. The baseline is a multiprocess program, and the new one is a multithreaded program; they are both fundamentally doing the same algorithm, computation, and operating on the same input data, etc. In both, there are as many processes or threads as cores (64), with threads pinned to CPUs. I've done a fair amount of heap profiling using both Valgrind Massif and Heaptrack, and they show that the memory allocation is the same (as it should be). The RSS of both the baseline and new version of the program is larger than the LLC.
The machine has 64 cores (hyperthreads). For both versions, I straced relevant processes and found some interesting results. Here's the strace command I used:
strace -k -p <pid> -e trace=mmap,munmap,brk
Here are some details about the two versions:
Baseline Version:
64 processes
RES is around 13 MiB per process
using hugepages (2MB)
no malloc/free-related syscalls were made from the strace call listed above (more on this below)
top output
New Version
2 processes
32 threads per process
RES is around 2 GiB per process
using hugepages (2MB)
this version does a fair amount of memcpy calls of large buffers (25MB) with default settings of memcpy (which, I think, is supposed to use non-temporal stores but I haven't verified this)
in release and profile builds, many mmap and munmap calls were generated. Curiously, none were generated in debug mode. (more on that below).
top output (same columns as baseline)
Assuming I'm reading this right, the new version has 5x higher RSS in aggregate (entire node) and significantly more page faults as measured using perf stat when compared to the baseline version. When I run perf record/report on the page-faults event, it's showing that all of the page faults are coming from a memset in the program. However, the baseline version has that memset as well and there are no pagefaults due to it (as verified using perf record -e page-faults). One idea is that there's some other memory pressure for some reason that's causing the memset to page-fault.
So, my question is: how can I understand where this large increase in resident memory is coming from? Are there performance monitor counters (i.e., perf events) that can help shed light on this? Or, is there a heaptrack- or massif-like tool that will let me see what data actually makes up the RES footprint?
One of the most interesting things I noticed while poking around is the inconsistency of the mmap and munmap calls as mentioned above. The baseline version didn't generate any of those; the profile and release builds (basically, -march=native and -O3) of the new version DID issue those syscalls but the debug build of the new version DID NOT make calls to mmap and munmap (over tens of seconds of stracing). Note that the application is basically mallocing an array, doing compute, and then freeing that array -- all in an outer loop that runs many times.
It might seem that the allocator is able to easily reuse the allocated buffer from the previous outer loop iteration in some cases but not others -- although I don't understand how these things work nor how to influence them. I believe allocators have a notion of a time window after which application memory is returned to the OS. One guess is that in the optimized code (release builds), vectorized instructions are used for the computation and it makes it much faster. That may change the timing of the program such that the memory is returned to the OS; although I don't see why this isn't happening in the baseline. Maybe the threading is influencing this?
(As a shot-in-the-dark comment, I'll also say that I tried the jemalloc allocator, both with default settings as well as changing them, and I got a 30% slowdown with the new version but no change on the baseline when using jemalloc. I was a bit surprised here, as my previous experience with jemalloc was that it tends to produce some speedup with multithreaded programs. I'm adding this comment in case it triggers some other thoughts.)
In general: GCC can optimize malloc + memset(0) into calloc, which leaves pages untouched. If you only actually touch a few pages of a large allocation, that optimization not happening could account for a big difference in page faults (a small sketch of this effect follows the alternatives below).
Or does the change between versions maybe let the system use transparent hugepages differently, in a way that happens to not be good for your workload?
Or maybe a different allocation / free pattern is making your allocator hand pages back to the OS instead of keeping them in a free list. Lazy allocation means you get a soft page fault on the first access to a page after getting it from the kernel. Use strace to look for mmap / munmap or brk system calls.
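To illustrate the calloc point (made-up sizes; this assumes the malloc+memset-to-calloc optimization did not kick in): calloc can hand back pristine zero pages that are only faulted in when touched, while an explicit memset touches, and therefore faults in, every page.
#include <stdlib.h>
#include <string.h>

enum { BIG = 512 * 1024 * 1024 };   /* 512 MiB, mostly never touched */

int main(void)
{
    /* calloc: pages stay untouched until written, so only the pages
       actually used below are ever faulted in. */
    char *a = calloc(BIG, 1);

    /* malloc + explicit memset: every page is written, so every page
       is faulted in, even though most of the buffer is never used. */
    char *b = malloc(BIG);
    if (b != NULL)
        memset(b, 0, BIG);

    if (a != NULL)
        a[0] = 1;
    if (b != NULL)
        b[0] = 1;

    free(a);
    free(b);
    return 0;
}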
In your specific case, your strace testing confirms that your change led to malloc / free handing pages back to the OS instead of keeping them on a free list.
This fully explains the extra page faults. A backtrace of munmap calls could identify the guilty free calls. To fix it, see https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html / http://man7.org/linux/man-pages/man3/mallopt.3.html, especially M_MMAP_THRESHOLD (perhaps raise it to get glibc malloc not to use mmap for your arrays?). I haven't played with the parameters before. The man page mentions something about a dynamic mmap threshold.
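A minimal sketch of what that tuning might look like (glibc-specific; the threshold value here is only a guess, sized above your 25 MB buffers):
#include <malloc.h>   /* glibc-specific: mallopt, M_MMAP_THRESHOLD */

int main(void)
{
    /* Keep allocations below 32 MiB on the heap free lists instead of
       serving them with mmap, so freeing a 25 MB buffer does not munmap
       it (and re-allocating it does not page-fault all over again).
       Call this once, early, before the big allocations happen. */
    mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);

    /* ... rest of the program ... */
    return 0;
}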
It doesn't explain the extra RSS; are you sure you aren't accidentally allocating 5x the space? If you aren't, perhaps better alignment of the allocation lets the kernel use transparent hugepages where it didn't before, maybe leading to wasting up to 1.99 MiB at the end of an array instead of just under 4k? Or maybe Linux wouldn't use a hugepage if you only allocated the first couple of 4k pages past a 2M boundary.
If you're getting the page faults in memset, I assume these arrays aren't sparse and that you are touching every element.
I believe allocators have a notion of a time window after which application memory is returned to the OS
It would be possible for an allocator to check the current time every time you call free, but that's expensive so it's unlikely. It's also very unlikely that they use a signal handler or separate thread to do a periodic check of free-list size.
I think glibc just uses a size-based heuristic that it evaluates on every free. As I said, the man page mentions something about heuristics.
IMO, actually tuning malloc (or finding a different malloc implementation that's better for your situation) should probably be a separate question.

can I change pthread_create to map new threads not in the stack?

I'm using the pthread.h library in glibc-2.27 and when my process calls pthread_create() eighteen times or more (it's supposed to be a heavy multi-threaded application) the process is aborted with the error message:
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
I did some strace as part of my debugging ritual and I found the reason. Apparently all the implicit mmap() calls made by pthread_create() look like this:
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f6de43fa000
One can notice the MAP_STACK flag which indicates:
Allocate the mapping at an address suitable for a process or thread stack.
This flag is currently a no-op, but is used in the glibc threading implementation so that if some architectures require special treatment for stack allocations, support can later be transparently implemented for glibc.
(man mmap on my system - Ubuntu 18.04 LTS)
Is it possible to configure the pthread_create call not to do this? Or maybe use brk or something else to increase the data segment automatically?
Thanks for any help!
It is extremely unlikely that your issue has anything to do with this MAP_STACK flag.
You have a bug somewhere else in your application which causes stack corruption. Try running your application under valgrind, or building with -fsanitize=address. Either approach may pinpoint the exact location of the error, and you should be able to figure out what is wrong based on that.
Is it possible to configure the pthread_create call not to do this?
pthread_create() needs to allocate space for the thread's stack, otherwise the thread cannot run -- not even with an empty thread function. That's what the mmap you're seeing is for. It is not possible to do without.
or maybe use brk or something else to increase the data segment automatically?
If you have the time and skill to write your own thread library, then do have a go and let us know what happens. Otherwise, no, the details of how pthread_create() reserves space for the new thread's stack are not configurable in any implementation I know.
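That said, if the goal is only to control where the stack memory comes from, rather than to avoid allocating one, POSIX does let you hand pthread_create() a caller-owned buffer through the attributes interface. A minimal sketch follows (it will not make the stack smashing go away):
#include <pthread.h>
#include <stdlib.h>

static void *worker(void *arg)
{
    return arg;
}

int main(void)
{
    size_t stack_size = 1024 * 1024;   /* must be at least PTHREAD_STACK_MIN */
    void *stack = NULL;
    pthread_attr_t attr;
    pthread_t tid;

    /* Page-aligned, caller-owned stack; with this set, glibc uses the
       supplied buffer instead of creating its own MAP_STACK mapping. */
    if (posix_memalign(&stack, 4096, stack_size) != 0)
        return 1;

    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr, stack, stack_size);
    if (pthread_create(&tid, &attr, worker, NULL) == 0)
        pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    free(stack);
    return 0;
}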
And that does not matter anyway, because the mmap() call is not the problem. If a syscall has an unrecoverable failure then that's a failure of the kernel, and you get a kernel panic, not an application crash. GNU C's stack-smashing detection happens in userspace. The functions to which it applies therefore do not appear in your strace output, which traces only system calls.
It might be useful for you to have a better understanding of stack smashing and GNU's defense against it. Dr. Dobb's ran a nice article on just that several years ago, and it is still well worth a read. The bottom line, though, is that stack smashing happens when a function implementation misbehaves by overwriting the part of its stack frame that contains its return address. Unless you've got some inline assembly going on, the smashing almost surely arises from one of your own functions overrunning the bounds of one of its local variables. It is detected when that function tries to return, by tooling in the function epilogue that serves that purpose.
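For reference, the classic shape of the bug looks something like this (a made-up, reduced example): the abort fires when fill_name() returns and the clobbered canary is checked, not at the bad write itself.
#include <stdio.h>
#include <string.h>

static void fill_name(const char *src)
{
    char name[16];
    size_t len = strlen(src) + 1;

    /* Hand-rolled copy with no check against sizeof name: a long src
       overruns the array and clobbers the canary the compiler placed
       between the locals and the saved return address. */
    for (size_t i = 0; i < len; i++)
        name[i] = src[i];
    puts(name);
}   /* built with -fstack-protector, the abort fires here, in the epilogue */

int main(void)
{
    fill_name("this string is much longer than sixteen bytes");
    return 0;
}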

C malloc "can't allocate region" error, but can't repro with GDB?

How can I debug a C application that does not crash when attached to gdb or run inside of gdb?
It crashes consistently when run standalone - even the same debug build!
A few of us are getting this error with a C program written for BSD/Linux, and we are compiling on macOS with OpenSSL.
app(37457,0x7000017c7000) malloc: *** mach_vm_map(size=13835058055282167808) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
ERROR: malloc(buf->length + 1) failed!
I know, not helpful.
Recompiling the application with -g -rdynamic gives the same error. Ok, so now we know it isn't because of a release build as it continues to fail.
It works when running within a gdb debugging session though!!
$ sudo gdb app
(gdb) b malloc_error_break
Function "malloc_error_break" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (malloc_error_break) pending.
(gdb) run -threads 8
Starting program: ~/code/app/app -threads 8
[New Thread 0x1903 of process 45436]
warning: unhandled dyld version (15)
And it runs for hours. CTRL-C, and run ./app -threads 8 and it crashes after a second or two (a few million iterations).
Obviously there's an issue within one of the threads. But those workers for the threads are pretty big (a few hundred lines of code). Nothing stands out.
Note that the threads iterate over loops of about 20 million per second.
macOS 10.12.3
Homebrew w/GNU gcc and openssl (linking to crypto)
Ps, not familiar with C too much - especially any type of debugging. Be kind and expressive/verbose in answers. :)
One debugging technique that is sometimes overlooked is to include debug prints in the code; of course it has its disadvantages, but it also has advantages. A thing you must keep in mind, though, in the face of abnormal termination, is to make sure the printouts actually get printed. Often it's enough to print to stderr (but if that doesn't do the trick, you may need to fflush the stream explicitly).
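For example (a hypothetical helper; the names and message format are made up):
#include <stdio.h>

/* Progress marker: flush each message so it is not lost if the program
   is killed by the allocator error shortly afterwards. */
static void trace(const char *where, size_t n)
{
    fprintf(stderr, "%s: n=%zu\n", where, n);
    fflush(stderr);
    /* e.g. trace("before malloc", (size_t)(buf->length + 1)); */
}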
Another trick is to stop the program before the error occurs. This requires you to know when the program is about to crash, preferably as close to that point as possible. You do this by using raise (declared in <signal.h>):
raise(SIGSTOP);
This does not terminate the program, it just suspends execution. Now you can attach with gdb using the command gdb <program-name> <pid> (use ps to find the pid of the process). Now in gdb you have to tell it to ignore SIGSTOP:
> handle SIGSTOP ignore
Then you can set break-points. You can also step out of the raise function using the finish command (may have to be issued multiple times to return to your code).
This technique makes the program behave normally up to the moment you decide to stop it, so hopefully the final part, run under gdb, will not alter the behaviour enough to hide the bug.
A third option is to use valgrind. Normally when you see this kind of error there are bugs involved that valgrind will pick up, such as out-of-range accesses and uninitialized variables.
Many memory managers initialise memory to a known bad value to expose problems like this (e.g. Microsoft's CRT uses a range of values: 0xCD means uninitialised, 0xDD means already freed, etc.).
After each use of malloc, try memset'ing the memory to 0xCD (or some other constant value). This will allow you to identify uninitialised memory more easily with the debugger. Don't use 0x00, as this is a 'normal' value and will be harder to spot if it's wrong (it will also probably 'fix' your problem).
Something like:
void *memory = malloc(sizeof(my_object));
memset(memory, 0xCD, sizeof(my_object));
If you know the size of the blocks, you could do something similar before free (this is sometimes harder unless you know the size of your objects, or track it in some way):
memset(memory, 0xDD, sizeof(my_object));
free(memory);

Yet another Memory Leak Issue (memory is still gone when program terminates) - C program on SLES

I run my C program on SUSE Linux Enterprise; it compresses several thousand large files (between 10MB and 100MB in size), and the program gets slower and slower as it runs (it's running multi-threaded with 32 threads on an Intel Sandy Bridge board). When the program completes and is run again, it's still very slow.
When I watch the program running, I see that the memory is being depleted while the program runs, which you would think is just a classic memory leak problem. But, with a normal malloc()/free() mismatch, I would expect all the memory to return when the program terminates. But, most of the memory doesn't get reclaimed when the program completes. The free or top command shows Mem: 63996M total, 63724M used, 272M free when the program is slowed down to a halt, but, after the termination, the free memory only grows back to about 3660M. When the program is rerun, the free memory is quickly used up.
The top program only shows that the program, while running, is using at most 4% or so of the memory.
I thought that it might be a memory fragmentation problem, but, I built a small test program that simulates all the memory allocation activity in the program (many randomized aspects were built in - size/quantity), and it always returns all the memory upon completion. So, I don't think that's it.
Questions:
Can there be a malloc()/free() mismatch that will lose memory permanently, i.e. even after the process completes?
What other things in a C program (not C++) can cause permanent memory loss, i.e. after the program completes, and even the terminal window closes? Only a reboot brings the memory back. I've read other posts about files not being closed causing problems, but, I don't think I have that problem.
Is it valid to be looking at top and free for the memory statistics, i.e. do they accurately describe the memory situation? They do seem to correspond to the slowness of the program.
If the program only shows a 4% memory usage, will something like valgrind find this problem?
Can there be a malloc()/free() mismatch that will lose memory permanently, i.e. even after the process completes?
No. malloc and free, and even mmap, are harmless in this respect: when the process terminates, the OS (SUSE Linux in this case) reclaims all of the process's memory (unless it's shared with some other process that's still running).
What other things in a C program (not C++) can cause permanent memory loss, i.e. after the program completes, and even the terminal window closes? Only a reboot brings the memory back. I've read other posts about files not being closed causing problems, but, I don't think I have that problem.
Like malloc/free and mmap, files opened by the process are automatically closed by the OS.
There are a few things which cause permanent memory leaks, such as huge pages, but you would certainly know about it if you were using them. Apart from that, no.
However, if you define memory loss as memory not marked 'free' immediately, then a couple of things can happen.
Writes to disk or to an mmap'd file may be cached for a while in RAM. The OS must keep the pages around until it syncs them back to disk.
Files READ by the process may remain in memory if the OS has nothing else to use that RAM for right now - on the reasonable assumption that it might need them soon and it's quicker to read the copy that's already in RAM. Again, if the OS or another process needs some of that RAM, it can be discarded instantly.
Note that as someone who paid for all my RAM, I would rather the OS used ALL of it ALL the time, if it helps in even the smallest way. Free RAM is wasted RAM.
The main problem with having little free RAM is when it is overcommitted, which is to say there are more processes (and the OS) asking for or using RAM right now than is available on the system. It sounds like you are using about 4 GB of RAM in your processes, which might be a problem (and remember the OS needs a good chunk too). But it sounds like you have plenty of RAM! Try running half the number of processes and see if it gets better.
Sometimes a memory leak can cause temporary overcommitment - it's a good idea to look into that. Try plotting the memory use of your program over time - if it rises continuously, then it may well be a leak.
Note that forking a process creates a copy that shares the memory the original allocated, until both have exited or one of them exec's. But you aren't doing that.
Is it valid to be looking at top and free for the memory statistics, i.e. do they accurately describe the memory situation? They do seem to correspond to the slowness of the program.
Yes, top and ps are perfectly reasonable ways to look at memory, in particular observe the RES field. Ignore the VIRT field for now. In addition:
To see what the whole system is doing with memory, run:
vmstat 10
while your program is running and for a while after, and look at what happens to the ---memory--- columns.
In addition, after your process has finished, run
cat /proc/meminfo
And post the results in your question.
If the program only shows a 4% memory usage, will something like valgrind find this problem?
Probably, but it can be extremely slow, which might be impractical in this case. There are plenty of other tools which can help, such as electricfence and others, which do not slow your program down noticeably. I've even rolled my own in the past.
malloc()/free() work on the heap. This memory is guaranteed to be released to the OS when the process terminates. It is possible to leak memory even after the allocating process terminates using certain shared memory primitives (e.g. System V IPC). However, I don't think any of this is directly relevant.
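For completeness, here is a sketch of that shared-memory case (illustrative only; nothing suggests your program does this): a System V segment created with shmget() outlives the process unless it is explicitly marked for removal.
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* 64 MiB kernel-owned segment: it persists after the process exits. */
    int shmid = shmget(IPC_PRIVATE, 64 * 1024 * 1024, IPC_CREAT | 0600);
    if (shmid == -1) {
        perror("shmget");
        return 1;
    }

    char *p = shmat(shmid, NULL, 0);
    if (p == (void *) -1) {
        perror("shmat");
        return 1;
    }
    memset(p, 0, 64 * 1024 * 1024);
    shmdt(p);

    /* Without the line below, the segment (and its pages) stays allocated
       after exit, visible in `ipcs -m`, until it is removed or the machine
       reboots. */
    /* shmctl(shmid, IPC_RMID, NULL); */
    return 0;
}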
Stepping back a bit, here's output from a lightly-loaded Linux server:
$ uptime
03:30:56 up 72 days, 8:42, 2 users, load average: 0.06, 0.17, 0.27
$ free -m
             total       used       free     shared    buffers     cached
Mem:         24104      23452        652          0      15821        978
-/+ buffers/cache:       6651      17453
Swap:         3811          5       3806
Oh no, only 652 MB free! Right? Wrong.
Whenever Linux accesses a block device (say, a hard drive), it looks for any unused memory, and stores a copy of the data there. After all, why not? The data's already in RAM, some program clearly wanted that data, and RAM that's unused can't do anyone any good. If a program comes along and asks for more memory, the cached data is discarded to make room -- until then, might as well hang onto it.
The key to this free output is not the first line, but the second. Yes, 23.4 GB of RAM is being used -- but 17.4 GB is available for programs that want it. See Help! Linux ate my RAM! for more.
I can't say why the program is getting slower, but having the "free memory" metric steadily drop down to nothing is entirely normal and not the cause.
The operating system only makes as much memory free as it absolutely needs. Making memory free is wasted effort if the memory is later used normally -- it's more efficient to just directly transition the memory from one use to another than to make the memory free just to have to make it unfree later.
The only thing the system needs free memory for is operations that require memory that can't switch used memory from one purpose to another. This is a very small set of unusual operations such as servicing network interrupts.
If you type this command sysctl vm.min_free_kbytes, the system will tell you the number of KB it needs free. It's likely less than 100MB. So having any amount more than that free is perfectly fine.
If you want more of your memory free, remove it from the computer. Otherwise, the operating system assumes that there is zero cost to using it, and thus zero benefit to making it free.
For example, consider the data you wrote to disk. The operating system could make the memory that was holding that data free. But that's a double loss. If the data you wrote to disk is later read, it will have to be read back from disk rather than just grabbed from memory. And if that memory is later needed for some other purpose, the system will just have to undo all the work it went through making it free. Yuck. So if the system doesn't absolutely need free memory, it won't make it free.
My guess would be the problem is not in your program, but in the operating system. The OS keeps a cache of recently used files in memory on the assumption that you are going to access them again. It does not know with certainty what files are going to be needed, so it can end up deciding to keep the wrong ones at the expense of the ones you wish it was keeping.
It may be keeping the output files of the first run cached when you do your second run, which prevents it from effectively using the cache on the second run. You can test this theory by deleting all files from the first run (which should free them from cache) and seeing if that makes the second run go faster.
If that doesn't work, try deleting all the input files for the first run as well.
Answers:
Yes, there is no requirement in C or C++ to release memory that is not freed back to the OS.
Do you have memory-mapped files, open file handles for deleted files, etc.? Linux will not delete a file until all references to it are released. Also, Linux will cache the file in memory in case it needs to be read again; file-cache memory usage can be ignored, as the OS will deal with it.
No.
Maybe; valgrind will highlight cases where memory is not freed.

malloc()/free() behavior differs between Debian and Redhat

I have a Linux app (written in C) that allocates a large amount of memory (~60M) in small chunks through malloc() and then frees it (the app continues to run after that). This memory is not returned to the OS but stays allocated to the process.
Now, the interesting thing here is that this behavior happens only on RedHat Linux and clones (Fedora, Centos, etc.) while on Debian systems the memory is returned back to the OS after all freeing is done.
Any ideas why there could be the difference between the two or which setting may control it, etc.?
I'm not certain why the two systems would behave differently (probably different implementations of malloc from different glibc's). However, you should be able to exert some control over the global policy for your process with a call like:
mallopt(M_TRIM_THRESHOLD, bytes)
(See this linuxjournal article for details).
You may also be able to request an immediate release with a call like
malloc_trim(0)
(see malloc.h; the argument is the amount of slack, in bytes, to leave unreleased at the top of the heap). I believe that both of these calls can fail, so I don't think you can rely on them working 100% of the time. But my guess is that if you try them out you will find that they make a difference.
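A small sketch of both knobs together (glibc-specific; the values here are only examples):
#include <malloc.h>   /* glibc-specific: mallopt, malloc_trim */

int main(void)
{
    /* Trim the heap whenever more than 64 KiB is free at its top
       (the glibc default is 128 KiB); the value is just an example. */
    mallopt(M_TRIM_THRESHOLD, 64 * 1024);

    /* ... allocate and free the ~60M of small chunks here ... */

    /* One-shot request to give free heap pages back to the OS now;
       the argument is the slack to leave behind, not an amount to free. */
    malloc_trim(0);
    return 0;
}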
Some memory handlers don't present the memory as free until it is needed; instead they leave the CPU to do other things and finalize the cleanup later. If you wish to confirm that this is the case, just do a simple test: in a loop, allocate and free more memory in total than you have available.
