This is a really ugly question.
I have a C++ program which does the following in a loop:
Waits for a JMS message
Calculates some data
Sends a JMS message in response
My program (let's call it "Bob") has a rather severe memory leak. The memory leak is located in a shared library that someone else wrote, which I must use, but the source code to which I do not have access.
This memory leak causes Bob to crash during the "calculates some data" phase of the loop. This is a problem, because another program is awaiting Bob's response, and will be very upset if it does not receive one.
Due to various restrictions (yes, this is an X/Y problem, I told you it was ugly), I have determined that my only viable strategy is to modify Bob so that it does the following in its loop:
Waits for a JMS message
Calculates some data
Sends a JMS message in response
Checks to see whether it's in danger of using "too much" memory
If so, forks and execs another copy of itself, and gracefully exits
My question is as follows:
What is the best (reliable but not too inefficient) way to detect whether we're using "too much" memory? My current thought is to compare getrlimit(RLIMIT_AS) rlim_cur to getrusage(RUSAGE_SELF) ru_maxrss; is that correct? If not, what's a better way? Bob runs in a Linux VM on various host machines, all with different amounts of memory.
Assuming the memory leak occurs in the "Calculates some data" phase, I think it might make more sense to just refactor that portion into a separate program and fork out to execute that in its own process. That way you can at least isolate the offending code and make it easier to replace it in the future, rather than just masking the problem by having the program restart itself when it runs low on memory.
The "Calculates some data" part can either be a long-running process that waits for requests from the main program and restarts itself when necessary, or (even simpler) it could be a one-and-done program that just takes its data in *argv and sends its results to stdout. Then your main loop can just fork out and exec it every time through, and read the results when they come back. I would go with the simpler option if possible, but that will of course depend on what your needs are.
Whether you make the program restart itself or fork the "Calculates some data" section off into a separate process, either way you'll need to check memory consumption. Since you're on Linux, an easy way to check is to take the PID of the process of interest and read the contents of /proc/$PID/statm. The second number is the size of the resident set, in pages.
Reading these proc files is how tools like top and htop get their data about processes. Periodically reading a ~30-byte in-memory file to check for the memory leak doesn't sound too inefficient.
If the leak is regular and you want to make it a bit more sophisticated, you could even keep track of the rate of growth and adjust your rate of checks accordingly.
I have two processes:
Process A is mapping large file (~170 GB - content constantly changes) into memory for writing with the flags MAP_NONBLOCK and MAP_SHARED:
MyDataType *myDataType = (MyDataType *)mmap(NULL, sizeof(MyDataType), PROT_WRITE, MAP_NONBLOCK | MAP_SHARED, fileDescriptor, 0);
and every second I call msync:
msync((void *)myDataType, sizeof(MyDataType), MS_ASYNC);
This section works fine.
The problem occurs when process B tries to read from the same file that process A has mapped: process A does not respond for ~20 seconds.
Process B is trying to read from the file something like 1000 times, using fread() and fseek(), small blocks (~4 bytes every time).
Most of the content the process reads is close together in the file.
What is the cause for this problem? Is it related to pages allocation? How can I solve it?
BTW, the same problem occurs when I use mmap() in process B instead of simple fread().
msync() is likely the problem. It forces the system to write to disk, blocking the kernel in a write frenzy.
In general on Linux (it's the same on Solaris BTW), it is a bad idea to use msync() too often. There is no need to call msync() for the synchronization of data between the memory map and the read()/write() I/O operations, this is a misconception that comes from obsolete HOWTOs. In reality, mmap() makes only the file system cache "visible" for a process. This means that the memory blocks the process changes are still under kernel control. Even if your process crashed, the changes would land on the disk eventually. Other processes would also still be serviced by the same buffer.
Here is another answer on the subject: mmap, msync and linux process termination
The interesting part is the link to a discussion on realworldtech where Linus Torvalds himself explains how buffer cache and memory mapping work.
PS: the fseek()/fread() pair is also probably better replaced by pread(). One system call is always better than two. Also, fread() always reads a full 4K block and copies it into a buffer, so if you do several small reads without an intervening fseek(), it will read from its local buffer and may miss updates made by process A.
This sounds like you are suffering from I/O starvation, which has nothing to do with the method (mmap or fread) you choose. You will have to improve your (pre-)caching strategy and/or try another I/O scheduler (cfq is the default; maybe deadline delivers better overall results for you).
You can change the scheduler by writing to /sys:
echo deadline > /sys/block/<device>/queue/scheduler
Maybe you should try profiling, or even strace, to figure out for sure where the process is spending its time. 20 s seems like an awfully long time to be explained by I/O in msync().
When you say A doesn't respond, what exactly do you mean?
Say a process is forked from another process; in other words, we replicate a process through the fork function call. Since forking is a copy-on-write mechanism, whenever the forked process or the original process writes to a page, it gets a new physical page to write to. From what I've understood, things go like this when both the forked and original processes are executing:
--> When forking, all pages of the original and forked process are made read-only, so that the kernel gets to know which page is written. When that happens, the kernel maps a new physical page for the writing process, copies the previous content into it, and then grants write access to that page. What I am not clear about is this: if both the forked and original processes write to the same page, will one of them still hold the original physical page (the one from before forking), or will both get new physical pages? Secondly, is my assumption correct that all pages in the forked and original processes are made read-only at the time of forking?
--> Since each page fault triggers an interrupt, each first write in the original or forked process slows down execution. If we know the application, and we know that a lot of contiguous memory pages will be written, wouldn't it be better to grant write permission to multiple pages (a group of pages, say) when one page in the group is written to? That would reduce the number of interrupts due to page-fault handling, wouldn't it? Sure, we may sometimes make a copy unnecessarily in this case, but I think an interrupt has much more overhead than copying 512 variables of type long (the 4096 bytes of a page). Is my understanding correct, or am I missing something?
If I'm not mistaken, one of the processes will be seen as writing to the page first. Even if you have multiple cores, I believe the page fault will be handled serially. In that case, the first one to be caught will decouple the pages of the two processes, so by the time the second writes to it, there won't be a fault, because it'll now have a writable page of its own.
I believe when that's done, the existing page is kept by one process (and set back to read/write), and one new copy is made for the other process.
I think your third point revolves around one simple point: "Say if we know about the application...". That's the problem right there: the OS does not know about the application. Essentially the only thing it "knows" will be indirect, through observation by the kernel coders. They will undoubtedly observe that fork is normally followed by exec, so that's the case for which they will undoubtedly optimize. Yes, that's not always the case, and you're obviously concerned about the other cases -- all I'm saying here is that they're sufficiently unusual that I'd guess little effort is expended on them.
I'm not quite sure I follow the logic or math about 512 longs in a 4096-byte page -- the first time a page is written, it gets duplicated and decoupled between the processes. From that point onward, further writes to either process's copy of that page will not cause any further page faults (at least related to the copy-on-write -- of course, if a process sits idle a long time, that data might be paged out to the page file, or something on that order, but that's irrelevant here).
Fork semantically makes a copy of a process. Copy-on-write is an optimization which makes it much faster. Optimizations often have some hidden trade-off. Some cases are made faster, but others suffer. There is a cost to copy-on-write, but we hope that there will usually be a saving, because most of the copied pages will not in fact be written to by the child. In the ideal case, the child performs an immediate exec.
So we suffer the page fault exceptions for a small number of pages, which is cheaper than copying all the pages upfront.
Most "lazy evaluation" type optimizations are of this nature.
A lazy list of a million items is more expensive to fully instantiate than a regular list of a million items. But if the consumer of the list only accesses the first 100 items, the lazy list wins.
Well, the initial cost would be very high if fork() didn't use COW. If you look at a typical top display, the RSS/VSIZE ratio is very small (e.g. 2 MB / 56 MB for a typical vi session).
Cloning a process without COW would cause a tremendous amount of memory pressure, which would actually cause other processes to lose their attached pages (these would have to be moved to secondary storage and maybe later restored). That paging would cost 1-2 disk I/Os per page (the swap-out is only needed if the page is new or dirty; the swap-in is only needed if the page is ever referenced again by the other process).
Another point is granularity: back in the days when MMUs did not exist, whole processes had to be swapped out to yield their memory, causing the system to actually freeze for a second or so. Page-faulting on a per-page basis causes more traps, but these are spread out nicely, allowing processes to actually compete for physical RAM.
Without prior knowledge, it's hard to beat an LRU scheme.
I am building an application which takes an executable as its input, executes it, and keeps track of dynamic memory allocations (among other things) to help track down memory errors.
After reading the name of the executable, I create a child process, link the executable with my module (which includes my version of the malloc family of functions), and execute the executable provided by the user. The parent process will consist of a GUI (using the Qt framework) where I want to display warnings/errors/the number of allocations.
I need to communicate the number of mallocs/frees and a series of warning messages to the parent process in real-time. After the users application has finished executing I wish to display the number of memory leaks. ( I have taken care of all the backend coding needed for this in the shared library I link against).
Real-Time:
I thought of 2 different approaches to communicate this information.
The child process writes to 2 pipes (one for recording whether an allocation/free happened, and another for writing a single integer denoting a warning message).
The child process simply sends a signal to denote that an allocation has happened, plus a signal for each of the warning messages; I would map these to the actual warnings (strings) in the parent process.
Is the signal version as efficient as using a pipe? Is it feasible? Is there any better choice? I do care about efficiency :)
After user's application finishes executing:
I need to send the whole data structure I use to keep track of memory leaks here. This could possibly be very large, so I am not sure which IPC method would be the most efficient.
Thanks for your time
I would suggest a unix-domain socket, it's a little more flexible than a pipe, can be configured for datagram mode which save you having to find message boundaries, and makes it easy to move to a network interface later.
Signals are definitely not the way to do this. In general, signals are best avoided whenever possible.
A pipe solution is fine. You could also use shared memory, but that would be more vulnerable to accidental corruption by the target application.
I suggest a combination of shared memory and a socket. Have a shared memory area, say 1MB, and log all your information in some standard format in that buffer. If/when the buffer fills or the process terminates you send a message, via the socket, to the reader. After the reader ACKs you can clear the buffer and carry on.
To answer caf's concern about target application corruption, just use the mprotect system call to remove permissions (set PROT_NONE) from the shared memory area before giving control to your target process. Naturally this means you'll have to set PROT_READ|PROT_WRITE before updating your log on each allocation, not sure if this is a performance win with the mprotect calls thrown in.
EDIT: in case it isn't blindingly obvious, you can have multiple buffers (or one divided into N parts) so you can pass control back to the target process immediately and not wait for the reader to ACK. Also, given enough computation resources the reader can run as often as it wants reading the currently active buffer and performing real-time updates to the user or whatever it's reading for.
I work on Linux on an ARM processor for a cable modem. There is a tool that I have written that sends/storms customized UDP packets using raw sockets. I form the packet from scratch so that we have the flexibility to play with different options. This tool is mainly for stress testing routers.
I actually have multiple interfaces created. Each interface will obtain IP addresses using DHCP. This is done in order to make the modem behave as virtual customer premises equipment (vcpe).
When the system comes up, I start those processes that are asked for. Every process that I start continuously sends packets, so process 0 sends packets using interface 0, and so on. Each of these sender processes allows configuration (changing UDP parameters and other options at run time); that's the reason I decided to have separate processes.
I start these processes using fork and exec from the provisioning processes of the modem.
The problem now is that each process takes up a lot of memory. Starting just 3 such processes causes the system to crash and reboot.
I have tried the following:
I have always assumed that pushing more code into shared libraries would help. But when I tried moving many functions into a shared library and keeping minimal code in the processes, it made no difference, to my surprise. I also removed all arrays and made them use the heap, but that made no difference either. Maybe that's because the processes run continuously, so it makes no difference whether the memory is stack or heap? I suspect the process from which I call fork is huge, and that is why the processes I create end up huge: say process A is huge, and I start process B by fork and exec; B inherits A's memory area. So having A start C, which in turn starts B, will also not help, as C still inherits from A. I used vfork as an alternative, which did not help either; I wonder why.
I would appreciate if someone give me tips to help me reduce the memory used by each independent child processes.
Given this is a test tool, then the most efficient thing to do is to add more memory to the testing machine.
Failing that:
How are you measuring memory usage? Some methods don't get accurate results.
Check you don't have any memory leaks. e.g. with Valgrind on Linux x86.
You could try running the different testers in a single process, as different threads, or even multiplexed in a single thread - since the network should be the limiting factor?
exec() will shrink the process's memory size, as the new executable gets a fresh memory map.
If you can't add physical memory, then maybe you can add swap, maybe just for testing?
Not technically answering your question, but providing a couple of alternative solutions:
If you are using Linux, have you considered using pktgen? It is a flexible tool for sending UDP packets from the kernel as fast as the interface allows. This is much faster than a userspace tool.
Oh, and a shameless plug: I have made a multi-threaded network testing tool, which could be used to spam the network with UDP packets. It can operate in multi-process mode (using fork) or multi-thread mode (using pthreads). The pthreads might use less RAM, so that might be better for you. If anything it might be worth looking at the source, as I've spent many years improving this code, and it's been able to generate enough packets to saturate a 10 Gbps interface.
What could be happening is that process A occupies a significant amount of RAM + swap (if any) when it calls fork(). At that point the kernel must reserve enough RAM and swap for the child process to have its own copy (copy-on-write, actually) of the parent process's writable private memory, namely its stack and heap. When you call exec() from the child process, that memory is no longer needed and the child can have its own, smaller private working set.
So, the first thing to make sure of is that you don't have more than one process at a time in the state between fork() and exec(). In this state the child process must have a duplicate of its parent process's virtual memory space.
Second, try using the overcommit settings which will allow the kernel to reserve more memory than actually exists. These are /proc/sys/vm/overcommit*. You can get away with using overcommit because your child processes only need the extra VM space until they call exec, and shouldn't actually touch the duplicated address space of the parent process.
Third, in your parent process you can allocate the largest blocks using shared memory, rather than the stack or heap, which are private. Thus, when you fork, those shared memory regions will be shared with the child process rather than duplicated copy-on-write.
I'm running a sort of "sandbox" in C on Ubuntu: it takes a program, and runs it safely under the user nobody (and intercepts signals, etc). Also, it assigns memory and time limits, and measures time and memory usage.
(In case you're curious, it's for a sort of "online judge" to mark programs on test data)
Currently I've adapted the safeexec module from mooshak. Though most things work properly, the memory usage seems to be a problem. (It's highly inaccurate)
Now I've tried the advice here and parsed VM from /proc/pid/stat, and now the accuracy problem is fixed. However, for programs that finish really quickly it doesn't work and just gives back 0.
The safeexec program seems to work like this:
It fork()s
Uses execv() in the child process to run the desired program
Monitors the program from the parent process until the child process terminates (using wait4, which happens to return CPU usage - but not memory?)
So it parses /proc/../stat of the child process (which has been replaced by the execv)
So why is VM in /proc/child_pid/stat sometimes equal to 0?
Is it because the execv() finishes too quickly, and /proc/child_pid/stat just isn't available?
If so, is there some sort of other way to get the memory usage of the child?
(Since this is meant to judge programs under a time limit, I can't afford something with a performance penalty like valgrind)
Thanks in advance.
Can you arrange for the child process to use your own version of malloc() et al., and have that log the HWM (high-water mark) memory usage (perhaps using a handler registered with atexit())? Perhaps you'd use LD_PRELOAD to load your memory management library. This won't help with huge static arrays or huge automatic arrays.
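A rough sketch of such an interposer (compile with something like gcc -shared -fPIC -o libtrace.so trace.c and run the target with LD_PRELOAD=./libtrace.so; the counter names are made up, and a real tool would also wrap calloc/realloc/free):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void *(*real_malloc)(size_t);
unsigned long n_allocs;
unsigned long total_bytes;

/* Print the totals when the target exits. */
__attribute__((destructor))
static void report(void)
{
    fprintf(stderr, "allocs=%lu bytes=%lu\n", n_allocs, total_bytes);
}

void *malloc(size_t size)
{
    /* dlsym() itself may allocate, so serve those early requests from
     * a tiny static arena until the real malloc is resolved. */
    static char boot[1024];
    static size_t boot_used;
    static int resolving;

    if (!real_malloc) {
        if (resolving) {
            void *p = boot + boot_used;
            boot_used += (size + 15) & ~(size_t)15;
            return p;
        }
        resolving = 1;
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        resolving = 0;
    }
    n_allocs++;
    total_bytes += size;
    return real_malloc(size);
}
```

The counters could just as well be written to a pipe or socket back to the judge instead of stderr.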
Hmm, sounds interesting. Any way to track the static/automatic arrays, though?
Static memory can be analyzed with the 'size' command - more or less.
Automatic arrays are a problem - I'm not sure how you could handle those. Your memory allocation code could look at how much stack is in use when it is called (look at the address of a local variable). But there's no guarantee that the memory will be allocated when the maximum amount of local array is in use, so it gives at best a crude measure.
One other thought: perhaps you could use debugger technology - the ptrace() system call - to control the child process, and in particular, to hold it up for long enough to be able to collect the memory usage statistics from /proc/....
You could set the hard resource limit (setrlimit for RLIMIT_AS resource) before execve(). The program will not be able to allocate more than that amount of memory. If it tries to do so, memory allocation calls (brk, mmap, mremap) will fail. If the program does not handle the out-of-memory condition, it will segfault, which will be reflected in the exit status returned by wait4.
You can use the getrusage(2) function from the <sys/resource.h> header.
Link: https://linux.die.net/man/2/getrusage
This function fills an rusage structure whose ru_maxrss field holds the maximum resident set size of the largest child, across all the children the current process has waited for.
It can also be called from the main process after all the child processes have terminated.
To get information try something like this:
struct rusage usage;
int a = getrusage(RUSAGE_CHILDREN, &usage);
But there is a little trick. If you want to have information about every child process's memory usage (not only the biggest one), you must fork() your program twice: the first fork gives you an independent process, and the second one becomes the process you'd like to test.