I'm running a sort of "sandbox" in C on Ubuntu: it takes a program, and runs it safely under the user nobody (and intercepts signals, etc). Also, it assigns memory and time limits, and measures time and memory usage.
(In case you're curious, it's for a sort of "online judge" to mark programs on test data)
Currently I've adapted the safeexec module from mooshak. Though most things work properly, the memory usage seems to be a problem. (It's highly inaccurate)
Now I've tried the advice here and parsed VM from /proc/pid/stat, and now the accuracy problem is fixed. However, for programs that finish really quickly it doesn't work and just gives back 0.
The safeexec program seems to work like this:
It fork()s
Uses execv() in the child process to run the desired program
Monitors the program from the parent process until the child process terminates (using wait4, which happens to return CPU usage - but not memory?)
So it parses /proc/../stat of the child process (which has been replaced by the execv)
So why is VM in /proc/child_pid/stat sometimes equal to 0?
Is it because the execv() finishes too quickly, and /proc/child_pid/stat just isn't available?
If so, is there some sort of other way to get the memory usage of the child?
(Since this is meant to judge programs under a time limit, I can't afford something with a performance penalty like valgrind)
Thanks in advance.
Can you arrange for the child process to use your own version of malloc() et al and have that log the HWM memory usage (perhaps using a handler registered with atexit())? Perhaps you'd use LD_PRELOAD to load your memory management library. This won't help with huge static arrays or huge automatic arrays.
Hmm, sounds interesting. Any way to track the static/automatic arrays, though?
Static memory can be analyzed with the 'size' command - more or less.
Automatic arrays are a problem - I'm not sure how you could handle those. Your memory allocation code could look at how much stack is in use when it is called (look at the address of a local variable). But there's no guarantee that the memory will be allocated when the maximum amount of local array is in use, so it gives at best a crude measure.
One other thought: perhaps you could use debugger technology - the ptrace() system call - to control the child process, and in particular, to hold it up for long enough to be able to collect the memory usage statistics from /proc/....
You could set the hard resource limit (setrlimit for RLIMIT_AS resource) before execve(). The program will not be able to allocate more than that amount of memory. If it tries to do so, memory allocation calls (brk, mmap, mremap) will fail. If the program does not handle the out-of-memory condition, it will segfault, which will be reflected in the exit status returned by wait4.
You can use getrusage(2) function from sys/resources.h library.
Link: https://linux.die.net/man/2/getrusage
This functions uses "rusage" structure that contains ru_maxrss field which stores information about the largest child memory usage from all the children current process had.
This function can be also executed from main process after all the child processes were terminated.
To get information try something like this:
struct rusage usage;
int a = getrusage(RUSAGE_CHILDREN, &usage);
But there is a little trick.
If You want to have information about every child processes memory usage (not only the biggest one) You must fork() your program twice (first fork allows You to have independent process and the second one will be the process You'd like to test.
Related
This is a really ugly question.
I have a C++ program which does the following in a loop:
Waits for a JMS message
Calculates some data
Sends a JMS message in response
My program (let's call it "Bob") has a rather severe memory leak. The memory leak is located in a shared library that someone else wrote, which I must use, but the source code to which I do not have access.
This memory leak causes Bob to crash during the "calculates some data" phase of the loop. This is a problem, because another program is awaiting Bob's response, and will be very upset if it does not receive one.
Due to various restrictions (yes, this is an X/Y problem, I told you it was ugly), I have determined that my only viable strategy is to modify Bob so that it does the following in its loop:
Waits for a JMS message
Calculates some data
Sends a JMS message in response
Checks to see whether it's in danger of using "too much" memory
If so, forks and execs another copy of itself, and gracefully exits
My question is as follows:
What is the best (reliable but not too inefficient) way to detect whether we're using "too much" memory? My current thought is to compare getrlimit(RLIMIT_AS) rlim_cur to getrusage(RUSAGE_SELF) ru_maxrss; is that correct? If not, what's a better way? Bob runs in a Linux VM on various host machines, all with different amounts of memory.
Assuming the memory leak occurs in the "Calculates some data" phase, I think it might make more sense to just refactor that portion into a separate program and fork out to execute that in its own process. That way you can at least isolate the offending code and make it easier to replace it in the future, rather than just masking the problem by having the program restart itself when it runs low on memory.
The "Calculates some data" part can either be a long-running process that waits for requests from the main program and restarts itself when necessary, or (even simpler) it could be a one-and-done program that just takes its data in *argv and sends its results to stdout. Then your main loop can just fork out and exec it every time through, and read the results when they come back. I would go with the simpler option if possible, but that will of course depend on what your needs are.
If you make the program itself restart or fork the "Calculates some data" section to a separate process, in any case you'll need to check for the memory consumption. Since you're on Linux, an easy way to check this is to get the pid number of the process of interest, and read the contents of the file /proc/$PID/statm. The second number is the size of the resident set.
Reading these proc files are the way tools like top and htop get the data about processes. Reading a ~30-byte in-memory file periodically to check the memory leak doesn't sound too inefficient.
If the leak is regular and you want to make it a bit more sophisticated, you could even keep track of the rate of growth and adjust your rate of checks accordingly.
I have this a big server software that can hog 4-8GB of memory.
This makes fork-exec cumbersome, as the fork itself can take significant time, plus the default behavior seems to be that fork will fail unless there is enough memory for a copy of the entire resident memory.
Since this is starting to show as the hottest spot (60% of time spent in fork) when profiling I need to address it.
What would be the easiest way to avoid fork-exec routine?
You basically cannot avoid fork(2) (or the equivalent clone(2) syscall..., or the obsolete vfork which I don't recommend using) + execve(2) to start an external command (à la system(3), or à la posix_spawn) on Linux and (probably) MacOSX and most other Unix or POSIX systems.
What makes you think that it is becoming an issue? And 8GB process virtual address space is not a big deal today (at least on machines with 8Gbytes, or 16Gbytes RAM, like my desktop has). You don't practically need twice as much RAM (but you do need swap space) thanks to the lazy copy-on-write techniques used by all recent Unixes & Linux.
Perhaps you might believe that swap space could be an issue. On Linux, you could add swap space, perhaps by swapping on a file; just run as root:
dd if=/dev/zero of=/var/tmp/myswap bs=1M count=32768
mkswap /var/tmp/myswap
swapon /var/tmp/myswap
(of course, be sure that /var/tmp/ is not a tmpfs mounted filesystem, but sits on some disk, perhaps an SSD one....)
When you don't need any more a lot of swap space, run swapoff /var/tmp/myswap....
You could also start some external shell process near the beginning of your program (à la popen) and later you might send shell commands to it. Look at my execicar.c program for inspiration, or use it if it fits (I wrote it 10 years ago for similar purposes, but I forgot the details)
Alternatively fork at the beginning of your program some interpreter (Lua, Guile...) and send some commands to it.
Running more than a few dozens commands per second (starting any external program) is not reasonable, and should be considered as a design mistake, IMHO. Perhaps the commands that you are running could be replaced by in-process functions (e.g. /bin/ls can be done with stat, readdir, glob functions ...). Perhaps you might consider adding some plugin ability (with dlopen(3) & dlsym) to your code (and run functions from plugins instead of starting very often the same programs). Or perhaps embed an interpreter (Lua, Guile, ...) inside your code.
As an example, for web servers, look for old CGI vs FastCGI or HTTP forwarding (e.g. URL redirection) or embedded PHP or HOP or Ocsigen
This makes fork-exec cumbersome, as the fork itself can take
significant time
This is only half true. You didn't specify the OS, but fork(2) is pretty optimized in Linux (and I believe in other UNIX variants) by using copy-on-write. Copy-on-write means that the operating system will not copy the entire parent memory address space until the child (or the parent) writes to memory. So you can rest assured that if you have a parent process using 8 GB of memory and then you fork, you won't be using 16 GB of memory - especially if the child execs() something immediately.
fork will fail unless there is enough memory for a copy of the entire
resident memory.
No. The only overhead incurred by fork(2) is the copy and allocation of a task structure for the child, the allocation of a PID, and copying the parent's page tables. fork(2) will not fail if there isn't enough memory to copy the entire parent's address space, it will fail if there isn't enough memory to allocate a new task structure and the page tables. It may also fail if the maximum number of processes for the user has been reached. You can confirm this in man 2 fork (NOTE: See comments below).
If you still don't want to use fork(2), you can use vfork(2), which does no copying at all - it doesn't even copy the page tables - everything is shared with the parent. You can use that to create a new child process with a negligible overhead and then exec() something. Be aware that vfork(2) blocks the calling thread until the child either exits or calls one of the seven exec() functions. You also shouldn't modify the memory inside the child process before calling any of the exec() functions.
You mentioned that you can fork+exec 10k times per second. That sounds very excessive. Have you considered making the things you're execing into a daemon? Or maybe implement those external programs inside your application? It sounds very dodgy to have to fork that much.
fork most likely starts failing for you despite having the memory to back it because you're on a flavor of linux that has disabled (or put a limit on) memory overcommit. Check the file /proc/sys/vm/overcommit_memory. If it's 1 then my guess is wrong and there's something else weird going on. If it's 0 then you're not allowed to overcommit at all. If it's 2 then you need to read the documentation for how exactly this gets configured.
One solution mentioned above is just adding swap (that will never get used).
Another solution is to implement a small daemon that will take commands and execute those forks and execs for you piping back whatever output you need.
N.B. fork of a large process can in theory be as fast as a small process. The performance of fork is determined by how many memory mappings you have rather than how much memory they cover. Setting up copy-on-write is done per mapping. Except that on certain operating systems setting up COW of anonymous mappings is linear to amount of memory in those mappings, but I don't know what Linux does here, last time I studied the VM system in Linux was over 15 years ago.
I am trying to achieve CPU and Memory usage information of current process, inside a system call.
I can get current process name, pid and uid by using :
current->comm //process name
current->pid //process id
current_uid() //uid
but that seems to be all.(I am using kernel 3.2.0-24-generic)
As I have seen from Memory usage of current process in C, reading(vfs_read) and parsing /proc/pid/status seems to be the only option to get memory and cpu usage.
Is there a better way to obtain this information, or am I on the right track?
I also test my code as a kernel module first, since both system calls and kernel modules are running in kernel space. Is that also bad approach?
current->mm is the place where all memory information is stored.
current->mm->mmap is a list of memory mappings for the process, so you can iterate it and see what you find there.
current->utime and current->stime may be useful for getting CPU information.
I work on Linux for ARM processor for cable modem. There is a tool that I have written that sends/storms customized UDP packets using raw sockets. I form the packet from scratch so that we have the flexibility to play with different options. This tool is mainly for stress testing routers.
I actually have multiple interfaces created. Each interface will obtain IP addresses using DHCP. This is done in order to make the modem behave as virtual customer premises equipment (vcpe).
When the system comes up, I start those processes that are asked to. Every process that I start will continuously send packets. So process 0 will send packets using interface 0 and so on. Each of these processes that send packets would allow configuration (change in UDP parameters and other options at run time). Thats the reason I decide to have separate processes.
I start these processes using fork and excec from the provisioning processes of the modem.
The problem now is that each process takes up a lot of memory. Starting just 3 such processes, causes the system to crash and reboot.
I have tried the following:
I have always assumed that pushing more code to the Shared Libraries will help. So when I tried moving many functions into shared library and keeping minimum code in the processes, it made no difference to my surprise. I also removed all arrays and made them use the heap. However it made no difference. This maybe because the processes runs continuously and it makes no difference if it is stack or heap? I suspect the process from I where I call the fork is huge and that is the reason for the processes that I make result being huge. I am not sure how else I could go about. say process A is huge -> I start process B by forking and excec. B inherits A's memory area. So now I do this -> A starts C which inturn starts B will also not help as C still inherits A?. I used vfork as an alternative which did not help either. I do wonder why.
I would appreciate if someone give me tips to help me reduce the memory used by each independent child processes.
Given this is a test tool, then the most efficient thing to do is to add more memory to the testing machine.
Failing that:
How are you measuring memory usage? Some methods don't get accurate results.
Check you don't have any memory leaks. e.g. with Valgrind on Linux x86.
You could try running the different testers in a single process, as different threads, or even multiplexed in a single thread - since the network should be the limiting factor?
exec() will shrink the processes memory size as the new execution gets a fresh memory map.
If you can't add physical memory, then maybe you can add swap, maybe just for testing?
Not technically answering your question, but providing a couple of alternative solutions:
If you are using Linux have you considered using pktgen? It is a flexible tool for sending UDP packets from kernel as fast as the interface allows. This is much faster than a userspace tool.
oh and a shameless plug. I have made a multi-threaded network testing tool, which could be used to spam the network with UDP packets. It can operate in multi-process mode (by using fork), or multi-thread mode (by using pthreads). The pthreads might use less RAM, so might be better for you to use. If anything it might be worth looking at the source as I've spent many years improving this code, and its been able to generate enough packets to saturate a 10gbps interface.
What could be happening is that the fork call in process A requires a significant amount of RAM + swap (if any). Thus, when you call fork() from this process the kernel must reserve enough RAM and swap for the child process to have it's own copy (copy-on-write, actually) of the parent process's writable private memory, namely it's stack and heap. When you call exec() from the child process, that memory is no longer needed and your child process can have it's own, smaller private working set.
So, first thing to make sure is that you don't have more than one process at a time in the state between fork() and exec(). During this state is where the child process must have a duplicate of it's parent process virtual memory space.
Second, try using the overcommit settings which will allow the kernel to reserve more memory than actually exists. These are /proc/sys/vm/overcommit*. You can get away with using overcommit because your child processes only need the extra VM space until they call exec, and shouldn't actually touch the duplicated address space of the parent process.
Third, in your parent process you can allocate the largest blocks using shared memory, rather than the stack or heap, which are private. Thus, when you fork, those shared memory regions will be shared with the child process rather than duplicated copy-on-write.
I am working on a parallel app (C, pthread). I traced the system calls because at some point I have bad parallel performances. My traces shown that my program calls mprotect() many many times ... enough to significantly slow down my program.
I do allocate a lot of memory (with malloc()) but there is only a reasonable number of calls to brk() to increase the heap size. So why so many calls to mprotect() ?!
Are you creating and destroying lots of threads?
Most pthread implementations will add a "guard page" when allocating a threads stack. It's an access protected memory page used to detect stack overflows. I'd expect at least one call to mprotect each time a thread is created or terminated to (un)protect the guard page. If this is the case, there are several obvious strategies:
Set the guard page size to zero using pthread_attr_setguardsize() before creating threads.
Use a thread-pool (of as many threads as processors say). Once a thread is done with a task, return it to the pool to get a new task rather than terminate and create a new thread.
Another explanation might be that you're on a platform where a thread's stack will be grown if overflow is detected. I don't think this is implemented on Linux with GCC/Glibc as yet, but there have been some proposals along these lines recently. If you use a lot of stack space whilst processing, you might explicitely increase the initial/minimum stack size using pthread_attr_setstacksize.
Or it might be something else entirely!
If you can, run your program under a debug libc and break on mprotect(). Look at the call stack, see what your code is doing that's leading to the mprotect() calls.
glibc library that has ptmalloc2 for its malloc uses mprotect() internally for micromanagement of heap for threads other than main thread (for main thread, sbrk() is used instead.) malloc() firstly allocates large chunk of memory with mmap() for the thread if a heap area seems to have contention, and then it changes the protection bits of unnecessary portion to make it accessible with mprotect(). Later, when it needs to grow the heap, it changes the protection to read/writable with mprotect() again. Those mprotect() calls are for heap growth and shrink in multithreaded applications.
http://www.blackhat.com/presentations/bh-usa-07/Ferguson/Whitepaper/bh-usa-07-ferguson-WP.pdf
explains this in a bit more detailed way.
The 'valgrind' suite has a tool called 'callgrind' that will tell you what is calling what. If you run the application under 'callgrind', you can then view the resulting profile data with 'kcachegrind' (it can analyze profiles made by 'cachegrind' or 'callgrind'). Then just double-click on 'mprotect' in the left pane and it will show you what code is calling it and how many times.