Is execv() expensive? - c

I have a requirement. My process has to fork->exec another process during one of its code paths. The child process runs some checks and when some condition is true it has to re-exec itself. It did not cause any performance issues when I tested on high end machines.
But will it be expensive to call execv() again in the same process? Especially when it is exec()ing itself?
Note: There is no fork() involved for the second time. The process would just execv() itself for the second time, to get something remapped in its virtual address space.

The second execv() call is no more expensive than the first. It might even be cheaper, since the system might not need to read the program image from disk, and should not need to load any new dynamic libraries.
On the other hand, execv() is considerably more expensive than simply branching within the same program. I'm having trouble imagining a situation in which I would want to write a program that re-execs itself (without forking) instead of just calling a function.
On the third hand, "cheap" and "expensive" are relative. Unless you are doing this a lot, you probably won't actually notice any difference.
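For reference, a re-exec without fork can be sketched as below. This is only a minimal sketch under stated assumptions: the guard variable name MYAPP_REEXECED is hypothetical (it just stops the new incarnation from re-exec'ing forever), and /proc/self/exe is a Linux-specific way to name the running binary.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical guard variable: marks that the re-exec already happened. */
#define REEXEC_GUARD "MYAPP_REEXECED"

/*
 * Re-executes the current binary at most once.
 * Returns 0 if the guard is already set (nothing to do), -1 if setenv()
 * or execv() failed; does not return at all when execv() succeeds.
 */
int reexec_self_once(char *const argv[])
{
    if (getenv(REEXEC_GUARD) != NULL)
        return 0;                      /* second incarnation: carry on */
    if (setenv(REEXEC_GUARD, "1", 1) != 0)
        return -1;
    execv("/proc/self/exe", argv);     /* Linux-specific path to our own image */
    return -1;                         /* only reached if execv() failed */
}
```

A program would typically call reexec_self_once(argv) near the top of main(), after the checks that decide whether the re-exec is needed.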

The execve syscall is a little bit expensive; it would be unreasonable to run it more than a few dozen, or perhaps a few hundred, times per second (even if each call probably lasts at most a few milliseconds, and a fraction of a millisecond most of the time).
It is probably faster (and cleaner) than the dozen of equivalent calls to mmap(2) (& munmap & mprotect(2)) and setcontext(3) you'll use to nearly mimic it (and then, there is the issue of killing the running threads outside of the one doing the execve, and other resources attached to a process, e.g. FD_CLOEXEC-ed file descriptors).
(you won't be able to replicate with mmap, munmap, setcontext, close exactly what execve is doing, but you might be close enough... but that would be ridiculous)
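As an aside on the FD_CLOEXEC-ed descriptors mentioned above: the flag is set per descriptor with fcntl(2), and execve then closes those descriptors for you. A minimal sketch (the helper name set_cloexec is made up for illustration):

```c
#include <fcntl.h>

/* Marks fd close-on-exec so it is not inherited across execve().
 * Returns 0 on success, -1 on error (errno set by fcntl). */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}
```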
Also, the practical cost of execve should take into account the dynamic loading of the shared libraries (which should be loaded before running main, but technically after the execve syscall...) and their startup.
The question might not mean much; the answer depends heavily on the actual state of the machine and on the executable being execve-ed. I guess that execve-ing a huge ELF binary (some executables might have a gigabyte of code segment; e.g. the mythical Google crawler is rumored to be a monolithic program with a billion lines of C++ source code, and at some point it was statically linked), or one with hundreds of shared libraries, takes much longer than execve-ing the usual /bin/sh.
I also guess that execve from a process with a terabyte-sized address space takes much longer than the usual execve my zsh shell is doing on my desktop.
A typical reason to execve its own program (actually some updated version of it) is, inside a long lasting server, when the binary executable of the server has been updated.
Another reason to execve its own program is to have a more-or-less "stateless" server (some web server for static content) restart itself and reload its configuration files.
More generally, this is an entire research subject: read about dynamic software updating, application checkpointing, persistence, etc... See also the references here.
It is the same for dumping a core(5) file: in my life, I never saw a core dump lasting more than a fraction of a second, but I did hear that on early 1990s Cray computers, a core dump could (pathologically) last half an hour.... So I imagine that some pathological execve could last quite a long time (e.g. bringing a terabyte of code segment, using C-O-W techniques, into RAM; this is not counted as execve time but it is part of the cost to start a program; and you also might have many relocations for many shared libraries).
Addenda
For a small executable (less than a few megabytes), you can afford several hundred execve calls per second, so that is not a big deal in practice. Notice that a shell script with usual commands like ls, mv, ... is execve-ing quite a lot (very often after some fork, which it does for nearly every command). If you suspect some issues, you could benchmark (e.g. with strace(1) using strace -tt -T -f....). On my desktop (Debian/x86-64/Sid, i7 3770K) an execve of /bin/ls (measured with strace -T -f -tt zsh-static -c ls) takes about 250 µs (for an ELF binary executable /bin/ls of 118 Kbytes which is probably already in the page cache), and for ocamlc (a binary of 1.8 Mbytes) about 1.3 ms; a malloc usually takes half a µs or a few µs; a call to time(2) takes about 3 ns (avoiding the overhead of a syscall through vdso(7)...)

Related

Fork()ing and running on specific set of CPUs

I have a parent process, which I use to spawn a series of child processes, which each run their own program sequentially. Each of these programs change a file over time, I want to read the data from this file and see how it changes as each program runs.
I need two sets of data for this to work, the value of the file at some set interval (I haven't decided on the interval yet), and the time each program takes to run, there are other variables which can influence the execution times of these programs, which I want to see also.
So I figured to get more accurate timing of the child process while still reading from a file I could run them on different cores. I have 8 cores, I would like to run the parent process on 0-3, then fork the child to run on 4-7. I'm not sure if this is possible though within C, and a search around hasn't yielded any answers, which makes me think it isn't.
Within Linux, outside of a program, I can use taskset to do this.
I plan on setting aside 4 of the cores using the kernel boot parameter isolcpus. I want as little noise as possible while running the child programs.
Asking the kernel to associate CPU cores with threads or processes is also known as setting the "affinity" between the core and the process/thread.
Under linux, there exists a set of functions that provide this capability. Take a look at the manual page for one of the functions...
man pthread_setaffinity_np
This family of API calls might be able to give you what you need.
That man page has a "see also" section that links to the other functions in this family.
Typically with features such as these that deal with kernel process and thread scheduling, it is entirely dependent on what mood the kernel is in at the time as to whether your requests are met or ignored. Your mileage may vary due to system load or the number of available cores. Even if a system has 16 cores, these features may be disabled in the kernel compilation settings (think virtual machines). Equally, you may find that there are some additional options that you may be able to add to your kernel to get better results than the defaults.
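For whole processes (rather than threads), the related call is sched_setaffinity(2), which is what taskset uses. A minimal Linux/glibc sketch, with a made-up helper name pin_to_cpu_range:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for sched_setaffinity() and CPU_* macros */
#endif
#include <sched.h>
#include <sys/types.h>

/*
 * Restricts process pid (0 means the caller) to CPUs first..last inclusive.
 * Returns 0 on success, -1 on error (bad range, or errno set by
 * sched_setaffinity).
 */
int pin_to_cpu_range(pid_t pid, int first, int last)
{
    cpu_set_t set;

    if (first < 0 || last < first || last >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);
    for (int cpu = first; cpu <= last; cpu++)
        CPU_SET(cpu, &set);
    return sched_setaffinity(pid, sizeof(set), &set);
}
```

For the setup in the question, the child would call pin_to_cpu_range(0, 4, 7) between fork() and exec; the affinity mask is kept across execve and inherited by further children.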

Executing an external program when forking is not advisable

I have a big server application that can hog 4-8GB of memory.
This makes fork-exec cumbersome, as the fork itself can take significant time, plus the default behavior seems to be that fork will fail unless there is enough memory for a copy of the entire resident memory.
Since this is starting to show as the hottest spot (60% of time spent in fork) when profiling I need to address it.
What would be the easiest way to avoid fork-exec routine?
You basically cannot avoid fork(2) (or the equivalent clone(2) syscall..., or the obsolete vfork which I don't recommend using) + execve(2) to start an external command (à la system(3), or à la posix_spawn) on Linux and (probably) MacOSX and most other Unix or POSIX systems.
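For completeness, the posix_spawn route looks roughly like the sketch below. Whether it is cheaper than a plain fork + execve for a large process depends on how the C library implements it (some implementations avoid copying the parent's page tables), so treat that as something to measure, not a given. The helper name run_command is made up for illustration.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Starts argv[0] (looked up in PATH) with posix_spawnp() and waits for it.
 * Returns the child's exit status (0..255), or -1 if it could not be
 * started or did not exit normally.
 */
int run_command(char *const argv[])
{
    pid_t pid;
    int status;

    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```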
What makes you think that it is becoming an issue? An 8GB process virtual address space is not a big deal today (at least on machines with 8Gbytes or 16Gbytes of RAM, like my desktop has). You don't practically need twice as much RAM (but you do need swap space) thanks to the lazy copy-on-write techniques used by all recent Unixes & Linux.
Perhaps you might believe that swap space could be an issue. On Linux, you could add swap space, perhaps by swapping on a file; just run as root:
dd if=/dev/zero of=/var/tmp/myswap bs=1M count=32768
mkswap /var/tmp/myswap
swapon /var/tmp/myswap
(of course, be sure that /var/tmp/ is not a tmpfs mounted filesystem, but sits on some disk, perhaps an SSD one....)
When you don't need any more a lot of swap space, run swapoff /var/tmp/myswap....
You could also start some external shell process near the beginning of your program (à la popen) and later you might send shell commands to it. Look at my execicar.c program for inspiration, or use it if it fits (I wrote it 10 years ago for similar purposes, but I forgot the details)
Alternatively fork at the beginning of your program some interpreter (Lua, Guile...) and send some commands to it.
Running more than a few dozens commands per second (starting any external program) is not reasonable, and should be considered as a design mistake, IMHO. Perhaps the commands that you are running could be replaced by in-process functions (e.g. /bin/ls can be done with stat, readdir, glob functions ...). Perhaps you might consider adding some plugin ability (with dlopen(3) & dlsym) to your code (and run functions from plugins instead of starting very often the same programs). Or perhaps embed an interpreter (Lua, Guile, ...) inside your code.
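To illustrate the "in-process function instead of /bin/ls" idea, here is a minimal sketch that walks a directory with opendir/readdir instead of spawning a command. It only counts entries; the function name count_dir_entries is made up for illustration.

```c
#include <dirent.h>
#include <string.h>

/*
 * Counts the entries of directory path, skipping "." and "..".
 * Returns the count, or -1 if the directory cannot be opened.
 */
long count_dir_entries(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long n = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        n++;
    }
    closedir(dir);
    return n;
}
```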
As an example, for web servers, look for old CGI vs FastCGI or HTTP forwarding (e.g. URL redirection) or embedded PHP or HOP or Ocsigen
This makes fork-exec cumbersome, as the fork itself can take
significant time
This is only half true. You didn't specify the OS, but fork(2) is pretty optimized in Linux (and I believe in other UNIX variants) by using copy-on-write. Copy-on-write means that the operating system will not copy the entire parent memory address space until the child (or the parent) writes to memory. So you can rest assured that if you have a parent process using 8 GB of memory and then you fork, you won't be using 16 GB of memory - especially if the child execs() something immediately.
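The visible effect of copy-on-write can be checked with a small experiment: after fork, a write in the child lands in the child's private copy and the parent still sees its own data. The sketch below shows only that semantic; it does not measure when or how many pages actually get copied. The function name is made up for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Allocates size bytes, fills them with 'P', forks, and lets the child
 * overwrite the buffer with 'C'. Returns the first byte the parent sees
 * afterwards ('P' expected), or -1 on error.
 */
int byte_seen_by_parent_after_child_write(size_t size)
{
    char *buf = malloc(size);
    if (buf == NULL || size == 0) {
        free(buf);
        return -1;
    }
    memset(buf, 'P', size);

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        memset(buf, 'C', size);    /* touches the child's private copies only */
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid) {
        free(buf);
        return -1;
    }
    int seen = buf[0];
    free(buf);
    return seen;
}
```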
fork will fail unless there is enough memory for a copy of the entire
resident memory.
No. The only overhead incurred by fork(2) is the copy and allocation of a task structure for the child, the allocation of a PID, and copying the parent's page tables. fork(2) will not fail if there isn't enough memory to copy the entire parent's address space, it will fail if there isn't enough memory to allocate a new task structure and the page tables. It may also fail if the maximum number of processes for the user has been reached. You can confirm this in man 2 fork (NOTE: See comments below).
If you still don't want to use fork(2), you can use vfork(2), which does no copying at all - it doesn't even copy the page tables - everything is shared with the parent. You can use that to create a new child process with a negligible overhead and then exec() something. Be aware that vfork(2) blocks the calling thread until the child either exits or calls one of the seven exec() functions. You also shouldn't modify the memory inside the child process before calling any of the exec() functions.
You mentioned that you can fork+exec 10k times per second. That sounds very excessive. Have you considered making the things you're execing into a daemon? Or maybe implement those external programs inside your application? It sounds very dodgy to have to fork that much.
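fork most likely starts failing for you despite having the memory to back it because you're on a flavor of Linux that has disabled (or put a limit on) memory overcommit. Check the file /proc/sys/vm/overcommit_memory. If it's 1, the kernel always overcommits, so my guess is wrong and there's something else weird going on. If it's 0 (the usual default), the kernel applies a heuristic and refuses requests it considers obvious overcommits, which a fork of a very large process can run into. If it's 2, strict accounting is in effect and nothing beyond a configured limit can be committed; you need to read the documentation for how exactly this gets configured.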
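The same check can be done from inside the server, for example to log the setting at startup. A minimal sketch (the helper name read_overcommit_mode is made up; it only returns the raw number and leaves the interpretation to the kernel documentation):

```c
#include <stdio.h>

/* Returns the value in /proc/sys/vm/overcommit_memory (0, 1 or 2),
 * or -1 if the file cannot be read (for example on a non-Linux system). */
int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (f == NULL)
        return -1;
    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```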
One solution mentioned above is just adding swap (that will never get used).
Another solution is to implement a small daemon that will take commands and execute those forks and execs for you piping back whatever output you need.
N.B. fork of a large process can in theory be as fast as a small process. The performance of fork is determined by how many memory mappings you have rather than how much memory they cover. Setting up copy-on-write is done per mapping. Except that on certain operating systems setting up COW of anonymous mappings is linear to amount of memory in those mappings, but I don't know what Linux does here, last time I studied the VM system in Linux was over 15 years ago.

Fork and dynamic library interaction

I considered the following experiment: a simple C program that only returns 0, but linked with all the libraries that gcc allowed me to link, 207 in total. It takes a long time to run this program: 2.1 cold start, 0.24 warm. So the next step was to write a program, also linked with this heap of libraries, that will fork & exec on request. The idea was that if it has already loaded the libraries, and fork creates an identical copy of the process, then I would get the first program running very quickly. But I found no difference between running the first program via the shell and via the second program linked with all the libraries.
What is my mistake?
EDIT: Yeah, I missed the point of exec. But is there any possible improvement to my idea to speed up starting an application? I know about prelink, but it implements a somewhat different idea.
The only advantage of what you're doing is that it gets all the libraries read from disk into the filesystem cache (same as your "warm start"). Otherwise, what you're doing is exactly how the shell loads a program (fork and exec) so I don't see how you expect it to be faster. The idea that this will "copy" a process is true if you just fork, but you also exec.
To make a "copying" analogy with the filesystem, it's like if you took a file that was really slow to generate, copied it, then rm'd it and generated it all over again rather than using the copy.
fork creates an exact copy of the process; however, exec clears the process's memory. Therefore all the libraries have to be loaded again (or at least initialised; the code segments might be shared).
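If the work can be expressed as a function inside the already-loaded program, forking without exec does keep everything that was loaded and initialized. The sketch below illustrates this with a table standing in for the expensive start-up state; all names (table, expensive_init, child_work, run_in_fork) are made up for illustration, and it assumes the work's result fits in an exit status.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int table[256];          /* stands in for expensively initialized state */

/* Simulated slow start-up work, done once in the parent. */
void expensive_init(void)
{
    for (int i = 0; i < 256; i++)
        table[i] = i * i;
}

/* Work the child does using the state it inherited from the parent. */
int child_work(void)
{
    return table[6];            /* 36 if the parent's initialization was inherited */
}

/*
 * Forks and runs fn in the child WITHOUT calling exec, so the child keeps
 * the parent's memory. Returns fn's result as reported through the exit
 * status (0..255), or -1 on error.
 */
int run_in_fork(int (*fn)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(fn() & 0xff);
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```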

How does a system call translate to CPU instructions?

Let's say there is a simple program like:
#include <stdio.h>
#include <fcntl.h>   /* open(), O_RDONLY */
int main(void)
{
    int x;
    int fd;
    printf("Cool");
    fd = open("/tmp/cool.txt", O_RDONLY);   /* the system call in question */
    return 0;
}
The open is a system call here. I suppose when the shell runs it, it makes some hundred other system calls to implement it? How about a declaration like int x - at some point should it have some additional system calls in the backdrop to get the memory from the computer?
I am not sure what is the boundary between a system call and a normal stuff ... everything, in the end, needs the operating system's help right?!
Or is it like the C generates an executable (code) which can be run on the processor and need no OS assistance is needed until a system call is reached - at which point it has to do something to load the OS instructions etc ...
A bit vague :) Please clarify.
I'm not answering the questions in order, so I'm prefixing my answers with the questions. I've taken the liberty of editing them a bit. You didn't specify the processor architecture, but I'm assuming you want to know about x86, so the processor-level details will pertain to x86. Other architectures can behave differently (memory management, how system calls are made, etc.). I'm also using Linux for examples.
Does the c compiler generate executable code that can be run straight on the processor without need for OS assistance until a system call is reached, at which point it has to do something to load the OS instructions?
Yes, that is correct. The compiler generates native machine code that can be run straight on the processor. The executable files that you get from the compiler, however, contain both the code and other needed data, for example, instructions on where to load the code in the memory. On Linux the ELF format is typically used for executables.
If the process is completely loaded into memory and has sufficient stack space, it will not need further OS assistance before it wants to make a system call. When you make a system call, it is just an instruction in the machine code that calls the OS. The program itself does not need to "load the OS instructions" in any way. The processor handles transferring execution to the OS code.
With Linux on the x86 architecture, one way for the machine code to make a system call is to use the software interrupt vector 128 to transfer execution to the operating system. In x86 assembly (Intel syntax), that is expressed as int 0x80. Linux will then perform tasks based on the values that the calling program placed into processor registers before making the system call: the system call number is found in the eax processor register and the system call parameters are found in other processor registers. After the OS is done, it will return a result in the eax register, and has possibly modified buffers pointed to by the system call parameters etc. Note however, that this is not the only way to make a system call.
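From C, the "system call number in a register, then trap into the kernel" mechanism can be exercised without writing assembly, through the syscall(2) library function on Linux. It loads the number and arguments and executes whatever trap instruction the platform uses. A minimal sketch (the helper name raw_getpid is made up):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for the syscall() declaration on glibc */
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invokes the getpid system call by number, bypassing the getpid() wrapper. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```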
However, if the process is not entirely in memory, and execution moves to a part of the code that is not in memory at the moment, the processor causes a page fault, which moves execution to the operating system, which then loads the required part of the process into memory and transfers execution back to the process, which can then continue execution normally, without even noticing that anything happened.
I'm not entirely sure on the next point, so take it with a grain of salt. The Wikipedia article on stack overflow (the computer error, not this site :) seems to indicate that stacks are usually of fixed size, so int x; should not cause the OS to run, unless that part of the stack is not in the memory (see previous paragraph). If you had a system with dynamic stack size (if it is even possible, but as far as I can see, it is), int x; could also cause a page fault when the stack space is used up, prompting the operating system to allocate more stack space for the process.
Page faults cause the execution to move to the operating system, but are not system calls in the usual sense of the word. System calls are explicit calls to the OS when you want it to perform some work for you. Page faults and other such events are implicit. Hardware interrupts continuously transfer the execution from your process to the OS so that it can react to them. After that it transfers the execution back to your process, or some other process.
On a multitasking OS, you can run many programs at once even if you have only one processor/core. This is accomplished by running only one program at a time, but switching between programs quickly. The hardware timer interrupt makes sure that control is transferred back to the OS in a timely fashion, so that one process can't hog the CPU all for itself. When control is passed to the OS and it has done what it needs to, it may always start a different process from the one that was interrupted. The OS handles all this totally transparently, so you don't have to think about it, and your process won't notice it. From the viewpoint of your process, it is executing continuously.
In short: Your program executes system calls only when you explicitly ask it to. The operating system may also swap parts of your process in and out of the memory when it wants to, and generally does things related and unrelated to your process in the background, but you don't normally need to think about that at all. (You can reduce the amount of page faults, though, by keeping your program as small as possible, and things like that)
In this case open() is an explicit system call, but I suppose when the shell runs it, it makes some hundred other system calls to implement it.
No, the shell has got nothing to do with an open() call in your c program. Your program makes that one system call, and shell doesn't come into the picture at all.
The shell will only affect your program when it starts it. When you start your program with the shell, the shell does a fork system call to fork off a second process, which then does an execve system call to replace itself with your program. After that, your program is in control. Before the control gets to your main() function though, it executes some initialization code, that was put there by the compiler. If you want to see what system calls a process makes, on Linux you can use strace to view them. Just say strace ls, for example, to see what system calls ls makes during its execution. If you compile a c program with just a main() function that returns immediately, you can see with strace what system calls the initialization code makes.
How does the process get its memory from the computer etc.? It has to involve some system calls again right? I am not sure what is the boundary between a system call and normal stuff. Everything in the end needs the OS help, right?
Yep, system calls. When your program is loaded into memory with the execve system call, it takes care of getting enough memory for your process. When you need more memory and call malloc(), it will make a brk system call to grow the data segment of your process if it has run out of internally cached memory to give you.
Not everything needs explicit help from the OS. If you have enough memory, have all your input in memory, and you write your output data to memory, you won't need the OS at all. That is, as long as you only do calculations on data you already have in memory, don't need more memory, and don't need to communicate with the outside world, you don't need the OS. On the other hand, a program that does not communicate with the outside world at all is a pretty useless one, because it can't get any input, and cannot give any output. Even if you calculate the millionth decimal of pi, it doesn't matter at all if you don't output it to the user.
This answer got quite big, so in case I missed something or didn't explain something clearly enough, please leave me a comment and I'll try to elaborate. If anyone spots any mistakes, be sure to point them out also.

C/C++ memory usage API in Linux/Windows

I'd like to obtain memory usage information for both per process and system wide. In Windows, it's pretty easy. GetProcessMemoryInfo and GlobalMemoryStatusEx do these jobs greatly and very easily. For example, GetProcessMemoryInfo gives "PeakWorkingSetSize" of the given process. GlobalMemoryStatusEx returns system wide available memory.
However, I need to do it on Linux. I'm trying to find Linux system APIs that are equivalent to GetProcessMemoryInfo and GlobalMemoryStatusEx.
I found 'getrusage'. However, 'ru_maxrss' (the maximum resident set size) in struct rusage is just zero; it is not implemented. Also, I have no idea to get system-wide free memory.
As a current workaround, I'm using "system("ps -p %my_pid -o vsz,rsz");" and manually logging to a file. But it's dirty, and the data is not convenient to process.
I'd like to know some fancy Linux APIs for this purpose.
You can see how it is done in libstatgrab.
And you can also use it (GPL)
Linux has a (modular) filesystem-interface for fetching such data from the kernel, thus being usable by nearly any language or scripting tool.
Memory can be complex. There's the program executable itself, presumably mmap()'ed in. Shared libraries. Stack utilization. Heap utilization. Portions of the software resident in RAM. Portions swapped out. Etc.
What exactly is "PeakWorkingSetSize"? It sounds like the maximum resident set size (the maximum non-swapped physical-memory RAM used by the process).
Though it could also be the total virtual memory footprint of the entire process (sum of the in-RAM and SWAPPED-out parts).
Regardless, under Linux, you can strace a process to see its kernel-level interactions. "ps" gets its data from /proc/${PID}/* files.
I suggest you cat /proc/${PID}/status. The Vm* lines are quite useful.
Specifically: VmData refers to process heap utilization. VmStk refers to process stack utilization.
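Reading those lines from inside the process takes only a few lines of C. The sketch below parses /proc/self/status for a given key; besides VmData and VmStk, VmRSS (current resident set) and VmHWM (peak resident set) are likely the closest matches to the Windows working-set figures. The helper name read_self_status_kb is made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Looks up a "Key:   <number> kB" line (e.g. "VmRSS", "VmHWM", "VmData")
 * in /proc/self/status. Returns the number in kB, or -1 if the file or
 * the key is not available.
 */
long read_self_status_kb(const char *key)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[256];
    size_t klen = strlen(key);
    long value = -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            if (sscanf(line + klen + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```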
If you continue using "ps", you might consider popen().
I have no idea to get system-wide free memory.
There's always /usr/bin/free
Note that Linux will make use of unused memory for buffering files and caching... Thus the +/-buffers/cache line.
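A programmatic alternative to parsing the output of free is the Linux-specific sysinfo(2) call. A minimal sketch (the helper name free_ram_bytes is made up); as with free, the figure does not count memory the kernel is using for buffers and cache, so it understates what is really available:

```c
#include <sys/sysinfo.h>

/* Returns free RAM in bytes as reported by sysinfo(2), or 0 on error.
 * Memory used for buffers and page cache is not counted as free. */
unsigned long long free_ram_bytes(void)
{
    struct sysinfo si;
    if (sysinfo(&si) != 0)
        return 0;
    return (unsigned long long)si.freeram * si.mem_unit;
}
```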

Resources