Fork and dynamic library interaction in C

I tried the following experiment: a simple C program that only returns 0, but linked with every library gcc would let me link, 207 in total. This program takes a long time to run: 2.1 s from a cold start, 0.24 s warm. So the next step was to write a second program, also linked with this heap of libraries, which would fork & exec the first one on request. The idea was that since it has already loaded the libraries, and fork creates an identical copy of the process, I would get the first program running very quickly. But I found no difference between running the first program via the shell and running it via the second program linked with all the libraries.
What is my mistake?
EDIT: Yeah, I missed the point of exec. But is there any possible improvement of my idea to speed up application startup? I know about prelink, but it addresses a somewhat different problem.

The only advantage of what you're doing is that it gets all the libraries read from disk into the filesystem cache (same as your "warm start"). Otherwise, what you're doing is exactly how the shell loads a program (fork and exec), so I don't see why you would expect it to be faster. The idea that this "copies" a process would be true if you just forked, but you also exec.
To make a "copying" analogy with the filesystem: it's as if you took a file that was really slow to generate, copied it, then rm'd the copy and generated the file all over again rather than using the copy.

fork creates an exact copy of the process, but exec then clears the process's memory. Therefore all the libraries have to be loaded again (or at least initialised; the code segments might be shared).
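A minimal sketch of the pattern under discussion, with the first program assumed to live at ./prog1 (a made-up path): the child produced by fork does share the parent's mapped libraries, but execv immediately replaces that image, so the dynamic linker must map and initialise prog1's libraries from scratch.

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: right now it still shares the parent's mapped libraries... */
        char *const argv[] = { "./prog1", NULL };
        execv("./prog1", argv);            /* ...but execv discards that image */
        perror("execv");                   /* only reached if execv fails */
        _exit(127);
    }
    waitpid(pid, NULL, 0);                 /* parent waits for the child */
    return 0;
}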

Related

Is execv() expensive?

I have a requirement: my process has to fork->exec another process during one of its code paths. The child process runs some checks, and when some condition is true it has to re-exec itself. This did not cause any performance issues when I tested on high-end machines.
But will it be expensive to call execv() again in the same process? Especially when it is exec()ing itself?
Note: there is no fork() involved the second time. The process just execv()s itself a second time, to get something remapped in its virtual address space.
The second execv() call is no more expensive than the first. It might even be cheaper, since the system might not need to read the program image from disk, and should not need to load any new dynamic libraries.
On the other hand, execv() is considerably more expensive than simply branching within the same program. I'm having trouble imagining a situation in which I would want to write a program that re-execs itself (without forking) instead of just calling a function.
On the third hand, "cheap" and "expensive" are relative. Unless you are doing this a lot, you probably won't actually notice any difference.
The execve syscall is a little bit expensive; it would be unreasonable to run it more than a few dozen (or perhaps a few hundred) times per second, even though a single call probably lasts a few milliseconds, and perhaps a fraction of a millisecond, most of the time.
It is probably faster (and cleaner) than the dozens of equivalent calls to mmap(2) (& munmap & mprotect(2)) and setcontext(3) you would need to nearly mimic it (and then there is the issue of killing the running threads other than the one doing the execve, and of other resources attached to a process, e.g. FD_CLOEXEC-ed file descriptors).
(You won't be able to replicate exactly what execve does with mmap, munmap, setcontext, and close, but you might get close enough... and doing so would be ridiculous.)
Also, the practical cost of execve should take into account the dynamic loading of the shared libraries (which are loaded before main runs, but technically after the execve syscall...) and their startup.
The question might not mean much; it heavily depends on the actual state of the machine and on the execve'd executable. I guess that execve of a huge ELF binary (some executables have a gigabyte of code segment; the mythical Google crawler is rumored to have been a monolithic program of a billion lines of C++ source code, and at some point it was statically linked), e.g. one with hundreds of shared libraries, takes much longer than execve of the usual /bin/sh.
I also guess that execve from a process with a terabyte-sized address space takes much longer than the usual execve my zsh shell does on my desktop.
A typical reason for a program to execve itself (actually, some updated version of itself) is, inside a long-lasting server, when the binary executable of the server has been updated.
Another reason to execve its own program is to have a more-or-less "stateless" server (e.g. some web server for static content) restart itself and reload its configuration files.
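A minimal sketch of that restart-by-re-exec idea, assuming the server was started with a usable path in argv[0] and that a SIGHUP should trigger the restart (both assumptions are mine, not anything prescribed above):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static char **saved_argv;                  /* argv saved at startup */
static volatile sig_atomic_t want_reexec;  /* set from the signal handler */

static void on_hup(int sig) { (void)sig; want_reexec = 1; }

int main(int argc, char **argv)
{
    (void)argc;
    saved_argv = argv;
    signal(SIGHUP, on_hup);

    for (;;) {
        pause();                           /* placeholder for the real event loop */
        if (want_reexec) {
            /* Replace this process with the (possibly updated) binary on disk; */
            /* open file descriptors without FD_CLOEXEC survive the re-exec.    */
            execv(saved_argv[0], saved_argv);
            perror("execv");               /* reached only if execv fails */
            return 1;
        }
    }
}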
More generally, this is an entire research subject: read about dynamic software updating, application checkpointing, persistence, etc... See also the references here.
It is the same for dumping a core(5) file: in my life, I have never seen a core dump last more than a fraction of a second, but I did hear that on early-1990s Cray computers a core dump could (pathologically) last half an hour... So I imagine that some pathological execve could last quite a long time (e.g. bringing a terabyte of code segment into RAM using copy-on-write techniques; this is not counted as execve time, but it is part of the cost of starting the program, and you also might have many relocations for many shared libraries).
Addenda
For a small executable (less than a few megabytes), you might afford several hundred execves per second, so this is not a big deal in practice. Notice that a shell script with usual commands like ls, mv, ... is execve-ing quite a lot (very often after a fork, which it does for nearly every command). If you suspect some issue, you could benchmark (e.g. with strace(1), using strace -tt -T -f ...). On my Debian/x86-64/Sid desktop with an i7 3770K, an execve of /bin/ls (measured with strace -T -f -tt zsh-static -c ls) takes about 250 µs (for an ELF binary executable /bin/ls of 118 Kbytes, which is probably already in the page cache), and for ocamlc (a binary of 1.8 Mbytes) about 1.3 ms; a malloc usually takes about half a µs to a few µs; a call to time(2) takes about 3 ns (avoiding the overhead of a syscall through vdso(7)...)
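If you would rather measure from inside a program than with strace, here is a rough sketch that times fork+execve+wait of /bin/true (the target program and the run count are arbitrary choices of mine; this measures the whole spawn, not the execve syscall alone):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    enum { RUNS = 100 };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < RUNS; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            char *const argv[] = { "/bin/true", NULL };
            execv("/bin/true", argv);
            _exit(127);                    /* only reached if execv fails */
        }
        waitpid(pid, NULL, 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f us per fork+execve+wait\n", secs * 1e6 / RUNS);
    return 0;
}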

Executing an external program when forking is not advisable

I have a big piece of server software that can hog 4-8 GB of memory.
This makes fork-exec cumbersome, as the fork itself can take significant time, plus the default behavior seems to be that fork will fail unless there is enough memory for a copy of the entire resident memory.
Since this is starting to show up as the hottest spot (60% of time spent in fork) when profiling, I need to address it.
What would be the easiest way to avoid the fork-exec routine?
You basically cannot avoid fork(2) (or the equivalent clone(2) syscall..., or the obsolete vfork, which I don't recommend using) + execve(2) to start an external command (à la system(3), or à la posix_spawn) on Linux and (probably) MacOSX and most other Unix or POSIX systems.
What makes you think that it is becoming an issue? An 8 GB process virtual address space is not a big deal today (at least on machines with 8 Gbytes or 16 Gbytes of RAM, like my desktop has). You don't practically need twice as much RAM (but you do need swap space), thanks to the lazy copy-on-write techniques used by all recent Unixes and Linux.
Perhaps you believe that swap space could be an issue. On Linux, you can add swap space, perhaps by swapping to a file; just run as root:
dd if=/dev/zero of=/var/tmp/myswap bs=1M count=32768
mkswap /var/tmp/myswap
swapon /var/tmp/myswap
(Of course, be sure that /var/tmp/ is not a tmpfs-mounted filesystem, but sits on some disk, perhaps an SSD one...)
When you no longer need that much swap space, run swapoff /var/tmp/myswap...
You could also start some external shell process near the beginning of your program (à la popen) and later send shell commands to it. Look at my execicar.c program for inspiration, or use it if it fits (I wrote it 10 years ago for similar purposes, but I have forgotten the details).
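A minimal sketch of that long-lived helper-shell idea, assuming a single persistent /bin/sh is acceptable (the example commands are placeholders; note that a plain popen pipe is one-way, so reading output back would need a second channel):

#include <stdio.h>

int main(void)
{
    /* One long-lived shell, started once: we pay fork+exec a single time. */
    FILE *sh = popen("/bin/sh", "w");
    if (!sh) { perror("popen"); return 1; }

    /* Send it commands whenever needed, instead of fork+exec per command. */
    fprintf(sh, "echo hello from the helper shell\n");
    fprintf(sh, "ls /tmp > /dev/null\n");
    fflush(sh);

    pclose(sh);                            /* sends EOF; the shell exits */
    return 0;
}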
Alternatively fork at the beginning of your program some interpreter (Lua, Guile...) and send some commands to it.
Running more than a few dozen commands per second (starting any external program) is not reasonable, and should be considered a design mistake, IMHO. Perhaps the commands that you are running could be replaced by in-process functions (e.g. /bin/ls can be done with the stat, readdir, and glob functions...). Perhaps you might consider adding some plugin ability (with dlopen(3) & dlsym) to your code, and run functions from plugins instead of starting the same programs over and over; a sketch follows. Or perhaps embed an interpreter (Lua, Guile, ...) inside your code.
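Here is a rough sketch of the dlopen/dlsym plugin idea; the plugin path ./plugin.so and the exported function name run are made-up examples (build the host program with -ldl on glibc systems):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Load the plugin once, then call into it as often as needed. */
    void *handle = dlopen("./plugin.so", RTLD_NOW);
    if (!handle) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    /* Look up the entry point; casting the result is the usual idiom. */
    int (*run)(const char *) = (int (*)(const char *))dlsym(handle, "run");
    if (!run) { fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }

    int rc = run("some argument");         /* in-process call, no fork+exec */
    printf("plugin returned %d\n", rc);

    dlclose(handle);
    return 0;
}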
As an example, for web servers, look at old-style CGI vs FastCGI, or HTTP forwarding (e.g. URL redirection), or embedded PHP, or HOP, or Ocsigen.
This makes fork-exec cumbersome, as the fork itself can take significant time
This is only half true. You didn't specify the OS, but fork(2) is pretty optimized in Linux (and, I believe, in other UNIX variants) by using copy-on-write. Copy-on-write means that the operating system will not copy the entire parent memory address space until the child (or the parent) writes to memory. So you can rest assured that if you have a parent process using 8 GB of memory and then you fork, you won't be using 16 GB of memory, especially if the child execs something immediately.
fork will fail unless there is enough memory for a copy of the entire resident memory.
No. The only overhead incurred by fork(2) is the copying and allocation of a task structure for the child, the allocation of a PID, and copying the parent's page tables. fork(2) will not fail if there isn't enough memory to copy the entire parent's address space; it will fail if there isn't enough memory to allocate a new task structure and the page tables. It may also fail if the maximum number of processes for the user has been reached. You can confirm this in man 2 fork (NOTE: see the comments below).
If you still don't want to use fork(2), you can use vfork(2), which does no copying at all: it doesn't even copy the page tables; everything is shared with the parent. You can use it to create a new child process with negligible overhead and then exec() something. Be aware that vfork(2) blocks the calling thread until the child either exits or calls one of the seven exec() functions. You also shouldn't modify any memory inside the child process before calling one of the exec() functions.
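A minimal sketch of the vfork+exec pattern just described, spawning /bin/ls as an arbitrary example (note that the child does nothing but execv or _exit, as vfork requires):

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = vfork();                   /* parent is suspended until exec/_exit */
    if (pid < 0) {
        perror("vfork");
        return 1;
    }
    if (pid == 0) {
        /* Child: modify nothing, just exec; the address space is shared. */
        char *const argv[] = { "/bin/ls", NULL };
        execv("/bin/ls", argv);
        _exit(127);                        /* exec failed; _exit, never exit() */
    }
    waitpid(pid, NULL, 0);
    return 0;
}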
You mentioned that you can fork+exec 10k times per second. That sounds very excessive. Have you considered making the things you're execing into a daemon? Or maybe implement those external programs inside your application? It sounds very dodgy to have to fork that much.
fork most likely starts failing for you, despite there being enough memory to back it, because you're on a flavor of Linux that has disabled (or put a limit on) memory overcommit. Check the file /proc/sys/vm/overcommit_memory. If it's 1, overcommit is always allowed, my guess is wrong, and something else weird is going on. If it's 0, the kernel uses a heuristic that can refuse large commitments, such as the copy-on-write reservation made when forking a huge process. If it's 2, overcommit is strictly accounted, and you need to read the documentation for how exactly this gets configured.
One solution mentioned above is just adding swap (which will likely never get used).
Another solution is to implement a small daemon that will take commands and execute those forks and execs for you, piping back whatever output you need.
N.B. A fork of a large process can in theory be as fast as that of a small process. The performance of fork is determined by how many memory mappings you have rather than by how much memory they cover, since setting up copy-on-write is done per mapping. The exception is that on certain operating systems, setting up COW of anonymous mappings is linear in the amount of memory in those mappings; I don't know what Linux does here, as the last time I studied its VM system was over 15 years ago.

How to speed up consecutive program startup under Linux?

I've written two relatively small programs in C. They communicate with each other using textual data. Program A generates some problems from the given input; B evaluates them and creates input for another iteration of A.
Here's a bash script that I currently use:
for i in {1..1000}
do
./A data > data2;
./B data2 > data;
done
The problem is that, since what A and B do is not very time consuming, most of the time is spent (as I suppose) starting the apps up. When I measure the time the script takes to run, I get:
$ time ./bash.sh
real 0m10.304s
user 0m4.010s
sys 0m0.113s
So my main question is: is there any way to communicate data between those two apps faster? I don't want to integrate them into one application, because I'm trying to build a toolset of independent, easily communicating tools (as suggested in "The Art of Unix Programming", from which I'm learning how to write reusable software).
PS. The data and data2 files contain sets of data that each application needs in full, all at once (so communicating e.g. one line of data at a time is impossible).
Thanks for any suggestions.
cheers,
kajman
Can you create a named pipe?
mkfifo data1
mkfifo data2
./A data1 > data2 &
./B data2 > data1
If your application is reading and writing in a loop, this could work :)
If you used a pipe to transfer the stdout of program A to the stdin of program B, you would remove the need to write the file "data2" each loop:
./A data | ./B > data.new && mv data.new data
(Redirecting B straight back into data would truncate the file while A is still reading it, hence the temporary file.) Program B would need the capability of using input from stdin rather than a specified file.
If you want to make a program run faster, you need to understand what is making the program run slowly. The field of computer science dedicated to measuring the performance of a running program is called profiling.
Once you discover which internal portion of your program is running slow, you can generally speed it up. How you go about speeding up that item depends heavily on what "the slow part" is doing and how it is "being done".
Several people have recommended pipes for moving the data directly from the output of one program into the input of another program. Assuming you rewrite your tools to handle input and output in a piped manner, this might improve performance. Again, it depends on what you are doing and how you are doing it.
For example, if your tool just converts Windows-style end-of-lines into Unix-style end-of-lines, it might read one line at a time, waiting for it to become available, fix the end-of-line, and write the line out with the desired ending. Or it might read in all of the data, do a replacement on each "wrong" end-of-line in memory, and then write out all of the data. With the first approach, piping speeds things up; with the second, piping doesn't speed up anything.
The reason it is so hard to answer such a question is that the fix you need really depends on the code you have, the problem you are trying to solve, and the means by which you are solving it now. In the end, there isn't always a 100% guarantee that the code can be sped up; however, virtually every piece of code has opportunities to be sped up. Use profiling to speed up the parts that are slow, instead of wasting your time on a part of your program that is only called once and represents 0.001% of the runtime.
Remember: if you speed up something that is 0.001% of your program's runtime by 50%, you have only sped up your entire program by 0.0005%. Use profiling to determine the block of code that's taking up 90% of your runtime and concentrate on that.
I do have to wonder why, if A and B depend on each other to run, do you want them to be part of an independent toolset.
One solution is a compromise between the two (a sketch follows the list):
Create a library that contains A.
Create a library that contains B.
Create a program that spawns two threads, thread 1 running A and thread 2 running B.
Create a semaphore that tells A to run and another that tells B to run.
After the function that calls A in thread 1, increment B's semaphore.
After the function that calls B in thread 2, increment A's semaphore.
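A minimal sketch of that ping-pong arrangement, assuming run_A and run_B are the (hypothetical) library entry points for the two tools (link with -pthread; unnamed POSIX semaphores as used here are a Linux-friendly choice):

#include <pthread.h>
#include <semaphore.h>

#define ITERATIONS 1000

static sem_t sem_a, sem_b;                 /* "your turn" signals for A and B */

static void run_A(void) { /* ... generate problems ... */ }
static void run_B(void) { /* ... evaluate them ... */ }

static void *thread_a(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        sem_wait(&sem_a);                  /* wait until it's A's turn */
        run_A();
        sem_post(&sem_b);                  /* hand over to B */
    }
    return NULL;
}

static void *thread_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        sem_wait(&sem_b);
        run_B();
        sem_post(&sem_a);
    }
    return NULL;
}

int main(void)
{
    pthread_t ta, tb;
    sem_init(&sem_a, 0, 1);                /* A goes first */
    sem_init(&sem_b, 0, 0);
    pthread_create(&ta, NULL, thread_a, NULL);
    pthread_create(&tb, NULL, thread_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}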
Another possibility is to use file locking in your programs (a sketch follows the list):
Make both A and B execute in infinite loops (or however many times you're processing data).
At the top of the loop in both A and B, attempt to lock both files (if that fails, sleep and try again, so that you don't do anything until you hold the locks).
At the end of each loop, unlock the files and sleep for longer than the sleep in step 2.
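A rough sketch of the locking steps above, using flock(2) as one possible mechanism (the file names and the sleep intervals are arbitrary choices):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/file.h>

/* Try to take exclusive locks on both data files; sleep and retry on failure. */
static void lock_both(int fd1, int fd2)
{
    for (;;) {
        if (flock(fd1, LOCK_EX | LOCK_NB) == 0) {
            if (flock(fd2, LOCK_EX | LOCK_NB) == 0)
                return;                    /* got both locks */
            flock(fd1, LOCK_UN);           /* got only one; back off */
        }
        usleep(100 * 1000);                /* retry after 100 ms (step 2) */
    }
}

int main(void)
{
    int fd1 = open("data", O_RDWR);
    int fd2 = open("data2", O_RDWR);
    if (fd1 < 0 || fd2 < 0) { perror("open"); return 1; }

    for (;;) {                             /* the processing loop from the list */
        lock_both(fd1, fd2);
        /* ... read input, compute, write output ... */
        flock(fd1, LOCK_UN);
        flock(fd2, LOCK_UN);
        usleep(200 * 1000);                /* sleep longer than the retry sleep (step 3) */
    }
}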
Either of these solves the problem of the overhead of launching the programs between runs.
It's almost certainly not application startup which is the bottleneck. Linux will end up caching large portions of your programs, which means that launching will progressively get faster (to a point) the more times you start your program.
You need to look elsewhere for your bottleneck.

Simulating file system access

I am designing a file system in user space and need to test it. I do not want to use the available benchmarking tools, as my requirements are different. So, to test the file system, I wish to simulate file access operations. To do this, I first use the ftw() function to walk through one of my existing (experimental) file systems and list all the files and directories in a file.
Then I invoke a simulator to simulate file access by a number of processes. The simulator randomly starts a process, i.e. it spawns a thread that does what a real process would have done. The thread randomly selects a file operation (read, write, rename, etc.) and selects arguments for this operation from the list (generated by ftw()). The thread performs a number of such file operations and then exits, marking the end of a process. The simulator continues to spawn threads; thread executions can overlap just as real processes do. As operations are performed by the threads, files get inserted, deleted, and renamed, and the list of files is updated accordingly.
I have not yet started coding. Does the plan seem sane? I am also not sure how to code the simulator: how will it spawn threads over a period of time? Should I use some random delay to do this?
Thanks
Yep, that seems fairly reasonable to me. I would consider attempting to impose a statistical distribution over your file operations (and accesses to particular files) that is somehow matched to your expected workload. You might be able to find some statistics about typical filesystem workloads as a starting point.
That sounds about right for a decent test case, just to make sure it's working. You could use sleep() to wait between spawning threads, or just spawn them all at once and have them do an operation, wait a bit, then do another operation, and so on. IMO, if you hit it hard with a lot of requests and it works, there's a good chance your filesystem will do just fine. Take an example from PostMark, which does little more than append like crazy to different files, and from other benchmarks that do random-access reads/writes in different locations to make sure pages have to be read from disk.
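A minimal sketch of such a spawning loop with random inter-arrival delays (the thread count, delay bounds, and per-thread operation count are arbitrary placeholders; link with -pthread):

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_PROCESSES 50                   /* simulated "processes" to spawn */

/* One simulated process: a short random sequence of file operations. */
static void *simulated_process(void *arg)
{
    unsigned seed = (unsigned)(size_t)arg; /* per-thread seed: rand() is not thread-safe */
    int ops = 1 + rand_r(&seed) % 20;      /* each "process" does 1-20 operations */
    for (int i = 0; i < ops; i++) {
        /* ... pick a random op (read/write/rename) and a random file ... */
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NUM_PROCESSES];
    srand(42);                             /* fixed seed for reproducible runs */

    for (int i = 0; i < NUM_PROCESSES; i++) {
        pthread_create(&tids[i], NULL, simulated_process, (void *)(size_t)i);
        usleep((rand() % 500) * 1000);     /* random 0-499 ms between spawns */
    }
    for (int i = 0; i < NUM_PROCESSES; i++)
        pthread_join(tids[i], NULL);
    return 0;
}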

Finding which functions are called in a multi-process program without modifying source?

I'm working on a project where I need to find which functions get called in various Linux programs (written in C) given particular inputs. My current approach has been to compile a program with -pg (the profiling option), run it, and find which functions got called by processing gprof's output. Only functions that are called at least once appear in the output file.
The apparent problem is that only one process can write to the gprof output file. If the program forks multiple processes, I don't get any profiling output from the other processes.
Is there any way to make gprof produce an output file for each process (maybe labelled by pid)? The manual suggests having each process change into a different directory, but I don't want to modify the source code to do this. Is there another tool for Linux that can help?
Here they suggest using tprof.
Have you tried valgrind?
http://www.network-theory.co.uk/docs/valgrind/valgrind_17.html
--child-silent-after-fork=<yes|no> [default: no]
When enabled, Valgrind will not show any debugging or logging output for the child process resulting from a fork call. This can make the output less confusing (although more misleading) when dealing with processes that create children. It is particularly useful in conjunction with --trace-children=. Use of this flag is also strongly recommended if you are requesting XML output (--xml=yes), since otherwise the XML from child and parent may become mixed up, which usually makes it useless.
Take a look at GCov: http://gcc.gnu.org/onlinedocs/gcc/Gcov.html
