What happens after a process is forked?

Say a process is forked from another process; in other words, we replicate a process through the fork function call. Since forking uses a copy-on-write mechanism, whenever the forked process or the original process writes to a page, it gets a new physical page to write to. From what I've understood, things go like this when both the forked and the original process are executing:
--> When forking, all pages of the original and forked process are marked read-only, so that the kernel gets to know when a page is written. When that happens, the kernel maps a new physical page for the writing process, copies the previous content to it, and then grants write access to that page. What I am not clear about: if both the forked and the original process write to the same page, will one of them still hold the original physical page (the one from before the fork), or will both get new physical pages? Secondly, is my assumption correct that all pages in the forked and original process are marked read-only at the time of forking?
--> Since each page fault triggers an interrupt, each first write in the original or forked process will slow down execution. Say we know something about the application, and we know that a lot of contiguous memory pages will be written; wouldn't it be better to grant write permission to multiple pages (a group of pages, let's say) when one page in the group is written to? That would reduce the number of interrupts due to page-fault handling, wouldn't it? Sure, we may sometimes make a copy unnecessarily in this case, but I think an interrupt has much more overhead than writing 512 variables of type long (the 4096 bytes of a page). Is my understanding correct, or am I missing something?

If I'm not mistaken, one of the processes will be seen as writing to the page first. Even if you have multiple cores, I believe the page fault will be handled serially. In that case, the first one to be caught will decouple the pages of the two processes, so by the time the second writes to it, there won't be a fault, because it'll now have a writable page of its own.
I believe when that's done, the existing page is kept by one process (and set back to read/write), and one new copy is made for the other process.
I think your third point revolves around one simple point: "Say if we know about the application...". That's the problem right there: the OS does not know about the application. Essentially the only thing it "knows" will be indirect, through observation by the kernel coders. They will undoubtedly observe that fork is normally followed by exec, so that's the case for which they will undoubtedly optimize. Yes, that's not always the case, and you're obviously concerned about the other cases -- all I'm saying here is that they're sufficiently unusual that I'd guess little effort is expended on them.
I'm not quite sure I follow the logic or math about 512 longs in a 4096-byte page -- the first time a page is written, it gets duplicated and decoupled between the processes. From that point onward, further writes to either process's copy of that page will not cause any further page faults (at least none related to copy-on-write -- of course, if a process sits idle a long time, that data might be paged out to the page file, or something on that order, but that's irrelevant here).

Fork semantically makes a copy of a process. Copy-on-write is an optimization which makes it much faster. Optimizations often have some hidden trade-off. Some cases are made faster, but others suffer. There is a cost to copy-on-write, but we hope that there will usually be a saving, because most of the copied pages will not in fact be written to by the child. In the ideal case, the child performs an immediate exec.
So we suffer the page fault exceptions for a small number of pages, which is cheaper than copying all the pages upfront.
Most "lazy evaluation" type optimizations are of this nature.
A lazy list of a million items is more expensive to fully instantiate than a regular list of a million items. But if the consumer of the list only accesses the first 100 items, the lazy list wins.

Well, the initial cost would be very high if fork() didn't use COW. If you look at a typical top display, the RSS/VSIZE ratio is very small (e.g. 2MB/56MB for a typical vi session).
Cloning a process without COW would cause a tremendous amount of memory pressure, which would actually cause other processes to lose their attached pages (these would have to be moved to secondary storage, and maybe later restored). That paging would cost 1-2 disk I/Os per page (the swap-out is only needed if the page is new or dirty; the swap-in is only needed if the page is ever again referenced by the other process).
Another point is granularity: back in the days when MMUs did not exist, whole processes had to be swapped out to yield their memory, causing the system to freeze for a second or so. Page-faulting on a per-page basis causes more traps, but these are spread out nicely, allowing the processes to actually compete for physical RAM.
Without prior knowledge, it's hard to beat an LRU scheme.


`mmap()` manual concurrent prefaulting / paging

I'm trying to fine tune mmap() to perform fast writes or reads (generally not both) of a potentially very large file. The writes and reads will be mostly sequential on one pass and then likely very sparse on future passes. No region of memory needs to be accessed more than once.
In other words, think of it as a file transfer with some lossiness that gets fixed asynchronously.
It appears, as expected, that the main limitation of mmap()'s performance is the number of minor page faults it generates on large files. Furthermore, I suspect the laziness of the Linux kernel's page-to-disk writeback is causing some performance issues. Namely, any test programs that end up performing huge writes to mmapped memory seem to take a long time after performing all writes to terminate/munmap the memory.
I was hoping to offset the cost of these faults by concurrently prefaulting pages while performing the almost-sequential access and paging out pages that I won't need again. But I have three main questions regarding this approach and my understanding of the problem:
Is there a straightforward (preferably POSIX [or at least OSX] compatible) way of performing a partial prefault? I am aware of the MAP_POPULATE flag, but this seems to attempt loading the entire file into memory, which is intolerable in many cases. Also, this seems to cause the mmap() call to block until prefaulting is complete, which is also intolerable. My idea for a manual alternative was to spawn a thread simply to try reading the next N pages in memory to force a prefetch. But it might be that madvise with MADV_SEQUENTIAL already does this, in effect.
msync() can be used to flush changes to the disk. However, is it actually useful to do this periodically? My idea is that it might be useful if the program is frequently in an "idle" state of disk IO and can afford to squeeze in some disk writebacks. Then again, the kernel might very well be handling this itself better than the application ever could.
Is my understanding of disk IO accurate? My assumption is that prefaulting and reading/writing pages can be done concurrently by different threads or processes; if I am wrong about this, then manual prefaulting would not be useful at all. Similarly, if an msync() call blocks all disk IO, both to the filesystem cache and to the raw filesystem, then there also isn't as much of an incentive to use it over flushing the entire disk cache at the program's termination.
It appears, as expected, that the main limitation of mmap()'s performance is the number of minor page faults it generates on large files.
That's not particularly surprising, I agree. But this is a cost that cannot be avoided, at least for the pages corresponding to regions of the mapped file that you actually access.
Furthermore, I suspect the laziness of the Linux kernel's page-to-disk writeback is causing some performance issues. Namely, any test programs that end up performing huge writes to mmapped memory seem to take a long time after performing all writes to terminate/munmap the memory.
That's plausible. Again, this is an unavoidable cost, at least for dirty pages, but you can exercise some influence over when those costs are incurred.
I was hoping to offset the cost of these faults by concurrently prefaulting pages while performing the almost-sequential access and paging out pages that I won't need again. But I have three main questions regarding this approach and my understanding of the problem:
Is there a straightforward (preferably POSIX [or at least OSX] compatible) way of performing a partial prefault? I am aware of the MAP_POPULATE flag, but this seems to attempt loading the entire file into memory,
Yes, that's consistent with its documentation.
which is intolerable in many cases. Also, this seems to cause the mmap() call to block until prefaulting is complete,
That's also as documented.
which is also intolerable. My idea for a manual alternative was to spawn a thread simply to try reading the next N pages in memory to force a prefetch.
Unless there's a delay between when you initially mmap() the file and when you want to start accessing the mapping, it's not clear to me why you would expect that to provide any improvement.
But it might be that madvise with MADV_SEQUENTIAL already does this, in effect.
If you want POSIX compatibility, then you're looking for posix_madvise(). I would indeed recommend using this function instead of trying to roll your own userspace alternative. In particular, if you use posix_madvise() to assert POSIX_MADV_SEQUENTIAL on some or all of the mapped region, then it is reasonable to hope that the kernel will read ahead to load pages before they are needed. Additionally, if you advise with POSIX_MADV_DONTNEED then you might, at the kernel's discretion, get earlier sync to disk and overall less memory use. There is other advice you can pass by this mechanism, too, if it is useful.
msync() can be used to flush changes to the disk. However, is it actually useful to do this periodically? My idea is that it might be useful if the program is frequently in an "idle" state of disk IO and can afford to squeeze in some disk writebacks. Then again, the kernel might very well be handling this itself better than the application ever could.
This is something to test. Note that msync() supports asynchronous syncing, however, so you don't need I/O idleness. Thus, when you're sure you're done with a given page you could consider msync()ing it with flag MS_ASYNC to request that the kernel schedule a sync. This might reduce the delay incurred when you unmap the file. You'll have to experiment with combining it with posix_madvise(..., ..., POSIX_MADV_DONTNEED); they might or might not complement each other.
Is my understanding of disk IO accurate? My assumption is that prefaulting and reading/writing pages can be done concurrently by different threads or processes; if I am wrong about this, then manual prefaulting would not be useful at all.
It should be possible for one thread to prefault pages (by accessing them), while another reads or writes others that have already been faulted in, but it's unclear to me why you expect such a prefaulting thread to be able to run ahead of the one(s) doing the reads and writes. If it has any effect at all (i.e. if the kernel does not prefault on its own) then I would expect prefaulting a page to be more expensive than reading or writing each byte in it once.
Similarly, if an msync() call blocks all disk IO, both to the filesystem cache and to the raw filesystem, then there also isn't as much of an incentive to use it over flushing the entire disk cache at the program's termination.
There is a minimum number of disk reads and writes that will need to be performed on behalf of your program. For any given mmapped file, they will all be performed on the same I/O device, and therefore they will all be serialized with respect to one another. If you are I/O bound then to a first approximation, the order in which those I/O operations are performed does not matter for overall runtime.
Thus, if the runtime is what you're concerned with, then probably neither posix_madvise() nor msync() will be of much help unless your program spends a significant fraction of its runtime on tasks that are independent of accessing the mmapped file. If you do find yourself not wholly I/O bound then my suggestion would be to see first what posix_madvise() can do for you, and to try asynchronous msync() if you need more. I'm inclined to doubt that userspace prefaulting or synchronous msync() would provide a win, but in optimization, it's always better to test than to (only) predict.

Modify read-only memory at low overhead

Assume that I have a page of memory that is read-only (e.g., set through mmap/mprotect). How do I modify one word (8 bytes) on this page at the lowest possible overhead?
Some context: I assume x86-64, Linux as my runtime environment. The modifications happen rarely but frequently enough that I have to worry about overhead. The page is read-only to protect some important data, which the program must read frequently, against rogue/illegal modifications. There are only a few places that are allowed to modify the data on the page, and I know all the locations of these places and the address of the page statically. The problem I'm trying to solve is protecting some data against memory-safety bugs in the program, with a few authorized places where I need to make modifications to the data. The modifications are not frequent, but frequent enough that several kernel round-trips (through system calls) are too costly.
So far, I thought of the following solutions:
mprotect
ptrace
shared memory
new system call
mprotect
mprotect(addr, 4096, PROT_WRITE | PROT_READ);
addr[12] = 0xc0fec0fe;
mprotect(addr, 4096, PROT_READ);
The mprotect solution is clean, simple, and straightforward. Unfortunately, it involves two round trips into the kernel and will incur some overhead. In addition, the whole page is writable during that time frame, allowing another thread to modify that memory area concurrently.
ptrace
Unfortunately, ptracing yourself is no longer possible (as a ptraced process needs to be stopped). So the solution is to fork, ptrace the child process, and then use PTRACE_POKETEXT to write to the child process's memory.
This option has the drawback of requiring an extra tracer process, and it will cause problems if the traced program itself uses multiple processes. The overhead per write is at least one system call for ptrace, plus the required synchronization between the processes.
shared memory
Shared memory is similar to the ptrace solution, except that it removes the per-write system call. Both processes set up shared memory with different permissions (RW in the child, R in the parent). The two processes still need to synchronize on each write, which is then carried out by the process holding the writable mapping. Shared memory has similar drawbacks in complexity as the ptrace solution, and similar incompatibilities with multiple communicating processes.
new system call
Adding a new system call to the kernel would solve the problem and would only require a single system call to modify one word in the process without having to change the page tables or the requirement to set up multiple communicating processes.
Is there anything that is faster than the 4 discussed/sketched solutions? Could I rely on any debug features? Are there any other neat low-level systems tricks?

Mitigating memory leaks by forking

This is a really ugly question.
I have a C++ program which does the following in a loop:
Waits for a JMS message
Calculates some data
Sends a JMS message in response
My program (let's call it "Bob") has a rather severe memory leak. The memory leak is located in a shared library that someone else wrote, which I must use, but the source code to which I do not have access.
This memory leak causes Bob to crash during the "calculates some data" phase of the loop. This is a problem, because another program is awaiting Bob's response, and will be very upset if it does not receive one.
Due to various restrictions (yes, this is an X/Y problem, I told you it was ugly), I have determined that my only viable strategy is to modify Bob so that it does the following in its loop:
Waits for a JMS message
Calculates some data
Sends a JMS message in response
Checks to see whether it's in danger of using "too much" memory
If so, forks and execs another copy of itself, and gracefully exits
My question is as follows:
What is the best (reliable but not too inefficient) way to detect whether we're using "too much" memory? My current thought is to compare getrlimit(RLIMIT_AS) rlim_cur to getrusage(RUSAGE_SELF) ru_maxrss; is that correct? If not, what's a better way? Bob runs in a Linux VM on various host machines, all with different amounts of memory.
Assuming the memory leak occurs in the "Calculates some data" phase, I think it might make more sense to just refactor that portion into a separate program and fork out to execute that in its own process. That way you can at least isolate the offending code and make it easier to replace it in the future, rather than just masking the problem by having the program restart itself when it runs low on memory.
The "Calculates some data" part can either be a long-running process that waits for requests from the main program and restarts itself when necessary, or (even simpler) it could be a one-and-done program that just takes its data in *argv and sends its results to stdout. Then your main loop can just fork out and exec it every time through, and read the results when they come back. I would go with the simpler option if possible, but that will of course depend on what your needs are.
Whether you make the program restart itself or fork the "Calculates some data" section into a separate process, in either case you'll need to check memory consumption. Since you're on Linux, an easy way to check this is to get the pid of the process of interest and read the contents of the file /proc/$PID/statm. The second number is the size of the resident set, in pages.
Reading these proc files is the way tools like top and htop get their data about processes. Periodically reading a ~30-byte in-memory file to check for the memory leak doesn't sound too inefficient.
If the leak is regular and you want to make it a bit more sophisticated, you could even keep track of the rate of growth and adjust your rate of checks accordingly.

Executing an external program when forking is not advisable

I have a big piece of server software that can hog 4-8GB of memory.
This makes fork-exec cumbersome, as the fork itself can take significant time, plus the default behavior seems to be that fork will fail unless there is enough memory for a copy of the entire resident memory.
Since this is starting to show as the hottest spot (60% of time spent in fork) when profiling I need to address it.
What would be the easiest way to avoid fork-exec routine?
You basically cannot avoid fork(2) (or the equivalent clone(2) syscall, or the obsolete vfork, which I don't recommend using) + execve(2) to start an external command (à la system(3) or à la posix_spawn) on Linux and (probably) MacOSX and most other Unix or POSIX systems.
What makes you think it is becoming an issue? An 8GB process virtual address space is not a big deal today (at least on machines with 8 or 16 Gbytes of RAM, like my desktop). You don't practically need twice as much RAM (but you do need swap space), thanks to the lazy copy-on-write techniques used by all recent Unixes and Linux.
Perhaps you might believe that swap space could be an issue. On Linux, you could add swap space, perhaps by swapping on a file; just run as root:
dd if=/dev/zero of=/var/tmp/myswap bs=1M count=32768
mkswap /var/tmp/myswap
swapon /var/tmp/myswap
(of course, be sure that /var/tmp/ is not a tmpfs mounted filesystem, but sits on some disk, perhaps an SSD one....)
When you no longer need that extra swap space, run swapoff /var/tmp/myswap.
You could also start some external shell process near the beginning of your program (à la popen) and later send shell commands to it. Look at my execicar.c program for inspiration, or use it if it fits (I wrote it 10 years ago for similar purposes, but I have forgotten the details).
Alternatively, fork some interpreter (Lua, Guile...) at the beginning of your program and send commands to it.
Running more than a few dozen commands per second (i.e., starting an external program that often) is not reasonable, and should be considered a design mistake, IMHO. Perhaps the commands you are running could be replaced by in-process functions (e.g. /bin/ls can be done with the stat, readdir, and glob functions). Perhaps you might consider adding some plugin ability (with dlopen(3) & dlsym) to your code (and run functions from plugins instead of repeatedly starting the same programs). Or perhaps embed an interpreter (Lua, Guile, ...) inside your code.
As an example, for web servers, look for old CGI vs FastCGI or HTTP forwarding (e.g. URL redirection) or embedded PHP or HOP or Ocsigen
This makes fork-exec cumbersome, as the fork itself can take significant time
This is only half true. You didn't specify the OS, but fork(2) is well optimized on Linux (and, I believe, on other UNIX variants) through copy-on-write: the operating system will not copy the entire parent memory address space until the child (or the parent) writes to memory. So you can rest assured that if you have a parent process using 8 GB of memory and then you fork, you won't be using 16 GB of memory -- especially if the child exec()s something immediately.
fork will fail unless there is enough memory for a copy of the entire resident memory.
No. The only overhead incurred by fork(2) is the copying and allocation of a task structure for the child, the allocation of a PID, and copying the parent's page tables. fork(2) will not fail if there isn't enough memory to copy the entire parent's address space; it will fail if there isn't enough memory to allocate a new task structure and the page tables. It may also fail if the maximum number of processes for the user has been reached. You can confirm this in man 2 fork (NOTE: see comments below).
If you still don't want to use fork(2), you can use vfork(2), which does no copying at all - it doesn't even copy the page tables - everything is shared with the parent. You can use that to create a new child process with a negligible overhead and then exec() something. Be aware that vfork(2) blocks the calling thread until the child either exits or calls one of the seven exec() functions. You also shouldn't modify the memory inside the child process before calling any of the exec() functions.
You mentioned that you can fork+exec 10k times per second. That sounds very excessive. Have you considered making the things you're execing into a daemon? Or maybe implement those external programs inside your application? It sounds very dodgy to have to fork that much.
fork most likely starts failing for you, despite there being memory to back it, because you're on a flavor of Linux that has restricted memory overcommit. Check the file /proc/sys/vm/overcommit_memory. If it's 1 then my guess is wrong and something else weird is going on. If it's 2 then overcommit is disabled: the kernel enforces a strict commit limit derived from swap plus a fraction of RAM (see overcommit_ratio), and a large fork can exceed it. If it's 0 the kernel uses a heuristic that can still refuse obviously excessive allocations; read the documentation for how exactly this is configured.
One solution mentioned above is just adding swap (which will never actually get used).
Another solution is to implement a small daemon that will take commands and execute those forks and execs for you piping back whatever output you need.
N.B. Forking a large process can in theory be as fast as forking a small one. The performance of fork is determined by how many memory mappings you have, rather than how much memory they cover, since setting up copy-on-write is done per mapping. Except that on certain operating systems, setting up COW of anonymous mappings is linear in the amount of memory in those mappings; I don't know what Linux does here -- the last time I studied the Linux VM system was over 15 years ago.

byte level write access protection?

Protecting a page for read and/or write access is possible, as there are bits in the page table entry that can be turned on and off at kernel level. Is there a way in which a certain region of memory can be protected from write access? Let's say, in a C structure, there are certain variables which need to be write-protected, and any write access to them triggers a segfault and a core dump. It's something like a scaled-down version of mprotect(), which works at page level -- is there a mechanism to do a similar kind of thing at byte level in user space?
thanks, Kapil Upadhayay.
No, there is no such facility. If you need per-data-object protections, you'll have to allocate at least a page per object (using mmap). If you also want some protection against access beyond the end of the object (for arrays), you might allocate at least one more page than you need, align the object so it ends right at a page boundary, and use mprotect to protect the one or more additional pages you allocated.
Of course this kind of approach will result in programs that are very slow and waste lots of resources. It's probably not viable except as a debugging technique, and valgrind can meet that need much more effectively without having to modify your program...
One way, although terribly slow, is to protect the whole page on which the object lies. Whenever a write access to that page happens, your custom handler for the invalid page access gets called and resolves the situation by quickly unprotecting the page, writing the data, and then protecting the page again.
This works fine for single-threaded programs, I'm not sure what to do for multi-threaded programs.
This idea is probably not new, so you may be able to find some information or even a ready-made implementation of it.
