I'm working on a multi-process program which basically performs fuzzification on each layer of an RVB file (1 process -> 1 layer). Each child process creates a temp file using tmpfile(). After each child process finishes its job, the main process has to read each temp file created and assemble the data. The problem is that I don't know how to read each temp file from the main process: I can't access a child process's memory, so I can't know what the FILE pointer to the temp file it created is!
Any idea?
Don't hesitate to ask for clarifications if needed.
The tmpfile() function returns you a FILE pointer to a file with no determinate name - indeed, even the child process cannot readily determine a name for the file, let alone the parent (and on many Unix systems, the file has no name; it has been unlinked before tmpfile() returns to the caller).
extern FILE *tmpfile(void);
So, you are using the wrong temporary file creation primitive if you must convey file names around.
You have a number of options:
Have the parent process create the file streams with tmpfile() so that both the parent and children share the files. There are some minor coordination issues to handle - the parent will need to seek back to the start before reading what the children wrote, and it should only do that after the child has exited.
Use one of the filename generating primitives instead - mkstemp() is good, and if you need a FILE pointer instead of a file descriptor, you can use fdopen() to create one. You are still faced with the problem of getting file names from children to parent; again, the parent could open the files, or you can use a pipe for each child, or some shared memory, or ... take your pick of IPC mechanisms.
Have the parent open a pipe for each child before forking. The child process closes the read end of the pipe and writes to the write end; the parent closes the write end of the pipe and arranges to read from the read end. The issue here with multiple children is that the capacity of any given pipe is finite (64 KiB on modern Linux, and historically as little as 4-5 KiB). Consequently, you need to ensure the parent reads all the pipes completely, bearing in mind that the children won't be able to exit until all the data has been read (strictly, all but the last buffer full has been read).
Consider using threads - but be aware of the coordination issues using threads.
Decide that you do not need to use multiple threads of control - whether processes or threads - but simply have the main program do the work. This eliminates coordination and IPC issues - it does mean you won't benefit from the multi-core processor on the machine.
Of these, assuming parallel execution is of paramount importance, I'd probably use pipes to get the file names from the children (option 2); it has the fewest coordination issues. But for simplicity, I'd go with 'main program does it all' (option 5).
If you call tmpfile() in the parent process, the child will inherit all open descriptors and will be able to write to the file, and the open file will remain accessible to the parent as well.
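A minimal sketch of that approach, assuming one child per layer and a placeholder process_layer() standing in for the real fuzzification work:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_LAYERS 3                      /* assumption: one child per layer */

/* stand-in for the real per-layer fuzzification (hypothetical) */
static void process_layer(int layer, FILE *out)
{
    fprintf(out, "result of layer %d\n", layer);
}

int main(void)
{
    FILE *tmp[NUM_LAYERS];

    for (int i = 0; i < NUM_LAYERS; i++) {
        tmp[i] = tmpfile();               /* created in the parent... */
        if (tmp[i] == NULL) { perror("tmpfile"); exit(1); }

        if (fork() == 0) {                /* ...and inherited by the child */
            process_layer(i, tmp[i]);
            fflush(tmp[i]);               /* push the data into the file */
            _exit(0);
        }
    }

    while (wait(NULL) > 0)                /* let every child finish first */
        ;

    for (int i = 0; i < NUM_LAYERS; i++) {
        rewind(tmp[i]);                   /* seek back before reading */
        /* ... read and assemble the data from tmp[i] ... */
        fclose(tmp[i]);
    }
    return 0;
}

The key points are that the streams are created before fork(), the children flush before exiting, and the parent waits and rewinds before reading.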
You could create a tempfile in the parent process and then fork, then have the child process use that.
The child process can send the file descriptor back to the parent process.
EDIT: example code is on the APUE site (src.tar.gz, apue.2e/lib, recvfd.c and sendfd.c).
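Those helpers boil down to sending the descriptor as SCM_RIGHTS ancillary data over a Unix-domain socket. A stripped-down sketch (error handling mostly omitted; the names send_fd/recv_fd are just illustrative):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* send one file descriptor over a Unix-domain socket */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;                 /* this carries the fd */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* receive a file descriptor sent by send_fd(); returns it, or -1 on error */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg;
    int fd = -1;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

The parent would create the channel with socketpair(AF_UNIX, SOCK_STREAM, 0, sv) before forking; the child then calls send_fd(sv[1], fileno(tmpf)) and the parent calls recv_fd(sv[0]).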
Use threads instead of subprocesses? Put the names of the temporary files in another file? Don't use random names for the temp files, but (for example) names based on the pid of the parent process (to allow several instances of your program to run simultaneously) plus a sequential number?
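If you go the predictable-name route, a small sketch of building such a name (the directory and pattern are just examples; in a child you would use getppid(), or have the parent build the name before forking):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char name[64];
    int layer = 3;                       /* example sequence number */

    /* the parent pid keeps separate runs apart; the sequence number
       distinguishes the layers within one run */
    snprintf(name, sizeof name, "/tmp/fuzz_%ld_%d.tmp", (long) getpid(), layer);

    FILE *out = fopen(name, "wb");
    if (out != NULL) {
        /* ... a child writes its result here; the parent can rebuild the
           same name from the pid and the layer number and read it back ... */
        fclose(out);
    }
    return 0;
}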
I am comparatively new to Linux programming. I wonder whether the exec() function called after fork() can cause data loss in the parent process.
After a successful call to fork, a new process is created which is a duplicate of the calling process. One thing that gets duplicated are the file descriptors, so it's possible for the new process to read/write the same file descriptors as the original process. These may be files, sockets, pipes, etc.
The exec function replaces the currently running program in the current process with a new program, overwriting the memory of the old program in that process. So any data stored in the memory of the old program is lost. This does not however affect the parent process that forked this process.
When a new program is executed via exec, any open file descriptors that do not have the FD_CLOEXEC (close-on-exec) flag set (see the fcntl man page) are again preserved. So now you have two processes, each possibly running a different program, which may both write to the same file descriptor. If that happens, and the processes don't properly synchronize, data written by one process to the file may be overwritten by the other process.
So data loss can occur with regard to writing to file descriptors that the child process inherited from the parent process.
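If you want to make sure a particular descriptor does not survive an exec, you can mark it close-on-exec; a small sketch (the file name is just an example):

#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    /* "private.log" is only an example name */
    int fd = open("private.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

    /* set the close-on-exec flag so a program started with exec
       will not inherit this descriptor */
    int flags = fcntl(fd, F_GETFD);
    fcntl(fd, F_SETFD, flags | FD_CLOEXEC);

    /* on modern systems the same thing can be done atomically at open
       time by adding O_CLOEXEC to the open() flags */

    printf("fd %d is now close-on-exec\n", fd);
    return 0;
}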
I have a bunch of processes forked from the same parent process. And they need to read the same large file during initialization. Unfortunately, I do not have any control over the parent process.
Is it possible that one process open the file, read contents and save the trouble of other brother processes from opening and reading?
mmap does not seem to work, because I need to mmap the file before forking processes.
Simple shmget/shmat is not a good idea for the needed synchronization.
Use another separate process to load the file into shared memory, so the working processes no longer need to read the file. It works, although it is a little troublesome.
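For what it's worth, a rough sketch of that loader idea using POSIX shared memory (the object name "/shared_input" is arbitrary, error checking is mostly omitted, and some systems require linking with -lrt):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    /* read the input file once */
    FILE *in = fopen(argv[1], "rb");
    if (in == NULL) { perror("fopen"); return 1; }
    fseek(in, 0, SEEK_END);
    long size = ftell(in);
    rewind(in);

    /* create a named shared memory object that the workers can map */
    int shm = shm_open("/shared_input", O_CREAT | O_RDWR, 0600);
    ftruncate(shm, size);
    char *dst = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, shm, 0);

    fread(dst, 1, size, in);
    fclose(in);
    munmap(dst, size);
    close(shm);

    /* a worker then does:
         int shm = shm_open("/shared_input", O_RDONLY, 0);
         char *src = mmap(NULL, size, PROT_READ, MAP_SHARED, shm, 0);
       and never touches the original file; shm_unlink("/shared_input")
       removes the object once everyone is done */
    return 0;
}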
Is there any other way?
In *nix systems, processes are created using the fork() system call. Consider, for example, when the init process creates another process: first it forks itself, creating a child process with the same context as init. Only on calling exec() does this child process become a new process. So why is the intermediate step (of creating a child with the same context as the parent) needed? Isn't that a waste of time and resources, because we are creating a context (which consumes time and wastes memory) and then overwriting it?
Why is this not implemented as allocating a vacant memory area and then calling exec()? This would save time and resources, right?
The intermediate step enables you to set up shared resources in the child process without the external program being aware of it. The canonical example is constructing a pipe:
// read output of "ls"
// (error checking omitted for brevity)
int pipe_fd[2];
FILE *fp;
char line[256];

pipe(pipe_fd);                 // pipe() takes the int[2] array itself
if (fork() == 0) {             // child:
    close(pipe_fd[0]);         // we don't want to read from the pipe
    dup2(pipe_fd[1], 1);       // redirect stdout to the write end of the pipe
    execlp("ls", "ls", (char *) NULL);
    _exit(127);                // in case exec fails
}
// parent:
close(pipe_fd[1]);             // we don't want to write to the pipe
fp = fdopen(pipe_fd[0], "r");
while (fgets(line, sizeof line, fp) != NULL) {
    ...
}
fclose(fp);                    // also closes pipe_fd[0]
wait(NULL);                    // reap the child
Note how the redirection of standard output to the pipe is done in the child, between fork and exec. Of course, for this simple case, there could be a spawning API that would simply do this automatically, given the proper parameters. But the fork() design enables arbitrary manipulation of per-process resources in the child — one can close unwanted file descriptors, modify per-process limits, drop privileges, manipulate signal masks, and so on. Without fork(), the API for spawning processes would end up either extremely fat or not very useful. And indeed, the process spawning calls of competing operating systems typically fall somewhere in between.
As for the waste of memory, it is avoided with the copy on write technique. fork() doesn't allocate new memory for the child process, but points the child to the parent's memory, with the instructions to make a copy of a page only if the page is ever written to. This makes fork() not only memory-efficient, but also fast, because it only needs to copy a "table of contents".
This is an old complaint. Many people have asked Why fork() first? and typically they suggest an operation that will both create a new process from scratch and run a program in it. This operation is called something like spawn().
And they always say, Won't that be faster?
And in fact, every system other than the Unix family does go the "spawn" way. Only Unix is based on fork() and exec().
But it's funny, Unix has always been much faster than other full-featured systems. It has always handled way more users and load.
And Unix has been made even faster over the years. Fork() no longer really duplicates the address space, it just shares it using a technique called copy-on-write. (A very old fork optimization called vfork() is also still around.)
Drink the Kool-Aid.
I don't know exactly how the init process works at the kernel level in terms of forking, but to answer your question of why you need to call fork and then exec: it is simply because once you exec, there is no turning back.
If you check out the documentation here, it essentially requires a new process to be spawned (the fork call) in order for the parent process to resume control and either wait for it to finish or sit as a daemon probably would.
Only on calling exec() does this child process become a new process.
Not really. After a fork, you already have a new process, even if it is not that different from its parent. There are some cases where no exec needs to follow a fork.
So why is the intermediate step (of creating a child with the same context as the parent) needed?
One reason would be because it is an efficient way to create the whole shebang. Cloning is usually less complex than creating from scratch.
Isn't that a waste of time and resources, because we are creating a context (which consumes time and wastes memory) and then overwriting it?
It is not a waste of time and resources, as most of this resource is virtual, due to the copy-on-write mechanism used. Moreover, it is incorrect to state that the created context is overwritten. Nothing is rewritten, given that nothing was actually written in the first place. That's the whole point of COW. "Only" the process address space (code, heap and stack) is substituted, not overwritten. A lot of the process context is partially or totally preserved, including environment, file descriptors, priority, ignored signals, current and root directory, limits, various masks, processor bindings, privileges and several other things foreign to the process address space.
The Unix kernel represents open files using three data structures: the descriptor table, the file table, and the v-node table.
When a process opens a file twice, it gets two different descriptors in the descriptor table and two entries in the file table (so that they have different positions in the same file), and both of them point to one entry in the v-node table.
A child process inherits the parent process's descriptor table, so the kernel maintains one descriptor table for each process. But two descriptors from different processes can point to the same entry in the open file table.
So:
1. When the child process does some read on the file, does the offset of the same file change in the parent process?
2. If 1 is true, is there a convenient way for two processes to get the same effect as fork on the same file? That is, two processes sharing position (offset) information on the same file.
3. Is there a way to fork so that both processes have totally unrelated tables, as if they were two unrelated processes that merely opened the same files?
When the child process does some read on the file, does the offset of the same file change in the parent process?
Yes, since the offset is stored in the system-wide open file table. You could get a similar effect using dup or dup2.
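A small experiment that demonstrates the shared offset (the file read here is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* "/etc/hosts" is just a convenient existing file for the experiment */
    int fd = open("/etc/hosts", O_RDONLY);
    char buf[16];

    if (fork() == 0) {                    /* the child reads 16 bytes... */
        read(fd, buf, sizeof buf);
        _exit(0);
    }
    wait(NULL);

    /* ...and the parent's offset has moved too, because both descriptors
       refer to the same entry in the system-wide open file table */
    printf("parent offset after the child's read: %ld\n",
           (long) lseek(fd, 0, SEEK_CUR));
    return 0;
}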
If 1 is true, is there a convenient way for two processes to get the same effect as fork on the same file? That is, two processes sharing position (offset) information on the same file.
There is a technique called "passing the file descriptor" using Unix domain sockets. Look for "ancillary" data in sendmsg.
Is there a way to fork so that both processes have totally unrelated tables, as if they were two unrelated processes that merely opened the same files?
You have to open the file again to achieve this. Although it doesn't do what you want, you should also look for the FD_CLOEXEC flag.
I read in the man page of execve that if a process (A) calls execve, the already-open file descriptors are copied to the new process (B).
Two possibilities arise here:
1) Does it mean that a new file descriptor table is created for process B, whose entries are copied from the older file descriptor table of process A?
2) Or does process B get the file descriptor table of process A, since after execve process A will cease to exist and the already-open files could then only be closed from process B?
Which one is correct?
execve does not create a new process. It replaces the calling process's program image, memory space, etc. with new ones based on an executable file from the filesystem. The file descriptor table is modified by closing any descriptors with close-on-exec flag set; the rest remain open and in the same state they were in (current position, locks, etc.) prior to execve.
You're probably confusing this with what happens on fork since execve is usually preceded by fork. When a process forks, the child process has a new file descriptor table referring to the same open file descriptions as the parent process's file descriptor table.
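A small illustration of the first point: a descriptor opened before the exec keeps its state, so the exec'd program's output lands in the file it refers to (the file name is just an example):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* "out.txt" is an arbitrary example name */
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    dup2(fd, STDOUT_FILENO);              /* stdout now refers to out.txt */

    /* exec replaces this program, but descriptors without FD_CLOEXEC keep
       their state, so date's output ends up in out.txt */
    execlp("date", "date", (char *) NULL);
    return 127;                           /* reached only if exec fails */
}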
Which one is correct?
#2
Though what you ask is more of an OS implementation detail: it is rarely, if ever, important to applications, is completely transparent to them, and depends on the OS.
It is normally said that the new process inherits the file descriptors. Except those having the FD_CLOEXEC flag set, obviously.
Even in case of #1, if we presume that for some short time both process A and B are in memory (not really; that's fork() territory), copying of the fd table would be OK. Since process A would be terminated (by the exec()), all its file descriptors would be close()d, and that would have no effect on the already-copied file descriptors in process B. File descriptors are like pointers to the corresponding kernel structure containing the actual information about what the file descriptor points to. Copying the fd table doesn't make copies of the underlying structures; it copies only the pointers. The kernel structure contains a reference counter (required to implement fork()) which is incremented when a copy is made, so the kernel knows how many processes are using it. Calling close() on a file descriptor first of all decrements the reference counter. Only if the counter goes to zero (no more processes are using the structure) does the OS actually close the underlying file/socket/pipe/etc. (But obviously, even if inside the kernel two processes were present for some short time simultaneously, user-space applications couldn't see that, since the new process after exec() also keeps the PID of the original process.)