Sharing file descriptors across processes

I want to set up a shared memory environment for multiple independent processes. The data structure I want to share also contains connection fds, which are per-process.
I wanted to know whether there is a way to share these fds, or to use global fds, or something similar.
Thanks in advance.

There are two ways to share file descriptors on a Unix host. One is by letting a child process inherit them across a fork().
The other is sending file descriptors over a Unix domain socket with sendmsg(); see this example program, function send_connection (archived here). Note that a passed file descriptor may get a different number in the receiving process, so you may have to perform some dup2() magic to make the numbers in your shared memory come out right.
If you don't do this, the file descriptors in your shared memory region will be just integers, meaningless to the other processes.
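For reference, here is a minimal sketch of the sendmsg()/recvmsg() mechanics with SCM_RIGHTS ancillary data (the function names are mine, and error handling is abbreviated):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd over the connected Unix domain socket sock. */
int send_fd(int sock, int fd)
{
    char dummy = '\0';                 /* must send at least one data byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof u.buf,
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor; returns -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof u.buf,
    };
    struct cmsghdr *cmsg;
    int fd = -1;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;                         /* new number in this process */
}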

Recently, I had to solve a problem similar to what the OP is describing. To this end, I proposed a dedicated system call (a very simple one, I might add) to send file descriptors directly to cooperating processes, relying on POSIX.1b signal queues as the delivery medium (as an added benefit, this approach is inherently immune to the "fd recursion" attack, which plagues all VFS-based mechanisms to some degree).
Here's the proposed patch:
http://permalink.gmane.org/gmane.linux.kernel/1843084
(At present, the patch only adds the new syscall for the x86/x86_64 architectures, but wiring it up for other architectures is trivial; no platform-dependent features are used.)
The theory of operation is as follows. Both sender and receiver need to agree on one or more signal numbers to use for descriptor passing. These must be POSIX.1b signals, which guarantee reliable delivery, hence the SIGRTMIN offset. Also, smaller signal numbers have higher delivery priority, in case priority management is required:
int signo_to_use = SIGRTMIN + my_sig_off;
Then, the originating process invokes the system call:
int err = sendfd(peer_pid, signo_to_use, fd_to_send);
That's it; nothing else is necessary on the sender's side. Obviously, sendfd() will only succeed if the originating process has the right to signal the destination process and the destination process is not blocking/ignoring the signal.
It must also be noted that sendfd() never blocks; it returns immediately if the destination process' signal queue is full. In a well-designed application, this indicates that the destination process is in trouble anyway, or that there's too much work to do, so new workers should be spawned or work items dropped. The size of a process' signal queue can be configured via setrlimit() (RLIMIT_SIGPENDING), same as the number of available file descriptors.
The receiving process may safely ignore the signal (in this case nothing happens and almost no overhead is incurred on the kernel side). However, if the receiving process wants to get the delivered file descriptor, all it has to do is collect the signal info using sigtimedwait()/sigwaitinfo() or the more versatile signalfd():
/* First, the receiver needs to specify what it is waiting for: */
sigset_t sig_mask;
sigemptyset(&sig_mask);
sigaddset(&sig_mask, signo_to_use);
siginfo_t sig_info;
/* Then all it needs is to wait for the event: */
sigwaitinfo(&sig_mask, &sig_info);
After a successful return from sigwaitinfo(), sig_info.si_int will contain the new file descriptor, pointing to the same I/O object as the file descriptor sent by the originating process. sig_info.si_pid will contain the originating process' PID, and sig_info.si_uid its UID. If sig_info.si_int is less than zero (an invalid file descriptor), sig_info.si_errno will contain the errno for the actual error encountered during the fd duplication process.

Related

Should I close a single fifo that's written to by multiple threads after they are done?

I'm experimenting with a fictional server/client application where the client side launches request threads for a (possibly very long) period of time, with small delays in between. Each request thread writes the contents of the request to the 'public' FIFO (known by all client and server threads), and receives the server's answer in a 'private' FIFO that the server creates with an implicitly known name (in my case, 'tmp/processId.threadId').
The public fifo is opened once in the main (request thread spawner) thread so that all request threads may write to it.
Since I don't care about the return values of my request threads, and I can't know in advance how many request threads I will create (so I could store their ids and join them later), I opted to create the threads in a detached state, exit the main thread when the specified timeout expires, and let the already-spawned threads live on their own.
All of this is fine; however, I'm not closing the public FIFO anywhere after all spawned request threads finish: after all, I did exit the main thread without waiting. Is this a small kind of disaster, in which case I absolutely need to count the active threads (perhaps with a condition variable) and close the FIFO when the count reaches 0? Or should I just accept that the file is not explicitly closed, and let the OS do it?
Supposing that you genuinely mean a FIFO, such as might be created via mkfifo(), no, it's not a particular issue that the process does not explicitly close it. If any open handles on it remain when the process terminates, they will be closed. Depending on the nature of the termination, it might be that pending data are not flushed, but that is of no consequence if the FIFO is used only for communication among the threads of one process.
But it possibly is an issue that the process does not remove the FIFO. A FIFO has filesystem persistence. Once you create one, it lives until it no longer has any links to the filesystem and is no longer open in any process (like any other file). Merely closing it does not cause it to be removed. Aside from leaving clutter on your filesystem, this might cause issues for concurrent or future runs of the program.
If indeed you are using your FIFOs only for communication among the threads of a single process, then you would probably be better served by pipes.
I managed to solve this issue by setting up a cleanup routine with atexit(), which is called when the process terminates, i.e. when all threads finish their work.
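For illustration, a minimal sketch of that approach; FIFO_PATH and the thread-spawning part are placeholders of mine, and unlink() in the handler also removes the FIFO from the filesystem, addressing the persistence issue mentioned above:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

#define FIFO_PATH "/tmp/public_fifo"   /* assumed name for this sketch */

static int public_fd = -1;

static void cleanup(void)
{
    if (public_fd >= 0)
        close(public_fd);
    unlink(FIFO_PATH);                 /* remove the FIFO from the filesystem */
}

int main(void)
{
    if (mkfifo(FIFO_PATH, 0666) == -1)
        perror("mkfifo");              /* may already exist; not fatal here */
    /* O_RDWR on a FIFO is unspecified by POSIX but works on Linux and
     * avoids blocking in open() until a reader appears. */
    public_fd = open(FIFO_PATH, O_RDWR);
    if (public_fd == -1) {
        perror("open");
        return EXIT_FAILURE;
    }
    atexit(cleanup);                   /* runs on normal process termination */
    /* ... spawn detached request threads that write to public_fd ... */
    return EXIT_SUCCESS;
}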

Is there any way to execute a callback (on Linux) when a file descriptor is closed

I'm working on a kevent/kqueue emulation library for Linux. I'm a new maintainer on this project, and unfortunately, the previous maintainer is not involved much anymore (so I can't pick their brains about this).
Under FreeBSD and macOS, when you close() the file descriptor returned by kqueue(), you free any resources and events associated with it.
It seems like the existing code doesn't provide a similar interface. Before I add a function to the API (or revive an old one) to explicitly free kqueue resources, I was wondering if there was any way to associate triggers with a file descriptor on Linux, so that when it's closed we can clean up anything associated with the FD.
The file descriptor itself could be of any type, e.g. one provided by eventfd or epoll, or anything else that creates file descriptors.
When the last write file descriptor from a pipe() call is closed, epoll()/poll() waiters will see an [E]POLLHUP event on any read file descriptors still open. Presumably the same is true of any fd that represents a connection rather than state.
The solution to this is fairly simple, if a little annoying to implement. It relies on an fcntl() command called F_SETSIG, to specify the signal used to communicate FD state changes, and an fcntl() command called F_SETOWN_EX, to specify which thread the signal should be delivered to.
When the application starts it spawns a separate monitoring thread. This thread is used to receive FD generated signals.
In our particular use case, the monitoring thread must be started implicitly the first time a monitored FD is created, and destroyed without an explicit join. This is because we're emulating a FreeBSD API (kqueue), which does not have explicit init and deinit functions.
The monitoring thread:
Listens for the signal we passed to F_SETSIG.
Gets its thread ID, and stores it in a global.
Informs the application that the monitoring thread has started (and the global is filled) using pthread_cond_broadcast.
Calls pthread_detach to ensure it's cleaned up correctly without another thread needing to do an explicit pthread_join.
Calls sigwaitinfo to wait on delivery of a signal.
The application thread(s):
Uses pthread_once to start the monitoring thread the first time an FD is created, then waits for the monitoring thread to start fully.
Uses F_SETSIG to specify the signal sent when the FD is opened/closed, and F_SETOWN_EX to direct those signals to the monitoring thread (see the sketch below).
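As a rough illustration of that fcntl() setup (the signal number, the helper name, and the O_ASYNC use are assumptions of mine, not a quote from our implementation):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Assumed application-chosen realtime signal for FD state changes. */
#define SIG_FD_STATE (SIGRTMIN + 1)

/* Route state-change signals for fd to the monitoring thread whose
 * kernel tid is monitor_tid (obtainable via gettid()). */
static int watch_fd(int fd, pid_t monitor_tid)
{
    struct f_owner_ex owner = { .type = F_OWNER_TID, .pid = monitor_tid };
    int flags;

    if (fcntl(fd, F_SETSIG, SIG_FD_STATE) == -1)
        return -1;
    if (fcntl(fd, F_SETOWN_EX, &owner) == -1)
        return -1;
    /* O_ASYNC enables signal-driven notification on this descriptor. */
    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC);
}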
When a monitored FD is closed, the sigwaitinfo call in the monitoring thread returns. In our case we're using a pipe to represent the kqueue, so we need to map the FD we received the signal for to the one associated with the resources (kqueues) we need to free. Once this mapping is done, we may (see below for more information) clean up the resources associated with the FD pair, and call sigwaitinfo again to wait for more signals.
One of the other key pieces of this strategy is that the resources associated with the FDs are reference counted. This is because the signals are not delivered synchronously, so an FD can be closed, and a new FD created with the same number, before the signal indicating that the original FD was closed has been delivered and acted on. This would obviously cause big issues with active resources being freed.
To solve this we maintain a mutex synchronised FD to resource mapping array. Each element in this array contains a reference count for a particular FD.
In the case where the signal is not delivered before the FD is reused when creating a new pipe/resource pair the reference count for that particular FD will be > 0. When this occurs we immediately free the resource, and reinitialise it, increasing the reference count.
When the signal indicating the FD was closed is delivered, the reference count is decremented (but not to zero), and the resource is not freed.
Alternatively if the signal is delivered before the FD was reused, then the monitoring thread will decrement the reference count to zero, and immediately free the associated resources.
If this description is a bit confusing, you can look over our real-world implementation using any of the links above.
Note: Our implementation isn't exactly as described above (notably, we don't check the reference count of an FD when creating a new FD/resource mapping). I think this is because we rely on the fact that closing one end of the pipe doesn't necessarily result in the other end being closed, so the open end's FD isn't available for reuse immediately. Unfortunately, the developer who wrote the code isn't available for querying.

Is it possible to fork a process without inheriting the virtual memory space of the parent process?

As the parent process is using a huge amount of memory, fork may fail with an errno of ENOMEM under some configurations of the kernel overcommit policy, even though the child process may only exec a low-memory-consuming program like ls.
To clarify the problem: when /proc/sys/vm/overcommit_memory is configured to be 2, allocation of (virtual) memory is limited to SWAP + MEMORY * ratio (default 50%).
When a process forks, virtual memory is not copied, thanks to COW. But the kernel still needs to allocate virtual memory space. As an analogy, fork is like a malloc of the parent's virtual memory space size: it does not allocate physical memory, and only a write to the shared memory triggers a copy, at which point physical memory is allocated. When overcommit_memory is configured to be 2, fork may fail due to this virtual memory space allocation.
Is it possible to fork a process without inheriting the virtual memory space of the parent process under the following conditions?
if the child process calls exec after fork
if the child process doesn't call exec and will not use any global or static variables from the parent process. For example, the child process just does some logging and then quits.
As Basile Starynkevitch answered, it's not possible.
There is, however, a very simple and common solution for this that does not rely on Linux-specific behaviour or memory overcommit control: use an early-forked slave process to do the fork and exec.
Have the large parent process create a Unix domain socket and fork a slave process as early as possible, closing all other descriptors in the slave (and reopening STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO to /dev/null). I prefer a datagram socket for its simplicity and guarantees, although a stream socket will also work.
In some rare cases it is useful to have the slave process execute a separate dedicated small helper program. In most instances this is not necessary, but it can make the security design much easier. (In Linux, you can include SCM_CREDENTIALS ancillary messages when passing data over a Unix domain socket, and use the process ID therein to verify the identity/executable of the peer via the /proc/PID/exe pseudo-file.)
In any case, the slave process will block in reading from the socket. When the other end closes the socket, the read/receive will return 0, and the slave process will exit.
Each datagram the slave process receives describes a command to execute. (Using a datagram allows using C strings, delimited with NUL characters, without any escaping etc.; using a Unix stream socket typically requires you to delimit the "command" somehow, which in turn means escaping the delimiters in the command component strings.)
The slave process creates one or more pipes, and forks a child process. This child process closes the original Unix socket, replaces its standard streams with the respective pipe ends (closing the other ends), and executes the desired command. I personally prefer to use an extra close-on-exec socket in Linux to detect successful execution; in the error case, the errno code is written to that socket, so the slave-parent can reliably detect the failure and the exact reason, too. On success, the slave-parent closes the unnecessary pipe ends, and replies to the original process, attaching the other pipe ends as SCM_RIGHTS ancillary data. After sending the message, it closes the rest of the pipe ends and waits for a new message.
On the original process side, the above procedure is sequential; only one thread may start an external process at a time. (You simply serialize the access with a mutex.) Several executed processes can still run at the same time; it is only the request to, and response from, the slave helper that is serialized.
If that is an issue -- it should not be in typical cases -- you can for example multiplex the connections, by prefixing each message with an ID number (assigned by the parent process, monotonically increasing). In that case, you'll probably use a dedicated thread on the parent end to manage the communications with the slave, as you certainly cannot have multiple threads reading from the same socket at the same time, and expect deterministic results.
Further improvements to the scheme include things like using a dedicated process group for the executed processes, setting limits on them (by setting limits on the slave process), and executing the commands as dedicated users and groups by using a privileged slave.
The privileged slave case is where it is most useful to have the parent execute a separate helper process for it. In Linux, both sides can use SCM_CREDENTIALS ancillary messages via Unix domain sockets to verify the identity (PID, and via the PID, the executable) of the peer, making it rather straightforward to implement robust security. (But note that /proc/PID/exe has to be checked more than once, to catch attacks where a message is sent by a nefarious program that quickly executes the appropriate program, but with command-line arguments that cause it to exit soon, occasionally making it look like the correct executable made the request, while a copy of the descriptor -- and thus the entire communications channel -- was in the control of a nefarious user.)
In summary, the original problem can be solved, although the answer to the question as posed is no. If the executions are security-sensitive, for example changing privileges (user accounts) or capabilities (in Linux), then the design has to be considered carefully, but in normal cases the implementation is quite straightforward.
I'd be happy to elaborate if necessary.
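To make the scheme concrete, here is a heavily simplified sketch of the early-fork pattern (names are mine; passing pipe ends with SCM_RIGHTS, error reporting, and the security checks described above are all omitted):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

static int slave_fd = -1;

/* The slave: forked while the parent is still small. Each datagram is
 * one command line; a zero-length read means the parent closed its end. */
static void slave_loop(int fd)
{
    char buf[4096];
    ssize_t len;

    while ((len = recv(fd, buf, sizeof buf - 1, 0)) > 0) {
        buf[len] = '\0';
        if (fork() == 0) {
            /* crude: run the command via the shell, for brevity */
            execl("/bin/sh", "sh", "-c", buf, (char *)NULL);
            _exit(127);
        }
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;                          /* reap finished children */
    }
    _exit(0);
}

/* Call this before the parent allocates its large data structures. */
int init_slave(void)
{
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;
    switch (fork()) {
    case -1:
        return -1;
    case 0:                            /* child becomes the slave */
        close(sv[0]);
        slave_loop(sv[1]);             /* never returns */
    default:                           /* parent */
        close(sv[1]);
        slave_fd = sv[0];
        return 0;
    }
}

/* Later, even when the parent is huge, launching a command is cheap: */
int run_command(const char *cmd)
{
    return send(slave_fd, cmd, strlen(cmd) + 1, 0) == -1 ? -1 : 0;
}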
No, it is not possible. You might be interested in vfork(2), which I don't recommend. Look also into mmap(2) and its MAP_NORESERVE flag. But copy-on-write techniques are used by the kernel, so you won't actually double the RAM consumption.
My suggestion is to have enough swap space not to be concerned by such an issue. So set up your computer to have more available swap space than the largest running process. You can always create a temporary swap file (e.g. with dd if=/dev/zero of=/var/tmp/swapfile bs=1M count=32768 then mkswap /var/tmp/swapfile), add it as a temporary swap area (swapon /var/tmp/swapfile), and remove it (swapoff /var/tmp/swapfile and rm /var/tmp/swapfile) when you don't need it anymore.
You probably don't want to swap on a tmpfs file system like /tmp often is, since tmpfs file systems are themselves backed by swap space.
I dislike memory overcommitment and I disable it (through proc(5)). YMMV.
I'm not aware of any way to do (2), but for (1) you could try to use vfork, which forks a new process without copying the page tables of the parent process. But this generally isn't recommended, for a number of reasons, including that it causes the parent to block until the child performs an execve or terminates.
This is possible on Linux. Use the clone syscall without the flag CLONE_THREAD and with the flag CLONE_VM. The parent and child processes will use the same mappings, much like a thread would; there is no COW or page table copying.
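A minimal sketch of that clone() usage (the stack size and the exec'd program are arbitrary choices of mine; note that until the execl, the child shares and can modify the parent's memory):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static char child_stack[64 * 1024];    /* the child needs its own stack */

static int child_fn(void *arg)
{
    (void)arg;
    execl("/bin/ls", "ls", (char *)NULL);
    _exit(127);                        /* exec failed */
}

int main(void)
{
    /* CLONE_VM: share the address space, so no page tables are copied.
     * SIGCHLD: let the parent reap the child with waitpid() as usual. */
    pid_t pid = clone(child_fn, child_stack + sizeof child_stack,
                      CLONE_VM | SIGCHLD, NULL);
    if (pid == -1)
        return EXIT_FAILURE;
    waitpid(pid, NULL, 0);
    return EXIT_SUCCESS;
}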
madvise(addr, size, MADV_DONTFORK) marks the given address range as not to be inherited by the child across a fork().
Alternatively, you can call munmap() in the child after fork() to remove the virtual addresses inherited from the parent process.

Forcefully remove fcntl locks from a different process

Is there any way I can remove fcntl byte range locks on a file from a process that did not lock these ranges?
I have several processes that put byte-range locks on files. What I basically need is an external tool that would help me remove byte-range locks for files I specify.
There are two options that immediately come to mind.
Write a kernel module to do this.
As far as I know, there is no kernel facility to do this as of right now.
(You could add a new command to fcntl(), that given superuser privileges or same user as the owner of the lock, does the force-unlock or lock stealing.)
Write a small library that installs a realtime signal handler, say for SIGRTMAX. When this signal is caught, having been sent by sigqueue() with an int payload describing an open file descriptor, release all byte locks on that descriptor.
Alternatively, you can have the signal handler open and read a file or pipe (say /tmp/PID.lock), where the file or pipe contains a data packet defining which file or file descriptor and byte range to unlock.
As long as the library is loaded when the process starts (and possibly interposing all signal() and sigaction() calls to make sure your signal is kept in the call chain), this should work fine.
The second option requires that you preload the library (via LD_PRELOAD environment variable, or preloading it for all binaries using /etc/ld.so.conf).
The interposing library is not difficult at all to write. I have shown an example of using an interposing library to monitor fork() calls. In your case, you'd have to think of a good way to define the byte ranges to be unlocked (in file or pipe, triggered by a signal), and handle all that in the signal handler context; but there are enough async-signal-safe low-level unistd.h I/O functions to do this.
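A minimal sketch of the signal-handler variant (the choice of SIGRTMAX, carrying the descriptor in the sigqueue() payload, and unlocking the whole file rather than a specific range are all assumptions of mine):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Release all byte-range locks held on the descriptor carried in the
 * signal payload. fcntl() is async-signal-safe per POSIX. */
static void unlock_handler(int sig, siginfo_t *info, void *ctx)
{
    struct flock fl;

    (void)sig; (void)ctx;
    if (info->si_code != SI_QUEUE)
        return;                        /* only react to sigqueue() */
    memset(&fl, 0, sizeof fl);
    fl.l_type   = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;                   /* length 0 = to end of file */
    fcntl(info->si_value.sival_int, F_SETLK, &fl);
}

/* Runs automatically when the library is preloaded via LD_PRELOAD. */
__attribute__((constructor))
static void install_unlock_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = unlock_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMAX, &sa, NULL);
}

The external tool would then call sigqueue(pid, SIGRTMAX, value) with the target descriptor number as the integer payload.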

Synchronize two processes using two different states

I am trying to work out a way to synchronize two processes which share data.
Basically I have two processes linked using shared memory. I need process A to set some data in the shared memory area, then process B to read that data and act on it.
The sequence of events I am looking to have is:
B blocks waiting for data available signal
A writes data
A signals data available
B reads data
B blocks waiting for data not available signal
A signals data not available
All goes back to the beginning.
In other terms, B would block until it got a "1" signal, get the data, then block again until that signal went to "0".
I have managed to emulate it OK using purely shared memory, but either I block using a while loop which consumes 100% of CPU time, or I use a while loop with a nanosleep in it which sometimes misses some of the signals.
I have tried using semaphores, but I can only find a way to wait for a zero, not for a one, and trying to use two semaphores just didn't work. I don't think semaphores are the way to go.
There will be numerous processes all accessing the same shared memory area, and all processes need to be notified when that shared memory has been modified.
It's basically trying to emulate a hardware data and control bus, where events are edge rather than level triggered. It's the transitions between states I am interested in, rather than the states themselves.
So, any ideas or thoughts?
Linux has its own eventfd(2) facility that you can incorporate into your normal poll/select loop. You can pass an eventfd file descriptor from process to process through a Unix socket the usual way, or just inherit it with fork(2).
Edit 0:
After re-reading the question, I think one of your options is signals and process groups: start your "listening" processes in the same process group (setpgid(2)), then signal them all with a negative pid argument to kill(2) or sigqueue(2). Again, Linux provides signalfd(2) for polling and avoiding slow signal trampolines.
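For the eventfd option, a minimal sketch (creating the descriptor before fork(2) so both processes share it; in the multi-process case you would pass it over a Unix socket instead):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/wait.h>

int main(void)
{
    uint64_t value;
    int efd = eventfd(0, 0);           /* counter starts at 0 */

    if (efd == -1)
        return EXIT_FAILURE;

    if (fork() == 0) {                 /* reader: process B */
        read(efd, &value, sizeof value);   /* blocks while counter is 0 */
        printf("B: data available, counter=%llu\n",
               (unsigned long long)value);
        _exit(0);
    }

    /* writer: process A puts data into shared memory, then signals */
    value = 1;
    write(efd, &value, sizeof value);  /* wakes the blocked reader */
    wait(NULL);
    return EXIT_SUCCESS;
}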
If only two processes are involved, you can use a file, shared memory, or even networking to pass the flag or signal. If more processes are involved, there may be suitable solutions that involve modifying the kernel. There is one shared memory region in your question, right? How are the signals passed now?
On Linux, all POSIX control structures (mutexes, condition variables, read-write locks, semaphores) have an option such that they can also be used between processes if they reside in shared memory. For the flow that you describe, a classic mutex/condition pair seems to fit the job well. Look into the man pages of the ..._init functions for these structures.
Linux also has other facilities, such as futex, to handle this even more efficiently. But those are probably not the right tools to start with.
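A sketch of that mutex/condition approach, with the pair placed in POSIX shared memory (names are mine; for brevity, one process is assumed to create and initialize the segment before the others attach, and error handling is abbreviated). Compile with -pthread (and -lrt on older glibc):

#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

struct shared {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             data_available;    /* the "1"/"0" level the OP wants */
};

/* Create and initialize the shared segment (run by one process only). */
struct shared *shared_create(const char *name)
{
    pthread_mutexattr_t ma;
    pthread_condattr_t  ca;
    struct shared *s;
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);

    if (fd == -1 || ftruncate(fd, sizeof *s) == -1)
        return NULL;
    s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (s == MAP_FAILED)
        return NULL;

    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&s->lock, &ma);

    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&s->cond, &ca);
    return s;
}

/* B: block until A flips the flag to the wanted state (1 or 0). */
void wait_for_state(struct shared *s, int wanted)
{
    pthread_mutex_lock(&s->lock);
    while (s->data_available != wanted)
        pthread_cond_wait(&s->cond, &s->lock);
    pthread_mutex_unlock(&s->lock);
}

/* A: publish the new state and wake all waiters. */
void set_state(struct shared *s, int state)
{
    pthread_mutex_lock(&s->lock);
    s->data_available = state;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->lock);
}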
1. Single Reader & Single Writer
This can be implemented using semaphores.
In the POSIX semaphore API, sem_wait() blocks while the semaphore's count is zero; once the count is incremented using sem_post() from the other process, the wait finishes.
In this case you have to use 2 semaphores for synchronization:
process 1 (reader)
sem_wait(sem1);    /* block until the writer posts "data available" */
.......            /* read the data */
sem_post(sem2);    /* tell the writer the data was consumed */
process 2 (writer)
sem_wait(sem2);    /* block until the reader posts "data consumed" */
.......            /* write the data */
sem_post(sem1);    /* tell the reader new data is available */
Initialize sem1 to 0 and sem2 to 1 so that the writer runs first. In this way you can achieve synchronization through shared memory.
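A runnable sketch of the same handshake using named POSIX semaphores, so two independent processes can open them (the semaphore names and the argv convention are mine; compile with -pthread, and note the initial values only apply when the semaphores are first created):

#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    /* sem1 = "data available" (starts 0), sem2 = "data consumed" (starts 1) */
    sem_t *sem1 = sem_open("/demo_sem1", O_CREAT, 0600, 0);
    sem_t *sem2 = sem_open("/demo_sem2", O_CREAT, 0600, 1);

    if (sem1 == SEM_FAILED || sem2 == SEM_FAILED)
        return 1;

    if (argc > 1) {                    /* run as "./demo w" => writer */
        sem_wait(sem2);                /* wait until previous data consumed */
        puts("writer: producing data");
        sem_post(sem1);                /* announce data available */
    } else {                           /* run with no argument => reader */
        sem_wait(sem1);                /* block until data available */
        puts("reader: consuming data");
        sem_post(sem2);                /* announce data consumed */
    }
    return 0;
}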
