Forcefully remove fcntl locks from a different process - c

Is there any way I can remove fcntl byte range locks on a file from a process that did not lock these ranges?
I have several processes that put byte range locks on files. What I basically need to come up with is an external tool that would help me remove byte range locks for files I specify.

There are two options that immediately come to mind.
1. Write a kernel module to do this.
As far as I know, there is no kernel facility to do this as of right now.
(You could add a new command to fcntl() that, given superuser privileges or the same user as the owner of the lock, does the force-unlock or lock stealing.)
2. Write a small library that installs a handler for a realtime signal, say SIGRTMAX. When the handler catches this signal, sent by sigqueue() with an int payload that names an open file descriptor, it releases all byte-range locks on that descriptor.
Alternatively, you can have the signal handler open and read a file or pipe (say /tmp/PID.lock), where the file or pipe contains a data packet defining which file or file descriptor and byte range to unlock.
As long as the library is loaded when the process starts (and possibly interposes all signal() and sigaction() calls to make sure your signal is kept in the call chain), this should work fine.
The second option requires that you preload the library (via the LD_PRELOAD environment variable, or by preloading it for all binaries using /etc/ld.so.preload).
The interposing library is not difficult at all to write. I have shown an example of using an interposing library to monitor fork() calls. In your case, you'd have to think of a good way to define the byte ranges to be unlocked (in file or pipe, triggered by a signal), and handle all that in the signal handler context; but there are enough async-signal-safe low-level unistd.h I/O functions to do this.
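A minimal sketch of the handler described in the second option, assuming the simplest payload: the int sent with sigqueue() is a descriptor number in the target process, and every byte-range lock on that descriptor is dropped. The names unlock_handler, install_unlock_handler and request_unlock are made up for illustration; a real preloaded library would presumably call install_unlock_handler() from a constructor function.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Counters the rest of the program can inspect (illustrative only). */
static volatile sig_atomic_t unlock_calls = 0;
static volatile sig_atomic_t unlock_failures = 0;

/* Handler: the int payload sent with sigqueue() names the descriptor to unlock. */
static void unlock_handler(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    int saved_errno = errno;

    if (info->si_code == SI_QUEUE) {
        /* l_start = 0 and l_len = 0 cover the whole file. */
        struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 0 };
        /* fcntl() is on the POSIX list of async-signal-safe functions. */
        if (fcntl(info->si_value.sival_int, F_SETLK, &fl) == -1)
            unlock_failures++;
        unlock_calls++;
    }
    errno = saved_errno;
}

/* Install the handler for the given (realtime) signal. Returns 0 or -1. */
int install_unlock_handler(int signo)
{
    struct sigaction act;
    memset(&act, 0, sizeof act);
    act.sa_sigaction = unlock_handler;
    act.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&act.sa_mask);
    return sigaction(signo, &act, NULL);
}

/* What the external tool would do: ask process `pid` to drop its locks on
   descriptor `fd` (a descriptor number in the target process, not in ours). */
int request_unlock(pid_t pid, int signo, int fd)
{
    union sigval value;
    value.sival_int = fd;
    return sigqueue(pid, signo, value);
}
```

The external tool still has to find out which descriptor number the target process uses for the file; that lookup is outside this sketch.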

Related

Race condition during file write

Suppose two different processes open the same file independently, and so have different entries in the Open file table (system-wide). But they refer to the same i-node entry.
As the file descriptors refer to different entries in the open file table (system-wide), they may have different file offsets. Is there any chance of a race condition during write because the file offsets differ? And how does the kernel avoid it?
Book: The Linux Programming Interface; Page no. 95; Chapter-5 (File I/O: Further details); Section 5.4
Assuming a POSIX system (presumed from the use of write()), each write() operation is supposed to be fully atomic.
Per POSIX 7's 2.9.7 Thread Interactions with Regular File Operations:
All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2017 when they operate on
regular files or symbolic links:
chmod()
chown()
close()
creat()
dup2()
fchmod()
fchmodat()
fchown()
fchownat()
fcntl()
fstat()
fstatat()
ftruncate()
lchown()
link()
linkat()
lseek()
lstat()
open()
openat()
pread()
read()
readlink()
readlinkat()
readv()
pwrite()
rename()
renameat()
stat()
symlink()
symlinkat()
truncate()
unlink()
unlinkat()
utime()
utimensat()
utimes()
write()
writev()
If two threads each call one of these functions, each call shall
either see all of the specified effects of the other call, or none of
them. The requirement on the close() function shall also apply
whenever a file descriptor is successfully closed, however caused (for
example, as a consequence of calling close(), calling dup2(), or of
process termination).
But pay particular attention to the specification for write() (note the word "attempt"):
The write() function shall attempt to write nbyte bytes ...
POSIX says that write() calls to a file shall be atomic. POSIX does not say that the write() calls will be complete. Here's a Linux bug report where a signal was interrupting a write() that was partially complete. Note the explanation:
Now this is perfectly valid behavior as far as spec (POSIX, SUS,...) is concerned (please correct me if I'm missing something). So I'd say the program is incorrect. But OTOH I agree that this was not possible before a50527b1 and we don't want to break userspace. I'd hate to revert that commit since it allows us to interrupt processes doing large writes (especially when something goes wrong) but if you explain to us why this behavior is a problem for you then I guess I'll have to revert it.
That's all but admitting that there's a POSIX requirement for write() calls to be atomic, if not complete, with an offer to revert back to earlier behavior where the write() calls apparently were all also complete in this same circumstance.
Note, though, there are lots of file systems out there that don't conform to POSIX standards.
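Since write() only attempts to write nbyte bytes, callers that need the whole buffer written usually loop. A minimal sketch follows; the helper name write_all is an assumption, not something from the answer. Note that once the loop has to issue a second write(), the data is no longer written by a single atomic call.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all `len` bytes of `buf` to `fd`, retrying after short counts and
   after EINTR. Returns 0 on success, -1 with errno set on failure. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte was written */
            return -1;
        }
        p += n;                 /* short count: advance and try the rest */
        len -= (size_t)n;
    }
    return 0;
}
```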
As the file descriptors refer to different entries in the open file table (system-wide), they may have different file offsets. Is there any chance of a race condition during write because the file offsets differ?
Any write() in Linux can return a short count, for example due to a signal being delivered to a userspace handler. For simplicity, let's ignore that, and only consider what happens to the successfully written data.
There are two scenarios:
1. The regions written to do not overlap.
(For example, one process writes 100 bytes starting at offset 23, and another writes 50 bytes starting at offset 200.)
There is no race condition in this case.
2. The regions written to do overlap.
(For example, one process writes 100 bytes starting at offset 50, and another writes 10 bytes starting at offset 70.)
There is a race condition. It is impossible to predict (without advisory locks etc.) the order in which the data gets updated.
Depending on the target filesystem, and if the writes are large enough (so that paging effects can be observed), the two writes may even be "mixed" (in page-sized chunks) in Linux on some filesystems on machines with more than one hardware thread, even though POSIX says this shouldn't happen.
Normally, writes go through the Linux page cache. It is possible for one of the processes to have opened the file with O_DIRECT | O_SYNC, bypassing the page cache. In that case, there are many additional corner cases that can occur. Specifically, even if you use a shared clock source, and can show that the normal/page-cached write completed before the direct write call was made, it may still be possible for the page-cached write to overwrite the direct write contents.
And how does the kernel avoid it?
It doesn't. Why should it? POSIX says each write is atomic, but there is no practical way to avoid a race condition relying on that alone (and get consistent and expected results).
Userspace programs have at least four different methods to avoid such races:
1. Advisory file locks on the entire open file using the flock() interface.
2. Advisory file locks on the entire open file using the lockf() interface. In Linux, these are just shorthand for placing/removing fcntl() advisory locks on the entire file.
3. Advisory record locks on the file using the fcntl() interface. This works even across shared volumes, as long as the file server is configured to support file locking.
4. Obtaining an exclusive lease on the open file using the fcntl() interface.
Advisory file locks are like street lights: they are intended for co-operating processes to easily determine who gets to go when. However, they do not stop any other process from actually ignoring the "lock" and accessing the file.
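As an illustration of the fcntl() record-lock method, here is a minimal sketch that takes an exclusive advisory lock on exactly the byte range it is about to write. The helper names set_range_lock and locked_pwrite are assumptions for this example, and it only protects against other processes that take the same lock.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Apply `type` (F_RDLCK, F_WRLCK or F_UNLCK) to bytes [start, start + len).
   F_SETLKW blocks until the lock can be granted. */
static int set_range_lock(int fd, int type, off_t start, off_t len)
{
    struct flock fl = { .l_type = (short)type, .l_whence = SEEK_SET,
                        .l_start = start, .l_len = len };
    return fcntl(fd, F_SETLKW, &fl);
}

/* Write `len` bytes at `offset` while holding an exclusive advisory lock on
   that range. Returns the pwrite() result, or -1 if locking fails. */
ssize_t locked_pwrite(int fd, const void *buf, size_t len, off_t offset)
{
    if (len == 0)
        return 0;               /* l_len == 0 would mean "to end of file" */
    if (set_range_lock(fd, F_WRLCK, offset, (off_t)len) == -1)
        return -1;

    ssize_t n = pwrite(fd, buf, len, offset);

    if (set_range_lock(fd, F_UNLCK, offset, (off_t)len) == -1)
        return -1;
    return n;
}
```

A production version would also retry F_SETLKW after EINTR and handle a short pwrite() count.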
File leases are a mechanism where one or more processes can hold a read lease on the same file at the same time, but only one process can hold a write lease, and only when that process is the only one having the file open. When granted, the write lease (or exclusive lease) means that if any other process tries to open the same file, the lease owner process is notified by a signal (that you can control using the fcntl() interface), and has a configured time (typically 45 seconds; see man 5 proc and /proc/sys/fs/lease-break-time, in seconds) to relinquish the lease. The opener is blocked in the kernel until the lease is downgraded or the lease break time passes, in which case the kernel breaks the lease.
This allows the lease holder to postpone the opening for a short while.
However, the lease holder cannot block the opening, and cannot e.g. replace the file with a decoy one; the opener already has a hold on the inode, and the lease break time is just a grace period for cleanup work.
Technically, a fifth method would be mandatory file locking, but aside from the kernel's use of it for executed binaries, it is not used, and is actually buggy in Linux anyway. In Linux, inodes are only locked against modification when that inode is being executed as a binary by the kernel. (You can still rename or delete the original file, and create a new one, so that any subsequent execs will execute the modified/new data. Attempts to modify a file that is being executed as a binary file will fail with error ETXTBSY.)

Correctly processing Ctrl-C when using poll()

I am making a program, that runs like a server, so it is constantly running poll. I need to process both Ctrl-C and Ctrl-D. And while Ctrl-D is pretty easy to work with when using poll (you just also poll for POLLIN on stdin), I cannot come up with a pretty solution for signals. Do I need to create a dummy file to which my signal handler will write something when it's time to exit, or would pipes fit this purpose nicely?
As commented by Dietrich Epp, a usual way of handling this is the "pipe to self" trick. First, at initialization time, you set up a pipe(7): you'll call pipe(2) and you keep both read and write file descriptors of that pipe in some (e.g. global) data. Your signal handler would just write(2) onto the write-end fd some bytes (perhaps a single 0 byte ...). And your event loop around poll(2) (or the older select(2), etc...) would react by read(2)-ing bytes when the read-end file descriptor has some data.
This pipe to self trick is common and portable to all POSIX systems, and recommended e.g. by Qt.
The signalfd(2) system call is Linux specific (e.g. you don't have that on MacOSX). Some old Linux kernels might not have it.
Be aware that the set of functions usable inside a signal handler is limited to async-signal-safe functions - so you are allowed to use write(2) but forbidden to use fprintf or malloc inside a signal handler. Read carefully signal(7) and signal-safety(7).
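A minimal sketch of the pipe-to-self trick described above. The names self_pipe, setup_self_pipe and wait_for_signal are illustrative; a real server would add the read end to the same pollfd array as stdin and its sockets instead of polling it alone.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int self_pipe[2] = { -1, -1 };   /* [0] = read end, [1] = write end */

/* Handler: only async-signal-safe calls, here a single write(2). */
static void on_signal(int signo)
{
    int saved_errno = errno;
    unsigned char byte = (unsigned char)signo;
    /* If the pipe is full, a wakeup is already pending, so ignore failure. */
    ssize_t ignored = write(self_pipe[1], &byte, 1);
    (void)ignored;
    errno = saved_errno;
}

static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Create the pipe and route `signo` into it. Returns 0 or -1. */
int setup_self_pipe(int signo)
{
    struct sigaction act;

    if (pipe(self_pipe) == -1)
        return -1;
    if (set_nonblocking(self_pipe[0]) == -1 || set_nonblocking(self_pipe[1]) == -1)
        return -1;
    memset(&act, 0, sizeof act);
    act.sa_handler = on_signal;
    act.sa_flags = SA_RESTART;
    sigemptyset(&act.sa_mask);
    return sigaction(signo, &act, NULL);
}

/* Wait up to `timeout_ms` for a signal byte. Returns the signal number,
   0 on timeout, or -1 on error. */
int wait_for_signal(int timeout_ms)
{
    struct pollfd pfd = { .fd = self_pipe[0], .events = POLLIN };
    unsigned char byte;
    int ready;

    do {
        ready = poll(&pfd, 1, timeout_ms);   /* poll() is not restarted by SA_RESTART */
    } while (ready == -1 && errno == EINTR);
    if (ready <= 0)
        return ready;
    if (read(self_pipe[0], &byte, 1) != 1)
        return -1;
    return byte;
}
```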
signalfd is what you are after: connect it to SIGINT and you can poll for Ctrl+C. See the example in the link provided (quite far down the page; they are actually catching Ctrl+C there).
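For comparison, a minimal Linux-only sketch of the signalfd(2) approach. The function names open_signal_fd and poll_signal_fd are made up here, and a multithreaded program would use pthread_sigmask() in place of sigprocmask().

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Block `signo` and return a descriptor that becomes readable when the
   signal is pending. Returns -1 on error. */
int open_signal_fd(int signo)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, signo);
    /* The signal must be blocked, otherwise its default action still runs. */
    if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
        return -1;
    return signalfd(-1, &mask, SFD_CLOEXEC);
}

/* Wait up to `timeout_ms` on the signal descriptor. Returns the signal
   number, 0 on timeout, or -1 on error. */
int poll_signal_fd(int sfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = sfd, .events = POLLIN };
    struct signalfd_siginfo info;
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready <= 0)
        return ready;
    if (read(sfd, &info, sizeof info) != (ssize_t)sizeof info)
        return -1;
    return (int)info.ssi_signo;
}
```

In a server loop the descriptor returned by open_signal_fd() would sit in the same pollfd array as stdin, so Ctrl-C and Ctrl-D are handled in one place.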

Is there a way to close output of stderr in one thread but not others?

Say my program has several threads. Since file descriptors are shared among the threads, if I close stderr, none of the threads will be able to write to it. My question: is there a way to shut down the output of stderr in one thread, but not the others?
To be more specific, one thread of my program calls a third-party library function that keeps printing warning messages which I know are useless. But I have no access to the source of this third-party library.
No. File descriptors are global resources available to all threads in a process. Standard error is file descriptor number 2, of course, so it is a global resource and you can't stop the third party code from writing to it.
If the problem is serious enough to warrant the treatment, you can do:
int fd2_copy = dup(2);
int fd2_null = open("/dev/null", O_WRONLY);
Before calling your third-party library function:
dup2(fd2_null, 2);
third_party_library_function();
dup2(fd2_copy, 2);
Basically, for the duration of the third-party library, switch standard error to /dev/null, reinstating the normal output after the function.
You should, of course, error check the system calls.
The downside of this is that while this thread is executing the third party function, any other thread that needs to write to standard error will also write to /dev/null.
You'd probably have to think in terms of adding an 'error writing thread' (EWT) which can be synchronized with the 'third-party library executing thread' (TPLET). Other threads would write a message to the EWT. If the TPLET was executing the third-party library, the EWT would wait until it was done, and only then write any queued messages. (While that would 'work', it is hard work.)
One way around this would be to have the error reporting functions used by the general code (other than the third-party library code) write to fd2_copy rather than standard error per se. This would require a disciplined use of error reporting functions, but is a whole heap easier than an extra thread.
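The dup()/dup2() sequence above, with the error checking filled in, might look like the following sketch. The wrapper name call_with_stderr_silenced is an assumption, and noisy_library_call is only a stand-in for the third-party function. The caveat already mentioned still applies: while the wrapped call runs, every thread's writes to descriptor 2 go to /dev/null.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the third-party function (hypothetical). */
void noisy_library_call(void)
{
    fprintf(stderr, "useless warning\n");
}

/* Run `fn` with descriptor 2 pointing at /dev/null, then restore it.
   Returns 0 on success, -1 if any of the system calls failed. */
int call_with_stderr_silenced(void (*fn)(void))
{
    int rc = 0;
    int saved, devnull;

    fflush(stderr);
    saved = dup(STDERR_FILENO);            /* keep the real standard error */
    if (saved == -1)
        return -1;
    devnull = open("/dev/null", O_WRONLY);
    if (devnull == -1) {
        close(saved);
        return -1;
    }

    if (dup2(devnull, STDERR_FILENO) == -1) {
        rc = -1;
    } else {
        fn();
        fflush(stderr);                    /* discard anything still buffered */
        if (dup2(saved, STDERR_FILENO) == -1)
            rc = -1;                       /* could not reinstate stderr */
    }
    close(devnull);
    close(saved);
    return rc;
}
```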
stderr is per process not per thread, so closing it will close for all threads.
If you want to skip particular messages, maybe you can use grep -v.
On Linux it is possible to give the current thread its own private file descriptor table, using the unshare() function declared in <sched.h>:
unshare(CLONE_FILES);
After that call, you can call close(2); and it will affect only the current thread.
Note however that once the file descriptor table is unshared, you can't go back to sharing it again - it's a one-way operation. This is also Linux-specific, so it's not portable.

Sharing File descriptors across processes

I want to setup a shared memory environment for multiple independent processes. In the data structure that I want to share, there are also connection fds which are per process.
I wanted to know if there is a way in which we can share these fds? or use global fds or something similar of the kind?
Thanks in advance.
There are two ways to share file descriptors on a Unix host. One is by letting a child process inherit them across a fork.
The other is sending file descriptors over a Unix domain socket with sendmsg; see this example program, function send_connection (archived here). Note that the file descriptor might have a different number in the receiving process, so you may have to perform some dup2 magic to make them come out right in your shared memory.
If you don't do this, the file descriptors in your shared memory region will be just integers.
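A minimal sketch of passing a descriptor over a Unix domain socket with sendmsg() and SCM_RIGHTS. The function names send_fd and recv_fd are made up here and are not taken from the linked example program. The descriptor returned by recv_fd() generally has a different number from the one that was sent.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer big enough for one int, aligned for struct cmsghdr. */
union fd_control {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor `fd` over Unix domain socket `sock`. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char payload = 'F';                    /* at least one byte of real data */
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union fd_control control;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&control, 0, sizeof control);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof control.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor from `sock`. Returns the new descriptor or -1. */
int recv_fd(int sock)
{
    char payload;
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union fd_control control;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof control.buf;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS || cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```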
Recently, I had to solve a problem similar to what the OP describes. To this end, I proposed a dedicated (and very simple) system call that sends file descriptors directly to cooperating processes, relying on POSIX.1b signal queues as the delivery medium. As an added benefit, this approach is inherently immune to the "fd recursion" attack, which plagues all VFS-based mechanisms to some degree.
Here's the proposed patch:
http://permalink.gmane.org/gmane.linux.kernel/1843084
(Presently, the patch only adds the new syscall for the x86/x86_64 architectures, but wiring it up to other architectures is trivial; no platform-dependent features are used.)
The theory of operation is as follows. Both sender and receiver need to agree on one or more signal numbers to use for descriptor passing. Those must be POSIX.1b realtime signals, which guarantee reliable delivery, hence the offset from SIGRTMIN. Also, smaller signal numbers have higher delivery priority, in case priority management is required:
int signo_to_use = SIGRTMIN + my_sig_off;
Then, originating process invokes a system call:
int err = sendfd(peer_pid, signo_to_use, fd_to_send);
That's it, nothing else is necessary on the sender's side. Obviously, sendfd() will only be successful, if the originating process has the right to signal destination process and destination process is not blocking/ignoring the signal.
It must also be noted that sendfd() never blocks; it will return immediately if the destination process' signal queue is full. In a well-designed application, this will indicate that the destination process is in trouble anyway, or that there is too much work to do, so new workers should be spawned or work items dropped. The size of the process' signal queue can be configured using setrlimit() (RLIMIT_SIGPENDING), in the same way as the number of available file descriptors.
The receiving process may safely ignore the signal (in this case nothing will happen and almost no overhead will be incurred on the kernel side). However, if the receiving process wants to get the delivered file descriptor, all it has to do is collect the signal info using sigtimedwait()/sigwaitinfo() or the more versatile signalfd():
/* First, the receiver needs to specify what it is waiting for: */
sigset_t sig_mask;
sigemptyset(&sig_mask);
sigaddset(&sig_mask, signo_to_use);
/* The signal should be blocked so that sigwaitinfo() can collect it: */
sigprocmask(SIG_BLOCK, &sig_mask, NULL);
siginfo_t sig_info;
/* Then all it needs is to wait for the event: */
sigwaitinfo(&sig_mask, &sig_info);
After the successful return of the sigwaitinfo(), sig_info.si_int will contain the new file descriptor, pointing to the same IO object, as file descriptor sent by the originating process. sig_info.si_pid will contain the originating process' PID, and sig_info.si_uid will contain the originating process' UID. If sig_info.si_int is less than zero (represents an invalid file descriptor), sig_info.si_errno will contain the errno for the actual error encountered during fd duplication process.

Reading shared data inside a signal handler

I am in a situation where I need to read a binary search tree (BST) inside a signal handler (a SIGSEGV handler, which to my knowledge runs on a per-thread basis). The BST can be modified by the other threads in the application.
Now, since a signal handler can't use semaphores, mutexes etc. and therefore can't safely access shared data, how do I solve this problem? Note that my application is multithreaded and running on a multicore system.
You shouldn't access shared data from a signal handler. You can find more information about signals in the following articles:
Linux Signals for the Application Programmer
The Linux Signals Handling Model
All about Linux signals
Looks like the safest way to deal with signals in linux so far is signalfd.
I can see two quite clean solutions:
Linux-specific: Create a dedicated thread handling signals. Catch signals using signalfd(). This way you will handle signals in a regular thread, not any limited handler.
Portable: Also use a dedicated thread that sleeps until signal is received. You may use a pipe to create a pair of file descriptors. The thread may read(2) from the first descriptor and in a signal handler you may write(2) to the second descriptor. Using write() in a signal handler is legal according to POSIX. When the thread reads something from the pipe it knows it must perform some action.
Assuming the SH can't access the shared data directly, then maybe you could do it indirectly:
Have some global variable that only signal handlers can write to, but can be read from elsewhere (even if only within the same thread).
SH sets the flag when it is invoked
Threads poll this flag when they are not in the middle of modifying the BST; when they find it set, they do the processing that is required by the original signal (using whatever synchronization is necessary), and then raise a different signal (like SIGUSR1) to indicate that the processing is done.
The SH for THAT signal resets the flag
If you're worried about overlapping SIGSEGVs, add a counter to the mix to keep track. (Hey! You just built your own semaphore!)
The weak link here is obviously the polling, but it's a start.
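A simplified sketch of the flag idea, under two loudly stated assumptions: it uses a C11 atomic_int (assumed to be lock-free on the target, which makes it usable from a handler), and the polling thread clears the flag itself instead of raising a second signal as in the steps above. The names work_pending, install_flag_handler and claim_pending_work are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

/* Set by the handler, claimed by whichever thread polls it first. */
static atomic_int work_pending;

/* Handler: does nothing but record that a request arrived. */
static void note_signal(int signo)
{
    (void)signo;
    atomic_store(&work_pending, 1);
}

/* Install the flag-setting handler for `signo`. Returns 0 or -1. */
int install_flag_handler(int signo)
{
    struct sigaction act;

    memset(&act, 0, sizeof act);
    act.sa_handler = note_signal;
    sigemptyset(&act.sa_mask);
    return sigaction(signo, &act, NULL);
}

/* Called by a thread at a point where it is not in the middle of modifying
   the tree. Returns 1 if a request was pending and this caller now owns it,
   0 otherwise. The exchange makes sure only one thread claims each request. */
int claim_pending_work(void)
{
    return atomic_exchange(&work_pending, 0);
}
```

A thread that gets 1 back would then take the normal lock on the BST and do the work outside signal context.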
You might consider mmap-ing a fuse file system (in user space).
Actually, you'll be happier on GNU Hurd, which has support for external pagers.
And perhaps your hack of reading a binary search tree in your signal handler could often work in practice, non-portably and in a kernel version dependent way. Perhaps serializing access with low-level non portable tricks (e.g. futexes and atomic gcc builtins) might work. Reading the (machine specific) source code of NPTL i.e. current Linux pthread routines should help.
It could probably be the case that pthread_mutex_lock etc are in fact usable from inside a Linux signal handler... (because it probably does only futex and atomic instructions).

Resources