How to swap two open file descriptors? - c

For my master thesis project I am building an API in C that works with Unix sockets. To make it short, I have two sockets identified by their two fds, on which I have called a O_NONBLOCK connect(). At this point, I am calling select() to check which one connects first and is ready for writing.
The problems start now, as the application which is using this API is aware of only one of those sockets, let's say the one identified by fd1. If the socket identified by fd2 is the first to connect, the application has no way to know it can write to that socket.
I think my best options are using dup() and/or dup2(), but according to their man page, dup() creates a copy of the fd passed to the function, but which refers to the same open file description, meaning that the two can be used interchangeably, and dup2() closes the new fd which replaces the old fd.
So my assumptions on what would happen are (in pseudo code)
int fd1, fd2, fd3;
fd1 = socket(x); // what the app is aware of
fd2 = socket(y); // first to connect
fd3 = dup(fd1); // fd1 and fd3 identify the same description
dup2(fd2, fd1); // The description identified by fd2 is now identified by fd1, the description previously identified by fd1 (and fd3) is closed
dup2(fd3, fd2); // The description identified by fd3 (copy of fd1, closed in the line above) is identified by fd2 (which can be closed and reassigned to fd3) since now the description that was being identified by fd2 is being identified by fd1.
Which looks fine, except for the fact that the first dup2() closes fd1, which closes also fd3 since they are identifying the same file description. The second dup2() works fine but it's replacing the fd of a connection which has been closed by the first one, while I want it to keep trying to connect.
Can anyone with a better understanding of Unix file descriptors help me out?
EDIT: I want to elaborate a little bit more on what the API does and why the application only sees one fd.
The API provides to the application the means to call a very "fancy" version of connect() select() and close().
When the application calls api_connect(), it passes to the function a pointer to an int (together with all the necessary addresses and protocols etc). api_connect() will call socket(), bind() and connect(); the important part is that it will write the return value of socket() in the memory passed through the pointer. This is what I mean by "The socket is only aware of one fd". The application will then call FD_SET(fd1, write_set), call api_select() and then check if the fd is writable by calling FD_ISSET(fd1, write_set). api_select() works more or less like select(), but has a timer which can trigger a timeout if the connection takes more than a set amount of time to connect (since it's O_NONBLOCK). If this happens, api_select() creates a new connection on a different interface (calling all the necessary socket(), bind() and connect()). This connection is identified by a new fd, fd2, which the application doesn't know about and which is tracked in the API.
Now, if the application calls api_select() with FD_SET(fd1, write_set) and the API realises that it is the second connection that has completed, thus making fd2 writable, I want the application to use fd2. The problem is that the application will only call FD_ISSET(fd1, write_set) and write(fd1) afterwards, that's why I need to replace fd2 with fd1.
At this point I'm really confused on whether I really need to dup or just do an integer swap (my understanding of Unix file descriptors is just a little bit more than basic).

I think my best options are using dup() and/or dup2(), but according
to their man page, dup() creates a copy of the fd passed to the
function, but which refers to the same open file description,
Yes.
meaning
that the two can be used interchangeably,
Maybe. It depends on what you mean by "interchangeably".
and dup2() closes the new fd
which replaces the old fd.
dup2() closes the target file descriptor, if it is open, before duping the source descriptor onto it. Perhaps that's what you meant, but I'm having trouble reading your description that way.
So my assumptions on what would happen are (excuse my crappy pseudo
code)
int fd1, fd2, fd3;
fd1 = socket(x); // what the app is aware of
fd2 = socket(y); // first to connect
fd3 = dup(fd1); // fd1 and fd3 identify the same description
Good so far.
dup2(fd2, fd1); // The description identified by fd2 is now identified by fd1, the description previously identified by fd1 (and fd3) is closed
No, the comment is incorrect. File descriptor fd1 is first closed, and then made to be a duplicate of fd2. The underlying open file description to which fd1 originally referred is not closed, because the process has another open file descriptor associated with it, fd3.
dup2(fd3, fd2); // The description identified by fd3 (copy of fd1, closed in the line above) is identified by fd2 (which can be closed and reassigned to fd3) since now the description that was being identified by fd2 is being identified by fd1.
Which looks fine, except for the fact that the first dup2() closes
fd1,
Yes it does.
which closes also fd3
No it doesn't.
since they are identifying the same file
description.
Irrelevant. Closing is a function on file descriptors, not, directly, on the underlying open file descriptions. In fact, it would be best not to use the word "identifying" here, for that suggests that file descriptors are some kind of identifier or alias for open file descriptions. They are not. File descriptors identify entries in a table of associations with open file descriptions, but are not themselves open file descriptions.
In short, your sequence of dup(), dup2(), and dup2() calls should effect exactly the kind of swap you want, provided that they all succeed. They do, however, leave an extra open file descriptor hanging around, which would yield a file descriptor leak under many circumstances. Therefore, don't forget to finish up with a
close(fd3);
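Putting the sequence above together, a minimal sketch of the swap with error checking might look like the following. The name swap_fds is hypothetical, not part of any existing API, and the helper makes no attempt to be atomic with respect to other threads.

```c
#include <unistd.h>

/* Hypothetical helper: swap the open file descriptions behind the
 * descriptor numbers a and b. Returns 0 on success, -1 on failure. */
int swap_fds(int a, int b)
{
    int tmp = dup(a);           /* tmp keeps a's description alive */
    if (tmp == -1)
        return -1;

    if (dup2(b, a) == -1) {     /* a now refers to b's description */
        close(tmp);
        return -1;
    }
    if (dup2(tmp, b) == -1) {   /* b now refers to a's old description */
        dup2(tmp, a);           /* best effort: put a back as it was */
        close(tmp);
        return -1;
    }
    close(tmp);                 /* drop the extra descriptor: no leak */
    return 0;
}
```

Note that dup() and dup2() do not copy the FD_CLOEXEC flag, so if either descriptor had it set, it has to be set again after the swap.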
Of course, all that assumes that it is the value of fd1 that is special to the application, not the variable containing it. File descriptors are just numbers. There is nothing inherently special about the objects that contain them, so if it is the variable fd1 that the application needs to use, regardless of its specific value, then all you need to do is perform an ordinary swap of integers:
fd3 = fd1;
fd1 = fd2;
fd2 = fd3;
With respect to the edit, you write,
When the application calls api_connect(), it passes to the function a
pointer to an int (together with all the necessary addresses and
protocols etc). api_connect() will call socket(), bind() and
connect(), the important part is that it will write the return value
of socket() in the memory passed through the pointer.
Whether api_connect() returns the file descriptor value by writing it through a pointer or by conveying it as or in the function's return value is irrelevant. The point remains that it is the value that matters, not the object, if any, containing it.
This is what I
mean by "The socket is only aware of one fd". The application will
then call FD_SET(fd1, write_set), call a api_select() and then check
if the fd is writable by calling FD_ISSET(fd1, write_set).
Well that sounds problematic in light of the rest of your description.
[Under some conditions,]
api_select() creates a new connection on a different interface
(calling all the necessary socket(), bind() and connect()). This
connection is identified by a new fd -fd2- the application doesn't
know about, and which is tracked in the API.
Now, if the application calls api_select() with FD_SET(fd1, write_set)
and the API realises that is the second connection that has completed,
thus making fd2 writable, I want the application to use fd2. The
problem is that the application will only call FD_ISSET(fd1,
write_set) and write(fd1) afterwards, that's why I need to replace fd2
with fd1.
Do note that even if you do swap file descriptors as described in the first part of this answer, that will have no effect on either FD's membership in any fd_set, for such membership is logical, not physical. You will have to manage fd_set membership manually if the caller relies on that.
It is unclear to me whether api_select() is intended to provide services for more than one (caller-specified) file descriptor at the same time, as select() can do, but I imagine that the bookkeeping required for it to do so would be monstrous. On the other hand, if in fact the function handles only one caller-provided FD at a time, then mimicking the interface of select() is ... odd.
In that case, I would strongly urge you to design a more suitable interface. Among other things, such an interface should moot the question of swapping FDs. Instead, it can directly tell the caller what FD, if any, is ready for use, either by returning it or by writing it through a pointer to a variable specified by the caller.
Also, in the event that you do switch, one way or another, to an alternative FD, do not overlook managing the old one lest you leak a file descriptor. Each process has a pretty limited quantity of those available, so a file descriptor leak can be much more troublesome than a memory leak. In the event that you do switch, then, are you sure you really need to swap, as opposed to just dup2()ing the new FD onto the old, then closing the new?
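If a one-way replacement is enough, a sketch of "dup2() the new FD onto the old, then close the new" could look like this. The helper name replace_fd is made up for illustration.

```c
#include <unistd.h>

/* Hypothetical helper: make old_fd refer to whatever new_fd refers to,
 * then release new_fd. Whatever old_fd referred to before is closed by
 * dup2() (unless another descriptor still refers to it).
 * Returns 0 on success, -1 on failure. */
int replace_fd(int old_fd, int new_fd)
{
    if (old_fd == new_fd)
        return 0;               /* nothing to do, and must not close */
    if (dup2(new_fd, old_fd) == -1)
        return -1;
    return close(new_fd);       /* avoid leaking the now-redundant fd */
}
```

After replace_fd(fd1, fd2) the application keeps using the number fd1, the slow connection is closed, and no descriptor is leaked.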

Related

Protect /dev/shm file

I'm working on an application which is using a shared memory via shm_open(). It perform mmap() from a file within /dev/shm and is based on producer/consumer approach.
Is there any mechanism for my shared memory to be protected and accessible only by this application? I know it is possible to use encryption but does linux (or the programming language) provide any services so that the file is only accessible by my application?
If you use fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0);, then the shared memory object cannot be opened by any other process (without changing the access mode first). If it succeeds (fd != -1), and you immediately unlink the object via int rc = shm_unlink(name); successfully (rc == 0), only processes that can access the current process itself can access the object.
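A minimal sketch of that create-then-unlink step, assuming a hypothetical helper name create_private_shm and a caller-chosen size. On older glibc versions, linking may require -lrt.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a shared memory object with mode 0 and
 * unlink its name right away, so that it is reachable only through the
 * returned descriptor. Returns the descriptor, or -1 on failure. */
int create_private_shm(const char *name, off_t size)
{
    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0);
    if (fd == -1)
        return -1;              /* e.g. EEXIST: somebody else owns it */

    if (shm_unlink(name) == -1 || ftruncate(fd, size) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned descriptor can then be passed to mmap() with MAP_SHARED as usual.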
There is a small time window between the two operations when another process with sufficient privileges might have changed the mode and opened the object. To check, use fcntl(fd, F_SETLEASE, F_WRLCK) to obtain a write lease on the object. It will succeed only if this is the only process with access to the object.
Have the first instance of the application bind to a previously-agreed Unix domain stream socket, named or abstract, and listen for incoming connections on it. (For security reasons, it is important to use fcntl(sockfd, F_SETFD, FD_CLOEXEC) to avoid leaking the socket to a child process in case it exec()s a new binary.)
If the socket has been already bound, the bind will fail; so connect to that socket instead. When the first instance accepts a new connection, or the second instance connects to it, both must use int rc = getsockopt(connfd, SOL_SOCKET, SO_PEERCRED, &creds, &credslen); with struct ucred creds; socklen_t credslen = sizeof creds;, to obtain the credentials of the other side.
You can then check that the uid of the other side matches getuid() and geteuid(), and verify using e.g. stat() that the path "/proc/PID/exe" (where PID is the pid of the other side) refers to the same inode on the same filesystem as "/proc/self/exe". If they do, both sides are executing the same binary. (Note that you can also use POSIX realtime signals, via sigqueue(), passing one data token (an int, a void pointer, or a uintptr_t/intptr_t, which happen to match unsigned long/long on Linux) between them. This is useful, for example, if one wants to notify the other that it is about to exit, and that the other one should bind to and listen for incoming connections on the Unix domain stream socket.)
Then, the initial process can pass a copy of the shared object description (via descriptor fd) to the second process, using an SCM_RIGHTS ancillary message, with for example the actual size of the shared object as data (recommend a size_t for this). If you want to pass other stuff, use a structure.
The first (often, but not necessarily only) message the second process receives will contain the ancillary data with a new file descriptor referring to the shared object. Note that because this is a Unix domain stream socket, message boundaries are not preserved, and if there wasn't a full data payload, you need to use a loop to read the rest of the data.
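The descriptor passing described above can be sketched roughly as follows. The function names send_fd and recv_fd are hypothetical, the payload is just the object size as a size_t, and for brevity the receiver assumes the small payload arrives in one piece instead of looping as a robust implementation would.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: send descriptor fd over the Unix domain socket
 * sock, with the object size as the data payload. 0 on success. */
int send_fd(int sock, int fd, size_t size)
{
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct iovec iov = { .iov_base = &size, .iov_len = sizeof size };
    struct msghdr msg;
    struct cmsghdr *cm;

    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;         /* "pass these descriptors" */
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof fd);

    return sendmsg(sock, &msg, 0) == (ssize_t)sizeof size ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor and the size payload.
 * Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock, size_t *size)
{
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct iovec iov = { .iov_base = size, .iov_len = sizeof *size };
    struct msghdr msg;
    struct cmsghdr *cm;
    int fd;

    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    /* Assumption: the whole payload arrives at once (no read loop). */
    if (recvmsg(sock, &msg, 0) != (ssize_t)sizeof *size)
        return -1;

    cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
        cm->cmsg_type != SCM_RIGHTS ||
        cm->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cm), sizeof fd);
    return fd;
}
```

The received descriptor is a new number in the receiving process, but it refers to the same open file description as the one that was sent.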
Both sides can then close the Unix domain socket. The second side can then mmap() the shared object.
If there is never more than this exact pair of processes sharing data, then both sides can close the descriptor, making it impossible for anyone except superuser or the kernel to access the shared descriptor. The kernel will keep an internal reference as long as the mapping exists; it is equivalent to the process having the descriptor still open, except that the process itself cannot access or share the descriptor anymore, only the shared memory itself.
Because the shared object has been unlinked already, no cleanup is necessary. The shared object will vanish as soon as the last process with an open descriptor or existing mmap closes it, unmaps it, or exits.
The Unix security model that Linux implements does not have strong boundaries between processes running as the same uid. In particular, they can examine each other's /proc/PID/ pseudodirectories, including their open file descriptors listed under /proc/PID/fd/.
Because of this, security-sensitive applications usually run as a dedicated user. The aforementioned scheme works well even when the second party is a process running as the human user, and the first party as the dedicated application uid. If you use a named Unix domain stream socket, you do need to ensure its access mode is suitable (you can use chmod(), chgrp(), et al. after binding to the socket, to change the named Unix domain stream socket access mode). Abstract Unix domain stream sockets do not have a filesystem-visible node, and any process can connect to such a bound socket.
When a privilege boundary is involved between the application (running as its own dedicated uid) and the agent (running as a user uid), it is important to make sure that both sides are who they claim to be across the entire exchange. The credentials are valid only at that point in time, and a known attack method is to have the valid agent execute a nefarious binary just after having connected to the socket, so that the other side still sees the original credentials, but the next communications are in control of a nefarious process.
To avoid this, make sure the socket descriptor is not shared across an exec (using CLOEXEC descriptor flag), and optionally check the peer credentials more than once, for example initially and finally.
Why is this "complicated"? Because proper security has to be baked in, it cannot be added on top afterwards, or taken invisibly care of for you: it must be a part of the approach. Changes in the approach must be reflected in the security implementation, or you have no security.
In real life, after you implement this (for the same-executable-binary one, and the privileged-service-or-application and user-agent one), you'll find that it isn't as complicated as it sounds: each step has its purpose, and can be tweaked if the approach changes. In particular, it isn't much C code at all.
If one wants or needs "something easier", then one just has to pick something other than security-sensitive code.

How to get the file descriptors of TCP socket for a given process in Linux?

I'm trying to find the file descriptors for all TCP sockets of a given process, ie. given its pid, so that I can get the socket option at another process without modifying the original one.
For example, if I know the file descriptor is fd, then I hope to call getsockopt(fd, ...) to retrieve the options at another process. I'm wondering is this doable? If so, how to get the fd I need outside the original process?
I have tried to print out the return value when creating a socket, ie. s = socket(...); printf("%d\n", s);, keeping the original process running and call getsockopt(s, ...) somewhere else but it doesn't work - it seems that such return value is process-dependent.
I have also found the solution with unix domain sockets but I don't want to modify the codes of original process.
As for reading /proc/<PID>/fd directly or leveraging lsof, I'd like to say I don't know how to find what I need from them. My gut feeling is that they could be potential solutions.
Of course any other ideas are welcome as well. To be honest, I'm not very familiar with the file descriptor mechanism in Linux.
No. You simply cannot do what you are asking.
A file descriptor is just an integer, but it refers to an open file object in a given process. That integer value in another process refers to a different, possibly unopened file object.
Without involving the ptrace debugging API, or remote code injection, you are limited to what the kernel exposes to you via /proc.
Check out the man page for ss. If this utility can't show you information about a socket you desire, then nothing can.
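To see what /proc does expose, here is a Linux-only sketch that lists which descriptors of a process are sockets by reading the symlinks under /proc/PID/fd. The function name count_socket_fds is invented for the example. This only identifies the sockets; it does not give you a descriptor you can pass to getsockopt() from another process.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (Linux only): print and count the descriptors of
 * process pid whose /proc/PID/fd entry points at a socket.
 * Returns the count, or -1 if the directory cannot be read. */
int count_socket_fds(pid_t pid)
{
    char dirpath[64];
    struct dirent *ent;
    DIR *dir;
    int count = 0;

    snprintf(dirpath, sizeof dirpath, "/proc/%ld/fd", (long)pid);
    dir = opendir(dirpath);
    if (dir == NULL)
        return -1;              /* no such process, or no permission */

    while ((ent = readdir(dir)) != NULL) {
        char link[512], target[128];
        ssize_t n;

        if (ent->d_name[0] == '.')
            continue;           /* skip "." and ".." */
        snprintf(link, sizeof link, "%s/%s", dirpath, ent->d_name);
        n = readlink(link, target, sizeof target - 1);
        if (n < 0)
            continue;
        target[n] = '\0';
        /* Socket entries look like "socket:[12345]" (the inode). */
        if (strncmp(target, "socket:[", 8) == 0) {
            printf("fd %s -> %s\n", ent->d_name, target);
            count++;
        }
    }
    closedir(dir);
    return count;
}
```

The inode number inside the brackets can be matched against the inode column of /proc/net/tcp to tell which of those sockets are TCP.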

dup() followed by close() from multiple threads or processes

My program does the following in chronological order
The program is started with root permissions.
Among other tasks, a file only readable with root permissions is open()ed.
Root privileges are dropped.
Child processes are spawned with clone() and the CLONE_FILES | CLONE_FS | CLONE_IO flags set, which means that while they use separate regions of virtual memory, they share the same file descriptor table (and other IO stuff).
All child processes execve() their own programs (the FD_CLOEXEC flag is not used).
The original program terminates.
Now I want every spawned program to read the contents of the aforementioned file, but after they all have read the file, I want it to be closed (for security reasons).
One possible solution I'm considering now is having a step 3a where the fd of the file is dup()licated once for every child process, and each child gets its own fd (as an argv). Then every child program would simply close() their fd, so that after all fds pointing to the file are close()d the "actual file" is closed.
But does it work that way? And is it safe to do this (i.e. is the file really closed)? If not, is there another/better method?
While using dup() as I suggested above is probably just fine, I've now --a day after asking this SO question-- realized that there is a nicer way to do this, at least from the point of view of thread safety.
All dup()licated file descriptors point to the same file position indicator, which of course means you run into trouble when multiple threads/processes might simultaneously try to change the file position during read operations (even if your own code does so in a thread safe way, the same doesn't necessarily go for libraries you depend on).
So wait, why not just call open() multiple times (once for every child) on the needed file before dropping root? From the manual of open():
A call to open() creates a new open file description, an entry in the system-wide table of open files. This entry records the file offset and the file status flags (modifiable via the fcntl(2) F_SETFL operation). A file descriptor is a reference to one of these entries; this reference is unaffected if pathname is subsequently removed or modified to refer to a different file. The new open file description is initially not shared with any other process, but sharing may arise via fork(2).
Could be used like this:
int fds[CHILD_C];
for (int i = 0; i < CHILD_C; i++) {
fds[i] = open("/foo/bar", O_RDONLY);
// check for errors here
}
drop_privileges();
// etc
Then every child gets a reference to one of those fds through argv and does something like:
FILE *stream = fdopen(atoi(argv[FD_STRING_I]), "r")
read whatever needed from the stream
fclose(stream) (this also closes the underlying file descriptor)
Disclaimer: According to a bunch of tests I've run this is indeed safe and sound. I have however only tested open()ing with O_RDONLY. Using O_RDWR or O_WRONLY may or may not be safe.
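The difference between dup() and a second open() can be checked with a small experiment. This is only a sketch: second_fd_offset is a made-up name, and it assumes the file at path is a regular file of at least four bytes.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical experiment: open path, obtain a second descriptor either
 * with dup() (use_dup != 0) or with another open(), read 4 bytes through
 * the first descriptor, and report the file offset seen through the
 * second one. Returns -1 on any failure. */
long second_fd_offset(const char *path, int use_dup)
{
    char buf[4];
    long result = -1;
    int first, second;

    first = open(path, O_RDONLY);
    if (first == -1)
        return -1;

    second = use_dup ? dup(first) : open(path, O_RDONLY);
    if (second != -1) {
        if (read(first, buf, sizeof buf) == (ssize_t)sizeof buf)
            result = (long)lseek(second, 0, SEEK_CUR);
        close(second);
    }
    close(first);
    return result;
}
```

With dup() the second descriptor shares the open file description, so its offset has moved to 4; with a separate open() it has its own offset, still 0.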

How to create blocking file descriptor in unix?

I would like to create blocking and non-blocking file in Unix's C. First, blocking:
fd = open("file.txt", O_CREAT | O_WRONLY | O_EXCL);
is that right? Shouldnt I add some mode options, like 0666 for example?
How about non-blocking file? I have no idea for this.
I would like to achieve something like:
when I open it to write in it, and it's opened for writing, it's ok; if not it blocks.
when I open it to read from it, and it's opened for reading, it's ok; if not it blocks.
File descriptors are blocking or non-blocking; files are not. Add O_NONBLOCK to the options in the open() call if you want a non-blocking file descriptor.
Note that opening a FIFO for reading or writing will block unless there's a process with the FIFO open for the other operation, or you specify O_NONBLOCK. If you open it for read and write, the open() is non-blocking (will return promptly); I/O operations are still controlled by whether you set O_NONBLOCK or not.
The updated question is not clear. However, if you're looking for 'exclusive access to the file' (so that no-one else has it open), then neither O_EXCL nor O_NONBLOCK is the answer. O_EXCL affects what happens when you create the file; the create will fail if the file already exists. O_NONBLOCK affects whether a read() operation will block when there's no data available to read. If you read the POSIX open() description, there is nothing there that allows you to request 'exclusive access' to a file.
To answer the question about file mode: if you include O_CREAT, you need the third argument to open(). If you omit O_CREAT, you don't need the third argument to open(). It is a varargs function:
int open(const char *filename, int options, ...);
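As a small sketch of both points, here is the question's open() call with the mode argument supplied, plus a way to switch an already open descriptor to non-blocking mode. The helper names and the 0644 mode are choices made for the example.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create path for writing, failing if it already
 * exists. Because O_CREAT is given, the third (mode) argument is
 * required; the process umask is still applied to it. */
int create_exclusive(const char *path)
{
    return open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);
}

/* Hypothetical helper: turn on O_NONBLOCK for an open descriptor
 * without disturbing its other status flags. 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}
```

A second create_exclusive() call on the same path fails with errno set to EEXIST, which is all that O_EXCL promises; it does not stop anyone from opening the existing file.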
I don't know what you are calling a blocking file (blocking IO in Unix means that the IO operations wait for the data to be available or for a sure failure, they are opposed to non-blocking IO which returns immediately if there is no available data).
You always need to specify a mode when opening with O_CREAT.
The open() you show will fail if the file already exists (once fixed for the above point).
Unix has no standard way to lock a file for exclusive access other than that. There are advisory locks (but all programs must respect the protocol). Some systems have a mandatory locking extension. The received wisdom is not to rely on either kind of locking when accessing a network file system.
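For illustration, an advisory lock could be taken like this. The sketch uses flock(), which is a BSD interface available on Linux and most Unix systems rather than part of POSIX (POSIX offers fcntl() record locks instead); the helper name open_locked is made up.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: open (creating if needed) path and take an
 * exclusive advisory lock on it without blocking. Returns the locked
 * descriptor, or -1 if the file cannot be opened or is already locked
 * through another open file description. */
int open_locked(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        close(fd);              /* somebody else holds the lock */
        return -1;
    }
    return fd;                  /* closing fd releases the lock */
}
```

This only keeps out programs that also call flock() on the file; anything that ignores the protocol can still open and modify it.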
Shouldn't I add some mode options?
You should, if the file is write-only and to be created if nonexistent. In this case, open() expects a third argument as well, so omitting it results in undefined behavior.
Edit:
The updated question is even more confusing...
when I open it to write in it, and it's opened for writing, it's ok; if not it blocks.
Why would you need that? See, if you try to write to a file/file descriptor not opened for writing, write() will return -1 and you can check the error code stored in errno. Tell us what you're trying to achieve by this bizarre thing you want instead of overcomplicating and messing up your code.
(Remarks in parentheses:
I would like to create blocking and non-blocking file
What's that?
in unix's C
Again, there's no such thing. There is the C language, which is platform-independent.)

Process started from system command in C inherits parent fd's

I have a sample application of a SIP server listening on both tcp and udp ports 5060.
At some point in the code, I do a system("pppd file /etc/ppp/myoptions &");
After this if I do a netstat -apn, It shows me that ports 5060 are also opened for pppd!
Is there any method to avoid this? Is this standard behaviour of the system function in Linux?
Thanks,
Elison
Yes, by default whenever you fork a process (which system does), the child inherits all the parent's file descriptors. If the child doesn't need those descriptors, it SHOULD close them. The way to do this with system (or any other method that does a fork+exec) is to set the FD_CLOEXEC flag on all file descriptors that shouldn't be used by the children of your process. This will cause them to be closed automatically whenever any child execs some other program.
In general, ANY TIME your program opens ANY KIND of file descriptor that will live for an extended period of time (such as a listen socket in your example), and which should not be shared with children, you should do
fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) | FD_CLOEXEC);
on the file descriptor.
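The same call with its return values checked, as a small sketch (the wrapper name set_cloexec is hypothetical):

```c
#include <fcntl.h>

/* Hypothetical helper: mark fd close-on-exec while keeping any other
 * descriptor flags intact. Returns 0 on success, -1 on failure. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;              /* e.g. EBADF: fd is not open */
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}
```

Calling it right after socket(), accept() or open() keeps the window in which another thread could fork and exec with the descriptor still inheritable as small as this approach allows.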
As of the POSIX.1-2024 revision, you can use the SOCK_CLOEXEC flag or'd into the type of the socket to get this behavior automatically when you create the socket:
listenfd = socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, 0);
bind(listenfd, ...
listen(listenfd, ...
which guarantees it will be closed properly even if some other simultaneously running thread does a system or fork+exec call. Fortunately, this flag has been supported for a while on Linux and BSD unixes (but not OSX, unfortunately).
You should probably avoid the system() function altogether. It's inherently dangerous, in that it invokes the shell, which can be tampered with, and it is rather non-portable, even between Unices.
What you should do is the fork()/exec() dance. It goes something like this
if(!fork()){
//close file descriptors
...
execlp("pppd", "pppd", "file", "/etc/ppp/myoptions", NULL);
perror("exec");
exit(-1);
}
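A more complete sketch of that dance, with the descriptors to close passed in explicitly. The function name spawn_closing and its parameters are assumptions for the example, and unlike the backgrounded pppd command in the question it waits for the child, which a real server would probably not want to do.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] (searched in PATH) in a child that
 * first closes the nfds descriptors listed in fds, then wait for it.
 * Returns the child's exit status, or -1 on failure. */
int spawn_closing(char *const argv[], const int *fds, size_t nfds)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Child: drop the descriptors the new program must not see. */
        for (size_t i = 0; i < nfds; i++)
            close(fds[i]);
        execvp(argv[0], argv);
        _exit(127);             /* only reached if the exec failed */
    }
    /* Parent: its own copies of the descriptors stay open. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For the question's case the argument vector would be { "pppd", "file", "/etc/ppp/myoptions", NULL } and fds would hold the two listening sockets.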
Yes, this is standard behavior of fork() in Linux, from which system() is implemented.
The identifier returned from the socket() call is a valid file descriptor. This value is usable with file-oriented functions such as read(), write(), ioctl(), and close().
The converse, that every file descriptor is a socket, is not true. One cannot open a regular file with open() and pass that descriptor to, e.g., bind() or listen().
When you call system() the child process inherits the same file descriptors as the parent. This is how stdin (0), stdout (1), and stderr (2) are inherited by child processes. If you arrange to open a socket with a file descriptor of 0, 1 or 2, the child process will inherit that socket as one of the standard I/O file descriptors.
Your child process is inheriting every open file descriptor from the parent, including the socket you opened.
As others have stated, this is standard behavior that programs depend on.
When it comes to preventing it you have a few options. Firstly is closing all file descriptors after the fork(), as Dave suggests. Second, there is the POSIX support for using fcntl with FD_CLOEXEC to set a 'close on exec' bit on a per-fd basis.
Finally, though, since you mention you are running on Linux, there are a set of changes designed to let you set the bit right at the point of opening things. Naturally, this is platform dependent. An overview can be found at http://udrepper.livejournal.com/20407.html
What this means is that you can use a bitwise or with the 'type' in your socket creation call to set the SOCK_CLOEXEC flag. Provided you're running kernel 2.6.27 or later, that is.
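system() forks the current process and then runs the command in the child, which inherits the parent's open descriptors; that is probably why pppd shows up as holding port 5060. The parent itself keeps running. You can use fork()/exec() directly to create the child process, which lets you close the descriptors before the exec.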
