Protect /dev/shm file - c

I'm working on an application that uses shared memory via shm_open(). It performs mmap() on a file within /dev/shm and is based on a producer/consumer approach.
Is there any mechanism for my shared memory to be protected and accessible only by this application? I know it is possible to use encryption, but does Linux (or the programming language) provide any services so that the file is only accessible by my application?

If you use fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0);, then the shared memory object cannot be opened by any other process (without changing the access mode first). If it succeeds (fd != -1), and you immediately unlink the object via int rc = shm_unlink(name); successfully (rc == 0), only processes that can access the current process itself can access the object.
There is a small time window between the two operations when another process with sufficient privileges might have changed the mode and opened the object. To check, use fcntl(fd, F_SETLEASE, F_WRLCK) to obtain a write lease on the object. It will succeed only if this is the only process with access to the object.
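The create, unlink, and lease-check sequence described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: create_private_shm and its exclusive out-parameter are hypothetical names, the object name is a placeholder, and reporting a failed lease to the caller instead of aborting is an assumption of this example.

```c
#define _GNU_SOURCE             /* F_SETLEASE is a Linux extension */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a shared memory object that nobody else can open (mode 0),
 * unlink it at once, and size it. `name` is a placeholder such as
 * "/myapp-shm". *exclusive is set to 1 if the write lease could be
 * obtained, 0 otherwise. Returns the descriptor, or -1 on error. */
int create_private_shm(const char *name, size_t size, int *exclusive)
{
    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0);
    if (fd == -1)
        return -1;

    if (shm_unlink(name) == -1 || ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return -1;
    }

    /* The lease is only used as a check here, so release it again. */
    *exclusive = (fcntl(fd, F_SETLEASE, F_WRLCK) == 0);
    if (*exclusive)
        fcntl(fd, F_SETLEASE, F_UNLCK);

    return fd;
}
```

The caller would then mmap() the returned descriptor with MAP_SHARED. On older glibc versions, linking may additionally need -lrt.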
Have the first instance of the application bind to a previously-agreed Unix domain stream socket, named or abstract, and listen for incoming connections on it. (For security reasons, it is important to use fcntl(sockfd, F_SETFD, FD_CLOEXEC) to avoid leaking the socket to a child process in case it exec()s a new binary.)
If the socket has already been bound, the bind will fail; in that case, connect to that socket instead. When the first instance accepts a new connection, or the second instance connects to it, both must use int rc = getsockopt(connfd, SOL_SOCKET, SO_PEERCRED, &creds, &credslen); with struct ucred creds; socklen_t credslen = sizeof creds;, to obtain the credentials of the other side.
You can then check that the uid of the other side matches getuid() and geteuid(), and verify using e.g. stat() that the path "/proc/PID/exe" (where PID is the pid of the other side) refers to the same inode on the same filesystem as "/proc/self/exe". If they do, both sides are executing the same binary. (Note that you can also use POSIX realtime signals, via sigqueue(), passing one data token (an int, a void pointer, or a uintptr_t/intptr_t, which happen to match unsigned long/long on Linux) between them. This is useful, for example, if one wants to notify the other that it is about to exit, and the other one should bind to and listen for incoming connections on the Unix domain stream socket.)
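The credential and executable check can be sketched like this. It is a minimal sketch under the assumptions that /proc is mounted and that connfd is a connected Unix domain socket; peer_is_same_binary is a hypothetical helper name.

```c
#define _GNU_SOURCE             /* struct ucred */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the peer on connfd runs under our uid and executes the
 * same binary as we do, 0 if it does not, and -1 on error. */
int peer_is_same_binary(int connfd)
{
    struct ucred creds;
    socklen_t credslen = sizeof creds;

    if (getsockopt(connfd, SOL_SOCKET, SO_PEERCRED, &creds, &credslen) == -1)
        return -1;
    if (creds.uid != getuid() || creds.uid != geteuid())
        return 0;

    char path[64];
    struct stat self, peer;
    snprintf(path, sizeof path, "/proc/%ld/exe", (long)creds.pid);
    if (stat("/proc/self/exe", &self) == -1 || stat(path, &peer) == -1)
        return -1;

    /* Same device and same inode means the same file. */
    return self.st_dev == peer.st_dev && self.st_ino == peer.st_ino;
}
```

Both the accepting and the connecting side would call this on their end of the connection before exchanging anything else.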
Then, the initial process can pass a copy of the shared object description (via descriptor fd) to the second process, using an SCM_RIGHTS ancillary message, with for example the actual size of the shared object as data (recommend a size_t for this). If you want to pass other stuff, use a structure.
The first (often, but not necessarily only) message the second process receives will contain the ancillary data with a new file descriptor referring to the shared object. Note that because this is an Unix domain stream socket, message boundaries are not preserved, and if there wasn't a full data payload, you need to use a loop to read the rest of the data.
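A minimal sketch of the descriptor hand-over, assuming a connected Unix domain stream socket and a size_t payload as suggested above. send_fd and recv_fd are hypothetical names, and MSG_CMSG_CLOEXEC and MSG_NOSIGNAL are Linux-specific flags chosen for this example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

union fd_ctl {
    struct cmsghdr hdr;         /* forces correct alignment */
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor fd, with the object size as the data payload. */
int send_fd(int sock, int fd, size_t size)
{
    struct iovec iov = { .iov_base = &size, .iov_len = sizeof size };
    union fd_ctl ctl;
    struct msghdr msg = { 0 };

    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof fd);
    memcpy(CMSG_DATA(cmsg), &fd, sizeof fd);

    return sendmsg(sock, &msg, MSG_NOSIGNAL) == (ssize_t)sizeof size ? 0 : -1;
}

/* Receive the descriptor; loops until the whole payload has arrived.
 * Returns the new descriptor, or -1 on error. */
int recv_fd(int sock, size_t *size)
{
    char *p = (char *)size;
    size_t have = 0;
    int fd = -1;

    while (have < sizeof *size) {
        struct iovec iov = { .iov_base = p + have, .iov_len = sizeof *size - have };
        union fd_ctl ctl;
        struct msghdr msg = { 0 };

        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctl.buf;
        msg.msg_controllen = sizeof ctl.buf;

        ssize_t n = recvmsg(sock, &msg, MSG_CMSG_CLOEXEC);
        if (n <= 0) {
            if (fd != -1)
                close(fd);
            return -1;
        }
        for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
            if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS && fd == -1)
                memcpy(&fd, CMSG_DATA(c), sizeof fd);
        have += (size_t)n;
    }
    return fd;
}
```

The descriptor number on the receiving side will generally differ from the sender's; only the underlying object is the same.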
Both sides can then close the Unix domain socket. The second side can then mmap() the shared object.
If there is never more than this exact pair of processes sharing data, then both sides can close the descriptor, making it impossible for anyone except superuser or the kernel to access the shared descriptor. The kernel will keep an internal reference as long as the mapping exists; it is equivalent to the process having the descriptor still open, except that the process itself cannot access or share the descriptor anymore, only the shared memory itself.
Because the shared object has been unlinked already, no cleanup is necessary. The shared object will vanish as soon as the last process with an open descriptor or existing mmap closes it, unmaps it, or exits.
The Unix security model that Linux implements does not have strong boundaries between processes running as the same uid. In particular, they can examine each other's /proc/PID/ pseudodirectories, including their open file descriptors listed under /proc/PID/fd/.
Because of this, security-sensitive applications usually run as a dedicated user. The aforementioned scheme works well even when the second party is a process running as the human user, and the first party as the dedicated application uid. If you use a named Unix domain stream socket, you do need to ensure its access mode is suitable (you can use chmod(), chgrp(), et al. after binding to the socket, to change the named Unix domain stream socket access mode). Abstract Unix domain stream sockets do not have a filesystem-visible node, and any process can connect to such a bound socket.
When a privilege boundary is involved between the application (running as its own dedicated uid) and the agent (running as a user uid), it is important to make sure that both sides are who they claim to be throughout the entire exchange. The credentials are valid only at the moment they are checked, and a known attack method is to have the valid agent execute a nefarious binary just after having connected to the socket, so that the other side still sees the original credentials, but all further communication is controlled by the nefarious process.
To avoid this, make sure the socket descriptor is not shared across an exec (using CLOEXEC descriptor flag), and optionally check the peer credentials more than once, for example initially and finally.
Why is this "complicated"? Because proper security has to be baked in. It cannot be added on top afterwards, or be taken care of invisibly for you: it must be a part of the approach. Changes in the approach must be reflected in the security implementation, or you have no security.
In real life, after you implement this (both the same-executable-binary case and the privileged-service-or-application and user-agent case), you'll find that it isn't as complicated as it sounds: each step has its purpose, and can be tweaked if the approach changes. In particular, it isn't much C code at all.
If one wants or needs "something easier", then one just has to pick something other than security-sensitive code.

Related

Why should I close all file descriptors after calling fork() and prior to calling exec...()? And how would I do it?

I've seen a lot of C code that tries to close all file descriptors between calling fork() and calling exec...(). Why is this commonly done and what is the best way to do it in my own code, as I've seen so many different implementations already?
When calling fork(), your operating system creates a new process by simply cloning your existing process. The new process will be pretty much identical to the process it was cloned from, except for its process ID and any properties that are documented to be replaced or reset by the fork() call.
When calling any form of exec...(), the process image of the calling process is replaced by a new process image but other than that the process state is preserved. One consequence is that open file descriptors in the process file descriptor table prior to calling exec...() are still present in that table after calling it, so the new process code inherits access to them. I guess this has probably been done so that STDIN, STDOUT, and STDERR are automatically inherited by child processes.
However, keep in mind that in POSIX C, file descriptors are not only used to access actual files; they are also used for all kinds of system and network sockets, pipes, shared memory identifiers, and so on. If you don't close these prior to calling exec...(), your new child process will get access to all of them, even to resources it could not gain access to on its own because it lacks the required access rights. Think about a root process creating a non-root child process: this child would have access to all open file descriptors of the root parent process, including open files that should only be writable by root, or protected server sockets below port 1024.
So unless you want a child process to inherit access to currently open file descriptors, as may explicitly be desired e.g. to capture STDOUT of a process or feed data via STDIN to that process, you are required to close them prior to calling exec...(). This is not only because of security (which sometimes may play no role at all) but also because otherwise the child process will have fewer free file descriptors available (think of a long chain of processes, each opening files and then spawning a sub-process: there will be fewer and fewer free file descriptors available).
One way to do that is to always open files using the flag O_CLOEXEC, which ensures that the file descriptor is automatically closed when exec...() is called. One problem with that solution is that you cannot control how external libraries open files, so you cannot rely on all code always setting this flag.
Another problem is that this solution only works for file descriptors created with open(). You cannot pass that flag when creating sockets, pipes, etc. This is a known problem, and some systems work around it by offering the non-standard accept4(), pipe2(), dup3(), and the SOCK_CLOEXEC flag for sockets. However, these are not yet POSIX standard, and it's unknown if they will become standard (this is planned, but until a new standard has been released we cannot know for sure; it will also take years until all systems have adopted them).
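Where those extensions exist (Linux with glibc is assumed here), the flag can be requested in the same call that creates the descriptor. A minimal sketch; cloexec_socket and cloexec_pipe are hypothetical wrapper names:

```c
#define _GNU_SOURCE             /* pipe2() is a non-standard extension */
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Socket that is close-on-exec from the moment it exists. */
int cloexec_socket(int domain, int type)
{
    return socket(domain, type | SOCK_CLOEXEC, 0);
}

/* Pipe whose two ends are close-on-exec from the moment they exist. */
int cloexec_pipe(int fds[2])
{
    return pipe2(fds, O_CLOEXEC);
}
```

Because the flag is set atomically with creation, there is no separate fcntl() step for these descriptors.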
What you can do is set the FD_CLOEXEC flag later using fcntl() on the file descriptor; however, note that this isn't safe in a multi-threaded environment. Just consider the following code:
int so = socket(...);
fcntl(so, F_SETFD, FD_CLOEXEC);
If another thread calls fork() between the first and the second line, which is of course possible, the flag has not been set yet, and thus the file descriptor won't get closed when that child calls exec...().
So the only way that is really safe is to explicitly close them and this is not as easy as it may seem!
I've seen a lot of code that does stupid things like this:
for (int i = STDERR_FILENO + 1; i < 256; i++) close(i);
But just because some POSIX systems have a default limit of 256 doesn't mean that this limit cannot be raised. Also, on some systems the default limit is higher to begin with.
Using FD_SETSIZE instead of 256 is equally wrong: just because the select() API has a hard limit by default on most systems doesn't mean that a process cannot have more open file descriptors than this limit. After all, you don't have to use select() with them; you can use the poll() API as a replacement, and poll() has no upper limit on file descriptor numbers.
Using OPEN_MAX instead of 256 is always correct, as that really is the absolute maximum number of file descriptors a process can have. The downside is that OPEN_MAX can theoretically be huge and doesn't reflect the real current runtime limit of a process.
To avoid having to close too many non-existing file descriptors, you can use this code instead:
int fdlimit = (int)sysconf(_SC_OPEN_MAX);
for (int i = STDERR_FILENO + 1; i < fdlimit; i++) close(i);
sysconf(_SC_OPEN_MAX) is documented to update correctly if the open file limit (RLIMIT_NOFILE) has been raised using setrlimit(). The resource limits (rlimits) are the effective limits for a running process and for files they will always have to be between _POSIX_OPEN_MAX (documented as the minimum number of file descriptors a process is always allowed to open, must be at least 20) and OPEN_MAX (must be at least _POSIX_OPEN_MAX and sets the upper limit).
While closing all possible descriptors in a loop is technically correct and will work as desired, it may try to close several thousand file descriptors, most of which often do not exist. Even if the close() call for a non-existing file descriptor is fast (which is not guaranteed by any standard), it may take a while on weaker systems (think of embedded devices or small single-board computers), which may be a problem.
So several systems have developed more efficient ways to solve this issue. Famous examples are closefrom() and fdwalk() which BSD and Solaris systems support. Unfortunately The Open Group voted against adding closefrom() to the standard (quote): "it is not possible to standardize an interface that closes arbitrary file descriptors above a certain value while still guaranteeing a conforming environment." (Source) This is of course nonsense, as they make the rules themselves and if they define that certain file descriptors can always be silently omitted from closing if the environment or system requires or the code itself requests that, then this would break no existing implementation of that function and still offer the desired functionality for the rest of us. Without these functions people will use a loop and do exactly what The Open Group tries to avoid here, so not adding it only makes the situation even worse.
On some platforms you are basically out of luck, e.g. macOS, which is fully POSIX conformant. If you don't want to close all file descriptors in a loop on macOS, your only option is to not use fork()/exec...() but instead posix_spawn(). posix_spawn() is a newer API for platforms that don't support process forking; it can be implemented purely in user space on top of fork()/exec...() on platforms that do support forking, and can otherwise use some other API a platform offers for starting child processes. On macOS there exists a non-standard flag POSIX_SPAWN_CLOEXEC_DEFAULT, which will treat all file descriptors as if the CLOEXEC flag had been set on them, except for those for which you explicitly specified file actions.
On Linux you can get a list of file descriptors by looking at the path /proc/{PID}/fd/ with {PID} being the process ID of your process (getpid()), that is, if the proc file system has been mounted at all and it has been mounted to /proc (but a lot of Linux tools rely on that, not doing so would break many other things as well). Basically you can limit yourself to close all descriptors listed under this path.
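A minimal sketch of that Linux approach, assuming /proc is mounted at /proc. close_fds_from is a hypothetical name for this example (it is not the BSD closefrom()), and a caller would fall back to the sysconf() loop when it returns -1.

```c
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Close every open descriptor >= lowfd that is listed in /proc/self/fd.
 * Returns 0 on success, or -1 if the directory cannot be opened. */
int close_fds_from(int lowfd)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int self = dirfd(dir);      /* the descriptor used for the listing itself */
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        char *end;
        long fd = strtol(ent->d_name, &end, 10);
        if (end == ent->d_name || *end != '\0')
            continue;           /* skips "." and ".." */
        if (fd >= lowfd && fd != self)
            close((int)fd);
    }
    closedir(dir);              /* also closes `self` */
    return 0;
}
```

A typical call in the child between fork() and exec...() would be close_fds_from(STDERR_FILENO + 1).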
True story: Once upon a time I wrote a simple little C program that opened a file, and I noticed that the file descriptor returned by open was 4. "That's funny," I thought. "Standard input, output, and error are always file descriptors 0, 1, and 2, so the first file descriptor you open is usually 3."
So I wrote another little C program that started reading from file descriptor 3 (without opening it, that is, but rather, assuming that 3 was a pre-opened fd, just like 0, 1, and 2). It quickly became apparent that, on the Unix system I was using, file descriptor 3 was pre-opened on the system password file. This was evidently a bug in the login program, which was exec'ing my login shell with fd 3 still open on the password file, and the stray fd was in turn being inherited by programs I ran from my shell.
Naturally the next thing I tried was a simple little C program to write to the pre-opened file descriptor 3, to see if I could modify the password file and give myself root access. This, however, didn't work; the stray fd 3 was opened on the password file in read-only mode.
But at any rate, this helps to explain why you shouldn't leave file descriptors open when you exec a child process.
[Footnote: I said "true story", and it mostly is, but for the sake of the narrative I did change one detail. In fact, the buggy version of /bin/login was leaving fd 3 opened on the groups file, /etc/group, not the password file.]

detect if file descriptor is socket in solaris 11.0 and extract ip address

In Solaris, I need to get the IP address a specific process (an sshd session) is using; I have its process ID.
How is this done on Linux? After reading the netstat.c source, this is the flow:
Iterate over the process's file descriptors, located at /proc/ProcessId/fd/.
If the file descriptor is a socket, readlink it, open it, and finally read it.
In Solaris, I can detect the socket file descriptors of the process like this:
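int fd = -1;
struct dirent *dentp;
while ((dentp = readdir(dirp)) != NULL) { /* iterate over the file descriptors */
    if (dentp->d_name[0] == '.')
        continue; /* skip "." and ".." */
    fd = atoi(dentp->d_name);
    struct stat statb;
    char temp_dir_path[100];
    /* pid: the target process ID, assumed to be defined earlier */
    snprintf(temp_dir_path, sizeof temp_dir_path, "/proc/%d/fd/%d", (int)pid, fd);
    if (stat(temp_dir_path, &statb) != -1)
    {
        if (S_ISSOCK(statb.st_mode))
        {
            /* What to do here?? temp_dir_path is e.g. /proc/12345/fd/4 */
        }
    }
}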
I saw there are methods like getpeername(..),getsockname(..) they receive as param the file descriptor of the current context process, I want to read file descriptor for another process.
Can I open the file descriptor and cast it to struct sockaddr_in ?
The socket file descriptor structure is different between Linux and Solaris. I guess I need to do whatever pfiles / lsof do.
I saw there are methods like getpeername(..),getsockname(..) they receive as param the file descriptor of the current context process, I want to read file descriptor for another process.
Can I open the file descriptor and cast it to struct sockaddr_in ?
No. You can open() it and use the file descriptor open() returns and try using getpeername() and getsockname() on the file descriptor you get. It might even work.
You'll probably be better served by using the method pfiles uses. Per the pfiles man page:
pfiles
Report fstat(2) and fcntl(2) information for all open files in
each process. For network endpoints, the local (and peer if
connected) address information is also provided. For sockets, the
socket type, socket options and send and receive buffer sizes are also
provided. In addition, a path to the file is reported if the
information is available from /proc/pid/path. This is not necessarily
the same name used to open the file. See proc(4) for more information.
The pfiles source code can be found at http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/ptools/pfiles/pfiles.c
Solaris provides a libproc interface library that does what you need. pfiles uses that - the library provides calls such as pr_getpeername() and pr_getsockname(). You can see the implementations in http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libproc/common/pr_getsockname.c
Note that there are actual system calls to get what you need directly from the kernel.
The OpenSolaris man pages for the libproc library can be found at http://illumos.org/man/3proc/all They are likely to be substantially similar to the Solaris 11 libproc implementation.
To use these tools, you have to be really careful. From the Pgrab man page for the function used to grab a process:
Grabbing a process is a destructive action. Stopping a process stops
execution of all its threads. The impact of stopping a process depends
on the purpose of that process. For example, if one stops a process
that's primarily doing computation, then its computation is delayed
the entire time that it is stopped. However, if instead this is an
active TCP server, then the accept backlog may fill causing connection
errors and potentially connection time out errors.
There are options to not stop the grabbed process, and to grab it read-only.

How to get the file descriptors of TCP socket for a given process in Linux?

I'm trying to find the file descriptors for all TCP sockets of a given process, i.e. given its pid, so that I can get the socket options from another process without modifying the original one.
For example, if I know the file descriptor is fd, then I hope to call getsockopt(fd, ...) to retrieve the options from another process. Is this doable? If so, how do I get the fd I need outside the original process?
I have tried printing the return value when creating a socket, i.e. s = socket(...); printf("%d\n", s);, keeping the original process running, and calling getsockopt(s, ...) somewhere else, but it doesn't work. It seems that this return value is process-dependent.
I have also found the solution with Unix domain sockets, but I don't want to modify the code of the original process.
As for reading /proc/<PID>/fd directly or leveraging lsof, I don't know how to find what I need from them. My gut feeling is that they could be potential solutions.
Of course any other ideas are welcome as well. To be honest, I'm not very familiar with the file descriptor mechanism in Linux.
No. You simply cannot do what you are asking.
A file descriptor is just an integer, but it refers to an open file object in a given process. That integer value in another process refers to a different, possibly unopened file object.
Without involving the ptrace debugging API, or remote code injection, you are limited to what the kernel exposes to you via /proc.
Check out the man page for ss. If this utility can't show you information about a socket you desire, then nothing can.
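What /proc does expose can be read with a few lines of C. The following is a minimal sketch, assuming /proc is mounted and the caller is allowed to read the target's /proc/<PID>/fd directory (same user or root); list_socket_fds is a hypothetical helper name. It only identifies which descriptors are sockets and reports their inode numbers; it does not give access to the sockets themselves.

```c
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Print "fd N -> socket inode I" for every socket descriptor of `pid`.
 * Returns the number of sockets found, or -1 if the fd directory
 * cannot be opened. */
int list_socket_fds(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/fd", (long)pid);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        char target[64];
        unsigned long inode;
        /* Each entry is a symlink; sockets read as "socket:[inode]". */
        ssize_t n = readlinkat(dirfd(dir), ent->d_name, target, sizeof target - 1);
        if (n < 0)
            continue;           /* "." and ".." are not symlinks */
        target[n] = '\0';
        if (sscanf(target, "socket:[%lu]", &inode) == 1) {
            printf("fd %s -> socket inode %lu\n", ent->d_name, inode);
            count++;
        }
    }
    closedir(dir);
    return count;
}
```

The inode numbers can then be matched against the inode column of /proc/<PID>/net/tcp to find the TCP sockets among them.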

setuid equivalent for non-root users

Does Linux have some C interface similar to setuid, which allows a program to switch to a different user using e.g. the username/password? The problem with setuid is that it can only be used by superusers.
I am running a simple web service which requires jobs to be executed as the logged in user. So the main process runs as root, and after the user logs in it forks and calls setuid to switch to the appropriate uid. However, I am not quite comfortable with the main proc running as root. I would rather have it run as another user, and have some mechanism to switch to another user similar to su (but without starting a new process).
First, setuid() can most definitely be used by non-superusers. Technically, all you need in Linux is the CAP_SETUID (and/or CAP_SETGID) capability to switch to any user. Second, setuid() and setgid() can change the process identity between the real (user who executed the process), effective (owner of the setuid/setgid binary), and saved identities.
However, none of that is really relevant to your situation.
There exists a relatively straightforward, yet extremely robust solution: Have a setuid root helper, forked and executed by your service daemon before it creates any threads, and use an Unix domain socket pair to communicate between the helper and the service, the service passing both its credentials and the pipe endpoint file descriptors to the helper when user binaries are to be executed. The helper will check everything securely, and if all is in order, it will fork and execute the desired user helper, with the specified pipe endpoints connected to standard input, standard output, and standard error.
The procedure for the service to start the helper, as early as possible, is as follows:
1. Create an Unix domain socket pair, used for privileged communications between the service and the helper.
2. Fork.
3. In the child, close all excess file descriptors, keeping only one end of the socket pair. Redirect standard input, output, and error to /dev/null.
4. In the parent, close the child end of the socket pair.
5. In the child, execute the privileged helper binary.
6. The parent sends a simple message, possibly one without any data at all, but with an ancillary message containing its credentials.
7. The helper program waits for the initial message from the service.
8. When it receives it, it checks the credentials. If the credentials do not pass muster, it quits immediately.
The credentials in the ancillary message define the originating process' UID, GID, and PID. Although the process needs to fill in these, the kernel verifies they are true. The helper of course verifies that UID and GID are as expected (correspond to the account the service ought to be running as), but the trick is to get the statistics on the file the /proc/PID/exe symlink points to. That is the genuine executable of the process that sent the credentials. You should verify it is the same as the installed system service daemon (owned by root:root, in the system binary directory).
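The credential-carrying message can be sketched as follows. This is a minimal sketch with hypothetical function names (enable_creds, send_creds, recv_creds); it sends one dummy data byte along with the ancillary message, which is a choice of this example rather than a requirement stated above.

```c
#define _GNU_SOURCE             /* struct ucred, SCM_CREDENTIALS */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

union cred_ctl {
    struct cmsghdr hdr;         /* forces correct alignment */
    char buf[CMSG_SPACE(sizeof(struct ucred))];
};

/* Receiver: ask the kernel to deliver credentials on this socket. */
int enable_creds(int sock)
{
    int on = 1;
    return setsockopt(sock, SOL_SOCKET, SO_PASSCRED, &on, sizeof on);
}

/* Sender: one dummy byte plus our own pid, uid, and gid. */
int send_creds(int sock)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct ucred me = { .pid = getpid(), .uid = getuid(), .gid = getgid() };
    union cred_ctl ctl;
    struct msghdr msg = { 0 };

    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_CREDENTIALS;
    cmsg->cmsg_len = CMSG_LEN(sizeof me);
    memcpy(CMSG_DATA(cmsg), &me, sizeof me);

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receiver: returns 0 and fills *out if credentials arrived, else -1. */
int recv_creds(int sock, struct ucred *out)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union cred_ctl ctl;
    struct msghdr msg = { 0 };

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_CREDENTIALS) {
            memcpy(out, CMSG_DATA(c), sizeof *out);
            return 0;
        }
    }
    return -1;
}
```

The helper would call enable_creds() on its end of the socket pair before the service sends anything, then compare the received uid and gid against the expected service account and inspect /proc/PID/exe for the received pid.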
There is a very simple attack that may defeat the security up to this point. A nefarious user may create their own program, that forks and executes the helper binary correctly, sends the initial message with its true credentials -- but replaces itself with the correct system binary before the helper has a chance to check what the credentials actually refer to!
That attack is trivially defeated by three further steps:
9. The helper program generates a (cryptographically secure) pseudorandom number, say 1024 bits, and sends it back to the parent.
10. The parent sends the number back, but again adds its credentials in an ancillary message.
11. The helper program verifies that the UID, GID, and PID have not changed, and that /proc/PID/exe still points to the correct service daemon binary. (I'd just repeat the full checks.)
At step 8, the helper has already ascertained the other end of the socket is executing the binary it ought to be executing. Sending it a random cookie it has to send back, means the other end cannot have "stuffed" the socket with the messages beforehand. Of course this assumes the attacker cannot guess the pseudorandom number beforehand. If you want to be careful, you can read a suitable cookie from /dev/random, but remember it is a limited resource (may block if there is not enough randomness available to the kernel). I'd personally just read say 1024 bits (128 bytes) from /dev/urandom, and use that.
At this point, the helper has ascertained the other end of the socket pair is your service daemon, and the helper can trust the control messages as far as it can trust the service daemon. (I'm assuming this is the only mechanism the service daemon will spawn user processes; otherwise you'd need to re-pass the credentials in every further message, and re-check them every time in the helper.)
Whenever the service daemon wishes to execute a user binary, it
1. Creates the necessary pipes (one for feeding standard input to the user binary, one to get back the standard output from the user binary)
2. Sends a message to the helper containing
   - Identity to run the binary as; either user (and group) names, or UID and GID(s)
   - Path to the binary
   - Command-line parameters given to the binary
   - An ancillary message containing the file descriptors for the user binary endpoints of the data pipes
Whenever the helper gets such a message, it forks. In the child, it replaces standard input and output with the file descriptors in the ancillary message, changes identity with setresgid() and setresuid() and/or initgroups(), changes the working directory to somewhere appropriate, and executes the user binary. The parent helper process closes the file descriptors in the ancillary message, and waits for the next message.
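The fork-and-switch step in the helper might look roughly like this. It is a sketch under several assumptions: spawn_as is a hypothetical name, the identity arrives as numeric UID and GID, setgroups() with a single group stands in for initgroups() (which needs a user name), the working directory is set to "/", and the supplementary groups are only reset when running as root so that the sketch can also be tried unprivileged with the caller's own IDs.

```c
#define _GNU_SOURCE             /* setresuid(), setresgid(), setgroups() */
#include <grp.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, wire in_fd/out_fd to standard input/output, switch identity,
 * and execute `path`. Returns the child's pid in the parent, or -1 if
 * fork() fails. The child never returns from this function. */
pid_t spawn_as(uid_t uid, gid_t gid, int in_fd, int out_fd,
               const char *path, char *const argv[])
{
    pid_t child = fork();
    if (child != 0)
        return child;           /* parent, or fork() error */

    if (dup2(in_fd, STDIN_FILENO) == -1 || dup2(out_fd, STDOUT_FILENO) == -1)
        _exit(126);
    if (in_fd > STDERR_FILENO)
        close(in_fd);
    if (out_fd > STDERR_FILENO)
        close(out_fd);

    /* Order matters: groups first, then gid, and the uid last, because
     * the privilege to change groups is lost with the uid change. */
    if (geteuid() == 0 && setgroups(1, &gid) == -1)
        _exit(126);
    if (setresgid(gid, gid, gid) == -1 || setresuid(uid, uid, uid) == -1)
        _exit(126);
    if (chdir("/") == -1)
        _exit(126);

    execv(path, argv);
    _exit(127);                 /* only reached if execv() failed */
}
```

After spawn_as() returns in the parent helper, it would close its copies of in_fd and out_fd and go back to waiting for the next message; a real helper would also close any other descriptors the child must not inherit.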
If the helper exits when there is going to be no more input from the socket, then it will automatically exit when the service exits.
I could provide some example code, if there is sufficient interest. There's lots of details to get right, so the code is a bit tedious to write. However, correctly written, it is more secure than e.g. Apache SuEXEC.
No, there is no way to change UID using only a username and password. (The concept of a "password" is not recognized by the kernel in any fashion -- it only exists in userspace.) To switch from one non-root UID to another, you must become root as an intermediate step, typically by exec()-uting a setuid binary.
Another option in your situation may be to have the main server run as an unprivileged user, and have it communicate with a back-end process running as root.

Sharing File descriptors across processes

I want to set up a shared memory environment for multiple independent processes. The data structure that I want to share also contains connection fds, which are per process.
Is there a way to share these fds, or to use global fds or something similar?
Thanks in advance.
There are two ways to share file descriptors on a Unix host. One is by letting a child process inherit them across a fork.
The other is sending file descriptors over a Unix domain socket with sendmsg; see this example program, function send_connection (archived here). Note that the file descriptor might have a different number in the receiving process, so you may have to perform some dup2 magic to make them come out right in your shared memory.
If you don't do this, the file descriptors in your shared memory region will be just integers.
Recently, I had to solve a problem similar to what the OP is describing. To this end, I proposed a dedicated system call (a very simple one, I might add) that sends file descriptors directly to cooperating processes, relying on POSIX.1b signal queues as the delivery medium (as an added benefit, such an approach is inherently immune to the "fd recursion" attack, which plagues all VFS-based mechanisms to some degree).
Here's the proposed patch:
http://permalink.gmane.org/gmane.linux.kernel/1843084
(presently, the patch only adds the new syscall for the x86/x86_64 architecture, but wiring it up to other architectures is trivial; no platform-dependent features are used).
The theory of operation is as follows. Both sender and receiver need to agree on one or more signal numbers to use for descriptor passing. Those must be POSIX.1b signals, which guarantee reliable delivery, hence the SIGRTMIN offset. Also, smaller signal numbers have higher delivery priority, in case priority management is required:
int signo_to_use = SIGRTMIN + my_sig_off;
Then, originating process invokes a system call:
int err = sendfd(peer_pid, signo_to_use, fd_to_send);
That's it; nothing else is necessary on the sender's side. Obviously, sendfd() will only be successful if the originating process has the right to signal the destination process and the destination process is not blocking/ignoring the signal.
It must also be noted that sendfd() never blocks; it will return immediately if the destination process' signal queue is full. In a well-designed application, this will indicate that the destination process is in trouble anyway, or there's too much work to do, so new workers shall be spawned or work items dropped. The size of the process' signal queue can be configured using setrlimit() (RLIMIT_SIGPENDING), same as the number of available file descriptors.
The receiving process may safely ignore the signal (in this case nothing will happen and almost no overhead will be incurred on the kernel side). However, if the receiving process wants to get the delivered file descriptor, all it has to do is collect the signal info using sigtimedwait()/sigwaitinfo() or the more versatile signalfd():
/* First, the receiver needs to specify what it is waiting for: */
sigset_t sig_mask;
sigemptyset(&sig_mask);
sigaddset(&sig_mask, signo_to_use);
/* The signal has to be blocked so that it stays queued until collected: */
sigprocmask(SIG_BLOCK, &sig_mask, NULL);
siginfo_t sig_info;
/* Then all it needs is to wait for the event: */
sigwaitinfo(&sig_mask, &sig_info);
After the successful return of the sigwaitinfo(), sig_info.si_int will contain the new file descriptor, pointing to the same IO object, as file descriptor sent by the originating process. sig_info.si_pid will contain the originating process' PID, and sig_info.si_uid will contain the originating process' UID. If sig_info.si_int is less than zero (represents an invalid file descriptor), sig_info.si_errno will contain the errno for the actual error encountered during fd duplication process.
