Background Our kernel level program invokes a process in user space for making some decisions on the basis of values in a file. The user space program is a short lived process that compares value passed by the kernel with the file contents. At a time usually many instances of the user space program can be invoked. The file has less than one thousand lines.
Question What is the preferred way to read the a small file that is shared among short lived many processes? Currently We are using File I/O (fopen, fread)
Note The question When should I use mmap for file access? discusses very nicely but there is no discussion for the case of short lived many processes
What is the preferred way to read a small file that is shared among short lived many processes?
getline() or fread() using standard POSIX I/O from <stdio.h>, or low-level <unistd.h> open() and read() to a sufficiently large buffer (with sufficiently aggressive growth policy); depending on how the read data is parsed/interpreted.
You don't use memory mapping for reading a file once; it is just not as efficient as read()/fread(), due to the mapping overhead.
Note that if the file contains many numbers, the actual bottleneck is the string-to-integer and string-to-floating-point conversions (strtol(), strtod(), sscanf(), etc.), because if accessed often enough the file contents will stay in the page cache. The standard implementations of string conversion functions are designed for correctness, not for efficiency.
Our kernel level program invokes a process in user space for making some decisions on the basis of values in a file.
Seems very inefficient to me. Personally, I'd keep the "file" in-kernel, as a structure, and only expose an userspace interface, probably a character device, to modify its contents.
That way you only incur a context switch whenever the "file" is changed by an userspace process, and kernel-space stuff can simply examine the contents of the structure directly, in native format, with no overhead.
This is what e.g. netfilter (built-in firewall) and other existing stuff do.
Related
I was reading about sendfile(2) recently, and the man page states:
sendfile() copies data between one file descriptor and another.
Because this copying is done within the kernel, sendfile() is more
efficient than the combination of read(2) and write(2), which would
require transferring data to and from user space.
It got me thinking, why exactly is the combination of read()/write() slower? The man page focuses on extra copying that has to happen to and from userspace, not the total number of calls required. I took a short look at the kernel code for read and write but didn't see the copy.
Why does the copy exist in the first place? Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
why exactly is the combination of read()/write() slower?
The manual page is quite clear about this. Doing read() and then write() requires to copy the data two times.
Why does the copy exist in the first place?
It should be quite obvious: since you invoke read, you want the data to be copied to the memory of your process, in the specified destination buffer. Same goes for write: you want the data to be copied from the memory of your process. The kernel doesn't really know that you just want to do a read + write, and that copying back and forth two times could be avoided.
When executing read, the data is copied by the kernel from the file descriptor to the process memory. When executing write the data is copied by the kernel from the process memory to the file descriptor.
Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
The crucial point here is that when you read or write a file, the file has to be mapped from disk to memory by the kernel in order for it to be read or written. This is called memory-mapped file I/O, and it's a huge factor in the performance of modern operating systems.
The file content is already present in kernel memory, mapped as a memory page (or more). In case of a read, the data needs to be copied from that file kernel memory page to the process memory, while in case of a write, the data needs to be copied from the process memory to the file kernel memory page. The kernel will then ensure that the data in the kernel memory page(s) corresponding to the file is correctly written back to disk when needed (if needed at all).
This "intermediate" kernel mapping can be avoided, and the file mapped directly into userspace memory, but then the application would have to manage it manually, which is complicated and easy to mess up. This is why, for normal file operations, files are mapped into kernel memory. The kernel provides high level APIs for userspace programs to interact with them, and the hard work is left to the kernel itself.
The sendfile syscall is much faster because you do not need to perform the copy two times, but only once. Assuming that you want to do a sendfile of file A to file B, then all the kernel needs to do is to copy the data from A to B. However, in the case of read + write, the kernel needs to first copy from A to your process, and then from your process to B. This double copy is of course slower, and if you don't really need to read or manipulate the data, then it's a complete waste of time.
FYI, sendfile itself is basically an easy-to-use wrapper around splice (as can bee seen from the source code), which is a more generic syscall to perform zero-copy data transfer between file descriptors.
I took a short look at the kernel code for read and write but didn't see the copy.
In terms of kernel code, the whole process for reading a file is very complicated, but what the kernel ends up doing is a "special" version of memcpy(), called copy_to_user(), which copies the content of the file from the kernel memory to the userspace memory (doing the appropriate checks before performing the actual copy). More specifically, for files, the copyout() function is used, but the behavior is very similar, both end up calling raw_copy_to_user() (which is architecture-dependent).
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
The aio_{read,write} libc functions defined by POSIX are just asynchronous wrappers around read and write (i.e. they still use read and write under the hood). These still copy data to/from userspace.
io_uring can provide zero-copy operations, when using the O_DIRECT flag of open (see the manual page):
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user-
space buffers. The O_DIRECT flag on its own makes an effort
to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be
used in addition to O_DIRECT. See NOTES below for further
discussion.
This should be done carefully though, as it could very well degrade performance in case the userspace application does not do the appropriate caching on its own (if needed).
See also this related detailed answer on asynchronous I/O, and this LWN article on io_uring.
I have third-party library with a function that does some computation on the specified data, and writes the results to a file specified by file name:
int manipulateAndWrite(const char *filename,
const FOO_DATA *data);
I cannot change this function, or reimplement the computation in my own function, because I do not have the source.
To get the results, I currently need to read them from the file. I would prefer to avoid the write to and read from the file, and obtain the results into a memory buffer instead.
Can I pass a filepath that indicates writing to memory instead of a
filesystem?
Yes, you have several options, although only the first suggestion below is supported by POSIX. The rest of them are OS-specific, and may not be portable across all POSIX systems, although I do believe they work on all POSIXy systems.
You can use a named pipe (FIFO), and have a helper thread read from it concurrently to the writer function.
Because there is no file per se, the overhead is just the syscalls (write and read); basically just the overhead of interprocess communication, nothing to worry about. To conserve resources, do create the helper thread with a small stack (using pthread_attr_ etc.), as the default stack size tends to be huge (on the order of several megabytes; 2*PTHREAD_STACK_SIZE should be plenty for helper threads.)
You should ensure the named pipe is in a safe directory, accessible only to the user running the process, for example.
In many POSIXy systems, you can create a pipe or a socket pair, and access it via /dev/fd/N, where N is the descriptor number in decimal. (In Linux, /proc/self/fd/N also works.) This is not mandated by POSIX, so may not be available on all systems, but most do support it.
This way, there is no actual file per se, and the function writes to the pipe or socket. If the data written by the function is at most PIPE_BUF bytes, you can simply read the data from the pipe afterwards; otherwise, you do need to create a helper thread to read from the pipe or socket concurrently to the function, or the write will block.
In this case, too, the overhead is minimal.
On ELF-based POSIXy systems (basically all), you can interpose the open(), write(), and close() syscalls or C library functions.
(In Linux, there are two basic approaches, one using the linker --wrap, and one using dlsym(). Both work fine for this particular case. This ability to interpose functions is based on how ELF binaries are linked at run time, and is not directly related to POSIX.)
You first set up the interposing functions, so that open() detects if the filename matches your special "in-memory" file, and returns a dedicated descriptor number for it. (You may also need to interpose other functions, like ftruncate() or lseek(), depending on what the function actually does; in Linux, you can run a binary under ptrace to examine what syscalls it actually uses.)
When write() is called with the dedicated descriptor number, you simply memcpy() it to a memory buffer. You'll need to use global variables to describe the allocated size, size used, and the pointer to the memory buffer, and probably be prepared to resize/grow the buffer if necessary.
When close() is called with the dedicated descriptor number, you know the memory buffer is complete, and the contents ready for processing.
You can use a temporary file on a RAM filesystem. While the data is technically written to a file and read back from it, the operations involve RAM only.
You should arrange for a default path to one to be set at compile time, and for individual users to be able to override that for their personal needs, for example via an environment variable (YOURAPP_TMPDIR?).
There is no need for the application to try and look for a RAM-based filesystem: choices like this are, and should be, up to the user. The application should not even care what kind of filesystem the file is on, and should just use the specified directory.
You could not use that library function. Take a look at this on how to write to in-memory files:
Is it possible to create a C FILE object to read/write in memory
I know that I should close opened files. I know that if I don't do that the file descriptors will leak. I also know that file descriptor is just an integer. With that integer os associates some resources. And here is the question. What are those resources? What makes it difficult to create infinite (a lot) file descriptors? Why can't os detect those leakages? And why os doesn't give the same file descriptors for the same file opening?
What are those resources?
This link posted by codeforester contains some material about.
Anyway, those file descriptors are simply handles to complex data the kernel holds for a program. They could be opaque pointers, but using simple numbers has its advantages (stdin, stdout, stderr have a well-known number, for example). What kind and amount of data is a kernel thing, and a program should not, and doesn't need, to know. So, nor you and me. But, just to speak, for example some buffer is needed. Then, the kernel must know in any moment which files are opened, otherwise, for example, you could unmount a filesystem with open files and leave programs dangling.
What makes it difficult to create infinite (a lot) file descriptors?
Because file descriptors cost ram (and CPU also), which is a finite resource, and nobody wants a kernel crash because some (stupid) programmer wastes file descriptors... :-). So, the kernel reserves a finite amount of resources for file descriptors (which are not always simple files). Kernels are not all equal, each can have its policy and often some way for users to manage relevant settings.
Why can't os detect those leakages?
Because it can not. The kernel can not tell the difference between a poor written program, which leaks resources, and a program which legitimately allocates many resources. Moreover, it is not the duty of a kernel to try to distinguish good programs from bad ones. A kernel must supply services, fast and efficiently -- all the rest is responsibility of programmers.
And why os doesn't give the same file descriptors for the same file opening?
Because it is legitimate to open the same file twice or more. Two programs can open the same file, two threads can, or even a single thread. And the kernel must always respect the "contract" its API claims, always in the same manner: again, it is the programmer who must know what he is doing.
If my code does something like fd = open("/dev/sdXY", ...) and pwrite(fd, ...)/pread(fd, ...), do the I/O operations skip the buffers or disk cache? Suppose /dev/sdXY is a unmounted, formatted disk partition (ext4, ufs, etc.).
I ask that because there is a need to grant contiguous file storage in an application I'm working on and I read that the only way to achieve it is doing something like what I described. However, I may remove the need for contiguous storage if that would lead in lost of buffers, disk cache or some other useful feature.
I'm also confused about if I would need to re-implement low level stuff since the partition would already be formatted with a file system. I read that would be the case for RAW disks/partitions. I already know it will be needed to handle which blocks are free or in use, files and folders structures, etc., I'm already working on that.
Another question: I have only seen something about buffers when reading about fopen()/fread()/fwrite() and C++'s file streams. Is it right that only these streams and the f* family of functions have some kind of buffer, unlike open/write/read/pwrite/pread/etc? Is this buffer the same as disk cache or something different?
A last one: Is HDD cache handled by its own drive or by file system (ext4, ufs, etc.)?
The simple answer is 'it depends'. What's hard is characterizing what it depends on.
Simply using open() doesn't avoid the kernel disk buffer pool. To do that, you need special options (O_DIRECT) on Linux. However, using open() does avoid using hidden application buffers; you get to choose where the data is read from or written to without any intermediate copies. By contrast, the f* family of functions do have a 'hidden' application buffer; the data is frequently read into an I/O buffer associated with the FILE * file stream, and then copied into your application buffers.
If your /dev/sdXY device is already formatted with a file system but you want to ensure contiguous file storage for a file, you are going to have to replicate a significant portion of the file system driver to ensure you allocate the space correctly. It is unlikely to be a sensible use of your time or energy. Yes, you would need to reimplement all sorts of low-level disk space management — it would be entirely non-trivial. Further, the implementation for ext4 would be quite different from the implementation for ufs, etc — so you'd really have your work cut out for you.
When writing directly to a device in /dev, I open a file descriptor and perform a UNIX write() followed by a read(). Can I have multiple threads doing this write()/read() sequence on the same file descriptor, and not get jumbled data if two threads enter the write() function at the same time?
References to std documentation would be immensely helpful. I've not been able to find anything though. Someone has mentioned that such operations are atomic in the kernel, but I am sceptical.
Also, to clarify this is a file in /dev, so any insight as to how far the "file pointer" concept applies here is helpful as well.
File pointers (FILE *fp, for example), are a layer in the user-side code sitting above the function calls (such as write()). Access to fp is controlled by locks in a threaded environment (you can't have to threads modifying the same structure at the same time).
Inside the kernel, I'd expect there to be a lock on the file descriptor (and/or 'open file description') to prevent it being used from two threads at once.
You can look up the POSIX specification for read() and
getchar_unlocked()
to find out more about locking etc — at least for a POSIX compliant implementation.
Note that POSIX still uses C99. Therefore, it is not cognizant of the C11 thread facilities. The C11 standard does not have read() et al (file I/O using file descriptors), so it says nothing about such system calls. Neither does it provide a getchar_unlocked() or any of its relatives.
Caveat: I have not been in kernels for awhile, but this is the way it used to work.
For disk files:
Can you open the file in append mode, writing block sizes <= BLKSIZE ?
Small enough block sizes guarantee, in POSIX environments, atomic writes in POSIX environments (actually, the limit may be greater than BLKSIZE... I'm too lazy to hunt around for the alternate symbol).
Append guarantees seeks to the end of the file... for devices supporting seeks. Combine with atomic writes you should be golden.
Each buffer must stand by itself under the assumption some "foreign" information may follow it.
For ttys:
Append mode makes no sense here. As before, but paying attention to line endings gets even more important. And this very much does not apply to reads. Codes ttys treat as control sequences can also trip you up if even the modes the sequences enable split across blocks.
For other devices:
Can get tricky here. Depends on the device.
I'm going to assume that you are referring to a generic character device (e.g. a tty) since you were not specific. As far as I know, each fd-type operation (e.g. read()/write()) maps directly into a call into the driver.
Therefore, the driver will receive each write()'s data chunk as a whole and not see the next one's data until it is done (e.g. data is queued to be transmitted).
However, if the driver does not consume the entire chunk of data at once (i.e. write() returns less than the specified number of bytes, then there is no guarantee, that the thread will be able to write again with the remainder before another thread does a different write().
Also, as Johnathan Leffler noted, if you use standard I/O with process-level buffering, all bets are off.
Bottom line, if you are using direct fd writes, each write will map directly to one driver function call. From there, it's up to the driver as to if write is atomic.
Edit: wlformyd brings up the question of locking between multiple threads on multiple processors. To my knowledge, there is no locking on a FD and, in fact, that would be ineffective as multiple FDs could be used to access the same device.
I believe it is up to the driver itself to do locking to prevent contention to internal queues and/or hardware. in that sense, on a multi-processor system, the kernel doesn't prevent multiple simultaneous access to a driver's write routine. However, a properly written driver should do the locking to prevent mixing of output between two write calls.