Working of Open System Call - file

I am reading about memory-mapped files. The source says they are faster than the traditional ways of opening or reading a file, such as the open and read system calls respectively, without describing how the open or read system calls actually work.
So here is my question: how does the open system call work?
As far as I know, it loads the file into memory, whereas with a memory-mapped file only the addresses are saved in memory, and when needed the requested page is brought into memory.
I would appreciate clarification of my understanding so far.
EDIT
My previous understanding written above is mostly wrong; for the correct explanation refer to the accepted answer by Pawel.

Since you gave no details I'm assuming you are interested in the behavior of Unix-like systems.
Actually the open() system call only creates a file descriptor, which may then be used by either mmap() or read().
Both memory mapped I/O and standard I/O internally access files on disk through the page cache, a buffer in which files are cached in order to reduce the number of I/O operations.
The standard I/O approach (using write() and read()) involves performing a system call which then copies data from (or to, if you are writing) the page cache to a buffer chosen by the application. In addition to that, non-sequential access requires another system call, lseek(). System calls are expensive, and so is copying data.
When a file is memory mapped, usually a memory region in the process address space is mapped directly to the page cache, so that all reads and writes of already loaded data can be performed without any additional delay (no system calls, no data copying). Only when an application attempts to access a file region that is not already loaded does a page fault occur, and the kernel loads the required data (a whole page) from disk.
EDIT:
I see that I also have to explain memory paging. On most modern architectures there is physical memory, which is a real piece of hardware, and virtual memory, which creates address spaces for processes. The kernel decides how addresses in virtual memory are mapped to addresses in physical memory. The smallest unit is a memory page (usually, but not always, 4K). It does not have to be a 1:1 mapping; for example, all virtual memory pages may be mapped to the same physical address.
In memory mapped I/O, part of the application address space and the kernel's page cache are mapped to the same physical memory region, hence the program is able to directly access the page cache.
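To make the difference concrete, here is a minimal sketch (not from the answer above; the file name data.bin is a placeholder) that reads the same file once through read(), which copies data out of the page cache into a user buffer, and once through mmap(), where the first access simply page-faults the data into a mapping of the page cache:

    /* Sketch: read() copies page-cache data into buf; mmap() maps it directly. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);    /* "data.bin" is a placeholder */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Path 1: read() copies from the page cache into a user buffer. */
        char *buf = malloc(st.st_size);
        ssize_t n = read(fd, buf, st.st_size);           /* may return less */
        printf("read() copied %zd bytes\n", n);

        /* Path 2: mmap() maps the page cache into our address space. */
        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }
        printf("first byte via mapping: %d\n", map[0]);  /* page fault loads it */

        munmap(map, st.st_size);
        free(buf);
        close(fd);
        return 0;
    }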

Pawel has beautifully explained how reads and writes are performed. Let me answer the original question: how does fopen(3) work?
When a user space process calls fopen (defined in libc or any user space library), the library translates it into the open(2) system call. First, it takes the arguments given to fopen and writes them into architecture-specific registers along with the open() syscall number. This number tells the kernel which system call the user space program wants to run. After loading these registers, the user space process traps into the kernel (via a software interrupt, traditionally INT 80h on x86) and blocks.
The kernel verifies the arguments provided, the access permissions, etc., and then either returns an error or invokes the actual system call, which is vfs_open() in this case. vfs_open() checks for an available file descriptor in the fd array and allocates a struct file. The reference count of the accessed file is increased and the fd is returned to the user program. That completes the working of open, and of most system calls in general.
open() together with read()/write(), followed by close(), is undoubtedly a much lengthier process than having the file memory mapped in the buffer cache.
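As a small illustration of the point that fopen ultimately boils down to open(2), here is a hedged sketch: every FILE* wraps a plain file descriptor, which fileno(3) exposes and which can be used with raw system calls such as fstat. The path /etc/hostname is just an example:

    /* Sketch: a FILE* from fopen wraps a file descriptor obtained via open(2). */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        FILE *fp = fopen("/etc/hostname", "r");   /* path chosen for illustration */
        if (fp == NULL) { perror("fopen"); return 1; }

        int fd = fileno(fp);                      /* the descriptor open(2) returned */
        struct stat st;
        if (fstat(fd, &st) == 0)
            printf("fd=%d, size=%lld bytes\n", fd, (long long)st.st_size);

        fclose(fp);                               /* closes the descriptor too */
        return 0;
    }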

For a lucid explanation of how open and read work on Linux, you can read this. The code snippets are from an older version of the kernel but the theory still holds.
You would still need to use the open() system call to get a valid file descriptor, which you would pass on to mmap(). As to why mmapped I/O is faster: there is no copying of data from (to) user space to (from) kernel space buffers, which is what happens with the read and write system calls.

Related

Where and why do read(2) and write(2) system calls copy to and from userspace?

I was reading about sendfile(2) recently, and the man page states:
sendfile() copies data between one file descriptor and another.
Because this copying is done within the kernel, sendfile() is more
efficient than the combination of read(2) and write(2), which would
require transferring data to and from user space.
It got me thinking, why exactly is the combination of read()/write() slower? The man page focuses on extra copying that has to happen to and from userspace, not the total number of calls required. I took a short look at the kernel code for read and write but didn't see the copy.
Why does the copy exist in the first place? Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
why exactly is the combination of read()/write() slower?
The manual page is quite clear about this. Doing read() and then write() requires copying the data twice.
Why does the copy exist in the first place?
It should be quite obvious: since you invoke read, you want the data to be copied to the memory of your process, in the specified destination buffer. Same goes for write: you want the data to be copied from the memory of your process. The kernel doesn't really know that you just want to do a read + write, and that copying back and forth two times could be avoided.
When executing read, the data is copied by the kernel from the file descriptor to the process memory. When executing write the data is copied by the kernel from the process memory to the file descriptor.
Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
The crucial point here is that when you read or write a file, the file has to be mapped from disk to memory by the kernel in order for it to be read or written. This is called memory-mapped file I/O, and it's a huge factor in the performance of modern operating systems.
The file content is already present in kernel memory, mapped as a memory page (or more). In case of a read, the data needs to be copied from that file kernel memory page to the process memory, while in case of a write, the data needs to be copied from the process memory to the file kernel memory page. The kernel will then ensure that the data in the kernel memory page(s) corresponding to the file is correctly written back to disk when needed (if needed at all).
This "intermediate" kernel mapping can be avoided, and the file mapped directly into userspace memory, but then the application would have to manage it manually, which is complicated and easy to mess up. This is why, for normal file operations, files are mapped into kernel memory. The kernel provides high level APIs for userspace programs to interact with them, and the hard work is left to the kernel itself.
The sendfile syscall is much faster because you do not need to perform the copy two times, but only once. Assuming that you want to do a sendfile of file A to file B, then all the kernel needs to do is to copy the data from A to B. However, in the case of read + write, the kernel needs to first copy from A to your process, and then from your process to B. This double copy is of course slower, and if you don't really need to read or manipulate the data, then it's a complete waste of time.
FYI, sendfile itself is basically an easy-to-use wrapper around splice (as can be seen from the source code), which is a more generic syscall to perform zero-copy data transfer between file descriptors.
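A hedged sketch of the two approaches on reasonably recent Linux kernels (the file names A and B are placeholders): the read()/write() loop copies every chunk from the kernel into a user buffer and back, while sendfile() keeps the data inside the kernel:

    /* Sketch: double copy via a userspace buffer vs. a single in-kernel copy. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int in  = open("A", O_RDONLY);
        int out = open("B", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(in, &st);

        /* Approach 1: read()/write() loop, two copies per chunk. */
        char buf[4096];
        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0)   /* copy #1: kernel -> buf */
            write(out, buf, n);                       /* copy #2: buf -> kernel */

        /* Approach 2: sendfile(), the data never enters userspace. */
        lseek(in, 0, SEEK_SET);
        ftruncate(out, 0);
        lseek(out, 0, SEEK_SET);
        off_t off = 0;
        sendfile(out, in, &off, st.st_size);          /* single in-kernel copy */

        close(in);
        close(out);
        return 0;
    }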
I took a short look at the kernel code for read and write but didn't see the copy.
In terms of kernel code, the whole process for reading a file is very complicated, but what the kernel ends up doing is a "special" version of memcpy(), called copy_to_user(), which copies the content of the file from the kernel memory to the userspace memory (doing the appropriate checks before performing the actual copy). More specifically, for files, the copyout() function is used, but the behavior is very similar, both end up calling raw_copy_to_user() (which is architecture-dependent).
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
The aio_{read,write} libc functions defined by POSIX are just asynchronous wrappers around read and write (i.e. they still use read and write under the hood). These still copy data to/from userspace.
io_uring can provide zero-copy operations, when using the O_DIRECT flag of open (see the manual page):
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user-
space buffers. The O_DIRECT flag on its own makes an effort
to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be
used in addition to O_DIRECT. See NOTES below for further
discussion.
This should be done carefully though, as it could very well degrade performance in case the userspace application does not do the appropriate caching on its own (if needed).
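For illustration, a minimal, Linux-specific sketch of O_DIRECT usage (data.bin is a placeholder, and the 4096-byte alignment is an assumption; the required alignment depends on the device and file system):

    /* Sketch: direct I/O with an aligned buffer, bypassing the page cache. */
    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        /* Direct I/O generally requires an aligned buffer; 4096 is a common choice. */
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        ssize_t n = read(fd, buf, 4096);   /* bypasses the page cache */
        printf("read %zd bytes directly from the device\n", n);

        free(buf);
        close(fd);
        return 0;
    }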
See also this related detailed answer on asynchronous I/O, and this LWN article on io_uring.

How does mmap work when 2 programs map the same file

I am trying to understand how mmap works while looking at man mmap.
As I understand it, it adds a mapping to the page table that maps between the file and the virtual address (which is the address given as void *addr).
So, what happens when 2 programs map the same file?
Are there 2 entries in the page table, one for each program?
So, what happens when 2 programs map the same file? Are there 2 entries in the page table, one for each program?
In modern operating systems, each process has its own page table for its memory, that may point to pages of physical memory shared with other user and kernel processes.
With MAP_SHARED, this mapping is shared: updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called.
This seems very interesting, but there are numerous caveats:
The actual pages mmapped by both processes for the same file may reside at the same address or at different addresses in each process; storing pointers into this shared memory may therefore not allow the other process to use them, as they might point to inconsistent addresses.
The implementation may or may not use the same physical memory pages for both mappings: for subtle reasons (cache strategies, out-of-sync reads, ...), even if it is the same physical memory, modifications done by one process to its memory may not be immediately reflected in the memory of the other process.
So a modification may or may not be visible to other processes mmapping the file or reading it via read or the FILE* stream API.
If one of the processes calls msync(), the modifications should be visible in all maps and for all not-yet-read portions of the file, bearing in mind that the FILE* streaming APIs may have buffered some data in internal, unshared buffers: modifications in this area will not be reflected.
Conclusion: it is risky and unreliable to use these mechanisms to implement inter-process communication. The behavior may depend on system-specific characteristics such as the OS strategies, the CPU and cache architectures, the type of RAM in use, the clock speed, and who knows what else. It is safer to rely on proven APIs that may indeed be implemented using mmapped memory, but only if they are known to provide the correct semantics.
The actual system implementation is different. At the risk of oversimplification (and omitting paging here):
An mmap will map physical page frames to a file.
So, what happens when 2 programs map the same file? Are there 2 entries in the page table, one for each program?
If two processes (P and Q) map the same file, then P and Q will each have their own page table; each page table will have an entry mapping to the same physical page frame (which could be mapped to different addresses within P and Q).
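As a rough illustration (the file name shared.bin is made up, and a fork()ed child stands in for a second program, so here the mapping even happens to sit at the same virtual address in both processes), a MAP_SHARED mapping lets a write by one process become visible to the other because both page tables reference the same physical page:

    /* Sketch: two processes sharing one physical page via MAP_SHARED. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("shared.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        ftruncate(fd, 4096);                       /* make sure one page exists */

        char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        if (fork() == 0) {                         /* child: its own page table */
            strcpy(map, "written by the child");
            _exit(0);
        }

        wait(NULL);                                /* parent sees the same physical page */
        printf("parent reads: %s\n", map);

        munmap(map, 4096);
        close(fd);
        return 0;
    }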

Memory Mapped I/O in Unix

I am unable to understand how files are managed in memory-mapped I/O. Normally, if we open a file using open or fopen, it returns an fd or a file pointer respectively. After this open, where does the file reside for processing? Is it in memory (a copy of the file that is on the hard disk) or not? If it is not in memory, where is the data fetched from by subsequent read or write system calls, or is data fetched from the hard disk each time read or write is called? Or is a copy of the file stored in memory, accessed by the process for further manipulation, and copied back to the hard disk once the process is done? Which of these scenarios is what actually happens?
The following is the definition given for memory mapped i/o in Advanced Programming in Unix Environment(2nd Edition) book:
Memory-mapped I/O lets us map a file on disk into a buffer in memory so that, when we fetch bytes from the buffer, the corresponding bytes of the file are read. Similarly, when we store data in the buffer, the corresponding bytes are automatically written to the file. This lets us perform I/O without using read or write.
What is mapping a file into memory? And here, they say this memory is placed in between the stack and the heap. What kind of data is present in this memory after mapping a file: a copy of the file, or the address of the file which resides on the hard disk? And how does the scenario above become true?
Can anyone explain the working mechanism of memory-mapped I/O and the mmap functionality?
Normally when you open a file, the system sets up some bookkeeping structures (metadata) but does not need to read any part of the actual data of the file. When you call read(), the system loads a chunk of the file into (virtual) memory which you allocated for the purpose.
When you memory-map a file, the system again sets up bookkeeping, and also sets up a (virtual) memory "mapping" which means a range of valid addresses which, if used, will reflect reads (or writes) of the underlying file. It does not mean the entire file needs to be read at once, because it can be "paged in" on demand, i.e. the system can give you an address range to use, then wait for you to actually use it before loading any data there. This "page faulting" is supported by a hardware device called the Memory Management Unit, or MMU. The same system is used when you run an executable file--the system can simply map it into virtual memory and read pages (chunks) from disk only as needed.
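A minimal sketch of that on-demand behavior (big.dat is a placeholder and is assumed to be larger than the offset touched): setting up the mapping reads nothing from disk; only the access faults in the page containing that byte:

    /* Sketch: map a whole file cheaply, pay for data only when touching it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("big.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* Setting up the mapping is cheap: no file data is read yet. */
        const unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* This access triggers a page fault; the kernel reads in just one page.
           (Assumes the file is larger than this offset.) */
        printf("byte at offset 1000000: %u\n", p[1000000]);

        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
    }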
Is it in memory (a copy of the file which is on the hard disk) or not?
According to Computer Programming and Utilization, when you open a file with fopen its contents are loaded into memory (partially or wholly).
If it is not in memory, where is the data fetched from by subsequent read or write system calls?
When you fwrite some data, it is eventually copied into the kernel which will then write it to disk (or wherever) after buffering. In general, no part of a file needs to be loaded in order to write.
what is mapping a file into memory?
For more refer here
In this memory, what type of data is present after mapping a file. It
contains copy of the file or the address of the file which resides in
hard disk.
A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. Refer this.
It is possible to mmap a file to a region of memory. When this is done, the file can be accessed just like an array in the program. This is more efficient than read or write, as only the regions of the file that a program actually accesses are loaded. Accesses to not-yet-loaded parts of the mmapped region are handled in the same way as swapped-out pages.
After this open, where does the file reside for processing? Is it in memory (a copy of the file which is on the hard disk) or not?
On the disk. It may also be partly or completely in memory if the operating system does a read-ahead, but that isn't detectable by you. You still have to issue reads to get data from the file.
If it is not in memory, where is the data fetched from by subsequent read or write system calls?
From the disk.
or is data fetched from the hard disk each time read or write is called?
In effect, but you also have to consider the effect of any caching.
Otherwise, is a copy of the file stored in memory and accessed by the process for further manipulation, and once the process is completed the file is copied back to the hard disk?
No. The file behaves as though it is all on the disk.
And here, they defined the memory as placed in between the stack and heap.
Not in what you quoted.
In this memory, what type of data is present after mapping a file?
The data in the file. The question 'what type of data' doesn't make sense. Data is data.
Does it contain a copy of the file or the address of the file which resides on the hard disk?
It effectively contains a copy of the file.
And how does the above scenario become true?
Via virtual memory. Too broad to cover here.

User space buffer and Kernel buffer

I am currently studying the file system in Linux. I learned that when we call fopen(), the library call will call malloc() to allocate space for the FILE structure, and inside this FILE structure there will be a buffer for I/O. But later on I found that the write system call actually writes the data to a kernel buffer. So what's the difference between these two buffers?
You have to understand two things: fwrite() is a standard library routine operating on the FILE structure, but write() is a system call. I bet fwrite() uses write() internally. Nothing keeps fwrite() from providing user space I/O buffering until it is ready to pass your data on to the write() syscall.
The write() syscall in its turn goes straight to the kernel and says: "Hey kernel, I've got this user space buffer here. Will you write this one to the storage for me?". And here it's up to the kernel what to do next: it will either go directly to storage to write the data, or, most likely, copy the data to a kernel buffer until it decides it's time to modify the storage data.
Turning back to your question. Any kind of buffering is done to accumulate data in order to postpone turning to more expensive operations: the standard library may consider invoking a syscall for every few bytes expensive, the kernel considers going to the hard disk on every syscall expensive, and so on.
You might want to read this to see how far buffering goes https://fgiesen.wordpress.com/2015/10/25/reading-and-writing-are-less-symmetric-than-you-probably-think/
The FILE structure holds the meta data about the opened file (mode, stream position, etc). It is part of the C Standard I/O interface.
The buffer allocated as part of FILE takes only a limited amount of data (e.g. when the stream is buffered). It is deallocated upon fclose(). You may even provide your own user space stdio buffer with setvbuf().
The kernel buffer receives the file contents written by write(), whenever the stream is flushed or the associated file descriptor is closed.
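To see the two buffering layers side by side, here is a hedged sketch (out.txt is a placeholder): fwrite() only fills the user-space stdio buffer installed with setvbuf(); the data reaches the kernel's buffer only when fflush(), a full buffer, or fclose() triggers a write(2):

    /* Sketch: user-space stdio buffer vs. the kernel buffer reached via write(2). */
    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("out.txt", "w");
        if (fp == NULL) { perror("fopen"); return 1; }

        /* Use a fully (block) buffered stream with our own 8 KiB buffer. */
        static char iobuf[8192];
        setvbuf(fp, iobuf, _IOFBF, sizeof iobuf);

        fwrite("hello\n", 1, 6, fp);   /* lands in iobuf, no system call yet */
        fflush(fp);                    /* now write(2) copies it to the kernel buffer */

        fclose(fp);                    /* the kernel later writes the data to disk */
        return 0;
    }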
The FILE structure holds the information about the opened file; this defines the FILE struct members. But at the kernel level a file is accessed through its inode and the buffer cache.
Data is read from and written to disk from user space through the buffer cache, using the copy_to_user and copy_from_user functions.
There is a very big difference between the two buffers: one is the kernel buffer and the other is the user buffer. So what basically happens when you do I/O is that the buffer from the user space is copied into a buffer in the kernel space. The function copy_from_user() does this task.
Now the question which arises is why we need two buffers when the kernel has access to the user space. The reason is that the kernel does not want to read the user buffer directly, because the kernel and user space have different address spaces, and so a valid address in user space might not be a valid address in the kernel.
In the kernel, if an invalid address is accessed then the system will panic immediately, so the function copy_from_user does the task of mapping between the user space address and the kernel space address and checks whether the address is accessible. If not, it simply returns EFAULT (bad address).
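For context, this is roughly what it looks like on the kernel side; it is a sketch only (demo_write and demo_buf are made-up names, and this is not a complete module), showing how a character-device write handler typically uses copy_from_user() to validate and copy the user buffer:

    /* Kernel-side sketch: copying a user buffer into a kernel buffer. */
    #include <linux/fs.h>
    #include <linux/uaccess.h>

    static char demo_buf[4096];   /* kernel-space buffer */

    static ssize_t demo_write(struct file *filp, const char __user *ubuf,
                              size_t count, loff_t *ppos)
    {
        if (count > sizeof(demo_buf))
            count = sizeof(demo_buf);

        /* copy_from_user returns the number of bytes that could NOT be copied. */
        if (copy_from_user(demo_buf, ubuf, count) != 0)
            return -EFAULT;       /* bad user-space address */

        return count;             /* tell the caller how much we consumed */
    }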

How does read(2) in Linux C work?

According to the man page, we can specify the amount of bytes we want to read from a file descriptor.
But in read's implementation, how many read requests will be created to perform one read?
For example, if I want to read 4MB, will it create only one request for 4MB or will it split it into multiple small requests? such as 4KB per request?
read(2) is a system call, so the C library wrapper dispatches it to the kernel (in very old times this was done via an interrupt, but nowadays there are faster ways of entering the kernel).
Inside the kernel the call is first handled by the VFS (virtual file system); the virtual file system provides a common interface for inodes (the structures that represent open files) and a common way of interfacing with the underlying file system.
The VFS dispatches to the underlying file system (the mount(8) program will tell you which mount points exist and which file system is used there). (See here for more information: http://www.inf.fu-berlin.de/lehre/SS01/OS/Lectures/Lecture16.pdf )
The file system can do its own caching, so the number of disk reads depends on what is present in the cache, on how the file system allocates blocks for storage of a particular file, and on how the file is divided into disk blocks (all questions for the particular file system).
If you want to do your own caching then open the file with the O_DIRECT flag; in this case there is an effort not to use the cache; however, all reads have to be aligned to 512-byte offsets and come in multiples of 512 bytes in size (this is so that your buffer can be transferred via DMA to the backing store: http://www.quora.com/Why-does-O_DIRECT-require-I-O-to-be-512-byte-aligned )
It depends on how deep you go.
The C library just passes the size you gave it straight to the kernel in one read() system call, so at that level it's just one request.
Inside the kernel, for an ordinary file in standard buffered mode the 4MB you requested is going to be copied from multiple pagecache pages (4kB each) which are unlikely to be contiguous. Any of the file data which isn't actually already in the pagecache is going to have to be read from disk. The file might not be stored contiguously on disk, so that 4MB could result in multiple requests to the underlying block device.
If there is data available, read will return as much data as is immediately available and will fit in the buffer, without waiting. If there's no data available, it will wait until there is some and return what it can without waiting more.
How much that is depends on what the file descriptor refers to. If it refers to a socket, that will be whatever is in the socket buffer. If it is a file, that will be whatever is in the buffer cache.
When you call read it makes just one request to fill the buffer, and if it couldn't fill the whole buffer (no more data, or data has not arrived yet, as with sockets) it returns the number of bytes it actually wrote into your buffer.
As the manual says:
RETURN VALUE
Upon successful completion, these functions shall return a non-negative integer indicating the number of bytes actually read. Otherwise, the functions shall return −1 and set errno to indicate the error.
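Because of that, callers that need an exact number of bytes typically loop over read(). A minimal sketch (the helper name read_fully is made up):

    /* Sketch: keep calling read(2) until the buffer is full, EOF, or an error. */
    #include <errno.h>
    #include <unistd.h>

    ssize_t read_fully(int fd, void *buf, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = read(fd, (char *)buf + done, len - done);
            if (n == 0)                      /* end of file */
                break;
            if (n < 0) {
                if (errno == EINTR)          /* interrupted: just retry */
                    continue;
                return -1;                   /* real error */
            }
            done += n;                       /* short read: keep going */
        }
        return (ssize_t)done;
    }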
There's really no one right answer, other than however many are necessary at whatever layer the request winds up going to. Typically, a single request will be passed to the kernel. This may result in no further requests going to other layers because all the information is in memory. But if the data has to be read from, say, a software RAID, requests may have to be issued to multiple physical devices to satisfy the request.
I don't think you can really give a better answer than "whatever the implementer thought was the best way".

Resources