How OS performs buffering for a file - c

I know that when you call fwrite or fprintf, or any other function that writes to a file, the contents aren't immediately flushed to disk but are buffered in memory.
Firstly, where does the OS manage these buffers, and how? Secondly, if you write to a file and later read back the content you wrote, and the OS hasn't flushed the contents in the meantime, how does it know that it has to satisfy the read from the buffer? How does it handle this situation?
The reason I want to know is that I'm interested in implementing my own buffering scheme in user space, rather than in kernel space as done by the OS. That is, writes to a file would be buffered in user space and the actual write would only occur at a certain point. Consequently, I also need to handle reads of content that is still in the buffer. Is it possible to do all this in user space?

Firstly, where does the OS manage these buffers, and how?
The functions fwrite and fprintf use stdio buffers, which already live entirely in user space. The buffers are (likely) static arrays or perhaps malloced memory.
how it knows that it has to return the read from the buffer
It doesn't, so data still sitting in the stdio buffer isn't seen by anyone reading the file. Nothing actually happens to the file until the underlying system call (write) is made (and even then, read on).
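A minimal sketch of that point, assuming a POSIX system and a stdio implementation that honours full buffering (glibc does). The function name and the file name passed to it are made up for illustration. It writes a short message through stdio and asks the kernel (via stat) how big the file is before and after fflush():

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Size of the file as the kernel currently sees it, or -1 on error. */
long kernel_file_size(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? (long)st.st_size : -1;
}

/*
 * Hypothetical helper: writes `msg` through stdio and reports the file size
 * the kernel sees before and after fflush(). Returns 0 on success, -1 on error.
 */
int size_before_and_after_flush(const char *path, const char *msg,
                                long *before, long *after)
{
    assert(path && msg && before && after);

    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    /* Force full buffering with a buffer much larger than the message. */
    static char buf[4096];
    if (setvbuf(fp, buf, _IOFBF, sizeof buf) != 0 || fputs(msg, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    *before = kernel_file_size(path);   /* data is still in the stdio buffer */

    if (fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }
    *after = kernel_file_size(path);    /* write() has now been issued */

    return fclose(fp) == 0 ? 0 : -1;
}
```

With a short message, I would expect `before` to be 0 and `after` to be the message length, because the bytes only leave the process when the stdio buffer is flushed.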
Is it possible to do all this in user-space
No, it's not possible. The good news is that the kernel already has buffers, so not every write you do is translated into an immediate write to the disk. It is postponed and executed later. If in the meantime somebody tries to read from the file, the kernel is smart enough to serve the data from its buffer.
Bits from TLPI:
When working with disk files, the read() and write() system calls
don’t directly initiate disk access. Instead, they simply copy data
between a user-space buffer and a buffer in the kernel buffer cache.
When performing I/O on a disk file, a successful return from write()
doesn’t guarantee that the data has been transferred to disk, because
the kernel performs buffering of disk I/O in order to reduce disk
activity and expedite write() calls.
At some later point, the kernel writes (flushes) its buffer to the
disk.
If, in the interim, another process attempts to read these bytes of
the file, then the kernel automatically supplies the data from the
buffer cache, rather than from (the outdated contents of) the file.
So you may want to find out about sync and fsync.
Multiple levels of buffering are generally bad. The reason stdio buffers are useful is that they minimize the number of system calls performed. If system calls were cheaper, nobody would use stdio buffers anymore.

Related

Understanding low level file routines

I am going through Mark Burgess's "The GNU C Programming Tutorial". I have come across the following information:
Even though low-level file routines do not use buffering, and once you call write, your data can be read from the file immediately, it may take up to a minute before your data is physically written to disk. (Page:142)
Firstly, is "it may take up to a minute (some time) before your data is written to disk" true?
Secondly, if low-level file routines do not use buffering, why does the delay occur?
There are two places where I/O buffering can occur (at least — it could be more than just two).
One is in the application; the standard I/O functions using FILE * use buffered I/O unless you use setvbuf() to prevent it.
The other is in the kernel. Disk I/O normally goes into the kernel buffer pool, and eventually gets written by the kernel to disk. There are ways around that (O_DIRECT on Linux; raw devices on classic Unix; etc). The key point is that the write() system call normally writes to the kernel buffer pool. The kernel takes responsibility for ensuring that the data is written to disk safely and correctly (journalling, …).
The kernel doesn't write everything to disk immediately because (a) you may add more changes to the data, (b) other people may need to read or write the data, (c) the disk drive may be busy writing something else at the other end of its 1 TiB of storage and it will take time to get the write head in position to take your data, and it would be better for the overall performance of the system if it scheduled other work before writing your changed buffer to disk. It will get written to disk. It is just not defined when, and it could be fractions of a second or multiple seconds or longer, though most often it will not take minutes for the data to be written to disk.
These days, there could also be buffering in the RAID controllers, and maybe in the individual disks inside the RAID setup, and maybe there's network buffering too if it is a remotely-mounted file system. Those add extra levels of buffering.
The read() and write() and related low-level I/O functions do not have any client-side (application) buffering — unlike the standard C I/O functions.
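To illustrate the setvbuf() remark, here is a small sketch, assuming a POSIX system; the function name is my own. It switches a stream to unbuffered mode and then checks the file size the kernel reports right after the write, with no fflush() at all:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: writes `msg` through an unbuffered FILE * and returns
 * the file size the kernel reports immediately afterwards, without fflush().
 * Returns -1 on error.
 */
long size_after_unbuffered_write(const char *path, const char *msg)
{
    assert(path != NULL && msg != NULL);

    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    /* _IONBF: no stdio buffer. Must be set before any other I/O on fp. */
    if (setvbuf(fp, NULL, _IONBF, 0) != 0 || fputs(msg, fp) == EOF) {
        fclose(fp);
        return -1;
    }

    struct stat st;
    long size = stat(path, &st) == 0 ? (long)st.st_size : -1;

    fclose(fp);
    return size;
}
```

With application buffering disabled, the data reaches the kernel as part of the fputs() call itself. Whether it has reached the disk is still up to the kernel.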
A file is said to be buffered when its contents are not read or written directly. Instead, the file's bytes pass through a temporary buffer in memory.
For example, if you are reading from a file, you are reading from the buffer. Once you have read all the characters in the buffer, it is replenished with new bytes from the file. The reason for this indirection is that a memory read is much faster than a hard disk read.
The calls read and write are low-level, and do not perform buffering. The stdio.h calls like getc and putc do use buffering. These higher-level APIs only call the low-level ones when the buffer must be replenished.
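The replenishing idea can be sketched in a few lines of C on top of read(). This is only an illustration, assuming a POSIX system; `struct breader` and its functions are hypothetical names, not a real library API, and the sketch does not retry on EINTR.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical buffered reader; illustrative only. */
struct breader {
    int fd;             /* descriptor being read */
    size_t pos;         /* index of the next unread byte in buf */
    size_t len;         /* number of valid bytes in buf */
    unsigned refills;   /* how many times read() has been called */
    char buf[4096];
};

void breader_init(struct breader *r, int fd)
{
    assert(r != NULL);
    r->fd = fd;
    r->pos = 0;
    r->len = 0;
    r->refills = 0;
}

/* Returns the next byte as an unsigned char value, or -1 on EOF or error. */
int breader_getc(struct breader *r)
{
    assert(r != NULL);
    if (r->pos == r->len) {
        /* Buffer exhausted: only now do we pay for a system call. */
        ssize_t n = read(r->fd, r->buf, sizeof r->buf);
        r->refills++;
        if (n <= 0)
            return -1;
        r->pos = 0;
        r->len = (size_t)n;
    }
    return (unsigned char)r->buf[r->pos++];
}
```

Reading three bytes one at a time through breader_getc() costs a single read() call, which is the whole point of the buffer; the `refills` counter is there only to make that observable.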
Writing to the hard drive is much slower than writing to RAM. When you write to a drive it writes to memory, but doesn't always write to the disk immediately. The data might not be written to disk until that part of memory needs to be overwritten to make room for something else. This is called a Write-Back cache.

are fread and fwrite different in handling the internal buffer?

I keep on reading that fread() and fwrite() are buffered library calls. In case of fwrite(), I understood that once we write to the file, it won't be written to the hard disk, it will fill the internal buffer and once the buffer is full, it will call write() system call to write the data actually to the file.
But I am not able to understand how this buffering works in case of fread(). Does buffered in case of fread() mean, once we call fread(), it will read more data than we originally asked and that extra data will be stored in buffer (so that when 2nd fread() occurs, it can directly give it from buffer instead of going to hard disk)?
I also have the following queries.
If fread() works as I described above, will the first fread() call read an amount of data equal to the size of the internal buffer? If so, what happens if my fread() call asks for more bytes than the internal buffer size?
If fread() works as I described above, then at least one read() system call to the kernel is certain to happen for fread(). But in the case of fwrite(), if we call fwrite() only once during program execution, we can't say for sure that the write() system call will be made. Is my understanding correct?
Will the internal buffer be maintained by OS?
Does fclose() flush the internal buffer?
There is buffering or caching at many different levels in a modern system. This might be typical:
C standard library
OS kernel
disk controller (esp. if using hardware RAID)
disk drive
When you use fread(), the library may request 8 KB or so from the kernel even if you asked for less. The extra data is stored in user space, so the next sequential read needs no system call or context switch.
The kernel may read ahead also; there are library functions to give it hints on how to do this for your particular application. The OS cache could be gigabytes in size since it uses main memory.
The disk controller may read ahead too, and could have a cache size up to hundreds of megabytes on smallish systems. It can't do as much in terms of read-ahead, because it doesn't know where the next logical block is for the current file (indeed it doesn't even know what file it is reading).
Finally, the disk drive itself has a cache, perhaps 16 MB or so. Like the controller, it doesn't know what file it is reading. For many years one disk block was 512 bytes, but it got a little larger (a few KB) recently with multi-terabyte disks.
When you call fclose(), it will probably deallocate the user-space buffer, but not the others.
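One way to observe the user-space read-ahead, as a rough sketch assuming a POSIX system (fileno and lseek are POSIX, not standard C), is to read a single byte with fread() and then ask the kernel where the underlying descriptor's offset is. The function name is hypothetical, and the exact offset depends on the C library's buffer size, so treat any particular number as implementation-specific.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: creates a file of `total` bytes, reads ONE byte with
 * fread(), and returns the kernel-level offset of the underlying descriptor
 * afterwards. Returns -1 on error.
 */
long fd_offset_after_one_byte_fread(const char *path, size_t total)
{
    assert(path != NULL && total > 1);

    FILE *out = fopen(path, "w");
    if (out == NULL)
        return -1;
    for (size_t i = 0; i < total; i++)
        fputc('x', out);
    if (fclose(out) != 0)
        return -1;

    FILE *in = fopen(path, "r");
    if (in == NULL)
        return -1;

    char byte;
    long offset = -1;
    if (fread(&byte, 1, 1, in) == 1) {
        /* lseek(fd, 0, SEEK_CUR) reports the offset without moving it. */
        offset = (long)lseek(fileno(in), 0, SEEK_CUR);
    }

    fclose(in);
    return offset;
}
```

On the implementations I am aware of, the returned offset is well past 1 for a file of several kilobytes, because the library filled its own buffer in one read() and handed back just the one byte requested.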
Your understanding is correct. And any buffered fwrite data will be flushed when the FILE* is closed. The buffered I/O is mostly transparent for I/O on regular files.
But for terminals and other character devices you may care. Another instance where buffered I/O may be an issue is if you read from the file that one process is writing to from another process -- a common example is if a program writes text to a log file during operation, and the user runs a command like tail -f program.log to watch the content of the log file live. If the writing process has buffering enabled and it doesn't explicitly flush the log file, it will make it difficult to monitor the log file.
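For the log file case, one common approach is to flush after every record so that a reader such as tail -f sees each line promptly. A minimal sketch (the function name log_line is made up for this example):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: appends one line to the log and immediately pushes it
 * from the stdio buffer to the kernel, so other processes can read it.
 * Returns 0 on success, -1 on error.
 */
int log_line(FILE *log, const char *msg)
{
    assert(log != NULL && msg != NULL);

    if (fprintf(log, "%s\n", msg) < 0)
        return -1;
    return fflush(log) == 0 ? 0 : -1;
}
```

An alternative with a similar effect is to call setvbuf(log, NULL, _IOLBF, 0) once after opening the file, which asks the library to flush at every newline. The explicit fflush() costs one system call per record, which is usually acceptable for logs.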

When does the actual write() take place in C?

What really happens when the write() system call is executed?
Let's say I have a program that writes certain data to a file using the write() function. Now, the C library has its own internal buffer, and the OS has its own buffer too.
What interaction takes place between these buffers?
Is it that when the C library buffer fills completely it is written to the OS buffer, and when the OS buffer fills completely the actual write is done to the file?
I am looking for some detailed answers, useful links would also help. Consider this question for a UNIX system.
The write() system call (in fact, every system call) is nothing more than a contract between the application program and the OS.
For "normal" files, write() only puts the data into a buffer and marks that buffer as "dirty".
At some time in the future, these dirty buffers will be collected and actually written to disk. This can be forced with fsync().
This is done by the .write() "method" in the mounted-filesystem table,
and this will invoke the hardware's .write() method (which could involve another level of buffering, such as DMA).
Modern hard disks have their own buffers, whose contents may or may not have actually been written to the physical disk, even if the OS (via the controller) told them to.
Now, some (abnormal) files don't have a write() method to support them. Imagine open()ing "/dev/null", and write()ing a buffer to it. The system could choose not to buffer it, since it will never be written anyway.
Also note that the behaviour of write() depends on the nature of the file; for network sockets, write(fd, buff, size) can return before size bytes have been sent (write returns the number of bytes it actually accepted). But it is impossible to find out where they are once they have been sent. They could still be in a network buffer (e.g. waiting for Nagle ...), or a buffer inside the network interface, or a buffer in a router or switch somewhere on the wire.
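Because write() may accept fewer bytes than requested (on sockets and pipes in particular), callers usually wrap it in a loop. A minimal sketch, assuming a POSIX system; write_all is a conventional name for such a helper, not a standard function:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: keeps calling write() until all `len` bytes have been
 * accepted by the kernel, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is set by write()).
 */
int write_all(int fd, const void *data, size_t len)
{
    assert(data != NULL || len == 0);

    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        p += n;                 /* skip the part the kernel accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

Note that a successful return only means the kernel has taken the bytes. It says nothing about whether they have reached the peer or the disk.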
As far as I know...
The write() function is a lower level thing where the library doesn't buffer data (unlike fwrite() where the library does/may buffer data).
Despite that, the only guarantee is that the OS transfers the data to disk drive before the next fsync() completes. However, hard disk drives usually have their own internal buffers that are (sometimes) beyond the OS's control, so even if a subsequent fsync() has completed it's possible for a power failure or something to occur before the data is actually written from the disk drive's internal buffer to the disk's physical media.
Essentially, if you really must make sure that your data is actually written to the disk's physical media; then you need to redesign your code to avoid this requirement, or accept a (small) risk of failure, or ensure the hardware is capable of it (e.g. get a UPS).
write() hands data to the operating system, making it visible to all processes (if it is something that other processes can read). How the operating system buffers it, and when it gets written permanently to disk, is very much specific to the library, OS, system configuration, and file system. However, sync() can be used to force buffers to be flushed.
What is guaranteed is that POSIX requires that, on a POSIX-compliant file system, a read() which can be proved to occur after a write() has returned must return the written data.
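That guarantee can be checked with a short sketch, assuming a POSIX system and a local file system. The function name is made up. It writes through one descriptor and reads the bytes back through a second, independently opened descriptor, with no fsync() or sync() in between:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes `len` bytes via one descriptor, then reads them
 * back through a second descriptor on the same file, without any fsync().
 * Returns the number of bytes read back, or -1 on error.
 */
ssize_t write_then_read_back(const char *path, const char *data, size_t len,
                             char *out)
{
    assert(path != NULL && data != NULL && out != NULL);

    int wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (wfd < 0)
        return -1;
    int rfd = open(path, O_RDONLY);
    if (rfd < 0) {
        close(wfd);
        return -1;
    }

    ssize_t n = -1;
    if (write(wfd, data, len) == (ssize_t)len)
        n = pread(rfd, out, len, 0);    /* served from the kernel's buffers */

    close(rfd);
    close(wfd);
    return n;
}
```

The read succeeds whether or not the data has reached the disk yet, since both descriptors go through the same kernel cache.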
OS dependent, see man 2 sync and (on Linux) the discussion in man 8 sync.
Years ago operating systems were supposed to implement an 'elevator algorithm' to schedule writes to disk. The idea would be to minimize the disk writing head movement, which would allow a good throughput for several processes accessing the disk at the same time.
Since you're asking about UNIX, keep in mind that a file might actually live on, for example, an FTP server that you have mounted. Likewise, the files under /dev and /proc are not files on the HDD.
Also, on Linux data is not written to the hard drive directly; instead, a background process flushes all pending writes every so often.
But again, those are implementation details, that really don't affect anything from the point of view of your program.

Since 'fread' is buffered, is it necessary to fread data into memory and then use it?

I am using fopen/fread/fwrite/fseek on Linux with gcc. Is it necessary to allocate a memory buffer and use fread to read data sequentially into the buffer before using the data?
When you use fread or the other file I/O functions in the C standard library, memory is buffered in several places.
Your application allocates a buffer which gets passed to fread. fread copies data into your buffer, and then you can do what you want with it. You are responsible for allocation/deallocation of this buffer.
The C library will usually create a buffer for every FILE* you have open. Data is read into this buffer in large chunks. This allows fread to satisfy many small requests without having to make a large number of system calls, which are expensive. This is what people mean when they say fread is buffered.
The kernel will also buffer files that are being read in the disk cache. This reduces the time needed for the read system call, since if data is already in memory, your program won't have to wait while the kernel fetches it from the disk. The kernel will hold on to recently read files, and it may read ahead for files which are being accessed sequentially.
The C library buffer is allocated automatically when you open a file and freed when you close the file. You don't have to manage it at all.
The kernel disk cache is stored in physical memory that isn't being used for anything else. Again, you don't have to manage this. The memory will be freed as soon as it's needed for something else.
You must pass a buffer (created by your code, malloced or local) to fread so that it can pass the data it reads back to you. I don't know exactly what you mean by saying "fread is buffered". Most C library calls operate in this fashion: they will not hand their internal storage (buffer or otherwise) to you, and if they do, they will provide corresponding free/release functions.
See http://pubs.opengroup.org/onlinepubs/000095399/functions/fread.html It also has a very basic example.
With fread, yes, you have to allocate memory in your process and the system call will copy the data into your buffer.
In some specialised cases, you can handle data without copying it into userspace. See the sendfile system call, which copies data from one file descriptor to another directly. This can be used to transfer data from a file to a network socket without excessive copying.
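A rough sketch of the sendfile approach follows. It is Linux-specific (the header is <sys/sendfile.h>), and copying to a regular file rather than a socket assumes a reasonably recent kernel. The function name copy_with_sendfile is made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: copies the whole of in_fd to out_fd inside the kernel,
 * without passing the data through a user-space buffer.
 * Returns the number of bytes copied, or -1 on error.
 */
long copy_with_sendfile(int out_fd, int in_fd)
{
    assert(out_fd >= 0 && in_fd >= 0);

    struct stat st;
    if (fstat(in_fd, &st) != 0)
        return -1;

    off_t offset = 0;
    while (offset < st.st_size) {
        /* sendfile() advances `offset` by the number of bytes it copied. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0)
            return -1;
        if (n == 0)
            break;              /* input ended earlier than fstat() said */
    }
    return (long)offset;
}
```

The loop is there because sendfile(), like write(), may transfer fewer bytes than requested in a single call.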

Write system call writes data to disk directly?

I've read a couple of questions (here) related to this, but I still have some confusion.
My understanding is that the write system call puts the data into the buffer cache (the OS cache, as it is referred to in that question). When the buffer cache gets full, it is written to the disk.
Buffered I/O is a further optimization on top of this. It caches data in the C RTL buffers, and when they get full a write system call is issued to move the contents to the buffer cache. If I use fflush, then the data related to this particular file that is present in the C RTL buffers as well as in the buffer cache is sent to the disk.
Is my understanding correct?
How the stdio buffers are flushed depends on the standard C library you use. To quote from the Linux manual page:
Note that fflush() only flushes the user space buffers provided by the C library.
To ensure that the data is physically stored on disk the kernel buffers must be
flushed too, for example, with sync(2) or fsync(2).
This means that on a Linux system, using fflush or overflowing the buffer will call the write function. But the operating system may keep internal buffers, and not actually write the data to the device. To make sure the data is truly written to the device, use both fflush and the low-level fsync.
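Put together, the two steps might look like this. It is a minimal sketch assuming a POSIX system (fileno and fsync are POSIX, not standard C), and flush_to_disk is a name I made up:

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: pushes any data in fp's stdio buffer to the kernel,
 * then asks the kernel to push its buffers for that file to the device.
 * Returns 0 on success, -1 on error.
 */
int flush_to_disk(FILE *fp)
{
    assert(fp != NULL);

    if (fflush(fp) != 0)            /* C library buffer -> kernel buffers */
        return -1;
    if (fsync(fileno(fp)) != 0)     /* kernel buffers -> storage device */
        return -1;
    return 0;
}
```

The order matters: calling fsync() first would not help, because the kernel cannot flush data that is still sitting in the process's own stdio buffer. Even after fsync() returns, a drive with its own volatile cache may still hold the data, as discussed in the answers above.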

Resources