Take VirtualBox's virtual disk as an example: if VirtualBox didn't bypass the buffering done by the host OS file system, the guest OS file system would just be copying data from memory to memory.
In fact, I want to write a file system in user space (putting all directories and files in a single big file). But if I use C APIs such as fread and fwrite, the OS file system will buffer the data that my user-space file system reads and writes. My user-space file system already implements its own buffering, so if I don't bypass the buffering done by the OS file system, I will just be copying data from memory to memory, which is wasteful.
Does anyone know how to solve this problem?
stdio doesn't support that.
For *NIX: man open for O_DIRECT, man posix_fadvise and man madvise.
For Windows, check the CreateFile for FILE_FLAG_NO_BUFFERING. Probably a good idea to dig the CreateFileMapping too.
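To make the O_DIRECT suggestion a bit more concrete, here is a minimal sketch for Linux. The names alloc_block and write_block_direct and the 4096-byte BLOCK_SIZE are my own assumptions for illustration; the real alignment requirement depends on the device and file system, and some file systems reject O_DIRECT altogether, which is why the sketch falls back to ordinary buffered I/O when open() fails with EINVAL.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0             /* platform without O_DIRECT: plain buffered I/O */
#endif

#define BLOCK_SIZE 4096        /* assumed alignment; real code should query the device */

/* Allocate one zero-filled block whose address is BLOCK_SIZE-aligned. */
void *alloc_block(void)
{
    void *p = NULL;
    if (posix_memalign(&p, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return NULL;
    memset(p, 0, BLOCK_SIZE);
    return p;
}

/* Write one aligned block at offset 0, bypassing the page cache if possible.
 * Returns 1 if O_DIRECT was used, 0 if buffered I/O was used instead,
 * and -1 on error (errno is set). */
int write_block_direct(const char *path, const void *block)
{
    int used_direct = (O_DIRECT != 0);
    int fd;

    assert(path != NULL && block != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL) {
        /* This file system refuses O_DIRECT: fall back to buffered I/O. */
        used_direct = 0;
        fd = open(path, O_WRONLY | O_CREAT, 0600);
    }
    if (fd < 0)
        return -1;

    /* Buffer address, length and file offset are all BLOCK_SIZE multiples. */
    ssize_t n = pwrite(fd, block, BLOCK_SIZE, 0);
    int saved = errno;
    close(fd);
    errno = saved;
    return n == BLOCK_SIZE ? used_direct : -1;
}
```

A caller would fill the block returned by alloc_block(), pass it to write_block_direct(), and free() it afterwards. The point to notice is that with O_DIRECT the buffer address, the transfer length and the file offset all have to be suitably aligned, which is why a plain malloc() buffer is not used here.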
Your question isn't very clear, but if all you want to do is use stdio without buffering, then setbuf(file, NULL); will solve your problem. A better solution might be to avoid stdio entirely and use lower-level io primitives read, write, etc. (not part of plain C but specified by POSIX, and with nearly-equivalent versions of them available on most non-POSIX systems too).
I was talking with a teacher, and he told me that the read and write system calls use buffers, because there is a variable in your system spec that controls how many times you can access the device you want to read from or write to, and the system uses a buffer to store data while it is waiting to write to the device.
I saw in another Stack Overflow post (C fopen vs open) that one of the advantages of the fopen and fwrite functions is that they use buffers (which is supposed to be much faster).
I have read the man page of read and write sys calls, and the man pages do not talk about any buffers.
Did I misunderstand something? How do the buffers of the read / write syscalls work?
The functions you mention, read and write are system calls, therefore their behavior is platform dependent.
As you know, fread and fwrite are C standard library functions. They buffer in user space and in this way optimize performance for typical applications. read and write are different. There is some stub code in userspace C libraries (such as GNU libc) for these functions, but the main purpose of that code is just to provide a convenient wrapper for invoking the right kernel functionality (it's also possible to invoke that functionality with syscall() directly!)
If you're interested in the details, here is an example: the wrapper for write system call in the uclibc library.
So the typical implementations of read and write do not do buffering in user space. They may still do buffering in the kernel space, though. Read about the O_DIRECT flag for more details: How are the O_SYNC and O_DIRECT flags in open(2) different/alike?
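As a rough illustration of how thin the libc wrapper is, here is a Linux-specific sketch that issues the write system call through syscall() instead of write(). The function name raw_write is just something I made up for the example; SYS_write and syscall() are the real glibc/Linux names.

```c
#define _GNU_SOURCE            /* for syscall() on glibc */
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Same effect as write(fd, buf, len), but without going through the
 * libc write() wrapper. No user-space buffering happens either way. */
ssize_t raw_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    return (ssize_t)syscall(SYS_write, fd, buf, len);
}
```

For example, raw_write(STDOUT_FILENO, "hi\n", 3) hands three bytes straight to the kernel. Whatever buffering happens after that point is the kernel's, not the library's.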
If my code does something like fd = open("/dev/sdXY", ...) and pwrite(fd, ...)/pread(fd, ...), do the I/O operations skip the buffers or disk cache? Suppose /dev/sdXY is an unmounted, formatted disk partition (ext4, ufs, etc.).
I ask because the application I'm working on needs to guarantee contiguous file storage, and I read that the only way to achieve that is to do something like what I described. However, I may drop the requirement for contiguous storage if it would mean losing buffers, the disk cache or some other useful feature.
I'm also confused about if I would need to re-implement low level stuff since the partition would already be formatted with a file system. I read that would be the case for RAW disks/partitions. I already know it will be needed to handle which blocks are free or in use, files and folders structures, etc., I'm already working on that.
Another question: I have only seen something about buffers when reading about fopen()/fread()/fwrite() and C++'s file streams. Is it right that only these streams and the f* family of functions have some kind of buffer, unlike open/write/read/pwrite/pread/etc? Is this buffer the same as disk cache or something different?
A last one: Is HDD cache handled by its own drive or by file system (ext4, ufs, etc.)?
The simple answer is 'it depends'. What's hard is characterizing what it depends on.
Simply using open() doesn't avoid the kernel disk buffer pool. To do that, you need special options (O_DIRECT) on Linux. However, using open() does avoid using hidden application buffers; you get to choose where the data is read from or written to without any intermediate copies. By contrast, the f* family of functions do have a 'hidden' application buffer; the data is frequently read into an I/O buffer associated with the FILE * file stream, and then copied into your application buffers.
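If you want to see the 'hidden' application buffer for yourself, something like the following sketch should do it. The helper names kernel_visible_size and show_stdio_buffering are mine, not standard functions. It forces full buffering on the stream, writes five bytes with fwrite(), and asks the kernel for the file size before and after fflush().

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Size of the file behind fp as the kernel currently sees it, or -1. */
long kernel_visible_size(FILE *fp)
{
    struct stat st;
    assert(fp != NULL);
    if (fstat(fileno(fp), &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Writes "hello" through stdio and reports the file size seen by the
 * kernel before and after fflush(). fp must be a freshly opened stream.
 * Returns 0 on success, -1 on failure. */
int show_stdio_buffering(FILE *fp, long *before, long *after)
{
    static const char msg[] = "hello";

    if (setvbuf(fp, NULL, _IOFBF, BUFSIZ) != 0)    /* force full buffering */
        return -1;
    if (fwrite(msg, 1, sizeof msg - 1, fp) != sizeof msg - 1)
        return -1;
    *before = kernel_visible_size(fp);   /* data is still in the FILE buffer */
    if (fflush(fp) != 0)
        return -1;
    *after = kernel_visible_size(fp);    /* data has been handed to the kernel */
    return 0;
}
```

Called on an empty file from tmpfile(), this reports a size of 0 before the flush and 5 after it. Note that the second number only means the kernel has the data; it says nothing about whether it has reached the disk.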
If your /dev/sdXY device is already formatted with a file system but you want to ensure contiguous file storage for a file, you are going to have to replicate a significant portion of the file system driver to ensure you allocate the space correctly. It is unlikely to be a sensible use of your time or energy. Yes, you would need to reimplement all sorts of low-level disk space management — it would be entirely non-trivial. Further, the implementation for ext4 would be quite different from the implementation for ufs, etc — so you'd really have your work cut out for you.
As far as I know, I can disable the OS cache by using open() with O_DIRECT. But how can I do that if I want to use fopen() instead of open()?
I think due to the alignment requirements of the O_DIRECT flag it's not possible (see that question). The f...() - IO family uses an internal buffer to cache IO operation and I don't think that a standard implementation would align that buffer appropriately.
Edit
For special purposes, I could think of two non-portable solutions:
If you are sure, that your file system doesn't require any special alignment, you could use fdopen():
int fd = open( ....., O_WRONLY|O_DIRECT );
FILE *fp = fdopen( fd, "w" );
If you are working on Linux only, using fopencookie() could be a solution:
Use the cookie to transport the 'real' fd from open() and provide a write function that copies the data to an appropriately aligned buffer and then calls write() (I have never used fopencookie(), but I think it could be worth trying if using a non-standard GNU extension isn't a no-go)
In both cases be aware that f-...() I/O functions still do internal buffering so real write()s may not occur before you call fflush() or fclose()
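For what it's worth, a rough sketch of the fopencookie() idea might look like the code below. Everything named here (struct aligned_cookie, aligned_write, aligned_close, fopen_aligned, the 4096-byte ALIGN) is hypothetical and only for illustration. Be aware that it only takes care of the buffer address: a real O_DIRECT descriptor usually also requires transfer lengths and file offsets to be multiples of the block size, which this sketch does not attempt to handle, so treat it as a starting point rather than a finished solution.

```c
#define _GNU_SOURCE            /* fopencookie() is a GNU extension */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define ALIGN 4096             /* assumed alignment requirement */

struct aligned_cookie {        /* hypothetical helper type */
    int   fd;                  /* the 'real' descriptor from open() */
    char *buf;                 /* ALIGN-aligned bounce buffer, ALIGN bytes */
};

/* Called by stdio whenever it flushes its own internal buffer. */
static ssize_t aligned_write(void *c, const char *data, size_t size)
{
    struct aligned_cookie *ck = c;
    size_t done = 0;

    while (done < size) {
        size_t chunk = size - done < ALIGN ? size - done : ALIGN;
        memcpy(ck->buf, data + done, chunk);   /* copy into aligned memory */
        ssize_t n = write(ck->fd, ck->buf, chunk);
        if (n <= 0)
            break;             /* report what was written so far (0 = error) */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Called by fclose(): release the descriptor and the bounce buffer. */
static int aligned_close(void *c)
{
    struct aligned_cookie *ck = c;
    int rc = close(ck->fd);
    free(ck->buf);
    free(ck);
    return rc;
}

/* Wrap an already opened descriptor in a write-only FILE stream. */
FILE *fopen_aligned(int fd)
{
    cookie_io_functions_t io = { NULL, aligned_write, NULL, aligned_close };
    struct aligned_cookie *ck;
    void *mem = NULL;
    FILE *fp;

    assert(fd >= 0);
    ck = malloc(sizeof *ck);
    if (ck == NULL)
        return NULL;
    if (posix_memalign(&mem, ALIGN, ALIGN) != 0) {
        free(ck);
        return NULL;
    }
    ck->fd = fd;
    ck->buf = mem;
    fp = fopencookie(ck, "w", io);
    if (fp == NULL) {
        free(mem);
        free(ck);
    }
    return fp;
}
```

Usage would be along the lines of opening the file with open(), passing the descriptor to fopen_aligned(), and then using fprintf()/fwrite() as usual. fclose() flushes the stdio buffer through aligned_write() and closes the descriptor.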
After each read/write from the file, you can call fflush() to force the file to dump all user space buffers to lower level buffers. syncfs() may be of use to you to force the kernel to clear all buffers to disk. If you need greater control at a lower level, you will probably just have to use open() instead of fopen().
You may also want to explore the available ioctl() calls for your disk and memory devices to see if caching can be disabled system-wide at that level.
What really happens when write() system call is executed?
Let's say I have a program which writes certain data into a file using the write() function call. Now the C library has its own internal buffer and the OS has its own buffer too.
What interaction takes place between these buffers?
Is it that when the C library buffer fills up completely, it is written to the OS buffer, and when the OS buffer fills up completely, the actual write to the file is done?
I am looking for some detailed answers, useful links would also help. Consider this question for a UNIX system.
The write() system call (in fact, any system call) is nothing more than a contract between the application program and the OS.
for "normal" files, the write() only puts the data on a buffer, and marks that buffer as "dirty"
at some time in the future, these dirty buffers will be collected and actually written to disk. This can be forced by fsync()
this is done by the .write() "method" in the mounted-filesystem-table
and this will invoke the hardware's .write() method. (which could involve another level of buffering, such as DMA)
modern hard disks have their own buffers, whose contents may or may not have actually been written to the physical disk, even if the OS->controller told them to.
Now, some (abnormal) files don't have a write() method to support them. Imagine open()ing "/dev/null", and write()ing a buffer to it. The system could choose not to buffer it, since it will never be written anyway.
Also note that the behaviour of write() does depend on the nature of the file; for network sockets, write(fd,buff,size) can return before size bytes have been sent (write will return the number of bytes sent). But it is impossible to find out where they are once they have been sent. They could still be in a network buffer (e.g. waiting for Nagle ...), or a buffer inside the network interface, or a buffer in a router or switch somewhere on the wire.
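Since write() may accept fewer bytes than you asked for, callers that need the whole buffer handed over usually loop. A minimal sketch of that idiom is below; write_all is a name I picked for the example, not a standard function.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep calling write() until all len bytes have been accepted by the OS.
 * Returns 0 on success, -1 on error (errno is set by write()). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted before any data moved: retry */
            return -1;
        }
        p   += n;              /* skip the part that was accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

Keep in mind that a return value of 0 here only means the OS has accepted every byte. As described above, the data may still be sitting in a kernel, device or network buffer.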
As far as I know...
The write() function is a lower level thing where the library doesn't buffer data (unlike fwrite() where the library does/may buffer data).
Despite that, the only guarantee is that the OS transfers the data to the disk drive before the next fsync() completes. However, hard disk drives usually have their own internal buffers that are (sometimes) beyond the OS's control, so even if a subsequent fsync() has completed, it's possible for a power failure or something similar to occur before the data is actually written from the disk drive's internal buffer to the disk's physical media.
Essentially, if you really must make sure that your data is actually written to the disk's physical media; then you need to redesign your code to avoid this requirement, or accept a (small) risk of failure, or ensure the hardware is capable of it (e.g. get a UPS).
write() writes data to the operating system, making it visible to all processes (if it is something which can be read by other processes). How the operating system buffers it, or when it gets written permanently to disk, is very library, OS, system configuration and file system specific. However, sync() can be used to force buffers to be flushed.
What is guaranteed is that POSIX requires that, on a POSIX-compliant file system, a read() which can be proved to occur after a write() has returned must return the written data.
OS dependent, see man 2 sync and (on Linux) the discussion in man 8 sync.
Years ago operating systems were supposed to implement an 'elevator algorithm' to schedule writes to disk. The idea would be to minimize the disk writing head movement, which would allow a good throughput for several processes accessing the disk at the same time.
Since you're asking about UNIX, you must keep in mind that a file might actually be on an FTP server which you have mounted, for example. Likewise, the files under /dev and /proc are not files on the HDD either.
Also, on Linux data is not written to the hard drive directly, instead there is a polling process, that flushes all pending writes every so often.
But again, those are implementation details, that really don't affect anything from the point of view of your program.
I would like to ask a fundamental question about when it is useful to use a system call like fsync. I am a beginner and I was always under the impression that write is enough to write to a file, and samples that use write do actually end up writing to the file.
So what is the purpose of a system call like fsync?
Just to provide some background, I am using the Berkeley DB library version 5.1.19 and there is a lot of talk around the cost of fsync() vs just writing. That is the reason I am wondering.
Think of it as a layer of buffering.
If you're familiar with the standard C calls like fopen and fprintf, you should already be aware of buffering happening within the C runtime library itself.
The way to flush those buffers is with fflush which ensures that the information is handed from the C runtime library to the OS (or surrounding environment).
However, just because the OS has it, doesn't mean it's on the disk. It could get buffered within the OS as well.
That's what fsync takes care of, ensuring that the stuff in the OS buffers is written physically to the disk.
You may typically see this sort of operation in logging libraries:
fprintf (myFileHandle, "something\n"); // output it
fflush (myFileHandle); // flush to OS
fsync (fileno (myFileHandle)); // flush to disk
fileno is a function which gives you the underlying int file descriptor for a given FILE* file handle, and fsync on the descriptor does the final level of flushing.
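In real code you would probably also want to check the return values, since any of the three steps can fail. Here is a small sketch of the same sequence wrapped in a helper; log_line_durable is a made-up name for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Append one line and push it through both buffering layers.
 * Returns 0 on success, -1 if any step fails. */
int log_line_durable(FILE *fp, const char *line)
{
    assert(fp != NULL && line != NULL);
    if (fprintf(fp, "%s\n", line) < 0)   /* into the C library buffer */
        return -1;
    if (fflush(fp) != 0)                 /* C library buffer -> OS */
        return -1;
    if (fsync(fileno(fp)) != 0)          /* OS buffers -> storage device */
        return -1;
    return 0;
}
```

You would call it as log_line_durable(myFileHandle, "something") and treat a non-zero result as "this line may not be on disk".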
Now that is a relatively expensive operation since the disk write is usually considerably slower than in-memory transfers.
As well as logging libraries, one other use case may be useful for this behaviour. Let me see if I can remember what it was. Yes, that's it. Databases! Just like Berzerkely DB. Where you want to ensure the data is on the disk, a rather useful feature for meeting ACID requirements :-)