Buffering and file I/O in C

Many years ago I noticed, when reading a large binary file with BlockRead() in Delphi 7, that reading byte by byte is much slower than reading a chunk of, say, 16384 bytes at a time. This apparently meant that Delphi 7 did not use an internal buffer (at least, not by default) and that each BlockRead() call read directly from the disk.
What about fread() in C? Should the developer manage buffering herself/himself, or will the C library take care of it? I know that text file I/O is buffered by default in C and, as far as I can remember, it is possible to change the size of the internal buffer.
UPDATE: It is possible that Delphi 7 did use an internal buffer for an opened file, but that its default size was small.

According to the book C in a Nutshell (2005) by T. Crawford and P. Prinz:
When you open an ordinary file by calling fopen( ), the new stream is fully buffered. ... After you have opened a file, and before you perform the first input or output operation on it, you can change the buffering mode using the setbuf( ) or setvbuf( ) function.
It seems that this is about files in general, not only text files.
I will update this answer soon with the results of some tests.
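In the meantime, here is a minimal sketch of such a test, comparing byte-by-byte fread() against 16384-byte chunks. The file name data.bin, the chunk size, and the use of clock() are all arbitrary choices for illustration:

#include <stdio.h>
#include <time.h>

int main(void)
{
    static char chunk[16384];
    char byte;
    clock_t t0;

    FILE *fp = fopen("data.bin", "rb");   /* some large existing file */
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    t0 = clock();
    while (fread(&byte, 1, 1, fp) == 1)   /* one byte per call */
        ;
    printf("byte-by-byte: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    rewind(fp);

    t0 = clock();
    while (fread(chunk, 1, sizeof chunk, fp) > 0)   /* 16384 bytes per call */
        ;
    printf("chunked:      %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    fclose(fp);
    return 0;
}

If the stream really is fully buffered, the byte-by-byte loop should still be reasonably fast, because most fread() calls are served from the library's internal buffer rather than from the disk.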

Related

What is exactly a stream in C language?

I can't understand the meaning of "stream" in the C language. Is it an abstraction (just a name describing many operations)? Is it an object (monitor, keyboard, file on a hard drive) that a program exchanges data with? Or is it a memory space in RAM that temporarily holds the exchanged data?
Thanks for any help.
A stream is an abstraction of an I/O channel. It can map to a physical device such as a terminal or tape drive or a printer, or it can map to a file in a file system, or a network socket, or something else completely. How that mapping is accomplished is not exposed to you, the programmer.
From the perspective of your code, a stream is simply a source (input stream) or sink (output stream) of characters (text stream) or bytes (binary stream). Streams are managed through FILE objects and the stdio routines.
As far as your code is concerned, all streams behave the same way, regardless of what they are mapped to. It's a uniform interface to operations that can have wildly different implementations.
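As a small sketch of that uniformity, the same byte-counting routine below works on any stream, whatever it happens to map to (the file name example.txt is hypothetical):

#include <stdio.h>

/* Counts bytes from any stream: a file, a pipe, or the terminal. */
long count_bytes(FILE *stream)
{
    long n = 0;
    while (fgetc(stream) != EOF)
        n++;
    return n;
}

int main(void)
{
    FILE *fp = fopen("example.txt", "rb");
    if (fp != NULL) {
        printf("file:  %ld bytes\n", count_bytes(fp));
        fclose(fp);
    }
    printf("stdin: %ld bytes\n", count_bytes(stdin));
    return 0;
}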
A stream is just a sequence of data available over time. It is distinct from a file, for example, because you can't set the position. Examples: data coming/going through RS232, USB, Ethernet, IP networks, etc.
But my question is: what exactly is a stream at the machine level?
Nothing special. The machine level does not know anything about streams.
What is exactly a stream in C language?
The same: the C language itself does not know anything about streams.
In C, when we use the term stream, we mean any input source or output destination.
Some examples are:
stdin (standard input which is the keyboard by default)
stdout (standard output which by default is the screen)
stderr (standard error which is the screen by default)
Functions such as scanf, gets and getchar read from the keyboard as their default input stream, while printf and puts write to the screen as their default output stream.
But we can create streams to files too!
The stdio.h library supports two types of files: text files and binary files. Within a text file, the bytes represent characters, which makes it possible for a human to read what the file contains. By contrast, in a binary file, bytes do not necessarily represent characters. In summary, text files have two things that binary files do not: text files are divided into lines, with each line ending in one or two special characters (the exact encoding depends on the operating system), and text files may contain an explicit end-of-file marker.
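A minimal sketch of creating a stream to a file (the file name notes.txt is just an example):

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("notes.txt", "w");   /* open a text stream for writing */
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    fprintf(fp, "hello, stream\n");       /* like printf, but to the file */
    fclose(fp);
    return 0;
}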
Streams are specific to the running program as well. Let me explain this further.
When you run a program through the terminal (Unix-like/Windows), what essentially happens is:
The terminal (shell) forks a child process and runs your specified program (./name_of_program) in it.
All printf output goes to the stdout inherited from the parent process that forked; likewise, scanf reads from the inherited stdin.
The operating system handles the characteristics of the streams, i.e. how many bytes can be streamed to stdin/stdout at once. In Unix this is generally 4096 bytes. (Hint: use pipes to work around this limit.)
There are three buffering modes for streams in C (or any programming language): fully buffered, line-buffered, and unbuffered. (Hint: add a delay between printf() calls to see what this means; a sketch of this experiment follows.)
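A sketch of that experiment, assuming a POSIX system for sleep(), since standard C has no delay() function. Run it once with output to the terminal and once redirected to a file, and watch when the text appears:

#include <stdio.h>
#include <unistd.h>   /* sleep() -- POSIX, not standard C */

int main(void)
{
    /* Uncomment to try unbuffered mode:
    setvbuf(stdout, NULL, _IONBF, 0);
    */
    for (int i = 0; i < 5; i++) {
        /* Terminal (line buffered): one line per second.
         * Redirected to a file (fully buffered): all lines appear at exit. */
        printf("tick %d\n", i);
        sleep(1);
    }
    return 0;
}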
Finally, read and write access to files is handled by another OS facility: file descriptors. These are small non-negative integers used by the OS to keep track of open files and ports (like serial ports).

Difference between stream and direct I/O in C?

In C, I believe (correct me if I'm wrong) there are two different types of input/output functions, direct and stream, which result in binary and ASCII files respectively.
What is the difference between stream (ASCII) and direct (Binary) I/O in terms of retrieving (read/write) and printing data?
No, yes, sort of, maybe…
In C, … there are two different types of input/output functions, direct and stream, which result in binary and ASCII files respectively.
In Standard C, there are only file streams, FILE *. In POSIX C, there are what might be termed 'direct' file access functions, mainly using file descriptors instead of file streams. AFAIK, Windows also provides alternative I/O functions, mainly using handles instead of file streams. So "No" — Standard C has one type of I/O function; but POSIX (and Windows) provide alternatives.
In Standard C, you can create binary files and text files using:
FILE *bfp = fopen("binary-file.bin", "wb");
FILE *tfp = fopen("regular-file.txt", "w");
On Windows (and maybe other systems for Windows compatibility), you can be explicit about opening a text file:
FILE *tcp = fopen("regular-file.txt", "wt");
So the standard distinguishes between text and binary files, but file streams can be used to access either type of file. Further, on Unix systems, there is no difference between a text file and a binary file; they will be treated the same. On Windows, a text file will have its CRLF (carriage return, line feed) line endings mapped to newline on input, and newlines mapped to CRLF line endings on output. That translation does not occur with binary files.
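A small sketch of that translation (the file names are examples). After running this on Windows, t.txt contains three bytes ('a', CR, LF) while b.bin contains exactly the two bytes written; on Unix the two files are identical:

#include <stdio.h>

int main(void)
{
    FILE *t = fopen("t.txt", "w");    /* text mode */
    FILE *b = fopen("b.bin", "wb");   /* binary mode */
    if (t == NULL || b == NULL)
        return 1;
    fputc('a', t);  fputc('\n', t);   /* '\n' may be written as CRLF */
    fputc('a', b);  fputc('\n', b);   /* exactly the bytes 0x61, 0x0A */
    fclose(t);
    fclose(b);
    return 0;
}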
Note that there is also a concept 'direct I/O' on Linux, activated using the O_DIRECT flag, which is probably not what you're thinking of. It is a refinement of file descriptor I/O.
What is the difference between stream (ASCII) and direct (Binary) I/O in terms of retrieving (read/write) and printing data?
There are multiple issues.
First, the dichotomy between text files and binary files is separate from the dichotomy between stream I/O and direct I/O.
With stream I/O, line endings are mapped from the native form (e.g. CRLF) to newline when processing text files; no such mapping occurs when processing binary files.
With text I/O, it is assumed that there will be no null bytes, '\0' in the data. Such bytes in the middle of a line mess up text processing code that expects to read up to a null. With binary I/O, all 256 byte values are expected; code that breaks because of a null byte is broken.
Complicating this is the distinction between different code sets for encoding text files. If you have a single-byte code set, such as ISO 8859-15, then null bytes don't generally appear. If you have a multi-byte code set such as UTF-8, again, null bytes don't generally appear. However, if you have a wide character code set such as UTF-16 (whether big-endian or little-endian), then you will often get zero bytes in the body of the file — it is not intended to be read or written as a byte stream but rather as a stream of 16-bit units.
The major difference between stream I/O and direct I/O is that the stream library buffers data for both input and output, unless you override it with setvbuf(). That is, if you repeatedly read a single character in the user code (getchar() for example), the stream library first reads a chunk of data from the file and then doles out one character at a time from the chunk, only going back to the file for more data when the previous chunk has been delivered completely. By contrast, direct I/O reading a single byte at a time will make a system call for each byte. Granted, the kernel will buffer the I/O (it does that for the stream I/O too — so there are multiple layers of buffering here, which is part of what O_DIRECT I/O attempts to avoid whenever possible), but the overhead of a system call per byte is rather substantial.
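Here is a sketch of the contrast, assuming a POSIX system and a hypothetical file data.bin. The fgetc() loop makes a system call only when the stdio buffer needs refilling; the read() loop makes one system call per byte:

#include <stdio.h>
#include <fcntl.h>    /* open(), O_RDONLY -- POSIX */
#include <unistd.h>   /* read(), close() -- POSIX */

int main(void)
{
    long n1 = 0, n2 = 0;

    /* Stream I/O: most reads are served from the library's buffer. */
    FILE *fp = fopen("data.bin", "rb");
    if (fp != NULL) {
        while (fgetc(fp) != EOF)
            n1++;
        fclose(fp);
    }

    /* Direct I/O: one read() system call for every single byte. */
    int fd = open("data.bin", O_RDONLY);
    if (fd != -1) {
        char c;
        while (read(fd, &c, 1) == 1)
            n2++;
        close(fd);
    }

    printf("%ld bytes via stream, %ld bytes via read()\n", n1, n2);
    return 0;
}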
Generally, you have more fine-grained control over access with file descriptors; there are operations you can do with file descriptors that are simply not feasible with streams because the stream interface functions simply don't cover the possibility. For example, setting FD_CLOEXEC or O_CLOEXEC on a file descriptor means that the file descriptor will be closed automatically by the system when the program executes another one — the stream library simply doesn't cover the concept, let alone provide control over it. The cost of gaining the fine-grained control is that you have to write more code — or, at least, different code that does what is handled for you by the stream library functions.
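For instance, a sketch of setting the close-on-exec flag with POSIX calls (the file name is hypothetical):

#include <stdio.h>
#include <fcntl.h>    /* open(), fcntl(), FD_CLOEXEC -- POSIX */

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    /* Mark the descriptor close-on-exec; the stream library offers
     * no portable way to express this. */
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1) {
        perror("fcntl");
        return 1;
    }
    return 0;
}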
Streams are a portable way of reading and writing data. They provide a flexible and efficient means of I/O. A stream is a file or a physical device (like a monitor) which is manipulated through a pointer to the stream.
Stream I/O is BUFFERED, that is to say, a fixed chunk is read from or written to the file via a temporary storage area (the buffer). But data written to a buffer does not appear in the file (or device) until the buffer is flushed or written out (on a line-buffered stream, writing '\n' does this).
In direct or low-level I/O:
This form of I/O is UNBUFFERED -- each read/write request results in accessing the disk (or device) directly to fetch/put a specific number of bytes.
There are no formatting facilities -- we are dealing with raw bytes of information.
This means we are now using binary (and not text) files.

are fread and fwrite different in handling the internal buffer?

I keep reading that fread() and fwrite() are buffered library calls. In the case of fwrite(), I understand that when we write to the file, the data won't be written to the hard disk immediately; it fills the internal buffer, and once the buffer is full, the write() system call is made to actually write the data to the file.
But I am not able to understand how this buffering works in the case of fread(). Does "buffered" in the case of fread() mean that once we call fread(), it will read more data than we originally asked for, and that the extra data will be stored in the buffer (so that when the second fread() occurs, it can be served directly from the buffer instead of going to the hard disk)?
I also have the following questions:
If fread() works as I described above, will the first fread() call read an amount of data equal to the size of the internal buffer? If so, what happens if my fread() call asks for more bytes than the internal buffer size?
If fread() works as I described above, then at least one read() system call to the kernel is certain to happen in the case of fread(). But in the case of fwrite(), if we only call fwrite() once during program execution, we can't say for sure that the write() system call will be made. Is my understanding correct?
Is the internal buffer maintained by the OS?
Does fclose() flush the internal buffer?
There is buffering or caching at many different levels in a modern system. This might be typical:
C standard library
OS kernel
disk controller (esp. if using hardware RAID)
disk drive
When you use fread(), it may request 8 KB or so even if you asked for less. This will be stored in user space, so there is no system call and context switch on the next sequential read.
The kernel may read ahead also; there are library functions to give it hints on how to do this for your particular application. The OS cache could be gigabytes in size since it uses main memory.
The disk controller may read ahead too, and could have a cache size up to hundreds of megabytes on smallish systems. It can't do as much in terms of read-ahead, because it doesn't know where the next logical block is for the current file (indeed it doesn't even know what file it is reading).
Finally, the disk drive itself has a cache, perhaps 16 MB or so. Like the controller, it doesn't know what file it is reading. For many years one disk block was 512 bytes, but it got a little larger (a few KB) recently with multi-terabyte disks.
When you call fclose(), it will probably deallocate the user-space buffer, but not the others.
Your understanding is correct. And any buffered fwrite data will be flushed when the FILE* is closed. Buffered I/O is mostly transparent when working with regular files.
But for terminals and other character devices, you may care. Another instance where buffered I/O may be an issue is when one process reads a file that another process is writing to -- a common example is a program writing text to a log file during operation while the user runs a command like tail -f program.log to watch the content of the log file live. If the writing process has buffering enabled and doesn't explicitly flush the log file, monitoring it becomes difficult.
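A sketch of that logging scenario (the file name program.log is just an example); flushing after each message lets tail -f show lines as they are written:

#include <stdio.h>

int main(void)
{
    FILE *log = fopen("program.log", "a");
    if (log == NULL)
        return 1;
    fprintf(log, "starting up\n");
    fflush(log);          /* push the line out of the stdio buffer now */
    /* ... long-running work, logging and flushing as it goes ... */
    fclose(log);          /* flushes any remaining buffered data */
    return 0;
}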

what is the point of using the setvbuf() function in c?

Why would you want to set aside a block of memory in setvbuf()?
I have no clue why you would want to send your read/write stream to a buffer.
setvbuf is not intended to redirect the output to a buffer (if you want to perform I/O on a buffer you use sprintf & co.), but to tightly control the buffering behavior of the given stream.
In fact, C I/O functions don't immediately pass the data to be written to the operating system; they keep an intermediate buffer to avoid continuously performing (potentially expensive) system calls, waiting for the buffer to fill before actually performing the write.
The most basic use case is to disable buffering altogether (useful e.g. for a log file, where you want the data to go to disk immediately after each output operation) or, on the other hand, to enable block buffering on streams where it is disabled by default (or set to line buffering). This may be useful to improve output performance.
Setting a specific buffer for output can be useful if you are working with a device that is known to work well with a particular buffer size; on the other hand, you may want a small buffer to cut down on memory usage in memory-constrained environments, or to avoid losing too much data in case of a power loss without disabling buffering completely.
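A sketch of the two cases described above: disabling buffering on a log stream, and giving another stream a large private buffer. The file names and the 64 KB size are arbitrary choices:

#include <stdio.h>

int main(void)
{
    static char big_buf[65536];

    FILE *log = fopen("app.log", "a");
    FILE *out = fopen("bulk-output.dat", "wb");
    if (log == NULL || out == NULL)
        return 1;

    /* setvbuf must be called before the first I/O on the stream. */
    setvbuf(log, NULL, _IONBF, 0);                 /* unbuffered */
    setvbuf(out, big_buf, _IOFBF, sizeof big_buf); /* fully buffered, 64 KB */

    fprintf(log, "handed to the OS immediately\n");
    fwrite("x", 1, 1, out);   /* sits in big_buf until it fills, or until flush/close */

    fclose(out);
    fclose(log);
    return 0;
}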
In C, files opened with e.g. fopen are buffered by default. You can use setvbuf to supply your own buffer, or to make the file operations completely unbuffered (like stderr is).
It can also be used to approximate fmemopen functionality on systems that don't have that function.
The size of a file's buffer can affect standard library I/O rates. There is a table in Chapter 5 of Stevens's Advanced Programming in the UNIX Environment that shows I/O throughput increasing dramatically with I/O buffer size, up to about 16 KB, then leveling off. A lot of other factors can influence overall I/O throughput, so this one "tuning" effect may or may not be a cure-all. This is the main reason for the "why", other than turning buffering off or on.
Each FILE structure has a buffer associated with it internally. The reason for this is to reduce I/O, since real I/O operations are costly in time.
All your reads/writes are buffered until the buffer is full; then all the buffered data is transferred in one real I/O operation.
Why would you want to set aside a block of memory in setvbuf()?
For buffering.
I have no clue why you would want to send your read/write stream to a buffer.
Neither do I, but as that's not what it does the point is moot.
"The setvbuf() function may be used on any open stream to change its buffer" [my emphasis]. In other words it alread has a buffer, and all the function does is change that. It doesn't say anything about 'sending your read/write streams to a buffer". I suggest you read the man page to see what it actually says. Especially this part:
When an output stream is unbuffered, information appears on the destination file or terminal as soon as written; when it is block buffered many characters are saved up and written as a block; when it is line buffered characters are saved up until a newline is output or input is read from any stream attached to a terminal device (typically stdin).

How are files written? Why do I not see my data written immediately?

I understand the general process of writing to and reading from a file, but I was curious about what happens under the hood during file writing. For instance, I have written a program that writes a series of numbers, line by line, to a .txt file. One thing that bothers me, however, is that I don't see the information written until after my C program has finished running. Is there a way to see the information written while the program is running, rather than after? Is this even possible? This is a hard question to phrase in one line, so please forgive me if it's already been answered elsewhere.
The reason I ask is that I'm writing to a file and was hoping I could scan the file for the highest and lowest values (the program would optimally be able to run for hours).
Research buffering and caching.
There are a number of layers of optimisation performed by:
your application,
your OS, and
your disk driver,
in order to extend the life of your disk and increase performance.
With the careful use of flushing commands, you can generally make things happen "quite quickly" when you really need them to, though you should generally do so sparingly.
Flushing can be particularly useful when debugging.
The GNU C Library documentation has a good page on the subject of file flushing, listing functions such as fflush which may do what you want.
You are observing an effect caused solely by the C standard I/O (stdio) buffers. I claim that OS or disk driver buffering has nothing to do with it.
In stdio, I/O happens in one of three modes:
Fully buffered: data is written once BUFSIZ (from <stdio.h>) characters have accumulated. This is the default when I/O is redirected to a file or pipe. This is what you observe. Typically BUFSIZ is anywhere from 1 KB to several KB.
Line buffered: data is written once a newline is seen (or BUFSIZ is reached). This is the default when I/O goes to a terminal.
Unbuffered: data is written immediately.
You can use the setvbuf() (<stdio.h>) function to change the default, using the _IOFBF, _IOLBF or _IONBF macros, respectively. See your friendly setvbuf man page.
In your case, you can set your output stream (stdout or the FILE * returned by fopen) to line buffered.
Alternatively, you can call fflush() on the output stream whenever you want I/O to happen, regardless of buffering.
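A sketch of both options for a program like the one in the question (the file name and loop are illustrative; sleep() is POSIX, not standard C):

#include <stdio.h>
#include <unistd.h>   /* sleep() -- POSIX */

int main(void)
{
    FILE *fp = fopen("numbers.txt", "w");
    if (fp == NULL)
        return 1;

    setvbuf(fp, NULL, _IOLBF, 0);     /* option 1: flush at every newline */

    for (int i = 0; i < 100; i++) {
        fprintf(fp, "%d\n", i);       /* now visible to other programs */
        /* fflush(fp); */             /* option 2: explicit flush instead */
        sleep(1);
    }
    fclose(fp);
    return 0;
}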
Indeed, there are several layers between the write functions and the actual file.
First, you open the file for writing. This causes the file to be either created or truncated. If you write then, the write doesn't actually occur immediately; the data is cached until the buffer is full or the file is flushed or closed.
You can call fflush() after writing each portion of data, or you can simply wait until the file is closed.
Yes, it is possible to see what's written to the file(s). If you program under Linux, you can open a new terminal and watch the progress with, for example, less filename.
