what is the point of using the setvbuf() function in c? - c

Why would you want to set aside a block of memory in setvbuf()?
I have no clue why you would want to send your read/write stream to a buffer.

setvbuf is not intended to redirect the output to a buffer (if you want to perform IO on a buffer you use sprintf & co.), but to tightly control the buffering behavior of the given stream.
In facts, C IO functions don't immediately pass the data to be written to the operating system, but keep an intermediate buffer to avoid continuously performing (potentially expensive) system calls, waiting for the buffer to fill before actually performing the write.
The most basic case is to disable buffering altogether (useful e.g. if writing to a log file, where you want the data to go to disk immediately after each output operation) or, on the other hand, to enable block buffering on streams where it is disabled by default (or is set to line-buffering). This may be useful to enhance output performance.
Setting a specific buffer for output can be useful if you are working with a device that is known to work well with a specific buffer size; on the other side, you may want to have a small buffer to cut down on memory usage in memory-constrained environments, or to avoid losing much data in case of power loss without disabling buffering completely.

In C files opened with e.g. fopen are by default buffered. You can use setvbuf to supply your own buffer, or make the file operations completely unbuffered (like to stderr is).
It can be used to create fmemopen functionality on systems that doesn't have that function.

The size of a files buffer can affect Standard library call I/O rates. There is a table in Chap 5 of Steven's 'Advanced Programming in the UNIX Environment' that shows I/O throughput increasing dramatically with I/O buffer size, up to ~16K then leveling off. A lot of other factor can influenc overall I/O throughtput, so this one "tuning" affect may or may not be a cureall. This is the main reason for "why" other than turning off/on buffering.

Each FILE structure has a buffer associated with it internally. The reason behind this is to reduce I/O, and real I/O operations are time costly.
All your read/write will be buffered until the buffer is full. All the data buffered will be output/input in one real I/O operation.

Why would you want to set aside a block of memory in setvbuf()?
For buffering.
I have no clue why you would want to send your read/write stream to a buffer.
Neither do I, but as that's not what it does the point is moot.
"The setvbuf() function may be used on any open stream to change its buffer" [my emphasis]. In other words it alread has a buffer, and all the function does is change that. It doesn't say anything about 'sending your read/write streams to a buffer". I suggest you read the man page to see what it actually says. Especially this part:
When an output stream is unbuffered, information appears on the destination file or terminal as soon as written; when it is block buffered many characters are saved up and written as a block; when it is line buffered characters are saved up until a newline is output or input is read from any stream attached to a terminal device (typically stdin).

Related

Understanding Buffering in C

I am having a really hard time understanding the depths of buffering especially in C programming and I have searched for really long on this topic but haven't found something satisfying till now.
I will be a little more specific:
I do understand the concept behind it (i.e. coordination of operations by different hardware devices and minimizing the difference in speed of these devices) but I would appreciate a more full explanation of these and other potential reasons for buffering (and by full I mean full the longer and deeper the better) it would also be really nice to give some concrete Examples of how buffering is implemented in I/O streams.
The other questions would be that I noticed that some rules in buffer flushing aren't followed by my programs as weirdly as this sounds like the following simple fragment:
#include <stdio.h>
int main(void)
{
FILE * fp = fopen("hallo.txt", "w");
fputc('A', fp);
getchar();
fputc('A', fp);
getchar();
return 0;
}
The program is intended to demonstrate that impending input will flush arbitrary stream immediately when the first getchar() is called but this simply doesn't happen as often as I try it and with as many modifications as I want — it simply doesn't happen as for stdout (with printf() for example) the stream is flushed without any input requested also negating the rule therefore am I understanding this rule wrongly or is there something other to consider
I am using Gnu GCC on Windows 8.1.
Update:
I forgot to ask that I read on some sites how people refer to e.g. string literals as buffers or even arrays as buffers; is this correct or am I missing something?
Please explain this point too.
The word buffer is used for many different things in computer science. In the more general sense, it is any piece of memory where data is stored temporarily until it is processed or copied to the final destination (or other buffer).
As you hinted in the question there are many types of buffers, but as a broad grouping:
Hardware buffers: These are buffers where data is stored before being moved to a HW device. Or buffers where data is stored while being received from the HW device until it is processed by the application. This is needed because the I/O operation usually has memory and timing requirements, and these are fulfilled by the buffer. Think of DMA devices that read/write directly to memory, if the memory is not set up properly the system may crash. Or sound devices that must have sub-microsecond precision or it will work poorly.
Cache buffers: These are buffers where data is grouped before writing into/read from a file/device so that the performance is generally improved.
Helper buffers: You move data into/from such a buffer, because it is easier for your algorithm.
Case #2 is that of your FILE* example. Imagine that a call to the write system call (WriteFile() in Win32) takes 1ms for just the call plus 1us for each byte (bear with me, things are more complicated in real world). Then, if you do:
FILE *f = fopen("file.txt", "w");
for (int i=0; i < 1000000; ++i)
fputc('x', f);
fclose(f);
Without buffering, this code would take 1000000 * (1ms + 1us), that's about 1000 seconds. However, with a buffer of 10000 bytes, there will be only 100 system calls, 10000 bytes each. That would be 100 * (1ms + 10000us). That's just 0.1 seconds!
Note also that the OS will do its own buffering, so that the data is written to the actual device using the most efficient size. That will be a HW and cache buffer at the same time!
About your problem with flushing, files are usually flushed just when closed or manually flushed. Some files, such as stdout are line-flushed, that is, they are flushed whenever a '\n' is written. Also the stdin/stdout are special: when you read from stdin then stdout is flushed. Other files are untouched, only stdout. That is handy if you are writing an interactive program.
My case #3 is for example when you do:
FILE *f = open("x.txt", "r");
char buffer[1000];
fgets(buffer, sizeof(buffer), f);
int n;
sscanf(buffer, "%d", &n);
You use the buffer to hold a line from the file, and then you parse the data from the line. Yes, you could call fscanf() directly, but in other APIs there may not be the equivalent function, and moreover you have more control this way: you can analyze the type if line, skip comments, count lines...
Or imagine that you receive one byte at a time, for example from a keyboard. You will just accumulate characters in a buffer and parse the line when the Enter key is pressed. That is what most interactive console programs do.
The noun "buffer" really refers to a usage, not a distinct thing. Any block of storage can serve as a buffer. The term is intentionally used in this general sense in conjunction with various I/O functions, though the docs for the C I/O stream functions tend to avoid that. Taking the POSIX read() function as an example, however: "read() attempts to read up to count bytes from file descriptor fd into the buffer starting at buf". The "buffer" in that case simply means the block of memory in which the bytes read will be recorded; it is ordinarily implemented as a char[] or a dynamically-allocated block.
One uses a buffer especially in conjunction with I/O because some devices (especially hard disks) are most efficiently read in medium-to-large sized chunks, where as programs often want to consume that data in smaller pieces. Some other forms of I/O, such as network I/O, may inherently come in chunks, so that you must record each whole chunk (in a buffer) or else lose that part you're not immediately ready to consume. Similar considerations apply to output.
As for your test program's behavior, the "rule" you hoped to demonstrate is specific to console I/O, but only one of the streams involved is connected to the console.
The first question is a bit too broad. Buffering is used in many cases, including message storage before actual usage, DMA uses, speedup usages and so on. In short, the entire buffering thing can be summarized as "save my data, let me continue execution while you do something with the data".
Sometimes you may modify buffers after passing them to functions, sometimes not. Sometimes buffers are hardware, sometimes software. Sometimes they reside in RAM, sometimes in other memory types.
So, please ask more specific question. As a point to begin, use wikipedia, it is almost always helpful: wiki
As for the code sample, I haven't found any mention of all output buffers being flushed upon getchar. Buffers for files are generally flushed in three cases:
fflush() or equivalent
File is closed
The buffer is overflown.
Since neither of these cases is true for you, the file is not flushed (note that application termination is not in this list).
Buffer is a simple small area inside your memory (RAM) and that area is responsible of storing information before sent to your program, as long I'm typing the characters from the keyboard these characters will be stored inside the buffer and as soon I press the Enter key these characters will be transported from the buffer into your program so with the help of buffer all these characters are instantly available to your program (prevent lag and the slowly) and sent them to the output display screen

are fread and fwrite different in handling the internal buffer?

I keep on reading that fread() and fwrite() are buffered library calls. In case of fwrite(), I understood that once we write to the file, it won't be written to the hard disk, it will fill the internal buffer and once the buffer is full, it will call write() system call to write the data actually to the file.
But I am not able to understand how this buffering works in case of fread(). Does buffered in case of fread() mean, once we call fread(), it will read more data than we originally asked and that extra data will be stored in buffer (so that when 2nd fread() occurs, it can directly give it from buffer instead of going to hard disk)?
And I have following queries also.
If fread() works as I mention above, then will first fread() call read the data that is equal to the size of the internal buffer? If that is the case, if my fread() call ask for more bytes than internal buffer size, what will happen?
If fread() works as I mention above, that means at least one read() system call to kernel will happen for sure in case of fread(). But in case of fwrite(), if we only call fwrite() once during the program execution, we can't say for sure that write() system call be called. Is my understanding correct?
Will the internal buffer be maintained by OS?
Does fclose() flush the internal buffer?
There is buffering or caching at many different levels in a modern system. This might be typical:
C standard library
OS kernel
disk controller (esp. if using hardware RAID)
disk drive
When you use fread(), it may request 8 KB or so if you asked for less. This will be stored in user-space so there is no system call and context switch on the next sequential read.
The kernel may read ahead also; there are library functions to give it hints on how to do this for your particular application. The OS cache could be gigabytes in size since it uses main memory.
The disk controller may read ahead too, and could have a cache size up to hundreds of megabytes on smallish systems. It can't do as much in terms of read-ahead, because it doesn't know where the next logical block is for the current file (indeed it doesn't even know what file it is reading).
Finally, the disk drive itself has a cache, perhaps 16 MB or so. Like the controller, it doesn't know what file it is reading. For many years one disk block was 512 bytes, but it got a little larger (a few KB) recently with multi-terabyte disks.
When you call fclose(), it will probably deallocate the user-space buffer, but not the others.
Your understanding is correct. And any buffered fwrite data will be flushed when the FILE* is closed. The buffered I/O is mostly transparent for I/O on regular files.
But for terminals and other character devices you may care. Another instance where buffered I/O may be an issue is if you read from the file that one process is writing to from another process -- a common example is if a program writes text to a log file during operation, and the user runs a command like tail -f program.log to watch the content of the log file live. If the writing process has buffering enabled and it doesn't explicitly flush the log file, it will make it difficult to monitor the log file.

What is meant by stream buffering?

I had started learning C programming, so I'm a beginner, while learning about standard streams of text, I came up with the lines "stdout" stream is buffered while "stderr" stream is not buffered, but I am not able to make sense with this lines.
I already have read about "buffer" on this forum and I like candy analogy, but I am not able to figure out what is meant when one says: "This stream is buffered and the other one is not." What is the effect?
What is the difference?
Update: Does it affect the speed of processing?
Buffer is a block of memory which belongs to a stream and is used to hold stream data temporarily. When the first I/O operation occurs on a file, malloc is called and a buffer is obtained. Characters that are written to a stream are normally accumulated in the buffer (before being transmitted to the file in chunks), instead of appearing as soon as they are output by the application program. Similarly, streams retrieve input from the host environment in blocks rather than on a character-by-character basis. This is done to increase efficiency, as file and console I/O is slow in comparison to memory operations.
GCC provides three types of buffering - unbuffered, block buffered, and line buffered. Unbuffered means that characters appear on the destination file as soon as written (for an output stream), or input is read from a file on a character-by-character basis instead of reading in blocks (for input streams). Block buffered means that characters are saved up in the buffer and written or read as a block. Line buffered means that characters are saved up only till a newline is written into or read from the buffer.
stdin and stdout are block buffered if and only if they can be determined not to refer to an interactive device else they are line buffered (this is true of any stream). stderr is always unbuffered by default.
The standard library provides functions to alter the default behaviour of streams. You can use fflush to force the data out of the output stream buffer (fflush is undefined for input streams). You can make the stream unbuffered using the setbuf function.
Buffering is collecting up many elements before writing them, or reading many elements at once before processing them. Lots of information out there on the Internet, for example, this
and other SO questions like this
EDIT in response to the question update: And yes, it's done for performance reasons. Writing and reading from disks etc will in any case write or read a 'block' of some sort for most devices, and there's a fair overhead in doing so. So batching these operations up can make for a dramatic performance difference
A program writing to buffered output can perform the output in the time it takes to write to the buffer which is typically very fast, independent of the speed of the output device which may be slow.
With buffered output the information is queues and a separate process deals with the output rendering.
With unbuffered output, the data is written directly to the output device, so runs at the speed on the device. This is important for error output because if the output were buffered it would be possible for the process to fail before the buffered output has made it to the display - so the program might terminate with no diagnostic output.

How are files written? Why do I not see my data written immediately?

I understand the general process of writing and reading from a file, but I was curious as to what is happening under the hood during file writing. For instance, I have written a program that writes a series of numbers, line by line, to a .txt file. One thing that bothers me however is that I don't see the information written until after my c program is finished running. Is there a way to see the information written while the program is running rather than after? Is this even possible to do? This is a hard question to phrase in one line, so please forgive me if it's already been answered elsewhere.
The reason I ask this is because I'm writing to a file and was hoping that I could scan the file for the highest and lowest values (the program would optimally be able to run for hours).
Research buffering and caching.
There are a number of layers of optimisation performed by:
your application,
your OS, and
your disk driver,
in order to extend the life of your disk and increase performance.
With the careful use of flushing commands, you can generally make things happen "quite quickly" when you really need them to, though you should generally do so sparingly.
Flushing can be particularly useful when debugging.
The GNU C Library documentation has a good page on the subject of file flushing, listing functions such as fflush which may do what you want.
You observe an effect solely caused by the C standard I/O (stdio) buffers. I claim that any OS or disk driver buffering has nothing to do with it.
In stdio, I/O happens in one of three modes:
Fully buffered, data is written once BUFSIZ (from <stdio.h>) characters were accumulated. This is the default when I/0 is redirected to a file or pipe. This is what you observe. Typically BUFSIZ is anywhere from 1k to several kBytes.
Line buffered, data is written once a newline is seen (or BUFSIZ is reached). This is the default when i/o is to a terminal.
Unbuffered, data is written immediately.
You can use the setvbuf() (<stdio.h>) function to change the default, using the _IOFBF, _IOLBF or _IONBF macros, respectively. See your friendly setvbuf man page.
In your case, you can set your output stream (stdout or the FILE * returned by fopen) to line buffered.
Alternatively, you can call fflush() on the output stream whenever you want I/O to happen, regardless of buffering.
Indeed, there are several layers between the writing commands resp. functions and the actual file.
First, you open the file for writing. This causes the file to be either created or emptied. If you write then, the write doesn't actually occur immediately, but the data are cached until the buffer is full or the file is flushed or closed.
You can call fflush() for writing each portion of data, or you can actually wait until the file is closed.
Yes, it is possible to see whats written in the file(s). If you programm under Linux you can open a new Terminal and watch the progress with for example "less Filename".

Write system call writes data to disk directly?

I've read couple of questions(here) related to this but I still have some confusion.
My understanding is that write system call puts the data into Buffered Cache(OS caches as referred in that question). When the Buffered Cache gets full it is written to the disk.
Buffered IO is further optimization on top of this. It caches in the C RTL buffers and when they get full a write system call issued to move the contents to Buffered Cache. If I use fflush then data related to this particular file that is present in the C RTL buffers as well as Buffered Cache is sent to the disk.
Is my understanding correct?
How the stdio buffers are flushed is depending on the standard C library you use. To quote from the Linux manual page:
Note that fflush() only flushes the user space buffers provided by the C library.
To ensure that the data is physically stored on disk the kernel buffers must be
flushed too, for example, with sync(2) or fsync(2).
This means that on a Linux system, using fflush or overflowing the buffer will call the write function. But the operating system may keep internal buffers, and not actually write the data to the device. To make sure the data is truly written to the device, use both fflush and the low-level fsync.
Edit: Answer rephrased.

Resources