C: How Efficient Are Output Routines in Terms of Buffering?

I can't find any information on whether buffering is already done implicitly, out of the box, when one writes a file with either fprintf or fwrite. I understand that this might be an implementation- or platform-dependent feature. What I'm interested in is whether I can at least expect it to be implemented efficiently on modern popular platforms such as Windows, Linux, or Mac OS X.
AFAIK, buffering for I/O routines is usually done at two levels:
Library level: this could be the C standard library, the Java SDK (BufferedOutputStream), etc.;
OS level: modern platforms extensively cache/buffer I/O operations.
My question is about #1, not #2 (which I already know to be true). In other words, can I expect C standard library implementations on all modern platforms to take advantage of buffering?
If not, then is manually creating a buffer (with cleverly chosen size) and flushing it on overflow a good solution to the problem?
Conclusion
Thanks to everyone who pointed out functions like setbuf and setvbuf. This is exactly the evidence I was looking for to answer my question. A useful extract:
All files are opened with a default allocated buffer (fully buffered)
if they are known to not refer to an interactive device. This function
can be used to either set a specific memory block to be used as buffer
or to disable buffering for the stream.
The default streams stdin and stdout are fully buffered by default if
they are known to not refer to an interactive device. Otherwise, they
may either be line buffered or unbuffered by default, depending on the
system and library implementation. The same is true for stderr, which
is always either line buffered or unbuffered by default.

In most cases buffering for stdio routines is tuned to be consistent with typical block size of the operating system in question. This is done to optimize the number of I/O operations in the default case. Of course you can always change it with setbuf()/setvbuf() routines.
Unless you are doing something special, you should stick to the default buffering as you can be quite sure it's mostly optimal on your OS (for the typical scenario).
The only case that justifies changing it is when you want to use the stdio library to interact with I/O channels that are not geared towards it, in which case you might want to disable buffering altogether. But I don't see such cases very often.
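For illustration, here is a minimal sketch of tuning the buffer with setvbuf(); the file name and the 1 MiB size are arbitrary choices for the example, not recommendations:

#include <stdio.h>

static char big_buf[1 << 20]; /* 1 MiB buffer; pick a size based on measurement */

int main(void)
{
    FILE *f = fopen("out.dat", "wb");
    if (!f)
        return 1;

    /* setvbuf must be called before any other operation on the stream */
    if (setvbuf(f, big_buf, _IOFBF, sizeof big_buf) != 0)
        return 1;

    for (int i = 0; i < 100000; i++)
        fprintf(f, "record %d\n", i); /* writes accumulate in big_buf, flushed in large chunks */

    fclose(f); /* flushes any remaining buffered data */
    return 0;
}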

You can safely assume that standard I/O is sensibly buffered on any modern system.

As @David said, you can expect sensible buffering (at both levels).
However, there can be a huge difference between fprintf and fwrite, because fprintf interprets a format string.
If you stack-sample it, you can find a significant percentage of the time being spent converting doubles into character strings, and the like.
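As a rough sketch of that difference (the file names and element count are made up for the example), writing the same doubles in binary avoids the per-value conversion entirely:

#include <stdio.h>

int main(void)
{
    double v[1000];
    for (int i = 0; i < 1000; i++)
        v[i] = i * 0.5;

    FILE *bin = fopen("data.bin", "wb");
    if (bin) {
        fwrite(v, sizeof v[0], 1000, bin); /* raw bytes copied into the stdio buffer, no conversion */
        fclose(bin);
    }

    FILE *txt = fopen("data.txt", "w");
    if (txt) {
        for (int i = 0; i < 1000; i++)
            fprintf(txt, "%.17g\n", v[i]); /* each value converted to text: this is where the time goes */
        fclose(txt);
    }
    return 0;
}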

The C I/O library lets you control how buffering is done (inside the application, before whatever the OS does) with setvbuf. If you don't specify anything, the standard requires that "when opened, a stream is fully buffered if and only if it can be determined not to refer to an interactive device." The same requirement holds for stdin and stdout, while stderr is never fully buffered by default, even when it can be determined to be directed at a non-interactive device.

Related

Why can no other operations be performed on a file before setvbuf?

The documentation for glibc setvbuf (http://man7.org/linux/man-pages/man3/setvbuf.3p.html) states:
The setvbuf() function may be used after the stream pointed to by
stream is associated with an open file but before any other operation
(other than an unsuccessful call to setvbuf()) is performed on the
stream.
What is the point of having this restriction? (but before any other operation...)
Why is it not possible to first write to the file and then call setvbuf(), for example?
I suspect that this restriction was taken literally from UNIX. Referring to the Rationale for ANSI C:
4.9.5.6 The setvbuf function
setvbuf has been adopted from UNIX System V, both to control the
nature of stream buffering and to specify the size of I/O buffers.
A particular implementation may provide useful behaviour for this otherwise-undefined case in the form of a non-portable extension. It is easier for output streams, which can simply be flushed, but it gets less trivial for input streams.
Without a crystal ball, my guess is that it was simply easier to require reopening the file before setting the buffer, rather than thinking through every edge case involved in rebuffering.
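A minimal sketch of the ordering the standard requires (the file name and buffer size are arbitrary):

#include <stdio.h>

int main(void)
{
    static char buf[8192];
    FILE *f = fopen("log.txt", "w");
    if (!f)
        return 1;

    /* allowed: nothing else has touched the stream yet */
    setvbuf(f, buf, _IOFBF, sizeof buf);

    fprintf(f, "hello\n");

    /* calling setvbuf(f, ...) here, after the fprintf above,
       would be undefined behaviour */

    fclose(f);
    return 0;
}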

Are fscanf and fprintf buffered in C?

I wish to write an efficient C program that makes a copy of a file. There doesn't seem to be a library function (such as rename) that performs this. I plan to use fscanf and fprintf from stdio.h, but their descriptions do not say how, or whether, they are buffered. Are they buffered between different cache levels? (I assume the disk-to-memory buffer is handled by the OS.)
When you open a file with fopen, it will be fully buffered.
You can change the buffering before doing any other operations on the file, using setvbuf (reference).
Using any normal I/O functions on a FILE object will take advantage of the buffering.
If you are just copying the data, you will be doing sequential reads and writes, and will not necessarily need buffering. But doing that efficiently does require choosing an appropriate block size for I/O operations. Traditionally, this is related to the size of a disk sector (4096 bytes), but that value isn't future-proof. The default used by fopen is BUFSIZ.
As with any optimisation, construct actual tests to verify your performance gains (or losses).
In the end, for the fastest I/O you might have to use OS-specific APIs. The C I/O functions just map to the general case of those APIs, but there may be special performance settings for an OS that you cannot control through the C library. I certainly ran into this when writing a fast AVI writer for Windows. Using platform-specific I/O I was able to achieve the maximum read/write speed of the disk: twice the speed of buffered I/O (<stdio.h>) or the native AVI API, and about 20% faster than traditional C unbuffered I/O.
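For concreteness, a sketch of such a sequential copy; the 64 KiB block size is an arbitrary example value, and real code should benchmark it and check fwrite's return value as well:

#include <stdio.h>

int copy_file(const char *src, const char *dst)
{
    char block[64 * 1024]; /* example block size; measure to find what suits your system */
    size_t n;

    FILE *in = fopen(src, "rb");
    if (!in)
        return -1;
    FILE *out = fopen(dst, "wb");
    if (!out) {
        fclose(in);
        return -1;
    }

    while ((n = fread(block, 1, sizeof block, in)) > 0)
        fwrite(block, 1, n, out); /* error checking omitted for brevity */

    fclose(in);
    return fclose(out); /* 0 on success */
}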
The printf and scanf families of functions are all part of the same buffered "interface"; from man 3 stdio:
The standard I/O library provides a simple and efficient buffered
stream I/O interface. Input and output is mapped into logical data
streams and the physical I/O characteristics are concealed. The
functions and macros are listed below; more information is
available from the individual man pages.
If you want to avoid this buffering, you will have to disable it with setvbuf or bypass stdio and use the OS's read/write calls directly.

fsync vs write system call

I would like to ask a fundamental question about when it is useful to use a system call like fsync. I am a beginner and I was always under the impression that write is enough to write to a file, and the sample programs that use write do appear to write to the file in the end.
So what is the purpose of a system call like fsync?
Just to provide some background: I am using the Berkeley DB library version 5.1.19 and there is a lot of talk around the cost of fsync() versus just writing. That is the reason I am wondering.
Think of it as a layer of buffering.
If you're familiar with the standard C calls like fopen and fprintf, you should already be aware of buffering happening within the C runtime library itself.
The way to flush those buffers is with fflush which ensures that the information is handed from the C runtime library to the OS (or surrounding environment).
However, just because the OS has it, doesn't mean it's on the disk. It could get buffered within the OS as well.
That's what fsync takes care of, ensuring that the stuff in the OS buffers is written physically to the disk.
You may typically see this sort of operation in logging libraries:
fprintf (myFileHandle, "something\n"); // output it
fflush (myFileHandle); // flush to OS
fsync (fileno (myFileHandle)); // flush to disk
fileno is a function which gives you the underlying int file descriptor for a given FILE* file handle, and fsync on the descriptor does the final level of flushing.
Now that is a relatively expensive operation since the disk write is usually considerably slower than in-memory transfers.
As well as logging libraries, there is one other use case where this behaviour is useful. Let me see if I can remember what it was. Yes, that's it. Databases! Just like Berzerkely DB. When you want to ensure the data is on the disk, it is a rather useful feature for meeting ACID requirements :-)
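The same pattern at the system-call level, as a sketch (the path and message are made up, and error handling is abbreviated):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "committed record\n";

    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    if (write(fd, msg, strlen(msg)) < 0) /* data handed to the kernel */
        return 1;
    if (fsync(fd) < 0)                   /* forced out of kernel buffers onto the disk */
        return 1;

    close(fd);
    return 0;
}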

C stdio unbuffered multiplexing

I'm working on a little program that needs to pipe binary streams very closely (unbuffered). It has to rely on select() multiplexing and is never allowed to "hold existing input unless more input has arrived, because it's not worth it yet".
It's possible using system calls, but then again, I would like to use stdio for convenience (string formatting is involved, too).
Can I safely use select() on a stream's underlying file descriptor as long as I'm using unbuffered stdio? If not, how can I determine a FILE stream that will not block from a set?
Is there any call that transfers all input from libc to the application, besides the char-by-char functions (getchar() and friends)?
While I'm not entirely clear on whether it's sanctioned by the standards, using select on fileno(f) should in practice work when f is unbuffered. Keep in mind, however, that unbuffered stdio can perform pathologically badly, and that you are not allowed to change the buffering except as the very first operation, before you use the stream at all.
If your only concern is being able to do formatted output, the newly-standardized-in-POSIX-2008 dprintf (and vdprintf) function might be a better solution to your problem.
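As a rough sketch combining the two suggestions, assuming a POSIX-2008 system (descriptors, the missing timeout, and error handling are simplified for the example):

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
    /* must be done before stdin is used at all */
    setvbuf(stdin, NULL, _IONBF, 0);

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fileno(stdin), &rfds);

    if (select(fileno(stdin) + 1, &rfds, NULL, NULL, NULL) > 0 &&
        FD_ISSET(fileno(stdin), &rfds)) {
        int c = getchar(); /* won't block: select() says data is ready and stdin is unbuffered */
        if (c != EOF)
            dprintf(STDOUT_FILENO, "got: %c\n", c); /* formatted output straight to a file descriptor */
    }
    return 0;
}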

How does scanf() work inside the OS?

I've been wondering how scanf()/printf() actually works in the hardware and OS levels. Where does the data flow and what exactly is the OS doing around these times? What calls does the OS make? And so on...
scanf() and printf() are functions in libc (the C standard library), and they call the read() and write() operating system syscalls respectively, talking to the file descriptors stdin and stdout respectively (fscanf and fprintf allow you to specify the file stream you want to read/write from).
Calls to read() and write() (and all syscalls) result in a 'context switch' out of your user-level application into kernel mode, which means it can perform privileged operations, such as talking directly to hardware. Depending on how you started the application, the 'stdin' and 'stdout' file descriptors are probably bound to a console device (such as tty0), or some sort of virtual console device (like that exposed by an xterm). read() and write() safely copy the data to/from a kernel buffer called a 'uio'.
The format-string conversion part of scanf and printf does not occur in kernel mode, but in ordinary user mode (inside libc). The general rule of thumb with syscalls is to switch into kernel mode as infrequently as possible, both to avoid the performance overhead of context switching and for security (you need to be very careful about anything that happens in kernel mode; less code in kernel mode means fewer bugs and security holes in the operating system).
By the way, all of this was written from a Unix perspective; I don't know how MS Windows works.
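A tiny sketch of the user-space/kernel split described above, contrasting a buffered stdio call with a direct write() to the same descriptor:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    printf("via stdio: formatted in user space\n"); /* stored in stdout's buffer first */
    fflush(stdout);                                 /* forces the underlying write() syscall now */

    const char raw[] = "via write(): straight to fd 1\n";
    write(STDOUT_FILENO, raw, strlen(raw));         /* one syscall, no stdio buffering */
    return 0;
}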
On the OS I am working with, scanf and printf are based on the functions getch() and putch().
I think the OS just provides two streams, one for input and the other for output; the streams abstract away how the output data gets presented and where the input data comes from.
So what scanf and printf do is just consume bytes from, or add bytes to, the respective stream.
scanf, printf, and similar functions cannot be implemented purely in portable C, because at the lowest level they have to enter the kernel; that transition is typically done through a small system-call stub, which may use inline assembly (the asm keyword). The bulk of the work, however, such as parsing the format string and converting values, is ordinary C code inside the library. You can write your own input function on top of the same low-level calls.

Resources