C stdio unbuffered multiplexing

I'm working on a little program that needs to pipe binary streams very closely (unbuffered). It has to rely on select() multiplexing and is never allowed to "hold existing input unless more input has arrived, because it's not worth it yet".
It's possible using System calls, but then again, I would like to use stdio for convenience (string formatting is involved, too).
Can I safely use select() on a stream's underlying file descriptor as long as I'm using unbuffered stdio? If not, how can I determine a FILE stream that will not block from a set?
Is there any call that transfers all input from libc to the application, besides the char-by-char functions (getchar() and friends)?

While I'm not entirely clear on whether it's sanctioned by the standards, using select on fileno(f) should in practice work when f is unbuffered. Keep in mind, however, that unbuffered stdio can perform pathologically badly, and that you are not allowed to change the buffering except as the very first operation, before you use the stream at all.
If your only concern is being able to do formatted output, the newly-standardized-in-POSIX-2008 dprintf (and vdprintf) function might be a better solution to your problem.
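A minimal sketch of the dprintf route, assuming a POSIX.1-2008 system: select() and read() work on the raw descriptor, and dprintf() handles the formatting, so no FILE buffer ever sits between the kernel and the application. The helper name forward_chunk and the "chunk N" header format are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: block until in_fd is readable, read whatever has
 * arrived, and forward it to out_fd behind a formatted header line.
 * Returns the number of payload bytes, 0 on EOF, -1 on error. */
long forward_chunk(int in_fd, int out_fd)
{
    char buf[4096];
    fd_set rfds;

    assert(in_fd >= 0 && in_fd < FD_SETSIZE); /* select() cannot watch larger fds */
    FD_ZERO(&rfds);
    FD_SET(in_fd, &rfds);
    if (select(in_fd + 1, &rfds, NULL, NULL, NULL) < 0)
        return -1;

    /* On a pipe or socket, read() returns what is available; it does not
     * wait for the whole buffer to fill. */
    ssize_t n = read(in_fd, buf, sizeof buf);
    if (n <= 0)
        return (long)n;

    /* Formatted output straight to the descriptor; nothing is left behind
     * in a FILE buffer. */
    if (dprintf(out_fd, "chunk %zd\n", n) < 0)
        return -1;

    /* The payload may be binary, so it goes out via write(), not "%s". */
    for (ssize_t off = 0; off < n; ) {
        ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
        if (w < 0)
            return -1;
        off += w;
    }
    return (long)n;
}
```

A multiplexing loop would put several descriptors into the fd_set and call the read/forward part for each one that select() reports as ready.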

Related

Why no other operations on file can be performed before setvbuf?

The documentation for glibc setvbuf (http://man7.org/linux/man-pages/man3/setvbuf.3p.html) states:
The setvbuf() function may be used after the stream pointed to by
stream is associated with an open file but before any other operation
(other than an unsuccessful call to setvbuf()) is performed on the
stream.
What is the point of having this restriction? (but before any other operation...)
Why is it not possible to first write to the file and then call setvbuf(), for example?

I suspect that this restriction was taken literally from Unix OS. Referring to Rationale for ANSI C:
4.9.5.6 The setvbuf function
setvbuf has been adopted from UNIX System V, both to control the
nature of stream buffering and to specify the size of I/O buffers.
A particular implementation may define useful behavior for this otherwise undefined case as a non-portable extension. That is easier for output streams, which can simply be flushed, but it gets less trivial for input streams.
Without a crystal ball, my guess is that it is easier to reopen the file and set the buffer than to think through every edge case involved in rebuffering.
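For reference, the ordering the documentation asks for looks roughly like this. It is only a sketch; the helper name open_with_mode is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: open `path` for writing with the given buffering
 * mode (_IOFBF, _IOLBF or _IONBF). setvbuf() is the first operation on
 * the new stream, which is the only ordering guaranteed to work. */
FILE *open_with_mode(const char *path, int mode, size_t size)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return NULL;

    /* NULL buffer: let the library allocate one; `size` is then a hint. */
    if (setvbuf(f, NULL, mode, size) != 0) {
        fclose(f);
        return NULL;
    }
    return f;
}
```

If the buffering has to change later, closing the stream and opening it again through the same helper stays within what the standard defines.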

C: How Efficient Are Output Routines in Terms of Buffering?

I can't find any information on whether buffering is already implicitly done out of the box when one is writing a file with either fprintf or fwrite. I understand that this might be implementation/platform dependent feature. What I'm interested in, is whether I can at least expect it to be implemented efficiently on modern popular platforms such as Windows, Linux, or Mac OS X?
AFAIK, usually buffering for I/O routines is done on 2 levels:
Library level: this could be C standard library, or Java SDK (BufferedOutputStream), etc.;
OS level: modern platforms extensively cache/buffer I/O operations.
My question is about #1, not #2 (as I know it's already true). In other words, can I expect C standard library implementations for all modern platforms to take advantage of buffering?
If not, then is manually creating a buffer (with cleverly chosen size) and flushing it on overflow a good solution to the problem?
Conclusion
Thanks to everyone who pointed out functions like setbuf and setvbuf. These are the exact evidence that I was looking for to answer my question. Useful extract:
All files are opened with a default allocated buffer (fully buffered)
if they are known to not refer to an interactive device. This function
can be used to either set a specific memory block to be used as buffer
or to disable buffering for the stream.
The default streams stdin and stdout are fully buffered by default if
they are known to not refer to an interactive device. Otherwise, they
may either be line buffered or unbuffered by default, depending on the
system and library implementation. The same is true for stderr, which
is always either line buffered or unbuffered by default.

In most cases buffering for stdio routines is tuned to be consistent with typical block size of the operating system in question. This is done to optimize the number of I/O operations in the default case. Of course you can always change it with setbuf()/setvbuf() routines.
Unless you are doing something special, you should stick to the default buffering as you can be quite sure it's mostly optimal on your OS (for the typical scenario).
The only case that justifies it is when you want to use stdio library to interact with I/O channels that are not geared towards it, in which case you might want to disable buffering altogether. But I don't get to see cases for this too often.

You can safely assume that standard I/O is sensibly buffered on any modern system.

As @David said, you can expect sensible buffering (at both levels).
However, there can be a huge difference between fprintf and fwrite, because fprintf interprets a format string.
If you stack-sample it, you can find a significant percent of time converting doubles into character strings, and stuff like that.

The C I/O library lets you control how buffering is done (inside the application, before anything the OS does) with setvbuf. If you don't specify anything, the standard requires that "when opened, a stream is fully buffered if and only if it can be determined not to refer to an interactive device." That requirement holds for stdin and stdout, while stderr is never fully buffered, even when it can be determined to be directed to a non-interactive device.

what is the point of using the setvbuf() function in c?

Why would you want to set aside a block of memory in setvbuf()?
I have no clue why you would want to send your read/write stream to a buffer.

setvbuf is not intended to redirect the output to a buffer (if you want to perform IO on a buffer you use sprintf & co.), but to tightly control the buffering behavior of the given stream.
In fact, C I/O functions don't immediately pass the data to be written to the operating system; they keep an intermediate buffer to avoid continuously performing (potentially expensive) system calls, and wait for the buffer to fill before actually performing the write.
The most basic case is to disable buffering altogether (useful e.g. if writing to a log file, where you want the data to go to disk immediately after each output operation) or, on the other hand, to enable block buffering on streams where it is disabled by default (or is set to line-buffering). This may be useful to enhance output performance.
Setting a specific buffer for output can be useful if you are working with a device that is known to work well with a specific buffer size; on the other side, you may want to have a small buffer to cut down on memory usage in memory-constrained environments, or to avoid losing much data in case of power loss without disabling buffering completely.
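The effect of a caller-supplied buffer can be observed with a pipe. The following is a sketch assuming a POSIX system; flush_demo and readable_now are hypothetical names. Two bytes are written through a fully buffered stream, and the read end of the pipe is checked before and after fflush().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/* Returns 1 if fd has data to read right now, 0 if not, -1 on error. */
static int readable_now(int fd)
{
    fd_set rfds;
    struct timeval tv = {0, 0};   /* zero timeout: poll, do not wait */

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return select(fd + 1, &rfds, NULL, NULL, &tv);
}

/* Hypothetical demo: stores the pipe's readability before and after
 * fflush() in *before and *after. Returns 0 on success, -1 on error. */
int flush_demo(int *before, int *after)
{
    static char buf[1024];        /* must outlive the stream that uses it */
    int p[2];

    if (pipe(p) != 0)
        return -1;
    FILE *out = fdopen(p[1], "w");
    if (out == NULL) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    /* Own buffer, full buffering, set before any other stream operation. */
    if (setvbuf(out, buf, _IOFBF, sizeof buf) != 0) {
        fclose(out);
        close(p[0]);
        return -1;
    }

    fputs("hi", out);                 /* lands in buf, not in the pipe */
    *before = readable_now(p[0]);
    fflush(out);                      /* now handed to the kernel */
    *after = readable_now(p[0]);

    fclose(out);                      /* also closes p[1] */
    close(p[0]);
    return 0;
}
```

On typical implementations the first check reports nothing to read and the second reports data, which is the whole point of the intermediate buffer.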

In C, files opened with e.g. fopen are buffered by default. You can use setvbuf to supply your own buffer, or to make the file operations completely unbuffered (as stderr is).
It can be used to create fmemopen functionality on systems that don't have that function.

The size of a file's buffer can affect standard library I/O rates. There is a table in Chapter 5 of Stevens' 'Advanced Programming in the UNIX Environment' that shows I/O throughput increasing dramatically with I/O buffer size, up to ~16K, then leveling off. A lot of other factors can influence overall I/O throughput, so this one "tuning" effect may or may not be a cure-all. This is the main reason "why", other than turning buffering off or on.

Each FILE structure has a buffer associated with it internally. The reason behind this is to reduce I/O, and real I/O operations are time costly.
All your read/write will be buffered until the buffer is full. All the data buffered will be output/input in one real I/O operation.

Why would you want to set aside a block of memory in setvbuf()?
For buffering.
I have no clue why you would want to send your read/write stream to a buffer.
Neither do I, but as that's not what it does the point is moot.
"The setvbuf() function may be used on any open stream to change its buffer" [my emphasis]. In other words it already has a buffer, and all the function does is change that. It doesn't say anything about "sending your read/write streams to a buffer". I suggest you read the man page to see what it actually says. Especially this part:
When an output stream is unbuffered, information appears on the destination file or terminal as soon as written; when it is block buffered many characters are saved up and written as a block; when it is line buffered characters are saved up until a newline is output or input is read from any stream attached to a terminal device (typically stdin).

Why can't use C standard I/O with sockets

It's often said that one shouldn't use C standard I/O functions (like fprintf(), fscanf()) when working with sockets.
I can't understand why. I think if the reason was just in their buffered nature, one could just flush the output buffer each time he outputs, right?
Why does everyone use UNIX I/O functions instead? Are there any situations where the use of standard C functions is appropriate and correct?

You can certainly use stdio with sockets. You can even write a program that uses nothing but stdin and stdout, run it from inetd (which provides a socket on STDIN_FILENO and STDOUT_FILENO), and it works even though it doesn't contain any socket code at all.
What you can't do is mix buffered I/O with select or poll because there is no fselect or fpoll working on FILE *'s and you can't even implement one yourself because there's no standard way of querying a FILE * to find out whether its input buffer is empty.
As soon as you need to handle multiple connections, stdio is not good enough.
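The select problem can be reproduced with a pipe standing in for a socket. This is a sketch assuming a POSIX system and the usual full buffering for non-interactive streams; hidden_input_demo is a hypothetical name. Two lines are sent, one is read with fgets(), and select() is then asked whether more input is pending.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical demo: returns what select() reports for the descriptor
 * after one of two pending lines has been read through stdio, and copies
 * the second line into `second` (len must be at least 2). */
int hidden_input_demo(char *second, size_t len)
{
    int p[2];
    char first[16];
    fd_set rfds;
    struct timeval tv = {0, 0};       /* zero timeout: poll, do not wait */

    if (pipe(p) != 0)
        return -1;
    assert(p[0] < FD_SETSIZE);

    FILE *in = fdopen(p[0], "r");
    if (in == NULL || write(p[1], "a\nb\n", 4) != 4
        || fgets(first, sizeof first, in) == NULL) {
        if (in != NULL) fclose(in); else close(p[0]);
        close(p[1]);
        return -1;
    }

    /* stdio has typically pulled all four bytes into its own buffer, so
     * the kernel has nothing left to report. */
    FD_ZERO(&rfds);
    FD_SET(p[0], &rfds);
    int ready = select(p[0] + 1, &rfds, NULL, NULL, &tv);

    /* "b\n" is either in the FILE buffer or still in the pipe, so this
     * call cannot block. */
    if (fgets(second, (int)len, in) == NULL)
        second[0] = '\0';

    fclose(in);                       /* also closes p[0] */
    close(p[1]);
    return ready;
}
```

With common C libraries the function returns 0 even though a complete line is still waiting inside the FILE, so an event loop built on select() would stall on it.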

It's totally fine when you have simple scenario with one socket in blocking mode and your application protocol is text-based.
It quickly becomes a huge pain with more than one socket or with non-blocking sockets, with any sort of binary encoding, and with any real performance requirements.

I do not know of any direct objection. Most likely this will work fine.
At the same time, I can imagine a platform where fprintf() and fscanf() have their own buffers sitting above the file descriptor layer. You may not be able to flush these buffers.
It is hard to speak about all possible platforms. This means that it is better to avoid this with sockets.
At the end of the day the app program should solve the app problem. It should not be a compiler/library test.

It's because sockets (TCP sockets, for example) are readable and writable as if they were files or pipes, but this is just an abstraction. The inner workings of a network connection are much more complicated than a local file or pipe.
To start with, reading a file is always "fast": either you get the data or you hit end-of-file. On the other hand, if you expect 500 bytes from a TCP connection and it sends 499 (and the connection is not closed), you may be waiting forever. Writing is the same thing: it will block once the TCP output buffer is full.
Even the most basic program needs to handle timeouts and disconnection, and all these things interact with FILE's own buffered I/O, so not even textbook examples could be expected to work well.

flush without sync

From what I've read, flush pushes data into the OS buffers and sync makes sure that data goes down to the storage media. So, if you want to be sure that data is actually written to disk, you need to do a flush followed by a sync. So, are there any cases where you want to call flush but not sync?

You only want to fflush if you're using stdio's FILE *. This writes a user space buffer to the kernel.
The other answers seem to be missing fdatasync. This is the system call you want to flush a specific file descriptor to disk.
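Put together, the two steps look roughly like this. It is a sketch assuming a POSIX system; write_durably is a hypothetical name, and fsync() is used here where fdatasync() would also do on systems that provide it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write `text` to `path` and do not return success
 * until the data has been pushed to the storage device.
 * Returns 0 on success, -1 on any failure. */
int write_durably(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(text, f) >= 0
          && fflush(f) == 0            /* stdio buffer -> kernel */
          && fsync(fileno(f)) == 0;    /* kernel cache -> storage device */

    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Dropping the fsync() line gives the "flush but no sync" case: the data survives a crash of the program, but not necessarily a crash of the machine.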

When you fflush, you flush the stdio buffer of one stream (unless you pass NULL, in which case it flushes all open output streams). This hands the data to the operating system; it does not by itself force it to disk. http://www.manpagez.com/man/3/fflush/
When you sync, you flush all the buffers to disk. http://www.manpagez.com/man/2/sync/
The most important thing that you should notice is that fflush is a standard function, while sync is a system call provided by the operating system (Linux for example).
So basically, if you are writing portable program, you in fact never use sync.

Yes, lots. Most programs most of the time would not bother to call any of the various sync operations; flushing the data into the kernel buffer pool as you close the file is sufficient. This is doubly true if you're using a journalled file system.
Note that flushing is a higher-level operation than write() or similar system calls. It is used by the C <stdio.h> library, or the C++ <iostream> library. The system calls inherently hand the data to the kernel buffer pool (or write it directly to disk if you're using direct I/O or something similar).
Note, too, that on POSIX-like systems, you can arrange for data sync etc by setting flags on the open() system call (O_SYNC, O_DSYNC, O_RSYNC), or subsequently via fcntl().

Just to clarify, fflush() applies only when using the FILE interface of UNIX that buffers writes at the application level. In case the normal write() call is used, fflush() makes little sense.
Having said that, I can think of two situations where you would like to call fflush() but not sync:
You want to make sure that the data will eventually make it to disk even if the application crashes.
You want to force to the screen the data that the application has written to standard output so far.
The second case is the most common use I have seen and it is usually required if the printf() call does not end with a new line character ('\n').