Making fgets issue longer read() calls on Linux

I'm reading quite large lines (up to 128K) of text using fgets. I'm seeing excessive context switching on the server; using strace I see the following:
read(3, "9005 10218 00840023102015 201008"..., 4096) = 4096
i.e. fgets reads chunks of 4096 bytes at a time. Is there any way to control how big the chunks are that fgets uses when calling read()?

setvbuf would be the obvious place to start.

The function fgets() is part of the stdio package, and as such it must buffer (or not) the input stream in a way that is consistent with also using fgetc(), fscanf(), fread() and so forth. That means that the buffer itself (if the stream is buffered) is the property of the FILE object.
Whether there is a buffer or not, and if buffered, how large the buffer is, can be suggested to the library by calling setvbuf().
The library implementation has a fair amount of latitude to ignore hints and do what it thinks best, but buffers that are "reasonable" powers of two in size will usually be accepted. You've noticed that the default was 4096, which is clearly smaller than optimal.
The stream is buffered by default if it is opened on an actual file. Its buffering on a pipe, FIFO, TTY or anything else potentially has different defaults.
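As a minimal sketch of that (the file name is a placeholder, and 128K is just the line length from the question):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("data.txt", "r");   /* "data.txt" is a placeholder */
        if (!fp) { perror("fopen"); return EXIT_FAILURE; }

        /* Suggest a 128K fully buffered stream. setvbuf must be called
           before the first I/O operation; it returns nonzero on failure. */
        if (setvbuf(fp, NULL, _IOFBF, 128 * 1024) != 0)
            fputs("setvbuf failed, using default buffering\n", stderr);

        char line[128 * 1024];
        while (fgets(line, sizeof line, fp) != NULL) {
            /* process one line */
        }

        fclose(fp);
        return EXIT_SUCCESS;
    }

With a buffer that size, strace should show read() calls of roughly the buffer size instead of 4096-byte chunks.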

Related

What's the difference between read and fread?

I have a question about the SO question What is the difference between read() and fread()?
How do read() and fread() work? And what do unbuffered read and buffered read mean?
If fread() is implemented by calling read(), why do some people say fread() is faster than read() in the SO question C — fopen() vs open()?
fread cannot be faster than read, provided(!) you use the same buffer size for read that fread uses internally.
However, every disk access comes with quite some overhead, so you improve performance by minimising their number.
If you read small chunks of data, every read goes to the kernel directly, so you get slow. fread, in contrast, profits from its buffer as long as there is still data in it; once the buffer is consumed, the next large chunk is read in at once, and subsequent calls are again served in small chunks from the buffer.
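A rough sketch of that first point: a plain read loop using a chunk size matching stdio's typical 4K buffer makes the same number of system calls that fread would (the file name here is hypothetical):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);   /* hypothetical file */
        if (fd == -1) return 1;

        char chunk[4096];   /* same size stdio commonly uses internally */
        ssize_t n;
        while ((n = read(fd, chunk, sizeof chunk)) > 0) {
            /* process n bytes; syscall count now matches fread's */
        }

        close(fd);
        return 0;
    }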
What's the difference between read and fread?
fread is a standardized C programming language function that works with a FILE pointer.
read is a C function available on a POSIX compatible system that works with a file descriptor.
how do read() and fread() work?
fread calls some system-defined API to read from a file. To know how a specific fread works you have to look at the specific implementation of fread for your system - Windows fread is very different from Linux fread. Here's the implementation of fread as part of the glibc library: https://github.com/lattera/glibc/blob/master/libio/iofread.c#L30
read() invokes the read system call: https://man7.org/linux/man-pages/man2/syscalls.2.html
what do unbuffered read and buffered read mean?
fread reads data into some intermediate buffer, then copies from that buffer to the memory you passed as an argument.
read makes the kernel copy straight into the buffer you passed as an argument. There is no intermediate buffer.
why do some say fread() is faster than read() in this topic?
The assumption is that system calls are costly - both the system call itself and whatever the kernel does inside it.
Reading the data once into an intermediate buffer in a big chunk (say, 4 kilobytes) and then handing it out in small chunks within your program is faster than context switching to the kernel every time you want a small chunk, which would make the kernel repeatedly perform small input/output operations to fetch the data.
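To make the contrast concrete, a hedged sketch (file name hypothetical): the first loop costs one system call per byte, the second is served almost entirely from stdio's buffer.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Unbuffered: every read() below is a separate system call. */
        int fd = open("data.bin", O_RDONLY);
        if (fd == -1) return 1;
        char c;
        while (read(fd, &c, 1) == 1) { /* one context switch per byte */ }
        close(fd);

        /* Buffered: stdio refills its buffer with one large read(),
           then fgetc() returns bytes from user-space memory. */
        FILE *fp = fopen("data.bin", "rb");
        if (!fp) return 1;
        int ch;
        while ((ch = fgetc(fp)) != EOF) { /* usually no syscall here */ }
        fclose(fp);
        return 0;
    }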

are fread and fwrite different in handling the internal buffer?

I keep on reading that fread() and fwrite() are buffered library calls. In the case of fwrite(), I understood that once we write to the file, it won't be written to the hard disk immediately; it will fill the internal buffer, and once the buffer is full, the write() system call will be made to actually write the data to the file.
But I am not able to understand how this buffering works in the case of fread(). Does buffered in the case of fread() mean that once we call fread(), it will read more data than we originally asked for, and the extra data will be stored in the buffer (so that when the 2nd fread() occurs, it can be served directly from the buffer instead of going to the hard disk)?
I also have the following queries:
If fread() works as I described above, will the first fread() call read an amount of data equal to the size of the internal buffer? If so, what happens if my fread() call asks for more bytes than the internal buffer size?
If fread() works as I described above, at least one read() system call to the kernel is guaranteed to happen for fread(). But in the case of fwrite(), if we only call fwrite() once during the program's execution, we can't say for sure that the write() system call will be made. Is my understanding correct?
Will the internal buffer be maintained by the OS?
Does fclose() flush the internal buffer?
There is buffering or caching at many different levels in a modern system. This might be typical:
C standard library
OS kernel
disk controller (esp. if using hardware RAID)
disk drive
When you use fread(), it may request 8 KB or so if you asked for less. This will be stored in user-space so there is no system call and context switch on the next sequential read.
The kernel may read ahead also; there are library functions to give it hints on how to do this for your particular application. The OS cache could be gigabytes in size since it uses main memory.
The disk controller may read ahead too, and could have a cache size up to hundreds of megabytes on smallish systems. It can't do as much in terms of read-ahead, because it doesn't know where the next logical block is for the current file (indeed it doesn't even know what file it is reading).
Finally, the disk drive itself has a cache, perhaps 16 MB or so. Like the controller, it doesn't know what file it is reading. For many years one disk block was 512 bytes, but it got a little larger (a few KB) recently with multi-terabyte disks.
When you call fclose(), it will probably deallocate the user-space buffer, but not the others.
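A sketch you can run under strace to see the user-space layer in action (the file name is a placeholder): four one-byte fread() calls should map to a single larger read() made by the first call.

    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("data.txt", "r");   /* placeholder file */
        if (!fp) return 1;

        char byte;
        for (int i = 0; i < 4; i++) {
            /* The first call triggers one read() that fills the stream's
               buffer; the rest are copied out of user-space memory. */
            if (fread(&byte, 1, 1, fp) != 1)
                break;
        }

        fclose(fp);   /* releases the stream's user-space buffer */
        return 0;
    }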
Your understanding is correct. And any buffered fwrite data will be flushed when the FILE* is closed. The buffered I/O is mostly transparent for I/O on regular files.
But for terminals and other character devices you may care. Another instance where buffered I/O may be an issue is when one process reads from a file that another process is writing to -- a common example is a program that writes text to a log file during operation, while the user runs a command like tail -f program.log to watch the content of the log file live. If the writing process has buffering enabled and doesn't explicitly flush the log file, this makes it difficult to monitor the log file.
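A hedged sketch of the usual fix for that scenario, flushing each log line explicitly (the file name matches the tail -f example above):

    #include <stdio.h>

    int main(void)
    {
        FILE *log = fopen("program.log", "a");
        if (!log) return 1;

        fprintf(log, "service started\n");
        fflush(log);   /* push the buffered line to the kernel now,
                          so `tail -f program.log` sees it immediately */

        fclose(log);
        return 0;
    }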

what is the point of using the setvbuf() function in c?

Why would you want to set aside a block of memory in setvbuf()?
I have no clue why you would want to send your read/write stream to a buffer.
setvbuf is not intended to redirect the output to a buffer (if you want to perform IO on a buffer you use sprintf & co.), but to tightly control the buffering behavior of the given stream.
In fact, C I/O functions don't immediately pass data to be written to the operating system; they keep an intermediate buffer to avoid continuously performing (potentially expensive) system calls, waiting for the buffer to fill before actually performing the write.
The most basic case is to disable buffering altogether (useful e.g. if writing to a log file, where you want the data to go to disk immediately after each output operation) or, on the other hand, to enable block buffering on streams where it is disabled by default (or is set to line-buffering). This may be useful to enhance output performance.
Setting a specific buffer for output can be useful if you are working with a device that is known to work well with a specific buffer size; on the other side, you may want to have a small buffer to cut down on memory usage in memory-constrained environments, or to avoid losing much data in case of power loss without disabling buffering completely.
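A short sketch of the three modes described above (the file name and sizes are arbitrary):

    #include <stdio.h>

    int main(void)
    {
        FILE *log = fopen("app.log", "a");   /* arbitrary file name */
        if (!log) return 1;

        /* Pick exactly one of these before the first I/O on the stream: */
        setvbuf(log, NULL, _IONBF, 0);           /* unbuffered: data goes out immediately */
        /* setvbuf(log, NULL, _IOLBF, 4096); */  /* line buffered */
        /* setvbuf(log, NULL, _IOFBF, 65536); */ /* fully buffered, 64K blocks */

        fprintf(log, "event\n");   /* with _IONBF, reaches the file at once */
        fclose(log);
        return 0;
    }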
In C, files opened with e.g. fopen are buffered by default. You can use setvbuf to supply your own buffer, or to make the file operations completely unbuffered (like stderr is).
It can be used to create fmemopen functionality on systems that don't have that function.
The size of a file's buffer can affect standard library I/O rates. There is a table in Chapter 5 of Stevens' 'Advanced Programming in the UNIX Environment' that shows I/O throughput increasing dramatically with I/O buffer size, up to ~16K, then leveling off. A lot of other factors can influence overall I/O throughput, so this one tuning effect may or may not be a cure-all. This is the main reason for "why", other than turning buffering on or off.
Each FILE structure has a buffer associated with it internally. The reason behind this is to reduce I/O, since real I/O operations are costly in time.
All your reads/writes will be buffered until the buffer is full, and then all the buffered data will be output/input in one real I/O operation.
Why would you want to set aside a block of memory in setvbuf()?
For buffering.
I have no clue why you would want to send your read/write stream to a buffer.
Neither do I, but as that's not what it does the point is moot.
"The setvbuf() function may be used on any open stream to change its buffer" [my emphasis]. In other words it alread has a buffer, and all the function does is change that. It doesn't say anything about 'sending your read/write streams to a buffer". I suggest you read the man page to see what it actually says. Especially this part:
When an output stream is unbuffered, information appears on the destination file or terminal as soon as written; when it is block buffered many characters are saved up and written as a block; when it is line buffered characters are saved up until a newline is output or input is read from any stream attached to a terminal device (typically stdin).
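You can watch the line-buffered case directly when stdout is a terminal, as in this small sketch:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        printf("no newline yet");   /* stays in stdout's buffer on a terminal */
        sleep(2);                   /* nothing appears during the pause */
        printf("\n");               /* the newline flushes the line-buffered stream */
        return 0;
    }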

In C, what's the size of stdout buffer?

Today I learned that stdout is line buffered when it's connected to a terminal and fully buffered otherwise. So, in the normal situation, if I use printf() without a terminating '\n', it will be printed on the screen only when the buffer is full. How do I get the size of this buffer? How big is it?
The actual size is defined by the individual implementation; the standard doesn't mandate a minimum size (based on what I've been able to find, anyway). I don't have a clue how you'd determine the size of the buffer.
Edit
Chapter and verse:
7.19.3 Files
...
3 When a stream is unbuffered, characters are intended to appear from the source or at the destination as soon as possible. Otherwise characters may be accumulated and transmitted to or from the host environment as a block. When a stream is fully buffered, characters are intended to be transmitted to or from the host environment as a block when a buffer is filled. When a stream is line buffered, characters are intended to be transmitted to or from the host environment as a block when a new-line character is encountered. Furthermore, characters are intended to be transmitted as a block to the host environment when a buffer is filled, when input is requested on an unbuffered stream, or when input is requested on a line buffered stream that requires the transmission of characters from the host environment. Support for these characteristics is implementation-defined, and may be affected via the setbuf and setvbuf functions.
Emphasis added.
"Implementation-defined" is not a euphemism for "I don't know", it's simply a statement that the language standard explicitly leaves it up to the implementation to define the behavior.
And having said that, there is a non-programmatic way to find out; consult the documentation for your compiler. "Implementation-defined" also means that the implementation must document the behavior:
3.4.1
1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made
2 EXAMPLE An example of implementation-defined behavior is the propagation of the high-order bit when a signed integer is shifted right.
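One documented constant worth checking is stdio's own BUFSIZ, the size setbuf() uses; note that an implementation is free to use a different size for any particular stream, so this is only a hint:

    #include <stdio.h>

    int main(void)
    {
        /* BUFSIZ is the implementation's documented setbuf() size
           (8192 on glibc); a given stream's actual buffer may differ. */
        printf("BUFSIZ = %d\n", BUFSIZ);
        return 0;
    }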
On Linux, when a pipe is created, a default pipe size of 64K is used.
/proc/sys/fs/pipe-max-size holds the maximum pipe size; 1048576 is a typical default.
For glibc's default file buffer, 65536 bytes might seem reasonable. However, grepping the glibc source tree shows:
libio/libio.h:#define _IO_BUFSIZ _G_BUFSIZ
sysdeps/generic/_G_config.h:#define _G_BUFSIZ 8192
sysdeps/unix/sysv/linux/_G_config.h:#define _G_BUFSIZ 8192
That may or may not answer the original question, but for a minute's effort the best guess is 8 kilobytes.
8K is adequate for mere line buffering, but for output that is more than line buffered it is not efficient compared with 64K. Because the default pipe size is 64K, if a larger pipe size is not expected and is not explicitly set, then 64K is the recommended size for a stdio buffer. If performance is required, meager 8K buffers do not suffice.
A pipe's size can be increased with fcntl(pipefd, F_SETPIPE_SZ, 1048576), and the stdio-provided file buffer can be replaced with setvbuf(stdout, buffer, _IOFBF, 1048576).
If a pipe is not used, pipe size is irrelevant. However, if data is piped between two processes, increasing the pipe size could be a performance boon. Otherwise the smallest buffer or the smallest pipe creates a bottleneck.
When reading, a larger buffer means stdio might need fewer invocations of the read function. The word "might" points to an important consideration: a single read invocation can return as much data as a single write invocation provided, but a read invocation can also return fewer bytes than requested, in which case an additional read invocation may gain the additional bytes.
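A sketch of the usual remedy for such short reads, looping until the requested count arrives or EOF is hit (the helper name is mine):

    #include <unistd.h>

    ssize_t read_full(int fd, char *buf, size_t count)
    {
        size_t got = 0;
        while (got < count) {
            ssize_t r = read(fd, buf + got, count - got);
            if (r < 0) return -1;   /* error */
            if (r == 0) break;      /* end of file */
            got += (size_t)r;       /* a short read: go around again */
        }
        return (ssize_t)got;
    }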
For writing a line of data, stdio is overkill; however, stdio makes line-buffered output possible, and in some scenarios line-buffered output is essential. When writing to a file provided by the proc or sys virtual file systems, the line feed byte should be included in a single write buffer; if a second write is used for it, the outcome could be unexpected.
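For example, a sketch with the newline included in the single write (the path is purely hypothetical):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* hypothetical sysctl-style file */
        int fd = open("/proc/sys/kernel/some_knob", O_WRONLY);
        if (fd == -1) return 1;

        const char *value = "1\n";   /* line feed byte included in the buffer */
        if (write(fd, value, strlen(value)) == -1) {   /* one write, not two */
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }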
If read, write, and stdio are mixed, caveats exist. Before a write function invocation, an fflush invocation is required; because stderr is not buffered, the fflush invocation is not required for stderr. read might also provide fewer bytes than expected, because stdio might already have buffered the preceding bytes.
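A sketch of the required ordering when mixing the two on the output side:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        printf("buffered by stdio");     /* sits in stdout's buffer */
        fflush(stdout);                  /* required before the raw write() */
        write(STDOUT_FILENO, "!\n", 2);  /* unbuffered; now in the right order */
        return 0;
    }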
Not mixing unistd and stdio I/O is good advice, but it is often ignored. Mixing buffered input is unreasonable; mixing unbuffered input is possible; mixing buffered output is plausible.
stdio provides the convenience of buffered I/O. Buffered I/O is possible without stdio, but it requires additional code. When a sufficiently sized buffer is used, a write function invocation is not necessarily slower than stdio's output functions.
When a pipe is not involved, the mmap function can provide superior I/O. On a pipe, mmap does not return an error, but the data never appears in the address space; lseek, on the other hand, does return an error on a pipe.
Lastly, man 3 setvbuf provides a good example: if the buffer is allocated on the stack, the fclose invocation must not be omitted before returning.
The actual question was "In C, what's the size of stdout buffer?" 8192 might answer that much. But those who encounter this question are probably also curious about buffered input/output efficiency, a goal some inquiries approach only implicitly. A preference for terse replies leaves the significance of pipe size, buffer size, and mmap unexplicated; this reply explicates them.
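Putting the pipe-size and buffer-size points together, a hedged sketch (F_SETPIPE_SZ is Linux-specific; the static buffer sidesteps the stack caveat mentioned above):

    #define _GNU_SOURCE   /* for F_SETPIPE_SZ */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Replace stdout's stdio buffer with a 1 MB fully buffered one,
           before the first output operation. */
        static char buffer[1048576];
        setvbuf(stdout, buffer, _IOFBF, sizeof buffer);

        /* If stdout is a pipe, try to grow it to match; this fails
           harmlessly when stdout is not a pipe. */
        fcntl(STDOUT_FILENO, F_SETPIPE_SZ, 1048576);

        for (int i = 0; i < 1000000; i++)
            printf("%d\n", i);

        return 0;   /* exit flushes stdout; the static buffer never dangles */
    }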
There are some pretty interesting answers on a similar question.
On a Linux system you can view buffer sizes from different functions, including ulimit.
The header files limits.h and pipe.h should also contain that kind of info.
You could set the stream to unbuffered, or just flush it. There is some decent info elsewhere on when the C runtime typically flushes it for you, along with some examples; take a look.

character reading in C

I am struggling to understand the difference between these functions. Which one of them can be used if I want to read one character at a time?
fread()
read()
getc()
Depending on how you want to do it, you can use any of those functions.
The easiest to use would probably be fgetc().
fread() : reads a block of data from a stream (documentation)
read() : the POSIX call that reads raw bytes from a file descriptor (documentation)
getc() : gets a character from a stream (documentation). Please consider using fgetc() (doc) instead, since getc() may be implemented as a macro that evaluates its stream argument more than once, which makes fgetc() a bit safer.
fread() is a standard C function for reading blocks of binary data from a file.
read() is a POSIX function for doing the same.
getc() is a standard C function (a macro, actually) for reading a single character from a file - i.e., it's what you are looking for.
In addition to the other answers, also note that read is an unbuffered way to read from a file, while fread provides an internal buffer, so reading is buffered (the buffer size can be tuned by you, e.g. via setvbuf). Each time you call read, a system call occurs that reads the number of bytes you asked for. With fread, the library reads a larger chunk into the internal buffer and returns only the bytes you requested; on each call it first checks whether it can serve you more data from the buffer, and only if it can't does it make a system call (read) to fetch another chunk, again returning only the portion you wanted.
Also, read directly handles the file descriptor number, whereas fread needs the file to be opened as a FILE pointer.
The answer depends on what you mean by "one character at a time".
If you want to ensure that only one character is consumed from the underlying file descriptor (which may refer to a non-seekable object like a pipe, socket, or terminal device) then the only solution is to use read with a length of 1. If you use strace (or similar) to monitor a shell script using the shell command read, you'll see that it repeatedly calls read with a length of 1. Otherwise it would risk reading too many bytes (past the newline it's looking for) and having subsequent processes fail to see the data on the "next line".
On the other hand, if the only program that should be performing further reads is your program itself, fread or getc will work just fine. Note that getc should be a lot faster than fread if you're just reading a single byte.
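A sketch of the first approach, consuming exactly one line without reading past the newline, the way the shell's read builtin does (the helper name is mine):

    #include <unistd.h>

    /* Read up to cap-1 bytes from fd, stopping after the first newline,
       one byte per read() so nothing past the line is consumed. */
    ssize_t read_line(int fd, char *buf, size_t cap)
    {
        size_t n = 0;
        while (n + 1 < cap) {
            char c;
            ssize_t r = read(fd, &c, 1);
            if (r <= 0) break;          /* EOF or error */
            buf[n++] = c;
            if (c == '\n') break;       /* stop exactly at the newline */
        }
        buf[n] = '\0';
        return (ssize_t)n;
    }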
