fgets() was intended for reading some string until EOF or \n occurred. It is very handy for reading text config files, for example, but there are some problems.
First, it may return EINTR in case of signal delivery, so it should be wrapped with loop checking for that.
Second problem is much worse: at least in glibc, it will return EINTR and loss all already read data in case it delivered in middle. This is very unlikely to happen, but I think this may be source of some complicated vulnerabilities in some daemons.
Setting SA_RESTART flag on signals seems to help avoiding this problem but I'm not sure it covers ALL possible cases on all platforms. Is it?
If no, is there a way to avoid the problem at all?
If no, it seems that fgets() is not usable for reading files in daemons because it may lead to random data loss.
Example code for tests:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
static char buf[1000000];
static volatile int do_exit = 0;
static void int_sig_handle(int signum) { do_exit = 1; }
void try(void) {
char * r;
int err1, err2;
size_t len;
memset(buf,1,20); buf[20]=0;
r = fgets(buf, sizeof(buf), stdin);
if(!r) {
err1 = errno;
err2 = ferror(stdin);
printf("\n\nfgets()=NULL, errno=%d(%s), ferror()=%d\n", err1, strerror(err1), err2);
len = strlen(buf);
printf("strlen()=%u, buf=[[[%s]]]\n", (unsigned)len, buf);
} else if(r==buf) {
err1 = errno;
err2 = ferror(stdin);
len = strlen(buf);
if(!len) {
printf("\n\nfgets()=buf, strlen()=0, errno=%d(%s), ferror()=%d\n", err1, strerror(err1), err2);
} else {
printf("\n\nfgets()=buf, strlen()=%u, [len-1]=0x%02X, errno=%d(%s), ferror()=%d\n",
(unsigned)len, (unsigned char)(buf[len-1]), err1, strerror(err1), err2);
}
} else {
printf("\n\nerr\n");
}
}
int main(int argc, char * * argv) {
struct sigaction sa;
sa.sa_flags = 0; sigemptyset(&sa.sa_mask); sa.sa_handler = int_sig_handle;
sigaction(SIGINT, &sa, NULL);
printf("attempt 1\n");
try();
printf("\nattempt 2\n");
try();
printf("\nend\n");
return 0;
}
This code can be used to test signal delivery in middle of "attempt 1" and ensure that its partially read data become completely lost after that.
How to test:
run program with strace
enter some line (do not press Enter), press Ctrl+D, see read() syscall completed with some data
send SIGINT
see fread() returned NULL, "attempt 2" and enter some data and press Enter
it will print second entered data but will not print first anywhere
FreeBSD 11 libc: same behaviour
FreeBSD 8 libc: first attempt returns partially read data and sets ferror() and errno
EDIT: according with #John Bollinger recommendations I've added dumping of the buffer after NULL return. Results:
glibc and FreeBSD 11 libc: buffer contains that partially read data but NOT NULL-TERM so the only way to get its length is to clear entire buffer before calling fgets() which looks not like intended use
FreeBSD 8 libc: still returns properly null-terminated partially-read data
stdio is indeed not reasonably usable with interrupting signal handlers.
Per ISO C 11 7.21.7.2 The fgets function, paragraph 3:
The fgets function returns s if successful. If end-of-file is encountered and no characters have been read into the array, the contents of the array remain unchanged and a null pointer is returned. If a read error occurs during the operation, the array contents are indeterminate and a null pointer is returned.
EINTR is a read error, so the array contents are indeterminate after such a return.
Theoretically, the behavior could be specified for fgets in a way that you could meaningfully recover from an error in the middle of the operation by setting up the buffer appropriately before the call, since you know that fgets does not write '\n' except as the final character before null termination (similar to techniques for using fgets with embedded NULs). However, it's not specified that way, and there would be no analogous way to handle other stdio functions like scanf, which have nowhere to store state for resuming them after EINTR.
Really, signals are just a really backwards way of doing things, and interrupting signals are an even more backwards tool full of race conditions and other unpleasant and unfixable corner cases. If you want to do this kind of thing in a safe and modern way, you probably need to have a thread that forwards stdin through a pipe or socket, and close the writing end of the pipe or socket in the signal handler so that the main part of your program reading from it gets EOF.
First, it may return EINTR in case of signal delivery, so it should be
wrapped with loop checking for that.
Of course you mean that fgets() will return NULL and set errno to EINTR. Yes, this is a possibility, and not only for fgets(), or even for stdio functions generally -- a wide variety of functions from the I/O realm and others may exhibit this behavior. Most POSIX functions that may block on events external to the program can fail with EINTR and various function-specific associated behaviors. It's a characteristic of the programming and operational environment.
Second problem is much worse: at least in glibc, it will return EINTR
and loss all already read data in case it delivered in middle. This is
very unlikely to happen, but I think this may be source of some
complicated vulnerabilities in some daemons.
No, at least not in my tests. It is your test program that loses data. When fgets() returns NULL to signal an error, that does not imply that it has not transferred any data to the buffer, and if I modify your program to print the buffer after an EINTR is signaled then I indeed see that the data from attempt 1 have been transferred there. But the program ignores that data.
Now it is possible that other programs make the same mistake that yours does, and therefore lose data, but that is not because of a flaw in the implementation of fgets().
FreeBSD 8 libc: first attempt returns partially read data and sets ferror() and errno
I'm inclined to think that this behavior is flawed -- if the function returns before reaching end of line / file then it should signal an error by providing a NULL return value. It may, but is not obligated to, transfer some or all of the data read to that point to the user-provided buffer. (But if it doesn't transfer data then they should remain available to be read.) I also find it surprising that the function sets the file's error flag at all. I'm inclined to think that erroneous, but I'm not prepared to present an argument for that at the moment.
Related
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 6 months ago.
Improve this question
I am trying to read input from stdin with fread(). However i am have a problem, the loop will not terminate and instead keeps reading.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[])
{
if (argc != 2) {
fprintf(stderr, "argument err");
return -1;
}
FILE *in = fopen(argv[1], "w");
if (in == NULL) {
fprintf(stderr, "failed to open file");
return -1;
}
char buffer[20];
size_t ret;
while ((ret = fread(buffer, 1, 20, stdin)) > 0) {
if (fwrite(buffer, 1, ret, in) != ret) {
if (ferror(in) != 0) {
perror("write err:");
}
}
}
return 0;
}
How can i make this loop terminate when EOF is reached? i have tried using ctrl+D but that just seems like a strange way to stop taking input.
I guess what i want is to use fread() to read multiple arbitrary amounts of data in chunks of 20 bytes and then somehow stop.
How can i make this loop terminate when EOF is reached?
When do you think EOF is reached? Really. When you are providing input interactively, how is the system or the program supposed to know that you've entered all the data you want the program to consume?
i have tried using ctrl+D but that just seems like a strange way to stop taking input.
It is exactly the way to signal a soft EOF to a POSIX terminal. Since you want the loop to stop when EOF is encountered, it seems absolutely natural to me to use ctrl+D for the purpose when providing data interactively. That's not the only way you could signal the end of the input, but it has a lot going for it.
I guess what i want is to use fread() to read multiple arbitrary amounts of data in chunks of 20 bytes and then somehow stop.
Again: how is the program supposed to know when it has consumed all the "multiple arbitrary amounts" of data that you decide to provide on a given run? An EOF signal is an eminently reasonable choice for multiple reasons, and the way to deliver that from a POSIX terminal interface is ctrl+D.
As pointed out before you are reading from an eternal stream, this means that stdin don't naturally have a EOF (or <=0) value.
If you want your loop to terminate, you will have to add a termination condition, like a certain character, word or all type of value. After that you could use a break or a return in some case. You could also search if your terminal emulator support the insertion of an EOF value into the stdin, which is pretty common (But very platform dependent).
ADD: On my system, typical linux, CTRL+D is for an EOF insertion in stdin. It seems that you found this out yourself, and if you want your program to know where to stop you will need to use this.
You cand also send a signal to your program, usually done with a shortcut like CTRL+D, CTRL+C, CTRL+T etc... there is all sort of signal, which can be sent by your system or/and your TE and you just have to implement in your program the corresponding signal receiver.
How can i make this loop terminate when EOF is reached? i have tried using ctrl+D but that just seems like a strange way to stop taking input.
fread and fwrite are there to read data records, so they (both) take the number of records to read and the size of the record. If the available data doesn't fit on a full record, you will not get the full record at all (indeed, the routines return the number of full records read, and the partial read will be waiting for the next fread() call.)
All the calls in stdio.h package are buffered, so the buffer holds the data that has been read (from the system) but not yet consumed by the user, and so, this makes me to wonder why are you trying to use a buffer to read data that is already buffered?
EOF is produced when you are trying to read one record and the fread() call results in a true end of file from the system (this normally requires two calls, the first to complete the remaining data, the second resulting in no data ---zero bytes--- returned from the system) So you have to distinguish two cases:
fread() returns 0 in case it has read something, but is not enough to complete a record.
fread() returns EOF in case it has read nothing (the true end of file is reached)
As I've said above, fread() & fwrite() will read/write full records (this is useful when your data is a struct with a fixed length, but normally not when you can have extra data at the end)
The way to terminate the loop should be something like this:
while ((ret = fread(buffer, 1, 20, stdin)) >= 0) {
if (fwrite(buffer, 1, ret, in) != ret) {
if (ferror(in) != 0) {
perror("write err:");
}
}
}
/* here you can have upto 19 bytes in the buffer that cannot
* be read with that record length, but you can read individually
* with fgetc() calls. */
so, if you read half a record (at end of file) only at the next fread() it will detect the end of file (by reading nothing) and you will be free of ending. (beware that the extra data that doesn't fill a full buffer, still needs to be read by other means)
The cheapest and easiest way to solve this problem (to copy a file from one descriptor to another) is described in K&R (in the first edition) and has not yet have better code to void it, is this:
int c;
while ((c = fgetc(in)) != EOF)
fputc(c, out);
while it seems to read the characters one by one, it actually makes a call to read(2) to completely fill a full buffer of data, and return just one character, next characters will be taken from the buffer, saving calls to read(), and the same happens to fputc() (it fills the buffer until it's full, then flushes it, in a single call to write()).
Many people has tried to defeat the code above, without any measurable gain in efficience. So, my hint is be simple, that the world is complicated enough to force you to go complex.
Normally, to indicate EOF to a program attached to standard input on a Linux terminal, I need to press Ctrl+D once if I just pressed Enter, or twice otherwise. I noticed that the patch command is different, though. With it, I need to press Ctrl+D twice if I just pressed Enter, or three times otherwise. (Doing cat | patch instead doesn't have this oddity. Also, If I press Ctrl+D before typing any real input at all, it doesn't have this oddity.) Digging into patch's source code, I traced this back to the way it loops on fread. Here's a minimal program that does the same thing:
#include <stdio.h>
int main(void) {
char buf[4096];
size_t charsread;
while((charsread = fread(buf, 1, sizeof(buf), stdin)) != 0) {
printf("Read %zu bytes. EOF: %d. Error: %d.\n", charsread, feof(stdin), ferror(stdin));
}
printf("Read zero bytes. EOF: %d. Error: %d. Exiting.\n", feof(stdin), ferror(stdin));
return 0;
}
When compiling and running the above program exactly as-is, here's a timeline of events:
My program calls fread.
fread calls the read system call.
I type "asdf".
I press Enter.
The read system call returns 5.
fread calls the read system call again.
I press Ctrl+D.
The read system call returns 0.
fread returns 5.
My program prints Read 5 bytes. EOF: 1. Error: 0.
My program calls fread again.
fread calls the read system call.
I press Ctrl+D again.
The read system call returns 0.
fread returns 0.
My program prints Read zero bytes. EOF: 1. Error: 0. Exiting.
Why does this means of reading stdin have this behavior, unlike the way that every other program seems to read it? Is this a bug in patch? How should this kind of loop be written to avoid this behavior?
UPDATE: This seems to be related to libc. I originally experienced it on glibc 2.23-0ubuntu3 from Ubuntu 16.04. #Barmar noted in the comments that it doesn't happen on macOS. After hearing this, I tried compiling the same program against musl 1.1.9-1, also from Ubuntu 16.04, and it didn't have this problem. On musl, the sequence of events has steps 12 through 14 removed, which is why it doesn't have the problem, but is otherwise the same (except for the irrelevant detail of readv in place of read).
Now, the question becomes: is glibc wrong in its behavior, or is patch wrong in assuming that its libc won't have this behavior?
I've managed to confirm that this is due to an unambiguous bug in glibc versions prior to 2.28 (commit 2cc7bad). Relevant quotes from the C standard:
The byte input/output functions — those functions described in this subclause that perform
input/output: [...], fread
The byte input functions read characters from the stream as if by successive
calls to the fgetc function.
If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of-file indicator for the stream is set and the fgetc function returns EOF. Otherwise, the fgetc function returns the next character from the input stream pointed to by stream.
(emphasis on "or" mine)
The following program demonstrates the bug with fgetc:
#include <stdio.h>
int main(void) {
while(fgetc(stdin) != EOF) {
puts("Read and discarded a character from stdin");
}
puts("fgetc(stdin) returned EOF");
if(!feof(stdin)) {
/* Included only for completeness. Doesn't occur in my testing. */
puts("Standard violation! After fgetc returned EOF, the end-of-file indicator wasn't set");
return 1;
}
if(fgetc(stdin) != EOF) {
/* This happens with glibc in my testing. */
puts("Standard violation! When fgetc was called with the end-of-file indicator set, it didn't return EOF");
return 1;
}
/* This happens with musl in my testing. */
puts("No standard violation detected");
return 0;
}
To demonstrate the bug:
Compile the program and execute it
Press Ctrl+D
Press Enter
The exact bug is that if the end-of-file stream indicator is set, but the stream is not at end-of-file, glibc's fgetc will return the next character from the stream, rather than EOF as the standard requires.
Since fread is defined in terms of fgetc, this is the cause of what I originally saw. It's previously been reported as glibc bug #1190 and has been fixed since commit 2cc7bad in February 2018, which landed in glibc 2.28 in August 2018.
I just want to create 2 new forks(child processes) and they will put their name sequentally. SO first they need to some string in pipe to check something. Let's see the code:
char myname[] = "ALOAA";
int main ()
{
int fds[2];
pid_t pid;
pipe(fds);
pid = fork();
if(pid == 0)
{
strcpy(myname, "first");
}
else
{
pid = fork();
if(pid == 0)
{
strcpy(myname, "second");
}
}
if(strcmp(myname, "ALOAA") != 0)
{
char readbuffer[1025];
int i;
for (i = 0; i < 2 ; i++)
{
//printf("%s\n", myname);
close(fds[0]);
write(fds[1], myname, strlen(myname));
while(1)
{
close(fds[1]);
int n = read(fds[0], readbuffer, 1024);
readbuffer[n] = 0;
printf("%s-alihan\n", readbuffer);
if(readbuffer != myname)
break;
sleep(1);
}
//printf("%s\n", myname);
}
}
return 0;
}
So the first process will write her name to pipe. And after that, will check if any new string in pipe. It will be same for second too. However I got empty string from read() function. So it prints like that
-alihan
-alihan
I couldn't get the problem.
However I got empty string from read() function [...] I couldn't get the problem.
#MikeCAT nailed this issue with his observation in comments that each child closes fds[0] before it ever attempts to read from it. No other file is assigned the same FD between, so the read fails. You do not test for the failure.
Not testing for the read failure is a significant problem, because your program does not merely fail to recognize it -- it exhibits undefined behavior as a result. This arises for (at least) two reasons:
read() will have indicated failure by returning -1, and your program will respond by attempting an out-of-bounds write (to readbuffer[-1]).
if we ignore the UB resulting from (1), we still have the program thereafter reading from completely uninitialized array readbuffer (because neither the read() call nor the assignment will have set the value of any element of that array).
Overall, you need to learn the discipline of checking the return values of your library function calls for error conditions, at least everywhere that it matters whether an error occurred (which is for most calls). For example, your usage of pipe(), fork(), and write() exhibits this problem, too. Under some circumstances you want to check the return value of printf()-family functions, and you usually want to check the return value of input functions -- not just read(), but scanf(), fgets(), etc..
Tertiarily, your usage of read() and write() is incorrect. You make the common mistake of assuming that (on success) write() will reliably write all the bytes specified, and that read() will read all bytes that have been written, up to the specified buffer size. Although that ordinarily works in practice for exchanging short messages over a pipe, it is not guaranteed. In general, write() may perform only a partial write and read() may perform only a partial read, for unspecified, unpredictable reasons.
To write successfully one generally must be prepared to repeat write() calls in a loop, using the return value to determine where (or whether) to start the next write. To read complete messages successfully one generally must be prepared similarly to repeat read() calls in a loop until the requisite number of bytes have been read into the buffer, or until some other termination condition is satisfied, such as the end of the file being reached. I presume it will not be lost on you that many forms of this require advance knowledge of the number of bytes to read.
The mod_rewrite documentation states that it is a strict requirement to disable in(out)put buffering in a rewrite program.
Keeping that in mind I've written a simple program (I do know that it lacks the EOF check but this is not an issue and it saves one condition check per loop):
#include <stdio.h>
#include <stdlib.h>
int main ( void )
{
setvbuf(stdin,NULL,_IONBF,0);
setvbuf(stdout,NULL,_IONBF,0);
int character;
while ( 42 )
{
character = getchar();
if ( character == '-' )
{
character = '_';
}
putchar(character);
}
return 0;
}
After making some measurements I was shocked - it was over 9,000 times slower than the demo Perl script provided by the documentation:
#!/usr/bin/perl
$| = 1; # Turn off I/O buffering
while (<STDIN>) {
s/-/_/g; # Replace dashes with underscores
print $_;
}
Now I have two related questions:
Question 1. I believe that the streams may be line buffered since Apache sends a new line after each path. Am I correct? Switching my program to
setvbuf(stdin,NULL,_IOLBF,4200);
setvbuf(stdout,NULL,_IOLBF,4200);
makes it twice as fast as Perl one. This should not hit Apache's performance, should it?
Question 2. How can one write a program in C which will use unbuffered streams (like Perl one) and will perform as fast as Perl one?
Question 1: You would have to look at the code. It could be line buffered, it could be using fflush at the end of each request (or block of requests), or it could be using write calls with a larger buffer. In any case, it won't be doing per-character I/O which is what your program is doing.
Question 2: I suspect the main issue is on output. If you were to assemble the entire result in a buffer and write that out as one call, then you would be faster. However, that just means you are doing the line buffering instead of having the library take care of it for you. The key is that with no buffering, each output call results in a system call - that is very high overhead. In theory, the same concept holds true on input but I'm not sure the implementation wouldn't notice the available characters and buffer them in any case. Same workaround though - read a larger buffer and then take it apart yourself.
Personally, I'd avoid all the setvbuf stuff and just do an fflush at the end of each request.
When writing to a terminal, stdout is flushed after every line. This way you can always see the output right away. When writing to a file or, as in your case a pipe, this automatic flush is disabled. Usually in those cases performance is more important.
This causes problems when processes have to interact with each other. One program writes something. It's not sent instantly but stored in a buffer. Second program waits for that data. First program waits for more data from second program resulting in a deadlock.
To avoid this, you need to flush all the output before waiting for additional input. Simple fflusuh(stdout) before every read operation should be enough. This is actually what $|=1 does in Perl. Nothing needs to be done with stdin.
If performance is critical and you need to operate only on single bytes. Read and write data in big chunks using unbuffered read/write. For example:
#include <unistd.h>
int main() {
char buf[1024];
while(1) {
int len = read(0,buf,sizeof(buf));
for(int i=0;i<len;i++) {
if ( buf[i] == '-' ) {
buf[i] = '_';
}
}
write(1,buf,len);
}
}
I've included an example program using getchar() below, for reference (not that anyone probably needs it), and feel free to address concerns with it if you desire. But my question is:
What exactly is going on when the program calls getchar()?
Here is my understanding (please clarify or correct me):
When getchar is called, it checks the STDIN buffer to see if there is any input.
If there isn't any input, getchar sleeps.
Upon wake, getchar checks to see if there is any input, and if not, puts it self to sleep again.
Steps 2 and 3 repeat until there is input.
Once there is input (which by convention includes an 'EOF' at the end), getchar returns the first character of this input and does something to indicate that the next call to getchar should return the second letter from the same buffer? I'm not really sure what that is.
When there are no more characters left other than EOF, does getchar flush the buffer?
The terms I used are probably not quite correct.
#include <stdio.h>
int getLine(char buffer[], int maxChars);
#define MAX_LINE_LENGTH 80
int main(void){
char line[MAX_LINE_LENGTH];
int errorCode;
errorCode = getLine(line, sizeof(line));
if(errorCode == 1)
printf("Input exceeded maximum line length of %d characters.\n", MAX_LINE_LENGTH);
printf("%s\n", line);
return 0;
}
int getLine(char buffer[], int maxChars){
int c, i = 0;
while((c = getchar()) != EOF && c != '\n' && i < maxChars - 1)
buffer[i++] = c;
buffer[i++] = '\0';
if(i == maxChars)
return 1;
else
return 0;
}
Step 2-4 are slightly off.
If there is no input in the standard I/O buffer, getchar() calls a function to reload the buffer. On a Unix-like system, that normally ends up calling the read() system call, and the read() system call puts the process to sleep until there is input to be processed, or the kernel knows there will be no input to be processed (EOF). When the read returns, the code adjusts the data structures so that getchar() knows how much data is available. You description implies polling; the standard I/O system does not poll for input.
Step 5 uses the adjusted pointers to return the correct values.
There really isn't an EOF character; it is a state, not a character. Even though you type Control-D or Control-Z to indicate 'EOF', that character is not inserted into the input stream. In fact, those characters cause the system to flush any typed characters that are still waiting for 'line editing' operations (like backspace) to change them so that they are made available to the read() system call. If there are no such characters, then read() returns 0 as the number of available characters, which means EOF. Then getchar() returns the value EOF (usually -1 but guaranteed to be negative whereas valid characters are guaranteed to be non-negative (zero or positive)).
So basically, rather than polling, is it that hitting Return causes a certain I/O interrupt, and then when the OS receives this, it wakes up any processes that are sleeping for I/O?
Yes, hitting Return triggers interrupts and the OS kernel processes them and wakes up processes that are waiting for the data. The terminal driver is woken by the kernel when interrupt occurs, and decides what to do with the character(s) that were just received. They may be stashed for further processing (canonical mode) or made available immediately (raw mode), etc. Assuming, of course, that the input is a terminal; if the input is from a disk file, it is simpler in many ways — or if it is a pipe, or …
Nominally, it isn't the terminal app that gets woken by the interrupt; it is the kernel that wakes first, then the shell running in the terminal app that is woken because there's data for it to read, and only when there's output does the terminal app get woken.
I say 'nominally' because there's an outside chance that in fact the terminal app does mediate the I/O via a pty (pseudo-tty), but I think it happens at the kernel level and the terminal application is involved fairly late in the process. There's a huge disconnect really between the keyboard where you type and the display where what you type appears.
See also Canonical vs non-canonical terminal input.