When does scanf start and stop scanning? - c

It seems scanf begins scanning the input when the Enter key is pressed, and I want to verify this with the code below (I eliminated error checking and handling for simplicity).
#include <stdio.h>

int main(int argc, char **argv) {
    /* disable buffering */
    setvbuf(stdin, NULL, _IONBF, 0);

    int number;
    scanf("%d", &number);
    printf("number: %d\n", number);
    return 0;
}
Here comes another problem: after I disable input buffering (just to verify the result; I know I should next to never do that in practice, since it can interfere with the results), the output is (note the extra prompt):
$ ./ionbf
12(space)(enter)
number: 12
$
$
which is different from the output when input buffering is enabled (no extra prompt):
$ ./iofbf
12(space)(enter)
number: 12
$
It seems the newline character is consumed when buffering is enabled. I tested on two different machines, one with gcc 4.1.2 and bash 3.2.25 installed, the other with gcc 4.4.4 and bash 4.1.5, and the result is the same on both.
The problems are:
How to explain the different behaviors when input buffering is enabled and disabled?
Back to the original problem, when does scanf begin scanning user input? The moment a character is entered? Or is it buffered until a line completes?

Interesting question — long-winded answer. In case of doubt, I'm describing what I think happens on Unix; I leave Windows to other people. I think the behaviour would be similar, but I'm not sure.
When you use setvbuf(stdin, NULL, _IONBF, 0), you force the stdin stream to read one character at a time using the read(0, buffer, 1) system call. When you run with _IOFBF or _IOLBF, then the code managing the stream will attempt to read many more bytes at a time (up to the size of the buffer you provide if you use setvbuf(), or BUFSIZ if you don't). These observations plus the space in your input are key to explaining what happens. I'm assuming your terminal is in normal or canonical input mode — see Canonical vs non-canonical terminal input for a discussion of that.
You are correct that the terminal driver does not make any characters available until you type return. This allows you to use backspace etc to edit the line as you type it.
When you hit return, the kernel has 4 characters available to send to any program that wants to read them: 1 2 space return.
In the case where you are not using _IONBF, those 4 characters are all read at once into the standard I/O buffer for stdin by a call such as read(0, buffer, BUFSIZ). The scanf() then collects the 1, the 2 and the space characters from the buffer, and puts back the space into the buffer. (Note that the kernel has passed all four characters to the program.) The program prints its output and exits. The shell resumes, prints a prompt and waits for some more input to be available — but there won't be any input available until the user types another return, possibly (usually) preceded by some other characters.
In the case where you are using _IONBF, the program reads the characters one at a time. It makes a read() call to get one character and gets the 1; it makes another read() call and gets the 2; it makes another read() call and gets the space character. (Note that the kernel still has the return ready and waiting.) It doesn't need the space to interpret the number, so it puts it back in its pushback buffer (there is guaranteed to be space for at least one byte in the pushback buffer), ready for the next standard I/O read operation, and returns. The program prints its output and exits. The shell resumes, prints a prompt, and tries to read a new command from the terminal. The kernel obliges by returning the newline that is waiting, and the shell says "Oh, that's an empty command" and gives you another prompt.
You can demonstrate this is what happens by typing 1 2 x p s return to your (_IONBF) program. When you do that, your program reads the value 12 and the 'x', leaving 'ps' and the newline to be read by the shell, which will then execute the ps command (without echoing the characters that it read), and then prompt again.
You could also use truss or strace or a similar command to trace the system calls executed by your program and verify that this is what actually happens.
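If you'd rather not use a tracing tool, a small sketch (mine, not part of the original program) also makes the pushback visible: after scanf("%d", ...) the very next character stdio hands back should be the space, because that's what scanf pushed back when it saw the number had ended.
#include <stdio.h>

int main(void) {
    int number;

    /* scanf stops at the first non-digit (the space) and ungets it */
    if (scanf("%d", &number) != 1)
        return 1;

    /* the next character from stdio is the pushed-back space (code 32),
       not the newline, which is still in the stdio buffer (or, with
       _IONBF, still in the kernel's tty buffer) */
    int c = getchar();
    printf("number: %d, next char code: %d\n", number, c);
    return 0;
}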

Related

Linux stdin buffering

With the following program and the sample run, I expected to see "stdin contains 9 bytes", but as you can see in the sample run, I got "stdin contains 0 bytes". Why is that? How can I fix this program to get the actual number of unread bytes in stdin?
Program:
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    if ((setvbuf(stdin, NULL, _IONBF, 0) |
         setvbuf(stdout, NULL, _IONBF, 0)) != 0) {
        printf("setting stdin/stdout to unbuffered failed");
        return 1;
    }
    printf("type some keys\n");
    sleep(3);
    printf("\n");

    int bytesUnread = 0;                 /* FIONREAD takes an int argument */
    if (ioctl(0, FIONREAD, &bytesUnread) != 0) {
        printf("ioctl error");
        return 1;
    }
    printf("stdin contains %d bytes\n", bytesUnread);
    return 0;
}
Sample run:
$ ./a.out
type some keys
some keys
stdin contains 0 bytes
Here is another sample run where I hit enter, and you can see it worked as expected.
$ ./a.out
type some keys
asd
stdin contains 4 bytes
Redirect stdin from a file or press Enter after pressing a few keys and you will see that your code works as expected, regardless of whether you call setvbuf or not, because your problem is not the FILE stream being buffered. The FILE stream isn't even involved in your ioctl. Rather, your problem is that the bytes have not been transmitted yet. They're in the kernel's line-editing buffer for the canonical-mode tty, which lets you backspace, Ctrl-W, etc. over them before they are sent.
If you want the tty layer to transmit bytes as they're generated on the terminal, rather than only in aggregate after line editing, you need to take the tty out of canonical mode. The termios.h interfaces are how you do this (see man 3 termios). Generally the easiest way is to tcgetattr, cfmakeraw, then tcsetattr, but cfmakeraw is not entirely portable so it can be preferable to just do the equivalent changes yourself.
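A rough sketch of that approach (my illustration, assuming a POSIX tty on standard input, with error checking trimmed): once canonical mode is switched off, the FIONREAD query from the question starts reporting bytes as soon as they are typed.
#include <stdio.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>

int main(void) {
    struct termios saved, raw;

    tcgetattr(STDIN_FILENO, &saved);        /* remember current settings */
    raw = saved;
    cfmakeraw(&raw);                        /* or clear ICANON/ECHO by hand */
    tcsetattr(STDIN_FILENO, TCSANOW, &raw);

    printf("type some keys\r\n");           /* raw mode: no ONLCR, so \r\n */
    sleep(3);

    int pending = 0;
    if (ioctl(STDIN_FILENO, FIONREAD, &pending) == 0)
        printf("stdin contains %d bytes\r\n", pending);

    tcsetattr(STDIN_FILENO, TCSANOW, &saved);   /* restore the terminal */
    return 0;
}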
setvbuf controls the buffering of a FILE stream, not of the underlying file descriptor or device. Changing the stream's input buffering doesn't help here, because you generally don't know what is available to be read from the underlying file descriptor until you actually read it. Even in cases where you could know (e.g., by checking with an ioctl), that might change, due to later events and other threads and processes accessing the file descriptor. So you can never actually know for certain what you are going to read until you read it.
Looking at your sample run, it looks like you are using a terminal and not hitting Enter. In that case, the input will be held in the terminal buffer (in case you later hit backspace), and your ioctl call (which checks the file descriptor's buffer) will not see it.
In any case, getting the amount of data in the input buffer is not useful, as trying to do anything with that knowledge before you actually read it is a race condition -- you have no way of knowing if some other process or driver is in the middle of modifying the buffer. So for any real use, you just want to read the input data and react based on what the (atomic) read system call returns.

What is it with printf() sending output to buffer?

I am going through "C PRIMER PLUS" and there is this topic about "OUTPUT FLUSHING".
Now it says:
printf() statements send output to an intermediate storage area called a buffer.
Every now and then, the material in the buffer is sent to the screen. The
standard C rules for when output is sent from the buffer to the screen are
clear:
It is sent when the buffer gets full.
When a newline character is encountered.
When there is impending input.
(Sending the output from the buffer to the screen or file is called flushing
the buffer.)
Now, to verify the above statements, I wrote this simple program:
#include <stdio.h>

int main(int argc, char** argv) {
    printf("Hello World");
    return 0;
}
So, the printf() neither contains a newline, nor does it have any impending input (e.g., a scanf() statement or any other input statement). Then why does it print the contents on the output screen?
Let's suppose the first condition evaluated to true: the buffer got full (which can't really happen here).
Keeping that in mind, I truncated the statement inside printf() to
printf("Hi");
Still it prints the statement on the console.
So what's the deal here? All of the above conditions are false, but I'm still getting the output on screen.
Can you please elaborate? It appears I'm misunderstanding the concept. Any help is highly appreciated.
EDIT: As suggested by a very useful comment, maybe the exit() function executed after the end of the program is causing all the buffers to flush, resulting in the output on the console. But then what if we hold the screen before exit() executes, like this:
#include <stdio.h>

int main(int argc, char** argv) {
    printf("Hello World!");
    getchar();
    return 0;
}
It still outputs on the console.
Output buffering is an optimization technique. Writing data to some devices (hard disks, for example) is an expensive operation; that's why buffering appeared. In essence, it avoids writing data byte-by-byte (or char-by-char) and collects it in a buffer in order to write several KiB of data at once.
Being an optimization, output buffering must be transparent to the user (it is transparent even to the program). It must not affect the behaviour of the program; with or without buffering (or with different sizes of the buffer), the program must behave the same. This is what the rules you mentioned are for.
A buffer is just an area in memory where the data to be written is temporarily stored until enough data accumulates to make the actual writing process to the device efficient. Some devices (hard disk etc.) do not even allow writing (or reading) data in small pieces but only in blocks of some fixed size.
The rules of buffer flushing:
It is sent when the buffer gets full.
This is obvious. The buffer is full, its purpose was fulfilled, let's push the data forward to the device. Also, there is probably more data to come from the program, and we need to make room for it.
When a newline character is encountered.
There are two types of devices: line-mode and block-mode. This rule applies only to the line-mode devices (the terminal, for example). It doesn't make much sense to flush the buffer on newlines when writing to disk. But it makes a lot of sense to do it when the program is writing to the terminal. In front of the terminal there is the user waiting impatiently for output. Don't let them wait too much.
But why does output to a terminal need buffering? Writing to the terminal is not expensive. That's true when the terminal is physically located near the processor, but not when the terminal and the processor are half the globe apart and the user runs the program through a remote connection.
When there is impending input.
It should read "when there is impending input on the same device" to make it clear.
Reading is also buffered, for the same reason as writing: efficiency. The reading code uses its own buffer; it fills the buffer when needed, and then scanf() and the other input-reading functions get their data from that input buffer.
When an input is about to happen on the same device, the buffer must be flushed (the data actually written to the device) in order to ensure consistency. The program has sent some data to the output and now it expects to read back the same data; that's why the data must be flushed to the device, so that the reading code finds it there and loads it.
But why are the buffers flushed when the application exits?
Err... buffering is transparent, it must not affect the application behaviour. Your application has sent some data to the output. The data must be there (on the output device) when the application quits.
The buffers are also flushed when the associated files are closed, for the same reason. And this is what happens when the application exits: the cleanup code closes all the open files (standard input and output are just files from the application's point of view), and closing forces the buffers to be flushed.
Part of the specification for exit() in the C standard (POSIX link given) is:
Next, all open streams with unwritten buffered data are flushed, all open streams are closed, …
So, when the program exits, pending output is flushed, regardless of newlines, etc. Similarly, when the file is closed (fclose()), pending output is written:
Any unwritten buffered data for the stream are delivered to the host environment to be written to the file; any unread buffered data are discarded.
And, of course, the fflush() function flushes the output.
The rules quoted in the question are not wholly accurate.
When the buffer is full — this is correct.
When a newline is encountered — this is not correct, though it often applies. If the output device is an 'interactive device', then line buffering is the default. However, if the output device is 'non-interactive' (disk file, a pipe, etc), then the output is not necessarily (or usually) line-buffered.
When there is impending input — this too is not correct, though it is commonly the way it works. Again, it depends on whether the input and output devices are 'interactive'.
The output buffering mode can be modified by calling setvbuf()
to set no buffering, line buffering or full buffering.
The standard says (§7.21.3):
¶3 When a stream is unbuffered, characters are intended to appear from the source or at the destination as soon as possible. Otherwise characters may be accumulated and transmitted to or from the host environment as a block. When a stream is fully buffered, characters are intended to be transmitted to or from the host environment as a block when a buffer is filled. When a stream is line buffered, characters are intended to be transmitted to or from the host environment as a block when a new-line character is encountered. Furthermore, characters are intended to be transmitted as a block to the host environment when a buffer is filled, when input is requested on an unbuffered stream, or when input is requested on a line buffered stream that requires the transmission of characters from the host environment. Support for these characteristics is implementation-defined, and may be affected via the setbuf and setvbuf functions.
…
¶7 At program startup, three text streams are predefined and need not be opened explicitly — standard input (for reading conventional input), standard output (for writing conventional output), and standard error (for writing diagnostic output). As initially opened, the standard error stream is not fully buffered; the standard input and standard output streams are fully buffered if and only if the stream can be determined not to refer to an interactive device.
Also, §5.1.2.3 Program execution says:
The input and output dynamics of interactive devices shall take place as specified in 7.21.3. The intent of these requirements is that unbuffered or line-buffered output appear as soon as possible, to ensure that prompting messages actually appear prior to a program waiting for input.
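This is also why, when a prompt doesn't end in a newline and you can't rely on the implementation flushing it for you, an explicit fflush(stdout) is the usual fix. A minimal sketch (my example, not from the text above):
#include <stdio.h>

int main(void) {
    int n;

    printf("Enter a number: ");   /* no newline: may sit in the buffer */
    fflush(stdout);               /* push the prompt out before blocking */

    if (scanf("%d", &n) == 1)
        printf("You typed %d\n", n);
    return 0;
}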
The strange buffering behavior of printf can be explained with the simple C code below. Please read through the whole thing, run it, and make sure you understand it, as the result is not obvious (it's a bit tricky).
#include <stdio.h>
#include <unistd.h>

int main()
{
    int a = 0, b = 0, c = 0;
    printf("Enter two numbers");
    while (1)
    {
        sleep(1000);
    }
    scanf("%d%d", &b, &c);
    a = b + c;
    printf("The sum is %d", a);
    return 1;
}
EXPERIMENT #1:
Action: Compile and Run above code
Observations:
The expected output is
Enter two numbers
But this output is not seen
EXPERIMENT #2:
Action: Move the scanf statement above the while loop.
#include <stdio.h>
#include <unistd.h>

int main()
{
    int a = 0, b = 0, c = 0;
    printf("Enter two numbers");
    scanf("%d%d", &b, &c);
    while (1)
    {
        sleep(1000);
    }
    a = b + c;
    printf("The sum is %d", a);
    return 1;
}
Observations: Now the output is printed (the reason is explained at the end), just from moving the scanf.
EXPERIMENT #3:
Action: Now add \n to the printf statement as below
#include <stdio.h>
#include <unistd.h>

int main()
{
    int a = 0, b = 0, c = 0;
    printf("Enter two numbers\n");
    while (1)
    {
        sleep(1000);
    }
    scanf("%d%d", &b, &c);
    a = b + c;
    printf("The sum is %d", a);
    return 1;
}
Observation: The output Enter two numbers is seen (after adding \n)
EXPERIMENT #4:
Action: Now remove the \n from the printf line and comment out the while loop, the scanf line, the addition, and the final printf that prints the result.
#include <stdio.h>

int main()
{
    int a = 0, b = 0, c = 0;
    printf("Enter two numbers");
    // while (1)
    // {
    //     sleep(1000);
    // }
    // scanf("%d%d", &b, &c);
    // a = b + c;
    // printf("The sum is %d", a);
    return 1;
}
Observations: The line "Enter two numbers" is printed to screen.
ANSWER:
The reason behind the strange behavior is described in Richard Stevens' book.
PRINTF PRINTS TO THE SCREEN WHEN
The job of printf is to write output to the stdout buffer. The C library flushes that buffer when:
it is about to read something in from the input buffer (EXPERIMENT #2)
it encounters a newline, since stdout connected to a terminal is line-buffered by default (EXPERIMENT #3)
the program exits, at which point all output buffers are flushed (EXPERIMENT #4)
By default stdout is line-buffered, so printf will not print while the line has not ended.
If the stream is unbuffered, output appears immediately, as is.
If it is fully buffered, it is flushed only when the buffer is full.
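To watch the three modes side by side, here is a hedged sketch (the "none"/"full" argument names are just for this demo) that picks the stdout buffering mode with setvbuf() and then sleeps in the middle of a line; the text before the sleep is visible during the sleep only in the unbuffered case.
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int mode = _IOLBF;                     /* default for this demo: line buffered */
    if (argc > 1 && strcmp(argv[1], "none") == 0) mode = _IONBF;
    if (argc > 1 && strcmp(argv[1], "full") == 0) mode = _IOFBF;
    setvbuf(stdout, NULL, mode, BUFSIZ);   /* must be done before any output */

    printf("before the sleep");            /* no newline on purpose */
    sleep(3);                              /* visible during the sleep only with "none" */
    printf(" ... after the sleep\n");      /* newline flushes in line mode; exit flushes the rest */
    return 0;
}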

Why doesn't getchar() read characters such as backspace?

This is a very basic C question, coming from page 18 of Kernighan and Ritchie.
I've compiled this very simple code for counting characters input from the keyboard:
#include <stdio.h>

/* count characters in input; 1st version */
main()
{
    long nc;

    nc = 0;
    while (getchar() != EOF)
        ++nc;
    printf("%ld\n", nc);
}
This compiles fine, runs fine, and behaves pretty much as expected, i.e. if I enter "Hello World", it returns a value of 11 when I press Ctrl-D to signal EOF.
What is confusing me is if I make a mistake, I can use backspace to delete the characters and re-enter them, and it returns only the number of characters displayed by the terminal when I invoke EOF.
If the code is counting each character, including special characters, if I type four characters, delete two, and type another two, shouldn't that output as 8 characters (4 char + 2 del + 2 char), not 4?
I'm obviously misunderstanding how C handles backspace, and how/when the code is incrementing the variable nc?
Typically, your terminal session is running in "line mode", that is, it only passes data to your program when a line is complete (eg, you pressed Return, etc). So you only see the line as it is complete (with any editing having been done before your program ever sees anything). Typically this is a good thing, so every program doesn't need to deal with delete/etc.
On most systems (eg Unix-based systems, etc), it is possible to put the terminal into "raw" mode -- that is, each character is passed as received to the program. For example, screen-oriented text editors commonly do this.
It's not that getchar() doesn't count the "deletions" but it doesn't even see the input until it's passed to your program by the terminal driver.
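If you actually wanted the erase characters to be counted, you would have to take the terminal out of canonical mode yourself. A hedged sketch using termios (note that in this mode Ctrl-D no longer signals end of file; it arrives as the byte 4, so the loop checks for it explicitly):
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void) {
    struct termios saved, raw;

    tcgetattr(STDIN_FILENO, &saved);
    raw = saved;
    raw.c_lflag &= ~(ICANON | ECHO);        /* character-at-a-time, no echo */
    raw.c_cc[VMIN]  = 1;                    /* return each byte as it arrives */
    raw.c_cc[VTIME] = 0;
    tcsetattr(STDIN_FILENO, TCSANOW, &raw);

    long nc = 0;
    int c;
    while ((c = getchar()) != EOF && c != 4)   /* 4 == the Ctrl-D byte */
        ++nc;                                  /* backspaces now count too */

    tcsetattr(STDIN_FILENO, TCSANOW, &saved);  /* restore the terminal */
    printf("%ld\n", nc);
    return 0;
}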
When you input something, it doesn't reach your C program until you press \n or send EOF (or EOL). This is what POSIX defines as Canonical Mode Input Processing - which is typically the default mode.
Backspace characters are normally used to edit input in cooked tty mode (see canonical input mode in tty(4) on BSD and termios(3) on Linux systems), so they are consumed by the tty driver and never reach the input your process receives. The same applies to Ctrl-D as the end-of-file character or to Ctrl-K as the kill-input character. The driver does several things behind the scenes that your process never sees. These are intended to make life easier for users and programmers, as you normally don't want erased input in your data (that's the reason for erasing it), and you want line endings to be \n and not the \r that the tty normally generates when you press the [RETURN] key. But if you read from a file that happens to contain backspaces, you'll get them as normal input anyway; just create a file with backspaces and try reading with input redirected from it, and you'll see those characters in your input.
By the way, if you want to enter literal backspaces at the terminal, just prefix each one with a Ctrl-V character (this is also handled by the tty driver and does not apply when reading from a file) and you'll see your backspace characters as normal input (to send a literal Ctrl-V, just type it twice).
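For the file case, a tiny sketch (mine; the file name bs_demo.txt is arbitrary) writes a literal backspace byte into a file and reads it back, showing that it arrives like any other byte because no tty driver is involved:
#include <stdio.h>

int main(void) {
    FILE *f = fopen("bs_demo.txt", "w");
    if (!f) return 1;
    fputs("ab\bc\n", f);                 /* 'a', 'b', backspace, 'c', newline */
    fclose(f);

    f = fopen("bs_demo.txt", "r");
    if (!f) return 1;
    int c, count = 0;
    while ((c = fgetc(f)) != EOF)
        printf("byte %d: %d\n", count++, c);
    fclose(f);
    return 0;                            /* prints 5 bytes, one with code 8 */
}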

Different behaviour of Ctrl-D (Unix) and Ctrl-Z (Windows)

As per title I am trying to understand the exact behavior of Ctrl+D / Ctrl+Z in a while loop with a gets (which I am required to use). The code I am testing is the following:
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char str[80];

    while (printf("Insert string: ") && gets(str) != NULL) {
        puts(str);
    }
    return 0;
}
If my input is simply a Ctrl+D (or Ctrl+Z on Windows) gets returns NULL and the program exits correctly. The unclear situation is when I insert something like house^D^D (Unix) or house^Z^Z\n (Windows).
In the first case my interpretation is a getchar (or something similar inside the gets function) waits for read() to get the input, the first Ctrl+D flushes the buffer which is not empty (hence not EOF) then the second time read() is called EOF is triggered.
In the second case though, I noticed that the first Ctrl+Z is inserted into the buffer while everything that follows is simply ignored. Hence my understanding is the first read() call inserted house^Z and discarded everything else returning 5 (number of characters read). (I say 5 because otherwise I think a simple Ctrl+Z should return 1 without triggering EOF). Then the program waits for more input from the user, hence a second read() call.
I'd like to know what I get right and wrong of the way it works and which part of it is simply implementation dependent, if any.
Furthermore, I noticed that on both Unix and Windows, even after EOF is triggered it seems to reset to false in the following gets() call, and I don't understand why this happens or in which line of the code.
I would really appreciate any kind of help.
(12/20/2016) I heavily edited my question in order to avoid confusion
The CTRL-D and CTRL-Z "end of file" indicators serve a similar purpose on Unix and Windows systems respectively, but are implemented quite differently.
On Unix systems (including Unix clones like Linux) CTRL-D, while officially described as the end-of-file character, is actually a delimiter character. It does almost the same thing as the end-of-line character (usually carriage return or CTRL-M) which is used to delimit lines. Both characters tell the operating system that the input line is finished and make it available to the program. The only difference is that with the end-of-line character, a line feed (CTRL-J) is inserted at the end of the input buffer to mark the end of the line, while with the end-of-file character nothing is inserted.
This means when you enter house^D^D on Unix the read system call will first return a buffer of length 5 with the 5 characters house in it. When read is called again to obtain more input, it will then return a buffer of length 0 with no characters in it. Since a zero-length read on a normal file indicates that the end of file has been reached, the gets library function also interprets this as end of file and stops reading the input. However, since it filled the buffer with 5 characters, it doesn't return NULL to indicate that it reached the end of the file. And since it hasn't actually reached end of file, as terminal devices aren't actually files, further calls to gets after this will make further calls to read, which will return any subsequent characters that the user types.
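You can watch those read return values yourself with a small sketch (my illustration, for a Unix terminal in canonical mode): typing house, then Ctrl-D twice, should show a 5-byte read followed by a 0-byte read.
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[80];

    for (int i = 1; i <= 3; i++) {
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        printf("read #%d returned %zd\n", i, n);   /* 5, then 0 for house^D^D */
        if (n <= 0)
            break;                                 /* 0 is what gets() treats as EOF */
    }
    return 0;
}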
On Windows CTRL-Z is handled very differently. The biggest difference is that it's not treated specially by the operating system at all. When you type house^Z^Z^M on Windows, only the carriage return character is given special treatment. Just like on Unix, the carriage return makes the typed line available to the program, though in this case a carriage return and a line feed are added to the buffer to mark the end of the line. So the result is that the ReadFile function returns a 9-byte buffer with the 9 characters house^Z^Z^M^J in it.
It is actually the program itself, specifically the C runtime library, that treats CTRL-Z specially. In the case of the Microsoft C runtime library, when it sees the CTRL-Z character in the buffer returned by ReadFile, it treats it as an end-of-file marker and ignores everything else after it. Using the example in the previous paragraph, gets ends up calling ReadFile to get more input, because the fact that it has seen the CTRL-Z character isn't remembered when reading from the console (or another device) and it hasn't yet seen the end of line (which was ignored). If you then press Enter again, gets will return with the buffer filled with the 7 bytes house^Z\0 (adding a 0 byte to indicate the end of the string). By default, it does much the same thing when reading from normal files: if a CTRL-Z character appears in a file, it and everything after it is ignored. This is for backward compatibility with CP/M, which only supported file lengths that were multiples of 128 and used CTRL-Z to mark where text files were really supposed to end.
Note that both the Unix and Windows behaviours described above are only the normal default handling of user input. The Unix handling of CTRL-D only occurs when reading from a terminal device in canonical mode and it's possible to change the "end-of-file" character to something else. On Windows the operating system never treats CTRL-Z specially, but whether the C runtime library does or not depends on whether the FILE stream being read is in text or binary mode. This is why in portable programs you should always include the character b in the mode string when opening binary files (eg. fopen("foo.gif", "rb")).
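As a small illustration of that last point (my example; the file name data.bin is arbitrary), writing a 0x1A byte into a file and reading it back in binary mode shows every byte coming through; reading the same file in text mode with the Microsoft runtime would stop at the CTRL-Z.
#include <stdio.h>

int main(void) {
    FILE *out = fopen("data.bin", "wb");
    if (!out) return 1;
    fputc('A', out);
    fputc(0x1A, out);                    /* the CTRL-Z byte */
    fputc('B', out);
    fputc('C', out);
    fclose(out);

    FILE *in = fopen("data.bin", "rb");  /* binary mode: no CTRL-Z handling */
    if (!in) return 1;
    int c, count = 0;
    while ((c = fgetc(in)) != EOF)
        count++;
    fclose(in);

    printf("read %d bytes in binary mode\n", count);   /* 4 on any platform */
    return 0;
}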

Understanding behaviour of read() and write()

Hi, I am a student just starting to learn low-level C programming. I tried to understand the read() and write() functions with this program.
#include <unistd.h>
#include <stdlib.h>

int main(void)
{
    char *st;

    st = calloc(sizeof(char), 2);   /* allocate memory for 2 chars */
    read(0, st, 2);                 /* read at most 2 bytes from stdin */
    write(1, st, 2);                /* echo them to stdout */
    return 0;
}
I was expecting it to give a segmentation fault when I tried to input more than 2 characters, but when I execute the program and enter "asdf", after printing "as" as output it executes the "df" command.
I want to know why it doesn't give a segmentation fault when we assign more than 2 chars to a string of size 2, and why it executes the rest of the input (after the first 2 chars) as a command instead of just printing it as output.
Also, reading the man page of read(), I found that read() should give an EFAULT error, but it doesn't.
I am using Linux.
Your read specifically states that it only wants two characters so that's all it gets. You are not putting any more characters into the st area so you won't get any segmentation violations.
As to why it's executing the df part, that doesn't actually happen on my immediate system since the program hangs around until ENTER is pressed, and it appears the program's I/O is absorbing the extra. But that immediate system is Cygwin - see update below for behaviour on a "real" UNIX box.
And you'll only get EFAULT if st is outside your address space or otherwise invalid. That's not the case here.
Update:
Trying this on Ubuntu 9, I see that the behaviour is identical to yours. When I supply the characters asls, the program outputs as and then does a directory listing.
That means your program is only reading the two characters and leaving the rest for the "next" program to read, which is the shell.
Just make sure you don't try entering:
asrm -rf /
(no, seriously, don't do that).
You ask read() to read no more than 2 characters (the third parameter to read()), so it overwrites no more than two characters in the buffer you supplied. That's why there's no reason for any erroneous behavior.
When you read(), you specify how many bytes you want. You won't get more than that unless your libc is broken, so you'll never write beyond the end of your buffer as long as your count is never greater than the size of your buffer. The extra bytes remain in the stream, and the next read() will get them. And if you don't have a next read() in your app, the process that spawned it (which would normally be the shell) may see them, since spawning a console app from the shell involves attaching the shell's input and output streams to the process. Whether the shell sees and gets the bytes depends partly on how much buffering is done behind the scenes by libc, and whether it can/does "unget" any buffered bytes on exit.
With read(0, st, 2); you read 2 chars from standard input.
The rest of what you typed is not acquired by the program, but it is not discarded either, so those keystrokes (df and Enter) go back to the shell from which your program started.
Since you only read 2 characters, there is no problem. The df characters are not consumed by your program, so they stay in the terminal buffer and are consumed by the shell:
your program runs
you type asdf\n
your program reads as and leaves df\n in the tty buffer
you write the content of the st buffer to stdout
your program stops
the shell reads df\n from input and executes the df command.
Fun things to try:
strace your program to trace the system calls: strace -e trace=read,write ./yourprogram
change the read to read(0, st, 5) (and enlarge the buffer accordingly)
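And if you want the leftover characters never to reach the shell, a hedged variant of the program can keep calling read() until the newline has been consumed:
#include <unistd.h>

int main(void) {
    char buf[64];
    ssize_t n;

    /* keep reading (and echoing) until the whole line, newline included,
       has been consumed, so nothing is left over for the shell */
    while ((n = read(0, buf, sizeof buf)) > 0) {
        write(1, buf, (size_t)n);
        if (buf[n - 1] == '\n')
            break;
    }
    return 0;
}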

Resources