The scanf function, the specifer %s and the new line - c

I read into C11 standard this:
Input white-space characters (as specified by the isspace function) are
skipped, unless the specification includes a [, c, or n specifier.
so I understand that if I use that specifiers the next scanf can contains for example a new line.
But if I write this:
char buff[5 + 1];
printf("Input: ");
scanf("%10s", buff);
printf("Input: ");
char buff_2[5 + 1];
scanf("%[abcde]", buff_2);
and then I input, i.e., RR and then Return,
the next scanf fails because of \n.
So also %s doesn't discard a new line?

So also %s doesn't discard a new line?
%s tells scanf to discard any leading whitespace, including newlines. It will then read any non-whitespace characters, leaving any trailing whitespace in the input buffer.
So assuming your input stream looks like "\n\ntest\n", scanf("%s", buf) will discard the two leading newlines, consume the string "test", and leave the trailing newline in the input stream, so after the call the input stream looks like "\n".
Edit
Responding to xdevel2000's comment here.
Let's talk about how conversion specifiers work. Here are some relevant paragraphs from the online C 2011 standard:
7.21.6.2 The fscanf function
...
9 An input item is read from the stream, unless the specification includes an n specifier. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.285)
The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.
10 Except in the case of a % specifier, the input item (or, in the case of a %n directive, the
count of input characters) is converted to a type appropriate to the conversion specifier. If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure. Unless assignment suppression was indicated by a *, the result of the conversion is placed in the object pointed to by the first argument following
the format argument that has not already received a conversion result. If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined.
12 The conversion specifiers and their meanings are:
...
c Matches a sequence of characters of exactly the number specified by the field width (1 if no field width is present in the directive).286)
...
s Matches a sequence of non-white-space characters.286)
...
[ Matches a nonempty sequence of characters from a set of expected characters
(the scanset).286)
...
285) fscanf pushes back at most one input character onto the input stream. Therefore, some sequences that are acceptable to strtod, strtol, etc., are unacceptable to fscanf.
286) No special provisions are made for multibyte characters in the matching rules used by the c, s, and [
conversion specifiers — the extent of the input field is determined on a byte-by-byte basis. The
resulting field is nevertheless a sequence of multibyte characters that begins in the initial shift state.
%s matches a sequence of non-whitespace characters. Here's a basic algorithm describing how it works (not taking into account end of file or other exceptional conditions):
c <- next character from input stream
while c is whitespace
c <- next character from input stream
while c is not whitespace
append c to target buffer
c <- next character from input stream
push c back onto input stream
append 0 terminator to target buffer
The first whitespace character after the non-whitespace characters (if any) is pushed back onto the input stream for the next input operation to read.
By contrast, the algorithm for the %c specifier is dead simple (unless you're using a field width greater than 1, which I've never done and won't get into here):
c <- next character from input stream
write c to target
The algorithm for the %[ conversion specifier is a little different:
c <- next character from input stream
while c is in the list of characters in the scan set
append c to target buffer
c <- next character from input stream
append 0 to target buffer
push c back onto input stream
So, it's a mistake to describe any conversion specifier as "retaining" trailing whitespace (which would imply that the trailing whitespace is saved to the target buffer); that's not the case. Trailing whitespace is left in the input stream for the next input operation to read.

%s consumes everything until a whitespace character and discards leading whitespace characters not trailing ones. The [ conversion specifier in the second scanf does not skip leading whitespace characters and therefore, fails to scan because of the newline character(which is a whitespace character) left over by the first scanf.
To fix the issue, either use
int c;
while((c=getchar())!='\n' && c!=EOF);
After the first scanf to clear the stdin or add a space before the format specifier(%[) in the second scanf.

Your excerpt from the standard omits important context. The preceding text specifies that skipping whitespace is the first step in processing a conversion specifier for a type other than c, [, or n.
The next step, other than for an n specifier, is to read an input item, which is defined as "the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence" (quoted from C99, but equivalent applies to C2011).
An s item "[m]atches a sequence of non-white-space characters", so with the input you specify, the first scanf() reads everything up to, but not including, the newline.
The standard explicitly specifies
Trailing white space (including new-line characters) is left unread unless matched by a directive.
so the newline definitely remains unscanned at this point.
The format given to the next scanf() starts with a %[ conversion specifier, which, as you already observed, does not cause whitespace (leading or otherwise) to be skipped, though it can include whitespace in the item that is scanned. Since the next character available from the input is a newline, however, and the given scan set for your %[ does not include that character, zero characters are scanned for that item. Going back to the standard (C99, again):
If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.
There are easier ways to read free-form input line by line, but you can do it with scanf() if you must. For example:
char buff[10 + 1] = {0};
printf("Input: ");
/*
* Ignore leading whitespace and scan a string of up to 10 non-whitespace
* characters. Zero-length inputs will produce a matching failure, leaving
* the buffer unchanged (and initialized to an empty string). End of
* input will produce an input error, which is ignored.
*/
scanf("%10s", buff);
/* Scan and ignore anything else up to a newline. There will
* be an (ignorable) matching failure if the next available character is a
* newline. Any input error generated by this call is also ignored.
*/
scanf("%*[^\n]");
/*
* Consume the next character, if any. If there is one, it will be a
* newline. An input error will occur if we're already at the end of stdin;
* a careful program would test for that (by comparing the return value to
* EOF) but this one doesn't.
*/
scanf("%*c");
printf("Input: ");
/* scan the second string; again, we're ignoring matching and input errors */
char buff_2[5 + 1] = {0};
scanf("%5[abcde]", buff_2);
If you're exclusively using scanf() for such a job then it is essential to read each line in three steps, as shown, because each one can produce a matching failure that would prevent any attempt to match subsequent items.
Note, too, how maximum field widths are matched to buffer sizes in that example, which your original code did not do correctly.

Related

scanf function for reading full line input using ^\n

What is the difference between
scanf(" %[^\n]", str_1);
and
scanf("%[^/n],str_2);
I am trying to read a full line as an input using scanf function, the first one works fine but as soon as I remove the space between " and % the function takes no input.
scanf(" %[^\n]", str_1);
This will skip whitespace (until it finds a non-whitespace character) and then start reading characters into str_1 until it reaches a newline or EOF. It will not do any bounds checking, so may overflow the storage for str_1. If an EOF is read before finding a non-whitespace character, it will return 0 without writing anything into str_1 (not even a NUL character).
scanf("%[^/n],str_2);
This will read characters other than / and n into str_2 until it sees either a / or n or an EOF. Like the first one, it does not do any bounds checking so may (will likely?) overflow the storage of str_2. If the first character of input is either a / or a n (or EOF) it will fail to match the pattern at all, returning 0 and not storing anything into str_2 (not even a NUL terminator).
Basic scanf semantics -- whitspace in the format string will skip 0 or more whitespace characters in the input until a non-whitespace character is reached to start whatever the next pattern is. %[ matches characters with no bounds checking, so should never be used with untrusted input -- you should always use an explicit bound or m modifier.
If you want to read lines of input (as opposed to whitespace delimited words, ignoring differences between newlines and other whitespace), you should use fgets or getline

Question about format string in scanf function

For each of the following pairs of scanf format strings, indicate whether or not the two strings are equivalent. If they're not, show how they can be distinguished:
(b) "%d-%d-%d" versus "%d -%d -%d"
So in this case, my answer was that they were not equivalent. Because non-white-space characters except conversion specifier which start with %, cannot be preceded by spaces, it will not match with the non-white-space character. So in the first case, no spaces will be allowed after the first and second integer, while in the second case, any number of spaces will be allowed after the first 2 integers.
But I saw that the book had a different answer. It said that they were both equivalent to each other.
Is this the mistake of the book? Or am I just wrong with the concept of format string in the scanf function?
The book is wrong. As per the specification of the scanf():
Whitespace character: the function will read and ignore any whitespace characters encountered before the next non-whitespace character (whitespace characters include spaces, newline and tab characters -- see isspace). A single whitespace in the format string validates any quantity of whitespace characters extracted from the stream (including none).
Non-whitespace character, except format specifier (%): Any character that is not either a whitespace character (blank, newline or tab) or part of a format specifier (which begin with a % character) causes the function to read the next character from the stream, compare it to this non-whitespace character and if it matches, it is discarded and the function continues with the next character of format. If the character does not match, the function fails, returning and leaving subsequent characters of the stream unread.
So in first case when scanf arrives to the %d and gets the input, next is the - which means that scanf will expect next in the stream to see the non-whitespae character - and not any other whitespace character. So the legal input is 1- 2, but not 1 -2
In the second case, after first %d, scanf will allow the whitespace and than will arrive to non-whitespace, so it will allow the input 1 - 2 by the above definitions.
"%d-%d-%d" differs from "%d -%d -%d" and the difference has nothing to do with "%d".
Format "-" scans over input "-" and stops on the first space of input " -".
Format " -" scans over inputs "-" and " -" as the " " in the format matches 0 or more white-space characters in the input.
A directive composed of white-space character(s) is executed by reading input up to the first nonwhite-space character (which remains unread), or until no more characters can be read. The directive never fails. C17dr § 7.21.6.2 5
Had the question been: "%d-%d-%d" versus "%d- %d- %d",
These 2 are functionally identical.
We would need to dive input arcane stdin input errors to divine a potential difference.

Details for reading strings in C?

Can someone explain how does this work? I know that it reads a string(with spaces), but I don't really understand the mechanism behind. Can someone explain
it to me piece by piece?
scanf("%[^\n]%*c",string);
scanf("%[^\n]%*c",string);
This says "read everything until a newline character and then read one character". * (in %*c) is to suppresses the assignment. That's it reads a line and consumes the newline character.
From scanf():
An optional '*' assignment-suppression character: scanf() reads
input as directed by the conversion specification, but discards the
input. No corresponding pointer argument is required, and this
specification is not included in the count of successful assignments
returned by scanf().
But I'd, for reading lines, use fgets() instead and consume the trailing newline afterwards.
char string[256];
if (fgets(string, sizeof string, stdin) == NULL) {
/* handle failure */
}
string[strcspn(string, "\n")] = 0; /* For removing the newline if present */
which is less error prone and better to understand.
If you are using glibc, you can use getline() as well.
From the man page:
All conversions are introduced by the % (percent sign) character
A conversion is how we match certain text strings. For example a %s matches a string, and %d matches a decimal integer. So looking at your string, we have a "%[ conversion, which according to the man page:
[ Matches a nonempty sequence of characters from the specified set of accepted characters; ... the set is defined by the characters between the open bracket [ character and a close bracket ] character.
So this conversion is going to define a list of characters which will be matched and read into your string. Important to this is:
The set excludes those characters if the first character after the open bracket is a circumflex ^.
And if you look at your string "%[^\n]%*c" you've got %[^\n] which means that you're matching any character until you hit a newline character.
Next, you have a %* the star is a conversion which ignores what matches after it. From the man page:
Suppresses assignment. The conversion that follows occurs as usual, but no pointer is used; the result of the conversion is simply discarded.
So if you look at your last match, you've got a c,
c Matches a sequence of width count characters (default 1);
so the %*c means that you're going to match 1 character, and then discard it (the character being matched is the newline - which the %[^\n] didn't consume because you matched everything up to that newline), it won't be stored in your string variable.
Reading the man page is your friend. I hope this helps.

What is the difference between these two scanf statements?

I am having some doubt. The doubt is
What is the difference between the following two scanf statements.
scanf("%s",buf);
scanf("%[^\n]", buf);
If I am giving the second scanf in the while loop, it is going infinitely. Because the \n is in the stdin.
But in the first statement, reads up to before the \n. It also will not read the \n.
But The first statement does not go in infinitely. Why?
Regarding the properties of the %s format specifier, quoting C11 standrad, chapter §7.21.6.2, fscanf()
s Matches a sequence of non-white-space characters.
The newline is a whitespace character, so only a newlinew won't be a match for %s.
So, in case the newline is left in the buffer, it does not scan the newline alone, and wait for the next non-whitespace input to appear on stdin.
The %s format specifier specifies that scanf() should read all characters in the standard input buffer stdin until it encounters the first whitespace character, and then stop there. The whitespace ('\n') remains in the stdin buffer until consumed by another function, like getchar().
In the second case there is no mention of stopping.
You can think of scanf as extracting words separated by whitespace from a stream of characters. Imagine reading a file which contains a table of numbers, for example, without worrying about the exact number count per line or the exact space count and nature between numbers.
Whitespace, for the record, is horizontal and vertical (these exist) tabs, carriage returns, newlines, form feeds and last not least, actual spaces.
In order to free the user from details, scanf treats all whitespace the same: It normally skips it until it hits a non-whitespace and then tries to convert the character sequence starting there according to the specified input conversion. E.g. with "%d" it expects a sequence of digits, perhaps preceded by a minus sign.
The input conversion "%s" also starts with skipping whitespace (and that's clearer documented in the opengroup's man page than in the Linux one).
After skipping leading whitespace, "%s" accepts everything until another whitespace is read (and put back in the input, because it isn't made part of the "word" being read). That sequence of non-whitespace chars -- basically a "word" -- is stored in the buffer provided. For example, scanning a string from " a bc " results in skipping 3 spaces and storing "a" in the buffer. (The next scanf would skip the intervening space and put "bc" in the buffer. The next scanf after that would skip the remaining whitespace, encounter the end of file and return EOF.) So if a user is asked to enter three words they could give three words on one line or on three lines or on any number of lines preceded or separated by any number of empty lines, i.e. any number of subsequent newlines. Scanf couldn't care less.
There are a few exceptions to the "skip leading whitespace" strategy. Both concern conversions which usually indicate that the user wants to have more control about the input conversion. One of them is "%c" which just reads the next character. The other one is the "%[" spec which details exactly which characters are considered part of the next "word" to read. The conversion specification you use, "%[^\n]", reads everything except newline. Input from the keyboard is normally passed to a program line by line, and each line is by definition terminated by a newline. The newline of the first line passed to your program will be the first character from the input stream which does not match the conversion specification. Scanf will read it, inspect it and then put it back in the input stream (with ungetc()) for somebody else to consume. Unfortunately, it will itself be the next consumer, in another loop iteration (as I assume). Now the very first character it encounters (the newline) does not match the input conversion (which demands anything but the newline). Scanf therefore gives up immediately, puts the offending character dutifully back in the input for somebody else to consume and returns 0 indicating the failure to even perfom the very first conversion in the format string. But alas, it itself will be the next consumer. Yes, machines are stupid.
First scanf("%s",buf); scan only word or string, but second scanf("%[^\n]", buf); reads a string until a user inputs is new line character.
Let's take a look at these two code snippets :
#include <stdio.h>
int main(void){
char sentence[20] = {'\0'};
scanf("%s", sentence);
printf("\n%s\n", sentence);
return 0;
}
Input : Hello, my name is Claudio.
Output : Hello
#include <stdio.h>
int main(void){
char sentence[20] = {'\0'};
scanf("%[^\n]", sentence);
printf("\n%s\n", sentence);
return 0;
}
Input : Hello, my name is Claudio.
Output : Hello, my name is Claudio.
%[^\n] is an inverted group scan and this is how I personally use it, as it allows me to input a sentece with blank spaces in it.
Common
Both expect buf to be a pointer to a character array. Both append a null character to that array if at least 1 character was saved. Both return 1 if something was saved. Both return EOF if end-of-file detected before saving anything. Both return EOF in input error is detected. Both may save buf with embedded '\0' characters in it.
scanf("%s",buf);
scanf("%[^\n]", buf);
Differences
"%s" 1) consumes and discards leading white-space including '\n', space, tab, etc. 2) then saves non-white-space to buf until 3) a white-space is detected (which is then put back into stdin). buf will not contain any white-space.
"%[^\n]" 1) does not consume and discards leading white-space. 2) it saves non-'\n' characters to buf until 3) a '\n' is detected (which is then put back into stdin). If the first character read is a '\n', then nothing is saved in buf and 0 is returned. The '\n' remains in stdin and explains OP's infinite loop.
Failure to test the return value of scanf() is a common code oversight. Better code checks the return value of scanf().
IMO: code should never use either:
Both fail to limit the number of characters read. Use fgets().
You can think of %s as %[^\n \t\f\r\v], that is, after skipping any leading whitespace, a group a non-whitespace characters.

Reading tabs in C

I am trying to sscanf from a file. The pattern I am trying to match is the following
"%s\t%s\t%s\t%f"
Thing is that I am surprised because for an input like following:
Hello Hola Hallo 5.344434
it is reading all of the data properly...
Do you know why?
I was expecting it to be finding tabs like |---|---|---|---| not that only one space was matching.
Thanks
The standard reads:
A directive composed of white-space character(s) is executed by
reading input up to the first non-white-space character (which remains
unread), or until no more characters can be read.
In other words, a sequence of white-space characters (space, tab, newline, etc.; as defined by isspace()) in the format string matches any amount of white space in the input.
No way - scanf treat all white-space identically - they're used as delimiter, and just ignored. So if you really want to doing something with tab space, you should parse it yourself.
To parse, you need to read the whole line without any parsing, unlike scanf. So, you need to use fgets.
FILE *fp = /* init.. */;
char buf[1024];
fgets(buf, 1024, fp);
// parse yourself!
If you take a look at the documentation for scanf:
C string that contains a sequence of characters that control how characters extracted from the stream are treated:
Whitespace character: the function will read and ignore any whitespace characters
encountered before the next non-whitespace character (whitespace characters include
spaces, newline and tab characters -- see isspace). A single whitespace in the format
string validates any quantity of whitespace characters extracted from the stream
(including none).
Non-whitespace character, except format specifier (%): Any character that is not
either a whitespace character (blank, newline or tab) or part of a format specifier
(which begin with a % character) causes the function to read the next character
from the stream, compare it to this non-whitespace character and if it matches,
it is discarded and the function continues with the next character of format. If the
character does not match, the function fails, returning and leaving subsequent
characters of the stream unread.
Format specifiers: A sequence formed by an initial percentage sign (%) indicates a
format specifier, which is used to specify the type and format of the data to be
retrieved from the stream and stored into the locations pointed by the additional
arguments.
You will notice that the whitespace characters get ignored.
Did you carefully read scanf(3) documentation? You need to read the entire line using getline(3) then parse that line "manually"!

Resources