The scanf 'maximum field width' includes whitespace? - c

Suppose we have
int n;
sscanf(" 42", "%2d", &n);
Should n be 4 (the whitespace counted against the width in "%2d") or 42 (whitespace ignored, so that scanf reads 3 characters)?
The implementation on ideone reads 3 characters.

The POSIX specification for sscanf()
is fairly clear about the processing:
The format is a character string, … composed of zero or more directives. Each directive is composed of one of the following: one or more white-space characters (<space>, <tab>, <newline>, <vertical-tab>, or <form-feed>); an ordinary character (neither '%' nor a white-space character); or a conversion specification. Each conversion specification is introduced by the character '%' [CX] ⌦ or the character sequence "%n$", ⌫ after which the following appear in sequence:
…
A directive that is a conversion specification defines a set of matching input sequences, as described below for each conversion character. A conversion specification shall be executed in the following steps.
Input white-space characters (as specified by isspace) shall be skipped, unless the conversion specification includes a [, c, C, or n conversion specifier.
An item shall be read from the input, unless the conversion specification includes an n conversion specifier. An input item shall be defined as the longest sequence of input bytes (up to any specified maximum field width, which may be measured in characters or bytes dependent on the conversion specifier) which is an initial subsequence of a matching sequence. The first byte, if any, after the input item shall remain unread. If the length of the input item is 0, the execution of the conversion specification shall fail; this condition is a matching failure, unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.
If white space is skipped by a conversion specification (%…), it is not counted as part of the field width; the skipping occurs before any counting does.
The equivalent specification in C11 §7.21.6.2 The fscanf function is very similar (but it doesn't include the 'C extension' markup, of course).

The scanf 'maximum field width' includes whitespace?
Yes for [ and c.
No for other specifiers.
"%n" does not apply.
The fscanf() (C11dr §7.21.6.2 7-9)
7 ... A conversion specification is executed in the following steps:
8 Input white-space characters (as specified by the isspace function) are skipped, unless the specification includes a [, c, or n specifier.
9 An input item is read from the stream, ... An input item is defined as the longest sequence of input characters which does not exceed any specified field width and ....
The width applies after leading input white-space character consumption.
Further, as I read the spec, if the conversion fails, the leading input white-space characters remain consumed.

From the BSD manual page:
In addition to these flags, there may be an optional maximum field width,
expressed as a decimal integer, between the % and the conversion. If no width is
given, a default of ``infinity'' is used (with one exception, below); otherwise
at most this many bytes are scanned in processing the conversion. In the case of
the lc, ls and l[ conversions, the field width specifies the maximum number of
multibyte characters that will be scanned. Before conversion begins, most conversions skip white space; this white space is not counted against the field
width.
The Linux man page has
An optional decimal integer which specifies the maximum field width. Reading
of characters stops either when this maximum is reached or when a nonmatching
character is found, whichever happens first. Most conversions discard initial
white space characters (the exceptions are noted below), and these discarded
characters don't count toward the maximum field width. String input conversions store a terminating null byte ('\0') to mark the end of the input; the
maximum field width does not include this terminator.
Both pages specify that the skipped whitespace does not count against the field width.

Related

C - format specifier for scanf?

float lat, lon;
char info[50];
scanf("%f, %f, %49[^\n]", &lat, &lon, info);
In the above snippet, what kind of format specifier is %49[^\n].
I understand that it is the format specifier for the character array, which will accept up to 49 characters of input (plus the sentinel \0), and that [^\n] looks like a regex (although I had read somewhere that scanf doesn't support regex) or a character set that expands to "any character that is not a newline \n". Am I correct?
Also, why is there no s in the format specifier for writing into array info?
The program this snippet is from works. But is this good C style?
The specifier %[ is a different conversion specifier from %s, even if it also must be paired with an argument of type char * (or wchar_t *). See e.g. the table here
[set] matches a non-empty sequence of character from set of characters.
If the first character of the set is ^, then all characters not in the set are matched. If the set begins with ] or ^] then the ] character is also included into the set. It is implementation-defined whether the character - in the non-initial position in the scanset may be indicating a range, as in [0-9]. If width specifier is used, matches only up to width. Always stores a null character in addition to the characters matched (so the argument array must have room for at least width+1 characters)
My apologies, I incorrectly answered below. If you can skip to the end, I'll give you the correct answer.
*** Incorrect Answer Begins ***
It would not be a proper format specifier, as there is no type.
%[parameter][flags][width][.precision][length]type
are the rules for a format statement. As you can see, the type is non-optional. The author of this format item is thinking they can combine regex with printf, when the two have entirely different processing rules (and printf doesn't follow regex's patterns).
*** Correct Answer Begins ***
scanf uses different format string rules than printf. Within scanf's man page is this addition to printf's rules:
[
Matches a nonempty sequence of characters from the specified set
of accepted characters; the next pointer must be a pointer to char,
and there must be enough room for all the characters in the string,
plus a terminating null byte. The usual skip of leading white space is
suppressed. The string is to be made up of characters in (or not in) a
particular set; the set is defined by the characters between the open
bracket [ character and a close bracket ] character. The set excludes
those characters if the first character after the open bracket is a
circumflex (^). To include a close bracket in the set, make it the
first character after the open bracket or the circumflex; any other
position will end the set. The hyphen character - is also special;
when placed between two other characters, it adds all intervening
characters to the set. To include a hyphen, make it the last character
before the final close bracket. For instance, [^]0-9-] means the set
"everything except close bracket, zero through nine, and hyphen". The
string ends with the appearance of a character not in the (or, with a
circumflex, in) set or when the field width runs out.
Which basically means that scanf can scan with a subset of regex's rules (the character set subset) but not all of regex's rules

How to use a scanf width specifier of 0?

How to use a scanf width specifier of 0?
1) unrestricted width (as seen with Cygwin gcc version 4.5.3)
2) UB
3) something else?
My application (not shown) dynamically forms the width specifier as part of a larger format string for scanf(). Rarely it would create a "%0s" in the middle of the format string. In this context, the destination string for that %0s has just 1 byte of room for scanf() to store a \0 which with behavior #1 above causes problems.
Note: The following test cases use constant formats.
#include <stdio.h>
#include <string.h>  /* memset(); <memory.h> is a non-standard header */
void scanf_test(const char *Src, const char *Format) {
    char Dest[10];
    int NumFields;
    memset(Dest, '\0', sizeof(Dest));  /* clear the whole buffer, last byte included */
    NumFields = sscanf(Src, Format, Dest);
    printf("scanf:%d Src:'%s' Format:'%s' Dest:'%s'\n", NumFields, Src, Format, Dest);
}
int main(void) {
    scanf_test("1234" , "%s");
    scanf_test("1234" , "%2s");
    scanf_test("1234" , "%1s");
    scanf_test("1234" , "%0s");  /* the zero-width case this question is about */
    return 0;
}
Output:
scanf:1 Src:'1234' Format:'%s' Dest:'1234'
scanf:1 Src:'1234' Format:'%2s' Dest:'12'
scanf:1 Src:'1234' Format:'%1s' Dest:'1'
scanf:1 Src:'1234' Format:'%0s' Dest:'1234'
My question is about the last line. It seems that a 0 width results in no width limitation rather than a width of 0. If this is correct behavior or UB, I'll have to approach the zero width situation another way or are there other scanf() formats to consider?
The maximum field width specifier must be non-zero. C99, 7.19.6.2:
The format shall be a multibyte character sequence, beginning and ending in its initial
shift state. The format is composed of zero or more directives: one or more white-space
characters, an ordinary multibyte character (neither % nor a white-space character), or a
conversion specification. Each conversion specification is introduced by the character %.
After the %, the following appear in sequence:
— An optional assignment-suppressing character *.
— An optional nonzero decimal integer that specifies the maximum field width (in
characters).
— An optional length modifier that specifies the size of the receiving object.
— A conversion specifier character that specifies the type of conversion to be applied.
So, if you use 0, the behavior is undefined.
This came from 7.21.6.2 of n1570.pdf (C11 standard draft):
After the %, the following appear in sequence:
— An optional assignment-suppressing character *.
— An optional decimal integer greater than zero that specifies the
maximum field width (in characters).
...
It's undefined behaviour, because the C standard states that your maximum field width must be greater than zero.
An input item is defined as the longest sequence of input characters
which does not exceed any specified field width and ...
What is it you wish to achieve by reading a field of width 0 and assigning it as a string (empty string) into Dest? Which actual problem are you trying to solve? It seems more clear to just assign like *Dest = '\0';.

Logical inconsistency with [ ] conversion specifier in scanf() in C

Please have a look at this code snippet:
char line1[10], line2[10];
int rtn;
rtn = scanf("%9[a]%9[^\n]", line1, line2);
printf("line1 = %s|\nline2 = %s|\n", line1, line2);
printf("rtn = %d\n", rtn);
Output:
$ gcc line.c -o line
$ ./line
abook
line1 = a|
line2 = book|
rtn = 2
$ ./line
book
line1 = |
line2 = �Js�|
rtn = 0
$
For input abook, %9[a] fails at b from the book and stores previously parsed a+\0 at line1.
Then %9[^\n] parses the remaining line and stores just now parsed book+\0 at line2.
Please note 2 points here:
At the time of storing the parsed input, \0 is appended at the end of it since %[] is a conversion specifier for a string.
When %9[a] failed at b, scanf didn't exit. It simply went on scanning further input.
Now for input book, %9[a] should fail at b from the book and should store just \0 at line1 since here nothing was parsed.
Then %9[^\n] should parse the remaining line and should store just now parsed book+\0 at line2.
Now, let's see what exactly happened:
Here the return value is 0, which means scanf didn't assign a value to any variable. scanf simply exited without assigning anything, hence the garbage data in line2. In the case of line1, that garbage data happens to begin with a null character.
But this is quite strange! Isn't it?
I mean scanf exits if %[...] fails at the very first character of input. (Even if additional conversion specifier is there in scanf statement.)
But if the same %[...] fails at any other character other than first one then scanf simply continues scanning the further input. (If additional conversion specifier is there of course.) It doesn't exit.
So why this inconsistency?
Why not let scanf statement continue scan the input (if additional conversion specifier is there of course) even if %[...] fails at the very first char of input? Exactly like what happens in other case.
Is there any special reason behind this inconsistency?
$ gcc --version
gcc (Ubuntu 4.4.3-4ubuntu5.1) 4.4.3
2) When %9[a] failed at b, scanf didn't exit. It simply went on scanning further input.
Yes, the %9[a] directive means "store up to 9 'a's, but at least one"(1), so the conversion %9[a] did not fail, it succeeded. It found fewer 'a's than it could have consumed, but that's not a failure. The input matching failed at the 'b', but the conversion succeeded.
(1) Specified in 7.21.6.2 (12) where the conversions are described:
[ Matches a nonempty sequence of characters from a set of expected characters (the scanset).
Now for input book, %9[a] should fail at b from the book and should store just '\0' at line1 since here nothing was parsed. Then %9[^\n] should parse the remaining line and should store just now parsed book+\0 at line2.
No. It is supposed to exit when a conversion fails. The first conversion %9[a] failed, so scanf is supposed to stop and return 0, since no conversion succeeded.
Always check the return value of scanf.
That is specified (for fscanf, but scanf is equivalent to fscanf with stdin as input stream) in 7.21.6.2 (16):
The fscanf function returns the value of the macro EOF if an input failure occurs
before the first conversion (if any) has completed. Otherwise, the function returns the
number of input items assigned, which can be fewer than provided for, or even zero, in
the event of an early matching failure.
Here output for line1 is nothing which is exactly what we expected. An empty string!
You can't expect anything. The arrays line1 and line2 aren't initialised, so when the conversion fails, their contents are still indeterminate. In this case, line1 contained no printable character before the first 0 byte.
But for line2 it's garbage chars! We didn't expect this. So how did this happen ?
That's what happened to be the contents of line2. There were never any values assigned to the elements, so they are whatever they happened to be before the call to scanf.
Transferred from comments to the question since the response to the reply question requires more space than the comments allow.
This comment refers to an earlier version of the code:
Since you didn't check the return value from scanf(), you've no idea whether it said "I failed" or not. You can't blame it when you ignore its error returns; in the second example, it will have said '0 items scanned successfully', which means that none of the variables were set to anything useful at all. You must always check the return value from scanf() so you know whether it did what you expected.
The reply question is:
I updated the code and output to show the return value of scanf. And yes for case 2 the return value is 0. But this doesn't answer the question. Clearly scanf exited in case 2. But for case 1, return value is 2 which means scanf successfully assigned values to both the variables. So why this inconsistency?
I don't see any inconsistency. The fscanf() specification (copied from ISO/IEC 9899:2011, but the URL links to POSIX rather than the C standard) says:
¶3 [...] Each conversion specification is introduced by the character %.
After the %, the following appear in sequence:
— An optional assignment-suppressing character *.
— An optional decimal integer greater than zero that specifies the maximum field width
(in characters).
— An optional length modifier that specifies the size of the receiving object.
— A conversion specifier character that specifies the type of conversion to be applied.
Later, it says:
¶8 [...] Input white-space characters (as specified by the isspace function) are skipped, unless
the specification includes a [, c, or n specifier.284)
¶9 An input item is read from the stream, unless the specification includes an n specifier. An
input item is defined as the longest sequence of input characters which does not exceed
any specified field width and which is, or is a prefix of, a matching input sequence.285)
The first character, if any, after the input item remains unread. If the length of the input
item is zero, the execution of the directive fails; this condition is a matching failure unless
end-of-file, an encoding error, or a read error prevented input from the stream, in which
case it is an input failure.
¶12 [...]
[ Matches a nonempty sequence of characters from a set of expected characters
(the scanset).286)
[Bold italic emphasis added. I've left the footnote references in place, but the contents of the footnotes are not material to the discussion so I've omitted them.]
So, the behaviour you are seeing is exactly what the standard demands. When %9[a] is applied to the string abook, there is a sequence of one a which matches the %9[a] conversion specification, so the directive is successful, and the scan continues with book. When %9[a] is applied to the string book, there are zero characters matching the item, so the execution of the directive fails and it is a matching error and since it is the first conversion specification, the return value of 0 is correct.
Note that the length specifies a maximum field width, so the 9 in %9[a] means 1-9 letters a.

Is scanf's "regex" support a standard?

Is scanf's "regex" support a standard? I can't find the answer anywhere.
This code works in gcc but not in Visual Studio:
scanf("%[^\n]",a);
Is it a Visual Studio fault or a gcc extension?
EDIT: It looks like VS works, but one has to account for the difference in line endings between Linux and Windows (\r\n).
That particular format string should work fine in a conforming implementation. The [ character introduces a scanset for matching a non-empty set of characters (with the ^ meaning that the scanset is an inversion of the characters supplied). In other words, the format specifier %[^\n] should match every character that's not a newline.
From C99 7.19.6.2, slightly paraphrased:
The [ format specifier matches a nonempty sequence of characters from a set of expected characters (the scanset). If no l length modifier is present, the corresponding argument shall be a pointer to the initial element of a character array large enough to accept the sequence and a terminating null character, which will be added automatically.
If an l length modifier is present, the input shall be a sequence of multibyte characters that begins in the initial shift state. Each multibyte character is converted to a wide character as if by a call to the mbrtowc function, with the conversion state described by an mbstate_t object initialized to zero
before the first multibyte character is converted. The corresponding argument shall be a pointer to the initial element of an array of wchar_t large enough to accept the sequence and the terminating null wide character, which will be added automatically.
The conversion specifier includes all subsequent characters in the format string, up to and including the matching right bracket ]. The characters between the brackets (the scanlist) compose the scanset, unless the character after the left bracket is a circumflex ^, in which case the scanset contains all
characters that do not appear in the scanlist between the circumflex and the right bracket. If the conversion specifier begins with [] or [^], the right bracket character is in the scanlist and the next following right bracket character is the matching right bracket that ends the specification; otherwise the first following right bracket character is the one that ends the specification. If a - character is in the scanlist and is not the first, nor the second where the first character is a ^, nor the last character, the behavior is implementation-defined.
It's possible, if MSVC isn't working correctly, that this is just one of the many examples where Microsoft either don't conform to the latest standard, or think they know better :-)
The "%[" format spec for scanf() is standard and has been since C90.
MSVC does support it.
You can also provide a field width in the format spec to provide safety against buffer overruns:
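#include <stdio.h>
#include <string.h>
int main(void)
{
    char buf[9];
    /* Width 8 leaves room for the terminating null character. */
    if (scanf("%8[^\n]", buf) != 1)
        return 1;  /* empty line or end of input: buf was not assigned */
    printf("%s\n", buf);
    printf("strlen(buf) == %zu\n", strlen(buf));  /* strlen returns size_t */
    return 0;
}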
Also note that the "%[" format spec doesn't mean that scanf() supports regular expressions. That particular format spec is similar to a capability of regexes (and no doubt was influenced by regex), but it's far more limited than regular expressions.

usage of % [^\n]

A[50][5000];
for(i=0;i<50;++i)
scanf("%[\n]",A[i]);
%[^\n]
What is the usage and meaning of it?
And can I use that construct with other characters, like
%[\t]
%[\a]
scanf()'s "%[" conversion specifier starts what's called a "scanset". It has some similarities to the regex construct that looks the same (but it is still quite different). Here's what the standard says:
Matches a nonempty sequence of characters from a set of expected characters (the scanset).
...
The conversion specifier includes all subsequent characters in the format string, up to and including the matching right bracket (]). The characters between the brackets (the scanlist) compose the scanset, unless the character after the left bracket is a circumflex (^), in which case the scanset contains all characters that do not appear in the scanlist between the circumflex and the right bracket. If the conversion specifier begins with [] or [^], the right bracket character is in the scanlist and the next following right bracket character is the matching right bracket that ends the specification; otherwise the first following right bracket character is the one that ends the specification. If a - character is in the scanlist and is not the first, nor the second where the first character is a ^, nor the last character, the behavior is implementation-defined.
So the scanf() conversion "%[\n]" will match a newline character, while "%[^\n]" will match all characters up to a newline.
Here's what P.J. Plauger has to say about scansets in "The Standard C Library":
A scan set behaves much like the s conversion specifier. It stores up to w characters (default is the rest of the input) in the char array pointed at by ptr. It always stores a null character after any input. It does not skip leading white-space. It also lets you specify what characters to consider as part of the field. You can specify all the characters that match, as in %[0123456789abcdefABCDEF], which matches an arbitrary sequence of hexadecimal digits. Or you can specify all the characters that do not match, as in %[^0123456789] which matches any characters other than digits.
If you want to include the right bracket (]) in the set of characters you specify, write it immediately after the opening [ (or [^), as in %[][] which scans for square brackets. You cannot include the null character in the set of characters you specify. Some implementations may let you specify a range of characters by using a minus sign (-). The list of hexadecimal digits, for example, can be written as %[0-9abcdefABCDEF] or even, in some cases, as %[0-9a-fA-F]. Please note, however, that such usage is not universal. Avoid it in a program that you wish to keep maximally portable.
Yes, it's pretty much like a set in a regular expression -- you can specify a set of characters to be accepted, or a set of characters to end the scan, so "%[^ \r\n\t]" would read until it encountered a space, carriage return, new-line or tab. Like with an RE, the leading "^" means "not" -- you can omit it to specify the characters that will be accepted instead of those that will end the conversion. With most compilers (though it's not technically required) you can specify ranges, such as "%[a-z]" to specify any lower-case letter (in this case, where the '-' isn't the first or last character, the behavior is implementation defined).
Though not widely used (or even known) this conversion has been part of C almost forever, and is supported in C89/90.
%[^\n] copies a string up to (but not including) a newline from standard input to element i of A. As written, with no field width, this acts almost like gets().

Resources