Is scanf's "regex" support a standard? I can't find the answer anywhere.
This code works in gcc but not in Visual Studio:
scanf("%[^\n]",a);
Is it a Visual Studio fault or a gcc extension?
EDIT: It looks like VS works, but you have to consider the difference in line endings between Linux and Windows (\r\n).
That particular format string should work fine in a conforming implementation. The [ character introduces a scanset for matching a non-empty set of characters (with the ^ meaning that the scanset is an inversion of the characters supplied). In other words, the format specifier %[^\n] should match every character that's not a newline.
From C99 7.19.6.2, slightly paraphrased:
The [ format specifier matches a nonempty sequence of characters from a set of expected characters (the scanset). If no l length modifier is present, the corresponding argument shall be a pointer to the initial element of a character array large enough to accept the sequence and a terminating null character, which will be added automatically.
If an l length modifier is present, the input shall be a sequence of multibyte characters that begins in the initial shift state. Each multibyte character is converted to a wide character as if by a call to the mbrtowc function, with the conversion state described by an mbstate_t object initialized to zero before the first multibyte character is converted. The corresponding argument shall be a pointer to the initial element of an array of wchar_t large enough to accept the sequence and the terminating null wide character, which will be added automatically.
The conversion specifier includes all subsequent characters in the format string, up to and including the matching right bracket ]. The characters between the brackets (the scanlist) compose the scanset, unless the character after the left bracket is a circumflex ^, in which case the scanset contains all characters that do not appear in the scanlist between the circumflex and the right bracket. If the conversion specifier begins with [] or [^], the right bracket character is in the scanlist and the next following right bracket character is the matching right bracket that ends the specification; otherwise the first following right bracket character is the one that ends the specification. If a - character is in the scanlist and is not the first, nor the second where the first character is a ^, nor the last character, the behavior is implementation-defined.
It's possible, if MSVC isn't working correctly, that this is just one of the many examples where Microsoft either don't conform to the latest standard, or think they know better :-)
The "%[" format spec for scanf() is standard and has been since C90.
MSVC does support it.
You can also provide a field width in the format spec to provide safety against buffer overruns:
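#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[9];   /* room for 8 characters plus the terminating null */
    if (scanf("%8[^\n]", buf) != 1)   /* fails on an empty line or at end of input */
        return 1;
    printf("%s\n", buf);
    /* strlen() returns size_t, so cast it to match %u */
    printf("strlen(buf) == %u\n", (unsigned)strlen(buf));
    return 0;
}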
Also note that the "%[" format spec doesn't mean that scanf() supports regular expressions. That particular format spec is similar to one capability of regexes (and was no doubt influenced by them), but it's far more limited than regular expressions.
Related
In a C-scanf format, how do I specify, that I want a character ^?
"%[^]" does not work with GNU scanf, because ^ at start has the negation meaning.
scanf is it possible to specify a string consisting of a number of ^ characters?
Doing a simple %[^] is impossible.
The %[^] is actually invalid - the initial [ is not closed. A %[^]] is interpreted as all characters except ].
Assuming %[^] would be valid, then it would present an ambiguity: %[^]] could be interpreted as a string consisting only of ^, followed by a ]. Or imagine something like %[^]abc]. I believe the ability to scan strings consisting only of ^ was sacrificed to give ^ its functionality, which is a reasonable sacrifice.
To solve the problem in practice, do not use scanf and write it yourself. Or you could do something like "%[\01^]" - also scan for something else that will not be in the input, like a 0x01 byte.
From C99 7.19.6.2 (and this pdf) (emphasis mine):
[
[...]
The conversion specifier includes all subsequent characters in the format string, up to and including the matching right bracket (]). The characters between the brackets (the scan list) compose the scanset, unless the character after the left bracket is a circumflex (^), in which case the scanset contains all characters that do not appear in the scanlist between the circumflex and the right bracket. If the conversion specifier begins with [] or [^], the right bracket character is in the scanlist and the next following right bracket character is the matching right bracket that ends the specification; otherwise the first following right bracket character is the one that ends the specification. [...]
So if the conversion is %[] or %[^] then the ] is in the scanlist, and the next ] will end the scanlist.
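As a workaround, you could negate a scanlist that names every character except ^, which leaves only ^ in the scanset, although that is impractical to write out. Note that %[^^] does not do this: it is a negated scanset whose scanlist is ^, so it matches every character except ^.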
float lat, lon;
char info[50];
scanf("%f, %f, %49[^\n]", &lat, &lon, info);
In the above snippet, what kind of format specifier is %49[^\n]?
I do understand that it is the format specifier for the character array which is going to accept input up to 49 characters (+ the sentinel \0), and [^\n] looks like it's a regex (although I had read somewhere that scanf doesn't support regex) OR a character set which is to expand to "any other character" that is NOT "newline" \n. Am I correct?
Also, why is there no s in the format specifier for writing into array info?
The program this snippet is from works. But is this good C style?
The specifier %[ is a different conversion specifier from %s, even if it also must be paired with an argument of type char * (or wchar_t *). See e.g. the table here
[set] matches a non-empty sequence of character from set of characters.
If the first character of the set is ^, then all characters not in the set are matched. If the set begins with ] or ^] then the ] character is also included into the set. It is implementation-defined whether the character - in the non-initial position in the scanset may be indicating a range, as in [0-9]. If width specifier is used, matches only up to width. Always stores a null character in addition to the characters matched (so the argument array must have room for at least width+1 characters)
My apologies, I incorrectly answered below. If you can skip to the end, I'll give you the correct answer.
*** Incorrect Answer Begins ***
It would not be a proper format specifier, as there is no type.
%[parameter][flags][width][.precision][length]type
are the rules for a format statement. As you can see, the type is non-optional. The author of this format item is thinking they can combine regex with printf, when the two have entirely different processing rules (and printf doesn't follow regex's patterns).
*** Correct Answer Begins ***
scanf uses different format string rules than printf. Within scanf's man page is this addition to printf's rules:
[
Matches a nonempty sequence of characters from the specified set
of accepted characters; the next pointer must be a pointer to char,
and there must be enough room for all the characters in the string,
plus a terminating null byte. The usual skip of leading white space is
suppressed. The string is to be made up of characters in (or not in) a
particular set; the set is defined by the characters between the open
bracket [ character and a close bracket ] character. The set excludes
those characters if the first character after the open bracket is a
circumflex (^). To include a close bracket in the set, make it the
first character after the open bracket or the circumflex; any other
position will end the set. The hyphen character - is also special;
when placed between two other characters, it adds all intervening
characters to the set. To include a hyphen, make it the last character
before the final close bracket. For instance, [^]0-9-] means the set
"everything except close bracket, zero through nine, and hyphen". The
string ends with the appearance of a character not in the (or, with a
circumflex, in) set or when the field width runs out.
Which basically means that scanf can scan with a subset of regex's rules (the character set subset), but not all of regex's rules.
Say I have a file dog.txt
The quick brown fox jumps over the lazy dog.
I can read from the file like this
# include <stdio.h>
int main(){
char str[10];
FILE *fp;
fp = fopen("dog.txt", "r");
fscanf(fp, "%[ABCDEFGHIJKLMNOPQRSTUVWXYZ]", str);
printf("%s\n", str);
return 0;
}
and the program will output T. However instead of listing all the letters, can I utilize the POSIX Character Classes, something like [:upper:] ?
No, there's no portable way to do it. Some implementations allow you to use character ranges like %[A-Z], but that's not guaranteed by the C standard. C99 §7.19.6.2/12 says this about the [ conversion specifier (emphasis added):
The conversion specifier includes all subsequent characters in the format string, up to and including the matching right bracket (]). The characters between the brackets (the scanlist) compose the scanset, unless the character after the left bracket is a circumflex (^), in which case the scanset contains all characters that do not appear in the scanlist between the circumflex and the right bracket. If the conversion specifier begins with [] or [^], the right bracket character is in the scanlist and the next following right bracket character is the matching right bracket that ends the specification; otherwise the first following right bracket character is the one that ends the specification. If a - character is in the scanlist and is not the first, nor the second where the first character is a ^, nor the last character, the behavior is implementation-defined.
The POSIX.1-2008 description has almost identical wording (and even defers to the ISO C standard in case of accidental conflict), so there are no additional guarantees in this case when using a POSIX system.
No, you can't. This is what you can do with []:
The conversion specification includes all subsequent bytes in the format string up to and including the matching <right-square-bracket> (']'). The bytes between the square brackets (the scanlist) comprise the scanset, unless the byte after the <left-square-bracket> is a <circumflex> ('^'), in which case the scanset contains all bytes that do not appear in the scanlist between the <circumflex> and the <right-square-bracket>. If the conversion specification begins with "[]" or "[^]" , the <right-square-bracket> is included in the scanlist and the next <right-square-bracket> is the matching <right-square-bracket> that ends the conversion specification; otherwise, the first <right-square-bracket> is the one that ends the conversion specification. If a '-' is in the scanlist and is not the first character, nor the second where the first character is a '^' , nor the last character, the behavior is implementation-defined.
(POSIX standard for scanf. The C standard has similar wording, see Adam Rosenfield's answer.)
So, depending on the implementation, you might be able to do fscanf(fp, "%[A-Z]", str), but there's no guarantee that it will work on every POSIX system. In any case, [:upper:] inside a scanset is not a character class: it is just the characters :, u, p, e and r, the same as [:epru].
Try this:
fscanf(fp, "%[A-Z]", str);
sscanf(line, "%d %64[^\n", &seconds, message);
does %64[^ mean - up to 64 characters?
Should it work with GNU C Compiler?
It means "read at most 64 characters or stop when reaching a newline, whichever comes first". It's specified by the standard so all standard libraries have to support it.
C11 7.21.6.2
[ Matches a nonempty sequence of characters from a set of expected
characters (the scanset).
[...]
The conversion specifier includes all subsequent characters in the
format string, up to and including the matching right bracket (]).
The characters between the brackets (the scanlist) compose the
scanset, unless the character after the left bracket is a circumflex
(^), in which case the scanset contains all characters that do not
appear in the scanlist between the circumflex and the right bracket.
As noted in the comments, a matching ] is required to delimit the scanlist; the format string in the question omits it and should read "%d %64[^\n]". An s specifier is not required.
A[50][5000];
for(i=0;i<50;++i)
scanf("%[\n]",A[i]);
%[^\n]
What is the usage and meaning of it?
And can I use that construct like
%[\t]
%[\a]
scanf()'s "%[" conversion specifier starts what's called a "scanset". It has some similarities to the regex construct that looks the same (but it is still quite different). Here's what the standard says:
Matches a nonempty sequence of characters from a set of expected characters (the scanset).
...
The conversion specifier includes all subsequent characters in the format string, up to and including the matching right bracket (]). The characters between the brackets (the scanlist) compose the scanset, unless the character after the left bracket is a circumflex (^), in which case the scanset contains all characters that do not appear in the scanlist between the circumflex and the right bracket. If the conversion specifier begins with [] or [^], the right bracket character is in the scanlist and the next following right bracket character is the matching right bracket that ends the specification; otherwise the first following right bracket character is the one that ends the specification. If a - character is in the scanlist and is not the first, nor the second where the first character is a ^, nor the last character, the behavior is implementation-defined.
So the scanf() conversion "%[\n]" will match a newline character, while "%[^\n]" will match all characters up to a newline.
Here's what P.J. Plauger has to say about scansets in "The Standard C Library":
A scan set behaves much like the s conversion specifier. It stores up to w characters (default is the rest of the input) in the char array pointed at by ptr. It always stores a null character after any input. It does not skip leading white-space. It also lets you specify what characters to consider as part of the field. You can specify all the characters that match, as in %[0123456789abcdefABCDEF], which matches an arbitrary sequence of hexadecimal digits. Or you can specify all the characters that do not match, as in %[^0123456789] which matches any characters other than digits.
If you want to include the right bracket (]) in the set of characters you specify, write it immediately after the opening [ (or [^), as in %[][] which scans for square brackets. You cannot include the null character in the set of characters you specify. Some implementations may let you specify a range of characters by using a minus sign (-). The list of hexadecimal digits, for example, can be written as %[0-9abcdefABCDEF] or even, in some cases, as %[0-9a-fA-F]. Please note, however, that such usage is not universal. Avoid it in a program that you wish to keep maximally portable.
Yes, it's pretty much like a set in a regular expression -- you can specify a set of characters to be accepted, or a set of characters to end the scan, so "%[^ \r\n\t]" would read until it encountered a space, carriage return, new-line or tab. Like with an RE, the leading "^" means "not" -- you can omit it to specify the characters that will be accepted instead of those that will end the conversion. With most compilers (though it's not technically required) you can specify ranges, such as "%[a-z]" to specify any lower-case letter (in this case, where the '-' isn't the first or last character, the behavior is implementation-defined).
Though not widely used (or even known) this conversion has been part of C almost forever, and is supported in C89/90.
With %[^\n], scanf() copies a string up to a newline from standard input to element i of A. As written, with no field width, this acts almost like gets().