In C, when using sscanf, for the format parameter, is there a difference between using:
%255[^\0]s
And:
%255c
Is one faster? Will one of the above ever give a different outcome?
The question is certainly not what the OP intended.
The OP asks for the sscanf(buf, format, dest) difference between the formats
"%255[^\0]s" // what looks like the format specifier %255[^\0] followed by the letter 's'
"%255c"
More likely, the OP wanted the sscanf(buf, format, dest) difference between the formats
"%255[^\0]" // what looks like the format specifier %255[^\0]
"%255c"
OR
"%255s" // format specifier %255s
"%255c"
The "%255[^\0]" is not the format one may think. This is the same as format "%255[^". sscanf() does not know there is something past the explicit null character '\0'. Since the format specifier begins with a [ but does not end with a matching ], it is an invalid specifier. "If a conversion specification is invalid, the behavior is undefined."
This also applies to the original "%255[^\0]s": the behavior is undefined.
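One way to see why everything after the \0 is invisible to sscanf() is to measure the literal with strlen(), which stops at the first null character just as sscanf() does when it walks the format. A minimal sketch (the helper name visible_format_length is made up for this illustration):

```c
#include <stddef.h>
#include <string.h>

/* Length of a format string as sscanf() sees it: everything up to,
   but not including, the first '\0'. Hypothetical helper. */
size_t visible_format_length(const char *format)
{
    return strlen(format);
}
```

visible_format_length("%255[^\0]s") and visible_format_length("%255[^") both return 6, because the ] and the s sit after the embedded null character.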
Following are the salient issues between "%255s" and "%255c"
"%255c" does not consume leading white spaces. "%255s" does consume unlimited leading white space, scanning them, but not saving them to dest.
"%255c" does scan white spaces and saves them to dest. "%255s", after it found a non-white space, will stop scanning should it encounter a white space.
Both will scan up to 255 characters and place the scanned characters into dest.
"%255c" does not append a \0 so dest should cope with 255 char.
"%255s", if it scans at least 1 char, will append a \0, so dest should cope with 256 char.
Neither will scan a \0, as scanning buf stops at \0 in sscanf(). "%255c" would scan a \0 in fscanf(). That case is unusual, as fscanf() is rarely used on files that contain \0.
Should any speed difference occur, certainly it is implementation dependent.
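The %s versus %c points above can be checked with a small sketch. It uses a width of 5 for %c rather than 255 so the input can stay short, and the helper names scan_word and scan_five_chars are invented for this illustration.

```c
#include <stdio.h>

/* %255s: skips leading white space, stops at the next white space,
   and appends a terminating '\0'. dest needs room for 256 bytes. */
int scan_word(const char *src, char dest[256])
{
    return sscanf(src, "%255s", dest);
}

/* %5c: does not skip white space, stores exactly 5 characters,
   and does NOT append a '\0'. dest needs room for 5 bytes.
   src is assumed to hold at least 5 characters. */
int scan_five_chars(const char *src, char dest[5])
{
    return sscanf(src, "%5c", dest);
}
```

With the input "  ab cd", scan_word() stores "ab" plus a terminator, while scan_five_chars() stores the five characters "  ab " (two spaces, a, b, space) and leaves every byte after them untouched.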
Can't say for sure about speed, but there is a difference in outcome.
First, %255c will (assuming there's at least 255 characters in the string you're scanning) read 255 characters, regardless of what they are. %255[^\0]s, on the other hand, will read up to 255 non-whitespace characters.
Second, because strings are already terminated by \0, the [^\0] part of the format is redundant, as sscanf will never consider the null terminator to be part of the input.
See the following code:
int main()
{
char test[3];
scanf("%s", test);
__fpurge(stdin);
printf("%s", test);
}
The program should record only 3 characters, but when I type, for example, 8 characters, the program records all 8! This should not happen. The correct behavior would be to record 3 characters. Why does scanf do this?
scanf accepts more data than you can fit in test because you allow it to do so by using %s without a limit. This is dangerous, and must be avoided in production code.
Replace %s with %3s to fix this problem. If you want to read three characters, test must be four-characters wide to accommodate null terminator:
char test[4];
scanf("%3s", test);
When you pass test to scanf(), you are passing nothing but a pointer to the first character of your buffer, so scanf() has no idea how large your buffer is. It will happily accept as many characters as you type, and it will store them all in there. So, when you type more than 2 characters, you are causing scanf() to write characters (plus the zero asciiz terminator character) past the end of your buffer. Normally, what is to be expected in such a case is a program crash.
The fact that you did not experience a crash is largely coincidence. What is probably happening is that the compiler has allocated room for more than 3 characters on the stack due to alignment considerations, possibly room for 8 characters or more. If you type enough characters, your program will surely crash.
For this reason, this usage of scanf() is considered completely unsafe. One should never use scanf() like that when doing any serious coding. Instead you should specify the width of your string, like this: "%2s". (Note that you must specify a number which is smaller than the size of your buffer by one, in order to account for the zero asciiz terminator character that will be automatically appended by scanf().)
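To keep the width in the format and the size of the buffer from drifting apart, one option is to build the format from a single constant with the preprocessor's stringify operator. This is only a sketch: MAX_CHARS, STRINGIFY, and read_limited are names made up here, and sscanf() stands in for scanf() so the example does not depend on typed input.

```c
#include <stdio.h>

#define MAX_CHARS 3                  /* characters wanted, excluding '\0' */
#define STRINGIFY_(x) #x
#define STRINGIFY(x) STRINGIFY_(x)   /* STRINGIFY(MAX_CHARS) expands to "3" */

/* Copy at most MAX_CHARS non-white-space characters from input into out.
   out must have room for MAX_CHARS + 1 bytes. Returns sscanf()'s result. */
int read_limited(const char *input, char out[MAX_CHARS + 1])
{
    /* Adjacent string literals are concatenated, giving "%3s". */
    return sscanf(input, "%" STRINGIFY(MAX_CHARS) "s", out);
}
```

Calling read_limited("abcdefgh", test) with char test[MAX_CHARS + 1] stores "abc" and the terminator; the remaining characters are simply not consumed.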
I've been doing a bit of reading through the Linux programmer's manual looking up various functions and trying to get a deeper understanding of what they are/how they work.
Looking at fgets() I read: "A '\0' is stored after the last character in the buffer."
I've read through What does \0 stand for? and have a pretty solid understanding of what \0 symbolizes (a null character right ?). But what I'm struggling to grasp is its relevance to fgets(), I don't really understand why it "needs" to end with a null character.
As you already said, you are probably aware that \0 constitutes the end of all strings in C. As per the C standard, everything that is a string needs to be \0 terminated.
Since fgets() makes a string, that string, of course, will be properly null terminated.
Do note that for all string functions in C, any string you use or generate with them must be terminated with a \0 character.
Because otherwise you do not know how long the resulting string is.
One of the arguments to fgets is the maximum number of characters to read, but it's just that: a maximum. If you ask for 512 characters, but there are only 8 in the buffer, you will only get 8 characters … and a null character ('\0') in the 9th slot to mark the logical end of the C string.
Arguably, fgets could instead have been designed to return the number of characters read, but then for most purposes you'd only have to add the null byte yourself manually, and the function would have to find a way to signify an error other than returning a null pointer.
From C standards:
The fgets function reads at most one less than the number of
characters specified by n from the stream pointed to by stream into
the array pointed to by s. No additional characters are read after a
new-line character (which is retained) or after end-of-file. A null
character is written immediately after the last character read into
the array.
This ensures that the created string does not overflow the buffer (characters are never written beyond the provided storage).
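The "at most one less than n" rule in the quoted text can be observed with a short sketch. The helper read_chunk is hypothetical; it just wraps fgets() and uses strlen() on the result, which is only safe because fgets() always writes the terminating null character.

```c
#include <stdio.h>
#include <string.h>

/* Read one chunk with fgets() and report how many characters were stored,
   not counting the terminating '\0'. Returns 0 on end-of-file or error. */
size_t read_chunk(FILE *stream, char *buf, int size)
{
    if (fgets(buf, size, stream) == NULL)
        return 0;
    return strlen(buf);   /* relies on the '\0' that fgets() appended */
}
```

With a 6-byte buffer and a stream containing "hello world\n", successive calls store "hello", then " worl", then "d\n": never more than 5 characters, each followed by '\0'.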
As all the people before me said, fgets reads bytes from a file and makes them into a standard C string, which is null-terminated. The termination with the \0 byte reflects the fact that this function is text-oriented.
If you don't want to use null-termination for the data read from the file, it's not a string (not text), and also the end-of-line byte \n has no significance. In this case, you can use fread.
So C has two functions to read from file: fgets for text and fread for non-text (binary data).
BTW if the input file has a genuine zero-valued byte, fgets will do an uncomfortable thing: it will continue reading until it reads an end-of-line byte \n, and the output "string" will have two (or more) null-terminations. This doesn't make any sense as text, so it's another example of fgets being text-oriented and unsuitable for arbitrary data.
I have a little problem with my code. At the moment I just have a line (a char* string with \0 at the end) and I want the line to be checked for special characters. Therefore I used the following code:
char lineJunk;
if(sscanf(lineContent, "%*[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+=\0/{}:]%c", &lineJunk)){
return 0;
}
Now my compiler will spit out the following warning:
Multiple markers at this line
- no closing ‘]’ for ‘%[’ format [-Wformat=]
- embedded ‘\0’ in format [-Wformat-contains-nul]
- too many arguments for format [-Wformat-extra-args]
These warnings only appear when I have \0 in my sscanf. Yet otherwise the code won't work, because the Line I am checking on has \0 at its end. When I use \\0 instead of \0 the warnings disappear, but the code doesn't work anymore. Just \ doesn't work either.
Somebody know a solution?
You are not using sscanf() correctly. To explain the warnings:
No closing ] means that your format string has no closing ], which is required because the format contains a [.
The closing ] in your format string is not really there, because you have an embedded '\0' in the format string, so the actual format string is
"%*[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+="
This is because you have an explicit '\0' in the format string, which is also what causes the previous warning.
The compiler already adds one '\0' at the end of the format string to mark its end, which makes it a legitimate C string that you can pass to strlen() and other functions that expect the nul terminator to be present.
By embedding another one in the format string, you mark the end of the string at the position where you inserted it, which is why the format string becomes the truncated one shown above.
You are discarding the matched value by using the * modifier, so no argument is required for it. Remove the * if you want the passed argument to be used.
You can't match the '\0' with sscanf(). If you want that, you need to traverse the string one byte at a time until you find a '\0', and in that case the length should be known beforehand.
There is no need for '\0' in the format of sscanf(char *src, char *format, ...). sscanf() will stop scanning when it reaches the '\0' in src. So sscanf() will never provide '\0' for scanning.
As mentioned by @iharob, the '\0' in the format is trouble, as sscanf() sees it as the end of the format. That is what the compiler is warning about.
// Eliminate `\0` from the format.
#define SKIP "%*[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+=/{}:]"
if(sscanf(lineContent, SKIP "%c" , &lineJunk) == 1) {
return 0;
}
Should A-Z be consecutive as with typical encoding of ASCII, a short-cut would be: #define SKIP "%*[A-Za-z0-9+=/{}:]"
--
Note: it is better to compare the result of sscanf() with what the code wants, 1, rather than testing for non-zero. In some situations sscanf() will return EOF.
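As an alternative to the scanset, the same check can be sketched with strspn() from <string.h>. This is a different technique from the one used in the answers above, and the names ALLOWED and only_allowed_chars are made up here. One reason to consider it: a %[ directive has to match at least one character, so a line whose very first character is disallowed makes sscanf() return 0 rather than 1, and an empty line makes it return EOF.

```c
#include <string.h>

#define ALLOWED \
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+=/{}:"

/* Returns 1 if every character of line is in ALLOWED, 0 otherwise.
   An empty line counts as valid here; that is a choice, not a rule. */
int only_allowed_chars(const char *line)
{
    /* strspn() gives the length of the leading run of allowed characters.
       The line is clean exactly when that run reaches the terminating '\0'. */
    return line[strspn(line, ALLOWED)] == '\0';
}
```

For example, only_allowed_chars("abc+=/") is 1, while only_allowed_chars("ab!c") and only_allowed_chars("!abc") are both 0.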
I have read strings with spaces in them using the following scanf() statement.
scanf("%[^\n]", &stringVariableName);
What is the meaning of the control string [^\n]?
Is it an okay way to read strings with white space like this?
This means "read anything until you find a '\n'".
This is OK, but it would be better to say "read anything until you find a '\n' or until you have read as many characters as my buffer can hold":
char stringVariableName[256] = {0};
if (scanf("%255[^\n]", stringVariableName) == 1)
...
Edit: removed & from the argument, and check the result of scanf.
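The possible return values of the bounded scanset can be explored without typing input by pointing sscanf() at a string instead. A small sketch, with first_line as a hypothetical helper name:

```c
#include <stdio.h>

/* Copy the first line of src (up to 255 characters, without the '\n')
   into out. Returns 1 on success, 0 if the first line is empty (out is
   then left unmodified), or EOF if src is an empty string. */
int first_line(const char *src, char out[256])
{
    return sscanf(src, "%255[^\n]", out);
}
```

first_line("hello world\nnext", out) returns 1 and stores "hello world", spaces included. first_line("\nnext", out) returns 0, which is why comparing the result with 1 matters.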
The format specifier "%[^\n]" instructs scanf() to read up to but not including the newline character. From the linked reference page:
matches a non-empty sequence of character from set of characters.
If the first character of the set is ^, then all characters not
in the set are matched. If the set begins with ] or ^] then the ]
character is also included into the set.
If the string is on a single line, fgets() is an alternative but the newline must be removed as fgets() writes it to the output buffer. fgets() also forces the programmer to specify the maximum number of characters that can be read into the buffer, making it less likely for a buffer overrun to occur:
char buffer[1024];
if (fgets(buffer, 1024, stdin))
{
/* Remove newline. */
char* nl = strrchr(buffer, '\n');
if (nl) *nl = '\0';
}
It is possible to specify the maximum number of characters to read via scanf():
scanf("%1023[^\n]", buffer);
but it is impossible to forget to do it for fgets() as the compiler will complain. Though, of course, the programmer could specify the wrong size but at least they are forced to consider it.
Technically, this can't be well defined.
Matches a nonempty sequence of characters from a set of expected
characters (the scanset).
If no l length modifier is present, the corresponding argument shall
be a pointer to the initial element of a character array large enough
to accept the sequence and a terminating null character, which will be
added automatically.
Supposing the declaration of stringVariableName looks like char stringVariableName[x];, then &stringVariableName is a char (*)[x];, not a char *. The type is wrong. The behaviour is undefined. It might work by coincidence, but anything that relies on coincidence doesn't work by my definition.
The only way to form a char * using &stringVariableName is if stringVariableName is a char! This implies that the character array is only large enough to accept a terminating null character. In the event where the user enters one or more characters before pressing enter, scanf would be writing beyond the end of the character array and invoking undefined behaviour. In the event where the user merely presses enter, the %[...] directive will fail and not even a '\0' will be written to your character array.
Now, with that all said and done, I'll assume you meant this: scanf("%[^\n]", stringVariableName); (note the omitted ampersand)
You really should be checking the return value!!
A %[ directive causes scanf to retrieve a sequence of characters consisting of those specified between the [ square brackets ]. A ^ at the beginning of the set indicates that the desired set contains all characters except for those between the brackets. Hence, %[^\n] tells scanf to read as many non-'\n' characters as it can, and store them into the array pointed to by the corresponding char *.
The '\n' will be left unread. This could cause problems. An empty field will result in a match failure. In this situation, it's possible that no data will be copied into your array (not even a terminating '\0' character). For this reason (and others), you really need to check the return value!
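Both issues, the unread '\n' and the match failure on an empty line, can be handled in one place. The following is a sketch under my own assumptions (the name read_line_scanf and the choice to treat an empty line as an empty string are not from the question), not the only way to do it:

```c
#include <stdio.h>

/* Read one line (up to 255 characters) with %[^\n], then consume the '\n'
   that the directive leaves behind. Returns 1 if a line was read (possibly
   empty) and EOF at end of input. */
int read_line_scanf(FILE *stream, char out[256])
{
    int n = fscanf(stream, "%255[^\n]", out);
    if (n == EOF)
        return EOF;
    if (n == 0)
        out[0] = '\0';       /* match failure on an empty line: out was not written */

    int c = fgetc(stream);   /* the unread '\n', if the whole line fit */
    if (c != '\n' && c != EOF)
        ungetc(c, stream);   /* line was longer than 255 characters; keep the rest */
    return 1;
}
```

On a stream containing "a\n\nb\n", successive calls yield "a", an empty string, and "b", and the fourth call returns EOF.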
Which manual contains information about the return values of scanf? The scanf manual.
Other people have explained what %[^\n] means.
This is not an okay way to read strings. It is just as dangerous as the notoriously unsafe gets, and for the same reason: it has no idea how big the buffer at stringVariableName is.
The best way to read one full line from a file is getline, but not all C libraries have it. If you don't, you should use fgets, which knows how big the buffer is, and be aware that you might not get a complete line (if the line is too long for the buffer).
Reading from the man pages for scanf()...
[ Matches a non-empty sequence of characters from the
specified set of accepted characters; the next pointer must be a
pointer to char, and there must be enough room for all the characters
in the string, plus a terminating null byte. The usual skip of
leading white space is suppressed. The string is to be made up of
characters in (or not in) a particular set; the set is defined by the
characters between the open bracket [ character and a close bracket ]
character. The set excludes those characters if the first character
after the open bracket is a circumflex (^). To include a close
bracket in the set, make it the first character after the open bracket
or the circumflex; any other position will end the set. The hyphen
character - is also special; when placed between two other
characters, it adds all intervening characters to the set. To
include a hyphen, make it the last character before the final close
bracket. For instance, [^]0-9-] means the set "everything except
close bracket, zero through nine, and hyphen". The string ends with
the appearance of a character not in the (or, with a
circumflex, in) set or when the field width runs out.
In a nutshell, [^\n] means: read everything from the input that is not a \n and store it through the matching pointer in the argument list.
I have been struggling to figure out the fscanf formatting. I just want to read in a file of words delimited by spaces. And I want to discard any strings that contain non-alphabetic characters.
char temp_text[100];
while(fscanf(fcorpus, "%101[a-zA-Z]s", temp_text) == 1) {
printf("%s\n", temp_text);
}
I've tried the above code both with and without the 's'. I read in another stackoverflow thread that the s when used like that will be interpreted as a literal 's' and not as a string. Either way - when I include the s and when I do not include the s - I can only get the first word from the file I am reading through to print out.
The %[ scan specifier does not skip leading spaces. Either add a space before it or at the end in place of your s. Also you have your 100 and 101 backwards and thus a serious buffer overflow bug.
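A sketch of the first suggestion, with the widths the right way round (a 100-byte buffer and a width of 99). The helper name next_alpha_word is invented here. Note that the C standard leaves the meaning of a range such as a-z in a scanset implementation-defined, although common C libraries treat it as the obvious range.

```c
#include <stdio.h>

/* Read the next alphabetic word from stream into word (at most 99 characters).
   The leading space in the format skips any white space before the word.
   Returns 1 on success, 0 if the next non-space character is not a letter,
   and EOF at end of input. */
int next_alpha_word(FILE *stream, char word[100])
{
    return fscanf(stream, " %99[a-zA-Z]", word);
}
```

On a stream containing "hello  world\nfoo 42", three calls return 1 with "hello", "world", and "foo". The fourth returns 0 because '4' is not in the set, so a loop that tests for == 1 stops there, and skipping such tokens needs extra handling.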
The s isn't needed.
Here are a few things to try:
Print out the return value from fscanf, and make sure it is 1.
Make sure that the fscanf is consuming the whitespace by using fgetc to get the next character and printing it out.