C: fscanf - infinite loop when first character matches - c

I am attempting to parse a text (CSS) file using fscanf and pull out all statements that match this pattern:
#import "some/file/somewhere.css";
To do this, I have the following loop set up:
FILE *file = fopen(pathToSomeFile, "r");
char *buffer = (char *)malloc(sizeof(char) * 9000);
while(!feof(file))
{
// %*[^#] : Read and discard all characters up to a '#'
// %8999[^;] : Read up to 8999 characters starting at '#' to a ';'.
if(fscanf(file, "%*[^#] %8999[^;]", buffer) == 1)
{
// Do stuff with the matching characters here.
// This code is long and not relevant to the question.
}
}
This works perfectly SO LONG AS the VERY FIRST character in the file is not a '#'. (Literally, a single space before the first '#' character in the CSS file will make the code run fine.)
But if the very first character in the CSS file is a '#', then what I see in the debugger is an infinite loop -- execution enters the while loop, hits the fscanf statement, but does not enter the 'if' statement (fscanf fails), and then continues through the loop forever.
I believe my fscanf formatters may need some tweaking, but am unsure how to proceed. Any suggestions or explanations for why this is happening?
Thank you.

I'm not an expert on scanf pattern syntax, but my interpretation of yours is:
Match a non-empty sequence of non-'#' characters, then
Match a non-empty sequence of up to 8999 non-';' characters
So yes, if your string starts with a '#', then the first part will fail.
I think if you start your format string with some whitespace, then fscanf will eat any leading whitespace in your data string, i.e. simply " %8999[^;]".

Oli already said why fscanf failed. And since failure is a normal state for fscanf your busy loop is not the consequence of the fscanf failure but of the missing handling for it.
You have to handle a fscanf failure even if your format would be correct (in your special case), because you cannot be sure that the input always is matchable by the format. Actually you can be sure that much more nonmatching input exists than matching input.
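One way to make the loop robust against match failures is to split the format into two calls and check each result. The following is only a sketch under assumptions: the helper name count_statements is hypothetical, and it simply counts every "#...;" run rather than doing the question's real processing.

```c
#include <stdio.h>

/* Hypothetical helper: counts "#...;" statements in a stream.
 * Each fscanf result is checked, so a match failure cannot make the loop spin. */
int count_statements(FILE *fp)
{
    char buf[9000];
    int count = 0;

    for (;;) {
        /* Skip text before the next '#'. This returns 0 whether or not
         * anything was skipped (nothing is assigned), and EOF at end of input. */
        if (fscanf(fp, "%*[^#]") == EOF)
            break;

        /* The next character is '#', so this reads "#..." up to a ';'. */
        int rc = fscanf(fp, "%8999[^;]", buf);
        if (rc == EOF)
            break;
        if (rc == 1)
            count++;

        /* Consume one character (normally the ';') so progress is guaranteed. */
        if (getc(fp) == EOF)
            break;
    }
    return count;
}
```

Because the first call is allowed to match nothing, a file that starts with '#' is handled the same way as one that starts with other text.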

Your format string does the following actions:
Read (and discard) 1 or more non-# characters
Read (and discard) 0 or more whitespace characters (due to the space in the format string)
Read and store 1 to 8999 non-; characters
Unfortunately, there is no format specifier for reading "zero or more" characters from a user-defined set.
If you don't care about multiple #import statements on a line, you could change your code to read a single line (with fgets) and then extract the #import statement from it: if the first character is not #, use your current format string with sscanf; otherwise, use sscanf(line, "%8999[^;]", buffer).
If multiple #import statements on a line should be handled correctly, you could inspect the next character to be read with getc and then put it back with ungetc.
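The line-based variant could look roughly like this. It is a sketch, not a definitive implementation: the name extract_statement is hypothetical, the caller is assumed to have read the line with fgets, and the output buffer is assumed to hold at least 9000 bytes.

```c
#include <stdio.h>

/* Hypothetical helper: copies the first "#...;" statement found in line
 * into out (assumed to be at least 9000 bytes). Returns 1 if one was found. */
int extract_statement(const char *line, char *out)
{
    if (line[0] == '#') {
        /* Nothing to skip: "%*[^#]" would fail here, so leave it out. */
        return sscanf(line, "%8999[^;]", out) == 1;
    }
    /* At least one non-'#' character precedes the statement (if any). */
    return sscanf(line, "%*[^#]%8999[^;]", out) == 1;
}
```

Note that a line containing '#' but no ';' still yields a match that runs to the end of the line, so the caller may want to check for the ';' separately.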

Related

Getting particular strings in scanf

I was wondering if it is possible to only read in particular parts of a string using scanf.
For example since I am reading from a file i use fscanf
if I wanted to read name and number (where number is the 111-2222) when they are in a string such as:
Bob Hardy:sometext:111-2222:sometext:sometext
I use this but its not working:
(fscanf(read, "%23[^:] %27[^:] %10[^:] %27[^:] %d\n", name,var1, number, var2, var3))
Your initial format string fails because it does not consume the : delimiters.
If you want scanf() to read a portion of the input, but you don't care what is actually read, then you should use a field descriptor with the assignment-suppression flag (*):
char nl;
fscanf(read, "%23[^:]:%*[^:]:%10[^:]%*[^\n]%c", name, number, &nl);
As a bonus, you don't need to worry about buffer overruns for fields with assignment suppressed.
You should not attempt to match a single newline via a trailing newline character in the format, because a literal newline (or space or tab) in the format will match any run of whitespace. In this particular case, it would consume not just the line terminator but also any leading whitespace on the next line.
The last field is not suppressed, even though it will almost always receive a newline, because that way you can tell from the return value if you've scanned the last line of the file and it is not newline-terminated.
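The same idea can be tried on a string with sscanf. This is a minimal sketch that only extracts the two wanted fields; the function name parse_name_number is hypothetical, and the buffer sizes are chosen to match the field widths above.

```c
#include <stdio.h>

/* Hypothetical helper: pulls the 1st and 3rd ':'-separated fields out of line.
 * The 2nd field is matched but discarded by the '*' flag.
 * Returns 1 when both fields were stored. */
int parse_name_number(const char *line, char name[24], char number[11])
{
    return sscanf(line, "%23[^:]:%*[^:]:%10[^:]", name, number) == 2;
}
```

Suppressed fields do not count toward the return value, which is why the check is against 2 and not 3.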
Check fscanf() return value.
fscanf(read, "%23[^:] %27[^:] ... fails because after scanning the first field with %23[^:], fscanf() encounters a ':'. The white space in the format matches zero characters there, and the following %27[^:] cannot match a ':', so scanning stops.
Had the code checked the return value of fscanf(), which was certainly 1, the source of the problem might have been self-evident. The scan needs to consume the ':', so add it to the format: "%23[^:]: %27[^:]: ...
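A small sketch of what the return value reveals, using sscanf on a string so it is easy to try. Both function names are hypothetical, and only the first two fields are scanned to keep it short.

```c
#include <stdio.h>

/* Format without the ':' delimiters, as in the question (first two fields only). */
int fields_without_colons(const char *line)
{
    char name[24], var1[28];
    return sscanf(line, "%23[^:] %27[^:]", name, var1);
}

/* Same scan, but the format consumes the ':' between the fields. */
int fields_with_colons(const char *line)
{
    char name[24], var1[28];
    return sscanf(line, "%23[^:]: %27[^:]", name, var1);
}
```

For the sample line, the first version stops after one field, while the second stores both.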
Better to use fgets()
Using fscanf() both to read data and to tell properly formatted data from improperly formatted data is very challenging. It can be done correctly for expected input, yet it rarely copes well with incorrectly formatted input.
Instead, simply read a line of data and then parse it. Using '%n' is an easy way to detect a complete conversion, as it saves the count of characters scanned, if scanning gets that far.
char buffer[200];
if (fgets(buffer, sizeof buffer, read) == NULL) {
return EOF;
}
int n = 0;
sscanf(buffer, " %23[^:]: %27[^:]: %10[^:]: %27[^:]:%d %n",
name, var1, number, var2, &var3, &n);
if (n == 0) {
return FAIL; // scan incomplete
}
if (buffer[n]) {
return FAIL; // Extra data on line
}
// Success!
Note: sample input ended with text, but original format used "%d". Unclear on OP's intent.

I/O in C Errors

I've been trying for hours to find the answer to this question I got at university. I tried running this on a file containing these two lines:
hello
world
and it reads the file perfectly, so I can't find the answer. I would appreciate your help!
A student wrote the next function for reading a text file and printing it exactly as it is.
void ReadFile(FILE *fIn)
{
char nextLine[MAX_LINE_LENGTH];
while(!feof(fIn))
{
fscanf(fIn,"%s",nextLine);
printf("%s\n",nextLine);
}
}
What are the two errors in this function?
You can assume that each line in the file is not longer than MAX_LINE_LENGTH characters, and that it is a text file that contains only alphabet characters, and that each line is terminated by '\n'.
Thanks.
It discards white space. Try adding multiple spaces and tabs.
It may evaluate a stream more than once, and if there is a read error, the loop never terminates.
See: Why is “while ( !feof (file) )” always wrong?
Reading strings via scanf is dangerous. There is no bounds checking. You may read past your MAX_LINE_LENGTH (and boom! Segfault).
The main error is that fscanf( fIn, "%s", nextLine ) doesn't scan a complete line.
From man page:
s
Matches a sequence of non-white-space characters; the next pointer must be a pointer to character array that is long enough to hold the input sequence and the terminating null byte ('\0'), which is added automatically. The input string stops at white space or at the maximum field width, whichever occurs first.
Thus if you have a line "a b" the first fscanf() will scan just "a" and the second one "b" and both are printed in two different lines. You can use fgets() to read a whole line.
The second one may be that it is stated that "each line in the file is not longer than MAX_LINE_LENGTH characters", but nextLine can contain at most MAX_LINE_LENGTH-1 characters (+ '\0'). That problem becomes even more important if you replace fscanf() with fgets(), because then nextLine must also have the capacity to store '\n' or '\r\n' (depending on the platform you're on).
A correct way of doing that is:
void ReadFile(FILE *fIn)
{
char nextLine[MAX_LINE_LENGTH];
while(fgets(nextLine, MAX_LINE_LENGTH, fIn)) {
printf("%s", nextLine);
}
}
As some have posted, using feof to control a loop is not a good idea, nor is using fscanf to read lines.
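To see why the feof-controlled loop misbehaves, it can help to count loop iterations. The sketch below uses two hypothetical functions and a bounded %63s (the width is an arbitrary choice for the example): one follows the pattern from the question, the other stops on the fscanf return value.

```c
#include <stdio.h>

/* Pattern from the question: feof() is tested before the read that
 * actually hits end of file, so the body runs one extra time. */
int iterations_with_feof(FILE *fIn)
{
    char word[64];
    int iterations = 0;
    while (!feof(fIn)) {
        int rc = fscanf(fIn, "%63s", word); /* result not acted on, as in the question */
        (void)rc;
        iterations++;
    }
    return iterations;
}

/* The loop stops as soon as fscanf fails to store a word. */
int iterations_with_return_check(FILE *fIn)
{
    char word[64];
    int iterations = 0;
    while (fscanf(fIn, "%63s", word) == 1)
        iterations++;
    return iterations;
}
```

On a file holding "hello\nworld\n", the first loop body runs three times (the third fscanf fails and leaves the previous word in the buffer, so the original function would print it again), while the second runs twice.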

C programming language (scanf)

I have read strings with spaces in them using the following scanf() statement.
scanf("%[^\n]", &stringVariableName);
What is the meaning of the control string [^\n]?
Is it an okay way to read strings with white space like this?
This means "read anything until you find a '\n'".
This is OK, but it would be better to say "read anything until you find a '\n' or until my buffer is full", by giving a maximum field width:
char stringVariableName[256] = {0};
if (scanf("%255[^\n]", stringVariableName) == 1)
...
Edit: removed & from the argument, and check the result of scanf.
The format specifier "%[^\n]" instructs scanf() to read up to but not including the newline character. From the linked reference page:
matches a non-empty sequence of character from set of characters.
If the first character of the set is ^, then all characters not
in the set are matched. If the set begins with ] or ^] then the ]
character is also included into the set.
If the string is on a single line, fgets() is an alternative but the newline must be removed as fgets() writes it to the output buffer. fgets() also forces the programmer to specify the maximum number of characters that can be read into the buffer, making it less likely for a buffer overrun to occur:
char buffer[1024];
if (fgets(buffer, 1024, stdin))
{
/* Remove newline. */
char* nl = strrchr(buffer, '\n');
if (nl) *nl = '\0';
}
It is possible to specify the maximum number of characters to read via scanf():
scanf("%1023[^\n]", buffer);
but it is impossible to forget to do it for fgets() as the compiler will complain. Though, of course, the programmer could specify the wrong size but at least they are forced to consider it.
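An alternative way to strip the newline after fgets() is strcspn, which also copes with a last line that has no newline. This is a sketch with a hypothetical helper name, read_line, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line into buf and removes the trailing
 * newline, if any. Returns buf, or NULL at end of file or on error. */
char *read_line(char *buf, size_t size, FILE *fp)
{
    if (fgets(buf, (int)size, fp) == NULL)
        return NULL;
    /* strcspn gives the index of the first '\n', or of the terminating
     * '\0' when there is none, so this assignment is always in bounds. */
    buf[strcspn(buf, "\n")] = '\0';
    return buf;
}
```

A line longer than the buffer is still returned in pieces, so callers that care need to detect that case themselves.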
Technically, this can't be well defined.
Matches a nonempty sequence of characters from a set of expected
characters (the scanset).
If no l length modifier is present, the corresponding argument shall
be a pointer to the initial element of a character array large enough
to accept the sequence and a terminating null character, which will be
added automatically.
Supposing the declaration of stringVariableName looks like char stringVariableName[x];, then &stringVariableName is a char (*)[x];, not a char *. The type is wrong. The behaviour is undefined. It might work by coincidence, but anything that relies on coincidence doesn't work by my definition.
The only way to form a char * using &stringVariableName is if stringVariableName is a char! This implies that the character array is only large enough to accept a terminating null character. In the event where the user enters one or more characters before pressing enter, scanf would be writing beyond the end of the character array and invoking undefined behaviour. In the event where the user merely presses enter, the %[...] directive will fail and not even a '\0' will be written to your character array.
Now, with that all said and done, I'll assume you meant this: scanf("%[^\n]", stringVariableName); (note the omitted ampersand)
You really should be checking the return value!!
A %[ directive causes scanf to retrieve a sequence of characters consisting of those specified between the [ square brackets ]. A ^ at the beginning of the set indicates that the desired set contains all characters except for those between the brackets. Hence, %[^\n] tells scanf to read as many non-'\n' characters as it can, and store them into the array pointed to by the corresponding char *.
The '\n' will be left unread. This could cause problems. An empty field will result in a match failure. In this situation, it's possible that no data will be copied into your array (not even a terminating '\0' character). For this reason (and others), you really need to check the return value!
Which manual contains information about the return values of scanf? The scanf manual.
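Putting those points together, a line reader built on %[^\n] has to give a width, check the return value, and consume the newline itself. The following is a sketch under assumptions: the name read_line_scanset is hypothetical, and the buffer is assumed to hold 256 bytes.

```c
#include <stdio.h>

/* Hypothetical helper: reads one line (without its newline) into buf,
 * which is assumed to hold at least 256 bytes.
 * Returns 1 for a non-empty line, 0 for an empty line, EOF at end of input. */
int read_line_scanset(FILE *fp, char *buf)
{
    buf[0] = '\0';                  /* an empty line leaves buf untouched otherwise */
    int rc = fscanf(fp, "%255[^\n]", buf);
    if (rc == EOF)
        return EOF;

    /* %[^\n] leaves the newline unread; discard the rest of the line,
     * including any characters that did not fit in buf. */
    int c;
    while ((c = getc(fp)) != EOF && c != '\n')
        ;
    return rc;
}
```

Without the loop that discards the newline, every call after the first would see '\n' as the next character and report a match failure.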
Other people have explained what %[^\n] means.
This is not an okay way to read strings. It is just as dangerous as the notoriously unsafe gets, and for the same reason: it has no idea how big the buffer at stringVariableName is.
The best way to read one full line from a file is getline, but not all C libraries have it. If you don't, you should use fgets, which knows how big the buffer is, and be aware that you might not get a complete line (if the line is too long for the buffer).
Reading from the man pages for scanf()...
[ Matches a non-empty sequence of characters from the
specified set of accepted characters; the next pointer must be a
pointer to char, and there must be enough room for all the characters
in the string, plus a terminating null byte. The usual skip of
leading white space is suppressed. The string is to be made up of
characters in (or not in) a particular set; the set is defined by the
characters between the open bracket [ character and a close bracket ]
character. The set excludes those characters if the first character
after the open bracket is a circumflex (^). To include a close
bracket in the set, make it the first character after the open bracket
or the circumflex; any other position will end the set. The hyphen
character - is also special; when placed between two other
characters, it adds all intervening characters to the set. To
include a hyphen, make it the last character before the final close
bracket. For instance, [^]0-9-] means the set "everything except
close bracket, zero through nine, and hyphen". The string ends with
the appearance of a character not in the (or, with a
circumflex, in) set or when the field width runs out.
In a nutshell, [^\n] means: read every character from the input that is not a \n and store them through the matching pointer in the argument list.
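The scanset from the man page example can be tried with sscanf. This is only a sketch: the function name is hypothetical, a field width is added so the 16-byte buffer cannot overflow, and the treatment of '-' as a range is what the man page above describes (the C standard leaves it implementation-defined).

```c
#include <stdio.h>

/* Hypothetical helper: copies the leading run of characters that are
 * NOT ']', a digit, or '-' into out. Returns 1 if the run is non-empty. */
int copy_until_bracket_digit_or_hyphen(const char *input, char out[16])
{
    out[0] = '\0';
    /* ']' right after '^' is a set member, 0-9 is a range,
     * and the final '-' before the closing ']' is a literal hyphen. */
    return sscanf(input, "%15[^]0-9-]", out) == 1;
}
```

An input that starts with an excluded character produces a match failure, so nothing is stored and the function returns 0.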

Invalid output with fscanf()

The language I am using is C
I am trying to scan data from a file, and the code segment is like:
char lsm;
long unsigned int address;
int objsize;
while(fscanf(mem_trace,"%c %lx,%d\n",&lsm,&address,&objsize)!=EOF){
printf("%c %lx %d\n",lsm,address,objsize);
}
The file which I read from has the first line as follows:
S 00600aa0,1
I 004005b6,5
I 004005bb,5
I 004005c0,5
S 7ff000398,8
The results that show in stdout is:
8048350 134524916
S 600aa0 1
I 4005b6 5
I 4005bb 5
I 4005c0 5
S 7ff000398,8
Obviously, the results have an extra first line that comes from nowhere. Does anybody know how this could happen?
Thx!
This works for me on the data you supply:
#include <stdio.h>
int main(void)
{
char lsm[2];
long unsigned int address;
int objsize;
while (scanf("%1s %lx,%d\n", lsm, &address, &objsize) == 3)
printf("%s %9lx %d\n", lsm, address, objsize);
return 0;
}
There are multiple changes. The simplest and least consequential is the change from fscanf() to scanf(); that's for my convenience.
One important change is the type of lsm, from a single char to an array of two characters. The format string then uses %1s, which reads one character (plus NUL '\0') into the string, but it also (and this is crucial) skips leading blanks.
Another change is the use of == 3 instead of != EOF in the condition. If something goes wrong, scanf() returns the number of successful matches. Suppose that it managed to read a letter but what followed was not a hex number; it would return 1 (not EOF). Further, it would return 1 on each iteration until it could find something that matched a hex number. Always test for the number of values you expect.
The output format was tidied up with the %9lx. I was testing on a 64-bit system, so the 9-digit hex converts fine. One problem with scanf() is that if you get an overflow on a conversion, the behaviour is undefined.
Output:
S 600aa0 1
I 4005b6 5
I 4005bb 5
I 4005c0 5
S 7ff000398 8
Why did you get the results you got?
The first conversion read a space into lsm, but then failed to convert S into a hex number, so it was left behind for the next cycle. So, you got the left-over garbage printed in the address and object size columns. The second iteration read the S and was then in synchrony with the data until the last line. The newline at the end of the format (like any other white space in the format string) eats white space, which is why the last line worked despite the leading blank.
A directive that is a conversion specification defines a set of
matching input sequences, as described below for each specifier. A
conversion specification is executed in the following steps:
Input white-space characters (as specified by the isspace function)
are skipped, unless the specification includes a [, c, or n specifier.
An input item is read from the stream, unless the specification
includes an n specifier.
[...]
The first time you call fscanf, your %c reads the first blank space in the file. Your white-space character reads zero or more characters of white-space, this time zero of them. Your %lx fails to match the S character in the file, so fscanf returns. You don't check the result. Your variables contain values that they had from earlier operations.
The second time you call fscanf, your %c reads the first S character in the file. From that point on, everything else succeeds too.
Added in editing, here is the simplest change to your format string to solve your problem:
" %c %lx,%d\n"
The space at the beginning will read zero or more characters of white-space and then %c will read the first non-white-space character in the file.
Here is another format string that will also solve your problem:
" %c %lx,%d"
The reason is that if you read and discard zero or more white-space characters twice in a row, the result is the same as doing it just once.
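The effect of the leading space can be checked on a single line with sscanf. This is a sketch with hypothetical function names; the sample line is assumed to start with a blank, as described above.

```c
#include <stdio.h>

/* With the leading space: white space is skipped before %c. */
int parse_trace_line(const char *line, char *op, unsigned long *address, int *size)
{
    return sscanf(line, " %c %lx,%d", op, address, size);
}

/* Without it: %c stores whatever character comes first, even a blank. */
int parse_without_leading_space(const char *line, char *op, unsigned long *address, int *size)
{
    return sscanf(line, "%c %lx,%d", op, address, size);
}
```

Comparing the return value against 3 is what tells the two cases apart; the second version reports only 1 converted field for a line that begins with a blank.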
I think that fscanf reads the first character [space] into lsm, then fails to read address and objsize because the format doesn't match the rest of the line.
Then it prints a space, followed by whatever happened to be in address and objsize when they were declared.
EDIT--
fscanf consumes the whitespace after each call; if you call ftell you'll see:
printf("%c %lx %d %ld\n",lsm,address,objsize,ftell(mem_trace));

How To Read in Strings that only Contain Alphabet letters with fscanf?

I have been struggling to figure out the fscanf formatting. I just want to read in a file of words delimited by spaces. And I want to discard any strings that contain non-alphabetic characters.
char temp_text[100];
while(fscanf(fcorpus, "%101[a-zA-Z]s", temp_text) == 1) {
printf("%s\n", temp_text);
}
I've tried the above code both with and without the 's'. I read in another stackoverflow thread that the s when used like that will be interpreted as a literal 's' and not as a string. Either way - when I include the s and when I do not include the s - I can only get the first word from the file I am reading through to print out.
The %[ scan specifier does not skip leading spaces. Either add a space before it or at the end in place of your s. Also you have your 100 and 101 backwards and thus a serious buffer overflow bug.
The s isn't needed.
Here are a few things to try:
Print out the return value from fscanf, and make sure it is 1.
Make sure that the fscanf is consuming the whitespace by using fgetc to get the next character and printing it out.
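If the goal is to drop whole words that contain any non-alphabetic character, a different approach from the scanset is to read whitespace-delimited tokens with a bounded %s and filter them afterwards. This is a sketch of that swapped-in technique, not what the answers above propose; both function names are hypothetical.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: true if s is non-empty and purely alphabetic. */
static int is_alpha_word(const char *s)
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++) {
        if (!isalpha((unsigned char)*s))
            return 0;
    }
    return 1;
}

/* Hypothetical helper: prints and counts the purely alphabetic words. */
int count_alpha_words(FILE *fp)
{
    char word[100];
    int count = 0;
    /* %99s skips leading white space and leaves room for the '\0'. */
    while (fscanf(fp, "%99s", word) == 1) {
        if (is_alpha_word(word)) {
            printf("%s\n", word);
            count++;
        }
    }
    return count;
}
```

A token longer than 99 characters is read in pieces by this loop, so very long words would need extra handling.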
