How To Read in Strings that only Contain Alphabet letters with fscanf? - c

I have been struggling to figure out the fscanf formatting. I just want to read in a file of words delimited by spaces. And I want to discard any strings that contain non-alphabetic characters.
char temp_text[100];
while(fscanf(fcorpus, "%101[a-zA-Z]s", temp_text) == 1) {
printf("%s\n", temp_text);
}
I've tried the above code both with and without the 's'. I read in another stackoverflow thread that the s when used like that will be interpreted as a literal 's' and not as a string. Either way - when I include the s and when I do not include the s - I can only get the first word from the file I am reading through to print out.

The %[ scan specifier does not skip leading spaces. Either add a space before it or at the end in place of your s. Also you have your 100 and 101 backwards and thus a serious buffer overflow bug.

The s isn't needed.
Here are a few things to try:
Print out the return value from fscanf, and make sure it is 1.
Make sure that the fscanf is consuming the whitespace by using fgetc to get the next character and printing it out.

Related

Reading the string with defined number of characters from the input

So I am trying to read a defined number of characters from the input. Let's say that I want to read 30 characters and put them in to a string. I managed to do this with a for loop, and I cleaned the buffer as shown below.
for(i=0;i<30;i++){
string[i]=getchar();
}
string[30]='\0';
while(c!='\n'){
c=getchar(); // c is some defined variable type char
}
And this is working for me, but I was wondering if there is another way to do this. I was researching and some of them are using sprintf() for this problem, but I didn't understand that solution. Then I found that you can use scanf with %s. And some of them use %3s when they want to read 3 characters. I tried this myself, but this command only reads the string till the first empty space. This is the code that I used:
scanf("%30s",string);
And when I run my program with this line, if I for example write: "Today is a beatiful day. It is raining, but it's okay i like rain." I thought that the first 30 characters would be saved in to the string. But when i try to read this string with puts(string); it only shows "Today".
If I use scanf("%s",string) or gets(string) that would rewrite some parts of my memory if the number of characters on input is greater than 30.
You can use scanf("%30[^\n]",s)
Actually, this is how you can set which characters to input. Here, carat sign '^' denotes negation, ie. this will input all characters except \n. %30 asks to input 30 characters. So, there you are.
The API you're looking for is fgets(). The man page describes
char *fgets(char *s, int size, FILE *stream);
fgets() reads in at most one less than size characters from stream and stores them into the buffer pointed to by s. Reading stops after an EOF or a newline. If a newline is read, it is stored into the buffer. A terminating null byte ('\0') is stored after the last character in the buffer.

reading the remainder of a string with sscanf

I'm trying to read a string which consists of a set of numbers followed by a string, wrapped with some other basic text.
In other words, the format of the line is something like this:
Stuff<5,10,-5,8,"Test string here.">
Naively, I tried:
sscanf(str,"Stuff<%d,%d,%d,%d,\"%s\">",&i1,&i2,&i3,&i4,str2);
But after some research I discovered %s is supposed to stop parsing when it gets to a whitespace character. I found this question, but none of the answers addresses the problem I have: the string could contain any character in it, including newline characters and properly escaped quotes. The latter is not a problem, if I can just get sscanf to put everything after the first quote in the pre-allocated buffer I provide, I can strip the end off myself.
But how do I do this? I can't use %[] because it requires something in it to terminate the string, and the only thing I want to terminate it is the null terminator. So I thought, "Hey, I'll just use the null terminator!" But %[\0] made the compiler grumpy:
warning: no closing ‘]’ for ‘%[’ format
warning: embedded ‘\0’ in format
warning: no closing ‘]’ for ‘%[’ format
warning: embedded ‘\0’ in format
Using something like %*c won't work either, because I don't know exactly how many characters need to be taken. I tried passing strlen(str) since it will be less than that, but sscanf returns 4 and nothing is put into str2, suggesting that perhaps because the length was too long it gave up and didn't bother.
Update: I guess I could do something like:
sscanf(str,"Stuff<%d,%d,%d,%d,\"%n",&i1,&i2,&i3,&i4,&n);
str2 = str+n;
Your update seems to be a good answer. I was going to suggest strchr to find the location of the first quote char, after using sscanf to get i1 thru i4. Side note, you should always check the return value from sscanf to make sure that the conversions worked. This is even more important with your suggested answer, since n will be left uninitialized if the first four conversions aren't successful.
Scan for '\"', then for everything not '\"', then '\"' again.
Be sure to check sscanf() result and limit how long the test string may be.
char test_string[100];
int n = 0;
if (sscanf(str, "Stuff<%d,%d,%d,%d, \"%99[^\"]\"> %n",
&i1, &i2, &i3, &i4, test_string, &n) == 5 && str[n] == '\0') Good();
Your attempt using "...%[\0]...", from sscanf() point-of-view, is "...%[".
Everything in the format from "\0" on is ignored.
Using the int n = 0, appending " %n" to the format string, appending &n to the parameters and checking str[n] == '\0' is a neat trick with sscanf() to insure the entire line parsed correctly. Note: "%n" does not add to sscanf() result.
This is not the only way to achieve what you want to achieve, but probably the neatest way to do it: You'll need to use the scansets. I won't tell you the solution directly with this answer, I'll explain how to use scansets as far as I know them, and you'll hopefully be able to do it yourself.
Scansets %[...] are like %s when it comes to assignment, they interpret values as characters and store them into character arrays. %s is whitespace-terminated, %[...] is the flexible version of that.
There are two ways of using the scanset, first one being without a preceding caret ^, second one being with a preceding caret ^.
When you use scanset without the preceding caret ^, the characters you put inside the brackets will be the only ones that will be read, stored and then left behind. As soon as scanf encounters a non-matching character, that %[...] will be over. For example:
// input: asdasdasdwasdasd
char s[100] = { 0 };
scanf( "%[das]", s );
printf( "%s", s );
// output: asdasdasd
When you use scanset with the preceding caret ^, the search is inversed. It reads, stores and leaves behind every character until it reaches any one of the characters that you've put down after the preceding caret ^. Example:
// input: abcdefgh^kekQ
char s[100] = { 0 };
scanf( "%[^Q^]", s );
printf( "%s", s );
// output: abcdefgh
Beware, remaining characters is still to be read inside the stream, the file pointer won't get beyond the character which caused termination. I.e. for the first one, getchar( ); would give a 'w', and for the second one it would give a '^'.
I hope this will be enough. If you still cannot find your way out, ask away, I can give you a solution.

C programming language (scanf)

I have read strings with spaces in them using the following scanf() statement.
scanf("%[^\n]", &stringVariableName);
What is the meaning of the control string [^\n]?
Is is okay way to read strings with white space like this?
This mean "read anything until you find a '\n'"
This is OK, but would be better to do this "read anything until you find a '\n', or read more characters than my buffer support"
char stringVariableName[256] = {}
if (scanf("%255[^\n]", stringVariableName) == 1)
...
Edit: removed & from the argument, and check the result of scanf.
The format specifier "%[^\n]" instructs scanf() to read up to but not including the newline character. From the linked reference page:
matches a non-empty sequence of character from set of characters.
If the first character of the set is ^, then all characters not
in the set are matched. If the set begins with ] or ^] then the ]
character is also included into the set.
If the string is on a single line, fgets() is an alternative but the newline must be removed as fgets() writes it to the output buffer. fgets() also forces the programmer to specify the maximum number of characters that can be read into the buffer, making it less likely for a buffer overrun to occur:
char buffer[1024];
if (fgets(buffer, 1024, stdin))
{
/* Remove newline. */
char* nl = strrchr(buffer, '\n');
if (nl) *nl = '\0';
}
It is possible to specify the maximum number of characters to read via scanf():
scanf("%1023[^\n]", buffer);
but it is impossible to forget to do it for fgets() as the compiler will complain. Though, of course, the programmer could specify the wrong size but at least they are forced to consider it.
Technically, this can't be well defined.
Matches a nonempty sequence of characters from a set of expected
characters (the scanset).
If no l length modifier is present, the corresponding argument shall
be a pointer to the initial element of a character array large enough
to accept the sequence and a terminating null character, which will be
added automatically.
Supposing the declaration of stringVariableName looks like char stringVariableName[x];, then &stringVariableName is a char (*)[x];, not a char *. The type is wrong. The behaviour is undefined. It might work by coincidence, but anything that relies on coincidence doesn't work by my definition.
The only way to form a char * using &stringVariableName is if stringVariableName is a char! This implies that the character array is only large enough to accept a terminating null character. In the event where the user enters one or more characters before pressing enter, scanf would be writing beyond the end of the character array and invoking undefined behaviour. In the event where the user merely presses enter, the %[...] directive will fail and not even a '\0' will be written to your character array.
Now, with that all said and done, I'll assume you meant this: scanf("%[^\n]", stringVariableName); (note the omitted ampersand)
You really should be checking the return value!!
A %[ directive causes scanf to retrieve a sequence of characters consisting of those specified between the [ square brackets ]. A ^ at the beginning of the set indicates that the desired set contains all characters except for those between the brackets. Hence, %[^\n] tells scanf to read as many non-'\n' characters as it can, and store them into the array pointed to by the corresponding char *.
The '\n' will be left unread. This could cause problems. An empty field will result in a match failure. In this situation, it's possible that no data will be copied into your array (not even a terminating '\0' character). For this reason (and others), you really need to check the return value!
Which manual contains information about the return values of scanf? The scanf manual.
Other people have explained what %[^\n] means.
This is not an okay way to read strings. It is just as dangerous as the notoriously unsafe gets, and for the same reason: it has no idea how big the buffer at stringVariableName is.
The best way to read one full line from a file is getline, but not all C libraries have it. If you don't, you should use fgets, which knows how big the buffer is, and be aware that you might not get a complete line (if the line is too long for the buffer).
Reading from the man pages for scanf()...
[ Matches a non-empty sequence of characters from the
specified set of accepted characters; the next pointer must be a
pointer to char, and there must be enough room for all the characters
in the string, plus a terminating null byte. The usual skip of
leading white space is suppressed. The string is to be made up of
characters in (or not in) a particular set; the set is defined by the
characters between the open bracket [ character and a close bracket ]
character. The set excludes those characters if the first character
after the open bracket is a circumflex (^). To include a close
bracket in the set, make it the first character after the open bracket
or the circumflex; any other position will end the set. The hyphen
character - is also special; when placed between two other
characters, it adds all intervening characters to the set. To
include a hyphen, make it the last character before the final close
bracket. For instance, [^]0-9-] means the set "everything except
close bracket, zero through nine, and hyphen". The string ends with
the appearance of a character not in the (or, with a
circumflex, in) set or when the field width runs out.
In a nutshell, the [^\n] means that read everything from the string that is not a \n and store that in the matching pointer in the argument list.

How can I read a string with spaces in it in C?

scanf("%s",str) won't do it. It will stop reading at the first space.
gets(str) doesn't work either when the string is large. Any ideas?
use fgets with STDIN as the file stream. Then you can specify the amount of data you want to read and where to put it.
char str[100];
Try this
scanf("%[^\n]s",str);
or this
fgets(str, sizeof str, stdin))
Create your own function to read a line. Here's what you basically have to do:
1. fgets into allocated (growable) memory
2. if it was a full line you're done
3. grow the array
4. fgets more characters into the newly allocated memory
5. goto 2.
The implementation may be a bit tricky :-)
You need to think about what you need to pass to your function (at the very least the address of the array and its size); and what the function returns when everything "works" or when there is an error. You need to decide what is an error (is a string 10Gbytes long with no '\n' an error?). You need to decide on how to grow the array.
Edit
Actually it may be better to fgetc rather than fgets
get a character
it it EOF? DONE
add to array (update length), possible growing it (update size)
is it '\n'? DONE
repeat
When do you want to stop reading? At EOF, at a specific character, or what?
You can read a specific number of characters with %c
c Matches a sequence of width
count characters (default 1); the next
pointer must be a pointer to char, and there must be enough room
for all the characters (no terminating NUL is added). The usual
skip of leading white space is suppressed. To skip white space
first, use an explicit space in the format.
You can read specific characters (or up to excluded ones) with %[
[ Matches a nonempty sequence of
characters from the specified set of
accepted characters; the next pointer must be a pointer to
char,
and there must be enough room for all the characters in the
string,
plus a terminating NUL character. The usual skip of leading
white
space is suppressed. The string is to be made up of characters
in
(or not in) a particular set; the set is defined by the
characters
between the open bracket [ character and a close bracket ]
charac-
ter. The set excludes those characters if the first
character
after the open bracket is a circumflex ^. To include a close
bracket in the set, make it the first character after the open
bracket or the circumflex; any other position will end the set.
The hyphen character - is also special; when placed between two
other characters, it adds all intervening characters to the set.
To include a hyphen, make it the last character before the final
close bracket. For instance, `[^]0-9-]' means the set
``everything
except close bracket, zero through nine, and hyphen''. The
string
ends with the appearance of a character not in the (or, with a
cir-
cumflex, in) set or when the field width runs out
To read string with space you can do as follows:
char name[30],ch;
i=1;
while((ch=getchar())!='\n')
{
name[i]=ch;
i++;
}
i++;
name[i]='\n';
printf("String is %s",name);

C: fscanf - infinite loop when first character matches

I am attempting to parse a text (CSS) file using fscanf and pull out all statements that match this pattern:
#import "some/file/somewhere.css";
To do this, I have the following loop set up:
FILE *file = fopen(pathToSomeFile, "r");
char *buffer = (char *)malloc(sizeof(char) * 9000);
while(!feof(file))
{
// %*[^#] : Read and discard all characters up to a '#'
// %8999[^;] : Read up to 8999 characters starting at '#' to a ';'.
if(fscanf(file, "%*[^#] %8999[^;]", buffer) == 1)
{
// Do stuff with the matching characters here.
// This code is long and not relevant to the question.
}
}
This works perfectly SO LONG AS the VERY FIRST character in the file is not a '#'. (Literally, a single space before the first '#' character in the CSS file will make the code run fine.)
But if the very first character in the CSS file is a '#', then what I see in the debugger is an infinite loop -- execution enters the while loop, hits the fscanf statement, but does not enter the 'if' statement (fscanf fails), and then continues through the loop forever.
I believe my fscanf formatters may need some tweaking, but am unsure how to proceed. Any suggestions or explanations for why this is happening?
Thank you.
I'm not an expert on scanf pattern syntax, but my interpretation of yours is:
Match a non-empty sequence of non-'#' characters, then
Match a non-empty sequence of up to 8999 non-';' characters
So yes, if your string starts with a '#', then the first part will fail.
I think if you start your format string with some whitespace, then fscanf will eat any leading whitespace in your data string, i.e. simply " %8999[^;]".
Oli already said why fscanf failed. And since failure is a normal state for fscanf your busy loop is not the consequence of the fscanf failure but of the missing handling for it.
You have to handle a fscanf failure even if your format would be correct (in your special case), because you cannot be sure that the input always is matchable by the format. Actually you can be sure that much more nonmatching input exists than matching input.
Your format string does the following actions:
Read (and discard) 1 or more non-# characters
Read (and discard) 0 or more whitespace characters (due to the space in the format string)
Read and store 1 to 8999 non-; characters
Unfortunately, there is no format specifier for reading "zero or more" characters from a user-defined set.
If you don't care about multiple #include statements on a line, you could change your code to read a single line (with fgets), and then extract the #include statement from that (if the first character does not equal #, you can use your current format string with sscanf, otherwise, you could use sscanf(line, "%8999[^;]", buffer)).
If multiple #include statemens on a line should be handled correctly, you could inspect the next character to be read with getc and then put it back with ungetc.

Resources