I'm trying to read a string which consists of a set of numbers followed by a string, wrapped with some other basic text.
In other words, the format of the line is something like this:
Stuff<5,10,-5,8,"Test string here.">
Naively, I tried:
sscanf(str,"Stuff<%d,%d,%d,%d,\"%s\">",&i1,&i2,&i3,&i4,str2);
But after some research I discovered %s is supposed to stop parsing when it gets to a whitespace character. I found this question, but none of the answers addresses the problem I have: the string could contain any character in it, including newline characters and properly escaped quotes. The latter is not a problem, if I can just get sscanf to put everything after the first quote in the pre-allocated buffer I provide, I can strip the end off myself.
But how do I do this? I can't use %[] because it requires something in it to terminate the string, and the only thing I want to terminate it is the null terminator. So I thought, "Hey, I'll just use the null terminator!" But %[\0] made the compiler grumpy:
warning: no closing ‘]’ for ‘%[’ format
warning: embedded ‘\0’ in format
Using something like %*c won't work either, because I don't know exactly how many characters need to be taken. I tried passing strlen(str) since it will be less than that, but sscanf returns 4 and nothing is put into str2, suggesting that perhaps because the length was too long it gave up and didn't bother.
Update: I guess I could do something like:
sscanf(str,"Stuff<%d,%d,%d,%d,\"%n",&i1,&i2,&i3,&i4,&n);
str2 = str+n;
Your update seems to be a good answer. I was going to suggest strchr to find the location of the first quote char, after using sscanf to get i1 thru i4. Side note, you should always check the return value from sscanf to make sure that the conversions worked. This is even more important with your suggested answer, since n will be left uninitialized if the first four conversions aren't successful.
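To make the return-value check concrete, here is a minimal sketch of the %n approach from the update. The helper name parse_stuff_prefix and the out array are made up for illustration; initializing n to -1 is one way (an assumption, not the only one) to detect that sscanf() never reached the %n directive.

```c
#include <stdio.h>

/* Hypothetical helper, not from the original post: parses the four integers
   and returns a pointer to the first character after the opening quote,
   or NULL if the prefix did not match. */
const char *parse_stuff_prefix(const char *str, int out[4])
{
    int n = -1; /* stays -1 unless sscanf() gets as far as %n */
    int got = sscanf(str, "Stuff<%d,%d,%d,%d,\"%n",
                     &out[0], &out[1], &out[2], &out[3], &n);

    /* %n does not count towards the return value, so 4 is the success value */
    if (got != 4 || n < 0)
        return NULL;
    return str + n;
}
```

The returned pointer still includes the trailing `">`, which the caller has to strip, as described in the question.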
Scan for '\"', then for everything not '\"', then '\"' again.
Be sure to check sscanf() result and limit how long the test string may be.
char test_string[100];
int n = 0;
if (sscanf(str, "Stuff<%d,%d,%d,%d, \"%99[^\"]\"> %n",
&i1, &i2, &i3, &i4, test_string, &n) == 5 && str[n] == '\0') Good();
Your attempt using "...%[\0]...", from sscanf() point-of-view, is "...%[".
Everything in the format from "\0" on is ignored.
Using the int n = 0, appending " %n" to the format string, appending &n to the parameters and checking str[n] == '\0' is a neat trick with sscanf() to ensure the entire line parsed correctly. Note: "%n" does not add to the sscanf() result.
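Wrapped into a function, the same idea might look like the sketch below. The name parse_stuff_line and the v array are illustrative only. Note that this format stops at the first quote, so it does not handle the escaped quotes mentioned in the question, and an empty string between the quotes counts as a failure because %[ needs at least one character.

```c
#include <stdio.h>

/* Hypothetical wrapper around the format above.
   Returns 1 only if the whole line matched, 0 otherwise. */
int parse_stuff_line(const char *str, int v[4], char text[100])
{
    int n = 0; /* only set if sscanf() reaches the trailing %n */
    int got = sscanf(str, "Stuff<%d,%d,%d,%d, \"%99[^\"]\"> %n",
                     &v[0], &v[1], &v[2], &v[3], text, &n);

    /* 5 conversions stored, and the scan ended exactly at the terminator */
    return got == 5 && str[n] == '\0';
}
```

If the closing `">` is missing, n keeps its initial 0 and str[0] is 'S', so the check fails; trailing junk after the `>` makes str[n] something other than '\0'.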
This is not the only way to achieve what you want to achieve, but probably the neatest way to do it: You'll need to use the scansets. I won't tell you the solution directly with this answer, I'll explain how to use scansets as far as I know them, and you'll hopefully be able to do it yourself.
Scansets %[...] are like %s when it comes to assignment, they interpret values as characters and store them into character arrays. %s is whitespace-terminated, %[...] is the flexible version of that.
There are two ways of using the scanset, first one being without a preceding caret ^, second one being with a preceding caret ^.
When you use scanset without the preceding caret ^, the characters you put inside the brackets will be the only ones that will be read, stored and then left behind. As soon as scanf encounters a non-matching character, that %[...] will be over. For example:
// input: asdasdasdwasdasd
char s[100] = { 0 };
scanf( "%[das]", s );
printf( "%s", s );
// output: asdasdasd
When you use scanset with the preceding caret ^, the search is inverted. It reads, stores and leaves behind every character until it reaches any one of the characters that you've put down after the preceding caret ^. Example:
// input: abcdefgh^kekQ
char s[100] = { 0 };
scanf( "%[^Q^]", s );
printf( "%s", s );
// output: abcdefgh
Beware, the remaining characters are still waiting to be read from the stream; the file position won't get beyond the character which caused termination. I.e. for the first one, getchar( ); would give a 'w', and for the second one it would give a '^'.
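The two examples can be reproduced without typing input by scanning a string with sscanf() instead. This is only a sketch: the function names scan_accept and scan_reject are invented here, a field width is added to protect the buffer, and %n is used to find the character that stopped the scan.

```c
#include <stdio.h>

/* Stores the leading run of 'd', 'a' and 's' characters in out and returns
   the character that stopped the scan (what getchar() would see next). */
char scan_accept(const char *in, char out[100])
{
    int n = 0;

    out[0] = '\0';
    if (sscanf(in, "%99[das]%n", out, &n) != 1)
        n = 0; /* nothing matched: the scan stopped at the first character */
    return in[n];
}

/* Same, but stores everything up to the first 'Q' or '^'. */
char scan_reject(const char *in, char out[100])
{
    int n = 0;

    out[0] = '\0';
    if (sscanf(in, "%99[^Q^]%n", out, &n) != 1)
        n = 0;
    return in[n];
}
```

With the inputs from the examples above, scan_accept("asdasdasdwasdasd", s) stores "asdasdasd" and returns 'w', and scan_reject("abcdefgh^kekQ", s) stores "abcdefgh" and returns '^'.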
I hope this will be enough. If you still cannot find your way out, ask away, I can give you a solution.
I have a little problem with my code. At the moment I just have a line (char* string with \0 at the end) and I want the line to be checked on special characters. Therefore I used the following code:
char lineJunk;
if(sscanf(lineContent, "%*[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+=\0/{}:]%c", &lineJunk)){
return 0;
}
Now my compiler will spit out the following warning:
Multiple markers at this line
- no closing ‘]’ for ‘%[’ format [-Wformat=]
- embedded ‘\0’ in format [-Wformat-contains-nul]
- too many arguments for format [-Wformat-extra-args]
These warnings only appear when I have \0 in my sscanf. Yet otherwise the code won't work, because the Line I am checking on has \0 at its end. When I use \\0 instead of \0 the warnings disappear, but the code doesn't work anymore. Just \ doesn't work either.
Somebody know a solution?
You are not using sscanf() correctly. To explain the warnings:
"No closing ]" means your format string has no closing ], which is required because the format contains a %[ conversion.
The closing ] in your format string is not really there, because you have an embedded '\0' in the format string, so the actual format string is
"%*[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+="
"Embedded \0" is reported because you have an explicit '\0' in the format string, which is what causes the previous warning too.
The compiler already adds one at the end of the format string to mark its end, so that it becomes a legitimate C string, in the sense that you can pass it to strlen() and other functions that expect the nul terminator to be present.
By embedding another one in the format string, you mark the end of the string at the position where you inserted it; that's why the effective format string is the truncated one shown above.
"Too many arguments": you are discarding the matched value by using the * modifier, and the %c that follows is cut off by the '\0', so nothing in the effective format uses the parameter you pass. You need to remove the * to make the passed parameter useful.
You can't match the '\0' with sscanf(). If you want that, you need to traverse the string one byte at a time until you find a '\0', and in that case the length should be known beforehand.
There is no need for '\0' in the format of sscanf(char *src, char *format, ...). sscanf() will stop scanning when it reaches the '\0' in src. So sscanf() will never provide '\0' for scanning.
As mentioned by @iharob, the '\0' in the format is trouble, as sscanf() sees that as the end of the format. That is what the compiler is warning about.
// Eliminate `\0` from the format.
#define SKIP "%*[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+=/{}:]"
if(sscanf(lineContent, SKIP "%c" , &lineJunk) == 1) {
return 0;
}
Should A-Z be consecutive as with typical encoding of ASCII, a short-cut would be: #define SKIP "%*[A-Za-z0-9+=/{}:]"
--
Note: it is better to check the sscanf() result against what the code wants, 1, rather than non-zero. In some situations sscanf() will return EOF.
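One caveat with the sscanf() version: %*[...] must match at least one character, so a line that starts with a disallowed character makes sscanf() return 0 and the check does not fire. As an alternative, and swapping in a different technique than the answers above use, the standard strspn() function can do the same test. A minimal sketch, with the function name has_special_char invented for illustration:

```c
#include <string.h>

/* Returns 1 if line contains a character outside the allowed set, else 0.
   strspn() gives the length of the leading run of allowed characters; if
   that run does not end at the terminating '\0', something else was found. */
int has_special_char(const char *line)
{
    static const char allowed[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789+=/{}:";

    return line[strspn(line, allowed)] != '\0';
}
```

This treats an empty line as containing no special characters.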
I have the following problem:
sscanf is not returning the way I want it to.
This is the sscanf:
sscanf(naru,
"%s[^;]%s[^;]%s[^;]%s[^;]%f[^';']%f[^';']%[^;]%[^;]%[^;]%[^;]"
"%[^;]%[^;]%[^;]%[^;]%[^;]%[^;]%[^;]%[^;]%[^;]%[^;]%[^;]%[^;]"
"%[^;]%[^;]%[^;]%[^;]%[^;]%[^;]",
&jokeri, &paiva1, &keskilampo1, &minlampo1, &maxlampo1,
&paiva2, &keskilampo2, &minlampo2, &maxlampo2, &paiva3,
&keskilampo3, &minlampo3, &maxlampo3, &paiva4, &keskilampo4,
&minlampo4, &maxlampo4, &paiva5, &keskilampo5, &minlampo5,
&maxlampo5, &paiva6, &keskilampo6, &minlampo6, &maxlampo6,
&paiva7, &keskilampo7, &minlampo7, &maxlampo7);
The string it's scanning:
const char *str = "city;"
"2014-04-14;7.61;4.76;7.61;"
"2014-04-15;5.7;5.26;6.63;"
"2014-04-16;4.84;2.49;5.26;"
"2014-04-17;2.13;1.22;3.45;"
"2014-04-18;3;2.15;3.01;"
"2014-04-19;7.28;3.82;7.28;"
"2014-04-20;10.62;5.5;10.62;";
All of the variables are stored as char paiva1[22] etc; however, the sscanf isn't storing anything except the city correctly. I've been trying to stop each variable at ;.
Any help how to get it to store the dates etc correctly would be appreciated.
Or if there's a smarter way to do this, I'm open to suggestions.
There are multiple problems, but BLUEPIXY hit the first one — the scan-set notation doesn't follow %s.
Your first line of the format is:
"%s[^;]%s[^;]%s[^;]%s[^;]%f[^';']%f[^';']%[^;]%[^;]%[^;]%[^;]"
As it stands, it looks for a space separated word, followed by a [, a ^, a ;, and a ] (which is self-contradictory; the character after the string is a space or end of string).
The first fixup would be to use scan-sets properly:
"%[^;]%[^;]%[^;]%[^;]%f[^';']%f[^';']%[^;]%[^;]%[^;]%[^;]"
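Now you have a problem that the first %[^;] scans everything up to the end of string or first semicolon, leaving nothing for the second %[^;] to match, because the semicolon itself is never consumed. The next fixup is to match the semicolons explicitly: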
"%[^;]; %[^;]; %[^;]; %[^;]; %f[^';']%f[^';']%[^;]%[^;]%[^;]%[^;]"
This looks for a string up to a semicolon, then for the semicolon, then optional white space, then repeats for three items. Apart from adding a length to limit the size of string, preventing overflow, these are fine. The %f is OK. The following material looks for an odd sequence of characters again.
However, when the data is looked at, it seems to consist of a city, and then seven sets of 'a date plus three numbers'.
You'd do better with an array of structures (if you've worked with those yet), or a set of 4 parallel arrays, and a loop:
char jokeri[30];
char paiva[7][30];
float keskilampo[7];
float minlampo[7];
float maxlampo[7];
int eoc; // End of conversion
int offset = 0;
char sep;
if (sscanf(str + offset, "%29[^;]%c%n", jokeri, &sep, &eoc) != 2 || sep != ';')
...report error...
offset += eoc;
for (int i = 0; i < 7; i++)
{
if (sscanf(str + offset, "%29[^;];%f;%f;%f%c%n", paiva[i],
&keskilampo[i], &minlampo[i], &maxlampo[i], &sep, &eoc) != 5 ||
sep != ';')
...report error...
offset += eoc;
}
See also How to use sscanf() in loops.
Now you have data that can be managed. The set of 29 separately named variables is a ghastly thought; the code using them will be horrid.
Note that the scan-set conversion specifications limit the string to a maximum length one shorter than the size of jokeri and the paiva array elements.
You might legitimately be wondering about why the code uses %c%n and &sep before &eoc. There is a reason, but it is subtle. Suppose that the sscanf() format string is:
"%29[^;];%f;%f;%f;%n"
Further, suppose there's a problem in the data: the semicolon after the third number is missing. The call to sscanf() will report that it made 4 successful conversions, but it doesn't count the %n as an assignment, so you can't tell that sscanf() didn't find a semicolon and therefore did not set eoc at all; the value is left over from a previous call to sscanf(), or simply uninitialized. By using the %c to scan a value into sep, we get 5 returned on success, and we can be sure the %n was successful too. The code checks that the value in sep is in fact a semicolon and not something else.
You might want to consider a space before the semi-colons, and before the %c. They'll allow some other data strings to be converted that would not be matched otherwise. Spaces in a format string (outside a scan-set) indicate where optional white space may appear.
I would use strtok function to break your string into pieces using ; as a delimiter. Such a long format string may be a source of problems in future.
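A rough sketch of the strtok approach, combined with the array-of-structures idea from the other answer. The names day_record, parse_forecast and FIELD_LEN are invented for this example. Two assumptions to be aware of: strtok() writes into the string, so the data must be in a writable array rather than a string literal, and strtok() skips empty fields (two adjacent semicolons count as one delimiter).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FIELD_LEN 30

struct day_record {
    char date[FIELD_LEN];
    float avg, min, max;
};

/* Splits "city;date;avg;min;max;date;avg;min;max;..." in place.
   Returns the number of complete day records stored, or -1 if even the
   city is missing. strtok() overwrites the ';' delimiters in line. */
int parse_forecast(char *line, char city[FIELD_LEN],
                   struct day_record days[], int max_days)
{
    char *tok = strtok(line, ";");
    int count = 0;

    if (tok == NULL)
        return -1;
    snprintf(city, FIELD_LEN, "%s", tok); /* truncates instead of overflowing */

    while (count < max_days) {
        char *date, *avg, *min, *max;

        if ((date = strtok(NULL, ";")) == NULL ||
            (avg = strtok(NULL, ";")) == NULL ||
            (min = strtok(NULL, ";")) == NULL ||
            (max = strtok(NULL, ";")) == NULL)
            break; /* ran out of fields */

        snprintf(days[count].date, FIELD_LEN, "%s", date);
        days[count].avg = strtof(avg, NULL); /* no validation of the number */
        days[count].min = strtof(min, NULL);
        days[count].max = strtof(max, NULL);
        count++;
    }
    return count;
}
```

Unlike the sscanf() loop, this does not verify that the numeric fields are well-formed; passing an end pointer to strtof() instead of NULL would allow that check.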
I have read strings with spaces in them using the following scanf() statement.
scanf("%[^\n]", &stringVariableName);
What is the meaning of the control string [^\n]?
Is it an okay way to read strings with white space like this?
This means "read anything until you find a '\n'".
This is OK, but it would be better to do "read anything until you find a '\n', or until you have read as many characters as my buffer supports":
char stringVariableName[256] = { 0 };
if (scanf("%255[^\n]", stringVariableName) == 1)
...
Edit: removed & from the argument, and check the result of scanf.
The format specifier "%[^\n]" instructs scanf() to read up to but not including the newline character. From the linked reference page:
matches a non-empty sequence of character from set of characters.
If the first character of the set is ^, then all characters not
in the set are matched. If the set begins with ] or ^] then the ]
character is also included into the set.
If the string is on a single line, fgets() is an alternative but the newline must be removed as fgets() writes it to the output buffer. fgets() also forces the programmer to specify the maximum number of characters that can be read into the buffer, making it less likely for a buffer overrun to occur:
char buffer[1024];
if (fgets(buffer, 1024, stdin))
{
/* Remove newline. */
char* nl = strrchr(buffer, '\n');
if (nl) *nl = '\0';
}
It is possible to specify the maximum number of characters to read via scanf():
scanf("%1023[^\n]", buffer);
but it is impossible to forget to do it for fgets() as the compiler will complain. Though, of course, the programmer could specify the wrong size but at least they are forced to consider it.
Technically, this can't be well defined.
Matches a nonempty sequence of characters from a set of expected
characters (the scanset).
If no l length modifier is present, the corresponding argument shall
be a pointer to the initial element of a character array large enough
to accept the sequence and a terminating null character, which will be
added automatically.
Supposing the declaration of stringVariableName looks like char stringVariableName[x];, then &stringVariableName is a char (*)[x];, not a char *. The type is wrong. The behaviour is undefined. It might work by coincidence, but anything that relies on coincidence doesn't work by my definition.
The only way to form a char * using &stringVariableName is if stringVariableName is a char! This implies that the character array is only large enough to accept a terminating null character. In the event where the user enters one or more characters before pressing enter, scanf would be writing beyond the end of the character array and invoking undefined behaviour. In the event where the user merely presses enter, the %[...] directive will fail and not even a '\0' will be written to your character array.
Now, with that all said and done, I'll assume you meant this: scanf("%[^\n]", stringVariableName); (note the omitted ampersand)
You really should be checking the return value!!
A %[ directive causes scanf to retrieve a sequence of characters consisting of those specified between the [ square brackets ]. A ^ at the beginning of the set indicates that the desired set contains all characters except for those between the brackets. Hence, %[^\n] tells scanf to read as many non-'\n' characters as it can, and store them into the array pointed to by the corresponding char *.
The '\n' will be left unread. This could cause problems. An empty field will result in a match failure. In this situation, it's possible that no data will be copied into your array (not even a terminating '\0' character). For this reason (and others), you really need to check the return value!
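The three possible outcomes are easy to see by scanning strings with sscanf() instead of reading stdin. A small sketch; the name read_field and the 10-byte buffer are arbitrary choices for this example:

```c
#include <stdio.h>

/* Returns what sscanf() returns: 1 if a field was stored, 0 on a match
   failure (empty field), EOF if the input ended before anything was read.
   out is cleared first, because on failure sscanf() leaves it untouched. */
int read_field(const char *in, char out[10])
{
    out[0] = '\0';
    return sscanf(in, "%9[^\n]", out);
}
```

For "hello world\nnext" this returns 1 and stores "hello wor" (the field width stops it at 9 characters); for "\nnext" it returns 0; for "" it returns EOF.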
Which manual contains information about the return values of scanf? The scanf manual.
Other people have explained what %[^\n] means.
This is not an okay way to read strings. It is just as dangerous as the notoriously unsafe gets, and for the same reason: it has no idea how big the buffer at stringVariableName is.
The best way to read one full line from a file is getline, but not all C libraries have it. If you don't, you should use fgets, which knows how big the buffer is, and be aware that you might not get a complete line (if the line is too long for the buffer).
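For the fgets route, a minimal sketch of a line reader that also tells you whether the whole line fit. The name read_line and its return convention are made up for this example; strcspn() is used to find the newline.

```c
#include <stdio.h>
#include <string.h>

/* Reads up to size-1 characters of one line into buf.
   Returns 1 if a newline was found (and removed), 0 if no newline was seen
   (line longer than the buffer, or a last line without a newline),
   -1 on end of file or error. */
int read_line(FILE *fp, char *buf, size_t size)
{
    size_t len;

    if (size == 0 || fgets(buf, (int)size, fp) == NULL)
        return -1;

    len = strcspn(buf, "\n"); /* index of the newline, or of the '\0' */
    if (buf[len] == '\n') {
        buf[len] = '\0';
        return 1;
    }
    return 0;
}
```

When it returns 0, the rest of the long line is still in the stream and will be returned by the next call.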
Reading from the man pages for scanf()...
[ Matches a non-empty sequence of characters from the
specified set of accepted characters; the next pointer must be a
pointer to char, and there must be enough room for all the characters
in the string, plus a terminating null byte. The usual skip of
leading white space is suppressed. The string is to be made up of
characters in (or not in) a particular set; the set is defined by the
characters between the open bracket [ character and a close bracket ]
character. The set excludes those characters if the first character
after the open bracket is a circumflex (^). To include a close
bracket in the set, make it the first character after the open bracket
or the circumflex; any other position will end the set. The hyphen
character - is also special; when placed between two other
characters, it adds all intervening characters to the set. To
include a hyphen, make it the last character before the final close
bracket. For instance, [^]0-9-] means the set "everything except
close bracket, zero through nine, and hyphen". The string ends with
the appearance of a character not in the (or, with a
circumflex, in) set or when the field width runs out.
In a nutshell, [^\n] means: read everything from the input that is not a \n and store it via the matching pointer in the argument list.
I have been struggling to figure out the fscanf formatting. I just want to read in a file of words delimited by spaces. And I want to discard any strings that contain non-alphabetic characters.
char temp_text[100];
while(fscanf(fcorpus, "%101[a-zA-Z]s", temp_text) == 1) {
printf("%s\n", temp_text);
}
I've tried the above code both with and without the 's'. I read in another stackoverflow thread that the s when used like that will be interpreted as a literal 's' and not as a string. Either way - when I include the s and when I do not include the s - I can only get the first word from the file I am reading through to print out.
The %[ scan specifier does not skip leading spaces. Either add a space before it or at the end in place of your s. Also you have your 100 and 101 backwards and thus a serious buffer overflow bug.
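Note that even with the leading space, a %[a-zA-Z] loop stops at the first word that begins with a non-alphabetic character. If the goal is to skip such words and keep going, one option (a different technique from the scanset, sketched here under that assumption) is to read every whitespace-delimited token with %99s and test it afterwards. The sketch scans a string with sscanf() and %n so it is easy to try; the same loop body works with fscanf() on a file. The names all_alpha and count_alpha_words are invented.

```c
#include <ctype.h>
#include <stdio.h>

/* Returns 1 if every character of s is alphabetic. */
static int all_alpha(const char *s)
{
    for (; *s != '\0'; s++)
        if (!isalpha((unsigned char)*s))
            return 0;
    return 1;
}

/* Counts the whitespace-delimited words in text that are purely alphabetic.
   %99s skips leading whitespace itself and leaves room for the '\0' in a
   100-byte buffer; %n reports how far the scan got. */
int count_alpha_words(const char *text)
{
    char word[100];
    int n = 0, count = 0;

    while (sscanf(text, "%99s%n", word, &n) == 1) {
        if (all_alpha(word))
            count++;
        text += n; /* continue after the token just read */
    }
    return count;
}
```

For "hello w0rld foo, bar\n baz" this counts 3 words: "w0rld" and "foo," are discarded.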
The s isn't needed.
Here are a few things to try:
Print out the return value from fscanf, and make sure it is 1.
Make sure that the fscanf is consuming the whitespace by using fgetc to get the next character and printing it out.
I'm reading in a .txt file. I'm using fscanf to get the data as it is formatted.
The line I'm having problems with is this:
result = fscanf(fp, "%s", ap->name);
This is fine until I have a name with a whitespace eg: St Ives
So I use this to read in the white space:
result = fscanf(fp, "%[^\n]s", ap->name);
However, when I try to read in the first name (with no white space) it just doesn't work and messes up the other fscanf.
But when I use [^\n] in a different file, it works fine. Not sure what is happening.
If I use fgets in the place of the fscanf above I get "\n" in the variable.
Edit//
Ok, so if I use:
result = fscanf(fp, "%s", ap->name);
result = fscanf(fp, "%[^\n]s", ap->name);
This allows me to read in a string with no white space. But When I get a "name" with whitespace it doesn't work.
One problem with this:
result = fscanf(fp, "%[^\n]s", ap->name);
is that you have an extra s at the end of your format specifier. The entire format specifier should just be %[^\n], which says "read in a string which consists of characters which are not newlines". The extra s is not part of the format specifier, so it's interpreted as a literal: "read the next character from the input; if it's an "s", continue, otherwise fail."
The extra s doesn't actually hurt you, though. You know exactly what the next character of input is: a newline. It doesn't match, and input processing stops there, but it doesn't really matter since it's the end of your format specifier. This would cause problems, though, if you had other format specifiers after this one in the same format string.
The real problem is that you're not consuming the newline: you're only reading in all of the characters up to the newline, but not the newline itself. To fix that, you should do this:
result = fscanf(fp, "%[^\n]%*c", ap->name);
The %*c specifier says to read in a character (c), but don't assign it to any variable (*). If you omitted the *, you would have to pass fscanf() another parameter containing a pointer to a character (a char*), where it would then store the resulting character that it read in.
You could also use %[^\n]\n, but that would also read in any whitespace which followed the newline, which may not be what you want. When fscanf finds whitespace in its format specifier (a space, newline, or tab), it consumes as much whitespace as it can (i.e. you can think of it consuming the longest string that matches the regular expression [ \t\n]*).
Finally, you should also specify a maximum length to avoid buffer overruns. You can do this by placing the buffer length in between the % and the [. For example, if ap->name is a buffer of 256 characters, you should do this:
result = fscanf(fp, "%255[^\n]%*c", ap->name);
This works great for statically allocated arrays; unfortunately, if the array is dynamically sized at runtime, there's no easy way to pass the buffer size to fscanf. You'll have to create the format string with snprintf, e.g.:
char format[256];
snprintf(format, sizeof(format), "%%%d[^\n]%%*c", buffer_size - 1);
result = fscanf(fp, format, ap->name);
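One thing to watch with %[^\n]%*c: on an empty line the %[ directive fails before %*c runs, so the newline stays in the stream. Below is a sketch of a helper that builds the format at run time as above but discards the rest of the line with getc() instead. The name read_name and its return convention are assumptions for this example.

```c
#include <stdio.h>

/* Reads one line into buf (at most buffer_size-1 characters).
   Returns 1 if a non-empty line was stored, 0 for an empty line,
   EOF at end of input. Whatever does not fit is discarded. */
int read_name(FILE *fp, char *buf, int buffer_size)
{
    char format[32];
    int result, c;

    if (buffer_size < 2)
        return EOF; /* no room for even one character plus '\0' */

    /* Builds e.g. "%255[^\n]" for a 256-byte buffer */
    snprintf(format, sizeof format, "%%%d[^\n]", buffer_size - 1);

    buf[0] = '\0';
    result = fscanf(fp, format, buf);
    if (result == EOF)
        return EOF;

    /* Consume the rest of the line, including the newline itself */
    while ((c = getc(fp)) != '\n' && c != EOF)
        ;
    return result;
}
```

Because the newline is always consumed here, a blank line between two names does not stall the reader.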
Jumm wrote:
If I use fgets in the place of the fscanf above I get "\n" in the variable.
Which is a far easier problem to solve so go with it:
fgets( ap->name, MAX, fp ) ;
nlptr = strrchr ( ap->name, '\n' ) ;
if( nlptr != 0 )
{
*nlptr = '\0' ;
}
I'm not sure how you think [^\n] is supposed to work. %[ ] is a conversion specifier which says "accept a run of characters from the set inside the brackets". The ^ inverts the condition, so the run consists of characters that are not in the set. %s with fscanf only reads until it comes across whitespace. For strings with spaces and newlines in them, use a combination of fgets and sscanf instead, and specify a restriction on the length.
As far as I can gather, you are trying to use a regular expression in the fscanf function, which does not support them; at least not to my knowledge, nor have I seen it anywhere. Enlighten me on this.
The format specifier for reading a string is %s, it could be that you need to do it this way, %s\n which will pick up the newline.
But for pete's sake do not use the standard old gets family of functions as specified by Clifford's answer above, as that is where buffer overflows happen. gets was exploited by an infamous worm, the Morris Worm of 1988, more specifically through the fingerd daemon, whose call to gets caused chaos. Fortunately, that has now been patched. Furthermore, a lot of programmers have been drilled into the mentality of not using the function.
Even Microsoft has adopted a safe version of gets family of functions, that specifies a parameter to indicate the length of buffer instead.
EDIT
My bad - I did not realize that Clifford indeed has specified the max length for input...Whoops! Sorry! Clifford's answer is correct! So +1 to Clifford's answer.
Thanks Neil for pointing out my error...
Hope this helps,
Best regards,
Tom.
I found the problem.
As Paul Tomblin said, I had an extra new line character in the field above. So using what tommieb75 said I used:
result = fscanf(fp, "%s\n", ap->code);
result = fscanf(fp, "%[^\n]s", ap->name);
And this fixed it!
Thanks for your help.