Force fscanf to Consume Possible Whitespace - c

I have a multiline TSV file with the following format:
Type\tBasic Name\tAttribute\tA Long Description\n
As you can see, the Basic Name and the Description can both contain some number of spaces. I am trying to read each line in and extract the elements. For now, I've narrowed it down to just extracting the basic name. My fscanf is as follows:
fscanf(file_in, "%*[^ ]s\t%128[^ ]s\t%*[^ ]s\t%[^ ]s\n", name_string, desc_string);
This doesn't work as I have hoped, and I'm having trouble narrowing down the error. Does anyone know how I could read in the lines properly?

I mostly agree with Pablo (that the scanf family don't make great parsers), but it's worth understanding how to write a scanf pattern. The pattern you're looking for is something like this:
fscanf(file_in, " %*[^\t] %128[^\t] %*[^\t] %128[^\n]", name_string, desc_string)
Notes:
%[xyz] is a directive. %[xyz]s is two directives, the second of which matches a literal s
As far as I know, there is no way to match a single literal tab character, since any whitespace in the pattern matches any amount of whitespace (including none) in the input. I used a space in my example, which will match a terminating tab, but it will also match any number of consecutive tabs, so empty fields won't be parsed correctly.
The 128-character limit does not include the terminating NUL character.
Also, if the scan stops because the character limit is exceeded, it won't skip the rest of the field automatically, so you'll end up out of sync with the input.
A better pattern would be:
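fscanf(file_in, " %*[^\t] %128[^\t]%*[^\t] %*[^\t] %128[^\n]%*[^\n]", name_string, desc_string)
which explicitly skips the remaining characters in the field, if necessary. Bear in mind that a %[ directive fails when it matches no characters, so if a field fits within the limit the skipping directive ends the scan at that point; issuing the skip as a separate call avoids that. An even better solution would be to use the a modifier (a glibc extension; POSIX.1-2008 spells it m) and get fscanf to malloc memory for you.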
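As a rough sketch of the "limit the width, then skip the overflow" idea, here is one way to do it on a line that has already been read into memory. The helper name read_field and the use of %n to track how far the scan got are my own choices for illustration, not something from the question. The skip is done in a separate sscanf call so that an empty match there cannot cut the parse short.

```c
#include <stdio.h>

/* Hypothetical helper: copy one tab-delimited field from *cursor into out
 * (at most 128 characters plus the terminating NUL), discard whatever part
 * of the field did not fit, and step over the tab that ends the field.
 * Returns 1 if the field had at least one character, 0 if it was empty. */
int read_field(const char **cursor, char out[129])
{
    int used = 0;
    int got = sscanf(*cursor, "%128[^\t\n]%n", out, &used);
    if (got != 1) {             /* %[ fails on an empty field */
        out[0] = '\0';
        used = 0;
    }
    *cursor += used;

    /* Skip the tail of an over-long field in a separate call. If there is
     * no tail, the directive fails and used simply stays 0. */
    used = 0;
    sscanf(*cursor, "%*[^\t\n]%n", &used);
    *cursor += used;

    if (**cursor == '\t')       /* consume exactly one separator */
        ++*cursor;
    return got == 1;
}
```

Calling read_field four times on one line yields the four columns in order. Because exactly one tab is consumed per call, an empty column shows up as an empty string instead of being silently merged with its neighbour.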

I'd rather use strtok for this. It's more accurate than fscanf, since that function family only works when the format is 100% OK; otherwise you end up missing values.
Take a look at Parallel to PHP's "explode" in C: Split char* into char* using delimiter, where I explain in more detail how to use strtok.
So, read each line with fgets and parse it with strtok.
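A minimal sketch of the fgets plus strtok approach, assuming lines shorter than 1024 bytes and at most four columns (both limits are arbitrary choices for the example, and the function names are made up):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: split one line in place on tabs and store up to
 * max_fields pointers in fields[]. Returns the number of fields found.
 * Caveat: strtok treats a run of delimiters as a single delimiter, so
 * empty columns are skipped rather than reported. */
int split_tsv_line(char *line, char *fields[], int max_fields)
{
    int count = 0;
    line[strcspn(line, "\r\n")] = '\0';    /* drop the line terminator */
    for (char *tok = strtok(line, "\t");
         tok != NULL && count < max_fields;
         tok = strtok(NULL, "\t"))
        fields[count++] = tok;
    return count;
}

/* Read every line from in and print its second column (the basic name). */
void print_names(FILE *in)
{
    char line[1024];
    char *fields[4];
    while (fgets(line, sizeof line, in) != NULL)
        if (split_tsv_line(line, fields, 4) >= 2)
            puts(fields[1]);
}
```

Spaces inside a column are preserved because only the tab is passed as a delimiter. If empty columns matter, strtok is the wrong tool and a manual scan with strchr or strcspn is a safer bet.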

Firstly, as it has already been noted, the %[] is a conversion specifier by itself. There's no s after the []. The s-es that you have in your format string will not be considered parts of the conversion specifiers. You have to get rid of those s-es.
Secondly, as you said yourself, your file is TAB-separated, which immediately means that you should extract the continuous portions of the sequence by using the %[^\t] conversion specifier (or the %[^\n] specifier for the last portion). Why did you use %[^ ], and how did you expect it to work? The %[^ ] specifier actually stops parsing at a space character, which is the opposite of what you wanted.
In your example the proper combination of specifiers would be
fscanf(file_in, "%*[^\t]\t%128[^\t]\t%*[^\t]\t%[^\n]\n", name_string, desc_string);
This format string assumes that all 4 portions of the string are guaranteed to be present and that the last portion is guaranteed to be terminated by \n.
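Since those assumptions can be violated by real input, it is worth checking the return value. The sketch below applies the same kind of format to a line held in memory; the function name and the 256-character limit on the description are assumptions added for the example.

```c
#include <stdio.h>

/* Hypothetical wrapper: extract the 2nd and 4th tab-separated columns.
 * Returns 1 only if both conversions were stored, otherwise 0.
 * Suppressed (%*) conversions do not count toward sscanf's return value. */
int parse_record(const char *line, char name[129], char desc[257])
{
    int got = sscanf(line, "%*[^\t]\t%128[^\t]\t%*[^\t]\t%256[^\n]",
                     name, desc);
    return got == 2;
}
```

A line with fewer than four columns makes sscanf return less than 2, so the caller can reject it instead of using a stale or partial description.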

Related

Reading specifically formatted string

I'm attempting to read a file containing lines of strings in the following format:
"string";"string";"string";"string";"string"
How do I read them each using functions compatible on Windows and Linux?
Length of each string is unknown.
I have attempted to use fscanf like this:
fscanf(fp, "\"%s\";\"%s\";\"%s\";\"%s\";\"%s\"\n");
But the first string picked up the whole line.
If you really want to use fscanf, you could use a format string like this :
fscanf(fp, "\"%[^\"]\";\"%[^\"]\";\"%[^\"]\";\"%[^\"]\";\"%[^\"]\"\n", ...);
For more details, read up on the [set] conversion specifier in the reference docs for fscanf.
Note that this will not work with embedded '"' characters in the strings.
This also leaves no flexibility (like additional whitespace around the semicolons, optional quotes, etc.).
In case those limitations are problematic for you, you'll want a more intelligent parser (libcsv comes to mind, for example). Also see pmg's answer for how to roll your own.
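For completeness, a sketch of the scanset format with width limits added, so that an over-long field cannot overflow its buffer. The 63-character limit and the function name are assumptions for the example; the question says the real lengths are unknown, so pick a limit that suits your data or use a real parser.

```c
#include <stdio.h>

#define FIELD_MAX 63    /* assumed limit; must match the 63 in the format */

/* Hypothetical helper: parse five "quoted";"fields" from one line.
 * Returns 0 if all five fields were stored, -1 otherwise.
 * Note: an empty field ("") makes %[ fail, and the final closing quote
 * is not verified, because it comes after the last counted conversion. */
int parse_quoted5(const char *line, char f[5][FIELD_MAX + 1])
{
    int got = sscanf(line,
        "\"%63[^\"]\";\"%63[^\"]\";\"%63[^\"]\";\"%63[^\"]\";\"%63[^\"]\"",
        f[0], f[1], f[2], f[3], f[4]);
    return got == 5 ? 0 : -1;
}
```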
Here's some pseudo-code for you:
loop
    getchar; if not a quote exit with error
    loop
        getchar; mind EOF
        if not a quote, add to string
        if a quote exit inner loop
    use string
    getchar; if not semicolon exit with error unless EOF
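One possible C rendering of the inner part of that pseudo-code, reading from a FILE * with getc. The function name, the fixed-capacity output buffer, and the next out-parameter are choices made for this sketch, not part of the original answer.

```c
#include <stdio.h>

/* Hypothetical helper: read one "quoted" field from in into out, which has
 * room for cap bytes including the terminating NUL.
 * On success returns 1 and stores in *next the character that follows the
 * closing quote (';', '\n', EOF, ...). Returns 0 if the opening quote is
 * missing, the input ends inside the field, or the field does not fit. */
int read_quoted(FILE *in, char *out, size_t cap, int *next)
{
    if (cap == 0)
        return 0;
    int c = getc(in);
    if (c != '"')                   /* not a quote: error */
        return 0;
    size_t n = 0;
    while ((c = getc(in)) != '"') { /* a quote ends the inner loop */
        if (c == EOF || n + 1 >= cap)
            return 0;
        out[n++] = (char)c;         /* not a quote: add to string */
    }
    out[n] = '\0';
    *next = getc(in);               /* caller checks for ';' or end of line */
    return 1;
}
```

The outer loop from the pseudo-code then becomes: call read_quoted, use the string, and continue while *next is a semicolon.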

sscanf returns 1 despite the line not matching the search pattern

I need to read certain parameters from a file. My problem is to scan the lines of the file to find the parameters. The text file is structured in lines like:
character *\n
So each line has to have the pattern (space or tab)character[space or tab][char][space or tab]\n.
Spaces or tabs at the beginning are optional. I tried to do it with
char val;
if(sscanf(buf, "%*[ \t]character%*[ \t]%c%*[ \t]\n",&val)==1||sscanf(buf, "character%*[ \t]%c%*[ \t]\n",&val)==1){
printf("%c in %i\n", val,line);
}else{
fprintf(stderr,"Error while reading line %i\n",line);
}
buf contains the current line.
My problem is, in lines like character \n my program does not print an error. Instead it saves '\n' in val. I do not understand this behavior, because this line does not match my search pattern.
What is wrong with my code?
I do not understand this behavior, because this line does not match my search pattern.
The *scanf functions do not check the pattern first and then, if it matches, fill in the values. They check one character at a time, and indicate how many of the fields in the format string they were able to use.
Unfortunately, in your case, %c can certainly match '\n'. The subsequent %*[ \t] fails, as would the subsequent \n, but since those aren't stored anywhere, they don't affect sscanf's return value, so you can't tell from the result whether there was any error.
The simplest way to solve this might be to not use *scanf functions at all. Your input format is easily described using a custom routine, but not so easily with a format string.
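A custom routine along those lines might look like the sketch below. It assumes at least one blank between the keyword and the value, treats trailing blanks as optional, and also accepts a final line without a newline; those details are my reading of the question, so adjust them to the real format.

```c
#include <string.h>

/* Hypothetical matcher for lines of the form
 *   [blanks] "character" blanks <one char> [blanks] '\n'
 * Returns 1 and stores the value in *val on a full match, else 0.
 * *val is left untouched when the line does not match. */
int parse_character_line(const char *buf, char *val)
{
    const char *p = buf + strspn(buf, " \t");   /* optional leading blanks */
    if (strncmp(p, "character", 9) != 0)
        return 0;
    p += 9;

    size_t gap = strspn(p, " \t");              /* required separator */
    if (gap == 0)
        return 0;
    p += gap;

    if (*p == '\0' || *p == '\n')               /* value is missing */
        return 0;
    char c = *p++;

    p += strspn(p, " \t");                      /* optional trailing blanks */
    if (*p != '\n' && *p != '\0')               /* junk after the value */
        return 0;
    *val = c;
    return 1;
}
```

Unlike the sscanf version, the whole line is checked before anything is stored, so a line such as "character \n" is rejected instead of yielding '\n' as the value.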

Use scanf with Regular Expressions

I've been trying to use regular expressions on scanf, in order to read a string of maximum n characters and discard anything else until the New Line Character. Any spaces should be treated as regular characters, thus included in the string to be read.
I've studied a Wikipedia article about Regular Expressions, yet I can't get scanf to work properly. Here is some code I've tried:
scanf("[ ]*%ns[ ]*[\n]", string);
[ ] is supposed to go for the actual space character, * is supposed to mean one or more, n is the number of characters to read and string is a pointer allocated with malloc.
I have tried several different combinations; however I tend to get only the first word of a sentence read (stops at space character). Furthermore, * seems to discard a character instead of meaning "zero or more"...
Could anybody explain in detail how regular expressions are interpreted by scanf? What is more, is it efficient to use getc repetitively instead?
Thanks in Advance :D
The short answer: scanf does not handle regular expressions literally speaking.
If you want to use regular expressions in C, you could use the regex POSIX library. See the following question for a basic example on this library usage : Regular expressions in C: examples?
Now if you want to do it the scanf way you could try something like
scanf("%*[ ]%ns%*[ ]\n",str);
Replace the n in %ns by the maximal number of characters to read from input stream.
The %*[ ] part asks to ignore any spaces; note that the directive fails if there is no space to match. You could put a number between the * and the [ (for example %*3[ ]) to ignore at most that many characters. You could add other characters between the brackets to ignore more than just spaces.
Not sure if the above scanf would work as spaces are also matched with the %s directive.
I would definitely go with an fgets call, then trim the surrounding whitespace with something like the following: How do I trim leading/trailing whitespace in a standard way?
is it efficient to use getc repetitively instead?
Depends somewhat on the application, but YES, repeated getc() is efficient.
Unless I read the question wrong, %[^\n] will save everything until the newline is encountered.
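To make the "read at most n characters, spaces included, and discard the rest of the line" requirement concrete, here is a sketch built on fgets and getc instead of scanf. The function name is made up, and the capacity is passed as an int only because that is what fgets takes.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line from in, keep at most cap - 1
 * characters of it in out (spaces included), and discard the remainder of
 * the line up to and including the newline.
 * Returns 1 if a line was read, 0 at end of input or if cap < 2. */
int read_line_truncated(FILE *in, char *out, int cap)
{
    if (cap < 2 || fgets(out, cap, in) == NULL)
        return 0;
    size_t len = strcspn(out, "\n");
    if (out[len] == '\n') {
        out[len] = '\0';            /* whole line fit: strip the newline */
    } else {
        int c;                      /* line was longer: drain the rest */
        while ((c = getc(in)) != EOF && c != '\n')
            ;
    }
    return 1;
}
```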

Is there a way to know when fscanf reads a whitespace or a new line?

I want to know if there is a way to know when fscanf reads a whitespace or a new line.
Example:
formatting asking words italic
links returns
As fscanf reads a string till it meets a newline or a whitespace (using %s), it'll read formatting and the space after it and before a. The thing is, is there a way to know that it read a space? And after it entered the second line, is there a way to know that it read a carriage return?
You can instruct fscanf to read whitespace into your variable instead of reading and discarding whitespace. Use something like %[ \n\r\t], though you may need to include more characters in that set. Depending on the locale and some features of the runtime character set, you might want to write a separate function to compute the appropriate format string once before using it.
If you need to distinguish \n from other kinds of whitespace, you have your variable containing the whitespace that you just finished reading. You might want to count all of the \n characters in it, depending on your needs.
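A small sketch of that idea: read a word with %s, then capture the whitespace run that follows it with a scanset and count the newlines in it. The function name, the 63-character limits, and the choice of whitespace characters in the set are assumptions for the example.

```c
#include <stdio.h>

/* Hypothetical helper: read one word and the whitespace run after it.
 * Returns the number of '\n' characters in that run, or -1 if no word
 * could be read. gap receives the whitespace itself (possibly empty). */
int read_word_and_gap(FILE *in, char word[64], char gap[64])
{
    gap[0] = '\0';
    if (fscanf(in, "%63s", word) != 1)
        return -1;

    /* Unlike %s, %[ does not skip leading whitespace, so it captures the
     * run. It fails when the word is followed directly by end of input. */
    if (fscanf(in, "%63[ \t\r\n]", gap) != 1)
        gap[0] = '\0';

    int newlines = 0;
    for (const char *p = gap; *p != '\0'; ++p)
        if (*p == '\n')
            ++newlines;
    return newlines;
}
```

A return value of 0 with a non-empty gap means the word was followed by spaces or tabs only; a positive value means at least one line break was crossed.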

fscanf usage in C

I have a file like this:
10 15
something
I want to read this into three variables, let's say number1, number2, and mystring. I have doubts about what kind of pattern to give to fscanf. I am thinking something like this:
fscanf(fp,"%i %i\n%s",number1,number2,mystring);
Should this work, and also, is this the correct way of reading this file? If not, what would you suggest?
fscanf(fp,"%i %i\n%s",&number1,&number2,mystring);
fscanf takes pointers.
Read each line with fgets (or getline if you have it), split up the line with strsep (better, if available) or strtok_r (more awkward API but more portable), and then use strtoul to convert strings to numbers as necessary.
*scanf should never be used, because:
Some format strings (e.g. a bare "%s") are just as eager to overflow your buffers as gets is.
Behavior on integer overflow is undefined -- invalid input can potentially crash your program.
They do not report the character position of the first scan error, making it nigh-impossible to recover from a parse error. (This can be somewhat mitigated by using fgets and then sscanf instead of fscanf.)
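As an illustration of the number-conversion step, here is a sketch that parses a line such as "10 15" after it has been read with fgets. It uses strtol instead of strtoul so that negative values are handled, and the function name is made up for the example.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse exactly two base-10 integers from line.
 * Returns 1 on success; 0 on a syntax error, an out-of-range value, or
 * anything other than whitespace after the second number. */
int parse_two_longs(const char *line, long *a, long *b)
{
    char *end;

    errno = 0;
    long x = strtol(line, &end, 10);
    if (end == line || errno == ERANGE)     /* no digits, or overflow */
        return 0;

    const char *rest = end;
    long y = strtol(rest, &end, 10);
    if (end == rest || errno == ERANGE)
        return 0;

    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        ++end;
    if (*end != '\0')                       /* trailing junk */
        return 0;

    *a = x;
    *b = y;
    return 1;
}
```

Because strtol reports where it stopped through end, the caller could also compute the offset of the first bad character, which is the information the scanf family does not provide.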
Besides the problem with pointers, generally using spaces in the scanf format is a mistake -- in most cases scanf skips whitespace automatically. So I would use something like:
int number1, number2;
char mystring[32];
fscanf(fp, "%i%i%31s", &number1, &number2, mystring);
This will read two numbers followed by a string of up to 31 non-whitespace characters, all separated by any whitespace. Note that "whitespace" includes spaces, tabs, and newlines, so it doesn't matter if it's all on one line, or spread out over 3 lines, or anything in between.
Note also using a limit on the size of the string -- without that, the input might overflow any fixed size buffer you provide (and there's no way to provide a variable sized buffer with scanf)

Resources