Parsing a string with fscanf using %[...] in C

I am working with the fscanf function to scan in a large string that is delimited by commas, with the last substring in the larger string separated by an asterisk (*). Here is an example:
substring1,substring2,substring3*substring4
I am able to parse the substrings separated by commas with no problem, but when it gets to the asterisk, it stalls the program, as fscanf is blocking. I am using the %[^...] format specifier in fscanf, shown below:
fscanf(fs, "%[^*,]%*c", str);
The code above is in a simple for loop that scans multiple times. As you can see, I am scanning until either an asterisk or a comma appears. However, I am afraid that I am not including the asterisk in the set properly. Can someone correct my mistake?
Thanks.

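The only characters that are special in a %[ pattern are ^, -, and ], so the asterisk is included in your set correctly.
This pattern will fail if the next character to be read is either a ',' or a '*', because %[ must match at least one character. When that happens the %*c is never reached, the delimiter stays unread, and every later call fails the same way. So if you have two consecutive commas or asterisks, your loop will jam and stop reading.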
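As a rough illustration of the failure mode described above, here is a minimal sketch that splits on ',' or '*' and tolerates empty fields. It works on an in-memory string with sscanf rather than on the asker's stream, and the helper name split_fields, the 32-byte field size, and the use of %n are assumptions made for the example, not part of the original post.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits `input` on ',' or '*' into fields[][32].
   Returns the number of fields stored (empty fields are kept as ""). */
int split_fields(const char *input, char fields[][32], int max_fields)
{
    int count = 0;
    const char *p = input;

    while (count < max_fields) {
        int used = 0;

        fields[count][0] = '\0';
        /* %31[^*,] fails when the next character is a delimiter (empty
           field) or the string has ended; treat both as an empty field. */
        if (sscanf(p, "%31[^*,]%n", fields[count], &used) != 1)
            used = 0;
        count++;

        p += used;
        p += strcspn(p, "*,");   /* drop the tail of an overlong field */
        if (*p == '\0')
            break;               /* no delimiter left: that was the last field */
        p++;                     /* step over the ',' or '*' ourselves */
    }
    return count;
}
```

Stepping over the delimiter by hand, instead of relying on %*c, is what keeps the loop moving after a failed conversion. On a real stream it is also worth remembering that %[^*,] keeps reading (newlines included) until it sees a delimiter or end of file, so the last field can block on an interactive source that never sends either.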

Related

Reading specifically formatted string

I'm attempting to read a file containing lines of strings in the following format:
"string";"string";"string";"string";"string"
How do I read each of them using functions that work on both Windows and Linux?
The length of each string is unknown.
I have attempted to use fscanf like this:
fscanf(fp, "\"%s\";\"%s\";\"%s\";\"%s\";\"%s\"\n");
But the first string picked up the whole line.
If you really want to use fscanf, you could use a format string like this :
fscanf(fp, "\"%[^\"]\";\"%[^\"]\";\"%[^\"]\";\"%[^\"]\";\"%[^\"]\"\n", ...);
For more details, read up on the [set] conversion specifier in the reference docs for fscanf.
Note that this will not work with embedded '"' characters in the strings.
This also leaves no flexibility (like additional whitespace around the semicolons, optional quotes, etc.).
If those limitations are a problem for you, you'll want a more intelligent parser (libcsv comes to mind, for example). Also see pmg's answer for how to roll your own.
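For illustration, the scanset format above might be wrapped like this. The sketch parses an in-memory line with sscanf, and the function name parse_quoted_line and the 64-byte buffers (hence the %63 widths) are assumptions added here; the original answer gives no sizes.

```c
#include <stdio.h>

/* Hypothetical helper: parses one line of five "quoted";"fields".
   Returns the number of fields converted (5 on full success). */
int parse_quoted_line(const char *line, char out[5][64])
{
    /* Each %63[^\"] stores up to 63 characters that are not a double quote;
       the quotes and semicolons in the format must match literally. */
    return sscanf(line,
                  "\"%63[^\"]\";\"%63[^\"]\";\"%63[^\"]\";\"%63[^\"]\";\"%63[^\"]\"",
                  out[0], out[1], out[2], out[3], out[4]);
}
```

Checking the return value matters here: an empty field such as "" makes its %[ conversion fail, so the call returns early with fewer than five fields.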
Here's some pseudo-code for you:
loop
    getchar; if not a quote exit with error
    loop
        getchar; mind EOF
        if not a quote, add to string
        if a quote exit inner loop
    use string
    getchar; if not semicolon exit with error unless EOF
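The pseudo-code above could be turned into C roughly as follows. This is only a sketch: the name read_record, the FIELD_MAX size, the choice to treat a newline as the end of a record, and the return conventions are all assumptions that the pseudo-code does not specify.

```c
#include <stdio.h>

#define FIELD_MAX 64   /* assumed buffer size, not from the original post */

/* Hypothetical helper: reads one record of "quoted";"fields" from fp.
   Returns the number of fields read, 0 at end of file, -1 on bad input. */
int read_record(FILE *fp, char fields[][FIELD_MAX], int max_fields)
{
    int count = 0;
    int c;

    while (count < max_fields) {
        c = getc(fp);
        if (c == EOF && count == 0)
            return 0;                          /* clean end of file */
        if (c != '"')
            return -1;                         /* a field must open with a quote */

        size_t len = 0;
        while ((c = getc(fp)) != '"') {
            if (c == EOF)
                return -1;                     /* unterminated field */
            if (len < FIELD_MAX - 1)
                fields[count][len++] = (char)c; /* extra characters are dropped */
        }
        fields[count][len] = '\0';
        count++;

        c = getc(fp);                          /* separator or end of record */
        if (c == '\n' || c == EOF)
            return count;
        if (c != ';')
            return -1;
    }
    return -1;                                 /* more fields than max_fields */
}
```

Unlike the fscanf format, this version accepts empty fields and strings of unknown length (truncating anything past the buffer), at the cost of a little more code.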

What is the purpose of : in scanf?

scanf("%d:%d:%d%s", &hh, &mm, &ss, t12)
When reading the parts of a time to be displayed, the input statement is written as above, with : in the format string. The line works fine, but can someone explain the need for and use of the colon in the input statement?
From the standard, C11 7.21.6.2 The fscanf function /3 and /6:
The format is composed of zero or more directives: one or more white-space
characters, an ordinary multibyte character (neither % nor a white-space character), or a conversion specification.
A directive that is an ordinary multibyte character is executed by reading the next characters of the stream. If any of those characters differ from the ones composing the directive, the directive fails and the differing and subsequent characters remain unread.
Hence the : simply means "make sure that the next character in the stream is a colon". Nothing more, nothing less.
Your format string simply means you'll be able to scan things like 12:34:56am - without the literal colons in the format string, the scan would fail.
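A small sketch of that behaviour, using sscanf on a string so the result is easy to inspect. The function name parse_time and the %2s width (added so the suffix fits a 3-byte buffer) are assumptions for the example, not part of the original call.

```c
#include <stdio.h>

/* Hypothetical helper: returns the number of successful conversions
   (4 when the whole time, including the am/pm suffix, was parsed). */
int parse_time(const char *text, int *hh, int *mm, int *ss, char suffix[3])
{
    /* Each ':' in the format must match a literal colon in the input;
       the first mismatch stops the scan. */
    return sscanf(text, "%d:%d:%d%2s", hh, mm, ss, suffix);
}
```

With "12:34:56am" all four conversions succeed. With "12-34-56am" only the first %d is converted, because the '-' does not match the ':' directive and the scan stops there.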

Use scanf with Regular Expressions

I've been trying to use regular expressions on scanf, in order to read a string of maximum n characters and discard anything else until the New Line Character. Any spaces should be treated as regular characters, thus included in the string to be read.
I've studied a Wikipedia article about Regular Expressions, yet I can't get scanf to work properly. Here is some code I've tried:
scanf("[ ]*%ns[ ]*[\n]", string);
[ ] is supposed to go for the actual space character, * is supposed to mean one or more, n is the number of characters to read and string is a pointer allocated with malloc.
I have tried several different combinations; however I tend to get only the first word of a sentence read (stops at space character). Furthermore, * seems to discard a character instead of meaning "zero or more"...
Could anybody explain in detail how regular expressions are interpreted by scanf? What is more, is it efficient to use getc repetitively instead?
Thanks in Advance :D
The short answer: scanf does not handle regular expressions, strictly speaking.
If you want to use regular expressions in C, you could use the POSIX regex library (regex.h). See the following question for a basic example of using this library: Regular expressions in C: examples?
Now if you want to do it the scanf way you could try something like
scanf("%*[ ]%ns%*[ ]\n",str);
Replace the n in %ns by the maximal number of characters to read from input stream.
The %*[ ] part asks to skip a run of spaces; the * suppresses assignment, so nothing is stored. You could put a maximum field width after the * (for example %*3[ ]) to skip at most that many characters, and you could add other characters between the brackets to skip more than just spaces.
I am not sure the above scanf does what you want, though: %s itself skips leading whitespace and stops at the first whitespace character, so it still reads only one word, and %*[ ] fails the scan when there is no space to match.
I would definitely go with an fgets call, then trim the surrounding whitespace with something like the following: How do I trim leading/trailing whitespace in a standard way?
is it efficient to use getc repetitively instead?
Depends somewhat on the application, but YES, repeated getc() is efficient.
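Unless I read the question wrong, %[^\n] will save everything until the newline is encountered. Note that there are no quotes inside the brackets and no trailing s: %[ is a complete conversion specifier on its own.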
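One way to combine a bounded scanset with discarding the rest of the line, as the question asks, is sketched below. The helper name read_line_prefix and the fixed limit of 15 characters are assumptions chosen for the example; a getc loop does the discarding because a %*[^\n] directive would fail on a line that has nothing left to skip.

```c
#include <stdio.h>

/* Hypothetical helper: stores at most 15 characters of the next line in buf
   (spaces included) and discards the rest of that line.
   Returns 1 if a line was consumed, 0 at end of file. */
int read_line_prefix(FILE *in, char buf[16])
{
    int c;

    buf[0] = '\0';
    /* %15[^\n] stores up to 15 non-newline characters. It fails on an empty
       line, which simply leaves buf empty; only EOF ends the reading. */
    if (fscanf(in, "%15[^\n]", buf) == EOF)
        return 0;

    /* Drop whatever is left on the line, including the newline itself. */
    while ((c = getc(in)) != '\n' && c != EOF)
        ;
    return 1;
}
```

Leading and trailing spaces are kept as ordinary characters here; trimming them, if wanted, would be a separate step.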

Force fscanf to Consume Possible Whitespace

I have a multiline TSV file with the following format:
Type\tBasic Name\tAttribute\tA Long Description\n
As you can see, the Basic Name and the Description can both contain some number of spaces. I am trying to read each line in and extract the elements. For now, I've narrowed it down to just extracting the basic name. My fscanf is as follows:
fscanf(file_in, "%*[^ ]s\t%128[^ ]s\t%*[^ ]s\t%[^ ]s\n", name_string, desc_string);
This doesn't work as I have hoped, and I'm having trouble narrowing down the error. Does anyone know how I could read in the lines properly?
I mostly agree with Pablo (that the scanf family don't make great parsers), but it's worth understanding how to write a scanf pattern. The pattern you're looking for is something like this:
fscanf(file_in, " %*[^\t] %128[^\t] %*[^\t] %128[^\n]", name_string, desc_string)
Notes:
%[xyz] is a directive. %[xyz]s is two directives, the second of which matches a literal s
As far as I know, a whitespace character in the pattern cannot be made to match exactly one literal tab, since any whitespace in the pattern matches any amount of whitespace (including none) in the input. I used a space in my example, which will match a terminating tab, but it will also match any number of consecutive tabs, so empty fields won't be parsed correctly.
The 128-character limit does not include the terminating NUL character, so each buffer needs room for at least 129 bytes.
Also, if the scan stops because the character limit is exceeded, it won't skip the rest of the field automatically, so you'll end up out of sync with the input.
A pattern that also skips the rest of an overlong field would be:
fscanf(file_in, " %*[^\t] %128[^\t]%*[^\t] %*[^\t] %128[^\n]%*[^\n]", name_string, desc_string)
which explicitly skips the remaining characters in the field. Be aware that %*[^\t] must match at least one character, so when a field fits within the limit the skip directive fails and the scan stops there; the skip is safer as a separate fscanf call whose result is ignored. An even better solution would be to use the m modifier (POSIX; older glibc versions spelled it a) and get fscanf to malloc memory for you.
I'd rather use strtok for this. It's more accurate than fscanf, since that function family only works when the format is 100% OK; otherwise you end up missing values.
Take a look at Parallel to PHP's "explode" in C: Split char* into char* using delimiter, where I explain in more detail how to use strtok.
So, read each line with fgets and parse it with strtok.
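A minimal sketch of the strtok half of that approach is shown below, applied to a line that fgets would have filled. The helper name split_tsv and the caller-supplied pointer array are assumptions for the example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits a tab-separated line in place and stores
   pointers to at most max_fields fields. Returns the number of fields. */
int split_tsv(char *line, char *fields[], int max_fields)
{
    int count = 0;

    line[strcspn(line, "\n")] = '\0';   /* strip the newline that fgets keeps */

    /* strtok overwrites each tab with '\0' and returns the start of a field. */
    for (char *tok = strtok(line, "\t");
         tok != NULL && count < max_fields;
         tok = strtok(NULL, "\t"))
        fields[count++] = tok;

    return count;
}
```

Typical use would be a loop such as while (fgets(line, sizeof line, file_in)) { n = split_tsv(line, fields, 4); ... }. One limitation to keep in mind: strtok treats consecutive tabs as a single separator, so empty fields disappear and the remaining fields shift position.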
Firstly, as it has already been noted, the %[] is a conversion specifier by itself. There's no s after the []. The s-es that you have in your format string will not be considered parts of the conversion specifiers. You have to get rid of those s-es.
Secondly, as you said yourself, your file is TAB-separated. Which immediately means that you should extract the continuous portions of the sequence by using the %[^\t] conversion specifier (or the %[^\n] specifier for the last portion). Why did you use %[^ ] and how did you expect it to work? The %[^ ] actually stops parsing at space character, which is the opposite of what you wanted.
In your example the proper combination of specifiers would be
fscanf(file_in, "%*[^\t]\t%128[^\t]\t%*[^\t]\t%[^\n]\n", name_string, desc_string);
This format string assumes that all 4 portions of the string are guaranteed to be present and that the last portion is guaranteed to be terminated by \n.
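To show how that format behaves, here is a sketch that applies it to one line held in memory. The function name parse_tsv_line, the use of sscanf instead of fscanf, and the 256-character bound on the description are assumptions added for the example; the trailing \n is left out because the line is already isolated.

```c
#include <stdio.h>

/* Hypothetical helper: extracts field 2 (name) and field 4 (description)
   from one TSV line. Returns 2 when both were converted. */
int parse_tsv_line(const char *line, char name[129], char desc[257])
{
    /* %*[^\t] skips a field without storing it; %128[^\t] and %256[^\n]
       store bounded fields, spaces included. */
    return sscanf(line, "%*[^\t]\t%128[^\t]\t%*[^\t]\t%256[^\n]", name, desc);
}
```

The return value is the only sign that something went wrong. For instance, a line whose first field is empty returns 0, because %*[^\t] has nothing to match.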

Is there a way to know when fscanf reads a whitespace or a new line?

I want to know if there is a way to know when fscanf reads a whitespace or a new line.
Example:
formatting asking words italic
links returns
As fscanf reads a string until it meets a newline or whitespace (using %s), it will read formatting and stop at the space after it. The thing is, is there a way to know that it reached a space? And after it moves on to the second line, is there a way to know that it passed a newline?
You can instruct fscanf to read whitespace into your variable instead of reading and discarding whitespace. Use something like %[ \n\r\t], but you may need to include more characters in that set. Depending on the locale and some features of the runtime character set, you might want to write a separate function to compute the appropriate format string once before using it.
If you need to distinguish \n from other kinds of whitespace, you have your variable containing the whitespace that you just finished reading. You might want to count all of the \n characters in it, depending on your needs.
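A sketch of that idea follows: read a word with %s, then capture the whitespace run after it with a scanset and count the newlines in it. The helper name read_word_and_gap, the 64-byte buffers, and the particular set of whitespace characters are assumptions for the example.

```c
#include <stdio.h>

/* Hypothetical helper: reads one word, then the whitespace that follows it,
   and reports how many newlines that whitespace contained.
   Returns 1 if a word was read, 0 otherwise. */
int read_word_and_gap(FILE *in, char word[64], int *newlines)
{
    char gap[64] = "";

    *newlines = 0;
    if (fscanf(in, "%63s", word) != 1)
        return 0;

    /* %[ does not skip whitespace, so it can store it instead. Only the
       first 63 whitespace characters are captured and counted. */
    if (fscanf(in, "%63[ \t\r\n\v\f]", gap) == 1) {
        for (const char *p = gap; *p != '\0'; p++)
            if (*p == '\n')
                (*newlines)++;
    }
    return 1;
}
```

On an interactive stream the second fscanf waits until a non-whitespace character or end of file arrives, since that is the only way it can know the whitespace run has ended.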
