Reading specifically formatted string - c

I'm attempting to read a file containing lines of strings in the following format:
"string";"string";"string";"string";"string"
How do I read each of them using functions that work on both Windows and Linux?
The length of each string is unknown.
I have attempted to use fscanf like this:
fscanf(fp, "\"%s\";\"%s\";\"%s\";\"%s\";\"%s\"\n");
But the first string picked up the whole line.

If you really want to use fscanf, you could use a format string like this:
fscanf(fp, "\"%[^\"]\";\"%[^\"]\";\"%[^\"]\";\"%[^\"]\";\"%[^\"]\"\n", ...);
For more details, read up on the [set] conversion specifier in the reference docs for fscanf.
Note that this will not work with embedded '"' characters in the strings, nor with empty strings (""), because %[ must match at least one character.
It also leaves no flexibility (additional whitespace around the semicolons, optional quotes, etc.).
If those limitations are a problem for you, you'll want a more capable parser (libcsv comes to mind, for example). Also see pmg's answer for how to roll your own.
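A minimal sketch of the scanset approach, with two additions that are not in the answer above: a field width on each %[ to bound the write, and a check of the return value. The helper name parse_five and the 63-character field limit are assumptions made for illustration. It parses from a string with sscanf; the same format string should work with fscanf.

```c
#include <stdio.h>

#define FIELD_MAX 63  /* assumed limit; each buffer holds FIELD_MAX + 1 bytes */

/* Parses "a";"b";"c";"d";"e" from line into out[0..4].
   Returns 1 when all five fields were converted, 0 otherwise.
   Empty fields ("") and embedded quotes are not handled. */
int parse_five(const char *line, char out[5][FIELD_MAX + 1])
{
    int n = sscanf(line,
                   "\"%63[^\"]\";\"%63[^\"]\";\"%63[^\"]\";\"%63[^\"]\";\"%63[^\"]\"",
                   out[0], out[1], out[2], out[3], out[4]);
    return n == 5;
}
```

A field longer than the width makes the following literal '"' fail to match, so the function reports failure instead of writing past the buffer.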

Here's some pseudo-code for you:
loop
    getchar; if not a quote exit with error
    loop
        getchar; mind EOF
        if not a quote, add to string
        if a quote exit inner loop
    use string
    getchar; if not semicolon exit with error unless EOF
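The pseudo-code above could be turned into C roughly as follows. This is a sketch under assumptions: the names read_quoted and read_record, the 64-byte field capacity, and the choice to treat a newline as the end of a record are all illustrative, and over-long fields are silently truncated. Only getc is used, so it should behave the same on Windows and Linux.

```c
#include <stdio.h>

#define FIELD_CAP 64   /* assumed maximum field size, including the NUL */

/* Reads one "quoted" field from fp into buf (capacity cap, cap >= 1).
   Returns 1 on success, 0 on EOF before any character, -1 on malformed input.
   Characters beyond cap - 1 are read but dropped. */
int read_quoted(FILE *fp, char *buf, size_t cap)
{
    int c = getc(fp);
    size_t len = 0;

    if (c == EOF)
        return 0;
    if (c != '"')
        return -1;                 /* field must start with a quote */

    while ((c = getc(fp)) != '"') {
        if (c == EOF)
            return -1;             /* unterminated field */
        if (len + 1 < cap)
            buf[len++] = (char)c;
    }
    buf[len] = '\0';
    return 1;
}

/* Reads one line of ';'-separated quoted fields into fields[].
   Returns the number of fields read, 0 at end of input,
   or -1 on malformed input or more than max_fields fields. */
int read_record(FILE *fp, char fields[][FIELD_CAP], int max_fields)
{
    int count = 0;

    while (count < max_fields) {
        int r = read_quoted(fp, fields[count], FIELD_CAP);
        if (r == 0 && count == 0)
            return 0;              /* clean end of input */
        if (r != 1)
            return -1;
        count++;

        int c = getc(fp);
        if (c == '\n' || c == EOF)
            return count;          /* end of record */
        if (c != ';')
            return -1;             /* expected a separator */
    }
    return -1;
}
```

Calling read_record in a loop until it returns 0 processes the file one line at a time.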

Related

Using regex within sscanf [duplicate]

I needed to read a string until the following sequence is written: \nx\n :
(.....)\n
x\n
\n is the new line character and (.....) can be any characters that may include other \n characters.
scanf allows regular expressions as far as I know, but I can't make it read a string until this pattern. Can you help me with the scanf format string?
I was trying something like:
char input[50000];
scanf(" %[^(\nx\n)]", input);
but it doesn't work.
scanf allows regular expressions as far as I know
Unfortunately, it does not allow regular expressions: the syntax is misleadingly close, but there is nothing even remotely similar to a regex engine in the implementation of scanf. All that's there is support for something like regex character classes, so %[<something>] is treated implicitly as [<something>]+ (one or more characters from the set). That's why your call of scanf translates into "read a string consisting of characters other than '(', ')', 'x', and '\n'".
To solve your problem at hand, you can set up a loop that reads the input character by character. Every time you get a '\n', check that:
You have at least three characters in the input that you've seen so far,
The character immediately before '\n' is an 'x', and
The character before the 'x' is another '\n'.
If all of the above is true, you have reached the end of your anticipated input sequence; otherwise, your loop should continue.
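The three checks above can be sketched as a small function. The name read_until_marker and the fixed caller-supplied buffer are assumptions for illustration; a real program might grow the buffer instead of giving up when it fills.

```c
#include <stdio.h>

/* Reads from fp into buf (capacity cap, cap >= 1) until the input read so
   far ends with "\nx\n". Returns the length of the text before that
   terminator, or -1 if EOF or a full buffer is reached first. */
long read_until_marker(FILE *fp, char *buf, size_t cap)
{
    size_t len = 0;
    int c;

    while (len + 1 < cap && (c = getc(fp)) != EOF) {
        buf[len++] = (char)c;
        if (c == '\n' && len >= 3 &&
            buf[len - 2] == 'x' && buf[len - 3] == '\n') {
            len -= 3;              /* drop the "\nx\n" terminator */
            buf[len] = '\0';
            return (long)len;
        }
    }
    buf[len] = '\0';               /* terminator not found */
    return -1;
}
```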
scanf does not support regular expressions. It has limited support for character classes but that's not at all the same thing.
Never use scanf, fscanf, or sscanf, because:
Numeric overflow triggers undefined behavior. The C runtime is allowed to crash your program just because someone typed too many digits.
Some format specifiers (notably %s) are unsafe in exactly the same way gets is unsafe, i.e. they will cheerfully write past the end of the provided buffer and crash your program.
They make it extremely difficult to handle malformed input robustly.
You don't need regular expressions for this case; read a line at a time with getline and stop when the line read is just "x". However, the standard (not ISO C, but POSIX) regular expression library routines are called regcomp and regexec.
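A sketch of the line-at-a-time approach with getline. Note that getline is POSIX and not part of ISO C, so it is not available with every Windows toolchain. The function name read_block and the decision to return everything before the "x" line as one malloc'd string are assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* for getline(); POSIX, not ISO C */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Reads lines from fp until a line that is exactly "x".
   Returns a malloc'd string holding the preceding lines (caller frees),
   or NULL on allocation failure or if EOF comes before the "x" line. */
char *read_block(FILE *fp)
{
    char *line = NULL, *text = NULL;
    size_t cap = 0, total = 0;
    ssize_t n;

    while ((n = getline(&line, &cap, fp)) != -1) {
        if (strcmp(line, "x\n") == 0 || strcmp(line, "x") == 0) {
            free(line);
            if (text == NULL) {          /* no lines before the marker */
                text = malloc(1);
                if (text != NULL)
                    text[0] = '\0';
            }
            return text;
        }
        char *grown = realloc(text, total + (size_t)n + 1);
        if (grown == NULL)
            break;
        text = grown;
        memcpy(text + total, line, (size_t)n + 1);  /* copy including the NUL */
        total += (size_t)n;
    }
    free(line);
    free(text);
    return NULL;
}
```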

In C, shall I use FLUSH every time I use scanf to get rid of buffer?

As the title says, shall I use
while(getchar() != '\n');
every time I use scanf?
And can someone explain the logic behind
while(getchar() != '\n');
Thanks.
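No, you generally don't need to do that. The loop you posted reads and discards characters from stdin until it has consumed a '\n', so everything left on the current line, including the newline itself, is thrown away. The way you wrote it, the loop never terminates if end-of-file is reached before a newline, because getchar then keeps returning EOF.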
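For the cases where discarding the rest of a line really is what you want, a variant of the loop that also stops at end-of-file might look like this. The function name discard_line is an assumption for illustration.

```c
#include <stdio.h>

/* Discards the rest of the current line on fp, including the '\n'.
   Returns 0 if a newline was consumed, EOF if the stream ended first. */
int discard_line(FILE *fp)
{
    int c;

    while ((c = getc(fp)) != '\n') {
        if (c == EOF)
            return EOF;
    }
    return 0;
}
```

After something like scanf("%d", &n), calling discard_line(stdin) drops whatever else was typed on that line.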
Typical problems or the need for "flushing" can be avoided by:
Not mixing scanf with other input methods. For example, don't mix it with fgets.
Preceding format specifiers with a space where whitespace isn't skipped automatically and you want it skipped.
For example, to ignore blanks, instead of scanf("%c", ...) use scanf(" %c", ...).
That aside, when you have complex input to read in you might want to:
Read entire lines with fgets, which you can then parse as you please with sscanf, strtok et al. It may look like a contradiction to recommend sscanf where scanf is inadequate. The point is that once you have the full string stored safely using fgets, you've got considerably more freedom to analyze it, throw away portions that don't match, do a strchr here and there, etc.
Use languages (with libraries) better suited for the job, like Python or Perl, to reduce the task to a simpler problem.
Use a full-blown lexer
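As a sketch of the fgets-then-sscanf option from the list above: the function below reads one whole line and only then picks it apart, so a malformed line is consumed in full and cannot leave stray input behind. The "key=value" line format, the name read_key_value, and the buffer sizes are assumptions chosen for illustration.

```c
#include <stdio.h>

/* Reads one line from fp and parses "key=value" from it.
   key must have room for 32 bytes, value for 64 bytes.
   Returns 1 on success, 0 on a malformed line, EOF at end of input. */
int read_key_value(FILE *fp, char key[32], char value[64])
{
    char line[128];                /* assumed maximum line length */

    if (fgets(line, sizeof line, fp) == NULL)
        return EOF;
    /* the widths 31 and 63 bound the writes into key and value */
    return sscanf(line, " %31[^=]=%63[^\n]", key, value) == 2;
}
```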

Use scanf with Regular Expressions

I've been trying to use regular expressions on scanf, in order to read a string of maximum n characters and discard anything else until the New Line Character. Any spaces should be treated as regular characters, thus included in the string to be read.
I've studied a Wikipedia article about Regular Expressions, yet I can't get scanf to work properly. Here is some code I've tried:
scanf("[ ]*%ns[ ]*[\n]", string);
[ ] is supposed to go for the actual space character, * is supposed to mean one or more, n is the number of characters to read and string is a pointer allocated with malloc.
I have tried several different combinations; however I tend to get only the first word of a sentence read (stops at space character). Furthermore, * seems to discard a character instead of meaning "zero or more"...
Could anybody explain in detail how regular expressions are interpreted by scanf? What is more, is it efficient to use getc repetitively instead?
Thanks in Advance :D
The short answer: scanf does not handle regular expressions literally speaking.
If you want to use regular expressions in C, you could use the regex POSIX library. See the following question for a basic example on this library usage : Regular expressions in C: examples?
Now if you want to do it the scanf way you could try something like
scanf("%*[ ]%ns%*[ ]\n",str);
Replace the n in %ns by the maximal number of characters to read from input stream.
The %*[ ] part asks to read and discard a run of spaces. You could put a number after the * (for example %*3[ ]) to skip at most that many characters. You could add other characters between the brackets to ignore more than just spaces.
Not sure if the above scanf would work as spaces are also matched with the %s directive.
I would definitely go with an fgets call, then trim the surrounding whitespace with something like the following: How do I trim leading/trailing whitespace in a standard way?
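A sketch of that fgets-based route, which keeps at most a fixed number of characters, discards the remainder of the line, and trims surrounding whitespace in place. The name read_trimmed is an assumption, and this is only one of several reasonable ways to trim.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Reads one line from fp, keeps at most cap - 1 characters of it in buf,
   discards the rest of the line, and trims surrounding whitespace in place.
   Returns buf, or NULL at end of input (or if cap < 2). */
char *read_trimmed(FILE *fp, char *buf, int cap)
{
    if (cap < 2 || fgets(buf, cap, fp) == NULL)
        return NULL;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[--len] = '\0';                 /* whole line fit */
    } else {
        int c;                             /* line was longer: skip the rest */
        while ((c = getc(fp)) != '\n' && c != EOF)
            ;
    }

    while (len > 0 && isspace((unsigned char)buf[len - 1]))
        buf[--len] = '\0';                 /* trailing whitespace */

    size_t start = 0;
    while (isspace((unsigned char)buf[start]))
        start++;                           /* leading whitespace */
    memmove(buf, buf + start, len - start + 1);
    return buf;
}
```

Spaces inside the text are kept; if leading spaces should count as regular characters too, the final trimming steps can simply be dropped.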
is it efficient to use getc repetitively instead?
Depends somewhat on the application, but YES, repeated getc() is efficient.
Unless I read the question wrong, %[^\n] (without quotes inside the brackets and without a trailing s) will save everything until the newline is encountered.

Force fscanf to Consume Possible Whitespace

I have a multiline TSV file with the following format:
Type\tBasic Name\tAttribute\tA Long Description\n
As you can see, the Basic Name and the Description can both contain some number of spaces. I am trying to read each line in and extract the elements. For now, I've narrowed it down to just extracting the basic name. My fscanf is as follows:
fscanf(file_in, "%*[^ ]s\t%128[^ ]s\t%*[^ ]s\t%[^ ]s\n", name_string, desc_string);
This doesn't work as I have hoped, and I'm having trouble narrowing down the error. Does anyone know how I could read in the lines properly?
I mostly agree with Pablo (that the scanf family don't make great parsers), but it's worth understanding how to write a scanf pattern. The pattern you're looking for is something like this:
fscanf(file_in, " %*[^\t] %128[^\t] %*[^\t] %128[^\n]", name_string, desc_string);
Notes:
%[xyz] is a directive. %[xyz]s is two directives, the second of which matches a literal s.
As far as I know, there is no way to match a single literal tab character, since any whitespace in the pattern matches any amount of whitespace (including none) in the input. I used a space in my example, which will match a terminating tab, but it will also match any number of consecutive tabs, so empty fields won't be parsed correctly.
The 128-character limit does not include the terminating NUL character, so the buffers need room for 129 bytes.
Also, if the scan stops because the character limit is exceeded, it won't skip the rest of the field automatically, so you'll end up out of sync with the input.
A better pattern would be:
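fscanf(file_in, " %*[^\t] %128[^\t]%*[^\t] %*[^\t] %128[^\n]%*[^\n]", name_string, desc_string);
which explicitly skips the remaining characters in the field, if necessary. Be aware, though, that a %[ directive fails when it matches no characters, so this pattern stops early whenever a field already fit within its limit; in practice the skip is more reliable in a separate call made after checking the return value. An even better solution would be to use the a modifier (a glibc extension; POSIX.1-2008 spells it m) and get fscanf to malloc memory for you.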
I'd rather use strtok for this. It's more accurate than fscanf, since that function family only works when the format is 100% OK; otherwise you end up missing values.
Take a look at Parallel to PHP's "explode" in C: Split char* into char* using delimiter, where I explain in more detail how to use strtok.
So, read each line with fgets and parse it with strtok.
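A sketch of the fgets plus strtok combination for the TSV lines in the question. The names split_tabs and next_record are assumptions for illustration. One limitation worth knowing: strtok treats consecutive tabs as a single delimiter, so empty fields disappear rather than being reported.

```c
#include <stdio.h>
#include <string.h>

/* Splits one tab-separated line (modified in place) into up to max fields.
   Returns the number of fields stored in out.
   strtok treats consecutive tabs as one delimiter, so empty fields vanish. */
int split_tabs(char *line, char *out[], int max)
{
    int n = 0;

    for (char *tok = strtok(line, "\t\n"); tok != NULL && n < max;
         tok = strtok(NULL, "\t\n"))
        out[n++] = tok;
    return n;
}

/* Reads the next line from fp into line (capacity cap) and splits it.
   Returns the number of fields, or EOF at end of input.
   The pointers in out point into line. */
int next_record(FILE *fp, char *line, int cap, char *out[], int max)
{
    if (fgets(line, cap, fp) == NULL)
        return EOF;
    return split_tabs(line, out, max);
}
```

For the format in the question, out[1] would then be the basic name and out[3] the description.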
Firstly, as it has already been noted, the %[] is a conversion specifier by itself. There's no s after the []. The s-es that you have in your format string will not be considered parts of the conversion specifiers. You have to get rid of those s-es.
Secondly, as you said yourself, your file is TAB-separated. Which immediately means that you should extract the continuous portions of the sequence by using the %[^\t] conversion specifier (or the %[^\n] specifier for the last portion). Why did you use %[^ ] and how did you expect it to work? The %[^ ] actually stops parsing at space character, which is the opposite of what you wanted.
In your example the proper combination of specifiers would be
fscanf(file_in, "%*[^\t]\t%128[^\t]\t%*[^\t]\t%[^\n]\n", name_string, desc_string);
This format string assumes that all 4 portions of the string are guaranteed to be present and that the last portion is guaranteed to be terminated by \n.
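Because the format assumes all four fields are present, it is worth checking how many conversions actually succeeded. The sketch below applies the same directives to a line already held in memory, adds a width to the last field, and drops the trailing \n directive. The name parse_tsv_line and the 129-byte buffers are assumptions for illustration.

```c
#include <stdio.h>

/* Extracts field 2 (name) and field 4 (description) of a TSV line.
   name and desc must each have room for 129 bytes.
   Returns 1 if both fields were converted, 0 otherwise. */
int parse_tsv_line(const char *line, char name[129], char desc[129])
{
    /* "\t" in a scanf format matches any run of whitespace, not one tab */
    return sscanf(line, "%*[^\t]\t%128[^\t]\t%*[^\t]\t%128[^\n]",
                  name, desc) == 2;
}
```

A line with fewer than four fields converts at most one value, so the function returns 0 instead of leaving desc uninitialized without notice.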

How to know new line character in fscanf?

How to know that fscanf reached a new line \n in a file.
I have been using my own functions for doing that. I know I can use fgets and then sscanf for my required pattern. But my requirements are not stable, some times I want to get TAB separated strings, some times new line separated strings and some times some special character separated strings. So if there is any way to know of new line from fscanf please help me. Or else any alternative ways are also welcome.
Thanks in advance.
fscanf(stream, "%42[^\n]", buffer);
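is roughly equivalent to fgets(buffer, 43, stream): it stores at most 42 characters plus the terminating NUL, but it leaves the newline unread and fails if the line is empty. You can't replace the 42 by * to pass the buffer length as an argument (as you can in printf); in scanf, * means the assignment is suppressed. So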
fscanf(stream, "%*[^\n]%*c");
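reads up to and including the next end-of-line character (provided the rest of the line is not empty; if it is, %*[^\n] fails and the newline is left unread).
Any conversion specifier other than [, c and n starts by skipping whitespace.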
Kernighan and Pike in the excellent book 'The Practice of Programming' show how to use sprintf() to create an appropriate format specifier including the length (similar to the examples in the answer by AProgrammer), and then use that in the call to scanf(). That way, you can also control the separators. Concerns about the 'inefficiency' of this approach are probably misguided - the alternatives are harder to get right.
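A sketch of that idea: build the format at run time so that both the width and the separator come from arguments. This is not the book's code; the name read_field is an assumption, and snprintf is used here in place of sprintf to bound the write into the format buffer.

```c
#include <stdio.h>

/* Reads up to cap - 1 characters from fp into buf, stopping before sep.
   The separator itself is left unread. sep must not be '\0'.
   Returns 1 if a non-empty field was read, 0 otherwise. */
int read_field(FILE *fp, char *buf, size_t cap, char sep)
{
    char fmt[32];

    if (cap < 2 || sep == '\0')
        return 0;
    /* e.g. cap = 43 and sep = '\t' give the format "%42[^\t]" */
    snprintf(fmt, sizeof fmt, "%%%zu[^%c]", cap - 1, sep);
    return fscanf(fp, fmt, buf) == 1;
}
```

Calling it with '\t', '\n', or some other character as sep covers the tab-separated, newline-separated, and special-character cases with one function; the caller consumes the separator afterwards, for example with getc.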
That said, I normally do not use the scanf() family of functions for file I/O; I get the data into a string with some sort of 'get line' routine, and then use sscanf() or other more specialized parsing code to split it up.
