Using regex within sscanf [duplicate] - c

I needed to read a string until the following sequence is written: \nx\n :
(.....)\n
x\n
\n is the new line character and (.....) can be any characters that may include other \n characters.
scanf allows regular expressions as far as I know, but i can't make it to read a string untill this pattern. Can you help me with the scanf format string?
I was trying something like:
char input[50000];
scanf(" %[^(\nx\n)]", input);
but it doesn't work.

scanf allows regular expressions as far as I know
Unfortunately, it does not allow regular expressions: the syntax is misleadingly close, but there is nothing even remotely similar to the regex in the implementation of scanf. All that's there is a support for character classes of regex, so %[<something>] is treated implicitly as [<something>]*. That's why your call of scanf translates into read a string consisting of characters other than '(', ')', 'x', and '\n'.
To solve your problem at hand, you can set up a loop that read the input character by character. Every time you get a '\n', check that
You have at least three characters in the input that you've seen so far,
That the character immediately before '\n' is an 'x', and
That the character before the 'x' is another '\n'
If all of the above is true, you have reached the end of your anticipated input sequence; otherwise, your loop should continue.

scanf does not support regular expressions. It has limited support for character classes but that's not at all the same thing.
Never use scanf, fscanf, or sscanf, because:
Numeric overflow triggers undefined behavior. The C runtime is allowed to crash your program just because someone typed too many digits.
Some format specifiers (notably %s) are unsafe in exactly the same way gets is unsafe, i.e. they will cheerfully write past the end of the provided buffer and crash your program.
They make it extremely difficult to handle malformed input robustly.
You don't need regular expressions for this case; read a line at a time with getline and stop when the line read is just "x". However, the standard (not ISO C, but POSIX) regular expression library routines are called regcomp and regexec.

Related

Reading specifically formatted string

I'm attempting to read a file containing lines of strings in the following format:
"string";"string";"string";"string";"string"
How do i read them each using functions compatible on windows and linux?
Length of each string is unknown.
i have attempted to use fscanf like this:
fscanf(fp, "\"%s\";\"%s\";\"%s\";\"%s\";\"%s\"\n");
But the first string picked up the whole line.
If you really want to use fscanf, you could use a format string like this :
fscanf(fp, "\"%[^\"]\";\"%[^\"]\";\"%[^\"]\";\"%[^\"]\";\"%[^\"]\"\n", ...);
For more details, read up on the [set] conversion specifier in the reference docs for fscanf.
Note that this will not work with embedded '"' characters in the strings.
This also leaves no flexibility (like additional whitespace around the semicolons, optional quotes, etc.).
In case those limitations are problematic for you, you'll want a more intelligent parser (libcsv comes to mind eg.). Also ref. pmg's answer for how to roll your own.
here's some pseudo-code for you
loop
getchar; if not a quote exit with error
loop
getchar; mind EOF
if not a quote, add to string
if a quote exit inner loop
use string
getchar; if not semicolon exit with error unless EOF

What's the difference between gets and scanf?

If the code is
scanf("%s\n",message)
vs
gets(message)
what's the difference?It seems that both of them get input to message.
The basic difference [in reference to your particular scenario],
scanf() ends taking input upon encountering a whitespace, newline or EOF
gets() considers a whitespace as a part of the input string and ends the input upon encountering newline or EOF.
However, to avoid buffer overflow errors and to avoid security risks, its safer to use fgets().
Disambiguation: In the following context I'd consider "safe" if not leading to trouble when correctly used. And "unsafe" if the "unsafetyness" cannot be maneuvered around.
scanf("%s\n",message)
vs
gets(message)
What's the difference?
In terms of safety there is no difference, both read in from Standard Input and might very well overflow message, if the user enters more data then messageprovides memory for.
Whereas scanf() allows you to be used safely by specifying the maximum amount of data to be scanned in:
char message[42];
...
scanf("%41s", message); /* Only read in one few then the buffer (messega here)
provides as one byte is necessary to store the
C-"string"'s 0-terminator. */
With gets() it is not possible to specify the maximum number of characters be read in, that's why the latter shall not be used!
The main difference is that gets reads until EOF or \n, while scanf("%s") reads until any whitespace has been encountered. scanf also provides more formatting options, but at the same time it has worse type safety than gets.
Another big difference is that scanf is a standard C function, while gets has been removed from the language, since it was both superfluous and dangerous: there was no protection against buffer overruns. The very same security flaw exists with scanf however, so neither of those two functions should be used in production code.
You should always use fgets, the C standard itself even recommends this, see C11 K.3.5.4.1
Recommended practice
6 The fgets function allows properly-written
programs to safely process input lines too long to store in the result
array. In general this requires that callers of fgets pay attention to
the presence or absence of a new-line character in the result array.
Consider using fgets (along with any needed processing based on
new-line characters) instead of gets_s.
(emphasis mine)
There are several. One is that gets() will only get character string data. Another is that gets() will get only one variable at a time. scanf() on the other hand is a much, much more flexible tool. It can read multiple items of different data types.
In the particular example you have picked, there is not much of a difference.
gets - Reads characters from stdin and stores them as a string.
scanf - Reads data from stdin and stores them according to the format specified int the scanf statement like %d, %f, %s, etc.
gets:->
gets() reads a line from stdin into the buffer pointed to by s until either a terminating newline or EOF, which it replaces with a null byte ('\0').
BUGS:->
Never use gets(). Because it is impossible to tell without knowing the data in advance how many characters gets() will read, and because gets() will continue to store characters past the end of the buffer, it is extremely dangerous to use. It has been used to break computer security. Use fgets() instead.
scanf:->
The scanf() function reads input from the standard input stream stdin;
BUG:->
Some times scanf makes boundary problems when deals with array and string concepts.
In case of scanf you need that format mentioned, unlike in gets. So in gets you enter charecters, strings, numbers and spaces.
In case of scanf , you input ends as soon as a white-space is encountered.
But then in your example you are using '%s' so, neither gets() nor scanf() that the strings are valid pointers to arrays of sufficient length to hold the characters you are sending to them. Hence can easily cause an buffer overflow.
Tip: use fgets() , but that all depends on the use case
The concept that scanf does not take white space is completely wrong. If you use this part of code it will take white white space also :
#include<stdio.h>
int main()
{
char name[25];
printf("Enter your name :\n");
scanf("%[^\n]s",name);
printf("%s",name);
return 0;
}
Where the use of new line will only stop taking input. That means if you press enter only then it will stop taking inputs.
So, there is basically no difference between scanf and gets functions. It is just a tricky way of implementation.
scanf() is much more flexible tool while gets() only gets one variable at a time.
gets() is unsafe, for example: char str[1]; gets(str)
if you input more then the length, it will end with SIGSEGV.
if only can use gets, use malloc as the base variable.

Use scanf with Regular Expressions

I've been trying to use regular expressions on scanf, in order to read a string of maximum n characters and discard anything else until the New Line Character. Any spaces should be treated as regular characters, thus included in the string to be read.
I've studied a Wikipedia article about Regular Expressions, yet I can't get scanf to work properly. Here is some code I've tried:
scanf("[ ]*%ns[ ]*[\n]", string);
[ ] is supposed to go for the actual space character, * is supposed to mean one or more, n is the number of characters to read and string is a pointer allocated with malloc.
I have tried several different combinations; however I tend to get only the first word of a sentence read (stops at space character). Furthermore, * seems to discard a character instead of meaning "zero or more"...
Could anybody explain in detail how regular expressions are interpreted by scanf? What is more, is it efficient to use getc repetitively instead?
Thanks in Advance :D
The short answer: scanf does not handle regular expressions literally speaking.
If you want to use regular expressions in C, you could use the regex POSIX library. See the following question for a basic example on this library usage : Regular expressions in C: examples?
Now if you want to do it the scanf way you could try something like
scanf("%*[ ]%ns%*[ ]\n",str);
Replace the n in %ns by the maximal number of characters to read from input stream.
The %*[ ] part asks to ignore any spaces. You could replace the * by a specific number to ignore a precise number of characters. You could add other characters between braces to ignore more than just spaces.
Not sure if the above scanf would work as spaces are also matched with the %s directive.
I would definitely go with a fgets call, then triming the surrounding whitespaces with something like the following: How do I trim leading/trailing whitespace in a standard way?
is it efficient to use getc repetitively instead?
Depends somewhat on the application, but YES, repeated getc() is efficient.
unless I read the question wrong, %[^'\n']s will save everything until the carriage return is encountered.

Force fscanf to Consume Possible Whitespace

I have a multiline TSV file with the following format:
Type\tBasic Name\tAttribute\tA Long Description\n
As you can see, the Basic Name and the Description can both contain some number of spaces. I am trying to read each line in and extract the elements. For now, I've narrowed it down to just extracting the basic name. My fscanf is as follows:
fscanf(file_in, "%*[^ ]s\t%128[^ ]s\t%*[^ ]s\t%[^ ]s\n", name_string, desc_string);
This doesn't work as I have hoped, and I'm having trouble narrowing down the error. Does anyone know how I could read in the lines properly?
I mostly agree with Pablo (that the scanf family don't make great parsers), but it's worth understanding how to write a scanf pattern. The pattern you're looking for is something like this:
fscanf(" %*[^\t] %128[^\t] %*[^\t] %128[^\n]", name_string, desc_string)
Notes:
%[xyz] is a directive. %[xyz]s is two directives, the second of which matches a literal s
As far a I know, there is no way to match a single literal tab character, since any whitespace in the pattern matches any amount of whitespace (including none) in the input. I used a space in my example, which will match a terminating tab, but it will also match any number of consecutive tabs so empty fields won't be parsed correctly.
The 128-character limit does not include the terminating NUL character.
Also, if the scan stops because the chracter limit is exceeded, it won't skip the rest of the field automatically, so you'll end up out of synch with the input.
A better pattern would be:
fscanf(" %*[^\t] %128[^\t]%*[^\t] %*[^\t] %128[^\n]%*[^\n]", name_string, desc_string)
which explicitly skips the remaining characters in the field, if necessary. An even better solution would be to use the a modifier and get fscanf to malloc memory for you.
I'd rather use strtok for this. It's more acurate than fscanf since this function family only work when the format is 100% OK, otherwise you end up missing values.
Take a look at Parallel to PHP's "explode" in C: Split char* into char* using delimiter, where I explain in more detail how to use strtok.
So, read each line with fgets and parse it with strtok.
Firstly, as it has already been noted, the %[] is a conversion specifier by itself. There's no s after the []. The s-es that you have in your format string will not be considered parts of the conversion specifiers. You have to get rid of those s-es.
Secondly, as you said yourself, your file is TAB-separated. Which immediately means that you should extract the continuous portions of the sequence by using the %[^\t] conversion specifier (or the %[^\n] specifier for the last portion). Why did you use %[^ ] and how did you expect it to work? The %[^ ] actually stops parsing at space character, which is the opposite of what you wanted.
In your example the proper combination of specifiers would be
fscanf(file_in, "%*[^\t]\t%128[^\t]\t%*[^\t]\t%[^\n]\n", name_string, desc_string);
This format string assumes that all 4 portions of the string are guaranteed to be present and that the last portion is guaranteed to be terminated by \n.

fscanf usage in C

I have a file like this:
10 15
something
I want to read this into tree variables, let's say number1, number2, and mystring. I have doubts about what kind of pattern to give to fscanf. I am thinking something like this;
fscanf(fp,"%i %i\n%s",number1,number2,mystring);
Should this work, and also, is this the correct way of reading this file? If not, what would you suggest?
fscanf(fp,"%i %i\n%s",&number1,&number2,mystring);
fscanf takes pointers.
Read each line with fgets (or getline if you have it), split up the line with strsep (better, if available) or strtok_r (more awkward API but more portable), and then use strtoul to convert strings to numbers as necessary.
*scanf should never be used, because:
Some format strings (e.g. a bare "%s") are just as eager to overflow your buffers as gets is.
Behavior on integer overflow is undefined -- invalid input can potentially crash your program.
They do not report the character position of the first scan error, making it nigh-impossible to recover from a parse error. (This can be somewhat mitigated by using fgets and then sscanf instead of fscanf.)
Besides the problem with pointers, generally using spaces in the scanf format is a mistake -- in most cases scanf skips whitespace automatically. So I would use something like:
int number1, number2;
char mystring[32];
fscanf("%i%i%31s", &number1, &number2, &mystring)
This will read two numbers followed by a string of up to 31 non-whitespace characters, all separated by any whitespace. Note that "whitespace" includes spaces, tabs, and newlines, so it doesn't matter if its all on one line, or spread out over 3 lines or anything in between.
Note also using a limit on the size of the string -- without that, the input might overflow any fixed size buffer you provide (and there's no way to provide a variable sized buffer with scanf)

Resources