How to REGEX // in C? Single line comments

I used the following to get it to work partially:
%{
#define OR 2
#define AND 3
.........
.........
%}
delim [ \t]
ws {delim}*
letter [A-Za-z]
digit [0-9]
comments [/]+({letter}|{digit}|{delim})*
%%
{comments} {return(COMMENT);}
......................
......................
%%
int main()
{
    int tkn = 0;
    while ((tkn = yylex()) != 0)
    {
        switch (tkn)
        {
        case COMMENT:
            printf("GOT COMMENT");
            break;
        }
    }
    return 0;
}
This is working fine. The problem is that the regex obviously does not recognize special characters because [/]+({letter}|{digit}|{delim})* does not consider special characters. How to change the regex to accommodate more characters till end of line?

Couldn't you just use
[/]+.*
It will match some number of / and then anything till the end of line. Of course this will not cover comments like /* COMMENT */.

Maybe it's late, but I find \/[\/]+.* more appropriate. It covers a double slash (or more) and then the rest of the text.
The following explanation is from regex101.com:
\/ matches the character / literally
[\/]+ matches the character / one or more times, as many times as possible, giving back as needed (greedy)
.* matches any character (except for line terminators), zero or more times

A single-line comment expression starting with '//' can be captured by the following regular expression.
\/\/[^\r\n]*
\/\/ matches the double-slash
[^\r\n]* matches as many characters that are not carriage-return or line-feed as it can find.
However, the C language allows a single line comment to be extended to the next line when the last character in the line is a backslash (\). Therefore, you may want to use the following.
\/\/[^\r\n]*(?:(?<=\\)\r?\n[^\r\n]*)*
\/\/ matches the double-slash
[^\r\n]* matches as many characters that are not carriage-return (\r) or line-feed (\n) as it can find
(?: start a non-capturing group
(?<=\\) assert that a backslash (\) immediately precedes the current position
\r?\n match the end of a line
[^\r\n]* matches as many characters that are not carriage-return (\r) or line-feed (\n) as it can find
)* complete the non-capturing group and let it repeat 0 or more times
Note that this method has problems. Depending on what you are doing, you may want to find and use a lexical scanner. A lexical scanner can avoid the following problems.
Scanning the text
/* Comment appears to have // a comment inside it */
will match
// a comment inside it */
Scanning the text
char* a = "string appears to have // a comment";
will match
// a comment";

Why can't you just write
"//"|"/*" {return(COMMENT);}
?

The following regular expression works just fine for me.
\/\/.*

Related

What does [^0-9]+$ mean (regular expression in FLEX)

This is what I know:
^ inside brackets matches a character that isn't one of the included inside the brackets.
+ Matches one or more appearances of the expression to its left (in my ex. [^0-9]).
$ If I'm not mistaken, matches to an expression that ends with the expression to its left.
Then it seems this expression should match input that has at least one character that isn't a digit and that ends with that expression, for example it should match:
1a, aaa, 2321a,1b1b
and should not match:
111, 432423,asd3213
but it is unclear to me from running this rule what exactly it matches.
This is my full code:
%option noyywrap
%{
#include<stdio.h>
%}
%%
[^0-9]+$ printf("not a number");
%%
int main()
{
    yylex();
    return 0;
}
And I'm using flex.
output examples(sorry for the links, it won't let me upload a photo):
[1] https://ibb.co/qp3hB0r - doesn't match but prints back
[2] https://ibb.co/syZHjrw - doesn't match and eats it (why does it happen if I didn't add ".|\n" in the code?)
[3] https://ibb.co/s6S0tQh - matches and prints back
[4] https://ibb.co/VmZW7KR - same as the 3rd
[5] https://ibb.co/2vPfWhc - matched only the 11(?) and ate up the aa
I'm really confused as to what it actually matches and would appreciate the help.
This is what I know:
^ inside brackets matches a character that isn't one of the included inside the brackets.
That's an odd way to put it. More accurate would be that the whole bracket-enclosed fragment matches one character that is not (because of the ^) in the range '0' - '9'.
+ Matches one or more appearances of the expression to its left (in my ex. [^0-9]).
Again an odd way to put it. The + quantifier modifies the preceding fragment to match one or more appearances of whatever it otherwise would match exactly once.
$ If I'm not mistaken, matches to an expression that ends with the expression to its left.
You are mistaken. The $ anchors the match to the end of a line -- the overall pattern matches only text that ends at the end of a line, as determined by immediately preceding a newline (and therefore not at the very end of the file). That's a restriction, not an extension: nothing is matched that wouldn't be matched by the pattern excluding the $, but there is an additional requirement that the match occur at the end of a line. That's not at all the same thing as matching text that ends with a match to the preceding pieces of the pattern.
Thus,
it seems this expression should match input that has at least one
character that isn't a digit and that ends with that expression, for
example it should match: 1a, aaa, 2321a,1b1b
No. Taking those as four separate examples, it would not match any of them unless they appeared at the end of a line. If they all did appear at the end of a line, then only aaa would be matched in total, but the trailing a or b of each of the others would be matched.
Note also, however, that when a flex scanner cannot match the input to any user-defined rule, its default rule is invoked, which copies the next input character to the standard output, consuming it. Therefore, if you present an input to your scanner that contains at least one non-digit at the end of a line, then it will eventually consume any preceding input up to the last digit, printing all of that on the standard output, before eventually matching that trailing portion of the line and printing "not a number".

RegExp doesn't match when in the middle of other words

Hello everyone. I'm testing a regexp with lex to find the product id in HTML from Amazon. I don't know why, when it reads a file which contains:
<span class="a-icon-alt">4,7 de un máximo de 5 estrellas</span>
it works, but if its content is something like:
aaaaaa< span class="a-icon-alt">4,7 de un máximo de 5 estrellas< /span>bbbbbb
it doesn't. Here is the code with the regex.
%{
#include <stdio.h>
int nc, np, nl;
void escribir_datos (int dato1, int dato2, int dato3);
%}
productos (<li+[ ]+id=\"result_[0-9]*)+
num_productos [0-9]*
nombre_producto <h2+[ ]+data-attribute=\"([^\"]*)
nombre_final_producto \"[^\"]*\"
precio_producto <span+[ ]+class=\"a-size-base+[ ]+a-color-price+[ ]+s-price+[ ]+a-text-bold\">(.*?)<\/span>
precio_final_producto [0-9]+([,][0-9]+)?
valoraciones <span+[ ]+class=\"a-icon-alt\">(.*?)<\/span>
%%
{valoraciones} { nl++; }
[^ \t\n]+ { np++; nc += yyleng; }
[ \t]+ { nc += yyleng; }
\n { nc++; }
%%
/*----- Procedures section --------*/
int main (int argc, char *argv[]) {
    if (argc == 2) {
        yyin = fopen (argv[1], "rt");
        if (yyin == NULL) {
            printf ("The file %s cannot be opened\n", argv[1]);
            exit (-1);
        }
    }
    else yyin = stdin;
    nc = np = nl = 0;
    yylex ();
    escribir_datos(nc,np,nl);
    return 0;
}
void escribir_datos (int dato1, int dato2, int dato3) {
    printf("Num_char=%d\tNum_words=%d\tNum_lines=%d\n", dato1,dato2,dato3);
}
Thanks I hope you can help me.
The intended use case for lexical analyzers generated by (f)lex is to split the input into a series of primitive "tokens", each one with some syntactic significance. They do not search for regular expressions, because they assume that every part of the input will match some pattern in your lexical description.
So each time the lexical analyzer examines the input, it will select the pattern which gives the best match. A match is a sequence of characters starting at the current input point which matches a pattern, and the best match is the one which matches the longest sequence. (If there are two or more patterns which match the same longest sequence, the first one in the list of patterns is considered the best one.)
With that in mind, consider what happens with the input
aaaaaa< span class="a-icon-alt">4,7 de un máximo de 5 estrellas< /span>bbbbbb
Your file has four patterns:
{valoraciones}
[^ \t\n]+
[ \t]+
\n
The input doesn't match {valoraciones} because that pattern only matches a string starting with <. It doesn't match [ \t]+ either, because it doesn't start with either space or tab, and similarly it doesn't match a newline. But it does match [^ \t\n]+. Since (f)lex always selects the longest match, and [^ \t\n]+ matches any sequence of characters other than whitespace (space, tab, newline), the first match will be aaaaaa<.
After that's matched, the remaining input starts with the space before span..., which means that only the third pattern ([ \t]+) matches. It could match any number of space characters, but there is only one and that's what it will match.
So then the input is span class="a-icon-alt">4,7.... {valoraciones} still won't match -- the input doesn't start with < -- so we're back to a match of the second pattern.
And so on.
I think you need to be a lot clearer (with yourself) about what the tokens you are trying to match are. If you're looking for specific HTML tags, then you probably want to recognise any sequence which doesn't contain a < as a token, rather than looking for input terminated with a white space character. But then you also need to accept any tag as a token, as well as the specific tags you are trying to catch.
Of course, it is also possible that (f)lex is not the ideal tool for your use case. You don't really say what your use case is, so I'm not going to make any assumption one way or another.
In any event, you should take a few minutes to read the documentation on flex patterns. Any regex syntax not described on that page will not work with (f)lex, regardless of whether it works with regex libraries or online regex checkers. In particular, .*? does not give you a non-greedy match, as it would in many regex libraries. (F)lex doesn't implement non-greedy matches (because it doesn't do any backtracking), and it considers .*? to be an optional (?) appearance of any number including zero repetitions (*) of any character other than a newline (.). Making the repetition optional has no effect, since the repetition already matches zero repetitions. So the pattern <.*?> would match from the < up to the last > on the same line. That is probably not what you want.
You also probably don't want <span+, which matches < followed by the letters s, p, a and then any number of n (as long as there is at least one). In other words, it will match <span, <spann, <spannnnnnnnnnn, and many more.
Thanks for the answer. The problem was that the three rules after {valoraciones} conflicted with the first one, so I couldn't find a word embedded in other text. For example, I want to find dog in a text containing aaaaadogaaaa, and that doesn't match dog because of what I said at the beginning.

C program to store special characters in strings

Basically, I can't figure this out: I want my C program to store the entire plain text of a batch program, write it to a file, and then run it.
I finished my program, but holding the contents is my problem. How do I put the code in a string and make it ignore ALL special characters like %s, \, etc.?
You have to escape special characters with a \. You can escape the backslash itself with another backslash (i.e. \\).
As Ian previously mentioned, you can escape characters that aren't allowed in normal C strings with \; for instance, newline becomes \n, double-quote becomes \", and backslash becomes \\.
If you're unable or unwilling to do this for whatever reason, then you may be out of luck if your solution must be in C. However, if you're willing to switch to C++, then you can use raw strings:
const char* s1 = R"foo(
Hello
World
)foo";
This is equivalent to
const char* s2 = "\nHello\nWorld\n";
A raw string must begin with R" followed by an arbitrary delimiter (made of any source character but parentheses, backslash and spaces; can be empty; and at most 16 characters long), then (, and must end with ) followed by the delimiter and ". The delimiter must be chosen such that the termination substring (), delimiter, ") does not appear within the string.

What is [^\n] in C? [duplicate]

I have run into some code and was wondering what the original developer was up to. Below is a simplified program using this pattern:
#include <stdio.h>
int main() {
    char title[80] = "mytitle";
    char title2[80] = "mayataiatale";
    char mystring[80];
    /* huh? */
    sscanf(title,"%[^a]",mystring);
    printf("%s\n",mystring); /* Output is "mytitle" */
    /* huh? */
    sscanf(title2,"%[^a]",mystring);
    printf("%s\n",mystring); /* Output is "m" */
    return 0;
}
The man page for scanf has relevant information, but I'm having trouble reading it. What is the purpose of using this sort of notation? What is it trying to accomplish?
The main reason for the character classes is that the %s notation stops at the first white space character, even if you specify field lengths, and you quite often don't want it to. In that case, the character class notation can be extremely helpful.
Consider this code to read a line of up to 10 characters, discarding any excess, but keeping spaces:
#include <stdio.h>
int main(void)
{
    char buffer[10+1] = "";
    int rc;
    while ((rc = scanf("%10[^\n]%*[^\n]", buffer)) >= 0)
    {
        getchar(); /* consume the newline that ended the line */
        printf("rc = %d\n", rc);
        if (rc >= 0)
            printf("buffer = <<%s>>\n", buffer);
        buffer[0] = '\0';
    }
    printf("rc = %d\n", rc);
    return(0);
}
This was actually example code for a discussion on comp.lang.c.moderated (circa June 2004) related to getline() variants.
At least some confusion reigns. The first format specifier, %10[^\n], reads up to 10 non-newline characters and they are assigned to buffer, along with a trailing null. The second format specifier, %*[^\n], contains the assignment suppression character (*) and reads zero or more remaining non-newline characters from the input. When the scanf() function completes, the input is pointing at the next newline character. The body of the loop reads (and discards) that character with getchar(), so that when the loop restarts, the input is looking at the start of the next line. The process then repeats. If the line is shorter than 10 characters, then those characters are copied to buffer, and the 'zero or more non-newlines' format processes zero non-newlines.
The constructs like %[a] and %[^a] exist so that scanf() can be used as a kind of lexical analyzer. These are sort of like %s, but instead of collecting a span of as many "stringy" characters as possible, they collect just a span of characters as described by the character class. There might be cases where writing %[a-zA-Z0-9] might make sense, but I'm not sure I see a compelling use case for complementary classes with scanf().
IMHO, scanf() is simply not the right tool for this job. Every time I've set out to use one of its more powerful features, I've ended up eventually ripping it out and implementing the capability in a different way. In some cases that meant using lex to write a real lexical analyzer, but usually doing line at a time I/O and breaking it coarsely into tokens with strtok() before doing value conversion was sufficient.
Edit: I ended ripping out scanf() typically because when faced with users insisting on providing incorrect input, it just isn't good at helping the program give good feedback about the problem, and having an assembler print "Error, terminated." as its sole helpful error message was not going over well with my user. (Me, in that case.)
It's like character sets from regular expressions; [0-9] matches a string of digits, [^aeiou] matches anything that isn't a lowercase vowel, etc.
There are all sorts of uses, such as pulling out numbers, identifiers, chunks of whitespace, etc.
You can read about it in the ISO/IEC 9899 standard, which is available online.
Here is the paragraph about the [ conversion specifier, quoted from the document (page 286):
Matches a nonempty sequence of characters from a set of expected
characters.
The conversion specifier includes all subsequent characters in the
format string, up to and including the matching right bracket (]). The
characters between the brackets (the scanlist) compose the scanset,
unless the character after the left bracket is a circumflex (^), in
which case the scanset contains all characters that do not appear in
the scanlist between the circumflex and the right bracket. If the
conversion specifier begins with [] or [^], the right bracket
character is in the scanlist and the next following right bracket
character is the matching right bracket that ends the specification;
otherwise the first following right bracket character is the one that
ends the specification. If a - character is in the scanlist and is not
the first, nor the second where the first character is a ^, nor the
last character, the behavior is implementation-defined.
