Hello everyone, I'm testing a regexp with lex to find the product id in HTML from Amazon. I don't know why, but when it reads a file which contains:
<span class="a-icon-alt">4,7 de un máximo de 5 estrellas</span>
it works, but if its content is something like:
aaaaaa< span class="a-icon-alt">4,7 de un máximo de 5 estrellas< /span>bbbbbb
it doesn't. Here is the code with the regex.
%{
#include <stdio.h>
int nc, np, nl;
void escribir_datos (int dato1, int dato2, int dato3);
%}
productos (<li+[ ]+id=\"result_[0-9]*)+
num_productos [0-9]*
nombre_producto <h2+[ ]+data-attribute=\"([^\"]*)
nombre_final_producto \"[^\"]*\"
precio_producto <span+[ ]+class=\"a-size-base+[ ]+a-color-price+[ ]+s-price+[ ]+a-text-bold\">(.*?)<\/span>
precio_final_producto [0-9]+([,][0-9]+)?
valoraciones <span+[ ]+class=\"a-icon-alt\">(.*?)<\/span>
%%
{valoraciones} { nl++; }
[^ \t\n]+ { np++; nc += yyleng; }
[ \t]+ { nc += yyleng; }
\n { nc++; }
%%
/*----- Sección de Procedimientos --------*/
int main (int argc, char *argv[]) {
if (argc == 2) {
yyin = fopen (argv[1], "rt");
if (yyin == NULL) {
printf ("El fichero %s no se puede abrir\n", argv[1]);
exit (-1);
}
}
else yyin = stdin;
nc = np = nl = 0;
yylex ();
escribir_datos(nc,np,nl);
return 0;
}
void escribir_datos (int dato1, int dato2, int dato3) {
printf("Num_char=%d\tNum_words=%d\tNum_lines=%d\n", dato1,dato2,dato3);
}
Thanks I hope you can help me.
The intended use case for lexical analyzers generated by (f)lex is to split the input into a series of primitive "tokens", each one with some syntactic significance. They do not search for regular expressions, because they assume that every part of the input will match some pattern in your lexical description.
So each time the lexical analyzer examines the input, it will select the pattern which gives the best match. A match is a sequence of characters starting at the current input point which matches a pattern, and the best match is the one which matches the longest sequence. (If there are two or more patterns which match the same longest sequence, the first one in the list of patterns is considered the best one.)
With that in mind, consider what happens with the input
aaaaaa< span class="a-icon-alt">4,7 de un máximo de 5 estrellas< /span>bbbbbb
Your file has four patterns:
{valoraciones}
[^ \t\n]+
[ \t]+
\n
The input doesn't match {valoraciones} because that pattern only matches a string starting with <. It doesn't match [ \t]+ either, because it doesn't start with either space or tab, and similarly it doesn't match a newline. But it does match [^ \t\n]+. Since (f)lex always selects the longest match, and [^ \t\n]+ matches any sequence of characters other than whitespace (space, tab, newline), the first match will be aaaaaa<.
After that's matched, the remaining input begins with the space before span..., which means that only the third pattern ([ \t]+) matches. It could match any number of space characters, but there is only one and that's what it will match.
So then the input is span class="a-icon-alt">4,7.... {valoraciones} still won't match -- the input doesn't start with < -- so we're back to a match of the second pattern.
And so on.
I think you need to be a lot clearer (with yourself) about what the tokens you are trying to match are. If you're looking for specific HTML tags, then you probably want to recognise any sequence which doesn't contain a < as a token, rather than looking for input terminated with a white space character. But then you also need to accept any tag as a token, as well as the specific tags you are trying to catch.
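To make that token structure concrete, here is a rough sketch in plain C rather than flex. It treats the input as two kinds of token, a run of text containing no <, or a complete tag, and counts the tags equal to a wanted string. The function name count_tags and the exact-match comparison are assumptions made for illustration; a real scanner would need to be more tolerant about attributes and spacing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: walks the input as two kinds of token, either a run
 * of text containing no '<', or a tag running from '<' to the next '>'.
 * Returns how many tags are exactly equal to `wanted`. */
int count_tags(const char *input, const char *wanted)
{
    assert(input != NULL && wanted != NULL);
    size_t wanted_len = strlen(wanted);
    int count = 0;

    while (*input != '\0') {
        if (*input != '<') {
            /* Text token: skip everything up to the next '<' (or the end). */
            input += strcspn(input, "<");
            continue;
        }
        /* Tag token: runs from '<' to the next '>' inclusive. */
        const char *end = strchr(input, '>');
        if (end == NULL)
            break;                      /* unterminated tag: stop scanning */
        size_t tag_len = (size_t)(end - input) + 1;
        if (tag_len == wanted_len && strncmp(input, wanted, tag_len) == 0)
            count++;
        input = end + 1;
    }
    return count;
}
```

Because the text before a tag is consumed as its own token, the wanted tag is found whether or not it is preceded by characters such as aaaaaa.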
Of course, it is also possible that (f)lex is not the ideal tool for your use case. You don't really say what your use case is, so I'm not going to make any assumption one way or another.
In any event, you should take a few minutes to read the documentation on flex patterns. Any regex syntax not described on that page will not work with (f)lex, regardless of whether it works with regex libraries or online regex checkers. In particular, .*? does not give you a non-greedy match, as it would in many regex libraries. (F)lex doesn't implement non-greedy matches (because it doesn't do any backtracking), and it considers .*? to be an optional (?) appearance of any number including zero repetitions (*) of any character other than a newline (.). Making the repetition optional has no effect, since the repetition already matches zero repetitions. So the pattern <.*?> would match from the < up to the last > on the same line. That is probably not what you want.
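The effect of a greedy .* can be seen outside flex as well. The sketch below uses the POSIX regex.h functions, which are not flex but also report the longest match, so they show the same behaviour; the helper name first_match_length is an assumption for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the first match of the POSIX
 * extended regex `pattern` in `text`, or -1 if it does not match or does
 * not compile. */
int first_match_length(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);
    regex_t re;
    regmatch_t m;
    int length = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0)
        length = (int)(m.rm_eo - m.rm_so);
    regfree(&re);
    return length;
}
```

With the text <a>x<b>y, the pattern <.*> matches the seven characters <a>x<b>, while <[^>]*> stops at the first > and matches only <a>. A negated character class like that is the usual substitute for a non-greedy match in flex.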
You also probably don't want <span+, which matches < followed by the letters s, p, a and then any number of n (as long as there is at least one). In other words, it will match <span, <spann, <spannnnnnnnnnn, and many more.
Thanks for the answer. The problem was that the three rules after {valoraciones} conflicted with the first one, so I couldn't find a word embedded in other text. For example, if I want to find dog in a text containing aaaaadogaaaa, it doesn't match dog, for the reason I described at the beginning.
Hello, I have been trying to count and then display the words that end with 'ent' in a file. I tried to do that using the following code, but I couldn't even count them.
This is the function I tried to use to count the words. It is supposed to read text from a file, then count the words that end with 'ent' and display them on the terminal.
Please help me achieve that if you have an idea how to do it.
void ent(FILE*output){
char s[250];
int ent, i;
output = fopen("output.txt", "r");
while(fgets(s, 250, output)){
if((s[i]=='e') && (s[i+1]=='n' )&& (s[i+2]=='t')){
ent++;
}
}
printf("le nbr de mots avec ent est: %d\n",ent);
fclose(output);
}
You want to use a Regular Expression, or regex for short. The most basic explanation one can give is that it is a form of notation that allows you to match patterns in text. In essence, you can run it against every line in your text file and check how many times a match is found.
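The regex you are looking for is ^.*ent$ (note that, applied line by line, it matches a whole line that ends in ent, so it counts such lines rather than individual words):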
^ matches the start of a line.
. matches any character, while adding the Kleene Star * allows any number of characters to be matched.
ent literally matches the characters ent.
$ is the end-of-line symbol.
Depending on the implementation of your programming language (and OS in the case of C), there can be numerous options that can be applied such as "multiline", "global", etc. Standard Linux documentation can be found here.
If you wish to look at an example, you can take a look at this function of mine, in which I use regular expressions to parse IPv4 and IPv6 port numbers.
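For counting individual words rather than lines, a regex is not strictly required. The following is a minimal sketch in plain C, assuming that a "word" is a maximal run of ASCII letters (so accented letters would split a word in the default C locale); the name count_words_ending_with is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the words in `text` that end with `suffix`.
 * A "word" is assumed here to be a maximal run of alphabetic characters. */
int count_words_ending_with(const char *text, const char *suffix)
{
    assert(text != NULL && suffix != NULL);
    size_t suffix_len = strlen(suffix);
    int count = 0;
    const char *p = text;

    while (*p != '\0') {
        /* Skip separators (anything that is not a letter). */
        while (*p != '\0' && !isalpha((unsigned char)*p))
            p++;
        /* Measure the word that starts here (possibly empty at the end). */
        const char *start = p;
        while (isalpha((unsigned char)*p))
            p++;
        size_t word_len = (size_t)(p - start);
        /* Compare the last suffix_len letters of the word with the suffix. */
        if (suffix_len > 0 && word_len >= suffix_len &&
            memcmp(p - suffix_len, suffix, suffix_len) == 0)
            count++;
    }
    return count;
}
```

Calling it on each line returned by fgets and summing the results gives the total for a file. Unlike the code in the question, every counter and index is initialised before use.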
I am writing a compiler for the mini-L language. I am pretty much done with the scanning part. This is where you basically find tokens in the input. My problem is that when I type in "function", it is matched as an identifier. I can't really tell what's wrong.
ident [a-zA-Z][a-zA-Z0-9_]+[^_]
%{
#include <stdio.h>
#include <stdlib.h>
int col = 0;
int row = 0;
%}
%%
"function" {printf("FUNCTION\n"); col += yyleng;}
{ident} {printf("IDENT %s\n", yytext); col += yyleng;}
. {printf("Error at line %d, column %d: unrecognized symbol \"%s\"\n", row, col, yytext);}
%%
int main(int argc, char** argv){
if(argc > 1){
FILE *fp = fopen(argv[1], "r");
if(fp){
yyin = fp;
}
}
printf("Give me the input:\n");
yylex();
}
I tried rearranging the rules but that didn't work and that doesn't seem like the problem anyway. I think it's a problem with my regex. Any help is much appreciated.
So here's your ident pattern: [a-zA-Z][a-zA-Z0-9_]+[^_]. Let's break that down:
[a-zA-Z] A single letter.
[a-zA-Z0-9_]+ One or more (+) of a letter, a digit, or a _
[^_] Anything other than _. Anything. Such as a space.
One simple observation is that the pattern can't match fewer than three characters. On the other hand, the last one could be almost anything. It could be a letter. But it could also be a comma, or a space, or even a newline. Just not an underline.
So that won't match x, which is usually considered a valid identifier. But it will match function . (That is, the word function followed by a space.) Since that is longer than function, the identifier pattern will win unless you happened to write function_.
So that's probably not what you wanted.
Note that if you want to avoid identifiers ending with an underscore, you have to deal with two issues.
The obvious pattern [[:alpha:]][[:alnum:]_]*[[:alnum:]] can't match single letters. So you need [[:alpha:]]([[:alnum:]_]*[[:alnum:]])?.
If your identifier pattern doesn't match a trailing underscore, the underscore will be left over for the next token. Maybe you're OK with that. But it's probably better to match the underscore and issue an error message. Or reconsider the restriction.
Note: The pattern syntax is documented in the Flex manual, including the built in character sets I used above.
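The corrected identifier pattern can be tried outside flex. This sketch uses the POSIX regex.h functions, which accept the same bracket expressions; the anchors ^ and $ are added so that the whole string has to match, and the helper name is_identifier is an assumption for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the whole string `s` matches the
 * identifier pattern from the answer, 0 otherwise (or on compile failure). */
int is_identifier(const char *s)
{
    assert(s != NULL);
    regex_t re;
    /* ^ and $ anchor the pattern so that the entire string must match. */
    const char *pattern = "^[[:alpha:]]([[:alnum:]_]*[[:alnum:]])?$";
    int ok = 0;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Under this pattern x, function and a_b1 are accepted, while foo_, _foo and "function " (with a trailing space) are rejected.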
I'm trying to create a scanner with flex that acts somewhat like grep.
Basically, what I want to do is: given a word (regular text, not a regex), find any line in the input that contains a match for that text, then print the line that contains the word.
The problem I've been having is that I can't figure out how to best print the line. I can print everything after the searched word, but I don't know how to properly store the contents of the whole line.
I tried using yyseek(), but when I compile, I get back the message that yyseek is an undefined symbol.
Using yymore() to store text works well for anything after the matched word in the line.
Here is the code that I have so far:
%option yylineno
%option noyywrap
%{
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *search_str = NULL;
char *curr_line = NULL;
%}
%x found
letter [a-zA-Z]
word {letter}+
line (.*)\n
%%
<INITIAL,found>{word} {
/* If a word matches the string that we are looking for, use the 'found'
* condition, which will cause the line to be dumped at the end.
*/
yymore();
if (strcmp(search_str, yytext) == 0) {
BEGIN(found);
}
}
<found>{line} {
yymore();
ECHO;
BEGIN(INITIAL);
}
. { }
\n {}
%%
int main(int argc, char *argv[])
{
if (argc > 1) {
size_t str_len = strlen(argv[1]); /* strlen, not sizeof: sizeof(argv[1]) is the size of a pointer */
search_str = malloc(str_len + 1);
strcpy(search_str, argv[1]);
yylex();
free(search_str);
return 0;
}
printf("usage: ./a.out [search word]\n");
return 1;
}
This is really not a good use case for flex. And it's not totally clear to me that it will do what you want, either. (I don't actually know what you want, so I could be wrong about that.) But note the following:
Target line           grep night   grep -w night   Your code
-------------------   ----------   -------------   ---------
a night to remember   Yes          Yes             Yes
a knight to forget    Yes          No              No
night23               Yes          No              Yes
Anyway, your instinct about using yymore was correct. You just have to start earlier, so that the entire line is retained in the token. The small complication is that when you need to check a word, you can't check from the beginning of yytext; it contains the whole line up to this point. You have to check the last strlen(search_str) characters. The following code makes sure it only does that computation once, since it requires a complete scan of search_str. Also note that it makes sure it does not overrun the beginning of yytext.
In effect, the following code divides the text into three kinds of tokens: words, non-words, and newlines. Only newline fails to call yymore(), so when the newline rule triggers, yytext contains the entire line. As in your code, once a match is found in a line, the rest of the line is simply added to the match.
(Note: I rewrote this without macros, which are generally overused. I don't see any reason to think that {letter} is more readable than [[:alpha:]], and the latter has the advantage of being clear to anyone who knows flex, without having to search for your particular definition.)
%x FOUND
%%
  /* Indented lines before the first rule are put at the top of yylex */
  int match_length = strlen(search_str);
[^[:alpha:]\n]+   { yymore(); }
[[:alpha:]]+      { yymore();
                    if (yyleng >= match_length
                        && 0 == strcmp(yytext + yyleng - match_length,
                                       search_str))
                      BEGIN(FOUND);
                  }
<INITIAL,FOUND>\n BEGIN(INITIAL);
<FOUND>.*         printf("%s\n", yytext);
The oddity at the end is to deal with inputs which are not correctly terminated with a newline character. The last pattern will print the line with a newline character (even if there isn't one), and the newline character (if there is one) will restart the start condition.
For a slight speed gain, you could remember the previous value of yyleng every time you call yymore(), so that yyleng - prev_yyleng will be the length of "this part" of the token. (The flex scanner knows this value but doesn't provide any interface for you to find it out, which is slightly annoying. But it's not a big deal.) Then instead of checking whether the entire line up to this point is long enough to make a compare possible, you could check whether the last word matched was exactly the right length, which will be true less often, thereby requiring fewer calls to strcmp.
All in all, though, this is not a good strategy. You'll probably find that strstr is faster than flex, and it's only slightly optimised compared to what is possible for a repeated search for the same target. Better would be to implement or find one of the standard search algorithms:
Boyer-Moore
Knuth-Morris-Pratt
Rabin-Karp
etc.
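For comparison, the strstr approach needs very little code. This is a minimal sketch, assuming that lines fit in a fixed buffer (longer lines would be processed in pieces, which a real tool would have to handle); the function name print_matching_lines is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies every line of `in` that contains `needle`
 * to `out` and returns the number of lines printed.
 * Assumption: each line fits in the fixed buffer below. */
int print_matching_lines(FILE *in, const char *needle, FILE *out)
{
    assert(in != NULL && needle != NULL && out != NULL);
    char line[4096];
    int printed = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (strstr(line, needle) != NULL) {
            fputs(line, out);
            /* Add a newline if the last line of the input lacked one. */
            if (strchr(line, '\n') == NULL)
                fputc('\n', out);
            printed++;
        }
    }
    return printed;
}
```

Like plain grep night in the table above, this matches substrings, so it also prints the knight and night23 lines. Calling print_matching_lines(stdin, argv[1], stdout) from main would give the command-line behaviour.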
I used the following to get it to work partially:
%{
#define OR 2
#define AND 3
.........
.........
%}
delim [ \t]
ws {delim}*
letter [A-Za-z]
digit [0-9]
comments [/]+({letter}|{digit}|{delim})*
%%
{comments} {return(COMMENT);}
......................
......................
%%
int main()
{
int tkn = 0;
while (tkn = yylex())
{
switch (tkn)
{
case COMMENT:
printf("GOT COMMENT");
}
}
}
This is working fine. The problem is that the regex obviously does not recognize special characters because [/]+({letter}|{digit}|{delim})* does not consider special characters. How to change the regex to accommodate more characters till end of line?
Couldn't you just use
[/]+.*
It will match some number of / and then anything till the end of line. Of course this will not cover comments like /* COMMENT */.
Maybe it's late, but I find it more appropriate to use \/[\/]+.* This will cover a double slash (or more) and then the rest of the text.
Following is the explanation from regex101.com
\/ matches the character / literally (case sensitive)
[\/]+ matches a single character present in the list
  + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
  \/ matches the character / literally (case sensitive)
.* matches any character (except for line terminators)
A single-line comment expression starting with '//' can be captured by the following regular expression.
\/\/[^\r\n]*
\/\/ matches the double-slash
[^\r\n]* matches as many characters that are not carriage-return or line-feed as it can find.
However, the C language allows a single line comment to be extended to the next line when the last character in the line is a backslash (\). Therefore, you may want to use the following.
\/\/[^\r\n]*(?:(?<=\\)\r?\n[^\r\n]*)*
\/\/ matches the double-slash
[^\r\n]* matches as many characters that are not carriage-return (\r) or line-feed (\n) as it can find
(?: start a non-capturing group
(?<=\\) assert that a backslash (\) immediately precedes the current position
\r?\n match the end of a line
[^\r\n]* matches as many characters that are not carriage-return (\r) or line-feed (\n) as it can find
)* complete the non-capturing group and let it repeat 0 or more times
Note that this method has problems. Depending on what you are doing, you may want to find and use a lexical scanner. A lexical scanner can avoid the following problems.
Scanning the text
/* Comment appears to have // a comment inside it */
will match
// a comment inside it */
Scanning the text
char* a = "string appears to have // a comment";
will match
// a comment";
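A scanner that tracks whether it is inside a block comment or a string literal avoids both false matches. The following is a simplified sketch in C, not a complete C lexer: it ignores character constants and backslash line continuations, and the function name find_line_comment is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns a pointer to the first "//" in `src` that
 * starts a real line comment, or NULL if there is none. It skips over
 * block comments and string literals, but is otherwise a simplification:
 * character constants and line continuations are not handled. */
const char *find_line_comment(const char *src)
{
    assert(src != NULL);
    const char *p = src;

    while (*p != '\0') {
        if (p[0] == '/' && p[1] == '/') {
            return p;                         /* genuine line comment */
        } else if (p[0] == '/' && p[1] == '*') {
            p += 2;                           /* now inside a block comment */
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (*p != '\0')
                p += 2;                       /* step over the closing star-slash */
        } else if (*p == '"') {
            p++;                              /* now inside a string literal */
            while (*p != '\0' && *p != '"') {
                if (*p == '\\' && p[1] != '\0')
                    p++;                      /* skip the escaped character */
                p++;
            }
            if (*p == '"')
                p++;                          /* step over the closing quote */
        } else {
            p++;
        }
    }
    return NULL;
}
```

For both problem inputs shown above the function returns NULL, whereas for a line such as int x; // real it returns a pointer to the two slashes.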
Why can't you just write
"//"|"/*" {return(COMMENT);}
?
Following regular expression works just fine for me.
\/\/.*
I have run into some code and was wondering what the original developer was up to. Below is a simplified program using this pattern:
#include <stdio.h>
int main() {
char title[80] = "mytitle";
char title2[80] = "mayataiatale";
char mystring[80];
/* hugh ? */
sscanf(title,"%[^a]",mystring);
printf("%s\n",mystring); /* Output is "mytitle" */
/* hugh ? */
sscanf(title2,"%[^a]",mystring); /* Output is "m" */
printf("%s\n",mystring);
return 0;
}
The man page for scanf has relevant information, but I'm having trouble reading it. What is the purpose of using this sort of notation? What is it trying to accomplish?
The main reason for the character classes is that the %s notation stops at the first white space character, even if you specify field lengths, and you quite often don't want it to. In that case, the character class notation can be extremely helpful.
Consider this code to read a line of up to 10 characters, discarding any excess, but keeping spaces:
#include <ctype.h>
#include <stdio.h>
int main(void)
{
char buffer[10+1] = "";
int rc;
while ((rc = scanf("%10[^\n]%*[^\n]", buffer)) >= 0)
{
int c = getchar();
printf("rc = %d\n", rc);
if (rc >= 0)
printf("buffer = <<%s>>\n", buffer);
buffer[0] = '\0';
}
printf("rc = %d\n", rc);
return(0);
}
This was actually example code for a discussion on comp.lang.c.moderated (circa June 2004) related to getline() variants.
At least some confusion reigns. The first format specifier, %10[^\n], reads up to 10 non-newline characters and they are assigned to buffer, along with a trailing null. The second format specifier, %*[^\n] contains the assignment suppression character (*) and reads zero or more remaining non-newline characters from the input. When the scanf() function completes, the input is pointing at the next newline character. The body of the loop reads that character with getchar(), so that when the loop restarts, the input is looking at the start of the next line. The process then repeats. If the line is shorter than 10 characters, then those characters are copied to buffer, and the 'zero or more non-newlines' format processes zero non-newlines.
The constructs like %[a] and %[^a] exist so that scanf() can be used as a kind of lexical analyzer. These are sort of like %s, but instead of collecting a span of as many "stringy" characters as possible, they collect just a span of characters as described by the character class. There might be cases where writing %[a-zA-Z0-9] might make sense, but I'm not sure I see a compelling use case for complementary classes with scanf().
IMHO, scanf() is simply not the right tool for this job. Every time I've set out to use one of its more powerful features, I've ended up eventually ripping it out and implementing the capability in a different way. In some cases that meant using lex to write a real lexical analyzer, but usually doing line at a time I/O and breaking it coarsely into tokens with strtok() before doing value conversion was sufficient.
Edit: I ended up ripping out scanf() typically because when faced with users insisting on providing incorrect input, it just isn't good at helping the program give good feedback about the problem, and having an assembler print "Error, terminated." as its sole helpful error message was not going over well with my user. (Me, in that case.)
It's like character sets from regular expressions; [0-9] matches a string of digits, [^aeiou] matches anything that isn't a lowercase vowel, etc.
There are all sorts of uses, such as pulling out numbers, identifiers, chunks of whitespace, etc.
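As a small illustration of that kind of use, the sketch below splits input of the form name=digits with two scansets. The function name split_name_value and the 32-byte buffer size are assumptions for illustration; the digits are listed explicitly because the meaning of a - range inside a scanset is implementation-defined.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: splits input such as "width=42" into a name and a
 * run of digits. Returns 1 on success, 0 otherwise.
 * Both output buffers must hold at least 32 bytes (31 characters + NUL). */
int split_name_value(const char *input, char *name, char *digits)
{
    assert(input != NULL && name != NULL && digits != NULL);
    /* %31[^=]         : up to 31 characters that are not '='
     * =               : a literal '='
     * %31[0123456789] : up to 31 decimal digits */
    return sscanf(input, "%31[^=]=%31[0123456789]", name, digits) == 2;
}
```

Given width=42px, this stores width and 42 and leaves px unread. Given =42 or width=px it returns 0, because a scanset has to match at least one character.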
You can read about it in the ISO/IEC 9899 standard available online.
Here is a paragraph I quote from the document about [ (Page 286):
Matches a nonempty sequence of characters from a set of expected
characters.
The conversion specifier includes all subsequent characters in the
format string, up to and including the matching right bracket (]). The
characters between the brackets (the scanlist) compose the scanset,
unless the character after the left bracket is a circumflex (^), in
which case the scanset contains all characters that do not appear in
the scanlist between the circumflex and the right bracket. If the
conversion specifier begins with [] or [^], the right bracket
character is in the scanlist and the next following right bracket
character is the matching right bracket that ends the specification;
otherwise the first following right bracket character is the one that
ends the specification. If a - character is in the scanlist and is not
the first, nor the second where the first character is a ^, nor the
last character, the behavior is implementation-defined.
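The rule about a right bracket placed first in the scanlist is what makes a scanset such as [^]] possible. Here is a minimal sketch, assuming input in the style of an INI section header; the function name parse_section and the 32-byte buffer are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: extracts the text between '[' and ']' from a line
 * such as "[core]". Returns 1 if a non-empty name was read, 0 otherwise.
 * `name` must hold at least 32 bytes (31 characters + NUL).
 * Note: sscanf cannot report whether the closing ']' was actually present,
 * so "[core" is accepted as well. */
int parse_section(const char *line, char *name)
{
    assert(line != NULL && name != NULL);
    /* [         : a literal '['
     * %31[^]]   : up to 31 characters that are not ']'
     *             (the ']' right after '^' belongs to the scanlist)
     * ]         : a literal ']' */
    return sscanf(line, "[%31[^]]]", name) == 1;
}
```

In the format string, the first ] after the circumflex is part of the scanlist, the second ] ends the conversion specifier, and the third is an ordinary character to match in the input.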