Lex: How do I prevent it from matching against substrings?

For example, I'm supposed to convert "int" to "INT". But if there's the word "integer", I don't think it's supposed to turn into "INTeger".
If I define "int" printf("INT"); the substrings are matched though. Is there a way to prevent this from happening?

I believe the following captures what you want.
%{
#include <stdio.h>
%}
ws [\t\n ]
%%
{ws}int{ws} { printf ("%cINT%c", *yytext, yytext[4]); }
. { printf ("%c", *yytext); }
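Note that this rule only fires when int has a whitespace character on both sides, so it misses int at the very start or end of the input, or next to punctuation. To expand this beyond word boundaries ({ws}, in this case) you will need to either add characters to ws or add more specific checks.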

well, here's how i did it:
(("int"([a-z]|[A-Z]|[0-9])+)|(([a-z]|[A-Z]|[0-9])+"int")) ECHO;
"int" printf("INT");
better suggestions welcome.

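Lex will choose the rule with the longest possible match for the current input. To avoid substring matches you need to include an additional rule that can match more than int does. The easiest way to do this is to add a simple rule that picks up any run of letters, i.e. [a-zA-Z]+. For integer that rule matches seven characters against three for int, so it wins; for a bare int both rules match three characters and the one listed first (int) wins. Note that the catch-all below has an empty action, so other words are discarded; use ECHO as its action if you want them copied to the output. The entire lex program would look like this: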
%%
[\t ]+ /* skip whitespace */
int { printf("INT"); }
[a-zA-Z]+ /* catch-all to avoid substring matches */
%%
int main(int argc, char *argv[])
{
yylex();
}
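For readers who want to see the longest-match idea outside of lex, here is a minimal sketch in plain C. It is not what lex generates; the function name upcase_int_words is made up for illustration, and unlike the program above it copies every other word through unchanged instead of discarding it. It assumes the output buffer can hold strlen(in) + 1 bytes.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy `in` to `out`, replacing the whole word "int" with "INT".
 * A "word" is a maximal run of letters, mirroring the [a-zA-Z]+ rule.
 * Hypothetical helper; `out` must hold at least strlen(in) + 1 bytes. */
void upcase_int_words(const char *in, char *out)
{
    while (*in != '\0') {
        if (isalpha((unsigned char)*in)) {
            /* Longest match: consume the whole run of letters first. */
            size_t len = 0;
            while (isalpha((unsigned char)in[len]))
                len++;
            /* Only a run that is exactly "int" gets replaced. */
            if (len == 3 && strncmp(in, "int", 3) == 0)
                memcpy(out, "INT", 3);
            else
                memcpy(out, in, len);
            in += len;
            out += len;
        } else {
            *out++ = *in++;   /* non-letters pass through unchanged */
        }
    }
    *out = '\0';
}
```

Because the whole run of letters is consumed before the comparison, integer and print are never split, which is the same effect the catch-all rule has in the lex version.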

Related

Flex matching wrong rule

I am writing a compiler for the mini-L language. I am pretty much done with the scanning part. This is where you basically find tokens in the input. My problem is, when I type in "function" it matches it as an identifier. I can't really tell what's wrong.
ident [a-zA-Z][a-zA-Z0-9_]+[^_]
%{
#include <stdio.h>
#include <stdlib.h>
int col = 0;
int row = 0;
%}
%%
"function" {printf("FUNCTION\n"); col += yyleng;}
{ident} {printf("IDENT %s\n", yytext); col += yyleng;}
. {printf("Error at line %d, column %d: unrecognized symbol \"%s\"\n", row, col, yytext);}
%%
int main(int argc, char** argv){
if(argc > 1){
FILE *fp = fopen(argv[1], "r");
if(fp){
yyin = fp;
}
}
printf("Give me the input:\n");
yylex();
}
I tried rearranging the rules but that didn't work and that doesn't seem like the problem anyway. I think it's a problem with my regex. Any help is much appreciated.
So here's your ident pattern: [a-zA-Z][a-zA-Z0-9_]+[^_]. Let's break that down:
[a-zA-Z] A single letter.
[a-zA-Z0-9_]+ One or more (+) of a letter, a digit, or a _
[^_] Anything other than _. Anything. Such as a space.
One simple observation is that the pattern can't match fewer than three characters. On the other hand, the last one could be almost anything. It could be a letter. But it could also be a comma, or a space, or even a newline. Just not an underline.
So that won't match x, which is usually considered a valid identifier. But it will match function . (That is, the word function followed by a space.) Since that is longer than function, the identifier pattern will win unless you happened to write function_.
So that's probably not what you wanted.
Note that if you want to avoid identifiers ending with an underscore, you have to deal with two issues.
The obvious pattern [[:alpha:]][[:alnum:]_]*[[:alnum:]] can't match single letters. So you need [[:alpha:]]([[:alnum:]_]*[[:alnum:]])?.
If your identifier pattern doesn't match a trailing underscore, the underscore will be left over for the next token. Maybe you're OK with that. But it's probably better to match the underscore and issue an error message. Or reconsider the restriction.
Note: The pattern syntax is documented in the Flex manual, including the built in character sets I used above.
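If it helps to sanity-check the corrected pattern outside of flex, the following C sketch tests whether a complete string matches [[:alpha:]]([[:alnum:]_]*[[:alnum:]])?. The name is_identifier is hypothetical, and the sketch assumes the default C locale, where isalpha and isalnum agree with the POSIX character classes.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `s` as a whole matches
 * [[:alpha:]]([[:alnum:]_]*[[:alnum:]])?   (hypothetical helper name). */
bool is_identifier(const char *s)
{
    size_t len = strlen(s);

    /* Must start with a letter. */
    if (len == 0 || !isalpha((unsigned char)s[0]))
        return false;

    /* Every later character must be a letter, a digit, or '_'. */
    for (size_t i = 1; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isalnum(c) && c != '_')
            return false;
    }

    /* The last character may not be '_'. For a single letter this
     * re-checks s[0], which is already known to be a letter. */
    return s[len - 1] != '_';
}
```

With this check x and function are accepted, while function_ and "function " (with a trailing space) are rejected, which is the behaviour the original pattern failed to give.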

RegExp doesn't match when in the middle of other words

Hello everyone, I'm testing a regexp with lex to find the product id in HTML from Amazon. I don't know why it works when the file it reads contains:
<span class="a-icon-alt">4,7 de un máximo de 5 estrellas</span>
but not when its content is something like:
aaaaaa< span class="a-icon-alt">4,7 de un máximo de 5 estrellas< /span>bbbbbb
Here is the code with the regex.
%{
#include <stdio.h>
int nc, np, nl;
void escribir_datos (int dato1, int dato2, int dato3);
%}
productos (<li+[ ]+id=\"result_[0-9]*)+
num_productos [0-9]*
nombre_producto <h2+[ ]+data-attribute=\"([^\"]*)
nombre_final_producto \"[^\"]*\"
precio_producto <span+[ ]+class=\"a-size-base+[ ]+a-color-price+[ ]+s-price+[ ]+a-text-bold\">(.*?)<\/span>
precio_final_producto [0-9]+([,][0-9]+)?
valoraciones <span+[ ]+class=\"a-icon-alt\">(.*?)<\/span>
%%
{valoraciones} { nl++; }
[^ \t\n]+ { np++; nc += yyleng; }
[ \t]+ { nc += yyleng; }
\n { nc++; }
%%
/*----- Sección de Procedimientos --------*/
int main (int argc, char *argv[]) {
if (argc == 2) {
yyin = fopen (argv[1], "rt");
if (yyin == NULL) {
printf ("El fichero %s no se puede abrir\n", argv[1]);
exit (-1);
}
}
else yyin = stdin;
nc = np = nl = 0;
yylex ();
escribir_datos(nc,np,nl);
return 0;
}
void escribir_datos (int dato1, int dato2, int dato3) {
printf("Num_char=%d\tNum_words=%d\tNum_lines=%d\n", dato1,dato2,dato3);
}
Thanks I hope you can help me.
The intended use case for lexical analyzers generated by (f)lex is to split the input into a series of primitive "tokens", each one with some syntactic significance. They do not search for regular expressions, because they assume that every part of the input will match some pattern in your lexical description.
So each time the lexical analyzer examines the input, it will select the pattern which gives the best match. A match is a sequence of characters starting at the current input point which matches a pattern, and the best match is the one which matches the longest sequence. (If there are two or more patterns which match the same longest sequence, the first one in the list of patterns is considered the best one.)
With that in mind, consider what happens with the input
aaaaaa< span class="a-icon-alt">4,7 de un máximo de 5 estrellas< /span>bbbbbb
Your file has four patterns:
{valoraciones}
[^ \t\n]+
[ \t]+
\n
The input doesn't match {valoraciones} because that pattern only matches a string starting with <. It doesn't match [ \t]+ either, because it doesn't start with either space or tab, and similarly it doesn't match a newline. But it does match [^ \t\n]+. Since (f)lex always selects the longest match, and [^ \t\n]+ matches any sequence of characters other than whitespace (space, tab, newline), the first match will be aaaaaa<.
After that's matched, the remaining input starts with a space followed by span..., which means that only the third pattern ([ \t]+) matches. It could match any number of space characters, but there is only one and that's what it will match.
So then the input is span class="a-icon-alt">4,7.... {valoraciones} still won't match -- the input doesn't start with < -- so we're back to a match of the second pattern.
And so on.
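The selection rule walked through above (longest match wins, earlier rule wins a tie) can be sketched in plain C. This is only an illustration, not how flex is implemented: it models just the three simple patterns ([^ \t\n]+, [ \t]+ and \n) and leaves out {valoraciones}. All function names are made up for the example.

```c
#include <stddef.h>
#include <string.h>

/* Length matched at the start of `s` by each of the three simple patterns. */
static size_t match_nonspace(const char *s) { return strcspn(s, " \t\n"); } /* [^ \t\n]+ */
static size_t match_blanks(const char *s)   { return strspn(s, " \t"); }    /* [ \t]+    */
static size_t match_newline(const char *s)  { return *s == '\n' ? 1 : 0; }  /* \n        */

/* Returns the index (0, 1 or 2) of the winning rule and stores the match
 * length in *len. Returns -1 with *len == 0 when nothing matches, which
 * for these rules only happens at the end of the input. */
int next_token(const char *s, size_t *len)
{
    size_t (*rules[])(const char *) = { match_nonspace, match_blanks, match_newline };
    int best = -1;

    *len = 0;
    for (int i = 0; i < 3; i++) {
        size_t n = rules[i](s);
        if (n > *len) {       /* strictly longer, so earlier rules win ties */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

Calling next_token on aaaaaa< span class... returns rule 0 with length 7 (aaaaaa<), then rule 1 with length 1 for the space, then rule 0 again for span, which is the sequence described above.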
I think you need to be a lot clearer (with yourself) about what the tokens you are trying to match are. If you're looking for specific HTML tags, then you probably want to recognise any sequence which doesn't contain a < as a token, rather than looking for input terminated with a white space character. But then you also need to accept any tag as a token, as well as the specific tags you are trying to catch.
Of course, it is also possible that (f)lex is not the ideal tool for your use case. You don't really say what your use case is, so I'm not going to make any assumption one way or another.
In any event, you should take a few minutes to read the documentation on flex patterns. Any regex syntax not described on that page will not work with (f)lex, regardless of whether it works with regex libraries or online regex checkers. In particular, .*? does not give you a non-greedy match, as it would in many regex libraries. (F)lex doesn't implement non-greedy matches (because it doesn't do any backtracking), and it considers .*? to be an optional (?) appearance of any number including zero repetitions (*) of any character other than a newline (.). Making the repetition optional has no effect, since the repetition already matches zero repetitions. So the pattern <.*?> would match from the < up to the last > on the same line. That is probably not what you want.
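To make the difference concrete, here is a small C sketch that computes how much of a line a greedy <.*> would take compared with the usual (f)lex idiom <[^>]*>, which stops at the first >. The two function names are hypothetical and the code only imitates what the patterns match; it is not taken from flex.

```c
#include <stddef.h>
#include <string.h>

/* Length matched by <.*> at the start of `s`. Since '.' never matches a
 * newline, the match runs to the LAST '>' on the current line.
 * Returns 0 when there is no match. */
size_t match_greedy_tag(const char *s)
{
    if (*s != '<')
        return 0;
    size_t line_len = strcspn(s, "\n");
    for (size_t i = line_len; i > 1; i--)
        if (s[i - 1] == '>')
            return i;
    return 0;
}

/* Length matched by <[^>]*> at the start of `s`: stops at the FIRST '>'.
 * Returns 0 when there is no match. */
size_t match_tag(const char *s)
{
    if (*s != '<')
        return 0;
    const char *end = strchr(s + 1, '>');
    return end ? (size_t)(end - s) + 1 : 0;
}
```

For the input <a>x</a> y the greedy version covers <a>x</a> (8 characters), while the negated character class covers only <a> (3 characters).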
You also probably don't want <span+, which matches < followed by the letters s, p and a, and then any number of n (as long as there is at least one). In other words, it will match <span, <spann, <spannnnnnnnnnn, and many more.
Thanks for the answer. The problem was that the three rules after {valoraciones} conflicted with the first one, so I couldn't find a word embedded in other text. For example, if I want to find dog in a text containing aaaaadogaaaa, it doesn't match, for the reason I gave at the beginning.

Print line of matched word in flex

I'm trying to create a scanner with flex that acts somewhat like grep.
Basically, what I want to do is: given a word (regular text, not a regex), find any line in the input that contains a match for that text, then print the line that contains the word.
The problem I've been having is that I can't figure out how to best print the line. I can print everything after the searched word, but I don't know how to properly store the contents of the whole line.
I tried using yyseek(), but when I compile, I get back the message that yyseek is an undefined symbol.
Using yymore() to store text works well for anything after the matched word in the line.
Here is the code that I have so far:
%option yylineno
%option noyywrap
%{
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *search_str = NULL;
char *curr_line = NULL;
%}
%x found
letter [a-zA-Z]
word {letter}+
line (.*)\n
%%
<INITIAL,found>{word} {
/* If a word matches the string that we are looking for, use the 'found'
* condition, which will cause the line to be dumped at the end.
*/
yymore();
if (strcmp(search_str, yytext) == 0) {
BEGIN(found);
}
}
<found>{line} {
yymore();
ECHO;
BEGIN(INITIAL);
}
. { }
\n {}
%%
int main(int argc, char *argv[])
{
if (argc > 1) {
size_t str_len = strlen(argv[1]); /* sizeof(argv[1]) would be the size of a pointer, not the string length */
search_str = malloc(str_len + 1);
strcpy(search_str, argv[1]);
yylex();
free(search_str);
return 0;
}
printf("usage: ./a.out [search word]\n");
return 1;
}
This is really not a good use case for flex. And it's not totally clear to me that it will do what you want, either. (I don't actually know what you want, so I could be wrong about that.) But note the following:
Target line           grep night   grep -w night   Your code
-------------------   ----------   -------------   ---------
a night to remember   Yes          Yes             Yes
a knight to forget    Yes          No              No
night23               Yes          No              Yes
Anyway, your instinct about using yymore was correct. You just have to start earlier, so that the entire line is retained in the token. The small complication is that when you need to check a word, you can't check from the beginning of yytext; it contains the whole line up to this point. You have to check the last strlen(search_str) characters. The following code computes strlen(search_str) only once, since that requires a complete scan of search_str. Also note that it makes sure it does not overrun the beginning of yytext.
In effect, the following code divides the text into three kinds of tokens: words, non-words, and newlines. Only newline fails to call yymore(), so when the newline rule triggers, yytext contains the entire line. As in your code, once a match is found in a line, the rest of the line is simply added to the match.
(Note: I rewrote this without macros, which are generally overused. I don't see any reason to think that {letter} is more readable than [[:alpha:]], and the latter has the advantage of being clear to anyone who knows flex, without having to search for your particular definition.)
%x FOUND
%%
  /* Indented lines before the first rule are put at the top of yylex */
  int match_length = strlen(search_str);
[^[:alpha:]\n]+   { yymore(); }
[[:alpha:]]+      { yymore();
                    if (yyleng >= match_length
                        && 0 == strcmp(yytext + yyleng - match_length,
                                       search_str))
                      BEGIN(FOUND);
                  }
<INITIAL,FOUND>\n BEGIN(INITIAL);
<FOUND>.*         printf("%s\n", yytext);
The oddity at the end is to deal with inputs which are not correctly terminated with a newline character. The last pattern will print the line with a newline character (even if there isn't one), and the newline character (if there is one) will restart the start condition.
For a slight speed gain, you could remember the previous value of yyleng every time you call yymore(), so that yyleng - prev_yyleng will be the length of "this part" of the token. (The flex scanner knows this value but doesn't provide any interface for you to find it out, which is slightly annoying. But it's not a big deal.) Then instead of checking whether the entire line up to this point is long enough to make a compare possible, you could check whether the last word matched was exactly the right length, which will be true less often, thereby requiring fewer calls to strcmp.
All in all, though, this is not a good strategy. You'll probably find that strstr is faster than flex, and it's only slightly optimised compared to what is possible for a repeated search for the same target. Better would be to implement or find one of the standard search algorithms:
Boyer-Moore
Knuth-Morris-Pratt
Rabin-Karp
etc.
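As a point of comparison, a strstr-based version is only a few lines of C. This is a rough sketch rather than a tuned implementation: the function name count_matching_lines is made up, it works on an in-memory buffer instead of a stream, it silently truncates lines longer than 1023 bytes, and it behaves like plain grep (substring match), not grep -w.

```c
#include <stdio.h>
#include <string.h>

/* Print every line of `text` that contains `needle` and return how many
 * lines matched. Hypothetical helper; lines longer than the local buffer
 * are truncated, which a real tool would have to handle properly. */
int count_matching_lines(const char *text, const char *needle)
{
    char line[1024];
    int count = 0;

    while (*text != '\0') {
        size_t len = strcspn(text, "\n");   /* length of the current line */
        size_t copy = len < sizeof line - 1 ? len : sizeof line - 1;

        /* Copy the line so strstr cannot run past its end. */
        memcpy(line, text, copy);
        line[copy] = '\0';

        if (strstr(line, needle) != NULL) {
            puts(line);
            count++;
        }

        text += len;
        if (*text == '\n')
            text++;                         /* skip the line terminator */
    }
    return count;
}
```

On the three target lines from the table above, searching for night prints all three, matching the grep night column.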

Interpreting '\n' in printf("%s", string)

This piece of code is acting a bit strange to my taste. Please, anyone care to explain why? And how to force '\n' to be interpreted as a special char?
beco#raposa:~/tmp/user/foo/bar$ ./interpretastring.x "2nd\nstr"
1st
str
2nd\nstr
beco#raposa:~/tmp/user/foo/bar$ cat interpretastring.c
#include <stdio.h>
int main(int argc, char **argv)
{
char *s="1st\nstr";
printf("%s\n", s);
printf("%s\n", argv[1]);
return 0;
}
Bottom line, the intention is for the 2nd string to be printed on two lines, just like the first. This program is a simplification. The real program has problems reading from a file using fgets (not an OS argument passed via argv like here), but I think solving it here will also solve it there.
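The shell does not convert the escape sequence, so the program receives the backslash and the n as two separate characters. (In char *s="1st\nstr" it is the C compiler that turns \n into a newline; printf never sees a backslash.) Either use a shell syntax that produces a real newline, such as $'2nd\nstr' in bash, or convert the sequence inside the program.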
The function below only takes care of \n; no other escape sequence gets special treatment.
It does the job with little complexity. It does not turn the two characters into one single newline character. It just changes <\><n> into <space><newline>. That's fine. It would be better if the C standard library had a function to interpret escape sequences in a string (as POSIX provides for regular expressions, for instance).
#include <string.h>

/* change the two characters '\' 'n' into ' ' '\n' (in place) */
void changebarn(char *nt)
{
    while (nt != NULL)
        if ((nt = strchr(nt, '\\')))
            if (*++nt == 'n')
            {
                *nt = '\n';
                *(nt - 1) = ' ';
            }
}
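If the extra space is not wanted, a variant that really collapses the two characters into one newline is not much longer. This is a sketch under the same limitation as changebarn (only \n is handled), and the name unescape_newlines is hypothetical. It works in place because the result is never longer than the input.

```c
/* Replace each two-character sequence backslash + 'n' with a single
 * newline, in place. Other backslash sequences are left untouched. */
void unescape_newlines(char *s)
{
    char *dst = s;

    while (*s != '\0') {
        if (s[0] == '\\' && s[1] == 'n') {
            *dst++ = '\n';    /* two input characters become one */
            s += 2;
        } else {
            *dst++ = *s++;    /* everything else is copied as-is */
        }
    }
    *dst = '\0';
}
```

Calling it on a writable copy of argv[1], or on the buffer filled by fgets, before the printf would make the second string print on two lines like the first.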

Lex: Longest Word Consisting Only of Letters from Other Word

I am trying to write lex code which will take a string as input, and parse through a long dictionary file to find the longest word in that dictionary which is made up of only the letters in that string. Each letter in the string can be used zero or more times, meaning the word "in" would be valid for "input". Here is what I have so far:
%{
#include <stdio.h>
%}
%option noyywrap
%%
[input]+ {
printf("This is the longest I think: %s\n", yytext);
}
.|\n {}
%%
int main(void)
{
yylex();
return 0;
}
However, this really does not do what I expect it to do. This code goes through and prints the matching portions of every word in the dictionary, so I get output like "i", "iu", "inu", etc., and these obviously aren't valid words. Anyone know how to fix this?
You could use the beginning-of-line and end-of-line markers as part of your regular expression to require that the entire line is matched, not just a part of it. Try changing your regex from [input]+ to
^[input]+$
You will then need some separate logic to track the longest string you've found so far, but judging from the code you have above I think this more directly addresses your question at hand.
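The "separate logic" for tracking the longest word could look roughly like the C sketch below. It sidesteps lex entirely and uses strspn to test whether a word consists only of the allowed letters; in a lex program the same comparison would go in the action of the ^[input]+$ rule. The function name longest_word_from and the array-based interface are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Returns the longest entry of `words` made only of characters found in
 * `letters`, or NULL if no entry qualifies. On a tie the earliest entry
 * wins. Hypothetical helper for illustration. */
const char *longest_word_from(const char *const words[], size_t n,
                              const char *letters)
{
    const char *best = NULL;
    size_t best_len = 0;

    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(words[i]);
        /* strspn gives the length of the leading run of allowed letters;
         * the word qualifies only if that run covers the whole word. */
        if (len > best_len && strspn(words[i], letters) == len) {
            best = words[i];
            best_len = len;
        }
    }
    return best;
}
```

Requiring the run of allowed letters to cover the whole word plays the same role as the ^ and $ anchors in the pattern: partial matches such as the inu inside a longer word are never reported.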
Hope this helps!

Resources