Flex matching wrong rule - c

I am writing a compiler for the mini-L language. I am pretty much done with the scanning part. This is where you basically find tokens in the input. My problem is, when I type in "function" it matches it as an identifier. I can't really tell what's wrong.
ident [a-zA-Z][a-zA-Z0-9_]+[^_]
%{
#include <stdio.h>
#include <stdlib.h>
int col = 0;
int row = 0;
%}
%%
"function" {printf("FUNCTION\n"); col += yyleng;}
{ident} {printf("IDENT %s\n", yytext); col += yyleng;}
. {printf("Error at line %d, column %d: unrecognized symbol \"%s\"\n", row, col, yytext);}
%%
int main(int argc, char** argv){
if(argc > 1){
FILE *fp = fopen(argv[1], "r");
if(fp){
yyin = fp;
}
}
printf("Give me the input:\n");
yylex();
}
I tried rearranging the rules but that didn't work and that doesn't seem like the problem anyway. I think it's a problem with my regex. Any help is much appreciated.

So here's your ident pattern: [a-zA-Z][a-zA-Z0-9_]+[^_]. Let's break that down:
[a-zA-Z] A single letter.
[a-zA-Z0-9_]+ One or more (+) of a letter, a digit, or a _
[^_] Anything other than _. Anything. Such as a space.
One simple observation is that the pattern can't match fewer than three characters. On the other hand, the last one could be almost anything. It could be a letter. But it could also be a comma, or a space, or even a newline. Just not an underscore.
So that won't match x, which is usually considered a valid identifier. But it will match function . (That is, the word function followed by a space.) Since that is longer than function, the identifier pattern will win unless you happened to write function_.
So that's probably not what you wanted.
Note that if you want to avoid identifiers ending with an underscore, you have to deal with two issues.
The obvious pattern [[:alpha:]][[:alnum:]_]*[[:alnum:]] can't match single letters. So you need [[:alpha:]]([[:alnum:]_]*[[:alnum:]])?.
If your identifier pattern doesn't match a trailing underscore, the underscore will be left over for the next token. Maybe you're OK with that. But it's probably better to match the underscore and issue an error message. Or reconsider the restriction.
Note: The pattern syntax is documented in the Flex manual, including the built in character sets I used above.
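To make the corrected pattern concrete, here is a minimal C sketch that checks a whole string against [[:alpha:]]([[:alnum:]_]*[[:alnum:]])? by hand. The function name is_ident is made up for illustration and is not part of flex; it assumes the default "C" locale for the <ctype.h> classification.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the whole string s matches
 * [[:alpha:]]([[:alnum:]_]*[[:alnum:]])? and 0 otherwise. */
int is_ident(const char *s) {
    assert(s != NULL);
    size_t len = strlen(s);
    if (len == 0 || !isalpha((unsigned char)s[0]))
        return 0;                       /* must start with a letter */
    for (size_t i = 1; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isalnum(c) && c != '_')
            return 0;                   /* only letters, digits and '_' */
    }
    /* The last character must not be '_'. For len == 1 that character
     * is s[0], which is already known to be a letter. */
    return s[len - 1] != '_';
}
```

With this check, x and function are accepted, while a_ and _a are rejected, which is the behaviour the optional group in the pattern is meant to give.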

Related

RegExp doesn't match when in the middle of other words

Hello everyone, I'm testing a regexp with lex to find the product id in HTML from Amazon. I don't know why, when it reads a file which contains:
<span class="a-icon-alt">4,7 de un máximo de 5 estrellas</span>
it works, but if the content is something like:
aaaaaa< span class="a-icon-alt">4,7 de un máximo de 5 estrellas< /span>bbbbbb
it doesn't. Here is the code with the regex.
%{
#include <stdio.h>
int nc, np, nl;
void escribir_datos (int dato1, int dato2, int dato3);
%}
productos (<li+[ ]+id=\"result_[0-9]*)+
num_productos [0-9]*
nombre_producto <h2+[ ]+data-attribute=\"([^\"]*)
nombre_final_producto \"[^\"]*\"
precio_producto <span+[ ]+class=\"a-size-base+[ ]+a-color-price+[ ]+s-price+[ ]+a-text-bold\">(.*?)<\/span>
precio_final_producto [0-9]+([,][0-9]+)?
valoraciones <span+[ ]+class=\"a-icon-alt\">(.*?)<\/span>
%%
{valoraciones} { nl++; }
[^ \t\n]+ { np++; nc += yyleng; }
[ \t]+ { nc += yyleng; }
\n { nc++; }
%%
/*----- Sección de Procedimientos --------*/
int main (int argc, char *argv[]) {
if (argc == 2) {
yyin = fopen (argv[1], "rt");
if (yyin == NULL) {
printf ("El fichero %s no se puede abrir\n", argv[1]);
exit (-1);
}
}
else yyin = stdin;
nc = np = nl = 0;
yylex ();
escribir_datos(nc,np,nl);
return 0;
}
void escribir_datos (int dato1, int dato2, int dato3) {
printf("Num_char=%d\tNum_words=%d\tNum_lines=%d\n", dato1,dato2,dato3);
}
Thanks I hope you can help me.
The intended use case for lexical analyzers generated by (f)lex is to split the input into a series of primitive "tokens", each one with some syntactic significance. They do not search for regular expressions, because they assume that every part of the input will match some pattern in your lexical description.
So each time the lexical analyzer examines the input, it will select the pattern which gives the best match. A match is a sequence of characters starting at the current input point which matches a pattern, and the best match is the one which matches the longest sequence. (If there are two or more patterns which match the same longest sequence, the first one in the list of patterns is considered the best one.)
With that in mind, consider what happens with the input
aaaaaa< span class="a-icon-alt">4,7 de un máximo de 5 estrellas< /span>bbbbbb
Your file has four patterns:
{valoraciones}
[^ \t\n]+
[ \t]+
\n
The input doesn't match {valoraciones} because that pattern only matches a string starting with <. It doesn't match [ \t]+ either, because it doesn't start with either space or tab, and similarly it doesn't match a newline. But it does match [^ \t\n]+. Since (f)lex always selects the longest match, and [^ \t\n]+ matches any sequence of characters other than whitespace (space, tab, newline), the first match will be aaaaaa<.
After that's matched, the remaining input starts with a space followed by span..., which means that only the third pattern ([ \t]+) matches. It could match any number of space characters, but there is only one and that's what it will match.
So then the input is span class="a-icon-alt">4,7.... {valoraciones} still won't match -- the input doesn't start with < -- so we're back to a match of the second pattern.
And so on.
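The walkthrough above can be reproduced with a small C simulation of the longest-match rule. This is only a sketch: it models the three fallback rules and leaves {valoraciones} out, and the names run_length and next_match are invented for the example, not flex internals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix of s whose characters are all in `set`
 * (want_in_set == 1) or all outside it (want_in_set == 0). */
static size_t run_length(const char *s, const char *set, int want_in_set) {
    size_t n = 0;
    while (s[n] != '\0' && (strchr(set, s[n]) != NULL) == want_in_set)
        n++;
    return n;
}

/* Simulates one scanner step for the rules
 *   rule 1: [^ \t\n]+     rule 2: [ \t]+     rule 3: \n
 * Stores the winning rule number in *rule and returns the match length.
 * At end of input it returns 0 and sets *rule to 0. */
size_t next_match(const char *input, int *rule) {
    assert(input != NULL && rule != NULL);
    size_t len[4] = {0};
    len[1] = run_length(input, " \t\n", 0);
    len[2] = run_length(input, " \t", 1);
    len[3] = (input[0] == '\n') ? 1 : 0;

    size_t best = 0;
    *rule = 0;
    for (int r = 1; r <= 3; r++) {
        if (len[r] > best) {   /* strict '>' keeps the earlier rule on ties */
            best = len[r];
            *rule = r;
        }
    }
    return best;
}
```

Calling next_match repeatedly on the sample line, advancing by the returned length each time, gives rule 1 for aaaaaa<, then rule 2 for the single space, then rule 1 again for span, exactly as described.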
I think you need to be a lot clearer (with yourself) about what the tokens you are trying to match are. If you're looking for specific HTML tags, then you probably want to recognise any sequence which doesn't contain a < as a token, rather than looking for input terminated with a white space character. But then you also need to accept any tag as a token, as well as the specific tags you are trying to catch.
Of course, it is also possible that (f)lex is not the ideal tool for your use case. You don't really say what your use case is, so I'm not going to make any assumption one way or another.
In any event, you should take a few minutes to read the documentation on flex patterns. Any regex syntax not described on that page will not work with (f)lex, regardless of whether it works with regex libraries or online regex checkers. In particular, .*? does not give you a non-greedy match, as it would in many regex libraries. (F)lex doesn't implement non-greedy matches (because it doesn't do any backtracking), and it considers .*? to be an optional (?) appearance of any number including zero repetitions (*) of any character other than a newline (.). Making the repetition optional has no effect, since the repetition already matches zero repetitions. So the pattern <.*?> would match from the < up to the last > on the same line. That is probably not what you want.
You also probably don't want <span+, which matches < followed by the letters s, p, a and then any number of n (as long as there is at least one). In other words, it will match <span, <spann, <spannnnnnnnnnn, and many more.
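Since .*? is not available, the usual way to stop at the first > is a negated character class, for example <[^>]*> as a (f)lex pattern. The C sketch below shows the same idea outside of flex; tag_length is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the pattern <[^>]*> : returns the length
 * of the tag that starts at s (from '<' through the first '>'), or 0 if
 * s does not start with '<' or no '>' follows. */
size_t tag_length(const char *s) {
    assert(s != NULL);
    if (s[0] != '<')
        return 0;
    const char *end = strchr(s, '>');   /* first '>', not the last one */
    return end ? (size_t)(end - s) + 1 : 0;
}
```

On a line holding both an opening and a closing tag, this stops at the end of the opening tag instead of running on to the final > of the line.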
Thanks for the answer. The problem was that the three rules after {valoraciones} conflicted with the first one, so I couldn't find a word in the middle of other text. For example, I want to find dog in a text like aaaaadogaaaa, and that doesn't match dog because of what I said at the beginning.

Print line of matched word in flex

I'm trying to create a scanner with flex that acts somewhat like grep.
Basically, what I want to do is: given a word (regular text, not a regex), find any line in the input that contains a match for that text, then print the line that contains the word.
The problem I've been having is that I can't figure out how to best print the line. I can print everything after the searched word, but I don't know how to properly store the contents of the whole line.
I tried using yyseek(), but when I compile, I get back the message that yyseek is an undefined symbol.
Using yymore() to store text works well for anything after the matched word in the line.
Here is the code that I have so far:
%option yylineno
%option noyywrap
%{
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *search_str = NULL;
char *curr_line = NULL;
%}
%x found
letter [a-zA-Z]
word {letter}+
line (.*)\n
%%
<INITIAL,found>{word} {
/* If a word matches the string that we are looking for, use the 'found'
* condition, which will cause the line to be dumped at the end.
*/
yymore();
if (strcmp(search_str, yytext) == 0) {
BEGIN(found);
}
}
<found>{line} {
yymore();
ECHO;
BEGIN(INITIAL);
}
. { }
\n {}
%%
int main(int argc, char *argv[])
{
if (argc > 1) {
size_t str_len = strlen(argv[1]); /* sizeof(argv[1]) would be the size of a pointer, not the string length */
search_str = malloc(str_len + 1);
strcpy(search_str, argv[1]);
yylex();
free(search_str);
return 0;
}
printf("usage: ./a.out [search word]\n");
return 1;
}
This is really not a good use case for flex. And it's not totally clear to me that it will do what you want, either. (I don't actually know what you want, so I could be wrong about that.) But note the following:
Target line           grep night   grep -w night   Your code
-------------------   ----------   -------------   ---------
a night to remember   Yes          Yes             Yes
a knight to forget    Yes          No              No
night23               Yes          No              Yes
Anyway, your instinct about using yymore was correct. You just have to start earlier, so that the entire line is retained in the token. The small complication is that when you need to check a word, you can't check from the beginning of yytext; it contains the whole line up to this point. You have to check the last strlen(search_str) characters. The following code makes sure it only does that computation once, since it requires a complete scan of search_str. Also note that it makes sure it does not overrun the beginning of yytext.
In effect, the following code divides the text into three kinds of tokens: words, non-words, and newlines. Only newline fails to call yymore(), so when the newline rule triggers, yytext contains the entire line. As in your code, once a match is found in a line, the rest of the line is simply added to the match.
(Note: I rewrote this without macros, which are generally overused. I don't see any reason to think that {letter} is more readable than [[:alpha:]], and the latter has the advantage of being clear to anyone who knows flex, without having to search for your particular definition.)
%x FOUND
%%
    /* Indented lines before the first rule are put at the top of yylex */
    int match_length = strlen(search_str);
[^[:alpha:]\n]+   { yymore(); }
[[:alpha:]]+      { yymore();
                    if (yyleng >= match_length
                        && 0 == strcmp(yytext + yyleng - match_length,
                                       search_str))
                      BEGIN(FOUND);
                  }
<INITIAL,FOUND>\n BEGIN(INITIAL);
<FOUND>.*         printf("%s\n", yytext);
The oddity at the end is to deal with inputs which are not correctly terminated with a newline character. The last pattern will print the line with a newline character (even if there isn't one), and the newline character (if there is one) will restart the start condition.
For a slight speed gain, you could remember the previous value of yyleng every time you call yymore(), so that yyleng - prev_yyleng will be the length of "this part" of the token. (The flex scanner knows this value but doesn't provide any interface for you to find it out, which is slightly annoying. But it's not a big deal.) Then instead of checking whether the entire line up to this point is long enough to make a compare possible, you could check whether the last word matched was exactly the right length, which will be true less often, thereby requiring fewer calls to strcmp.
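As a rough sketch of that bookkeeping in plain C: text and len stand in for yytext and yyleng after a word rule has matched, and prev_len is the value yyleng had before the current piece was appended. The function name and parameters are invented for this example. Note that comparing the last piece exactly, rather than a suffix of the line, also means that knight no longer counts as a match for night.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the most recently appended piece of the token, that is
 * text[prev_len .. len), is exactly equal to target; 0 otherwise. */
int last_piece_equals(const char *text, size_t len, size_t prev_len,
                      const char *target, size_t target_len) {
    assert(text != NULL && target != NULL && prev_len <= len);
    if (len - prev_len != target_len)
        return 0;               /* cheap length test, no comparison needed */
    return memcmp(text + prev_len, target, target_len) == 0;
}
```

In a scanner, prev_len would be saved in a variable each time yymore() is called and reset to 0 when the newline rule fires.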
All in all, though, this is not a good strategy. You'll probably find that strstr is faster than flex, and it's only slightly optimised compared to what is possible for a repeated search for the same target. Better would be to implement or find one of the standard search algorithms:
Boyer-Moore
Knuth-Morris-Pratt
Rabin-Karp
etc.
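For reference, here is a compact sketch of one of these, Knuth-Morris-Pratt, in C. The function name kmp_search and its return convention are choices made for this example; a production version would probably keep the failure table around between searches for the same target instead of rebuilding it on every call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns the index of the first occurrence of pat in text, or -1 if
 * there is none (also -1 if the failure table cannot be allocated). */
long kmp_search(const char *text, const char *pat) {
    assert(text != NULL && pat != NULL);
    size_t n = strlen(text), m = strlen(pat);
    if (m == 0)
        return 0;                       /* empty pattern matches at 0 */

    /* fail[i] = length of the longest proper prefix of pat[0..i]
     * that is also a suffix of it. */
    size_t *fail = malloc(m * sizeof *fail);
    if (fail == NULL)
        return -1;
    fail[0] = 0;
    size_t k = 0;
    for (size_t i = 1; i < m; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        fail[i] = k;
    }

    /* Scan the text once; k is the number of pattern characters
     * currently matched, and the text index never moves backwards. */
    long result = -1;
    k = 0;
    for (size_t i = 0; i < n; i++) {
        while (k > 0 && text[i] != pat[k])
            k = fail[k - 1];
        if (text[i] == pat[k])
            k++;
        if (k == m) {
            result = (long)(i + 1 - m);
            break;
        }
    }
    free(fail);
    return result;
}
```

To get grep-like behaviour, each input line would be passed as text, and the line printed whenever the result is not -1.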

C Trying to match the exact substring and nothing more

I have tried different functions including strtok(), strcmp() and strstr(), but I guess I'm missing something. Is there a way to match the exact substring in a string?
For example:
If I have a name: "Tan"
And I have 2 file names: "SomethingTan5346" and "nothingTangyrs634"
So how can I make sure that I match the first string and not both? Because the second file is for the person Tangyrs. Or is it impossible with this approach? Am I going at it the wrong way?
If, as seems to be the case, you just want to identify strings that have your text but are immediately followed by a digit, your best bet is probably to get yourself a good regular expression implementation and just search for Tan[0-9].
It could be done simply by using strstr() to find the string then checking the character following that with isdigit() but the actual code to do that would be:
not as easy as you think since you may have to do multiple searches (e.g., TangoTangoTan42 would need three checks); and
inadvisable if there's the chance the searches may become more complex (such as Tan followed by 1-3 digits or exactly two # characters and an X).
A regular expression library will make this much easier, provided you're willing to invest a little effort into learning about it.
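As one possible illustration, assuming a POSIX system where <regex.h> is available, the search for Tan[0-9] could look like the sketch below. The wrapper name regex_found is made up for this example; regcomp, regexec and regfree are the standard POSIX calls.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the POSIX extended regex `pattern` matches anywhere in s,
 * 0 if it does not, and -1 if the pattern fails to compile. */
int regex_found(const char *pattern, const char *s) {
    assert(pattern != NULL && s != NULL);
    regex_t re;
    /* REG_NOSUB: only a yes/no answer is needed, not match positions. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Changing the requirement later (say, Tan followed by one to three digits) then only means changing the pattern string, not the code.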
If you don't want to invest the time in learning regular expressions, the following complete test program should be a good starting point to evaluate a string based on the requirements in the first paragraph:
#include <stdio.h>
#include <string.h>
#include <ctype.h>

int hasSubstrWithDigit(char *lookFor, char *searchString) {
    // Cache length and set initial search position.
    size_t lookLen = strlen(lookFor);
    char *foundPos = searchString;

    // Keep looking for string until none left.
    while ((foundPos = strstr(foundPos, lookFor)) != NULL) {
        // If at end, no possibility of following digit.
        if (strlen(foundPos) == lookLen) return 0;

        // If followed by digit, return true (cast keeps isdigit well-defined).
        if (isdigit((unsigned char)foundPos[lookLen])) return 1;

        // Otherwise keep looking, from next character.
        foundPos++;
    }

    // Not found, return false.
    return 0;
}

int main(int argc, char *argv[]) {
    if (argc < 3) {
        printf("Usage: testprog <lookFor> <searchIn>...\n");
        return 1;
    }
    for (int i = 2; i < argc; ++i) {
        printf("Result of looking for '%s' in '%s' is %d\n", argv[1], argv[i], hasSubstrWithDigit(argv[1], argv[i]));
    }
    return 0;
}
Though, as you can see, it's not as elegant as a regex search, and is likely to become even less elegant if your requirements change :-)
Running that with:
./testprog Tan xyzzyTan xyzzyTan7 xyzzyTangy4 xyzzyTangyTan12
shows it in action:
Result of looking for 'Tan' in 'xyzzyTan' is 0
Result of looking for 'Tan' in 'xyzzyTan7' is 1
Result of looking for 'Tan' in 'xyzzyTangy4' is 0
Result of looking for 'Tan' in 'xyzzyTangyTan12' is 1
The solution depends on your definition of exact matching.
This might be useful for you:
Traverse all matches of the target substring.
C find all occurrences of substring
Finding all instances of a substring in a string
find the count of substring in string
https://cboard.cprogramming.com/c-programming/73365-how-use-strstr-find-all-occurrences-substring-string-not-only-first.html
etc.
Having the span of the match, verify that the previous and following characters match/do not match your criterion for "exact match".
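A minimal sketch of those two steps in C follows. The criterion is passed in as a callback so it can be swapped; both find_bounded and the example criterion not_followed_by_letter are names invented here, and the criterion itself (a match counts unless a letter follows it) is only an assumption about what "exact" means for the file names in the question.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Criterion callback: prev and next are the characters on either side of
 * a match, or '\0' at the start/end of the string. Nonzero means accept. */
typedef int (*boundary_fn)(int prev, int next);

/* Example criterion (an assumption): accept a match that is not followed
 * by a letter, so "Tan5346" passes and "Tangyrs" does not. */
int not_followed_by_letter(int prev, int next) {
    (void)prev;                         /* the left side is not checked */
    return !isalpha(next);
}

/* Returns a pointer to the first occurrence of needle in hay whose
 * neighbours satisfy accept, or NULL if there is none. */
const char *find_bounded(const char *hay, const char *needle,
                         boundary_fn accept) {
    assert(hay != NULL && needle != NULL && accept != NULL);
    size_t nlen = strlen(needle);
    if (nlen == 0)
        return NULL;
    /* Visit every occurrence, resuming one character after each hit. */
    for (const char *p = hay; (p = strstr(p, needle)) != NULL; p++) {
        int prev = (p == hay) ? '\0' : (unsigned char)p[-1];
        int next = (unsigned char)p[nlen];
        if (accept(prev, next))
            return p;
    }
    return NULL;
}
```

A stricter definition, such as requiring a non-letter on both sides, only needs a different callback.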
Or,
You could take advantage of regex in C++ (I know the tag is "C"), with #include <regex>, or POSIX #include <regex.h>.
You may want to use strstr(3) to search a substring in a string, strchr(3) to search a character in a string, or even regular expressions with regcomp(3).
You should read more about parsing techniques, notably about recursive descent parsers. In some cases, sscanf(3) with %n can also be handy. You should take care of the return count.
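For instance, if the file names are assumed to be a run of letters followed by a number (an assumption about the naming convention, not something stated in the question), sscanf with %n could split them as in the sketch below. split_name is a hypothetical helper, and the A-Z style ranges inside %[...] are implementation-defined in standard C, although common C libraries treat them as ranges.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Splits a name such as "SomethingTan5346" into its leading letters and
 * the number that follows. Returns 1 on success, 0 if s is not letters
 * immediately followed by digits and nothing else. name needs 64 bytes. */
int split_name(const char *s, char name[64], int *number) {
    assert(s != NULL && name != NULL && number != NULL);
    int consumed = 0;
    /* %n stores how many characters have been read so far. It does not
     * count towards the return value, so success here is 2, not 3. */
    if (sscanf(s, "%63[A-Za-z]%d%n", name, number, &consumed) != 2)
        return 0;
    /* %d would also accept a sign or leading spaces; insist on a digit. */
    if (!isdigit((unsigned char)s[strlen(name)]))
        return 0;
    return s[consumed] == '\0';         /* reject trailing junk */
}
```

The caller can then test whether name ends with the person's name (for example with strcmp on the tail), which distinguishes SomethingTan from nothingTangyrs.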
You could loop to read then parse every line, perhaps using getline(3), see this.
You need first to document your input file format (or your file name conventions, if SomethingTan5346 is some file path), perhaps using EBNF notation.
(you probably want to combine several approaches I am suggesting above)
BTW, I recommend limiting (for your convenience) file paths to a restricted set of characters. For example using * or ; or spaces or tabs in file paths is possible (see path_resolution(7)) but should be frowned upon.

Creating a Lexical Analyzer in C

I am trying to create a lexical analyzer in C.
The program reads another program as input to convert it into tokens, and the source code is here-
#include <stdio.h>
#include <conio.h>
#include <string.h>
int main() {
FILE *fp;
char read[50];
char seprators [] = "\n";
char *p;
fp=fopen("C:\\Sum.c", "r");
clrscr();
while ( fgets(read, sizeof(read)-1, fp) !=NULL ) {
//Get the first token
p=strtok(read, seprators);
//Get and print other tokens
while (p!=NULL) {
printf("%s\n", p);
p=strtok(NULL, seprators);
}
}
return 0;
}
And the contents of Sum.c are-
#include <stdio.h>
int main() {
int x;
int y;
int sum;
printf("Enter two numbers\n");
scanf("%d%d", &x, &y);
sum=x+y;
printf("The sum of these numbers is %d", sum);
return 0;
}
I am not getting the correct output and only see a blank screen in place of output.
Can anybody please tell me where am I going wrong??
Thank you so much in advance..
You've asked a few questions since this one, so I guess you've moved on. There are a few things that can be noted about your problem and your start at a solution that can help others starting to solve a similar problem. You'll also find that people can often be slow at answering things that are obvious homework. We often wait until homework deadlines have passed. :-)
First, I noted you used a few features specific to the Borland C compiler which are non-standard and would not make the solution portable or generic. You could solve the problem without them just fine, and that is usually a good choice. For example, you used #include <conio.h> just to clear the screen with clrscr(), which is probably unnecessary and not relevant to the lexer problem.
I tested the program, and as written it works! It transcribes all the lines of the file Sum.c to stdout. If you only saw a blank screen it is because it could not find the file. Either you did not write it to your C:\ directory or it had a different name. As already mentioned by @WhozCraig you need to check that the file was found and opened properly.
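A minimal sketch of that check, using only standard C: the function name echo_lines is made up for this example, and the path passed to it is whatever the caller chooses.

```c
#include <assert.h>
#include <stdio.h>

/* Copies the file at path to out, one fgets() chunk at a time. Returns
 * the number of chunks read (one per line for lines that fit in the
 * buffer), or -1 after printing the reason if the file cannot be opened. */
long echo_lines(const char *path, FILE *out) {
    assert(path != NULL && out != NULL);
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        perror(path);                   /* e.g. "No such file or directory" */
        return -1;
    }
    char line[256];
    long count = 0;
    /* fgets already reserves room for the terminating '\0', so the full
     * buffer size can be passed. */
    while (fgets(line, sizeof line, fp) != NULL) {
        fputs(line, out);
        count++;
    }
    fclose(fp);
    return count;
}
```

With a check like this, a missing Sum.c produces an error message instead of a blank screen.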
I see you are using the C function strtok to divide the input up into tokens. There are some nice examples of using this in the documentation you could include in your code, which do more than your simple case. As mentioned by @Grijesh Chauhan there are more separators to consider than \n, or end-of-line. What about spaces and tabs, for example?
However, in programs, things are not always separated by spaces and lines. Take this example:
result=(number*scale)+total;
If we only used white space as a separator, then it would not identify the words used and only pick up the whole expression, which is obviously not tokenization. We could add these things to the separator list:
char seprators [] = "\n=(*)+;";
Then your code would pick out those words too. There is still a flaw in that strategy, because in programming languages, those symbols are also tokens that need to be identified. The problem with programming language tokenization is there are no clear separators between tokens.
There is a lot of theory behind this, but basically we have to write down the patterns that form the basis of the tokens we want to recognise and not look at the gaps between them, because as has been shown, there aren't any! These patterns are normally written as regular expressions. Computer Science theory tells us that we can use finite state automata to match these regular expressions. Writing a lexer involves a particular style of coding, which looks like this:
while ( NOT <<EOF>> ) {
switch ( next_symbol() ) {
case state_symbol[1]:
....
break;
case state_symbol[2]:
....
break;
default:
error(diagnostic);
}
}
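To make that skeleton a little more concrete, here is a small hand-written scanner in C that recognises tokens by their own shape rather than by separators. It is only a sketch: the token kinds and the name next_token are invented for this example, and it knows just three patterns (identifiers, unsigned integers, single-character symbols), far fewer than real C.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_SYMBOL } TokenKind;

/* Scans one token starting at *src, copies its text into buf (truncating
 * if it does not fit) and advances *src past it. Whitespace before the
 * token is skipped. */
TokenKind next_token(const char **src, char *buf, size_t bufsize) {
    assert(src != NULL && *src != NULL && buf != NULL && bufsize >= 2);
    const char *p = *src;
    size_t n = 0;
    TokenKind kind;

    while (isspace((unsigned char)*p))
        p++;

    if (*p == '\0') {
        kind = TOK_EOF;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        kind = TOK_IDENT;               /* letter or '_', then letters/digits/'_' */
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n + 1 < bufsize) buf[n++] = *p;
            p++;
        }
    } else if (isdigit((unsigned char)*p)) {
        kind = TOK_NUMBER;              /* a run of digits */
        while (isdigit((unsigned char)*p)) {
            if (n + 1 < bufsize) buf[n++] = *p;
            p++;
        }
    } else {
        kind = TOK_SYMBOL;              /* any other single character */
        buf[n++] = *p++;
    }
    buf[n] = '\0';
    *src = p;
    return kind;
}
```

Calling it in a loop on result=(number*scale)+total; yields result, =, (, number, *, scale, ), +, total and ; as separate tokens, so the symbols are reported as tokens rather than thrown away as separators.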
So, now, perhaps the value of the academic assignment becomes clearer.

Lex: How do I Prevent it from matching against substrings?

For example, I'm supposed to convert "int" to "INT". But if there's the word "integer", I don't think it's supposed to turn into "INTeger".
If I define "int" printf("INT"); the substrings are matched though. Is there a way to prevent this from happening?
I believe the following captures what you want.
%{
#include <stdio.h>
%}
ws [\t\n ]
%%
{ws}int{ws} { printf ("%cINT%c", *yytext, yytext[4]); }
. { printf ("%c", *yytext); }
To expand this beyond word boundaries ({ws}, in this case) you will need to either add modifiers to ws or add more specific checks.
well, here's how i did it:
(("int"([a-z]|[A-Z]|[0-9])+)|(([a-z]|[A-Z]|[0-9])+"int")) ECHO;
"int" printf("INT");
better suggestions welcome.
Lex will choose the rule with the longest possible match for the current input. To avoid substring matches you need to include an additional rule that matches more than int does when int is only part of a longer word. The easiest way to do this is to add a simple rule that picks up any run of letters, i.e. [a-zA-Z]+. For the input int itself both rules match the same three characters, and the int rule wins because it is listed first. The entire lex program would look like this:-
%%
[\t ]+ /* skip whitespace */
int { printf("INT"); }
[a-zA-Z]+ /* catch-all to avoid substring matches */
%%
int main(int argc, char *argv[])
{
yylex();
}
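The same longest-match idea can be sketched in plain C: take each maximal run of letters as one word, and only replace it when the whole run is int. The function upcase_int is invented for this illustration. Unlike the lex program above, which discards everything except int, this version copies all other text through unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Copies in to out, replacing each whole word "int" with "INT". A word
 * is a maximal run of letters, like the pattern [a-zA-Z]+. out must have
 * room for strlen(in) + 1 bytes (the output is never longer than in). */
void upcase_int(const char *in, char *out) {
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (isalpha((unsigned char)*in)) {
            size_t len = 0;
            while (isalpha((unsigned char)in[len]))
                len++;                  /* longest run of letters */
            if (len == 3 && memcmp(in, "int", 3) == 0)
                memcpy(out, "INT", 3);  /* the whole word is "int" */
            else
                memcpy(out, in, len);   /* some other word: copy as-is */
            in += len;
            out += len;
        } else {
            *out++ = *in++;             /* non-letters pass through */
        }
    }
    *out = '\0';
}
```

Because words are runs of letters only, int2 would still become INT2 here, just as it would with the [a-zA-Z]+ rule.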

Resources