flex i am confused - c

input to lexer
abc gef4 44jdjd ghghg
x
ererete
xyzzz
55k
hello wold
33
my rules
rule1 [0-9]+[a-zA-Z]+
rule2 [x-z]
rule3 .*
{rule1} {
printf("%s \n", yytext);
}
{rule2} {
printf("%s \n", yytext);
}
{rule3} {
// prints nothing
}
output :-
x
55k
I cannot understand the output. Can someone please help me?

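Flex always picks the rule with the longest match and, on a tie, the rule listed first. On line 1 the first character matches neither rule1 nor rule2, so rule3 (.*) eats the input up to the end of the line. Rule3 also produces the longest match on lines 3, 4, 6, and 7 (on line 4, rule2 matches only the x while rule3 matches all of xyzzz). Only on lines 2 and 5 does an earlier rule match the same text as rule3 and win the tie. You probably want a less greedy rule3, i.e. one that doesn't consume the spaces: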
[^ \t\n]* /* Do nothing */
Then 44jdjd is matched by rule1.

Related

Why is my bison/flex not working as intended?

I have this homework assignment where I have to transform some input into a particular output. The problem I'm having is that I can only convert the first line into the output I need, the other lines return a "syntax error" error.
Additionally, if I change the lines order, no lines are converted so only one particular line is working.
This is my input file:
Input.txt
B0102 Bobi 2017/01/16 V8 1, massage 12.50
J1841 Jeco 20.2 2017/01/17 V8 2, Tosse 2, tosquia 22.50
B2232 Bobi 2017/01/17 Tosse 1, Leptospirose 1, bath 30.00, massage 12.50
B1841 Jeco 21.4 2017/01/18 Leptospirose 1, Giardiase 2
And this is the output I should obtain:
Output
Bobi (B0102) paid 2 services/vaccines 22.50
Jeco (J1841) paid 3 services/vaccines 62.50
Bobi (B2232) paid 4 services/vaccines 62.50
Jeco (B1841) paid 2 services/vaccines 30.00
If I change the line order in the input file, not even the first line is converted.
However, if the order is as I showed above, this is my output:
Bobi (B0102) paid 2 services/vaccines 22.50
syntax error
This is my code:
file.y
%{
#include "file.h"
#include <stdio.h>
int yylex();
int counter = 0;
int vaccineCost = 10;
%}
%union{
char* code;
char* name;
float value;
int quantity;
};
%token COMMA WEIGHT DATE SERVICE VACCINE
%token CODE
%token NAME
%token VALUE
%token QUANTITY
%type <name> NAME
%type <code> CODE
%type <value> VALUE
%type <quantity> QUANTITY
%type <value> services
%start begining
%%
begining: /*empty*/
| animal
;
animal: CODE NAME WEIGHT DATE services {printf("%s (%s) paid %d services/vaccines %.2f\n", $2, $1, counter, $5); counter = 0;}
| CODE NAME DATE services {printf("%s (%s) paid %d services/vaccines %.2f\n", $2, $1, counter, $4); counter = 0;}
;
services: services COMMA SERVICE VALUE {$$ = $1 + $4; counter++;}
| services COMMA VACCINE QUANTITY{$$ = $1 + $4*vaccineCost;counter++;}
| SERVICE VALUE{$$ = $2;counter++;}
| VACCINE VALUE
{$$ = $2*vaccineCost;counter++;}
;
%%
int main(){
yyparse();
return 0;
}
void yyerror (char const *s) {
fprintf (stderr, "%s\n", s);
}
file.flex
%option noyywrap
%{
#include "file.h"
#include "file.tab.h"
#include <stdio.h>
#include <string.h>
%}
/*Patterns*/
YEAR 20[0-9]{2}
MONTH 0[1-9]|1[0-2]
DAY 0[1-9]|[1-2][0-9]|3[0-1]
%%
, {return COMMA;}
[A-Z][0-9]{4} {yylval.code = strdup(yytext); return CODE;}
[A-Z][a-z]* {yylval.name = strdup(yytext); return NAME;}
[0-9]+[.][0-9] {return WEIGHT;}
{YEAR}"/"{MONTH}"/"{DAY} {return DATE;}
(banho|massagem|tosquia) {return SERVICE;}
[0-9]+\.[0-9]{2} {yylval.value = atof(yytext);return VALUE;}
(V8|V10|Anti-Rabatica|Giardiase|Tosse|Leptospirose) {return VACCINE;}
[1-9] {yylval.quantity = atoi(yytext);return QUANTITY;}
\n
.
<<EOF>> return 0;
%%
And these are the commands I execute:
bison -d file.y
flex -o file.c file.flex
gcc file.tab.c file.c -o exec -lfl
./exec < Input.txt
Can anyone point me in the right direction or tell me what is wrong with my code?
Thanks, and if my explanation wasn't good enough I'll try my best to explain it better!!
There are at least two different problems which cause those symptoms.
Your top-level grammar only accepts at most a single animal:
begining: /*empty*/
| animal
So an input containing more than one line won't be allowed. You need a top-level rule which accepts any number of animals. (By the way, modern bison versions let you write %empty as the right-hand side of an empty production, instead of having to (mis)use a comment.)
The order of your scanner rules means that most of the words you want to recognise as VACCINE will instead be recognised as NAME. Recall that when two patterns match the same token, the first one in the file will win. So with these rules:
[A-Z][a-z]* {yylval.name = strdup(yytext); return NAME;}
(V8|V10|Anti-Rabatica|Giardiase|Tosse|Leptospirose) {return VACCINE;}
Tokens like Tosse, which could match either rule, will be assumed to match the first rule. Only V8 and Anti-Rabatica, which [A-Z][a-z]* can match only in part (V and Anti), are longer matches for the second rule and so are recognised as VACCINE. So your first input line doesn't trigger this problem, but all the other ones do.
You probably should handle newline characters syntactically, unless you allow treatment records to be split over multiple lines. And be aware that many (f)lex versions do not allow empty actions, as in your last two flex rules. This may cause lexical errors.
And finally
<<EOF>> return 0;
is unnecessary. That's how the scanner handles end-of-file by default. <<EOF>> rules are often wrong or redundant, and should only be used when clearly needed (and with great care).

detect if a line is match to a format - in C

I have a file and I need to check if its lines are in the following format:
name: name1,name2,name3,name4 ...
(some string, followed by ":", then a single space and after that strings separated by ",").
I tried doing it with the following code:
int result =0;
do
{
result =sscanf(rest,"%[^:]: %s%s", p1,p2,p3);
if(result==3)
{
printf("invalid!");
fclose(fpointer);
return -1;
}
}while (fgets(rest ,LINE , fpointer) != NULL);
This works well for lines like: name: name1, name2 (with a space between name1, and name2).
But it fails with the following line:
name : name1,name2
I want to somehow tell sscanf not to ignore the white space before the ":".
Could someone see how?
Thanks for helping!
This works for me:
result = sscanf(rest,"%[^*:]: %[^,],%s", p1, p2, p3);
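Note that the * inside the scanset is a literal character, not a space-skipping flag: %[^*:] reads everything up to the first ':' or '*', so a space before the colon ends up in p1.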

Is there an option for flex to match whole words only?

I'm writing a lexer and I'm using Flex to generate it based on custom rules.
I want to match identifiers of sorts that start with a letter and then can have either letters or numbers. So I wrote the following pattern for them:
[[:alpha:]][[:alnum:]]*
It works fine, the lexer that gets generated recognizes the pattern perfectly, although it doesn't only match whole words but all appearances of that pattern.
So for example it would match the input "Text" and "9Text" (discarding that initial 9).
Consider the following simple lexer that accepts IDs as described above:
%{
#include <stdio.h>
#define LINE_END 1
#define ID 2
%}
/* Flex options: */
%option noinput
%option nounput
%option noyywrap
%option yylineno
/* Definitions: */
WHITESPACE [ \t]
BLANK {WHITESPACE}+
NEW_LINE "\n"|"\r\n"
ID [[:alpha:]][[:alnum:]_]*
%%
{NEW_LINE} {printf("New line.\n"); return LINE_END;}
{BLANK} {/* Blanks are skipped */}
{ID} {printf("ID recognized: '%s'\n", yytext); return ID;}
. {fprintf(stderr, "ERROR: Invalid input in line %d: \"%s\"\n", yylineno, yytext);}
%%
int main(int argc, char **argv) {
while (yylex() != 0);
return 0;
}
When compiled and fed the following input, it produces the output below:
Input:
Test
9Test
Output:
Test
ID recognized: 'Test'
New line.
9Test
ERROR: Invalid input in line 2: "9"
ID recognized: 'Test'
New line.
Is there a way to make flex match only whole words (i.e. delimited by either blanks or custom delimiters like '(' ')' for example)?
Because I could write a rule that excludes IDs that start with numbers, but what about the ones that start with symbols like "$Test" or "&Test"? I don't think I can enumerate all of the possible symbols.
Following the example above, the desired output would be:
Test
ID recognized: 'Test'
New line.
9Test
ERROR: Invalid input 2: "9Test"
New line.
You seem to be asking two questions at once.
'Whole word' isn't a recognized construct in programming languages. The lexical rules and the grammar of your language are already defined. Just implement them.
The best way to handle illegal or unexpected characters in flex is not to handle them specially at all. Return them to the parser, just as you would for a special character. Then the parser can deal with it and attempt recovery via discarding.
Place this as your final rule:
. return yytext[0];
You can use this
Let's say you want to identify the reserved word for:
([\r\n\z]|" "|"")+"for"/([\r\n\z]|" ")+ {}
any new line character or generally a control character [\r\n\z]
or a white space " "
or the beginning of the line ""
for at least 1 time +
the word you want in quotes "for"
only followed by /
almost the same expression without the "" at least 1 time -> ([\r\n\z]|" ")+
With this code you can form your own matching pattern for whatever you need to do before and after the word.
I'm not sure if this is the best answer, but this works for me.
%x ERROR
%%
{NEW_LINE} {
printf("New line.\n");
return LINE_END;
}
<INITIAL,ERROR>{BLANK} {
BEGIN(INITIAL);
}
{ID} {
printf("ID recognized: '%s'\n", yytext);
return ID;
}
<INITIAL,ERROR>. {
fprintf(stderr, "ERROR: Invalid input in line %d: \"%s\"\n", yylineno, yytext);
BEGIN(ERROR);
}
%%
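See the flex manual's section on start conditions to learn more.
(My attempt at explaining what I've done)
ERROR is declared with %x, so it is an exclusive start condition. Whenever the lexer hits something unexpected, it switches to ERROR, where only the two rules tagged <INITIAL,ERROR> are active, so the rest of the offending word is reported character by character instead of being matched as an ID. To get back to the normal rules, the lexer has to hit a blank.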

Flex not counting lines properly on multiline comments

I'm using the regex below to identify multi-line comments in Flex:
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/] { /* DO NOTHING */ }
But it seems to me that flex/bison is not reporting the line number correctly.
For example:
Input:
1 ___bqmu7ftc
2 // _qXnFEgQL9Zsyn8Ohtx7zhToLK68xbu3XRrOvRi
3 /* "{ output 6 = <=W if u7 do nN)T!=$||JN,a9vR)7"
4 -758939
5 -31943.6165480
6 // "RND"
7 '_'
8 */
9 [br _int]
Output:
1 TK_IDENT [___bqmu7ftc]
4 [
4 TK_IDENT [br]
4 TK_IDENT [_int]
4 ]
The line should be 9 instead of 4.
Any ideas?
I don't know how you generated the test output in your question, but here's an (almost) minimal example of how to use yylineno. It works fine for me:
%{
#define ID 257
%}
%option yylineno
%option noinput nounput noyywrap
%%
[[:space:]]+ { /* DO NOTHING */ }
"//".* { /* DO NOTHING */ }
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/] { /* DO NOTHING */ }
[[:alpha:]_][[:alnum:]_]* { return ID; }
. { return *yytext; }
%%
int main(int argc, char** argv) {
for (;;) {
int token = yylex();
switch (token) {
case 0: printf("%4d: %s\n", yylineno, "EOF"); return 0;
case ID: printf("%4d: %-4s[%s]\n", yylineno, "ID", yytext); break;
default: printf("%4d: %c\n", yylineno, token); break;
}
}
}
This is the solution I found in the Flex manual.
Remember to declare int comment_caller; (and the line_num counter) in your definitions section.
%x comment
%x foo
%%
"/*" {comment_caller = INITIAL;
BEGIN(comment);
}
<foo>"/*" {
comment_caller = foo;
BEGIN(comment);
}
<comment>[^*\n]* {}
<comment>"*"+[^*/\n]* {}
<comment>\n {++line_num;}
<comment>"*"+"/" BEGIN(comment_caller);
I had the same problem with multi-line comments in flex. I used the regex suggested in another Stack Overflow question (it is the same regex as the one in this question).
This regex also consumes the newlines inside the multi-line comment. So if you track the current line by counting \n in a separate rule, you will get into trouble: the regular expression matches the whole multi-line comment at once, so the newlines inside it are never counted.
So I found another way to keep the line count correct while still using the regular expression, explained below.
Flex keeps the matched text in the yytext variable, so we can count the newlines inside the matched comment. That worked with all the code I tested.
Here is my code:
Note: numberOfCurrentLine is the global variable I use to store the current line number.
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/] {
// Count the occurrences of \n in the matched comment and add
// that number to numberOfCurrentLine
// so the current line number stays correct
char* str = yytext;
int i = 0;
char *pch=strchr(str,'\n');
while (pch!=NULL) {
i++;
pch=strchr(pch+1,'\n');
}
numberOfCurrentLine+=i;
}
This code counts the number of \n in the selected comment and adds it to the global variable that is counting the number of the current line.
The code for counting number of occurrences of a char that I used above is from this post.
So with the above code, I always have the right number of the current line and the code works perfectly.

Skipping strings between separator with sscanf()

I've got a couple of strings structured like this:
1|36901|O|173665.47|1996-01-02|5-LOW|Clerk#000000951|0|nstructions sleep furiously among |
I want to extract the fields in position 0, 1, 3, 7, in this case 1, 36901, 173665.47 and 0.
I've tried
sscanf(line, "%d|%d|%*c|%lf|%*s|%*s|%*s|%d|%*s|", &rec.order_key, &rec.cust_key, &rec.total_price, &rec.ship_priority);
printf("%d %d %lf %d", rec.order_key, rec.cust_key, rec.total_price, rec.ship_priority);
and expecting to get
1 36901 173665.470000 0
instead I got
1 36901 173665.470000 1
so I guess I did something wrong with the skipping, but I just can't figure it out.
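I figured it out: %*s skips everything up to the next whitespace character, so it does not stop at the | separator and swallows the fields that follow. sscanf() then stops before the last %d, and ship_priority is never assigned. Using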
sscanf(line, "%d|%d|%*c|%lf|%*[^|]|%*[^|]|%*[^|]|%d|%*[^|]|",
&rec.order_key, &rec.cust_key, &rec.total_price, &rec.ship_priority);
solved the problem.
