I would like to make a project with Flex and Bison.
I have a grammar (that's only a part of mine):
variable_name:
| text {printf("VARIABLE NAME (TEXT) IN BISON: %s\n", $1); $$ = _variable_init($1);}
| character {printf("VARIABLE NAME (CHARACTER) IN BISON: %s\n", $1); $$ = _variable_init($1);}
;
bool_expr:
| true_t { printf("BOOL EXP TRUE\n"); $$ = _bool_expression_init_bool(TRUE); }
| false_t { printf("BOOL EXP FALSE\n"); $$ = _bool_expression_init_bool(FALSE); }
| variable_name { printf("BOOL EXP VARIABLE: %s\n", $1->name); $$ = _bool_expression_init_variable($1); }
| bool_expr eq bool_expr { printf("BOOL EXP ==\n"); $$ = _bool_expression_init_binary_op($1, "==", $3); }
| bool_expr noteq bool_expr { printf("BOOL EXP !=\n"); $$ = _bool_expression_init_binary_op($1, "!=", $3); }
| bool_expr and bool_expr { printf("BOOL EXP AND\n"); $$ = _bool_expression_init_binary_op($1, "&&", $3); }
| bool_expr or bool_expr { printf("BOOL EXP OR\n"); $$ = _bool_expression_init_binary_op($1, "||", $3); }
| '!' bool_expr { printf("BOOL EXP NOT\n"); $$ = _bool_expression_init_unary_op("!", $2); }
| '(' bool_expr ')' { printf("BOOL EXP ()\n"); $$ = _bool_expression_init_with_brackets($2); }
| int_expr eq int_expr { printf("BOOL EXP INT==\n"); $$ = _bool_expression_init_from_int($1, "==", $3); }
| int_expr noteq int_expr { printf("BOOL EXP INT!=\n"); $$ = _bool_expression_init_from_int($1, "!=", $3); }
| int_expr get int_expr { printf("BOOL EXP INT>=\n"); $$ = _bool_expression_init_from_int($1, ">=", $3); }
| int_expr let int_expr { printf("BOOL EXP INT<=\n"); $$ = _bool_expression_init_from_int($1, "<=", $3); }
| int_expr '<' int_expr { printf("BOOL EXP INT<\n"); $$ = _bool_expression_init_from_int($1, "<", $3); }
| int_expr '>' int_expr { printf("BOOL EXP INT>\n"); $$ = _bool_expression_init_from_int($1, ">", $3); }
;
int_expr:
| num { printf("INT EXP NUM\n"); $$ = _int_expression_init_int($1); }
| variable_name { printf("INT EXP VARIABLE\n"); $$ = _int_expression_init_variable($1); }
| int_expr '+' int_expr { printf("INT EXP +\n"); $$ = _int_expression_init_binary_op($1, "+", $3); }
| int_expr '-' int_expr { printf("INT EXP -\n"); $$ = _int_expression_init_binary_op($1, "-", $3); }
| int_expr '*' int_expr { printf("INT EXP *\n"); $$ = _int_expression_init_binary_op($1, "*", $3); }
| int_expr '/' int_expr { printf("INT EXP /\n"); $$ = _int_expression_init_binary_op($1, "/", $3); }
| '(' int_expr ')' { printf("INT EXP ())\n"); $$ = _int_expression_init_bracket($2); }
;
I copied only the important parts (hopefully there is the issue).
So when I want to parse this
var != 10
as a bool_expr, Bison identifies var as a variable and prints:
VARIABLE NAME (TEXT) IN BISON: var
but in the next moment it prints
BOOL EXP VARIABLE: var !=
and it thinks that the variable is the "var !=" when there is a rule
int_expr != int_expr
but it doesn't check this part.
Btw variable has Variable* type (struct), int_expr has IntExpression* (struct), bool_expr has BoolExpression* (struct).
I don't know what I should do. I tried adding six additional rules to bool_expr that are almost the same as the last six, with the first int_expr replaced by variable, and that worked, but it's disgusting. As far as I knew, Bison searches for the longest match, not the first.
The log:
VARIABLE NAME (TEXT) IN BISON: var
VARIABLE NAME IN FUNCTION: var //It was called in variable_init function
BOOL EXP VARIABLE: var!=
INT EXP NUM //10
There is nothing in your grammar which allows the parser to distinguish between integer variables and boolean variables. Both are simply variable_name; moreover, either a boolean variable or an integer variable could be followed by a comparison operator. (In the case of var != 10, it would be possible to deduce that var had to be an integer variable, once it is seen that the right-hand side of the operator is an integer. But a != b would still be ambiguous, since the variable names do not carry a marker to indicate their type.)
Bison, by default, produces an LALR(1) parser, which means that every reduction needs to be determined by examining at most one token following the end of the production. (That's what the "(1)" in "LALR(1)" means.) In other words, the parser would have to be able to decide between reducing the variable_name "var" to a bool_expr or an int_expr no later than when it sees the != token. That's not possible, and because it is not possible, Bison should have reported a reduce/reduce conflict, which you are apparently ignoring. (If it didn't, then the rest of your grammar is relevant. But that would be surprising.)
Bison doesn't just give up when it sees a conflict. It makes a somewhat arbitrary choice (in the case of reduce/reduce conflicts, it chooses the reduction which occurs earliest in the grammar, in this case to bool_expr) and builds the parser regardless. Occasionally, this default produces the correct parse, but in most cases the resulting parser is flawed, with behaviour similar to what you are experiencing. So although Bison claims that the conflict report is just a warning, you ignore it at your peril.
As @user253751 notes in a comment, you can ask Bison to produce a GLR parser, which allows arbitrary lookahead (at the cost of slowing down the parse). However, Bison's GLR implementation still requires the grammar to be unambiguous. Ambiguous parses will be detected at run-time, during the parse, and will cause the parse to fail; that would be the case with the ambiguous expression a != b noted above. (You can provide your own ambiguity resolution mechanism. But that's an extremely advanced technique, and in this case it won't work unless the ambiguity resolver has access to semantic information, like the declared type of each variable.)
Without seeing the rest of your grammar, it's hard to know whether type resolution could be done at compile-time (because variables need to be declared with a specific type), or only at run-time, but in neither case can that determination be made by the parser. So if you have boolean variables (and even if you don't), you cannot do better in the parser than to just have one expression non-terminal.
If you require declaration before use, then you could do type-resolution in your reduction actions by consulting the symbol table you are building up. At that point, you can either insert an automatic conversion function, or report an error (depending on whether you feel that automatic conversion is convenient). If you only require declaration (not prior declaration), then you can do the type-resolution after the parse is complete, by walking the AST twice: first to build up the symbol table, and second to resolve types.
If you consider type mismatches to be an error, then reporting the error in a semantic action is a lot more user-friendly. During the parse, it is very difficult to produce an error message more informative than "syntax error at line 10". But in the semantic action, you know precisely what the error is and what produced it, so it's easy to produce error messages like "the comparison at line 10 requires that 'var' be an integer variable, not a boolean variable." Your users will thank you.
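As a rough illustration of checking types in a semantic action, here is a minimal C sketch. Everything in it is an assumption made for the example: the `ValueType` enum, the fixed `symbols` table and the helper `check_int_operand` are hypothetical and do not come from the question's code, which does not show its symbol table.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical types: the question does not show its symbol table. */
typedef enum { TYPE_UNKNOWN, TYPE_INT, TYPE_BOOL } ValueType;

typedef struct {
    const char *name;
    ValueType type;
} Symbol;

/* A tiny fixed table standing in for the one built from declarations. */
static const Symbol symbols[] = {
    { "var",  TYPE_INT  },
    { "flag", TYPE_BOOL },
};

static ValueType lookup_type(const char *name)
{
    for (size_t i = 0; i < sizeof symbols / sizeof symbols[0]; ++i)
        if (strcmp(symbols[i].name, name) == 0)
            return symbols[i].type;
    return TYPE_UNKNOWN;
}

static const char *describe(ValueType type)
{
    switch (type) {
    case TYPE_INT:  return "an integer variable";
    case TYPE_BOOL: return "a boolean variable";
    default:        return "an undeclared name";
    }
}

/* Intended to be called from the action of a rule such as
 * expr: expr "<" expr, once the operand is known to be a variable.
 * Returns 1 if `name` is an integer variable; otherwise writes a
 * diagnostic into msg (msg_size must be at least 1) and returns 0. */
static int check_int_operand(const char *name, int line,
                             char *msg, size_t msg_size)
{
    ValueType actual = lookup_type(name);
    if (actual == TYPE_INT) {
        msg[0] = '\0';
        return 1;
    }
    snprintf(msg, msg_size,
             "the comparison at line %d requires that '%s' be an "
             "integer variable, not %s",
             line, name, describe(actual));
    return 0;
}
```

In a real parser the line number would probably come from Bison's location tracking or from the lexer, and the table would be filled in by the declaration rules.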
By the way, the usual convention is to use UPPER_CASE for the symbolic names of terminals, like !=, which would usually be named something like NOTEQ or T_NE. But if you are using Bison, you can make your grammar a lot more readable by using quoted aliases for your terminals:
%token EQ "==" NOTEQ "!="
GEQ ">=" LEQ "<="
Then you can use the symbolic names in your lexical analyser:
"!=" { return NOTEQ; }
without forcing whoever reads your grammar to guess what the name means:
expr: expr "==" expr
| expr "!=" expr
| ...
You must use double quotes; '+' is quite different.
I'm currently building a Python parser and I'm at the definition of arithmetic expressions. The rules behind arithmetic expressions work properly up until I add the parentheses.
Here is the starting point:
%token TOKEN_ARITH_ADD TOKEN_ARITH_SUB
%token TOKEN_ARITH_MUL TOKEN_ARITH_DIV TOKEN_ARITH_MOD
%token TOKEN_ARITH_POWER
%token TOKEN_ASSIGN
%token TOKEN_PAREN_OPEN TOKEN_PAREN_CLOSE
and then:
arith_expr: factor
| arith_expr TOKEN_ARITH_ADD factor { $$ = ast_init_arith_op($3, "+", $1); };
| arith_expr TOKEN_ARITH_SUB factor { $$ = ast_init_arith_op($3, "-", $1); };
| TOKEN_PAREN_OPEN arith_expr TOKEN_PAREN_CLOSE { $$ = $2; };
;
factor: power { $$ = ast_init_arith_op($1, NULL, NULL); };
| factor TOKEN_ARITH_MUL power { $$ = ast_init_arith_op($3, "*", $1); };
| factor TOKEN_ARITH_DIV power { $$ = ast_init_arith_op($3, "/", $1); };
| factor TOKEN_ARITH_MOD power { $$ = ast_init_arith_op($3, "%", $1); };
;
power: term
| power TOKEN_ARITH_POWER term { $$ = ast_init_arith_op($3, "**", $1); }
term: identifier;
| literal_int;
| literal_float;
The result is that if, for instance, I enter this:
myVar = (a + b) * 2
I get error: syntax error, unexpected TOKEN_ARITH_MUL, expecting TOKEN_EOL.
So I've tried to change the %token for %left for the first three ones, with the same problem.
I've also tried to change the %token for the assign to a %right; unfortunately I got an error at compile time (error: rule given for assign, which is a token). In retrospect, that makes sense.
It looks like the TOKEN_PAREN_OPEN arith_expr TOKEN_PAREN_CLOSE collapses to an arith_expr and the assign kicks in right away. What am I doing wrong?
According to your grammar, a multiplication operator can appear only between a factor and a power. An expression enclosed in parentheses is neither and cannot be reduced to either. As far as the part of the grammar presented goes, it is an arith_expr.
@n.m.'s comment is correct: you put the rule for a parenthesized expression in the wrong place. It should be a term, not an arith_expr. However, your followup comment suggests that you misunderstood. Do not change the production. Just move it, as is, to be one of the alternatives for term:
term: identifier
| literal_int
| literal_float
| TOKEN_PAREN_OPEN arith_expr TOKEN_PAREN_CLOSE
;
That allows a parenthesized expression to appear as a complete expression itself or as an operand of any operator.
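To see why the placement matters, here is a small hand-written recursive-descent evaluator in C that mirrors the layering of the grammar. It is only a sketch, not Bison output: the function names are hypothetical, the `power` level and identifiers are left out for brevity, operands are plain integers, and malformed input simply trips an `assert`.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

static const char *cursor;          /* current position in the input */

static long parse_arith_expr(void);

static void skip_spaces(void)
{
    while (isspace((unsigned char)*cursor))
        ++cursor;
}

/* term: literal_int | '(' arith_expr ')' */
static long parse_term(void)
{
    skip_spaces();
    if (*cursor == '(') {
        ++cursor;
        long value = parse_arith_expr();
        skip_spaces();
        assert(*cursor == ')');     /* sketch: abort on a missing ')' */
        ++cursor;
        return value;
    }
    char *end;
    long value = strtol(cursor, &end, 10);
    assert(end != cursor);          /* sketch: abort if no number here */
    cursor = end;
    return value;
}

/* factor: term | factor ('*' | '/' | '%') term */
static long parse_factor(void)
{
    long value = parse_term();
    for (;;) {
        skip_spaces();
        char op = *cursor;
        if (op != '*' && op != '/' && op != '%')
            return value;
        ++cursor;
        long rhs = parse_term();
        if (op == '*') {
            value *= rhs;
        } else {
            assert(rhs != 0);       /* sketch: no division by zero */
            value = (op == '/') ? value / rhs : value % rhs;
        }
    }
}

/* arith_expr: factor | arith_expr ('+' | '-') factor */
static long parse_arith_expr(void)
{
    long value = parse_factor();
    for (;;) {
        skip_spaces();
        char op = *cursor;
        if (op != '+' && op != '-')
            return value;
        ++cursor;
        long rhs = parse_factor();
        value = (op == '+') ? value + rhs : value - rhs;
    }
}

/* Evaluates a complete expression such as "(1 + 2) * 2". */
static long evaluate(const char *text)
{
    cursor = text;
    long value = parse_arith_expr();
    skip_spaces();
    assert(*cursor == '\0');        /* sketch: no trailing garbage */
    return value;
}
```

Because the parenthesized form is handled in `parse_term`, the result of `(1 + 2)` is available as the left operand of `*`. Had it been handled only in `parse_arith_expr`, the `*` would have nothing to attach to, which is the situation the syntax error describes.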
When I run the Bison program below (with bison file.y), I get the error missing a declaration type for $2 in 'seq':
%union {
char cval;
}
%token <cval> AMINO
%token STARTCODON STOPCODON
%%
series: STARTCODON seq STOPCODON {printf("%s", $2);}
seq : AMINO
| seq AMINO
;
%%
I would like to know why I get this error, and how I can correctly declare the variable $2
You haven't told Bison what type seq is, so it doesn't know what to do with $2.
Use the %type directive:
%type <cval> seq
Note that the type used for $2 is a single char, which is not a string as expected by the "%s" format. You need to come up with a way to create your own string from the sequence.
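One possible way to build that string is sketched below in C. It assumes (this is not in the question) that the %union gains a `char *sval` member, that `seq` is declared with `%type <sval> seq`, and that the hypothetical helper `append_char` is called from the two `seq` actions, as the comment shows.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, meant to be used from actions such as:
 *     seq : AMINO      { $$ = append_char(NULL, $1); }
 *         | seq AMINO  { $$ = append_char($1, $2); }
 *
 * Appends c to the heap-allocated string `prefix` (which may be NULL)
 * and returns the grown string. Ownership of `prefix` passes to the
 * function: the caller must use only the returned pointer afterwards.
 * Returns NULL, after freeing `prefix`, if memory runs out. */
char *append_char(char *prefix, char c)
{
    size_t len = prefix ? strlen(prefix) : 0;
    char *result = realloc(prefix, len + 2);   /* new char + '\0' */
    if (result == NULL) {
        free(prefix);
        return NULL;
    }
    result[len] = c;
    result[len + 1] = '\0';
    return result;
}
```

With that in place, the `series` action can print `$2` with "%s" and should then `free($2)`. Calling `strlen` on every append makes this quadratic in the sequence length, which is fine for a sketch but worth replacing with a length-tracking buffer for long inputs.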
I have written a simple grammar:
operations :
/* empty */
| operations operation ';'
| operations operation_id ';'
;
operation :
NUM operator NUM
{
printf("%d\n%d\n",$1, $3);
}
;
operation_id :
WORD operator WORD
{
printf("%s\n%s\n%s\n",$1, $3, $<string>2);
}
;
operator :
'+' | '-' | '*' | '/'
{
$<string>$ = strdup(yytext);
}
;
As you can see, I have defined an operator that recognizes one of 4 symbols. Now, I want to print this symbol in operation_id. The problem is that the logic in operator works only for the last symbol in the alternatives.
So if I write a/b; it prints ab/ and that's cool. But for other operations, e.g. a+b; it prints aba. What am I doing wrong?
*I omitted the newline characters in the example output.
This non-terminal from your grammar is just plain wrong.
operator :
'+' | '-' | '*' | '/' { $<string>$ = strdup(yytext); }
;
First, in yacc/bison, each production has an action. That rule has four productions, of which only the last has an associated action. It would be clearer to write it like this:
operator : '+'
| '-'
| '*'
| '/' { $<string>$ = strdup(yytext); }
;
which makes it a bit more obvious that the action only applies to the reduction from the token '/'.
The action itself is incorrect as well. yytext should never be used outside of a lexer action, because its value isn't reliable; it will be the value at the time the most recent lexer action was taken, but since the parser usually (but not always) reads one token ahead, it will usually (but not always) be the string associated with the next token. That's why the usual advice is to make a copy of yytext, but the idea is to copy it in the lexer rule, assigning the copy to the appropriate member of yylval so that the parser can use the semantic value of the token.
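The effect of the lookahead on yytext can be imitated in plain C. The sketch below is not Flex code: `token_text` is a hypothetical stand-in for yytext (one buffer reused for every token), and `scan_token` and `scan_with_lookahead` are made-up names. It keeps both a pointer into the shared buffer and a private copy of the first token, then scans one more token the way a parser reading ahead would.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for yytext: a single buffer overwritten by every token. */
static char token_text[64];

/* Scans one token from *input into token_text and advances *input.
 * Returns 0 at end of input, 1 otherwise. */
static int scan_token(const char **input)
{
    const char *p = *input;
    while (*p == ' ')
        ++p;
    if (*p == '\0') {
        *input = p;
        return 0;
    }
    size_t len = 1;                 /* operators are one character */
    if (isalnum((unsigned char)*p))
        while (isalnum((unsigned char)p[len]) &&
               len < sizeof token_text - 1)
            ++len;
    memcpy(token_text, p, len);
    token_text[len] = '\0';
    *input = p + len;
    return 1;
}

/* What a lexer action should do: hand the parser its own copy. */
static char *copy_text(const char *text)
{
    char *copy = malloc(strlen(text) + 1);
    assert(copy != NULL);
    return strcpy(copy, text);
}

/* Scans the first token of `text`, remembers it both by pointer into
 * the shared buffer (*aliased) and by private copy (*copied), then
 * scans the lookahead token. The caller frees *copied. */
static void scan_with_lookahead(const char *text,
                                const char **aliased, char **copied)
{
    int found = scan_token(&text);
    assert(found);
    *aliased = token_text;
    *copied = copy_text(token_text);
    scan_token(&text);              /* lookahead overwrites token_text */
}
```

For the input "a / b", the copy still reads "a" after the lookahead, while the aliased pointer now reads "/". That is the same kind of surprise as using yytext inside a parser action.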
You should avoid the use of $<type>$ =. A non-terminal can only have one type, and it should be declared in the prologue to the bison file:
%type <string> operator
Finally, you will find that it is very rarely useful to have a non-terminal which recognizes different operators, because the different operators are syntactically different. In a more complete expression grammar, you'd need to distinguish between a + b * c, which is the sum of a and the product of b and c, and a * b + c, which is the sum of c and the product of a and b. That can be done by using different non-terminals for the sum and product syntaxes, or by using different productions for an expression non-terminal and disambiguating with precedence rules, but in both cases you will not be able to use an operator non-terminal which produces + and * indiscriminately.
For what it's worth, here is the explanation of why a+b results in the output of aba:
The production operator : '+' has no explicit action, so it ends up using the default action, which is $$ = $1.
However, the lexer rule which returns '+' (presumably -- I'm guessing here) never sets yylval. So yylval still has the value it was last assigned.
Presumably (another guess), the lexer rule which produces WORD correctly sets yylval.string = strdup(yytext);. So the semantic value of the '+' token is the semantic value of the previous WORD token, which is to say a pointer to the string "a".
So when the rule
operation_id :
WORD operator WORD
{
printf("%s\n%s\n%s\n",$1, $3, $<string>2);
}
;
executes, $1 and $2 both have the value "a" (two pointers to the same string), and $3 has the value "b".
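The chain of guesses above can be reproduced with a toy lexer in C. This is only a sketch under the same assumptions the explanation makes: `toy_yylex` is a hypothetical hand-written scanner in which the WORD rule sets `yylval.string` and the operator rule does not, and `scan_operation` records the value a parser would push for each of the three tokens.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum { WORD = 258 };                /* token code, as Bison would assign */

static union { char *string; } yylval;   /* stand-in for Bison's yylval */
static const char *input;

/* Mimics the lexer guessed at above: only WORD sets yylval. */
static int toy_yylex(void)
{
    while (*input == ' ')
        ++input;
    if (*input == '\0')
        return 0;
    if (isalpha((unsigned char)*input)) {
        size_t len = 0;
        while (isalnum((unsigned char)input[len]))
            ++len;
        char *copy = malloc(len + 1);
        assert(copy != NULL);
        memcpy(copy, input, len);
        copy[len] = '\0';
        yylval.string = copy;       /* like strdup(yytext) */
        input += len;
        return WORD;
    }
    return *input++;                /* operator: yylval left untouched */
}

/* Reads the three tokens of `WORD operator WORD` and stores, for each,
 * the semantic value the parser would push when shifting that token. */
static void scan_operation(const char *text, char *values[3])
{
    input = text;
    for (int i = 0; i < 3; ++i) {
        int token = toy_yylex();
        assert(token != 0);         /* sketch: expects three tokens */
        values[i] = yylval.string;
    }
}
```

For "a+b" the three recorded values are "a", the same pointer to "a" again, and "b", which matches the aba output once the default action $$ = $1 passes the '+' token's value up to operator.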
Clearly, it is semantically incorrect for $2 to have the value "a", but there is another error waiting to occur. As written, your parser leaks memory because you never free() any of the strings created by strdup. That's not very satisfactory, and at some point you will want to fix the actions so that semantic values are freed when they are no longer required. At that point, you will discover that having two semantic values pointing at the same block of allocated memory makes it highly likely that free() will be called twice on the same memory block, which is Undefined Behaviour (and likely to produce very difficult-to-diagnose bugs).
I am learning lex and yacc programming, and this yacc program to validate and evaluate arithmetic expressions is giving me 10 shift/reduce conflicts. Can you point out what's wrong with this program?
This is 611.y
%{
#include<stdio.h>
int flag=1;
%}
%token id num
%left '(' ')'
%left '+' '-'
%left '/' '*'
%nonassoc UMINUS
%%
stmt:
expression { printf("\n valid exprn");}
;
expression:
'(' expression ')'
| '(' expression {printf("\n Syntax error: Missing right paranthesis");}
| expression '+' expression {printf("\nplus recog!");$$=$1+$3;printf("\n %d",$$);}
| expression '+' { printf ("\n Syntax error: Right operand is missing ");}
| expression '-' expression {printf("\nminus recog!");$$=$1-$3;printf("\n %d",$$);}
| expression '-' { printf ("\n Syntax error: Right operand is missing ");}
| expression '*' expression {printf("\nMul recog!");$$=$1*$3;printf("\n %d",$$);}
| expression '*' { printf ("\n Syntax error: Right operand is missing ");}
| expression '/' expression {printf("\ndivision recog!");if($3==0) printf("\ndivision cant be done, as divisor is zero.");else {$$=$1/$3;printf("\n %d",$$);}}
| expression '/' { printf ("\n Syntax error: Right operand is missing ");}
| expression '%' expression
| expression '%' { printf ("\n Syntax error: Right operand is missing ");}
| id
| num
;
%%
main()
{
printf(" Enter an arithmetic expression\n");
yyparse();
}
yyerror()
{
printf(" Invalid arithmetic Expression\n");
exit(1);
}
This is 611.l
%{
#include "y.tab.h"
#include<stdio.h>
#include<ctype.h>
extern int yylval;
int val;
%}
%%
[a-zA-Z][a-zA-Z0-9]* {printf("\n enter the value of variable %s:",yytext);scanf("%d",&val);yylval=val;return id;}
[0-9]+[.]?[0-9]* {yylval=atoi(yytext);return num;}
[ \t] ;
\n {return 0;}
. {return yytext[0];}
%%
int yywrap()
{
return 1;
}
When I compile the code like this
lex 611.l
yacc -d 611.y
It gives me
yacc:10 shift/reduce conflicts.
Please help me out here.
Two things are wrong:
Precedence of '%' is missing, add it to '/' '*'
The '(' expression error handler is ambiguous (in an expression such as (4*(2+3)+5*7 there are many places to insert the missing parenthesis) and in fact conflicts with the normal '(' expression ')' rule. It is non-trivial to make such a handler work. I would recommend removing it and relying on yacc's built-in error handling.
Simple error handling can be implemented like this:
stmt:
expression { printf("\n valid exprn");}
| error { printf(" Invalid arithmetic Expression\n"); }
;
expression:
'(' expression ')'
| '(' error ')' { printf(" Invalid arithmetic Expression\n"); }
| ... /* all the rest */
You won't need the other error handlers either.
I have some difficulty figuring out how to fix this.
Basically (and this could be valid for any operator, yet I'm using the '+' as an example), say we had this rule in the lexer source:
[+-]?[0-9]+ { yylval = atoi(yytext); return INTEGER; }
And, in the parser, we'd have
exp: INTEGER
| exp '+' exp { $$ = $1 + $3; }
| // etc etc
Then, in the resulting calculator, if I do
2 + 2
It would work as expected and give me the number 4.
But if I do
2+2
i.e. without spaces between 2, + and the other 2, I have a syntax error. The reason is that "+2" itself is a token, so bison reads "exp exp" and doesn't find anything since it's not part of the parser rules.
But, the line
2++2
is fine, since bison does "2" + "+2".
My question is... how could we fix that behavior so that "2+2" works the same way as "2 + 2"?
EDIT: It seems this question, as is, was a duplicate of another one, as pointed out in a comment below. Well, I have partially found the answer, but still.
If we make it the parser's job, and define a custom precedence level for the unary rules like this:
exp:
| // bla bla bla
| '+' exp %prec UPLUS { $$ = +$2; }
| '-' exp %prec UMINUS { $$ = -$2; }
I still see a problem. Indeed, we can technically do this, in the calculator:
2+++++2
4
2+++++++++++2
4
2++++3
5
Is there a way to avoid such ugly syntax and trigger an error or at least a warning, so that only 2+2 is allowed or, at worst, only 2+2 and 2++2, which are the only two choices that make sense there?
Thanks!
Unary operators are best handled in the grammar, not the scanner. There's no reason to do it the hard way. Just allow unary operators '+' and '-' in the productions for 'primary'; ignore unary '+'; and output code to negate the operand if the number of unary '-' operators is odd.
And get rid of [-+]? in the lex specification. At present you seem to be trying to handle it in both places.
There's also no reason to prohibit spaces between unary operators and their operands, or to only allow one unary operator, which is what handling it in the lexer condemns you to doing. Do it in the grammar. Only.
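The idea can be sketched with a small hand-written recursive-descent calculator in C. This is an illustration rather than Bison output: the function names are hypothetical, only '+' and '-' are supported, and bad input trips an `assert`. The number scanner accepts digits only, as the lexer would once [-+]? is removed, and signs are handled in `parse_primary`.

```c
#include <assert.h>
#include <ctype.h>

static const char *cursor;          /* current position in the input */

static void skip_spaces(void)
{
    while (isspace((unsigned char)*cursor))
        ++cursor;
}

/* NUMBER token: digits only, like [0-9]+ in the lexer. */
static long scan_number(void)
{
    assert(isdigit((unsigned char)*cursor));   /* sketch: abort otherwise */
    long value = 0;
    while (isdigit((unsigned char)*cursor))
        value = value * 10 + (*cursor++ - '0');
    return value;
}

/* primary: NUMBER | '+' primary | '-' primary
 * Unary '+' is ignored; the operand is negated if the number of
 * unary '-' operators is odd. Spaces between signs are allowed. */
static long parse_primary(void)
{
    int negate = 0;
    for (;;) {
        skip_spaces();
        if (*cursor == '-')
            negate = !negate;
        else if (*cursor != '+')
            break;
        ++cursor;
    }
    long value = scan_number();
    return negate ? -value : value;
}

/* exp: primary | exp '+' primary | exp '-' primary */
static long parse_exp(void)
{
    long value = parse_primary();
    for (;;) {
        skip_spaces();
        char op = *cursor;
        if (op != '+' && op != '-')
            return value;
        ++cursor;
        long rhs = parse_primary();
        value = (op == '+') ? value + rhs : value - rhs;
    }
}

/* Evaluates a complete expression such as "2+2" or "2 - -3". */
static long evaluate_signed(const char *text)
{
    cursor = text;
    long value = parse_exp();
    skip_spaces();
    assert(*cursor == '\0');        /* sketch: no trailing garbage */
    return value;
}
```

With the sign handled here, "2+2" and "2 + 2" give the same result, and "2++2" is still accepted as 2 + (+2). Rejecting longer runs of signs would need an explicit check, for example a limit on how many signs `parse_primary` consumes.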