Operators included in lexemes (bison/flex) [duplicate] - c

I have some difficulty figuring out how to fix this.
Basically (and this could be valid for any operator, yet I'm using '+' as an example), say we had this rule in the lexer source:
[+-]?[0-9]+ { yylval = atoi(yytext); return INTEGER; }
And, in the parser, we'd have
exp: INTEGER
| exp '+' exp { $$ = $1 + $3; }
| // etc etc
Then, in the resulting calculator, if I do
2 + 2
It would work as expected and give me the number 4.
But if I do
2+2
i.e. without spaces between 2, + and the other 2, I get a syntax error. The reason is that "+2" itself is a token, so bison reads "exp exp" and doesn't find anything since it's not part of the parser rules.
But, the line
2++2
is fine, since bison does "2" + "+2".
My question is... how could we fix that behavior so that "2+2" works the same way as "2 + 2"?
EDIT: It seems this question, as is, was a duplicate of another one, as pointed out in a comment below. Well, I have partially found the answer, but still.
If we make it the parser's job, and define a custom precedence level for the unary rules like this:
exp:
| // bla bla bla
| '+' exp %prec UPLUS { $$ = +$2; }
| '-' exp %prec UMINUS { $$ = -$2; }
I still see a problem. Indeed, we can technically do this, in the calculator:
2+++++2
4
2+++++++++++2
4
2++++3
5
Is there a way to avoid such ugly syntax and trigger an error or at least a warning, so that only 2+2 is allowed or, at worst, only 2+2 and 2++2, which are the only two choices that make sense there?
Thanks!

Unary operators are best handled in the grammar, not the scanner. There's no reason to do it the hard way. Just allow unary operators '+' and '-' in the productions for 'primary'; ignore unary '+'; and output code to negate the operand if the number of unary '-' operators is odd.
And get rid of [-+]? in the lex specification. At present you seem to be trying to handle it in both places.
There's also no reason to prohibit spaces between unary operators and their operands, or to only allow one unary operator, which is what handling it in the lexer condemns you to doing. Do it in the grammar. Only.
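To make the division of labour concrete, here is a minimal hand-written sketch in C rather than flex/bison. The names eval, parse_primary and parse_expr are made up for illustration, there is no error handling, and only '+' and '-' are supported. The point is that the digit-reading code never looks at a sign; unary '+' and '-' are consumed in the routine for 'primary'.

```c
#include <ctype.h>

/* Hypothetical stand-in for the flex/bison pair; not generated code. */
static const char *p;   /* cursor into the input string */

static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

static int parse_expr(void);

/* primary: INTEGER | '(' expr ')' | '+' primary | '-' primary */
static int parse_primary(void) {
    skip_ws();
    if (*p == '+') { p++; return parse_primary(); }   /* unary plus: ignored */
    if (*p == '-') { p++; return -parse_primary(); }  /* unary minus: negate */
    if (*p == '(') {
        p++;
        int v = parse_expr();
        skip_ws();
        if (*p == ')') p++;
        return v;
    }
    /* The "scanner" part: digits only, no optional sign. */
    int v = 0;
    while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
    return v;
}

/* expr: primary (('+' | '-') primary)* */
static int parse_expr(void) {
    int v = parse_primary();
    for (;;) {
        skip_ws();
        if (*p == '+') { p++; v += parse_primary(); }
        else if (*p == '-') { p++; v -= parse_primary(); }
        else return v;
    }
}

int eval(const char *s) { p = s; return parse_expr(); }
```

With this split, "2+2", "2 + 2" and "2++2" all evaluate the same way, and spaces between a unary operator and its operand cost nothing.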

Related

Bison/Flex print value of terminal from alternative

I have written a simple grammar:
operations :
/* empty */
| operations operation ';'
| operations operation_id ';'
;
operation :
NUM operator NUM
{
printf("%d\n%d\n",$1, $3);
}
;
operation_id :
WORD operator WORD
{
printf("%s\n%s\n%s\n",$1, $3, $<string>2);
}
;
operator :
'+' | '-' | '*' | '/'
{
$<string>$ = strdup(yytext);
}
;
As you can see, I have defined an operator that recognizes one of 4 symbols. Now, I want to print this symbol in operation_id. The problem is that the logic in operator works only for the last symbol in the alternative.
So if I write a/b; it prints ab/ and that's cool. But for other operations, e.g. a+b; it prints aba. What am I doing wrong?
*I omitted the newline characters in the example output.
This non-terminal from your grammar is just plain wrong.
operator :
'+' | '-' | '*' | '/' { $<string>$ = strdup(yytext); }
;
First, in yacc/bison, each production has an action. That rule has four productions, of which only the last has an associated action. It would be clearer to write it like this:
operator : '+'
| '-'
| '*'
| '/' { $<string>$ = strdup(yytext); }
;
which makes it a bit more obvious that the action only applies to the reduction from the token '/'.
The action itself is incorrect as well. yytext should never be used outside of a lexer action, because its value isn't reliable; it will be the value at the time the most recent lexer action was taken, but since the parser usually (but not always) reads one token ahead, it will usually (but not always) be the string associated with the next token. That's why the usual advice is to make a copy of yytext, but the idea is to copy it in the lexer rule, assigning the copy to the appropriate member of yylval so that the parser can use the semantic value of the token.
You should avoid the use of $<type>$ =. A non-terminal can only have one type, and it should be declared in the prologue to the bison file:
%type <string> operator
Finally, you will find that it is very rarely useful to have a non-terminal which recognizes different operators, because the different operators are syntactically different. In a more complete expression grammar, you'd need to distinguish between a + b * c, which is the sum of a and the product of b and c, and a * b + c, which is the sum of c and the product of a and b. That can be done by using different non-terminals for the sum and product syntaxes, or by using different productions for an expression non-terminal and disambiguating with precedence rules, but in both cases you will not be able to use an operator non-terminal which produces + and * indiscriminately.
For what it's worth, here is the explanation of why a+b results in the output of aba:
The production operator : '+' has no explicit action, so it ends up using the default action, which is $$ = $1.
However, the lexer rule which returns '+' (presumably -- I'm guessing here) never sets yylval. So yylval still has the value it was last assigned.
Presumably (another guess), the lexer rule which produces WORD correctly sets yylval.string = strdup(yytext);. So the semantic value of the '+' token is the semantic value of the previous WORD token, which is to say a pointer to the string "a".
So when the rule
operation_id :
WORD operator WORD
{
printf("%s\n%s\n%s\n",$1, $3, $<string>2);
}
;
executes, $1 and $2 both have the value "a" (two pointers to the same string), and $3 has the value "b".
Clearly, it is semantically incorrect for $2 to have the value "a", but there is another error waiting to occur. As written, your parser leaks memory because you never free() any of the strings created by strdup. That's not very satisfactory, and at some point you will want to fix the actions so that semantic values are freed when they are no longer required. At that point, you will discover that having two semantic values pointing at the same block of allocated memory makes it highly likely that free() will be called twice on the same memory block, which is Undefined Behaviour (and likely to produce very difficult-to-diagnose bugs).
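The stale-value effect can be reproduced without bison at all. Below is a small C simulation under stated assumptions: yylval_sim, yylex_sim and value_of_token are hypothetical names standing in for yylval, yylex and the value a parser would copy when it shifts a token, and the toy scanner only knows one-letter words and one-character operators. When operators_set_value is 0 the operator rule leaves the value untouched, as guessed above; when it is 1 the operator rule stores its own copy.

```c
#include <stdlib.h>
#include <string.h>

enum { TOKEN_END = 0, TOKEN_WORD = 258 };

/* Stand-ins for bison's YYSTYPE and the global yylval. */
typedef union { char *string; } semantic_value;
static semantic_value yylval_sim;
static const char *cursor;

/* Allocate a one-character string (a strdup-like copy made in the scanner). */
static char *copy_char(char c) {
    char *s = malloc(2);
    if (s) { s[0] = c; s[1] = '\0'; }
    return s;
}

/* Toy scanner: WORD always sets the value; operators only do so when asked. */
static int yylex_sim(int operators_set_value) {
    char c = *cursor;
    if (c == '\0') return TOKEN_END;
    cursor++;
    if (c >= 'a' && c <= 'z') {
        yylval_sim.string = copy_char(c);
        return TOKEN_WORD;
    }
    if (operators_set_value) yylval_sim.string = copy_char(c);
    return c;   /* otherwise the previous value is still in yylval_sim */
}

/* Semantic value that would accompany token number `index` (0-based). */
const char *value_of_token(const char *text, int index, int operators_set_value) {
    cursor = text;
    yylval_sim.string = NULL;
    for (int i = 0; i <= index; i++) yylex_sim(operators_set_value);
    return yylval_sim.string;
}
```

For the input a+b, token 1 (the '+') carries "a" in the first mode and "+" in the second. The sketch never frees its copies, so it leaks in the same way the question's parser does.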

Translation of if then else in compiler grammar

...
IF LP assignment-expression RP marker statement {
backpatch($3.tlist,$5.instr);
$$.nextList = mergeList($3.flist,$6.nextList);
}
|IF LP assignment-expression RP marker statement ELSE Next statement {
backpatch($3.tlist,$5.instr);
backpatch($3.flist,$8.instr);
YYSTYPE::BackpatchList *temp = mergeList($6.nextList,$8.nextList);
$$.nextList = mergeList(temp,$9.nextList);
}
...
Assignment-expression is any assignment expression that is possible using the C operators =, +=, -=, *=, /=.
LP = (
RP = )
marker and Next are both EMPTY rules
The problem with the above grammar rule and implementation is that it can't generate correct code when the expression looks like this:
bool a;
if(a){
printf("hi");
}
else{
printf("die");
}
It expects that assignment-expression contains a relop, OR, or AND in order to generate correct code, since the comparison is only emitted for a relop, and the same applies to OR and AND.
The code above doesn't contain any of these, so it's unable to generate correct code.
Correct code can be generated by using the following rule, but this leads to two reduce-reduce conflicts.
...
IF LP assignment-expression {
if($3.flist == NULL && $3.tlist == NULL)
...
} RP marker statement {
...
}
|IF LP assignment-expression{
if($3.flist == NULL && $3.tlist == NULL)
...
} RP marker statement ELSE Next statement {
...
}
...
What modifications should I make to the grammar rule so that it works as expected?
I tried the IF ELSE grammar rule from here as well as from the dragon book, but was unable to solve this.
The whole grammar can be found here: Github
In order to insert the mid-rule action, you need to left-factor; otherwise, the bison-generated parser cannot decide which of the two MRAs to reduce. (Even though they are presumably identical, bison doesn't know that.)
if_prefix: "if" '(' expression ')' { $$ = $3; /* Normalize the flist */ }
if: if_prefix marker statement { ... }
| if_prefix marker statement "else" Next statement { ... }
(You could left factor differently; that's just one suggestion.)
It looks like your grammar has an incorrect definition of expression.
An assignment expression is only one of many non-terminals that should be able to reduce to an expression. For an if/then/else construction you generally need to allow any expression to occur between the parens. Your first example, as you point out, is perfectly valid C but doesn't contain an assignment.
In your grammar, you have this line:
/*Expression list**/
expression:
assignment-expression{}
|expression COMMA assignment-expression{}
;
However, an expression should be able to have more than assignment-expressions. Not being terribly familiar with yacc/bison, I would guess you need to change this to something like the following:
/*Expression **/
expression:
assignment-expression{}
|logical-OR-expression{}
|logical-AND-expression{}
|inclusive-OR-expression{}
|exclusive-OR-expression{}
|inclusive-AND-expression{}
|equality-expression{}
|relational-expression{}
|additive-expression{}
|multiplicative-expression{}
|exponentiation-expression{}
|unary-expression{}
|postfix-expression{}
|primary-expression{}
|expression COMMA expression{}
;
I can't really validate that this is going to work for you, and it may be imperfect, but hopefully you get the idea. Each different type of expression needs to be able to reduce to an expression. You have something very similar for statement earlier in your grammar, so this should hopefully make sense.
It might be helpful to do some reading or watch some tutorials on how LR grammars work.

Runtime formula evaluation

I would like to evaluate formulas which a user can input for many data points, so efficiency is a concern. This is for a Fortran project, but my solutions so far have been centered on using a yacc/bison grammar, so I will probably use Fortran's iso_c_binding feature to interface to yyparse().
The preferred (so far) solution would be a small extension of the classic mfcalc calculator example from the Bison manual, with the bison grammar made to recognize a (single) variable name as well (which is not hard).
The question is what to do in the executable statements. I see two options there.
First, I could simply evaluate the expression as it is parsed, as in the mfcalc example.
Second, I could invoke the bison parser once for parsing and for creating a stack-based (reverse Polish) representation of the formula being parsed, so
2 + 3*x would be translated into 2 3 x * + (of course, as the relevant data structure).
The relevant part of the grammar would look like this:
%union {
double val;
char *c;
int fcn;
}
%type <val> NUMBER
%type <c> VAR
%type <fcn> Function
/* Tokens and %left PLUS MINUS etc. left out for brevity */
%%
...
Function:
SIN { $$=SIN; }
| COS { $$=COS; }
| TAN { $$=TAN; }
| SQRT { $$=SQRT; }
Expression:
NUMBER { push_number($1); }
| VAR { push_var($1); }
| Expression PLUS Expression { push_operand(PLUS); }
| Expression MINUS Expression { push_operand(MINUS); }
| Expression DIVIDE Expression { push_operand(DIVIDE); }
| MINUS Expression %prec NEG { push_operand(NEG); }
| LEFT_PARENTHESIS Expression RIGHT_PARENTHESIS
| Function LEFT_PARENTHESIS Expression RIGHT_PARENTHESIS { push_function($1); }
| Expression POWER Expression { push_operand(POWER); }
The functions push_... would put the formula into an array of structs, each of which holds the token and the yacc union.
The RPN would then be interpreted using a very simple (and hopefully fast) interpreter.
So, the questions.
Is the second approach valid? I think it is from what I understand about bison's (or yacc's) way of handling shift and reduce (basically, this will shift a number and reduce an expression, so the order should be guaranteed to be correct for RPN), but I am not quite sure.
Also, is it worth the additional effort over simply evaluating the function using the $$ construct (the first approach)?
Finally, are there other, better solutions? I had considered using syntax trees, but I don't think the additional effort is actually worth it. Also, I tend to think that using trees is overkill where an array would do just nicely :-)
It's only slightly more difficult to generate three-address virtual ops than RPN. In effect, the RPN is a virtual stack machine. The three-address ops -- which can also easily go into an array -- are probably faster to interpret, and will probably be more flexible in the long term.
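As a rough sketch of what "three-address ops in an array" could look like, assuming a flat register file of doubles: the names struct tac, run_tac and eval_example are invented for this illustration, and the instruction list for 2 + 3*x is written out by hand rather than produced by a parser.

```c
#include <stddef.h>

enum tac_op { TAC_ADD, TAC_SUB, TAC_MUL, TAC_DIV };

/* One three-address instruction: reg[dst] = reg[lhs] op reg[rhs]. */
struct tac { enum tac_op op; int dst, lhs, rhs; };

/* Interpret n instructions against the register file reg. */
void run_tac(const struct tac *code, size_t n, double *reg) {
    for (size_t i = 0; i < n; i++) {
        double a = reg[code[i].lhs];
        double b = reg[code[i].rhs];
        switch (code[i].op) {
        case TAC_ADD: reg[code[i].dst] = a + b; break;
        case TAC_SUB: reg[code[i].dst] = a - b; break;
        case TAC_MUL: reg[code[i].dst] = a * b; break;
        case TAC_DIV: reg[code[i].dst] = a / b; break;
        }
    }
}

/* 2 + 3*x, hand-compiled: r0 = 2, r1 = 3, r2 = x, r3 = r1*r2, r4 = r0+r3. */
double eval_example(double x) {
    static const struct tac code[] = {
        { TAC_MUL, 3, 1, 2 },
        { TAC_ADD, 4, 0, 3 },
    };
    double reg[5] = { 2.0, 3.0, x, 0.0, 0.0 };
    run_tac(code, sizeof code / sizeof code[0], reg);
    return reg[4];
}
```

Constants are loaded into registers once, so evaluating many data points only means rewriting the register that holds x and rerunning the loop.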
The main advantage of parsing the expression into some internal form is that it is likely to be faster to evaluate the internal form than to reparse the original string. That may not be the case, but it usually is because converting floating-point literals into floating-point numbers is (relatively speaking) quite slow.
There is also the intermediate case of tokenizing the expression (into an array), and then directly evaluating while parsing the token stream. (In effect, that makes bison your virtual machine.)
Which of these strategies is the best depends a lot on details of your use case, but none of them are difficult so you could try all three and compare.
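For comparison, an RPN interpreter along the lines described in the question might look like the sketch below. Everything here is an assumption made for illustration: the struct rpn_item layout, the names rpn_eval and rpn_example, a fixed stack size, and a program that is already well formed (no underflow or overflow checks). Functions and POWER are left out to keep it short.

```c
#include <stddef.h>

enum rpn_kind {
    RPN_NUMBER, RPN_VAR, RPN_PLUS, RPN_MINUS, RPN_TIMES, RPN_DIVIDE, RPN_NEG
};

/* One element of the RPN program; val is only used by RPN_NUMBER. */
struct rpn_item { enum rpn_kind kind; double val; };

#define RPN_STACK_MAX 64

/* Evaluate a well-formed RPN program for one value of the variable x. */
double rpn_eval(const struct rpn_item *prog, size_t n, double x) {
    double stack[RPN_STACK_MAX];
    size_t top = 0;
    for (size_t i = 0; i < n; i++) {
        switch (prog[i].kind) {
        case RPN_NUMBER: stack[top++] = prog[i].val; break;
        case RPN_VAR:    stack[top++] = x; break;
        case RPN_NEG:    stack[top - 1] = -stack[top - 1]; break;
        default: {
            /* Binary operator: pop right operand first, then left. */
            double b = stack[--top];
            double a = stack[--top];
            double r = 0.0;
            switch (prog[i].kind) {
            case RPN_PLUS:   r = a + b; break;
            case RPN_MINUS:  r = a - b; break;
            case RPN_TIMES:  r = a * b; break;
            case RPN_DIVIDE: r = a / b; break;
            default: break;
            }
            stack[top++] = r;
            break;
        }
        }
    }
    return stack[0];
}

/* 2 + 3*x in RPN order: 2 3 x * + */
double rpn_example(double x) {
    static const struct rpn_item prog[] = {
        { RPN_NUMBER, 2.0 }, { RPN_NUMBER, 3.0 }, { RPN_VAR, 0.0 },
        { RPN_TIMES, 0.0 }, { RPN_PLUS, 0.0 },
    };
    return rpn_eval(prog, sizeof prog / sizeof prog[0], x);
}
```

The array is built once by the parser actions and then evaluated once per data point, which is where the saving over reparsing comes from.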

Checking Valid Arithmetic Expression in Lex (in C)

I have to write code for checking whether an arithmetic expression is valid, in lex. I am aware that I could do this very easily using yacc, but doing it only in lex is not so easy.
I have written the code below, which for some reason doesn't work.
Besides this, I also don't get how to handle binary operators.
My wrong code:
%{
#include <stdio.h>
/* Will be using stack to check the validity of arithmetic expressions */
char stack[100];
int top = 0;
int validity =0;
%}
operand [a-zA-Z0-9_]+
%%
/* Will consider unary operators (++,--), binary operators(+,-,*,/,^), braces((,)) and assignment operators (=,+=,-=,*=,^=) */
"(" { stack[top++]='(';}
")" { if(stack[top]!=')') yerror(); else top--;}
[+|"-"|*|/|^|%] { if(stack[top]!='$') yerror(); else stack[top]=='&';}
"++" { if(stack[top]!='$') yerror(); else top--;}
[+"-"*^%]?= { if(top) yerror();}
operand { if(stack[top]=='&') top--; else stack[top++]='$';}
%%
int yerror()
{
printf("Invalid Arithmetic Expression\n");
}
First, learn how to write regular expressions in Flex. (Patterns, Flex manual).
Inside a character class ([…]), neither quotes nor stars nor vertical bars are special. To include a - or a ], you can escape them with a \ or put them at the beginning of the list, or in the case of - at the end.
So in:
[+|"-"|*|/|^|%]
The | is just another character in the list, and including it five times doesn't change anything. "-" is a character range consisting only of the character ", although I suppose the intention was to include a -. Probably you wanted [-+*/^%] or [+\-*/^%].
There is no way that the flex scanner can guess that a + (for example) is a unary operator instead of a binary operator, and putting it twice in the list of rules won't do anything; the first rule will always take effect.
Finally, if you use definitions (like operand) in your patterns, you have to enclose them in braces: {operand}; otherwise, flex will interpret it as a simple keyword.
And a hint for the assignment itself: A valid unparenthesized arithmetic expression can be simplified into the regular expression:
term {prefix-operator}*{operand}{postfix-operator}*
expr {term}({infix-operator}{term})*
But you can't use that directly because (a) it doesn't deal with parentheses, (b) you probably need to allow whitespace, and (c) it doesn't correctly reject a+++++b because C insists on the "maximal munch" rule for lexical scans, so that is not the same as the correct expression a++ + ++b.
You can, however, translate the above regular expression into a very simple two-state state machine.
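One possible shape for that two-state machine, written as plain C rather than as flex rules, is sketched below. It is an illustration under assumptions, not the assignment's solution: the function name is_valid_expression is made up, parentheses are tracked with a simple depth counter, whitespace is skipped, only '+' and '-' count as prefix operators, and postfix ++/-- and assignment operators are not handled at all.

```c
#include <ctype.h>

/* Returns 1 if s is a well-formed expression under the simplified rules. */
int is_valid_expression(const char *s) {
    enum { WANT_OPERAND, WANT_OPERATOR } state = WANT_OPERAND;
    int depth = 0;   /* number of currently open parentheses */

    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (isspace(c)) continue;

        if (state == WANT_OPERAND) {
            if (c == '+' || c == '-') continue;        /* prefix operator */
            if (c == '(') { depth++; continue; }
            if (isalnum(c) || c == '_') {
                /* Consume the rest of the operand. */
                while (isalnum((unsigned char)s[1]) || s[1] == '_') s++;
                state = WANT_OPERATOR;
                continue;
            }
            return 0;
        }

        /* state == WANT_OPERATOR */
        if (c == ')') {
            if (--depth < 0) return 0;                 /* unmatched ')' */
            continue;
        }
        if (c == '+' || c == '-' || c == '*' || c == '/' || c == '^' || c == '%') {
            state = WANT_OPERAND;                      /* infix operator */
            continue;
        }
        return 0;
    }
    /* Must end right after an operand, with every '(' closed. */
    return state == WANT_OPERATOR && depth == 0;
}
```

In flex the two states would map naturally onto start conditions, with the depth counter kept in a C variable.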

Left-recursive error with my C grammar

I have a left-recursive error with my C grammar which can be found here
http://www.archive-host.com/files/1959502/24fe084677d7655eb57ba66e1864081450017dd9/cAST.txt.
The error appears when I replace
initializer
: assignment_expression
| '{' initializer_list '}'
;
with
initializer
: assignment_expression
| '{' initializer_list '}'
| initializer_list
;
I did this because I am trying to parse this code, typed in and terminated with Ctrl-D:
int k [2] = 1,4;
However this code does work with the first version
int k [2] = {1,4};
Is there a way to do this without the { }, please?
To do this, you'd need to introduce context sensitivity (or something on that order).
The problem is that 1,4 already has a defined meaning. It's an expression using the comma operator that evaluates the 1, discards the result, then evaluates the 4, which is the value of the expression as a whole.
As such, to make this work, you'd have to use a different syntax for initializers than for normal expressions (and in the process, depart pretty widely from C as it's currently defined). From a purely grammatical viewpoint, that almost certainly does not need to be done with context sensitivity, but it will involve basically defining the syntax for initializers separately from/in parallel with the syntax for normal expressions, instead of using a common syntax for both.
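The existing meaning of 1,4 is easy to check in ordinary C. In this small sketch the helper names one, comma_result and times_evaluated are invented; a function call is used for the left operand so that its evaluation is observable.

```c
static int evaluated = 0;

/* Left operand with a visible side effect. */
static int one(void) { evaluated++; return 1; }

/* The comma operator evaluates one(), discards its result, and yields 4. */
int comma_result(void) {
    int k = (one(), 4);
    return k;
}

/* How many times the left operand has been evaluated so far. */
int times_evaluated(void) { return evaluated; }
```

Note that the parentheses are needed even here: in a declaration, a bare comma separates declarators, which is another sign that initializers already use a narrower syntax than full expressions.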
