I'm new to Bison and I'm having trouble with shift/reduce conflicts...
I'm writing the rules for grammar for the C language: ID is a token that identifies a variable, and I wrote this rule to ensure that the identifier can be considered even if it is written in parentheses.
id : '(' ID ')' {printf("(ID) %s\n", $2);}
| ID {printf("ID %s\n", $1);}
;
Output of Bison conflicts is:
State 82
12 id: '(' ID . ')'
13 | ID .
')' shift, and go to state 22
')' [reduce using rule 13 (id)]
$default reduce using rule 13 (id)
How can I resolve this conflict?
I hope I was clear and thanks for your help.
Your id rule by itself cannot cause a shift/reduce conflict. There must be some other rule in your grammar that uses ID. For example, you may have an expression rule such as:
expr: '(' expr ')'
| ID
;
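In the above example, once the parser has read '(' ID and sees ')' as the lookahead, it can either shift the ')' to complete your id rule or reduce the ID on its own so that '(' expr ')' can match, and it doesn't know which to choose. Check what is in state 22.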
Edit: you ask "what can I do to solve the conflict?"
I'm writing the rules for grammar for the C language: ID is a token that identifies a variable, and I wrote this rule to ensure that the identifier can be considered even if it is written in parentheses
In C, a parenthesized identifier is just an ordinary expression (it is even still an lvalue, so (x) = 1; is valid), so it does not need a rule of its own. Just remove your id rule and replace id with expr wherever you use it.
I am trying to perform a syntax analysis using bison, but it uses the wrong rule at one point and I didn't manage to find how to fix it.
I have a few rules but these ones seem to be the source of the problem :
method : vars statements;
vars : //empty
| vars var;
var : type IDENTIFIER ';';
type : IDENTIFIER;
statements : //empty
| statements statement;
statement : IDENTIFIER '=' e ';';
e : (...)
With IDENTIFIER being a simple regex matching [a-zA-Z]*
So basically, if I write that :
int myint;
myint = 12;
Since myint is an identifier, bison seems to still try to match it on the second line as a type and then matches the whole thing as a var and not as a statement. So I get this error (knowing that ASSIGN is '=') :
syntax error, unexpected ASSIGN, expecting IDENTIFIER
Edit : Note that bison is indicating that there are shift/reduce errors, so it may be linked (as said in the answers).
The problem comes from the default resolution of a shift/reduce conflict caused by the empty statements rule. After the vars, the parser needs to know whether to reduce the empty statements and start matching statements, or to shift the IDENTIFIER that might begin another var. Bison resolves the conflict by shifting, which sends it down the var path.
You can avoid this problem by refactoring the grammar to avoid empty productions:
method: vars | vars statements | statements ;
vars: var | vars var ;
statements : statement | statements statement ;
... rest the same
This avoids needing to know whether something is a var or a statement until the parser has shifted far enough into it to tell.
I am trying to write a simple parser using lemon, for a javascript-like language. I am unable to resolve a conflict error, and I suspect it is an unsolvable problem.
The conflict is between the grammar for:
{x = 10;}
and
{x:10};
The first is a statement block containing an assignment statement and the second is an expression statement defining an object.
A grammar to parse both of them results in a conflict. The minimal code is as follows:
rMod ::= rStmt.
rStmt ::= rStmtList RCURLY. {leaveScope();}
rStmtList ::= rStmtList rStmt.
rStmtList ::= LCURLY. {enterScope();}
rStmt ::= rExpr SEMI.
rExpr ::= rObj.
rObj ::= LCURLY rObjItemList RCURLY.
rObjItemList ::= rObjItemList COMMA rObjItem.
rObjItemList ::= rObjItem.
rObjItem ::= ID COLON rExpr.
rExpr ::= ID.
rExpr ::= NUM.
The out file shows the following:
State 4:
(3) rStmtList ::= LCURLY *
rObj ::= LCURLY * rObjItemList RCURLY
rObjItemList ::= * rObjItemList COMMA rObjItem
rObjItemList ::= * rObjItem
rObjItem ::= * ID COLON rExpr
ID shift 8
ID reduce 3 ** Parsing conflict **
rObjItemList shift 6
rObjItem shift-reduce 8 rObjItemList ::= rObjItem
{default} reduce 3 rStmtList ::= LCURLY
Any suggestions on how I can resolve this would be gratefully accepted. Thanks.
The heart of the problem is that you want to execute enterScope() after the brace which initiates a statement block. However, if the brace is followed by the two tokens VAR and :, then it starts an object literal, not a block. So it is impossible to know whether or not to execute the enterScope action without two-token lookahead, and lemon does not generate LR(2) parsers. To that extent, you are correct that the problem is unsolvable. But of course there are solutions.
Probably the worst solution from any perspective (readability, complexity, verifiability) is to create an LR(1) grammar using the usual LR(2)→LR(1) transformation, which will allow you to call the enterScope(); action at the point where it is clear that a scope has been entered. This means delaying the reduction by one token. That in turn means dividing expr into two disjoint non-terminals: those expr which can start with a VAR and those which cannot. For those expr which can start with a VAR, you also need to provide a mechanism which essentially allows you to glue together a VAR and the rest of the expr; in the case of expressions, that is particularly ugly (but still possible). The goal is to be able to write:
block(A) ::= blockPrefix(B) RCURLY . { closeScope(); A = B;}
blockPrefix(A) ::= lcurlyOpen exprNotStartingVAR(E) . { A = E; }
blockPrefix(A) ::= lcurlyVAR(V) restOfExprStartingVar(R) . { A = makeExpr(V, R); }
blockPrefix(A) ::= blockPrefix(B) SEMI expr(E) . { A = appendExpr(B, E); }
lcurlyOpen ::= LCURLY . { openScope(); }
lcurlyVAR(A) ::= LCURLY VAR(V) . { openScope(); A = V; }
An alternative, which is also ugly but probably less ugly in this particular case, is to recognize a variable name followed by a colon as a single lexical token (VAR_COLON). Although that complicates the lexer (particularly since you need to recognize constructs where whitespace or even comments appear between the variable name and the colon), it makes the grammar much simpler. With that change, there is no conflict because the object literal must start with a VAR_COLON while an expr can only start with a VAR (or other unrelated tokens).
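The VAR_COLON idea can be sketched outside of any particular lexer generator. The following Python tokenizer is only an illustration under stated assumptions: the function name tokenize, the pattern names, and the ASSIGN and NUM tokens are hypothetical, and only block and line comments in the C style are skipped between the name and the colon.

```python
import re

# One unit of skippable input: a whitespace character, a /* ... */ comment,
# or a // comment that must run to the end of its line.
SKIP = r"(?:\s|/\*(?:[^*]|\*(?!/))*\*/|//[^\n]*(?=\n|\Z))"

# Order matters: VAR_COLON is tried before plain VAR.
TOKEN_RE = re.compile("|".join([
    rf"(?P<SKIP>{SKIP}+)",
    rf"(?P<VAR_COLON>[A-Za-z_]\w*{SKIP}*:)",
    r"(?P<VAR>[A-Za-z_]\w*)",
    r"(?P<NUM>\d+)",
    r"(?P<LCURLY>\{)",
    r"(?P<RCURLY>\})",
    r"(?P<ASSIGN>=)",
    r"(?P<SEMI>;)",
    r"(?P<COMMA>,)",
]))


def tokenize(text):
    """Return a list of (kind, text) pairs; a name plus its colon is one token."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind = m.lastgroup
        if kind == "VAR_COLON":
            # Keep only the name; the whitespace, comments and colon are dropped.
            name = re.match(r"[A-Za-z_]\w*", m.group()).group()
            tokens.append((kind, name))
        elif kind != "SKIP":
            tokens.append((kind, m.group()))
        pos = m.end()
    return tokens
```

With this sketch, tokenize("{x = 10;}") starts with LCURLY followed by VAR, while tokenize("{x : 10}") starts with LCURLY followed by VAR_COLON, so the parser can tell the two constructs apart with a single token of lookahead. A language that also has a ?: conditional operator would need extra care, since a name before that colon would be merged as well.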
A much simpler solution is to not try to make the scope an inherited attribute. If we do scope resolution synthetically (that is, once the whole block has been parsed), then the problem more or less vanishes:
stmt ::= expr SEMI .
stmt ::= block .
stmtList ::= stmt .
stmtList ::= stmtList stmt .
block(A) ::= LCURLY stmtList(B) RCURLY . { A = applyScope(newScope(), B); }
objLiteral ::= LCURLY itemList RCURLY .
objLiteral ::= LCURLY RCURLY .
itemList ::= item .
itemList ::= itemList COMMA item .
item ::= VAR COLON expr .
expr ::= VAR .
expr ::= objLiteral .
...
That grammar has no conflicts, but it might radically change the way you handle scopes, since it requires variable names to be scoped once a block is complete rather than doing it in-line as the parse proceeds.
However, I would argue that for most languages (including Javascript), it is actually more convenient to do scoping at the end of a block, or even as a post-parse walk over the AST. Javascript, unlike C, allows local variables to be declared after their first mention. Local functions can even be used before their declaration. (This is subtly different from Python, where a function declaration is an executable assignment, but the scoping rules are similar.)
As another example, C++ allows class members to be declared anywhere inside the declaration of the class, even if the member has already been mentioned inside another class member function.
And there are many other examples. These scoping rules generally benefit the programmer by allowing stylistic options (such as putting member variable definitions at the end of a class definition in C++) which would not be possible in C.
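The idea of resolving scopes after a block is complete can be sketched as a walk over a finished AST. This is a minimal Python illustration, not the applyScope function from the grammar above: the tuple-based node shapes ("block", "decl", "use") and the function name resolve are assumptions made for the example.

```python
def resolve(block, outer=frozenset()):
    """Return the names used in `block` that no enclosing scope declares.

    A block is ("block", [stmt, ...]); a statement is ("decl", name),
    ("use", name), or a nested block. These shapes are hypothetical.
    """
    kind, stmts = block
    assert kind == "block"
    # First pass: collect every declaration in this block, wherever it
    # appears, so a name may be used before the line that declares it.
    scope = set(outer) | {s[1] for s in stmts if s[0] == "decl"}
    unresolved = []
    # Second pass: check uses and descend into nested blocks.
    for s in stmts:
        if s[0] == "use" and s[1] not in scope:
            unresolved.append(s[1])
        elif s[0] == "block":
            unresolved.extend(resolve(s, scope))
    return unresolved
```

Because declarations are collected before uses are checked, a block such as ("block", [("use", "x"), ("decl", "x")]) resolves cleanly, which is the behaviour an in-line enterScope() action cannot easily provide.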
I am working on a yacc file to parse a given file and convert it to an equivalent c++ file. I have created the following grammar based on the provided syntax diagrams:
program: PROGRAMnumber id 'is' comp_stmt
;
comp_stmt: BEGINnumber statement symbol ENDnumber
;
statement: statement SEMInumber statement
| id EQnumber expression
| PRINTnumber expression
| declaration
;
declaration: VARnumber id
;
expression: term
;
term: term as_op term
| MINUSnumber term
| factor
;
factor: factor md_op factor
| ICONSTnumber
| id
| RPARENnumber expression LPARENnumber
;
as_op: PLUSnumber
| MINUSnumber
;
md_op: TIMESnumber
| DIVnumber
;
symbol: SEMInumber
| COMMAnumber
;
id: IDnumber
| id symbol id
;
The only issue I have remaining is that I am receiving this error when trying to compile with yacc.
conflicts: 14 shift/reduce
calc.y:103.17-111.41: warning: rule useless in parser due to conflicts: declaration: VARnumber id
I have resolved the only other conflict I have encountered, but I am not sure what the resolution for this conflict is. The line it should match is of the format
var a, b, c, d;
or
var a;
All of your productions intended to derive lists are ambiguous and therefore generate shift/reduce conflicts. For example:
id: id symbol id
Will be clearly ambiguous when there are three identifiers: are the first two to be reduced first, or the last two? The usual list idiom is left-recursion:
id_list: id | id_list ',' id
For most languages, that would not be correct for statements, which are terminated with semi-colons, not separated by them, but that model would work for a comma-separated list of identifiers, or for a left-associative sequence of addition operators.
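To make the ambiguity concrete, the number of distinct parse trees for a list of n identifiers can be counted for both forms of the rule. The Python sketch below is only an illustration; the function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ambiguous_trees(n):
    """Parse trees for n identifiers under  id: ID | id symbol id."""
    if n == 1:
        return 1
    # Any of the n-1 separators may be the top-level split point.
    return sum(ambiguous_trees(k) * ambiguous_trees(n - k) for k in range(1, n))


def left_recursive_trees(n):
    """Parse trees for n identifiers under  id_list: id | id_list ',' id."""
    # The last separator is always the top-level split, so there is one tree.
    return 1 if n >= 1 else 0
```

For three identifiers ambiguous_trees(3) is 2, matching the two groupings described above, while the left-recursive form always yields exactly one tree.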
For statements, you probably want something more like:
statement_list: | statement_list statement ';'
Speaking of symbol, do you really believe that , and ; have the same syntactic function? That seems unlikely, since you write var a, b, c, d; and not, for example, var a; b, c; d,.
The "useless rule" warning produced by bison is exactly because your grammar allows ids to be separated with semicolons. When the parser sees "var" ID with ; as lookahead, it first reduces ID to id and then needs to decide whether to reduce var id to declaration or to shift the ; in order to later reduce it to symbol and then proceed with the reduction of id symbol id. In the absence of precedence rules, bison always resolves shift/reduce conflicts in favour of shifting, so that is what it does in this case. But the result is that it is never possible to reduce "var" id to declaration, making the production useless as the result of a shift-reduce conflict resolution, which is more or less what the warning says.
In a yacc program, how do we write the action for an assignment operation using a C struct node?
Example:-
stmt: stmt stmt ';'
| exp ';' {printtree();}
| bool ';' {...}
| VAR ASSIGN exp ';' { /* How to store this value to VAR using node? */ }
...
;
exp: exp PLUS exp { /* stores the char '+' in a syntax tree node whose
left child is $1 and whose right child is $3 */
make_operator($1,'+',$3); }
| exp MINUS exp {...}
...
;
It would be of great help if someone can suggest a solution for this.
The answer is that since your Yacc parser is not actually executing the code, but producing an abstract syntax tree (as evidenced by the use of a make_operator function in the PLUS operation), the same thing is done for the assignment. It could be as simple as:
stmt: stmt stmt ';'
| exp ';' {printtree();}
| bool ';' {...}
| VAR ASSIGN exp ';' {$$ = make_operator($1, '=', $3);}
...
;
The actual job of generating the code to perform the assignment will be done by other passes over the syntax tree which is constructed by the parser. Those passes will have to do things like ensuring that VAR is actually defined in the given scope and so on, depending on the rules of the language: does it have the right type, is it modifiable, ...
A translation scheme for assignments (at least of a simple scalar variable which fits into a register) is:
1. Generate the code to calculate the address of the assignment target, such that this code leaves the value in a new temporary register, call it t1.
2. Generate the code to calculate the value of the expression, leaving it in another register t2.
3. Generate the code mem[t1] := t2 which represents store the value of t2 into the memory location pointed at by t1. (Of course, this intermediate code isn't literally represented by text such as mem[t1] := t2, but rather some instruction data structure. The text is just a printed notation so we can discuss it.)
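The three steps above can be sketched as a small code generator. This Python version is only an illustration under stated assumptions: the AST shape (ints for constants, strings for variable names, (op, left, right) tuples for operators), the addr(...) notation, and the function name gen_assign are all hypothetical, and instructions are kept as printable strings purely for readability.

```python
import itertools


def gen_assign(target, expr):
    """Return printable three-address instructions for `target = expr`."""
    temps = (f"t{i}" for i in itertools.count(1))
    code = []

    def gen_expr(node):
        # Emit code for `node` and return the temporary holding its value.
        if isinstance(node, int):
            t = next(temps)
            code.append(f"{t} := {node}")
            return t
        if isinstance(node, str):
            addr = next(temps)
            code.append(f"{addr} := addr({node})")
            val = next(temps)
            code.append(f"{val} := mem[{addr}]")
            return val
        op, left, right = node
        l = gen_expr(left)
        r = gen_expr(right)
        t = next(temps)
        code.append(f"{t} := {l} {op} {r}")
        return t

    # Step 1: address of the assignment target in a fresh temporary.
    addr = next(temps)
    code.append(f"{addr} := addr({target})")
    # Step 2: value of the right-hand side in another temporary.
    value = gen_expr(expr)
    # Step 3: store the value at the target address.
    code.append(f"mem[{addr}] := {value}")
    return code
```

For a constant right-hand side, gen_assign("x", 5) produces t1 := addr(x), then t2 := 5, then mem[t1] := t2, which mirrors the notation used in the steps.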
I'm developing a domain-specific language. Part of the language is exactly like C expression parsing semantics, such as precedence and symbols.
I'm using the Lemon parser. I ran into an issue of the same token being used for two different things, and I can't tell the difference in the lexer. The ampersand (&) symbol is used for both 'bitwise and' and "address of".
At first I thought it was a trivial issue, until I realized that they don't have the same associativity.
How do I give the same token two different associativities? Should I just use AMP (as in ampersand) and make the addressof and bitwise and rules use AMP, or should I use different tokens (such as ADDRESSOF and BITWISE_AND)? If I do use separate symbols, how am I supposed to tell them apart in the lexer (it can't know without being a parser itself!)?
If you're going to write the rules out explicitly, using a different non-terminal for every "precedence" level, then you do not need to declare precedence at all, and you should not do so.
Lemon, like all yacc-derivatives, uses precedence declarations to remove ambiguities from ambiguous grammars. The particular ambiguous grammar referred to is this one:
expression: expression '+' expression
| expression '*' expression
| '&' expression
| ... etc, etc.
In that case, every alternative leads to a shift-reduce conflict. If your parser generator didn't have precedence rules, or you wanted to be precise, you'd have to write that as an unambiguous grammar (which is what you've done):
term: ID | NUMBER | '(' expression ')' ;
postfix_expr: term | term '[' expression ']' | ... ;
unary_expr: postfix_expr | '&' unary_expr | '*' unary_expr | ... ;
multiplicative_expr: unary_expr | multiplicative_expr '*' unary_expr | ... ;
additive_expr: multiplicative_expr | additive_expr '+' multiplicative_expr | ... ;
...
assignment_expr: conditional_expr | unary_expr '=' assignment_expr | ...;
expression: assignment_expr ;
[1]
Note that the unambiguous grammar even encodes left-associativity (multiplicative and additive, above) and right-associativity (assignment, although it's a bit weird, see below). So there are really no ambiguities.
Now, the precedence declarations (%left, %right, etc.) are only used to disambiguate. If there are no ambiguities, the declarations are ignored. The parser generator does not even check that they reflect the grammar. (In fact, many grammars cannot be expressed as this kind of precedence relationship.)
Consequently, it's a really bad idea to include precedence declarations if the grammar is unambiguous. They might be completely wrong, and mislead anyone who reads the grammar. Changing them will not affect the way the language is parsed, which might mislead anyone who wants to edit the grammar.
There is at least some question about whether it's better to use an ambiguous grammar with precedence rules or to use an unambiguous grammar. In the case of C-like languages, whose grammar cannot be expressed with a simple precedence list, it's probably better to just use the unambiguous grammar. However, unambiguous grammars have a lot more states and may make parsing slightly slower, unless the parser generator is able to optimize away the unit-reductions (all of the first alternatives in the above grammar, where each expression-type might just be the previous expression-type without affecting the AST; each of these productions needs to be reduced, although it's mostly a no-op, and many parser generators will insert some code.)
The reason C cannot simply be expressed as a precedence relationship is precisely the assignment operator. Consider:
a = 4 + b = c + 4;
This doesn't parse because in assignment-expression, the assignment operator can only apply on the left to a unary-expression. This doesn't reflect either possible numeric precedence between + and =. [2]
If + were of higher precedence than =, the expression would parse as:
a = ((4 + b) = (c + 4));
and if + were lower precedence, it would parse as
((a = 4) + (b = c)) + 4;
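The claim that the statement does not parse under the unambiguous grammar can be checked with a small hand-written parser. The Python sketch below is an illustration under simplifying assumptions: only + and = are supported, a unary expression is reduced to a single name or number, and the function name parse_assignment is hypothetical.

```python
def parse_assignment(tokens):
    """Parse per  assignment_expr: additive_expr | unary_expr '=' assignment_expr."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def unary():
        # Simplification: a unary expression is just one name or number.
        nonlocal pos
        tok = peek()
        if tok is None or tok in ("+", "="):
            raise SyntaxError(f"unexpected {tok!r}")
        pos += 1
        return tok

    def additive():
        nonlocal pos
        left = unary()
        while peek() == "+":
            pos += 1
            left = ("+", left, unary())  # left-associative
        return left

    def assignment():
        nonlocal pos
        left = additive()
        if peek() == "=":
            # Only a unary expression may appear to the left of '='.
            if isinstance(left, tuple):
                raise SyntaxError("left side of '=' must be a unary expression")
            pos += 1
            return ("=", left, assignment())  # right-associative
        return left

    tree = assignment()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree
```

Calling parse_assignment("a = b = c + 4".split()) yields a right-nested tree, whereas parse_assignment("a = 4 + b = c + 4".split()) raises SyntaxError, because 4 + b is not a unary expression.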
[1] I just realized that I left out cast_expression, but I can't be cast to put it back in; you get the idea.
[2] Description fixed.
Later I realized I had the same ambiguity between dereference (*) and multiplication, also (*).
Lemon provides a way to assign a precedence to a rule: put the name of a token used in the associativity declarations (%left/%right/%nonassoc) in square brackets after the period.
I haven't verified that this works correctly yet, but I think you can do this (note the things in square brackets near the end):
.
.
.
%left COMMA.
%right QUESTION ASSIGN
ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
LSH_ASSIGN RSH_ASSIGN AND_ASSIGN XOR_ASSIGN OR_ASSIGN.
%left LOGICAL_OR.
%left LOGICAL_AND.
%left BITWISE_OR.
%left BITWISE_XOR.
%left BITWISE_AND.
%left EQ NE.
%left LT LE GT GE.
%left LSHIFT RSHIFT.
%left PLUS MINUS.
%left TIMES DIVIDE MOD.
//%left MEMBER_INDIRECT ->* .*
%right INCREMENT DECREMENT CALL INDEX DOT INDIRECT ADDRESSOF DEREFERENCE.
.
.
.
multiplicative_expr ::= cast_expr.
multiplicative_expr(A) ::= multiplicative_expr(B) STAR cast_expr(C). [TIMES]
{ A = Node_2_Op(Op_Mul, B, C); }
.
.
.
unary_expr(A) ::= STAR unary_expr(B). [DEREFERENCE]
{ A = Node_1_Op(Op_Dereference, B); }