Solving parsing conflicts in a tiny Lemon grammar - parser-generator

I'm trying to learn basics of the Lemon parser generator, but I stuck quickly.
Here's a tiny grammar:
%right PLUS_PLUS.
%left DOT.
program ::= expr.
member_expr ::= expr DOT IDENTIFIER.
lhs_expr ::= member_expr.
expr ::= lhs_expr.
expr ::= PLUS_PLUS lhs_expr.
It causes 1 parsing conflict:
State 3:
(3) expr ::= lhs_expr *
(4) expr ::= PLUS_PLUS lhs_expr *
DOT reduce 3 expr ::= lhs_expr
DOT reduce 4 ** Parsing conflict **
{default} reduce 4 expr ::= PLUS_PLUS lhs_expr
Whereas, if I rewrite the last rule as follows:
expr ::= PLUS_PLUS expr DOT IDENTIFIER.
Then it causes no conflicts. But I don't think that this is the right way to go.
I'd appreciate if someone could explain what's the right way, and why.

So you wrote an ambiguous grammar, which says to accept:
++ x . y
with two interpretations:
[++ x ] . y
and
++ [x . y]
where the [ ] are just my way to show to the groupings.
Lemon is an L(AL)R parser, and such parsers simply don't handle ambiguities (multiple interpretations). The reduce-reduce conflict reported is what happens when the parser hits that middle dot; does it group "++ x" as "[++ x] ." or as "++ [ x .]"? Both choices are valid, and it can't choose safely.
If you stick with Lemon (or another LALR parser generator), you have to get rid of the problem by changing the grammar. [You could use a GLR parser generator; it would accept and give you both parses. But all you've done then is to push the problem of resolving the ambiguity to the post-parsing phrase. As you don't want an ambiguity, you might as well avoid it during parsing if you can. In this case I think you can.]
I think you are trying to build a C-like language.
So you want something like this:
primitive_target ::= IDENTIFIER ;
primitive_target ::= IDENTIFIER '[' expr ']' ;
access_path ::= primitive_target ;
access_path ::= access_path '.' primitive_target ;
lhs ::= access_path ;
lhs ::= PLUS_PLUS access_path ;
lhs ::= access_path PLUS_PLUS ;
program ::= expr ;
expr ::= term ;
expr ::= expr '+' term ;
term :::= '(' expr ')' ;
term ::= lhs ;
term ::= lhs '=' expr ;
term ::= constant ;

Related

How to create Abstract Syntax Tree nodes when the operator and operands are not in the same production?

So in all cases of AST examples, there are productions of the following kind:
expr -> expr "+" expr;
expr -> expr "-" expr;
And in this case it's easy to create a new node like this:
expr: expr "+" expr {newNode("+",$1,$3);}
;
Now my grammar has the following implementation:
assignment:IDENTIFIER '=' expression ';'
;
expression:term expression_1
;
expression_1: '+' term expression_1 |
'-' term expression_1 |
;
term: factor term_1
;
term_1: '*' factor term_1 |
'/' factor term_1 |
;
factor: IDENTIFIER |
'(' expression ')' |
NUM | FNUM | STRING
;
Here, while making a new node, how do I take the first operand(which is in a previous production), and feed that into a newNode function which will have the operator and the second operand(both of these are together in a different production)?

How to define standard mathematical notation with mpc parser

I am reading a compiler tutorial here www.buildyourownlisp.com. It uses a parser combinator called mpc. What I have at the moment will parse polish notation, but I'm trying to work out how to use standard notation with it. I just can't seem to how to do it.
The rules of the parser are as follows:
. Any character is required.
a The character a is required.
[abcdef] Any character in the set abcdef is required.
[a-f] Any character in the range a to f is required.
a? The character a is optional.
a* Zero or more of the character a are required.
a+ One or more of the character a are required.
^ The start of input is required.
$ The end of input is required.
"ab" The string ab is required.
'a' The character a is required.
'a' 'b' First 'a' is required, then 'b' is required.
'a' | 'b' Either 'a' is required, or 'b' is required.
'a'* Zero or more 'a' are required.
'a'+ One or more 'a' are required.
<abba> The rule called abba is required.
The polish notation is written like this:
" \
number: /-?[0-9]+/'.'?/[0-9]+/ ; \
operator: '+' | '-' | '*' | '/' | '%' | \"add\" | \"sub\" | \"mul\" | \"div\" ; \
expr: <number> | '(' <operator> <expr>+ ')' ; \
dlispy: /^/ <operator> <expr>+ /$/ ;",
I've managed to make it accept decimal numbers by adding '.'?/[0-9]+/, but i can't work out how to restructure it to accept standard notation eg 2*(3+2) instead of *2 (+ 3 2). I know that I'll have to rewrite the expr and the dlispy rules, but I'm new to regex and BNF. Hope you can help, thanks
Written as yacc rules, that would be:
expr : '(' expr ')
| expr operator expr
| number
;

Bison shift/reduce conflicts warning [duplicate]

I'm having trouble fixing a shift reduce conflict in my grammar. I tried to add -v to read the output of the issue and it guides me towards State 0 and mentions that my INT and FLOAT is reduced to variable_definitions by rule 9. I cannot see the conflict and I'm having trouble finding a solution.
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token INT FLOAT
%token ADDOP MULOP INCOP
%token WHILE IF ELSE RETURN
%token NUM ID
%token INCLUDE
%token STREAMIN ENDL STREAMOUT
%token CIN COUT
%token NOT
%token FLT_LITERAL INT_LITERAL STR_LITERAL
%right ASSIGNOP
%left AND OR
%left RELOP
%%
program: variable_definitions
| function_definitions
;
function_definitions: function_head block
| function_definitions function_head block
;
identifier_list: ID
| ID '[' INT_LITERAL ']'
| identifier_list ',' ID
| identifier_list ',' ID '[' INT_LITERAL ']'
;
variable_definitions:
| variable_definitions type identifier_list ';'
;
type: INT
| FLOAT
;
function_head: type ID arguments
;
arguments: '('parameter_list')'
;
parameter_list:
|parameters
;
parameters: type ID
| type ID '['']'
| parameters ',' type ID
| parameters ',' type ID '['']'
;
block: '{'variable_definitions statements'}'
;
statements:
| statements statement
;
statement: expression ';'
| compound_statement
| RETURN expression ';'
| IF '('bool_expression')' statement ELSE statement
| WHILE '('bool_expression')' statement
| input_statement ';'
| output_statement ';'
;
input_statement: CIN
| input_statement STREAMIN variable
;
output_statement: COUT
| output_statement STREAMOUT expression
| output_statement STREAMOUT STR_LITERAL
| output_statement STREAMOUT ENDL
;
compound_statement: '{'statements'}'
;
variable: ID
| ID '['expression']'
;
expression_list:
| expressions
;
expressions: expression
| expressions ',' expression
;
expression: variable ASSIGNOP expression
| variable INCOP expression
| simple_expression
;
simple_expression: term
| ADDOP term
| simple_expression ADDOP term
;
term: factor
| term MULOP factor
;
factor: ID
| ID '('expression_list')'
| literal
| '('expression')'
| ID '['expression']'
;
literal: INT_LITERAL
| FLT_LITERAL
;
bool_expression: bool_term
| bool_expression OR bool_term
;
bool_term: bool_factor
| bool_term AND bool_factor
;
bool_factor: NOT bool_factor
| '('bool_expression')'
| simple_expression RELOP simple_expression
;
%%
Your definition of a program is that it is either a list of variable definitions or a list of function definitions (program: variable_definitions | function_definitions;). That seems a bit odd to me. What if I want to define both a function and a variable? Do I have to write two programs and somehow link them together?
This is not the cause of your problem, but fixing it would probably fix the problem as well. The immediate cause is that function_definitions is one or more function definition while variable_definitions is zero or more variable definitions. In other words, the base case of the function_definitions recursion is a function definition, while the base case of variable_definitions is the empty sequence. So a list of variable definitions starts with an empty sequence.
But both function definitions and variable definitions start with a type. So if the first token of a program is int, it could be the start of a function definition with return type int or a variable definition of type int. In the former case, the parser should shift the int in order to produce the function_definitions base case:; in the latter case, it should immediately reduce an empty variable_definitions base case.
If you really wanted a program to be either function definitions or variable definitions, but not both. you would need to make variable_definitions have the same form as function_definitions, by changing the base case from empty to type identifier_list ';'. Then you could add an empty production to program so that the parser could recognize empty inputs.
But as I said at the beginning, you probably want a program to be a sequence of definitions, each of which could either be a variable or a function:
program: %empty
| program type identifier_list ';'
| program function_head block
By the way, you are misreading the output file produced by -v. It shows the following actions for State 0:
INT shift, and go to state 1
FLOAT shift, and go to state 2
INT [reduce using rule 9 (variable_definitions)]
FLOAT [reduce using rule 9 (variable_definitions)]
Here, INT and FLOAT are possible lookaheads. So the interpretation of the line INT [reduce using rule 9 (variable_definitions)] is "if the lookahead is INT, immediately reduce using production 9". Production 9 produces the empty sequence, so the reduction reduces zero tokens at the top of the parser stack into a variable_definitions. Reductions do not use the lookahead token, so after the reduction, the lookahead token is still INT.
However, the parser doesn't actually do that because it has a different action for INT, which is to shift it and go to state 1. as indicated by the first line start INT. The brackets [...] indicate that this action is not taken because it is a conflict and the resolution of the conflict was some other action. So the more accurate interpretation of that line is "if it weren't for the preceding action on INT, the lookahead INT would cause a reduction using rule 9."

Bison loop for conflict

to solve the dangling else problem, I used the following solution:
stmt : stmt_matched
| stmt_unmatched
;
stmt_unmatched : IF '(' exp ')' stmt
| IF '(' exp ')' stmt_matched ELSE stmt_unmatched
;
stmt_matched : IF '(' exp ')' stmt_matched ELSE stmt_matched
| stmt_for
| ...
;
For defining the rules of grammar about the for loop, I produce a conflict shift/reduce due to the same problem:
stmt_for : FOR '(' exp ';' exp ';' exp ')' stmt
;
How can I solve this problem?
Not all for statements are matched. Consider, for example
if (c) for (;;) if (d) ; else ;
So it is necessary to divide for statements into for_matched and for_unmatched. (And similarly with other compound statements such as while.)

Execute an action for all component of a rule in bison

I have the following rule in my bison file :
affectation: VAR '=' expr ';'
| VAR PLUSEQ expr ';'
| VAR MINUSEQ expr ';'
;
I would like the parser to display the variable name and its content every time an affectation is done. For that, I use the action {printf("%s:%s\n",$1, $3);}. However, as there are 3 forms of affectation, is there a way to apply this action to all the components without writting :
affectation: VAR '=' expr ';'
{printf("%s:%s\n",$1, $3);}
| VAR PLUSEQ expr ';'
{printf("%s:%s\n",$1, $3);}
| VAR MINUSEQ expr ';'
{printf("%s:%s\n",$1, $3);}
;
Basically, the answer is no. In most use cases, the three productions would have different semantics, so it would be normal for their to be three different actions, although they might share code. (As always, refactoring the shared code can reduce the need for duplication.)
If the three rules were really semantically identical, you could collect the different operators into a prefix rule:
aff_pfx: VAR '=' | VAR PLUSEQ | VAR MINUSEQ
affectation: aff_pfx expr ';' { handle($1, $2); }
That relies on the default action copying $$ = $1 in all the productions for aff_pfx, so it is not fully general. Also, it completely erases any distinction between the three syntaxes, which seems unlikely to be correct.
If you are just trying to produce a trace of the parse,take a look at bison's built-in debugging features.

Resources