I'm developing a domain-specific language. Part of the language follows C expression-parsing semantics exactly, such as precedence and symbols.
I'm using the Lemon parser generator. I ran into an issue where the same token is used for two different things, and the lexer can't tell the difference: the ampersand (&) symbol is used for both "bitwise and" and "address of".
At first I thought it was a trivial issue, until I realized that they don't have the same associativity.
How do I give the same token two different associativities? Should I just use AMP (as in ampersand) and make the address-of and bitwise-and rules use AMP, or should I use different tokens (such as ADDRESSOF and BITWISE_AND)? If I do use separate symbols, how am I supposed to know which one to emit from the lexer (it can't know without being a parser itself!)?
If you're going to write the rules out explicitly, using a different non-terminal for every "precedence" level, then you do not need to declare precedence at all, and you should not do so.
Lemon, like all yacc-derivatives, uses precedence declarations to remove ambiguities from ambiguous grammars. The particular ambiguous grammar referred to is this one:
expression: expression '+' expression
| expression '*' expression
| '&' expression
| ... etc, etc.
In that case, every alternative leads to a shift-reduce conflict. If your parser generator didn't have precedence rules, or you wanted to be precise, you'd have to write that as an unambiguous grammar (which is what you've done):
term: ID | NUMBER | '(' expression ')' ;
postfix_expr: term | postfix_expr '[' expression ']' | ... ;
unary_expr: postfix_expr | '&' unary_expr | '*' unary_expr | ... ;
multiplicative_expr: unary_expr | multiplicative_expr '*' unary_expr | ... ;
additive_expr: multiplicative_expr | additive_expr '+' multiplicative_expr | ... ;
...
assignment_expr: conditional_expr | unary_expr '=' assignment_expr | ...;
expression: assignment_expr ;
[1]
Note that the unambiguous grammar even encodes associativity: left-associative (multiplicative and additive, above) and right-associative (assignment, although it's a bit weird; see below). So there are really no ambiguities.
Now, the precedence declarations (%left, %right, etc.) are only used to disambiguate. If there are no ambiguities, the declarations are ignored. The parser generator does not even check that they reflect the grammar. (In fact, many grammars cannot be expressed as this kind of precedence relationship.)
Consequently, it's a really bad idea to include precedence declarations if the grammar is unambiguous. They might be completely wrong, and mislead anyone who reads the grammar. Changing them will not affect the way the language is parsed, which might mislead anyone who wants to edit the grammar.
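Concretely, for the ampersand question this means you can keep a single AMP token coming out of the lexer and simply use it in both rules of the unambiguous grammar; no %left/%right declarations are involved. A minimal sketch in Lemon syntax (and_expr, equality_expr, postfix_expr and the Node_* constructors are illustrative names, not anything Lemon provides):

/* the same AMP token appears in both rules; no precedence declarations needed */
and_expr(A)   ::= equality_expr(B).                  { A = B; }
and_expr(A)   ::= and_expr(B) AMP equality_expr(C).  { A = Node_2_Op(Op_BitAnd, B, C); }

unary_expr(A) ::= postfix_expr(B).                   { A = B; }
unary_expr(A) ::= AMP unary_expr(B).                 { A = Node_1_Op(Op_AddressOf, B); }

The left recursion in and_expr is what makes the binary operator left-associative, and the right-recursive unary_expr is what makes the prefix operator nest to the right, so the two uses of AMP get their different "associativities" from the shape of the rules alone.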
There is at least some question about whether it's better to use an ambiguous grammar with precedence rules or to use an unambiguous grammar. In the case of C-like languages, whose grammar cannot be expressed with a simple precedence list, it's probably better to just use the unambiguous grammar. However, unambiguous grammars have a lot more states and may make parsing slightly slower, unless the parser generator is able to optimize away the unit reductions (all of the first alternatives in the above grammar, where one expression type is just the next expression type down, without affecting the AST; each of these productions still has to be reduced, even though the reduction is essentially a no-op, and many parser generators will still emit code for it).
The reason C cannot simply be expressed as a precedence relationship is precisely the assignment operator. Consider:
a = 4 + b = c + 4;
This doesn't parse, because in an assignment-expression the assignment operator can only have a unary-expression on its left. That restriction doesn't correspond to either possible precedence ordering of + and =.
If + were of higher precedence than =, the expression would parse as:
a = ((4 + b) = (c + 4));
and if + were lower precedence, it would parse as
(a = 4) + (b = (c + 4));
[1] I just realized that I left out cast_expression, but I can't be bothered to put it back in; you get the idea.
Later I realized I had the same ambiguity between dereference and multiplication, which both use the asterisk (*).
Lemon provides a way to assign a precedence to a rule: put a name from the associativity declarations (%left/%right/%nonassoc) in square brackets after the period that terminates the rule.
I haven't verified that this works correctly yet, but I think you can do this (note the things in square brackets near the end):
.
.
.
%left COMMA.
%right QUESTION ASSIGN
ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
LSH_ASSIGN RSH_ASSIGN AND_ASSIGN XOR_ASSIGN OR_ASSIGN.
%left LOGICAL_OR.
%left LOGICAL_AND.
%left BITWISE_OR.
%left BITWISE_XOR.
%left BITWISE_AND.
%left EQ NE.
%left LT LE GT GE.
%left LSHIFT RSHIFT.
%left PLUS MINUS.
%left TIMES DIVIDE MOD.
//%left MEMBER_INDIRECT ->* .*
%right INCREMENT DECREMENT CALL INDEX DOT INDIRECT ADDRESSOF DEREFERENCE.
.
.
.
multiplicative_expr ::= cast_expr.
multiplicative_expr(A) ::= multiplicative_expr(B) STAR cast_expr(C). [TIMES]
{ A = Node_2_Op(Op_Mul, B, C); }
.
.
.
unary_expr(A) ::= STAR unary_expr(B). [DEREFERENCE]
{ A = Node_1_Op(Op_Dereference, B); }
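The same bracket trick answers the original ampersand question: keep a single AMP token in the lexer and tag each rule that uses it with the precedence symbol it should take. A sketch in the same style as the rules above (and_expr and equality_expr are illustrative names; BITWISE_AND and ADDRESSOF come from the declarations above):

and_expr(A) ::= and_expr(B) AMP equality_expr(C). [BITWISE_AND]
{ A = Node_2_Op(Op_BitAnd, B, C); }

unary_expr(A) ::= AMP unary_expr(B). [ADDRESSOF]
{ A = Node_1_Op(Op_AddressOf, B); }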
Related
I'm new to Bison and I'm having trouble with shift/reduce conflicts...
I'm writing the grammar rules for the C language: ID is a token that identifies a variable, and I wrote this rule to ensure that the identifier is recognized even if it is written in parentheses.
id : '(' ID ')' {printf("(ID) %s\n", $2);}
| ID {printf("ID %s\n", $1);}
;
The Bison conflict output is:
State 82
12 id: '(' ID . ')'
13 | ID .
')' shift, and go to state 22
')' [reduce using rule 13 (id)]
$default reduce using rule 13 (id)
How can I resolve this conflict?
I hope I was clear and thanks for your help.
Your id rule by itself cannot cause a shift/reduce conflict. There must be some other rule in your grammar that uses ID. For example, suppose you have an expression rule such as:
expr: '(' expr ')'
| ID
;
In the above example, ID can reduce to id or to expr and the parser doesn't know which reduction to take. Check what is in state 22.
Edit: you ask "what can I do to solve the conflict?"
I'm writing the grammar rules for the C language: ID is a token that identifies a variable, and I wrote this rule to ensure that the identifier is recognized even if it is written in parentheses.
A variable in parentheses as a left-hand side is invalid in C, so it can only occur on a right-hand side. There you can treat it as an expression, so just remove your id rule and, wherever you use id, use expr instead.
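A hedged sketch of that change (assuming, as above, that the other use of ID is an expr rule, and that ID carries a string value as in your printf actions):

expr: '(' expr ')'   { $$ = $2; }
    | ID             { printf("ID %s\n", $1); }
    ;

With the id non-terminal gone, a parenthesized identifier is simply parsed as a parenthesized expr, so there is no longer a choice for the parser to make.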
I am trying to write a simple parser using Lemon for a JavaScript-like language. I am unable to resolve a conflict error, and I suspect it is an unsolvable problem.
The conflict is between the grammar for:
{x = 10;}
and
{x:10};
The first is a statement block containing an assignment statement and the second is an expression statement defining an object.
A grammar to parse both of them results in a conflict. The minimal code is as follows:
rMod ::= rStmt.
rStmt ::= rStmtList RCURLY. {leaveScope();}
rStmtList ::= rStmtList rStmt.
rStmtList ::= LCURLY. {enterScope();}
rStmt ::= rExpr SEMI.
rExpr ::= rObj.
rObj ::= LCURLY rObjItemList RCURLY.
rObjItemList ::= rObjItemList COMMA rObjItem.
rObjItemList ::= rObjItem.
rObjItem ::= ID COLON rExpr.
rExpr ::= ID.
rExpr ::= NUM.
The out file shows the following:
State 4:
(3) rStmtList ::= LCURLY *
rObj ::= LCURLY * rObjItemList RCURLY
rObjItemList ::= * rObjItemList COMMA rObjItem
rObjItemList ::= * rObjItem
rObjItem ::= * ID COLON rExpr
ID shift 8
ID reduce 3 ** Parsing conflict **
rObjItemList shift 6
rObjItem shift-reduce 8 rObjItemList ::= rObjItem
{default} reduce 3 rStmtList ::= LCURLY
Any suggestions on how I can resolve this would be gratefully accepted. Thanks.
The heart of the problem is that you want to execute enterScope() after the brace which initiates a statement block. However, if the brace is followed by the two tokens VAR and : (ID and COLON in your grammar), then it starts an object literal, not a block. So it is impossible to know whether or not to execute the enterScope action without two-token lookahead, and Lemon does not generate LR(2) parsers. To that extent, you are correct that the problem is unsolvable. But of course there are solutions.
Probably the worst solution from any perspective (readability, complexity, verifiability) is to create an LR(1) grammar using the usual LR(2)→LR(1) transformation, which will allow you to call the enterScope(); action at the point where it is clear that a scope has been entered. This means delaying the reduction by one token. That in turn means dividing expr into two disjoint non-terminals: those expr which can start with a VAR and those which cannot. For those expr which can start with a VAR, you also need to provide a mechanism which essentially allows you to glue together a VAR and the rest of the expr; in the case of expressions, that is particularly ugly (but still possible). The goal is to be able to write:
block(A) ::= blockPrefix(B) RCURLY . { closeScope(); A = B;}
blockPrefix(A) ::= lcurlyOpen exprNotStartingVAR(E) . { A = E; }
blockPrefix(A) ::= lcurlyVAR(V) restOfExprStartingVar(R) . { A = makeExpr(V, R); }
blockPrefix(A) ::= blockPrefix(B) SEMI expr(E) . { A = appendExpr(B, E); }
lcurlyOpen ::= LCURLY . { openScope(); }
lcurlyVAR(A) ::= LCURLY VAR(V) . { openScope(); A = V; }
An alternative, which is also ugly but probably less ugly in this particular case, is to recognize a variable name followed by a colon as a single lexical token (VAR_COLON). Although that complicates the lexer (particularly since you need to recognize constructs where whitespace or even comments appear between the variable name and the colon), it makes the grammar much simpler. With that change, there is no conflict, because an object literal must start with a VAR_COLON while an expr can start only with a plain VAR (or some other, unrelated token).
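With that token, the two conflicting alternatives might look like this (a sketch reusing the names from this answer; VAR is the plain identifier token, which your grammar calls ID):

rObjItem ::= VAR_COLON rExpr.   /* object-literal entries begin with the fused token */
rExpr ::= VAR.                  /* ordinary expressions still begin with a plain VAR */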
A much simpler solution is not to try to create the scope as an inherited attribute. If we do scope resolution synthetically, once a block has been parsed, then the problem more or less vanishes:
stmt ::= expr SEMI .
stmt ::= block .
stmtList ::= stmt .
stmtList ::= stmtList stmt .
block(A) ::= LCURLY stmtList(B) RCURLY . { A = applyScope(newScope(), B); }
objLiteral ::= LCURLY itemList RCURLY .
objLiteral ::= LCURLY RCURLY .
itemList ::= item .
itemList ::= itemList COMMA item .
item ::= VAR COLON expr .
expr ::= VAR .
expr ::= objLiteral .
...
That grammar has no conflicts, but it might radically change the way you handle scopes, since it requires variable names to be scoped once a block is complete rather than doing it in-line as the parse proceeds.
However, I would argue that for most languages (including JavaScript), it is actually more convenient to do scoping at the end of a block, or even as a post-parse walk over the AST. JavaScript, unlike C, allows local variables to be declared after their first mention. Local functions can even be used before their declaration. (This is subtly different from Python, where a function declaration is an executable assignment, but the scoping rules are similar.)
As another example, C++ allows class members to be declared anywhere inside the declaration of the class, even if the member has already been mentioned inside another class member function.
And there are many other examples. These scoping rules generally benefit the programmer by allowing stylistic options (such as putting member variable definitions at the end of a class definition in C++) which would not be possible in C.
I am working on a yacc file to parse a given file and convert it to an equivalent C++ file. I have created the following grammar based on the provided syntax diagrams:
program: PROGRAMnumber id 'is' comp_stmt
;
comp_stmt: BEGINnumber statement symbol ENDnumber
;
statement: statement SEMInumber statement
| id EQnumber expression
| PRINTnumber expression
| declaration
;
declaration: VARnumber id
;
expression: term
;
term: term as_op term
| MINUSnumber term
| factor
;
factor: factor md_op factor
| ICONSTnumber
| id
| RPARENnumber expression LPARENnumber
;
as_op: PLUSnumber
| MINUSnumber
;
md_op: TIMESnumber
| DIVnumber
;
symbol: SEMInumber
| COMMAnumber
;
id: IDnumber
| id symbol id
;
The only issue I have remaining is that I am receiving this error when trying to compile with yacc.
conflicts: 14 shift/reduce
calc.y:103.17-111.41: warning: rule useless in parser due to conflicts: declaration: VARnumber id
I have resolved the only other conflict I have encountered, but I am not sure what the resolution for this conflict is. The line it should match is of the format
var a, b, c, d;
or
var a;
All of your productions intended to derive lists are ambiguous and therefore generate shift/reduce conflicts. For example:
id: id symbol id
That rule is clearly ambiguous when there are three identifiers: should the first two be reduced first, or the last two? The usual list idiom is left recursion:
id_list: id | id_list ',' id
For most languages, that would not be correct for statements, which are terminated with semi-colons, not separated by them, but that model would work for a comma-separated list of identifiers, or for a left-associative sequence of addition operators.
For statements, you probably want something more like:
statement_list: | statement_list statement ';'
Speaking of symbol, do you really believe that , and ; have the same syntactic function? That seems unlikely, since you write var a, b, c, d; and not, for example, var a; b, c; d,.
The "useless rule" warning produced by bison is exactly because your grammar allows ids to be separated with semicolons. When the parser sees "var" ID with ; as lookahead, it first reduces ID to id and then needs to decide whether to reduce var id to declaration or to shift the ; in order to later reduce it to symbol and then proceed with the reduction of id symbol id. In the absence of precedence rules, bison always resolves shift/reduce conflicts in favour of shifting, so that is what it does in this case. But the result is that it is never possible to reduce "var" id to declaration, making the production useless as the result of a shift-reduce conflict resolution, which is more or less what the warning says.
Program Text
type T is array ("A" .. "F") of Integer;
Compiler Console Output
hello.adb:4:22: discrete type required for range
Question
If my understanding is correct, clause 9 of chapter 3.6 of the Ada Reference Manual is the reason the compiler raises a compilation error:
Each index_subtype_definition or discrete_subtype_definition in an array_type_definition defines an index subtype; its type (the index type) shall be discrete.
Hence, why exactly is "A" .. "F" not discrete? What exactly does discrete mean?
Background info
The syntax requirements for array type definitions are quoted below. Source: Ada Reference Manual
array_type_definition ::= unconstrained_array_definition | constrained_array_definition
constrained_array_definition ::= array (discrete_subtype_definition {, discrete_subtype_definition}) of component_definition
discrete_subtype_definition ::= discrete_subtype_indication | range
range ::= range_attribute_reference | simple_expression .. simple_expression
simple_expression ::= [unary_adding_operator] term {binary_adding_operator term}
term ::= factor {multiplying_operator factor}
factor ::= primary [** primary] | abs primary | not primary
primary ::= numeric_literal | null | string_literal | aggregate | name | qualified_expression | allocator | (expression)
This:
"A" .. "F"
does satisfy the syntax of a range; it consists of a simple_expression, followed by .., followed by another simple_expression. So it's not a syntax error.
It's still invalid; specifically it's a semantic error. The syntax isn't the only thing that determines whether a chunk of code is valid or not. For example, "foo" * 42 is a syntactically valid expression, but it's semantically invalid because there's no * operator for a string and an integer (unless you write your own).
A discrete type is either an integer type or an enumeration type. Integer, Character, and Boolean are examples of discrete types. Floating-point types, array types, pointer types, record types, and so forth are not discrete types, so expressions of those types can't be used in a range for a discrete_subtype_indication.
This:
type T is array ("A" .. "F") of Integer;
is probably supposed to be:
type T is array ('A' .. 'F') of Integer;
String literals are of type String, which is an array type. Character literals are of type Character, which is an enumeration type and therefore a discrete type.
You wrote in a comment on another answer:
Unfortunately I'm unable to replace the string literals by character literals and recompile the code ...
If that's the case, it's quite unfortunate. The code you've posted is simply invalid; it will not compile. Your only options are to modify it or not to use it.
Ermm ... I think it is trying to tell you that you can't use String literals to specify ranges. You probably meant to use a character literal.
Reference:
http://archive.adaic.com/standards/83lrm/html/lrm-02-05.html
After all, the clauses quoted above are explicitly requiring string_literal to be used.
You have misunderstood the Ada syntax specs. Specifically, you have missed this production:
name ::= simple_name
| character_literal | operator_symbol
| indexed_component | slice
| selected_component | attribute
In a yacc program, how do we write the action for an assignment operation using a C structure node?
Example:
stmt: stmt stmt ';'
| exp ';' {printtree();}
| bool ';' {...}
| VAR ASSIGN exp ';' {//How to store this value to VAR using node?}
...
;
exp: exp PLUS exp {make_operator($1, '+', $3); /* stores a '+' operator node with
                     $1 as the left child and $3 as the right child in the syntax tree */
                  }
| exp MINUS exp {...}
...
;
It would be of great help if someone could suggest a solution for this.
Since your Yacc parser is not actually executing the code but building an abstract syntax tree (as evidenced by the use of a make_operator function in the PLUS rule), the answer is that the same thing is done for the assignment. It could be as simple as:
stmt: stmt stmt ';'
| exp ';' {printtree();}
| bool ';' {...}
| VAR ASSIGN exp ';' {$$ = make_operator($1, '=', $3);}
...
;
The actual job of generating the code to perform the assignment will be done by other passes over the syntax tree which is constructed by the parser. Those passes will have to do things like ensuring that VAR is actually defined in the given scope and so on, depending on the rules of the language: does it have the right type, is it modifiable, ...
A translation scheme for assignments (at least of a simple scalar variable which fits into a register) is:
1. Generate the code to calculate the address of the assignment target, such that this code leaves the value in a new temporary register; call it t1.
2. Generate the code to calculate the value of the expression, leaving it in another register t2.
3. Generate the code mem[t1] := t2, which represents storing the value of t2 into the memory location pointed at by t1. (Of course, this intermediate code isn't literally represented by text such as mem[t1] := t2, but rather by some instruction data structure; the text is just a printed notation so we can discuss it.)
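For example, for a hypothetical statement x = y + 1, a later pass over the tree built by the VAR ASSIGN exp rule might emit intermediate code along these lines (same informal notation as above; t1 and t2 are fresh temporaries):

t1 := addr(x)        ; step 1: address of the assignment target
t2 := y + 1          ; step 2: value of the right-hand side
mem[t1] := t2        ; step 3: store t2 through the address in t1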