Translation of if then else in compiler grammar

...
IF LP assignment-expression RP marker statement {
backpatch($3.tlist,$5.instr);
$$.nextList = mergeList($3.flist,$6.nextList);
}
|IF LP assignment-expression RP marker statement ELSE Next statement {
backpatch($3.tlist,$5.instr);
backpatch($3.flist,$8.instr);
YYSTYPE::BackpatchList *temp = mergeList($6.nextList,$8.nextList);
$$.nextList = mergeList(temp,$9.nextList);
}
...
Here, assignment-expression is any assignment expression that can be formed with the C operators =, +=, -=, *=, /=.
LP = (
RP = )
marker and Next are both empty rules.
The problem with the above grammar rule and implementation is that it can't generate correct code for an input such as:
bool a;
if(a){
printf("hi");
}
else{
printf("die");
}
It expects assignment-expression to contain a relop, OR, or AND in order to generate correct code.
The comparison code is only emitted for a relop, and the same applies to OR and AND.
The code above contains none of these, so correct code can't be generated.
Correct code can be generated with the following rule, but it leads to two reduce-reduce conflicts.
...
IF LP assignment-expression {
if($3.flist == NULL && $3.tlist == NULL)
...
} RP marker statement {
...
}
|IF LP assignment-expression{
if($3.flist == NULL && $3.tlist == NULL)
...
} RP marker statement ELSE Next statement {
...
}
...
What modifications should I make to the grammar rule so that it works as expected?
I tried the IF ELSE grammar rule from here as well as the one from the Dragon Book, but I was unable to solve this.
The whole grammar can be found here: Github

In order to insert the mid-rule action, you need to left-factor; otherwise, the bison-generated parser cannot decide which of the two MRAs to reduce. (Even though they are presumably identical, bison doesn't know that.)
if_prefix: "if" '(' expression ')' { $$ = $3; /* Normalize the flist */ }
if: if_prefix marker statement { ... }
| if_prefix marker statement "else" Next statement { ... }
(You could left factor differently; that's just one suggestion.)
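To make the "normalize the flist" comment concrete, here is a minimal C sketch of backpatching for `if (a) S1 else S2` where `a` is a plain value. The instruction table, `emit`, `normalize` and `demo_if_else` are hypothetical names invented for illustration; only the ideas of `backpatch`, a list merge, and `tlist`/`flist` come from the question, and the real grammar's data structures may well differ.

```c
#include <stdlib.h>

enum { MAX_INSTR = 64, NOT_A_JUMP = -2, UNPATCHED = -1 };

/* target[i] is the jump target of instruction i (hypothetical IR). */
int target[MAX_INSTR];
int next_instr = 0;

typedef struct BackpatchList {
    int instr;                    /* index of a jump with no target yet */
    struct BackpatchList *next;
} BackpatchList;

typedef struct { BackpatchList *tlist, *flist; } Cond;

/* Append one instruction and return its index. */
int emit(int initial_target) {
    target[next_instr] = initial_target;
    return next_instr++;
}

BackpatchList *make_list(int instr) {
    BackpatchList *n = malloc(sizeof *n);
    if (n == NULL) abort();
    n->instr = instr;
    n->next = NULL;
    return n;
}

/* Concatenate two lists; either may be NULL. */
BackpatchList *merge_list(BackpatchList *a, BackpatchList *b) {
    if (a == NULL) return b;
    BackpatchList *tail = a;
    while (tail->next != NULL) tail = tail->next;
    tail->next = b;
    return a;
}

/* Fill every jump on the list with `label`, freeing the list as we go. */
void backpatch(BackpatchList *list, int label) {
    while (list != NULL) {
        BackpatchList *next = list->next;
        target[list->instr] = label;
        free(list);
        list = next;
    }
}

/* A plain value has neither list. Emit "if value != 0 goto _" and
   "goto _" so the if-rule can treat it like any other condition. */
Cond normalize(Cond c) {
    if (c.tlist == NULL && c.flist == NULL) {
        c.tlist = make_list(emit(UNPATCHED));  /* taken when nonzero */
        c.flist = make_list(emit(UNPATCHED));  /* taken when zero */
    }
    return c;
}

/* Simulates `if (a) S1 else S2`; returns 1 if all jumps were patched. */
int demo_if_else(void) {
    next_instr = 0;
    Cond c = normalize((Cond){ NULL, NULL });          /* instr 0 and 1 */
    int then_label = next_instr;                       /* marker: 2 */
    emit(NOT_A_JUMP);                                  /* body of S1 */
    BackpatchList *skip = make_list(emit(UNPATCHED));  /* Next: instr 3 */
    int else_label = next_instr;                       /* 4 */
    emit(NOT_A_JUMP);                                  /* body of S2 */
    backpatch(c.tlist, then_label);
    backpatch(c.flist, else_label);
    backpatch(skip, next_instr);                       /* end: 5 */
    return target[0] == 2 && target[1] == 4 && target[3] == 5;
}
```

Because the normalization happens once, in the action of the shared prefix, both if-productions see a condition that always has a true list and a false list.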

It looks like your grammar has an incorrect definition of expression.
An assignment expression is only one of many non-terminals that should be able to reduce to an expression. For an if/then/else construction you generally need to allow any expression to occur between the parens. Your first example, as you point out, is perfectly valid C but doesn't contain an assignment.
In your grammar, you have this line:
/*Expression list**/
expression:
assignment-expression{}
|expression COMMA assignment-expression{}
;
However, an expression should be able to have more than assignment-expressions. Not being terribly familiar with yacc/bison, I would guess you need to change this to something like the following:
/*Expression **/
expression:
assignment-expression{}
|logical-OR-expression{}
|logical-AND-expression{}
|inclusive-OR-expression{}
|exclusive-OR-expression{}
|inclusive-AND-expression{}
|equality-expression{}
|relational-expression{}
|additive-expression{}
|multiplicative-expression{}
|exponentiation-expression{}
|unary-expression{}
|postfix-expression{}
|primary-expression{}
|expression COMMA expression{}
;
I can't really validate that this is going to work for you, and it may be imperfect, but hopefully you get the idea. Each different type of expression needs to be able to reduce to an expression. You have something very similar for statement earlier in your grammar, so this should hopefully make sense.
It might be helpful to do some reading or watch some tutorials on how LR grammars work.

count>=10? break : continue;

while(1) {
// other stuff
// there's no code in the loop after the below statement:
count>=10? break : continue; // error
}
Why does this statement give errors? Any help will be highly appreciated.
58 16 [Error] expected expression before 'break'
This is the error that the compiler gives.

Why does this statement give errors?
?: is not a "short version of if" as it is incorrectly described on many sites.
?: is not a statement, it is an operator.
An operator joins one, two or three operands to produce an expression. An expression is a piece of code that is computed and produces a value. A statement is a piece of code that does something. They are different things.
A statement can contain expressions. An expression cannot contain statements.
break and continue are statements. This is why the fragment count >= 10 ? break : continue; is not a valid statement and does not compile.
Use an if statement and it works:
if (count >= 10) {
break;
} else {
continue;
}
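For contrast, a small sketch of where ?: does fit: both arms are expressions that produce a value. The function names are made up for this example.

```c
/* Both arms of ?: are expressions of type int, so this is valid. */
int clamp_to_ten(int count) {
    return count >= 10 ? 10 : count;
}

/* The operator chooses between two values; it does not choose
   between two actions. */
const char *describe(int count) {
    return count >= 10 ? "stop" : "go";
}
```

The result of the operator is used as a value each time, which is exactly what break and continue cannot provide.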

As follows from the error message
58 16 [Error] expected expression before 'break'
in this statement with the conditional operator
count>=10? break : continue;
the compiler expects expressions instead of the statements break and continue.
According to the C Standard the conditional operator is defined the following way
logical-OR-expression ? expression : conditional-expression
As you can see it includes three expressions.
Instead of using the conditional operator you could use the if-else statement the following way
if ( count>=10 )
{
break;
}
else
{
continue;
}
But in any case this construction with break and continue statements looks bad.
It seems you should move the condition count>=10 into the loop statement itself. Otherwise, it is enough to write
if ( count>=10 )
{
break;
}
without the else part of the if statement.

Conditional operator:
a ? b : c - if a is logically true (does not evaluate to zero) then evaluate expression b, otherwise evaluate expression c
Neither break nor continue is an expression. They are statements and can't be used with the conditional operator.
Furthermore, continue as the last statement in your loop is pointless.
What you need is simply:
while(1) {
// other stuff
if(count >= 10) break;
}
or even simpler:
do {
// other stuff
} while(count < 10);
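As a quick check that the two loop shapes behave the same, here is a sketch where "other stuff" is assumed to be a single `count++` (the question does not show the real loop body):

```c
/* while(1) with an explicit break at the end of the body. */
int loop_with_break(void) {
    int count = 0;
    while (1) {
        count++;                 /* stands in for "other stuff" */
        if (count >= 10) break;
    }
    return count;
}

/* The same loop with the exit test moved into the loop condition. */
int loop_with_do_while(void) {
    int count = 0;
    do {
        count++;                 /* stands in for "other stuff" */
    } while (count < 10);
    return count;
}
```

Both functions return 10: the body runs at least once in either form, and the loop stops as soon as count reaches 10.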

LL1 grammar for IF-ELSE condition for a C program

I have to produce an LL(1) grammar that covers the IF, IF-ELSE, and IF - ELSE IF - ELSE constructs of a C program.
I was working out the FOLLOW sets and wasn't able to resolve the recursion, so I thought that maybe my grammar is wrong or doesn't satisfy the LL(1) conditions.
Can you tell me if the grammar is correct?
<MAIN> ::= int main () { <PROG> <AUX_PROG> }
<AUX_PROG> ::= <PROG> <AUX_PROG> | ε
<PROG> ::= <IF_STAT> | other | ε
<IF_STAT> ::= if ( other ) { <PROG> } <ELSE_STAT>
<ELSE_STAT> ::= else { <PROG> } | ε
follow(PROG) = { "}", if, other }
follow(AUX_PROG) = { "}" }
follow(IF_STAT) = follow(PROG) = { "}", if, other }
follow(ELSE_STAT) = follow(IF_STAT) = { "}", if, other }
follow(MAIN) = { $ }
first(MAIN) = { int }
first(AUX_PROG) = { if, other, ε }
first(PROG) = { if, other, ε }
first(IF_STAT) = { if }
first(ELSE_STAT) = { else, ε }
UPDATE: I have modified the grammar and also I have included the first and the follow.
The braces are required so that there is no dangling-else problem.

That grammar is ambiguous because <PROG> ::= ε makes <AUX_PROG> ::= <PROG> <AUX_PROG> left-recursive. If you eliminate the null production for <PROG> then the grammar is certainly LL(1).
But just being LL(1) does not demonstrate that the grammar correctly recognises the desired syntax, much less that it correctly parses each input into the desired parse tree. So it definitely depends on how you define "correct". Since your question doesn't really specify either the syntax you hope to match or the form in which you would like it to be analysed, it's hard to comment on these forms of correctness.
You're absolutely correct to note that the heart of C's dangling-else issue is that C does not require the bodies of if and else clauses to be delimited. So the following is legal C:
if (condition1) if (condition2) statement1; else statement2;
and the language's rules cause else statement2 to be bound to if (condition2), rather than the first if.
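A small sketch that makes the binding observable. The function name and the result codes are invented for this example, and a compiler may warn about the missing braces, which is the point of the demonstration.

```c
/* Returns 1 if statement1 ran, 2 if statement2 ran, 0 if neither did. */
int which_branch(int condition1, int condition2) {
    int result = 0;
    if (condition1) if (condition2) result = 1; else result = 2;
    return result;
}
```

With condition1 false, neither statement runs and the result stays 0. If the else belonged to the first if, the result would be 2 instead.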
That's often called an ambiguity, but it's actually easy to disambiguate. You'll find the disambiguation technique all over the place, including Wikipedia's somewhat ravaged entry on dangling else, or most popular programming language textbooks. However, the disambiguation technique does not result in an LL(1) grammar; you need to use a bottom-up parser. (Even an operator precedence parser can deal with it, but LALR(1) parsers are probably more common.)
As Wikipedia points out, a simple solution is to change the grammar to remove the possibility of if (c1) if (c2) .... A simple way to do that is to insist that the target of the if be delimited in some way, such as adding braces (which would in any case be required if the body were more than one statement). It's not necessary to put the same requirement on the body of the else clause, but that would probably be confusing for language users. However, it is convenient to permit chained if...else constructs like this:
if (c1) {
body1
}
else if (c2) {
body2
}
else if (c3) {
body3
}
...
That's not ambiguous, even though the body of each else is not delimited. In some languages, that construct is abbreviated by using a special elseif token (which might be spelled elif or elsif) in order to preserve the rule that else clauses must be delimited blocks. But it's not too eccentric to simply allow else if as an exception to the general rule about bodies.
So if you're designing a language, you have options. If you're implementing someone else's language (such as the one given by the instructor of a course) you need to make sure you understand what their requirements are.
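For illustration, here is a minimal recursive-descent recogniser in C for one possible braces-required variant with chained else if. The grammar in the comment is my assumption, not the asker's exact grammar, and tokens are single characters for brevity: i = if, e = else, o = other.

```c
/* Assumed grammar (LL(1), one token of lookahead):
     STMTS     ::= STMT STMTS | ε
     STMT      ::= IF_STAT | o
     IF_STAT   ::= i ( o ) BLOCK ELSE_STAT
     ELSE_STAT ::= e BLOCK | e IF_STAT | ε
     BLOCK     ::= { STMTS }                                  */

static const char *p;            /* cursor into the token string */

static int stmts(void);
static int if_stat(void);

/* Consume token t if it is the lookahead. */
static int accept(char t) {
    if (*p == t) { p++; return 1; }
    return 0;
}

static int block(void) {
    return accept('{') && stmts() && accept('}');
}

static int else_stat(void) {
    if (!accept('e')) return 1;          /* ε: no else part */
    if (*p == 'i') return if_stat();     /* chained "else if" */
    return block();                      /* otherwise braces are required */
}

static int if_stat(void) {
    return accept('i') && accept('(') && accept('o') && accept(')')
        && block() && else_stat();
}

static int stmts(void) {
    while (*p == 'i' || *p == 'o') {     /* FIRST(STMT) = { i, o } */
        if (*p == 'o') p++;
        else if (!if_stat()) return 0;
    }
    return 1;                            /* ε when lookahead is } or end */
}

/* Returns 1 if the whole token string is a valid statement list. */
int parse_program(const char *tokens) {
    p = tokens;
    return stmts() && *p == '\0';
}
```

Every decision is made by looking at one token, which is what makes the grammar LL(1). Since e never appears in the FOLLOW set of ELSE_STAT, the ε alternative is never in conflict with the else alternative.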

Is there any practical semantic difference between the comma operator and the semicolon?

Here's what I understand so far:
The comma operator allows for brevity of code, e.g. int x = 0, y = 0, z = 0 as opposed to int x = 0; int y = 0; int z = 0;. In this case it's sort of like syntactic sugar for a semicolon.
The comma operator acts as a sequence point. So in the code f(), g();, the function f() is guaranteed to execute and produce all of its side effects before g(). But the same is true if you use the code f(); g();.
The comma operator is an operator, whereas the semicolon is simply a program token that takes no part in the evaluation of expressions. Since the comma operator has such low precedence, it differs very little from the semicolon in this regard.
So, I'm wondering what is the semantic difference between these two constructs in practice? Is there any situation where using a comma would produce different results from using a semicolon?

In the case of
int x = 0, y = 0, z = 0 ;
the , is not the comma operator; it is a separator.
The semicolon is part of statements and declarations.
int i = 0; // declaration
i = i + 5; // statement
On the other hand, the comma operator is part of expressions. A semicolon can't be used where an expression is expected. For example
if(++i, i < 10) { /*...*/ } // A semicolon can't be used.

There are cases where the comma is used as a separator token, and other cases where the comma is the comma operator.
Quoting Wikipedia,
The use of the comma token as an operator is distinct from its use in function calls and definitions, variable declarations, enum declarations, and similar constructs, where it acts as a separator.
One example to clarify (borrowed directly from section 6.5.17 of the C11 standard):
You can write a function call like
f(a, (t=3, t+2), c);
here, the comma in (t=3, t+2) is a comma operator, this is valid and accepted.
However, you cannot write
f(a, (t=3; t+2), c);
this is a syntactic error.
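A compilable version of the first call, as a sketch. `sum3` and the argument values are made up so that the result can be checked.

```c
/* A stand-in for f: the commas between parameters are separators. */
int sum3(int a, int b, int c) {
    return a + b + c;
}

int comma_as_argument(void) {
    int t;
    /* The parenthesised comma expression is a single argument: t is set
       to 3 first, then t + 2 is evaluated and passed as the value. */
    return sum3(1, (t = 3, t + 2), 10);
}
```

The call receives three arguments, 1, 5 and 10, so `comma_as_argument()` returns 16.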

One situation where it does make a difference is this:
while(foo(), bar()) {
...
}
I don't know if there's any real practical use for this, but it compiles with a comma and not with a semicolon.
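A runnable sketch of the same shape, with an increment in place of foo() and a comparison in place of bar() (both substitutions are mine):

```c
int count_below_ten(void) {
    int i = 0, n = 0;        /* these commas are separators */
    while (i++, i < 10) {    /* comma operator: increment first, then test */
        n++;
    }
    return n;
}
```

The left operand is evaluated only for its side effect, and the value of the whole condition is the value of the right operand. The body runs for i = 1 through 9, so the function returns 9.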

The comma operator has been fairly well discussed in other answers. That leaves the semi-colon.
The semi-colon (;) is a statement terminator, meaning that it terminates the syntax of any statement. It also means that an expression, when followed by a semi-colon, is turned into a statement:
foo(); // a statement
bar(); // a statement
3+5; // a statement
(t=3, t+2); // a statement
while(foo(), bar()); // a statement
while(foo(), bar()) {
; // empty statement
}
The semi-colon also terminates declarations.

Bison/Flex print value of terminal from alternative

I have written a simple grammar:
operations :
/* empty */
| operations operation ';'
| operations operation_id ';'
;
operation :
NUM operator NUM
{
printf("%d\n%d\n",$1, $3);
}
;
operation_id :
WORD operator WORD
{
printf("%s\n%s\n%s\n",$1, $3, $<string>2);
}
;
operator :
'+' | '-' | '*' | '/'
{
$<string>$ = strdup(yytext);
}
;
As you can see, I have defined an operator that recognizes one of 4 symbols. Now, I want to print this symbol in operation_id. The problem is that the logic in operator works only for the last symbol in the alternative.
So if I write a/b; it prints ab/ and that's fine. But for other operations, e.g. a+b; it prints aba. What am I doing wrong?
*I omitted the newline characters in the example output.

This non-terminal from your grammar is just plain wrong.
operator :
'+' | '-' | '*' | '/' { $<string>$ = strdup(yytext); }
;
First, in yacc/bison, each production has an action. That rule has four productions, of which only the last has an associated action. It would be clearer to write it like this:
operator : '+'
| '-'
| '*'
| '/' { $<string>$ = strdup(yytext); }
;
which makes it a bit more obvious that the action only applies to the reduction from the token '/'.
The action itself is incorrect as well. yytext should never be used outside of a lexer action, because its value isn't reliable; it will be the value at the time the most recent lexer action was taken, but since the parser usually (but not always) reads one token ahead, it will usually (but not always) be the string associated with the next token. That's why the usual advice is to make a copy of yytext, but the idea is to copy it in the lexer rule, assigning the copy to the appropriate member of yylval so that the parser can use the semantic value of the token.
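The lookahead effect can be imitated in plain C. This is only a simulation: `fake_yytext`, `fake_yylval` and `fake_yylex` are stand-ins I made up, not the real flex/bison objects, and tokens are single characters.

```c
#include <stdlib.h>
#include <string.h>

char fake_yytext[32];   /* stand-in for the scanner's shared buffer */
char *fake_yylval;      /* stand-in for the token's semantic value */

/* Scan one single-character token. Like a lexer action, it stores an
   owned copy of the text in fake_yylval. Returns 0 at end of input. */
int fake_yylex(const char **src) {
    if (**src == '\0') return 0;
    fake_yytext[0] = **src;
    fake_yytext[1] = '\0';
    (*src)++;
    fake_yylval = malloc(strlen(fake_yytext) + 1);
    if (fake_yylval == NULL) abort();
    strcpy(fake_yylval, fake_yytext);
    return 1;
}

/* Returns 1 if the copy made at scan time still holds "a" after the
   lookahead token has been read, while the shared buffer holds "+". */
int lookahead_demo(void) {
    const char *src = "a+";
    fake_yylex(&src);
    char *word = fake_yylval;    /* semantic value of the first token */
    fake_yylex(&src);            /* the parser reads one token ahead */
    char *op = fake_yylval;
    int ok = strcmp(word, "a") == 0 && strcmp(fake_yytext, "+") == 0;
    free(word);                  /* each copy is freed exactly once */
    free(op);
    return ok;
}
```

By the time an action for the first token would run, the shared buffer already describes the next token. Only the copy taken in the scanner is still correct.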
You should avoid the use of $<type>$ =. A non-terminal can only have one type, and it should be declared in the prologue to the bison file:
%type <string> operator
Finally, you will find that it is very rarely useful to have a non-terminal which recognizes different operators, because the different operators are syntactically different. In a more complete expression grammar, you'd need to distinguish between a + b * c, which is the sum of a and the product of b and c, and a * b + c, which is the sum of c and the product of a and b. That can be done by using different non-terminals for the sum and product syntaxes, or by using different productions for an expression non-terminal and disambiguating with precedence rules, but in both cases you will not be able to use an operator non-terminal which produces + and * indiscriminately.
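The "different non-terminals" approach can be sketched outside of bison as a tiny hand-written evaluator. The function names are hypothetical, and the sketch assumes well-formed input made of digits, + and * only, with no error handling.

```c
#include <ctype.h>

static const char *cur;          /* cursor into the input text */

/* number ::= digit+ */
static int number(void) {
    int value = 0;
    while (isdigit((unsigned char)*cur)) {
        value = value * 10 + (*cur - '0');
        cur++;
    }
    return value;
}

/* product ::= number ( '*' number )* */
static int product(void) {
    int value = number();
    while (*cur == '*') {
        cur++;
        value *= number();
    }
    return value;
}

/* sum ::= product ( '+' product )* */
static int sum(void) {
    int value = product();
    while (*cur == '+') {
        cur++;
        value += product();
    }
    return value;
}

int evaluate(const char *text) {
    cur = text;
    return sum();
}
```

Here `evaluate("1+2*3")` gives 7 and `evaluate("1*2+3")` gives 5. The two operators are handled at different levels of the grammar, which a single operator non-terminal could not express.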
For what it's worth, here is the explanation of why a+b results in the output of aba:
The production operator : '+' has no explicit action, so it ends up using the default action, which is $$ = $1.
However, the lexer rule which returns '+' (presumably -- I'm guessing here) never sets yylval. So yylval still has the value it was last assigned.
Presumably (another guess), the lexer rule which produces WORD correctly sets yylval.string = strdup(yytext);. So the semantic value of the '+' token is the semantic value of the previous WORD token, which is to say a pointer to the string "a".
So when the rule
operation_id :
WORD operator WORD
{
printf("%s\n%s\n%s\n",$1, $3, $<string>2);
}
;
executes, $1 and $2 both have the value "a" (two pointers to the same string), and $3 has the value "b".
Clearly, it is semantically incorrect for $2 to have the value "a", but there is another error waiting to occur. As written, your parser leaks memory because you never free() any of the strings created by strdup. That's not very satisfactory, and at some point you will want to fix the actions so that semantic values are freed when they are no longer required. At that point, you will discover that having two semantic values pointing at the same block of allocated memory makes it highly likely that free() will be called twice on the same memory block, which is Undefined Behaviour (and likely to produce very difficult-to-diagnose bugs).

Checking Valid Arithmetic Expression in Lex (in C)

I have to write code in lex that checks whether an arithmetic expression is valid. I am aware that I could do this very easily using yacc, but doing it only in lex is not so easy.
I have written the code below, which for some reason doesn't work.
Besides this, I also don't understand how to handle binary operators.
My wrong code:
%{
#include <stdio.h>
/* Will be using stack to check the validity of arithmetic expressions */
char stack[100];
int top = 0;
int validity =0;
%}
operand [a-zA-Z0-9_]+
%%
/* Will consider unary operators (++,--), binary operators(+,-,*,/,^), braces((,)) and assignment operators (=,+=,-=,*=,^=) */
"(" { stack[top++]='(';}
")" { if(stack[top]!=')') yerror(); else top--;}
[+|"-"|*|/|^|%] { if(stack[top]!='$') yerror(); else stack[top]=='&';}
"++" { if(stack[top]!='$') yerror(); else top--;}
[+"-"*^%]?= { if(top) yerror();}
operand { if(stack[top]=='&') top--; else stack[top++]='$';}
%%
int yerror()
{
printf("Invalid Arithmetic Expression\n");
}

First, learn how to write regular expressions in Flex. (Patterns, Flex manual).
Inside a character class ([…]), neither quotes nor stars nor vertical bars are special. To include a - or a ], you can escape them with a \ or put them at the beginning of the list, or in the case of - at the end.
So in:
[+|"-"|*|/|^|%]
The | is just another character in the list, and including it five times doesn't change anything. "-" is a character range consisting only of the character ", although I suppose the intention was to include a -. Probably you wanted [-+*/^%] or [+\-*/^%].
There is no way that the flex scanner can guess that a + (for example) is a unary operator instead of a binary operator, and putting it twice in the list of rules won't do anything; the first rule will always take effect.
Finally, if you use definitions (like operand) in your patterns, you have to enclose them in braces: {operand}; otherwise, flex will interpret it as a simple keyword.
And a hint for the assignment itself: A valid unparenthesized arithmetic expression can be simplified into the regular expression:
term {prefix-operator}*{operand}{postfix-operator}*
expr {term}({infix-operator}{term})*
But you can't use that directly because (a) it doesn't deal with parentheses, (b) you probably need to allow whitespace, and (c) it doesn't correctly reject a+++++b because C insists on the "maximal munch" rule for lexical scans, so that is not the same as the correct expression a++ + ++b.
You can, however, translate the above regular expression into a very simple two-state state machine.
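One way such a state machine might look, written in plain C rather than as flex rules. Everything here is an assumption made for illustration: the function name, the operator set, treating only + and - as prefix operators, leaving out ++ and --, and the parenthesis depth counter, which goes beyond the two states.

```c
#include <ctype.h>
#include <string.h>

enum state { WANT_OPERAND, WANT_OPERATOR };

/* Returns 1 if s is a valid expression under the assumptions above. */
int is_valid_expression(const char *s) {
    enum state st = WANT_OPERAND;
    int depth = 0;                           /* open parentheses */
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (isspace(c)) continue;            /* whitespace is allowed */
        if (st == WANT_OPERAND) {
            if (isalnum(c) || c == '_') {
                /* Swallow the rest of the operand. */
                while (isalnum((unsigned char)s[1]) || s[1] == '_') s++;
                st = WANT_OPERATOR;
            } else if (c == '(') {
                depth++;
            } else if (c == '+' || c == '-') {
                /* Prefix operator: still waiting for an operand. */
            } else {
                return 0;
            }
        } else {
            if (strchr("+-*/^%", c) != NULL) {
                st = WANT_OPERAND;           /* infix operator */
            } else if (c == ')') {
                if (--depth < 0) return 0;   /* unmatched ')' */
            } else {
                return 0;
            }
        }
    }
    /* Must end after an operand, with all parentheses closed. */
    return st == WANT_OPERATOR && depth == 0;
}
```

This accepts inputs such as `a+b*c`, `(a + b) * c` and `-a`, and rejects `a+*b`, `a+`, `(a+b` and `a b`. It does not apply the maximal munch rule, so `a++b` is accepted as `a + +b`.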