Why doesn't YACC generate a shift-reduce conflict?

I am trying to understand shift-reduce conflicts and how to fix them.
I have the following YACC code, for which I was expecting a shift-reduce conflict, but Bison doesn't generate any such warning:
%%
lang_cons: /* empty */
| declaraion // SEMI_COLON
| func
;
declaraion : keyword ID
;
func : keyword ID SEMI_COLON
;
keyword : INT
| FLOAT
;
%%
But if I uncomment SEMI_COLON in the second rule (i.e., | declaraion SEMI_COLON), I get a shift-reduce conflict. I was expecting a reduce-reduce conflict in this case. Please help me understand this mess!
PS: Consider input,
1) int varName
2) int func;

If you give bison the -v command line flag, it will produce an .output file containing the generated state machine, which will probably help you see what is going on.
Note that bison actually parses the augmented grammar, which consists of your grammar with the additional rule
start': start END
where END is a special token whose code is 0, indicating the end of input, and start is whatever your grammar uses as a start symbol. (That ensures that the bison parser will not silently ignore garbage at the end of an otherwise valid input.)
That makes your original grammar unambiguous; after varName is seen, the lookahead will be either END, in which case declaration is reduced, or ';', which will be shifted (followed by a reduction of func when the following END is seen).
In your second grammar, the conflict involves the choice between reducing declaration or shifting the semicolon. If the semicolon were part of declaration, then you would see a reduce/reduce conflict.
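The lookahead decision described above can be sketched in C. This is only an illustration of what the generated parser does in the state reached after keyword ID in the first grammar; the enum names and the function action_after_keyword_id are hypothetical and are not part of bison's generated code.

```c
/* Hypothetical token codes; the end-of-input token is 0, as in bison. */
enum token { TOK_END = 0, TOK_SEMI_COLON, TOK_ID, TOK_INT, TOK_FLOAT };

/* Hypothetical parser actions for this one state. */
enum action { ACT_ERROR, ACT_SHIFT, ACT_REDUCE_DECL };

/* Action taken after `keyword ID` has been seen (first grammar only). */
enum action action_after_keyword_id(enum token lookahead)
{
    switch (lookahead) {
    case TOK_END:        return ACT_REDUCE_DECL; /* declaraion : keyword ID */
    case TOK_SEMI_COLON: return ACT_SHIFT;       /* continue towards func   */
    default:             return ACT_ERROR;
    }
}
```

Each lookahead selects exactly one action, which is why there is no conflict. In the second grammar the same state would need both the shift and the reduction on ';', and that is the reported shift-reduce conflict.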


How to find a substring

Given a text file containing a string, I would like to find some specific substrings/sequences inside this string.
Bison file .y (Declaration+Rules)
%token <cval> AMINO
%token STARTCODON STOPCODON
%type <cval> seq
%%
series: STARTCODON seq STOPCODON {printf("%s", $2);}
seq: AMINO
| seq AMINO
;
%%
Here I want to print every sequence between STARTCODON and STOPCODON
Flex file .l (Rules)
%%
("ATG")+ {return STARTCODON;}
("TAA"|"TAG"|"TGA")+ {return STOPCODON;}
("GCT"|"GCC"|"GCA"|"GCG")+ {yylval.cval = 'A';
return AMINO;}
("CGT"|"CGC"|"CGA"|"CGG"|"AGA"|"AGG")+ {yylval.cval = 'R';
return AMINO;}
.
.
.
[ \t]+ /*ignore whitespace*/
\n /*ignore end of line*/
. {printf("-");}
%%
When I run the code I get only the output of the rule . {printf("-");}.
I am new to Flex/Bison. I suspect that:
The bison rule series: STARTCODON seq STOPCODON {printf("%s", $2);} is not correct.
Flex doesn't subdivide correctly the entire string into tokens of 3 characters.
EDIT:
(Example) Input file: DnaSequence.txt:
Input string: cccATGAATTATTAGzzz, where the lowercase characters (ccc, zzz) produce the (right) output -, ATG is the STARTCODON, AATTAT is a sequence of two AMINO tokens (AAT TAT), and TAG is the STOPCODON.
This input string produces the (wrong) output ---.
EDIT:
Following the suggestions of @JohnBollinger, I have added <<EOF>> {return ENDTXT;} in the Flex file, and the rule finalseries: series ENDTXT; in the Bison file.
Now it's returning the yyerror's error message, indicating a parsing error.
I suppose that we need a STARTTXT token, but I don't know how to implement it.
I am new to Flex/Bison. I suspect that:
The bison rule series: STARTCODON seq STOPCODON {printf("%s", $2);} is not correct.
The rule is syntactically acceptable. It would be semantically correct if the semantic value of symbol 2 ($2, the seq) were a C string, in which case it would cause that value to be printed to the standard output, but your Flex file appears to assume that type <cval> is char, which is not a C string, nor directly convertible to one.
Flex doesn't subdivide correctly the entire string into tokens of 3 characters.
Your Flex input looks OK to me, actually. And the example input / output you present indicates that Flex is indeed recognizing all your triplets from ATG to TAG, else the rule for . would be triggered more than three times.
The datatype problem is a detail that you'll need to sort out, but the main problem is that your production for seq does not set its semantic value. How that results in (seemingly) nothing being printed when the series production is used for a reduction depends on details that you have not disclosed, and probably involves undefined behavior.
If <cval> were declared as a string (char *), and if your lexer set its values as strings rather than as characters, then setting the semantic value might look something like this:
seq: AMINO { $$ = calloc(MAX_AMINO_ACIDS + 1, 1); /* check for allocation failure ... */
strcpy($$, $1); }
| seq AMINO { $$ = $1; strcat($$, $2); }
;
You might consider sticking with char as the type for the semantic value of AMINO, and defining seq to have a different type (i.e. char *). That way, your changes could be restricted to the grammar file. That would, however, call for a different implementation of the semantic actions in the production for seq.
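A minimal sketch of that alternative, assuming AMINO keeps a char value and seq is given a char * type. The helper name seq_append is hypothetical; it grows the string by one character per AMINO instead of relying on a fixed MAX_AMINO_ACIDS bound.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: append one amino-acid code to a heap-allocated,
   NUL-terminated sequence. Pass NULL as seq to start a new sequence.
   Returns NULL (and frees seq) if memory cannot be obtained. */
char *seq_append(char *seq, char amino)
{
    assert(amino != '\0');              /* a NUL code would end the string */
    size_t len = seq ? strlen(seq) : 0;
    char *grown = realloc(seq, len + 2); /* one new char plus terminator */
    if (grown == NULL) {
        free(seq);
        return NULL;
    }
    grown[len] = amino;
    grown[len + 1] = '\0';
    return grown;
}
```

With such a helper, the actions might read seq: AMINO { $$ = seq_append(NULL, $1); } | seq AMINO { $$ = seq_append($1, $2); } and the series action could print $2 with %s and then free it.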
Finally, note that although you say
Here I want to print every sequence between STARTCODON and STOPCODON
your grammar, as presented, has series as its start symbol. Thus, once it reduces the token sequence to a series, it expects to be done. If additional tokens follow (say those of another series) then that would be erroneous. If that's something you need to support then you'll need a higher-level start symbol representing a sequence of multiple series.

Why does a string in parenthesis compile and what does it compile to?

This program compiles (C with diab):
int main()
{
("----");
}
Why is it not considered a compiler error? (Is it because the compiler supports some other feature that needs this syntax?)
What does it compile to?
It compiles for the same reason that 1;, "----";, or 1 + 2 + 3 + 4; would compile: because an expression, followed by a semicolon, is a valid statement.
Turning expressions into statements with a semicolon is needed for a lot of parts of C to work. For example:
do_stuff_to(x);
is a function call, which has a value, but can be useful as a statement in its own right.
Even something like
x = y;
(that is, an assignment) is actually an expression. This one in particular is quite useful in statement position.
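As a small illustration, every line in the body of the function below (apart from the declaration and the return) is an expression followed by a semicolon. The names bump and expression_statements_demo are made up for this sketch, and the (void) cast is only there to keep compilers from warning about the unused value.

```c
static int call_count = 0;

/* A function whose return value callers are free to ignore. */
static int bump(void)
{
    return ++call_count;
}

int expression_statements_demo(void)
{
    int x, y = 5;

    bump();           /* call expression as a statement; result discarded */
    x = y;            /* assignment expression as a statement             */
    (void)("----");   /* string literal expression; value discarded       */

    return x;
}
```

Dropping the (void) cast leaves a program that is just as valid; a compiler may merely warn that the statement has no effect.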
The relevant parts of the C grammar are:
statement
: labeled_statement
| compound_statement
| expression_statement
| selection_statement
| iteration_statement
| jump_statement
;
that is, a statement can be one of many things, including an expression_statement; and
expression_statement
: ';'
| expression ';'
;
that is, an expression_statement is either a semicolon or an expression, followed by a semicolon.
What this program compiles to is implementation-dependent. A compiler is free to compile the string into the data segment of the program, or it is free to simply ignore it. On my machine, GCC does not even put the string in the compiled executable at all, even with no optimization level.
Compilers are also not required to warn about this construct, but GCC does, when given the flag -Wunused-value. This warning can be helpful sometimes, because this particular construction is not useful at all.
test.c: In function ‘main’:
test.c:2:5: warning: statement with no effect [-Wunused-value]
("----");
^

Bison/Flex print value of terminal from alternative

I have written a simple grammar:
operations :
/* empty */
| operations operation ';'
| operations operation_id ';'
;
operation :
NUM operator NUM
{
printf("%d\n%d\n",$1, $3);
}
;
operation_id :
WORD operator WORD
{
printf("%s\n%s\n%s\n",$1, $3, $<string>2);
}
;
operator :
'+' | '-' | '*' | '/'
{
$<string>$ = strdup(yytext);
}
;
As you can see, I have defined an operator that recognizes one of 4 symbols. Now, I want to print this symbol in operation_id. The problem is that the logic in operator works only for the last symbol in the alternative.
So if I write a/b; it prints ab/ and that's cool. But for other operations, e.g. a+b; it prints aba. What am I doing wrong?
*I omitted the newline characters in the example output.
This non-terminal from your grammar is just plain wrong.
operator :
'+' | '-' | '*' | '/' { $<string>$ = strdup(yytext); }
;
First, in yacc/bison, an action belongs to a single production, not to the rule as a whole. That rule has four productions, of which only the last has an associated action. It would be clearer to write it like this:
operator : '+'
| '-'
| '*'
| '/' { $<string>$ = strdup(yytext); }
;
which makes it a bit more obvious that the action only applies to the reduction from the token '/'.
The action itself is incorrect as well. yytext should never be used outside of a lexer action, because its value isn't reliable; it will be the value at the time the most recent lexer action was taken, but since the parser usually (but not always) reads one token ahead, it will usually (but not always) be the string associated with the next token. That's why the usual advice is to make a copy of yytext, but the idea is to copy it in the lexer rule, assigning the copy to the appropriate member of yylval so that the parser can use the semantic value of the token.
You should avoid the use of $<type>$ =. A non-terminal can only have one type, and it should be declared in the prologue to the bison file:
%type <string> operator
Finally, you will find that it is very rarely useful to have a non-terminal which recognizes different operators, because the different operators are syntactically different. In a more complete expression grammar, you'd need to distinguish between a + b * c, which is the sum of a and the product of b and c, and a * b + c, which is the sum of c and the product of a and b. That can be done by using different non-terminals for the sum and product syntaxes, or by using different productions for an expression non-terminal and disambiguating with precedence rules, but in both cases you will not be able to use an operator non-terminal which produces + and * indiscriminately.
For what it's worth, here is the explanation of why a+b results in the output of aba:
The production operator : '+' has no explicit action, so it ends up using the default action, which is $$ = $1.
However, the lexer rule which returns '+' (presumably -- I'm guessing here) never sets yylval. So yylval still has the value it was last assigned.
Presumably (another guess), the lexer rule which produces WORD correctly sets yylval.string = strdup(yytext);. So the semantic value of the '+' token is the semantic value of the previous WORD token, which is to say a pointer to the string "a".
So when the rule
operation_id :
WORD operator WORD
{
printf("%s\n%s\n%s\n",$1, $3, $<string>2);
}
;
executes, $1 and $2 both have the value "a" (two pointers to the same string), and $3 has the value "b".
Clearly, it is semantically incorrect for $2 to have the value "a", but there is another error waiting to occur. As written, your parser leaks memory because you never free() any of the strings created by strdup. That's not very satisfactory, and at some point you will want to fix the actions so that semantic values are freed when they are no longer required. At that point, you will discover that having two semantic values pointing at the same block of allocated memory makes it highly likely that free() will be called twice on the same memory block, which is Undefined Behaviour (and likely to produce very difficult-to-diagnose bugs).
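The stale-value effect can be reproduced without flex or bison. The sketch below is a simulation under the same guesses made above (the WORD rule stores its text, the operator rule stores nothing); fake_yylval, fake_lex and capture_values are hypothetical stand-ins, not generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared value slot (like yylval.string). */
static const char *fake_yylval = NULL;

enum { TOK_WORD, TOK_OP };

/* Mimics scanner rules: WORD stores its text, operators store nothing. */
static int fake_lex(int kind, const char *text)
{
    if (kind == TOK_WORD)
        fake_yylval = text;
    return kind;
}

/* Scans WORD op WORD and records the value the parser would capture for
   each token, i.e. whatever fake_yylval holds right after that token. */
void capture_values(const char *lhs, const char *op, const char *rhs,
                    const char *out[3])
{
    assert(out != NULL);
    fake_lex(TOK_WORD, lhs); out[0] = fake_yylval;
    fake_lex(TOK_OP, op);    out[1] = fake_yylval; /* stale: still lhs */
    fake_lex(TOK_WORD, rhs); out[2] = fake_yylval;
}
```

For the input a+b, out[0] and out[1] end up as the same pointer, which mirrors $1 and $2 both being "a" and is also the shape of the double-free hazard described above.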

How to recursively parse an expression?

I'm writing a small language, and I'm really stuck on expression parsing. I've written an LR Recursive Descent Parser and it works, but now that I need to parse expressions I'm finding it really difficult. I do not have a grammar defined, but if it helps, I kind of have an idea of how it works even without a grammar. Currently, my expression struct looks like this:
typedef struct s_ExpressionNode {
Token *value;
char expressionType;
struct s_ExpressionNode *lhand; /* left operand subtree */
char operand;
struct s_ExpressionNode *rhand; /* right operand subtree */
} ExpressionNode;
I'm trying to get it to parse something like:
5 + 5 + 2 * (-3 / 2) * age
I was reading this article on how to parse expressions. I tried to implement the first grammar, but it didn't work out too well. Then I noticed the second grammar, which appears to remove left recursion. However, I'm stuck trying to implement it since I don't understand what P and B mean, and also U is a - but the - is also a B? Also, I'm not sure what expect(end) is supposed to mean either.
In the "Recursive-descent recognition" section of the article you linked, the E, P, B, and U are the non-terminal symbols in the expression grammar presented. From their definitions in the text, I infer that "E" is chosen as a mnemonic for "expression", "P" as mnemonic for "primary", "B" for "binary (operator)", and "U" for "unary (operator)". Given those characterizations, it should be clear that the terminal symbol "-" can be reduced either to a U or to a B, depending on context:
unary: -1
binary: x-1
The expect() function described in the article is used to consume the next token if it happens to be of the specified type, or otherwise to throw an error. The end token is defined to be a synthetic token representing the end of the input. Thus
expect(end)
expresses the expectation that there are no more tokens to process in the expression, and its given implementation throws an error if that expectation is not met.
All of this is in the text, except the reason for choosing the particular symbols E, P, B, and U. If you're having trouble following the text then you probably need to search out something simpler.

Many "multiple alternatives" errors with my C grammar

I am trying to write a C grammar with ANTLRWorks, and for that I used this one http://stuff.mit.edu/afs/athena/software/antlr_v3.2/examples-v3/java/C/C.g where I tried to make it simpler by removing many blocks I don't use, as well as the backtracking. And here is what I have: http://www.archive-host.com/files/1956778/24fe084677d7655eb57ba66e1864081450017dd9/CNew.txt
Then when I press Ctrl+D, I get a lot of warnings and errors like these:
[21:20:54] warning(200): C:\CNew.g:188:2: Decision can match input such as "'{' '}'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[21:20:54] warning(200): C:\CNew.g:210:2: Decision can match input such as "'for' '('" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
[21:20:54] error(201): C:\CNew.g:210:2: The following alternatives can never be matched: 3
[21:20:54] error(208): C:\CNew.g:250:1: The following token definitions can never be matched because prior tokens match the same input: CHAR
I don't really understand why I have all these warnings; there should not be any conflicts.
But I still have these errors:
[22:02:55] error(208): C:\Users\Seiya\Desktop\projets\TRAD\Gram\CNew.g:238:1: The following token definitions can never be matched because prior tokens match the same input: CONSTANT
[22:17:18] error(208): CNew.g:251:1: The following token definitions can never be matched because prior tokens match the same input: CHAR
[22:17:18] error(208): C:\Users\Seiya\Desktop\projets\TRAD\Gram\CNew.g:251:1: The following token definitions can never be matched because prior tokens match the same input: CHAR
That means the lexer can never create the tokens CHAR and INT because some other lexer rule, CONSTANT, matches the same input. What you need to do is change CONSTANT into a parser rule.
In other words, change these two rules:
primary_expression
: ID
| CONSTANT
| '(' expression ')'
;
CONSTANT
: INT
| CHAR
;
into the following:
primary_expression
: ID
| constant
| '(' expression ')'
;
constant
: INT
| CHAR
;
