How to avoid conflicts when generating IR codes in yacc? - c

%nonassoc NO_ELSE
%nonassoc ELSE
stmt_conditional
: IF '(' expr { show_if_begin($3); } ')' stmt_compound { show_if_end(); } %prec NO_ELSE
| IF '(' expr { show_if_begin($3); } ')' stmt_compound { show_if_else(); } ELSE stmt_compound { show_if_end(); }
Since stmt_compound will generate IRs, show_if_begin() should be in front of stmt_compound. However, this will cause reduce/reduce conflicts in yacc. How to solve this problem?
EDITED
This is what I've tried, but it didn't work.
stmt_conditional
: stmt_cond_if { show_if_end(); } %prec NO_ELSE
| stmt_cond_if { show_if_else(); } ELSE stmt_compound { show_if_end(); }
;
stmt_cond_if
: IF '(' expr { show_if_begin($3); } ')' stmt_compound
;

The reason why your variant doesn't work is that you need infinite lookahead to distinguish between the two kinds of if statement. When your parser looks at an IF token, it has no idea whether it will see ELSE or NO_ELSE, or when to expect one of those, but it has to decide where to shift now in order to process the expression.
The solution is to factor out more common stuff so that the decision is made exactly when it's needed.
stmt_conditional
: stmt_cond_if { ... } else_part
;
else_part
: %prec NO_ELSE
| ELSE stmt_compound { ... }
;
In theory, the parser generator should be able to refactor these rules automatically, but code blocks in the middle of the rule prevent yacc from doing so. Even empty code blocks in otherwise identical rules prevent them from being flattened out. That's because yacc sees the contents of code blocks as black boxes. It has no idea what these actions mean and has to assume they all might mean different things, even if the text is identical.
You can verify that when you remove the actions, the conflict disappears. Of course a yacc grammar without actions is less useful.
Naturally, this factoring-out unifies the actions that originally were distinct for the else and no-else variants. There is no way around this: you have to write an action that is good for both cases.
Edit In practice, such code blocks are handled by yacc by creating a new invisible rule and a new non-terminal for each mid-rule action. Quoth the yacc page:
Actions that do not terminate a rule are actually handled by Yacc by manufacturing a new nonterminal symbol name, and a new rule matching this name to the empty string. The interior action is the action triggered off by recognizing this added rule.
It's these invisible rules that actually cause conflicts in your grammar. So another solution would be to leave the rules proper as is, but move actions to the end of their respective rules. I personally still like to factor out common parts manually. -- end edit.
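Putting the pieces together, the factored grammar might be written out like this (an untested sketch using the asker's show_if_* hooks; note that show_if_else() now runs right after ELSE is shifted rather than before it):

```yacc
stmt_conditional
    : IF '(' expr { show_if_begin($3); } ')' stmt_compound else_part
    ;

else_part
    : /* empty */ %prec NO_ELSE               { show_if_end(); }
    | ELSE { show_if_else(); } stmt_compound  { show_if_end(); }
    ;
```

Because both alternatives of else_part close with show_if_end(), the action after the then-part is the same in both cases, which is exactly the unification described above.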
Without %prec there would be a shift/reduce warning, because
if (x)
if (y)
do_something
else
do_something_else
is ambiguous: when the parser sees else, it doesn't know whether it belongs to the inner if (shift) or the outer one (reduce). The default behaviour is to shift, which is what we normally want. %prec makes this explicit and eliminates the warning.
Note this still may interfere with the rest of your grammar. If you can change your syntax I would suggest switching from C-like compound-statement-based syntax to an ENDIF-based one as found in e.g. Ada.


Flex: REJECT rejects one character at a time?

I'm parsing C++-style scoped names, e.g., A::B. I want to parse such a name as a single token for a type name if it was previously declared as a type; otherwise, I want to parse it as three tokens A, ::, and B.
Given:
L [A-Za-z_]
D [0-9]
S [ \f\r\t\v]
identifier {L}({L}|{D})*
sname {identifier}({S}*::{S}*{identifier})+
%%
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
REJECT;
}
{identifier} {
// ...
return Y_NAME;
}
However, for the input sequence:
A::BB -> Not a type; returned as "A", "::", "BB"
(Declare A::B as a type.)
A::BB
What happens on the second parse of A::BB is that REJECT is called and flex discards only one character from the end, then tries to match A::B (one B). This matches the previously declared A::B via the {sname} rule, which is wrong.
What I assumed REJECT did was to proceed to the second-best matching rule with the same input. Hence, I expected it to match A alone in {identifier} and just leave ::BB on the input stream. But, as shown, that's not what happens. It peels off one character at a time from the end of the input and re-attempts a match.
Adding in yyless() to chop off the ::BB doesn't help:
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
char const *const sep = strpbrk( yytext, ": \t" );
size_t keep_len = (size_t)(sep - yytext);
yyless( keep_len );
REJECT;
}
The only thing that I've found that works is:
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
char const *const sep = strpbrk( yytext, ": \t" );
size_t keep_len = (size_t)(sep - yytext);
yyless( keep_len );
goto identifier;
}
{identifier} {
identifier:
// ...
return Y_NAME;
}
But this seems kind of hack-ish. Is there a more canonical "flex way" to do what I want?
Despite the hack-ish-ness, is there actually anything wrong with my solution? I.e., would it not work in some cases? Or not work in some versions of Flex?
Yes, I'm aware that, even if it did work the way I want, this won't parse all contrived C++ scoped names; but it's good enough.
Yes, I'm aware that generally parsing such things should be done as separate tokens, but C-like languages are hard to parse since they're not context-free. Knowing when you have a type really helps.
Yes, I'm aware that REJECT slows down the lexer. This is (mostly) for an interactive and command-line tool in a terminal, so the human is the slowest component.
I'd like to focus on the problem at hand with the code as it mostly is. Mine is more of a question about how to use REJECT, yyless(), etc., to get the desired behavior. Thanks.
REJECT does not make any distinction between different rules; it just falls back to the next possible accepting pattern. That might be a shorter match of the same pattern, or it might not even be shorter, if there's a lower-precedence rule which matches the same token. (Normally, Flex chooses the longest match out of the possible matches of the regular expression. With REJECT, the shorter matches are also considered.)
So you can avoid the false match of A::B for input A::BB by using trailing context: [Note 1]
{sname}/[^[:alnum:]_] {...}
(In Flex, you can't put trailing context in a macro because the expansion is surrounded with parentheses.)
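Applied to the scanner in the question, only the {sname} rule changes (an untested sketch; the rest stays as the asker wrote it):

```lex
{sname}/[^[:alnum:]_]   {
            if ( find_type( yytext ) )
                return Y_TYPE_NAME;
            REJECT;
        }
```

With the trailing context, a shorter match of {sname} itself, such as A::B inside A::BB, is no longer acceptable, because it would end in the middle of an identifier.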
You could use that solution if you wanted to try all possible complete prefixes of id1::id2::id3::id4 starting with the longest one (id1::id2::id3::id4) and falling back to each shorter prefix in turn (id1::id2::id3, id1::id2, id1). The trailing context prevents intermediate matches in the middle of an identifier.
I'm not sure if that is the problem you are trying to solve, though. And even if it is, it doesn't really solve the problem of what to do after you fall back to an acceptable solution, because you probably don't want to redo the search for prefixes in the middle of the sequence. In other words, if the original were A::B::A::B, and A::B is a "known prefix", the tokens returned should (presumably) be A::B, ::, A, ::, B rather than A::B, ::, A::B.
On the other hand, you might not care about previously defined prefixes; only whether the complete sequence is a known type. In that case, the possible token sequences for A::B::C::D are [TYPENAME(A::B::C::D)] and [ID(A), ::, ID(B), ::, ID(C), ::, ID(D)]. For this case, you don't want to restrict the {sname} fallbacks; you want to eliminate them. After the fallback, you only want to try matches for different rules.
There are a variety of alternative solutions (which do not use REJECT), mostly relying on the possibility of using start conditions. (You still need to think about what happens after the fallback. Start conditions can be useful for that, as well.)
One fairly general solution is to define a start condition which excludes the rule you want to fall back from. Once you've determined that you want to fall back to a different rule, you change to a start condition which excludes the rule which just matched and then call yyless(0) (without returning). That will cause Flex to rescan the current token from the beginning in the new start condition. (If you have several possible rules which might match, you will need several start conditions. This could get out of hand, but in most cases the possible set of matching rules is very limited.)
At some point, you need to return to the previous start condition. (You could use a start condition stack if it's not trivial to figure out which was the previous start condition.) Ideally, to avoid false interior matches, you would want to return to the previous start condition only when you reached the end of the first (incorrect) match. That's easy to do if you are tracking the input position for each token; you just need to save the position just before calling yyless(0). (You also need to correct the token input position by subtracting yyleng before yyless sets yyleng to 0).
Rescanning from the beginning of the token might seem inefficient, but it's less inefficient than the overhead imposed by REJECT. And the overhead of REJECT affects the entire scanner operation, while the rescanning solution is essentially free except for the tokens you happen to rescan. [Note 2]
(BEGIN(some_state); yyless(0);) is a reasonably common flex idiom; this is not the only use case. Another one is the answer to the question "How do I run a start condition until I reach a token I can't identify without consuming that token?")
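A minimal sketch of that idiom for the asker's situation (simplified: it switches straight back to INITIAL after the fallback token, so it does not do the position tracking discussed above):

```lex
%x NO_SNAME
%%
{sname}     {
                if ( find_type( yytext ) )
                    return Y_TYPE_NAME;
                /* Wrong match: rescan from the start of the token,
                 * this time with the {sname} rule disabled. */
                BEGIN(NO_SNAME);
                yyless(0);
            }
<NO_SNAME>{identifier}  {
                /* Matched the bare identifier; re-enable {sname}
                 * for subsequent tokens. */
                BEGIN(INITIAL);
                return Y_NAME;
            }
```

Because NO_SNAME is an exclusive start condition, the unannotated {sname} rule cannot fire during the rescan, so the {identifier} rule wins.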
But I think that in this case there is a simpler solution, using yymore to accumulate the token. (This avoids having to do your own dynamically expanded token buffer.) Again, there are two possibilities, depending on whether you might allow initial prefix matches or restrict the possibilities to either the full sequence or the first identifier.
But the outline is the same: Match the shortest valid possibility first, remember how long the token was at that point, and then use a start condition to enable a pattern which extends the token, either to the next possibility or to the end of the sequence, depending on your needs. Before continuing with the scanner loop, you call yymore() which indicates to Flex that the next pattern extends the token rather than replacing it.
At each possible match end, you test to see if that would really be a valid token and if so, record the position (and whatever else you might need to recall, such as the token type). When you reach a point where you can no longer extend the match, you use yyless to fall back to the last valid match point, and return the token.
This is slightly less inefficient than the pure yyless() solution because it avoids the rescan of the token which is returned. (All these solutions, including REJECT, do rescan the text following the selected match if it is shorter than the longest possible extended match. That's theoretically avoidable but since it's not a lot of overhead, it doesn't seem worthwhile to build a complex mechanism to avoid it.)
Again, you probably want to avoid trying to extend token matches after the fallback until you reach the longest extent. This can be solved the same way, by recording the longest matched extent, but the start condition handling is a bit simpler.
Here's some not-very-well tested code for the simpler problem, where only the first identifier and the full match are possible:
%{
/* Code to track current token position and the fallback position.
* It's a simple byte count; line-ends are not dealt with.
*/
static int token_pos = 0;
static int token_max_pos = 0;
/* This is done before every action, even empty actions */
#define YY_USER_ACTION token_pos += yyleng;
/* FALLBACK needs to undo YY_USER_ACTION; it also records the far end
 * of the longest match so we don't re-extend inside the fallback region */
#define FALLBACK(to) do { token_max_pos = token_pos; token_pos -= yyleng - to; yyless(to); } while (0)
/* SET_MORE needs to pre-undo the next YY_USER_ACTION */
#define SET_MORE() do { token_pos -= yyleng; yymore(); } while(0)
%}
%x EXTEND_ID
ident [[:alpha:]_][[:alnum:]_]*
%%
    int fallback_leng = 0;
    /* The trailing context here is to avoid triggering EOF in
     * the EXTEND_ID state.
     */
{ident}/[^[:alnum:]_] {
    /* In the fallback region, don't attempt to extend the match. */
    if (token_pos <= token_max_pos)
        return Y_IDENT;
    fallback_leng = yyleng;
    BEGIN(EXTEND_ID);
    SET_MORE();
}
{ident} { return find_type(yytext) ? Y_TYPE_NAME : Y_IDENT; }
<EXTEND_ID>{
([[:space:]]*"::"[[:space:]]*{ident})+|.|\n {
BEGIN(INITIAL);
if (yyleng == fallback_leng + 1)
FALLBACK(fallback_leng);
if (find_type(yytext))
return Y_TYPE_NAME;
else {
FALLBACK(fallback_leng);
return Y_IDENT;
}
}
}
Notes
At least, I think you can do that. I haven't ever tried and REJECT does impose a number of limitations on other scanner features. Really, it's a historical artefact which massively slows down lexical analysis and should generally be avoided.
The use of REJECT anywhere in the rules causes Flex to switch to a different template in which a list of endpoints is retained, which has a cost in both time and space. Another consequence is that Flex can no longer resize the input buffer.

Bison Flex reduce/reduce conflict on dangling else with mid action

I am currently moving a fun-project of mine over to bison/flex as parser and have trouble solving a reduce/reduce conflict:
// https://github.com/X39/yaoosl/blob/master/code-gen/yaoosl.y#L761-L766
ifthen: YST_IF YST_ROUNDO expression code_ifstart YST_ROUNDC codebody code_ifendnoelse
| YST_IF YST_ROUNDO expression code_ifstart YST_ROUNDC ifthen_clsd YST_ELSE code_ifelse ifthen code_ifendelse
;
ifthen_clsd: codebody
| YST_IF YST_ROUNDO expression code_ifstart YST_ROUNDC ifthen_clsd code_ifelse YST_ELSE ifthen_clsd code_ifendelse
;
Note: stuff prefixed with code_ are the mid-actions
Could somebody explain to me how to solve this properly and why the "go-to" solution is either wrong or did not work?
Thanks, X39
Since the two rules are identical up to the code_ifelse (and assuming code_ifelse is an empty rule, like an in-rule action), the parser can't tell whether to reduce code_ifelse before or after the YST_ELSE. You might be able to fix it by making the two rules consistent about the order of code_ifelse and YST_ELSE.
Some rules-of-thumb for grammars:
Don't use symbolic names for single-char tokens like '(' and ')' -- it just obfuscates things and makes the grammar hard to read and understand.
Don't use in-rule actions unless absolutely necessary -- it's better to create a single token rule with an end-rule action that does what you need.
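Concretely, the consistency fix amounts to moving code_ifelse after YST_ELSE in ifthen_clsd so that both rules agree (a sketch, not tested against the rest of the yaoosl grammar):

```yacc
ifthen      : YST_IF YST_ROUNDO expression code_ifstart YST_ROUNDC codebody code_ifendnoelse
            | YST_IF YST_ROUNDO expression code_ifstart YST_ROUNDC ifthen_clsd YST_ELSE code_ifelse ifthen code_ifendelse
            ;

ifthen_clsd : codebody
            | YST_IF YST_ROUNDO expression code_ifstart YST_ROUNDC ifthen_clsd YST_ELSE code_ifelse ifthen_clsd code_ifendelse
            ;
```

With the mid-rule action on the same side of YST_ELSE in both rules, the parser no longer has to choose between two different hidden empty rules before seeing the ELSE token.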

Package-qualified names. Differences (if any) between Package::<&var> vs &Package::var?

Reading through https://docs.perl6.org/language/packages#Package-qualified_names it outlines qualifying package variables with this syntax:
Foo::Bar::<$quux>; # ..as an alternative to $Foo::Bar::quux
For reference the package structure used as the example in the document is:
class Foo {
sub zape () { say "zipi" }
class Bar {
method baz () { return 'Þor is mighty' }
our &zape = { "zipi" }; #this is the variable I want to resolve
our $quux = 42;
}
}
The same page states this style of qualification doesn't work to access &zape in the Foo::Bar package listed above:
(This does not work with the &zape variable)
Yet, if I try:
Foo::Bar::<&zape>; # instead of &Foo::Bar::zape;
it resolves just fine.
Have I misinterpreted the document or completely missed the point being made? What would be the logic behind it 'not working' with code reference variables vs a scalar for example?
I'm not aware of differences, but Foo::Bar::<&zape> can also be modified to use {} instead of <>, which then can be used with something other than literals, like this:
my $name = '&zape';
Foo::Bar::{$name}()
or
my $name = 'zape';
&Foo::Bar::{$name}()
JJ and Moritz have provided useful answers.
This nanswer is a whole nother ball of wax. I've written and discarded several nanswers to your question over the last few days. None have been very useful. I'm not sure this is either but I've decided I've finally got a first version of something worth publishing, regardless of its current usefulness.
In this first installment my nanswer is just a series of observations and questions. I also hope to add an explanation of my observations based on what I glean from spelunking the compiler's code to understand what we see. (For now I've just written up the start of that process as the second half of this nanswer.)
Differences (if any) between Package::<&var> vs &Package::var?
They're fundamentally different syntax. They're not fully interchangeable in where you can write them. They result in different evaluations. Their result can be different things.
Let's step thru lots of variations drawing out the differences.
say Package::<&var>; # compile-time error: Undeclared name: Package
So, forget the ::<...> bit for a moment. P6 is looking at that Package bit and demanding that it be an already declared name. That seems simple enough.
say &Package::var; # (Any)
Quite a difference! For some reason, for this second syntax, P6 has no problem with those two arbitrary names (Package and var) not having been declared. Who knows what it's doing with the &. And why is it (Any) and not (Callable) or Nil?
Let's try declaring these things. First:
my Package::<&var> = { 42 } # compile-time error: Type 'Package' is not declared
OK. But if we declare Package things don't really improve:
package Package {}
my Package::<&var> = { 42 } # compile-time error: Malformed my
OK, start with a clean slate again, without the package declaration. What about the other syntax?:
my &Package::var = { 42 }
Yay. P6 accepts this code. Now, for the next few lines we'll assume the declaration above. What about:
say &Package::var(); # 42
\o/ So can we use the other syntax?:
say Package::<&var>(); # compile-time error: Undeclared name: Package
Nope. It seems like the my didn't declare a Package with a &var in it. Maybe it declared a &Package::var, where the :: just happens to be part of the name but isn't about packages? P6 supports a bunch of "pseudo" packages. One of them is LEXICAL:
say LEXICAL::; # PseudoStash.new(... &Package::var => (Callable) ...
Bingo. Or is it?
say LEXICAL::<&Package::var>(); # Cannot invoke this object
# (REPR: Uninstantiable; Callable)
What happened to our { 42 }?
Hmm. Let's start from a clean slate and create &Package::var in a completely different way:
package Package { our sub var { 99 } }
say &Package::var(); # 99
say Package::<&var>(); # 99
Wow. Now, assuming those lines above and trying to add more:
my Package::<&var> = { 42 } # Compile-time error: Malformed my
That was to be expected given our previous attempt above. What about:
my &Package::var = { 42 } # Cannot modify an immutable Sub (&var)
Is it all making sense now? ;)
Spelunking the compiler code, checking the grammar
[1] I spent a long time trying to work out what the deal really is before looking at the source code of the Rakudo compiler. This is a footnote covering my initial compiler spelunking. I hope to continue it tomorrow and turn this nanswer into an answer this weekend.
The good news is it's just P6 code -- most of Rakudo is written in P6.
The bad news is knowing where to look. You might see the doc directory and then the compiler overview. But then you'll notice the overview doc has barely been touched since 2010! Don't bother. Perhaps Andrew Shitov's "internals" posts will help orient you? Moving on...
In this case what I am interested in is understanding the precise nature of the Package::<&var> and &Package::var forms of syntax. When I type "syntax" into GH's repo search field the second file listed is the Perl 6 Grammar. Bingo.
Now comes the ugly news. The Perl 6 Grammar file is 6K LOC and looks super intimidating. But I find it all makes sense when I keep my cool.
Next, I'm wondering what to search for on the page. :: nets 600+ matches. Hmm. ::< is just 1, but it is in an error message. But in what? In token morename. Looking at that I can see it's likely not relevant. But the '::' near the start of the token is just the ticket. Searching the page for '::' yields 10 matches. The first 4 (from the start of the file) are more error messages. The next two are in the above morename token. 4 matches left.
The next one appears a quarter way thru token term:sym<name>. A "name". .oO ( Undeclared name: Package So maybe this is relevant? )
Next, token typename. A "typename". .oO ( Type 'Package' is not declared So maybe this is relevant too? )
token methodop. Definitely not relevant.
Finally token infix:sym<?? !!>. Nope.
There are no differences between Package::<&var> and &Package::var.
package Foo { our $var = "Bar" };
say $Foo::var === Foo::<$var>; # OUTPUT: «True␤»
Ditto for subs (of course):
package Foo { our &zape = { "Bar" } };
say &Foo::zape === Foo::<&zape>;# OUTPUT: «True␤»
What the documentation (somewhat confusingly) is trying to say is that package-scope variables can only be accessed if declared using our. There are two zapes, one of them has got lexical scope (subs get lexical scope by default), so you can't access that one. I have raised this issue in the doc repo and will try to fix it as soon as possible.

Lex: multiple files, different rules

I have to parse more than one file, with different rules for each case. That is: I need some rules to work when processing a file, and be disabled afterwards.
I could simply use a global variable to track the state of the program, and have my rules decide inside their body whether to do anything useful or not, like this:
%{
static int state;
%}
%%
{something} {
if (state == SOMETHING_STATE) ...
}
{something_else} {
if (state == SOMETHING_ELSE_STATE) ...
}
%%
I'm guessing there's a better way to do this, though. Is there?
You want start states -- basically lex has a builtin state variable, and you can annotate rules so that they fire only in certain states:
%s state1
%s state2
%s state3
%%
<state1>{something} { // this rule matches only in state1
// change to state2
BEGIN(state2); }
<state1,state3>{something_else} { // this rule matches in state1 or state3
}
{more_stuff} { // this rule matches in all states
}
You can find documentation on this in any lex or flex book, or in the online flex documentation
Assuming you're using something like Flex rather than the original lex, you can use start states. When you parse something that changes how you do the parsing, you enter a state to do that parsing, and you tag the appropriate rules with that state. When you parse whatever marks the end of that state, you go do another:
{state_signal} { BEGIN(STATE2); }
<STATE2>{something} { handle_something(); }
<STATE2>{state3_signal} { BEGIN(STATE3); }
<STATE3>{something} {handle_something2(); }
<STATE3>{whatever} { BEGIN(INITIAL); }
Using Flex and Yacc you can also build different individual parsers and tweak the makefile to compile them into a single executable. I believe you're only using Flex, but this should be possible too.
I can't say precisely how to do it now, but I've actually done it, quite some time ago, so if you look around you'll find a way. Basically you'll have to compile each parser with a different prefix (-P), and that will generate differently named parse functions and differently named global variables for the parsers to use.
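The build for that multi-scanner approach might look roughly like this (a sketch; the file names, the foo/bar prefixes, and main.c are made up for illustration):

```make
# Each scanner gets its own prefix, so yylex/yyin etc. become
# foolex/fooin and barlex/barin, avoiding symbol clashes.
foo.c: foo.l
	flex -P foo -o foo.c foo.l

bar.c: bar.l
	flex -P bar -o bar.c bar.l

tool: main.c foo.c bar.c
	$(CC) -o tool main.c foo.c bar.c
```

main() then calls foolex() or barlex() depending on which file is being processed.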

Recovering error tokens in parsing (Lemon)

I'm using Lemon as a parser generator, its error handling is the same as yacc's and bison's if you don't know Lemon.
Lemon has an option to define the error token in a set of rules in order to catch parsing errors. The default behavior of the generated parser is to destroy the token causing the error; is there any way to override this behavior so that I can keep the token?
Here's an example to show what's happening: basically I'm appending the tokens for each rule together to reform the input string, here's an example grammar:
input ::= string(A). { printf("%s", A); } // Print the result
string(A) ::= string(B) part(C). { A = append(B, C); }
string(A) ::= part(B). { A = B; }
part(A) ::= NUMBER(B) NAME(C). { A = append(C, B); } // Rearrange the number and name
part(A) ::= error(B). { A = B; } // On error keep the token anyways
On input:
"Username 1234Joseph"
I get output:
"Joseph1234"
Because the text "Username " is junked by the parser in the part(A) ::= error(B) rule, but I really want:
"Username Joseph1234"
as output.
If you can solve this problem in bison or another parser generator I would accept that as an answer :)
With yacc/bison, a parsing error puts the parser into error recovery mode, if possible. It will attempt to discard tokens on its way back to a "clean" state.
I'm unable to find a reference for Lemon, so I can't show Lemon code to fix this, but with yacc/bison, one would use the error-recovery rules described in the Bison manual.
Namely, you need to adjust your error rule to state that the parser is ok with yyerrok to prevent it from dropping tokens. Next, it will attempt to reread the "bad" token, so you need to clear it with yyclearin. Finally, since the rule attached to your error code contains the contents of your token, you will need to set up a function that adjusts your input stack, by taking the current token contents and creating a new (proper) token with the same contents.
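In yacc/bison terms, the recovery rule sketched above might look something like this (hypothetical: save_error_text() stands in for whatever helper captures the offending token's text before the lookahead is cleared):

```yacc
part : NUMBER NAME  { $$ = append($2, $1); }
     | error        {
                      yyerrok;                /* recover immediately; stop discarding tokens */
                      $$ = save_error_text(); /* hypothetical: grab the bad token's text */
                      yyclearin;              /* don't re-read the offending lookahead */
                    }
     ;
```

The order matters: the token's text must be captured before yyclearin discards the lookahead.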
As an example, if a grammar defined as MyOther MyOther saw the input MyTok MyOther:
stack before:
MyTok: "the text"
MyOther: "new text"
stack after:
MyOther: "the text"
MyOther: "new text"
To accomplish this, look into using yybackup. I'm unable to find an alternative method, though yybackup is frowned upon.
It's an old one, but why not...
The grammar must include spaces. At the moment the grammar only allows a sequence of NUMBER NAME tokens (without any space between the tokens).
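One way to follow that advice is to have the lexer return a SPACE token and add a rule that keeps it (a sketch; it assumes the lexer is changed to emit SPACE for runs of whitespace):

```lemon
string(A) ::= string(B) SPACE(C). { A = append(B, C); } // keep the whitespace in the output
```

With that rule, "Username 1234Joseph" tokenizes as NAME SPACE NUMBER NAME, the space survives the reassembly, and the output becomes "Username Joseph1234".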
