Recovering error tokens in parsing (Lemon) - c

I'm using Lemon as a parser generator, its error handling is the same as yacc's and bison's if you don't know Lemon.
Lemon has an option to define the error token in a set of rules in order to catch parsing errors. The default behavior of the generated parser is to destroy the token causing the error; is there any way to override this behavior so that I can keep the token?
Here's an example to show what's happening: basically I'm appending the tokens for each rule together to reform the input string, here's an example grammar:
input ::= string(A) { printf("%s", A); } // Print the result
string(A) ::= string(B) part(C). { A = append(B, C); }
string(A) ::= part(B). { A = B; }
part(A) ::= NUMBER(B) NAME(C). { A = append(C, B); } // Rearrange the number and name
part(A) ::= error(B). { A = B; } // On error keep the token anyways
On input:
"Username 1234Joseph"
I get output:
"Joseph1234"
Because the text "Username " is junked by the parser in the part(A) ::= error(B) rule, but I really want:
"Username Joseph1234"
as output.
If you can solve this problem in bison or another parser generator I would accept that as an answer :)

With yacc/bison, a parsing error drops the tool into error recovery mode, if possible. It will attempt to discard tokens on its way to a "clean" state.
I'm unable to find a reference for lemon, so I can't show some lemon code to fix this, but with yacc/bison, one would use the rules here.
Namely, you need to adjust your error rule to state that the parser is ok with yyerrok to prevent it from dropping tokens. Next, it will attempt to reread the "bad" token, so you need to clear it with yyclearin. Finally, since the rule attached to your error code contains the contents of your token, you will need to set up a function that adjusts your input stack, by taking the current token contents and creating a new (proper) token with the same contents.
As an example, if a grammar defined as MyOther MyOther saw MyTok MyOther:
stack
MyTok: "the text"
MyOther: "new text"
stack
MyOther: "the text"
MyOther: "new text"
To accomplish this, look into using yybackup. I'm unable to find an alternative method, though yybackup is frowned upon.

It's an old one, but why not...
The grammar must include spaces. At the moment the grammar only allows a sequence of NUMBER NAME tokens (without any space between the tokens).

Related

Flex: REJECT rejects one character at a time?

I'm parsing C++-style scoped names, e.g., A::B. I want to parse such a name as a single token for a type name if it was previously declared as a type; otherwise, I want to parse it as three tokens A, ::, and B.
Given:
L [A-Za-z_]
D [0-9]
S [ \f\r\t\v]
identifier {L}({L}|{D})*
sname {identifier}({S}*::{S}*{identifier})+
%%
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
REJECT;
}
{identifier} {
// ...
return Y_NAME;
}
However, for the input sequence:
A::BB -> Not a type; returned as "A", "::", "BB"
(Declare A::B as a type.)
A::BB
What happens on the second parse of A::BB, REJECT is called and flex discards only 1 character from the end and tries to match A::B (one B). This matches the previously declared A::B in the {sname} rule which is wrong.
What I assumed REJECT did was to proceed to the second-best matching rule with the same input. Hence, I expected it to match A alone in {identifier} and just leave ::BB on the input stream. But, as shown, that's not what happens. It peels off one character at a time from the end of the input and re-attempts a match.
Adding in yyless() to chop off the ::BB doesn't help:
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
char const *const sep = strpbrk( yytext, ": \t" );
size_t keep_len = (size_t)(sep - yytext);
yyless( keep_len );
REJECT;
}
The only thing that I've found that works is:
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
char const *const sep = strpbrk( yytext, ": \t" );
size_t keep_len = (size_t)(sep - yytext);
yyless( keep_len );
goto identifier;
}
{identifier} {
identifier:
// ...
return Y_NAME;
}
But these seems kind of hack-ish. Is there a more canonical "flex way" to do what I want?
Despite the hack-ish-ness, is there actually anything wrong with my solution? I.e., would it not work in some cases? Or not work in some versions of Flex?
Yes, I'm aware that, even if it did work the way I want, this won't parse all contrived C++ scoped names; but it's good enough.
Yes, I'm aware that generally parsing such things should be done as separate tokens, but C-like languages are hard to parse since they're not context-free. Knowing when you have a type really helps.
Yes, I'm aware that REJECT slows down the lexer. This is (mostly) for an interactive and command-line tool in a terminal, so the human is the slowest component.
I'd like to focus on the problem at hand with the code as it mostly is. Mine is more of a question about how to use REJECT, yyless(), etc., to get the desired behavior. Thanks.
REJECT does not make any distinction between different rules; it just falls back to the next possible accepting pattern (which might not even be shorter, if there's a lower-precedence rule which matches the same token.) That might be a shorter match of the same pattern. (Normally, Flex chooses the longest match out of the possible matches of the regular expression. With REJECT, the shorter matches are also considered.)
So you can avoid the false match of A::B for input A::BB by using trailing context: [Note 1]
{sname}/[^[:alnum:]_] {...}
(In Flex, you can't put trailing context in a macro because the expansion is surrounded with parentheses.)
You could use that solution if you wanted to try all possible complete prefixes of id1::id2::id3::id4 starting with the longest one (id1::id2::id3::id4) and falling back to each shorter prefix in turn (id1::id2::id3, id1::id2, id1). The trailing context prevents intermediate matches in the middle of an identifier.
I'm not sure if that is the problem you are trying to solve, though. And even if it is, it doesn't really solve the problem of what to do after you fall back to an acceptable solution, because you probably don't want to redo the search for prefixes in the middle of the sequence. In other words, if the original were A::B::A::B, and A::B is a "known prefix", the tokens returned should (presumably) be A::B, ::, A, ::, B rather than A::B, ::, A::B.
On the other hand, you might not care about previously defined prefixes; only whether the complete sequence is a known type. In that case, the possible token sequences for A::B::C::D are [TYPENAME(A::B::C::D)] and [ID(A),, ::, ID(B), ::, ID(C), ID(D)]. For this case, you don't want to restrict the {sname} fallbacks; you want to eliminate them. After the fallback, you only want to try matches for different rules.
There are a variety of alternative solutions (which do not use REJECT), mostly relying on the possibility of using start conditions. (You still need to think about what happens after the fallback. Start conditions can be useful for that, as well.)
One fairly general solution is to define a start condition which excludes the rule you want to fall back from. Once you've determined that you want to fall back to a different rule, you change to a start condition which excludes the rule which just matched and then call yyless(0) (without returning). That will cause Flex to rescan the current token from the beginning in the new start condition. (If you have several possible rules which might match, you will need several start conditions. This could get out of hand, but in most cases the possible set of matching rules is very limited.)
At some point, you need to return to the previous start condition. (You could use a start condition stack if it's not trivial to figure out which was the previous start condition.) Ideally, to avoid false interior matches, you would want to return to the previous start condition only when you reached the end of the first (incorrect) match. That's easy to do if you are tracking the input position for each token; you just need to save the position just before calling yyless(0). (You also need to correct the token input position by subtracting yyleng before yyless sets yyleng to 0).
Rescanning from the beginning of the token might seem inefficient, but it's less inefficient than the overhead imposed by REJECT. And the overhead of REJECT affects the entire scanner operation, while the rescanning solution is essentially free except for the tokens you happen to rescan. [Note 2]
(BEGIN(some_state); yyless(0);) is a reasonably common flex idiom; this is not the only use case. Another one is the answer to the question "How do I run a start condition until I reach a token I can't identify without consuming that token?")
But I think that in this case there is a simpler solution, using yymore to accumulate the token. (This avoids having to do your own dynamically expanded token buffer.) Again, there are two possibilities, depending on whether you might allow initial prefix matches or restrict the possibilities to either the full sequence or the first identifier.
But the outline is the same: Match the shortest valid possibility first, remember how long the token was at that point, and then use a start condition to enable a pattern which extends the token, either to the next possibility or to the end of the sequence, depending on your needs. Before continuing with the scanner loop, you call yymore() which indicates to Flex that the next pattern extends the token rather than replacing it.
At each possible match end, you test to see if that would really be a valid token and if so, record the position (and whatever else you might need to recall, such as the token type). When you reach a point where you can no longer extend the match, you use yyless to fall back to the last valid match point, and return the token.
This is slightly less inefficient than the pure yyless() solution because it avoids the rescan of the token which is returned. (All these solutions, including REJECT, do rescan the text following the selected match if it is shorter than the longest possible extended match. That's theoretically avoidable but since it's not a lot of overhead, it doesn't seem worthwhile to build a complex mechanism to avoid it.)
Again, you probably want to avoid trying to extend token matches after the fallback until you reach the longest extent. This can be solved the same way, by recording the longest matched extent, but the start condition handling is a bit simpler.
Here's some not-very-well tested code for the simpler problem, where only the first identifier and the full match are possible:
%{
/* Code to track current token position and the fallback position.
* It's a simple byte count; line-ends are not dealt with.
*/
static int token_pos = 0;
static int token_max_pos = 0;
/* This is done before every action, even empty actions */
#define YY_USER_ACTION token_pos += yyleng;
/* FALLBACK needs to undo YY_USER_ACTION */
#define FALLBACK(to) do { token_pos -= yyleng - to; yyless(to); } while (0)
/* SET_MORE needs to pre-undo the next YY_USER_ACTION */
#define SET_MORE(X) do { token_pos -= yyleng; yymore(); } while(0)
%}
%x EXTEND_ID
ident [[:alpha:]_][[:alnum:]_]*
%%
int fallback_leng = 0;
/* The trailing context here is to avoid triggering EOF in
* the EXTEND_ID state.
*/
{ident}/[^[:alnum:]_] {
/* In the fallback region, don't attempt to extend the match. */
if (token_pos <= token_max_pos)
return Y_IDENT;
fallback_leng = yyleng;
BEGIN(EXTEND_ID);
SET_MORE(yymore);
}
{ident} { return find_type(yytext) ? Y_TYPE_NAME : Y_IDENT; }
<EXTEND_ID>{
([[:space:]]*"::"[[:space:]]*{ident})*|.|\n {
BEGIN(INITIAL);
if (yyleng == fallback_leng + 1)
FALLBACK(fallback_leng);
if (find_type(yytext))
return Y_TYPE_NAME;
else {
FALLBACK(fallback_leng);
return Y_IDENT;
}
}
}
Notes
At least, I think you can do that. I haven't ever tried and REJECT does impose a number of limitations on other scanner features. Really, it's a historical artefact which massively slows down lexical analysis and should generally be avoided.
The use of REJECT anywhere in the rules causes Flex to switch to a different template in which a list of endpoints is retained, which has a cost in both time and space. Another consequence is that Flex can no longer resize the input buffer.

Bison Flex reduce/reduce conflict on dangling else with mid action

I am currently moving a fun-project of mine over to bison/flex as parser and have trouble solving a reduce/reduce conflict:
// https://github.com/X39/yaoosl/blob/master/code-gen/yaoosl.y#L761-L766
ifthen: YST_IF YST_ROUNDO expression code_ifstart YST_ROUNDC codebody code_ifendnoelse
| YST_IF YST_ROUNDO expression code_ifstart YST_ROUNDC ifthen_clsd YST_ELSE code_ifelse ifthen code_ifendelse
;
ifthen_clsd: codebody
| YST_IF YST_ROUNDO expression code_ifstart YST_ROUNDC ifthen_clsd code_ifelse YST_ELSE ifthen_clsd code_ifendelse
;
Note: stuff prefixed with code_ are the mid-actions
Could somebody explain to me how to solve this properly and why the "go-to" solution is either wrong or did not worked?
Thanks, X39
Since the two rules are identical up to the code_ifelse (and assuming code_ifelse is an empty rule, like an in-rule action), it can't tell whether to reduce code_ifelse before or after the YST_ELSE. You might be able to fix it by making the two rules consistent with the order of code_ifelse and YST_ELSE.
Some rules-of-thumb for grammars:
Don't use symblic names for single char tokens like '(' and ')' -- it just obfuscates things and makes the grammar hard to read and understand.
Don't use in-rule actions unless absolutely necessary -- it's better to create a single token rule with an end-rule action that does what you need.

Flex does not recognize identifiers

I am trying to implement a very simple parser using flex. I am currently stuck in the ID recognition. That is my code:
ID [a−zA−Z_][a−zA−Z0−9_]*
...
{ID} { printf( "An identifier: %s\n", yytext ); return TOK_ID;}
However what I get is only the first letter of the identifier, for example if I try to parse:
int _underscore ;
The result is:
An identifier: _
Any advice?
EDIT:
With a more accurate analysis I have figured out that the code is able to recognize only the id with a,z,A,Z,_, that are the explicit characters in the regular expression. I did not find anything like that online, is that a bug?
EDIT2:
If I modify the code in that way all work
ID [a−zA−Z_][a−zA−Z0−9_]*
...
[a−zA−Z_][a−zA−Z0−9_]* { printf( "An identifier: %s\n", yytext ); return TOK_ID;}
According to the documentation it should work also in the other way.
This is a character encoding issue. In your copy-and-pasted source code, the things that look like ASCII hyphens (-, code U+2D) in your definition of ID:
ID [a−zA−Z_][a−zA−Z0−9_]*
aren't. Instead they're unicode minus signs (−, U+2212). If you replace the incorrect minus signs with the correct hyphens, the line will look like:
ID [a-zA-Z_][a-zA-Z0-9_]*
Depending on your font, if you look very closely, you may see a difference between the − in the first version and the - in the second.
Anyway, replace your ID definition with the second version above (or else retype it from scratch, and all should be well.

How to avoid conflicts when generating IR codes in yacc?

%nonassoc NO_ELSE
%nonassoc ELSE
stmt_conditional
: IF '(' expr { show_if_begin($3); } ')' stmt_compound { show_if_end(); } %prec NO_ELSE
| IF '(' expr { show_if_begin($3); } ')' stmt_compound { show_if_else(); } ELSE stmt_compound { show_if_end(); }
Since stmt_compound will generate IRs, show_if_begin() should be in front of stmt_compound. However, this will cause reduce/reduce conflicts in yacc. How to solve this problem?
EDITED
This is what I've tried, but it didn't work.
stmt_conditional
: stmt_cond_if { show_if_end(); } %prec NO_ELSE
| stmt_cond_if { show_if_else(); } ELSE stmt_compound { show_if_end(); }
;
stmt_cond_if
: IF '(' expr { show_if_begin($3); } ')' stmt_compound
The reason why your variant doesn't work is that you need infinite lookakead to distinguish between the two kinds of if statement. When your parser looks at an IF token, it has no idea whether it will see ELSE or NO_ELSE, or when to expect one if those, but it has to decide where to shift now in order to process the expression.
The solution is to factor out more common stuff so that the decision is made exactly when it's needed.
stmt_conditional
: stmt_cond_if { ... } else_part
;
else_part
: %prec NO_ELSE
| ELSE stmt_compound { ... }
;
In theory, the parser generator should be able to refactor these rules automatically, but code blocks in the middle of the rule prevent yacc from doing so. Even empty code blocks in otherwise identical rules prevent them from being flattened out. That's because yacc sees the contents of code blocks as black boxes. It has no idea what these actions mean and has to assume they all might mean different things, even if the text is identical.
You can verify that when you remove the actions, the conflict disappears. Of course a yacc grammar without actions is less useful.
Naturally this factoring-out unifies your actions for THEN that originally were distinct for else and no-else variants. There is no way around this. You have to write an action that is good for both cases.
Edit In practice, such code blocks are handled by yacc by creating a new invisible rule and a new non-terminal for each mid-rule action. Quoth the yacc page:
Actions that do not terminate a rule are actually handled by Yacc by manufacturing a new nonterminal symbol name, and a new rule matching this name to the empty string. The interior action is the action triggered off by recognizing this added rule.
It's these invisible rules that actually cause conflicts in your grammar. So another solution would be to leave the rules proper as is, but move actions to the end of their respective rules. I personally still like to factor out common parts manually. -- end edit.
Without %prec there would be a shift/reduce warning, because
if (x)
if (y)
do_something
else
do_something_else
is ambiguous: when the parser sees else, it doesn't know whether it belongs to the inner if (shift) or the outer one (reduce). The default behaviour is to shift, which is what we normally want. %prec makes this explicit and eliminates the warning.
Note this still may interfere with the rest of your grammar. If you can change your syntax I would suggest switching from C-like compound-statement-based syntax to an ENDIF-based one as found in e.g. Ada.

Can Antlr C parser recover from invalid tokens

I developed an Antlr3.4, grammar which generates an AST for later parsing. The generated parser uses Antlr's C interface. When the parser encounters an unexpected token it adds
"Tree Error Node" to the AST token stream and continues on processing input. (Internally "Tree Error Node" represents ANTLR3_TOKEN_INVALID.)
When I pass the output of the parser to the AST parser, it halts upon the "Tree Error Node". Is there anyway to handle invalid tokens in an AST stream?
I'm using:
libantlr3c-3.4
antlr3.4
I turns out you can override the tree adaptor method "errorNode" to emit a user specified token. That token can then be handled in the AST parser.
You need to override Match() using the method described above and perform recover of the parser (this is c# pseudo code):
public override object Match(IIntStream input, int ttype, BitSet follow)
{
if (needs recover)
{
... Recover from mismatch, i.e. skip until next valid terminal.
}
return base.Match(input, ttype, follow);
}
Also, you need to recover from mismatched token:
protected override object RecoverFromMismatchedToken(IIntStream input, int ttype, BitSet follow)
{
if (needs recover)
{
if (unwanted token(input, ttype))
{
.. go to the next valid terminal
.. consume as if ok
.. return next valid token
}
if (missing token(input, follow))
{
.. go to the next valid terminal
.. insert missing symbol and return
}
.. othwerwise throw
}
.. call base recovery(input, ttype, follow);
}
Let me know if there are additional questions.

Resources