Can the ANTLR C parser recover from invalid tokens?

I developed an ANTLR 3.4 grammar which generates an AST for later parsing. The generated parser uses ANTLR's C interface. When the parser encounters an unexpected token, it adds a
"Tree Error Node" to the AST token stream and continues processing input. (Internally, "Tree Error Node" represents ANTLR3_TOKEN_INVALID.)
When I pass the output of the parser to the AST parser, it halts upon the "Tree Error Node". Is there any way to handle invalid tokens in an AST stream?
I'm using:
libantlr3c-3.4
antlr3.4

It turns out you can override the tree adaptor method "errorNode" to emit a user-specified token. That token can then be handled in the AST parser.

You need to override Match() using the method described above and perform the parser's recovery yourself (this is C# pseudocode):
public override object Match(IIntStream input, int ttype, BitSet follow)
{
    if (needs recover)
    {
        ... Recover from mismatch, i.e. skip until next valid terminal.
    }
    return base.Match(input, ttype, follow);
}
Also, you need to recover from a mismatched token:
protected override object RecoverFromMismatchedToken(IIntStream input, int ttype, BitSet follow)
{
    if (needs recover)
    {
        if (unwanted token(input, ttype))
        {
            .. go to the next valid terminal
            .. consume as if ok
            .. return next valid token
        }
        if (missing token(input, follow))
        {
            .. go to the next valid terminal
            .. insert missing symbol and return
        }
        .. otherwise throw
    }
    .. call base recovery(input, ttype, follow);
}
Let me know if there are additional questions.

Related

Flex: REJECT rejects one character at a time?

I'm parsing C++-style scoped names, e.g., A::B. I want to parse such a name as a single token for a type name if it was previously declared as a type; otherwise, I want to parse it as three tokens A, ::, and B.
Given:
L [A-Za-z_]
D [0-9]
S [ \f\r\t\v]
identifier {L}({L}|{D})*
sname {identifier}({S}*::{S}*{identifier})+
%%
{sname} {
    if ( find_type( yytext ) )
        return Y_TYPE_NAME;
    REJECT;
}
{identifier} {
    // ...
    return Y_NAME;
}
However, for the input sequence:
A::BB -> Not a type; returned as "A", "::", "BB"
(Declare A::B as a type.)
A::BB
On the second parse of A::BB, REJECT is called, and flex discards only one character from the end and tries to match A::B (one B). This matches the previously declared A::B in the {sname} rule, which is wrong.
What I assumed REJECT did was to proceed to the second-best matching rule with the same input. Hence, I expected it to match A alone in {identifier} and just leave ::BB on the input stream. But, as shown, that's not what happens. It peels off one character at a time from the end of the input and re-attempts a match.
Adding in yyless() to chop off the ::BB doesn't help:
{sname} {
    if ( find_type( yytext ) )
        return Y_TYPE_NAME;
    char const *const sep = strpbrk( yytext, ": \t" );
    size_t keep_len = (size_t)(sep - yytext);
    yyless( keep_len );
    REJECT;
}
The only thing that I've found that works is:
{sname} {
    if ( find_type( yytext ) )
        return Y_TYPE_NAME;
    char const *const sep = strpbrk( yytext, ": \t" );
    size_t keep_len = (size_t)(sep - yytext);
    yyless( keep_len );
    goto identifier;
}
{identifier} {
identifier:
    // ...
    return Y_NAME;
}
But this seems kind of hack-ish. Is there a more canonical "flex way" to do what I want?
Despite the hack-ish-ness, is there actually anything wrong with my solution? I.e., would it not work in some cases? Or not work in some versions of Flex?
Yes, I'm aware that, even if it did work the way I want, this won't parse all contrived C++ scoped names; but it's good enough.
Yes, I'm aware that generally parsing such things should be done as separate tokens, but C-like languages are hard to parse since they're not context-free. Knowing when you have a type really helps.
Yes, I'm aware that REJECT slows down the lexer. This is (mostly) for an interactive and command-line tool in a terminal, so the human is the slowest component.
I'd like to focus on the problem at hand with the code as it mostly is. Mine is more of a question about how to use REJECT, yyless(), etc., to get the desired behavior. Thanks.
REJECT does not make any distinction between different rules; it just falls back to the next possible accepting pattern (which might not even be shorter, if there's a lower-precedence rule which matches the same token.) That might be a shorter match of the same pattern. (Normally, Flex chooses the longest match out of the possible matches of the regular expression. With REJECT, the shorter matches are also considered.)
So you can avoid the false match of A::B for input A::BB by using trailing context: [Note 1]
{sname}/[^[:alnum:]_] {...}
(In Flex, you can't put trailing context in a macro because the expansion is surrounded with parentheses.)
You could use that solution if you wanted to try all possible complete prefixes of id1::id2::id3::id4 starting with the longest one (id1::id2::id3::id4) and falling back to each shorter prefix in turn (id1::id2::id3, id1::id2, id1). The trailing context prevents intermediate matches in the middle of an identifier.
I'm not sure if that is the problem you are trying to solve, though. And even if it is, it doesn't really solve the problem of what to do after you fall back to an acceptable solution, because you probably don't want to redo the search for prefixes in the middle of the sequence. In other words, if the original were A::B::A::B, and A::B is a "known prefix", the tokens returned should (presumably) be A::B, ::, A, ::, B rather than A::B, ::, A::B.
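To make the fallback order concrete, here is a minimal sketch in plain C (not Flex) of which prefixes the trailing-context version of the rule would consider for a scoped name, longest first. The function name scoped_prefix_lengths and the 32-part cap are made up for illustration, and it only treats spaces and tabs as separators, which is narrower than the {S} macro in the question.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_PARTS 32  /* arbitrary cap for this sketch */

static int is_ident_char(int c) { return isalnum(c) || c == '_'; }

/* Stores in out[] the length of every complete prefix of a scoped name
 * (a prefix that ends exactly at the end of an identifier), longest
 * first. Returns how many lengths were stored. */
size_t scoped_prefix_lengths(const char *name, size_t *out, size_t max_out)
{
    size_t ends[MAX_PARTS];
    size_t n = 0, i = 0, count = 0;

    assert(name != NULL && out != NULL);
    while (n < MAX_PARTS) {
        size_t start = i, j;
        /* Consume one identifier: a letter or '_' first, then alnum or '_'. */
        if (isalpha((unsigned char)name[i]) || name[i] == '_')
            while (is_ident_char((unsigned char)name[i]))
                i++;
        if (i == start)
            break;                      /* no identifier here */
        ends[n++] = i;                  /* a valid fallback point */
        /* Skip optional blanks, a "::" separator, optional blanks. */
        j = i;
        while (name[j] == ' ' || name[j] == '\t')
            j++;
        if (name[j] != ':' || name[j + 1] != ':')
            break;
        j += 2;
        while (name[j] == ' ' || name[j] == '\t')
            j++;
        i = j;
    }
    /* Report the boundaries in reverse: longest prefix first. */
    while (n > 0 && count < max_out)
        out[count++] = ends[--n];
    return count;
}
```

For the input A::BB this reports only the lengths 5 and 1, so the four-character prefix A::B is never offered as a candidate, which is the effect the trailing context has on REJECT.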
On the other hand, you might not care about previously defined prefixes; only whether the complete sequence is a known type. In that case, the possible token sequences for A::B::C::D are [TYPENAME(A::B::C::D)] and [ID(A), ::, ID(B), ::, ID(C), ::, ID(D)]. For this case, you don't want to restrict the {sname} fallbacks; you want to eliminate them. After the fallback, you only want to try matches for different rules.
There are a variety of alternative solutions (which do not use REJECT), mostly relying on the possibility of using start conditions. (You still need to think about what happens after the fallback. Start conditions can be useful for that, as well.)
One fairly general solution is to define a start condition which excludes the rule you want to fall back from. Once you've determined that you want to fall back to a different rule, you change to a start condition which excludes the rule which just matched and then call yyless(0) (without returning). That will cause Flex to rescan the current token from the beginning in the new start condition. (If you have several possible rules which might match, you will need several start conditions. This could get out of hand, but in most cases the possible set of matching rules is very limited.)
At some point, you need to return to the previous start condition. (You could use a start condition stack if it's not trivial to figure out which was the previous start condition.) Ideally, to avoid false interior matches, you would want to return to the previous start condition only when you reached the end of the first (incorrect) match. That's easy to do if you are tracking the input position for each token; you just need to save the position just before calling yyless(0). (You also need to correct the token input position by subtracting yyleng before yyless sets yyleng to 0).
Rescanning from the beginning of the token might seem inefficient, but it's less inefficient than the overhead imposed by REJECT. And the overhead of REJECT affects the entire scanner operation, while the rescanning solution is essentially free except for the tokens you happen to rescan. [Note 2]
(BEGIN(some_state); yyless(0);) is a reasonably common flex idiom; this is not the only use case. Another one is the answer to the question "How do I run a start condition until I reach a token I can't identify without consuming that token?"
But I think that in this case there is a simpler solution, using yymore to accumulate the token. (This avoids having to do your own dynamically expanded token buffer.) Again, there are two possibilities, depending on whether you might allow initial prefix matches or restrict the possibilities to either the full sequence or the first identifier.
But the outline is the same: Match the shortest valid possibility first, remember how long the token was at that point, and then use a start condition to enable a pattern which extends the token, either to the next possibility or to the end of the sequence, depending on your needs. Before continuing with the scanner loop, you call yymore() which indicates to Flex that the next pattern extends the token rather than replacing it.
At each possible match end, you test to see if that would really be a valid token and if so, record the position (and whatever else you might need to recall, such as the token type). When you reach a point where you can no longer extend the match, you use yyless to fall back to the last valid match point, and return the token.
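The bookkeeping described above (extend the match one step at a time, remember the last point that was a valid token, fall back to it at the end) can be sketched outside Flex in plain C. This is only a simulation of the idea, not scanner code: is_known_type is a hypothetical stand-in for the question's find_type() that knows the single type A::B, choose_token_length is a made-up name, and whitespace around :: is not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for find_type(): only "A::B" is a known type. */
static int is_known_type(const char *text, size_t len)
{
    static const char known[] = "A::B";
    return len == sizeof known - 1 && memcmp(text, known, len) == 0;
}

/* Length of the identifier starting at s, or 0 if none starts there. */
static size_t ident_len(const char *s)
{
    size_t n = 0;
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}

/* Returns the length of the token to hand back: the longest prefix
 * ending at an identifier boundary that is a known type, or else just
 * the first identifier. Returns 0 if text does not start with one. */
size_t choose_token_length(const char *text)
{
    size_t pos, fallback;

    assert(text != NULL);
    pos = ident_len(text);
    fallback = pos;                 /* shortest valid match */
    while (pos > 0) {
        size_t next;
        if (is_known_type(text, pos))
            fallback = pos;         /* remember the last valid match end */
        if (text[pos] != ':' || text[pos + 1] != ':')
            break;                  /* cannot extend any further */
        next = ident_len(text + pos + 2);
        if (next == 0)
            break;
        pos += 2 + next;            /* extend, as yymore() would */
    }
    return fallback;                /* the point yyless() would cut back to */
}
```

With these assumptions, A::B::C yields 4 (the known type A::B) and A::BB yields 1 (just the identifier A).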
This is slightly less inefficient than the pure yyless() solution because it avoids the rescan of the token which is returned. (All these solutions, including REJECT, do rescan the text following the selected match if it is shorter than the longest possible extended match. That's theoretically avoidable but since it's not a lot of overhead, it doesn't seem worthwhile to build a complex mechanism to avoid it.)
Again, you probably want to avoid trying to extend token matches after the fallback until you reach the longest extent. This can be solved the same way, by recording the longest matched extent, but the start condition handling is a bit simpler.
Here's some not-very-well tested code for the simpler problem, where only the first identifier and the full match are possible:
%{
/* Code to track current token position and the fallback position.
 * It's a simple byte count; line-ends are not dealt with.
 */
static int token_pos = 0;
static int token_max_pos = 0;
/* This is done before every action, even empty actions */
#define YY_USER_ACTION token_pos += yyleng;
/* FALLBACK needs to undo YY_USER_ACTION */
#define FALLBACK(to) do { token_pos -= yyleng - (to); yyless(to); } while (0)
/* SET_MORE needs to pre-undo the next YY_USER_ACTION */
#define SET_MORE(X) do { token_pos -= yyleng; yymore(); } while(0)
%}
%x EXTEND_ID
ident [[:alpha:]_][[:alnum:]_]*
%%
    int fallback_leng = 0;
    /* The trailing context here is to avoid triggering EOF in
     * the EXTEND_ID state.
     */
{ident}/[^[:alnum:]_] {
        /* In the fallback region, don't attempt to extend the match. */
        if (token_pos <= token_max_pos)
            return Y_IDENT;
        fallback_leng = yyleng;
        BEGIN(EXTEND_ID);
        SET_MORE(yymore);
    }
{ident} { return find_type(yytext) ? Y_TYPE_NAME : Y_IDENT; }
<EXTEND_ID>{
    ([[:space:]]*"::"[[:space:]]*{ident})*|.|\n {
        BEGIN(INITIAL);
        if (yyleng == fallback_leng + 1)
            FALLBACK(fallback_leng);
        if (find_type(yytext))
            return Y_TYPE_NAME;
        else {
            FALLBACK(fallback_leng);
            return Y_IDENT;
        }
    }
}
Notes
1. At least, I think you can do that. I haven't ever tried and REJECT does impose a number of limitations on other scanner features. Really, it's a historical artefact which massively slows down lexical analysis and should generally be avoided.
2. The use of REJECT anywhere in the rules causes Flex to switch to a different template in which a list of endpoints is retained, which has a cost in both time and space. Another consequence is that Flex can no longer resize the input buffer.

Flex does not recognize identifiers

I am trying to implement a very simple parser using flex. I am currently stuck in the ID recognition. That is my code:
ID [a−zA−Z_][a−zA−Z0−9_]*
...
{ID} { printf( "An identifier: %s\n", yytext ); return TOK_ID;}
However what I get is only the first letter of the identifier, for example if I try to parse:
int _underscore ;
The result is:
An identifier: _
Any advice?
EDIT:
On closer analysis, I have figured out that the code only recognizes identifiers made of a, z, A, Z, and _, which are the characters that appear explicitly in the regular expression. I did not find anything like that online. Is that a bug?
EDIT2:
If I modify the code this way, everything works:
ID [a−zA−Z_][a−zA−Z0−9_]*
...
[a−zA−Z_][a−zA−Z0−9_]* { printf( "An identifier: %s\n", yytext ); return TOK_ID;}
According to the documentation it should work also in the other way.
This is a character encoding issue. In your copy-and-pasted source code, the things that look like ASCII hyphens (-, U+002D) in your definition of ID:
ID [a−zA−Z_][a−zA−Z0−9_]*
aren't. Instead, they're Unicode minus signs (−, U+2212). If you replace the incorrect minus signs with the correct hyphens, the line will look like:
ID [a-zA-Z_][a-zA-Z0-9_]*
Depending on your font, if you look very closely, you may see a difference between the − in the first version and the - in the second.
Anyway, replace your ID definition with the second version above (or else retype it from scratch), and all should be well.
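If the two characters are hard to tell apart by eye, you can check the bytes instead. Here is a minimal sketch in C, assuming the scanner source is saved as UTF-8, in which U+2212 is the three-byte sequence E2 88 92. The function name count_unicode_minus is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Counts occurrences of U+2212 (MINUS SIGN) in a UTF-8 encoded,
 * NUL-terminated string by looking for the bytes E2 88 92. */
size_t count_unicode_minus(const char *s)
{
    size_t count = 0;
    size_t i;

    assert(s != NULL);
    for (i = 0; s[i] != '\0'; i++) {
        /* The && chain stops at the terminating NUL, so this never
         * reads past the end of the string. */
        if ((unsigned char)s[i] == 0xE2 &&
            (unsigned char)s[i + 1] == 0x88 &&
            (unsigned char)s[i + 2] == 0x92) {
            count++;
            i += 2;     /* skip the rest of the sequence */
        }
    }
    return count;
}
```

Running this over each line of the .l file and reporting any line with a non-zero count would point straight at the pasted definition; a line typed with ordinary hyphens gives 0.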

How to handle errors in custom MapFunction correctly?

I have implemented a MapFunction for my Apache Flink flow. It parses incoming elements and converts them to another format, but sometimes an error can occur (e.g. the incoming data is not valid).
I see two possible ways to handle it:
1. Ignore invalid elements. But it seems I can't ignore errors, because for every incoming element I must provide an outgoing element.
2. Split incoming elements into valid and invalid. But it seems I should use another function for this.
So, I have two questions:
1. How do I handle errors correctly in my MapFunction?
2. How do I implement such transformation functions correctly?
You could use a FlatMapFunction instead of a MapFunction. This would allow you to only emit an element if it is valid. The following shows an example implementation:
input.flatMap(new FlatMapFunction<String, Long>() {
    @Override
    public void flatMap(String input, Collector<Long> collector) throws Exception {
        try {
            Long value = Long.parseLong(input);
            collector.collect(value);
        } catch (NumberFormatException e) {
            // ignore invalid data: nothing is emitted for this element
        }
    }
});
This is to build on @Till Rohrmann's idea above. Adding this as an answer instead of a comment for better formatting.
I think one way to implement "split + select" could be to use a ProcessFunction with a SideOutput. My graph would look something like this:
Source --> ValidateProcessFunction ---good data--> UDF ---> SinkToOutput
                                   \
                                    \---bad data-----> SinkToErrorChannel
Would this work? Is there a better way?

Reusing ANTLR3 lexer and parser

I create an input stream from a string with
pANTLR3_UINT8 input_string = (pANTLR3_UINT8) "test";
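/* Pass the string length: sizeof(input_string) is only the size of the pointer. */
pANTLR3_INPUT_STREAM stream = antlr3StringStreamNew(input_string, ANTLR3_ENC_8BIT, (ANTLR3_UINT32) strlen((const char *) input_string), (pANTLR3_UINT8) "testname");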
and then use my lexer and parser to process the string. When I'm done with this string I want to process a new one, but re-creating the lexer and parser objects seems inefficient.
I've found the reset method of the lexer and parser classes and the reuse method of the stream, but how do I use those to parse a new string?
I believe what you're looking for is the setCharStream() function.

Recovering error tokens in parsing (Lemon)

I'm using Lemon as a parser generator. If you don't know Lemon, its error handling is the same as yacc's and bison's.
Lemon has an option to define the error token in a set of rules in order to catch parsing errors. The default behavior of the generated parser is to destroy the token causing the error; is there any way to override this behavior so that I can keep the token?
Here's an example to show what's happening. Basically, I'm appending the tokens for each rule together to re-form the input string. The grammar:
input ::= string(A). { printf("%s", A); } // Print the result
string(A) ::= string(B) part(C). { A = append(B, C); }
string(A) ::= part(B). { A = B; }
part(A) ::= NUMBER(B) NAME(C). { A = append(C, B); } // Rearrange the number and name
part(A) ::= error(B). { A = B; } // On error keep the token anyways
On input:
"Username 1234Joseph"
I get output:
"Joseph1234"
Because the text "Username " is junked by the parser in the part(A) ::= error(B) rule, but I really want:
"Username Joseph1234"
as output.
If you can solve this problem in bison or another parser generator I would accept that as an answer :)
With yacc/bison, a parsing error drops the tool into error recovery mode, if possible. It will attempt to discard tokens on its way to a "clean" state.
I'm unable to find a reference for lemon, so I can't show some lemon code to fix this, but with yacc/bison, one would use the rules here.
Namely, you need to adjust your error rule to state that the parser is ok with yyerrok to prevent it from dropping tokens. Next, it will attempt to reread the "bad" token, so you need to clear it with yyclearin. Finally, since the rule attached to your error code contains the contents of your token, you will need to set up a function that adjusts your input stack, by taking the current token contents and creating a new (proper) token with the same contents.
As an example, if a grammar defined as MyOther MyOther saw MyTok MyOther:
stack (before)
    MyTok: "the text"
    MyOther: "new text"
stack (after)
    MyOther: "the text"
    MyOther: "new text"
To accomplish this, look into using YYBACKUP. I'm unable to find an alternative method, though YYBACKUP is frowned upon.
It's an old one, but why not...
The grammar must include spaces. At the moment the grammar only allows a sequence of NUMBER NAME tokens (without any space between the tokens).
