I'm having a problem with an island grammar and a non-greedy rule used to consume "everything except what I want".
Desired outcome:
My input file is a C header file, containing function declarations along with typedefs, structs, comments, and preprocessor definitions.
My desired output is parsing and subsequent transformation of function declarations ONLY. I would like to ignore everything else.
Setup and what I've tried:
The header file I'm attempting to lex and parse is very uniform and consistent.
Every function declaration is preceded by a linkage macro PK_linkage_m, and all functions return the same type PK_ERROR_code_t, e.g.:
PK_linkage_m PK_ERROR_code_t PK_function(...);
These tokens don't appear anywhere other than at the start of a function declaration.
I have approached this as an island grammar, that is, function declarations in a sea of text.
I have tried to use the linkage token PK_linkage_m to indicate the end of the "TEXT" and the PK_ERROR_code_t token as the start of the function declaration.
Observed problem:
While lexing and parsing a single function declaration works, it fails when I have more than one function declaration in a file. The token stream shows that everything, all of the function declarations, and the PK_ERROR_code_t of the last function declaration are consumed as text, and then only the last function declaration in the file is correctly parsed.
My one line summary is: My non-greedy grammar rule to consume everything before the PK_ERROR_code_t is consuming too much.
What I perhaps incorrectly believe is the solution:
Fix my non-greedy lexer rule somehow so that it consumes everything until it finds the PK_linkage_m token; my non-greedy rule appears to consume too much.
What I haven't tried:
As this is my first ANTLR project, and my first language-parsing project in a very long time, I'd be more than happy to rewrite it if my approach is wrong. I was considering using line terminators to skip everything that doesn't start with a newline, but I'm not sure how to make that work, and not sure how it's fundamentally different.
Here is my lexer file KernelLexer.g4:
lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant
@lexer::members {
public static final int WHITESPACE = 1;
}
PK_ERROR: 'PK_ERROR_code_t' -> mode(FUNCTION);
PK_LINK: 'PK_linkage_m';
//Doesn't work: once it starts consuming, it doesn't stop.
TEXT_SEA: .*? PK_LINK -> skip;
TEXT_WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;
mode FUNCTION;
//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';
COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;
ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';
WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;//channel(1);
Here is my parser file KernelParser.g4:
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : func_decl+;
func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;
Here is a simple example input file:
/*some stuff*/
other stuff;
PK_linkage_m PK_ERROR_code_t PK_CLASS_ask_superclass
(
/* received */
PK_CLASS_t /*class*/, /* a class */
/* returned */
PK_CLASS_t *const /*superclass*/ /* immediate superclass of class */
);
/*some stuff*/
blar blar;
PK_linkage_m PK_ERROR_code_t PK_CLASS_is_subclass
(
/* received */
PK_CLASS_t /*may_be_subclass*/, /* a potential subclass */
PK_CLASS_t /*class*/, /* a class */
/* returned */
PK_LOGICAL_t *const /*is_subclass*/ /* whether it was a subclass */
);
more stuff;
Here is the token output:
line 28:0 token recognition error at: 'more stuff;\r\n'
[#0,312:326='PK_ERROR_code_t',<'PK_ERROR_code_t'>,18:13]
[#1,328:347='PK_CLASS_is_subclass',<ID>,18:29]
[#2,350:350='(',<'('>,19:0]
[#3,369:378='PK_CLASS_t',<ID>,21:0]
[#4,390:408='/*may_be_subclass*/',<COMMENTED_NAME>,21:21]
[#5,409:409=',',<','>,21:40]
[#6,439:448='PK_CLASS_t',<ID>,22:0]
[#7,460:468='/*class*/',<COMMENTED_NAME>,22:21]
[#8,469:469=',',<','>,22:30]
[#9,512:523='PK_LOGICAL_t',<ID>,24:0]
[#10,525:525='*',<'*'>,24:13]
[#11,526:530='const',<'const'>,24:14]
[#12,533:547='/*is_subclass*/',<COMMENTED_NAME>,24:21]
[#13,587:588=');',<');'>,25:0]
[#14,608:607='<EOF>',<EOF>,29:0]
It's always difficult to cope with lexer rules that "read everything but ...", but you are on the right path.
After commenting out the skip action (TEXT_SEA: .*? PK_LINK ; //-> skip;), I observed that the first function was consumed by a second TEXT_SEA token (because the lexer prefers the longest match, TEXT_SEA gives PK_ERROR no chance to be seen):
$ grun Kernel file -tokens input.txt
line 27:0 token recognition error at: 'more stuff;'
[#0,0:41='/*some stuff*/\n\nother stuff;\n\nPK_linkage_m',<TEXT_SEA>,1:0]
[#1,42:292=' PK_ERROR_code_t PK_CLASS_ask_superclass\n(\n/* received */\nPK_CLASS_t
/*class*/, /* a class */\n/* returned */\nPK_CLASS_t *const /*superclass*/
/* immediate superclass of class */\n);\n\n/*some stuff*/\nblar blar;\n\n\n
PK_linkage_m',<TEXT_SEA>,5:12]
[#2,294:308='PK_ERROR_code_t',<'PK_ERROR_code_t'>,17:13]
[#3,310:329='PK_CLASS_is_subclass',<ID>,17:29]
This gave me the idea to use TEXT_SEA both as "sea consumer" and as the trigger for the function mode.
lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant
@lexer::members {
public static final int WHITESPACE = 1;
}
PK_LINK: 'PK_linkage_m' ;
TEXT_SEA: .*? PK_LINK -> mode(FUNCTION);
LINE : .*? ( [\r\n] | EOF ) ;
mode FUNCTION;
//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';
PK_ERROR : 'PK_ERROR_code_t' ;
COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;
ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';
WS: [ \t\r\n]+ -> channel(HIDDEN) ;
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : ( TEXT_SEA | func_decl | LINE )+;
func_decl
: PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK
{System.out.println("---> Found declaration on line " + $start.getLine() + " `" + $text + "`");}
;
param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;
Execution :
$ grun Kernel file -tokens input.txt
[#0,0:41='/*some stuff*/\n\nother stuff;\n\nPK_linkage_m',<TEXT_SEA>,1:0]
[#1,42:42=' ',<WS>,channel=1,5:12]
[#2,43:57='PK_ERROR_code_t',<'PK_ERROR_code_t'>,5:13]
[#3,58:58=' ',<WS>,channel=1,5:28]
[#4,59:81='PK_CLASS_ask_superclass',<ID>,5:29]
[#5,82:82='\n',<WS>,channel=1,5:52]
[#6,83:83='(',<'('>,6:0]
...
[#24,249:250=');',<');'>,11:0]
[#25,251:292='\n\n/*some stuff*/\nblar blar;\n\n\nPK_linkage_m',<TEXT_SEA>,11:2]
[#26,293:293=' ',<WS>,channel=1,17:12]
[#27,294:308='PK_ERROR_code_t',<'PK_ERROR_code_t'>,17:13]
[#28,309:309=' ',<WS>,channel=1,17:28]
[#29,310:329='PK_CLASS_is_subclass',<ID>,17:29]
[#30,330:330='\n',<WS>,channel=1,17:49]
[#31,331:331='(',<'('>,18:0]
...
[#55,562:563=');',<');'>,24:0]
[#56,564:564='\n',<LINE>,24:2]
[#57,565:565='\n',<LINE>,25:0]
[#58,566:566='\n',<LINE>,26:0]
[#59,567:577='more stuff;',<LINE>,27:0]
[#60,578:577='<EOF>',<EOF>,27:11]
---> Found declaration on line 5 `PK_ERROR_code_t PK_CLASS_ask_superclass
(
PK_CLASS_t /*class*/,
PK_CLASS_t *const /*superclass*/
);`
---> Found declaration on line 17 `PK_ERROR_code_t PK_CLASS_is_subclass
(
PK_CLASS_t /*may_be_subclass*/,
PK_CLASS_t /*class*/,
PK_LOGICAL_t *const /*is_subclass*/
);`
Instead of including .*? at the start of a rule (which I'd always try to avoid), why don't you try to match either:
a PK_ERROR in the default mode (and switch to another mode like you're now doing),
or else match a single character and skip it?
Something like this:
lexer grammar KernelLexer;
PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER : . -> skip;
mode FUNCTION;
// the rest of your rules as you have them now
Note that this will match PK_ERROR_code_t as well for the input "PK_ERROR_code_t_MU ...", so this would be a safer way:
lexer grammar KernelLexer;
PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER : ( [a-zA-Z_] [a-zA-Z_0-9]* | . ) -> skip;
mode FUNCTION;
// the rest of your rules as you have them now
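For completeness, here is roughly what the whole lexer would then look like; the FUNCTION mode rules are copied unchanged from the question:
lexer grammar KernelLexer;
PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER : ( [a-zA-Z_] [a-zA-Z_0-9]* | . ) -> skip;
mode FUNCTION;
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';
COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;
ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';
WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;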
Your parser grammar could then look like this:
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : func_decl+ EOF;
func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block : param_decl*;
param_decl : type_decl COMMENTED_NAME COMMA?;
type_decl : CONST? STAR* ID STAR* CONST?;
causing your example input to be parsed as just the two function declarations.
I'm new to Lex/Yacc. Found these lex & yacc files which parse ansi C.
To experiment I added an action to print part of the parsing:
constant
: I_CONSTANT { printf("I_CONSTANT %d\n", $1); }
| F_CONSTANT
| ENUMERATION_CONSTANT /* after it has been defined as such */
;
The problem is, no matter where I put the action, and whatever $X I use, I always get value 0.
Here I got printed:
I_CONSTANT 0
Even though my input is:
int foo(int x)
{
return 5;
}
Any idea?
Nothing in the lex file you point to actually sets semantic values for any token. As the author says, the files are just a grammar and "the bulk of the work" still needs to be done. (There are other caveats having to do with the need for a preprocessor.)
Since nothing in the lex file ever sets yylval, it will always be 0, and that is what yacc/bison will find when it sets up the semantic value for the token ($1 in this case).
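For example, a minimal sketch of the kind of action that is missing from the lex file (the exact pattern used in those files may differ; I_CONSTANT is the token name from the grammar above):
[0-9]+    { yylval = atoi(yytext); return I_CONSTANT; }
With an action like that, $1 in the yacc rule receives the integer's value instead of 0.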
Turns out yylval = atoi(yytext) is not done in the lex file, so I had to add it myself. I also learned that I can add extern char *yytext to the yacc file header and then use yytext directly.
I am learning how to use PEGKit, but am running into problem with creating a grammar for a script that parses lines, even when they are separated by multiple line break characters. I have reduced the problem to this grammar:
expr
@before {
PKTokenizer *t = self.tokenizer;
self.silentlyConsumesWhitespace = NO;
t.whitespaceState.reportsWhitespaceTokens = YES;
self.assembly.preservesWhitespaceTokens = YES;
}
= Word nl*;
nl = nl_char nl_char*;
nl_char = '\n'! | '\r'!;
To me, this simple grammar should allow one word per line, with as many line breaks as necessary. But it only allows one word with an optional line break. Does anybody know what's wrong here? Thank you.
Creator of PEGKit here.
Try the following grammar instead (make sure you are using HEAD of master):
@before {
PKTokenizer *t = self.tokenizer;
[t.whitespaceState setWhitespaceChars:NO from:'\\n' to:'\\n'];
[t.whitespaceState setWhitespaceChars:NO from:'\\r' to:'\\r'];
[t setTokenizerState:t.symbolState from:'\\n' to:'\\n'];
[t setTokenizerState:t.symbolState from:'\\r' to:'\\r'];
}
lines = line+;
line = ~eol* eol+; // note the `~` Not unary operator. this means "zero or more NON eol tokens, followed by one or more eol token"
eol = '\n'! | '\r'!;
Note that here, I am tweaking the tokenizer to recognize newlines and carriage returns as Symbols rather than whitespace. That makes it easier to match and discard them (they are discarded by the ! operator).
For another approach to the same problem using the builtin S whitespace rule, see here.
I'm making a static analyzer for C.
I have done the lexer and parser using ANTLR, which generates Java code.
Does ANTLR build the AST for us automatically with options {output=AST;}? Or do I have to make the tree myself? If it does, how do I spit out the nodes of that AST?
I am currently thinking that the nodes on that AST will be used for making SSA, followed by data flow analysis in order to make the static analyzer. Am I on the right path?
Raphael wrote:
Does antlr build the AST for us automatically by option{output=AST;}? Or do I have to make the tree myself? If it does, then how to spit out the nodes on that AST?
No, the parser does not know what you want as root and as leaves for each parser rule, so you'll have to do a bit more than just put options { output=AST; } in your grammar.
For example, when parsing the source "true && (false || true && (true || false))" using the parser generated from the grammar:
grammar ASTDemo;
options {
output=AST;
}
parse
: orExp
;
orExp
: andExp ('||' andExp)*
;
andExp
: atom ('&&' atom)*
;
atom
: 'true'
| 'false'
| '(' orExp ')'
;
// ignore white space characters
Space
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
the following parse tree is generated:
(i.e. just a flat, one-dimensional list of tokens)
You'll want to tell ANTLR which tokens in your grammar become root, leaves, or simply left out of the tree.
Creating AST's can be done in two ways:
use rewrite rules which look like this: foo : A B C D -> ^(D A B);, where foo is a parser rule that matches the tokens A B C D. So everything after the -> is the actual rewrite rule. As you can see, the token C is not used in the rewrite rule, which means it is omitted from the AST. The token placed directly after the ^( will become the root of the tree;
use the tree-operators ^ and ! after a token inside your parser rules where ^ will make a token the root, and ! will delete a token from the tree. The equivalent for foo : A B C D -> ^(D A B); would be foo : A B C! D^;
Both foo : A B C D -> ^(D A B); and foo : A B C! D^; will produce the same AST: a tree with D as the root and A and B as its children.
Now, you could rewrite the grammar as follows:
grammar ASTDemo;
options {
output=AST;
}
parse
: orExp
;
orExp
: andExp ('||'^ andExp)* // Make `||` root
;
andExp
: atom ('&&'^ atom)* // Make `&&` root
;
atom
: 'true'
| 'false'
| '(' orExp ')' -> orExp // Just a single token, no need to do `^(...)`,
// we're removing the parenthesis. Note that
// `'('! orExp ')'!` will do exactly the same.
;
// ignore white space characters
Space
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
which will create the following AST from the source "true && (false || true && (true || false))" (in LISP notation: (&& true (|| false (&& true (|| true false))))).
Related ANTLR wiki links:
Tree construction
Tree parsing
Tree construction facilities
Raphael wrote:
I am currently thinking that the nodes on that AST will be used for making SSA, followed by data flow analysis in order to make the static analyzer. Am I on the right path?
Never did anything like that, but IMO the first thing you'd want is an AST from the source, so yeah, I guess you're on the right path! :)
EDIT
Here's how you can use the generated lexer and parser:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "true && (false || true && (true || false))";
ASTDemoLexer lexer = new ASTDemoLexer(new ANTLRStringStream(src));
ASTDemoParser parser = new ASTDemoParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
I have some questions about ANTLR3 with a tree grammar in the C target.
I have almost done my interpreter (functions, variables, boolean and math expressions are OK) and I have kept the most difficult statements for the end (like if, switch, etc.)
1) I would like to interpret a simple loop statement:
repeat: ^(REPEAT DIGIT stmt);
I've seen many examples but nothing about the tree walker (only a topic here with the macros MARK() / REWIND(m) + @init / @after, but it's not working; I get ANTLR errors: "unexpected node at offset 0"). How can I interpret this statement in C?
2) Same question with a simple if statement:
if: ^(IF condition stmt elseifstmt* elsestmt?);
The problem is skipping the statement if the condition is false and testing the other elseif/else statements.
3) I have some statements which can stop the script (like "break" or "exit"). How can I interrupt the tree walker and skip the remaining tokens?
4) When a lexer or parser error is detected, ANTLR reports an error. But I would like to produce my own error messages. How can I get the line number where the parser failed?
Ask me if you want more details.
Thank you very much (and I apologize for my poor English).
About the repeat statement, I think I've found a way to do it. On antlr.org, I found a complete interpreter for the C-- language, but written in Java.
I put the while statement here (a bit different, but the approach is the same):
whileStmt
scope{
Boolean breaked;
}
@after{
CommonTree stmtNode=(CommonTree)$whileStmt.start.getChild(1);
CommonTree exprNode=(CommonTree)$whileStmt.start.getChild(0);
int test;
$whileStmt::breaked=false;
while($whileStmt::breaked==false){
stream.push(stream.getNodeIndex(exprNode));
test=expr().value;
stream.pop();
if (test==0) break;
stream.push(stream.getNodeIndex(stmtNode));
stmt();
stream.pop();
}
}
: ^(WHILE . .)
;
I've tried to translate this code into C:
repeat
scope {
int breaked;
int tours;
}
@after
{
int test;
pANTLR3_BASE_TREE repeatstmt = (pANTLR3_BASE_TREE)$repeat.start->getChild($repeat.start,1);
pANTLR3_BASE_TREE exprstmt = (pANTLR3_BASE_TREE)$repeat.start->getChild($repeat.start,0);
$repeat::breaked = 0;
test = 1;
while($repeat::breaked == 0)
{
TW_FOLLOWPUSH(exprstmt);
TW_FOLLOWPOP();
test++;
if(test == $repeat::tours)
break;
TW_FOLLOWPUSH(repeatstmt);
CTX->repeat(CTX);
TW_FOLLOWPOP();
}
}
: ^(REPEAT DIGIT stmt)
{
$repeat::tours = $DIGIT.text->toInt32($DIGIT.text);
}
But nothing happens (stmt is parsed just once).
Do you have an idea about this, please?
About the homemade error messages, I've found the macro GETLINE() in the lexer. It works when the tree walker crashes, but ANTLR still displays its own error messages for lexer or parser errors.
Thanks.