I'm having a problem with an island grammar and a non-greedy rule used to consume "everything except what I want".
Desired outcome:
My input file is a C header file, containing function declarations along with typedefs, structs, comments, and preprocessor definitions.
My desired output is parsing and subsequent transformation of function declarations ONLY. I would like to ignore everything else.
Setup and what I've tried:
The header file I'm attempting to lex and parse is very uniform and consistent.
Every function declaration is preceded by a linkage macro PK_linkage_m, and all functions return the same type PK_ERROR_code_t, e.g.:
PK_linkage_m PK_ERROR_code_t PK_function(...);
These tokens don't appear anywhere other than at the start of a function declaration.
I have approached this as an island grammar, that is, function declarations in a sea of text.
I have tried to use the linkage token PK_linkage_m to indicate the end of the "TEXT" and the PK_ERROR_code_t token as the start of the function declaration.
Observed problem:
While lexing and parsing a single function declaration works, it fails when I have more than one function declaration in a file. The token stream shows that everything up to the last function declaration (the leading text plus all earlier function declarations) is consumed as text, and only the last function declaration in the file is parsed correctly.
My one line summary is: My non-greedy grammar rule to consume everything before the PK_ERROR_code_t is consuming too much.
What I perhaps incorrectly believe is the solution:
Fix my non-greedy lexer rule somehow so that it consumes everything until it finds the PK_linkage_m token. My non-greedy rule appears to consume too much.
What I haven't tried:
As this is my first ANTLR project, and my first language parsing project in a very long time, I'd be more than happy to rewrite it if I'm on the wrong track. I was considering using line terminators to skip every line that doesn't start with the linkage macro, but I'm not sure how to make that work, or how it would be fundamentally different.
Here is my lexer file KernelLexer.g4:
lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant
@lexer::members {
public static final int WHITESPACE = 1;
}
PK_ERROR: 'PK_ERROR_code_t' -> mode(FUNCTION);
PK_LINK: 'PK_linkage_m';
//Doesn't work. Once it starts consuming, it doesn't stop.
TEXT_SEA: .*? PK_LINK -> skip;
TEXT_WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;
mode FUNCTION;
//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';
COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;
ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';
WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;//channel(1);
Here is my parser file KernelParser.g4:
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : func_decl+;
func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;
Here is a simple example input file:
/*some stuff*/
other stuff;
PK_linkage_m PK_ERROR_code_t PK_CLASS_ask_superclass
(
/* received */
PK_CLASS_t /*class*/, /* a class */
/* returned */
PK_CLASS_t *const /*superclass*/ /* immediate superclass of class */
);
/*some stuff*/
blar blar;
PK_linkage_m PK_ERROR_code_t PK_CLASS_is_subclass
(
/* received */
PK_CLASS_t /*may_be_subclass*/, /* a potential subclass */
PK_CLASS_t /*class*/, /* a class */
/* returned */
PK_LOGICAL_t *const /*is_subclass*/ /* whether it was a subclass */
);
more stuff;
Here is the token output:
line 28:0 token recognition error at: 'more stuff;\r\n'
[@0,312:326='PK_ERROR_code_t',<'PK_ERROR_code_t'>,18:13]
[@1,328:347='PK_CLASS_is_subclass',<ID>,18:29]
[@2,350:350='(',<'('>,19:0]
[@3,369:378='PK_CLASS_t',<ID>,21:0]
[@4,390:408='/*may_be_subclass*/',<COMMENTED_NAME>,21:21]
[@5,409:409=',',<','>,21:40]
[@6,439:448='PK_CLASS_t',<ID>,22:0]
[@7,460:468='/*class*/',<COMMENTED_NAME>,22:21]
[@8,469:469=',',<','>,22:30]
[@9,512:523='PK_LOGICAL_t',<ID>,24:0]
[@10,525:525='*',<'*'>,24:13]
[@11,526:530='const',<'const'>,24:14]
[@12,533:547='/*is_subclass*/',<COMMENTED_NAME>,24:21]
[@13,587:588=');',<');'>,25:0]
[@14,608:607='<EOF>',<EOF>,29:0]
It's always difficult to cope with lexer rules "reading everything but ...", but you are on the right path.
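After commenting out the skip action on TEXT_SEA (TEXT_SEA: .*? PK_LINK ; //-> skip;), I observed that the first function was consumed by a second TEXT_SEA token. The lexer always picks the rule that matches the most input, so right after the first PK_linkage_m, TEXT_SEA (which runs on to the next PK_linkage_m) beats the one-character TEXT_WS match, and PK_ERROR never gets a chance to be seen: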
$ grun Kernel file -tokens input.txt
line 27:0 token recognition error at: 'more stuff;'
[@0,0:41='/*some stuff*/\n\nother stuff;\n\nPK_linkage_m',<TEXT_SEA>,1:0]
[@1,42:292=' PK_ERROR_code_t PK_CLASS_ask_superclass\n(\n/* received */\nPK_CLASS_t
/*class*/, /* a class */\n/* returned */\nPK_CLASS_t *const /*superclass*/
/* immediate superclass of class */\n);\n\n/*some stuff*/\nblar blar;\n\n\n
PK_linkage_m',<TEXT_SEA>,5:12]
[@2,294:308='PK_ERROR_code_t',<'PK_ERROR_code_t'>,17:13]
[@3,310:329='PK_CLASS_is_subclass',<ID>,17:29]
This gave me the idea to use TEXT_SEA both as "sea consumer" and starter of the function mode.
lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant
@lexer::members {
public static final int WHITESPACE = 1;
}
PK_LINK: 'PK_linkage_m' ;
TEXT_SEA: .*? PK_LINK -> mode(FUNCTION);
LINE : .*? ( [\r\n] | EOF ) ;
mode FUNCTION;
//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';
PK_ERROR : 'PK_ERROR_code_t' ;
COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;
ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';
WS: [ \t\r\n]+ -> channel(HIDDEN) ;
.
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : ( TEXT_SEA | func_decl | LINE )+;
func_decl
: PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK
{System.out.println("---> Found declaration on line " + $start.getLine() + " `" + $text + "`");}
;
param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;
Execution :
$ grun Kernel file -tokens input.txt
[@0,0:41='/*some stuff*/\n\nother stuff;\n\nPK_linkage_m',<TEXT_SEA>,1:0]
[@1,42:42=' ',<WS>,channel=1,5:12]
[@2,43:57='PK_ERROR_code_t',<'PK_ERROR_code_t'>,5:13]
[@3,58:58=' ',<WS>,channel=1,5:28]
[@4,59:81='PK_CLASS_ask_superclass',<ID>,5:29]
[@5,82:82='\n',<WS>,channel=1,5:52]
[@6,83:83='(',<'('>,6:0]
...
[@24,249:250=');',<');'>,11:0]
[@25,251:292='\n\n/*some stuff*/\nblar blar;\n\n\nPK_linkage_m',<TEXT_SEA>,11:2]
[@26,293:293=' ',<WS>,channel=1,17:12]
[@27,294:308='PK_ERROR_code_t',<'PK_ERROR_code_t'>,17:13]
[@28,309:309=' ',<WS>,channel=1,17:28]
[@29,310:329='PK_CLASS_is_subclass',<ID>,17:29]
[@30,330:330='\n',<WS>,channel=1,17:49]
[@31,331:331='(',<'('>,18:0]
...
[@55,562:563=');',<');'>,24:0]
[@56,564:564='\n',<LINE>,24:2]
[@57,565:565='\n',<LINE>,25:0]
[@58,566:566='\n',<LINE>,26:0]
[@59,567:577='more stuff;',<LINE>,27:0]
[@60,578:577='<EOF>',<EOF>,27:11]
---> Found declaration on line 5 `PK_ERROR_code_t PK_CLASS_ask_superclass
(
PK_CLASS_t /*class*/,
PK_CLASS_t *const /*superclass*/
);`
---> Found declaration on line 17 `PK_ERROR_code_t PK_CLASS_is_subclass
(
PK_CLASS_t /*may_be_subclass*/,
PK_CLASS_t /*class*/,
PK_LOGICAL_t *const /*is_subclass*/
);`
Instead of including .*? at the start of a rule (which I'd always try to avoid), why don't you try to match either:
a PK_ERROR in the default mode (and switch to another mode like you're now doing),
or else match a single character and skip it?
Something like this:
lexer grammar KernelLexer;
PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER : . -> skip;
mode FUNCTION;
// the rest of your rules as you have them now
Note that this will also match the PK_ERROR_code_t at the start of a longer identifier, as in the input "PK_ERROR_code_t_MU ...", so this would be a safer way:
lexer grammar KernelLexer;
PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER : ( [a-zA-Z_] [a-zA-Z_0-9]* | . ) -> skip;
mode FUNCTION;
// the rest of your rules as you have them now
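To make the difference between the two OTHER rules concrete, here is a rough C sketch of the scanning behaviour the safer rule gives you. It is hand-written for illustration and is not what ANTLR generates; the function name count_islands and its helpers are hypothetical. It consumes a whole identifier at a time and treats it as an island only when it is exactly the keyword; every other character is skipped one at a time.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers mirroring [a-zA-Z_] and [a-zA-Z_0-9]. */
static int is_id_start(int c) { return isalpha(c) || c == '_'; }
static int is_id_part(int c)  { return isalnum(c) || c == '_'; }

/* Counts how often keyword occurs as a complete identifier in src,
 * skipping everything else (the "sea"). */
int count_islands(const char *src, const char *keyword)
{
    size_t klen = strlen(keyword);
    int count = 0;
    const char *p = src;

    while (*p != '\0') {
        if (is_id_start((unsigned char)*p)) {
            /* Consume the whole identifier before deciding what it is. */
            const char *start = p;
            while (is_id_part((unsigned char)*p))
                p++;
            if ((size_t)(p - start) == klen &&
                strncmp(start, keyword, klen) == 0)
                count++;        /* exact match: an island */
        } else {
            p++;                /* any other single character: skip it */
        }
    }
    return count;
}
```

With the keyword "PK_ERROR_code_t", the input "PK_ERROR_code_t_MU x; PK_ERROR_code_t f();" yields one island, because PK_ERROR_code_t_MU is consumed as a single, different identifier.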
Your parser grammar could then look like this:
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : func_decl+ EOF;
func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block : param_decl*;
param_decl : type_decl COMMENTED_NAME COMMA?;
type_decl : CONST? STAR* ID STAR* CONST?;
causing your example input to be parsed like this:
I have an issue with the if statement in my grammar, which can be found here: http://sd-g1.archive-host.com/membres/up/24fe084677d7655eb57ba66e1864081450017dd9/CNew.txt . When I type, for example, the following (ending the input with Ctrl+D):
int k = 0;
if ( k ==0 ){
return k;
}
the parser stops at "if(" and the console does not give any reason. Does anyone know where the issue might come from?
Assuming the entry point of your grammar is translation_unit, it looks like the parser simply stops after it matched a single external_declaration. Try adding the EOF (end of file) token at the end of that rule so that the parser is forced to match the entire input:
translation_unit
: external_declaration+ EOF
;
However, I don't see how an external_declaration would ever match an if-statement (a selection_statement) in your grammar. Perhaps you want to add a statement to your external_declaration:
translation_unit
scope Symbols; // entire file is a scope
@init {
$Symbols::types = new HashSet();
}
: (external_declaration)+ EOF
;
external_declaration
: function_definition
| declaration
| statement
;
after which your input will get properly parsed.
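The effect of the trailing EOF can be illustrated outside ANTLR with a small hand-written C sketch. Everything here is an assumption made for illustration: the toy rule decl : WORD ';' and the function names match_decls, accepts_prefix and accepts_whole are hypothetical and have nothing to do with the grammar in the question. Without an end-of-input check, the matcher is satisfied with a prefix and silently ignores the rest; with the check, leftover input is reported as a failure.

```c
#include <ctype.h>
#include <stddef.h>

/* Skips whitespace starting at s[i]; returns the new index. */
static size_t skip_ws(const char *s, size_t i)
{
    while (isspace((unsigned char)s[i]))
        i++;
    return i;
}

/* Toy rule  decl : WORD ';'  applied repeatedly (decl+).
 * Returns the index just past the last complete decl, or 0 if none. */
size_t match_decls(const char *s)
{
    size_t end = 0;
    size_t i = 0;

    for (;;) {
        i = skip_ws(s, i);
        size_t start = i;
        while (isalpha((unsigned char)s[i]))
            i++;
        if (i == start)
            break;              /* no WORD here */
        i = skip_ws(s, i);
        if (s[i] != ';')
            break;              /* WORD not followed by ';': stop */
        i++;
        end = i;
    }
    return end;
}

/* Like a start rule without EOF: happy with any matching prefix. */
int accepts_prefix(const char *s)
{
    return match_decls(s) > 0;
}

/* Like a start rule ending in EOF: the whole input must be consumed. */
int accepts_whole(const char *s)
{
    size_t n = match_decls(s);
    return n > 0 && s[skip_ws(s, n)] == '\0';
}
```

For the input "k; if (", accepts_prefix succeeds after "k;" and says nothing about the rest, while accepts_whole fails, which is the kind of feedback the EOF token gives you in the grammar.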
I am using ANTLRWorks 1.5 for C (ANTLR 3.5).
I created a lexer and parser file.
When I try to generate code, it reports this error: <[18:52:50] error(100): Script.g:57:2: syntax error: antlr: MissingTokenException(inserted [@-1,0:0='<missing EOF>',<-1>,57:1] at options {)>.
Here is the code; please tell me what I am missing.
/* ############################## L E X E R ############################ */
grammar Lexer;
options {
language = C;
output = AST; //generating an AST
ASTLabelType = pANTLT3_BASE_TREE; //specifying a tree walker
k=1; // Only 1 lookahead character required
}
// Define string values - either simple unquoted or complex quoted
STRING : ('a'..'z'|'A'..'Z'|'0'..'9'|'_' | '+')+
| ('"' (~'"')* '"');
// Ignore all whitespace
WS :(' '
| '\t'
| '\r' '\n' { newline(); }
| '\n' { newline(); }
)
{ $setType(Token.SKIP); } ;
// TODO:Single-line comment
LINE_COMMENT : '/*' (~('\n'|'\r'))* ('\n'|'\r'('\n')?)?
{ $setType(Token.SKIP); newline(); } ;
// Punctuation
LBRACE : '<';
RBRACE : '>';
SLASH : '/';
EQUALS : '=';
SEMI : ';';
TRIGGER : ('Trigger');
TRIGGERTYPE : ('Fall') SLASH ('Rise')|('Rise') SLASH ('Fall')|('Fall')|('Rise');
DEFAULT : ('Default TimeSet');
TIMESETVAL : ('TSET_')('0..9')*;
/* ############################## P A R S E R ############################ */
grammar Script;
options {
language=C;
output=AST; // Automatically build the AST while parsing
ASTLabelType=pANTLR3_BASE_TREE;
//k=2; // Need lookahead of two for props without keys (to check for the =)
}
/*tokens {
SCRIPT; // Imaginary token inserted at the root of the script
BLOCK; // Imaginary token inserted at the root of a block
COMMAND; // Imaginary token inserted at the root of a command
PROPERTY; // Imaginary token inserted at the root of a property
}*/
/** Rule to parse Trigger line
*/
trigger : TRIGGER EQUALS TRIGGERTYPE SEMI;
/** Rule to parse TimeSet line
*/
timeset : DEFAULT TIMESETVAL;
Your grammar Lexer is declared as a "combined" grammar (plain grammar) but contains only lexer rules. When you write just grammar, ANTLR expects at least one parser rule.
There are three different types of grammar:
combined grammar: grammar Foo, generates:
class FooParser extends Parser and
class FooLexer extends Lexer
parser grammar: parser grammar Bar, generates:
class Bar extends Parser
lexer grammar: lexer grammar Baz, generates:
class Baz extends Lexer
So, in your case, change grammar Lexer; into lexer grammar ScriptLexer; (don't name your lexer grammar Lexer since it is the base lexer class in ANTLR!) and import this lexer in your parser grammar:
parser grammar ScriptParser;
import ScriptLexer;
options {
language=C;
output=AST;
ASTLabelType=pANTLR3_BASE_TREE;
}
// ...
Related:
ANTLR3 Wiki: Composite Grammars
I'm trying to build a small parser for XML files in C. I know I could find a finished solution, but I only need some basic functionality for an embedded project. I'm trying to create a grammar that describes XML without attributes, just tags, but it doesn't seem to work and I haven't been able to figure out why.
Here is the grammar:
XML : FIRST_TAG NIZ
NIZ : VAL NIZ | eps
VAL : START VAL END
| STR
| eps
Here is the part of the C code that implements this grammar:
void check() {
getSymbol();
if( sym == FIRST_LINE )
{
niz();
}
else {
printf("FIRST_LINE EXPECTED");
exit(1);
}
}
void niz() {
getSymbol();
if( sym == ERROR )
return;
if( sym == START ) {
back = 1;
val();
niz();
}
printf(" EPS OR START EXPECTED\n");
}
void val() {
getSymbol();
if( sym == ERROR )
return;
if( sym == START ) {
back = 0;
val();
getSymbol();
if( sym != END ) {
printf("END EXPECTED");
exit(1);
}
return;
}
if( sym == EMPTY_TAG || sym == STR)
return;
printf("START, STR, EMPTY_TAG OR EPS EXPECTED\n");
exit(1);
}
void getSymbol() {
int pom;
if(back == 1) {
back = 0;
return;
}
sym = getNextToken(cmd + offset, &pom);
offset += pom + 1;
}
EDIT: Here is an example of an XML file that the parser rejects:
<?xml version="1.0"?>
<VATCHANGES>
<DATE>15/08/2012</DATE>
<TIME>1452</TIME>
<EFDSERIAL>01KE000001</EFDSERIAL>
<CHANGENUM>1</CHANGENUM>
<VATRATE>A</VATRATE>
<FROMVALUE>16.00</FROMVALUE>
<TOVALUE>18.00</TOVALUE>
<VATRATE>B</VATRATE>
<FROMVALUE>2.00</FROMVALUE>
<TOVALUE>0.00</TOVALUE>
<VATRATE>C</VATRATE>
<FROMVALUE>5.00</FROMVALUE>
<TOVALUE>0.00</TOVALUE>
<DATE>25/05/2010</DATE>
<CHANGENUM>2</CHANGENUM>
<VATRATE>C</VATRATE>
<FROMVALUE>0.00</FROMVALUE>
<TOVALUE>4.00</TOVALUE>
</VATCHANGES>
It gives END EXPECTED at the output.
First, your grammar needs some work. Assuming the preamble is handled correctly, you have a basic error in the definition of NIZ.
NIZ : VAL NIZ | eps
VAL : START VAL END
| STR
| eps
So we enter NIZ and look for VAL first. The problem is the eps alternative in both VAL and NIZ. If VAL derives nothing (i.e. eps), it consumes no tokens in the process (it cannot, since eps is the empty production), and NIZ reduces to:
NIZ: eps NIZ | eps
which isn't good: NIZ can then derive itself without consuming any input.
Consider reworking it into something more along these lines. I wrote this quickly, without thinking beyond a purely basic construction.
XML: START_LINE ELEMENT
ELEMENT: OPENTAG BODY CLOSETAG
OPENTAG: lt id(n) gt
CLOSETAG: lt fs id(n) gt
BODY: ELEMENT | VALUE
VALUE: str | eps
This is super basic. Terminals include:
lt: '<'
gt: '>'
fs: '/'
str: any alphanumeric string excluding chars lt or gt.
id(n): any alphanumeric string excluding chars lt, gt, or fs.
I can almost feel the wrath of the XML purists raining down on me right now, but the point I'm trying to get across is that, when a grammar is well-defined, the RDP will literally write itself. Obviously the lexer (i.e. the token engine) needs to handle the terminals accordingly. Note: the id(n) is an id-stack to ensure you properly close the innermost tag, and is an attribute of your parser in accordance with how it manages tag ids. It's not traditional, but it makes things MUCH easier.
This can/should clearly be expanded to include stand-alone element declarations and short-cut element closure. For example, this grammar allows for elements of this form:
<ElementName>...</ElementName>
but not of this form:
<ElementName/>
Nor does it account for short-cut termination such as:
<ElementName>...</>
Accounting for such additions will obviously complicate the grammar considerably, but will also make the parser substantially more robust. Like I said, the sample above is basic with a capital B. If you're really going to embark on this, these are things you want to consider when designing your grammar, and by consequence your RDP.
Anyway, just consider how a few reworks in your grammar can/will substantially make this easier on you.
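As a rough illustration of how directly the ELEMENT grammar above maps onto a recursive descent parser, here is a minimal C sketch. It is not the asker's code and not a definitive implementation: the function names (parse_xml, parse_element, parse_body, parse_id, expect), the MAX_ID limit and the global cursor are assumptions made for illustration. It leaves out the START_LINE preamble, lets the C call stack play the role of the id-stack, and relaxes BODY to allow several sibling elements (an extension beyond the grammar as written, which the sample input would need).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_ID 64               /* assumed upper bound on tag name length */

static const char *cur;         /* next unread character of the input */

static void skip_ws(void)
{
    while (isspace((unsigned char)*cur))
        cur++;
}

/* Consumes c if it is the next character; returns 1 on success. */
static int expect(char c)
{
    if (*cur != c)
        return 0;
    cur++;
    return 1;
}

/* id: one or more alphanumeric characters, copied into out. */
static int parse_id(char *out)
{
    size_t n = 0;
    while (isalnum((unsigned char)*cur) && n < MAX_ID - 1)
        out[n++] = *cur++;
    out[n] = '\0';
    return n > 0;
}

static int parse_element(void);

/* BODY: ELEMENT* | str | eps   (ELEMENT* is this sketch's extension) */
static int parse_body(void)
{
    skip_ws();
    if (*cur != '<') {
        /* VALUE: str, i.e. everything up to the next '<' or '>' */
        while (*cur != '\0' && *cur != '<' && *cur != '>')
            cur++;
        return 1;
    }
    while (cur[0] == '<' && cur[1] != '/') {
        if (!parse_element())
            return 0;
        skip_ws();
    }
    return 1;
}

/* ELEMENT: OPENTAG BODY CLOSETAG */
static int parse_element(void)
{
    char open[MAX_ID], close[MAX_ID];

    /* OPENTAG: lt id gt */
    if (!expect('<') || !parse_id(open) || !expect('>'))
        return 0;
    if (!parse_body())
        return 0;
    /* CLOSETAG: lt fs id gt, and the id must match the open tag */
    if (!expect('<') || !expect('/') || !parse_id(close) || !expect('>'))
        return 0;
    return strcmp(open, close) == 0;
}

/* Returns 1 if src consists of exactly one well-formed element. */
int parse_xml(const char *src)
{
    cur = src;
    skip_ws();
    if (!parse_element())
        return 0;
    skip_ws();
    return *cur == '\0';
}
```

Each grammar rule becomes one function, and the open tag name kept in the local variable of parse_element is what gets compared when the matching close tag is read. For example, parse_xml accepts "<A><B>16.00</B><C>x</C></A>" and rejects "<A><B>x</A></B>".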
I would like to parse lambda calculus terms. I don't know how to parse a term while respecting the grouping imposed by parentheses. For example:
(lx ly (x(xy)))(lx ly xxxy)
I can't find a good way to do this; I just can't see a suitable algorithm.
A term is represented by a structure that has a type (APPLICATION, ABSTRACTION, VARIABLE) and a left and a right component of type "struct term".
Any idea how to do this?
EDIT
Sorry to disturb you again, but I really want to understand. Can you check the function "expression()" to let me know if I am right.
Term* expression(){
if(current==LINKER){
Term* t = create_node(ABSTRACTION);
get_next_symbol();
t->right = create_node_variable();
get_next_symbol();
t->left = expression();
}
else if(current==OPEN_PARENTHESIS){
application();
get_next_symbol();
if(current != CLOSE_PARENTHESIS){
printf("Error\n");
exit(1);
}
}
else if(current==VARIABLE){
return create_node_variable();
}
else if(current==END_OF_TERM)
{
printf("Error");
exit(1);
}
}
Thanks
The grammar can be simplified by separating the application from other expressions:
EXPR -> l{v} APPL "abstraction"
-> (APPL) "brackets"
-> {v} "variable"
APPL -> EXPR + "application"
The only difference from your approach is that the application is represented as a list of expressions, because abcd can be implicitly read as (((ab)c)d), so you might as well store it as abcd while parsing.
Based on this grammar, a simple recursive descent parser can be created with a single character of lookahead:
EXPR: 'l' // read character, then APPL, return as abstraction
'(' // read APPL, read ')', return as-is
any // read character, return as variable
eof // fail
APPL: ')' // unread character, return as application
any // read EXPR, append to list, loop
eof // return as application
The root symbol is APPL, of course. As a post-parsing step, you can turn your APPL = list of EXPR into a tree of applications. The recursive descent is so simple that you can easily turn it into an imperative solution with an explicit stack if you wish.
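The EXPR/APPL scheme above could be sketched in C roughly as follows. This is one possible reading, not a definitive implementation. It assumes single-character variable names with 'l' reserved for lambda, and it folds the application list into a left-nested tree while parsing rather than in a separate pass. The Term layout (including the name field), the helpers node, peek, expr, appl, parse and show, and the "(lx.body)" output notation are all hypothetical choices made for illustration.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum { VARIABLE, ABSTRACTION, APPLICATION } TermType;

typedef struct Term {
    TermType type;
    char name;              /* VARIABLE: its name; ABSTRACTION: bound variable */
    struct Term *left;      /* APPLICATION: function; ABSTRACTION: body */
    struct Term *right;     /* APPLICATION: argument */
} Term;

static const char *in;      /* next unread character */

/* Allocates a node; nodes are never freed in this sketch. */
static Term *node(TermType type, char name, Term *left, Term *right)
{
    Term *t = malloc(sizeof *t);
    if (t == NULL) { perror("malloc"); exit(1); }
    t->type = type; t->name = name; t->left = left; t->right = right;
    return t;
}

/* Skips whitespace and returns the lookahead character without consuming it. */
static char peek(void)
{
    while (isspace((unsigned char)*in))
        in++;
    return *in;
}

static Term *appl(void);

/* EXPR -> 'l' v APPL | '(' APPL ')' | v ; returns NULL on error */
static Term *expr(void)
{
    char c = peek();
    if (c == '\0' || c == ')')
        return NULL;                        /* eof or stray ')': fail */
    in++;
    if (c == 'l') {
        char v = peek();
        if (v == '\0' || v == '(' || v == ')')
            return NULL;                    /* missing bound variable */
        in++;
        Term *body = appl();
        return body ? node(ABSTRACTION, v, body, NULL) : NULL;
    }
    if (c == '(') {
        Term *inner = appl();
        if (inner == NULL || peek() != ')')
            return NULL;
        in++;                               /* consume ')' */
        return inner;
    }
    return node(VARIABLE, c, NULL, NULL);
}

/* APPL -> EXPR+ ; abcd is folded as (((ab)c)d) */
static Term *appl(void)
{
    Term *result = expr();
    while (result != NULL && peek() != ')' && peek() != '\0') {
        Term *arg = expr();
        if (arg == NULL)
            return NULL;
        result = node(APPLICATION, 0, result, arg);
    }
    return result;
}

/* Parses a whole term; returns NULL on a syntax error. */
Term *parse(const char *src)
{
    in = src;
    Term *t = appl();
    return (t != NULL && peek() == '\0') ? t : NULL;
}

/* Appends a fully parenthesised rendering of t to the string in out. */
void show(const Term *t, char *out)
{
    size_t n = strlen(out);
    switch (t->type) {
    case VARIABLE:
        out[n] = t->name; out[n + 1] = '\0';
        break;
    case ABSTRACTION:
        sprintf(out + n, "(l%c.", t->name);
        show(t->left, out);
        strcat(out, ")");
        break;
    case APPLICATION:
        strcat(out, "(");
        show(t->left, out);
        show(t->right, out);
        strcat(out, ")");
        break;
    }
}
```

Calling show on parse("abcd") with an empty, sufficiently large buffer produces (((ab)c)d), matching the implicit reading described above. Note that an abstraction body extends as far to the right as possible, because the rule is l{v} APPL.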