I'm having a problem with an island grammar and a non-greedy rule used to consume "everything except what I want".
Desired outcome:
My input file is a C header file, containing function declarations along with typedefs, structs, comments, and preprocessor definitions.
My desired output is parsing and subsequent transformation of function declarations ONLY. I would like to ignore everything else.
Setup and what I've tried:
The header file I'm attempting to lex and parse is very uniform and consistent.
Every function declaration is preceded by a linkage macro PK_linkage_m and all functions return the same type PK_ERROR_code_t, ex:
PK_linkage_m PK_ERROR_code_t PK_function(...);
These tokens don't appear anywhere other than at the start of a function declaration.
I have approached this as an island grammar, that is, function declarations in a sea of text.
I have tried to use the linkage token PK_linkage_m to indicate the end of the "TEXT" and the PK_ERROR_code_t token as the start of the function declaration.
Observed problem:
While lexing and parsing a single function declaration works, it fails when the file contains more than one function declaration. The token stream shows that "everything + all function declarations + PK_ERROR_code_t of last function declaration" is consumed as text, and then only the last function declaration in the file is parsed correctly.
My one line summary is: My non-greedy grammar rule to consume everything before the PK_ERROR_code_t is consuming too much.
What I perhaps incorrectly believe is the solution:
Fix my non-greedy lexer rule somehow so that it consumes everything until it finds the PK_linkage_m token. My non-greedy rule appears to consume too much.
What I haven't tried:
As this is my first ANTLR project, and my first language parsing project in a very long time, I'd be more than happy to rewrite it if I'm wrong and getting wronger. I was considering using line terminators to skip everything that doesn't start with a newline, but I'm not sure how to make that work, or how it would be fundamentally different.
Here is my lexer file KernelLexer.g4:
lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant
@lexer::members {
public static final int WHITESPACE = 1;
}
PK_ERROR: 'PK_ERROR_code_t' -> mode(FUNCTION);
PK_LINK: 'PK_linkage_m';
//Doesnt work. Once it starts consuming, it doesnt stop.
TEXT_SEA: .*? PK_LINK -> skip;
TEXT_WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;
mode FUNCTION;
//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';
COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;
ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';
WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;//channel(1);
Here is my parser file KernelParser.g4:
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : func_decl+;
func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;
Here is a simple example input file:
/*some stuff*/
other stuff;
PK_linkage_m PK_ERROR_code_t PK_CLASS_ask_superclass
(
/* received */
PK_CLASS_t /*class*/, /* a class */
/* returned */
PK_CLASS_t *const /*superclass*/ /* immediate superclass of class */
);
/*some stuff*/
blar blar;
PK_linkage_m PK_ERROR_code_t PK_CLASS_is_subclass
(
/* received */
PK_CLASS_t /*may_be_subclass*/, /* a potential subclass */
PK_CLASS_t /*class*/, /* a class */
/* returned */
PK_LOGICAL_t *const /*is_subclass*/ /* whether it was a subclass */
);
more stuff;
Here is the token output:
line 28:0 token recognition error at: 'more stuff;\r\n'
[#0,312:326='PK_ERROR_code_t',<'PK_ERROR_code_t'>,18:13]
[#1,328:347='PK_CLASS_is_subclass',<ID>,18:29]
[#2,350:350='(',<'('>,19:0]
[#3,369:378='PK_CLASS_t',<ID>,21:0]
[#4,390:408='/*may_be_subclass*/',<COMMENTED_NAME>,21:21]
[#5,409:409=',',<','>,21:40]
[#6,439:448='PK_CLASS_t',<ID>,22:0]
[#7,460:468='/*class*/',<COMMENTED_NAME>,22:21]
[#8,469:469=',',<','>,22:30]
[#9,512:523='PK_LOGICAL_t',<ID>,24:0]
[#10,525:525='*',<'*'>,24:13]
[#11,526:530='const',<'const'>,24:14]
[#12,533:547='/*is_subclass*/',<COMMENTED_NAME>,24:21]
[#13,587:588=');',<');'>,25:0]
[#14,608:607='<EOF>',<EOF>,29:0]
It's always difficult to cope with lexer rules "reading everything but ...", but you are on the right path.
After commenting out the skip action on TEXT_SEA (that is, TEXT_SEA: .*? PK_LINK ; //-> skip;), I observed that the first function was consumed by a second TEXT_SEA token (because lexer rules are greedy, TEXT_SEA gives PK_ERROR no chance to be seen):
$ grun Kernel file -tokens input.txt
line 27:0 token recognition error at: 'more stuff;'
[#0,0:41='/*some stuff*/\n\nother stuff;\n\nPK_linkage_m',<TEXT_SEA>,1:0]
[#1,42:292=' PK_ERROR_code_t PK_CLASS_ask_superclass\n(\n/* received */\nPK_CLASS_t
/*class*/, /* a class */\n/* returned */\nPK_CLASS_t *const /*superclass*/
/* immediate superclass of class */\n);\n\n/*some stuff*/\nblar blar;\n\n\n
PK_linkage_m',<TEXT_SEA>,5:12]
[#2,294:308='PK_ERROR_code_t',<'PK_ERROR_code_t'>,17:13]
[#3,310:329='PK_CLASS_is_subclass',<ID>,17:29]
This gave me the idea to use TEXT_SEA both as "sea consumer" and starter of the function mode.
lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant
@lexer::members {
public static final int WHITESPACE = 1;
}
PK_LINK: 'PK_linkage_m' ;
TEXT_SEA: .*? PK_LINK -> mode(FUNCTION);
LINE : .*? ( [\r\n] | EOF ) ;
mode FUNCTION;
//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';
PK_ERROR : 'PK_ERROR_code_t' ;
COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;
ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';
WS: [ \t\r\n]+ -> channel(HIDDEN) ;
The parser grammar:
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : ( TEXT_SEA | func_decl | LINE )+;
func_decl
: PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK
{System.out.println("---> Found declaration on line " + $start.getLine() + " `" + $text + "`");}
;
param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;
Execution :
$ grun Kernel file -tokens input.txt
[#0,0:41='/*some stuff*/\n\nother stuff;\n\nPK_linkage_m',<TEXT_SEA>,1:0]
[#1,42:42=' ',<WS>,channel=1,5:12]
[#2,43:57='PK_ERROR_code_t',<'PK_ERROR_code_t'>,5:13]
[#3,58:58=' ',<WS>,channel=1,5:28]
[#4,59:81='PK_CLASS_ask_superclass',<ID>,5:29]
[#5,82:82='\n',<WS>,channel=1,5:52]
[#6,83:83='(',<'('>,6:0]
...
[#24,249:250=');',<');'>,11:0]
[#25,251:292='\n\n/*some stuff*/\nblar blar;\n\n\nPK_linkage_m',<TEXT_SEA>,11:2]
[#26,293:293=' ',<WS>,channel=1,17:12]
[#27,294:308='PK_ERROR_code_t',<'PK_ERROR_code_t'>,17:13]
[#28,309:309=' ',<WS>,channel=1,17:28]
[#29,310:329='PK_CLASS_is_subclass',<ID>,17:29]
[#30,330:330='\n',<WS>,channel=1,17:49]
[#31,331:331='(',<'('>,18:0]
...
[#55,562:563=');',<');'>,24:0]
[#56,564:564='\n',<LINE>,24:2]
[#57,565:565='\n',<LINE>,25:0]
[#58,566:566='\n',<LINE>,26:0]
[#59,567:577='more stuff;',<LINE>,27:0]
[#60,578:577='<EOF>',<EOF>,27:11]
---> Found declaration on line 5 `PK_ERROR_code_t PK_CLASS_ask_superclass
(
PK_CLASS_t /*class*/,
PK_CLASS_t *const /*superclass*/
);`
---> Found declaration on line 17 `PK_ERROR_code_t PK_CLASS_is_subclass
(
PK_CLASS_t /*may_be_subclass*/,
PK_CLASS_t /*class*/,
PK_LOGICAL_t *const /*is_subclass*/
);`
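The observation earlier in this answer (a second TEXT_SEA swallowing the first declaration) can be pictured with a small C model. This is only a sketch under the simplifying assumption that the lexer picks whichever rule matches the most characters at the current position; the function names are made up for illustration and are not part of ANTLR.

```c
#include <stddef.h>
#include <string.h>

/* Length that a rule shaped like `.*? MARKER` would match at p:
 * everything up to and including the next marker, or 0 if there is none. */
size_t sea_match_len(const char *p, const char *marker)
{
    const char *hit = strstr(p, marker);
    return hit ? (size_t)(hit - p) + strlen(marker) : 0;
}

/* Length that a plain literal rule would match at p (0 if it does not match). */
size_t literal_match_len(const char *p, const char *literal)
{
    size_t n = strlen(literal);
    return strncmp(p, literal, n) == 0 ? n : 0;
}

/* Returns 1 when the "sea" rule matches more characters than the literal
 * rule at p, which is the situation where the literal never gets emitted. */
int sea_rule_wins(const char *p, const char *marker, const char *literal)
{
    return sea_match_len(p, marker) > literal_match_len(p, literal);
}
```

As long as another PK_linkage_m exists further down the file, the sea rule matches more text than PK_ERROR_code_t does, which is why only the last declaration survived in the original grammar.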
Instead of including .*? at the start of a rule (which I'd always try to avoid), why don't you try to match either:
a PK_ERROR in the default mode (and switch to another mode like you're now doing),
or else match a single character and skip it?
Something like this:
lexer grammar KernelLexer;
PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER : . -> skip;
mode FUNCTION;
// the rest of your rules as you have them now
Note that this will match PK_ERROR_code_t as well for the input "PK_ERROR_code_t_MU ...", so this would be a safer way:
lexer grammar KernelLexer;
PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER : ( [a-zA-Z_] [a-zA-Z_0-9]* | . ) -> skip;
mode FUNCTION;
// the rest of your rules as you have them now
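To see why the identifier alternative in OTHER makes this safer, here is a rough C equivalent of the scanning logic. It is a sketch only, and find_whole_identifier is a hypothetical helper name: whole identifiers are consumed in one go, so PK_ERROR_code_t is only recognised when it is the entire identifier, and anything else is skipped one character at a time.

```c
#include <stddef.h>
#include <string.h>

static int is_id_start(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static int is_id_part(char c)
{
    return is_id_start(c) || (c >= '0' && c <= '9');
}

/* Returns a pointer to the first place where `keyword` appears as a
 * complete identifier in `text`, or NULL if it never does. */
const char *find_whole_identifier(const char *text, const char *keyword)
{
    size_t klen = strlen(keyword);
    const char *p = text;

    while (*p != '\0') {
        if (is_id_start(*p)) {
            const char *start = p;
            while (is_id_part(*p)) {
                p++;                      /* take the longest identifier */
            }
            if ((size_t)(p - start) == klen &&
                strncmp(start, keyword, klen) == 0) {
                return start;             /* exact identifier match */
            }
        } else {
            p++;                          /* like the `.` alternative: skip one char */
        }
    }
    return NULL;
}
```

With this logic, "PK_ERROR_code_t_MU" is skipped as a single identifier instead of being split into the keyword plus "_MU".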
Your parser grammar could then look like this:
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : func_decl+ EOF;
func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block : param_decl*;
param_decl : type_decl COMMENTED_NAME COMMA?;
type_decl : CONST? STAR* ID STAR* CONST?;
With these grammars, your example input should parse as two func_decl subtrees under file.
I'm new to Lex/Yacc. Found these lex & yacc files which parse ansi C.
To experiment I added an action to print part of the parsing:
constant
: I_CONSTANT { printf("I_CONSTANT %d\n", $1); }
| F_CONSTANT
| ENUMERATION_CONSTANT /* after it has been defined as such */
;
The problem is, no matter where I put the action, and whatever $X I use, I always get value 0.
Here I got printed:
I_CONSTANT 0
Even though my input is:
int foo(int x)
{
return 5;
}
Any idea?
Nothing in the lex file you point to actually sets semantic values for any token. As the author says, the files are just a grammar and "the bulk of the work" still needs to be done. (There are other caveats having to do with the need for a preprocessor.)
Since nothing in the lex file ever sets yylval, it will always be 0, and that is what yacc/bison will find when it sets up the semantic value for the token ($1 in this case).
Turns out yylval = atoi(yytext) is not done in the lex file, so I had to add it myself. Also learned I can add extern char *yytext to the yacc file header, and then use yytext directly.
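For anyone hitting the same thing, the effect can be reproduced without lex at all. The following is a hand-written stand-in for a scanner, assuming a plain int semantic value; demo_yylex and demo_yylval are hypothetical names, not what lex generates. When the assignment is left out, the value the parser would see stays 0.

```c
#include <ctype.h>
#include <stdlib.h>

enum { TOK_END = 0, TOK_I_CONSTANT = 1, TOK_OTHER = 2 };

int demo_yylval = 0;   /* plays the role of yylval; a global starts at 0 */

/* Scans one token starting at *cursor and advances the cursor.
 * assign_value == 0 mimics a scanner whose actions never set yylval. */
int demo_yylex(const char **cursor, int assign_value)
{
    const char *p = *cursor;

    while (isspace((unsigned char)*p)) {
        p++;
    }
    if (*p == '\0') {
        *cursor = p;
        return TOK_END;
    }
    if (isdigit((unsigned char)*p)) {
        const char *start = p;
        while (isdigit((unsigned char)*p)) {
            p++;
        }
        if (assign_value) {
            demo_yylval = atoi(start);   /* the step the lex file was missing */
        }
        *cursor = p;
        return TOK_I_CONSTANT;
    }
    if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_') {
            p++;                          /* a whole word such as "return" */
        }
    } else {
        p++;                              /* any other single character */
    }
    *cursor = p;
    return TOK_OTHER;
}
```

Scanning "return 5;" with assign_value set to 0 still reports an I_CONSTANT token, but the stored value is 0; with assign_value set to 1 it becomes 5.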
I am trying to develop a language parser on CentOS 6.0 by means of Bison 3.0 (C parser generator), Flex 2.5.35 and gcc 4.4.7. I have the following Bison grammar file:
%{
#include <stdio.h>
%}
%union {
int int_t;
char* str_t;
}
%token SEP
%token <str_t> ID
%start start
%type <int_t> plst
%%
start: plst start
| EOS { YYACCEPT; }
;
// <id> , <id> , ... , <id>
plst: ID SEP_PARAMS plst { printf("Rule 1 %s %s \n",$1,$2); }
| ID { printf("Rule 2 %s \n", $1); }
| /* empty */ { }
;
%%
int yyerror(GNode* root, const char* s) {printf("Error: %s", s);}
The problem
As it is now, it is not really a meaningful one, but it is enough to understand my problem I think. Consider that I have a scanner written in Flex which recognizes my tokens. This grammar file is used to recognize simple identifier lists like: id1,id2,...,idn. My problem is that in each grammar rule, when I try to get the value of the identifier (the string representing the name of the identifier), I get a NULL pointer, as my printfs also show.
What am I doing wrong? Thank you.
Edit
Thanks to recent answers, I could understand that the problem relates strongly to Flex and its configuration file. In particular I have edited my lex file in order to meet the specifications described by the Flex Manual for Bison Bridging:
{ID} { printf("[id-token]");
yylval->str_t = strdup(yytext);
return ID; }
However after running Bison, then Flex (providing the --bison-bridge option) and then the compiler, I execute the generated parser and I instantly get Segmentation Fault.
What's the problem?
The flex option --bison-bridge (or %option bison-bridge) matches up to the bison option %define api.pure. You need to use either BOTH bison-bridge and api.pure or NEITHER -- either way can work, but they need to be consistent. Since it appears you are NOT using api.pure, you want to delete the --bison-bridge option.
The values for $1, $2 etc. have to be set by the lexer.
If you have a rule in the lexer for identifiers, like
ID [a-z][a-z0-9]*
%%
{ID} { return ID; }
the semantic values are not set.
You have to do e.g.
{ID} { /* Set the union's value, used by e.g. `$1` in the parser */
yylval.str_t = strdup(yytext);
return ID;
}
Remember to free the value in the parser, as strdup allocates memory.
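The ownership hand-off can be sketched in plain C. This is an illustration only: demo_value mirrors the %union from the question, and scan_id, use_and_release_id and copy_text are hypothetical helpers standing in for the lexer action and the parser action.

```c
#include <stdlib.h>
#include <string.h>

/* Mirrors the %union from the question. */
typedef union {
    int   int_t;
    char *str_t;
} demo_value;

demo_value demo_yylval;   /* stand-in for the global yylval */

/* Heap copy of the first len characters of text (what strdup would give
 * for a NUL-terminated yytext). Returns NULL if allocation fails. */
static char *copy_text(const char *text, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy != NULL) {
        memcpy(copy, text, len);
        copy[len] = '\0';
    }
    return copy;
}

/* "Lexer side": store a private copy, because the scanner's buffer is
 * overwritten by the next match. Returns 1 on success, 0 on failure. */
int scan_id(const char *text, size_t len)
{
    demo_yylval.str_t = copy_text(text, len);
    return demo_yylval.str_t != NULL;
}

/* "Parser side": use the value, then release it. Returns the name length. */
size_t use_and_release_id(char *name)
{
    size_t n = strlen(name);
    free(name);
    return n;
}
```

The point of the copy is that the semantic value must outlive the scanner's current match; the point of the free is that nothing else will release it.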
From the Bison Manual:
In a simple interactive command parser
where each input is one line, it may
be sufficient to allow yyparse to
return 1 on error and have the caller
ignore the rest of the input line when
that happens (and then call yyparse
again).
This is pretty much what I want, but I am having trouble getting it to work. Basically, I want to detect an error in flex, and if an error is detected, have Bison discard the entire line. What I have right now isn't working quite right, because my commands still get executed:
kbsh: ls '/home
Error: Unterminated Single Quote
admin kbrandt tempuser
syntax error
kbsh:
In my Bison file:
commands:
/*Empty*/ { prompt(); } |
command { prompt(); }
;
command:
error {return 1; } |
chdir_command |
pwd_command |
exit_command |
WORD arg_list {
execute_command($1, $2);
//printf("%s, %s\n", $1, $2);
} |
WORD { execute_command($1, NULL); }
;
And in my Flex:
' {BEGIN inQuote; }
<inQuote>\n {printf("Error: Unterminated Single Quote\n"); BEGIN(0); return(ERROR);}
I don't think you'll find a simple solution to handling these types of parsing errors in the lexer.
I would keep the lexer (flex/lex) as dumb as possible: it should just provide a stream of basic tokens (identifiers, keywords, etc.) and leave the error detection to the parser (yacc/bison). In fact, the parser is set up for exactly what you want, with a little restructuring of your approach...
In the lexer (parser.l), keep it simple (no eol/newline handling), something like this (it isn't the full thing):
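%}
/* The backslash is required before the double quote; before the single quote it is harmless */
/* [^'\n]* rather than .* so a match stops at the first closing quote */
SINGLE_QUOTE_STRING \'[^'\n]*\'
DOUBLE_QUOTE_STRING \"[^"\n]*\"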
%%
{SINGLE_QUOTE_STRING} {
yylval.charstr = copy_to_tmp_buffer(yytext); // implies a %union
return STRING;
}
{DOUBLE_QUOTE_STRING} {
yylval.charstr = copy_to_tmp_buffer(yytext); // implies a %union
return STRING;
}
\n return NEWLINE;
Then in your parser.y file do all the real handling (again, not the full thing):
command:
error NEWLINE
{ yyclearin; yyerrok; print_the_next_command_prompt(); }
| chdir_command STRING NEWLINE
{ do_the_chdir($<charstr>2); print_the_next_command_prompt(); }
| ... and so on ...
There are two things to note here:
The shift of things like NEWLINE to the yacc side so that you can determine when the user is done with the command then you can clear things out and start over (assuming you have "int yywrap() {return 1;}" somewhere). If you try to detect it too early in flex, when do you know to raise an error?
chdir isn't one command (unless it was sub ruled and you just didn't show it), it now has chdir_command STRING (the argument to the chdir). This makes it so that the parser can figure out what went wrong, you can then yyerror if that directory doesn't exist, etc...
This way you should get something like (guessing what chdir might look like):
cd 'some_directory
syntax error
cd 'some_directory'
you are in the some_directory dude!
And it is all handled by the yacc grammar, not by the tokenizer.
I have found that keeping flex as simple as possible gives you the most ***flex***ibility. :)
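The effect of the error NEWLINE rule above can be modelled in a few lines of C. This is a simplified sketch, not what bison generates: the token names are hypothetical, and T_ERROR stands for any token that would trigger a syntax error. The idea is that a line is only acted on once its NEWLINE has been seen, and a line containing an error is dropped up to that NEWLINE.

```c
#include <stddef.h>

enum demo_token { T_WORD, T_STRING, T_ERROR, T_NEWLINE, T_END };

/* Walks a T_END-terminated token array line by line and returns how many
 * lines would have their command executed. A line containing T_ERROR is
 * discarded up to and including its T_NEWLINE. */
int count_executed_lines(const enum demo_token *toks)
{
    int executed = 0;
    size_t i = 0;

    while (toks[i] != T_END) {
        int saw_error = 0;
        int saw_command = 0;

        /* Collect everything up to the synchronising NEWLINE. */
        while (toks[i] != T_NEWLINE && toks[i] != T_END) {
            if (toks[i] == T_ERROR) {
                saw_error = 1;
            } else {
                saw_command = 1;
            }
            i++;
        }
        if (toks[i] == T_NEWLINE) {
            i++;                 /* consume the NEWLINE and start over */
        }
        if (!saw_error && saw_command) {
            executed++;          /* only clean, non-empty lines run */
        }
    }
    return executed;
}
```

For a first line with an unterminated quote followed by a well-formed second line, only the second line counts as executed.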
OK, so here is the deal.
In my language I have some commands, say
XYZ 3 5
GGB 8 9
HDH 8783 33
And in my Lex file
XYZ { return XYZ; }
GGB { return GGB; }
HDH { return HDH; }
[0-9]+ { yylval.ival = atoi(yytext); return NUMBER; }
\n { return EOL; }
In my yacc file
start : commands
;
commands : command
| command EOL commands
;
command : xyz
| ggb
| hdh
;
xyz : XYZ NUMBER NUMBER { /* Do something with the numbers */ }
;
etc. etc. etc. etc.
My question is, how can I get the entire text
XYZ 3 5
GGB 8 9
HDH 8783 33
Into commands while still returning the NUMBERs?
Also, when my Lex returns a STRING [0-9a-zA-Z]+, and I want to do verification on its length, should I do it like
rule: STRING STRING { if (strlen($1) < 5 ) /* Do some shit else error */ }
or actually have a token in my Lex that returns different tokens depending on length?
If I've understood your first question correctly, you can have semantic actions like
{ $$ = makeXYZ($2, $3); }
which will allow you to build the value of command as you want.
For your second question, the borders between lexical analysis and grammatical analysis, and between grammatical analysis and semantic analysis, aren't hard and fixed. Moving them is a trade-off between factors like ease of description, clarity of error messages, and robustness in the presence of errors. For the string length check, the likelihood of an error occurring is quite high, and the error message you get by returning different terminals for different lengths will probably not be clear. So if it is possible (that depends on the grammar) I'd handle it in the semantic analysis phase, where the message can easily be tailored.
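As a rough illustration of handling the length check in the semantic phase, a small helper called from the rule's action could look like this. The name check_min_length and the message wording are made up for the example.

```c
#include <stdio.h>
#include <string.h>

/* Returns 0 when s has at least min_len characters. Otherwise returns -1
 * and writes a tailored message into msg (at most msg_size bytes). */
int check_min_length(const char *s, size_t min_len, char *msg, size_t msg_size)
{
    size_t len = strlen(s);

    if (len >= min_len) {
        if (msg_size > 0) {
            msg[0] = '\0';       /* no error, so leave an empty message */
        }
        return 0;
    }
    snprintf(msg, msg_size,
             "'%s' has %zu characters, expected at least %zu",
             s, len, min_len);
    return -1;
}
```

An action such as rule: STRING STRING { ... } could call this on $1 and report msg through yyerror, which keeps the grammar itself unchanged.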
If you arrange for your lexical analyzer (yylex()) to store the whole string in some variable, then your code can access it. The communication with the parser proper will be through the normal mechanisms, but there's nothing that says you can't also have another variable lurking around (probably a file static variable - but beware multithreading) that stores the whole input line before it is dissected.
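A minimal sketch of that side channel, assuming lines shorter than 256 bytes and a three-letter command name; scan_command and last_line are hypothetical names, and sscanf stands in for the real tokenizer.

```c
#include <stdio.h>
#include <string.h>

/* File-static side channel holding the raw text of the current line.
 * Not thread-safe, as noted above. */
static char current_line[256];

/* Remembers the whole line, then extracts the command name and its two
 * numbers, much as the scanner would return a keyword and two NUMBERs.
 * Returns 1 when all three fields were read, 0 otherwise. */
int scan_command(const char *line, char name[4], int *a, int *b)
{
    snprintf(current_line, sizeof current_line, "%s", line);
    return sscanf(line, "%3s %d %d", name, a, b) == 3;
}

/* What a parser action could call to get the untouched input text. */
const char *last_line(void)
{
    return current_line;
}
```

The numbers still travel through the normal token values; the full text is simply available on the side for whoever needs it.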
As you use yylval.ival you already have union with ival field in your YACC source, like this:
%union {
int ival;
}
Now you specify token type, like this:
%token <ival> NUMBER
Now you can access the ival field of a NUMBER token simply by its position in the rule ($2 and $3 here), like
xyz : XYZ NUMBER NUMBER { printf("XYZ %d %d", $2, $3); }
For your second question I'd define union like this:
%union {
char* strval;
int ival;
}
and in your YACC source specify the token types
%token <strval> STRING
%token <ival> NUMBER
So now you can do things like
foo : STRING NUMBER { printf("%s (len %d) %d", $1, (int) strlen($1), $2); }