I am writing a compiler in C, and I use bison for the grammar and flex for the tokens. To improve the quality of error messages, some common errors need to appear in the grammar. This has the side effect, however, of bison thinking that an invalid input is actually valid.
For example, consider this grammar:
program
: command ';' program
| command ';'
| command {yyerror("Missing ;.");} // wrong input
;
command
: INC
| DEC
;
where INC and DEC are tokens and program is the initial symbol. In this case, INC; is a valid program, but INC is not, and an error message is generated. The function yyparse(), however, returns 0 as if the program were correct.
Looking at the bison manual, I found the macro YYERROR, which should behave as if the parser itself found an error. But even if I add YYERROR after the call to yyerror(), the function yyparse() still returns 0. I could use YYABORT instead, but that would stop on the first error, which is terrible and not what I want.
Is there any way to make yyparse() return 1 without stopping on the first error?
Since you intend to recover from syntax errors, you're not going to be able to use the return code from yyparse to signal that one or more errors occurred. Instead, you'll have to track that information yourself.
The traditional way to do that would be to use a global error count (just showing the crucial pieces):
%{
int parse_error_count = 0;
%}
%%
program: statement { yyerror("Missing semicolon");
++parse_error_count; }
%%
int parse_interface() {
parse_error_count = 0;
int status = yyparse();
if (status) return status; /* Might have run out of memory */
if (parse_error_count) return 3; /* yyparse returns 0, 1 or 2 */
return 0;
}
A more modern solution is to define an additional "out" parameter to yyparse:
%parse-param { int* error_count }
%%
program: statement { yyerror(error_count, "Missing semicolon"); /* %parse-param arguments are also passed to yyerror */
++*error_count; }
%%
int main(void) {
int error_count = 0;
int status = yyparse(&error_count);
if (status || error_count) { /* handle error */ }
return status || error_count;
}
Finally, in case you really need to export the symbol yyparse from your bison-generated code, you can do the following ugly hack:
%code top {
#define yyparse internal_yyparse
}
%parse-param { int* error_count }
%%
program: statement { yyerror(error_count, "Missing semicolon"); /* %parse-param arguments are also passed to yyerror */
++*error_count; }
%%
#undef yyparse
int yyparse() {
int error_count = 0;
int status = internal_yyparse(&error_count);
// Whatever you want to do as a summary
return status ? status : error_count ? 1 : 0;
}
yyerror() just prints an error message. It doesn't alter what yyparse() returns.
What you're attempting is not a good idea. You'll enormously expand the grammar and you run a major risk of making it ambiguous. All you need to do is remove the production that calls yyerror(). That input will produce a syntax error anyway, and that will cause yyparse() not to return 0. You're keeping a dog and barking yourself. What you should be checking for is semantic errors that the parser can't see.
If you really want to improve the error messages, there's enough information in the parse tables and state information to tell you what the expected next token was. However in most cases it's such a large set it's pointless to print it. But programmers are used to sorting out 'syntax error'. Don't sweat it. Writing compilers is hard enough already.
NB You should make your grammar left-recursive to avoid excessive stack usage: for example, program : program command ';' | command ';'.
I'm writing a compiler with flex and bison for a college assignment. I'm having trouble adding a function identifier to my symbol table: when evaluating a function declaration, I get the opening parenthesis in yytext where I'd expect the identifier. My flex file contains the following, where yylval is a union and vlex is a struct:
abc [A-Za-z_]
alphanum [A-Za-z_0-9]
id {abc}+{alphanum}*
...
#define STORE_YYLVAL_NONE\
do{\
... // location control irrelevant to the problem
yylval.vlex.type = none_t;\
yylval.vlex.value.sValue = yytext;\
}while(0)
...
{id} {
LOG_DEBUG("id: %s\n", yytext);
STORE_YYLVAL_NONE;
return TK_IDENTIFIER;
}
[,;:()\[\]\{\}\+\-\*/<>!&=%#\^\.\|\?\$] {
LOG_DEBUG("special\n");
STORE_YYLVAL_NONE;
return *yytext;
}
...
And in my bison file I have:
new_identifier_with_node: TK_IDENTIFIER {
hshsym_add_or_exit(&hshsym, yylval.vlex.value.sValue, &(yylval.vlex));
$$ = ast_node_create(&(yylval.vlex));
};
func: type new_identifier_with_node '(' param_list ')' func_block { ... };
I also have a log inside hshsym_add_or_exit, which adds an identifier to my symbol table. When parsing the following program:
int k(int x,int y, int z){}
int f(){
k(10,20,30);
}
I'm getting the following debug output:
yylex: DEBUG! id: k
yylex: DEBUG! special
hshsym_add_or_exit: DEBUG! Declaring: (
That is, when the new_identifier_with_node production is evaluated, the content of yytext is ( instead of k, as I would expect. Is the code above the cause? I have some still unresolved shift/reduce conflicts which I guess could be at fault, but I don't see how in this specific case. I believe I'm missing something really basic but I can't see what. The project is quite large (and shamefully disorganized) at this point, but I can provide a complete and reproducible example if need be.
The basic problem is that you are using yylval in the new_identifier_with_node production instead of $1. $1 is the semantic value of the first symbol in the production, in this case TK_IDENTIFIER.
In a bison action, yylval is usually the value of the lookahead token, which is the next token in the input stream. That's why it shows up as a parenthesis in this case. But you cannot count on that in general, because bison may perform a default reduction without reading the lookahead token first. Using yylval in a bison action is very rarely useful, aside from some applications in error recovery.
Even after you fix that, you will find that the semantic values are incorrect, because your flex action forwards a pointer into flex's internal buffer (yytext) rather than copying the token string. See, for example, this question.
The program is intended to store words in a symbol table and then print out each word's part of speech. Later, the parser should go further and determine, among other things, whether the input is a sentence.
I create the executable file by
flex try1.l
bison -dy try1.y
gcc lex.yy.c y.tab.c -o try1.exe
in cmd (WINDOWS)
My issue occurs when I try to declare any value when running the executable,
verb run
it goes like this
BOLD IS INPUT
verb run
run
run
syntax error
noun cat
cat
syntax error
run
run
syntax error
cat run
syntax error
MY THOUGHTS: I'm unsure why I'm getting the syntax error back from the code. After debugging and printing out the value being stored, I figured there has to be some kind of issue with the linked list, as it seemed that only one value was being stored in it. When I printed the stored word_type integer for run, it showed the correct value, 259, but the program refused to let me define any other words in my symbol table. I have since removed the print statements, and the program now behaves as shown above. I think there is an issue with the add_word function: the word isn't being added properly, so the lookup_word function is crashing the program.
Lexer file. This example is taken from O'Reilly's lex & yacc, 2nd edition, Examples 1-5 and 1-6.
I am trying to learn lex and yacc on my own and reproduce this example.
%{
/*
* We now build a lexical analyzer to be used by a higher-level parser.
*/
#include <stdlib.h>
#include <string.h>
#include "ytab.h" /* token codes from the parser */
#define LOOKUP 0 /* default - not a defined word type. */
int state;
%}
/*
* Example from page 9 Word recognizer with a symbol table. PART 2 of Lexer
*/
%%
\n { state = LOOKUP; } /* end of line, return to default state */
\.\n { state = LOOKUP;
return 0; /* end of sentence */
}
/* whenever a line starts with a reserved part of speech name */
/* start defining words of that type */
^verb { state = VERB; }
^adj { state = ADJ; }
^adv { state = ADV; }
^noun { state = NOUN; }
^prep { state = PREP; }
^pron { state = PRON; }
^conj { state = CONJ; }
[a-zA-Z]+ {
if(state != LOOKUP) {
add_word(state, yytext);
} else {
switch(lookup_word(yytext)) {
case VERB:
return(VERB);
case ADJECTIVE:
return(ADJECTIVE);
case ADVERB:
return(ADVERB);
case NOUN:
return(NOUN);
case PREPOSITION:
return(PREPOSITION);
case PRONOUN:
return(PRONOUN);
case CONJUNCTION:
return(CONJUNCTION);
default:
printf("%s: don't recognize\n", yytext);
/* don't return, just ignore it */
}
}
}
. ;
%%
int yywrap()
{
return 1;
}
/* define a linked list of words and types */
struct word {
char *word_name;
int word_type;
struct word *next;
};
struct word *word_list; /* first element in word list */
extern void *malloc() ;
int
add_word(int type, char *word)
{
struct word *wp;
if(lookup_word(word) != LOOKUP) {
printf("!!! warning: word %s already defined \n", word);
return 0;
}
/* word not there, allocate a new entry and link it on the list */
wp = (struct word *) malloc(sizeof(struct word));
wp->next = word_list;
/* have to copy the word itself as well */
wp->word_name = (char *) malloc(strlen(word)+1);
strcpy(wp->word_name, word);
wp->word_type = type;
word_list = wp;
return 1; /* it worked */
}
int
lookup_word(char *word)
{
struct word *wp = word_list;
/* search down the list looking for the word */
for(; wp; wp = wp->next) {
if(strcmp(wp->word_name, word) == 0)
return wp->word_type;
}
return LOOKUP; /* not found */
}
Yacc file,
%{
/*
* A lexer for the basic grammar to use for recognizing English sentences.
*/
#include <stdio.h>
%}
%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION
%%
sentence: subject VERB object{ printf("Sentence is valid.\n"); }
;
subject: NOUN
| PRONOUN
;
object: NOUN
;
%%
extern FILE *yyin;
main()
{
do
{
yyparse();
}
while (!feof(yyin));
}
yyerror(s)
char *s;
{
fprintf(stderr, "%s\n", s);
}
Header file. I had to create two versions of some values; I'm not sure why, but the code was having an issue with them that I didn't understand, so I defined each token under both its full name and a shortened name, whereas the book has only one name for each.
# define NOUN 257
# define PRON 258
# define VERB 259
# define ADVERB 260
# define ADJECTIVE 261
# define PREPOSITION 262
# define CONJUNCTION 263
# define ADV 260
# define ADJ 261
# define PREP 262
# define CONJ 263
# define PRONOUN 258
If you feel that there is a problem with your linked list implementation, you'd be a lot better off testing and debugging it with a simple driver program rather than trying to do that with tools (flex and bison) that you are still learning. On the whole, the simpler a test is and the fewer dependencies it has, the easier it is to track down problems. See this useful essay by Eric Lippert for some suggestions on debugging.
I don't understand why you felt the need to introduce "short versions" of the token IDs. The example code in Levine's book does not anywhere use these symbols. You cannot just invent symbols and you don't need these abbreviations for anything.
The comment that you "had to create 2 versions [of the header file] for some values" but that the "code was having an issue with them, and I wasn't understanding why" is far too unspecific for an answer. Perhaps the problem was that you thought you could use identifiers which are not defined anywhere, which would certainly cause a compiler error. But if there is some other issue, you could ask a question with an accurate problem description (that is, exactly what problem you encountered) and a Minimal, Complete, and Verifiable example (as indicated in the StackOverflow help pages).
In any case, manually setting the values of the token IDs is almost certainly preventing you from recognizing inputs. Bison/yacc reserves the values 256 and 257 for internal tokens, so the first token it generates (and therefore uses in the parser) has the value 258. That means that the token values you are returning from your lexical scanner have a different meaning inside bison. Bottom line: never manually set token values. If your header isn't being generated correctly, figure out why.
As far as I can see, the only legal input for your program has the form:
sentence: subject VERB object
Since none of your sample inputs ("run", for example) have this form, a syntax error is not surprising. However, the fact that you receive a very early syntax error on the input "cat" does suggest there might be a problem with your symbol table lookup. (That's probably the result of the problem noted above.)
I am trying to resolve the following warning:
warning: suggest braces around empty body in an 'if' statement
Relevant code:
cdc(.....)
{
//some statements
ENTER_FUNC(CDC_TRKEY_FC,cdcType_t); //Showing warning in this line
if(something)
{
if(..)
{
}
else
{
}
}
else
{
}
}
If I remove the ; and add braces as below
ENTER_FUNC(CDC_TRKEY_FC,cdcType_t)
{
}
the warning is gone.
What exactly does it mean? Is it behaving like an if statement?
Sorry, it's confidential code, so I can't share all of it.
If this is your code
if (/* condition */);
/* other code */
Then the other code will ALWAYS be executed.
You probably want the other code to only be executed if the condition is true.
In order to achieve that, you mainly have to delete the ;.
It is widely considered to be best practice to be somewhat generous with the {}, i.e.
if (/* condition */)
{
/* other code */
}
The fact that the warning does not occur after deleting the ; in the line ENTER_FUNC(CDC_TRKEY_FC,cdcType_t); and replacing it with {} can be explained if ENTER_FUNC is actually a macro which, together with the ; (which is NOT part of the macro), essentially expands to the if(); that earlier versions of your question mentioned.
The replacement with {} then does exactly what the compiler wanted.
The ENTER_FUNC() is probably meant to be used like
ENTER_FUNC(CDC_TRKEY_FC,cdcType_t) /* delete this ; */
{ /* new {, followed by rest of your function code */
if(something)
{
if(..)
{
}
else
{
}
}
else
{
}
} /* new */
Please excuse that this answer more or less assumes that you made a mistake in your code. Compare the contribution by Scheff, which assumes (also plausibly) that you were following a more complex design and wrote the code this way fully intentionally.
The statement
if (cond) ; else do_something();
or even
if (cond) ; do_something();
might be intended. Maybe the ; after if (cond) is a placeholder for something which will be added later.
Inserting comments
if (cond) /** #todo */ ; else do_something();
or
if (cond) /** #todo */ ; /* and then always */ do_something();
would make this clear to the human reader, but not to the compiler, which ignores comments completely.
However, the compiler authors suspected that there is a high chance the semicolon was placed unintentionally (it can easily be overlooked). Hence, they added a warning about this and gave a hint on how to make the intention clear if there is one:
Use { } instead of ; for an intentionally empty then-body to get rid of this warning.
Sample:
#include <stdio.h>
int main()
{
int cond = 1;
if (cond) /** #todo */ ; else printf("cond not met.\n");
if (cond) /** #todo */ ; printf("cond checked.\n");
return 0;
}
Output:
cond checked.
Live demo on ideone
The compiler used on ideone is stated as gcc 6.3.
I must admit that I couldn't reproduce the OP's diagnostic.
After the question was edited, the answer does not seem to match the question anymore. Hence, a little update:
The OP states that the
warning: suggest braces around empty body in an 'if' statement
appears for this line of code:
ENTER_FUNC(CDC_TRKEY_FC,cdcType_t); //Showing warning in this line
It seems that the OP was not aware that ENTER_FUNC is (very likely) a macro with an if statement in its replacement text (something like #define ENTER_FUNC(A,B) if (...)). (This is the most plausible scenario for getting this warning on this code.)
Unfortunately, the OP is not willing to show how ENTER_FUNC is defined, nor to prepare an MCVE with the same behavior.
However, the technique of hiding an if in a macro is even more questionable; I wouldn't recommend doing so. Imagine the following situation:
cdc(.....)
{
//some statements
ENTER_FUNC(CDC_TRKEY_FC,cdcType_t) // This time, the author forgot the ; or {}
if(something)
{
if(..)
{
}
else
{
}
}
else
{
}
}
The if(something) statement now becomes the body of the hidden if of the ENTER_FUNC() macro, which is probably not intended and therefore a bug. The application may now misbehave in certain situations. This is hard to catch just by looking at the source code; the error can only be found by single-step debugging and a bit of luck.
(Another option would be to expand all macros and check the C code after replacement. C compilers usually provide a preprocess-only option which makes the result of preprocessing visible to human eyes, e.g. gcc -E.)
So, the author of ENTER_FUNC built a macro which
- causes a compiler warning if the macro is used properly
- where the warning goes away if the macro is used wrongly.
IMHO, this is an unfortunate design.
I have some questions about antlr3 with tree grammar in C target.
I have almost finished my interpreter (functions, variables, boolean and math expressions all work), and I have kept the most difficult statements for the end (if, switch, etc.).
1) I would like to interpret a simple loop statement:
repeat: ^(REPEAT DIGIT stmt);
I've seen many examples, but nothing about the tree walker (only a topic here that uses the macros MARK() / REWIND(m) together with @init / @after, which doesn't work for me: I get ANTLR errors such as "unexpected node at offset 0"). How can I interpret this statement in C?
2) Same question with a simple if statement:
if: ^(IF condition stmt elseifstmt* elsestmt?);
The problem is to skip the statement if the condition is false and then test the other elseif/else statements.
3) I have some statements which can stop the script (like "break" or "exit"). How can I interrupt the tree walker and skip the following tokens?
4) When a lexer or parser error is detected, ANTLR reports an error, but I would like to produce my own error messages. How can I get the line number where the parser failed?
Ask me if you want more details.
Thank you very much (and I apologize for my poor English).
About the repeat statement, I think I've found a way to do it. On antlr.org, I found a complete interpreter for the C-- language, but written in Java.
Here is its while statement (a bit different, but the approach is the same):
whileStmt
scope{
Boolean breaked;
}
@after{
CommonTree stmtNode=(CommonTree)$whileStmt.start.getChild(1);
CommonTree exprNode=(CommonTree)$whileStmt.start.getChild(0);
int test;
$whileStmt::breaked=false;
while($whileStmt::breaked==false){
stream.push(stream.getNodeIndex(exprNode));
test=expr().value;
stream.pop();
if (test==0) break;
stream.push(stream.getNodeIndex(stmtNode));
stmt();
stream.pop();
}
}
: ^(WHILE . .)
;
I've tried to transform this code into C language:
repeat
scope {
int breaked;
int tours;
}
@after
{
int test;
pANTLR3_BASE_TREE repeatstmt = (pANTLR3_BASE_TREE)$repeat.start->getChild($repeat.start,1);
pANTLR3_BASE_TREE exprstmt = (pANTLR3_BASE_TREE)$repeat.start->getChild($repeat.start,0);
$repeat::breaked = 0;
test = 1;
while($repeat::breaked == 0)
{
TW_FOLLOWPUSH(exprstmt);
TW_FOLLOWPOP();
test++;
if(test == $repeat::tours)
break;
TW_FOLLOWPUSH(repeatstmt);
CTX->repeat(CTX);
TW_FOLLOWPOP();
}
}
: ^(REPEAT DIGIT stmt)
{
$repeat::tours = $DIGIT.text->toInt32($DIGIT.text);
}
But nothing happens (stmt is parsed just once).
Do you have any idea what is wrong?
About the custom error messages: I've found the macro GETLINE() in the lexer. It works when the tree walker fails, but ANTLR continues to display its own error messages for lexer or parser errors.
Thanks.
From the Bison Manual:
In a simple interactive command parser where each input is one line, it may be sufficient to allow yyparse to return 1 on error and have the caller ignore the rest of the input line when that happens (and then call yyparse again).
This is pretty much what I want, but I am having trouble getting it to work. Basically, I want to detect an error in flex, and if an error is detected, have Bison discard the entire line. What I have right now isn't working quite right, because my commands still get executed:
kbsh: ls '/home
Error: Unterminated Single Quote
admin kbrandt tempuser
syntax error
kbsh:
In my Bison file:
commands:
/*Empty*/ { prompt(); } |
command { prompt(); }
;
command:
error {return 1; } |
chdir_command |
pwd_command |
exit_command |
WORD arg_list {
execute_command($1, $2);
//printf("%s, %s\n", $1, $2);
} |
WORD { execute_command($1, NULL); }
;
And in my Flex:
' {BEGIN inQuote; }
<inQuote>\n {printf("Error: Unterminated Single Quote\n"); BEGIN(0); return(ERROR);}
I don't think you'll find a simple solution to handling these types of parsing errors in the lexer.
I would keep the lexer (flex/lex) as dumb as possible: it should just provide a stream of basic tokens (identifiers, keywords, etc.) and let the parser (yacc/bison) do the error detection. In fact, the parser is set up for exactly what you want, with a little restructuring of your approach...
In the lexer (parser.l), keep it simple (no eol/newline handling), something like this (not the complete file):
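%}
/* In flex, the backslash before ' is optional; the one before " is required. */
/* [^'\n]* instead of .* keeps two quoted strings on one line from merging. */
SINGLE_QUOTE_STRING \'[^'\n]*\'
DOUBLE_QUOTE_STRING \"[^"\n]*\"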
%%
{SINGLE_QUOTE_STRING} {
yylval.charstr = copy_to_tmp_buffer(yytext); // implies a %union
return STRING;
}
{DOUBLE_QUOTE_STRING} {
yylval.charstr = copy_to_tmp_buffer(yytext); // implies a %union
return STRING;
}
\n return NEWLINE;
Then in your parser.y file do all the real handling (again, not the complete file):
command:
error NEWLINE
{ yyclearin; yyerrok; print_the_next_command_prompt(); }
| chdir_command STRING NEWLINE
{ do_the_chdir($<charstr>2); print_the_next_command_prompt(); }
| ... and so on ...
There are two things to note here:
- Things like NEWLINE are shifted to the yacc side, so that you can determine when the user is done with the command; then you can clear things out and start over (assuming you have "int yywrap() {return 1;}" somewhere). If you try to detect the error too early in flex, how do you know when to raise it?
- chdir isn't a single command any more (unless it had a sub-rule that you just didn't show); the rule is now chdir_command STRING (the argument to chdir). This lets the parser figure out what went wrong; you can then call yyerror if that directory doesn't exist, etc.
This way you should get something like (guessing what chdir might look like):
cd 'some_directory
syntax error
cd 'some_directory'
you are in the some_directory dude!
And it is all handled by the yacc grammar, not by the tokenizer.
I have found that keeping flex as simple as possible gives you the most ***flex***ibility. :)