From the Bison Manual:
In a simple interactive command parser
where each input is one line, it may
be sufficient to allow yyparse to
return 1 on error and have the caller
ignore the rest of the input line when
that happens (and then call yyparse
again).
This is pretty much what I want, but I am having trouble getting to work. Basically, I want to detect and error in flex, and if an error is detected, have Bison discard the entire line. What I have right now, isn't working quite right because my commands still get executed:
kbsh: ls '/home
Error: Unterminated Single Quote
admin kbrandt tempuser
syntax error
kbsh:
In my Bison file:
commands:
/*Empty*/ { prompt(); } |
command { prompt(); }
;
command:
error {return 1; } |
chdir_command |
pwd_command |
exit_command |
WORD arg_list {
execute_command($1, $2);
//printf("%s, %s\n", $1, $2);
} |
WORD { execute_command($1, NULL); }
;
And in my Flex:
' {BEGIN inQuote; }
<inQuote>\n {printf("Error: Unterminated Single Quote\n"); BEGIN(0); return(ERROR);}
I don't think you'll find a simple solution to handling these types of parsing errors in the lexer.
I would keep the lexer (flex/lex) as dumb as possible, it should just provide a stream of basic tokens (identifiers, keywords, etc...) and have the parser (yacc/bison) do the error detection. In fact it is setup for exactly what you want, with a little restructuring of your approach...
In the lexer (parser.l), keep it simple (no eol/newline handling), something like (isn't full thing):
}%
/* I don't recall if the backslashify is required below */
SINGLE_QUOTE_STRING \'.*\'
DOUBLE_QUOTE_STRING \".*\"
%%
{SINGLE_QUOTE_STRING} {
yylval.charstr = copy_to_tmp_buffer(yytext); // implies a %union
return STRING;
}
{DOUBLE_QUOTE_STRING} {
yylval.charstr = copy_to_tmp_buffer(yytext); // implies a %union
return STRING;
}
\n return NEWLINE;
Then in your parser.y file do all the real handling (isn't full thing):
command:
error NEWLINE
{ yyclearin; yyerrorok; print_the_next_command_prompt(); }
| chdir_command STRING NEWLINE
{ do_the_chdir($<charstr>2); print_the_next_command_prompt(); }
| ... and so on ...
There are two things to note here:
The shift of things like NEWLINE to the yacc side so that you can determine when the user is done with the command then you can clear things out and start over (assuming you have "int yywrap() {return 1;}" somewhere). If you try to detect it too early in flex, when do you know to raise an error?
chdir isn't one command (unless it was sub ruled and you just didn't show it), it now has chdir_command STRING (the argument to the chdir). This makes it so that the parser can figure out what went wrong, you can then yyerror if that directory doesn't exist, etc...
This way you should get something like (guessing what chdir might look like):
cd 'some_directory
syntax error
cd 'some_directory'
you are in the some_directory dude!
And it is all handled by the yacc grammer, not by the tokenizer.
I have found that keeping flex as simple as possible gives you the most ***flex***ibility. :)
Related
I'm new to Lex/Yacc. Found these lex & yacc files which parse ansi C.
To experiment I added an action to print part of the parsing:
constant
: I_CONSTANT { printf("I_CONSTANT %d\n", $1); }
| F_CONSTANT
| ENUMERATION_CONSTANT /* after it has been defined as such */
;
The problem is, no matter where I put the action, and whatever $X I use, I always get value 0.
Here I got printed:
I_CONSTANT 0
Even though my input is:
int foo(int x)
{
return 5;
}
Any idea?
Nothing in the lex file you point to actually sets semantic values for any token. As the author says, the files are just a grammar and "the bulk of the work" still needs to be done. (There are other caveats having to do with the need for a preprocessor.)
Since nothing in the lex file ever sets yylval, it will always be 0, and that is what yacc/bison will find when it sets up the semantic value for the token ($1 in this case).
Turns out yylval = atoi(yytext) is not done in the lex file, so I had to add it myself. Also learned I can add extern char *yytext to the yacc file header, and then use yytext directly.
I am a new to flex. I have just written a sample code to detect multi line comments using a flex program. Now I want to improve the code. I want to detect unfinished and ill formed comments in the code. for example: a comment beginning with /* without an ending */ is an unfinished comment and by ill formed comment I mean the comment is not properly formed, say, an EOF appears inside the comment etc. What I have to add in my code to check these things? My sample code is as follows:
%x COMMENT_MULTI_LINE
%{
char* commentStart;
%}
%%
[\n\t\r ]+ {
/* ignore whitespace */ }
<INITIAL>"/*" {
commentStart = yytext;
BEGIN(COMMENT_MULTI_LINE);
}
<COMMENT_MULTI_LINE>"*/" {
char* comment = strndup(commentStart, yytext + 2 - commentStart);
printf("'%s': was a multi-line comment\n", comment);
free(comment);
BEGIN(INITIAL);
}
<COMMENT_MULTI_LINE>. {
}
<COMMENT_MULTI_LINE>\n {
}
%%
int main(int argc, char *argv[]){
yylex();
}
The flex manual section on using <<EOF>> is quite helpful as it has exactly your case as an example, and their code can also be copied verbatim into your flex program.
As it explains, when using <<EOF>> you cannot place it in a normal regular expression pattern. It can only be proceeded by a the name of a state. In your code you are using a state to indicate you are inside a comment. This state is called COMMENT_MULTI. All you have to do is put that in front of the <<EOF>> marker and give it an action to do:
<COMMENT_MULTI><<EOF>> {printf("Unterminated Comment: %s\n", yytext);
yyterminate();}
The special action function yyterminate() tells flex that you have recognised the <<EOF>> and that it marks the end-of-input for your program.
I have tested this, and it works in your code. (And with multi-line strings also).
I am writing a compiler in C, and I use bison for the grammar and flex for the tokens. To improve the quality of error messages, some common errors need to appear in the grammar. This has the side effect, however, of bison thinking that an invalid input is actually valid.
For example, consider this grammar:
program
: command ';' program
| command ';'
| command {yyerror("Missing ;.");} // wrong input
;
command
: INC
| DEC
;
where INC and DEC are tokens and program is the initial symbol. In this case, INC; is a valid program, but INC is not, and an error message is generated. The function yyparse(), however, returns 0 as if the program were correct.
Looking at the bison manual, I found the macro YYERROR, which should behave as if the parser itself found an error. But even if I add YYERROR after the call to yyerror(), the function yyparse() still returns 0. I could use YYABORT instead, but that would stop on the first error, which is terrible and not what I want.
Is there anyway to make yyparse() return 1 without stopping on the first error?
Since you intend to recover from syntax errors, you're not going to be able to use the return code from yyparse to signal that one or more errors occurred. Instead, you'll have to track that information yourself.
The traditional way to do that would be to use a global error count (just showing the crucial pieces):
%{
int parse_error_count = 0;
%}
%%
program: statement { yyerror("Missing semicolon");
++parse_error_count; }
%%
int parse_interface() {
parse_error_count = 0;
int status = yyparse();
if (status) return status; /* Might have run out of memory */
if (parse_error_count) return 3; /* yyparse returns 0, 1 or 2 */
return 0;
}
A more modern solution is to define an additional "out" parameter to yyparse:
%parse-param { int* error_count }
%%
program: statement { yyerror("Missing semicolon");
++*error_count; }
%%
int main() {
int error_count = 0;
int status = yyparse(&error_count);
if (status || error_count) { /* handle error */ }
Finally, in case you really need to export the symbol yyparse from your bison-generated code, you can do the following ugly hack:
%code top {
#define yyparse internal_yyparse
}
%parse-param { int* error_count }
%%
program: statement { yyerror("Missing semicolon");
++*error_count; }
%%
#undef yyparse
int yyparse() {
int error_count = 0;
int status = internal_yyparse(&error_count);
// Whatever you want to do as a summary
return status ? status : error_count ? 1 : 0;
}
yyerror() just prints an error message. It doesn't alter what yyparse() returns.
What you're attempting is not a good idea. You'll enormously expand the grammar and you run a major risk of making it ambiguous. All you need to do is remove the production that calls yyerror(). That input will produce a syntax error anyway, and that will cause yyparse() not to return 0. You're keeping a dog and barking yourself. What you should be checking for is semantic errors that the parser can't see.
If you really want to improve the error messages, there's enough information in the parse tables and state information to tell you what the expected next token was. However in most cases it's such a large set it's pointless to print it. But programmers are used to sorting out 'syntax error'. Don't sweat it. Writing compilers is hard enough already.
NB You should make your grammar left-recursive to avoid excessive stack usage: for example, program : program ';' command.
I'm writing a lexer and I'm using Flex to generate it based on custom rules.
I want to match identifiers of sorts that start with a letter and then can have either letters or numbers. So I wrote the following pattern for them:
[[:alpha:]][[:alnum:]]*
It works fine, the lexer that gets generated recognizes the pattern perfectly, although it doesn't only match whole words but all appearances of that pattern.
So for example it would match the input "Text" and "9Text" (discarding that initial 9).
Consider the following simple lexer that accepts IDs as described above:
%{
#include <stdio.h>
#define LINE_END 1
#define ID 2
%}
/* Flex options: */
%option noinput
%option nounput
%option noyywrap
%option yylineno
/* Definitions: */
WHITESPACE [ \t]
BLANK {WHITESPACE}+
NEW_LINE "\n"|"\r\n"
ID [[:alpha:]][[:alnum:]_]*
%%
{NEW_LINE} {printf("New line.\n"); return LINE_END;}
{BLANK} {/* Blanks are skipped */}
{ID} {printf("ID recognized: '%s'\n", yytext); return ID;}
. {fprintf(stderr, "ERROR: Invalid input in line %d: \"%s\"\n", yylineno, yytext);}
%%
int main(int argc, char **argv) {
while (yylex() != 0);
return 0;
}
When compiled and fed the following input produces the output below:
Input:
Test
9Test
Output:
Test
ID recognized: 'Test'
New line.
9Test
ERROR: Invalid input in line 2: "9"
ID recognized: 'Test'
New line.
Is there a way to make flex match only whole words (i.e. delimited by either blanks or custom delimiters like '(' ')' for example)?
Because I could write a rule that excludes IDs that start with numbers, but what about the ones that start with symbols like "$Test" or "&Test"? I don't think I can enumerate all of the possible symbols.
Following the example above, the desired output would be:
Test
ID recognized: 'Test'
New line.
9Test
ERROR: Invalid input 2: "9Test"
New line.
You seem to be asking two questions at once.
'Whole word' isn't a recognized construct in programming languages. The lexical and grammar are already defined. Just implement them.
The best way to handle illegal or unexpected characters in flex is not to handle them specially at all. Return them to the parser, just as you would for a special character. Then the parser can deal with it and attempt recovery via discarding.
Place this as you final rule:
. return yytext[0];
You can use this
Lets say you want to identify the reserved word for :
([\r\n\z]|" "|"")+"for"/([\r\n\z]|" ")+ {}
any new line character or generally a control character [\r\n\z]
or a white space " "
or the beginning of the line ""
for at least 1 time +
the word you want in quotes "for"
only followed by /
almost the same expression without the "" at least 1 time -> ([\r\n\z]|" ")+
With this code you can form your own matching pattern for whatever you need to do before and after the word.
I'm not sure if this is the best answer, but this works for me.
%x ERROR
%%
{NL} {
printf("New line.\n");
return LINE_END;
}
<INITIAL,ERROR>{BLANK} {
BEGIN(INITIAL);
}
{ID} {
printf("ID recognized: '%s'\n", yytext);
return ID;
}
<INITIAL,ERROR>. {
fprintf(stderr, "ERROR: Invalid input in line %d: \"%s\"\n", yylineno, yytext);
BEGIN(ERROR);
}
%%
Read this to learn more about starting conditions.
(My attempt at explaining what I've done)
Whenever this lexer hits something unexpected, it exclusively activates 2 sets of rules. To get out of the error set of rules, the lexer has to hit a 'blank'.
I am trying to develop a language parser on CentOS 6.0 by means of Bison 3.0 (C parser generator), Flex 2.5.35 and gcc 4.4.7. I have the following Bison grammar file:
%{
#include <stdio.h>
%}
%union {
int int_t;
char* str_t;
}
%token SEP
%token <str_t> ID
%start start
%type <int_t> plst
%%
start: plst start
| EOS { YYACCEPT; }
;
// <id> , <id> , ... , <id>
plst: ID SEP_PARAMS plst { printf("Rule 1 %s %s \n",$1,$2); }
| ID { printf("Rule 2 %s \n", $1); }
| /* empty */ { }
;
%%
int yyerror(GNode* root, const char* s) {printf("Error: %s", s);}
The problem
As it is now, it is not really a meaningful one, but it is enough to understand my problem I think. Consider that I have a scanner written in Flex which recognizes my tokens. This grammar file is used to recognize simple identifier lists like: id1,id2,...,idn. My problem is that in each grammar rule, when I try to get the value of the identifier (the string representing the same of the identifier), I get a NULL pointer as also proved by my printfs.
What am I doing wrong? Thankyou
Edit
Thanks to recent answers, I could understand that the problems strongly relates to Flex and its configuration file. In particular I have edited my lex file in order to meet the specifications described by the Flex Manual for Bison Bridging:
{ID} { printf("[id-token]");
yylval->str_t = strdup(yytext);
return ID; }
However after running Bison, then Flex (providing the --bison-bridge option) and then the compiler, I execute the generated parser and I instantly get Segmentation Fault.
What's the problem?
The flex option --bison-bridge (or %option bison-bridge) matches up to the bison option %define api.pure. You need to use either BOTH bison-bridge and api.pure or NEITHER -- either way can work, but they need to be consistent. Since it appears you are NOT using api.pure, you want to delete the --bison-bridge option.
The values for $1, $2 etc. have to be set by the lexer.
If you have a rule in the lexer for identifiers, like
ID [a-z][a-z0-9]*
%%
{ID} { return ID; }
the semantic values are not set.
You have to do e.g.
{ID} { /* Set the unions value, used by e.g. `$1` in the parser */
yylval.str_t = strdup(yytext);
return ID;
}
Remember to free the value in the parser, as strdup allocates memory.