I'm trying to understand flex/bison, but the documentation is a bit difficult for me, and I've probably grossly misunderstood something. Here's a test case: http://namakajiri.net/misc/bison_charlit_test/
File "a" contains the single character 'a'. "foo.y" has a trivial grammar like this:
%%
file: 'a' ;
The generated parser can't parse file "a"; it gives a syntax error.
The grammar "bar.y" is almost the same, only I changed the character literal for a named token:
%token TOK_A;
%%
file: TOK_A;
and then in bar.lex:
a { return TOK_A; }
This one works just fine.
What am I doing wrong in trying to use character literals directly as bison terminals, like in the docs?
I'd like my grammar to look like "statement: selector '{' property ':' value ';' '}'" and not "statement: selector LBRACE property COLON value SEMIC RBRACE"...
I'm running bison 2.5 and flex 2.5.35 in debian wheezy.
The problem is a runtime problem, not a compile time problem.
The trouble is that you have two radically different lexical analyzers.
The bar.lex analyzer recognizes an a in the input and returns it as a TOK_A and ignores everything else.
The foo.lex analyzer echoes every single character, but that's all.
foo.lex — as written
%{
#include "foo.tab.h"
%}
%%
foo.lex — equivalent
%{
#include "foo.tab.h"
%}
%%
. { ECHO; }
foo.lex — required
%{
#include "foo.tab.h"
%}
%%
. { return *yytext; }
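To make the difference concrete, here is a minimal sketch in plain C of what the "required" rule does. It is a hand-written stand-in, not flex output; the name next_token and the string cursor are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a scanner built from ". { return *yytext; }".
   Each call consumes characters from *cursor. A newline is not matched
   by ".", so it goes to the default rule, which echoes it and keeps
   scanning. Returns 0 at end of input, the code bison treats as EOF.
   The cast to unsigned char is a choice made here to keep codes >= 0. */
int next_token(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    while (**cursor != '\0') {
        unsigned char c = (unsigned char)*(*cursor)++;
        if (c == '\n') {
            putchar(c);     /* default rule: ECHO, no token returned */
            continue;
        }
        return c;           /* the token code is the character code */
    }
    return 0;
}
```

The point is that a character literal such as 'a' in the grammar only matches if the scanner hands back that same character code as the token.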
Working code
Here's some working code with diagnostic printing in place.
foo-lex.l
%%
. { printf("Flex: %d\n", *yytext); return *yytext; }
foo.y
%{
#include <stdio.h>
void yyerror(char *s);
%}
%%
file: 'a' { printf("Bison: got file!\n"); }
;
%%
int main(void)
{
yyparse();
}
void yyerror(char *s)
{
fprintf(stderr, "%s\n", s);
}
Compilation and execution
$ flex foo-lex.l
$ bison foo.y
$ gcc -o foo foo.tab.c lex.yy.c -lfl
$ echo a | ./foo
Flex: 97
Bison: got file!

$
Point of detail: how did that blank line get into the output? Answer: the lexical analyzer put it there. The pattern . does not match a newline, so the newline was treated as if there was a rule:
\n { ECHO; }
This is why the input was accepted. If you change the foo-lex.l file to:
%%
. { printf("Flex-1: %d\n", *yytext); return *yytext; }
\n { printf("Flex-2: %d\n", *yytext); return *yytext; }
and then recompile and run again, the output is:
$ echo a | ./foo
Flex-1: 97
Bison: got file!
Flex-2: 10
syntax error
$
with no blank lines. This is because the grammar doesn't allow a newline to appear in a valid 'file'.
I am writing a parser and a scanner on Ubuntu. In my flex code "scanner.l" I have an IDENTIFIER token and a BOOL_LITERAL token. IDENTIFIER is any word and BOOL_LITERAL is either true or false.
In my bison code "parser.y" I have the grammar, which should be able to accept a BOOL_LITERAL through the primary production.
However, the code is not working as intended. Here is the erro
Here are all of my files:
scanner.l
%{
#include <string>
#include <vector>
using namespace std;
#include "listing.h"
#include "tokens.h"
%}
%option noyywrap
ws [ \t\r]+
comment (\-\-.*\n)|\/\/.*\n
line [\n]
digit [0-9]
int {digit}+
real {int}"."{int}([eE][+-]?{digit})?
boolean ["true""false"]
punc [\(\),:;]
addop ["+""-"]
mulop ["*""\/"]
relop [="/=">">=""<="<]
id [A-Za-z][A-Za-z0-9]*
%%
{ws} { ECHO; }
{comment} { ECHO; nextLine();}
{line} { ECHO; nextLine();}
{relop} { ECHO; return(RELOP); }
{addop} { ECHO; return(ADDOP); }
{mulop} { ECHO; return(MULOP); }
begin { ECHO; return(BEGIN_); }
boolean { ECHO; return(BOOLEAN); }
end { ECHO; return(END); }
endreduce { ECHO; return(ENDREDUCE); }
function { ECHO; return(FUNCTION); }
integer { ECHO; return(INTEGER); }
real { ECHO; return(REAL); }
is { ECHO; return(IS); }
reduce { ECHO; return (REDUCE); }
returns { ECHO; return(RETURNS); }
and { ECHO; return(ANDOP); }
{boolean} { ECHO; return(BOOL_LITERAL); }
{id} { ECHO; return(IDENTIFIER);}
{int} { ECHO; return(INT_LITERAL); }
{real} { ECHO; return(REAL_LITERAL); }
{punc} { ECHO; return(yytext[0]); }
. { ECHO; appendError(LEXICAL, yytext); }
%%
parser.y
%{
#include <string>
using namespace std;
#include "listing.h"
int yylex();
void yyerror(const char* message);
%}
%error-verbose
%token INT_LITERAL REAL_LITERAL BOOL_LITERAL
%token IDENTIFIER
%token ADDOP MULOP RELOP ANDOP
%token BEGIN_ BOOLEAN END ENDREDUCE FUNCTION INTEGER IS REDUCE RETURNS REAL
%%
function:
function_header optional_variable body ;
function_header:
FUNCTION IDENTIFIER RETURNS type ';' ;
parameters:
parameters ',' |
parameter ;
parameter:
IDENTIFIER ':' type |
;
optional_variable:
variable |
;
variable:
IDENTIFIER ':' type IS statement_ ;
type:
INTEGER |
BOOLEAN |
REAL ;
body:
BEGIN_ statement_ END ';' ;
statement_:
statement ';' |
error ';' ;
statement:
expression |
REDUCE operator reductions ENDREDUCE ;
operator:
ADDOP |
MULOP ;
reductions:
reductions statement_ |
;
expression:
expression ANDOP relation |
relation ;
relation:
relation RELOP term |
term;
term:
term ADDOP factor |
factor ;
factor:
factor MULOP primary |
primary ;
primary:
'(' expression ')' |
INT_LITERAL |
REAL_LITERAL |
BOOL_LITERAL |
IDENTIFIER ;
%%
void yyerror(const char* message)
{
appendError(SYNTAX, message);
}
int main(int argc, char *argv[])
{
firstLine();
yyparse();
lastLine();
return 0;
}
Other associated files:
listing.h
enum ErrorCategories {LEXICAL, SYNTAX, GENERAL_SEMANTIC, DUPLICATE_IDENTIFIER,
UNDECLARED};
void firstLine();
void nextLine();
int lastLine();
void appendError(ErrorCategories errorCategory, string message);
listing.cc
#include <cstdio>
#include <string>
using namespace std;
#include "listing.h"
static int lineNumber;
static string error = "";
static int totalErrors = 0;
static void displayErrors();
void firstLine()
{
lineNumber = 1;
printf("\n%4d ",lineNumber);
}
void nextLine()
{
displayErrors();
lineNumber++;
printf("%4d ",lineNumber);
}
int lastLine()
{
printf("\r");
displayErrors();
printf(" \n");
return totalErrors;
}
void appendError(ErrorCategories errorCategory, string message)
{
string messages[] = { "Lexical Error, Invalid Character ", "",
"Semantic Error, ", "Semantic Error, Duplicate Identifier: ",
"Semantic Error, Undeclared " };
error = messages[errorCategory] + message;
totalErrors++;
}
void displayErrors()
{
if (error != "")
printf("%s\n", error.c_str());
error = "";
}
makefile
compile: scanner.o parser.o listing.o
g++ -o compile scanner.o parser.o listing.o
scanner.o: scanner.c listing.h tokens.h
g++ -c scanner.c
scanner.c: scanner.l
flex scanner.l
mv lex.yy.c scanner.c
parser.o: parser.c listing.h
g++ -c parser.c
parser.c tokens.h: parser.y
bison -d -v parser.y
mv parser.tab.c parser.c
mv parser.tab.h tokens.h
listing.o: listing.cc listing.h
g++ -c listing.cc
Note:
I have to run "makefile", "bison -d parser.y" and finally "makefile" again. Then I run the command "./compile < incremental1.txt" and I get the following error:
[screenshot of the error output]
Please help me understand why I am getting a syntax error.
@SoronelHaetir has certainly identified one of the problems with your parser. But that problem cannot create the syntax error message which appears in your image. [Note 1] Your grammar allows identifiers in exactly the same place as boolean literals, so the fact that true is actually scanned as an identifier will not produce a syntax error in an expression which starts true and. (In other words, x and... would be parsed just the same.)
The problem is actually your use of 8.E+1 as a numeric literal. Your rule for REAL_LITERAL uses the pattern
{int}"."{int}([eE][+-]?{digit})?
which doesn't match 8.E+1 because there is no {int} following the '.'. So when the scanner reaches the input 8.E+1, it produces the INT_LITERAL 8, which is the longest match. When it is asked for the next token, it first sees a '.', but that doesn't match any pattern so it uses the default fallback action (ECHO), and then continues to the next character (E), which matches the IDENTIFIER pattern. And the input
true and 8 E ...
is indeed a syntax error: there is an unexpected identifier following the 8, and that's what bison reports.
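If you want to check the longest-match claim by hand, here is a rough C sketch that mimics just the two patterns involved. The function names match_int and match_real are made up for this illustration; they are not part of flex or of your scanner.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the longest prefix of s matching {int} = [0-9]+, or 0. */
size_t match_int(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Length of the longest prefix of s matching
   {int}"."{int}([eE][+-]?{digit})?  or 0 if there is no match. */
size_t match_real(const char *s)
{
    size_t whole = match_int(s);
    if (whole == 0 || s[whole] != '.')
        return 0;
    size_t frac = match_int(s + whole + 1);
    if (frac == 0)
        return 0;           /* the pattern demands digits after the '.' */
    size_t n = whole + 1 + frac;
    /* optional exponent: e or E, optional sign, exactly one digit */
    if (s[n] == 'e' || s[n] == 'E') {
        size_t m = n + 1;
        if (s[m] == '+' || s[m] == '-')
            m++;
        if (isdigit((unsigned char)s[m]))
            n = m + 1;
    }
    return n;
}
```

For the input 8.E+1, match_real finds nothing and match_int matches one character, so the scanner returns INT_LITERAL for the 8 and carries on from the '.'.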
Aside from fixing the pattern for real literals, you should make sure that you do something sensible with unrecognised characters; flex's default action -- which basically just ignores characters that can't match any pattern -- is not of much use, particularly in debugging (as I think the above explanation demonstrates).
There are a number of other issues with your patterns involving the same misconception about the syntax of character classes as shown in the boolean literal pattern. This indicates to me that you did not attempt to test your lexical scanner before hooking it into your parser. That's an essential step in writing parsers; if your lexical scanner is not returning the tokens you expect it to return, you're going to have a lot of trouble trying to figure out what errors there might be in your grammar.
You might find the debugging techniques outlined in this answer useful. (That post also has links to the flex and bison manuals. Section 6 of the flex manual is a brief but complete guide to the syntax of flex patterns, and you might want to take a few minutes to read it.)
Notes
Please copy and paste the text of error messages into your questions rather than using an image showing a screenshot. Images are very hard to read on smartphones, for example, or for people who rely on screen-readers. And it's not possible to copy a part of a screenshot into an answer, which I would have preferred to have done here.
Your boolean pattern should be "true"|"false" not ["true""false"].
Honestly, the way your patterns are set up is just weird. Is there some reason not to use:
...
%%
"true" { /* */ return BOOL_LITERAL; }
"false" { /* */ return BOOL_LITERAL; }
Patterns make sense when you aren't trying to match literals but here you are.
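For what it's worth, the difference between the two spellings can be sketched in C. Inside a flex character class the double quotes are ordinary characters, so the bracketed form matches a single character. The helper names below are invented for the example.

```c
#include <assert.h>
#include <string.h>

/* What ["true""false"] matches: exactly one character drawn from the
   set  " t r u e f a l s  (quotes are ordinary inside a class). */
int matches_class(char c)
{
    return c != '\0' && strchr("\"truefals", c) != NULL;
}

/* What "true"|"false" matches: one of the two whole words. */
int matches_alternation(const char *s)
{
    assert(s != NULL);
    return strcmp(s, "true") == 0 || strcmp(s, "false") == 0;
}
```

So with the class version the scanner can only ever produce a one-character BOOL_LITERAL such as t, and the longer {id} match wins for the word true.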
I found it easy to learn Lex/Flex; learning Yacc/Bison, however, was a lot more confusing because of the lack of simple example programs. From what I have seen, the first example parser introduced to students tends to be a calculator, which is not very beginner-friendly. It was hard for me to understand how Yacc works just by looking at complex source code, so I decided to write my own very simple program which combines Lex and Yacc.
test.l:
%{
#include "y.tab.h"
%}
%%
"PRINT" { printf("Returning PRINT\n"); return PRINT; }
"EXIT" { printf("Returning EXIT\n"); return EXIT; }
. ;
[ \t\n] ;
%%
int yywrap(void) { return 1; }
test.y:
%{
int yylex();
#include <stdio.h>
#include <stdlib.h>
//#include "y.tab.h"
%}
%start line
%token PRINT
%token EXIT
%%
line: PRINT {printf("Caught PRINT\n");}
| EXIT {printf("Caught EXIT\n");}
| ;
%%
int main() { yyparse(); }
Compilation:
yacc -d test.y
lex test.l
gcc lex.yy.c y.tab.h -ll
I understand that this program is very simple and that there are tons of examples for Yacc, but I assure you that I tried for a couple of hours before asking for help.
Although I think I got the basics done, I'm unable to figure out why it won't work, please let me know how I could fix it. I suspect that I might be compiling incorrectly. I could also display y.tab.h if necessary. Thank you for your time.
Edit: The goal is simply for the lexer to return the appropriate return value to the yacc parser, and for the yacc parser to "catch" it and print "Caught PRINT" or "Caught EXIT".
Edit 2: My compilation process was indeed incorrect. I would like to thank the person who helped me understand how to fix the issue.
You shouldn't compile *.h files.
With gcc y.tab.c lex.yy.c and a yyerror definition provided inside test.y (the bison manual recommends void yyerror (char const *s) { fprintf (stderr, "%s\n", s); }), it should build.
So we have a tutorial on flex and bison before we start our compilation techniques course at university.
The following test should be split into lines and newlines
testtest test data
second line in the data
another line without a trailing newline
This is what my parser should output:
Line: testtest test data
NL
Line: second line in the data
NL
Line: another line without a trailing newline
When I run the following:
cat test.txt | ./parser
This returns:
LINE: testtest test data
It's a bad: syntax error
This is in my .y file:
%{
#include<stdio.h>
int yylex(); /* Supress C99 warning on OSX */
extern char *yytext; /* Correct for Flex */
unsigned int total;
%}
%token LINE
%token NL
%%
line : LINE {printf("LINE: %s\n", yytext);}
;
newline : NL {printf("NL\n");}
;
And this is in my binary.flex file:
%top{
#define YYSTYPE int
#include "binary.tab.h" /* Token values generated by bison */
}
%option noyywrap
%%
[^\n\r/]+ return LINE;
\n return NL;
%%
So, any ideas to solve this problem ?
PS: This is my .c file
#include<stdio.h>
#include "binary.tab.h"
extern unsigned int total;
int yyerror(char *c)
{
printf("It's a bad: %s\n", c);
return 0;
}
int main(int argc, char **argv)
{
if(!yyparse())
printf("It's a mario time: %d\n",total);
return 0;
}
Your bison grammar recognizes precisely one LINE (without a newline) because a bison grammar recognizes its start symbol, which by default is the first non-terminal defined in the file (here, line). Just that, and no more. The newline rule cannot be reached from it.
If you want to recognize multiple lines, each consisting of a LINE and possibly a NL, you'll need to add a definition for an input consisting of multiple lines, each consisting of ... . I'm not sure why you would use bison for this, though, since the original problem seems easy to solve with just flex.
By the way, if your input file includes a \r character, none of your flex patterns will recognize it (the flex-generated default rule will catch it, but that is almost never what you want). Use %option nodefault so that you get a warning about this sort of error. And react when you see warnings: you will have seen several when you ran bison on your bison file, I'm sure.
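To show how little machinery the expected output really needs, here is a small C sketch that does the splitting directly. It is neither flex nor bison; print_lines is a hypothetical helper, and unlike your LINE pattern it only treats \n as special (it does not exclude \r or /).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "Line: <text>" for each run of non-newline characters and
   "NL" for each newline found in input. Returns the number of tokens
   written, counting both kinds. */
int print_lines(const char *input, FILE *out)
{
    assert(input != NULL && out != NULL);
    int tokens = 0;
    while (*input != '\0') {
        if (*input == '\n') {
            fputs("NL\n", out);
            input++;
        } else {
            size_t len = strcspn(input, "\n");   /* up to next newline */
            fprintf(out, "Line: %.*s\n", (int)len, input);
            input += len;
        }
        tokens++;
    }
    return tokens;
}
```

The loop is the important part: it keeps accepting LINE and NL items until the input ends, which is exactly the repetition your grammar is missing.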
I'm starting on the whole world of Flex and Bison. So I followed a tutorial to write this l file for flex:
%{
#include <stdio.h>
#include <stdlib.h>
void yyerror(char *);
#include "y.tab.h"
%}
%%
/******************** RULES ********************/
/* One letter variables */
[a-z] {
yylval = *yytext - 'a'; // This is to return a number between 0 and 26 representting the letter variable.
printf("VAR: %s\n",yytext);
return VARIABLE;
}
/* Integer constants */
[0-9]+ {
yylval = atoi(yytext);
printf("INT: %d\n",yylval);
return INTEGER;
}
/* Operators */
[-+()=/*\n]+ { printf("OPR: %s\n",yytext); return *yytext; /*\n is considered an operator because it signals the end of a statement*/ }
/* This skips white space and tab chararcters */
[ \t] ;
/* Anything esle is not allowed */
. yyerror("Invalid character found");
/***************** SUBROUTINES *****************/
%%
int yywrap(void){
return 1;
}
And this is the grammar:
/***************** DEFINITIONS *****************/
%token INTEGER VARIABLE
%left '+' '-'
%left '*' '/'
%{
void yyerror(char *);
int yylex(void);
int sym[26];
%}
%%
/******************** RULES ********************/
program:
program statement '\n'
|
;
statement:
expr { printf("EXPR: %d\n", $1); }
| VARIABLE '=' expr { sym[$1] = $3; }
;
expr:
INTEGER
| VARIABLE { $$ = sym[$1]; }
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
| '(' expr ')' { $$ = $2; }
;
%%
/***************** SUBROUTINES *****************/
void yyerror(char *s){
printf("%s\n",s);
}
int main(void) {
yyparse();
return 0;
}
And several questions arise. The first comes when compiling. This is how I compile:
bison -d bas.y -o y.tab.c
flex bas.l
gcc y.tab.h lex.yy.c y.tab.c -o bas_fe
Which gives me two warnings like this:
bas.y:24:7: warning: incompatible implicit declaration of built-in function ‘printf’
expr { printf("EXPR: %d\n", $1); }
^
bas.y: In function ‘yyerror’:
bas.y:39:4: warning: incompatible implicit declaration of built-in function ‘printf’
printf("%s\n",s);
Now, they are only warnings and the printing works, but I found it odd, since I have clearly included the libraries for use of the printf function.
My real question arises from my interaction with the program. This is the console output:
x = (3+5)
VAR: x
OPR: =
OPR: (
INT: 3
OPR: +
INT: 5
x
OPR: )
VAR: x
syntax error
Several questions arise from this.
1) Upon inputting x = (3+5) the program printout does not include the ')' Why?
2) When inputting x (expected output would have been 8) only then the ')' appears. Why?
3) And then there is the "syntax error" message. I'm assuming the message is automatically generated within the code of y.tab.c. Can it be changed to something more meaningful? Am I right in assuming that the syntax error is because the program found ) and newline and the variable, and that this DOES NOT correspond to a program statement, as defined by the grammar?
I have clearly included the libraries for use of the printf function.
You included stdio.h in your flex file, but not in your bison file. And the warnings about printf being undeclared are from your bison file, not your flex file.
When you compile multiple files with gcc (or any other C compiler), the files are compiled independently and then linked together. So your command
gcc y.tab.h lex.yy.c y.tab.c -o bas_fe
does not concatenate the three files and compile them as a single unit. Rather, it compiles the three files independently, including uselessly compiling the header file y.tab.h.
What you should do is add a prolog block including #include <stdio.h> to your bas.y file.
[-+()=/*\n]+ {... return *yytext; ...}
This flex pattern matches any number of characters from the set [-+()=/*\n]. So in the input x=(3+5)\n, the )\n is being matched as a single token. However, the action returns *yytext, the first character of yytext, effectively ignoring the \n. Since your grammar requires \n, that creates a syntax error.
Simply remove the repetition operator from the pattern.
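Here is a rough C sketch of what the pattern does with and without the +, assuming a hypothetical helper match_operator that is not part of flex:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the prefix of s matched by [-+()=/*\n]+ when repeat is
   non-zero, or by [-+()=/*\n] (a single character) when repeat is 0. */
size_t match_operator(const char *s, int repeat)
{
    static const char set[] = "-+()=/*\n";
    assert(s != NULL);
    size_t n = strspn(s, set);   /* longest run of operator characters */
    if (!repeat && n > 1)
        n = 1;
    return n;
}
```

On the text )\n the repeating version consumes two characters, but the action still returns only ')' as the token, so the '\n' the grammar is waiting for never arrives. The single-character version consumes one character per call and the newline comes back as its own token.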
Can the error message be changed to something more meaningful?
If you have a reasonably modern bison, add the declaration
%error-verbose
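to the beginning of your bison file. (In Bison 3.0 and later the same request is spelled %define parse.error verbose, and %error-verbose is deprecated.)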
I have a Makefile so that when I type make the following commands run:
yacc -d parser.y
gcc -c y.tab.c
flex calclexer.l
gcc -c lex.yy.c
But then after this I get the following error messages:
calclexer.l:10: error: parse error before '[' token
calclexer.l:10: error: stray '\' in program
calclexer.l:15: error: stray '\' in program
calclexer.l:24: error: stray '\' in program
make: *** [lex.yy.o] Error 1
This is what is inside calclexer. How can it be fixed?
%{
#include "y.tab.h"
#include "parser.h"
#include <math.h>
%}
%%
%%
([0-9]+|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?) {
yylval.dval = atof(yytext);
return NUMBER;
}
[ \t] ; /* ignore white space */
[A-Za-z][A-Za-z0-9]* { /* return symbol pointer */
yylval.symp = symlook(yytext);
return NAME;
}
"$" { return 0; /* end of input */ }
\n |
. return yytext[0];
%%
You look to have an extra "%%" in "calclexer.l", where you have:
%%
%%
Remove one of those (and the blank line).
The format of a lexer file is (taken from the flex manpage):
definitions
%%
rules
%%
user code
The user code gets copied verbatim to the output file. With the extra "%%", your rules are being interpreted as user code.
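As a rough illustration of that layout rule, the following C sketch decides which section a line lands in purely by counting the "%%" lines before it. The function section_of and the sample arrays are hypothetical, and the sketch ignores %{ ... %} blocks, which the real tool does track.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum section { DEFINITIONS = 0, RULES = 1, USER_CODE = 2 };

/* Returns the section containing lines[target], decided only by how
   many "%%" separator lines come before it. Everything after the
   second separator counts as user code. */
enum section section_of(const char *const *lines, size_t target)
{
    assert(lines != NULL);
    int separators = 0;
    for (size_t i = 0; i < target; i++) {
        if (strcmp(lines[i], "%%") == 0 && separators < 2)
            separators++;
    }
    return (enum section)separators;
}
```

With two consecutive "%%" lines, the first pattern line already falls in USER_CODE, which is why the C compiler ends up seeing the regular expressions and complains about stray backslashes.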