Hi, I am trying to run the code from the lex and yacc book by John R. Levine. I have compiled the lex and yacc programs on Linux using the commands
lex example.l
yacc example.y
gcc -o example y.tab.c
./example
The program asks the user to enter verbs, nouns, prepositions, etc. in the format
verb accept,admire,reject
noun jam,pillow,knee
and then runs the yacc grammar to check whether the input is a simple or a compound sentence.
But when I type
jam reject knee
it shows nothing on screen, whereas it is supposed to show the line "Parsed a simple sentence." after parsing. The code is given below.
yacc file
%{
#include <stdio.h>
/* we found the following required for some yacc implementations. */
/* #define YYSTYPE int */
%}
%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION
%%
sentence: simple_sentence { printf("Parsed a simple sentence.\n"); }
| compound_sentence { printf("Parsed a compound sentence.\n"); }
;
simple_sentence: subject verb object
| subject verb object prep_phrase
;
compound_sentence: simple_sentence CONJUNCTION simple_sentence
| compound_sentence CONJUNCTION simple_sentence
;
subject: NOUN
| PRONOUN
| ADJECTIVE subject
;
verb: VERB
| ADVERB VERB
| verb VERB
;
object: NOUN
| ADJECTIVE object
;
prep_phrase: PREPOSITION NOUN
;
%%
extern FILE *yyin;
main()
{
while(!feof(yyin)) {
yyparse();
}
}
yyerror(s)
char *s;
{
fprintf(stderr, "%s\n", s);
}
lex file
%{
/*
* We now build a lexical analyzer to be used by a higher-level parser.
*/
#include "ch1-06y.h" /* token codes from the parser */
#define LOOKUP 0 /* default - not a defined word type. */
int state;
%}
%%
\n { state = LOOKUP; }
\.\n { state = LOOKUP;
return 0; /* end of sentence */
}
^verb { state = VERB; }
^adj { state = ADJECTIVE; }
^adv { state = ADVERB; }
^noun { state = NOUN; }
^prep { state = PREPOSITION; }
^pron { state = PRONOUN; }
^conj { state = CONJUNCTION; }
[a-zA-Z]+ {
if(state != LOOKUP) {
add_word(state, yytext);
} else {
switch(lookup_word(yytext)) {
case VERB:
return(VERB);
case ADJECTIVE:
return(ADJECTIVE);
case ADVERB:
return(ADVERB);
case NOUN:
return(NOUN);
case PREPOSITION:
return(PREPOSITION);
case PRONOUN:
return(PRONOUN);
case CONJUNCTION:
return(CONJUNCTION);
default:
printf("%s: don't recognize\n", yytext);
/* don't return, just ignore it */
}
}
}
. ;
%%
/* define a linked list of words and types */
struct word {
char *word_name;
int word_type;
struct word *next;
};
struct word *word_list; /* first element in word list */
extern void *malloc();
int
add_word(int type, char *word)
{
struct word *wp;
if(lookup_word(word) != LOOKUP) {
printf("!!! warning: word %s already defined \n", word);
return 0;
}
/* word not there, allocate a new entry and link it on the list */
wp = (struct word *) malloc(sizeof(struct word));
wp->next = word_list;
/* have to copy the word itself as well */
wp->word_name = (char *) malloc(strlen(word)+1);
strcpy(wp->word_name, word);
wp->word_type = type;
word_list = wp;
return 1; /* it worked */
}
int
lookup_word(char *word)
{
struct word *wp = word_list;
/* search down the list looking for the word */
for(; wp; wp = wp->next) {
if(strcmp(wp->word_name, word) == 0)
return wp->word_type;
}
return LOOKUP; /* not found */
}
header file
# define NOUN 257
# define PRONOUN 258
# define VERB 259
# define ADVERB 260
# define ADJECTIVE 261
# define PREPOSITION 262
# define CONJUNCTION 263
You have several problems:
The build details you describe do not follow the usual pattern, and in fact they do not work for the code you provide.
Having sorted out how to build your program, it does not work at all, instead segfaulting before reading any input.
Having solved that problem, your expectation of the program's behavior with the given input is incorrect in at least two ways.
With respect to the build:
yacc builds C source for a parser and optionally a header file containing corresponding token definitions. It is usual to exercise the option to get the definitions, and to #include their header in the lexer's source file (#include "y.tab.h"):
yacc -d example.y
lex builds C source for a lexical analyzer. This can be done either before or after yacc, as lex does not depend directly on the token definitions:
lex example.l
The two generated C source files must be compiled and linked together, possibly with other sources as well, and possibly with libraries. In particular, it is often convenient to link in libl (or libfl if your lex is really GNU flex). I linked the latter to get the default yywrap():
gcc -o example lex.yy.c y.tab.c -lfl
With respect to the segfault:
Your generated program is built around this:
extern FILE *yyin;
main()
{
while(!feof(yyin)) {
yyparse();
}
}
In the first place, you should read Why is “while ( !feof (file) )” always wrong?. Having had that under consideration might have spared you from committing a much more fundamental mistake: evaluating yyin before it has been set. Although it's true that yyin will be set to stdin if you don't set it to something else, that cannot happen at program initialization because stdin is not a compile-time constant. Therefore, when control first reaches the loop control expression, yyin's value is still NULL, and a segfault results.
It would be safe and make more sense to test for end of file after yyparse() returns.
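As a rough illustration of "parse first, test for end of file afterwards", here is a minimal sketch in plain C. It does not use yacc at all: parse_one_sentence is a hypothetical stand-in for yyparse() that simply consumes one line, and parse_all is an assumed helper name.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for yyparse(): consumes input up to and
 * including the next newline (or up to end of file). */
int parse_one_sentence(FILE *in)
{
    int c;
    while ((c = fgetc(in)) != EOF && c != '\n')
        ;
    return 0;
}

/* Calls the parser first and only then asks whether the stream has hit
 * end of file or an error. The stream is never inspected before a read
 * has been attempted. Returns the number of parser calls made. */
int parse_all(FILE *in, int (*parse)(FILE *))
{
    int calls = 0;

    assert(in != NULL && parse != NULL);
    do {
        parse(in);
        calls++;
    } while (!feof(in) && !ferror(in));
    return calls;
}
```

Note that feof() only becomes true after a read has actually run into end of file, so in this sketch the parser is still called one final time on empty input. That is harmless here, but it is the reason feof() is a poor loop guard on its own.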
With respect to behavioral expectations:
You complained that the input
verb accept,admire,reject
noun jam,pillow,knee
jam reject knee
does not elicit any output from the program, but that's not exactly true. That input does not elicit output from the program when it is entered interactively, without afterward sending an end-of-file signal (i.e. by typing control-D at the beginning of a line).
Because the parser has not yet detected end-of-file in that case (and pays no attention at all to newlines, since your lexer notifies it about them only when they immediately follow a period), it has no reason to attempt to reduce its token stack to the start symbol. It could be that you will continue with a prepositional phrase to extend the simple sentence, and it cannot be sure that it won't see a CONJUNCTION next, either. It doesn't print anything because it's waiting for more input. If you end the sentence with a period or afterward send a control-D then it will in fact print "Parsed a simple sentence."
Related
The program is intended to store words in a symbol table and then print them out together with their part of speech. The parser should then go further and report, for example, whether the input is a sentence.
I create the executable file by
flex try1.l
bison -dy try1.y
gcc lex.yy.c y.tab.c -o try1.exe
in cmd (WINDOWS)
My issue occurs when I try to declare any value when running the executable,
verb run
it goes like this
BOLD IS INPUT
verb run
run
run
syntax error
noun cat
cat
syntax error
run
run
syntax error
cat run
syntax error
MY THOUGHTS: I'm unsure why I'm getting the "syntax error" message back from the code. After debugging and trying to print out what value was being stored, I figured there has to be some kind of issue with the linked list, as it seemed that only one value was being stored in it. When I printed the stored word_type integer value for run, it showed the correct value 259, but the program refused to let me define any other words in my symbol table. I have since removed the print statements, and the program now behaves as shown above. I still think there is an issue with the add_word function: the word isn't being added properly, so the lookup function is crashing the program.
Lexer file. This example is taken from the O'Reilly 2nd edition of Lex and Yacc,
Examples 1-5 and 1-6.
I am trying to learn Lex and Yacc on my own and to reproduce this example.
%{
/*
* We now build a lexical analyzer to be used by a higher-level parser.
*/
#include <stdlib.h>
#include <string.h>
#include "ytab.h" /* token codes from the parser */
#define LOOKUP 0 /* default - not a defined word type. */
int state;
%}
/*
* Example from page 9 Word recognizer with a symbol table. PART 2 of Lexer
*/
%%
\n { state = LOOKUP; } /* end of line, return to default state */
\.\n { state = LOOKUP;
return 0; /* end of sentence */
}
/* whenever a line starts with a reserved part of speech name */
/* start defining words of that type */
^verb { state = VERB; }
^adj { state = ADJ; }
^adv { state = ADV; }
^noun { state = NOUN; }
^prep { state = PREP; }
^pron { state = PRON; }
^conj { state = CONJ; }
[a-zA-Z]+ {
if(state != LOOKUP) {
add_word(state, yytext);
} else {
switch(lookup_word(yytext)) {
case VERB:
return(VERB);
case ADJECTIVE:
return(ADJECTIVE);
case ADVERB:
return(ADVERB);
case NOUN:
return(NOUN);
case PREPOSITION:
return(PREPOSITION);
case PRONOUN:
return(PRONOUN);
case CONJUNCTION:
return(CONJUNCTION);
default:
printf("%s: don't recognize\n", yytext);
/* don't return, just ignore it */
}
}
}
. ;
%%
int yywrap()
{
return 1;
}
/* define a linked list of words and types */
struct word {
char *word_name;
int word_type;
struct word *next;
};
struct word *word_list; /* first element in word list */
extern void *malloc() ;
int
add_word(int type, char *word)
{
struct word *wp;
if(lookup_word(word) != LOOKUP) {
printf("!!! warning: word %s already defined \n", word);
return 0;
}
/* word not there, allocate a new entry and link it on the list */
wp = (struct word *) malloc(sizeof(struct word));
wp->next = word_list;
/* have to copy the word itself as well */
wp->word_name = (char *) malloc(strlen(word)+1);
strcpy(wp->word_name, word);
wp->word_type = type;
word_list = wp;
return 1; /* it worked */
}
int
lookup_word(char *word)
{
struct word *wp = word_list;
/* search down the list looking for the word */
for(; wp; wp = wp->next) {
if(strcmp(wp->word_name, word) == 0)
return wp->word_type;
}
return LOOKUP; /* not found */
}
Yacc file,
%{
/*
* A lexer for the basic grammar to use for recognizing English sentences.
*/
#include <stdio.h>
%}
%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION
%%
sentence: subject VERB object{ printf("Sentence is valid.\n"); }
;
subject: NOUN
| PRONOUN
;
object: NOUN
;
%%
extern FILE *yyin;
main()
{
do
{
yyparse();
}
while (!feof(yyin));
}
yyerror(s)
char *s;
{
fprintf(stderr, "%s\n", s);
}
Header file, had to create 2 versions for some values not sure why but code was having an issue with them, and I wasn't understanding why so I just created a token with the full name and the shortened as the book had only one for each.
# define NOUN 257
# define PRON 258
# define VERB 259
# define ADVERB 260
# define ADJECTIVE 261
# define PREPOSITION 262
# define CONJUNCTION 263
# define ADV 260
# define ADJ 261
# define PREP 262
# define CONJ 263
# define PRONOUN 258
If you feel that there is a problem with your linked list implementation, you'd be a lot better off testing and debugging it with a simple driver program rather than trying to do that with some tools (flex and bison) which you are still learning. On the whole, the simpler a test is and the fewer dependencies it has, the easier it is to track down problems. See this useful essay by Eric Lippert for some suggestions on debugging.
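A driver of that kind might start from something like the sketch below. It re-implements the symbol table from the question in standard C, with no flex or bison involved. The token values in the enum are illustrative assumptions only, not the values bison would generate.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative values only; a real build takes them from the
 * bison-generated header. */
enum { LOOKUP = 0, NOUN = 258, VERB = 259 };

struct word {
    char *word_name;
    int word_type;
    struct word *next;
};

static struct word *word_list;   /* first element in the word list */

/* Returns the stored type of word, or LOOKUP if it is not in the list. */
int lookup_word(const char *word)
{
    assert(word != NULL);
    for (struct word *wp = word_list; wp; wp = wp->next) {
        if (strcmp(wp->word_name, word) == 0)
            return wp->word_type;
    }
    return LOOKUP;
}

/* Adds word with the given type. Returns 1 on success, 0 if the word is
 * already defined or memory could not be allocated. */
int add_word(int type, const char *word)
{
    struct word *wp;

    if (lookup_word(word) != LOOKUP)
        return 0;
    wp = malloc(sizeof *wp);
    if (wp == NULL)
        return 0;
    wp->word_name = malloc(strlen(word) + 1);
    if (wp->word_name == NULL) {
        free(wp);
        return 0;
    }
    strcpy(wp->word_name, word);
    wp->word_type = type;
    wp->next = word_list;        /* link the new entry at the head */
    word_list = wp;
    return 1;
}
```

A small main() can then add "run" as a VERB and "cat" as a NOUN and check that lookup_word() returns those types again. If that works, the linked list is probably not where the problem lies.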
I don't understand why you felt the need to introduce "short versions" of the token IDs. The example code in Levine's book does not anywhere use these symbols. You cannot just invent symbols and you don't need these abbreviations for anything.
The comment that you "had to create 2 versions [of the header file] for some values" but that the "code was having an issue with them, and I wasn't understanding why" is far too unspecific for an answer. Perhaps the problem was that you thought you could use identifiers which are not defined anywhere, which would certainly cause a compiler error. But if there is some other issue, you could ask a question with an accurate problem description (that is, exactly what problem you encountered) and a Minimal, Complete, and Verifiable example (as indicated in the StackOverflow help pages).
In any case, manually setting the values of the token IDs is almost certainly preventing you from being able to recognize inputs. Bison/yacc reserves the values 256 and 257 for internal tokens, so the first one which will be generated (and therefore used in the parser) has value 258. That means that the token values you are returning from your lexical scanner have a different meaning inside bison. Bottom line: Never manually set token values. If your header isn't being generated correctly, figure out why.
As far as I can see, the only legal input for your program has the form:
sentence: subject VERB object
Since none of your sample inputs ("run", for example) have this form, a syntax error is not surprising. However, the fact that you receive a very early syntax error on the input "cat" does suggest there might be a problem with your symbol table lookup. (That's probably the result of the problem noted above.)
I'm implementing a custom parser generator with an embedded lexer and parser to parse HTTP headers in an event-driven, state-machine fashion. Here are some definitions the eventual parser generator could consume to parse a single header field without CRLF at the end:
token host<prio=1> = "[Hh][Oo][Ss][Tt]" ;
token ospace = "[ \t]*" ;
token htoken = "[-!#$%&'*+.^_`|~0-9A-Za-z]+" ;
token hfield = "[\t\x20-\x7E\x80-\xFF]*" ;
token space = " " ;
token htab = "\t" ;
token colon = ":" ;
obsFoldStart = 1*( space | htab ) ;
hdrField =
obsFoldStart hfield
| host colon ospace hfield<print>
| htoken colon ospace hfield
;
The lexer is based on a maximal munch rule and the tokens are dynamically turned on and off depending on the context, so there is no conflict between htoken and hfield, and the priority value resolves the conflict between host and htoken. I'm planning to implement the parser as an LL(1) table-driven parser. I haven't yet decided whether I'll implement regexp token matching by simulating the nondeterministic finite automaton or go all the way and expand it into a deterministic finite automaton.
Now, I would like to include some C source code in my parser generator input:
hdrField =
obsFoldStart hfield
| host {
parserState->userdata.was_host = 1;
} colon ospace hfield<print>
| htoken {
parserState->userdata.was_host = 0;
} colon ospace hfield
;
What I need, then, is some way to read text tokens that end once as many } characters have been read as { characters.
How can I do this? I'm handling comments using BEGIN(COMMENTS) and BEGIN(INITIAL), but I don't believe such a strategy would work for embedded C source. Also, the comment handling could complicate the embedded C source code handling a lot, because I don't believe a single token can have a comment in the middle of it.
Basically, I need the embedded C language snippet as a C string I can store to my data structures.
So, I took some of the generated lex code and made it stand-alone.
I hope it's OK that I used C++ code, although the question mentions only C. IMHO, this concerns only the less relevant parts of this sample code. (Memory management in C is much more tedious than simply delegating it to std::string.)
scanC.l:
%{
#include <iostream>
#include <string>
#ifdef _WIN32
/// disables #include <unistd.h>
#define YY_NO_UNISTD_H
#endif // _WIN32
// buffer for collected C/C++ code
static std::string cCode;
// counter for braces
static int nBraces = 0;
%}
/* Options */
/* make never interactive (prevent usage of certain C functions) */
%option never-interactive
/* force lexer to process 8 bit ASCIIs (unsigned characters) */
%option 8bit
/* prevent usage of yywrap */
%option noyywrap
EOL ("\n"|"\r"|"\r\n")
SPC ([ \t]|"\\"{EOL})*
LITERAL "\""("\\".|[^\\"])*"\""
%s CODE
%%
<INITIAL>"{" { cCode = '{'; nBraces = 1; BEGIN(CODE); }
<INITIAL>. |
<INITIAL>{EOL} { std::cout << yytext; }
<INITIAL><<EOF>> { return 0; }
<CODE>"{" {
cCode += '{'; ++nBraces;
//updateFilePos(yytext, yyleng);
} break;
<CODE>"}" {
cCode += '}'; //updateFilePos(yytext, yyleng);
if (!--nBraces) {
BEGIN(INITIAL);
//return new Token(filePosCCode, Token::TkCCode, cCode.c_str());
std::cout << '\n'
<< "Embedded C code:\n"
<< cCode << "// End of embedded C code\n";
}
} break;
<CODE>"/*" { // C comments
cCode += "/*"; //_filePosCComment = _filePos;
//updateFilePos(yytext, yyleng);
char c1 = ' ';
do {
char c0 = c1; c1 = yyinput();
switch (c1) {
case '\r': break;
case '\n':
cCode += '\n'; //updateFilePos(&c1, 1);
break;
default:
if (c0 == '\r' && c1 != '\n') {
c0 = '\n'; cCode += '\n'; //updateFilePos(&c0, 1);
} else {
cCode += c1; //updateFilePos(&c1, 1);
}
}
if (c0 == '*' && c1 == '/') break;
} while (c1 != EOF);
if (c1 == EOF) {
//ErrorFile error(_filePosCComment, "'/*' without '*/'!");
//throw ErrorFilePrematureEOF(_filePos);
std::cerr << "ERROR! '/*' without '*/'!\n";
return -1;
}
} break;
<CODE>"//"[^\r\n]* | /* C++ one-line comments */
<CODE>"'"("\\".|[^\\'])+"'" | /*"/* C/C++ character constants */
<CODE>{LITERAL} | /* C/C++ string constants */
<CODE>"#"[^\r\n]* | /* preprocessor commands */
<CODE>[ \t]+ | /* non-empty white space */
<CODE>[^\r\n] { // any other character except EOL
cCode += yytext;
//updateFilePos(yytext, yyleng);
} break;
<CODE>{EOL} { // special handling for EOL
cCode += '\n';
//updateFilePos(yytext, yyleng);
} break;
<CODE><<EOF>> { // premature EOF
//ErrorFile error(_filePosCCode,
// compose("%1 '{' without '}'!", _nBraces));
//_errorManager.add(error);
//throw ErrorFilePrematureEOF(_filePos);
std::cerr << "ERROR! Premature end of input. (Not enough '}'s.)\n";
}
%%
int main(int argc, char **argv)
{
return yylex();
}
A sample text to scan scanC.txt:
Hello juhist.
The text without braces doesn't need to have any syntax.
It just echoes the characters until it finds a block:
{ // the start of C code
// a C++ comment
/* a C comment
* (Remember that nested /*s are not supported.)
*/
#define MAX 1024
static char buffer[MAX] = "", empty="\"\"";
/* It is important that tokens are recognized to a limited amount.
* Otherwise, it would be too easy to fool the scanner with }}}
* where they have no meaning.
*/
char *theSameForStringConstants = "}}}";
char *andCharConstants = '}}}';
int main() { return yylex(); }
}
This code should be just copied
(with a remark that the scanner recognized the C code a such.)
Greetings, Scheff.
Compiled and tested on cygwin64:
$ flex --version
flex 2.6.4
$ flex -o scanC.cc scanC.l
$ g++ --version
g++ (GCC) 7.3.0
$ g++ -std=c++11 -o scanC scanC.cc
$ ./scanC < scanC.txt
Hello juhist.
The text without braces doesn't need to have any syntax.
It just echoes the characters until it finds a block:
Embedded C code:
{ // the start of C code
// a C++ comment
/* a C comment
* (Remember that nested /*s are not supported.)
*/
#define MAX 1024
static char buffer[MAX] = "", empty="\"\"";
/* It is important that tokens are recognized to a limited amount.
* Otherwise, it would be too easy to fool the scanner with }}}
* where they have no meaning.
*/
char *theSameForStringConstants = "}}}";
char *andCharConstants = '}}}';
int main() { return yylex(); }
}// End of embedded C code
This code should be just copied
(with a remark that the scanner recognized the C code a such.)
Greetings, Scheff.
$
Notes:
1. This is taken from a helper tool (not for selling). Hence, this is not bullet-proof but just good enough for productive code.
2. What I saw when adapting it: The line continuation of pre-processor lines is not handled.
3. It's surely possible to fool the tool with a creative combination of macros with unbalanced { }, something we would never do in our productive code (see 1.).
So, it might be at least a start for further development.
To check this against a C lex specification, I have ANSI C grammar, Lex specification at hand, though it's 22 years old. (There are probably newer ones available matching the current standards.)
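For comparison, the same brace-counting idea can be sketched as a hand-written C function that works on a string already in memory. This is only an illustration of the technique, not a replacement for the scanner above: balanced_block_length is a hypothetical name, and the function knows nothing about preprocessor lines or macros.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length of the balanced {...} block that starts at src[0],
 * or 0 if src does not start with '{' or the block is never closed.
 * Braces inside string literals, character constants and comments are
 * not counted. */
size_t balanced_block_length(const char *src)
{
    size_t i = 0;
    int depth = 0;

    assert(src != NULL);
    if (src[0] != '{')
        return 0;
    while (src[i] != '\0') {
        char c = src[i];
        if (c == '"' || c == '\'') {              /* literal or constant */
            char quote = c;
            i++;
            while (src[i] != '\0' && src[i] != quote) {
                if (src[i] == '\\' && src[i + 1] != '\0')
                    i++;                          /* skip escaped char */
                i++;
            }
            if (src[i] == '\0')
                return 0;                         /* unterminated literal */
            i++;
        } else if (c == '/' && src[i + 1] == '*') {   /* C comment */
            i += 2;
            while (src[i] != '\0' && !(src[i] == '*' && src[i + 1] == '/'))
                i++;
            if (src[i] == '\0')
                return 0;                         /* unterminated comment */
            i += 2;
        } else if (c == '/' && src[i + 1] == '/') {   /* C++ comment */
            while (src[i] != '\0' && src[i] != '\n')
                i++;
        } else {
            if (c == '{')
                depth++;
            else if (c == '}' && --depth == 0)
                return i + 1;                     /* matching brace found */
            i++;
        }
    }
    return 0;                                     /* ran out of input */
}
```

The first return value bytes of src can then be copied into whatever string storage the parser generator uses.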
I'm trying to build a Bison grammar and seem to be missing something. I have kept it very basic so far, but I am still getting a syntax error and can't figure out why.
Here is my Bison Code:
%{
#include <stdlib.h>
#include <stdio.h>
int yylex(void);
int yyerror(char *s);
%}
// Define the types flex could return
%union {
long lval;
char *sval;
}
// Define the terminal symbol token types
%token <sval> IDENT;
%token <lval> NUM;
%%
Program:
Def ';'
;
Def:
IDENT '=' Lambda { printf("Successfully parsed file"); }
;
Lambda:
"fun" IDENT "->" "end"
;
%%
main() {
yyparse();
return 0;
}
int yyerror(char *s)
{
extern int yylineno; // defined and maintained in flex.flex
extern char *yytext; // defined and maintained in flex.flex
printf("ERROR: %s at symbol \"%s\" on line %i", s, yytext, yylineno);
exit(2);
}
Here is my Flex Code
%{
#include <stdlib.h>
#include "bison.tab.h"
%}
ID [A-Za-z][A-Za-z0-9]*
NUM [0-9][0-9]*
HEX [$][A-Fa-f0-9]+
COMM [/][/].*$
%%
fun|if|then|else|let|in|not|head|tail|and|end|isnum|islist|isfun {
printf("Scanning a keyword\n");
}
{ID} {
printf("Scanning an IDENT\n");
yylval.sval = strdup( yytext );
return IDENT;
}
{NUM} {
printf("Scanning a NUM\n");
/* Convert into long to loose leading zeros */
char *ptr = NULL;
long num = strtol(yytext, &ptr, 10);
if( errno == ERANGE ) {
printf("Number was to big");
exit(1);
}
yylval.lval = num;
return NUM;
}
{HEX} {
printf("Scanning a NUM\n");
char *ptr = NULL;
/* convert hex into decimal using offset 1 because of the $ */
long num = strtol(&yytext[1], &ptr, 16);
if( errno == ERANGE ) {
printf("Number was to big");
exit(1);
}
yylval.lval = num;
return NUM;
}
";"|"="|"+"|"-"|"*"|"."|"<"|"="|"("|")"|"->" {
printf("Scanning an operator\n");
}
[ \t\n]+ /* eat up whitespace */
{COMM}* /* eat up one-line comments */
. {
printf("Unrecognized character: %s at linenumber %d\n", yytext, yylineno );
exit(1);
}
%%
And here is my Makefile:
all: parser
parser: bison flex
gcc bison.tab.c lex.yy.c -o parser -lfl
bison: bison.y
bison -d bison.y
flex: flex.flex
flex flex.flex
clean:
rm bison.tab.h
rm bison.tab.c
rm lex.yy.c
rm parser
Everything compiles just fine; I do not get any errors running make all.
Here is my testfile
f = fun x -> end;
And here is the output:
./parser < a0.0
Scanning an IDENT
Scanning an operator
Scanning a keyword
Scanning an IDENT
ERROR: syntax error at symbol "x" on line 1
Since x seems to be recognized as an IDENT, the rule should be correct, yet I am still getting a syntax error.
I feel like I am missing something important; hopefully somebody can help me out.
Thanks in advance!
EDIT:
I tried to remove the IDENT in the Lambda rule and the testfile, now it seems to run through the line, but still throws
ERROR: syntax error at symbol "" on line 1
after the EOF.
Your scanner recognizes keywords (and prints out a debugging line, but see below), but it doesn't bother reporting anything to the parser. So they are effectively ignored.
In your bison definition file, you use (for example) "fun" as a terminal, but you do not provide the terminal with a name which could be used in the scanner. The scanner needs this name, because it has to return a token id to the parser.
To summarize, what you need is something like this:
In your grammar, before the %%:
%token T_FUN "fun"
%token T_IF "if"
%token T_THEN "then"
/* Etc. */
In your scanner definition:
fun { return T_FUN; }
if { return T_IF; }
then { return T_THEN; }
/* Etc. */
A couple of other notes:
Your scanner rule for recognizing operators also fails to return anything, so operators will also be ignored. That's clearly not desirable. flex and bison allow an easier solution for single-character operators, which is to let the character be its own token id. That avoids having to create a token name. In the parser, a single-quoted character represents a token-id whose value is the character; that's quite different from a double-quoted string, which is an alias for the declared token name. So you could do this:
"=" { return '='; }
/* Etc. */
but it's easier to do all the single-character tokens at once:
[;+*.<=()-] { return yytext[0]; }
and even easier to use a default rule at the end:
. { return yytext[0]; }
which will have the effect of handling unrecognized characters by returning an unknown token id to the parser, which will cause a syntax error.
This won't work for "->", since that is not a single character token, which will have to be handled in the same way as keywords.
Flex will produce debugging output automatically if you use the -d flag when you create the scanner. That's a lot easier than inserting your own debugging printout, because you can turn it off by simply removing the -d option. (You can use %option debug instead if you don't want to change the flex invocation in your makefile.) It's also better because it provides consistent information, including position information.
Some minor points:
The pattern [0-9][0-9]* could more easily be written [0-9]+
The comment pattern "//".* does not require a $ lookahead at the end, since .* will always match the longest sequence of non-newline characters; consequently, the first unmatched character must either be a newline or the EOF. $ lookahead will not match if the pattern is terminated with an EOF, which will cause odd errors if the file ends with a comment without a newline at the end.
There is no point using {COMM}* since the comment pattern does not match the newline which terminates the comment, so it is impossible for there to be two consecutive comment matches. But anyway, after matching a comment and the following newline, flex will continue to match a following comment, so {COMM} is sufficient. (Personally, I wouldn't use the COMM abbreviation; it really adds nothing to readability, IMHO.)
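To make the token-id conventions above concrete, here is a small hand-written scanner in plain C that follows the same scheme: keywords and "->" get named token ids, identifiers return IDENT, and any other single character is returned as its own token id. It is only a sketch of what the flex rules would do. The enum values and the name next_token are assumptions for illustration; in a real build the ids come from the bison-generated header.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token ids; bison would normally generate these. */
enum { T_FUN = 258, T_END, T_ARROW, IDENT };

/* Returns the id of the next token in *cursor and advances *cursor past
 * it. Returns 0 at the end of the input. */
int next_token(const char **cursor)
{
    const char *p;

    assert(cursor != NULL && *cursor != NULL);
    p = *cursor;
    while (isspace((unsigned char)*p))            /* eat up whitespace */
        p++;
    if (*p == '\0') {
        *cursor = p;
        return 0;
    }
    if (isalpha((unsigned char)*p)) {             /* keyword or identifier */
        const char *start = p;
        size_t len;
        while (isalnum((unsigned char)*p))
            p++;
        len = (size_t)(p - start);
        *cursor = p;
        if (len == 3 && strncmp(start, "fun", 3) == 0)
            return T_FUN;
        if (len == 3 && strncmp(start, "end", 3) == 0)
            return T_END;
        return IDENT;
    }
    if (p[0] == '-' && p[1] == '>') {             /* multi-character token */
        *cursor = p + 2;
        return T_ARROW;
    }
    *cursor = p + 1;
    return (unsigned char)*p;   /* single character is its own token id */
}
```

Run over the test input from the question, f = fun x -> end;, this yields IDENT, '=', T_FUN, IDENT, T_ARROW, T_END, ';' and then 0, which is the token sequence the grammar expects.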
Suppose I want to deal with certain patterns and leave the other text (VHDL code) as it is in the output file.
For that purpose I would need to write a master rule at the end, such as
(MY_PATTERN){
// do something with my pattern
}
(.*){
return TOK_VHDL_CODE;
}
The problem with this strategy is that MY_PATTERN is useless in this case: the text would be matched by .* under the maximal munch rule.
So how can I get this functionality?
The easy way is to get rid of the * in your default rule at the end and just use
. { append_to_buffer(*yytext); }
so your default rule takes all the stuff that isn't matched by the previous rules and stuffs it off in a buffer somewhere to be dealt with by someone else.
In theory, it's possible to find a regular expression which will match a string not containing a pattern, but except in the case of very simple patterns, it is neither easy nor legible.
If all you want to do is search for (and react to) specific patterns, you could use a default rule which matches one character and does nothing:
{Pattern1} { /* Do something with the pattern */ }
{Pattern2} { /* Do something with the pattern */ }
.|\n /* Default rule does nothing */
If, on the other hand, you wanted to do something with the not-otherwise-matched strings (as in your example), you'll need the default rule to accumulate the strings, and the pattern rules to "send" (return) the accumulated token before acting on the token which they matched. That means that some actions will need to send two tokens, which is a bit awkward with the standard "parser calls scanner for a token" architecture, because it requires the scanner to maintain some state.
If you have a not-too-ancient version of bison, you could use a "push parser" instead, which allows the scanner to call the parser. That makes it easy to send two tokens in a single action. Otherwise, you need to build a kind of state machine into your scanner.
Below is a simple example (which needs pattern definitions, among other things) using a push-parser.
%{
#include <stdlib.h>
#include <string.h>
#include "parser.tab.h"
/* Since the lexer calls the parser and we call the lexer,
* we pass through a parser (state) to the lexer. This is
* how you change the `yylex` prototype:
*/
#define YY_DECL static int yylex(yypstate* parser)
%}
pattern1 ...
pattern2 ...
/* Standard "avoid warnings" options */
%option noyywrap noinput nounput nodefault
%%
/* Indented code before the first pattern is inserted at the beginning
* of yylex, perfect for local variables.
*/
size_t vhdl_length = 0;
/* These are macros because they do out-of-sequence return on error. */
/* If you don't like macros, please accept my apologies for the offense. */
#define SEND_(toke, leng) do { \
size_t leng_ = leng; \
char* text = memmove(malloc(leng_ + 1), yytext, leng_); \
text[leng_] = 0; \
int status = yypush_parse(parser, toke, &text); \
if (status != YYPUSH_MORE) return status; \
} while(0);
#define SEND_TOKEN(toke) SEND_(toke, yyleng)
#define SEND_TEXT do if(vhdl_length){ \
SEND_(TEXT, vhdl_length); \
yytext += vhdl_length; yyleng -= vhdl_length; vhdl_length = 0; \
} while(0);
{pattern1} { SEND_TEXT; SEND_TOKEN(TOK_1); }
{pattern2} { SEND_TEXT; SEND_TOKEN(TOK_2); }
/* Default action just registers that we have one more char and
 * calls yymore() to keep accumulating the token.
 */
.|\n { ++vhdl_length; yymore(); }
/* In the push model, we're responsible for sending EOF to the parser */
<<EOF>> { SEND_TEXT; return yypush_parse(parser, 0, 0); }
%%
/* In this model, the lexer drives everything, so we provide the
* top-level interface here.
*/
int parse_vhdl(FILE* in) {
yyin = in;
/* Create a new pure push parser */
yypstate* parser = yypstate_new();
int status = yylex(parser);
yypstate_delete(parser);
return status;
}
To actually get that to work with bison, you need to provide a couple of extra options:
parser.y
%code requires {
/* requires blocks get copied into the tab.h file */
/* Don't do this if you prefer a %union declaration, of course */
#define YYSTYPE char*
}
%code {
#include <stdio.h>
void yyerror(const char* msg) { fprintf(stderr, "%s\n", msg); }
}
%define api.pure full
%define api.push-pull push
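For the other route mentioned above, a small state machine inside a conventional pull-style scanner, here is a minimal hand-written sketch in C rather than flex. It only illustrates the "remember a pending token" idea: the marker "@@" stands in for a real pattern, and the names struct scanner, scan, TOK_TEXT and TOK_MARK are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOK_TEXT = 258, TOK_MARK };   /* hypothetical token ids */

struct scanner {
    const char *pos;   /* next unread character */
    int pending;       /* token owed to the caller on the next call, or 0 */
};

/* Returns the next token id, or 0 at end of input. For TOK_TEXT, *text
 * and *len describe the accumulated run of unmatched characters. */
int scan(struct scanner *s, const char **text, size_t *len)
{
    const char *start, *mark, *end;

    assert(s != NULL && text != NULL && len != NULL);
    *text = s->pos;
    *len = 0;
    if (s->pending) {                 /* second half of a two-token send */
        int tok = s->pending;
        s->pending = 0;
        return tok;
    }
    start = s->pos;
    mark = strstr(start, "@@");       /* next occurrence of the "pattern" */
    end = mark ? mark : start + strlen(start);
    s->pos = mark ? mark + 2 : end;
    if (end > start) {                /* unmatched text comes first */
        *text = start;
        *len = (size_t)(end - start);
        if (mark)
            s->pending = TOK_MARK;    /* deliver the pattern next time */
        return TOK_TEXT;
    }
    return mark ? TOK_MARK : 0;
}
```

The parser keeps calling scan() as usual. The pending field is the extra state the scanner has to carry so that one match can produce two tokens over two calls.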
I'm having a problem using (reentrant) Flex + Lemon for parsing. I'm using a simple grammar and lexer here. When I run it, I'll put in a number followed by an EOF token (Ctrl-D). The printout will read:
89
found int of .
AST=0.
Where the first line is the number I put in. Theoretically, the AST value should be the sum of everything I put in.
EDIT: when I call Parse() manually it runs correctly.
Also, lemon appears to run the atom ::= INT rule even when the token is 0 (the stop token). Why is this? I'm very confused about this behavior and I can't find any good documentation.
Okay, I figured it out. The reason is that there is a particularly nasty (and poorly documented) interaction going on between flex and lemon.
In an attempt to save memory, lemon will hold onto a token without copying, and push it on to an internal token stack. However, flex also tries to save memory by changing the value that yyget_text points to as it lexes the input. The offending line in my example is:
// in the do loop of main.c...
Parse(parser, token, yyget_text(lexer));
This should be:
Parse(parser, token, strdup(yyget_text(lexer)));
which will ensure that the value that lemon points to when it reduces the token stack later is the same as what you originally passed in.
(Note: Don't forget, strdup means you'll have to free that memory at some point later. Lemon will let you write token "destructors" that can do this, or if you're building an AST tree you should wait until the end of the AST lifetime.)
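The effect can be reproduced without flex or lemon. In the sketch below, fake_get_text is a hypothetical stand-in for yyget_text() that keeps overwriting one shared buffer, and copy_text is an assumed helper equivalent to strdup():

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static char scan_buffer[64];   /* stands in for the lexer's own buffer */

/* Hypothetical stand-in for yyget_text(): overwrites the shared buffer
 * with the current lexeme and returns a pointer to that buffer. */
const char *fake_get_text(const char *lexeme)
{
    assert(lexeme != NULL);
    strncpy(scan_buffer, lexeme, sizeof scan_buffer - 1);
    scan_buffer[sizeof scan_buffer - 1] = '\0';
    return scan_buffer;
}

/* Portable equivalent of strdup(); the caller must free() the result. */
char *copy_text(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);

    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}
```

If a pointer returned by fake_get_text("89") is kept while the scanner moves on to the next (empty) lexeme, the kept pointer now reads as an empty string, which matches the "found int of ." symptom. A copy made with copy_text() before the scanner moves on still reads "89".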
You might also try making a token type that contains a pointer to the string and the length of the string. I've had success with this.
token.h
#ifndef Token_h
#define Token_h
typedef struct Token {
int code;
char * string;
int string_length;
} Token;
#endif // Token_h
main.c
int main(int argc, char** argv) {
// Set up the scanner
yyscan_t scanner;
yylex_init(&scanner);
yyset_in(stdin, scanner);
// Set up the parser
void* parser = ParseAlloc(malloc);
// Do it!
Token t;
do {
t.code = yylex(scanner);
t.string = yyget_text(scanner);
t.string_length = yyget_leng(scanner);
Parse(parser, t.code, t);
} while (t.code > 0);
if (-1 == t.code) {
fprintf(stderr, "The scanner encountered an error.\n");
}
// Cleanup the scanner and parser
yylex_destroy(scanner);
ParseFree(parser, free);
return 0;
}
language.y (excerpt)
class_interface ::= INTERFACE IDENTIFIER(A) class_inheritance END.
{
printf("defined class %.*s\n", A.string_length, A.string);
}
See my printf statement there? I'm using the string and the length to print out my token.
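As a small stand-alone illustration of that printf idiom, the sketch below formats only the first string_length characters of a token whose string points into a longer buffer. The Token layout mirrors token.h above (with a const pointer), and token_to_string is a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

typedef struct Token {
    int code;
    const char *string;    /* points into a larger buffer, not NUL-terminated */
    int string_length;     /* number of characters that belong to the token */
} Token;

/* Writes the token's text into out as a NUL-terminated string and
 * returns the value snprintf() returned. */
int token_to_string(char *out, size_t cap, Token t)
{
    assert(out != NULL && cap > 0);
    /* "%.*s" takes the maximum number of characters from an int argument */
    return snprintf(out, cap, "%.*s", t.string_length, t.string);
}
```

With string pointing at "Foo extends Bar" and string_length set to 3, the buffer receives just "Foo". The pointer is still only valid for as long as the scanner's buffer keeps that text.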