How to get entire input string in Lex and Yacc? - c

OK, so here is the deal.
In my language I have some commands, say
XYZ 3 5
GGB 8 9
HDH 8783 33
And in my Lex file
XYZ { return XYZ; }
GGB { return GGB; }
HDH { return HDH; }
[0-9]+ { yylval.ival = atoi(yytext); return NUMBER; }
\n { return EOL; }
In my yacc file
start : commands
;
commands : command
| command EOL commands
;
command : xyz
| ggb
| hdh
;
xyz : XYZ NUMBER NUMBER { /* Do something with the numbers */ }
;
etc. etc. etc. etc.
My question is, how can I get the entire text
XYZ 3 5
GGB 8 9
HDH 8783 33
Into commands while still returning the NUMBERs?
Also when my Lex returns a STRING [0-9a-zA-Z]+, and I want to do verification on it's length, should I do it like
rule: STRING STRING { if (strlen($1) < 5 ) /* Do some shit else error */ }
or actually have a token in my Lex that returns different tokens depending on length?

If I've understood your first question correctly, you can have semantic actions like
{ $$ = makeXYZ($2, $3); }
which will allow you to build the value of command as you want.
For your second question, the borders between lexical analysis and grammatical analysis and between grammatical analysis and semantic analysis aren't hard and well fixed. Moving them is a trade-off between factors like easiness of description, clarity of error messages and robustness in presence of errors. Considering the verification of string length, the likelihood of an error occurring is quite high and the error message if it is handled by returning different terminals for different length will probably be not clear. So if it is possible -- that depend on the grammar -- I'd handle it in the semantic analysis phase, where the message can easily be tailored.

If you arrange for your lexical analyzer (yylex()) to store the whole string in some variable, then your code can access it. The communication with the parser proper will be through the normal mechanisms, but there's nothing that says you can't also have another variable lurking around (probably a file static variable - but beware multithreading) that stores the whole input line before it is dissected.

As you use yylval.ival you already have union with ival field in your YACC source, like this:
%union {
int ival;
}
Now you specify token type, like this:
%token <ival> NUMBER
So now you can access ival field simply for NUMBER token as $1 in your rules, like
xyz : XYZ NUMBER NUMBER { printf("XYZ %d %d", $2, $3); }
For your second question I'd define union like this:
%union {
char* strval;
int ival;
}
and in you LEX source specify token types
%token <strval> STRING;
%token <ival> NUMBER;
So now you can do things like
foo : STRING NUMBER { printf("%s (len %d) %d", $1, strlen($1), $2); }

Related

what does $1 means in yacc, and how can I get its value

I want to complete a parsing about varlist declaration,
like varlist:id comma varlist|id.
At this time, I need to set up a list about var.
So I write this code:
varlist: id comma varlist{ createtnode($1.idcontext);}
|id{createtnode($1.idcontext);};
But I find $1.idcontext is not that idcontext I want which should be this id token's idcontext.
Now, $1.idcontext is this sentence 'varlist'. Without the code action, this grammar works correctly.
typedef struct{
int* TC;
int* FC;
}boolcode;
typedef struct {
char* idcontext;
int constvalue;
int chain;
boolcode ftentry;
}includes;
/* Value type. */
#if ! defined YYSTYPE && ! defined YYSTYPE_IS_DECLARED
typedef struct{
int classify;
includes unique;
} YYSTYPE;
# define YYSTYPE_IS_TRIVIAL 1
# define YYSTYPE_IS_DECLARED 1
#endif
VARList: IDENcode COMMA VARList{
if(YYDEBUG)
{printf("6\n");
printf("%s\n",$1.idcontext);
}
varlistnode* newvar=malloc(sizeof(varlistnode));
newvar->varname=$1.idcontext;
newvar->value=0;
newvar->next=NULL;
mynotes->next=newvar;
mynotes=mynotes->next;
}|IDENcode{
if(YYDEBUG)
{printf("7\n");printf("%s\n",$1.idcontext);}
varlistnode* newvar=malloc(sizeof(varlistnode));
newvar->varname=$1.idcontext;
newvar->value=0;
newvar->next=NULL;
mynotes->next=newvar;
mynotes=mynotes->next;
};
The words wait recognize:
a,b,c,d
The result of printf() function:
7
d:
6
c,d:
6
b,c,d:
6
a,b,c,d:enter code here
The real problem in this program is not visible in this question because the bug is in your lexical scanner.
You haven't included the flex file in the question, but it's reasonable to guess that it contains something like this:
[[:alpha:]_][[:alnum:]_]* { yylval.unique.idcontext = yytext; /* INCORRECT */
return IDENcode;
}
It should read
[[:alpha:]_][[:alnum:]_]* { yylval.unique.idcontext = strdup(yytext);
return IDENcode;
}
yytext points into the scanner's internal buffer, whose contents are modified every time the scanner is called. What you're seeing is a mild version of this problem, because your input is very short; if the input were long enough that yylex needed to refill the buffer from the input file, then you would see complete garbage in your idcontext fields. If you want to use the string later, you need to make a copy of it (and then you need to remember to free() the copy when you no longer need it, which can be a bit of a challenge.)
The other possible issue -- and honestly I don't know whether you consider it a problem or not, because you did not specify what output you expect from your debugging trace -- is that your right-recursive rule:
varlist: id comma varlist { createtnode($1.idcontext); }
| id { createtnode($1.idcontext); }
ends up calling createtnode on the ids in reverse order, because a bison reduction action is executed when the rule is matched. Using right-recursion like that means that the first varlist action to execute is actually the one corresponding to the last id.
If you want the actions to execute left to right, you need to use left-recursion:
varlist: varlist comma id { createtnode($3.idcontext); } /* See below */
| id { createtnode($1.idcontext); }
Left-recursion has other advantages. For example, it does not require all the ids (and commas) to pile up on the parser's internal stack waiting for the final reduction action.
Again, you don't show enough of your code to see how you use the result of these actions. It looks to me like you're trying to create a global linked list of variables, whose header you store in a global variable. (mynotes evidently points to the tail of the list, so it can't be used to recover the head.) If that's the case, then the change above should work fine. But it would be more normal to make the semantic value of varlist a list header, avoiding the use of global variables. That would result in code which looked more like this:
varlist: id comma varlist { $$ = append($1, createtnode($3.idcontext)); }
| id { $$ = append(newlist(), createtnode($1.idcontext); }

Flex and Bison, Windows Error using symbol table

Program is intended to store values in a symbol table and then have them be able to be printed out stating the part of speech. Further to be parsed and state more in the parser, whether it is a sentence and more.
I create the executable file by
flex try1.l
bison -dy try1.y
gcc lex.yy.c y.tab.c -o try1.exe
in cmd (WINDOWS)
My issue occurs when I try to declare any value when running the executable,
verb run
it goes like this
BOLD IS INPUT
verb run
run
run
syntax error
noun cat
cat
syntax error
run
run
syntax error
cat run
syntax error
MY THOUGHTS: I'm unsure why I'm getting this error back from the code Syntax error. Although after debugging and trying to print out what value was being stored, I figured there has to be some kind of issue with the linked list. As it seemed only one value was being stored in the linked list and causing an error of sorts. As I tried to print out the stored word_type integer value for run and it would print out the correct value 259, but would refuse to let me define any other words to my symbol table. I reversed the changes of the print statements and now it works as previously stated. I think again there is an issue with the addword method as it isn't properly being added so the lookup method is crashing the program.
Lexer file, this example is taken from O'Reily 2nd edition on Lex And Yacc,
Example 1-5,1-6.
Am trying to learn Lex and Yacc on my own and reproduce this example.
%{
/*
* We now build a lexical analyzer to be used by a higher-level parser.
*/
#include <stdlib.h>
#include <string.h>
#include "ytab.h" /* token codes from the parser */
#define LOOKUP 0 /* default - not a defined word type. */
int state;
%}
/*
* Example from page 9 Word recognizer with a symbol table. PART 2 of Lexer
*/
%%
\n { state = LOOKUP; } /* end of line, return to default state */
\.\n { state = LOOKUP;
return 0; /* end of sentence */
}
/* whenever a line starts with a reserved part of speech name */
/* start defining words of that type */
^verb { state = VERB; }
^adj { state = ADJ; }
^adv { state = ADV; }
^noun { state = NOUN; }
^prep { state = PREP; }
^pron { state = PRON; }
^conj { state = CONJ; }
[a-zA-Z]+ {
if(state != LOOKUP) {
add_word(state, yytext);
} else {
switch(lookup_word(yytext)) {
case VERB:
return(VERB);
case ADJECTIVE:
return(ADJECTIVE);
case ADVERB:
return(ADVERB);
case NOUN:
return(NOUN);
case PREPOSITION:
return(PREPOSITION);
case PRONOUN:
return(PRONOUN);
case CONJUNCTION:
return(CONJUNCTION);
default:
printf("%s: don't recognize\n", yytext);
/* don't return, just ignore it */
}
}
}
. ;
%%
int yywrap()
{
return 1;
}
/* define a linked list of words and types */
struct word {
char *word_name;
int word_type;
struct word *next;
};
struct word *word_list; /* first element in word list */
extern void *malloc() ;
int
add_word(int type, char *word)
{
struct word *wp;
if(lookup_word(word) != LOOKUP) {
printf("!!! warning: word %s already defined \n", word);
return 0;
}
/* word not there, allocate a new entry and link it on the list */
wp = (struct word *) malloc(sizeof(struct word));
wp->next = word_list;
/* have to copy the word itself as well */
wp->word_name = (char *) malloc(strlen(word)+1);
strcpy(wp->word_name, word);
wp->word_type = type;
word_list = wp;
return 1; /* it worked */
}
int
lookup_word(char *word)
{
struct word *wp = word_list;
/* search down the list looking for the word */
for(; wp; wp = wp->next) {
if(strcmp(wp->word_name, word) == 0)
return wp->word_type;
}
return LOOKUP; /* not found */
}
Yacc file,
%{
/*
* A lexer for the basic grammar to use for recognizing English sentences.
*/
#include <stdio.h>
%}
%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION
%%
sentence: subject VERB object{ printf("Sentence is valid.\n"); }
;
subject: NOUN
| PRONOUN
;
object: NOUN
;
%%
extern FILE *yyin;
main()
{
do
{
yyparse();
}
while (!feof(yyin));
}
yyerror(s)
char *s;
{
fprintf(stderr, "%s\n", s);
}
Header file, had to create 2 versions for some values not sure why but code was having an issue with them, and I wasn't understanding why so I just created a token with the full name and the shortened as the book had only one for each.
# define NOUN 257
# define PRON 258
# define VERB 259
# define ADVERB 260
# define ADJECTIVE 261
# define PREPOSITION 262
# define CONJUNCTION 263
# define ADV 260
# define ADJ 261
# define PREP 262
# define CONJ 263
# define PRONOUN 258
If you feel that there is a problem with your linked list implementation, you'd be a lot better off testing and debugging it with a simple driver program rather than trying to do that with some tools (flex and bison) which you are still learning. On the whole, the simpler a test is and the fewest dependencies which it has, the easier it is to track down problems. See this useful essay by Eric Clippert for some suggestions on debugging.
I don't understand why you felt the need to introduce "short versions" of the token IDs. The example code in Levine's book does not anywhere use these symbols. You cannot just invent symbols and you don't need these abbreviations for anything.
The comment that you "had to create 2 versions [of the header file] for some values" but that the "code was having an issue with them, and I wasn't understanding why" is far too unspecific for an answer. Perhaps the problem was that you thought you could use identifiers which are not defined anywhere, which would certainly cause a compiler error. But if there is some other issue, you could ask a question with an accurate problem description (that is, exactly what problem you encountered) and a Minimal, Complete, and Verifiable example (as indicated in the StackOverflow help pages).
In any case, manually setting the values of the token IDs is almost certainly preventing you from being able to recognized inputs. Bison/yacc reserves the values 256 and 257 for internal tokens, so the first one which will be generated (and therefore used in the parser) has value 258. That means that the token values you are returning from your lexical scanner have a different meaning inside bison. Bottom line: Never manually set token values. If your header isn't being generated correctly, figure out why.
As far as I can see, the only legal input for your program has the form:
sentence: subject VERB object
Since none of your sample inputs ("run", for example) have this form, a syntax error is not surprising. However, the fact that you receive a very early syntax error on the input "cat" does suggest there might be a problem with your symbol table lookup. (That's probably the result of the problem noted above.)

Yacc actions return 0 for any variable ($)

I'm new to Lex/Yacc. Found these lex & yacc files which parse ansi C.
To experiment I added an action to print part of the parsing:
constant
: I_CONSTANT { printf("I_CONSTANT %d\n", $1); }
| F_CONSTANT
| ENUMERATION_CONSTANT /* after it has been defined as such */
;
The problem is, no matter where I put the action, and whatever $X I use, I always get value 0.
Here I got printed:
I_CONSTANT 0
Even though my input is:
int foo(int x)
{
return 5;
}
Any idea?
Nothing in the lex file you point to actually sets semantic values for any token. As the author says, the files are just a grammar and "the bulk of the work" still needs to be done. (There are other caveats having to do with the need for a preprocessor.)
Since nothing in the lex file ever sets yylval, it will always be 0, and that is what yacc/bison will find when it sets up the semantic value for the token ($1 in this case).
Turns out yylval = atoi(yytext) is not done in the lex file, so I had to add it myself. Also learned I can add extern char *yytext to the yacc file header, and then use yytext directly.

Lexical & Grammar Analysis using Yacc and Lex

I am fairly new to Yacc and Lex programming but I am training myself with a analyser for C programs.
However, I am facing a small issue that I didn't manage to solve.
When there is a declaration for example like int a,b; I want to save a and b in an simple array. I did manage to do that but it saving a bit more that wanted.
It is actually saving "a," or "b;" instead of "a" and "b".
It should have worked as $1should only return tID which is a regular expression recognising only a string chain. I don't understand why it take the coma even though it defined as a token. Does anyone know how to solve this problem ?
Here is the corresponding yacc declarations :
Declaration :
tINT Decl1 DeclN
{printf("Declaration %s\n", $2);}
| tCONST Decl1 DeclN
{printf("Declaration %s\n", $2);}
;
Decl1 :
tID
{$$ = $1;
tabvar[compteur].id=$1; tabvar[compteur].adresse=compteur;
printf("Added %s at adress %d\n", $1, compteur);
compteur++;}
| tID tEQ E
{$$ = $1;
tabvar[compteur].id=$1; tabvar[compteur].adresse=compteur;
printf("Added %s at adress %d\n", $1, compteur);
pile[compteur]=$3;
compteur++;}
;
DeclN :
/*epsilon*/
| tVIR Decl1 DeclN
And the extract of the Lex file :
separateur [ \t\n\r]
id [A-Za-z][A-Za-z0-9_]*
nb [0-9]+
nbdec [0-9]+\.[0-9]+
nbexp [0-9]+e[0-9]+
"," { return tVIR; }
";" { return tPV; }
"=" { return tEQ; }
{separateur} ;
{id} { yylval.str = yytext; return tID; }
{nb}|{nbdec}|{nbexp} { yylval.nb = atoi(yytext); return tNB; }
%%
int yywrap() {return 1;}
The problem is that yytext is a reference into lex's token scanning buffer, so it is only valid until the next time the parser calls yylex. You need to make a copy of the string in yytext if you want to return it. Something like:
{id} { yylval.str = strdup(yytext); return tID; }
will do the trick, though it also exposes you to the possibility of memory leaks.
Also, in general when writing lex/yacc parsers involving single character tokens, it is much clearer to use them directly as charcter constants (eg ',', ';', and '=') rather than defining named tokens (tVIR, tPV, and tEQ in your code).

Can't retrieve semantic values from Bison grammar file

I am trying to develop a language parser on CentOS 6.0 by means of Bison 3.0 (C parser generator), Flex 2.5.35 and gcc 4.4.7. I have the following Bison grammar file:
%{
#include <stdio.h>
%}
%union {
int int_t;
char* str_t;
}
%token SEP
%token <str_t> ID
%start start
%type <int_t> plst
%%
start: plst start
| EOS { YYACCEPT; }
;
// <id> , <id> , ... , <id>
plst: ID SEP_PARAMS plst { printf("Rule 1 %s %s \n",$1,$2); }
| ID { printf("Rule 2 %s \n", $1); }
| /* empty */ { }
;
%%
int yyerror(GNode* root, const char* s) {printf("Error: %s", s);}
The problem
As it is now, it is not really a meaningful one, but it is enough to understand my problem I think. Consider that I have a scanner written in Flex which recognizes my tokens. This grammar file is used to recognize simple identifier lists like: id1,id2,...,idn. My problem is that in each grammar rule, when I try to get the value of the identifier (the string representing the same of the identifier), I get a NULL pointer as also proved by my printfs.
What am I doing wrong? Thankyou
Edit
Thanks to recent answers, I could understand that the problems strongly relates to Flex and its configuration file. In particular I have edited my lex file in order to meet the specifications described by the Flex Manual for Bison Bridging:
{ID} { printf("[id-token]");
yylval->str_t = strdup(yytext);
return ID; }
However after running Bison, then Flex (providing the --bison-bridge option) and then the compiler, I execute the generated parser and I instantly get Segmentation Fault.
What's the problem?
The flex option --bison-bridge (or %option bison-bridge) matches up to the bison option %define api.pure. You need to use either BOTH bison-bridge and api.pure or NEITHER -- either way can work, but they need to be consistent. Since it appears you are NOT using api.pure, you want to delete the --bison-bridge option.
The values for $1, $2 etc. have to be set by the lexer.
If you have a rule in the lexer for identifiers, like
ID [a-z][a-z0-9]*
%%
{ID} { return ID; }
the semantic values are not set.
You have to do e.g.
{ID} { /* Set the unions value, used by e.g. `$1` in the parser */
yylval.str_t = strdup(yytext);
return ID;
}
Remember to free the value in the parser, as strdup allocates memory.

Resources