Why is my bison/flex not working as intended? - c

I have this homework assignment where I have to transform some input into a particular output. The problem I'm having is that I can only convert the first line into the output I need, the other lines return a "syntax error" error.
Additionally, if I change the lines order, no lines are converted so only one particular line is working.
This is my input file:
Input.txt
B0102 Bobi 2017/01/16 V8 1, massage 12.50
J1841 Jeco 20.2 2017/01/17 V8 2, Tosse 2, tosquia 22.50
B2232 Bobi 2017/01/17 Tosse 1, Leptospirose 1, bath 30.00, massage 12.50
B1841 Jeco 21.4 2017/01/18 Leptospirose 1, Giardiase 2
And this is the output I should obtain:
Output
Bobi (B0102) paid 2 services/vaccines 22.50
Jeco (J1841) paid 3 services/vaccines 62.50
Bobi (B2232) paid 4 services/vaccines 62.50
Jeco (B1841) paid 2 services/vaccines 30.00
If I change the line order in the input file, not even the first line is converted.
However, if the order is as I showed above, this is my output:
Bobi (B0102) paid 2 services/vaccines 22.50
syntax error
This is my code:
file.y
%{
#include "file.h"
#include <stdio.h>
int yylex();
int counter = 0;
int vaccineCost = 10;
%}
%union{
char* code;
char* name;
float value;
int quantity;
};
%token COMMA WEIGHT DATE SERVICE VACCINE
%token CODE
%token NAME
%token VALUE
%token QUANTITY
%type <name> NAME
%type <code> CODE
%type <value> VALUE
%type <quantity> QUANTITY
%type <value> services
%start begining
%%
begining: /*empty*/
| animal
;
animal: CODE NAME WEIGHT DATE services {printf("%s (%s) paid %d services/vaccines %.2f\n", $2, $1, counter, $5); counter = 0;}
| CODE NAME DATE services {printf("%s (%s) paid %d services/vaccines %.2f\n", $2, $1, counter, $4); counter = 0;}
;
services: services COMMA SERVICE VALUE {$$ = $1 + $4; counter++;}
| services COMMA VACCINE QUANTITY{$$ = $1 + $4*vaccineCost;counter++;}
| SERVICE VALUE{$$ = $2;counter++;}
| VACCINE VALUE
{$$ = $2*vaccineCost;counter++;}
;
%%
int main(){
yyparse();
return 0;
}
void yyerror (char const *s) {
fprintf (stderr, "%s\n", s);
}
file.flex
%option noyywrap
%{
#include "file.h"
#include "file.tab.h"
#include <stdio.h>
#include <string.h>
%}
/*Patterns*/
YEAR 20[0-9]{2}
MONTH 0[1-9]|1[0-2]
DAY 0[1-9]|[1-2][0-9]|3[0-1]
%%
, {return COMMA;}
[A-Z][0-9]{4} {yylval.code = strdup(yytext); return CODE;}
[A-Z][a-z]* {yylval.name = strdup(yytext); return NAME;}
[0-9]+[.][0-9] {return WEIGHT;}
{YEAR}"/"{MONTH}"/"{DAY} {return DATE;}
(banho|massagem|tosquia) {return SERVICE;}
[0-9]+\.[0-9]{2} {yylval.value = atof(yytext);return VALUE;}
(V8|V10|Anti-Rabatica|Giardiase|Tosse|Leptospirose) {return VACCINE;}
[1-9] {yylval.quantity = atoi(yytext);return QUANTITY;}
\n
.
<<EOF>> return 0;
%%
And these are the commands I execute:
bison -d file.y
flex -o file.c file.flex
gcc file.tab.c file.c -o exec -lfl
./exec < Input.txt
Can anyone point me in the right direction or tell me what is wrong with my code?
Thanks, and if my explanation wasn't good enough I'll try my best to explain it better!

There are at least two different problems which cause those symptoms.
Your top-level grammar only accepts at most a single animal:
begining: /*empty*/
| animal
So an input containing more than one line won't be allowed. You need a top-level rule which accepts any number of animals. (By the way, modern bison versions let you write %empty as the right-hand side of an empty production, instead of having to (mis)use a comment.)
The order of your scanner rules means that most of the words you want to recognise as VACCINE will instead be recognised as NAME. Recall that when two patterns match the same text, the first one in the file will win. So with these rules:
[A-Z][a-z]* {yylval.name = strdup(yytext); return NAME;}
(V8|V10|Anti-Rabatica|Giardiase|Tosse|Leptospirose) {return VACCINE;}
Tokens like Tosse, which could match either rule, will be assumed to match the first rule. Only V8, V10 and Anti-Rabatica, for which the second rule has the longer match, will be recognised as VACCINE. So your first input line doesn't trigger this problem, but all the other ones do.
You probably should handle newline characters syntactically, unless you allow treatment records to be split over multiple lines. And be aware that many (f)lex versions do not allow empty actions, as in your last two flex rules. This may cause lexical errors.
And finally
<<EOF>> return 0;
is unnecessary. That's how the scanner handles end-of-file by default. <<EOF>> rules are often wrong or redundant, and should only be used when clearly needed (and with great care).
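The rule-ordering point can be sketched in plain C. This is not how flex is implemented; `classify`, `match_name` and `match_vaccine` are hypothetical helpers standing in for the two scanner rules, and they only apply the same selection policy: the longest match wins, and on equal length the rule listed first wins.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for two scanner rules; each returns the length
   of the longest prefix of s that it matches (0 = no match). */

/* Rule: [A-Z][a-z]*  (NAME) */
static size_t match_name(const char *s) {
    size_t n;
    if (!isupper((unsigned char)s[0])) return 0;
    n = 1;
    while (islower((unsigned char)s[n])) n++;
    return n;
}

/* Rule: (V8|V10|Anti-Rabatica|Giardiase|Tosse|Leptospirose)  (VACCINE) */
static size_t match_vaccine(const char *s) {
    static const char *const words[] = {
        "V8", "V10", "Anti-Rabatica", "Giardiase", "Tosse", "Leptospirose"
    };
    size_t best = 0;
    for (size_t i = 0; i < sizeof words / sizeof words[0]; i++) {
        size_t len = strlen(words[i]);
        if (strncmp(s, words[i], len) == 0 && len > best) best = len;
    }
    return best;
}

/* Longest match wins; on a tie the rule listed first wins.
   vaccine_first selects which of the two rules comes first in the file. */
const char *classify(const char *s, int vaccine_first) {
    size_t name = match_name(s), vac = match_vaccine(s);
    if (name == 0 && vac == 0) return "NONE";
    if (vaccine_first) return vac >= name ? "VACCINE" : "NAME";
    return name >= vac ? "NAME" : "VACCINE";
}
```

With the NAME rule first, `classify("Tosse", 0)` yields "NAME"; putting the vaccine rule first (`classify("Tosse", 1)`) yields "VACCINE", while an ordinary name such as "Bobi" is still a NAME in either order.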

Related

structure of yacc definitions

I am in the process of writing a parser for a markup language for a personal project:
sample:
/* This is a comment */
production_title = "My Production"
director = "Joe Smith"
DOP = "John Blogs"
DIT = "Random Name"
format = "16:9"
camera = "Arri Alexa"
codec = "ProRes"
date = _auto
Reel: A001
Scene: 23/22a
Slate: 001
1-2, 50MM, T1.8, {ND.3}
3AFS, 50MM, T1.8, {ND.3}
Slate: 002:
1, 65MM, T1.8, {ND.3 BPM1/2}
Slate: 003:
1-3, 24MM, T1.9 {ND.3}
Reel: A002
Scene: 23/22a
Slate: 004
1-5, 32MM, T1.9, {ND.3}
Scene: 23/21
Slate: 005
1, 100MM, T1.9, {ND.6}
END
I have started learning lex and yacc, and have run into a couple of issues regarding the structure of the grammar definitions.
yacc.y
%{
#include <stdio.h>
int yylex();
void yyerror(char *s);
%}
%token PROD_TITL _DIR DOP DIT FORMAT CAMERA CODEC DATE EQUALS
%right META
%%
meta: PROD_TITL EQUALS META {
printf("%s is set to %s\n",$1, $3);
}
| _DIR EQUALS META {
printf("%s is set to %s\n",$1, $3);
}
%%
int main(void) {
return yyparse();
}
void yyerror(char *s) {fprintf(stderr, "%s\n", s);}
lex.l
%{
#include <stdio.h>
#include <string.h>
#include "y.tab.h"
%}
%%
"production_title" {yylval = strdup(yytext); return PROD_TITL;}
"director" {yylval = strdup(yytext); return _DIR;}
"DOP" return DOP;
"DIT" return DIT;
"format" return FORMAT;
"camera" return CAMERA;
"codec" return CODEC;
"date" return DATE;
"exit" exit(EXIT_SUCCESS);
\"[^"\n]*["\n] { yylval = strdup(yytext);
return META;
}
= return EQUALS;
[ \t\n] ;
"/*"([^*]|\*+[^*/])*\*+"/" ;
. printf("unrecognized input\n");
%%
int yywrap(void) {
return 1;
}
The main issue that I am having is that the program only runs correctly on the first parse; after that it returns a syntax error, which is incorrect. Is this something to do with the way that I have written the grammar?
example output from sample.txt and typed in commands:
hc#linuxtower:~/Documents/CODE/parse> ./a.out < sample.txt
production_title is set to "My Production"
syntax error
hc#linuxtower:~/Documents/CODE/parse> ./a.out
production_title = "My Production"
production_title is set to "My Production"
director = "Joe Smith"
syntax error
When compiling I get warnings in the lex.l file with regards to my regex's:
ca_mu.l: In function ‘yylex’:
ca_mu.l:9:9: warning: assignment makes integer from pointer without a cast [-Wint-conversion]
"production_title" {yylval = strdup(yytext); return PROD_TITL;}
^
ca_mu.l:10:9: warning: assignment makes integer from pointer without a cast [-Wint-conversion]
"director" {yylval = strdup(yytext); return _DIR;}
^
ca_mu.l:20:10: warning: assignment makes integer from pointer without a cast [-Wint-conversion]
\"[^"\n]*["\n] { yylval = strdup(yytext);
^
Could this be the source of the problem or an additional issue?
Those are two separate issues.
Your grammar is as follows, leaving out the actions:
meta: PROD_TITL EQUALS META
| _DIR EQUALS META
That means that your grammar accepts one of two sequences, both having exactly three tokens. That is, it accepts "PROD_TITL EQUALS META" or "_DIR EQUALS META". That's it. Once it finds one of those things, it has parsed as much as it knows how to parse, and it expects to be told that the input is complete. Any other input is an error.
The compiler is complaining about yylval = strdup(yytext); because it has been told that yylval is of type int. That's yacc/bison's default semantic type; if you don't do anything to change it, that's what bison will assume, and it will insert extern int yylval; in the header file it generates, so that the lexer knows what the semantic type is. If you search the internet you'll probably find a variety of macro hacks suggested to change this, but the correct way to do it with a "modern" bison is to insert the following declaration in your bison file, in the declarations section before the first %%:
%define api.value.type { char* }
Later on, you'll probably find that you want a union type instead of making everything a string. Before you reach that point, you should read the section in the Bison manual on Defining Semantic Values. (In fact, you'd be well-advised to read the Bison manual from the beginning up to that point, including the simple examples in section 2. It's not that long, and it's pretty easy reading.)
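The first point (a grammar that accepts exactly one three-token statement) can be illustrated with a small hand-written recognizer in C. This is only a sketch, not bison output: the token ids, `DIR_KW` (standing in for `_DIR`), and the functions `accepts_single` and `accepts_sequence` are hypothetical names. `accepts_sequence` corresponds to what a repeating top-level rule would accept.

```c
#include <stddef.h>

/* Hypothetical token ids; END marks end of input (yylex signals it with 0). */
enum { END = 0, PROD_TITL = 258, DIR_KW, EQUALS, META };

/* Match one `key EQUALS META` statement at toks[*pos]; advance on success.
   The token array must be terminated by END. */
static int statement(const int *toks, size_t *pos) {
    int key = toks[*pos];
    if ((key != PROD_TITL && key != DIR_KW) ||
        toks[*pos + 1] != EQUALS || toks[*pos + 2] != META)
        return 0;
    *pos += 3;
    return 1;
}

/* The grammar as posted: exactly one statement, then end of input. */
int accepts_single(const int *toks) {
    size_t pos = 0;
    return statement(toks, &pos) && toks[pos] == END;
}

/* A repeating grammar: zero or more statements, then end of input. */
int accepts_sequence(const int *toks) {
    size_t pos = 0;
    while (statement(toks, &pos))
        ;
    return toks[pos] == END;
}
```

Given the tokens for two assignments in a row, `accepts_single` fails at the fourth token (the "syntax error" seen after the first line), while `accepts_sequence` consumes both statements.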

Flex and Bison code - syntax error always

First of all I need to say that I am very new to Flex and Bison and I am a bit confused. There is a school project that wants us to create a compiler using Flex and Bison for some kind of CLIPS language.
My code has a lot of problems, but the main one is that whatever I type, I get a syntax error when the result should be something else. The ideal scenario would be for it to work fully for the CLIPS language. E.g. when I write "4" I get a syntax error. Reading my code will maybe help you understand this better. If I write "test 3 4" it doesn't show a syntax error, but it counts it as an unknown token, and that's wrong again. I'm completely lost. The code is a prototype provided by the school and we need to make some changes. If you have any questions, don't hesitate to ask. Thank you!
P.S.: Don't mind the comments; they were originally in Greek.
FLEX CODE:
%option noyywrap
/* C code defining the required header files and variables.
Anything between %{ and %} is copied verbatim into the C file that
Flex will generate. */
%{
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
/* Header file containing the list of all tokens */
#include "token.h"
/* Counter for the current line */
int line = 1;
%}
/* Names and their definitions (as regular expressions).
After this point the names (on the left) can be used in place of the
usually long and hard-to-read regular expressions */
/* Regular expressions written according to the language's definitions */
DELIMITER [ \t]+
INTCONST [+-]*[1-9][0-9]*
VARIABLE [?][A-Za-z0-9]*
DEFINITIONS [a-zA-Z][-|_|A-Z|a-z|0-9]*
COMMENTS ^;.*$
/* For each pattern (on the left) that matches, the corresponding code
inside the braces is executed. The return statement allows a numeric
value to be returned through the yylex() function */
/* If it meets a delimiter or a comment it ignores it; if it meets an integer, variable or definition it reports it. In every other case it prints that it does not recognise the token, the line it is on, and the string that was given */
%%
{DELIMITER} {;}
"bind" { return BIND;}
"test" { return TEST;}
"read" { return READ;}
"printout" { return PRINTOUT;}
"deffacts" { return DEFFACTS;}
"defrule" { return DEFRULE;}
"->" { return '->';}
"=" { return '=';}
"+" { return '+';}
"-" { return '-';}
"*" { return '*';}
"/" { return '/';}
"(" { return '(';}
")" { return ')';}
{INTCONST} { return INTCONST; }
{VARIABLE} { return VARIABLE; }
{DEFINITIONS} { return DEFINITIONS; }
{COMMENTS} {;}
\n { line++; printf("\n"); }
.+ { printf("\tLine=%d, UNKNOWN TOKEN, value=\"%s\"\n",line, yytext);}
<<EOF>> { printf("#END-OF-FILE#\n"); exit(0); }
%%
/* Table of all the tokens, matching the definitions in token.h */
char *tname[11] = {"DELIMITER","INTCONST" , "VARIABLE", "DEFINITIONS", "COMMENTS", "BIND", "TEST", "READ", "PRINTOUT", "DEFFACTS", "DEFRULE"};
BISON CODE:
%{
/* C definitions and declarations. Anything to do with defining or initialising
variables & functions, header files and #define statements goes here */
#include <stdio.h>
#include <stdlib.h>
int yylex(void);
void yyerror(char *);
%}
/* Definition of the recognised tokens. */
%token INTCONST VARIABLE DEFINITIONS PLUS NEWLINE MINUS MULT DIV COM BIND TEST READ PRINTOUT DEFFACTS DEFRULE
%%
/* Definition of the grammar rules. Each time a grammar rule is matched
against the input, the C code between the braces is executed.
The expected syntax is:
name : rule { C code } */
program:
program expr NEWLINE { printf("%d\n", $2); }
|
;
expr:
INTCONST { $$ = $1; }
| VARIABLE { $$ = $1; } // added variables
| PLUS expr expr { $$ = $2 + $3; } // added addition as an operation
| MINUS expr expr { $$ = $2 - $3; } // added subtraction as an operation
| MULT expr expr { $$ = $2 * $3; } // added multiplication as an operation
| DIV expr expr { $$ = $2 / $3; } // added division as an operation
| COM { $$ = $1; } // added comments
| DEFFACTS expr { $$ = $2; } // added facts
| DEFRULE expr { $$ = $2; } // added rules
| BIND expr expr { $$ = $2;} // added bind
| TEST expr expr { $$ = $2 ;} // added test
| READ expr expr { $$ = $2 ;} // added read
| PRINTOUT expr expr { $$ = $2 ;} // added printout
;
%%
/* The yyerror function is used to report errors. Specifically, it is called
by yyparse when a syntax error occurs. In this case the function simply
prints an error message to the screen. */
void yyerror(char *s) {
fprintf(stderr, "Error: %s\n", s);
}
/* The main function, which is also the program's entry point.
In this case it simply calls Bison's yyparse function
to start the syntax analysis. */
int main(void) {
yyparse();
return 0;
}
TOKEN FILE:
#define DELIMITER 1
#define INTCONST 2
#define VARIABLE 3
#define DEFINITIONS 4
#define COMMENTS 5
#define BIND 6
#define TEST 7
#define READ 8
#define PRINTOUT 9
#define DEFFACTS 10
#define DEFRULE 11
MAKEFILE:
all:
bison -d simple-bison-code.y
flex mini-clips-la.l
gcc simple-bison-code.tab.c lex.yy.c -o B2
./B2
clean:
rm simple-bison-code.tab.c simple-bison-code.tab.h lex.yy.c B2
Your top-level rule is:
program:
program expr NEWLINE
which cannot succeed unless the parser sees a NEWLINE token. But it will never see one, because your lexical scanner never sends one; when it sees a newline, it increments the line count but doesn't return anything.
All your tokens are considered invalid because your lexical scanner uses its own definitions of the token values. You shouldn't do that. The parser generator (bison/yacc) will generate a header file containing the correct definitions; that is, the values it is expecting to see.
There are various other problems, probably more than I noticed. The most important is that you should not call exit(0) in the <<EOF>> rule, since that will mean that the parser can never succeed; it does not succeed until it is passed an EOF token. In fact, you should not normally have an <<EOF>> rule; the default action is to return 0 and that is pretty well the only action which makes sense.
Also, '->' is not a correct C literal. The compiler would have complained about it if you had enabled compiler warnings (-Wall), which you should always do, even if you are compiling generated code.
And your scanner's last pattern, intended to trigger on bad tokens, is .+, which will match the entire line, not just the erroneous character. Since (f)lex scanners accept the pattern with the longest match, most of your other patterns will never match. (Flex usually warns you about unmatchable patterns. Didn't you get such a warning?)
The fallback pattern should be .|\n, although you can use . if you are absolutely sure that every newline will be matched by some rule. I like to use %option nodefault, which will cause flex to warn me if there is some possible input not matched by any rule.
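The effect of using `.+` rather than `.` as the fallback can be sketched in C. This is an illustration of the longest-match policy only, not flex's implementation; `match_test`, `match_fallback` and `winner` are hypothetical helpers, and only the "test" keyword rule is modelled.

```c
#include <stddef.h>
#include <string.h>

/* Length matched by the literal pattern "test" at the start of s. */
static size_t match_test(const char *s) {
    return strncmp(s, "test", 4) == 0 ? 4 : 0;
}

/* Length matched by ".+" (greedy != 0) or "." (greedy == 0);
   '.' matches any character except newline. */
static size_t match_fallback(const char *s, int greedy) {
    size_t n = 0;
    while (s[n] != '\0' && s[n] != '\n') {
        n++;
        if (!greedy) break;
    }
    return n;
}

/* Pick the rule with the longest match; the keyword rule is listed
   first, so it wins ties. */
const char *winner(const char *s, int greedy_fallback) {
    size_t kw = match_test(s), fb = match_fallback(s, greedy_fallback);
    if (kw == 0 && fb == 0) return "NONE";
    return kw >= fb ? "TEST" : "UNKNOWN";
}
```

For the input "test 3 4", the `.+` fallback matches eight characters against the keyword's four, so `winner("test 3 4", 1)` is "UNKNOWN", which is the behaviour reported in the question. With a single-character fallback, `winner("test 3 4", 0)` is "TEST".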

What is wrong with this Bison grammar?

I'm trying to build a Bison grammar and seem to be missing something. I have kept it very basic so far, but I am still getting a syntax error and can't figure out why:
Here is my Bison Code:
%{
#include <stdlib.h>
#include <stdio.h>
int yylex(void);
int yyerror(char *s);
%}
// Define the types flex could return
%union {
long lval;
char *sval;
}
// Define the terminal symbol token types
%token <sval> IDENT;
%token <lval> NUM;
%%
Program:
Def ';'
;
Def:
IDENT '=' Lambda { printf("Successfully parsed file"); }
;
Lambda:
"fun" IDENT "->" "end"
;
%%
main() {
yyparse();
return 0;
}
int yyerror(char *s)
{
extern int yylineno; // defined and maintained in flex.flex
extern char *yytext; // defined and maintained in flex.flex
printf("ERROR: %s at symbol \"%s\" on line %i", s, yytext, yylineno);
exit(2);
}
Here is my Flex Code
%{
#include <stdlib.h>
#include "bison.tab.h"
%}
ID [A-Za-z][A-Za-z0-9]*
NUM [0-9][0-9]*
HEX [$][A-Fa-f0-9]+
COMM [/][/].*$
%%
fun|if|then|else|let|in|not|head|tail|and|end|isnum|islist|isfun {
printf("Scanning a keyword\n");
}
{ID} {
printf("Scanning an IDENT\n");
yylval.sval = strdup( yytext );
return IDENT;
}
{NUM} {
printf("Scanning a NUM\n");
/* Convert into long to loose leading zeros */
char *ptr = NULL;
long num = strtol(yytext, &ptr, 10);
if( errno == ERANGE ) {
printf("Number was to big");
exit(1);
}
yylval.lval = num;
return NUM;
}
{HEX} {
printf("Scanning a NUM\n");
char *ptr = NULL;
/* convert hex into decimal using offset 1 because of the $ */
long num = strtol(&yytext[1], &ptr, 16);
if( errno == ERANGE ) {
printf("Number was to big");
exit(1);
}
yylval.lval = num;
return NUM;
}
";"|"="|"+"|"-"|"*"|"."|"<"|"="|"("|")"|"->" {
printf("Scanning an operator\n");
}
[ \t\n]+ /* eat up whitespace */
{COMM}* /* eat up one-line comments */
. {
printf("Unrecognized character: %s at linenumber %d\n", yytext, yylineno );
exit(1);
}
%%
And here is my Makefile:
all: parser
parser: bison flex
gcc bison.tab.c lex.yy.c -o parser -lfl
bison: bison.y
bison -d bison.y
flex: flex.flex
flex flex.flex
clean:
rm bison.tab.h
rm bison.tab.c
rm lex.yy.c
rm parser
Everything compiles just fine; I do not get any errors running make all.
Here is my testfile
f = fun x -> end;
And here is the output:
./parser < a0.0
Scanning an IDENT
Scanning an operator
Scanning a keyword
Scanning an IDENT
ERROR: syntax error at symbol "x" on line 1
Since x seems to be recognized as an IDENT, the rule should be correct, yet I am still getting a syntax error.
I feel like I am missing something important, hopefully somebody can help me out.
Thanks in advance!
EDIT:
I tried to remove the IDENT in the Lambda rule and the testfile, now it seems to run through the line, but still throws
ERROR: syntax error at symbol "" on line 1
after the EOF.
Your scanner recognizes keywords (and prints out a debugging line, but see below), but it doesn't bother reporting anything to the parser. So they are effectively ignored.
In your bison definition file, you use (for example) "fun" as a terminal, but you do not provide the terminal with a name which could be used in the scanner. The scanner needs this name, because it has to return a token id to the parser.
To summarize, what you need is something like this:
In your grammar, before the %%:
%token T_FUN "fun"
%token T_IF "if"
%token T_THEN "then"
/* Etc. */
In your scanner definition:
fun { return T_FUN; }
if { return T_IF; }
then { return T_THEN; }
/* Etc. */
A couple of other notes:
Your scanner rule for recognizing operators also fails to return anything, so operators will also be ignored. That's clearly not desirable. flex and bison allow an easier solution for single-character operators, which is to let the character be its own token id. That avoids having to create a token name. In the parser, a single-quoted character represents a token-id whose value is the character; that's quite different from a double-quoted string, which is an alias for the declared token name. So you could do this:
"=" { return '='; }
/* Etc. */
but it's easier to do all the single-character tokens at once:
[;+*.<=()-] { return yytext[0]; }
and even easier to use a default rule at the end:
. { return yytext[0]; }
which will have the effect of handling unrecognized characters by returning an unknown token id to the parser, which will cause a syntax error.
This won't work for "->", since that is not a single character token, which will have to be handled in the same way as keywords.
Flex will produce debugging output automatically if you use the -d flag when you create the scanner. That's a lot easier than inserting your own debugging printout, because you can turn it off by simply removing the -d option. (You can use %option debug instead if you don't want to change the flex invocation in your makefile.) It's also better because it provides consistent information, including position information.
Some minor points:
The pattern [0-9][0-9]* could more easily be written [0-9]+
The comment pattern "//".* does not require a $ lookahead at the end, since .* will always match the longest sequence of non-newline characters; consequently, the first unmatched character must either be a newline or the EOF. $ lookahead will not match if the pattern is terminated with an EOF, which will cause odd errors if the file ends with a comment without a newline at the end.
There is no point using {COMM}* since the comment pattern does not match the newline which terminates the comment, so it is impossible for there to be two consecutive comment matches. But anyway, after matching a comment and the following newline, flex will continue to match a following comment, so {COMM} is sufficient. (Personally, I wouldn't use the COMM abbreviation; it really adds nothing to readability, IMHO.)
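To make the token-id conventions discussed above concrete, here is a minimal hand-written scanner in C. It is a stand-in for a generated yylex, not flex output; `next_token` and the enum values are hypothetical. Named tokens are numbered above 255 (bison also keeps named tokens out of the character range), so a single-character operator can safely be returned as its own id.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token ids, kept above 255 so they cannot collide with
   single-character tokens. */
enum { T_FUN = 258, T_END, T_ARROW, IDENT };

/* Scan one token starting at *p, advance *p past it, and return its id
   (0 at end of input). */
int next_token(const char **p) {
    const char *s = *p;
    while (isspace((unsigned char)*s)) s++;
    if (*s == '\0') { *p = s; return 0; }

    /* Multi-character operator: needs a named token, like a keyword. */
    if (s[0] == '-' && s[1] == '>') { *p = s + 2; return T_ARROW; }

    if (isalpha((unsigned char)*s)) {
        size_t n = 1;
        while (isalnum((unsigned char)s[n])) n++;
        *p = s + n;
        if (n == 3 && strncmp(s, "fun", 3) == 0) return T_FUN;
        if (n == 3 && strncmp(s, "end", 3) == 0) return T_END;
        return IDENT;
    }

    *p = s + 1;
    return (unsigned char)*s;   /* the character is its own token id */
}
```

Scanning the test file's line `f = fun x -> end;` with repeated calls yields IDENT, '=', T_FUN, IDENT, T_ARROW, T_END, ';' and finally 0, so every token the grammar mentions actually reaches the parser.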

Can't retrieve semantic values from Bison grammar file

I am trying to develop a language parser on CentOS 6.0 by means of Bison 3.0 (C parser generator), Flex 2.5.35 and gcc 4.4.7. I have the following Bison grammar file:
%{
#include <stdio.h>
%}
%union {
int int_t;
char* str_t;
}
%token SEP
%token <str_t> ID
%start start
%type <int_t> plst
%%
start: plst start
| EOS { YYACCEPT; }
;
// <id> , <id> , ... , <id>
plst: ID SEP_PARAMS plst { printf("Rule 1 %s %s \n",$1,$2); }
| ID { printf("Rule 2 %s \n", $1); }
| /* empty */ { }
;
%%
int yyerror(GNode* root, const char* s) {printf("Error: %s", s);}
The problem
As it is now, it is not really a meaningful one, but I think it is enough to understand my problem. Consider that I have a scanner written in Flex which recognizes my tokens. This grammar file is used to recognize simple identifier lists like: id1,id2,...,idn. My problem is that in each grammar rule, when I try to get the value of the identifier (the string representing the name of the identifier), I get a NULL pointer, as my printfs also show.
What am I doing wrong? Thank you
Edit
Thanks to recent answers, I could understand that the problems strongly relates to Flex and its configuration file. In particular I have edited my lex file in order to meet the specifications described by the Flex Manual for Bison Bridging:
{ID} { printf("[id-token]");
yylval->str_t = strdup(yytext);
return ID; }
However after running Bison, then Flex (providing the --bison-bridge option) and then the compiler, I execute the generated parser and I instantly get Segmentation Fault.
What's the problem?
The flex option --bison-bridge (or %option bison-bridge) matches up to the bison option %define api.pure. You need to use either BOTH bison-bridge and api.pure or NEITHER -- either way can work, but they need to be consistent. Since it appears you are NOT using api.pure, you want to delete the --bison-bridge option.
The values for $1, $2 etc. have to be set by the lexer.
If you have a rule in the lexer for identifiers, like
ID [a-z][a-z0-9]*
%%
{ID} { return ID; }
the semantic values are not set.
You have to do e.g.
{ID} { /* Set the unions value, used by e.g. `$1` in the parser */
yylval.str_t = strdup(yytext);
return ID;
}
Remember to free the value in the parser, as strdup allocates memory.
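The copy-then-free pattern can be sketched in plain C. The names here are hypothetical: `semantic_value` mirrors the %union from the question, and `dup_text` is a portable stand-in for strdup. The copy is needed because the scanner's match buffer (yytext) is only valid until the next token is scanned.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of the %union in the question. */
typedef union { int int_t; char *str_t; } semantic_value;

/* Portable stand-in for strdup: heap-allocate a copy of text. */
static char *dup_text(const char *text) {
    size_t len = strlen(text) + 1;
    char *copy = malloc(len);
    if (copy != NULL) memcpy(copy, text, len);
    return copy;
}

/* What the scanner action does: store a private copy of the match. */
semantic_value make_id_value(const char *matched_text) {
    semantic_value v;
    v.str_t = dup_text(matched_text);
    return v;
}

/* What the parser action should do once it is finished with the value. */
void release_id_value(semantic_value *v) {
    free(v->str_t);
    v->str_t = NULL;
}
```

A value created with `make_id_value` stays intact even after the source buffer is overwritten by the next match, and `release_id_value` is the counterpart the parser action would call to avoid leaking the copy.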

How to get entire input string in Lex and Yacc?

OK, so here is the deal.
In my language I have some commands, say
XYZ 3 5
GGB 8 9
HDH 8783 33
And in my Lex file
XYZ { return XYZ; }
GGB { return GGB; }
HDH { return HDH; }
[0-9]+ { yylval.ival = atoi(yytext); return NUMBER; }
\n { return EOL; }
In my yacc file
start : commands
;
commands : command
| command EOL commands
;
command : xyz
| ggb
| hdh
;
xyz : XYZ NUMBER NUMBER { /* Do something with the numbers */ }
;
etc. etc. etc. etc.
My question is, how can I get the entire text
XYZ 3 5
GGB 8 9
HDH 8783 33
Into commands while still returning the NUMBERs?
Also, when my Lex returns a STRING [0-9a-zA-Z]+ and I want to do verification on its length, should I do it like
rule: STRING STRING { if (strlen($1) < 5 ) /* Do some shit else error */ }
or actually have a token in my Lex that returns different tokens depending on length?
If I've understood your first question correctly, you can have semantic actions like
{ $$ = makeXYZ($2, $3); }
which will allow you to build the value of command as you want.
For your second question, the borders between lexical analysis and grammatical analysis, and between grammatical analysis and semantic analysis, aren't hard and fast. Moving them is a trade-off between factors like ease of description, clarity of error messages and robustness in the presence of errors. Considering the verification of string length, the likelihood of an error occurring is quite high, and the error message, if it is handled by returning different terminals for different lengths, will probably not be clear. So if it is possible (that depends on the grammar) I'd handle it in the semantic analysis phase, where the message can easily be tailored.
If you arrange for your lexical analyzer (yylex()) to store the whole string in some variable, then your code can access it. The communication with the parser proper will be through the normal mechanisms, but there's nothing that says you can't also have another variable lurking around (probably a file static variable - but beware multithreading) that stores the whole input line before it is dissected.
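One way to sketch the "extra variable" idea in C is shown here. The names `current_line`, `remember_token` and `take_line` are hypothetical, and the sketch rebuilds the line by joining token texts with single spaces rather than keeping the raw input. The scanner would call `remember_token(yytext)` in each action, and the parser action for a completed command would call `take_line`.

```c
#include <string.h>

/* File-static buffer holding the text of the current input line.
   Not thread-safe, as noted above. */
static char current_line[256];

/* Called by the scanner for every token it matches. */
void remember_token(const char *text) {
    size_t used = strlen(current_line);
    size_t room = sizeof current_line - used - 1;
    if (used > 0 && room > 0) { strcat(current_line, " "); room--; }
    strncat(current_line, text, room);   /* silently truncates long lines */
}

/* Called by a parser action at end of line: copy the line out, then
   reset the buffer for the next line. */
void take_line(char *out, size_t out_size) {
    if (out_size == 0) return;
    strncpy(out, current_line, out_size - 1);
    out[out_size - 1] = '\0';
    current_line[0] = '\0';
}
```

After the scanner has seen XYZ, 3 and 5, `take_line` hands back "XYZ 3 5", while the NUMBER tokens still reach the parser through yylval as usual.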
Since you use yylval.ival, you already have a union with an ival field in your YACC source, like this:
%union {
int ival;
}
Now you specify token type, like this:
%token <ival> NUMBER
So now you can access ival field simply for NUMBER token as $1 in your rules, like
xyz : XYZ NUMBER NUMBER { printf("XYZ %d %d", $2, $3); }
For your second question I'd define union like this:
%union {
char* strval;
int ival;
}
and in your YACC source specify the token types
%token <strval> STRING;
%token <ival> NUMBER;
So now you can do things like
foo : STRING NUMBER { printf("%s (len %zu) %d", $1, strlen($1), $2); }

Resources