parsing with bison - c

I bought Flex & Bison from O'Reilly but I'm having some trouble implementing a parser (breaking things down into tokens was no big deal).
Suppose I have a huge binary string and what I need to do is add the bits together - every bit is a token:
[0-1] { return NUMBER;}
1101010111111
Or for that matter a collection of tokens with no "operation".
Would a such a grammar be correct?
calclist :
| calclist expr EOL {eval($2)}
expr: NUMBER
|expr NUMBER { $$=$1+$2 }
or is there a better way to do it?

Your example lex rule "[0-1] { return NUMBER; }" doesn't set yylval, so if you use that value in your grammar (as you do in the rule "expr NUMBER { $$=$1+$2; }") you'll get garbage.
In general what you're doing is correct, though the task you've chosen is so trivial that lex/bison is serious overkill.

Related

Unexpected value in yytext in production rule for function declaration

I'm writing a compiler with flex and bison for a college assignment. I'm having trouble adding a function identifier to my symbol table - when evaluating a function declaration I'm getting the opening parenthesis in yytext where I'd expect the identifier. In my flex file I have, where yylval is an union and vlex is a struct:
abc [A-Za-z_]
alphanum [A-Za-z_0-9]
id {abc}+{alphanum}*
...
#define STORE_YYLVAL_NONE\
do{\
... // location control irrelevant to the problem
yylval.vlex.type = none_t;\
yylval.vlex.value.sValue = yytext;\
}while(0)
...
{id} {
LOG_DEBUG("id: %s\n", yytext);
STORE_YYLVAL_NONE;
return TK_IDENTIFIER;
}
[,;:()\[\]\{\}\+\-\*/<>!&=%#\^\.\|\?\$] {
LOG_DEBUG("special\n");
STORE_YYLVAL_NONE;
return *yytext;
}
...
And in my bison file I have:
new_identifier_with_node: TK_IDENTIFIER {
hshsym_add_or_exit(&hshsym, yylval.vlex.value.sValue, &(yylval.vlex));
$$ = ast_node_create(&(yylval.vlex));
};
func: type new_identifier_with_node '(' param_list ')' func_block { ... };
I also have a log inside hshsym_add_or_exit, which adds an identifier to my symbol table. When parsing the following program:
int k(int x,int y, int z){}
int f(){
k(10,20,30);
}
I'm getting the following debug output:
yylex: DEBUG! id: k
yylex: DEBUG! special
hshsym_add_or_exit: DEBUG! Declaring: (
That is, when the new_identifier_with_node production is evaluated, the content of yytext is ( instead of k, as I would expect. Is the code above the cause? I have some still unresolved shift/reduce conflicts which I guess could be at fault, but I don't see how in this specific case. I believe I'm missing something really basic but I can't see what. The project is quite large (and shamefully disorganized) at this point, but I can provide a complete and reproducible example if need be.
The basic problem is that you are using yylval in the new_identifier_with_node production, instead of $1. $1 is the semantic value of the first symbol in the production, in this case TK_IDENTIFIER.
In a bison action, yylval is usually the value of lookahead token, which is the next token in the input stream. That's why it shows up as a parenthesis in this case. But you cannot in general count on that because bison will perform a default reduction before reading the lookahead token. In general, using yylval in a bison action is very rarely useful, aside from some applications in error recovery.
Even after you fix that, you will find that the semantic values are incorrect because your flex action is forwarding a pointer to an internal data buffer rather than copying the token string. See, for example, this question.

Yacc actions return 0 for any variable ($)

I'm new to Lex/Yacc. Found these lex & yacc files which parse ansi C.
To experiment I added an action to print part of the parsing:
constant
: I_CONSTANT { printf("I_CONSTANT %d\n", $1); }
| F_CONSTANT
| ENUMERATION_CONSTANT /* after it has been defined as such */
;
The problem is, no matter where I put the action, and whatever $X I use, I always get value 0.
Here I got printed:
I_CONSTANT 0
Even though my input is:
int foo(int x)
{
return 5;
}
Any idea?
Nothing in the lex file you point to actually sets semantic values for any token. As the author says, the files are just a grammar and "the bulk of the work" still needs to be done. (There are other caveats having to do with the need for a preprocessor.)
Since nothing in the lex file ever sets yylval, it will always be 0, and that is what yacc/bison will find when it sets up the semantic value for the token ($1 in this case).
Turns out yylval = atoi(yytext) is not done in the lex file, so I had to add it myself. Also learned I can add extern char *yytext to the yacc file header, and then use yytext directly.

How to disable parsing for a piece of text in a file?

Structure of my file is :
`pragma TOKEN1_NAME TOKEN1_VALUE
`pragma TOKEN2_NAME TOKEN2_VALUE
`pragma TOKEN3_NAME TOKEN3_VALUE
`pragma TOKEN4_NAME TOKEN4_VALUE
VHDL_TEXT{
// A valid VHDL text goes here.
}
`pragma TOKEN2_NAME TOKEN2_VALUE
VHDL_TEXT{
// VHDL text
}
I need to pass VHDL text as it is to the output file.I can do that by making a default rule at the end of lex file as:
Rule: . { append_to_buffer(*yytext); }
I also have list of other rules in my Lex file to deal with the tokens.
The problem i am having is how to deal with the situation in which VHDL text is also containing some of the tokens that can be recognized by the Lex rules?
In other words ,i want to disable detecting further valid token one i found the text i am interesting in and again start detection once it is over.
As rici points out indirectly you need to be able to distinguish between occurrences of the trailing delimiter '}' for your rule and occurrences of the right curly bracket in a valid VHDL design specification or portion.
See IEEE Std 1076-2008, 15.3 Lexical elements, separators, and delimiters where we find that '{' and '}' are not used as delimiters in VHDL.
They are other special characters (15.2 Character set, using ISO/IEC 8859-1:1998) requiring handling where graphic characters may appear.
graphic_character ::=
basic_graphic_character | lower_case_letter | other_special_character
These include extended identifiers (15.4.3), character literals (15.6), string literals (15.7), bit string literals (15.8), comments (15.9) and tool directives (15.11).
There's a need to identify these lexical elements within the production otherwise identifying '}' as a delimiter for the rule.
Only one tool directive is currently defined (24.1 Protect tool directives) wherein the use of the two curly bracket characters would be contained in VHDL lexical elements. All other uses in lexical elements are directly delimited. (And you could disclaim tool directive support, in VHDL they basically also invoke separate lexical, syntactical and semantic analysis).
Essentially you need to operate a VHDL lexical analyzer for traversing 'VHDL text' where you're rule delimiter right curly bracket will stand out like a sore thumb (as an exception, serving as the closing delimiter for VHDL text).
And about now you'd get the idea life would be easier if you could deal with VHDL by reference instead if possible. Your mechanism is as complex as including tool directives in VHDL (which can be done with a preprocessor as could your VHDL text).
This is in response to the vhdl tag added by FUZxxl.
When you have essentially different languages in a source file that you need to deal with that have clear demarcation tokens (like your VHDL_TEXT markers) that can be easily recognized by the lexer, the easiest thing to do is to use flex exclusive start states (%x). In your case, you would do something like:
%{
/* some global vars for holding aux state */
static int brace_depth;
static Buffer vhdl_text;
%}
%x VHDL
%%
.. normal lexer rules for your non-vhdl stuff
VHDL_TEXT[ \t]*{ { brace_depth = 1;
BufferClear(vhdl_text);
BEGIN(VHDL); }
<VHDL>"{" { BufferAppend(vhdl_text, *yytext);
brace_depth++; }
<VHDL>"}" { if (--brace_depth == 0) {
BEGIN(INITIAL);
yylval.buf = BufferExtract(vhdl_text);
return VHDL_TEXT; }
BufferAppend(vhdl_text, *yytext); }
<VHDL>--.*\n { BufferAppendString(vhdl_text, yytext); }
<VHDL>\"[^"\n]\" { BufferAppendString(vhdl_text, yytext); }
<VHDL>\\[^\\\n]\\ { BufferAppendString(vhdl_text, yytext); }
<VHDL>.|\n { BufferAppend(vhdl_text, *yytext); }
This will gather up everything between the curly braces in VHDL_TEXT {...} and return it to your parser as a single token (matching nested braces properly, if there are any in the VHDL text.) You can do macro substitution-like stuff in the VHDL code by adding a rule like:
<VHDL>{IDENT} { if (Macro *mac = lookup_macro_by_name(yytext)) {
BufferAppendString(vhdl_text, mac->replacement);
} else {
BufferAppendString(vhdl_text, yytext); } }
You also probably want a <VHDL><<EOF>> rule to detect a missing closing } on the vhdl text and give an appropriate error message.

Can't retrieve semantic values from Bison grammar file

I am trying to develop a language parser on CentOS 6.0 by means of Bison 3.0 (C parser generator), Flex 2.5.35 and gcc 4.4.7. I have the following Bison grammar file:
%{
#include <stdio.h>
%}
%union {
int int_t;
char* str_t;
}
%token SEP
%token <str_t> ID
%start start
%type <int_t> plst
%%
start: plst start
| EOS { YYACCEPT; }
;
// <id> , <id> , ... , <id>
plst: ID SEP_PARAMS plst { printf("Rule 1 %s %s \n",$1,$2); }
| ID { printf("Rule 2 %s \n", $1); }
| /* empty */ { }
;
%%
int yyerror(GNode* root, const char* s) {printf("Error: %s", s);}
The problem
As it is now, it is not really a meaningful one, but it is enough to understand my problem I think. Consider that I have a scanner written in Flex which recognizes my tokens. This grammar file is used to recognize simple identifier lists like: id1,id2,...,idn. My problem is that in each grammar rule, when I try to get the value of the identifier (the string representing the same of the identifier), I get a NULL pointer as also proved by my printfs.
What am I doing wrong? Thankyou
Edit
Thanks to recent answers, I could understand that the problems strongly relates to Flex and its configuration file. In particular I have edited my lex file in order to meet the specifications described by the Flex Manual for Bison Bridging:
{ID} { printf("[id-token]");
yylval->str_t = strdup(yytext);
return ID; }
However after running Bison, then Flex (providing the --bison-bridge option) and then the compiler, I execute the generated parser and I instantly get Segmentation Fault.
What's the problem?
The flex option --bison-bridge (or %option bison-bridge) matches up to the bison option %define api.pure. You need to use either BOTH bison-bridge and api.pure or NEITHER -- either way can work, but they need to be consistent. Since it appears you are NOT using api.pure, you want to delete the --bison-bridge option.
The values for $1, $2 etc. have to be set by the lexer.
If you have a rule in the lexer for identifiers, like
ID [a-z][a-z0-9]*
%%
{ID} { return ID; }
the semantic values are not set.
You have to do e.g.
{ID} { /* Set the unions value, used by e.g. `$1` in the parser */
yylval.str_t = strdup(yytext);
return ID;
}
Remember to free the value in the parser, as strdup allocates memory.

How to get entire input string in Lex and Yacc?

OK, so here is the deal.
In my language I have some commands, say
XYZ 3 5
GGB 8 9
HDH 8783 33
And in my Lex file
XYZ { return XYZ; }
GGB { return GGB; }
HDH { return HDH; }
[0-9]+ { yylval.ival = atoi(yytext); return NUMBER; }
\n { return EOL; }
In my yacc file
start : commands
;
commands : command
| command EOL commands
;
command : xyz
| ggb
| hdh
;
xyz : XYZ NUMBER NUMBER { /* Do something with the numbers */ }
;
etc. etc. etc. etc.
My question is, how can I get the entire text
XYZ 3 5
GGB 8 9
HDH 8783 33
Into commands while still returning the NUMBERs?
Also when my Lex returns a STRING [0-9a-zA-Z]+, and I want to do verification on it's length, should I do it like
rule: STRING STRING { if (strlen($1) < 5 ) /* Do some shit else error */ }
or actually have a token in my Lex that returns different tokens depending on length?
If I've understood your first question correctly, you can have semantic actions like
{ $$ = makeXYZ($2, $3); }
which will allow you to build the value of command as you want.
For your second question, the borders between lexical analysis and grammatical analysis and between grammatical analysis and semantic analysis aren't hard and well fixed. Moving them is a trade-off between factors like easiness of description, clarity of error messages and robustness in presence of errors. Considering the verification of string length, the likelihood of an error occurring is quite high and the error message if it is handled by returning different terminals for different length will probably be not clear. So if it is possible -- that depend on the grammar -- I'd handle it in the semantic analysis phase, where the message can easily be tailored.
If you arrange for your lexical analyzer (yylex()) to store the whole string in some variable, then your code can access it. The communication with the parser proper will be through the normal mechanisms, but there's nothing that says you can't also have another variable lurking around (probably a file static variable - but beware multithreading) that stores the whole input line before it is dissected.
As you use yylval.ival you already have union with ival field in your YACC source, like this:
%union {
int ival;
}
Now you specify token type, like this:
%token <ival> NUMBER
So now you can access ival field simply for NUMBER token as $1 in your rules, like
xyz : XYZ NUMBER NUMBER { printf("XYZ %d %d", $2, $3); }
For your second question I'd define union like this:
%union {
char* strval;
int ival;
}
and in you LEX source specify token types
%token <strval> STRING;
%token <ival> NUMBER;
So now you can do things like
foo : STRING NUMBER { printf("%s (len %d) %d", $1, strlen($1), $2); }

Resources