Yacc Yytext is overwritten? - c

The solution is mentioned in the comments below the post.
I am running into an issue where I have a statement such as
i = ary[4];
lex prints out "ary"; however, yacc for some reason prints out '[', which means that yytext is being overwritten somehow.
Would someone tell me how to clean up this problem? As soon as I take out
PStmt : Id '[' Expr ']' { $$ = doRary($1, $3); };
then my program doesn't have problems, but I can't read arrays anymore.
In my lex file I have:
{letter}({letter}|{digit})* { return Ident; }
{digit}{digit}* { return IntLit; }
...
\[ { return '['; }
\] { return ']'; }
...
In my yacc file I have:
[Updated: I had to remove this section]
I would appreciate any tips/solutions as to how to deal with this as the aforementioned statement seems to influence other parts of the grammar.
FYI: I am following C precedence rules.

yytext is an internal buffer which belongs to the scanner generated by (f)lex, and its contents are modified on every call to yylex(). The bison/yacc-generated parser calls yylex() at unpredictable moments. In particular, it will call yylex() in order to obtain the lookahead token, which is not part of the current production.
So yytext should not be used outside of lexer actions. If the string value of the scanned token will be required by the parser, the lexer action for that token must make a copy of yytext and store it into the appropriate member of yylval so that it is available in parser actions involving that token. (See the bison manual for more details.)
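For example, a minimal sketch of that pattern, assuming the parser declares a %union with a char *sval member and that Ident carries that type (the member name here is illustrative, not taken from the question):
/* in the bison file */
%union { char *sval; }
%token <sval> Ident
/* in the flex file; strdup needs <string.h> */
{letter}({letter}|{digit})*   { yylval.sval = strdup(yytext); return Ident; }
The copy made by strdup survives the parser's lookahead; the parser action that consumes the token is then responsible for freeing it.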
Also see this question, and many others.

Related

Unexpected value in yytext in production rule for function declaration

I'm writing a compiler with flex and bison for a college assignment. I'm having trouble adding a function identifier to my symbol table: when evaluating a function declaration, I'm getting the opening parenthesis in yytext where I'd expect the identifier. In my flex file I have the following, where yylval is a union and vlex is a struct:
abc [A-Za-z_]
alphanum [A-Za-z_0-9]
id {abc}+{alphanum}*
...
#define STORE_YYLVAL_NONE \
do{\
    ... // location control irrelevant to the problem
    yylval.vlex.type = none_t;\
    yylval.vlex.value.sValue = yytext;\
}while(0)
...
{id}    {
            LOG_DEBUG("id: %s\n", yytext);
            STORE_YYLVAL_NONE;
            return TK_IDENTIFIER;
        }
[,;:()\[\]\{\}\+\-\*/<>!&=%#\^\.\|\?\$]    {
            LOG_DEBUG("special\n");
            STORE_YYLVAL_NONE;
            return *yytext;
        }
...
And in my bison file I have:
new_identifier_with_node: TK_IDENTIFIER {
        hshsym_add_or_exit(&hshsym, yylval.vlex.value.sValue, &(yylval.vlex));
        $$ = ast_node_create(&(yylval.vlex));
    };
func: type new_identifier_with_node '(' param_list ')' func_block { ... };
I also have a log inside hshsym_add_or_exit, which adds an identifier to my symbol table. When parsing the following program:
int k(int x,int y, int z){}
int f(){
k(10,20,30);
}
I'm getting the following debug output:
yylex: DEBUG! id: k
yylex: DEBUG! special
hshsym_add_or_exit: DEBUG! Declaring: (
That is, when the new_identifier_with_node production is evaluated, the content of yytext is ( instead of k, as I would expect. Is the code above the cause? I have some still unresolved shift/reduce conflicts which I guess could be at fault, but I don't see how in this specific case. I believe I'm missing something really basic but I can't see what. The project is quite large (and shamefully disorganized) at this point, but I can provide a complete and reproducible example if need be.
The basic problem is that you are using yylval in the new_identifier_with_node production, instead of $1. $1 is the semantic value of the first symbol in the production, in this case TK_IDENTIFIER.
In a bison action, yylval is usually the value of the lookahead token, which is the next token in the input stream. That's why it shows up as a parenthesis in this case. But you cannot count on that in general, because bison will perform a default reduction before reading the lookahead token. In general, using yylval in a bison action is very rarely useful, aside from some applications in error recovery.
Even after you fix that, you will find that the semantic values are incorrect because your flex action is forwarding a pointer to an internal data buffer rather than copying the token string. See, for example, this question.
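A rough sketch of both fixes, keeping the names from the question (vlex, hshsym_add_or_exit, ast_node_create) and assuming TK_IDENTIFIER is declared with the vlex member of the %union:
/* flex: copy the token text instead of aliasing yytext */
{id}    { LOG_DEBUG("id: %s\n", yytext);
          yylval.vlex.type = none_t;
          yylval.vlex.value.sValue = strdup(yytext);
          return TK_IDENTIFIER; }
/* bison: use $1, the value of TK_IDENTIFIER, rather than yylval */
new_identifier_with_node: TK_IDENTIFIER {
        hshsym_add_or_exit(&hshsym, $1.value.sValue, &$1);
        $$ = ast_node_create(&$1);
    };
Note that &$1 still points into the parser's value stack, so the callees must copy whatever they need to keep, just as with the original &(yylval.vlex).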

Yacc actions return 0 for any variable ($)

I'm new to Lex/Yacc. I found these lex & yacc files, which parse ANSI C.
To experiment I added an action to print part of the parsing:
constant
: I_CONSTANT { printf("I_CONSTANT %d\n", $1); }
| F_CONSTANT
| ENUMERATION_CONSTANT /* after it has been defined as such */
;
The problem is, no matter where I put the action, and whatever $X I use, I always get value 0.
Here is what got printed:
I_CONSTANT 0
Even though my input is:
int foo(int x)
{
return 5;
}
Any idea?
Nothing in the lex file you point to actually sets semantic values for any token. As the author says, the files are just a grammar and "the bulk of the work" still needs to be done. (There are other caveats having to do with the need for a preprocessor.)
Since nothing in the lex file ever sets yylval, it will always be 0, and that is what yacc/bison will find when it sets up the semantic value for the token ($1 in this case).
Turns out yylval = atoi(yytext) is not done in the lex file, so I had to add it myself. Also learned I can add extern char *yytext to the yacc file header, and then use yytext directly.
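That change looks roughly like this in the lex file (a sketch assuming the default int semantic value type, i.e. no %union; the real ANSI C lexer's constant patterns are more involved):
[0-9]+    { yylval = atoi(yytext); return I_CONSTANT; }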

Lex how to emulate modes or a stack of contexts

I'm trying to figure out how to emulate a context/mode or "stack of contexts" in lex (flex).
In particular, I'd like to write a parser that has a notion of string literals that can drop you back into an expression-y context.
I have a simple grammar that supports raw string literals using the syntax '...' and prints a string when it finds one.
However, a string token has potentially unbounded length (up to lex's maximum buffer size which I think is defined in some macro in the generated C source).
I want to define a begin_string token ' and an end_string token ' as well as a distinct token for reading a character while inside a string.
And I want to achieve this by having some notion of a context that says "now I'm in a string" and affects which tokenization rules are "active".
Here's the naive grammar below for context.
%{
#include <stdio.h>
%}
%option noyywrap
%%
'[^']*' { printf("found string literal (( %s ))\n", yytext); }
\n { /* do nothing */ }
. { /* do nothing */ }
%%
int main()
{
yylex();
return 0;
}
If I understand your needs correctly, that feature is provided by start conditions. As the manual explains, a start condition is a kind of state which can be used to enable and disable a set of productions.
For example, you might have:
%option nodefault
%x IN_STRING
%%
/* Other patterns for regular tokens */
"'" { BEGIN(IN_STRING); return BEGIN_STRING; }
<IN_STRING>"'" { BEGIN(INITIAL); return END_STRING; }
<IN_STRING>.|\n { return STRING_CHAR; }
Flex will optionally enable a feature which allows you to push and pop the current start condition on a stack, but in this simple case that isn't necessary. If you do need to do that, remember to add %option stack to your prolog, and read the description of the API at the end of the Start Condition chapter linked above.
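If you do end up needing nesting, the stacked form looks roughly like this (a sketch; yy_push_state and yy_pop_state are only generated when %option stack is given):
%option stack
%x IN_STRING
%%
"'"                 { yy_push_state(IN_STRING); return BEGIN_STRING; }
<IN_STRING>"'"      { yy_pop_state(); return END_STRING; }
<IN_STRING>.|\n     { return STRING_CHAR; }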

How to disable parsing for a piece of text in a file?

The structure of my file is:
`pragma TOKEN1_NAME TOKEN1_VALUE
`pragma TOKEN2_NAME TOKEN2_VALUE
`pragma TOKEN3_NAME TOKEN3_VALUE
`pragma TOKEN4_NAME TOKEN4_VALUE
VHDL_TEXT{
// A valid VHDL text goes here.
}
`pragma TOKEN2_NAME TOKEN2_VALUE
VHDL_TEXT{
// VHDL text
}
I need to pass the VHDL text through to the output file as-is. I can do that by making a default rule at the end of the lex file:
Rule: . { append_to_buffer(*yytext); }
I also have a list of other rules in my lex file to deal with the tokens.
The problem I am having is how to deal with the situation in which the VHDL text also contains some of the tokens that can be recognized by the lex rules.
In other words, I want to disable detection of further valid tokens once I have found the text I am interested in, and start detection again once it is over.
As rici indirectly points out, you need to be able to distinguish between occurrences of the trailing delimiter '}' for your rule and occurrences of the right curly bracket in a valid VHDL design specification or portion thereof.
See IEEE Std 1076-2008, 15.3 Lexical elements, separators, and delimiters, where we find that '{' and '}' are not used as delimiters in VHDL.
They are instead other special characters (15.2 Character set, using ISO/IEC 8859-1:1998) that require handling wherever graphic characters may appear.
graphic_character ::=
    basic_graphic_character | lower_case_letter | other_special_character
Those places include extended identifiers (15.4.3), character literals (15.6), string literals (15.7), bit string literals (15.8), comments (15.9) and tool directives (15.11).
You need to recognize these lexical elements within the production that would otherwise treat '}' as the delimiter for the rule.
Only one tool directive is currently defined (24.1 Protect tool directives), and there the two curly bracket characters would only occur inside VHDL lexical elements. All other uses in lexical elements are directly delimited. (You could also disclaim tool directive support; in VHDL, tool directives basically invoke separate lexical, syntactic and semantic analysis anyway.)
Essentially you need to run a VHDL lexical analyzer while traversing the 'VHDL text', in which your rule's delimiter, the right curly bracket, will stand out like a sore thumb (as an exception, serving as the closing delimiter for the VHDL text).
And about now you'd get the idea that life would be easier if you could deal with the VHDL by reference instead, if possible. Your mechanism is as complex as including tool directives in VHDL (which can be done with a preprocessor, as could your VHDL text).
This is in response to the vhdl tag added by FUZxxl.
When you have essentially different languages in a source file that you need to deal with that have clear demarcation tokens (like your VHDL_TEXT markers) that can be easily recognized by the lexer, the easiest thing to do is to use flex exclusive start states (%x). In your case, you would do something like:
%{
/* some global vars for holding aux state */
static int brace_depth;
static Buffer vhdl_text;
%}
%x VHDL
%%
.. normal lexer rules for your non-vhdl stuff
VHDL_TEXT[ \t]*"{"  { brace_depth = 1;
                      BufferClear(vhdl_text);
                      BEGIN(VHDL); }
<VHDL>"{"           { BufferAppend(vhdl_text, *yytext);
                      brace_depth++; }
<VHDL>"}"           { if (--brace_depth == 0) {
                          BEGIN(INITIAL);
                          yylval.buf = BufferExtract(vhdl_text);
                          return VHDL_TEXT; }
                      BufferAppend(vhdl_text, *yytext); }
<VHDL>--.*\n        { BufferAppendString(vhdl_text, yytext); }
<VHDL>\"[^"\n]*\"   { BufferAppendString(vhdl_text, yytext); }
<VHDL>\\[^\\\n]*\\  { BufferAppendString(vhdl_text, yytext); }
<VHDL>.|\n          { BufferAppend(vhdl_text, *yytext); }
This will gather up everything between the curly braces in VHDL_TEXT {...} and return it to your parser as a single token (matching nested braces properly, if there are any in the VHDL text). You can do macro-substitution-like stuff in the VHDL code by adding a rule like:
<VHDL>{IDENT}       { Macro *mac = lookup_macro_by_name(yytext);
                      if (mac) {
                          BufferAppendString(vhdl_text, mac->replacement);
                      } else {
                          BufferAppendString(vhdl_text, yytext); } }
You also probably want a <VHDL><<EOF>> rule to detect a missing closing } on the vhdl text and give an appropriate error message.
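Such a rule could look something like this (a sketch; how you report the error is up to you):
<VHDL><<EOF>>       { fprintf(stderr, "error: unterminated VHDL_TEXT block\n");
                      BEGIN(INITIAL);
                      yyterminate(); }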

Can't retrieve semantic values from Bison grammar file

I am trying to develop a language parser on CentOS 6.0 by means of Bison 3.0 (C parser generator), Flex 2.5.35 and gcc 4.4.7. I have the following Bison grammar file:
%{
#include <stdio.h>
%}
%union {
int int_t;
char* str_t;
}
%token SEP
%token <str_t> ID
%start start
%type <int_t> plst
%%
start: plst start
     | EOS { YYACCEPT; }
     ;
// <id> , <id> , ... , <id>
plst: ID SEP_PARAMS plst { printf("Rule 1 %s %s \n", $1, $2); }
    | ID { printf("Rule 2 %s \n", $1); }
    | /* empty */ { }
    ;
%%
int yyerror(GNode* root, const char* s) {printf("Error: %s", s);}
The problem
As it is now, it is not really a meaningful grammar, but I think it is enough to illustrate my problem. Consider that I have a scanner written in Flex which recognizes my tokens. This grammar file is used to recognize simple identifier lists like id1,id2,...,idn. My problem is that in each grammar rule, when I try to get the value of the identifier (the string representing the name of the identifier), I get a NULL pointer, as also shown by my printfs.
What am I doing wrong? Thank you.
Edit
Thanks to recent answers, I could understand that the problems strongly relates to Flex and its configuration file. In particular I have edited my lex file in order to meet the specifications described by the Flex Manual for Bison Bridging:
{ID}    { printf("[id-token]");
          yylval->str_t = strdup(yytext);
          return ID; }
However, after running Bison, then Flex (providing the --bison-bridge option), and then the compiler, I execute the generated parser and instantly get a segmentation fault.
What's the problem?
The flex option --bison-bridge (or %option bison-bridge) matches up to the bison option %define api.pure. You need to use either BOTH bison-bridge and api.pure or NEITHER -- either way can work, but they need to be consistent. Since it appears you are NOT using api.pure, you want to delete the --bison-bridge option.
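If you do want to keep --bison-bridge, the consistent pure-parser pairing looks roughly like this (a sketch; the exact %define spelling varies slightly between bison versions):
/* flex file */
%option bison-bridge
/* actions then receive yylval as a pointer: yylval->str_t = strdup(yytext); */
/* bison file */
%define api.pure
Otherwise, drop --bison-bridge and keep writing yylval.str_t = strdup(yytext); in the lexer actions.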
The values for $1, $2 etc. have to be set by the lexer.
If you have a rule in the lexer for identifiers, like
ID [a-z][a-z0-9]*
%%
{ID} { return ID; }
the semantic values are not set.
You have to do e.g.
{ID}    { /* Set the union's value, used by e.g. `$1` in the parser */
          yylval.str_t = strdup(yytext);
          return ID;
        }
Remember to free the value in the parser, as strdup allocates memory.
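For example, in the grammar above that could look like this (a sketch; it assumes the string is not needed after the action, uses the declared SEP token, and prints only the identifier, since the separator carries no string value; free needs <stdlib.h>):
plst: ID SEP plst { printf("Rule 1 %s \n", $1); free($1); }
    | ID          { printf("Rule 2 %s \n", $1); free($1); }
    | /* empty */ { }
    ;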
