Lex how to emulate modes or a stack of contexts - c

I'm trying to figure out how to emulate a context/mode or "stack of contexts" in lex (flex).
In particular, I'd like to write a parser that has a notion of string literals that can drop you back into an expression-y context.
I have a simple grammar that supports raw string literals using the syntax '...' and prints a string when it finds one.
However, a string token has potentially unbounded length (up to lex's maximum buffer size which I think is defined in some macro in the generated C source).
I want to define a begin_string token ' and an end_string token ' as well as a distinct token for reading a character while inside a string.
And I want to achieve this by having some notion of a context that says "now I'm in a string" and affects which tokenization rules are "active".
Here's the naive grammar below for context.
%{
#include <stdio.h>
%}
%option noyywrap
%%
'[^']*' { printf("found string literal (( %s ))\n", yytext); }
\n { /* do nothing */ }
. { /* do nothing */ }
%%
int main()
{
yylex();
return 0;
}

If I understand your needs correctly, that feature is provided by start conditions. As the manual explains, a start condition is a kind of state, which can be used to enable and disable a set of rules.
For example, you might have:
%option nodefault
%x IN_STRING
%%
/* Other patterns for regular tokens */
"'" { BEGIN(IN_STRING); return BEGIN_STRING; }
<IN_STRING>"'" { BEGIN(INITIAL); return END_STRING; }
<IN_STRING>.|\n { return STRING_CHAR; }
Flex will optionally enable a feature which allows you to push and pop the current start condition on a stack, but in this simple case that isn't necessary. If you do need to do that, remember to add %option stack to your prolog, and read the description of the API at the end of the Start Condition chapter linked above.
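For the "stack of contexts" part of the question, here is a minimal hand-written C sketch of what the start-condition stack does. It is not flex output and not the flex API: mode_push and mode_pop only imitate the behaviour of yy_push_state() and yy_pop_state(), and the use of { and } to enter and leave an expression inside a string is an assumption made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical modes, mirroring INITIAL and IN_STRING above. */
enum mode { MODE_INITIAL, MODE_IN_STRING };

#define MAX_DEPTH 32

struct mode_stack {
    enum mode items[MAX_DEPTH]; /* saved modes */
    size_t depth;               /* number of saved modes */
    enum mode current;          /* the active mode */
};

/* Imitates yy_push_state(): remember the active mode, then switch. */
static void mode_push(struct mode_stack *s, enum mode next)
{
    assert(s->depth < MAX_DEPTH);
    s->items[s->depth++] = s->current;
    s->current = next;
}

/* Imitates yy_pop_state(): return to the most recently saved mode. */
static void mode_pop(struct mode_stack *s)
{
    assert(s->depth > 0);
    s->current = s->items[--s->depth];
}

/*
 * Writes one letter per input character into `out`:
 *   B begin string, E end string, C string character,
 *   X begin embedded expression, Y end embedded expression, T anything else.
 * Returns the number of contexts still open at the end of the input.
 */
size_t scan_modes(const char *input, char *out, size_t out_size)
{
    struct mode_stack s = { .depth = 0, .current = MODE_INITIAL };
    size_t n = 0;

    assert(out_size > 0);
    for (const char *p = input; *p != '\0' && n + 1 < out_size; p++) {
        char tok;
        if (s.current == MODE_IN_STRING) {
            if (*p == '\'')     { mode_pop(&s); tok = 'E'; }
            else if (*p == '{') { mode_push(&s, MODE_INITIAL); tok = 'X'; }
            else                { tok = 'C'; }
        } else {
            if (*p == '\'')                    { mode_push(&s, MODE_IN_STRING); tok = 'B'; }
            else if (*p == '}' && s.depth > 0) { mode_pop(&s); tok = 'Y'; }
            else                               { tok = 'T'; }
        }
        out[n++] = tok;
    }
    out[n] = '\0';
    return s.depth;
}
```

A nonzero return value means a string or an embedded expression was still open when the input ended, which is the situation an <<EOF>> rule would report in a real flex scanner.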

Related

Regular expression in FLEX finding text

I got lex file with this rule:
%option noyywrap
%{
%}
LNA [^<>]
LNANA [^<>!]
%%
(<!!) fprintf(yyout, "begin_comment\t\t\t%s\n", yytext);
(!!>) fprintf(yyout, "end_comment\t\t\t%s\n", yytext);
({LNANA}*|({LNA}{LNANA})*|{LNA}+{LNANA}{LNANA}{LNA}) fprintf(yyout,
"string\t\t\t%s\n", yytext);
. fprintf(yyout, "illegal char %s\n", yytext);
%%
I need to find comments between "<!!" and "!!>", as well as the plain strings in the code that have no delimiters around them,
for example:
<!! This is a comment that need to be found !!>
simple string that need to be found also
and this is my output:
As you can see, this does not work as needed.
Any help?
I'm not sure exactly what you're trying for.
There's certainly a regular expression which matches an entire comment (as long as you don't intend comments to nest). But it's hard to get it right, and you typically end up splitting strings and returning more tokens than necessary. Here's one which I think works, although it's not fully tested. Since you need to match the entire comment, the pattern has to include the comment delimiters. Of course, you also have to match the strings between the comments, as well as doing something in the case that a comment is not correctly terminated.
"<!!"([^!]*!)([^!]+!)*!+([^!>][^!]*!([^!]+!)*!+)*> { /* Comment */ }
"<!!" { /* This pattern will match on unterminated comments */ }
[^<]+ { /* Non comment text (but maybe not the whole string) */ }
"<" { /* Also non-comment text */ }
A possibly clearer and probably slower version uses a start condition, and returns both the insides of comments and the rest of the text in single pieces (in yytext, as per the yylex interface).
%x IN_COMMENT
%%
"<!!" { BEGIN(IN_COMMENT);
yytext[yyleng -= 3] = 0;
if (yyleng) return STRING;
}
/* This pattern deliberately fails if it reaches the end of the input */
([^<]+|"<")/(.|\n) { yymore(); }
/* The next pattern is to catch the last character in the input */
.|\n { return STRING; }
<IN_COMMENT>"!!>" { BEGIN(INITIAL);
yytext[yyleng -= 3] = 0;
return COMMENT;
}
<IN_COMMENT>[^!]+|! { yymore(); }
<IN_COMMENT><<EOF>> { fputs("Unterminated comment\n", stderr); yyterminate(); }
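To make the intended result of the start-condition version concrete, here is a rough plain-C sketch of the same splitting: text outside comments comes back as one piece, the inside of each comment as another, and an unclosed comment is flagged. The names split_comments, struct piece and the PIECE_* constants are invented for this illustration. It uses strstr rather than a generated scanner, so it shows the behaviour, not how flex does it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum piece_kind { PIECE_STRING, PIECE_COMMENT, PIECE_UNTERMINATED };

struct piece {
    enum piece_kind kind;
    const char *start;  /* points into the input, not NUL-terminated */
    size_t length;
};

/*
 * Splits `text` into plain-text pieces and the insides of <!! ... !!>
 * comments. Returns the number of pieces written to `out` (at most `max`).
 */
size_t split_comments(const char *text, struct piece *out, size_t max)
{
    size_t n = 0;
    const char *p = text;

    assert(text != NULL && out != NULL);
    while (*p != '\0' && n < max) {
        if (strncmp(p, "<!!", 3) == 0) {
            const char *body = p + 3;
            const char *end = strstr(body, "!!>");
            if (end == NULL) {
                /* End of input inside a comment: the <<EOF>> case. */
                out[n++] = (struct piece){ PIECE_UNTERMINATED, body, strlen(body) };
                break;
            }
            out[n++] = (struct piece){ PIECE_COMMENT, body, (size_t)(end - body) };
            p = end + 3;
        } else {
            /* Plain text runs up to the next comment opener or the end. */
            const char *next = strstr(p, "<!!");
            size_t len = next ? (size_t)(next - p) : strlen(p);
            out[n++] = (struct piece){ PIECE_STRING, p, len };
            p += len;
        }
    }
    return n;
}
```

For the input "abc<!! hi !!>def" this yields a string piece "abc", a comment piece " hi " and a string piece "def", which corresponds to the STRING and COMMENT tokens returned by the rules above.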

checking unfinished comments in flex

I am new to flex. I have just written some sample code to detect multi-line comments using a flex program. Now I want to improve the code. I want to detect unfinished and ill-formed comments. For example, a comment beginning with /* without an ending */ is an unfinished comment, and by ill-formed comment I mean a comment that is not properly formed, say, one in which an EOF appears. What do I have to add to my code to check these things? My sample code is as follows:
%x COMMENT_MULTI_LINE
%{
char* commentStart;
%}
%%
[\n\t\r ]+ {
/* ignore whitespace */ }
<INITIAL>"/*" {
commentStart = yytext;
BEGIN(COMMENT_MULTI_LINE);
}
<COMMENT_MULTI_LINE>"*/" {
char* comment = strndup(commentStart, yytext + 2 - commentStart);
printf("'%s': was a multi-line comment\n", comment);
free(comment);
BEGIN(INITIAL);
}
<COMMENT_MULTI_LINE>. {
}
<COMMENT_MULTI_LINE>\n {
}
%%
int main(int argc, char *argv[]){
yylex();
}
The flex manual section on using <<EOF>> is quite helpful as it has exactly your case as an example, and their code can also be copied verbatim into your flex program.
As it explains, <<EOF>> cannot be placed inside a normal regular expression pattern. The only thing that may precede it is the name of a start condition. In your code you are using a start condition to indicate that you are inside a comment, and it is called COMMENT_MULTI_LINE. All you have to do is put that in front of the <<EOF>> marker and give it an action:
<COMMENT_MULTI_LINE><<EOF>> {printf("Unterminated Comment: %s\n", yytext);
yyterminate();}
The special action function yyterminate() tells flex that you have recognised the <<EOF>> and that it marks the end-of-input for your program.
I have tested this, and it works in your code. (And with multi-line strings also).
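If you also want the error message to say where the unfinished comment began, the idea is to remember the line on which /* was seen and report it when the input ends while still inside the comment. Here is a small stand-alone C sketch of that bookkeeping. The function name unterminated_comment_line is made up, and the single variable comment_line stands in for both the COMMENT_MULTI_LINE state and the saved position.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 0 if every comment in `src` is closed. Otherwise returns the
 * 1-based line number on which the unterminated comment starts.
 */
int unterminated_comment_line(const char *src)
{
    int line = 1;
    int comment_line = 0; /* 0 means "not inside a comment" */

    assert(src != NULL);
    for (const char *p = src; *p != '\0'; p++) {
        if (*p == '\n')
            line++;
        if (comment_line == 0 && p[0] == '/' && p[1] == '*') {
            comment_line = line; /* entering the comment state */
            p++;                 /* skip the '*' as well */
        } else if (comment_line != 0 && p[0] == '*' && p[1] == '/') {
            comment_line = 0;    /* back to the initial state */
            p++;                 /* skip the '/' as well */
        }
    }
    /* Reaching the end while still inside a comment is the <<EOF>> case. */
    return comment_line;
}
```

In the flex program itself the equivalent would be to save yylineno (with %option yylineno) in the "/*" action and print it in the <<EOF>> action.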

How to disable parsing for a piece of text in a file?

Structure of my file is :
`pragma TOKEN1_NAME TOKEN1_VALUE
`pragma TOKEN2_NAME TOKEN2_VALUE
`pragma TOKEN3_NAME TOKEN3_VALUE
`pragma TOKEN4_NAME TOKEN4_VALUE
VHDL_TEXT{
// A valid VHDL text goes here.
}
`pragma TOKEN2_NAME TOKEN2_VALUE
VHDL_TEXT{
// VHDL text
}
I need to pass the VHDL text as it is to the output file. I can do that by adding a default rule at the end of the lex file:
Rule: . { append_to_buffer(*yytext); }
I also have a list of other rules in my lex file to deal with the tokens.
The problem I am having is how to deal with the situation in which the VHDL text also contains some of the tokens that can be recognized by the lex rules.
In other words, I want to disable detection of further valid tokens once I find the text I am interested in, and start detection again once it is over.
As rici points out indirectly you need to be able to distinguish between occurrences of the trailing delimiter '}' for your rule and occurrences of the right curly bracket in a valid VHDL design specification or portion.
See IEEE Std 1076-2008, 15.3 Lexical elements, separators, and delimiters where we find that '{' and '}' are not used as delimiters in VHDL.
They are other special characters (15.2 Character set, using ISO/IEC 8859-1:1998) requiring handling where graphic characters may appear.
graphic_character ::=
basic_graphic_character | lower_case_letter | other_special_character
These include extended identifiers (15.4.3), character literals (15.6), string literals (15.7), bit string literals (15.8), comments (15.9) and tool directives (15.11).
There's a need to identify these lexical elements within the production otherwise identifying '}' as a delimiter for the rule.
Only one tool directive is currently defined (24.1 Protect tool directives) wherein the use of the two curly bracket characters would be contained in VHDL lexical elements. All other uses in lexical elements are directly delimited. (And you could disclaim tool directive support, in VHDL they basically also invoke separate lexical, syntactical and semantic analysis).
Essentially you need to operate a VHDL lexical analyzer for traversing 'VHDL text', where your rule delimiter, the right curly bracket, will stand out like a sore thumb (as an exception, serving as the closing delimiter for VHDL text).
And about now you'd get the idea life would be easier if you could deal with VHDL by reference instead if possible. Your mechanism is as complex as including tool directives in VHDL (which can be done with a preprocessor as could your VHDL text).
This is in response to the vhdl tag added by FUZxxl.
When you have essentially different languages in a source file that you need to deal with that have clear demarcation tokens (like your VHDL_TEXT markers) that can be easily recognized by the lexer, the easiest thing to do is to use flex exclusive start states (%x). In your case, you would do something like:
%{
/* some global vars for holding aux state */
static int brace_depth;
static Buffer vhdl_text;
%}
%x VHDL
%%
.. normal lexer rules for your non-vhdl stuff
VHDL_TEXT[ \t]*"{" { brace_depth = 1;
BufferClear(vhdl_text);
BEGIN(VHDL); }
<VHDL>"{" { BufferAppend(vhdl_text, *yytext);
brace_depth++; }
<VHDL>"}" { if (--brace_depth == 0) {
BEGIN(INITIAL);
yylval.buf = BufferExtract(vhdl_text);
return VHDL_TEXT; }
BufferAppend(vhdl_text, *yytext); }
<VHDL>--.*\n { BufferAppendString(vhdl_text, yytext); }
<VHDL>\"[^"\n]*\" { BufferAppendString(vhdl_text, yytext); }
<VHDL>\\[^\\\n]*\\ { BufferAppendString(vhdl_text, yytext); }
<VHDL>.|\n { BufferAppend(vhdl_text, *yytext); }
This will gather up everything between the curly braces in VHDL_TEXT {...} and return it to your parser as a single token (matching nested braces properly, if there are any in the VHDL text.) You can do macro substitution-like stuff in the VHDL code by adding a rule like:
<VHDL>{IDENT} { if (Macro *mac = lookup_macro_by_name(yytext)) {
BufferAppendString(vhdl_text, mac->replacement);
} else {
BufferAppendString(vhdl_text, yytext); } }
You also probably want a <VHDL><<EOF>> rule to detect a missing closing } on the vhdl text and give an appropriate error message.
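To show the brace-counting logic on its own, here is a rough C sketch that finds the closing } of a VHDL_TEXT block while ignoring braces inside -- comments, "..." string literals and \...\ extended identifiers, as the <VHDL> rules above are meant to do. The function name find_block_end is hypothetical, and, like the rules above, it does not handle character literals such as '}'.

```c
#include <assert.h>
#include <stddef.h>

/*
 * `text` points just past the opening '{' of a VHDL_TEXT block.
 * Returns the offset of the matching '}', or -1 if the input ends first
 * (the case a <VHDL><<EOF>> rule would report).
 */
long find_block_end(const char *text)
{
    int depth = 1;
    const char *p = text;

    assert(text != NULL);
    while (*p != '\0') {
        if (p[0] == '-' && p[1] == '-') {
            /* Comment: skip to the end of the line. */
            while (*p != '\0' && *p != '\n')
                p++;
            continue;
        }
        if (*p == '"' || *p == '\\') {
            /* String literal or extended identifier on a single line. */
            const char *q = p + 1;
            while (*q != '\0' && *q != *p && *q != '\n')
                q++;
            if (*q == *p) {
                p = q + 1; /* complete literal: skip it whole */
                continue;
            }
            p++;           /* no closing delimiter: ordinary character */
            continue;
        }
        if (*p == '{')
            depth++;
        else if (*p == '}' && --depth == 0)
            return (long)(p - text);
        p++;
    }
    return -1;
}
```

For example, in the text `s <= "}"; }` the brace inside the string literal is skipped and the offset of the final brace is returned.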

Is there an option for flex to match whole words only?

I'm writing a lexer and I'm using Flex to generate it based on custom rules.
I want to match identifiers of sorts that start with a letter and then can have either letters or numbers. So I wrote the following pattern for them:
[[:alpha:]][[:alnum:]]*
It works fine, the lexer that gets generated recognizes the pattern perfectly, although it doesn't only match whole words but all appearances of that pattern.
So for example it would match the input "Text" and "9Text" (discarding that initial 9).
Consider the following simple lexer that accepts IDs as described above:
%{
#include <stdio.h>
#define LINE_END 1
#define ID 2
%}
/* Flex options: */
%option noinput
%option nounput
%option noyywrap
%option yylineno
/* Definitions: */
WHITESPACE [ \t]
BLANK {WHITESPACE}+
NEW_LINE "\n"|"\r\n"
ID [[:alpha:]][[:alnum:]_]*
%%
{NEW_LINE} {printf("New line.\n"); return LINE_END;}
{BLANK} {/* Blanks are skipped */}
{ID} {printf("ID recognized: '%s'\n", yytext); return ID;}
. {fprintf(stderr, "ERROR: Invalid input in line %d: \"%s\"\n", yylineno, yytext);}
%%
int main(int argc, char **argv) {
while (yylex() != 0);
return 0;
}
When compiled and fed the following input produces the output below:
Input:
Test
9Test
Output:
Test
ID recognized: 'Test'
New line.
9Test
ERROR: Invalid input in line 2: "9"
ID recognized: 'Test'
New line.
Is there a way to make flex match only whole words (i.e. delimited by either blanks or custom delimiters like '(' ')' for example)?
Because I could write a rule that excludes IDs that start with numbers, but what about the ones that start with symbols like "$Test" or "&Test"? I don't think I can enumerate all of the possible symbols.
Following the example above, the desired output would be:
Test
ID recognized: 'Test'
New line.
9Test
ERROR: Invalid input 2: "9Test"
New line.
You seem to be asking two questions at once.
'Whole word' isn't a recognized construct in programming languages. The lexical structure and grammar are already defined. Just implement them.
The best way to handle illegal or unexpected characters in flex is not to handle them specially at all. Return them to the parser, just as you would for a special character. Then the parser can deal with it and attempt recovery via discarding.
Place this as your final rule:
. return yytext[0];
You can use this.
Let's say you want to identify the reserved word for:
([\r\n\z]|" "|"")+"for"/([\r\n\z]|" ")+ {}
any new line character or generally a control character [\r\n\z]
or a white space " "
or the beginning of the line ""
for at least 1 time +
the word you want in quotes "for"
only followed by /
almost the same expression without the "" at least 1 time -> ([\r\n\z]|" ")+
With this code you can form your own matching pattern for whatever you need to do before and after the word.
I'm not sure if this is the best answer, but this works for me.
%x ERROR
%%
{NEW_LINE} {
printf("New line.\n");
return LINE_END;
}
<INITIAL,ERROR>{BLANK} {
BEGIN(INITIAL);
}
{ID} {
printf("ID recognized: '%s'\n", yytext);
return ID;
}
<INITIAL,ERROR>. {
fprintf(stderr, "ERROR: Invalid input in line %d: \"%s\"\n", yylineno, yytext);
BEGIN(ERROR);
}
%%
Read this to learn more about start conditions.
(My attempt at explaining what I've done)
Whenever this lexer hits something unexpected, it switches to the exclusive ERROR start condition, in which only the two rules marked <INITIAL,ERROR> are active. To get out of that condition, the lexer has to hit a blank.
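Another way to look at the "whole word" requirement is: first take the longest run of non-blank characters, then decide whether that whole run is an identifier. Here is a minimal C sketch of that check, with made-up helper names word_length and is_identifier. In flex, a catch-all rule such as [^ \t\r\n]+ placed after {ID} should have a similar effect, because flex prefers the longest match and, on a tie, the earlier rule, but treat that as a suggestion to try rather than a tested scanner.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the maximal run of non-blank characters starting at `s`. */
size_t word_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '\0' && s[n] != ' ' && s[n] != '\t' &&
           s[n] != '\n' && s[n] != '\r')
        n++;
    return n;
}

/*
 * Returns 1 if the first `n` characters of `s` match
 * [[:alpha:]][[:alnum:]_]* (the ID definition from the question), else 0.
 */
int is_identifier(const char *s, size_t n)
{
    if (n == 0 || !isalpha((unsigned char)s[0]))
        return 0;
    for (size_t i = 1; i < n; i++) {
        if (!isalnum((unsigned char)s[i]) && s[i] != '_')
            return 0;
    }
    return 1;
}
```

With these, "Test" is accepted as an ID, while "9Test" and "$Test" are each rejected as a single invalid word instead of being split into an error character plus an ID.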

Can't retrieve semantic values from Bison grammar file

I am trying to develop a language parser on CentOS 6.0 by means of Bison 3.0 (C parser generator), Flex 2.5.35 and gcc 4.4.7. I have the following Bison grammar file:
%{
#include <stdio.h>
%}
%union {
int int_t;
char* str_t;
}
%token SEP
%token <str_t> ID
%start start
%type <int_t> plst
%%
start: plst start
| EOS { YYACCEPT; }
;
// <id> , <id> , ... , <id>
plst: ID SEP_PARAMS plst { printf("Rule 1 %s %s \n",$1,$2); }
| ID { printf("Rule 2 %s \n", $1); }
| /* empty */ { }
;
%%
int yyerror(GNode* root, const char* s) {printf("Error: %s", s);}
The problem
As it is now, it is not really a meaningful one, but I think it is enough to understand my problem. Consider that I have a scanner written in Flex which recognizes my tokens. This grammar file is used to recognize simple identifier lists like: id1,id2,...,idn. My problem is that in each grammar rule, when I try to get the value of the identifier (the string representing the name of the identifier), I get a NULL pointer, as my printfs also show.
What am I doing wrong? Thank you.
Edit
Thanks to recent answers, I now understand that the problem is closely related to Flex and its configuration file. In particular, I have edited my lex file in order to meet the specifications described by the Flex Manual for Bison Bridging:
{ID} { printf("[id-token]");
yylval->str_t = strdup(yytext);
return ID; }
However after running Bison, then Flex (providing the --bison-bridge option) and then the compiler, I execute the generated parser and I instantly get Segmentation Fault.
What's the problem?
The flex option --bison-bridge (or %option bison-bridge) matches up to the bison option %define api.pure. You need to use either BOTH bison-bridge and api.pure or NEITHER -- either way can work, but they need to be consistent. Since it appears you are NOT using api.pure, you want to delete the --bison-bridge option.
The values for $1, $2 etc. have to be set by the lexer.
If you have a rule in the lexer for identifiers, like
ID [a-z][a-z0-9]*
%%
{ID} { return ID; }
the semantic values are not set.
You have to do e.g.
{ID} { /* Set the union's value, used by e.g. `$1` in the parser */
yylval.str_t = strdup(yytext);
return ID;
}
Remember to free the value in the parser, as strdup allocates memory.
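The difference between the two conventions mentioned in these answers can be shown without flex or bison at all. In the sketch below, every name (semantic_value, demo_yylval, lex_pure, lex_global, TOKEN_ID) is a stand-in invented for illustration, not generated code. With the bridge or pure style the lexer fills a slot whose address the parser passes in. With the default style both sides share a global. Mixing the two means the lexer writes through a pointer nobody supplied, which is consistent with the segmentation fault described in the question.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the type bison generates from %union. */
typedef union { int int_t; char *str_t; } semantic_value;

enum { TOKEN_EOF = 0, TOKEN_ID = 258 };

/* Stand-in for the global yylval used by a non-pure parser. */
semantic_value demo_yylval;

/* Heap copy of the first n characters of s; the caller must free it. */
static char *copy_text(const char *s, size_t n)
{
    char *copy = malloc(n + 1);
    assert(copy != NULL);
    memcpy(copy, s, n);
    copy[n] = '\0';
    return copy;
}

/* Pure (bridge) style: the caller passes the slot to fill. */
int lex_pure(const char **cursor, semantic_value *lvalp)
{
    const char *p = *cursor;

    while (*p == ' ' || *p == ',')   /* skip separators */
        p++;
    if (!isalpha((unsigned char)*p)) {
        *cursor = p;
        return TOKEN_EOF;
    }
    const char *start = p;
    while (isalnum((unsigned char)*p))
        p++;
    lvalp->str_t = copy_text(start, (size_t)(p - start));
    *cursor = p;
    return TOKEN_ID;
}

/* Non-pure style: same scanner, but the value goes into the global. */
int lex_global(const char **cursor)
{
    return lex_pure(cursor, &demo_yylval);
}
```

In both styles the string is a fresh heap copy, so whoever consumes the token (the grammar action, in a real parser) is responsible for calling free on it.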
