Flex not counting lines properly on multiline comments - c

I'm using the following regex to identify multiline comments in Flex:
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/] { /* DO NOTHING */ }
But it seems to me that flex/bison is not maintaining the line counter properly.
For example:
Input:
1 ___bqmu7ftc
2 // _qXnFEgQL9Zsyn8Ohtx7zhToLK68xbu3XRrOvRi
3 /* "{ output 6 = <=W if u7 do nN)T!=$||JN,a9vR)7"
4 -758939
5 -31943.6165480
6 // "RND"
7 '_'
8 */
9 [br _int]
Output:
1 TK_IDENT [___bqmu7ftc]
4 [
4 TK_IDENT [br]
4 TK_IDENT [_int]
4 ]
The line should be 9 instead of 4.
Any ideas?

I don't know how you generated the test output in your question, but here's an (almost) minimal example of how to use yylineno. It works fine for me:
%{
#define ID 257
%}
%option yylineno
%option noinput nounput noyywrap
%%
[[:space:]]+ { /* DO NOTHING */ }
"//".* { /* DO NOTHING */ }
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/] { /* DO NOTHING */ }
[[:alpha:]_][[:alnum:]_]* { return ID; }
. { return *yytext; }
%%
int main(int argc, char** argv) {
    for (;;) {
        int token = yylex();
        switch (token) {
            case 0:  printf("%4d: %s\n", yylineno, "EOF"); return 0;
            case ID: printf("%4d: %-4s[%s]\n", yylineno, "ID", yytext); break;
            default: printf("%4d: %c\n", yylineno, token); break;
        }
    }
}
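If you save this as, say, lineno.l (a file name chosen just for this example), it can be built and run like this; no -lfl is needed because the file supplies its own main and uses %option noyywrap:
flex lineno.l
cc -o lineno lex.yy.c
./lineno < input.txt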

This is the solution I found in the Flex manual.
Remember to declare int comment_caller; (and the line_num counter used below) in your definitions section.
%x comment
%x foo
%%
"/*"                  {
                        comment_caller = INITIAL;
                        BEGIN(comment);
                      }
<foo>"/*"             {
                        comment_caller = foo;
                        BEGIN(comment);
                      }
<comment>[^*\n]*      { /* eat anything that's not a '*' */ }
<comment>"*"+[^*/\n]* { /* eat up '*'s not followed by '/'s */ }
<comment>\n           { ++line_num; }
<comment>"*"+"/"      BEGIN(comment_caller);

I had the same problem with multi-line comments in flex. I used the regex that was suggested in this Stack Overflow question (it is the same as the regex you mentioned in your question).
This regex also consumes the newlines inside a multi-line comment. So if you are tracking the current line number by counting \n characters, you will run into trouble: the regular expression matches the whole multi-line comment at once, so your newline-counting rule never sees those newlines.
I found another way to keep the line count correct even with this regular expression, explained below.
Flex keeps the matched text in the yytext variable, so we can count the number of newlines in the matched comment ourselves. That worked perfectly with every input I tested.
Here is my code:
Note: numberOfCurrentLine is the global variable I use to track the current line number (strchr also needs <string.h> in the definitions section).
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/] {
    // Count the occurrences of '\n' in the matched comment and add that
    // to numberOfCurrentLine so the line count stays correct.
    char *str = yytext;
    int i = 0;
    char *pch = strchr(str, '\n');
    while (pch != NULL) {
        i++;
        pch = strchr(pch + 1, '\n');
    }
    numberOfCurrentLine += i;
}
This code counts the number of \n characters in the matched comment and adds that to the global variable that tracks the current line number.
The code for counting number of occurrences of a char that I used above is from this post.
So with the above code, I always have the right number of the current line and the code works perfectly.

Related

Regular expression in FLEX finding text

I have a lex file with these rules:
%option noyywrap
%{
%}
LNA [^<>]
LNANA [^<>!]
%%
(<!!) fprintf(yyout, "begin_comment\t\t\t%s\n", yytext);
(!!>) fprintf(yyout, "end_comment\t\t\t%s\n", yytext);
({LNANA}*|({LNA}{LNANA})*|{LNA}+{LNANA}{LNANA}{LNA}) fprintf(yyout, "string\t\t\t%s\n", yytext);
. fprintf(yyout, "illegal char %s\n", yytext);
%%
I need to find comments between "<!!" and "!!>", and the plain strings in the code outside of them.
for example
<!! This is a comment that need to be found !!>
simple string that need to be found also
and this is my output:
As you can see, this does not work as needed.
Any help?
I'm not sure exactly what you're trying for.
There's certainly a regular expression which matches an entire comment (as long as you don't intend comments to nest). But it's hard to get it right, and you typically end up splitting strings and returning more tokens than necessary. Here's one which I think works, although it's not fully tested. Since you need to match the entire comment, the pattern has to include the comment delimiters. Of course, you also have to match the strings between the comments, as well as doing something in the case that a comment is not correctly terminated.
"<!!"([^!]*!)([^!]+!)*!+([^!>][^!]*!([^!]+!)*!+)*">"  { /* Comment */ }
"<!!"   { /* This pattern will match on unterminated comments */ }
[^<]+   { /* Non-comment text (but maybe not the whole string) */ }
"<"     { /* Also non-comment text */ }
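To experiment with these rules, a minimal self-contained test harness might look like the sketch below. It just prints what each rule matched instead of returning tokens, so it is an illustration rather than part of the answer; the literal delimiters are quoted so that flex does not read a leading < as the start of a start-condition prefix.
%option noyywrap noinput nounput
%%
"<!!"([^!]*!)([^!]+!)*!+([^!>][^!]*!([^!]+!)*!+)*">"  { printf("comment: [%s]\n", yytext); }
"<!!"   { printf("unterminated comment\n"); }
[^<]+   { printf("text: [%s]\n", yytext); }
"<"     { printf("text: [%s]\n", yytext); }
%%
int main(void) { return yylex(); }
With the example input from the question, the first line should come out as a single comment match and the rest as one plain-text match (newlines included).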
A possibly clearer and probably slower version uses a start condition, and returns both the insides of comments and the rest of the text in single pieces (in yytext, as per the yylex interface).
%x IN_COMMENT
%%
"<!!"                 { BEGIN(IN_COMMENT);
                        yytext[yyleng -= 3] = 0;
                        if (yyleng) return STRING;
                      }
  /* This pattern deliberately fails if it reaches the last input character */
([^<]+|<)/(.|\n)      { yymore(); }
  /* The next pattern is to catch the last character in the input */
.|\n                  { return STRING; }
<IN_COMMENT>!!>       { BEGIN(INITIAL);
                        yytext[yyleng -= 3] = 0;
                        return COMMENT;
                      }
<IN_COMMENT>[^!]+|!   { yymore(); }
<IN_COMMENT><<EOF>>   { fputs("Unterminated comment\n", stderr); yyterminate(); }

Lex: Match multiple regexes at the same time

I have the code below:
I want to count the number of characters using
(.) {charCount++;}
and at the same time count the number of words using
([a-zA-Z0-9])* {wordcount++;}
Is this possible using lex rules, or do I have to count it manually/programmatically using the file stream in C code? Basically, is there something like a "continue matching with another regex" facility?
%{
int wordcount = 0, charCount = 0;
%}
%%
[\t ]+          ; // ignore whitespace
"\n"            ; // ignore newlines
([a-zA-Z0-9])*  { wordcount++; }
(.)             { charCount++; }
%%
int yywrap(void) { return 1; }
int main()
{
    // The function that starts the analysis
    yyin = fopen("input.txt", "r");
    yylex();
    printf("The number of words in the file is %d and charCount %d", wordcount, charCount);
    return 0;
}
The number of characters matched by a rule is available in yyleng, so you can do this:
[ \t\n] ;
[a-zA-Z0-9]+ { ++wordcount; charcount += yyleng; }
. { ++charcount; }
But (f)lex is not designed to do multiple scans over the input. So there's no simple general solution.
FWIW, I'd use [[:alnum:]] instead of [a-zA-Z0-9] and [[:space:]] instead of [ \t\n].
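Putting those pieces together, a complete little counter might look like the sketch below (the input file name and the spirit of the variable names are carried over from the question):
%{
#include <stdio.h>
int wordcount = 0, charcount = 0;
%}
%option noyywrap noinput nounput
%%
[[:space:]]+    { /* whitespace is not counted */ }
[[:alnum:]]+    { ++wordcount; charcount += yyleng; }
.               { ++charcount; }
%%
int main(void)
{
    yyin = fopen("input.txt", "r");
    if (!yyin) { perror("input.txt"); return 1; }
    yylex();
    printf("words: %d, characters: %d\n", wordcount, charcount);
    return 0;
}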

When using multiple buffers with Flex, how do I avoid having tokens get split between buffers

Let's assume I have a simple grammar of positive integers and alphabetic strings separated by commas. I want to parse this grammar using Flex and Bison, and I want to use multiple input buffers with Flex, for whatever reasons (maybe data is arriving over the network or a serial line or whatever). The problem I'm seeing is that when a string or an integer (which are both variable length tokens) are split between the end of one buffer and the start of the next, the lexer reports two tokens when there should only be one.
In the example below, the chunks are 10, asdf and g,. If that was all in one buffer it would yield the tokens INT(10) COMMA STR(asdfg) COMMA. But with the 'g' in a different buffer from the 'asdf', the lexer actually produces INT(10) COMMA STR(asdf) STR(g) COMMA. It appears the logic upon reaching the end of the buffer is (1) check if the input matches a token, (2) refill the buffer. I feel it should be the other way around: (2) refill the buffer, (1) check if the input matches a token.
I want to make sure I'm not doing something silly with the way I'm changing the buffers.
stdout/stderr:
read_more_input: Setting up buffer containing: 10,
--accepting rule at line 48 ("10")
Starting parse
Entering state 0
Reading a token: Next token is token INT_TERM ()
Shifting token INT_TERM ()
Entering state 1
Return for a new token:
--accepting rule at line 50 (",")
Reading a token: Next token is token COMMA ()
Shifting token COMMA ()
Entering state 4
Reducing stack by rule 2 (line 67):
$1 = token INT_TERM ()
$2 = token COMMA ()
-> $$ = nterm int_non_term ()
Stack now 0
Entering state 3
Return for a new token:
--(end of buffer or a NUL)
--EOF (start condition 0)
read_more_input: Setting up buffer containing: asdf
--(end of buffer or a NUL)
--accepting rule at line 49 ("asdf")
Reading a token: Next token is token STR_TERM ()
Shifting token STR_TERM ()
Entering state 6
Return for a new token:
--(end of buffer or a NUL)
--EOF (start condition 0)
read_more_input: Setting up buffer containing: g,
--accepting rule at line 49 ("g")
Reading a token: Next token is token STR_TERM ()
syntax errorError: popping token STR_TERM ()
Stack now 0 3
Error: popping nterm int_non_term ()
Stack now 0
Cleanup: discarding lookahead token STR_TERM ()
Stack now 0
Lex file:
%{
#include <stdbool.h>
#include "yacc.h"
bool read_more_input(yyscan_t scanner);
%}
%option reentrant bison-bridge
%%
[0-9]+ { yylval->int_value = atoi(yytext); return INT_TERM; }
[a-zA-Z]+ { yylval->str_value = strdup(yytext); return STR_TERM; }
, { return COMMA; }
<<EOF>> {
    if (!read_more_input(yyscanner)) {
        yyterminate();
    }
}
Yacc file:
%{
// This appears to be a bug. This typedef breaks a dependency cycle between the headers.
// See https://stackoverflow.com/questions/44103798/cyclic-dependency-in-reentrant-flex-bison-headers-with-union-yystype
typedef void * yyscan_t;
#include <stdbool.h>
#include "yacc.h"
#include "lex.h"
%}
%define api.pure full
%lex-param {yyscan_t scanner}
%parse-param {yyscan_t scanner}
%define api.push-pull push
%union {
int int_value;
char * str_value;
}
%token <int_value> INT_TERM
%type <int_value> int_non_term
%token <str_value> STR_TERM
%type <str_value> str_non_term
%token COMMA
%%
complete : int_non_term str_non_term { printf(" === %d === %s === \n", $1, $2); }
int_non_term : INT_TERM COMMA { $$ = $1; }
str_non_term : STR_TERM COMMA { $$ = $1; }
%%
char * packets[] = {"10,", "asdf", "g,"};
int current_packet = 0;

bool read_more_input(yyscan_t scanner) {
    if (current_packet >= 3) {
        fprintf(stderr, "read_more_input: No more input\n");
        return false;
    }
    fprintf(stderr, "read_more_input: Setting up buffer containing: %s\n", packets[current_packet]);
    size_t buffer_size = strlen(packets[current_packet]) + 2;
    char * buffer = (char *) calloc(buffer_size, sizeof(char));
    memcpy(buffer, packets[current_packet], buffer_size - 2);
    yy_scan_buffer(buffer, buffer_size, scanner);
    current_packet++;
    return true;
}

int main(int argc, char** argv) {
    yyscan_t scanner;
    yylex_init(&scanner);
    read_more_input(scanner);
    yyset_debug(1, scanner);
    yydebug = 1;

    int status;
    yypstate *ps = yypstate_new();
    YYSTYPE pushed_value;
    do {
        status = yypush_parse(ps, yylex(&pushed_value, scanner), &pushed_value, scanner);
    } while (status == YYPUSH_MORE);
    yypstate_delete(ps);
    yylex_destroy(scanner);
    return 0;
}
That is not the expected use case for multiple buffers. Multiple input buffers are generally used to handle things like #include or even macro expansion, where the included text definitely should respect token boundaries. (Consider an #included file which has an unterminated comment...)
If you want to paste together inputs from different sources in a way which allows tokens to flow across buffer boundaries, redefine the YY_INPUT macro to meet your needs.
YY_INPUT is the macro hook for customising input: it is given a buffer and a maximum length, and it must copy that many bytes (or fewer) into the buffer and indicate how many bytes were provided (0 bytes is taken to mean end of input, at which point yywrap will be called).
YY_INPUT is expanded inside yylex so it has access to the yylex arguments, which includes the lexer state. yywrap in a reentrant lexer is called with the scanner state as an argument. So you can use the two mechanisms together, if you want to.
Unfortunately, this does not allow "zero-copy" buffer switching. But flex is not optimised for in-memory input buffers in general: you can provide flex with a buffer using yy_scan_buffer, but the buffer must be terminated with two NUL bytes and it will be modified during the scan, so the feature is rarely useful.
Here's a trivial example, which lets you set yylex up with a NULL-terminated argv-like array of strings, and proceeds to lex them all as a single input. (If you choose to use argv+1 to initialize this array, you'll notice that it runs tokens together from consecutive arguments.)
%{
#include <stdlib.h>  /* strtol */
#include <string.h>  /* memcpy, strnlen, strdup */
#include <parser.tab.h>
#define YY_EXTRA_TYPE char**
/* FIXME:
* This assumes that none of the string segments are empty
* strings (or for the feature-not-a-bug interpretation,
* it allows the list to be terminated by NULL or an empty string).
*/
#define YY_INPUT(buf,result,max_size) { \
char* segment = *yyextra; \
if (segment == NULL) result = 0; \
else { \
size_t avail = strnlen(segment, max_size); \
memcpy(buf, segment, avail); \
if (segment[avail]) *yyextra += avail; \
else ++yyextra; \
result = avail; \
} \
}
%}
%option reentrant bison-bridge
%option noinput nounput nodefault noyywrap
%%
[[:space:]]+ ;
[0-9]+ { yylval->number = strtol(yytext, 0, 10); return NUMBER; }
[[:alpha:]_][[:alnum:]_]* { yylval->string = strdup(yytext); return ID; }
. { return *yytext; }
%%
/* This function must be exported in some header */
void yylex_strings(char** argv, yyscan_t scanner) {
yyset_extra(argv, scanner);
}
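As a usage illustration (not part of the answer above), a driver that lexes the program's own command-line arguments might look like the sketch below. The header name lexer.h and the %union member types (long number, char* string) are assumptions on my part; the token names ID and NUMBER and the member names come from the scanner rules above, so adjust everything to your actual parser. With %option bison-bridge, yylex takes a pointer to the semantic value as its first argument, and yylex_init/yylex_destroy are the standard reentrant-scanner entry points.
#include <stdio.h>
#include <stdlib.h>
#include "parser.tab.h"   /* token ids and YYSTYPE from bison -d */
#include "lexer.h"        /* scanner header, e.g. from %option header-file="lexer.h" */

void yylex_strings(char** argv, yyscan_t scanner);   /* defined in the scanner above */

int main(int argc, char** argv) {
    (void)argc;
    yyscan_t scanner;
    yylex_init(&scanner);
    yylex_strings(argv + 1, scanner);   /* argv is already NULL-terminated */

    YYSTYPE value;
    int token;
    while ((token = yylex(&value, scanner)) != 0) {
        if (token == ID) {
            printf("ID(%s) ", value.string);
            free(value.string);
        } else if (token == NUMBER) {
            printf("NUMBER(%ld) ", (long)value.number);
        } else {
            printf("'%c' ", token);
        }
    }
    printf("\n");
    yylex_destroy(scanner);
    return 0;
}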

What is wrong with this Bison grammar?

I'm trying to build a Bison grammar and seem to be missing something. I have kept it very basic so far, but I am still getting a syntax error and can't figure out why.
Here is my Bison Code:
%{
#include <stdlib.h>
#include <stdio.h>
int yylex(void);
int yyerror(char *s);
%}
// Define the types flex could return
%union {
long lval;
char *sval;
}
// Define the terminal symbol token types
%token <sval> IDENT;
%token <lval> NUM;
%%
Program:
Def ';'
;
Def:
IDENT '=' Lambda { printf("Successfully parsed file"); }
;
Lambda:
"fun" IDENT "->" "end"
;
%%
int main() {
yyparse();
return 0;
}
int yyerror(char *s)
{
extern int yylineno; // defined and maintained in flex.flex
extern char *yytext; // defined and maintained in flex.flex
printf("ERROR: %s at symbol \"%s\" on line %i", s, yytext, yylineno);
exit(2);
}
Here is my Flex Code
%{
#include <stdlib.h>
#include <string.h>  /* strdup */
#include <errno.h>   /* errno, ERANGE */
#include "bison.tab.h"
%}
ID [A-Za-z][A-Za-z0-9]*
NUM [0-9][0-9]*
HEX [$][A-Fa-f0-9]+
COMM [/][/].*$
%%
fun|if|then|else|let|in|not|head|tail|and|end|isnum|islist|isfun {
printf("Scanning a keyword\n");
}
{ID} {
printf("Scanning an IDENT\n");
yylval.sval = strdup( yytext );
return IDENT;
}
{NUM} {
printf("Scanning a NUM\n");
/* Convert into long to loose leading zeros */
char *ptr = NULL;
long num = strtol(yytext, &ptr, 10);
if( errno == ERANGE ) {
printf("Number was to big");
exit(1);
}
yylval.lval = num;
return NUM;
}
{HEX} {
printf("Scanning a NUM\n");
char *ptr = NULL;
/* convert hex into decimal using offset 1 because of the $ */
long num = strtol(&yytext[1], &ptr, 16);
if( errno == ERANGE ) {
printf("Number was to big");
exit(1);
}
yylval.lval = num;
return NUM;
}
";"|"="|"+"|"-"|"*"|"."|"<"|"="|"("|")"|"->" {
printf("Scanning an operator\n");
}
[ \t\n]+ /* eat up whitespace */
{COMM}* /* eat up one-line comments */
. {
printf("Unrecognized character: %s at linenumber %d\n", yytext, yylineno );
exit(1);
}
%%
And here is my Makefile:
all: parser
parser: bison flex
gcc bison.tab.c lex.yy.c -o parser -lfl
bison: bison.y
bison -d bison.y
flex: flex.flex
flex flex.flex
clean:
rm bison.tab.h
rm bison.tab.c
rm lex.yy.c
rm parser
Everything compiles just fine; I do not get any errors running make all.
Here is my testfile
f = fun x -> end;
And here is the output:
./parser < a0.0
Scanning an IDENT
Scanning an operator
Scanning a keyword
Scanning an IDENT
ERROR: syntax error at symbol "x" on line 1
Since x seems to be recognized as an IDENT, the rule should be correct, yet I am still getting a syntax error.
I feel like I am missing something important, hopefully somebody can help me out.
Thanks in advance!
EDIT:
I tried removing the IDENT from the Lambda rule and from the test file; now it seems to get through the line, but it still throws
ERROR: syntax error at symbol "" on line 1
after the EOF.
Your scanner recognizes keywords (and prints out a debugging line, but see below), but it doesn't bother reporting anything to the parser. So they are effectively ignored.
In your bison definition file, you use (for example) "fun" as a terminal, but you do not provide the terminal with a name which could be used in the scanner. The scanner needs this name, because it has to return a token id to the parser.
To summarize, what you need is something like this:
In your grammar, before the %%:
%token T_FUN "fun"
%token T_IF "if"
%token T_THEN "then"
/* Etc. */
In your scanner definition:
fun { return T_FUN; }
if { return T_IF; }
then { return T_THEN; }
/* Etc. */
A couple of other notes:
Your scanner rule for recognizing operators also fails to return anything, so operators will also be ignored. That's clearly not desirable. flex and bison allow an easier solution for single-character operators, which is to let the character be its own token id. That avoids having to create a token name. In the parser, a single-quoted character represents a token-id whose value is the character; that's quite different from a double-quoted string, which is an alias for the declared token name. So you could do this:
"=" { return '='; }
/* Etc. */
but it's easier to do all the single-character tokens at once:
[;+*.<=()-] { return yytext[0]; }
and even easier to use a default rule at the end:
. { return yytext[0]; }
which will have the effect of handling unrecognized characters by returning an unknown token id to the parser, which will cause a syntax error.
This won't work for "->", since that is not a single character token, which will have to be handled in the same way as keywords.
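For example (T_ARROW is just a name chosen for illustration), in the grammar:
%token T_ARROW "->"
and in the scanner:
"->"    { return T_ARROW; }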
Flex will produce debugging output automatically if you use the -d flag when you create the scanner. That's a lot easier than inserting your own debugging printout, because you can turn it off by simply removing the -d option. (You can use %option debug instead if you don't want to change the flex invocation in your makefile.) It's also better because it provides consistent information, including position information.
Some minor points:
The pattern [0-9][0-9]* could more easily be written [0-9]+
The comment pattern "//".* does not require a $ lookahead at the end, since .* will always match the longest sequence of non-newline characters; consequently, the first unmatched character must either be a newline or the EOF. $ lookahead will not match if the pattern is terminated with an EOF, which will cause odd errors if the file ends with a comment without a newline at the end.
There is no point using {COMM}* since the comment pattern does not match the newline which terminates the comment, so it is impossible for there to be two consecutive comment matches. But anyway, after matching a comment and the following newline, flex will continue to match a following comment, so {COMM} is sufficient. (Personally, I wouldn't use the COMM abbreviation; it really adds nothing to readability, IMHO.)

Is there an option for flex to match whole words only?

I'm writing a lexer and I'm using Flex to generate it based on custom rules.
I want to match identifiers of sorts that start with a letter and then can have either letters or numbers. So I wrote the following pattern for them:
[[:alpha:]][[:alnum:]]*
It works fine, the lexer that gets generated recognizes the pattern perfectly, although it doesn't only match whole words but all appearances of that pattern.
So for example it would match the input "Text" and "9Text" (discarding that initial 9).
Consider the following simple lexer that accepts IDs as described above:
%{
#include <stdio.h>
#define LINE_END 1
#define ID 2
%}
/* Flex options: */
%option noinput
%option nounput
%option noyywrap
%option yylineno
/* Definitions: */
WHITESPACE [ \t]
BLANK {WHITESPACE}+
NEW_LINE "\n"|"\r\n"
ID [[:alpha:]][[:alnum:]_]*
%%
{NEW_LINE} {printf("New line.\n"); return LINE_END;}
{BLANK} {/* Blanks are skipped */}
{ID} {printf("ID recognized: '%s'\n", yytext); return ID;}
. {fprintf(stderr, "ERROR: Invalid input in line %d: \"%s\"\n", yylineno, yytext);}
%%
int main(int argc, char **argv) {
    while (yylex() != 0);
    return 0;
}
When compiled and fed the following input produces the output below:
Input:
Test
9Test
Output:
Test
ID recognized: 'Test'
New line.
9Test
ERROR: Invalid input in line 2: "9"
ID recognized: 'Test'
New line.
Is there a way to make flex match only whole words (i.e. delimited by either blanks or custom delimiters like '(' ')' for example)?
Because I could write a rule that excludes IDs that start with numbers, but what about the ones that start with symbols like "$Test" or "&Test"? I don't think I can enumerate all of the possible symbols.
Following the example above, the desired output would be:
Test
ID recognized: 'Test'
New line.
9Test
ERROR: Invalid input 2: "9Test"
New line.
You seem to be asking two questions at once.
'Whole word' isn't a recognized construct in programming languages. The lexical and grammatical rules are already defined; just implement them.
The best way to handle illegal or unexpected characters in flex is not to handle them specially at all. Return them to the parser, just as you would for a special character. Then the parser can deal with it and attempt recovery via discarding.
Place this as your final rule:
. return yytext[0];
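On the parser side, a typical way to "deal with it and attempt recovery" is an error production that discards input up to a synchronizing token. A hedged sketch, assuming a line-oriented bison grammar that consumes ID and LINE_END tokens like the ones in the question:
line
    : ID LINE_END
    | error LINE_END    { yyerrok; }   /* discard everything up to the next newline */
    ;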
You can use this
Let's say you want to identify the reserved word for:
([\r\n\z]|" "|"")+"for"/([\r\n\z]|" ")+ {}
any new line character or generally a control character [\r\n\z]
or a white space " "
or the beginning of the line ""
for at least 1 time +
the word you want in quotes "for"
only followed by /
almost the same expression without the "" at least 1 time -> ([\r\n\z]|" ")+
With this code you can form your own matching pattern for whatever you need to do before and after the word.
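A more conventional variant of the same idea (my own suggestion, not part of the answer above) uses flex's trailing-context operator with a character class, so the keyword only matches when the next character cannot continue a word:
"for"/[^[:alnum:]_]    { /* 'for' as a whole word */ }
Note that this only constrains what follows; what precedes is normally handled by having an identifier rule, since flex's longest-match behaviour makes something like xfor match as a single identifier. It also will not match a for that is the very last thing in the input, because the trailing context needs one more character to look at.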
I'm not sure if this is the best answer, but this works for me.
%x ERROR
%%
{NL} {
    printf("New line.\n");
    return LINE_END;
}
<INITIAL,ERROR>{BLANK} {
    BEGIN(INITIAL);
}
{ID} {
    printf("ID recognized: '%s'\n", yytext);
    return ID;
}
<INITIAL,ERROR>. {
    fprintf(stderr, "ERROR: Invalid input in line %d: \"%s\"\n", yylineno, yytext);
    BEGIN(ERROR);
}
%%
Read this to learn more about starting conditions.
(My attempt at explaining what I've done)
Whenever this lexer hits something unexpected, it switches to the exclusive ERROR start condition, in which only the two rules tagged <INITIAL,ERROR> are active. To get out of the error state, the lexer has to hit a 'blank'.
