Flex pattern for ID gives 'Segmentation fault' - c

I have a program in C that converts expression to RPN (reverse Polish notation).
All I need to do is replace the lexer code written in C with Flex. I have already done some of the work, but I have problems with patterns: word or variable IDs, to be specific. Yes, this is a class exercise.
This is what I have:
%{
#include "global.h"
int lineno = 1;
int tokenval = NONE;
%}
%option noyywrap
WS " "
NEW_LINE "\n"
DIGIT [0-9]
LETTER [a-zA-Z]
NUMBER {DIGIT}+
ID {LETTER}({LETTER}|{DIGIT})*
%%
{WS}+ {}
{NEW_LINE} { ++lineno; }
{NUMBER} { sscanf (yytext, "%d", &tokenval); return(NUM); }
{ID} { sscanf (yytext, "%s", &tokenval); return(ID); }
. { return *yytext;}
<<EOF>> { return (DONE); }
%%
and defined in global.h
#define BSIZE 128
#define NONE -1
#define EOS '\0'
#define NUM 256
#define DIV 257
#define MOD 258
#define ID 259
#define DONE 260
Everything works when I use digits, brackets and operators, but when I type, for example, a+b, it gives me a segmentation fault (the output should be ab+).
Please don't ask me for the parser code (I can share it if really needed): the requirement is to ONLY implement the lexer using Flex.

The problem is that the program is doing an sscanf with a string format (%s) into the address of an integer (&tokenval). You should change that to an array of char, e.g.,
%{
#include "global.h"
int lineno = 1;
int tokenval = NONE;
char tokenbuf[132];
%}
and
{ID} { sscanf (yytext, "%s", tokenbuf); return(ID); }
(though strcpy is a better choice than sscanf, this is just a starting point).

When flex scans a token matching pattern ID, the associated action attempts to copy the token into a character array at location &tokenval. But tokenval has type int, so the code has undefined behavior. Moreover, if the length of the ID equals or exceeds the size of an int, then you cannot fit all its bytes (including a string terminator) in the space occupied by an int. A reasonably likely result is that you attempt to write past its end, which could result in a segfault.

Related

Postfix calculator using words instead of operators

I need to create a postfix calculator using a stack, where the user writes the operators as words.
Like:
9.5 2.3 add =
or
5 3 5 sub div =
My problem is that I can't work out which function I should use to scan the input, because it's a mix of numbers, words and a character (=).
What you want to do is essentially to write a parser.
First, use fgets to read a complete line. Then use strtok to get tokens separated by whitespace.
After that, check if the token is a number or not. You can do that with sscanf. Check the return value to see whether the conversion to a number was successful. If it was not, check if the string is equal to "add", "sub", "=" etc. If it's not a number or one of the approved operations, generate an error. You don't have to treat strings of length 1 (i.e. single characters) differently.
My problem is that I can't work out which function I should use to scan the input, because it's a mix of numbers, words and a character (=).
But all of these are separated by whitespace. You could tokenize based on that and then build up a parse tree manually with strcmp and strtol, or simply compare the first character of each token (assuming that keywords cannot start with a digit and there are no variables).
See strtok(_r). The "Example" section explains how to use it in depth, but as an extract without error handling and corner cases:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
int main(void)
{
char eq[] = "5 3 5 sub div =";
for (char *tok = strtok(eq, " "); tok != NULL; tok = strtok(NULL, " ")) {
if (isdigit(tok[0]))
printf("token-num: %s\n", tok);
else if (tok[0] == '=')
printf("token-eq: =\n");
else
printf("token-op: %s\n", tok);
}
return EXIT_SUCCESS;
}

Flex - identify float/ int/ id tokens

I'm trying to create a flex file that will recognize floats, integers and ids:
Valid int: not allowed to start with 0.
Valid float: its representation must include an exponent whose value is an integer with or without a sign, e.g. 2.78e+10.
Valid id: can only start with a lower-case letter, and several underscores cannot appear one after another.
I am not sure where I'm going wrong. If the input contains only a float, only an int or only an id, I get the expected token back, but when everything is combined in one file it doesn't work.
This is the file that I created:
%option noyywrap
%{
#include "Token.h"
#include <stdio.h>
#include <stdlib.h>
static int skip_single_line_comment(int num); //function for one line comment
static int skip_multiple_line_comment(int num);//function for multiple lines of comments
int line_num=0;
%}
ALPHA ([a-zA-Z])
DIGIT ([0-9])
Sign ([+|-])
Expo ([e]{Sign}?)
float_num ([1-9]+(\.({DIGIT}+{Expo}{DIGIT}+)))
int_num [1-9]{DIGIT}+
id ([a-z]+({ALPHA}|{DIGIT}|(\_({ALPHA}|{DIGIT})))*)
%%
{float_num} {
create_and_store_token(TOKEN_FLOAT, yytext, line_num);
fprintf(yyout,"Line %d : found token of type TOKEN_FLOAT, lexeme %s.\n", line_num, yytext);
}
\n {line_num++;}
{int_num} {
create_and_store_token(TOKEN_INTEGER, yytext, line_num);
fprintf(yyout,"Line %d : found token of type TOKEN_INTEGER , lexeme %s.\n", line_num, yytext);
}
{id} {
create_and_store_token(TOKEN_ID, yytext, line_num);
fprintf(yyout,"Line %d : found token of type TOKEN_ID, lexeme %s.\n", line_num, yytext);
}
"//" {line_num=skip_single_line_comment(line_num); fprintf(yyout,"The number of the line is:%d.\n", line_num);}
"/*" {line_num=skip_multiple_line_comment(line_num); fprintf(yyout,"The number of the line is:%d.\n", line_num);}
%%
static int
skip_single_line_comment(int num)
{
char c;
/* Read until we find \n or EOF */
while((c = input()) != '\n' && c != EOF)
;
/* Maybe you want to place back EOF? */
if(c == EOF)
unput(c);
return num=num+1;
}
static int
skip_multiple_line_comment(int num)
{
char c;
for(;;)
{
switch(input())
{
/* We expect ending the comment first before EOF */
case EOF:
fprintf(stderr, "Error unclosed comment, expect */\n");
exit(-1);
goto done;
break;
/* Is it the end of comment? */
case '*':
if((c = input()) == '/'){
num=num+1;
goto done;
}
unput(c);
break;
default:
/* skip this character */
break;
}
}
done:
/* exit entry */
return num ;
}
void main(int argc, char **argv){
yyin=fopen("C:\\temp\\test1.txt","r");
yyout=fopen("C:\\temp\\test1Soltion.txt","w");
yylex();}
the input file:
21
41.e-21
a_23_e4_5
8
1.1E+21
a1_c23_e4_56
The output:
Line 0 : found token of type TOKEN_INTEGER , lexeme 21.
Line 1 : found token of type TOKEN_INTEGER , lexeme 41.
.Line 1 : found token of type TOKEN_ID, lexeme e.
-Line 1 : found token of type TOKEN_INTEGER , lexeme 21.
Line 2 : found token of type TOKEN_ID, lexeme a_23_e4_5.
81.1E+Line 4 : found token of type TOKEN_INTEGER , lexeme 21.
Line 5 : found token of type TOKEN_ID, lexeme a1_c23_e4_56.
You have several problems in your code (from top to bottom):
Sign is wrong: [+|-] says that a sign is one of +, | or -, because every character between the square brackets [ and ] is a member of the set. You could use (\+|-) or [+-], but not what you have written. For the - to be taken literally rather than as a range indicator, put it next to one of the square brackets that delimit the character set (preferably the last one, so that you can still use the negation character ^ without interference).
An exponent trailer to a floating point number allows both e and E, so the actual regexp should be [eE].
A floating point number can begin with 0. You can have something like -00013.26 and it is valid.
Your floating point number is forced to have digits on both sides of the dot, so you will not recognize anything like 3. or .26 as a floating point number. You have written ([1-9]+(\.({DIGIT}+{Expo}{DIGIT}+))), which accepts one or more digits from the set [1-9] (so you disallow 0 in front of the decimal point), followed by a dot, at least one digit after the dot, and then a mandatory exponent. This is why 41.e-21 is not recognized as a floating point number. Even 40.25 will not be recognized as floating point: it is scanned as the integer 40, a dot (which is echoed to the output by default) and then the integer 25.
You don't allow a sign in front of a number (this is common in compiler implementations, but not when reading number sequences as you are trying to do). This is the reason that 41.e-21 is parsed as Int(41), . (echoed), e (identifier), - (echoed, because it is not valid as a sign here, since you don't allow signed integers) and 21 as an integer.
You don't accept integers of fewer than two digits: the + in [1-9]{DIGIT}+ requires one non-zero digit followed by at least one more digit. This causes the mess you have on lines four and five.
So the only thing you recognize correctly is the identifier, which by the way has to begin with a lower-case letter; you have not included support for identifiers beginning with an upper-case letter.

Lex/flex program to count ids, statements, keywords, operators etc

%{
#undef yywrap
#define yywrap() 1
#include<stdio.h>
int statements = 0;
int ids = 0;
int assign = 0;
int rel = 0;
int keywords = 0;
int integers = 0;
%}
DIGIT [0-9]
LETTER [A-Za-z]
TYPE int|char|bool|float|void|for|do|while|if|else|return|void
%option yylineno
%option noyywrap
%%
\n {statements++;}
{TYPE} {/*printf("%s\n",yytext);*/keywords++;}
(<|>|<=|>=|==) {rel++;}
'#'/[a-zA-Z0-9]* {;}
[a-zA-Z]+[a-zA-Z0-9]* {printf("%s\n",yytext);ids++;}
= {assign++;}
[0-9]+ {integers++;}
. {;}
%%
void main(int argc, char **argv)
{
FILE *fh;
if (argc == 2 && (fh = fopen(argv[1], "r"))) {
yyin = fh;
}
yylex();
printf("statements = %d ids = %d assign = %d rel = %d keywords = %d integers = %d \n",statements,ids,assign,rel,keywords,integers);
}
//Input file.c
#include<stdio.h>
void main(){
float a123;
char a;
char b123;
char c;
int ab[5];
int bc[2];
int ca[7];
int ds[4];
for( a = 0; a < 5 ;a++)
printf("%d ", a);
return 0;
}
output:
include
stdio
h
main
a123
a
b123
c
ab
bc
ca
ds
a
a
a
printf
d
a
statements = 14 ids = 18 assign = 1 rel = 3 keywords = 11 integers = 7
I am printing the identifiers on the go. #include<stdio.h> is being counted as identifier. How do I avoid this?
I have tried '#'/[a-zA-Z0-9]* {;} rule:action pair but it is still being counted as identifier. How is the file being tokenized?
Also the %d string in printf is being counted as an identifier. I have explicitly written that identifiers should only begin with letters, then why is %d being inferred as identifier?
I have tried '#'/[a-zA-Z0-9]* {;} rule:action pair but it [include] is still being counted as identifier. How is the file being tokenized?
Tokens are recognized one at a time. Each token starts where the previous token finished.
'#'/[a-zA-Z0-9]* matches '#' provided it is followed by [a-zA-Z0-9]*. You probably meant "#"/[a-zA-Z0-9]* (with double quotes) which would match a #, again provided it is followed by a letter or digit. Note that only the # is matched; the pattern after the / is "trailing context", which is basically a lookahead assertion. In this case, the lookahead is pointless because [a-zA-Z0-9]* can match the empty string, so any # would be matched. In any event, after the # is matched as a token, the scan continues at the next character. So the next token would be include.
Because of the typo, that pattern does not match. (There are no apostrophes in the source.) So what actually matches is your "fallback" rule: the rule whose pattern is a single dot (.). (We call this a fallback rule because it matches anything. Really, it should be .|\n, since . matches anything but a newline, but as long as you have some rule which matches a newline character, it's acceptable to use just the dot. If you don't supply a fallback rule, one will be inserted automatically by flex with the action ECHO.)
Thus, the # is ignored (just as it would have been if you'd written the rule as intended) and again the scan continues with the token include.
If you wanted to ignore the entire preprocessor directive, you could do something like
^[[:blank:]]*#.* { ; }
(from a comment) I am getting stdio and h as keywords, how does that fit the definition that I have given? What happened to the . in between?
After the < is ignored by the fallback rule, stdio is matched. Since [a-zA-Z]+[a-zA-Z0-9]* doesn't match anything other than letters and digits, the . is not considered part of the token. Then the . is matched and ignored by the fallback rule, and then h is matched.
Also the %d string in printf is being counted as an identifier.
Not really. The % is explicitly ignored by the fallback rule (as was the ") and then the d is matched as an identifier. If you want to ignore words in string literals, you will have to recognise and ignore string literals.
The #include directive is a preprocessor directive and is therefore handled by the preprocessor. The preprocessor includes the header file and removes the #include directive, so after preprocessing, when the program is given to the compiler as input, it no longer contains any preprocessor directives such as #include.
So you don't need to write code to detect #include: the compiler never sees it, and it is not designed to tokenize the #include directive.
References: Is #include a token of type keyword?
Adding the following line to the rules section works for me:
#.* ;
Here the rule is #.* and the action is ;. The pattern #.* matches from a # to the end of the line, and ; does nothing, so this ignores lines starting with #. Note that without a leading ^ it also matches a # that appears in the middle of a line.

How to write regex for c type integer in lex?

I am trying to write C parser code in lex:
%{
/* this program does the job for identifying C type integer and floats*/
%}
%%
[\t ]+ /* ignore whitespace */ ;
[0-9][5]+ { printf ("\"%s\" out of range \n", yytext); }
[-+][0-9][5] { printf ("\"%s\" is a C integers\n", yytext); }
[-+]?[0-9]*\.?[0-9]+ { printf ("\"%s\" is a float\n", yytext); }
\n ECHO; /* which is the default anyway */
%%
I am facing a problem in identifying C type integers because they have a limit, i.e. 32767. So I have used a regex that is meant to match numbers of more than 5 digits and yell an "out of range" error, but it's a hack and not a perfect solution.
This might be provably impossible to do right. Regular expressions make up a fairly simplistic kind of recognition (a state machine with no memory), and are really only good for tokenization/lexical analysis. You're trying to use it for type-checking, which requires a lot more horsepower.
I'd save this for the parser itself. It will be far easier, with the symbol table filled in (to know what kind of variable you want to assign to) and can check the actual value of the integer and compare it to the upper and lower bounds.
This is working :)
%{
/* this program does the job for identifying C type integer and floats*/
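#include <stdio.h>
#include <stdlib.h>
%}
%%
[\t ]+ /* ignore whitespace */ ;
[-+]?[0-9]+ { /* listed before the float rule: on an equal-length match the earlier rule wins */
  /* strtol clamps to LONG_MIN/LONG_MAX on overflow, so huge values are still flagged */
  long v = strtol(yytext, NULL, 10);
  if (v < -32767 || v > 32767) { printf("OUT OF RANGE INTEGER"); } else { printf("INTEGER"); }
}
[-+]?[0-9]*\.?[0-9]+ { printf ("\"%s\" is a float\n", yytext); }
\n ECHO; /* which is the default anyway */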
%%

Get String token value in flex and bison more than it is intended

I tried to get the value of the token in bison, but it seems that I get more than one token at once.
Here is my flex code:
%{
#include <stdio.h>
#include "y.tab.h"
//YYSTYPE yylval;
%}
semicolon [;]
var [a-c]
digit [0-9]+
string [a-zA-Z]+
%%
Counter {yylval = yytext; return VAR;}
[a-zA-Z0-9]+ { yylval = yytext; return STRING;}
....
Here is my bison code :
%{
#define YYSTYPE char *
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
int limit;
int input;
int count=0;
char a[20];
char message[200];
%}
%token DIGIT VAR OPENPAR CLOSEPAR PETIK
%token WRITELN DO FOR BEGINKEY END TO EQUAL
%token SEMICOLON VARKEY COLON TYPE STRING READLN
%start program
%%
program: dlist slist {printf("L3: HALT");}
;
dlist: /* nothing */
| decl dlist
;
decl: VARKEY VAR COLON TYPE SEMICOLON
;
slist: stmt
| stmt slist
| BEGINKEY FOR VAR EQUAL DIGIT TO DIGIT DO slist END
{
printf("\nBeginFunc\n");
printf("t%d = %d;\n",count,$5);
printf("%s = t%d\n",$3,count);
....
The problem appears when I input writeln('forloop');. The program should only get forloop, but it gets forloop');
But when I input it line by line, like this:
forloop
'
)
;
It shows only forloop
What may cause this problem?
You will have to process or duplicate yytext before you pass it to bison. bison will request look ahead tokens from the scanner and those will overwrite any yytext.
For identifiers typically strdup or an ANSI-C equivalent is used. If the language has only one namespace or the namespaces can be distinguished in the scanner already, then it is customary to build the symbol table(s) in the scanner directly and pass a number of the identifier only.
For numbers typically the value of the number is determined and passed to the parser.
Some of the terms above may be unfamiliar to you, but it will be worthwhile for you to investigate what they mean.
This is really an FAQ, and seeing it asked again and again makes me wonder why this kind of question is not flagged as a duplicate. Even Bison has an FAQ entry about it: http://www.gnu.org/software/bison/manual/html_node/Strings-are-Destroyed.html
