Lex/flex program to count ids, statements, keywords, operators etc - c

%{
#undef yywrap
#define yywrap() 1
#include<stdio.h>
int statements = 0;
int ids = 0;
int assign = 0;
int rel = 0;
int keywords = 0;
int integers = 0;
%}
DIGIT [0-9]
LETTER [A-Za-z]
TYPE int|char|bool|float|void|for|do|while|if|else|return
%option yylineno
%option noyywrap
%%
\n {statements++;}
{TYPE} {/*printf("%s\n",yytext);*/keywords++;}
(<|>|<=|>=|==) {rel++;}
'#'/[a-zA-Z0-9]* {;}
[a-zA-Z]+[a-zA-Z0-9]* {printf("%s\n",yytext);ids++;}
= {assign++;}
[0-9]+ {integers++;}
. {;}
%%
void main(int argc, char **argv)
{
FILE *fh;
if (argc == 2 && (fh = fopen(argv[1], "r"))) {
yyin = fh;
}
yylex();
printf("statements = %d ids = %d assign = %d rel = %d keywords = %d integers = %d \n",statements,ids,assign,rel,keywords,integers);
}
//Input file.c
#include<stdio.h>
void main(){
float a123;
char a;
char b123;
char c;
int ab[5];
int bc[2];
int ca[7];
int ds[4];
for( a = 0; a < 5 ;a++)
printf("%d ", a);
return 0;
}
output:
include
stdio
h
main
a123
a
b123
c
ab
bc
ca
ds
a
a
a
printf
d
a
statements = 14 ids = 18 assign = 1 rel = 3 keywords = 11 integers = 7
I am printing the identifiers on the go. #include<stdio.h> is being counted as identifier. How do I avoid this?
I have tried '#'/[a-zA-Z0-9]* {;} rule:action pair but it is still being counted as identifier. How is the file being tokenized?
Also the %d string in printf is being counted as an identifier. I have explicitly written that identifiers should only begin with letters, then why is %d being inferred as identifier?

I have tried '#'/[a-zA-Z0-9]* {;} rule:action pair but it [include] is still being counted as identifier. How is the file being tokenized?
Tokens are recognized one at a time. Each token starts where the previous token finished.
'#'/[a-zA-Z0-9]* matches the three-character sequence '#' (apostrophes included, because an apostrophe is an ordinary character in a flex pattern), provided it is followed by [a-zA-Z0-9]*. You probably meant "#"/[a-zA-Z0-9]* (with double quotes), which would match a #, again provided it is followed by letters or digits. Note that only the # is matched; the pattern after the / is "trailing context", which is basically a lookahead assertion. In this case, the lookahead is pointless because [a-zA-Z0-9]* can match the empty string, so any # would be matched. In any event, after the # is matched as a token, the scan continues at the next character. So the next token would be include.
Because of the typo, that pattern does not match. (There are no apostrophes in the source.) So what actually matches is your "fallback" rule: the rule whose pattern is a single dot (.). We call this a fallback rule because it matches anything. Really, it should be .|\n, since . matches anything but a newline, but as long as you have some rule which matches a newline character, it's acceptable to use a bare dot. If you don't supply a fallback rule, flex inserts one automatically, with the action ECHO.
Thus, the # is ignored (just as it would have been if you'd written the rule as intended) and again the scan continues with the token include.
If you wanted to ignore the entire preprocessor directive, you could do something like
^[[:blank:]]*#.* { ; }
(from a comment) I am getting stdio and h as keywords, how does that fit the definition that I have given? What happened to the . in between?
After the < is ignored by the fallback rule, stdio is matched. Since [a-zA-Z]+[a-zA-Z0-9]* doesn't match anything other than letters and digits, the . is not considered part of the token. Then the . is matched and ignored by the fallback rule, and then h is matched.
Also the %d string in printf is being counted as an identifier.
Not really. The % is explicitly ignored by the fallback rule (as was the ") and then the d is matched as an identifier. If you want to ignore words in string literals, you will have to recognise and ignore string literals.
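To illustrate what "recognise and ignore string literals" amounts to, here is a rough sketch in plain C rather than flex. The function name count_words_outside_literals is made up for this example, and it assumes simplified rules: a line whose first non-blank character is # is skipped entirely, quoted literals end at the closing quote or at a newline, and comments are not handled at all.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the question): counts identifier-like words
   in src, ignoring preprocessor lines and quoted literals. Keywords are not
   separated out here; they are counted as words too. */
int count_words_outside_literals(const char *src)
{
    assert(src != NULL);
    int count = 0;
    int at_line_start = 1;              /* only blanks seen so far on this line */
    const char *p = src;

    while (*p != '\0') {
        unsigned char c = (unsigned char)*p;
        if (c == '\n') {
            at_line_start = 1;
            p++;
        } else if (c == ' ' || c == '\t') {
            p++;                        /* blanks do not change at_line_start */
        } else if (c == '#' && at_line_start) {
            while (*p != '\0' && *p != '\n')
                p++;                    /* skip the whole directive */
        } else if (c == '"' || c == '\'') {
            char quote = *p++;
            while (*p != '\0' && *p != quote && *p != '\n') {
                if (*p == '\\' && p[1] != '\0')
                    p++;                /* step over the backslash */
                p++;                    /* step over the (escaped) character */
            }
            if (*p == quote)
                p++;                    /* consume the closing quote */
            at_line_start = 0;
        } else if (isalpha(c) || c == '_') {
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;                    /* consume the whole word */
            count++;
            at_line_start = 0;
        } else {
            p++;                        /* anything else: same as the fallback rule */
            at_line_start = 0;
        }
    }
    return count;
}
```

With this approach neither include, stdio and h from the directive nor the d inside "%d " is counted, because the directive and the literal are each consumed as one unit before the word rule gets a chance to see them.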

The #include directive is a preprocessor directive and is therefore handled by the preprocessor. The preprocessor includes the header file and removes the #include directive, so after preprocessing, the program given to the compiler as input no longer contains any preprocessor directives such as #include.
So you don't need to write code to detect #include: the compiler never sees it, and it is not designed to tokenize #include directives.
References: Is #include a token of type keyword?

Adding the following line to the rules section works for me:
#.* ;
Here the rule is #.* and the action is ;. The pattern #.* matches everything from a # to the end of the line, and ; does nothing, so this effectively ignores any line starting with # (and also anything that follows a # elsewhere on a line).

Related

Shall conversion specifier "%%" match white spaces

According to the C standard the conversion specifier % is defined as:
% Matches a single % character; no conversion or assignment occurs. The
complete conversion specification shall be %%.
However this code:
#include <stdio.h>
int main(int argc, char* argv[])
{
int n;
printf("%d\n", sscanf(" %123", "%% %d", &n));
return 0;
}
compiled with gcc-11.1.0 gives the output 1 so apparently %% matched the " %" of the string.
This seems to be a violation of "Matches a single % character" as it also accepted the spaces in front of the % character.
Question: Is it correct according to the standard to accept white spaces as part of %% directive?
According to the C89 Standard, at least, "Input white-space characters [...] are skipped, unless the specification includes a [, c, or n specifier." (That's an old version of the Standard, but it's the one I had handy. But I don't imagine this has changed in more recent versions.)
I looked at the final C17 draft and there's actually a specific example showing that %% skips whitespace:
EXAMPLE 5 The call:
#include <stdio.h>
/* ... */
int n, i;
n = sscanf("foo %bar 42", "foo%%bar%d", &i);
will assign to n the value 1 and to i the value 42 because input
white-space characters are skipped for both the % and d conversion
specifiers.
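For anyone who wants to try this, here is a small sketch with two wrappers whose names are made up for illustration. The first uses the same "%% %d" format as the question. The second is an assumption on my part about how to get the stricter behaviour the question expected: it replaces %% with a one-character scanset, relying on the rule quoted above that a [ specifier does not skip white space.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical wrapper: parses input with the format "%% %d".
   The %% directive skips leading white space before matching '%'. */
int scan_percent_number(const char *input, int *out)
{
    assert(input != NULL && out != NULL);
    return sscanf(input, "%% %d", out);
}

/* Hypothetical stricter variant: "%*1[%]" matches exactly one '%' without
   skipping white space first, and the '*' suppresses assignment, so only
   the %d counts towards the return value. */
int scan_percent_number_strict(const char *input, int *out)
{
    assert(input != NULL && out != NULL);
    return sscanf(input, "%*1[%]%d", out);
}
```

Given " %123", the first wrapper should return 1 and store 123, while the strict variant should return 0 because the leading space is not a % character.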

count the total no. of keywords in the file

I want to count the total no. of keywords in the file but the code counts those keywords that are used to declare the variable.
void main()
{
//2d array used to store the keywords but few of them are used.
char key[32][12]={"int","char","while","for","if","else"};
//cnt is used to count the occurrence of the keyword in the file.
int cnt=0,i;
//used to store the string that is read line by line.
char ch[100];
FILE *fp=fopen("key.c","r");
//to check whether file exists or not
if(fp=='\0')
{
printf("file not found..\n");
exit(0);
}
//to extract the word till it don't reach the end of file
while((fscanf(fp,"%s",ch))!=EOF)
{
//compare the keyword with the word present in the file.
for(i=0;i<32;i++)
{
// compare the keyword with the string in ch.
if(strcmp(key[i],ch)==0) {
//just to check which keyword is printed.
printf("\nkeyword is : %s",ch);
cnt++;
}
}
}
printf("\n Total no. of keywords are : %d", cnt);
fclose(fp);
}
Expected output should be:
Total no. of keywords are : 7
Actual output is coming :
Total no. of keywords are : 3
fscanf(fp,"%s",ch) will match a sequence of non-whitespace characters (see cpp reference), so in your case for, while and if won't be matched as single words - because there's no space after them.
In my opinion, though it departs a little from your approach, you would be better off using flex(1) for this purpose, as it scans the file more efficiently than comparing each word with every keyword in your set. This approach will require more processing, as several keywords can be in the same line, and it only filters which lines have keywords on them.
Using flex(1) also gives you more efficient C source code. A sample input for flex(1) would be:
%{
#include <stdio.h>
unsigned long count = 0;
%}
%%
int |
char |
unsigned |
signed |
static |
auto |
do |
while |
if |
else |
    /* ... add more keywords as you want here (flex requires comments in the rules section to be indented) */
return |
break |
continue |
volatile { printf("keyword is = %s\n", yytext);
count++;
}
\n |
. ;
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();
printf("count = %lu\n", count);
}
The efficiency comes basically from the fact that flex(1) generates a scanner that finds the right match while reading the source file only once (one decision per character, with all the patterns matched in parallel).
The problem in your code comes from the fact that the %s format has its own idea of what a word is, different from the one defined by the C language. For scanf(), a word is anything delimited by white space (such as \n, \t or a plain space), so it will read something like while(a==b) as a single word if you don't put spaces around your keywords.
Also, if you compare each input word with every keyword in your set, your algorithm ends up making N passes over each input character, where N = nw * awl, nw being the number of keywords and awl the average keyword length in your set.
By the way, keywords should not be recognised inside comments or string literals. It is easy to adapt the code above to reject those and scan correctly. For example, the next flex file will do this:
%{
#include <stdio.h>
unsigned long count = 0;
%}
%x COMM1
%x COMM2
%x STRLIT
%x CHRLIT
%%
int |
char |
unsigned |
signed |
static |
auto |
do |
while |
if |
else |
    /* ... */
return |
break |
continue |
volatile { printf("kw is %s\n", yytext);
count++;
}
[a-zA-Z_][a-zA-Z0-9_]* |
^[\ \t]*#.* |
"/*"([^*]|\*[^/])*"*/" |
"//".* |
\"([^"\n]|\\")*\" |
\'([^'\n]|\\')*\' |
. |
\n ;
%%
int yywrap()
{
return 1;
}
int main()
{
yylex();
printf("count = %lu\n", count);
}
It allows different regular expressions to be recognised as language tokens, so provision is given to match also C language constructs like identifiers ([a-zA-Z_][a-zA-Z0-9_]*), preprocessor directives (^[\ \t]*#.*), old style C comments ("/*"([^*]|\*[^/])*"*/"), new C++ style comments ("//".*), string literals (\"([^"\n]|\\")*\"), character literals (\'([^'\n]|\\')*\'), where keywords cannot be identified as such.
Flex(1) is worth learning, as it greatly simplifies reading structured data into a program. I suggest you study it.
note
You would do better to write if (fp == NULL), or even if (!fp)... (There is nothing incorrect in your statement if (fp == '\0'), but as '\0' is the char representation of the nul character, it is strange and imprecise to compare a pointer value with a character literal, and it suggests you are treating the FILE * not as a pointer but as an integer (or char) value.) But I repeat, it is perfectly legal C.
note 2
The flex sample code posted above doesn't consider the possibility of running out of buffer space on very long tokens (such as multi-line comments overflowing the internal buffer). This is done on purpose, to keep the description and the code simple. Of course, in a professional scanner, all of this must be accounted for.

Lexical Analyzer C program for identifying tokens

I wrote a C program for a lexical analyzer (a small piece of code) that will identify keywords, identifiers and constants. I am taking a string (C source code as a string) and then splitting it into words.
#include <stdio.h>
#include <conio.h>
#include <string.h>
char symTable[5][7] = { "int", "void", "float", "char", "string" };
int main() {
int i, j, k = 0, flag = 0;
char string[7];
char str[] = "int main(){printf(\"Hello\");return 0;}";
char *ptr;
printf("Splitting string \"%s\" into tokens:\n", str);
ptr = strtok(str, " (){};""");
printf("\n\n");
while (ptr != NULL) {
printf ("%s\n", ptr);
for (i = k; i < 5; i++) {
memset(&string[0], 0, sizeof(string));
for (j = 0; j < 7; j++) {
string[j] = symTable[i][j];
}
if (strcmp(ptr, string) == 0) {
printf("Keyword\n\n");
break;
} else
if (string[j] == 0 || string[j] == 1 || string[j] == 2 ||
string[j] == 3 || string[j] == 4 || string[j] == 5 ||
string[j] == 6 || string[j] == 7 || string[j] == 8 ||
string[j] == 9) {
printf("Constant\n\n");
break;
} else {
printf("Identifier\n\n");
break;
}
}
ptr = strtok(NULL, " (){};""");
k++;
}
_getch();
return 0;
}
With the above code, I am able to identify keywords and identifiers but I couldn't obtain the result for numbers. I've tried using strspn() but to no avail. I even replaced 0,1,2...,9 with '0','1',....,'9'.
Any help would be appreciated.
Here are some problems in your parser:
The test string[j] == 0 does not test if string[j] is the digit 0. The characters for digits are written '0' through '9'; their values are 48 to 57 in ASCII and UTF-8. Furthermore, you should be comparing *ptr instead of string[j] to test if you have a digit in the string indicating the start of a number.
Splitting the string with strtok() is not a good idea: it modifies the string and overwrites the first separator character with '\0': this will prevent matching operators such as (, )...
The string " (){};""" is exactly the same as " (){};". In order to escape " inside strings, you must use \".
To write a lexer for C, you should switch on the first character and check the following characters depending on the value of the first character:
if you have white space, skip it
if you have //, it is a line comment: skip all characters up to the newline.
if you have /*, it is a block comment: skip all characters until you get the pair */.
if you have a ', you have a character constant: parse the characters, handling escape sequences until you get a closing '.
if you have a ", you have a string literal: do the same as for character constants.
if you have a digit, consume all subsequent digits, you have an integer. Parsing the full number syntax requires much more code: leave that for later.
if you have a letter or an underscore: consume all subsequent letters, digits and underscores, then compare the word with the set of predefined keywords. You have either a keyword or an identifier.
otherwise, you have an operator: check if the next characters are part of a 2 or 3 character operator, such as == and >>=.
That's about it for a simple C parser. The full syntax requires more work, but you will get there one step at a time.
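The steps above can be sketched roughly as follows. This is a minimal illustration and not a complete C lexer: the enum, the function names next_token and is_keyword, the short keyword table and the operator table are all my own assumptions, and full number syntax, trigraphs and error reporting are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_END, TOK_KEYWORD, TOK_IDENT, TOK_NUMBER,
                  TOK_STRING, TOK_CHAR, TOK_OP };

/* Illustrative, incomplete keyword table. */
static int is_keyword(const char *s, size_t len)
{
    static const char *const kw[] = { "int", "void", "float", "char",
                                      "return", "if", "else", "while", "for" };
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (strlen(kw[i]) == len && memcmp(kw[i], s, len) == 0)
            return 1;
    return 0;
}

/* Scans one token starting at *pp, advances *pp past it, and reports the
   token's start and length through the output parameters. */
enum token_kind next_token(const char **pp, const char **start, size_t *len)
{
    assert(pp != NULL && *pp != NULL && start != NULL && len != NULL);
    const char *p = *pp;
    enum token_kind kind;

    for (;;) {                              /* skip white space and comments */
        if (isspace((unsigned char)*p)) {
            p++;
        } else if (p[0] == '/' && p[1] == '/') {
            while (*p != '\0' && *p != '\n')
                p++;
        } else if (p[0] == '/' && p[1] == '*') {
            p += 2;
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (*p != '\0')
                p += 2;                     /* step over the closing pair */
        } else {
            break;
        }
    }

    *start = p;
    if (*p == '\0') {
        kind = TOK_END;
    } else if (*p == '"' || *p == '\'') {   /* string or character constant */
        char quote = *p++;
        kind = (quote == '"') ? TOK_STRING : TOK_CHAR;
        while (*p != '\0' && *p != quote) {
            if (*p == '\\' && p[1] != '\0')
                p++;                        /* keep escaped quotes inside */
            p++;
        }
        if (*p == quote)
            p++;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
        kind = TOK_NUMBER;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        kind = is_keyword(*start, (size_t)(p - *start)) ? TOK_KEYWORD : TOK_IDENT;
    } else {
        /* Longest operators first, so that >>= wins over >> and >. */
        static const char *const ops[] = { "<<=", ">>=", "==", "!=", "<=", ">=",
                                           "&&", "||", "++", "--", "+=", "-=",
                                           "->", "<<", ">>" };
        size_t n = 1;                       /* default: one-character operator */
        for (size_t i = 0; i < sizeof ops / sizeof ops[0]; i++) {
            size_t l = strlen(ops[i]);
            if (strncmp(p, ops[i], l) == 0) {
                n = l;
                break;
            }
        }
        p += n;
        kind = TOK_OP;
    }
    *len = (size_t)(p - *start);
    *pp = p;
    return kind;
}
```

A caller would keep a const char * cursor pointing into the source text and call next_token(&cursor, &start, &len) in a loop until it returns TOK_END. Unlike strtok(), this leaves the source string unmodified and hands back the separators such as ( and ; as tokens of their own.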
When you're writing a lexer, always create a dedicated function that finds your tokens (the name yylex is the one used by the Lex tool, which is why I used it). Writing the lexer in main is not a smart idea, especially if you want to do syntax or semantic analysis later on.
From your question it is not clear whether you just want to recognise number tokens, or whether you also want to fetch the number's value. I will assume the first.
This is example code that finds whole numbers:
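#include <stdio.h>
#include <ctype.h>
int yylex(void){
    /* We read one char from standard input; int (not char) so that EOF fits */
    int c = getchar();
    /* If we read a new line or reach end of file, we return the end of input token */
    if(c == '\n' || c == EOF)
        return EOI;
    /* If we see a digit, we consume the whole run of digits before returning.
       A stricter lexer would also reject input such as 123a as a lexical
       error; this example does not check for that. */
    if(isdigit(c)){
        while(isdigit(c = getchar()))
            ;
        /* Put back the first character that is not part of the number */
        ungetc(c,stdin);
        return NUM;
    }
    /* Additional code for keywords, identifiers, errors, etc. */
    /* Until that code exists, any other character is returned as its own token */
    return c;
}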
Tokens EOI, NUM, etc. should be defined at the top of the file. Later on, when you write the syntax analysis, you use these tokens to check whether the code conforms to the language syntax. Single-character tokens usually get no named definition at all: the lexer function simply returns the character itself, ')' for example. For that reason, named tokens should be given values above 255. For example:
#define EOI 256
#define NUM 257
If you have any further questions, feel free to ask.
string[j]==1
This test is wrong(1) (on all C implementations I have heard of), since string[j] is a char holding a character code in some encoding (ASCII, UTF-8, or even the old EBCDIC used on IBM mainframes), and the code of the digit character 1 is not the number 1. On my Linux/x86-64 machine (and on almost all machines, since they use ASCII or UTF-8), the character 1 is encoded as the byte 49 (that is, (char)49 == '1').
You probably want
string[j]=='1'
and you should consider using the standard isdigit (and related) function.
Be aware that UTF-8 is practically used everywhere but is a multi-byte encoding (of displayable characters). See this answer.
Note (1): the string[j]==1 test is probably misplaced too! Perhaps you should test isdigit(*ptr) at some better place.
PS. Please get into the habit of compiling with all warnings and debug info (e.g. with gcc -Wall -Wextra -g if using GCC...)
and use the debugger (e.g. gdb). You would have found your bug in less time than it took you to get an answer here.

Regular expression matching using regcomp() and regexec() functions in C

I am not familiar with the regex library in C. Currently I am trying to use the regexec() and regcomp() functions to search for a string that matches my pattern or regular expression, but I can't get my matched string. Am I missing something in my code, or using the functions incorrectly?
My sample code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <regex.h>
int main(int argc, char ** argv)
{
regex_t r;
const char * my_regex = "(\\d+.\\d+.\\d+.\\d+)";
const char * my_string = "Am trying to match any ip like, 23.54.67.89 , in this string and 123.232.123.33 is possible";
const int no_of_matches = 10;
regmatch_t m[no_of_matches];
printf ("Trying to match '%s' in '%s'\n", my_regex, my_string);
int status = regcomp (&r, my_regex, REG_EXTENDED|REG_NEWLINE);
printf("status: %d\n",status);
if(status!=0)
{
printf ("Regex error compiling \n");
}
int match_size = regexec (&r, my_string, no_of_matches, m, 0);
printf("Number of Matches : %d\n",match_size);
int i = 0;
for (i = 0; i < match_size; i++)
{
//Now i wana print all matches here,
int start = m[i].rm_so;
int finish = m[i].rm_eo;
printf("%.*s\n", (finish - start), my_string + start);
}
regfree (& r);
return 0;
}
Here is the problem: I can't print my matches. Any suggestions? I am on Linux.
I have edited my for loop, now it prints:
Trying to match '(\d+.\d+.\d+.\d+)' in 'Am trying to match any ip like, 23.54.67.89 , in this string and 123.232.123.33 is possible'
status: 0
Number of Matches : 1
m trying to match any ip like, 23.54.67.89 , in this string and 123.232.123.33 is possible
But am expecting my out put as:
Trying to match '(\d+.\d+.\d+.\d+)' in 'Am trying to match any ip like, 23.54.67.89 , in this string and 123.232.123.33 is possible'
status: 0
Number of Matches : 2
23.54.67.89
123.232.123.33
Your regular expression is not a POSIX regular expression. You're using Perl/Tcl/Vim flavour, which won't work like you hope it would.
regcomp() and regexec() are POSIX regular expressions, and as such, are part of POSIX-compliant (or just POSIX-y) C libraries. They are not just part of some regex library; these are the POSIX standard stuff.
In particular, POSIX regular expressions do not recognize \d, or any other backslash-character classes. You should use [[:digit:]] instead. (The character classes are enclosed in brackets, so to match any digit or lowercase letter you could use [[:digit:][:lower:]]. For anything except a control character, you could use [^[:cntrl:]].)
In general, you can check out the Character classes table in the Regular expressions Wikipedia article, which contains a concise summary of the equivalent classes with descriptions.
Do you need a locale-aware example to demonstrate this?
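Here is a sketch of how the search could look with a POSIX pattern. Two things go beyond what is said above and are worth stating explicitly: regexec() returns 0 on a match (it is not a match count), and it reports one match per call, so finding every address means calling it again starting after the previous match. The helper name find_dotted_quads and the 32-byte slots are assumptions made for this example; I used [0-9] here, and [[:digit:]] would work in its place.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies up to max dotted-quad matches from text into
   out and returns how many were found, or -1 if the pattern fails to
   compile. Matches longer than 31 characters are truncated. */
int find_dotted_quads(const char *text, char out[][32], int max)
{
    assert(text != NULL && out != NULL);
    /* POSIX ERE: [0-9] instead of \d, and \. for a literal dot. */
    const char *pattern = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+";
    regex_t re;
    regmatch_t m[1];
    int found = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    const char *cursor = text;
    /* regexec() returns 0 when it finds a match at or after cursor. */
    while (found < max && regexec(&re, cursor, 1, m, 0) == 0) {
        int len = (int)(m[0].rm_eo - m[0].rm_so);
        /* Offsets in m[0] are relative to cursor, not to text. */
        snprintf(out[found], sizeof out[found], "%.*s", len, cursor + m[0].rm_so);
        found++;
        cursor += m[0].rm_eo;           /* resume after this match */
    }
    regfree(&re);
    return found;
}
```

Called on the string from the question with room for a few results, this should find the two addresses 23.54.67.89 and 123.232.123.33.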

Flex pattern for ID gives 'Segmentation fault'

I have a program in C that converts expression to RPN (reverse Polish notation).
All I need to do is to replace lexer code written in C with Flex. I already did some work, but I have problems with patterns - word or variable id to be specific. Yes, this is class exercise.
This is what I have:
%{
#include "global.h"
int lineno = 1;
int tokenval = NONE;
%}
%option noyywrap
WS " "
NEW_LINE "\n"
DIGIT [0-9]
LETTER [a-zA-Z]
NUMBER {DIGIT}+
ID {LETTER}({LETTER}|{DIGIT})*
%%
{WS}+ {}
{NEW_LINE} { ++lineno; }
{NUMBER} { sscanf (yytext, "%d", &tokenval); return(NUM); }
{ID} { sscanf (yytext, "%s", &tokenval); return(ID); }
. { return *yytext;}
<<EOF>> { return (DONE); }
%%
and defined in global.h
#define BSIZE 128
#define NONE -1
#define EOS '\0'
#define NUM 256
#define DIV 257
#define MOD 258
#define ID 259
#define DONE 260
Everything works when I use digits, brackets and operators, but when I type for example a+b it gives me a Segmentation fault (and the output should be ab+).
Please don't ask me for a parser code (I can share if really needed) - requirement is to ONLY implement lexer using Flex.
The problem is that the program is doing an sscanf with a string format (%s) into the address of an integer (&tokenval). You should change that to an array of char, e.g.,
%{
#include "global.h"
int lineno = 1;
int tokenval = NONE;
char tokenbuf[132];
%}
and
{ID} { sscanf (yytext, "%s", tokenbuf); return(ID); }
(though strcpy is a better choice than sscanf, this is just a starting point).
When flex scans a token matching pattern ID, the associated action attempts to copy the token into a character array at location &tokenval. But tokenval has type int, so the code has undefined behavior. In particular, if the length of the ID equals or exceeds the size of an int, then you cannot fit all its bytes (including a string terminator) in the space occupied by an int. A reasonably likely result is that you attempt to write past its end, which could result in a segfault.
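Whatever buffer ends up holding the identifier, the copy should be bounded by the buffer's size. A minimal sketch, assuming a helper named copy_token (my own name, not part of flex or of the original program):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies a token into dst, truncating if needed so
   that dst is always NUL-terminated. Returns 1 if the whole token fit,
   0 if it was truncated. */
int copy_token(char *dst, size_t dstsize, const char *text)
{
    assert(dst != NULL && text != NULL && dstsize > 0);
    /* snprintf writes at most dstsize bytes, terminator included, and
       returns the length the full string would have needed. */
    int needed = snprintf(dst, dstsize, "%s", text);
    return needed >= 0 && (size_t)needed < dstsize;
}
```

In the flex action this would be used as copy_token(tokenbuf, sizeof tokenbuf, yytext), with tokenbuf being a char array like the one declared in the first answer; an identifier longer than the buffer then gets truncated instead of overwriting whatever follows it in memory.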
