How to find tokens from a c file? - c

I am trying to generate tokens from a C source file. I have split the C file into an array line and stored the words of the entire file in an array words.
The problem is with the strtok() function, which is splitting the line on whitespace characters. Because of this, I am not getting certain delimiters like parentheses and brackets because there is no whitespace between them and other tokens.
How do I determine which one is an identifier and which one is an operator?
Code so far:
int main()
{
/* ... */
char line[300][200];
char delim[]=" \n\t";
char *words[1000];
char *token;
while (fgets(&line[i][0], 100, fp1) != NULL)
{
token = strtok(&line[i][0], delim);
while (token != NULL)
{
words[j++] = token;
token = strtok(NULL, delim);
}
i++;
}
for(i = 0; i < 50; i++)
{
printf("%s\n", words[i]);
}
return 0;
}

This is a tricky question, one that probably needs more depth than a StackOverflow answer can give. I'll try, nonetheless.
Tokenizing the input is the first stage of compilation. Its purpose is to simplify the task of the parser, which builds an abstract syntax tree from the contents of the file. How does it simplify things? By recognizing up front the tokens that have a special meaning, as well as identifiers, operators and so on. C is a complex language to tokenize, so let's start with something simpler: a typical calculator.
An input example would be:
( 4 +5)* 2
When the syntax is free-form, spaces can be added or omitted at will, so, as you have already found, splitting on whitespace is not an option.
The tokenized output for the example above would be: LPAR, LIT, OP, LIT, RPAR, OP, LIT. The meaning goes as follows:
LPAR: Left parenthesis
RPAR: Right parenthesis
LIT: Literal (a number)
OP: Operator (say: +, -, * and /).
The complete output would therefore be:
{ LPAR, LIT(4), OP('+'), LIT(5), RPAR, OP('*'), LIT(2) }
Your lexer basically has to advance through the input string, character by character, using a state machine. For example, when you read a digit, you enter the "reading a literal" state, in which only further digits and '.' are allowed.
Now the parser has an easier task. If you feed it the tokens above, it does not have to skip spaces or distinguish between a negative number and a minus operator; it can just advance through a list or array. It can act according to the type of each token, and some tokens carry associated data, as you can see.
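To make the calculator example concrete, here is a minimal sketch of such a lexer. The names CalcToken, TokKind and calc_lex are invented for this illustration, and the sketch leans on strtod() to consume a literal instead of spelling out the "reading a literal" state by hand. It does not try to tell a unary minus from a binary one; it simply emits OP('-').

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token kinds for the calculator example. */
typedef enum { TOK_LPAR, TOK_RPAR, TOK_LIT, TOK_OP } TokKind;

typedef struct {
    TokKind kind;
    double  value;  /* meaningful only when kind == TOK_LIT */
    char    op;     /* meaningful only when kind == TOK_OP  */
} CalcToken;

/* Scans src into out (capacity max). Returns the number of tokens
 * written, or -1 on an unexpected character or if out is too small. */
int calc_lex(const char *src, CalcToken *out, int max)
{
    int n = 0;
    while (*src != '\0') {
        unsigned char c = (unsigned char)*src;
        if (isspace(c)) { ++src; continue; }    /* spaces separate nothing */
        if (n == max) return -1;

        CalcToken t = { TOK_LIT, 0.0, '\0' };
        if (isdigit(c) || c == '.') {
            char *end;
            t.value = strtod(src, &end);        /* consumes the whole literal */
            if (end == src) return -1;          /* e.g. a lone '.' */
            src = end;
        } else if (c == '(') {
            t.kind = TOK_LPAR; ++src;
        } else if (c == ')') {
            t.kind = TOK_RPAR; ++src;
        } else if (c == '+' || c == '-' || c == '*' || c == '/') {
            t.kind = TOK_OP; t.op = (char)c; ++src;
        } else {
            return -1;                          /* not part of the language */
        }
        out[n++] = t;
    }
    return n;
}
```

Calling calc_lex("( 4 +5)* 2", tokens, 16) on an array of 16 CalcToken should fill in the seven tokens listed above, regardless of how the spaces are placed.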
This is only an introduction to the introduction, anyway. Information about the whole compilation process could fill a book. And there are actually many books devoted to this topic, such as the famous "Dragon book" by Aho, Sethi & Ullman. A more recent one is the "Tiger book".
Finally, lexers are all quite similar to one another, so it is possible to find generic lexers out there. You can even find a C grammar for that kind of tool.
Hope this (somehow) helps.

Related

Need help parsing a "|" separated line from a file

I have to parse a file that would look something like this
String|OtherString|1234|0
String2|OtherString2|4321|1
...
So, I need to go through every line of the file and take each separate token of each line.
FILE *fp=fopen("test1.txt","r");
int c;
char str1[500];
char str2[500];
int num1=0;
int num2;
while((c=fgetc(fp))!=EOF){
fscanf(fp, "%s|%s|%d|%d", &str1[0], &str2[0], &num1, &num2);
}
fclose(fp);
There's more to it, but these are the sections relevant to my question. fscanf isn't working, presumably because I've written it wrong. What's supposed to happen is that str1[500] should be set to String, in this case, str2 to OtherString, etc. It seems as though fscanf isn't doing anything, however. Would greatly appreciate some help.
EDIT: I am not adamant about using fgetc or fscanf, these are just what I have atm, I'd use anything that would let me do what I have to
strtok() in a loop will work for you. The following is a bare bones example, with very little error handling etc, but illustrates the concept...
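char strArray[4][80];          /* room for 4 fields of up to 79 characters */
char *tok = NULL;
char *dup = strdup(origLine);  /* work on a copy: strtok() modifies its input */
int i = 0;
if(dup)
{
    tok = strtok(dup, "|\n");
    while(tok && i < 4)        /* stop before overflowing strArray */
    {
        /* bounded copy, always NUL-terminated */
        strncpy(strArray[i], tok, sizeof strArray[i] - 1);
        strArray[i][sizeof strArray[i] - 1] = '\0';
        tok = strtok(NULL, "|\n");
        i++;
    }
    free(dup);
}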
If reading from a file, then put this loop inside another while loop that reads the file, line by line. Functions useful for this will include fopen(), fgets() and fclose(). One additional feature that should be considered for code that reads data from a file is to determine the number of records (lines) in the file to be read, and use that information to create a properly sized container to populate with the parsing results. But this will be for another question.
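The outer loop could look something like the sketch below. The Record type and the names parse_record and load_records are made up for this illustration, and the field layout is assumed from the sample line String|OtherString|1234|0. One thing to keep in mind: strtok() treats a run of delimiters as one, so an empty field such as a||b is silently skipped rather than reported.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record layout matching "String|OtherString|1234|0". */
typedef struct {
    char str1[500];
    char str2[500];
    int  num1;
    int  num2;
} Record;

/* Parses one line in place (strtok overwrites the delimiters with NUL).
 * Returns 1 on success, 0 if the line does not hold exactly four fields
 * or a string field is too long for the record. */
int parse_record(char *line, Record *rec)
{
    char *field[4];
    int n = 0;
    for (char *tok = strtok(line, "|\n"); tok != NULL; tok = strtok(NULL, "|\n")) {
        if (n == 4) return 0;               /* too many fields */
        field[n++] = tok;
    }
    if (n != 4) return 0;                   /* too few fields */
    if (strlen(field[0]) >= sizeof rec->str1 ||
        strlen(field[1]) >= sizeof rec->str2) return 0;
    strcpy(rec->str1, field[0]);
    strcpy(rec->str2, field[1]);
    rec->num1 = (int)strtol(field[2], NULL, 10);
    rec->num2 = (int)strtol(field[3], NULL, 10);
    return 1;
}

/* Reads up to max records from fp, skipping lines that do not parse.
 * Returns the number of records stored. */
int load_records(FILE *fp, Record *recs, int max)
{
    char buf[1100];
    int n = 0;
    while (n < max && fgets(buf, sizeof buf, fp) != NULL) {
        if (parse_record(buf, &recs[n])) n++;
    }
    return n;
}
```

The caller is expected to fopen() the file, check the result, pass the FILE pointer to load_records(), and fclose() it afterwards.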
Note: fgetc() is not suggested here as it reads one char per loop, and would be less efficient than using fgets() for reading lines from a file when used in conjunction with strtok().
Note also, in general, the more consistently a file is formatted in terms of number of fields, content of fields, etc., the less complicated a parser needs to be. The inverse is also true: a less consistently formatted input file requires a more complex parser. For example, the parser required for human-entered line data is typically more complicated than one used for a computer-generated set of uniform lines.

removing multi-char constants in C

Here's some code I found in a very old C library that's trying to eat whitespace from a file...
while(
(line_buf[++line_idx] != ' ') &&
(line_buf[ line_idx] != ' ') &&
(line_buf[ line_idx] != ',') &&
(line_buf[ line_idx] != '\0') )
{
This great thread explains what the problem is, but most of the answers are "just ignore it" or "you should never do this". What I don't see, however, is the canonical solution. Can anyone offer a way to code this test using the "proper way"?
UPDATE: to clarify, the question is "what is the proper way to test for the presence of a string of one or more characters at a given index in another string". Forgive me if I am using the wrong terminology.
Original question
There is no canonical or correct way. Multi-character constants have always been implementation defined. Look up the documentation for the compiler used when the code was written and figure out what was meant.
Updated question
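You can test a character against several candidates at once using strchr(). Because strchr() also matches the terminating NUL of its first argument, the '\0' test is covered as well, and the loop continues for as long as no match is found:
while (!strchr( " ,", line_buf[++line_idx] ))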
{
Again, this does not account for that multi-char constant. You should figure out why that was there before simply removing it.
Also, strchr() does not handle Unicode. If you are dealing with a UTF-8 stream, for example, you will need a function capable of handling it.
Finally, if you are concerned about speed, profile. The compiler might get you better results using the three (or four) individual test expressions in the ‘while’ condition.
In other words, the multiple tests might be the best solution!
Beyond that, I smell some uncouth indexing: the way that line_idx is updated depends on the surrounding code to actuate the loop properly. Make sure that you don’t create an off-by-one error when you update stuff.
Good luck!
UPDATE: to clarify, the question is "what is the proper way to test
for the presence of a string of one or more characters at a given
index in another string". Forgive me if I am using the wrong
terminology.
Well, there are a number of ways, but the standard way is using strspn which has the prototype:
size_t strspn(const char *s, const char *accept);
and it cleverly:
calculates the length (in bytes) of the initial segment of s
which consists entirely of bytes in accept.
This allows you to test for the "the presence of a string of one or more characters at a given index in another string" and tells you how many of the characters from that string were sequentially matched.
For example, if you had another string, say char *s = "somestring";, and wanted to know if it contained the letters r, s, t (say, in char *accept = "rst";) beginning at the 5th character, you could test:
size_t n;
if ((n = strspn (&s[4], accept)) > 0)
printf ("matched %zu chars from '%s' at beginning of '%s'\n",
n, accept, &s[4]);
To compare in order, you can use strncmp (&s[4], accept, strlen (accept));, which returns 0 on a match. You can also simply use nested loops to iterate over s with the characters in accept.
All of the ways are "proper", so long as they do not invoke Undefined Behavior (and are reasonably efficient).
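Applied to the loop from the question, the same family of functions can replace the hand-written tests. The sketch below uses strcspn(), the complement of strspn(), to measure how far it is to the next space, comma or end of string; the helper names field_length and separator_run are made up for this illustration, and the caller is assumed to pass a start index that lies within the string.

```c
#include <string.h>

/* Number of characters starting at line_buf[start] that are neither a
 * space nor a comma. Stops at the terminating NUL at the latest. */
size_t field_length(const char *line_buf, size_t start)
{
    return strcspn(line_buf + start, " ,");
}

/* Number of consecutive spaces and commas starting at line_buf[start]. */
size_t separator_run(const char *line_buf, size_t start)
{
    return strspn(line_buf + start, " ,");
}
```

With these, advancing past a field becomes line_idx += field_length(line_buf, line_idx); and no index update is buried in the loop condition.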

Reading String token by token in C

I'm trying to build an LL(1) Recursive Descent Parser in C using a specific grammar given to me. I have an idea how to do this recursively in general... my issue, however, is stopping me from really being able to start my implementation. I'm not too familiar with C, so I'm sure this is why I'm having an issue. Basically, I need to be able to read a String such as "(1+2)*3" token by token. So for instance, in the case of the String of above me I need to first read the "(", then further down the recursive process I'd call something like nextToken() which would give me the "1".
That being said, ultimately I would probably only need to read the very first token of the String each time that I call nextToken(), because after I grab the value I'd alter the initial string to be the same as it previously was, minus the most recently read token. So for example, I start with "(1+2)*3", then I call nextToken() on the String which means that I get the "(" and then the initial String is now "1+2)*3".
My issue is I don't know how to do this in C..
That's what a "lexer" does, typically before a parser. I guess the best you can do is try LEX (probably flex, as in Flex & Bison). (It's true that what a lexer does can also be done solely in the parser, but it's probably much messier.)
A less preferable way would be to categorize all the possibilities and write regular expressions to match some valid prefix (which is what the LEX does under the hood).
In C, a "string" is just a region of memory containing characters, which is terminated by the first NUL (0) character. That being the case, all you need for a string is a pointer to the first character. (That means that the length of the string needs to be computed, so try to avoid doing that more often than is necessary.)
There are standard library functions which can do things like compare strings and copy strings, but it is important to remember that memory management of strings is your responsibility.
While this may seem primitive, error-prone, and complicated to those used to languages in which strings are actual datatypes, it is how it is. If you're planning on doing string manipulation in C, you need to get used to it.
Nonetheless, string manipulation in C can be both efficient and trouble-free, as long as you follow the rules. For example, if you want to refer to the substring of s starting at the 3rd character, you can just use pointer arithmetic: s + 2. If you want to (temporarily) create a substring at a given point in a string, you can drop a 0 into the string at the end of the substring, and then later restore the character that was there. (In fact, that's what the standard library function strtok does, and it's how a lexical scanner built with (f)lex works.) Note that this strategy requires that the character array be mutable, so you won't be able to apply it to string literals. (String arrays are fine, though, since they are mutable.)
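As a small illustration of the "drop a 0 in and restore it" idea, here is a sketch that compares a slice of a mutable string against a word without copying the slice. The function name slice_equals is invented for this example, and the caller is assumed to guarantee that start + len does not exceed the length of s.

```c
#include <string.h>

/* Returns 1 if the len characters of s beginning at index start spell
 * out word, 0 otherwise. s must be a mutable array, not a string
 * literal, because one character is overwritten for a moment. */
int slice_equals(char *s, size_t start, size_t len, const char *word)
{
    char saved = s[start + len];    /* remember the character we clobber */
    s[start + len] = '\0';          /* temporarily end the string here */
    int equal = strcmp(s + start, word) == 0;   /* s + start: pointer arithmetic */
    s[start + len] = saved;         /* put the original character back */
    return equal;
}
```

For char buf[] = "(1+2)*3";, slice_equals(buf, 1, 3, "1+2") should report a match and leave buf unchanged afterwards.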
It's quite possible that your best bet for building a lexical scanner would be to use flex. The scanner which flex builds will do a lot of things for you, including input buffering, and flex lets you specify regular expressions instead of hand coding them.
But if you want to do it by hand, it is not that hard, particularly if the entire input is in memory so that buffering is not necessary. (If no token spans a line, you could also read the input a line at a time, but that's not as efficient as reading fixed-length blocks, which is what the flex scanner will do.)
Here, for example, is a simple scanner which handles arithmetic operators, integers, and identifiers. It does not use the "overwrite with NUL" strategy, so it can be used with string literals. For identifiers, it creates a newly-allocated string, so the caller needs to free the identifier when it is no longer needed. (No garbage collection. C'est la vie.) The token is "returned" through a reference argument; the actual return value of the function is a pointer to the remainder of the source string. Quite a lot of error checking has been omitted.
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* The type of a single-character operator is the character itself, so
 * other token types need to start at 256. We use 0 as the token type
 * that indicates the end of input.
 */
enum TokenType { NUMBER = 256, ID };

typedef struct Token {
    enum TokenType token_type;
    union {            /* Anonymous unions are a C11 feature. */
        long number;   /* Only valid if type is NUMBER */
        char* id;      /* Only valid if type is ID */
    };
} Token;

/* You would normally call this like this:
 *   do {
 *     s = next_token(s, &token);
 *     // Do something with token
 *   } while (token.token_type);
 */
const char* next_token(const char* input, Token* out) {
    /* Skip whitespace. The <ctype.h> functions need an unsigned char
     * value, hence the casts.
     */
    while (isspace((unsigned char)*input)) ++input;
    if (isdigit((unsigned char)*input)) {
        char* lim;
        out->number = strtol(input, &lim, 10);
        out->token_type = NUMBER;
        return lim;
    } else if (isalpha((unsigned char)*input)) {
        const char* lim = input + 1;
        /* Find the end of the id */
        while (isalnum((unsigned char)*lim)) ++lim;
        /* Allocate enough memory to copy the id. We need one extra byte
         * for the NUL
         */
        size_t len = (size_t)(lim - input);
        out->id = malloc(len + 1);
        memcpy(out->id, input, len);
        out->id[len] = 0; /* NUL-terminate the string */
        out->token_type = ID;
        return lim;
    } else {
        out->token_type = *input;
        /* If we hit the end of the input string, we don't advance the
         * input pointer, to avoid reading random memory.
         */
        return *input ? input + 1 : input;
    }
}

Creating a Lexical Analyzer in C

I am trying to create a lexical analyzer in C.
The program reads another program as input to convert it into tokens, and the source code is here-
#include <stdio.h>
#include <conio.h>
#include <string.h>
int main() {
FILE *fp;
char read[50];
char seprators [] = "\n";
char *p;
fp=fopen("C:\\Sum.c", "r");
clrscr();
while ( fgets(read, sizeof(read)-1, fp) !=NULL ) {
//Get the first token
p=strtok(read, seprators);
//Get and print other tokens
while (p!=NULL) {
printf("%s\n", p);
p=strtok(NULL, seprators);
}
}
return 0;
}
And the contents of Sum.c are-
#include <stdio.h>
int main() {
int x;
int y;
int sum;
printf("Enter two numbers\n");
scanf("%d%d", &x, &y);
sum=x+y;
printf("The sum of these numbers is %d", sum);
return 0;
}
I am not getting the correct output and only see a blank screen in place of output.
Can anybody please tell me where am I going wrong??
Thank you so much in advance..
You've asked a few questions since this one, so I guess you've moved on. There are a few things that can be noted about your problem and your start at a solution that can help others starting to solve a similar problem. You'll also find that people can often be slow at answering things that are obviously homework. We often wait until homework deadlines have passed. :-)
First, I noted you used a few features specific to the Borland C compiler which are non-standard and would not make the solution portable or generic. You could solve the problem without them just fine, and that is usually a good choice. For example, you used #include <conio.h> just to clear the screen with clrscr(), which is probably unnecessary and not relevant to the lexer problem.
I tested the program, and as written it works! It transcribes all the lines of the file Sum.c to stdout. If you only saw a blank screen, it is because the program could not find the file. Either you did not write it to your C:\ directory or it had a different name. As already mentioned by @WhozCraig, you need to check that the file was found and opened properly.
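A check of that kind might look like the following sketch. The helper name open_source is made up for this illustration; perror() prints the given prefix followed by the system's description of the last error, which on most platforms says why fopen() failed.

```c
#include <stdio.h>

/* Opens path for reading. On failure, reports the problem on stderr
 * and returns NULL so the caller can stop instead of reading from a
 * NULL stream. */
FILE *open_source(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        perror(path);
    }
    return fp;
}
```

In the program from the question, main() would call fp = open_source("C:\\Sum.c"); and return a non-zero status straight away if the result is NULL.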
I see you are using the C function strtok to divide the input up into tokens. There are some nice examples of using this in the documentation you could include in your code, which do more than your simple case. As mentioned by @Grijesh Chauhan, there are more separators to consider than \n, or end-of-line. What about spaces and tabs, for example?
However, in programs, things are not always separated by spaces and lines. Take this example:
result=(number*scale)+total;
If we only used white space as a separator, then it would not identify the words used and only pick up the whole expression, which is obviously not tokenization. We could add these things to the separator list:
char seprators [] = "\n=(*)+;";
Then your code would pick out those words too. There is still a flaw in that strategy, because in programming languages, those symbols are also tokens that need to be identified. The problem with programming language tokenization is there are no clear separators between tokens.
There is a lot of theory behind this, but basically we have to write down the patterns that form the basis of the tokens we want to recognise and not look at the gaps between them, because as has been shown, there aren't any! These patterns are normally written as regular expressions. Computer Science theory tells us that we can use finite state automata to match these regular expressions. Writing a lexer involves a particular style of coding, which looks like this:
while ( NOT <<EOF>> ) {
switch ( next_symbol() ) {
case state_symbol[1]:
....
break;
case state_symbol[2]:
....
break;
default:
error(diagnostic);
}
}
So, now, perhaps the value of the academic assignment becomes clearer.
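To show what matching patterns rather than gaps means in practice, here is a minimal sketch that splits the example result=(number*scale)+total; into tokens. It uses if/else tests on character classes rather than a literal switch, and it is far from a full C lexer: the name split_tokens and the limit MAX_TOKEN_LEN are assumptions made for this illustration, and every punctuation character is treated as a one-character token.

```c
#include <ctype.h>
#include <string.h>

#define MAX_TOKEN_LEN 32  /* hypothetical limit for this sketch */

/* Copies the tokens found in src into tokens[0..max-1].
 * Patterns recognised: identifiers, runs of digits, and single
 * punctuation characters. Returns the number of tokens, or -1 if
 * there are too many tokens or one is too long. */
int split_tokens(const char *src, char tokens[][MAX_TOKEN_LEN], int max)
{
    int n = 0;
    while (*src != '\0') {
        unsigned char c = (unsigned char)*src;
        if (isspace(c)) { ++src; continue; }    /* whitespace is optional */
        if (n == max) return -1;

        size_t len = 1;                         /* default: one punctuation char */
        if (isalpha(c) || c == '_') {           /* identifier pattern */
            while (isalnum((unsigned char)src[len]) || src[len] == '_') ++len;
        } else if (isdigit(c)) {                /* integer pattern */
            while (isdigit((unsigned char)src[len])) ++len;
        }
        if (len >= MAX_TOKEN_LEN) return -1;

        memcpy(tokens[n], src, len);
        tokens[n][len] = '\0';
        ++n;
        src += len;
    }
    return n;
}
```

On the example expression this should yield ten tokens: result, =, (, number, *, scale, ), +, total and ;, with the symbols kept as tokens instead of being thrown away as separators.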

performing regular expression in C

I would like to perform a regular expression match in C. Suppose I have the following text:
thecapital([x], implies(maincity(y),x))
The program has to output like:
implies(maincity(y),x))
can anyone please suggest how shall I proceed?
To transform the input string thecapital([x], implies(maincity(y),x)) to the output string implies(maincity(y),x)) you can use the following simple function:
const char *
transform(const char *expr) {
return expr + 16;
}
It doesn't use regular expressions, but on the other hand it's lightning fast. Or maybe you didn't put your question clearly. For example, you didn't describe in words what transformation should be done. Giving just one example is not enough.
So what do you really want to do?
Skip the first 16 characters of the input string
Return everything after the first space character
Return everything after the last space character
Return the suffix of the argument starting with the second i
Return "implies(maincity(y),x))"
Return the second argument to the term in parentheses, followed by an extra closing parenthesis
For your one example my simple suggested function fulfills all these requirements. But of course it will fail hopelessly when given any other input.
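To show how much the choice matters, here is a sketch of just one of the interpretations listed above, "return everything after the first space character". The function name after_first_space is made up for this illustration; it gives the same result as transform() for the example, but behaves differently on other inputs.

```c
#include <string.h>

/* Returns a pointer to the text following the first space in expr,
 * or NULL if expr contains no space at all. */
const char *after_first_space(const char *expr)
{
    const char *space = strchr(expr, ' ');
    return space ? space + 1 : NULL;
}
```

Unlike expr + 16, this version does not read past the end of a short input, but it is still only a guess at what the real requirement is.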

Resources