Creating a Lexical Analyzer in C

I am trying to create a lexical analyzer in C.
The program reads another program as input to convert it into tokens, and the source code is here-
#include <stdio.h>
#include <conio.h>
#include <string.h>
int main() {
FILE *fp;
char read[50];
char seprators [] = "\n";
char *p;
fp=fopen("C:\\Sum.c", "r");
clrscr();
while ( fgets(read, sizeof(read)-1, fp) !=NULL ) {
//Get the first token
p=strtok(read, seprators);
//Get and print other tokens
while (p!=NULL) {
printf("%s\n", p);
p=strtok(NULL, seprators);
}
}
return 0;
}
And the contents of Sum.c are-
#include <stdio.h>
int main() {
int x;
int y;
int sum;
printf("Enter two numbers\n");
scanf("%d%d", &x, &y);
sum=x+y;
printf("The sum of these numbers is %d", sum);
return 0;
}
I am not getting the correct output and only see a blank screen in place of output.
Can anybody please tell me where am I going wrong??
Thank you so much in advance..

You've asked a few questions since this one, so I guess you've moved on. There are a few things that can be noted about your problem and your start at a solution that can help others starting to solve a similar problem. You'll also find that people can often be slow at answering things that are obvious homework. We often wait until homework deadlines have passed. :-)
First, I noted you used a few features specific to the Borland C compiler, which are non-standard and would not make the solution portable or generic. You could solve the problem without them just fine, and that is usually a good choice. For example, you used #include <conio.h> just to clear the screen with clrscr(), which is probably unnecessary and not relevant to the lexer problem.
I tested the program, and as written it works! It transcribes all the lines of the file Sum.c to stdout. If you only saw a blank screen, it is because the program could not find the file: either you did not save it in your C:\ directory or it had a different name. As already mentioned by @WhozCraig, you need to check that the file was found and opened properly.
I see you are using the C function strtok to divide the input up into tokens. There are some nice examples of using this in the documentation you could include in your code, which do more than your simple case. As mentioned by @Grijesh Chauhan, there are more separators to consider than \n, or end-of-line. What about spaces and tabs, for example?
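The check for a missing file can be sketched as follows. The helper name open_source is hypothetical (it is not part of the original program); it only wraps fopen so that a failure is reported instead of being passed on to fgets.

```c
#include <stdio.h>

/* Hypothetical helper: opens `path` for reading. On failure it prints the
   path followed by the system's error message on stderr and returns NULL,
   so the caller can stop before handing a NULL FILE* to fgets(). */
FILE *open_source(const char *path) {
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        perror(path);
    }
    return fp;
}
```

In main, something like fp = open_source("C:\\Sum.c"); if (fp == NULL) return 1; would then replace the bare fopen call.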
However, in programs, things are not always separated by spaces and lines. Take this example:
result=(number*scale)+total;
If we only used white space as a separator, then it would not identify the words used and only pick up the whole expression, which is obviously not tokenization. We could add these things to the separator list:
char seprators [] = "\n=(*)+;";
Then your code would pick out those words too. There is still a flaw in that strategy, because in programming languages, those symbols are also tokens that need to be identified. The problem with programming language tokenization is there are no clear separators between tokens.
There is a lot of theory behind this, but basically we have to write down the patterns that form the basis of the tokens we want to recognise and not look at the gaps between them, because as has been shown, there aren't any! These patterns are normally written as regular expressions. Computer Science theory tells us that we can use finite state automata to match these regular expressions. Writing a lexer involves a particular style of coding, which looks like this:
while ( NOT <<EOF>> ) {
switch ( next_symbol() ) {
case state_symbol[1]:
....
break;
case state_symbol[2]:
....
break;
default:
error(diagnostic);
}
}
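Filled in for a handful of token kinds, that skeleton might look roughly like the sketch below. The names TokenKind and next_token are made up for illustration, the patterns are hand-coded rather than generated from regular expressions, and only the symbols from the result=(number*scale)+total; example are recognised.

```c
#include <ctype.h>
#include <stddef.h>

/* Token kinds for this sketch only; a real C lexer needs many more. */
typedef enum { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_PUNCT, TOK_ERROR } TokenKind;

/* Reads one token starting at *src, copies its text into buf (capacity cap,
   null-terminated when cap > 0) and advances *src past the token. */
TokenKind next_token(const char **src, char *buf, size_t cap) {
    const char *p = *src;
    size_t n = 0;
    TokenKind kind;

    while (isspace((unsigned char)*p)) p++;      /* white space is not a token */

    if (*p == '\0') {
        kind = TOK_EOF;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        kind = TOK_IDENT;                        /* [A-Za-z_][A-Za-z0-9_]* */
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n + 1 < cap) buf[n++] = *p;
            p++;
        }
    } else if (isdigit((unsigned char)*p)) {
        kind = TOK_NUMBER;                       /* [0-9]+ */
        while (isdigit((unsigned char)*p)) {
            if (n + 1 < cap) buf[n++] = *p;
            p++;
        }
    } else {
        switch (*p) {
        case '=': case '(': case ')': case '*': case '+': case ';':
            kind = TOK_PUNCT;                    /* single-character symbols */
            break;
        default:
            kind = TOK_ERROR;                    /* unrecognised character */
            break;
        }
        if (n + 1 < cap) buf[n++] = *p;
        p++;
    }
    if (cap > 0) buf[n] = '\0';
    *src = p;
    return kind;
}
```

Calling next_token repeatedly until it returns TOK_EOF yields the symbols as tokens in their own right, which strtok cannot do because it throws its separators away.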
So, now, perhaps the value of the academic assignment becomes clearer.

Related

Why does this code accessing the array after scanf result in a segmentation error?

For some homework I have to write a calculator in C. I wanted to input some string with scanf and then access it. But when I access the first element I get a segmentation error.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
int main(){
char input1[30];
scanf("%s",input1);
printf("%s",input1);
char current = input1[0];
int counter = 0;
while(current != '\0'){
if(isdigit(current) || current == '+' || current == '-' || current == '*' || current == '/'){
counter++;
current = input1[counter];
}else{
printf("invalid input\n");
exit(1);
}
}
return 0;
}
The printf in line 3 returns the string, but accessing it in line 4 returns a segmentation error (tested in gdb). Why?
There are a few potential causes, some of which have been mentioned in the comments (I won't cover those). It's hard to say which one (or more) is the cause of your problem, so I guess it makes sense to iterate them. However, you may notice that I cite some resources in the process... The information is out there, yet you're not stumbling across it until it's too late. Something needs to change with how you research, because this is slowing your progress down.
On input/output dynamics, just a quick note
printf("%s",input1);
Unless we include a trailing newline, this output may be delayed (or "buffered"), which may have the effect of confusing you about the root of your issues. As an alternative to using a trailing newline (which I'd prefer, personally) you could explicitly force partial lines to be written by invoking fflush(stdout) immediately after each of the relevant output operations, or use setbuf to disable buffering entirely. I think this is unlikely to be your problem, but it may mask your problem, so it's important to realise, when using printf to debug, it might be best to include a trailing newline...
On main entry points
The first potential culprit I see is here:
int main()
I don't know why our education system is still pushing these broken lessons. My only guess is the professors learnt many years back using the nowadays irrelevant Turbo C and don't want to stay up-to-date with tech. We can further reduce this to a simple testcase to work out if this is your segfault, but like I said, it's hard to say whether this is actually your problem...
#include <string.h> // needed for memset
int main() {
char input1[30];
memset(input1, '\x90', sizeof input1);
return 0; // this is redundant for `main` nowadays, btw
}
To explain what's going on here, I'll cite this page, which you probably ought to go and read (in its entirety) once you're done here:
A common misconception for C programmers, is to assume that a function prototyped as follows takes no arguments:
int foo();
In fact, this function is deemed to take an unknown number of arguments. Using the keyword void within the brackets is the correct way to tell the compiler that the function takes NO arguments.
Simply put, if the linker doesn't know/can't work out how many arguments are required for the entry point, there's probably gonna be some oddness to your callstack, and that's gonna occur at the beginning or end of your program.
On input errors, return values and uninitialised access
#include <assert.h>
#include <stdio.h>
#include <string.h>
int main(void) {
char input1[30];
memset(input1, '\x90', sizeof input1);
scanf("%s",input1); // this is sus a.f.
assert(memchr(input1, '\0', sizeof input1));
}
In my testcase, I actually wrote '\x90' to each byte in the array, to show that if the scanf call fails you may end up with an array that has no null terminator. If this is your problem, this assertion is likely to fail (as you can see from the ideone demo) when you run it, which indicates that your loop is likely accessing garbage beyond the bounds of input1. On this note I intended to demonstrate that we (mostly) cannot rely upon scanf and friends unless we also check their return values! There's a good chance your compiler is warning you about this one, so another lesson is to pay close attention to warning messages, and strive to have none.
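One way to act on that is sketched below. The helper read_word is hypothetical; it uses fscanf on a FILE * so it can be exercised on something other than stdin, and the %29s width is an addition that keeps the input within a 30-byte array such as input1.

```c
#include <stdio.h>

/* Hypothetical helper: reads one white-space-delimited word of at most 29
   characters into buf, which must have room for 30 bytes. Returns 1 on
   success. On end-of-file or a read error it returns 0 and sets buf to the
   empty string, so the caller never sees an unterminated array. */
int read_word(FILE *in, char buf[30]) {
    if (fscanf(in, "%29s", buf) != 1) {
        buf[0] = '\0';
        return 0;
    }
    return 1;
}
```

With this, if (!read_word(stdin, input1)) { /* report the error and exit */ } would stand in for the unchecked scanf("%s", input1).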
On argument expectations for standard library functions
For many standard library functions it may be possible to give input that is outside of the acceptable domain, and so causes instability. The most common form, which I also see in your program, exists in the form of possibly passing invalid values to <ctype.h> functions. In your case, you could change the declaration of current to be an unsigned char instead, but the usual idiom is to put the cast explicitly in the call (like isdigit((unsigned char) current)) so the rest of us can see you're not stuck in this common error, at least while you're learning C.
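The cast idiom can be sketched like this. The function name is_valid_expression is made up for illustration; it performs the same character test as the loop in the question, but on a string passed as an argument.

```c
#include <ctype.h>

/* Returns 1 if every character of s is a digit or one of + - * /,
   and 0 otherwise. The cast to unsigned char keeps negative char values
   (possible where plain char is signed) out of isdigit(), whose argument
   must be representable as an unsigned char or be EOF. */
int is_valid_expression(const char *s) {
    for (; *s != '\0'; s++) {
        if (!isdigit((unsigned char)*s) &&
            *s != '+' && *s != '-' && *s != '*' && *s != '/') {
            return 0;
        }
    }
    return 1;
}
```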
Please note at this point I'm thinking whichever resources you're using to learn aren't working, because you're stumbling into common traps... please try to find more reputable resources to learn from so you don't fall into more common traps and waste more time later on. If you're struggling, check out the C tag wiki...

C Trying to match the exact substring and nothing more

I have tried different functions including strtok(), strcmp() and strstr(), but I guess I'm missing something. Is there a way to match the exact substring in a string?
For example:
If I have a name: "Tan"
And I have 2 file names: "SomethingTan5346" and "nothingTangyrs634"
So how can I make sure that I match the first string and not both? Because the second file is for the person Tangyrs. Or is it impossible with this approach? Am I going at it the wrong way?
If, as seems to be the case, you just want to identify strings that have your text but are immediately followed by a digit, your best bet is probably to get yourself a good regular expression implementation and just search for Tan[0-9].
It could be done simply by using strstr() to find the string and then checking the character following it with isdigit(), but the actual code to do that would be:
not as easy as you think, since you may have to do multiple searches (e.g., TangoTangoTan42 would need three checks); and
inadvisable if there's a chance the searches may become more complex (such as Tan followed by 1-3 digits or exactly two # characters and an X).
A regular expression library will make this much easier, provided you're willing to invest a little effort into learning about it.
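As a rough illustration of the regular-expression route, the sketch below uses the POSIX <regex.h> interface, so it assumes a POSIX system; it is not standard C. The function name has_tan_digit is hypothetical and the pattern Tan[0-9] is hard-coded.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if text contains "Tan" immediately followed by a digit,
   0 if it does not, and -1 if the pattern could not be compiled. */
int has_tan_digit(const char *text) {
    regex_t re;
    int found;

    /* REG_NOSUB: only a yes/no answer is needed, not match positions. */
    if (regcomp(&re, "Tan[0-9]", REG_NOSUB) != 0) {
        return -1;
    }
    found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

Changing the requirement (say, to Tan followed by one to three digits) then means editing the pattern string rather than the search logic.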
If you don't want to invest the time in learning regular expressions, the following complete test program should be a good starting point to evaluate a string based on the requirements in the first paragraph:
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int hasSubstrWithDigit(char *lookFor, char *searchString) {
// Cache length and set initial search position.
size_t lookLen = strlen(lookFor);
char *foundPos = searchString;
// Keep looking for string until none left.
while ((foundPos = strstr(foundPos, lookFor)) != NULL) {
// If at end, no possibility of following digit.
if (strlen(foundPos) == lookLen) return 0;
// If followed by digit, return true.
if (isdigit((unsigned char)foundPos[lookLen])) return 1;
// Otherwise keep looking, from next character.
foundPos++;
}
// Not found, return false.
return 0;
}
int main(int argc, char *argv[]) {
if (argc < 3) {
printf("Usage testprog <lookFor> <searchIn>...\n");
return 1;
}
for (int i = 2; i < argc; ++i) {
printf("Result of looking for '%s' in '%s' is %d\n", argv[1], argv[i], hasSubstrWithDigit(argv[1], argv[i]));
}
return 0;
}
Though, as you can see, it's not as elegant as a regex search, and is likely to become even less elegant if your requirements change :-)
Running that with:
./testprog Tan xyzzyTan xyzzyTan7 xyzzyTangy4 xyzzyTangyTan12
shows it in action:
Result of looking for 'Tan' in 'xyzzyTan' is 0
Result of looking for 'Tan' in 'xyzzyTan7' is 1
Result of looking for 'Tan' in 'xyzzyTangy4' is 0
Result of looking for 'Tan' in 'xyzzyTangyTan12' is 1
The solution depends on your definition of exact matching.
This might be useful for you:
Traverse all matches of the target substring.
C find all occurrences of substring
Finding all instances of a substring in a string
find the count of substring in string
https://cboard.cprogramming.com/c-programming/73365-how-use-strstr-find-all-occurrences-substring-string-not-only-first.html
etc.
Having the span of the match, verify that the previous and following characters match/do not match your criterion for "exact match".
Or,
You could take advantage of regex in C++ (I know the tag is "C"), with #include <regex>, or POSIX #include <regex.h>.
You may want to use strstr(3) to search a substring in a string, strchr(3) to search a character in a string, or even regular expressions with regcomp(3).
You should read more about parsing techniques, notably about recursive descent parsers. In some cases, sscanf(3) with %n can also be handy. You should take care of the return count.
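A small sketch of the sscanf with %n idea, under the assumption that the text to parse is exactly "Tan" followed by a number and nothing else. The function name parse_tan_id is hypothetical. Note that %d also skips leading white space and accepts a sign, so this is looser than a strict "digits only" check.

```c
#include <stdio.h>

/* Parses text of the form "Tan<number>" with nothing after the number.
   On success stores the number in *id and returns 1; otherwise returns 0. */
int parse_tan_id(const char *s, int *id) {
    int consumed = 0;

    /* %n records how many characters have been consumed so far; it is not
       counted in the return value, so one successful %d gives exactly 1. */
    if (sscanf(s, "Tan%d%n", id, &consumed) != 1) {
        return 0;
    }
    return s[consumed] == '\0';   /* reject trailing characters */
}
```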
You could loop to read then parse every line, perhaps using getline(3), see this.
You need first to document your input file format (or your file name conventions, if SomethingTan5346 is some file path), perhaps using EBNF notation.
(you probably want to combine several approaches I am suggesting above)
BTW, I recommend limiting (for your convenience) file paths to a restricted set of characters. For example using * or ; or spaces or tabs in file paths is possible (see path_resolution(7)) but should be frowned upon.

Asking a user to input their name and outputting their initials in C

I just wanted to start this off by admitting I'm a complete beginner to coding, and that it's not something that comes intuitively to me.
What I'm trying to do is write a simple program that has the user input their full name, and outputs their initials. The logic I'm trying to follow is that since strings in C are just characters in an array, the characters that should be an initial will come after the '\0' value.
I'm not sure if my problem is a problem of logic or translating that logic into working syntax, so any help would be appreciated.
Here is the code in full:
# include <stdio.h>
#include <cs50.h>
#include <string.h>
int main (void)
{
printf ("Please insert your name. \n");
string name = get_string();
//printf ("Your name is %s\n", name);
int x = 0;
char c = name [x];
while (c != '\0')
{
x++;
}
printf ("%c/n", c);
}
I understand it's a complete mess, but again I'm trying to figure out if it's best to just quit, so any help would be appreciated.
Thanks in advance.
The logic I'm trying to follow is that since strings in C are just characters in an array, the characters that should be an initial will come after the '\0' value.
In C, \0 denotes the end of a string, so you definitely don't want to be looking for that value.
Let's think about the logic. Someone's initials are probably:
the first character in the string
the first character after a space
– i.e. "Albus Percival Wulfric Brian Dumbledore" -> "APWBD" –
so you'll want to loop over the string, looking for either:
a space, in which case you'll want to grab the next letter, or
the end of the string ('\0') in which case you'll want to stop.
Edge cases to watch out for:
what happens if the string is empty?
what happens if the string ends with a space? (this might not happen if you're guaranteed to get properly formatted input)
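One possible shape for that loop is sketched below, in plain C rather than with the cs50 string type. The function name initials and its output-buffer parameters are assumptions made for this example; it also upper-cases each initial, which the description above does not require.

```c
#include <ctype.h>
#include <stddef.h>

/* Writes the initials of name into out (capacity cap, null-terminated when
   cap > 0): the first non-space character of the string and every
   non-space character that directly follows a space, upper-cased. */
void initials(const char *name, char *out, size_t cap) {
    size_t n = 0;
    int at_word_start = 1;   /* the start of the string counts as "after a space" */

    for (; *name != '\0'; name++) {
        if (*name == ' ') {
            at_word_start = 1;
        } else {
            if (at_word_start && n + 1 < cap) {
                out[n++] = (char)toupper((unsigned char)*name);
            }
            at_word_start = 0;
        }
    }
    if (cap > 0) out[n] = '\0';
}
```

Both edge cases fall out of the loop: an empty string produces an empty result, and a trailing space sets the flag but never finds a following letter.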
Please don't get discouraged – we were all beginners once. This kind of thinking isn't always straightforward, but the answer will come eventually. Good luck!

How to find tokens from a c file?

I am trying to generate tokens from a C source file. I have split the C file into an array line and stored the words of the entire file in an array words.
The problem is with the strtok() function, which is splitting the line on whitespace characters. Because of this, I am not getting certain delimiters like parentheses and brackets because there is no whitespace between them and other tokens.
How do I determine which one is an identifier and which one is an operator?
Code so far:
int main()
{
/* ... */
char line[300][200];
char delim[]=" \n\t";
char *words[1000];
char *token;
while (fgets(&line[i][0], 100, fp1) != NULL)
{
token = strtok(&line[i][0], delim);
while (token != NULL)
{
words[j++] = token;
token = strtok(NULL, delim);
}
i++;
}
for(i = 0; i < 50; i++)
{
printf("%s\n", words[i]);
}
return 0;
}
This is a tricky question, something that needs probably more depth than a StackOverflow answer. I'll try, nonetheless.
Tokenizing the input is the first part of the compilation process. The objective is to simplify the task of the parser, which is going to build an abstract syntax tree from the contents of the file. How do we simplify this? We recognize the tokens that have a special meaning, as well as identifiers, operators and so on. C is indeed a tricky, complex language, so let's simplify the language to tokenize: we'll start with a typical calculator.
An input example would be:
( 4 +5)* 2
When syntax is free, you can add or skip spaces, so as you have already found, splitting on spaces is not an option.
The tokenized output for the example above would be: LPAR, LIT, OP, LIT, RPAR, OP, LIT. The meaning goes as follows:
LPAR: Left parenthesis
RPAR: Right parenthesis
LIT: Literal (a number)
OP: Operator (say: +, -, * and /).
The complete output would therefore be:
{ LPAR, LIT(4), OP('+'), LIT(5), RPAR, OP('*'), LIT(2) }
Your lexer basically has to advance through the input string, char by char, using a state machine. For example, when you read a digit, you enter the "input literal" state, in which only further digits and '.' are allowed.
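A minimal sketch of such a lexer for the calculator language follows. The type and function names (TokType, Token, tokenize) are invented for this example, and the "input literal" state is the inner loop that consumes digits and at most one '.'.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { T_LPAR, T_RPAR, T_LIT, T_OP, T_BAD } TokType;

typedef struct {
    TokType type;
    double  value;   /* meaningful only for T_LIT */
    char    op;      /* meaningful only for T_OP  */
} Token;

/* Splits expr into at most max tokens stored in out and returns how many
   were written. Characters that fit no pattern become T_BAD tokens. */
size_t tokenize(const char *expr, Token *out, size_t max) {
    size_t n = 0;

    while (*expr != '\0' && n < max) {
        if (isspace((unsigned char)*expr)) {     /* spaces separate nothing */
            expr++;
            continue;
        }
        Token t = { T_BAD, 0.0, '\0' };
        if (*expr == '(') {
            t.type = T_LPAR;
            expr++;
        } else if (*expr == ')') {
            t.type = T_RPAR;
            expr++;
        } else if (*expr == '+' || *expr == '-' || *expr == '*' || *expr == '/') {
            t.type = T_OP;
            t.op = *expr++;
        } else if (isdigit((unsigned char)*expr)) {
            /* "input literal" state: digits, then optionally '.' and digits */
            double scale = 1.0;
            t.type = T_LIT;
            while (isdigit((unsigned char)*expr)) {
                t.value = t.value * 10.0 + (*expr++ - '0');
            }
            if (*expr == '.') {
                expr++;
                while (isdigit((unsigned char)*expr)) {
                    scale /= 10.0;
                    t.value += (*expr++ - '0') * scale;
                }
            }
        } else {
            expr++;                              /* leave t as T_BAD */
        }
        out[n++] = t;
    }
    return n;
}
```

Run on the input ( 4 +5)* 2, this produces the seven tokens listed above, each carrying its value or operator character where relevant.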
Now the parser has an easier task. If you feed it with the previous tokens, it does not have to skip spaces, or distinguish between a negative number and a minus operator, it can just advance in a list or array. It can behave following the type of the token, and some of them have associated data, as you can see.
This is only an introduction to the introduction, anyway. Information about the whole compilation process could fill a book. And there are actually many books devoted to this topic, such as the famous "Dragon book" by Aho, Sethi & Ullman. A more recent one is the "Tiger book".
Finally, lexers are quite similar to one another, and it is therefore possible to find generic lexers out there. You can even find the C grammar for those kinds of tools.
Hope this (somehow) helps.

Toggle the cases means capital to small and small to capital

Like if your file has the following text: CompuTEr. Then after running your code the content of the file should be: cOMPUteR!
But the problem is with only one character, a. It is not changing into capital A.
#include<stdio.h>
#include<conio.h>
#include <ctype.h>
int main()
{
char name[100];
int loop;
printf("Enter any Sting: ");
gets(name);
for (loop=0; name[loop] !=0; loop++)
{
if(name[loop]>='A' && name[loop]<='Z')
name[loop]=name[loop]+32;
else if(name[loop]>'a' && name[loop]<='z')
name[loop]=name[loop]-32;
}
printf("\nConvert case of character : %s",name);
return 0;
}
Change
else if(name[loop]>'a' && name [loop]<='z')
To
else if(name[loop]>='a' && name [loop]<='z')
And never ever use gets(). It is dangerous because it doesn't prevent buffer overflows. Use fgets() instead:
fgets(name,sizeof(name),stdin);
There are many problems with your code:
#include<stdio.h>
Use proper spacing: #include <stdio.h>.
While most spacing is not required, consistent use of spaces and indentation greatly enhances code readability and reduces the number of bugs. If you use sloppy style, you probably also use buggy algorithms and careless constructs. Bugs will have more places to hide. C is a very sharp tool, it is unforgiving with the careless.
#include<conio.h>
This header is an obsolete, Microsoft specific thing. It is not required here.
#include <ctype.h>
ctype.h defines the right functions for your assignment, but you don't even use them!
int main()
Define main as int main(void) or int main(int argc, char *argv[])
{
char name[100];
int loop;
printf("Enter any Sting: ");
So this whole thing was just a sting? Interesting lapsus!
gets(name);
Definitely never ever use gets! If the user types more than 99 characters at the prompt, gets will cause a buffer overflow as it cannot determine the size of name. This buffer overflow will most likely cause the program to crash, but careful exploitation of such a flaw can allow an attacker to take control of the computer. This is a very good and simple example of a security flaw. scanf is more complicated, inefficient and difficult to use safely in this context, as in most. Just use fgets and handle the trailing '\n' appropriately.
for (loop=0; name[loop] !=0; loop++)
The common idiom for this loop is to use an index variable called i. Naming it loop is confusing and verbose. The comparison to 0 could be written more simply as name[loop] if your teacher condones it, or name[loop] != '\0' to make it clear that you are comparing characters.
{
if(name[loop]>='A' && name[loop]<='Z')
You should use isupper to test for upper-case. Comparing explicitly to character values 'A' and 'Z' is non portable and error prone as you don't seem to get your comparisons right.
name[loop]=name[loop]+32;
You are assuming ASCII or similar encoding. This dirty trick is non portable, confusing and error prone. Just use the tolower function.
else if(name[loop]>'a' && name[loop]<='z')
Same remark as above. Your code is buggy. Proper use of spacing and less verbose index name should make the bug more obvious:
else if (name[i] > 'a' && name[i] <= 'z')
Use islower for both portability and reliability.
name[loop]=name[loop]-32;
See above. toupper was designed for this purpose.
}
printf("\nConvert case of character : %s",name);
Fix the broken English. Also add a '\n' at the end of the format string. Some systems will not even output anything if you don't.
return 0;
}
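Putting those remarks together, the core of the program might be rewritten along the lines of the sketch below. The function names toggle_case and read_line are made up for this example; the <ctype.h> functions and fgets are the ones recommended above.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Swaps the case of every letter in s, in place; other characters are
   left untouched. */
void toggle_case(char *s) {
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        if (isupper(c)) {
            s[i] = (char)tolower(c);
        } else if (islower(c)) {
            s[i] = (char)toupper(c);
        }
    }
}

/* Reads one line from in into buf (size bytes) and strips the trailing
   newline, if any. Returns 1 on success, 0 on end-of-file or error. */
int read_line(FILE *in, char *buf, size_t size) {
    if (fgets(buf, (int)size, in) == NULL) {
        return 0;
    }
    buf[strcspn(buf, "\n")] = '\0';
    return 1;
}
```

In main, read_line(stdin, name, sizeof name) would take the place of gets(name), followed by toggle_case(name) and a printf whose format string ends in '\n'.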
Proofing your homework online is only useful if you learn from it. I hope you did!

Resources