When writing a compiler, how are tokens checked? - c

Does a compiler use if statements to decide what to do when a certain keyword is encountered, and should someone writing a compiler use them for most operations when checking code? Or is there a more efficient way? For example, when I test a symbol against a symbol table and it comes back as a valid "token", do I have to use an if statement for every single keyword to determine what to do? That seems rather inefficient. For example, the pseudocode:
/*Each keyword/token in my compiler has a numerical representation which is what the symbol table returns back for example #define IF 0 and so on*/
if(Token == IF){
//This will be done to generate the AST representation for IF statements
}else if(Token == ELSE){
//This will be done to generate the AST representation of an else branch
}else if(Token == INT){
//This will be done to generate the AST representation of an integer
}

What kind of compilers do you mean?
If performance matters, you may want something like a callback table: use the keyword as the key and the callback function as the value, so the pseudocode would look like this:
func *fp = funcTbl.get(Token);
if (fp) { fp(); }
You may try recursive descent too. The function related to each keyword gets called just where that keyword is expected.
Last but not least, what you wrote is OK as well.
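A minimal sketch of the callback-table idea in C, assuming the token codes are small consecutive integers (as with the question's #define IF 0). The enum values, handler names and dispatch() are hypothetical names made up for this illustration; the handlers only return a label where a real compiler would build an AST node.

```c
#include <stddef.h>

/* Hypothetical token codes; the question uses #define IF 0 and so on. */
enum token { TOK_IF, TOK_ELSE, TOK_INT, TOK_COUNT };

typedef const char *(*handler_fn)(void);

/* Placeholder handlers: a real compiler would build AST nodes here. */
static const char *handle_if(void)   { return "if-node"; }
static const char *handle_else(void) { return "else-node"; }
static const char *handle_int(void)  { return "int-node"; }

/* The table is indexed directly by token code, so lookup is one array access. */
static const handler_fn handlers[TOK_COUNT] = {
    [TOK_IF]   = handle_if,
    [TOK_ELSE] = handle_else,
    [TOK_INT]  = handle_int,
};

/* Returns the handler's result, or NULL for a token with no handler. */
const char *dispatch(int tok)
{
    if (tok < 0 || tok >= TOK_COUNT || handlers[tok] == NULL)
        return NULL;
    return handlers[tok]();
}
```

Calling dispatch(TOK_IF) runs handle_if() without testing any other keyword first. A switch on the token code is a reasonable alternative, since compilers can often turn a dense switch into a similar jump table.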

Assuming you have already split your source language from string representation to a series of lexical tokens, your next step is to use a parser to build an AST from your tokens.
The parsing stage of compilation achieves two main goals:
It checks your language for syntactic correctness, throwing an error if your input cannot be parsed according to the structure of your grammar.
It generates an AST representation of your source code
Does a compiler use if statements when deciding what to do if a certain keyword is encountered?
No, your parser should analyse the series of lexical tokens and check them against the structure of your language's grammar.
Parsing is a well-understood topic in computer science which can be approached in different ways. It cannot be trivially implemented with the example code fragment you have provided above. In a realistic programming language you need to consider that grammars can be ambiguous, and that a simple predictive parser is not appropriate for all grammars, and some kind of backtracking may be needed. If you do not understand this concept, I recommend you use a parser generator for this, such as Bison.
This diagram shows a simplistic overview of the most important stages of compilation and may help you to understand its pipeline structure.
This process has been refined over decades by many academics working out how best to 'divide and conquer' such a mammoth task. I strongly encourage you to follow it.
For further reading, check out Modern Compiler Implementation in Java by Andrew Appel.

Related

Is GLR algorithm a must when bison parsing C grammar?

I'm trying to study C grammar with flex/bison.
I found bison cannot process this grammar: https://www.lysator.liu.se/c/ANSI-C-grammar-y.html, because the LALR algorithm cannot process recursively multiple expressions.
Is GLR algorithm a must for C grammar?
There is nothing wrong with that grammar except:
it represents a very old version of C
it requires a lexical analyser which can somehow distinguish between IDENTIFIER and TYPE_NAME
it does not even attempt to handle the preprocessor phases
Also, it has one shift/reduce conflict as a result of the "dangling else" ambiguity. However, that conflict can be ignored because bison's conflict resolution algorithm produces the correct result in this case. (You can suppress the warning either with an %expect directive or by including a precedence declaration which favours shifting else over reducing if.) Or you can eliminate the ambiguity in the grammar using the technique described in the Wikipedia page linked above. (Note: I'm not talking about copy-and-pasting code from the Wikipedia page. In the case of C, you need to consider all cases of compound statements which terminate with an if statement.)
Moreover, an LR parser is not recursive, and it has no problems which could be described as a failure to "process recursively multiple expressions". (You might have that problem with a recursive descent parser, although it's pretty easy to work around the issue.)
So any problems you might have experienced (if your question refers to a concrete issue) have nothing to do with what's described in your question.
Of the problems I listed above, the most troubling is the syntactic ambiguity of the cast operator. The cast operator is not actually ambiguous; clearly, C compilers manage to compile such expressions correctly. But distinguishing between the two possible parses of, for example, (x)-y*z requires knowing whether x names a type or a variable.
In C, all names are lexically scoped, so it is certainly possible to resolve x at compile time. But the resolution is not context-free. Since GLR is also a technique for parsing context-free grammars, using a GLR parser won't directly help you. It might be useful in the sense that GLR parsers can theoretically produce "parse forests" rather than parse trees; that is, the output of a GLR parser might effectively contain all possible correct parses, leaving the possibility to resolve the ambiguity by building symbol tables for each scope and then choosing between alternative parses by examining the name binding in effect at each site. (This works because type alias declarations -- "typedefs" -- are not ambiguous, so all the potential parses will have the same alias declarations.)
The usual solution, though, is to parse the program text using a deterministic parser, maintaining a symbol table during the parse, and giving the lexical analyser access to this symbol table so that it can distinguish between IDENTIFIER and TYPE_NAME, as expected by the grammar you link. This technique is politely called "lexical feedback", although it's also often called "the lexer hack".
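A rough sketch of lexical feedback in C, assuming the parser calls a registration function when it finishes a typedef declaration and the lexer asks before returning any identifier-shaped lexeme. The names declare_typedef() and classify_name() and the fixed-size table are assumptions for illustration, and scopes are ignored entirely, which a real C front end cannot do.

```c
#include <string.h>

/* Token kinds expected by the linked grammar. */
enum { IDENTIFIER, TYPE_NAME };

#define MAX_TYPEDEFS 64   /* arbitrary limit for this sketch */

static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count;

/* Called by the parser after a typedef declaration. The pointer is stored,
   not copied, so the caller must keep the string alive. Returns 0 if full. */
int declare_typedef(const char *name)
{
    if (typedef_count >= MAX_TYPEDEFS)
        return 0;
    typedef_names[typedef_count++] = name;
    return 1;
}

/* Called by the lexer for every identifier-shaped lexeme. */
int classify_name(const char *name)
{
    for (int i = typedef_count - 1; i >= 0; i--)
        if (strcmp(typedef_names[i], name) == 0)
            return TYPE_NAME;
    return IDENTIFIER;
}
```

With this in place, x in (x)-y*z is returned as TYPE_NAME only if a typedef for x was seen earlier, which is enough for a deterministic parser to choose between the cast and the subtraction.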

Contextual conditions when building AST for C program

I'm writing an interpreter for C (subset) in Javascript (I want to provide program's execution visualisation in browser).
As the first step I want to create an AST for the user program. I'm using Jison for this, which is similar to the flex/bison combination.
For now I simply tokenize the program and parse to check if it conforms to the grammar given by the standard (let's leave alone the ambiguity problem introduced by typedef).
However conforming to C grammar doesn't guarantee that program makes any sense, for example
int main() {
x = ("jklfds" || "jklgfd")(2, imlost);
}
conforms to the grammar, although x is not declared, ("jklfds" || "jklgfd") isn't a function pointer - types are not checked. In general there are many contextual conditions that aren't checked.
I'm wondering how much I should check while building the AST. For example, in theory at this point it would be easy to fully calculate and check constant expressions. However, much of the other checking requires context.
Is it possible, for example, during parsing, to know that some identifiers refer to structs declared earlier in the program?
What about building the AST as is and checking contextual constraints by analyzing/transforming the AST multiple times, proving more and more conditions are correct? Will it be easier or harder than checking during parsing?
I'm looking for the most friendly solution, I don't care for its speed.

How to determine return type, arguments, function name from C99 function declarations

I'm looking for the simplest way to determine the return type, arguments and function name from a C header file written in C99.
It's my school project, which has to be written in Perl without any libs. So I have a few options. I can use regular expressions, but that's not applicable to the hardest functions, like the following:
int * (* func(int * arg[]))();
The return type should be "int * (* )()" and the argument is "int * []".
The second way is to use a grammar and parse it, but I think that this is not the right way.
My buddy told me about an existing algorithm which can do it, but he doesn't remember its name or where he saw it. The algorithm was quite simple. Something like: find the first closing parenthesis; everything between this closing parenthesis and the nearest preceding opening parenthesis is the arguments...
Does anyone have an idea what I am looking for?
Look at the magic decoder ring for C declarations
If you can, obtain The C Programming Language by Kernighan and Ritchie. Not only is it the C bible, but in chapter 5 they present code to parse C declarations. You can look there to see how they do it and quite possibly adapt their approach (chapter 5, section 12).
You simply have to build a parser for that kind of problem. Usually the top-down approach (e.g. recursive descent) would do for this kind of job. Fortunately, top-down parsers are more or less straightforward to implement.
The only hard bit in C-like languages is that they usually need at least LL(1) parsing (one token of lookahead), or even worse, LL(2) or more. So sometimes you have to peek a few tokens ahead to find out whether it's a function declaration or a function call, for example.

Keyword-Label-Value style configuration file parsing library for C

Does a configuration parsing library exist already that will read the following style of file:
Keyword Label Value;
With nesting by { } replacing Values; optional Labels; support for "Include" would be nice.
An example configuration file might look like:
Listen Inside 127.0.0.1:1000;
Listen Outside {
IP 1.2.3.4;
Port 1000;
TLS {
CertFile /path/to/file;
};
};
ACL default_acl {
IP 192.168.0.0/24;
IP 10.0.0.0/24;
};
What programming languages are you familiar with? My impression from your question is C.
It looks like the tokens of your configuration language can be described by regular expressions:
Listen
127.0.0.1:1000
1000
;
{
}
etc.
Almost all modern programming languages have some form of support for those.
If the implementation is C, I'd probably use flex. It generates a function which will apply a set of regular expressions, put the matched text into a C string, and return the type of that regular expression (just an int, which you choose). The function is a 'lexical analyser' or 'tokeniser'. It chops up streams of characters into handy units that match your needs, one regular expression at a time.
Flex is pretty easy to use. It has several advantages over lex. One is that you can have multiple lexical analyser functions, so if you need to do something odd for an include file, then you could have a second lexical analyser function for that job.
Your language looks simple. Bison/Yacc are very powerful tools, and "with great power comes great responsibility" :-)
I think it is sufficiently simple that I might just write a parser by hand. It might only take a few functions to handle its structure. A very straightforward technique is called recursive descent parsing. Have you got a CS degree, or do you understand this stuff?
Lots of people will (at this stage) tell you to get the 'Dragon Book' or one of its newer versions, often because that is what they had at college. The Dragon book is great, but it is like telling someone to read all of Wikipedia to find out about whales. Great if you have the time, and you'll learn a lot.
A reasonable start is the Wikipedia Recursive Descent parser article. Recursive descent is very popular because it is relatively straightforward to understand. The thing that makes it straightforward is to have a proper grammar which is cast into a form which is easy for recursive descent to parse. Then you literally write a function for every rule, with a simple error handling mechanism (that's why I asked about this). There are probably tools to generate them, but you might find it quicker to just write it. A first cut might take a day, then you'd be in a good position to decide.
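To give a feel for "a function for every rule", here is a small hand-written sketch in C. It assumes a grammar guessed from the example file, statement := WORD [WORD] (WORD | block) ';' and block := '{' statement* '}', and it uses a tiny hand-rolled tokenizer instead of flex so that it stands alone. All names (parse_config, struct parser and so on) are made up for the illustration, and it only validates and counts statements rather than storing values.

```c
#include <ctype.h>
#include <string.h>

enum tok { T_WORD, T_LBRACE, T_RBRACE, T_SEMI, T_EOF };

struct parser {
    const char *p;    /* read position in the input text */
    enum tok tok;     /* current lookahead token */
    int statements;   /* statements parsed so far */
};

/* Tokenizer: '{', '}' and ';' are one token each; any other run of
   characters up to whitespace or punctuation is a word. */
static void advance(struct parser *ps)
{
    while (isspace((unsigned char)*ps->p)) ps->p++;
    switch (*ps->p) {
    case '\0': ps->tok = T_EOF; return;
    case '{': ps->p++; ps->tok = T_LBRACE; return;
    case '}': ps->p++; ps->tok = T_RBRACE; return;
    case ';': ps->p++; ps->tok = T_SEMI; return;
    }
    while (*ps->p && !isspace((unsigned char)*ps->p) && !strchr("{};", *ps->p))
        ps->p++;
    ps->tok = T_WORD;
}

/* Consumes the lookahead if it has the expected kind. */
static int accept(struct parser *ps, enum tok t)
{
    if (ps->tok != t) return 0;
    advance(ps);
    return 1;
}

static int statement(struct parser *ps);

/* block := '{' statement* '}' */
static int block(struct parser *ps)
{
    if (!accept(ps, T_LBRACE)) return 0;
    while (ps->tok == T_WORD)
        if (!statement(ps)) return 0;
    return accept(ps, T_RBRACE);
}

/* statement := WORD [WORD] (WORD | block) ';' */
static int statement(struct parser *ps)
{
    if (!accept(ps, T_WORD)) return 0;             /* keyword */
    if (ps->tok == T_WORD) {
        advance(ps);                               /* label, or the value */
        if (ps->tok == T_WORD) advance(ps);        /* value after a label */
        else if (ps->tok == T_LBRACE && !block(ps)) return 0;
    } else if (!block(ps)) {                       /* keyword followed by a block */
        return 0;
    }
    if (!accept(ps, T_SEMI)) return 0;
    ps->statements++;
    return 1;
}

/* Returns the number of statements (nested ones included), or -1 on error. */
int parse_config(const char *text)
{
    struct parser ps = { text, T_EOF, 0 };
    advance(&ps);
    while (ps.tok == T_WORD)
        if (!statement(&ps)) return -1;
    return ps.tok == T_EOF ? ps.statements : -1;
}
```

Each grammar rule maps to one function, and the single lookahead token in ps->tok is enough to decide which branch to take. Error handling here is just "return 0 and give up"; a real parser would report the position of the offending token.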
A very nifty lex/flex feature is that any characters which are not matched are just echoed to standard output. So you can see what your regular expressions are matching, and can add them incrementally. When the output 'dries up', everything is being matched.
Pontification alert: IMHO, more C programmers should learn to use flex. It is relatively easy to use, and very powerful for text handling. IMHO lots are put off because they are also told to use yacc/bison which are much more powerful, subtle and complex tools.
end Pontification.
If you need a bit of help with the grammar, please ask. If there is a nice grammar (might not be the case, but so far your examples look okay) then implementation is straightforward.
I found two links to stackoverflow answers which look helpful:
Recursive descent parser implementation
Looking for a tutorial on Recursive Descent Parsing
Here is an example of using flex.
Flex takes a 'script', and generates a C function called yylex(). This is the input script.
Remember that all of the regular expressions are being matched within that yylex function, so though the script looks weird, it is really an ordinary C function. To tell the caller, which will be your recursive descent parser, what type of regular expression is matched, it returns an integer value that you choose, just like any ordinary C function.
If there is nothing to tell the parser about, like white space, and probably some form of comment, it doesn't return. It 'silently' consumes those characters. If the syntax needs to use newline, then that would be recognised as a token, and a suitable token value returned to the parser. It is sometimes easier to let it be more free form, so this example consumes and ignores all white space.
Effectively the yylex function is everything from the first %% to the second %%. It behaves like a big switch() statement.
The regular expressions are like (very exotic) case: labels.
The code inside the { ... } is ordinary C. It can contain any C statements, and must be properly nested within the { ... }
The stuff before the first %% is the place to put flex definitions, and a few 'instructions' to flex.
The stuff inside %{ ... %} is ordinary C, and can include any headers needed by the C in the file, or even define global variables.
The stuff after the second %% is ordinary C, with no need for extra syntax, so no %{ ... %}.
/* scanner for configuration files */
%{
/* Put headers in here */
#include <stdio.h>   /* fprintf */
#include <stdlib.h>  /* exit */
#include "config.h"  /* token values */
%}
%%
[0-9]+ { return TOK_NUMBER; }
[0-9]+"."[0-9]+"."[0-9]+"."[0-9]+":"[0-9]+ { return TOK_IP_PORT; }
[0-9]+"."[0-9]+"."[0-9]+"."[0-9]+"/"[0-9]+ { return TOK_IP_RANGE; }
"Listen" { return TOK_KEYWORD_LISTEN; }
[A-Za-z][A-Za-z0-9_]* { return TOK_IDENTIFIER; }
"{" { return TOK_OPEN_BRACE; }
"}" { return TOK_CLOSE_BRACE; }
";" { return TOK_SEMICOLON; }
[ \t\n]+ /* eat up whitespace, do nothing */
. { fprintf(stderr, "Unrecognized character: %s\n", yytext );
exit(1);
}
%%
/* -------- A simple test ----------- */
int main(int argc, char *argv[])
{
    int tok;
    yyin = stdin;
    /* yylex() returns 0 at end of input */
    while ((tok = yylex()) != 0) {
        fprintf(stderr, "%d %s\n", tok, yytext);
    }
    return 0;
}
That has a minimal, dummy main, which calls the yylex() function to get the next token (enum) value. yytext is the string matched by the regular expression, so main just prints it.
WARNING, this is barely tested, little more than:
flex config.l
gcc lex.yy.c -ll
./a.out <tinytest
The values are just integers, so an enum in a header:
#ifndef _CONFIG_H_
#define _CONFIG_H_
enum TOKENS {
TOK_KEYWORD_LISTEN = 256,
TOK_IDENTIFIER = 257,
TOK_OPEN_BRACE = 258,
TOK_CLOSE_BRACE = 259,
TOK_SEMICOLON = 260,
TOK_IP_PORT = 261,
TOK_IP_RANGE = 262,
TOK_NUMBER = 263,
};
#endif /* _CONFIG_H_ */
In your parser, call yylex when you need the next value. You'll probably wrap yylex in something which copies yytext before handing the token type value back to the parser.
You will need to be comfortable handling memory. If this were a large file, maybe use malloc to allocate space. But for small files, and to make it easy to get started and debug, it might make sense to write your own 'dumb' allocator. A 'dumb' memory management system can make debugging much easier. Initially just have a big char array statically allocated and a mymalloc() handing out pieces. I can imagine the configuration data never gets free()'d. Everything can be held in C strings initially, so it is straightforward to debug because the exact sequence of input is in the char array. An improved version might 'stat' a file and allocate a piece big enough.
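A minimal sketch of such a 'dumb' allocator, assuming a C11 compiler (for _Alignas and max_align_t) and a pool size picked arbitrarily. mymalloc() is the name used above; pool_strdup() and POOL_SIZE are hypothetical additions for the example.

```c
#include <stddef.h>
#include <string.h>

#define POOL_SIZE 65536   /* arbitrary size chosen for this sketch */

/* One big statically allocated array; pieces are handed out front to back. */
static _Alignas(max_align_t) char pool[POOL_SIZE];
static size_t pool_used;

/* Returns the next n bytes of the pool, or NULL when it is exhausted.
   Nothing is ever freed. */
void *mymalloc(size_t n)
{
    size_t align = _Alignof(max_align_t);
    size_t start = (pool_used + align - 1) / align * align;  /* round up */

    if (n > POOL_SIZE || start > POOL_SIZE - n)
        return NULL;
    pool_used = start + n;
    return pool + start;
}

/* Copies a token's text (for example yytext) into the pool. */
char *pool_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = mymalloc(len);

    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}
```

A wrapper around yylex() could call pool_strdup(yytext) before returning the token type, so the parser gets a stable copy of each lexeme. Because the pool is a plain char array, its contents can be inspected in a debugger in input order.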
How you deal with the actual configuration values is a bit beyond what I can describe. Text strings might be all that is needed, or maybe there is already a mechanism for that. Often there is no need to store the text value of 'Keywords', because the parser has recognised what it means, and the program might convert other values, e.g. IP addresses, into some internal representation.
Have you looked at lex and yacc (or alternatively, flex and bison)? It's a little hairy, but we use those to parse files that look exactly like your config file there. You can define sub-structures using brackets, parse variable-length lists with the same key, etc.
By labels do you mean comments? You can define your own comment structure, we use '#' to denote a comment line.
It doesn't support includes AFAIK.
C libraries exist for JSON and YAML. They look like what you need.

What is easiest way to calculate an infix expression using C language?

Suppose the user inputs an infix expression as a string.
What could be the easiest (by easiest I mean the shortest) way to evaluate the result of that expression using the C language?
One probable way is converting it to postfix and then evaluating it with a stack. But it's rather a long process.
Is there any way of using functions such as atoi() or eval() that could make the job easier?
C doesn't have an "eval" function built-in, but there are libraries that provide it.
I would highly recommend using TinyExpr. It's free and open-source C code that implements math evaluation from a string. TinyExpr is only 1 C file, and it's about 500 lines of code. I don't think you'll find a shorter or easier way that is actually complete (and not just a toy example).
Here is a complete example of using it, which should demonstrate how easy it is:
#include "tinyexpr.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
    printf("%f\n", te_interp("5 * 5", 0)); /* Prints 25.000000 */
return 0;
}
If you want to build an expression solver yourself, I would recommend looking at the TinyExpr source-code as a starting point. It's pretty clean and easy to follow.
Certainly the most instructive way (and possibly even the easiest, once you know how) is to learn how to write your own recursive descent parser. A parser for infix expressions in C isn't very long.
Here's one of a number of excellent blog posts by Eli Bendersky on parsing. (This one is the one that's most relevant to you, but I highly recommend all of them.) It contains source code for an infix expression parser -- admittedly in Python, not C, but the conversion should be fairly straightforward, and you'll learn a lot in the process.
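For a sense of scale, here is a compact recursive descent evaluator sketched in C. It is not a port of the blog post's code; it assumes only +, -, *, /, unary minus, parentheses and numbers accepted by strtod(), and the function name eval_infix() is made up for this example.

```c
#include <ctype.h>
#include <stdlib.h>

static const char *cur;   /* read position in the expression */
static int failed;        /* set when a syntax error is seen */

static double parse_expr(void);

static void skip_spaces(void)
{
    while (isspace((unsigned char)*cur)) cur++;
}

/* factor := number | '(' expr ')' | '-' factor */
static double parse_factor(void)
{
    skip_spaces();
    if (*cur == '(') {
        cur++;
        double v = parse_expr();
        skip_spaces();
        if (*cur == ')') cur++;
        else failed = 1;                 /* missing closing parenthesis */
        return v;
    }
    if (*cur == '-') {
        cur++;
        return -parse_factor();
    }
    char *end;
    double v = strtod(cur, &end);
    if (end == cur) failed = 1;          /* no number where one was expected */
    cur = end;
    return v;
}

/* term := factor (('*' | '/') factor)* */
static double parse_term(void)
{
    double v = parse_factor();
    for (;;) {
        skip_spaces();
        if (*cur == '*') { cur++; v *= parse_factor(); }
        else if (*cur == '/') { cur++; v /= parse_factor(); }
        else return v;
    }
}

/* expr := term (('+' | '-') term)* */
static double parse_expr(void)
{
    double v = parse_term();
    for (;;) {
        skip_spaces();
        if (*cur == '+') { cur++; v += parse_term(); }
        else if (*cur == '-') { cur++; v -= parse_term(); }
        else return v;
    }
}

/* Returns 1 and stores the value on success, 0 on a syntax error. */
int eval_infix(const char *text, double *result)
{
    cur = text;
    failed = 0;
    double v = parse_expr();
    skip_spaces();
    if (failed || *cur != '\0') return 0;
    *result = v;
    return 1;
}
```

Operator precedence falls out of the call structure: parse_expr() calls parse_term(), which calls parse_factor(), so multiplication binds tighter than addition without any explicit precedence table. The file-scope state keeps the sketch short but makes it non-reentrant.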
You need to parse the string. There's no eval() in C (as in most static languages), so you need to either write your own parser or find some library to help.
Since most easy-to-use parsers are for C++ and not C, I'd rather use a full embeddable language. My absolute favorite is Lua, which can be incredibly lightweight if you don't include the libraries. Also, the syntax is nicer than C's, so your users might like it better.
Of course, Lua is a full-blown programming language, so it might not be appropriate, or maybe it could help in other ways (to make it easier to extend your application).
One clean (possibly not short) way to do it is to build a tree, like a compiler would.
For example, say you have the expression "2+3". The '+' would be the head. The '2' would be the left child and the '3' would be the right child.
Since each expression evaluates to a value, this tree can be extended for arbitrarily complex expressions: it just needs to be arranged in order of precedence for each operator. Low-precedence operators (like '+') go at the top, while high-precedence operators (like '*') go at the bottom. You would then evaluate the expressions on the tree from the bottom up.
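A small sketch of the bottom-up evaluation in C, assuming the tree has already been built by a parser. The struct layout and the name eval_tree() are hypothetical; a node with op == 0 is taken to be a leaf holding a number.

```c
#include <stddef.h>

/* op == 0 marks a leaf holding a number; otherwise op is '+', '-', '*' or '/'. */
struct node {
    char op;
    double value;
    const struct node *left, *right;
};

/* Evaluates both children first, then applies the operator at this node. */
double eval_tree(const struct node *n)
{
    if (n == NULL) return 0.0;          /* defensive: missing child */
    if (n->op == 0) return n->value;

    double l = eval_tree(n->left);
    double r = eval_tree(n->right);

    switch (n->op) {
    case '+': return l + r;
    case '-': return l - r;
    case '*': return l * r;
    case '/': return l / r;
    default:  return 0.0;               /* unknown operator in this sketch */
    }
}
```

For "2+3*4" the '+' node sits at the root with the leaf 2 on the left and a '*' node (children 3 and 4) on the right, so the multiplication is evaluated before the addition.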
You need to build an interpreter for some scripting language.
Convert the string into an array of tokens which are the operands and operators.
Convert the infix token array to a Reverse Polish Notation array.
After the equation is in RPN, then you can pop tokens off the stack and operate on them.
Take a look at the Wikipedia article on Reverse Polish Notation. It shows how to do the conversion and the calculation.
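The three steps above can be sketched in C roughly as follows, using the shunting-yard algorithm for the infix-to-RPN conversion. The sketch assumes non-negative numbers, the four binary operators and parentheses only; struct rpn_tok, to_rpn(), eval_rpn() and MAX_TOKENS are names invented for the example, and tokenizing is folded into the conversion loop to keep it short.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TOKENS 128   /* arbitrary limit for this sketch */

/* One RPN token: an operand (op == 0) or an operator character. */
struct rpn_tok { char op; double value; };

static int prec(char op) { return (op == '+' || op == '-') ? 1 : 2; }

/* Shunting-yard: converts infix text to RPN in out[MAX_TOKENS].
   Returns the token count, or -1 on malformed input or overflow. */
int to_rpn(const char *s, struct rpn_tok *out)
{
    char ops[MAX_TOKENS];   /* operator stack */
    int nops = 0, nout = 0;

    while (*s) {
        if (isspace((unsigned char)*s)) { s++; continue; }
        if (nout + nops >= MAX_TOKENS) return -1;
        if (isdigit((unsigned char)*s)) {
            char *end;
            out[nout].op = 0;
            out[nout++].value = strtod(s, &end);
            s = end;
        } else if (*s == '(') {
            ops[nops++] = *s++;
        } else if (*s == ')') {
            while (nops > 0 && ops[nops - 1] != '(')
                out[nout++].op = ops[--nops];
            if (nops == 0) return -1;    /* unbalanced ')' */
            nops--;                      /* discard the '(' */
            s++;
        } else if (strchr("+-*/", *s)) {
            /* pop operators of higher or equal precedence (left-associative) */
            while (nops > 0 && ops[nops - 1] != '(' &&
                   prec(ops[nops - 1]) >= prec(*s))
                out[nout++].op = ops[--nops];
            ops[nops++] = *s++;
        } else {
            return -1;                   /* unknown character */
        }
    }
    while (nops > 0) {
        if (ops[nops - 1] == '(') return -1;   /* unbalanced '(' */
        out[nout++].op = ops[--nops];
    }
    return nout;
}

/* Evaluates RPN with a value stack. Returns 1 on success, 0 on error.
   n must not exceed MAX_TOKENS. */
int eval_rpn(const struct rpn_tok *t, int n, double *result)
{
    double st[MAX_TOKENS];
    int sp = 0;

    for (int i = 0; i < n; i++) {
        if (t[i].op == 0) { st[sp++] = t[i].value; continue; }
        if (sp < 2) return 0;            /* operator needs two operands */
        double b = st[--sp], a = st[--sp];
        switch (t[i].op) {
        case '+': st[sp++] = a + b; break;
        case '-': st[sp++] = a - b; break;
        case '*': st[sp++] = a * b; break;
        case '/': st[sp++] = a / b; break;
        default: return 0;
        }
    }
    if (sp != 1) return 0;
    *result = st[0];
    return 1;
}
```

For "3 + 4 * 2", to_rpn() produces the sequence 3 4 2 * +, and eval_rpn() then pushes operands and applies each operator to the top two stack entries.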

Resources