I am trying to build the semantic phase of a C compiler using lex and yacc. Right now the problem is that if the C program contains multiple errors, the compiler stops after the first one. What can I do?
I strongly recommend that you perform the semantic analysis as a separate phase, not as part of the parsing phase. Use YACC only to build an abstract syntax tree, then traverse this tree in a separate function. That function has unlimited freedom when it comes to moving around in the tree, as opposed to having to "follow the parsing". As for the specific problem you mentioned, pmg's comment seems to have pinpointed it.
There is no one absolute answer to this. A typical way to handle it is to create a special error rule that reads symbols until it reaches (for example) a semicolon at the end of a line, a reasonable signal that whatever follows is intended as a new declaration, definition, or statement, and then to restart parsing from that point (retaining enough context to know that, for example, you are currently parsing a function body, so you accept or reject input on that basis).
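In yacc/bison terms, that technique is spelled with the built-in error token. A minimal sketch follows; it assumes a lexer that supplies NUMBER tokens, and note that yyerror must report and return rather than call exit(), or the parse will still die on the first error:

%{
#include <stdio.h>
int yylex(void);
/* Report the error, but keep going: do NOT call exit() here. */
void yyerror(const char *msg) { fprintf(stderr, "error: %s\n", msg); }
%}
%token NUMBER
%%
program
    : /* empty */
    | program stmt
    ;
stmt
    : NUMBER ';'        { printf("statement ok\n"); }
    | error ';'         { yyerrok; }   /* skip to the next ';' and resume */
    ;
%%

Each match of the error rule discards input up to the next semicolon, so parsing, and with it error reporting, continues for the rest of the file.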
Related
I am currently in the process of writing a C-to-assembly compiler. It is not meant to be practical, but I would like to do it for the educational value. When testing for keywords, is there a more efficient way than reading the next word in the file and running it through a bunch of nested if statements that test for the keywords?
Your question is actually quite specific. You are asking about how to build the lexical analyzer, also known as the scanner, and how to efficiently and conveniently recognize keywords. The scanner is the first phase of a typical compiler, and it converts the source code, which is a sequence of characters, to a sequence of tokens, where a token is a unit such as a number, an operator or a keyword.
Since keywords match the pattern for general identifiers, a common trick is to put all the keywords in the symbol table, together with information that each is a keyword. Then, when the scanner finds an identifier, it searches the symbol table as usual to see whether that identifier has been seen before. If the identifier is a keyword, it will be found, together with the information about which keyword it is.
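Here is a minimal sketch of that trick in C; the token codes, fixed-size table, and linear search are illustrative only (a real scanner would hash, and would check for table overflow):

#include <stdio.h>
#include <string.h>

enum { TOK_IDENT = 0, TOK_IF, TOK_WHILE, TOK_RETURN };   /* illustrative codes */

struct sym { char name[32]; int token; };
static struct sym table[1024];
static int nsyms;

static void install(const char *name, int token)
{
    strncpy(table[nsyms].name, name, sizeof table[nsyms].name - 1);
    table[nsyms].token = token;
    nsyms++;
}

/* Returns a keyword's token code if the name was pre-installed as a
   keyword, TOK_IDENT otherwise (installing new identifiers on first sight). */
static int lookup(const char *name)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].token;
    install(name, TOK_IDENT);
    return TOK_IDENT;
}

int main(void)
{
    install("if", TOK_IF);            /* seed the table with the keywords */
    install("while", TOK_WHILE);
    install("return", TOK_RETURN);
    printf("%d %d\n", lookup("while"), lookup("count"));   /* prints: 2 0 */
    return 0;
}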
Are you doing this for part of a class? If so, there should be guidelines on parsing and lexing. If not, you're in for a lot of work!
Writing an actual compiler is much more complicated than just going through a bunch of if statements, because you need to keep track of the environment. You'll need to think about how you allow classes, functions, function calls, class instantiations, recursive functions... the list goes on.
Take a look at the course lectures from UC Berkeley on the subject (parsing, lexing, code generation) and the tools you'll need:
http://www-inst.eecs.berkeley.edu/~cs164/fa13/
Note that this course in particular used C++ to write a Python2.5 to Assembly compiler, but the concepts in the Lectures and Readings and some of the tools are not language-restricted.
Keywords (as opposed to tokens in general) form a closed set, for which it is practical to generate a collision-free hash function (tools such as gperf do exactly this). Because the set is small, the hash function doesn't even need to be minimal.
You can do it with a bunch of if/else if statements and strcmp(). However, writing statements for all of the keywords gets annoying very quickly. You'd be better off using a hash table: at the start of the compilation you put all keywords in the table, and then you do lookups as needed. The drawback is that if you have to use C, you'll also have to write your own hash table (or use one from a library). If you can use C++, you can use a map or an unordered_map from the standard library. In any case, as someone else mentioned, this won't be a performance bottleneck.
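If writing a hash table in C feels like too much, a sorted table plus the standard library's bsearch() is a minimal alternative; this is only a sketch, with an abbreviated keyword list:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Must stay alphabetically sorted for bsearch() to work. */
static const char *keywords[] = {
    "break", "case", "do", "else", "for", "if", "return", "while",
};

static int cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

static int is_keyword(const char *word)
{
    return bsearch(word, keywords,
                   sizeof keywords / sizeof keywords[0],
                   sizeof keywords[0], cmp) != NULL;
}

int main(void)
{
    printf("%d %d\n", is_keyword("while"), is_keyword("whale"));   /* prints: 1 0 */
    return 0;
}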
OK, so I have a complete (and working) Bison grammar.
The thing is I want to be able to set another starting point (%start) if I wish.
How is this doable, without having to create a separate grammar/parser?
I'm going to try to put together a version of yacc that does this. There is one complication that makes this not as trivial as it seems: the question of what constitutes an "end" symbol. The kind of place where this is of greatest use is in processing chunks in mid-stream (Knuth's TeX processor for [c]Web does this, for instance). Along these lines, another example where this can be used is in providing a unified parser for both the pre-processing layer and language layer and in processing individual macros themselves as entire parsing units (as well as being able to account for which macro bodies are common syntactic units like "expression" or "statement" and which are not).
In those kinds of applications, there is no natural "end" symbol to mark off the boundary of a segment for parsing. Normally, the LR method requires this in order to recognize when to take the "accept" action. Otherwise, you have accept-reduce (and even accept-shift) conflicts to contend with!
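Until such a yacc exists, a common workaround with stock yacc/bison is to declare one pseudo-token per desired start symbol and have the lexer emit the chosen one exactly once, before the real input. A sketch, in which the selector token names and the nonterminals expr and stmt are illustrative:

%token START_EXPR START_STMT    /* selector tokens, never produced from real input */
%%
top
    : START_EXPR expr
    | START_STMT stmt
    ;

The caller sets a flag before invoking yyparse(); the lexer checks that flag on its first call and returns the corresponding selector token, which steers the parser into the desired subgrammar. Note that this dodges rather than solves the "end" problem above: each alternative still ends only at end-of-input.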
I have a giant C project with numerous C files. I have to find all inner loops. I am sure there are no O(n³) blocks in the project, so only O(n²)-complexity blocks need to be found (a loop within a loop).
Is it possible to find all inner loops using grep? If yes, what regexp can I use to find all occurrences of inner loops of all kinds, like {for,for}, {while,for}, {for,while}, {do,while}, etc.? If not, is there any simple Unix-way method to do it (maybe multiple greps, or some kind of awk)?
Regexes are for regular languages; what you are describing is context-free, and I'm pretty sure it can't be done using regular expressions. See the answer to a similar question here. You should look at a more powerful tool, such as a scripting language (Python or so).
This is a good case for specific compiler extensions. Recent GCC (that is, version 4.6) can be extended by plugins (painfully coded in C) or by MELT extensions; MELT is a high-level domain-specific language for coding GCC extensions, and it is much easier to use than C.
However, I do admit that coding GCC extensions is not entirely trivial: you have to partly understand how GCC works and what its main internal representations are (Gimple, Tree, ...). When extending GCC, you basically add your own compiler passes, which can do whatever you want (including detecting nested loops). Coding a GCC extension is usually more than a week of work; the hardest part is understanding how GCC works.
The big advantage of working within the GCC framework (through plugins in C or extensions in MELT) is that your extensions work on the same data as the compiler does.
Back to the question of finding nested loops: don't treat it as a purely syntactic problem (this is why grep cannot work). Within the GCC compiler, at some level of internal representation, a loop implemented with for, while, do, or even gotos is still considered a loop, and for GCC these things can be nested (and GCC knows about the nesting!).
Without a C parser, you can get a heuristic solution at best.
If you can rely on certain rules being consistently followed in the code (e.g. no goto, no loops through recursion, ...), you can scan the preprocessed C code with regexps. grep is certainly not sophisticated enough, but with a few lines of Perl or similar it is possible.
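To make the heuristic concrete, here is a sketch in C rather than Perl. It tracks brace depth and reports any loop keyword seen while another loop is still open. It assumes preprocessed input, brace-enclosed loop bodies, and no loop keywords inside comments or strings; the trailing while of a do/while is also miscounted as a fresh loop:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

int main(void)
{
    char word[64];
    int n = 0, c, depth = 0, line = 1, loops = 0;
    int loop_depth[256];    /* brace depth at which each open loop began */

    while ((c = getchar()) != EOF) {
        if (c == '\n')
            line++;
        if (isalnum(c) || c == '_') {              /* accumulate a word */
            if (n < (int)sizeof word - 1)
                word[n++] = (char)c;
            continue;
        }
        word[n] = '\0';
        n = 0;
        if (strcmp(word, "for") == 0 || strcmp(word, "while") == 0 ||
            strcmp(word, "do") == 0) {
            if (loops > 0)
                printf("line %d: possible inner loop (%s)\n", line, word);
            if (loops < 256)
                loop_depth[loops++] = depth;
        }
        if (c == '{')
            depth++;
        else if (c == '}') {
            depth--;
            while (loops > 0 && loop_depth[loops - 1] >= depth)
                loops--;    /* loops opened at this depth have ended */
        }
    }
    return 0;
}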
But the technically better and much more reliable approach is to use a real C parser.
There are three kinds of loops in C:
"structured syntax" (while, for, ...)
[Watch out for GCC, which can hide statements (and therefore loops) inside expressions using its ({ stmt; exp }) statement-expression syntax!]
ad hoc loops using goto; these interact with the structured syntax.
recursion
To find the first kind, you have to find the structured syntax and the nesting.
Grep can certainly find the keywords (if you ignore false positives in comments and strings), but it can't find nested structures. You could of course use grep to find all the loop syntax and then manually inspect the hits that occur in the same file to see whether they are nested.
(If you wanted to do this without paying the false-positive price, you could use our Source Code Search Engine, which knows the lexical syntax of C and is never confused about whether a string of characters is a keyword, a number, a string, etc.)
If you want to find those loops automatically, you pretty much need a full C parser with the preprocessing already expanded (otherwise some macro may hide a critical piece of loop syntax). Once you have a syntax tree for C, it is straightforward (although likely a bit inconvenient) to code something that clambers over the tree, detecting loop syntax nodes and counting the nesting of loops in subtrees. You can do this with any tool that will parse C and give you abstract syntax trees. ANTLR can likely do this; I think there's a C grammar obtainable for ANTLR that handles C fairly well, but you'll have to run the preprocessor before using ANTLR.
You could also do this with our DMS Software Reengineering Toolkit with its C Front End. Our C Front End has a full preprocessor built in, so it can read the code directly and expand as it parses; it also handles a relatively wide variety of C dialects and character encodings (ever dealt with C containing Japanese text?). DMS provides an additional advantage: given a language front end (e.g., C), you can write patterns for that language directly using the language's own syntax. So we can express fragments of what we want to find easily:
pattern block_for_loop(t:expression,l:expression,i:expression, s: statements): statement
" for(\t,\l\,\i) { \s } ";
pattern statement_for_loop(t:expression,l:expression,i:expression, s: statement): statement
" for(\t,\l\,\i) \s ";
pattern block_while_loop(c:expression, s: statements): statement
" while(\c) { \s } ";
pattern statement_while_loop(c:expression, s: statement): statement
" while(\c) \s ";
...
and group them together:
pattern_set syntactic_loops
{ block_for_loop,
statement_for_loop,
block_while_loop,
statement_while_loop,
...
}
Given the pattern set, DMS can scan a syntax tree and find matches to any set member, without coding any particular tree-crawling machinery and without knowing a huge amount of detail about the structure of the tree. (There are a lot of node types in an AST for a real C parser!) Finding nested loops this way is pretty straightforward: scan the tree top-down for a loop (using the pattern set); any hits must be top-level loops. Then scan the subtrees of a found loop's AST node (easy when you know where the tree for the outer loop is) for additional loops; any hits must be nested loops; recurse if necessary to pick up loops of arbitrary nesting. This works for the GCC statements-inside-expressions stuff, too. The tree nodes are stamped with precise file/line/column information, so it's easy to generate a report on location.
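The tree walk itself is nothing exotic; here is a sketch in C over a made-up AST node type (this is not DMS's actual API):

#include <stdio.h>

typedef struct Node {
    const char *kind;             /* e.g. "for", "while", "call", ... */
    int is_loop;                  /* set on for/while/do nodes */
    int line;                     /* source position for reporting */
    struct Node **children;
    int n_children;
} Node;

/* Walk the tree; report any loop found while already inside a loop. */
static void find_nested(const Node *n, int enclosing_loops)
{
    if (n->is_loop && enclosing_loops > 0)
        printf("line %d: nested %s loop\n", n->line, n->kind);
    int depth = enclosing_loops + (n->is_loop ? 1 : 0);
    for (int i = 0; i < n->n_children; i++)
        find_nested(n->children[i], depth);
}

int main(void)
{
    Node inner = { "while", 1, 5, NULL, 0 };
    Node *kids[] = { &inner };
    Node outer = { "for", 1, 3, kids, 1 };
    find_nested(&outer, 0);       /* prints: line 5: nested while loop */
    return 0;
}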
For ad hoc loops built using goto (what, your code doesn't have any?), you need something that can produce the actual control flow graph and then structure that graph into nested control flow regions. The point here is that a while loop containing an unconditional goto out of it isn't really a loop in spite of the syntax, and an if statement whose then clause gotos back to code upstream of the if likely is a loop. All that loop syntax is really just a heuristic hint that you may have a loop!
Those control flow regions contain the real nesting of the control flow; DMS will construct C flow graphs and will produce those structured regions. It provides libraries to build and access that graph; this way you can get the "true" control flow based on gotos. Having found a pair of nested control flow regions, one can access the AST associated with parts of the region to get location information for the report.
GCC is always a lot of fun due to its significantly enhanced version of C. For instance, it has an indirect goto statement. (Regular C has this hiding under setjmp/longjmp!)
To figure out loops in the face of this, you need points-to analysis, which DMS also provides. This information is used by the nested region analysis. There are (conservative) limits to the accuracy of points-to analysis, but modulo that you get the correct nested region graph.
Recursion is harder to find. For this, you have to determine whether A calls B ... calls Z calls A, where A and B and ... can be in separate compilation units. You need a global call graph containing all the compilation units of your application. At this point, you are probably expecting me to say that DMS does that too; voila, I'm pleased to say it does. Constructing that call graph of course requires points-to analysis for function calls; yes, DMS does that too. With the call graph, you can find cycles, which are likely recursion. Also with the call graph, you can find indirect nesting, e.g., a loop in one function that calls a function in another compilation unit that also contains loops.
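Finding the cycles is ordinary graph work once the call graph exists. A minimal depth-first sketch in C, over a hard-coded toy graph rather than the output of any real tool:

#include <stdio.h>

enum { N = 3 };                      /* functions: 0 = main, 1 = A, 2 = B */
static const int calls[N][N] = {     /* calls[i][j] != 0 means i calls j */
    {0, 1, 0},                       /* main calls A */
    {0, 0, 1},                       /* A calls B */
    {0, 1, 0},                       /* B calls A: a cycle, likely recursion */
};
static int state[N];                 /* 0 = unvisited, 1 = on stack, 2 = done */

static void dfs(int f)
{
    state[f] = 1;
    for (int g = 0; g < N; g++) {
        if (!calls[f][g])
            continue;
        if (state[g] == 1)           /* back edge: f reaches an ancestor */
            printf("cycle: %d calls back into %d\n", f, g);
        else if (state[g] == 0)
            dfs(g);
    }
    state[f] = 2;
}

int main(void)
{
    for (int f = 0; f < N; f++)
        if (state[f] == 0)
            dfs(f);
    return 0;
}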
To find structures such as loops accurately, you need a lot of machinery (and it will take some effort, but then C is a bitch of a language to analyze), and DMS can provide it.
If you don't care about accuracy, and you don't care about all the kinds of loops, you can use grep and manual procedures to get a semi-accurate map of just the loops hinted at by the structured loop statements.
I suspect that finding something like this would be impossible using grep alone:
void somethingElse(void);

void outer(void)
{
    for (int i = 0; i < 10; i++) {
        somethingElse();            /* the second loop hides in the callee */
    }
}

/* ... insert other code ... */

void somethingElse(void)
{
    for (int j = 0; j < 10; j++) {
        printf("%d\n", j);
    }
}
I have built a small tool for static analysis of C code. Its purpose is to warn users about the use of functions such as strcpy() that could cause buffer overflows.
Now, to formalise this, I need to write a formal grammar which excludes the unsafe library methods from the set of accepted library methods.
For example,
AllowedSentence -> ANSI C Permitted Code, NOT UnSafeLibraryMethods
UnSafeLibraryMethods -> strcpy | other potentially unsafe methods
Any ideas on how this grammar can be formalised?
I think this should not be done at the grammar level. It should be a rule that is applied to the parse tree after parsing is done.
You hardly need a parser for the way you have posed the problem. If your only goal is to object to the presence of certain identifiers ("strcpy"), you can simply build a lexer that processes C and picks out identifiers. Special lexemes can recognize your list of "you shouldn't use this" names. This way you use positive recognition instead of negative recognition to pick out the identifiers that you believe to be trouble.
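Here is that idea sketched as a flex specification; the flagged list is abbreviated, and a real version would also need rules to skip comments and string literals:

%{
#include <stdio.h>
static int line = 1;
%}
%option noyywrap
%%
"strcpy"|"strcat"|"gets"|"sprintf"    { printf("line %d: unsafe call: %s\n", line, yytext); }
[A-Za-z_][A-Za-z0-9_]*                { /* ordinary identifier: ignore */ }
\n                                    { line++; }
.                                     { /* skip everything else */ }
%%
int main(void) { return yylex(); }

Longest-match ensures that identifiers merely containing one of the names (say my_strcpy2) fall through to the ordinary-identifier rule rather than being flagged.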
If you want a more sophisticated analysis tool, you'll likely want to parse C, name-resolve the identifiers to their actual definitions, and then scan the tree looking for identifiers that are objectionable. This will at least let you decide whether the identifier is actually defined by the user or comes from some known library; surely, if my code defines strcpy, you shouldn't complain unless you know my strcpy is defective somehow.
I have a "a pain in the a$$" task to extract/parse all standard C functions that were called in the main() function. Ex: printf, fseek, etc...
Currently, my only plan is to read each line inside main() and search for standard C functions by checking against a list of standard C functions that I will also be defining (#define CFUNCTIONS "printf...").
As you know, there are so many standard C functions that defining all of them will be very annoying.
Any idea how I can check whether a string is a standard C function?
If you have heard of cscope, try looking into the database it generates. There are instructions available at the cscope front end to list out all the functions that a given function has called.
If you look at the list of the calls from main(), you should be able to narrow down your work considerably.
If you have to parse by hand, I suggest starting with the included standard headers. They should give you a decent idea about which functions you could expect to see in main().
Either way, the work sounds non-trivial and interesting.
Parsing C source code seems simple at first blush, but as others have pointed out, programmers getting far off the leash with #defines and #includes is rather common. Unless it is known that the specific program to be parsed is mild-mannered with respect to text substitution, the complexity of parsing arbitrary C source code is considerable.
Consider the less used, but far more effective tactic of parsing the object module. Compile the source module, but do not link it. To further simplify, reprocess the file containing main to remove all other functions, but leave declarations in their places.
Depending on the requirements, there are two ways to complete the task:
Write a program which opens the object module and iterates through the external reference symbol table. If the symbol matches one of the interesting function names, list it. Many platforms have library functions for parsing an object module.
Write a command file or script which uses the developer tools to examine object modules. For example, on Linux, the command nm lists external references with a U.
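For illustration (exact symbol names and decoration vary by platform and compiler):

$ gcc -c main.c -o main.o
$ nm main.o | awk '$1 == "U" { print $2 }'
fseek
printf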
The task may look simple at first, but in order to be really 100% sure you would need to parse the C file. It is not sufficient to just look for the name; you need to know the context as well, i.e. only once you have determined that an identifier is a function call can you check whether it is a standard C runtime function.
(plus I guess it makes the task more interesting :-)
I don't think there's any way around having to define a list of standard C functions to accomplish your task. But it's even more annoying than that -- consider macros, for example:
#include <stdio.h>

#define OUTPUT(foo) printf("%s\n", foo)

int main(void)
{
    OUTPUT("Ha ha!");
    return 0;
}
So you'll probably want to run your code through the preprocessor (e.g. gcc -E) before checking which functions are called from main(). Then you might have cases like this:
some_func("This might look like a call to fclose(fp), but surprise!\n");
So you'll probably need a full-blown parser to do this rigorously, since string literals may span multiple lines.
I won't bring up trigraphs...that would just be pointless sadism. :-) Anyway, good luck, and happy coding!