Multiple start points for Bison grammar/parser - c

OK, so I have a complete (and working) Bison grammar.
The thing is I want to be able to set another starting point (%start) if I wish.
How is this doable, without having to create a separate grammar/parser?

I'm going to try to put together a version of yacc that does this. There is one complication that makes this not as trivial as it seems: the question of what constitutes an "end" symbol. The kind of place where this is of greatest use is in processing chunks in mid-stream (Knuth's TeX processor for [c]Web does this, for instance). Along these lines, another example where this can be used is in providing a unified parser for both the pre-processing layer and language layer and in processing individual macros themselves as entire parsing units (as well as being able to account for which macro bodies are common syntactic units like "expression" or "statement" and which are not).
In those kinds of applications, there is no natural "end" symbol to mark off the boundary of a segment for parsing. Normally, the LR method requires this in order to recognize when to take the "accept" action. Otherwise, you have accept-reduce (and even accept-shift) conflicts to contend with!
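One workaround that needs no changes to yacc itself is to make the lexer return an artificial "selector" token first, and give the grammar one top-level alternative per selector (something like start: START_EXPR expr | START_STMT stmt). The C side of that idea might look like the sketch below. The token names, the start_lexer struct and the replay_tokens stand-in lexer are all hypothetical, and this does nothing about the end-symbol problem described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; a real parser would take these from the
   header the parser generator emits. 0 conventionally means end of input. */
enum token { TOK_EOF = 0, TOK_NUM = 258, START_EXPR, START_STMT };

/* Wraps an underlying lexer and injects one selector token first. */
struct start_lexer {
    int selector;                 /* START_EXPR or START_STMT */
    int selector_sent;            /* nonzero once the selector was returned */
    int (*next)(void *state);     /* the real lexer */
    void *state;                  /* opaque state for the real lexer */
};

int start_lexer_next(struct start_lexer *lx)
{
    assert(lx != NULL && lx->next != NULL);
    if (!lx->selector_sent) {
        lx->selector_sent = 1;
        return lx->selector;      /* first token picks the entry point */
    }
    return lx->next(lx->state);   /* afterwards, behave like the real lexer */
}

/* Stand-in for a real lexer: replays a zero-terminated token array. */
int replay_tokens(void *state)
{
    const int **cursor = state;
    int tok = **cursor;
    if (tok != TOK_EOF)
        (*cursor)++;
    return tok;
}
```

In a real parser, yylex would play the role of start_lexer_next, and the selector would be set just before calling yyparse.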

Related

How should I parse keywords when writing a C Compiler?

I am currently writing a C to Assembly compiler. It is not meant to be practical, but I would like to do it for the educational value. When testing for keywords, is there a more efficient way than reading the next word in the file and running it through a bunch of nested if statements that test for each keyword? Is there any better way?
Your question is actually quite specific. You are asking about how to build the lexical analyzer, also known as the scanner, and how to efficiently and conveniently recognize keywords. The scanner is the first phase of a typical compiler, and it converts the source code, which is a sequence of characters, to a sequence of tokens, where a token is a unit such as a number, an operator or a keyword.
Since keywords match the pattern for general identifiers, a common trick is to put all the keywords in the symbol table, each marked as a keyword. Then, when the scanner finds an identifier, it searches the symbol table as usual to see if that identifier has been seen before. If the identifier is a keyword, it will be found, together with the information about which keyword it is.
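A minimal sketch of that trick, assuming a small fixed-size chained hash table: the struct layout, the sizes, the hash function and the short keyword list are all illustrative choices, not anything prescribed by the answer above. Names are stored by pointer, so the caller has to keep the strings alive.

```c
#include <assert.h>
#include <string.h>

enum { TABLE_SIZE = 64, MAX_SYMBOLS = 256 };
enum sym_kind { SYM_IDENTIFIER, SYM_KEYWORD };

struct symbol {
    const char *name;    /* not copied; must outlive the table */
    enum sym_kind kind;
    int next;            /* index of next symbol in the same bucket, or -1 */
};

struct symtab {
    struct symbol syms[MAX_SYMBOLS];
    int buckets[TABLE_SIZE];
    int count;
};

static unsigned hash_name(const char *s)
{
    unsigned h = 5381;                   /* simple multiplicative string hash */
    while (*s)
        h = h * 33u + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Returns the symbol's index, inserting it with `kind` if it is new. */
int symtab_intern(struct symtab *t, const char *name, enum sym_kind kind)
{
    unsigned b = hash_name(name);
    for (int i = t->buckets[b]; i != -1; i = t->syms[i].next)
        if (strcmp(t->syms[i].name, name) == 0)
            return i;                    /* already known, keyword or not */
    assert(t->count < MAX_SYMBOLS);
    t->syms[t->count] = (struct symbol){ name, kind, t->buckets[b] };
    t->buckets[b] = t->count;
    return t->count++;
}

/* Empties the table and pre-loads a (partial) keyword list. */
void symtab_init(struct symtab *t)
{
    static const char *const keywords[] = {
        "if", "else", "while", "for", "return", "int"
    };
    t->count = 0;
    for (int i = 0; i < TABLE_SIZE; i++)
        t->buckets[i] = -1;
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        symtab_intern(t, keywords[i], SYM_KEYWORD);
}
```

The scanner then calls symtab_intern for every identifier-shaped lexeme and checks the kind field of the returned entry, so keywords cost no more than any other lookup.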
Are you doing this for part of a class? If so, there should be guidelines on parsing and lexing. If not, you're in for a lot of work!
Writing an actual compiler is much more complicated than just going through a bunch of if statements, because you need to keep track of the environment. You'll need to think about how you allow classes, functions, function calls, class instantiations, recursive functions... the list goes on.
Take a look at course lectures from UC Berkeley on the subject, i.e. parsing, lexing, code generation, and the tools you'll need:
http://www-inst.eecs.berkeley.edu/~cs164/fa13/
Note that this course in particular used C++ to write a Python2.5 to Assembly compiler, but the concepts in the Lectures and Readings and some of the tools are not language-restricted.
Keywords (as opposed to tokens in general) are a closed set, for which it's practical to generate a collision-free hash function. Because the set is small, it's not even necessary for the hash function to be minimal.
You can do it with a bunch of if - else if statements and strcmp(). However, writing statements for all of the keywords gets annoying very quickly. You'd be better off using a hash table: at the start of the compilation you put all keywords in the table and then you do lookups as needed. The drawback of this is that if you have to use C, you'll also have to write your own hash table (or use one from a library). If you can use C++, though, then you can use a map or an unordered_map from the STL. In any case, if you are worried about the performance, as someone else mentioned, it won't be a bottleneck.
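If writing a hash table in C feels like too much, a sorted array plus the standard bsearch() is a different technique that also replaces the if/strcmp chain. This is a sketch under the assumption that a partial keyword list is enough for illustration; keyword_index and compare_names are made-up names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Must stay sorted in strcmp() order for bsearch() to work. */
static const char *const keywords[] = {
    "break", "case", "char", "do", "else", "for",
    "if", "int", "return", "struct", "void", "while"
};

/* bsearch() passes the key first and a pointer to the array element second. */
static int compare_names(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns the keyword's index in the table, or -1 for an ordinary identifier. */
int keyword_index(const char *word)
{
    assert(word != NULL);
    const char *const *hit = bsearch(word, keywords,
                                     sizeof keywords / sizeof keywords[0],
                                     sizeof keywords[0], compare_names);
    return hit ? (int)(hit - keywords) : -1;
}
```

The returned index can double as the token code offset, which keeps the scanner's keyword handling down to one call.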

C - How to find all inner loops using grep?

I have a giant C project with numerous C files. I have to find all inner loops. I am sure there is no O(n³) block in the project, so only O(n²)-complexity blocks need to be found (a loop within a loop).
Is it possible to find all inner loops using grep? If yes, what regexp may I use to find all occurrences of inner loops of all kinds like {for,for}, {while,for}, {for, while}, {do, while}, etc. ? If no, is there any simple unix-way method to do it (maybe multiple greps or a kind of awk)?
Regular expressions describe regular languages; what you are describing looks context-free, and I'm pretty sure it can't be done with regular expressions. See the answer to a similar question here. You should look at some other kind of tool, such as a script in Python or a similar language.
This is a good case for specific compiler extensions. The recent GCC compiler (that is, version 4.6 of GCC) can be extended by plugins (painfully coded in C) or by MELT extensions; MELT is a high-level domain-specific language for coding GCC extensions, and MELT is much easier to use than C.
However, I do admit that coding GCC extensions is not entirely trivial: you have to partly understand how GCC works, and what are its main internal representations (Gimple, Tree, ...). When extending GCC, you basically add your own compiler passes, which can do whatever you want (-including detecting nested loops-). Coding a GCC extension is usually more than a week of work. (The hardest part is to understand how GCC works).
The big advantage of working within the GCC framework (thru plugins in C or extensions in MELT) is that your extensions are working on the same data as the compiler does.
Back to the question of finding nested loops, don't consider it as only purely syntactic (this is why grep cannot work). Within the GCC compiler, at some level of internal representations, a loop implemented by for, or while, or do, or even with goto-s, is still considered a loop, and for GCC these things can be nested (and GCC knows about the nesting!).
Without a C parser, you can get a heuristic solution at best.
If you can rely on certain rules being consistently followed in the code (e.g. no goto, no loops through recursion, ...), you can scan the preprocessed C code with regexps. Certainly, grep is not sophisticated enough but with a few lines of Perl or similar it is possible.
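As a rough illustration of such a heuristic (written in C rather than Perl), the sketch below counts loop keywords that appear inside the braced body of another loop. It assumes preprocessed input (no comments, no macros), that every loop body is wrapped in braces, and no goto or recursion; count_nested_loops and everything else in it is a made-up name.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum { MAX_DEPTH = 256 };
enum block_kind { BLOCK_OTHER, BLOCK_LOOP, BLOCK_DO };

static int word_is(const char *start, size_t len, const char *kw)
{
    return strlen(kw) == len && strncmp(start, kw, len) == 0;
}

/* Returns how many for/while/do keywords sit inside another loop's braces. */
int count_nested_loops(const char *src)
{
    enum block_kind stack[MAX_DEPTH];      /* kind of each open '{' */
    enum block_kind pending = BLOCK_OTHER; /* kind the next '{' will get */
    int depth = 0, parens = 0, nested = 0, after_do_block = 0;

    assert(src != NULL);
    for (const char *p = src; *p; ) {
        if (*p == '"' || *p == '\'') {     /* skip string/char literals */
            char quote = *p++;
            while (*p && *p != quote) {
                if (*p == '\\' && p[1])
                    p++;
                p++;
            }
            if (*p)
                p++;
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            size_t len = (size_t)(p - start);
            int is_do = word_is(start, len, "do");
            int is_while = word_is(start, len, "while");
            if (is_while && after_do_block) {
                /* tail of do { ... } while (...); not a new loop */
            } else if (is_do || is_while || word_is(start, len, "for")) {
                for (int i = 0; i < depth; i++)
                    if (stack[i] != BLOCK_OTHER) { nested++; break; }
                pending = is_do ? BLOCK_DO : BLOCK_LOOP;
            }
            after_do_block = 0;
        } else {
            char c = *p++;
            if (c == '(') {
                parens++;
            } else if (c == ')' && parens > 0) {
                parens--;
            } else if (c == ';' && parens == 0) {
                pending = BLOCK_OTHER;     /* loop without a braced body */
            } else if (c == '{') {
                assert(depth < MAX_DEPTH);
                stack[depth++] = pending;
                pending = BLOCK_OTHER;
            } else if (c == '}' && depth > 0) {
                after_do_block = (stack[--depth] == BLOCK_DO);
            }
        }
    }
    return nested;
}
```

Any code that breaks the stated assumptions (an unbraced inner loop, for example) is silently missed, which is exactly the limitation of the heuristic approach.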
But the technically better and much more reliable approach is to use a real C parser.
There are three kinds of loops in C:
"structured syntax" (while, for, ...)
[Watch out for GCC, which can hide statements, and therefore loops, inside expressions using its ({ stmt; exp; }) statement-expression syntax!]
ad hoc loops using goto; these interact with the structured syntax.
recursion
To find the first kind, you have to find the structured syntax and the nesting.
Grep can certainly find the keywords (if you ignore false positives in comments and strings), but it can't find nested structures. You could of course use grep to find all the loop syntax and then simply inspect those that occurred in the same file to see if they were nested.
(If you wanted to do this without the false positives price, you could use our Source Code Search Engine, which knows the lexical syntax of C and is never confused as to when a string of characters is a keyword, a number, a string, etc.)
If you want to find those loops automatically, you pretty much need a full C parser with expanded preprocessing accomplished. (Otherwise some macro may hide a critical piece of loop syntax). Once you have a syntax tree for C, it is straightforward (although likely a bit inconvenient) to code something that clambers over the tree, detecting loop syntax nodes, and counting nesting of loops in subtrees. You can do this with any tool that will parse C and give you abstract syntax trees. ANTLR can likely do this; I think there's a C grammar obtainable for ANTLR that handles C fairly well, but you'll have to run the preprocessor before using ANTLR.
You could also do this with our DMS Software Reengineering Toolkit with its C Front End. Our C Front End has a full preprocessor built in so it can read the code directly and expand as it parses; it also handles a relatively wide variety of dialects of C and character encodings (ever dealt with C containing Japanese text?). DMS provides an additional advantage: given a language (e.g., C) front end, you can write patterns for that language directly using the language syntax. So we can express fragments of what we want to find easily:
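The tree-clambering step can be sketched as follows, assuming a deliberately simplified, hypothetical AST node type. A real C parser will have many more node kinds and its own child representation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified AST: only loop kinds are distinguished. */
enum node_kind { NODE_OTHER, NODE_FOR, NODE_WHILE, NODE_DO };

struct ast_node {
    enum node_kind kind;
    size_t child_count;
    struct ast_node **children;
};

static int is_loop(const struct ast_node *n)
{
    return n->kind == NODE_FOR || n->kind == NODE_WHILE || n->kind == NODE_DO;
}

/* Returns the deepest syntactic loop nesting in the subtree rooted at n:
   0 = no loops, 1 = only un-nested loops, 2 = a loop inside a loop, ... */
int max_loop_depth(const struct ast_node *n)
{
    int deepest_child = 0;

    assert(n != NULL);
    for (size_t i = 0; i < n->child_count; i++) {
        int d = max_loop_depth(n->children[i]);
        if (d > deepest_child)
            deepest_child = d;
    }
    return deepest_child + (is_loop(n) ? 1 : 0);
}
```

Running this on each function body and reporting those with a result of 2 or more gives the syntactically nested loops; it says nothing about goto-based loops or recursion.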
pattern block_for_loop(t:expression,l:expression,i:expression, s: statements): statement
" for(\t;\l;\i) { \s } ";
pattern statement_for_loop(t:expression,l:expression,i:expression, s: statement): statement
" for(\t;\l;\i) \s ";
pattern block_while_loop(c:expression, s: statements): statement
" while(\c) { \s } ";
pattern statement_while_loop(c:expression, s: statement): statement
" while(\c) \s ";
...
and group them together:
pattern_set syntactic_loops
{ block_for_loop,
statement_for_loop,
block_while_loop,
statement_while_loop,
...
}
Given the pattern set, DMS can scan a syntax tree and find matches to any set member, without coding any particular tree crawling machinery and without knowing a huge amount of detail about the structure of the tree. (There's a lot of node types in an AST for a real C parser!) Finding nested loops this way would be pretty straightforward: scan the tree top down for a loop (using the pattern set); any hits must be top level loops. Scan subtrees of a found loop AST node (easy when you know where the tree for the outer loop is) for additional loops; any hits must be nested loops; recurse if necessary to pick up loops with arbitrary nesting. This works for the GCC loops-with-statements stuff, too. The tree nodes are stamped with precise file/line/column information so it's easy to generate a report on location.
For ad hoc loops built using goto (what, your code doesn't have any?), you need something that can produce the actual control flow graph, and then structure that graph into nested control flow regions. The point here is that a while loop that contains an unconditional goto out isn't really a loop in spite of the syntax; and an if statement whose then clause gotos back to code upstream of the if is likely really a loop. All that loop syntax stuff is really just heuristic hints that you may have a loop!
Those control flow regions contain the real nesting of the control flow; DMS will construct C flow graphs, and will produce those structured regions. It provides libraries to build and access that graph; this way you can get the "true" control flow based on gotos. Having found a pair of nested control flow regions, one can access the AST associated with parts of region to get location information to report.
GCC is always a lot of fun due to its significantly enhanced version of C. For instance, it has an indirect goto statement. (Regular C has this hiding under setjmp/longjmp!).
To figure out loops in the face of this, you need points-to analysis, which DMS also provides. This information is used by the nested region analysis. There are (conservative) limits to the accuracy of points-to analysis, but modulo that you get the correct nested region graph.
Recursion is harder to find. For this, you have to determine if A calls B ... calls Z calls A, where A and B and ... can be in separate compilation units. You need a global call graph, containing all the compilation units of your application. At this point, you are probably expecting me to say that DMS does that too, voila, I'm pleased to say it does. Constructing that call graph of course requires points-to analysis for function calls; yes, DMS does that too. With the call graph, you can find cycles in the call graph, which are likely recursion. Also with the call graph, you can find indirect nesting, e.g., loops in one function, that call a function in another compilation unit that also contains loops.
To find structures such as loops accurately you need a lot of machinery (and this will take some effort, but then C is a bitch of a language to analyze) and DMS can provide it.
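Once a call graph exists, the cycle check itself is small. The sketch below assumes the graph is already available as a boolean matrix (building it, including resolving calls through function pointers, is the hard part) and uses Warshall's transitive-closure algorithm; find_recursive and the matrix layout are illustrative choices.

```c
#include <assert.h>

enum { MAX_FUNCS = 16 };

/* calls[a][b] is nonzero when function a may call function b directly.
   On return, recursive[f] is nonzero when f can reach itself through one
   or more calls. Returns the number of such functions. */
int find_recursive(int n, unsigned char calls[MAX_FUNCS][MAX_FUNCS],
                   unsigned char recursive[MAX_FUNCS])
{
    unsigned char reach[MAX_FUNCS][MAX_FUNCS];
    int count = 0;

    assert(n >= 0 && n <= MAX_FUNCS);
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            reach[a][b] = calls[a][b] != 0;

    /* Warshall's algorithm: allow paths through intermediate function k. */
    for (int k = 0; k < n; k++)
        for (int a = 0; a < n; a++)
            for (int b = 0; b < n; b++)
                if (reach[a][k] && reach[k][b])
                    reach[a][b] = 1;

    for (int f = 0; f < n; f++) {
        recursive[f] = reach[f][f];
        count += recursive[f];
    }
    return count;
}
```

The same reach matrix also answers the indirect-nesting question: a loop in function a is nested around a loop in function b whenever the loop body calls something from which b is reachable.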
If you don't care about accuracy, and you don't care about all the kinds of loops, you can use grep and manual procedures to get a semi-accurate map of just the loops hinted at by the structured loop statements.
I suspect that finding something like this would be impossible using grep alone:
public void do(){
    for(...){
        somethingElse();
    }
}
... Insert other code...
public void somethingElse(){
    for(.....){
        print(....);
    }
}

Semantic Phase of compiler

I am trying to write the semantic phase of a C compiler using lex and yacc. Right now the problem is that if the C program has multiple errors, the compiler stops after the first one. What can I do?
I strongly recommend that you perform the semantic analysis as a separate phase, not as a part of the parsing phase. Use YACC only to build an abstract syntax tree, then traverse this tree in a separate function. Said function will have unlimited freedom when it comes to moving around in the tree, as opposed to having to "follow the parsing". As for the specific problem you mentioned, @pmg's comment seems to have pinpointed the problem.
There is no one absolute answer to this. A typical way to handle it is to create a special pattern to read symbols until it gets to (for example) a semicolon at the end of a line, giving a reasonable signal that whatever's after that is intended as a new declaration, definition, statement, etc., and then re-start parsing from that point (retaining enough context to know that, for example, you're currently parsing a function body, so you accept/reject input on that basis).
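The skip-to-semicolon idea can be shown without yacc at all. The sketch below hand-parses a toy statement form (ID '=' NUM ';') over a token array; the token names and parse_statements are invented for the example. In yacc itself the same effect is usually obtained with the built-in error token, for instance a rule along the lines of stmt: error ';' together with yyerrok.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds for a toy statement form:  ID '=' NUM ';'  */
enum tok { T_END, T_ID, T_ASSIGN, T_NUM, T_SEMI };

/* Parses a sequence of statements and returns the number of syntax errors.
   After an error, tokens are discarded up to and including the next ';'
   so that parsing can resume at the following statement. */
int parse_statements(const enum tok *toks)
{
    static const enum tok expected[] = { T_ID, T_ASSIGN, T_NUM, T_SEMI };
    int errors = 0;
    size_t i = 0;

    assert(toks != NULL);
    while (toks[i] != T_END) {
        int ok = 1;
        for (size_t k = 0; k < 4 && ok; k++) {
            if (toks[i] == expected[k])
                i++;
            else
                ok = 0;
        }
        if (!ok) {
            errors++;                    /* report the error, do not stop */
            while (toks[i] != T_END && toks[i] != T_SEMI)
                i++;                     /* skip to the synchronizing token */
            if (toks[i] == T_SEMI)
                i++;                     /* resume after the ';' */
        }
    }
    return errors;
}
```

Because the parser resynchronizes instead of returning at the first mismatch, a single run reports every malformed statement.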

How to define grammar which excludes a certain set of words?

I have built a small code for static analysis of C code. The purpose of building it is to warn users about the use of methods such as strcpy() which could essentially cause buffer overflows.
Now, to formalise the same, I need to write a formal Grammar which shows the excluded libraries as NOT a part of the allowed set of accepted library methods used.
For example,
AllowedSentence->ANSI C Permitted Code, NOT UnSafeLibraryMethods
UnSafeLibraryMethods->strcpy|other potentially unsafe methods
Any ideas on how this grammar can be formalised?
I think, this should not be done at the grammar level. It should be a rule that is applied to the parse tree after parsing is done.
You hardly need a parser for the way you have posed the problem. If your only goal is to object to the presence of certain identifiers ("strcpy"), you can simply build a lexer that processes C and picks out identifiers. Special lexemes can recognize your list of "you shouldn't use this". This way you use positive recognition instead of negative recognition to pick out the identifiers that you believe to be trouble.
If you want a more sophisticated analysis tool, you'll likely want to parse C, name-resolve the identifiers to their actual definitions, and then scan the tree looking for identifiers that are objectionable. This will at least let you decide if the identifier is actually defined by the user, or comes from some known library; surely, if my code defines strcpy, you shouldn't complain unless you know my strcpy is defective somehow.
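A minimal sketch of such a lexer-only check, assuming a short illustrative deny list and unpreprocessed input: it skips comments and string/character literals, extracts whole identifiers, and counts the ones on the list. The function names and the list contents are made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Illustrative deny list; extend as needed. */
static const char *const banned[] = { "gets", "sprintf", "strcat", "strcpy" };

static int is_banned(const char *start, size_t len)
{
    for (size_t i = 0; i < sizeof banned / sizeof banned[0]; i++)
        if (strlen(banned[i]) == len && strncmp(banned[i], start, len) == 0)
            return 1;
    return 0;
}

/* Counts banned identifiers in src, ignoring comments and literals. */
int count_banned_identifiers(const char *src)
{
    int hits = 0;

    assert(src != NULL);
    for (const char *p = src; *p; ) {
        if (p[0] == '/' && p[1] == '*') {            /* block comment */
            p += 2;
            while (*p && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (*p)
                p += 2;
        } else if (p[0] == '/' && p[1] == '/') {     /* line comment */
            while (*p && *p != '\n')
                p++;
        } else if (*p == '"' || *p == '\'') {        /* string or char literal */
            char quote = *p++;
            while (*p && *p != quote) {
                if (*p == '\\' && p[1])
                    p++;
                p++;
            }
            if (*p)
                p++;
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            hits += is_banned(start, (size_t)(p - start));
        } else if (isdigit((unsigned char)*p)) {     /* skip numbers like 0x1f */
            while (isalnum((unsigned char)*p) || *p == '.')
                p++;
        } else {
            p++;
        }
    }
    return hits;
}
```

Because whole identifiers are compared, a user-defined my_strcpy is not flagged, but a user-defined function that happens to be called strcpy still is; telling those apart needs name resolution.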

How do you compare two files containing C code based on code structure, not merely textual differences?

I have two files containing C code which I wish to compare. I'm looking for a utility which will construct a syntax tree for each file, and compare the syntax trees, instead of merely comparing the text of the files. This way minor differences in formatting and style will be ignored. It would be nice to even be able to tell the comparison tool to ignore differences such as variable names, etc.
Correct me if I'm wrong, but diff doesn't have this capability. I'm an Ubuntu user. Thanks!
Our SD Smart Differencer does exactly what you want. It uses compiler-quality parsers to read source code and build ASTs for two files you select. It then compares the trees guided by the syntax, so it doesn't get confused by whitespace, layout or comments. Because it normalizes the values of constants, it doesn't get confused by change of radix or how you expressed escape sequences!
The deltas are reported at the level of the language constructs (variable, expression, statement, declaration, function, ...) in terms of programmer intent (delete, insert, copy, move) complete with determining that an identifier has been renamed consistently throughout a changed block.
The SmartDifferencer has versions available for C (in a number of dialects; if you want compiler-accurate parsing, the language dialect matters) as well as for C++, Java, C#, JavaScript, COBOL, Python and many other languages.
If you want to understand how a set of files are related to one another, our SD CloneDR will accept a very large set of files, and tell you what they have in common. It finds code that has been copy-paste-edited across the entire set. You don't have to tell it what to look for; it finds it automatically. Using ASTs (as above), it isn't fooled by whitespace changes or renames of identifiers. There's a bunch of sample clone detection reports for various languages at the web site.
There is a program called codeCompare from devart (http://www.devart.com/codecompare/benefits.html#cc) that includes the following feature (I know it is not exactly what you asked for but probably it can be used for that).
The feature is called "Structure Comparison"
This functionality allows you to compare different file revisions by the presence of the structure blocks (classes, fields, methods). Different versions of the same file are compared independently of their destination.
Structure comparison can be applied to the following languages:
C#
C++
Visual Basic
JavaScript
(I know it does not include C, but maybe with the C++ version you can solve the problem)

Resources