C - How to find all inner loops using grep? - c

I have a giant C project with numerous C files. I have to find all inner loops. I am sure there is no O(n³) block in the project, so only O(n²)-complexity blocks need to be found (a loop inside a loop).
Is it possible to find all inner loops using grep? If yes, what regexp can I use to find all occurrences of inner loops of all kinds, like {for,for}, {while,for}, {for,while}, {do,while}, etc.? If not, is there any simple Unix-way method to do it (maybe multiple greps or some kind of awk)?

Regexes are for regular languages; what you are describing is context-free, and I'm pretty sure it can't be done using regular expressions. See the answer to a similar question here. You should look for a more powerful kind of automaton, e.g. via a scripting language (Python or so).

This is a good case for a compiler extension. Recent GCC compilers (that is, GCC 4.6) can be extended by plugins (painfully coded in C) or by MELT extensions; MELT is a high-level domain-specific language for coding GCC extensions, and MELT is much easier to use than C.
However, I do admit that coding GCC extensions is not entirely trivial: you have to partly understand how GCC works and what its main internal representations are (Gimple, Tree, ...). When extending GCC, you basically add your own compiler passes, which can do whatever you want (including detecting nested loops). Coding a GCC extension is usually more than a week of work (the hardest part is understanding how GCC works).
The big advantage of working within the GCC framework (through plugins in C or extensions in MELT) is that your extensions work on the same data as the compiler does.
Back to the question of finding nested loops: don't consider it purely syntactic (this is why grep cannot work). Within the GCC compiler, at some level of internal representation, a loop implemented with for, while, do, or even with gotos is still considered a loop, and for GCC these things can be nested (and GCC knows about the nesting!).

Without a C parser, you can get a heuristic solution at best.
If you can rely on certain rules being consistently followed in the code (e.g. no goto, no loops through recursion, ...), you can scan the preprocessed C code with regexps. Certainly, grep is not sophisticated enough, but with a few lines of Perl or similar it is possible.
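A rough sketch of that heuristic, written here in C (any scripting language would do): it assumes preprocessed input, braces around loop bodies, and no loop keywords inside comments or string literals, so treat its output as hints only. It tracks brace depth and flags a loop keyword seen while another loop is still open.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

static int is_ident(char c) { return isalnum((unsigned char)c) || c == '_'; }

/* Does a loop keyword start at s[i], as a whole word? */
static int loop_kw_at(const char *s, size_t i)
{
    static const char *kw[] = { "for", "while", "do" };
    for (size_t k = 0; k < 3; k++) {
        size_t n = strlen(kw[k]);
        if (strncmp(s + i, kw[k], n) == 0 &&
            (i == 0 || !is_ident(s[i - 1])) && !is_ident(s[i + n]))
            return 1;
    }
    return 0;
}

int main(int argc, char **argv)
{
    FILE *f = argc > 1 ? fopen(argv[1], "r") : stdin;
    if (!f) { perror("fopen"); return 1; }

    char line[4096];
    int depth = 0, lineno = 0, loops_open = 0;
    int loop_depth[256];            /* brace depth at which each open loop started */

    while (fgets(line, sizeof line, f)) {
        lineno++;
        for (size_t i = 0; line[i]; i++) {
            if (loop_kw_at(line, i)) {
                if (loops_open > 0)
                    printf("possible inner loop, line %d: %s", lineno, line);
                if (loops_open < 256)
                    loop_depth[loops_open++] = depth;
                while (is_ident(line[i])) i++;   /* skip the keyword itself */
                i--;
                continue;
            }
            if (line[i] == '{') depth++;
            if (line[i] == '}') {
                depth--;
                while (loops_open > 0 && loop_depth[loops_open - 1] >= depth)
                    loops_open--;    /* the loop opened at this depth has closed */
            }
        }
    }
    return 0;
}

It over-reports brace-less single-statement loops and knows nothing about goto or recursion, which is exactly why the parser-based approach below is more reliable.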
But the technically better and much more reliable approach is to use a real C parser.

There are three kinds of loops in C:
1. "structured syntax" (while, for, ...). Watch out for GCC, which can hide statements, and therefore loops, inside expressions using its ({ statement; expression; }) statement-expression extension!
2. ad hoc loops using goto; these interact with the structured syntax.
3. recursion.
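All three kinds can nest in one another. A small contrived C example (the function names are invented) showing one instance of each:

#include <stdio.h>

void structured(int n)                /* kind 1: structured syntax, nested */
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)   /* the inner loop: O(n^2) */
            printf("%d %d\n", i, j);
}

void with_goto(int n)                 /* kind 2: an ad hoc loop built from goto */
{
    int i = 0;
again:
    if (i < n) {
        for (int j = 0; j < n; j++)   /* a structured loop nested inside a goto loop */
            printf("%d %d\n", i, j);
        i++;
        goto again;
    }
}

void recur(int n)                     /* kind 3: recursion acting as the outer loop */
{
    if (n <= 0)
        return;
    for (int j = 0; j < n; j++)
        printf("%d\n", j);
    recur(n - 1);
}

int main(void)
{
    structured(2);
    with_goto(2);
    recur(2);
    return 0;
}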
To find the first kind, you have to find the structured syntax and the nesting.
Grep can certainly find the keywords (if you ignore false positives in comments and strings), but it can't find nested structures. You could of course use grep to find all the loop syntax and then simply inspect those that occurred in the same file to see if they were nested.
(If you wanted to do this without paying the false-positive price, you could use our Source Code Search Engine, which knows the lexical syntax of C and is never confused about whether a string of characters is a keyword, a number, a string, etc.)
If you want to find those loops automatically, you pretty much need a full C parser with the preprocessing already expanded (otherwise some macro may hide a critical piece of loop syntax). Once you have a syntax tree for C, it is straightforward (although likely a bit inconvenient) to code something that clambers over the tree, detecting loop syntax nodes and counting the nesting of loops in subtrees. You can do this with any tool that will parse C and give you abstract syntax trees. ANTLR can likely do this; I think there's a C grammar available for ANTLR that handles C fairly well, but you'll have to run the preprocessor before using ANTLR.
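As a sketch of that tree walk, assuming a hypothetical ast_node type with a kind, a source line, and a list of children (real front ends such as ANTLR, Clang or DMS have their own, much richer node types), counting nesting is a short recursion:

#include <stdio.h>

enum node_kind { NODE_FOR, NODE_WHILE, NODE_DO, NODE_OTHER };

struct ast_node {
    enum node_kind kind;
    int line;                       /* source line the node came from */
    int nchildren;
    struct ast_node **children;
};

static int is_loop(const struct ast_node *n)
{
    return n->kind == NODE_FOR || n->kind == NODE_WHILE || n->kind == NODE_DO;
}

/* Report every loop encountered while already inside 'depth' enclosing loops. */
static void find_nested(const struct ast_node *n, int depth)
{
    if (n == NULL)
        return;
    int d = depth;
    if (is_loop(n)) {
        if (depth > 0)
            printf("inner loop at line %d (nesting level %d)\n", n->line, depth + 1);
        d = depth + 1;
    }
    for (int i = 0; i < n->nchildren; i++)
        find_nested(n->children[i], d);
}

int main(void)                      /* tiny hand-built tree: a while inside a for */
{
    struct ast_node inner = { NODE_WHILE, 12, 0, NULL };
    struct ast_node *kids[] = { &inner };
    struct ast_node outer = { NODE_FOR, 10, 1, kids };
    find_nested(&outer, 0);         /* prints: inner loop at line 12 (nesting level 2) */
    return 0;
}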
You could also do this with our DMS Software Reengineering Toolkit with its C Front End. Our C Front End has a full preprocessor built in, so it can read the code directly and expand as it parses; it also handles a relatively wide variety of C dialects and character encodings (ever dealt with C containing Japanese text?). DMS provides an additional advantage: given a language front end (e.g., C), you can write patterns for that language directly using the language's own syntax. So we can express fragments of what we want to find easily:
pattern block_for_loop(t:expression,l:expression,i:expression, s: statements): statement
  " for(\t;\l;\i) { \s } ";
pattern statement_for_loop(t:expression,l:expression,i:expression, s: statement): statement
  " for(\t;\l;\i) \s ";
pattern block_while_loop(c:expression, s: statements): statement
  " while(\c) { \s } ";
pattern statement_while_loop(c:expression, s: statement): statement
  " while(\c) \s ";
...
and group them together:
pattern_set syntactic_loops
  { block_for_loop,
    statement_for_loop,
    block_while_loop,
    statement_while_loop,
    ...
  }
Given the pattern set, DMS can scan a syntax tree and find matches to any set member, without coding any particular tree-crawling machinery and without knowing a huge amount of detail about the structure of the tree. (There are a lot of node types in an AST for a real C parser!) Finding nested loops this way would be pretty straightforward: scan the tree top-down for a loop (using the pattern set); any hits must be top-level loops. Scan subtrees of a found loop AST node (easy when you know where the tree for the outer loop is) for additional loops; any hits must be nested loops; recurse if necessary to pick up loops with arbitrary nesting. This works for the GCC statements-inside-expressions trick, too. The tree nodes are stamped with precise file/line/column information, so it's easy to generate a report on location.
For ad hoc loops built using goto (what, your code doesn't have any?), you need something that can produce the actual control flow graph, and then structure that graph into nested control flow regions. The point here is that a while loop containing an unconditional goto out of it isn't really a loop in spite of the syntax, and an if statement whose then clause gotos back to code upstream of the if is likely really a loop. All that loop syntax stuff is really just a heuristic hint that you may have a loop!
Those control flow regions contain the real nesting of the control flow; DMS will construct C flow graphs and will produce those structured regions. It provides libraries to build and access that graph; this way you can get the "true" control flow based on gotos. Having found a pair of nested control flow regions, one can access the AST associated with parts of the region to get location information to report.
GCC is always a lot of fun due to its significantly enhanced version of C. For instance, it has an indirect goto statement. (Regular C has this hiding under setjmp/longjmp!)
To figure out loops in the face of this, you need points-to analysis, which DMS also provides. This information is used by the nested region analysis. There are (conservative) limits to the accuracy of points-to analysis, but modulo that you get the correct nested region graph.
Recursion is harder to find. For this, you have to determine if A calls B ... calls Z calls A, where A and B and ... can be in separate compilation units. You need a global call graph, containing all the compilation units of your application. At this point, you are probably expecting me to say that DMS does that too; voila, I'm pleased to say it does. Constructing that call graph of course requires points-to analysis for function calls; yes, DMS does that too. With the call graph, you can find cycles in the call graph, which are likely recursion. Also with the call graph, you can find indirect nesting, e.g., loops in one function that call a function in another compilation unit that also contains loops.
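To sketch just the last step: once a call graph has been extracted, finding likely recursion is ordinary cycle detection. The graph below is a made-up adjacency matrix for illustration; building the real graph (especially through function pointers) is the hard part, because it needs the points-to analysis mentioned above.

#include <stdio.h>

#define NFUNCS 4

/* Hypothetical call graph: calls[a][b] != 0 means function a calls function b. */
static const char *name[NFUNCS] = { "A", "B", "C", "Z" };
static int calls[NFUNCS][NFUNCS] = {
    [0][1] = 1,   /* A calls B */
    [1][3] = 1,   /* B calls Z */
    [3][0] = 1,   /* Z calls A */
};

static int state[NFUNCS];   /* 0 = unvisited, 1 = on current DFS path, 2 = done */

static void dfs(int f)
{
    state[f] = 1;
    for (int g = 0; g < NFUNCS; g++) {
        if (!calls[f][g])
            continue;
        if (state[g] == 1)           /* back edge: the call chain returns to g */
            printf("recursion cycle detected: %s -> %s\n", name[f], name[g]);
        else if (state[g] == 0)
            dfs(g);
    }
    state[f] = 2;
}

int main(void)
{
    for (int f = 0; f < NFUNCS; f++)
        if (state[f] == 0)
            dfs(f);
    return 0;
}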
To find structures such as loops accurately, you need a lot of machinery (and this will take some effort, but then C is a bitch of a language to analyze), and DMS can provide it.
If you don't care about accuracy, and you don't care about all the kinds of loops, you can use grep and manual procedures to get a semi-accurate map of just the loops hinted at by the structured loop statements.

I suspect that finding something like this would be impossible using grep alone:
void doSomething(void) {
    for (...) {
        somethingElse();
    }
}
... insert other code ...
void somethingElse(void) {
    for (...) {
        print(...);
    }
}

Related

How should I parse keywords when writing a C Compiler?

I am currently in the process of writing a C-to-assembly compiler; it is not meant to be practical, but I would like to do it for the educational value. I was wondering, when I am testing for keywords, is there a more efficient way than just reading the next word in the file and then running it through a bunch of nested if statements that test for the keywords? Is there a better way?
Your question is actually quite specific. You are asking about how to build the lexical analyzer, also known as the scanner, and how to efficiently and conveniently recognize keywords. The scanner is the first phase of a typical compiler, and it converts the source code, which is a sequence of characters, to a sequence of tokens, where a token is a unit such as a number, an operator or a keyword.
Since keywords match the pattern for general identifiers, a common trick is to put all the keywords in the symbol table, together with information that each is a keyword. Then, when the scanner finds an identifier, it searches the symbol table as usual to see if that identifier has been seen before. If the identifier is a keyword, it will be found, together with the information about which keyword it is.
Are you doing this for part of a class? If so, there should be guidelines on parsing and lexing. If not, you're in for a lot of work!
Writing an actual compiler is much more complicated than just going through a bunch of if statements, because you need to keep track of the environment. You'll need to think about how you allow classes, functions, function calls, class instantiations, recursive functions... the list goes on.
Take a look at course lectures from UC Berkeley on the subject, i.e. parsing, lexing, code generation, and the tools you'll need:
http://www-inst.eecs.berkeley.edu/~cs164/fa13/
Note that this course in particular used C++ to write a Python2.5 to Assembly compiler, but the concepts in the Lectures and Readings and some of the tools are not language-restricted.
Keywords (as opposed to tokens in general) form a closed set, for which it is practical to generate a collision-free (perfect) hash function. Because the set is small, it does not even need to be a minimal one.
You can do it with a bunch of if/else if statements and strcmp(). However, writing statements for all of the keywords gets annoying very quickly. You'd be better off using a hash table: at the start of the compilation you put all the keywords in the table, and then you do lookups as needed. The drawback is that if you have to use C, you'll also have to write your own hash table (or use one from a library). If you can use C++, though, you can use a map or an unordered_map from the standard library. In any case, if you are worried about performance, as someone else mentioned, it won't be a bottleneck.
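A sketch of the table-driven idea in plain C (the token names and table contents are invented for the example): a small sorted keyword table plus bsearch() avoids both the if/else chain and a hand-written hash table.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Token codes made up for the example. */
enum token { TOK_IDENT, TOK_IF, TOK_INT, TOK_RETURN, TOK_WHILE };

struct keyword { const char *text; enum token tok; };

/* Must stay sorted by 'text' because we look it up with bsearch(). */
static const struct keyword keywords[] = {
    { "if",     TOK_IF     },
    { "int",    TOK_INT    },
    { "return", TOK_RETURN },
    { "while",  TOK_WHILE  },
};

static int cmp(const void *k, const void *e)
{
    return strcmp((const char *)k, ((const struct keyword *)e)->text);
}

/* Called after the scanner has read an identifier-shaped word. */
enum token classify(const char *word)
{
    const struct keyword *kw = bsearch(word, keywords,
        sizeof keywords / sizeof keywords[0], sizeof keywords[0], cmp);
    return kw ? kw->tok : TOK_IDENT;
}

int main(void)
{
    printf("%d %d\n", classify("while"), classify("count"));  /* keyword vs. identifier */
    return 0;
}

If lookup speed ever matters, a generated perfect hash (e.g. from gperf) is the usual next step, but for a handful of keywords it won't be the bottleneck.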

How to get abstract syntax tree of a `c` program in `GCC`

How can I get the abstract syntax tree of a C program in GCC?
I'm trying to automatically insert OpenMP pragmas into the input C program.
I need to analyze nested for loops for finding dependencies so that I can insert appropriate OpenMP pragmas.
So basically what I want to do is traverse and analyze the abstract syntax tree of the input c program.
How do I achieve this?
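For concreteness, the kind of rewrite being asked about looks roughly like this. The pragma and the collapse clause below are only an illustration: the correct annotation depends on the dependence analysis of the particular loop nest, which is exactly what the AST traversal is for.

/* Before: a nested loop with independent iterations. */
void scale(int n, int m, double a[n][m], double s)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i][j] *= s;
}

/* After: the nest is parallelized once the analysis shows no loop-carried
 * dependence; with collapse(2) both loop variables are private to a thread. */
void scale_omp(int n, int m, double a[n][m], double s)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i][j] *= s;
}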
You need full dataflow to find 'dependencies'. Then you will need to actually insert the OpenMP calls.
What you want is a program transformation system. GCC probably has the dependency information, but it is famously difficult to work with for custom projects. Others have mentioned Clang and Rose. Clang might be a decent choice, but custom analysis/transformation isn't its main purpose. Rose is designed to support custom tools, but IMHO is a rather complicated scheme to use in practice because of its use of the EDG front end, which isn't designed to support transformation.
Our DMS Software Reengineering Toolkit with its C front end is explicitly designed to be a program transformation system. It has full data flow analysis (including points-to analysis, call graph construction and range analyses) tied to the AST in sensible ways. It provides source-to-source rewrite rules enabling changes to the ASTs expressed in surface syntax form; you can read the transformations rather than inspect a bunch of procedural code. With a modified AST, DMS can regenerate source code including the comments in a compilable form.
Not exactly an AST but GCCXML might help http://linux.die.net/man/1/gccxml
Edit: as stated by Ira Baxter, GCCXML does not output information about function/method bodies. Here's a fork that seems to fix that limitation: http://sourceforge.net/projects/gccxml-bodies/

Source to source manipulations

I need to do some source-to-source manipulations on the Linux kernel. I tried to use Clang for this purpose, but there is a problem. Clang preprocesses the source code, i.e. it performs macro and include expansion. This sometimes causes Clang to produce C code that is broken from the Linux kernel's point of view. I can't maintain all the changes manually, since I expect to have thousands of changes per single file.
I tried ANTLR, but the public grammars available are incomplete and not suitable for such projects as Linux kernel.
So my question is the following. Are there any ways to perform source-to-source manipulations for a C code without preprocessing it?
So assume the following code.
#define AAA 1
void f1(int a){
    if(a == AAA)
        printf("hello");
}
After applying source-to-source manipulation I want to get this
#define AAA 1
void f1(int a){
    if(functionCall(a == AAA))
        printf("hello");
}
But Clang, for instance, produces the following code, which does not fit my requirements, i.e. it expands the macro AAA:
#define AAA 1
void f1(int a){
    if(functionCall(a == 1))
        printf("hello");
}
I hope I was clear enough.
Edit
The above code is only an example. The source-to-source manipulations I want to do are not restricted to if() statement substitution; they also include inserting a unary operator in front of an expression, replacing an arithmetic expression with its positive or negative value, etc.
Solution
There is one solution I found myself. I use gcc to produce the preprocessed source code and then apply Clang. That way I don't have any issues with macro expansion and includes, since that job has already been done by gcc. Thanks for the answers!
You may consider http://coccinelle.lip6.fr/ : it provides a nice semantic patching framework.
An idea would be to replace all occurrences of
if(a == AAA)
with
if(functionCall(a == AAA))
You can do this easily using, e.g., the sed tool.
If you have a finite collection of patterns to be replaced you can write a sed script to perform the substitution.
Would this solve your problem?
Handling the preprocessor is one of the most difficult problems in applying transformations to C (and C++) code.
Our DMS Software Reengineering Toolkit with its C Front End comes relatively close to doing this. DMS can parse C source code, preserving most preprocessor conditionals, macro definitions and uses.
It does so by allowing preprocessor actions in "well-structured" places. Examples: #defines are allowed where declarations or statements can occur; macro calls and conditionals can stand in for many of the nonterminals in the language (e.g., function head, expression, statement, declarations) and appear in many non-structured places where people commonly put them (e.g., an #if foo ... #endif wrapped around just the "if (...) {" part of a statement). It parses the source code and preprocessor directives as if they were part of one language (they ARE; it's called "C"), and builds corresponding ASTs, which can be transformed and will regenerate correctly with the captured preprocessor directives. [This level of capability handles the OP's example perfectly.]
Some directives are poorly placed (both in the syntactic sense, e.g., spanning multiple fragments of the language, and in the "you've got to be kidding" understandability sense). These DMS handles by expanding them away, with some advance guidance from the engineer ("always expand this macro"). A less satisfactory approach is to hand-convert the unstructured preprocessor conditionals/macro calls into structured ones; this is a bit painful but more workable than one might expect, since the bad cases occur considerably less frequently than the good ones.
To do better than this, one needs to have symbol tables and flow analysis that take into account the preprocessor conditions, and capture all the preprocessor conditionals. We've done some experimental work with DMS to capture conditional declarations in the symbol table (seems to work fine), and we're just starting work on a scheme for the latter.
Not easy being green.
Clang maintains extremely accurate information about the original source code.
Most notably, the SourceManager is able to tell whether a given token has been expanded from a macro or written as is, and Chandler Carruth recently implemented macro diagnostics that can display the actual macro expansion stack (at the various stages of expansion), tracing back to the actual written code (3.0).
Therefore, it is possible to use the generated AST and then rewrite the source code with all its macros still in place. You would have to query virtually every node to know whether it comes from a macro expansion or not, and if it does, retrieve the original (unexpanded) code, but it seems possible.
There is a rewriter module in Clang.
You can dig up Chandler's code on the macro diagnostics stack.
So I guess you should have all you need :) (And hope so because I won't be able to help much more :p)
I would advise resorting to the ROSE framework. The source is available on GitHub.

Semantic Phase of compiler

I am trying to write the semantic phase of a C compiler using lex and yacc. Right now the problem is that if I have multiple errors in the C program, it stops after the first one. What can I do?
I strongly recommend that you perform the semantic analysis as a separate phase, not as a part of the parsing phase. Use YACC only to build an abstract syntax tree, then traverse this tree in a separate function. Said function will have unlimited freedom when it comes to moving around in the tree, as opposed to having to "follow the parsing". As for the specific problem you mentioned, #pmg's comment seems to have pinpointed the problem.
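A sketch of the "report and continue" idea in that separate pass (the node type and the check here are invented for the example): the key point is that the checker counts and reports errors instead of aborting on the first one.

#include <stdio.h>
#include <stdarg.h>

static int error_count = 0;

/* Report a semantic error but keep going instead of aborting. */
static void semantic_error(int line, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    fprintf(stderr, "line %d: error: ", line);
    vfprintf(stderr, fmt, ap);
    fputc('\n', stderr);
    va_end(ap);
    error_count++;
}

/* Hypothetical AST: check every node, never return early on error. */
struct node { int line; int declared; const char *name; struct node *next; };

static void check_all(struct node *n)
{
    for (; n; n = n->next)
        if (!n->declared)
            semantic_error(n->line, "'%s' used before declaration", n->name);
}

int main(void)
{
    struct node b = { 7, 0, "y", NULL };
    struct node a = { 3, 0, "x", &b };
    check_all(&a);
    fprintf(stderr, "%d error(s)\n", error_count);   /* both errors get reported */
    return error_count != 0;
}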
There is no one absolute answer to this. A typical way to handle it is to create a special pattern to read symbols until it gets to (for example) a semicolon at the end of a line, giving a reasonable signal that whatever's after that is intended as a new declaration, definition, statement, etc., and then re-start parsing from that point (retaining enough context to know that, for example, you're currently parsing a function body, so you accept/reject input on that basis).

How do you compare two files containing C code based on code structure, not merely textual differences?

I have two files containing C code which I wish to compare. I'm looking for a utility which will construct a syntax tree for each file, and compare the syntax trees, instead of merely comparing the text of the files. This way minor differences in formatting and style will be ignored. It would be nice to even be able to tell the comparison tool to ignore differences such as variable names, etc.
Correct me if I'm wrong, but diff doesn't have this capability. I'm a Ubuntu user. Thanks!
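To illustrate what "compare the syntax trees instead of the text" means, here is a minimal sketch with an invented node type: two fragments that differ only in identifier spelling compare as structurally equal. Real tools build the trees with a full parser, which is the hard part.

#include <stdio.h>
#include <string.h>

/* Hypothetical AST node: a kind, an optional identifier, and children. */
struct node {
    const char *kind;          /* "func", "if", "call", "ident", ... */
    const char *name;          /* identifier spelling, if any */
    int nchildren;
    struct node **children;
};

/* Structural equality: same kinds and same shape; identifier spellings
 * are ignored, so consistent renames do not count as differences. */
static int same_structure(const struct node *a, const struct node *b)
{
    if (!a || !b) return a == b;
    if (strcmp(a->kind, b->kind) != 0) return 0;
    if (a->nchildren != b->nchildren) return 0;
    for (int i = 0; i < a->nchildren; i++)
        if (!same_structure(a->children[i], b->children[i]))
            return 0;
    return 1;
}

int main(void)
{
    /* f(x) vs g(y): textually different, structurally identical. */
    struct node x = { "ident", "x", 0, NULL }, y = { "ident", "y", 0, NULL };
    struct node *cx[] = { &x }, *cy[] = { &y };
    struct node f = { "call", "f", 1, cx }, g = { "call", "g", 1, cy };
    printf("%s\n", same_structure(&f, &g) ? "same structure" : "different");
    return 0;
}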
Our SD Smart Differencer does exactly what you want. It uses compiler-quality parsers to read source code and build ASTs for the two files you select. It then compares the trees guided by the syntax, so it doesn't get confused by whitespace, layout or comments. Because it normalizes the values of constants, it doesn't get confused by a change of radix or by how you expressed escape sequences!
The deltas are reported at the level of the language constructs (variable, expression, statement, declaration, function, ...) in terms of programmer intent (delete, insert, copy, move), complete with determining that an identifier has been renamed consistently throughout a changed block.
The SmartDifferencer has versions available for C (in a number of dialects; if you want a compiler-accurate parse, the language dialect matters) as well as for C++, Java, C#, JavaScript, COBOL, Python and many other languages.
If you want to understand how a set of files is related to one another, our SD CloneDR will accept a very large set of files and tell you what they have in common. It finds code that has been copy-paste-edited across the entire set. You don't have to tell it what to look for; it finds it automatically. Using ASTs (as above), it isn't fooled by whitespace changes or renames of identifiers. There are a bunch of sample clone detection reports for various languages at the web site.
There is a program called Code Compare from Devart (http://www.devart.com/codecompare/benefits.html#cc) that includes the following feature (I know it is not exactly what you asked for, but it can probably be used for that).
The feature is called "Structure Comparison"
This functionality allows you to compare different file revisions by the presence of structural blocks (classes, fields, methods). Different versions of the same file are compared independently of their location.
Structure comparison can be applied to the following languages:
C#
C++
Visual Basic
JavaScript
(I know it does not include C, but maybe with the C++ version you can solve the problem)
