How to get abstract syntax tree of a `c` program in `GCC` - c

How can I get the abstract syntax tree of a c program in gcc?
I'm trying to automatically insert OpenMP pragmas to the input c program.
I need to analyze nested for loops for finding dependencies so that I can insert appropriate OpenMP pragmas.
So basically what I want to do is traverse and analyze the abstract syntax tree of the input c program.
How do I achieve this?

You need full dataflow to find 'dependencies'. Then you will need to actually insert the OpenMP calls.
What you want is a program transformation system. GCC probably has the dependency information, but it is famously difficult to work with for custom projects. Others have mentioned Clang and Rose. Clang might be a decent choice, but custom analysis/transformation isn't its main purpose. Rose is designed to support custom tools, but IMHO is a rather complicated scheme to use in practice because of its use of the EDG front end, which isn't designed to support transformation.
[THE FOLLOWING TEXT WAS DELETED BY A MODERATOR. I HAVE PUT IT BACK, BECAUSE IT IS ONE THE VALID TRANSFORMATION SYSTEMS FOR THIS TASK. THE FACT THAT I AM RESPONSIBLE FOR IT IN NO WAY DIMINISHES ITS VALUE AS A USEFUL ANSWER TO THE OP.]
Our DMS Software Reengineering Toolkit with its C front end is explicitly designed to be a program transformation system. It has full data flow analysis (including points-to analysis, call graph construction and range analyses) tied to the AST in sensible ways. It provides source-to-source rewrite rules enabling changes to the ASTs expressed in surface syntax form; you can read the transformations rather than inspect a bunch of procedural code. With a modified AST, DMS can regenerate source code including the comments in a compilable form.

Not exactly an AST but GCCXML might help http://linux.die.net/man/1/gccxml
edit : as stated by Ira Baxter gccxml does not output information about function/methods bodies. Here's a fork that seems to fix that lack http://sourceforge.net/projects/gccxml-bodies/

Related

From Assembler to C-Compiler

i designed a small RISC in verilog. Which steps do I have to take to create a c compiler which uses my assembler-language? Or is it possible to modify a conventional compiler like gcc, cause I don't want to make things like linker, ...
Thanks
You need to use an unmodified C lexer+parser (often called the front end), and a modified code generation component (the back end) to do that.
Eli Bendersky's pycparser can be used as the front end, and Atul's mini C compiler can be used as inspiration for the code generating back end: http://people.cs.uchicago.edu/~varmaa/mini_c/
With Eli Bendersky's pycparser, all you need to do is convert the AST to a Control Flow Graph (CFG) and generate code from there. It is easier to start with supporting a subset of C than the full shebang.
The two tools are written in Python, but you didn't mention any implementation language preferences :)
I have found most open sourcen compilers (except clang it seems) too tightly coupled to easily modify the back end. Clang and especially GCC are not easy to dive into, nowhere NEAR as easy as the two above. And since Eli's parser does full C99 (it parses everything I've thrown at it) it seem like a nice front end to use for further development.
The examples on the Github project demonstrates most of the features of the project and it's easy to get started. The example that parses C to literal English is worth taking a look at, but may take a while to fully grok. It basically handles any C expression, so it is a good reference for how to handle the different nodes of the AST.
I also recommended the tools above, in my answer to this question: Build AST from C code

C - How to find all inner loops using grep?

I have a giant C project with numerous C files. I have to find all inner-loops. I am sure there is no any O(n³) block in the project, so only O(n²)-compexity blocks must be found (a loop in a loop).
Is it possible to find all inner loops using grep? If yes, what regexp may I use to find all occurrences of inner loops of all kinds like {for,for}, {while,for}, {for, while}, {do, while}, etc. ? If no, is there any simple unix-way method to do it (maybe multiple greps or a kind of awk)?
Regex are for regular languages, what you are describing seems like a Context-Free, and i'm pretty sure it can't be done using Regular Expressions. See the answer to a similar question here . You should look for other type of automata like a scripting language(python or so).
This is a good case for specific compiler extensions. The recent GCC compiler (that is version 4.6 of GCC) can be extended by plugins (painfully coded in C) or by MELT extensions; MELT is a high-level domain specific language to code GCC extensions in, and MELT is much easy to use than C.
However, I do admit that coding GCC extensions is not entirely trivial: you have to partly understand how GCC works, and what are its main internal representations (Gimple, Tree, ...). When extending GCC, you basically add your own compiler passes, which can do whatever you want (-including detecting nested loops-). Coding a GCC extension is usually more than a week of work. (The hardest part is to understand how GCC works).
The big advantage of working within the GCC framework (thru plugins in C or extensions in MELT) is that your extensions are working on the same data as the compiler does.
Back to the question of finding nested loops, don't consider it as only purely syntactic (this is why grep cannot work). Within the GCC compiler, at some level of internal representations, a loop implemented by for, or while, or do, or even with goto-s, is still considered a loop, and for GCC these things can be nested (and GCC knows about the nesting!).
Without a C parser, you can get a heuristic solution at best.
If you can rely on certain rules being consistently followed in the code (e.g. no goto, no loops through recursion, ...), you can scan the preprocessed C code with regexps. Certainly, grep is not sophisticated enough but with a few lines of Perl or similar it is possible.
But the technically better and much more reliable approach is to use a real C parser.
There are three kinds of loops in C:
"structured syntax" (while, for, ...)
[Watch out for GCC, which can hide statements therefore loops inside expressions using (stmt; exp) syntax!]
ad hoc loops using goto; these interact with the structured syntax.
recursion
To find the first kind, you have to find the structured syntax and the nesting.
Grep can certainly find the keywords (if you ignore false positives in comments and strings), but it can't find nested structures. You could of course use grep to find all the loop syntax and then simply inspect those that occurred in the same file to see if they were nested.
(If you wanted to do this without the false positives price, you could use our Source Code Search Engine, which knows the lexical syntax of C and is never confused as to when a string of characters is a keyword, a number, a string, etc.)
If you want to find those loops automatically, you pretty much need a full C parser with expanded preprocessing accomplished. (Otherwise some macro may hide a critical piece of loop syntax). Once you have a syntax tree for C, it is straightforward (although likely a bit inconvenient) to code something that clambers over the tree, detecting loop syntax nodes, and counting nesting of loops in subtrees. You can do this with any tool that will parse C and give you abstract sytnax trees. ANTLR can likely do this; I think there's a C grammar obtainable for ANTLR that handles C fairly well, but you'll have to run the preprocessor before using ANTLR.
You could also do this with our DMS Software Reengineering Toolkit with its C Front End. Our C Front End has a full preprocessor built in so it can read the code directly and expand as it parses; it also handles a relatively wide variety of dialects of C and character encodings (ever dealt with C containing Japanese text?). DMS provides an additional advantage: given a language (e.g., C) front end, you can write patterns for the that language directly using the language syntax. So we can express fragments of what we want to find easily:
pattern block_for_loop(t:expression,l:expression,i:expression, s: statements): statement
" for(\t,\l\,\i) { \s } ";
pattern statement_for_loop(t:expression,l:expression,i:expression, s: statement): statement
" for(\t,\l\,\i) \s ";
pattern block_while_loop(c:expression, s: statements): statement
" while(\c) { \s } ";
pattern statement_while_loop(c:expression): statement
" for(\c) \s ";
...
and group them together:
pattern_set syntactic_loops
{ block_for_loop,
statement_for_loop,
block_while_loop,
block_statement_loop,
...
}
Given the pattern set, DMS can scan a syntax tree and find matches to any set member, without coding any particular tree crawling machinery and without knowing a huge amount of detail about the structure of the tree. (There's a lot of node types in an AST for a real C parser!) Finding nested loops this way would be pretty straightforward: scan the tree top down for a loop (using the pattern set); any hits must be top level loops. Scan subtrees of a found loop AST node (easy when you know where the tree for the outer loop is) for additional loops; any hits must be nested loops; recurse if necessary to pick up loops with arbitrary nesting. This works for the GCC loops-with-statements stuff, too. The tree nodes are stamped with precise file/line/column information so its easy to generate a report on location.
For ad hoc loops built using goto (what, your code doesn't have any?), you need something that can produce the actual control flow graph, and then structure that graph into nested control flow regions. The point here is that a while loop that contains an unconditional goto out isn't really a loop in spite of the syntax; and an if statement whose then clause gotos back to code upstream of the if is likely really a loop. All that loop syntax stuff is really just hueristic hints you may have loop!
Those control flow regions contain the real nesting of the control flow; DMS will construct C flow graphs, and will produce those structured regions. It provides libraries to build and access that graph; this way you can get the "true" control flow based on gotos. Having found a pair of nested control flow regions, one can access the AST associated with parts of region to get location information to report.
GCC is always a lot fun due to its signficantly enhanced version of C. For instance, it has an indirect goto statement. (Regular C has this hiding under setjmp/longjmp!).
To figure out loops in the face of this, you need points-to analysis, which DMS also provides. This information is used by the nested region analysis. There are (conservative) limits to the accuracy of points-to analysis, but modulo that you get the correct nested region graph.
Recursion is harder to find. For this, you have to determine if A calls B ... calls Z calls A, where A and B and ... can be in separate compilation units. You need a global call graph, containing all the compilation units of your application. At this point, you are probably expecting me to say that DMS does that too, voila, I'm pleased to say it does. Constructing that call graph of course requires points-to anlaysis for function calls; yes, DMS does that too. With the call graph, you can find cycles in the call graph, which are likely recursion. Also with the call graph, you can find indirect nesting, e.g., loops in one function, that call a function in another compilation unit that also contains loops.
To find structures such a loops accurately you need a lot of machinery (and this will take some effort, but then C is a bitch of a language to analyze) and DMS can provide it.
If you don't care about accuracy, and you don't care about all the kinds of loops, you can use grep and manual procedures to get a semi-accurate map of just the loops hinted at by the structured loop statements.
I suspect that finding something like this would be impossible using grep alone:
public void do(){
for(...){
somethingElse();
}
}
... Insert other code...
public void somethingElse(){
for(.....){
print(....);
}
}

antlr generate ast for c and parse the ast

I am doing static analyze on c program.And I search the antlr website ,there seems to be no appropriate grammar file that produce ast for c program.Does it mean I have to do it myself from the very start.Or is there a quicker method.I also need a tree parser that can traverse the ast created by the parser.
You indicated you want to do static analysis to detect buffer overflow.
First, writing a grammar for C is harder than it looks. There's all that stuff in the standard, and then there's what the real compilers actually accept. And you have to decide what to do about the preprocessor (and it varies from compiler to compiler!). If you don't get the grammar and preprocessing exactly right, you won't be able to parse real programs. (If you want to do toy languages, that's fine, but then you don't need a C grammar).
To do the analysis, you'll need far more machinery than an AST. You'll need symbol tables, control and data flow analysis, likely local and global points-to analysis, call graph extraction, and some type of range analysis.
People just don't seem to understand this.
** GETTING A PARSER IS A LONG WAY FROM DOING ANYTHING USEFUL WITH REAL LANGUAGES **
I'm shouting because I see this over, and over, and over.
If you want to get on with a specific program analysis or transformation task, unless you want to die of old age before you start your task, you better find a foundation that has most of what you need already. A foundation on a parser generator with a creaky grammar is not a foundation. (Don't get me wrong: ANTLR, YACC, JavaCC are all fine parser generators, and they're great for building a parser for a new language. They're great for implementing production parsers for real langauges when the investment gets made. But they produce parsers, and mostly people don't do the production part. And they don't provide the additional machinery by a long shot.)
Our DMS Software Reengineering Toolkit contains all the above machinery because it is almost always needed, and it is a royal headache to implement. (My team has 15 years invested so far.)
We've also instantiated that machinery is forms specifically useful for COBOL and Java, C, C++ (to somewhat lesser extent, the language is really hard), in a variety of dialects, so that others don't have to repeat this long process.
GCC and Clang are pretty mature for C and C++ as alternatives.
The hardest part is writing the grammar. Mixing in rewrite rules to create an AST isn't that hard, and creating a tree grammar from a parser grammar that emits an AST isn't that hard too (compared to writing the parser grammar, that is).
Here's a previous Q&A that shows how to create a proper AST: How to output the AST built using ANTLR?
And I couldn't find a decent SO-Q&A that explains how to go about creating a tree grammar, so here's a link to my personal blog that explains this: http://bkiers.blogspot.com/2011/03/6-creating-tree-grammar.html
Good luck.

Tool to produce self-referential programs?

Many results in computability theory (such as Kleene's second recursion theorem) ensure that it is possible to construct programs that can operate over their own source code. For example, in Michael Sipser's "Introduction to the Theory of Computation," he proves a special case of the Recursion Theorem, which states that any program representing a function that accepts two strings and produces a string can be converted into an equivalent program where the second argument is equal to the program's own source code. Moreover, this process can be done automatically.
The construction that one uses to produce programs with access to their own source code is well-known (most theory of computation books contain it) and is often used to generate quines. My question is whether someone has written a general-purpose tool that accepts as input a program in some language (perhaps C, for example) that contains some placeholder for the source of the program, then processes the program to produce a new program with access to its own source code. This would make it possible, for example, to generate quines automatically, or to write programs that can introspect on their syntax trees (possibly enabling reflection in languages that don't already support it). If not, I was planning on writing my own version of such a tool, but I don't want to reinvent the wheel if this has already been done.
EDIT: Based on #Henning Makholm's suggestion, I decided to just sit down and implement such a program. The resulting program (which I've dubbed "kleene") accepts as input a C++ program and produces a new C++ program that can access its own source code by calling the function kleene::MySource(). This means that you could transform this very simple program into a Quine using the kleene program:
#include <iostream>
int main() {
std::cout << kleene::MySource() << std::endl;
}
If you're curious to check it out, it's available here on my website.
Lots of examples at the Wikipedia article and links therefrom. After looking at one or two it should be obvious how to build a quine generator a given language that takes an arbitrary piece of payload code as input.
One problem with your reflection idea is that the program cannot, in general, know that what it has constructed is its own source code.
Our DMS Software Reengineering Toolkit is a program transformation system, that will accept programs in arbitrary syntax (described to DMS in an explicit parameter called a "domain description"), parse them to ASTs, carry out analyses and transformations of the ASTs, and can regenerate revised program text from the modified version.
DMS is of course coded in a language (actually as set of domain-specific languages) for which there are already DMS-domain descriptions. So, DMS can read itself, and we use that capability to bootstrap additional DMS capabilities and optimize its performance.
So while we aren't producing quines, we are building programs with self-enhancing code.
And yes, your observation about such a tool providing reflection for arbitrary langauges is smack on. Most reflection facilities provided in languages allow only access to those things the language-compiler folks thought of paramount importance to access at runtime, such as "method names". Things they weren't interested in, of course, aren't accessible; ever seen a reflection mechanism that will tell you what's in an expression? In a comment?
DMS provides complete access to all the details of the source code, by virtue of inspecting the code from outside, using general purpose, complete mechanisms. If your language doesn't have reflection, DMS is the way to access the code and reason arbitrarily about it. Even if your langauge has reflection, DMS can reason about programs in your language in ways that your language cannot, because it can't get access to its own detailed structure.

Abstract-syntax tree for subset of C

For teaching purpose we are building a javascript step by step interpreter for (a subset of) C code.
Basically we have : int,float..., arrays, functions, for, while... no pointers.
The javascript interpreter is done and allow us to explain how a boolean expression is evaluated, will show the variables stack...
For now, we are manually converting our C examples to some javascript that will run and build a stack of actions (affectation, function call...) that can later on be used to do the step by step stuff. Since we are limiting ourselves to a subset of C it's quite easy to do.
Now we would like to compile the C code to our javascript representation. All we need is a Abstract-syntax tree of the C code and the javascript generation is straightforward.
Do you know a good C-parser that could generate a such tree ? No need to be in javascript (but that would be perfect), any language is alright as this can be done offline.
I've looked at Emscripten ( https://github.com/kripken/emscripten ) but it's more a C=>javascript compiler and that's not what we want.
I've recently used Eli Bendersky's pycparser to mess with ASTs of C code. I think it'd work well for your purposes.
I think that ANTLR has a full C parser.
To do your translation task, I suspect you will need full symbol table support; you have to know what the symbols mean. Here most "parsers" will fail you; they don't build a full symbol table. I think ANTLR does not, but I could be wrong.
Our DMS Software Reengineering Toolkit with its C Front End provides a full C arser, and builds complete symbol tables. (You may not need it for your application, but it includes a full C preprocessor, too). It also provide control flow, data flow, points-to-analysis and call graph construction, all of which can be useful in translating C to whatever your target virtual machine is.

Resources