antlr C grammar to create AST - c

Is there any C grammar available which generates the AST, which includes all the parser rules using "^" and "!" notations?
I went through the book written by Terence Parr, to write such a grammar, but it seems that writing one such grammar for C lang is a time consuming process, so was wondering if its available already which can me save a lot of time!
(A grammar for a smaller subset of C language is also fine..)
Thanks :)

See this. It's straight from the ANTLR 4 source repo: a C11 grammar. It looks pretty compliant.
Of course, it doesn't come with a preprocessor, but handing cpp or mcpp the file first is easy enough.
It also doesn't come with AST rules, but it doesn't look too hard to do (albeit time consuming).

No answers after two weeks.
You are right, building a full parser that builds complete ASTs and handles all the details of C (including preprocessor)
covering a variety of dialects of C (e.g., ANSI, GNU C 2/3/4/, Miscrosoft Visual C, Green Hills C)... is actually a lot of work. And unless you invest this work, it won't process any real C programs.
I would expect there to be a full ANTLR grammar for C that did this considering how old ANTLR is. It is surprising that nobody here can seem to identify one; certainly you'd expect to find it at the ANTLR site.
We've put the energy required into building such C parsers (covering all the above dialects), and added computing symbol tables, extracting control and data flows, building call graphs, enabling analyzers, and tree transformations in the DMS Software Reengineering Toolkit with its C front end. This front end has been applied to C applications comprised of 18,000 compilation units to build custom analysis tools.

Related

From Assembler to C-Compiler

i designed a small RISC in verilog. Which steps do I have to take to create a c compiler which uses my assembler-language? Or is it possible to modify a conventional compiler like gcc, cause I don't want to make things like linker, ...
Thanks
You need to use an unmodified C lexer+parser (often called the front end), and a modified code generation component (the back end) to do that.
Eli Bendersky's pycparser can be used as the front end, and Atul's mini C compiler can be used as inspiration for the code generating back end: http://people.cs.uchicago.edu/~varmaa/mini_c/
With Eli Bendersky's pycparser, all you need to do is convert the AST to a Control Flow Graph (CFG) and generate code from there. It is easier to start with supporting a subset of C than the full shebang.
The two tools are written in Python, but you didn't mention any implementation language preferences :)
I have found most open sourcen compilers (except clang it seems) too tightly coupled to easily modify the back end. Clang and especially GCC are not easy to dive into, nowhere NEAR as easy as the two above. And since Eli's parser does full C99 (it parses everything I've thrown at it) it seem like a nice front end to use for further development.
The examples on the Github project demonstrates most of the features of the project and it's easy to get started. The example that parses C to literal English is worth taking a look at, but may take a while to fully grok. It basically handles any C expression, so it is a good reference for how to handle the different nodes of the AST.
I also recommended the tools above, in my answer to this question: Build AST from C code

antlr generate ast for c and parse the ast

I am doing static analyze on c program.And I search the antlr website ,there seems to be no appropriate grammar file that produce ast for c program.Does it mean I have to do it myself from the very start.Or is there a quicker method.I also need a tree parser that can traverse the ast created by the parser.
You indicated you want to do static analysis to detect buffer overflow.
First, writing a grammar for C is harder than it looks. There's all that stuff in the standard, and then there's what the real compilers actually accept. And you have to decide what to do about the preprocessor (and it varies from compiler to compiler!). If you don't get the grammar and preprocessing exactly right, you won't be able to parse real programs. (If you want to do toy languages, that's fine, but then you don't need a C grammar).
To do the analysis, you'll need far more machinery than an AST. You'll need symbol tables, control and data flow analysis, likely local and global points-to analysis, call graph extraction, and some type of range analysis.
People just don't seem to understand this.
** GETTING A PARSER IS A LONG WAY FROM DOING ANYTHING USEFUL WITH REAL LANGUAGES **
I'm shouting because I see this over, and over, and over.
If you want to get on with a specific program analysis or transformation task, unless you want to die of old age before you start your task, you better find a foundation that has most of what you need already. A foundation on a parser generator with a creaky grammar is not a foundation. (Don't get me wrong: ANTLR, YACC, JavaCC are all fine parser generators, and they're great for building a parser for a new language. They're great for implementing production parsers for real langauges when the investment gets made. But they produce parsers, and mostly people don't do the production part. And they don't provide the additional machinery by a long shot.)
Our DMS Software Reengineering Toolkit contains all the above machinery because it is almost always needed, and it is a royal headache to implement. (My team has 15 years invested so far.)
We've also instantiated that machinery is forms specifically useful for COBOL and Java, C, C++ (to somewhat lesser extent, the language is really hard), in a variety of dialects, so that others don't have to repeat this long process.
GCC and Clang are pretty mature for C and C++ as alternatives.
The hardest part is writing the grammar. Mixing in rewrite rules to create an AST isn't that hard, and creating a tree grammar from a parser grammar that emits an AST isn't that hard too (compared to writing the parser grammar, that is).
Here's a previous Q&A that shows how to create a proper AST: How to output the AST built using ANTLR?
And I couldn't find a decent SO-Q&A that explains how to go about creating a tree grammar, so here's a link to my personal blog that explains this: http://bkiers.blogspot.com/2011/03/6-creating-tree-grammar.html
Good luck.

Are GCC and Clang parsers really handwritten?

It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine generated, Bison-Flex based, bottom up parsing.
Could someone here please confirm that this is the case?
And if so, why do mainstream compiler frameworks use handwritten parsers?
Update : interesting blog on this topic here
There's a folk-theorem that says C is hard to parse, and C++ essentially impossible.
It isn't true.
What is true is that C and C++ are pretty hard to parse using LALR(1) parsers without hacking the parsing machinery and tangling in symbol table data. GCC in fact used to parse them, using YACC and additional hackery like this, and yes it was ugly. Now GCC uses handwritten parsers, but still with the symbol table hackery. The Clang folks never tried to use automated parser generators; AFAIK the Clang parser has always been hand-coded recursive descent.
What is true, is that C and C++ are relatively easy to parse with stronger automatically generated parsers, e.g., GLR parsers, and you don't need any hacks. The Elsa C++ parser is one example of this. Our C++ Front End is another (as are all our "compiler" front ends, GLR is pretty wonderful parsing technology).
Our C++ front end isn't as fast as GCC's, and certainly slower than Elsa; we've put little energy into tuning it carefully because we have other more pressing issues (nontheless it has been used on millions of lines of C++ code). Elsa is likely slower than GCC simply because it is more general. Given processor speeds these days, these differences might not matter a lot in practice.
But the "real compilers" that are widely distributed today have their roots in compilers of 10 or 20 years ago or more. Inefficiencies then mattered much more, and nobody had heard of GLR parsers, so people did what they knew how to do. Clang is certainly more recent, but then folk theorems retain their "persuasiveness" for a long time.
You don't have to do it that way anymore. You can very reasonably use GLR and other such parsers as front ends, with an improvement in compiler maintainability.
What is true, is that getting a grammar that matches your friendly neighborhood compiler's behavior is hard. While virtually all C++ compilers implement (most) of the original standard, they also tend have lots of dark corner extensions, e.g., DLL specifications in MS compilers, etc. If you have a strong parsing engine, you can
spend your time trying to get the final grammar to match reality, rather than trying to bend your grammar to match the limitations of your parser generator.
EDIT November 2012: Since writing this answer, we've improved our C++ front end to handle full C++11, including ANSI, GNU, and MS variant dialects. While there was lots of extra stuff, we don't have to change our parsing engine; we just revised the grammar rules. We did have to change the semantic analysis; C++11 is semantically very complicated, and this work swamps the effort to get the parser to run.
EDIT February 2015: ... now handles full C++14. (See get human readable AST from c++ code for GLR parses of a simple bit of code, and C++'s infamous "most vexing parse").
EDIT April 2017: Now handles (draft) C++17.
Yes:
GCC used a yacc (bison) parser once upon a time, but it was replaced with a hand-written recursive descent parser at some point in the 3.x series: see http://gcc.gnu.org/wiki/New_C_Parser for links to relevant patch submissions.
Clang also uses a hand-written recursive descent parser: see the section "A single unified parser for C, Objective C, C++ and Objective C++" near the end of http://clang.llvm.org/features.html .
Clang's parser is a hand-written recursive-descent parser, as are several other open-source and commercial C and C++ front ends.
Clang uses a recursive-descent parser for several reasons:
Performance: a hand-written parser allows us to write a fast parser, optimizing the hot paths as needed, and we're always in control of that performance. Having a fast parser has allowed Clang to be used in other development tools where "real" parsers are typically not used, e.g., syntax highlighting and code completion in an IDE.
Diagnostics and error recovery: because you're in full control with a hand-written recursive-descent parser, it's easy to add special cases that detect common problems and provide great diagnostics and error recovery (e.g., see http://clang.llvm.org/features.html#expressivediags) With automatically generated parsers, you're limited to the capabilities of the generator.
Simplicity: recursive-descent parsers are easy to write, understand, and debug. You don't need to be a parsing expert or learn a new tool to extend/improve the parser (which is especially important for an open-source project), yet you can still get great results.
Overall, for a C++ compiler, it just doesn't matter much: the parsing part of C++ is non-trivial, but it's still one of the easier parts, so it pays to keep it simple. Semantic analysis---particularly name lookup, initialization, overload resolution, and template instantiation---is orders of magnitude more complicated than parsing. If you want proof, go check out the distribution of code and commits in Clang's "Sema" component (for semantic analysis) vs. its "Parse" component (for parsing).
Weird answers there!
C/C++ grammars aren't context free. They are context sensitive because of the Foo * bar; ambiguity. We have to build a list of typedefs to know if Foo is a type or not.
Ira Baxter: I don't see the point with your GLR thing. Why build a parse tree which comprises ambiguities. Parsing means solving ambiguities, building the syntax tree. You resolve these ambiguities in a second pass, so this isn't less ugly. For me it is far more ugly ...
Yacc is a LR(1) parser generator (or LALR(1)), but it can be easily modified to be context sensitive. And there is nothing ugly in it. Yacc/Bison has been created to help in parsing C language, so probably it isn't the ugliest tool to generate a C parser ...
Until GCC 3.x the C parser is generated by yacc/bison, with typedefs table built during parsing. With "in parse" typedefs table building, C grammar becomes locally context free and furthermore "locally LR(1)".
Now, in Gcc 4.x, it is a recursive descent parser. It is exactly the same parser as in Gcc 3.x, it is still LR(1), and has the same grammar rules. The difference is that the yacc parser has been hand rewritten, the shift/reduce are now hidden in the call stack, and there is no "state454 : if (nextsym == '(') goto state398" as in gcc 3.x yacc's parser, so it is easier to patch, handle errors and print nicer messages, and to perform some of the next compiling steps during parsing. At the price of much less "easy to read" code for a gcc noob.
Why did they switched from yacc to recursive descent? Because it is quite necessary to avoid yacc to parse C++, and because GCC dreams to be multi language compiler, i.e. sharing maximum of code between the different languages it can compile. This is why the C++ and the C parser are written in the same way.
C++ is harder to parse than C because it isn't "locally" LR(1) as C, it is not even LR(k).
Look at func<4 > 2> which is a template function instantiated with 4 > 2, i.e. func<4 > 2>
has to be read as func<1>. This is definitely not LR(1). Now consider, func<4 > 2 > 1 > 3 > 3 > 8 > 9 > 8 > 7 > 8>. This is where a recursive descent can easily solve ambiguity, at the price of a few more function calls (parse_template_parameter is the ambiguous parser function. If parse_template_parameter(17tokens) failed, try again parse_template_parameter(15tokens), parse_template_parameter(13tokens)
... until it works).
I don't know why it wouldn't be possible to add into yacc/bison recursive sub grammars, maybe this will be the next step in gcc/GNU parser development?
gcc's parser is handwritten.. I suspect the same for clang. This is probably for a few reasons:
Performance: something that you've hand-optimized for your particular task will almost always perform better than a general solution. Abstraction usually has a performance hit
Timing: at least in the case of GCC, GCC predates a lot of free developer tools (came out in 1987). There was no free version of yacc, etc. at the time, which I'd imagine would've been a priority to the people at the FSF.
This is probably not a case of "not invented here" syndrome, but more along the lines of "there was nothing optimized specifically for what we needed, so we wrote our own".
It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine generated, Bison-Flex based, bottom up parsing.
Bison in particular I don't think can handle the grammar without parsing some things ambiguously and doing a second pass later.
I know Haskell's Happy allows for monadic (i.e. state-dependent) parsers that can resolve the particular issue with C syntax, but I know of no C parser generators that allow a user-supplied state monad.
In theory, error recovery would be a point in favor of a handwritten parser, but my experience with GCC/Clang has been that the error messages are not particularly good.
As for performance - some of the claims seem unsubstantiated. Generating a big state machine using a parser generator should result in something that's O(n) and I doubt parsing is the bottleneck in much tooling.

Abstract-syntax tree for subset of C

For teaching purpose we are building a javascript step by step interpreter for (a subset of) C code.
Basically we have : int,float..., arrays, functions, for, while... no pointers.
The javascript interpreter is done and allow us to explain how a boolean expression is evaluated, will show the variables stack...
For now, we are manually converting our C examples to some javascript that will run and build a stack of actions (affectation, function call...) that can later on be used to do the step by step stuff. Since we are limiting ourselves to a subset of C it's quite easy to do.
Now we would like to compile the C code to our javascript representation. All we need is a Abstract-syntax tree of the C code and the javascript generation is straightforward.
Do you know a good C-parser that could generate a such tree ? No need to be in javascript (but that would be perfect), any language is alright as this can be done offline.
I've looked at Emscripten ( https://github.com/kripken/emscripten ) but it's more a C=>javascript compiler and that's not what we want.
I've recently used Eli Bendersky's pycparser to mess with ASTs of C code. I think it'd work well for your purposes.
I think that ANTLR has a full C parser.
To do your translation task, I suspect you will need full symbol table support; you have to know what the symbols mean. Here most "parsers" will fail you; they don't build a full symbol table. I think ANTLR does not, but I could be wrong.
Our DMS Software Reengineering Toolkit with its C Front End provides a full C arser, and builds complete symbol tables. (You may not need it for your application, but it includes a full C preprocessor, too). It also provide control flow, data flow, points-to-analysis and call graph construction, all of which can be useful in translating C to whatever your target virtual machine is.

C to IEC 61131-3 IL compiler

I have a requirement for porting some existing C code to a IEC 61131-3 compliant PLC.
I have some options of splitting the code into discrete function blocks and weaving those blocks into a standard solution (Ladder, FB, Structured Text etc). But this would require carving up the C code in order to build each function block.
When looking at the IEC spec I realsied that the IEC Instruction List form could be a target language for a compiler. The wikepedia article lists two development tools:
CoDeSys
Beremiz
But these seem to be targeted compiling IEC languages to C, not C to IEC.
Another possible solution is to push the C code through a C to Pascal translator and use that as a starting point for a Structured Text solution.
If not any of these I will go down the route of splitting the code up into function blocks.
Edit
As prompted by mlieson's reply I should have mentioned that the C code is an existing real-time control system. So the programs algorithms should already suit a PLC environment.
Maybe this answer comes too late but it is possible to call C code from CoDeSys thanks to an external library.
You can find documentation on the CoDeSys forum at http://forum-en.3s-software.com/viewtopic.php?t=620
That would give you to use your C code into the PLC with minor modifcations. You'll just have to define the functions or function blocks interfaces.
My guess is that a C to Pascal translator will not get you near enough for being worth the trouble. Structured text looks a lot like Pascal, but there are differences that you will need to fix everywhere.
Not a bug issue, but don't forget that PLCs runtime enviroment is a bit different. A C applications starts at main() and ends when main() returns. A PLC calls it main() over and over again, 100:s of times per second and it never ends.
Usally lengthy calculations and I/O needs to be coded in diffent fashion than a C appliation would use.
Unless your C source is many many thousands lines of code - Rewrite it.
It is impossible. To be short: the IL language is a 4GL (i.e. limited to
the domain, as well as other IEC 61131-3 languages -- ST, FBD, LD, SFC).
The C language is a 3GL.
To understand the problem, try to answer the question, which way to
express in IL manipulations with a pointer? for example, to express call a
function by a pointer. What about interrupts? Low level access to the
peripherial devices?
(really, there are more problems)
BTW, there is the Reflex language, aka "C with processes". Reflex is a 4GL for the
control domain with C-like syntax. But the known translators produce
C-code and Python-code.
If the amount of code to convert is a few thousand lines, recoding by hand is probably your best bet.
If you have lots of code to convert, then an automated tool might be very effective.
Using the DMS Software Reengineering Toolkit we've built translators to map mechanical motion diagrams into RLL (PLC) code. DMS also has full C parser/analyzers/front ends. The pieces are there to build a C to RLL code.
This isn't an easy task. It likely takes 6-12 man-months to configure DMS to something resembling what you want. If that's less than what it takes to do by hand, then its the right way to do it.
There are a few IEC development environments and target hardware that can use C blocks... I would also take a look at the reasons why it HAS to be an IEC-61131 complaint target. I have written extensively on compliance and why it doesn't mean squat.
SOFTplc corp can help I'm sure with user defined loadable modules... and they can be in C..
Schneider also supports C function blocks...
Labview too!! not sure why IEC is important that's all!! the compiler if existed would create bad code for sure:)
Your best bet is to split your C code into smaller parts which can be recoded as PLC functional blocks and use C to PASCAL convertor for each block which you will rewrite in structured text. Prepare to do a lot of manual work since automated conversion will probably disappoint you.
Also take a look at this page: http://www.control.com/thread/1026228786
Every time I've done this, I just parsed and converted it by hand from C directly to ST. I only ran into a few functions that required complete rewrites, although there was very little that dealt with pointers, which is something that ST generally chokes on, unfortunately.
Using the existing C code as blocks that are called by the PLC program would have the added advantage that the C blocks could run at the same periodicity that they did before, and their function is likely already well documented and tested. This would minimize any effect on changes from the existing control system. This is an architecture for controls with software PLCs that I have seen used before.

Resources