antlr generate ast for c and parse the ast - c

I am doing static analysis on a C program, and when I search the ANTLR website there seems to be no appropriate grammar file that produces an AST for C programs. Does that mean I have to do it myself from scratch, or is there a quicker method? I also need a tree parser that can traverse the AST created by the parser.

You indicated you want to do static analysis to detect buffer overflow.
First, writing a grammar for C is harder than it looks. There's all that stuff in the standard, and then there's what the real compilers actually accept. And you have to decide what to do about the preprocessor (and it varies from compiler to compiler!). If you don't get the grammar and preprocessing exactly right, you won't be able to parse real programs. (If you want to do toy languages, that's fine, but then you don't need a C grammar).
To do the analysis, you'll need far more machinery than an AST. You'll need symbol tables, control and data flow analysis, likely local and global points-to analysis, call graph extraction, and some type of range analysis.
People just don't seem to understand this.
** GETTING A PARSER IS A LONG WAY FROM DOING ANYTHING USEFUL WITH REAL LANGUAGES **
I'm shouting because I see this over, and over, and over.
If you want to get on with a specific program analysis or transformation task, unless you want to die of old age before you start your task, you had better find a foundation that has most of what you need already. A foundation on a parser generator with a creaky grammar is not a foundation. (Don't get me wrong: ANTLR, YACC, and JavaCC are all fine parser generators, and they're great for building a parser for a new language. They're great for implementing production parsers for real languages when the investment gets made. But they produce parsers, and mostly people don't do the production part. And they don't provide the additional machinery by a long shot.)
Our DMS Software Reengineering Toolkit contains all the above machinery because it is almost always needed, and it is a royal headache to implement. (My team has 15 years invested so far.)
We've also instantiated that machinery in forms specifically useful for COBOL, Java, C, and C++ (to a somewhat lesser extent; the language is really hard), in a variety of dialects, so that others don't have to repeat this long process.
GCC and Clang are pretty mature for C and C++ as alternatives.

The hardest part is writing the grammar. Mixing in rewrite rules to create an AST isn't that hard, and creating a tree grammar from a parser grammar that emits an AST isn't that hard either (compared to writing the parser grammar, that is).
Here's a previous Q&A that shows how to create a proper AST: How to output the AST built using ANTLR?
And I couldn't find a decent SO-Q&A that explains how to go about creating a tree grammar, so here's a link to my personal blog that explains this: http://bkiers.blogspot.com/2011/03/6-creating-tree-grammar.html
Good luck.

Related

How much lisp to implement in C before writing extension in itself?

I am implementing a Lisp interpreter in C. So far I have implemented a few primitives like cons, car, cdr, eq, and basic arithmetic.
Just before starting to implement define and lambda, it occurred to me that I need to implement an environment. I am unsure whether I could implement it in Lisp itself.
My intent is to implement a minimal amount of Lisp so that I can write extensions to the language in itself. I am not sure how much is minimal. Would implementing an FFI qualify as minimal?
The answer to your question depends on the meaning that you give to the word “minimal”.
Given your question, and assuming that you don't want to make an implementation competing with today's fine implementations of Common Lisp and Scheme, my hypothesis is that by “minimal” you mean: Turing complete, that is, capable of expressing any computation expressible in a general-purpose programming language.
With this assumption, you need to implement three other things:
conditional forms (cond)
lambda expressions (lambda)
a way of defining recursive lambda expression (labels or defun)
Your interpreter should then be able to evaluate forms. This should be sufficient to have a language equivalent to the original LISP, one that allows any computable function to be expressed in the language.
First off, you are talking about writing a LISP interpreter. You have a lot of choices to make when it comes to scoping and Lisp-1 vs. Lisp-2, since these questions alter the implementation core. An interpreter is a general-purpose program that reads and evaluates code. It can support abstractions, but it won't extend itself by making more native stuff.
If you are interested in such stuff, you can perhaps make a compiler instead. E.g. there are many Scheme-like subsets that compile to C or Java code, but you can make your own VM. Thus it can indeed compile itself to be run on its own target machine (self-hosting), if all the forms and procedures you use have been implemented using the primitives supported by the compiler.
Making a dumb compiler is not much different from making an interpreter. That is very clear if you've watched the SICP videos (10A is about compilation, 7A-B is about interpreters).
The environment can be a chain of pairs, just as in a LISP interpreter. It would be difficult to implement the environment itself in Lisp without making the language very difficult to use (unless it's compiled, that is).
You may use the data structures of Lisp and the primitives from the C code, though.
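For the environment-as-a-chain-of-pairs idea, here is a minimal sketch in C (with hypothetical type and function names, not tied to any particular interpreter): frames of (symbol, value) bindings chained to their enclosing frame, which is essentially all that define and lambda need on the C side.

    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    /* Hypothetical minimal object model: only fixnums are used below. */
    typedef struct Obj {
        enum { SYMBOL, FIXNUM } tag;
        union { const char *name; long num; } u;
    } Obj;

    /* One binding: a (symbol, value) pair; bindings are chained into a frame,
       and frames are chained into environments (innermost frame first). */
    typedef struct Binding {
        const char *name;
        Obj *value;
        struct Binding *next;
    } Binding;

    typedef struct Env {
        Binding *bindings;
        struct Env *parent;        /* enclosing lexical environment */
    } Env;

    static Env *env_extend(Env *parent) {
        Env *e = calloc(1, sizeof *e);
        e->parent = parent;
        return e;
    }

    static void env_define(Env *e, const char *name, Obj *value) {
        Binding *b = malloc(sizeof *b);
        b->name = name;
        b->value = value;
        b->next = e->bindings;
        e->bindings = b;
    }

    /* Walk outward through the chain of frames until the symbol is found. */
    static Obj *env_lookup(Env *e, const char *name) {
        for (; e != NULL; e = e->parent)
            for (Binding *b = e->bindings; b != NULL; b = b->next)
                if (strcmp(b->name, name) == 0)
                    return b->value;
        return NULL;               /* unbound symbol */
    }

    int main(void) {
        Obj forty_two;
        forty_two.tag = FIXNUM;
        forty_two.u.num = 42;

        Env *global = env_extend(NULL);
        env_define(global, "x", &forty_two);

        Env *local = env_extend(global);   /* e.g. entering a lambda body */
        Obj *x = env_lookup(local, "x");   /* found in the enclosing frame */
        printf("x = %ld\n", x ? x->u.num : -1L);
        return 0;
    }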
Making an FFI is a fast way to give your language lots of features. It solves the chicken-and-egg problem by using other people's work from within your language. It fuses the top (primitives and syntax) and the bottom layer (a runtime) of your system. It's the ultimate primitive, and you can think of it as a system call or message bus to the runtime.
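As a concrete illustration of how little the C side of an FFI primitive needs, here is a hedged sketch using POSIX dlopen/dlsym to look up and call a foreign function at runtime. The fixed double(double) signature and the libm path are simplifications for the example; a real FFI also needs a calling-convention story, which is what libffi (mentioned below) provides.

    /* Sketch of the C side of a trivial FFI primitive: load a shared library,
       look up a symbol, and call it as double(*)(double). Assumes POSIX dlopen;
       on glibc link with -ldl. A real Lisp would wrap this as a primitive that
       takes the library name, symbol name, and arguments as Lisp objects. */
    #include <stdio.h>
    #include <dlfcn.h>

    typedef double (*unary_double_fn)(double);

    static double call_foreign(const char *lib, const char *sym, double arg) {
        void *handle = dlopen(lib, RTLD_LAZY);
        if (!handle) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 0.0;
        }
        /* ISO C objects to converting void* straight to a function pointer,
           so go through the object-pointer alias, as dlsym's man page does. */
        unary_double_fn fn;
        *(void **)&fn = dlsym(handle, sym);
        double result = fn ? fn(arg) : 0.0;
        dlclose(handle);
        return result;
    }

    int main(void) {
        /* Call cos(0.5) from the math library without linking against it.
           "libm.so.6" is the glibc name; adjust for other platforms. */
        printf("cos(0.5) = %f\n", call_foreign("libm.so.6", "cos", 0.5));
        return 0;
    }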
I strongly suggest reading Queinnec's book: Lisp In Small Pieces. It is a book dedicated entirely to answering your question, and it explains in detail the many trade-offs and the internals of Lisp implementations and definitions, by giving many explained examples of Lisp interpreters and compilers.
You might also consider using libffi. You could be interested in the internals of M. Serrano's Bigloo & Hop implementations. You might even look inside my MELT lisp-like language to customize the GCC compiler.
You also need to learn more about garbage collection (you might read the GC handbook). You could use Boehm's conservative Garbage Collector (or something else, e.g. my Qish or MPS) or write your own GC.
You may want to learn more about Chicken, Scheme 48, Guile and read their papers and look inside their code.
See also J.Pitrat's blog: it is not about Lisp (but about bootstrapping strong AI) and has several fascinating entries related to bootstrapping.

Are GCC and Clang parsers really handwritten?

It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine generated, Bison-Flex based, bottom up parsing.
Could someone here please confirm that this is the case?
And if so, why do mainstream compiler frameworks use handwritten parsers?
Update : interesting blog on this topic here
There's a folk-theorem that says C is hard to parse, and C++ essentially impossible.
It isn't true.
What is true is that C and C++ are pretty hard to parse using LALR(1) parsers without hacking the parsing machinery and tangling in symbol table data. GCC in fact used to parse them, using YACC and additional hackery like this, and yes it was ugly. Now GCC uses handwritten parsers, but still with the symbol table hackery. The Clang folks never tried to use automated parser generators; AFAIK the Clang parser has always been hand-coded recursive descent.
What is true, is that C and C++ are relatively easy to parse with stronger automatically generated parsers, e.g., GLR parsers, and you don't need any hacks. The Elsa C++ parser is one example of this. Our C++ Front End is another (as are all our "compiler" front ends, GLR is pretty wonderful parsing technology).
Our C++ front end isn't as fast as GCC's, and certainly slower than Elsa; we've put little energy into tuning it carefully because we have other more pressing issues (nonetheless it has been used on millions of lines of C++ code). Elsa is likely slower than GCC simply because it is more general. Given processor speeds these days, these differences might not matter a lot in practice.
But the "real compilers" that are widely distributed today have their roots in compilers of 10 or 20 years ago or more. Inefficiencies then mattered much more, and nobody had heard of GLR parsers, so people did what they knew how to do. Clang is certainly more recent, but then folk theorems retain their "persuasiveness" for a long time.
You don't have to do it that way anymore. You can very reasonably use GLR and other such parsers as front ends, with an improvement in compiler maintainability.
What is true, is that getting a grammar that matches your friendly neighborhood compiler's behavior is hard. While virtually all C++ compilers implement (most of) the original standard, they also tend to have lots of dark-corner extensions, e.g., DLL specifications in MS compilers, etc. If you have a strong parsing engine, you can spend your time trying to get the final grammar to match reality, rather than trying to bend your grammar to match the limitations of your parser generator.
EDIT November 2012: Since writing this answer, we've improved our C++ front end to handle full C++11, including ANSI, GNU, and MS variant dialects. While there was lots of extra stuff, we don't have to change our parsing engine; we just revised the grammar rules. We did have to change the semantic analysis; C++11 is semantically very complicated, and this work swamps the effort to get the parser to run.
EDIT February 2015: ... now handles full C++14. (See get human readable AST from c++ code for GLR parses of a simple bit of code, and C++'s infamous "most vexing parse").
EDIT April 2017: Now handles (draft) C++17.
Yes:
GCC used a yacc (bison) parser once upon a time, but it was replaced with a hand-written recursive descent parser at some point in the 3.x series: see http://gcc.gnu.org/wiki/New_C_Parser for links to relevant patch submissions.
Clang also uses a hand-written recursive descent parser: see the section "A single unified parser for C, Objective C, C++ and Objective C++" near the end of http://clang.llvm.org/features.html .
Clang's parser is a hand-written recursive-descent parser, as are several other open-source and commercial C and C++ front ends.
Clang uses a recursive-descent parser for several reasons:
Performance: a hand-written parser allows us to write a fast parser, optimizing the hot paths as needed, and we're always in control of that performance. Having a fast parser has allowed Clang to be used in other development tools where "real" parsers are typically not used, e.g., syntax highlighting and code completion in an IDE.
Diagnostics and error recovery: because you're in full control with a hand-written recursive-descent parser, it's easy to add special cases that detect common problems and provide great diagnostics and error recovery (e.g., see http://clang.llvm.org/features.html#expressivediags). With automatically generated parsers, you're limited to the capabilities of the generator.
Simplicity: recursive-descent parsers are easy to write, understand, and debug. You don't need to be a parsing expert or learn a new tool to extend/improve the parser (which is especially important for an open-source project), yet you can still get great results.
Overall, for a C++ compiler, it just doesn't matter much: the parsing part of C++ is non-trivial, but it's still one of the easier parts, so it pays to keep it simple. Semantic analysis---particularly name lookup, initialization, overload resolution, and template instantiation---is orders of magnitude more complicated than parsing. If you want proof, go check out the distribution of code and commits in Clang's "Sema" component (for semantic analysis) vs. its "Parse" component (for parsing).
Weird answers there!
C/C++ grammars aren't context free. They are context sensitive because of the Foo * bar; ambiguity. We have to build a list of typedefs to know if Foo is a type or not.
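To make that concrete, here is a small, hypothetical sketch of the classic workaround (often called the "lexer hack"): the parser records typedef names as it reduces typedef declarations, and the lexer consults that table so an identifier can come back as a different token kind once it names a type. This is an illustration of the technique, not GCC's actual code.

    /* The ambiguity: after "typedef int Foo;", the statement "Foo * bar;"
       declares bar as a pointer to Foo; without the typedef it multiplies
       Foo by bar. The lexer-hack fix: let the lexer ask a typedef table
       (filled in by the parser) before classifying an identifier. */
    #include <stdio.h>
    #include <string.h>

    enum token_kind { TK_IDENTIFIER, TK_TYPE_NAME };

    #define MAX_TYPEDEFS 1024
    static const char *typedef_names[MAX_TYPEDEFS];
    static int typedef_count;

    /* Called by the parser when it reduces a typedef declaration. */
    void register_typedef(const char *name) {
        if (typedef_count < MAX_TYPEDEFS)
            typedef_names[typedef_count++] = name;
    }

    /* Called by the lexer for every identifier; once "Foo" is registered it
       comes back as TK_TYPE_NAME, making the grammar locally unambiguous. */
    enum token_kind classify_identifier(const char *name) {
        for (int i = 0; i < typedef_count; i++)
            if (strcmp(typedef_names[i], name) == 0)
                return TK_TYPE_NAME;
        return TK_IDENTIFIER;
    }

    int main(void) {
        printf("%d\n", classify_identifier("Foo") == TK_TYPE_NAME);  /* 0: plain identifier */
        register_typedef("Foo");                      /* parser saw: typedef int Foo; */
        printf("%d\n", classify_identifier("Foo") == TK_TYPE_NAME);  /* 1: now a type name */
        return 0;
    }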
Ira Baxter: I don't see the point of your GLR thing. Why build a parse tree which comprises ambiguities? Parsing means solving ambiguities and building the syntax tree. You resolve these ambiguities in a second pass, so this isn't less ugly. For me it is far more ugly ...
Yacc is an LR(1) parser generator (or LALR(1)), but it can be easily modified to be context sensitive. And there is nothing ugly in it. Yacc/Bison was created to help in parsing the C language, so it probably isn't the ugliest tool to generate a C parser ...
Until GCC 3.x the C parser was generated by yacc/bison, with the typedef table built during parsing. With "in-parse" typedef table building, the C grammar becomes locally context free and furthermore "locally LR(1)".
Now, in GCC 4.x, it is a recursive descent parser. It is exactly the same parser as in GCC 3.x, it is still LR(1), and it has the same grammar rules. The difference is that the yacc parser has been rewritten by hand, the shift/reduce machinery is now hidden in the call stack, and there is no "state454: if (nextsym == '(') goto state398" as in GCC 3.x's yacc parser, so it is easier to patch, handle errors, print nicer messages, and perform some of the next compiling steps during parsing. At the price of much less "easy to read" code for a GCC noob.
Why did they switch from yacc to recursive descent? Because avoiding yacc is pretty much necessary to parse C++, and because GCC aims to be a multi-language compiler, i.e. to share the maximum of code between the different languages it can compile. This is why the C++ and the C parsers are written in the same way.
C++ is harder to parse than C because it isn't "locally" LR(1) like C; it is not even LR(k).
Look at func<4 > 2>, which is a template function instantiated with 4 > 2, i.e. func<4 > 2> has to be read as func<1>. This is definitely not LR(1). Now consider func<4 > 2 > 1 > 3 > 3 > 8 > 9 > 8 > 7 > 8>. This is where a recursive descent parser can easily resolve the ambiguity, at the price of a few more function calls (parse_template_parameter is the ambiguous parser function: if parse_template_parameter(17tokens) failed, try again with parse_template_parameter(15tokens), then parse_template_parameter(13tokens) ... until it works).
I don't know why it wouldn't be possible to add recursive sub-grammars into yacc/bison; maybe this will be the next step in gcc/GNU parser development?
GCC's parser is handwritten; I suspect the same for Clang. This is probably for a few reasons:
Performance: something that you've hand-optimized for your particular task will almost always perform better than a general solution. Abstraction usually comes with a performance hit.
Timing: at least in the case of GCC, GCC predates a lot of free developer tools (came out in 1987). There was no free version of yacc, etc. at the time, which I'd imagine would've been a priority to the people at the FSF.
This is probably not a case of "not invented here" syndrome, but more along the lines of "there was nothing optimized specifically for what we needed, so we wrote our own".
It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine generated, Bison-Flex based, bottom up parsing.
I don't think Bison in particular can handle the grammar without parsing some things ambiguously and doing a second pass later.
I know Haskell's Happy allows for monadic (i.e. state-dependent) parsers that can resolve the particular issue with C syntax, but I know of no C parser generators that allow a user-supplied state monad.
In theory, error recovery would be a point in favor of a handwritten parser, but my experience with GCC/Clang has been that the error messages are not particularly good.
As for performance - some of the claims seem unsubstantiated. Generating a big state machine using a parser generator should result in something that's O(n) and I doubt parsing is the bottleneck in much tooling.

antlr C grammar to create AST

Is there any C grammar available which generates the AST, which includes all the parser rules using "^" and "!" notations?
I went through the book written by Terence Parr to write such a grammar, but it seems that writing such a grammar for the C language is a time-consuming process, so I was wondering if one is available already, which could save me a lot of time!
(A grammar for a smaller subset of C language is also fine..)
Thanks :)
See this. It's straight from the ANTLR 4 source repo: a C11 grammar. It looks pretty compliant.
Of course, it doesn't come with a preprocessor, but handing cpp or mcpp the file first is easy enough.
It also doesn't come with AST rules, but it doesn't look too hard to do (albeit time consuming).
No answers after two weeks.
You are right: building a full parser that builds complete ASTs and handles all the details of C (including the preprocessor), covering a variety of dialects of C (e.g., ANSI, GNU C 2/3/4, Microsoft Visual C, Green Hills C)... is actually a lot of work. And unless you invest this work, it won't process any real C programs.
I would expect there to be a full ANTLR grammar for C that did this considering how old ANTLR is. It is surprising that nobody here can seem to identify one; certainly you'd expect to find it at the ANTLR site.
We've put the energy required into building such C parsers (covering all the above dialects), and added computing symbol tables, extracting control and data flows, building call graphs, enabling analyzers, and tree transformations in the DMS Software Reengineering Toolkit with its C front end. This front end has been applied to C applications comprised of 18,000 compilation units to build custom analysis tools.

What parser-generators with code separation and language extensibility would you recommend?

I'm looking for a context-free grammar parser generator with grammar/code separation and the possibility to add support for new target languages. For instance, if I want a parser in Pascal, I can write my own Pascal code generator without reimplementing the whole thing.
I understand that most open-source parser generators can in theory be extended; still, I'd prefer something that has extensibility planned and documented.
Feature-wise I need the parser to at least support Python-style indentation, maybe with some additional work. No requirement on the type of parser generated, but I'd prefer something fast.
Which are the most well-known/maintained options?
Popular parser-generators seem to mostly use mixed grammar/code approach which I really don't like. Comparison list on Wikipedia lists a few but I'm a novice at this and can't tell which to try.
Why I don't like mixing grammar/code: because this approach seems like a mess. Grammar is grammar, implementation details are implementation details. They're different things written in different languages, it's intuitive to keep them in separate places.
What if I want to reuse parts of grammar in another project, with different implementation details? What if I want to compile a parser in a different language? All of this requires grammar to be kept separate.
Most parser generators won't handle arbitrary context-free grammars. They handle some subset (LL(1), LL(k), LL(*), LALR(1), LR(k), ...). If you choose one of these, you will almost certainly have to hack your grammar to match the limitations of the parser generator (no left recursion, limited lookahead, ...). If you want a real context-free parser generator, you want an Earley parser generator (inefficient), a GLR parser generator (the most practical of the lot), or a PEG parser generator (and the last isn't context-free; it requires rules to be ordered to determine which ones take precedence).
You seem to be worried about mixing syntax and parser-actions used to build the trees.
If the tree you build isn't a direct function of the syntax, there has to be some way to tie the tree-building machinery to the grammar productions. Placing it "near" the grammar production is one way, but leads to your "mixed" notation objection.
Another way is to give each rule a name (or some unique identifier), and set the tree-building machinery off to the side indexed by the names. This way your grammar isn't contaminated with the "other stuff", which seems to be your objection. None of the parser generator systems I know of do this. An awkward issue is that you now have to invent lots of rule names, and anytime you have a few hundred names that's inconvenient by itself and it is hard to make them mnemonic.
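Purely as a hypothetical illustration of that second approach (again, no generator I know of works this way), the off-to-the-side association could be as simple as a table in C mapping invented rule names to tree-builder callbacks; the rule names and builders below are made up for the sketch.

    /* Sketch of keeping tree-building actions out of the grammar: the grammar
       file only names its rules, and a separate table maps each rule name to a
       callback that builds the AST node for that production. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct Node {
        const char *label;
        struct Node *kids[4];
        int nkids;
    } Node;

    static Node *mk(const char *label, int nkids, Node **kids) {
        Node *n = calloc(1, sizeof *n);
        n->label = label;
        n->nkids = nkids;
        for (int i = 0; i < nkids; i++) n->kids[i] = kids[i];
        return n;
    }

    typedef Node *(*builder_fn)(Node **children, int n);

    /* Builders live here, far away from the grammar text. */
    static Node *build_add(Node **c, int n) { return mk("add", n, c); }
    static Node *build_num(Node **c, int n) { return mk("num", n, c); }

    /* The off-to-the-side association: rule name -> builder. */
    static const struct { const char *rule; builder_fn build; } builders[] = {
        { "expr_plus_term",  build_add },
        { "primary_number",  build_num },
    };

    static builder_fn find_builder(const char *rule) {
        for (size_t i = 0; i < sizeof builders / sizeof builders[0]; i++)
            if (strcmp(builders[i].rule, rule) == 0)
                return builders[i].build;
        return NULL;
    }

    int main(void) {
        /* A parser would call this when it reduces the rule named "expr_plus_term". */
        Node *lhs = find_builder("primary_number")(NULL, 0);
        Node *rhs = find_builder("primary_number")(NULL, 0);
        Node *kids[2] = { lhs, rhs };
        Node *sum = find_builder("expr_plus_term")(kids, 2);
        printf("%s(%s, %s)\n", sum->label, sum->kids[0]->label, sum->kids[1]->label);
        return 0;
    }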
A third way is to make the tree a direct function of the syntax, and auto-generate the tree-building steps. This requires no extra stuff off to the side at all to produce the ASTs. The only tool I know that does it (there may be others, but I've been looking for 20-odd years and haven't seen one) is my company's product, the DMS Software Reengineering Toolkit. [DMS isn't just a parser generator; it is a complete ecosystem for building program analysis and transformation tools for arbitrary languages, using a GLR parsing engine; yes, it handles Python-style indents.]
One objection is that such trees are concrete, bloated and confusing; if done right, that's not true.
My SO answer to this question:
What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree? discusses how we get the benefits of ASTs from automatically generated compressed CSTs.
The good news about DMS's scheme is that the basic grammar isn't bloated with parsing support. The not so good news is that you will find lots of other things you want to associate with grammar rules (prettyprinting rules, attribute computations, tree synthesis,...) and you come right back around to the same choices. DMS has all of these "other things" and solves the association problem a number of ways:
By placing other related descriptive formalisms next to the grammar rule (producing the mixing you complained about). We tolerate this for pretty-printing rules because in fact it is nice to have the grammar (parse) rule adjacent to the pretty-print (anti-parse) rule. We also allow attribute computations to be placed near the grammar rules to provide an association.
While DMS allows rules to have names, this is only for convenient access by procedural code, not associating other mechanisms with the rule.
DMS provides a third way to associate these mechanisms (esp. attribute grammar computations) by using the rule itself as a kind of giant name. So, you write the grammar and prettyprint rules in one place, and somewhere else you can write the grammar rule again with an associated attribute computation. In principle, this is just like giving each rule a name (well, a signature) and associating the computation with the name. But it also allows us to define many, many different attribute computations (for different purposes) and associate them with their rules, without cluttering up the base grammar. Our tools check that a (rule, associated-computation) pair has a valid rule in the base grammar, so it makes it relatively easy to track down what needs fixing when the base grammar changes.
This being my tool (I'm the architect), you shouldn't take this as a recommendation, just a bias. That bias is supported by DMS's ability to parse (without whimpering) C, C++, Java, C#, IBM Enterprise COBOL, Python, F77/F90/F95 (with column-6 continuations, F90 continuations, and embedded C preprocessor directives to boot, under most circumstances), Mumps, PHP4/5 and many other languages.
First off, any decent parser generator is going to be robust enough to support Python's indenting. That isn't really all that weird as languages go. You should try parsing column-sensitive languages like Fortran77 some time...
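For reference on the indentation requirement, the usual technique (whatever generator you choose) is a lexer-side indent stack that emits synthetic INDENT/DEDENT tokens, so the grammar itself never sees raw whitespace. A minimal, generator-agnostic sketch, assuming spaces-only indentation and printing tokens instead of feeding a real parser:

    /* Sketch of Python-style indentation handling: track a stack of active
       indentation widths and emit INDENT/DEDENT pseudo-tokens as the level
       changes. Assumes spaces only (no tabs). */
    #include <stdio.h>

    #define MAX_DEPTH 64

    int main(void) {
        int stack[MAX_DEPTH] = {0};   /* stack[0] = 0: top-level indentation */
        int top = 0;
        char line[1024];

        while (fgets(line, sizeof line, stdin)) {
            int indent = 0;
            while (line[indent] == ' ')
                indent++;
            if (line[indent] == '\n' || line[indent] == '\0')
                continue;                               /* blank line: ignore */

            if (indent > stack[top]) {                  /* deeper: open one block */
                if (top + 1 < MAX_DEPTH)
                    stack[++top] = indent;
                puts("INDENT");
            } else {
                while (top > 0 && indent < stack[top]) {   /* shallower: close blocks */
                    top--;
                    puts("DEDENT");
                }
                /* A real lexer reports an error here if indent != stack[top]. */
            }
            printf("LINE  %s", line + indent);          /* rest of the line's tokens */
        }
        while (top > 0) {                               /* close anything left at EOF */
            top--;
            puts("DEDENT");
        }
        return 0;
    }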
Secondly, I don't think you really need the parser itself to be "extensible" do you? You just want to be able to use it to lex and parse the language or two you have in mind, right? Again, any decent parser-generator can do that.
Thirdly, you don't really say what about the mix between grammar and code you don't like. Would you rather it be all implemented in a meta-language (kinda tough), or all in code?
Assuming it is the latter, there are a couple of in-language parser generator toolkits I know of. The first is Boost's Spirit, which is implemented in C++. I've used it, and it works. However, back when I used it you pretty much needed a graduate degree in "boostology" to be able to understand its error messages well enough to get anything working in a reasonable amount of time.
The other I know about is OpenToken, which is a parser-generation toolkit implemented in Ada. Ada doesn't have the error-novel problem that C++ has with its templates, so OpenToken is far easier to use. However, you have to use it in Ada...
Typical functional languages allow you to implement any sublanguage you like (mostly) within the language itself, thanks to their inherently good support for things like lambdas and metaprogramming. However, their parsers tend to be slower. That's really no problem at all if you are just parsing a configuration file or two. It's a tremendous problem if you are parsing hundreds of files at a go.

How do I implement parsing?

I am designing a compiler in C. I want to know which technique I should use: top-down or bottom-up? I have only implemented operator precedence using bottom-up parsing. I have applied the following rules:
E:=E+E
E:=E-E
E:=E/E
E:=E*E
E:=E^E
I want to know: am I going the right way?
If I want to include if-else, loops, arrays, and functions, do I need to implement parsing?
If yes, how do I implement it? Can anyone guide me?
I have only implemented token collection and operator precedence. What are the next steps?
Lex & Yacc are your answer. Or Flex and Bison, which are the branched versions of the original tools.
They are free, they are the de facto standard for writing lexers and parsers in C, and they are used all over the place.
In addition, O'Reilly has released a small pearl of 300 pages: Flex & Bison. I bought it and it really explains how to write a good parser for a programming language and handle all the subtle things (error recovery, conflicts, scopes and so on). It will also answer your questions about how you are parsing expressions: your approach is right for a top-down parser, but you'll discover that it is not enough to handle operator precedence on its own.
Of course, as a hobby, you could write your own lexer and parser, but it would be mostly an academic effort, nice for understanding how FSMs and parsers work, but not quite as much fun :)
If you are, instead, interested in programming language design or complex implementations, I suggest this book: Programming Language Pragmatics, which is not as famous as the Dragon Book but really explains why and how various features can and should be implemented in a compiler. The Dragon Book is a bible too, and it will cover at a really low level how to write a parser... but it would be sort of boring, I warn you...
The best way to implement a good parser in C is using flex & yacc
Your question is quite vague and hard to answer without a more specific, detailed question. The "Dragon book" is an excellent reference though for someone seeking to implement a compiler from scratch, or as others have pointed out Lex and Yacc.
If you intend to implement the parser by hand, you will want to do a recursive descent parser. The code directly reflects the grammar, so it's fairly easy to figure out and understand. It places some restrictions on your grammar (you can't have any left-recursive nonterminals), but you can work around those problems.
However, it depends on the complexity of the grammar; hand-hacking a parser for anything much more complicated than basic arithmetic expressions gets very tedious very quickly. If you're trying to implement anything that looks like a real programming language, use a parser generator like yacc or bison.
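To make that concrete for the grammar in the question (+, -, *, /, ^), here is a sketch of a hand-written recursive-descent parser that evaluates as it goes. It fixes the precedence problem by using one function per precedence level, treats ^ as right-associative, and adds parentheses for grouping (which the question's grammar omits); it is a toy (integers only, no error handling), not a full compiler front end.

    /* Minimal recursive-descent parser/evaluator with one function per
       precedence level:
       E -> T (('+'|'-') T)* ; T -> P (('*'|'/') P)* ; P -> F ('^' P)? ; F -> number | '(' E ')' */
    #include <stdio.h>
    #include <ctype.h>
    #include <stdlib.h>

    static const char *p;                 /* cursor into the input string */

    static long parse_expr(void);

    static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

    static long parse_factor(void) {      /* F -> number | '(' E ')' */
        skip_ws();
        if (*p == '(') {
            p++;                          /* consume '(' */
            long v = parse_expr();
            skip_ws();
            if (*p == ')') p++;           /* consume ')' */
            return v;
        }
        char *end;
        long v = strtol(p, &end, 10);
        p = end;
        return v;
    }

    static long ipow(long base, long e) { /* integer exponentiation for '^' */
        long r = 1;
        while (e-- > 0) r *= base;
        return r;
    }

    static long parse_power(void) {       /* P -> F ('^' P)?  (right-associative) */
        long base = parse_factor();
        skip_ws();
        if (*p == '^') {
            p++;
            return ipow(base, parse_power());
        }
        return base;
    }

    static long parse_term(void) {        /* T -> P (('*'|'/') P)* */
        long v = parse_power();
        for (;;) {
            skip_ws();
            if (*p == '*') { p++; v *= parse_power(); }
            else if (*p == '/') { p++; v /= parse_power(); }
            else return v;
        }
    }

    static long parse_expr(void) {        /* E -> T (('+'|'-') T)* */
        long v = parse_term();
        for (;;) {
            skip_ws();
            if (*p == '+') { p++; v += parse_term(); }
            else if (*p == '-') { p++; v -= parse_term(); }
            else return v;
        }
    }

    int main(void) {
        p = "2 + 3 * 4 ^ 2 - (1 + 1)";
        printf("%ld\n", parse_expr());    /* prints 48: 2 + 3*16 - 2 */
        return 0;
    }

The same layering (one function per precedence level) is exactly what a yacc/bison grammar expresses with %left/%right declarations, which is why the books recommended above spend so much time on it.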

Resources