Recursive Descent Parser for C - c

I'm looking for a parser for C. Here is what I need:
Written in C (not C++).
Handwritten (not generated).
BSD or similarly permissive license.
Capable of nontrivially parsing itself (can be a subset of C).
It can be part of a project as long as it's decoupled so that I can pull out the parser.
Is there an existing parser that fulfills these requirements?

If you don't need C99, then lcc is a slam dunk:
It is documented in a very clear, well-written book.
Techniques used for recursive-descent parsing of operators with precedence are well documented in an article and technical report by Dave Hanson.
Clear, handwritten ANSI C code.
One potential downside is that the lcc parser does not build an abstract-syntax treeā€”it goes straight from parsing to intermediate code.
If you must have C99 then I think tinycc (tcc) is your best bet.

How about Sparse?

You could try TCC. It's licensed under the Lesser GPL.

It seems that nwcc sufficiently agrees with your requirements.

Good c compiler is present at this location. Simple and accessible.
https://github.com/rui314/8cc

GCC has one in gcc/c-parser.c.

Check elsa, it uses the Generalized LR algorithm.
Its main use is for C++, but it also parses C code.
Check on its page, on the section called "How much C can Elsa parse?" which says it can parse most C programs, including the Linux kernel.
It's released under a BSD license.

Here is a recursive descent parser I ported to C:
http://www.gabotronics.com/resources/recursive-descent-parser.htm

Related

GNU parser Optimization Using L-System

Any suggestion on how to use Lindenmeyer System approach (L-System) to make the GNU parser faster through parallelism. I also need to compare the normal execution time and execution time when the L-system is implemented for the C language. Any suggestion will be helpful.
Your question is too broad for this format. Note that you can write a very efficient parser for the C language without the L-System, the TinyCC compiler is a good example of a hand written C parser written int the C language that is very efficient. Check it out at http://tinycc.org

Bootstrapping A compiler [duplicate]

I've heard of the idea of bootstrapping a language, that is, writing a compiler/interpreter for the language in itself. I was wondering how this could be accomplished and looked around a bit, and saw someone say that it could only be done by either
writing an initial compiler in a different language.
hand-coding an initial compiler in Assembly, which seems like a special case of the first
To me, neither of these seem to actually be bootstrapping a language in the sense that they both require outside support. Is there a way to actually write a compiler in its own language?
Is there a way to actually write a compiler in its own language?
You have to have some existing language to write your new compiler in. If you were writing a new, say, C++ compiler, you would just write it in C++ and compile it with an existing compiler first. On the other hand, if you were creating a compiler for a new language, let's call it Yazzleof, you would need to write the new compiler in another language first. Generally, this would be another programming language, but it doesn't have to be. It can be assembly, or if necessary, machine code.
If you were going to bootstrap a compiler for Yazzleof, you generally wouldn't write a compiler for the full language initially. Instead you would write a compiler for Yazzle-lite, the smallest possible subset of the Yazzleof (well, a pretty small subset at least). Then in Yazzle-lite, you would write a compiler for the full language. (Obviously this can occur iteratively instead of in one jump.) Because Yazzle-lite is a proper subset of Yazzleof, you now have a compiler which can compile itself.
There is a really good writeup about bootstrapping a compiler from the lowest possible level (which on a modern machine is basically a hex editor), titled Bootstrapping a simple compiler from nothing. It can be found at https://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html.
The explanation you've read is correct. There's a discussion of this in Compilers: Principles, Techniques, and Tools (the Dragon Book):
Write a compiler C1 for language X in language Y
Use the compiler C1 to write compiler C2 for language X in language X
Now C2 is a fully self hosting environment.
The way I've heard of is to write an extremely limited compiler in another language, then use that to compile a more complicated version, written in the new language. This second version can then be used to compile itself, and the next version. Each time it is compiled the last version is used.
This is the definition of bootstrapping:
the process of a simple system activating a more complicated system that serves the same purpose.
EDIT: The Wikipedia article on compiler bootstrapping covers the concept better than me.
A super interesting discussion of this is in Unix co-creator Ken Thompson's Turing Award lecture.
He starts off with:
What I am about to describe is one of many "chicken and egg" problems that arise when compilers are written in their own language. In this ease, I will use a specific example from the C compiler.
and proceeds to show how he wrote a version of the Unix C compiler that would always allow him to log in without a password, because the C compiler would recognize the login program and add in special code.
The second pattern is aimed at the C compiler. The replacement code is a Stage I self-reproducing program that inserts both Trojan horses into the compiler. This requires a learning phase as in the Stage II example. First we compile the modified source with the normal C compiler to produce a bugged binary. We install this binary as the official C. We can now remove the bugs from the source of the compiler and the new binary will reinsert the bugs whenever it is compiled. Of course, the login command will remain bugged with no trace in source anywhere.
Check out podcast Software Engineering Radio episode 61 (2007-07-06) which discusses GCC compiler internals, as well as the GCC bootstrapping process.
Donald E. Knuth actually built WEB by writing the compiler in it, and then hand-compiled it to assembly or machine code.
As I understand it, the first Lisp interpreter was bootstrapped by hand-compiling the constructor functions and the token reader. The rest of the interpreter was then read in from source.
You can check for yourself by reading the original McCarthy paper, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.
Every example of bootstrapping a language I can think of (C, PyPy) was done after there was a working compiler. You have to start somewhere, and reimplementing a language in itself requires writing a compiler in another language first.
How else would it work? I don't think it's even conceptually possible to do otherwise.
Another alternative is to create a bytecode machine for your language (or use an existing one if it's features aren't very unusual) and write a compiler to bytecode, either in the bytecode, or in your desired language using another intermediate - such as a parser toolkit which outputs the AST as XML, then compile the XML to bytecode using XSLT (or another pattern matching language and tree-based representation). It doesn't remove the dependency on another language, but could mean that more of the bootstrapping work ends up in the final system.
It's the computer science version of the chicken-and-egg paradox. I can't think of a way not to write the initial compiler in assembler or some other language. If it could have been done, I should Lisp could have done it.
Actually, I think Lisp almost qualifies. Check out its Wikipedia entry. According to the article, the Lisp eval function could be implemented on an IBM 704 in machine code, with a complete compiler (written in Lisp itself) coming into being in 1962 at MIT.
Some bootstrapped compilers or systems keep both the source form and the object form in their repository:
ocaml is a language which has both a bytecode interpreter (i.e. a compiler to Ocaml bytecode) and a native compiler (to x86-64 or ARM, etc... assembler). Its svn repository contains both the source code (files */*.{ml,mli}) and the bytecode (file boot/ocamlc) form of the compiler. So when you build it is first using its bytecode (of a previous version of the compiler) to compile itself. Later the freshly compiled bytecode is able to compile the native compiler. So Ocaml svn repository contains both *.ml[i] source files and the boot/ocamlc bytecode file.
The rust compiler downloads (using wget, so you need a working Internet connection) a previous version of its binary to compile itself.
MELT is a Lisp-like language to customize and extend GCC. It is translated to C++ code by a bootstrapped translator. The generated C++ code of the translator is distributed, so the svn repository contains both *.melt source files and melt/generated/*.cc "object" files of the translator.
J.Pitrat's CAIA artificial intelligence system is entirely self-generating. It is available as a collection of thousands of [A-Z]*.c generated files (also with a generated dx.h header file) with a collection of thousands of _[0-9]* data files.
Several Scheme compilers are also bootstrapped. Scheme48, Chicken Scheme, ...

Parsing C to Ocaml

I would like to get the Abstract Syntax Tree (AST) from a C code, into an OCaml value, so that I can further process the parsed code with a plain OCaml program.
I had in mind to use GCC, get the AST (in GIMPLE) with a hook, and convert the GIMPLE code to Ocaml.
But I wonder if there is another way, or if someone did something similar already. (I haven't found much actually on that...)
I don't want to resort to using CIL. It is an OCaml parser for C code, but it doesn't contain all optimizations that GCC has. (I especially need a deeper alias analysis than the one implemented in CIL).
Can LLVM be a good idea to look at? Already done maybe?
Any better idea?
If your problem with CIL is the precision of the provided alias analysis, take a look at Frama-C. It is based on CIL but provides a precise value analysis that works for pointers. The value analysis makes its results available inside a modular architecture.
An other option to parse C to Ocaml would be FrontC. Its description says :
FrontC is an OCAML library providing a C parser and lexer. The result is a syntactic tree easy to process with usual OCAML tree management.
It provides support for ANSI C syntax, old-C K&R style syntax and the standard GNU CC attributes.
It provides also a C pretty printer as an example of use.

Is C open source?

Does C (or any other low-level language, for that matter) even have source, or is the compiler the part that "does all the work", including parsing? If so, couldn't different compilers have different C dialects? Where does the stdlib factor into this? I would really like to know how this works.
The C language is not a piece of software but a defined standard, so one wouldn't say that it's open-source, but rather that it's an open standard.
There are a gazillion different compilers for C however, and many of those are indeed open-source. The most notable example is GCC's C compiler, which is all under the GNU General Public License (GPL), an open-source license.
There are more options. Watcom is open-source, for instance. There is no shortage of open-source C compilers, but without a doubt the most widespread one, at least in the non-Windows world, is GCC.
For Windows, your best bet is probably Watcom or GCC by using Cygwin or MinGW.
C is a standard which specifies how C compilers should generate programs.
C itself doesn't have any source code, just like a musical note doesn't have any plastic.
Some C compilers, such as GCC, are open source.
C is just a language, and a standardised one at that, too. It pretty much is the compiler that "does all the work". Different compilers did have different dialects; before the the C99 ANSI standard, you had things like Borland C and other competing compilers, that implemented the C language in their own fantastic ways.
stdlib is just an agreed-upon collection of standard libraries that are required to be present in any ANSI C implementation.
To add on to the other great answers:
Regarding different dialects -- there are some additional features added to C that are compiler specific. You can provide the command line flag -std=... to gcc to specify the C standard that you want to use, each has slight variations/additions to syntax, the most common is probably c99.
Each compiler tends to implement a few different extras, for example, typeof() is not in the C standard and so compilers do not have to implement this but nevertheless it is useful and most compilers provide it. Here is a list of gcc C extensions
The stdlib is a set of functions specified in the C standard. Much like compilers, stdlib can have different implementations. The GNU implementation is open source, as is gcc, but there are other compilers and could be other implementations of stdlib that are closed source.
The Compiler would determine all the mappings from C to Assembly etc... but as far as someone owning it.....noone really owns C however the ANSI/ISO determines the standards
GCC's C compiler is written in C. So we know there are at least one C compiler written in C.
GNU's stdlib (glibc) is also written in C (stdio.h, stdlib.h). But it also has some parts written in assembly language.
A really good question. There is a way to define a language standard (not the implementation!) in a form of a "source code", in a strict and unambigous language. Unfortunately, all of the old languages, including C, are poorly defined. But it is still possible to translate that definitions into a source code form.
Another approach is to define a language via its operational semantics, often in a form of a simple (and unefficient) reference implementation.
Helgi Hrafn Gunnarsson has written the main answer but I thought it would be worth noting that you can effectively end up with dialects too.
The compilers should do the same thing with regards to whichever standard they support (which these days should be pretty much all the same version) but there are grey areas. The way in which the compilers work for 'undefined' functionality for example. If the C specification says that the behaviour is undefined for a specific case then the compiler can do pretty much what it wants.
There are also examples of functions added to the libraries (and new libraries added) by the compiler makers to support specific platform traits, create a competitive advantage or simply to make life easier. The cynical might suggest that some of these are added to help lock people into a specific compiler too.
I would say that C as a language is not open source.
As pointed out by many, you can download GNU licensed compilers and libraries for free, but if you wanted to write your own C compiler, you would need to follow the ISO C standards, and ISO charge hard cash for the specification of the C language, which at the time of posting this is $178.
So really the answer depends on what elements you are interested in being free and open source.
I'm not sure what your definitions of "open source" are.
For the standardization process, it is possible for anyone to participate, but if you want to be able to vote then you will need to pay to join your national body (for instance, ANSI for the USA, BSI for the UK, AFNOR for France etc.). As a rule most standards body memberships are paid by corporations. That said, the process is fairly open. You can access discussion papers on the standards web site.
The standards themselves are not free either. The ISO pdf store currently sells the C standard for 198 swiss francs. Draft copies of the standard can be found easily for free.
There are plenty of open source implementations of both compilers and libraries.

C to IEC 61131-3 IL compiler

I have a requirement for porting some existing C code to a IEC 61131-3 compliant PLC.
I have some options of splitting the code into discrete function blocks and weaving those blocks into a standard solution (Ladder, FB, Structured Text etc). But this would require carving up the C code in order to build each function block.
When looking at the IEC spec I realsied that the IEC Instruction List form could be a target language for a compiler. The wikepedia article lists two development tools:
CoDeSys
Beremiz
But these seem to be targeted compiling IEC languages to C, not C to IEC.
Another possible solution is to push the C code through a C to Pascal translator and use that as a starting point for a Structured Text solution.
If not any of these I will go down the route of splitting the code up into function blocks.
Edit
As prompted by mlieson's reply I should have mentioned that the C code is an existing real-time control system. So the programs algorithms should already suit a PLC environment.
Maybe this answer comes too late but it is possible to call C code from CoDeSys thanks to an external library.
You can find documentation on the CoDeSys forum at http://forum-en.3s-software.com/viewtopic.php?t=620
That would give you to use your C code into the PLC with minor modifcations. You'll just have to define the functions or function blocks interfaces.
My guess is that a C to Pascal translator will not get you near enough for being worth the trouble. Structured text looks a lot like Pascal, but there are differences that you will need to fix everywhere.
Not a bug issue, but don't forget that PLCs runtime enviroment is a bit different. A C applications starts at main() and ends when main() returns. A PLC calls it main() over and over again, 100:s of times per second and it never ends.
Usally lengthy calculations and I/O needs to be coded in diffent fashion than a C appliation would use.
Unless your C source is many many thousands lines of code - Rewrite it.
It is impossible. To be short: the IL language is a 4GL (i.e. limited to
the domain, as well as other IEC 61131-3 languages -- ST, FBD, LD, SFC).
The C language is a 3GL.
To understand the problem, try to answer the question, which way to
express in IL manipulations with a pointer? for example, to express call a
function by a pointer. What about interrupts? Low level access to the
peripherial devices?
(really, there are more problems)
BTW, there is the Reflex language, aka "C with processes". Reflex is a 4GL for the
control domain with C-like syntax. But the known translators produce
C-code and Python-code.
If the amount of code to convert is a few thousand lines, recoding by hand is probably your best bet.
If you have lots of code to convert, then an automated tool might be very effective.
Using the DMS Software Reengineering Toolkit we've built translators to map mechanical motion diagrams into RLL (PLC) code. DMS also has full C parser/analyzers/front ends. The pieces are there to build a C to RLL code.
This isn't an easy task. It likely takes 6-12 man-months to configure DMS to something resembling what you want. If that's less than what it takes to do by hand, then its the right way to do it.
There are a few IEC development environments and target hardware that can use C blocks... I would also take a look at the reasons why it HAS to be an IEC-61131 complaint target. I have written extensively on compliance and why it doesn't mean squat.
SOFTplc corp can help I'm sure with user defined loadable modules... and they can be in C..
Schneider also supports C function blocks...
Labview too!! not sure why IEC is important that's all!! the compiler if existed would create bad code for sure:)
Your best bet is to split your C code into smaller parts which can be recoded as PLC functional blocks and use C to PASCAL convertor for each block which you will rewrite in structured text. Prepare to do a lot of manual work since automated conversion will probably disappoint you.
Also take a look at this page: http://www.control.com/thread/1026228786
Every time I've done this, I just parsed and converted it by hand from C directly to ST. I only ran into a few functions that required complete rewrites, although there was very little that dealt with pointers, which is something that ST generally chokes on, unfortunately.
Using the existing C code as blocks that are called by the PLC program would have the added advantage that the C blocks could run at the same periodicity that they did before, and their function is likely already well documented and tested. This would minimize any effect on changes from the existing control system. This is an architecture for controls with software PLCs that I have seen used before.

Resources