Parsing C to OCaml

I would like to get the Abstract Syntax Tree (AST) from C code into an OCaml value, so that I can further process the parsed code with a plain OCaml program.
I had in mind to use GCC, get the AST (in GIMPLE) with a hook, and convert the GIMPLE code to OCaml.
But I wonder if there is another way, or if someone has already done something similar. (I haven't actually found much on that...)
I don't want to resort to using CIL. It is an OCaml parser for C code, but it doesn't provide all the analyses and optimizations that GCC has. (I especially need a deeper alias analysis than the one implemented in CIL.)
Can LLVM be a good idea to look at? Already done maybe?
Any better idea?

If your problem with CIL is the precision of the provided alias analysis, take a look at Frama-C. It is based on CIL but provides a precise value analysis that works for pointers. The value analysis makes its results available inside a modular architecture.

Another option for parsing C into OCaml would be FrontC. Its description says:
FrontC is an OCAML library providing a C parser and lexer. The result is a syntactic tree easy to process with usual OCAML tree management.
It provides support for ANSI C syntax, old-C K&R style syntax and the standard GNU CC attributes.
It provides also a C pretty printer as an example of use.

Related

How much lisp to implement in C before writing extension in itself?

I am implementing a Lisp interpreter in C. So far I have implemented a few primitives like cons, car, cdr, and eq, along with basic arithmetic.
Just before starting to implement define and lambda, it occurred to me that I need to implement an environment. I am unsure whether I could implement it in Lisp itself.
My intent is to implement a minimal amount of Lisp in C so that I can write extensions to the language in itself. I am not sure how much is minimal. Would implementing an FFI qualify as minimal?
The answer to your question depends on the meaning that you give to the word “minimal”.
Given your question, and assuming that you don't want to compete with today's fine implementations of Common Lisp and Scheme, my hypothesis is that by “minimal” you mean: Turing complete, that is, capable of expressing any computation expressible in a general-purpose programming language.
With this assumption, you need to implement three other things:
conditional forms (cond)
lambda expressions (lambda)
a way of defining recursive lambda expression (labels or defun)
Your interpreter should then be able to evaluate forms. This should be sufficient to have a language equivalent to the original LISP, which allows any computable function to be expressed in the language.
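To make the evaluation of forms concrete, here is a minimal, self-contained C sketch of dispatching on a conditional form. The representation is invented for the example (a pre-tagged Expr struct rather than real cons cells); a real interpreter would instead inspect the symbol in the car of a list, and lambda additionally needs an environment.

/* Illustrative sketch only: a pre-tagged expression struct stands in
   for real cons cells, and all names are invented for the example. */
#include <stdio.h>

typedef enum { E_NUM, E_COND } Kind;

typedef struct Expr {
    Kind kind;
    long num;                        /* E_NUM: a literal number */
    struct Expr *test, *then, *els;  /* E_COND: one (test then) clause plus a default */
} Expr;

static long eval(const Expr *e)
{
    switch (e->kind) {
    case E_NUM:
        return e->num;
    case E_COND:
        /* clauses are tried in order; the first true test wins */
        return eval(e->test) != 0 ? eval(e->then) : eval(e->els);
    }
    return 0; /* not reached */
}

int main(void)
{
    Expr one  = { E_NUM, 1, 0, 0, 0 };
    Expr two  = { E_NUM, 2, 0, 0, 0 };
    Expr zero = { E_NUM, 0, 0, 0, 0 };
    Expr c    = { E_COND, 0, &zero, &one, &two };  /* (cond (0 1) (t 2)) */
    printf("%ld\n", eval(&c));                     /* prints 2 */
    return 0;
}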
First off, you are talking about writing a LISP interpreter. You have a lot of choices to make when it comes to scoping and Lisp-1 vs. Lisp-2, since these questions alter the implementation core. An interpreter is a general-purpose program that reads and evaluates code. It can support abstractions, but it won't extend itself by adding more native functionality.
If you are interested in such things you could make a compiler instead. E.g. there are many Scheme-like subsets that compile to C or Java code, but you could also make your own VM. It can then indeed compile itself to run on its own target machine (self-hosting), provided all the forms and procedures you use have been implemented using the primitives supported by the compiler.
Making a dumb compiler is not much different from making an interpreter. That is very clear if you've watched the SICP videos (10A is about compilation, 7A and 7B are about interpreters).
The environment can be a chain of pairs, just as in a LISP interpreter (see the sketch below). It would be difficult to implement the environment in Lisp itself without making the language very difficult to use (unless it's compiled, that is).
You may use the data structures of Lisp and the primitives from the C code, though.
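As a concrete illustration of the chain-of-pairs environment, here is a minimal self-contained sketch in C. All names (binding, env_extend, env_lookup) are invented for the example, and a real interpreter would store tagged Lisp objects rather than plain ints.

/* Illustrative sketch only: names are invented, and a real interpreter
   would store tagged Lisp objects rather than plain ints. */
#include <stdio.h>
#include <string.h>

struct binding {
    const char *name;
    int value;               /* stand-in for a tagged Lisp object */
    struct binding *next;    /* enclosing bindings, ending in NULL */
};

/* Applying a lambda would call this once per parameter. */
static struct binding *env_extend(struct binding *env, const char *name,
                                  int value, struct binding *cell)
{
    cell->name = name;
    cell->value = value;
    cell->next = env;
    return cell;
}

static struct binding *env_lookup(struct binding *env, const char *name)
{
    for (; env != NULL; env = env->next)
        if (strcmp(env->name, name) == 0)
            return env;      /* innermost binding wins */
    return NULL;             /* unbound variable */
}

int main(void)
{
    struct binding cells[2];   /* caller-provided to keep the sketch malloc-free */
    struct binding *env = env_extend(NULL, "x", 1, &cells[0]);
    env = env_extend(env, "y", 2, &cells[1]);
    printf("y = %d\n", env_lookup(env, "y")->value);   /* prints y = 2 */
    return 0;
}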
Making an FFI is a fast way to give your language lots of features. It solves the chicken-and-egg problem by reusing other people's work from within your language. It fuses the top layer (primitives and syntax) and the bottom layer (a runtime) of your system. It's the ultimate primitive, and you can think of it as a system call or message bus to the runtime.
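As a small illustration of that idea, here is a sketch in C of the simplest possible ad-hoc "FFI": a table mapping names to C function pointers that the evaluator can consult. The names and the fixed two-argument signature are invented for the example.

/* Illustrative sketch only: the names and the fixed two-argument
   signature are invented for the example. */
#include <stdio.h>
#include <string.h>

typedef long (*prim_fn)(long, long);

static long prim_add(long a, long b) { return a + b; }
static long prim_mul(long a, long b) { return a * b; }

static const struct { const char *name; prim_fn fn; } primitives[] = {
    { "+", prim_add },
    { "*", prim_mul },
};

/* The evaluator consults this table when it sees a symbol in call position. */
static prim_fn find_primitive(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof primitives / sizeof primitives[0]; i++)
        if (strcmp(primitives[i].name, name) == 0)
            return primitives[i].fn;
    return NULL;
}

int main(void)
{
    prim_fn f = find_primitive("+");
    printf("(+ 2 3) => %ld\n", f ? f(2, 3) : -1);   /* prints 5 */
    return 0;
}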
I strongly suggest reading Queinnec's book Lisp In Small Pieces. It is a book dedicated entirely to answering your question, and it explains in detail the many trade-offs and the internals of Lisp implementations and definitions, giving many explained examples of Lisp interpreters and compilers.
You might also consider using libffi. You could be interested in the internals of M. Serrano's Bigloo & Hop implementations. You might even look inside my MELT lisp-like language for customizing the GCC compiler.
You also need to learn more about garbage collection (you might read the GC handbook). You could use Boehm's conservative Garbage Collector (or something else, e.g. my Qish or MPS) or write your own GC.
You may want to learn more about Chicken, Scheme 48, Guile and read their papers and look inside their code.
See also J.Pitrat's blog: it is not about Lisp (but about bootstrapping strong AI) and has several fascinating entries related to bootstrapping.

From Assembler to C Compiler

I designed a small RISC in Verilog. What steps do I have to take to create a C compiler that targets my assembly language? Or is it possible to modify a conventional compiler like GCC? Because I don't want to build things like the linker, ...
Thanks
You need to use an unmodified C lexer+parser (often called the front end), and a modified code generation component (the back end) to do that.
Eli Bendersky's pycparser can be used as the front end, and Atul's mini C compiler can be used as inspiration for the code generating back end: http://people.cs.uchicago.edu/~varmaa/mini_c/
With Eli Bendersky's pycparser, all you need to do is convert the AST to a Control Flow Graph (CFG) and generate code from there. It is easier to start with supporting a subset of C than the full shebang.
The two tools are written in Python, but you didn't mention any implementation language preferences :)
I have found most open source compilers (except Clang, it seems) too tightly coupled to easily modify the back end. Clang and especially GCC are not easy to dive into, nowhere near as easy as the two tools above. And since Eli's parser handles full C99 (it parses everything I've thrown at it), it seems like a nice front end to use for further development.
The examples on the GitHub project demonstrate most of the features of the project, and it's easy to get started. The example that parses C to literal English is worth taking a look at, but may take a while to fully grok. It basically handles any C expression, so it is a good reference for how to handle the different nodes of the AST.
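To make the code-generating back end concrete, here is a small C sketch that walks an expression tree and emits assembly for a made-up accumulator-style RISC. The instruction names (LDI, PUSH, POP, ADD, MUL) and the register convention are invented and would have to be replaced by your own instruction set.

/* Illustrative sketch only: the instruction names and the register
   convention (result in r0, temporaries saved on a stack) are made up. */
#include <stdio.h>

typedef struct Node {
    char op;                 /* '+', '*', or 0 for a literal */
    int value;               /* used when op == 0 */
    struct Node *lhs, *rhs;
} Node;

static void gen(const Node *n)
{
    if (n->op == 0) {                 /* literal: load immediate */
        printf("    LDI  r0, %d\n", n->value);
        return;
    }
    gen(n->lhs);                      /* lhs result ends up in r0 */
    printf("    PUSH r0\n");          /* save it across the rhs */
    gen(n->rhs);
    printf("    POP  r1\n");
    printf("    %s  r0, r1, r0\n",    /* r0 = r1 op r0, i.e. lhs op rhs */
           n->op == '+' ? "ADD" : "MUL");
}

int main(void)
{
    /* (1 + 2) * 3 */
    Node one = {0, 1, 0, 0}, two = {0, 2, 0, 0}, three = {0, 3, 0, 0};
    Node sum  = {'+', 0, &one, &two};
    Node prod = {'*', 0, &sum, &three};
    gen(&prod);
    return 0;
}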
I also recommended the tools above, in my answer to this question: Build AST from C code

Bootstrapping a compiler [duplicate]

I've heard of the idea of bootstrapping a language, that is, writing a compiler/interpreter for the language in itself. I was wondering how this could be accomplished and looked around a bit, and saw someone say that it could only be done by either
writing an initial compiler in a different language.
hand-coding an initial compiler in Assembly, which seems like a special case of the first
To me, neither of these seem to actually be bootstrapping a language in the sense that they both require outside support. Is there a way to actually write a compiler in its own language?
Is there a way to actually write a compiler in its own language?
You have to have some existing language to write your new compiler in. If you were writing a new, say, C++ compiler, you would just write it in C++ and compile it with an existing compiler first. On the other hand, if you were creating a compiler for a new language, let's call it Yazzleof, you would need to write the new compiler in another language first. Generally, this would be another programming language, but it doesn't have to be. It can be assembly, or if necessary, machine code.
If you were going to bootstrap a compiler for Yazzleof, you generally wouldn't write a compiler for the full language initially. Instead you would write a compiler for Yazzle-lite, the smallest possible subset of Yazzleof (well, a pretty small subset at least). Then in Yazzle-lite, you would write a compiler for the full language. (Obviously this can occur iteratively instead of in one jump.) Because Yazzle-lite is a proper subset of Yazzleof, you now have a compiler which can compile itself.
There is a really good writeup about bootstrapping a compiler from the lowest possible level (which on a modern machine is basically a hex editor), titled Bootstrapping a simple compiler from nothing. It can be found at https://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html.
The explanation you've read is correct. There's a discussion of this in Compilers: Principles, Techniques, and Tools (the Dragon Book):
Write a compiler C1 for language X in language Y
Use the compiler C1 to write compiler C2 for language X in language X
Now C2 is a fully self-hosting environment.
The way I've heard of is to write an extremely limited compiler in another language, then use that to compile a more complicated version, written in the new language. This second version can then be used to compile itself, and the next version. Each time it is compiled the last version is used.
This is the definition of bootstrapping:
the process of a simple system activating a more complicated system that serves the same purpose.
EDIT: The Wikipedia article on compiler bootstrapping covers the concept better than I can.
A super interesting discussion of this is in Unix co-creator Ken Thompson's Turing Award lecture, Reflections on Trusting Trust.
He starts off with:
What I am about to describe is one of many "chicken and egg" problems that arise when compilers are written in their own language. In this case, I will use a specific example from the C compiler.
and proceeds to show how he wrote a version of the Unix C compiler that would always allow him to log in without a password, because the C compiler would recognize the login program and add in special code.
The second pattern is aimed at the C compiler. The replacement code is a Stage I self-reproducing program that inserts both Trojan horses into the compiler. This requires a learning phase as in the Stage II example. First we compile the modified source with the normal C compiler to produce a bugged binary. We install this binary as the official C. We can now remove the bugs from the source of the compiler and the new binary will reinsert the bugs whenever it is compiled. Of course, the login command will remain bugged with no trace in source anywhere.
Check out podcast Software Engineering Radio episode 61 (2007-07-06) which discusses GCC compiler internals, as well as the GCC bootstrapping process.
Donald E. Knuth actually built WEB by writing the compiler in it, and then hand-compiled it to assembly or machine code.
As I understand it, the first Lisp interpreter was bootstrapped by hand-compiling the constructor functions and the token reader. The rest of the interpreter was then read in from source.
You can check for yourself by reading the original McCarthy paper, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.
Every example of bootstrapping a language I can think of (C, PyPy) was done after there was a working compiler. You have to start somewhere, and reimplementing a language in itself requires writing a compiler in another language first.
How else would it work? I don't think it's even conceptually possible to do otherwise.
Another alternative is to create a bytecode machine for your language (or use an existing one if its features aren't very unusual) and write a compiler to bytecode, either in the bytecode itself, or in your desired language using another intermediate - such as a parser toolkit which outputs the AST as XML, then compiling the XML to bytecode using XSLT (or another pattern-matching language and tree-based representation). It doesn't remove the dependency on another language, but it could mean that more of the bootstrapping work ends up in the final system.
It's the computer science version of the chicken-and-egg paradox. I can't think of a way not to write the initial compiler in assembler or some other language. If it could have been done, I should think Lisp could have done it.
Actually, I think Lisp almost qualifies. Check out its Wikipedia entry. According to the article, the Lisp eval function was implemented in machine code on an IBM 704, with a complete compiler (written in Lisp itself) coming into being in 1962 at MIT.
Some bootstrapped compilers or systems keep both the source form and the object form in their repository:
OCaml is a language which has both a bytecode compiler (i.e. a compiler to OCaml bytecode, run by a bytecode interpreter) and a native compiler (to x86-64 or ARM, etc. assembly). Its svn repository contains both the source code (files */*.{ml,mli}) and the bytecode (file boot/ocamlc) form of the compiler. So when you build it, it first uses the bytecode (of a previous version of the compiler) to compile itself. Later, the freshly compiled bytecode is able to compile the native compiler. So the OCaml svn repository contains both *.ml[i] source files and the boot/ocamlc bytecode file.
The Rust compiler downloads (using wget, so you need a working Internet connection) a previous version of its binary to compile itself.
MELT is a Lisp-like language to customize and extend GCC. It is translated to C++ code by a bootstrapped translator. The generated C++ code of the translator is distributed, so the svn repository contains both *.melt source files and melt/generated/*.cc "object" files of the translator.
J.Pitrat's CAIA artificial intelligence system is entirely self-generating. It is available as a collection of thousands of [A-Z]*.c generated files (also with a generated dx.h header file) with a collection of thousands of _[0-9]* data files.
Several Scheme compilers are also bootstrapped. Scheme48, Chicken Scheme, ...

What libraries would be useful for implementing a small language interpreter in C?

For my own learning experience, I want to try writing an interpreter for a simple programming language in C – the main thing I think I need is a hash table library, but a general purpose collection of data structures and helper functions would be pretty helpful. What would you guys recommend?
libbasekit - by the author of Io. You can also use libcoroutine.
One library I recommend looking into is libgc, a garbage collector for C.
You use it by replacing calls to malloc, realloc, strdup, etc. with their libgc counterparts (e.g. GC_MALLOC). It works by scanning the stack, global variables, and GC-allocated blocks, looking for numbers that might be pointers. Believe it or not, it actually performs quite well (almost on par with the very good ptmalloc, which is the default (non-garbage collected) malloc implementation in GNU/Linux), and a lot of programs use it (including Mono and GCJ). A disadvantage, though, is it might not play well with other libraries you may want to use, and you may even have to recompile some of them by hand to replace calls to malloc with GC_MALLOC.
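As a small illustration of that substitution, here is a sketch that builds a linked list with GC_MALLOC and never calls free. It assumes libgc is installed and the program is linked with -lgc; on some systems the header is <gc/gc.h> rather than <gc.h>.

/* Sketch assuming libgc is installed; compile with: cc demo.c -lgc */
#include <stdio.h>
#include <gc.h>              /* on some systems: <gc/gc.h> */

struct node { int value; struct node *next; };

int main(void)
{
    struct node *head = NULL;
    int i;
    GC_INIT();                            /* initialize the collector first */
    for (i = 0; i < 5; i++) {
        struct node *n = GC_MALLOC(sizeof *n);   /* instead of malloc */
        n->value = i;
        n->next = head;
        head = n;
    }
    for (; head != NULL; head = head->next)
        printf("%d ", head->value);
    putchar('\n');
    return 0;   /* no free(): unreachable blocks are reclaimed by the GC */
}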
Honestly - and I know some people will hate me for it - but I recommend you use C++. You don't have to bust a gut to learn it just to be able to start your project. Just use it like C, but in an hour you can learn how to use std::map<> (an associative container), std::string for easy textual data handling, and std::vector<> for a resizable heap-allocated array. If you want to spend an extra hour or two, learn to put member functions in classes (don't worry about polymorphism, virtual functions etc. to begin with), and you'll get a more organised program.
You need no more than the standard library for a suitably small language with simple constructs. The most complex part of an interpreted language is probably expression evaluation. For that, as well as for procedure calling and construct nesting, you will need to understand and implement stack data structures.
The code at the link above is C++, but the algorithm is described clearly and you could re-implement it easily in C. There again there are few valid arguments for not using C++ IMO.
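For instance, here is a small self-contained sketch in C of the stack-based core: evaluating a postfix (RPN) expression with an explicit stack. An infix-to-postfix conversion pass (e.g. the shunting-yard algorithm) would typically produce this form first; the fixed-size stack and single-digit operands are simplifications for the example.

/* Illustrative sketch only: fixed-size stack, single-digit operands,
   and just two operators, to keep the example short. */
#include <stdio.h>
#include <ctype.h>

static long stack[64];
static int top = 0;

static void push(long v) { stack[top++] = v; }
static long pop(void)    { return stack[--top]; }

static long eval_rpn(const char *s)
{
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            push(*s - '0');
        } else if (*s == '+' || *s == '*') {
            long b = pop(), a = pop();
            push(*s == '+' ? a + b : a * b);
        }                    /* anything else (spaces) is skipped */
    }
    return pop();
}

int main(void)
{
    printf("%ld\n", eval_rpn("2 3 + 4 *"));   /* (2 + 3) * 4 = 20 */
    return 0;
}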
Before diving into which libraries to use, I suggest you learn about grammars and compiler design. Input parsing in particular is similar for compilers and interpreters: it consists of tokenizing and parsing. Tokenizing converts a stream of characters (your input) into a stream of tokens; a parser then takes this stream of tokens and matches it against your grammar.
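As a small illustration of the tokenizing step, here is a hand-written sketch in C that turns a character stream into a token stream. The token set and names are invented for the example; this is roughly the kind of code a tool like lex generates for you.

/* Illustrative sketch only: the token set and names are invented. */
#include <stdio.h>
#include <ctype.h>

typedef enum { TOK_NUM, TOK_IDENT, TOK_PUNCT, TOK_EOF } TokKind;

static const char *src;      /* the character stream */

static TokKind next_token(char *text)
{
    int i = 0;
    while (isspace((unsigned char)*src))
        src++;                            /* skip whitespace */
    if (*src == '\0')
        return TOK_EOF;
    if (isdigit((unsigned char)*src)) {   /* number: a run of digits */
        while (isdigit((unsigned char)*src)) text[i++] = *src++;
        text[i] = '\0';
        return TOK_NUM;
    }
    if (isalpha((unsigned char)*src)) {   /* identifier: letter, then alnum */
        while (isalnum((unsigned char)*src)) text[i++] = *src++;
        text[i] = '\0';
        return TOK_IDENT;
    }
    text[0] = *src++;                     /* single-character punctuation */
    text[1] = '\0';
    return TOK_PUNCT;
}

int main(void)
{
    char text[64];
    TokKind k;
    src = "x1 = 42;";
    while ((k = next_token(text)) != TOK_EOF)
        printf("kind=%d text=%s\n", (int)k, text);
    return 0;
}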
You don't mention what language you're writing an interpreter for, but very likely that language contains recursion. In that case you need a so-called bottom-up parser, which you cannot write by hand but need to generate. If you try to write such a parser by hand, you will end up with an error-prone mess.
If you're developing for a POSIX platform then you can use lex and yacc. These tools are a bit old but very powerful for building parsers. Lex can generate code that implements the tokenizing process, and yacc can generate a bottom-up parser.
My answer probably raises more questions than it answers. That's because the field of compilers/interpreters is quite complex and cannot simply be explained in a short answer. Just get a good book on compiler design.

How would I implement something similar to the Objective-C #encode() compiler directive in ANSI C?

The #encode directive returns a const char * which is a coded type descriptor of the various elements of the datatype that was passed in. Example follows:
struct test {
    int ti;
    char tc;
};

printf("%s", #encode(struct test));
// returns "{test=ic}"
I could see using sizeof() to determine primitive types - and if it was a full object, I could use the class methods to do introspection.
However, how does it determine each element of an opaque struct?
@Lothar's answer might be "cynical", but it's pretty close to the mark, unfortunately. In order to implement something like #encode(), you need a full-blown parser to extract the type information. Well, at least for anything other than "trivial" #encode() statements (e.g., #encode(char *)). Modern compilers generally have either two or three main components:
The front end.
The intermediate end (for some compilers).
The back end.
The front end parses all the source code and basically converts the source text into an internal, "machine usable" form.
The back end translates the internal, "machine usable" form into executable code.
Compilers that have an "intermediate end" typically do so because of some need: they support multiple "front ends", possibly made up of completely different languages. Another reason is to simplify optimization: all the optimization passes work on the same intermediate representation. The gcc compiler suite is an example of a "three stage" compiler. llvm could be considered an "intermediate end and back end" compiler: the "low level virtual machine" is the intermediate representation, and all the optimization takes place in this form. llvm is also able to keep the program in this intermediate representation right up until the last second; this allows for "link time optimization". The clang compiler is really a "front end" that (effectively) outputs llvm intermediate representation.
So, if you want to add #encode() functionality to an 'existing' compiler, you'd probably have to do it as a "source to source" 'compiler / preprocessor'. This was how the original Objective-C and C++ compilers were written: they parsed the input source text and converted it to "plain C", which was then fed to the standard C compiler. There are a few ways to do this:
Roll your own
Use yacc and lex to put together an ANSI-C parser. You'll need a grammar: the ANSI C grammar (Yacc) is a good start. Actually, to be clear, when I say yacc, I really mean bison and flex. And also, loosely, the other various yacc- and lex-like C-based tools: lemon, dparser, etc...
Use Perl with Yapp or EYapp, which are pseudo-yacc clones in Perl. Probably better for quickly prototyping an idea than the C-based yacc and lex; it's Perl after all: regular expressions, associative arrays, no memory management, etc.
Build your parser with ANTLR. I don't have any experience with this tool chain, but it's another "compiler compiler" tool that seems to be geared more towards Java developers. There appear to be freely available C and Objective-C grammars for it.
Hack another tool
Note: I have no personal experience using any of these tools to do anything like adding #encode(), but I suspect they would be a big help.
CIL - No personal experience with this tool, but designed for parsing C source code and then "doing stuff" with it. From what I can glean from the docs, this tool should allow you to extract the type information you'd need.
Sparse - Worth looking at, but not sure.
clang - Haven't used it for this purpose, but allegedly one of the goals was to make it "easily hackable" for just this sort of thing. Particularly (and again, no personal experience) in doing the "heavy lifting" of all the parsing, letting you concentrate on the "interesting" part, which in this case would be extracting context- and syntax-sensitive type information and then converting that into a plain C string.
gcc Plugins - Plugins are a gcc 4.5 (which is the current alpha/beta version of the compiler) feature and "might" allow you to easily hook in to the compiler to extract the type information you'd need. No idea if the plugin architecture allows for this kind of thing.
Others
Coccinelle - Bookmarked this recently to "look at later". This "might" be able to do what you want, and "might" be able to do it without much effort.
MetaC - Bookmarked this one recently too. No idea how useful this would be.
mygcc - "Might" do what you want. It's an interesting idea, but it's not directly applicable to what you want. From the web page: "Mygcc allows programmers to add their own checks that take into account syntax, control flow, and data flow information."
Links.
CocoaDev Objective-C Parsing - Worth looking at. Has some links to lexers and grammars.
Edit #1, the bonus links.
@Lothar makes a good point in his comment. I had actually intended to include lcc, but it looks like it got lost along the way.
lcc - The lcc C compiler. This is a C compiler that is particularly small, at least in terms of source code size. It also has a book, which I highly recommend.
tcc - The tcc C compiler. Not quite as pedagogical as lcc, but definitely still worth looking at.
poc - The poc Objective-C compiler. This is a "source to source" Objective-C compiler. It parses the Objective-C source code and emits C source code, which it then passes to gcc (well, usually gcc). Has a number of Objective-C extensions / features that aren't available in gcc. Definitely worth looking at.
You would implement this by implementing the ANSI C compiler first and then adding some implementation-specific pragmas and functions to it.
Yes, I know this is a cynical answer, and I accept the downvotes.
One way to do it would be to write a preprocessor, which reads the source code for the type definitions and also replaces #encode... with the corresponding string literal.
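As a sketch of what such a preprocessor would ultimately emit, here is a small C program that maps already-parsed struct fields to their Objective-C type codes ('i' for int, 'c' for char) and builds the "{name=...}" string. The hard part, parsing the struct definition itself, is assumed to have already happened; the field table here is hard-coded.

/* Illustrative sketch only: the field table is hard-coded; a real
   preprocessor would fill it in by parsing the struct definition. */
#include <stdio.h>

struct field { const char *name; char type_code; };

/* Emit the string literal that replaces #encode(struct <tag>). */
static void emit_encode(const char *tag, const struct field *f, int n)
{
    int i;
    printf("\"{%s=", tag);
    for (i = 0; i < n; i++)
        putchar(f[i].type_code);
    printf("}\"\n");
}

int main(void)
{
    /* struct test { int ti; char tc; } as parsed by the preprocessor:
       'i' is the Objective-C type code for int, 'c' for char */
    struct field fields[] = { { "ti", 'i' }, { "tc", 'c' } };
    emit_encode("test", fields, 2);   /* prints "{test=ic}" */
    return 0;
}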
Another approach, if your program is compiled with -g, would be to write a function that reads the type definition from the program's debug information at run-time, or use gdb or another program to read it for you and then reformat it as desired. The gdb ptype command can be used to print the definition of a particular type (or if that is insufficient there is also maint print type, which is sure to print far more information than you could possibly want).
If you are using a compiler that supports plugins (e.g. GCC 4.5), it may also be possible to write a compiler plugin for this. Your plugin could then take advantage of the type information that the compiler has already parsed. Obviously this approach would be very compiler-specific.
