I am thinking of writing a source-to-source compiler and would like to add a keyword to the standard C grammar (say, shared). When a pointer is marked as shared, it is a special one that must not be dereferenced directly. Instead, a function call should be made to copy out the value safely.
If all variables were primitive types, a simple C++ program would do the translation for me. However, we have struct and union, and then we have possibilities like a struct containing shared pointers, a struct containing plain pointers to shared pointers, and so on. It sounds like serious type checking, akin to handling the volatile keyword, so reusing or modifying an existing compiler is probably the better option. But I do not know which compiler is easier to start modifying. Do you have any suggestions? By the way, I want to see the translated C code, not the intermediate code. Would that change the choice? Thank you.
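For example, the translation I have in mind would look something like this (shared_copy_out is a hypothetical runtime call, just to illustrate the intent):

/* before: C extended with the new keyword */
shared int *p;
int v = *p;                         /* direct dereference: should be rejected */

/* after: plain C emitted by the translator */
int *p;
int v;
shared_copy_out(&v, p, sizeof v);   /* the value is copied out through a function call */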
Are you aware of opaque types? Declare a type in a .h file:
typedef struct OpaqueType_s OpaqueType;
And define it in a .c file:
struct OpaqueType_s {
    int value;
};
Then you can dereference pointers to it in the .c file where
you define it, but only pass them to other functions in other files
(a bit like void).
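Applied to your shared pointers, the opaque-type pattern gives you the copy-out-by-function-call discipline for free. A minimal sketch, with hypothetical names:

/* shared.h -- the opaque declaration; code that includes this cannot dereference */
typedef struct SharedInt_s SharedInt;
int shared_read(const SharedInt *p);   /* the only way to get the value out */

/* shared.c -- the single file that knows the layout */
struct SharedInt_s {
    int value;
};

int shared_read(const SharedInt *p)
{
    return p->value;   /* direct dereference is confined to this file */
}

The compiler then enforces the rule for you: any attempt to dereference a SharedInt * outside shared.c is a hard error.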
Our DMS Software Reengineering Toolkit with its C Front End might be appropriate.
DMS provides generic parsing/AST construction, symbol table construction, control and data flow analysis facilities, procedural APIs and surface-syntax rewrite rules for AST modification, and AST-to-compilable text regeneration (including comment regeneration). The front ends specialize DMS to a particular language (e.g., C; DMS supports many others). The front end is driven by a BNF; everything is built on top of that (using attribute grammars, etc.). Dialect management enables a language front end to be specialized for a particular dialect (for C, that currently includes GCC2/3/4, MS Visual C, ANSI C, GreenHills C, C99, etc.).
To carry out OP's task, he would define a dialect for his version of C, modify the symbol table machinery provided with the front end to capture his new property on pointers, and modify the type checking to verify the new pointer types aren't misused according to his definition. With type checking out of the way, he could write source-to-source transformations to convert shared-pointer C to vanilla C to get runnable code.
Alternatives might be to use Clang or GCC, but my understanding is that neither is driven by an explicit grammar, so changes to the parser must be implemented as code. Neither offers source-to-source rewriting. Clang, I think, offers source patching, but I'm not sure you can apply a series of rewrites to a place that has already been patched. GCC, I think, will let you procedurally modify the AST, but it can't regenerate source code.
All of these solutions have some trouble with preprocessing. Clang and GCC, I think, have to expand the preprocessor directives. DMS has to expand unstructured directives.
I've heard of the idea of bootstrapping a language, that is, writing a compiler/interpreter for the language in itself. I was wondering how this could be accomplished and looked around a bit, and saw someone say that it could only be done by either
writing an initial compiler in a different language.
hand-coding an initial compiler in assembly, which seems like a special case of the first.
To me, neither of these seem to actually be bootstrapping a language in the sense that they both require outside support. Is there a way to actually write a compiler in its own language?
Is there a way to actually write a compiler in its own language?
You have to have some existing language to write your new compiler in. If you were writing a new, say, C++ compiler, you would just write it in C++ and compile it with an existing compiler first. On the other hand, if you were creating a compiler for a new language, let's call it Yazzleof, you would need to write the new compiler in another language first. Generally, this would be another programming language, but it doesn't have to be. It can be assembly, or if necessary, machine code.
If you were going to bootstrap a compiler for Yazzleof, you generally wouldn't write a compiler for the full language initially. Instead you would write a compiler for Yazzle-lite, the smallest possible subset of Yazzleof (well, a pretty small subset at least). Then in Yazzle-lite, you would write a compiler for the full language. (Obviously this can occur iteratively instead of in one jump.) Because Yazzle-lite is a proper subset of Yazzleof, you now have a compiler which can compile itself.
There is a really good writeup about bootstrapping a compiler from the lowest possible level (which on a modern machine is basically a hex editor), titled Bootstrapping a simple compiler from nothing. It can be found at https://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html.
The explanation you've read is correct. There's a discussion of this in Compilers: Principles, Techniques, and Tools (the Dragon Book):
Write a compiler C1 for language X in language Y
Use the compiler C1 to write compiler C2 for language X in language X
Now C2 is a fully self hosting environment.
The way I've heard of is to write an extremely limited compiler in another language, then use that to compile a more complicated version, written in the new language. This second version can then be used to compile itself, and the next version. Each time it is compiled the last version is used.
This is the definition of bootstrapping:
the process of a simple system activating a more complicated system that serves the same purpose.
EDIT: The Wikipedia article on compiler bootstrapping covers the concept better than I can.
A super interesting discussion of this is in Unix co-creator Ken Thompson's Turing Award lecture.
He starts off with:
What I am about to describe is one of many "chicken and egg" problems that arise when compilers are written in their own language. In this case, I will use a specific example from the C compiler.
and proceeds to show how he wrote a version of the Unix C compiler that would always allow him to log in without a password, because the C compiler would recognize the login program and add in special code.
The second pattern is aimed at the C compiler. The replacement code is a Stage I self-reproducing program that inserts both Trojan horses into the compiler. This requires a learning phase as in the Stage II example. First we compile the modified source with the normal C compiler to produce a bugged binary. We install this binary as the official C. We can now remove the bugs from the source of the compiler and the new binary will reinsert the bugs whenever it is compiled. Of course, the login command will remain bugged with no trace in source anywhere.
Check out podcast Software Engineering Radio episode 61 (2007-07-06) which discusses GCC compiler internals, as well as the GCC bootstrapping process.
Donald E. Knuth actually built WEB by writing the compiler in it, and then hand-compiled it to assembly or machine code.
As I understand it, the first Lisp interpreter was bootstrapped by hand-compiling the constructor functions and the token reader. The rest of the interpreter was then read in from source.
You can check for yourself by reading the original McCarthy paper, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.
Every example of bootstrapping a language I can think of (C, PyPy) was done after there was a working compiler. You have to start somewhere, and reimplementing a language in itself requires writing a compiler in another language first.
How else would it work? I don't think it's even conceptually possible to do otherwise.
Another alternative is to create a bytecode machine for your language (or use an existing one if its features aren't very unusual) and write a compiler to bytecode, either in the bytecode itself, or in your desired language using another intermediate - such as a parser toolkit which outputs the AST as XML, then compiling the XML to bytecode using XSLT (or another pattern-matching language and tree-based representation). It doesn't remove the dependency on another language, but it could mean that more of the bootstrapping work ends up in the final system.
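To make the bytecode-machine idea concrete, here is a toy stack-based interpreter in C; the instruction set is invented for illustration:

#include <stdio.h>

/* Opcodes for a made-up stack machine. */
enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

static void run(const int *code)
{
    int stack[64];
    int sp = 0, pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:  stack[sp++] = code[pc++];         break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[--sp]);      break;
        case OP_HALT:  return;
        }
    }
}

int main(void)
{
    /* what a compiler might emit for "print 2 + 3" */
    int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    run(program);
    return 0;
}

Once a machine like this exists, the compiler-to-bytecode is the only piece that still needs a host language.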
It's the computer science version of the chicken-and-egg paradox. I can't think of a way not to write the initial compiler in assembler or some other language. If it could have been done, I should think Lisp could have done it.
Actually, I think Lisp almost qualifies. Check out its Wikipedia entry. According to the article, the Lisp eval function was first implemented in machine code on an IBM 704, with a complete compiler (written in Lisp itself) coming into being in 1962 at MIT.
Some bootstrapped compilers or systems keep both the source form and the object form in their repository:
OCaml is a language which has both a bytecode compiler (i.e. a compiler to OCaml bytecode) and a native compiler (to x86-64 or ARM, etc. assembly). Its svn repository contains both the source code (files */*.{ml,mli}) and the bytecode (file boot/ocamlc) form of the compiler. So when you build it, it first uses the bytecode (of a previous version of the compiler) to compile itself. Later, the freshly compiled bytecode is able to compile the native compiler. So the OCaml svn repository contains both *.ml[i] source files and the boot/ocamlc bytecode file.
The Rust compiler downloads (using wget, so you need a working Internet connection) a previous version of its binary to compile itself.
MELT is a Lisp-like language to customize and extend GCC. It is translated to C++ code by a bootstrapped translator. The generated C++ code of the translator is distributed, so the svn repository contains both *.melt source files and melt/generated/*.cc "object" files of the translator.
J. Pitrat's CAIA artificial intelligence system is entirely self-generating. It is available as a collection of thousands of generated [A-Z]*.c files (along with a generated dx.h header file) and a collection of thousands of _[0-9]* data files.
Several Scheme compilers are also bootstrapped: Scheme48, Chicken Scheme, ...
Does C (or any other low-level language, for that matter) even have source, or is the compiler the part that "does all the work", including parsing? If so, couldn't different compilers have different C dialects? Where does the stdlib factor into this? I would really like to know how this works.
The C language is not a piece of software but a defined standard, so one wouldn't say that it's open-source, but rather that it's an open standard.
There are a gazillion different compilers for C however, and many of those are indeed open-source. The most notable example is GCC's C compiler, which is all under the GNU General Public License (GPL), an open-source license.
There are more options. Watcom is open-source, for instance. There is no shortage of open-source C compilers, but without a doubt the most widespread one, at least in the non-Windows world, is GCC.
For Windows, your best bet is probably Watcom, or GCC via Cygwin or MinGW.
C is a standard which specifies how C compilers should generate programs.
C itself doesn't have any source code, just like a musical note doesn't have any plastic.
Some C compilers, such as GCC, are open source.
C is just a language, and a standardised one at that. It pretty much is the compiler that "does all the work". Different compilers did have different dialects; before the C99 standard, you had things like Borland C and other competing compilers that implemented the C language in their own fantastic ways.
stdlib is just an agreed-upon collection of standard libraries that are required to be present in any ANSI C implementation.
To add on to the other great answers:
Regarding different dialects - there are some additional features added to C that are compiler-specific. You can provide the command-line flag -std=... to gcc to specify the C standard that you want to use; each has slight variations/additions to the syntax, and the most common is probably c99.
Each compiler tends to implement a few different extras; for example, typeof() is not in the C standard, so compilers do not have to implement it, but it is nevertheless useful and most compilers provide it. Here is a list of gcc C extensions.
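As a small illustration of such an extension, here is the classic use of typeof() - a type-generic swap macro. This is gcc-specific, not ISO C:

#include <stdio.h>

/* typeof() is a GNU extension: it lets the macro declare a temporary
   of the same type as its argument, so SWAP works for any assignable type. */
#define SWAP(a, b) do { typeof(a) tmp = (a); (a) = (b); (b) = tmp; } while (0)

int main(void)
{
    int x = 1, y = 2;
    SWAP(x, y);
    printf("%d %d\n", x, y);   /* prints "2 1" */
    return 0;
}

Strict ISO modes like -std=c99 reject the plain typeof spelling (C23 finally adopted it); gcc's __typeof__ spelling works in all modes.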
The stdlib is a set of functions specified in the C standard. Much like compilers, stdlib can have different implementations. The GNU implementation is open source, as is gcc, but there are other compilers, and there could be other implementations of stdlib that are closed source.
The compiler determines all the mappings from C to assembly etc., but as far as someone owning it goes, no one really owns C; rather, ANSI/ISO determines the standards.
GCC's C compiler is written in C. So we know there is at least one C compiler written in C.
GNU's stdlib (glibc) is also written in C (stdio.h, stdlib.h). But it also has some parts written in assembly language.
A really good question. There is a way to define a language standard (not the implementation!) in the form of "source code", in a strict and unambiguous language. Unfortunately, all of the old languages, including C, are poorly defined. But it is still possible to translate such definitions into a source code form.
Another approach is to define a language via its operational semantics, often in the form of a simple (and inefficient) reference implementation.
Helgi Hrafn Gunnarsson has written the main answer but I thought it would be worth noting that you can effectively end up with dialects too.
The compilers should do the same thing with regard to whichever standard they support (which these days should be pretty much all the same version), but there are grey areas - the way compilers handle 'undefined' behaviour, for example. If the C specification says that the behaviour is undefined for a specific case, then the compiler can do pretty much what it wants.
There are also examples of functions added to the libraries (and new libraries added) by the compiler makers to support specific platform traits, create a competitive advantage or simply to make life easier. The cynical might suggest that some of these are added to help lock people into a specific compiler too.
I would say that C as a language is not open source.
As pointed out by many, you can download GNU-licensed compilers and libraries for free, but if you wanted to write your own C compiler, you would need to follow the ISO C standards, and ISO charges hard cash for the specification of the C language, which at the time of posting is $178.
So really the answer depends on what elements you are interested in being free and open source.
I'm not sure what your definitions of "open source" are.
For the standardization process, it is possible for anyone to participate, but if you want to be able to vote then you will need to pay to join your national body (for instance, ANSI for the USA, BSI for the UK, AFNOR for France etc.). As a rule most standards body memberships are paid by corporations. That said, the process is fairly open. You can access discussion papers on the standards web site.
The standards themselves are not free either: the ISO PDF store currently sells the C standard for 198 Swiss francs. Draft copies of the standard can be found easily for free.
There are plenty of open source implementations of both compilers and libraries.
The #encode directive returns a const char * which is a coded type descriptor of the various elements of the datatype that was passed in. An example follows:
struct test
{
    int ti;
    char tc;
};

printf("%s", #encode(struct test));
// returns "{test=ic}"
I could see using sizeof() to determine primitive types - and if it were a full object, I could use the class methods to do introspection.
However, how does it determine each element of an opaque struct?
@Lothar's answer might be "cynical", but it's pretty close to the mark, unfortunately. In order to implement something like #encode(), you need a full-blown parser in order to extract the type information - well, at least for anything other than "trivial" #encode() statements (i.e., #encode(char *)). Modern compilers generally have either two or three main components:
The front end.
The intermediate end (for some compilers).
The back end.
The front end must parse all the source code and basically converts the source code text into an internal, "machine useable" form.
The back end translates the internal, "machine useable" form into executable code.
Compilers that have an "intermediate end" typically do so because of some need: they support multiple "front ends", possibly made up of completely different languages. Another reason is to simplify optimization: all the optimization passes work on the same intermediate representation. The gcc compiler suite is an example of a "three stage" compiler. llvm could be considered an "intermediate and back end" stage compiler: the "low level virtual machine" is the intermediate representation, and all the optimization takes place in this form. llvm is also able to keep things in this intermediate representation right up until the last second; this allows for "link time optimization". The clang compiler is really a "front end" that (effectively) outputs llvm intermediate representation.
So, if you want to add #encode() functionality to an 'existing' compiler, you'd probably have to do it as a "source to source" 'compiler / preprocessor'. This was how the original Objective-C and C++ compilers were written - they parsed the input source text and converted it to "plain C", which was then fed into the standard C compiler. There are a few ways to do this:
Roll your own
Use yacc and lex to put together an ANSI C parser. You'll need a grammar - ANSI C grammar (Yacc) is a good start. Actually, to be clear, when I say yacc, I really mean bison and flex, and also, loosely, the various other yacc- and lex-like C-based tools: lemon, dparser, etc.
Use perl with Yapp or EYapp, which are pseudo-yacc clones in perl. Probably better for quickly prototyping an idea than C-based yacc and lex - it's perl, after all: regular expressions, associative arrays, no memory management, etc.
Build your parser with Antlr. I don't have any experience with this tool chain, but it's another "compiler compiler" tool that (seems) to be geared more towards Java developers. There appear to be freely available C and Objective-C grammars.
Hack another tool
Note: I have no personal experience using any of these tools to do anything like adding #encode(), but I suspect they would be a big help.
CIL - No personal experience with this tool, but designed for parsing C source code and then "doing stuff" with it. From what I can glean from the docs, this tool should allow you to extract the type information you'd need.
Sparse - Worth looking at, but not sure.
clang - Haven't used it for this purpose, but allegedly one of the goals was to make it "easily hackable" for just this sort of stuff. Particularly (and again, no personal experience) in doing the "heavy lifting" of all the parsing, letting you concentrate on the "interesting" part, which in this case would be extracting context and syntax sensitive type information, and then convert that in to a plain C string.
gcc Plugins - Plugins are a gcc 4.5 (which is the current alpha/beta version of the compiler) feature and "might" allow you to easily hook into the compiler to extract the type information you'd need. No idea if the plugin architecture allows for this kind of thing.
Others
Coccinelle - Bookmarked this recently to "look at later". This "might" be able to do what you want, and "might" be able to do it without much effort.
MetaC - Bookmarked this one recently too. No idea how useful this would be.
mygcc - "Might" do what you want. It's an interesting idea, but it's not directly applicable to what you want. From the web page: "Mygcc allows programmers to add their own checks that take into account syntax, control flow, and data flow information."
Links.
CocoaDev Objective-C Parsing - Worth looking at. Has some links to lexers and grammars.
Edit #1, the bonus links.
@Lothar makes a good point in his comment. I had actually intended to include lcc, but it looks like it got lost along the way.
lcc - The lcc C compiler. This is a C compiler that is particularly small, at least in terms of source code size. It also has a book, which I highly recommend.
tcc - The tcc C compiler. Not quite as pedagogical as lcc, but definitely still worth looking at.
poc - The poc Objective-C compiler. This is a "source to source" Objective-C compiler. It parses the Objective-C source code and emits C source code, which it then passes to gcc (well, usually gcc). Has a number of Objective-C extensions / features that aren't available in gcc. Definitely worth looking at.
You would implement this by implementing the ANSI C compiler first and then adding some implementation-specific pragmas and functions to it.
Yes, I know this is a cynical answer, and I accept the downvotes.
One way to do it would be to write a preprocessor which reads the source code for the type definitions and also replaces #encode... with the corresponding string literal.
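For illustration, using the struct test from the question, such a pass would rewrite

printf("%s", #encode(struct test));

into

printf("%s", "{test=ic}");

with the descriptor string computed at preprocessing time from the parsed struct definition.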
Another approach, if your program is compiled with -g, would be to write a function that reads the type definition from the program's debug information at run-time, or use gdb or another program to read it for you and then reformat it as desired. The gdb ptype command can be used to print the definition of a particular type (or if that is insufficient there is also maint print type, which is sure to print far more information than you could possibly want).
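A quick sketch of what that looks like in a gdb session, assuming the struct test from the question was compiled with -g:

$ gdb ./a.out
(gdb) ptype struct test
type = struct test {
    int ti;
    char tc;
}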
If you are using a compiler that supports plugins (e.g. GCC 4.5), it may also be possible to write a compiler plugin for this. Your plugin could then take advantage of the type information that the compiler has already parsed. Obviously this approach would be very compiler-specific.
I know it is not supported, but I am wondering if there are any tricks around it. Any tips?
Reflection in general is a means for a program to analyze the structure of some code.
This analysis is used to change the effective behavior of the code.
Reflection as analysis is generally very weak; usually it can only provide access to function and field names. This weakness comes from the language implementers essentially not wanting to make the full source code available at runtime, along with the appropriate analysis routines to extract what one wants from the source code.
Another approach is to tackle program analysis head-on, by using a strong program analysis tool, e.g., one that can parse the source text exactly the way the compiler does.
(Often people propose to abuse the compiler itself to do this, but that usually doesn't work; the compiler machinery wants to be a compiler and it is darn hard to bend it to other purposes).
What is needed is a tool that:
Parses language source text
Builds abstract syntax trees representing every detail of the program. (It is helpful if the ASTs retain comments and other details of the source code layout, such as column numbers, literal radix values, etc.)
Builds symbol tables showing the scope and meaning of every identifier
Can extract control flows from functions
Can extract data flow from the code
Can construct a call graph for the system
Can determine what each pointer points to
Enables the construction of custom analyzers using the above facts
Can transform the code according to such custom analyses (usually by revising the ASTs that represent the parsed code)
Can regenerate source text (including layout and comments) from the revised ASTs.
Using such machinery, one implements analysis at whatever level of detail is needed, and then transforms the code to achieve the effect that runtime reflection would accomplish.
There are several major benefits:
The detail level or amount of analysis is a matter of ambition (i.e., it isn't limited to what runtime reflection can do)
There isn't any runtime overhead to achieve the reflected change in behavior
The machinery involved can be general and applied across many languages, rather than being limited to what a specific language implementation provides.
This is compatible with the C/C++ idea that you don't pay for what you don't use. If you don't need reflection, you don't need this machinery, and your language doesn't need to have the intellectual baggage of weak reflection built in.
See our DMS Software Reengineering Toolkit for a system that can do all of the above for C, Java, and COBOL, and most of it for C++.
[EDIT August 2017: Now handles C11 and C++2017]
Tips and tricks always exist. Take a look at the Metaresc library: https://github.com/alexanderchuranov/Metaresc
It provides an interface for type declarations that will also generate meta-data for the type. Based on the meta-data you can easily serialize/deserialize objects of any complexity. Out of the box you can serialize/deserialize XML, JSON, XDR, Lisp-like notation, and C-init notation.
Here is a simple example:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "metaresc.h"

TYPEDEF_STRUCT (point_t,
                double x,
                double y
                );

int main (int argc, char * argv[])
{
  point_t point = {
    .x = M_PI,
    .y = M_E,
  };
  char * str = MR_SAVE_XML (point_t, &point);
  if (str)
    {
      printf ("%s\n", str);
      free (str);
    }
  return (EXIT_SUCCESS);
}
This program will output
$ ./point
<?xml version="1.0"?>
<point>
<x>3.1415926535897931</x>
<y>2.7182818284590451</y>
</point>
The library works fine with recent gcc and clang on Linux, macOS, FreeBSD and Windows.
A custom macro language is one of the options; alternatively, the user could write declarations as usual and generate type descriptors from DWARF debug info. This moves complexity to the build process, but makes adoption much easier.
any tricks around it? Any tips?
The compiler can optionally generate a 'debug symbol file', which a debugger can use to help debug the code. The linker may also generate a 'map file'.
A trick/tip might be to generate and then read these files.
Based on the responses to How can I add reflection to a C++ application? (Stack Overflow) and the fact that C++ is considered a "superset" of C, I would say you're out of luck.
There's also a nice long answer about why C++ doesn't have reflection (Stack Overflow).
I needed reflection in a bunch of structs in a C++ project.
I created an XML file with the description of all those structs - fortunately the field types were primitive types.
I used a template (not a C++ template) to auto-generate a class for each struct, along with setter/getter methods.
In each class I used a map to associate string names and class members (pointers to members).
I didn't regret using reflection because it opened new ways to design my core functionality that I couldn't even imagine without reflection.
(BTW, it was an external report generator for a program that uses a raw database)
So, I used code generation, function pointers and maps to simulate reflection.
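In plain C, the same trick works with offsetof and a generated field table. A minimal sketch (the struct and helper names are invented for illustration):

#include <stdio.h>
#include <stddef.h>
#include <string.h>

struct point { int x; int y; };

/* The kind of table a code generator would emit, one per struct:
   field name, byte offset, field size. */
struct field_info { const char *name; size_t offset; size_t size; };

static const struct field_info point_fields[] = {
    { "x", offsetof(struct point, x), sizeof(int) },
    { "y", offsetof(struct point, y), sizeof(int) },
};

/* Poor man's reflection: look a field up by name and copy its value out. */
static int get_int_field(const struct point *p, const char *name, int *out)
{
    size_t i;
    for (i = 0; i < sizeof point_fields / sizeof point_fields[0]; i++) {
        if (strcmp(point_fields[i].name, name) == 0) {
            memcpy(out, (const char *)p + point_fields[i].offset, sizeof *out);
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    struct point p = { 3, 4 };
    int v;
    if (get_int_field(&p, "y", &v))
        printf("y = %d\n", v);   /* prints "y = 4" */
    return 0;
}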
I know of the following options, but all come at cost and a lot of limitations:
Use libdl (#include <dlfcn.h>)
Call a tool like objdump or nm
Parse the object files yourself (using a corresponding library)
Involve a parser and generate the necessary information at compile time.
"Abuse" the linker to generate symbol arrays.
I'll use unit test frameworks as examples further down, because automatic test discovery is a typical case where reflection comes in very handy, and it's something that most unit test frameworks for C fall short of.
Using libdl (#include <dlfcn.h>) (POSIX)
If you're on a POSIX environment, a little bit of reflection can be done using libdl. Plugins are developed that way.
Use
#include <dlfcn.h>
in your source code and link with -ldl.
Then you have access to functions dlopen(), dlerror(), dlsym() and dlclose() with which you could load and access / run shared objects at runtime. However, it does not give you easy access to the symbol table.
Another disadvantage of this approach is that you basically restrict reflection to objects loaded as dynamic library (shared object loaded at runtime via dlopen()).
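A minimal sketch of that API (the library name assumes a glibc-style system; build with -ldl):

#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    /* Load a shared object and look a symbol up by its string name. */
    void *handle = dlopen("libm.so.6", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "%s\n", dlerror());
        return 1;
    }

    double (*cosine)(double) = (double (*)(double)) dlsym(handle, "cos");
    if (cosine)
        printf("cos(0) = %f\n", cosine(0.0));

    dlclose(handle);
    return 0;
}

Note that you must already know the name "cos": dlsym answers "give me this symbol", not "what symbols are there?", which is exactly the missing-symbol-table limitation mentioned above.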
Running nm or objdump
You could run nm or objdump to show the symbol table and parse the output.
For me, nm -P --defined-only -g xyz.o gives good results, and parsing the output is trivial.
You'd be interested in the first word of each line only, which is the symbol name, and maybe the second one, which is the section type.
If you do not know the object name in some static way, i.e. the object is actually a shared object, at least on Linux you then might want to skip symbol names starting with '_'.
objdump, nm or similar tools are also often available outside POSIX environments.
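A sketch of that parsing approach, shelling out to nm via popen (xyz.o is a placeholder object file):

#include <stdio.h>

int main(void)
{
    /* Run nm in POSIX output mode and read the defined global symbols. */
    FILE *p = popen("nm -P --defined-only -g xyz.o", "r");
    if (!p)
        return 1;

    char line[512], name[256], type[8];
    while (fgets(line, sizeof line, p))
        if (sscanf(line, "%255s %7s", name, type) == 2)
            printf("symbol %s, section type %s\n", name, type);

    pclose(p);
    return 0;
}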
Parsing the object files yourself
You could parse the object files yourself. You probably don't want to implement that from scratch but use an existing library for that. This is how nm, objdump and even libdl are implemented. You could peek at the source code of nm, objdump and libdl and the libraries they use in order to find out how they do what they do.
Involving a Parser
You could write a parser and code generator which generates the necessary reflective information at compile time and stores it in the object file. Then you have a lot of freedom and could even implement primitive forms of annotations. That's what some unit test frameworks like AceUnit do.
I found that writing a parser which covers straightforward C syntax is fairly trivial. Writing a parser which really understands C and can deal with all cases is NOT trivial.
So, this has limitations which depend on how exotic the C syntax is that you want to reflect upon.
"Abusing" the linker to generate symbol arrays
You could put references to symbols which you want to reflect upon in a special section and use a linker configuration to emit the section boundaries so you can access them in C.
I've described how this works in N-Dependency injection in C - better way than linker-defined arrays?
But beware, this depends on a lot of things and is not very portable. I have only tried this with GCC/ld, and I know it doesn't work with all compilers / linkers. Also, it's almost guaranteed that dead code elimination will not detect how you call this stuff, so if you use dead code elimination, you will have to add all the reflected symbols as entry points.
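A sketch of the technique with GCC and GNU ld, which synthesize __start_<section> and __stop_<section> symbols for any user-defined section whose name is a valid C identifier (not portable beyond that toolchain):

#include <stdio.h>

typedef void (*test_fn)(void);

/* Drop a pointer to each test into the custom section "tests".
   The "used" attribute keeps dead code elimination from discarding it. */
#define REGISTER_TEST(fn) \
    static const test_fn fn##_entry \
        __attribute__((section("tests"), used)) = fn

static void test_foo(void) { puts("foo ran"); }
REGISTER_TEST(test_foo);

static void test_bar(void) { puts("bar ran"); }
REGISTER_TEST(test_bar);

/* GNU ld provides these boundary symbols automatically. */
extern const test_fn __start_tests[], __stop_tests[];

int main(void)
{
    /* Test discovery without naming any test explicitly. */
    for (const test_fn *t = __start_tests; t < __stop_tests; t++)
        (*t)();
    return 0;
}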
Pitfalls
For some of the mechanisms, dead code elimination can be a problem, in particular when you "abuse" the linker to generate symbol arrays. It can be worked around by declaring the reflected symbols as entry points to the linker, and depending on the number of symbols this might be neither nice nor convenient.
Conclusion
Combining nm and libdl can actually give quite good results. The combination can be almost as powerful as the level of Reflection used by JUnit 3.x in Java. The level of reflection given is sufficient to implement a JUnit 3.x-style unit test framework for C, including test-case discovery by naming convention.
Involving a parser is more work and limited to objects that you compile yourself, but gives you most power and freedom. The level of reflection given can be sufficient to implement a JUnit 4.x-style unit test framework for C, including test-case discovery by annotations. AceUnit is a unit test framework for C that does exactly this.
Combining parsing and the linker to generate symbol arrays can give very nice results - if your environment is so much under your control that you can ensure that working with the linker that way works for you.
And of course you can combine all approaches to stitch together the bits and pieces until they fit your needs.
You would need to implement it yourself from the ground up. In straight C, there is no runtime information whatsoever kept on structures and composite types. Metadata simply does not exist in the standard.
Implementing reflection for C would be much simpler... because C is a simple language.
There are some basic options for analyzing a program, like detecting whether a function exists by calling dlopen/dlsym - it depends on your needs.
There are tools for creating code that can modify/extend itself using tcc.
You can use the above tools to create your own code analyzers.
For similar reasons to the author of the question, I have been working on a C-type-reflection-API along with a C reflection graph database format and a clang plug-in that writes reflection metadata.
The intent is to use the C reflection API for writing serialization and deserialization routines, such as mappers for ASN.1, function argument printers, function proxies, fuzzers, etc. Clang and GCC both have plugin APIs that allow access to the AST but there currently is no standard graph format for C reflection metadata.
The proposed C reflection API is called Crefl:
https://github.com/michaeljclark/crefl
The Crefl API provides runtime access to reflection metadata for C structure declarations with support for arbitrarily nested combinations of: intrinsic, set, enum, struct, union, field (member), array, constant, variable.
The Crefl reflection graph database format for portable reflection metadata.
The Crefl clang plug-in outputs C reflection metadata used by the library.
The Crefl API provides task-oriented query access to C reflection metadata
The Crefl C reflection data model is essentially a transcription of the C data types in ISO/IEC 9899:9999.
C intrinsic data types.
integer types.
floating-point types.
complex number types.
boolean type.
nested struct, union, field, and bitfield
arrays and pointers
typedef type aliases
enum and enum constants
functions and function parameters
const, volatile and restrict qualifiers
GNU-C style attributes using (__attribute__).
The library is still a work in progress. The hope is to find others who are interested in reflection support in C.
Parsers and debug symbols are great ideas. However, the gotcha is that C does not really keep array information around - mostly you just have pointers to stuff.
For example, there is no way, by reading the source code, to know whether a char * points to a single character, a string, or a fixed array of bytes whose length is kept in some "nearby" field. This is a problem for human readers, let alone any automated tool.
Why not use a modern language, like Java or .NET? They can be faster than C as well.