Source-to-source manipulations in C

I need to do some source-to-source manipulations in the Linux kernel. I tried to use Clang for this purpose, but there is a problem. Clang preprocesses the source code, i.e. it performs macro and include expansion. This sometimes causes Clang to produce C code that is broken from the Linux kernel's point of view. I can't maintain all the changes manually, since I expect thousands of changes per file.
I tried ANTLR, but the publicly available grammars are incomplete and not suitable for projects such as the Linux kernel.
So my question is the following. Is there any way to perform source-to-source manipulations on C code without preprocessing it?
So assume the following code:
#define AAA 1
void f1(int a) {
    if (a == AAA)
        printf("hello");
}
After applying the source-to-source manipulation I want to get this:
#define AAA 1
void f1(int a) {
    if (functionCall(a == AAA))
        printf("hello");
}
But Clang, for instance, produces the following code, which does not fit my requirements, i.e. it expands the macro AAA:
#define AAA 1
void f1(int a) {
    if (functionCall(a == 1))
        printf("hello");
}
I hope I was clear enough.
Edit
The above code is only an example. The source-to-source manipulations I want to do are not restricted to if() statement substitution; they also include inserting a unary operator in front of an expression, replacing an arithmetic expression with its positive or negative value, etc.
Solution
There is one solution I found for myself. I use gcc to produce the preprocessed source code and then apply Clang. That way I don't have any issues with macro expansion and includes, since that job is done by gcc. Thanks for the answers!
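A minimal sketch of that pipeline (the file name is illustrative, a real kernel file needs its usual -I/-D flags, and my-clang-tool stands for whatever Clang-based rewriter you have built):

gcc -E file.c -o file.i    # gcc expands all includes and macros
my-clang-tool file.i       # the Clang-based rewriter then runs on preprocessed source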

You may consider http://coccinelle.lip6.fr/ : it provides a nice semantic patching framework.

An idea would be to replace all occurrences of
if(a == AAA)
with
if(functionCall(a == AAA))
You can do this easily using, e.g., the sed tool.
If you have a finite collection of patterns to be replaced you can write a sed script to perform the substitution.
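For the literal pattern in the question, a purely textual (and therefore fragile) GNU sed one-liner might look like this:

sed -i 's/if(a == AAA)/if(functionCall(a == AAA))/g' file.c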
Would this solve your problem?

Handling the preprocessor is one of the most difficult problems in applying transformations to C (and C++) code.
Our DMS Software Reengineering Toolkit with its C Front End comes relatively close to doing this. DMS can parse C source code, preserving most preprocessor conditionals, macro definitions and uses.
It does so by allowing preprocessor actions in "well-structured" places. Examples: #defines are allowed where declarations or statements can occur, macro calls and conditionals as replacements for many of the nonterminals in the language (e.g., function head, expression, statement, declarations), and in many non-structured places that people commonly place them (e.g., #if foo ... if (...) { ... #endif). It parses the source code and preprocessor directives as if they were part of one language (they ARE; it's called "C"), and builds corresponding ASTs, which can be transformed and will regenerate correctly with the captured preprocessor directives. [This level of capability handles the OP's example perfectly.]
Some directives are poorly placed (both in the syntax sense, e.g., across multiple fragments of the language, and in the "you've got to be kidding" understandability sense). These DMS handles by expanding them away, with some guidance from the engineer given in advance ("always expand this macro"). A less satisfactory approach is to hand-convert the unstructured preprocessor conditionals/macro calls into structured ones; this is a bit painful, but more workable than one might expect, since the bad cases occur considerably less frequently than the good ones.
To do better than this, one needs to have symbol tables and flow analysis that take into account the preprocessor conditions, and capture all the preprocessor conditionals. We've done some experimental work with DMS to capture conditional declarations in the symbol table (seems to work fine), and we're just starting work on a scheme for the latter.
Not easy being green.

Clang maintains extremely accurate information about the original source code.
Most notably, the SourceManager is able to tell whether a given token has been expanded from a macro or written as-is, and Chandler Carruth recently implemented macro diagnostics which are able to display the actual macro expansion stack (at the various stages of expansion), tracing back to the code as actually written (3.0).
Therefore, it is possible to use the generated AST and then rewrite the source code with all its macros still in place. You would have to query virtually every node to know whether it comes from a macro expansion or not, and if it does, retrieve the original code of the expansion, but it still seems possible.
There is a rewriter module in Clang
You can dig up Chandler's code on the macro diagnostics stack
So I guess you should have all you need :) (And hope so because I won't be able to help much more :p)

I would advise resorting to the ROSE framework. Source is available on GitHub.

Related

Preprocessor-like substitution into a parser

I am currently making a parser which aims to be able to read data into a program.
The syntax used is greatly inspired by C.
I would like to reproduce a kind of preprocessor inline substitution in it.
For example:
#define HELLO ((variable1 + variable2 + variable3))
int variable1 = 37;
int variable2 = 82;
int variable3 = 928;
Thing is... I'm actually using C. I'm also using standard functions from stdio.h to parse through my files.
So... what techniques could I use to make this work correctly and efficiently?
Do standard compilers substitute the text by re-copying the stream buffer and making the substitution there as the re-copying occurs, or are there more efficient techniques?
I guess we say "preprocessor" because it first substitutes everything until there are no preprocessor directives left (a recursive approach, maybe?), and only then starts doing the real compile job?
Excuse my lack of knowledge!
Thanks!
No, modern C compilers don't implement the preprocessor as a text processor; rather, the different compiler phases (preprocessing being one of them) are intertwined. This is particularly important for the efficiency of the compiler itself, and for being able to trace errors back to the original source code.
Also implementing a preprocessor by yourself is a tedious task. Think twice before you start such a project.
Yes, you are right about preprocessors. The preprocessor has the job of bringing together all the files required to build the program (e.g. stdio.h) into one file, which then allows the compiler to compile the program. The file you want to compile is given as an argument to the compiler, and the techniques used by the compiler may vary according to the OS and the compiler itself.
The C preprocessor works on tokens not text. In particular, macro expansion cannot contain preprocessor directives. Other preprocessors, such as m4, work differently.
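A small illustration of the consequence: because expansion produces tokens, a macro cannot generate a new directive. The snippet below is deliberately ill-formed C:

#define EMIT_DEFINE #define VALUE 1
EMIT_DEFINE  /* expands to the tokens '#' 'define' 'VALUE' '1'; they are never
                re-read as a directive, so the compiler reports a syntax error */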

Lines of Code as a function of preprocessor definitions

A project I'm working on (in C) has a lot of sections of code that can be included or omitted based on compile-time configuration, using preprocessor directives.
I'm interested in estimating how many lines of code different configurations are adding to, or subtracting from, my core project. In other words, I'd like to write a few #define and #undef lines somewhere, and get a sense of what that does to the LOC count.
I'm not familiar with LOC counters, but from a cursory search, it doesn't seem like most of the easily-available tools do that. I'm assuming this isn't a difficult problem, but just a rather uncommon metric to measure.
Is there an existing tool that would do what I'm looking for, or some easy way to do it myself? Excluding comments and blank lines would be a major nice-to-have, too.
Run it through a preprocessor. For example, under gcc, use the option -E, I believe, to get just the kind of output you seem to want.
-E  Stop after the preprocessing stage; do not run the compiler proper. The output is in the form of preprocessed source code, which is sent to the standard output.
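For instance, a rough way to compare non-blank line counts under two configurations (CONFIG_FEATURE is an illustrative macro; -P suppresses linemarkers, and -E already strips comments):

gcc -E -P -DCONFIG_FEATURE file.c | grep -cv '^[[:space:]]*$'   # feature on
gcc -E -P -UCONFIG_FEATURE file.c | grep -cv '^[[:space:]]*$'   # feature off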
You could get the preprocessor output from your compiler, but this might have other unwanted side effects, like expanding complex multi-line macros, and adding to the LOC count in ways you didn't expect.
Why not write your own simple pre-processor, and use your own include/exclude directives? You can make them trivially simple to parse, and then pipe your code through this pre-processor before sending it to a full-featured LOC counter like CLOC.

How does a macro-enabled language keep track of the source code for debugging?

This is a more theoretical question about macros (I think). I know macros take source code and produce object code without evaluating it, enabling programmers to create more versatile syntactic structures. If I had to classify these two macro systems, I'd say there was the "C style" macro and the "Lisp style" macro.
It seems that debugging macros can be a bit tricky because at runtime, the code that is actually running differs from the source.
How does the debugger keep track of the execution of the program in terms of the preprocessed source code? Is there a special "debug mode" that must be set to capture extra data about the macro?
In C, I can understand that you'd set a compile time switch for debugging, but how would an interpreted language, such as some forms of Lisp, do it?
Apologies for not trying this out, but the Lisp toolchain requires more time than I have to spend to figure it out.
I don't think there's a fundamental difference in "C style" and "Lisp style" macros in how they're compiled. Both transform the source before the compiler-proper sees it. The big difference is that C's macros use the C preprocessor (a weaker secondary language that's mostly for simple string substitution), while Lisp's macros are written in Lisp itself (and hence can do anything at all).
(As an aside: I haven't seen a non-compiled Lisp in a while ... certainly not since the turn of the century. But if anything, being interpreted would seem to make the macro debugging problem easier, not harder, since you have more information around.)
I agree with Michael: I haven't seen a debugger for C that handles macros at all. Code that uses macros gets transformed before anything happens. The "debug" mode for compiling C code generally just means it stores functions, types, variables, filenames, and such -- I don't think any of them store information about macros.
For debugging programs that use macros, Lisp is pretty much the same as C here: your debugger sees the compiled code, not the macro application. Typically macros are kept simple, and debugged independently before use, to avoid the need for this, just like C.
For debugging the macros themselves, before you go and use them somewhere, Lisp does have features that make this easier than in C, e.g., the REPL and macroexpand-1 (though in C there is obviously a way to macroexpand an entire file, fully, at once). You can see the before-and-after of a macroexpansion, right in your editor, when you write it.
I can't remember any time I ran across a situation where debugging into a macro definition itself would have been useful. Either it's a bug in the macro definition, in which case macroexpand-1 isolates the problem immediately, or it's a bug below that, in which case the normal debugging facilities work fine and I don't care that a macroexpansion occurred between two frames of my call stack.
In LispWorks, developers can use the Stepper tool: it provides a stepper where one can step through the full macro expansion process.
You should really look into the kind of support that Racket has for debugging code with macros. This support has two aspects, as Ken mentions. On one hand there is the issue of debugging macros: in Common Lisp the best way to do that is to just expand macro forms manually. With CPP the situation is similar but more primitive -- you'd run the code through only the CPP expansion and inspect the result. However, both of these are insufficient for more involved macros, and this was the motivation for having a macro debugger in Racket -- it shows you the syntax expansion steps one by one, with additional gui-based indications for things like bound identifiers etc.
On the side of using macros, Racket has always been more advanced than other Scheme and Lisp implementations. The idea is that each expression (as a syntactic object) is the code plus additional data that contains its source location. This way, when a form is a macro, the expanded code that has parts coming from the macro will have the correct source location -- from the definition of the macro rather than from its use (where the forms are not really present). Some Scheme and Lisp implementations implement a limited form of this using the identity of subforms, as dmitry-vk mentioned.
I don't know about Lisp macros (which I suspect are probably quite different from C macros) or debugging, but many - probably most - C/C++ debuggers do not handle source-level debugging of C preprocessor macros particularly well.
Generally, C/C++ debuggers don't 'step' into a macro definition. If a macro expands into multiple statements, the debugger will usually just stay on the same source line (where the macro is invoked) for each debugger 'step' operation.
This can make debugging macros a little more painful than they might otherwise be - yet another reason to avoid them in C/C++. If a macro is misbehaving in a truly mysterious way, I'll drop into assembly mode to debug it or expand the macro (either manually or using the compiler's switch). It's pretty rare that you have to go to that extreme; if you're writing macros that are that complicated, you're probably taking the wrong approach.
Usually in C, source-level debugging has line granularity ("next" command) or instruction-level granularity ("step into"). Macro processors insert special directives into the processed source that allow the compiler to map compiled sequences of CPU instructions to source code lines.
In Lisp there exists no convention between macros and the compiler for tracking source code to compiled code mappings, so it is not always possible to do single-stepping in source code.
The obvious option is to do single stepping in the macroexpanded code. The compiler already sees the final, expanded version of the code and can track the source-code-to-machine-code mapping.
The other option is to use the fact that Lisp expressions have identity during manipulation. If the macro is simple and just does destructuring and pastes code into a template, then some expressions of the expanded code will be identical (with respect to EQ comparison) to expressions that were read from the source code. In this case the compiler can map some expressions of the expanded code back to the source code.
The simple answer is that it is complicated ;-) There are several different things that contribute to being able to debug a program, and even more for tracking macros.
In C and C++, the preprocessor is used to expand macros and includes into actual source code. The originating filenames and line numbers are tracked in this expanded source file using #line directives.
http://msdn.microsoft.com/en-us/library/b5w2czay(VS.80).aspx
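A minimal example of the mechanism (the file name and line number are arbitrary):

#include <stdio.h>

#line 500 "generated_from.c"
int main(void) {
    /* __FILE__ and __LINE__ now follow the #line directive above */
    printf("%s:%d\n", __FILE__, __LINE__);  /* prints generated_from.c:502 */
    return 0;
}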
When a C or C++ program is compiled with debugging enabled, the assembler generates additional information in the object file that tracks source lines, symbol names, type descriptors, etc.
http://sources.redhat.com/gdb/onlinedocs/stabs.html
The operating system has features that make it possible for a debugger to attach to a process and control the process execution; pausing, single stepping, etc.
When a debugger is attached to the program, it translates the process stack and program counter back into symbolic form by looking up the meaning of program addresses in the debugging information.
Dynamic languages typically execute in a virtual machine, whether it is an interpreter or a bytecode VM. It is the VM that provides hooks to allow a debugger to control program flow and inspect program state.

Large C macros. What's the benefit?

I've been working with a large codebase written primarily by programmers who no longer work at the company. One of the programmers apparently had a special place in his heart for very long macros. The only benefit I can see to using macros is being able to write functions that don't need to have all their parameters passed in (which is recommended against in a best-practices guide I've read). Other than that I see no benefit over an inline function.
Some of the macros are so complicated I have a hard time imagining someone even writing them. I tried creating one in that spirit and it was a nightmare. Debugging is extremely difficult, as it collapses N+ lines of code into 1 in the debugger (e.g. there was a segfault somewhere in this large block of code. Good luck!). I had to actually pull the macro out and run it un-macro-ized to debug it. The only way I can see the person having written these is by automatically generating them from code written in a function after he had debugged it (or by being smarter than me and writing it perfectly the first time, which is always possible, I guess).
Am I missing something? Am I crazy? Are there debugging tricks I'm not aware of? Please fill me in. I would really like to hear from the macro-lovers in the audience. :)
To me the best use of macros is to compress code and reduce errors. The downside is obviously in debugging, so they have to be used with care.
I tend to think that if the resulting code isn't an order of magnitude smaller and less prone to errors (meaning the macros take care of some bookkeeping details) then it wasn't worth it.
In C++, many uses like this can be replaced with templates, but not all. A simple example of useful macros is the event-handler macros in MFC -- without them, creating event tables would be much harder to get right, and the code you'd have to write (and read) would be much more complex.
If the macros are extremely long, they probably make the code short but efficient. In effect, he might have used macros to explicitly inline code or remove decision points from the run-time code path.
It might be important to understand that, in the past, such optimizations weren't done by many compilers, and some things that we take for granted today, like fast function calls, weren't valid then.
To me, macros are evil. With their many side effects, and the fact that in C++ you can get the same performance gains with inline, they are not worth the risk.
For example, see this short macro:
#define max(a, b) ((a)>(b)?(a):(b))
then try this call:
max(i++, j++)
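Expanding the macro by hand shows the problem; the call becomes

((i++) > (j++) ? (i++) : (j++))

so whichever argument is larger ends up incremented twice.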
Another example. Say you have:
#define PLANETS 8
#define SOCCER_MIDDLE_RIGHT 8
If an error is thrown, it will refer to '8', but not to either of its meaningful representations.
I only know of two reasons for doing what you describe.
First is to force functions to be inlined. This is pretty much pointless, since the inline keyword usually does the same thing, and function inlining is often a premature micro-optimization anyway.
Second is to simulate nested functions in C or C++. This is related to your "writing functions that don't need to be passed in all their parameters" but can actually be quite a bit more powerful than that. Walter Bright gives examples of where nested functions can be useful.
There are other reasons to use of macros, such as using preprocessor-specific functionality (like including __FILE__ and __LINE__ in autogenerated error messages) or reducing boilerplate code in ways that functions and templates can't (the Boost.Preprocessor library excels here; see Boost.ScopeExit or this sample enum code for examples), but these reasons don't seem to apply for doing what you describe.
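As a sketch of the first point (the macro name is made up): a function could not report its caller's location, but a macro can, because __FILE__ and __LINE__ are expanded at the call site.

#include <stdio.h>

/* Expanded at the call site, so it reports the caller's file and line. */
#define LOG_ERROR(msg) \
    fprintf(stderr, "%s:%d: error: %s\n", __FILE__, __LINE__, (msg))

int main(void) {
    LOG_ERROR("something went wrong");
    return 0;
}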
Very long macros will have performance drawbacks, like increased compiled binary size, and there are certainly other reasons for not using them.
For the most problematic macros, I would consider running the code through the preprocessor, and replacing the macro output with function calls (inline if possible) or straight LOC. If the macros exist for compatibility with other architectures/OSes, you might be stuck, though.
Part of the benefit is code replication without the eventual maintenance cost - that is, instead of copying code elsewhere you create a macro from it and only have to edit it once...
Of course, you could also just make a method to be called but that is sort of more work... I'm against much macro use myself, just trying to present a potential rationale.
There are a number of good reasons to write macros in C.
Some of the most important are creating configuration tables using x-macros, making function-like macros that can accept multiple parameter types as inputs, and converting tables from human-readable/configurable/understandable values into computer-used values.
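A minimal x-macro sketch (the table contents are illustrative): the table is written once and expanded twice, which keeps the enum and the string array in sync by construction.

/* The configuration table: one row per command. */
#define COMMANDS \
    X(CMD_START, "start") \
    X(CMD_STOP,  "stop")

/* First expansion: the enumerators. */
#define X(id, name) id,
enum command { COMMANDS };
#undef X

/* Second expansion: the matching human-readable names. */
#define X(id, name) name,
static const char *command_names[] = { COMMANDS };
#undef X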
I can't really see a reason for people to write very long macros, except for the historic automatic function inlining.
I would say that when debugging complex macros (when writing X-macros, etc.), I tend to preprocess the source file and substitute the preprocessed file for the original.
This allows you to see the C code generated, and gives you real lines to work with in the debugger.
I don't use macros at all. Inline functions serve every useful purpose a macro can. Macros allow you to do very weird and counterintuitive things like splitting up identifiers (how does someone search for the identifier then?).
I have also worked on a product where a legacy programmer (who thankfully is long gone) also had a special love affair with macros. His 'custom' scripting language is the height of sloppiness. This was compounded by the fact that he wrote his C++ classes in C, meaning all class functions and variables were public. Anyway, he wrote almost everything in macros and variadic functions (another hideous monstrosity foisted on the world). So instead of writing a proper template class he would use a macro instead! He also resorted to macros to create factory classes, instead of normal code... His code is pretty much unmaintainable.
From what I have seen, macros can be used when they are small, are used declaratively, and don't contain moving parts like loops and other program-flow expressions. It's OK if the macro is one or at most two lines long and it declares an instance of something. Something that won't break during runtime. Also, macros should not contain class definitions or function definitions. If the macro contains code that needs to be stepped into using a debugger, then the macro should be removed and replaced with something else.
They can also be useful for wrapping custom tracing/debugging functionality. For instance you want custom tracing in debug builds but not release builds.
Anyway, when you are working in legacy code like that, just be sure to remove the macro mess a bit at a time. If you keep it up, with enough time you will eventually remove them all and make life a bit easier for yourself. I have done this in the past with especially messy macros. What I do is turn on the compiler switch to have the preprocessor generate an output file. Then I raid that file, copy the code, re-indent it, and replace the macro with the generated code. Thank goodness for that compiler feature.
Some of the legacy code I've worked with used macros very extensively in the place of methods. The reasoning was that the computer/OS/runtime had an extremely small stack, so that stack overflows were a common problem. Using macros instead of methods meant that there were fewer methods on the stack.
Luckily, most of that code was obsolete, so it is (mostly) gone now.
C89 did not have inline functions. If using a compiler with extensions disabled (which is a desirable thing to do for several reasons), then the macro might be the only option.
Although C99 came out in 1999, there was resistance to it for a long time; commercial compiler vendors didn't feel it was worth their time to implement C99. Some (e.g. MS) still haven't. So for many companies it was not a viable practical decision to use C99 conforming mode, even up to today in the case of some compilers.
I have used C89 compilers that did have an extension for inline functions, but the extension was buggy (e.g. multiple definition errors when there should not be), things like that may dissuade a programmer from using inline functions.
Another thing is that the macro version effectively forces the function to actually be inlined. The C99 inline keyword is only a compiler hint, and the compiler may still decide to generate a single instance of the function code, linked like a non-inline function. (One compiler that I still use does this if the function is not trivial and returns void.)
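A sketch of the two idioms being contrasted here:

/* C89: a function-like macro is the only portable way to avoid call overhead */
#define SQUARE(x) ((x) * (x))

/* C99: inline is only a hint; the compiler may still emit an out-of-line copy */
static inline int square(int x) { return x * x; }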

Typical C with C Preprocessor refactoring

I'm working on a refactoring tool for C with preprocessor support...
I don't know what kinds of refactoring are involved in large C projects, and I would like to know what people actually do when refactoring C code (and preprocessor directives).
I'd also like to know whether some features that would be really interesting are missing from every tool, so that the refactoring has to be done completely manually... I've seen for instance that Xref could not refactor macros that are used as iterators (don't know exactly what that means though).
thanks
Anybody interested in this (specific to C), might want to take a look at the coccinelle tool:
Coccinelle is a program matching and transformation engine which provides the language SmPL (Semantic Patch Language) for specifying desired matches and transformations in C code. Coccinelle was initially targeted towards performing collateral evolutions in Linux. Such evolutions comprise the changes that are needed in client code in response to evolutions in library APIs, and may include modifications such as renaming a function, adding a function argument whose value is somehow context-dependent, and reorganizing a data structure. Beyond collateral evolutions, Coccinelle is successfully used (by us and others) for finding and fixing bugs in systems code.
Huge topic!
The stuff I need to clean up is contorted nests of #ifdefs. A refactoring tool would understand when conditional stuff appears in argument lists (function declarations or definitions), and improve that.
If it was really good, it would recognize that
#if defined(SysA) || defined(SysB) || ... || defined(SysJ)
was really equivalent to:
#if !defined(SysK) && !defined(SysL)
If you managed that, I'd be amazed.
It would allow me to specify 'this macro is now defined - which code is visible' (meaning, visible to the compiler); it would also allow me to choose to see the code that is invisible.
It would handle a system spread across over 100 top-level directories, with varying levels of sub-directories under those. It would handle tens of thousands of files, with lengths of 20K lines in places.
It would identify where macro definitions come from makefiles instead of header files (aargh!).
Well, since it is part of the preprocessor... #include refactoring is a huge huge topic and I'm not aware of any tools that do it really well.
Trivial problems a tool could tackle:
Enforcing consistent case and backslash usage in #includes
Enforce a consistent header-guarding convention, automatically add redundant external guards, etc. (both guard patterns are sketched below)
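For reference, the two guard patterns mentioned above (all names are illustrative):

/* foo.h: the conventional internal include guard */
#ifndef PROJECT_FOO_H
#define PROJECT_FOO_H
/* ... declarations ... */
#endif /* PROJECT_FOO_H */

/* In an including file: a redundant external guard, which saves
   re-opening foo.h when it has already been seen. */
#ifndef PROJECT_FOO_H
#include "foo.h"
#endif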
Harder problems a tool could tackle:
Finding and removing spurious includes.
Suggest the use of predeclarations wherever practical.
For macros... perhaps some sort of scoping would be interesting, where if you #define a macro inside a block, the tool would automatically #undef it at the end of a block. Other quick things I can think of:
A quick analysis of macro safety could be helpful, as a lot of people still don't know to use do { } while (0) and other techniques (see the sketch after this list).
Alternately, find and flag spots where expressions with side-effects are passed as macro arguments. This could possibly be really helpful for things like... asserts with unintentional side-effects.
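A sketch of the do { } while (0) idiom referred to above (the macro is illustrative): wrapping the body this way makes the expansion a single statement, so it composes correctly with if/else.

#define SWAP_INT(a, b) do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

int main(void) {
    int x = 2, y = 5;
    if (x < y)
        SWAP_INT(x, y);  /* a bare { ... } block here would break the 'else' */
    else
        x = y;
    return x;  /* returns 5 */
}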
Macros can often get quite complex, so I wouldn't try supporting much more than simple renaming.
I will tell you honestly that there are no good tools for refactoring C++ like there are for Java. Most of it will be painful search-and-replace, but this depends on the actual task. Look at the NetBeans and Eclipse C++ plugins.
I've seen for instance that Xref could not refactor macros that are used as iterators (don't know exactly what that means though)
To be honest, you might be in over your head - consider if you are the right person for this task.
If you can handle reliable renaming of various types, variables and macros over a big project with an arbitrarily complex directory hierarchy, I want to use your product.
Just discovered this old question, but I wanted to mention that I've rescued the free version of Xrefactory for C, now named c-xrefactory, which manages to do some refactorings in macros such as rename macro, rename macro parameter. It is an Emacs plugin.
