I want to run simple analysis on C files (such as: if the foo macro is called with INT_TYPE as an argument, then the result is cast to int*). I do not want to preprocess the file, I just want to parse it (so that, for instance, I'll have correct line numbers).
I.e., I want to get from
#include <a.h>
#define FOO(f)
int f() {FOO(1);}
a list of tokens like
<include_directive value="a.h"/>
<macro name="FOO"><param name="f"/><result/></macro>
<function name="f">
<return>int</return>
<body>
<macro_call name="FOO"><param>1</param></macro_call>
</body>
</function>
with no need to set include path, etc.
Is there any preexisting parser that does it? All parsers I know assume C is preprocessed. I want to have access to the macros and actual include instructions.
Our C Front End can parse code containing preprocessor elements to a fair extent and still build a usable AST. (Yes, the parse tree has precise file/line/column information.)
There are a number of restrictions, but within them it handles most code. In the few cases it cannot handle, a small, easy change to the source file giving equivalent code often solves the problem.
Here's a rough set of rules and restrictions:
#includes and #defines can occur wherever a declaration or statement can occur, but not in the middle of a statement. These rarely cause a problem.
Macro calls can occur where function calls occur in expressions, or can appear without a semicolon in place of statements. Macro calls that span non-well-formed chunks are not handled well (anybody surprised?). The latter occur occasionally but not rarely and need manual revision. The OP's example of "j(v,oid)*" is problematic, but this is really rare in code.
#if ... #endif must be wrapped around major language constructs (nonterminals) (constant, expression, statement, declaration, function) or sequences of such entities, or around certain non-well-formed but commonly occurring idioms, such as if (exp) {. Each arm of the conditional must contain the same kind of syntactic construct as the other arms. An #if wrapped around random text used as a bad kind of comment is problematic, but easily fixed in the source by making it a real comment, as shown below. Where these conditions are not met, you need to modify the original source code, often by moving the #if / #elif / #else / #endif a few tokens.
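For example, an #if used purely as a comment wraps text that is not any language construct:
#if 0
these are working notes, not compilable C
#endif
and the easy fix is to turn it into a real comment:
/* these are working notes, not compilable C */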
In our experience, one can revise a code base of 50,000 lines in a few hours to get around these issues. While that seems annoying (and it is), the alternative is to not be able to parse the source code at all, which is far worse than annoying.
You also want more than just a parser. See Life After Parsing to learn what happens after you succeed in getting a parse tree. We've done some additional work in building symbol tables in which the declarations are recorded with the preprocessor context in which they are embedded, enabling type checking to include the preprocessor conditions.
You can have a look at this ANTLR grammar. You will have to add rules for preprocessor tokens, though.
Your specific example can be handled by writing your own parser and ignoring macro expansion,
because FOO(1) itself can be interpreted as a function call.
When more cases are considered, however, the parser becomes much more difficult. You can refer to the PDF Link for more information.
Related
How do I implement #define in yacc/bison?
For Example:
#define f(x) x*x
If f(x) appears anywhere in any function, it is replaced by the right-hand side of the
macro, substituting for the argument ‘x’.
For example, f(3) would be replaced with 3*3. The macro can call another macro too.
It's not usually possible to do macro expansion inside a parser, at least not C-style macros, because C-style macro expansion doesn't respect syntax. For example
#define IF if(
#define THEN )
is legal (although very bad style IMHO). But for that to be handled inside the grammar, it would be necessary to allow a macro identifier to appear anywhere in the input, not just where an identifier might be expected. The necessary modifications to the grammar are going to make it much less readable and are very likely to introduce parser action conflicts. [Note 1]
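With those definitions, a use site looks nothing like a function call, which is exactly why the grammar would have to accept a macro name almost anywhere:
IF x > 0 THEN
    y = 1;
/* after expansion: if ( x > 0 ) y = 1; */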
Alternatively, you could do the macro expansion in the lexical analyzer. The lexical analyzer is not a parser, but parsing a C-style macro invocation doesn't require much sophistication, and if macro parameters were not allowed, it would be even simpler. This is how Flex handles macro replacement in its regular expressions ({identifier}, for example). [Note 2] Since Flex macros are just raw character sequences, not token lists as with C-style macros, they can be handled by pushing the replacement text back into the input stream. (F)lex provides the unput special action for this purpose. unput pushes one character back into the input stream, so if you want to push back an entire macro replacement, you have to unput it one character at a time, back to front, so that the last character unput is the first one to be read afterwards.
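A minimal sketch of that push-back, assuming it lives in the user-code section of the .l file (where unput() is available) and that <string.h> has been included in the definitions section:
static void unput_string(const char *s)
{
    /* push the replacement back last-character-first so it is
       re-read in its original order */
    size_t n = strlen(s);
    while (n > 0)
        unput(s[--n]);
}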
That's workable but ugly. And it's not really scalable to even the small feature list provided by the C preprocessor. And it violates the fundamental principle of software design, which is that each component does just one thing (so that it can do it well).
So that leaves the most common approach, which is to add a separate macro processor component, so that instead of dividing the parse into lexical scan/syntax analysis, the parse becomes lexical scan/macro expansion/syntax analysis. [Note 3]
A C-style macro processor which works between the lexical analyser and the syntactic analyser could itself be written in Bison. As I mentioned above, the parsing requirements are generally minimal, but there is still parsing to be done and Bison is presumably already part of the project. Although I don't know of any macro processor (other than proof-of-concept programs I've written myself) which does this, I think it's a very flexible solution. In particular, the Bison syntactic analysis phase could be implemented with a push-parser, which avoids the need to produce the entire macro-expanded token stream in order to make it available to a traditional pull-parser.
That's not the only way to design macros, though. Indeed, it has a lot of shortcomings, because the macro expansions are not hygienic, respecting neither syntax nor scope. Probably anyone who has used C macros has at one time or other been bitten by these problems; the simplest manifestation is defining a macro like:
#define NEXT(a) a + 1
and then writing
int x = NEXT(a) * 3;
which is not going to produce the expected result: it expands to a + 1 * 3, i.e. a + 3, because the expansion respects neither precedence nor the intended grouping. Also, any macro expansion which needs to use a local variable will sooner or later produce an incorrect expansion because of unexpected name collision. Hygienic macro expansion seeks to solve these issues by viewing macro expansion as an operation on syntax trees, not token streams, making the parsing paradigm lexical scan/syntax analysis/macro expansion (of the parse tree). For that operation, the appropriate tool might well be some kind of tree parser.
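The conventional mitigation, still unhygienic but enough to fix the precedence problem, is to parenthesize the definition:
#define NEXT(a) ((a) + 1)
int x = NEXT(a) * 3;   /* now expands to ((a) + 1) * 3 */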
Notes
Also, you'd want to remove the token from the parse tree. Yacc/bison does have a poorly-documented feature, YYBACKUP, which might possibly help accomplish this. I don't know if that's one of its intended use cases; indeed, it is not clear to me what its intended use cases are.
The (f)lex documentation calls these definitions, but they really are macros, and they suffer from all the usual problems macros bring with them, such as mysterious interactions with surrounding syntax.
Another possibility is macro expansion/lexical scan/syntax analysis, which could be implemented using a macro processor like M4. But that completely divorces the macros from the rest of the language.
yacc and lex generate C source in the end, so you can use macros inside the parser and lexer actions.
The actual #define preprocessor directives can go in the first section of the lexer and parser file:
%{
// Somewhere here
#define f(x) x*x
%}
These sections will be copied verbatim to the generated C source.
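For instance, a minimal flex input along those lines might look like this; the SQUARE macro and the number-matching rule are illustrative, not from the question:
%{
#include <stdio.h>
#include <stdlib.h>
/* an ordinary C macro; it is copied verbatim into the generated scanner
   and can therefore be used inside the rule actions below */
#define SQUARE(x) ((x) * (x))
%}
%option noyywrap
%%
[0-9]+    { printf("%d\n", SQUARE(atoi(yytext))); }
.|\n      { /* pass over everything else */ }
%%
int main(void) { return yylex(); }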
I was reading the C Preprocessor guide page on gnu.org on computed includes which has the following explanation:
2.6 Computed Includes
Sometimes it is necessary to select one of several different header files to be included into your program. They might specify configuration parameters to be used on different sorts of operating systems, for instance. You could do this with a series of conditionals,
#if SYSTEM_1
# include "system_1.h"
#elif SYSTEM_2
# include "system_2.h"
#elif SYSTEM_3 …
#endif
That rapidly becomes tedious. Instead, the preprocessor offers the ability to use a macro for the header name. This is called a computed include. Instead of writing a header name as the direct argument of ‘#include’, you simply put a macro name there instead:
#define SYSTEM_H "system_1.h"
…
#include SYSTEM_H
This doesn't make sense to me. The first code snippet allows for optionality based on which system type you encounter by using branching #if/#elif conditionals. The second seems to have no optionality: a macro is used to define a particular system header, and then the macro is placed into the include statement without any code that would imply its definition can change. Yet the text implies these are equivalent and that the second is shorthand for the first. Can anyone explain how the optionality of the first code snippet exists in the second? I also don't know what code is implied by the "…" in the second code snippet.
There are other places in the code or build system that define (or don't define) the macros being tested in the conditionals. What's suggested is that instead of those places defining lots of different SYSTEM_1, SYSTEM_2, etc. macros, they'll just define SYSTEM_H to the desired value.
Most likely this won't actually be an explicit #define; instead it will be a compiler option, e.g.
gcc -DSYSTEM_H='"system_1.h"' ...
And this will most likely actually come from a setting in a makefile or other configuration file.
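So the selection happens wherever SYSTEM_H gets defined. A small sketch (header names are illustrative) of how the including file might also provide a default when the build defines nothing:
/* fall back to a default header if the build system didn't pick one */
#ifndef SYSTEM_H
#define SYSTEM_H "system_default.h"
#endif

#include SYSTEM_H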
I'm writing unit tests for some function macros, which are just wrappers around some function calls, with a little housekeeping thrown in.
I've been writing tests all morning and I'm starting to get tedium of the brainpan, so this might just be a case of tunnel vision, but:
Is there a valid case to be made for unit testing macro expansion? By that I mean checking that the correct function behavior is produced for the various source-code forms of the function macro's arguments. For example, an argument can be written in source code as a:
literal
variable
operator expression
struct member access
pointer-to-struct member access
pointer dereference
array index
function call
macro expansion
(feel free to point out any that I've missed)
If the macro doesn't expand properly, then the code usually won't even compile. So then, is there even any sensible point in a different unit test if the argument was a float literal or a float variable, or the result of a function call?
Should the expansion be part of the unit test?
As I noted in a comment:
Using expressions such as value & 1 could reveal that the macros are careless, but code inspections can do that too.
I think going through the full panoply of tests is overkill; the tedium is giving you a relevant warning.
There is an additional mode of checking that might be relevant, namely side-effects such as: x++ + ++y as an argument. If the argument to the macro is evaluated more than once, the side-effects will probably be scrambled, or at least repeated. An I/O function (getchar(), or printf("Hello World\n")) as the argument might also reveal mismanagement of arguments.
It also depends in part on the rules you want to apply to the macros. However, if they're supposed to look and behave like function calls, they should evaluate each argument exactly once: if the macro doesn't evaluate an argument at all, then side effects that would occur if the macro were really a function won't occur.
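For example, a minimal sketch of such a test; the SQUARE macro and next_value() are illustrative, not from the question:
#include <assert.h>

#define SQUARE(x) ((x) * (x))      /* looks like a function, but evaluates x twice */

static int calls;
static int next_value(void) { ++calls; return 3; }

int main(void)
{
    int r = SQUARE(next_value());  /* expands to ((next_value()) * (next_value())) */
    assert(r == 9);
    assert(calls == 1);            /* fails: the argument was evaluated twice */
    return 0;
}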
Also, don't underestimate the value of inline functions.
Based on the comments and some of the points made in Jonathan Leffler's answer, I've come to the conclusion that this is something better covered by functional testing, preferably with a fuzz tester.
That way, using a couple of automation scripts, the fuzzer can throw a jillion arguments at the function macro and log those that either don't compile, produce compiler warnings, or compile and run but produce an incorrect result.
Since fuzz tests aren't supposed to run quickly (like unit tests), there's no problem just adding it to the fuzz suite and letting it run over the weekend.
The goal of testing is to find errors. And, your macro definitions can contain errors. Therefore, there is a case for testing macros in general, and unit-testing in particular can find many specific errors, as will be explained below.
Code inspection can obviously also be used to find errors; however, there are good points in favor of doing both: unit tests can cheaply be repeated whenever the respective code is modified, say, for refactoring.
Code inspections cannot cheaply be repeated (at least they cause more effort than re-running unit tests), but they can also find other points that tests can never detect, like wrong or bad documentation, design issues such as code duplication, etc.
That said, there are a number of issues you can find when unit-testing macros, some of which were already mentioned. And, it may in principle be possible that there are fuzz testers which also check for such problems, but I doubt that problems with macro definitions are already in focus of fuzz-testers:
wrong algorithm: Expressions and statements in macro definitions can be just as wrong as they can be in normal non-macro code.
insufficient parenthesization (as mentioned in the comments): This is a potential problem with macros, and it can be detected, possibly even at compile time, by passing expressions with operators of low precedence as macro arguments. For example, calling FOO(x = 2) in test code will lead to a compile error if FOO(a) is defined as (2 * a) instead of (2 * (a)).
unintended multiple use of arguments in the expansion (as mentioned by Jonathan): This also is a potential problem specific to macros. It should be part of the specification of a macro how often its arguments will be evaluated in the expanded code (and sometimes no fixed number can be given; see assert). Such statements about how often an argument will be evaluated can be tested by passing macro arguments with side effects that can afterwards be checked by the test code. For example, if FOO(a) is defined to be ((a) * (a)), then the call FOO(++x) will result in x being incremented twice rather than once.
unintended expansion: Sometimes a macro shall expand in a way that causes no code to be produced. assert with NDEBUG is an example here, which shall expand such that the expanded code will be optimized away completely. Whether a macro shall expand in such a way typically depends on configuration macros. To check that a macro actually 'disappears' for the respective configuration, syntactically wrong macro arguments can be used: FOO(++ ++) for example can be a compile-time test to see if instead of the empty expansion one of the non-empty expansions was used (whether this works, however, depends on whether the non-empty expansions use the argument).
bad semicolon: To ensure that a function-like macro expands cleanly into a compound statement (with a proper do-while(0) wrapper but without a trailing semicolon), a compile-time check like if (0) FOO(42); else .. can be used (see the sketch after this list).
Note: Those tests I mentioned as compile-time tests are, strictly speaking, just a form of static analysis. In contrast to using a static analysis tool, such tests have the benefit of specifically testing those properties that the macros are expected to have according to their design. For example, static analysis tools typically issue warnings when macro arguments are used without parentheses in the expansion; however, in many expansions parentheses are intentionally not used.
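A minimal sketch of two such compile-time checks, using hypothetical FOO and FOO2 macros (not from the question):
/* hypothetical macros under test */
#define FOO(a)  (2 * (a))
#define FOO2(x) do { (void)(x); } while (0)

/* parenthesization check: with a careless definition like (2 * a),
   FOO(x = 2) would expand to (2 * x = 2) and fail to compile */
void check_parens(void)
{
    int x = 0;
    (void)FOO(x = 2);
}

/* "bad semicolon" check: if FOO2 expanded to a bare { ... } block instead of
   do { ... } while (0), the trailing ';' would orphan the 'else' below */
void check_semicolon(void)
{
    if (0)
        FOO2(42);
    else
        (void)0;
}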
I am trying to use a function-like macro to generate an object-like macro name (generically, a symbol). The following will not work because __func__ (C99 6.4.2.2-1) puts quotes around the function name.
#define MAKE_AN_IDENTIFIER(x) __func__##__##x
The desired result of calling MAKE_AN_IDENTIFIER(NULL_POINTER_PASSED) would be MyFunctionName__NULL_POINTER_PASSED. There may be other reasons this would not work (such as __func__ being taken literally and not interpreted, but I could fix that) but my question is what will provide a predefined macro like __func__ except without the quotes? I believe this is not possible within the C99 standard so valid answers could be references to other preprocessors.
Presently I have simply created my own object-like macro and redefined it manually before each function to be the function name. Obviously this is a poor and probably unacceptable practice. I am aware that I could take an existing cpp program or library and modify it to provide this functionality. I am hoping there is either a commonly used cpp replacement which provides this or a preprocessor library (prefer Python) which is designed for extensibility so as to allow me to 'configure' it to create the macro I need.
I wrote the above to try to provide a concise and well-defined question, but it is certainly the Y referred to by @Ruud. The X is...
I am trying to manage unique values for reporting errors in an embedded system. The values will be passed as a parameter to a(some) particular function(s). I have already written a Python program using pycparser to parse my code and identify all symbols being passed to the function(s) of interest. It generates a .h file of #defines maintaining the values of previously existing entries, commenting out removed entries (to avoid reusing the value and also allow for reintroduction with the same value), assigning new unique numbers for new identifiers, reporting malformed identifiers, and also reporting multiple use of any given identifier. This means that I can simply write:
void MyFunc(int * p)
{
    if (p == NULL)
    {
        myErrorFunc(MYFUNC_NULL_POINTER_PASSED);
        return;
    }
    // do something actually interesting here
}
and the Python program will create the #define MYFUNC_NULL_POINTER_PASSED 7 (or whatever next available number) for me with all the listed considerations. I have also written a set of macros that further simplify the above to:
#define FUNC MYFUNC
void MyFunc(int * p)
{
    RETURN_ASSERT_NOT_NULL(p);
    // do something actually interesting here
}
assuming I provide the #define FUNC. I want to use the function name since that will be constant throughout many changes (as opposed to __LINE__) and will make it much easier for someone to transfer the value from the old generated #define to the new generated #define when the function itself is renamed. Honestly, I think the only reason I am trying to 'solve' this 'issue' is because I have to work in C rather than C++. At work we are writing fairly object-oriented C, so there is a lot of NULL pointer checking and IsInitialized checking. I have two-line functions that turn into 30 because of all these basic checks (these macros reduce those lines by a factor of five). While I do enjoy the challenge of crazy macro development, I much prefer to avoid them. That said, I dislike repeating myself and hiding the functional code in a pile of error checking even more than I dislike crazy macros.
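The question doesn't show those macros, but a minimal sketch of how RETURN_ASSERT_NOT_NULL might be built on top of the per-function FUNC define (the concatenation helpers are assumptions) could be:
/* sketch only; myErrorFunc() and the error-ID naming scheme come from the question */
#define CONCAT3(a, b, c)             a ## b ## c
#define EXPAND_THEN_CONCAT3(a, b, c) CONCAT3(a, b, c)
#define ERROR_ID(name)               EXPAND_THEN_CONCAT3(FUNC, _, name)

#define RETURN_ASSERT_NOT_NULL(p)                        \
    do {                                                 \
        if ((p) == NULL) {                               \
            myErrorFunc(ERROR_ID(NULL_POINTER_PASSED));  \
            return;                                      \
        }                                                \
    } while (0)
With #define FUNC MYFUNC in effect, ERROR_ID(NULL_POINTER_PASSED) expands to MYFUNC_NULL_POINTER_PASSED, which is exactly the identifier the Python generator assigns a value to.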
If you prefer to take a stab at this issue, have at.
__FUNCTION__ used to compile to a string literal (I think in gcc 2.96), but it hasn't for many years. Now instead we have __func__, which compiles to a string array, and __FUNCTION__ is a deprecated alias for it. (The change was a bit painful.)
But in neither case was it possible to use this predefined macro to generate a valid C identifier (i.e. "remove the quotes").
But could you instead use the line number rather than function name as part of your identifier?
If so, the following would work. As an example, compiling the following 5-line source file:
#define CONCAT_TOKENS4(a,b,c,d) a##b##c##d
#define EXPAND_THEN_CONCAT4(a,b,c,d) CONCAT_TOKENS4(a,b,c,d)
#define MAKE_AN_IDENTIFIER(x) EXPAND_THEN_CONCAT4(line_,__LINE__,__,x)
static int MAKE_AN_IDENTIFIER(NULL_POINTER_PASSED);
will generate the warning:
foo.c:5: warning: 'line_5__NULL_POINTER_PASSED' defined but not used
As pointed out by others, there is no macro that returns the (unquoted) function name (mainly because the C preprocessor has insufficient syntactic knowledge to recognize functions). You would have to explicitly define such a macro yourself, as you already did:
#define FUNC MYFUNC
To avoid having to do this manually, you could write your own preprocessor to add the macro definition automatically. A similar question is this: How to automatically insert pragmas in your program
If your source code has a consistent coding style (particularly indentation), then a simple line-based filter (sed, awk, perl) might do. In its most naive form: every function starts with a line that does not start with a hash or whitespace, and ends with a closing parenthesis or a comma. With awk:
{
    print $0;
}
/^[^# \t].*[,\)][ \t]*$/ {
    sub(/\(.*$/, "");
    sub(/^.*[ \t]/, "");
    print "#define FUNC " toupper($0);
}
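A hypothetical invocation (script and file names made up) would run the filter over each source file before compiling, e.g. awk -f add_func_define.awk foo.c > foo_with_func.c.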
For a more robust solution, you need a compiler framework like ROSE.
GNU C has a __FUNCTION__ macro, but sadly even that cannot be used in the way you are asking.
I'm curious as to why I see nearly all C macros formatted like this:
#ifndef FOO
# define FOO
#endif
Or this:
#ifndef FOO
#define FOO
#endif
But never this:
#ifndef FOO
    #define FOO
#endif
(moreover, vim's = operator only seems to count the first two as correct.)
Is this due to portability issues among compilers, or is it just a standard practice?
I've seen it done all three ways; it seems to be a matter of style, not of syntax.
While the second example is usually the most common, I've seen cases where the first (or third) is used to help distinguish multiple levels of #ifdefs. Sometimes the logic can become deeply nested, and the only way to understand it at a glance is to use indentation, much as it is common practice to indent blocks of code between { and }.
IIRC, older C preprocessors required the # to be the first character on the line (though I've never actually encountered one that had this requirement).
I've never seen code like your first example. I usually wrote preprocessor directives as in your second example; I found that it visually interfered less with the indentation of the actual code (not that I write in C anymore).
The GNU C Preprocessor manual says:
Preprocessing directives are lines in your program that start with '#'. Whitespace is allowed before and after the '#'.
For preference I use the third style, with the exception of include guards, for which I use the second style.
I don't like the first style at all - I think of #define as being a preprocessor instruction, even though really of course it isn't, it's a # followed by the preprocessor instruction define. But since I do think of it that way, it seems wrong to separate them. I expect text editors written by people who advocate that style will have a block indent/un-indent that works on code written in that style. But I would hate to encounter it using a text editor that didn't.
There's no point pandering to ancient preprocessors where the # has to be the first character of the line, unless you can also list off the top of your head all the other differences between those implementations and standard C, in order to avoid the other things you could possibly do that they would not support. Of course if you genuinely are working with a pre-standard compiler, fair enough.
Preprocessor directives are lines included in our programs that are not actually program statements but directives for the preprocessor. These lines are always preceded by a hash sign (#). Whitespace is allowed before and after the '#'. As soon as a newline character is found, the preprocessor directive is considered to end.
There is no other rule as far as the C/C++ standards are concerned, so it remains a matter of style and readability. I have only seen (and written) programs in the second way that you posted, although the third one seems more readable.