Expansion of function-like macro creates a separate token - c

I just found out that gcc seems to treat the result of the expansion of a function-like macro as a separate token. Here is a simple example showing the behavior of gcc:
#define f() foo
void f()_bar(void);
void f()bar(void);
void f()-bar(void);
When I execute gcc -E -P test.c (running just the preprocessor), I get the following output:
void foo _bar(void);
void foo bar(void);
void foo-bar(void);
It seems like, in the first two definitions, gcc inserts space after the expanded macro to ensure it is a separate token. Is that really what is happening here?
Is this mandated by any standard (I couldn't find documentation on the topic)?
I want to make _bar part of the same token. Is there any way to do this? I could use the token concatenation operator ## but it will require several levels of macros (since in the real code f() is more complex). I was wondering if there is a simple (and probably more readable) solution.

It seems like, in the first two definitions, gcc inserts space after the expanded macro to ensure it is a separate token. Is that really what is happening here?
Yes.
Is this mandated by any standard (I couldn't find documentation on the topic)?
Yes, although an implementation would be allowed to insert even more than one whitespace to separate the tokens.
f()_bar
here you have 4 tokens after lexical analysis (they are actually pre-processor tokens at this stage but let's call them tokens): f, (, ) and _bar.
The function-like macro replacement semantic (as defined in C11, 6.10.3) has to replace the 3 token f, (, ) into a new one foo. It is not allowed to work on other tokens and change the last _bar token. For this the implementation has to insert at least one whitespace to preserve _bar token. Otherwise the result would have been foo_bar which is a single token.
gcc preprocessor somewhat documents it here:
Once the input file is broken into tokens, the token boundaries never change, except when the ‘##’ preprocessing operator is used to paste tokens together. See Concatenation. For example,
#define foo() bar
foo()baz
==> bar baz
not
==> barbaz
In the other case, like f()-bar, there 5 tokens: f, (, ), - and bar. (- is a punctuator token in C whereas _ in _bar is simply a character of the identifier token). The implementation does not have to insert token separator (as whitespace) here as after macro replacement -bar are still considered as two separate tokens from C syntax.
gcc preprocessor (cpp) does not insert whitespace here simply because it does not have to. In cpp documentation, on token spacing it is written (on a different issue):
However, we would like to keep space insertion to a minimum, both for aesthetic reasons and because it causes problems for people who still try to abuse the preprocessor for things like Fortran source and Makefiles.
I didn't address the solution to your issue in this answer, but I think you have to use operator explicitly specified to concatenate tokens: the ## token pasting operator.

The only way I can think of (if you can not use the token concatenation operator ##) is using the traditional (pre-standard) C preprocessing:
gcc -E -P -traditional-cpp test.c
Output:
void foo_bar(void);
void foobar(void);
void foo-bar(void);
More info

Related

Preprocessing Tokens: '- -' vs. '--'

Why does the (GCC) preprocessor create two tokens - -B instead of a single one --B in the following example? What is the logic that the former should be correct and not the latter?
#define A -B
-A
Output according to gcc -E:
- -B
After all, -- is a valid operator, so theoretically a valid token as well.
Is this specific to the GCC preprocessor or does this follow from the C standards?
The preprocessor works on tokens, not strings. Macro substitution without ## cannot create a new token and so, if the preprocessor output goes to a textfile as opposed to going straight into the compiler, preprocessors insert whitespace so that the outputted textfile can be used as C input again without changed semantics.
The space insertion doesn't seem to be in the standard, but then the standard describes the preprocessor as working on tokens and as feeding its output to the compiler proper, not a textfile.
Focusing on the white space insertion is missing the issue.
The macro A is defined as the sequence of preprocessing tokens - and B.
When the compiler parses a fragment of source code -A, it produces 2 tokens - and A. A is expanded as part of the preprocessing phase and the tokens are converted to C tokens: -, - and B.
If B is itself defined as a macro (#define B 4), A would expand to -, -, 4, which is parsed as an expression evaluating to the value 4 with type int.
gcc -E produces text. For the text to convert back to the same sequence of tokens as the original source code, a space needs to be inserted between the two - tokens to prevent -- to be parsed as a single token.

Spaces inserted by the C preprocessor

Suppose we are given this input C code:
#define Y 20
#define A(x) (10+x+Y)
A(A(40))
gcc -E outputs like that (10+(10+40 +20)+20).
gcc -E -traditional-cpp outputs like that (10+(10+40+20)+20).
Why the default cpp inserts the space after 40 ?
Where can I find the most detailed specification of the cpp that covers that logic ?
The C standard doesn't specify this behaviour, since the output of the preprocessing phase is simply a stream of tokens and whitespace. Serializing the stream of tokens back into a character string, which is what gcc -E does, is not required or even mentioned by the standard, and does not form part of the translation processs specified by the standard.
In phase 3, the program "is decomposed into preprocessing tokens and sequences of white-space characters." Aside from the result of the concatenation operator, which ignores whitespace, and the stringification operator, which preserves whitespace, tokens are then fixed and whitespace is no longer needed to separate them. However, the whitespace is needed in order to:
parse preprocessor directives
correctly process the stringification operator
The whitespace elements in the stream are not eliminated until phase 7, although they are no longer relevant after phase 4 concludes.
Gcc is capable of producing a variety of information useful to programmers, but not corresponding to anything in the standard. For example, the preprocessor phase of the translation can also produce dependency information useful for inserting into a Makefile, using one of the -M options. Alternatively, a human-readable version of the compiled code can be output using the -S option. And a compilable version of the preprocessed program, roughly corresponding to the token stream produced by phase 4, can be output using the -E option. None of these output formats are in any way controlled by the C standard, which is only concerned with actually executing the program.
In order to produce the -E output, gcc must serialize the stream of tokens and whitespace in a format which does not change the semantics of the program. There are cases in which two consecutive tokens in the stream would be incorrectly glued together into a single token if they are not separated from each other, so gcc must take some precautions. It cannot actually insert whitespace into the stream being processed, but nothing stops it from adding whitespace when it presents the stream in response to gcc -E.
For example, if macro invocation in your example were modified to
A(A(0x40E))
then naive output of the token stream would result in
(10+(10+0x40E+20)+20)
which could not be compiled because 0x40E+20 is a single pp-number token which cannot be converted into a numeric token. The space before the + prevents this from happening.
If you attempt to implement a preprocessor as some kind of string transformation, you will undoubtedly confront serious issues in the corner cases. The correct implementation strategy is to tokenize first, as indicated in the standard, and then perform phase 4 as a function on a stream of tokens and whitespace.
Stringification is a particularly interesting case where whitespace affects semantics, and it can be used to see what the actual token stream looks like. If you stringify the expansion of A(A(40)), you can see that no whitespace was actually inserted:
$ gcc -E -x c - <<<'
#define Y 20
#define A(x) (10+x+Y)
#define Q_(x) #x
#define Q(x) Q_(x)
Q(A(A(40)))'
"(10+(10+40+20)+20)"
The handling of whitespace in stringification is precisely specified by the standard: (§6.10.3.2, paragraph 2, many thanks to John Bollinger for finding the specification.)
Each occurrence of white space between the argument’s preprocessing tokens
becomes a single space character in the character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted.
Here is a more subtle example where additional whitespace is required in the gcc -E output, but is not actually inserted into the token stream (again shown by using stringification to produce the real token stream.) The I (identify) macro is used to allow two tokens to be inserted into the token stream without intervening whitespace; that's a useful trick if you want to use macros to compose the argument to the #include directive (not recommended, but it can be done).
Maybe this could be a useful test case for your preprocessor:
#define Q_(x) #x
#define Q(x) Q_(x)
#define I(x) x
#define C(x,...) x(__VA_ARGS__)
// Uncomment the following line to run the program
//#include <stdio.h>
char*quoted=Q(C(I(int)I(main),void){I(return)I(C(puts,quoted));});
C(I(int)I(main),void){I(return)I(C(puts,quoted));}
Here's the output of gcc -E (just the good stuff at the end):
$ gcc -E squish.c | tail -n2
char*quoted="intmain(void){returnputs(quoted);}";
int main(void){return puts(quoted);}
In the token stream which is passed out of phase 4, the tokens int and main are not separated by whitespace (and neither are return and puts). That's clearly shown by the stringification, in which no whitespace separates the token. However, the program compiles and executes fine, even if passed explicitly through gcc -E:
$ gcc -E squish.c | gcc -x c - && ./a.out
intmain(void){returnputs(quoted);}
and compiling the output of gcc -E.
Different compilers and different versions of the same compiler may produce different serializations of a preprocessed program. So I don't think you will find any algorithm which is testable with a character-by-character comparison with the -E output of a given compiler.
The simplest possible serialization algorithm would be to unconditionally output a space between two consecutive tokens. Obviously, that would output unnecessary spaces, but it would never syntactically alter the program.
I think the minimal space algorithm would be to record the DFA state at the end of the last character in a token so that you can later output a space between two consecutive tokens if there exists a transition from the state at the end of the first token on the first character of the following token. (Keeping the DFA state as part of the token is not intrinsically different from keeping the token type as part of the token, since you can derive the token type from a simple lookup from the DFA state.) That algorithm would not insert a space after 40 in your original test case, but it would insert a space after 0x40E. So it is not the algorithm being used by your version of gcc.
If you use the above algorithm, you will need to rescan tokens created by token concatenation. However, that is necessary anyway, because you need to flag an error if the result of the concatenation is not a valid preprocessing token.
If you don't want to record states (although, as I said, there is essentially no cost in doing so) and you don't want to regenerate the state by rescanning the token as you output it (which would also be quite cheap), you could precompute a two-dimensional boolean array keyed by token type and following character. The computation would essentially be the same as the above: for every accepting DFA state which returns a particular token type, enter a true value in the array for that token type and any character with a transition out of the DFA state. Then you can look up the token type of a token and the first character of the following token to see if a space may be necessary. This algorithm does not produce a minimally-spaced output: it would, for example, put a space after the 40 in your example, since 40 is a pp-number and it is possible for some pp-number to be extended with a + (even though you cannot extend 40 in that way). So it's possible that gcc uses some version of this algorithm.
Adding some historical context to rici's excellent answer.
If you can get your hands on a working copy of gcc 2.7.2.3, experiment with its preprocessor. At that time the preprocessor was a separate program from the compiler, and it used a very naive algorithm for text serialization, which tended to insert far more spaces than were necessary. When Neil Booth, Per Bothner and I implemented the integrated preprocessor (appearing in gcc 3.0 and since), we decided to make -E output a bit smarter at the same time, but without making the implementation too complicated. The core of this algorithm is the library function cpp_avoid_paste, defined at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libcpp/lex.c#l2990 , and its caller is here: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-ppoutput.c#l177 (look for "Subtle logic to output a space...").
In the case of your example
#define Y 20
#define A(x) (10+x+Y)
A(A(40))
cpp_avoid_paste will be called with a CPP_NUMBER token (what rici called a "pp-number") on the left, and a '+' token on the right. In this case it unconditionally says "yes, you need to insert a space to avoid pasting" rather than checking whether the last character of the number token is one of eEpP.
Compiler design often comes down to a trade-off between accuracy and implementation simplicity.

Macro Expansion: Argument with Commas

The code I'm working on uses some very convoluted macro voodoo in order to generate code, but in the end there is a construct that looks like this
#define ARGS 1,2,3
#define MACROFUNC_OUTER(PARAMS) MACROFUNC_INNER(PARAMS)
#define MACROFUNC_INNER(A,B,C) A + B + C
int a = MACROFUNC_OUTER(ARGS);
What is expected is to get
int a = 1 + 2 + 3;
This works well for the compiler it has originally been written for (GHS) and also for GCC, but MSVC (2008) considers PARAMS as a single preprocessing token that it won't expand, setting then A to the whole PARAM and B and C to nothing. The result is this
int a = 1,2,3 + + ;
while MSVC warns that not enough actual parameters for macro 'MACROFUNC_INNER'.
Is it possible to get MSVC do the expansion with some tricks (another layer of macro to force a second expansion, some well placed ## or #, ...). Admitting that changing the way the construct work is not an option. (i.e.: can I solve the problem myself?)
What does the C standard say about such corner case? I couldn't find in the C11 norm anything that explicitly tells how to handle arguments that contains a list of arguments. (i.e.: can I argue with the author of the code that he has to write it again, or is just MVSC non-conform?)
MSVC is non-conformant. The standard is actually clear on the point, although it does not feel the need to mention this particular case, which is not exceptional.
When a function-like macro invocation is encountered, the preprocessor:
§6.10.3/11 identifies the arguments, which are possibly empty sequences of tokens separated by non-protected commas , (a comma is protected if it is inside parentheses ()).
§6.10.3.1/1 does a first pass over the macro body, substituting each parameter which is not used in a # or ## operation with the corresponding fully macro-expanded argument. (It does no other substitutions in the macro body in this step.)
§6.10.3.4/1 rescans the substituted replacement token sequence, performing more macro replacements as necessary.
(The above mostly ignores stringification (#) and token concatenation (##), which are not relevant to this question.)
This order of operations unambiguously leads to the behaviour expected by whoever wrote the software.
Apparently (according to #dxiv, and verified here) the following standards-compliant workaround works on some versions of MS Visual Studio:
#define CALL(A,B) A B
#define OUTER(PARAM) CALL(INNER,(PARAM))
#define INNER(A,B,C) whatever
For reference, the actual language from the C11 standard, skipping over the references to # and ## handling:
§6.10.3 11 The sequence of preprocessing tokens bounded by the outside-most matching parentheses forms the list of arguments for the function-like macro. The individual arguments within the list are separated by comma preprocessing tokens, but comma preprocessing tokens between matching inner parentheses do not separate arguments.…
§6.10.3.1 1 After the arguments for the invocation of a function-like macro have been identified, argument substitution takes place. A parameter in the replacement list… is replaced by the corresponding argument after all macros contained therein have been expanded. Before being substituted, each argument’s preprocessing tokens are completely macro replaced as if they formed the rest of the preprocessing file…
§6.10.3.4 1 After all parameters in the replacement list have been substituted… [t]he resulting preprocessing token sequence is then rescanned, along with all subsequent preprocessing tokens of the source file, for more macro names to replace.
C11 says that each appearance of an object-like macro's name
[is] replaced by the replacement list of preprocessing tokens that constitute the remainder of the directive. The replacement list is then rescanned for more macro names as specified below.
[6.10.3/9]
Of function-like macros it says this:
If the identifier-list in the macro definition does not end with an ellipsis, the number of arguments [...] in an invocation of a function-like macro shall equal the number of parameters in the macro definition.
[6.10.3/4]
and this:
The sequence of preprocessing tokens bounded by the outside-most matching parentheses forms the list of arguments for the function-like macro.
[6.10.3/11]
and this:
After the arguments for the invocation of a function-like macro have been identified, argument substitution takes place. A parameter in the replacement list [...] is replaced by the corresponding argument after all macros contained therein have been expanded. Before being substituted, each argument’s preprocessing tokens are completely macro replaced as if they formed the rest of the preprocessing file; no other preprocessing tokens are available.
[6.10.3.1/1]
Of macros in general it also says this:
After all parameters in the replacement list have been substituted [... t]he resulting preprocessing token sequence is then rescanned, along with all subsequent preprocessing tokens of the source file, for more macro names to replace.
[6.10.3.4/1]
MSVC++ does not properly expand the arguments to function-like macros before rescanning the expansion of such macros. It seems unlikely that there is any easy workaround.
UPDATE:
In light of #dxiv's answer, however, it may be that there is a solution after all. The problem with his solution with respect to standard-conforming behavior is that there needs to be one more expansion than is actually performed. That can easily enough be supplied. This variation on his approach works with GCC, as it should, and inasmuch as it is based on code that dxiv claims works with MSVC++, it seems likely to work there, too:
#define EXPAND(x) x
#define PAREN(...) (__VA_ARGS__)
#define EXPAND_F(m, ...) EXPAND(m PAREN(__VA_ARGS__))
#define SUM3(a,b,c) a + b + c
#define ARGS 1,2,3
int sum = EXPAND_F(SUM3, ARGS);
I have of course made it a little more generic than perhaps it needs to be, but that may serve you well if you have a lot of these to deal with..
Curiuosly enough, the following appears to work in MSVC (tested with 2010 and 2015).
#define ARGS 1,2,3
#define OUTER(...) INNER PARAN(__VA_ARGS__)
#define PARAN(...) (__VA_ARGS__)
#define INNER(A,B,C) A + B + C
int a = OUTER(ARGS);
I don't know that it's supposed to work by the letter of the standard, in fact I have a hunch it's not. Could still be conditionally compiled just for MSVC, as a workaround.
[EDIT] P.S. As pointed out in the comments, the above is (another) non-standard MSVC behavior. Instead, the alternative workarounds posted by #rici and #JohnBollinger in the respective replies are compliant, thus recommended.

Can a C macro definition refer to other macros?

What I'm trying to figure out is if something such as this (written in C):
#define FOO 15
#define BAR 23
#define MEH (FOO / BAR)
is allowed? I would want the preprocessor to replace every instance of
MEH
with
(15 / 23)
but I'm not so sure that will work. Certainly if the preprocessor only goes through the code once then I don't think it'd work out the way I'd like.
I found several similar examples but all were really too complicated for me to understand. If someone could help me out with this simple one I'd be eternally grateful!
Short answer yes. You can nest defines and macros like that - as many levels as you want as long as it isn't recursive.
The answer is "yes", and two other people have correctly said so.
As for why the answer is yes, the gory details are in the C standard, section 6.10.3.4, "Rescanning and further replacement". The OP might not benefit from this, but others might be interested.
6.10.3.4 Rescanning and further replacement
After all parameters in the replacement list have been substituted and
# and ## processing has taken place, all placemarker preprocessing tokens are removed.
Then, the resulting preprocessing token sequence
is rescanned, along with all subsequent preprocessing tokens of the
source file, for more macro names to replace.
If the name of the macro being replaced is found during this scan of
the replacement list (not including the rest of the source file's
preprocessing tokens), it is not replaced. Furthermore, if any nested
replacements encounter the name of the macro being replaced, it is not
replaced. These nonreplaced macro name preprocessing tokens are no
longer available for further replacement even if they are later
(re)examined in contexts in which that macro name preprocessing token
would otherwise have been replaced.
The resulting completely macro-replaced preprocessing token sequence
is not processed as a preprocessing directive even if it resembles
one, but all pragma unary operator expressions within it are then
processed as specified in 6.10.9 below.
Yes, it's going to work.
But for your personal information, here are some simplified rules about macros that might help you (it's out of scope, but will probably help you in the future). I'll try to keep it as simple as possible.
The defines are "defined" in the order they are included/read. That means that you cannot use a define that wasn't defined previously.
Usefull pre-processor keyword: #define, #undef, #else, #elif, #ifdef, #ifndef, #if
You can use any other previously #define in your macro. They will be expanded. (like in your question)
Function macro definitions accept two special operators (# and ##)
operator # stringize the argument:
#define str(x) #x
str(test); // would translate to "test"
operator ## concatenates two arguments
#define concat(a,b) a ## b
concat(hello, world); // would translate to "helloworld"
There are some predefined macros (from the language) as well that you can use:
__LINE__, __FILE__, __cplusplus, etc
See your compiler section on that to have an extensive list since it's not "cross platform"
Pay attention to the macro expansion
You'll see that people uses a log of round brackets "()" when defining macros. The reason is that when you call a macro, it's expanded "as is"
#define mult(a, b) a * b
mult(1+2, 3+4); // will be expanded like: 1 + 2 * 3 + 4 = 11 instead of 21.
mult_fix(a, b) ((a) * (b))
Yes, and there is one more advantage of this feature. You can leave some macro undefined and set its value as a name of another macro in the compilation command.
#define STR "string"
void main() { printf("value=%s\n", VALUE); }
In the command line you can say that the macro "VALUE" takes value from another macro "STR":
$ gcc -o test_macro -DVALUE=STR main.c
$ ./test_macro
Output:
value=string
This approach works as well for MSC compiler on Windows. I find it very flexible.
I'd like to add a gotcha that tripped me up.
Function-style macros cannot do this.
Example that doesn't compile when used:
#define FOO 1
#define FSMACRO(x) FOO + x
Yes, that is supported. And used quite a lot!
One important thing to note though is to make sure you paranthesize the expression otherwise you might run into nasty issues!
#define MEH FOO/BAR
// vs
#define MEH (FOO / BAR)
// the first could be expanded in an expression like 5 * MEH to mean something
// completely different than the second

During C macro expansion, is there a special case for macros that would expand to "/*"?

Here's a relevant example. It's obviously not valid C, but I'm just dealing with the preprocessor here, so the code doesn't actually have to compile.
#define IDENTITY(x) x
#define PREPEND_ASTERISK(x) *x
#define PREPEND_SLASH(x) /x
IDENTITY(literal)
PREPEND_ASTERISK(literal)
PREPEND_SLASH(literal)
IDENTITY(*pointer)
PREPEND_ASTERISK(*pointer)
PREPEND_SLASH(*pointer)
Running gcc's preprocessor on it:
gcc -std=c99 -E macrotest.c
This yields:
(...)
literal
*literal
/literal
*pointer
**pointer
/ *pointer
Please note the extra space in the last line.
This looks like a feature to prevent macros from expanding to "/*" to me, which I'm sure is well-intentioned. But at a glance, I couldn't find anything pertaining to this behaviour in the C99 standard. Then again, I'm inexperienced at C. Can someone shed some light on this? Where is this specified? I would guess that a compiler adhering to C99 should not just insert extra spaces during macro expansion just because it would probably prevent programming mistakes.
The source code is already tokenized before being processed by CPP.
So what you have is a / and a * token that will not be combined implicitly to a /* "token" ( since /* is not really a preprocessor token I put it in "").
If you use -E to output preprocessed source CPP needs to insert a space in order to avoid /* being read by a subsequent compiler pass.
The same feature prevents from two e.g. + signs from different macros being combined into a ++ token on output.
The only way to really paste two preprocessor tokens together is with the ## operator:
#define P(x,y) x##y
...
P(foo,bar)
results in the token foobar
P(+,+)
results in the token ++, but
P(/,*)
is not valid since /* is not a valid preprocessor token.
The behavior of the pre-processor is standardized. In the summary at http://en.wikipedia.org/wiki/C_preprocessor , the results you are observing are the effect of:
"3: Tokenization - The preprocessor breaks the result into preprocessing tokens and whitespace. It replaces comments with whitespace".
This takes place before:
"4: Macro Expansion and Directive Handling".

Resources