Why does the pre-processor give a space? - c-preprocessor

I want to comment out a line using the pre-processor:
#define open /##*
#define close */
main()
{
open commented line close
}
When I run $ gcc -E filename.c I expected
/* commented line */
but I got
/ * commented line */
so the compiler reports an error.
Why is it inserting an unwanted space?

From the GNU C Preprocessor documentation:
However, two tokens that don't together form a valid token cannot be pasted together. For example, you cannot concatenate x with + in either order. If you try, the preprocessor issues a warning and emits the two tokens. Whether it puts white space between the tokens is undefined. It is common to find unnecessary uses of '##' in complex macros. If you get this warning, it is likely that you can simply remove the '##'.
In this case '/' and '*' do not together form a valid C or C++ token, so they are emitted with a space between them.
(Aside: you are likely to get C compilation errors even if you do manage to insert "comments" into the output of the C preprocessor. There aren't supposed to be any comments there.)
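For illustration, here is a minimal sketch (the GLUE macro is invented for this example): pasting tokens that do form a valid token succeeds, while pasting / and * does not:
#define GLUE(a, b) a##b
int GLUE(foo, bar) = 1;   /* pastes into the single identifier foobar */
int main(void) { return foobar; }
/* By contrast, GLUE(/, *) would be rejected: the tokens / and * cannot be
   pasted into one valid preprocessing token. */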

The error is because /* is not a valid token.
As the CPP documentation explains:
two tokens that don't together form a valid token cannot be pasted together. For example, you cannot concatenate x with + in either order.
You can get the error by pasting other nonsense stuff e.g. /##+ or +##-.
As for the space, it is deliberately inserted to avoid creating a comment and messing up the rest of the output. From the GCC source code:
  /* Avoid comment headers, since they are still processed in stage 3.
     It is simpler to insert a space here, rather than modifying the
     lexer to ignore comments in some circumstances.  Simply returning
     false doesn't work, since we want to clear the PASTE_LEFT flag.  */
  if ((*plhs)->type == CPP_DIV && rhs->type != CPP_EQ)
    *end++ = ' ';

The preprocessor runs and produces code in a form that the C compiler can understand. It only processes your code once, so even if you could produce a /* with your #define, the compiler would see the /* and give you an error, because it's not valid C at that point (comments have already been handled during preprocessing).
This doesn't seem like a very good thing to do.

Because comments are replaced with spaces before (and only before) macros are expanded. If you paste together the characters / and * using the preprocessor, the result is just the two operators / and *, not a comment. Edit: strictly, such abuse of ## attempts to create /* as a single token, which has undefined behavior. You can paste together > ## > or < %:%: :, although you shouldn't.
See §6.4.6 of C99 for what tokens you are allowed to construct and §6.10.3.3 for the concatenation process.

If you want to comment out some code using the pre-processor, use
#if 0
...
#endif
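For example, a minimal sketch of that approach: unlike comments, an #if 0 block can itself contain /* ... */ comments (and nested #if/#endif pairs) without problems:
#include <stdio.h>
int main(void)
{
#if 0
    printf("this line is skipped\n");   /* a comment inside the block is fine */
#endif
    printf("only this line runs\n");
    return 0;
}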

Related

What type of content is allowed to be used as arguments for C preprocessor macro?

Honestly, I know the syntax of the C programming language well, but I know almost nothing about the syntax of the C preprocessor, although I do use it in practice from time to time.
So the question. Suppose we have a simple macro that expands to nothing:
#define macro(param)
What are the restrictions on the syntax that can be placed inside a macro invocation?
It is certainly impossible to use a bare comma (single or multiple) when invoking the macro:
macro(,); // won't compile
However, if we put the comma inside parentheses, it is accepted by the C preprocessor:
macro((,)); // compiles fine
Of course, you can't use the comment characters:
macro(//); // compile error
because, as far as I know, comments are processed by the preprocessor itself.
Unclosed quotes and unbalanced parentheses aren't allowed either when using the macro:
macro("); // compile error
But characters that have no meaning in C syntax are accepted:
macro(##$); // compiles
Even characters from other languages work fine:
macro(бла-бла-бла я пишу по-русски); // compiles too
Can I use arbitrary valid C/C++ code in curly brackets when invoking the macro? Can I use arbitrary valid C/C++ code without curly brackets? The following code seems to compile fine:
macro(int a = 5; printf("%d\n", a););
You cannot pass arbitrary text to be ignored with your proposed scheme:
The C preprocessor will read the input file and parse it as a sequence of preprocessing tokens, after stripping comments, to pass to the macro as arguments. It matches parentheses to determine what tokens constitute arguments separated by ,.
Text containing strings or character constants with unrecognized escape sequences, or stand-alone backslashes, does not parse as standard preprocessing tokens. Whether macro(##$); compiles is implementation-dependent.
Note however that you can work around the , problem by defining your macro as taking a variable number of arguments:
#define macro(...)
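A minimal sketch of that workaround (reusing the question's macro name): the commas now simply separate (possibly empty) arguments, all of which are discarded:
#define macro(...)
int main(void)
{
    macro(,);          /* accepted: two empty arguments */
    macro(1, 2, 3);    /* accepted: three arguments, all ignored */
    return 0;
}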

Spaces inserted by the C preprocessor

Suppose we are given this input C code:
#define Y 20
#define A(x) (10+x+Y)
A(A(40))
gcc -E outputs (10+(10+40 +20)+20).
gcc -E -traditional-cpp outputs (10+(10+40+20)+20).
Why does the default cpp insert a space after 40?
Where can I find the most detailed specification of cpp that covers this logic?
The C standard doesn't specify this behaviour, since the output of the preprocessing phase is simply a stream of tokens and whitespace. Serializing the stream of tokens back into a character string, which is what gcc -E does, is not required or even mentioned by the standard, and does not form part of the translation process specified by the standard.
In phase 3, the program "is decomposed into preprocessing tokens and sequences of white-space characters." Aside from the result of the concatenation operator, which ignores whitespace, and the stringification operator, which preserves whitespace, tokens are then fixed and whitespace is no longer needed to separate them. However, the whitespace is needed in order to:
parse preprocessor directives
correctly process the stringification operator
The whitespace elements in the stream are not eliminated until phase 7, although they are no longer relevant after phase 4 concludes.
Gcc is capable of producing a variety of information useful to programmers, but not corresponding to anything in the standard. For example, the preprocessor phase of the translation can also produce dependency information useful for inserting into a Makefile, using one of the -M options. Alternatively, a human-readable version of the compiled code can be output using the -S option. And a compilable version of the preprocessed program, roughly corresponding to the token stream produced by phase 4, can be output using the -E option. None of these output formats are in any way controlled by the C standard, which is only concerned with actually executing the program.
In order to produce the -E output, gcc must serialize the stream of tokens and whitespace in a format which does not change the semantics of the program. There are cases in which two consecutive tokens in the stream would be incorrectly glued together into a single token if they are not separated from each other, so gcc must take some precautions. It cannot actually insert whitespace into the stream being processed, but nothing stops it from adding whitespace when it presents the stream in response to gcc -E.
For example, if macro invocation in your example were modified to
A(A(0x40E))
then naive output of the token stream would result in
(10+(10+0x40E+20)+20)
which could not be compiled because 0x40E+20 is a single pp-number token which cannot be converted into a numeric token. The space before the + prevents this from happening.
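You can see the same effect in ordinary source text (a small illustration, not part of the original answer): written without a space, 0x40E+20 lexes as one pp-number, which is not a valid integer constant, so a typical compiler rejects it:
/* int bad = 0x40E+20; */   /* would not compile: "0x40E+20" is a single pp-number */
int good = 0x40E + 20;      /* two separate tokens: fine, good == 1058 */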
If you attempt to implement a preprocessor as some kind of string transformation, you will undoubtedly confront serious issues in the corner cases. The correct implementation strategy is to tokenize first, as indicated in the standard, and then perform phase 4 as a function on a stream of tokens and whitespace.
Stringification is a particularly interesting case where whitespace affects semantics, and it can be used to see what the actual token stream looks like. If you stringify the expansion of A(A(40)), you can see that no whitespace was actually inserted:
$ gcc -E -x c - <<<'
#define Y 20
#define A(x) (10+x+Y)
#define Q_(x) #x
#define Q(x) Q_(x)
Q(A(A(40)))'
"(10+(10+40+20)+20)"
The handling of whitespace in stringification is precisely specified by the standard (§6.10.3.2, paragraph 2; many thanks to John Bollinger for finding the specification):
Each occurrence of white space between the argument’s preprocessing tokens
becomes a single space character in the character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted.
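A small illustration of that rule, reusing the Q_ macro from above:
#define Q_(x) #x
const char *s = Q_(   a   +   b   );   /* s == "a + b": interior whitespace collapses
                                          to single spaces, and leading and trailing
                                          whitespace is deleted */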
Here is a more subtle example where additional whitespace is required in the gcc -E output, but is not actually inserted into the token stream (again shown by using stringification to produce the real token stream). The I (identity) macro is used to allow two tokens to be inserted into the token stream without intervening whitespace; that's a useful trick if you want to use macros to compose the argument to the #include directive (not recommended, but it can be done).
Maybe this could be a useful test case for your preprocessor:
#define Q_(x) #x
#define Q(x) Q_(x)
#define I(x) x
#define C(x,...) x(__VA_ARGS__)
// Uncomment the following line to run the program
//#include <stdio.h>
char*quoted=Q(C(I(int)I(main),void){I(return)I(C(puts,quoted));});
C(I(int)I(main),void){I(return)I(C(puts,quoted));}
Here's the output of gcc -E (just the good stuff at the end):
$ gcc -E squish.c | tail -n2
char*quoted="intmain(void){returnputs(quoted);}";
int main(void){return puts(quoted);}
In the token stream which is passed out of phase 4, the tokens int and main are not separated by whitespace (and neither are return and puts). That's clearly shown by the stringification, in which no whitespace separates the tokens. However, the program compiles and executes fine, even when the output of gcc -E is itself compiled and run:
$ gcc -E squish.c | gcc -x c - && ./a.out
intmain(void){returnputs(quoted);}
Different compilers and different versions of the same compiler may produce different serializations of a preprocessed program. So I don't think you will find any algorithm which is testable with a character-by-character comparison with the -E output of a given compiler.
The simplest possible serialization algorithm would be to unconditionally output a space between two consecutive tokens. Obviously, that would output unnecessary spaces, but it would never syntactically alter the program.
I think the minimal space algorithm would be to record the DFA state at the end of the last character in a token so that you can later output a space between two consecutive tokens if there exists a transition from the state at the end of the first token on the first character of the following token. (Keeping the DFA state as part of the token is not intrinsically different from keeping the token type as part of the token, since you can derive the token type from a simple lookup from the DFA state.) That algorithm would not insert a space after 40 in your original test case, but it would insert a space after 0x40E. So it is not the algorithm being used by your version of gcc.
If you use the above algorithm, you will need to rescan tokens created by token concatenation. However, that is necessary anyway, because you need to flag an error if the result of the concatenation is not a valid preprocessing token.
If you don't want to record states (although, as I said, there is essentially no cost in doing so) and you don't want to regenerate the state by rescanning the token as you output it (which would also be quite cheap), you could precompute a two-dimensional boolean array keyed by token type and following character. The computation would essentially be the same as the above: for every accepting DFA state which returns a particular token type, enter a true value in the array for that token type and any character with a transition out of the DFA state. Then you can look up the token type of a token and the first character of the following token to see if a space may be necessary. This algorithm does not produce a minimally-spaced output: it would, for example, put a space after the 40 in your example, since 40 is a pp-number and it is possible for some pp-number to be extended with a + (even though you cannot extend 40 in that way). So it's possible that gcc uses some version of this algorithm.
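A rough sketch of that table-driven check (the token-type enum and the table entries are invented for illustration; a real implementation would derive the table from the lexer's DFA as described above):
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>

enum toktype { TOK_NUMBER, TOK_IDENT, TOK_PLUS, TOK_DIV, TOK_OTHER, TOK_COUNT };

/* paste_risk[t][c] is true if a token of type t, followed immediately by a
   character c, might glue into a longer token when the stream is printed. */
static bool paste_risk[TOK_COUNT][UCHAR_MAX + 1];

static void init_table(void)
{
    /* A pp-number can be extended by digits, letters, '.', and by '+' or '-'
       (because of the e+/E-/p+/P- forms), so be conservative about it. */
    for (int c = 0; c <= UCHAR_MAX; c++)
        paste_risk[TOK_NUMBER][c] = isalnum(c) || c == '.' || c == '+' || c == '-';
    paste_risk[TOK_PLUS]['+'] = true;   /* "+" "+" would glue into "++" */
    paste_risk[TOK_PLUS]['='] = true;   /* "+" "=" would glue into "+=" */
    paste_risk[TOK_DIV]['*']  = true;   /* "/" "*" would open a comment  */
    paste_risk[TOK_DIV]['/']  = true;   /* "/" "/" would open a comment  */
    paste_risk[TOK_DIV]['=']  = true;   /* "/" "=" would glue into "/="  */
    /* ...and so on for the remaining token types. */
}

/* Emit a space between two adjacent output tokens only when gluing is possible. */
static bool needs_space(enum toktype prev, unsigned char next_first_char)
{
    return paste_risk[prev][next_first_char];
}
This is the conservative variant described above: it asks "could some token of this type be extended by this character?", so it emits a space after 40 even though 40 itself cannot be extended by +.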
Adding some historical context to rici's excellent answer.
If you can get your hands on a working copy of gcc 2.7.2.3, experiment with its preprocessor. At that time the preprocessor was a separate program from the compiler, and it used a very naive algorithm for text serialization, which tended to insert far more spaces than were necessary. When Neil Booth, Per Bothner and I implemented the integrated preprocessor (appearing in gcc 3.0 and since), we decided to make -E output a bit smarter at the same time, but without making the implementation too complicated. The core of this algorithm is the library function cpp_avoid_paste, defined at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libcpp/lex.c#l2990 , and its caller is here: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-ppoutput.c#l177 (look for "Subtle logic to output a space...").
In the case of your example
#define Y 20
#define A(x) (10+x+Y)
A(A(40))
cpp_avoid_paste will be called with a CPP_NUMBER token (what rici called a "pp-number") on the left, and a '+' token on the right. In this case it unconditionally says "yes, you need to insert a space to avoid pasting" rather than checking whether the last character of the number token is one of eEpP.
Compiler design often comes down to a trade-off between accuracy and implementation simplicity.

What am I missing with the ## operator

I get a compiler error with GCC when trying to preprocess some macros from some TI code that compiles fine with the TI compiler.
The macros in question are variations of
#define CHIP_FSET(Reg,Field,Val) _CHIP_##Reg##_FSET(##Field,Val)
and it is used in code like
CHIP_FSET(ST1_55, XF, CHIP_ST1_55_XF_OFF)
and when GCC gets a hold of that it says
error: pasting "(" and "XF" does not give a valid pre-processing token
It pre-processes successfully if I remove the ## in front of Field. If I am understanding the code correctly, the ## in front of Field seems irrelevant, because the expansion becomes a function call (or another macro call) that takes two parameters. So the ## is redundant; the original replacement will result in ..._FSET(Field,Val) anyway.
So what am I missing? Everything I could find on the ## pre-processor operator said it just sticks the text together, so the ## never did anything in the first place in this case.
What am I missing?
And why would GCC choke on it but the TI compiler allow it? I'm guessing the answer to that is something like "ambiguous part of the spec".
=========================
Update
I think the problem is that there is a host of nested macros that might not be completely resolved. What the compiler ends up with is invalid, so it spits the dummy at some point while processing them all.
I've managed to make the problem worse by filling in the missing macros and it has caused some others parts to break. Such are the joys of porting code between platforms and compilers I guess.
Thanks for the help.
No, the specs are not ambiguous. ## operates at the level of tokens. The two tokens that are pasted together are required to form a valid token again. ( doesn't form a token together with alphabetic characters, hence the error message.
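Concretely (a sketch based on the questioner's own finding that dropping the ## in front of Field fixes preprocessing): only the identifier pieces around Reg need pasting, so the macro can be written as
#define CHIP_FSET(Reg,Field,Val) _CHIP_##Reg##_FSET(Field,Val)
with which CHIP_FSET(ST1_55, XF, CHIP_ST1_55_XF_OFF) expands to _CHIP_ST1_55_FSET(XF, CHIP_ST1_55_XF_OFF), assuming the TI headers provide _CHIP_ST1_55_FSET (or a further macro of that name).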

What does '\' actually do in C?

As far as I know, \ in C just joins the next line as if there were no line break.
Consider the following code:
main(){\
return 0;
}
When I look at the pre-processed code (gcc -E), it shows
main(){return
0;
}
and not
main(){return 0;
}
What is the reason for this kind of behaviour? Also, how can I get the code I expected?
Yes, your expected result is the one required by the C and C++ standards. The backslash simply escapes the newline, i.e. the backslash-newline sequence is deleted.
GCC 4.2.1 from my OS X installation gives the expected result, as does Clang. Furthermore, adding a #define to the beginning and testing with
#define main(){\
return 0;
}
main()
yields the correct result
}
{return 0;
Perhaps gcc -E does some extra processing after preprocessing and before outputting it. In any case, the line break seen by the rest of the preprocessor seems to be in the right place. So it's a cosmetic bug.
UPDATE: According to the GCC FAQ, -E (or the default setting of the cpp command) attempts to put output tokens in roughly the same visual location as input tokens. To get "raw" output, specify -P as well. This fixes the observed issues.
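For example (assuming the questioner's file is named filename.c):
$ gcc -E -P filename.c
which, per the FAQ behaviour described above, should show the spliced line with return 0; on the same line as main(){, rather than spreading the tokens over two lines.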
Probably what happened:
In preserving visual appearance, tokens not separated by spaces are kept together.
Line splicing happens before spaces are identified for the above.
The { and return tokens are grouped into the same visual block.
0 follows a space and its location on the next line is duly noted.
PLUG: If this is really important to you, I have implemented my own preprocessor with correct implementation of both raw-preprocessed and whitespace-preserving "pretty" modes. Following this discussion I added line splices to the preserved whitespace. It's not really intended as a standalone tool, though. It's a testbed for a compiler framework which happens to be a fully compliant C++11 preprocessor library, which happens to have a miniature command-line driver. (The error messages are on par with GCC, or Clang, sans color, though.)
From K&R section A.12 Preprocessing:
A.12.2 Line Splicing
Lines that end with the backslash character \ are
folded by deleting the backslash and the following newline character.
This occurs before division into tokens.
It doesn't matter :/ The tokenizer will not see any difference. [1]
Update In response to the comments:
There seems to be a fair amount of confusion as to what the expected output of the preprocessor should be. My point is that the expectation /seems/ reasonable at a glance but doesn't actually need to be specified in this way for the output to be valid. The amount of whitespace present in the output is simply irrelevant to the parser. What matters is that the preprocessor should treat the continued line as one line while interpreting it.
In other words: the preprocessor is not a text transformation tool, it's a token manipulation tool.
If it matters to you, you're probably
using the preprocessor for something other than C/C++
treating C++ code as text, which is a ... code smell. (libclang and various less complete parser libraries come to mind).
[1] (The preprocessor is free to achieve the specified result in whichever way it sees fit. The result you are seeing is possibly the most efficient way the implementors have found to implement this particular transformation.)
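The splice genuinely matters only before tokenization, for example when continuing a long #define across physical lines, since a directive must occupy a single logical line. A minimal sketch (not from either answer):
#include <stdio.h>

/* The backslash-newline splices the two physical lines into one logical line,
   so this is a single #define directive. */
#define MAX(a, b) \
    ((a) > (b) ? (a) : (b))

int main(void)
{
    printf("%d\n", MAX(3, 7));   /* prints 7 */
    return 0;
}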

During C macro expansion, is there a special case for macros that would expand to "/*"?

Here's a relevant example. It's obviously not valid C, but I'm just dealing with the preprocessor here, so the code doesn't actually have to compile.
#define IDENTITY(x) x
#define PREPEND_ASTERISK(x) *x
#define PREPEND_SLASH(x) /x
IDENTITY(literal)
PREPEND_ASTERISK(literal)
PREPEND_SLASH(literal)
IDENTITY(*pointer)
PREPEND_ASTERISK(*pointer)
PREPEND_SLASH(*pointer)
Running gcc's preprocessor on it:
gcc -std=c99 -E macrotest.c
This yields:
(...)
literal
*literal
/literal
*pointer
**pointer
/ *pointer
Please note the extra space in the last line.
This looks like a feature to prevent macros from expanding to "/*" to me, which I'm sure is well-intentioned. But at a glance, I couldn't find anything pertaining to this behaviour in the C99 standard. Then again, I'm inexperienced at C. Can someone shed some light on this? Where is this specified? I would guess that a compiler adhering to C99 should not just insert extra spaces during macro expansion just because it would probably prevent programming mistakes.
The source code is already tokenized before macro expansion is performed by CPP.
So what you have is a / token and a * token, which will not be implicitly combined into a /* "token" (since /* is not really a preprocessing token, I put it in quotes).
If you use -E to output preprocessed source, CPP needs to insert a space in order to avoid the /* being read as a comment opener by a subsequent compiler pass.
The same feature prevents, e.g., two + signs coming from different macros from being combined into a ++ token on output.
The only way to really paste two preprocessor tokens together is with the ## operator:
#define P(x,y) x##y
...
P(foo,bar)
results in the token foobar
P(+,+)
results in the token ++, but
P(/,*)
is not valid since /* is not a valid preprocessor token.
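A compilable sketch of the two valid cases (the variable names are invented):
#define P(x,y) x##y
int foobar = 5;
int main(void)
{
    int i = P(foo, bar);   /* pastes to the identifier foobar, so i == 5 */
    i P(+, +);             /* pastes to the ++ operator, i.e. i++ */
    return i;              /* returns 6 */
}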
The behavior of the pre-processor is standardized. In the summary at http://en.wikipedia.org/wiki/C_preprocessor , the results you are observing are the effect of:
"3: Tokenization - The preprocessor breaks the result into preprocessing tokens and whitespace. It replaces comments with whitespace".
This takes place before:
"4: Macro Expansion and Directive Handling".
