Preprocessing Tokens: '- -' vs. '--'

Why does the (GCC) preprocessor create two tokens, - -B, instead of a single one, --B, in the following example? What is the logic by which the former is correct and the latter is not?
#define A -B
-A
Output according to gcc -E:
- -B
After all, -- is a valid operator, so theoretically a valid token as well.
Is this specific to the GCC preprocessor or does this follow from the C standards?

The preprocessor works on tokens, not strings. Macro substitution without ## cannot create a new token, so if the preprocessor output goes to a text file rather than straight into the compiler, preprocessors insert whitespace so that the output text can be used as C input again without changed semantics.
The space insertion doesn't seem to be in the standard, but then the standard describes the preprocessor as working on tokens and as feeding its output to the compiler proper, not a text file.

Focusing on the white space insertion is missing the issue.
The macro A is defined as the sequence of preprocessing tokens - and B.
When the compiler parses a fragment of source code -A, it produces 2 tokens - and A. A is expanded as part of the preprocessing phase and the tokens are converted to C tokens: -, - and B.
If B is itself defined as a macro (#define B 4), A would expand to -, -, 4, which is parsed as an expression evaluating to the value 4 with type int.
gcc -E produces text. For the text to convert back to the same sequence of tokens as the original source code, a space needs to be inserted between the two - tokens to prevent -- from being parsed as a single token.
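To make the semantic difference concrete, here is a minimal sketch (the variable B, the function and its return value are illustrative additions of mine, not part of the question):

#define A -B

int main(void) {
    int B = 4;
    int x = -A;   /* the preprocessor produces the tokens - - B, so this is -(-B) == 4 */
    /* Had the -E output been written as --B, it would instead be parsed as a
       pre-decrement of B: a different token and a different program.                  */
    return x;
}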

Related

Spaces inserted by the C preprocessor

Suppose we are given this input C code:
#define Y 20
#define A(x) (10+x+Y)
A(A(40))
gcc -E outputs this: (10+(10+40 +20)+20).
gcc -E -traditional-cpp outputs this: (10+(10+40+20)+20).
Why does the default cpp insert the space after 40?
Where can I find the most detailed specification of cpp that covers this logic?
The C standard doesn't specify this behaviour, since the output of the preprocessing phase is simply a stream of tokens and whitespace. Serializing the stream of tokens back into a character string, which is what gcc -E does, is not required or even mentioned by the standard, and does not form part of the translation process specified by the standard.
In phase 3, the program "is decomposed into preprocessing tokens and sequences of white-space characters." Aside from the result of the concatenation operator, which ignores whitespace, and the stringification operator, which preserves whitespace, tokens are then fixed and whitespace is no longer needed to separate them. However, the whitespace is needed in order to:
parse preprocessor directives
correctly process the stringification operator
The whitespace elements in the stream are not eliminated until phase 7, although they are no longer relevant after phase 4 concludes.
Gcc is capable of producing a variety of information useful to programmers, but not corresponding to anything in the standard. For example, the preprocessor phase of the translation can also produce dependency information useful for inserting into a Makefile, using one of the -M options. Alternatively, a human-readable version of the compiled code can be output using the -S option. And a compilable version of the preprocessed program, roughly corresponding to the token stream produced by phase 4, can be output using the -E option. None of these output formats are in any way controlled by the C standard, which is only concerned with actually executing the program.
In order to produce the -E output, gcc must serialize the stream of tokens and whitespace in a format which does not change the semantics of the program. There are cases in which two consecutive tokens in the stream would be incorrectly glued together into a single token if they are not separated from each other, so gcc must take some precautions. It cannot actually insert whitespace into the stream being processed, but nothing stops it from adding whitespace when it presents the stream in response to gcc -E.
For example, if the macro invocation in your example were modified to
A(A(0x40E))
then naive output of the token stream would result in
(10+(10+0x40E+20)+20)
which could not be compiled because 0x40E+20 is a single pp-number token which cannot be converted into a numeric token. The space before the + prevents this from happening.
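As an aside (my own illustration, not from the answer), the same pp-number rule bites even in hand-written source, because E followed by a sign extends the pp-number:

int bad = 0x40E+20;    /* a single pp-number "0x40E+20"; rejected by the compiler (e.g. as an invalid suffix) */
int ok  = 0x40E + 20;  /* three separate tokens; compiles fine                                                 */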
If you attempt to implement a preprocessor as some kind of string transformation, you will undoubtedly confront serious issues in the corner cases. The correct implementation strategy is to tokenize first, as indicated in the standard, and then perform phase 4 as a function on a stream of tokens and whitespace.
Stringification is a particularly interesting case where whitespace affects semantics, and it can be used to see what the actual token stream looks like. If you stringify the expansion of A(A(40)), you can see that no whitespace was actually inserted:
$ gcc -E -x c - <<<'
#define Y 20
#define A(x) (10+x+Y)
#define Q_(x) #x
#define Q(x) Q_(x)
Q(A(A(40)))'
"(10+(10+40+20)+20)"
The handling of whitespace in stringification is precisely specified by the standard (§6.10.3.2, paragraph 2; many thanks to John Bollinger for finding the specification):
Each occurrence of white space between the argument’s preprocessing tokens
becomes a single space character in the character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted.
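To see that rule in action, here is a quick sketch of my own (the Q_ macro matches the stringification helper used elsewhere in this answer; the spaced-out argument is mine):

#define Q_(x) #x

/* Interior runs of whitespace collapse to a single space; leading and trailing
   whitespace around the argument is deleted.                                   */
const char *s = Q_(   a   +   b   );   /* s points to "a + b" */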
Here is a more subtle example where additional whitespace is required in the gcc -E output, but is not actually inserted into the token stream (again shown by using stringification to produce the real token stream.) The I (identity) macro is used to allow two tokens to be inserted into the token stream without intervening whitespace; that's a useful trick if you want to use macros to compose the argument to the #include directive (not recommended, but it can be done).
Maybe this could be a useful test case for your preprocessor:
#define Q_(x) #x
#define Q(x) Q_(x)
#define I(x) x
#define C(x,...) x(__VA_ARGS__)
// Uncomment the following line to run the program
//#include <stdio.h>
char*quoted=Q(C(I(int)I(main),void){I(return)I(C(puts,quoted));});
C(I(int)I(main),void){I(return)I(C(puts,quoted));}
Here's the output of gcc -E (just the good stuff at the end):
$ gcc -E squish.c | tail -n2
char*quoted="intmain(void){returnputs(quoted);}";
int main(void){return puts(quoted);}
In the token stream which is passed out of phase 4, the tokens int and main are not separated by whitespace (and neither are return and puts). That's clearly shown by the stringification, in which no whitespace separates the tokens. However, the program compiles and executes fine, even when it is explicitly passed through gcc -E and the output of gcc -E is then compiled:
$ gcc -E squish.c | gcc -x c - && ./a.out
intmain(void){returnputs(quoted);}
Different compilers and different versions of the same compiler may produce different serializations of a preprocessed program. So I don't think you will find any algorithm which is testable with a character-by-character comparison with the -E output of a given compiler.
The simplest possible serialization algorithm would be to unconditionally output a space between two consecutive tokens. Obviously, that would output unnecessary spaces, but it would never syntactically alter the program.
I think the minimal space algorithm would be to record the DFA state at the end of the last character in a token so that you can later output a space between two consecutive tokens if there exists a transition from the state at the end of the first token on the first character of the following token. (Keeping the DFA state as part of the token is not intrinsically different from keeping the token type as part of the token, since you can derive the token type from a simple lookup from the DFA state.) That algorithm would not insert a space after 40 in your original test case, but it would insert a space after 0x40E. So it is not the algorithm being used by your version of gcc.
If you use the above algorithm, you will need to rescan tokens created by token concatenation. However, that is necessary anyway, because you need to flag an error if the result of the concatenation is not a valid preprocessing token.
If you don't want to record states (although, as I said, there is essentially no cost in doing so) and you don't want to regenerate the state by rescanning the token as you output it (which would also be quite cheap), you could precompute a two-dimensional boolean array keyed by token type and following character. The computation would essentially be the same as the above: for every accepting DFA state which returns a particular token type, enter a true value in the array for that token type and any character with a transition out of the DFA state. Then you can look up the token type of a token and the first character of the following token to see if a space may be necessary. This algorithm does not produce a minimally-spaced output: it would, for example, put a space after the 40 in your example, since 40 is a pp-number and it is possible for some pp-number to be extended with a + (even though you cannot extend 40 in that way). So it's possible that gcc uses some version of this algorithm.
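Here is a rough sketch of that table-driven variant, under my own hypothetical names (pp_kind, paste_risk, needs_space); a real preprocessor would derive the table from its lexer's DFA rather than filling it in by hand:

#include <stdbool.h>

/* Hypothetical token kinds; a real preprocessor distinguishes many more. */
enum pp_kind { PP_NUMBER, PP_IDENT, PP_PLUS, PP_MINUS, PP_KIND_MAX };

/* paste_risk[kind][c] is true if some token of that kind could be extended
   by the character c, i.e. gluing the two would change token boundaries.   */
static bool paste_risk[PP_KIND_MAX][256];

static void init_paste_table(void) {
    /* A few hand-filled entries standing in for the full DFA computation.  */
    paste_risk[PP_NUMBER]['+'] = true;   /* 1e+5, 0x40E+20, ...             */
    paste_risk[PP_NUMBER]['-'] = true;
    paste_risk[PP_NUMBER]['0'] = true;   /* digits, letters and '.' as well */
    paste_risk[PP_IDENT]['_']  = true;   /* foo _bar                        */
    paste_risk[PP_IDENT]['b']  = true;   /* any identifier character        */
    paste_risk[PP_MINUS]['-']  = true;   /* - -  vs  --                     */
    paste_risk[PP_PLUS]['+']   = true;   /* + +  vs  ++                     */
}

/* Decide whether a space must be emitted between two adjacent output tokens. */
static bool needs_space(enum pp_kind left_kind, unsigned char next_char) {
    return paste_risk[left_kind][next_char];
}

As described above, this conservative check keys on the token type rather than the token's final DFA state, so it emits a space after any pp-number followed by '+', not just after ones ending in e, E, p or P.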
Adding some historical context to rici's excellent answer.
If you can get your hands on a working copy of gcc 2.7.2.3, experiment with its preprocessor. At that time the preprocessor was a separate program from the compiler, and it used a very naive algorithm for text serialization, which tended to insert far more spaces than were necessary. When Neil Booth, Per Bothner and I implemented the integrated preprocessor (appearing in gcc 3.0 and since), we decided to make -E output a bit smarter at the same time, but without making the implementation too complicated. The core of this algorithm is the library function cpp_avoid_paste, defined at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libcpp/lex.c#l2990 , and its caller is here: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-ppoutput.c#l177 (look for "Subtle logic to output a space...").
In the case of your example
#define Y 20
#define A(x) (10+x+Y)
A(A(40))
cpp_avoid_paste will be called with a CPP_NUMBER token (what rici called a "pp-number") on the left, and a '+' token on the right. In this case it unconditionally says "yes, you need to insert a space to avoid pasting" rather than checking whether the last character of the number token is one of eEpP.
Compiler design often comes down to a trade-off between accuracy and implementation simplicity.

Expansion of function-like macro creates a separate token

I just found out that gcc seems to treat the result of the expansion of a function-like macro as a separate token. Here is a simple example showing the behavior of gcc:
#define f() foo
void f()_bar(void);
void f()bar(void);
void f()-bar(void);
When I execute gcc -E -P test.c (running just the preprocessor), I get the following output:
void foo _bar(void);
void foo bar(void);
void foo-bar(void);
It seems like, in the first two definitions, gcc inserts a space after the expanded macro to ensure it is a separate token. Is that really what is happening here?
Is this mandated by any standard (I couldn't find documentation on the topic)?
I want to make _bar part of the same token. Is there any way to do this? I could use the token concatenation operator ## but it will require several levels of macros (since in the real code f() is more complex). I was wondering if there is a simple (and probably more readable) solution.
It seems like, in the first two definitions, gcc inserts a space after the expanded macro to ensure it is a separate token. Is that really what is happening here?
Yes.
Is this mandated by any standard (I couldn't find documentation on the topic)?
Yes, although an implementation would be allowed to insert more than one whitespace character to separate the tokens.
f()_bar
here you have 4 tokens after lexical analysis (they are actually preprocessing tokens at this stage, but let's call them tokens): f, (, ) and _bar.
The function-like macro replacement semantics (as defined in C11, 6.10.3) require replacing the 3 tokens f, (, ) with a new one, foo. The replacement is not allowed to work on other tokens or to change the last token, _bar. For this, the implementation has to insert at least one whitespace character to preserve the _bar token. Otherwise the result would have been foo_bar, which is a single token.
gcc preprocessor somewhat documents it here:
Once the input file is broken into tokens, the token boundaries never change, except when the ‘##’ preprocessing operator is used to paste tokens together. See Concatenation. For example,
#define foo() bar
foo()baz
==> bar baz
not
==> barbaz
In the other case, like f()-bar, there are 5 tokens: f, (, ), - and bar. (- is a punctuator token in C, whereas the _ in _bar is simply a character of the identifier token.) The implementation does not have to insert a token separator (whitespace) here, because after macro replacement - and bar are still considered two separate tokens by the C syntax.
The gcc preprocessor (cpp) does not insert whitespace here simply because it does not have to. The cpp documentation on token spacing says (about a different issue):
However, we would like to keep space insertion to a minimum, both for aesthetic reasons and because it causes problems for people who still try to abuse the preprocessor for things like Fortran source and Makefiles.
I didn't address the solution to your issue in this answer, but I think you have to use the operator explicitly provided for concatenating tokens: the ## token-pasting operator.
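For what it's worth, here is a sketch of that approach applied to the question's example; the CONCAT / CONCAT_ helper names are mine, and the extra level of indirection is what lets f() expand before the paste happens:

#define f() foo

#define CONCAT_(a, b) a##b
#define CONCAT(a, b)  CONCAT_(a, b)   /* arguments expand here, before ## is applied */

void CONCAT(f(), _bar)(void);         /* expands to: void foo_bar(void);             */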
The only way I can think of (if you cannot use the token concatenation operator ##) is to use the traditional (pre-standard) C preprocessor:
gcc -E -P -traditional-cpp test.c
Output:
void foo_bar(void);
void foobar(void);
void foo-bar(void);

#line and string literal concatenation

Given this piece of C code:
char s[] =
"start"
#ifdef BLAH
"mid"
#endif
"end";
what should the output of the preprocessor be? In other words, what should the actual compiler receive and be able to handle? To narrow the possibilities, let's stick to C99.
I'm seeing that some preprocessors output this:
#line 1 "tst00.c"
char s[] =
"start"
#line 9
"end";
or this:
# 1 "tst00.c"
char s[] =
"start"
# 7 "tst00.c"
"end";
gcc -E outputs this:
# 1 "tst00.c"
# 1 "<command-line>"
# 1 "tst00.c"
char s[] =
"start"
"end";
And gcc is perfectly fine compiling all of the above preprocessed code even with the -fpreprocessed option, meaning that no further preprocessing should be done as all of it has been done already.
The confusion stems from this wording of the 1999 C standard:
5.1.1.2 Translation phases
1 The precedence among the syntax rules of translation is specified by the following phases.
...
4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. ... All preprocessing directives are then deleted.
...
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
In other words, is it legal for the #line directive to appear between adjacent string literals? If it is, it means that the actual compiler must do another round of string literal concatenation, but that's not mentioned in the standard.
Or are we simply dealing with non-standard compiler implementations, gcc included?
The #line or # 1 lines you get from GCC -E (or a compatible tool) are added for the sake of human readers and any tools that might attempt to work with a text form of the output of the preprocessor. They are just for convenience.
In general, yes, directives may appear between concatenated string literal tokens. #line is no different from #ifdef in your example.
Or are we simply dealing with non-standard compiler implementations, gcc included?
-E and -fpreprocessed modes are not standardized. A standard preprocessor always feeds its output into a compiler, not a text file. Moreover:
The output of the preprocessor has no standard textual representation.
The reason for inserting #line directives is so that any __LINE__ and __FILE__ macros that you might insert into the already-preprocessed file, before preprocessing it again, will expand correctly. When compiling such a file, the compiler may also notice and use those values when reporting errors. Usage of "preprocessed text files" is nonstandard and generally discouraged.
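A small sketch of that effect (the file name "virtual.c" and the line number are arbitrary choices of mine, not from the answer):

#include <stdio.h>

int main(void) {
    /* The marker below behaves like the ones gcc -E emits: it resets the
       line counter and the reported file name for everything that follows. */
#line 100 "virtual.c"
    printf("%s:%d\n", __FILE__, __LINE__);   /* prints: virtual.c:100 */
    return 0;
}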

Working of the C Preprocessor

How does the following piece of code work; in other words, what is the algorithm of the C preprocessor? Does this work on all compilers?
#include <stdio.h>
#define b a
#define a 170
int main() {
    printf("%i", b);
    return 0;
}
The preprocessor just replaces b with a wherever it finds it in the program, and then replaces a with 170. It is just plain textual replacement.
Works on gcc.
It's at §6.10.3 (Macro Replacement):
6.10.3.4 Rescanning and further replacement
1 After all parameters in the replacement list have been substituted and # and ## processing has taken place, all placemarker preprocessing tokens are removed. The resulting preprocessing token sequence is then rescanned, along with all subsequent preprocessing tokens of the source file, for more macro names to replace.
Further paragraphs state some complementary rules and exceptions, but this is basically it.
Though it may violate some definitions of "single pass", it's very useful, like the recursive preprocessing of included files (§5.1.1.2p4).
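As a hedged illustration (fed through gcc -E; the x macro is my own addition), rescanning keeps replacing until nothing is left to replace, except that a macro is never re-expanded inside its own expansion:

#define b a
#define a 170
#define x x + 1   /* self-referencing macro */

b   /* expanded to a, rescanned, and replaced again: the output is 170          */
x   /* expanded to x + 1; the inner x is not replaced again (6.10.3.4p2), so
       the output is x + 1 rather than an endless expansion                     */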
This simple replacement (first b with a and then a with 170) should work with any compiler.
You should be careful with more complicated cases (usually involving stringification '#' and token concatenation '##'), as there are corner cases handled differently at least by MSVC and gcc.
If in doubt, you can always check the ISO standard (a draft is available online) to see how things are supposed to work :). Section 6.10.3 is the most relevant in your case.
The preprocessor just replaces the symbols sequentially wherever they appear. The order of the definitions does not matter in this case: b is replaced by a first, and the printf statement becomes
printf("%i", a);
and after a is replaced by 170, it becomes
printf("%i", 170);
If the order of the definitions were reversed, i.e.
#define a 170
#define b a
the result would be the same. Replacement lists are not expanded when a macro is defined, only when it is used, so the second definition still reads #define b a; at the point of use, b is replaced by a, the result is rescanned, and a is replaced by 170. So, finally, the printf statement again becomes
printf("%i",170);
This works with any conforming compiler.
To see the details for yourself, you can run gcc -E and inspect the preprocessor output, which should clear up any doubt.
#define does not assign a value to a keyword; it defines a macro name that the preprocessor replaces with its replacement list wherever the name is used.
Here, 'b' expands to 'a', and 'a' in turn expands to '170'. For simplicity, the chain can be pictured as follows:
b -> a -> 170
Either order of the two #defines ends up defining the same thing.
I think you are trying to find out how the source code is processed by the compiler. To know that exactly, you have to go through the translation phases. The general steps followed by every compiler are below (I tried to give every detail, gathered from different blogs and websites):
First Step by Compiler - Physical source file characters are mapped to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
Second Step by Compiler - Each instance of a new-line character and an immediately preceding backslash character is deleted, splicing physical source lines to form logical source lines. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character.
Third Step by Compiler - The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of other white-space characters is retained or replaced by one space character is implementation-defined.
Fourth Step by Compiler - Preprocessing directives are executed and macro invocations are expanded. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively.
Fifth Step by Compiler - Each escape sequence in character constants and string literals is converted to a member of the execution character set.
Sixth Step by Compiler - Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.
Seventh Step by Compiler - White-space characters separating tokens are no longer significant. Preprocessing tokens are converted into tokens. The resulting tokens are syntactically and semantically analyzed and translated.
Last Step - All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

During C macro expansion, is there a special case for macros that would expand to "/*"?

Here's a relevant example. It's obviously not valid C, but I'm just dealing with the preprocessor here, so the code doesn't actually have to compile.
#define IDENTITY(x) x
#define PREPEND_ASTERISK(x) *x
#define PREPEND_SLASH(x) /x
IDENTITY(literal)
PREPEND_ASTERISK(literal)
PREPEND_SLASH(literal)
IDENTITY(*pointer)
PREPEND_ASTERISK(*pointer)
PREPEND_SLASH(*pointer)
Running gcc's preprocessor on it:
gcc -std=c99 -E macrotest.c
This yields:
(...)
literal
*literal
/literal
*pointer
**pointer
/ *pointer
Please note the extra space in the last line.
This looks like a feature to prevent macros from expanding to "/*" to me, which I'm sure is well-intentioned. But at a glance, I couldn't find anything pertaining to this behaviour in the C99 standard. Then again, I'm inexperienced at C. Can someone shed some light on this? Where is this specified? I would guess that a compiler adhering to C99 should not insert extra spaces during macro expansion merely because doing so would probably prevent programming mistakes.
The source code is already tokenized before being processed by CPP.
So what you have is a / token and a * token, which will not be implicitly combined into a /* "token" (since /* is not really a preprocessor token, I put it in quotes).
If you use -E to output preprocessed source, CPP needs to insert a space in order to keep /* from being read as a comment opener by a subsequent compiler pass.
The same feature prevents, for example, two + signs coming from different macros from being combined into a ++ token on output.
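A small illustration of that '+ +' case (the PLUS macro and the variables are mine, not from the answer):

#define PLUS +

int main(void) {
    int i = 1, j = 2;
    int k = i PLUS PLUS j;   /* token stream: i + + j, i.e. i + (+j) == 3 */
    /* gcc -E prints "i + + j"; written back without the space it would
       read as i ++ j, which does not even parse.                         */
    return k;
}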
The only way to really paste two preprocessor tokens together is with the ## operator:
#define P(x,y) x##y
...
P(foo,bar)
results in the token foobar
P(+,+)
results in the token ++, but
P(/,*)
is not valid since /* is not a valid preprocessor token.
The behavior of the pre-processor is standardized. In the summary at http://en.wikipedia.org/wiki/C_preprocessor, the results you are observing are the effect of:
"3: Tokenization - The preprocessor breaks the result into preprocessing tokens and whitespace. It replaces comments with whitespace".
This takes place before:
"4: Macro Expansion and Directive Handling".

Resources