Spaces inserted by the C preprocessor

Suppose we are given this input C code:
#define Y 20
#define A(x) (10+x+Y)
A(A(40))
gcc -E outputs this: (10+(10+40 +20)+20).
gcc -E -traditional-cpp outputs this: (10+(10+40+20)+20).
Why does the default cpp insert a space after 40?
Where can I find the most detailed specification of cpp that covers this logic?

The C standard doesn't specify this behaviour, since the output of the preprocessing phase is simply a stream of tokens and whitespace. Serializing that stream of tokens back into a character string, which is what gcc -E does, is not required or even mentioned by the standard, and does not form part of the translation process specified by the standard.
In phase 3, the program "is decomposed into preprocessing tokens and sequences of white-space characters." Aside from the result of the concatenation operator, which ignores whitespace, and the stringification operator, which preserves whitespace, tokens are then fixed and whitespace is no longer needed to separate them. However, the whitespace is needed in order to:
parse preprocessor directives
correctly process the stringification operator
The whitespace elements in the stream are not eliminated until phase 7, although they are no longer relevant after phase 4 concludes.
Gcc is capable of producing a variety of information useful to programmers, but not corresponding to anything in the standard. For example, the preprocessor phase of the translation can also produce dependency information useful for inserting into a Makefile, using one of the -M options. Alternatively, a human-readable version of the compiled code can be output using the -S option. And a compilable version of the preprocessed program, roughly corresponding to the token stream produced by phase 4, can be output using the -E option. None of these output formats are in any way controlled by the C standard, which is only concerned with actually executing the program.
In order to produce the -E output, gcc must serialize the stream of tokens and whitespace in a format which does not change the semantics of the program. There are cases in which two consecutive tokens in the stream would be incorrectly glued together into a single token if they are not separated from each other, so gcc must take some precautions. It cannot actually insert whitespace into the stream being processed, but nothing stops it from adding whitespace when it presents the stream in response to gcc -E.
For example, if the macro invocation in your example were modified to
A(A(0x40E))
then naive output of the token stream would result in
(10+(10+0x40E+20)+20)
which could not be compiled because 0x40E+20 is a single pp-number token which cannot be converted into a numeric token. The space before the + prevents this from happening.
If you attempt to implement a preprocessor as some kind of string transformation, you will undoubtedly confront serious issues in the corner cases. The correct implementation strategy is to tokenize first, as indicated in the standard, and then perform phase 4 as a function on a stream of tokens and whitespace.
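For instance, one way to realise that strategy (a hedged sketch of the data structure idea, not GCC's internal representation; the names are illustrative) is to make the unit of processing a token that records its own spelling and whether whitespace preceded it, rather than a flat string:

enum pp_kind { PP_IDENT, PP_NUMBER, PP_PUNCT, PP_STRING, PP_CHAR, PP_OTHER };

struct pp_token {
    enum pp_kind kind;       /* which preprocessing-token category */
    const char  *spelling;   /* the exact characters of the token */
    int          ws_before;  /* nonzero if whitespace preceded it */
};

Phase 4 then becomes a function from one stream of struct pp_token to another, and the -E serializer is a separate, purely cosmetic pass over that stream.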
Stringification is a particularly interesting case where whitespace affects semantics, and it can be used to see what the actual token stream looks like. If you stringify the expansion of A(A(40)), you can see that no whitespace was actually inserted:
$ gcc -E -x c - <<<'
#define Y 20
#define A(x) (10+x+Y)
#define Q_(x) #x
#define Q(x) Q_(x)
Q(A(A(40)))'
"(10+(10+40+20)+20)"
The handling of whitespace in stringification is precisely specified by the standard: (§6.10.3.2, paragraph 2, many thanks to John Bollinger for finding the specification.)
Each occurrence of white space between the argument’s preprocessing tokens
becomes a single space character in the character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted.
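For instance, a minimal illustration of that rule (using the same stringification macros as in the session above):

#define Q_(x) #x
#define Q(x) Q_(x)

const char *s = Q(   a   +    b   );   /* s is "a + b" */

The runs of interior whitespace collapse to single spaces, and the leading and trailing whitespace is deleted.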
Here is a more subtle example where additional whitespace is required in the gcc -E output, but is not actually inserted into the token stream (again shown by using stringification to produce the real token stream). The I (identity) macro is used to allow two tokens to be inserted into the token stream without intervening whitespace; that's a useful trick if you want to use macros to compose the argument to the #include directive (not recommended, but it can be done).
Maybe this could be a useful test case for your preprocessor:
#define Q_(x) #x
#define Q(x) Q_(x)
#define I(x) x
#define C(x,...) x(__VA_ARGS__)
// Uncomment the following line to run the program
//#include <stdio.h>
char*quoted=Q(C(I(int)I(main),void){I(return)I(C(puts,quoted));});
C(I(int)I(main),void){I(return)I(C(puts,quoted));}
Here's the output of gcc -E (just the good stuff at the end):
$ gcc -E squish.c | tail -n2
char*quoted="intmain(void){returnputs(quoted);}";
int main(void){return puts(quoted);}
In the token stream which is passed out of phase 4, the tokens int and main are not separated by whitespace (and neither are return and puts). That's clearly shown by the stringification, in which no whitespace separates the tokens. However, the program compiles and executes fine, even if it is first run explicitly through gcc -E and the output of that is then compiled:
$ gcc -E squish.c | gcc -x c - && ./a.out
intmain(void){returnputs(quoted);}
Different compilers and different versions of the same compiler may produce different serializations of a preprocessed program. So I don't think you will find any algorithm which is testable with a character-by-character comparison with the -E output of a given compiler.
The simplest possible serialization algorithm would be to unconditionally output a space between two consecutive tokens. Obviously, that would output unnecessary spaces, but it would never syntactically alter the program.
I think the minimal space algorithm would be to record the DFA state at the end of the last character in a token so that you can later output a space between two consecutive tokens if there exists a transition from the state at the end of the first token on the first character of the following token. (Keeping the DFA state as part of the token is not intrinsically different from keeping the token type as part of the token, since you can derive the token type from a simple lookup from the DFA state.) That algorithm would not insert a space after 40 in your original test case, but it would insert a space after 0x40E. So it is not the algorithm being used by your version of gcc.
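That idea can be sketched as a tiny predicate, assuming a hypothetical lexer helper dfa_has_transition(state, ch) that reports whether the scanner, sitting in state, could consume ch:

#include <stdbool.h>

/* Hypothetical helper: true if the lexer DFA has a transition on ch
   out of the given state. */
bool dfa_has_transition(int state, char ch);

/* Emit a space only when gluing the tokens together could extend the
   left token, i.e. the DFA state reached at the end of the left token
   can consume the first character of the right token. */
static bool need_space(int left_end_state, char right_first)
{
    return dfa_has_transition(left_end_state, right_first);
}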
If you use the above algorithm, you will need to rescan tokens created by token concatenation. However, that is necessary anyway, because you need to flag an error if the result of the concatenation is not a valid preprocessing token.
If you don't want to record states (although, as I said, there is essentially no cost in doing so) and you don't want to regenerate the state by rescanning the token as you output it (which would also be quite cheap), you could precompute a two-dimensional boolean array keyed by token type and following character. The computation would essentially be the same as the above: for every accepting DFA state which returns a particular token type, enter a true value in the array for that token type and any character with a transition out of the DFA state. Then you can look up the token type of a token and the first character of the following token to see if a space may be necessary. This algorithm does not produce a minimally-spaced output: it would, for example, put a space after the 40 in your example, since 40 is a pp-number and it is possible for some pp-number to be extended with a + (even though you cannot extend 40 in that way). So it's possible that gcc uses some version of this algorithm.
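A rough sketch of that coarser, token-type-keyed test (written as a switch rather than a precomputed array; the token kinds and the conservative character sets are my own choices, not GCC's actual tables):

#include <ctype.h>
#include <stdbool.h>

enum tok_kind { TOK_NUMBER, TOK_IDENT, TOK_MINUS, TOK_PLUS, TOK_OTHER };

/* True if a token of kind `left`, followed immediately by a token whose
   first character is `right_first`, might paste into a longer token. */
static bool avoid_paste(enum tok_kind left, char right_first)
{
    switch (left) {
    case TOK_NUMBER:   /* a pp-number could continue with a digit, a letter,
                          '_', '.', or a sign after e/E/p/P, so be cautious */
        return isalnum((unsigned char)right_first) || right_first == '_' ||
               right_first == '.' || right_first == '+' || right_first == '-';
    case TOK_IDENT:    /* identifier followed by an identifier character */
        return isalnum((unsigned char)right_first) || right_first == '_';
    case TOK_MINUS:    /* would form --, -=, or -> */
        return right_first == '-' || right_first == '=' || right_first == '>';
    case TOK_PLUS:     /* would form ++ or += */
        return right_first == '+' || right_first == '=';
    default:
        return false;
    }
}

With a conservative test like this, 40 followed by + gets a space, which matches the behaviour in the question.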

Adding some historical context to rici's excellent answer.
If you can get your hands on a working copy of gcc 2.7.2.3, experiment with its preprocessor. At that time the preprocessor was a separate program from the compiler, and it used a very naive algorithm for text serialization, which tended to insert far more spaces than were necessary. When Neil Booth, Per Bothner and I implemented the integrated preprocessor (appearing in gcc 3.0 and since), we decided to make -E output a bit smarter at the same time, but without making the implementation too complicated. The core of this algorithm is the library function cpp_avoid_paste, defined at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libcpp/lex.c#l2990 , and its caller is here: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-ppoutput.c#l177 (look for "Subtle logic to output a space...").
In the case of your example
#define Y 20
#define A(x) (10+x+Y)
A(A(40))
cpp_avoid_paste will be called with a CPP_NUMBER token (what rici called a "pp-number") on the left, and a '+' token on the right. In this case it unconditionally says "yes, you need to insert a space to avoid pasting" rather than checking whether the last character of the number token is one of eEpP.
Compiler design often comes down to a trade-off between accuracy and implementation simplicity.

Related

Why should the controlled group in a conditional inclusion be lexically valid when the conditional is false?

The following program compiles:
// #define WILL_COMPILE
#ifdef WILL_COMPILE
int i =
#endif
int main()
{
return 0;
}
But the following will issue a warning:
//#define WILL_NOT_COMPILE
#ifdef WILL_NOT_COMPILE
char* s = "failure
#endif
int main()
{
return 0;
}
I understand that in the first example, the controlled group is removed by the time the compilation phase of the translation is reached. So it compiles without errors or warnings.
But why is lexical validity required in the second example when the controlled group is not going to be included?
Searching online I found this quote:
Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended.
But this does not state why the lexical validity is checked when the conditional fails.
Have I missed something here?
In translation phase 3 the preprocessor generates preprocessing tokens, and having a " end up in the catch-all category (a non-white-space character that cannot be one of the above) is undefined behavior.
See C11 6.4 Lexical elements p3:
A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories.69) If a ' or a " character matches the last category, the behavior is
undefined. ....
For reference, the grammar of preprocessing-token is:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
The unmatched " in your second example falls into that last category, non-white-space character that cannot be one of the above.
Since this is undefined behavior and not a constraint, the compiler is not obliged to diagnose it, but it is certainly allowed to, and with -pedantic-errors it even becomes an error. As rici points out, it only becomes a constraint violation if the token survives preprocessing.
The gcc document you cite basically says the same thing:
... Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended. ...
"Why is [something about C] the way it is?" questions can't usually be answered, because none of the people who wrote the 1989 C standard are here to answer questions [as far as I know, anyway] and if they were here, it was nearly thirty years ago and they probably don't remember.
However, I can think of a plausible reason why the contents of skipped conditional groups are required to consist of a valid sequence of preprocessing tokens. Observe that comments are not required to consist of a valid sequence of preprocessing tokens:
/* this comment's perfectly fine even though it has an unclosed
character literal inside */
Observe also that it is really simple to scan for the end of a comment: for /* you look for the next */, and for // you look for the end of the line. The only complication is that trigraphs and backslash-newline are supposed to be converted first. Tokenizing the contents of comments would be extra code to no useful purpose.
By contrast, it is not simple to scan for the end of a skipped conditional group, because conditional groups nest. You have to be looking for #if, #ifdef, and #ifndef as well as #else and #endif, and counting your depth. And all of those directives are lexically defined in terms of preprocessor tokens, because that's the most natural way to look for them when you're not in a skipped conditional group. Requiring skipped conditional groups to be tokenizable allows the preprocessor to use the same code to process directives within skipped conditional groups as it does elsewhere.
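For example, in the fragment below (illustrative only) the preprocessor has to recognise the inner directives even though the whole group is skipped; otherwise it would stop at the wrong #endif:

#if 0
  #if defined(FOO)   /* inner conditional, still inside the skipped group */
    int x = 1;
  #endif             /* closes the inner #if */
#endif               /* only this one ends the skipped group */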
By default, GCC issues only a warning when it encounters an un-tokenizable line inside a skipped conditional group, an error elsewhere:
#if 0
"foo
#endif
"bar
gives me
test.c:2:1: warning: missing terminating " character
"foo
^
test.c:4:1: error: missing terminating " character
"bar
^~~~
This is an intentional leniency, possibly one I introduced myself (it's only been twenty years since I wrote a third of GCC's current preprocessor, but I have still forgotten a lot of the details). You see, the original C preprocessor, the one K and R wrote, did allow arbitrary nonsense inside skipped conditional groups, because it wasn't built around the concept of tokens in the first place; it transformed text into other text. So people would put comments between #if 0 and #endif instead of /* and */, and naturally enough those comments would sometimes contain apostrophes. So, when Per Bothner and Neil Booth and Chiaki Ishikawa and I replaced GCC's original "C-Compatible Compiler Preprocessor"1 with the integrated, fully standards-compliant "cpplib", circa GCC 3.0, we felt we needed to cut a little compatibility slack here.
1 Raise your hand if you're old enough to know why RMS thought this name was funny.
The description of Translation phase 3 (C11 5.1.1.2/3), which happens before preprocessing directives are actioned:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
And the grammar for preprocessing-token is:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Note in particular that a string-literal is a single preprocessing-token. The subsequent description (C11 6.4/3) clarifies that:
If a ' or a " character matches the last category, the behavior is
undefined.
So your second code causes undefined behaviour at translation phase 3.

Preprocessing Tokens: '- -' vs. '--'

Why does the (GCC) preprocessor create two tokens - -B instead of a single one --B in the following example? What is the logic that the former should be correct and not the latter?
#define A -B
-A
Output according to gcc -E:
- -B
After all, -- is a valid operator, so theoretically a valid token as well.
Is this specific to the GCC preprocessor or does this follow from the C standards?
The preprocessor works on tokens, not strings. Macro substitution without ## cannot create a new token and so, if the preprocessor output goes to a textfile as opposed to going straight into the compiler, preprocessors insert whitespace so that the outputted textfile can be used as C input again without changed semantics.
The space insertion doesn't seem to be in the standard, but then the standard describes the preprocessor as working on tokens and as feeding its output to the compiler proper, not a textfile.
Focusing on the white space insertion is missing the issue.
The macro A is defined as the sequence of preprocessing tokens - and B.
When the compiler parses a fragment of source code -A, it produces 2 tokens - and A. A is expanded as part of the preprocessing phase and the tokens are converted to C tokens: -, - and B.
If B is itself defined as a macro (#define B 4), A would expand to -, -, 4, which is parsed as an expression evaluating to the value 4 with type int.
gcc -E produces text. For the text to convert back to the same sequence of tokens as the original source code, a space needs to be inserted between the two - tokens to prevent -- to be parsed as a single token.
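A small check of that claim (an illustrative, compilable program, not from the question):

#include <stdio.h>

#define A -B
#define B 4

int main(void)
{
    int v = -A;          /* expands to the tokens - - 4, i.e. the value 4 */
    printf("%d\n", v);   /* prints 4 */
    return 0;
}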

Expansion of function-like macro creates a separate token

I just found out that gcc seems to treat the result of the expansion of a function-like macro as a separate token. Here is a simple example showing the behavior of gcc:
#define f() foo
void f()_bar(void);
void f()bar(void);
void f()-bar(void);
When I execute gcc -E -P test.c (running just the preprocessor), I get the following output:
void foo _bar(void);
void foo bar(void);
void foo-bar(void);
It seems like, in the first two declarations, gcc inserts a space after the expanded macro to ensure it is a separate token. Is that really what is happening here?
Is this mandated by any standard (I couldn't find documentation on the topic)?
I want to make _bar part of the same token. Is there any way to do this? I could use the token concatenation operator ## but it will require several levels of macros (since in the real code f() is more complex). I was wondering if there is a simple (and probably more readable) solution.
It seems like, in the first two declarations, gcc inserts a space after the expanded macro to ensure it is a separate token. Is that really what is happening here?
Yes.
Is this mandated by any standard (I couldn't find documentation on the topic)?
Yes, although an implementation would be allowed to insert even more than one whitespace character to separate the tokens.
f()_bar
here you have 4 tokens after lexical analysis (they are actually pre-processor tokens at this stage but let's call them tokens): f, (, ) and _bar.
The function-like macro replacement semantics (as defined in C11, 6.10.3) replace the 3 tokens f, (, ) with a new one, foo. The replacement is not allowed to operate on other tokens and change the trailing _bar token. To preserve the _bar token in the textual output, the implementation has to insert at least one whitespace character; otherwise the result would have been foo_bar, which is a single token.
gcc preprocessor somewhat documents it here:
Once the input file is broken into tokens, the token boundaries never change, except when the ‘##’ preprocessing operator is used to paste tokens together. See Concatenation. For example,
#define foo() bar
foo()baz
==> bar baz
not
==> barbaz
In the other case, like f()-bar, there are 5 tokens: f, (, ), - and bar. (- is a punctuator token in C, whereas the _ in _bar is simply a character of the identifier token.) The implementation does not have to insert a token separator (whitespace) here, because after macro replacement - and bar are still two separate tokens as far as C syntax is concerned.
gcc's preprocessor (cpp) does not insert whitespace here simply because it does not have to. In the cpp documentation, on token spacing, it is written (about a different issue):
However, we would like to keep space insertion to a minimum, both for aesthetic reasons and because it causes problems for people who still try to abuse the preprocessor for things like Fortran source and Makefiles.
I didn't address the solution to your issue in this answer, but I think you have to use the operator explicitly specified for concatenating tokens: the ## token pasting operator.
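For reference, the usual workaround is the two-level paste idiom, so that f() is fully expanded before its result is pasted (the PASTE names are illustrative):

#define f() foo

/* Indirection: the arguments of PASTE are macro-expanded before PASTE_
   pastes them, because in PASTE they are not operands of ##. */
#define PASTE_(a, b) a##b
#define PASTE(a, b)  PASTE_(a, b)

void PASTE(f(), _bar)(void);   /* declares void foo_bar(void); */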
The only way I can think of (if you can not use the token concatenation operator ##) is using the traditional (pre-standard) C preprocessing:
gcc -E -P -traditional-cpp test.c
Output:
void foo_bar(void);
void foobar(void);
void foo-bar(void);

What does '\' actually do in C?

As far as I know, \ in C just appends the next line as if there were no line break.
Consider the following code:
main(){\
return 0;
}
When I look at the preprocessed code (gcc -E), it shows
main(){return
0;
}
and not
main(){return 0;
}
What is the reason for this kind of behaviour? Also, how can I get the code I expected?
Yes, your expected result is the one required by the C and C++ standards. The backslash simply escapes the newline, i.e. the backslash-newline sequence is deleted.
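Because that deletion happens in translation phase 2, before tokenization, a backslash-newline can even split a single token. For example (illustrative only):

/* The backslash-newline is removed before tokenization, so "ma" and
   "in" below form the single ordinary identifier main. */
int ma\
in(void)
{
    return 0;
}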
GCC 4.2.1 from my OS X installation gives the expected result, as does Clang. Furthermore, adding a #define to the beginning and testing with
#define main(){\
return 0;
}
main()
yields the correct result
}
{return 0;
Perhaps gcc -E does some extra processing after preprocessing and before outputting it. In any case, the line break seen by the rest of the preprocessor seems to be in the right place. So it's a cosmetic bug.
UPDATE: According to the GCC FAQ, -E (or the default setting of the cpp command) attempts to put output tokens in roughly the same visual location as input tokens. To get "raw" output, specify -P as well. This fixes the observed issues.
Probably what happened:
In preserving visual appearance, tokens not separated by spaces are kept together.
Line splicing happens before spaces are identified for the above.
The { and return tokens are grouped into the same visual block.
0 follows a space and its location on the next line is duly noted.
PLUG: If this is really important to you, I have implemented my own preprocessor with correct implementation of both raw-preprocessed and whitespace-preserving "pretty" modes. Following this discussion I added line splices to the preserved whitespace. It's not really intended as a standalone tool, though. It's a testbed for a compiler framework which happens to be a fully compliant C++11 preprocessor library, which happens to have a miniature command-line driver. (The error messages are on par with GCC, or Clang, sans color, though.)
From K&R section A.12 Preprocessing:
A.12.2 Line Splicing
Lines that end with the backslash character \ are
folded by deleting the backslash and the following newline character.
This occurs before division into tokens.
It doesn't matter :/ The tokenizer will not see any difference. 1
Update In response to the comments:
There seems to be a fair amount of confusion as to what the expected output of the preprocessor should be. My point is that the expectation /seems/ reasonable at a glance but doesn't actually need to be specified in this way for the output to be valid. The amount of whitespace present in the output is simply irrelevant to the parser. What matters is that the preprocessor should treat the continued line as one line while interpreting it.
In other words: the preprocessor is not a text transformation tool, it's a token manipulation tool.
If it matters to you, you're probably
using the preprocessor for something other than C/C++
treating C++ code as text, which is a ... code smell. (libclang and various less complete parser libraries come to mind).
1 (The preprocessor is free to achieve the specified result in whichever way it sees fit. The result you are seeing is possibly the most efficient way the implementors have found to implement this particular transformation)

Working of the C Preprocessor

How does the following piece of code work, in other words what is the algorithm of the C preprocessor? Does this work on all compilers?
#include <stdio.h>
#define b a
#define a 170
int main() {
printf("%i", b);
return 0;
}
The preprocessor just replaces b with a wherever it finds it in the program and then replaces a with 170. It is just plain textual replacement.
Works on gcc.
It's at §6.10.3 (Macro Replacement):
6.10.3.4 Rescanning and further replacement
1) After all parameters in the replacement list have been substituted and #
and ## processing has taken place, all placemarker preprocessing tokens are removed. Then, the resulting preprocessing token sequence
is rescanned, along with all subsequent preprocessing tokens of the
source file, for more macro names to replace.
Further paragraphs state some complementary rules and exceptions, but this is basically it.
Though it may violate some definitions of "single pass", it's very useful. Like the recursive preprocessing of included files (§5.1.1.2p4).
This simple replacement (first b with a and then a with 170) should work with any compiler.
You should be careful with more complicated cases (usually involving stringification '#' and token concatenation '##'), as there are corner cases handled differently at least by MSVC and gcc.
When in doubt, you can always check the ISO standard (a draft is available online) to see how things are supposed to work :). Section 6.10.3 is the most relevant in your case.
The preprocessor just replaces the symbols sequentially whenever they appear. The order of the definitions does not matter in this case, b is replaced by a first, and the printf statement becomes
printf("%i", a);
and after a is replaced by 170, it becomes
printf("%i", 170);
If the order of definition was changed, i.e
#define a 170
#define b a
Then preprocessor replaces a first, and the 2nd definition becomes
#define b 170
So, finally the printf statement becomes
printf("%i",170);
This works for any compiler.
To see the details for yourself, you can run gcc -E and inspect the preprocessor output; that should clear up any remaining doubt.
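For example, assuming the snippet above is saved as demo.c (the exact line markers and whitespace in the output vary between compiler versions), the tail of the output shows the fully expanded call:

$ gcc -E demo.c | tail -n4
int main() {
printf("%i", 170);
return 0;
}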
#define does not assign a value to a keyword; it defines a macro name together with its replacement.
Here, b is defined to expand to a, and a is then defined to expand to 170. For simplicity, the chain can be pictured as:
b -> a -> 170
It's just an indirect way of defining the same thing.
I think you are trying to understand how the source code is processed by the compiler. To know exactly, you have to go through the translation phases. The general steps followed by every compiler (I have tried to give every detail, gathered from different blogs and websites) are below:
First Step by Compiler - Physical source file characters are mapped to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
Second Step by Compiler - Each instance of a new-line character and an immediately preceding backslash character is deleted, splicing physical source lines to form logical source lines. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character.
Third Step by Compiler - The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of other white-space characters is retained or replaced by one space character is implementation-defined.
Fourth Step by Compiler - Preprocessing directives are executed and macro invocations are expanded. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively.
Fifth Step by Compiler - Each escape sequence in character constants and string literals is converted to a member of the execution character set.
Sixth Step by Compiler - Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated (a short example follows this list).
Seventh Step by Compiler - White-space characters separating tokens are no longer significant. Preprocessing tokens are converted into tokens. The resulting tokens are syntactically and semantically analyzed and translated.
Last Step - All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.
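As a small illustration of the sixth step, adjacent string literal tokens become a single literal, so the two declarations below are equivalent:

const char *msg1 = "Hello, " "world";   /* concatenated in phase 6 */
const char *msg2 = "Hello, world";      /* identical contents */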
