Working of the C Preprocessor


How does the following piece of code work, in other words what is the algorithm of the C preprocessor? Does this work on all compilers?
#include <stdio.h>
#define b a
#define a 170
int main() {
printf("%i", b);
return 0;
}

The preprocessor just replaces b with a wherever it finds it in the program, and then replaces a with 170. It is essentially plain textual (token-by-token) replacement.
Works on gcc.

It's at §6.10.3 (Macro Replacement):
6.10.3.4 Rescanning and further replacement
1) After all parameters in the replacement list have been substituted and #
and ## processing has taken place, all placemarker preprocessing tokens are removed. Then, the resulting preprocessing token sequence
is rescanned, along with all subsequent preprocessing tokens of the
source file, for more macro names to replace.
Further paragraphs state some complementary rules and exceptions, but this is basically it.
Though it may violate some definitions of "single pass", it's very useful. Like the recursive preprocessing of included files (§5.1.1.2p4).
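To make the rescanning rule concrete, here is a minimal sketch. OUTER, INNER and tally are hypothetical names chosen for illustration (OUTER and INNER play the roles of b and a from the question). It also shows one of the "exceptions" mentioned above: a macro name is not replaced again while its own expansion is being rescanned.

```c
/* OUTER and INNER are hypothetical stand-ins for b and a in the question. */
#define OUTER INNER
#define INNER 170

/* A macro is not expanded again inside its own expansion, so the inner
   `tally` below refers to the variable and there is no infinite recursion. */
static int tally = 5;
#define tally (tally + 1)

/* OUTER -> INNER (rescan) -> 170 */
int chained_value(void) { return OUTER; }

/* tally -> (tally + 1); the rescan leaves the inner tally alone */
int self_reference_value(void) { return tally; }
```

Calling chained_value() should give 170 and self_reference_value() should give 6.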

This simple replacement (first b with a and then a with 170) should work with any compiler.
You should be careful with more complicated cases (usually involving stringification '#' and token concatenation '##'), as there are corner cases that are handled differently by at least MSVC and gcc.
In doubt, you can always check the ISO standard (a draft is available online) to see how things are supposed to work :). Section 6.10.3 is the most relevant in your case.
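As an illustration of why '#' and '##' need care, here is a small sketch of the part that the standard does pin down: an argument that is an operand of # or ## is not macro-expanded first, so an extra level of indirection is commonly used. VALUE, STR, XSTR, CAT and XCAT are hypothetical names for this example, and it does not try to reproduce any MSVC/gcc difference.

```c
#define VALUE 170

#define STR(x)  #x          /* stringifies the argument exactly as written */
#define XSTR(x) STR(x)      /* argument is expanded first, then stringified */

#define CAT(x, y)  x##y     /* pastes the arguments exactly as written */
#define XCAT(x, y) CAT(x, y)/* arguments are expanded first, then pasted */

const char *raw_string(void)      { return STR(VALUE); }   /* "VALUE" */
const char *expanded_string(void) { return XSTR(VALUE); }  /* "170"   */

int pasted_raw(void)
{
    int VALUE1 = 7;          /* CAT(VALUE, 1) forms the identifier VALUE1 */
    return CAT(VALUE, 1);
}

int pasted_expanded(void)
{
    return XCAT(VALUE, 1);   /* 170 and 1 are pasted into the number 1701 */
}
```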

The preprocessor just replaces the symbols sequentially whenever they appear. The order of the definitions does not matter in this case: b is replaced by a first, and the printf statement becomes
printf("%i", a);
and after a is replaced by 170, it becomes
printf("%i", 170);
If the order of the definitions were changed, i.e.
#define a 170
#define b a
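the result is the same. Replacement lists are not expanded when a macro is defined, so b is still defined as the token a. At the point of use, b is replaced by a, and rescanning then replaces a with 170.
So, finally the printf statement again becomes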
printf("%i",170);
This works for any compiler.
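One way to check when the replacement actually happens is to redefine the inner macro after the outer one has been defined. This is only a sketch; the FIRST_* and SECOND_* names are hypothetical and stand in for the two definition orders discussed above.

```c
/* Order 1: the outer macro is defined before the inner one exists. */
#define FIRST_B FIRST_A
#define FIRST_A 170

/* Order 2: the inner macro is defined first. */
#define SECOND_A 170
#define SECOND_B SECOND_A

int first_order(void)  { return FIRST_B; }   /* 170 */
int second_order(void) { return SECOND_B; }  /* 170 */

/* SECOND_B's replacement list is still the token SECOND_A, so a later
   redefinition of SECOND_A is picked up by later uses of SECOND_B. */
#undef SECOND_A
#define SECOND_A 42

int after_redefinition(void) { return SECOND_B; }  /* 42 */
```

Both orders give 170, and the last function gives 42, which suggests that expansion happens where the macro is used rather than where it is defined.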

For more detail, you can run gcc -E to inspect the preprocessor's output, which should easily clear up any doubt.

#define simply associates a replacement with a name (it does not create a variable).
Here, 'b' is first defined as 'a', then 'a' is defined as '170'. Loosely, it can be pictured as follows:
b=a=170
It's just a different way of defining the same thing.

I think you are asking how the source code is processed by the compiler. To know exactly, you have to go through the translation phases. The general steps followed by every compiler (I tried to give every detail, gathered from different blogs and websites) are below:
First Step by Compiler - Physical source file characters are mapped to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
Second Step by Compiler - Each instance of a new-line character and an immediately preceding backslash character is deleted, splicing physical source lines to form logical source lines. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character.
Third Step by Compiler - The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of other white-space characters is retained or replaced by one space character is implementation-defined.
Fourth Step by Compiler - Preprocessing directives are executed and macro invocations are expanded. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively.
Fifth Step by Compiler - Each escape sequence in character constants and string literals is converted to a member of the execution character set.
Sixth Step by Compiler - Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.
Seventh Step by Compiler - White-space characters separating tokens are no longer significant. Preprocessing tokens are converted into tokens. The resulting tokens are syntactically and semantically analyzed and translated.
Last Step - All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.
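A few of these steps can be observed directly from ordinary code. The sketch below touches the second step (line splicing), the third step (a comment becomes one space) and the sixth step (adjacent string literals are concatenated). GREETING and the function names are hypothetical, chosen only for this example.

```c
#include <string.h>

/* Second step: backslash-newline is deleted, so this is one logical line. */
#define GREETING "hello, " \
                 "world"

/* Third step: the comment is replaced by one space, so `int` and `x`
   remain two separate tokens. */
int comment_separates(void) { int/* comment */x = 3; return x; }

/* Sixth step: adjacent string literals become a single literal. */
const char *joined(void) { return "abc" "def"; }

/* The two pieces of GREETING are likewise joined into "hello, world". */
size_t greeting_length(void) { return strlen(GREETING); }
```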

Related

Why should the controlled group in a conditional inclusion be lexically valid when the conditional is false?

The following program compiles:
// #define WILL_COMPILE
#ifdef WILL_COMPILE
int i =
#endif
int main()
{
return 0;
}
GCC Live demo here.
But the following will issue a warning:
//#define WILL_NOT_COMPILE
#ifdef WILL_NOT_COMPILE
char* s = "failure
#endif
int main()
{
return 0;
}
GCC Live demo here.
I understand that in the first example, the controlled group is removed by the time the compilation phase of the translation is reached. So it compiles without errors or warnings.
But why is lexical validity required in the second example when the controlled group is not going to be included?
Searching online I found this quote:
Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended.
But this does not state why the lexical validity is checked when the conditional fails.
Have I missed something here?

In translation phase 3 the preprocessor generates preprocessing tokens, and having a " end up in the catch-all category "each non-white-space character that cannot be one of the above" is undefined behavior.
See C11 6.4 Lexical elements p3:
A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories. If a ' or a " character matches the last category, the behavior is
undefined. ....
For reference, the grammar for preprocessing-token is:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Of which the unmatched " in your second example matches non-white-space character that cannot be one of the above.
Since this is undefined behavior and not a constraint violation, the compiler is not obliged to diagnose it, but it is certainly allowed to, and with -pedantic-errors it even becomes an error (see the godbolt session). As rici points out, it only becomes a constraint violation if the token survives preprocessing.
The gcc document you cite basically says the same thing:
... Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended. ...
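To illustrate the distinction between "lexically valid" and "syntactically valid", here is a small sketch. HYPOTHETICAL_FEATURE is an assumed, never-defined macro. The skipped group below is nonsense as C syntax, yet every line still breaks up into well-formed preprocessing tokens, so it should not trigger the unterminated-string diagnostic.

```c
/* HYPOTHETICAL_FEATURE is assumed to be undefined, so the group is skipped. */
#ifdef HYPOTHETICAL_FEATURE
int i =
char *s = "properly terminated" +++ ) ) {
#endif

/* Only this part reaches the compiler proper (phase 7). */
int skipped_group_ok(void) { return 1; }
```

Removing the closing quote from the string inside the skipped group would bring back the warning from the question.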

"Why is [something about C] the way it is?" questions can't usually be answered, because none of the people who wrote the 1989 C standard are here to answer questions [as far as I know, anyway] and if they were here, it was nearly thirty years ago and they probably don't remember.
However, I can think of a plausible reason why the contents of skipped conditional groups are required to consist of a valid sequence of preprocessing tokens. Observe that comments are not required to consist of a valid sequence of preprocessing tokens:
/* this comment's perfectly fine even though it has an unclosed
character literal inside */
Observe also that it is really simple to scan for the end of a comment. /* you look for the next */, // you look for the end of the line. The only complication is that trigraphs and backslash-newline are supposed to be converted first. Tokenizing the contents of comments would be extra code to no useful purpose.
By contrast, it is not simple to scan for the end of a skipped conditional group, because conditional groups nest. You have to be looking for #if, #ifdef, and #ifndef as well as #else and #endif, and counting your depth. And all of those directives are lexically defined in terms of preprocessor tokens, because that's the most natural way to look for them when you're not in a skipped conditional group. Requiring skipped conditional groups to be tokenizable allows the preprocessor to use the same code to process directives within skipped conditional groups as it does elsewhere.
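The depth counting described above can be sketched roughly as follows. This is a deliberately simplified, line-based illustration and not how cpplib works: it looks at raw text lines rather than preprocessing tokens, and it ignores comments and line splicing. The function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the directive name if the line starts with '#'
   (after optional blanks), otherwise NULL. */
static const char *directive_name(const char *line)
{
    while (*line == ' ' || *line == '\t') line++;
    if (*line != '#') return NULL;
    line++;
    while (*line == ' ' || *line == '\t') line++;
    return line;
}

/* True if s begins with exactly the given word (not a longer identifier). */
static int starts_with_word(const char *s, const char *word)
{
    size_t n = strlen(word);
    return strncmp(s, word, n) == 0
        && !isalnum((unsigned char)s[n]) && s[n] != '_';
}

/* lines[start] is assumed to be an #if, #ifdef or #ifndef line.
   Returns the index of its matching #endif, or -1 if there is none. */
int find_matching_endif(const char *const lines[], int count, int start)
{
    int depth = 0;
    for (int i = start; i < count; i++) {
        const char *name = directive_name(lines[i]);
        if (name == NULL) continue;               /* not a directive line */
        if (starts_with_word(name, "if") ||
            starts_with_word(name, "ifdef") ||
            starts_with_word(name, "ifndef"))
            depth++;                              /* a nested group opens */
        else if (starts_with_word(name, "endif") && --depth == 0)
            return i;                             /* back at the outer level */
    }
    return -1;
}
```

Even this toy version has to recognise directive names inside the skipped text, which is the part a real preprocessor does with its ordinary tokenizer.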
By default, GCC issues only a warning when it encounters an un-tokenizable line inside a skipped conditional group, an error elsewhere:
#if 0
"foo
#endif
"bar
gives me
test.c:2:1: warning: missing terminating " character
"foo
^
test.c:4:1: error: missing terminating " character
"bar
^~~~
This is an intentional leniency, possibly one I introduced myself (it's only been twenty years since I wrote a third of GCC's current preprocessor, but I have still forgotten a lot of the details). You see, the original C preprocessor, the one K and R wrote, did allow arbitrary nonsense inside skipped conditional groups, because it wasn't built around the concept of tokens in the first place; it transformed text into other text. So people would put comments between #if 0 and #endif instead of /* and */, and naturally enough those comments would sometimes contain apostrophes. So, when Per Bothner and Neil Booth and Chiaki Ishikawa and I replaced GCC's original "C-Compatible Compiler Preprocessor"1 with the integrated, fully standards-compliant "cpplib", circa GCC 3.0, we felt we needed to cut a little compatibility slack here.
1 Raise your hand if you're old enough to know why RMS thought this name was funny.

The description of Translation phase 3 (C11 5.1.1.2/3), which happens before preprocessing directives are actioned:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
And the grammar for preprocessing-token is:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Note in particular that a string-literal is a single preprocessing-token. The subsequent description (C11 6.4/3) clarifies that:
If a ' or a " character matches the last category, the behavior is
undefined.
So your second code causes undefined behaviour at translation phase 3.

In which phase of the compiler is the error of an identifier name being too long detected?

If I have a very long identifier name, in which phase of the compiler can this error be detected?
Also, if I assign a very large constant to a variable, is that an error?
int a=1987655321467890008766555890765433111223;

The C standard defines eight phases of translation:
1. Physical source multibyte characters and trigraph sequences are mapped to characters of the source character set.
2. Each backslash followed by a new-line is deleted (splicing together two lines).
3. The source characters are grouped into preprocessing tokens, and each sequence of white-space characters is replaced by one space, except new-lines are kept.
4. Preprocessing directives and _Pragma operators are executed, and macro invocations are expanded.
5. Source characters in strings and character constants are converted to the execution character set.
6. Adjacent string literals are concatenated.
7. Each preprocessing token is converted into a grammar token, and white-space characters separating tokens are discarded. The resulting tokens are analyzed and translated (compiled).
8. All external references are resolved (the program is linked).
The C standard does not specify in which phase problems in names or values are detected, and the phases are largely conceptual. The phases explain how the C language is understood, not how a compiler must execute.
However, given that, phase 3 is a logical time to detect names that are too long, particularly since names can be preprocessing identifiers, not just identifiers for variables in the program. But this could also be done in phase 4 for preprocessing identifiers or 7 for other identifiers. Also, the compiler might accept long identifiers up to phase 7, but the linker in phase 8 might have a shorter limit, so errors could occur in 8.
Numbers that are much too large for the compiler to handle at all might be detected in phase 3, but 7 is more likely. For numbers that are too large for the object they are being used to initialize, phase 7 is the logical time to detect the problem.
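For a sense of scale, the constant in the question has 40 digits, far beyond what a 64-bit integer can hold. The sketch below is only a run-time analogue of the range check, using the standard strtoull function. It is not a description of what any compiler does internally, and fits_in_unsigned_long_long is a hypothetical helper name.

```c
#include <errno.h>
#include <stdlib.h>

/* Returns 1 if the decimal text is a number that fits in unsigned long long,
   0 if it is out of range or not a plain decimal number. */
int fits_in_unsigned_long_long(const char *digits)
{
    char *end;
    errno = 0;
    (void)strtoull(digits, &end, 10);   /* sets errno to ERANGE on overflow */
    return errno != ERANGE && end != digits && *end == '\0';
}
```

With this helper, "170" fits, while the 40-digit value from the question does not.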

Compilers that required # on the first column?

Were there widely used pre-ANSI C compilers† that required the # to be on the first column?
† I would accept any compiler on this list. If I can find mention of it in the comp.lang.c Usenet newsgroup in a post dated before 1995, I would accept it.
K&R C did not specify whether whitespace was permitted before the #. From the original The C Programming Language, §12¶1 of the "C Reference Manual" in Appendix A:
The C compiler contains a preprocessor capable of macro substitution, conditional compilation, and inclusion of named files. Lines beginning with # communicate with this preprocessor.
Thus, whether or not whitespace was permitted to precede the # was unspecified. This would mean a pre-ANSI compiler could fail to compile a program if the directive did not begin on the first column.
In ISO C (and in ANSI C before that), the C preprocessing directives were explicitly permitted to be prefixed with whitespace. In ANSI C (C-89):
A preprocessing directive consists of a sequence of preprocessing
tokens that begins with a # preprocessing token that is either the
first character in the source file (optionally after white space
containing no new-line characters) or that follows white space
containing at least one new-line character, and is ended by the next
new-line character.
ISO C.2011 has similar language, but is clarified even further:
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the
following constraints: The first token in the sequence is a # preprocessing token that (at
the start of translation phase 4) is either the first character in the source file (optionally
after white space containing no new-line characters) or that follows white space
containing at least one new-line character. The last token in the sequence is the first new-line
character that follows the first token in the sequence.165) A new-line character ends
the preprocessing directive even if it occurs within what would otherwise be an invocation of a function-like macro.
165) Thus, preprocessing directives are commonly called ‘‘lines’’. These ‘‘lines’’ have no other syntactic
significance, as all white space is equivalent except in certain situations during preprocessing (see the
# character string literal creation operator in 6.10.3.2, for example).
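As an illustration of what the ANSI/ISO wording permits, the following sketch indents its directives, both with white space before the # and with white space between the # and the directive name. HYPOTHETICAL_MODE and MODE_NAME are assumed names for this example. A conforming compiler should accept all of these forms, whereas a pre-ANSI preprocessor that insisted on # in the first column presumably would not.

```c
#define HYPOTHETICAL_MODE 2

    #if HYPOTHETICAL_MODE == 1
        #define MODE_NAME "one"    /* white space before the # */
    #elif HYPOTHETICAL_MODE == 2
    #   define MODE_NAME "two"     /* white space after the # as well */
    #else
        #error "unsupported HYPOTHETICAL_MODE"
    #endif

const char *mode_name(void) { return MODE_NAME; }
```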

Short answer: Yes.
I remember writing things like
#if foo
/* ... */
#else
#if bar
/* ... */
#else
#error "neither foo nor bar specified"
#endif
#endif
so that the various pre-ANSI compilers that I once used wouldn't complain about "unrecognized preprocessor directive '#error'". This would have been with Ritchie's original cc for the pdp11, or pcc (the "portable C compiler" which, IIRC, was the basis for the Vax cc of the 80's or so). Both of those compilers -- more accurately, the preprocessor used with both of those compilers -- definitely required the # to be in the first column. (Actually, although those compilers were very different, they might both have used different variants of basically the same preprocessor, which was always a separate program in those days.)

Inserting a one-line line comment with a preprocessor macro

Is it possible to simulate a one-line comment (//) using a preprocessor macro (or magic)? For example, can this compile with gcc -std=c99?
#define LINE_COMMENT() ???
int main() {
LINE_COMMENT() asd(*&##)($*?><?><":}{)(#
return 0;
}

No. Here is an extract from the standard showing the phases of translation of a C program:
The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal character name is produced by token concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
As you can see, comments are removed before macros are expanded, so a macro cannot expand into a comment.
You can obviously define a macro that takes an argument and expands to nothing, but it's slightly more restrictive than a comment, as its argument must consist only of valid preprocessor token characters (e.g. no # or unmatched quotes). Not very useful for general commenting purposes.
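A minimal sketch of that "expands to nothing" macro, assuming C99 variadic macros. IGNORE is a hypothetical name. The argument still has to be made of valid preprocessing tokens with balanced parentheses and no unmatched quotes, which is why it cannot swallow the line from the question.

```c
/* Hypothetical macro: accepts any arguments and expands to nothing. */
#define IGNORE(...)

int ignore_demo(void)
{
    int x = 1;
    IGNORE(x = 2; any balanced (tokens) "can go here")
    return x;   /* still 1: the assignment above never reached the compiler */
}
```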

No. Comments are removed in an early translation phase, before macros are expanded. You can do selective compilation (without regard to comments) with #if directives, as in:
#if 0
... // this stuff will not be compiled
...
#endif // up to here.
That's all the magic you can do with the limited macro preprocessor available in C/C++.

Why does this C program compile without an error?

I'm a beginner in C, and I was playing with C. I typed a C code like this:
#include <stdio.h>
int main()
{
printf("hello world\n");
\
return 0;
}
Even though I used \ knowingly, the C compiler doesn't throw any error. What is this symbol used for in the C language?
Edit:
Even this works:
"\n";

The sequence backslash-newline is removed from the code in a very early phase (phase 2) of the translation process. It used to be how you created long string literals before there was string concatenation, and is how you still extend macros over multiple lines.
See §5.1.1.2 Translation Phases of the C99 standard:
The precedence among the syntax rules of translation is specified by the following
phases.5)
1. Physical source file multibyte characters are mapped, in an implementation-defined
manner, to the source character set (introducing new-line characters for
end-of-line indicators) if necessary. Trigraph sequences are replaced by
corresponding single-character internal representations.
2. Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice. A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such
splicing takes place.
3. The source file is decomposed into preprocessing tokens6) and sequences of
white-space characters (including comments). A source file shall not end in a
partial preprocessing token or in a partial comment. Each comment is replaced by
one space character. New-line characters are retained. Whether each nonempty
sequence of white-space characters other than new-line is retained or replaced by
one space character is implementation-defined.
4. Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
5. Each source character set member and escape sequence in character constants and
string literals is converted to the corresponding member of the execution character
set; if there is no corresponding member, it is converted to an implementation-defined
member other than the null (wide) character.7)
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
8. All external object and function references are resolved. Library components are
linked to satisfy external references to functions and objects not defined in the
current translation. All such translator output is collected into a program image
which contains information needed for execution in its execution environment.
5) Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice.
6) As described in 6.4, the process of dividing a source file’s characters into preprocessing tokens is
context-dependent. For example, see the handling of < within a #include preprocessing directive.
7) An implementation need not convert all non-corresponding source characters to the same execution
character.
If you had a blank or any other character after your stray backslash, you would have a compilation error. We can tell that you don't have anything after it because you don't have a compilation error.
The other part of your question, about:
"\n";
is quite different. It is a simple expression that has no side-effects and therefore no effect on the program. The optimizer will completely discard it. When you write:
i = 1;
you have an expression with a value that is discarded; it is evaluated for its side-effect of modifying i.
Sometimes, you'll find code like:
*ptr++;
The compiler will warn you that the result of the expression is discarded; the expression can be simplified to:
ptr++;
and will achieve the same effect in the program.

The \, when immediately followed by a newline, is consumed by preprocessing and causes the next "physical" line to be joined to the current logical line. This is very important for writing long preprocessing directives, which have to be all on one logical line:
#define SHORT very long macro \
consisting of lots and \
lots of preprocessor \
tokens
If you remove the backslash-newline sequences, it is no longer correct. Some other languages from the Unix culture have a similar backslash line continuation syntax: the POSIX shell language derived from the Bourne shell, and also makefiles.
$ this is \
one shell command
About "\n";, that is a primary expression used to form an expression-statement. In C, expressions can be used as statements, and this is exploited all the time. Your printf call, for instance, is an expression statement. printf("hello world\n") is a postfix expression which calls a function, obtaining a return value. Because you used this expression as a statement, the return value is thrown away. The return value of printf indicates how many characters were printed, or whether it was successful at all, so by throwing it away, your program makes itself oblivious to whether the printf call actually worked.
Since the value of an expression-statement is discarded, if such a statement also has no side effects, it is a useless statement which does nothing (like your "\n"). But such useless expression statements are not erroneous. If you add warning options to your compiler command line you might get a warning such as "statement with no effect" or something like that.
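Both points can be seen in a short sketch. The function names are hypothetical. The first function has a backslash-newline in the middle of an identifier, which is spliced away before tokenization. The second keeps printf's return value instead of discarding it.

```c
#include <stdio.h>

int count_with_splice(void)
{
    int total = 40;
    /* The backslash-newline below is deleted in phase 2, so the two physical
       lines form the single logical line `total += 2;`. */
    to\
tal += 2;
    return total;
}

int checked_print(void)
{
    /* printf returns the number of characters written,
       or a negative value on error. */
    int written = printf("hello world\n");
    return written;
}
```

Here count_with_splice() should return 42, and checked_print() should return 12 when the output succeeds.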

The backslash \ gets interpreted by the C preprocessor. It protects the character that follows it (the new-line character in your case).

The backslash is simply escaping the next character. In this case, that is probably the line-end (new-line) character. Perfectly reasonable.

The backslash plus what is following it is an escape sequence; "\n" together is the newline character (prints a newline). Another important one is "\t", for tab.
