This program gives 5 as output. But after replacing all the macros, it would seemingly result in --5, which should cause a compilation error for trying to decrement the constant 5. Yet it compiles and runs fine.
#include <stdio.h>
#define A -B
#define B -C
#define C 5
int main()
{
printf("The value of A is %d\n", A);
return 0;
}
Why is there no error?
Here are the steps for the compilation of the statement printf("The value of A is %d\n", A);:
the lexical parser produces the preprocessing tokens printf, (, "The value of A is %d\n", ,, A, ) and ;.
A is a macro that expands to the 2 tokens - and B.
B is also a macro and gets expanded to - and C.
C is again a macro and gets expanded to 5.
the tokens are then converted to C tokens, producing errors for preprocessing tokens that do not convert to proper C tokens (e.g. 0a). In this example, all the tokens convert unchanged.
the compiler parses the resulting sequence according to the C grammar: printf, (, "The value of A is %d\n", ,, -, -, 5, ), ; matches a function call to printf with 2 arguments: a format string and a constant expression - - 5, which evaluates to 5 at compile time.
the code is therefore equivalent to printf("The value of A is %d\n", 5);. It will produce the output:
The value of A is 5
This sequence of macros is expanded as tokens, not strictly a sequence of characters, hence A does not expand as --5, but rather as - -5. Good C compilers would insert an extra space when preprocessing the source to textual output to ensure the resulting text produces the same sequence of tokens when reparsed. Note however that the C Standard does not say anything about preprocessing to textual output, it only specifies preprocessing as one of the parsing phases and it is a quality of implementation issue for compilers to not introduce potential side effects when preprocessing to textual output.
There is a separate feature for combining tokens into new tokens in the preprocessor called token pasting. It requires a specific operator ## and is quite tricky to use.
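A minimal sketch of token pasting (the PASTE macro name here is my own, not from the question):
#include <stdio.h>
#define PASTE(a, b) a##b   /* ## glues the two argument tokens into one token */
int main(void)
{
    int xy = 42;
    printf("%d\n", PASTE(x, y));  /* PASTE(x, y) expands to the single identifier xy */
    return 0;
}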
Note also that macros should be defined with parentheses around each argument and parentheses around the whole expansion to prevent operator precedence issues:
#define A (-B)
#define B (-C)
#define C 5
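To see why the parentheses matter, consider a hypothetical VALUE macro (my own example, not part of the question) used next to a higher-precedence operator:
#include <stdio.h>
#define VALUE_BAD  2+3     /* no parentheses around the expansion */
#define VALUE_GOOD (2+3)
int main(void)
{
    printf("%d\n", VALUE_BAD * 2);   /* expands to 2+3*2 and prints 8 */
    printf("%d\n", VALUE_GOOD * 2);  /* expands to (2+3)*2 and prints 10 */
    return 0;
}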
Two consecutive dashes are not combined into a single pre-decrement operator -- because the C preprocessor works with individual tokens, effectively inserting whitespace around macro substitutions. Running this program through gcc -E
#define A -B
#define B -C
#define C 5
int main() {
return A;
}
produces the following output:
int main() {
return - -5;
}
Note the space after the first -.
According to the standard, macro replacements are performed at the level of preprocessor tokens, not at the level of individual characters (6.10.3 paragraph 9):
A preprocessing directive of the form
# define identifier replacement-list new-line
defines an object-like macro that causes each subsequent instance of the macro name to be replaced by the replacement list of preprocessing tokens that constitute the remainder of the directive.
Therefore, the two - dashes constitute two different tokens, so they are kept separate from each other in the output of the preprocessor.
Whenever we use #define in a C program, the preprocessor will replace the macro name with its replacement text wherever it is used.
#define A -B
#define B -C
#define C 5
So when we print A,
it expands in the following steps:
A => -B
B => -C, so A => -(-C)
C => 5, so A => -(-5) => 5
So when we print the value of A, it comes out to be 5.
Generally these #define statements are used to define constants that are to be used throughout the code.
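For instance (an illustrative constant of my own, not taken from the question):
#include <stdio.h>
#define BUFFER_SIZE 128   /* named constant usable anywhere in this file */
int main(void)
{
    char buffer[BUFFER_SIZE];
    printf("buffer holds %zu bytes\n", sizeof buffer);  /* prints 128 */
    return 0;
}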
As far as my knowledge goes in C, the C preprocessor replaces macros literally, exactly as written in the #define. But now I am seeing that it inserts spaces before and after.
Is my explanation correct, or am I doing something that should give undefined behavior?
Consider the following C code:
#include <stdio.h>
#define k +-6+-
#define kk xx+k-x
int main()
{
int x = 1029, xx = 4,t;
printf("x=%d,xx=%d\n",x,xx);
t=(35*kk*2)*4;
printf("t=%d,x=%d,xx=%d\n",t,x,xx);
return 0;
}
The initial values are: x = 1029, xx = 4. Let's calculate the value of t now.
t = (35*kk*2)*4;
t = (35*xx+k-x*2)*4; // replacing the literal kk
t = (35*xx++-6+--x*2)*4; // replacing the literal k
Now, the value of xx is 4 (it would be increased by one only after this use), and x is decremented by one, becoming 1028. So, the calculation of the current statement:
t = (35*4-6+1028*2)*4;
t = (140-6+2056)*4;
t = 2190*4;
t = 8760;
But the output of the above code is:
x=1029,xx=4
t=8768,x=1029,xx=4
From the second line of the output, it is clear that the increment and decrement did not take place.
That means after replacing k and kk, it is becoming:
t = (35*xx+ +-6+- -x*2)*4;
(If it is, then the calculation is clear.)
My point of concern: is this standard C behavior or just undefined behavior? Or am I doing something wrong?
The C standard specifies that the source file is analyzed and parsed into preprocessor tokens. When macro replacement occurs, a macro that is replaced is replaced with those tokens. The replacement is not literal text replacement.
C 2018 5.1.1.2 specifies translation phases (rephrasing and summarizing, not exact quotes):
Physical source file multibyte characters are mapped to the source character set. Trigraph sequences are replaced by single-character representations.
Lines continued with backslashes are merged.
The source file is converted from characters into preprocessing tokens and white-space characters—each sequence of characters that can be a preprocessing token is converted to a preprocessing token, and each comment becomes one space.
Preprocessing is performed (directives are executed and macros are expanded).
Source characters in character constants and string literals are converted to members of the execution character set.
Adjacent string literals are concatenated.
White-space characters are discarded. “Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.” (That quoted text is the main part of C compilation as we think of it!)
The program is linked to become an executable file.
So, in phase 3, the compiler recognizes that #define kk xx+k-x consists of the tokens #, define, kk, xx, +, k, -, and x. The compiler also knows there is white space between define and kk and between kk and xx, but this white space is not itself a preprocessor token.
In phase 4, when the compiler replaces kk in the source, it is doing so with these tokens. kk gets replaced by the tokens xx, +, k, -, and x, and k is replaced by the tokens +, -, 6, +, and -. Combined, those form xx, +, +, -, 6, +, -, -, and x.
The tokens remain that way. They are not reanalyzed to put + and + together to form ++.
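To connect this to the observed output (a worked check of my own, not part of the answer above), the compiler therefore sees the tokens t = ( 35 * xx + + - 6 + - - x * 2 ) * 4 ;, which contain no ++ or -- operators:
t = (35*xx+ +-6+- -x*2)*4;          // the tokens the compiler actually sees
t = (35*4 + (-6) + (-(-1029))*2)*4; // xx == 4, x == 1029; - -x is just x
t = (140 - 6 + 2058)*4;
t = 2192*4;
t = 8768;                           // matches the printed t=8768; x and xx are never modified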
As @EricPostpischil says in a comprehensive answer, the C pre-processor works on tokens, not character strings, and once the input is tokenised, whitespace is no longer needed to separate adjacent tokens.
If you ask a C preprocessor to print out the processed program text, it will probably add whitespace characters where needed to separate the tokens. But that's just for your convenience; the whitespace might or might not be present, and it makes almost no difference because it has no semantic value and will be discarded before the token sequence is handed over to the compiler.
But there is a brief moment during preprocessing when you can see some whitespace, or at least an indication as to whether there was whitespace inside a token sequence, if you can pass the token sequence as an argument to a function-like macro.
Most of the time, the preprocessor does not modify tokens. The tokens it receives are what it outputs, although not necessarily in the same order and not necessarily all of them. But there are two exceptions, involving the two preprocessor operators # (stringify) and ## (token concatenation). The first of these transforms a macro argument -- a possibly empty sequence of tokens -- into a string literal, and when it does so it needs to consider the presence or absence of whitespace in the token sequence.
(The token concatenation operator combines two tokens into a single token if possible; when it does so, intervening whitespace is ignored. That operator is not relevant here.)
The C standard actually specifies precisely how whitespace in a macro argument is handled if the argument is stringified, in paragraph 2 of §6.10.3.2:
Each occurrence of white space between the argument’s preprocessing tokens
becomes a single space character in the character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted.
We can see this effect in action:
#include <stdio.h>
/* I is just used to eliminate whitespace between two macro invocations.
* The indirection of `STRING/STRING_` is explained in many SO answers;
* it's necessary in order that the stringify operator apply to the expanded
* macro argument, rather than the literal argument.
*/
#define I(x) x
#define STRING_(x) #x
#define STRING(x) STRING_(x)
#define PLUS +
int main(void) {
printf("%s\n", STRING(I(PLUS)I(PLUS)));
printf("%s\n", STRING(I(PLUS) I(PLUS)));
}
The output of this program is:
++
+ +
showing that the whitespace in the second invocation was preserved.
Contrast the above with gcc's -E output for ordinary use of the macro:
int main(void) {
(void) I(PLUS)I(PLUS)3;
(void) I(PLUS) I(PLUS)3;
}
The macro expansion is
int main(void) {
(void) + +3;
(void) + +3;
}
showing that the preprocessor was forced to insert a cosmetic space into the first expansion, in order to preserve the semantics of the macro expansion. (Again, I emphasize that the -E output is not what the preprocessor module passes to the compiler, in normal GCC operation. Internally, it passes a token sequence. All of the whitespace in the -E output above is a courtesy which makes the generated file more useful.)
#define function(in) in+1
int main(void)
{
printf("%d\n",function(1));
return 0;
}
The above is correctly preprocessed to :
int main(void)
{
printf("%d\n",1+1);
return 0;
}
However, if the macro body in+1 is changed to in_1, the preprocessor will not do the argument replacement as expected and will end up with this:
printf("%d\n",in_1);
What are the tokens the preprocessor can correctly separate from the argument when substituting it (like the + sign)?
Short answer: The replacement done by the preprocessor is not simple text substitution. In your case, the parameter is an identifier, and the preprocessor can only replace tokens that are identical to that identifier.
The related form of preprocessing is
#define identifier(identifier-list) token-sequence
In order for the replacement to take place, the identifiers in the identifier-list and the tokens in the token-sequence must be identical in the token sense, according to C's tokenization rule (the rule to parse stream into tokens).
If you agree with the fact that
in C in and in_1 are two different identifiers (and C cannot relate one to the other), while
in+1 is not an identifier but a sequence of three tokens:
(1) identifier in,
(2) operator +, and
(3) integer constant 1,
then your question is clear: in and in_1 are just two identifiers between which C does not see any relationship, and cannot do the replacement as you wish.
Reference 1: In C, there are six classes of tokens:
(1) identifiers (e.g. in)
(2) keywords (e.g. if)
(3) constants (e.g. 1)
(4) string literals (e.g. "Hello")
(5) operators (e.g. +)
(6) other separators (e.g. ;)
Reference 2: In C, an identifier is a sequence of letters (including _) and digits (the first one cannot be a digit).
Reference 3: The tokenization rule:
... the next token is the longest string of characters that could constitute a token.
This is to say, when reading in+1, the compiler will read all the way to +, and knows that in is an identifier. But in the case of in_1, the compiler will read all the way to the white space after it, and deems in_1 as an identifier.
All references are from the Reference Manual in K&R's The C Programming Language. The language has evolved, but they capture the essence.
See the C11 standard, section 6.4, for the tokenization grammar.
The relevant token type here is identifier, which is defined as any sequence of letters or digits that doesn't start with a digit; universal character names (\u escapes) and other implementation-defined characters can also appear in an identifier.
Due to the "maximal munch" principle of tokenization, character sequence in+1 is tokenized as in + 1 (not i n + 1).
If you want in_1, use two hashes (the token-pasting operator):
#define function(in) in ## _1
So...
function(dave) --> dave_1
function(phil) --> phil_1
And for completeness, you can also use a single hash to turn the arg into a text string.
#define function(in) printf(#in "\n");
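so that, roughly, a call like function(dave) expands to:
printf("dave" "\n");   /* the adjacent string literals are later concatenated into "dave\n" */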
The output of the following C code is T T, but I think it should be t t.
#include<stdio.h>
#define T t
void main()
{
char T = 'T';
printf("\n%c\t%c\n",T,t);
}
The preprocessor does not perform substitution of any text within quotes, whether they are single quotes or double quotes.
So the character constant 'T' is unchanged.
From section 6.10.3 of the C standard:
9 A preprocessing directive of the form
# define identifier replacement-list new-line
defines an object-like macro that causes each subsequent instance of
the macro name 171) to be replaced by the replacement list of
preprocessing tokens that constitute the remainder of the
directive. The replacement list is then rescanned for more macro
names as specified below.
171) Since, by macro-replacement time, all character constants and
string literals are preprocessing tokens, not sequences possibly
containing identifier-like subsequences (see 5.1.1.2, translation
phases), they are never scanned for macro names or parameters.
TL;DR The variable name T is subject to MACRO replacement, not the initializer 'T'.
To elaborate, #define macros cause replacement at the preprocessing-token level, and anything inside quotes (either '' or "") is not subject to macro replacement.
So in essence, if you run the preprocessor on your code (for example, gcc -E test.c), the relevant part looks like
char t = 'T';
printf("\n%c\t%c\n",t,t);
Run gcc -E main.c -o test.txt && tail test.txt to see the preprocessed result.
which, as expected, prints the value of the variable t twice: T T.
That said, for a hosted environment, the required signature for main() is int main(void), at least.
C seems to be pretty permissive when it comes to whitespace.
We can use or omit whitespace around an operator, between a function name and its parenthesized list of arguments, between an array name and its index, etc. in order to make code more readable. I understand this is a matter of preference.
The only place I can think of where whitespace is NOT allowed is this:
#include < stdio.h > // fatal error: stdio.h : No such file or directory
What are the other contexts in C where whitespace cannot be used for readability?
In most cases, adding whitespace within a single token either makes the program invalid or changes the meaning of the token. An obvious example: "foo" and " foo " are both valid string literals with different values, because a string literal is a single token. Changing 123456 to 123 456 changes it from a single integer constant to two integer constants, resulting in a syntax error.
The exceptions to this involve the preprocessor.
You've already mentioned the #include directive. Note that given:
#include "header.h"
the "header.h" is not syntactically a string literal; it's processed before string literals are meaningful. The syntax is similar, but for example a \t sequence in a header name isn't necessarily replaced by a tab character.
Newlines (which are a form of whitespace) are significant in preprocessor directives; you can't legally write:
#ifdef
FOO
/* ... */
#endif
But whitespace other than newlines is permitted:
# if SPACES_ARE_ALLOWED_HERE
#endif
And there's one case I can think of where whitespace is permitted between preprocessor tokens but it changes the meaning. In the definition of a function-like macro, the ( that introduces the parameter list must immediately follow the macro name. This:
#define TWICE(x) ((x) + (x))
defines TWICE as a function-like macro that takes one argument. But this:
#define NOT_TWICE (x) ((x) + (x))
defines NOT_TWICE as an ordinary macro with no arguments that expands to (x) ((x) + (x)).
This rule applies only to macro definitions; a macro invocation follows the normal rules, so you can write either TWICE(42) or TWICE ( 42 ).
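A sketch of how the two behave at an invocation site (assuming the definitions above):
TWICE(3)       /* function-like: expands to ((3) + (3)) */
NOT_TWICE(3)   /* object-like: expands to (x) ((x) + (x))(3), which is almost
                  certainly a syntax error (and x is undeclared) where it is used */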
Whitespace is significant (i.e., not merely for readability) within a lexical token: within an identifier (foo bar is different from foobar), within a number (123 456 is different from 123456), within a string (that's your example, basically), or within an operator (+ + is different from ++ and + = is different from +=). Between tokens you can add as much whitespace as you want, but adding whitespace inside such a token breaks it into two separate tokens (or changes the value, in the case of string constants), thus changing the meaning of your code.
In most cases the code with the added white space is either equivalent to the original code or results in a syntax error. But there are exceptions. For example:
return a +++ b;
is the same as
return a ++ + b;
but is different from:
return a + ++ b;
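A minimal program showing the difference (the variable names are my own):
#include <stdio.h>
int main(void)
{
    int a = 5, b = 7;
    int r1 = a+++b;        /* tokenized as a ++ + b: r1 = 5 + 7, then a becomes 6 */
    printf("r1=%d a=%d b=%d\n", r1, a, b);   /* prints r1=12 a=6 b=7 */
    a = 5; b = 7;
    int r2 = a + ++b;      /* pre-increments b to 8 before the addition */
    printf("r2=%d a=%d b=%d\n", r2, a, b);   /* prints r2=13 a=5 b=8 */
    return 0;
}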
As I recall, you need to be very careful with function-like macros, as in this dummy example:
#include <stdio.h>
#define sum(x, y) ((x)+(y))
int main(void)
{
printf("%d\n", sum(2, 2));
return 0;
}
the:
#define sum(x, y) ((x)+(y))
is different thing than say:
#define sum (x, y) ((x)+(y))
The latter is an object-like macro that expands exactly to (x, y) ((x)+(y)); that is, the parameters are not substituted (as they would be in a function-like macro).
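With the object-like definition, an invocation written like a call no longer works; a sketch of what the expansion would produce:
sum(2, 2)   /* with the object-like macro this becomes (x, y) ((x)+(y))(2, 2),
               where x and y are undeclared and the result is not a valid expression */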
I came across one more piece of code that is even more confusing.
#include "stdio.h"
#define f(a,b) a##b
#define g(a) #a
#define h(a) g(a)
int main(void)
{
printf("%s\n",h(f(1,2)));
printf("%s\n",g(1));
printf("%s\n",g(f(1,2)));
return 0;
}
output is
12
1
f(1,2)
My assumption was
1) first f(1,2) is replaced by 12, because the macro f(a,b) concatenates its arguments
2) then the g(a) macro replaces 12 with the string literal "12"
3) so the output should be 12
But why is g(f(1,2)) not getting substituted to "12"?
I'm sure I'm missing something here.
Can someone explain this program to me?
Macro replacement occurs from the outside in. (Strictly speaking, the preprocessor is required to behave as though it replaces macros one at a time, starting from the beginning of the file and restarting after each replacement.)
The standard (C99 §6.10.3.2/2) says
If, in the replacement list, a parameter is immediately preceded by a # preprocessing
token, both are replaced by a single character string literal preprocessing token that
contains the spelling of the preprocessing token sequence for the corresponding
argument.
Since # is present in the replacement list for the macro g, the argument f(1,2) is converted to a string immediately, and the result is "f(1,2)".
On the other hand, in h(f(1,2)), since the replacement list doesn't contain #, §6.10.3.1/1 applies,
After the arguments for the invocation of a function-like macro have been identified,
argument substitution takes place. A parameter in the replacement list, unless preceded
by a # or ## preprocessing token or followed by a ## preprocessing token (see below), is
replaced by the corresponding argument after all macros contained therein have been
expanded.
and the argument f(1, 2) is macro expanded to give 12, so the result is g(12) which then becomes "12" when the file is "re-scanned".
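Tracing the two calls step by step (a summary of the rules quoted above, not additional standard text):
/* g(f(1,2)): g's replacement list uses #a, so the argument is stringized
 *            exactly as written, without prior expansion   ->  "f(1,2)"
 * h(f(1,2)): h's replacement list has no # before a, so the argument is
 *            macro-expanded first: f(1,2) -> 12, giving g(12); the rescan
 *            then stringizes 12                             ->  "12"
 */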
Macros can't expand into preprocessing directives. From C99 6.10.3.4/3 "Rescanning and further replacement":
The resulting completely macro-replaced preprocessing token sequence
is not processed as a preprocessing directive even if it resembles
one,
Source: https://stackoverflow.com/a/2429368/2591612
But you can pass f(a,b) to g like you did with h; f(a,b) is interpreted as a string literal, as @Red Alert states.