How does C recognize macro tokens as arguments?

#include <stdio.h>

#define function(in) in+1

int main(void)
{
    printf("%d\n", function(1));
    return 0;
}
Apart from the #include expansion, the above is correctly preprocessed to:
int main(void)
{
    printf("%d\n", 1+1);
    return 0;
}
However, if the macro body in+1 is changed to in_1, the preprocessor does not substitute the argument, and the call ends up as:
printf("%d\n",in_1);
What is the list of tokens (like the + sign) at which the preprocessor can separate the macro body and insert the argument?

Short answer: the replacement done by the preprocessor is not simple text substitution. In your case, the parameter must be an identifier, and only tokens identical to that identifier are replaced.
The related form of preprocessing is
#define identifier(identifier-list) token-sequence
In order for the replacement to take place, the identifiers in the identifier-list and the tokens in the token-sequence must be identical in the token sense, according to C's tokenization rules (the rules for parsing a character stream into tokens).
If you agree with the fact that
in C, in and in_1 are two different identifiers (and C cannot relate one to the other), while
in+1 is not an identifier but a sequence of three tokens:
(1) identifier in,
(2) operator +, and
(3) integer constant 1,
then your question is clear: in and in_1 are just two identifiers between which C does not see any relationship, and cannot do the replacement as you wish.
Reference 1: In C, there are six classes of tokens:
(1) identifiers (e.g. in)
(2) keywords (e.g. if)
(3) constants (e.g. 1)
(4) string literals (e.g. "Hello")
(5) operators (e.g. +)
(6) other separators (e.g. ;)
Reference 2: In C, an identifier is a sequence of letters (including _) and digits, where the first character cannot be a digit.
Reference 3: The tokenization rule:
... the next token is the longest string of characters that could constitute a token.
This is to say, when reading in+1, the compiler reads up to the + and knows that in is an identifier. But in the case of in_1, the compiler reads all the way to the white space after it, and deems in_1 a single identifier.
All references are from the Reference Manual in K&R's The C Programming Language. The language has evolved since, but these rules capture the essence.
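To make this concrete, here is a minimal runnable sketch (the macro name f and the variable in_1 are mine, purely for illustration) showing that a parameter is substituted only where it appears as a standalone identifier token:
#include <stdio.h>

/* `in` is replaced wherever it appears as its own token;
 * `in_1` is a different identifier and is left alone. */
#define f(in) (in + in_1)

int main(void)
{
    int in_1 = 100;
    printf("%d\n", f(5));   /* expands to (5 + in_1), prints 105 */
    return 0;
}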

See the C11 standard, section 6.4, for the tokenization grammar.
The relevant token type here is identifier, which is defined as any sequence of letters or digits that doesn't start with a digit; an identifier may also contain universal character names (\u escapes) and other implementation-defined characters.
Due to the "maximal munch" principle of tokenization, the character sequence in+1 is tokenized as in + 1 (not, say, i n + 1).
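A classic demonstration of maximal munch, as a small sketch any conforming compiler should agree on:
#include <stdio.h>

int main(void)
{
    int x = 1, y = 2;
    int z = x+++y;            /* greedily tokenized as x ++ + y, not x + ++y */
    printf("%d %d\n", z, x);  /* prints 3 2: z = x++ + y, then x is 2 */
    return 0;
}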

If you want in_1, use two hashes, the ## token-pasting operator:
#define function(in) in ## _1
So...
function(dave) --> dave_1
function(phil) --> phil_1
And for completeness, you can also use a single hash (the stringizing operator #) to turn the argument into a text string:
#define function(in) printf(#in "\n")
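Putting both operators together in a runnable sketch (the macro names SUFFIX and SHOW are mine, not from the question):
#include <stdio.h>

#define SUFFIX(in) in ## _1          /* ## pastes _1 onto the argument token */
#define SHOW(in)   printf(#in "\n")  /* # turns the argument into a string literal */

int main(void)
{
    int dave_1 = 42;
    printf("%d\n", SUFFIX(dave));  /* expands to the identifier dave_1, prints 42 */
    SHOW(hello world);             /* prints: hello world */
    return 0;
}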

Why can identifiers contain '$' in C?

Recently I saw code like this:
int $ = 123;
So why can '$' be in an identifier in C?
Is it the same in C++?
This is not good practice. Generally, you should only use alphanumeric characters and underscores ([A-Za-z0-9_]) in identifiers.
Surface Level
Unlike in other languages (bash, perl), C does not use $ to denote the use of a variable, so the character is technically available. As of C++17 this is standards-conformant (see Draft n4659); in C it most likely falls under the implementation-defined identifier characters of C11 6.4.2. In practice it is supported by modern compilers.
As for your C++ question, let's test it!
int main(void) {
    int $ = 0;
    return $;
}
On GCC/G++/Clang/Clang++, this indeed compiles, and runs just fine.
Deeper Level
Compilers take source code, lex it into a token stream, parse that into an abstract syntax tree (AST), and then use the AST to generate code (e.g. assembly or LLVM IR). Your question really only concerns the first part (lexing).
The grammar (and thus the lexer implementation) of C/C++ does not treat $ as special, unlike commas, periods, arrows (->), etc. As such, you may get output from the lexer like the following for the C code below:
int i_love_$ = 0;
After the lexer, this becomes a token stream like:
["int", "i_love_$", "=", "0"]
If you were to take this code:
int i_love_$,_and_.s = 0;
The lexer would output a token stream like:
["int", "i_love_$", ",", "_and_", ".", "s", "=", "0"]
As you can see, because C/C++ does not treat a character like $ as special, it simply becomes part of the identifier, whereas a character like the period splits the text into separate tokens.
The 2018 C standard says in 6.4.2 paragraph 1 that an identifier consists of a nondigit character followed by zero or more nondigit or digit characters, where the nondigit characters are:
one of the characters _, a through z, or A through Z,
a universal-character-name, which is \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits, that is outside certain ranges [1], or
implementation-defined characters.
The digit characters are 0 through 9.
Taking GCC as an example, its documentation says these additional characters are defined in its preprocessor section, and that section says GCC accepts $ and the characters that correspond to the universal character names [2]. Thus, allowing $ is a choice made by the compiler implementors.
Draft n4659 of the 2017 C++ standard has the same rules, in clause 5.10 [lex.name], except it limits the universal character names further.
Footnotes
[1] These \u and \U forms allow you to write any character as a hexadecimal code. The excluded ranges are those in C’s basic character set and codes reserved for control characters and special uses.
[2] The “universal character names” are the \u and \U forms. The characters that correspond to them are the characters that those forms represent. For example, π is a universal character, and \u03c0 is the universal character name for it.
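For example, here is a sketch that should compile under C11 with a compiler that supports extended identifiers and the $ extension (recent GCC and Clang do; the names are placeholders):
#include <stdio.h>

int main(void)
{
    int \u03c0 = 3;   /* \u03c0 is the universal character name for pi */
    int $price = 1;   /* $ accepted as an implementation-defined extra character */
    printf("%d %d\n", \u03c0, $price);
    return 0;
}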

Double hash usage

In C99 6.10.3.3 (2) (with my highlight):
If, in the replacement list of a function-like macro, a parameter is immediately preceded or followed by a ## preprocessing token, the parameter is replaced by the corresponding argument’s preprocessing token sequence; however, if an argument consists of no preprocessing tokens, the parameter is replaced by a placemarker preprocessing token instead.
#include <stdio.h>

#define hash_hash(a, b) a ## b
#define mkstr(a) # a
#define in_between(a) mkstr(a)

int main(void)
{
    char p[] = in_between(a hash_hash(,) b);
    printf("%s\n", p);
    return 0;
}
Output:
a b
I exercised the highlighted phrase with hash_hash(,), and the result matched the standard's description.
But I wonder: if the comma , is omitted, as in hash_hash(), does this differ from the standard's explanation (undefined behavior)? And is the placemarker the same as white-space?
But I wonder: if the comma , is omitted, as in hash_hash(), does this differ from the standard's explanation (undefined behavior)?
I think the relevant part is C11 6.10.3/4, emphasis mine:
Constraints
If the identifier-list in the macro definition does not end with an ellipsis, the number of arguments (including those arguments consisting of no preprocessing tokens) in an invocation of a function-like macro shall equal the number of parameters in the macro definition.
The comma would give two arguments, each consisting of no preprocessing tokens, which matches the macro's two parameters; without it, the invocation supplies only a single argument. So hash_hash() is a constraint violation and you should get a diagnostic message.
But I wonder: if the comma , is omitted, as in hash_hash(), does this differ from the standard's explanation (undefined behavior)?
The comma in this case is part of the syntax of function-like macro invocation. It separates parameters in the macro's parameter list. Invocations of a function-like macro must provide a value for each of that macro's named parameters, and with hash_hash() (without the comma) you would not be doing so. This would be a violation of a language constraint, so not only would the resulting behavior be undefined, but a conforming C implementation is obligated to emit a diagnostic when it encounters such a violation.
And is the placemarker the same as white-space?
No. You can conceptualize a placemarker as a zero-length preprocessing token. Such a thing is not directly representable in source code. It is neither whitespace nor the absence of preprocessing tokens. "Placemarker" is a pretty good description of its nature and role.
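Here is a small sketch of the placemarker at work (the macro name GLUE is mine; empty macro arguments are permitted since C99):
#include <stdio.h>

#define GLUE(a, b) a ## b

int main(void)
{
    int xy = 42;
    printf("%d\n", GLUE(xy, ));  /* b is empty: xy ## placemarker yields xy */
    printf("%d\n", GLUE(, xy));  /* a is empty: placemarker ## xy yields xy */
    return 0;
}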
if the comma , is omitted, as in hash_hash(), does this differ from the standard's explanation (undefined behavior)? And is the placemarker the same as white-space?
Yes. If a macro is defined with multiple parameters, the arguments in an invocation must be separated with comma delimiters. Attempting to invoke a multiple-argument macro otherwise violates the rules of the standard, and it will not compile:
Function-like macros can take arguments, just like true functions. To define a macro that uses arguments, you insert parameters between the pair of parentheses in the macro definition that make the macro function-like. The parameters must be valid C identifiers, separated by commas and optionally whitespace.
Furthermore, because this requirement is well defined, violating it is not an example of undefined behavior.
In the phrase "separated by commas and optionally whitespace" from the quoted paragraph, the "and" indicates that whitespace may accompany the comma delimiter, but whitespace by itself is not sufficient as a delimiter.
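For instance (a sketch with a hypothetical GLUE macro), whitespace inside the parentheses does not create a second argument:
#include <stdio.h>

#define GLUE(a, b) a ## b

int main(void)
{
    int ab = 7;
    printf("%d\n", GLUE(a, b));  /* two arguments: pasted into the identifier ab */
    /* GLUE(a b) would be a constraint violation: whitespace alone does not
     * separate arguments, so "a b" is one argument for a two-parameter macro. */
    return 0;
}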

Does #define in C put the thing exactly or give spaces before and after?

As far as my knowledge goes, the C preprocessor replaces a macro literally, exactly as written in the #define. But now I am seeing that it adds spaces before and after.
Is my explanation correct or am I doing something which should give some undefined behaviors?
Consider the following C code:
#include <stdio.h>

#define k +-6+-
#define kk xx+k-x

int main()
{
    int x = 1029, xx = 4, t;
    printf("x=%d,xx=%d\n", x, xx);
    t = (35*kk*2)*4;
    printf("t=%d,x=%d,xx=%d\n", t, x, xx);
    return 0;
}
The initial values are x = 1029 and xx = 4. Let's calculate the value of t now.
t = (35*kk*2)*4;
t = (35*xx+k-x*2)*4; // replacing the literal kk
t = (35*xx++-6+--x*2)*4; // replacing the literal k
Now, the value of xx is 4 (it would be incremented only after this use), and x is pre-decremented to 1028. So the calculation of the current statement:
t = (35*4-6+1028*2)*4;
t = (140-6+2056)*4;
t = 2190*4;
t = 8760;
But the output of the above code is:
x=1029,xx=4
t=8768,x=1029,xx=4
From the second line of the output, it is clear that the increments and decrements did not take place.
That means after replacing k and kk, it is becoming:
t = (35*xx+ +-6+- -x*2)*4;
(If it is, then the calculation is clear.)
My concern: is this standard C behavior or undefined behavior? Or am I doing something wrong?
The C standard specifies that the source file is analyzed and parsed into preprocessor tokens. When macro replacement occurs, a macro that is replaced is replaced with those tokens. The replacement is not literal text replacement.
C 2018 5.1.1.2 specifies translation phases (rephrasing and summarizing, not exact quotes):
Physical source file multibyte characters are mapped to the source character set. Trigraph sequences are replaced by single-character representations.
Lines continued with backslashes are merged.
The source file is converted from characters into preprocessing tokens and white-space characters—each sequence of characters that can be a preprocessing token is converted to a preprocessing token, and each comment becomes one space.
Preprocessing is performed (directives are executed and macros are expanded).
Source characters in character constants and string literals are converted to members of the execution character set.
Adjacent string literals are concatenated.
White-space characters are discarded. “Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.” (That quoted text is the main part of C compilation as we think of it!)
The program is linked to become an executable file.
So, in phase 3, the compiler recognizes that #define kk xx+k-x consists of the tokens #, define, kk, xx, +, k, -, and x. The compiler also knows there is white space between define and kk and between kk and xx, but this white space is not itself a preprocessor token.
In phase 4, when the compiler replaces kk in the source, it is doing so with these tokens. kk gets replaced by the tokens xx, +, k, -, and x, and k is replaced by the tokens +, -, 6, +, and -. Combined, those form xx, +, +, -, 6, +, -, -, -, and x.
The tokens remain that way. They are not reanalyzed to put + and + together to form ++.
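To complete the picture, evaluating the token sequence the preprocessor actually produced reproduces the observed output; the inner + and - signs are unary operators, and no ++ or -- token is ever formed:
t = (35*xx+ +-6+- -x*2)*4
t = (35*4 + (+(-6)) + ((-(-1029))*2))*4
t = (140 - 6 + 2058)*4
t = 2192*4
t = 8768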
As Eric Postpischil says in a comprehensive answer, the C preprocessor works on tokens, not character strings, and once the input is tokenised, whitespace is no longer needed to separate adjacent tokens.
If you ask a C preprocessor to print out the processed program text, it will probably add whitespace characters where needed to separate the tokens. But that's just for your convenience; the whitespace might or might not be present, and it makes almost no difference because it has no semantic value and will be discarded before the token sequence is handed over to the compiler.
But there is a brief moment during preprocessing when you can see some whitespace, or at least an indication as to whether there was whitespace inside a token sequence, if you can pass the token sequence as an argument to a function-like macro.
Most of the time, the preprocessor does not modify tokens. The tokens it receives are what it outputs, although not necessarily in the same order and not necessarily all of them. But there are two exceptions, involving the two preprocessor operators # (stringify) and ## (token concatenation). The first of these transforms a macro argument -- a possibly empty sequence of tokens -- into a string literal, and when it does so it needs to consider the presence or absence of whitespace in the token sequence.
(The token concatenation operator combines two tokens into a single token if possible; when it does so, intervening whitespace is ignored. That operator is not relevant here.)
The C standard actually specifies precisely how whitespace in a macro argument is handled if the argument is stringified, in paragraph 2 of §6.10.3.2:
Each occurrence of white space between the argument’s preprocessing tokens becomes a single space character in the character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted.
We can see this effect in action:
/* I is just used to eliminate whitespace between two macro invocations.
 * The indirection of `STRING/STRING_` is explained in many SO answers;
 * it's necessary in order that the stringify operator apply to the expanded
 * macro argument, rather than the literal argument.
 */
#include <stdio.h>

#define I(x) x
#define STRING_(x) #x
#define STRING(x) STRING_(x)
#define PLUS +

int main(void) {
    printf("%s\n", STRING(I(PLUS)I(PLUS)));
    printf("%s\n", STRING(I(PLUS) I(PLUS)));
}
The output of this program is:
++
+ +
showing that the whitespace in the second invocation was preserved.
Contrast the above with gcc's -E output for ordinary use of the macro:
int main(void) {
    (void) I(PLUS)I(PLUS)3;
    (void) I(PLUS) I(PLUS)3;
}
The macro expansion is
int main(void) {
    (void) + +3;
    (void) + +3;
}
showing that the preprocessor was forced to insert a cosmetic space into the first expansion, in order to preserve the semantics of the macro expansion. (Again, I emphasize that the -E output is not what the preprocessor module passes to the compiler, in normal GCC operation. Internally, it passes a token sequence. All of the whitespace in the -E output above is a courtesy which makes the generated file more useful.)

Are the preprocessor commands in C counted as tokens?

I was reading about tokens and counting the number of tokens in a program.
Previously I read somewhere that preprocessor commands are not counted as tokens.
But when I read about tokens on GeeksforGeeks, its "special symbols" section says:
pre processor(#): The preprocessor is a macro processor that is used automatically by the compiler to transform your program before actual compilation.
So I am confused that in a program, if we write #define will it be a token?
For example:
#include <stdio.h>
#define max 100
int main()
{
    printf("max is %d", max);
    return 0;
}
How many tokens are in this example?
The linked article is full of basic errors, and should not be relied upon.
The process of parsing C or C++ is defined as a series of transformations [1]:
Backslash-newline is replaced with nothing whatsoever -- not even a space.
Comments are removed and replaced with a single space each.
The surviving text is converted into a series of preprocessing tokens. These are less specific than the tokens used by the language proper: for instance, the keyword if is an IF token to the language proper, but just an IDENT token to the preprocessor.
Preprocessing directives are executed and macros are expanded.
Each preprocessing token is converted into a token.
The stream of tokens is parsed into an abstract syntax tree, and the rest of the compiler takes it from there.
Your example program
#include <stdio.h>
#define max 100
int main()
{
    printf("max is %d", max);
    return 0;
}
will, after transformation 3, be this series of 23 preprocessing tokens:
PUNCT:# IDENT:include INCLUDE-ARG:<stdio.h>
PUNCT:# IDENT:define IDENT:max PP-NUMBER:100
IDENT:int IDENT:main PUNCT:( PUNCT:)
PUNCT:{
IDENT:printf PUNCT:( STRING:"max is %d" PUNCT:, IDENT:max PUNCT:) PUNCT:;
IDENT:return PP-NUMBER:0 PUNCT:;
PUNCT:}
The directives are still present at this stage. Please notice that #include and #define are each two tokens: the # and the directive name are separate. Some people like to write complex #if nests with the hashmarks all in column 1 but the directive names indented.
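For example, that two-token structure is exactly what makes this indentation style legal (FOO and BAR are placeholder names):
#if defined FOO
#  if FOO > 1
#    define BAR 2
#  endif
#endif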
After transformation 5, though, the directives are gone and we have this series of 16+n tokens:
[ ... some large volume of tokens produced from the contents of stdio.h ... ]
INT IDENT:main LPAREN RPAREN
LBRACE
IDENT:printf LPAREN STRING:"max is %d" COMMA DECIMAL-INTEGER:100 RPAREN SEMICOLON
RETURN DECIMAL-INTEGER:0 SEMICOLON
RBRACE
where 'n' is however many tokens came from stdio.h.
Preprocessing directives (#include, #define, #if, etc.) are always removed from the token stream and perhaps replaced with something else, so you will never have tokens after transformation 6 that directly result from the text of a directive line. But you will usually have tokens that result from the effects of each directive, such as the contents of stdio.h, and DECIMAL-INTEGER:100 replacing IDENT:max.
Finally, C and C++ perform this series of operations in almost, but not quite, the same way, and the specifications are formally independent. You can usually rely on preprocessing operations to behave the same in both languages, as long as you're only doing simple things with the preprocessor, which is best practice nowadays anyway.
[1] You will sometimes see people talking about translation phases, which are the way the C and C++ standards officially describe this series of operations. My list is not the list of translation phases; it includes separate bullet points for some things that are grouped as a single phase by the standards, and leaves out several steps that aren't relevant to this discussion.

Why do these consecutive macro replacements not result in an error?

This program gives the output 5. But after replacing all the macros, it would seem to result in --5, which should cause a compilation error from trying to decrement the constant 5. Yet it compiles and runs fine.
#include <stdio.h>

#define A -B
#define B -C
#define C 5

int main()
{
    printf("The value of A is %d\n", A);
    return 0;
}
Why is there no error?
Here are the steps for the compilation of the statement printf("The value of A is %d\n", A);:
the lexical parser produces the preprocessing tokens printf, (, "The value of A is %d\n", ,, A, ) and ;.
A is a macro that expands to the 2 tokens - and B.
B is also a macro and gets expanded to - and C.
C is again a macro and gets expanded to 5.
the preprocessing tokens are then converted to C tokens, producing errors for any that do not convert to a proper C token (e.g. 0a). In this example, the tokens convert unchanged.
the compiler parses the resulting sequence according to the C grammar: printf, (, "The value of A is %d\n", ,, -, -, 5, ), ; matches a function call to printf with 2 arguments: a format string and a constant expression - - 5, which evaluates to 5 at compile time.
the code is therefore equivalent to printf("The value of A is %d\n", 5);. It will produce the output:
The value of A is 5
This sequence of macros is expanded as tokens, not strictly as a sequence of characters; hence A does not expand to --5 but to - -5. Good C compilers insert an extra space when preprocessing the source to textual output, to ensure that the resulting text produces the same sequence of tokens when reparsed. Note, however, that the C Standard says nothing about preprocessing to textual output; it only specifies preprocessing as one of the translation phases, and it is a quality-of-implementation issue for compilers not to introduce potential side effects when preprocessing to textual output.
There is a separate feature for combining tokens into new tokens in the preprocessor called token pasting. It requires a specific operator ## and is quite tricky to use.
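For contrast, here is a sketch (the macro name CAT is mine) showing that ## genuinely manufactures a -- token where ordinary expansion cannot:
#include <stdio.h>

#define CAT(a, b) a ## b

int main(void)
{
    int i = 5;
    int a = - - i;        /* two separate tokens: unary minus applied twice, a == 5 */
    int b = CAT(-, -) i;  /* ## pastes the dashes into the single token --,
                             so this is --i: i becomes 4 and b == 4 */
    printf("%d %d %d\n", a, b, i);  /* prints 5 4 4 */
    return 0;
}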
Note also that macros should be defined with parentheses around each argument and parentheses around the whole expansion to prevent operator precedence issues:
#define A (-B)
#define B (-C)
#define C 5
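The kind of precedence surprise those parentheses guard against, sketched with a hypothetical DOUBLE macro:
#include <stdio.h>

#define DOUBLE(x)      x + x
#define DOUBLE_SAFE(x) ((x) + (x))

int main(void)
{
    printf("%d\n", 3 * DOUBLE(2));       /* expands to 3 * 2 + 2, prints 8 */
    printf("%d\n", 3 * DOUBLE_SAFE(2));  /* expands to 3 * ((2) + (2)), prints 12 */
    return 0;
}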
Two consecutive dashes are not combined into a single pre-decrement operator -- because the C preprocessor works with individual tokens, effectively inserting whitespace around macro substitutions. Running this program through gcc -E
#define A -B
#define B -C
#define C 5

int main() {
    return A;
}
produces the following output:
int main() {
    return - -5;
}
Note the space after the first -.
According to the standard, macro replacements are performed at the level of preprocessor tokens, not at the level of individual characters (C11 6.10.3/9):
A preprocessing directive of the form
# define identifier replacement-list new-line
defines an object-like macro that causes each subsequent instance of the macro name to be replaced by the replacement list of preprocessing tokens that constitute the remainder of the directive.
Therefore, the two dashes - constitute two different tokens, so they are kept separate from each other in the output of the preprocessor.
Whenever we use #define in a C program, the preprocessor replaces the macro name with its replacement text wherever it is used.
#define A -B
#define B -C
#define C 5
So when we print A, it expands in the following steps:
A => -B
B => -C, so A => -(-C)
C => 5, so A => -(-5) => 5
So when we print the value of A, it comes out to be 5.
Generally, these #define statements are used to declare constants that are used throughout the code.
