I was reading about tokens and counting the number of tokens in a program.
Previously I read somewhere that preprocessor commands are not counted as tokens.
But when I read about tokens on Geeksforgeeks it is given in section "special symbols":
pre processor(#): The preprocessor is a macro processor that is used automatically by the compiler to transform your program before actual compilation.
So I am confused that in a program, if we write #define will it be a token?
For example:
#include<stdio.h>
#define max 100
int main()
{
printf("max is %d", max);
return 0;
}
How many tokens are in this example?
The linked article is full of basic errors, and should not be relied upon.
The process of parsing C or C++ is defined as a series of transformations:¹
1. Backslash-newline is replaced with nothing whatsoever -- not even a space.
2. Comments are removed and replaced with a single space each.
3. The surviving text is converted into a series of preprocessing tokens. These are less specific than the tokens used by the language proper: for instance, the keyword if is an IF token to the language proper, but just an IDENT token to the preprocessor.
4. Preprocessing directives are executed and macros are expanded.
5. Each preprocessing token is converted into a token.
6. The stream of tokens is parsed into an abstract syntax tree, and the rest of the compiler takes it from there.
Your example program
#include<stdio.h>
#define max 100
int main()
{
printf("max is %d", max);
return 0;
}
will, after transformation 3, be this series of 23 preprocessing tokens:
PUNCT:# IDENT:include INCLUDE-ARG:<stdio.h>
PUNCT:# IDENT:define IDENT:max PP-NUMBER:100
IDENT:int IDENT:main PUNCT:( PUNCT:)
PUNCT:{
IDENT:printf PUNCT:( STRING:"max is %d" PUNCT:, IDENT:max PUNCT:) PUNCT:;
IDENT:return PP-NUMBER:0 PUNCT:;
PUNCT:}
The directives are still present at this stage. Please notice that #include and #define are each two tokens: the # and the directive name are separate. Some people like to write complex #if nests with the hashmarks all in column 1 but the directive names indented.
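For example, that indented style looks something like this (a layout illustration only, with made-up macro names):
#if defined(FOO)
#  if FOO > 1
#    define BAR 2
#  else
#    define BAR 1
#  endif
#endif
Because the # and the directive name are separate tokens, the whitespace between them is allowed.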
After transformation 5, though, the directives are gone and we have this series of 16+n tokens:
[ ... some large volume of tokens produced from the contents of stdio.h ... ]
INT IDENT:main LPAREN RPAREN
LBRACE
IDENT:printf LPAREN STRING:"max is %d" COMMA DECIMAL-INTEGER:100 RPAREN SEMICOLON
RETURN DECIMAL-INTEGER:0 SEMICOLON
RBRACE
where 'n' is however many tokens came from stdio.h.
Preprocessing directives (#include, #define, #if, etc.) are always removed from the token stream in transformation 4, and perhaps replaced with something else, so you will never have tokens after that point that directly result from the text of a directive line. But you will usually have tokens that result from the effects of each directive, such as the contents of stdio.h, and DECIMAL-INTEGER:100 replacing IDENT:max.
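If you want to see this effect yourself, feeding the example to the preprocessor alone (for instance gcc -E, assuming GCC is available) prints, after the many lines that come from stdio.h, something like:
int main()
{
printf("max is %d", 100);
return 0;
}
with max already replaced by 100 and both directive lines gone.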
Finally, C and C++ do this series of operations almost, but not quite, the same, and the specifications are formally independent. You can usually rely on preprocessing operations to behave the same in both languages, as long as you're only doing simple things with the preprocessor, which is best practice nowadays anyway.
¹ You will sometimes see people talking about translation phases, which are the way the C and C++ standards officially describe this series of operations. My list is not the list of translation phases; it includes separate bullet points for some things that are grouped as a single phase by the standards, and leaves out several steps that aren't relevant to this discussion.
Related
int main(){\
int a = 5;\
return a;\
}
Above compiles fine. I assume the C preprocessor removes the backslashes before compilation?
output of gcc -E:
int main(){
int a = 5;
return a;}
It seems that not all the \n (newline) characters get removed, the way they are with macros; it mainly just removed the backslashes.
I have seen this used in multiline macros such as:
#define TEST(in)\
int a = in; \
int b = 6;
int main(){
TEST(5)
return 0;
}
output of gcc -E:
int main(){
int a = 5; int b = 6;
return 0;
}
The preprocessor removes the backslash as well as the \n character in the above example, so why is it not removing all the newline characters in my first example?
"Splices" -- backslash newline sequences -- are removed before the preprocessor processes the program text. At least that's the theory, bearing in mind that the C standard does not actually define a process called the "preprocessor".
What it does define is a procedure for converting the program text into a stream of tokens which can be parsed, and then turning that into an executable. The procedure consists of eight translation phases, and the compiler must produce the same result as would be produced if the phases were executed one at a time, each one taking as input the output of the previous phase. (Most of the inputs and outputs are streams of tokens, rather than character strings. So the output GCC produces when run with the -E flag doesn't correspond to anything in the standard, allowing GCC to basically produce whatever output it finds convenient. Or whatever its authors thought you would find convenient.)
The "as if" clause means that a particular compiler can combine phases or execute them in pieces, as long as it doesn't change the result. So you can really only look at the process as the abstract description of an algorithm. Still, it's useful to understand. The full text is found in §5.1.1.2 of the standard.
What follows is a highly condensed and commented description of the phases, incomplete and somewhat imprecise in its details, in the hope that it's easier to digest than the language in the standard. But do read it in the original.
1. Remove trigraphs (which are now deprecated, so don't worry if you don't know what they are) and, if necessary, convert the program text to whatever character encoding the compiler requires.
2. Remove splices. All backslash-newline sequences are simply removed from the program text, leaving nothing behind. (OK, that's the theory. In practice, most compilers still know the original source line number of every bit of text. But this information is only used for producing diagnostics.)
3. Split the text into tokens and whitespace sequences, and replace all comments with a single space character.
4. "Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed". This is as close as the standard gets to defining the preprocessor, so it's reasonable to say that the "preprocessor" is the execution of phase 4. #include directives are preprocessor directives, and processing the include directive starts with passing the included file through phases 1-3 before inserting it into the token stream to be further preprocessed.
5. Replace all the escape sequences in character and string literals with the actual characters (possibly wide characters) which will be used during execution.
6. Concatenate adjacent string literals (see the sketch after this list).
7. Remove all whitespace, leaving only tokens. Convert preprocessing tokens into syntactic tokens. Parse the resulting token stream and convert it into a "translation unit". Or, in other words, compile the program into an object file (although that's way more specific than the language in the standard).
8. Combine all the translation units and necessary library modules into a single executable image. Informally, this is the linking phase and the result is something you can hand to the operating system for execution.
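As a small sketch of phases 5 and 6 (my own example, not taken from the standard):
#include <stdio.h>

int main(void)
{
    /* Phase 5 converts the escape sequence \n into a real newline
       character; phase 6 then concatenates the two adjacent string
       literals into the single literal "Hello, world\n". */
    printf("Hello, " "world\n");
    return 0;
}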
That's what the standard mandates. But real-world compilers do lots of other stuff, like generate more or less readable error messages; rearrange the code in ways that might make it execute faster and/or occupy less space; insert debugging information into the executable; and produce whatever additional analyses and reports the user has requested (none of which are standardised). This, for example, includes the -E and/or -S outputs. The compiler does these things as a favour to you, and they can be helpful in understanding the way your program was compiled. But you shouldn't take them too seriously, since the official result of the compilation process is the actual executable.
Most compilation toolchains can also produce libraries, so it is not the case that all programs are immediately fully processed into executable images. But that's the only outcome which is standardised. Although the standard refers to libraries, particularly the standard library, it does not make any assumptions about how libraries come into existence.
The standard libraries (and headers) don't even have to exist in the filesystem; it's enough that the compiler recognises their names and responds appropriately. Some of the stuff the standard library has to implement cannot be written in portable C, so it is quite possible that the standard library source code, if it exists, is not all in the form of a standard C program. Standard library headers might include constructs which receive special handling by the compiler, and thus cannot be used by other compilers or copied directly into your program.
This might all seem too much in the air, but the intention was to make it possible to have C implementations which run on extremely limited processors, including processors without any external storage at all. (And it is still quite common to target embedded systems which might be missing lots of things you normally take for granted.) And, on the whole, it's served us pretty well over the years.
The C Standard does not specify how preprocessing can be performed with textual output, as gcc -E does. Compilers try to produce textual output that can be fed back to the compiler to produce the same program. Whitespace details are largely irrelevant for this, as long as the same tokens are produced. In your example, gcc outputs readable text where escaped newlines are output as newlines as long as they are surrounded by white space. This is optional: escaped newlines should actually be removed completely from the input during an early phase, allowing tokens that are broken across separate lines to be reconstituted.
Here is a pathological example:
#inc\
lude\
<st\
dio.\
h>
int \
main\
() {\
retu\
rn 0\
; }
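After the splices are removed in phase 2, those eleven physical lines are effectively just two logical source lines:
#include<stdio.h>
int main() {return 0; }
so the directive and the function definition are reassembled before the preprocessor ever sees them.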
Regarding macros with definitions spanning multiple lines with escaped newlines, their expansion is described precisely by the C Standard: sequences of whitespace and comments must be replaced by a single space.
Here is an illustrative example:
int main(){\
int a = 5;\
return a;\
}
#define TEST() int main(){\
int a = 5;\
return a;\
}
#define XSTR(x) #x
#define STR(x) XSTR(x)
TEST()
STR(TEST())
Output of gcc -E:
# 1 "prepmain.c"
# 1 "<built-in>" # 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "prepmain.c" int main(){
int a = 5;
return a;} # 14 "prepmain.c"
int main(){ int a = 5; return a;} "int main(){ int a = 5; return a;}"
Output of gcc -E -P:
int main(){ int a = 5; return a;}
int main(){ int a = 5; return a;}
"int main(){ int a = 5; return a;}"
#define function(in) in+1
int main(void)
{
printf("%d\n",function(1));
return 0;
}
The above is correctly preprocessed to:
int main(void)
{
printf("%d\n",1+1);
return 0;
}
However, if the macro body in+1 is changed to in_1, the preprocessor will not do the argument replacement at all and will end up with this:
printf("%d\n",in_1);
What is the list of tokens at which the preprocessor can correctly separate the text and insert the argument (like the + sign)?
Short answer: The replacement done by the preprocessor is not simple text substitution. In your case, the argument must be an identifier, and it can only replace identical identifiers.
The related form of preprocessing is
#define identifier(identifier-list) token-sequence
In order for the replacement to take place, the identifiers in the identifier-list and the tokens in the token-sequence must be identical in the token sense, according to C's tokenization rule (the rule for splitting the character stream into tokens).
If you agree with the fact that
in C in and in_1 are two different identifiers (and C cannot relate one to the other), while
in+1 is not an identifier but a sequence of three tokens:
(1) identifier in,
(2) operator +, and
(3) integer constant 1,
then your question is clear: in and in_1 are just two identifiers between which C does not see any relationship, and cannot do the replacement as you wish.
Reference 1: In C, there are six classes of tokens:
(1) identifiers (e.g. in)
(2) keywords (e.g. if)
(3) constants (e.g. 1)
(4) string literals (e.g. "Hello")
(5) operators (e.g. +)
(6) other separators (e.g. ;)
Reference 2: In C, an identifier is a sequence of letters (including _) and digits (the first one cannot be a digit).
Reference 3: The tokenization rule:
... the next token is the longest string of characters that could constitute a token.
This is to say, when reading in+1, the compiler will read all the way to +, and knows that in is an identifier. But in the case of in_1, the compiler will read all the way to the white space after it, and treats in_1 as a single identifier.
All references are from the Reference Manual in K&R's The C Programming Language. The language has evolved, but they capture the essence.
See the C11 standard section 6.4 for the tokenization grammar.
The relevant token type here is identifier, which is defined as any sequence of letters or digits that doesn't start with a digit; also there can be \u codes and other implementation-defined characters in an identifier.
Due to the "maximal munch" principle of tokenization, character sequence in+1 is tokenized as in + 1 (not i n + 1).
If you want in_1, use two hashes....
#define function(in) in ## _1
So...
function(dave) --> dave_1
function(phil) --> phil_1
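A minimal compilable sketch of that (my own example; the variable dave_1 is made up so that the pasted name refers to something):
#include <stdio.h>

#define function(in) in ## _1

int main(void)
{
    int dave_1 = 42;                  /* the pasted name must name something */
    printf("%d\n", function(dave));   /* function(dave) expands to dave_1 */
    return 0;
}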
And for completeness, you can also use a single hash to turn the arg into a text string.
#define function(in) printf(#in "\n");
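For example (a usage sketch of that definition, with a made-up argument):
#include <stdio.h>

#define function(in) printf(#in "\n");

int main(void)
{
    function(hello)   /* expands to printf("hello" "\n"); and prints hello */
    return 0;
}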
This program gives 5 as output. But after replacing all the macros, it would appear to result in --5. This should cause a compilation error, trying to decrement 5. But it compiles and runs fine.
#include <stdio.h>
#define A -B
#define B -C
#define C 5
int main()
{
printf("The value of A is %d\n", A);
return 0;
}
Why is there no error?
Here are the steps for the compilation of the statement printf("The value of A is %d\n", A);:
the lexical parser produces the preprocessing tokens printf, (, "The value of A is %d\n", ,, A, ) and ;.
A is a macro that expands to the 2 tokens - and B.
B is also a macro and gets expanded to - and C.
C is again a macro and gets expanded to 5.
the tokens are then converted to C tokens, producing errors for preprocessing tokens that do not convert to proper C tokens (ex: 0a). In this example, the tokens are identical.
the compiler parses the resulting sequence according to the C grammar: printf, (, "The value of A is %d\n", ,, -, -, 5, ), ; matches a function call to printf with 2 arguments: a format string and a constant expression - - 5, which evaluates to 5 at compile time.
the code is therefore equivalent to printf("The value of A is %d\n", 5);. It will produce the output:
The value of A is 5
This sequence of macros is expanded as tokens, not strictly a sequence of characters, hence A does not expand as --5, but rather as - -5. Good C compilers would insert an extra space when preprocessing the source to textual output to ensure the resulting text produces the same sequence of tokens when reparsed. Note however that the C Standard does not say anything about preprocessing to textual output, it only specifies preprocessing as one of the parsing phases and it is a quality of implementation issue for compilers to not introduce potential side effects when preprocessing to textual output.
There is a separate feature for combining tokens into new tokens in the preprocessor called token pasting. It requires a specific operator ## and is quite tricky to use.
Note also that macros should be defined with parentheses around each argument and parentheses around the whole expansion to prevent operator precedence issues:
#define A (-B)
#define B (-C)
#define C 5
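Here is a separate sketch (my own, with made-up macros) of the kind of precedence surprise such parentheses guard against:
#include <stdio.h>

#define SQ_BAD(x)  x * x            /* no parentheses */
#define SQ_GOOD(x) ((x) * (x))      /* argument and whole expansion parenthesized */

int main(void)
{
    printf("%d\n", SQ_BAD(1 + 2));   /* expands to 1 + 2 * 1 + 2  -> prints 5 */
    printf("%d\n", SQ_GOOD(1 + 2));  /* expands to ((1 + 2) * (1 + 2)) -> prints 9 */
    return 0;
}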
Two consecutive dashes are not combined into a single pre-decrement operator -- because the C preprocessor works with individual tokens, effectively inserting whitespace around macro substitutions. Running this program through gcc -E
#define A -B
#define B -C
#define C 5
int main() {
return A;
}
produces the following output:
int main() {
return - -5;
}
Note the space after the first -.
According to the standard, macro replacements are performed at the level of preprocessor tokens, not at the level of individual characters (6.10.3.9):
A preprocessing directive of the form
# define identifier replacement-list new-line
defines an object-like macro that causes each subsequent instance of the macro name to be replaced by the replacement list of preprocessing tokens that constitute the remainder of the directive.
Therefore, the two dashes - constitute two different tokens, so they are kept separate from each other in the output of the preprocessor.
Whenever we use #define in a C program, the preprocessor will replace the defined name with its value wherever it is used.
#define A -B
#define B -C
#define C 5
So when we print A, the expansion proceeds in the following steps:
A => -B
B => -C
C => 5
so A => -(-C) => -(-5) => 5.
When we print the value of A, it therefore comes out to be 5.
Generally, these #define statements are used to declare the values of constants that are to be used throughout the code.
I saw some examples in the CPP manual where we can write a macro's body over many lines without the backslash.
#define strange(file) fprintf (file, "%s %d",
...
strange(stderr) p, 35)
output:
fprintf (stderr, "%s %d", p, 35)
Are these special cases, like directives inside macro arguments, or is this allowed only for #define?
For #include directives, it must always be declared on one line, if I am not wrong.
Edit:
From https://gcc.gnu.org/onlinedocs/cpp/Directives-Within-Macro-Arguments.html
3.9 Directives Within Macro Arguments
Occasionally it is convenient to use preprocessor directives within
the arguments of a macro. The C and C++ standards declare that
behavior in these cases is undefined. GNU CPP processes arbitrary
directives within macro arguments in exactly the same way as it would
have processed the directive were the function-like macro invocation
not present.
If, within a macro invocation, that macro is redefined, then the new
definition takes effect in time for argument pre-expansion, but the
original definition is still used for argument replacement. Here is a
pathological example:
#define f(x) x x
f (1
#undef f
#define f 2
f)
which expands to
1 2 1 2
with the semantics described above.
The example is on many lines.
Multi-line macro definitions without backslash-newline
Since comments are replaced by spaces in translation phase 3:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments). A source file shall not end in a
partial preprocessing token or in a partial comment. Each comment is replaced by
one space character. New-line characters are retained. Whether each nonempty
sequence of white-space characters other than new-line is retained or replaced by
one space character is implementation-defined.
and the preprocessor runs as phase 4:
Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
it is possible, but absurd, to write a multi-line macro like this:
#include <stdio.h>
#define possible_but_absurd(a, b) /* comments
*/ printf("are translated"); /* in phase 3
*/ printf(" before phase %d", a); /* (the preprocessor)
*/ printf(" is run (%s)\n", b); /* but why abuse the system? */
int main(void)
{
printf("%s %s", "Macros can be continued without backslashes",
"because comments\n");
possible_but_absurd(4, "ISO/IEC 9899:2011,\nSection 5.1.1.2"
" Translation phases");
return 0;
}
which, when run, states:
Macros can be continued without backslashes because comments
are translated before phase 4 is run (ISO/IEC 9899:2011,
Section 5.1.1.2 Translation phases)
Backslash-newline in macro definitions
Translation phases 1 and 2 are also somewhat relevant:
Physical source file multibyte characters are mapped, in an implementation-defined
manner, to the source character set (introducing new-line characters for
end-of-line indicators) if necessary. Trigraph sequences are replaced by
corresponding single-character internal representations.
The trigraph replacement is nominally relevant because ??/ is the trigraph for a backslash.
Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice. A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such
splicing takes place.
This tells you that by the time phase 4 (the preprocessor) is run, macro definitions are on a single (logical) line — the trailing backslash-newline combinations have been deleted.
The standard notes that the phases are 'as if' — the behaviour of the compiler must be as if it went through the separate phases, but many implementations do not formally separate them out fully.
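For instance (a sketch of my own, not from the question), a definition written as
#define ADD(a, b) \
    ((a) +        \
     (b))
is, by the time phase 4 runs, the single logical line #define ADD(a, b) ((a) + (b)) (give or take the leftover spaces) -- the backslash-newline pairs themselves are deleted.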
Avoid the GCC extension
The expanded example (quote from the GCC manual) has the invocation spread over many lines, but the definition is strictly on one line. (This much is not a GCC extension but completely standard behaviour.)
Note that if you're remotely sane, you'll ignore the possibility of putting preprocessing directives within the invocation of a macro (the #undef and #define in the example). It is a GCC extension and totally unportable. The standard says that the behaviour is undefined.
Annex J.2 Undefined behavior
There are sequences of preprocessing tokens within the list of macro arguments that would otherwise act as preprocessing directives (6.10.3).
int main(void)
{
#if 0
something"
#endif
return 0;
}
The simple program above generates a warning: missing terminating " character in gcc. This seems odd, because it means that the compiler allows the code block between #if 0 and #endif to contain an invalid statement like something here, but not a double quote " that isn't paired. The same happens with #ifdef and #ifndef.
Real comments are fine here:
int main(void)
{
/*
something"
*/
return 0;
}
Why? The single quote ' behaves similarly; are there any other tokens that are treated specially?
See the comp.lang.c FAQ, question 11.19:
Under ANSI C, the text inside a "turned off" #if, #ifdef, or #ifndef must still consist of "valid preprocessing tokens." This means that the characters " and ' must each be paired just as in real C code, and the pairs mustn't cross line boundaries.
Compilation goes through many stages before an executable binary is generated.
You are not in the compiler proper yet: it is the preprocessor that is flagging this error. It does not check C language syntax, but problems like unterminated quotes are preprocessor errors.
After the preprocessing pass, your code goes to the C compiler, which would detect the kind of error you are expecting...
The preprocessor works at the token level, and a string literal is considered a single token. The preprocessor is warning you that you have an invalid token.
According to the C99 standard, a preprocessing token is one of these things:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
The standard also says:
If a ' or a " character matches the last category, the behavior is
undefined.
Things like "statement" above are invalid to the C compiler, but it is a valid token, and the preprocessor eliminates this token before it gets to the compiler.
Besides Kevin's answer, Incompatibilities of GCC says:
GCC complains about unterminated character constants inside of preprocessing conditionals that fail. Some programs have English comments enclosed in conditionals that are guaranteed to fail; if these comments contain apostrophes, GCC will probably report an error. For example, this code would produce an error:
#if 0
You can't expect this to work.
#endif
The best solution to such a problem is to put the text into an actual C comment delimited by /*...*/.
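A sketch of that fix applied to the snippet above:
#if 0
/* You can't expect this to work. */
#endif
The apostrophe now sits inside a comment, which is replaced by a space in translation phase 3, so the skipped group never has to tokenize it.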