#include<stdio.h>
#define A -B
#define B -C
#define C 5
int main() {
printf("The value of A is %d\n", A);
return 0;
}
I came across the above code. I thought that after preprocessing, it gets transformed to
// code from stdio.h
int main() {
printf("The value of A is %d\n", --5);
return 0;
}
which should result in a compilation error. But the code compiles fine and prints 5.
How does the code get preprocessed in this case so that it does not result in a compiler error?
PS: I am using gcc version 8.2.0 on Linux x86-64.
The preprocessor is defined as operating on a stream of tokens, not text. You have to read through all of sections 5.1.1, 6.4, and 6.10 of the C standard to fully understand how this works, but the critical bits are in 5.1.1.1 "Phases of translation": in phase 3, the source file is "decomposed into preprocessing tokens"; phases 4, 5, and 6 operate on those tokens; and in phase 7 "each preprocessing token is converted into a token". That indefinite article is critical: each preprocessing token becomes exactly one token.
What this means is, if you start with this source file
#define A -B
#define B -C
#define C 5
A
then, after translation phase 4 (macro expansion, among other things), what you have is a sequence of three preprocessing tokens,
<punctuator: -> <punctuator: -> <pp-number: 5>
and at the beginning of translation phase 7 that becomes
TK_MINUS TK_MINUS TK_INTEGER:5
which is then parsed as the expression -(-(5)) rather than as --(5). The standard offers no latitude in this: a C compiler that parses your example as --(5) is defective.
When you ask a compiler to dump out preprocessed source as text, the form of that text is not specified by the standard; typically, what you get has whitespace inserted as necessary so that a human will understand it the same way translation phase 7 would have.
Today I read this question Any rules about underscores in filenames in C/C++?,
and I found it very interesting that the standard seems to not allow what is usually seen in many libraries (I also do it in my personal library this way):
For example, in opencv we can see this:
// File: opencv/include/opencv2/opencv.hpp
#include "opencv2/opencv_modules.hpp"
But the standard says:
§ 6.10.2 Source file inclusion
Semantics
5 The implementation shall provide unique mappings for sequences
consisting of one or more nondigits or digits (6.4.2.1) followed by a
period (.) and a single nondigit. The first character shall not be
a digit. The implementation may ignore distinctions of alphabetical
case and restrict the mapping to eight significant characters before
the period.
nondigit means letters (A-Z a-z) and underscore _.
It says absolutely nothing about / which would imply that it is forbidden to use a path, not to mention dots or hyphens in file names.
To test this first, I wrote a simple program with a source file test.c and a header file _1.2-3~a.hh in the same directory tst/:
// File: test.c
#include "./..//tst//./_1.2-3~a.hh"
int main(void)
{
char a [10] = "abcdefghi";
char b [5] = "qwert";
strncpy(b, a, 5 - 1);
printf("b: \"%c%c%c%c%c\"\n", b[0], b[1], b[2], b[3], b[4]);
/* printed: b: "abcdt" */
b[5 - 1] = '\0';
printf("b: \"%c%c%c%c%c\"\n", b[0], b[1], b[2], b[3], b[4]);
/* printed: b: "abcd" */
return 0;
}
// File: _1.2-3~a.hh
#include <stdio.h>
#include <string.h>
I compiled it with these options: $ gcc -std=c11 -pedantic-errors test.c -o tst and got no complaints from the compiler (I have gcc (Debian 8.2.0-8) 8.2.0).
Is it really forbidden to use a relative path in an include?
Ah; the standard is really talking about the minimum character set of the filesystem supporting the C compiler.
Anything in the "" (or <> with some preprocessing first) is parsed as a string according to normal C rules and passed from there to the OS to do whatever it wants with it.
This leads to compiler errors on Windows when the programmer forgets to type \\ instead of '\' when writing a path into the header files. On modern Windows we can just use '/' and expect it to work but on older Windows or DOS it didn't.
For extra fun, try
#include "/dev/tty"
Really nice one. It wants you to type C code while compiling.
I would say it's not forbidden but not recommended, since it will not compile in some cases.
For example:
if you clone this directory into your root (so you'd have C:\test\).
if you try to run it in a virtual environment online, you may face issues.
Is it really forbidden to use a path in an include?
Not sure what you mean here: relative paths are commonly used, but using an absolute path would be foolish.
Comments are usually converted to a single white-space before the preprocessor is run. However, there is a compelling use case.
#pragma once
#ifdef DOXYGEN
#define DALT(t,f) t
#else
#define DALT(t,f) f
#endif
#define MAP(n,a,d) \
DALT ( COMMENT(| n | a | d |) \
, void* mm_##n = a \
)
/// Memory map table
/// | name | address | description |
/// |------|---------|-------------|
MAP (reg0 , 0 , foo )
MAP (reg1 , 8 , bar )
In this example, when the DOXYGEN flag is set, I want to generate doxygen markup from the macro. When it isn't, I want to generate the variables. In this instance, the desired behaviour is to generate comments in the macros. Any thoughts about how?
I've tried /##/ and another example with more indirection
#define COMMENT SLASH(/)
#define SLASH(s) /##s
Neither works.
In doxygen it is possible to run commands on the sources before they are fed into the doxygen kernel. The Doxyfile offers several FILTER settings for this; in this case the INPUT_FILTER line should read:
INPUT_FILTER = "sed -e 's%^ *MAP *(\([^,]*\),\([^,]*\),\([^)]*\))%/// | \1 | \2 | \3 |%'"
Furthermore the entire #if construct can disappear and one, probably, just needs:
#define MAP(n,a,d) void* mm_##n = a
The ISO C standard describes the output of the preprocessor as a stream of preprocessing tokens, not text. Comments are not preprocessing tokens; they are stripped from the input before tokenization happens. Therefore, within the standard facilities of the language, it is fundamentally impossible for preprocessing output to contain comments or anything that resembles them.
In particular, consider
#define EMPTY
#define NOT_A_COMMENT_1(text) /EMPTY/EMPTY/ text
#define NOT_A_COMMENT_2(text) / / / text
NOT_A_COMMENT_1(word word word)
NOT_A_COMMENT_2(word word word)
After translation phase 4, the fourth and fifth lines of the above both become the six-token sequence
[/][/][/][word][word][word]
where square brackets indicate token boundaries. There isn't any such thing as a // token, and therefore there is nothing you can do to make the preprocessor produce one.
Now, the ISO C standard doesn't specify the behavior of doxygen. However, if doxygen is reusing a preprocessor that came with someone's C compiler, the people who wrote that preprocessor probably thought textual preprocessor output should be, above all, an accurate reflection of the token sequence that the "compiler proper" would receive. That means it will forcibly insert spaces where necessary to make separate tokens remain separate. For instance, with test.c containing the above example,
$ gcc -E test.c
...
/ / / word word word
/ / / word word word
(I have elided some irrelevant chatter above the output we're interested in.)
If there is a way around this, you are most likely to find it in the doxygen manual. There might, for instance, be configuration options that teach it that certain macros should be understood to define symbols, and what symbols those are, and what documentation they should have.
For example:
int main()
{
fun();//calling a fun
}
void fun(void)
{
#if 0
int a = 4;
int b = 5;
#endif
}
What is the size of the fun() function? And how much memory in total will be created for the main() function?
Compilation of a C source file is done in multiple phases. The phase where the preprocessor runs is done before the phase where the code is compiled.
The "compiler" will not even see code that the preprocessor has removed; from its point of view, the function is simply
void fun(void)
{
}
Whether the function will "create memory" depends on the compiler and its optimization settings. For a debug build the function will probably still exist and be called. For an optimized release build the compiler might not call the function, or even keep it (that is, generate boilerplate code for it).
Compilation is split into four stages:
Preprocessing.
Compilation.
Assembly.
Linking.
The preprocessor directives are processed before the actual compilation starts, and conditional inclusion is one of the things performed in this stage.
The #if is a conditional inclusion directive.
From C11 draft 6.10.1-3:
Preprocessing directives of the forms
#if constant-expression new-line group_opt
#elif constant-expression new-line group_opt
check whether the controlling constant expression evaluates to nonzero.
In your code the controlling expression of #if 0 is the constant 0, which is not nonzero, so the code within the conditional block is excluded.
The output of the preprocessing stage can be written to stdout with the -E option:
gcc -E filename.c
The output of the command above ends with:
# 943 "/usr/include/stdio.h" 3 4
# 2 "filename.c" 2
void fun(void)
{
}
int main()
{
fun();
return 0;
}
As we can see, the statements inside the #if block are removed during the preprocessing stage.
This directive can be used to keep certain blocks of code from being compiled.
Now, to see whether the compiler allocates any memory for an empty function, take
filename.c:
void fun(void)
{
}
int main()
{
fun();
return 0;
}
The size command gives,
$ size a.out
text data bss dec hex filename
1171 552 8 1731 6c3 a.out
and for the code,
filename.c:
void fun(void)
{
#if 0
int a = 4;
int b = 5;
#endif
}
int main()
{
fun();
return 0;
}
The output of size command for the above code is,
$ size a.out
text data bss dec hex filename
1171 552 8 1731 6c3 a.out
In both cases the sizes are identical, from which we can conclude that the compiler does not allocate any memory for the block of code disabled by #if 0.
According to the GCC reference:
The simplest sort of conditional is
#ifdef MACRO
controlled text
#endif /* MACRO */
This block is called a conditional group. controlled text will be
included in the output of the preprocessor if and only if MACRO is
defined. We say that the conditional succeeds if MACRO is defined,
fails if it is not.
The controlled text inside of a conditional can include preprocessing
directives. They are executed only if the conditional succeeds. You
can nest conditional groups inside other conditional groups, but they
must be completely nested. In other words, ‘#endif’ always matches the
nearest ‘#ifdef’ (or ‘#ifndef’, or ‘#if’). Also, you cannot start a
conditional group in one file and end it in another.
Even if a conditional fails, the controlled text inside it is still
run through initial transformations and tokenization. Therefore, it
must all be lexically valid C. Normally the only way this matters is
that all comments and string literals inside a failing conditional
group must still be properly ended.
The comment following the ‘#endif’ is not required, but it is a good
practice if there is a lot of controlled text, because it helps people
match the ‘#endif’ to the corresponding ‘#ifdef’. Older programs
sometimes put MACRO directly after the ‘#endif’ without enclosing it
in a comment. This is invalid code according to the C standard. CPP
accepts it with a warning. It never affects which ‘#ifndef’ the
‘#endif’ matches.
Sometimes you wish to use some code if a macro is not defined. You can
do this by writing ‘#ifndef’ instead of ‘#ifdef’. One common use of
‘#ifndef’ is to include code only the first time a header file is
included.
I started coding C in vim and I have some problems.
The backslash is intended to join lines but when I try to write:
ret\
urn 0;
I get
return
0;
and when I add spaces before urn, the lines are not joined:
ret\
    urn 0;
Why, in the second case, do I not get return 0; but instead
ret
    urn 0;
command:
gcc -E -Wall -Wextra -Wimplicit -pedantic -std=c99 main.c -o output.i
GCC 5.4,
Vim 7.4
-E output is not officially specified by the standard. It's an engineering tradeoff among several different design constraints, of which the relevant two are:
whitespace must be inserted or deleted as necessary so that the "compiler proper" (imagine feeding the -E output back into gcc -fpreprocessed — which is what -save-temps does) sees the same sequence of pp-tokens that it would normally (without -E). (See C99 section 6.4 for the definition of a pp-token.)
to the maximum extent possible, tokens should appear at the same line and column position that they did in the original source code, so that error messages and debugging information are as accurate as possible.
Here's how this applies to your examples:
ret\
urn 0;
The backslash-newline combines ret and urn into a single pp-token, which must therefore appear all together on one line in the output. The 0 and ;, however, should continue to be on their original line and column so that diagnostics are accurate. So you get
return
    0;
with spaces inserted to keep the 0 in its original column.
ret\
    urn 0;
Here the backslash-newline is immediately followed by whitespace, so ret and urn do not have to be combined, so, again, the diagnostics are most accurate if everything stays where it originally was, and the output is
ret
    urn 0;
which looks like the backslash-newline had no effect at all.
You might find the output of gcc -E -P less surprising. -P tells the preprocessor not to bother trying to preserve token position (and also turns off all those lines beginning with # in the output). Your examples produce return 0; and ret urn 0;, both all on one line, in -P mode.
Finally, a word of advice: everyone who ever has to read your code (and that includes yourself six months later) will appreciate it if you never split a token in the middle with backslash-newline, except for very long string literals. It's a legacy misfeature that wouldn't be included in the language if it were designed from scratch today.
The white space is a token separator. Just because you split the line doesn't mean a white space will be ignored.
What the compiler sees is something like ret urn;, which is not valid C: it is two identifiers that probably weren't declared before, and they do not form a valid expression.
Keywords must be written as a single token with no spaces.
Now, when you do :
ret\
urn;
The backslash followed by a newline is removed in the early translation phases, and the subsequent line is appended. If the line has no white spaces at the beginning, the result is a valid token that the compiler understands as the keyword return.
Long story short, you seem to be asking about behavior specific to GCC. It looks like a compiler bug, since clang does the expected thing (although the line count remains the same):
clang -E -Wall -Wextra -Wimplicit -pedantic -std=c99 -x c main.cpp
# 1 "main.cpp"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 316 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "main.cpp" 2
int main(void) {
ret urn 0;
}
It doesn't seem crucial however, since in this particular case the code will be invalid either way.
The behavior of the C preprocessor on \ followed by a newline is to remove both bytes from the input. This is done in a very early phase of the parsing. Yet the preprocessor retains the original line number for each token it sees and tries to output tokens on separate lines for the compiler to issue correct diagnostics for later phases of compilation.
For the input:
ret\
urn 1;
it may produce:
#line 1 "myfile.c"
return
#line 2 "myfile.c"
1;
Which it may shorten as
return
1;
Note that you can split any input line at any position with an escaped newline:
#inclu\
de <st\
dio.h>\
"Hello word\\
n"
for (i = 0; i < n; i+\
+)
ret\
\
\
urn;
\
r\
et\
urn\
123;\
Sorry if my question is very basic. I would like to understand the output produced by the preprocessor cpp. Let's say I have the following very basic program.
#include <stdio.h>
#include <stdlib.h>
int x=100;
int main ()
{
printf ("\n Welcome..\n");
}
I execute the following command.
cpp main.c main.i
In main.i I see:
# 1 "/usr/include/stdio.h" 1 3 4
What is the meaning of the above line?
The gcc documentation explains the C preprocessor output aptly.
Here are the relevant sections:
The output from the C preprocessor looks much like the input, except that all preprocessing directive lines have been replaced with blank lines and all comments with spaces. Long runs of blank lines are discarded.
Source file name and line number information is conveyed by lines of the form
# linenum filename flags
These are called linemarkers. They are inserted as needed into the output (but never within a string or character constant). They mean that the following line originated in file filename at line linenum. filename will never contain any non-printing characters; they are replaced with octal escape sequences.
After the file name comes zero or more flags, which are 1, 2, 3, or 4. If there are multiple flags, spaces separate them. Here is what the flags mean:
1 This indicates the start of a new file.
2 This indicates returning to a file (after having included another file).
3 This indicates that the following text comes from a system header file, so certain warnings should be suppressed.
4 This indicates that the following text should be treated as being wrapped in an implicit extern "C" block.