I started coding C in Vim and I have run into some confusing behavior.
The backslash is supposed to join lines, but when I write:
ret\
urn 0;
I get
return
0;
and when I add spaces before urn, the lines stay unjoined:
ret\
    urn 0;
It stays like that, unjoined.
Why, in the second case, do I not get return 0; but instead
ret
    urn 0;
Command:
gcc -E -Wall -Wextra -Wimplicit -pedantic -std=c99 main.c -o output.i
GCC 5.4,
Vim 7.4
The -E output format is not officially specified by the standard. It's an engineering tradeoff among several design constraints, of which the relevant two are:
whitespace must be inserted or deleted as necessary so that the "compiler proper" (imagine feeding the -E output back into gcc -fpreprocessed — which is what -save-temps does) sees the same sequence of pp-tokens that it would normally (without -E). (See C99 section 6.4 for the definition of a pp-token.)
to the maximum extent possible, tokens should appear at the same line and column position that they did in the original source code, so that error messages and debugging information are as accurate as possible.
Here's how this applies to your examples:
ret\
urn 0;
The backslash-newline combines ret and urn into a single pp-token, which must therefore appear all together on one line in the output. The 0 and ;, however, should continue to be on their original line and column so that diagnostics are accurate. So you get
return
0;
with spaces inserted to keep the 0 in its original column.
ret\
    urn 0;
Here the backslash-newline is immediately followed by whitespace, so ret and urn remain two separate tokens after splicing. Again, the diagnostics are most accurate if everything stays where it originally was, so the output is
ret
    urn 0;
which looks like the backslash-newline had no effect at all.
You might find the output of gcc -E -P less surprising. -P tells the preprocessor not to bother trying to preserve token position (and also turns off all those lines beginning with # in the output). Your examples produce return 0; and ret urn 0;, both all on one line, in -P mode.
Finally, a word of advice: everyone who ever has to read your code (and that includes yourself six months later) will appreciate it if you never split a token in the middle with backslash-newline, except for very long string literals. It's a legacy misfeature that wouldn't be included in the language if it were designed from scratch today.
White space is a token separator. Just because you split the line doesn't mean the white space will be ignored.
What the compiler sees is something like ret urn;, which is not valid C: it is two identifier tokens that were never declared and do not form a valid expression.
Keywords must be written as a single token, with no spaces inside.
Now, when you write:
ret\
urn;
The backslash followed by a newline is removed in the early translation phases, and the following line is appended. If the next line has no white space at its beginning, the result is a single token that the compiler understands as the keyword return.
Long story short, you seem to be asking about GCC-specific behavior. It seems like a compiler bug, since clang does the expected thing (although the line count remains the same):
clang -E -Wall -Wextra -Wimplicit -pedantic -std=c99 -x c main.cpp
# 1 "main.cpp"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 316 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "main.cpp" 2
int main(void) {
ret urn 0;
}
It doesn't seem crucial, however, since in this particular case the code will be invalid either way.
The behavior of the C preprocessor on \ followed by a newline is to remove both bytes from the input. This is done in a very early phase of the parsing. Yet the preprocessor retains the original line number for each token it sees and tries to output tokens on separate lines for the compiler to issue correct diagnostics for later phases of compilation.
For the input:
ret\
urn 1;
it may produce:
#line 1 "myfile.c"
return
#line 2 "myfile.c"
1;
which it may shorten to:
return
1;
Note that you can split any input line at any position with an escaped newline:
#inclu\
de <st\
dio.h>\
"Hello word\\
n"
for (i = 0; i < n; i+\
+)
ret\
\
\
urn;
\
r\
et\
urn\
123;\
Comments are normally replaced by a single white-space character early in preprocessing (translation phase 3). However, there is a compelling use case.
#pragma once
#ifdef DOXYGEN
#define DALT(t,f) t
#else
#define DALT(t,f) f
#endif
#define MAP(n,a,d) \
DALT ( COMMENT(| n | a | d |) \
, void* mm_##n = a \
)
/// Memory map table
/// | name | address | description |
/// |------|---------|-------------|
MAP (reg0 , 0 , foo )
MAP (reg1 , 8 , bar )
In this example, when the DOXYGEN flag is set, I want to generate doxygen markup from the macro. When it isn't, I want to generate the variables. In this instance, the desired behaviour is to generate comments in the macros. Any thoughts about how?
I've tried /##/ and another example with more indirection
#define COMMENT SLASH(/)
#define SLASH(s) /##s
Neither works.
In doxygen it is possible to run commands on the sources before they are fed into the doxygen engine. The Doxyfile offers several FILTER options; in this case INPUT_FILTER should read:
INPUT_FILTER = "sed -e 's%^ *MAP *(\([^,]*\),\([^,]*\),\([^)]*\))%/// | \1 | \2 | \3 |%'"
Furthermore, the entire #ifdef construct can disappear; one probably just needs:
#define MAP(n,a,d) void* mm_##n = a
The ISO C standard describes the output of the preprocessor as a stream of preprocessing tokens, not text. Comments are not preprocessing tokens; they are stripped from the input before tokenization happens. Therefore, within the standard facilities of the language, it is fundamentally impossible for preprocessing output to contain comments or anything that resembles them.
In particular, consider
#define EMPTY
#define NOT_A_COMMENT_1(text) /EMPTY/EMPTY/ text
#define NOT_A_COMMENT_2(text) / / / text
NOT_A_COMMENT_1(word word word)
NOT_A_COMMENT_2(word word word)
After translation phase 4, the fourth and fifth lines of the above will both become the six-token sequence
[/][/][/][word][word][word]
where square brackets indicate token boundaries. There isn't any such thing as a // token, and therefore there is nothing you can do to make the preprocessor produce one.
Now, the ISO C standard doesn't specify the behavior of doxygen. However, if doxygen is reusing a preprocessor that came with someone's C compiler, the people who wrote that preprocessor probably thought textual preprocessor output should be, above all, an accurate reflection of the token sequence that the "compiler proper" would receive. That means it will forcibly insert spaces where necessary to keep separate tokens separate. For instance, with the above example saved as test.c:
$ gcc -E test.c
...
/ / / word word word
/ / / word word word
(I have elided some irrelevant chatter above the output we're interested in.)
If there is a way around this, you are most likely to find it in the doxygen manual. There might, for instance, be configuration options that teach it that certain macros should be understood to define symbols, and what symbols those are, and what documentation they should have.
#include<stdio.h>
#define A -B
#define B -C
#define C 5
int main() {
printf("The value of A is %d\n", A);
return 0;
}
I came across the above code. I thought that after preprocessing, it gets transformed to
// code from stdio.h
int main() {
printf("The value of A is %d\n", --5);
return 0;
}
which should result in a compilation error. But, the code compiles fine and produces output 5.
How does the code get preprocessed in this case so that it does not result into a compiler error?
PS: I am using gcc version 8.2.0 on Linux x86-64.
The preprocessor is defined as operating on a stream of tokens, not text. You have to read through all of sections 5.1.1, 6.4, and 6.10 of the C standard to fully understand how this works, but the critical bits are in 5.1.1.1 "Phases of translation": in phase 3, the source file is "decomposed into preprocessing tokens"; phases 4, 5, and 6 operate on those tokens; and in phase 7 "each preprocessing token is converted into a token". That indefinite article is critical: each preprocessing token becomes exactly one token.
What this means is, if you start with this source file
#define A -B
#define B -C
#define C 5
A
then, after translation phase 4 (macro expansion, among other things), what you have is a sequence of three preprocessing tokens,
<punctuator: -> <punctuator: -> <pp-number: 5>
and at the beginning of translation phase 7 that becomes
TK_MINUS TK_MINUS TK_INTEGER:5
which is then parsed as the expression -(-(5)) rather than as --(5). The standard offers no latitude in this: a C compiler that parses your example as --(5) is defective.
When you ask a compiler to dump out preprocessed source as text, the form of that text is not specified by the standard; typically, what you get has whitespace inserted as necessary so that a human will understand it the same way translation phase 7 would have.
I wrote this code that creates identifiers containing universal character names via token concatenation.
//#include <stdio.h>
int printf(const char*, ...);
#define CAT(a, b) a ## b
int main(void) {
//int \u306d\u3053 = 10;
int CAT(\u306d, \u3053) = 10;
printf("%d\n", \u306d\u3053);
//printf("%d\n", CAT(\u306d, \u3053));
return 0;
}
This code worked well with gcc 4.8.2 (with the -fextended-identifiers option) and gcc 5.3.1, but didn't work with clang 3.3, which gave this error message:
prog.c:10:17: error: use of undeclared identifier 'ねこ'
printf("%d\n", \u306d\u3053);
^
1 error generated.
and local clang (Apple LLVM version 7.0.2 (clang-700.1.81)) with error message:
$ clang -std=c11 -Wall -Wextra -o uctest1 uctest1.c
warning: format specifies type 'int' but the argument has type
'<dependent type>' [-Wformat]
uctest1.c:10:17: error: use of undeclared identifier 'ねこ'
printf("%d\n", \u306d\u3053);
^
1 warning and 1 error generated.
When I used -E option to have the compilers output code with macro expanded, gcc 5.3.1 emitted this:
# 1 "main.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "main.c"
int printf(const char*, ...);
int main(void) {
int \U0000306d\U00003053 = 10;
printf("%d\n", \U0000306d\U00003053);
return 0;
}
local clang emitted this:
# 1 "uctest1.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 326 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "uctest1.c" 2
int printf(const char*, ...);
int main(void) {
int \u306d\u3053 = 10;
printf("%d\n", ねこ);
return 0;
}
As you can see, the identifiers declared and used in printf() match in gcc's output, but they don't match in clang's output.
I know that creating universal character names via token concatenation invokes undefined behavior.
Quote from N1570 5.1.1.2 Translation phases:
If a character sequence that matches the syntax of a universal character name is produced by token concatenation (6.10.3.3), the behavior is undefined.
I thought that this character sequence \u306d\u3053 may "match the syntax of a universal character name" because it contains universal character names as its substring.
I also thought that "match" may mean that the entire token produced via concatenation stands for one universal character name, and that therefore this undefined behavior isn't invoked in this code.
Reading PRE30-C. Do not create a universal character name through concatenation, I found a comment saying this kind of concatenation is allowed:
What is forbidden, to create a new UCN via concatenation. Like doing
assign(\u0001,0401,a,b,4)
just concatenating stuff that happens to contain UCNs anywhere is okay.
There is also an edit log showing that a code example like this one (but with 4 characters) was replaced with a different code example.
Does my code example invoke some undefined behaviors (not limited to ones invoked by producing universal character names via token concatenation)?
Or is this a bug in clang?
Your code does not trigger the undefined behavior you mention, because no universal character name (6.4.3) is produced by the token concatenation.
And, according to 6.10.3.3, since both the left and right operands of the ## operator are identifiers, and the produced token is also a valid preprocessing token (an identifier too), the ## operator itself does not trigger undefined behavior.
After reading the descriptions of identifiers (6.4.2, D.1, D.2) and universal character names (6.4.3), I'm pretty sure this is a bug in clang's preprocessor, which treats identifiers produced by token concatenation differently from normal identifiers.
Alright, I understand that the title of this topic sounds a bit like gibberish, so I'll try to explain it as clearly as I can.
This is related to this previous post (an approach that's been verified to work):
multipass a source code to cpp
-- which basically asks cpp to preprocess the code once before starting the gcc build process.
take the previous post's sample code:
#include <stdio.h>
#define DEF_X #define X 22
int main(void)
{
DEF_X
printf("%u", X);
return 1;
}
Now, to be able to freely insert DEF_X anywhere, we need to add a newline.
This doesn't work:
#define DEF_X \
#define X 22
This still doesn't work, but is closer:
#define DEF_X \n \
#define X 22
If we get the latter to work, then thanks to C's free-form syntax and multiline string-literal concatenation, it works anywhere as far as C/C++ is concerned:
"literal_str0" DEF_X "literal_str1"
Now, when cpp preprocesses this, we get:
# 1 "d:/Projects/Research/tests/test.c"
# 1 "<command-line>"
# 1 "d:/Projects/Research/test/test.c"
# 1 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/stdio.h" 1 3
# 19 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/stdio.h" 3
# 1 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/_mingw.h" 1 3
# 32 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/_mingw.h" 3=
# 33 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/_mingw.h" 3
# 20 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/stdio.h" 2 3
ETC_ETC_ETC_IGNORED_FOR_BREVITY_BUT_LOTS_OF_DECLARATIONS
int main(void)
{
\n #define X 22
printf("%u", X);
return 1;
}
We have a stray \n in our preprocessed file, so now the problem is to get rid of it.
Now, the Unix system commands aren't really my strongest suit. I've compiled dozens of packages in Linux and written simple bash scripts that just run multiline commands (so I don't have to type them every time or keep pressing the up arrow to find the right command succession), so I don't know the finer points of stream piping and its arguments.
Having said that, I tried these commands:
cpp $MY_DIR/test.c | perl -p -e 's/\\n/\n/g' > $MY_DIR/test0.c
gcc $MY_DIR/test0.c -o test.exe
It works; it removes that stray \n.
Oh, as to using perl rather than sed: I'm just more familiar with perl's regex variant; it's more consistent in my eyes.
Anyway, this has the nasty side effect of eating up every \n in the file (even in string literals), so I need a script or a series of commands to remove a \n only if:
it is not inside a quote -- so this won't be modified: "hell0_there\n"
it is not inside a function call's argument list
This is safe, as one can never pass a bare \n, which is neither a keyword nor an identifier.
If I need to "stringify" an expression containing \n, I can simply call a function macro QUOTE_VAR(token); that encapsulates all instances where \n would have to be treated as a string.
This should cover all cases where \n should be substituted, at least for my own coding conventions.
Really, I would do this on my own if I could manage it, but my regex skills are extremely lacking; I've only used it for simple substitutions.
A better way is to replace \n only when it occurs at the beginning of a line.
The following command should do the work:
sed -e 's/\s*\\n/\n/g'
or only when it occurs before a #:
sed -e 's/\\n\s*#/\n#/g'
Or you can reverse the order: substitute DEF_X with your own tool before running the C preprocessor.
For example, does MIN_N_THINGIES below compile to 2? Or will the division be recomputed every time I use the macro in code (e.g. recomputing the end condition of a for loop on each iteration)?
#define MAX_N_THINGIES (10)
#define MIN_N_THINGIES ((MAX_N_THINGIES) / 5)
uint8_t i;
for (i = 0; i < MIN_N_THINGIES; i++) {
printf("hi");
}
This question stems from the fact that I'm still learning about the build process. Thanks!
If you pass -E to gcc, it will show the output of the preprocessing stage.
gcc -E test.c | tail -n11
Outputs:
# 3 "test.c" 2
int main() {
uint8_t i;
for (i = 0; i < ((10) / 5); i++) {
printf("hi");
}
return 0;
}
Then, if you pass the -S flag to gcc, you will see that the division was optimized out. If you also pass the -o flag you can set the output file names and diff them to see that they generate the same code.
gcc -S test.c -o test-with-div.s
edit test.c to make MIN_N_THINGIES equal a const 2
gcc -S test.c -o test-constant.s
diff test-with-div.s test-constant.s
// for educational purposes you should look at the .s files generated.
Then, as mentioned in another comment, you can change the optimization level using -O...
gcc -S test.c -O2 -o test-unroll-loop.s
will unroll the for loop entirely, so that there isn't even a loop anymore.
The preprocessor will replace MIN_N_THINGIES with ((10)/5); then it is up to the compiler to optimize the expression (or not).
Maybe. The standard does not mandate that it is or is not. Most compilers will once optimization flags are passed (for example, gcc with -O0 does not fold it, while with -O2 it even unrolls the loop).
Modern compilers perform much more complicated transformations (vectorization, loop skewing, blocking, ...). However, unless you really care about performance, e.g. you program HPC or real-time systems, you probably should not worry about the compiler's output - unless you're just interested (and yes, compilers can be a fascinating subject).
No. The preprocessor does not evaluate macros; that's handled by the compiler. The preprocessor can evaluate integer arithmetic expressions (no floating point values) in #if conditionals, though.
Macros are simply text substitutions.
Note that the expanded macros can still be computed and optimized by the compiler; it's just not done by the preprocessor.
The standard mandates that some expressions are evaluated at compile time. But note that the preprocessor does just text splicing (well, almost) when the macro is called, so if you do:
#define A(x) ((x) / (S))
#define S 5
A(10) /* Gives ((10) / (5)) == 2 */
#undef S
#define S 2
A(20) /* Gives ((20) / (2)) == 10 */
The parentheses are there to avoid traps like:
#define square(x) x * x
square(a + b) /* Gets you a + b * a + b, not the expected square */
After preprocessing, the result is passed to the compiler proper, which does (most of) the computation in the source that the standard requests. Most compilers will do a lot of constant folding, i.e., computing (sub)expressions made of known constants, as this is simple to do.
To see the expansions, it is useful to write a *.c file of a few lines, just with the macros to check, run it through the preprocessor alone (typically something like cc -E file.c), and examine the output.