What does '\' actually do in C? - c

As far as I know \ in C just appends the next line as if there was not a line break.
Consider the following code:
main(){\
return 0;
}
When I saw the pre-processed code(gcc -E) it shows
main(){return
0;
}
and not
main(){return 0;
}
What is the reason for this kind of behaviour? Also, how can I get the code I expected?

Yes, your expected result is the one required by the C and C++ standards. The backslash simply escapes the newline, i.e. the backslash-newline sequence is deleted.
GCC 4.2.1 from my OS X installation gives the expected result, as does Clang. Furthermore, adding a #define to the beginning and testing with
#define main(){\
return 0;
}
main()
yields the correct result
}
{return 0;
Perhaps gcc -E does some extra processing after preprocessing and before outputting it. In any case, the line break seen by the rest of the preprocessor seems to be in the right place. So it's a cosmetic bug.
UPDATE: According to the GCC FAQ, -E (or the default setting of the cpp command) attempts to put output tokens in roughly the same visual location as input tokens. To get "raw" output, specify -P as well. This fixes the observed issues.
Probably what happened:
In preserving visual appearance, tokens not separated by spaces are kept together.
Line splicing happens before spaces are identified for the above.
The { and return tokens are grouped into the same visual block.
0 follows a space and its location on the next line is duly noted.
PLUG: If this is really important to you, I have implemented my own preprocessor with correct implementation of both raw-preprocessed and whitespace-preserving "pretty" modes. Following this discussion I added line splices to the preserved whitespace. It's not really intended as a standalone tool, though. It's a testbed for a compiler framework which happens to be a fully compliant C++11 preprocessor library, which happens to have a miniature command-line driver. (The error messages are on par with GCC, or Clang, sans color, though.)

From K&R section A.12 Preprocessing:
A.12.2 Line Splicing
Lines that end with the backslash character \ are
folded by deleting the backslash and the following newline character.
This occurs before division into tokens.

It doesn't matter :/ The tokenizer will not see any difference. 1
Update In response to the comments:
There seems to be a fair amount of confusion as to what the expected output of the preprocessor should be. My point is that the expectation /seems/ reasonable at a glance but doesn't actually need to be specified in this way for the output to be valid. The amount of whitespace present in the output is simply irrelevant to the parser. What matters is that the preprocessor should treat the continued line as one line while interpreting it.
In other words: the preprocessor is not a text transformation tool, it's a token manipulation tool.
If it matters to you, you're probably
using the preprocessor for for something other than C/C++
treating C++ code as text, which is a ... code smell. (libclang and various less complete parser libraries come to mind).
1 (The preprocessor is free to achieve the specified result in whichever way it sees fit. The result you are seeing is possibly the most efficient way the implementors have found to implement this particular transformation)

Related

Spaces inserted by the C preprocessor

Suppose we are given this input C code:
#define Y 20
#define A(x) (10+x+Y)
A(A(40))
gcc -E outputs like that (10+(10+40 +20)+20).
gcc -E -traditional-cpp outputs like that (10+(10+40+20)+20).
Why the default cpp inserts the space after 40 ?
Where can I find the most detailed specification of the cpp that covers that logic ?
The C standard doesn't specify this behaviour, since the output of the preprocessing phase is simply a stream of tokens and whitespace. Serializing the stream of tokens back into a character string, which is what gcc -E does, is not required or even mentioned by the standard, and does not form part of the translation processs specified by the standard.
In phase 3, the program "is decomposed into preprocessing tokens and sequences of white-space characters." Aside from the result of the concatenation operator, which ignores whitespace, and the stringification operator, which preserves whitespace, tokens are then fixed and whitespace is no longer needed to separate them. However, the whitespace is needed in order to:
parse preprocessor directives
correctly process the stringification operator
The whitespace elements in the stream are not eliminated until phase 7, although they are no longer relevant after phase 4 concludes.
Gcc is capable of producing a variety of information useful to programmers, but not corresponding to anything in the standard. For example, the preprocessor phase of the translation can also produce dependency information useful for inserting into a Makefile, using one of the -M options. Alternatively, a human-readable version of the compiled code can be output using the -S option. And a compilable version of the preprocessed program, roughly corresponding to the token stream produced by phase 4, can be output using the -E option. None of these output formats are in any way controlled by the C standard, which is only concerned with actually executing the program.
In order to produce the -E output, gcc must serialize the stream of tokens and whitespace in a format which does not change the semantics of the program. There are cases in which two consecutive tokens in the stream would be incorrectly glued together into a single token if they are not separated from each other, so gcc must take some precautions. It cannot actually insert whitespace into the stream being processed, but nothing stops it from adding whitespace when it presents the stream in response to gcc -E.
For example, if macro invocation in your example were modified to
A(A(0x40E))
then naive output of the token stream would result in
(10+(10+0x40E+20)+20)
which could not be compiled because 0x40E+20 is a single pp-number token which cannot be converted into a numeric token. The space before the + prevents this from happening.
If you attempt to implement a preprocessor as some kind of string transformation, you will undoubtedly confront serious issues in the corner cases. The correct implementation strategy is to tokenize first, as indicated in the standard, and then perform phase 4 as a function on a stream of tokens and whitespace.
Stringification is a particularly interesting case where whitespace affects semantics, and it can be used to see what the actual token stream looks like. If you stringify the expansion of A(A(40)), you can see that no whitespace was actually inserted:
$ gcc -E -x c - <<<'
#define Y 20
#define A(x) (10+x+Y)
#define Q_(x) #x
#define Q(x) Q_(x)
Q(A(A(40)))'
"(10+(10+40+20)+20)"
The handling of whitespace in stringification is precisely specified by the standard: (§6.10.3.2, paragraph 2, many thanks to John Bollinger for finding the specification.)
Each occurrence of white space between the argument’s preprocessing tokens
becomes a single space character in the character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted.
Here is a more subtle example where additional whitespace is required in the gcc -E output, but is not actually inserted into the token stream (again shown by using stringification to produce the real token stream.) The I (identify) macro is used to allow two tokens to be inserted into the token stream without intervening whitespace; that's a useful trick if you want to use macros to compose the argument to the #include directive (not recommended, but it can be done).
Maybe this could be a useful test case for your preprocessor:
#define Q_(x) #x
#define Q(x) Q_(x)
#define I(x) x
#define C(x,...) x(__VA_ARGS__)
// Uncomment the following line to run the program
//#include <stdio.h>
char*quoted=Q(C(I(int)I(main),void){I(return)I(C(puts,quoted));});
C(I(int)I(main),void){I(return)I(C(puts,quoted));}
Here's the output of gcc -E (just the good stuff at the end):
$ gcc -E squish.c | tail -n2
char*quoted="intmain(void){returnputs(quoted);}";
int main(void){return puts(quoted);}
In the token stream which is passed out of phase 4, the tokens int and main are not separated by whitespace (and neither are return and puts). That's clearly shown by the stringification, in which no whitespace separates the token. However, the program compiles and executes fine, even if passed explicitly through gcc -E:
$ gcc -E squish.c | gcc -x c - && ./a.out
intmain(void){returnputs(quoted);}
and compiling the output of gcc -E.
Different compilers and different versions of the same compiler may produce different serializations of a preprocessed program. So I don't think you will find any algorithm which is testable with a character-by-character comparison with the -E output of a given compiler.
The simplest possible serialization algorithm would be to unconditionally output a space between two consecutive tokens. Obviously, that would output unnecessary spaces, but it would never syntactically alter the program.
I think the minimal space algorithm would be to record the DFA state at the end of the last character in a token so that you can later output a space between two consecutive tokens if there exists a transition from the state at the end of the first token on the first character of the following token. (Keeping the DFA state as part of the token is not intrinsically different from keeping the token type as part of the token, since you can derive the token type from a simple lookup from the DFA state.) That algorithm would not insert a space after 40 in your original test case, but it would insert a space after 0x40E. So it is not the algorithm being used by your version of gcc.
If you use the above algorithm, you will need to rescan tokens created by token concatenation. However, that is necessary anyway, because you need to flag an error if the result of the concatenation is not a valid preprocessing token.
If you don't want to record states (although, as I said, there is essentially no cost in doing so) and you don't want to regenerate the state by rescanning the token as you output it (which would also be quite cheap), you could precompute a two-dimensional boolean array keyed by token type and following character. The computation would essentially be the same as the above: for every accepting DFA state which returns a particular token type, enter a true value in the array for that token type and any character with a transition out of the DFA state. Then you can look up the token type of a token and the first character of the following token to see if a space may be necessary. This algorithm does not produce a minimally-spaced output: it would, for example, put a space after the 40 in your example, since 40 is a pp-number and it is possible for some pp-number to be extended with a + (even though you cannot extend 40 in that way). So it's possible that gcc uses some version of this algorithm.
Adding some historical context to rici's excellent answer.
If you can get your hands on a working copy of gcc 2.7.2.3, experiment with its preprocessor. At that time the preprocessor was a separate program from the compiler, and it used a very naive algorithm for text serialization, which tended to insert far more spaces than were necessary. When Neil Booth, Per Bothner and I implemented the integrated preprocessor (appearing in gcc 3.0 and since), we decided to make -E output a bit smarter at the same time, but without making the implementation too complicated. The core of this algorithm is the library function cpp_avoid_paste, defined at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libcpp/lex.c#l2990 , and its caller is here: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-ppoutput.c#l177 (look for "Subtle logic to output a space...").
In the case of your example
#define Y 20
#define A(x) (10+x+Y)
A(A(40))
cpp_avoid_paste will be called with a CPP_NUMBER token (what rici called a "pp-number") on the left, and a '+' token on the right. In this case it unconditionally says "yes, you need to insert a space to avoid pasting" rather than checking whether the last character of the number token is one of eEpP.
Compiler design often comes down to a trade-off between accuracy and implementation simplicity.

What is the meaning of multi-line comment warnings in C?

I'm working on a C file for a homework assignment and I thought it might help the graders if I made my answers visible like so:
//**********|ANSWER|************\\
//blah blah blah, answering the
//questions, etc etc
and found when compiling with gcc that those backslash characters at the end of the first line seemed to be triggering a "multi-line comment" warning. When I removed them, the warning disappeared. So my question is twofold:
a) how exactly does the presence of the backslash characters make it a "multi-line comment", and
b) why would a multi-line comment be a problem anyway?
C (since the 1999 standard) has two forms of comments.
Old-style comments are introduced by /* and terminated by */, and can span a portion of a line, a complete line, or multiple lines.
C++-style comments are introduced by // and terminated by the end of the line.
But a backslash at the end of a line causes that line to be spliced to the next line. So you can legally introduce a comment with //, put a backslash at the end of the line, and cause the comment to span multiple physical lines (but only one logical line).
That's what you're doing on your first line:
//**********|ANSWER|************\\
Just use something other than backslash at the end of the line, for example:
//**********|ANSWER|************//
Though even that is potentially misleading, since it almost looks like an old-style /* .. */ comment. You might consider something a little simpler:
/////////// |ANSWER| ////////////
or:
/**********|ANSWER|************/
The compiler simply tells you that you might have inadvertently commented-out the next line of code by ending the previous comment line with \, which is a line continuation character in C. This causes the second line to get concatenated with the first. This in turn makes the // comment to actually comment-out both original lines. In your case it is not a problem, since the next line is a comment as well.
But if the next line was not intended to be a comment, then you might have ended up with "weird behavior": compiler ignoring the second line for no apparent reason. The situation is often complicated by the fact that some syntax-highlighting code editors do not detect this situation and fail to highlight the next line as a comment.
Generally, for this specific reason, it is not a good idea to abuse the \ character as code level. Use it only if you really have to, i.e. only if you really want to stitch several lines into one.
Nobody asked, but this is the top answer in Google, so
Suppressing, this specific warning could be done with -Wno-comment option.
a) how exactly does the presence of the backslash characters make it a "multi-line comment", and
A backslash as the last character on a line means that the compiler should disregard the backslash and the newline character - it tells the compiler to do this before it should check for comments. So it says that before removing comments it should effectively look at
//**********|ANSWER|************\//blah blah blah, answering the
//questions, etc etc
it now sees the // at the start and ignores the rest of the line
b) why would a multi-line comment be a problem anyway?
In your example it isn't since the second line is a comment anyway, but what if you had written something useful on the second line?
Well since you asked question "a" it's likely that you didn't realize that the compiler behaved this way, and if you don't realize that you've commented out a line of code, then it's quite nice of the compiler to warn you.
Another reason is that even if had known this is that normally an editor will not visibly show whitespace and it's therefore easy to miss that the backslash may or may not be the last character on the line. For example:
int i = 42;
// backslash+space: \
i++
// backslash and no space: \
i--
printf("%d\n", i);
Would result in 43 since the i-- is commented out, but i++ isn't (because the backslash is not the last character on the line, but a space is).
That will comment the line below it as well. If you want to do that all on one line without a warning try
/* // Bla \\ */

force gcc compilation / ignore error messages

I'm trying to compile some code I found online, but gcc keeps getting me error messages.
Is there any way I can bypass the error, and compile?
ttys000$ gcc -o s-proc s-proc.c
s-proc.c:84:18: error: \x used with no following hex digits
Here's the line it keeps bitching about:
printf("\x%02x", ((unsigned char *)code)[i]);
...
First post on here, so if I broke any rules or wasn't specific enough, let me know.
You can't ignore errors1. You can only ignore warnings. Change the code.
printf("\\x%02x", ((unsigned char *)code)[i]);
It's just a guess, since without documentation or input from the original author of the code, we have no solid evidence for what the code is actually supposed to do. However, the above correction is extremely plausible, it's a simple typo (the original author forgot a \), and it's conceivable that the author uses a C compiler which silently ignores the error (Python has the same behavior by design).
The line of code above, or something almost exactly like it, is found in probably tens of thousands of source files across the globe. It is used for encoding a binary blob using escape sequences so it can be embedded as a literal in a C program. Similar code appears in JSON, XML, and HTML emitters. I've probably written it a hundred times.
Alternatively, if the code were supposed to print out the character, this would not work:
printf("\x%02x", ((unsigned char *)code)[i]);
This doesn't work because escape sequences (the things that start with \, like \x42) are handled by the C compiler, but format strings (the things that start with %, like %02x) are handled by printf. The above line of code might only work if the order were reversed: if printf ran first, before you compiled the program. So no, it doesn't work.
If the author had intended to write literal characters, the following is more plausible:
printf("%c", ((unsigned char *)code)[i]); // clumsy
putchar((unsigned char *)code)[i]); // simpler
So you know either the original author simply typo'd and forgot a single \ (I make that mistake all the time), or the author has no clue.
Notes:
1: An error means that GCC doesn't know what the code is supposed to do, so continuing would be impossible.
Looks like you want to add a prefix of x to the hex number. If yes, you can drop the \:
printf("x%02x", ((unsigned char *)code)[i]);
The reason you are getting error is \x marks the beginning of a hex escape sequence.
Example: printf("\x43\x4f\x4f\x4c");
Prints
COOL
As C has an ASCII value of 0x43.
But in your case the \x is not followed by hex digits which causes parse errors. You can see the C syntax here it clearly says:
hex-escape ::= \x hex-digit ∗
Escape the \ with another \
printf("\\x%02x", ((unsigned char *)code)[i]);
By the way, you can't force GCC to continue compilation after an error, as as error is an error because it prevents further logical analysis of the source code which is impossible to resolve.

Why pre-processor gives a space?

I want to comment a line using the pre-processor:
#define open /##*
#define close */
main()
{
open commented line close
}
when I do $gcc -E filename.c I expected
/* commented line */
but I got
/ * commented line */
so that the compiler shows an error
Why it is giving an unwanted space ?
From the GNU C Preprocessor documentation:
However, two tokens that don't together form a valid token cannot be pasted together. For example, you cannot concatenate x with + in either order. If you try, the preprocessor issues a warning and emits the two tokens. Whether it puts white space between the tokens is undefined. It is common to find unnecessary uses of '##' in complex macros. If you get this warning, it is likely that you can simply remove the '##'.
In this case '*' and '/' do not form a valid C or C++ token. So they are emitted with a space between them.
(Aside: you are likely to get C compilation errors even if you do manage to insert "comments" into the output of the C preprocessor. There aren't supposed to be any comments there.)
The error is because /* is not a valid token.
As explained from the CPP doc:
two tokens that don't together form a valid token cannot be pasted together. For example, you cannot concatenate x with + in either order.
You can get the error by pasting other nonsense stuff e.g. /##+ or +##-.
About the space, it is deliberately inserted to avoid creating a comment and mess up the rest. From the GCC source code:
/* Avoid comment headers, since they are still processed in stage 3.
It is simpler to insert a space here, rather than modifying the
lexer to ignore comments in some circumstances. Simply returning
false doesn't work, since we want to clear the PASTE_LEFT flag. */
if ((*plhs)->type == CPP_DIV && rhs->type != CPP_EQ)
*end++ = ' ';
The preprocessor runs and produces code in a form that the C compiler can understand. It only processes your code once, so even if you could produce a /* with your #define, the compiler would see the /* and give you an error because it's not valid C code (it's a preprocessing instruction).
This doesn't seem like a very good thing to do.
Because comments are replaced with spaces before (and only before) the preprocessor runs. If you paste together the characters / and * using the preprocessor, you get /* which is just a couple operators. Edit: such abuse of ## technically creates /* as a single token, which has undefined behavior. You can paste together > ## > or < %:%: :, although you shouldn't.
See §6.4.6 of C99 for what tokens you are allowed to construct and 6.10.3.3 for the catenation process.
If you want to comment some code using pre-processor, use
#if 0
...
#endif

C macro/#define indentation?

I'm curious as to why I see nearly all C macros formatted like this:
#ifndef FOO
# define FOO
#endif
Or this:
#ifndef FOO
#define FOO
#endif
But never this:
#ifndef FOO
#define FOO
#endif
(moreover, vim's = operator only seems to count the first two as correct.)
Is this due to portability issues among compilers, or is it just a standard practice?
I've seen it done all three ways, it seems to be a matter of style, not of syntax
While usually the second example is the most common, i've seen cases where the first (or third) is used to help distinguish multiple levels of #ifdefs. Sometimes the logic can become deeply nested and the only way to understand it at a glance is to use indentation much like it is common practice to indent blocks of code between { and }.
IIRC, older C preprocessors required the # to be the first character on the line (though I've never actually encountered one that had this requirement).
I never seen your code like your first example. I usually wrote preprocessor directives as in your second example. I found that it visually interfered with the indentation of the actual code less (not that I write in C anymore).
The GNU C Preprocessor manual says:
Preprocessing directives are lines in
your program that start with '#'.
Whitespace is allowed before and after
the '#'.
For preference I use the third style, with the exception of include guards, for which I use the second style.
I don't like the first style at all - I think of #define as being a preprocessor instruction, even though really of course it isn't, it's a # followed by the preprocessor instruction define. But since I do think of it that way, it seems wrong to separate them. I expect text editors written by people who advocate that style will have a block indent/un-indent that works on code written in that style. But I would hate to encounter it using a text editor that didn't.
There's no point pandering to ancient preprocessors where the # has to be the first character of the line, unless you can also list off the top of your head all the other differences between those implementations and standard C, in order to avoid the other things you could possibly do that they would not support. Of course if you genuinely are working with a pre-standard compiler, fair enough.
Preprocessor directives are lines included in our programs that are not actually program statements but directives for the preprocessor. These lines are always preceded by a hash sign (#).Whitespace is allowed before and after the '#'. As soon as a newline character is found, the preprocessor directive is considered to end.
There is no other rule as far the standard of C/C++ concerned,So it remains as the matter of style and readability issue,I have seen/wrote programs only in the second way that you posted,although the third one seems more readable.

Resources