Suppose in C, I have the following code:
i=5 \
+6;
If I print i, it gives me 11.
I do not understand how the above code executes correctly. At first glance, I guessed it would be a compiler error because of the unrecognized token \. Can somebody explain the logic? Is it related to maximal munch logic?
A backslash at the end of a line tells the compiler to ignore the new-line character.
It is a way of formatting lines to be readable for humans without interrupting the source text. E.g., if you have a long string enclosed in quotation marks, you can use a backslash to continue the string on a new line without inserting a new-line character in the string.
(This was more useful before the C standard added the property that adjacent strings, such as "abc" "def", are concatenated. Now you can put strings on consecutive lines, and they will be concatenated. Prior to that, you had to use the backslash to do it.)
Nowadays the most common use of the backslash is, as heretolearn points out, to continue preprocessor macro definitions. Unlike regular C statements, preprocessor statements must be on a single line. However, some preprocessor macro definitions are quite long. To format them (somewhat) nicely, a definition is spread over multiple physical lines, but the backslash makes them into one line for the compiler (including the preprocessor).
A backslash followed by a new-line character is completely removed from the source text by the compiler, unlike a new-line character by itself. So the source text:
abc\
def
is equivalent to the single identifier abcdef, not abc def. You can use it in the middle of any operator or other language construction except trigraph sequences (trigraph sequences, such as ??=, are converted to replacement characters, such as #, before the backslash-new-line processing):
MyStructureVariable-\
>MemberName
IncrementMe+\
+
However, do not do that. Use it reasonably.
The practice of escaping the newlines at the end of a line is indeed to mark the continuation of the statement onto the next line. Apparently that was needed in the old C compilers. There is only one place that I'm sure it is still needed, and that is in macro definitions of functions, something that is generally frowned upon in C++.
A continued line is a line which ends with a backslash, \. The
backslash is removed and the following line is joined with the current
one. No space is inserted, so you may split a line anywhere, even in
the middle of a word. (It is generally more readable to split lines
only at white space.)
The trailing backslash on a continued line is commonly referred to as
a backslash-newline.
If there is white space between a backslash and the end of a line,
that is still a continued line. However, as this is usually the result
of an editing mistake, and many compilers will not accept it as a
continued line, GCC will warn you about it.
I'd like to automate removal of all trailing whitespace from .c and .h files, to reduce garbage that causes merge conflicts in git history, etc.
Is there any conceivable way this could change the output of the compilation stage? Or is it perfectly safe to do this automatically?
The only case I can think of where this could change the meaning is if there's a macro that ends with backslash followed by spaces:
#define FOO bar\<space>
where <space> represents a space character. Before trimming, the backslash escapes the space, which I don't think has any effect. But when you remove the space it escapes the newline, so the next line will become part of the expansion.
Since there's no reason to write an escaped space like that, this seems like a very unlikely problem. In fact, if there's code that looks like this, I think it's more likely that they intended to write a multi-line expansion, and the space was added by accident.
Outside macros and string literals, all sequences of whitespace are treated as a single space, and there's no difference between spaces and newlines.
UPDATE:
This case isn't actually valid. C doesn't allow escape sequences outside string or character literals, and only newlines can be escaped with backslash. GCC has an extension to treat this as an escaped newline (in case the programmer made the mistake I describe above), and it prints a warning when it's doing it. So removing the spaces will produce the same result but get rid of the warning.
What characters are valid to occur before and after a preprocessing directive in C according to the C standard.
/*what are all the valid characters that can occur here*/ #include <stdio.h> /*and here according to the C standard*/
main()
{
printf("Hello World");
}
Now, the C standard doesn't seem to mention which characters are valid before and after a preprocessing directive. If someone can point me to the exact wording in the C standard, it will be much appreciated.
Note that before the preprocessor directives are examined, the compiler has already passed through phases 1 through 3 of the translation process. Translation phase 2 combines lines ending with a backslash with the following physical line in order to create logical lines. Translation phase 3 replaces every comment with a single space character. (It is allowed but not required that phase 3 also replaces every consecutive sequence of whitespace characters other than newline with a single space character.)
Once that is done, Phase 4 is entered, at which point preprocessor directives are identified. According to §6.10 paragraph 2 of the standard, a sequence of tokens is a preprocessor directive only if "The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either the first character in the source file (optionally after white space containing no new-line characters) or that follows white space containing at least one new-line character."
That's a very verbose way of saying that the # has to be the first token on a line, which means that it can only be preceded on the line by whitespace. But as the parenthetic comment in the sentence I quoted says, the test applies to the program as seen after phases 1 to 3, which means that comments have already been replaced with whitespace. So in the original program, the # might be preceded by a comment or by whitespace. (The comment must be a /*…*/ comment, since //… comments extend to the end of the line. Also note that continuation lines are combined before comments are identified, so a continuation marker can occur inside a //… comment.)
As to what can follow a preprocessor directive on the line, the answer is technically "nothing", since the directive extends up to and including the newline. (Again, a comment may have appeared in the original program.) The standard shows a grammar for each preprocessor directive which indicates what the directive's syntax is. If you were to add a non-whitespace character to a preprocessing directive, that would either create a syntax error or alter the meaning of the directive.
§6.10 paragraph 5 requires that whitespace within a preprocessor directive can only be a space or tab character, so that vertical tab and form-feed characters would be illegal. However, it is possible that the implementation has changed those characters to space characters in translation phase 3, so the use of vertical tab and form-feed in a preprocessor directive is implementation-dependent. Portable programs should only contain vertical tab and form-feed characters at the beginning of a line.
Let's dissect the paragraph 2 of the C standard you linked in a comment:
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the following constraints:
The characters that make up a source file are divided into tokens. Such tokens are, for example, special characters like '#', identifiers beginning with a letter or underscore, or numbers beginning with a decimal digit.
The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either the first character in the source file (optionally after white space containing no new-line characters) or that follows white space containing at least one new-line character.
"White space" is any sequence of white space characters without any other character. Commonly used white space characters are space (or blank), horizontal tabulator, line feed or carriage return.
"[...] white space containing at least one new-line character" means that one new-line character exists in the sequence of white space characters before the '#'. It does not matter where in the sequence it is.
So these are all valid sequences, shown as C strings:
"\n\t\t\t#..."
"\n #..."
"\n#..."
"\n\t#..."
"\n\t #..."
"\n \t#..."
The last token in the sequence is the first new-line character that follows the first token in the sequence.
Beginning with the token '#' all next tokens make up the preprocessor directive, until the next new-line character is found. The footnote 165 mentions the term "line" for such a sequence.
A new-line character ends the preprocessing directive even if it occurs within what would otherwise be an invocation of a function-like macro.
The invocation of a function-like macro looks like a function call in C, an identifier with a pair of parentheses. If there is a new-line before the closing parenthesis, the directive ends at that place.
EDIT:
White space characters are listed concretely in chapter 7.4.1.10 "The isspace function" of the standard you linked:
The standard white-space characters are the following: space (' '), form feed ('\f'), new-line ('\n'), carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v').
One can assume that this function is used by the preprocessor.
Your confusion might come from interpreting "[...] white space containing no new-line characters [...]" as "white space does not include new-line in general" or "new-line is a special white space character." Neither is true.
The new-line is a valid white space character. It just has a special meaning under the specific circumstances of marking the beginning and the end of a preprocessor directive. And that is why they request white space without any new-line.
If the white space contained a new-line, it would mark the beginning of a new token sequence in the context of the preprocessor.
Please note that the preprocessor and the language C are quite separated concepts. You can use the preprocessor for preprocessing any other source files, using it for assembly is quite common. And you can write C source files without any preprocessor directive.
The preprocessor knows nothing about C, and the C compiler knows nothing about preprocessing directives.
I just noticed that I can use \ to extend the single-line comment to the next line, similarly to doing so in pre-processor directives.
Why is nobody talking about this language feature?
I didn't even see it in books..
What language version supports this?
It's part of C. Called line splicing.
The K&R book talks about it
Lines that end with the backslash character \ are folded by deleting the backslash and the
following newline character. This occurs before division into tokens.
This occurs in the preprocessing phase.
So a single-line comment can be made to span several physical lines, like this:
//This is \
still a single line comment
Likewise with the case of strings
char str[]="Hello \
world. This is \
a string";
Edit: As noted in the comments, single line comments were not there in ANSI C but were introduced as part of the standard in C99 though many compilers already supported it.
From C99,
Except within a character constant, a string literal, or a comment, the characters // introduce a comment that includes all multibyte characters up to, but not including, the next new-line character. The contents of such a comment are examined only to identify multibyte characters and to find the terminating new-line character.
As far as line splicing is concerned, it is specified in C89 itself
2.1.1.2 Translation phases
Each instance of a new-line character and an immediately preceding backslash character is deleted, splicing physical source lines to form logical source lines. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character.
Look at KamiKaze's answer to see the relevant part of C99.
While it's true that a \ will effectively escape the newline at the end of a single-line comment, splicing the line with the following one (just as it does on any other line), you could claim that this is a bug in the Standard. At any rate, the situation is fantastically confusing. You might believe that both of these facts are true:
The single-line comment syntax // turns the rest of the line, up to the next newline, into a comment, which is not interpreted in any way, i.e. is ignored.
At the end of any line, a \ character eliminates the newline and splices the line to the following line.
But these two rules are basically in conflict; it looks like they can't both be true at the same time.
Now in fact, by definition, the second rule "wins", and the first rule really has to say that the rest of the line is not interpreted in any way except to check whether the last character is a \, in which case it retains its line-splicing meaning.
(Now, if you're a compiler writer or a language lawyer, of course, you don't think about it that way. If you're a compiler writer or language lawyer, you know that the \ was processed during an earlier phase of compilation, before comments are parsed, meaning that the first rule is perfectly true as stated. But most people don't think like compiler writers and language lawyers.)
My point is that this situation is basically fraught with peril. I would bet good money that there are compilers or other language processors out there that get this wrong. I would urge any sane programmer not to rely on this, not to put a \ at the end of any line that contains a single-line comment. (And if I were writing a compiler or other language processor, I'd try to warn about this.)
This is not a feature of comments but a general feature of the language, as it applies to all newline-characters.
The following is found in the C99 standard:
5.1.1.2 Translation phases
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
So it is standard compliant for C99 at least.
It is not much talked about, because the relevant use cases (except for large macros and strings) are quite rare. If you need a multiline comment (the standard comment in C; // was added later from C++), you could just use
/* multi
line
comment
*/
Every use except for large macros and strings will make the code harder to read and might even make it quite confusing. So generally it is not used except for the mentioned niches.
Every instance of \ followed by a newline is removed from the source during an early phase of translation (phase 2), before tokenisation and comment handling.
As a consequence, a single line comment can be extended to the next line of source code by escaping this newline with a \ (or a ??/ trigraph sequence):
// this is a single \
line comment
Note how the stackoverflow code highlighter is fooled by this trick and does not colorize the end of the comment line.
This feature can be further abused to make really weird looking comments:
/\
/\ This is a single line comment /\
\/ \/
/\
*\ This is a multi-line comment
*\
/
Any token can be broken in pieces this way. Check this corner case:
\
r\
et\
urn\
 0x7\
ffff;\
I'm working on a C file for a homework assignment and I thought it might help the graders if I made my answers visible like so:
//**********|ANSWER|************\\
//blah blah blah, answering the
//questions, etc etc
and found when compiling with gcc that those backslash characters at the end of the first line seemed to be triggering a "multi-line comment" warning. When I removed them, the warning disappeared. So my question is twofold:
a) how exactly does the presence of the backslash characters make it a "multi-line comment", and
b) why would a multi-line comment be a problem anyway?
C (since the 1999 standard) has two forms of comments.
Old-style comments are introduced by /* and terminated by */, and can span a portion of a line, a complete line, or multiple lines.
C++-style comments are introduced by // and terminated by the end of the line.
But a backslash at the end of a line causes that line to be spliced to the next line. So you can legally introduce a comment with //, put a backslash at the end of the line, and cause the comment to span multiple physical lines (but only one logical line).
That's what you're doing on your first line:
//**********|ANSWER|************\\
Just use something other than backslash at the end of the line, for example:
//**********|ANSWER|************//
Though even that is potentially misleading, since it almost looks like an old-style /* .. */ comment. You might consider something a little simpler:
/////////// |ANSWER| ////////////
or:
/**********|ANSWER|************/
The compiler simply tells you that you might have inadvertently commented out the next line of code by ending the previous comment line with \, which is the line continuation character in C. This causes the second line to be concatenated with the first, which in turn makes the // comment actually comment out both original lines. In your case it is not a problem, since the next line is a comment as well.
But if the next line was not intended to be a comment, then you might have ended up with "weird behavior": compiler ignoring the second line for no apparent reason. The situation is often complicated by the fact that some syntax-highlighting code editors do not detect this situation and fail to highlight the next line as a comment.
Generally, for this specific reason, it is not a good idea to abuse the \ character at the code level. Use it only if you really have to, i.e. only if you really want to stitch several lines into one.
Nobody asked, but this is the top answer in Google, so:
This specific warning can be suppressed with the -Wno-comment option.
a) how exactly does the presence of the backslash characters make it a "multi-line comment", and
A backslash as the last character on a line means that the compiler should disregard the backslash and the newline character - it tells the compiler to do this before it should check for comments. So it says that before removing comments it should effectively look at
//**********|ANSWER|************\//blah blah blah, answering the
//questions, etc etc
It now sees the // at the start and ignores the rest of the line.
b) why would a multi-line comment be a problem anyway?
In your example it isn't since the second line is a comment anyway, but what if you had written something useful on the second line?
Well since you asked question "a" it's likely that you didn't realize that the compiler behaved this way, and if you don't realize that you've commented out a line of code, then it's quite nice of the compiler to warn you.
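Another reason is that, even if you had known this, an editor normally does not show trailing whitespace, so it is easy to miss whether or not the backslash is the last character on the line. For example:
int i = 42;
// backslash+space: \
i++;
// backslash and no space: \
i--;
printf("%d\n", i);
Would print 43 on a compiler that applies the rule strictly, since the i-- is commented out but the i++ isn't (because the last character on that line is a space, not the backslash). Some compilers are more lenient: GCC documents that it still treats a backslash followed by trailing white space as a continued line, in which case both statements are commented out and the result is 42.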
That will comment the line below it as well. If you want to do that all on one line without a warning try
/* // Bla \\ */
I'm a beginner in C, and I was playing with C. I typed a C code like this:
#include <stdio.h>
int main()
{
printf("hello world\n");
\
return 0;
}
Even though I used \ knowingly, the C compiler doesn't throw any error. What is this symbol used for in the C language?
Edit:
Even this works:
"\n";
The sequence backslash-newline is removed from the code in a very early phase (phase 2) of the translation process. It used to be how you created long string literals before there was string concatenation, and is how you still extend macros over multiple lines.
See §5.1.1.2 Translation Phases of the C99 standard:
The precedence among the syntax rules of translation is specified by the following
phases.5)
1. Physical source file multibyte characters are mapped, in an implementation-defined
manner, to the source character set (introducing new-line characters for
end-of-line indicators) if necessary. Trigraph sequences are replaced by
corresponding single-character internal representations.
2. Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice. A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such
splicing takes place.
3. The source file is decomposed into preprocessing tokens6) and sequences of
white-space characters (including comments). A source file shall not end in a
partial preprocessing token or in a partial comment. Each comment is replaced by
one space character. New-line characters are retained. Whether each nonempty
sequence of white-space characters other than new-line is retained or replaced by
one space character is implementation-defined.
4. Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
5. Each source character set member and escape sequence in character constants and
string literals is converted to the corresponding member of the execution character
set; if there is no corresponding member, it is converted to an implementation-defined
member other than the null (wide) character.7)
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
8. All external object and function references are resolved. Library components are
linked to satisfy external references to functions and objects not defined in the
current translation. All such translator output is collected into a program image
which contains information needed for execution in its execution environment.
5) Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice.
6) As described in 6.4, the process of dividing a source file’s characters into preprocessing tokens is
context-dependent. For example, see the handling of < within a #include preprocessing directive.
7) An implementation need not convert all non-corresponding source characters to the same execution
character.
If you had a blank or any other character after your stray backslash, you would have a compilation error. We can tell that you don't have anything after it because you don't have a compilation error.
The other part of your question, about:
"\n";
is quite different. It is a simple expression that has no side-effects and therefore no effect on the program. The optimizer will completely discard it. When you write:
i = 1;
you have an expression with a value that is discarded; it is evaluated for its side-effect of modifying i.
Sometimes, you'll find code like:
*ptr++;
The compiler will warn you that the result of the expression is discarded; the expression can be simplified to:
ptr++;
and will achieve the same effect in the program.
The \, when immediately followed by a newline, is consumed by preprocessing and causes the next "physical" line to be joined to the current logical line. This is very important for writing long preprocessing directives, which have to be all on one logical line:
#define SHORT very long macro \
consisting of lots and \
lots of preprocessor \
tokens
If you remove the backslash-newline sequences, it is no longer correct. Some other languages from the Unix culture have a similar backslash line continuation syntax: the POSIX shell language derived from the Bourne shell, and also makefiles.
$ this is \
one shell command
About "\n";, that is a primary expression used to form an expression-statement. In C, expressions can be used as statements, and this is exploited all the time. Your printf call, for instance, is an expression statement. printf("hello world\n") is a postfix expression which calls a function, obtaining a return value. Because you used this expression as a statement, the return value is thrown away. The return value of printf indicates how many characters were printed, or whether it was successful at all, so by throwing it away, your program makes itself oblivious to whether the printf call actually worked.
Since the value of an expression-statement is discarded, if such a statement also has no side effects, it is a useless statement which does nothing (like your "\n"). But such useless expression statements are not erroneous. If you add warning options to your compiler command line you might get a warning such as "statement with no effect" or something like that.
The backslash \ gets interpreted by the C preprocessor. It protects the character that follows it (the new-line character in your case).
The backslash is simply escaping the next character. In this case, probably a line end (CR) character. Perfectly reasonable.
The backslash plus what is following it is an escape sequence; "\n" together is the newline character (prints a newline). Another important one is "\t", for tab.