N2310 5.1.1.2(p1) defines the translation phases. In particular, phase 2:
Each instance of a backslash character ( \ ) immediately followed by a new-line character is deleted, splicing physical source lines to
form logical source lines. Only the last backslash on any physical
source line shall be eligible for being part of such a splice. A
source file that is not empty shall end in a new-line character, which
shall not be immediately preceded by a backslash character before any
such splicing takes place.
Consider the following code:
#include <stdio.h>
int main(void){
//There are spaces after the backslash before the new-line
int i = 12\
34;
printf("%d\n", i); //prints 1234
}
On my machine it compiles with gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0 with a
warning: backslash and newline separated by space
int i = 12\
But prints 1234 anyway.
Is this a non-standard gcc/clang extension? Should a fully conforming compiler reject such code with an error, even without the -Werror flag?
The standard does not differentiate between warnings and errors; it uses the single term "diagnostic message". A conforming compiler must issue a diagnostic message when it encounters an invalid program. However, it is not required to terminate compilation at that point. C11 (N1570) 5.1.1.3 says this explicitly:
A conforming implementation shall produce at least one diagnostic message (identified in
an implementation-defined manner) if a preprocessing translation unit or translation unit
contains a violation of any syntax rule or constraint, even if the behavior is also explicitly
specified as undefined or implementation-defined. Diagnostic messages need not be
produced in other circumstances.9)
9) The intent is that an implementation should identify the nature of, and where possible localize, each
violation. Of course, an implementation is free to produce any number of diagnostics as long as a
valid program is still correctly translated. It may also successfully translate an invalid program.
(Emphasis mine)
This means that gcc is fully conforming when it issues a warning and then translates the code in any way it sees fit. This is really just a usability feature.
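For contrast with the snippet in the question, here is a minimal sketch of a splice that phase 2 does cover: the backslash is the very last character on its line, with nothing between it and the new-line. The function name `spliced_literal` is made up for illustration.

```c
/* Phase 2 sketch: the backslash below is immediately followed by the
   new-line, so the two physical lines form one logical line and the
   digits end up in the single pp-number 1234. */
int spliced_literal(void)
{
    return 12\
34;
}
```

Calling `spliced_literal()` returns 1234. Putting a space after the backslash takes the code outside what the quoted phase 2 text describes, which is where the warning in the question comes from.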
When I have a syntax error in C, how can I know if it's a preprocessor error or a compiler error?
Let's say I type in this line: "# include header.h" (The " is part of the line to make it a string literal).
Will the preprocessor have an issue with it or will it be the compiler that will treat it as a string without assigning it to anything?
Typically compiler output doesn't distinguish "pre-processor errors" from "compiler errors", as these aren't really standardized terms.
What's called "pre-processing" is often the process of forming pre-processor tokens, followed by resolving all includes, pragmas and macros. In the C standard, this "pre-processing" roughly corresponds to "translation phases" 3 and 4:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments). A source file shall not end in a
partial preprocessing token or in a partial comment. Each comment is replaced by
one space character. New-line characters are retained. Whether each nonempty
sequence of white-space characters other than new-line is retained or replaced by
one space character is implementation-defined.
Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
Obviously, the compiler will not complain about finding a valid string literal "# include header.h" in either of the above phases, since a string literal is a valid pre-processor token. What you call "pre-processor errors" are probably errors that occur in either of the above phases.
(This is a simplified explanation, there's lots of other mildly interesting stuff happening as well, like trigraph and newline \ replacement etc.)
But in this case, I think the compiler will complain in phase 7, emphasis mine:
White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
"Will the preprocessor have an issue with it or will it be the compiler that will treat it as a string without assigning it to anything?"
I've tried your example:
"#include <stdio.h>"
I get the following errors:
For GCC:
"error: expected identifier or '(' before string constant"
For Clang:
"error: expected identifier or '('"
Both GCC and Clang treat it as a string literal, which is reasonable since character sequences surrounded by " are specified as string literals:
"A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes,as in "xyz"."
Source: ISO/IEC 9899:2018 (C18), §6.4.5/3.
This issue is one the compiler cares about, not the preprocessor. In general, since macros are expanded before compilation, the incorrectness or failure of preprocessor directives is usually also something the compiler complains about. There is usually no explicit error detection stage for the C preprocessor.
If the string were used in a proper assignment, for example:
const char* p = "#include <stdio.h>";
and you use variables, functions, etc. that are declared in header.h, you can* get errors about undeclared identifiers or undefined references for them, since the compiler/linker cannot find those declarations.
*Whether you get an error also depends on whether the definition of that variable/function is visible before its use in the source code, and on how you link several source files.
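As a small sketch of the point above: once the text sits between double quotes it is ordinary string data, and no directive is executed. The names `directive_text` and `directive_text_length` are hypothetical.

```c
#include <string.h>

/* The text between the quotes is one string-literal preprocessing token
   (phase 3), so phase 4 never sees a directive here and no header is
   included because of it. */
static const char *const directive_text = "#include <stdio.h>";

/* Returns the number of characters in the literal. */
size_t directive_text_length(void)
{
    return strlen(directive_text);
}
```

`directive_text_length()` returns 18, the length of the text as plain characters.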
"When I have a syntax error in C, how can I know if it's a preprocessor error or a compiler error?"
As said above, there are no real preprocessor errors, the compiler covers these issues. The preprocessor doesn't really analyze for errors, it is just expanding. Usually it is very clear if an error belongs to a macro or not, even though the compiler evaluates the syntactical issues.
As said in the comments already by Eugene, you can take a look at the macro expanded version of your code when using the -E option for GCC and test if the expansions were expanded successfully/as desired.
The following program compiles:
// #define WILL_COMPILE
#ifdef WILL_COMPILE
int i =
#endif
int main()
{
return 0;
}
But the following will issue a warning:
//#define WILL_NOT_COMPILE
#ifdef WILL_NOT_COMPILE
char* s = "failure
#endif
int main()
{
return 0;
}
I understand that in the first example, the controlled group is removed by the time the compilation phase of the translation is reached. So it compiles without errors or warnings.
But why is lexical validity required in the second example when the controlled group is not going to be included?
Searching online I found this quote:
Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended.
But this does not state why the lexical validity is checked when the conditional fails.
Have I missed something here?
In translation phase 3 the preprocessor generates preprocessing tokens, and having a " end up in the catch-all category "each non-white-space character that cannot be one of the above" is undefined behavior.
See C11 6.4 Lexical elements p3:
A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories.69) If a ' or a " character matches the last category, the behavior is
undefined. ....
For reference the preprocessing-token are:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Of which the unmatched " in your second example matches non-white-space character that cannot be one of the above.
Since this is undefined behavior and not a constraint violation, the compiler is not obliged to diagnose it, but it is certainly allowed to, and with -pedantic-errors it even becomes an error. As rici points out, it only becomes a constraint violation if the token survives preprocessing.
The gcc document you cite basically says the same thing:
... Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended. ...
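To illustrate the distinction, here is a sketch of a skipped group that is not valid C but is still lexically valid, so nothing needs to be diagnosed. It assumes the compiler accepts the `@` character in source text (GCC and Clang do); the function name is hypothetical.

```c
/* The group below is skipped, yet phase 3 still tokenizes it. Every line
   forms valid preprocessing tokens (each @ falls into the catch-all
   single-character category), so the text passes even though it could
   never be parsed as C. Note that it contains no unmatched quote. */
#if 0
this text @ is not @ C at all, but it tokenizes
int broken = ;
#endif

int after_skipped_group(void)
{
    return 1;
}
```

Adding a lone `"` or `'` to one of the skipped lines would turn this into the situation from the question.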
"Why is [something about C] the way it is?" questions can't usually be answered, because none of the people who wrote the 1989 C standard are here to answer questions [as far as I know, anyway] and if they were here, it was nearly thirty years ago and they probably don't remember.
However, I can think of a plausible reason why the contents of skipped conditional groups are required to consist of a valid sequence of preprocessing tokens. Observe that comments are not required to consist of a valid sequence of preprocessing tokens:
/* this comment's perfectly fine even though it has an unclosed
character literal inside */
Observe also that it is really simple to scan for the end of a comment: after /* you look for the next */, and after // you look for the end of the line. The only complication is that trigraphs and backslash-newline are supposed to be converted first. Tokenizing the contents of comments would be extra code to no useful purpose.
By contrast, it is not simple to scan for the end of a skipped conditional group, because conditional groups nest. You have to be looking for #if, #ifdef, and #ifndef as well as #else and #endif, and counting your depth. And all of those directives are lexically defined in terms of preprocessor tokens, because that's the most natural way to look for them when you're not in a skipped conditional group. Requiring skipped conditional groups to be tokenizable allows the preprocessor to use the same code to process directives within skipped conditional groups as it does elsewhere.
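The nesting problem can be sketched like this. `FEATURE_LEVEL` and `GREETING` are hypothetical names chosen for the example, not anything from GCC.

```c
#include <string.h>

/* FEATURE_LEVEL is a hypothetical configuration macro. */
#define FEATURE_LEVEL 0

#if FEATURE_LEVEL > 0
  /* This whole group is skipped, but the preprocessor still has to
     recognise the nested #if/#else/#endif below. Otherwise it would
     mistake the inner #else for the one that ends the skipped group. */
  #if FEATURE_LEVEL > 1
    #define GREETING "level two"
  #else
    #define GREETING "level one"
  #endif
#else
  #define GREETING "disabled"
#endif

/* Returns 1 when the outer #else branch was the one selected. */
int greeting_is_disabled(void)
{
    return strcmp(GREETING, "disabled") == 0;
}
```

Only the outer `#else` branch defines `GREETING` here, which works because the directives inside the skipped group were counted rather than ignored.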
By default, GCC issues only a warning when it encounters an un-tokenizable line inside a skipped conditional group, an error elsewhere:
#if 0
"foo
#endif
"bar
gives me
test.c:2:1: warning: missing terminating " character
"foo
^
test.c:4:1: error: missing terminating " character
"bar
^~~~
This is an intentional leniency, possibly one I introduced myself (it's only been twenty years since I wrote a third of GCC's current preprocessor, but I have still forgotten a lot of the details). You see, the original C preprocessor, the one K and R wrote, did allow arbitrary nonsense inside skipped conditional groups, because it wasn't built around the concept of tokens in the first place; it transformed text into other text. So people would put comments between #if 0 and #endif instead of /* and */, and naturally enough those comments would sometimes contain apostrophes. So, when Per Bothner and Neil Booth and Chiaki Ishikawa and I replaced GCC's original "C-Compatible Compiler Preprocessor"1 with the integrated, fully standards-compliant "cpplib", circa GCC 3.0, we felt we needed to cut a little compatibility slack here.
1 Raise your hand if you're old enough to know why RMS thought this name was funny.
The description of Translation phase 3 (C11 5.1.1.2/3), which happens before preprocessing directives are actioned:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
And the grammar for preprocessing-token is:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Note in particular that a string-literal is a single preprocessing-token. The subsequent description (C11 6.4/3) clarifies that:
If a ' or a " character matches the last category, the behavior is
undefined.
So your second code causes undefined behaviour at translation phase 3.
Obviously, having the compiler come across the # character will cause a syntax error (unless it's a comment or string literal). However, if the character is found e.g. inside of an #if 0 block, is the program technically valid?
I tried this:
#define NOTHING(x)
int main()
{
NOTHING(####);
return 0;
}
with -pedantic -Wall -Wextra, on both gcc and clang and it gave no warnings. I'm not sure if that's guaranteed to work, or if they just don't have a specific warning for it.
I didn't find anything in the standard saying one way or the other, which is worrying. I don't want to base a tool on this only to find out that a standard compliant compiler chokes on it.
You can include almost any character in a C program as long as it is either eliminated or stringified by the preprocessor. # is a valid preprocessor token so it will not be flagged as an error in the preprocessing phases of compilation.
The standard does not prevent a compiler from issuing warnings about anything, and it is possible that a compiler would produce a warning in this case. But I don't know of one which does, and I would consider it a quality-of-implementation issue.
Relevant standard sections:
Definition of preprocessing-token in §6.4: "each non-white-space character that cannot be one of the above". Note that #### is therefore four preprocessing-tokens.
§6.4/2, emphasis added: "Each preprocessing token that is converted to a token shall have the lexical form of a keyword, an identifier, a constant, a string literal, or a punctuator." As long as the # are not "converted to a token", no error has occurred. The conversion to a token happens in phase 7 of the compilation process (see §5.1.1.2/7). That is long after the expansion of macros, which includes stringification, in phase 4.
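A short sketch of both escape routes mentioned above, elimination and stringification. `STR` is a helper macro added for this example, and the use of `@` assumes the compiler accepts that character in source text, as GCC and Clang do.

```c
#include <string.h>

#define NOTHING(x)      /* discards its argument */
#define STR(x) #x       /* stringifies its argument */

int eliminated(void)
{
    NOTHING(####);      /* expands to nothing, leaving an empty statement */
    NOTHING(@);
    return 1;
}

int stringified_matches(void)
{
    /* The argument never reaches phase 7 as tokens of its own; its
       spelling ends up inside a string literal instead. */
    return strcmp(STR(####), "####") == 0 && strcmp(STR(@), "@") == 0;
}
```

Both functions compile, and `stringified_matches()` returns 1, because none of the odd characters is ever converted to a token on its own.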
The following line appears in a C program. Is it an error or not?
in/*hello*/t k;
As I understand it, the preprocessor first removes the comments and then passes the code to the compiler, so the code the compiler sees would be
int k;
which is perfectly ok.
But when I actually compile this with gcc, it gives a compiler error saying that
in, k, t are not defined
The comment in the code will be replaced by a single space character. So
in/*hello*/t k;
will become
in t k;
which is not correct.
C11 §5.1.1.2 Translation phases
3 The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
From C11 5.1.1.2 Translation phases (my emphasis)
The source file is decomposed into preprocessing tokens and
sequences of white-space characters (including comments). A source
file shall not end in a partial preprocessing token or in a partial
comment. Each comment is replaced by one space character
So, as Yu Hao notes, your comment is replaced by a space by the preprocessor. You can test this yourself by using gcc -E your_file.c (gcc) or cl /EP your_file.c (msvc) to view the output from the preprocessor.
The preprocessor does not simply delete comments from the code. It ignores whatever is inside the comment and puts a single blank (white space) in its place, then continues with the next meaningful character. So you get the error because these identifiers are not defined.
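One way to see the inserted space from inside a program is to stringify the text, as in this sketch. `STR` and `CAT` are helper macros introduced for the example.

```c
#include <string.h>

#define STR(x) #x
#define CAT(a, b) a##b

/* The comment separates two identifiers, in and t. Stringifying the
   argument shows the single space that took the place of the comment. */
int comment_becomes_space(void)
{
    return strcmp(STR(in/*hello*/t), "in t") == 0;
}

/* Token pasting is the way to build one identifier from two pieces:
   CAT(in, t) forms the keyword int. */
CAT(in, t) pasted_return_type(void)
{
    return 7;
}
```

`comment_becomes_space()` returns 1, and `pasted_return_type` is declared with return type `int` only because `##` joined the two pieces explicitly.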
I'm a beginner in C, and I was playing with C. I typed a C code like this:
#include <stdio.h>
int main()
{
printf("hello world\n");
\
return 0;
}
Even though I used \ knowingly, the C compiler doesn't throw any error. What is this symbol used for in the C language?
Edit:
Even this works:
"\n";
The sequence backslash-newline is removed from the code in a very early phase (phase 2) of the translation process. It used to be how you created long string literals before there was string concatenation, and is how you still extend macros over multiple lines.
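Both uses can be sketched as follows. The identifiers and the `SUM_OF_SQUARES` macro are made up for illustration.

```c
#include <string.h>

/* Old style: continue a string literal with backslash-newline. The next
   line must start in column 1, because any indentation would become part
   of the string. */
static const char spliced[] = "hello, \
world";

/* Modern style: adjacent string literals are concatenated in phase 6. */
static const char concatenated[] = "hello, "
                                   "world";

/* A directive must fit on one logical line, so long macros still rely on
   splicing. */
#define SUM_OF_SQUARES(a, b) \
    ((a) * (a) +             \
     (b) * (b))

/* Returns 1 when both spellings produce the same text. */
int same_text(void)
{
    return strcmp(spliced, concatenated) == 0;
}

int sum_of_squares_3_4(void)
{
    return SUM_OF_SQUARES(3, 4);
}
```

`same_text()` returns 1 and `sum_of_squares_3_4()` returns 25, so the splice leaves no trace in either the string or the macro.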
See §5.1.1.2 Translation Phases of the C99 standard:
The precedence among the syntax rules of translation is specified by the following
phases.5)
1. Physical source file multibyte characters are mapped, in an implementation-defined
manner, to the source character set (introducing new-line characters for
end-of-line indicators) if necessary. Trigraph sequences are replaced by
corresponding single-character internal representations.
2. Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice. A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such
splicing takes place.
3. The source file is decomposed into preprocessing tokens6) and sequences of
white-space characters (including comments). A source file shall not end in a
partial preprocessing token or in a partial comment. Each comment is replaced by
one space character. New-line characters are retained. Whether each nonempty
sequence of white-space characters other than new-line is retained or replaced by
one space character is implementation-defined.
4. Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
5. Each source character set member and escape sequence in character constants and
string literals is converted to the corresponding member of the execution character
set; if there is no corresponding member, it is converted to an implementation-defined
member other than the null (wide) character.7)
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
8. All external object and function references are resolved. Library components are
linked to satisfy external references to functions and objects not defined in the
current translation. All such translator output is collected into a program image
which contains information needed for execution in its execution environment.
5) Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice.
6) As described in 6.4, the process of dividing a source file’s characters into preprocessing tokens is
context-dependent. For example, see the handling of < within a #include preprocessing directive.
7) An implementation need not convert all non-corresponding source characters to the same execution
character.
If you had any other character after your stray backslash, you would get a compilation error; with only blanks between the backslash and the new-line, GCC and Clang warn and still splice the lines. Since you saw no diagnostic at all, we can tell that nothing follows the backslash.
The other part of your question, about:
"\n";
is quite different. It is a simple expression that has no side-effects and therefore no effect on the program. The optimizer will completely discard it. When you write:
i = 1;
you have an expression with a value that is discarded; it is evaluated for its side-effect of modifying i.
Sometimes, you'll find code like:
*ptr++;
The compiler will warn you that the result of the expression is discarded; the expression can be simplified to:
ptr++;
and will achieve the same effect in the program.
The \, when immediately followed by a newline, is consumed by preprocessing and causes the next "physical" line to be joined to the current logical line. This is very important for writing long preprocessing directives, which have to be all on one logical line:
#define SHORT very long macro \
consisting of lots and \
lots of preprocessor \
tokens
If you remove the backslash-newline sequences, it is no longer correct. Some other languages from the Unix culture have a similar backslash line continuation syntax: the POSIX shell language derived from the Bourne shell, and also makefiles.
$ this is \
one shell command
About "\n";, that is a primary expression used to form an expression-statement. In C, expressions can be used as statements, and this is exploited all the time. Your printf call, for instance, is an expression statement. printf("hello world\n") is a postfix expression which calls a function, obtaining a return value. Because you used this expression as a statement, the return value is thrown away. The return value of printf indicates how many characters were printed, or whether it was successful at all, so by throwing it away, your program makes itself oblivious to whether the printf call actually worked.
Since the value of an expression-statement is discarded, if such a statement also has no side effects, it is a useless statement which does nothing (like your "\n"). But such useless expression statements are not erroneous. If you add warning options to your compiler command line you might get a warning such as "statement with no effect" or something like that.
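A minimal sketch of both cases, with hypothetical function names: one keeps the value that the expression statement would discard, the other discards a side-effect-free expression on purpose.

```c
#include <stdio.h>

/* Keeps the value that an expression statement would throw away.
   printf returns the number of characters written, or a negative value
   on error. */
int print_greeting(void)
{
    int written = printf("hello world\n");
    return written;
}

/* An expression statement with no side effects. The cast to void makes
   the discard explicit. */
void useless_statement(void)
{
    (void)"\n";
}
```

When the output succeeds, `print_greeting()` returns 12: eleven visible characters plus the new-line.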
The backslash \ gets interpreted by the C preprocessor. It protects the character that follows it (the new-line character in your case).
The backslash is simply escaping the next character. In this case, that is probably the line-end (newline) character. Perfectly reasonable.
The backslash plus what is following it is an escape sequence; "\n" together is the newline character (prints a newline). Another important one is "\t", for tab.