Given:
#error /*
*/ foo
Microsoft C++ outputs an error message of /* and GCC outputs foo.
Which is correct?
GCC is correct.
Replacement of comments (including line-breaks) happens in translation phase 3, pre-processing in translation phase 4 (ISO/IEC 9899:1999, §5.1.1.2).
Hence, the preprocessing part of the compiler does not "see" the line-breaks anymore.
And, #error is defined like this (§6.10.5):
A preprocessing directive of the form
# error pp-tokens_opt new-line
causes the implementation to produce a diagnostic message that includes the specified
sequence of preprocessing tokens.
So, the foo has to be part of the output.
GCC is correct, because the comment /* ... */ (line break included) should be replaced by a single space according to the standard.
Related
When I have a syntax error in C, how can I know if it's a preprocessor error or a compiler error?
Let's say I type in this line: "# include header.h" (the double-quote characters are part of the line, making it a string literal).
Will the preprocessor have an issue with it or will it be the compiler that will treat it as a string without assigning it to anything?
Typically compiler output doesn't distinguish "pre-processor errors" from "compiler errors", as these aren't really standardized terms.
What's called "pre-processing" is often the process of forming pre-processor tokens, followed by resolving all includes, pragmas and macros. In the C standard, this "pre-processing" roughly corresponds to "translation phases" 3 and 4:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments). A source file shall not end in a
partial preprocessing token or in a partial comment. Each comment is replaced by
one space character. New-line characters are retained. Whether each nonempty
sequence of white-space characters other than new-line is retained or replaced by
one space character is implementation-defined.
Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
The compiler will obviously not complain about finding a valid string literal "# include header.h" in either of the above phases - a string literal is a valid pre-processor token. What you call "pre-processor errors" are probably errors that occur in any of the above phases.
(This is a simplified explanation, there's lots of other mildly interesting stuff happening as well, like trigraph and newline \ replacement etc.)
But in this case, I think the compiler will complain in phase 7:
White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
"Will the preprocessor have an issue with it or will it be the compiler that will treat it as a string without assigning it to anything?"
I've tried your example:
"#include <stdio.h>"
I get the following errors:
For GCC:
"error: expected identifier or '(' before string constant"
For Clang:
"error: expected identifier or '('"
Both GCC and Clang treat it as a string literal, which is reasonable, since character sequences surrounded by " are specified as string literals:
"A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz"."
Source: ISO/IEC 9899:2018 (C18), §6.4.5/3.
This issue is one the compiler cares about, not the preprocessor. In general, since macros are expanded before compilation, the incorrectness or failure of preprocessor directives is usually also something the compiler complains about. There is usually no explicit error detection stage for the C preprocessor.
If the assignment were done properly, e.g.:
const char* p = "#include <stdio.h>";
and you then use variables, functions etc. that are declared in that header, you can* get errors about undefined references to those variables/functions, since the compiler/linker can't see/find those declarations.
*Whether you get an error furthermore depends on whether the definition of that variable/function is visible before its use in the source code, and on how you link the several source files.
"When I have a syntax error in C, how can I know if it's a preprocessor error or a compiler error?"
As said above, there are no real preprocessor errors; the compiler covers these issues. The preprocessor doesn't really analyze for errors; it just expands. Usually it is quite clear whether an error belongs to a macro or not, even though it is the compiler that reports the syntactic issues.
As Eugene already said in the comments, you can look at the macro-expanded version of your code by using the -E option with GCC and check whether the expansions came out successfully/as desired.
Take, for example, the following:
#define FOO
FOO #define BAR 1
BAR
What should, according to each of the ANSI C and C99 standards, be the preprocessed output of the above code?
It seems to me that this should be evaluated to 1; however, running the above example through both gcc -E and clang -E produces the following:
#define BAR 1
BAR
The draft standard "ISO/IEC 9899:201x Committee Draft — April 12, 2011 N1570" section 6.10 actually contains an example of this:
EXAMPLE In:
#define EMPTY
EMPTY # include <file.h>
the sequence of preprocessing tokens on the second line is not a preprocessing directive, because it does not begin with a # at the start of translation phase 4, even though it will do so after the macro EMPTY has been replaced.
It tells us that "... the second line is not a preprocessing directive ..."
So the following line in your code
FOO #define BAR 1
is not a preprocessing directive, meaning that only FOO will be replaced and BAR will not be defined. Consequently, the output of the preprocessor is:
#define BAR 1
BAR
Your code is not valid
ISO/IEC 9899:2011, Section 6.10 Preprocessing directives:
A preprocessing directive consists of a sequence of preprocessing
tokens that satisfies the following constraints: The first token in
the sequence is a # preprocessing token that (at the start of
translation phase 4) is either the first character in the source file
(optionally after white space containing no new-line characters) or
that follows white space containing at least one new-line character.
This example actually occurs in the Standard (C17 6.10/8):
EXAMPLE In:
#define EMPTY
EMPTY # include <file.h>
the sequence of preprocessing tokens on the second line is not a preprocessing directive, because it does not begin with a # at the start of translation phase 4, even though it will do so after the macro EMPTY has been replaced.
So the output you see from gcc -E is correct. (Note: the amount of whitespace here is not significant; at that stage of translation the program has been translated to a sequence of preprocessing tokens, and the differing amounts of whitespace in the output are just an artefact of how gcc -E works.)
Obviously, having the compiler come across the # character will cause a syntax error (unless it's a comment or string literal). However, if the character is found e.g. inside of an #if 0 block, is the program technically valid?
I tried this:
#define NOTHING(x)
int main()
{
NOTHING(####);
return 0;
}
with -pedantic -Wall -Wextra, on both gcc and clang and it gave no warnings. I'm not sure if that's guaranteed to work, or if they just don't have a specific warning for it.
I didn't find anything in the standard saying one way or the other, which is worrying. I don't want to base a tool on this only to find out that a standard compliant compiler chokes on it.
You can include almost any character in a C program as long as it is either eliminated or stringified by the preprocessor. # is a valid preprocessor token so it will not be flagged as an error in the preprocessing phases of compilation.
The standard does not prevent a compiler from issuing warnings about anything, and it is possible that a compiler would produce a warning in this case. But I don't know of one which does, and I would consider it a quality-of-implementation issue.
Relevant standard sections:
Definition of preprocessing-token in §6.4: # and ## are punctuators (§6.4.6), so they are valid preprocessing-tokens in their own right. Note that #### therefore lexes as two ## preprocessing-tokens under maximal munch (§6.4/4).
§6.4/2, emphasis added: "Each preprocessing token that is converted to a token shall have the lexical form of a keyword, an identifier, a constant, a string literal, or a punctuator." As long as the # are not "converted to a token", no error has occurred. The conversion to a token happens in phase 7 of the compilation process (see §5.1.1.2/7). That is long after the expansion of macros, which includes stringification, in phase 4.
Let's say I have two files, a.h:
#if 1
#include "b.h"
and b.h:
#endif
Both gcc's and clang's preprocessors reject a.h:
$ cpp -ansi -pedantic a.h >/dev/null
In file included from a.h:2:0:
b.h:1:2: error: #endif without #if
#endif
^
a.h:1:0: error: unterminated #if
#if 1
^
However, the C standard (N1570 6.10.2.3) says:
A preprocessing directive of the form
# include "q-char-sequence" new-line
causes the replacement of that directive by the entire contents of the source file identified by the specified sequence between the " delimiters.
which appears to permit the construct above.
Are gcc and clang not compliant in rejecting my code?
The C standard defines 8 translation phases. A source file is processed by each of the 8 phases in sequence (or in an equivalent manner).
Phase 4, as defined in N1570 section 5.1.1.2, is:
Preprocessing directives are executed, macro invocations are expanded,
and _Pragma unary operator expressions are executed. If a
character sequence that matches the syntax of a universal character
name is produced by token concatenation (6.10.3.3), the behavior is
undefined. A #include preprocessing directive causes the named
header or source file to be processed from phase 1 through phase 4,
recursively. All preprocessing directives are then deleted.
The relevant sentence here is:
A #include preprocessing directive causes the named
header or source file to be processed from phase 1 through phase 4,
recursively.
which implies that each included source file is preprocessed by itself. This precludes having a #if in one file and the corresponding #endif in another.
(As "A wild elephant" mentioned in comments, and as rodrigo's answer says, the grammar in section 6.10 also says that an if-section, which starts with a #if (or #ifdef or #ifndef) line and ends with a #endif line, can only appear as part of a preprocessing-file.)
I think the compilers are right, or at best the standard is ambiguous.
The trick is not in how #include is implemented, but in the order in which preprocessing is done.
Look at the grammar rules in section 6.10 of the C99 standard:
preprocessing-file:
group[opt]
group:
group-part
group group-part
group-part:
if-section
control-line
text-line
# non-directive
if-section:
if-group elif-groups[opt] else-group[opt] endif-line
if-group:
# if constant-expression new-line group[opt]
...
control-line:
# include pp-tokens new-line
...
As you can see, the #include stuff is nested inside the group, and group is the thing inside the #if / #endif.
For example, in a well-formed file such as:
#if 1
#include <a.h>
#endif
That will parse as #if 1, plus a group, plus #endif. And the inside group has an #include.
But in your example:
#if 1
#include <a.h>
The rule if-section does not apply to this source, so the group productions are not even checked.
You could probably argue that the standard is ambiguous, because it does not specify when the replacement of the #include directive happens, and that a conforming implementation could shift a lot of grammar rules and replace the #include before failing over the missing #endif. But such ambiguities are impossible to avoid when the side effects of the syntax modify the very text being parsed. Isn't C wonderful?
Thinking of a C preprocessor as a very simple compiler, to translate a file a C preprocessor conceptually carries out a few phases.
Lexical analysis – Groups the sequence of characters making up the preprocessing translation unit into strings having an identified meaning (tokens) in the preprocessor language.
Syntactic analysis – Groups the tokens of the preprocessing translation unit into syntactic structures built according to the preprocessing language grammar.
Code generation – Translates all files making up the preprocessing translation unit into a single file containing 'pure' C instructions only.
Strictly speaking, the translation phases mentioned in §5.1.1.2 of the C Standard (ISO/IEC 9899:201x) relating to preprocessing are phase 3 and phase 4. Phase 3 corresponds almost exactly to lexical analysis while phase 4 is about code generation.
Syntactic analysis (parsing) seems to be missing from that picture. Indeed, the C preprocessor grammar is so simple that real preprocessors/compilers perform it along with lexical analysis.
If the syntactic analysis phase ends successfully – i.e. all statements in the preprocessing translation unit are legal according to the preprocessor grammar – code generation can take place and all preprocessing directives are executed.
Executing a preprocessing directive means transforming the source file according to the directive's semantics and then removing the directive from the source file.
The semantics for each preprocessor directive is specified in §6.10.1-6.10.9 of the C Standard.
Getting back to your sample program, the 2 files you provided, i.e. a.h and b.h, are conceptually processed as follows.
Lexical Analysis - Each individual preprocessing token is delimited by a '{' on the left and a '}' on the right.
a.h
{#}{if} {1}
{#}{include} {"b.h"}
b.h
{#}{endif}
This phase is performed without errors and its result, the sequence of preprocessing tokens, is passed to the subsequent phase: syntactic analysis.
Syntactic Analysis
A tentative derivation for a.h is given below
preprocessing-file →
group →
group-part →
if-section →
if-group endif-line →
if-group #endif new-line →
…
and it is clear that the contents of a.h cannot be derived from the preprocessing grammar – in fact the terminating #endif is missing – and therefore a.h is not syntactically correct. This is exactly what your compiler is telling you when it writes
a.h:1:0: error: unterminated #if
Something similar happens for b.h; reasoning backwards, the #endif can only be derived from the rule
if-section →
if-group elif-groups[opt] else-group[opt] endif-line
This means the file contents should be derived from one of the following 3 groups
# if constant-expression new-line group[opt]
# ifdef identifier new-line group[opt]
# ifndef identifier new-line group[opt]
Since that is not the case (b.h does not contain # if/# ifdef/# ifndef but only the single #endif line), the contents of b.h are again not syntactically correct, and your compiler tells you so this way:
In file included from a.h:2:0:
b.h:1:2: error: #endif without #if
Code Generation
Of course, since your program is lexically sound but syntactically not correct, this phase never gets performed.
#if / #ifdef / #ifndef
#elif
#else
#endif
must be matched within one file.
I have a C program that uses ".global" assembly language code. My compiler does not allow this, and I need to ignore it. I tried the following:
#define .global //global
But this gives a compiler error. Is there any other option that I can use? The compilation error is:
"expected an identifier"
You are almost certainly going to end up writing your own preprocessing script; it shouldn't be too difficult, if your source files are reasonably controlled. If you don't use the .global construct in string literals or comments, for example, it would be sufficient to do something like:
sed 's/\.global [_[:alpha:]][_[:alnum:]]*;//g'
(perhaps with a bit more attention to detail about whitespace).
You cannot manufacture a comment with a macro. (You also cannot define a macro whose name starts with a ., although I suppose there could be a compiler which accepts that as an extension.)
Comments are replaced with whitespace in phase 3 of the translation process. Preprocessor directives are not examined until phase 4, by which time all the comments have disappeared.
So there is no difference between
#define COMMENT //comment
#define COMMENT
Standards reference: §5.1.1.2/1:
The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character.…
Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed.…