Is the # character allowed in C source?

Obviously, having the compiler come across a stray # character will cause a syntax error (unless it's in a comment or a string literal). However, if the character is found, e.g., inside an #if 0 block, is the program technically valid?
I tried this:
#define NOTHING(x)

int main()
{
    NOTHING(####);
    return 0;
}
with -pedantic -Wall -Wextra on both gcc and clang, and it gave no warnings. I'm not sure whether that's guaranteed to work, or whether they just don't have a specific warning for it.
I didn't find anything in the standard saying one way or the other, which is worrying. I don't want to base a tool on this only to find out that a standard-compliant compiler chokes on it.

You can include almost any character in a C program as long as it is either eliminated or stringified by the preprocessor. # is a valid preprocessing token, so it will not be flagged as an error during the preprocessing phases of compilation.
The standard does not prevent a compiler from issuing warnings about anything, and it is possible that a compiler would produce a warning in this case. But I don't know of one which does, and I would consider it a quality-of-implementation issue.
Relevant standard sections:
Definition of preprocessing-token in §6.4: # and ## are punctuators, and anything that fits no other category still falls under the catch-all "each non-white-space character that cannot be one of the above". Either way, #### lexes into valid preprocessing-tokens (by maximal munch, two ## punctuators).
§6.4/2, emphasis added: "Each preprocessing token that is converted to a token shall have the lexical form of a keyword, an identifier, a constant, a string literal, or a punctuator." As long as the # tokens are not "converted to a token" and parsed, no error can occur. That conversion happens in phase 7 of the translation process (see §5.1.1.2/7), long after the expansion of macros, which includes stringification, in phase 4; since NOTHING(####) expands to nothing, the # tokens never reach phase 7.
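To illustrate the stringification route the answer mentions, here is a minimal sketch (the STR macro is my own, not from the question):

#define STR(x) #x

/* The # characters in the argument never reach phase 7: they are
   stringified during macro expansion in phase 4. */
const char *s = STR(####); /* s points to the string "####" */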

Preprocessor and compiler errors in C

When I have a syntax error in C, how can I know if it's a preprocessor error or a compiler error?
Let's say I type in this line: "# include header.h" (The " is part of the line to make it a string literal).
Will the preprocessor have an issue with it or will it be the compiler that will treat it as a string without assigning it to anything?
Typically compiler output doesn't distinguish "pre-processor errors" from "compiler errors", as these aren't really standardized terms.
What's called "pre-processing" is often the process of forming pre-processor tokens, followed by resolving all includes, pragmas and macros. In the C standard, this "pre-processing" roughly corresponds to "translation phases" 3 and 4:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments). A source file shall not end in a
partial preprocessing token or in a partial comment. Each comment is replaced by
one space character. New-line characters are retained. Whether each nonempty
sequence of white-space characters other than new-line is retained or replaced by
one space character is implementation-defined.
Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
The compiler will obviously not complain about finding a valid string literal "# include header.h" in either of the above phases - a string literal is a valid preprocessing token. What you call "pre-processor errors" are probably errors that occur in any of the above phases.
(This is a simplified explanation; lots of other mildly interesting things happen as well, like trigraph replacement and \ newline splicing, etc.)
But in this case, I think the compiler will complain in phase 7, emphasis mine:
White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
"Will the preprocessor have an issue with it or will it be the compiler that will treat it as a string without assigning it to anything?"
I've tried your example:
"#include <stdio.h>"
I get the following errors:
For GCC:
"error: expected identifier or '(' before string constant"
For Clang:
"error: expected identifier or '('"
Both GCC and Clang treat it as a string literal, which is reasonable since character sequences surrounded by " are specified as string literals:
"A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes,as in "xyz"."
Source: ISO/IEC 9899:2018 (C18), §6.4.5/3.
This issue is one the compiler cares about, not the preprocessor. In general, since macros are expanded before compilation, incorrect or failing preprocessor directives are usually also reported by the compiler. There is usually no separate error-detection stage for the C preprocessor.
If the assignment were proper, e.g.:
const char* p = "#include <stdio.h>";
and you use variables, functions, etc. that are declared in header.h, you can* get errors about undefined references to those variables/functions, since the compiler/linker can't see/find those declarations.
*Whether you get an error also depends on whether the definition of that variable/function is visible before its use in the source code, and on how you link the several source files.
"When I have a syntax error in C, how can I know if it's a preprocessor error or a compiler error?"
As said above, there are no real preprocessor errors; the compiler covers these issues. The preprocessor doesn't really analyze for errors; it just expands. Usually it is quite clear whether an error belongs to a macro or not, even though it is the compiler that reports the syntactic issues.
As Eugene already said in the comments, you can look at the macro-expanded version of your code by using GCC's -E option, and check whether the expansions came out as desired.

Backslash preceding whitespaces and newline splice [duplicate]

N2310 §5.1.1.2(p1) defines the translation phases, in particular phase 2:
Each instance of a backslash character ( \ ) immediately followed by a new-line character is deleted, splicing physical source lines to
form logical source lines. Only the last backslash on any physical
source line shall be eligible for being part of such a splice. A
source file that is not empty shall end in a new-line character, which
shall not be immediately preceded by a backslash character before any
such splicing takes place.
Consider the following code:
#include <stdio.h>

int main(void){
    // There are spaces after the backslash, before the new-line
    int i = 12\
34;
    printf("%d\n", i); // prints 1234
}
On my machine it compiles with gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0 with a
warning: backslash and newline separated by space
int i = 12\
But prints 1234 anyway.
Is this a non-standard gcc/clang extension, and should a fully conforming compiler report an error in such a case, even without the -Werror flag?
The standard does not differentiate between warnings and errors, it just uses one term "diagnostic message." A conforming compiler must issue a diagnostic message when it encounters an invalid program. However, it is not required to terminate compilation at that point. C11 (N1570) 5.1.1.3 says this explicitly:
A conforming implementation shall produce at least one diagnostic message (identified in
an implementation-defined manner) if a preprocessing translation unit or translation unit
contains a violation of any syntax rule or constraint, even if the behavior is also explicitly
specified as undefined or implementation-defined. Diagnostic messages need not be
produced in other circumstances.9)
9) The intent is that an implementation should identify the nature of, and where possible localize, each
violation. Of course, an implementation is free to produce any number of diagnostics as long as a
valid program is still correctly translated. It may also successfully translate an invalid program.
(Emphasis mine)
This means that gcc is fully conforming when it issues a warning and then translates the code in any way it sees fit. This is really just a usability feature.
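For contrast, a minimal sketch of the conforming form: when the backslash is immediately followed by the new-line, the splice is performed silently and no diagnostic is required:

#include <stdio.h>

int main(void){
    // No spaces between the backslash and the new-line this time
    int i = 12\
34;
    printf("%d\n", i); // prints 1234, with no warning
}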

Why should the controlled group in a conditional inclusion be lexically valid when the conditional is false?

The following program compiles:
// #define WILL_COMPILE
#ifdef WILL_COMPILE
int i =
#endif

int main()
{
    return 0;
}
But the following will issue a warning:
// #define WILL_NOT_COMPILE
#ifdef WILL_NOT_COMPILE
char* s = "failure
#endif

int main()
{
    return 0;
}
I understand that in the first example, the controlled group is removed by the time the compilation phase of the translation is reached. So it compiles without errors or warnings.
But why is lexical validity required in the second example when the controlled group is not going to be included?
Searching online I found this quote:
Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended.
But this does not state why the lexical validity is checked when the conditional fails.
Have I missed something here?
In translation phase 3 the preprocessor generates preprocessing tokens, and having a " end up in the catch-all category, non-white-space character that cannot be one of the above, is undefined behavior.
See C11 6.4 Lexical elements p3:
A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories.69) If a ' or a " character matches the last category, the behavior is
undefined. ....
For reference, the grammar for preprocessing-token is:
preprocessing-token:
    header-name
    identifier
    pp-number
    character-constant
    string-literal
    punctuator
    each non-white-space character that cannot be one of the above
Of which the unmatched " in your second example matches non-white-space character that cannot be one of the above.
Since this is undefined behavior and not a constraint violation, the compiler is not obliged to diagnose it, but it is certainly allowed to, and with -pedantic-errors it even becomes an error. As rici points out, it only becomes a constraint violation if the token survives preprocessing.
The gcc document you cite basically says the same thing:
... Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended. ...
"Why is [something about C] the way it is?" questions can't usually be answered, because none of the people who wrote the 1989 C standard are here to answer questions [as far as I know, anyway] and if they were here, it was nearly thirty years ago and they probably don't remember.
However, I can think of a plausible reason why the contents of skipped conditional groups are required to consist of a valid sequence of preprocessing tokens. Observe that comments are not required to consist of a valid sequence of preprocessing tokens:
/* this comment's perfectly fine even though it has an unclosed
character literal inside */
Observe also that it is really simple to scan for the end of a comment: for /* comments you look for the next */, and for // comments you look for the end of the line. The only complication is that trigraphs and backslash-newline are supposed to be converted first. Tokenizing the contents of comments would be extra code to no useful purpose.
By contrast, it is not simple to scan for the end of a skipped conditional group, because conditional groups nest. You have to be looking for #if, #ifdef, and #ifndef as well as #else and #endif, and counting your depth. And all of those directives are lexically defined in terms of preprocessor tokens, because that's the most natural way to look for them when you're not in a skipped conditional group. Requiring skipped conditional groups to be tokenizable allows the preprocessor to use the same code to process directives within skipped conditional groups as it does elsewhere.
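A small sketch of why that matters: even inside a skipped group, the preprocessor must still recognize nested directives in order to find the matching #endif (FOO here is just a placeholder):

#if 0
  #if defined(FOO)  /* must still be seen as a directive... */
  int x = 1;
  #endif            /* ...so that this #endif pairs with the inner #if */
#endif              /* ...and this one really ends the skipped group */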
By default, GCC issues only a warning when it encounters an un-tokenizable line inside a skipped conditional group, and an error elsewhere:
#if 0
"foo
#endif
"bar
gives me
test.c:2:1: warning: missing terminating " character
"foo
^
test.c:4:1: error: missing terminating " character
"bar
^~~~
This is an intentional leniency, possibly one I introduced myself (it's only been twenty years since I wrote a third of GCC's current preprocessor, but I have still forgotten a lot of the details). You see, the original C preprocessor, the one K and R wrote, did allow arbitrary nonsense inside skipped conditional groups, because it wasn't built around the concept of tokens in the first place; it transformed text into other text. So people would put comments between #if 0 and #endif instead of /* and */, and naturally enough those comments would sometimes contain apostrophes. So, when Per Bothner and Neil Booth and Chiaki Ishikawa and I replaced GCC's original "C-Compatible Compiler Preprocessor"1 with the integrated, fully standards-compliant "cpplib", circa GCC 3.0, we felt we needed to cut a little compatibility slack here.
1 Raise your hand if you're old enough to know why RMS thought this name was funny.
The description of Translation phase 3 (C11 5.1.1.2/3), which happens before preprocessing directives are actioned:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
And the grammar for preprocessing-token is:
    header-name
    identifier
    pp-number
    character-constant
    string-literal
    punctuator
    each non-white-space character that cannot be one of the above
Note in particular that a string-literal is a single preprocessing-token. The subsequent description (C11 6.4/3) clarifies that:
If a ' or a " character matches the last category, the behavior is
undefined.
So your second example causes undefined behaviour at translation phase 3.

sizeof operator evaluates in which stage of compilation in gcc

sizeof is a compile-time operator.
Compilation has many stages. At which stage is the sizeof operator evaluated?
Typically, after the preprocessor runs and produces the preprocessed translation unit (whole header files pasted in place of each #include, #defines substituted all over the place, inactive branches of #ifdef conditionals completely removed, etc.), the compiler runs. Modern compilers are usually able to do the preprocessing themselves, but for historical reasons the C preprocessor (cpp) and the C compiler (cc) are at least conceptually distinct, with the output of the former serving as input to the latter.
In subsequent stages, it is entirely up to the internal implementation of the compiler what these stages are and what their order is. The most "traditional" pipeline, however, is:
Lexing: separation of tokens from one another;
Parsing: interpreting the combinations of tokens according to the language grammar and producing a parse tree;
Producing an Abstract Syntax Tree: the parse tree is taken as input and a more usable, better annotated tree is produced;
Scope analysis: matching the used identifiers with their respective declarations, emitting errors in case of undeclared identifiers;
Type checking: checking whether the type of each expression matches the expected type in the particular context. After this stage has passed and no errors have been emitted, the program is considered syntactically and semantically correct, so we can proceed with the next step;
Code generation and optimisation: possibly at this stage, the compiler would emit, for example, 4 in place of the abstract node that represents sizeof(int). It would also chew up constant expressions like 3 + 4 into 7.
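A minimal sketch of that compile-time folding: because sizeof(int) folds to an integer constant, it can appear even where a constant expression is required, such as a file-scope array bound (the buffer name is purely illustrative):

#include <stdio.h>

/* sizeof(int) must fold at compile time for this declaration to be valid */
static char buffer[sizeof(int) * 8];

int main(void)
{
    printf("%zu\n", sizeof buffer); /* e.g. 32 where sizeof(int) == 4 */
    return 0;
}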
Note that sizeof can be evaluated at runtime when it is applied to a variable-length array (C99 onwards):
int n;
n = ...;
int vl_arr[n];
sizeof(vl_arr); // could be evaluated at runtime if "n" is not known at compile-time
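And a runnable sketch of the runtime case (the value 5 merely stands in for input that is not known until runtime):

#include <stdio.h>

int main(void)
{
    int n = 5;                      /* stands in for a runtime-only value */
    int vl_arr[n];                  /* variable-length array (C99) */
    printf("%zu\n", sizeof vl_arr); /* evaluated at runtime: n * sizeof(int) */
    return 0;
}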

Why does GCC emit a warning when using trigraphs, but not when using digraphs?

Code:
#include <stdio.h>

int main(void)
{
    ??< puts("Hello Folks!"); ??>
}
The above program, when compiled with GCC 4.8.1 with -Wall and -std=c11, gives the following warning:
source_file.c: In function ‘main’:
source_file.c:8:5: warning: trigraph ??< converted to { [-Wtrigraphs]
??< puts("Hello Folks!"); ??>
^
source_file.c:8:30: warning: trigraph ??> converted to } [-Wtrigraphs]
But when I change the body of main to:
<% puts("Hello Folks!"); %>
no warnings are thrown.
So, why does the compiler warn me when using trigraphs, but not when using digraphs?
Because trigraphs have the undesirable effect of silently changing code. This means that the same source file is valid both with and without trigraph replacement, but leads to different code. This is especially problematic in string literals, like "What??!".
Language design and language evolution should strive to avoid silent changes. Having the compiler warn about trigraphs is a good thing to have.
Contrast this with digraphs, which were new tokens that do not lead to silent changes.
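A sketch of such a silent change, assuming trigraph replacement is enabled (e.g. with gcc -std=c89 or -trigraphs): the same line yields different output depending on whether ??! is replaced:

#include <stdio.h>

int main(void)
{
    /* With trigraph replacement, ??! becomes | and this prints "What|";
       without it, it prints "What??!". */
    puts("What??!");
    return 0;
}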
This gcc document on pre-processing gives a pretty good rationale for a warning (emphasis mine):
Trigraphs are not popular and many compilers implement them incorrectly. Portable code should not rely on trigraphs being either converted or ignored. With -Wtrigraphs GCC will warn you when a trigraph may change the meaning of your program if it were converted.
and this gcc document on Tokenization explains that digraphs, unlike trigraphs, do not have potential negative side effects (emphasis mine):
There are also six digraphs, which the C++ standard calls alternative tokens, which are merely alternate ways to spell other punctuators. This is a second attempt to work around missing punctuation in obsolete systems. It has no negative side effects, unlike trigraphs,
Maybe because digraphs have "no negative side effects, unlike trigraphs", as stated in the gcc documentation:
Punctuators are all the usual bits of punctuation which are meaningful to C and C++. All but three of the punctuation characters in ASCII are C punctuators. The exceptions are ‘#’, ‘$’, and ‘`’. In addition, all the two- and three-character operators are punctuators. There are also six digraphs, which the C++ standard calls alternative tokens, which are merely alternate ways to spell other punctuators. This is a second attempt to work around missing punctuation in obsolete systems. It has no negative side effects, unlike trigraphs, but does not cover as much ground. The digraphs and their corresponding normal punctuators are:
Digraph:     <%  %>  <:  :>  %:  %:%:
Punctuator:  {   }   [   ]   #   ##
Trigraphs are nasty because they use character sequences which could legally appear within valid code. A common case which used to cause compiler errors on code for classic Macintosh:
unsigned int signature = '????'; /* Should be value 0x3F3F3F3F */
Trigraph processing would turn that into:
unsigned int signature = '??^; /* Should be value 0x3F3F3F3F */
which would of course not compile. In some slightly rarer cases, it would be possible for such processing to yield code which would compile, but with different meaning from what was intended, e.g.
char *template = "????/1234";
which would get turned into
char *template = "??S4"; // ??/ becomes \, and \123 becomes S
Not the string literal that was intended, but perfectly legitimate nonetheless.
By contrast, digraphs are relatively benign: outside of some weird corner cases involving macros, no code containing processable digraphs would have a legitimate meaning in the absence of such processing.
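As a sketch, here is the full digraph set in use (valid since C95); unlike trigraphs, these are ordinary tokens and have no effect inside string literals:

#include <stdio.h>

int main(void)
<%
    int arr<:2:> = <%1, 2%>;             /* <: :> spell [ ], <% %> spell { } */
    printf("%d\n", arr<:0:> + arr<:1:>); /* prints 3 */
%>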
