C/C++ Preprocessor #include syntax. Whitespace mandatory? - c-preprocessor

In some code from others I came across the notation:
#include"file.h"
That made me wonder, because I always write include directives with a space between the directive and the path/file, like:
#include "file.h"
So I searched for the #include syntax but could not find a definitive answer: the documentation only shows how the directive is used, not its exact syntax.
Resource used: https://gcc.gnu.org/onlinedocs/cpp/Include-Syntax.html#Include-Syntax
Is there even a spec written about how the syntax should be? Or is the first one (without the whitespace) invalid syntax but accepted through pre-compiler extensions?
(Also works on MSVC)

It is allowed to have #include<header> and #include"header", with no whitespace character after include, although one may argue that this makes it slightly harder to read.
The grammar for #include is, cut down to only show the relevant bits:
# include pp-tokens new-line
pp-tokens: preprocessing-token
preprocessing-token: header-name
header-name: < h-char-sequence >
" q-char-sequence "
(The spaces above are only part of the grammar notation; in the standard, the literal tokens are set in bold.)
Furthermore:
Preprocessing tokens can be separated by white space; this consists of
comments (described later), or white-space characters (space,
horizontal tab, new-line, vertical tab, and form-feed), or both.
This is from the C99 draft standard document (my emphasis on "can").
I'm not a C++ programmer, but I imagine that C++ is following the same grammar rules on this point.
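To make the point concrete, here is a minimal sketch that spells the directive four different ways. The function name include_spacing_demo is made up for illustration; the headers are ordinary standard ones, and the quoted form relies on the usual fallback to the angle-bracket search when no local file of that name exists.

```c
#include<string.h>          /* no white space before the < */
#include"limits.h"          /* no white space before the " */
#include <stddef.h>         /* the conventional single space */
#   include   <stdlib.h>    /* extra white space between the tokens */

/* Hypothetical helper that uses one name from each header above:
   strlen (string.h), CHAR_BIT (limits.h), size_t (stddef.h), abs (stdlib.h). */
size_t include_spacing_demo(const char *s)
{
    return strlen(s) + (size_t)abs(-1) + (CHAR_BIT >= 8 ? 1u : 0u);
}
```

If all four directives are accepted, include_spacing_demo("abc") evaluates to 5.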

Related

what characters are valid to occur before and after a preprocessing directive in C according to the C standard [duplicate]

What characters are valid to occur before and after a preprocessing directive in C according to the C standard.
/*what are all the valid characters that can occur here*/ #include <stdio.h> /*and here according to the C standard*/
main()
{
printf("Hello World");
}
The C standard does not seem to say which characters are valid before and after a preprocessing directive. If someone can point me to the exact wording in the C standard, it will be much appreciated.
Note that before the preprocessor directives are examined, the compiler has already passed through phases 1 through 3 of the translation process. Translation phase 2 combines lines ending with a backslash with the following physical line in order to create logical lines. Translation phase 3 replaces every comment with a single space character. (It is allowed but not required that phase 3 also replaces every consecutive sequence of whitespace characters other than newline with a single space character.)
Once that is done, Phase 4 is entered, at which point preprocessor directives are identified. According to §6.10 paragraph 2 of the standard, a sequence of tokens is a preprocessor directive only if "The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either the first character in the source file (optionally after white space containing no new-line characters) or that follows white space containing at least one new-line character."
That's a very verbose way of saying that the # has to be the first token on a line, which means that it can only be preceded on the line by whitespace. But as the parenthetic comment in the sentence I quoted says, the test applies to the program as seen after phases 1 to 3, which means that comments have already been replaced with whitespace. So in the original program, the # might be preceded by a comment or by whitespace. (The comment must be a /*…*/ comment, since //… comments extend to the end of the line. Also note that continuation lines are combined before comments are identified, so a continuation marker can occur inside a //… comment.)
As to what can follow a preprocessor directive on the line, the answer is technically "nothing", since the directive extends up to and including the newline. (Again, a comment may have appeared in the original program.) The standard shows a grammar for each preprocessor directive which indicates what the directive's syntax is. If you were to add a non-whitespace character to a preprocessing directive, that would either create a syntax error or alter the meaning of the directive.
§6.10 paragraph 5 requires that whitespace within a preprocessor directive can only be a space or tab character, so that vertical tab and form-feed characters would be illegal. However, it is possible that the implementation has changed those characters to space characters in translation phase 3, so the use of vertical tab and form-feed in a preprocessor directive is implementation-dependent. Portable programs should only contain vertical tab and form-feed characters at the beginning of a line.
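As a rough illustration of the points above, the following sketch puts comments and spaces in the places the answer says are allowed. The macro names INDENTED and ANSWER and the function directive_demo are invented for the example.

```c
/* a comment may come first */ #include <string.h> /* and another may follow */
    #define INDENTED 1          /* spaces precede the # */
/* lead */ # /* inner */ define ANSWER 42 /* a comment that spans two
   physical lines is still a single space after phase 3 */ + INDENTED

/* ANSWER expands to 42 + INDENTED, because the multi-line comment did not
   end the directive. */
int directive_demo(void)
{
    return ANSWER + (int)strlen("");
}
```

Here directive_demo() should return 43, which shows that the text after the multi-line comment still belongs to the #define.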
Let's dissect the paragraph 2 of the C standard you linked in a comment:
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the following constraints:
The characters that make up a source file are divided into tokens. Examples of such tokens are special characters like '#', identifiers beginning with a letter or underscore, and numbers beginning with a decimal digit.
The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either the first character in the source file (optionally after white space containing no new-line characters) or that follows white space containing at least one new-line character.
"White space" is any sequence of white space characters without any other character. Commonly used white space characters are space (or blank), horizontal tabulator, line feed or carriage return.
"[...] white space containing at least one new-line character" means that one new-line character exists in the sequence of white space characters before the '#'. It does not matter where in the sequence it is.
So these are all valid sequences, shown as C strings:
"\n\t\t\t#..."
"\n #..."
"\n#..."
"\n\t#..."
"\n\t #..."
"\n \t#..."
The last token in the sequence is the first new-line character that follows the first token in the sequence.
Beginning with the token '#' all next tokens make up the preprocessor directive, until the next new-line character is found. The footnote 165 mentions the term "line" for such a sequence.
A new-line character ends the preprocessing directive even if it occurs within what would otherwise be an invocation of a function-like macro.
The invocation of a function-like macro looks like a function call in C, an identifier with a pair of parentheses. If there is a new-line before the closing parenthesis, the directive ends at that place.
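A small sketch of that last rule, with made-up names (ADD, spanning, DIRECTIVE_SAW_THREE): outside a directive a macro invocation may run over several lines, but inside a directive the invocation has to stay on one logical line, which here is done with a backslash splice.

```c
#define ADD(a, b) ((a) + (b))

/* Outside a directive, the invocation may span physical lines. */
int spanning = ADD(1,
                   2);

/* Inside a directive a bare new-line would end the directive, so the
   invocation is kept on one logical line with a backslash-newline. */
#if ADD(1, \
        2) == 3
#define DIRECTIVE_SAW_THREE 1
#else
#define DIRECTIVE_SAW_THREE 0
#endif

int directive_saw_three(void)
{
    return DIRECTIVE_SAW_THREE;
}
```

Removing the backslash from the #if line would leave the directive with an unterminated invocation, which a compiler is expected to reject.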
EDIT:
White space characters are listed concretely in chapter 7.4.1.10 "The isspace function" of the standard you linked:
The standard white-space characters are the following: space (' '), form feed ('\f'), new-line ('\n'), carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v').
One can assume that this function is used by the preprocessor.
Your confusion might come from interpreting "[...] white space containing no new-line characters [...]" as "white space does not include new-line in general" or "new-line is a special white space character." Neither is true.
The new-line is a valid white space character. It just has a special meaning under the specific circumstances of marking the beginning and the end of a preprocessor directive. And that is why they request white space without any new-line.
If the white space contained a new-line, it will mark the beginning of a new token sequence in the context of the preprocessor.
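The rule from paragraph 2 can be sketched as a small checker. This is only an illustration under assumptions: the function name hash_starts_directive is invented, the input is taken to be the text as it looks after translation phases 1 to 3 (comments already replaced by spaces), and isspace in the "C" locale stands in for the standard's notion of white space.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the '#' at src[pos] could begin a preprocessing directive:
   walking backwards over white space must either reach the start of the
   text or cross at least one new-line character. */
int hash_starts_directive(const char *src, size_t pos)
{
    int saw_newline = 0;

    if (src[pos] != '#')
        return 0;
    while (pos > 0 && isspace((unsigned char)src[pos - 1])) {
        if (src[pos - 1] == '\n')
            saw_newline = 1;
        pos--;
    }
    return pos == 0 || saw_newline;
}
```

For example, the '#' in "\n\t\t\t#..." (index 4) qualifies, while the '#' in "int x; #..." (index 7) does not, because only a space separates it from the preceding token.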
Please note that the preprocessor and the language C are quite separated concepts. You can use the preprocessor for preprocessing any other source files, using it for assembly is quite common. And you can write C source files without any preprocessor directive.
The preprocessor knows nothing about C, and the C compiler knows nothing about preprocessing directives.

Using \ to extend single-line comments

I just noticed that I can use \ to extend the single-line comment to the next line, similarly to doing so in pre-processor directives.
Why is nobody talking about this language feature?
I didn't even see it in books.
What language version supports this?
It's part of C. Called line splicing.
The K&R book talks about it
Lines that end with the backslash character \ are folded by deleting the backslash and the
following newline character. This occurs before division into tokens.
This occurs in the preprocessing phase.
So a single-line comment can be made to span several lines, like this:
//This is \
still a single line comment
Likewise with the case of strings
char str[]="Hello \
world. This is \
a string";
Edit: As noted in the comments, single line comments were not there in ANSI C but were introduced as part of the standard in C99 though many compilers already supported it.
From C99,
Except within a character constant, a string literal, or a comment, the characters // introduce a comment that includes all multibyte characters up to, but not including, the next new-line character. The contents of such a comment are examined only to identify multibyte characters and to find the terminating new-line character.
As far as line splicing is concerned, it is specified in C89 itself
2.1.1.2 Translation phases
Each instance of a new-line character and an immediately preceding backslash character is deleted, splicing physical source lines to form logical source lines. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character.
Look at KamiKaze's answer to see the relevant part of C99.
While it's true that a \ will effectively escape the newline at the end of a single-line comment, splicing the line with the following one (just as it does on any other line), you could claim that this is a bug in the Standard. At any rate, the situation is fantastically confusing. You might believe that both of these facts are true:
The single-line comment syntax // turns the rest of the line, up to the next newline, into a comment, which is not interpreted in any way, i.e. is ignored.
At the end of any line, a \ character eliminates the newline and splices the line to the following line.
But these two rules are basically in conflict; it looks like they can't both be true at the same time.
Now in fact, by definition, the second rule "wins", and the first rule really has to say that the rest of the line is not interpreted in any way except to check whether the last character is a \, in which case it retains its line-splicing meaning.
(Now, if you're a compiler writer or a language lawyer, of course, you don't think about it that way. If you're a compiler writer or language lawyer, you know that the \ was processed during an earlier phase of compilation, before comments are parsed, meaning that the first rule is perfectly true as stated. But most people don't think like compiler writers and language lawyers.)
My point is that this situation is basically fraught with peril. I would bet good money that there are compilers or other language processors out there that get this wrong. I would urge any sane programmer not to rely on this, not to put a \ at the end of any line that contains a single-line comment. (And if I were writing a compiler or other language processor, I'd try to warn about this.)
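To show why I consider this dangerous, here is a minimal sketch (the function name spliced_comment_demo is made up). The backslash at the end of the comment silently pulls the next line into the comment. GCC, for one, reports this under -Wcomment, which -Wall enables.

```c
int spliced_comment_demo(void)
{
    int x = 1;
    // reset x before returning; the trailing backslash is the bug \
    x = 2;
    return x;   /* the assignment above is part of the comment */
}
```

The function returns 1, not 2, because after line splicing the assignment is just more comment text.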
This is not a feature of comments but a general feature of the language, as it applies to all newline-characters.
The following is found in the C99 standard:
5.1.1.2 Translation phases
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
So it is standard compliant for C99 at least.
It is not much talked about, because the relevant usecases (except for large macros and strings) are quite rare. If you need a multiline comment (the standard comment in C, // was added from C++ later on), you could just use
/* multi
line
comment
*/
Every use except for large macros and strings will make the code harder to read and might even make it quite confusing. So generally it is not used except for the mentioned niches.
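For completeness, a sketch of the one niche where splicing is routinely used, a multi-line macro. SWAP_INT and swap_demo are hypothetical names chosen for the example.

```c
/* Each backslash-newline is deleted in translation phase 2, so the
   preprocessor sees the whole definition as one logical line. */
#define SWAP_INT(a, b)      \
    do {                    \
        int tmp_ = (a);     \
        (a) = (b);          \
        (b) = tmp_;         \
    } while (0)

int swap_demo(void)
{
    int x = 1, y = 2;
    SWAP_INT(x, y);
    return x * 10 + y;   /* x is now 2 and y is now 1 */
}
```

After the swap, swap_demo() returns 21.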
Every instance of \ followed by a newline is removed from the source in translation phase 2, before tokenisation and comment handling.
As a consequence, a single line comment can be extended to the next line of source code by escaping this newline with a \ (or a ??/ trigraph sequence):
// this is a single \
line comment
Note how the stackoverflow code highlighter is fooled by this trick and does not colorize the end of the comment line.
This feature can be further abused to make really weird looking comments:
/\
/\ This is a single line comment /\
\/ \/
/\
*\ This is a multi-line comment
*\
/
Any token can be broken in pieces this way. Check this corner case:
\
r\
et\
urn\
0x7\
ffff;\
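A compilable variation on the same idea, as a sketch only (the function name spliced_tokens is invented): an identifier, a hexadecimal constant and a keyword are each split across physical lines and reassembled by line splicing before tokenisation.

```c
int spliced_tokens(void)
{
    int to\
tal = 0x7\
f;              /* after splicing: int total = 0x7f; */
    re\
turn total;     /* after splicing: return total; */
}
```

The continuation lines have to start in the first column, since any leading white space would end up in the middle of the token. The function returns 127.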

Compilers that required # on the first column?

Were there widely used pre-ANSI C compilers† that required the # to be on the first column?
† I would accept any compiler on this list. If I can find mention of it in the comp.lang.c Usenet newsgroup in a post dated before 1995, I would accept it.
K&R C did not specify whether whitespace was permitted before the #. From the original The C Programming Language, §12¶1 of the "C Reference Manual" in Appendix A:
The C compiler contains a preprocessor capable of macro substitution, conditional compilation, and inclusion of named files. Lines beginning with # communicate with this preprocessor.
Thus, whether or not whitespace was permitted to precede the # was unspecified. This would mean a pre-ANSI compiler could fail to compile a program if the directive did not begin on the first column.
In ISO C (and in ANSI C before that), the C preprocessing directives were explicitly permitted to be prefixed with whitespace. In ANSI C (C-89):
A preprocessing directive consists of a sequence of preprocessing
tokens that begins with a # preprocessing token that is either the
first character in the source file (optionally after white space
containing no new-line characters) or that follows white space
containing at least one new-line character, and is ended by the next
new-line character.
ISO C.2011 has similar language, but is clarified even further:
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the
following constraints: The first token in the sequence is a # preprocessing token that (at
the start of translation phase 4) is either the first character in the source file (optionally
after white space containing no new-line characters) or that follows white space
containing at least one new-line character. The last token in the sequence is the first new-line
character that follows the first token in the sequence.165) A new-line character ends
the preprocessing directive even if it occurs within what would otherwise be an invocation of a function-like macro.
165) Thus, preprocessing directives are commonly called "lines". These "lines" have no other syntactic
significance, as all white space is equivalent except in certain situations during preprocessing (see the
# character string literal creation operator in 6.10.3.2, for example).
Short answer: Yes.
I remember writing things like
#if foo
/* ... */
#else
#if bar
/* ... */
#else
 #error "neither foo nor bar specified"
#endif
#endif
so that the various pre-ANSI compilers that I once used wouldn't complain about "unrecognized preprocessor directive '#error'". This would have been with Ritchie's original cc for the pdp11, or pcc (the "portable C compiler" which, IIRC, was the basis for the Vax cc of the 80's or so). Both of those compilers -- more accurately, the preprocessor used with both of those compilers -- definitely required the # to be in the first column. (Actually, although those compilers were very different, they might both have used different variants of basically the same preprocessor, which was always a separate program in those days.)
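For code that had to survive such preprocessors, one convention I remember is to keep every # in column 1 and put any indentation after it. The sketch below assumes the old preprocessor tolerated white space after the #, which I have not verified for any particular one; SMALL_MODEL, BUFFER_SIZE and buffer_size are hypothetical names.

```c
/* The # stays in the first column; nesting is shown by indenting after it. */
#ifndef BUFFER_SIZE
#  if defined(SMALL_MODEL)
#    define BUFFER_SIZE 128
#  else
#    define BUFFER_SIZE 4096
#  endif
#endif

int buffer_size(void)
{
    return BUFFER_SIZE;
}
```

With neither macro predefined, buffer_size() returns 4096.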

Trailing characters with #include directive

I had a seemingly innocuous line in a source file
#include <some_sys_header_file.h>"
It was buried with a bunch of other includes that were using double-quotes (rather than angle brackets) so the spurious double-quote wasn't spotted.
The compiler (or rather, pre-processor) was happy, included the required file, and skipped the rest of the line.
But, when formatting the file using Artistic Style, the double-quote caused chaos with literal strings being incorrectly split over multiple lines.
Is there a standard for how this should be treated?
It is undefined behavior.
C99 says in 6.10 that an #include directive has the form
# include pp-tokens new-line
The only pp-tokens starting with a " are a string literal (C99 6.4.5 String literals) and a header-name in double quotes (C99 6.4.7 Header names). However, string literals must not contain un-escaped new-lines and header-names must not contain new-lines.
The lone " also cannot be part of a header-name since it is not within <> (C99 6.4.7 Header names).
What's left according to C99 6.4 Lexical elements is
preprocessing-token:
...
each non-white-space character that cannot be one of the above
In conjunction with the Semantics in paragraph 3
If a ' or a " character matches the last category, the behavior is
undefined.
So you might or might not get a diagnostic.
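For contrast with the malformed line, here is a sketch of the pp-tokens form used in a well-defined way: the tokens after include are macro-expanded and must then match one of the two header-name forms. SYS_HEADER and name_length are hypothetical names.

```c
#define SYS_HEADER <string.h>   /* expands to a well-formed header name */
#include SYS_HEADER             /* the "# include pp-tokens new-line" form */

/* #include <string.h>"   <- with the stray quote the behavior is undefined,
   so it is only shown here inside a comment */

size_t name_length(const char *s)
{
    return strlen(s);   /* strlen and size_t come from the header above */
}
```

If the computed include worked, name_length("abc") is 3.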
When Standards were written, some compilers didn't use normal string-parsing logic on their #include arguments (especially the angle-bracket form which isn't used anywhere else). Since the closing angle bracket isn't supposed to be followed by anything other than white space, it would be legitimate for a compiler to simply erase the last non-whitespace character (which should be an angle bracket) without checking what it is, and use whatever precedes it as a file name. Thus, a compiler could regard:
#include <stdio.h> Hey
as a request to include a file called stdio.h> He. While that particular behavior might not be useful, it might hypothetically be necessary to use C in an environment that uses > as a path-separator character, and which might thus require:
#include <graphlib>drawing.h>
Since including characters after the intended header name could cause a compiler to #include a file other than the one intended, and since such a file could do anything whatsoever, include directives other than those defined by the standard are considered Undefined Behavior. Personally, I think the language would benefit greatly if an additional form were standardized:
#include "literal1" "literal2" ["literal3", etc.]
with string concatenation applied in the usual fashion. Many projects I've worked with have had absurdly long lists of include paths, the need for which could have been greatly mitigated if it were possible to say:
#include "fooproject_directories.h"
#include ACMEDIR "acme.h"
#include IOSYSDIR "io.h"
etc. thus making clear which files were supposed to be loaded from where. Although it is possible that some code somewhere might rely upon the ability to say:
#include "this"file.h"
as a means of loading a file called this"file.h I think the Standard could accommodate such things by having such treatments be considered non-normative, requiring that any implementations document such non-normative behavior, and not requiring strictly-compliant programs to work on such non-normative platforms.

C include paths: Safe to use hyphen & underscore?

I realise that this may be operating-system specific, but I am trying to write cross-platform GCC code. So I would like to know whether it really is best to avoid these or, if I am forced to choose, which would be the safer bet.
The syntax of header names is (C11, 6.4.7 par. 1):
header-name:
< h-char-sequence >
" q-char-sequence "
h-char-sequence:
h-char
h-char-sequence h-char
h-char:
any member of the source character set except
the new-line character and >
q-char-sequence:
q-char
q-char-sequence q-char
q-char:
any member of the source character set except
the new-line character and "
Both - and _ are part of the basic source character set (5.2.1 par. 3). But the standard leaves much stuff related to #include implementation-defined.
In practice I don't see any problems in using them. I'm not aware of a file system that disallows or assigns a special meaning to them. This Wikipedia article doesn't list one either.
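The grammar quoted above is simple enough to sketch as a checker. This is an illustration only: the function names are invented, and the code tests just the two exclusions per form (new-line and the closing delimiter), not membership in the source character set or anything the file system might reject.

```c
#include <stddef.h>

/* Returns 1 if s is a non-empty sequence of characters other than
   new-line and the given closing delimiter, 0 otherwise. */
static int is_char_sequence(const char *s, char closing)
{
    size_t i;

    if (s == NULL || s[0] == '\0')
        return 0;
    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] == '\n' || s[i] == closing)
            return 0;
    }
    return 1;
}

/* h-char-sequence: what may appear between < and > */
int is_h_char_sequence(const char *s) { return is_char_sequence(s, '>'); }

/* q-char-sequence: what may appear between the double quotes */
int is_q_char_sequence(const char *s) { return is_char_sequence(s, '"'); }
```

Names such as my-header_v2.h pass both checks, so as far as the grammar is concerned hyphens and underscores are fine.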

Resources