Trailing characters with #include directive - c

I had a seemingly innocuous line in a source file:
#include <some_sys_header_file.h>"
It was buried among a bunch of other includes that were using double quotes (rather than angle brackets), so the spurious double quote wasn't spotted.
The compiler (or rather, pre-processor) was happy, included the required file, and skipped the rest of the line.
But, when formatting the file using Artistic Style, the double-quote caused chaos with literal strings being incorrectly split over multiple lines.
Is there a standard for how this should be treated?

It is undefined behavior.
C99 says in 6.10 that an #include directive has the form
# include pp-tokens new-line
The only pp-tokens starting with a " are a string literal (C99 6.4.5 String literals) and a header-name in double quotes (C99 6.4.7 Header names). However, string literals must not contain un-escaped new-lines and header-names must not contain new-lines.
The lone " also cannot be part of a header-name since it is not within <> (C99 6.4.7 Header names).
What's left according to C99 6.4 Lexical elements is
preprocessing-token:
...
each non-white-space character that cannot be one of the above
In conjunction with the Semantics in paragraph 3
If a ' or a " character matches the last category, the behavior is
undefined.
So you might or might not get a diagnostic.
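A minimal reproduction (hypothetical header name): whether any diagnostic appears is entirely up to the implementation; GCC and Clang do warn about the stray token in practice, but nothing requires them to.
/* stray quote after the header name: undefined behavior; may or may not be diagnosed */
#include <stdio.h>"

int main(void)
{
    return 0;
}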

When the Standards were written, some compilers didn't use normal string-parsing logic on their #include arguments (especially the angle-bracket form, which isn't used anywhere else). Since the closing angle bracket isn't supposed to be followed by anything other than white space, it would be legitimate for a compiler to simply erase the last non-whitespace character (which should be an angle bracket) without checking what it is, and use whatever precedes it as a file name. Thus, a compiler could regard:
#include <stdio.h> Hey
as a request to include a file called stdio.h> He. While that particular behavior might not be useful, it might hypothetically be necessary to use C in an environment that uses > as a path-separator character, and which might thus require:
#include <graphlib>drawing.h>
Since including characters after the intended header name could cause a compiler to #include a file other than the one intended, and since such a file could do anything whatsoever, include directives other than those defined by the standard are considered Undefined Behavior. Personally, I think the language would benefit greatly if an additional form were standardized:
#include "literal1" "literal2" ["literal3", etc.]
with string concatenation applied in the usual fashion. Many projects I've worked with have had absurdly long lists of include paths, the need for which could have been greatly mitigated if it were possible to say:
#include "fooproject_directories.h"
#include ACMEDIR "acme.h"
#include IOSYSDIR "io.h"
etc. thus making clear which files were supposed to be loaded from where. Although it is possible that some code somewhere might rely upon the ability to say:
#include "this"file.h"
as a means of loading a file called this"file.h, I think the Standard could accommodate such things by having such treatments be considered non-normative, requiring that any implementation document such non-normative behavior, and not requiring strictly conforming programs to work on such non-normative platforms.
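For what it's worth, standard C already permits a macro-expanded form of #include (C99 6.10.2p4), which gets part of the way toward the above. String concatenation as proposed is not standard (how several expanded tokens combine into one header name is implementation-defined), but a macro expanding to a single complete string literal is portable. A hypothetical sketch, with all file and macro names invented for illustration:
/* fooproject_directories.h (hypothetical): one place that knows the paths */
#define ACME_ACME_H  "acme/include/acme.h"
#define IOSYS_IO_H   "iosys/include/io.h"

/* user code: */
#include "fooproject_directories.h"
#include ACME_ACME_H
#include IOSYS_IO_H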

Related

C preprocessor: what is the rationale behind not allowing argument of the #include directive to begin with a digit?

N2479 C17..C2x working draft — February 5, 2020 ISO/IEC 9899:202x (E):
6.10.2 Source file inclusion
The implementation shall provide unique mappings for sequences consisting of one or more nondigits or digits (6.4.2.1) followed by a period (.) and a single nondigit. The first character shall not be a digit.
Question: what is the rationale behind not allowing argument (char-sequence) of the #include directive to begin with a digit?
Extra question: compilers seem not to generate any diagnostic message when the shall requirement above is violated (e.g. use of #include "1.h"). Why?
UPD. Later my colleague answered: "The first character shall not be a digit" relates only to the unique mappings. So, the standard was misinterpreted.
Question: what is the rationale behind not allowing argument (char-sequence) of the #include directive to begin with a digit?
By itself, the sentence “The first character shall not be a digit” would seem to be saying that a C program shall not use a digit as the first character in a header name. However, it is between two sentences that tell us how C implementations must process header names and is in a clause, 6.10, that tells us how implementations process #include directives. The clause that tells us the grammar for header names is in a different place in the C standard, 6.4.7, where it gives #include <1/a.h> as an example of a possible directive (C 2018 6.4.7 4).
So I believe the intent of 6.10.2 5 is to provide a quality-of-implementation guarantee, saying that you cannot implement C directly using a file system that does not support at least eight characters in the base part of file names, but you can use a file system that ignores case (per its last sentence) or that does not support names beginning with a digit. Although “The first character shall not be a digit” appears to be a prohibition on C programs, that is because a mistake was made in putting this in a separate sentence without qualification; the first two sentences should have been something like “The implementation shall provide unique mappings for sequences consisting of a nondigit (6.4.2.1) followed by zero or more nondigits or digits followed by a period (.) and a single nondigit.”
(In C 1990, this paragraph appears in 6.8.2, where the significance requirement is only for six characters. It was increased to eight in C 1999, reflecting the prevalence of better file systems.)
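A quick check of that reading (assuming a file named 1.h exists beside the source file): common compilers accept the directive without any diagnostic.
/* 1.h is assumed to exist in the same directory; GCC and Clang accept this silently */
#include "1.h"

int main(void)
{
    return 0;
}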

Why should the controlled group in a conditional inclusion be lexically valid when the conditional is false?

The following program compiles:
// #define WILL_COMPILE
#ifdef WILL_COMPILE
int i =
#endif
int main()
{
    return 0;
}
But the following will issue a warning:
//#define WILL_NOT_COMPILE
#ifdef WILL_NOT_COMPILE
char* s = "failure
#endif
int main()
{
    return 0;
}
I understand that in the first example, the controlled group is removed by the time the compilation phase of the translation is reached. So it compiles without errors or warnings.
But why is lexical validity required in the second example when the controlled group is not going to be included?
Searching online I found this quote:
Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended.
But this does not state why the lexical validity is checked when the conditional fails.
Have I missed something here?
In translation phase 3 the preprocessor generates preprocessing tokens, and having a " end up in the catch-all category each non-white-space character that cannot be one of the above is undefined behavior.
See C11 6.4 Lexical elements p3:
A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories.69) If a ' or a " character matches the last category, the behavior is
undefined. ....
For reference the preprocessing-token are:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
The unmatched " in your second example falls into that last category: each non-white-space character that cannot be one of the above.
Since this is undefined behavior and not a constraint, the compiler is not obliged to diagnose it, but it is certainly allowed to, and with -pedantic-errors it even becomes an error (demonstrated in a godbolt session). As rici points out, it only becomes a constraint violation if the token survives preprocessing.
The gcc document you cite basically says the same thing:
... Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended. ...
"Why is [something about C] the way it is?" questions can't usually be answered, because none of the people who wrote the 1989 C standard are here to answer questions [as far as I know, anyway] and if they were here, it was nearly thirty years ago and they probably don't remember.
However, I can think of a plausible reason why the contents of skipped conditional groups are required to consist of a valid sequence of preprocessing tokens. Observe that comments are not required to consist of a valid sequence of preprocessing tokens:
/* this comment's perfectly fine even though it has an unclosed
character literal inside */
Observe also that it is really simple to scan for the end of a comment: for /* you look for the next */, and for // you look for the end of the line. The only complication is that trigraphs and backslash-newline are supposed to be converted first. Tokenizing the contents of comments would be extra code to no useful purpose.
By contrast, it is not simple to scan for the end of a skipped conditional group, because conditional groups nest. You have to be looking for #if, #ifdef, and #ifndef as well as #else and #endif, and counting your depth. And all of those directives are lexically defined in terms of preprocessor tokens, because that's the most natural way to look for them when you're not in a skipped conditional group. Requiring skipped conditional groups to be tokenizable allows the preprocessor to use the same code to process directives within skipped conditional groups as it does elsewhere.
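A small illustration of the nesting problem: the preprocessor has to recognize the nested conditional below, inside a group it is skipping, to know which #endif actually closes that group. (SOME_OPTION is an invented name; inside a skipped group its value is never evaluated, only the nesting depth matters.)
#if 0
  #if SOME_OPTION     /* must be recognized as a nested #if ... */
  int x = 1;
  #endif              /* ... so this #endif does not end the outer group */
  int y = 2;
#endif                /* only this one closes the skipped group */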
By default, GCC issues only a warning when it encounters an un-tokenizable line inside a skipped conditional group, an error elsewhere:
#if 0
"foo
#endif
"bar
gives me
test.c:2:1: warning: missing terminating " character
"foo
^
test.c:4:1: error: missing terminating " character
"bar
^~~~
This is an intentional leniency, possibly one I introduced myself (it's only been twenty years since I wrote a third of GCC's current preprocessor, but I have still forgotten a lot of the details). You see, the original C preprocessor, the one K and R wrote, did allow arbitrary nonsense inside skipped conditional groups, because it wasn't built around the concept of tokens in the first place; it transformed text into other text. So people would put comments between #if 0 and #endif instead of /* and */, and naturally enough those comments would sometimes contain apostrophes. So, when Per Bothner and Neil Booth and Chiaki Ishikawa and I replaced GCC's original "C-Compatible Compiler Preprocessor"1 with the integrated, fully standards-compliant "cpplib", circa GCC 3.0, we felt we needed to cut a little compatibility slack here.
1 Raise your hand if you're old enough to know why RMS thought this name was funny.
The description of Translation phase 3 (C11 5.1.1.2/3), which happens before preprocessing directives are actioned:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
And the grammar for preprocessing-token is:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Note in particular that a string-literal is a single preprocessing-token. The subsequent description (C11 6.4/3) clarifies that:
If a ' or a " character matches the last category, the behavior is
undefined.
So your second code causes undefined behaviour at translation phase 3.

C/C++ Preprocessor #include syntax. Whitespace mandatory?

In some code from others I came across the notation:
#include"file.h"
And I wondered, I always write include syntaxes with a space between the directive and the path/file like:
#include "file.h"
So I searched for the #include syntax, but could not find a definitive answer: the resource below only shows how the directive is used, not its exact syntax.
Resource used: https://gcc.gnu.org/onlinedocs/cpp/Include-Syntax.html#Include-Syntax
Is there even a spec written about how the syntax should be? Or is the first one (without the whitespace) invalid syntax but accepted through pre-compiler extensions?
(Also works on MSVC)
It is allowed to have #include<header> and #include"header", with no whitespace character after include, although one may argue that this makes it slightly harder to read.
The grammar for #include is, cut down to only show the relevant bits:
# include pp-tokens new-line
pp-tokens: preprocessing-token
preprocessing-token: header-name
header-name: < h-char-sequence >
" q-char-sequence "
(Spaces in the above are only part of the grammar notation; #, include, <, >, and " are literal.)
Furthermore:
Preprocessing tokens can be separated by white space; this consists of
comments (described later), or white-space characters (space,
horizontal tab, new-line, vertical tab, and form-feed), or both.
This is from the C99 draft standard document (my emphasis on "can").
I'm not a C++ programmer, but I imagine that C++ is following the same grammar rules on this point.
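A minimal demonstration that both spellings compile:
#include<stdio.h>    /* no whitespace after include: valid */
#include"stdio.h"    /* also valid; if the quoted search fails, the directive is reprocessed as <stdio.h> (C99 6.10.2p3) */

int main(void)
{
    puts("both forms compile");
    return 0;
}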

Are trigraphs required to write a newline character in C99 using only ISO 646?

Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly. For instance, one could opt to write a Hello World program as such:
%:include <stdio.h>
%:include <stdlib.h>
int main()
<%
    fputs("Hello World!??/n", stdout);
    return EXIT_SUCCESS;
%>
However, besides digraphs, I used the ??/ trigraph to write the \ character.
Given my assumptions above, is it possible to either
include the '\n' character (which is translated to a newline in <stdio.h> functions) in a string without the use of trigraphs, or
write a newline to a FILE * without using the '\n' character?
For stdout you could just use puts("") to output a newline. Or indeed replace the fputs in your original program with puts and delete the \n.
If you want to get the newline character into a variable so you can do other things with it, I know another standard function that gives you one for free:
#include <time.h>   /* time, ctime */
#include <string.h> /* strchr */

int gimme_a_newline(void)
{
    time_t t = time(0);
    /* ctime's result always ends with "\n\0", so the character before the NUL is a newline */
    return strchr(ctime(&t), 0)[-1];
}
You could then say
fprintf(stderr, "Hello, world!%c", gimme_a_newline());
(I hope all of the characters I used are ISO646 or digraph-accessible. I found it surprisingly difficult to get a simple list of which ASCII characters are not in ISO646. Wikipedia has a color-coded table with not nearly enough contrast between colors for me to tell what's what.)
Your premise:
Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly.
is questionable. C99 defines "source" and "execution" character sets, and requires that both include representations of the backslash character (C99 5.2.1). The only reason I can imagine for an effort such as you describe would be to try to produce source code that does not require character set transcoding upon movement among machines. In that case, however, the choice of ISO 646 as a common baseline is odd. You're more likely to run into an EBCDIC machine than one that uses an ISO 646 variant that is not coincident with the ISO-8859 family of character sets. (And if you can assume ISO 8859, then backslash does not present a problem.)
Nevertheless, if you insist on writing C source code without using a literal backslash character, then the trigraph for that character is the way to do so. That's what trigraphs were invented for. In character constants and string literals, you cannot portably substitute anything else for \n or its trigraph equivalent, ??/n, because it is implementation-dependent how that code is mapped. In particular, it is not safe to assume that it maps to a line-feed character (which, however, is included among the invariant characters of ISO 646).
Update:
You ask specifically whether it is possible to
include the '\n' character (which is translated to a newline in <stdio.h> functions) in a string without the use of trigraphs, or
No, it is not possible, because there is no one '\n' character. Moreover, there seems to be a bit of a misconception here: \n in a character or string literal represents one character in the execution character set. The compiler is therefore responsible for that transformation, not the stdio functions. The stdio functions' responsibility is to handle that character on output by writing a character or character sequence intended to produce the specified effect ("[m]oves the active position to the initial position of the next line").
You also ask whether it is possible to
write a newline to a FILE * without using the '\n' character?
This one depends on exactly what you mean. If you want to write a character whose code in the execution character set you know, then you can write a numeric constant having that numeric value. In particular, if you want to write the character with encoded value 0xa (in the execution character set) then you can do so. For example, you could
fputc(0xa, my_file);
but that does not necessarily produce a result equivalent to
fputc('\n', my_file);
Short answer is, yes, for what you want to do, you have to use this trigraph.
Even if there were a digraph for \, it would be useless inside a string literal, because digraphs must be tokens: they are recognized by the tokenizer, while trigraphs are replaced earlier (in translation phase 1) and so still work inside string literals and the like.
Still wondering why somebody would encode source this way today ... :o
No. \n (or its trigraph equivalent) is the portable representation of a newline character.
No. You'd have to represent the literal newline somehow, and \n (or its trigraph equivalent) is the only portable representation.
It's very unusual to find C source code that uses trigraphs or digraphs! Some compilers (e.g. GNU gcc) require command-line options to enable trigraphs; otherwise they assume trigraphs appear unintentionally and issue a warning on encountering them in the source code.
EDIT: I forgot about puts(""). That's a sneaky way to do it, but only works for stdout.
Yes, of course it's possible:
fputc(0x0A, file);   /* 0x0A is the ISO 646 line feed, though not necessarily '\n' in the execution character set */

Occurrences of question mark in C code

I am doing a simple program that should count the occurrences of the ternary operator ?: in C source code, and I am trying to simplify it as much as possible. So I've filtered from the source code these things:
String literals " "
Character constants ' '
Trigraph sequences ??=, ??(, etc.
Comments
Macros
And now I am only counting the occurrences of question marks.
So my question is: is there any other symbol, operator, or anything else that could cause a problem, i.e. contain '?'?
Let's suppose that the source is syntactically valid.
I think you found all the places where a question mark is introduced, and therefore eliminated all possible false positives (for the ternary operator). But maybe you eliminated too much: maybe you want to count those "?:"s that get introduced by macros; you don't count those. Is that what you intend? If so, you're done.
Run your tool on preprocessed source code (you can get this by running e.g. gcc -E). This will have done all macro expansions (as well as #include substitution), and eliminated all trigraphs and comments, so your job will become much easier.
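A minimal sketch of that approach, assuming the tool reads preprocessed source on stdin (the program name is invented): gcc -E has already removed comments, expanded macros, and replaced trigraphs, so only string literals and character constants still need to be skipped.
/* count_qmarks.c: count '?' outside string literals and character constants */
#include <stdio.h>

int main(void)
{
    enum { CODE, STR, CHR } state = CODE;
    int c, prev = 0, count = 0;

    while ((c = getchar()) != EOF) {
        if (state == CODE) {
            if (c == '"')       state = STR;
            else if (c == '\'') state = CHR;
            else if (c == '?')  count++;
        } else if (state == STR) {
            if (c == '"' && prev != '\\') state = CODE;
        } else {            /* CHR */
            if (c == '\'' && prev != '\\') state = CODE;
        }
        /* treat an escaped backslash as consumed, so "\\" closes correctly */
        prev = (prev == '\\' && c == '\\') ? 0 : c;
    }
    printf("%d\n", count);   /* after this filtering, each hit is a ?: candidate */
    return 0;
}
Usage: gcc -E foo.c | ./count_qmarks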
In K&R ANSI C the only places where a question mark can validly occur are:
String literals " "
Character constants ' '
Comments
Now you might notice macros and trigraph sequences are missing from this list.
I didn't include trigraph sequences since they are a compiler extension and not "valid C". I don't mean you should remove the check from your program; I'm trying to say you already went further than what's needed for ANSI C.
I also didn't include macros because when you're talking about a character that can occur in macros you can mean two things:
Macro names/identifiers
Macro bodies
The ? character cannot occur in macro identifiers (http://stackoverflow.com/questions/369495/what-are-the-valid-characters-for-macro-names), and I see macro bodies as regular C code, so the first list (string literals, character constants and comments*) should cover them too.
* Can macros validly contain comments? Because if I use this:
#define somemacro 15 // this is a comment
then // this is a comment isn't part of the macro. But what if I were to compile this C file with -D somemacro="15 // this is a comment"?
