Is double quote (") a preprocessing-token or an unterminated string literal?
C11, 6.4 Lexical elements, Syntax, 1:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
C11, 6.4.5 String literals, Syntax, 1:
string-literal:
encoding-prefix(opt) " s-char-sequence(opt) "
Note: GCC considers it to be an unterminated string literal:
#if 0
"
#endif
produces:
warning: missing terminating " character
C 2018 6.4.1 3 says one category of preprocessing tokens is “single non-white-space characters that do not lexically match the other preprocessing token categories”, and the next sentence says “If a ’ or a " character matches the last category, the behavior is undefined.” Thus, if an isolated " appears (one that is not paired with another " with an s-char-sequence between them), it fails to match the lexical form of a string literal and is parsed as a single non-white-space character that does not match the other categories, and the behavior is not defined by the standard. This explains the GCC message.
(I note that 6.10.2 3 describes # include directives with “" q-char-sequence " new-line”. However, the earlier grammar in 6.10 1 describes the directives as “# include pp-tokens new-line”. My interpretation of this is that the directive is parsed as having preprocessor token(s), notably a string literal, and 6.10.2 3 says that if that string literal has the form shown as “" q-char-sequence "”, then it is a # include directive of the type that paragraph is discussing.)
Related
Consider the following c code:
#include <stdio.h>
#define PRE(a) " ## a
int main() {
printf("%s\n", PRE("));
return 0;
}
If we adhere strictly to the tokenization rules of c99, I would expect it to break up as:
...
[#] [define] [PRE] [(] [a] [)] ["]* [##] [a]
...
[printf] [(] ["%s\n"] [,] [PRE] [(] ["]* [)] [)] [;]
...
* A single non-whitespace character that does not match any preprocessing-token pattern
And thus, after running preprocessing directives, the printf line should become:
printf("%s\n", "");
And parse normally. But instead, it throws an error when compiled with gcc, even when using the flags -std=c99 -pedantic. What am I missing?
From C11 6.4. Lexical elements:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
3 [...] The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string
literals, punctuators, and single non-white-space characters that do
not lexically match the other preprocessing token categories.69) If a
' or a " character matches the last category, the behavior is
undefined. [...]
So if " is not part of a string-literal, but is a non-white-space character, the behavior is undefined. I do not know why it's undefined and not a hard error - I think it's to allow compilers to parse multiline string literals.
it throws an error
But on godbolt:
<source>:3:16: warning: missing terminating " character
3 | #define PRE(a) " ## a
| ^
<source>:6:24: warning: missing terminating " character
6 | printf("%s\n", PRE("));
| ^
<source>:8: error: unterminated argument list invoking macro "PRE"
8 | }
|
<source>: In function 'int main()':
<source>:6:20: error: 'PRE' was not declared in this scope
6 | printf("%s\n", PRE("));
| ^~~
<source>:6:20: error: expected '}' at end of input
<source>:5:12: note: to match this '{'
5 | int main() {
| ^
it throws an error not on #define PRE line (it could), but on PRE(") line. Tokens are recognized before macro substitutions (phase 3 vs phase 4), so whatever you do you can't like "create" new lexical string literals as a result of macro substitution by for example gluing two macros or like you want to do. Note that -pedantic will not change the warning into error - -pedantic throws errors where standard tells to throw error, but the standard tells the behavior is undefined, so no error is needed there.
If I write putchar('\\t'); while trying to print "\t" instead of an actual tab, I get the multi character constant warning. On the other hand, if I write putchar('\\'); I get no warning. Upon looking in the ASCII table, there is no character '\\', only '\'. So why is there no warning? Why is '\\' one character but '\\t' is more than one? Can a backslash only be used to escape one following character?
You cannot print \ and t with one putchar invocation, since putchar puts one and exactly only one character into the standard output. Use 2:
putchar('\\');
putchar('t');
Another option would be to use the string "\\t" with fputs:
fputs("\\t", stdout);
There is no warning for '\\' because that is one way how you enter the character literal for the character \. On ASCII this is synonymous with '\134' and '\x5c'.
From C11 6.4.4.4 paragraphs 2 and 4:
2
An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'. [...] With a few exceptions detailed later, the elements of the sequence are any members of the source character set; they are mapped in an implementation-defined manner to members of the execution character set.
[...]
4
The double-quote " and question-mark ? are representable either by themselves or by the escape sequences \" and \?, respectively, but the single-quote ' and the backslash \ shall be represented, respectively, by the escape sequences \' and \\.
The reason why you get a warning for this is that the behaviour is wholly implementation-defined. In C11 J.3.4 the following is listed as implementation-defined behaviour:
The value of an integer character constant containing more than one character or containing a character or escape sequence that does not map to a single-byte execution character (6.4.4.4).
Since '\\' contains an escape sequence that maps to a single-byte execution character \, there is no implementation-defined pitfalls, and nothing to warn about; but \\t contains 2 characters: \ and t, and it wouldn't do what you want portably.
\\ is one character, t is one character, so that is clearly two characters.
\\ is an escape sequence, just like \t; it means \.
If you want to print the two characters \ and t, you clearly need either two calls to putch() or a function that takes a string argument "\\t".
https://en.wikipedia.org/wiki/Escape_sequences_in_C#Table_of_escape_sequences
To recap, the phases 5-7 are described in the standard:
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member
of the execution character set; if there is no corresponding member,
it is converted to an implementation- defined member other than the
null (wide) character. 7)
Adjacent string literal tokens are concatenated.
White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting
tokens are syntactically and semantically analyzed and translated as a
translation unit.
Now I agree that whites-space characters are no longer significant at phase 7, but couldn't one get rid of them already after phase 4? Is there an example where this would make a difference?
Of course it should be realized that removing white-space characters separating tokens doesn't work at this stage as the data after phase 4 consists of preprocessing tokens. The idea is to get rid of spaces separating preprocessing tokens at an earlier stage.
Consider this source code
char* str = "some text" " with spaces";
In phase 5 this is converted to these tokens (one token per line):
char
*
str
=
"some text"
" with spaces"
Here matter the spaces in "some text" and " with spaces".
Afterwards all spacees between tokens (see above) are ignored.
If you remove whitespace before step 5 you get other string literals like "sometext"
The latest version of the C standard provides for width prefixes to string constants e.g. u8"a" is a single preprocessing token.
Does whether you get one or two preprocessing tokens depend on the exact letters in the prefix? E.g. is it the case that u9"a" is still two preprocessing tokens?
C11 specifies in 6.4 that a string literal is one of the pre-processing tokens:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Hence u8"a" is a single token because the string literal section 6.4.5 lists that as a valid option:
string-literal:
encoding-prefix(opt) " s-char-sequence(opt) "
encoding-prefix:
u8
u
U
L
The sequence u9"a" is not a string literal because u9 is not one of the valid prefixes.
The u9 would (from my reading) be treated as an identifier while the "a" would be a string literal, so that would be two separate pre-processing tokens.
I was playing around a bit with the C preprocessor, when something which seemed so simple failed:
#define STR_START "
#define STR_END "
int puts(const char *);
int main() {
puts(STR_START hello world STR_END);
}
When I compile it with gcc (note: similar errors with clang), it fails, with these errors:
$ gcc test.c
test.c:1:19: warning: missing terminating " character
test.c:2:17: warning: missing terminating " character
test.c: In function ‘main’:
test.c:7: error: missing terminating " character
test.c:7: error: ‘hello’ undeclared (first use in this function)
test.c:7: error: (Each undeclared identifier is reported only once
test.c:7: error: for each function it appears in.)
test.c:7: error: expected ‘)’ before ‘world’
test.c:7: error: missing terminating " character
Which sort of confused me, so I ran it through the pre-processor:
$ gcc -E test.c
# 1 "test.c"
# 1 ""
# 1 ""
# 1 "test.c"
test.c:1:19: warning: missing terminating " character
test.c:2:17: warning: missing terminating " character
int puts(const char *);
int main() {
puts(" hello world ");
}
Which, despite the warnings, produces completely valid code (in the bolded text)!
If, macros in C are simply a textual replace, why is it that my initial example would fail? Is this a compiler bug? If not, where in the standards does it have information pertaining to this scenario?
Note: I am not looking for how to make my initial snippet compile. I am simply looking for info on why this scenario fails.
The problem is that even though the code expands to " hello, world ", it's not being recognized as a single string literal token by the preprocessor; instead, it's being recognized as the (invalid) sequence of tokens ", hello, ,, world, ".
N1570:
6.4 Lexical elements
...
3 A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories.69) If a ' or a " character matches the last category, the behavior is
undefined. Preprocessing tokens can be separated by white space; this consists of
comments (described later), or white-space characters (space, horizontal tab, new-line,
vertical tab, and form-feed), or both. As described in 6.10, in certain circumstances
during translation phase 4, white space (or the absence thereof) serves as more than
preprocessing token separation. White space may appear within a preprocessing token
only as part of a header name or between the quotation characters in a character constant
or string literal.
69) An additional category, placemarkers, is used internally in translation phase 4 (see 6.10.3.3); it cannot
occur in source files.
Note that neither ' nor " are punctuators under this definition.
The preprocessor runs in multiple phases. Phase 3, tokenization, occurs before expansion, so preprocessor macros must represent full tokens. In your case, STR_START and STR_END are tokenized and then substituted, which makes those tokens invalid.
Here
#define STR_START "
compiler expects string literal. String literal must end with closing quote. That's why compiler complains about missing terminating " character.
After macro expansion compiler complains again, because invalid tokens.
For example, MSVC compiler complains in other words:
error C2001: newline in constant
and after expansion it complains about missing quotes.