C lexer: token concatenation of unterminated string literals

Consider the following C code:
#include <stdio.h>
#define PRE(a) " ## a
int main() {
printf("%s\n", PRE("));
return 0;
}
If we adhere strictly to the tokenization rules of C99, I would expect it to break up as:
...
[#] [define] [PRE] [(] [a] [)] ["]* [##] [a]
...
[printf] [(] ["%s\n"] [,] [PRE] [(] ["]* [)] [)] [;]
...
* A single non-whitespace character that does not match any preprocessing-token pattern
And thus, after running preprocessing directives, the printf line should become:
printf("%s\n", "");
And parse normally. But instead, it throws an error when compiled with gcc, even when using the flags -std=c99 -pedantic. What am I missing?

From C11 6.4. Lexical elements:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
3 [...] The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string
literals, punctuators, and single non-white-space characters that do
not lexically match the other preprocessing token categories.69) If a
' or a " character matches the last category, the behavior is
undefined. [...]
So if " is not part of a string-literal, but is a non-white-space character, the behavior is undefined. I do not know why it's undefined and not a hard error - I think it's to allow compilers to parse multiline string literals.
it throws an error
But on godbolt:
<source>:3:16: warning: missing terminating " character
3 | #define PRE(a) " ## a
| ^
<source>:6:24: warning: missing terminating " character
6 | printf("%s\n", PRE("));
| ^
<source>:8: error: unterminated argument list invoking macro "PRE"
8 | }
|
<source>: In function 'int main()':
<source>:6:20: error: 'PRE' was not declared in this scope
6 | printf("%s\n", PRE("));
| ^~~
<source>:6:20: error: expected '}' at end of input
<source>:5:12: note: to match this '{'
5 | int main() {
| ^
It throws an error not on the #define PRE line (it could), but on the PRE(") line. Tokens are recognized before macro substitution (translation phase 3 vs. phase 4), so whatever you do, you cannot "create" new string-literal tokens as a result of macro substitution, for example by gluing two macros together as you are trying to do. Note that -pedantic will not turn the warning into an error: -pedantic produces errors where the standard requires one, but here the standard says the behavior is undefined, so no error is required.


Is double quote (") a preprocessing-token or an unterminated string literal?
C11, 6.4 Lexical elements, Syntax, 1:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
C11, 6.4.5 String literals, Syntax, 1:
string-literal:
encoding-prefix(opt) " s-char-sequence(opt) "
Note: GCC considers it to be an unterminated string literal:
#if 0
"
#endif
produces:
warning: missing terminating " character
C 2018 6.4.1 3 says one category of preprocessing tokens is “single non-white-space characters that do not lexically match the other preprocessing token categories”, and the next sentence says “If a ’ or a " character matches the last category, the behavior is undefined.” Thus, if an isolated " appears (one that is not paired with another " with an s-char-sequence between them), it fails to match the lexical form of a string literal and is parsed as a single non-white-space character that does not match the other categories, and the behavior is not defined by the standard. This explains the GCC message.
(I note that 6.10.2 3 describes # include directives with “" q-char-sequence " new-line”. However, the earlier grammar in 6.10 1 describes the directives as “# include pp-tokens new-line”. My interpretation of this is that the directive is parsed as having preprocessor token(s), notably a string literal, and 6.10.2 3 says that if that string literal has the form shown as “" q-char-sequence "”, then it is a # include directive of the type that paragraph is discussing.)

How does backslash \ join printf strings written on separate lines in C?

Using Dev-C++ I was having some fun with C and got this:
#include<stdio.h>
main()
{
printf("Hello
world" );
}
Here I thought the output would be like "Hello (with spaces) World", but instead I got these errors:
C:\Users\ASUS\Documents\Dev C++ Programs\helloWorldDk.c In function 'main':
5 10 C:\Users\ASUS\Documents\Dev C++ Programs\helloWorldDk.c [Warning] missing terminating " character
5 3 C:\Users\ASUS\Documents\Dev C++ Programs\helloWorldDk.c [Error] missing terminating " character
6 8 C:\Users\ASUS\Documents\Dev C++ Programs\helloWorldDk.c [Warning] missing terminating " character
6 1 C:\Users\ASUS\Documents\Dev C++ Programs\helloWorldDk.c [Error] missing terminating " character
6 1 C:\Users\ASUS\Documents\Dev C++ Programs\helloWorldDk.c [Error] 'world' undeclared (first use in this function)
6 1 C:\Users\ASUS\Documents\Dev C++ Programs\helloWorldDk.c [Note] each undeclared identifier is reported only once for each function it appears in
7 1 C:\Users\ASUS\Documents\Dev C++ Programs\helloWorldDk.c [Error] expected ')' before '}' token
7 1 C:\Users\ASUS\Documents\Dev C++ Programs\helloWorldDk.c [Error] expected ';' before '}' token
But when I added a \ it worked:
#include<stdio.h>
main()
{
printf("Hello \
World" );
}
Without any warnings and errors.
What magic of '\' is this?
And if any other sorcery like this exists, please let me know.
The backslash has many special meanings, e.g. in escape sequences that represent special characters.
But the special meaning you found is that of \ immediately followed by a newline, which means "delete me and the newline". For the compiler this solves the problem of encountering a newline in the middle of a string.
Adjacent string literals are also concatenated (in translation phase 6), so one could instead have written:
#include <stdio.h>
int main(void) {
printf("Hello\n"
"World\n");
return 0;
}
Arguably nicer syntax for long strings. Note that the implementation's limit on string literal length still applies. From a theoretical point of view, the C preprocessor is a language unto itself; see a discussion on Turing-completeness. For a practical example, X-macros are very useful in some cases.

Macro function for printing with UB

I'm learning how to use macro functions and now faced some (most likely undefined) behavior. Here is an example:
#include <stdio.h>
#include <stdio.h>
#define FOO(a, b) { \
printf("%s%s\n", #a #b); \
}
int main(int argc, char * argv[]){
{ printf("%s%s\n", 1 2); } //compile error
FOO(1, 2); //prints 12 with some garbage
}
I'm most likely experiencing UB, but digging into N1570 did not give a clear explanation of this. The closest thing I found was 5.1.1.2(p4):
Preprocessing directives are executed, macro invocations are expanded,
and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal character name is
produced by token concatenation (6.10.3.3), the behavior is undefined.
Probably tokens "1" "2" were concatenated yielding UB, but I'm not sure.
Probably tokens "1" "2" were concatenated yielding UB, but I'm not sure.
You are correct.
"1" and "2" became "12", and went to the first %s in printf(). Then, the second %s has nothing to process, thus the garbage values.
The compiler warnings agree too (of course):
prog.cc:4:12: warning: format '%s' expects a matching 'char*' argument [-Wformat=]
4 | printf("%s%s\n", #a #b); \
| ^~~~~~~~
prog.cc:9:5: note: in expansion of macro 'FOO'
9 | FOO(1, 2); //prints 12 with some garbage
| ^~~
prog.cc:4:16: note: format string is defined here
4 | printf("%s%s\n", #a #b); \
| ~^
| |
| char*
In your macro, change this:
printf("%s%s\n", #a #b);
to this:
printf("%s%s\n", #a, #b);
where the comma will do the trick, as @Blaze commented.
Note: For the hardcoded printf() call to work, you would need to make 1 and 2 strings; a comma alone would not suffice. Example: printf("%s%s\n", "1", "2");.
FOO expands to printf("%s%s\n", "1" "2"). The adjacent string literals are concatenated during translation (phase 6), yielding printf("%s%s\n", "12").
This is not a correct call to printf and invokes undefined behavior. The relevant part in the standard is this:
7.21.6.1 The fprintf function
...
2 ... If there are insufficient arguments for the format, the behavior is undefined.

Why can't we use the preprocessor to create custom-delimited strings?

I was playing around a bit with the C preprocessor, when something which seemed so simple failed:
#define STR_START "
#define STR_END "
int puts(const char *);
int main() {
puts(STR_START hello world STR_END);
}
When I compile it with gcc (note: similar errors with clang), it fails, with these errors:
$ gcc test.c
test.c:1:19: warning: missing terminating " character
test.c:2:17: warning: missing terminating " character
test.c: In function ‘main’:
test.c:7: error: missing terminating " character
test.c:7: error: ‘hello’ undeclared (first use in this function)
test.c:7: error: (Each undeclared identifier is reported only once
test.c:7: error: for each function it appears in.)
test.c:7: error: expected ‘)’ before ‘world’
test.c:7: error: missing terminating " character
Which sort of confused me, so I ran it through the pre-processor:
$ gcc -E test.c
# 1 "test.c"
# 1 ""
# 1 ""
# 1 "test.c"
test.c:1:19: warning: missing terminating " character
test.c:2:17: warning: missing terminating " character
int puts(const char *);
int main() {
puts(" hello world ");
}
Which, despite the warnings, produces completely valid code (the puts line)!
If macros in C are simply a textual replacement, why does my initial example fail? Is this a compiler bug? If not, where in the standards is this scenario covered?
Note: I am not looking for how to make my initial snippet compile. I am simply looking for info on why this scenario fails.
The problem is that even though the code expands to " hello world ", it's not being recognized as a single string-literal token by the preprocessor; instead, it's being recognized as the (invalid) sequence of tokens ", hello, world, ".
N1570:
6.4 Lexical elements
...
3 A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories.69) If a ' or a " character matches the last category, the behavior is
undefined. Preprocessing tokens can be separated by white space; this consists of
comments (described later), or white-space characters (space, horizontal tab, new-line,
vertical tab, and form-feed), or both. As described in 6.10, in certain circumstances
during translation phase 4, white space (or the absence thereof) serves as more than
preprocessing token separation. White space may appear within a preprocessing token
only as part of a header name or between the quotation characters in a character constant
or string literal.
69) An additional category, placemarkers, is used internally in translation phase 4 (see 6.10.3.3); it cannot
occur in source files.
Note that neither ' nor " are punctuators under this definition.
The preprocessor runs in multiple phases. Phase 3, tokenization, occurs before expansion (phase 4), so a macro body must already consist of valid preprocessing tokens. In your case, STR_START and STR_END are tokenized before substitution, and each body is a lone ", which does not lexically match any valid token category.
Here
#define STR_START "
the compiler expects a string literal. A string literal must end with a closing quote; that's why the compiler complains about a missing terminating " character.
After macro expansion the compiler complains again, because the resulting tokens are invalid.
For example, the MSVC compiler complains in other words:
error C2001: newline in constant
and after expansion it complains about missing quotes.

What can create a lexical error in C?

Besides not closing a comment /*..., what constitutes a lexical error in C?
Here are some:
"abc<EOF>
where EOF is the end of the file. In fact, EOF in the middle of many lexemes should produce errors:
0x<EOF>
I assume that using bad escapes in strings is illegal:
"ab\qcd"
Probably trouble with floating point exponents
1e+%
Arguably, you shouldn't have stuff at the end of a preprocessor directive:
#if x %
Basically, anything that does not conform to ISO/IEC 9899:1999, Annex A.1 "Lexical grammar", is a lexical fault, if the compiler does its lexical analysis according to this grammar. Here are some examples:
"abc<EOF> // invalid string literal (from Ira Baxter's answer) (ISO C 9899/1999 6.4.4.5)
'a<EOF> // invalid char literal (6.4.4.4)
where EOF is the end of the file.
double a = 1e*3; // malformed floating-point literal (6.4.4.2)
int a = 0x0g; // invalid integer hex literal (6.4.4.1)
int a = 09; // invalid octal literal (6.4.4.1)
char a = 'aa'; // too long char literal (from Joel's answer, 6.4.4.4)
double a = 0x1p1q; // invalid hexadecimal floating point constant (6.4.4.2)
// instead of q, only a float suffix, that is 'f', 'l', 'F' or 'L' is allowed.
// invalid header name (6.4.7)
#include <<a.h>
#include ""a.h"
Aren't [#$`] and other symbols like that (perhaps from Unicode) lexical errors in C if they appear anywhere outside a string or comment? They do not constitute any valid lexical sequence of the language. They cannot pass the lexer, because the lexer cannot recognize them as any kind of valid token. Lexers are usually FSM- or regex-based, so these symbols are simply unrecognized input.
For example in the following code there are several lexical errors:
int main(void){
` int a = 3;
# —
return 0;
}
We can support this claim by feeding the code to gcc, which gives:
../a.c: In function ‘main’:
../a.c:2: error: stray ‘`’ in program
../a.c:3: error: stray ‘#’ in program
../a.c:3: error: stray ‘\342’ in program
../a.c:3: error: stray ‘\200’ in program
../a.c:3: error: stray ‘\224’ in program
GCC is smart and does error recovery, so it parsed a function definition (it knows we are in main), but these errors definitely look like lexical errors, not syntax errors, and rightly so: GCC's lexer has no token type that can be built from these symbols. Note that it even treats the three-byte UTF-8 em dash as three unrecognized bytes.
Illegal id
int 3d = 1;
Illegal preprocessor directive
#defin x 1
Unexpected token
if [0] {}
Unresolvable id
while (x) {} // x not declared
Lexical errors:
An unterminated comment
Any sequence of non-comment and non-whitespace characters that is not a valid preprocessor token
Any preprocessor token that is not a valid C token; an example is 0xe-2, which looks like an expression but is in fact a syntax error according to the standard -- an odd corner case resulting from the rules for pp-tokens.
Badly formed float constant (e.g. 123.34e, or 123.45.33).
