Width prefixes to string constants in C

The latest version of the C standard provides for width prefixes to string constants, e.g. u8"a" is a single preprocessing token.
Does whether you get one or two preprocessing tokens depend on the exact letters in the prefix? E.g. is u9"a" still two preprocessing tokens?

C11 specifies in 6.4 that a string literal is one of the pre-processing tokens:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Hence u8"a" is a single token because the string literal section 6.4.5 lists that as a valid option:
string-literal:
encoding-prefix(opt) " s-char-sequence(opt) "
encoding-prefix:
u8
u
U
L
The sequence u9"a" is not a string literal because u9 is not one of the valid prefixes.
The u9 would (from my reading) be treated as an identifier while the "a" would be a string literal, so that would be two separate pre-processing tokens.
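To see the difference in practice, here is a small sketch (the macro name u9 is just a hypothetical identifier picked for this example): because u9"a" is two preprocessing tokens, the identifier u9 is subject to macro expansion and ordinary string concatenation then applies, whereas in C11 the single token u8"a" contains no identifier to expand.
#include <stdio.h>

/* Hypothetical macro: u9 is an ordinary identifier, so u9"a" is the two
 * preprocessing tokens u9 and "a"; the macro expands, and the adjacent
 * string literals are concatenated in translation phase 6. */
#define u9 "x"

int main(void)
{
    printf("%s\n", u9"a");   /* prints xa */
    return 0;
}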


Is double quote (") a preprocessing-token or an unterminated string literal?
C11, 6.4 Lexical elements, Syntax, 1:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
C11, 6.4.5 String literals, Syntax, 1:
string-literal:
encoding-prefix(opt) " s-char-sequence(opt) "
Note: GCC considers it to be an unterminated string literal:
#if 0
"
#endif
produces:
warning: missing terminating " character
C 2018 6.4 3 says one category of preprocessing tokens is “single non-white-space characters that do not lexically match the other preprocessing token categories”, and the next sentence says “If a ’ or a " character matches the last category, the behavior is undefined.” Thus, if an isolated " appears (one that is not paired with another " with an s-char-sequence between them), it fails to match the lexical form of a string literal and is parsed as a single non-white-space character that does not match the other categories, and the behavior is not defined by the standard. This explains the GCC message.
(I note that 6.10.2 3 describes # include directives with “" q-char-sequence " new-line”. However, the earlier grammar in 6.10 1 describes the directives as “# include pp-tokens new-line”. My interpretation of this is that the directive is parsed as having preprocessor token(s), notably a string literal, and 6.10.2 3 says that if that string literal has the form shown as “" q-char-sequence "”, then it is a # include directive of the type that paragraph is discussing.)

Does #define in C substitute the replacement text exactly, or does it add spaces before and after?

As far as my knowledge goes, the C preprocessor replaces macros with their replacement text exactly as written in the #define. But now I am seeing that it adds spaces before and after.
Is my explanation correct, or am I doing something that should give undefined behavior?
Consider the following C code:
#include <stdio.h>
#define k +-6+-
#define kk xx+k-x
int main()
{
    int x = 1029, xx = 4, t;
    printf("x=%d,xx=%d\n",x,xx);
    t=(35*kk*2)*4;
    printf("t=%d,x=%d,xx=%d\n",t,x,xx);
    return 0;
}
The initial values are: x = 1029, xx = 4. Let's calculate the value of t now.
t = (35*kk*2)*4;
t = (35*xx+k-x*2)*4; // replacing the macro kk
t = (35*xx++-6+--x*2)*4; // replacing the macro k
Now, xx is 4 (the post-increment xx++ would take effect only after its value is used), and x is pre-decremented by --x to 1028. So the calculation of the current statement:
t = (35*4-6+1028*2)*4;
t = (140-6+2056)*4;
t = 2190*4;
t = 8760;
But the output of the above code is:
x=1029,xx=4
t=8768,x=1029,xx=4
From the second line of the output, it is clear that the increments and decrements did not take place.
That means that after replacing k and kk, the expression becomes:
t = (35*xx+ +-6+- -x*2)*4;
(If it is, then the calculation is clear.)
My concern: is this standard C behavior, or just undefined behavior? Or am I doing something wrong?
The C standard specifies that the source file is analyzed and parsed into preprocessor tokens. When macro replacement occurs, a macro that is replaced is replaced with those tokens. The replacement is not literal text replacement.
C 2018 5.1.1.2 specifies translation phases (rephrasing and summarizing, not exact quotes):
1. Physical source file multibyte characters are mapped to the source character set. Trigraph sequences are replaced by single-character representations.
2. Lines continued with backslashes are merged.
3. The source file is converted from characters into preprocessing tokens and white-space characters: each sequence of characters that can be a preprocessing token is converted to a preprocessing token, and each comment becomes one space.
4. Preprocessing is performed (directives are executed and macros are expanded).
5. Source characters in character constants and string literals are converted to members of the execution character set.
6. Adjacent string literals are concatenated.
7. White-space characters are discarded. “Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.” (That quoted text is the main part of C compilation as we think of it!)
8. The program is linked to become an executable file.
So, in phase 3, the compiler recognizes that #define kk xx+k-x consists of the tokens #, define, kk, xx, +, k, -, and x. The compiler also knows there is white space between define and kk and between kk and xx, but this white space is not itself a preprocessor token.
In phase 4, when the compiler replaces kk in the source, it is doing so with these tokens. kk gets replaced by the tokens xx, +, k, -, and x, and k is in turn replaced by the tokens +, -, 6, +, and -. Combined, those form xx, +, +, -, 6, +, -, -, and x.
The tokens remain that way. They are not reanalyzed to put + and + together to form ++.
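As a small check, one can evaluate the expanded expression directly; the unary + and - operators simply bind to the operands that follow them (a sketch using the same values as the question):
#include <stdio.h>

int main(void)
{
    int x = 1029, xx = 4;
    /* The expansion of (35*kk*2)*4 is (35*xx+ +-6+- -x*2)*4, which parses as
     * 35*xx + (+(-6)) + (-(-x))*2, all multiplied by 4:
     * (140 + (-6) + 2058) * 4 = 2192 * 4 = 8768 */
    int t = (35*xx+ +-6+- -x*2)*4;
    printf("t=%d\n", t);   /* prints t=8768 */
    return 0;
}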
As @EricPostpischil says in a comprehensive answer, the C preprocessor works on tokens, not character strings, and once the input is tokenised, whitespace is no longer needed to separate adjacent tokens.
If you ask a C preprocessor to print out the processed program text, it will probably add whitespace characters where needed to separate the tokens. But that's just for your convenience; the whitespace might or might not be present, and it makes almost no difference because it has no semantic value and will be discarded before the token sequence is handed over to the compiler.
But there is a brief moment during preprocessing when you can see some whitespace, or at least an indication as to whether there was whitespace inside a token sequence, if you can pass the token sequence as an argument to a function-like macro.
Most of the time, the preprocessor does not modify tokens. The tokens it receives are what it outputs, although not necessarily in the same order and not necessarily all of them. But there are two exceptions, involving the two preprocessor operators # (stringify) and ## (token concatenation). The first of these transforms a macro argument -- a possibly empty sequence of tokens -- into a string literal, and when it does so it needs to consider the presence or absence of whitespace in the token sequence.
(The token concatenation operator combines two tokens into a single token if possible; when it does so, intervening whitespace is ignored. That operator is not relevant here.)
The C standard actually specifies precisely how whitespace in a macro argument is handled if the argument is stringified, in paragraph 2 of §6.10.3.2:
Each occurrence of white space between the argument’s preprocessing tokens becomes a single space character in the character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted.
We can see this effect in action:
#include <stdio.h>

/* I is just used to eliminate whitespace between two macro invocations.
 * The indirection of `STRING/STRING_` is explained in many SO answers;
 * it's necessary in order that the stringify operator apply to the expanded
 * macro argument, rather than the literal argument.
 */
#define I(x) x
#define STRING_(x) #x
#define STRING(x) STRING_(x)
#define PLUS +
int main(void) {
    printf("%s\n", STRING(I(PLUS)I(PLUS)));
    printf("%s\n", STRING(I(PLUS) I(PLUS)));
}
The output of this program is:
++
+ +
showing that the whitespace in the second invocation was preserved.
Contrast the above with gcc's -E output for ordinary use of the macro:
int main(void) {
    (void) I(PLUS)I(PLUS)3;
    (void) I(PLUS) I(PLUS)3;
}
The macro expansion is
int main(void) {
    (void) + +3;
    (void) + +3;
}
showing that the preprocessor was forced to insert a cosmetic space into the first expansion, in order to preserve the semantics of the macro expansion. (Again, I emphasize that the -E output is not what the preprocessor module passes to the compiler, in normal GCC operation. Internally, it passes a token sequence. All of the whitespace in the -E output above is a courtesy which makes the generated file more useful.)

How does C recognize macro tokens as arguments

#define function(in) in+1
int main(void)
{
    printf("%d\n",function(1));
    return 0;
}
The above is correctly preprocessed to:
int main(void)
{
    printf("%d\n",1+1);
    return 0;
}
However, if the macro body in+1 is changed to in_1, the preprocessor will not do the argument replacement and will end up with this:
printf("%d\n",in_1);
What is the list of tokens at which the preprocessor can correctly separate the macro body and insert the argument (like the + sign)?
Short answer: the replacement done by the preprocessor is not simple text substitution. In your case, the macro parameter is an identifier, and it can only replace tokens in the body that are identical to that identifier.
The related form of preprocessing is
#define identifier(identifier-list) token-sequence
In order for the replacement to take place, an identifier from the identifier-list must appear in the token-sequence as an identical token, according to C's tokenization rule (the rule for parsing the character stream into tokens).
If you agree with the fact that
in C, in and in_1 are two different identifiers (and C cannot relate one to the other), while
in+1 is not an identifier but a sequence of three tokens:
(1) identifier in,
(2) operator +, and
(3) integer constant 1,
then your question is clear: in and in_1 are just two identifiers between which C does not see any relationship, and cannot do the replacement as you wish.
Reference 1: In C, there are six classes of tokens:
(1) identifiers (e.g. in)
(2) keywords (e.g. if)
(3) constants (e.g. 1)
(4) string literals (e.g. "Hello")
(5) operators (e.g. +)
(6) other separators (e.g. ;)
Reference 2: In C, an identifier is a sequence of letters (including _) and digits (the first one cannot be a digit).
Reference 3: The tokenization rule:
... the next token is the longest string of characters that could constitute a token.
This is to say, when reading in+1, the compiler will read all the way to +, and knows that in is an identifier. But in the case of in_1, the compiler will read all the way to the white space after it, and deems in_1 as an identifier.
All references are from the Reference Manual in K&R's The C Programming Language. The language has evolved since, but those definitions capture the essence.
See the C11 standard, section 6.4, for the tokenization grammar.
The relevant token type here is identifier, which is defined as any sequence of letters or digits that doesn't start with a digit; an identifier may also contain universal character names (\u escapes) and other implementation-defined characters.
Due to the "maximal munch" principle of tokenization, character sequence in+1 is tokenized as in + 1 (not i n + 1).
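A small sketch of maximal munch at work, outside of any macro:
#include <stdio.h>

int main(void)
{
    int a = 1, b = 2;
    /* "a+++b" is tokenized as a ++ + b (each token is the longest possible
     * match), so this is (a++) + b, not a + (++b). */
    int c = a+++b;
    printf("%d %d %d\n", c, a, b);   /* prints 3 2 2 */
    return 0;
}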
If you want in_1, use two hashes (the ## token-pasting operator)...
#define function(in) in ## _1
So...
function(dave) --> dave_1
function(phil) --> phil_1
And for completeness, you can also use a single hash to turn the arg into a text string.
#define function(in) printf(#in "\n");
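A minimal sketch putting both operators together (the macro names MAKE_VAR and SHOW_NAME are made up for this example):
#include <stdio.h>

#define MAKE_VAR(in)  in ## _1           /* paste: dave -> dave_1 */
#define SHOW_NAME(in) printf(#in "\n")   /* stringify: dave -> "dave" */

int main(void)
{
    int dave_1 = 42;
    printf("%d\n", MAKE_VAR(dave));   /* prints 42 */
    SHOW_NAME(dave);                  /* prints dave */
    return 0;
}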

When is whitespace significant in translation phases 5 and 6 in the C language?

To recap, phases 5-7 are described in the standard:
5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
Now I agree that white-space characters are no longer significant at phase 7, but couldn't one get rid of them already after phase 4? Is there an example where this would make a difference?
Of course, it should be realized that removing white-space characters separating tokens doesn't quite make sense at this stage, as the data after phase 4 already consists of preprocessing tokens; the idea is to get rid of the spaces separating preprocessing tokens at an earlier stage.
Consider this source code
char* str = "some text" " with spaces";
After tokenization (phase 3), this consists of the following preprocessing tokens (one token per line):
char
*
str
=
"some text"
" with spaces"
Here the spaces inside "some text" and " with spaces" matter.
Afterwards, all spaces between the tokens (see above) are ignored.
If you removed the whitespace before phase 5, you would get different string literals, like "sometext".
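A small sketch making the concatenation concrete: the space between the two string-literal tokens is irrelevant, but the spaces inside each literal are part of the tokens themselves and survive phase 6.
#include <stdio.h>

int main(void)
{
    /* Phase 6 concatenates the adjacent literals into one array;
     * the result is "some text with spaces". */
    char* str = "some text" " with spaces";
    printf("%s\n", str);
    return 0;
}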

Mixing wide and narrow string literals in C

Just found out that all of the following work:
printf( "%ls\n", "123" L"456" );
printf( "%ls\n", L"123" "456" );
printf( "%ls\n", L"123" L"456" );
The output is
123456
123456
123456
Why can I freely mix and match wide and narrow string literals to get a wide string literal as a result? Is that a documented behavior?
Is that a documented behavior?
Yes, this behavior is supported by the standard; section 6.4.5 String literals, paragraph 4 of the C99 draft standard says (emphasis mine):
In translation phase 6, the multibyte character sequences specified by any sequence of adjacent character and wide string literal tokens are concatenated into a single multibyte character sequence. If any of the tokens are wide string literal tokens, the resulting multibyte character sequence is treated as a wide string literal; otherwise, it is treated as a character string literal.
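A short sketch of the rule in action: because one operand of the concatenation is a wide literal, the whole result is a wide string literal (an array of wchar_t) and therefore matches %ls.
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* "123" L"456" concatenates to a wide string literal, so it can be
     * stored in a const wchar_t * and printed with %ls. */
    const wchar_t *s = "123" L"456";
    printf("%ls\n", s);   /* prints 123456 */
    return 0;
}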
