Mixing wide and narrow string literals in C

Just found out that all of the following work:
printf( "%ls\n", "123" L"456" );
printf( "%ls\n", L"123" "456" );
printf( "%ls\n", L"123" L"456" );
The output is
123456
123456
123456
Why can I freely mix and match wide and narrow string literals to get a wide string literal as a result? Is that a documented behavior?

Is that a documented behavior?
Yes, this behavior is supported by the standard. Section 6.4.5 String literals, paragraph 4 of the C99 draft standard says (emphasis mine):
In translation phase 6, the multibyte character sequences specified by any sequence of
adjacent character and wide string literal tokens are concatenated into a single multibyte
character sequence. If any of the tokens are wide string literal tokens, the resulting
multibyte character sequence is treated as a wide string literal; otherwise, it is treated as a character string literal.


Related

Why might a string literal not be a string?

I'm struggling with this part in the C standard about string literals, especially the second part of it:
"In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. 80)"
"80) A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
Source: ISO/IEC 9899:2018 (C18), §6.4.5/6, Page 51
I don't understand the explanation - "because a null character can be embedded in it by a \0 escape sequence.".
Looking at the referenced section §7.1.1 for the definition of a "string", it states:
"A string is a contiguous sequence of characters terminated by and including the first null character."
Source: ISO/IEC 9899:2018 (C18), §7.1.1/1, Page 132
I thought the focus might lie on the "can", in the sense that a string literal does not have to include/embed the null character, while a string is required to.
But then again I'm asking myself: how can one use a string literal as a string if it has no string-terminating null character in it to mark the end of the string (as string-handling functions require)?
I'm totally drawing a blank at the moment.
Note: I'm aware that a string literal is stored in read-only memory and can't be modified, and that a string is a generic term for a sequence of characters terminated by NUL, which may or may not be mutable.
Thus, my question is not: "What is the difference between a string and a string literal?"
My question is:
Why/how can a string literal not be a string?
and, following from my concerns so far:
Is it true that a string literal can have the NUL byte omitted?
I was about to ask this question myself, but shortly before posting it I figured it out. My confusion was caused by the slightly misleading wording of the quote.
But I decided not to delete the draft of the question, as it could be useful for future readers, and to provide a Q&A instead.
Feel free to comment and hint.
Related stuff:
What is the difference between char s[] and char *s?
What is the type of string literals in C and C++?
Are string literals const?
"Life-time" of a string literal in C
You're overthinking it.
"A string is a contiguous sequence of characters terminated by and including the first null character."
Source: ISO/IEC 9899:2018 (C18), §7.1.1/1, Page 132
says that a "string" only extends up to the first null character. Characters that may exist after the null are not part of the string. However
"80) A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
makes it clear a string literal may contain an embedded null. If it does, the string literal AS A WHOLE is not a string -- the string is just the prefix of the string literal up to and including the first null.
Let's take a look at the definition of the term "string literal" in the same section of C18, §6.4.5/3:
"A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz"."
According to that, a string literal consists only of the characters enclosed in quotation marks, the bare string content. It does not have an appended \0. The NUL byte is appended later during translation, as stated in §6.4.5/6:
"In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. 80)"
Let's look at an example:
"foo" is a string literal, but not a string because "foo" does not contain an embedded null character.
"foo\0" is a string literal and a string because the literal itself contains a null character at the end of the character sequence.
Note that you don't need to explicitly insert the null character at the end of a string literal to turn it into a string. As already said, it is implicitly appended during program translation.
This means that
const char *s = "foo";
yields the same string as
const char *s = "foo\0";
(the second array merely carries one extra trailing null byte).
I admit that the sentence:
"A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
is a little confusing and illogical in the context. It would be better phrased:
"A string literal might not be a string (see 7.1.1), because a null character might not (OR is not required to) be embedded in it by a \0 escape sequence."
or alternatively:
"A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
As @EricPostpischil pointed out in his comment, the meaning of the footnote is probably quite different.
It means that if the string literal contains a null character inside it, rather than only at the end as a string requires, the string literal is not equivalent to a string.
For example:
The string literal
"foo\0bar"
is not a string, as its first null character is embedded inside the string literal rather than at the end of it.

character string literal and string literal in standard?

I am confused by these four terms:
character string literal
character constants
string literal
multibyte character sequence
And reading this quote in the C standard:
A character string literal need not be a string (see 7.1.1), because
a null character may be embedded in it by a \0 escape sequence.
What is meant by the first part?
A string-literal is
either a character string literal, e.g. "abc";
or UTF-8 string literal, e.g. u8"abc";
or wide string literal, e.g. L"abc".
From the standard (emphasis mine):
A character string literal is a sequence of zero or more multibyte characters enclosed in
double-quotes, as in "xyz". A UTF-8 string literal is the same, except prefixed by u8.
A wide string literal is the same, except prefixed by the letter L, u, or U.....
In translation phase 7, a byte or code of value zero is appended to each multibyte
character sequence that results from a string literal or literals. 78)
78) A string literal need not be a string (see 7.1.1), because a null character may be embedded in it by a
\0 escape sequence.
A string is a contiguous sequence of characters terminated by and including the first null
character.
So a string literal may have \0 also in the middle or even at the beginning, for instance "a\0b" or "\0ab". I think this is what the footnote is saying.
A character constant is a c-char-sequence (usually a single character) in single quotes, with a possible prefix L/u/U.
An integer character constant is a sequence of one or more multibyte characters enclosed
in single-quotes, as in 'x'. A wide character constant is the same, except prefixed by the
letter L, u, or U.
So the terminology is not very symmetric, IMO. E.g. a wide character constant is a particular case of a character constant. However, both character string literals and wide string literals are string literals.

When is whitespace significant in translation phases 5 and 6 in the C language?

To recap, phases 5-7 are described in the standard:
5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
Now I agree that white-space characters are no longer significant at phase 7, but couldn't one get rid of them already after phase 4? Is there an example where this would make a difference?
Of course it should be realized that removing white-space characters separating tokens doesn't work at this stage as the data after phase 4 consists of preprocessing tokens. The idea is to get rid of spaces separating preprocessing tokens at an earlier stage.
Consider this source code
char* str = "some text" " with spaces";
In phase 5 this is converted to these tokens (one token per line):
char
*
str
=
"some text"
" with spaces"
Here the spaces inside "some text" and " with spaces" matter.
Afterwards, all spaces between tokens (see above) are ignored.
If you remove whitespace before phase 5, you get different string literals, such as "sometext".

Width prefixes to string constants

The latest version of the C standard provides for width prefixes to string constants e.g. u8"a" is a single preprocessing token.
Does whether you get one or two preprocessing tokens depend on the exact letters in the prefix? E.g. is it the case that u9"a" is still two preprocessing tokens?
C11 specifies in 6.4 that a string literal is one of the pre-processing tokens:
preprocessing-token:
    header-name
    identifier
    pp-number
    character-constant
    string-literal
    punctuator
    each non-white-space character that cannot be one of the above
Hence u8"a" is a single token because the string literal section 6.4.5 lists that as a valid option:
string-literal:
    encoding-prefix(opt) " s-char-sequence(opt) "
encoding-prefix:
    u8
    u
    U
    L
The sequence u9"a" is not a string literal because u9 is not one of the valid prefixes.
The u9 would (from my reading) be treated as an identifier while the "a" would be a string literal, so that would be two separate pre-processing tokens.

Why is the compiler treating a character as an integer?

I have a small snippet of code.
When I run this on my DevC++ gnu compiler it shows the following output:
#include <stdio.h>

int main(void)
{
    char b = 'a';
    printf("%zu,", sizeof('a'));  /* size of the character constant */
    printf("%zu", sizeof(b));     /* size of the char object */
    return 0;
}
OUTPUT: 4,1
Why is 'a' treated as an integer, whereas b is treated as only a character?
Because character literals are of type int and not char in C.
So sizeof 'a' == sizeof (int).
Note that in C++, a character literal is of type char and so sizeof 'a' == sizeof (char).
That's just the way it is in C. That's just how the language was originally defined. As for why... Back then virtually everything in C was an int, unless there was a very good reason to make it something else. So, historically character constants in C have type int.
Note BTW, in C nomenclature 'a' is called constant, not literal. C has string literals and no other literals.
In C, a character literal has type int.
In C++, a character literal that contains only one character has type char, which is an integral type.
In both C and C++, a wide character literal has type wchar_t, and a multicharacter literal has type int.
From IBM XL C/C++ documentation
A character literal contains a sequence of characters or escape
sequences enclosed in single quotation mark symbols, for example 'c'.
A character literal may be prefixed with the letter L, for example
L'c'. A character literal without the L prefix is an ordinary
character literal or a narrow character literal. A character literal
with the L prefix is a wide character literal. An ordinary character
literal that contains more than one character or escape sequence
(excluding single quotes ('), backslashes (\) or new-line characters)
is a multicharacter literal.
Character literals have the following form:
            .---------------------.
            V                     |
>>-+---+--'----+-character-------+-+--'------------------------><
   '-L-'       '-escape_sequence-'
At least one character or escape sequence must appear in the character
literal. The characters can be from the source program character set,
excluding the single quotation mark, backslash and new-line symbols. A
character literal must appear on a single logical source line.
C: A character literal has type int.

Resources