I am confused by these four terms:
character string literal
character constant
string literal
multibyte character sequence
And by this quote from the C Standard:
A character string literal need not be a string (see 7.1.1), because
a null character may be embedded in it by a \0 escape sequence.
What is meant by the first part?
A string-literal is
either a character string literal, e.g. "abc";
or UTF-8 string literal, e.g. u8"abc";
or wide string literal, e.g. L"abc".
From the standard (emphasis mine):
A character string literal is a sequence of zero or more multibyte characters enclosed in
double-quotes, as in "xyz". A UTF−8 string literal is the same, except prefixed by u8.
A wide string literal is the same, except prefixed by the letter L, u, or U.....
In translation phase 7, a byte or code of value zero is appended to each multibyte
character sequence that results from a string literal or literals. 78)
78) A string literal need not be a string (see 7.1.1), because a null character may be embedded in it by a
\0 escape sequence.
A string is a contiguous sequence of characters terminated by and including the first null
character.
So a string literal may have \0 also in the middle or even at the beginning, for instance "a\0b" or "\0ab". I think this is what the footnote is saying.
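To see the footnote's point concretely, here is a small sketch that compares the size of the array a literal creates with the length of the string found at its start. The function and variable names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* The array for "a\0b" holds 'a', '\0', 'b' and the appended '\0'. */
static const char embedded[] = "a\0b";
/* The array for "\0ab" starts with a null character. */
static const char leading[] = "\0ab";

/* Number of bytes in the array, counting every null character. */
size_t embedded_array_size(void) { return sizeof embedded; }

/* Length of the string that starts at the first element:
   it ends at the first null character. */
size_t embedded_string_length(void) { return strlen(embedded); }
size_t leading_string_length(void)  { return strlen(leading); }
```

With these definitions the array for "a\0b" has 4 bytes while the string at its start has length 1, and the string at the start of "\0ab" is empty.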
A character constant is a c-char-sequence (usually a single character) in single quotes, with a possible prefix L/u/U.
An integer character constant is a sequence of one or more multibyte characters enclosed
in single-quotes, as in 'x'. A wide character constant is the same, except prefixed by the
letter L, u, or U.
So the terminology is not very symmetric, IMO. E.g. wide character constant is a particular case of character constant. However both character string literal and wide string literal belong to string literals.
In C (and similar languages), a string literal is written, for example, as "abc". Another example is "ab\"c". I have a file that contains such strings. That is, the file contents are "abc" or "ab\"c" etc. Any string literal that can be written in a .c file can appear in the file I'm reading.
These strings can be malformed. E.g. "abc (no closing quotes). What is the best way to write a parser to make sure the string in the file is a valid C literal string? (so that if I copy the file contents and paste them after char* str =, the resulting expression will be accepted by the compiler when at the top of a function)
The strings are each in a separate line.
Alternatively, you can think of this as wanting to parse lines that declare string variables initialized from literals. Imagine I'm grepping a big file with char\* .* = (.*);$ and want to make sure the part in the parentheses will not cause compilation errors.
The grammar for C string literals is given in C 2018 6.4.5. Supposing you want to parse only plain strings, not those with encoding prefixes such as u in u"xyz", then the grammar for a string-literal is " s-char-sequence(opt) ", where “opt” means optional and s-char-sequence is one or more s-chars. An s-char is either any member of the source character set except ", \, or the new-line character, or an escape-sequence.
The source character set includes at least the Latin alphabet (26 letters A-Z) in uppercase and lowercase, the ten digits, space, horizontal tab, vertical tab, form feed, and these characters:
! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
However, a C implementation may include other characters in its source character set. Therefore, any character found in the string other than ", \, or the new-line character must be accepted as potentially valid in some C implementation.
An escape-sequence is defined in 6.4.4.4 paragraph 1 to be one of:
\ followed by ', ", ?, \, a, b, f, n, r, t, or v,
\ followed by one to three octal digits, or
\x followed by one or more hexadecimal digits, or
a universal-character-name.
Paragraph 7 says:
Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.
A universal-character-name is defined in 6.4.3 to be \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits. Paragraph 2 limits these:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
This part of the C grammar looks fairly simple to parse:
A string literal must start with a ".
If the next character is anything other than ", \, or a new-line character, then accept it.
If the next character is \ and it is followed by one of the single characters listed above, accept it and the following character.
If the next character is \ and it is followed by one to three octal digits, accept it and up to three octal digits.
If the next two characters are \x and are followed by a hexadecimal digit, accept them and all the hexadecimal digits that follow.
If the next two characters are \u and are followed by four hexadecimal digits, accept those six characters. However, if the value is one of those prohibited in the constraint above, this is not a valid C string literal.
If the next two characters are \U and are followed by eight hexadecimal digits, accept those ten characters. However, if the value is one of those prohibited in the constraint above, this is not a valid C string literal.
Repeat the above until the next character is not accepted.
If the next character is not ", this is not a valid C string literal.
If the next character is ", accept it.
If that is the end of the line read from the file, it is a valid C string literal. Otherwise, it is not.
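The steps above can be sketched in C roughly as follows. This is only an illustration under some assumptions: the names is_c_string_literal and hex_value are made up, the line is assumed to be a null-terminated buffer with at most one trailing '\n' (as fgets would return it), and the sketch does not check other constraints such as whether an octal or hexadecimal escape fits in an unsigned char.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(int c)
{
    static const char digits[] = "0123456789abcdef";
    const char *p;

    if (c == '\0')
        return -1;
    p = strchr(digits, tolower((unsigned char)c));
    return p != NULL ? (int)(p - digits) : -1;
}

/* Returns true if line holds exactly one plain (unprefixed) string literal. */
bool is_c_string_literal(const char *line)
{
    const char *p = line;

    if (*p++ != '"')
        return false;                      /* must start with a quote */

    while (*p != '"') {
        if (*p == '\0' || *p == '\n')
            return false;                  /* no closing quote */
        if (*p != '\\') {
            p++;                           /* ordinary s-char */
            continue;
        }
        p++;                               /* skip the backslash */
        if (*p != '\0' && strchr("'\"?\\abfnrtv", *p) != NULL) {
            p++;                           /* simple escape */
        } else if (*p >= '0' && *p <= '7') {
            int n = 0;                     /* one to three octal digits */
            while (n < 3 && *p >= '0' && *p <= '7') {
                p++;
                n++;
            }
        } else if (*p == 'x') {
            p++;
            if (hex_value(*p) < 0)
                return false;              /* \x needs at least one digit */
            while (hex_value(*p) >= 0)
                p++;
        } else if (*p == 'u' || *p == 'U') {
            int n = (*p == 'u') ? 4 : 8;   /* exact digit count */
            unsigned long v = 0;
            p++;
            for (int i = 0; i < n; i++, p++) {
                int d = hex_value(*p);
                if (d < 0)
                    return false;
                v = v * 16 + (unsigned long)d;
            }
            /* Constraint from 6.4.3 paragraph 2. */
            if ((v < 0xA0 && v != 0x24 && v != 0x40 && v != 0x60) ||
                (v >= 0xD800 && v <= 0xDFFF))
                return false;
        } else {
            return false;                  /* unknown escape such as \c */
        }
    }

    p++;                                   /* accept the closing quote */
    return *p == '\0' || (*p == '\n' && p[1] == '\0');
}
```

For instance, is_c_string_literal("\"abc\"") should be true, while the unterminated "abc and a literal containing \c should both be rejected.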
I'm struggling with this part in the C standard about string literals, especially the second part of it:
"In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. 80)"
"80) A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
Source: ISO/IEC 9899:2018 (C18), §6.4.5/6, Page 51
I don't understand the explanation - "because a null character can be embedded in it by a \0 escape sequence.".
Looking at the referenced section §7.1.1 for the definition of a "string", it states:
"A string is a contiguous sequence of characters terminated by and including the first null character."
Source: ISO/IEC 9899:2018 (C18), §7.1.1/1, Page 132
I thought the emphasis might lie on the "can": that a string literal does not have to embed a null character, while a string is required to have one.
But then I ask myself: how can a string literal be used as a string if it has no terminating null character to mark the end of the string (as string-handling functions require)?
I'm totally drawing a blank at the moment.
Note: I'm aware that a string literal is stored in read-only memory and can't be modified, and that "string" is a generic term for a sequence of characters terminated by NUL, which may or may not be mutable.
Thus, my question is not: "What is the difference between a string and a string literal?"
My Question is:
Why/How can a string-literal not be a string?
and, following from my concerns so far:
Is it true that a string literal can have the NUL byte omitted?
I wanted to ask this question myself, but shortly before posting it I found the answer. My confusion came from the slightly misleading wording of the quote.
But I decided not to delete the question's draft, as it could be useful for future readers, and to provide a Q&A instead.
Feel free to comment and hint.
Related stuff:
What is the difference between char s[] and char *s?
What is the type of string literals in C and C++?
Are string literals const?
"Life-time" of a string literal in C
You're overthinking it.
"A string is a contiguous sequence of characters terminated by and including the first null character."
Source: ISO/IEC 9899:2018 (C18), §7.1.1/1, Page 132
says that a "string" only extends up to the first null character. Characters that may exist after the null are not part of the string. However,
"80) A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
makes it clear that a string literal may contain an embedded null. If it does, the string literal AS A WHOLE is not a string -- the string is just the prefix of the string literal up to the first null.
Let's take a look at the definition of the term "string literal" in the same section, C18 §6.4.5/3:
"A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz"."
According to that, a string literal consists only of the characters enclosed in quotation marks, the bare string content. It does not have an appended \0. The NUL byte is appended later, during translation, as stated in §6.4.5/6:
"In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. 80)"
Let's take an example:
"foo" is a string literal, but not a string because "foo" does not contain an embedded null character.
"foo\0" is a string literal and a string because the literal itself contains a null character at the end of the character sequence.
Note that you don't need to explicitly insert the null character at the end of a string literal to turn it into a string. As already said, it is implicitly appended during program translation.
That means,
const char *s = "foo";
is equal to
const char *s = "foo\0";
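As a rough check of how far that equivalence goes, the following sketch (function names are invented for illustration) compares the two literals both as arrays and as strings. Used as strings they behave the same, but the array created for "foo\0" is one byte larger, because the terminator is appended in addition to the explicit \0.

```c
#include <stddef.h>
#include <string.h>

/* sizeof on a string literal gives the size of the array it creates,
   including the null character appended in translation phase 7. */
size_t size_of_foo(void)           { return sizeof "foo"; }   /* f o o \0 */
size_t size_of_foo_with_null(void) { return sizeof "foo\0"; } /* f o o \0 \0 */

/* Compared as strings, both end at the first null character. */
int same_as_strings(void) { return strcmp("foo", "foo\0") == 0; }
```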
I admit that the sentence:
"A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
is a little confusing and illogical in the context. It would be better phrased:
"A string literal might not be a string (see 7.1.1), because a null character might not (OR is not required to) be embedded in it by a \0 escape sequence."
or alternatively:
"A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
As @EricPostpischil pointed out in his comment, the meaning of the footnote is probably quite different.
It means that if the string literal contains a null character inside it, rather than only at the end as a string requires, the string literal is not equivalent to a string.
For example:
The string literal
"foo\0bar"
is not a string, as it contains the first null character embedded inside of the string literal, but not at the end of it.
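One way to picture this, as a sketch only: the array created for such a literal holds several strings stored back to back. The helper below (count_strings is a made-up name) walks an array whose last element is a null character and counts them.

```c
#include <stddef.h>
#include <string.h>

/* Counts the null-terminated strings stored back to back in
   buf[0..size-1]. Assumes buf[size-1] is a null character, which holds
   for the array created by any string literal. */
size_t count_strings(const char *buf, size_t size)
{
    size_t count = 0;

    /* Jump from the start of one string to just past its terminator. */
    for (size_t i = 0; i < size; i += strlen(buf + i) + 1)
        count++;
    return count;
}
```

Called as count_strings("foo\0bar", sizeof "foo\0bar"), it finds two strings, "foo" and "bar", whereas for "foo" it finds one.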
If I write putchar('\\t'); while trying to print "\t" instead of an actual tab, I get the multi character constant warning. On the other hand, if I write putchar('\\'); I get no warning. Upon looking in the ASCII table, there is no character '\\', only '\'. So why is there no warning? Why is '\\' one character but '\\t' is more than one? Can a backslash only be used to escape one following character?
You cannot print \ and t with one putchar invocation, since putchar writes exactly one character to the standard output. Use two calls:
putchar('\\');
putchar('t');
Another option would be to use the string "\\t" with fputs:
fputs("\\t", stdout);
There is no warning for '\\' because that is how you write the character constant for the character \. On ASCII this is synonymous with '\134' and '\x5c'.
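A quick way to convince yourself, sketched with made-up function names and assuming an ASCII execution character set: each of the three constants below denotes one character with the same value, and the string literal "\\t" contains exactly two characters.

```c
#include <string.h>

/* Three spellings of a single character (equal to '\\' only on ASCII). */
int backslash_value(void) { return '\\'; }
int octal_value(void)     { return '\134'; }
int hex_value_5c(void)    { return '\x5c'; }

/* The string literal "\\t" holds the two characters \ and t. */
size_t backslash_t_length(void) { return strlen("\\t"); }
```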
From C11 6.4.4.4 paragraphs 2 and 4:
2
An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'. [...] With a few exceptions detailed later, the elements of the sequence are any members of the source character set; they are mapped in an implementation-defined manner to members of the execution character set.
[...]
4
The double-quote " and question-mark ? are representable either by themselves or by the escape sequences \" and \?, respectively, but the single-quote ' and the backslash \ shall be represented, respectively, by the escape sequences \' and \\.
The reason why you get a warning for this is that the behaviour is wholly implementation-defined. In C11 J.3.4 the following is listed as implementation-defined behaviour:
The value of an integer character constant containing more than one character or containing a character or escape sequence that does not map to a single-byte execution character (6.4.4.4).
Since '\\' contains an escape sequence that maps to a single-byte execution character \, there are no implementation-defined pitfalls and nothing to warn about; but '\\t' contains two characters, \ and t, and it wouldn't do what you want portably.
\\ is one character, t is one character, so that is clearly two characters.
\\ is an escape sequence, just like \t; it means \.
If you want to print the two characters \ and t, you clearly need either two calls to putchar() or a function that takes the string argument "\\t".
https://en.wikipedia.org/wiki/Escape_sequences_in_C#Table_of_escape_sequences
Just found out that all of the following work:
printf( "%ls\n", "123" L"456" );
printf( "%ls\n", L"123" "456" );
printf( "%ls\n", L"123" L"456" );
The output is
123456
123456
123456
Why can I freely mix and match wide and narrow string literals to get a wide string literal as a result? Is that a documented behavior?
Is that a documented behavior?
Yes, this behavior is specified by the standard. Section 6.4.5 String literals, paragraph 4 of the C99 draft standard says (emphasis mine):
In translation phase 6, the multibyte character sequences specified by any sequence of
adjacent character and wide string literal tokens are concatenated into a single multibyte
character sequence. If any of the tokens are wide string literal tokens, the resulting
multibyte character sequence is treated as a wide string literal; otherwise, it is treated as a character string literal.
6.4.5 String literals
In translation phase 6, the multibyte character sequences specified by
any sequence of adjacent character and wide string literal tokens are
concatenated into a single multibyte character sequence. If any of the
tokens are wide string literal tokens, the resulting multibyte
character sequence is treated as a wide string literal; otherwise, it
is treated as a character string literal.
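The quoted rule can be checked with a small sketch like the one below. The array and function names are invented for illustration, and it assumes a compiler in C99 mode or later.

```c
#include <stddef.h>
#include <wchar.h>

/* Adjacent literals are joined in translation phase 6. Because at least
   one token in each pair is wide, every result is a wide string literal
   and can initialize an array of wchar_t. */
static const wchar_t narrow_then_wide[] = "123" L"456";
static const wchar_t wide_then_narrow[] = L"123" "456";
static const wchar_t wide_then_wide[]   = L"123" L"456";

/* Returns nonzero if all three spellings produce the same wide string. */
int all_concatenations_match(void)
{
    return wcscmp(narrow_then_wide, L"123456") == 0
        && wcscmp(wide_then_narrow, L"123456") == 0
        && wcscmp(wide_then_wide,   L"123456") == 0;
}
```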
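I have a small snippet of code:
#include <stdio.h>

int main(void)
{
    char b = 'a';
    printf("%zu,", sizeof('a'));   /* size of the character constant */
    printf("%zu", sizeof(b));      /* size of the char variable */
    getchar();                     /* standard replacement for getch() */
    return 0;
}
When I run this with the GNU compiler in Dev-C++, it shows the following output:
OUTPUT: 4,1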
Why is 'a' treated as an integer, whereas b is treated as only a char?
Because character literals are of type int and not char in C.
So sizeof 'a' == sizeof (int).
Note that in C++, a character literal is of type char and so sizeof 'a' == sizeof (char).
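For C, this can be verified with a short sketch such as the following. The function names are made up, and the _Generic selection assumes a C11 compiler.

```c
#include <stddef.h>

/* In C, the constant 'a' has type int, so its size is sizeof(int). */
size_t size_of_constant(void) { return sizeof 'a'; }

/* A variable declared as char has size 1 by definition. */
size_t size_of_variable(void)
{
    char b = 'a';
    return sizeof b;
}

/* _Generic selects on the type of its controlling expression:
   returns 1 if 'a' has type int, 0 otherwise. */
int constant_is_int(void) { return _Generic('a', int: 1, default: 0); }
```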
That's just the way it is in C. That's just how the language was originally defined. As for why... Back then virtually everything in C was an int, unless there was a very good reason to make it something else. So, historically character constants in C have type int.
Note, BTW, that in C nomenclature 'a' is called a constant, not a literal. In C the term "literal" is used for string literals (and, since C99, compound literals), not for character constants.
In C, a character literal has type int.
In C++, a character literal that contains only one character has type char, which is an integral type.
In both C and C++, a wide character literal has type wchar_t, and a multicharacter literal has type int.
From IBM XL C/C++ documentation
A character literal contains a sequence of characters or escape
sequences enclosed in single quotation mark symbols, for example 'c'.
A character literal may be prefixed with the letter L, for example
L'c'. A character literal without the L prefix is an ordinary
character literal or a narrow character literal. A character literal
with the L prefix is a wide character literal. An ordinary character
literal that contains more than one character or escape sequence
(excluding single quotes ('), backslashes (\) or new-line characters)
is a multicharacter literal.
Character literals have the following form:
             .---------------------.
             V                     |
>>-+---+--'----+-character-------+-+--'------------------------><
   '-L-'       '-escape_sequence-'
At least one character or escape sequence must appear in the character
literal. The characters can be from the source program character set,
excluding the single quotation mark, backslash and new-line symbols. A
character literal must appear on a single logical source line.
In C, a character literal has type int.