What wide-characters are translated into a null multibyte?

According to the reference for wcstombs, wcstombs will translate wide-characters "until a wide character translates into a null character."
So what wide-characters are translated into a null multibyte? Is it a specific character? Or any character outside a given range?

The wcstombs function will translate until the L'\0' character (the wide character NUL) is encountered in the wide string (or until the destination multibyte string is filled). That documentation describes what it does when it encounters an error.

Related

Why might a string literal not be a string?

I'm struggling with this part in the C standard about string literals, especially the second part of it:
"In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. 80)"
"80) A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
Source: ISO/IEC 9899:2018 (C18), §6.4.5/6, Page 51
I don't understand the explanation "because a null character can be embedded in it by a \0 escape sequence".
The referenced section, §7.1.1, defines a "string" as follows:
"A string is a contiguous sequence of characters terminated by and including the first null character."
Source: ISO/IEC 9899:2018 (C18), §7.1.1/1, Page 132
I thought the emphasis might lie on the word "can", meaning that a string literal does not have to contain a null character, whereas a string must.
But then I ask myself: how can a string literal be used as a string if it has no terminating null character to mark its end (which string-handling functions require)?
I'm totally drawing a blank at the moment.
Note: I'm aware that a string literal is stored in read-only memory and can't be modified, and that "string" is a generic term for a sequence of characters terminated by NUL, which may or may not be mutable.
Thus, my question is not: "What is the difference between a string and a string literal?"
My question is:
Why/How can a string-literal not be a string?
and, according to my concerns, so far:
Is it true that a string literal can have the NUL byte omitted?
I wanted to ask this question myself, but shortly before posting it I found the answer. My confusion came from the slightly misleading wording of the quote.
But I decided not to delete the draft, since it could be useful for future readers, and to provide a Q&A instead.
Feel free to comment and hint.
Related stuff:
What is the difference between char s[] and char *s?
What is the type of string literals in C and C++?
Are string literals const?
"Life-time" of a string literal in C
You're overthinking it.
"A string is a contiguous sequence of characters terminated by and including the first null character."
Source: ISO/IEC 9899:2018 (C18), §7.1.1/1, Page 132
says that a "string" only extends up to the first null character. Characters that may exist after the null are not part of the string. However,
"80) A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
makes it clear that a string literal may contain an embedded null. If it does, the string literal AS A WHOLE is not a string; the string is just the prefix of the string literal up to and including the first null.
Let's take a look at the definition of the term "string literal" in the same section of C18, §6.4.5/3:
"A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz"."
According to that, a string literal consists only of the characters enclosed in quotation marks, the bare string content. It does not have an appended \0. The NUL byte is appended later during translation, as stated in §6.4.5/6:
"In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. 80)"
Let's look at an example:
"foo" is a string literal, but not a string because "foo" does not contain an embedded null character.
"foo\0" is a string literal and a string because the literal itself contains a null character at the end of the character sequence.
Note that you don't need to explicitly insert the null character at the end of a string literal to turn it into a string. As already said, it is implicitly appended during program translation.
That means
const char *s = "foo";
designates the same string as
const char *s = "foo\0";
although the array created for the second literal holds one extra trailing null byte.
I admit that the sentence:
"A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
is a little confusing and illogical in the context. It would be better phrased:
"A string literal might not be a string (see 7.1.1), because a null character might not (OR is not required to) be embedded in it by a \0 escape sequence."
or alternatively:
"A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."
As @EricPostpischil pointed out in his comment, the meaning of the footnote is probably quite different.
It means that if a string literal contains a null character somewhere inside it, rather than only at the end as a string requires, the string literal is not equivalent to a string.
For example:
The string literal
"foo\0bar"
is not a string, as its first null character is embedded inside the string literal rather than at the end of it.

mbrtowc: how to determine the number of bytes to skip if a null character is read

According to the C99 specification the mbrtowc function returns 0
if the next n or fewer bytes complete the multibyte character that
corresponds to the null wide character (which is the value stored).
What is the best way to continue reading the input immediately after the encoded null character?
My current solution is to convert the null wide character with the given encoding in order to determine the number of input bytes to skip for the next call to mbrtowc. But there might be a more elegant way to do this.
Additionally I wonder what the rationale behind this behaviour of mbrtowc might be.
One byte. The null byte always represents the null character regardless of shift state, and cannot participate as part of a multibyte character. The source for this is:
5.2.1.2 Multibyte characters
...
A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.

Will gcc functions in string.h break UTF-8 string?

I'm unsure about the following cases in GCC; can anyone help?
Can a valid UTF-8 character (other than code point 0) contain a zero byte? If so, I think a function such as strlen would break that UTF-8 character.
Can a valid UTF-8 character contain a byte whose value is equal to '\n'? If so, I think a function such as "gets" would break that UTF-8 character.
Can a valid UTF-8 character contain a byte whose value is equal to ' ' or '\t'? If so, I think a call such as scanf("%s%s") would break that UTF-8 character and interpret it as two or more words.
The answer to all your questions is the same: No.
It's one of the advantages of UTF-8: no ASCII byte ever occurs in the encoding of a non-ASCII code point.
For example, you can safely use strlen on a UTF-8 string; just note that its result is the number of bytes, not the number of code points.

C scan unicode character from string

I have a wchar_t string that includes Unicode characters like "ş, ç, ü, ...".
I need to take these characters one by one from the string, but I can't read them with sscanf, and I couldn't find an alternative function. What should I do?

Can the null character be used to represent the zero character?

The C99 standard requires that "A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string." (5.2.1/2) It then goes on to list 99 other characters that must be in the execution set. Can a character set be used in which the null character is one of these 99 characters? In particular, is it allowed that '0' == '\0' ?
Edit: Everyone is pointing out that in ASCII, '0' is 0x30. This is true, but the standard doesn't mandate the use of ASCII.
No matter if you use ASCII, EBCDIC or something "self-crafted", '0' must be distinct from '\0', for the reason you mention yourself:
A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string. (5.2.1/2)
If the null character terminates a character string, it cannot be contained in that string. It is the only character which cannot be contained in a string; all other characters can be used and thus must be distinct from 0.
I don't think the standard states that each of the characters that it lists (including the null character) has a distinct value, other than that the digits do. But a "character set" containing a value 0 that allegedly represents 91 of the 100 required characters is clearly not really a character set containing the required 100 characters. So this is either:
part of the English-language definition of "a character set",
obvious from context,
a very minor flaw in the text of the standard, that it should spell it out to prevent wilful misinterpretation by a faithless implementer.
Take your pick.
If '0' == '\0', you would not be able to distinguish the end of a string from the character '0'.
Thus it would be rather hard to use something like "0_any_string", as it already starts with '0'.
No, it can't. A character set must be described by an injective function, i.e. a function that maps each character to exactly one distinct binary value. Mapping 2 characters to the same value would make the character set ambiguous, i.e. the computer won't be able to map the data back to a matching character since more than one fits.
The C99 standard poses another restriction by forcing the mapping of null character to a specific binary value. Given the above paragraph this means that no other character can have a value identical to null.
The integer constant literal 0 has different meanings depending upon
the context in which it's used. In all cases, it is still an integer
constant with the value 0, it is just described in different ways.
If a pointer is being compared to the constant literal 0, then this is
a check to see if the pointer is a null pointer. This 0 is then
referred to as a null pointer constant. The C standard defines that 0
cast to the type void * is both a null pointer and a null pointer
constant.
What is the difference between NULL, '\0' and 0
