Can the null character be used to represent the zero character? - c

The C99 standard requires that "A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string." (5.2.1.2) It then goes on to list 99 other characters that must be in the execution set. Can a character set be used in which the null character is one of these 99 characters? In particular, is it allowed that '0' == '\0' ?
Edit: Everyone is pointing out that in ASCII, '0' is 0x30. This is true, but the standard doesn't mandate the use of ASCII.

No matter if you use ASCII, EBCDIC or something "self-crafted", '0' must be distinct from '\0', for the reason you mention yourself:
A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string. (5.2.1.2)
If the null character terminates a character string, it cannot be contained in that string. It is the only character which cannot be contained in a string; all other characters can be used and thus must be distinct from 0.

I don't think the standard states that each of the characters that it lists (including the null character) has a distinct value, other than that the digits do. But a "character set" containing a value 0 that allegedly represents 91 of the 100 required characters is clearly not really a character set containing the required 100 characters. So this is either:
part of the English-language definition of "a character set",
obvious from context,
a very minor flaw in the text of the standard, that it should spell it out to prevent wilful misinterpretation by a faithless implementer.
Take your pick.

If '0' == '\0' held, you would not be able to tell the end of a string apart from the character '0'.
Thus it would be impossible to use something like "0_any_string", as it already starts with '0' and would be read as an empty string.

No, it can't. A character set must be described by an injective function, i.e. a function that maps each character to exactly one distinct binary value. Mapping two characters to the same value would make the character set ambiguous, i.e. the computer wouldn't be able to interpret the data as a matching character since more than one fits.
The C99 standard poses another restriction by forcing the mapping of null character to a specific binary value. Given the above paragraph this means that no other character can have a value identical to null.

The integer constant literal 0 has different meanings depending upon
the context in which it's used. In all cases, it is still an integer
constant with the value 0, it is just described in different ways.
If a pointer is being compared to the constant literal 0, then this is
a check to see if the pointer is a null pointer. This 0 is then
referred to as a null pointer constant. The C standard defines that 0
cast to the type void * is both a null pointer and a null pointer
constant.
What is the difference between NULL, '\0' and 0

Related

C strcmp() with one char difference

My question is that how will strcmp() handle the following case:
strcmp("goodpassT", "goodpass");
I read that the comparison continues until a different character is found or null character (\0) is found in any of the strings. In the above case, when it encounters \0 for the second argument, will it just stop comparison, or will it still compare to the T character ? The return value is 1, but I'm not sure about the stopping condition.
The comparison is done using unsigned char. Thus the shorter string is smaller as its terminating 0 is smaller than other unsigned nonzero char in the longer string.
See http://port70.net/~nsz/c/c11/n1570.html#7.24.4p1
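The result of strcmp("goodpassT", "goodpass"); will be a positive value (1 on your implementation; the standard only guarantees a value greater than zero). The strings are compared character by character, as unsigned char, up to the first position where they differ. Here that position is where 'T' meets the terminating '\0' of the shorter string, so the comparison does look at 'T' and then stops.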

mbrtowc: howto determine number of characters to skip if null character is read

According to the C99 specification the mbrtowc function returns 0
if the next n or fewer bytes complete the multibyte character that
corresponds to the null wide character (which is the value stored).
What is the best way to continue reading the input immediately after the encoded null character?
My current solution is to convert the null wide character with the given encoding in order to determine the number of input bytes to skip for the next call to mbrtowc. But there might be a more elegant way to do this.
Additionally I wonder what the rationale behind this behaviour of mbrtowc might be.
One byte. The null byte always represents the null character regardless of shift state, and cannot participate as part of a multibyte character. The source for this is:
5.2.1.2 Multibyte characters
...
A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.

Will gcc functions in string.h break UTF-8 string?

I don't know the following cases in GCC, who can help me?
Whether a valid UTF-8 character (except code point 0) still contains zero byte? If so, I think function such as strlen will break that UTF-8 character.
Whether a valid UTF-8 character contains a byte whose value is equal to '\n'? If so, I think function such as "gets" will break that UTF-8 character.
Whether a valid UTF-8 character contains a byte whose value is equal to ' ' or '\t'? If so, I think function such as scanf("%s%s") will break that UTF-8 character and be interpreted as two or more words.
The answer to all your questions is the same: No.
It's one of the advantages of UTF-8: no ASCII byte ever occurs in the encoding of a non-ASCII code point.
For example, you can safely use strlen on a UTF-8 string, only that its result is the number of bytes instead of UTF-8 code points.

Printing an array of characters with "while"

So here is the working version:
#include <stdio.h>
int main(int argc, char const *argv[]) {
char myText[] = "hello world\n";
int counter = 0;
while(myText[counter]) {
printf("%c", myText[counter]);
counter++;
}
}
and in action:
Korays-MacBook-Pro:~ koraytugay$ gcc koray.c
Korays-MacBook-Pro:~ koraytugay$ ./a.out
hello world
My question is, why is this code even working? When (or how) does
while(myText[counter])
evaluate to false?
These 2 work as well:
while(myText[counter] != '\0')
while(myText[counter] != 0)
This one prints garbage in the console:
while(myText[counter] != EOF)
and this does not even compile:
while(myText[counter] != NULL)
I can see why the '\0' works, as C puts this character at the end of my array in compile time. But why does not NULL work? How is 0 == '\0'?
As for your last question,
But why does not NULL work?
Usually, NULL is a pointer type. Here, myText[counter] is a value of type char. As per the conditions for using the == operator, from C11 standard, chapter 6.5.9,
One of the following shall hold:
both operands have arithmetic type;
both operands are pointers to qualified or unqualified versions of compatible types;
one operand is a pointer to an object type and the other is a pointer to a qualified or unqualified version of void; or
one operand is a pointer and the other is a null pointer constant.
So, it tells us that a null pointer constant (NULL) can only be compared with a pointer type.
After that,
When (or how) does while(myText[counter]) evaluate to false?
Easy, when myText[counter] has got a value of 0.
To elaborate, after the initialization, myText holds the character values used to initialize it, with a "null" at the end. The "null" is how C identifies the end of a string. The null is represented by a value of 0, so when the end of the string is reached, the while() condition is false.
Additional explanation:
while(myText[counter] != '\0') works, because '\0' is the representation of the "null" used as the string terminator.
while(myText[counter] != 0) works, because 0 is the decimal value of '\0'.
Both statements above are equivalent to while(myText[counter]).
while(myText[counter] != EOF) does not work because a null is not an EOF.
Reference: C11 standard, chapter 6.3.2.3, Pointers, Paragraph 3
An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant.
and, from chapter, 7.19,
NULL
which expands to an implementation-defined null pointer constant
Note: In the end, this question and related answers will clear the confusion, should you have any.
In C, any non-zero value evaluates to true. C strings are null-terminated. That is, there is a special zero-value character after the last character in the string.
And so when the null terminator is reached, the zero value evaluates to false, and the loop terminates.
I can see why the '\0' works, as C puts this character at the end of my array in compile time. But why does not NULL work? How is 0 == '\0'?
0 has the same value as '\0' because '\0' is a character with the value zero. (Not to be confused with '0', which is the zero digit and has a value of 48.)
Regarding NULL, that actually can work since it also evaluates to zero. However, NULL is a pointer type so you may have to type cast to avoid errors. (Hard to say for certain since you didn't post the error that you got.)
When (or how) does this work?
while(myText[counter])
Any built-in with value zero will evaluate to false in a boolean context. So this while(myText[counter]) is false when myText[counter] is '\0', which has the value 0.
How is 0 == '\0'?
It is defined that way in the language. '\0' is an int literal with value zero. 0 is also an int literal with value zero, so both compare equal, and they both evaluate to false in a boolean context
How is 0 == '\0'?
All characters have a numeric value (8 bits wide on most platforms); for example, 'a' is 97 (decimal) in ASCII. The backslash in the character literal '\0' introduces an "escape" to directly specify the character through its numeric value, written in octal. In this case, the numeric value 0.
The termination of a string is \0
NULL is used to initialise a pointer to a determined value
while(myText[counter]) evaluates to false as soon as counter points to the zero byte.
In the end there is nothing different than a zero byte at the end of the string. Actually NULL would mean the same but it is used for notation purposes only in the context of pointers.
If something is not 100% clear from a coding perspective, you might want to look inside your debugger watch window, what are the bits and bytes actually during program execution.

What is the purpose of the s==NULL case for mbrtowc?

mbrtowc is specified to handle a NULL pointer for the s (multibyte character pointer) argument as follows:
If s is a null pointer, the mbrtowc() function shall be equivalent to the call:
mbrtowc(NULL, "", 1, ps)
In this case, the values of the arguments pwc and n are ignored.
As far as I can tell, this usage is largely useless. If ps is not storing any partially-converted character, the call will simply return 0 with no side effects. If ps is storing a partially-converted character, then since '\0' is not valid as the next byte in a multibyte sequence ('\0' can only be a string terminator), the call will return (size_t)-1 with errno==EILSEQ and leave ps in an undefined state.
The intended usage seems to have been to reset the state variable, particularly when NULL is passed for ps and the internal state has been used, analogous to mbtowc's behavior with stateful encodings, but this is not specified anywhere as far as I can tell, and it conflicts with the semantics for mbrtowc's storage of partially-converted characters (if mbrtowc were to reset state when encountering a 0 byte after a potentially-valid initial subsequence, it would be unable to detect this dangerous invalid sequence).
If mbrtowc were specified to reset the state variable only when s is NULL, but not when it points to a 0 byte, a desirable state-reset behavior would be possible, but such behavior would violate the standard as written. Is this a defect in the standard? As far as I can tell, there is absolutely no way to reset the internal state (used when ps is NULL) once an illegal sequence has been encountered, and thus no correct program can use mbrtowc with ps==NULL.
Since a '\0' byte must convert to a null wide character regardless of shift state (5.2.1.2 Multibyte characters), and the mbrtowc() function is specified to reset the shift state when it converts to a wide null character (7.24.6.3.2/3 The mbrtowc function), calling mbrtowc( NULL, "", 1, ps) will reset the shift state stored in the mbstate_t pointed to by ps. And if mbrtowc( NULL, "", 1, NULL) is called to use the library's internal mbstate_t object, it will be reset to an initial state. See the end of the answer for cites of the relevant bits of the standard.
I'm by no means particularly experienced with the C standard multibyte conversion functions (my experience with this kind of thing has been using the Win32 APIs for conversion).
If mbrtowc() processes an 'incomplete char' that's cut short by a 0 byte, it should return (size_t)(-1) to indicate an invalid multibyte char (and thus detect the dangerous situation you describe). In that case the conversion/shift state is unspecified (and I think you're basically hosed for that string). The multibyte 'sequence' that a conversion was attempted on but contains a '\0' is invalid and never will be valid with subsequent data. If the '\0' wasn't intended to be part of the converted sequence, then it shouldn't have been included in the count of bytes available for processing.
If you're in a situation where you might get additional, subsequent bytes for a partial multibyte char (say from a network stream), the n you passed for the partial multibyte char shouldn't include a 0 byte, so you'll get a (size_t)(-2) returned. In this case, if you pass a '\0' while in the middle of the partial conversion, you'll lose the fact that there's an error and as a side-effect reset the mbstate_t state in use (whether it's your own or the internal one being used because you passed in a NULL pointer for ps). I think I'm essentially restating your question here.
However I think it is possible to detect and handle this situation, but unfortunately it requires keeping track of some state yourself:
#include <stdio.h>   // EOF
#include <wchar.h>   // mbrtowc, mbstate_t, putwchar
#define MB_ERROR ((size_t)(-1))
#define MB_PARTIAL ((size_t)(-2))
// function to get a stream of multibyte characters from somewhere
// (assumed to behave like getchar(): returns the next byte, or EOF)
int get_next(void);
int bar(void)
{
    int ch;   // int, not char, so that EOF stays distinguishable from data bytes
    wchar_t wc;
    mbstate_t state = {0};
    int in_partial_convert = 0;
    while ((ch = get_next()) != EOF)
    {
        char c = (char)ch;
        size_t result = mbrtowc(&wc, &c, 1, &state);
        switch (result) {
        case MB_ERROR:
            // this multibyte char is invalid
            return -1;
        case MB_PARTIAL:
            // do nothing yet, we need more data
            // but remember that we're in this state
            in_partial_convert = 1;
            break;
        case 1:
            // output the completed wide char
            in_partial_convert = 0; // no longer in the middle of a conversion
            putwchar(wc);
            break;
        case 0:
            if (in_partial_convert) {
                // this 'last' multibyte char was malformed
                // return an error condition
                return -1;
            }
            // end of the multibyte string
            // we'll handle similar to EOF
            return 0;
        }
    }
    return 0;
}
Maybe not an ideal situation, but I think it shows it's not completely broken so as to be impossible to use.
Standards citations:
5.2.1.2 Multibyte characters
A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.
A byte with all bits zero shall be interpreted as a null character independent of shift state.
A byte with all bits zero shall not occur in the second or subsequent bytes of a multibyte character.
7.24.6.3.2/3 The mbrtowc function
If the corresponding wide character is the null wide character, the resulting state described is the initial conversion state.
In 5.2.1.2, Multibyte characters, the C Standard states:
A byte with all bits zero shall be interpreted as a null character independent of shift
state. Such a byte shall not occur as part of any other multibyte character.
The Standard seems to differentiate between shift state and conversion state, as, for example, 7.24.6 mentions:
The conversion state described by the pointed-to object is altered as needed to track the shift state, and the position within a multibyte character, for the associated multibyte character sequence.
(emphasis added by me). However, I think that the intent is to interpret a byte with all zero bits as the null character regardless of the mbstate_t value, which encodes the entire conversion state, particularly as "Such a byte shall not occur as part of any other multibyte character" implies that the null byte cannot occur within a multibyte character. If a null byte does occur in errant input where the second, third, etc. byte of a multibyte character should be, then I interpret the Standard as saying that the partial multibyte character at the EOF is silently ignored.
My reading of 7.24.6.3.2, The mbrtowc function, for the case when s is NULL is thus: the next 1 byte completes the null wide character, the return value of mbrtowc is 0, and the resulting state is the initial conversion state because:
If the corresponding wide character is the null wide character, the resulting state described is the initial conversion state.
By passing NULL for both s and ps, the internal mbstate_t of mbrtowc is reset to the initial state.

Resources