What is the string terminator sequence for a UTF-16 string?
EDIT:
Let me rephrase the question in an attempt to clarify. How does the call to wcslen() work?
Unicode does not define string terminators; your environment or language does. For instance, C strings use 0x00 as a string terminator, whereas .NET strings are not terminated at all: the String class stores the length of the string in a separate field.
To answer your second question, wcslen looks for a terminating L'\0' character. As I read it, that is a run of 0x00 bytes whose length depends on the compiler's wchar_t, but it will be the two-byte sequence 0x00 0x00 if you're using UTF-16 (encoding U+0000, NUL).
7.24.4.6.1 The wcslen function (from the Standard)
...
[#3] The wcslen function returns the number of wide
characters that precede the terminating null wide character.
And the null wide character is L'\0'
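To illustrate, here is a minimal sketch of what wcslen does (a hand-rolled equivalent; the real implementation may differ):

#include <stdio.h>
#include <wchar.h>

/* A hand-rolled equivalent of wcslen: scan forward until the
   first wide character whose value is zero (L'\0'). */
size_t my_wcslen(const wchar_t *s)
{
    const wchar_t *p = s;
    while (*p != L'\0')
        ++p;
    return (size_t)(p - s);
}

int main(void)
{
    printf("%zu\n", my_wcslen(L"ab"));   /* prints 2 */
    return 0;
}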
There isn't any. String terminators are not part of an encoding.
For example, if you had the string ab, it would be encoded in UTF-16 (little endian) with the following sequence of bytes: 61 00 62 00. And if you had 大家 you would get 27 59 B6 5B. So, as you can see, there is no predetermined terminator sequence.
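If you want to see those bytes for yourself, here is a minimal sketch assuming a C11 compiler (where u"..." produces UTF-16 code units):

#include <stdio.h>
#include <uchar.h>

int main(void)
{
    const char16_t s[] = u"ab";   /* UTF-16 code units, C11 */
    const unsigned char *bytes = (const unsigned char *)s;

    /* On a little-endian machine this prints 61 00 62 00 00 00;
       the final 00 00 pair is the compiler-added terminator,
       not part of the encoding itself. */
    for (size_t i = 0; i < sizeof s; i++)
        printf("%02x ", bytes[i]);
    printf("\n");
    return 0;
}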
Can UTF-8 contain a zero byte?
Can I safely store a UTF-8 string in a zero-terminated char *?
I understand strlen() will not return correct information, but "storing", printing and "transferring" the char array seem to be safe.
Yes.
Just like with ASCII and similar 8-bit encodings before Unicode, you can't store the NUL character in such a string (the value U+0000 is the Unicode code point NUL, very much like in ASCII).
As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
In C a 0 byte is the string terminator. As long as the Unicode code point 0, U+0000, is not in the Unicode string, there is no problem.
To be able to store 0 bytes in Unicode text, one may use Modified UTF-8, which converts not only code points >= 128 but also code point 0 to a multi-byte sequence (every byte thereof having its high bit set, i.e. being >= 128). This is done in Java for some APIs, like DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.
Formally it is no longer UTF-8, as UTF-8 requires the shortest possible encoding. Also, once such a string is unpacked to a non-UTF-8 form, its length has to be tracked separately instead of relying on strlen.
So the most feasible solution is not to accept U+0000 in strings.
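A rough sketch of just the U+0000 case (the helper name is made up; Modified UTF-8 writes the code point as the overlong pair C0 80, so no payload byte is ever zero):

#include <stdio.h>

/* Hypothetical helper: write one code point <= U+007F in
   Modified UTF-8. U+0000 becomes the overlong pair C0 80
   (110 00000, 10 000000); everything else in that range
   stays a single byte, exactly as in standard UTF-8. */
void put_modified_utf8_ascii(unsigned int cp, FILE *out)
{
    if (cp == 0) {
        fputc(0xC0, out);
        fputc(0x80, out);
    } else {
        fputc((int)cp, out);
    }
}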
I don't know the answers for the following cases in GCC; who can help me?
Does a valid UTF-8 character (other than code point 0) ever contain a zero byte? If so, I think functions such as strlen will break that UTF-8 character.
Does a valid UTF-8 character ever contain a byte whose value is equal to '\n'? If so, I think functions such as gets will break that UTF-8 character.
Does a valid UTF-8 character ever contain a byte whose value is equal to ' ' or '\t'? If so, I think functions such as scanf("%s%s") will break that UTF-8 character and interpret it as two or more words.
The answer to all your questions is the same: no.
It's one of the advantages of UTF-8: bytes in the ASCII range (0x00 through 0x7F) never occur when encoding non-ASCII code points into UTF-8.
For example, you can safely use strlen on a UTF-8 string; just remember that its result is the number of bytes, not the number of code points.
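A short sketch of that distinction (the code-point counter is a made-up helper; it skips UTF-8 continuation bytes, which always match the bit pattern 10xxxxxx):

#include <stdio.h>
#include <string.h>

/* Count UTF-8 code points: every byte except a continuation
   byte (10xxxxxx) starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

int main(void)
{
    const char *s = "th\xc3\xa9";                      /* "thé" in UTF-8 */
    printf("bytes: %zu\n", strlen(s));                 /* prints 4 */
    printf("code points: %zu\n", utf8_codepoints(s));  /* prints 3 */
    return 0;
}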
I need to determine the length of a UTF-8 string in bytes in C. How do I do it correctly? As I know, in UTF-8 the terminating symbol is 1 byte in size. Can I use the strlen function for this?
Can I use the strlen function for this?
Yes, strlen gives you the number of bytes before the first '\0' character, so
strlen(utf8) + 1
is the number of bytes in utf8 including the 0-terminator, since no character other than '\0' contains a 0 byte in UTF-8.
Of course, that only works if utf8 is actually UTF-8 encoded, otherwise you need to convert it to UTF-8 first.
Yes, strlen() will simply count the bytes until it encounters the NUL, which is the correct terminator for a 0-terminated UTF-8-encoded C string.
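For example, that byte count (terminator included) is exactly the size to allocate when copying; a minimal sketch with a made-up helper name:

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate a 0-terminated UTF-8 string.
   This works because UTF-8 never produces a 0 byte except for
   the terminator itself. */
char *utf8_dup(const char *utf8)
{
    size_t bytes = strlen(utf8) + 1;   /* payload + terminator */
    char *copy = malloc(bytes);
    if (copy != NULL)
        memcpy(copy, utf8, bytes);
    return copy;
}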
Consider this line of text:
First line of text.
If a character array is used to hold the first TEN characters of that line, the array will contain:
First lin'\0'
"First" contains 5 letters and "lin" contains 3 letters. Where are the other two characters being used?
Is \0 considered two characters?
Or is the space between the words considered a character, thus '\0' is one character?
Yes, space is a character. In ASCII encoding it has code number 32.
The space between the two words has ASCII code 0x20 (octal 040, decimal 32); it occupies one byte.
The null at the end of the string, ASCII code 0x00 (0 in both octal and decimal), occupies the other byte.
Note that the space bar is simply the key on the keyboard that generates a space character when typed.
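To see the arithmetic in code, here is a small sketch using snprintf, which truncates and always 0-terminates:

#include <stdio.h>

int main(void)
{
    char buf[10];   /* room for 9 visible characters + 1 terminator */

    /* snprintf writes at most sizeof buf - 1 characters and then
       the '\0': "First lin" is 5 letters + 1 space + 3 letters,
       so the tenth byte is the terminator. */
    snprintf(buf, sizeof buf, "%s", "First line of text.");
    printf("[%s]\n", buf);   /* prints [First lin] */
    return 0;
}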
'\0' is the null-terminator, it is literally the value zero in all implementations.
'\0' is considered a single character because the backslash \ begins an escape sequence. '\0' and '0' are thus both single characters, but they mean very different things.
Note that space is represented by a different ascii value.
Note that \s matches whitespace in regular expressions, but in C and Java source code a space character is written simply as ' '; there is no '\s' character escape.
How do I get the byte size of a multibyte-character string in Visual C? Is there a function or do I have to count the characters myself?
Or, more generally, how do I get the right byte size of a TCHAR string?
Solution:
_tcslen(_T("TCHAR string")) * sizeof(TCHAR)
EDIT:
I was talking about null-terminated strings only.
Let's see if I can clear this up:
"Multi-byte character string" is a vague term to begin with, but in the world of Microsoft, it typically meants "not ASCII, and not UTF-16". Thus, you could be using some character encoding which might use 1 byte per character, or 2 bytes, or possibly more. As soon as you do, the number of characters in the string != the number of bytes in the string.
Let's take UTF-8 as an example, even though it isn't used on MS platforms. The character é is encoded as "c3 a9" in memory -- thus, two bytes, but 1 character. If I have the string "thé", it's:
text:   t   h   é       \0
mem:    74  68  c3 a9   00
This is a "null terminated" string, in that it ends with a null. If we wanted to allow our string to have nulls in it, we'd need to store the size in some other fashion, such as:
struct my_string
{
    size_t length;
    char  *data;
};
... and a slew of functions to help deal with that. (This is sort of how std::string works, quite roughly.)
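As a rough illustration of one such function (the constructor name is made up, and the struct is repeated so the sketch stands alone), a length-prefixed string can carry embedded nulls:

#include <stdlib.h>
#include <string.h>

struct my_string
{
    size_t length;
    char  *data;
};

/* Hypothetical constructor: copies exactly 'length' bytes,
   embedded zeros included, since nothing here relies on a
   terminator to find the end. */
struct my_string my_string_make(const char *bytes, size_t length)
{
    struct my_string s;
    s.length = length;
    s.data = malloc(length);
    if (s.data != NULL)
        memcpy(s.data, bytes, length);
    return s;
}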
For null-terminated strings, however, strlen() will compute their size in bytes, not characters. (There are other functions for counting characters.) strlen just counts the number of bytes before it sees a 0 byte -- nothing fancy.
Now, "wide" or "unicode" strings in the world of MS refer to UTF-16 strings. They have similar problems in that the number of bytes != the number of characters. (Also: the number of bytes / 2 != the number of characters) Let look at thé again:
text:    t       h       é       \0
shorts:  0x0074  0x0068  0x00e9  0x0000
mem:     74 00   68 00   e9 00   00 00
That's "thé" in UTF-16, stored in little endian (which is what your typical desktop is). Notice all the 00 bytes -- these trip up strlen. Thus, we call wcslen, which looks at it as 2-byte shorts, not single bytes.
Lastly, you have TCHARs, which are one of the above two cases, depending on whether UNICODE is defined. _tcslen will be the appropriate function (either strlen or wcslen), and TCHAR will be either char or wchar_t. TCHAR was created to ease the move to UTF-16 in the Windows world.
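Putting that together, a Windows-specific sketch (the helper name is made up; unlike the one-liner above, it also counts the terminator):

#include <stddef.h>
#include <tchar.h>

/* Hypothetical helper: byte size of a 0-terminated TCHAR
   string, terminator included. With UNICODE defined this is
   wcslen * sizeof(wchar_t); otherwise strlen * sizeof(char). */
size_t tchar_string_bytes(const TCHAR *s)
{
    return (_tcslen(s) + 1) * sizeof(TCHAR);
}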
According to MSDN, _tcslen corresponds to strlen when _MBCS is defined. strlen will return the number of bytes in the string. If you use _tcsclen that corresponds to _mbslen which returns the number of multibyte characters.
Also, multibyte strings do not (AFAIK) contain embedded nulls, no.
I would question the use of a multibyte encoding in the first place, though... unless you're supporting a legacy app, there's no reason to choose multibyte over Unicode.