UTF-8 string size in bytes - c

I need to determine the length of a UTF-8 string in bytes in C. How do I do it correctly? As far as I know, the terminating character in UTF-8 is 1 byte long. Can I use the strlen function for this?

Can I use strlen function for this?
Yes, strlen gives you the number of bytes before the first '\0' character, so
strlen(utf8) + 1
is the number of bytes in utf8 including the 0-terminator, since no character other than '\0' contains a 0 byte in UTF-8.
Of course, that only works if utf8 is actually UTF-8 encoded, otherwise you need to convert it to UTF-8 first.

Yes, strlen() will simply count the bytes until it encounters the NUL, which is the correct terminator for a 0-terminated UTF-8-encoded C string.
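
As a small illustration of the answers above, here is a minimal sketch that wraps strlen in two helpers. The names utf8_byte_length and utf8_storage_size are made up for this example and are not part of any library, and the input is assumed to be a NUL-terminated, UTF-8 encoded string.

```c
#include <stddef.h>
#include <string.h>

/* Bytes occupied by a NUL-terminated UTF-8 string, not counting the
   terminating '\0'. Hypothetical helper name, for illustration only. */
size_t utf8_byte_length(const char *utf8)
{
    return strlen(utf8);
}

/* Bytes needed to store the string including the terminating '\0'. */
size_t utf8_storage_size(const char *utf8)
{
    return strlen(utf8) + 1;
}
```

For the euro sign, which UTF-8 encodes as the three bytes E2 82 AC, utf8_byte_length("\xe2\x82\xac") is 3 and utf8_storage_size("\xe2\x82\xac") is 4.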

Related

How to determine the size of a string in C, or at least ensuring that it doesn't exceed a maximum number of bytes?

Is it possible to determine the size in bytes of a string in C?
I'm trying to ensure that JSON strings built in C do not exceed a 1 MB size limit before passing them to the requesting application. I don't know the strings at compile time.
I've read that it is just strlen * sizeof( char ); but I don't understand that, because I read elsewhere that UTF-8 can have characters of size up to four bytes and sizeof( char ) is always one.
I am likely misunderstanding something basic.
If a character array is allocated as char JSON[1048576], does this allocate that many characters or that many bytes? If it is bytes, then as long as something like snprintf is used when writing to the JSON array, would this guarantee that it can never exceed 1 MB in size, even if there were characters in that array that take more than one byte?
Thank you.
Since you are after a size limit of 1 MB and not a string length limit per se, you can just use strlen(json_str), provided that your json string is null terminated ('\0').
If you allocate char JSON[1048576] that will give you an array with that many bytes. And snprintf(JSON, 1048576, "<json string>", ...) will guarantee that you never overfill your array.
It does not guarantee however that your string is a valid utf-8 string since the last character may be a multi byte character that is split in the middle.
A C char is not the same as a utf-8 character. In C char is by definition 1 Byte but in utf-8 the visual character that you want, like the heart in your comment, may be represented by several bytes of data.
One byte gives you 256 different values, and since there are far more than 256 Unicode "characters", more than one byte is needed to encode many of them. The designers of utf-8 were clever, though: the first 128 code points (0 to 127) are encoded using just one byte each, and if only those characters are used the string is both valid utf-8 and valid ascii.
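
To illustrate the caveat about a multi-byte character being split at the end of the buffer, here is a rough sketch of how a truncated tail could be cut off after the snprintf call. The function names are invented for this example, and the code assumes the string was valid UTF-8 before it was truncated; it does not validate the rest of the string.

```c
#include <stddef.h>
#include <string.h>

/* Expected sequence length for a given lead byte, or 0 if the byte is a
   continuation byte or not a valid lead. Hypothetical helper. */
static size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Removes a multi-byte sequence that was cut short at the end of s.
   Returns the resulting length in bytes. */
size_t utf8_trim_split_tail(char *s)
{
    size_t len = strlen(s);
    size_t i = len;

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (i > 0 && len - i < 3 && ((unsigned char)s[i - 1] & 0xC0) == 0x80)
        i--;

    if (i == 0)
        return len; /* empty, or no lead byte found: leave unchanged */

    size_t start = i - 1; /* candidate lead byte of the last sequence */
    size_t expected = utf8_expected_len((unsigned char)s[start]);
    if (expected > len - start) {
        s[start] = '\0'; /* sequence is incomplete: drop it */
        return start;
    }
    return len;
}
```

For example, writing "a" followed by the euro sign (bytes 61 E2 82 AC) into a 4-byte buffer with snprintf leaves 61 E2 82 plus the terminator; calling utf8_trim_split_tail on that buffer shortens it to just "a".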

Can I store UTF8 in C-style char array

Follow up
Can UTF-8 contain zero byte?
Can I safely store UTF8 string in zero terminated char * ?
I understand strlen() will not return correct information, but "storing", printing and "transferring" the char array seems to be safe.
Yes.
Just like with ASCII and similar 8-bit encodings before Unicode, you can't store the NUL character in such a string (U+0000 is the Unicode code point NUL, very much like in ASCII).
As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
In C a 0 byte is the string terminator. As long as the Unicode point 0, U+0000 is not in the Unicode string there is no problem.
To be able to store U+0000 in such a string, one may use modified UTF-8, which converts not only code points >= 128 but also 0 to a multi-byte sequence (every byte thereof having its high bit set, >= 128). This is done in Java for some APIs, like DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.
It is formally no longer UTF-8, as UTF-8 requires the shortest encoding. Also, this only works if the length is determined by some means other than strlen when unpacking to a non-UTF-8 representation.
So the most feasible solution is not to accept U+0000 in strings.
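
If the data arrives with an explicit length (for example from a file or a socket), rejecting U+0000 can be sketched as a simple check before treating the bytes as a C string. The helper name below is an assumption made for this example; the check relies on the fact that in UTF-8 a zero byte only ever encodes U+0000.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the len bytes at data contain no zero byte, so they can be
   stored as a NUL-terminated C string without being cut short.
   Hypothetical helper name, for illustration only. */
bool utf8_fits_c_string(const char *data, size_t len)
{
    return memchr(data, '\0', len) == NULL;
}
```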

Will gcc functions in string.h break UTF-8 string?

I am unsure about the following cases in GCC. Can anyone help?
Can a valid UTF-8 character (other than code point 0) contain a zero byte? If so, I think a function such as strlen will break that UTF-8 character.
Can a valid UTF-8 character contain a byte whose value is equal to '\n'? If so, I think a function such as "gets" will break that UTF-8 character.
Can a valid UTF-8 character contain a byte whose value is equal to ' ' or '\t'? If so, I think a function such as scanf("%s%s") will break that UTF-8 character and interpret it as two or more words.
The answer to all your questions is the same: No.
It's one of the advantages of UTF-8: no ASCII byte ever occurs in the UTF-8 encoding of a non-ASCII code point.
For example, you can safely use strlen on a UTF-8 string, only that its result is the number of bytes instead of the number of Unicode code points.
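
If the number of code points is what you need, a minimal sketch (assuming the input is valid, NUL-terminated UTF-8) is to count every byte that is not a continuation byte of the form 10xxxxxx. The function name is made up for this example.

```c
#include <stddef.h>

/* Counts Unicode code points in a valid, NUL-terminated UTF-8 string by
   skipping continuation bytes (10xxxxxx). Hypothetical helper name. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For "a\xc3\xb6\xc3\xbc" (the string aöü) this gives 3, while strlen gives 5.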

utf8 strings and malloc in c

With "opendir" and "readdir" I read a directory's contents.
During that process I do some string manipulation / allocation,
something like this:
size_t stringlength = strlen(cur_dir) + strlen(ep->d_name) + 2; // +2 for the '/' and the terminating '\0'
char *file_with_path = xmalloc(stringlength); // xmalloc is a malloc wrapper with some tests (like no more memory)
snprintf(file_with_path, stringlength, "%s/%s", cur_dir, ep->d_name);
But what if a string contains a two-byte utf8 char?
How do you handle that issue?
stringlength*2?
Thanks
strlen() counts the bytes in the string; it doesn't care whether the contained bytes represent UTF-8 encoded Unicode characters. So, for example, strlen() of a string containing a UTF-8 encoding of "aöü" would return 5, since the string is encoded as "a\xc3\xb6\xc3\xbc".
strlen counts the number of bytes in a string (up to the terminating NUL), not the number of UTF-8 characters, so stringlength should already be as large as you need it.
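
A self-contained version of the same idea could look like the sketch below. It uses plain malloc instead of the xmalloc wrapper from the question, and join_path is a name invented for this example. No extra room is needed for multi-byte characters because strlen already counts bytes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated "dir/name", or NULL if allocation fails.
   The caller must free the result. Hypothetical helper name. */
char *join_path(const char *dir, const char *name)
{
    /* strlen counts bytes, so UTF-8 names are already sized correctly;
       +2 covers the '/' separator and the terminating '\0'. */
    size_t size = strlen(dir) + strlen(name) + 2;
    char *path = malloc(size);
    if (path == NULL)
        return NULL;
    snprintf(path, size, "%s/%s", dir, name);
    return path;
}
```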

UTF-16 string terminator

What is the string terminator sequence for a UTF-16 string?
EDIT:
Let me rephrase the question in an attempt to clarify. How does the call to wcslen() work?
Unicode does not define string terminators. Your environment or language does. For instance, C strings use 0x0 as a string terminator, whereas .NET strings use a separate value in the String class to store the length of the string.
To answer your second question, wcslen looks for a terminating L'\0' character. As I read it, that is sizeof(wchar_t) bytes of 0x00, which depends on the compiler, but it will likely be the two-byte sequence 0x00 0x00 if you're using UTF-16 (encoding U+0000, 'NUL').
7.24.4.6.1 The wcslen function (from the Standard)
...
[#3] The wcslen function returns the number of wide
characters that precede the terminating null wide character.
And the null wide character is L'\0'
There isn't any. String terminators are not part of an encoding.
For example, the string ab would be encoded in UTF-16 (little-endian) as the following sequence of bytes: 61 00 62 00. And 大家 would give 27 59 B6 5B. So, as you can see, there is no predetermined terminator sequence.
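
When a program does adopt the C convention of ending a UTF-16 string with a zero code unit, finding the length works much like wcslen on a platform with a 16-bit wchar_t. The sketch below is only an illustration of that convention: utf16_length_units is a made-up name, and it counts 16-bit code units, not bytes or code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit code units before the first 0x0000 unit.
   Hypothetical helper name; the terminator is a convention of the
   program, not something UTF-16 itself defines. */
size_t utf16_length_units(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

For the array {0x0061, 0x0062, 0x0000} (the string ab plus a terminator) the result is 2, and the same holds for {0x5927, 0x5BB6, 0x0000} (大家).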
