utf8 strings and malloc in c

With "opendir" and "readdir" i do read a directories content.
During that process i do some strings manipulation / allocation:
something like that:
int stringlength = strlen(cur_dir)+strlen(ep->d_name)+2;
char *file_with_path = xmalloc(stringlength); //xmalloc is a malloc wrapper with some tests (like no more memory)
snprintf (file_with_path, (size_t)stringlength, "%s/%s", cur_dir, ep->d_name);
But what if a string contains a two-byte utf8 char?
How do you handle that issue?
stringlength*2?
Thanks

strlen() counts the bytes in the string; it doesn't care whether those bytes represent UTF-8 encoded Unicode characters. So, for example, strlen() of a string containing a UTF-8 encoding of "aöü" would return 5, since the string is encoded as "a\xc3\xb6\xc3\xbc".

strlen counts the number of bytes in a string (up to the terminating NUL), not the number of UTF-8 characters, so stringlength should already be as large as you need it.
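
For illustration, here is a minimal, self-contained sketch of the pattern from the question (using plain malloc in place of the asker's xmalloc wrapper, which is not shown); it demonstrates that the byte-based arithmetic is already exact for multi-byte UTF-8 names:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *cur_dir = "/tmp";
    const char *d_name = "aöü.txt"; /* "ö" and "ü" are two bytes each in UTF-8 */

    /* strlen counts bytes, so this is exact even for multi-byte names:
       +2 covers the '/' separator and the terminating '\0'. */
    size_t stringlength = strlen(cur_dir) + strlen(d_name) + 2;
    char *file_with_path = malloc(stringlength);
    if (file_with_path == NULL)
        return 1;

    snprintf(file_with_path, stringlength, "%s/%s", cur_dir, d_name);
    printf("%s (%zu bytes)\n", file_with_path, strlen(file_with_path));
    free(file_with_path);
    return 0;
}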

Related

How to determine the size of a string in C, or at least ensure that it doesn't exceed a maximum number of bytes?

Is it possible to determine the size in bytes of a string in C?
I'm trying to ensure that JSON strings built in C do not exceed a 1 MB size limit before passing them to the requesting application. I don't know the strings at compile time.
I've read that it is just strlen * sizeof( char ); but I don't understand that, because I read elsewhere that UTF-8 can have characters of size up to four bytes and sizeof( char ) is always one.
I am likely misunderstanding something basic.
If a character array is allocated as char JSON[1048576], does this allocate that many characters or bytes? If it is bytes, then as long as something like snprintf is used when writing to the JSON array, would this guarantee that it can never exceed 1 MB in size, even if there were characters in that array that exceed one byte?
Thank you.
Since you are after a size limit of 1 MB and not a string length limit per se, you can just use strlen(json_str), provided that your JSON string is null-terminated ('\0').
If you allocate char JSON[1048576] that will give you an array with that many bytes. And snprintf(JSON, 1048576, "<json string>", ...) will guarantee that you never overfill your array.
It does not guarantee, however, that your string is valid UTF-8, since the last character may be a multi-byte character that is split in the middle.
A C char is not the same as a UTF-8 character. In C, char is by definition one byte, but in UTF-8 the visual character that you want, like the heart in your comment, may be represented by several bytes of data.
One byte gives you 256 different values, and since there are far more than 256 Unicode "characters", more than one byte is needed to encode many of them. The designers of UTF-8 were clever, though: the first 128 code points (the ASCII range) can be encoded using just one byte, and if only those characters are used the result is both valid UTF-8 and valid ASCII.
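
A minimal sketch of the byte-limit check described above (the function name and the choice to count the terminator toward the limit are illustrative assumptions, not from the question):

#include <string.h>

#define JSON_MAX_BYTES 1048576 /* the 1 MB limit from the question */

/* Hypothetical helper: returns 1 if the null-terminated JSON string,
   including its '\0' terminator, fits in a JSON_MAX_BYTES buffer.
   strlen counts bytes, not characters, which is what a size limit needs. */
int json_within_limit(const char *json_str)
{
    return strlen(json_str) + 1 <= JSON_MAX_BYTES;
}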

Can I store UTF8 in C-style char array

Follow up
Can UTF-8 contain a zero byte?
Can I safely store a UTF-8 string in a zero-terminated char *?
I understand strlen() will not return the correct character count, but "storing", printing and "transferring" the char array seem to be safe.
Yes.
Just like with ASCII and similar 8-bit encodings before Unicode, you can't store the NUL character in such a string (the value U+0000 is the Unicode code point NUL, very much like in ASCII).
As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
In C a 0 byte is the string terminator. As long as the Unicode point 0, U+0000 is not in the Unicode string there is no problem.
To be able to store 0 bytes in Unicode, one may use modified UTF-8, which converts not only code points >= 128 but also 0 to a multi-byte sequence (every byte thereof having its high bit set, i.e. >= 128). This is done in Java for some APIs, like DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.
Formally this is no longer UTF-8, since UTF-8 requires the shortest encoding. Also, when unpacking to a non-UTF-8 representation, the length has to be determined by decoding instead of by strlen.
So the most feasible solution is not to accept U+0000 in strings.
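
As an illustration of the modified-UTF-8 trick described above, here is a sketch that handles only ASCII input plus embedded zeros (real modified UTF-8 also applies the normal multi-byte rules to code points >= 128, omitted here for brevity; the function name is hypothetical):

#include <stddef.h>

/* Copy 'len' input bytes into 'out', encoding any embedded 0 byte as the
   overlong two-byte pair 0xC0 0x80 (both bytes have their high bit set),
   so the result can be stored as an ordinary null-terminated C string.
   'out' must have room for 2 * len + 1 bytes in the worst case.
   Returns the encoded length, excluding the terminator. */
size_t encode_modified_utf8_ascii(const unsigned char *in, size_t len,
                                  unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0) {      /* U+0000 gets the overlong form */
            out[o++] = 0xC0;
            out[o++] = 0x80;
        } else {
            out[o++] = in[i];  /* ASCII bytes pass through unchanged */
        }
    }
    out[o] = '\0';             /* the terminator is now unambiguous */
    return o;
}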

convert jchararray to jstring in JNI

I am using the JNI code below to convert a jchararray to a jstring, but I am getting only the first character on Linux.
char *carr =(char*)malloc(length+1);
(*env)->GetCharArrayRegion(env, ch, 0, length, carr);
return (*env)->NewStringUTF(env, carr);
GetCharArrayRegion returns Java chars, i.e. UTF-16 code units (jchar in JNI). They are not null-terminated, so you cannot use NewStringUTF, which expects a null-terminated string comprising bytes in the modified UTF-8 encoding.
First, allocate the correct amount of memory
jchar *carr = malloc(length * sizeof(jchar));
Then execute the GetCharArrayRegion
(*env)->GetCharArrayRegion(env, ch, 0, length, carr);
Then notice that you've got an array of UTF-16 characters. If the first character falls into the ASCII range, and the architecture is little-endian, it is expected that you'd just "get the first character", because the MSB byte of the first jchar will be zero, and NewStringUTF would consider this the terminator. Use NewString instead:
return (*env)->NewString(env, carr, length);
You should use the NewString() function, which takes a jchar array and its length. The NewStringUTF() function takes a (modified) UTF-8 encoded C string as input.
See https://www3.ntu.edu.sg/home/ehchua/programming/java/JavaNativeInterface.html#zz-4.2 for more details.
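
Putting the pieces together, here is a sketch of the full conversion (the function name is illustrative; note that NewString copies the data, so the temporary buffer must be freed):

#include <jni.h>
#include <stdlib.h>

jstring chararray_to_jstring(JNIEnv *env, jcharArray ch)
{
    jsize length = (*env)->GetArrayLength(env, ch);
    jchar *carr = malloc((size_t)length * sizeof *carr); /* UTF-16 code units, no terminator needed */
    if (carr == NULL)
        return NULL;

    (*env)->GetCharArrayRegion(env, ch, 0, length, carr);
    jstring result = (*env)->NewString(env, carr, length); /* takes jchar* plus an explicit length */
    free(carr); /* NewString made its own copy */
    return result;
}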

Does strlen() always correctly report the number of chars in a pointer-initialized string?

As long as I use char and not some wchar_t type to declare a string, will strlen() correctly report the number of chars in the string, or are there some very specific cases I need to be aware of? Here is an example:
char *something = "Report all my chars, please!";
strlen(something);
What strlen does is basically count all bytes until it hits a zero-byte, the so-called null-terminator, character '\0'.
So as long as the string contains a terminator within the bounds of the memory allocated for the string, strlen will correctly return the number of chars in the string.
Note that strlen can't count the number of characters (code points) in a multi-byte encoded string (like UTF-8). It will correctly return the number of bytes in the string though.
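
To make the distinction concrete, here is a small sketch that counts code points by skipping UTF-8 continuation bytes (those of the form 10xxxxxx); it assumes the input is valid UTF-8 and does no validation:

#include <stdio.h>
#include <string.h>

/* Count code points in a valid UTF-8 string: every byte that is not a
   continuation byte (0x80..0xBF) starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

int main(void)
{
    const char *something = "aöü"; /* 5 bytes, 3 code points */
    printf("bytes: %zu\n", strlen(something)); /* prints 5 */
    printf("code points: %zu\n", utf8_codepoints(something)); /* prints 3 */
    return 0;
}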

UTF-8 string size in bytes

I need to determine the length of a UTF-8 string in bytes in C. How do I do it correctly? As far as I know, in UTF-8 the terminating symbol is one byte. Can I use the strlen function for this?
Can I use strlen function for this?
Yes, strlen gives you the number of bytes before the first '\0' character, so
strlen(utf8) + 1
is the number of bytes in utf8 including the 0-terminator, since no character other than '\0' contains a 0 byte in UTF-8.
Of course, that only works if utf8 is actually UTF-8 encoded, otherwise you need to convert it to UTF-8 first.
Yes, strlen() will simply count the bytes until it encounters the NUL, which is the correct terminator for a 0-terminated UTF-8-encoded C string.
