Is libunistring's u8_strlen() equal to strlen()? - c

I'm trying to use libunistring in my C program.
I have to process UTF-8 strings, and for that I used the u8_strlen() function from the libunistring library.
Code example:
#include <stdio.h>
#include <string.h>
#include <unistr.h>

void print_length(uint8_t *msg) {
    printf("Default strlen: %zu\n", strlen((char *) msg));
    printf("U8 strlen: %zu\n", u8_strlen(msg));
}
Just imagine that we call print_length() with msg = "привет" (Cyrillic, UTF-8 encoded).
I expected strlen() to return 12 (6 letters * 2 bytes per letter), and
u8_strlen() to return 6 (just 6 letters).
But I received these curious results:
Default strlen: 12
U8 strlen: 12
After this I tried to look up the u8_strlen implementation, and found this code:
size_t
u8_strlen (const uint8_t *s)
{
  return strlen ((const char *) s);
}
I'm wondering: is this a bug, or is it the correct behavior? If it's correct, why?

I believe this is the intended behavior.
The libunistring manual says:
size_t u8_strlen (const uint8_t *s)
Returns the number of units in s.
Also in the manual, it defines what this "unit" is:
UTF-8 strings, through the type ‘uint8_t *’. The units are bytes (uint8_t).
I believe the reason they label the function u8_strlen, even though it does nothing more than the standard strlen, is that the library also has u16_strlen and u32_strlen for operating on UTF-16 and UTF-32 strings (which count the number of 2-byte units up to 0x0000 and of 4-byte units up to 0x00000000, respectively); u8_strlen is included simply for completeness.
GNU gnulib does, however, include mbslen, which probably does what you want:
mbslen function: Determine the number of multibyte characters in a string.

There is also the u8_mbsnlen function
Function: size_t u8_mbsnlen (const uint8_t *s, size_t n)
Counts and returns the number of Unicode characters in the n units from s.
This function is similar to the gnulib function mbsnlen, except that it operates on Unicode strings.
(link)
Unfortunately this requires you to pass in the length of the string in units (bytes) as well; since u8_strlen gives you exactly that, the two combine naturally, as sketched below.
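A minimal sketch of that combination (my example, assuming a UTF-8 encoded source file and linking with -lunistring):
#include <stdio.h>
#include <unistr.h>

int main(void) {
    const uint8_t *msg = (const uint8_t *) "привет";
    size_t units = u8_strlen(msg);          /* units (bytes), same as strlen */
    size_t chars = u8_mbsnlen(msg, units);  /* Unicode characters */
    printf("units: %zu, characters: %zu\n", units, chars);
    return 0;
}
This should print "units: 12, characters: 6".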

Related

C Unicode: How do I apply C11 standard amendment DR488 fix to C11 standard function c16rtomb()?

Question:
As mentioned in the cppreference page for the function c16rtomb, under the Notes section:
In C11 as published, unlike mbrtoc16, which converts variable-width multibyte (such as UTF-8) to variable-width 16-bit (such as UTF-16) encoding, this function can only convert single-unit 16-bit encoding, meaning it cannot convert UTF-16 to UTF-8 despite that being the original intent of this function. This was corrected by the post-C11 defect report DR488.
And below this passage, the C reference page provided an example source code with the following sentence above it:
Note: this example assumes the fix for the defect report 488 is applied.
That sentence implies there is a way to take the DR488 fix and somehow "apply" it to the C11 standard function c16rtomb.
I would like to know how to apply the fix for GCC, because it seems the fix was already applied to Visual Studio 2017's Visual C++, as of v141.
The behavior seen in GCC, when debugging the code in GDB, is consistent with what was found in DR488, as follows:
Section 7.28.1 describes the function c16rtomb(). In particular, it states "When c16 is not a valid wide character, an encoding error occurs". "wide character" is defined in section 3.7.3 as "value representable by an object of type wchar_t, capable of representing any character in the current locale". This wording seems to imply that, e.g. for the common cases (e.g, an implementation that defines __STDC_UTF_16__ and a program that uses an UTF-8 locale), c16rtomb() will return -1 when it encounters a character that is encoded as multiple char16_t (for UTF-16 a wide character can be encoded as a surrogate pair consisting of two char16_t). In particular, c16rtomb() will not be able to process strings generated by mbrtoc16().
The behavior described at the end of that quote is exactly what I observe.
Source code:
#include <stdio.h>
#include <uchar.h>
#define __STD_UTF_16__
int main() {
    char16_t* ptr_string = (char16_t*) u"我是誰";
    //C++ disallows variable-length arrays.
    //GCC uses GNU C++, which has an extension for variable-length arrays.
    //It is not a truly standard feature in C++ pedantic mode at all.
    //https://stackoverflow.com/questions/40633344/variable-length-arrays-in-c14
    char buffer[64];
    char* bufferOut = buffer;
    //Must zero this object before attempting to use mbstate_t at all.
    mbstate_t multiByteState = {0};
    //c16 = 16-bit Characters or char16_t typed characters
    //r = representation
    //tomb = to Multi-Byte Strings
    while (*ptr_string) {
        char16_t character = *ptr_string;
        size_t size = c16rtomb(bufferOut, character, &multiByteState);
        if (size == (size_t) -1)
            break;
        bufferOut += size;
        ptr_string++;
    }
    size_t bufferOutSize = bufferOut - buffer;
    printf("Size: %zu - ", bufferOutSize);
    for (size_t i = 0; i < bufferOutSize; i++) {
        printf("%#x ", +(unsigned char) buffer[i]);
    }
    //This statement is used to set a breakpoint. It does not do anything else.
    int debug = 0;
    return 0;
}
Output from Visual Studio:
Size: 9 - 0xe6 0x88 0x91 0xe6 0x98 0xaf 0xe8 0xaa 0xb0
Output from GCC:
Size: 0 -
On Linux you should be able to fix this with a call to setlocale(LC_ALL, "en_US.utf8");
Example on ideone
This function will do the following, as stated in Microsoft documentation:
Convert a UTF-16 wide character into a multibyte character in the current locale.
The POSIX documentation is similar. __STD_UTF_16__ doesn't seem to have an effect in either compiler; note that the standard macro is actually spelled __STDC_UTF_16__, and it is predefined by the implementation to indicate that char16_t values are UTF-16 encoded, so defining it yourself does nothing. It specifies the encoding of the char16_t source, not the encoding of the multibyte destination, which comes from the locale.
It's the Windows documentation that seems more inconsistent, because it appears to imply that a setlocale call is necessary, or that converting to the ANSI code page is an option.
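A minimal sketch of the question's loop with that fix applied (my example, assuming a glibc that implements the DR488 behavior and that the en_US.utf8 locale is installed):
#include <stdio.h>
#include <uchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, "en_US.utf8");   /* make the multibyte encoding UTF-8 */
    const char16_t *p = u"我是誰";
    char buffer[64];
    char *out = buffer;
    mbstate_t state = {0};
    while (*p) {
        size_t size = c16rtomb(out, *p++, &state);
        if (size == (size_t) -1)
            break;                     /* encoding error */
        out += size;
    }
    printf("Size: %zu\n", (size_t) (out - buffer));
    return 0;
}
With the fix in effect this should report Size: 9, matching the Visual Studio output above.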

C program stops working with scanf_s

I am fairly new to programming and I am having trouble with a piece of code. I am trying to input a word, but when I run the program and enter the word, it stops working.
This is the code:
#include <stdio.h>

int main(void) {
    char a[] = "";
    printf("Enter word:\n");
    scanf_s("%s", a);
    return 0;
}
I tried giving a[] a size of 20 and used %19s as another question suggested, but that did not work either.
Edit 1: Changed char a[] = ""; to char a[20] = {0};, but it did not work.
Edit 2: Added sizeof(a) and the code worked. Additionally, I removed the {0}, but I don't know if that made a difference.
Final code:
#include <stdio.h>

int main(void) {
    char a[20];
    printf("Enter word:\n");
    scanf_s("%19s", a, sizeof(a));
    return 0;
}
Diagnosis
There are (at least) two problems in the code:
You've not provided any useful space to store the string. (The original question defined char a[] = "";, which, be it noted, is an array of length 1 that can only hold a string of length 0.)
You've not told scanf_s() how big the buffer is. It requires a length argument after the pointer to the character array.
Microsoft's definition for scanf_s() specifies:
Unlike scanf and wscanf, scanf_s and wscanf_s require the buffer size to be specified for all input parameters of type c, C, s, S, or string control sets that are enclosed in []. The buffer size in characters is passed as an additional parameter immediately following the pointer to the buffer or variable. For example, if you are reading a string, the buffer size for that string is passed as follows:
char s[10];
scanf_s("%9s", s, _countof(s)); // buffer size is 10, width specification is 9
The buffer size includes the terminating null. You can use a width specification field to ensure that the token that's read in will fit into the buffer. If no width specification field is used, and the token read in is too big to fit in the buffer, nothing is written to that buffer.
Note
The size parameter is of type unsigned, not size_t.
The _countof() macro is a Microsoft extension. It is approximately equivalent to sizeof(s) / sizeof(s[0]), which in this case is the same as sizeof(s) since sizeof(char) == 1 by definition.
Note that the size parameter is unsigned, not size_t as you would expect. This is one of the areas of difference between the Microsoft implementation of the TR 24731-1 functions and Annex K of ISO/IEC 9899:2011. The size specified in the standard is technically rsize_t, but that is defined as size_t with a restricted range (hence the r):
The type is rsize_t which is the type size_t.
but the footnote (not shown) refers to the definition of RSIZE_MAX.
See also Do you use the TR 24731 'safe' functions?
Fixing the code in the question
The example in the quote from Microsoft largely shows how to fix your code. You need:
#include <stdio.h>

int main(void)
{
    char a[4096];
    printf("Enter word:\n");
    if (scanf_s("%s", a, (unsigned)sizeof(a)) != 1) // Note the cast!
        fprintf(stderr, "scanf_s() failed\n");
    else
        printf("scanf_s() read: <<%s>>\n", a);
    return 0;
}
Note that I checked the result of scanf_s() rather than just assuming it worked, and reported errors on standard error.
Using
char a[] = "";
creates an array big enough for a single byte. You have to allocate enough space, e.g. like this:
char a[20] = {0}; // Can hold a string of length 19 plus the \0 terminator
With your original code, scanf_s would write more to memory than you allocated, causing an overflow and typically a segmentation fault.

How to determine the size a char array really needs?

I have a question about the following code:
#include <stdio.h>

void testing(int idNumber)
{
    char name[20];
    snprintf(name, sizeof(name), "number_%d", idNumber);
}
The size of the char array name is 20, so if idNumber is 111 it works, but what if the actual idNumber is 111111111111111111111111111111? How do I determine how big the char array should be in order to hold the result of snprintf?
Well, if int is 32 bits on your platform, then the widest value it could print would be -2147483648, which is 11 characters, so you'd need 7 for number_, 11 for %d, and 1 for the null terminator: 19 total.
But you should generally check the return value from snprintf(), to make sure you had enough space. For example, if the locale is set to something other than the standard "C" one, it could print thousands separators, in which case you'd need 2 more characters than you have. A minimal check is sketched below.
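Such a check might look like this (format_name is a hypothetical helper of my own, not from any answer here):
#include <stdio.h>

/* Returns 0 on success, -1 on truncation or encoding error. */
int format_name(char *name, size_t size, int idNumber) {
    int n = snprintf(name, size, "number_%d", idNumber);
    if (n < 0 || (size_t) n >= size)
        return -1;    /* encoding error, or the output did not fit */
    return 0;
}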
There is only one good answer:
Ask snprintf itself (pass a length of 0).
It returns the size of the output it would have written if the buffer were big enough, excluding the terminating 0.
man-page for snprintf
Standard-quote (C99+Amendments):
7.21.6.5 The snprintf function
Synopsis
#include <stdio.h>
int snprintf(char * restrict s, size_t n,
const char * restrict format, ...);
Description
2 The snprintf function is equivalent to fprintf, except that the output is written into
an array (specified by argument s) rather than to a stream. If n is zero, nothing is written,
and s may be a null pointer. Otherwise, output characters beyond the n-1st are
discarded rather than being written to the array, and a null character is written at the end
of the characters actually written into the array. If copying takes place between objects
that overlap, the behavior is undefined.
Returns
3 The snprintf function returns the number of characters that would have been written
had n been sufficiently large, not counting the terminating null character, or a negative
value if an encoding error occurred. Thus, the null-terminated output has been
completely written if and only if the returned value is nonnegative and less than n.
Look at the documentation of snprintf. If you pass NULL for the destination and 0 for the size, it will return the number of bytes needed. So you do that first, malloc the memory, and do another snprintf with the right size.
All the printf functions return the number of bytes printed (excluding a trailing zero), except snprintf will return the number of characters that would have been printed if the length was unlimited.
Quote from here:
If the resulting string would be longer than n-1 characters, the
remaining characters are discarded and not stored, but counted for the
value returned by the function.
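A minimal sketch of the two-pass approach described above (make_name is a hypothetical helper of my own):
#include <stdio.h>
#include <stdlib.h>

char *make_name(int idNumber) {
    /* Pass 1: with NULL/0, snprintf only reports the length it would need. */
    int needed = snprintf(NULL, 0, "number_%d", idNumber);
    if (needed < 0)
        return NULL;                           /* encoding error */
    char *name = malloc((size_t) needed + 1);  /* +1 for the '\0' */
    if (name == NULL)
        return NULL;
    /* Pass 2: the buffer is now exactly big enough. */
    snprintf(name, (size_t) needed + 1, "number_%d", idNumber);
    return name;                               /* caller must free() it */
}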
To use a right-sized buffer, calculate its maximum needs.
#include <limits.h>
#include <stdio.h>

#define INT_PRINT_SIZE(i) ((sizeof(i) * CHAR_BIT)/3 + 3)

void testing(int idNumber) {
    const char format[] = "number_%d";
    char name[sizeof format + INT_PRINT_SIZE(idNumber)];
    snprintf(name, sizeof(name), format, idNumber);
}
This approach assumes the C locale. A more robust solution could use
...
int cnt = snprintf(name, sizeof(name), format, idNumber);
if (cnt < 0 || (size_t) cnt >= sizeof(name)) Handle_EncodingError_SurprisingLocale();
Akin to https://stackoverflow.com/a/26497268/2410359

Converting non-printable ASCII characters to binary

I am trying to convert a string of non-printable ASCII characters to binary. Here is the code:
#include <stdio.h>

int main(int argc, char *argv[])
{
    char str[32];
    sprintf(str, "\x01\x00\x02");
    printf("\n[%x][%x][%x]", str[0], str[1], str[2]);
    return 1;
}
I expect the output to be [1][0][2], but it prints [1][0][4].
What am I doing wrong here?
The sprintf operation ended at the first instance of \x00 in your string literal, because NUL (U+0000) terminates strings in C. (That the compiler does not complain when you write \x00 inside a string literal is arguably a misfeature of the language.) Thus str[2] accesses uninitialized memory and the program is entitled to print complete nonsense or even crash.
To do what you wanted to do, simply eliminate the sprintf:
#include <stdio.h>

int main(void)
{
    static const unsigned char str[32] =
        { 0x01, 0x00, 0x02 }; // will be zero-filled to the declared size
    printf("[%02x][%02x][%02x]\n", str[0], str[1], str[2]);
    return 0;
}
(Binary data should always be stored in arrays of unsigned char, not plain char; or uint8_t if you have it. Because U+0000 terminates strings, I think it's better style to write embedded binary data using an array literal rather than a string literal; but it is more typing. The static const is just because the data is never modified and known at compile time; the program would work without it. Don't declare argc and argv if you're not going to use them. Return zero, not one, from main to indicate successful completion.)
(Using sprintf the way you were using it is a bad idea for other reasons: for instance, if your binary block contained \x25 (also known as % in ASCII), it would try to read additional arguments-to-be-formatted, and again print complete nonsense or crash. If you have a good reason to not just use static initialized data, the right way to copy blocks of binary data around is memcpy.)
C strings end with a null byte, so sprintf only reads until \x00. Instead, you can use memcpy (sketched below) or simply initialize with
char str[32] = "\x01\x00\x02";
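A minimal memcpy sketch of that alternative (my example):
#include <stdio.h>
#include <string.h>

int main(void) {
    char str[32];
    /* memcpy copies exactly the byte count you give it; it never stops at '\0'. */
    memcpy(str, "\x01\x00\x02", 3);
    printf("[%x][%x][%x]\n", str[0], str[1], str[2]);
    return 0;
}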
"\x00" terminates the format string which is the 2nd argument of the sprint() prematurely. Obviously that was unintentional but there is no ways sprint() can figure out that the first NUL is not the last NUL. So the format string it works on is actually shorter than what you intended to pass.

UTF8 support on cross platform C application

I am developing a cross-platform C (C89 standard) application which has to deal with UTF-8 text. All I need is basic string manipulation functions like substr, first, last, etc.
Question 1
Is there a UTF-8 library that has the above functions implemented? I have already looked into ICU and it is too big for my requirement. I just need to support UTF-8.
I have found a UTF-8 decoder here. The following function prototypes are from that code.
void utf8_decode_init(char p[], int length);
int utf8_decode_next();
The initialization function takes a character array, but utf8_decode_next() returns an int. Why is that? How can I print the characters this function returns using standard functions like printf? The function is dealing with character data, so how can that be assigned to an integer?
If the above decoder is not good for production code, do you have a better recommendation?
Question 2
I also got confused by reading articles that say that for Unicode you need to use wchar_t. From my understanding this is not required, as normal C strings can hold UTF-8 values. I have verified this by looking at the source code of SQLite and git. SQLite has the following typedef.
typedef unsigned char u8;
Is my understanding correct? Also why is unsigned char required?
The utf8_decode_next() function returns the next Unicode code point. Since Unicode is a 21-bit character set, it cannot return anything smaller than an int, and it can be argued that, technically, it should be a long, since an int could be a 16-bit quantity. Effectively, the function returns a UTF-32 character.
You would need to look at the C94 wide character extensions to C89 to print wide characters (wprintf(), <wctype.h>, <wchar.h>). However, wide characters alone are not guaranteed to be UTF-8 or even Unicode. You most probably cannot print the characters from utf8_decode_next() portably, but it depends on what your portability requirements are. The wider the range of systems you must port to, the less chance there is of it all working simply. To the extent you can write UTF-8 portably, you would send the UTF-8 string (not an array of the UTF-32 characters obtained from utf8_decode_next()) to one of the regular printing functions. One of the strengths of UTF-8 is that it can be manipulated by code that is largely ignorant of it.
You need to understand that a 4-byte wchar_t can hold any Unicode codepoint in a single unit, but that UTF-8 can require between one and four 8-bit bytes (1-4 units of storage) to hold a single Unicode codepoint. On some systems, I believe wchar_t can be a 16-bit (short) integer. In this case, you are forced into using UTF-16, which encodes Unicode codepoints outside the Basic Multilingual Plane (BMP, code points U+0000 .. U+FFFF) using two storage units and surrogates.
Using unsigned char makes life easier; plain char is often signed. Having negative numbers makes life more difficult than it needs to be (and, believe me, it is difficult enough without adding complexity).
You do not need any special library routines for character or substring search with UTF-8. strstr does everything you need. That's the whole point of UTF-8 and the design requirements it was invented to meet.
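For instance, a quick sketch (my example, assuming the source file and the execution character set are both UTF-8):
#include <stdio.h>
#include <string.h>

int main(void) {
    /* strstr searches byte-wise; a valid UTF-8 needle matches exactly. */
    const char *haystack = "привет мир";
    const char *hit = strstr(haystack, "мир");
    if (hit)
        printf("found at byte offset %ld\n", (long) (hit - haystack));
    return 0;
}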
GLib has quite a few relevant functions, and can be used independent of GTK+.
There are over 100,000 characters in Unicode. There are 256 possible values of char in most C implementations.
Hence, UTF-8 uses more than one char to encode each character, and the decoder needs a return type which is larger than char.
wchar_t is a larger type than char (well, it doesn't have to be larger, but it usually is). It represents the characters of the implementation-defined wide character set. On some implementations (most importantly, Windows, which uses surrogate pairs for characters outside the "basic multilingual plane"), it still isn't big enough to represent any Unicode character, which presumably is why the decoder you reference uses int.
You can't print wide characters using printf, because it deals in char. wprintf deals in wchar_t, so if the wide character set is Unicode, and if wchar_t is int-sized on your system (as it is on Linux), then wprintf and friends will print the decoder output without further processing. Otherwise it won't.
In any case, you cannot portably print arbitrary Unicode characters, because there's no guarantee that the terminal can display them, or even that the wide character set is in any way related to Unicode.
SQLite has probably used unsigned char so that:
they know the signedness: it's implementation-defined whether char is signed or not.
they can do right-shifts and assign out-of-range values, and get consistent, defined results across all C implementations. Implementations have more freedom in how signed char behaves than in how unsigned char behaves; see the sketch below.
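A quick illustration of that signedness pitfall (my example; whether plain char is signed is implementation-defined):
#include <stdio.h>

int main(void) {
    char c = '\xC3';         /* negative on platforms where char is signed */
    unsigned char u = 0xC3;
    /* On signed-char platforms this prints "0 1": the first test fails. */
    printf("%d %d\n", c >= 0xC0, u >= 0xC0);
    return 0;
}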
Normal C strings are fine for storing UTF-8 data, but you can't easily search for a substring in your UTF-8 string. This is because a character encoded as a sequence of bytes using the UTF-8 encoding can be anywhere from one to four bytes long, depending on the character. That is, a "character" is not equivalent to a "byte" in UTF-8 the way it is in ASCII.
In order to do substring searches etc., you will need to decode the string to some internal format used to represent Unicode characters and then do the substring search on that. Since there are far more than 256 Unicode characters, a byte (or char) is not enough. That's why the library you found uses ints.
As for your second question, it's probably just because it does not make sense to talk about negative characters, so they may as well be specified as "unsigned".
I have implemented substr and length functions which support UTF-8 characters. This code is a modified version of what SQLite uses.
The following macro loops through the input text and skips all multi-byte sequences. The if condition checks whether the current byte starts a multi-byte sequence, and the inner loop advances input until it finds the next lead byte.
#define SKIP_MULTI_BYTE_SEQUENCE(input) { \
    if( (*(input++)) >= 0xc0 ) { \
        while( (*input & 0xc0) == 0x80 ){ input++; } \
    } \
}
substr and length are implemented using this macro.
typedef unsigned char utf8;
substr
void substr(const utf8 *string,
            int start,
            int len,
            utf8 **substring)
{
    int bytes, i;
    const utf8 *str2;
    utf8 *output;

    /* Advance to the start character (start is a 1-based index). */
    --start;
    while( *string && start ) {
        SKIP_MULTI_BYTE_SEQUENCE(string);
        --start;
    }
    /* Find the byte just past the last requested character. */
    for(str2 = string; *str2 && len; len--) {
        SKIP_MULTI_BYTE_SEQUENCE(str2);
    }
    bytes = (int) (str2 - string);
    output = *substring;
    for(i = 0; i < bytes; i++) {
        *output++ = *string++;
    }
    *output = '\0';
}
length
int length(const utf8 *string)
{
    int len;

    len = 0;
    while( *string ) {
        ++len;
        SKIP_MULTI_BYTE_SEQUENCE(string);
    }
    return len;
}
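A usage sketch for these two functions (my example; it assumes the typedef, macro, and functions above are in the same file, and the caller supplies the output buffer):
#include <stdio.h>

int main(void) {
    const utf8 *s = (const utf8 *) "привет";
    utf8 out[32];
    utf8 *p = out;
    substr(s, 1, 3, &p);   /* copy 3 characters starting at character 1 */
    printf("length: %d, substr: %s\n", length(s), (char *) out);
    return 0;
}
This should print "length: 6, substr: при" on a UTF-8 terminal.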
