Converting UTF-8 text to wchar_t in C

I know this question has been asked quite a few times here, and I did read some of the answers, but there are several suggested solutions and I'm trying to figure out the best of them.
I'm writing a C99 app that basically receives XML text encoded in UTF-8.
Part of its job is to copy and manipulate that string (finding a substring, concatenating strings, etc.).
As I would rather not use an outside, non-standard library right now, I'm trying to implement this using wchar_t.
Currently, I'm using mbstowcs to convert the text to wchar_t for easy manipulation, and for the inputs I tried in different languages it worked fine.
Thing is, I read that some people have had issues with UTF-8 and mbstowcs, so I would like to hear whether this use is permitted/acceptable.
The other option I considered was using iconv with the WCHAR_T parameter. Thing is, I'm working on a platform (not a PC) whose locale support is very limited, basically only the ANSI C locale. How about that?
I also came across some popular C++ libraries, but I'm limited to a C99 implementation.
Also, I will be compiling this code on another platform, where the size of wchar_t is different (2 bytes versus 4 bytes on my machine). How can I overcome that? Using fixed-size character containers? But then, which manipulation functions should I use instead?
Happy to hear some thoughts. Thanks.

C does not define what encodings the char and wchar_t types use; the standard library only mandates some functions that translate between the two without saying how. If the implementation-defined encoding of char is not UTF-8, then feeding UTF-8 bytes to mbstowcs will result in data corruption.
As noted in the rationale for the C99 standard:
However, the five functions are often too restrictive and too primitive to develop portable international programs that manage characters.
...
C90 deliberately chose not to invent a more complete multibyte- and wide-character library, choosing instead to await their natural development as the C community acquired more experience with wide characters.
Sourced from the C99 Rationale document.
So, if you have UTF-8 data in your chars there isn't a standard API way to convert that to wchar_ts.
In my opinion, wchar_t should usually be avoided unless necessary - you might need it if you're using Win32 APIs, for example. I am not convinced it will simplify string manipulation: wchar_t is UTF-16 on Windows, so you may still need more than one wchar_t to represent a single Unicode code point anyway.
I suggest you investigate the ICU project - at least from an educational standpoint.
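For illustration, here is a minimal sketch of the locale-dependent route discussed above; it only behaves as hoped when the C runtime actually provides a UTF-8 locale (the locale name below is an assumption and varies by platform):
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* mbstowcs interprets the input bytes according to LC_CTYPE, so this
       only works where a UTF-8 locale exists - not on a platform that is
       restricted to the plain "C" locale. */
    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
        fputs("no UTF-8 locale available\n", stderr);
        return 1;
    }

    const char *utf8 = "caf\xC3\xA9";  /* "café" encoded as UTF-8 bytes */
    wchar_t wide[32];
    size_t n = mbstowcs(wide, utf8, sizeof wide / sizeof wide[0]);
    if (n == (size_t)-1) {
        fputs("invalid multibyte sequence\n", stderr);
        return 1;
    }
    printf("converted %zu wide characters\n", n);
    return 0;
}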

Also, I will be compiling this code on another platform, where the size of wchar_t is different (2 bytes versus 4 bytes on my machine). How can I overcome that? Using fixed-size char containers?
You could do that with conditional typedefs like this:
#include <stddef.h>
#include <stdint.h>

#if defined(__STDC_UTF_16__) || defined(__STDC_UTF_32__)
#include <uchar.h>   /* char16_t, char32_t (C11) */
#endif

#if defined(__STDC_UTF_16__)
typedef char16_t CHAR16;
#elif defined(_WIN32)
typedef wchar_t CHAR16;   /* wchar_t is 16 bits on Windows */
#else
typedef uint16_t CHAR16;
#endif

#if defined(__STDC_UTF_32__)
typedef char32_t CHAR32;
#elif defined(__STDC_ISO_10646__)
typedef wchar_t CHAR32;   /* wchar_t holds ISO 10646 code points */
#else
typedef uint32_t CHAR32;
#endif
This will define the typedefs CHAR16 and CHAR32 to use the new C11 character types (char16_t and char32_t from <uchar.h>) if available, but otherwise fall back to wchar_t when possible and to fixed-width unsigned integers otherwise.

Related

C11 Unicode Support

I am writing some string conversion functions similar to atoi() or strtoll(). I wanted to include a version of my function that would accept a char16_t* or char32_t* instead of just a char* or wchar_t*.
My function works fine, but as I was writing it I realized that I do not understand what char16_t or char32_t are. I know that the standard only requires that they are an integer type of at least 16 or 32 bits respectively but the implication is that they are UTF-16 or UTF-32.
I also know that the standard defines a couple of functions, but they did not include any *get or *put functions (like they did when they added wchar.h in C99).
So I am wondering: what do they expect me to do with char16_t and char32_t?
That's a good question with no apparent answer.
The uchar.h types and functions added in C11 are largely useless. They only support conversions between the new type (char16_t or char32_t) and the locale-specific, implementation-defined multibyte encoding, mappings which are not going to be complete unless the locale is UTF-8 based. The useful conversions (to/from wchar_t, and to/from UTF-8) are not supported. Of course you can roll your own for conversions to/from UTF-8 since these conversions are 100% specified by the relevant RFCs/UCS/Unicode standards, but be careful: most people implement them wrong and have dangerous bugs.
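For reference, a sketch of one such hand-rolled conversion step, UTF-8 to UTF-32 (the function name is mine; it rejects overlong forms, surrogates and out-of-range values, returning 0 for invalid input):
#include <stddef.h>
#include <uchar.h>

/* Decodes one UTF-8 sequence from a NUL-terminated string.
   Returns the number of bytes consumed, or 0 on invalid input. */
size_t u8_decode(const unsigned char *s, char32_t *out)
{
    if (s[0] < 0x80) { *out = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *out = ((char32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return (*out >= 0x80) ? 2 : 0;           /* reject overlong forms */
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *out = ((char32_t)(s[0] & 0x0F) << 12) |
               ((char32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return (*out >= 0x800 && (*out < 0xD800 || *out > 0xDFFF)) ? 3 : 0;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *out = ((char32_t)(s[0] & 0x07) << 18) |
               ((char32_t)(s[1] & 0x3F) << 12) |
               ((char32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return (*out >= 0x10000 && *out <= 0x10FFFF) ? 4 : 0;
    }
    return 0;
}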
Note that the new compiler-level features for UTF-8, UTF-16, and UTF-32 literals (u8, u, and U, respectively) are potentially useful; you can process the resulting strings with your own functions in meaningful ways that don't depend at all on locale. But the library-level support for Unicode in C11 is, in my opinion, basically useless.
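For example, the literal forms give you data whose encoding is fixed by the language rather than by the locale (a small sketch; whether the terminal renders the UTF-8 output correctly is another matter):
#include <stdio.h>
#include <uchar.h>

int main(void)
{
    const char     *s8  = u8"h\u00E9llo";  /* always UTF-8   */
    const char16_t *s16 = u"h\u00E9llo";   /* always UTF-16  */
    const char32_t *s32 = U"h\u00E9llo";   /* always UTF-32  */

    /* UTF-32 makes per-code-point processing trivial: one unit per point. */
    size_t points = 0;
    for (const char32_t *p = s32; *p; ++p)
        points++;

    size_t units16 = 0;
    for (const char16_t *q = s16; *q; ++q)
        units16++;

    printf("%s: %zu code points, %zu UTF-16 units\n", s8, points, units16);
    return 0;
}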
Testing whether a UTF-16 or UTF-32 character in the ASCII range is one of the "usual" 10 digits, + or -, or a "normal" white-space is easy, as is converting '0'-'9' to a digit value. Given that, atoi_utf16/32() proceeds like atoi(): simply inspect one character at a time, as in the sketch after the steps below.
Testing whether some other UTF-16/UTF-32 character is a digit or white-space is harder. Code would need an extended isspace()/isdigit(), which can be had by switching locales (setlocale()) if the needed locale is available. (Note: you likely need to restore the locale when the function is done.)
Converting a character that passes isdigit() but is not one of the usual 10 to its value is problematic. Anyway, that appears not even to be allowed.
Conversion steps:
Set the locale to one corresponding to UTF-16/UTF-32.
Use isspace() for white-space detection.
Convert in a similar fashion for your_atof().
Restore the locale.
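As a concrete sketch of the easy, ASCII-range case (the function name and scope are mine; no overflow handling, and the locale is not involved at all):
#include <uchar.h>

/* atoi for char32_t strings, ASCII range only: the "usual" ten digits,
   an optional sign, and common ASCII white-space. */
long atoi_c32(const char32_t *s)
{
    while (*s == U' '  || *s == U'\t' || *s == U'\n' ||
           *s == U'\v' || *s == U'\f' || *s == U'\r')
        s++;

    long sign = 1;
    if (*s == U'+' || *s == U'-') {
        if (*s == U'-')
            sign = -1;
        s++;
    }

    long value = 0;
    while (*s >= U'0' && *s <= U'9') {
        value = value * 10 + (long)(*s - U'0');
        s++;
    }
    return sign * value;
}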
This question may be a bit old, but I'd like to touch on implementing your functions with char16_t and char32_t support.
The easiest way to do this is to write your strtoull function using the char32_t type (call it something like strtoull_c32). This makes parsing Unicode easier because every code point in UTF-32 occupies a single 32-bit unit. Then implement strtoull_c16 and strtoull_c8 by internally converting both UTF-8 and UTF-16 encodings to UTF-32 and passing them to strtoull_c32.
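For instance, the UTF-16 leg of that design needs a small decoding step that folds surrogate pairs into single code points before delegating to the UTF-32 worker; a rough sketch (the function name is hypothetical):
#include <uchar.h>

/* Reads one code point from a NUL-terminated UTF-16 string and advances
   the cursor. Lone surrogates are passed through unvalidated here; real
   code should treat them as errors. */
static char32_t next_cp_utf16(const char16_t **ps)
{
    const char16_t *s = *ps;
    char32_t cp = s[0];
    if (cp >= 0xD800 && cp <= 0xDBFF &&
        s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        /* Surrogate pair: combine the high and low halves. */
        cp = 0x10000 + ((cp - 0xD800) << 10) + (char32_t)(s[1] - 0xDC00);
        s += 2;
    } else {
        s += 1;
    }
    *ps = s;
    return cp;
}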
I honestly haven't looked at the Unicode facilities in the C11 standard library, but if they don't provide a suitable way for converting those types to UTF-32 then you can use a third party library to make the conversion for you.
There's ICU, which was started by IBM and then adopted by the Unicode Consortium. It's a very feature-rich and stable library that's been around for a long time.
I recently started a UTF library (UTFX) for C89 that you could use for this too. It's pretty simple and lightweight, unit-tested and documented. You could give it a go, or use it to learn more about how UTF conversions work.

Usage of SafeStr in C

I am reading about the use of safe strings at the following location:
https://www.securecoding.cert.org/confluence/pages/viewpage.action?pageId=5111861
The following is mentioned there:
SafeStr strings, when used properly, can eliminate many of these errors and provide backward compatibility to legacy code as well.
My question is: what does the author mean by "provide backward compatibility to legacy code as well"? Please explain with an example.
Thanks for your time and help.
It means that functions from the standard libc (and elsewhere) which expect plain, null-terminated char arrays will work even on SafeStr strings. This is probably achieved by putting a control structure at a negative offset from the start of the string (or some similar trick).
Examples: strcmp(), printf(), etc. can be used directly on the strings returned by SafeStr.
In contrast, there are other string libraries for C which are very "smart" and dynamic, but whose strings cannot be passed to "old school" functions without conversion.
From that page:
The library is based on the safestr_t type which is completely compatible with char *. This allows casting of safestr_t structures to char *.
That's some backward compatibility with all the existing code that takes char * or const char * pointers.
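To illustrate the negative-offset trick (an illustrative layout, not SafeStr's actual one): a header placed just before the character data lets the same pointer be handed to legacy functions while the library finds its metadata behind it.
#include <stdlib.h>
#include <string.h>

struct hdr {
    size_t len;   /* metadata lives just before the character data */
    size_t cap;
};

char *mystr_new(const char *src)
{
    size_t len = strlen(src);
    struct hdr *h = malloc(sizeof *h + len + 1);
    if (h == NULL)
        return NULL;
    h->len = len;
    h->cap = len + 1;
    char *s = (char *)(h + 1);   /* data starts right after the header */
    memcpy(s, src, len + 1);
    return s;   /* a plain char*: callers can pass it to strcmp(), printf()... */
}

size_t mystr_len(const char *s)   /* only valid for mystr_new() strings */
{
    return ((const struct hdr *)s - 1)->len;   /* negative offset */
}

void mystr_free(char *s)
{
    free((struct hdr *)s - 1);
}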

When to use typedef'd data types versus C's built-in standard data types

gcc 4.7.2
c89
Hello,
I am using the Apache Portable Runtime and looking at their typedefs:
typedef short apr_int16_t;
typedef int apr_int32_t;
typedef size_t apr_size_t; /* This is basically the same, so what's the point? */
etc.
So what is the point of all this?
When should you decide to use C's built-in standard data types, and when should you use typedef'd data types?
I just gave an example using the APR, but I am also speaking generally. There is also the stdint.h header file that typedefs data types.
Many thanks for any suggestions,
In my opinion, it is better to have custom-defined data types for the system's native data types, as it helps in clearly distinguishing the size of the types.
For example: a long may be 32 or 64 bits depending on the machine your code runs on and the way it has been built. But if your code specifically needs a 64-bit variable, then naming the type int_64_t or something similar always makes the size clear.
In such cases, the code can be written as:
#if defined(_64BIT_)
typedef long int_64_t;
#else
typedef long long int_64_t;
#endif
But as suggested by Mehrdad, don't use it "just for kicks". : )
Great question.
So what is the point of all this?
It's meant to be for abstraction, but like anything else, it is sometimes misused/overused.
Sometimes it's necessary for backwards compatibility (e.g. typedef void VOID in Windows), sometimes it's necessary for proper abstraction (e.g. typedef unsigned int size_t), and sometimes it's completely pointless logically, but makes typing easier (e.g. typedef char const *LPCSTR in Windows).
When should you decide to use C's built-in standard data types or typedef'd data types?
If it makes something easier, or if it implements a proper abstraction barrier, use it.
What exactly that means is something you'll just have to learn over time.
But don't use it "just for kicks"!
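As a rough illustration of that rule of thumb (a sketch using the standard stdint.h/inttypes.h types):
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int     n   = 42;           /* "just a number": a built-in type is fine  */
    int32_t pkt = 0x12345678;   /* a wire-format field needs exactly 32 bits */
    size_t  len = sizeof pkt;   /* size of an object: use the abstraction    */

    printf("%d, %" PRId32 ", %zu\n", n, pkt, len);
    return 0;
}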

How to convert argv to wide chars in Win32 command line application?

I'm using the Win32 API in my C program to read from a serial port, which seems to be pretty low-level stuff. Assuming that there is no better way of reading from a serial port, the CreateFile function takes an LPCWSTR argument; from what I've read, it looks like LPCWSTR is a wchar_t pointer type. Firstly, I don't really understand the difference between wchar and char. I've read about ANSI and Unicode, but I don't really know how it applies to my situation.
My program uses a main function, not wmain, and needs to get an argument from the command line and store it in a wchar_t variable. Now, I know I could do this if I just made the string up on the spot:
wchar_t variable[1024];
swprintf(variable,1024,L"%s",L"randomstringETC");
Because it looks like the L converts char arrays to wchar arrays. However, it does not work when I do:
wchar_t variable[1024];
swprintf(variable,1024,L"%s",Largv[1]);
obviously, because it's a syntax error. I guess my question is: is there an easy way to convert normal strings to wchar_t strings?
Or is there a way to avoid this Unicode stuff completely and read from the serial port another way using C on Windows?
There is no Win32 API function named CreateFile. There are CreateFileW and CreateFileA; CreateFile is a macro that maps to one of these real functions depending on whether the UNICODE macro is defined. CreateFileW takes an LPCWSTR (aka const wchar_t*), CreateFileA takes an LPCSTR (aka const char*).
If you are not ready yet to move to Unicode then simply use the CreateFileA() function explicitly. Or change the project setting: Project + Properties, General, Character Set. There's a non-zero cost: the underlying operating system is entirely Unicode based, so CreateFileA() goes through a translation layer that turns the const char* into a const wchar_t* according to the current system code page.
The L thing is only for string literals.
You need to convert the argv string (plain char) to wide characters, using something like the C runtime's mbstowcs() function.
MultiByteToWideChar can be used to map from ANSI to Unicode. To do your swprintf call, first convert the argument and then format from the wide buffer:
WCHAR wideArg[256] = {0};
MultiByteToWideChar(CP_ACP, 0, argv[1], -1, wideArg, _countof(wideArg));
WCHAR lala[256] = {0};
swprintf(lala, _countof(lala), L"%s", wideArg);
It is possible to avoid Unicode by compiling your application against a multi-byte character set, but it's bad practice to do this unless you have legacy reasons. Windows will need to convert the string back to Unicode at some point anyway, because that is the encoding of the underlying OS.
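Putting it together, a minimal sketch of converting argv[1] and handing it to the wide API (the port name, e.g. "COM3", is assumed to arrive as the first argument):
#include <windows.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    /* CP_ACP = the current ANSI code page, which is what main() received.
       -1 means the source is NUL-terminated; returns 0 on failure. */
    wchar_t wide[1024];
    if (MultiByteToWideChar(CP_ACP, 0, argv[1], -1, wide, 1024) == 0) {
        fprintf(stderr, "conversion failed: %lu\n", GetLastError());
        return 1;
    }

    HANDLE h = CreateFileW(wide, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFileW failed: %lu\n", GetLastError());
        return 1;
    }
    /* ... configure with SetCommState, read with ReadFile ... */
    CloseHandle(h);
    return 0;
}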

Why weren't new (bit-width-specific) printf() format option strings adopted as part of C99?

While researching how to do cross-platform printf() format strings in C (that is, taking into account the number of bits I expect each integer argument to printf() to have), I ran across this section of the Wikipedia article on printf(). The article discusses non-standard options that can be passed to printf() format strings, such as (what seems to be a Microsoft-specific extension):
printf("%I32d\n", my32bitInt);
It goes on to state that:
ISO C99 includes the inttypes.h header file that includes a number of macros for use in platform-independent printf coding.
... and then lists a set of macros that can be found in said header. Looking at the header file, to use them I would have to write:
printf("%"PRId32"\n", my32bitInt);
My question is: am I missing something? Is this really the standard C99 way to do it? If so, why? (Though I'm not surprised that I have never seen code that uses the format strings this way, since it seems so cumbersome...)
The C Rationale seems to imply that <inttypes.h> is standardizing existing practice:
<inttypes.h> was derived from the header of the same name found on several existing 64-bit systems.
but the remainder of the text doesn't say more about those macros, and I don't remember them being existing practice at the time.
What follows is just speculation, but educated by experience of how standardization committees work.
One advantage of the C99 macros over standardizing additional format specifiers for printf (note that C99 did also add some) is that providing <inttypes.h> and <stdint.h> - when you already have an implementation supporting the required features in an implementation-specific way - is just a matter of writing two files with adequate typedefs and macros. That reduces the cost of making an existing implementation conformant, reduces the risk of breaking existing programs which relied on implementation-specific features (the standard way doesn't interfere), and facilitates porting conformant programs to implementations that don't have these headers (they can be provided by the program itself). Additionally, if the implementation-specific ways already varied at the time, it doesn't favor one implementation over another.
Correct, this is how the C99 standard says you should use them. If you want truly portable code that is 100% standards-conformant to the letter, you should always print an int using "%d" and an int32_t using "%" PRId32.
Most people won't bother, though, since there are very few cases where failure to do so would matter. Unless you're porting your code to Win16 or DOS, you can assume that sizeof(int32_t) <= sizeof(int), so it's harmless to accidentally printf an int32_t as an int. Likewise, a long long is pretty much universally 64 bits (although it is not guaranteed to be so), so printing an int64_t as a long long (e.g. with a %llx specifier) is safe as well.
The types int_fast32_t, int_least32_t, et al are hardly ever used, so you can imagine that their corresponding format specifiers are used even more rarely.
You can always cast upwards and use %jd which is the intmax_t format specifier.
printf("%jd\n", (intmax_t)(-2));
I used intmax_t to show that any intXX_t can be used, but simply casting to long is much better for the int32_t case; then use %ld.
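Both approaches side by side (a small sketch):
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t a = -2;
    int64_t b = 1234567890123LL;

    printf("%" PRId32 " %" PRId64 "\n", a, b);  /* exact-width macros        */
    printf("%jd\n", (intmax_t)a);               /* widen to intmax_t         */
    printf("%ld\n", (long)a);                   /* int32_t always fits long  */
    return 0;
}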
I can only speculate about why. I like AProgrammer's answer above, but there's one aspect overlooked: what are you going to add to printf as a format modifier? There are already two different ways that numbers are used in a printf format string (width and precision). Adding a third kind of number to say how many bits of precision are in the argument would be great, but where are you going to put it without confusing people? Unfortunately, one of the flaws in C is that printf was not designed to be extensible.
The macros are awful, but when you have to write code that is portable across 32-bit and 64-bit platforms, they are a godsend. Definitely saved my bacon.
I think the answer to your question why is either
Nobody could think of a better way to do it, or
The standards committee couldn't agree on anything they felt was clearly better.
Another possibility: backward compatibility. If you add more format specifiers to printf, or additional options, it is possible that a specifier in some pre-C99 code would have a format string interpreted differently.
With the C99 change, you're not changing the functionality of printf.
