wchar_t encoding on different platforms - c

I faced a problem with encodings on different platforms (in my case Windows and Linux). On Windows, the size of wchar_t is 2 bytes, whereas on Linux it's 4 bytes. How can I "standardize" wchar_t to be the same size on both platforms? Is it hard to implement without additional libraries? For now, I'm aiming at the printf/wprintf API. The data is sent via socket communication. Thank you.

If you want to send Unicode data across different platforms and architectures, I'd suggest using UTF-8 encoding and (8-bit) chars. UTF-8 has some advantages, like not having endianness issues (UTF-8 is just a plain sequence of bytes, whereas both UTF-16 and UTF-32 can be little-endian or big-endian...).
On Windows, just convert the UTF-8 text to UTF-16 at the boundary of Win32 APIs (since Windows APIs tend to work with UTF-16). You can use the MultiByteToWideChar() API for that.
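As a rough illustration of that boundary conversion (a sketch, not production code; the helper names are made up and error handling is omitted):

#include <windows.h>
#include <string>

// UTF-8 bytes (e.g. from the socket) -> UTF-16 for Win32 calls.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &utf16[0], len);
    return utf16;
}

// UTF-16 from the Win32 API -> UTF-8 for the wire.
std::string utf16_to_utf8(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                                  NULL, 0, NULL, NULL);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                        &utf8[0], len, NULL, NULL);
    return utf8;
}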

To solve this problem I think you are going to have to convert all strings into UTF-8 before transmitting. On Windows you would use the WideCharToMultiByte function to convert wchar_t strings to UTF-8 strings, and MultiByteToWideChar to convert UTF-8 strings into wchar_t strings.
On Linux things aren't as straightforward. You can use the functions wctomb and mbtowc; however, what they convert to/from depends on the underlying locale setting. So if you want them to convert to/from UTF-8 and Unicode, you'll need to make sure the locale is set to use a UTF-8 encoding.
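For example, something along these lines (a sketch; it assumes an "en_US.UTF-8" locale is actually installed on the machine):

#include <clocale>
#include <cstdlib>

int main()
{
    // What wcstombs()/mbstowcs() convert to and from depends on the locale,
    // so select a UTF-8 locale explicitly.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const wchar_t* wide = L"h\u00e9llo";      // wchar_t string with a non-ASCII character

    char utf8[64];
    std::wcstombs(utf8, wide, sizeof utf8);   // wchar_t -> UTF-8 bytes (locale-dependent)

    wchar_t back[64];
    std::mbstowcs(back, utf8, 64);            // UTF-8 bytes -> wchar_t (locale-dependent)
    return 0;
}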

Related

Converting to wide characters on z/OS

I have the following code on Linux:
rc = iconv_open("WCHAR_T", SourceCode);
prior to using iconv to convert the data into a wide character string (wchar_t).
I am now compiling this on z/OS. I do not know what value to use in place of "WCHAR_T". I have found that code pages are represented by 5-digit character strings on z/OS, e.g. codepage 500 would be "00500", so I am happy enough with what to put into my SourceCode variable above; I just can't find a value that will successfully work as the first parameter to iconv_open.
wchar_t is 4 bytes long on z/OS (when compiling 64-bit, as I am), so I assume that I would need some variant of an EBCDIC equivalent to UTF-32 or UCS-4 perhaps, but I cannot find something that works. Every combination I have tried to date has returned with an errno of 121 (EINVAL: The parameter is incorrect).
If anyone familiar with how the above code works on Linux could give a summary of what it does, that might also help. What does it mean to iconv into "WCHAR_T"? Is this perhaps a combination of some data conversion and additionally a type change to wchar_t?
Alternatively, can anyone answer the question, "What is the internal representation of wchar_t on z/OS?"
wchar_t is an implementation defined data type. On z/OS it is 2 bytes in 31-bit mode and 4 bytes in 64-bit mode.
There is no single representation of wchar_t on z/OS. The encoding associated with the wchar_t data is dependent on the locale in which the application is running. It could be an IBM-939 Japanese DBCS code page or any of the other DBCS code pages that are used in countries like China, Korea, etc.
Wide string literals and character constants, i.e. those written as L"abc" or L'x', are converted to the implementation-defined encoding used for the wchar_t data type. This encoding is locale sensitive and can be manipulated using the wide-character run-time library functions.
The conversion of multibyte string literals to wide string literals is typically done by calling one of the mbtowc run-time library functions, which respect the encoding associated with the locale in which the application is running.
iconv, on the other hand, can be used to convert any strings to any one of the supported destination code pages, including double-byte code pages or any of the Unicode formats (UTF-8, UTF-16, UTF-32). The operation of iconv is independent of the wchar_t type.
Universal coded character set converters may be the answer to your question.
The closest to Unicode on z/OS would be UTF-EBCDIC but it requires defining locales that are based on UTF-EBCDIC.
If running as an ASCII application is an option, you could use UTF-32 as the internal encoding and provide iconv converters to/from any of the EBCDIC code pages your application needs to support. This would be better served by the char32_t data type, to avoid the opacity of wchar_t.
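To illustrate the Linux behaviour the question asks about: on glibc, iconv_open("WCHAR_T", ...) asks iconv to produce output in the platform's own in-memory wchar_t layout (4-byte UTF-32 in native byte order there). A sketch, with error checks omitted; the converter names accepted on z/OS may well differ:

#include <iconv.h>
#include <cstring>

// Convert a byte string encoded in 'fromcode' (e.g. "ISO-8859-1") into wchar_t.
size_t to_wchar(const char* fromcode, const char* in, wchar_t* out, size_t outcount)
{
    iconv_t cd = iconv_open("WCHAR_T", fromcode);   // target: this platform's wchar_t
    char* inbuf = const_cast<char*>(in);
    size_t inleft = std::strlen(in);
    char* outbuf = reinterpret_cast<char*>(out);
    size_t outleft = outcount * sizeof(wchar_t);
    iconv(cd, &inbuf, &inleft, &outbuf, &outleft);  // converts as much as fits in 'out'
    iconv_close(cd);
    return outcount - outleft / sizeof(wchar_t);    // number of wide characters produced
}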

wide char string functions on linux / windows

I want to create a string library with two different string classes for handling UTF-8 and UCS-2 (which I believe is some kind of UTF-16 that doesn't handle surrogates and characters above 0xFFFF).
On Windows platforms, wide chars are 2 octets wide. On Linux they are 4. So what happens with functions related to wide char strings? Do you pass buffers of 2-octet items on Windows and 4-octet items on Linux? If so, that makes these functions totally different on Windows and Linux, which doesn't make them really "standard"...
How can one handle this problem when trying to create a library that is supposed to manipulate wide chars the same way in cross-platform code? Thank you.
You're right about the different sizes of wchar_t on Windows and Linux. That also means you're right about the wide-character handling functions not being too useful. You should probably check out an encoding conversion library such as libiconv. Then you can work with UTF-8 internally and just convert on I/O.
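If the goal is a UCS-2/UTF-16 string class with the same layout on every platform, one option is to store char16_t code units and only use the conversion library at the edges. A sketch using iconv (assuming it knows the "UTF-16LE" and "UTF-8" converter names, which common builds do; error handling omitted):

#include <iconv.h>
#include <string>

// UTF-8 in, UTF-16 code units out; std::u16string is 2 bytes per unit everywhere,
// unlike std::wstring, whose element size differs between Windows and Linux.
std::u16string utf8_to_u16(const std::string& utf8)
{
    if (utf8.empty()) return std::u16string();
    std::u16string out(utf8.size(), u'\0');          // worst case: one unit per input byte
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    char* inbuf = const_cast<char*>(utf8.data());
    size_t inleft = utf8.size();
    char* outbuf = reinterpret_cast<char*>(&out[0]);
    size_t outleft = out.size() * sizeof(char16_t);
    iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    out.resize(out.size() - outleft / sizeof(char16_t));
    return out;
}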

wchar_t vs char for creating an API

I am creating a C++ library meant to be used with different applications written in different languages like Java, C#, Delphi etc.
Every now and then I am stuck on conversions between wstrings, strings, char*, and wchar_t*. E.g. I stuck to wchar_t's but then had to use a regex library that accepts chars, among other similar problems.
I wish to stick to either wide strings or normal strings. My library will mostly deal with ASCII characters but can have non-ASCII characters too, as in names etc. So, can I permanently switch to chars instead of wchar_t's and strings instead of wstrings? Can I have Unicode support with them, and will it affect scalability and portability across different platforms and languages?
Please advise.
You need to decide which encoding to use. Some considerations:
If you can have non-ASCII characters, then there is no point in choosing ASCII or 8-bit ANSI. That way leads to disappointment and risks data loss.
It makes sense to pick one encoding and stick to it. Everywhere. The Windows API is unusual in supporting both ANSI and Unicode, but that is due to backwards compatibility with old software. If Microsoft were starting over from scratch, there would be one encoding only.
The most common choices for Unicode encoding are UTF-8 and UTF-16. Any decent environment will have support for both. Either choice may be justifiable.
Java, VB, C# and Delphi all have good support for UTF-16, and all of them use UTF-16 for their native string types (in the case of Delphi, the native string type is UTF-16 only in Delphi 2009 and later. For earlier versions, you can use the WideString string type).
Windows and the runtimes mentioned above are natively UTF-16 (*nix systems, like Linux, use UTF-8 instead), so it may well be simplest to just use UTF-16.
On the other hand, UTF-8 is probably the technically better choice, being byte oriented and backwards compatible with 8-bit ASCII. Quite likely, if Unicode were being invented from scratch, there would be no UTF-16, and UTF-8 would be the variable-length encoding.
You have phrased the question as a choice between char and wchar_t. I think that the real choice is what your preferred encoding should be. You also have to watch out that wchar_t is 16-bit (UTF-16) on some systems but 32-bit (UTF-32) on others. It is not a portable data type. That is why C++11 introduces the new char16_t and char32_t data types to correct that ambiguity.
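A quick way to see the difference (C++11 or later):

#include <cstdio>

int main()
{
    // wchar_t is 2 bytes with MSVC on Windows and 4 bytes with glibc on Linux;
    // char16_t and char32_t have the same size on every conforming compiler.
    std::printf("wchar_t: %zu  char16_t: %zu  char32_t: %zu\n",
                sizeof(wchar_t), sizeof(char16_t), sizeof(char32_t));

    const char16_t* s16 = u"always 16-bit code units";
    const char32_t* s32 = U"always 32-bit code units";
    (void)s16; (void)s32;
    return 0;
}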
The major difference between a Unicode string and a simple char string is the code page. Having just a char* pointer is not enough to understand the meaning of the string. It can be in some specific encoding, it can be multibyte, etc. A wide-character string does not have these caveats.
In many cases international aspects are not important. In this case the difference between these 2 representations is minimal. The main question that you need to answer: is internationalization important to your library or not?
Modern Windows programming should tend towards builds with UNICODE defined, and thus use wide characters and wide character APIs. This is desirable for improved performance (fewer or no conversions behind the Windows API layers), improved capabilities (sometimes the ANSI wrappers don't expose all capabilities of the wide function), and in general it avoids problems with the inability to represent characters that are not on the system's current code page (and thus in practice the inability to represent non-ASCII characters).
Where this can be difficult is when you have to interface with things that don't use wide characters. For example, while Windows APIs have wide character filenames, Linux filesystems typically use bytestrings. While those bytestrings are often UTF-8 by convention, there's little enforcement. Interfacing with other languages can also be difficult if the language in question doesn't understand wide characters at an API level. Ideally such languages have chosen a specific encoding, such as UTF-8, allowing you to convert to and from that encoding at the boundaries.
And that's one general recommendation: use Unicode internally for all processing, and convert as necessary at the boundaries. If this isn't already familiar to you, it's good to reference Joel's article on Unicode.

Unicode: How to integrate Jansson (JSON library) with ICU's special UTF-8 data types?

I've been developing a C application that expects a wide range of UTF-8 characters, so I started using the ICU library to support Unicode characters, but it seems things aren't working nicely with other libraries (mainly jansson, a JSON library).
Even though jansson claims it fully supports UTF-8, it only expects chars as parameters (IIRC, a single byte isn't enough for Unicode chars), while ICU uses a special type called UChar (a 16-bit character type, at least on my system).
Casting a Unicode character to a regular character doesn't seem like a solution to me, since casting bigger data to smaller will cause data loss. I tried casting anyway; it didn't work.
So my question would be: How can I make the two libraries work nicely together?
Get ICU to produce output in UTF-8 using toUTF8/toUTF8String. (toUTF8String gives you a std::string, so call .c_str() on it to get the char* that Jansson wants.)
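A minimal sketch of that hand-off (the string contents and key name are just examples):

#include <unicode/unistr.h>
#include <jansson.h>
#include <cstdlib>
#include <string>

int main()
{
    // ICU side: text held as UTF-16 UChars inside a UnicodeString.
    // "\xC3\xBC" etc. are explicit UTF-8 bytes so the literal stays portable.
    icu::UnicodeString utext = icu::UnicodeString::fromUTF8("gr\xC3\xBC\xC3\x9F" "e");

    // Convert to UTF-8 at the boundary; Jansson only speaks UTF-8 char*.
    std::string utf8;
    utext.toUTF8String(utf8);

    json_t* obj = json_object();
    json_object_set_new(obj, "greeting", json_string(utf8.c_str()));

    char* dumped = json_dumps(obj, 0);   // 0 keeps the output as raw UTF-8
    // ... send or print 'dumped' somewhere ...
    free(dumped);
    json_decref(obj);
    return 0;
}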

FastCGI and Unicode

I am wondering if FastCGI supports Unicode functions like wprintf. I receive a buffer via fread and get a char* that has Unicode characters in it; I mean bytes with values above 128. How do I process them inside the FastCGI main function? A call to mbstowcs fails.
I meant that I was using the FastCGI Developer's Kit library. A Java client is sending data encoded with UTF-8; I decoded it server side using mbstring functions in PHP, but what is the equivalent of that with gcc? Whatever it is does not seem to work inside the FastCGI main function. I looked at fastcgi++ but I don't know how much it is used and how stable it is. Further, I don't find lugging in a huge library like Boost justified for a small utility.
If you need Unicode, use UTF-8 and not "wide" characters. They are much more suitable for the web.
FastCGI is a protocol, not an API, so it depends on the library you are choosing. Yes, see this library:
http://www.nongnu.org/fastcgipp/
I receive a buffer via fread and get a char* that has Unicode characters in it. I mean bytes with values above 128.
Unicode is not the only encoding that has bytes above 128 in it. In fact, most other encodings do. You need to find out which encoding is exactly used in your web application. In any case, I don't think wprintf is going to be useful to you in any way.
FastCGI expects a byte stream, so your safest bet is UTF-8.
That being said, the program should work with non-UTF-8 input; iconv is ideal for this.
You can use wide-character formatting, but only in the form of swprintf into a buffer, after which the buffer contents are converted to UTF-8 and then written to the correct FCGI stream.
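A sketch of that flow using the FastCGI Developer's Kit (fcgiapp.h); it leans on a UTF-8 locale being available for the wcstombs() step, and error handling is left out:

#include <fcgiapp.h>
#include <clocale>
#include <cstdlib>
#include <cwchar>

int main()
{
    std::setlocale(LC_ALL, "en_US.UTF-8");   // so wcstombs() produces UTF-8 bytes

    FCGX_Init();
    FCGX_Request request;
    FCGX_InitRequest(&request, 0, 0);

    while (FCGX_Accept_r(&request) >= 0) {
        wchar_t wide[256];
        std::swprintf(wide, 256, L"hello, %ls", L"w\u00f6rld");   // wide-character formatting

        char utf8[1024];
        size_t n = std::wcstombs(utf8, wide, sizeof utf8);        // wide -> UTF-8 bytes

        FCGX_PutS("Content-Type: text/plain; charset=utf-8\r\n\r\n", request.out);
        FCGX_PutStr(utf8, (int)n, request.out);                   // raw bytes to the FCGI stream
        FCGX_Finish_r(&request);
    }
    return 0;
}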
