String marshalling with marshal_as and encodings - winforms

Converting between String^ and std::string is very easy using marshal_as. However, I have nowhere found a description of how encodings in such a string are handled. String^ uses UTF-16, but what about std::string? The text in it can be interpreted in various ways, and it would be very useful if the marshalling converted to an encoding that is native to your application.
In my case all std::string instances contain UTF-8 encoded text. So how would I tell marshal_as to give me a UTF-8 encoded variant of the original String^ (and vice versa)?

I agree that the documentation is lacking. Without proper documentation we are programming by coincidence. marshal_as can be very useful, but when I have a question that isn't answered in the documentation, I just skip it and do the conversion in multiple explicit steps. Someone may have an accurate answer about how marshal_as behaves in each case, but unless you add it to your code as a comment, the next programmer isn't going to think of the issue or understand it, even after checking the documentation.
The BCL is very capable of converting characters. I suggest using a member of the Encoding class to call GetBytes and then copying the bytes into a C or C++ string data structure/class. Despite requiring more steps, it is then clear which character sets and encodings you are using, how mismatches are handled, how string ownership can be transferred, and how it should be destroyed. (Mismatches are, of course, not applicable when converting between UTF-16 and UTF-8.)
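As a minimal sketch of that multi-step approach in C++/CLI (the helper names ToUtf8 and FromUtf8 are purely illustrative, not part of any library):

    #include <string>

    // Sketch only: explicit String^ <-> UTF-8 std::string conversion via
    // System::Text::Encoding, so the chosen encoding is visible in the code.
    std::string ToUtf8(System::String^ managed)
    {
        array<unsigned char>^ bytes = System::Text::Encoding::UTF8->GetBytes(managed);
        if (bytes->Length == 0)
            return std::string();
        pin_ptr<unsigned char> pinned = &bytes[0];   // pin so the GC cannot move the array
        return std::string(reinterpret_cast<char*>(pinned), bytes->Length);
    }

    System::String^ FromUtf8(const std::string& utf8)
    {
        array<unsigned char>^ bytes = gcnew array<unsigned char>(static_cast<int>(utf8.size()));
        for (int i = 0; i < bytes->Length; ++i)
            bytes[i] = static_cast<unsigned char>(utf8[i]);
        // GetString interprets the bytes as UTF-8 and yields a UTF-16 String^.
        return System::Text::Encoding::UTF8->GetString(bytes);
    }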

Related

What does character encoding in C programming language depend on?

What does character encoding in C programming language depend on? (OS? compiler? or editor?)
I'm working not only with ASCII characters but also with characters from other encodings such as UTF-8.
How can we check the current character encoding in C?
The C source code might be stored in different encodings. This is clearly compiler dependent (i.e. a compiler setting, if available). However, I wouldn't count on it and would always stick to ASCII only. (IMHO this is the most portable way to write code.)
Actually, you can encode any character of any encoding using only ASCII in C source code if you write it as octal or hex escape sequences. (This is what I do from time to time to earn the respect of my colleagues – writing German texts with \303\244, \303\266, \303\274, \303\237 in translation tables from memory...)
Example: "\303\274" encodes the UTF-8 sequence for the string constant "ü". (But if I print this on my Windows console I only get "��", although I set code page 65001, which should provide UTF-8. The damn Windows console...)
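To make the trick concrete, here is a tiny sketch; the source file contains only ASCII, yet the string holds the UTF-8 bytes:

    #include <stdio.h>

    int main(void)
    {
        /* "\303\274" spells the UTF-8 bytes 0xC3 0xBC, i.e. the letter u-umlaut,
           while the source file itself stays pure ASCII. */
        const char *ue = "\303\274";
        printf("%s\n", ue);  /* renders correctly only if the output device interprets UTF-8 */
        return 0;
    }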
The program written in C may handle any encoding you are able to deal with. Actually, the characters are only numbers which can be stored as one of the available integral types (e.g. char for ASCII and UTF-8, other int types for encodings with 16 or 32 bit wide characters). As already mentioned by Clifford, the output decides what to do with these numbers. Thus, this is platform dependent.
To handle characters according to a certain encoding (e.g. converting to upper or lower case, locale-aware dictionary-like sorting, etc.) you have to use an appropriate library. This might be part of the standard libraries, the system libraries, or 3rd-party libraries.
This is especially true for conversion from one encoding to another. This is a good place to mention libintl.
I personally prefer ASCII, Unicode, and UTF-8 (and unfortunately UTF-16, as I'm doing most of my work on Windows 10). In this special case, the conversion can be done by a pure "bit-fiddling" algorithm (without any knowledge of specific characters). You may have a look at the Wikipedia article on UTF-8 to get a clue. With Google, you will probably find something ready-to-use if you don't want to do it yourself.
The standard library of C++11 and C++14 provides support as well (e.g. std::codecvt_utf8), but it is marked as deprecated in C++17. Thus, I don't need to throw away my bit-fiddling code (which I'm so proud of). Oops. This is tagged with c – sorry.
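For what it's worth, a rough sketch of such a bit-fiddling conversion in C++ (assuming well-formed input and skipping all validation of continuation bytes, overlong forms, and surrogate ranges):

    #include <cstddef>
    #include <string>

    // Decode UTF-8 byte sequences to code points and re-encode them as UTF-16.
    std::u16string Utf8ToUtf16(const std::string& in)
    {
        std::u16string out;
        for (std::size_t i = 0; i < in.size(); )
        {
            unsigned char b = static_cast<unsigned char>(in[i]);
            char32_t cp;
            std::size_t len;
            if      (b < 0x80)           { cp = b;        len = 1; }   // plain ASCII
            else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
            else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
            else                         { cp = b & 0x07; len = 4; }
            for (std::size_t k = 1; k < len; ++k)          // append the "x" bits
                cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
            i += len;

            if (cp <= 0xFFFF)                              // fits in one UTF-16 unit
                out.push_back(static_cast<char16_t>(cp));
            else                                           // needs a surrogate pair
            {
                cp -= 0x10000;
                out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
                out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
            }
        }
        return out;
    }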
It is platform or display device/framework dependent. The compiler does not care how the platform interprets either char or wchar_t when such values are rendered as glyphs on some display device.
If the output were to some remote terminal, then the rendering would be dependent on the terminal rather than the execution environment, while in a desktop computer, the rendering may be to a text console or to a GUI, and the resulting rendering may differ even between those.

UTF-8 and ISO 8859-9

I have been reading about UTF-8 and Unicode for the last couple of days, and just when I thought I had figured it all out, I got confused when I read that UTF-8 and ISO 8859-9 are not compatible.
I have a database that stores data as UTF-8. I have a requirement from a customer to support various ISO 8859-x code pages (i.e. 8859-3, 8859-2, and also ISO 6937). My questions are:
Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using Unicode?
I understand that Unicode can support all characters and it is the way to go. However, my customer is a European entity that wants us to use ISO code pages. So my question is: how can I support multiple client use cases using the existing UTF-8 data? Since ISO 8859-x is not a subset of Unicode, do I have to write code to send the appropriate ISO 8859-x character set depending on my use case? Is that all I need to do, or is there more to it?
By the way, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data. If so, how is the character set applied? Do I have to write code to return an 8859-x response, or is all that's needed to set an appropriate character set value in the response header?
The topic is pretty vast, so let me simplify (a lot, maybe even too much) and answer point by point.
Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using Unicode?
Yes, you're using Unicode and you're storing Unicode characters (formally called code points) using the UTF-8 encoding. Note that Unicode defines rules and sets of characters (even if the same word is often used as a synonym for the UTF-16 encoding); the way you encode such characters in a byte stream is another thing.
... However, my customer is a European entity that wants us to use ISO code pages. So my question is: how can I support multiple client use cases using the existing UTF-8 data?
Of course, if you store Unicode characters (it doesn't matter with which encoding) then you can always convert them to a specific ASCII code page (or to any other encoding). OK, this isn't formally always true (because Unicode doesn't define every possible character actually in use or used in the past), but I would ignore this point...
... Since ISO 8859-x is not a subset of Unicode, do I have to write code to send the appropriate ISO 8859-x character set depending on my use case?
All characters from the ISO 8859 code pages are also available in Unicode, so (from this point of view) it is a subset. Of course the encoded values are different, so they need to be converted. If you know the code page needed for each customer, then you can always convert UTF-8 encoded Unicode text into text in the right code page.
Is that all I need to do, or is there more to it?
Just that. The code could be pretty short, but you didn't tag your question with any language so I won't provide links/examples. For a rudimentary example, take a look at this post.
Let me also say one important thing: if they want to consume your data in ASCII with their code page, then you have to perform a conversion. If they can consume UTF-8 data directly (or you present it somehow in your own application), then you don't have to worry about code pages (that's why we use Unicode): no matter the encoding, the Unicode character set contains all the characters they may need.
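As a rough sketch, assuming a POSIX-like system where the iconv library is available (the helper name Utf8ToLatin2 is just illustrative), such a conversion could look like this:

    #include <iconv.h>
    #include <string>
    #include <vector>

    // Convert UTF-8 text to a single-byte code page such as ISO 8859-2.
    // "//TRANSLIT" asks iconv to approximate characters missing from the target
    // code page; it is a glibc extension and behaviour varies by platform.
    std::string Utf8ToLatin2(const std::string& utf8)
    {
        iconv_t cd = iconv_open("ISO-8859-2//TRANSLIT", "UTF-8");
        if (cd == (iconv_t)-1)
            return std::string();                      // conversion not supported

        std::vector<char> inBuf(utf8.begin(), utf8.end());
        std::vector<char> outBuf(utf8.size() + 1);     // a single-byte target never needs more

        char*  in      = inBuf.data();
        size_t inLeft  = inBuf.size();
        char*  out     = outBuf.data();
        size_t outLeft = outBuf.size();

        iconv(cd, &in, &inLeft, &out, &outLeft);       // real code should check the return value
        iconv_close(cd);

        return std::string(outBuf.data(), outBuf.size() - outLeft);
    }

On Windows, WideCharToMultiByte with the appropriate code page identifier would play the same role.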
By the way, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data.
Not exactly. You have a table of characters, right? For example, A. Now you have to store a numeric value that will be interpreted as A. In ASCII they arbitrarily decided that 65 is the numeric value that represents that character. Unicode is a long list of characters (and rules to combine them); the UTF-x encodings are arbitrary representations used to store them as numeric values.
If so, how is the character set applied?
"Character set" is a pretty vague term. The Unicode character set means all the characters available in Unicode. If you mean a code page, then (simplifying) it represents a subset of the available character set. Imagine you have 8-bit ASCII (so up to 256 symbols): you can't accommodate all the characters used in Europe, right? Code pages solve this problem: half of the symbols are always the same, and the other half represents different characters according to the code page (each country will use a specific code page with its preferred characters).
For an introductory overview of this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

wchar_t vs char for creating an API

I am creating a C++ library meant to be used with different applications written in different languages like Java, C#, Delphi etc.
Every now and then I get stuck on conversions between wstring, string, char*, and wchar_t*. E.g. I stuck to wchar_t but then had to use a regex library that accepts char, and ran into other similar problems.
I wish to stick to either wide or normal strings. My library will mostly deal with ASCII characters but can contain non-ASCII characters too, as in names etc. So, can I permanently switch to char instead of wchar_t and string instead of wstring? Can I have Unicode support with them, and will it affect scalability and portability across different platforms and languages?
Please advise.
You need to decide which encoding to use. Some considerations:
If you can have non-ASCII characters, then there is no point in choosing ASCII or an 8-bit ANSI code page. That way leads to disappointment and risks data loss.
It makes sense to pick one encoding and stick to it. Everywhere. The Windows API is unusual in supporting both ANSI and Unicode, but that is due to backwards compatibility with old software. If Microsoft were starting over from scratch, there would be only one encoding.
The most common choices for Unicode encoding are UTF-8 and UTF-16. Any decent environment will have support for both. Either choice may be justifiable.
Java, VB, C# and Delphi all have good support for UTF-16, and all of them use UTF-16 for their native string types (in the case of Delphi, the native string type is UTF-16 only in Delphi 2009 and later. For earlier versions, you can use the WideString string type).
The Windows platform is natively UTF-16 (*nix systems, like Linux, are UTF-8 instead), so it may well be simplest to just use UTF-16.
On the other hand, UTF-8 is probably a technically better choice, being byte-oriented and backwards compatible with ASCII. Quite likely, if Unicode were being invented from scratch, there would be no UTF-16 and UTF-8 would be the variable-length encoding.
You have phrased the question as a choice between char and wchar_t. I think that the real choice is what your preferred encoding should be. You also have to watch out that wchar_t is 16-bit (UTF-16) on some systems but 32-bit (UTF-32) on others. It is not a portable data type. That is why C++11 introduces the new char16_t and char32_t data types to correct that ambiguity.
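A small illustration of those types (string literals only, nothing library-specific):

    #include <string>

    // wchar_t's width is platform dependent; the C++11 types below are not.
    std::wstring   ws  = L"text";  // wchar_t: 16 bits on Windows, 32 bits on most *nix systems
    std::u16string s16 = u"text";  // char16_t string literal, encoded as UTF-16
    std::u32string s32 = U"text";  // char32_t string literal, encoded as UTF-32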
The major difference between Unicode and a simple char is the code page. Having just a char* pointer is not enough to understand the meaning of the string: it can be in a certain specific encoding, it can be multibyte, etc. A wide character string does not have these caveats.
In many cases international aspects are not important. In that case the difference between these two representations is minimal. The main question you need to answer is: is internationalization important to your library or not?
Modern Windows programming should tend towards builds with UNICODE defined, and thus use wide characters and wide character APIs. This is desirable for improved performance (fewer or no conversions behind the Windows API layers), improved capabilities (sometimes the ANSI wrappers don't expose all capabilities of the wide function), and in general it avoids problems with the inability to represent characters that are not on the system's current code page (and thus in practice the inability to represent non-ASCII characters).
Where this can be difficult is when you have to interface with things that don't use wide characters. For example, while Windows APIs have wide character filenames, Linux filesystems typically use bytestrings. While those bytestrings are often UTF-8 by convention, there's little enforcement. Interfacing with other languages can also be difficult if the language in question doesn't understand wide characters at an API level. Ideally such languages have chosen a specific encoding, such as UTF-8, allowing you to convert to and from that encoding at the boundaries.
And that's one general recommendation: use Unicode internally for all processing, and convert as necessary at the boundaries. If this isn't already familiar to you, it's good to reference Joel's article on Unicode.
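As a sketch of that boundary conversion on Windows (assuming UTF-16 std::wstring is the internal representation; the helper names are illustrative), the Win32 conversion functions can be wrapped like this:

    #include <windows.h>
    #include <string>

    // Internal processing uses UTF-16 std::wstring; conversion to UTF-8 happens
    // only at the boundary (files, sockets, byte-oriented libraries).
    std::string WideToUtf8(const std::wstring& wide)
    {
        if (wide.empty()) return std::string();
        // First call queries the required size, second call converts.
        int size = WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(),
                                       nullptr, 0, nullptr, nullptr);
        std::string utf8(size, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(),
                            &utf8[0], size, nullptr, nullptr);
        return utf8;
    }

    std::wstring Utf8ToWide(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        int size = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                                       nullptr, 0);
        std::wstring wide(size, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                            &wide[0], size);
        return wide;
    }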

Unicode: How to integrate Jansson (JSON library) with ICU special UTF-8 data types?

I've been developing a C application that expects a wide range of UTF-8 characters, so I started using the ICU library to support Unicode characters, but it seems things aren't working nicely with other libraries (mainly Jansson, a JSON library).
Even though Jansson claims it fully supports UTF-8, it only expects char as a parameter type (IIRC, a single byte isn't enough for Unicode chars), while ICU uses a special type called UChar (a 16-bit character type, at least on my system).
Casting a Unicode character to a regular character doesn't seem like a solution to me, since casting bigger data to smaller data will cause data loss. I tried casting anyway; it didn't work.
So my question would be: How can I make the two libraries work nicely together?
Get ICU to produce output in UTF-8 using toUTF8/toUTF8String. (toUTF8String gives you a std::string, so use .c_str() to get the char* that Jansson wants.)
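A brief sketch of that round trip, assuming a C++ translation unit with both ICU's C++ API and Jansson available:

    #include <unicode/unistr.h>   // icu::UnicodeString
    #include <jansson.h>
    #include <string>

    int main()
    {
        // ICU holds text as UTF-16 (UChar) internally.
        icu::UnicodeString text = icu::UnicodeString::fromUTF8("caf\303\251");

        // ICU -> UTF-8 std::string -> Jansson
        std::string utf8;
        text.toUTF8String(utf8);                    // appends UTF-8 bytes to utf8
        json_t* value = json_string(utf8.c_str());  // Jansson expects a UTF-8 char*

        // Jansson -> UTF-8 char* -> ICU
        icu::UnicodeString back = icu::UnicodeString::fromUTF8(json_string_value(value));

        json_decref(value);
        return 0;
    }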

Is there a faster implementation for converting a multibyte character string to a Unicode wstring?

In my project, where I adopted the Aho-Corasick algorithm to do some message filtering on the server side, the messages the server receives are multibyte character strings. But after several tests I found the bottleneck is the conversion between multibyte string and Unicode wstring. What I use now is the pair mbstowcs_s and wcstombs_s, which accounts for nearly 95% of the time cost of the whole mode. I have also tried MultiByteToWideChar/WideCharToMultiByte, with just the same result.
So I wonder if there is some other more efficient way to do the job? My project is built in VS2005, and the string converted will contain Chinese characters.
Many thanks.
There are a number of possibilities.
Firstly, what do you mean by "multi-byte character"? Do you mean UTF8 or an ISO DBCS system?
If you look at the definitions of UTF-8 and UTF-16, there is scope to do a highly optimised conversion, ripping out the "x" bits and reformatting them. See for example http://www.faqs.org/rfcs/rfc2044.html, which talks about UTF8<==>UTF32. Adjusting for UTF16 would be simple.
The second option might be to work entirely in UTF16. Render your Web page (or UI Dialog or whatever) in UTF16 and get the user input that way.
If all else fails, there are other string algorithms than Aho-Corasick. Possibly look for an algorithm that works with your original encoding.
[Added 29-Jan-2010]
See http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt for more on conversions, including two C implementations of mbtowc() and wctomb(). These are designed to work with arbitrarily large wchar_ts. If you just have 16-bit wchar_ts then you can simplify it a lot.
These would be much faster than the generic (code-page-sensitive) versions in the standard library.
Deprecated (I believe) but you could always use the non-safe versions (mbstowcs and wcstombs). Not sure if this will have a marked improvement though. Alternatively, if your character set is limited (a - z, 0 - 9, for instance), you could always do it manually with a lookup table..?
Perhaps you can reduce the amount of calls to MultiByteToWideChar?
You could also probably adopt Aho-Corasick to work directly on multibyte strings.

Resources