Converting string in host character encoding to Unicode in C

Is there a way to portably (that is, conforming to the C standard) convert strings in the host character encoding to an array of Unicode code points? I'm working on some data serialization software, and I've got a problem because while I need to send UTF-8 over the wire, the C standard doesn't guarantee the ASCII encoding, so converting a string in the host character encoding can be a nontrivial task.
Is there a library that takes care of this kind of stuff for me? Is there a function hidden in the C standard library that can do something like this?

The C11 standard, ISO/IEC 9899:2011, has a new header <uchar.h> with rudimentary facilities to help. It is described in section §7.28 Unicode utilities <uchar.h>.
There are two pairs of functions defined:
c16rtomb() and mbrtoc16() — using type char16_t aka uint_least16_t.
c32rtomb() and mbrtoc32() — using type char32_t aka uint_least32_t.
The r in the name is for 'restartable'; the functions are intended to be called iteratively. The mbrtoc{16,32}() pair convert from a multibyte code set (hence the mb) to either char16_t or char32_t. The c{16,32}rtomb() pair convert from either char16_t or char32_t to a multibyte character sequence.
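For illustration, here is a minimal sketch of an mbrtoc32() loop converting a locale-encoded string into an array of code points. It assumes a UTF-8 locale (so the hard-coded sample bytes convert) and that char32_t holds Unicode code points, which C11 guarantees only when __STDC_UTF_32__ is defined:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>

int main(void)
{
    setlocale(LC_ALL, "");             /* use the host's locale encoding */

    const char *src = "h\xc3\xa9llo";  /* "héllo" as UTF-8 bytes */
    size_t len = strlen(src);
    char32_t out[32];
    size_t nout = 0;

    mbstate_t state = {0};
    const char *p = src;
    while (p < src + len && nout < sizeof out / sizeof out[0]) {
        size_t rc = mbrtoc32(&out[nout], p, (size_t)(src + len - p), &state);
        if (rc == (size_t)-1 || rc == (size_t)-2)
            return 1;                  /* invalid or incomplete sequence */
        if (rc != (size_t)-3)          /* -3: code point emitted from stored state */
            p += rc ? rc : 1;          /* 0 means a null byte was converted */
        nout++;
    }

    for (size_t i = 0; i < nout; i++)
        printf("U+%04lX\n", (unsigned long)out[i]);
    return 0;
}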
I'm not sure whether they'll do what you want. The <uchar.h> header and hence the functions are not available on Mac OS X 10.9.1 with either the Apple-provided clang or with the 'home-built' GCC 4.8.2, so I've not had a chance to investigate them. The header does appear to be available on Linux (Ubuntu 13.10) with GCC 4.8.1.
I think it likely that ICU is a better choice; it is, however, a rather large library (but that is because it does a thorough job of supporting Unicode and different locales in general).

Related

Using iconv with WCHAR_T on Linux

I have the following code on Linux:
rc = iconv_open("WCHAR_T", SourceCode);
prior to using iconv to convert the data into a wide character string (wchar_t).
I am trying to understand what it achieves in order to port it to a platform where the option on parameter 1, "WCHAR_T", does not exist.
This leads to sub-questions such as:
Is there a single representation of wchar_t on Linux?
What code page does this use? I imagine it may be UTF-32.
Does it rely on any locale settings to achieve this?
I'm hoping for an answer that says something like: "The code you show is shorthand for doing the following two things instead...", and then I might be able to do those two steps instead of the shorthand on platforms where the "WCHAR_T" option to iconv_open doesn't exist.
The reason the (non-standard) WCHAR_T encoding exists is to make it easy to cast a pointer to wchar_t into a pointer to char and use it with iconv. The format understood by that encoding is whatever the system's native wchar_t is.
If you're asking about glibc and not other libc implementations, then on Linux wchar_t is a 32-bit type in the system's native endianness, and represents Unicode codepoints. This is not the same as UTF-32, since UTF-32 normally has a byte-order mark (BOM) and when it does not, is big endian. WCHAR_T is always native endian.
Note that some systems use different semantics for wchar_t. Windows always uses a 16-bit type holding little-endian UTF-16. If you used GNU libiconv on that platform, the WCHAR_T encoding would be different from what you would get on Linux.
Locale settings do not affect wchar_t because the size of wchar_t must be known at compile time, and therefore cannot practically vary based on locale.
If this piece of code is indeed casting a pointer to wchar_t and using that in its call to iconv, then you need to adjust the code to use one of the encodings UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE, depending on sizeof(wchar_t) and the platform's endianness. Those encodings do not require (nor allow) a BOM, and assuming you're not using a PDP-11, one of them will be correct for your platform.
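As a minimal sketch, selecting that replacement name could look like the helper below (the function name is hypothetical; it assumes wchar_t holds UTF-16 or UTF-32 code units, which is true on mainstream platforms):

#include <iconv.h>
#include <stdint.h>
#include <wchar.h>

/* Pick a portable iconv encoding name equivalent to the
   non-standard "WCHAR_T", from sizeof(wchar_t) and byte order. */
static const char *wchar_encoding(void)
{
    const union { uint16_t u; unsigned char c[2]; } probe = { 0x0102 };
    int big_endian = (probe.c[0] == 0x01);

    if (sizeof(wchar_t) == 4)
        return big_endian ? "UTF-32BE" : "UTF-32LE";
    return big_endian ? "UTF-16BE" : "UTF-16LE";
}

The original call then becomes iconv_open(wchar_encoding(), SourceCode).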
If you're getting the data from some other source, then you need to figure out what that is, and use the appropriate encoding from the list above for it. You should also probably send a patch upstream and ask the maintainer to use a different, more correct encoding for handling their data format.

How does ncurses output non-ASCII characters?

I'd like to know how ncurses (a C library) manages to put characters like ├ on the screen, despite them not (to the best of my knowledge) being part of ASCII.
I would have assumed it was just drawing them pixel by pixel, but you can copy/paste them out of the terminal (in macOS).
ncurses puts characters such as ├ on the screen by assuming that your locale environment variables (LC_ALL and/or LC_CTYPE) match the terminal on which you are displaying. The environment variables indicate the encoding (e.g., UTF-8). There are other encodings and terminals which support those encodings, but generally speaking you'll mostly see UTF-8. If the environment and terminal cooperate, things "just work":
at startup, ncurses checks for the locale which a program has initialized, via setlocale, and determines if that uses UTF-8. It uses that information later.
when a program adds character strings, e.g., using addstr, ncurses uses the character-type information (set as a side effect of calling setlocale) and standard C library functions to combine the sequences of bytes that make up a multibyte character and convert them into wide characters (see the sketch below). It stores those wide characters internally, and
when writing to the terminal, ncurses reverses the process, converting from wide characters to use the encoding assumed to be supported by the terminal (assuming that your locale environment matches the terminal).
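As a rough illustration (this is not ncurses source code), the standard-library conversion step mentioned above boils down to something like this, assuming a UTF-8 locale:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");               /* must match the terminal's encoding */

    const char bytes[] = "\xe2\x94\x9c"; /* UTF-8 for U+251C (├) */
    wchar_t wc;
    mbstate_t st = {0};
    size_t rc = mbrtowc(&wc, bytes, sizeof bytes - 1, &st);
    if (rc == (size_t)-1 || rc == (size_t)-2)
        return 1;                        /* not valid in the current locale */
    printf("wide character value: U+%04lX\n", (unsigned long)wc);
    return 0;
}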
However —
The ├ character happens to be a special case: it is one of the graphic characters used for line drawing, which predate Unicode and UTF-8. curses has names for these graphic characters, making them simple to refer to, e.g., ACS_LTEE (the ├ is a left tee):
Before UTF-8 came along to complicate things, developers came up with a scheme for drawing these graphic characters: a table adapted from the escape sequences used by the VT100 (late 1970s) and the AT&T 4410 and 5410 terminals (apparently the early 1980s, since the latter were in use by 1984).
AT&T SystemV curses provided support for these graphic characters from the mid-1980s. BSD curses never did that...
Unicode (roughly 1990 and later) provided most of the same glyphs using a different encoding. There are a few omissions (the most noticeable are the scan lines above/below the one used for horizontal lines), but once UTF-8 got into use in the early 2000s, it was logical to extend ncurses to use these characters.
ncurses looks at the locale settings, but prefers using the terminal description for these graphic characters, except in cases known not to work. If the terminal is assumed to use UTF-8, ncurses assumes it can display the Unicode equivalents of these characters, using a table for this purpose (SystemV curses and its successor X/Open Curses did none of this; NetBSD curses adapted the table from ncurses sometime after 2010).
Further reading:
NCURSES_NO_UTF8_ACS
Line Graphics (in curs_addch(3x))
Line Graphics (in curs_add_wch(3x))
There is more than one version of ncurses, for more than one platform, and if you really want to know, check the source. However, none of them would draw a character pixel-by-pixel; that isn’t something a library running inside a terminal emulator does.
Modern versions of the C standard library, POSIX and ncurses all support writing wide characters to the console and conversion between wide and multibyte strings. Today, wide characters are normally UTF-16 or UTF-32 and multibyte strings are normally UTF-8. You can see the documentation for <wchar.h> and ncursesw for more information.
Note that C11 does have support for UTF-8 string literals, through the u8 prefix (e.g., u8"├").
A program that’s concerned about portability with systems where the local multibyte encoding is something other than UTF-8 can use another library such as the C++ standard library or ICU to convert between UTF-8 and wide-character strings, then display those with curses.
You might need to #define _XOPEN_SOURCE 700, or the appropriate value for the version of the standard you are targeting, and with some versions of the libraries, also #define _XOPEN_SOURCE_EXTENDED 1, to get your system libraries to let you use functions such as addwstr().
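For example, a minimal wide-character ncurses program might look like this (a sketch assuming a UTF-8 locale and linking with -lncursesw):

#define _XOPEN_SOURCE 700
#define _XOPEN_SOURCE_EXTENDED 1
#include <locale.h>
#include <curses.h>

int main(void)
{
    setlocale(LC_ALL, "");                       /* pick up the user's UTF-8 locale */
    initscr();
    addwstr(L"\x251c wide-character output\n");  /* U+251C is ├ */
    addch(ACS_LTEE);                             /* same glyph via the portable ACS name */
    refresh();
    getch();                                     /* wait for a key before exiting */
    endwin();
    return 0;
}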
However, many programs might simply send strings of char encoded in UTF-8 to the console and assume it can handle them. I don’t recommend this approach, but it works on most Linux systems in 2017.

How to handle a Russian string as a command-line argument in a C program

I have an exe file built from C code. There is a situation where a Russian string is passed as an argument to this exe.
When I call the exe with this argument, Task Manager shows the Russian string perfectly as a command-line argument.
But when I print that argument from my exe, it just prints ???.
How can I make my C program (hence the exe) handle Russian characters?
The answer depends on the target platform for your program. Traditionally, a C or C++ program begins its life in the main() function, which may have byte-oriented strings passed as arguments (notice the char* in the declaration int main(int argc, char* argv[])). Byte-oriented strings mean that the characters in a string are passed in a specific byte-oriented encoding, and one character, for example Я or Ñ in UTF-8, may take more than one char.
Nowadays the most widely used encoding on Linux/Unix platforms is UTF-8, but some time ago other encodings were in use, such as ISO 8859-1, KOI8-R and many others. Most programs are still byte-oriented, as the UTF-8 encoding is mostly backward-compatible with the traditional C string APIs.
On the other hand, wide strings can be more convenient to use, because each character in a wide string occupies a fixed amount of space. Thus, for example, the following expression passes its assertion: std::wstring hello = L"Привет!¡Hola!"; assert(L'в' == hello[3]); (if UTF-8 char strings were used, the test would fail). So if your program performs a lot of operations on individual letters, rather than on strings as a whole, wide strings can be the solution.
To convert strings from a multibyte to a wide-character encoding, you may use the mbtowc family of functions, or that awesome fancy codecvt C++11 facility if your compiler supports it (likely it doesn't, as of mid-2014 :)).
On Windows, strings can also be passed as byte-oriented strings, and for Russian, CP1251 is most likely used (it depends on operating system settings, but for Windows sold in Russia and the CIS this is the most common variant). MSVC also has a language extension that lets an application programmer avoid all this complexity of manually converting byte strings to wide strings, by using a variant of the main() function which receives wide strings directly, as sketched below.
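A minimal sketch of that MSVC-specific entry point (non-standard and Windows-only):

#include <stdio.h>
#include <wchar.h>

/* MSVC extension: arguments arrive as UTF-16 wide strings, with no
   manual conversion from a byte-oriented code page required. */
int wmain(int argc, wchar_t *argv[])
{
    for (int i = 0; i < argc; i++)
        wprintf(L"arg %d: %ls\n", i, argv[i]);
    return 0;
}

Note that actually seeing Cyrillic output in the console may additionally require a suitable console font and output mode.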
@user3159253 provided a good answer that I will complete with some more references:
Windows: Usually it uses wide characters.
Linux: Normally it uses the UTF-8 encoding; please do NOT use wide chars in this case.
You are facing an internationalization issue (cf. i18n, l10n).
You might need tools like iconv for character set conversion, and gettext for string translation.

Adding Unicode support to a library for Windows

I would like to add Unicode support to a C library I am maintaining. Currently it expects all strings to be passed in UTF-8 encoding. Based on feedback, it seems Windows usually provides three versions of each function:
fooA() ANSI encoded strings
fooW() Unicode encoded strings
foo() string encoding depends on the UNICODE define
Is there an easy way to add this support without writing a lot of wrapper functions myself? Some of the functions are callable from the library and by the user and this complicates the situation a little.
I would like to keep support for utf8 strings as the library is usable on multiple operating systems.
The foo functions without the suffix are in fact macros. The fooA functions are obsolete and are simple wrappers around the fooW functions, which are the only ones that actually perform work. Windows uses UTF-16 strings for everything, so if you want to continue using UTF-8 strings, you must convert them for every API call (e.g. with MultiByteToWideChar).
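A minimal sketch of such a conversion helper (the function name is hypothetical, and error handling is kept brief):

#include <stdlib.h>
#include <windows.h>

/* Convert a NUL-terminated UTF-8 string to a freshly allocated
   UTF-16 string suitable for fooW()-style APIs; the caller frees it. */
static wchar_t *utf8_to_utf16(const char *utf8)
{
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (n <= 0)
        return NULL;
    wchar_t *wide = malloc((size_t)n * sizeof(wchar_t));
    if (wide && MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, n) <= 0) {
        free(wide);
        return NULL;
    }
    return wide;
}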
For the public interface of your library, stick to exactly one encoding: UTF-16, UTF-32 or UTF-8. Everything else (locale-dependent or OS-dependent encodings) is too complex for the callers. You don't need UTF-8 to be compatible with other OSes: many platform-independent libraries such as ICU, Qt or the Java standard libraries use UTF-16 on all systems. I think the choice between the three Unicode encodings depends on which OS you expect the library to be used on most. If it will mostly be used on Windows, stick to UTF-16 so that you can avoid all string conversions. On Linux, UTF-8 is a common choice as a filesystem or terminal encoding (because it is the only Unicode encoding with an 8-bit code unit), but see the note above regarding libraries. OS X uses UTF-8 for its POSIX interface and UTF-16 for everything else (Carbon, Cocoa).
Some notes on terminology: the words "ANSI" and "Unicode" as used in the Microsoft documentation are not in accordance with what the international standards say. When Microsoft speaks of "Unicode" or "wide characters", they mean "UTF-16" or (historically) the BMP subset thereof (with one code unit per code point). "ANSI" in Microsoft parlance means some locale-dependent legacy encoding which is completely obsolete in all modern versions of Windows.
If you want a definitive recommendation, go for UTF-16 and the ICU library.
Since your library already requires UTF-8 encoded strings, it is already fully Unicode-enabled, as UTF-8 is a lossless Unicode encoding. If you want to use your library in an environment that normally uses UTF-16 or even UTF-32 strings, that environment could simply encode to, and decode from, UTF-8 when talking to your library. Otherwise, your library would have to expose extra UTF-16/32 functions that do those encoding/decoding operations internally.

How to convert an integer to a string portably?

I was looking for a way to convert an integer to a string in a portable manner (portable among at least Windows and Linux, and x86 and x86_64), and I thought itoa() was standard, just like atoi().
But I read the following in the Wikipedia entry:
The itoa function is a widespread non-standard extension to the standard C programming language. It cannot be portably used, as it is not defined in any of the C language standards; however, compilers often provide it through the header <stdlib.h> while in non-conforming mode, because it is a logical counterpart to the standard library function atoi.
So I'd like to know if there is any way to do it in a portable manner or not. In case I have to write my own function, which things do I have to be careful with?
Most often you just use printf("%d", value);
http://en.wikipedia.org/wiki/Printf
You can use sprintf if you need it in a buffer, but how often do you convert to a string and not write it to a file or output device?
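The portable approach is snprintf() (C99), which reports truncation as well; a minimal sketch:

#include <stdio.h>

int main(void)
{
    int value = -42;
    char buf[32];                    /* ample for any 64-bit integer */
    int n = snprintf(buf, sizeof buf, "%d", value);
    if (n < 0 || (size_t)n >= sizeof buf)
        return 1;                    /* encoding error or truncation */
    printf("converted: \"%s\" (%d characters)\n", buf, n);
    return 0;
}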
If you aren't doing this terribly often, how about a runtime library routine that writes a few numbers to memory, analyzes the results and stores an encoding type? From then on you just switch on your "encoding type" to select which conversion routine to use.
