Using iconv with WCHAR_T on Linux - c

I have the following code on Linux:
rc = iconv_open("WCHAR_T", SourceCode);
prior to using iconv to convert the data into a wide character string (wchar_t).
I am trying to understand what it achieves in order to port it to a platform where the option on parameter 1, "WCHAR_T", does not exist.
This leads to sub-questions such as:
Is there a single representation of wchar_t on Linux?
What codepage does this use? I imagine maybe UTF-32
Does it rely on any locale settings to achieve this?
I'm hoping for an answer that says something like: "The code you show is shorthand for doing the following 2 things instead...", and then I might be able to do those two steps explicitly instead of the shorthand on the platform where the "WCHAR_T" option to iconv_open doesn't exist.

The reason the (non-standard) WCHAR_T encoding exists is to make it easy to cast a pointer to wchar_t into a pointer to char and use it with iconv. The format understood by that encoding is whatever the system's native wchar_t is.
If you're asking about glibc and not other libc implementations, then on Linux wchar_t is a 32-bit type in the system's native endianness, and represents Unicode codepoints. This is not the same as UTF-32, since UTF-32 normally has a byte-order mark (BOM) and when it does not, is big endian. WCHAR_T is always native endian.
Note that some systems use different semantics for wchar_t. Windows always uses a 16-bit type using a little-endian UTF-16. If you used the GNU libiconv on that platform, the WCHAR_T encoding would be different than if you ran it on Linux.
Locale settings do not affect wchar_t because the size of wchar_t must be known at compile time, and therefore cannot practically vary based on locale.
If this piece of code is indeed casting a pointer to wchar_t and using that in its call to iconv, then you need to adjust the code to use one of the encodings UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE, depending on sizeof(wchar_t) and the platform's endianness. Those encodings do not require (nor allow) a BOM, and assuming you're not using a PDP-11, one of them will be correct for your platform.
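For example, here is a minimal sketch (not the original code) of how the replacement encoding name could be picked on such a platform; the helper name pick_wchar_encoding() is invented here for illustration:
#include <stdint.h>
#include <wchar.h>

/* Return the iconv encoding name that matches this platform's wchar_t. */
static const char *pick_wchar_encoding(void)
{
    union { uint16_t u16; unsigned char bytes[2]; } probe = { 0x0102 };
    int little_endian = (probe.bytes[0] == 0x02);

    if (sizeof(wchar_t) == 4)
        return little_endian ? "UTF-32LE" : "UTF-32BE";
    else
        return little_endian ? "UTF-16LE" : "UTF-16BE";
}

/* Usage, replacing the original call: */
/*   rc = iconv_open(pick_wchar_encoding(), SourceCode); */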
If you're getting the data from some other source, then you need to figure out what that is, and use the appropriate encoding from the list above for it. You should also probably send a patch upstream and ask the maintainer to use a different, more correct encoding for handling their data format.

Related

How to handle a Russian string as a command-line argument in a C program

I have an exe file built from C code. There is a situation where a Russian string is passed as an argument to this exe.
When I call the exe with this argument, Task Manager shows the Russian string perfectly as the command-line argument.
But when I print that argument from my exe, it just prints ???
How can I make my C program (and hence the exe) handle Russian characters?
The answer depends on the target platform for your program. Traditionally, a C or C++ program begins its life in the main() function, which may receive byte-oriented strings as arguments (notice the char* in the declaration int main(int argc, char* argv[])). Byte-oriented means that the characters of a string are passed in a specific byte-oriented encoding, and one character, for example Я or Ñ in UTF-8, may take more than one char.
Nowadays the most widely used encoding on Linux/Unix platforms is UTF-8, but some time ago other encodings were in use, such as ISO 8859-1, KOI8-R and many others. Most programs are still byte-oriented, because UTF-8 is largely backward-compatible with the traditional C string APIs.
On the other hand, wide strings can be more convenient to use, because each character in a wide string occupies a fixed amount of storage. Thus, for example, the following expression passes its assertion: std::wstring hello = L"Привет!¡Hola!"; assert(L'в' == hello[3]); (if UTF-8 char strings were used, the test would fail). So if your program performs many operations on individual letters rather than on strings as a whole, wide strings can be the solution.
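The same point in plain C, as a rough sketch that assumes the source file is compiled with a UTF-8 execution character set:
#include <assert.h>
#include <wchar.h>

int main(void)
{
    const wchar_t *hello = L"Привет!¡Hola!";
    assert(hello[3] == L'в');     /* the 4th character, independent of byte length */
    assert(wcslen(hello) == 13);  /* counts characters, not bytes */
    return 0;
}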
To convert strings from a multi-byte to a wide character encoding, you may use the mbtowc family of functions, or the fancy codecvt C++11 facility if your compiler supports it (as of mid-2014, it likely doesn't :)).
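A minimal sketch of the mbtowc/mbstowcs approach, assuming the environment supplies a UTF-8 (or other matching) locale:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(int argc, char *argv[])
{
    setlocale(LC_ALL, "");                      /* adopt the environment's locale */
    if (argc < 2)
        return 1;

    size_t needed = mbstowcs(NULL, argv[1], 0); /* count wide characters */
    if (needed == (size_t)-1)
        return 1;                               /* invalid multibyte sequence */

    wchar_t *wide = malloc((needed + 1) * sizeof *wide);
    mbstowcs(wide, argv[1], needed + 1);

    wprintf(L"%ls has %zu characters\n", wide, needed);
    free(wide);
    return 0;
}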
On Windows, strings can also be passed as byte-oriented strings, and for Russian the encoding is most likely CP1251 (this depends on the operating system settings, but for Windows sold in Russia and the CIS this is the most common variant). MSVC also has a language extension that lets the application programmer avoid all this complexity of manually converting byte strings to wide strings, by using a variant of main() that receives wide strings directly.
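The variant referred to is the MSVC-specific wmain() entry point. A minimal Windows-only sketch:
#include <stdio.h>
#include <wchar.h>

/* MSVC extension: the command line arrives already decoded to UTF-16. */
int wmain(int argc, wchar_t *argv[])
{
    for (int i = 1; i < argc; i++)
        wprintf(L"argument %d: %ls\n", i, argv[i]);
    return 0;
}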
@user3159253 provided a good answer that I will complete with some more references:
Windows: Usually it uses wide characters.
Linux: Normally it uses UTF-8 encoding: please do NOT use wide chars in this case.
You are facing an internationalization (cf. i18n, l10n) issue.
You might need tools like iconv for character set conversion, and gettext for string translation.

Converting string in host character encoding to Unicode in C

Is there a way to portably (that is, conforming to the C standard) convert strings in the host character encoding to an array of Unicode code points? I'm working on some data serialization software, and I've got a problem because while I need to send UTF-8 over the wire, the C standard doesn't guarantee the ASCII encoding, so converting a string in the host character encoding can be a nontrivial task.
Is there a library that takes care of this kind of stuff for me? Is there a function hidden in the C standard library that can do something like this?
The C11 standard, ISO/IEC 9899:2011, has a new header <uchar.h> with rudimentary facilities to help. It is described in section §7.28 Unicode utilities <uchar.h>.
There are two pairs of functions defined:
c16rtomb() and mbrtoc16() — using type char16_t aka uint_least16_t.
c32rtomb() and mbrtoc32() — using type char32_t aka uint_least32_t.
The r in the name is for 'restartable'; the functions are intended to be called iteratively. The mbrtoc{16,32}() pair convert from a multibyte code set (hence the mb) to either char16_t or char32_t. The c{16,32}rtomb() pair convert from either char16_t or char32_t to a multibyte character sequence.
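A rough sketch of that restartable loop, converting a multibyte string in the current locale into char32_t code points (error handling is kept minimal, and the input is assumed to fit the fixed buffer):
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* use the host's locale/charset */
    const char *s = "héllo";          /* multibyte input in that charset */
    size_t left = strlen(s);

    mbstate_t state = {0};
    char32_t out[64];
    size_t n = 0;

    while (left > 0 && n < 64) {
        size_t rc = mbrtoc32(&out[n], s, left, &state);
        if (rc == (size_t)-3) { n++; continue; }   /* output produced without consuming input */
        if (rc == (size_t)-1 || rc == (size_t)-2)
            break;                                 /* invalid or truncated sequence */
        if (rc == 0)
            rc = 1;                                /* an embedded NUL */
        s += rc;
        left -= rc;
        n++;
    }

    printf("decoded %zu code points\n", n);
    return 0;
}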
I'm not sure whether they'll do what you want. The <uchar.h> header and hence the functions are not available on Mac OS X 10.9.1 with either the Apple-provided clang or with the 'home-built' GCC 4.8.2, so I've not had a chance to investigate them. The header does appear to be available on Linux (Ubuntu 13.10) with GCC 4.8.1.
I think it is likely that ICU is a better choice. It is, however, a rather large library (but that is because it does a thorough job of supporting Unicode and different locales in general).

Adding Unicode support to a library for Windows

I would like to add Unicode support to a C library I am maintaining. Currently it expects all strings to be passed in UTF-8 encoding. Based on feedback, it seems Windows usually provides three versions of each function:
fooA() ANSI encoded strings
fooW() Unicode encoded strings
foo() string encoding depends on the UNICODE define
Is there an easy way to add this support without writing a lot of wrapper functions myself? Some of the functions are callable from the library and by the user and this complicates the situation a little.
I would like to keep support for utf8 strings as the library is usable on multiple operating systems.
The foo functions without the suffix are in fact macros. The fooA functions are obsolete and are simple wrappers around the fooW functions, which are the only ones that actually perform work. Windows uses UTF-16 strings for everything, so if you want to continue using UTF-8 strings, you must convert them for every API call (e.g. with MultiByteToWideChar).
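A hedged sketch of that conversion; the helper name utf8_to_utf16() is invented here and is not part of the Windows API:
#include <windows.h>
#include <stdlib.h>

/* Convert a NUL-terminated UTF-8 string to a newly allocated UTF-16 string. */
static wchar_t *utf8_to_utf16(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len == 0)
        return NULL;

    wchar_t *utf16 = malloc(len * sizeof(wchar_t));
    if (utf16 != NULL)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, utf16, len);
    return utf16;    /* caller frees */
}

/* Example: wchar_t *wpath = utf8_to_utf16(path); DeleteFileW(wpath); free(wpath); */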
For the public interface of your library, stick to exactly one encoding: UTF-16, UTF-32 or UTF-8. Everything else (locale-dependent or OS-dependent encodings) is too complex for the callers.
You don't need UTF-8 to be compatible with other OSes: many platform-independent libraries such as ICU, Qt or the Java standard libraries use UTF-16 on all systems. The choice between the three Unicode encodings depends on which OS you expect the library to be used on most:
If it will mostly be used on Windows, stick to UTF-16 so that you can avoid all string conversions.
On Linux, UTF-8 is a common choice as a filesystem or terminal encoding (because it is the only Unicode encoding with an 8-bit code unit), but see the note above regarding libraries.
OS X uses UTF-8 for its POSIX interface and UTF-16 for everything else (Carbon, Cocoa).
Some notes on terminology: the words "ANSI" and "Unicode" as used in the Microsoft documentation are not in accordance with what the international standards say. When Microsoft speaks of "Unicode" or "wide characters", it means UTF-16 or (historically) the BMP subset thereof (with one code unit per code point). "ANSI" in Microsoft parlance means some locale-dependent legacy encoding which is completely obsolete in all modern versions of Windows.
If you want a definitive recommendation, go for UTF-16 and the ICU library.
Since your library already requires UTF-8 encoded strings, it is already fully Unicode-enabled, as UTF-8 is a lossless Unicode encoding. If you want to use your library in an environment that normally uses UTF-16 or even UTF-32 strings, that environment can simply encode to, and decode from, UTF-8 when talking to your library. Otherwise, your library would have to expose extra UTF-16/32 functions that perform those encoding/decoding operations internally.

What encoding is used when invoking fopen or open?

When we invoke a system call on Linux like open, or a stdio function like fopen, we must provide a const char *filename. My question is: what encoding is used here? Is it UTF-8, ASCII, or ISO 8859-x? Does it depend on the system or environment settings?
I know that on MS Windows there is a _wopen which accepts UTF-16.
It's a byte string, the interpretation is up to the particular filesystem.
Filesystem calls on Linux are encoding-agnostic, i.e. they do not (need to) know about the particular encoding. As far as they are concerned, the byte-string pointed to by the filename argument is passed down to the filesystem as-is. The filesystem expects that filenames are in the correct encoding (usually UTF-8, as mentioned by Matthew Talbert).
This means that you often don't need to do anything (filenames are treated as opaque byte-strings), but it really depends on where you receive the filename from, and whether you need to manipulate the filename in any way.
It depends on the system locale. Look at the output of the locale command. If the variables end in UTF-8, then your locale is UTF-8. Most modern Linux distributions use UTF-8. Although Andrew is correct that technically it's just a byte string, if you don't match the system locale some programs may not work correctly and it will be impossible to get correct user input, etc. It's best to stick with UTF-8.
The filename is the byte string; regardless of locale or any other conventions you're using about how filenames should be encoded, the string you must pass to fopen and to all functions taking filenames/pathnames is the exact byte string for how the file is named. For example if you have a file named ö.txt in UTF-8 in NFC, and your locale is UTF-8 encoded and uses NFC, you can just write the name as ö.txt and pass that to fopen. If your locale is Latin-1 based, though, you can't pass the Latin-1 form of ö.txt ("\xf6.txt") to fopen and expect it to succeed; that's a different byte string and thus a different filename. You would need to pass "\xc3\xb6.txt" ("ö.txt" if you interpret that as Latin-1), the same byte string as the actual name.
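To illustrate, a small sketch that passes the exact UTF-8 byte string from the example above to fopen(), assuming a file named ö.txt (stored as UTF-8, NFC) exists in the current directory:
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("\xc3\xb6.txt", "r");  /* the UTF-8 bytes for "ö.txt" */
    if (fp == NULL)
        perror("fopen");
    else
        fclose(fp);
    return 0;
}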
This situation is very different from Windows, which you seem to be familiar with, where the filename is a sequence of 16-bit units interpreted as UTF-16 (although AFAIK they need not actually be valid UTF-16), and filenames passed to fopen, etc. are interpreted according to the current locale as Unicode characters, which are then used to open/access the file based on its UTF-16 name.
As already mentioned above, this will be a byte string and the interpretation is left to the underlying system. More specifically, imagine two C functions, one in user space and one in kernel space, which take char * as their parameter. The encoding in user space will depend upon the execution character set of the user program (e.g. specified by -fexec-charset=charset in GCC). The encoding expected by the kernel function depends upon the execution charset used during kernel compilation (I am not sure where to get that information).
I did some further inquiries on this topic and came to the conclusion that there are two different ways how filename encoding can be handled by unixoid file systems.
File names are encoded in the "system locale", which usually is, but need not be, the same as the current environment locale reflected by the locale command (it may instead be some preset in a global configuration file).
File names are encoded in UTF-8, independent from any locale settings.
GTK+ solves this mess by assuming UTF-8 and allowing it to be overridden either by the current locale encoding or by a user-supplied encoding.
Qt solves it by assuming the locale encoding (and that the system locale is reflected in the current locale) and allowing it to be overridden with a user-supplied conversion function.
So the bottom line is: Use either UTF-8 or what LC_ALL or LANG tell you by default, and provide an override setting at least for the other alternative.

Is there a standard way to do an fopen with a Unicode string file path?

Is there a standard way to do an fopen with a Unicode string file path?
No, there's no standard way. There are some differences between operating systems. Here's how different OSs handle non-ASCII filenames.
Linux
Under Linux, a filename is simply a binary string. The convention on most modern distributions is to use UTF-8 for non-ASCII filenames. But in the beginning, it was common to encode filenames as ISO-8859-1. It's basically up to each application to choose an encoding, so you can even have different encodings used on the same filesystem. The LANG environment variable can give you a hint what the preferred encoding is. But these days, you can probably assume UTF-8 everywhere.
This is not without problems, though, because a filename containing an invalid UTF-8 sequence is perfectly valid on most Linux filesystems. How would you specify such a filename if you only support UTF-8? Ideally, you should support both UTF-8 and binary filenames.
OS X
The HFS filesystem on OS X uses Unicode (UTF-16) filenames internally. Most C (and POSIX) library functions like fopen accept UTF-8 strings (since they're 8-bit compatible) and convert them internally.
Windows
The Windows API uses UTF-16 for filenames, but fopen uses the current codepage, whatever that is (UTF-8 just became an option). Many C library functions have a non-standard equivalent that accepts UTF-16 (wchar_t on Windows). For example, _wfopen instead of fopen.
In *nix, you simply use the standard fopen (see more information in reply from TokeMacGuy, or in this forum)
In Windows, you can use _wfopen, and then pass a Unicode string (for more information, see MSDN).
As there is no real common way, I would wrap this call in a macro, together with all other system-dependent functions.
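One possible shape for such a wrapper, as a sketch: it assumes the rest of the program keeps all paths in UTF-8, and the name fopen_unicode() is invented here:
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <wchar.h>

/* Convert the UTF-8 path and mode to UTF-16 and call _wfopen. */
static FILE *fopen_unicode(const char *utf8_path, const char *mode)
{
    wchar_t wpath[MAX_PATH];
    wchar_t wmode[16];

    if (!MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wpath, MAX_PATH))
        return NULL;
    if (!MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16))
        return NULL;
    return _wfopen(wpath, wmode);
}
#else
/* On POSIX systems the byte string goes straight through. */
#define fopen_unicode(path, mode) fopen((path), (mode))
#endif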
This is a matter of your current locale. On my system, which is Unicode-enabled, file paths will be in Unicode. I'm able to detect this by means of the locale command:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
The encoding of file paths is normally set system wide, so if your file path is not in the system's locale, you will need to convert it, perhaps by means of the iconv library.
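A minimal iconv sketch of that conversion, re-encoding an ISO-8859-1 path into UTF-8 (error handling kept to a minimum):
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "\xf6.txt";              /* "ö.txt" in ISO-8859-1 */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        return 1;
    *outp = '\0';
    iconv_close(cd);

    printf("converted path: %s\n", out);
    return 0;
}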
Almost all POSIX platforms use UTF-8 nowadays, and modern Windows also supports UTF-8 as the locale, so you can just use UTF-8 everywhere and open any file without using wide strings on Windows. fopen just works portably:
setlocale(LC_ALL, "en_us.utf8"); // need some setup before calling this
fopen(R"(C:\filê\wíth\Ünicode\name.txt)", "w+");
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use ".UTF8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".UTF8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
...
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
UTF-8 Support
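A small sketch of the documented usage, assuming a UCRT new enough to support the UTF-8 code page and a source file compiled as UTF-8; the file name is just an example:
#include <locale.h>
#include <stdio.h>

int main(void)
{
    setlocale(LC_ALL, ".UTF8");          /* default ANSI locale, UTF-8 code page */
    FILE *fp = fopen("C:\\temp\\Ünicode.txt", "w+");
    if (fp != NULL)
        fclose(fp);
    return 0;
}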
