Adding Unicode support to a C library for Windows

I would like to add Unicode support to a C library I am maintaining. Currently it expects all strings to be passed in UTF-8 encoding. Based on feedback, it seems Windows usually provides three versions of each function:
fooA() ANSI-encoded strings
fooW() "Unicode" (UTF-16) encoded strings
foo() string encoding depends on the UNICODE define
Is there an easy way to add this support without writing a lot of wrapper functions myself? Some of the functions are callable both from within the library and by the user, which complicates the situation a little.
I would like to keep support for UTF-8 strings, as the library is usable on multiple operating systems.

The foo functions without the suffix are in fact macros. The fooA functions are obsolete and are simple wrappers around the fooW functions, which are the only ones that actually perform work. Windows uses UTF-16 strings for everything, so if you want to continue using UTF-8 strings, you must convert them for every API call (e.g. with MultiByteToWideChar).
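For instance, a minimal conversion helper might look like this (a sketch only; the name utf8_to_utf16 is illustrative, and error handling is reduced to returning NULL):

#include <windows.h>
#include <stdlib.h>

/* Convert a NUL-terminated UTF-8 string to a newly allocated
   UTF-16 string. The caller must free() the result. */
wchar_t *utf8_to_utf16(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len == 0)
        return NULL;                       /* invalid input */
    wchar_t *utf16 = malloc(len * sizeof(wchar_t));
    if (utf16 != NULL)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, utf16, len);
    return utf16;
}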
For the public interface of your library, stick to exactly one encoding: UTF-16, UTF-32, or UTF-8. Everything else (locale-dependent or OS-dependent encodings) is too complex for the callers. You don't need UTF-8 to be compatible with other OSes: many platform-independent libraries such as ICU, Qt, or the Java standard libraries use UTF-16 on all systems. I think the choice between the three Unicode encodings depends on which OS you expect the library to be used on most:
If it will mostly be used on Windows, stick to UTF-16 so that you can avoid all string conversions.
On Linux, UTF-8 is a common choice as a filesystem or terminal encoding (because it is the only Unicode encoding with an 8-bit-wide code unit), but see the note above regarding libraries.
OS X uses UTF-8 for its POSIX interface and UTF-16 for everything else (Carbon, Cocoa).
Some notes on terminology: the words "ANSI" and "Unicode" as used in the Microsoft documentation do not match what the international standards say. When Microsoft speaks of "Unicode" or "wide characters", it means "UTF-16" or (historically) the BMP subset thereof (with one code unit per code point). "ANSI" in Microsoft parlance means some locale-dependent legacy encoding which is completely obsolete in all modern versions of Windows.
If you want a definitive recommendation, go for UTF-16 and the ICU library.
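If you do go that route, conversion at the library boundary is straightforward (a sketch assuming ICU's u_strFromUTF8 from <unicode/ustring.h>; a fixed buffer is used for brevity):

#include <unicode/ustring.h>

void example(const char *utf8)
{
    UChar buf[256];                 /* UChar is ICU's UTF-16 code unit type */
    int32_t len = 0;
    UErrorCode status = U_ZERO_ERROR;
    u_strFromUTF8(buf, 256, &len, utf8, -1, &status);
    if (U_SUCCESS(status)) {
        /* buf now holds len UTF-16 code units */
    }
}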

Since your library already requires UTF-8 encoded strings, it is already fully Unicode-enabled, as UTF-8 is a lossless Unicode encoding. If you want to use your library in an environment that normally uses UTF-16 or even UTF-32 strings, that environment can simply encode to, and decode from, UTF-8 when talking to your library. Otherwise, your library would have to expose extra UTF-16/32 functions that perform those encoding/decoding operations internally.
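As a sketch of that latter approach on Windows (the names fooW and foo_utf8 are hypothetical stand-ins for your library's functions):

#include <windows.h>
#include <stdlib.h>

int foo_utf8(const char *s);   /* the existing UTF-8 entry point */

/* Hypothetical UTF-16 wrapper: convert the argument to UTF-8,
   then delegate to the UTF-8 implementation. */
int fooW(const wchar_t *ws)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, ws, -1, NULL, 0, NULL, NULL);
    if (len == 0)
        return -1;
    char *s = malloc(len);
    if (s == NULL)
        return -1;
    WideCharToMultiByte(CP_UTF8, 0, ws, -1, s, len, NULL, NULL);
    int result = foo_utf8(s);
    free(s);
    return result;
}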


How does ncurses output non-ASCII characters?

I'd like to know how ncurses (a C library) manages to put characters like ├ on the screen, despite them not (to the best of my knowledge) being part of ASCII.
I would have assumed it was just drawing them pixel by pixel, but you can copy/paste them out of the terminal (on macOS).
ncurses puts characters such as ├ on the screen by assuming that your locale environment variables (LC_ALL and/or LC_CTYPE) match the terminal on which you are displaying. The environment variables indicate the encoding (e.g., UTF-8). There are other encodings and terminals which support those encodings, but generally speaking you'll mostly see UTF-8. If the environment and terminal cooperate, things "just work":
at startup, ncurses checks for the locale which a program has initialized, via setlocale, and determines if that uses UTF-8. It uses that information later.
when a program adds character strings, e.g., using addstr, ncurses uses the character-type information (set as a side-effect of calling setlocale), and uses standard C library functions for combining sequences of bytes which make up a multi-byte character, and converting those into wide characters. It stores those wide characters internally, and
when writing to the terminal, ncurses reverses the process, converting from wide characters to use the encoding assumed to be supported by the terminal (assuming that your locale environment matches the terminal).
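A minimal program exercising that narrow (multibyte) path (a sketch; it assumes a UTF-8 locale and a UTF-8 encoded source file, and links against ncursesw):

#include <locale.h>
#include <curses.h>

int main(void)
{
    setlocale(LC_ALL, "");                    /* let ncurses detect the UTF-8 locale */
    initscr();
    addstr("├── a UTF-8 multibyte string");   /* bytes are combined into wide characters */
    refresh();                                /* converted back to UTF-8 on output */
    getch();
    endwin();
    return 0;
}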
However —
The character indicated ├ happens to be a special case. That is one of the graphic characters used for line-drawing, which predate Unicode and UTF-8. curses has names for these graphic characters, making it simple to refer to them, e.g., ACS_LTEE (the ├ is a left-tee):
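For instance, a minimal sketch that refers to the character by its ACS name rather than by any particular encoding:

#include <curses.h>

int main(void)
{
    initscr();
    addch(ACS_LTEE);   /* draws ├ using the terminal's line-drawing set */
    refresh();
    getch();
    endwin();
    return 0;
}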
Before UTF-8 came along to complicate things, developers came up with a scheme that uses a table of these graphic characters, adapting the escape sequences that the VT100 (late 1970s) and the AT&T 4410 and 5410 terminals (apparently the early 1980s, since the latter were in use by 1984) used for drawing them.
AT&T SystemV curses provided support for these graphic characters from the mid-1980s. BSD curses never did that...
Unicode (roughly 1990 and later) provided most of the same glyphs using a different encoding. There are a few omissions (the most noticeable are the scan lines above/below the one used for horizontal lines), but once UTF-8 got into use in the early 2000s, it was logical to extend ncurses to use these characters.
ncurses looks at the locale settings, but prefers using the terminal description for these graphic characters except for cases where that is known to not work — and will assume that the terminal can display the Unicode equivalents for these characters if the terminal is assumed to use UTF-8. It uses a table for this purpose (SystemV curses and its successor X/Open Curses didn't do any of this — NetBSD curses adapted the table from ncurses sometime after 2010).
Further reading:
NCURSES_NO_UTF8_ACS
Line Graphics (in curs_addch(3x))
Line Graphics (in curs_add_wch(3x))
There is more than one version of ncurses, for more than one platform, and if you really want to know, check the source. However, none of them would draw a character pixel-by-pixel; that isn’t something a library running inside a terminal emulator does.
Modern versions of the C standard library, POSIX and ncurses all support writing wide characters to the console and conversion between wide and multibyte strings. Today, wide characters are normally UTF-16 or UTF-32 and multibyte strings are normally UTF-8. You can see the documentation for <wchar.h> and ncursesw for more information.
Note that C11 does have support for UTF-8 literals, through the u8 prefix.
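For example (a one-line sketch, assuming a C11 compiler):

const char *tee = u8"\u251C";   /* u8 guarantees UTF-8 bytes for the ├ character */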
A program that’s concerned about portability with systems where the local multibyte encoding is something other than UTF-8 can use another library such as the C++ standard library or ICU to convert between UTF-8 and wide-character strings, then display those with curses.
You might need to #define _XOPEN_SOURCE 700, or the appropriate value for the version of the standard you are targeting, and with some versions of the libraries, also #define _XOPEN_SOURCE_EXTENDED 1, to get your system libraries to let you use functions such as addwstr().
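Putting those pieces together, a minimal wide-character sketch (links against ncursesw; whether the feature-test macros are needed depends on your system, as noted above):

#define _XOPEN_SOURCE 700
#define _XOPEN_SOURCE_EXTENDED 1
#include <locale.h>
#include <curses.h>

int main(void)
{
    setlocale(LC_ALL, "");
    initscr();
    addwstr(L"├ written as a wide-character string");
    refresh();
    getch();
    endwin();
    return 0;
}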
However, many programs might simply send strings of char encoded in UTF-8 to the console and assume it can handle them. I don’t recommend this approach, but it works on most Linux systems in 2017.

How can I open a file that has a Chinese filename in C?

Is there a standard way to do an fopen with a Unicode string file path?
No, there's no standard way. There are some differences between operating systems. Here's how different OSs handle non-ASCII filenames.
Linux
Under Linux, a filename is simply a binary string. The convention on most modern distributions is to use UTF-8 for non-ASCII filenames. But in the beginning, it was common to encode filenames as ISO-8859-1. It's basically up to each application to choose an encoding, so you can even have different encodings used on the same filesystem. The LANG environment variable can give you a hint as to what the preferred encoding is. But these days, you can probably assume UTF-8 everywhere.
This is not without problems, though, because a filename containing an invalid UTF-8 sequence is perfectly valid on most Linux filesystems. How would you specify such a filename if you only support UTF-8? Ideally, you should support both UTF-8 and binary filenames.
OS X
The HFS filesystem on OS X uses Unicode (UTF-16) filenames internally. Most C (and POSIX) library functions like fopen accept UTF-8 strings (since they're 8-bit compatible) and convert them internally.
Windows
The Windows API uses UTF-16 for filenames, but fopen uses the current codepage, whatever that is (UTF-8 just became an option). Many C library functions have a non-standard equivalent that accepts UTF-16 (wchar_t on Windows). For example, _wfopen instead of fopen.
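For example (a sketch; the path is illustrative, and the source file is assumed to be saved in an encoding the compiler understands):

#include <stdio.h>

int main(void)
{
    /* _wfopen takes UTF-16 (wchar_t) path and mode strings */
    FILE *f = _wfopen(L"C:\\例子\\文件.txt", L"r");
    if (f != NULL)
        fclose(f);
    return 0;
}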
In *nix, you simply use the standard fopen (see the reply from TokeMacGuy for more information).
In Windows, you can use _wfopen, and then pass a Unicode string (for more information, see MSDN).
As there is no real common way, I would wrap this call in a macro, together with all other system-dependent functions.
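A sketch of such a wrapper (the name my_fopen is hypothetical; the Windows branch assumes UTF-8 input and omits error handling):

#include <stdio.h>

#ifdef _WIN32
#include <windows.h>

static FILE *my_fopen(const char *utf8_path, const char *mode)
{
    wchar_t wpath[MAX_PATH], wmode[16];
    MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wpath, MAX_PATH);
    MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16);
    return _wfopen(wpath, wmode);
}
#else
#define my_fopen(path, mode) fopen(path, mode)
#endif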
This is a matter of your current locale. On my system, which is Unicode-enabled, file paths will be in Unicode. I'm able to detect this by means of the locale command:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
The encoding of file paths is normally set system wide, so if your file path is not in the system's locale, you will need to convert it, perhaps by means of the iconv library.
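For example, a sketch that converts an ISO-8859-1 path to UTF-8 with iconv before opening it (buffer sizes and error handling kept minimal):

#include <iconv.h>
#include <stdio.h>

int main(void)
{
    char latin1[] = "caf\xe9.txt";   /* "café.txt" encoded as ISO-8859-1 */
    char utf8[32];
    char *in = latin1, *out = utf8;
    size_t inleft = sizeof(latin1) - 1, outleft = sizeof(utf8) - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return 1;
    iconv(cd, &in, &inleft, &out, &outleft);
    *out = '\0';
    iconv_close(cd);

    FILE *f = fopen(utf8, "r");   /* the path is now in the system's UTF-8 locale */
    if (f != NULL)
        fclose(f);
    return 0;
}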
Almost all POSIX platforms use UTF-8 nowadays. Modern Windows also supports UTF-8 as the locale, so you can just use UTF-8 everywhere and open any file without using wide strings on Windows; fopen just works portably:
setlocale(LC_ALL, "en_us.utf8"); // needs the UCRT UTF-8 support described below
fopen("C:\\filê\\wíth\\Ünicode\\name.txt", "w+"); // a plain char string of UTF-8 bytes
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use ".UTF8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".UTF8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
...
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
UTF-8 Support

Converting string in host character encoding to Unicode in C

Is there a way to portably (that is, conforming to the C standard) convert strings in the host character encoding to an array of Unicode code points? I'm working on some data serialization software, and I've got a problem because while I need to send UTF-8 over the wire, the C standard doesn't guarantee the ASCII encoding, so converting a string in the host character encoding can be a nontrivial task.
Is there a library that takes care of this kind of stuff for me? Is there a function hidden in the C standard library that can do something like this?
The C11 standard, ISO/IEC 9899:2011, has a new header <uchar.h> with rudimentary facilities to help. It is described in section §7.28 Unicode utilities <uchar.h>.
There are two pairs of functions defined:
c16rtomb() and mbrtoc16() — using type char16_t aka uint_least16_t.
c32rtomb() and mbrtoc32() — using type char32_t aka uint_least32_t.
The r in the name is for 'restartable'; the functions are intended to be called iteratively. The mbrtoc{16,32}() pair convert from a multibyte code set (hence the mb) to either char16_t or char32_t. The c{16,32}rtomb() pair convert from either char16_t or char32_t to a multibyte character sequence.
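A sketch of the restartable, iterative pattern (assuming the C library provides <uchar.h>, which, as noted below, not all do):

#include <uchar.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");          /* use the host's multibyte encoding */
    const char *s = "héllo";        /* assumes the source file matches that encoding */
    mbstate_t state = {0};
    char32_t c32;
    size_t n = strlen(s);

    while (n > 0) {
        size_t rc = mbrtoc32(&c32, s, n, &state);
        if (rc == (size_t)-1 || rc == (size_t)-2)
            break;                  /* invalid or truncated sequence */
        if (rc == 0)
            rc = 1;                 /* an embedded NUL consumes one byte */
        printf("U+%04X\n", (unsigned)c32);
        s += rc;
        n -= rc;
    }
    return 0;
}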
I'm not sure whether they'll do what you want. The <uchar.h> header and hence the functions are not available on Mac OS X 10.9.1 with either the Apple-provided clang or with the 'home-built' GCC 4.8.2, so I've not had a chance to investigate them. The header does appear to be available on Linux (Ubuntu 13.10) with GCC 4.8.1.
I think it likely that ICU is a better choice — it is, however, a rather large library (but that is because it does a thorough job of supporting Unicode in general and the world's various locales in particular).

UTF-8 to UTF-16 API wrapper libraries for Windows?

Is there any wrapper library out there that mimics the Windows "ANSI" function names (e.g. CreateFileA), assumes the inputs are in UTF-8, converts them to UTF-16, calls the UTF-16 version of the function (e.g. CreateFileW), and converts the outputs back to UTF-8 for the program?
It would allow ASCII programs to use UTF-8 almost seamlessly.
Rather than wrapping the API functions, it's easier to wrap the strings in a conversion function. Then you'll be future-proof when the next version of Windows adds more API functions.
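For example, rather than shimming CreateFileA, convert at the call site (a sketch; widen() stands for a hypothetical UTF-8-to-UTF-16 helper such as the one sketched in the first answer above):

#include <windows.h>
#include <stdlib.h>

wchar_t *widen(const char *utf8);   /* hypothetical helper; returns a malloc'd string */

void open_example(const char *utf8_path)
{
    wchar_t *wpath = widen(utf8_path);
    HANDLE h = CreateFileW(wpath, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    free(wpath);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
}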
As others said, there are too many WinAPI functions to make such a library feasible. However, one can hack it at the tool-chain level, or by using something like http://research.microsoft.com/en-us/projects/detours/.
EDIT: Windows 10 added support for a UTF-8 code page in the ANSI APIs.
There is this thing called WDL, it has some UTF-8 wrappers (win32_utf8). I have never tried it so I don't know how complete the support is.
