How do I convert a UTF-8 string to upper case?

Is there a portable way to convert a UTF-8 string in C to upper case? If not, what is the Linux way to do it?

The portable way of doing it would be to use a Unicode-aware library such as ICU. It seems u_strToUpper might be the function you're looking for.
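For illustration, here is a minimal sketch of that route (UTF-8 to UTF-16, uppercase, back to UTF-8) using ICU's C API; the buffer sizes are fixed for brevity, and you would link with -licuuc:

#include <unicode/ustring.h>
#include <stdio.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar wide[256], upper[256];
    char out[256];

    /* UTF-8 -> UTF-16, uppercase with full case mapping, UTF-16 -> UTF-8 */
    u_strFromUTF8(wide, 256, NULL, "gr\xC3\xBC\xC3\x9F", -1, &status);   /* "grüß" */
    u_strToUpper(upper, 256, wide, -1, "de_DE", &status);
    u_strToUTF8(out, 256, NULL, upper, -1, &status);

    if (U_SUCCESS(status))
        printf("%s\n", out);   /* "GRÜSS": note that ß expands to SS */
    return 0;
}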

glib has g_utf8_strup().
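A minimal GLib sketch (compile with pkg-config --cflags --libs glib-2.0); g_utf8_strup returns a newly allocated string:

#include <glib.h>

int main(void)
{
    gchar *upper = g_utf8_strup("stra\xC3\x9F" "e", -1);  /* -1: NUL-terminated */
    g_print("%s\n", upper);  /* uppercased copy; may be longer than the input */
    g_free(upper);
    return 0;
}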

The canonical way to do this is with wchar_t: convert your UTF-8 string into a wide-character string, apply towlower/towupper/towctrans to the wide characters (which works if your locale is set correctly), and then convert the result back to UTF-8.
This is a giant PITA so you're probably better off using a supported, open-source Unicode library like ICU.
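For reference, a sketch of that round trip. It assumes the current locale uses UTF-8, and it shows the main limitation: the per-character towupper cannot handle mappings that change length (such as ß to SS), which is exactly why a library like ICU is the better option:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    setlocale(LC_ALL, "");                 /* must select a UTF-8 locale */

    const char *in = "gr\xC3\xBC\xC3\x9F"; /* "grüß" as UTF-8 bytes */
    wchar_t wide[64];
    char out[64];

    size_t n = mbstowcs(wide, in, 64);     /* UTF-8 -> wide */
    if (n == (size_t)-1) return 1;

    for (size_t i = 0; i < n; i++)         /* uppercase one character at a time */
        wide[i] = towupper(wide[i]);

    if (wcstombs(out, wide, 64) == (size_t)-1) return 1;  /* wide -> UTF-8 */
    printf("%s\n", out);                   /* "GRÜß": ß has no 1:1 upper case */
    return 0;
}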

Related

Using iconv with WCHAR_T on Linux

I have the following code on Linux:
rc = iconv_open("WCHAR_T", SourceCode);
prior to using iconv to convert the data into a wide character string (wchar_t).
I am trying to understand what it achieves in order to port it to a platform where the option on parameter 1, "WCHAR_T", does not exist.
This leads to sub-questions such as:
Is there a single representation of wchar_t on Linux?
What code page does this use? I imagine maybe UTF-32.
Does it rely on any locale settings to achieve this?
I'm hoping for an answer that says something like: "The code you show is shorthand for doing the following 2 things instead...." and then I might be able to do those two steps instead of the shorthand on the platform where "WCHAR_T" option on iconv_open doesn't exist.
The reason the (non-standard) WCHAR_T encoding exists is to make it easy to cast a pointer to wchar_t into a pointer to char and use it with iconv. The format understood by that encoding is whatever the system's native wchar_t is.
If you're asking about glibc and not other libc implementations, then on Linux wchar_t is a 32-bit type in the system's native endianness, and represents Unicode codepoints. This is not the same as UTF-32, since UTF-32 normally has a byte-order mark (BOM) and when it does not, is big endian. WCHAR_T is always native endian.
Note that some systems use different semantics for wchar_t. Windows always uses a 16-bit type using a little-endian UTF-16. If you used the GNU libiconv on that platform, the WCHAR_T encoding would be different than if you ran it on Linux.
Locale settings do not affect wchar_t because the size of wchar_t must be known at compile time, and therefore cannot practically vary based on locale.
If this piece of code is indeed casting a pointer to wchar_t and using that in its call to iconv, then you need to adjust the code to use one of the encodings UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE, depending on sizeof(wchar_t) and the platform's endianness. Those encodings do not require (nor allow) a BOM, and assuming you're not using a PDP-11, one of them will be correct for your platform.
If you're getting the data from some other source, then you need to figure out what that is, and use the appropriate encoding from the list above for it. You should also probably send a patch upstream and ask the maintainer to use a different, more correct encoding for handling their data format.
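As a hedged sketch under those assumptions, the runtime selection could look like this (native_wchar_encoding is an illustrative name, not a standard function):

#include <iconv.h>
#include <stdint.h>
#include <wchar.h>

static const char *native_wchar_encoding(void)
{
    /* Detect the host byte order by inspecting the bytes of a 16-bit value */
    union { uint16_t u; unsigned char c[2]; } probe = { 0x0102 };
    int big_endian = (probe.c[0] == 0x01);

    if (sizeof(wchar_t) == 2)                      /* e.g. Windows */
        return big_endian ? "UTF-16BE" : "UTF-16LE";
    return big_endian ? "UTF-32BE" : "UTF-32LE";   /* e.g. Linux/glibc */
}

/* Usage, mirroring the question's call:
       rc = iconv_open(native_wchar_encoding(), SourceCode);  */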

UTF-8 to UTF-16 API wrapper libraries for Windows?

Is there any wrapper library out there that mimics the Windows "ANSI" function names (e.g. CreateFileA), assumes the inputs are in UTF-8, converts them to UTF-16, calls the UTF-16 version of the function (e.g. CreateFileW), and converts the outputs back to UTF-8 for the program?
It would allow ASCII programs to use UTF-8 almost seamlessly.
Rather than wrapping the API functions, it's easier to wrap the strings in a conversion function. Then you'll be future-proof when the next version of Windows adds more API functions.
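A hedged sketch of such a conversion helper (utf8_to_wide is an illustrative name, not a Windows API; the caller frees the result):

#include <windows.h>
#include <stdlib.h>

wchar_t *utf8_to_wide(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len == 0) return NULL;                 /* invalid UTF-8 input */
    wchar_t *wide = malloc(len * sizeof *wide);
    if (wide)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
    return wide;
}

A call such as CreateFileA(path, ...) then becomes CreateFileW(wpath, ...) where wpath = utf8_to_wide(path), freed after the call.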
As others have said, there are too many WinAPI functions to make such a library feasible. However, one could hack it at the tool-chain level, or by using something like http://research.microsoft.com/en-us/projects/detours/.
EDIT: Windows 10 added support for the UTF-8 code page in the ANSI APIs.
There is this thing called WDL; it has some UTF-8 wrappers (win32_utf8). I have never tried it, so I don't know how complete the support is.

Concatenating a string using Win32 API

What's the best way to concatenate a string using Win32? If I understand correctly, the normal C approach would be to use strcat, but since Win32 now deals with Unicode strings (aka LPWSTR), I can't think of a way for strcat to work with this.
Is there a function for this, or should I just write my own?
lstrcat comes in ANSI and Unicode variants. Actually lstrcat is simply a macro defined as either lstrcatA or lstrcatW.
These functions are available by importing kernel32.dll, which is useful if you're trying to completely avoid the C runtime library. In most cases you can just use wcscat or _tcscat, as roy commented.
Also consider the strsafe.h functions, such as StringCchCat. These come in ANSI and Unicode variants as well, but they help protect against buffer overflows.
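A sketch using the wide variant; StringCchCatW returns a failure HRESULT instead of overflowing when the destination is too small:

#include <windows.h>
#include <strsafe.h>
#include <stdio.h>

int main(void)
{
    wchar_t buf[32] = L"Hello, ";
    if (SUCCEEDED(StringCchCatW(buf, ARRAYSIZE(buf), L"world")))
        wprintf(L"%ls\n", buf);   /* Hello, world */
    return 0;
}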

UTF-8 Linux filenames and C

I am working on an OS-independent file manager, using SDL_ttf to draw my text.
On Windows everything works well, but on Linux I have to use the UTF-8 functions of SDL_ttf, because the filenames can be UTF-8 encoded.
This works well, but if I have my own C string (not a file name) such as "Ää", it is displayed incorrectly. Is there any way to tell gcc to encode my strings as UTF-8?
You don't need anything special from your C compiler for UTF-8 string literals. Proper support for it in the APIs you use is another matter, but that seems to be covered.
What you do need to do is to make sure your source files are actually saved in UTF-8, so that non-ASCII characters don't get converted to some other encoding when you edit or save the file.
The compiler doesn't need specific UTF-8 support, as long as it assumes 8-bit characters and the usual ASCII values for any syntactically significant characters; in other words, it's almost certainly not the problem.
gcc should interpret your source code and string literals as UTF-8 by default; if it doesn't, try -finput-charset=UTF-8 and -fexec-charset=UTF-8.
See also: http://gcc.gnu.org/onlinedocs/gcc-4.0.1/cpp/Implementation_002ddefined-behavior.html#Implementation_002ddefined-behavior
C11 added Unicode string literal syntax: a u8 prefix, as in u8"Ää", guarantees a UTF-8 encoded literal. Googling for "Unicode programming C" should get you started; two tutorials that seemed good are the one on developerWorks and the one on cprogramming.com.
The general approach for your specific case would be to use a wide string literal L"Ää" and then convert it with wcstombs(), which yields UTF-8 only when the current locale uses UTF-8.
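A sketch of that approach; note that wcstombs() converts to the current locale's encoding, so this assumes a UTF-8 locale is in effect:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    setlocale(LC_ALL, "");          /* assumes e.g. de_DE.UTF-8 */

    const wchar_t *w = L"Ää";
    char buf[16];
    if (wcstombs(buf, w, sizeof buf) != (size_t)-1)
        printf("%s\n", buf);        /* bytes C3 84 C3 A4 in a UTF-8 locale */
    return 0;
}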

Adding Unicode support to a library for Windows

I would like to add Unicode support to a C library I am maintaining. Currently it expects all strings to be passed in UTF-8 encoding. Based on feedback, it seems Windows usually provides three versions of each function:
fooA(): ANSI-encoded strings
fooW(): Unicode-encoded strings
foo(): string encoding depends on the UNICODE define
Is there an easy way to add this support without writing a lot of wrapper functions myself? Some of the functions are callable both from within the library and by the user, which complicates the situation a little.
I would like to keep support for UTF-8 strings, as the library is usable on multiple operating systems.
The foo functions without the suffix are in fact macros. The fooA functions are obsolete and are simple wrappers around the fooW functions, which are the only ones that actually perform work. Windows uses UTF-16 strings for everything, so if you want to continue using UTF-8 strings, you must convert them for every API call (e.g. with MultiByteToWideChar).
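A hedged sketch of the reverse conversion, for results coming back from the W functions (wide_to_utf8 is an illustrative name; the caller frees the result):

#include <windows.h>
#include <stdlib.h>

char *wide_to_utf8(const wchar_t *wide)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
    if (len == 0) return NULL;
    char *utf8 = malloc(len);
    if (utf8)
        WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, len, NULL, NULL);
    return utf8;
}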
For the public interface of your library, stick to exactly one encoding: UTF-16, UTF-32, or UTF-8. Everything else (locale-dependent or OS-dependent encodings) is too complex for the callers. You don't need UTF-8 to be compatible with other OSes: many platform-independent libraries such as ICU, Qt, or the Java standard libraries use UTF-16 on all systems. The choice between the three Unicode encodings depends on which OS you expect the library to be used on most:
If it will mostly be used on Windows, stick to UTF-16 so that you can avoid all string conversions.
On Linux, UTF-8 is a common choice as a filesystem or terminal encoding (because it is the only Unicode encoding with an 8-bit-wide code unit), but see the note above regarding libraries.
OS X uses UTF-8 for its POSIX interface and UTF-16 for everything else (Carbon, Cocoa).
Some notes on terminology: the words "ANSI" and "Unicode" as used in the Microsoft documentation are not in accordance with what the international standards say. When Microsoft speaks of "Unicode" or "wide characters", it means UTF-16 or (historically) the BMP subset thereof (one code unit per code point). "ANSI" in Microsoft parlance means some locale-dependent legacy encoding, which is completely obsolete in all modern versions of Windows.
If you want a definitive recommendation, go for UTF-16 and the ICU library.
Since your library already requires UTF-8 encoded strings, it is already fully Unicode-enabled, as UTF-8 is a lossless Unicode encoding. If you want to use the library in an environment that normally uses UTF-16 or even UTF-32 strings, that environment can simply encode to, and decode from, UTF-8 when talking to your library. Otherwise, your library would have to expose extra UTF-16/32 functions that perform those encoding/decoding operations internally.
