Simplest way to convert unicode codepoint into UTF-8


What's the simplest way to convert a Unicode codepoint into a UTF-8 byte sequence in C? The only way that springs to mind is using iconv to map from the UTF-32LE codepage to UTF-8, but that seems like overkill.

Unicode conversion is not a simple task. Using iconv doesn't seem like overkill at all to me. iconv is also available as a library (iconv_open, iconv, iconv_close), so you can avoid making a system() call, if that's what you want to avoid.

Might I suggest ICU? It's a reasonably "industry standard" way of handling i18n issues.
I haven't used the C version myself, but I suspect ucnv_fromUnicode might be the function you're after.

UTF-8 works by encoding the length of the byte sequence in the highest bits of its first byte; every following byte starts with the bits 10. See http://en.wikipedia.org/wiki/UTF-8#Description
I found this small function in C here http://www.deanlee.cn/programming/convert-unicode-to-utf8/ , didn't test it though.
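For reference, the bit layout described above can be sketched in a few lines of C. This is not the function from the linked page, just a minimal illustration; the name utf8_encode and its return convention are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 (hypothetical helper).
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

The output is not NUL-terminated; the caller appends the returned number of bytes to whatever buffer it is building.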

Related

unicode string manipulation in c

I am using gcc on Linux Mint 15 and my terminal understands Unicode. I will be dealing with UTF-8. I am trying to obtain the base word of a more complex Unicode string. Sort of like trimming down the word 'alternative' to 'alternat' but in a different language. Hence I will need to test the ending of each word.
In C with ASCII, I can do something like this:
if(string[last_char]=='e')
last_char-=1; //Throws away the last character
Can I do something similar with Unicode? That is, something like this:
if(string[last_char]=='ഒ')
last_char-=1;

EDIT:
Sorry, as @chux said, I just noticed you are asking about C. Anyway, the same principle holds.
In C you can use wscanf and wprintf to do I/O with wide char strings. If your characters are inside the BMP you'll be fine. Just replace char * with wchar_t * and use the wide-character counterparts of the usual string functions.
For serious development I'd recommend converting all strings to char32_t for processing. Or use a library like ICU.
If all you need is to remove some given characters from the string, then maybe you don't need the complex Unicode character handling. Treat your Unicode text as a raw char * string and do whatever string operations you need on it.
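The raw char * approach might look like the sketch below. It assumes the text is UTF-8 and that the word and the suffix use the same normalization form; strip_suffix is a hypothetical helper, not a library function. The comparison has to cover the whole byte sequence of the character, because a non-ASCII character such as 'ഒ' occupies several bytes in UTF-8 and cannot be compared against a single char.

```c
#include <stddef.h>
#include <string.h>

/* If s ends with suffix (compared byte by byte), cut the suffix off
 * in place and return 1; otherwise leave s unchanged and return 0.
 * Both strings are assumed to be NUL-terminated UTF-8. */
int strip_suffix(char *s, const char *suffix)
{
    size_t n = strlen(s);
    size_t m = strlen(suffix);

    if (m == 0 || m > n || memcmp(s + n - m, suffix, m) != 0)
        return 0;

    s[n - m] = '\0';   /* drop all bytes of the suffix at once */
    return 1;
}
```

A call such as strip_suffix(word, "ഒ") works when the source file itself is saved as UTF-8. Matching on bytes is safe here because a complete UTF-8 character never matches starting in the middle of another character.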
The old C++ oriented answer is reproduced below, for reference.
The easy way
Use std::wstring
It's basically an std::string but individual characters are typed wchar_t.
And for IO you should use std::wcin and std::wcout. For example:
std::wstring str;
std::wcin >> str;
std::wcout << str << std::endl;
However, on some platforms (Windows, for example) wchar_t is 2 bytes wide, which means characters outside the BMP will not work. This should be okay for you I think, but should not be used in serious development. For more text on this topic, read this.
The hard way
Use a better unicode-aware string processing library like ICU.
The C++11 way
Use some mechanisms to convert your input string to std::u32string and you're done. The conversion routines can be hand-crafted or using an existing library like ICU.
Since std::u32string is built from char32_t, each element holds one complete code point, so indexing never lands in the middle of a character.

wchar_t encoding on different platforms

I faced a problem with encodings on different platforms (in my case Windows and Linux). On Windows, the size of wchar_t is 2 bytes, whereas on Linux it's 4 bytes. How can I "standardize" wchar_t to be the same size on both platforms? Is it hard to implement without additional libraries? For now, I'm aiming for the printf/wprintf API. The data is sent via socket communication. Thank you.

If you want to send Unicode data across different platforms and architectures, I'd suggest using UTF-8 encoding and (8-bit) chars. UTF-8 has some advantages, such as not having endianness issues: UTF-8 is just a plain sequence of bytes, whereas both UTF-16 and UTF-32 can be little-endian or big-endian.
On Windows, just convert the UTF-8 text to UTF-16 at the boundary of Win32 APIs (since Windows APIs tend to work with UTF-16). You can use the MultiByteToWideChar() API for that.

To solve this problem I think you are going to have to convert all strings into UTF-8 before transmitting. On Windows you would use the WideCharToMultiByte function to convert wchar_t strings to UTF-8 strings, and MultiByteToWideChar to convert UTF-8 strings into wchar_t strings.
On Linux things aren't as straightforward. You can use the functions wctomb and mbtowc, however what they convert to/from depends on the underlying locale setting. So if you want these to convert to/from UTF-8 and Unicode then you'll need to make sure the locale is set to use UTF-8 encoding.
This article might also be a good resource.
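A minimal sketch of the locale-dependent route on Linux follows. The helper names and the list of locale names are assumptions made for this example; which UTF-8 locales are installed varies from system to system.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Try a few common names for a UTF-8 locale; returns 1 on success.
 * The names are guesses: check `locale -a` on the target machine. */
int use_utf8_locale(void)
{
    static const char *names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Convert a NUL-terminated multibyte string to wchar_t using the
 * current locale. Returns the number of wide characters written
 * (not counting the terminator), or (size_t)-1 on invalid input.
 * If the result equals cap, the output is not NUL-terminated. */
size_t utf8_to_wide(const char *utf8, wchar_t *out, size_t cap)
{
    return mbstowcs(out, utf8, cap);
}
```

Without the setlocale call the program runs in the "C" locale, where mbstowcs is not guaranteed to accept bytes above 127, so the conversion only behaves as a UTF-8 decoder once a UTF-8 locale is active.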

Unicode: How to integrate Jansson(JSON library) with ICU special UTF-8 data types?

I've been developing a C application that expects a wide range of UTF-8 characters, so I started using the ICU library to support Unicode characters, but it seems things aren't working nicely with other libraries (mainly jansson, a JSON library).
Even though jansson claims it fully supports UTF-8, it only expects chars as parameters (IIRC, a single byte isn't enough for Unicode chars), while ICU uses a special type called UChar (a 16-bit character, at least on my system).
Casting a Unicode character to a regular character doesn't seem like a solution to me, since casting bigger data to a smaller type will cause data loss. I tried casting anyway; it didn't work.
So my question would be: How can I make the two libraries work nicely together?

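Get ICU to produce output in UTF-8 using toUTF8/toUTF8String. (toUTF8String gives you a std::string, so call .c_str() to get the char* that Jansson wants.) In plain C, u_strToUTF8 does the same conversion. Jansson's char* parameters are UTF-8 byte strings, where one character may span several chars, so no casting is needed.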
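To show what that conversion involves, here is a hand-rolled sketch that does not call ICU at all. uint16_t stands in for ICU's UChar, and utf16_to_utf8 is a hypothetical name; in real code ICU's own conversion functions are the better choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert n UTF-16 code units to a NUL-terminated UTF-8 string.
 * out must have room for 3 * n + 1 bytes. Returns the number of bytes
 * written (not counting the NUL), or (size_t)-1 on an unpaired surrogate. */
size_t utf16_to_utf8(const uint16_t *src, size_t n, char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {           /* high surrogate */
            if (i + 1 >= n || src[i + 1] < 0xDC00 || src[i + 1] > 0xDFFF)
                return (size_t)-1;
            /* combine the pair into one code point above U+FFFF */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(src[++i] - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                        /* stray low surrogate */
        }
        if (cp < 0x80) {
            out[o++] = (char)cp;
        } else if (cp < 0x800) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

The resulting char buffer is the kind of UTF-8 string that a char*-based API can take directly.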

FastCGI and Unicode

I am wondering if FastCGI supports Unicode functions like wprintf. I receive a buffer via fread and get a char* that has Unicode characters in it. I mean bytes with value above 128. How do I process them inside the FastCGI main function? A call to mbstowcs fails.
I meant that I was using the FastCGI Developer's Kit library. A Java client is sending data encoded with UTF-8. I decoded it server-side using the mbstring functions in PHP, but what is the equivalent of that in gcc? Whatever it is does not seem to work inside the FastCGI main function. I looked at fastcgipp but I don't know how much it is used and how stable it is. Further, I don't find lugging in a huge library like Boost justified for a small utility.

If you need Unicode, use UTF-8 and not "wide" characters. UTF-8 is much better suited to the web.

FastCGI is a protocol, not an API.
So it depends on the library you choose.
For one that does support it, see this library:
http://www.nongnu.org/fastcgipp/

I receive a buffer via fread and get char* that has unicode characters in it. I mean bytes with value above 128 .
Unicode is not the only encoding that has bytes above 128 in it. In fact, most other encodings do. You need to find out which encoding is exactly used in your web application. In any case, I don't think wprintf is going to be useful to you in any way.

FastCGI expects a byte stream, so your safest bet is UTF-8.
That being said, the program should work with non-UTF-8 input; iconv is ideal for this.
You can use the wprintf family, but only in its buffer-writing form (swprintf in standard C), after which the buffer contents are converted to UTF-8 and then written to the correct FCGI stream.

Is there a faster implementation for converting a multibyte character string to a unicode wstring?

In my project I adopted the Aho-Corasick algorithm to do message filtering on the server side. The messages the server receives are multibyte character strings. After several tests I found that the bottleneck is the conversion between the multibyte string and the unicode wstring. What I use now is the pair mbstowcs_s and wcstombs_s, which takes nearly 95% of the filter's total running time. I have also tried MultiByteToWideChar/WideCharToMultiByte, with just the same result.
So I wonder if there is some other, more efficient way to do the job? My project is built in VS2005, and the converted strings will contain Chinese characters.
Many thanks.

There are a number of possibilities.
Firstly, what do you mean by "multi-byte character"? Do you mean UTF8 or an ISO DBCS system?
If you look at the definitions of UTF8 and UTF16, there is scope to do a highly optimised conversion, ripping out the "x" bits and reformatting them. See for example http://www.faqs.org/rfcs/rfc2044.html, which talks about UTF8<==>UTF32. Adjusting for UTF16 would be simple.
The second option might be to work entirely in UTF16. Render your Web page (or UI Dialog or whatever) in UTF16 and get the user input that way.
If all else fails, there are other string algorithms than Aho-Corasick. Possibly look for an algorithm that works with your original encoding.
[Added 29-Jan-2010]
See http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt for more on conversions, including two C implementations of mbtowc() and wctomb(). These are designed to work with arbitrarily large wchar_ts. If you just have 16-bit wchar_ts then you can simplify it a lot.
These would be much faster than the generic (code-page-sensitive) versions in the standard library.
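As an illustration of the "rip out the x bits" idea, a hand-written decoder might look like the sketch below. It applies only if the input really is UTF-8 (not a DBCS code page), and utf8_decode is a hypothetical name, not one of the functions from the linked file.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from at most n bytes at s into *cp.
 * Returns the number of bytes consumed (1 to 4), or 0 if the input
 * is empty, truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    uint32_t c, min;
    size_t len;

    if (n == 0)
        return 0;
    c = s[0];
    if (c < 0x80) {                       /* 0xxxxxxx: plain ASCII */
        *cp = c;
        return 1;
    } else if ((c & 0xE0) == 0xC0) {      /* 110xxxxx */
        len = 2; min = 0x80;    c &= 0x1F;
    } else if ((c & 0xF0) == 0xE0) {      /* 1110xxxx */
        len = 3; min = 0x800;   c &= 0x0F;
    } else if ((c & 0xF8) == 0xF0) {      /* 11110xxx */
        len = 4; min = 0x10000; c &= 0x07;
    } else {
        return 0;                         /* stray continuation or invalid lead byte */
    }
    if (n < len)
        return 0;                         /* sequence cut short */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a 10xxxxxx continuation byte */
        c = (c << 6) | (s[i] & 0x3Fu);    /* append the six payload bits */
    }
    /* reject overlong forms, surrogates and values past U+10FFFF */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}
```

The result is a full 32-bit code point. With a 16-bit wchar_t, values above 0xFFFF would still need to be split into a surrogate pair before being stored in a wstring.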

Deprecated (I believe) but you could always use the non-safe versions (mbstowcs and wcstombs). Not sure if this will have a marked improvement though. Alternatively, if your character set is limited (a - z, 0 - 9, for instance), you could always do it manually with a lookup table..?

Perhaps you can reduce the number of calls to MultiByteToWideChar?

You could also probably adapt Aho-Corasick to work directly on multibyte strings.
