FastCGI and Unicode (C)

I am wondering whether FastCGI supports Unicode functions like wprintf. I receive a buffer via fread and get a char* that contains Unicode characters, i.e. bytes with values above 128. How do I process them inside the FastCGI main function? A call to mbstowcs fails.
To clarify: I am using the FastCGI Developer's Kit library. A Java client is sending data encoded as UTF-8. On the server side I decoded it in PHP using the mbstring functions, but what is the equivalent in C with gcc? Whatever it is does not seem to work inside the FastCGI main function. I looked at fastcgi++, but I don't know how widely used or how stable it is. Also, I don't find pulling in a huge library like Boost justified for a small utility.

If you need Unicode, use UTF-8 rather than "wide" characters; it is much more suitable for the web.

FastCGI is a protocol, not an API, so it depends on the library you choose.
Yes; see this library:
http://www.nongnu.org/fastcgipp/

I receive a buffer via fread and get a char* that has Unicode characters in it. I mean bytes with values above 128.
Unicode is not the only encoding scheme that uses bytes above 128; in fact, most other encodings do. You need to find out exactly which encoding your web application uses. In any case, I don't think wprintf is going to be useful to you in any way.

FastCGI expects a byte stream, so your safest bet is UTF-8.
That being said, the program should still cope with non-UTF-8 input; iconv is ideal for this.
You can use the wprintf family, but only in the form of swprintf into a wide buffer, after which the buffer contents are converted to UTF-8 and then written to the correct FCGI stream.
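To make that last step concrete, here is a minimal sketch, assuming glibc's iconv (which accepts "WCHAR_T" as an encoding name) and the FastCGI Developer's Kit (fcgiapp.h); the fixed buffer, the made-up helper name, and the terse error handling are for brevity only:

    /* Format into a wide buffer elsewhere, then convert it to UTF-8
       with iconv and write the bytes to the FastCGI output stream. */
    #include <iconv.h>
    #include <wchar.h>
    #include <fcgiapp.h>

    static int put_wide_as_utf8(FCGX_Stream *out, const wchar_t *ws)
    {
        iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
        if (cd == (iconv_t)-1)
            return -1;

        char *in = (char *)ws;                    /* iconv wants char** */
        size_t inleft = wcslen(ws) * sizeof(wchar_t);
        char buf[4096];
        char *outp = buf;
        size_t outleft = sizeof(buf);

        size_t rc = iconv(cd, &in, &inleft, &outp, &outleft);
        iconv_close(cd);
        if (rc == (size_t)-1)
            return -1;

        /* Write only the bytes actually produced. */
        return FCGX_PutStr(buf, (int)(sizeof(buf) - outleft), out);
    }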

Related

wchar_t encoding on different platforms

I faced a problem with encodings on different platforms (in my case Windows and Linux). On Windows, the size of wchar_t is 2 bytes, whereas on Linux it's 4 bytes. How can I "standardize" wchar_t to be the same size on both platforms? Is that hard to implement without additional libraries? For now, I'm aiming for the printf/wprintf API. The data is sent via socket communication. Thank you.
If you want to send Unicode data across different platforms and architectures, I'd suggest using UTF-8 encoding and (8-bit) chars. UTF-8 has some advantages, such as not having endianness issues (UTF-8 is just a plain sequence of bytes, whereas both UTF-16 and UTF-32 can be little-endian or big-endian).
On Windows, just convert the UTF-8 text to UTF-16 at the boundary of Win32 APIs (since Windows APIs tend to work with UTF-16). You can use the MultiByteToWideChar() API for that.
To solve this problem I think you are going to have to convert all strings into UTF-8 before transmitting. On Windows you would use the WideCharToMultiByte function to convert wchar_t strings to UTF-8 strings, and MultiByteToWideChar to convert UTF-8 strings into wchar_t strings.
On Linux things aren't as straightforward. You can use the functions wctomb and mbtowc, but what they convert to and from depends on the current locale setting. So if you want them to convert to and from UTF-8, you'll need to make sure the locale is set to a UTF-8 encoding.
This article might also be a good resource.
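For the Windows side, a minimal sketch of the two conversions described above; WideCharToMultiByte and MultiByteToWideChar are the real Win32 APIs, the helper names are made up, and error handling is trimmed:

    #include <windows.h>
    #include <stdlib.h>

    /* wchar_t (UTF-16 on Windows) -> UTF-8; caller frees the result. */
    char *to_utf8(const wchar_t *ws)
    {
        int n = WideCharToMultiByte(CP_UTF8, 0, ws, -1, NULL, 0, NULL, NULL);
        char *s = n ? malloc((size_t)n) : NULL;
        if (s)
            WideCharToMultiByte(CP_UTF8, 0, ws, -1, s, n, NULL, NULL);
        return s;
    }

    /* UTF-8 -> wchar_t; caller frees the result. */
    wchar_t *from_utf8(const char *s)
    {
        int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
        wchar_t *ws = n ? malloc((size_t)n * sizeof(wchar_t)) : NULL;
        if (ws)
            MultiByteToWideChar(CP_UTF8, 0, s, -1, ws, n);
        return ws;
    }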

Unicode: How to integrate Jansson (JSON library) with ICU's special UTF-8 data types?

I've been developing a C application that expects a wide range of UTF-8 characters, so I started using the ICU library to support Unicode, but things don't seem to work nicely with other libraries (mainly Jansson, a JSON library).
Even though Jansson claims it fully supports UTF-8, it only accepts chars as parameters (IIRC, a single byte isn't enough for most Unicode characters), while ICU uses a special type called UChar (a 16-bit character type, at least on my system).
Casting a Unicode character to a regular char doesn't seem like a solution to me, since casting bigger data to smaller causes data loss. I tried casting anyway; it didn't work.
So my question would be: How can I make the two libraries work nicely together?
Get ICU to produce output in UTF-8 using toUTF8/toUTF8String (toUTF8String gives you a std::string, so call .c_str() to get the char* that Jansson wants).
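If you're in plain C rather than C++, ICU4C's u_strToUTF8 (from unicode/ustring.h) does the same job. A sketch, assuming Jansson's json_string for the hand-off; the helper name is made up:

    #include <unicode/ustring.h>
    #include <jansson.h>
    #include <stdlib.h>

    /* Convert ICU's UTF-16 UChar buffer to UTF-8, then hand it to Jansson. */
    json_t *json_from_uchars(const UChar *us, int32_t ulen)
    {
        UErrorCode status = U_ZERO_ERROR;
        int32_t len8 = 0;

        /* First call only measures the required UTF-8 length. */
        u_strToUTF8(NULL, 0, &len8, us, ulen, &status);
        if (status != U_BUFFER_OVERFLOW_ERROR)
            return NULL;

        char *s = malloc((size_t)len8 + 1);
        if (!s)
            return NULL;

        status = U_ZERO_ERROR;
        u_strToUTF8(s, len8 + 1, NULL, us, ulen, &status);

        json_t *result = U_SUCCESS(status) ? json_string(s) : NULL;
        free(s);
        return result;
    }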

Char sent via socket

This question is purely theoretical, and not about the right way of doing it, but do we need to convert a char 'x' to network byte order? I'm interested in all cases: always / sometimes / never.
I personally think I should, but I need to be sure. Thank you.
No. char is a single-byte value, so endianness doesn't matter.
As you're thinking of it (endianness, ntohs, ntohl, etc.), no; see the sketch at the end of this answer.
Less basically, I should raise a warning that isn't network-specific: any string that isn't accompanied by its encoding is unreadable.
Say you're sending or storing the string "Français", over a network or not: the 'ç' will have to be encoded using some character encoding. Not specifying that encoding means it can be read as "FranÃ§ais" if you encoded it using UTF-8 but your reader assumed Latin-1.
So you have two solutions:
Write down a spec for your application to always use the same character encoding
Put metadata in a header somewhere specifying the encoding for the strings that follow (as HTTP does)
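A minimal illustration of the endianness point from the first answer (POSIX sockets assumed; the helper is illustrative):

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <unistd.h>

    void send_example(int sock)
    {
        /* A single byte has no byte order, so send it as-is. */
        char c = 'x';
        if (write(sock, &c, 1) != 1)
            return; /* handle error */

        /* A multi-byte integer does: convert to network byte order first. */
        uint32_t n = 42;
        uint32_t wire = htonl(n);
        if (write(sock, &wire, sizeof wire) != (ssize_t)sizeof wire)
            return; /* handle error */
    }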

Reading files using Windows API

I'm trying to write a console program that reads characters from a file.
I want it to be able to read from a Unicode file as well as an ANSI one.
How should I address this issue? Do I need to programmatically detect the type of the file and read accordingly? Or can I somehow use Windows API data types like TCHAR and the like?
Is the only difference between reading the files that in Unicode I have to read 2 bytes per character, while in ANSI it's 1 byte?
I'm a little lost with the Windows API.
I'd appreciate any help.
Thanks.
You can try reading the first chunk of the file in binary mode into a buffer (1 KB should be enough) and use the IsTextUnicode function to determine whether it's likely to be some variety of Unicode. Note that unless this function finds "strong" evidence that the text is Unicode (e.g. a BOM), it essentially performs a statistical analysis of the buffer to decide what it "looks like", so it can give wrong results; a case in which it fails is the (in)famous "Bush hid the facts" bug.
Still, you can control how much guesswork it does using its flags.
Notice that if your application does not actually manipulate or display the text, you may not even need to determine whether it's ANSI or Unicode; you can just treat it as an opaque byte stream and stay encoding-agnostic.
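A sketch of that approach; IsTextUnicode is the real Win32 API (declared via windows.h), while the helper around it is illustrative, and the answer remains a heuristic:

    #include <windows.h>
    #include <stdio.h>

    BOOL looks_like_utf16(const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (!f)
            return FALSE;

        char buf[1024];
        size_t n = fread(buf, 1, sizeof(buf), f);
        fclose(f);

        /* On input the flags select which tests to run;
           on output they report which tests passed. */
        INT result = IS_TEXT_UNICODE_UNICODE_MASK;
        return IsTextUnicode(buf, (int)n, &result);
    }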
I'm not sure if the Windows API has some utility methods for all kinds of text files.
In general, you need to read the file's BOM (http://en.wikipedia.org/wiki/Byte_Order_Mark), which will tell you which Unicode encoding is actually in use, so that you can then read the characters correctly.
You read bytes from a file, and then parse these bytes as the expected format dictates.
Typically you check if a file contains UTF text by reading the initial BOM, and then proceed to read the remainder of the file bytes, parsing these the way you think they're encoded.
Unicode text is typically encoded as UTF-8 (1-4 bytes per character), UTF-16 (2 or 4 bytes per character), or UTF-32 (4 bytes per character).
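As a sketch of the BOM check described above (these byte sequences are the standard BOMs; the UTF-32LE test must come before the UTF-16LE one, because the latter's BOM is a prefix of the former's):

    #include <string.h>
    #include <stddef.h>

    /* Inspect the first bytes of a file and name the encoding, if a BOM is present. */
    const char *detect_bom(const unsigned char *b, size_t n)
    {
        if (n >= 4 && memcmp(b, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
        if (n >= 4 && memcmp(b, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
        if (n >= 3 && memcmp(b, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
        if (n >= 2 && memcmp(b, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
        if (n >= 2 && memcmp(b, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
        return "unknown (no BOM; possibly ANSI or BOM-less UTF-8)";
    }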
Don't worry, there is no difference between reading ANSI and Unicode files; the difference only comes into play during processing.

Simplest way to convert a Unicode code point into UTF-8

What's the simplest way to convert a Unicode codepoint into a UTF-8 byte sequence in C? The only way that springs to mind is using iconv to map from the UTF-32LE codepage to UTF-8, but that seems like overkill.
Unicode conversion is not a simple task. Using iconv doesn't seem like overkill at all to me. Perhaps there is a library version of iconv you can use to avoid making a system() call, if that's what you want to avoid.
Might I suggest ICU? It's a reasonably "industry standard" way of handling i18n issues.
I haven't used the C version myself, but I suspect ucnv_fromUnicode might be the function you're after.
UTF-8 works by encoding the length of the encoded code point into the highest bits of the encoded bytes; see http://en.wikipedia.org/wiki/UTF-8#Description
I found this small C function at http://www.deanlee.cn/programming/convert-unicode-to-utf8/, though I haven't tested it.
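For reference, a minimal hand-rolled encoder following the bit layout described in the answer above; a sketch, so check it against the spec before relying on it:

    #include <stddef.h>
    #include <stdint.h>

    /* Encode one Unicode code point (0..0x10FFFF, excluding surrogates)
       as UTF-8. Returns the number of bytes written to buf, or 0 on error. */
    size_t utf8_encode(uint32_t cp, unsigned char buf[4])
    {
        if (cp <= 0x7F) {
            buf[0] = (unsigned char)cp;
            return 1;
        }
        if (cp <= 0x7FF) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp <= 0xFFFF) {
            if (cp >= 0xD800 && cp <= 0xDFFF)
                return 0; /* surrogate halves are not valid code points */
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
        if (cp <= 0x10FFFF) {
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0; /* out of Unicode range */
    }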
