Char sent via socket - C

This question is purely theoretical, not about the right way of doing it: do we need to convert a char 'x' to network format? I'm interested in all cases: always / sometimes / never.
I personally think I should, but I need to be sure. Thank you.

No, char is a single-byte value, so endianness doesn't matter.

As you're thinking of it (endianness, ntohs, ntohl, etc.), no.
Less fundamentally, I should raise a warning that isn't network-specific: any string not accompanied by its encoding is unreadable.
Say you're sending or storing, over the network or not, the string "Français": the 'ç' has to be encoded using some character encoding. If you don't specify which, it can be read back as "FranÃ§ais" when you encoded it as UTF-8 but your reader assumed Latin-1.
So you have two solutions:
Write down a spec for your application to always use the same character encoding, or
Put metadata in a header somewhere specifying the encoding of the strings that follow (like HTTP does).
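
To make the endianness point at the top of this answer concrete, here is a minimal sketch (POSIX sockets assumed; "sock" stands for an already-connected socket descriptor, my own placeholder): a lone char goes out as-is, while a multi-byte integer such as a length or port number is what needs htons/htonl.
#include <arpa/inet.h>   /* htons */
#include <stdint.h>
#include <unistd.h>      /* write */

/* Sketch only: "sock" is assumed to be an already-connected socket. */
void send_example(int sock)
{
    char c = 'x';
    write(sock, &c, 1);             /* one octet: no byte order to worry about */

    uint16_t value = 8080;
    uint16_t wire  = htons(value);  /* multi-byte value: convert to network order */
    write(sock, &wire, sizeof wire);
}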

Related

What is the difference between sqlite3_bind_text, sqlite3_bind_text16 and sqlite3_bind_text64?

I am using the SQLite3 C interface. After reading the documentation at https://www.sqlite.org/c3ref/bind_blob.html, I am totally confused.
What is the difference between sqlite3_bind_text, sqlite3_bind_text16 and sqlite3_bind_text64?
The documentation only describes that sqlite3_bind_text64 can accept an encoding parameter: SQLITE_UTF8, SQLITE_UTF16, SQLITE_UTF16BE, or SQLITE_UTF16LE.
So I guess, based on the parameters passed to these functions, that:
sqlite3_bind_text is for ANSI characters, char *
sqlite3_bind_text16 is for UTF-16 characters,
sqlite3_bind_text64 is for the various encodings mentioned above.
Is that correct?
One more question:
The documentation says: "If the fourth parameter to sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the length of the string is the number of bytes up to the first zero terminator." But it does not say what happens for sqlite3_bind_text64. Originally I thought this was a typo. However, when I pass -1 as the fourth parameter to sqlite3_bind_text64, I always get an SQLITE_TOOBIG error, which makes me think sqlite3_bind_text64 was left out of that statement on purpose. Is that correct?
Thanks
sqlite3_bind_text() is for UTF-8 strings.
sqlite3_bind_text16() is for UTF-16 strings using your processor's native endianness.
sqlite3_bind_text64() lets you specify a particular encoding (UTF-8, native UTF-16, or UTF-16 of a specific endianness). You'll probably never need it.
sqlite3_bind_blob() should be used for non-Unicode strings that are just treated as binary blobs; all SQLite string functions work only with Unicode.
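
As an illustration (a sketch only: "stmt" is assumed to be an already-prepared statement with two text parameters), and a likely explanation for the second question: the length parameter of sqlite3_bind_text64() is an unsigned sqlite3_uint64, so passing -1 wraps around to an enormous value, which would explain the SQLITE_TOOBIG error; pass the exact byte count instead.
#include <sqlite3.h>
#include <string.h>

/* Sketch: "stmt" is assumed to be a prepared statement with two text parameters. */
static int bind_both(sqlite3_stmt *stmt)
{
    /* UTF-8 string; -1 means "count bytes up to the first NUL". */
    int rc = sqlite3_bind_text(stmt, 1, "héllo", -1, SQLITE_TRANSIENT);
    if (rc != SQLITE_OK)
        return rc;

    /* 64-bit variant: the length is unsigned, so pass the byte count explicitly. */
    const char *s = "héllo";
    return sqlite3_bind_text64(stmt, 2, s, (sqlite3_uint64)strlen(s),
                               SQLITE_TRANSIENT, SQLITE_UTF8);
}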

C convert char* to network byte order before transfer

I'm working on a project where I must send data to a server, and the client will run on different OSes. I know about the problem with endianness on different machines, so I'm converting 'everything' (almost) to network byte order using htonl, and converting back on the other end.
Also, I know that for a single byte I don't need to convert anything. But what should I do for a char*? For example:
send(sock,(char*)text,sizeof(text));
What's the best approach to solve this? Should I create an 'intermediate function' to intercept this send and then actually send the char array character by character? If so, do I need to convert every char to network byte order? I think not, since every char is only one byte.
Thinking about it, if I create this intermediate function, I won't have to convert anything else to network byte order, since the function sends char by char and therefore needs no endianness conversion.
I'd appreciate any advice on this.
I am presuming from your question that the application-layer protocol (more specifically, everything above layer 4) is under your design control. For single-byte-wide (octet-wide, in networking parlance) data there is no issue with endian ordering, and you need do nothing special to accommodate it. Now if the character data is prepended with a length specifier that is, say, 2 octets, then the ordering of those bytes must be treated consistently.
Going with network byte ordering (big-endian) will certainly fill the bill, but so would consistently using little-endian. Consistency of byte ordering on each end of a connection is the crucial issue.
If the protocol is not under your design control, then the protocol specification should offer guidance on the issue of byte ordering for multi-byte integers and you should follow that.
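
A minimal sketch of that idea (POSIX sockets assumed, "sock" being an already-connected descriptor, and short writes ignored for brevity): the 2-octet length prefix gets htons, while the character bytes go out untouched.
#include <arpa/inet.h>   /* htons */
#include <stdint.h>
#include <string.h>
#include <unistd.h>      /* write */

/* Send a string as a 2-octet length in network byte order followed by the raw bytes. */
static int send_string(int sock, const char *text)
{
    uint16_t len  = (uint16_t)strlen(text);
    uint16_t wire = htons(len);                   /* multi-byte prefix: convert */

    if (write(sock, &wire, sizeof wire) != (ssize_t)sizeof wire)
        return -1;
    if (write(sock, text, len) != (ssize_t)len)   /* single octets: send as-is */
        return -1;
    return 0;
}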

Portable literal strings in C source files

Ok, I have this:
AllocConsole();
SetConsoleOutputCP(CP_UTF8);
HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 10, NULL, NULL);
WriteConsoleW(consoleHandle, L"wΕλληνικά\n", 10, NULL, NULL);
printf("aΕλληνικά\n");
wprintf(L"wΕλληνικά\n");
Now, the issue is that depending on the encoding the file was saved in, only some of these work. wprintf never works, but I already know why (the broken Microsoft stdout implementation, which only accepts narrow characters). Yet I have issues with the three others. If I save the file as UTF-8 without a signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version to work, I need to increase the character(?) count to 18:
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 18, NULL, NULL);
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide characters (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16-encoded string, not UTF-8.
If I save the file as UTF-8 with a BOM (as it should be), then WriteConsoleW somehow starts to work (???) and everything else stops working (I get ? instead of the characters). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
Now, let's go to UTF-16 (Unicode - codepage 1200): only WriteConsoleW works. The character count in WriteConsoleA should be 10 to keep the formatting precise.
Saving in UTF-16 big-endian mode (Unicode - codepage 1201) does not change anything. Again, WTF? Shouldn't the byte order inside the strings be inverted when stored to the file?
The conclusion is that the way strings are compiled into binary form depends on the encoding used. So what is the portable, compiler-independent way to store strings? Is there a preprocessor that would convert one string representation into another before compilation, so I could store the file as UTF-8 and only preprocess the strings I need in UTF-16 by wrapping them in some macro?
I think you've got at least a few assumptions here which are either wrong or not 100% correct as far as I know:
Now, the issue is that depending on the encoding the file was saved in, only some of these work.
Of course, because the encoding determines how to interpret the string literals.
wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters).
I've never heard of that one, but I'm rather sure this depends on the locale set for your program. I've got a few work projects where a locale is set and the output is just fine, using German umlauts etc.
If I save the file as UTF-8 without a signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version to work, I need to increase the character(?) count to 18:
That's because the ANSI version wants an ANSI string, while you're passing a UTF-8 encoded string (based on the file's encoding). The output still works, because the console handles the UTF-8 conversion for you - you're essentially printing raw UTF-8 here.
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide characters (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16-encoded string, not UTF-8.
I don't think so (although I'm not sure why it isn't working either). Have you tried setting some easy-to-find string and looking for it in the resulting binary? I'm rather sure it's indeed encoded as UTF-16. I assume that, due to the missing BOM, the compiler interprets the whole thing as a narrow string and therefore converts the UTF-8 content incorrectly.
If I save the file as UTF-8 with a BOM (as it should be), then WriteConsoleW somehow starts to work (???) and everything else stops working (I get ? instead of the characters). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
This is exactly what I described above. Now the wide string is encoded properly, because the compiler knows the file is in UTF-8, not ANSI (or some codepage). The narrow string is properly converted to the locale being used as well.
Overall, there's no encoding-independent way to do it, unless you escape everything using the proper codepage and/or UTF codes in advance. I'd just stick to UTF-8 with a BOM, because I think all current compilers will be able to read and interpret the file properly (besides Microsoft's Resource Compiler, although I haven't tried feeding the 2012 version with UTF-8).
Edit:
To use an analogy:
You're essentially saving a raw image to a file and expecting it to work properly no matter whether other programs read it as a grayscale, palettized, or full-color image. This won't work (even if the differences are smaller here).
The answer is here.
Quoting:
It is impossible for the compiler to intermix UTF-8 and UTF-16 strings into the compiled output! So you have to decide for one source code file:
either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use the L prefix),
or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use the L prefix);
7-bit ASCII characters are not involved and can be used with or without the L prefix.
The only portable and compiler-independent way is to use the ASCII character set and escape sequences, because there is no guarantee that any given compiler will accept a UTF-8 encoded file, and compilers' treatment of those multibyte sequences might vary.
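
As a sketch of that last point, the Greek string from the question can be spelled with universal character names (\uXXXX), which keep the source file in 7-bit ASCII regardless of how the editor saves it. The u8 prefix requires a C11-capable compiler (an assumption on my part about the toolchain); with older compilers you could spell out the UTF-8 bytes with \x escapes instead.
/* The file itself contains only ASCII; the escapes pin down the characters. */
const wchar_t *wide = L"w\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC\n";   /* UTF-16 on Windows */
const char    *utf8 = u8"a\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC\n";  /* C11 u8"" literal */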

How to convert Unicode code point values (UTF-16) to a C char array

I have an API which takes Unicode data as a C character array and sends it as a correct SMS in Unicode.
Now I have four code point values corresponding to four characters in some native alphabet, and I want to send them correctly by inserting them into a C char array.
I tried
char test_data[] = {"\x00\x6B\x00\x6A\x00\x63\x00\x69"};
where 0x006B is one code point and so on.
The api internally is calling
int len = mbstowcs(NULL,test_data,0);
which results in 0 for the above. It seems 0x00 is treated as a terminating null.
I want to assign the above code points to the C array so that they come out as the corresponding UTF-16 characters on the receiving phone (which does support the character set). If required, I have the leverage to change the API too.
The platform is Linux with glib.
UTF-16BE is not the native execution (a.k.a. multibyte) character set, and mbstowcs expects null-terminated strings, so this will not work. Since you are using Linux, the function probably expects any char[] sequence to be UTF-8.
I believe you can transcode character data in Linux using uniconv. I've only used the ICU4C project.
Your code would read the UTF-16BE data, transcode it to a common form (e.g. uint8_t), then transcode it to the native execution character set prior to calling the API (which will then transcode it to the native wide character set.)
Note: this may be a lossy process if the execution character set does not contain the relevant code points, but you have no choice because this is what the API is expecting. But as I noted above, modern Linux systems should default to UTF-8. I wrote a little bit about transcoding codepoints in C here.
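
To make the transcoding step concrete, here is a minimal sketch using POSIX iconv (my assumption about what is conveniently available on a glibc system; the libraries named above work the same way conceptually). It converts the four UTF-16BE code units from the question into an ordinary NUL-terminated UTF-8 C string, which a UTF-8 locale's mbstowcs can then handle.
#include <iconv.h>
#include <stdio.h>

int main(void)
{
    /* The four UTF-16BE code units from the question (8 bytes of payload). */
    char in[] = "\x00\x6B\x00\x6A\x00\x63\x00\x69";
    char out[64] = {0};

    char  *inp = in, *outp = out;
    size_t inleft = 8, outleft = sizeof out - 1;

    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");
    if (cd == (iconv_t)-1)
        return 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        return 1;
    iconv_close(cd);

    printf("%s\n", out);   /* now a NUL-terminated UTF-8 string: "kjci" */
    return 0;
}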
I think using wchar_t would solve your problem.
Correct me if I am wrong or missing something.
I think you should create a union of chars and ints.
typedef union { int int_arr[200]; char char_arr[800]; } wchars;
Then memcpy the data into this union for your assignment.

Reading files using Windows API

I'm trying to write a console program that reads characters from a file.
I want it to be able to read from a Unicode file as well as an ANSI one.
How should I address this issue? Do I need to programmatically distinguish the type of file and read accordingly? Or can I somehow use the Windows API data types like TCHAR and the like?
Is the only difference between reading the files that in Unicode I have to read 2 bytes per character, while in ANSI it's 1 byte?
I'm a little lost with the Windows API.
I would appreciate any help.
Thanks
You can try reading the first chunk of the file in binary mode into a buffer (1 KB should be enough) and using the IsTextUnicode function to determine whether it's likely to be some variety of Unicode. Note that unless this function finds some "strong" proof that the buffer is Unicode text (e.g. a BOM), it essentially performs a statistical analysis to decide what the text "looks like", so it can give wrong results; a case in which it fails is the (in)famous "Bush hid the facts" bug.
Still, you can set how much guesswork is done using its flags.
Notice that if your application does not really manipulate or display the text, you may not even need to determine whether it's ANSI or Unicode; you can just treat it as encoding-agnostic data.
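
A rough sketch of that approach (the file name is just a placeholder of mine, and error handling is minimal); passing NULL for the flags pointer lets IsTextUnicode run all of its heuristics:
#include <windows.h>
#include <stdio.h>

int main(void)
{
    BYTE  buf[1024];
    DWORD bytesRead = 0;

    /* "input.txt" is a placeholder file name. */
    HANDLE h = CreateFileA("input.txt", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;
    if (!ReadFile(h, buf, sizeof buf, &bytesRead, NULL)) {
        CloseHandle(h);
        return 1;
    }
    CloseHandle(h);

    /* NULL for the flags pointer applies all of the function's tests. */
    if (IsTextUnicode(buf, (int)bytesRead, NULL))
        puts("Probably UTF-16 text");
    else
        puts("Probably ANSI / UTF-8 text");
    return 0;
}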
I'm not sure if the Windows API has some utility methods for all kinds of text files.
In general, you need to read the BOM of the file (http://en.wikipedia.org/wiki/Byte_Order_Mark), which tells you which Unicode encoding is actually used so you can read the characters correctly.
You read bytes from a file, and then parse these bytes as the expected format dictates.
Typically you check if a file contains UTF text by reading the initial BOM, and then proceed to read the remainder of the file bytes, parsing these the way you think they're encoded.
Unicode text is typically encoded as UTF-8 (1-4 bytes per character), UTF-16 (2 or 4 bytes per character), or UTF-32 (4 bytes per character).
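
A rough sketch of that BOM check (reading the start of the file as binary bytes and comparing against the well-known signatures; a file without a BOM would still need the kind of guesswork described in the previous answer):
#include <stddef.h>

/* Return a label for the BOM at the start of the buffer, if any. */
static const char *detect_bom(const unsigned char *b, size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16 LE";   /* could also be UTF-32 LE if followed by 00 00 */
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16 BE";
    return "no BOM (ANSI or BOM-less UTF-8)";
}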
Don't worry, there is no difference between reading ANSI and Unicode files; the difference only comes into play when processing the data.
