How to convert Unicode code point values (UTF-16) to a C char array - c

I have an API which takes Unicode data as a C character array and sends it as a correct SMS in Unicode.
Now I have four code point values corresponding to four characters in some native alphabet, and I want to send those correctly by inserting them into a C char array.
I tried
char test_data[] = {"\x00\x6B\x00\x6A\x00\x63\x00\x69"};
where 0x006B is one code point and so on.
The API internally calls
int len = mbstowcs(NULL,test_data,0);
which results in 0 for the above. It seems the 0x00 byte is treated as a terminating null.
I want to assign the above code points correctly to the C array so they come out as the corresponding UTF-16 characters on the receiving phone (which does support the character set). If required, I have the leverage to change the API too.
The platform is Linux with glib.

UTF-16BE is not the native execution (AKA multibyte) character set and mbstowcs does expect null-terminated strings, so this will not work. Since you are using Linux, the function is probably expecting any char[] sequence to be UTF-8.
I believe you can transcode character data in Linux using uniconv. I've only used the ICU4C project.
Your code would read the UTF-16BE data, transcode it to a common form (e.g. uint8_t), then transcode it to the native execution character set prior to calling the API (which will then transcode it to the native wide character set.)
Note: this may be a lossy process if the execution character set does not contain the relevant code points, but you have no choice because this is what the API is expecting. But as I noted above, modern Linux systems should default to UTF-8. I wrote a little bit about transcoding codepoints in C here.
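For what it's worth, here is a minimal sketch of that transcoding step using POSIX iconv (the helper name utf16be_to_utf8 and the fixed-size output buffer are just illustrative; glib users could likely do the same with g_convert):
#include <iconv.h>
#include <stdio.h>

/* Illustrative helper: convert a UTF-16BE buffer to UTF-8 using iconv.
   Returns the number of bytes written to 'out', or (size_t)-1 on error. */
static size_t utf16be_to_utf8(const char *in, size_t in_len,
                              char *out, size_t out_len)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;          /* iconv's API takes non-const pointers */
    char *outp = out;
    size_t in_left = in_len, out_left = out_len;

    size_t rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_len - out_left;       /* bytes of UTF-8 produced */
}

int main(void)
{
    /* The four code points from the question: U+006B U+006A U+0063 U+0069 */
    const char utf16be[] = { 0x00, 0x6B, 0x00, 0x6A, 0x00, 0x63, 0x00, 0x69 };
    char utf8[64] = { 0 };

    size_t n = utf16be_to_utf8(utf16be, sizeof utf16be, utf8, sizeof utf8 - 1);
    if (n != (size_t)-1)
        printf("UTF-8: %s (%zu bytes)\n", utf8, n);   /* "kjci" for this input */
    return 0;
}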

I think using wchar_t would solve your problem.
Correct me if I am wrong or missing something.

I think you should create a union of chars and ints.
typedef union wchars { int int_arr[200]; char char_arr[800]; } wchars;
Then memcpy the data into this union for your assignment.
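For what it's worth, a minimal sketch of what this seems to describe (the names and sizes are illustrative, and note that memcpy only reinterprets the raw bytes; it performs no encoding conversion or byte-order fix-up):
#include <string.h>

typedef union {
    int  int_arr[200];
    char char_arr[800];
} wchars;

/* Copy raw bytes into the union; the same storage can then be read either
   as chars or as ints. Assumes a 4-byte int and matching endianness. */
void fill_union(wchars *u, const char *src, size_t n)
{
    memcpy(u->char_arr, src, n);
}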

Related

Effect of Wide Characters/ Strings on a C Program

Below is an excerpt from an old edition of the book Programming Windows by Charles Petzold:
There are, of course, certain disadvantages to using Unicode. First and foremost is that every string in your program will occupy twice as much space. In addition, you'll observe that the functions in the wide-character run-time library are larger than the usual functions.
Why would every string in my program occupy twice the bytes, should not only the character arrays we've declared as storing wchar_t type do so?
Is there perhaps some condition under which, if a program is to be able to work with long values, the entire mode the program operates in is altered?
Usually if we declare a long int, we never fuss over or mention the fact that all ints will be occupying double the memory now. Are strings somehow a special case?
Why would every string in my program occupy twice the bytes, should not only the character arrays we've declared as storing wchar_t type do so?
As I understand it, the point is that if you have a program that uses char * and you rewrite that program to use wchar_t *, then it will use (more than) twice the bytes.
If a string could potentially contain a character outside of the ASCII range, you'll have to declare it as a wide string. So most strings in the program will be bigger. Personally, I wouldn't worry about it; if you need Unicode, you need Unicode, and a few more bytes aren't going to kill you.
That seems to be what you're saying, and I agree. But the question is skating the fine line between opinionated and objective.
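To make the space cost concrete, here is a small sketch comparing a narrow and a wide literal of the same text (on the Windows systems Petzold describes, wchar_t is 2 bytes, so the wide version is exactly twice the size; on typical Linux it is 4 bytes):
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    char    narrow[] = "hello";    /* 6 bytes: 5 characters + '\0' */
    wchar_t wide[]   = L"hello";   /* 12 bytes with 2-byte wchar_t, 24 with 4-byte */

    printf("sizeof narrow = %zu\n", sizeof narrow);
    printf("sizeof wide   = %zu\n", sizeof wide);
    return 0;
}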
Unicode has several encoding forms: UTF-8, UTF-16, and UTF-32 (https://en.wikipedia.org/wiki/Unicode).
You can check their advantages and disadvantages to work out which one fits your situation.
Reference: UTF-8, UTF-16, and UTF-32

What is the difference between sqlite3_bind_text, sqlite3_bind_text16 and sqlite3_bind_text64?

I am using the sqlite3 C interface. After reading the documentation at https://www.sqlite.org/c3ref/bind_blob.html, I am totally confused.
What is the difference between sqlite3_bind_text, sqlite3_bind_text16 and sqlite3_bind_text64?
The documentation only describes that sqlite3_bind_text64 can accept an encoding parameter: SQLITE_UTF8, SQLITE_UTF16, SQLITE_UTF16BE, or SQLITE_UTF16LE.
So I guess, based on the parameters passed to these functions, that:
sqlite3_bind_text is for ANSI characters, char *
sqlite3_bind_text16 is for UTF-16 characters,
sqlite3_bind_text64 is for various encoding mentioned above.
Is that correct?
One more question:
The documentation says, "If the fourth parameter to sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the length of the string is the number of bytes up to the first zero terminator." But it does not say what will happen for sqlite3_bind_text64. Originally I thought this was a typo. However, when I pass -1 as the fourth parameter to sqlite3_bind_text64, I always get an SQLITE_TOOBIG error, which makes me think they left sqlite3_bind_text64 out of the above statement on purpose. Is that correct?
Thanks
sqlite3_bind_text() is for UTF-8 strings.
sqlite3_bind_text16() is for UTF-16 strings using your processor's native endianness.
sqlite3_bind_text64() lets you specify a particular encoding (UTF-8, native-order UTF-16, or UTF-16 with an explicit byte order). You'll probably never need it.
sqlite3_bind_blob() should be used for non-Unicode strings that are just treated as binary blobs; all SQLite string functions work only with Unicode.
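For illustration, a minimal sketch of how the three calls line up (the statement and the bind_examples helper are hypothetical; error checking is omitted). Note that the length parameter of sqlite3_bind_text64 is an unsigned sqlite3_uint64, which would explain the SQLITE_TOOBIG result when passing -1:
#include <sqlite3.h>
#include <stdint.h>

/* 'stmt' is assumed to be a prepared statement such as
   "INSERT INTO t(name) VALUES (?)". */
void bind_examples(sqlite3_stmt *stmt)
{
    /* UTF-8 input; a negative length means "read up to the NUL terminator". */
    sqlite3_bind_text(stmt, 1, "hello", -1, SQLITE_TRANSIENT);

    /* UTF-16 in the machine's native byte order; the length is in bytes. */
    static const uint16_t utf16[] = { 'h', 'e', 'l', 'l', 'o' };
    sqlite3_bind_text16(stmt, 1, utf16, (int)sizeof utf16, SQLITE_TRANSIENT);

    /* 64-bit byte count plus an explicit encoding tag. Because the length
       is unsigned, -1 wraps to an enormous value, so pass the real byte
       count instead. */
    sqlite3_bind_text64(stmt, 1, "hello", 5, SQLITE_TRANSIENT, SQLITE_UTF8);
}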

Secure MultiByteToWideChar Usage

I've got code that used MultiByteToWideChar like so:
wchar_t * bufferW = malloc(mbBufferLen * 2);
MultiByteToWideChar(CP_ACP, 0, mbBuffer, mbBufferLen, bufferW, mbBufferLen);
Note that the code does not use a previous call to MultiByteToWideChar to check how large the new unicode buffer needs to be, and assumes it will be twice the multibyte buffer.
My question is whether this usage is safe. Could there be a default code page that maps a character into a 3-byte or larger Unicode character and cause an overflow? While I'm aware the usage isn't exactly correct, I'd like to gauge the risk impact.
Could there be a default code page that maps a character into a 3-byte or larger [sequence of wchar_t UTF-16 code units]
There is currently no ANSI code page that maps a single byte to a character outside the BMP (i.e. one that would take more than one 2-byte code unit in UTF-16).
No single multi-byte ANSI character can ever be encoded as more than two 2-byte code units in UTF-16. So, at worst, you will never end up with a UTF-16 string that has more than 2x the length of the input ANSI string (not counting the null terminator, which does not apply in this case since you are passing explicit lengths), and at best you will end up with a UTF-16 string that has fewer wchar_t characters than the input string has char characters.
For what it's worth, Microsoft are endeavouring not to develop the ANSI code pages any further, and I suspect the NLS file format would need changes to allow it, so it's pretty unlikely that this will change in future. But there is no firm API promise that this will definitely always hold true.
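To remove the guesswork entirely, the usual pattern is to ask MultiByteToWideChar for the required length first and then allocate exactly that much. A sketch (the ansi_to_wide helper name is illustrative, and error handling is kept minimal):
#include <windows.h>
#include <stdlib.h>

wchar_t *ansi_to_wide(const char *mbBuffer, int mbBufferLen, int *outLen)
{
    /* First call: pass NULL/0 for the output buffer to get the required
       number of wchar_t code units. */
    int needed = MultiByteToWideChar(CP_ACP, 0, mbBuffer, mbBufferLen, NULL, 0);
    if (needed <= 0)
        return NULL;

    wchar_t *bufferW = malloc((size_t)needed * sizeof(wchar_t));
    if (bufferW == NULL)
        return NULL;

    /* Second call: perform the actual conversion into the sized buffer. */
    int written = MultiByteToWideChar(CP_ACP, 0, mbBuffer, mbBufferLen,
                                      bufferW, needed);
    if (written <= 0) {
        free(bufferW);
        return NULL;
    }
    if (outLen)
        *outLen = written;
    return bufferW;   /* NOT null-terminated, since mbBufferLen was an explicit length */
}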

Safety of using \0 before the end of character arrays

I am writing a driver for an embedded system that runs a custom, modified version of Linux (it's a handheld scanner). The manufacturer supplies a custom Eclipse Juno distribution with a few libraries and examples included.
The output I receive from the COM port comes in the form of a standard character array. I am using the individual characters in the array to convey information (error ids and error codes), like this:
if (tmp[i] == 250)
where tmp is a character array declared as char tmp[500]; that is first initialized to 0 and then filled with input from the COM port.
My question is:
Assuming I iterate through every piece of the array, is it safe to use 0 (as in \0) at any point before the end of the array? Assuming I am:
Not treating it as a string (iterating through and using it like an int array)
Aware of what is going to be in there and what exactly a \0 in the middle of it is supposed to mean.
The reason I'm asking is that several coworkers told me I should never, ever use a character array that contains \0 before the end, no matter the circumstances.
My code doing this currently performs as expected, but I'm unsure whether it might cause problems later.
Rewriting it to avoid this behaviour would be a non-trivial chunk of work.
Using an array of char as an array of small integers is perfectly fine. Just be careful not to pass it to any kind of function that expects "strings".
And if you want to be more explicit about it, and also make sure the array uses unsigned values, you could use uint8_t instead of char.
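A short sketch of that suggestion (the buffer size and the 250 marker come from the question; ERR_MARKER and process are illustrative names). Using uint8_t also sidesteps a subtle issue: if plain char is signed on the target, tmp[i] == 250 can never be true, because the byte 0xFA compares as -6.
#include <stdint.h>
#include <string.h>

#define BUF_SIZE   500   /* same size as the tmp buffer in the question */
#define ERR_MARKER 250   /* illustrative error id */

/* Scan the whole buffer by length; an embedded 0 byte is just data here,
   not a terminator, because nothing treats the buffer as a C string. */
void process(const uint8_t *tmp, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (tmp[i] == ERR_MARKER) {
            /* handle the error id found at position i */
        }
    }
}

int main(void)
{
    uint8_t tmp[BUF_SIZE];
    memset(tmp, 0, sizeof tmp);   /* initialized to 0, as in the question */
    tmp[0] = ERR_MARKER;
    tmp[1] = 0;                   /* an embedded zero is harmless here */
    process(tmp, sizeof tmp);
    return 0;
}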

char vs wchar_t

I'm trying to print out a wchar_t* string.
Code goes below:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
char *ascii_ = "中日友好"; //line-1
wchar_t *wchar_ = L"中日友好"; //line-2
int main()
{
    printf("ascii_: %s\n", ascii_); //line-3
    wprintf(L"wchar_: %s\n", wchar_); //line-4
    return 0;
}
//Output
ascii_: 中日友好
Question:
Apparently I should not assign CJK characters to a char* pointer in line-1, but I just did it, and the output of line-3 is correct. So why? How could printf() in line-3 give me the non-ASCII characters? Does it know the encoding somehow?
I assume the code in line-2 and line-4 is correct, but why didn't I get any output from line-4?
First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which is compatible with ASCII.
Now, as for why the wprintf() isn't working: this has to do with stream orientation. Each stream can only be set to either byte (narrow) or wide orientation. Once set, it cannot be changed, and it is set the first time the stream is used (byte-oriented here, because of the printf). After that, the wprintf will not work due to the incorrect orientation.
In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().
You cannot intermix printf() and wprintf(). (except on Windows)
EDIT:
To answer the question about why the wprintf line doesn't work even by itself: it's probably because the code is being compiled so that the UTF-8 bytes of 中日友好 are stored into wchar_. However, wchar_t needs a 4-byte Unicode encoding (2 bytes on Windows).
So there are two options that I can think of:
Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but may break if the user's system is not set to the Chinese locale.
Use wchar_t, but you will need to encode the Chinese characters using Unicode escape sequences. This will obviously make them unreadable in the source code, but it will work on any machine that can print Chinese character fonts, regardless of the locale (see the sketch below).
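A minimal sketch of that second option, combined with the orientation rule above: use only wide-character output, set the locale, and print the wchar_t string with %ls rather than %s (the \u escapes spell out 中日友好):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Let the C library pick up the user's locale (e.g. a UTF-8 locale),
       so wide characters can be converted on output. */
    setlocale(LC_ALL, "");

    /* U+4E2D U+65E5 U+53CB U+597D == 中日友好, written as escapes so the
       source file stays plain ASCII. */
    const wchar_t *wchar_ = L"\u4E2D\u65E5\u53CB\u597D";

    /* Use only wide-character output on stdout, and %ls for wchar_t*. */
    wprintf(L"wchar_: %ls\n", wchar_);
    return 0;
}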
Line 1 is not ASCII; it's whatever multibyte encoding is used by your compiler at compile time. On modern systems that's probably UTF-8. printf does not know the encoding. It's just sending bytes to stdout, and as long as the encodings match, everything is fine.
One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix byte-oriented and wide-character I/O on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform operations of the opposite orientation results in UB.
You are omitting one step and are therefore thinking about this the wrong way.
You have a C file on disk, containing bytes. You have an "ASCII" string and a wide string.
The ASCII string takes the bytes exactly as they are in line 1 and outputs them.
This works as long as the encoding on the user's side is the same as the one on the programmer's side.
For the wide string, the given bytes are first decoded into Unicode code points and stored in the program; maybe this step goes wrong on your side. On output they are encoded again according to the encoding on the user's side. This ensures that the characters are emitted as they were intended, not merely as they were entered.
Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.
