C CSV API for unicode

I need a C API for manipulating CSV data that can work with Unicode. I am aware of libcsv (sourceforge.net/projects/libcsv), but I don't think it will work for Unicode (please correct me if I'm wrong) because I don't see wchar_t being used.
Please advise.

It looks like libcsv does not use the C string functions to do its work, so it almost works out of the box despite its ignorance of multibyte and wide-character encodings: it treats the string as an array of bytes with an explicit length. This might mostly work for certain wide-character encodings that pad ASCII bytes out to the full width (so newline might be encoded as "\0\n" and space as "\0 "). You could also encode your wide data as UTF-8, which should make things a bit easier. But both approaches might founder on the way libcsv identifies space and line-terminator tokens: it expects you to tell it on a byte-by-byte basis whether it is looking at a space or a terminator, which doesn't allow for multibyte space/terminator encodings. You could fix this by modifying the library so that its space/terminator test functions receive a pointer into the string and the number of bytes remaining, which would be pretty straightforward.
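To illustrate why the UTF-8 route mostly works, here is a hedged sketch (not libcsv itself, just the underlying property it relies on): ASCII delimiters can be found byte-by-byte in UTF-8 data, because every byte of a multibyte UTF-8 sequence has its high bit set and can never collide with ',' or '\n'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libcsv: count the fields in one CSV
 * row by scanning bytes. Safe on UTF-8 input because no byte of a
 * multibyte UTF-8 character can equal the ASCII comma (0x2C). */
static size_t csv_count_fields(const char *row, size_t len)
{
    size_t fields = 1;
    for (size_t i = 0; i < len; i++)
        if (row[i] == ',')  /* never fires inside a multibyte sequence */
            fields++;
    return fields;
}
```

The same property is what makes the suggested space/terminator callback fix straightforward: an ASCII-only test stays correct on UTF-8 without any decoding.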

Related

What is the difference between sqlite3_bind_text, sqlite3_bind_text16 and sqlite3_bind_text64?

I am using the sqlite3 C interface. After reading the documentation at https://www.sqlite.org/c3ref/bind_blob.html , I am totally confused.
What is the difference between sqlite3_bind_text, sqlite3_bind_text16 and sqlite3_bind_text64?
The documentation only says that sqlite3_bind_text64 accepts an encoding parameter: SQLITE_UTF8, SQLITE_UTF16, SQLITE_UTF16BE, or SQLITE_UTF16LE.
So I guess, based on the parameters passed to these functions, that:
sqlite3_bind_text is for ANSI characters, char *
sqlite3_bind_text16 is for UTF-16 characters,
sqlite3_bind_text64 is for the various encodings mentioned above.
Is that correct?
One more question:
The documentation says "If the fourth parameter to sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the length of the string is the number of bytes up to the first zero terminator." But it does not say what happens for sqlite3_bind_text64. Originally I thought this was a typo. However, when I pass -1 as the fourth parameter to sqlite3_bind_text64, I always get an SQLITE_TOOBIG error, which makes me think sqlite3_bind_text64 was left out of the above statement on purpose. Is that correct?
Thanks
sqlite3_bind_text() is for UTF-8 strings.
sqlite3_bind_text16() is for UTF-16 strings using your processor's native endianness.
sqlite3_bind_text64() lets you specify a particular encoding (UTF-8, native UTF-16, or UTF-16 of a specific endianness). You'll probably never need it. Note that its length parameter is an unsigned 64-bit integer (sqlite3_uint64), so passing -1 wraps around to an enormous value, which is why you get SQLITE_TOOBIG: there is no "negative means NUL-terminated" convention for this function, and you must always pass an explicit byte length.
sqlite3_bind_blob() should be used for non-Unicode strings that are just treated as binary blobs; all sqlite string functions work only with Unicode.

using regular expression with unicode string in C

I'm currently using regular expressions on Unicode strings, but I only need to match ASCII characters, effectively ignoring all Unicode characters, and so far the functions in regex.h have worked fine (I'm on Linux, so the encoding is UTF-8). But can someone confirm whether it's really OK to do this? Or do I need a Unicode-aware regex library (like ICU)?
UTF-8 is a variable-length encoding; some characters are 1 byte, some 2, others 3 or 4. You know how many bytes to read from the bit prefix of each character's first byte: a leading 0 for 1 byte, 110 for 2 bytes, 1110 for 3 bytes, 11110 for 4 bytes.
If you try to read a UTF-8 string as ASCII, or any other fixed-width encoding, things will go very wrong... unless that UTF-8 string contains nothing but 1 byte characters in which case it matches ASCII.
However, since no UTF-8 character contains a null byte, and none of the continuation bytes can be confused with ASCII, and if you really are only matching ASCII, you might be able to get away with it... but I wouldn't recommend it, because there are much better regex options than POSIX, they're easy to use, and why leave a hidden encoding bomb in your code for some sucker to deal with later? (Note: that sucker may be you.)
Instead, use a Unicode-aware regex library like Perl Compatible Regular Expressions (PCRE). PCRE2 becomes Unicode-aware when you pass the PCRE2_UTF flag to pcre2_compile. PCRE regex syntax is more powerful and more widely understood than POSIX regexes, and PCRE has more features. GLib also ships a PCRE-based regex API (GRegex), alongside a feast of very handy C functions.
You need to be careful about your patterns and about the text you're going to match.
As an example, given the expression a.b:
"axb" matches
"aèb" does NOT match
The reason is that è is two bytes long when UTF-8 encoded but . would only match the first one.
So as long as you only match sequences of ASCII characters you're safe. If you mix ASCII and non ASCII characters, you're in trouble.
You can try to match a single UTF-8 encoded "character" with something like:
([\xC0-\xDF].|[\xE0-\xEF]..|\xF0...|.)
but this assumes that the text is encoded correctly (and, frankly, I never tried it).
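The prefix rules described above can also be written out directly; this is a hedged sketch that trusts the input to be well-formed (it does not validate continuation bytes):

```c
#include <assert.h>

/* Hedged sketch: byte length of the UTF-8 character starting at *s,
 * decided entirely by the bit prefix of the first byte. */
static int utf8_char_len(const unsigned char *s)
{
    if (s[0] < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((s[0] & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((s[0] & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((s[0] & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return -1;                            /* continuation/invalid byte */
}
```

A byte-oriented regex engine has no such function built in, which is why '.' stops at the first byte of a multibyte character.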

Secure MultiByteToWideChar Usage

I've got code that used MultiByteToWideChar like so:
wchar_t * bufferW = malloc(mbBufferLen * 2);
MultiByteToWideChar(CP_ACP, 0, mbBuffer, mbBufferLen, bufferW, mbBufferLen);
Note that the code does not use a previous call to MultiByteToWideChar to check how large the new unicode buffer needs to be, and assumes it will be twice the multibyte buffer.
My question is if this usage is safe? Could there be a default code page that maps a character into a 3-byte or larger unicode character, and cause an overflow? While I'm aware the usage isn't exactly correct, I'd like to gauge the risk impact.
Could there be a default code page that maps a character into a 3-byte or larger [sequence of wchar_t UTF-16 code units]
There is currently no ANSI code page that maps a single byte to a character outside the BMP (i.e. one that would take more than one 2-byte code unit in UTF-16).
No single multi-byte ANSI character can ever be encoded as more than two 2-byte code units in UTF-16. So, at worst, you will never end up with a UTF-16 string longer than 2x the length of the input ANSI string (not counting the null terminator, which does not apply here since you are passing explicit lengths), and at best you will end up with a UTF-16 string that has fewer wchar_t characters than the input string has char characters.
For what it's worth, Microsoft are endeavouring not to develop the ANSI code pages any further, and I suspect the NLS file format would need changes to allow it, so it's pretty unlikely that this will change in future. But there is no firm API promise that this will definitely always hold true.
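Even so, the robust pattern is the two-call idiom the question alludes to: ask MultiByteToWideChar for the required size first. A sketch (reusing mbBuffer/mbBufferLen from the question; error handling kept minimal, and MB_ERR_INVALID_CHARS added so malformed input fails instead of being silently replaced):

```c
#include <windows.h>
#include <stdlib.h>

/* Hedged sketch: convert an ANSI buffer to UTF-16 with the two-call
 * idiom. On success *outLen receives the number of wchar_t units and
 * the returned buffer must be free()d by the caller. */
static wchar_t *ansi_to_wide(const char *mbBuffer, int mbBufferLen,
                             int *outLen)
{
    /* First call: NULL/0 output asks for the required size. */
    int wlen = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS,
                                   mbBuffer, mbBufferLen, NULL, 0);
    if (wlen <= 0)
        return NULL;

    wchar_t *bufferW = malloc((size_t)wlen * sizeof(wchar_t));
    if (!bufferW)
        return NULL;

    /* Second call: convert into a buffer of exactly that size. */
    MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS,
                        mbBuffer, mbBufferLen, bufferW, wlen);
    *outLen = wlen;
    return bufferW;
}
```

This removes any dependence on the "2x is always enough" observation, so the code stays correct even if the code-page guarantees ever change.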

unicode string manipulation in c

I am using gcc on Linux Mint 15 and my terminal understands Unicode. I will be dealing with UTF-8. I am trying to obtain the base word of a more complex Unicode string, sort of like trimming the word 'alternative' down to 'alternat', but in a different language. Hence I will need to test the ending of each word.
In C and ASCII, I can do something like this:
if(string[last_char]=='e')
last_char-=1; //Throws away the last character
Can I do something similar with Unicode? That is, something like this:
if(string[last_char]=='ഒ')
last_char-=1;
EDIT:
Sorry, as @chux said, I just noticed you are asking about C. Anyway, the same principle holds.
In C you can use wscanf and wprintf to do I/O with wide-character strings. If your characters are inside the BMP you'll be fine. Just replace char * with wchar_t * and do everything as usual.
For serious development I'd recommend converting all strings to char32_t for processing, or using a library like ICU.
If all you need is to remove some given characters from the string, then maybe you don't need complex Unicode character handling. Treat your Unicode string as a raw char * string and do whatever string operations you need on it.
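Following that advice, the ending test from the question can be done with plain byte operations; a hedged sketch (the helper names are made up), which works because comparing byte suffixes is safe in UTF-8:

```c
#include <assert.h>
#include <string.h>

/* Does the UTF-8 string s end with the UTF-8-encoded suffix?
 * A byte-level comparison cannot mis-align in UTF-8. */
static int utf8_ends_with(const char *s, const char *suffix)
{
    size_t ls = strlen(s), lf = strlen(suffix);
    return ls >= lf && memcmp(s + ls - lf, suffix, lf) == 0;
}

/* Throw away the trailing character(s) if they match, as in the
 * question's 'alternative' -> 'alternat' example. */
static void utf8_chop(char *s, const char *suffix)
{
    if (utf8_ends_with(s, suffix))
        s[strlen(s) - strlen(suffix)] = '\0';
}
```

Usage: `utf8_chop(word, "ഒ");` removes a trailing ഒ (bytes E0 B4 92) without ever decoding the string.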
The old C++ oriented answer is reproduced below, for reference.
The easy way
Use std::wstring
It's basically an std::string but individual characters are typed wchar_t.
And for IO you should use std::wcin and std::wcout. For example:
std::wstring str;
std::wcin >> str;
std::wcout << str << std::endl;
However, on some platforms wchar_t is 2 bytes wide, which means characters outside the BMP will not work. This should be okay for you, I think, but it should not be relied on in serious development. For more on this topic, read this.
The hard way
Use a better unicode-aware string processing library like ICU.
The C++11 way
Use some mechanisms to convert your input string to std::u32string and you're done. The conversion routines can be hand-crafted or using an existing library like ICU.
As std::u32string is formed using char32_t, you can safely assume you're dealing with Unicode correctly.

Portable literal strings in C source files

Ok, I have this:
AllocConsole();
SetConsoleOutputCP(CP_UTF8);
HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 10, NULL, NULL);
WriteConsoleW(consoleHandle, L"wΕλληνικά\n", 10, NULL, NULL);
printf("aΕλληνικά\n");
wprintf(L"wΕλληνικά\n");
Now, the issue is that depending on the encoding the file was saved as, only some of these work. wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters). Yet I have issues with the three others. If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version working I need to increase the character(?) count to 18:
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 18, NULL, NULL);
WriteConsoleW does not work, I assume, because the string is stored as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16-encoded string, not UTF-8.
If I save the file as UTF-8 with BOM (as it should be), then WriteConsoleW somehow starts to work (???) and everything else stops working (I get ? instead of the characters). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
Now, let's go to UTF-16 (Unicode - codepage 1200). Only WriteConsoleW works. The character count in WriteConsoleA should be 10 to keep the formatting precise.
Saving in UTF-16 big-endian mode (Unicode - codepage 1201) does not change anything. Again, WTF? Shouldn't the byte order inside the strings be inverted when stored to the file?
The conclusion is that the way strings are compiled into binary form depends on the encoding used. Therefore, what is a portable and compiler-independent way to store strings? Is there a preprocessor which would convert one string representation into another before compilation, so I could store the file as UTF-8 and only preprocess the strings I need in UTF-16 by wrapping them in some macro?
I think you've got at least a few assumptions here which are either wrong or not 100% correct as far as I know:
Now, the issue is that depending on the encoding file was saved as only some these works.
Of course, because the encoding determines how to interpret the string literals.
wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters).
I've never heard of that one, but I'm rather sure this depends on the locale set for your program. I've got a few work projects where a locale is set and the output is just fine using German umlauts etc.
If I save file as UTF-8 without signature (BOM) and use MS Visual C++ compiler, only last printf works. If I want ANSI version working I need to increase character(?) count to 18:
That's because the ANSI version wants an ANSI string, while you're passing a UTF-8 encoded string (based on the file's encoding). The output still works, because the console handles the UTF-8 conversion for you - you're essentially printing raw UTF-8 here.
WriteConsoleW does not work, I assume, because the string is saved as UTF-8 byte sequence even I explicitly request it to be stored as wide-char (UTF-16) with L prefix and implementation most probably expects UTF-16 encoded string not UTF-8.
I don't think so (although I'm not sure why it isn't working either). Have you tried setting some easy-to-find string and looking for it in the resulting binary? I'm rather sure it's indeed encoded using UTF-16. I assume that due to the missing BOM the compiler interprets the whole thing as a narrow string and therefore converts the UTF-8 bytes wrong.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops (I get ? instead of a character). I need to decrease character count in WriteConsoleA back to 10 to keep formatting the same (otherwise i get 8 additional rectangles). Basically, WTF?
This is exactly what I described above. Now the wide string is encoded properly, because the compiler knows the file is UTF-8, not ANSI (or some codepage). The narrow string is properly converted to the locale being used as well.
Overall, there's no encoding-independent way to do it, unless you escape everything using the proper codepage and/or UTF escape codes in advance. I'd just stick to UTF-8 with BOM, because I think all current compilers will be able to properly read and interpret the file (besides Microsoft's resource compiler; although I haven't tried feeding the 2012 version with UTF-8).
Edit:
To use an analogy:
You're essentially saving a raw image to a file and expecting it to work properly no matter whether other programs try to read it as a grayscale, palettized, or full-color image. It won't (even if the differences are smaller here).
The answer is here.
Quoting:
It is impossible for the compiler to intermix UTF-8 and UTF-16 strings into the compiled output! So you have to decide for one source code file:
either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use the L prefix),
or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use the L prefix);
7-bit ASCII characters are not involved and can be used with or without the L prefix.
The only portable and compiler-independent way is to use the ASCII charset and escape sequences, because there are no guarantees that every compiler will accept a UTF-8-encoded file, and compiler treatment of those multibyte sequences might vary.
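For example, the "Ελληνικά" literal from the question can be spelled entirely in 7-bit ASCII source using C99 universal character names, so the file's on-disk encoding no longer matters (a sketch; note that the narrow string is still converted to the compiler's execution charset, so with MSVC you may additionally need /utf-8 to make it UTF-8):

```c
#include <string.h>
#include <wchar.h>

/* "Ελληνικά" written with \u escapes: the source file contains only
 * ASCII bytes, so every compiler reads it identically. */
static const char *greek_utf8 =
    "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac";
static const wchar_t *greek_wide =
    L"\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac";
```

On a UTF-8 execution charset the narrow string is 16 bytes (2 per Greek letter), while the wide string is 8 wchar_t code units; the escapes express the same 8 characters either way.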
