Portable literal strings in C source files - c

Ok, I have this:
AllocConsole();
SetConsoleOutputCP(CP_UTF8);
HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 10, NULL, NULL);
WriteConsoleW(consoleHandle, L"wΕλληνικά\n", 10, NULL, NULL);
printf("aΕλληνικά\n");
wprintf(L"wΕλληνικά\n");
Now, the issue is that, depending on the encoding the file was saved in, only some of these work. wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters). Yet I have issues with the three others. If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version working I need to increase the character(?) count to 18:
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 18, NULL, NULL);
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops (I get ? instead of a character). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
Now, let's go to UTF-16 (Unicode - Codepage 1200). Only WriteConsoleW works. The character count in WriteConsoleA should be 10 to keep the formatting precise.
Saving in UTF-16 Big Endian mode (Unicode - Codepage 1201) does not change anything. Again, WTF? Shouldn't the byte order inside the strings be inverted when stored to file?
The conclusion is that the way strings are compiled into binary form depends on the encoding used. Therefore, what is the portable and compiler-independent way to store strings? Is there a preprocessor which would convert one string representation into another before compilation, so I could store the file in UTF-8 and only preprocess the strings which I need to have in UTF-16 by wrapping them in some macro?

I think you've got at least a few assumptions here which are either wrong or not 100% correct as far as I know:
Now, the issue is that, depending on the encoding the file was saved in, only some of these work.
Of course, because the encoding determines how to interpret the string literals.
wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters).
I've never heard of that one, but I'm rather sure this depends on the locale set for your program. I've got a few work projects where a locale is set and the output is just fine using German umlauts etc.
If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version working I need to increase the character(?) count to 18:
That's because the ANSI version wants an ANSI string, while you're passing a UTF-8 encoded string (based on the file's encoding). The output still works, because the console handles the UTF-8 conversion for you - you're essentially printing raw UTF-8 here. The count of 18 is a byte count: each of the 8 Greek letters takes 2 bytes in UTF-8, plus 1 byte each for the a and the newline.
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
I don't think so (although I'm not sure why it isn't working either). Have you tried setting some easy-to-find string and looking for it in the resulting binary? I'm rather sure it's indeed encoded using UTF-16. I assume that, due to the missing BOM, the compiler might interpret the whole thing as a narrow string and therefore convert the UTF-8 stuff wrongly.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops (I get ? instead of a character). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
This is exactly what I described above. Now the wide string is encoded properly, because the compiler now knows the file is in UTF-8, not ANSI (or some codepage). The narrow string is properly converted to the locale being used as well.
Overall, there's no encoding-independent way to do it, unless you escape everything using the proper codepage and/or UTF codes in advance. I'd just stick to UTF-8 with BOM, because I think all current compilers will be able to properly read and interpret the file (besides Microsoft's Resource Compiler; although I haven't tried feeding the 2012 version with UTF-8).
Edit:
To use an analogy:
You're essentially saving a raw image to a file and you expect it to work properly, no matter whether other programs try to read it as a grayscale, palettized, or full color image. This won't work (despite differences being smaller).
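To check what the compiler actually stored for a literal, as suggested above, one option is to dump the literal's bytes at run time instead of searching the binary. This is only a sketch: hex_dump is a hypothetical helper name, not something from the question.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats len bytes of data as "CE 95 ..." into out.
 * Returns the number of characters written (excluding the NUL), or 0 if
 * out_cap is too small to hold the whole dump. */
size_t hex_dump(const void *data, size_t len, char *out, size_t out_cap)
{
    const unsigned char *p = (const unsigned char *)data;
    size_t n = 0;

    if (out_cap == 0)
        return 0;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* Two hex digits per byte, plus a separating space after the first. */
        size_t need = (i == 0) ? 2 : 3;
        if (n + need + 1 > out_cap) {
            out[0] = '\0';
            return 0;
        }
        n += (size_t)snprintf(out + n, out_cap - n,
                              i == 0 ? "%02X" : " %02X", (unsigned)p[i]);
    }
    return n;
}
```

Calling it on a narrow literal (with sizeof literal - 1 as the length) and on the matching L"..." literal (with sizeof literal - sizeof(wchar_t)) shows directly whether the compiler stored UTF-8 bytes, code page bytes, or UTF-16 code units for each save encoding.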

The answer is here.
Quoting:
It is impossible for the compiler to intermix UTF-8 and UTF-16
strings into the compiled output! So you have to decide for one source
code file:
either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use L prefix),
or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use L prefix),
7-bit ASCII characters are not involved and can be used with or without L prefix
The only portable and compiler-independent way is to use the ASCII charset and escape sequences, because there are no guarantees that any given compiler will accept a UTF-8 encoded file, and compilers' treatment of those multibyte sequences may vary.
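As a sketch of the escape-sequence approach, the Greek word from the question can be written with nothing but ASCII in the source file. The variable and function names here are made up for illustration. The narrow literal spells out the UTF-8 bytes with hex escapes, and the wide literal uses universal character names (\uXXXX), so neither depends on how the file was saved.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* "Ελληνικά" spelled as UTF-8 bytes with hex escapes: 8 letters, 16 bytes. */
static const char greek_utf8[] =
    "\xCE\x95\xCE\xBB\xCE\xBB\xCE\xB7\xCE\xBD\xCE\xB9\xCE\xBA\xCE\xAC";

/* The same word as a wide literal, using universal character names. */
static const wchar_t greek_wide[] =
    L"\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC";

/* Byte count to pass to a narrow API such as WriteConsoleA or fwrite. */
size_t greek_utf8_bytes(void)
{
    return strlen(greek_utf8);
}

/* wchar_t count to pass to a wide API such as WriteConsoleW. */
size_t greek_wide_units(void)
{
    return wcslen(greek_wide);
}
```

This also accounts for the two counts in the question: with the leading letter and the newline added, the UTF-8 form is 18 bytes while the wide form is 10 code units. One pitfall with hex escapes is that they swallow any hex digit that follows, so a literal like "\xA8" followed by the letter b has to be split into two adjacent literals.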

Related

Print UTF-16 string

So I want to parse an ID3v2.4 file. There are 4 types of text encoding in the format specification: ISO-8859-1, UTF-16 with BOM, UTF-16BE and UTF-8. I have already written code that obtains the bytes of the strings.
And my question is how to print UTF-16 with BOM and UTF-16BE bytes to the console.
And also one important condition: I can use only C libraries. I can't use C++ libraries. I can't even use third-party C libraries.
In general (NOT specifically for parsing ID3v2.4 files alone) you will want to choose a common character encoding that your code will use internally; then convert from any other character encoding into your chosen character encoding (for input data - e.g. from user or files or network) and convert back again (for output, to user or files or network).
For choosing a common character encoding:
you want something that minimizes "nonconvertible cases" - e.g. you wouldn't want to choose ASCII because there's far too much in far too many other character encodings that can't be converted to ASCII. This mostly means that you'll want a Unicode encoding.
you want something that is convenient. For Unicode encoding, this only really gives you 2 choices - UTF-8 (because you don't have to care about endian issues, and it's relatively efficient for space/memory consumption, and C functions like strlen() can still work) and versions of UTF-32 (because each codepoint takes up a fixed amount of space and it makes conversion a little simpler). Of these, the benefits of UTF-32 are mostly unimportant (unless you're doing a font rendering engine).
the "whatever random who-knows-what" character encoding that the C compiler uses is irrelevant (for both char and wchar_t), because it's implementation specific and not portable.
the "whatever random who-knows-what" character encoding that the terminal uses is irrelevant (the terminal should be considered "just another flavor of input/output, where conversion is involved").
Assuming you choose UTF-8:
You might be able to force the compiler to treat string literals as UTF-8 for you (e.g. with u8"hello", which is standard in both C11 and C++11). Otherwise you'll need to do it yourself where necessary.
I'd recommend using the uint8_t type for storing strings; partly because char is "signed or unsigned, depending on which way the wind is blowing" (which makes conversions to/from other character encodings painful due to "shifting a signed/negative number right" problems), and partly because it helps to find "accidentally used something that isn't UTF-8" bugs (e.g. warnings from compiler about "conversion from signed to unsigned").
Conversion between UTF-8 and UTF-32LE, UTF-32BE, UTF-16LE, UTF-16BE is fairly trivial (the relevant wikipedia articles are enough to describe how it works).
"UTF-16 with BOM" means that the first 2 bytes will tell you if it's UTF-16LE or UTF-16BE, so (after you add support for UTF-16LE and UTF-16BE) it's trivial. "UTF-32 with BOM" is similar (first 4 bytes tell you if it's UTF-32LE or UTF-32BE).
Conversion to/from ISO-8859-1 to UTF-8 is fairly trivial, because the characters match Unicode codepoints with the same value. However, often people get it wrong (e.g. say it's ISO-8859-1 when the data is actually encoded as Windows-1252 instead); and for the conversion from UTF-8 to ISO-8859-1 you will need to deal with "nonconvertible" codepoints.
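The UTF-16 to UTF-8 step described above could look roughly like the following. It is a minimal sketch in plain C with no third-party libraries; the function names are invented for this example. It assumes the caller passes the raw string bytes plus a flag saying which byte order to use when there is no BOM (1 for the UTF-16BE case).

```c
#include <stddef.h>
#include <stdint.h>

/* Read one 16-bit code unit from two bytes in the given byte order. */
static uint32_t read16(const uint8_t *p, int big_endian)
{
    return big_endian ? (((uint32_t)p[0] << 8) | p[1])
                      : (((uint32_t)p[1] << 8) | p[0]);
}

/* Encode one code point as UTF-8; returns the number of bytes written. */
static size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Convert UTF-16 bytes to a NUL-terminated UTF-8 string.
 * A leading BOM decides the byte order; without one, default_big_endian
 * is used. Unpaired surrogates become U+FFFD. Returns the UTF-8 length
 * in bytes, or (size_t)-1 if out is too small. */
size_t utf16_to_utf8(const uint8_t *in, size_t in_len, int default_big_endian,
                     uint8_t *out, size_t out_cap)
{
    int be = default_big_endian;
    size_t i = 0, n = 0;

    if (in_len >= 2 && in[0] == 0xFE && in[1] == 0xFF) {
        be = 1;
        i = 2;
    } else if (in_len >= 2 && in[0] == 0xFF && in[1] == 0xFE) {
        be = 0;
        i = 2;
    }

    while (i + 2 <= in_len) {
        uint8_t buf[4];
        size_t len;
        uint32_t cp = read16(in + i, be);
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {            /* high surrogate */
            uint32_t lo = (i + 2 <= in_len) ? read16(in + i, be) : 0;
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     /* stray low surrogate */
            cp = 0xFFFD;
        }

        len = utf8_encode(cp, buf);
        if (n + len + 1 > out_cap)
            return (size_t)-1;
        for (size_t k = 0; k < len; k++)
            out[n++] = buf[k];
    }
    if (n + 1 > out_cap)
        return (size_t)-1;
    out[n] = 0;
    return n;
}
```

The result can then be written with fputs or printf("%s"), which prints correctly only if the terminal itself expects UTF-8; converting to whatever the terminal uses is a separate output step, as described above.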

Clarification on Winapi Paths and Filename (W functions and A functions)

I have tried to check the importance of, and the reason for, using W winapi vs A (W meaning wide char, A meaning ascii, right?)
I have made a simple example; I receive the temp path for the current user like this:
CHAR pszUserTempPathA[MAX_PATH] = { 0 };
WCHAR pwszUserTempPathW[MAX_PATH] = { 0 };
GetTempPathA(MAX_PATH - 1, pszUserTempPathA);
GetTempPathW(MAX_PATH - 1, pwszUserTempPathW);
printf("pathA=%s\r\npathW=%ws\r\n",pszUserTempPathA,pwszUserTempPathW);
My current user has a russian name, so its written in cyrillic, printf outputs like this:
pathA=C:\users\Пыщь\Local\Temp
pathW=C:\users\Пыщь\Local\Temp
So both paths are all right. I thought I would receive some error, or a mess of symbols, with GetTempPathA since the current user name is Unicode, but I figured out that Cyrillic characters are actually included in the extended ASCII character set. So I have a question: if I were to ship my software, and it extracts data into the temp folder of the current user, who is Chinese (assuming he has Chinese symbols in his user name), will I get a mess or an error using the GetTempPathA version? Should I always use the W-suffixed functions for production software that works with winapi directly?
First, the -A suffix stands for ANSI, not ASCII. ASCII is a 7-bit character set. ANSI, as Microsoft uses the term, is for an encoding using 8-bit code units (chars) and code pages.
Some people use the terms "extended ASCII" or "high ASCII," but that's not actually a standard and, in some cases, isn't quite the same as ANSI. Extended ASCII is the ASCII character set plus (at most) 128 additional characters. For many ANSI code pages this is identical to extended ASCII, but some code pages accommodate variable length characters (which Microsoft calls multi-byte). Some people consider "extended ASCII" to just mean ISO-Latin-1 (which is nearly identical to Windows-1252).
Anyway, with an ANSI function, your string can include any characters from your current code page. If you need characters that aren't part of your current code page, you're out-of-luck. You'll have to use the wide -W versions.
In modern versions of Windows, you can generally think of the -A functions as wrappers around the -W functions that use MultiByteToWideChar and/or WideCharToMultiByte to convert any strings passing through the API. But the latter conversion can be lossy, since wide character strings might include characters that your multibyte strings cannot represent.
Portable, cross-platform code often stores all text in UTF-8, which uses 8-bit code units (chars) but can represent any Unicode code point, and anytime text needs to go through a Windows API, you'd explicitly convert to/from wide chars and then call the -W version of the API.
UTF-8 is similar to what Microsoft calls a multibyte ANSI code page, except that Windows does not completely support a UTF-8 code page. There is CP_UTF8, but it works only with certain APIs (like WideCharToMultiByte and MultiByteToWideChar). You cannot set your code page to CP_UTF8 and expect the general -A APIs to do the right thing.
As you try to test things, be aware that it's difficult (and sometimes impossible) to get the CMD console window to display characters outside the current code page. If you want to display multi-script strings, you probably should write a GUI application and/or use the debugger to inspect the actual content of the strings.
Of course, you need the wide version. The ANSI version, with a single-byte code page, can't even technically handle more than 256 distinct characters. Cyrillic is included in the extended ASCII set (if that's your localization), while Chinese isn't and can't be, due to the much larger set of characters needed to represent it. Moreover, you can get a mess with Cyrillic as well - it will only work properly if the executing machine has a matching localization. So on a machine with a non-Cyrillic localization the text will be displayed according to whatever is defined by the localization settings.

using regular expression with unicode string in C

I'm currently using regular expressions on Unicode strings, but I just need to match ASCII characters, thus effectively ignoring all non-ASCII characters, and until now the functions in regex.h work fine (I'm on Linux so the encoding is UTF-8). But can someone confirm if it's really OK to do so? Or do I need a Unicode regex library (like ICU)?
UTF-8 is a variable length encoding; some characters are 1 byte, some 2, others 3 or 4. You know how many bytes to read from the leading bits of each character's first byte: 0 for 1 byte, 110 for 2 bytes, 1110 for 3 bytes, 11110 for 4 bytes.
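The prefix rule can be sketched in a few lines of C. The function names are made up for this illustration, and the counting loop assumes the input is well-formed UTF-8 (it does not validate continuation bytes).

```c
#include <stddef.h>

/* Length of the UTF-8 sequence introduced by this lead byte, or 0 if the
 * byte is a continuation byte (10xxxxxx) or not a valid lead byte. */
int utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Count the characters (code points) in a NUL-terminated string that is
 * assumed to be well-formed UTF-8; a stray byte is counted as one. */
size_t utf8_count(const char *s)
{
    size_t count = 0;

    while (*s) {
        int len = utf8_seq_len((unsigned char)*s);
        if (len == 0)
            len = 1;
        /* Do not run past the terminator if the sequence is truncated. */
        while (len-- > 0 && *s)
            s++;
        count++;
    }
    return count;
}
```

For the string "aèb" encoded as UTF-8, strlen reports 4 bytes while utf8_count reports 3 characters, which is exactly the gap a byte-oriented regex engine falls into.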
If you try to read a UTF-8 string as ASCII, or any other fixed-width encoding, things will go very wrong... unless that UTF-8 string contains nothing but 1 byte characters in which case it matches ASCII.
However, since no multi-byte UTF-8 sequence contains a null byte, and none of the extra bytes can be confused with ASCII (every byte of a multi-byte sequence has its high bit set), and if you really are only matching ASCII, you might be able to get away with it... but I wouldn't recommend it because there are much better regex options than POSIX, they're easy to use, and why leave a hidden encoding bomb in your code for some sucker to deal with later? (Note: that sucker may be you)
Instead, use a Unicode-aware regex library like Perl Compatible Regular Expressions (PCRE). PCRE becomes Unicode-aware when you pass the PCRE2_UTF flag to pcre2_compile. PCRE regex syntax is more powerful and more widely understood than POSIX regexes, and PCRE has more features. GLib (the GNOME utility library) also wraps PCRE in its GRegex API, and GLib itself provides a feast of very handy C functions.
You need to be careful about your patterns and about the text you're going to match.
As an example, given the expression a.b:
"axb" matches
"aèb" does NOT match
The reason is that è is two bytes long when UTF-8 encoded but . would only match the first one.
So as long as you only match sequences of ASCII characters you're safe. If you mix ASCII and non ASCII characters, you're in trouble.
You can try to match a single UTF-8 encoded "character" with something like:
([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xF7]...|.)
but this assumes that the text is encoded correctly (and, frankly, I never tried it).
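The a.b example can be reproduced with the POSIX regex.h functions. This is a small sketch: regex_matches is a hypothetical wrapper name, and it assumes the program runs in the default "C" locale (no setlocale call), where the engine works byte by byte. An implementation running under a UTF-8 locale may treat . as a whole character instead.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the POSIX extended regex matches anywhere in text,
 * 0 if it does not, and -1 if the pattern fails to compile. */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this wrapper, regex_matches("a.b", "axb") succeeds, while the same pattern against "a\xC3\xA8" "b" (the UTF-8 bytes of aèb) does not, because . consumes only the first byte of è. A purely ASCII pattern such as [0-9]+ still finds its match inside text that contains non-ASCII characters elsewhere.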

UTF-8 and ISO 8859-9

I have been reading about UTF-8 and unicode for the last couple of days and when I thought I figured it all, I am confused when I read that UTF-8 and ISO 8859-9 are not compatible.
I have a database that stores data as UTF-8. I have a requirement from a customer to support various ISO 8859-x code pages (i.e. 8859-3, 8859-2, and also ISO 6937). My questions are:
Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using unicode?
I understand that unicode can support all characters and it is the way to go. However, my customer is an european entity that wants us to use ISO code pages. so my question is how can I support multiple client use cases using existing UTF-8 data? Since ISO 8859-x is not a subset of unicode, do I have to write code to send appropriate character set of ISO 8859-x depending on my use cases? Is that I need to do or there is more to it?
btw, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data. if so, how character set is applied? Do I have to write a code to return 8859-x response or all that's needed is to set an appropriate character set value in the response header?
Topic is pretty vast so let me simplify (a lot, even too much) and answer point by point.
Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using unicode?
Yes, you're using Unicode and you're storing Unicode characters (formally called code points) using the UTF-8 encoding. Please note that Unicode defines rules and sets of characters (even if the same word is often used as a synonym of the UTF-16 encoding); the way you encode such characters in a byte stream is another thing.
... However, my customer is an european entity that wants us to use ISO code pages. so my question is how can I support multiple client use cases using existing UTF-8 data?
Of course if you store Unicode characters (it doesn't matter with which encoding) then you can always convert them to a specific ISO 8859 code page (or to any other encoding), as long as the text only uses characters that the target code page contains. OK, this isn't formally always true (because Unicode doesn't define every possible character actually in use/used in the past) but I would ignore this point...
... Since ISO 8859-x is not a subset of unicode, do I have to write code to send appropriate character set of ISO 8859-x depending on my use cases?
All characters from the ISO 8859 code pages are also available in Unicode, so (from this point of view) each of them is a subset. Of course the encoded values are different, so they need to be converted. If you know the needed code page for each customer then you can always convert a UTF-8 encoded Unicode text into 8-bit text in the right code page.
Is that I need to do or there is more to it?
Just that. The code could be pretty short, but you didn't tag your question with any language so I won't provide links/examples. For a rudimentary example take a look at this post.
Let me also say one important thing: if they want to consume your data as 8-bit text in their code page then you have to perform a conversion. If they can consume UTF-8 data directly (or you present it somehow in your own application) then you don't have to worry about code pages (that's why we're using Unicode) because, no matter the encoding, the Unicode character set contains all the characters they may need.
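To make the conversion step concrete, here is a hand-rolled sketch in C for one code page, ISO 8859-9. It is an illustration under stated assumptions, not the method from the answer: the function names are invented, the input is assumed to be well-formed UTF-8, and it relies on ISO 8859-9 being identical to ISO 8859-1 except for six Turkish letters. In practice a conversion library such as iconv would cover all the code pages at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Map one Unicode code point to its ISO 8859-9 byte, or -1 if it has none. */
static int to_iso8859_9(uint32_t cp)
{
    switch (cp) {
    case 0x011E: return 0xD0;  /* G with breve */
    case 0x0130: return 0xDD;  /* capital I with dot above */
    case 0x015E: return 0xDE;  /* S with cedilla */
    case 0x011F: return 0xF0;  /* g with breve */
    case 0x0131: return 0xFD;  /* dotless i */
    case 0x015F: return 0xFE;  /* s with cedilla */
    /* These six Latin-1 letters were replaced by the Turkish ones above. */
    case 0xD0: case 0xDD: case 0xDE: case 0xF0: case 0xFD: case 0xFE:
        return -1;
    default:
        return cp < 0x100 ? (int)cp : -1;
    }
}

/* Convert NUL-terminated UTF-8 to NUL-terminated ISO 8859-9, writing '?'
 * for anything the code page cannot represent. Returns the number of bytes
 * written (excluding the NUL); output is truncated if out_cap is too small. */
size_t utf8_to_iso8859_9(const char *in, unsigned char *out, size_t out_cap)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    if (out_cap == 0)
        return 0;
    while (*p != 0 && n + 1 < out_cap) {
        uint32_t cp = *p;
        int extra = 0, ok = 1, b;

        if ((*p & 0x80) == 0x00)      { extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp &= 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp &= 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp &= 0x07; extra = 3; }
        else                          { ok = 0; }    /* stray byte */
        p++;
        for (; extra > 0; extra--) {
            if ((*p & 0xC0) != 0x80) {               /* truncated sequence */
                ok = 0;
                break;
            }
            cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);
        }
        b = ok ? to_iso8859_9(cp) : -1;
        out[n++] = (unsigned char)(b < 0 ? '?' : b);
    }
    out[n] = 0;
    return n;
}
```

Whether to substitute '?', drop the character, or reject the whole text when a code point has no slot in the target code page is a policy decision the customer requirement has to settle. Setting the character set in the response header only labels the bytes; it does not convert them.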
btw, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data.
Not exactly. You have a table of characters, right? For example A. Now you have to store a numeric value that will be interpreted as A. In ASCII they arbitrarily decided that 65 is the numeric value that represents that character. Unicode is a long list of characters (and rules to combine them); the UTF-X encodings are arbitrary representations used to store them as numeric values.
if so, how character set is applied?
"Character set" is a pretty vague term. With the Unicode character set you mean all characters available in Unicode. If you mean code page then (simplifying) it represents a subset of the available character set. Imagine you have an 8-bit extended ASCII (so up to 256 symbols); you can't accommodate all characters used in Europe, right? Code pages solve this problem: half of these symbols are always the same and the other half represent different characters according to the code page (each "Country" will use a specific code page with its preferred characters).
For an introductory overview about this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

char vs wchar_t

I'm trying to print out a wchar_t* string.
Code goes below:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
char *ascii_ = "中日友好"; //line-1
wchar_t *wchar_ = L"中日友好"; //line-2
int main()
{
printf("ascii_: %s\n", ascii_); //line-3
wprintf(L"wchar_: %s\n", wchar_); //line-4
return 0;
}
//Output
ascii_: 中日友好
Question:
Apparently I should not assign CJK characters to char* pointer in line-1, but I just did it, and the output of line-3 is correct, So why? How could printf() in line-3 give me the non-ascii characters? Does it know the encoding somehow?
I assume the code in line-2 and line-4 are correct, but why I didn't get any output of line-4?
First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which is ASCII-compatible and passes through printf() as plain bytes.
Now, as for why the wprintf() isn't working. This has to do with stream orientation. Each stream can only be set to either byte (normal) or wide orientation. Once set, it cannot be changed. It is set the first time the stream is used (here to byte orientation, due to the printf). After that the wprintf will not work due to the incorrect orientation.
In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().
You cannot intermix printf() and wprintf(). (except on Windows)
EDIT:
To answer the question about why the wprintf line doesn't work even by itself: it's probably because the code is being compiled so that the UTF-8 form of 中日友好 is stored into wchar_. However, wchar_t needs 4-byte Unicode code units (2 bytes on Windows).
So there's two options that I can think of:
Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but may break if the user's system is not set to the Chinese locale.
Use wchar_t, but you will need to encode the Chinese characters using unicode escape sequences. This will obviously make it unreadable in the source code, but it will work on any machine that can print Chinese character fonts regardless of the locale.
Line 1 is not ascii, it's whatever multibyte encoding is used by your compiler at compile-time. On modern systems that's probably UTF-8. printf does not know the encoding. It's just sending bytes to stdout, and as long as the encodings match, everything is fine.
One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix character-based and wide-character io on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform operations of the opposite orientation results in UB.
You are omitting one step and therefore think the wrong way.
You have a C file on disk, containing bytes. You have a narrow ("ASCII") string and a wide string.
The narrow string takes the bytes exactly as they are in line 1 and outputs them.
This works as long as the encoding on the user's side is the same as the one on the programmer's side.
The wide string is first decoded from the given bytes into Unicode code points, which are stored in the program; maybe this goes wrong on your side. On output they are encoded again according to the encoding on the user's side. This ensures that these characters are emitted as they are intended to be, not as they were entered.
Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.
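The orientation rule from the answers above can be checked with fwide(), which reports a stream's orientation without changing it when passed 0. The following is a sketch with invented helper names. It also uses %ls for the wchar_t* argument, since in standard C the %s specifier inside wprintf expects a char* (Microsoft's runtime differs here).

```c
#include <stdio.h>
#include <wchar.h>

/* Returns a negative value for a byte-oriented stream, a positive value
 * for a wide-oriented one, and 0 if the stream has no orientation yet. */
int stream_orientation(FILE *f)
{
    return fwide(f, 0);
}

/* Writes a wide string followed by a newline, using %ls for wchar_t*.
 * Refuses to write (returns -1) if the stream is already byte-oriented,
 * because mixing the two kinds of I/O on one stream is undefined. */
int put_wide_line(FILE *f, const wchar_t *s)
{
    if (fwide(f, 0) < 0)
        return -1;
    return fwprintf(f, L"%ls\n", s);
}
```

For the program in the question this means either dropping the printf call or routing the narrow and wide output to different streams. To print non-ASCII wide characters the program would also normally call setlocale(LC_ALL, "") before the first output, so that the wide characters can be converted to the terminal's encoding; that part depends on the locales installed on the machine.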
