So I want to parse an ID3v2.4 file. There are 4 types of text encoding in the format specification: ISO-8859-1, UTF-16 with BOM, UTF-16BE and UTF-8. I have already written code that obtains the bytes of the strings.
My question is: how do I print the UTF-16 with BOM and UTF-16BE bytes to the console?
One important condition: I can only use standard C libraries. I can't use C++ libraries, and I can't use third-party C libraries either.
In general (NOT specifically for parsing ID3v2.4 files alone) you will want to choose a common character encoding that your code uses internally; then convert from any other character encoding into your chosen encoding (for input data, e.g. from the user, files or network) and convert back again (for output, to the user, files or network).
For choosing a common character encoding:
you want something that minimizes "nonconvertible cases" - e.g. you wouldn't want to choose ASCII because there's far too much in far too many other character encodings that can't be converted to ASCII. This mostly means that you'll want a Unicode encoding.
you want something that is convenient. For a Unicode encoding this really only gives you 2 choices - UTF-8 (because you don't have to care about endian issues, it's relatively efficient for space/memory consumption, and C functions like strlen() still work) and versions of UTF-32 (because each codepoint takes up a fixed amount of space, which makes conversion a little simpler). Of these, the benefits of UTF-32 are mostly unimportant (unless you're writing a font rendering engine).
the "whatever random who-knows-what" character encoding that the C compiler uses is irrelevant (for both char and w_char), because it's implementation specific and not portable.
the "whatever random who-knows-what" character encoding that the terminal uses is irrelevant (the terminal should be considered "just another flavor of input/output, where conversion is involved").
Assuming you choose UTF-8:
You might be able to force the compiler to treat string literals as UTF-8 for you (e.g. the u8"hello" prefix; C++11 has it and C11 added it as well, but older compilers may not support it). Otherwise you'll need to do it yourself where necessary.
I'd recommend using the uint8_t type for storing strings; partly because char is "signed or unsigned, depending on which way the wind is blowing" (which makes conversions to/from other character encodings painful due to "shifting a signed/negative number right" problems), and partly because it helps to find "accidentally used something that isn't UTF-8" bugs (e.g. warnings from the compiler about "conversion from signed to unsigned").
Conversion between UTF-8 and UTF-32LE, UTF-32BE, UTF-16LE, UTF-16BE is fairly trivial (the relevant Wikipedia articles are enough to describe how it works).
"UTF-16 with BOM" means that the first 2 bytes tell you whether it's UTF-16LE or UTF-16BE, so (after you add support for UTF-16LE and UTF-16BE) it's trivial. "UTF-32 with BOM" is similar (the first 4 bytes tell you whether it's UTF-32LE or UTF-32BE).
Conversion between ISO-8859-1 and UTF-8 is fairly trivial, because ISO-8859-1 characters match the Unicode codepoints with the same value. However, people often get this wrong (e.g. say the data is ISO-8859-1 when it's actually encoded as Windows-1252 instead); and for the conversion from UTF-8 to ISO-8859-1 you will need to deal with "nonconvertible" codepoints. (A sketch of both conversions follows below.)
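For illustration, here is a minimal sketch of the two conversions that matter here (the function names are mine, not from any library; it assumes well-formed input and an output buffer that is large enough - 2 * nbytes always suffices):

#include <stddef.h>
#include <stdint.h>

/* Append one Unicode codepoint to a UTF-8 buffer; returns the number of bytes written. */
static size_t utf8_put(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80) { out[0] = (uint8_t)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* ISO-8859-1 to UTF-8: each input byte is already the Unicode codepoint (U+0000..U+00FF). */
size_t latin1_to_utf8(const uint8_t *in, size_t nbytes, uint8_t *out)
{
    size_t i, o = 0;
    for (i = 0; i < nbytes; i++)
        o += utf8_put(in[i], out + o);
    return o;
}

/* UTF-16 to UTF-8. If the data starts with a BOM, the BOM decides the byte order;
   without a BOM we default to big-endian (which also covers plain UTF-16BE). */
size_t utf16_to_utf8(const uint8_t *in, size_t nbytes, uint8_t *out)
{
    int big_endian = 1;
    size_t i = 0, o = 0;

    if (nbytes >= 2 && in[0] == 0xFF && in[1] == 0xFE) { big_endian = 0; i = 2; }
    else if (nbytes >= 2 && in[0] == 0xFE && in[1] == 0xFF) { big_endian = 1; i = 2; }

    while (i + 1 < nbytes) {
        uint32_t u = big_endian ? (uint32_t)((in[i] << 8) | in[i + 1])
                                : (uint32_t)((in[i + 1] << 8) | in[i]);
        i += 2;
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < nbytes) {      /* high surrogate */
            uint32_t lo = big_endian ? (uint32_t)((in[i] << 8) | in[i + 1])
                                     : (uint32_t)((in[i + 1] << 8) | in[i]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {                  /* low surrogate  */
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        o += utf8_put(u, out + o);
    }
    return o;
}

Once everything is UTF-8, printing to a console that expects UTF-8 is just a plain fwrite()/printf("%s") of the bytes.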
I have a file, foo.txt, which is just:
” ’
When I use fgetc on the file, like this:
char x = fgetc(myfile);
I get a constant value of 226 for both characters. Why is this? How can I fix this?
Here is my code:
FILE* f = fopen("./debate.txt", "rb");
int x = fgetc(f);
char y = (char)x;
For normal (portable) software, character encodings are a whole world of pain. The problems (and potential solutions) are:
A) The text file may be in any random/"text editor defined" encoding.
To deal with this there are 4 options:
expect input in a specific encoding (e.g. UTF-8) and refuse to support anything else (and generate an error message if the data in the file isn't valid for the encoding you chose). This will annoy some users (e.g. where the national standard is something incompatible like CNS 11643).
support many encodings, and let the user choose which encoding to expect (e.g. based on a command line argument). This is a little inconvenient for users and very painful for you.
support many encodings, and try to auto-detect which encoding the file used. This is a little more convenient for users until it guesses wrong and becomes a major annoyance (and you can't reduce the chance of guessing the wrong encoding to zero).
support many encodings, and let the user choose the encoding if they want, and auto-detect if the user didn't specify. This is the best possible option for users (and the worst possible option for software developers).
Of these options I'd use the first (I would say "the input file must be UTF-8"), partly because UTF-8 has become very common and well supported, and partly because every other encoding is provably worse for technical reasons. Note that (based on your results) it's extremely likely that your input file is in UTF-8. (A minimal validity check is sketched below.)
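Such a check can be a small routine along these lines (my own sketch; it only checks the sequence structure, not overlong forms or surrogate codepoints):

#include <stddef.h>
#include <stdint.h>

/* Return 1 if the buffer is structurally valid UTF-8, 0 otherwise. */
int is_valid_utf8(const uint8_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        int extra;
        uint8_t b = s[i++];
        if      (b < 0x80)           extra = 0;   /* ASCII             */
        else if ((b & 0xE0) == 0xC0) extra = 1;   /* 2-byte sequence   */
        else if ((b & 0xF0) == 0xE0) extra = 2;   /* 3-byte sequence   */
        else if ((b & 0xF8) == 0xF0) extra = 3;   /* 4-byte sequence   */
        else return 0;                            /* invalid lead byte */
        while (extra--) {
            if (i >= n || (s[i++] & 0xC0) != 0x80)
                return 0;                         /* truncated or bad continuation byte */
        }
    }
    return 1;
}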
B) Whatever the compiler uses for char is implementation defined (could be ASCII, could be EBCDIC, could be anything else), and may be either signed or unsigned.
In this case it's "very safe" (for portability) to assume ASCII. Assuming UTF-8 is the 2nd best choice but it creates problems with any code that does any maths (e.g. right shift, etc) on "possibly signed" char values.
C) The stdin, stdout, stderr pipes are random/implementation defined too.
This is similar to the previous problem, except that the best solution ("assume ASCII") is significantly harder (especially when you want to output error messages, etc that contain pieces of text from the input file). For this I'd be tempted to use ASCII as much as possible, but to cheat and output UTF-8 if I have to. If the OS (or shell) can't handle UTF-8 it'll create a mess, but most users would understand (and can work around it by piping your output to a file). The best alternative (for user output) is using a GUI and not using stdout, but that creates a large set of extra problems (and leads to a second large set of extra problems - internationalization for things like error messages, etc).
D) Whatever the compiler assumes for wchar is random/implementation defined (maybe UTF-16, maybe UTF-32, maybe anything else; and it may even be an 8-bit encoding that isn't "wide" at all).
The only sane choice here is to recognize that wchar is an unusable failure that should never (under any circumstances) be used for anything.
To be more specific, wchar is a historical mistake based on previous historical mistakes. Essentially, in the early days, Microsoft and Sun decided to adopt UCS-2 (an "all Unicode codepoints fit in 16 bits" assumption) which quickly became broken. To work around that problem Microsoft and Sun switched to UTF-16, but Microsoft was primarily running on little-endian machines and chose UTF-16LE, while Sun (Java) was aiming at big-endian machines and chose UTF-16BE. The wchar extension was added to C in 1995, at the same time that these companies (Microsoft, Sun) were doing everything wrong and nothing that was compatible with each other; so wchar ended up being a "we don't know what the standard is, so our standard is no standard at all" joke. For C (and C++) this was fixed in 2011 with the introduction of char16_t (UTF-16) and char32_t (UTF-32) in <uchar.h>, but adoption is slow (e.g. Microsoft is still too lazy to bother with C99).
Note that an additional part of the problem is that people want to assume that one wchar is one whole printable character, and that is almost never the case (e.g. even for UTF-32 where one wchar is one whole Unicode codepoint there are combining codepoints); and this ruins any benefit of any "wide char" implementation (even if your code is not portable at all and you know what wchar actually is).
The best solution (especially if you chose "expect that the input file is using UTF-8" to solve the first problem) is to use UTF-8 stored in uint8_t (so that nobody confuses it for whatever char is).
In that case; "converting the input from the file into your internal character encoding" can become "converting UTF-8 to UTF-8 by doing nothing"; and "converting your internal character encoding into whatever stdout wants" becomes "converting UTF-8 to ASCII (or UTF-8) by doing almost nothing (casting from uint8_t to char)". In other words, it can be extremely close to "use the same encoding for everything".
What does character encoding in C programming language depend on? (OS? compiler? or editor?)
I'm working on not only characters of ASCII but also ones of other encodings such as UTF-8.
How can we check the current character encodings in C?
The C source code might be stored in distinct encodings. This is clearly compiler dependent (i.e. a compiler setting, if available). Though, I wouldn't count on it and would stick to ASCII-only, always. (IMHO this is the most portable way to write code.)
Actually, you can encode any character of any encoding using only ASCII in C source code if you write it with octal or hex escape sequences. (This is what I do from time to time to earn the respect of my colleagues – writing German texts with \303\244, \303\266, \303\274, \303\231 into translation tables from memory...)
Example: "\303\274" encodes the UTF-8 sequence for a string constant "ü". (But if I print this on my Windows console I only get "��" although I set code page 65001 which should provide UTF-8. The damn Windows console...)
The program written in C may handle any encoding you are able to deal with. Actually, the characters are only numbers which can be stored as one of the available integral types (e.g. char for ASCII and UTF-8, other int types for encodings with 16 or 32 bit wide characters). As already mentioned by Clifford, the output decides what to do with these numbers. Thus, this is platform dependent.
To handle characters according to a certain encoding (e.g. make them upper case or lower case, do local dictionary-like sorting, etc.) you have to use an appropriate library. This might be part of the standard libraries, the system libraries, or 3rd party libraries.
This is especially true for conversion from one encoding to another. This is a good point to mention libintl.
I personally prefer ASCII, Unicode, and UTF-8 (and unfortunately UTF-16 as I'm doing most work on Windows 10). In this special case, the conversion can be done by a pure "bit-fiddling" algorithm (without any knowledge of special characters). You may have a look at Wikipedia UTF-8 to get a clue. By google, you probably will find something ready-to-use if you don't want to do it by yourself.
The standard library of C++11 and C++14 provides support also (e.g. std::codecvt_utf8) but it is remarked as deprecated in C++17. Thus, I don't need to throw away my bit-fiddling code (I'm so proud of). Oops. This is tagged with c – sorry.
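As a tiny example of that bit-fiddling (my own sketch; it assumes the codepoint is valid and is not itself in the surrogate range):

#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode codepoint as UTF-16 code units.
   Writes 1 or 2 units into out[] and returns how many were written. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {                           /* BMP: a single code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                                /* 20 bits split over a surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate  */
    return 2;
}

Decoding is the same arithmetic in reverse: take the 10 low bits of each surrogate, shift, and add 0x10000.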
It is platform or display device/framework dependent. The compiler does not care how the platform interprets either char or wchar_t when such values are rendered as glyphs on some display device.
If the output were to some remote terminal, then the rendering would be dependent on the terminal rather than the execution environment, while in a desktop computer, the rendering may be to a text console or to a GUI, and the resulting rendering may differ even between those.
Ok, I have this:
AllocConsole();
SetConsoleOutputCP(CP_UTF8);
HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 10, NULL, NULL);
WriteConsoleW(consoleHandle, L"wΕλληνικά\n", 10, NULL, NULL);
printf("aΕλληνικά\n");
wprintf(L"wΕλληνικά\n");
Now, the issue is that depending on the encoding the file was saved as, only some of these work. wprintf never works, but I already know why (the broken Microsoft stdout implementation, which only accepts narrow characters). Yet I have issues with the three others. If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version working I need to increase the character(?) count to 18:
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 18, NULL, NULL);
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops working (I get ? instead of the characters). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
Now, let's go to UTF-16 (Unicode - Codepage 1200). Only WriteConsoleW works. The character count in WriteConsoleA should be 10 to keep the formatting precise.
Saving in UTF-16 Big Endian mode (Unicode - Codepage 1201) does not change anything. Again, WTF? Shouldn't the byte order inside the strings be inverted when stored to file?
The conclusion is that the way strings are compiled into binary form depends on the encoding used. Therefore, what is the portable and compiler-independent way to store strings? Is there a preprocessor which would convert one string representation into another before compilation, so I could store the file in UTF-8 and only preprocess the strings which I need to have in UTF-16 by wrapping them in some macro?
I think you've got at least a few assumptions here which are either wrong or not 100% correct as far as I know:
Now, the issue is that depending on the encoding the file was saved as, only some of these work.
Of course, because the encoding determines how to interpret the string literals.
wprintf never works, but I already know why (the broken Microsoft stdout implementation, which only accepts narrow characters).
I've never heard of that one, but I'm rather sure this depends on the locale set for your program. I've got a few work projects where a locale is set and the output is just fine, using German umlauts etc.
If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version working I need to increase the character(?) count to 18:
That's because the ANSI version wants an ANSI string, while you're passing a UTF-8 encoded string (based on the file's encoding). The output still works, because the console handles the UTF-8 conversion for you - you're essentially printing raw UTF-8 here.
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
I don't think so (although I'm not sure why it isn't working either). Have you tried setting some easy-to-find string and looking for it in the resulting binary? I'm rather sure it's indeed encoded as UTF-16. I assume that due to the missing BOM the compiler might interpret the whole thing as a narrow string and therefore converts the UTF-8 content incorrectly.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops working (I get ? instead of the characters). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
This is exactly what I described above. Now the wide string is encoded properly, because the compiler now knows the file is in UTF-8, not ANSI (or some codepage). The narrow string is properly converted to the locale being used as well.
Overall, there's no encoding-independent way to do it, unless you escape everything using the proper codepage and/or UTF codes in advance. I'd just stick to UTF-8 with BOM, because I think all current compilers will be able to properly read and interpret the file (besides Microsoft's Resource Compiler; although I haven't tried feeding the 2012 version with UTF-8).
Edit:
To use an analogy:
You're essentially saving a raw image to a file and expecting it to work properly no matter whether other programs try to read it as a grayscale, palettized, or full-color image. This won't work (even though the differences here are smaller).
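To make the "escape everything in advance" idea concrete, here is a sketch (mine, not the asker's) of the original example rewritten so it no longer depends on how the source file is saved: the string literals stay 7-bit ASCII, the narrow string is spelled as raw UTF-8 bytes, the wide string uses \u escapes, and the lengths are computed instead of hand-counted.

#include <windows.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    SetConsoleOutputCP(CP_UTF8);
    HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written;

    /* The narrow string from the question as explicit UTF-8 bytes:
       1 + 8*2 + 1 = 18 bytes, which is exactly where the magic number 18 came from. */
    const char utf8[] = "a\xCE\x95\xCE\xBB\xCE\xBB\xCE\xB7\xCE\xBD\xCE\xB9\xCE\xBA\xCE\xAC\n";
    WriteConsoleA(consoleHandle, utf8, (DWORD)strlen(utf8), &written, NULL);

    /* The same text as a wide (UTF-16) literal via universal character names:
       1 + 8 + 1 = 10 code units, the count that WriteConsoleW expects. */
    const wchar_t wide[] = L"w\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC\n";
    WriteConsoleW(consoleHandle, wide, (DWORD)wcslen(wide), &written, NULL);

    return 0;
}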
The answer is here.
Quoting:
It is impossible for the compiler to intermix UTF-8 and UTF-16 strings into the compiled output! So you have to decide for one source code file:
either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use the L prefix),
or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use the L prefix);
7-bit ASCII characters are not involved and can be used with or without the L prefix.
The only portable and compiler-independent way is to use the ASCII charset and escape sequences, because there are no guarantees that any compiler will accept a UTF-8 encoded file, and a compiler's treatment of those multibyte sequences might vary.
I came across this in the book:
wscanf(L"%lf", &variable);
where the first parameter is of type of wchar_t *.
This is different from scanf("%lf", &variable); where the first parameter is of type char *.
So what is the difference then? I have never heard of a "wide character string" before. I have heard of something called raw string literals, which print the string as it is (no need for things like escape sequences), but that was not in C.
The exact nature of wide characters is (purposefully) left implementation defined.
When they first invented the concept of wchar_t, ISO 10646 and Unicode were still competing with each other (whereas they now, mostly cooperate). Rather than try to decree that an international character would be one or the other (or possibly something else entirely) they simply provided a type (and some functions) that the implementation could define to support international character sets as they chose.
Different implementations have exercised that potential for variation. For example, if you use Microsoft's compiler on Windows, wchar_t will be a 16-bit type holding UTF-16 Unicode (originally it held UCS-2 Unicode, but that's now officially obsolete).
On Linux, wchar_t will more often be a 32-bit type, holding UCS-4/UTF-32 encoded Unicode. Ports of gcc to at least some other operating systems do the same, though I've never tried to confirm that it's always the case.
There is, however, no guarantee of that. At least in theory an implementation on Linux could use 16 bits, or one on Windows could use 32 bits, or either one could decide to use 64 bits (though I'd be a little surprised to see that in reality).
In any case, the general idea of how things are intended to work, is that a single wchar_t is sufficient to represent a code point. For I/O, the data is intended to be converted from the external representation (whatever it is) into wchar_ts, which (is supposed to) make them relatively easy to manipulate. Then during output, they again get transformed into the encoding of your choice (which may be entirely different from the encoding you read).
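A small standard-C sketch of both points (what it prints depends entirely on the implementation and the locale; the narrow literal below is the UTF-8 byte sequence for "héllo", so the conversion only succeeds if the active locale actually uses UTF-8):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* Typically 2 on Windows (UTF-16 code units) and 4 on Linux (UTF-32). */
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));

    /* External (multibyte, locale-defined) representation in, wchar_t out. */
    setlocale(LC_ALL, "");
    const char *mb = "h\xC3\xA9llo";          /* "héllo" as UTF-8 bytes */
    wchar_t wide[16];
    size_t n = mbstowcs(wide, mb, 16);
    if (n != (size_t)-1)
        printf("decoded %u wide characters\n", (unsigned)n);
    return 0;
}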
"Wide character string" is referring to the encoding of the characters in the string.
From Wikipedia:
A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets. UTF-16 is one of the most commonly used wide character encodings.
Further, wchar_t is defined by Microsoft as an unsigned short (16-bit) data object. This could be, and most likely is, a different definition in other operating systems or languages.
Taken from the Wikipedia article from the comment below:
"The width of wchar_t is compiler-specific and can be as small as 8
bits. Consequently, programs that need to be portable across any C or
C++ compiler should not use wchar_t for storing Unicode text. The
wchar_t type is intended for storing compiler-defined wide characters,
which may be Unicode characters in some compilers."
First, I'm developing a platform-independent library using ANSI C (not C++ and not any non-standard libs like the MS CRT or glibc, ...).
After a few searches, I found that one of the best ways to do internationalization in ANSI C is to use the UTF-8 encoding.
In UTF-8:
strlen(s): always counts the number of bytes.
mbstowcs(NULL,s,0): The number of characters can be counted.
But I have some problems when I want random access to the elements (characters) of a UTF-8 string.
In ASCII encoding:
char get_char(char* ascii_str, int n)
{
    // It is very FAST.
    return ascii_str[n];
}
In UTF-16/32 encoding:
wchar_t get_char(wchar_t* wstr, int n)
{
// It is very FAST.
return wstr[n];
}
And here my problem in UTF-8 encoding:
// What is the return type?
// Because a single UTF-8 character can be 8, 16, 24 or 32 bits long.
/*?*/ get_char(char* utf8str, int n)
{
// I can find the Nth character of the string by using a for loop.
// But it is too slow.
// What is the best way?
}
Thanks.
Perhaps you're thinking about this a bit wrongly. UTF-8 is an encoding which is useful for serializing data, e.g. writing it to a file or the network. It is a very non-trivial encoding, though, and a raw string of Unicode codepoints can end up in any number of encoded bytes.
What you should probably do, if you want to handle text (given your description), is to store raw, fixed-width strings internally. If you're going for Unicode (which you should), then you need 21 bits per codepoint, so the nearest integral type is uint32_t. In short, store all your strings internally as arrays of integers. Then you can random-access each codepoint.
Only encode to UTF-8 when you are writing to a file or console, and decode from UTF-8 when reading.
By the way, a Unicode codepoint is still a long way from a character. The concept of a character is just far too high-level to have a simple general mechanic. (E.g. "a" + "accent grave" -- two codepoints, how many characters?)
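A sketch of that approach (my own helper, assuming well-formed UTF-8): decode once into an array of uint32_t codepoints, and random access becomes plain indexing.

#include <stddef.h>
#include <stdint.h>

/* Decode a UTF-8 byte string into an array of codepoints.
   Returns the number of codepoints. Assumes valid UTF-8 and an `out`
   array with room for nbytes entries (one per input byte is the worst case). */
size_t utf8_decode_all(const uint8_t *s, size_t nbytes, uint32_t *out)
{
    size_t i = 0, count = 0;
    while (i < nbytes) {
        uint8_t b = s[i++];
        uint32_t cp;
        int extra;
        if      (b < 0x80)           { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else                         { cp = b & 0x07; extra = 3; }
        while (extra-- && i < nbytes)
            cp = (cp << 6) | (s[i++] & 0x3F);
        out[count++] = cp;
    }
    return count;
}

/* Now the get_char() from the question is just an index: O(1), like the
   ASCII and UTF-32 versions. */
uint32_t get_char(const uint32_t *codepoints, size_t n)
{
    return codepoints[n];
}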
You simply can't. If you do need a lot of such queries, you can build an index for the UTF-8 string, or convert it to UTF-32 up front. UTF-32 is a better in-memory representation while UTF-8 is good on disk.
By the way, the code you listed for UTF-16 is not correct either. You may want to take care of surrogate pairs.
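If converting everything to UTF-32 up front costs too much memory, a sketch of the index idea (my own names, assuming well-formed UTF-8): record the byte offset at which each codepoint starts, then random access is one table lookup.

#include <stddef.h>
#include <stdint.h>

/* Build an index: offsets[k] = byte offset where the k-th codepoint starts.
   Returns the number of codepoints. `offsets` needs room for nbytes entries
   in the worst case (all-ASCII input). */
size_t utf8_build_index(const uint8_t *s, size_t nbytes, size_t *offsets)
{
    size_t i, count = 0;
    for (i = 0; i < nbytes; i++) {
        /* A byte starts a codepoint unless it is a continuation byte 10xxxxxx. */
        if ((s[i] & 0xC0) != 0x80)
            offsets[count++] = i;
    }
    return count;
}

For long strings you could store only every k-th offset and scan forward from the nearest entry, trading memory for a short linear scan.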
What do you want to count? As Kerrek SB has noted, you can have decomposed glyphs, i.e. "é" can be represented as a single character (LATIN SMALL LETTER E WITH ACUTE, U+00E9), or as two characters (LATIN SMALL LETTER E, U+0065, followed by COMBINING ACUTE ACCENT, U+0301). Unicode has composed and decomposed normalization forms.
What you are probably interested in counting is not characters, but grapheme clusters. You need some higher-level library to deal with this, and to deal with normalization forms, proper (locale-dependent) collation, proper line-breaking, proper case-folding (e.g. German ß -> SS), proper bidi support, etc. Real I18N is complex.
Contrary to what others have said, I don't really see a benefit in using UTF-32 instead of UTF-8: when processing text, grapheme clusters (or 'user-perceived characters') are far more useful than Unicode characters (i.e. raw codepoints), so even UTF-32 has to be treated as a variable-length encoding.
If you do not want to use a dedicated library, I suggest using UTF-8 as on-disk, endian-agnostic representation and modified UTF-8 (which differs from UTF-8 by encoding the zero character as a two-byte sequence) as in-memory representation compatible with ASCIIZ.
The necessary information for splitting strings into grapheme clusters can be found in Unicode Standard Annex #29 and the Unicode character database.
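And for reference, a minimal sketch of the "modified UTF-8" tweak mentioned above, following the definition given here: only U+0000 is treated specially (encoded as the two-byte sequence C0 80), so embedded NULs no longer terminate an ASCIIZ-style string; everything else is plain UTF-8.

#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint as modified UTF-8; returns the number of bytes written. */
size_t modified_utf8_put(uint32_t cp, uint8_t *out)
{
    if (cp == 0) {                                /* the only difference from standard UTF-8 */
        out[0] = 0xC0; out[1] = 0x80;
        return 2;
    }
    if (cp < 0x80) { out[0] = (uint8_t)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}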