I am creating a C++ library meant to be used with different applications written in different languages like Java, C#, Delphi etc.
Every now and then I get stuck on conversions between wstrings, strings, char* and wchar_t*. For example, I stuck to wchar_t but then had to use a regex library that only accepts char, and I keep running into similar problems.
I wish to stick to either wide or narrow strings. My library will mostly deal with ASCII characters but may also see non-ASCII characters, in names for example. So, can I permanently switch to char instead of wchar_t and string instead of wstring? Can I have Unicode support with them, and will it affect scalability and portability across different platforms and languages?
Please advise.
You need to decide which encoding to use. Some considerations:
If you can have non-ASCII characters, then there is no point in choosing ASCII or 8bit ANSI. That way leads to disappointment and risks data loss.
It makes sense to pick one encoding and stick to it. Everywhere. The Windows API is unusual in supporting both ANSI and Unicode, but that is due to backwards compatibility of old software. If Microsoft were starting over from scratch, there would be one encoding only.
The most common choices for Unicode encoding are UTF-8 and UTF-16. Any decent environment will have support for both. Either choice may be justifiable.
Java, VB, C# and Delphi all have good support for UTF-16, and all of them use UTF-16 for their native string types (in the case of Delphi, the native string type is UTF-16 only in Delphi 2009 and later. For earlier versions, you can use the WideString string type).
Windows is natively UTF-16 (*nix systems, like Linux, typically use UTF-8 instead), so if Windows and the languages above are your main targets it may well be simplest to just use UTF-16.
On the other hand, UTF-8 is probably a technically better choice, being byte oriented and backwards compatible with 7-bit ASCII. Quite likely, if Unicode were being invented from scratch, there would be no UTF-16 and UTF-8 would be the variable length encoding.
You have phrased the question as a choice between char and wchar_t. I think that the real choice is what your preferred encoding should be. You also have to watch out that wchar_t is 16 bits (UTF-16) on some systems but 32 bits (UTF-32) on others. It is not a portable data type. That is why C++11 introduces the new char16_t and char32_t data types to correct that ambiguity.
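As a rough illustration of what the fixed-width types buy you, here is a minimal C sketch that encodes one code point as UTF-16 using char16_t and char32_t from C11's <uchar.h>. The function name encode_utf16 is made up for this example; it is not part of any standard library.

```c
#include <stddef.h>
#include <uchar.h>   /* char16_t, char32_t (C11) */

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Returns the number of code units written (1 or 2), or 0 if cp is
 * not a valid Unicode scalar value. */
size_t encode_utf16(char32_t cp, char16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
    if (cp > 0x10FFFF) return 0;                 /* beyond the Unicode range */
    if (cp < 0x10000) {                          /* fits in a single unit */
        out[0] = (char16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* split into a surrogate pair */
    out[0] = (char16_t)(0xD800 + (cp >> 10));
    out[1] = (char16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

Unlike wchar_t, the width of these types does not change between Windows and Linux, so the same code produces the same units on both.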
The major difference between Unicode and plain char is the code page. A bare char* pointer is not enough to understand the meaning of the string: it can be in some specific encoding, it can be multibyte, etc. A wide character string does not have these caveats.
In many cases international aspects are not important. In this case the difference between these 2 representations is minimal. The main question that you need to answer: is internationalization important to your library or not?
Modern Windows programming should tend towards builds with UNICODE defined, and thus use wide characters and wide character APIs. This is desirable for improved performance (fewer or no conversions behind the Windows API layers), improved capabilities (sometimes the ANSI wrappers don't expose all capabilities of the wide function), and in general it avoids problems with the inability to represent characters that are not on the system's current code page (and thus in practice the inability to represent non-ASCII characters).
Where this can be difficult is when you have to interface with things that don't use wide characters. For example, while Windows APIs have wide character filenames, Linux filesystems typically use bytestrings. While those bytestrings are often UTF-8 by convention, there's little enforcement. Interfacing with other languages can also be difficult if the language in question doesn't understand wide characters at an API level. Ideally such languages have chosen a specific encoding, such as UTF-8, allowing you to convert to and from that encoding at the boundaries.
And that's one general recommendation: use Unicode internally for all processing, and convert as necessary at the boundaries. If this isn't already familiar to you, it's good to reference Joel's article on Unicode.
I have a file, foo.txt, which is just:
” ’
char x = fgetc(myfile);
When I use fgetc on the file, I get a constant value of 226 on both characters. Why is this? How can I fix this?
Here is my code:
FILE* f = fopen("./debate.txt", "rb");
int x = fgetc(f);
char y = (char)x;
For normal (portable) software, character encodings are a whole world of pain. The problems (and potential solutions) are:
A) The text file may be in any random/"text editor defined" encoding.
To deal with this there are 4 options:
expect input in a specific encoding (e.g. UTF-8) and refuse to support anything else (and generate an error message if the data in the file isn't valid for the encoding you chose). This will annoy some users (e.g. where the national standard is something incompatible like CNS 11643).
support many encodings, and let the user choose which encoding to expect (e.g. based on a command line argument). This is a little inconvenient for users and very painful for you.
support many encodings, and try to auto-detect which encoding the file used. This is a little more convenient for users until it guesses wrong and becomes a major annoyance (and you can't reduce the chance of guessing the wrong encoding to zero).
support many encodings, and let the user choose the encoding if they want, and auto-detect if the user didn't specify. This is the best possible option for users (and the worst possible option for software developers).
For these options I'd use the first (I would say "input file must be UTF-8", partly because UTF-8 has become very common and well supported, and partly because every other encoding is provably worse for technical reasons). Note that (based on your results) it's extremely likely that your input file is in UTF-8.
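To make the "226" result concrete: 226 is 0xE2, which in UTF-8 is the lead byte of a three-byte sequence (” is E2 80 9D and ’ is E2 80 99), so a single fgetc() only returns the first third of each character. The sketch below, with made-up function names, reads one whole sequence at a time. It only checks the byte structure, not every validity rule.

```c
#include <stdio.h>

/* Number of bytes in the UTF-8 sequence that starts with `lead`,
 * or 0 if `lead` cannot start a sequence. */
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;                    /* 0xxxxxxx: ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;   /* 110xxxxx */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;   /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;   /* 11110xxx */
    return 0;  /* continuation bytes and 0xC0, 0xC1, 0xF5..0xFF */
}

/* Read one whole UTF-8 sequence from f into buf.
 * Returns its length in bytes, 0 at end of file, -1 on malformed input. */
int read_utf8_sequence(FILE *f, unsigned char buf[4])
{
    int c = fgetc(f);              /* keep the int: EOF is not a char value */
    if (c == EOF) return 0;
    int len = utf8_sequence_length((unsigned char)c);
    if (len == 0) return -1;
    buf[0] = (unsigned char)c;
    for (int i = 1; i < len; i++) {
        c = fgetc(f);
        if (c == EOF || (c & 0xC0) != 0x80) return -1;  /* need 10xxxxxx */
        buf[i] = (unsigned char)c;
    }
    return len;
}
```

Called in a loop on the file from the question, read_utf8_sequence() would be expected to return 3 for each of the two quote characters rather than handing back 226 on its own.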
B) Whatever the compiler uses for char is implementation defined (could be ASCII, could be EBCDIC, could be anything else), and may be either signed or unsigned.
In this case it's "very safe" (for portability) to assume ASCII. Assuming UTF-8 is the 2nd best choice but it creates problems with any code that does any maths (e.g. right shift, etc) on "possibly signed" char values.
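A small sketch of the signedness problem and the usual workaround: convert to unsigned char before doing any bit arithmetic. The helper names are invented for illustration.

```c
/* Top 4 bits of a byte held in a plain char. Converting to unsigned char
 * first gives the same answer whether char is signed or unsigned;
 * shifting a negative char directly would be implementation defined. */
unsigned high_nibble(char c)
{
    return (unsigned)((unsigned char)c >> 4);
}

/* True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_utf8_continuation(char c)
{
    return ((unsigned char)c & 0xC0) == 0x80;
}
```

Storing text in unsigned char or uint8_t from the start avoids the casts altogether.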
C) The stdin, stdout, stderr pipes are random/implementation defined too.
This is similar to the previous problem, except that the best solution ("assume ASCII") is significantly harder (especially when you want to output error messages, etc that contain pieces of text from the input file). For this I'd be tempted to use ASCII as much as possible, but to cheat and output UTF-8 if I have to. If the OS (or shell) can't handle UTF-8 it'll create a mess, but most users would understand (and can work around it by piping your output to a file). The best alternative (for user output) is using a GUI and not using stdout, but that creates a large set of extra problems (and leads to a second large set of extra problems - internationalization for things like error messages, etc).
D) Whatever the compiler assumes for wchar_t is random/implementation defined (maybe UTF-16, maybe UTF-32, maybe anything else; and it may even be an 8-bit encoding that isn't "wide" at all).
The only sane choice here is to recognize that wchar_t is an unusable failure that should never (under any circumstances) be used for anything.
To be more specific, wchar_t is a historical mistake based on previous historical mistakes. Essentially, in the early days, Microsoft and Sun decided to adopt UCS-2 (an "all Unicode codepoints fit in 16 bits" assumption) which quickly became broken. To work around that problem Microsoft and Sun switched to UTF-16, but Microsoft was primarily running on little-endian machines and chose UTF-16LE and Sun (Java) was aiming for big-endian machines and chose UTF-16BE. The wide character extensions (<wchar.h>) were added to C in 1995 at the same time that companies (Microsoft, Sun) were doing everything wrong and weren't doing anything that was compatible with each other; so wchar_t ended up being a "we don't know what the standard is so our standard is no standard at all" joke. For C (and C++) this was fixed in 2011 with the introduction of char16_t (UTF-16) and char32_t (UTF-32) in <uchar.h>, but adoption is slow (e.g. Microsoft is still too lazy to bother with C99).
Note that an additional part of the problem is that people want to assume that one wchar_t is one whole printable character, and that is almost never the case (e.g. even for UTF-32 where one wchar_t is one whole Unicode codepoint there are combining codepoints); and this ruins any benefit of any "wide char" implementation (even if your code is not portable at all and you know what wchar_t actually is).
The best solution (especially if you chose "expect that the input file is using UTF-8" to solve the first problem) is to use UTF-8 stored in uint8_t (so that nobody confuses it for whatever char is).
In that case; "converting the input from the file into your internal character encoding" can become "converting UTF-8 to UTF-8 by doing nothing"; and "converting your internal character encoding into whatever stdout wants" becomes "converting UTF-8 to ASCII (or UTF-8) by doing almost nothing (casting from uint8_t to char)". In other words, it can be extremely close to "use the same encoding for everything".
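If you go with "input must be UTF-8, reject anything else", the one piece of real work left is the validity check. Here is one possible sketch over a uint8_t buffer; utf8_is_valid is a name chosen for this example, and the rules it applies (no overlong forms, no surrogates, nothing above U+10FFFF) are the standard UTF-8 well-formedness rules.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the n bytes at s are well-formed UTF-8, otherwise 0. */
int utf8_is_valid(const uint8_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        size_t len;
        uint32_t cp, min;

        if (b < 0x80)                { i++; continue; }   /* ASCII */
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return 0;               /* stray continuation or invalid lead */

        if (n - i < len) return 0;   /* sequence is cut off */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;      /* need 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return 0;                           /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;       /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                      /* outside Unicode */
        i += len;
    }
    return 1;
}
```

Once the input passes this check, the rest of the program can treat the buffer as UTF-8 without further conversion, which is the "do nothing" case described above.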
So I want to parse an ID3v2.4 file. The format specification defines 4 text encodings: ISO-8859-1, UTF-16 with BOM, UTF-16BE and UTF-8. I have already written code that obtains the bytes of the strings.
And my question is how to print UTF-16 with BOM and UTF-16BE bytes to the console.
And also one important condition: I can use only C libraries. I can't use C++ libraries. I can't even use third-party C libraries.
In general (NOT specifically for parsing ID3v2.4 files alone) you will want to choose a common character encoding that your code will use internally; then convert from any other character encoding into your chosen character encoding (for input data - e.g. from user or files or network) and convert back again (for output, to user or files or network).
For choosing a common character encoding:
you want something that minimizes "nonconvertible cases" - e.g. you wouldn't want to choose ASCII because there's far too much in far too many other character encodings that can't be converted to ASCII. This mostly means that you'll want a Unicode encoding.
you want something that is convenient. For Unicode encoding, this only really gives you 2 choices - UTF-8 (because you don't have to care about endian issues, and it's relatively efficient for space/memory consumption, and C functions like strlen() can still work) and versions of UTF-32 (because each codepoint takes up a fixed amount of space and it makes conversion a little simpler). Of these, the benefits of UTF-32 are mostly unimportant (unless you're doing a font rendering engine).
the "whatever random who-knows-what" character encoding that the C compiler uses is irrelevant (for both char and wchar_t), because it's implementation specific and not portable.
the "whatever random who-knows-what" character encoding that the terminal uses is irrelevant (the terminal should be considered "just another flavor of input/output, where conversion is involved").
Assuming you choose UTF-8:
You might be able to force the compiler to treat string literals as UTF-8 for you (e.g. like u8"hello" in C++, except I can't seem to find any sane standard for C). Otherwise you'll need to do it yourself where necessary.
I'd recommend using the uint8_t type for storing strings; partly because char is "signed or unsigned, depending on which way the wind is blowing" (which makes conversions to/from other character encodings painful due to "shifting a signed/negative number right" problems), and partly because it helps to find "accidentally used something that isn't UTF-8" bugs (e.g. warnings from the compiler about "conversion from signed to unsigned").
Conversion between UTF-8 and UTF-32LE, UTF-32BE, UTF-16LE, UTF-16BE is fairly trivial (the relevant Wikipedia articles are enough to describe how it works).
"UTF-16 with BOM" means that the first 2 bytes will tell you if it's UTF-16LE or UTF-16BE, so (after you add support for UTF-16LE and UTF-16BE) it's trivial. "UTF-32 with BOM" is similar (first 4 bytes tell you if it's UTF-32LE or UTF-32BE).
Conversion to/from ISO-8859-1 to UTF-8 is fairly trivial, because the characters match Unicode codepoints with the same value. However, often people get it wrong (e.g. say it's ISO-8859-1 when the data is actually encoded as Windows-1252 instead); and for the conversion from UTF-8 to ISO-8859-1 you will need to deal with "nonconvertible" codepoints.
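The conversions listed above can be sketched in plain C with no libraries beyond the standard headers. The function names (utf8_put, latin1_to_utf8, utf16_to_utf8) and the choice to replace unpaired surrogates with U+FFFD are assumptions made for this example, not something the tag format prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* Write code point cp at out as UTF-8; returns bytes written (1..4). */
static size_t utf8_put(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80)    { out[0] = (uint8_t)cp; return 1; }
    if (cp < 0x800)   { out[0] = (uint8_t)(0xC0 | (cp >> 6));
                        out[1] = (uint8_t)(0x80 | (cp & 0x3F)); return 2; }
    if (cp < 0x10000) { out[0] = (uint8_t)(0xE0 | (cp >> 12));
                        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
                        out[2] = (uint8_t)(0x80 | (cp & 0x3F)); return 3; }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* ISO-8859-1: each byte is the code point of the same value.
 * out needs room for 2 * n bytes. Returns bytes written. */
size_t latin1_to_utf8(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++)
        o += utf8_put(in[i], out + o);
    return o;
}

/* Read one 16-bit unit from two bytes in the given byte order. */
static uint32_t read16(const uint8_t *p, int big_endian)
{
    return big_endian ? ((uint32_t)p[0] << 8) | p[1]
                      : ((uint32_t)p[1] << 8) | p[0];
}

/* UTF-16 bytes to UTF-8. A leading BOM (FE FF or FF FE) selects the byte
 * order and is dropped; without one, big_endian decides. Unpaired
 * surrogates become U+FFFD. out needs room for 3 * (n / 2) bytes.
 * Returns bytes written. */
size_t utf16_to_utf8(const uint8_t *in, size_t n, int big_endian, uint8_t *out)
{
    size_t i = 0, o = 0;
    if (n >= 2 && in[0] == 0xFE && in[1] == 0xFF)      { big_endian = 1; i = 2; }
    else if (n >= 2 && in[0] == 0xFF && in[1] == 0xFE) { big_endian = 0; i = 2; }

    while (i + 1 < n) {
        uint32_t u = read16(in + i, big_endian);
        i += 2;
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n) {  /* high surrogate */
            uint32_t lo = read16(in + i, big_endian);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {         /* valid pair */
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        if (u >= 0xD800 && u <= 0xDFFF) u = 0xFFFD;     /* unpaired surrogate */
        o += utf8_put(u, out + o);
    }
    return o;
}
```

For the ID3 case, "UTF-16 with BOM" and "UTF-16BE" can both go through utf16_to_utf8() with big_endian set to 1, since a BOM overrides the default. The resulting bytes can then be written with fwrite() to a terminal that expects UTF-8.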
What does character encoding in C programming language depend on? (OS? compiler? or editor?)
I'm working on not only characters of ASCII but also ones of other encodings such as UTF-8.
How can we check the current character encodings in C?
The C source code might be stored in distinct encodings. This is clearly compiler dependent (i.e. a compiler setting if available). Though, I wouldn't count on it and would always stick to ASCII-only source. (IMHO this is the most portable way to write code.)
Actually, you can encode any character of any encoding using only ASCII characters in C source code if you write them as octal or hex escape sequences. (This is what I do from time to time to earn the respect of my colleagues: writing German texts with \303\244, \303\266, \303\274, \303\237 into translation tables from memory...)
Example: "\303\274" encodes the UTF-8 sequence for a string constant "ü". (But if I print this on my Windows console I only get "��" although I set code page 65001 which should provide UTF-8. The damn Windows console...)
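A minimal sketch of the escape-sequence approach, assuming the goal is UTF-8 bytes in an ASCII-only source file. The identifiers are invented for the example.

```c
#include <string.h>

/* "ü" written with octal escapes so the source file stays pure ASCII.
 * The two bytes 0xC3 0xBC are the UTF-8 encoding of U+00FC. */
static const char u_umlaut[] = "\303\274";

/* The same two bytes written with hex escapes. */
static const char u_umlaut_hex[] = "\xC3\xBC";

/* Returns 1 if both spellings produce identical bytes (2 bytes + NUL). */
int escapes_agree(void)
{
    return strlen(u_umlaut) == 2 && memcmp(u_umlaut, u_umlaut_hex, 3) == 0;
}
```

One thing to watch with the hex form: a hex escape swallows every hex digit that follows it, so "\xC3\xBCber" is parsed as one oversized escape. Splitting the literal, as in "\xC3\xBC" "ber", avoids that. Octal escapes stop after three digits and do not have this problem.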
The program written in C may handle any encoding you are able to deal with. Actually, the characters are only numbers which can be stored as one of the available integral types (e.g. char for ASCII and UTF-8, other int types for encodings with 16 or 32 bit wide characters). As already mentioned by Clifford, the output decides what to do with these numbers. Thus, this is platform dependent.
To handle characters according to a certain encoding (e.g. make them upper case or lower case, local dictionary-like sorting, etc.) you have to use an appropriate library. This might be part of the standard libraries, the system libraries, or 3rd party libraries.
This is especially true for conversion from one encoding to another. This is a good point to mention libiconv (which provides iconv(); libintl, by contrast, handles message translation).
I personally prefer ASCII, Unicode, and UTF-8 (and unfortunately UTF-16 as I'm doing most work on Windows 10). In this special case, the conversion can be done by a pure "bit-fiddling" algorithm (without any knowledge of special characters). You may have a look at the Wikipedia article on UTF-8 to get a clue. With a web search, you will probably find something ready-to-use if you don't want to do it yourself.
The standard library of C++11 and C++14 also provides support (e.g. std::codecvt_utf8) but it is marked as deprecated in C++17. Thus, I don't need to throw away my bit-fiddling code (which I'm so proud of). Oops. This is tagged with c, sorry.
It is platform or display device/framework dependent. The compiler does not care how the platform interprets either char or wchar_t when such values are rendered as glyphs on some display device.
If the output were to some remote terminal, then the rendering would be dependent on the terminal rather than the execution environment, while in a desktop computer, the rendering may be to a text console or to a GUI, and the resulting rendering may differ even between those.
I came across this in the book:
wscanf(L"%lf", &variable);
where the first parameter is of type of wchar_t *.
This is different from scanf("%lf", &variable); where the first parameter is of type char *.
So what is the difference then? I have never heard of a "wide character string" before. I have heard of something called Raw String Literals, which print the string as it is (no need for things like escape sequences), but that was not in C.
The exact nature of wide characters is (purposefully) left implementation defined.
When they first invented the concept of wchar_t, ISO 10646 and Unicode were still competing with each other (whereas they now mostly cooperate). Rather than try to decree that an international character would be one or the other (or possibly something else entirely) they simply provided a type (and some functions) that the implementation could define to support international character sets as they chose.
Different implementations have exercised that potential for variation. For example, if you use Microsoft's compiler on Windows, wchar_t will be a 16-bit type holding UTF-16 Unicode (originally it held UCS-2 Unicode, but that's now officially obsolete).
On Linux, wchar_t will more often be a 32-bit type, holding UCS-4/UTF-32 encoded Unicode. Ports of gcc to at least some other operating systems do the same, though I've never tried to confirm that it's always the case.
There is, however, no guarantee of that. At least in theory an implementation on Linux could use 16 bits, or one on Windows could use 32 bits, or either one could decide to use 64 bits (though I'd be a little surprised to see that in reality).
In any case, the general idea of how things are intended to work, is that a single wchar_t is sufficient to represent a code point. For I/O, the data is intended to be converted from the external representation (whatever it is) into wchar_ts, which (is supposed to) make them relatively easy to manipulate. Then during output, they again get transformed into the encoding of your choice (which may be entirely different from the encoding you read).
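The "convert on the way in" step can be sketched with the standard mbstowcs() function. The wrapper name to_wide is made up for this example, and which external encodings are accepted depends entirely on the current locale.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string in the current locale's encoding to wide
 * characters. Returns the number of wide characters written (excluding
 * the terminator), or (size_t)-1 if the input is not valid in that
 * encoding or cap is 0. At most cap - 1 characters are converted. */
size_t to_wide(const char *mb, wchar_t *out, size_t cap)
{
    if (cap == 0) return (size_t)-1;
    size_t n = mbstowcs(out, mb, cap - 1);
    if (n == (size_t)-1) return n;   /* invalid multibyte sequence */
    out[n] = L'\0';   /* mbstowcs does not terminate when the buffer fills */
    return n;
}
```

A program would normally call setlocale(LC_CTYPE, "") from <locale.h> before this so that the user's environment decides the external encoding. In the default "C" locale only ASCII input is guaranteed to convert.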
"Wide character string" is referring to the encoding of the characters in the string.
From Wikipedia:
A wide character is a computer character datatype that generally has a
size greater than the traditional 8-bit character. The increased
datatype size allows for the use of larger coded character sets.
UTF-16 is one of the most commonly used wide character encodings.
Further, wchar_t is defined by Microsoft as an unsigned short (16-bit) data object. The definition may well be different on other operating systems or in other languages.
Taken from the Wikipedia article from the comment below:
"The width of wchar_t is compiler-specific and can be as small as 8
bits. Consequently, programs that need to be portable across any C or
C++ compiler should not use wchar_t for storing Unicode text. The
wchar_t type is intended for storing compiler-defined wide characters,
which may be Unicode characters in some compilers."
This intrigues me, so I'm going to ask - for what reason is wchar_t not used so widely on Linux/Linux-like systems as it is on Windows? Specifically, the Windows API uses wchar_t internally whereas I believe Linux does not and this is reflected in a number of open source packages using char types.
My understanding is that given a character c which requires multiple bytes to represent it, then in a char[] form c is split over several parts of char* whereas it forms a single unit in wchar_t[]. Is it not easier, then, to use wchar_t always? Have I missed a technical reason that negates this difference? Or is it just an adoption problem?
wchar_t is a wide character with platform-defined width, which doesn't really help much.
UTF-8 characters span 1-4 bytes per character. UCS-2, which spans exactly 2 bytes per character, is now obsolete and can't represent the full Unicode character set.
Linux applications that support Unicode tend to do so properly, above the byte-wise storage layer. Windows applications tend to make this silly assumption that only two bytes will do.
wchar_t's Wikipedia article briefly touches on this.
The first people to use UTF-8 on a Unix-based platform explained:
The Unicode Standard [then at version 1.1]
defines an
adequate character set but an
unreasonable representation [UCS-2]. It states
that all characters are 16 bits wide [no longer true]
and are communicated and stored in 16-bit units.
It also reserves a pair
of characters (hexadecimal FFFE and
FEFF) to detect byte order in
transmitted text, requiring state in
the byte stream. (The Unicode
Consortium was thinking of files, not
pipes.) To adopt this encoding, we
would have had to convert all text
going into and out of Plan 9 between
ASCII and Unicode, which cannot be
done. Within a single program, in
command of all its input and output,
it is possible to define characters as
16-bit quantities; in the context of a
networked system with hundreds of
applications on diverse machines by
different manufacturers [italics mine], it is
impossible.
The italicized part is less relevant to Windows systems, which have a preference towards monolithic applications (Microsoft Office), non-diverse machines (everything's an x86 and thus little-endian), and a single OS vendor.
And the Unix philosophy of having small, single-purpose programs means fewer of them need to do serious character manipulation.
The source for our tools and
applications had already been
converted to work with Latin-1, so it
was ‘8-bit safe’, but the conversion
to the Unicode Standard and UTF[-8] is
more involved. Some programs needed no
change at all: cat, for instance,
interprets its argument strings,
delivered in UTF[-8], as file names
that it passes uninterpreted to the
open system call, and then just copies
bytes from its input to its output; it
never makes decisions based on the
values of the bytes...Most programs,
however, needed modest change.
...Few tools actually need to operate
on runes [Unicode code points]
internally; more typically they need
only to look for the final slash in a
file name and similar trivial tasks.
Of the 170 C source programs...only 23
now contain the word Rune.
The programs that do store runes
internally are mostly those whose
raison d’être is character
manipulation: sam (the text editor),
sed, sort, tr, troff, 8½ (the window
system and terminal emulator), and so
on. To decide whether to compute using
runes or UTF-encoded byte strings
requires balancing the cost of
converting the data when read and
written against the cost of converting
relevant text on demand. For programs
such as editors that run a long time
with a relatively constant dataset,
runes are the better choice...
UTF-32, with code points directly accessible, is indeed more convenient if you need character properties like categories and case mappings.
But widechars are awkward to use on Linux for the same reason that UTF-8 is awkward to use on Windows. GNU libc has no _wfopen or _wstat function.
UTF-8, being compatible with ASCII, makes it possible to ignore Unicode somewhat.
Often, programs don't care (and in fact, don't need to care) about what the input is, as long as there is not a \0 that could terminate strings. See:
char buf[128]; /* any reasonable size */
printf("Your favorite pizza topping is which?\n");
if (fgets(buf, sizeof(buf), stdin) != NULL) /* e.g. Jalapeños */
    printf("%s it shall be.\n", buf);
The only times I found I needed Unicode support were when I had to treat a multibyte character as a single unit (wchar_t), e.g. when counting the number of characters in a string rather than bytes. iconv from UTF-8 to wchar_t will quickly do that. For bigger issues like zero-width spaces and combining diacritics, something heavier like ICU is needed, but how often do you do that anyway?
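For the simple "count characters, not bytes" case, a conversion is not strictly required. Assuming the string is valid UTF-8, one possible shortcut is to count every byte that is not a continuation byte. The function name is invented for this sketch.

```c
#include <stddef.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string by skipping
 * continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

This counts code points, not what a user sees as characters: "Jalapeños" written with a precomposed ñ gives 9, while the same word written as n followed by a combining tilde gives 10. That second case is where a library like ICU comes in.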
wchar_t is not the same size on all platforms. On Windows it is a UTF-16 code unit that uses two bytes. On other platforms it typically uses 4 bytes (for UCS-4/UTF-32). It is therefore unlikely that these platforms would standardize on using wchar_t, since it would waste a lot of space.