What does character encoding in C programming language depend on?

What does character encoding in C programming language depend on? (OS? compiler? or editor?)
I'm working not only with ASCII characters but also with characters from other encodings such as UTF-8.
How can we check the current character encoding in C?

The C source code might be stored in different encodings. This is clearly compiler dependent (i.e. a compiler setting, if available). Still, I wouldn't count on it and would stick to ASCII-only source, always. (IMHO this is the most portable way to write code.)
Actually, you can encode any character of any encoding using only ASCII characters in C source code if you write them as octal or hex escape sequences. (This is what I do from time to time to earn the respect of my colleagues – writing German text with \303\244, \303\266, \303\274, \303\237 into translation tables from memory...)
Example: "\303\274" encodes the UTF-8 sequence for a string constant "ü". (But if I print this on my Windows console I only get "��" although I set code page 65001 which should provide UTF-8. The damn Windows console...)
A program written in C may handle any encoding you are able to deal with. Ultimately, characters are just numbers, which can be stored in one of the available integral types (e.g. char for ASCII and UTF-8, other integer types for encodings with 16 or 32 bit wide characters). As already mentioned by Clifford, the output device decides what to do with these numbers. Thus, this is platform dependent.
To handle characters according to a certain encoding (e.g. converting to upper or lower case, locale-aware dictionary-like sorting, etc.) you have to use an appropriate library. This might be part of the standard library, the system libraries, or a 3rd party library.
This is especially true for conversion from one encoding to another. This is a good point to mention libintl.
I personally prefer ASCII, Unicode, and UTF-8 (and, unfortunately, UTF-16, as I do most of my work on Windows 10). In this special case, the conversion can be done by a pure "bit-fiddling" algorithm (without any knowledge of specific characters). You may have a look at the Wikipedia article on UTF-8 to get a clue. Googling will probably turn up something ready-to-use if you don't want to do it yourself.
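A minimal sketch of such a bit-fiddling conversion (the function name is mine, and validation of surrogates and other invalid code points is omitted): turning a Unicode code point into UTF-8 bytes needs nothing but shifts and masks:

#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point (cp) as UTF-8 into buf (at least 4 bytes).
   Returns the number of bytes written, or 0 for code points above U+10FFFF. */
size_t utf8_encode(uint32_t cp, uint8_t buf[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        buf[0] = (uint8_t)cp;
        return 1;
    } else if (cp < 0x800) {              /* 2 bytes: 110xxxxx 10xxxxxx */
        buf[0] = (uint8_t)(0xC0 | (cp >> 6));
        buf[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {            /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = (uint8_t)(0xE0 | (cp >> 12));
        buf[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp < 0x110000) {           /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        buf[0] = (uint8_t)(0xF0 | (cp >> 18));
        buf[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}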
The C++11 and C++14 standard library provides support as well (e.g. std::codecvt_utf8), but it was deprecated in C++17. Thus, I don't need to throw away my bit-fiddling code (which I'm so proud of). Oops, this is tagged c – sorry.

It is platform or display device/framework dependent. The compiler does not care how the platform interprets either char or wchar_t when such values are rendered as glyphs on some display device.
If the output goes to a remote terminal, the rendering depends on the terminal rather than on the execution environment, while on a desktop computer the output may go to a text console or to a GUI, and the resulting rendering may differ even between those two.

Related

Reading non-ascii characters from a file in C

I have a file, foo.txt, which is just:
” ’
char x = fgetc(myfile);
When I use fgetc on the file, I get the value 226 for both characters. Why is this? How can I fix it?
Here is my code:
FILE* f = fopen("./debate.txt", "rb");
int x = fgetc(f);
char y = (char)x;
For normal (portable) software, character encodings are a whole world of pain. The problems (and potential solutions) are:
A) The text file may be in any random/"text editor defined" encoding.
To deal with this there are 4 options:
expect input in a specific encoding (e.g. UTF-8) and refuse to support anything else (and generate an error message if the data in the file isn't valid for the encoding you chose). This will annoy some users (e.g. where the national standard is something incompatible like CNS 11643).
support many encodings, and let the user choose which encoding to expect (e.g. based on a command line argument). This is a little inconvenient for users and very painful for you.
support many encodings, and try to auto-detect which encoding the file used. This is a little more convenient for users until it guesses wrong and becomes a major annoyance (and you can't reduce the chance of guessing the wrong encoding to zero).
support many encodings, and let the user choose the encoding if they want, and auto-detect if the user didn't specify. This is the best possible option for users (and the worst possible option for software developers).
For these options I'd use the first (I would say "input file must be UTF-8", partly because UTF-8 has become very common and well supported, and partly because every other encoding is provably worse for technical reasons). Note that (based on your results) it's extremely likely that your input file is in UTF-8.
B) Whatever the compiler uses for char is implementation defined (could be ASCII, could be EBCDIC, could be anything else), and may be either signed or unsigned.
In this case it's "very safe" (for portability) to assume ASCII. Assuming UTF-8 is the 2nd best choice, but it creates problems with any code that does any maths (e.g. right shift, etc.) on "possibly signed" char values (see the short sketch below).
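A small illustration of that signedness pitfall (variable names are mine): where char is signed, UTF-8 lead bytes become negative values, so naive comparisons go wrong unless you go through unsigned char:

#include <stdio.h>

int main(void)
{
    const char *s = "\303\274";   /* UTF-8 bytes of U+00FC */
    char c = s[0];                /* 0xC3, a UTF-8 lead byte */

    /* If char is signed, c holds a negative value here, so this test
       fails even though the stored byte is >= 0x80. */
    if (c >= 0x80)
        printf("lead byte (broken test)\n");

    /* Casting to unsigned char first gives the intended result. */
    if ((unsigned char)c >= 0x80)
        printf("lead byte (correct test)\n");

    return 0;
}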
C) The stdin, stdout, stderr pipes are random/implementation defined too.
This is similar to the previous problem, except that the best solution ("assume ASCII") is significantly harder (especially when you want to output error messages, etc that contain pieces of text from the input file). For this I'd be tempted to use ASCII as much as possible, but to cheat and output UTF-8 if I have to. If the OS (or shell) can't handle UTF-8 it'll create a mess, but most users would understand (and can work around it by piping your output to a file). The best alternative (for user output) is using a GUI and not using stdout, but that creates a large set of extra problems (and leads to a second large set of extra problems - internationalization for things like error messages, etc).
D) Whatever the compiler assumes for wchar is random/implementation defined (maybe UTF-16, maybe UTF-32, maybe anything else; and it may even be an 8-bit encoding that isn't "wide" at all).
The only sane choice here is to recognize that wchar is an unusable failure that should never (under any circumstances) be used for anything.
To be more specific, wchar is a historical mistake based on previous historical mistakes. Essentially, in the early days, Microsoft and Sun decided to adopt UCS-2 (an "all Unicode codepoints fit in 16 bits" assumption) which quickly became broken. To work around that problem Microsoft and Sun switched to UTF-16, but Microsoft was primarily running on little-endian machines and chose UTF-16LE, while Sun (Java) was aiming for big-endian machines and chose UTF-16BE. The wchar extension was added to C in 1995, at the same time that these companies (Microsoft, Sun) were doing everything wrong and weren't doing anything compatible with each other; so wchar ended up being a "we don't know what the standard is, so our standard is no standard at all" joke. For C (and C++) this was fixed in 2011 with the introduction of char16_t (UTF-16) and char32_t (UTF-32) in <uchar.h>, but adoption is slow (e.g. Microsoft is still too lazy to bother with C99).
Note that an additional part of the problem is that people want to assume that one wchar is one whole printable character, and that is almost never the case (e.g. even for UTF-32 where one wchar is one whole Unicode codepoint there are combining codepoints); and this ruins any benefit of any "wide char" implementation (even if your code is not portable at all and you know what wchar actually is).
The best solution (especially if you chose "expect that the input file is using UTF-8" to solve the first problem) is to use UTF-8 stored in uint8_t (so that nobody confuses it for whatever char is).
In that case; "converting the input from the file into your internal character encoding" can become "converting UTF-8 to UTF-8 by doing nothing"; and "converting your internal character encoding into whatever stdout wants" becomes "converting UTF-8 to ASCII (or UTF-8) by doing almost nothing (casting from uint8_t to char)". In other words, it can be extremely close to "use the same encoding for everything".

Print UTF-16 string

So I want to parse ID3v2.4 files. There are 4 types of text encoding in the format specification: ISO-8859-1, UTF-16 with BOM, UTF-16BE and UTF-8. I have already written code that obtains the bytes of the strings.
My question is how to print UTF-16 with BOM and UTF-16BE bytes to the console.
There is also one important condition: I can use only C libraries. I can't use C++ libraries. I can't even use third-party C libraries.
In general (NOT specifically for parsing ID3v2.4 files alone) you will want to choose a common character encoding that your code will use internally; then convert from any other character encoding into your chosen character encoding (for input data - e.g. from user or files or network) and convert back again (for output, to user or files or network).
For choosing a common character encoding:
you want something that minimizes "nonconvertible cases" - e.g. you wouldn't want to choose ASCII because there's far too much in far too many other character encodings that can't be converted to ASCII. This mostly means that you'll want a Unicode encoding.
you want something that is convenient. For Unicode encoding, this only really gives you 2 choices - UTF-8 (because you don't have to care about endian issues, and it's relatively efficient for space/memory consumption, and C functions like strlen() can still work) and versions of UTF-32 (because each codepoint takes up a fixed amount of space and it makes conversion a little simpler). Of these, the benefits of UTF-32 are mostly unimportant (unless you're doing a font rendering engine).
the "whatever random who-knows-what" character encoding that the C compiler uses is irrelevant (for both char and w_char), because it's implementation specific and not portable.
the "whatever random who-knows-what" character encoding that the terminal uses is irrelevant (the terminal should be considered "just another flavor of input/output, where conversion is involved").
Assuming you choose UTF-8:
You might be able to force the compiler to treat string literals as UTF-8 for you (e.g. like u8"hello" in C++, except I can't seem to find any sane standard for C). Otherwise you'll need to do it yourself where necessary.
I'd recommend using the uint8_t type for storing strings; partly because char is "signed or unsigned, depending on which way the wind is blowing" (which makes conversions to/from other character encodings painful due to "shifting a signed/negative number right" problems), and partly because it helps to find "accidentally used something that isn't UTF-8" bugs (e.g. warnings from the compiler about "conversion from signed to unsigned").
Conversion between UTF-8 and UTF-32LE, UTF-32BE, UTF-16LE, UTF-16BE is fairly trivial (the relevant Wikipedia articles are enough to describe how it works).
"UTF-16 with BOM" means that the first 2 bytes will tell you if it's UTF-16LE or UTF-16BE, so (after you add support for UTF-16LE and UTF-16BE) it's trivial. "UTF-32 with BOM" is similar (first 4 bytes tell you if it's UTF32-BE or UTF32-BE).
Conversion between ISO-8859-1 and UTF-8 is fairly trivial, because ISO-8859-1 characters match the Unicode codepoints with the same value. However, people often get it wrong (e.g. say it's ISO-8859-1 when the data is actually encoded as Windows-1252 instead); and for the conversion from UTF-8 to ISO-8859-1 you will need to deal with "nonconvertible" codepoints.
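A minimal sketch of the UTF-16 part (the function name is mine, error handling is minimal, and unpaired surrogates are passed through as-is): detect the BOM if present, combine pairs of bytes into code units, and merge surrogate pairs into code points. The resulting code points can then be re-encoded as UTF-8 (e.g. with a bit-fiddling encoder like the one sketched earlier) before writing to the console:

#include <stddef.h>
#include <stdint.h>

/* Decode UTF-16 bytes (with optional BOM) into code points.
   'be' selects big-endian when there is no BOM (as for UTF-16BE in ID3v2.4).
   Returns the number of code points written to out (out must be large enough;
   one code point per 2 input bytes is a safe upper bound). */
size_t utf16_decode(const uint8_t *bytes, size_t len, int be, uint32_t *out)
{
    size_t i = 0, n = 0;

    /* A BOM, if present, tells us the byte order. */
    if (len >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) { be = 0; i = 2; }
    else if (len >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) { be = 1; i = 2; }

    while (i + 1 < len) {
        uint32_t u = be ? (uint32_t)((bytes[i] << 8) | bytes[i + 1])
                        : (uint32_t)((bytes[i + 1] << 8) | bytes[i]);
        i += 2;

        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < len) {   /* high surrogate */
            uint32_t lo = be ? (uint32_t)((bytes[i] << 8) | bytes[i + 1])
                             : (uint32_t)((bytes[i + 1] << 8) | bytes[i]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        out[n++] = u;
    }
    return n;
}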

Would a C compiler actually have an ascii look up table

I know there are a few similar questions around relating to this, but it's still not completely clear.
For example: if my C source file contains lots of string literals, does the compiler, as it translates the source file, go through each character of those strings and use a look-up table to get the ASCII number for each character?
I'd guess that when characters are entered dynamically into a running C program from standard input, it is the terminal that translates the actual characters to numbers, but then if we have in the code, for example:
if (ch == 'c'){//.. do something}
the compiler must have its own way of understanding and mapping the characters to numbers?
Thanks in advance for some help with my confusion.
The C standard talks about the source character set, which is the set of characters it expects to find in the source files, and the execution character set, which is the set of characters used natively by the target platform.
For most modern computers that you're likely to encounter, the source and execution character sets will be the same.
A line like if (ch == 'c') will be stored in the source file as a sequence of values from the source character set. For the 'c' part, the representation is likely 0x27 0x63 0x27, where the 0x27s represent the single quote marks and the 0x63 represents the letter c.
If the execution character set of the platform is the same as the source character set, then there's no need to translate the 0x63 to some other value. It can just use it directly.
If, however, the execution character set of the target is different (e.g., maybe you're cross-compiling for an IBM mainframe that still uses EBCDIC), then, yes, it will need a way to look up the 0x63 it finds in the source file to map it to the actual value for a c used in the target character set.
Outside the scope of what's defined by the standard, there's the distinction between character set and encoding. While a character set tells you what characters can be represented (and what their values are), the encoding tells you how those values are stored in a file.
For "plain ASCII" text, the encoding is typically the identity function: A c has the value 0x63, and it's encoded in the file simply as a byte with the value of 0x63.
Once you get beyond ASCII, though, there can be more complex encodings. For example, if your character set is Unicode, the encoding might be UTF-8, UTF-16, or UTF-32, which represent different ways to store a sequence of Unicode values (code points) in a file.
So if your source file uses a non-trivial encoding, the compiler will have to have an algorithm and/or a lookup table to convert the values it reads from the source file into the source character set before it actually does any parsing.
On most modern systems, the source character set is typically Unicode (or a subset of Unicode). On Unix-derived systems, the source file encoding is typically UTF-8. On Windows, the source encoding might be based on a code page, UTF-8, or UTF-16, depending on the code editor used to create the source file.
On many modern systems, the execution character set is also Unicode, but, on an older or less powerful computer (e.g., an embedded system), it might be restricted to ASCII or the characters within a particular code page.
Edited to address follow-on question in the comments
Any tool that reads text files (e.g., an editor or a compiler) has three options: (1) assume the encoding, (2) take an educated guess, or (3) require the user to specify it.
Most unix utilities assume UTF-8 because UTF-8 is ubiquitous in that world.
Windows tools usually check for a Unicode byte-order mark (BOM), which can indicate UTF-16 or UTF-8. If there's no BOM, it might apply some heuristics (IsTextUnicode) to guess the encoding, or it might just assume the file is in the user's current code page.
For files that have only characters from ASCII, guessing wrong usually isn't fatal. UTF-8 was designed to be compatible with plain ASCII files. (In fact, every ASCII file is a valid UTF-8 file.) Also many common code pages are supersets of ASCII, so a plain ASCII file will be interpreted correctly. It would be bad to guess UTF-16 or UTF-32 for plain ASCII, but that's unlikely given how the heuristics work.
Regular compilers don't expend much code dealing with all of this. The host environment can handle many of the details. A cross-compiler (one that runs on one platform to make a binary that runs on a different platform) might have to deal with mapping between character sets and encodings.
Sort of. Except you can drop the ASCII bit, in full generality at least.
The mapping between character constants like 'c' and their numeric values is a function of the encoding used by the architecture that the compiler is targeting. ASCII is one such encoding, but there are others, and the C standard places only minimal requirements on the encoding, an important one being that '0' through '9' must be consecutive, in one block, positive, and able to fit into a char. Another requirement is that 'A' to 'Z' and 'a' to 'z' must be positive values that can fit into a char.
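A small sketch of what that guarantee buys you: digit conversion by arithmetic is portable to any conforming encoding, whereas the same trick for letters is not guaranteed (EBCDIC's letters, for instance, are not contiguous):

#include <stdio.h>

int main(void)
{
    char ch = '7';

    /* Guaranteed by the C standard: '0'..'9' are consecutive, so this
       works whether the execution character set is ASCII, EBCDIC, or
       anything else. */
    if (ch >= '0' && ch <= '9')
        printf("digit value: %d\n", ch - '0');

    /* NOT guaranteed: 'a'..'z' need not be contiguous (they are not in
       EBCDIC), so ch - 'a' is not a portable way to index letters. */
    return 0;
}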
No, the compiler is not required to have such a thing. Think for a minute about a pre-C11 compiler reading EBCDIC source and translating it for an EBCDIC machine. What use would an ASCII look-up table be in such a compiler?
Also think for another minute about what such ASCII look-up table(s) would look like in such a compiler!

wchar_t vs char for creating an API

I am creating a C++ library meant to be used with different applications written in different languages like Java, C#, Delphi etc.
Every now and then I am stuck on conversions between wstrings, strings, char*, and wchar_t*. E.g. I stuck to wchar_t's but then had to use a regex library which accepts char, and ran into other similar problems.
I wish to stick to either the wide types or normal strings. My library will mostly deal with ASCII characters but can have non-ASCII characters too, as in names etc. So, can I permanently switch to char's instead of wchar_t's and string's instead of wstring's? Can I have Unicode support with them, and will it affect scalability and portability across different platforms and languages?
Please advise.
You need to decide which encoding to use. Some considerations:
If you can have non-ASCII characters, then there is no point in choosing ASCII or 8bit ANSI. That way leads to disappointment and risks data loss.
It makes sense to pick one encoding and stick to it. Everywhere. The Windows API is unusual in supporting both ANSI and Unicode, but that is due to backwards compatibility of old software. If Microsoft were starting over from scratch, there would be one encoding only.
The most common choices for Unicode encoding are UTF-8 and UTF-16. Any decent environment will have support for both. Either choice may be justifiable.
Java, VB, C# and Delphi all have good support for UTF-16, and all of them use UTF-16 for their native string types (in the case of Delphi, the native string type is UTF-16 only in Delphi 2009 and later. For earlier versions, you can use the WideString string type).
Windows is natively UTF-16 (*nix systems, like Linux, are natively UTF-8 instead), so if you are mainly targeting Windows it may well be simplest to just use UTF-16.
On the other hand, UTF-8 is probably a technically better choice being byte oriented, and backwards compatible with 8bit ASCII. Quite likely, if Unicode was being invented from scratch, there would be no UTF-16 and UTF-8 would be the variable length encoding.
You have phrased the question as a choice between char and wchar_t. I think that the real choice is what your preferred encoding should be. You also have to watch out that wchar_t is 16 bit (UTF-16) on some systems but 32 bit (UTF-32) on others. It is not a portable data type. That is why C++11 introduced the new char16_t and char32_t data types to correct that ambiguity.
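For illustration (C11 exposes the same fixed-width types via <uchar.h>, which fits the C focus of this page; the value printed for wchar_t varies by platform, which is exactly the ambiguity being described):

#include <stdio.h>
#include <uchar.h>   /* char16_t, char32_t (C11) */
#include <wchar.h>   /* wchar_t */

int main(void)
{
    char16_t s16[] = u"\u00fc";  /* UTF-16 code units (at least 16 bits each) */
    char32_t s32[] = U"\u00fc";  /* UTF-32 code units (at least 32 bits each) */
    wchar_t  sw[]  = L"\u00fc";  /* 16 bits on Windows, typically 32 on Linux */

    printf("sizeof(char16_t) = %zu\n", sizeof(char16_t));
    printf("sizeof(char32_t) = %zu\n", sizeof(char32_t));
    printf("sizeof(wchar_t)  = %zu\n", sizeof(wchar_t));

    (void)s16; (void)s32; (void)sw;
    return 0;
}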
The major difference between Unicode and a simple char is the code page. Having just a char* pointer is not enough to understand the meaning of the string: it could be in some specific encoding, it could be multibyte, etc. A wide character string does not have these caveats.
In many cases international aspects are not important. In this case the difference between these 2 representations is minimal. The main question that you need to answer: is internationalization important to your library or not?
Modern Windows programming should tend towards builds with UNICODE defined, and thus use wide characters and wide character APIs. This is desirable for improved performance (fewer or no conversions behind the Windows API layers), improved capabilities (sometimes the ANSI wrappers don't expose all capabilities of the wide function), and in general it avoids problems with the inability to represent characters that are not on the system's current code page (and thus in practice the inability to represent non-ASCII characters).
Where this can be difficult is when you have to interface with things that don't use wide characters. For example, while Windows APIs have wide character filenames, Linux filesystems typically use bytestrings. While those bytestrings are often UTF-8 by convention, there's little enforcement. Interfacing with other languages can also be difficult if the language in question doesn't understand wide characters at an API level. Ideally such languages have chosen a specific encoding, such as UTF-8, allowing you to convert to and from that encoding at the boundaries.
And that's one general recommendation: use Unicode internally for all processing, and convert as necessary at the boundaries. If this isn't already familiar to you, it's good to reference Joel's article on Unicode.
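A minimal sketch of converting at such a boundary on Windows (Windows-only, with error handling cut to the bare minimum): the program keeps UTF-8 internally and widens just before calling a wide-character ("W") API; CreateFileW is mentioned only as an example of such a call:

#include <windows.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* Internal representation: UTF-8 (the two é are written as octal escapes). */
    const char *utf8_path = "C:\\temp\\r\303\251sum\303\251.txt";

    /* First call: ask how many wide characters are needed (incl. terminator). */
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, NULL, 0);
    if (wlen == 0)
        return 1;

    wchar_t *wide_path = malloc((size_t)wlen * sizeof(wchar_t));
    if (!wide_path)
        return 1;

    /* Second call: do the actual UTF-8 -> UTF-16 conversion. */
    MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wide_path, wlen);

    /* wide_path can now cross the boundary into wide APIs,
       e.g. CreateFileW(wide_path, ...). */
    wprintf(L"%ls\n", wide_path);

    free(wide_path);
    return 0;
}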

Why isn't wchar_t widely used in code for Linux / related platforms?

This intrigues me, so I'm going to ask - for what reason is wchar_t not used so widely on Linux/Linux-like systems as it is on Windows? Specifically, the Windows API uses wchar_t internally whereas I believe Linux does not and this is reflected in a number of open source packages using char types.
My understanding is that given a character c which requires multiple bytes to represent it, then in a char[] form c is split over several parts of char* whereas it forms a single unit in wchar_t[]. Is it not easier, then, to use wchar_t always? Have I missed a technical reason that negates this difference? Or is it just an adoption problem?
wchar_t is a wide character with platform-defined width, which doesn't really help much.
UTF-8 characters span 1-4 bytes per character. UCS-2, which spans exactly 2 bytes per character, is now obsolete and can't represent the full Unicode character set.
Linux applications that support Unicode tend to do so properly, above the byte-wise storage layer. Windows applications tend to make this silly assumption that only two bytes will do.
wchar_t's Wikipedia article briefly touches on this.
The first people to use UTF-8 on a Unix-based platform explained:
The Unicode Standard [then at version 1.1] defines an adequate character set but an unreasonable representation [UCS-2]. It states that all characters are 16 bits wide [no longer true] and are communicated and stored in 16-bit units. It also reserves a pair of characters (hexadecimal FFFE and FEFF) to detect byte order in transmitted text, requiring state in the byte stream. (The Unicode Consortium was thinking of files, not pipes.) To adopt this encoding, we would have had to convert all text going into and out of Plan 9 between ASCII and Unicode, which cannot be done. Within a single program, in command of all its input and output, it is possible to define characters as 16-bit quantities; in the context of a networked system with hundreds of applications on diverse machines by different manufacturers [italics mine], it is impossible.
The italicized part is less relevant to Windows systems, which have a preference towards monolithic applications (Microsoft Office), non-diverse machines (everything's an x86 and thus little-endian), and a single OS vendor.
And the Unix philosophy of having small, single-purpose programs means fewer of them need to do serious character manipulation.
The source for our tools and applications had already been converted to work with Latin-1, so it was ‘8-bit safe’, but the conversion to the Unicode Standard and UTF[-8] is more involved. Some programs needed no change at all: cat, for instance, interprets its argument strings, delivered in UTF[-8], as file names that it passes uninterpreted to the open system call, and then just copies bytes from its input to its output; it never makes decisions based on the values of the bytes...Most programs, however, needed modest change.

...Few tools actually need to operate on runes [Unicode code points] internally; more typically they need only to look for the final slash in a file name and similar trivial tasks. Of the 170 C source programs...only 23 now contain the word Rune.

The programs that do store runes internally are mostly those whose raison d’être is character manipulation: sam (the text editor), sed, sort, tr, troff, 8½ (the window system and terminal emulator), and so on. To decide whether to compute using runes or UTF-encoded byte strings requires balancing the cost of converting the data when read and written against the cost of converting relevant text on demand. For programs such as editors that run a long time with a relatively constant dataset, runes are the better choice...
UTF-32, with code points directly accessible, is indeed more convenient if you need character properties like categories and case mappings.
But widechars are awkward to use on Linux for the same reason that UTF-8 is awkward to use on Windows. GNU libc has no _wfopen or _wstat function.
UTF-8, being compatible to ASCII, makes it possible to ignore Unicode somewhat.
Often, programs don't care (and in fact, don't need to care) about what the input is, as long as there is not a \0 that could terminate strings. See:
#include <stdio.h>
int main(void)
{
    char buf[256];  /* any reasonable size */
    printf("Your favorite pizza topping is which?\n");
    fgets(buf, sizeof(buf), stdin); /* Jalapeños */
    printf("%s it shall be.\n", buf);
}
The only time I found I needed Unicode support was when I had to treat a multibyte character as a single unit (wchar_t); e.g. when having to count the number of characters in a string rather than bytes. iconv from UTF-8 to wchar_t will quickly do that. For bigger issues like zero-width spaces and combining diacritics, something heavier like ICU is needed, but how often do you do that anyway?
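A minimal sketch of that kind of counting with standard C only (using mbstowcs rather than iconv, and assuming the user's locale is UTF-8 based, as is typical on Linux):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Use the user's locale; the count is only right in a UTF-8 locale. */
    setlocale(LC_ALL, "");

    const char *s = "Jalape\303\261o";        /* "Jalapeño" as UTF-8 bytes */
    wchar_t wbuf[64];

    size_t bytes = strlen(s);                 /* 9 bytes */
    size_t chars = mbstowcs(wbuf, s, 64);     /* 8 characters, or (size_t)-1
                                                 if the bytes don't match the
                                                 locale's encoding */

    printf("bytes: %zu, characters: %zu\n", bytes, chars);
    return 0;
}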
wchar_t is not the same size on all platforms. On Windows it is a UTF-16 code unit that uses two bytes. On other platforms it typically uses 4 bytes (for UCS-4/UTF-32). It is therefore unlikely that these platforms would standardize on using wchar_t, since it would waste a lot of space.

Resources