I want to create a string library with two different string classes for handling UTF-8 and UCS-2 (which I believe is a fixed-width 16-bit encoding, like UTF-16 but without surrogates, so it cannot represent characters above 0xFFFF).
On Windows platforms, wide chars are 2 octets wide. On Linux they are 4. So what happens with functions related to wide char strings? Do you pass buffers of 2-octet items on Windows and 4-octet items on Linux? If so, these functions behave completely differently on Windows and Linux, which doesn't make them really "standard"...
How can one handle this problem when trying to create a library that is supposed to manipulate wide chars the same way for cross-platform code? Thank you.
You're right about the different sizes of wchar_t on Windows and Linux. That also means you're right about the wide-character handling functions not being too useful. You should probably check out an encoding conversion library such as libiconv. Then you can work with UTF-8 internally and just convert on I/O.
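As a rough illustration of "convert at the boundary", here is a minimal sketch using the POSIX iconv API. The helper name utf8_to_utf16le is made up for this example, and the encoding names "UTF-8" and "UTF-16LE" are the ones glibc and GNU libiconv accept; other platforms may spell them differently.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to UTF-16LE.
   Returns the number of bytes written to out, or (size_t)-1 on failure
   (unknown encoding name, invalid input, or output buffer too small). */
size_t utf8_to_utf16le(const char *utf8, char *out, size_t out_size)
{
    /* Note the argument order: destination encoding first, source second. */
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in_ptr = (char *)utf8;   /* iconv wants char **, input is not modified */
    size_t in_left = strlen(utf8);
    char *out_ptr = out;
    size_t out_left = out_size;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_size - out_left;    /* bytes actually produced */
}
```

Calling it on "\303\274" (UTF-8 for "ü") should give the two bytes FC 00. On some systems you have to link with -liconv.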
I have the following code on Linux:-
rc = iconv_open("WCHAR_T", SourceCode);
prior to using iconv to convert the data into a wide character string (wchar_t).
I am now compiling this on z/OS. I do not know what value to use in place of "WCHAR_T". I have found that code pages are represented by 5-digit character strings on z/OS, e.g. code page 500 would be "00500", so I am happy enough with what to put into my SourceCode variable above. I just can't find a value that will work as the first parameter to iconv_open.
wchar_t is 4 bytes long on z/OS (when compiling 64-bit as I am), so I assume that I would need some variant of an EBCDIC equivalent to UTF-32 or UCS-4 perhaps, but I cannot find something that works. Every combination I have tried to date has returned with an errno of 121 (EINVAL: The parameter is incorrect).
If anyone familiar with how the above code works on Linux, could give a summary of what it does, that might also help. What does it mean to iconv into "WCHAR_T"? Is this a combination perhaps, of some data conversion and additionally a type change to wchar_t?
Alternatively, can anyone answer the question, "What is the internal representation of wchar_t on z/OS?"
wchar_t is an implementation defined data type. On z/OS it is 2 bytes in 31-bit mode and 4 bytes in 64-bit mode.
There is no single representation of wchar_t on z/OS. The encoding associated with the wchar_t data is dependent on the locale in which the application is running. It could be an IBM-939 Japanese DBCS code page or any of the other DBCS code pages that are used in countries like China, Korea, etc.
Wide string literals and character constants i.e. those defined as L"abc" or L'x' are converted to the implementation defined encoding used to implement wchar_t data type. This encoding is locale sensitive and can be manipulated using wide character run time library functions.
The conversion of multi byte string literals to wide string literals is typically done by calling one of the mbtowc run time library functions which respect the encoding associated with the locale in which the application is running.
iconv on the other hand can be used to convert any string literals to any one of the supported destination code pages including double byte code pages or any of the Unicode formats (UTF8, UTF16, UTF32). The operation of iconv is independent of wchar_t type.
Universal coded character set converters may be the answer to your question.
The closest to Unicode on z/OS would be UTF-EBCDIC but it requires defining locales that are based on UTF-EBCDIC.
If running as an ASCII application is an option, you could use UTF-32 as the internal encoding and provide iconv converters to/from any of the EBCDIC code pages your application needs to support. This would be better served by char32_t data type to avoid opacity of wchar_t.
What does character encoding in C programming language depend on? (OS? compiler? or editor?)
I'm working on not only characters of ASCII but also ones of other encodings such as UTF-8.
How can we check the current character encodings in C?
The C source code might be stored in distinct encodings. This is clearly compiler dependent (i.e. a compiler setting if available). Though, I wouldn't count on it and count on ASCII-only always. (IMHO this is the most portable way to write code.)
Actually, you can encode any character of any encoding using only ASCII in C source code if you write it as octal or hex escape sequences. (This is what I do from time to time to earn the respect of my colleagues: writing German texts with \303\244, \303\266, \303\274, \303\237 into translation tables from memory...)
Example: "\303\274" encodes the UTF-8 sequence for a string constant "ü". (But if I print this on my Windows console I only get "��" although I set code page 65001 which should provide UTF-8. The damn Windows console...)
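To see what such an escaped literal really contains, a small sketch like the following dumps the bytes as hex. The function name bytes_to_hex is just something I made up for the example.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative helper: write the bytes of s into out as space-separated
   uppercase hex pairs, e.g. "C3 BC". Stops early if out would overflow. */
void bytes_to_hex(const char *s, char *out, size_t out_size)
{
    size_t len = strlen(s);
    size_t used = 0;

    if (out_size > 0)
        out[0] = '\0';

    /* Each byte needs at most 3 characters plus the terminating NUL. */
    for (size_t i = 0; i < len && used + 4 <= out_size; i++) {
        used += (size_t)snprintf(out + used, out_size - used,
                                 i ? " %02X" : "%02X",
                                 (unsigned char)s[i]);
    }
}
```

For "\303\274" this yields "C3 BC": two bytes for one visible character, which is also why strlen reports 2 for it.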
The program written in C may handle any encoding you are able to deal with. Actually, the characters are only numbers which can be stored as one of the available integral types (e.g. char for ASCII and UTF-8, other int types for encodings with 16 or 32 bit wide characters). As already mentioned by Clifford, the output decides what to do with these numbers. Thus, this is platform dependent.
To handle characters according to a certain encoding (e.g. converting to upper or lower case, locale-specific dictionary sorting, etc.), you have to use an appropriate library. This might be part of the standard libraries, the system libraries, or 3rd party libraries.
This is especially true for conversion from one encoding to another. This is a good point to mention libintl.
I personally prefer ASCII, Unicode, and UTF-8 (and unfortunately UTF-16 as I'm doing most work on Windows 10). In this special case, the conversion can be done by a pure "bit-fiddling" algorithm (without any knowledge of special characters). You may have a look at Wikipedia UTF-8 to get a clue. By google, you probably will find something ready-to-use if you don't want to do it by yourself.
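For what it's worth, the "bit-fiddling" between UTF-8 and UTF-16 can look roughly like this. It is only a sketch, not the code referred to above; the names utf8_decode and utf16_encode are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 if the input is truncated
   or malformed (bad lead byte, overlong form, surrogate, out of range). */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len;
    uint32_t c;

    if (n == 0)
        return 0;
    c = s[0];
    if (c < 0x80)                { len = 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else return 0;               /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;            /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Encode cp as UTF-16. Returns the number of 16-bit units written
   (1, or 2 for a surrogate pair), or 0 if cp is not a valid scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Chaining the two in a loop gives a UTF-8 to UTF-16 converter that needs no tables and no knowledge of individual characters.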
The standard library of C++11 and C++14 provides support as well (e.g. std::codecvt_utf8), but it is marked as deprecated in C++17. Thus, I don't need to throw away my bit-fiddling code (of which I'm so proud). Oops. This is tagged with c – sorry.
It is platform or display device/framework dependent. The compiler does not care how the platform interprets either char or wchar_t when such values are rendered as glyphs on some display device.
If the output were to some remote terminal, then the rendering would be dependent on the terminal rather than the execution environment, while in a desktop computer, the rendering may be to a text console or to a GUI, and the resulting rendering may differ even between those.
I am creating a C++ library meant to be used with different applications written in different languages like Java, C#, Delphi etc.
Every now and then I am stuck on conversions between wstrings, strings, char*, wchar_t*. E.g. I stuck to wchar_t's but had to use a regex library which accepts chars, and ran into other similar problems.
I wish to stick to either w's or normal strings. My library will mostly deal with ASCII characters but can have non-ASCII characters too, as in names etc. So, can I permanently switch to char's instead of wchar_t's and string's instead of wstring's? Can I have Unicode support with them, and will it affect scalability and portability across different platforms and languages?
Please advise.
You need to decide which encoding to use. Some considerations:
If you can have non-ASCII characters, then there is no point in choosing ASCII or 8bit ANSI. That way leads to disappointment and risks data loss.
It makes sense to pick one encoding and stick to it. Everywhere. The Windows API is unusual in supporting both ANSI and Unicode, but that is due to backwards compatibility of old software. If Microsoft were starting over from scratch, there would be one encoding only.
The most common choices for Unicode encoding are UTF-8 and UTF-16. Any decent environment will have support for both. Either choice may be justifiable.
Java, VB, C# and Delphi all have good support for UTF-16, and all of them use UTF-16 for their native string types. (In the case of Delphi, the native string type is UTF-16 only in Delphi 2009 and later; for earlier versions, you can use the WideString string type.)
Most OS platforms are natively UTF-16 (*Nix systems, like Linux, are UTF-8 instead), so it may well be simplest to just use UTF-16.
On the other hand, UTF-8 is probably a technically better choice being byte oriented, and backwards compatible with 8bit ASCII. Quite likely, if Unicode was being invented from scratch, there would be no UTF-16 and UTF-8 would be the variable length encoding.
You have phrased the question as a choice between char and wchar_t. I think that the real choice is what your preferred encoding should be. You also have to watch out that wchar_t is 16bit (UTF-16) on some systems but is 32bit (UTF-32) on others. It is not a portable data type. That is why C++11 introduced the new char16_t and char32_t data types to correct that ambiguity.
The major difference between Unicode and simple char is code page. Having just a char* pointer is not enough to understand the meaning of the string. It can be in a certain specific encoding, it can be multibyte, etc. Wide character string does not have these caveats.
In many cases international aspects are not important. In this case the difference between these 2 representations is minimal. The main question that you need to answer: is internationalization important to your library or not?
Modern Windows programming should tend towards builds with UNICODE defined, and thus use wide characters and wide character APIs. This is desirable for improved performance (fewer or no conversions behind the Windows API layers), improved capabilities (sometimes the ANSI wrappers don't expose all capabilities of the wide function), and in general it avoids problems with the inability to represent characters that are not on the system's current code page (and thus in practice the inability to represent non-ASCII characters).
Where this can be difficult is when you have to interface with things that don't use wide characters. For example, while Windows APIs have wide character filenames, Linux filesystems typically use bytestrings. While those bytestrings are often UTF-8 by convention, there's little enforcement. Interfacing with other languages can also be difficult if the language in question doesn't understand wide characters at an API level. Ideally such languages have chosen a specific encoding, such as UTF-8, allowing you to convert to and from that encoding at the boundaries.
And that's one general recommendation: use Unicode internally for all processing, and convert as necessary at the boundaries. If this isn't already familiar to you, it's good to reference Joel's article on Unicode.
I faced a problem with encodings on different platforms (in my case Windows and Linux). On Windows, the size of wchar_t is 2 bytes, whereas on Linux it's 4 bytes. How can I "standardize" wchar_t to be the same size on both platforms? Is it hard to implement without additional libraries? For now, I'm aiming for the printf/wprintf API. The data is sent via socket communication. Thank you.
If you want to send Unicode data across different platforms and architectures, I'd suggest using UTF-8 encoding and (8-bit) chars. UTF-8 has some advantages, like not having endianness issues: UTF-8 is just a plain sequence of bytes, whereas both UTF-16 and UTF-32 can be little-endian or big-endian.
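To make the endianness point concrete, here is a tiny sketch of how one and the same UTF-16 code unit ends up as two different byte sequences on the wire. The function name utf16_unit_to_bytes is hypothetical.

```c
#include <stdint.h>

/* Serialize one UTF-16 code unit as two bytes in the requested byte order. */
void utf16_unit_to_bytes(uint16_t unit, int big_endian, unsigned char out[2])
{
    unsigned char hi = (unsigned char)(unit >> 8);
    unsigned char lo = (unsigned char)(unit & 0xFF);

    out[0] = big_endian ? hi : lo;
    out[1] = big_endian ? lo : hi;
}
```

For U+00FC ("ü") this gives 00 FC in big-endian and FC 00 in little-endian, so both sides of a socket have to agree on the order (or send a byte order mark). The UTF-8 form is always C3 BC, whatever the machine.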
On Windows, just convert the UTF-8 text to UTF-16 at the boundary of Win32 APIs (since Windows APIs tend to work with UTF-16). You can use the MultiByteToWideChar() API for that.
To solve this problem I think you are going to have to convert all strings into UTF-8 before transmitting. On Windows you would use the WideCharToMultiByte function to convert wchar_t strings to UTF-8 strings, and MultiByteToWideChar to convert UTF-8 strings into wchar_t strings.
On Linux things aren't as straightforward. You can use the functions wctomb and mbtowc, however what they convert to/from depends on the underlying locale setting. So if you want these to convert to/from UTF-8 and Unicode then you'll need to make sure the locale is set to use UTF-8 encoding.
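A minimal sketch of the locale-dependent route, using mbstowcs (the string counterpart of mbtowc). The locale names "C.UTF-8" and "en_US.UTF-8" are assumptions: which UTF-8 locales are installed varies from system to system, so the helper (utf8_to_wide, a name made up here) reports failure if neither can be selected.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert a UTF-8 string to wchar_t using the C library's locale machinery.
   Returns the number of wide characters written (excluding the terminator),
   or (size_t)-1 if no UTF-8 locale could be selected or the input is invalid.
   Caveat: this changes the process-wide LC_CTYPE locale as a side effect. */
size_t utf8_to_wide(const char *utf8, wchar_t *out, size_t out_len)
{
    static const char *candidates[] = { "C.UTF-8", "en_US.UTF-8" };
    const char *chosen = NULL;

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        chosen = setlocale(LC_CTYPE, candidates[i]);
        if (chosen != NULL)
            break;
    }
    if (chosen == NULL)
        return (size_t)-1;       /* no UTF-8 locale available */

    return mbstowcs(out, utf8, out_len);
}
```

Without the setlocale call the program runs in the "C" locale, where the conversion of non-ASCII bytes typically fails. That is the trap described above.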
This article might also be a good resource.
This intrigues me, so I'm going to ask - for what reason is wchar_t not used so widely on Linux/Linux-like systems as it is on Windows? Specifically, the Windows API uses wchar_t internally whereas I believe Linux does not and this is reflected in a number of open source packages using char types.
My understanding is that given a character c which requires multiple bytes to represent it, then in a char[] form c is split over several parts of char* whereas it forms a single unit in wchar_t[]. Is it not easier, then, to use wchar_t always? Have I missed a technical reason that negates this difference? Or is it just an adoption problem?
wchar_t is a wide character with platform-defined width, which doesn't really help much.
UTF-8 characters span 1-4 bytes per character. UCS-2, which spans exactly 2 bytes per character, is now obsolete and can't represent the full Unicode character set.
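The 1-4 byte rule is easy to see in an encoder. This is only a sketch with an invented name (utf8_encode), following the standard UTF-8 bit layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 7 bits: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 11 bits */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 16 bits */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 21 bits */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Anything above U+FFFF takes the 4-byte branch, and those are exactly the code points a 2-byte UCS-2 unit cannot hold.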
Linux applications that support Unicode tend to do so properly, above the byte-wise storage layer. Windows applications tend to make this silly assumption that only two bytes will do.
wchar_t's Wikipedia article briefly touches on this.
The first people to use UTF-8 on a Unix-based platform explained:
The Unicode Standard [then at version 1.1] defines an adequate character set but an unreasonable representation [UCS-2]. It states that all characters are 16 bits wide [no longer true] and are communicated and stored in 16-bit units. It also reserves a pair of characters (hexadecimal FFFE and FEFF) to detect byte order in transmitted text, requiring state in the byte stream. (The Unicode Consortium was thinking of files, not pipes.) To adopt this encoding, we would have had to convert all text going into and out of Plan 9 between ASCII and Unicode, which cannot be done. Within a single program, in command of all its input and output, it is possible to define characters as 16-bit quantities; in the context of a networked system with hundreds of applications on diverse machines by different manufacturers [italics mine], it is impossible.
The italicized part is less relevant to Windows systems, which have a preference towards monolithic applications (Microsoft Office), non-diverse machines (everything's an x86 and thus little-endian), and a single OS vendor.
And the Unix philosophy of having small, single-purpose programs means fewer of them need to do serious character manipulation.
The source for our tools and applications had already been converted to work with Latin-1, so it was ‘8-bit safe’, but the conversion to the Unicode Standard and UTF[-8] is more involved. Some programs needed no change at all: cat, for instance, interprets its argument strings, delivered in UTF[-8], as file names that it passes uninterpreted to the open system call, and then just copies bytes from its input to its output; it never makes decisions based on the values of the bytes... Most programs, however, needed modest change.
...Few tools actually need to operate on runes [Unicode code points] internally; more typically they need only to look for the final slash in a file name and similar trivial tasks. Of the 170 C source programs...only 23 now contain the word Rune.
The programs that do store runes internally are mostly those whose raison d’être is character manipulation: sam (the text editor), sed, sort, tr, troff, 8½ (the window system and terminal emulator), and so on. To decide whether to compute using runes or UTF-encoded byte strings requires balancing the cost of converting the data when read and written against the cost of converting relevant text on demand. For programs such as editors that run a long time with a relatively constant dataset, runes are the better choice...
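The "look for the final slash in a file name" case from the quote is a nice example of code that needs no Unicode awareness at all. In UTF-8, every byte of a multibyte sequence has its high bit set, so an ASCII byte such as '/' can never appear inside one. A sketch (utf8_basename is a name made up for the illustration):

```c
#include <string.h>

/* Return a pointer to the part of path after the final '/'.
   Works unchanged on UTF-8 paths, because the byte 0x2F only ever
   encodes '/' and never occurs inside a multibyte sequence. */
const char *utf8_basename(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}
```

The same byte-wise search on UTF-16 data would need 16-bit units, and in many legacy multibyte encodings it is simply wrong, which is part of why UTF-8 fitted existing Unix tools so well.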
UTF-32, with code points directly accessible, is indeed more convenient if you need character properties like categories and case mappings.
But widechars are awkward to use on Linux for the same reason that UTF-8 is awkward to use on Windows. GNU libc has no _wfopen or _wstat function.
UTF-8, being compatible to ASCII, makes it possible to ignore Unicode somewhat.
Often, programs don't care (and in fact, don't need to care) about what the input is, as long as there is not a \0 that could terminate strings. See:
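#include <stdio.h>
int main(void) {
    char buf[256]; /* any reasonable size will do */
    printf("Your favorite pizza topping is which?\n");
    if (fgets(buf, sizeof(buf), stdin)) /* Jalapeños */
        printf("%s it shall be.\n", buf); /* bytes pass through untouched */
    return 0;
}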
The only times I found I needed Unicode support were when I had to treat a multibyte character as a single unit (wchar_t), e.g. when having to count the number of characters in a string, rather than bytes. iconv from UTF-8 to wchar_t will quickly do that. For bigger issues like zero-width spaces and combining diacritics, something heavier like ICU is needed, but how often do you do that anyway?
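If counting is all that is needed, it can also be done directly on the UTF-8 bytes without converting anything. A sketch, assuming the input is valid UTF-8 (the function name utf8_count_code_points is invented here):

```c
#include <stddef.h>

/* Count Unicode code points in a valid, NUL-terminated UTF-8 string by
   counting every byte that is NOT a continuation byte (10xxxxxx). */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For "Jalapeños" encoded as UTF-8 this returns 9, while strlen returns 10. Note that this counts code points, not what a user sees as characters; combining diacritics still need something like ICU.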
wchar_t is not the same size on all platforms. On Windows it is a UTF-16 code unit that uses two bytes. On other platforms it typically uses 4 bytes (for UCS-4/UTF-32). It is therefore unlikely that these platforms would standardize on using wchar_t, since it would waste a lot of space.
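You can check what your own toolchain does with a couple of one-liners. A small sketch; the function names are made up for the example:

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>
#include <wchar.h>   /* wchar_t, WCHAR_MAX */

/* Width of wchar_t in bits on the platform this is compiled for. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Non-zero if a single wchar_t can hold every Unicode code point
   (up to U+10FFFF), i.e. no surrogate pairs are needed. */
int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

With the usual toolchains I would expect 16 bits on Windows and 32 bits on Linux, which is exactly the portability problem described above.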