Are there any Unicode/wide chars that encode to multiple encoded characters? (C)

Consider wctomb(), which takes a wide character and encodes it to the currently selected character set. The glibc man page states that the output buffer should be MB_CUR_MAX bytes, while the FreeBSD man page states the output buffer size should be MB_LEN_MAX. Which is correct here?
Are there any example wide char/encoding combinations where it takes multiple encoded characters to represent the wide char?
On a more general note, does MB_CUR_MAX refer to the maximum combined byte count needed to represent a wide char, or just the maximum byte count for any single encoded char?

MB_CUR_MAX is correct, but both are big enough. You might want to use MB_LEN_MAX if you want to avoid variable-length array declarations.
MB_CUR_MAX is the maximum number of bytes in a multibyte character in the current locale. MB_LEN_MAX is the maximum number of bytes in a character for any supported locale. Unlike MB_CUR_MAX, MB_LEN_MAX is a macro so it can be used in an array declaration without creating a VLA.
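For illustration, here is a minimal sketch contrasting the two buffer declarations; the wide character L'\u00e9' ('é') is just an assumed example and must be representable in the current locale:

#include <limits.h>   /* MB_LEN_MAX */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>   /* MB_CUR_MAX, wctomb */

int main(void)
{
    setlocale(LC_ALL, "");             /* use the locale from the environment */

    char fixed[MB_LEN_MAX];            /* macro: no VLA needed */
    char variable[MB_CUR_MAX];         /* runtime value: a VLA in C99 and later */

    int n = wctomb(fixed, L'\u00e9');  /* encode one wide char into 'fixed' */
    if (n > 0)
        printf("%d byte(s) used, MB_CUR_MAX = %zu\n", n, (size_t)MB_CUR_MAX);

    (void)variable;                    /* declared only to show the alternative */
    return 0;
}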
Both constants refer to a single wide character. There is no simple definition of what a multibyte character is exactly, since multibyte encodings can include shift sequences; if the multibyte locale includes shift sequences, the number of bytes required for a particular call to wctomb with a particular wide character might vary from call to call depending on the shift state. (Also, the actual code might be different in different shift states.)
As far as I know, there is nothing which prevents a wide character from being translated to a multibyte sequence which might be decomposable into other multibyte sequences (as with Unicode composition); the definition of wctomb talks only about "representation". But I don't know of an implementation which does that, either; Unicode canonical decomposition must be done with separate APIs.
So it is possible that no installed locale requires a value as large as MB_LEN_MAX. But there is nothing stopping you from adding locales -- or even creating your own -- provided that they don't exceed the encoding limit (16 bytes on Linux).

Related

Converting C strings to Pascal strings

When converting a C string into a Pascal string, why should the length of the original string be less than or equal to 127 instead of 255? I understand that an unsigned byte ranges from 0 to 255 and a signed one from -128 to 127, but isn't the first character (the length byte) of a Pascal string unsigned?
The Pascal string you are referring to is probably the one used in older Pascals (called ShortString in e.g. Delphi and FreePascal, the most popular Pascal implementations these days). It can contain up to 255 single-byte characters (char in C). There is no need to restrict it to 127 characters.
Perhaps you were thinking of the fact that 255 bytes can only contain 127 UTF-16 code points. But these strings were popular in the old CP/M and DOS days, when no one knew anything about Unicode yet, and were made to contain ASCII or "Extended ASCII" (8 bit, using code pages).
But most modern Pascal implementations allow you to use strings up to 2 GB in size. There, the length indicator is no longer stored as the first element; it is kept just before the text data. And these days most of these strings can contain Unicode too, either as UTF-16 or as UTF-8, depending on the string type you choose (modern Pascal implementations have several different string types for different purposes, so there is no longer one single "Pascal string type").
Some languages do have the ability to restrict the size of a ShortString, as so-called "counted" strings:
var
  s: string[18];
That string has a maximum of 18 bytes of text data and 1 byte of length data (at index 0). Such shorter strings can be used in, say, records, so they don't grow too big.
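To make the classic layout concrete from the C side, here is a minimal sketch of how such a counted string could be modelled; the type and helper names are made up for illustration:

#include <stdio.h>
#include <string.h>

/* One unsigned length byte followed by up to 255 characters: the classic
   ShortString layout, which is why 255 (not 127) is the natural limit. */
typedef struct {
    unsigned char len;        /* 0..255 */
    char          data[255];
} ShortStringC;

/* Hypothetical helper: copy a C string, truncating at 255 characters. */
static void c_to_short_string(ShortStringC *dst, const char *src)
{
    size_t n = strlen(src);
    if (n > 255)
        n = 255;
    dst->len = (unsigned char)n;
    memcpy(dst->data, src, n);
}

int main(void)
{
    ShortStringC s;
    c_to_short_string(&s, "hello");
    printf("length byte = %u\n", s.len);   /* prints 5 */
    return 0;
}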
FreePascal's wiki has a great page showing all the types of strings that Pascal (at least that implementation) supports: http://wiki.freepascal.org/Character_and_string_types - it includes length-prefixed and null-terminated string types. None of the types on that page have a length restriction of 127.
The string type you're referring to would match ShortString which has a single byte prefix, however their documentation states it accepts 0-255.
I am aware of a string type that has a variable-length-integer prefix; that would restrict the length of the string to 127 characters if you want the in-memory representation to be binary-compatible with ShortString, because being 128 characters or longer would set the MSB to 1, which in variable-length integers means the integer is at least 2 bytes long instead of 1 byte.

How can I print a string with the same length with or without multibyte characters?

I am trying to do exercise 1-22 in the K&R book. It asks you to fold long lines (i.e. break them onto a new line) after a predefined number of characters.
I was testing the program and it worked well, but I saw that some lines were "folding" earlier than they should. I noticed that this happened on lines containing special characters, such as:
ö ş ç ğ
So, my question is, how do I ensure that lines are printed with the same maximum length, with or without multibyte characters?
What happens in your code?
K&R was written at a time when every character was encoded in a single char. Examples of such encoding standards are ASCII and ISO 8859.
Nowadays the leading standard is Unicode, which comes in several flavors. The UTF-8 encoding represents the thousands of Unicode characters on 8-bit bytes, using a variable-length scheme:
the ASCII characters (i.e. 0x00 to 0x7F) are encoded as a single byte.
all other characters are encoded as 2 to 4 bytes.
So the letter ö and the others in your list are encoded as 2 consecutive bytes. Unfortunately, the standard C library and the algorithms of K&R do not handle variable-length encodings, so each of your special characters is counted as two bytes and your algorithm is thrown off.
How to solve it?
There is no easy way. You must make a distinction between the length of the strings in memory, and the length of the strings when they are displayed.
I can propose a trick that uses a property of the encoding scheme: whenever you count the display length of a string, just ignore the bytes c in memory that satisfy the condition (c & 0xC0) == 0x80 (these are the UTF-8 continuation bytes).
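As a minimal sketch of that trick (the function name is made up; it assumes valid UTF-8 and counts every code point as one position, so combining marks and double-width characters are not handled):

#include <stddef.h>
#include <stdio.h>

/* Count display positions in a UTF-8 string by skipping continuation
   bytes, i.e. bytes matching the bit pattern 10xxxxxx. */
static size_t utf8_display_length(const char *s)
{
    size_t len = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            len++;
    return len;
}

int main(void)
{
    /* Source file assumed to be UTF-8 encoded: 7 positions, 11 bytes. */
    printf("%zu\n", utf8_display_length("ö ş ç ğ"));
    return 0;
}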
Another way would be to use wide chars wchar_t/wint_t (requires the header wchar.h) instead of char/int, and getwc()/putwc() instead of getc()/putc(). If in your environment sizeof(wchar_t) is 4, you will be able to work with Unicode just by using the wide characters and wide library functions instead of the narrow ones mentioned in K&R. If however sizeof(wchar_t) is smaller (for example 2), you can still work correctly with a large subset of Unicode, but could run into trouble in some cases with characters outside that subset, which then take two wide chars (surrogate pairs).
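A minimal sketch of that wide-character approach, assuming the environment provides a UTF-8 locale; unlike the full exercise it folds at an arbitrary column rather than at blanks:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define MAXCOL 40   /* fold after this many characters (arbitrary choice) */

int main(void)
{
    setlocale(LC_ALL, "");              /* decode input according to the locale */

    wint_t wc;
    int col = 0;
    while ((wc = getwc(stdin)) != WEOF) {
        putwc(wc, stdout);
        /* every wide char is one position, whether it was 1 byte or more */
        if (wc == L'\n' || ++col >= MAXCOL) {
            if (wc != L'\n')
                putwc(L'\n', stdout);
            col = 0;
        }
    }
    return 0;
}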
As in the comment, your string is probably encoded in UTF-8. That means that some characters, including the ones you mention, use more than one byte. If you simply count bytes to determine the width of your output, your computed value may be too large.
To properly determine the number of characters in a string with multibyte characters, use a function such as mbrlen(3).
You can use mbrtowc(3) to find out the number of bytes of the first character in a string, if you're counting character for character.
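For example, a minimal sketch of counting characters with mbrlen(), assuming the string is valid in the current locale:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Count characters (not bytes) in a multibyte string. */
static size_t mb_char_count(const char *s)
{
    mbstate_t st;
    size_t count = 0, left = strlen(s);

    memset(&st, 0, sizeof st);
    while (left > 0) {
        size_t n = mbrlen(s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            break;                       /* invalid or incomplete sequence */
        s += n;
        left -= n;
        count++;
    }
    return count;
}

int main(void)
{
    setlocale(LC_ALL, "");
    /* Source file assumed UTF-8: 4 letters + 3 spaces = 7 characters */
    printf("%zu\n", mb_char_count("ö ş ç ğ"));
    return 0;
}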
This of course goes way beyond the scope of the K&R book, which was written before multibyte encodings were in common use.

Problem with handling path length

I'm creating a library which will be used for file manipulation, both on Linux and Windows, so I need to handle paths. The main requirement is that my functions will receive strings in UTF-8 format. But this causes some problems; one of them is that I'm using MAX_PATH on Windows and PATH_MAX on Linux to size static path variables. With ASCII characters there is no problem, but when a path contains Unicode characters, the number of characters that fit is cut in half if a character needs 2 bytes, to a third if it needs 3 bytes, and so on. So is there a good solution for this problem?
Thanks in advance!
p.s. sorry for my english.
At least on Linux, your concern seems misplaced. Linux (and POSIX in general) treats paths as an opaque blob of bytes terminated by "\0". It does not concern itself with how those bytes are translated to characters. That is, PATH_MAX specifies the max length of a path name in bytes, not in characters.
So if a path name contains multibyte UTF-8 characters, it just means that the maximum path length in characters is correspondingly less than PATH_MAX (which counts bytes).
UTF-8 is a multibyte encoding that uses 1 to 4 bytes per character.
Since you want to define a static max path value, you may need to define it as n*4 (where n is the path length in characters that you want to allow) to accommodate UTF-8 encoded characters.
That totally depends on what you need.
If you want MAX_PATH number of bytes, you simply define a buffer as char name[MAX_PATH]. If you want MAX_PATH number of characters, you define a buffer as char name[MAX_PATH * 4], as UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets.
In short, as janneb points out, MAX_PATH (or PATH_MAX) specifies the number of underlying bytes, not characters.
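As a minimal sketch of that sizing rule (MAX_UTF8_CHARS is a made-up limit of our own; PATH_MAX is not guaranteed to be defined on every system):

#include <limits.h>         /* PATH_MAX, where the system provides it */
#include <stdio.h>

#define MAX_UTF8_CHARS 255  /* hypothetical per-path character limit we choose */

int main(void)
{
#ifdef PATH_MAX
    /* A buffer sized in bytes, the unit PATH_MAX/MAX_PATH actually count. */
    char byte_path[PATH_MAX];
    (void)byte_path;
    printf("PATH_MAX = %d bytes\n", (int)PATH_MAX);
#endif

    /* A buffer guaranteed to hold MAX_UTF8_CHARS characters of UTF-8:
       worst case 4 bytes per character, plus the terminating NUL. */
    char utf8_path[MAX_UTF8_CHARS * 4 + 1];
    (void)utf8_path;

    return 0;
}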
Doesn’t Microsoft use either UCS-2 or UTF-16 for its pathnames, so that MAX_PATH reflects a count of 16-bit code units, not even proper characters?
I know that Apple uses UTF-16, and that each component in a pathname can be up to 255 UTF-16 code units (not characters), and that it has been normalized to something approximating NFD since long ago.
I suspect you will have to first normalize if necessary, such as to NFD for Apple, then encode to your native filesystem’s internal format, and then check the length.
When you do that comparison, it is critical to remember that Unix uses 8-bit code units, Microsoft and Apple use 16-bit code units, and that no one seems to bother to actually use abstract characters. They could do that if they used UTF-32, but nobody wastes that much space in the filesystem. Pity, that.

Using narrow string manipulation functions on wide data

I'm parsing an XML file which can contain localized strings in different languages (at the moment it's just English and Spanish, but in the future it could be any language). The API of the XML parser returns all data within the XML via a char* which is UTF-8 encoded.
Some manipulation of the data is required after it has been parsed (searching within it for substrings, concatenating strings, determining the length of substrings, etc.).
It would be convenient to use standard functions such as strlen, strcat, etc. As the raw data I'm receiving from the XML parser is a char*, I can do all this manipulation readily using these standard string handling functions.
However, these all of course assume and require that the strings are null-terminated.
My question therefore is - if you have wide data represented as a char*, can a NULL terminator character occur within the data rather than at the end?
i.e. if a character in a certain language doesn't require 2 bytes to represent it, and it is represented in one byte, will/can the other byte be NULL?
UTF-8 is not "wide". UTF-8 is a multibyte encoding, in which a Unicode character can take 1 to 4 bytes. UTF-8 will never have a zero byte inside a valid character. Make sure you are not confused about what your parser is giving you: it could be UTF-16 or UCS-2 (or their 4-byte equivalents) placed in wide character strings, in which case you have to treat them as wide strings.
C distinguishes between multibyte characters and wide characters:
Wide characters must be able to represent any character of the execution character set using exactly the same number of bytes (e.g. if 兀 takes 4 bytes to be represented, A must also take 4 bytes to be represented). Examples of wide character encodings are UCS-4, and the deprecated UCS-2.
Multibyte characters can take a varying number of bytes to be represented. Examples of multibyte encodings are UTF-8 and UTF-16.
When using UTF-8, you can continue to use the str* functions, but bear in mind that they don't provide a way to get the length of a string in characters; for that you need to convert to wide characters and use wcslen. strlen returns the length in bytes, not characters, which is itself useful in other situations.
I can't stress enough that every element of the execution character set needs to be representable as a single wide character of a predefined size in bytes. Some systems use UTF-16 for their wide characters; the result is that the implementation can't conform to the C standard, and some wc* functions can't possibly work right.
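A minimal sketch of the bytes-versus-characters distinction, assuming a UTF-8 locale and a UTF-8 encoded source file:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");

    const char *s = "año";                         /* 'ñ' takes 2 bytes in UTF-8 */
    printf("bytes:      %zu\n", strlen(s));        /* 4 */

    wchar_t w[16];
    size_t n = mbstowcs(w, s, sizeof w / sizeof w[0]);
    if (n != (size_t)-1)
        printf("characters: %zu\n", wcslen(w));    /* 3 */

    return 0;
}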

wcstombs: character encoding?

The wcstombs documentation says it "converts the sequence of wide-character codes to a multibyte string". But it never says what a "wide character" is.
Is it implicit, i.e. does it convert, say, UTF-16 to UTF-8, or is the conversion defined by some environment variable?
Also, what is the typical use case of wcstombs?
You use the setlocale() standard function with the LC_CTYPE (or LC_ALL) category to set the mapping the library uses between wchar_t characters and multibyte characters. The actual locale name passed to setlocale() is implementation defined, so you'll need to look it up in your compiler's docs.
For example, with MSVC you might use
setlocale( LC_ALL, ".1252" );
to set the C runtime to use codepage 1252 as the multibyte character set. Note that the MSVC docs explicitly indicate that the locale cannot be set to UTF-7 or UTF-8 for the multibyte character sets:
The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL.
The "wide-character" wchar_t type is intended to be able to support any character set the system supports - the standard doesn't define the size of a wchar_t type (it could be as small as a char or any of the larger integer types). On Windows it's the system's 'internal' Unicode encoding, which is UTF-16 (UCS-2 before WinXP). Honestly, I can't find a direct quote on that in the MSVC docs, though. Strictly speaking, the implementation should call this out, but I can't find it.
It converts whatever your platform uses for a "wide char" (which I'm led to believe is indeed UCS-2 on Windows, but is usually UCS-4 on UNIX) into your current locale's default multibyte character encoding. If your locale is a UTF-8 one, then that is the multibyte encoding that will be used - but note that there are other possibilities, like JIS.
According to the C standard, the wchar_t type is "capable of representing any character in the current locale". The standard doesn't say what the encoding for wchar_t is. In fact, the minimum required limits on WCHAR_MIN and WCHAR_MAX are only [0, 255] or [-127, 127], depending on whether wchar_t is unsigned or signed.
A multibyte character can use more than one byte. A multibyte string is made of one or more multibyte characters. In a multibyte string, each character need not be of equal number of bytes (UTF-8 is an example). Whereas, an object of type wchar_t has a fixed size (in a given implementation, of course).
As an aside, I can also find the following in my copy of the C99 draft:
__STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month.
So, if I understood correctly, if __STDC_ISO_10646__ is defined, then wchar_t can store Unicode characters.
Wide character strings are composed of fixed-size wide characters (wchar_t), whereas the normal C string is a char* - a sequence of byte-wide units that may form multibyte characters. wchar_t is not the same thing as Unicode on all platforms, though Unicode representations are typically based on wchar_t.
I've seen wchars used in embedded systems like phones, where you want filenames with special characters but don't necessarily want to support all the glory and complexity of unicode.
Typical usage would be converting a 2-byte-based wide string to a regular C string, and vice versa.
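A minimal sketch of that typical usage, assuming the environment supplies a UTF-8 locale (in the plain "C" locale the non-ASCII character would not convert):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");                  /* LC_CTYPE decides the multibyte encoding */

    const wchar_t *wide = L"caf\u00e9";     /* "café" as a wide string */
    char mb[64];

    size_t n = wcstombs(mb, wide, sizeof mb);
    if (n == (size_t)-1) {
        fprintf(stderr, "a wide char is not representable in this locale\n");
        return EXIT_FAILURE;
    }
    printf("%zu bytes: %s\n", n, mb);       /* 5 bytes in a UTF-8 locale */
    return 0;
}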

Resources