Problem with handling path length - c

I'm creating library which will be used for file manipulations, both on linux and windows. So I need to handle paths, the main requirements is that my functions will recieve strings in UTF8 format. But it causes some problems, one of them is I'm using MAX_PATH on windows and PATH_MAX in linux, to represent static path variables. In the case of ASCII characters there will be no problem, but when path contains unicode characters, the length of path will be twice shorter if unicode char requires 2 bytes per char, 3 times shorter if unicode char requires 3 bytes per char and so on. So is there good solution for this problem?
Thanks in advance!
p.s. sorry for my english.

At least on Linux, your concern seems misplaced. Linux (and POSIX in general) treats paths as an opaque blob of bytes terminated by "\0". It does not concern itself with how those bytes are translated to characters. That is, PATH_MAX specifies the max length of a path name in bytes, not in characters.
So if the path names contains >= 0 multibyte UTF-8 characters, then it just means that the max path length in characters is <= PATH_MAX.

UTF-8 is multibyte encoding format ranging from 1 to 4 bytes per character.
As you want to statically define max path value, you may need to define max path as n*4 (where n is the path length in ASCII characters you want to define) to accommodate UTF-8 encoded characters.

That totally depends on what you need.
If you want MAX_PATH number of bytes, you simply define a buffer as char name[MAX_PATH]. If you want MAX_PATH number of characters, you define a buffer as char name[MAX_PATH * 4], as UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets.
In a word, as janneb points out, MAX_PATH (or PATH_MAX) specifies the number of underlying bytes instead of characters.

Doesn’t Microsoft use either UCS-2 or UTF-16 for its pathnames, and that so MAX_PATH has a length that reflects 16-bit code units, not even proper characters?
I know that Apple uses UTF-16, and that each component in a pathname can be up to 256 UTF-16 code units not characters, and that it normalized to something approximating NFD from a long time ago.
I suspect you will have to first normalize if necessary, such as to NFD for Apple, then encode to your native filesystem’s internal format, and then check the length.
When you do that comparison, it is critical to remember that Unix uses 8-bit code units, Microsoft and Apple use 16-bit code units, and that no one seems to bother to actually use abstract characters. They could do that if they used UTF-32, but nobody wastes that much space in the filesystem. Pity, that.

Related

Convert a `char *` to UTF-8 in C, or when using xmlwriter?

I'm using libxml/xmlwriter to generate an XML file within a program.
const char *s = someCharactersFromSomewhere();
xmlTextWriterWriteAttribute (writer, _xml ("value"), _xml (s));
In general I don't have much control over the contents of s, so I can't guarantee that it will be well-formatted in UTF-8. Mostly it is, but if not, the XML which is generated will be malformed.
What I'd like to find is a way to convert s to valid UTF-8, with any invalid character sequences in s replaced with escapes or removed.
Alternatively, if there is an alternative to xmlTextWriterWriteAttribute, or some option I can pass in when initializing the XML writer, such that it guarantees that it will always write valid UTF-8, that would be even better.
One more thing to mention is that the solution must work with both Linux and OSX. Ideally writing as little of my own code as possible! :P
In case the string is encoded in ASCII, then it will always be valid UTF-8 string.
This is because UTF-8 is backwards compatible with ASCII encoding.
See the second paragraph on Wikipedia here.
Windows primarily works with UTF-16, this means you will have to convert from UTF-16 to UTF-8 before you pass the string to the XML library.
If you have 8-bit ascii input then you can simply junk any character code > 127.
If you have some dodgy UTF-8 it is quite easy to parse, but the widechar symbol number that you generate might be out of the unicode range. You can use mbrlen() to individually validate each character.
I am describing this using unsigned chars. If you must use signed chars, then >128 means <0.
At its simplest:
Until the null byte
1 If the next byte is 0, then end the loop
2 If the next byte is < 128 then it is ascii, so keep it
3 If the next byte is >=128 < 128+64 it is invalid - discard it
4 If the next byte is >= 128+64 then it is probably a proper UTF-8 lead byte
call size_t mbrlen(const char *s, size_t n, mbstate_t *ps);
to see how many bytes to keep
if mbrlen says the code is bad (either the lead byte or the trail bytes),
skip 1 byte. Rule 3 will skip the rest.
Even simpler logic just calls mbrlen repeatedly, as it can accept the low ascii range.
You can assume that all the "furniture" of the file (eg xml <>/ symbols, spaces, quotes and newlines) won't be altered by this edit, as they are all valid 7-bit ascii codes.
char is a single byte character, while UTF codepoints range from 0 to 0x10FFFFF, so how do you represent a UTF character in only one byte?
First of all you need a wchar_t character. Those are used with wprintf(3) versions of the normal printf(3) routines. If you dig a little on this, you'll see that mapping your UTF codepoints into valid UTF-8 encoding is straigtforward, based on your setlocale(3) settings. Look at those manual pages referenced, and you'll get an idea of the task you are facing.
There's full support for wide character sets in the C standard... but you have to use it through the internationalization libraries and locales availables.

Converting C strings to Pascal strings

When converting a C string into a Pascal string, why should the length of the original string be less or equal to 127 instead of 256? I understand that an unsigned int ranges from 0~256 and a signed one ranges from -128~127, but isn't the first character of a Pascal string unsigned?
The Pascal string you are referring to is probably the one used in older Pascals (called ShortString in e.g. Delphi and FreePascal, the most popular Pascal implementations these days). That can contain 255 single-byte characters (char in C). There is no need to restrict this to 127 characters.
Perhaps you were thinking of the fact that 255 bytes can only contain 127 UTF-16 code points. But these strings were popular in the old CP/M and DOS days, when no one knew anything about Unicode yet, and were made to contain ASCII or "Extended ASCII" (8 bit, using code pages).
But most modern Pascal implementations allow you to use strings up to 2 GB in size. There, the length indicator is not stored as the first element anymore, just close to the text data. And these days, most of these strings can contain Unicode too, either as UTF-16 or as UTF-8, depending on the string type you choose (modern Pascal implementations have several different string types for different purposes, so there is not one single "Pascal string type" anymore).
Some languages do have the ability to restrict the size of a ShortString, as so called "counted" strings:
var
s: string[18];
That string has a maximum of 18 bytes text data and 1 byte length data (at index 0). Such shorter strings can be used in, say, records, so they don't grow too big.
FreePascal's wiki has a great page showing all the types of strings that Pascal (at least that implementation) supports: http://wiki.freepascal.org/Character_and_string_types - it includes length-prefixed and null-terminated string types. None of the types on that page have a length restriction of 127.
The string type you're referring to would match ShortString which has a single byte prefix, however their documentation states it accepts 0-255.
I am aware of a string-type that has a variable-length-integer prefix, which would restrict the length of the string to 127 characters if you want the in-memory representation to be binary-compatible with ShortString, as being 128 characters or longer would set the MSB bit to 1 which in variable-length-integers means the integer is at least 2 bytes long instead of 1 byte.

How can I print a string with the same length with or without multicharacters?

I am trying to do exercise 1-22 in K&R book. It asks to fold long lines (i.e.going into a new line) after a predefined number of characters in string.
As I was testing the program and it worked well, but I saw that some lines were "folding" earlier than they should. I noticed that it was the lines on which special characters appeared, such as:
ö ş ç ğ
So, my question is, how do I ensure that lines are printed with the same maximum length with or without multicharacters?
What happens in your code ?
The K&R was written in a time where all characters were encoded on one single char. Example of such encoding standards are ASCII or ISO 8859.
Nowadays the leading encoding standard is UNICODE, which comes in several flavors. The UTF-8 encoding is used to represent the thousands of unicode characters on 8 bit bytes, using a variable length scheme:
the ascii characters (i.e. 0x00 to 0x7F) are encoded on a single byte.
all other characters are encoded on 2 to 4 bytes.
So the letter ö and the others in your list are encoded as 2 consecutive bytes. Unfortunately, the standard C library and the algorithms of K&R do not manage variable encoding. So each of your special char is counted as two so that your algorithm is tricked.
How to solve it ?
There is no easy way. You must make a distinction between the length of the strings in memory, and the length of the strings when they are displayed.
I can propose you a trick that uses the properties of the encoding scheme: whenever you count the display length of a string, just ignore the characters c in memory that comply with the condition c&0xC0==0x80.
Another way would be to use wide chars wchar_t/win_t (requires header wchar.h) instead of char/int and use getwc()/putwc() instead of getc()/putc(). If on your environment sizeof(wchar_t) is 4 then you will be able to work with unicode just using the wide characters and wide library functions instead of the normal ones mentioned in K&R. If however
sizeof(wchar_t) is smaller (for example 2), you could work correctly with a larger subset of unicode but still could encounter alignement issues in some cases.
As in the comment, your string is probably encoded in UTF-8. That means that some characters, including the ones you mention, use more than one byte. If you simply count bytes to determine the width of your output, your computed value may be too large.
To properly determine the number of characters in a string with multibyte characters, use a function such as mbrlen(3).
You can use mbrtowc(3) to find out the number of bytes of the first character in a string, if you're counting character for character.
This of course goes way beyond the scope of the K&R book. It was written before multibyte characters were used.

Are there any unicode/wide chars that encode to multiple encoded characters

Consider wctomb(), which takes a wide character and encodes to the currently selected character set. The glibc man page states that the output buffer should be MB_CUR_MAX, while the FreeBSD man page states the output buffer size should be MB_LEN_MAX. Which is correct here?
Are there any example wide char/encoding combinations where it takes multiple encoded characters to represent the wide char?
On a more general note, does MB_CUR_MAX refer to the max combined encoded char byte count to represent a wide char, or is it just representing the max byte count for any particular encoded char?
MB_CUR_MAX is correct, but both are big enough. You might want to use MB_LEN_MAX if you want to avoid variable-length array declarations.
MB_CUR_MAX is the maximum number of bytes in a multibyte character in the current locale. MB_LEN_MAX is the maximum number of bytes in a character for any supported locale. Unlike MB_CUR_MAX, MB_LEN_MAX is a macro so it can be used in an array declaration without creating a VLA.
Both constants refer to a single wide character. There is no simple definition of what a multibyte character is exactly, since multibyte encodings can include shift sequences; if the multibyte locale includes shift sequences, the number of bytes required for a particular call to wctomb with a particular wide character might vary from call to call depending on the shift state. (Also, the actual code might be different in different shift states.)
As far as I know, there is nothing which prevents a wide character from being translated to a multibyte sequence which might be decomposable into other multibyte sequences (as with Unicode composition); the definition of wctomb talks only about "representation". But I don't know of an implementation which does that, either; Unicode canonical decomposition must be done with separate APIs.
So it is possible that no installed locale requires a value as large as MB_LEN_MAX. But there is nothing stopping you from adding locales -- or even creating your own -- provided that they don't exceed the encoding limit (16 bytes on Linux).

LZW Compression with Entire unicode library

I am trying to do this problem:
Assume we have an initial alphabet of the entire Unicode character set,
instead of just all the possible byte values. Recall that unicode
characters are unsigned 2-byte values, so this means that each
2 bytes of uncompressed data will be treated as one symbol, and
we'll have an alphabet with over 60,000 symbols. (Treating symbols as
2-byte Unicodes, rather than a byte at a time, makes for better
compression in the case of internationalized text.) And, note, there's
nothing that limits the number of bits per code to at most 16. As you
generalize the LZW algorithm for this very large alphabet, don't worry
if you have some pretty long codes.
With this, give the compressed version of this four-symbol sequence,
using our project assumptions, including an EOD code, and grouping
into 4-byte ints. (These three symbols are Unicode values,
represented numerically.) Write your answer as 3 8-digit hex values,
space separated, using capital hex digits, not lowercase.
32767 32768 32767 32768
The problem I am having is that I don't know the entire range of the alphabet, so when doing LZW compression I don't know what byte value the new codes will have. Stemming from that problem I also don't know the the EOD code will be.
Also, it seems to me that it will only take two integers the compressed data.
The problem statement is ill-formed.
In Unicode, as we know it today, code points (those numbers that represent characters, composable parts of characters and other useful but more sneaky things) cannot be all numbered from 0 to 65535 to fit into 16 bits. There are more than 100 thousand of Chinese, Japanese and Korean characters in Unicode. Clearly, you'd need 17+ bits just for those. So, Unicode clearly cannot be the correct option here.
OTOH, there exist a sort of "abridged" version of Unicode, Universal Character Set, whose UCS-2 encoding uses 16-bit code points and can technically be used for at most 65536 characters and the like. Those characters with codes greater than 65535 are, well, unlucky, you can't have them with UCS-2.
So, if it's really UCS-2, you can download its specification (ISO/IEC 10646, I believe) and figure out exactly which codes out of those 64K are used and thus should form your initial LZW alphabet.

Resources