I'm parsing an XML file which can contain localized strings in different languages (at the moment it's just English and Spanish, but in the future it could be any language). The API for the XML parser returns all data within the XML via a char* which is UTF-8 encoded.
Some manipulation of the data is required after it's been parsed (searching within it for substrings, concatenating strings, determining the length of substrings, etc.).
It would be convenient to use standard functions such as strlen, strcat, etc. Since the raw data I'm receiving from the XML parser is a char*, I can readily do all of this manipulation using these standard string handling functions.
However, these all of course assume and require that the strings are null-terminated.
My question therefore is: if you have wide data represented as a char*, can a null terminator character occur within the data rather than at the end?
i.e. if a character in a certain language doesn't require 2 bytes to represent it, and it is represented in one byte, will/can the other byte be NULL?
UTF-8 is not "wide". UTF-8 is a multibyte encoding, where a Unicode character can take 1 to 4 bytes. UTF-8 won't have zero bytes inside a valid character. Make sure you are not confused about what your parser is giving you. It could be UTF-16 or UCS-2 (or their 4-byte equivalents) placed in wide character strings, in which case you have to treat them as wide strings.
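For example, here is a minimal sketch (the literal bytes are just an illustration) showing that the byte-oriented str* functions keep working on UTF-8 data, because no encoded character contains a zero byte:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* UTF-8 data as it might come from a parser: "año".
       The 'ñ' is two bytes (0xC3 0xB1), but neither byte is zero,
       so the usual NUL terminator still marks the end of the string. */
    const char *s = "a\xC3\xB1o";

    printf("bytes: %zu\n", strlen(s));                            /* 4 bytes, 3 characters */
    printf("contains n-tilde: %s\n", strstr(s, "\xC3\xB1") ? "yes" : "no");
    return 0;
}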
C distinguishes between multibyte characters and wide characters:
Wide characters must be able to represent any character of the execution character set using exactly the same number of bytes (e.g. if 兀 takes 4 bytes to be represented, A must also take 4 bytes to be represented). Examples of wide character encodings are UCS-4, and the deprecated UCS-2.
Multibyte characters can take a varying number of bytes to be represented. Examples of multibyte encodings are UTF-8 and UTF-16.
When using UTF-8, you can continue to use the str* functions, but you have to bear in mind that they don't provide a way to return the length in characters of a string; for that you need to convert to wide characters and use wcslen. strlen returns the length in bytes, not characters, which is useful in other situations.
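A rough sketch of that difference, assuming the active locale is a UTF-8 one (otherwise the conversion and the counts may differ):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");                       /* assumes the user's locale is UTF-8 */

    const char *utf8 = "espa\xC3\xB1ol";         /* "español" in UTF-8 */
    printf("strlen (bytes): %zu\n", strlen(utf8));            /* 8 */

    wchar_t wide[32];
    size_t n = mbstowcs(wide, utf8, 32);         /* convert to wide characters */
    if (n != (size_t)-1)
        printf("wcslen (characters): %zu\n", wcslen(wide));   /* 7 */
    return 0;
}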
I can't stress enough that all elements of the execution character set need to be represented as a single wide character of a predefined size in bytes. Some systems use UTF-16 for their wide characters; the result is that the implementation can't conform to the C standard, and some wc* functions can't possibly work right.
Related
I'm using libxml/xmlwriter to generate an XML file within a program.
const char *s = someCharactersFromSomewhere();
xmlTextWriterWriteAttribute (writer, _xml ("value"), _xml (s));
In general I don't have much control over the contents of s, so I can't guarantee that it will be well-formed UTF-8. Mostly it is, but if not, the XML which is generated will be malformed.
What I'd like to find is a way to convert s to valid UTF-8, with any invalid character sequences in s replaced with escapes or removed.
Alternatively, if there is an alternative to xmlTextWriterWriteAttribute, or some option I can pass in when initializing the XML writer, such that it guarantees that it will always write valid UTF-8, that would be even better.
One more thing to mention is that the solution must work with both Linux and OSX. Ideally writing as little of my own code as possible! :P
If the string is encoded in ASCII, then it will always be a valid UTF-8 string.
This is because UTF-8 is backwards compatible with ASCII encoding.
See the second paragraph on Wikipedia here.
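If you only need to verify that case, a minimal check (the function name is mine) is to confirm that every byte is below 0x80:

#include <stdbool.h>

/* Returns true if s contains only 7-bit ASCII bytes,
   in which case it is also a valid UTF-8 string. */
static bool is_ascii(const char *s)
{
    for (const unsigned char *p = (const unsigned char *)s; *p; ++p)
        if (*p > 0x7F)
            return false;
    return true;
}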
Windows primarily works with UTF-16, which means you will have to convert from UTF-16 to UTF-8 before you pass the string to the XML library.
If you have 8-bit ASCII input then you can simply junk any character code > 127.
If you have some dodgy UTF-8 it is quite easy to parse, but the wide-char symbol number that you generate might be out of the Unicode range. You can use mbrlen() to individually validate each character.
I am describing this using unsigned chars. If you must use signed chars, then >= 128 means < 0.
At its simplest, loop until the null byte:
1. If the next byte is 0, end the loop.
2. If the next byte is < 128, it is ASCII, so keep it.
3. If the next byte is >= 128 and < 128+64, it is a stray continuation byte, which is invalid on its own - discard it.
4. If the next byte is >= 128+64, it is probably a proper UTF-8 lead byte: call size_t mbrlen(const char *s, size_t n, mbstate_t *ps); to see how many bytes to keep. If mbrlen says the sequence is bad (either the lead byte or the trail bytes), skip 1 byte; rule 3 will skip the rest.
Even simpler logic just calls mbrlen repeatedly, as it can accept the low ASCII range.
You can assume that all the "furniture" of the file (e.g. the XML <, >, / symbols, spaces, quotes and newlines) won't be altered by this edit, as they are all valid 7-bit ASCII codes.
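A sketch of that filtering loop built on mbrlen(), assuming a UTF-8 locale has been selected with setlocale (the function name sanitize_utf8 is mine); it keeps ASCII and well-formed multibyte sequences and drops everything else:

#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Copy only valid multibyte sequences from src into dst (dst must have room
   for strlen(src) + 1 bytes). Invalid bytes are skipped, as in the rules above. */
static void sanitize_utf8(char *dst, const char *src)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    const char *end = src + strlen(src);

    while (*src) {
        size_t n = mbrlen(src, (size_t)(end - src), &st);
        if (n == (size_t)-1 || n == (size_t)-2) {
            /* invalid or truncated sequence: drop one byte and resynchronise */
            memset(&st, 0, sizeof st);
            ++src;
        } else {
            if (n == 0)                  /* can't happen here: the loop stops at NUL */
                break;
            memcpy(dst, src, n);
            dst += n;
            src += n;
        }
    }
    *dst = '\0';
}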
char is a single-byte character, while Unicode code points range from 0 to 0x10FFFF, so how do you represent a Unicode character in only one byte?
First of all you need a wchar_t character. Those are used with the wprintf(3) versions of the normal printf(3) routines. If you dig a little into this, you'll see that mapping your Unicode code points into a valid UTF-8 encoding is straightforward, based on your setlocale(3) settings. Look at those manual pages referenced, and you'll get an idea of the task you are facing.
There's full support for wide character sets in the C standard... but you have to use it through the internationalization libraries and the locales available.
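A minimal sketch of that route, assuming the user's locale is a UTF-8 one:

#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* Pick up the user's locale; with a UTF-8 locale, mbstowcs() maps the
       multibyte input bytes to wchar_t code points. */
    setlocale(LC_ALL, "");

    const char *utf8 = "\xE5\x85\x80";           /* U+5140 (兀) encoded in UTF-8 */
    wchar_t wide[8];

    if (mbstowcs(wide, utf8, 8) != (size_t)-1)
        wprintf(L"code point: U+%04lX\n", (unsigned long)wide[0]);
    return 0;
}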
When converting a C string into a Pascal string, why should the length of the original string be less than or equal to 127 instead of 255? I understand that an unsigned byte ranges from 0~255 and a signed one ranges from -128~127, but isn't the first character of a Pascal string unsigned?
The Pascal string you are referring to is probably the one used in older Pascals (called ShortString in e.g. Delphi and FreePascal, the most popular Pascal implementations these days). That can contain up to 255 single-byte characters (char in C). There is no need to restrict this to 127 characters.
Perhaps you were thinking of the fact that 255 bytes can only contain 127 UTF-16 code points. But these strings were popular in the old CP/M and DOS days, when no one knew anything about Unicode yet, and were made to contain ASCII or "Extended ASCII" (8 bit, using code pages).
But most modern Pascal implementations allow you to use strings up to 2 GB in size. There, the length indicator is not stored as the first element anymore, just close to the text data. And these days, most of these strings can contain Unicode too, either as UTF-16 or as UTF-8, depending on the string type you choose (modern Pascal implementations have several different string types for different purposes, so there is not one single "Pascal string type" anymore).
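For comparison, a rough C sketch of the classic length-prefixed layout (the type and function names are mine), converting from a C string and truncating at 255 bytes:

#include <string.h>

/* Classic "ShortString" layout: one length byte followed by the text. */
struct pascal_string {
    unsigned char len;
    char data[255];
};

static void c_to_pascal(struct pascal_string *dst, const char *src)
{
    size_t n = strlen(src);
    if (n > 255)
        n = 255;                     /* a single length byte can hold at most 255 */
    dst->len = (unsigned char)n;
    memcpy(dst->data, src, n);
}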
Some languages do have the ability to restrict the size of a ShortString, as so-called "counted" strings:
var
  s: string[18];
That string has a maximum of 18 bytes of text data and 1 byte of length data (at index 0). Such shorter strings can be used in, say, records, so they don't grow too big.
FreePascal's wiki has a great page showing all the types of strings that Pascal (at least that implementation) supports: http://wiki.freepascal.org/Character_and_string_types - it includes length-prefixed and null-terminated string types. None of the types on that page have a length restriction of 127.
The string type you're referring to would match ShortString, which has a single-byte length prefix; however, the documentation states it accepts 0-255 characters.
I am aware of a string type that has a variable-length-integer prefix, which would restrict the length of the string to 127 characters if you want the in-memory representation to be binary-compatible with ShortString: at 128 characters or longer, the MSB would be set to 1, which in variable-length integers means the integer is at least 2 bytes long instead of 1 byte.
I am trying to do exercise 1-22 in the K&R book. It asks to fold long lines (i.e. continuing onto a new line) after a predefined number of characters in a string.
I was testing the program and it worked well, but I saw that some lines were "folding" earlier than they should. I noticed that it was the lines on which special characters appeared, such as:
ö ş ç ğ
So, my question is, how do I ensure that lines are printed with the same maximum length with or without multibyte characters?
What happens in your code?
K&R was written at a time when all characters were encoded in one single char. Examples of such encoding standards are ASCII and ISO 8859.
Nowadays the leading encoding standard is Unicode, which comes in several flavors. The UTF-8 encoding is used to represent the thousands of Unicode characters in 8-bit bytes, using a variable-length scheme:
the ASCII characters (i.e. 0x00 to 0x7F) are encoded in a single byte.
all other characters are encoded in 2 to 4 bytes.
So the letter ö and the others in your list are encoded as 2 consecutive bytes. Unfortunately, the standard C library and the algorithms of K&R do not handle variable-length encodings, so each of your special characters is counted as two, which throws off your algorithm.
How to solve it?
There is no easy way. You must make a distinction between the length of the strings in memory, and the length of the strings when they are displayed.
I can propose a trick that uses the properties of the encoding scheme: whenever you count the display length of a string, just ignore the bytes c in memory that satisfy the condition (c & 0xC0) == 0x80 (these are the UTF-8 continuation bytes).
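A sketch of that trick (the function name is mine): count every byte except the UTF-8 continuation bytes, i.e. those of the form 10xxxxxx:

#include <stddef.h>

/* Display length of a UTF-8 string: bytes matching 10xxxxxx are
   continuation bytes and are not counted. */
static size_t utf8_display_length(const char *s)
{
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; ++p)
        if ((*p & 0xC0) != 0x80)
            ++count;
    return count;
}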
Another way would be to use wide chars wchar_t/wint_t (requires the header wchar.h) instead of char/int, and use getwc()/putwc() instead of getc()/putc(). If in your environment sizeof(wchar_t) is 4, then you will be able to work with Unicode just by using the wide characters and wide library functions instead of the normal ones mentioned in K&R. If, however, sizeof(wchar_t) is smaller (for example 2), you could work correctly with a large subset of Unicode but could still encounter alignment issues in some cases.
As in the comment, your string is probably encoded in UTF-8. That means that some characters, including the ones you mention, use more than one byte. If you simply count bytes to determine the width of your output, your computed value may be too large.
To properly determine the number of characters in a string with multibyte characters, use a function such as mbrlen(3).
You can use mbrtowc(3) to find out the number of bytes of the first character in a string, if you're counting character by character.
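For instance, a sketch that counts the characters of a multibyte string with mbrtowc(3), assuming a suitable locale has been selected with setlocale (the function name is mine):

#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Count characters (not bytes) in a multibyte string using mbrtowc(). */
static size_t mb_char_count(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t count = 0, len = strlen(s);

    while (len > 0) {
        size_t n = mbrtowc(NULL, s, len, &st);
        if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
            break;                   /* NUL, invalid or incomplete sequence: stop */
        s += n;
        len -= n;
        ++count;
    }
    return count;
}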
This of course goes way beyond the scope of the K&R book. It was written before multibyte characters were used.
I've got code that used MultiByteToWideChar like so:
wchar_t * bufferW = malloc(mbBufferLen * 2);
MultiByteToWideChar(CP_ACP, 0, mbBuffer, mbBufferLen, bufferW, mbBufferLen);
Note that the code does not use a previous call to MultiByteToWideChar to check how large the new Unicode buffer needs to be, and assumes twice the size of the multibyte buffer will be enough.
My question is whether this usage is safe. Could there be a default code page that maps a character into a 3-byte or larger Unicode character, and cause an overflow? While I'm aware the usage isn't exactly correct, I'd like to gauge the risk.
Could there be a default code page that maps a character into a 3-byte or larger [sequence of wchar_t UTF-16 code units]
There is currently no ANSI code page that maps a single byte to a character outside the BMP (i.e. one that would take more than one 2-byte code unit in UTF-16).
No single multi-byte ANSI character can ever be encoded as more than two 2-byte code units in UTF-16. So, at worst, you will never end up with a UTF-16 string that has more than 2x the length of the input ANSI string (not counting the null terminator, which does not apply in this case since you are passing explicit lengths), and at best you will end up with a UTF-16 string that has fewer wchar_t characters than the input string has char characters.
For what it's worth, Microsoft are endeavouring not to develop the ANSI code pages any further, and I suspect the NLS file format would need changes to allow it, so it's pretty unlikely that this will change in future. But there is no firm API promise that this will definitely always hold true.
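That said, the usual safe pattern is to ask MultiByteToWideChar for the required size first and allocate exactly that. A sketch (the helper name is mine, not the original code):

#include <windows.h>
#include <stdlib.h>

/* Two-call pattern: first query the required size, then convert. */
wchar_t *ansi_to_wide(const char *mbBuffer, int mbBufferLen)
{
    int needed = MultiByteToWideChar(CP_ACP, 0, mbBuffer, mbBufferLen, NULL, 0);
    if (needed == 0)
        return NULL;

    wchar_t *bufferW = malloc(needed * sizeof(wchar_t));
    if (bufferW == NULL)
        return NULL;

    MultiByteToWideChar(CP_ACP, 0, mbBuffer, mbBufferLen, bufferW, needed);
    return bufferW;   /* not null-terminated, since an explicit input length was passed */
}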
Consider wctomb(), which takes a wide character and encodes it to the currently selected character set. The glibc man page states that the output buffer should be MB_CUR_MAX bytes, while the FreeBSD man page states the output buffer size should be MB_LEN_MAX bytes. Which is correct here?
Are there any example wide char/encoding combinations where it takes multiple encoded characters to represent the wide char?
On a more general note, does MB_CUR_MAX refer to the maximum combined byte count needed to encode a wide char, or does it just represent the maximum byte count for any single encoded character?
MB_CUR_MAX is correct, but both are big enough. You might want to use MB_LEN_MAX if you want to avoid variable-length array declarations.
MB_CUR_MAX is the maximum number of bytes in a multibyte character in the current locale. MB_LEN_MAX is the maximum number of bytes in a character for any supported locale. Unlike MB_CUR_MAX, MB_LEN_MAX is a compile-time constant, so it can be used in an array declaration without creating a VLA.
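A small usage sketch, assuming the user's locale is a UTF-8 one (so 'ñ' needs two bytes):

#include <limits.h>   /* MB_LEN_MAX */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>   /* wctomb, MB_CUR_MAX */

int main(void)
{
    setlocale(LC_ALL, "");            /* pick up the user's locale */

    char buf[MB_LEN_MAX];             /* fixed size; a MB_CUR_MAX bound would make this a VLA */
    int n = wctomb(buf, L'\x00F1');   /* U+00F1, 'ñ' */

    if (n > 0)
        printf("encoded in %d byte(s), MB_CUR_MAX = %d\n", n, (int)MB_CUR_MAX);
    return 0;
}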
Both constants refer to a single wide character. There is no simple definition of what a multibyte character is exactly, since multibyte encodings can include shift sequences; if the multibyte locale includes shift sequences, the number of bytes required for a particular call to wctomb with a particular wide character might vary from call to call depending on the shift state. (Also, the actual code might be different in different shift states.)
As far as I know, there is nothing which prevents a wide character from being translated to a multibyte sequence which might be decomposable into other multibyte sequences (as with Unicode composition); the definition of wctomb talks only about "representation". But I don't know of an implementation which does that, either; Unicode canonical decomposition must be done with separate APIs.
So it is possible that no installed locale requires a value as large as MB_LEN_MAX. But there is nothing stopping you from adding locales -- or even creating your own -- provided that they don't exceed the encoding limit (16 bytes on Linux).