Iterating backwards through a multibyte string in C

I know I can iterate forwards through a multibyte string in C using mbrtowc(). But what if I wanted to iterate backwards; in other words, how do I find the previous valid multibyte character? I tried the following method, and it at least partially works on my Ubuntu system using the default en_US.UTF-8 locale:
char *str = "\xc2\xa2\xc2\xa1xyzwxfd\xc2\xa9", *tmp = NULL;
wchar_t wc = 0;
size_t ret = 0, width = 1;
mbstate_t state = {0};
//Iterate through 2 characters using mbrtowc()
tmp = str;
tmp += mbrtowc(&wc, tmp, MB_CUR_MAX, &state);
tmp += mbrtowc(&wc, tmp, MB_CUR_MAX, &state);
//This is a simplified version of my code. I didn't test this
//exact code, but this general idea did work.
for (tmp--; (ret = mbrtowc(&wc, tmp, width, &state)) == (size_t)(-1) || ret == (size_t)(-2); width++, tmp--)
    if (width == MB_CUR_MAX) printf("error\n");
printf("last multibyte character %lc\n", wc);
The idea is simple: just iterate backwards one byte at a time until mbrtowc() reports a valid multibyte character. My question is: can I rely on this to work for any possible multibyte locale, or only for encodings with special properties? More specifically, is mbstate_t being used incorrectly here; could the change in direction affect the validity of mbstate_t? And can I guarantee that ret will be either (size_t)(-1) or (size_t)(-2)? I currently assume ret could be either, depending on whether the sequence is considered incomplete or invalid.

If you need to deal with any theoretically-possible multibyte encoding, then it is not possible to iterate backwards. There is no requirement that a multibyte encoding have the property that no proper suffix of a valid multibyte sequence is a valid multibyte sequence. (As it happens, your algorithm requires an even stronger property, because you might recognize a multibyte sequence starting in the middle of one valid sequence and continuing into the next sequence.)
Also, you cannot predict (again, in general) the multibyte state if the multibyte encoding has shift states. If you back-up over a multibyte sequence which changes the state, you have no idea what the previous state was.
UTF-8 was designed with this in mind. It does not have shift states, and it clearly marks the octets (bytes) which can start a sequence. So if you know that the multibyte encoding is UTF-8, you can easily iterate backwards: just scan backwards for a byte not in the range 0x80-0xBF. (UTF-16 and UTF-32 are also easily iterated in either direction, but you need to read them as two-/four-byte code units, respectively, because a misaligned read is quite likely to produce something that still looks like a correct codepoint.)
If you don't know that the multibyte encoding is UTF-8, then there is simply no robust algorithm to iterate backwards. All you can do is iterate forwards and remember the starting position and mbstate of each character.
Fortunately, these days there is really little reason to support multibyte encodings other than Unicode encodings.

For UTF-8 you can take advantage of a property of the encoding: the continuation bytes of a multibyte character (and only those bytes) start with the bits 10xxxxxx.
So if you are walking backwards and a byte c satisfies (c & 0xC0) == 0x80, you can skip it.
For other multibyte encodings you don't necessarily have such a simple solution, as the lead and continuation bytes can fall in overlapping ranges.

Related

Where is the C code that encodes the bytes in memory to the specific charset on Linux?

From the Linux documentation:
LC_CTYPE
This category determines the interpretation of byte sequences as characters (e.g., single versus multibyte characters), character classifications (e.g., alphabetic or digit), and the behavior of character classes. On glibc systems, this category also determines the character transliteration rules for iconv(1) and iconv(3). It changes the behavior of the character handling and classification functions, such as isupper(3) and toupper(3), and the multibyte character functions such as mblen(3) or wctomb(3).
However, looking at glibc's source code for putwchar:
/* _IO_putwc_unlocked */
# define _IO_putwc_unlocked(_wch, _fp) \
  (__glibc_unlikely ((_fp)->_wide_data == NULL \
                     || ((_fp)->_wide_data->_IO_write_ptr \
                         >= (_fp)->_wide_data->_IO_write_end)) \
   ? __woverflow (_fp, _wch) \
   : (wint_t) (*(_fp)->_wide_data->_IO_write_ptr++ = (_wch)))
/* putwchar */
wint_t
putwchar (wchar_t wc)
{
  wint_t result;
  _IO_acquire_lock (stdout);
  result = _IO_putwc_unlocked (wc, stdout);
  _IO_release_lock (stdout);
  return result;
}
There is no code using the locale set with setlocale(), which confuses me. When and where are the bytes stored in memory converted to the specific charset set by setlocale()?
Update:
#include <wchar.h>

int main() {
    wchar_t wc = L'\x00010437';
    putwchar(wc); // prints nothing
}

#include <locale.h>
#include <wchar.h>

int main() {
    wchar_t wc = L'\x00010437';
    setlocale(LC_CTYPE, "");
    putwchar(wc); // prints '𐐷'
}
In the two cases above, setlocale() affects the character displayed on the screen. I want to know at what point the bytes are determined to represent a specific character like '𐐷'.
Update2:
Maybe I have found the source code that converts the wide-character data into the specific charset. Here is a code snippet from _IO_wdo_write() in glibc/libio/wfileops.c:
/* Now convert from the internal format into the external buffer. */
result = (*cc->__codecvt_do_out) (cc, &fp->_wide_data->_IO_state,
                                  data, data + to_do, &new_data,
                                  write_ptr,
                                  buf_end,
                                  &write_ptr);
Expanding on my comment:
Where is the C code that encodes the bytes in memory to the specific charset on Linux?
To the best of my knowledge, there isn't any. A charset, a.k.a. character encoding, is a mapping from sequences of characters -- in a rather abstract sense of that term -- to sequences of bytes. If you are looking at bytes in memory that represent character data then, perforce, you are looking at an already-encoded representation. For a C program, they will normally be encoded according to the execution character set of the C implementation.
In particular, to the extent that C "character" and "wide character" types actually represent characters, they contain encoded character data. There is normally no conversion needed or performed when such data are read or written, which is why you don't see it in the glibc source.
It is of course possible for a program to encode characters in some other encoding and store the resulting bytes in memory, via iconv(3), for example. It is then the program's responsibility to ensure that they are handled appropriately. As for mapping encoded byte sequences to a visual representation -- "glyphs" -- this is a function performed by the program that displays or prints them. One way that is done is simply by selection of a font with appropriate mappings from byte sequences to glyphs.

Can I store UTF-8 in a C-style char array?

Follow up
Can UTF-8 contain a zero byte?
Can I safely store a UTF-8 string in a zero-terminated char *?
I understand strlen() will not return correct information, but storing, printing and transferring the char array seem to be safe.
Yes.
Just like with ASCII and similar 8-bit encodings before Unicode, you can't store the NUL character in such a string (the value U+0000 is the Unicode code point NUL, very much like in ASCII).
As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
In C a 0 byte is the string terminator. As long as the code point U+0000 does not occur in the Unicode string, there is no problem.
To be able to store 0 bytes, one may use "modified UTF-8", which converts not only code points >= 128, but also 0, to a multi-byte sequence (every byte thereof having its high bit set, >= 128). This is done in Java for some APIs, like DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.
Formally it is no longer UTF-8, as UTF-8 requires the shortest encoding. Also, the length must then be tracked explicitly instead of with strlen() when unpacking to non-UTF-8 data.
So the most feasible solution is not to accept U+0000 in strings.

Trouble comparing UTF-8 characters using wchar.h

I am in the process of making a small program that reads a file containing UTF-8 text, char by char. After reading a char it compares it with a few other characters, and if there is a match it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;
fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");
while (!feof(fpi)) {
    fread(&sample, sizeof(wchar_t*), 1, fpi);
    if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0)) {
        fwrite(L"_", sizeof(wchar_t*), 1, fpo);
    } else {
        fwrite(&sample, sizeof(wchar_t*), 1, fpo);
    }
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words γειά σου κόσμε. I would want it to return this:
γει_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome.
I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.
C has two different kinds of characters: multibyte characters and wide characters.
Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.
Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code points, so it must be at least 21 bits (in practice, 4 bytes) wide.
So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).
Unfortunately, there is more to Unicode than this.
ά can be represented as a single Unicode codepoint, or as two separate codepoints:
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS: 1 codepoint, 1 multibyte character, 2 bytes (0xCE 0xAC) = 2 chars.
U+03B1 GREEK SMALL LETTER ALPHA + U+0301 COMBINING ACUTE ACCENT: 2 codepoints, 2 multibyte characters, 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 chars.
U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA: 1 codepoint, 1 multibyte character, 3 bytes (0xE1 0xBD 0xB1) = 3 chars.
All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).
First of all, please do take the time to read this great article, which explains UTF8 vs Unicode and lots of other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html
What you are trying to do in your code is read in Unicode character by character, and do comparisons with those. That won't work if the input stream is UTF-8, and it's not really possible with quite this structure.
In short: full Unicode strings can be encoded in several ways. One of them is using a series of equally-sized "wide" chars, one for each character. That is what the wchar_t type (sometimes WCHAR) is for. Another way is UTF8, which uses a variable number of raw bytes to encode each character, depending on the value of the character.
UTF8 is just a stream of bytes, which can encode a unicode string, and is commonly used in files. It is not the same as a string of WCHARs, which are the more common in-memory representation. You can't poke through a UTF8 stream reliably, and do character replacements within it directly. You'll need to read the whole thing in and decode it, and then loop through the WCHARs that result to do your comparisons and replacement, and then map that result back to UTF8 to write to the output file.
On Win32, use MultiByteToWideChar to do the decoding, and you can use the corresponding WideCharToMultiByte to go back.
When you use a "string literal" with regular quotes, you're creating a nul-terminated ASCII string (char*), which does not support Unicode. The L"string literal" with the L prefix will create a nul-terminated string of WCHARs (wchar_t *), which you can use in string or character comparisons. The L prefix also works with single-quote character literals, like so: L'ε'
As a commenter noted, when you use fread/fwrite, you should be using sizeof(wchar_t) and not its pointer type, since the amount you are trying to read/write is an actual wchar, not the size of a pointer to one. This advice is just code feedback independent of the above; you don't want to be reading the input character by character anyway.
Note too that when you do string comparisons (wcscmp), you should use actual wide strings (which are terminated with a nul wide char), not single characters in memory as input. If (when) you want to do character-to-character comparisons, you don't even need to use the string functions. Since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {}.

What is the purpose of the s==NULL case for mbrtowc?

mbrtowc is specified to handle a NULL pointer for the s (multibyte character pointer) argument as follows:
If s is a null pointer, the mbrtowc() function shall be equivalent to the call:
mbrtowc(NULL, "", 1, ps)
In this case, the values of the arguments pwc and n are ignored.
As far as I can tell, this usage is largely useless. If ps is not storing any partially-converted character, the call will simply return 0 with no side effects. If ps is storing a partially-converted character, then since '\0' is not valid as the next byte in a multibyte sequence ('\0' can only be a string terminator), the call will return (size_t)-1 with errno set to EILSEQ, and leave ps in an undefined state.
The intended usage seems to have been to reset the state variable, particularly when NULL is passed for ps and the internal state has been used, analogous to mbtowc's behavior with stateful encodings, but this is not specified anywhere as far as I can tell, and it conflicts with the semantics for mbrtowc's storage of partially-converted characters (if mbrtowc were to reset state when encountering a 0 byte after a potentially-valid initial subsequence, it would be unable to detect this dangerous invalid sequence).
If mbrtowc were specified to reset the state variable only when s is NULL, but not when it points to a 0 byte, a desirable state-reset behavior would be possible, but such behavior would violate the standard as written. Is this a defect in the standard? As far as I can tell, there is absolutely no way to reset the internal state (used when ps is NULL) once an illegal sequence has been encountered, and thus no correct program can use mbrtowc with ps==NULL.
Since a '\0' byte must convert to a null wide character regardless of shift state (5.2.1.2 Multibyte characters), and the mbrtowc() function is specified to reset the shift state when it converts to a wide null character (7.24.6.3.2/3 The mbrtowc function), calling mbrtowc( NULL, "", 1, ps) will reset the shift state stored in the mbstate_t pointed to by ps. And if mbrtowc( NULL, "", 1, NULL) is called to use the library's internal mbstate_t object, it will be reset to an initial state. See the end of the answer for cites of the relevant bits of the standard.
I'm by no means particularly experienced with the C standard multibyte conversion functions (my experience with this kind of thing has been using the Win32 APIs for conversion).
If mbrtowc() processes an 'incomplete char' that's cut short by a 0 byte, it should return (size_t)(-1) to indicate an invalid multibyte char (and thus detect the dangerous situation you describe). In that case the conversion/shift state is unspecified (and I think you're basically hosed for that string). The multibyte 'sequence' that a conversion was attempted on but contains a '\0' is invalid and never will be valid with subsequent data. If the '\0' wasn't intended to be part of the converted sequence, then it shouldn't have been included in the count of bytes available for processing.
If you're in a situation where you might get additional, subsequent bytes for a partial multibyte char (say from a network stream), the n you passed for the partial multibyte char shouldn't include a 0 byte, so you'll get a (size_t)(-2) returned. In this case, if you pass a '\0' while in the middle of the partial conversion, you'll lose the fact that there's an error and, as a side effect, reset the mbstate_t state in use (whether it's your own or the internal one being used because you passed in a NULL pointer for ps). I think I'm essentially restating your question here.
However I think it is possible to detect and handle this situation, but unfortunately it requires keeping track of some state yourself:
#define MB_ERROR ((size_t)(-1))
#define MB_PARTIAL ((size_t)(-2))

// function to get a stream of multibyte characters from somewhere
int get_next(void);

int bar(void)
{
    int c;                          // int, so EOF can be distinguished
    char byte;
    wchar_t wc;
    mbstate_t state = {0};
    int in_partial_convert = 0;

    while ((c = get_next()) != EOF)
    {
        byte = (char)c;
        size_t result = mbrtowc(&wc, &byte, 1, &state);
        switch (result) {
        case MB_ERROR:
            // this multibyte char is invalid
            return -1;
        case MB_PARTIAL:
            // do nothing yet, we need more data,
            // but remember that we're in this state
            in_partial_convert = 1;
            break;
        case 1:
            // output the completed wide char
            in_partial_convert = 0; // no longer mid-conversion
            putwchar(wc);
            break;
        case 0:
            if (in_partial_convert) {
                // this 'last' multibyte char was malformed;
                // return an error condition
                return -1;
            }
            // end of the multibyte string;
            // we'll handle it similarly to EOF
            return 0;
        }
    }
    return 0;
}
Maybe not an ideal situation, but I think it shows it's not completely broken so as to be impossible to use.
Standards citations:
5.2.1.2 Multibyte characters
A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.
A byte with all bits zero shall be interpreted as a null character independent of shift state.
A byte with all bits zero shall not occur in the second or subsequent bytes of a multibyte character.
7.24.6.3.2/3 The mbrtowc function
If the corresponding wide character is the null wide character, the resulting state described is the initial conversion state.
In 5.2.1.2, Multibyte characters, the C Standard states:
A byte with all bits zero shall be interpreted as a null character independent of shift
state. Such a byte shall not occur as part of any other multibyte character.
The Standard seems to differentiate between shift state and conversion state, as, for example, 7.24.6 mentions:
The conversion state described by the pointed-to object is altered as needed to track the shift state, and the position within a multibyte character, for the associated multibyte character sequence.
(emphasis added by me). However, I think that the intent is to interpret a byte with all zero bits as the null character regardless of the mbstate_t value, which encodes the entire conversion state, particularly as "Such a byte shall not occur as part of any other multibyte character" implies that the null byte cannot occur within a multibyte character. If a null byte does occur in errant input where the second, third, etc. byte of a multibyte character should be, then I interpret the Standard as saying that the partial multibyte character at the EOF is silently ignored.
My reading of 7.24.6.3.2, The mbrtowc function, for the case when s is NULL is thus: the next 1 byte completes the null wide character, the return value of mbrtowc is 0, and the resulting state is the initial conversion state because:
If the corresponding wide character is the null wide character, the resulting state described is the initial conversion state.
By passing NULL for both s and ps, the internal mbstate_t of mbrtowc is reset to the initial state.

UTF8 support on cross platform C application

I am developing a cross platform C (C89 standard) application which has to deal with UTF8 text. All I need is basic string manipulation functions like substr, first, last etc.
Question 1
Is there a UTF8 library that has the above functions implemented? I have already looked into ICU and it is too big for my requirement. I just need to support UTF8.
I have found a UTF8 decoder here. Following function prototypes are from that code.
void utf8_decode_init(char p[], int length);
int utf8_decode_next();
The initialization function takes a character array, but utf8_decode_next() returns int. Why is that? How can I print the characters this function returns using standard functions like printf? The function is dealing with character data, so how can that be assigned to an integer?
If the above decoder is not good for production code, do you have a better recommendation?
Question 2
I also got confused by reading articles that says, for unicode you need to use wchar_t. From my understanding this is not required as normal C strings can hold UTF8 values. I have verified this by looking at source code of SQLite and git. SQLite has the following typedef.
typedef unsigned char u8;
Is my understanding correct? Also why is unsigned char required?
The utf8_decode_next() function returns the next Unicode code point. Since Unicode is a 21-bit character set, it cannot return anything smaller than an int, and it can be argued that technically it should be a long, since an int could be a 16-bit quantity. Effectively, the function returns you a UTF-32 character.
You would need to look at the C94 wide character extensions to C89 to print wide characters (wprintf(), <wctype.h>, <wchar.h>). However, wide characters alone are not guaranteed to be UTF-8 or even Unicode. You most probably cannot print the characters from utf8_decode_next() portably, but it depends on what your portability requirements are. The wider the range of systems you must port to, the less chance there is of it all working simply. To the extent you can write UTF-8 portably, you would send the UTF-8 string (not an array of the UTF-32 characters obtained from utf8_decode_next()) to one of the regular printing functions. One of the strengths of UTF-8 is that it can be manipulated by code that is largely ignorant of it.
You need to understand that a 4-byte wchar_t can hold any Unicode codepoint in a single unit, but that UTF-8 can require between one and four 8-bit bytes (1-4 units of storage) to hold a single Unicode codepoint. On some systems, I believe wchar_t can be a 16-bit (short) integer. In this case, you are forced into using UTF-16, which encodes Unicode codepoints outside the Basic Multilingual Plane (BMP, code points U+0000 .. U+FFFF) using two storage units and surrogates.
Using unsigned char makes life easier; plain char is often signed. Having negative numbers makes life more difficult than it need be (and, believe me, it is difficult enough without adding complexity).
You do not need any special library routines for character or substring search with UTF-8. strstr does everything you need. That's the whole point of UTF-8 and the design requirements it was invented to meet.
GLib has quite a few relevant functions, and can be used independent of GTK+.
There are over 100,000 characters in Unicode. There are 256 possible values of char in most C implementations.
Hence, UTF-8 uses more than one char to encode each character, and the decoder needs a return type which is larger than char.
wchar_t is a larger type than char (well, it doesn't have to be larger, but it usually is). It represents the characters of the implementation-defined wide character set. On some implementations (most importantly, Windows, which uses surrogate pairs for characters outside the "basic multilingual plane"), it still isn't big enough to represent any Unicode character, which presumably is why the decoder you reference uses int.
You can't print wide characters using printf, because it deals in char. wprintf deals in wchar_t, so if the wide character set is unicode, and if wchar_t is int on your system (as it is on linux), then wprintf and friends will print the decoder output without further processing. Otherwise it won't.
In any case, you cannot portably print arbitrary unicode characters, because there's no guarantee that the terminal can display them, or even that the wide character set is in any way related to Unicode.
SQLite has probably used unsigned char so that:
they know the signedness - it's implementation-defined whether char is signed or not.
they can do right-shifts and assign out-of-range values, and get consistent and defined results across all C implementations. Implementations have more freedom in how signed char behaves than unsigned char.
Normal C strings are fine for storing utf8 data, but you can't easily search for a substring in your utf8 string. This is because a character encoded as a sequence of bytes using the utf8 encoding could be anywhere from one to 4 bytes depending on the character. i.e. a "character" is not equivalent to a "byte" for utf8 like it is for ASCII.
In order to do substring searches etc. you will need to decode it to some internal format that is used to represent Unicode characters and then do the substring search on that. Since there are far more than 256 Unicode characters, a byte (or char) is not enough. That's why the library you found uses ints.
As for your second question, it's probably just because it does not make sense to talk about negative characters, so they may as well be specified as "unsigned".
I have implemented a substr & length functions which supports UTF8 characters. This code is a modified version of what SQLite uses.
The following macro advances through the input text, skipping over the continuation bytes of a multi-byte sequence. The if condition checks whether the current byte starts a multi-byte sequence, and the inner loop increments input until it finds the next head byte.
#define SKIP_MULTI_BYTE_SEQUENCE(input) {               \
    if( (*(input++)) >= 0xc0 ) {                        \
        while( (*input & 0xc0) == 0x80 ){ input++; }    \
    }                                                   \
}
substr and length are implemented using this macro.
typedef unsigned char utf8;
substr
void substr(const utf8 *string,
            int start,
            int len,
            utf8 **substring)
{
    int bytes, i;
    const utf8 *str2;
    utf8 *output;

    --start;
    while( *string && start ) {
        SKIP_MULTI_BYTE_SEQUENCE(string);
        --start;
    }
    for(str2 = string; *str2 && len; len--) {
        SKIP_MULTI_BYTE_SEQUENCE(str2);
    }
    bytes = (int) (str2 - string);
    output = *substring;
    for(i = 0; i < bytes; i++) {
        *output++ = *string++;
    }
    *output = '\0';
}
length
int length(const utf8 *string)
{
    int len;

    len = 0;
    while( *string ) {
        ++len;
        SKIP_MULTI_BYTE_SEQUENCE(string);
    }
    return len;
}
