I am in the process of making a small program that reads a UTF-8 encoded file char by char. After reading a char it compares it with a few other characters and, if there is a match, it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;

fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");

while (!feof(fpi)) {
    fread(&sample, sizeof(wchar_t*), 1, fpi);
    if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0)) {
        fwrite(L"_", sizeof(wchar_t*), 1, fpo);
    } else {
        fwrite(&sample, sizeof(wchar_t*), 1, fpo);
    }
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words "γειά σου κόσμε.", I would want it to return this:
γει_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome.
I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.
C has two different kinds of characters: multibyte characters and wide characters.
Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.
Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code points, so it must be at least 4 bytes long (technically, at least 21 bits).
So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).
Unfortunately, there is more to Unicode than this.
ά can be represented as a single Unicode codepoint, or as two separate codepoints:
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 char's.
U+03B1 GREEK SMALL LETTER ALPHA U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 char's.
U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 char's.
All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).
First of all, please do take the time to read this great article, which explains UTF8 vs Unicode and lots of other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html
What you are trying to do in your code is read in Unicode character by character, and do comparisons with those. That won't work if the input stream is UTF-8, and it's not really possible to do with quite this structure.
In short: Fully unicode strings can be encoded in several ways. One of them is using a series of equally-sized "wide" chars, one for each character. That is what the wchar_t type (sometimes WCHAR) is for. Another way is UTF8, which uses a variable number of raw bytes to encode each character, depending on the value of the character.
UTF8 is just a stream of bytes, which can encode a unicode string, and is commonly used in files. It is not the same as a string of WCHARs, which are the more common in-memory representation. You can't poke through a UTF8 stream reliably, and do character replacements within it directly. You'll need to read the whole thing in and decode it, and then loop through the WCHARs that result to do your comparisons and replacement, and then map that result back to UTF8 to write to the output file.
On Win32, use MultiByteToWideChar to do the decoding, and you can use the corresponding WideCharToMultiByte to go back.
When you use a "string literal" with regular quotes, you're creating a nul-terminated string of plain chars (char *), which cannot hold wide characters. The L"string literal" with the L prefix will create a nul-terminated string of WCHARs (wchar_t *), which you can use in string or character comparisons. The L prefix also works with single-quote character literals, like so: L'ε'
As a commenter noted, when you use fread/fwrite, you should be using sizeof(wchar_t) and not its pointer type, since the amount you are trying to read/write is an actual wchar, not the size of a pointer to one. This advice is just code feedback independent of the above-- you don't want to be reading the input character by character anyways.
Note too that when you do string comparisons (wcscmp), you should use actual wide strings (which are terminated with a nul wide char), not single wide characters sitting in memory without a terminator. If (when) you want to do character-to-character comparisons, you don't even need to use the string functions. Since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {}.
Can UTF-8 contain zero byte?
Can I safely store UTF8 string in zero terminated char * ?
I understand strlen() will not return correct information, but "storing", printing and "transferring" the char array seems to be safe.
Yes.
Just like with ASCII and similar 8-bit encodings before Unicode, you can't store the NUL character in such a string (the value U+0000 is the Unicode code point NUL, very much like in ASCII).
As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
In C a 0 byte is the string terminator. As long as the Unicode point 0, U+0000 is not in the Unicode string there is no problem.
To be able to store 0 bytes in Unicode, one may use modified UTF-8, which converts not only code points >= 128 but also 0 to a multi-byte sequence (every byte thereof having its high bit set, >= 128). This is done in Java for some APIs, like DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.
It formally is no longer UTF-8, as UTF-8 requires the shortest encoding. Also, this only works if, when unpacking to non-UTF-8, the length is determined by some means other than strlen.
So the most feasible solution is not to accept U+0000 in strings.
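As a sketch of how modified UTF-8 sidesteps the problem, here is an encoder restricted to ASCII-range input for simplicity (the function name and that restriction are illustrative assumptions, not Java's actual implementation):

```c
#include <stddef.h>

/* Encode bytes (ASCII range assumed, for simplicity) into modified
   UTF-8: a 0 byte becomes the overlong pair 0xC0 0x80, everything
   else is copied through. Returns the number of output bytes. */
size_t mutf8_encode(const unsigned char *in, size_t n, unsigned char *out) {
    size_t i, j = 0;
    for (i = 0; i < n; i++) {
        if (in[i] == 0) {
            out[j++] = 0xC0;   /* overlong encoding of U+0000: */
            out[j++] = 0x80;   /* no raw zero byte appears */
        } else {
            out[j++] = in[i];
        }
    }
    return j;
}
```

The output never contains a raw 0 byte, so it survives C string handling; a decoder must map the pair 0xC0 0x80 back to a single NUL.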
I want to index the characters in a utf8 string which does not necessarily contain
only ascii characters. I want the same kind of behavior I get in javascript:
> str = "lλך" // i.e. Latin ell, Greek lambda, Hebrew lamedh
'lλך'
> str[0]
'l'
> str[1]
'λ'
> str[2]
'ך'
Following the advice of UTF-8 Everywhere, I am representing my mixed character-length string just as any other string in C, and not using wchars.
The problem is that, in C, one cannot access the 16th character of a string: only the 16th byte. Because λ is encoded with two bytes in utf-8, I have to access the 16th and 17th bytes of the string in order to print out one λ.
For reference, the output of:
#include <stdio.h>

int main () {
    char word_with_greek[] = "this is lambda:_λ";

    printf("%s\n", word_with_greek);
    printf("The 0th character is: %c\n", word_with_greek[0]);
    printf("The 15th character is: %c\n", word_with_greek[15]);
    printf("The 16th character is: %c%c\n", word_with_greek[16], word_with_greek[17]);
    return 0;
}
is:
this is lambda:_λ
The 0th character is: t
The 15th character is: _
The 16th character is: λ
Is there an easy way to break up the string into characters? It does not seem too difficult to write a function which breaks a string into wchars- but I imagine that someone has already written this yet I cannot find it.
It depends on what your Unicode characters can be. Most strings are restricted to the Basic Multilingual Plane. If yours are (not by accident but because of their very nature: at least no risk of emoji...), you can use char16_t to represent any character. BTW, wchar_t is at least as large as char16_t, so in that case it is safe to use it.
If your text can contain emoji, or other characters outside the BMP, or simply if you are unsure, the only foolproof way is to convert everything to char32_t, because any Unicode character (at least in 2019...) has a code point that fits in 32 bits.
Converting from UTF-8 to 32-bit (or 16-bit) Unicode is not that hard and can be coded by hand; Wikipedia contains enough information for it. But you will find plenty of libraries where this is already coded and tested, mainly the excellent libiconv; the C11 version of the C standard library also contains functions for UTF-8 conversions in <uchar.h>. Not as nice, but usable.
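To illustrate how the hand-coded route might look, here is a minimal single-code-point decoder (a sketch: it does not reject overlong forms or fully validate continuation bytes, and the function name is my own):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 code point starting at s, store it in *cp, and
   return the number of bytes consumed (0 on an invalid lead byte). */
size_t utf8_decode_one(const unsigned char *s, uint32_t *cp) {
    if (s[0] < 0x80) {                      /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {            /* 110xxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {            /* 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0) {            /* 11110xxx + three continuations */
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6)
            | (s[3] & 0x3F);
        return 4;
    }
    return 0;                               /* invalid lead byte */
}
```

Production code should additionally reject overlong encodings, surrogate code points, and values above U+10FFFF, which is where a tested library earns its keep.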
I'm unsure about the following cases in GCC; can anyone help?
Can a valid UTF-8 character (other than code point 0) contain a zero byte? If so, I think a function such as strlen will break that UTF-8 character.
Can a valid UTF-8 character contain a byte whose value is equal to '\n'? If so, I think a function such as gets will break that UTF-8 character.
Can a valid UTF-8 character contain a byte whose value is equal to ' ' or '\t'? If so, I think a function such as scanf("%s%s") will break that UTF-8 character and interpret it as two or more words.
The answer to all your questions is the same: No.
It's one of the advantages of UTF-8: no ASCII byte ever occurs in the encoding of a non-ASCII code point.
For example, you can safely use strlen on a UTF-8 string, only that its result is the number of bytes instead of UTF-8 code points.
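A quick way to see this property is to count the ASCII-range bytes in a UTF-8 buffer; every byte belonging to a multibyte sequence is >= 0x80 (the helper name here is my own):

```c
#include <stddef.h>

/* Count the ASCII-range bytes (< 0x80) in a buffer. For UTF-8 input,
   each such byte is a genuine ASCII character: bytes that belong to a
   multibyte sequence all have their high bit set. */
size_t count_ascii_bytes(const unsigned char *s, size_t n) {
    size_t i, count = 0;
    for (i = 0; i < n; i++)
        if (s[i] < 0x80)
            count++;
    return count;
}
```

For the five UTF-8 bytes of "λ\nκ" (0xCE 0xBB 0x0A 0xCE 0xBA), the only ASCII byte is the newline itself, which is why gets, scanf and friends split the stream only at real separators.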
I know I can iterate forwards through a multibyte string, in C, using mbrtowc(). But what if I wanted to iterate backwards; or in other words how do I find the previous valid multibyte character. I tried the following method and it at least partially works on my Ubuntu system using the default en_us.UTF-8 locale:
char *str = "\xc2\xa2\xc2\xa1xyzwxfd\xc2\xa9", *tmp = NULL;
wchar_t wc = 0;
size_t ret = 0, width = 1;
mbstate_t state = {0};
//Iterate through 2 characters using mbrtowc()
tmp = str;
tmp += mbrtowc(&wc, tmp, MB_CUR_MAX, &state);
tmp += mbrtowc(&wc, tmp, MB_CUR_MAX, &state);
//This is a simplified version of my code. I didn't test this
//exact code, but this general idea did work.
for(tmp--; (ret = mbrtowc(&wc, tmp, width, &state)) == (size_t)(-1) || ret == (size_t)(-2); width++, tmp--)
if(width == MB_CUR_MAX) printf("error\n");
printf("last multibyte character %lc\n", wc);
The idea is simple: just iterate backwards by one byte until we find a valid multibyte character as defined by mbrtowc(). My question is: can I rely on this to work for any possible multibyte locale, or only for encodings with special properties? Also, more specifically, is mbstate_t being used incorrectly; I mean, could the change in direction affect the validity of mbstate_t? Can I guarantee that ret will only be (size_t)(-1) or (size_t)(-2), because I currently assume that ret could be either, depending on the definitions of an incomplete and an invalid multibyte character.
If you need to deal with any theoretically-possible multibyte encoding, then it is not possible to iterate backwards. There is no requirement that a multibyte encoding have the property that no proper suffix of a valid multibyte sequence is a valid multibyte sequence. (As it happens, your algorithm requires an even stronger property, because you might recognize a multibyte sequence starting in the middle of one valid sequence and continuing into the next sequence.)
Also, you cannot predict (again, in general) the multibyte state if the multibyte encoding has shift states. If you back-up over a multibyte sequence which changes the state, you have no idea what the previous state was.
UTF-8 was designed with this in mind. It does not have shift states, and it clearly marks the octets (bytes) which can start a sequence. So if you know that the multibyte encoding is UTF-8, you can easily iterate backwards. Just scan backwards for a character not in the range 0x80-0xBF. (UTF-16 and UTF-32 are also easily iterated in either direction, but you need to read them as two-/four-byte code units, respectively, because a misaligned read is quite likely to be a correct codepoint.)
If you don't know that the multibyte encoding is UTF-8, then there is simply no robust algorithm to iterate backwards. All you can do is iterate forwards and remember the starting position and mbstate of each character.
Fortunately, these days there is really little reason to support multibyte encodings other than Unicode encodings.
For UTF-8 you can take advantage of the encoding property of the bytes following the first one: the additional bytes of a multibyte character (and only them) start with 10xx xxxx.
So if you go backward and a char c is such that (c & 0xC0) == 0x80, you can skip it.
For other multibyte encoding you don't necessarily have such a simple solution as the lead and following bytes are in ranges that overlap.
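For the UTF-8 case specifically, the backward step reduces to skipping continuation bytes (utf8_prev is an illustrative helper, not a standard function; it assumes the buffer contains valid UTF-8):

```c
#include <stddef.h>

/* Given a pointer p into a valid UTF-8 buffer starting at begin,
   step back to the start of the previous code point by skipping
   continuation bytes (those matching 10xxxxxx). */
const char *utf8_prev(const char *begin, const char *p) {
    if (p <= begin)
        return begin;
    do {
        p--;
    } while (p > begin && ((unsigned char)*p & 0xC0) == 0x80);
    return p;
}
```

Because UTF-8 lead bytes and continuation bytes occupy disjoint ranges, this backward scan can never land in the middle of a character, which is exactly the property other multibyte encodings may lack.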
I am developing a cross platform C (C89 standard) application which has to deal with UTF8 text. All I need is basic string manipulation functions like substr, first, last etc.
Question 1
Is there a UTF8 library that has the above functions implemented? I have already looked into ICU and it is too big for my requirement. I just need to support UTF8.
I have found a UTF8 decoder here. Following function prototypes are from that code.
void utf8_decode_init(char p[], int length);
int utf8_decode_next();
The initialization function takes a character array, but utf8_decode_next() returns an int. Why is that? How can I print the characters this function returns using standard functions like printf? The function is dealing with character data, so how can that be assigned to an integer?
If the above decoder is not good for production code, do you have a better recommendation?
Question 2
I also got confused by reading articles that say you need to use wchar_t for Unicode. From my understanding this is not required, as normal C strings can hold UTF-8 values. I have verified this by looking at the source code of SQLite and git. SQLite has the following typedef.
typedef unsigned char u8;
Is my understanding correct? Also why is unsigned char required?
The utf8_decode_next() function returns the next Unicode code point. Since Unicode is a 21-bit character set, it cannot return anything smaller than an int, and it can be argued that technically it should be a long, since an int could be a 16-bit quantity. Effectively, the function returns you a UTF-32 character.
You would need to look at the C94 wide character extensions to C89 to print wide characters (wprintf(), <wctype.h>, <wchar.h>). However, wide characters alone are not guaranteed to be UTF-8 or even Unicode. You most probably cannot print the characters from utf8_decode_next() portably, but it depends on what your portability requirements are. The wider the range of systems you must port to, the less chance there is of it all working simply. To the extent you can write UTF-8 portably, you would send the UTF-8 string (not an array of the UTF-32 characters obtained from utf8_decode_next()) to one of the regular printing functions. One of the strengths of UTF-8 is that it can be manipulated by code that is largely ignorant of it.
You need to understand that a 4-byte wchar_t can hold any Unicode codepoint in a single unit, but that UTF-8 can require between one and four 8-bit bytes (1-4 units of storage) to hold a single Unicode codepoint. On some systems, I believe wchar_t can be a 16-bit (short) integer. In this case, you are forced into using UTF-16, which encodes Unicode codepoints outside the Basic Multilingual Plane (BMP, code points U+0000 .. U+FFFF) using two storage units and surrogates.
Using unsigned char makes life easier; plain char is often signed. Having negative numbers makes life more difficult than it need be (and, believe me, it is difficult enough without adding complexity).
You do not need any special library routines for character or substring search with UTF-8. strstr does everything you need. That's the whole point of UTF-8 and the design requirements it was invented to meet.
GLib has quite a few relevant functions, and can be used independent of GTK+.
There are over 100,000 characters in Unicode. There are 256 possible values of char in most C implementations.
Hence, UTF-8 uses more than one char to encode each character, and the decoder needs a return type which is larger than char.
wchar_t is a larger type than char (well, it doesn't have to be larger, but it usually is). It represents the characters of the implementation-defined wide character set. On some implementations (most importantly, Windows, which uses surrogate pairs for characters outside the "basic multilingual plane"), it still isn't big enough to represent any Unicode character, which presumably is why the decoder you reference uses int.
You can't print wide characters using printf, because it deals in char. wprintf deals in wchar_t, so if the wide character set is Unicode, and if wchar_t is int on your system (as it is on Linux), then wprintf and friends will print the decoder output without further processing. Otherwise it won't.
In any case, you cannot portably print arbitrary Unicode characters, because there's no guarantee that the terminal can display them, or even that the wide character set is in any way related to Unicode.
SQLite has probably used unsigned char so that:
they know the signedness: it's implementation-defined whether char is signed or not.
they can do right-shifts and assign out-of-range values, and get consistent and defined results across all C implementations. Implementations have more freedom in how signed char behaves than unsigned char.
Normal C strings are fine for storing utf8 data, but you can't easily search for a substring in your utf8 string. This is because a character encoded as a sequence of bytes using the utf8 encoding could be anywhere from one to 4 bytes depending on the character. i.e. a "character" is not equivalent to a "byte" for utf8 like it is for ASCII.
In order to do substring searches etc. you will need to decode it to some internal format that is used to represent Unicode characters and then do the substring search on that. Since there are far more than 256 Unicode characters, a byte (or char) is not enough. That's why the library you found uses ints.
As for your second question, it's probably just because it does not make sense to talk about negative characters, so they may as well be specified as "unsigned".
I have implemented a substr & length functions which supports UTF8 characters. This code is a modified version of what SQLite uses.
The following macro loops through the input text and skips all multi-byte sequence characters. The if condition checks whether this is a multi-byte sequence, and the loop inside it advances the input until it finds the next lead byte.
#define SKIP_MULTI_BYTE_SEQUENCE(input) { \
if( (*(input++)) >= 0xc0 ) { \
while( (*input & 0xc0) == 0x80 ){ input++; } \
} \
}
substr and length are implemented using this macro.
typedef unsigned char utf8;
substr
void substr(const utf8 *string,
            int start,
            int len,
            utf8 **substring)
{
    int bytes, i;
    const utf8 *str2;
    utf8 *output;

    --start;
    while( *string && start ) {
        SKIP_MULTI_BYTE_SEQUENCE(string);
        --start;
    }

    for(str2 = string; *str2 && len; len--) {
        SKIP_MULTI_BYTE_SEQUENCE(str2);
    }

    bytes = (int) (str2 - string);
    output = *substring;
    for(i = 0; i < bytes; i++) {
        *output++ = *string++;
    }
    *output = '\0';
}
length
int length(const utf8 *string)
{
    int len = 0;

    while( *string ) {
        ++len;
        SKIP_MULTI_BYTE_SEQUENCE(string);
    }
    return len;
}