How to navigate a UTF-8 text file - c

I have a text file in UTF-8 that I need to navigate in C. I need to split this file into seperate smaller files (i.e. cut it in half). When this happens, it sometimes splits the multi-byte characters into two different files. When a dumb text editor goes to read the file containing the second half of text, it reads the second half of the cut character and becomes confused, thus not displaying the rest of the text correctly. If I read byte-by-byte, how can I tell if I am at the beginning of a character or in the middle? Non-ascii compatible UTF-8 characters all start with the leading bit set to 1 but some are two bytes and some are three bytes.
Edit: Nevermind, I just found out that the first byte contains the number of leading 1s that the character is long. IE a three byte character is 1110xxxx xxxxxxxx xxxxxxxx.

if ((*s & 0xc0) == 0x80) /* You are in the middle of */;

UTF-8 characters are represented use 1 to 4 bytes.
Check a byte, if you have this binary pattern:
10xxxxxx
you are in the middle of a multi-byte. And you should continue to the next leading character.
If you have this:
0xxxxxxx
you have a 1-byte character.
110xxxxx
is the leading byte of a 2-byte character
1110xxxx
is the leading byte of a 3-byte character
and
11110xxx
is the leading byte of a 4-byte character

All UTF-8 characters are made of a leading byte and zero or more continuation bytes. All continuation bytes are of the form "10xxxxxx" in binary. So all leading bytes are of one of the two forms: "0xxxxxxx" or "11xxxxxx".

Related

sizeof(string) not including a "\" sign

I've been going through strlen and sizeof for strings (char arrays) and I don't quite get one thing.
I have the following code:
int main() {
char str[]="gdb\0eahr";
printf("sizeof=%u\n",sizeof(str));
printf("strlen=%u\n",strlen(str));
return 0;
}
the output of the code is:
sizeof=9
strlen=3
At first I was pretty sure that 2 separate characters \ followed by 0 wouldn't actually act as a NUL (\0) but I managed to figure that it does.
The thing is that I have no idea why sizeof shows 9 and not 10.
Since sizeof counts the amount of used bytes by the data type why doesn't it count the byte for the \?
In a following example:
char str[]="abc";
printf("sizeof=%u\n",sizeof(str));
that would print out "4" because of the NUL value terminating the array so why is \ being not counted?
In a character or string constant, the \ character marks the beginning of an escape sequence, used to represent character values for which there isn't a symbol in the source character set. For example, the escape sequence \n represents the newline character, \b represents the backspace character, \0 represents the zero-valued character (which is also the string terminator), etc.
In the string literal "gdb\0eahr", the escape sequence \0 maps to a single 0-valued character; the actual contents of str are {'g', 'd', 'b', 0, 'e', 'a', 'h', 'r', 0}.
It seems you already have the answer:
At first I was pretty sure that 2 separate characters "\" followed by "0" wouldn't actually act as a NULL "\0" but I managed to figure that it does.
The sequence \0 is an octal escape sequence for the byte 0. So while there are two characters in the code to denote this, it translated to a single byte in the string.
So you have 7 alphabetic characters, a null byte in the middle, and a null byte at the end. That's 9 bytes.
Why should char str[]="gdb\0eahr"; be 10 bytes with sizeof operator? It is 9 bytes because there are 8 string elements + trailing zero.
\0 is only 1 character, not 2. \'s purpose is to escape characters, therefore you might see some of these: \t, \n, \\ and others.
Strlen returns 3 because you have string termination at position str[3].
Single sequence of \ acts as escape character and is not part of string size. If you want to literally use \ in your string, you have to write it twice in sequence like \\, then this is single char of \ printable char.
The C compiler scans text strings as a part of compiling the source code and during the scan any special, escape sequences of characters are turned into a single character. The symbol backslash (\) is used to indicate the start of an escape sequence.
There are several formats for escape sequences. The most basic is a backslash followed by one of several special letters. These two characters are then translated into a single character. Some of these are:
'\n' is turned into a line feed character (0x0A or decimal 10)
'\t' is turned into a tab character (0x09 or decimal 9)
'\r' is turned into a carriage return character (0x0D or decimal 13)
'\\' is turned into a backslash character (0x5C)
This escape sequence idea was used back in the old days so that when a line of text was printed to a teletype machine or printer or a CRT terminal, the programmer could use these and other special command code characters to set where the next character would be printed or to cause the device to do some physical action like ring a bell or feed the paper to the next line.
The escape character also allowed you to embed a double quote (") or a single quote (') into a text string so that you could print text that contain quote marks.
In addition to the above special sequences of backslash followed by a letter there was also a way to specify any character by using a backslash followed by one up to three octal digits (0 through 7). So you could specify a line feed character by either using '\n' or you could use '\12' where 12 is the octal representation of the hexadecimal value A or the decimal value 10.
Then the ability to use a hexadecimal escape sequence was introduced with the backslash followed by the letter x followed by one or more hexadecimal digits. So you can write a line feed character with '\n' or '\12' or '\xa'.
See also Escape sequences in C in Wikipedia.

mbrtowc: howto determine number of characters to skip if null character is read

According to the C99 specification the mbrtowc function returns 0
if the next n or fewer bytes complete the multibyte character that
corresponds to the null wide character (which is the value stored).
What is the best way to continue reading the input immediately after the encoded null character?
My current solution is to convert the null wide character with the given encoding in order to determine the number of input bytes to skip for the next call to mbrtowc. But there might be a more elegant way to do this.
Additionally I wonder what the rationale behind this behaviour of mbrtowc might be.
One byte. The null byte always represents the null character regardless of shift state, and cannot participate as part of a multibyte character. The source for this is:
5.2.1.2 Multibyte characters
...
A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.

Find non-ascii characters from a UTF-8 string

I need to find the non-ASCII characters from a UTF-8 string.
my understanding:
UTF-8 is a superset of character encoding in which 0-127 are ascii characters.
So if in a UTF-8 string , a characters value is Not between 0-127, then it is not a ascii character , right? Please correct me if i'm wrong here.
On the above understanding i have written following code in C :
Note:
I'm using the Ubuntu gcc compiler to run C code
utf-string is x√ab c
long i;
char arr[] = "x√ab c";
printf("length : %lu \n", sizeof(arr));
for(i=0; i<sizeof(arr); i++){
char ch = arr[i];
if (isascii(ch))
printf("Ascii character %c\n", ch);
else
printf("Not ascii character %c\n", ch);
}
Which prints the output like:
length : 9
Ascii character x
Not ascii character
Not ascii character �
Not ascii character �
Ascii character a
Ascii character b
Ascii character
Ascii character c
Ascii character
To naked eye length of x√ab c seems to be 6, but in code it is coming as 9 ?
Correct answer for the x√ab c is 1 ...i.e it has only 1 non-ascii character , but in above output it is coming as 3 (times Not ascii character).
How can i find the non-ascii character from UTF-8 string, correctly.
Please guide on the subject.
What C calls a char is actually a byte. A UTF-8 character can be made up of several bytes.
In fact only the ASCII characters are represented by a single byte in UTF-8 (which is why all valid ASCII-encoded text is also effectively UTF-8 encoded).
So to count the number of UTF-8 characters you have to do a partial decoding: count the number of UTF-8 start codepoints.
See the Wikipedia article on UTF-8 to find out how they are encoded.
Basically there are 3 categories:
single-byte codes 0b0xxxxxxx
start bytes: 0b110xxxxx, 0b1110xxxx, 0b11110xxx
continuation bytes: 0b10xxxxxx
To count the number of unicode codepoint simply count all characters that are not continuation bytes.
However unicode codepoints don't always have a 1-to-1 correspondence to "characters" (depending on your exact definition of character).
The UTF-8 characters when taken in a character array occupies it in such a way that the first byte occupied by each UTF-8 character would contain the information regarding the number of bytes taken to represent the character. The number of consecutive 1's from the MSB of the first byte would represent the total bytes taken by the non-ascii character. In case of '√' the binary form would be: 11100010,10001000,10011010. Counting the number of 1's the in the first byte gives the number of bytes occupied as 3. Something like the code below would work for this:
int get_count(char non_ascii_char){
/*
The function returns the number of bytes occupied by the UTF-8 character
It takes the non ASCII character as the input and returns the length
to the calling function.
*/
int bit_counter=7,count=0;
/*
bit_counter - is the counter initialized to traverse through each bit of the
non ascii character
count - stores the number of bytes occupied by the character
*/
for(;bit_counter>=0;bit_counter--){
if((non_ascii_char>>bit_counter)&1){
count++;// increments on the number of consecutive 1s in the byte
}
else{
break;// breaks on encountering the first 0
}
}
return count;// returns the count to the calling function
}

Is spacebar a character?

Consider this line of text:
First line of text.
If a character array string is used to load the first TEN characters in the array it will output as:
First lin'\0'
First contains 5 letters, lin contains 3 letters. Where are the other two characters being used?
Is \0 considered two characters?
Or is the space between the words considered a character, thus '\0` is one character?
Yes, space is a character. In ASCII encoding it has code number 32.
The space between the two words has ASCII code 0x20 (0408, or 3210); it occupies one byte.
The null at the end of the string, ASCII code 0x00 (0 in both octal and decimal) occupies the other byte.
Note that the space bar is simply the key on the keyboard that generates a space character when typed.
'\0' is the null-terminator, it is literally the value zero in all implementations.
'\0' is considered a single character because the backslash \ means to escape a character. '\0' and '0' thus are both single characters, but mean very different things.
Note that space is represented by a different ascii value.
space is represented in String as "\s", probably space is represented as '\s' as a character

Trouble comparing UTF-8 characters using wchar.h

I am in the process of making a small program that reads a file, that contains UTF-8 elements, char by char. After reading a char it compares it with a few other characters and if there is a match it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;
fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");
while (!feof(fpi)) {
fread(&sample, sizeof(wchar_t*), 1, fpi);
if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0) ) {
fwrite(L"_", sizeof(wchar_t*), 1, fpo);
} else {
fwrite(&sample, sizeof(wchar_t*), 1, fpo);
}
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words γειά σου κόσμε. I would want it to return this:
γει_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome.
I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.
C has two different kinds of characters: multibyte characters and wide characters.
Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.
Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code-points, so must be (at least) 4 bytes long (technically, it must be at least 21-bits long).
So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).
Unfortunately, there is more to Unicode than this.
ά can be represented as a single Unicode codepoint, or as two separate codepoints:
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 char's.
U+03B1 GREEK SMALL LETTER ALPHA U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 char's.
U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 char's.
All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).
First of all, please do take the time to read this great article, which explains UTF8 vs Unicode and lots of other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html
What you are trying to do in your code is read in unicode character by character, and do comparisons with those. That's won't work if the input stream is UTF8, and it's not really possible to do with quite this structure.
In short: Fully unicode strings can be encoded in several ways. One of them is using a series of equally-sized "wide" chars, one for each character. That is what the wchar_t type (sometimes WCHAR) is for. Another way is UTF8, which uses a variable number of raw bytes to encode each character, depending on the value of the character.
UTF8 is just a stream of bytes, which can encode a unicode string, and is commonly used in files. It is not the same as a string of WCHARs, which are the more common in-memory representation. You can't poke through a UTF8 stream reliably, and do character replacements within it directly. You'll need to read the whole thing in and decode it, and then loop through the WCHARs that result to do your comparisons and replacement, and then map that result back to UTF8 to write to the output file.
On Win32, use MultiByteToWideChar to do the decoding, and you can use the corresponding WideCharToMultiByte to go back.
When you use a "string literal" with regular quotes, you're creating a nul-terminated ASCII string (char*), which does not support Unicode. The L"string literal" with the L prefix will create a nul-terminated string of WCHARs (wchar_t *), which you can use in string or character comparisons. The L prefix also works with single-quote character literals, like so: L'ε'
As a commenter noted, when you use fread/fwrite, you should be using sizeof(wchar_t) and not its pointer type, since the amount you are trying to read/write is an actual wchar, not the size of a pointer to one. This advice is just code feedback independent of the above-- you don't want to be reading the input character by character anyways.
Note too that when you do string comparisons (wcscmp), you should use actual wide strings (which are terminated with a nul wide char)-- not use single characters in memory as input. If (when) you want to do character-to-character comparisons, you don't even need to use the string functions. Since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {}.

Resources