I am developing a cross-platform C (C89) application which has to deal with UTF-8 text. All I need is basic string manipulation functions like substr, first, last, etc.
Question 1
Is there a UTF-8 library that has the above functions implemented? I have already looked into ICU, and it is too big for my requirements. I just need to support UTF-8.
I have found a UTF-8 decoder here. The following function prototypes are from that code.
void utf8_decode_init(char p[], int length);
int utf8_decode_next();
The initialization function takes a character array, but utf8_decode_next() returns an int. Why is that? How can I print the characters this function returns using standard functions like printf? The function is dealing with character data, so how can that be assigned to an integer?
If the above decoder is not good for production code, do you have a better recommendation?
Question 2
I also got confused by reading articles that say that for Unicode you need to use wchar_t. From my understanding this is not required, as normal C strings can hold UTF-8 values. I have verified this by looking at the source code of SQLite and Git. SQLite has the following typedef.
typedef unsigned char u8;
Is my understanding correct? Also why is unsigned char required?
The utf8_decode_next() function returns the next Unicode code point. Since Unicode is a 21-bit character set, it cannot return anything smaller than an int, and it can be argued that technically it should be a long, since an int could be a 16-bit quantity. Effectively, the function returns you a UTF-32 character.
You would need to look at the C94 wide character extensions to C89 to print wide characters (wprintf(), <wctype.h>, <wchar.h>). However, wide characters alone are not guaranteed to be UTF-8 or even Unicode. You most probably cannot print the characters from utf8_decode_next() portably, but it depends on what your portability requirements are. The wider the range of systems you must port to, the less chance there is of it all working simply. To the extent you can write UTF-8 portably, you would send the UTF-8 string (not an array of the UTF-32 characters obtained from utf8_decode_next()) to one of the regular printing functions. One of the strengths of UTF-8 is that it can be manipulated by code that is largely ignorant of it.
You need to understand that a 4-byte wchar_t can hold any Unicode codepoint in a single unit, but that UTF-8 can require between one and four 8-bit bytes (1-4 units of storage) to hold a single Unicode codepoint. On some systems, I believe wchar_t can be a 16-bit (short) integer. In this case, you are forced into using UTF-16, which encodes Unicode codepoints outside the Basic Multilingual Plane (BMP, code points U+0000 .. U+FFFF) using two storage units and surrogates.
Using unsigned char makes life easier; plain char is often signed. Having negative numbers makes life more difficult than it need be (and, believe me, it is difficult enough without adding complexity).
You do not need any special library routines for character or substring search with UTF-8. strstr does everything you need. That's the whole point of UTF-8 and the design requirements it was invented to meet.
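For example, here is a minimal sketch (assuming a UTF-8 encoded source file and a UTF-8 capable terminal; the sample strings are arbitrary) showing that plain strstr() finds a multi-byte needle correctly, because no UTF-8 character's encoding can appear as a substring of another character's encoding:
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *haystack = "γειά σου κόσμε";    /* UTF-8 encoded bytes */
    const char *needle   = "κόσμε";
    const char *hit = strstr(haystack, needle); /* plain byte-wise search */

    if (hit)
        printf("found at byte offset %ld\n", (long)(hit - haystack));
    else
        printf("not found\n");
    return 0;
}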
GLib has quite a few relevant functions, and can be used independent of GTK+.
There are over 100,000 characters in Unicode. There are 256 possible values of char in most C implementations.
Hence, UTF-8 uses more than one char to encode each character, and the decoder needs a return type which is larger than char.
wchar_t is a larger type than char (well, it doesn't have to be larger, but it usually is). It represents the characters of the implementation-defined wide character set. On some implementations (most importantly, Windows, which uses surrogate pairs for characters outside the "basic multilingual plane"), it still isn't big enough to represent any Unicode character, which presumably is why the decoder you reference uses int.
You can't print wide characters using printf, because it deals in char. wprintf deals in wchar_t, so if the wide character set is Unicode, and if wchar_t is int on your system (as it is on Linux), then wprintf and friends will print the decoder output without further processing. Otherwise it won't.
In any case, you cannot portably print arbitrary Unicode characters, because there's no guarantee that the terminal can display them, or even that the wide character set is in any way related to Unicode.
SQLite has probably used unsigned char so that:
they know the signedness - it's implementation-defined whether char is signed or not.
they can do right-shifts and assign out-of-range values, and get consistent and defined results across all C implementations. Implementations have more freedom in how signed char behaves than unsigned char (see the sketch below).
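A small sketch of the pitfall (0xE4 is just an arbitrary UTF-8 lead byte chosen for illustration): comparisons like the >= 0xc0 test used in UTF-8 decoders silently go wrong when char is signed, because the byte sign-extends to a negative int:
#include <stdio.h>

int main(void)
{
    char sc = (char)0xE4;       /* may be negative if plain char is signed */
    unsigned char uc = 0xE4;

    /* With a signed char, sc promotes to a negative int, so the test fails
       on most platforms even though the byte value is the same. */
    printf("signed char:   %d\n", sc >= 0xC0);   /* often prints 0 */
    printf("unsigned char: %d\n", uc >= 0xC0);   /* always prints 1 */
    return 0;
}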
Normal C strings are fine for storing UTF-8 data, but you can't easily index into a UTF-8 string by character position. This is because a character encoded as a sequence of bytes using the UTF-8 encoding can be anywhere from one to four bytes long depending on the character; i.e., a "character" is not equivalent to a "byte" in UTF-8 the way it is in ASCII.
In order to do such character-based operations, you will need to decode the string to some internal format that is used to represent Unicode characters and then operate on that. Since there are far more than 256 Unicode characters, a byte (or char) is not enough. That's why the library you found uses ints.
As for your second question, it's probably just because it does not make sense to talk about negative characters, so they may as well be specified as "unsigned".
I have implemented substr and length functions which support UTF-8 characters. This code is a modified version of what SQLite uses.
The following macro steps over one UTF-8 character at a time. The if condition checks whether the byte just consumed starts a multi-byte sequence, and the loop inside it increments input past the continuation bytes until it finds the next head byte.
#define SKIP_MULTI_BYTE_SEQUENCE(input) {            \
    if( (*(input++)) >= 0xc0 ) {                      \
        while( (*input & 0xc0) == 0x80 ){ input++; }  \
    }                                                 \
}
substr and length are implemented using this macro.
typedef unsigned char utf8;
substr
void substr(const utf8 *string,
            int start,            /* 1-based character index */
            int len,              /* number of characters to copy */
            utf8 **substring)
{
    int bytes, i;
    const utf8 *str2;
    utf8 *output;

    /* skip the first (start - 1) characters */
    --start;
    while( *string && start ) {
        SKIP_MULTI_BYTE_SEQUENCE(string);
        --start;
    }

    /* find the byte just past the last requested character */
    for(str2 = string; *str2 && len; len--) {
        SKIP_MULTI_BYTE_SEQUENCE(str2);
    }

    bytes = (int) (str2 - string);
    output = *substring;
    for(i = 0; i < bytes; i++) {
        *output++ = *string++;
    }
    *output = '\0';
}
length
int length(const utf8 *string)
{
    int len;

    len = 0;
    while( *string ) {
        ++len;
        SKIP_MULTI_BYTE_SEQUENCE(string);
    }
    return len;
}
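A quick usage sketch (the buffer size and sample string are arbitrary), assuming the typedef, the macro, and the two functions above; note that substr takes a 1-based character index:
#include <stdio.h>

int main(void)
{
    utf8 buf[64];
    utf8 *out = buf;
    const utf8 *text = (const utf8 *)"γειά σου";

    printf("length = %d\n", length(text));  /* 8 characters, not 15 bytes */
    substr(text, 1, 4, &out);               /* copy characters 1..4 */
    printf("substr = %s\n", (char *)buf);   /* prints "γειά" */
    return 0;
}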
In the Linux documentation:
LC_CTYPE
This category determines the interpretation of byte sequences as characters (e.g., single versus multibyte characters), character classifications (e.g., alphabetic or digit), and the behavior of character classes. On glibc systems, this category also determines the character transliteration rules for iconv(1) and iconv(3). It changes the behavior of the character handling and classification functions, such as isupper(3) and toupper(3), and the multibyte character functions such as mblen(3) or wctomb(3).
However, I see glibc's source code for putwchar:
/* _IO_putwc_unlocked */
# define _IO_putwc_unlocked(_wch, _fp) \
(__glibc_unlikely ((_fp)->_wide_data == NULL \
|| ((_fp)->_wide_data->_IO_write_ptr \
>= (_fp)->_wide_data->_IO_write_end)) \
? __woverflow (_fp, _wch) \
: (wint_t) (*(_fp)->_wide_data->_IO_write_ptr++ = (_wch)))
/* putwchar */
wint_t
putwchar (wchar_t wc)
{
wint_t result;
_IO_acquire_lock (stdout);
result = _IO_putwc_unlocked (wc, stdout);
_IO_release_lock (stdout);
return result;
}
There is no code using the locale set with setlocale(), which confuses me. When and where do the bytes stored in memory get converted to the specific charset set by setlocale()?
Update:
#include <wchar.h>

int main() {
    wchar_t wc = L'\x00010437';
    putwchar(wc); // prints nothing
}
#include <locale.h>
#include <wchar.h>

int main() {
    wchar_t wc = L'\x00010437';
    setlocale(LC_CTYPE, "");
    putwchar(wc); // prints '𐐷'
}
In the two cases above, setlocale() affects the character displayed on the screen. I want to know at which point the bytes are determined to represent a specific character like '𐐷'.
Update2:
Maybe I have found the source code that converts the in-memory data into the specific charset. Here is a code snippet from _IO_wdo_write() in glibc/libio/wfileops.c:
/* Now convert from the internal format into the external buffer. */
result = (*cc->__codecvt_do_out) (cc, &fp->_wide_data->_IO_state,
data, data + to_do, &new_data,
write_ptr,
buf_end,
&write_ptr);
Expanding on my comment:
Where is the C code encode the bytes in the memory to the specific charset in Linux?
To the best of my knowledge, there isn't any. A charset, a.k.a. character encoding, is a mapping from sequences of characters -- in a rather abstract sense of that term -- to sequences of bytes. If you are looking at bytes in memory that represent character data then, perforce, you are looking at an already-encoded representation. For a C program, they will normally be encoded according to the execution character set of the C implementation.
In particular, to the extent that C "character" and "wide character" types actually represent characters, they contain encoded character data. There is normally no conversion needed or performed when such data are read or written, which is why you don't see it in the glibc source.
It is of course possible for a program to encode characters in some other encoding and store the resulting bytes in memory, via iconv(3), for example. It is then the program's responsibility to ensure that they are handled appropriately. As for mapping encoded byte sequences to a visual representation -- "glyphs" -- this is a function performed by the program that displays or prints them. One way that is done is simply by selection of a font with appropriate mappings from byte sequences to glyphs.
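As an illustration of that last point, here is a minimal iconv(3) sketch (the encodings, sample string, and buffer size are arbitrary choices for the example) that re-encodes a UTF-8 string as ISO-8859-1 in memory:
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "T\xC3\xA4";   /* "Tä" in UTF-8 */
    char out[16];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out);
    iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");

    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    /* iconv advances the pointers and decrements the byte counts */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");
    iconv_close(cd);

    printf("converted to %zu bytes\n", sizeof(out) - outleft);
    return 0;
}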
I want to use √ symbol in the program written below.
#include <stdio.h>
main(){
char a='√';
if (a=='√'){
printf("Working");
}
else{
printf("Not working");
}
}
√ is not ASCII; that's why it's not working. But I want to know how to make it work.
Thanks in advance.
There are two different things going on here to be aware of:
The source C file itself may not be able to contain this character correctly.
The char type within the semantics of the actual program does not support this character, either.
As to the first issue, it depends on your platform (etc) but being conservative with C source is most portable, which means sticking to ASCII characters only within the code file. That means, e.g., in comments as well as within meaningful code. That said, lots of platforms will allow and support Unicode characters inside the source files.
Regarding the second, a char is an old-fashioned type for containing characters, and is limited to an octet, which means arbitrary Unicode characters with values above 0xFF just don't fit inside one. I suppose some non-ASCII characters with values above 0x7F do fit in a platform-dependent way (Windows code pages?), but in this case I would treat it as a string, using a Unicode escape sequence for this character: "\u221A".
char *sqrt = "\u221A";
if (strcmp(sqrt, "\u221A") == 0) {
    printf("Working");
} else {
    printf("Not working");
}
Heads-up that C strings (char*) are not really designed around non-ASCII characters either, so in this case you end up embedding the UTF-8 encoded representation of the character (which is three bytes long) inside the char string. This works, preserves the value, and the compare works, but if you're going to be working with Unicode more generally...
If your platform supports "wide characters" (wchar_t or unichar or similar) that can hold Unicode characters, then you can use those types to hold this character, and do direct equality comparisons like you were doing:
wchar_t sqrt = L'\u221A';
if (sqrt == L'\u221A') {
...
(FYI: be a little aware that these wide char types may not be wide enough for arbitrary Unicode code points on your platform; they might work for the square root char, but not, say, an emoji.)
Finally, for the sake of completeness, I feel honor-bound to admit that given a contemporary development environment/toolchain and target platform, you could probably get away with using the explicit character in a widechar literal like so:
wchar_t sqrt = L'√';
if (sqrt == L'√') {
....
But I'm old-fashioned, this feels sketchy, and I don't recommend it. :)
I am in the process of making a small program that reads a file, that contains UTF-8 elements, char by char. After reading a char it compares it with a few other characters and if there is a match it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;
fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");
while (!feof(fpi)) {
fread(&sample, sizeof(wchar_t*), 1, fpi);
if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0) ) {
fwrite(L"_", sizeof(wchar_t*), 1, fpo);
} else {
fwrite(&sample, sizeof(wchar_t*), 1, fpo);
}
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words γειά σου κόσμε, I would want it to return this:
γει_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome.
I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.
C has two different kinds of characters: multibyte characters and wide characters.
Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.
Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code-points, so must be (at least) 4 bytes long (technically, it must be at least 21-bits long).
So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).
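To illustrate the strlen() caveat, here is a small sketch, assuming the program runs under a UTF-8 locale (the sample string is arbitrary): mbstowcs() with a NULL destination counts multibyte characters rather than bytes:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *s = "γειά σου";

    setlocale(LC_ALL, "");          /* assume a UTF-8 locale */
    printf("%zu bytes, %zu characters\n",
           strlen(s),               /* counts bytes: 15 */
           mbstowcs(NULL, s, 0));   /* counts multibyte characters: 8 */
    return 0;
}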
Unfortunately, there is more to Unicode than this.
ά can be represented as a single Unicode codepoint, or as two separate codepoints:
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 chars.
U+03B1 GREEK SMALL LETTER ALPHA, U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 chars.
U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 chars.
All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).
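If you happen to have GLib available (compile with pkg-config glib-2.0), a sketch like the following uses g_utf8_normalize() to make the first two canonical equivalents from the list above compare equal; the byte strings are spelled out explicitly so the example does not depend on the source file's encoding:
#include <glib.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *precomposed = "\xCE\xAC";         /* U+03AC */
    const char *decomposed  = "\xCE\xB1\xCC\x81"; /* U+03B1 U+0301 */

    gchar *a = g_utf8_normalize(precomposed, -1, G_NORMALIZE_NFC);
    gchar *b = g_utf8_normalize(decomposed,  -1, G_NORMALIZE_NFC);

    printf("%s\n", strcmp(a, b) == 0 ? "equal after NFC" : "still different");

    g_free(a);
    g_free(b);
    return 0;
}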
First of all, please do take the time to read this great article, which explains UTF8 vs Unicode and lots of other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html
What you are trying to do in your code is read in Unicode character by character, and do comparisons with those. That won't work if the input stream is UTF-8, and it's not really possible to do with quite this structure.
In short: Fully unicode strings can be encoded in several ways. One of them is using a series of equally-sized "wide" chars, one for each character. That is what the wchar_t type (sometimes WCHAR) is for. Another way is UTF8, which uses a variable number of raw bytes to encode each character, depending on the value of the character.
UTF8 is just a stream of bytes, which can encode a unicode string, and is commonly used in files. It is not the same as a string of WCHARs, which are the more common in-memory representation. You can't poke through a UTF8 stream reliably, and do character replacements within it directly. You'll need to read the whole thing in and decode it, and then loop through the WCHARs that result to do your comparisons and replacement, and then map that result back to UTF8 to write to the output file.
On Win32, use MultiByteToWideChar to do the decoding, and you can use the corresponding WideCharToMultiByte to go back.
When you use a "string literal" with regular quotes, you're creating a nul-terminated ASCII string (char*), which does not support Unicode. The L"string literal" with the L prefix will create a nul-terminated string of WCHARs (wchar_t *), which you can use in string or character comparisons. The L prefix also works with single-quote character literals, like so: L'ε'
As a commenter noted, when you use fread/fwrite, you should be using sizeof(wchar_t) and not its pointer type, since the amount you are trying to read/write is an actual wchar, not the size of a pointer to one. This advice is just code feedback independent of the above-- you don't want to be reading the input character by character anyways.
Note too that when you do string comparisons (wcscmp), you should use actual wide strings (which are terminated with a nul wide char)-- not use single characters in memory as input. If (when) you want to do character-to-character comparisons, you don't even need to use the string functions. Since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {}.
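Putting those pieces together, a minimal sketch of the replace-with-underscore loop using C's wide-character I/O might look like this (the file names are placeholders, and it assumes a UTF-8 locale in the environment and a UTF-8 encoded source file for the L'ά' literal):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wint_t wc;
    FILE *fpi, *fpo;

    setlocale(LC_ALL, "");   /* so the streams decode/encode UTF-8 */

    fpi = fopen("input.txt", "r");
    fpo = fopen("output.txt", "w");
    if (fpi == NULL || fpo == NULL)
        return 1;

    /* fgetwc decodes one multibyte character per call */
    while ((wc = fgetwc(fpi)) != WEOF) {
        if (wc == L'ά' || wc == L'ε')
            fputwc(L'_', fpo);
        else
            fputwc(wc, fpo);
    }

    fclose(fpi);
    fclose(fpo);
    return 0;
}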
I discovered an interesting problem when processing UTF-8 strings containing non-ASCII chars with C standard library formatting functions like sprintf():
The functions of the printf() family are not aware of UTF-8 and process everything based on the number of bytes, not characters. Therefore the formatting is incorrect.
Simple example:
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[])
{
const char* testMsg = "Tääääßt";
char buf[1024];
int len;
sprintf(buf, "|%7.7s|", testMsg);
len = strlen(buf);
printf("Result=\"%s\", len=%d", buf, len);
return 0;
}
The result is:
Result="|Täää|", len=9
Most probably some of you will recommend converting the application from char to wchar_t and using fwprintf(), etc., but that's absolutely impossible because of huge existing applications. I could imagine writing a wrapper that uses these functions internally, but this would be tricky and very inefficient.
So the best solution would be a UTF-8-aware replacement for the formatting functions of the Standard C Library.
Currently I'm working on QNX 6.4, but replies for other operating systems, e.g. Linux, are also very welcome.
Well, once you ask printf to do intelligent padding of Unicode characters, you run into major problems. As they say,
w͢͢͝h͡o͢͡ ̸͢k̵͟n̴͘ǫw̸̛s͘ ̀́w͘͢ḩ̵a҉̡͢t ̧̕h́o̵r͏̵rors̡ ̶͡͠lį̶e͟͟ ̶͝in͢ ͏t̕h̷̡͟e ͟͟d̛a͜r̕͡k̢̨ ͡h̴e͏a̷̢̡rt́͏ ̴̷͠ò̵̶f̸ u̧͘ní̛͜c͢͏o̷͏d̸͢e̡͝?͞
How many Unicode characters are in Tääääßt? Well, it could be anywhere from 7 to 11, depending on how it's encoded. Each ä can be written as U+00E4, which is one character, or it could be written as U+0061 U+0308, which is two characters. So your next hope is to count grapheme clusters. (No, normalization won't make the problem go away.)
But, how wide is a grapheme cluster? Obviously, a is one column wide. U+200B should be zero columns wide, it's a "zero-width" space. Should each ひらがな be two columns wide? They usually are in terminal emulators. What happens when you format ひらがな as 7 columns, do you get "ひらが ", which adds a space, or do you get "ひらが", which is only 6 columns?
If you cut something up which mixes RTL and LTR text, should you reset the text direction afterwards? What are you going to do? (Some terminal emulators, such as Apple's, support a mixture of left-to-right and right-to-left text.)
What is your goal by truncating text? Are you trying to show the user a string in limited space, or are you trying to write a format that uses fixed-width fields?
Basically, if you want to cut Unicode text into chunks, you shouldn't be doing it with something as simple as printf (or wprintf, which is quite possibly worse). Use LibICU (website) to iterate over the breaks you want. Writing a UTF-8 aware version of printf is asking for all sorts of trouble that you don't want.
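For completeness, here is a hedged sketch of what "iterate over the breaks you want" looks like with ICU's C API; the fixed 256-unit buffer and the "en_US" locale are simplifying assumptions for the example:
#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

/* Count grapheme clusters ("user-perceived characters") in a UTF-8 string. */
static int32_t count_graphemes(const char *utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar buf[256];              /* sketch only: assumes the input fits */
    int32_t ulen, count = 0;
    UBreakIterator *bi;

    u_strFromUTF8(buf, 256, &ulen, utf8, -1, &status);
    if (U_FAILURE(status)) return -1;

    bi = ubrk_open(UBRK_CHARACTER, "en_US", buf, ulen, &status);
    if (U_FAILURE(status)) return -1;

    while (ubrk_next(bi) != UBRK_DONE)
        ++count;
    ubrk_close(bi);
    return count;
}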
The following C99 code snippet defines the function u8printf, where format specifiers such as %10s yield 10 UTF-8 code points, that is, characters rather than bytes. Don't forget to set the locale with setlocale(LC_ALL,"") somewhere before this routine is called. This works because vswprintf uses wchar_t internally. You can define u8fprintf and u8sprintf in a similar way. If you want to write this without C99 variable-length arrays, then a suitable combination of malloc/free is also possible.
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int u8printf(char *fmt,...){
    va_list ap;
    int n=mbstowcs(0,fmt,0);
    if(n==-1) return -1;
    wchar_t wfmt[n+1];
    mbstowcs(wfmt,fmt,n+1);
    for(int m=128;m<=32768;m*=2){
        wchar_t wbuf[m];
        va_start(ap,fmt);  /* restart the argument list on every retry */
        int r=vswprintf(wbuf,m,wfmt,ap);
        va_end(ap);
        if(r!=-1) {
            char buf[m*4];
            wcstombs(buf,wbuf,m*4);
            fputs(buf,stdout);
            return r;
        }
    }
    return -1;
}
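A quick usage sketch, assuming the u8printf definition above and a UTF-8 locale in the environment; because the formatting happens on wide characters internally, the precision now counts characters instead of bytes:
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");             /* required for the mbs/wcs conversions */
    u8printf("|%7.7s|\n", "Tääääßt");  /* prints |Tääääßt|, all 7 characters */
    return 0;
}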
Let's say I have a string:
char theString[] = "你们好āa";
Given that my encoding is UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the Latin character with the macron is two bytes, and the 'a' is one byte):
strlen(theString) == 12
How can I count the number of characters? How can I do the equivalent of subscripting so that:
theString[3] == "好"
How can I slice and cat such strings?
You only count the bytes whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xBF).
That's because all the bytes with the top two bits set to 10 are UTF-8 continuation bytes.
See here for a description of the encoding and how strlen can work on a UTF-8 string.
For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all others are continuation bytes.
Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:
utf8left(char *destbuff, char *srcbuff, size_t sz);
utf8mid(char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest(char *destbuff, char *srcbuff, size_t pos);
to get, respectively:
the left sz UTF-8 bytes of a string.
the sz UTF-8 bytes of a string, starting at pos.
the rest of the UTF-8 bytes of a string, starting at pos.
This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
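A minimal sketch of the first of those, under the assumption that destbuff is large enough and the input is valid UTF-8 (and taking sz as a count of UTF-8 characters rather than bytes):
#include <stddef.h>

/* Copy at most the first sz UTF-8 characters of srcbuff into destbuff. */
void utf8left(char *destbuff, const char *srcbuff, size_t sz)
{
    size_t i = 0;

    while (srcbuff[i]) {
        if ((srcbuff[i] & 0xC0) != 0x80) {  /* lead byte: next character */
            if (sz == 0)
                break;
            --sz;
        }
        destbuff[i] = srcbuff[i];
        ++i;
    }
    destbuff[i] = '\0';
}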
Try this for size:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
size_t len = 0;
for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
return len;
}
// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s
char *utf8index(char *s, size_t pos)
{
++pos;
for (; *s; ++s) {
if ((*s & 0xC0) != 0x80) --pos;
if (pos == 0) return s;
}
return NULL;
}
// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
char *p = utf8index(s, *start);
*start = p ? p - s : -1;
p = utf8index(s, *end);
*end = p ? p - s : -1;
}
// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
return strcat(dest, src);
}
// test program
int main(int argc, char **argv)
{
// slurp all of stdin to p, with length len
char *p = malloc(0);
size_t len = 0;
while (true) {
p = realloc(p, len + 0x10000);
ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
if (cnt == -1) {
perror("read");
abort();
} else if (cnt == 0) {
break;
} else {
len += cnt;
}
}
p[len] = '\0'; // NUL-terminate the buffer so the utf8 functions can find the end
// do some demo operations
printf("utf8len=%zu\n", utf8len(p));
ssize_t start = 2, end = 3;
utf8slice(p, &start, &end);
printf("utf8slice[2:3]=%.*s\n", end - start, p + start);
start = 3; end = 4;
utf8slice(p, &start, &end);
printf("utf8slice[3:4]=%.*s\n", end - start, p + start);
return 0;
}
Sample run:
matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā
Note that your example has an off-by-one error: theString[2] == "好".
The easiest way is to use a library like ICU
Depending on your notion of "character", this question can get more or less involved.
First off, you should transform your byte string into a string of Unicode codepoints. You can do this with iconv() or ICU, though if this is the only thing you do, iconv() is a lot easier, and it's part of POSIX.
Your string of unicode codepoints could be something like a null-terminated uint32_t[], or if you have C1x, an array of char32_t. The size of that array (i.e. its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.
However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints; for instance, an a with a circumflex accent can be expressed as two Unicode codepoints, or as the single precomposed codepoint â. Both are valid, and both are required by the Unicode standard to be treated equally. There is a process called "normalization" which turns your string into a definite version, but there are many graphemes which are not expressible as a single codepoint, and in general there is no way around a proper library that understands this and counts graphemes for you.
That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into Unicode codepoints is a must; everything beyond that is at your discretion.
Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv() first.
In the real world, theString[3]=foo; is not a meaningful operation. Why would you ever want to replace a character at a particular position in the string with a different character? There's certainly no natural-language-text processing task for which this operation is meaningful.
Counting characters is also unlikely to be meaningful. How many characters (for your idea of "character") are there in "á"? How about "á"? Now how about "གི"? If you need this information for implementing some sort of text editing, you're going to have to deal with these hard questions, or just use an existing library/gui toolkit. I would recommend the latter unless you're an expert on world scripts and languages and think you can do better.
For all other purposes, strlen tells you exactly the piece of information that's actually useful: how much storage space a string takes. This is what's needed for combining and separating strings. If all you want to do is combine strings or separate them at a particular delimiter, snprintf (or strcat if you insist...) and strstr are all you need.
If you want to perform higher-level natural-language-text operations, like capitalization, line breaking, etc. or even higher-level operations like pluralization, tense changes, etc. then you'll need either a library like ICU or respectively something much higher-level and linguistically-capable (and specific to the language(s) you're working with).
Again, most programs do not have any use for this sort of thing and just need to assemble and parse text without any considerations to natural language.
/* wrapped in a (hypothetically named) function for completeness:
   i counts bytes, j counts characters */
size_t utf8_char_count(const char *s)
{
    size_t i = 0, j = 0;

    while (s[i]) {
        if ((s[i] & 0xC0) != 0x80)
            j++;
        i++;
    }
    return (j);
}
This will count characters in a UTF-8 String... (Found in this article: Even faster UTF-8 character counting)
However I'm still stumped on slicing and concatenating?!?
In general, we should use a different data type for Unicode characters.
For example, you can use the wide char data type:
wchar_t theString[] = L"你们好āa";
Note the L modifier, which tells the compiler that the string is composed of wide chars.
The length of that string can be calculated using the wcslen function, which behaves like strlen.
One thing that's not clear from the above answers is why it's not simple. Each character is encoded in one way or another (it doesn't have to be UTF-8, for example), and each character may have multiple encodings, with varying ways to handle combining of accents, etc. The rules are really complicated and vary by encoding (e.g., UTF-8 vs. UTF-16).
This question has enormous security concerns, so it is imperative that this be done correctly. Use an OS-supplied library or a well-known third-party library to manipulate unicode strings; don't roll your own.
I did a similar implementation years back, but I do not have the code with me.
For each Unicode character, the first byte describes the number of bytes that follow it to construct the character. Based on the first byte you can determine the length of each Unicode character.
A sequence of code points can constitute a single syllable/letter/character in many non-Western-European languages (e.g., all Indic languages).
So, when you are counting the length or finding a substring (there are definitely use cases for finding substrings, say, playing a hangman game), you need to advance syllable by syllable, not code point by code point.
So the definition of the character/syllable, and where you actually break the string into "chunks of syllables", depends upon the nature of the language you are dealing with.
For example, the pattern of the syllables in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following:
V (Vowel in their primary form appearing at the beginning of the word)
C (consonant)
C + V (consonant + vowel in their secondary form)
C + C + V
C + C + C + V
You need to parse the string and look for the above patterns to break the string and to find the substrings.
I do not think it is possible to have a general-purpose method which can magically break strings in the above fashion for any Unicode string (or sequence of code points), as the pattern that works for one language may not be applicable to another.
I guess there may be some methods/libraries that can take some definition/configuration parameters as input to break Unicode strings into such syllable chunks. Not sure, though! I would appreciate it if someone could share how they solved this problem using any commercially available or open-source methods.