How to count characters in a unicode string in C

Let's say I have a string:
char theString[] = "你们好āa";
Given that my encoding is UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the Latin character with the macron is two bytes, and the 'a' is one byte):
strlen(theString) == 12
How can I count the number of characters? How can I do the equivalent of subscripting so that:
theString[3] == "好"
How can I slice and cat such strings?

You only count the bytes whose top two bits are not set to 10 (i.e., every byte less than 0x80 or greater than 0xBF).
That's because all the characters with the top two bits set to 10 are UTF-8 continuation bytes.
See here for a description of the encoding and how strlen can work on a UTF-8 string.
For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all others are continuation bytes.
Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:
void utf8left(char *destbuff, const char *srcbuff, size_t sz);
void utf8mid(char *destbuff, const char *srcbuff, size_t pos, size_t sz);
void utf8rest(char *destbuff, const char *srcbuff, size_t pos);
to get, respectively:
the left sz UTF-8 bytes of a string.
the sz UTF-8 bytes of a string, starting at pos.
the rest of the UTF-8 bytes of a string, starting at pos.
This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
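As an illustration, the first of those building blocks might be sketched like this, counting lead bytes the same way as described above (utf8left is the assumed name from the prototypes; destbuff is assumed large enough):

```c
#include <stddef.h>

/* Copy at most the first sz UTF-8 code points of srcbuff into destbuff.
 * A sketch, not production code: no validation of malformed sequences. */
void utf8left(char *destbuff, const char *srcbuff, size_t sz)
{
    size_t n = 0;
    while (*srcbuff) {
        /* any byte that is not 10xxxxxx starts a new code point */
        if (((unsigned char)*srcbuff & 0xC0) != 0x80) {
            if (n == sz)
                break;
            ++n;
        }
        *destbuff++ = *srcbuff++;
    }
    *destbuff = '\0';
}
```

utf8mid and utf8rest follow the same pattern: skip pos code points first, then copy.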

Try this for size:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
size_t len = 0;
for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
return len;
}
// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s
char *utf8index(char *s, size_t pos)
{
++pos;
for (; *s; ++s) {
if ((*s & 0xC0) != 0x80) --pos;
if (pos == 0) return s;
}
return NULL;
}
// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
char *p = utf8index(s, *start);
*start = p ? p - s : -1;
p = utf8index(s, *end);
*end = p ? p - s : -1;
}
// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
return strcat(dest, src);
}
// test program
int main(int argc, char **argv)
{
// slurp all of stdin to p, with length len
char *p = malloc(0);
size_t len = 0;
while (true) {
p = realloc(p, len + 0x10000);
ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
if (cnt == -1) {
perror("read");
abort();
} else if (cnt == 0) {
break;
} else {
len += cnt;
}
}
// do some demo operations
printf("utf8len=%zu\n", utf8len(p));
ssize_t start = 2, end = 3;
utf8slice(p, &start, &end);
printf("utf8slice[2:3]=%.*s\n", (int)(end - start), p + start);
start = 3; end = 4;
utf8slice(p, &start, &end);
printf("utf8slice[3:4]=%.*s\n", (int)(end - start), p + start);
return 0;
}
Sample run:
matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā
Note that your example has an off-by-one error: theString[2] == "好"

The easiest way is to use a library like ICU

Depending on your notion of "character", this question can get more or less involved.
First off, you should transform your byte string into a string of Unicode code points. You can do this with iconv() or ICU, though if this is the only thing you do, iconv() is a lot easier, and it's part of POSIX.
Your string of Unicode code points could be something like a null-terminated uint32_t[], or, if you have C11, an array of char32_t. The size of that array (i.e., its number of elements, not its size in bytes) is the number of code points (plus the terminator), and that should give you a very good start.
However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than code points. For instance, an a with a circumflex can be expressed as two Unicode code points (a plus a combining circumflex), or as the single precomposed code point â; both are valid, and both are required by the Unicode standard to be treated equally. There is a process called "normalization" which turns your string into a definite version, but there are many graphemes which are not expressible as a single code point, and in general there is no way around a proper library that understands this and counts graphemes for you.
That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into unicode codepoints is a must, everything beyond that is at your discretion.
Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv() first.

In the real world, theString[3]=foo; is not a meaningful operation. Why would you ever want to replace a character at a particular position in the string with a different character? There's certainly no natural-language-text processing task for which this operation is meaningful.
Counting characters is also unlikely to be meaningful. How many characters (for your idea of "character") are there in "á"? How about "á"? Now how about "གི"? If you need this information for implementing some sort of text editing, you're going to have to deal with these hard questions, or just use an existing library/gui toolkit. I would recommend the latter unless you're an expert on world scripts and languages and think you can do better.
For all other purposes, strlen tells you exactly the piece of information that's actually useful: how much storage space a string takes. This is what's needed for combining and separating strings. If all you want to do is combine strings or separate them at a particular delimiter, snprintf (or strcat, if you insist...) and strstr are all you need.
If you want to perform higher-level natural-language-text operations, like capitalization, line breaking, etc. or even higher-level operations like pluralization, tense changes, etc. then you'll need either a library like ICU or respectively something much higher-level and linguistically-capable (and specific to the language(s) you're working with).
Again, most programs do not have any use for this sort of thing and just need to assemble and parse text without any considerations to natural language.

size_t utf8len(const char *s)
{
    size_t i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xC0) != 0x80)
            j++;
        i++;
    }
    return j;
}
This will count the characters in a UTF-8 string (found in this article: Even faster UTF-8 character counting).
However, I'm still stumped on slicing and concatenating!

In general we should use a different data type for unicode characters.
For example, you can use the wide char data type
wchar_t theString[] = L"你们好āa";
Note the L prefix, which indicates that the string is composed of wide characters.
The length of that string can be calculated using the wcslen function, which behaves like strlen.

One thing that's not clear from the above answers is why it's not simple. Each character is encoded in one way or another (it doesn't have to be UTF-8, for example), and each character may have multiple encodings, with varying ways to handle combining accents, etc. The rules are really complicated, and vary by encoding (e.g., UTF-8 vs. UTF-16).
This question has enormous security concerns, so it is imperative that this be done correctly. Use an OS-supplied library or a well-known third-party library to manipulate unicode strings; don't roll your own.

I did a similar implementation years back, but I do not have the code with me.
For each UTF-8 character, the first byte describes the number of bytes that follow it to construct the character. Based on the first byte you can determine the length of each character's sequence.
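That rule can be sketched as a small classification of the lead byte (a sketch only; it does not validate overlong or out-of-range sequences):

```c
/* Number of bytes in the UTF-8 sequence whose lead byte is b,
 * or 0 if b is a continuation byte or an invalid lead byte. */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII                 */
    if (b < 0xC0) return 0;   /* 10xxxxxx: continuation byte     */
    if (b < 0xE0) return 2;   /* 110xxxxx: two-byte sequence     */
    if (b < 0xF0) return 3;   /* 1110xxxx: three-byte sequence   */
    if (b < 0xF8) return 4;   /* 11110xxx: four-byte sequence    */
    return 0;                 /* invalid in modern UTF-8         */
}
```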

A sequence of code points may constitute a single syllable/letter/character in many non-Western-European languages (e.g., all Indic languages).
So, when you are counting the length or finding a substring (there are definitely use cases for finding substrings; say you are playing a hangman game), you need to advance syllable by syllable, not code point by code point.
So the definition of a character/syllable, and where you actually break the string into "chunks of syllables", depends on the nature of the language you are dealing with.
For example, the pattern of the syllables in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following
V (Vowel in their primary form appearing at the beginning of the word)
C (consonant)
C + V (consonant + vowel in their secondary form)
C + C + V
C + C + C + V
You need to parse the string and look for the above patterns to break the string and to find the substrings.
I do not think it is possible to have a general-purpose method which can magically break strings in the above fashion for any Unicode string (or sequence of code points), as the pattern that works for one language may not be applicable for another.
I guess there may be some methods/libraries that can take definition/configuration parameters as input to break Unicode strings into such syllable chunks. I am not sure, though! I would appreciate it if someone could share how they solved this problem using any commercially available or open-source methods.

Related

Wide-character version of _memccpy

I have to concatenate wide C-style strings, and based on my research, it seems that something like _memccpy is ideal (in order to avoid Shlemiel's problem). But I can't seem to find a wide-character version. Does something like that exist?
Does something like that exist?
The C standard library does not contain a wide-character version of Microsoft's _memccpy(). Neither does it contain _memccpy() itself, although POSIX specifies the memccpy() function on which MS's _memccpy() appears to be modeled.
POSIX also defines wcpcpy() (a wide version of stpcpy()), which copies a wide string and returns a pointer to the end of the result. That's not as fully featured as memccpy(), but it would suffice to avoid Shlemiel's problem, if only Microsoft's C library provided a version of it.
You can, however, use swprintf() to concatenate wide strings without suffering from Shlemiel's problem, with the added advantage that it is in the standard library, since C99. It does not provide the memccpy behavior of halting after copying a user-specified (wide) character, but it does return the number of wide characters written, which is equivalent to returning a pointer to the end of the result. Also, it can directly concatenate an arbitrary fixed number of strings in a single call. swprintf does have significant overhead, though.
But of course, if the overhead of swprintf puts you off then it's pretty easy to write your own. The result might not be as efficient as a well-tuned implementation from your library vendor, but we're talking about a scaling problem, so you mainly need to win on the asymptotic complexity front. Simple example:
/*
* Copies at most 'count' wide characters from 'src' to 'dest', stopping after
* copying a wide character with value 0 if that happens first. If no 0 is
* encountered in the first 'count' wide characters of 'src' then the result
* will be unterminated.
* Returns 'dest' + n, where n is the number of non-zero wide characters copied to 'dest'.
*/
wchar_t *wcpncpy(wchar_t *dest, const wchar_t *src, size_t count) {
for (wchar_t *bound = dest + count; dest < bound; ) {
if ((*dest++ = *src++) == 0) return dest - 1;
}
return dest;
}

At what point do encodings enter into play in C, if at all? How do strings get properly printed, then?

To investigate how C deals with UTF-8 / Unicode characters, I did this little experiment.
It's not that I'm trying to solve anything particular at the moment, but I know that Java deals with the whole encoding situation in a transparent way to the coder and I was wondering how C, that is a lot lower level, treats its characters.
The following test seems to indicate that C is entirely ignorant of encoding concerns, and that it's just up to the display device to know how to interpret the sequence of chars when showing them on screen. The later tests (when printing the characters surrounded by _) seem particularly telling.
#include <stdio.h>
#include <string.h>
int main() {
char str[] = "João"; // ã does not belong to the standard
// (or extended) ASCII characters
printf("number of chars = %d\n", (int)strlen(str)); // 5
int len = 0;
while (str[len] != '\0')
len++;
printf("number of bytes = %d\n", len); // 5
for (int i = 0; i < len; i++)
printf("%c", str[i]);
puts("");
// "João"
for (int i = 0; i < len; i++)
printf("_%c_", str[i]);
puts("");
// _J__o__�__�__o_ -> wow!!!
str[2] = 'X'; // let's change this special character
// and see what happens
for (int i = 0; i < len; i++)
printf("%c", str[i]);
puts("");
// JoX�o
for (int i = 0; i < len; i++)
printf("_%c_", str[i]);
puts("");
// _J__o__X__�__o_
}
I have knowledge of how ASCII / UTF-8 work; what I'm really unsure about is at what moment the characters get interpreted as "compound" characters, as it seems that C just treats them as dumb bytes. What's really the science behind this?
The printing isn't a function of C, but of the display context, whatever that is. For a terminal there are UTF-8 decoding functions which map the raw character data into the character to be shown on screen using a particular font. A similar sort of display logic happens in graphical applications, though with even more complexity relating to proportional font widths, ligatures, hyphenation, and numerous other typographical concerns.
Internally this is often done by decoding UTF-8 into some intermediate form first, like UTF-16 or UTF-32, for look-up purposes. In extremely simple terms, each character in a font has a Unicode identifier. In practice this is a lot more complicated as there is room for character variants, and multiple characters may be represented by a singular character in a font, like "fi" and "ff" ligatures. Accented characters like "ç" may be a combination of characters, as allowed by Unicode. That's where things like Zalgo text come about: you can often stack a truly ridiculous number of Unicode "combining characters" together into a single output character.
Typography is a complex world with complex libraries required to render properly.
You can handle UTF-8 data in C, but only with special libraries. Nothing that C ships with in the Standard Library can understand it; to C it's just a series of bytes, and it assumes a byte is equivalent to a character for the purposes of length. That is, strlen and friends work with bytes as a unit, not characters.
C++, as an example, has much better support for this distinction between byte and character. Other languages have even better support, with languages like Swift having exceptional support for UTF-8 specifically and Unicode in general.
printf("_%c_", str[i]); prints the character associated with each str[i] - one at a time.
The value of char str[i] is converted to an int when passed to a variadic function. The int value is then converted to unsigned char as directed by "%c", and "the resulting character is written".
char str[] = "João"; does not certainly specify a UTF-8 sequence; that is an implementation detail. A specified way is to use char str[] = u8"João"; since C11.
printf() does not specify a direct way to print UTF-8 strings.

Best way to read a file into buffer if you need to check middle symbol of word

I want to ask what's the best way of reading a file (preferably using a buffer) if I will need to throw out words whose middle symbol is a number. (There can be more than one space between words.) The text file could look like this: "asd4ggt gklk6k k77k 345k ll4l 7", so I need to throw out "asd4ggt" and "7" (I don't need to throw out "k77k" because it has an even number of symbols, so there isn't a middle symbol). In words, symbols can be from 0 to 9, A to Z, a to z (only the simple English alphabet).
I'm thinking of reading the text file word by word: read one word into a buffer; if it has an even number of symbols, write it to the file; but if it has an odd number of symbols, check whether its middle symbol is a number, and if it is, skip the word and go to the next one.
Is this a right way of thinking how to complete this task?
Based on your comment we came to this:
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int evenCheck(const char *ptr);
size_t middleCheck(const char *ptr);
int main(void){
const char *ptr = "t4k4k";
size_t middle = middleCheck(ptr);
if( evenCheck(ptr) == 0){
printf("Output to file the word %s\n",ptr);
}else{
if ( isdigit(ptr[middle]) ){
printf("Ignoring the word %s, because has the number %c in the middle\n",ptr, ptr[middle]);
}else{
printf("Output to file the word %s, because the middle is %c which is a Letter\n",ptr, ptr[middle]);
}
}
}
int evenCheck(const char *ptr){
size_t len = strlen(ptr);
if ( (len % 2) ){
return 1;
}
return 0;
}
size_t middleCheck(const char *ptr){
size_t middle = strlen(ptr) / 2;
return middle;
}
Output:
Output to file the word t4k4k, because the middle is k which is a Letter
Now you were asking about how to do this if the file has more than one word.
Well, one option would be to save the file in a multi-dimensional array, or to read the whole file at once.
I'm sure you can do it, if not come back with another Question.
Depends on the type of data, and what you plan to do with it. If it's a small enough file that it fits in a single buffer, just load the file and then throw out whichever parts you don't want from the buffer.
If the data needs to be loaded into a data structure other than a flat buffer, then you'll need to process the input, probably line-by-line, building the structure and throwing out what you don't need as you go.
Note that the standard file routines can read a byte or line of text efficiently (they still use a larger buffer internally).
Other than that, your question really isn't that clear.
First, you should define which encoding is your text file using. On most operating systems, it would be UTF-8 (so Unicode characters can take several bytes each).
Then, the notion of a word is probably (human-)language specific. I'm not sure it would be the same in English (BTW, gklk6k is not an English word; it does not appear in any English dictionary), in ancient Greek, in Russian, in Japanese, in Chinese. Notice that the notion of a letter is not as simple as you might imagine (Unicode has many more letters than A ... Z & a ... z; my family name in Russian is Старынкевич, and all these letters are outside of A ... Z & a ... z, and they need more than one byte each). And what about combining characters? Greek diacritics? What exactly are words for you, and how can they be separated? What about punctuation?
If your system provides it, I would use getline(3) to read a line and have a loop around that. Then process every line. Splitting it into words is itself interesting. You could use some UTF-8 library like ICU, Glib (from GTK) etc etc...
In other words, you should define first what your input can be (and a few examples don't constitute a definition). Perhaps you could specify the possible valid input using EBNF notation. Maybe reading more about lexing & parsing techniques is relevant.

Creating C substrings: looping with assignment operator vs strncpy, which is better?

This might be somewhat pointless, but I'm curious what you guys think about it. I'm iterating over a string with pointers and want to pull a short substring out of it (placing the substring into a pre-allocated temporary array). Are there any reasons to use assignment over strncpy, or vice versa? I.e.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main()
{ char orig[] = "Hello. I am looking for Molly.";
/* Strings to store the copies
* Pretend that strings had some prior value, ensure null-termination */
char cpy1[4] = "huh\0";
char cpy2[4] = "huh\0";
/* Pointer to simulate iteration over a string */
char *startptr = orig + 2;
int length = 3;
int i;
/* Using strncpy */
strncpy(cpy1, startptr, length);
/* Using assignment operator */
for (i = 0; i < length; i++)
{ cpy2[i] = *(startptr + i);
}
/* Display Results */
printf("strncpy result:\n");
printf("%s\n\n", cpy1);
printf("loop result:\n");
printf("%s\n", cpy2);
}
It seems to me that strncpy is both less typing and more easily readable, but I've seen people advocate looping instead. Is there a difference? Does it even matter? Assume that this is for small values of i (0 < i < 5), and null-termination is assured.
Refs: Strings in c, how to get subString, How to get substring in C, Difference between strncpy and memcpy?
strncpy(char * dst, char *src, size_t len) has two peculiar properties:
if (strlen(src) >= len) : the resulting string will not be nul-terminated.
if (strlen(src) < len) : the end of the string will be filled/padded with '\0'.
The first property will force you to actually check if (strlen(src) >= len) and act appropriately (or brutally set the final character to nul with dst[len-1] = '\0';, as @Gilles does above). The other property is not particularly dangerous, but can waste a lot of cycles. Imagine:
char buff[10000];
strncpy(buff, "Hello!", sizeof buff);
which touches 10000 bytes, where only 7 need to be touched.
My advice:
A: if you know the sizes, just do memcpy(dst,src,len); dst[len] = 0;
B: if you don't know the sizes, get them somehow (using strlen and/or sizeof and/or the allocated size for dynamically allocced memory). Then: goto A above.
Since for safe operation the strncpy() version already needs to know the sizes (and the checks on them!), the memcpy() version is not more complex or more dangerous than the strncpy() version. (Technically it is even marginally faster, because memcpy() does not have to check for the '\0' byte.)
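That advice might be packaged as a helper along these lines (strxcpy is a made-up name for this sketch; the semantics mimic BSD strlcpy):

```c
#include <string.h>

/* Copy src into dst (capacity cap bytes), always nul-terminating,
 * truncating if necessary. Returns strlen(src), like BSD strlcpy,
 * so the caller can detect truncation. */
size_t strxcpy(char *dst, const char *src, size_t cap)
{
    size_t len = strlen(src);              /* learn the size first...       */
    if (cap) {
        size_t n = len < cap - 1 ? len : cap - 1;
        memcpy(dst, src, n);               /* ...then copy exactly that much */
        dst[n] = '\0';
    }
    return len;
}
```

A return value >= cap tells the caller the copy was truncated.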
While this may seem counter-intuitive, there are more optimized ways to copy a string than by using the assignment operator in a loop. For instance, IA-32 provides the REP prefix for MOVS, STOS, CMPS etc for string handling, and these can be much faster than a loop that copies one char at a time. The implementation of strncpy or strcpy may choose to use such hardware-optimized code to achieve better performance.
As long as you know your lengths are "in range" and everything is correctly nul terminated, then strncpy is better.
If you need to get length checks etc in there, looping could be more convenient.
A loop with assignment is a bad idea because you're reinventing the wheel. You might make a mistake, and your code is likely to be less efficient than the code in the standard library (some processors have optimized instructions for memory copies, and optimized implementations usually at least copy word by word if possible).
However, note that strncpy is not a well-rounded wheel. In particular, if the string is too long, it does not append a null byte to the destination. The BSD function strlcpy is better designed, but not available everywhere. Even strlcpy is not a panacea: you need to get the buffer size right, and be aware that it might truncate the string.
A portable way to copy a string, with truncation if the string is too long, is to call strncpy and always add the terminating null byte. If the buffer is an array:
char buffer[BUFFER_SIZE];
strncpy(buffer, source, sizeof(buffer)-1);
buffer[sizeof(buffer)-1] = 0;
If the buffer is given by a pointer and size:
strncpy(buf, source, buffer_size-1);
buf[buffer_size-1] = 0;

UTF8 support on cross platform C application

I am developing a cross platform C (C89 standard) application which has to deal with UTF8 text. All I need is basic string manipulation functions like substr, first, last etc.
Question 1
Is there a UTF8 library that has the above functions implemented? I have already looked into ICU and it is too big for my requirement. I just need to support UTF8.
I have found a UTF8 decoder here. The following function prototypes are from that code.
void utf8_decode_init(char p[], int length);
int utf8_decode_next();
The initialization function takes a character array, but utf8_decode_next() returns an int. Why is that? How can I print the characters this function returns using standard functions like printf? The function is dealing with character data, so how can that be assigned to an integer?
If the above decoder is not good for production code, do you have a better recommendation?
Question 2
I also got confused by reading articles that says, for unicode you need to use wchar_t. From my understanding this is not required as normal C strings can hold UTF8 values. I have verified this by looking at source code of SQLite and git. SQLite has the following typedef.
typedef unsigned char u8;
Is my understanding correct? Also why is unsigned char required?
The utf8_decode_next() function returns the next Unicode code point. Since Unicode is a 21-bit character set, it cannot return anything smaller than an int, and it can be argued that, technically, it should be a long since an int could be a 16-bit quantity. Effectively, the function returns a UTF-32 character.
You would need to look at the C94 wide character extensions to C89 to print wide characters (wprintf(), <wctype.h>, <wchar.h>). However, wide characters alone are not guaranteed to be UTF-8 or even Unicode. You most probably cannot print the characters from utf8_decode_next() portably, but it depends on what your portability requirements are. The wider the range of systems you must port to, the less chance there is of it all working simply. To the extent you can write UTF-8 portably, you would send the UTF-8 string (not an array of the UTF-32 characters obtained from utf8_decode_next()) to one of the regular printing functions. One of the strengths of UTF-8 is that it can be manipulated by code that is largely ignorant of it.
You need to understand that a 4-byte wchar_t can hold any Unicode codepoint in a single unit, but that UTF-8 can require between one and four 8-bit bytes (1-4 units of storage) to hold a single Unicode codepoint. On some systems, I believe wchar_t can be a 16-bit (short) integer. In this case, you are forced into using UTF-16, which encodes Unicode codepoints outside the Basic Multilingual Plane (BMP, code points U+0000 .. U+FFFF) using two storage units and surrogates.
Using unsigned char makes life easier; plain char is often signed. Having negative numbers makes life more difficult than it need be (and, believe me, it is difficult enough without adding complexity).
You do not need any special library routines for character or substring search with UTF-8. strstr does everything you need. That's the whole point of UTF-8 and the design requirements it was invented to meet.
GLib has quite a few relevant functions, and can be used independent of GTK+.
There are over 100,000 characters in Unicode. There are 256 possible values of char in most C implementations.
Hence, UTF-8 uses more than one char to encode each character, and the decoder needs a return type which is larger than char.
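A minimal decode-next function illustrates why: the decoded value can be up to 21 bits wide, so it needs at least an int (a sketch with only basic validation; overlong forms and surrogates are not rejected):

```c
/* Decode one UTF-8 code point starting at *s and advance *s past it.
 * Returns the code point, or -1 on a malformed sequence. */
int utf8_next(const unsigned char **s)
{
    unsigned char b = *(*s)++;
    int cp, extra;
    if (b < 0x80)               return b;                 /* ASCII */
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else                         return -1;               /* bad lead byte */
    while (extra--) {
        if ((**s & 0xC0) != 0x80) return -1;              /* bad continuation */
        cp = (cp << 6) | (*(*s)++ & 0x3F);                /* accumulate 6 bits */
    }
    return cp;
}
```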
wchar_t is a larger type than char (well, it doesn't have to be larger, but it usually is). It represents the characters of the implementation-defined wide character set. On some implementations (most importantly, Windows, which uses surrogate pairs for characters outside the "basic multilingual plane"), it still isn't big enough to represent any Unicode character, which presumably is why the decoder you reference uses int.
You can't print wide characters using printf, because it deals in char. wprintf deals in wchar_t, so if the wide character set is unicode, and if wchar_t is int on your system (as it is on linux), then wprintf and friends will print the decoder output without further processing. Otherwise it won't.
In any case, you cannot portably print arbitrary unicode characters, because there's no guarantee that the terminal can display them, or even that the wide character set is in any way related to Unicode.
SQLite has probably used unsigned char so that:
they know the signedness - it's implementation-defined whether char is signed or not.
they can do right-shifts and assign out-of-range values, and get consistent and defined results across all C implementations. Implementations have more freedom in how signed char behaves than unsigned char.
Normal C strings are fine for storing utf8 data, but you can't easily search for a substring in your utf8 string. This is because a character encoded as a sequence of bytes using the utf8 encoding could be anywhere from one to 4 bytes depending on the character. i.e. a "character" is not equivalent to a "byte" for utf8 like it is for ASCII.
In order to do substring searches etc., you will need to decode it to some internal format that is used to represent Unicode characters and then do the substring search on that. Since there are far more than 256 Unicode characters, a byte (or char) is not enough. That's why the library you found uses ints.
As for your second question, it's probably just because it does not make sense to talk about negative characters, so they may as well be specified as "unsigned".
I have implemented a substr & length functions which supports UTF8 characters. This code is a modified version of what SQLite uses.
The following macro loops through the input text and skips all multi-byte sequence characters. The if condition checks whether this is the start of a multi-byte sequence, and the loop inside it increments input until it finds the next head byte.
#define SKIP_MULTI_BYTE_SEQUENCE(input) { \
if( (*(input++)) >= 0xc0 ) { \
while( (*input & 0xc0) == 0x80 ){ input++; } \
} \
}
substr and length are implemented using this macro.
typedef unsigned char utf8;
substr
void substr(const utf8 *string,
int start,
int len,
utf8 **substring)
{
int bytes, i;
const utf8 *str2;
utf8 *output;
--start;
while( *string && start ) {
SKIP_MULTI_BYTE_SEQUENCE(string);
--start;
}
for(str2 = string; *str2 && len; len--) {
SKIP_MULTI_BYTE_SEQUENCE(str2);
}
bytes = (int) (str2 - string);
output = *substring;
for(i = 0; i < bytes; i++) {
*output++ = *string++;
}
*output = '\0';
}
length
int length(const utf8 *string)
{
int len;
len = 0;
while( *string ) {
++len;
SKIP_MULTI_BYTE_SEQUENCE(string);
}
return len;
}
