How to know the number of characters in a UTF-8 string - C

I want to know whether there is a simple way to determine the number of characters in a UTF-8 string.
For example, on Windows it can be done by:
converting the UTF-8 string to a wchar_t string
using the wcslen function to get the result
But I need a simpler, cross-platform solution.
Thanks in advance.

UTF-8 characters are either single bytes whose left-most bit is 0, or multi-byte sequences whose first byte starts with two or more 1 bits followed by a 0 (110..., 1110..., and so on), followed by continuation bytes of the form 10... (i.e. a single 1 on the left). Assuming that your string is well-formed, you can loop over all the bytes and increment your "character count" every time you see a byte that is not of the form 10... - i.e. count only the first byte of each UTF-8 character.

The entire concept of a "number of characters" does not really apply to Unicode, as code points do not map 1:1 to glyphs. The method proposed by @borrible is fine if you want to establish storage requirements in uncompressed form, but that is all it can tell you.
For example, there are code points like the "zero width space", which do not take up space on the screen when rendered, but occupy a code point, or modifiers for diacritics or vowels. So any statistic would have to be specific to the concrete application.
A proper Unicode renderer will have a function that can tell you how many pixels will be used for rendering a string if that information is what you're after.

If the string is known to be valid UTF-8, simply take the length of the string in bytes, excluding bytes whose values are in the range 0x80-0xbf (the continuation bytes):
size_t i, cnt;
for (cnt = i = 0; s[i]; i++)
    if (s[i] < 0x80 || s[i] > 0xbf) cnt++; /* count only lead bytes */
Note that s must point to an array of unsigned char in order for the comparisons to work.

Related

Length of Greek character string is larger than it should be

I'm writing a program that takes a string of Greek characters as input, and when I print its length, the output is double what it should be. For example, if ch="ΑΒ" (Greek characters) or ch="αβ",
printf("%d", strlen(ch)); outputs 4 instead of 2. And if ch="ab", it outputs 2. What's going on?
You can use the mbstowcs() function to convert a multibyte string to a wide-character string, and then use wcslen() to determine its length.
Probably because your string is encoded using variable-width character encoding.
In the good old days, we only bothered with 128 different characters: a-z, A-Z, 0-9, and some commas and brackets and control things. Everything was taken care of in 7 bits, and we called it ASCII. Then that wasn't enough and we added some other things like letters with lines or dots on top, and we went to 8 bits (1 byte) and could do any of 256 characters in one byte. (Although people's ideas of what should go in those extra 128 slots varied widely, based on what was most useful in their language - see comment from usr2564301 - and you then had to say whose version you were using for what should be in those extra slots.)
If you had 2 characters in your string, it would be 2 bytes long (plus a null terminator perhaps), always.
But then people woke up to the fact that English isn't the only language in the world, and there were in fact thousands of letters in hundreds of languages around the globe. Now what to do?
Well, we could say there are only about 65,000 characters that interest us, and encode all letters in two bytes. There are some encoding formats that do this. A two-letter string will then always be 4 bytes (um, perhaps with some byte order mark at the front, and maybe a null terminator at the end). Two problems: a) not very backwards compatible with ASCII, and b) wasteful of bytes if most text is stuff that is in the good ol' ASCII character set anyway.
Step in UTF-8, which I'll wager is what your string is using for its encoding, or something similar. ASCII characters, like 'a' and 'b', are encoded with one byte, and more exotic characters (--blush-- from an English-speaking perspective) take up more than one byte, of which the first byte is to say "what follows is to be taken along with this byte to represent a letter". So you get variable-width encoding. So the length of a two-letter string will be at least two bytes, but if it includes non-ASCII characters, it'll be more.

How can I print a string with the same length with or without multibyte characters?

I am trying to do exercise 1-22 in the K&R book. It asks you to fold long lines (i.e. continue them on a new line) after a predefined number of characters.
While testing the program, which otherwise worked well, I saw that some lines were "folding" earlier than they should. I noticed it was the lines on which special characters appeared, such as:
ö ş ç ğ
So, my question is, how do I ensure that lines are printed with the same maximum length whether or not they contain multibyte characters?
What happens in your code?
K&R was written at a time when every character was encoded in a single char. Examples of such encoding standards are ASCII and ISO 8859.
Nowadays the leading standard is Unicode, which comes in several flavors. The UTF-8 encoding represents the thousands of Unicode characters on 8-bit bytes, using a variable-length scheme:
the ASCII characters (i.e. 0x00 to 0x7F) are encoded on a single byte.
all other characters are encoded on 2 to 4 bytes.
So the letter ö and the others in your list are encoded as 2 consecutive bytes. Unfortunately, the standard C library and the algorithms of K&R do not handle variable-length encodings, so each of your special characters is counted as two, which trips up your algorithm.
How to solve it?
There is no easy way. You must distinguish between the length of a string in memory and its length when displayed.
One trick that uses a property of the encoding scheme: whenever you count the display length of a string, simply skip the bytes c that satisfy (c & 0xC0) == 0x80. (Note the parentheses: == binds tighter than & in C, so c & 0xC0 == 0x80 would not test what you want.)
Another way would be to use wide characters wchar_t/wint_t (requires the header wchar.h) instead of char/int, and getwc()/putwc() instead of getc()/putc(). If on your environment sizeof(wchar_t) is 4, you will be able to work with Unicode just by using the wide characters and wide library functions instead of the normal ones mentioned in K&R. If, however, sizeof(wchar_t) is smaller (for example 2), you can work correctly with a large subset of Unicode but may still run into problems in some cases.
As in the comment, your string is probably encoded in UTF-8. That means that some characters, including the ones you mention, use more than one byte. If you simply count bytes to determine the width of your output, your computed value may be too large.
To properly determine the number of characters in a string containing multibyte characters, use a function such as mbrlen(3).
You can use mbrtowc(3) to find out the number of bytes of the first character in a string, if you're counting character by character.
This of course goes way beyond the scope of the K&R book, which was written before multibyte encodings were in common use.

What is meant by octet in the iCalendar Specification?

Lines of text SHOULD NOT be longer than 75 octets, excluding the line break. Long content lines SHOULD be split into a multiple line representation using a line "folding" technique. That is, a long line can be split between any two characters by inserting a CRLF immediately followed by a single linear white-space character (i.e., SPACE or HTAB). — iCalendar Specification, §3.1 Content Lines
What is meant by octet here?
Does it mean the number of characters?
No. It really means octet, as in 8 bits. UTF-8 characters have a variable length (multi-octet). You have another hint here:
Note: It is possible for very simple implementations to generate
improperly folded lines in the middle of a UTF-8 multi-octet
sequence. For this reason, implementations need to unfold lines
in such a way to properly restore the original sequence.
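This matters when folding: a naive implementation that breaks every 75 octets can cut a multi-octet sequence in half. A sketch that moves the break backwards so it never lands inside a UTF-8 sequence (using UTF-8 is an assumption, though iCalendar permits it; continuation octets match (c & 0xC0) == 0x80, and the code assumes well-formed input):

```c
#include <stdio.h>
#include <string.h>

#define MAX_OCTETS 75

/* Move a proposed break position backwards until it no longer lands
 * on a UTF-8 continuation byte (10xxxxxx), so no multi-octet
 * sequence is ever split.  Assumes well-formed UTF-8. */
static size_t safe_break(const char *s, size_t brk)
{
    while (brk > 0 && ((unsigned char)s[brk] & 0xC0) == 0x80)
        brk--;
    return brk;
}

/* Fold one long content line: CRLF followed by a single space. */
static void fold_line(const char *s)
{
    size_t len = strlen(s), pos = 0;
    while (len - pos > MAX_OCTETS) {
        size_t brk = safe_break(s, pos + MAX_OCTETS);
        printf("%.*s\r\n ", (int)(brk - pos), s + pos);
        pos = brk;
    }
    printf("%s\r\n", s + pos);
}
```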

Finding the number of occurrences of each character in a String or character array

I am going over some interview preparation material and I was wondering what the best way to solve this problem would be if the characters in the String or array can be Unicode characters. If they were strictly ASCII, you could make an int array of size 256, map each ASCII character to an index, and let that position in the array represent the number of occurrences. If the string has Unicode characters, is that still possible, i.e. are Unicode characters of a reasonable size that you could represent them using the indexes of an integer array? Since Unicode characters can be more than 1 byte in size, what data type would you use to represent them? What would be the optimal solution for this case?
Since Unicode only defines code points in the range [0, 2^21), you only need an array of 2^21 (i.e. about 2 million) elements, which should fit comfortably into memory.
An array wouldn't be practical when using Unicode. This is because Unicode defines (fewer than) 2^21 characters.
Instead, consider using two parallel vectors, one for the character and one for the count. The setup would look something like this:
<'c', '$', 'F', '¿', '¤'> //unicode characters
< 1 , 3 , 1 , 9 , 4 > //number of times each character has appeared.
EDIT
After seeing Kerrek's answer, I must admit, an array of size 2 million would be reasonable. The amount of memory it would take up would be in the Megabyte range.
But as it's for an interview, I wouldn't recommend having an array 2 million elements long, especially if many of those slots will be unused (not all Unicode characters will appear, most likely). They're probably looking for something a little more elegant.
SECOND EDIT
As per the comments here, Kerrek's answer does indeed seem to be more efficient as well as easier to code.
While others here are focusing on data structures, you should also know that the notion of "Unicode character" is somewhat ill-defined. That's a potential interview trap. Consider: are å and å the same character? The first one is a "latin small letter a with ring above" (codepoint U+00E5). The second one is a "latin small letter a" (codepoint U+0061) followed by a "combining ring above" (U+030A). Depending on the purpose of the count, you might need to consider these as the same character.
You might want to look into Unicode normalization forms. It's great fun.
Convert string to UTF-32.
Sort the 32-bit characters.
Getting character counts is now trivial.

SQLite3 stores nonreadable text

I used SQLite3 to implement a small application that reads from and writes to a database. Some records that need to be added to the database are Arabic text, and when they are stored in the database they are converted to unreadable text. I use these APIs for writing and reading:
sqlite3_open
sqlite3_prepare
sqlite3_bind_text
sqlite3_step
What can I do to solve the problem ?
It is most likely that your text is in a non-ASCII encoding, for example Unicode.
The ASCII table has only characters represented by the integers 0 to 127, so there is nothing in it that can represent Arabic letters. Unicode, for example, uses five different ranges to represent the Arabic language:
Arabic (0600—06FF, 224 characters)
Arabic Supplement (0750—077F, 48 characters)
Arabic Presentation Forms-A (FB50—FDFF, 608 characters)
Arabic Presentation Forms-B (FE70—FEFF, 140 characters)
Rumi Numeral Symbols (10E60—10E7F, 31 characters)
And since there can be more letters/characters than an 8-bit value (the char type, which is 1 byte long) would allow, wide characters are used to represent some (or even all) of those letters.
As a result, the length of the string in characters will differ from its length in bytes. My assumption is that when you use the sqlite3_bind_text function, you pass a number of characters as the fourth parameter, whereas it should be a number of bytes. Or you could be misinterpreting this length when reading the string back from the database. The sqlite3_bind_text documentation says this about the fourth parameter:
In those routines that have a fourth argument, its value is the number
of bytes in the parameter. To be clear: the value is the number of
bytes in the value, not the number of characters. If the fourth
parameter is negative, the length of the string is the number of bytes
up to the first zero terminator.
Make sure you do the right thing there.
See also:
Wide characters
Unicode
Arabic characters in Unicode
Good luck!
