SQLite3 stores non-readable text - C

I used SQLite3 to implement a small application that reads from and writes to a database. Some of the records that need to be added to the database are Arabic texts, and when they are stored in the database they are converted to non-readable, non-understandable text. I use these APIs for writing and reading:
sqlite3_open
sqlite3_prepare
sqlite3_bind_text
sqlite3_step
What can I do to solve the problem?

It is most likely that your text is in a non-ASCII encoding, for example Unicode.
This is because the ASCII table has only 128 characters, represented by the integer values 0 to 127, so there is nothing in it that can represent Arabic letters. Unicode, for example, uses five different ranges to represent the Arabic script:
Arabic (0600—06FF, 224 characters)
Arabic Supplement (0750—077F, 48 characters)
Arabic Presentation Forms-A (FB50—FDFF, 608 characters)
Arabic Presentation Forms-B (FE70—FEFF, 140 characters)
Rumi Numeral Symbols (10E60—10E7F, 31 characters)
And since there can be more letters/characters than an 8-bit value (the char type, which is 1 byte long) can represent, wide characters are used to represent some (or even all) of those letters.
As a result, the length of the string in characters will differ from the length of the string in bytes. My assumption is that when you call the sqlite3_bind_text function, you pass the number of characters as the fourth parameter, whereas it should be the number of bytes. Or you could be misinterpreting this length when reading the string back from the database. The sqlite3_bind_text documentation says this about the fourth parameter:
In those routines that have a fourth argument, its value is the number of bytes in the parameter. To be clear: the value is the number of bytes in the value, not the number of characters. If the fourth parameter is negative, the length of the string is the number of bytes up to the first zero terminator.
Make sure you do the right thing there.
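For illustration, here is a minimal sketch of an insert that passes the byte length correctly; the table and column names (notes, body) are made up for the example, and the text is assumed to already be UTF-8 encoded:

#include <sqlite3.h>
#include <string.h>

int insert_text(sqlite3 *db, const char *text)   /* text: UTF-8 encoded Arabic string */
{
    sqlite3_stmt *stmt;
    int rc = sqlite3_prepare_v2(db, "INSERT INTO notes(body) VALUES (?)", -1, &stmt, NULL);
    if (rc != SQLITE_OK) return rc;

    /* the fourth argument is the length in BYTES, not characters;
       strlen() counts bytes, so it is the right value here
       (passing -1 would also work for a NUL-terminated string) */
    rc = sqlite3_bind_text(stmt, 1, text, (int)strlen(text), SQLITE_TRANSIENT);
    if (rc == SQLITE_OK)
        rc = sqlite3_step(stmt);

    sqlite3_finalize(stmt);
    return rc;
}

Reading the value back with sqlite3_column_text() also yields UTF-8, so the text only looks garbled if whatever displays it assumes a different encoding.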
See also:
Wide characters
Unicode
Arabic characters in Unicode
Good luck!

Related

Converting C strings to Pascal strings

When converting a C string into a Pascal string, why should the length of the original string be less than or equal to 127 instead of 256? I understand that an unsigned byte ranges from 0 to 255 and a signed one ranges from -128 to 127, but isn't the first character of a Pascal string unsigned?
The Pascal string you are referring to is probably the one used in older Pascals (called ShortString in e.g. Delphi and FreePascal, the most popular Pascal implementations these days). That can contain 255 single-byte characters (char in C). There is no need to restrict this to 127 characters.
Perhaps you were thinking of the fact that 255 bytes can only contain 127 UTF-16 code points. But these strings were popular in the old CP/M and DOS days, when no one knew anything about Unicode yet, and were made to contain ASCII or "Extended ASCII" (8 bit, using code pages).
But most modern Pascal implementations allow you to use strings up to 2 GB in size. There, the length indicator is not stored as the first element anymore, just close to the text data. And these days, most of these strings can contain Unicode too, either as UTF-16 or as UTF-8, depending on the string type you choose (modern Pascal implementations have several different string types for different purposes, so there is not one single "Pascal string type" anymore).
Some languages do have the ability to restrict the size of a ShortString, as so-called "counted" strings:
var
  s: string[18];
That string has a maximum of 18 bytes text data and 1 byte length data (at index 0). Such shorter strings can be used in, say, records, so they don't grow too big.
FreePascal's wiki has a great page showing all the types of strings that Pascal (at least that implementation) supports: http://wiki.freepascal.org/Character_and_string_types - it includes length-prefixed and null-terminated string types. None of the types on that page have a length restriction of 127.
The string type you're referring to would match ShortString, which has a single-byte length prefix; however, its documentation states that it accepts 0-255 characters.
I am aware of a string type that has a variable-length-integer prefix, which would restrict the length of the string to 127 characters if you want the in-memory representation to be binary-compatible with ShortString: at 128 characters or more, the MSB of the length byte would be set to 1, which in variable-length integers means the integer occupies at least 2 bytes instead of 1.
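To make the layout concrete, here is a rough C-side sketch of such a length-prefixed string and a conversion from a C string; the type and function names are invented for the example:

#include <string.h>

/* ShortString-like layout: one length byte followed by up to 255 bytes of text */
typedef struct {
    unsigned char len;        /* 0..255 fits in one unsigned byte, so no 127 limit */
    char          data[255];
} short_string;

void c_to_short_string(const char *src, short_string *dst)
{
    size_t n = strlen(src);
    if (n > 255)
        n = 255;              /* a single length byte cannot describe more */
    dst->len = (unsigned char)n;
    memcpy(dst->data, src, n);
}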

Length of Greek character string is larger than it should be

I'm writing a program that takes a string of Greek characters as input, and when I print its length, the output is double what it should be. For example, if ch="ΑΒ" (Greek characters) or ch="αβ",
printf("%zu", strlen(ch)); outputs 4 instead of 2. And if ch="ab", it outputs 2. What's going on?
You can use the mbstowcs() function to convert a multibyte string to a wide-character string, and then use wcslen() to determine its length.
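A small sketch of that approach; it assumes the program runs under a UTF-8 locale, which is why setlocale() is called first:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");                    /* pick up the environment's (UTF-8) locale */
    const char *ch = "αβ";                    /* two Greek letters, four bytes in UTF-8 */

    wchar_t wbuf[64];
    size_t nchars = mbstowcs(wbuf, ch, 64);   /* convert the multibyte string to wide characters */
    if (nchars == (size_t)-1) {
        perror("mbstowcs");
        return EXIT_FAILURE;
    }

    printf("bytes: %zu, characters: %zu\n", strlen(ch), wcslen(wbuf));
    return 0;
}

With a UTF-8 locale, this prints bytes: 4, characters: 2.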
Probably because your string is encoded using variable-width character encoding.
In the good old days, we only bothered with 128 different characters: a-z, A-Z, 0-9, and some commas and brackets and control things. Everything was taken care of in 7 bits, and we called it ASCII. Then that wasn't enough and we added some other things like letters with lines or dots on top, and we went to 8 bits (1 byte) and could do any of 256 characters in one byte. (Although people's ideas of what should go in those extra 128 slots varied widely, based on what was most useful in their language - see comment from usr2564301 - and you then had to say whose version you were using for what should be in those extra slots.)
If you had 2 characters in your string, it would be 2 bytes long (plus a null terminator perhaps), always.
But then people woke up to the fact that English isn't the only language in the world, and there were in fact thousands of letters in hundreds of languages around the globe. Now what to do?
Well, we could say there are only about 65,000 characters that interest us, and encode all letters in two bytes. There are some encoding formats that do this. A two-letter string will then always be 4 bytes (um, perhaps with some byte order mark at the front, and maybe a null terminator at the end). Two problems: a) not very backwards compatible with ASCII, and b) wasteful of bytes if most text is stuff that is in the good ol' ASCII character set anyway.
Step in UTF-8, which I'll wager is what your string is using for its encoding, or something similar. ASCII characters, like 'a' and 'b', are encoded with one byte, and more exotic characters (--blush-- from an English-speaking perspective) take up more than one byte, of which the first byte is to say "what follows is to be taken along with this byte to represent a letter". So you get variable-width encoding. So the length of a two-letter string will be at least two bytes, but if it includes non-ASCII characters, it'll be more.

How can I print a string with the same length with or without multibyte characters?

I am trying to do exercise 1-22 in the K&R book. It asks you to fold long lines (i.e. break them onto a new line) after a predefined number of characters.
While testing the program it seemed to work well, but I saw that some lines were "folding" earlier than they should. I noticed that this happened on lines containing special characters, such as:
ö ş ç ğ
So, my question is: how do I ensure that lines are printed with the same maximum length whether or not they contain multibyte characters?
What happens in your code?
K&R was written at a time when every character was encoded in a single char. Examples of such encoding standards are ASCII and ISO 8859.
Nowadays the leading encoding standard is Unicode, which comes in several flavors. The UTF-8 encoding represents the thousands of Unicode characters in 8-bit bytes, using a variable-length scheme:
the ASCII characters (i.e. 0x00 to 0x7F) are encoded in a single byte;
all other characters are encoded in 2 to 4 bytes.
So the letter ö and the others in your list are encoded as 2 consecutive bytes. Unfortunately, the standard C library and the algorithms in K&R do not handle variable-length encodings, so each of your special characters is counted as two and your algorithm is thrown off.
How to solve it?
There is no easy way. You must make a distinction between the length of the strings in memory, and the length of the strings when they are displayed.
I can propose a trick that uses the properties of the encoding scheme: whenever you count the display length of a string, simply ignore the bytes c in memory that satisfy the condition (c & 0xC0) == 0x80, i.e. the UTF-8 continuation bytes.
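A sketch of that counting rule (the helper name is made up; it assumes the string is valid UTF-8):

#include <stddef.h>

/* display length: count every byte that is not a UTF-8 continuation byte (10xxxxxx) */
size_t display_length(const char *s)
{
    size_t count = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}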
Another way would be to use wide characters wchar_t/wint_t (requires the header wchar.h) instead of char/int, and getwc()/putwc() instead of getc()/putc(). If on your environment sizeof(wchar_t) is 4, you will be able to work with Unicode just by using the wide characters and the wide library functions instead of the normal ones mentioned in K&R. If, however, sizeof(wchar_t) is smaller (for example 2), you can still work correctly with a large subset of Unicode, but characters outside that subset (which need surrogate pairs) can still cause trouble in some cases.
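As a rough illustration of the wide-character route, here is a stripped-down fold that breaks every MAXCOL characters (it ignores the word-boundary handling the exercise asks for, and assumes a UTF-8 locale):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define MAXCOL 40                       /* fold column, chosen arbitrarily here */

int main(void)
{
    setlocale(LC_ALL, "");              /* wide I/O must know the terminal's encoding */
    wint_t c;
    int col = 0;

    while ((c = getwc(stdin)) != WEOF) {
        if (c == L'\n') {
            col = 0;                    /* an explicit newline resets the column */
        } else if (++col > MAXCOL) {
            putwc(L'\n', stdout);       /* fold: start a new output line */
            col = 1;
        }
        putwc(c, stdout);
    }
    return 0;
}

Because getwc() reads whole characters rather than bytes, ö or ş advances the column count by one, not two.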
As in the comment, your string is probably encoded in UTF-8. That means that some characters, including the ones you mention, use more than one byte. If you simply count bytes to determine the width of your output, your computed value may be too large.
To properly determine the number of characters in a string with multibyte characters, use a function such as mbrlen(3).
You can use mbrtowc(3) to find out the number of bytes of the first character in a string, if you're processing it character by character.
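A counting sketch along those lines using mbrlen (the function name count_chars is invented; it assumes setlocale(LC_ALL, "") has been called and the string is in the locale's encoding):

#include <string.h>
#include <wchar.h>

/* count multibyte characters by walking the string one character at a time */
size_t count_chars(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);                   /* initial conversion state */

    size_t count = 0, left = strlen(s);
    while (left > 0) {
        size_t n = mbrlen(s, left, &st);         /* bytes in the next character */
        if (n == (size_t)-1 || n == (size_t)-2)
            break;                               /* invalid or truncated sequence */
        s += n;
        left -= n;
        count++;
    }
    return count;
}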
This of course goes way beyond the scope of the K&R book. It was written before multibyte characters were used.

Calculate string size in UTF-8 when converted from Latin-9 (ISO/IEC 8859-15)

We have a JDBC program which moves data from one database to another.
The source database uses the Latin-9 character set.
The destination database uses UTF-8 encoding, and column sizes are specified in bytes instead of characters.
We have converted the DDL scripts of the source database to equivalent scripts for the destination database, keeping the column sizes as-is.
In some cases, when special characters are present, the size of the data after conversion to UTF-8 exceeds the size of the column in the destination database, causing the JDBC program to fail.
I understand that UTF-8 is a variable-width encoding scheme which can take 1-4 bytes per character. Given this, the worst-case solution would be to allocate 4 times the column size in the destination database.
Is there a better estimate?
Since there's no telling in advance exactly how much a text string will grow, I think that all you can do is a trial run to convert the text to UTF-8, and generate a warning that certain columns need to be increased in size. Any ASCII (unaccented) characters will remain single bytes, and most Latin-9 accented characters will probably be 2 bytes each, but there are some that might be 3. You'd have to look at the Latin-9 and UTF-8 tables to see if any will be 3 or 4 bytes after conversion. Still, you'd have to examine your Latin-9 text to see how much it will grow.
The Euro symbol in Latin-9 takes 3 bytes to represent in UTF-8. The ASCII characters take only 1 byte. The remaining 127 characters take 2 bytes. Depending on what the actual locale is (and which characters are commonly used), an estimate between 1.5x and 2x should be sufficient.
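If an exact figure is preferred over an estimate, the growth can be computed per byte, since every Latin-9 byte maps to a known UTF-8 length. A sketch in C (the function name is made up; 0xA4 is the Euro sign's position in Latin-9):

#include <stddef.h>

/* exact number of UTF-8 bytes needed for a Latin-9 (ISO 8859-15) encoded buffer */
size_t utf8_len_from_latin9(const unsigned char *s, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] < 0x80)
            out += 1;          /* ASCII stays one byte */
        else if (s[i] == 0xA4)
            out += 3;          /* Euro sign -> U+20AC, three bytes */
        else
            out += 2;          /* every other Latin-9 character maps below U+0800 */
    }
    return out;
}

Running this over the existing data (or keeping a running maximum per column) gives a tighter bound than the blanket 4x.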

How to know the number of characters in a UTF-8 string

I want to know whether there is a simple way to determine the number of characters in a UTF-8 string.
For example, in Windows it can be done by:
converting the UTF-8 string to a wchar_t string
using the wcslen function to get the result
But I need a simpler, cross-platform solution.
Thanks in advance.
UTF-8 characters are either single bytes whose left-most bit is 0, or multi-byte sequences whose first byte starts with two or more 1 bits followed by a 0 (110..., 1110..., and so on), followed by successive bytes of the form 10... (i.e. a single 1 on the left). Assuming that your string is well-formed, you can loop over all the bytes and increment your character count every time you see a byte that is not of the form 10... - i.e. counting only the first byte of each UTF-8 character.
The entire concept of a "number of characters" does not really apply to Unicode, as codes do not map 1:1 to glyphs. The method proposed by borrible is fine if you want to establish storage requirements in uncompressed form, but that is all that it can tell you.
For example, there are code points like the "zero width space", which do not take up space on the screen when rendered, but occupy a code point, or modifiers for diacritics or vowels. So any statistic would have to be specific to the concrete application.
A proper Unicode renderer will have a function that can tell you how many pixels will be used for rendering a string if that information is what you're after.
If the string is known to be valid UTF-8, simply take the length of the string in bytes, excluding bytes whose values are in the range 0x80-0xbf:
size_t i, cnt;
/* count every byte that is not a UTF-8 continuation byte (0x80-0xbf) */
for (cnt = i = 0; s[i]; i++) if (s[i] < 0x80 || s[i] > 0xbf) cnt++;
Note that s must point to an array of unsigned char in order for the comparisons to work.
