ANSI C UTF-8 problem - c

First, I am developing a platform-independent library in ANSI C (not C++, and without any non-standard libraries such as the MS CRT or glibc).
After some searching, I found that one of the best ways to do internationalization in ANSI C is to use the UTF-8 encoding.
In utf-8:
strlen(s): always counts the number of bytes.
mbstowcs(NULL,s,0): The number of characters can be counted.
But I have a problem when I want random access to the elements (characters) of a UTF-8 string.
In ASCII encoding:
char get_char(char* assci_str, int n)
{
// It is very FAST.
return assci_str[n];
}
In UTF-16/32 encoding:
wchar_t get_char(wchar_t* wstr, int n)
{
// It is very FAST.
return wstr[n];
}
And here is my problem in UTF-8 encoding:
// What is the return type?
// Because a UTF-8 character occupies 8, 16, 24 or 32 bits.
/*?*/ get_char(char* utf8str, int n)
{
// I can find the Nth character of the string with a for loop.
// But that is too slow.
// What is the best way?
}
Thanks.

Perhaps you're thinking about this a bit wrongly. UTF-8 is an encoding which is useful for serializing data, e.g. writing it to a file or the network. It is a very non-trivial encoding, though, and a raw string of Unicode codepoints can end up in any number of encoded bytes.
What you should probably do, if you want to handle text (given your description), is to store raw, fixed-width strings internally. If you're going for Unicode (which you should), then you need 21 bits per codepoint, so the nearest integral type is uint32_t. In short, store all your strings internally as arrays of integers. Then you can random-access each codepoint.
Only encode to UTF-8 when you are writing to a file or console, and decode from UTF-8 when reading.
By the way, a Unicode codepoint is still a long way from a character. The concept of a character is just far too high-level to have a simple general mechanic. (E.g. "a" + "accent grave" -- two codepoints, how many characters?)
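The "decode when reading" step could look roughly like the sketch below. The function name utf8_to_utf32 and its error convention are made up for illustration; it rejects overlong forms, surrogates and values above U+10FFFF, but it is a minimal sketch rather than a hardened decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into an array of code points.
 * Returns the number of code points written, or (size_t)-1 if the input
 * is malformed or `out` (capacity `cap` elements) is too small. */
size_t utf8_to_utf32(const char *s, uint32_t *out, size_t cap)
{
    /* Smallest code point that legitimately needs 1, 2, 3, 4 bytes. */
    static const uint32_t min_cp[4] = { 0x0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p) {
        uint32_t cp;
        int extra, k;              /* extra = continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;    /* stray continuation or invalid byte */
        p++;

        for (k = 0; k < extra; k++, p++) {
            if ((*p & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (*p & 0x3F);
        }

        /* Reject overlong encodings, surrogates, and out-of-range values. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (n == cap) return (size_t)-1;
        out[n++] = cp;
    }
    return n;
}
```

After decoding, out[i] gives the i-th code point in constant time, which is the random access the question asks for (at the level of code points, not user-perceived characters).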

You simply can't. If you do need a lot of such queries, you can build an index for the UTF-8 string, or convert it to UTF-32 up front. UTF-32 is a better in-memory representation while UTF-8 is good on disk.
By the way, the code you listed for UTF-16 is not correct either. You may want to take care of the surrogate characters.
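For occasional lookups, the linear scan the question mentions can at least be kept cheap: only bytes that are not continuation bytes (10xxxxxx) start a code point. The helper below is a sketch with a hypothetical name, utf8_nth, and it assumes the input is already valid UTF-8. An index would simply cache the offsets this scan discovers.

```c
#include <stddef.h>

/* Return a pointer to the first byte of the n-th code point (0-based)
 * in a NUL-terminated UTF-8 string, or NULL if the string has fewer
 * code points. Assumes the input is valid UTF-8. */
const char *utf8_nth(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;

    for (; *p; p++) {
        if ((*p & 0xC0) != 0x80) {      /* not a continuation byte */
            if (n == 0)
                return (const char *)p;
            n--;
        }
    }
    return NULL;
}
```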

What do you want to count? As Kerrek SB has noted, you can have decomposed glyphs, i.e. "é" can be represented as a single character (LATIN SMALL LETTER E WITH ACUTE U+00E9), or as two characters (LATIN SMALL LETTER E U+0065 COMBINING ACUTE ACCENT U+0301). Unicode has composed and decomposed normalization forms.
What you are probably interested in counting is not characters, but grapheme clusters. You need some higher-level library to deal with this, and to deal with normalization forms, proper (locale-dependent) collation, proper line-breaking, proper case-folding (e.g. German ß->SS), proper bidi support, etc. Real I18N is complex.

Contrary to what others have said, I don't really see a benefit in using UTF-32 instead of UTF-8: When processing text, grapheme clusters (or 'user-perceived characters') are far more useful than Unicode characters (i.e. raw codepoints), so even UTF-32 has to be treated as a variable-length coding.
If you do not want to use a dedicated library, I suggest using UTF-8 as on-disk, endian-agnostic representation and modified UTF-8 (which differs from UTF-8 by encoding the zero character as a two-byte sequence) as in-memory representation compatible with ASCIIZ.
The necessary information for splitting strings into grapheme clusters can be found in Unicode Standard Annex #29 and the Unicode Character Database.

Related

Print UTF-16 string

So I want to parse an ID3v2.4 file. There are 4 types of text encoding in the format specification: ISO-8859-1, UTF-16 with BOM, UTF-16BE and UTF-8. I have already written code that obtains the bytes of the strings.
My question is how to print the UTF-16 with BOM and UTF-16BE bytes to the console.
One important condition: I can use only the C standard library. I can't use C++ libraries, and I can't even use third-party C libraries.
In general (NOT specifically for parsing ID3v2.4 files alone) you will want to choose a common character encoding that your code will use internally; then convert from any other character encoding into your chosen character encoding (for input data - e.g. from user or files or network) and convert back again (for output, to user or files or network).
For choosing a common character encoding:
you want something that minimizes "nonconvertible cases" - e.g. you wouldn't want to choose ASCII because there's far too much in far too many other character encodings that can't be converted to ASCII. This mostly means that you'll want a Unicode encoding.
you want something that is convenient. For Unicode encoding, this only really gives you 2 choices - UTF-8 (because you don't have to care about endian issues, and it's relatively efficient for space/memory consumption, and C functions like strlen() can still work) and versions of UTF-32 (because each codepoint takes up a fixed amount of space and it makes conversion a little simpler). Of these, the benefits of UTF-32 are mostly unimportant (unless you're doing a font rendering engine).
the "whatever random who-knows-what" character encoding that the C compiler uses is irrelevant (for both char and wchar_t), because it's implementation specific and not portable.
the "whatever random who-knows-what" character encoding that the terminal uses is irrelevant (the terminal should be considered "just another flavor of input/output, where conversion is involved").
Assuming you choose UTF-8:
You might be able to force the compiler to treat string literals as UTF-8 for you (e.g. u8"hello", which C11 added to C just as C++11 added it to C++; older C standards have no equivalent). Otherwise you'll need to do it yourself where necessary.
I'd recommend using the uint8_t type for storing strings; partly because char is "signed or unsigned, depending on which way the wind is blowing" (which makes conversions to/from other character encodings painful due to "shifting a signed/negative number right" problems), and partly because it helps to find "accidentally used something that isn't UTF-8" bugs (e.g. warnings from the compiler about "conversion from signed to unsigned").
Conversion between UTF-8 and UTF-32LE, UTF-32BE, UTF-16LE, UTF-16BE is fairly trivial (the relevant Wikipedia articles are enough to describe how it works).
"UTF-16 with BOM" means that the first 2 bytes will tell you if it's UTF-16LE or UTF-16BE, so (after you add support for UTF-16LE and UTF-16BE) it's trivial. "UTF-32 with BOM" is similar (the first 4 bytes tell you if it's UTF-32LE or UTF-32BE).
Conversion between ISO-8859-1 and UTF-8 is fairly trivial, because the characters match Unicode codepoints with the same value. However, people often get it wrong (e.g. say it's ISO-8859-1 when the data is actually encoded as Windows-1252 instead); and for the conversion from UTF-8 to ISO-8859-1 you will need to deal with "nonconvertible" codepoints.
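As a rough sketch of the UTF-16 case, using only standard C: the function below reads raw UTF-16 bytes, honours a BOM if one is present, combines surrogate pairs, and writes UTF-8. The names put_utf8, get16 and utf16_to_utf8, and the convention of returning (size_t)-1 on error, are invented for this example. The big_endian argument is the byte order to assume when there is no BOM (pass 1 for the BOM-less UTF-16BE case).

```c
#include <stddef.h>
#include <stdint.h>

/* Write one code point as UTF-8; returns the number of bytes (1..4). */
static size_t put_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Read one 16-bit code unit in the given byte order. */
static uint32_t get16(const unsigned char *p, int big_endian)
{
    return big_endian ? ((uint32_t)p[0] << 8) | p[1]
                      : ((uint32_t)p[1] << 8) | p[0];
}

/* Convert `len` bytes of UTF-16 into NUL-terminated UTF-8 in `out`.
 * A leading BOM overrides `big_endian`. Returns the UTF-8 length, or
 * (size_t)-1 on malformed input or if `out` (cap bytes) may be too small. */
size_t utf16_to_utf8(const unsigned char *in, size_t len, int big_endian,
                     char *out, size_t cap)
{
    size_t i = 0, n = 0;

    if (cap == 0 || len % 2 != 0) return (size_t)-1;
    if (len >= 2 && in[0] == 0xFE && in[1] == 0xFF) { big_endian = 1; i = 2; }
    else if (len >= 2 && in[0] == 0xFF && in[1] == 0xFE) { big_endian = 0; i = 2; }

    while (i < len) {
        uint32_t u = get16(in + i, big_endian);
        i += 2;
        if (u >= 0xD800 && u <= 0xDBFF) {           /* high surrogate */
            uint32_t lo;
            if (i >= len) return (size_t)-1;
            lo = get16(in + i, big_endian);
            i += 2;
            if (lo < 0xDC00 || lo > 0xDFFF) return (size_t)-1;
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return (size_t)-1;                      /* lone low surrogate */
        }
        if (n + 4 >= cap) return (size_t)-1;        /* keep room for NUL */
        n += put_utf8(u, (unsigned char *)out + n);
    }
    out[n] = '\0';
    return n;
}
```

The result can then be written with fputs(out, stdout), which only displays correctly if the console itself expects UTF-8; that part is outside what standard C can control.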

How to use Unicode in C? [duplicate]

What prerequisites are needed to do strict Unicode programming?
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
And what is the role played by multibyte character sequences in this scenario?
C99 or earlier
The C standard (C99) provides for wide characters and multi-byte characters, but since there is no guarantee about what those wide characters can hold, their value is somewhat limited. For a given implementation, they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.
Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU - International Components for Unicode - library) is sound, IMO.
The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.
Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.
One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format - see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are multi-byte formats.
UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of characters in a string reliably, starting at any point in the string. Single-byte characters have the high bit set to zero. Multi-byte characters have a first byte starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), with subsequent bytes always starting 10. The continuation bytes are always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the minimum possible format. One consequence of these rules is that the bytes 0xC0 and 0xC1 (also 0xF5..0xFF) cannot appear in valid UTF-8 data.
U+0000 .. U+007F 1 byte 0xxx xxxx
U+0080 .. U+07FF 2 bytes 110x xxxx 10xx xxxx
U+0800 .. U+FFFF 3 bytes 1110 xxxx 10xx xxxx 10xx xxxx
U+10000 .. U+10FFFF 4 bytes 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
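The bit patterns in this table translate directly into a byte classifier. The function name utf8_byte_class and its return convention are invented for this sketch.

```c
/* Classify a single byte according to the UTF-8 table:
 *   1..4  lead byte of a sequence of that many bytes
 *   0     continuation byte (10xx xxxx)
 *  -1     byte that can never appear in valid UTF-8 */
int utf8_byte_class(unsigned char b)
{
    if (b < 0x80) return 1;     /* 0xxx xxxx */
    if (b < 0xC0) return 0;     /* 10xx xxxx */
    if (b < 0xC2) return -1;    /* 0xC0, 0xC1: would only encode overlong forms */
    if (b < 0xE0) return 2;     /* 110x xxxx */
    if (b < 0xF0) return 3;     /* 1110 xxxx */
    if (b < 0xF5) return 4;     /* 1111 0xxx, up to U+10FFFF */
    return -1;                  /* 0xF5..0xFF */
}
```

Because continuation bytes are distinguishable from lead bytes, code that lands in the middle of a string can step backwards while this returns 0 to find the start of the current character.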
Originally, it was hoped that Unicode would be a 16-bit code set and everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.
UTF-16 thus is a single unit (16-bit word) code set for the 'Basic Multilingual Plane', meaning the characters with Unicode code points U+0000 .. U+FFFF, but uses two units (32-bits) for characters outside this range. Thus, code that works with the UTF-16 encoding must be able to handle variable width encodings, just like UTF-8 must. The codes for the double-unit characters are called surrogates.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
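The arithmetic behind surrogate pairs is short: subtract 0x10000, then put the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate. A sketch, with a made-up function name utf16_encode:

```c
#include <stdint.h>

/* Encode one code point as UTF-16 code units.
 * Returns 1 or 2 (units written to out), or 0 if cp is not encodable
 * (a surrogate value itself, or above U+10FFFF). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {                 /* BMP: a single unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high (leading) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low (trailing) surrogate */
    return 2;
}
```

For example, U+1F50A becomes the pair 0xD83D 0xDD0A, while U+20AC stays a single unit.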
UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.
You can find a lot more information at the ICU and Unicode web sites.
C11 and <uchar.h>
The C11 standard changed the rules, but not all implementations have caught up with the changes even now (mid-2017). The C11 standard summarizes the changes for Unicode support as:
Unicode characters and strings (<uchar.h>) (originally specified in
ISO/IEC TR 19769:2004)
What follows is a bare minimal outline of the functionality. The specification includes:
6.4.3 Universal character names
Syntax
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
hex-quad:
hexadecimal-digit hexadecimal-digit
hexadecimal-digit hexadecimal-digit
7.28 Unicode utilities <uchar.h>
The header <uchar.h> declares types and functions for manipulating Unicode characters.
The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (also described in 7.20.1.2).
(Translating the cross-references: <stddef.h> defines size_t,
<wchar.h> defines mbstate_t,
and <stdint.h> defines uint_least16_t and uint_least32_t.)
The <uchar.h> header also defines a minimal set of (restartable) conversion functions:
mbrtoc16()
c16rtomb()
mbrtoc32()
c32rtomb()
There are rules about which Unicode characters can be used in identifiers using the \unnnn or \U00nnnnnn notations. You may have to explicitly enable the support for such characters in identifiers. For example, GCC requires -fextended-identifiers to allow these in identifiers.
Note that macOS Sierra (10.12.5), to name but one platform, does not support <uchar.h>.
Note that this is not about "strict unicode programming" per se, but some practical experience.
What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).
Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
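Our wrappers are not public, but the idea behind the strncpy() replacement and the codepoint counter can be sketched as follows. The names utf8_copy and utf8_count are hypothetical, and both functions assume the source is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst (cap bytes including the terminating NUL).
 * If src does not fit, truncate at a code point boundary instead of
 * in the middle of a multi-byte sequence. Returns bytes copied. */
size_t utf8_copy(char *dst, size_t cap, const char *src)
{
    size_t len = strlen(src);

    if (cap == 0) return 0;
    if (len >= cap) {
        len = cap - 1;
        /* If the first dropped byte is a continuation byte, the cut is
         * inside a sequence: back up to the start of that sequence. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}

/* Count code points: every byte that is not a continuation byte. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

Truncating at a code point boundary can still separate a base letter from a following combining character, which is the extra check mentioned above.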
When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).
We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).
This FAQ is a wealth of info. Between that page and this article by Joel Spolsky, you'll have a good start.
One conclusion I came to along the way:
wchar_t is 16 bits on Windows, but not necessarily 16 bits on other platforms. I think it's a necessary evil on Windows, but probably can be avoided elsewhere. The reason it's important on Windows is that you need it to use files that have non-ASCII characters in the name (along with the W version of functions).
Note that Windows APIs that take wchar_t strings expect UTF-16 encoding. Note also that this is different than UCS-2. Take note of surrogate pairs. This test page has enlightening tests.
If you're programming on Windows, you can't use fopen(), fread(), fwrite(), etc. since they only take char * and don't understand UTF-8 encoding. Makes portability painful.
To do strict Unicode programming:
Only use string APIs that are Unicode aware (NOT strlen, strcpy, ... but their wide-string counterparts wcslen, wcscpy, ...)
When dealing with a block of text, use an encoding that allows storing Unicode chars (UTF-7, UTF-8, UTF-16, ...) without loss. Note that UCS-2 only covers the Basic Multilingual Plane.
Check that your OS default character set is Unicode compatible (ex: utf-8)
Use fonts that are Unicode compatible (e.g. arial_unicode)
Multi-byte character sequences is an encoding that pre-dates the UTF-16 encoding (the one used normally with wchar_t) and it seems to me it is rather Windows-only.
I've never heard of wint_t.
The most important thing is to always make a clear distinction between text and binary data. Try to follow the model of Python 3.x str vs. bytes or SQL TEXT vs. BLOB.
Unfortunately, C confuses the issue by using char for both "ASCII character" and int_least8_t. You'll want to do something like:
typedef char UTF8; // for code units of UTF-8 strings
typedef unsigned char BYTE; // for binary data
You might want typedefs for UTF-16 and UTF-32 code units too, but this is more complicated because the encoding of wchar_t is not defined. You'll need to use preprocessor #ifs. Some useful macros in C and C++0x are:
__STDC_UTF_16__ : If defined, values of type char16_t are UTF-16 encoded.
__STDC_UTF_32__ : If defined, values of type char32_t are UTF-32 encoded.
__STDC_ISO_10646__ : If defined, then wchar_t is UTF-32.
_WIN32 : On Windows, wchar_t is UTF-16, even though this breaks the standard.
WCHAR_MAX : Can be used to determine the size of wchar_t, but not whether the OS uses it to represent Unicode.
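Put together, the #ifs might look something like this. The typedef names utf16_unit and utf32_unit and the helper wchar_encoding are invented for the example, and the check is only a best-effort guess based on the macros listed above.

```c
#include <stdint.h>
#include <wchar.h>      /* WCHAR_MAX */

/* Code-unit types that do not depend on wchar_t at all. */
typedef uint_least16_t utf16_unit;
typedef uint_least32_t utf32_unit;

/* Best-effort description of what wchar_t holds on this platform. */
const char *wchar_encoding(void)
{
#if defined(_WIN32)
    return "UTF-16";
#elif defined(__STDC_ISO_10646__) && WCHAR_MAX >= 0x10FFFF
    return "UTF-32";
#else
    return "unknown";
#endif
}
```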
Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?
See also:
UTF-8 or UTF-16 or UTF-32 or UCS-2
Is wchar_t needed for Unicode support?
No. UTF-8 is a perfectly valid Unicode encoding that uses char* strings. It has the advantage that if your program is transparent to non-ASCII bytes (e.g., a line ending converter which acts on \r and \n but passes through other characters unchanged), you'll need to make no changes at all!
If you go with UTF-8, you'll need to change all the assumptions that char = character (e.g., don't call toupper in a loop) or char = screen column (e.g., for text wrapping).
If you go with UTF-32, you'll have the simplicity of fixed-width characters (but not fixed-width graphemes), but you will need to change the type of all of your strings.
If you go with UTF-16, you'll have to discard both the assumption of fixed-width characters and the assumption of 8-bit code units, which makes this the most difficult upgrade path from single-byte encodings.
I would recommend actively avoiding wchar_t because it's not cross-platform: sometimes it's UTF-32, sometimes it's UTF-16, and sometimes it's a pre-Unicode East Asian encoding. I'd recommend using typedefs.
Even more importantly, avoid TCHAR.
I wouldn't trust any standard library implementation. Just roll your own unicode types.
#include <windows.h>
typedef unsigned char utf8_t;
typedef unsigned short utf16_t;
typedef unsigned long utf32_t;
int main ( int argc, char *argv[] )
{
int msgBoxId;
utf16_t lpText[] = { 0x03B1, 0x0009, 0x03B2, 0x0009, 0x03B3, 0x0009, 0x03B4, 0x0000 };
utf16_t lpCaption[] = L"Greek Characters";
unsigned int uType = MB_OK;
msgBoxId = MessageBoxW( NULL, lpText, lpCaption, uType );
return 0;
}
From what I know, wchar_t is implementation dependent (as can be seen from this wiki article). And it's not unicode.
You basically want to deal with strings in memory as wchar_t arrays instead of char. When you do any kind of I/O (like reading/writing files) you can encode/decode using UTF-8 (this is probably the most common encoding) which is simple enough to implement. Just google the RFCs. So in-memory nothing should be multi-byte. One wchar_t represents one character. When you come to serializing however, that's when you need to encode to something like UTF-8 where some characters are represented by multiple bytes.
You'll also have to write new versions of strcmp etc. for the wide character strings, but this isn't a big issue. The biggest problem will be interop with libraries/existing code that only accept char arrays.
And when it comes to sizeof(wchar_t) (you will need 4 bytes if you want to do it right) you can always redefine it to a larger size with typedef/macro hacks if you need to.

using regular expression with unicode string in C

I'm currently using regular expressions on Unicode strings, but I just need to match ASCII characters and can effectively ignore all non-ASCII characters. Until now the functions in regex.h have worked fine (I'm on Linux, so the encoding is UTF-8). But can someone confirm whether it's really OK to do so? Or do I need a Unicode regex library (like ICU)?
UTF-8 is a variable-length encoding; some characters are 1 byte, some 2, others 3 or 4. You know how many bytes to read from the prefix of each character's first byte: 0 for 1 byte, 110 for 2 bytes, 1110 for 3 bytes, 11110 for 4 bytes.
If you try to read a UTF-8 string as ASCII, or any other fixed-width encoding, things will go very wrong... unless that UTF-8 string contains nothing but 1 byte characters in which case it matches ASCII.
However, since no multi-byte UTF-8 sequence contains a null byte, and none of the extra bytes can be confused with ASCII, if you really are only matching ASCII you might be able to get away with it... but I wouldn't recommend it, because there are much better regex options than POSIX, they're easy to use, and why leave a hidden encoding bomb in your code for some sucker to deal with later? (Note: that sucker may be you)
Instead, use a Unicode-aware regex library like Perl Compatible Regular Expressions (PCRE). PCRE becomes Unicode aware when you pass the PCRE2_UTF flag to pcre2_compile. PCRE regex syntax is more powerful and more widely understood than POSIX regexes, and PCRE has more features. PCRE-style regexes are also available through GLib, which itself provides a feast of very handy C functions.
You need to be careful about your patterns and about the text you're going to match.
As an example, given the expression a.b:
"axb" matches
"aèb" does NOT match
The reason is that è is two bytes long when UTF-8 encoded but . would only match the first one.
So as long as you only match sequences of ASCII characters you're safe. If you mix ASCII and non ASCII characters, you're in trouble.
You can try to match a single UTF-8 encoded "character" with something like:
([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xF4]...|.)
but this assumes that the text is encoded correctly (and, frankly, I never tried it).
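To make the "ASCII-only patterns are safe" case concrete, here is a small sketch using the POSIX regex.h functions. The wrapper name regex_search is made up. It deliberately does not call setlocale(), so the program stays in the default "C" locale and the matcher works byte by byte; in a UTF-8 locale, some implementations treat . as a whole character instead.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the POSIX extended regex `pattern` matches anywhere in
 * `text`, 0 if it does not, -1 if the pattern fails to compile. */
int regex_search(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this, regex_search("foo[0-9]+", text) finds an ASCII token inside UTF-8 text that also contains accented letters, while a.b does not match "aèb" when è is stored as the two bytes C3 A8.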

Unicode Character 'SPEAKER WITH THREE SOUND WAVES' (U+1F50A) in c source code

I want to print Unicode Character 'SPEAKER WITH THREE SOUND WAVES' (U+1F50A) Encodings "\uD83D\uDD0A" in C source code but get this output:
error: \uDD0A is not a valid universal character
error: \uD83D is not a valid universal character
\u notation (with four hexadecimal digits) corresponds to UCS-2, i.e. you can only name characters from the BMP (Basic Multilingual Plane, U+0000 through U+FFFF).
U+1F50A is beyond the BMP, and thus cannot be encoded in 16 bits. UTF-16 uses surrogate pairs for such characters beyond the BMP (values in the 0xD800 - 0xDFFF range, which are not used in UCS-2), but they are explicitly forbidden in \u notation.
You need \U notation (with eight hexadecimal digits) for that.
Also note that the conversion from either \u or \U notation to whatever actually ends up in the string is locale-dependent, so what might work on one platform might not work on another... if you want to be really portable and ensure e.g. UTF-8 or UTF-16 encoding in the string, you need to:
do the encoding manually via hexadecimal \x... or octal \...;
use third-party libraries with proper Unicode support (ICU).
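For this particular character, the first option looks like the sketch below: U+1F50A is the four UTF-8 bytes F0 9F 94 8A. The second array shows the same code point through \U inside a u8 literal, which assumes a C11 (or later) compiler; the array names are just for illustration.

```c
/* U+1F50A spelled out as UTF-8 bytes: the same on every compiler. */
static const char speaker_bytes[] = "\xF0\x9F\x94\x8A";

/* The same code point via \U (eight hex digits) in a C11 u8 literal,
 * which is always encoded as UTF-8. */
static const char speaker_u8[] = u8"\U0001F50A";
```

Printing either array with fputs() shows the speaker symbol only if the terminal expects UTF-8 and has a font containing it.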
While we're at it (and because many people are unaware of this), the above points straight at why Microsoft's 16bit version of wchar_t is broken when you want Unicode: It stems from a time when there was only the BMP, and 16bit UCS-2 was plenty enough. Since it is no longer sufficient to encode all defined Unicode characters, you can use it to hold UTF-16 code values, but wchar_t -- and by extension, std::wstring as well as L"" string literals -- isn't really wide as the name implies, but multibyte at best.
Good that C++ introduced explicit char16_t and char32_t, plus the locale-independent u"", U"" and u8"" string literals. Too bad MSVC doesn't yet support them AFAIK.

Detect UTF-16 file content

Is it possible to know if a file has Unicode (16 bits per char) or 8-bit ASCII content?
You may be able to read a byte-order-mark, if the file has this present.
UTF-16 characters are all at least 16 bits, with some being 32 bits, encoded as a surrogate pair (code units in the range 0xD800 to 0xDFFF). So simply scanning each char to see if it is less than 128 won't work. For example, the two bytes 0x20 0x20 encode two spaces in ASCII and UTF-8, but a single character U+2020 (dagger) in UTF-16. If the text is known to be English with the occasional non-ASCII character, then almost every other byte will be zero. But without some a priori knowledge about the text and/or its encoding, there is no reliable way to distinguish a general ASCII string from a general UTF-16 string.
Ditto to what Brian Agnew said about reading the byte order mark, a special two bytes that might appear at the beginning of the file.
You can also tell whether it is ASCII by scanning every byte in the file and seeing if they are all less than 128. If they are all less than 128, then it's just an ASCII file. If some of them are 128 or greater, there is some other encoding in there.
First off, ASCII is 7-bit, so if any byte has its high bit set you know the file isn't ASCII.
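The high-bit test is a one-line loop. The function name is_ascii is just for this sketch.

```c
#include <stddef.h>

/* Return 1 if no byte in buf[0..len) has its high bit set, else 0. */
int is_ascii(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] & 0x80)
            return 0;
    return 1;
}
```

Note that English text stored as UTF-16 also passes this test, because its bytes are ASCII values interleaved with zeros; the test only rules ASCII out, it does not rule UTF-16 out.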
The various "common" character sets such as ISO-8859-x, Windows-1252, etc, are 8-bit, so if every other byte is 0, you know that you're dealing with Unicode that only uses the ISO-8859 characters.
You'll run into problems where you're trying to distinguish between Unicode and some encoding such as UTF-8. In this case, almost every byte will have a value, so you can't make an easy decision. You can, as Pascal says do some sort of statistical analysis of the content: Arabic and Ancient Greek probably won't be in the same file. However, this is probably more work than it's worth.
Edit in response to OP's comment:
I think that it will be sufficient to check for the presence of 0-value bytes (ASCII NUL) within your content, and make the choice based on that. The reason being that JavaScript keywords are ASCII, and ASCII is a subset of Unicode. Therefore any Unicode representation of those keywords will consist of one byte containing the ASCII character (low byte), and another containing 0 (the high byte).
My one caveat is that you carefully read the documentation to ensure that their use of the word "Unicode" is correct (I looked at this page to understand the function, did not look any further).
If the file for which you have to solve this problem is long enough each time, and you have some idea what it's supposed to be (say, English text in unicode or English text in ASCII), you can do a simple frequency analysis on the chars and see if the distribution looks like that of ASCII or of unicode.
Unicode is an alphabet, not an encoding. You probably meant UTF-16. There are a lot of libraries around (python-chardet comes to mind instantly) to autodetect the encoding of text, though they all use heuristics.
To programmatically discern the type of a file -- including, but not limited to, the encoding -- the best bet is to use libmagic. It is BSD-licensed and part of just about every Unix system you are likely to encounter, and for lesser ones you can bundle it with your application.
Detecting the mime-type from C, for example, is as simple as:
magic_t Magic = magic_open(MAGIC_MIME|MAGIC_ERROR);
magic_load(Magic, NULL); /* load the default magic database */
const char *mimetype = magic_buffer(Magic, buf, bufsize);
Other languages have their own modules wrapping this library.
Back to your question, here is what I get from file(1) (the command-line interface to libmagic(3)):
% file /tmp/*rdp
/tmp/meow.rdp: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
For your specific use case, it's very easy to tell. Just scan the file; if you find any NUL byte ("\0"), it is almost certainly UTF-16. JavaScript source has to contain ASCII characters, and in UTF-16 each of them is stored as the ASCII byte paired with a zero byte.
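A sketch of that scan, which also guesses the byte order from where the zero bytes fall. The name guess_utf16 and its return codes are invented here, and it is only a heuristic for mostly-ASCII content such as JavaScript source.

```c
#include <stddef.h>

/* Heuristic for mostly-ASCII content:
 *   0  no zero bytes found (not UTF-16 by this test)
 *   1  zero bytes mostly at odd offsets  -> probably UTF-16LE
 *   2  zero bytes mostly at even offsets -> probably UTF-16BE */
int guess_utf16(const unsigned char *buf, size_t len)
{
    size_t i, even = 0, odd = 0;

    for (i = 0; i < len; i++) {
        if (buf[i] == 0) {
            if (i % 2) odd++;
            else       even++;
        }
    }
    if (even + odd == 0)
        return 0;
    return odd >= even ? 1 : 2;
}
```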

Resources