I need to get the Unicode value of a TCHAR.
e.g. if the TCHAR is 'A', I want to find the Unicode value 0x41.
Is it safe to cast to an int, or is there an API function I should be using?
Your question is a little malformed. A TCHAR can be either an 8 bit or a 16 bit character element. Knowing how wide the element is, on its own, is not enough; you also need to know how it is encoded. For example:
If you have an 8 bit ASCII encoded character, then its numeric value is the Unicode code point.
If you have an 8 bit Windows ANSI encoded character from a single byte character set you convert to UTF-16 with MultiByteToWideChar. The numeric value of the UTF-16 element is the Unicode code point.
If you have an 8 bit Windows ANSI encoded character element from a double byte or multi byte character set, that 8 bit char does not, in general, define a character. In general, you need multiple char elements.
Likewise for a 16 bit UTF-16 encoded character element. Again UTF-16 is a variable width encoding and a single character element does not in general define a Unicode code point.
So, in order to proceed you must become clear as to how your character is encoded.
Before even doing that you need to know how wide it is. TCHAR can be 8 bit or 16 bit, depending on how you compile. That flexibility was how we handled single source development for Win 9x and Win NT; the former had no Unicode support. Nowadays Win 9x is, thankfully, long forgotten, and so too should TCHAR be. Sadly it lives on in countless MSDN examples, but you should ignore those. On Windows the native character element is wchar_t.
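For the simple case of a single ANSI character element, a minimal sketch of the conversion (my own example, assuming a build where TCHAR is char and the byte is not a DBCS lead byte) might look like this:

/* Sketch only: convert one ANSI char element to its Unicode code point.
 * Assumes the char is a complete character in the system ANSI code page
 * (CP_ACP) and not a DBCS lead byte. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    char c = 'A';
    wchar_t w = 0;
    if (MultiByteToWideChar(CP_ACP, 0, &c, 1, &w, 1) == 1)
        printf("code point: 0x%04X\n", (unsigned)w);  /* prints 0x0041 */
    return 0;
}

For a wchar_t build, a non-surrogate UTF-16 element already is its Unicode code point, so a plain cast to unsigned is enough for BMP characters.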
Well, I just guess you want UTF-32 numbers.

Arx said it already: TCHAR can be char or wchar_t.

If you have a char string, it will probably contain data in the default single-byte charset of your system (UTF-8 is possible too). Since dealing with many different charsets is difficult and Windows has built-in conversion facilities, use MultiByteToWideChar to get a wchar_t array from your char array.

If you have a wchar_t array, it's most likely UTF-16 (LE, without BOM) on Windows. I don't know of any built-in function to get UTF-32 from it, but writing your own conversion is not that hard (otherwise, use some library):

http://en.wikipedia.org/wiki/UTF-16

Some bit-fiddling, but nothing more.

(What TCHAR is is a preprocessor thing, so you could implement different behaviour based on the #defines too. Or sizeof, or...)
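A minimal sketch of that bit-fiddling, assuming well-formed UTF-16 input stored as uint16_t (the function name is my own, not a Windows API):

#include <stdint.h>

/* Decode one UTF-16 sequence starting at s into a UTF-32 code point;
 * returns how many 16-bit units were consumed (1 or 2).
 * Assumes well-formed input (no unpaired surrogates). */
static int utf16_to_utf32(const uint16_t *s, uint32_t *out)
{
    if (s[0] >= 0xD800 && s[0] <= 0xDBFF) {          /* high surrogate */
        *out = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10)
                          | (s[1] - 0xDC00));        /* plus low surrogate */
        return 2;
    }
    *out = s[0];                                     /* BMP code point */
    return 1;
}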
I am receiving hex data from a serial port. I have converted the hex data to the corresponding int values. I want to display the equivalent characters on a GTK label. But if you look at the character map, the values from 0x00 to 0x20 are control characters. So I was thinking of adding 256 to each converted int value and showing the corresponding Unicode character on the label. But I am not able to convert an int to Unicode. Say I have an array of ints 266, 267, 289... how should I convert them to gunichar and display them on the GTK label? I know it may seem a very basic problem to you all, but I have struggled a lot and didn't find any answer. Please help.
The GTK functions that set text on UI elements all assume UTF-8 strings. A Unicode code point with a value > 127 written out as a single unsigned byte will not form a valid UTF-8 string. I can think of a couple of ways around this.
Store the code point as a 32-bit integer (which is essentially UTF-32) and use the functions in the iconv library, or something similar, to do the conversion from UTF-32 to UTF-8. There are other conversion implementations in C widely available. Converting your unsigned byte to UTF-32 really amounts to padding it with three leading zero bytes -- which is easy to code.
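A hedged sketch of this first approach using POSIX iconv (built into glibc; elsewhere you may need to link -liconv; the example value and the little-endian assumption are mine):

/* Sketch: convert one code point (stored as UTF-32) to UTF-8 with iconv.
 * Assumes a little-endian host, hence "UTF-32LE". */
#include <iconv.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t cp = 266;                 /* the code point, i.e. UTF-32 */
    char out[8] = {0};

    iconv_t cd = iconv_open("UTF-8", "UTF-32LE");
    if (cd == (iconv_t)-1) return 1;

    char *in = (char *)&cp;
    char *dst = out;
    size_t inleft = sizeof cp, outleft = sizeof out - 1;
    if (iconv(cd, &in, &inleft, &dst, &outleft) == (size_t)-1) return 1;
    iconv_close(cd);

    printf("UTF-8 bytes: %s\n", out);  /* out can now go to gtk_label_set_text() */
    return 0;
}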
Generate a UTF-8 string yourself, based on the 8-bit code point value. Since you have a limited range of values, this is easy-ish. If you look at the way that UTF-8 is written out, e.g., here:
https://en.wikipedia.org/wiki/UTF-8
you'll see that the values you need to represent are written as two unsigned bytes, the first beginning with binary 110B, and the second with 10B. The bits of the code point value are split up and distributed between these two bytes. Doing this conversion will need a little masking and bit-shifting, but it's not hugely difficult.
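As a sketch (assuming code points in the 0x80..0x7FF range, which covers values like 266 or 289), the masking and bit-shifting could look like this; GLib's g_unichar_to_utf8() will also do it for you:

#include <stdio.h>

/* Sketch: encode a code point in the range 0x80..0x7FF as two UTF-8 bytes.
 * buf must have room for 3 bytes (2 + terminating NUL). */
static void encode_2byte_utf8(unsigned int cp, char *buf)
{
    buf[0] = (char)(0xC0 | (cp >> 6));          /* 110xxxxx: top 5 bits */
    buf[1] = (char)(0x80 | (cp & 0x3F));        /* 10xxxxxx: low 6 bits */
    buf[2] = '\0';
}

int main(void)
{
    char utf8[3];
    encode_2byte_utf8(266, utf8);               /* U+010A -> 0xC4 0x8A */
    printf("%s\n", utf8);                       /* pass to gtk_label_set_text() */
    return 0;
}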
Having said all that, I have to wonder why you'd want to assign a character to a label that users will likely not understand? Why not just write the hex number on the label, if it is not a displayable character?
So I want to parse an ID3v2.4 file. There are 4 types of text encoding in the format specification: ISO-8859-1, UTF-16 with BOM, UTF-16BE, and UTF-8. I have already written code that obtains the bytes of the strings.
My question is how to print UTF-16 with BOM and UTF-16BE bytes to the console.
There is also one important condition: I can use only C libraries. I can't use C++ libraries. I can't even use third-party C libraries.
In general (NOT specifically for parsing ID3v2.4 files alone) you will want to choose a common character encoding that your code will use internally; then convert from any other character encoding into your chosen character encoding (for input data, e.g. from the user or files or the network) and convert back again (for output, to the user or files or the network).
For choosing a common character encoding:
you want something that minimizes "nonconvertible cases" - e.g. you wouldn't want to choose ASCII because there's far too much in far too many other character encodings that can't be converted to ASCII. This mostly means that you'll want a Unicode encoding.
you want something that is convenient. For Unicode encoding, this only really gives you 2 choices - UTF-8 (because you don't have to care about endian issues, and it's relatively efficient for space/memory consumption, and C functions like strlen() can still work) and versions of UTF-32 (because each codepoint takes up a fixed amount of space and it makes conversion a little simpler). Of these, the benefits of UTF-32 are mostly unimportant (unless you're doing a font rendering engine).
the "whatever random who-knows-what" character encoding that the C compiler uses is irrelevant (for both char and w_char), because it's implementation specific and not portable.
the "whatever random who-knows-what" character encoding that the terminal uses is irrelevant (the terminal should be considered "just another flavor of input/output, where conversion is involved").
Assuming you choose UTF-8:
You might be able to force the compiler to treat string literals as UTF-8 for you (e.g. with the u8"hello" prefix, which C11 also added). Otherwise you'll need to do it yourself where necessary.
I'd recommend using the uint8_t type for storing strings; partly because char is "signed or unsigned, depending on which way the wind is blowing" (which makes conversions to/from other character encodings painful due to "shifting a signed/negative number right" problems), and partly because it help to find "accidentally used something that isn't UTF-8" bugs (e.g. warnings from compiler about "conversion from signed to unsigned").
Conversion between UTF-8 and UTF-32LE, UTF-32BE, UTF-16LE, UTF-16BE is fairly trivial (the relevant Wikipedia articles are enough to describe how it works).
"UTF-16 with BOM" means that the first 2 bytes will tell you if it's UTF-16LE or UTF-16BE, so (after you add support for UTF-16LE and UTF-16BE) it's trivial. "UTF-32 with BOM" is similar (the first 4 bytes tell you if it's UTF-32LE or UTF-32BE).
Conversion between ISO-8859-1 and UTF-8 is fairly trivial, because the characters match Unicode code points with the same value. However, people often get it wrong (e.g. say it's ISO-8859-1 when the data is actually encoded as Windows-1252 instead); and for the conversion from UTF-8 to ISO-8859-1 you will need to deal with "nonconvertible" code points.
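To make that concrete, here is a standard-C-only sketch (names and buffer sizing are mine) that detects the BOM of a "UTF-16 with BOM" string, falls back to UTF-16BE when there is no BOM, and converts to UTF-8; the result can then be printed with printf on a UTF-8 terminal:

#include <stdint.h>
#include <stdlib.h>

/* Read one UTF-16 code unit from two bytes with the given endianness. */
static uint16_t get_u16(const uint8_t *p, int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

/* Append one code point to out as UTF-8; returns bytes written. */
static size_t put_utf8(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80)    { out[0] = (uint8_t)cp; return 1; }
    if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                        out[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                        out[1] = 0x80 | ((cp >> 6) & 0x3F);
                        out[2] = 0x80 | (cp & 0x3F); return 3; }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}

/* Convert "UTF-16 with BOM" (or BOM-less UTF-16BE) to a NUL-terminated
 * UTF-8 string. Assumes well-formed input; caller frees the result. */
static uint8_t *utf16_bom_to_utf8(const uint8_t *in, size_t nbytes)
{
    int big_endian = 1;                     /* no BOM: treat as UTF-16BE */
    size_t i = 0, o = 0;
    if (nbytes >= 2 && in[0] == 0xFF && in[1] == 0xFE) { big_endian = 0; i = 2; }
    else if (nbytes >= 2 && in[0] == 0xFE && in[1] == 0xFF) { big_endian = 1; i = 2; }

    uint8_t *out = malloc(nbytes * 2 + 1);  /* generous upper bound */
    if (!out) return NULL;

    while (i + 1 < nbytes) {
        uint32_t cp = get_u16(in + i, big_endian);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < nbytes) {
            uint32_t lo = get_u16(in + i, big_endian);   /* surrogate pair */
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        }
        o += put_utf8(cp, out + o);
    }
    out[o] = '\0';
    return out;
}

Printing the result is then just printf("%s\n", (char *)converted), relying on the console being set to UTF-8.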
I'm using libxml/xmlwriter to generate an XML file within a program.
const char *s = someCharactersFromSomewhere();
xmlTextWriterWriteAttribute (writer, _xml ("value"), _xml (s));
In general I don't have much control over the contents of s, so I can't guarantee that it will be well-formed UTF-8. Mostly it is, but if not, the XML which is generated will be malformed.
What I'd like to find is a way to convert s to valid UTF-8, with any invalid character sequences in s replaced with escapes or removed.
Alternatively, if there is an alternative to xmlTextWriterWriteAttribute, or some option I can pass in when initializing the XML writer, such that it guarantees that it will always write valid UTF-8, that would be even better.
One more thing to mention is that the solution must work with both Linux and OSX. Ideally writing as little of my own code as possible! :P
If the string is encoded in ASCII, then it will always be a valid UTF-8 string, because UTF-8 is backwards compatible with ASCII encoding. See the second paragraph of the Wikipedia article on UTF-8.
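If you only need to detect that case, a trivial check is enough (a sketch; the helper name is mine):

#include <stdbool.h>

/* Sketch: true if every byte is 7-bit ASCII, and therefore the string is
 * already valid UTF-8 as far as libxml is concerned. */
static bool is_ascii(const char *s)
{
    for (; *s; ++s)
        if ((unsigned char)*s > 127)
            return false;
    return true;
}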
Windows primarily works with UTF-16; this means you will have to convert from UTF-16 to UTF-8 before you pass the string to the XML library.
If you have 8-bit ASCII input then you can simply junk any character code > 127.
If you have some dodgy UTF-8 it is quite easy to parse, but the wide-character value that you generate might be outside the Unicode range. You can use mbrlen() to individually validate each character.
I am describing this using unsigned chars. If you must use signed chars, then >= 128 means < 0.
At its simplest, loop until the null byte:
1. If the next byte is 0, end the loop.
2. If the next byte is < 128, it is ASCII, so keep it.
3. If the next byte is >= 128 and < 128+64, it is invalid - discard it.
4. If the next byte is >= 128+64, it is probably a proper UTF-8 lead byte: call size_t mbrlen(const char *s, size_t n, mbstate_t *ps); to see how many bytes to keep. If mbrlen says the code is bad (either the lead byte or the trail bytes), skip 1 byte; rule 3 will skip the rest.
Even simpler logic just calls mbrlen repeatedly, as it can accept the low ASCII range.
You can assume that all the "furniture" of the file (e.g. the XML <>/ symbols, spaces, quotes and newlines) won't be altered by this edit, as they are all valid 7-bit ASCII codes.
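A sketch of that loop (the helper name is mine; it assumes the current locale is a UTF-8 locale, e.g. after setlocale(LC_CTYPE, "en_US.UTF-8"), so that mbrlen() validates UTF-8):

#include <string.h>
#include <wchar.h>

/* Sketch: copy s into dst, dropping bytes that do not form valid UTF-8.
 * dst must be at least as large as s. Assumes a UTF-8 locale. */
static void keep_valid_utf8(const char *s, char *dst)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t n = strlen(s);

    while (*s) {
        unsigned char c = (unsigned char)*s;
        if (c < 128) {                       /* plain ASCII: keep */
            *dst++ = *s++; n--;
            continue;
        }
        size_t len = mbrlen(s, n, &st);
        if (len == (size_t)-1 || len == (size_t)-2) {
            memset(&st, 0, sizeof st);       /* invalid or incomplete: skip one byte */
            s++; n--;
        } else {
            memcpy(dst, s, len);             /* valid multibyte sequence: keep */
            dst += len; s += len; n -= len;
        }
    }
    *dst = '\0';
}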
char is a single-byte type, while Unicode code points range from 0 to 0x10FFFF, so how do you represent a UTF character in only one byte?
First of all you need a wchar_t character. Those are used with the wprintf(3) versions of the normal printf(3) routines. If you dig a little into this, you'll see that mapping your UTF code points into a valid UTF-8 encoding is straightforward, based on your setlocale(3) settings. Look at the manual pages referenced, and you'll get an idea of the task you are facing.
There's full support for wide character sets in the C standard... but you have to use it through the internationalization libraries and the locales available.
Every time I do something similar to the condition below I get a multi-character constant warning.
char str[] = "León";
if(str[2] == 'ó') printf(true);
How can i solve this?
Unless the encoding on your platform is such that 'ó' can fit into a char, 'ó' is a multi-character constant. It seems to be the latter on your platform, judging by the warning you get. The values of multi-character constants are implementation defined; in other words, the choice of numeric value is up to the implementation, with some constraints (its type is int, and here the value evidently falls outside the char range, hence the warning).
Sadly, when you write char str[] = "León";, the ó is either narrowed into a single char or decomposed into more than one char in the array (two bytes, with a UTF-8 source encoding). So attempts to compare a single element of str to 'ó' will be futile.
If you want to use the extended ASCII characters, use their octal value. I am using the table at http://www.asciitable.com/ and I guess the value you require is 162 (decimal), which is 242 in octal. So use str[] = "Le\242n"; and use the same escape in the comparison.
You'll need to use the wchar_t type, or a unicode library. wchar_t is infamous for having many gotchas and easy bugs to hit, but it is the best primitive type available to C++ compilers.
You need to use variants of everything that support wchar_t, such as std::wcout or wprintf.
EDIT: wchar_t has been supplemented by char16_t and char32_t (introduced in C++11). The Unicode Standard 4.0 suggests their use whenever code must be portable between platforms, because wchar_t varies in size depending on the platform (like int does).
I recommend finding a good unicode library to handle comparison between the many characters that are made of multiple codepoints!
The other option is to stick entirely to the native char type, which is generally interpreted as some locale-specific, ASCII-compatible encoding.
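As a small illustration of the wchar_t route (a sketch only, assuming the compiler reads the source encoding correctly and ó is a single precomposed code point):

#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* pick up the user's locale for output */
    wchar_t str[] = L"León";
    if (str[2] == L'ó')               /* one wchar_t per character here */
        wprintf(L"true\n");
    return 0;
}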
ASCII is a 7-bit character encoding that numbers characters 0 ... 127. An ASCII-compatible encoding preserves the meanings of these bytes. Any character encoded as c < 0 or c > 127 cannot be an ASCII character. These are sometimes called by various confusing names such as "Extended ASCII" or the like.
In Unicode, the ASCII characters are still the characters 0 ... 127 of the Unicode codepoint range.
The problem is not so much that ó is an extended character; it is that your source file is actually in UTF-8, and therefore ó is encoded as 2 bytes. char in C stands for the thing generally called a byte elsewhere.
C also supports wide-character strings, where each element is a UTF-16, UCS-2, UTF-32, or some other code unit. There your ó would (most probably) be a single wchar_t.
Unfortunately you're opening a can of worms here, because the symbol ó can also be written in Unicode in 2 separate ways: as the single code point ó, or as the letter o followed by the combining acute accent ( ́). Both carry the same semantic information, but they consist of different bytes, and even when converted to wchar_t strings they would still be different sequences. The C standard library doesn't handle Unicode at all, except that C11 added some support for character literals explicitly in UTF-8. The C standard still doesn't provide a portable way of converting UTF-8 encoded text to wchar_t, nor can it do normalizations such as ó to o plus combining accent, or vice versa.
You could do something like
if (sizeof("ó") > 2) ...
If this is just one char, the length of your string is 2: one for the character and one for the terminating 0. Otherwise, if it doesn't fit, the compiler will allocate a longer sequence.
When you give your source file to the compiler you have to tell it which character encoding you used in your source editor (the source charset). My guess is that it is UTF-8, which encodes ó as 0xC3 0xB3. This part seems to be going right.
But 'ó' then becomes an integer with a value outside your char range (see your <limits.h>). Hence the warning on the == between them.
BTW—there is some meaning in "Extended ASCII" but not much. An "Extended ASCII" character set must encode each of its codepoints in one byte. So, UTF-8 is not an encoding for one of the many "Extended ASCII" character sets.
I'm currently using regular expressions on Unicode strings, but I just need to match ASCII characters, thus effectively ignoring all Unicode characters, and until now the functions in regex.h have worked fine (I'm on Linux so the encoding is UTF-8). But can someone confirm if it's really OK to do so? Or do I need a regex library with Unicode support (like ICU)?
UTF-8 is a variable-length encoding; some characters are 1 byte, some 2, others 3 or 4. You know how many bytes to read by the prefix of each character: 0 for 1 byte, 110 for 2 bytes, 1110 for 3 bytes, 11110 for 4 bytes.
If you try to read a UTF-8 string as ASCII, or as any other fixed-width encoding, things will go very wrong... unless that UTF-8 string contains nothing but 1-byte characters, in which case it matches ASCII.
However, since UTF-8 never uses a zero byte inside a multi-byte sequence, and none of the continuation bytes can be confused with ASCII, if you really are only matching ASCII you might be able to get away with it... but I wouldn't recommend it, because there are much better regex options than POSIX, they're easy to use, and why leave a hidden encoding bomb in your code for some sucker to deal with later? (Note: that sucker may be you.)
Instead, use a Unicode-aware regex library like Perl Compatible Regular Expressions (PCRE). PCRE2 is made Unicode aware by passing the PCRE2_UTF flag to pcre2_compile. PCRE regex syntax is more powerful and more widely understood than POSIX regexes, and PCRE has more features. A PCRE-based engine also ships with GLib (as GRegex), and GLib itself provides a feast of very handy C functions.
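A minimal PCRE2 sketch (assuming libpcre2-8 is installed; compile with -lpcre2-8); note that with PCRE2_UTF the dot matches one code point rather than one byte, unlike the byte-wise POSIX behaviour described in the next answer:

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>

int main(void)
{
    int errcode;
    PCRE2_SIZE erroffset;
    /* PCRE2_UTF makes the pattern and subject be treated as UTF-8. */
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)"a.b", PCRE2_ZERO_TERMINATED,
                                   PCRE2_UTF, &errcode, &erroffset, NULL);
    if (re == NULL) return 1;

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    PCRE2_SPTR subject = (PCRE2_SPTR)"a\xC3\xA8" "b";   /* "aèb" in UTF-8 */
    int rc = pcre2_match(re, subject, PCRE2_ZERO_TERMINATED, 0, 0, md, NULL);
    printf("%s\n", rc >= 0 ? "match" : "no match");     /* matches: . is one code point */

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}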
You need to be careful about your patterns and about the text you're going to match.
As an example, given the expression a.b:
"axb" matches
"aèb" does NOT match
The reason is that è is two bytes long when UTF-8 encoded but . would only match the first one.
So as long as you only match sequences of ASCII characters you're safe. If you mix ASCII and non ASCII characters, you're in trouble.
You can try to match a single UTF-8 encoded "character" with something like:
([\xC0-\xDF].|[\xE0-\xEF]..|\xF0...|.)
but this assumes that the text is encoded correctly (and, frankly, I never tried it).