GTK int to Unicode char conversion for display in GTK label - C

I am receiving hex data from a serial port.
I have converted the hex data to the corresponding int values.
I want to display the equivalent character in a GTK label.
But if we look at the character map, there are control characters from 0x00 to 0x20.
So I was thinking of adding 256 to the converted int value and showing the corresponding Unicode character in the label.
But I am not able to convert the int to Unicode. Say I have an array of ints 266, 267, 289...
How should I convert them to a gunichar and display them in the GTK label?
I know this may seem like a very basic problem to you all, but I have struggled a lot and didn't find any answer. Please help.

The GTK functions that set text on UI elements all expect UTF-8 strings. A Unicode code point with a value > 127 does not form a valid UTF-8 string if it is simply written out as a single unsigned byte. I can think of a couple of ways around this.
Store the code point as a 32-bit integer (which is essentially UTF-32) and use the functions in the iconv library, or something similar, to do the conversion from UTF-32 to UTF-8. There are other conversion implementations in C widely available. Converting your unsigned byte to UTF-32 really amounts to padding it with three leading zero bytes -- which is easy to code.
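Since the question is GTK-based, GLib (which GTK already depends on) provides one of those "something similar" options: g_unichar_to_utf8 converts a single code point to UTF-8 in one call. A minimal sketch, assuming label is an existing GtkLabel created elsewhere:

#include <gtk/gtk.h>

/* Minimal sketch: show one Unicode code point on a label.
   Assumes `label` is an existing GtkLabel created elsewhere. */
static void show_code_point (GtkWidget *label, gunichar cp)
{
    gchar buf[7];                             /* up to 6 UTF-8 bytes + NUL */
    gint  len = g_unichar_to_utf8 (cp, buf);  /* GLib does the encoding    */
    buf[len] = '\0';
    gtk_label_set_text (GTK_LABEL (label), buf);
}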
Generate a UTF-8 string yourself, based on the 8-bit code point value. Since you have a limited range of values, this is easy-ish. If you look at the way that UTF-8 is written out, e.g., here:
https://en.wikipedia.org/wiki/UTF-8
you'll see that the values you need to represent are written as two unsigned bytes, the first beginning with binary 110B, and the second with 10B. The bits of the code point value are split up and distributed between these two bytes. Doing this conversion will need a little masking and bit-shifting, but it's not hugely difficult.
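For the two-byte case described here (code points 0x80 to 0x7FF), a rough sketch of that masking and bit-shifting might look like the following; encode_two_byte_utf8 is just an illustrative name, not a library function:

/* Sketch: encode a code point in the range 0x80..0x7FF as two UTF-8 bytes.
   `out` must have room for 3 bytes (two encoded bytes plus a NUL). */
void encode_two_byte_utf8(unsigned int cp, unsigned char out[3])
{
    out[0] = 0xC0 | (cp >> 6);     /* 110xxxxx: the top 5 bits      */
    out[1] = 0x80 | (cp & 0x3F);   /* 10xxxxxx: the low 6 bits      */
    out[2] = '\0';                 /* NUL-terminate for convenience */
}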
Having said all that, I have to wonder why you'd want to show on a label a character that users will likely not understand. Why not just write the hex number on the label if it is not a displayable character?

Related

How does C language transform char literal to number and vice versa

I've been diving into C/low-level programming/system design recently. As a seasoned Java developer I still remember my attempts to pass the SUN Java Certification and the questions about whether the char type in Java can be cast to an Integer and how that can be done. That is what I know and remember - numbers up to 255 can be treated either as numbers or as characters, depending on the cast.
Getting to know C, I want to know more, but I find it hard to find a proper answer (I tried googling but usually get a gazillion results on just how to convert char to int in code) to how EXACTLY it works that the C compiler/system calls transform a number into a character and vice versa.
AFAIK, memory stores numbers. So let's assume a memory cell stores the value 65 (which is the letter 'A'). There is a value stored, and at some point the C code wants to read it and store it into a char variable. So far so good. And then we call the printf procedure with %c formatting for the given char parameter.
And here is where the magic happens - HOW EXACTLY does printf know that the character with value 65 is the letter 'A' (and should display it as a letter)? It is a basic character from the raw ASCII range (not some funny emoji-style UTF character). Does it call external standard libraries/system calls to consult an encoding system? I would love some nitty-gritty, low-level explanation, or at least a link to a trusted source.
The C language is largely agnostic about the actual encoding of characters. It has a source character set which defines how the compiler treats characters in the source code. So, for instance on an old IBM system the source character set might be EBCDIC where 65 does not represent 'A'.
C also has an execution character set which defines the meaning of characters in the running program. This is the one that seems more pertinent to your question. But it doesn't really affect the behavior of I/O functions like printf. Instead it affects the results of ctype.h functions like isalpha and toupper. printf just treats it as a char sized value which it receives as an int due to variadic functions using default argument promotions (any type smaller than int is promoted to int, and float is promoted to double). printf then shuffles off the same value to the stdout file and then it's somebody else's problem.
If the source character set and execution character set are different, then the compiler will perform the appropriate conversion so the source token 'A' will be manipulated in the running program as the corresponding A from the execution character set. The choice of actual encoding for the two character sets, ie. whether it's ASCII or EBCDIC or something else is implementation defined.
With a console application it is the console or terminal which receives the character value that has to look it up in a font's glyph table to display the correct image of the character.
Character constants are of type int. Except for the fact that it is implementation defined whether char is signed or unsigned, a char can mostly be treated as a narrow integer. The only conversion needed between the two is narrowing or widening (and possibly sign extension).
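As a small illustration of those last two points (assuming an ASCII execution character set), the same promoted int value can be printed either as a character or as a number:

#include <stdio.h>

int main(void)
{
    char c = 65;              /* the numeric value of 'A' under ASCII        */
    printf("%c %d\n", c, c);  /* c is promoted to int for both arguments;    */
                              /* %c prints the glyph, %d prints the number   */
    return 0;                 /* output: A 65                                */
}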
"HOW EXACTLY printf knows that character with value 65 is letter 'A' (and should display it as a letter)."
It usually doesn't, and it does not even need to. Even the compiler does not see characters ', A and ' in the C language fragment
char a = 'A';
printf("%c", c);
If the source and execution character sets are both ASCII or ASCII-compatible, as is usually the case nowadays, the compiler will have among the stream of bytes the triplet 39, 65, 39 - or rather 00100111 01000001 00100111. And its parser has been programmed with a rule that something between two 00100111s is a character literal, and since 01000001 is not a magic value it is translated as is to the final program.
The C program, at runtime, then handles 01000001 all the time (though from time to time it might be 01000001 zero-extended to an int, e.g. 00000000 00000000 00000000 01000001 on 32-bit systems; adding leading zeroes does not change its numerical value). On some systems, printf - or rather the underlying internal file routines - might translate the character value 01000001 to something else. But on most systems, 01000001 will be passed to the operating system as is. Then the operating system - or possibly a GUI program receiving the output from the operating system - will want to display that character, and then the display font is consulted for the glyph that corresponds to 01000001, and usually the glyph for 01000001 looks something like
A
And that will be displayed to the user.
At no point does the system really operate with glyphs or characters but just binary numbers. The system in itself is a Chinese room.
The real magic of printf is not how it handles characters, but how it handles numbers, as these are converted to more characters. While %c passes values as-is, %d will convert such a simple integer value as 0b101111000110000101001110 to stream of bytes 0b00110001 0b00110010 0b00110011 0b00110100 0b00110101 0b00110110 0b00110111 0b00111000 so that the display routine will correctly display it as
12345678
char in C is just an integer CHAR_BIT bits long. Usually it is 8 bits long.
HOW EXACTLY printf knows that character with value 65 is letter 'A'
The implementation knows what character encoding it uses, and the printf function code takes the appropriate action to output the letter 'A'.

Convert a `char *` to UTF-8 in C, or when using xmlwriter?

I'm using libxml/xmlwriter to generate an XML file within a program.
const char *s = someCharactersFromSomewhere();
xmlTextWriterWriteAttribute (writer, _xml ("value"), _xml (s));
In general I don't have much control over the contents of s, so I can't guarantee that it will be well-formed UTF-8. Mostly it is, but if not, the XML which is generated will be malformed.
What I'd like to find is a way to convert s to valid UTF-8, with any invalid character sequences in s replaced with escapes or removed.
Alternatively, if there is an alternative to xmlTextWriterWriteAttribute, or some option I can pass in when initializing the XML writer, such that it guarantees that it will always write valid UTF-8, that would be even better.
One more thing to mention is that the solution must work with both Linux and OSX. Ideally writing as little of my own code as possible! :P
If the string is encoded in ASCII, then it will always be a valid UTF-8 string.
This is because UTF-8 is backwards compatible with ASCII encoding.
See the second paragraph on Wikipedia here.
Windows primarily works with UTF-16, which means you will have to convert from UTF-16 to UTF-8 before you pass the string to the XML library.
If you have 8-bit ascii input then you can simply junk any character code > 127.
If you have some dodgy UTF-8 it is quite easy to parse, but the widechar symbol number that you generate might be out of the unicode range. You can use mbrlen() to individually validate each character.
I am describing this using unsigned chars. If you must use signed chars, then >= 128 means < 0.
At its simplest:
Until the null byte:
1. If the next byte is 0, then end the loop.
2. If the next byte is < 128, then it is ASCII, so keep it.
3. If the next byte is >= 128 and < 128+64, it is invalid - discard it.
4. If the next byte is >= 128+64, then it is probably a proper UTF-8 lead byte: call size_t mbrlen(const char *s, size_t n, mbstate_t *ps); to see how many bytes to keep. If mbrlen says the code is bad (either the lead byte or the trail bytes), skip 1 byte; rule 3 will skip the rest (see the sketch below).
Even simpler logic just calls mbrlen repeatedly, as it can accept the low ASCII range.
You can assume that all the "furniture" of the file (e.g. XML <>/ symbols, spaces, quotes and newlines) won't be altered by this edit, as they are all valid 7-bit ASCII codes.
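A rough sketch of that loop using mbrlen, assuming a UTF-8 locale has already been selected with setlocale; keep_valid_utf8 is just an illustrative name:

#include <string.h>
#include <wchar.h>

/* Sketch: copy only the valid multibyte sequences of `in` into `out`.
   `out` must be at least strlen(in)+1 bytes. Assumes a UTF-8 locale,
   e.g. setlocale(LC_CTYPE, "en_US.UTF-8") was called earlier. */
void keep_valid_utf8(const char *in, char *out)
{
    mbstate_t st;
    size_t n = strlen(in);
    memset(&st, 0, sizeof st);

    while (n > 0) {
        size_t len = mbrlen(in, n, &st);
        if (len == (size_t)-1 || len == (size_t)-2) {
            memset(&st, 0, sizeof st);   /* invalid or truncated: resync  */
            in++;                        /* skip one byte and try again   */
            n--;
        } else {
            memcpy(out, in, len);        /* a complete, valid sequence    */
            out += len;
            in  += len;
            n   -= len;
        }
    }
    *out = '\0';
}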
char is a single-byte character, while Unicode code points range from 0 to 0x10FFFF, so how do you represent a Unicode character in only one byte?
First of all you need a wchar_t character. Those are used with the wprintf(3) versions of the normal printf(3) routines. If you dig a little into this, you'll see that mapping your code points into a valid UTF-8 encoding is straightforward, based on your setlocale(3) settings. Look at the manual pages referenced, and you'll get an idea of the task you are facing.
There's full support for wide character sets in the C standard... but you have to use it through the internationalization libraries and the locales available.
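A minimal sketch of that wide-character route, assuming the environment provides a UTF-8 locale:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");          /* pick up the user's locale, assumed UTF-8    */

    wchar_t c = 0x05D0;             /* U+05D0 HEBREW LETTER ALEF                   */
    wprintf(L"%lc\n", (wint_t)c);   /* converted to the locale encoding on output  */
    return 0;
}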

Get the Unicode value of a TCHAR

I need to get the Unicode value of a TCHAR.
e.g. if the TCHAR is 'A', I want to find the Unicode value 0x41.
Is it safe to cast to an int, or is there an API function I should be using?
Your question is a little malformed. A TCHAR can be either an 8 bit or a 16 bit character. Knowing how wide the character is, in and of itself, is not enough. You also need to know how it is encoded. For example:
If you have an 8 bit ASCII encoded character, then its numeric value is the Unicode code point.
If you have an 8 bit Windows ANSI encoded character from a single byte character set you convert to UTF-16 with MultiByteToWideChar. The numeric value of the UTF-16 element is the Unicode code point.
If you have an 8 bit Windows ANSI encoded character element from a double byte or multi byte character set, that 8 bit char does not, in general, define a character. In general, you need multiple char elements.
Likewise for a 16 bit UTF-16 encoded character element. Again UTF-16 is a variable width encoding and a single character element does not in general define a Unicode code point.
So, in order to proceed you must become clear as to how your character is encoded.
Before even doing that you need to know how wide it is. TCHAR can be 8 bit or 16 bit, depending on how you compile. That flexibility was how we handled single-source development for Win 9x and Win NT. The former had no Unicode support. Nowadays Win 9x is, thankfully, long forgotten, and so too should TCHAR be. Sadly it lives on in countless MSDN examples, but you should ignore that. On Windows the native character element is wchar_t.
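For the single-byte ANSI case above, a hedged sketch of the MultiByteToWideChar route, assuming the character really does come from a single-byte code page:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    char ansi = 'A';            /* assumed to come from a single-byte ANSI code page */
    wchar_t wide[2] = { 0 };

    int n = MultiByteToWideChar(CP_ACP, 0, &ansi, 1, wide, 2);
    if (n == 1)
        printf("U+%04X\n", (unsigned)wide[0]);   /* prints U+0041 */
    return 0;
}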
Well, I just guess you want UTF-32 numbers.
As Arx said already, TCHAR can be char or wchar_t.
If you have a char string, it will probably contain data in your system's default single-byte charset (UTF-8 is possible too). As dealing with many different charsets is difficult and Windows has built-in conversion stuff, use MultiByteToWideChar to get a wchar_t array from your char array.
If you have a wchar_t array, it's most likely UTF-16 (LE, without BOM...) on Windows. I don't know any built-in function to get UTF-32 out of it, but writing your own conversion is not that hard (otherwise, use some lib):
http://en.wikipedia.org/wiki/UTF-16
Some bit-fiddling, but nothing more.
(What TCHAR is is a preprocessor thing, so you could implement different behaviour based on the #defines too. Or sizeof, or...)
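The bit-fiddling in question is mostly the surrogate-pair rule from that Wikipedia page; a sketch, assuming well-formed UTF-16 input, with utf16_to_utf32 as an illustrative name:

#include <stdint.h>

/* Sketch: decode one UTF-32 code point from UTF-16 units.
   Returns the number of 16-bit units consumed (1 or 2).
   Assumes the input is well-formed (no unpaired surrogates). */
int utf16_to_utf32(const uint16_t *in, uint32_t *out)
{
    if (in[0] >= 0xD800 && in[0] <= 0xDBFF) {          /* high surrogate */
        *out = 0x10000
             + (((uint32_t)(in[0] - 0xD800) << 10)     /* top 10 bits    */
             |   (uint32_t)(in[1] - 0xDC00));          /* low 10 bits    */
        return 2;
    }
    *out = in[0];                                      /* BMP code point */
    return 1;
}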

LZW Compression with Entire unicode library

I am trying to do this problem:
Assume we have an initial alphabet of the entire Unicode character set,
instead of just all the possible byte values. Recall that unicode
characters are unsigned 2-byte values, so this means that each
2 bytes of uncompressed data will be treated as one symbol, and
we'll have an alphabet with over 60,000 symbols. (Treating symbols as
2-byte Unicodes, rather than a byte at a time, makes for better
compression in the case of internationalized text.) And, note, there's
nothing that limits the number of bits per code to at most 16. As you
generalize the LZW algorithm for this very large alphabet, don't worry
if you have some pretty long codes.
With this, give the compressed version of this four-symbol sequence,
using our project assumptions, including an EOD code, and grouping
into 4-byte ints. (These three symbols are Unicode values,
represented numerically.) Write your answer as 3 8-digit hex values,
space separated, using capital hex digits, not lowercase.
32767 32768 32767 32768
The problem I am having is that I don't know the entire range of the alphabet, so when doing the LZW compression I don't know what values the new codes will have. Stemming from that problem, I also don't know what the EOD code will be.
Also, it seems to me that the compressed data will only take two integers.
The problem statement is ill-formed.
In Unicode, as we know it today, code points (those numbers that represent characters, composable parts of characters and other useful but more sneaky things) cannot all be numbered from 0 to 65535 to fit into 16 bits. There are more than 100 thousand Chinese, Japanese and Korean characters in Unicode. Clearly, you'd need 17+ bits just for those. So, Unicode clearly cannot be the correct option here.
OTOH, there exists a sort of "abridged" version of Unicode, the Universal Character Set, whose UCS-2 encoding uses 16-bit code points and can technically be used for at most 65536 characters and the like. Characters with codes greater than 65535 are, well, unlucky: you can't have them with UCS-2.
So, if it's really UCS-2, you can download its specification (ISO/IEC 10646, I believe) and figure out exactly which codes out of those 64K are used and thus should form your initial LZW alphabet.

How to convert Unicode escaped characters to utf8?

I saw the other questions about the subject but all of them were missing important details:
I want to convert \u00252F\u00252F\u05de\u05e8\u05db\u05d6 to UTF-8. I understand that you look through the stream for \u followed by four hex digits which you convert to bytes. The problems are as follows:
I heard that sometimes you look for 4 bytes after and sometimes 6 bytes after; is this correct? If so, how do you determine which it is? E.g. is \u00252F 4 or 6 bytes?
In the case of \u0025, this maps to one byte instead of two (0x25). Why? Is the four-hex-digit value supposed to represent UTF-16, which I am supposed to convert to UTF-8?
How do I know whether the text is supposed to be the literal characters \u0025 or the Unicode sequence? Does that mean that all backslashes must be escaped in the stream?
Lastly, am I being stupid in doing this by hand when I can use iconv to do this for me?
If you have the iconv interfaces at your disposal, you can simply convert the \u0123\uABCD etc. sequences to an array of bytes 01 23 AB CD ..., replacing any unescaped ASCII characters with a 00 byte followed by the ASCII byte, then run the array through iconv with a conversion descriptor obtained by iconv_open("UTF-8", "UTF-16BE").
Of course you can also do it much more efficiently working directly with the input yourself, but that requires reading and understanding the Unicode specification of UTF-16 and UTF-8.
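A minimal sketch of that iconv route, with the byte array hard-coded for illustration (the \u05de\u05e8 portion of the example, stored big-endian as described above):

#include <iconv.h>
#include <stdio.h>

int main(void)
{
    /* Bytes built from "\u05de\u05e8", big-endian UTF-16. */
    char in[]  = { 0x05, (char)0xde, 0x05, (char)0xe8 };
    char out[16];
    char *inp = in, *outp = out;
    size_t inleft = sizeof in, outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");
    if (cd == (iconv_t)-1)
        return 1;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
        fwrite(out, 1, sizeof out - outleft, stdout);   /* writes the UTF-8 bytes */

    iconv_close(cd);
    return 0;
}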
In some conventions (like C++11 string literals), you parse a specific number of hex digits, like four after \u and eight after \U. That may or may not be the convention with the input you provided, but it seems a reasonable guess. Other styles, like C++'s \x, parse as many hex digits as they can find after the \x, which means that you have to jump through some hoops if you want to put a literal hex digit immediately after one of these escaped characters.
Once you have all the values, you need to know what encoding they're in (e.g., UTF-16 or UTF-32) and what encoding you want (e.g., UTF-8). You then use a function to create a new string in the new encoding. You can write such a function (if you know enough about both encoding formats), or you can use a library. Some operating systems may provide such a function, but you might want to use a third-party library for portability.
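If you do write such a function yourself, here is a rough sketch of the four-hex-digit convention for BMP code points only; surrogate pairs, \U, and error handling are left out, and both function names are just illustrative:

#include <stdlib.h>

/* Sketch: append one BMP code point (< 0x10000) to `out` as UTF-8.
   Returns the number of bytes written. */
static size_t put_utf8(unsigned cp, char *out)
{
    if (cp < 0x80)  { out[0] = (char)cp; return 1; }
    if (cp < 0x800) { out[0] = (char)(0xC0 | (cp >> 6));
                      out[1] = (char)(0x80 | (cp & 0x3F)); return 2; }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Sketch: decode "\uXXXX" escapes (assumed well-formed, exactly four
   hex digits each) in `in`, copying everything else through unchanged.
   `out` must be large enough for the result plus a NUL. */
void unescape_to_utf8(const char *in, char *out)
{
    while (*in) {
        if (in[0] == '\\' && in[1] == 'u') {
            char hex[5] = { in[2], in[3], in[4], in[5], 0 };
            out += put_utf8((unsigned)strtoul(hex, NULL, 16), out);
            in += 6;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}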