How do I "decode" a UTF-8 character? - c

Let's assume I want to write a function to compare two Unicode characters. How should I do that? I read some articles (like this one) but still didn't get it. Let's take € as input. It's in the range 0x0800 to 0xFFFF, so it takes 3 bytes to encode. How do I decode it? Bitwise operations to get the 3 bytes out of a wchar_t and store them into 3 chars? An example in C would be great.
Here's my C code to "decode" it, but it obviously prints the wrong values when decoding Unicode...
#include <stdio.h>
#include <wchar.h>

// support for UTF8 which encodes up to 4 bytes only
struct Bytes
{
    char v1;
    char v2;
    char v3;
    char v4;
};

void printbin(unsigned n);
int length(wchar_t c);
void print(struct Bytes *b);

int main(void)
{
    struct Bytes bytes = { 0 };
    wchar_t c = '€';
    int len = length(c);
    //c = 11100010 10000010 10101100
    bytes.v1 = (c >> 24) << 4; // get first byte and remove leading "1110"
    bytes.v2 = (c >> 16) << 5; // skip over first byte and get 000010 from 10000010
    bytes.v3 = (c >> 8) << 5;  // skip over first two bytes and get 101100 from 10101100
    print(&bytes);
    return 0;
}

void print(struct Bytes *b)
{
    int v1 = (int)(b->v1);
    int v2 = (int)(b->v2);
    int v3 = (int)(b->v3);
    int v4 = (int)(b->v4);
    printf("v1 = %d\n", v1);
    printf("v2 = %d\n", v2);
    printf("v3 = %d\n", v3);
    printf("v4 = %d\n", v4);
}

int length(wchar_t c)
{
    if (c >= 0 && c <= 0x007F)
        return 1;
    if (c >= 0x0080 && c <= 0x07FF)
        return 2;
    if (c >= 0x0800 && c <= 0xFFFF)
        return 3;
    if (c >= 0x10000 && c <= 0x1FFFFF)
        return 4;
    if (c >= 0x200000 && c <= 0x3FFFFFF)
        return 5;
    if (c >= 0x4000000 && c <= 0x7FFFFFFF)
        return 6;
    return -1;
}

void printbin(unsigned n)
{
    if (!n)
        return;
    printbin(n >> 1);
    printf("%c", (n & 1) ? '1' : '0');
}

It's not at all easy to compare UTF-8 encoded characters. Best not to try. Either:
Convert them both to a wide format (32 bit integer) and compare this arithmetically. See wstring_convert or your favorite vendor-specific function; or
Convert them into one-character strings and use a function that compares UTF-8 encoded strings. There is no standard way to do this in C++, but it is the preferred method in other languages such as Ruby and PHP.
Just to make it clear, the thing that is hard is to take raw bits/bytes/characters encoded as UTF-8 and compare them. This is because your comparison has to take account of the encoding to know whether to compare 8 bits, 16 bits or more. If you can somehow turn the raw data into a null-terminated string, then the comparison is trivially easy using regular string functions. That string may be more than one byte/octet in length, but it will represent a single character/code point.
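For instance, once each character sits in its own null-terminated string, ordinary strcmp does the job. A standalone sketch (the byte values are the UTF-8 encoding of €, U+20AC; this is my illustration, not code from the question):

#include <stdio.h>
#include <string.h>

int main(void)
{
    // two one-character UTF-8 strings, both encoding € (U+20AC)
    const char *a = "\xE2\x82\xAC";
    const char *b = "\xE2\x82\xAC";
    printf("%s\n", strcmp(a, b) == 0 ? "equal" : "different"); // prints "equal"
    return 0;
}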
Windows is a bit of a special case. Wide characters are short int (16-bit). Historically this meant UCS-2 but it has been redefined as UTF-16. This means that all valid characters in the Basic Multilingual Plane (BMP) can be compared directly, since they will occupy a single short int, but others cannot. I am not aware of any simple way to deal with 32-bit wide characters (represented as a simple int) outside the BMP on Windows.
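To make the first option concrete in plain C, here is a minimal sketch that decodes one UTF-8 sequence into a 32-bit code point so that two characters can be compared arithmetically. The name utf8_decode is my own, and validation of the continuation bytes is omitted for brevity:

#include <stdint.h>
#include <stdio.h>

// Decode one UTF-8 sequence starting at s into *cp.
// Returns the number of bytes consumed, or 0 on an invalid lead byte.
static int utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {                      // 0xxxxxxx
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {            // 110xxxxx 10xxxxxx
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {            // 1110xxxx 10xxxxxx 10xxxxxx
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0) {            // 11110xxx plus 3 continuation bytes
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6)
            | (s[3] & 0x3F);
        return 4;
    }
    return 0;                               // invalid lead byte
}

int main(void)
{
    uint32_t a, b;
    utf8_decode((const unsigned char *)"\xE2\x82\xAC", &a); // €
    utf8_decode((const unsigned char *)"\xE2\x82\xAC", &b); // €
    printf("U+%04lX, equal: %d\n", (unsigned long)a, a == b); // U+20AC, equal: 1
    return 0;
}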

Related

Calculating checksum (16 bit) in c

I am being asked to do a checksum on this text with the bit size being 16:
"AAAAAAAAAA\nX"
At first the description seemed like it wanted the Fletcher-16 checksum. But the output of Fletcher's checksum performed on the above text yielded 8aee in hex. The example file says that the modular sum algorithm (minus the two's complement) should output 509d in hex.
The only other info is the standard "every two characters should be added to the checksum."
Besides using the generic Fletcher-16 checksum provided on the corresponding Wikipedia page, I have tried using this solution found here: calculating-a-16-bit-checksum to no avail. This code produced the hex value of 4f27.
Simply adding the data, treating it as an array of big-endian 16-bit integers, produces the result 509d.
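As a sanity check on that claim (my own arithmetic): 'A' is 0x41, so each "AA" pair contributes 0x4141; five such pairs sum to 0x14645, and the final "\nX" pair adds 0x0A58, for a total of 0x1509D, which truncated to 16 bits is 0x509D. The following program performs exactly this summation: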
#include <stdio.h>

int main(void) {
    char data[] = "AAAAAAAAAA\nX";
    int sum = 0;
    int i;
    for (i = 0; data[i] != '\0' && data[i + 1] != '\0'; i += 2) {
        int value = ((unsigned char)data[i] << 8) | (unsigned char)data[i + 1];
        sum = (sum + value) & 0xffff;
    }
    printf("%04x\n", sum);
    return 0;
}

Obtaining a sequence of bits from a C-style string in C?

I need to obtain a sequence of bits from a fixed-length C-style string (char*) in C. How can I do it? I need a sequence of bits representing the string, not a particular one. It needs to be strictly C, not C++.
You can use a simple bitmask of only one 1 and scan through the string one byte at a time, starting with mask = 0x80 (binary 10000000) and going down to 1 (binary 00000001).
#include <stdio.h>

#define N 5

int main(void) {
    char mystring[N] = "abcd";
    unsigned i;
    for (i = 0; i < N; i++) {
        unsigned char c = mystring[i];
        unsigned char mask = 0x80;
        do {
            putchar(c & mask ? '1' : '0');
            mask >>= 1;
        } while (mask > 0);
        putchar(' ');
    }
    putchar('\n');
    return 0;
}
Result:
01100001 01100010 01100011 01100100 00000000
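If you need the bits stored rather than printed, a small variant can write the '0'/'1' characters into a caller-supplied buffer. This is my own sketch; to_bitstring is a made-up name, not something from the question:

#include <stdio.h>
#include <stddef.h>

// Write the bits of s[0..n-1] as '0'/'1' characters into out,
// which must have room for 8*n + 1 bytes (including the terminator).
void to_bitstring(const char *s, size_t n, char *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)s[i];
        for (unsigned char mask = 0x80; mask != 0; mask >>= 1)
            out[k++] = (c & mask) ? '1' : '0';
    }
    out[k] = '\0';
}

int main(void)
{
    char bits[8 * 4 + 1];
    to_bitstring("abcd", 4, bits);
    puts(bits); // 01100001011000100110001101100100
    return 0;
}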

Decimal to BCD to ASCII

Perhaps this task is a bit more complicated than what I've written below, but the code that follows is my take on decimal to BCD. The task is to take in a decimal number, convert it to BCD, and then to ASCII so that it can be displayed on a microcontroller. As far as I'm aware the code works well enough for the basic operation of converting to BCD, but I'm stuck when it comes to converting this into ASCII. The overall output is ASCII so that an incremented value can be displayed on an LCD.
My code so far:
int dec2bin(int a) { // decimal to binary function
    int bin;
    int i = 1;
    while (a != 0) {
        bin += (a % 2) * i;
        i *= 10;
        a /= 2;
    }
    return bin;
}

unsigned int ConverttoBCD(int val) {
    unsigned int unit = 0;
    unsigned int ten = 0;
    unsigned int hundred = 0;
    hundred = (val / 100);
    ten = ((val - hundred * 100) / 10);
    unit = (val - (hundred * 100 + ten * 10));
    uint8_t ret1 = dec2bin(unit);
    uint8_t ret2 = dec2bin((ten) << 4);
    uint8_t ret3 = dec2bin((hundred) << 8);
    return (ret3 + ret2 + ret1);
}
The idea of converting to BCD for an ASCII representation of a number is actually the right one. Given BCD, you only need to add '0' to each digit to get the corresponding ASCII value.
But your code has several problems. The most important one is that you try to stuff a value shifted left by 8 bits into an 8-bit type. This can never work; those 8 bits will be zero, think about it! Also, I absolutely do not understand what your dec2bin() function is supposed to do.
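A standalone snippet (mine, not from the question) makes that truncation visible:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t ret3 = 4 << 8;          // 0x400 cannot fit in 8 bits
    printf("%u\n", (unsigned)ret3); // prints 0: the shifted bits are gone
    return 0;
}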
So I'll present you one possible correct solution to your problem. The key idea is to use a char for each individual BCD digit. Of course, a BCD digit only needs 4 bits and a char has at least 8 of them, but you need char anyway for your ASCII representation, and when your BCD digits are already in individual chars, all you have to do is indeed add '0' to each.
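For example (a standalone illustration with made-up digit values):

#include <stdio.h>

int main(void)
{
    char digits[] = { 4, 7, 1 };  // BCD digits of 471, one per char
    for (int i = 0; i < 3; ++i)
        putchar(digits[i] + '0'); // adding '0' yields the ASCII digit
    putchar('\n');                // prints 471
    return 0;
}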
While at it: converting to BCD by dividing and multiplying is a waste of resources. There's a nice algorithm called double dabble for converting to BCD using only bit shifts and additions. I'm using it in the following example code:
#include <stdio.h>
#include <string.h>

// for determining the number of value bits in an integer type,
// see https://stackoverflow.com/a/4589384/2371524 for this nice trick:
#define IMAX_BITS(m) ((m) /((m)%0x3fffffffL+1) /0x3fffffffL %0x3fffffffL *30 \
        + (m)%0x3fffffffL /((m)%31+1)/31%31*5 + 4-12/((m)%31+3))

// number of bits in unsigned int:
#define UNSIGNEDINT_BITS IMAX_BITS((unsigned)-1)

// convert to ASCII using BCD, return the number of digits:
int toAscii(char *buf, int bufsize, unsigned val)
{
    // sanity check, a buffer smaller than one digit is pointless
    if (bufsize < 1) return -1;

    // initialize output buffer to zero
    // if you don't have memset, use a loop here
    memset(buf, 0, bufsize);

    int scanstart = bufsize - 1;
    int i;

    // mask for single bits in value, start at most significant bit
    unsigned mask = 1U << (UNSIGNEDINT_BITS - 1);

    while (mask)
    {
        // extract single bit
        int bit = !!(val & mask);

        for (i = scanstart; i < bufsize; ++i)
        {
            // this is the "double dabble" trick -- in each iteration,
            // add 3 to each element that is greater than 4. This will
            // generate the correct overflowing bits while shifting for
            // BCD
            if (buf[i] > 4) buf[i] += 3;
        }

        // if we have filled the output buffer from the right far enough,
        // we have to scan one position earlier in the next iteration
        if (buf[scanstart] > 7) --scanstart;

        // check for overflow of our buffer:
        if (scanstart < 0) return -1;

        // now just shift the bits in the BCD digits:
        for (i = scanstart; i < bufsize - 1; ++i)
        {
            buf[i] <<= 1;
            buf[i] &= 0xf;
            buf[i] |= (buf[i+1] > 7);
        }

        // shift in the new bit from our value:
        buf[bufsize-1] <<= 1;
        buf[bufsize-1] &= 0xf;
        buf[bufsize-1] |= bit;

        // next bit:
        mask >>= 1;
    }

    // find first non-zero digit:
    for (i = 0; i < bufsize - 1; ++i) if (buf[i]) break;
    int digits = bufsize - i;

    // eliminate leading zero digits
    // (again, use a loop if you don't have memmove)
    // (or, if you're converting to a fixed number of digits and *want*
    // the leading zeros, just skip this step entirely, including the
    // loop above)
    memmove(buf, buf + i, digits);

    // convert to ascii:
    for (i = 0; i < digits; ++i) buf[i] += '0';

    return digits;
}

int main(void)
{
    // some simple test code:
    char buf[10];
    int digits = toAscii(buf, 10, 471142);

    for (int i = 0; i < digits; ++i)
    {
        putchar(buf[i]);
    }
    puts("");
}
You won't need this IMAX_BITS() "magic macro" if you actually know your target platform and how many bits there are in the integer type you want to convert.

Convert Hexadecimal to Decimal in AVR Studio?

How can I convert a hexadecimal (unsigned char type) to decimal (int type) in AVR Studio?
Are there any built-in functions available for these?
On AVRs, I had problems using the traditional hex-to-int approaches:
char *z="82000001";
uint32_t x=0;
sscanf(z, "%8X", &x);
or
x = strtol(z, 0, 16);
They simply produced the wrong output, and I didn't have time to investigate why.
So, for AVR Microcontrollers I wrote the following function, including relevant comments to make it easy to understand:
/**
 * hex2int
 * take a hex string and convert it to a 32bit number (max 8 hex digits)
 */
uint32_t hex2int(char *hex) {
    uint32_t val = 0;
    while (*hex) {
        // get current character then increment
        char byte = *hex++;
        // transform hex character to the 4bit equivalent number, using the ascii table indexes
        if (byte >= '0' && byte <= '9') byte = byte - '0';
        else if (byte >= 'a' && byte <= 'f') byte = byte - 'a' + 10;
        else if (byte >= 'A' && byte <= 'F') byte = byte - 'A' + 10;
        // shift 4 to make space for new digit, and add the 4 bits of the new digit
        val = (val << 4) | (byte & 0xF);
    }
    return val;
}
Example:
char *z ="82ABC1EF";
uint32_t x = hex2int(z);
printf("Number is [%X]\n", x);
Will output:
Number is [82ABC1EF]
EDIT: sscanf will also work on AVRs, but note that int is only 16 bits there, so for big hex numbers you'll need "%lX" (uint32_t corresponds to unsigned long), like this:
char *z="82000001";
uint32_t x=0;
sscanf(z, "%lX", &x);

decimal to hexadecimal unit conversion

I have the following code that converts a decimal number to hexadecimal. If I enter the number 4095, for example, it returns the hexadecimal FFF, but the problem is that I want the number printed in a 2-byte format like this (with zeros on the left):
4095 -> 0FFF,
33 -> 0021
I know there are simpler ways to do this, like:
int number = 4095;
printf("%04x", number);
but I want to do the conversion and the 2-byte formatting myself, and I don't know how to do the leading-zeros part.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char Hexadecimal(int rest);

int main()
{
    char num_hex, *x, c[2];
    int number = 4095, d, rest;
    x = calloc(12, sizeof(char));
    for (d = number; d > 0; d /= 16)
    {
        rest = d % 16;
        num_hex = Hexadecimal(rest);
        sprintf(c, "%c", num_hex);
        strcat(x, c);
    }
    strrev(x);
    printf("[%s]\n", x);
    return 0;
}
char Hexadecimal(int rest)
{
char letter;
switch(rest)
{
case 10:
letter = 'A';
break;
case 11:
letter = 'B';
break;
case 12:
letter = 'C';
break;
case 13:
letter = 'D';
break;
case 14:
letter = 'E';
break;
case 15:
letter = 'F';
break;
default:
letter = '0' + rest;
}
return letter;
}
One suggested simplification keeps the program's structure but replaces the switch with a constant lookup table:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char const Hexadecimal[] = "0123456789ABCDEF";

int main()
{
    char num_hex, *x, c[2];
    int number = 4095, d, rest;
    x = calloc(12, sizeof(char));
    for (d = number; d > 0; d /= 16)
    {
        rest = d % 16;
        num_hex = Hexadecimal[rest];
        sprintf(c, "%c", num_hex);
        strcat(x, c);
    }
    strrev(x);
    printf("[%s]\n", x);
    return 0;
}
I think it's easier to think of these problems in terms of binary arithmetic.
Here's an implementation that uses binary math and no standard library functions. It's written in a general way, so you could replace short with int or long and get a similarly correct result (as long as you replace the short type everywhere in the function with your desired type).
char *shortToHex(short value)
{
    static char tmp[(sizeof(short) * 2) + 3] = "0x";
    const char *hex = "0123456789ABCDEF";
    int i = sizeof(short);
    while (i > 0)
    {
        i--;
        char most_significant_nibble = (value >> (i * 8 + 4)) & 0xF;
        char least_significant_nibble = (value >> (i * 8)) & 0xF;
        tmp[2 + (sizeof(short) - i - 1) * 2] = hex[most_significant_nibble];
        tmp[2 + (sizeof(short) - i - 1) * 2 + 1] = hex[least_significant_nibble];
    }
    tmp[(sizeof(short) * 2) + 2] = 0;
    return tmp;
}
Let me explain line-by-line so you understand.
static char tmp[(sizeof(short)*2)+3] = "0x";
We need a buffer large enough to hold the number of nibbles in a short (16 bits == 2 bytes == 4 characters, so two per byte in general), plus a '\0' terminator, plus the "0x" prefix.
const char* hex = "0123456789ABCDEF";
We need to be able to grab a hexadecimal character based on an array index.
while (i > 0)
{
    i--;
    ...
}
We need a loop counter that counts down from the number of bytes in a short. We're going to loop starting at the most significant byte, to build up the string left-to-right.
char most_significant_nibble = (value >> (i * 8 + 4)) & 0xF;
char least_significant_nibble = (value >> (i * 8)) & 0xF;
tmp[2 + (sizeof(short)-i-1) * 2] = hex[most_significant_nibble];
tmp[2 + (sizeof(short)-i-1) * 2 + 1] = hex[least_significant_nibble];
We shift the bits in the input value to the right so that the desired bits land in the least-significant nibble of a temporary character. We determine how many bits to shift from the loop index (in the case of a 16-bit short, i starts out as 1, since we just decremented it, so the first time through the loop we compute 1 * 8 + 4 and shift 12 bits to the right, leaving the most significant nibble of the short). The second line shifts 8 bits to the right and grabs the second nibble, and so on. The & 0xF part ensures that the 4 bits we shifted into the lowest position are the only value left in the temporary char.
Then we take those temporary values and place them into the tmp array at the appropriate position. (The math is a little tricky, since our index is at the most significant (highest) byte of the value, but we want to place the character at the lowest position in the tmp string.)
tmp[(sizeof(short)*2)+2] = 0;
return tmp;
Lastly, we terminate the tmp string with a '\0' and return it.
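A quick usage sketch of my own, with shortToHex defined as above and the values taken from the question:

#include <stdio.h>

int main(void)
{
    printf("%s\n", shortToHex(4095)); // prints 0x0FFF
    printf("%s\n", shortToHex(33));   // prints 0x0021
    return 0;
}

Note that the function prefixes the digits with "0x"; if you want the bare 0FFF / 0021 form from the question, drop the prefix and shrink the buffer by two.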
