UTF-8 decoder fails on non-ASCII characters in C

Note: if you've followed my recent questions, you'll see that they're all about my Unicode library exercise in C. As this is one of my first few serious projects in C, I'm having many problems, so I'm sorry if I'm asking too many questions about one thing.
Part of my library decodes UTF-8 encoded char pointers into raw unsigned code points. However, certain planes don't decode correctly. Let's take a look at the (relevant) code:
typedef struct string {
unsigned long length;
unsigned *data;
} string;
// really simple stuff
string *upush(string *s, unsigned c) {
if (!s->length) s->data = (unsigned *) malloc((s->length = 1) * sizeof(unsigned));
else s->data = (unsigned *) realloc(s->data, ++s->length * sizeof(unsigned));
s->data[s->length - 1] = c;
return s;
}
// UTF-8 conversions
string ctou(char *old) {
unsigned long i, byte = 0, cur = 0;
string new;
new.length = 0;
for (i = 0; old[i]; i++)
if (old[i] < 0x80) upush(&new, old[i]);
else if (old[i] < 0xc0)
if (!byte) {
byte = cur = 0;
continue;
} else {
cur |= (unsigned)(old[i] & 0x3f) << (6 * (--byte));
if (!byte) upush(&new, cur), cur = 0;
}
else if (old[i] < 0xc2) continue;
else if (old[i] < 0xe0) {
cur = (unsigned)(old[i] & 0x1f) << 6;
byte = 1;
}
else if (old[i] < 0xf0) {
cur = (unsigned)(old[i] & 0xf) << 12;
byte = 2;
}
else if (old[i] < 0xf5) {
cur = (unsigned)(old[i] & 0x7) << 18;
byte = 3;
}
else continue;
return new;
}
All upush does, by the way, is push a code point onto the end of a string, reallocating memory as needed. ctou does the decoding work: it tracks the number of bytes still needed in the current sequence in byte, and the in-progress code point in cur.
The code all seems correct to me. Let's try decoding U+10fffd, which is f4 8f bf bd in UTF-8. Doing this:
long i;
string b = ctou("\xf4\x8f\xbf\xbd");
for (i = 0; i < b.length; i++)
printf("%z ", b.data[i]);
should print out:
10fffd
but instead it prints out:
fffffff4 ffffff8f ffffffbf ffffffbd
which is basically the four bytes of the UTF-8 sequence, each with ffffff tacked on in front.
Any guidance as to what is wrong in my code?

The char type is allowed to be signed, and conversion to int and then to unsigned (which is what happens implicitly when you convert directly to unsigned) shows the error:
#include <stdio.h>
int main() {
char c = '\xF4';
int i = c;
unsigned n = i;
printf("%X\n", n);
n = c;
printf("%X\n", n);
return 0;
}
Prints:
FFFFFFF4
FFFFFFF4
Use unsigned char instead.
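Applied to the decoder above, the idea is to read each byte through an unsigned char before comparing (a minimal sketch, not the whole corrected function):
unsigned char b = (unsigned char)old[i]; /* 0xF4 stays 0xF4 instead of becoming negative */
if (b < 0x80) upush(&new, b);
/* ... and compare the remaining ranges (0xc0, 0xc2, 0xe0, 0xf0, ...) against b as well */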

You've probably ignored the fact that char is a signed type on your platform. Always use:
unsigned char if you will be reading the actual values of bytes
signed char if you're using bytes as small signed integers
char for abstract strings where you don't care about the values except perhaps for 0.
By the way, your code is extremely inefficient. Instead of calling realloc over and over per-character, why not allocate sizeof(unsigned)*(strlen(old)+1) to begin with, then reduce the size at the end if it's too big? Of course this is only one of the many inefficiencies.
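That one-allocation idea might look like this (a sketch; error checking omitted):
/* worst case: one code point per input byte, plus one spare slot */
unsigned *data = malloc((strlen(old) + 1) * sizeof(unsigned));
/* ... decode into data, counting the decoded code points in n ... */
data = realloc(data, n * sizeof(unsigned)); /* shrink once at the end */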

Related

Converting a string of chars into its decimal value then back to its character values

unsigned long long int power(int base, unsigned int exponent)
{
if (exponent == 0)
return 1;
else
return base * power(base, exponent - 1);
}
I am working on a program where I need to take in a string of 8 characters (e.g. "I want t") then convert this into a long long int in the pack function. I have the pack function working fine.
unsigned long long int pack(char unpack[])
{
/*converting string to long long int here
didn't post code because its large*/
}
After I enter "I want t" I get "Value in Decimal = 5269342824372117620" and then I send the decimal to the unpack function. So I need to convert 5269342824372117620 back into "I want t". I tried bit manipulation, which was unsuccessful; any help would be greatly appreciated.
void unpack(long long int pack)
{
long long int bin;
char convert[100];
for(int i = 63, j = 0, k = 0; i >= 0; i--,j++)
{
if((pack & (1 << i)) != 0)
bin += power(2,j);
if(j % 8 == 0)
{
convert[k] = (char)bin;
bin = 0;
k++;
j = -1;
}
}
printf("String: %s\n", convert);
}
A simple solution for your problem is to consider the characters in the string to be digits in a large base that encompasses all possible values. For example base64 encoding can convert strings of 8 characters to 48-bit numbers, but you can only use a subset of at most 64 different characters in the source string.
To convert any 8 byte string into a number, you must use a base of at least 256.
Given your extra input (after entering "I want t" you get 5269342824372117620), and since 5269342824372117620 == 0x492077616e742074, you do indeed use base 256, big-endian order, and ASCII encoding for the characters.
Here is a simple portable pack function for this method:
unsigned long long pack(const char *s) {
unsigned long long x = 0;
int i;
for (i = 0; i < 8; i++) {
x = x * 256 + (unsigned char)s[i];
}
return x;
}
The unpack function is easy to derive: compute the remainders of divisions in the reverse order:
char *unpack(char *dest, unsigned long long x) {
/* dest is assumed to have a length of at least 9 */
int i;
for (i = 8; i-- > 0; ) {
dest[i] = x % 256;
x = x / 256;
}
dest[8] = '\0'; /* set the null terminator */
return dest;
}
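A quick round-trip check of the two functions above (a hypothetical main):
#include <stdio.h>

int main(void) {
    char buf[9];
    unsigned long long x = pack("I want t");
    printf("Value in Decimal = %llu\n", x);   /* 5269342824372117620 */
    printf("Unpacked: %s\n", unpack(buf, x)); /* I want t */
    return 0;
}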
For a potentially faster but less portable solution, you could use this, but you would get a different conversion on little-endian systems such as current Macs and PCs:
#include <string.h>
unsigned long long pack(const char *s) {
unsigned long long x;
memcpy(&x, s, 8);
return x;
}
char *unpack(char *s, unsigned long long x) {
memcpy(s, &x, 8);
s[8] = '\0';
return s;
}
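On a little-endian machine, for example, this memcpy version reads "I want t" as 0x7420746e61772049 (the same bytes, but with their significance reversed) rather than 0x492077616e742074.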

C - Pointer points to random values

I'm pretty confused by pointers in general. Normally a pointer to an array would just print out the values of the array, but here I get something else and I have no idea why. Could someone explain or suggest what is happening?
char *showBits(int dec, char *buf) {
char array[33];
buf=array;
unsigned int mask=1u<<31;
int count=0;
while (mask>0) {
if ((dec & mask) == 0) {
array[count]='0';
}
else {
array[count]='1';
}
count++;
mask=mask>>1;
}
return buf;
}
I was expecting it to return a binary representation of dec, but printing the result produces random garbage.
The problem is that you're returning a reference to local array. Instead, let the caller allocate the buffer. I've also fixed some other problems in the code:
#include <limits.h> /* for CHAR_BIT and UINT_MAX */
#define MAX_BUFFER_LENGTH (sizeof(unsigned int) * CHAR_BIT + 1)
char *to_bit_string(unsigned int n, char *buf) {
unsigned int mask = UINT_MAX - (UINT_MAX >> 1);
char *tmp;
for (tmp = buf; mask; mask >>= 1, tmp++) {
*tmp = n & mask ? '1': '0';
}
*tmp = 0;
return buf;
}
First, we use unsigned int instead of signed int here, because a signed int would be converted to unsigned anyway when used in conjunction with unsigned operands.
Second, unsigned int can have a varying number of bits, so we use sizeof(unsigned int) * CHAR_BIT + 1 to get the maximum buffer size needed.
Third, we use UINT_MAX - (UINT_MAX >> 1) as a handy way to get a value that has only the most-significant bit set, no matter how many value bits the type has.
Fourth, instead of indices we use a moving pointer.
Fifth, we remember to null-terminate the string.
Usage:
char the_bits[MAX_BUFFER_LENGTH];
puts(to_bit_string(0xDEADBEEF, the_bits));
Output
11011110101011011011111011101111
You have
char *showBits(int dec, char *buf);
and the function is expected "to return a binary representation of dec".
Assuming int is 32 bits, do
#define INT_BITS (32) // to avoid all those magic numbers: 32, 32-1, 32+1
Assuming further that the function is called like this:
int main(void)
{
int i = 42;
char buf[INT_BITS + 1]; // + 1 to be able to store the C-string's '\0' terminator.
printf("%d = 0b%s\n", i, showBits(i, buf));
}
You could change your code as follows:
char *showBits(int dec, char *buf) {
// char array[INT_BITS + 1]; // drop this, not needed as buf provides all we need
// buf=array; // drop this; see above
unsigned int mask = (1u << (INT_BITS - 1));
size_t count = 0; // size_t is typically used to type indexes
while (mask > 0) {
if ((dec & mask) == 0) {
buf[count] = '0'; // operate on the buffer provided by the caller.
} else {
buf[count] = '1'; // operate on the buffer provided by the caller.
}
count++;
mask >>= 1; // same as: mask = mask >> 1;
}
buf[INT_BITS] = '\0'; // '\0'-terminate the char array to make it a C-string.
return buf;
}
Alternatively the function can be used like this:
int main(void)
{
...
showBits(i, buf);
printf("%d = 0b%s\n", i, buf);
}
The result printed should look like this in both cases:
42 = 0b00000000000000000000000000101010
A slightly modified version of the code; the caller should provide a buffer large enough to accommodate the string:
char *showBits(unsigned int dec, char *buf) {
unsigned int mask = 1u << 31;
int count = 0;
while (mask>0) {
if ((dec & mask) == 0) {
buf[count] = '0';
}
else {
buf[count] = '1';
}
count++;
mask = mask >> 1;
}
buf[count] = '\0';
return buf;
}

Need to convert int to string using C

Hi I am pretty new to coding and I really need help.
Basically I have a decimal value, and I converted it to a binary value using this method:
long decimalToBinary(long n)
{
int remainder;
long binary = 0, i = 1;
while(n != 0)
{
remainder = n%2;
n = n/2;
binary= binary + (remainder*i);
i = i*10;
}
return binary;
}
And I want to give each character of the binary its own space inside an array. However, I can't seem to save the digits from the return value in my string array. I think it has something to do with converting the long to a string, but I could be wrong! Here is what I have so far.
I do not want to use sprintf(); I do not wish to print the value, I just want it stored inside the array so that the if conditions can read it. Any help would be appreciated!
int decimalG = 24;
long binaryG = decimalToBinary(decimalG);
char myStringG[8] = {binaryG};
for( int i = 0; i<8; i++)
{
if (myStringG[i] == '1' )
{
T1();
}
else
{
T0();
}
}
In this case, since the decimal is 24, the binary would be 11000, so it should execute the function T1() 2 times and T0() 6 times. But it doesn't do that, and I can't seem to find a way to store the saved values in the array.
PS: the itoa() function is also not an option. Thanks in advance! :)
As the post is tagged arm, using malloc() might not be the best approach, although it is the simplest. If you insist on using arrays:
#include <stdio.h>
#include <stdlib.h>
int decimalToBinary(long n, char out[], int len)
{
long remainder;
// C arrays are zero based
len--;
// TODO: check if the input is reasonable
while (n != 0) {
// pick a bit
remainder = n % 2;
// shift n one bit to the right
// It is the same as n = n/2 but
// is more telling of what you are doing:
// shifting the whole thing to the right
// and drop the least significant bit
n >>= 1;
// Check boundaries! Always!
if (len < 0) {
// return zero for "Fail"
return 0;
}
// doing the following four things at once:
// cast remainder to char
// add the numerical value of the digit "0"
// put it into the array at place len
// decrement len
out[len--] = (char) remainder + '0';
}
// pad the remaining leading places with '0' so the whole
// array is initialized (the caller may read all of it)
while (len >= 0)
out[len--] = '0';
// return non-zero value for "All OK"
return 1;
}
// I don't know what you do here, but it
// doesn't matter at all for this example
void T0()
{
fputc('0', stdout);
}
void T1()
{
fputc('1', stdout);
}
int main()
{
// your input
int decimalG = 24;
// an array able to hold 8 (eight) elements of type char
char myStringG[8];
// call decimalToBinary with the number, the array and
// the length of that array
if (!decimalToBinary(decimalG, myStringG, 8)) {
fprintf(stderr, "decimalToBinary failed\n");
exit(EXIT_FAILURE);
}
// Print the whole array
// How to get rid of the leading zeros is left to you
for (int i = 0; i < 8; i++) {
if (myStringG[i] == '1') {
T1();
} else {
T0();
}
}
// just for the optics
fputc('\n', stdout);
exit(EXIT_SUCCESS);
}
Computing the length needed is tricky, but if you know the size of long your micro uses (8, 16, 32, or even 64 bits these days) you can take that as the maximum size for the array. That leaves the leading zeros in, but that should not be a problem, or is it?
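For example, the worst case is simply one character per bit (a sketch; LONG_BIN_DIGITS is a made-up name):
#include <limits.h>
#define LONG_BIN_DIGITS (sizeof(long) * CHAR_BIT)
char buf[LONG_BIN_DIGITS + 1]; /* + 1 only if you also want a '\0' terminator */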
To achieve your goal, you don't have to convert a decimal value to binary:
unsigned decimalG = 24; // Assumed positive, for negative values
// have implementation-defined representation
for (; decimalG; decimalG >>= 1) {
if(decimalG & 1) {
// Do something
} else {
// Do something else
}
}
Or you can use a union, but I'm not sure whether this approach is well defined by the standard.
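That union idea would be something like this (a sketch; reading a union member other than the one last written is sanctioned by C99 and later, but is murkier in C++):
union { int i; unsigned u; } pun;
pun.i = -24; /* store a signed value */
for (unsigned bits = pun.u; bits; bits >>= 1) { /* inspect its representation as unsigned */
    if (bits & 1) {
        // Do something
    } else {
        // Do something else
    }
}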
If you stick to writing decimalToBinary, note that you'll have to use an array:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
char *decimalToBinary(unsigned n);
int
main(void) {
int decimalG = 15;
char *binary = decimalToBinary(decimalG);
puts(binary);
free(binary);
}
char *
decimalToBinary(unsigned n) {
// Don't forget to free() it after use!!!
char *binary = malloc(sizeof(unsigned) * CHAR_BIT + 1);
if(!binary) return 0;
size_t i;
for (i = 0; i < sizeof(unsigned) * CHAR_BIT; i++) {
binary[i] = '0' + ((n >> i) & 1); // in reverse order
}
binary[i] = 0;
return binary;
}
Use the itoa (integer-to-ascii) function.
http://www.cplusplus.com/reference/cstdlib/itoa/
EDIT: Correction:
Don't be an idiot, use the itoa (integer-to-ascii) function.
http://www.cplusplus.com/reference/cstdlib/itoa/
EDIT:
Maybe I wasn't clear enough. I saw the line that said:
PS: the itoa() function is also not an option.
This is completely unreasonable. You want to reinvent the wheel, but you want someone else to do it? What do you possibly have against itoa? It's available practically everywhere, no matter what platform you're targeting or version of C that you're using.
I want to give each character of the binary its own
space inside an array. However, I can't seem to save
the digits from the return value in my string array.
There are a number of ways to approach this, if I understand what you are asking. First, there is no need to actually store the binary representation of your number in an array in order to call T1() or T0() based on the value of any given bit of the number.
Take your example 24 (binary 11000). If I read your post correctly you state:
In this case since the decimal is 24, the binary
would be 11000 therefore it should execute the the
function T1() 2 times and T0() 6 times.
(I'm not sure where you get 6 times; it looks like you intended T0() to be called 3 times.)
If you have T0 and T1 defined, for example, to simply let you know when they are called, e.g.:
void T1 (void) { puts ("T1 called"); }
void T0 (void) { puts ("T0 called"); }
You can write a function (say named callt) to call T1 for each 1-bit and T0 for each 0-bit in a number simply as follows:
void callt (const unsigned long v)
{
if (!v) { putchar ('0'); return; };
size_t sz = sizeof v * CHAR_BIT;
unsigned long rem = 0;
while (sz--)
if ((rem = v >> sz)) {
if (rem & 1)
T1();
else
T0();
}
}
For example, if you passed 24 to the function, callt (24), the output would be:
$ ./bin/dec2bincallt
T1 called
T1 called
T0 called
T0 called
T0 called
(full example provided at the end of answer)
On the other hand, if you really do want to give each character of the binary its own space inside an array, then you simply need to pass an array to capture the bit values (either the ASCII character representations '0' and '1', or just 0 and 1) instead of calling T0 and T1. (You would also add a few lines to handle v = 0, and a nul-terminating character if you will use the array as a string.) For example:
/** copy 'sz' bits of the binary representation of 'v' to 's'.
* returns pointer to 's', on success, empty string otherwise.
* 's' must be adequately sized to hold 'sz + 1' bytes.
*/
char *bincpy (char *s, unsigned long v, unsigned sz)
{
if (!s || !sz) {
*s = 0;
return s;
}
if (!v) {
*s = '0';
*(s + 1) = 0;
return s;
}
unsigned i;
for (i = 0; i < sz; i++)
s[i] = (v >> (sz - 1 - i)) & 1 ? '1' : '0';
s[sz] = 0;
return s;
}
Let me know if you have any additional questions. Below are two example programs. Both take as their first argument the number to convert (or to process) as binary (default: 24 if no argument is given). The first simply calls T1 for each 1-bit and T0 for each 0-bit:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h> /* for CHAR_BIT */
void callt (const unsigned long v);
void T1 (void) { puts ("T1 called"); }
void T0 (void) { puts ("T0 called"); }
int main (int argc, char **argv) {
unsigned long v = argc > 1 ? strtoul (argv[1], NULL, 10) : 24;
callt (v);
return 0;
}
void callt (const unsigned long v)
{
if (!v) { putchar ('0'); return; };
size_t sz = sizeof v * CHAR_BIT;
unsigned long rem = 0;
while (sz--)
if ((rem = v >> sz)) {
if (rem & 1) T1(); else T0();
}
}
Example Use/Output
$ ./bin/dec2bincallt
T1 called
T1 called
T0 called
T0 called
T0 called
$ ./bin/dec2bincallt 11
T1 called
T0 called
T1 called
T1 called
The second stores each bit of the binary representation for the value as a nul-terminated string and prints the result:
#include <stdio.h>
#include <stdlib.h>
#define BITS_PER_LONG 64 /* define as needed */
char *bincpy (char *s, unsigned long v, unsigned sz);
int main (int argc, char **argv) {
unsigned long v = argc > 1 ? strtoul (argv[1], NULL, 10) : 24;
char array[BITS_PER_LONG + 1] = "";
printf (" values in array: %s\n", bincpy (array, v, 16));
return 0;
}
/** copy 'sz' bits of the binary representation of 'v' to 's'.
* returns pointer to 's', on success, empty string otherwise.
* 's' must be adequately sized to hold 'sz + 1' bytes.
*/
char *bincpy (char *s, unsigned long v, unsigned sz)
{
if (!s || !sz) {
*s = 0;
return s;
}
if (!v) {
*s = '0';
*(s + 1) = 0;
return s;
}
unsigned i;
for (i = 0; i < sz; i++)
s[i] = (v >> (sz - 1 - i)) & 1 ? '1' : '0';
s[sz] = 0;
return s;
}
Example Use/Output
(padding to 16 bits)
$ ./bin/dec2binarray
values in array: 0000000000011000
$ ./bin/dec2binarray 11
values in array: 0000000000001011

Bitwise operation on char array

So I have a binary representation of a number as a character array. What I need to do is shift this representation to the right by 11 bits.
For example,
I have a char array which is currently storing this string: 11000000111001
After performing a bitwise shift, I will get 110 with some zeros before it.
I tried using this function but it gave me strange output:
char *shift_right(unsigned char *ar, int size, int shift)
{
int carry = 0; // Clear the initial carry bit.
while (shift--) { // For each bit to shift ...
for (int i = size - 1; i >= 0; --i) { // For each element of the array from high to low ...
int next = (ar[i] & 1) ? 0x80 : 0; // ... if the low bit is set, set the carry bit.
ar[i] = carry | (ar[i] >> 1); // Shift the element one bit right and add the old carry.
carry = next; // Remember the old carry for next time.
}
}
return ar;
}
Any help on this would be very much appreciated; let me know if I'm not being clear.
They are just characters...
char *shift_right(char *ar, int size, int shift)
{
memmove(&ar[shift], ar, size - shift); /* move the leading characters right by shift */
memset(ar, '0', shift); /* fill the vacated leading positions with '0' */
return ar;
}
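Taking the string from the question, a quick check of this character-level version (assuming the function above):
#include <stdio.h>
#include <string.h>

int main(void) {
    char bits[] = "11000000111001";
    puts(shift_right(bits, (int)strlen(bits), 11)); /* prints 00000000000110 */
    return 0;
}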
Or, convert the string to a long-long, shift it, then back to a string:
char *shift_right(char *ar, int size, int shift)
{
unsigned long long x;
char *cp;
x=strtoull(ar, &cp, 2); // As suggested by 'Don't You Worry Child'
x = x >> shift;
while(cp > ar)
{
--cp;
*cp = (1 & x) ? '1' : '0';
x = x >> 1;
}
return ar;
}
If you really want to use bitwise shifting, then you can't do it on a string. Simply not possible!!
You have to convert it to an integer (use strtol for that), then do the bitwise shifting. After that, convert it back to a string (there is no standard library function for that; use a for loop).
I would advise to keep the code simple and readable.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
void shift_right (char* dest, const char* source, int shift_n)
{
uint16_t val = strtoul(source, NULL, 2);
val >>= shift_n;
for(uint8_t i=0; i<16; i++)
{
if(val & 0x8000) // first item of the string is the MSB
{
dest[i] = '1';
}
else
{
dest[i] = '0';
}
val <<= 1; // keep evaluating the number from MSB and down
}
dest[16] = '\0';
}
int main()
{
const char str [16+1] = "0011000000111001";
char str_shifted [16+1];
puts(str);
shift_right(str_shifted, str, 11);
puts(str_shifted);
return 0;
}

IEEE 754 arithmetic on 4 bytes (32 bits)

I wrote this code to do IEEE 754 floating-point arithmetic on a 4-byte string.
It takes in the bytes, converts them to binary, extracts the sign, exponent, and mantissa from the bits, and then does the calculation.
It all works just about perfectly: 0xDEADBEEF gives me 6259853398707798016, and the true answer is 6.259853398707798016E18. These are the same value, and I won't have anything this large in the project I'm working with; all other, smaller values put the decimal in the correct place.
Here is my code:
float calcByteValue(uint8_t data[]) {
int i;
int j = 0;
int index;
int sign, exp;
float mant;
char bits[8] = {0};
int *binary = malloc(32*sizeof *binary);
for (index = 0;index < 4;index++) {
for (i = 0;i < 8;i++,j++) {
bits[i] = (data[index] >> 7-i) & 0x01;
if (bits[i] == 1) {
binary[j] = 1;
} else {
binary[j] = 0;
}
}
printf("\nindex(%d)\n", index);
}
sign = getSign(&(binary[0]));
mant = getMant(&(binary[0]));
exp = getExp(&(binary[0]));
printf("\nBinary: ");
for (i = 0;i < 32;i++)
printf("%d", binary[i]);
printf("\nsign:%d, exp:%d, mant:%f\n",sign, exp, mant);
float f = pow(-1.0, sign) * mant * pow(2,exp);
printf("\n%f\n", f);
return f;
}
//-------------------------------------------------------------------
int getSign(int *bin) {
return bin[0];
}
int getExp (int *bin) {
int expInt, i, b, sum;
int exp = 0;
for (i = 0;i < 8;i++) {
b = 1;
b = b<<(7-i);
if (bin[i+1] == 1)
exp += bin[i+1] * b;
}
return exp-127;
}
float getMant(int *bin) {
int i,j;
float b;
float m = 0; /* initialize the accumulator before the += loop below */
int manBin[24] = {0};
manBin[0] = 1;
for (i = 1,j=9;j < 32;i++,j++) {
manBin[i] = bin[j];
printf("%d",manBin[i]);
}
for (i = 0;i < 24;i++) {
m += manBin[i] * pow(2,-i);
}
return m;
}
Now, my teacher told me that there is a much easier way where I can just take in the stream of bytes, and turn it into a float and it should work. I tried doing it that way but could not figure it out if my life depended on it.
I'm not asking you to do my homework for me, I have it done and working, but I just need to know if I could have done it differently/easier/more efficiently.
EDIT: there are a couple special cases I need to handle, but it's just things like if the exponent is all zeros blah blah blah. Easy to implement.
The teacher probably had this in mind:
char * str; // your deadbeef
float x;
memcpy(&x, str, sizeof(float));
I would advise against it because of the issues with endianness. But if your teacher wants it, he shall have it.
I think you want a union - just create a union where one member is a 4 character array, and the other a float. Write the first, then read the second.
Looking at what your code does then the "4 byte string" looks like it already contains the binary representation of a 32 bit float, so it already exists in memory at the address specified by data in big endian byte order.
You could probably cast the array data to a float pointer and dereference that (if you can assume the system you are running on is big endian and that data will be correctly aligned for the float type on your platform).
Alternatively, if you need more control (for example, to change the byte order or ensure alignment), you could look into type punning using a union of a uint8_t array and a float: copy the bytes into the union's uint8_t array, then read the float member.
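That union-based punning might look like this (a sketch; it assumes the four bytes are already in the machine's native byte order):
#include <stdint.h>
#include <stdio.h>

union f32 {
    uint8_t bytes[4];
    float f;
};

int main(void) {
    /* 0x41c80000 is 25.0f; this byte order assumes a little-endian machine */
    union f32 u = { .bytes = { 0x00, 0x00, 0xc8, 0x41 } };
    printf("%f\n", u.f); /* read the same storage back as a float */
    return 0;
}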
Here is my working code (C++):
#include <iostream>
using namespace std;

int main() {
unsigned char val[4] = {0, 0, 0xc8, 0x41};
cout << val << endl;
cout << "--------------------------------------------" << endl;
float f = *(float*)&val;
cout << f << endl;
return 0;
}
