I am trying to understand when casting causes data loss and how it works.
For the following examples, I try to determine whether there is data loss and, if so, why:
(i - int (4 bytes), f - float (4 bytes), d - double (8 bytes))
i == (int)(float) i; // sizeof(int)==sizeof(float) <- no loss
i == (int)(double) i; // sizeof(int)!=sizeof(double) <- possible loss
f == (float)(double) f;// sizeof(double)!=sizeof(float) <- possible loss
d == (float) d;// sizeof(double)!=sizeof(float) <- possible loss
Is it sufficient to base the answer only on the type sizes (plus rounding)?
Assuming 32-bit ints and normal 4- and 8-byte IEEE-754 floats/doubles, it would be:
i == (int)(float) i; // possible loss (32 -> 24 -> 32 bits)
i == (int)(double) i; // no loss (32 -> 53 -> 32 bits)
f == (float)(double) f; // no loss (24 -> 53 -> 24 bits)
d == (float) d; // possible loss (53 -> 24 -> 53 bits)
Note that int has 32 bits of precision, float has 24 (23 stored plus an implicit leading bit), and double has 53 (52 stored).
The memory allocated to store a variable of a given type is not the only factor to consider for data loss. Generally, the rounding behavior and how the CPU handles numeric overflow are other aspects you might want to look into.
Just because sizeof reports the same size in memory does not mean there is no data loss.
Consider 0.5.
You can store that in a float, but you cannot store it in an integer.
Therefore, data loss.
I.e. I want 0.5 of that cake. Cannot represent that as an integer. Either get nothing or lots of cakes. Yum
Why integer? Because you may need only whole numbers, for example an ID number.
Why float? Because you may need to work with real numbers, for example percentage calculations.
Why double? For when you have real numbers that cannot fit into a float.
I'm trying to find out exactly what this program prints.
#include <stdio.h>
int main() {
float bf = -62.140625;
int bi = *(int *)&bf;
int ci = bi+(1<<23);
float cf = *(float *)&ci;
printf("%X\n",bi);
printf("%f\n",cf);
}
This prints out:
C2789000
-124.281250
But what happens line by line? I do not understand.
Thanks in advance.
It is a convoluted way of doubling a 32-bit floating-point number by adding one to its exponent. Moreover, it is incorrect, because it violates the strict aliasing rule by accessing an object of type float via type int.
The exponent is located at bits 23 to 30. Adding 1<<23 increments the exponent by one, which works like multiplying the original number by 2.
If we rewrite this program to remove the pointer-based type punning:
#include <stdio.h>
#include <string.h>

int main() {
    float bf = -62.140625;
    int bi;
    memcpy(&bi, &bf, sizeof(bi));
    for(int i = 0; i < 32; i += 8)
        printf("%02x ", ((unsigned)bi >> i) & 0xff);
    printf("\n");
    bi += (1<<23);
    memcpy(&bf, &bi, sizeof(bi));
    printf("%f\n",bf);
}
Float numbers have the format: 1 sign bit, 8 exponent bits (biased by 127), and 23 fraction bits.
-62.140625 is 1.94189453125 × 2^5, so its unbiased exponent is 5 (a stored field value of 132).
bi += (1<<23);
increments the exponent by one, so the resulting float number will be -62.140625 * 2^1, which is equal to -124.281250. If you change that line to
bi += (1<<24);
it will increment the exponent by two, so the resulting float number will be -62.140625 * 2^2, which is equal to -248.562500.
float bf = -62.140625;
This creates a float object named bf and initializes it to −62.140625.
int bi = *(int *)&bf;
&bf takes the address of bf, which produces a pointer to a float. (int *) says to convert this to a pointer to an int. Then * says to access the pointed-to memory, as if it were an int.
The C standard does not define the behavior of this access, but many C implementations support it, sometimes requiring a command-line switch to enable support for it.
A float value is normally encoded in some way. −62.140625 is not an integer, so it cannot be stored as a binary numeral that represents an integer. It is encoded. Reinterpreting the bytes of memory as an int using * (int *) &bf is an attempt to get the bits into an int so they can be manipulated directly, instead of through floating-point operations.
int ci = bi+(1<<23);
The format most commonly used for the float type is IEEE-754 binary32, also called “single precision.” In this format, bit 31 is a sign bit, bits 30-23 encode an exponent and/or some other information, and bits 22-0 encode most of a significand (or, in the case of a NaN, other information). (The significand is the fraction part of a floating-point representation. A floating-point format represents a number as ±F•b^e, where b is a fixed base, F is a number with a fixed precision in a certain range, and e is an exponent in a certain range. F is the significand.)
1<<23 is 1 shifted 23 bits, so it is 1 in the exponent field, bits 30-23.
If the exponent field contains 1 to 253, then adding 1 to it increases the encoded exponent by 1. (The codes 0 and 255 have special meaning in the exponent field. 254 is a normal value, but adding 1 to it overflows into the special code 255, so it will not increase the exponent in a normal way.)
Since the base b of a binary floating-point format is 2, increasing the exponent by 1 multiplies the number represented by 2. ±F•b^e becomes ±F•b^(e+1).
float cf = *(float *)&ci;
This is the opposite of the previous reinterpretation: It says to reinterpret the bytes of the int as a float.
printf("%X\n",bi);
This says to print bi using a hexadecimal format. This is technically wrong; the %X format should be used with an unsigned int, not an int, but most C implementations let it pass.
printf("%f\n",cf);
This prints the new float value.
The payload of a BLE packet is limited to 20 bytes. I need to transfer the following data over it:
struct Data {
uint16_t top;
uint16_t bottom;
float accX;
float accY;
float accZ;
float qx;
float qy;
float qz;
float qw;
};
The size of Data is 32 bytes. The precision of the floats cannot be sacrificed, since they represent accelerometer and quaternion values; representing them imprecisely would create a huge drift error, as the data is integrated over time.
I don't want to send 2 packets either, as it's really important that the whole data set is sampled at the same time.
I'm planning to take advantage of the range instead.
Accelerometer values are IEEE floats in the range [-10, 10].
Quaternions are IEEE floats in the range [-1, 1]. We could drop w, since x^2 + y^2 + z^2 + w^2 = 1.
top and bottom are 10 bits each.
Knowing this information, how can I serialize Data using at most 20 bytes?
Assuming binary32, code is using 2*16 + 7*32 bits (256 bits) and OP wants to limit to 20*8 bits (160).
Some savings:
1) 10 bit uint16_t,
2) reduced exponent range saves a bit or 2 per float - it would save a few more bits if OP stated the minimum exponent as well (estimate 4 bits total),
3) Not coding w.
This makes for 2*10 + 6*(32-4) = 188 bits, still not down to 160.
OP says "The precision of floats can not be sacrificed", implying the 24-bit significand (23 bits explicitly coded) is needed. 7 floats * 23 bits is 161 bits, and that is not counting the signs, exponents, nor the 2 uint16_t.
So unless some pattern or redundant information can be eliminated, OP is outta luck.
Suggest taking many samples of data and trying to compress them using LZ or another compression technique. If OP ends up with significantly less than 20 bytes per sample on average, then the answer is yes - in theory; else you are SOL.
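To illustrate point 2) in code: a hypothetical quantize/dequantize pair (names and rounding choices are mine) that packs a float with a known range into n bits. Note that this does sacrifice precision, which the question rules out, so it is a sketch of the technique rather than a solution:

```c
#include <stdint.h>

/* Map x, known to lie in [lo, hi], to an n-bit unsigned code (n <= 31). */
static uint32_t quantize(float x, float lo, float hi, int n) {
    uint32_t levels = (1u << n) - 1u;
    float t = (x - lo) / (hi - lo);              /* normalize to [0, 1] */
    return (uint32_t)(t * (float)levels + 0.5f); /* round to nearest code */
}

/* Recover an approximation of the original value from its code. */
static float dequantize(uint32_t code, float lo, float hi, int n) {
    uint32_t levels = (1u << n) - 1u;
    return lo + (hi - lo) * ((float)code / (float)levels);
}
```

With, say, 12 bits per component the worst-case error is (hi-lo)/4095, which may or may not be tolerable once the data is integrated over time; that trade-off is exactly what the question forbids.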
I'm working on a small project where I need float multiplication with 16-bit floats (half precision). Unfortunately, I'm facing some problems with the algorithm:
Example Output
1 * 5 = 5
2 * 5 = 10
3 * 5 = 14.5
4 * 5 = 20
5 * 5 = 24.5
100 * 4 = 100
100 * 5 = 482
The Source Code
const int bits = 16;
const int exponent_length = 5;
const int fraction_length = 10;
const int bias = (1 << (exponent_length - 1)) - 1; // 15
const int exponent_mask = ((1 << 5) - 1) << fraction_length;
const int fraction_mask = (1 << fraction_length) - 1;
const int hidden_bit = (1 << 10); // Was 1 << 11 before update 1
int float_mul(int f1, int f2) {
int res_exp = 0;
int res_frac = 0;
int result = 0;
int exp1 = (f1 & exponent_mask) >> fraction_length;
int exp2 = (f2 & exponent_mask) >> fraction_length;
int frac1 = (f1 & fraction_mask) | hidden_bit;
int frac2 = (f2 & fraction_mask) | hidden_bit;
// Add exponents
res_exp = exp1 + exp2 - bias; // Remove double bias
// Multiply significants
res_frac = frac1 * frac2; // 11 bit * 11 bit → 22 bit!
// Shift 22bit int right to fit into 10 bit
if (highest_bit_pos(res_frac) == 21) {
res_frac >>= 11;
res_exp += 1;
} else {
res_frac >>= 10;
}
}
res_frac &= ~hidden_bit; // Remove hidden bit
// Construct float
return (res_exp << bits - exponent_length - 1) | res_frac;
}
By the way: I'm storing the floats in ints because I'll later try to port this code to some kind of assembler without floating-point operations.
The Question
Why does the code work for some values only? Did I forget some normalization or similar? Or does it work only by accident?
Disclaimer: I'm not a CompSci student, it's a leisure project ;)
Update #1
Thanks to the comment by Eric Postpischil I noticed one problem with the code: the hidden_bit flag was off by one (it should be 1 << 10). With that change, I don't get decimal places any more, but some calculations are still off (e.g. 3•3=20). I assume it's the res_frac shift, as described in the answers.
Update #2
The second problem with the code was indeed the res_frac shifting. After update #1 I got wrong results when frac1 * frac2 produced 22-bit results. I've updated the code above with the corrected shift statement. Thanks to all for every comment and answer! :)
From a cursory look:
No attempt is made to determine the location of the high bit in the product. Two 11-bit numbers, each with its high bit set, may produce a 21- or 22-bit number. (Example with two-bit numbers: binary 10 • 10 is 100, three bits, but binary 11 • 11 is 1001, four bits.)
The result is truncated instead of rounded.
Signs are ignored.
Subnormal numbers are not handled, on input or output.
11 is hardcoded as a shift amount in one place. This is likely incorrect; the correct amount will depend on how the significand is handled for normalization and rounding.
In decoding, the exponent field is shifted right by fraction_length. In encoding, it is shifted left by bits - exponent_length - 1. To avoid bugs, the same expression should be used in both places.
From a more detailed look by chux:
res_frac = frac1 * frac2 fails if int is less than 23 bits (22 for the product and one for the sign).
This is more a suggestion for how to make it easier to get your code right, rather than analysis of what is wrong with the existing code.
There are a number of steps that are common to some or all of the floating point arithmetic operations. I suggest extracting each into a function that can be written with focus on one issue, and tested separately. Then when you come to write e.g. multiplication, you only have to deal with the specifics of that operation.
All the operations will be easier working with a structure that has the actual signed exponent, and the full significand in a wider unsigned integer field. If you were dealing with signed numbers, it would also have a boolean for the sign bit.
Here are some sample operations that could be separate functions, at least until you get it working:
unpack: Take a 16 bit float and extract the exponent and significand into a struct.
pack: Undo unpack - deal with dropping the hidden bit, applying the bias to the exponent, and combining the fields into a float.
normalize: Shift the significand and adjust the exponent to bring the most significant 1-bit to a specified bit position.
round: Apply your rounding rules to drop low significance bits. If you want to do IEEE 754 style round-to-nearest, you need a guard digit that is the most significant bit that will be dropped, and an additional bit indicating if there are any one bits of lower significance than the guard bit.
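A minimal sketch of the unpack/pack pair for binary16 (struct and field names are mine; the sign bit and subnormals are ignored, as in the question's code):

```c
#include <stdint.h>

struct Half {
    int exp;        /* unbiased exponent */
    uint32_t sig;   /* significand with the hidden bit explicit, in a wide field */
};

/* Extract exponent and significand from a binary16 bit pattern. */
static struct Half unpack(uint16_t h) {
    struct Half r;
    r.exp = ((h >> 10) & 0x1F) - 15;   /* remove the bias of 15 */
    r.sig = (h & 0x3FF) | 0x400;       /* restore the hidden bit */
    return r;
}

/* Undo unpack: drop the hidden bit, re-apply the bias, combine the fields. */
static uint16_t pack(struct Half v) {
    return (uint16_t)((((unsigned)(v.exp + 15) & 0x1F) << 10) | (v.sig & 0x3FF));
}
```

For example, 5.0 in binary16 is 0x4500; unpack gives exp == 2 and sig == 0x500, and pack restores 0x4500.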
One problem is that you are truncating instead of rounding:
res_frac >>= 11; // Shift 22bit int right to fit into 10 bit
You should compute res_frac & 0x7ff first, the part of the 22-bit result that your algorithm is about to discard, and compare it to 0x400. If it is below, truncate. If it is above, round away from zero. If it is equal to 0x400, round to the even alternative.
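That rule, written as a small helper (a sketch; the case where rounding up overflows the significand and requires an exponent bump is not handled):

```c
#include <stdint.h>

/* Reduce a 22-bit product to 11 bits, round-to-nearest, ties-to-even. */
static uint32_t round_ties_even(uint32_t frac22) {
    uint32_t keep = frac22 >> 11;       /* bits that survive */
    uint32_t rem  = frac22 & 0x7FF;     /* bits about to be discarded */
    if (rem > 0x400) return keep + 1;   /* above halfway: round up */
    if (rem < 0x400) return keep;       /* below halfway: truncate */
    return keep + (keep & 1);           /* exactly halfway: round to even */
}
```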
Hi, I am trying to concatenate 4 integers into one integer. I used the concatenate function found here:
https://stackoverflow.com/a/12700533/2016977
My code:
unsigned concatenate(unsigned x, unsigned y) {
unsigned pow = 10;
while(y >= pow)
pow *= 10;
return x * pow + y;
}
void stringtoint(){
struct router *ptr;
ptr=start;
while(ptr!=NULL){
int a;
int b;
int c;
int d;
sscanf(ptr->ip, "%d.%d.%d.%d", &a, &b, &c, &d);
int num1 = concatenate(a,b);
int num2 = concatenate(c,d);
int num3 = concatenate(num1,num2);
printf("%d\n",num3);
ptr=ptr->next;
};
}
The problem:
I am dealing with IP addresses, e.g. 198.32.141.140. I break them down into 4 integers and concatenate them to form 19832141140; however, for large numbers my concatenate function produces garbage: 198.32.141.140 becomes -1642695340.
It does concatenate IPs that are small numbers correctly, e.g. 164.78.104.1 becomes 164781041.
How should I solve this problem? Basically, I am trying to turn an IP string, e.g. 198.32.141.140, into the integer 19832141140.
Your proposed approach is likely a very big mistake. How do you distinguish 127.0.1.1 from 127.0.0.11?
It's much better to treat IP addresses as exactly what they are. Namely, a.b.c.d represents
a * 256^3 + b * 256^2 + c * 256^1 + d * 256^0
and done in this way you can not possibly run into the issue I just described. Moreover, the implementation is trivial:
unsigned int number;
number = (a << 24) + (b << 16) + (c << 8) + d;
You may read a line and then use inet_aton(). Otherwise, you can do as Jason says, but you'd need to check each integer's value to be within 0 ... 255 (those 4 x 8 bits make up the 32-bit integer containing an IPv4 address). inet_aton() supports hex, decimal, and octal notation of IPv4 addresses.
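A sketch combining the parsing, range checks, and shifts (the function name and error convention are mine):

```c
#include <stdio.h>
#include <stdint.h>

/* Parse "a.b.c.d" into a 32-bit value; returns 0 on success, -1 on error. */
static int ipv4_to_u32(const char *s, uint32_t *out) {
    unsigned a, b, c, d;
    if (sscanf(s, "%u.%u.%u.%u", &a, &b, &c, &d) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;                       /* each octet must fit in 8 bits */
    *out = ((uint32_t)a << 24) | (b << 16) | (c << 8) | d;
    return 0;
}
```

With this representation, 198.32.141.140 packs to 0xC6208D8C, and 127.0.1.1 (0x7F000101) and 127.0.0.11 (0x7F00000B) stay distinct.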
/**
** You DO NOT want to do this usually...
**/
#include <stdint.h>
uint_fast64_t
concatIPv4Addr(uint_fast16_t parts[])
{
uint_fast64_t n = 0;
for (int i = 0; i < 3; ++i) {
n += parts[i];
n *= 1000;
}
return (n += parts[3]);
}
I used the "fast" integer types for speed purposes, but if you have a storage requirement, use the corresponding "least" types instead. Of course this assumes you have a C99 compiler or a C89 compiler with extensions. Otherwise you're stuck with the primitive types where a char could even be 32-bit according to the C standard. Since I don't know your target environment, I made no assumptions. Feel free to change to the appropriate primitive types as you see fit.
I used a 16-bit value (minimum) because an 8-bit number can only represent 0-255, meaning if 358 was entered accidentally, it would be interpreted as 102, which is still valid. If you have a type able to store more than 8 bits and less than 16 bits, you can obviously use that, but the type must be able to store more than 8 bits.
That aside, you will need at least a 38-bit type:
4294967295 (maximum 32-bit unsigned value)
255255255255 (255.255.255.255 converted to the integer you want)
274877906943 (maximum 38-bit unsigned value)
The function above will convert 127.0.1.1 and 127.0.0.11 to 127000001001 and 127000000011 respectively:
127.0.1.1 ->
127.000.001.001 ->
127000001001
127.0.0.11 ->
127.000.000.011 ->
127000000011
Why so many zeros? Because otherwise you can't tell the difference between them! As others have said, you could confuse 127.0.1.1 and 127.0.0.11. Using the function above or something more appropriate that actually converts an IPv4 address to its real decimal representation, you won't have such a problem.
Lastly, I did no validation on the IPv4 address passed to the function. I assume you already ensure the address is valid before calling any functions that save or use it. BTW, if you wanted to do the same thing for IPv6, you couldn't do it so easily, because it would require converting each of the 8 parts to decimal, each of which is at most 16 bits, yielding 5 decimal digits per part, or 40 digits total. To store that, you'd need a minimum of 133 bits, rather than the 128 bits required for the IPv6 address, just as you'd need 38 bits to store an IPv4 address this way instead of the 32 bits actually required.
Still not too bad, right? How about a theoretical IPv8 where there are 16 parts, each 32 bits in size? The equivalent function to the one above would require 532 bits, instead of the proper mathematical requirement: 512 bits. While not a problem today, I'm simply pointing out the error of doing anything with an IPv4 address represented by concatenating the decimal values of each part. It scales absolutely terribly.
I'm doing some exercises for a Computer Science course, and one of them is about converting a 64-element bit array into its double-precision floating-point value.
Understanding the first bit, the sign +/-, is quite easy. The same goes for the exponent, since we know that the bias is 1023.
We are having problems with the significand. How can I calculate it?
In the end, I would like to obtain the real number that the bits represent.
Computing the significand of the given 64 bits is quite easy.
According to the wiki article on IEEE 754, a binary64 significand has 53 bits of precision, of which only 52 (bits 0 to 51) are stored.
If you want to fit a number with more bits, say 67, into your 64-bit value, the excess trailing bits are rounded away:
11110000 11110010 11111 becomes 11110000 11110011 after rounding off the trailing 5 bits (they are more than half a unit in the last place, so the result rounds up).
Also, for a normalized number, the leading bit of the significand always has the value one, so there is no need to store it.
That's why you only store 52 bits of the significand instead of 53.
Now, to compute it, you just need to take the stored bits of the significand - the implicit leading bit is always 1 - and sum their weights:
int index_signf = 1;                  // starting at 1, not 0
int significand_length = 52;
int byteArray[53];                    // bits of the significand; byteArray[0] is the implicit 1
double significand_endValue = 0;
for ( ; index_signf <= significand_length; index_signf++)
{
    significand_endValue += byteArray[index_signf] * pow(2, -index_signf);
}
significand_endValue += 1;            // add the implicit leading 1
Now you just have to fill byteArray accordingly before computing it, using a function like this:
void getSignificandBits(int* array64bits, int* significandBitsArray){
    // assumes array64bits[0] is the sign bit, [1..11] the exponent,
    // and [12..63] the stored significand, most significant bit first;
    // significandBitsArray must have room for 53 elements
    significandBitsArray[0] = 1;      // the implicit leading 1
    for(int i = 0; i < 52; i++)
        significandBitsArray[i + 1] = array64bits[12 + i];
}
You could just load the bits into an unsigned integer of the same size as a double, take its address, cast that to void * and then to double *, and dereference it.
Of course, this might be "cheating" if you really are supposed to parse the floating point standard, but this is how I would have solved the problem given the parameters you've stated so far.
If you have a byte representation of an object you can copy the bytes into the storage of a variable of the right type to convert it.
#include <stdint.h>
#include <string.h>

double convert_to_double(uint64_t x) {
    double result;
    memcpy(&result, &x, sizeof(x));
    return result;
}
You will often see code like *(double *)&x used for this conversion, but although in practice it usually works, it is undefined behavior in C.
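As a quick check of the approach, here is a self-contained version decoding a known bit pattern (the example value is mine): 0x3FF0000000000000 is the binary64 encoding of 1.0 (sign 0, biased exponent 1023, fraction 0).

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as an IEEE-754 binary64 value. */
static double bits_to_double(uint64_t x) {
    double result;
    memcpy(&result, &x, sizeof result);  /* byte copy: no aliasing violation */
    return result;
}
/* bits_to_double(0x3FF0000000000000ULL) yields 1.0 */
```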