This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 7 years ago.
I am trying to convert two bytes to a float, and I have a problem with precision.
In my case I read a temperature and store it in two bytes. For example, 14.69 °C is stored as 14 (decimal) in one byte and 69 (decimal) in the second byte. I would then like to convert these bytes to a float and compare it with another float, for example:
byte byte1 = 0xE;
byte byte2 = 0x45;
float temp1 = (float) byte1*1.0 + (float) byte2*0.01; // byte2*0.1 if byte2<10
float temp2 = 14.69;
...
if (temp1==temp2){
...
}
I expected temp1 to be 14.69, but its value is 14.68999958. Why, and what is the solution?
Floating point operations can each introduce a small rounding error, and values such as 0.01 and 14.69 have no exact binary representation to begin with. You can reduce the error by doing as much of the arithmetic as possible in integers. For example:
((float)((unsigned int)byte1 * 100 + (unsigned int)byte2))/100.0
Also, comparing floats for strict equality can fail because of these precision issues. Compare against a tolerance instead: if (fabsf(f1 - f2) < EPSILON), where EPSILON is a small tolerance you choose for your data.
I think you should compare the bytes as they are, before converting them to float; floats are not reliable when it comes to equality tests.
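Both suggestions above can be sketched roughly as follows. The function names, the hundredths encoding, and the tolerance value are assumptions made for illustration; they are not part of the original code.

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding from the question: whole degrees in one byte,
   hundredths of a degree (0..99) in the other. */
int temp_to_hundredths(uint8_t whole, uint8_t hundredths)
{
    return (int)whole * 100 + (int)hundredths;
}

/* Exact comparison: stay in integers, so no rounding is involved. */
bool temp_equals(uint8_t whole, uint8_t hundredths, int target_hundredths)
{
    return temp_to_hundredths(whole, hundredths) == target_hundredths;
}

/* Tolerance comparison for values that are already floats.
   The tolerance is an assumed value: half of one hundredth. */
bool nearly_equal(float a, float b)
{
    const float tolerance = 0.005f;
    return fabsf(a - b) < tolerance;
}
```

With the bytes from the question, temp_equals(0xE, 0x45, 1469) compares 1469 with 1469 and never touches floating point.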
This question already has an answer here:
pow() function in C problems [duplicate]
(1 answer)
Closed 3 years ago.
I'm trying to multiply two 3-digit numbers.
I used two nested for loops, multiplied each digit of num1 with each digit of num2,
and shifted each result to the appropriate place using pow().
The problem is that 3*pow(10,2) comes out as 299 instead of 300.
I haven't tried much, but I used printf to see what is actually happening at runtime, and this is what I found.
The values of tempR after the shift should be
5,40,300,100,800,6000,1500,12000,90000
but are coming as
5,40,299,100,799,6000,1500,12000,89999
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int result = 0; // final result (must start at 0 before accumulating)
    int tempR; // temporary for each iteration
    char a[] = "345"; // number 1
    char b[] = "321"; // number 2
    for (int i = 2; i >= 0; i--)
    {
        for (int j = 2; j >= 0; j--)
        {
            int shift = abs(i - 2 + j - 2);
            printf("%d\n", shift); // used to see the values of shift,
                                   // and they come out as expected
            tempR = (int)(b[i] - '0') * (int)(a[j] - '0');
            printf("%d \n", tempR); // value of tempR is correct here
            tempR = tempR * pow(10, shift); // double result is truncated on assignment
            printf("%d \n", tempR); // here the problem starts
            result += tempR;
        }
    }
    printf("%d", result);
}
Although IEEE754 (ubiquitous on desktop systems) is required to return the best possible floating point value for certain operators such as addition, multiplication, division, and subtraction, and certain functions such as sqrt, this does not apply to pow.
pow(x, y) can be, and often is, implemented as exp(y * log(x)). Hopefully you can see that this can leave the result a tiny amount off even when pow is used with seemingly trivial integral arguments, and truncating that result to int then turns the tiny error into an off-by-one.
There are C implementations out there that have more accurate implementations of pow than the one you have, particularly for integral arguments. If such accuracy is required, you could move your toolset to such an implementation. Borrowing an implementation of pow from a respected mathematics library is also an option; otherwise roll your own. Using round is also a technique, albeit a slightly kludgy one.
Avoid floating point functions for integer calculations. The pow result is not guaranteed to be exact. In this case the product is slightly below 300, and the conversion to int truncates it to 299.
The pow function operates on doubles. Doubles use finite precision. Conversion back to integer chops rather than rounding.
Finite precision is like representing 1/3 as 0.333333. If you do 9 * 1/3 and chop to an integer, you'll get 2 instead of 3 because 9 * 1/3 will give 2.999997 which chops to two.
This same kind of rounding and chopping is causing you to be off by one. You could also round by adding 0.5 before chopping to an integer, but I wouldn't suggest it.
Don't pass integers through doubles and back if you expect exact answers.
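The chopping described above can be reproduced with a small sketch. The two function names are hypothetical, and the rounding helper assumes a non-negative input.

```c
/* Conversion to int discards the fraction (truncates toward zero). */
int chop_to_int(double v)
{
    return (int)v;
}

/* Rounds a non-negative value to nearest by adding 0.5 before truncating.
   Assumes v >= 0 and that the result fits in an int. */
int round_nonneg_to_int(double v)
{
    return (int)(v + 0.5);
}
```

Passing 9 * 0.333333 (about 2.999997) to chop_to_int gives 2, while round_nonneg_to_int gives 3 for the same value.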
Others have mentioned that pow does not yield exact results, and if you convert the result to an integer there is a high risk of losing precision, especially since assigning a floating point value to an integer type truncates the result rather than rounding it. Read more here: Is floating point math broken?
The most convenient solution is to write your own integer variant of pow. It can look like this:
int int_pow(int num, int e)
{
    int ret = 1;
    while(e-- > 0)
        ret *= num;
    return ret;
}
Note that it does not handle a negative e, and it returns 1 when both num and e are 0. It also has no protection against overflow. It just shows the idea.
In your particular case, you could write a very specialized variant based on 10:
unsigned int pow10(unsigned int e)
{
    unsigned int ret = 1;
    while(e-- > 0)
        ret *= 10;
    return ret;
}
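As a rough sketch of how an integer power of ten could replace pow() in the digit-by-digit multiplication from the question: the names ipow10 and multiply_digit_strings are hypothetical, and the code assumes both strings contain only decimal digits and that the product fits in an unsigned int.

```c
#include <stddef.h>
#include <string.h>

/* Integer power of ten; assumes the result fits in an unsigned int. */
unsigned int ipow10(unsigned int e)
{
    unsigned int ret = 1;
    while (e-- > 0)
        ret *= 10;
    return ret;
}

/* Multiplies two decimal digit strings digit by digit.
   Assumes only '0'..'9' characters and no overflow. */
unsigned int multiply_digit_strings(const char *a, const char *b)
{
    size_t len_a = strlen(a);
    size_t len_b = strlen(b);
    unsigned int result = 0;

    for (size_t i = 0; i < len_b; i++) {
        for (size_t j = 0; j < len_a; j++) {
            /* Place value of each digit, counted from the right. */
            unsigned int shift = (unsigned int)((len_b - 1 - i) + (len_a - 1 - j));
            unsigned int partial = (unsigned int)(b[i] - '0') * (unsigned int)(a[j] - '0');
            result += partial * ipow10(shift);
        }
    }
    return result;
}
```

Because every intermediate value is an integer, the partial product 3 * 100 is exactly 300 and nothing is truncated.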
This question already has answers here:
How to extract the decimal part from a floating point number in C?
(16 answers)
Closed 5 years ago.
I want to split a float into two separate parts: the integer ("real") part and the fractional ("non-real") part.
For example, if x = 45.678, my function has to give real = 45 and non_real = 678. I have tried the following logic.
void split(float x, unsigned int *real, unsigned int *non_real)
{
    *real = x; // conversion truncates toward zero
    *non_real = ((int)(x*N_DECIMAL_POINTS_PRECISION)%N_DECIMAL_POINTS_PRECISION);
    printf("Real = %u , Non_Real = %u\n", *real, *non_real); // %u matches unsigned int
}
where N_DECIMAL_POINTS_PRECISION = 10000. This gives the fractional part to 4 digits, but no further.
It works only for one specific precision. The code is not generic; it has to work for all floating point numbers, such as 9.565784 and 45.6875322. If anyone could help me with this, it would be really helpful.
Thanks in advance.
Use floor() to find the integer part, and then subtract the integer part from the original value to find the fractional part.
Note: The problem you're most likely having is that some numbers are too large for the integer part to fit in the range of an int.
--Added--
If and only if you are able to assume that an unsigned int has more bits than the floating point representation's significand (e.g. a 32-bit unsigned int and IEEE standard single-precision floating point with only 23 fractional bits, so that 32 > 23), then a number that is too large for an unsigned int can't have any fractional bits. This leads to a solution like:
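if (x >= UINT_MAX) { // >= because UINT_MAX may round up when converted to float
    integer_part = x; // too large to carry any fractional bits
    fractional_part = 0;
} else {
    integer_part = (unsigned int)x; // truncates toward zero; assumes x >= 0
    fractional_part = x - integer_part;
}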
This question already has answers here:
Why does dividing two int not yield the right value when assigned to double?
(10 answers)
Closed 6 years ago.
I have an array of double:
double theoretical_distribution[] = {1/21, 2/21, 3/21, 4/21, 5/21, 6/21};
And I am trying to compute its entropy as:
double entropy = 0;
for (int i = 0; i < sizeof(theoretical_distribution)/sizeof(*theoretical_distribution); i++) {
entropy -= (theoretical_distribution[i] * (log10(theoretical_distribution[i])/log10(arity)));
}
However, I am getting NaN. I have checked the expression
(theoretical_distribution[i] * (log10(theoretical_distribution[i])/log10(arity)))
and found that it returns NaN itself, so I assume it is the culprit. However, all it is supposed to be is a simple base conversion of the log. Am I missing some detail about the maths of it?
Why is it evaluating to NaN?
You are passing 0 to the log10 function, which returns negative infinity, and 0 multiplied by negative infinity is NaN.
This is because your array theoretical_distribution is being populated with constant values that result from integer computations, all of which have a denominator larger than the numerator, so every quotient truncates to 0.
You probably intended floating point computations, so make at least one of the numerator or denominator a floating constant (for example 1.0/21).
Sorry for the wordy title. My code is targeting a microcontroller (msp430) with no floating point unit, but this should apply to any similar MCU.
If I am multiplying a large runtime variable with what would normally be considered a floating point decimal number (1.8), is this still treated like floating point math by the MCU or compiler?
My simplified code is:
int multip = 0xf; // Can be from 0-15, not available at compile time
int holder = multip * 625; // 0 - 9375
holder = holder * 1.8; // 0 - 16875
Since the result will always be a positive whole number, is it still floating point math as far as the MCU or compiler is concerned, or is it fixed point?
(I realize I could just multiply by 18, but that would require declaring a 32-bit long instead of a 16-bit int, then dividing and downcasting for the array it will be put in. I am trying to skimp on memory here.)
The result is not an integer; it rounds to an integer.
9375 * 1.8000000000000000444089209850062616169452667236328125
yields
16875.0000000000004163336342344337026588618755340576171875
which rounds (in double precision floating point) to 16875.
If you write a floating-point multiply, I know of no compiler that will determine that there's a way to do that in fixed-point instead. (That does not mean they do not exist, but it ... seems unlikely.)
I assume you simplified away something important, because it seems like you could just do:
result = multip * 1125;
and get the final result directly.
I'd go for chux's formula if there's some reason you can't just multiply by 1125.
I am confident that FP code will be generated for
holder = holder * 1.8
To avoid FP and 32-bit math, given the OP values of
int multip = 0xf; // Max 15
unsigned holder = multip * 625; // Max 9375
// holder = holder * 1.8;
// alpha depends on rounding desired, e.g. 2 for round to nearest.
holder += (holder*4u + alpha)/5;
If int x is non-negative, you can compute x *= 1.8 rounded to nearest using only int arithmetic, without overflow unless the final result overflows, with:
x - (x+2)/5 + x
For truncation instead of round-to-nearest, use:
x - (x+4)/5 + x
If x may be negative, some additional work is needed.
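The two expressions above rely on 1.8x = 2x - x/5 and are evaluated left to right, so the intermediate value stays near 0.8x before the second x is added. Wrapped in functions, they might look like this; the function names are hypothetical and x is assumed non-negative.

```c
/* x * 1.8 rounded to nearest, for non-negative x, using only int arithmetic. */
int mul_1p8_round(int x)
{
    return x - (x + 2) / 5 + x;
}

/* x * 1.8 truncated toward zero, for non-negative x. */
int mul_1p8_trunc(int x)
{
    return x - (x + 4) / 5 + x;
}
```

For the question's maximum of 9375, both functions return 16875, which still fits in a 16-bit int.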
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
John Carmack’s Unusual Fast Inverse Square Root (Quake III)
I came across this piece of code on a blog recently. It is from the Quake3 Engine. It is meant to calculate the inverse square root fast using the Newton-Raphson method.
float InvSqrt (float x){
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}
What is the reason for doing int i = *(int*)&x;? Doing int i = (int) x; instead gives a completely different result.
int i = *(int*)&x; doesn't convert x to an int. What it does is read the actual bits of the float x, which usually form a completely different 4-byte value than you'd expect.
For reference, doing this is a really bad idea unless you know exactly how float values are represented in memory.
int i = *(int*)&x; says "take the four bytes which make up the float value x, and treat them as if they were an int." float values and int values are stored using completely different methods (e.g. int 4 and float 4.0 have completely different bit patterns).
The number that ends up in i is the binary value of the IEEE floating point representation of the number in x. The link explains what that looks like. This is not a common C idiom, it's a clever trick from before the SSE instructions got added to commercially available x86 processors.
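The difference between reinterpreting the bits and converting the value can be seen with a small sketch. It uses memcpy rather than the pointer cast from the Quake code, because reading a float through an int pointer breaks C's aliasing rules. The function name float_bits is hypothetical, and the code assumes float is a 32-bit IEEE 754 single, which the C standard does not guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Returns the raw bit pattern of a float.
   Assumes float is a 32-bit IEEE 754 single-precision value. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits); /* copies the bytes, no value conversion */
    return bits;
}
```

Under that assumption, float_bits(4.0f) is 0x40800000, whereas (int)4.0f is simply 4.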