Division of two floats giving incorrect answer - c

Attempting to divide two floats in C, using the code below:
#include <stdio.h>
#include <math.h>
int main(){
float fpfd = 122.88e6;
float flo = 10e10;
float int_part, frac_part;
int_part = (int)(flo/fpfd);
frac_part = (flo/fpfd) - int_part;
printf("\nInt_Part = %f\n", int_part);
printf("Frac_Part = %f\n", frac_part);
return(0);
}
To this code, I use the commands:
>> gcc test_prog.c -o test_prog -lm
>> ./test_prog
I then get this output:
Int_Part = 813.000000
Frac_Part = 0.802063
Now, this Frac_part it seems is incorrect. I have tried the same equation on a calculator first and then in Wolfram Alpha and they both give me:
Frac_Part = 0.802083
Notice the number at the fifth decimal place is different.
This may seem insignificant to most, but for the calculations I am doing it is of paramount importance.
Can anyone explain to me why the C code is making this error?

When you have inadequate precision from floating point operations, the first most natural step is to just use floating point types of higher precision, e.g. use double instead of float. (As pointed out immediately in the other answers.)
Second, examine the different floating point operations and consider their precisions. The one that stands out to me as being a source of error is the method above of separating a float into integer part and fractional part, by simply casting to int and subtracting. This is not ideal, because, when you subtract the integer part from the original value, you are doing arithmetic where the three numbers involved (two inputs and result) have very different scales, and this will likely lead to precision loss.
I would suggest to use the C <math.h> function modf instead to split floating point numbers into integer and fractional part. http://www.techonthenet.com/c_language/standard_library_functions/math_h/modf.php
(In greater detail: When you do an operation like f - (int)f, the floating point addition procedure is going to see that two numbers of some given precision X are being added, and it's going to naturally assume that the result will also have precision X. Then it will perform the actual computation under that assumption, and finally reevaluate the precision of the result at the end. Because the initial prediction turned out not to be ideal, some low order bits are going to get lost.)

Float are single precision for floating point, you should instead try to use double, the following code give me the right result:
#include <stdio.h>
#include <math.h>
int main(){
double fpfd = 122.88e6;
double flo = 10e10;
double int_part, frac_part;
int_part = (int)(flo/fpfd);
frac_part = (flo/fpfd) - int_part;
printf("\nInt_Part = %f\n", int_part);
printf("Frac_Part = %f\n", frac_part);
return(0);
}
Why ?
As I said, float are single precision floating point, they are smaller than double (in most architecture, sizeof(float) < sizeof(double)).
By using double instead of float you will have more bit to store the mantissa and the exponent part of the number (see wikipedia).

float has only 6~9 significant digits, it's not precise enough for most uses in practice. Changing all float variables to double (which provides 15~17 significant digits) gives output:
Int_Part = 813.000000
Frac_Part = 0.802083

Related

How to extract fractional part of a double value without rounding in c

When i try to extract the fractional part of a double it seems to be rounding down in C, Is there a way to do it without rounding?
double t = 8.2;
int ipart = (int)t;
long long val = abs((long long)(t*1000000));
long long fpart = (val)%1000000;
fpart gives 199999, Is there a way to get it as 200000 without rounding down? Tried many ways but none of the methods seems to be working for all the numbers.
Intention is to finally convert this double into string which should have the exact value as "8.20000". If i can extract fraction part in long long variable then i can generate the string using snprintf.
How to extract fractional part of a double value without rounding ...?
Use modf() from the standard C library.
#include <math.h>
double ipart;
double fraction = modf(value, &ipart);
printf("Fraction %g\n", fraction);
printf("Whole number %g\n", ipart);
The modf functions break the argument value into integral and fractional parts, each of which has the same type and sign as the argument. They store the integral part (in floating-point format) in the object pointed to by iptr.
C17dr § 7.12.6.12 2
Deeper into 8.2
double can represent about 264 different values exactly. 8.2 is not one of them.
With double t = 8.2;, t takes on a nearby value, which is exactly 8.199999999999999289457264239899814128875732421875 or
81801439850948192/9007199254740992 due to the binary nature of common double.
To find the fraction, use fraction*pow(2,DBL_MANT_DIG)/pow(2,DBL_MANT_DIG).
Thus the goal of "to get it as 200000 without rounding down" as 200000/denominator for the fraction part of t is not possible.
The value 8.2 can't be exactly represented in binary floating point. The actual value is closer to 8.19999999999999929.
Because of this, you're forced to round:
long long val = llabs(round(t*1000000));
Or add 0.5 to the value before converting:
long long val = llabs((t*1000000) + 0.5);

Nonintuitive result of the assignment of a double precision number to an int variable in C

Could someone give me an explanation why I get two different
numbers, resp. 14 and 15, as an output from the following code?
#include <stdio.h>
int main()
{
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
printf("%d %d",b,c); // 14 15, why?
return 0;
}
I expect to get 15 in both cases but it seems I'm missing some fundamentals of the language.
I am not sure if it's relevant but I was doing the test in CodeBlocks. However, if I type the same lines of code in some on-line compiler ( this one for example) I get an answer of 15 for the two printed variables.
... why I get two different numbers ...
Aside from the usual float-point issues, the computation paths to b and c are arrived in different ways. c is calculated by first saving the value as double a.
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.
Except for assignment and cast (which remove all extra range and precision), ...
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the
type;
1 evaluate operations and constants of type float and double to the
range and precision of the double type, evaluate long double
operations and constants to the range and precision of the long double
type;
2 evaluate all operations and constants to the range and precision of the
long double type.
C11dr §5.2.4.2.2 9
OP reported 2
By saving the quotient in double a = (Vmax-Vmin)/step;, precision is forced to double whereas int b = (Vmax-Vmin)/step; could compute as long double.
This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double versus remaining a long double. One as 15 (or just above), and the other just under 15. int truncation amplifies this difference to 15 and 14.
On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.
Conversion to int from a floating-point number is severe with numbers near a whole number. Often better to round() or lround(). The best solution is situation dependent.
This is indeed an interesting question, here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits mantissa plus one implicit bit. For details on the representation, see the wikipedia article.
Ok, so you first define some variables:
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
The respective values in binary will be
Vmax = 10.111001100110011001100110011001100110011001100110011
Vmin = 1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010
If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.
Now you do some math on these numbers. The first operation, the subtraction, results in the precise result:
10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
1.1000000000000000000000000000000000000000000000000000
Then you divide by step, which has been rounded up by your compiler:
1.1000000000000000000000000000000000000000000000000000
/ .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111
Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.
So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):
1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110
However, if you first store the result in a variable of type double, rounding takes place:
1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded: 1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111
And this is precisely the result you got.
The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011...[2] . (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.
And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.
The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.
See also question 14.4a in the C FAQ list.
An equivalent problem is analyzed in analysis of C programs for FLT_EVAL_METHOD==2.
If FLT_EVAL_METHOD==2:
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
computes b by evaluating a long double expression then truncating it to a int, whereas for c it's evaluating from long double, truncating it to double and then to int.
So both values are not obtained with the same process, and this may lead to different results because floating types does not provides usual exact arithmetic.

Precision loss / rounding difference when directly assigning double result to an int

Is there a reason why converting from a double to an int performs as expected in this case:
double value = 45.33;
double multResult = (double) value*100.0; // assign to double
int convert = multResult; // assign to int
printf("convert = %d\n", convert); // prints 4533 as expected
But not in this case:
double value = 45.33;
int multResultInt = (double) value*100.0; // assign directly to int
printf("multResultInt = %d\n", multResultInt); // prints 4532??
It seems to me there should be no difference. In the second case the result is still first stored as a double before being converted to an int unless I am not understanding some difference between casts and hard assignments.
There is indeed no difference between the two, but compilers are used to take some freedom when it comes down to floating point computations. For example compilers are free to use higher precision for intermediate results of computations but higher still means different so the results may vary.
Some compilers provide switches to always drop extra precision and convert all intermediate results to the prescribed floating point numbers (say 64bit double-precision numbers). This will make the code slower, however.
In the specific the number 45.33 cannot be represented exactly with a floating point value (it's a periodic number when expressed in binary and it would require an infinite number of bits). When multiplying by 100 this value may be you don't get an integer, but something very close (just below or just above).
int conversion or cast is performed using truncation and something very close to 4533 but below will become 4532, when above will become 4533; even if the difference is incredibly tiny, say 1E-300.
To avoid having problems be sure to account for numeric accuracy problems. If you are doing a computation that depends on exact values of floating point numbers then you're using the wrong tool.
#6502 has given you the theory, here's how to look at things experimentally
double v = 45.33;
int x = v * 100.0;
printf("x=%d v=%.20lf v100=%.20lf\n", x, v, v * 100.0 );
On my machine, this prints
x=4533 v=45.32999999999999829470 v100=4533.00000000000000000000
The value 45.33 does not have an exact representation when encoded as a 64-bit IEEE-754 floating point number. The actual value of v is slightly lower than the intended value due to the limited precision of the encoding.
So why does multiplying by 100.0 fix the problem on some machines? One possibility is that the multiplication is done with 80-bits of precision and then rounded to fit into a 64-bit result. The 80-bit number 4532.999... will round to 4533 when converted to 64-bits.
On your machine, the multiplication is evidently done with 64-bits of precision, and I would expect that v100 will print as 4532.999....

Truncating a double to a float in C

This a very simple question, but an important one since it affects my whole project tremendously.
Suppose I have the following code snipet:
unsigned int x = 0xffffffff;
float f = (float)((double)x * (double)2.328306436538696e-010); // x/2^32
I would expect that f be something like 0.99999, but instead, it rounds up to 1, since it's the closest float approximation. That's not good since I need float values on the interval of [0,1), not [0,1]. I'm sure it's something simple, but I'd appreciate some help.
In C (since C99), you can change the rounding direction with fesetround from libm
#include <stdio.h>
#include <fenv.h>
int main()
{
#pragma STDC FENV_ACCESS ON
fesetround(FE_DOWNWARD);
// volatile -- uncomment for GNU gcc and whoever else doesn't support FENV
unsigned long x = 0xffffffff;
float f = (float)((double)x * (double)2.328306436538696e-010); // x/2^32
printf("%.50f\n", f);
}
Tested with IBM XL, Sun Studio, clang, GNU gcc. This gives me 0.99999994039535522460937500000000000000000000000000 in all cases
The value above which a double rounds to 1 or more when converted to float in the default IEEE 754 rounding mode is 0x1.ffffffp-1 (in C99's hexadecimal notation, since your question is tagged “C”).
Your options are:
turn the FPU rounding mode to round-downward before the conversion, or
multiply by (0x1.ffffffp-1 / 0xffffffffp0) (give or take one ULP) to exploit the full single-precision range [0, 1) without getting the value 1.0f.
Method 2 leads to use the constant 0x1.ffffff01fffffp-33:
double factor = nextafter(0x1.ffffffp-1 / 0xffffffffp0, 0.0);
unsigned int x = 0xffffffff;
float f = (float)((double)x * factor);
printf("factor:%a\nunrounded:%a\nresult:%a\n", factor, (double)x * factor, f);
Prints:
factor:0x1.ffffff01fffffp-33
unrounded:0x1.fffffefffffffp-1
result:0x1.fffffep-1
You could just truncate the value to maximum precision (keeping the 24 high bits) and divide by 2^24 to get the closest value a float can represent without being rounded to 1;
unsigned int i = 0xffffffff;
float value = (float)(i>>8)/(1<<24);
printf("%.20f\n", value);
printf("%a\n", value);
>>> 0.99999994039535522461
>>> 0x1.fffffep-1
There's not much you can do - your int holds 32 bits but the mantissa of a float holds only 24. Rounding is going to happen. You could change the processor rounding mode to round down instead of to nearest, but that is going to cause some side effects that you want to avoid especially if you don't restore the rounding mode when you are finished.
There's nothing wrong with the formula you're using, it's producing the most accurate answer possible for the given input. There's just an end case that's failing a hard requirement. There's nothing wrong with testing for the specific end case and replacing it with the closest value that meets the requirement:
if (f >= 1.0f)
f = 0.99999994f;
0.999999940395355224609375 is the closest value that an IEEE-754 float can take without being equal to 1.0.
My eventual solution was to just shrink the size of my constant multiplier. It was probably the best solution since there was no point in multiplying by a double anyway. The precision was not seen after conversion to a float.
so 2.328306436538696e-010 was changed to 2.3283063

Multiplying two floats doesn't give exact result

I am trying to multiply two floats as follows:
float number1 = 321.12;
float number2 = 345.34;
float rexsult = number1 * number2;
The result I want to see is 110895.582, but when I run the code it just gives me 110896. Most of the time I'm having this issue. Any calculator gives me the exact result with all decimals. How can I achive that result?
edit : It's C code. I'm using XCode iOS simulator.
There's a lot of rounding going on.
float a = 321.12; // this number will be rounded
float b = 345.34; // this number will also be rounded
float r = a * b; // and this number will be rounded too
printf("%.15f\n", r);
I get 110895.578125000000000 after the three separate roundings.
If you want more than 6 decimal digits' worth of precision, you will have to use double and not float. (Note that I said "decimal digits' worth", because you don't get decimal digits, you get binary.) As it stands, 1/2 ULP of error (a worst-case bound for a perfectly rounded result) is about 0.004.
If you want exactly rounded decimal numbers, you will have to use a specialized decimal library for such a task. A double has more than enough precision for scientists, but if you work with money everything has to be 100% exact. No floating point numbers for money.
Unlike integers, floating point numbers take some real work before you can get accustomed to their pitfalls. See "What Every Computer Scientist Should Know About Floating-Point Arithmetic", which is the classic introduction to the topic.
Edit: Actually, I'm not sure that the code rounds three times. It might round five times, since the constants for a and b might be rounded first to double-precision and then to single-precision when they are stored. But I don't know the rules of this part of C very well.
You will never get the exact result that way.
First of all, number1 ≠ 321.12 because that value cannot be represented exactly in a base-2 system. You'll need an infinite number of bits for it.
The same holds for number2 ≠ 345.34.
So, you begin with inexact values to begin with.
Then the product will get rounded because multiplication gives you double the number of significant digits but the product has to be stored in float again if you multiply floats.
You probably want to use a 10-based system for your numbers. Or, in case your numbers only have 2 decimal digits of the fractional, you can use integers (32-bit integers are sufficient in this case, but you may end up needing 64-bit):
32112 * 34534 = 1108955808.
That represents 321.12 * 345.34 = 110895.5808.
Since you are using C you could easily set the precision by using "%.xf" where x is the wanted precision.
For example:
float n1 = 321.12;
float n2 = 345.34;
float result = n1 * n2;
printf("%.20f", result);
Output:
110895.57812500000000000000
However, note that float only gives six digits of precision. For better precision use double.
floating point variables are only approximate representation, not precise one. Not every number can "fit" into float variable. For example, there is no way to put 1/10 (0.1) into binary variable, just like it's not possible to put 1/3 into decimal one (you can only approximate it with endless 0.33333)
when outputting such variables, it's usual to apply many rounding options. Unless you set them all, you can never be sure which of them are applied. This is especially true for << operators, as the stream can be told how to round BEFORE <<.
Printf also does some rounding. Consider http://codepad.org/LLweoeHp:
float t = 0.1f;
printf("result: %f\n", t);
--
result: 0.100000
Well, it looks fine. Why? Because printf defaulted to some precision and rounded up the output. Let's dial in 50 places after decimal point: http://codepad.org/frUPOvcI
float t = 0.1f;
printf("result: %.50f\n", t);
--
result: 0.10000000149011611938476562500000000000000000000000
That's different, isn't it? After 625 the float ran out of capacity to hold more data, that's why we see zeroes.
A double can hold more digits, but 0.1 in binary is not finite. Double has to give up, eventually: http://codepad.org/RAd7Yu2r
double t = 0.1;
printf("result: %.70f\n", t);
--
result: 0.1000000000000000055511151231257827021181583404541015625000000000000000
In your example, 321.12 alone is enough to cause trouble: http://codepad.org/cgw3vUKn
float t = 321.12f;
printf("and the result is: %.50f\n", t);
result: 321.11999511718750000000000000000000000000000000000000
This is why one has to round up floating point values before presenting them to humans.
Calculator programs don't use floats or doubles at all. They implement decimal number format. eg:
struct decimal
{
int mantissa; //meaningfull digits
int exponent; //number of decimal zeroes
};
Ofc that requires reinventing all operations: addition, substraction, multiplication and division. Or just look for a decimal library.

Resources