This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 3 years ago.
When given the input 150, I expect the output to be the mathematically correct answer 70685.7750, but I am getting the wrong output 70685.7812.
#include<stdio.h>
int main()
{
float A,n,R;
n=3.14159;
scanf("%f",&R);
A=n*(R*R);
printf("A=%.4f\n",A);
}
float and double numbers cannot represent most values exactly in memory. The main reason is that memory is limited, while most non-integers would need infinitely many binary digits to be stored exactly.
The best example is PI. You can specify as many digits as you want, but it will still be an approximation.
The limited precision of the representation is the reason for the following rule:
when working with float and double numbers, do not check for equality (m == n), but check that the absolute difference between them is smaller than a certain error (fabs(m - n) < e)
Please note, as mentioned in the comments too, that the above rule is not "the mother rule of all rules". There are other rules also.
Careful analysis must be done for each particular situation, in order to have a properly working application.
(Thanks #EricPostpischil for the reminder)
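As a minimal sketch of the rule above: approx_equal is a hypothetical helper name, and the error bound e is something you have to pick for your own data, so treat this as an illustration and not as a general-purpose comparison.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true when m and n differ by less than e.
   fabs() makes the test symmetric, so the order of m and n does not matter. */
bool approx_equal(double m, double n, double e)
{
    return fabs(m - n) < e;
}
```

For instance, approx_equal(0.1 + 0.2, 0.3, 1e-9) holds even though 0.1 + 0.2 == 0.3 is false with IEEE-754 doubles.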
It is common for a variable of type float to be an IEEE-754 32-bit floating point number.
The number 3.14159 cannot be stored exactly in an IEEE-754 32-bit float - the closest value is approximately 3.14159012. 150 * 150 * 3.14159012 is 70685.7777, and the closest value to this that can be represented in a 32-bit float is 70685.78125, which you are then printing with %.4f so you see 70685.7812.
Another way of thinking about this is that your n value only ends up being accurate to the sixth significant figure, so, as you are just calculating a series of multiplications, your result is also only accurate to the sixth significant figure (i.e. 70685.8). (In the general case this can be worse: for example, subtraction of two close values can lead to a large increase in the relative error.)
If you switch to using variables of type double (and change the scanf() to use %lf), then you will likely get the answer you are after. double is typically a 64-bit float, which means that the error in the representation of your n values and the result is small enough not to affect the fourth decimal place.
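To make the comparison concrete, here is a small sketch of the same formula in both precisions. The function names area_float and area_double are made up for this illustration, and the behaviour described assumes float is IEEE-754 binary32 and double is binary64.

```c
/* Same formula as in the question, evaluated in two precisions. */
float area_float(float r)
{
    float n = 3.14159f;      /* stored as roughly 3.14159012 */
    float a = n * (r * r);   /* result is rounded to the nearest float */
    return a;
}

double area_double(double r)
{
    double n = 3.14159;      /* representation error is far below 1e-4 */
    return n * (r * r);
}
```

Under those assumptions, printing area_float(150.0f) with %.4f should show 70685.7812, while area_double(150.0) should show 70685.7750.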
Have you heard that float and double values aren't always perfectly accurate, have limited precision? Have you heard that type float gives you the equivalent of only about 7 decimal digits' worth of precision? This is what that means. Your expected and actual answers, 70685.7750 and 70685.7812, differ in the seventh digit, just about as expected.
I expect the output to be the mathematically correct answer
I am sorry to disappoint you, but that's your mistake. As a general rule, when you're doing floating-point arithmetic, you will never get the mathematically correct answer, you will always get a limited-precision approximation of the mathematically correct answer.
The canonical SO answers to this sort of question are collected at Is floating point math broken?. You might want to read some of those answers for more enlightenment.
My Matlab script reads a string value "0.001044397222448" from a file, and after parsing the file, this value printed in the console shows as double precision:
value_double =
0.001044397222448
After I convert this number to single using value_float = single(value_double), the value shows as:
value_float =
0.0010444
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
My problem is that later on, after I compare this with analogous C code, I get differences. In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000. So the C code keeps good precision. But, does Matlab?
The number 0.001044397222448 (like the vast majority of decimal fractions) cannot be exactly represented in binary floating point.
As a single-precision float, it's most closely represented as (hex) 0x0.88e428 × 2^-9, which in decimal is 0.001044397242367267608642578125.
In double precision, it's most closely represented as 0x0.88e427d4327300 × 2^-9, which in decimal is 0.001044397222447999984407118745366460643708705902099609375.
Those are what the numbers are, internally, in both C and Matlab.
Everything else you see is an artifact of how the numbers are printed back out, possibly rounded and/or truncated.
When I said that the single-precision representation "in decimal is 0.001044397242367267608642578125", that's mildly misleading, because it makes it look like there are 28 or more digits' worth of precision. Most of those digits, however, are an artifact of the conversion from base 2 back to base 10. As other answers have noted, single-precision floating point actually gives you only about 7 decimal digits of precision, as you can see if you notice where the single- and double-precision equivalents start to diverge:
0.001044397242367267608642578125
0.001044397222447999984407118745366460643708705902099609375
            ^
            difference
Similarly, double precision gives you roughly 16 decimal digits worth of precision, as you can see if you compare the results of converting a few previous and next mantissa values:
0x0.88e427d43272f8 0.00104439722244799976756668424826557384221814572811126708984375
0x0.88e427d4327300 0.001044397222447999984407118745366460643708705902099609375
0x0.88e427d4327308 0.00104439722244800020124755324246734744519926607608795166015625
0x0.88e427d4327310 0.0010443972224480004180879877395682342466898262500762939453125
^
changes
This also demonstrates why you can never exactly represent your original value 0.001044397222448 in binary. If you're using double, you can have 0.00104439722244799998, or you can have 0.0010443972224480002, but you can't have anything in between. (You'd get a little less close with float, and you could get considerably closer with long double, but you'll never get your exact value.)
In C, and whether you're using float or double, you can ask for as little or as much precision as you want when printing things with %f, and under a high-quality implementation you'll always get properly-rounded results. (Of course the results you get will always be the result of rounding the actual, internal value, not necessarily the decimal value you started with.) For example, if I run this code:
printf("%.5f\n", 0.001044397222448);
printf("%.10f\n", 0.001044397222448);
printf("%.15f\n", 0.001044397222448);
printf("%.20f\n", 0.001044397222448);
printf("%.30f\n", 0.001044397222448);
printf("%.40f\n", 0.001044397222448);
printf("%.50f\n", 0.001044397222448);
printf("%.60f\n", 0.001044397222448);
printf("%.70f\n", 0.001044397222448);
I see these results, which as you can see match the analysis above.
(Note that this particular example is using double, not float.)
0.00104
0.0010443972
0.001044397222448
0.00104439722244799998
0.001044397222447999984407118745
0.0010443972224479999844071187453664606437
0.00104439722244799998440711874536646064370870590210
0.001044397222447999984407118745366460643708705902099609375000
0.0010443972224479999844071187453664606437087059020996093750000000000000
I'm not sure how Matlab prints things.
In answer to your specific questions:
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
As a float, it is really "truncated" to a number which, converted back to decimal, is exactly 0.001044397242367267608642578125. But as we've seen, most of those digits are essentially meaningless, and the result can more properly be thought of as being about 0.0010443972.
In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000
So C got the same answer I did -- but, again, most of those digits are not meaningful.
So the C code keeps good precision. But, does Matlab?
I'd be willing to bet that Matlab keeps the same internal precision for ordinary floats and doubles.
MATLAB uses IEEE-754 binary64 for its double-precision type and binary32 for single-precision. When 0.001044397222448 is rounded to the nearest value representable in binary64, the result is 4816432068447840 × 2^-62 = 0.001044397222447999984407118745366460643708705902099609375.
When that is rounded to the nearest value representable in binary32, the result is 8971304 × 2^-33 = 0.001044397242367267608642578125.
Various software (C, Matlab, others) displays floating-point numbers in diverse ways, with more or fewer digits. The above values are the exact numbers represented by the floating-point data, per the IEEE 754 specification, and they are the values the data has when used in arithmetic operations.
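If you want to check those exact values from C, a sketch like the following may help. The names to_single, EXPECTED_BINARY64 and EXPECTED_BINARY32 are invented for this example, and it assumes a C99 compiler (for the hexadecimal floating constants) on a platform where float is binary32 and double is binary64.

```c
/* Exact values quoted above, written as integer / power of two.
   Both integers fit in a double's 53-bit significand, so the
   divisions are exact. */
static const double EXPECTED_BINARY64 = 4816432068447840.0 / 0x1p62;
static const double EXPECTED_BINARY32 = 8971304.0 / 0x1p33;

/* Round a double to single precision and widen it back, so the
   stored float value can be compared exactly as a double. */
double to_single(double x)
{
    return (double)(float)x;
}
```

Comparing 0.001044397222448 with EXPECTED_BINARY64, and to_single(0.001044397222448) with EXPECTED_BINARY32, using == should succeed under those assumptions, because no printing or rounding to decimal is involved.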
All single precisions should be the same
So here is the thing. According to documentation, both matlab and C comply with the IEEE 754 standard. Which means that there should not be any difference between what is actually stored in memory.
You could compute the binary representation by hand but according to this(thanks #Danijel) handy website, the representation of 0.001044397222448 should be 0x3a88e428.
The question is how precise your representation is. It is a bit tricky with floating point, but the short answer is that your number is accurate up to the 9th decimal and has decimals represented up to the 33rd decimal. If you want the long answer, see the two paragraphs at the end of this post.
A display issue
The fact that you are not seeing the same thing when you print does not mean that you don't have the same bits in memory (and you should have the exact same bytes in memory in C and MATLAB). The only reason you see a difference on your display is because the print functions truncate your number. If you print the 33 decimals in each language you should not have any difference.
To do so in matlab use: fprintf('%.33f', value_float);
To do so in C use: printf("%.33f\n", gf);
About floating point precision
Now in more detail, the question was: how precise is this representation? The tricky thing with floating point is that the precision of the representation depends on what number you are representing. The representation is over 32 bits and is divided into 1 bit for the sign, 8 for the exponent and 23 for the fraction.
The number can be computed as sign * 2^(exponent-127) * 1.fraction. This means that the maximal error/precision (depending on how you want to call it) is 2^(exponent-127-23), where the 23 stands for the 23 bits of the fraction. (There are a few edge cases, I won't elaborate on them.) In our case the exponent is 117, which means your precision is 2^(117-127-23) = 1.16415321826934814453125e-10. That means that your single precision float should represent your number accurately up to the 9th decimal; after that it is up to luck.
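The field layout and the spacing formula can be checked with a short sketch. The names float_fields, split_float and float_spacing are hypothetical, and the code assumes float is IEEE-754 binary32; it ignores the edge cases mentioned above (zero, subnormals, infinities, NaN).

```c
#include <stdint.h>
#include <string.h>

/* Assumes float is IEEE-754 binary32, as the text does. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

struct float_fields {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t fraction;  /* 23 bits */
};

struct float_fields split_float(float x)
{
    uint32_t bits;
    struct float_fields f;

    memcpy(&bits, &x, sizeof bits);   /* well-defined way to read the bytes */
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}

/* Spacing between neighbouring floats: 2^(exponent - 127 - 23).
   Only meaningful for normal numbers (exponent between 1 and 254). */
double float_spacing(float x)
{
    int e = (int)split_float(x).exponent - 127 - 23;
    double s = 1.0;

    for (; e < 0; e++)
        s /= 2.0;       /* exact: dividing by two only changes the exponent */
    for (; e > 0; e--)
        s *= 2.0;
    return s;
}
```

For 0.001044397222448f this should report exponent 117 and fraction 0x08e428, which together with the sign bit form 0x3a88e428.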
Further details
I know this is a rather short explanation. For more details, this post explains the floating point imprecision more precisely and this website gives you some useful info and allows you to play visually with the representation.
I'm trying to recreate printf and I'm currently trying to find a way to handle the conversion specifiers that deal with floats. More specifically: I'm trying to round doubles at a specific decimal place. Now I have the following code:
double ft_round(double value, int precision)
{
long long int power;
long long int result;
power = ft_power(10, precision);
result = (long long int) (value * power);
return ((double)result / power);
}
Which works for relatively small numbers (I haven't quite figured out whether printf compensates for truncation and rounding errors caused by it but that's another story). However, if I try a large number like
-154584942443242549.213565124235
I get -922337203685.4775391 as output, whereas printf itself gives me
-154584942443242560.0000000 (precision for both outputs is 7).
Both aren't exactly the output I was expecting but I'm wondering if you can help me figure out how I can make my idea for rounding work with larger numbers.
My question is basically twofold:
What exactly is happening in this case, both with my code and printf itself, that causes this output? (I'm pretty new to programming, sorry if it's a dumb question)
Do you guys have any tips on how to make my code capable of handling these bigger numbers?
P.S. I know there are libraries and such to do the rounding but I'm looking for a reinventing-the-wheel type of answer here, just FYI!
You can't round to a particular decimal precision with binary floating point arithmetic. It's just not possible. At small magnitudes, the errors are small enough that you can still get the right answer, but in general it doesn't work.
The only way to round a floating point number as decimal is to do all the arithmetic in decimal. Basically you start with the mantissa, converting it to decimal like an integer, then scale it by powers of 2 (the exponent) using decimal arithmetic. The amount of (decimal) precision you need to keep at each step is roughly (just a bit over) the final decimal precision you want. If you want an exact result, though, it's on the order of the base-2 exponent range (i.e. very large).
Typically rather than using base 10, implementations will use a base that's some large power of 10, since it's equivalent to work with but much faster. 1000000000 is a nice base because it fits in 32 bits and lets you treat your decimal representation as an array of 32-bit ints (comparable to how BCD lets you treat decimal representations as arrays of 4-bit nibbles).
My implementation in musl is dense but demonstrates this approach near-optimally and may be informative.
What exactly is happening in this case, both with my code and printf itself, that causes this output?
Overflow. Either ft_power(10, precision) exceeds LLONG_MAX and/or value * power > LLONG_MAX.
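A guard for the second kind of overflow could look roughly like this. scaled_fits_llong is a hypothetical helper, not part of the asker's code, and it takes the power as a double so that the multiplication itself cannot overflow an integer type.

```c
#include <stdbool.h>

/* Hypothetical guard: true when value * power can be converted to
   long long without overflow. 2^63 is written as a double constant
   because LLONG_MAX itself is not exactly representable as a double.
   Assumes long long is 64 bits wide. */
bool scaled_fits_llong(double value, double power)
{
    double scaled = value * power;

    /* NaN fails both comparisons, so it is rejected as well. */
    return scaled >= -9223372036854775808.0 && scaled < 9223372036854775808.0;
}
```

With the value from the question and a power of 10^7 the check fails, which matches the symptom: converting an out-of-range double to long long is undefined behaviour, and the output seen in the question looks like the most negative long long divided by 10^7.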
Do you guys have any tips on how to make my code capable of handling these bigger numbers?
Set aside the various int types when rounding or truncating. Use FP routines like round(), nearbyint(), etc. instead.
double ft_round(double value, int precision) {
// Use a re-coded `ft_power()` that computes/returns `double`
double pwr = ft_power(10, precision);
return round(value * pwr)/pwr;
}
As mentioned in this answer, floating point numbers have binary characteristics as well as finite precision. Using only double will extend the range of acceptable behavior. With extreme precision, the value computed with this code will be close to, yet potentially not exactly, the desired result.
Using temporary wider math will extend the acceptable range.
double ft_round(double value, int precision) {
double pwr = ft_power(10, precision);
return (double) (roundl((long double) value * pwr)/pwr);
}
I haven't quite figured out whether printf compensates for truncation and rounding errors caused by it but that's another story
See Printf width specifier to maintain precision of floating-point value to print FP with enough precision.
This question already has answers here:
Why does division result in zero instead of a decimal?
(5 answers)
Closed 8 years ago.
I'm programming a microcontroller atmega168 (8 bit).
I want to do something like:
float A = cos(- (2/3) * M_PI);
Including of course math.h (#define M_PI 3.14159265358979323846)
As result, instead of having -0.5, i get 1.
I check the result using a serial communication to my pc that i'm sure that works also for float numbers because if i set
A= -0.50;
I receive the correct result.
PS. I cannot use double...also because i don't see the reason of doing so
Help me please!
2/3 is evaluated using integer arithmetic, so it evaluates to 0. You meant to use floating point division:
float A = cos(-(2.0/3.0) * M_PI);
If you want float literals use an f suffix:
float A = cos(-(2.0f/3.0f) * M_PI);
Do note however, that the M_PI macro, expanded here, is a double literal. Is that what you want?
I presume the real code doesn't look quite like this. If this really is your code then you would write float A = -0.5f and move on. I guess the real code has variables.
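If the real code does use int variables, the same trap applies, and a cast on one operand avoids it. A minimal sketch, with made-up function names:

```c
/* Both operands are int, so the division truncates before the
   result is converted to double. */
double ratio_truncated(int num, int den)
{
    return num / den;
}

/* Casting one operand first forces floating-point division. */
double ratio_floating(int num, int den)
{
    return (double)num / den;
}
```

ratio_truncated(2, 3) gives 0.0, which is why the original expression ends up computing cos(0) = 1, while ratio_floating(2, 3) gives roughly 0.6667.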
How much precision do you need, and where are the numbers coming from? If your goal is to compute the cosine of 120 degrees, just set A to -0.5. If your original numbers are not expressed as radians, and if you don't need an absolutely precise result, a table-based approximation may be more useful than built in trig functions, since you can strike whatever balance between table size, execution speed, and precision will best fit your needs. Note also that if your original numbers are integers, you may be able to compute trig functions without using any floating-point values [e.g. one could have a function that accepts angles from 0-65535 and returns values from -16384 to +16384]. Integer math is often much faster than floating-point, so such a function could be a major performance win.
I have been asked a very simple question in the book to write the output of the following program -
#include<stdio.h>
int main()
{
float i=1.1;
while(i==1.1)
{
printf("%f\n",i);
i=i-0.1;
}
return 0;
}
I have already read that floating point numbers can be used as loop counters but that doing so is not advisable. When I compile this program with gcc and run it, I get no output, even though the logic looks correct to me and the value of i should be printed once. I tried printing the value of i and it gave me 1.100000. So I do not understand why the value is not being printed.
In most C implementations, using IEEE-754 binary floating-point, what happens in your program is:
The source text 1.1 is converted to a double. Since binary floating-point does not represent this value exactly, the result is the nearest representable value, 1.100000000000000088817841970012523233890533447265625.
The definition float i=1.1; converts the value to float. Since float has less precision than double, the result is 1.10000002384185791015625.
In the comparison i==1.1, the float 1.10000002384185791015625 is converted to double (which does not change its value) and compared to 1.100000000000000088817841970012523233890533447265625. Since they are unequal, the result is false.
The quantity 11/10 cannot be represented exactly in binary floating-point, and it has different approximations as double and as float.
The constant 1.1 in the source code is the double approximation of 11/10. Since i is of type float, it ends up containing the float approximation of 1.1.
Write while (i==1.1f) or declare i as double and your program will work.
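The two comparisons can be tried side by side with a sketch like this. The function names are invented, and the behaviour described assumes IEEE-754 float and double with expressions evaluated in their own types.

```c
/* i is converted to double and compared with the double nearest to 1.1. */
int equals_double_literal(float i)
{
    return i == 1.1;
}

/* Both sides are float, so the same rounding was applied to each. */
int equals_float_literal(float i)
{
    return i == 1.1f;
}

/* Widening a float to double never changes its value. */
double widen(float i)
{
    return (double)i;
}
```

Under those assumptions equals_double_literal(1.1f) is 0 and equals_float_literal(1.1f) is 1, and widen(1.1f) is the value 1.10000002384185791015625 quoted above.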
Comparing floating point numbers:1
Floating point math is not exact. Simple values like 0.2 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations can change the result. Different compilers and CPU architectures store temporary results at different precision, so results will differ depending on the details of your environment. If you do a calculation and then compare the results against some expected value it is highly unlikely that you will get exactly the result you intended.
In other words, if you do a calculation and then do this comparison:
if (result == expectedResult)
then it is unlikely that the comparison will be true. If the comparison is true then it is probably unstable – tiny changes in the input values, compiler, or CPU may change the result and make the comparison be false.
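A small illustration of the order-of-operations point, assuming IEEE-754 doubles and a compiler that does not reorder floating point arithmetic (that is, no fast-math style options). The function names are hypothetical.

```c
/* Adds the first two terms first. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Adds the last two terms first. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With the arguments 0.1, 0.2 and 0.3 the two functions are expected to return values that differ in the last bit, so comparing them with == fails even though both are "0.6" mathematically.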
In short:
1.1 can't be represented exactly as a binary floating point number. This is like writing 10/3 in decimal, which gives 3.333333333... without end.
I would suggest reading the article What Every Computer Scientist Should Know About Floating-Point Arithmetic.
1. For the experts who are encouraging beginner programmers to use == in floating point comparison
It is because i is not quite exactly 1.1.
If you are going to test a floating point value, you should do something along the lines of while(fabs(i-1.1) < SOME_DELTA), where SOME_DELTA is the threshold at which equality is good enough.
Read: https://softwareengineering.stackexchange.com/questions/101163/what-causes-floating-point-rounding-errors
This question already has answers here:
strange output in comparison of float with float literal
(8 answers)
Closed 9 years ago.
float a;
a=8.3;
if(a==8.3)
printf("1");
else
printf("2");
When a is set to 8.3 or 8.4 and compared with the same literal, the output is 2, but when a is set to 8.5 and compared with 8.5 the output is 1. I found that this is related to the concept of a recurring binary fraction, which takes 8 bytes. I want to know how to find out which numbers are recurring in binary. Kindly give some input.
Numbers with a recurring binary expansion are not exactly representable, hence the floating point comparison does not work.
Floating point math is not exact. Simple values like 0.2 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations can change the result. Also, as noted in the 2nd comment, the floating point literal 8.3 has type double while a has type float.
Comparing with epsilon – absolute error
Since floating point calculations involve a bit of uncertainty we can try to allow for this by seeing if two numbers are ‘close’ to each other. If you decide – based on error analysis, testing, or a wild guess – that the result should always be within 0.00001 of the expected result then you can change your comparison to this:
if (fabs(result - expectedResult) < 0.00001)
For example, 3/7 is a repeating binary fraction, so its value computed in double precision is different from its value stored in single precision. Thus comparing 3/7 with its stored single-precision value fails.
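On the question of how to tell which numbers recur in binary: a fraction has a finite binary expansion exactly when, after reducing it to lowest terms, its denominator is a power of two. The sketch below applies that test; terminates_in_binary and gcd_ul are hypothetical names, and the decimal value has to be supplied as a fraction (8.3 as 83/10, 8.5 as 85/10).

```c
#include <stdbool.h>

/* Greatest common divisor by Euclid's algorithm. */
static unsigned long gcd_ul(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* True when num/den has a finite binary expansion: after reducing the
   fraction, the denominator must be a power of two. den must be nonzero. */
bool terminates_in_binary(unsigned long num, unsigned long den)
{
    den /= gcd_ul(num, den);
    return (den & (den - 1)) == 0;
}
```

This gives false for 83/10 and 84/10 and true for 85/10, matching the outputs in the question. A finite expansion is necessary but not sufficient for exact storage: it must also fit in the 24 significant bits of a float or the 53 of a double.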
For more please read - What Every Computer Scientist Should Know About Floating-Point Arithmetic
You should not compare floating point numbers for equality using ==. Because of how floating point numbers are actually stored in memory it will give inaccurate results.
Use something like this to determine if your number a is close enough to the desired value:
if(fabs(a-8.3) < 0.0000005)
There are two problems here.
First is that floating point literals like 8.3 have type double, while a has type float. Doubles and floats store values to different precisions, and for values that don't have an exact floating point representation (such as 8.3), the stored values are slightly different. Thus, the comparison fails.
You could fix this by writing the comparison as a==8.3f; the f suffix forces the literal to be a float instead of a double.
However, it's bad juju to compare floating point values directly; again, most values cannot be represented exactly, but only to an approximation. If a were the result of an expression involving multiple floating-point calculations, it may not be equivalent to 8.3f. Ideally, you should look at the difference between the two values, and if it's less than some threshold, then they are effectively equivalent:
if ( fabs( a - 8.3f) < EPSILON )
{
// a is "equal enough" to 8.3
}
The exact value of EPSILON depends on a number of factors, not least of which is the magnitude of the values being compared. You only have so many digits of precision, so if the values you're trying to compare are greater than 999999.0, then you can't test for differences within 0.000001 of each other.
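One common way to deal with the magnitude problem is to scale the threshold with the values being compared. The following is only a sketch of that idea: nearly_equal and rel_tol are names chosen for this example, and a relative test like this behaves poorly when both values are very close to zero.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: compares a and b with a tolerance that scales
   with the larger magnitude, so it stays meaningful for big values. */
bool nearly_equal(float a, float b, float rel_tol)
{
    float fa = fabsf(a);
    float fb = fabsf(b);
    float largest = (fa > fb) ? fa : fb;

    return fabsf(a - b) <= rel_tol * largest;
}
```

Near 1000000.0f adjacent float values are 0.0625 apart, so a fixed threshold of 0.000001 can only ever match identical values there, whereas nearly_equal(1000000.0f, 1000000.0625f, 1e-6f) still treats the two neighbours as equal.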