This question already has answers here:
strange output in comparison of float with float literal
(8 answers)
#include <stdio.h>

int main(void)
{
    float a = 8.3;      /* 8.3 is a double literal; a stores only a float approximation */
    if (a == 8.3)       /* compares the float a against the double 8.3 */
        printf("1");
    else
        printf("2");
    return 0;
}
Setting a to 8.3 and 8.4 respectively and comparing with 8.3 and 8.4 correspondingly, the output is 2, but when comparing with 8.5 the output is 1. I found that this is related to the concept of recurring binary fractions (the literal being an 8-byte double). I want to know how to find out which numbers are recurring in binary. Kindly give some input.
Recurring binary fractions are not exactly representable, hence floating-point equality comparison will not work.
Floating point math is not exact. Simple values like 0.2 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations can change the result. Also, as noted in the second comment, the floating-point literal 8.3 has type double while a has type float.
Comparing with epsilon – absolute error
Since floating point calculations involve a bit of uncertainty we can try to allow for this by seeing if two numbers are ‘close’ to each other. If you decide – based on error analysis, testing, or a wild guess – that the result should always be within 0.00001 of the expected result then you can change your comparison to this:
if (fabs(result - expectedResult) < 0.00001)
For example, 3/7 is a repeating binary fraction; its value computed in double precision differs from its value stored in single precision, so comparing 3/7 against its stored single-precision value fails.
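A minimal sketch (not part of the original answer) illustrating the 3/7 example: the same quotient rounded to float and to double gives two different values, so == reports them unequal.

#include <stdio.h>

int main(void)
{
    float stored    = 3.0 / 7.0;   /* quotient rounded to single precision */
    double computed = 3.0 / 7.0;   /* quotient kept in double precision */

    if (stored == computed)        /* stored is widened to double, but the two roundings differ */
        printf("equal\n");
    else
        printf("not equal\n");     /* this branch is taken on typical IEEE-754 systems */
    return 0;
}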
For more please read - What Every Computer Scientist Should Know About Floating-Point Arithmetic
You should not compare floating point numbers for equality using ==. Because of how floating point numbers are actually stored in memory it will give inaccurate results.
Use something like this to determine if your number a is close enough to the desired value:
if (fabs(a - 8.3) < 0.0000005)
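A self-contained sketch of that check (the tolerance 0.0000005 is this answer's choice, not a universal constant):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = 8.3f;                      /* nearest float to 8.3 */
    if (fabs(a - 8.3) < 0.0000005)       /* a is widened to double for the subtraction */
        printf("close enough to 8.3\n");
    else
        printf("not close enough\n");
    return 0;
}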
There are two problems here.
First is that floating point literals like 8.3 have type double, while a has type float. Doubles and floats store values to different precisions, and for values that don't have an exact floating point representation (such as 8.3), the stored values are slightly different. Thus, the comparison fails.
You could fix this by writing the comparison as a==8.3f; the f suffix forces the literal to be a float instead of a double.
However, it's bad juju to compare floating point values directly; again, most values cannot be represented exactly, but only to an approximation. If a were the result of an expression involving multiple floating-point calculations, it may not be equivalent to 8.3f. Ideally, you should look at the difference between the two values, and if it's less than some threshold, then they are effectively equivalent:
if ( fabs( a - 8.3f ) < EPSILON )
{
    // a is "equal enough" to 8.3
}
The exact value of EPSILON depends on a number of factors, not least of which is the magnitude of the values being compared. You only have so many digits of precision, so if the values you're trying to compare are greater than 999999.0, then you can't test for differences within 0.000001 of each other.
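A sketch of a relative-error comparison along those lines (the helper name nearly_equal and the factor 1e-6f are illustrative choices, not a standard API); the tolerance scales with the magnitude of the operands, so it still works for large values:

#include <math.h>
#include <stdbool.h>

static bool nearly_equal(float x, float y)
{
    float diff    = fabsf(x - y);
    float largest = fmaxf(fabsf(x), fabsf(y));
    return diff <= largest * 1e-6f;   /* tolerance grows with the operands' magnitude */
}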
Related
My Matlab script reads a string value "0.001044397222448" from a file, and after parsing the file, this value printed in the console shows as double precision:
value_double =
0.001044397222448
After I convert this number to single using value_float = single(value_double), the value shows as:
value_float =
0.0010444
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
My problem is that later on, after I compare this with analogous C code, I get differences. In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000. So the C code keeps good precision. But, does Matlab?
The number 0.001044397222448 (like the vast majority of decimal fractions) cannot be exactly represented in binary floating point.
As a single-precision float, it's most closely represented as (hex) 0x0.88e428 × 2^-9, which in decimal is 0.001044397242367267608642578125.
In double precision, it's most closely represented as 0x0.88e427d4327300 × 2^-9, which in decimal is 0.001044397222447999984407118745366460643708705902099609375.
Those are what the numbers are, internally, in both C and Matlab.
Everything else you see is an artifact of how the numbers are printed back out, possibly rounded and/or truncated.
When I said that the single-precision representation "in decimal is 0.001044397242367267608642578125", that's mildly misleading, because it makes it look like there are 28 or more digits' worth of precision. Most of those digits, however, are an artifact of the conversion from base 2 back to base 10. As other answers have noted, single-precision floating point actually gives you only about 7 decimal digits of precision, as you can see if you notice where the single- and double-precision equivalents start to diverge:
0.001044397242367267608642578125
0.001044397222447999984407118745366460643708705902099609375
            ^ difference
Similarly, double precision gives you roughly 16 decimal digits worth of precision, as you can see if you compare the results of converting a few previous and next mantissa values:
0x0.88e427d43272f8 0.00104439722244799976756668424826557384221814572811126708984375
0x0.88e427d4327300 0.001044397222447999984407118745366460643708705902099609375
0x0.88e427d4327308 0.00104439722244800020124755324246734744519926607608795166015625
0x0.88e427d4327310 0.0010443972224480004180879877395682342466898262500762939453125
                                   ^ changes
This also demonstrates why you can never exactly represent your original value 0.001044397222448 in binary. If you're using double, you can have 0.00104439722244799998, or you can have 0.0010443972224480002, but you can't have anything in between. (You'd get a little less close with float, and you could get considerably closer with long double, but you'll never get your exact value.)
In C, and whether you're using float or double, you can ask for as little or as much precision as you want when printing things with %f, and under a high-quality implementation you'll always get properly-rounded results. (Of course the results you get will always be the result of rounding the actual, internal value, not necessarily the decimal value you started with.) For example, if I run this code:
printf("%.5f\n", 0.001044397222448);
printf("%.10f\n", 0.001044397222448);
printf("%.15f\n", 0.001044397222448);
printf("%.20f\n", 0.001044397222448);
printf("%.30f\n", 0.001044397222448);
printf("%.40f\n", 0.001044397222448);
printf("%.50f\n", 0.001044397222448);
printf("%.60f\n", 0.001044397222448);
printf("%.70f\n", 0.001044397222448);
I see these results, which as you can see match the analysis above.
(Note that this particular example is using double, not float.)
0.00104
0.0010443972
0.001044397222448
0.00104439722244799998
0.001044397222447999984407118745
0.0010443972224479999844071187453664606437
0.00104439722244799998440711874536646064370870590210
0.001044397222447999984407118745366460643708705902099609375000
0.0010443972224479999844071187453664606437087059020996093750000000000000
I'm not sure how Matlab prints things.
In answer to your specific questions:
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
As a float, it is really "truncated" to a number which, converted back to decimal, is exactly 0.001044397242367267608642578125. But as we've seen, most of those digits are essentially meaningless, and the result can more properly be thought of as being about 0.0010443972.
In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000
So C got the same answer I did -- but, again, most of those digits are not meaningful.
So the C code keeps good precision. But, does Matlab?
I'd be willing to bet that Matlab keeps the same internal precision for ordinary floats and doubles.
MATLAB uses IEEE-754 binary64 for its double-precision type and binary32 for single-precision. When 0.001044397222448 is rounded to the nearest value representable in binary64, the result is 4816432068447840 • 2^-62 = 0.001044397222447999984407118745366460643708705902099609375.
When that is rounded to the nearest value representable in binary32, the result is 8971304 • 2^-33 = 0.001044397242367267608642578125.
Various software (C, Matlab, others) displays floating-point numbers in diverse ways, with more or fewer digits. The above values are the exact numbers represented by the floating-point data, per the IEEE 754 specification, and they are the values the data has when used in arithmetic operations.
All single precisions should be the same
So here is the thing. According to the documentation, both MATLAB and C comply with the IEEE 754 standard, which means that there should not be any difference in what is actually stored in memory.
You could compute the binary representation by hand, but according to this handy website (thanks @Danijel), the representation of 0.001044397222448 should be 0x3a88e428.
The question is: how precise is your representation? It is a bit tricky with floating point, but the short answer is that your number is accurate up to the 9th decimal, and its exact decimal expansion runs to the 33rd decimal place. If you want the long answer, see the two paragraphs at the end of this post.
A display issue
The fact that you are not seeing the same thing when you print does not mean that you don't have the same bits in memory (and you should have the exact same bytes in memory in C and MATLAB). The only reason you see a difference on your display is because the print functions truncate your number. If you print the 33 decimals in each language you should not have any difference.
To do so in MATLAB use: fprintf('%.33f', value_float);
To do so in C use: printf("%.33f\n", gf);
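A minimal C sketch of that check (gf as in the question; the expected digits are the ones quoted above):

#include <stdio.h>

int main(void)
{
    float gf = 0.001044397222448f;
    printf("%.33f\n", gf);   /* expect 0.001044397242367267608642578125000 */
    return 0;
}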
About floating point precision
Now in more detail, the question was: how precise is this representation? Well, the tricky thing with floating point is that the precision of the representation depends on what number you are representing. The representation is 32 bits wide and is divided into 1 bit for the sign, 8 for the exponent and 23 for the fraction.
The number can be computed as sign * 2^(exponent-127) * 1.fraction. This basically means that the maximal error/precision (depending on how you want to call it) is basically 2^(exponent-127-23), where the 23 accounts for the 23 bits of the fraction. (There are a few edge cases, I won't elaborate on them.) In our case the biased exponent is 117, which means your precision is 2^(117-127-23) = 1.16415321826934814453125e-10. That means that your single precision float should represent your number accurately up to the 9th decimal; after that it is up to luck.
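A sketch (assuming a 32-bit IEEE-754 float, as this answer does) that pulls apart the stored bits of the value to show the sign, exponent and fraction fields discussed above:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 0.001044397222448f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* well-defined way to inspect the bit pattern */

    printf("raw bits : 0x%08x\n", (unsigned)bits);                      /* expected: 0x3a88e428 */
    printf("sign     : %u\n", (unsigned)(bits >> 31));
    printf("exponent : %u (biased)\n", (unsigned)((bits >> 23) & 0xff)); /* expected: 117 */
    printf("fraction : 0x%06x\n", (unsigned)(bits & 0x7fffff));
    return 0;
}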
Further details
I know this is a rather short explanation. For more details, this post explains the floating point imprecision more precisely and this website gives you some useful info and allows you to play visually with the representation.
This question already has answers here:
Is floating point math broken?
(31 answers)
When given the input 150, I expect the output to be the mathematically correct answer 70685.7750, but I am getting the wrong output 70685.7812.
#include <stdio.h>

int main()
{
    float A, n, R;
    n = 3.14159;
    scanf("%f", &R);
    A = n * (R * R);
    printf("A=%.4f\n", A);
}
float and double numbers are not represented exactly in memory. The main reason is that memory is finite, while most non-integer values would need an infinite number of digits.
The best example is PI. You can specify as many digits as you want, but it will still be an approximation.
The limited precision of the representation is the reason for the following rule:
when working with float and double numbers, do not check for equality (m == n), but check that the difference between them is smaller than a certain error (fabs(m - n) < e)
Please note, as mentioned in the comments too, that the above rule is not "the mother rule of all rules". There are other rules also.
Careful analysis must be done for each particular situation, in order to have a properly working application.
(Thanks #EricPostpischil for the reminder)
It is common for a variable of type float to be an IEEE-754 32-bit floating point number.
The number 3.14159 cannot be stored exactly in an IEEE-754 32-bit float - the closest value is approximately 3.14159012. 150 * 150 * 3.14159012 is 70685.7777, and the closest value to this that can be represented in a 32-bit float is 70685.78125, which you are then printing with %.4f so you see 70685.7812.
Another way of thinking about this is that your n value only ends up being accurate to the sixth significant figure, so - as you are just calculating a series of multiplications - your result is also only accurate to the sixth significant figure (i.e. 70685.8). (In the general case this can be worse - for example subtraction of two close values can lead to a large increase in the relative error.)
If you switch to using variables of type double (and change the scanf() to use %lf), then you will likely get the answer you are after. double is typically a 64-bit float, which means that the error in the representation of your n values and the result is small enough not to affect the fourth decimal place.
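A sketch of that suggested change (not the asker's original code): with double and %lf, the representation error is pushed well past the fourth decimal place, so with the input 150 this will likely print 70685.7750.

#include <stdio.h>

int main(void)
{
    double A, n, R;
    n = 3.14159;
    if (scanf("%lf", &R) == 1)      /* %lf reads a double */
    {
        A = n * (R * R);
        printf("A=%.4f\n", A);      /* %f prints a double; likely 70685.7750 for R = 150 */
    }
    return 0;
}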
Have you heard that float and double values aren't always perfectly accurate, have limited precision? Have you heard that type float gives you the equivalent of only about 7 decimal digits' worth of precision? This is what that means. Your expected and actual answers, 70685.7750 and 70685.7812, differ in the seventh digit, just about as expected.
I expect the output to be the mathematically correct answer
I am sorry to disappoint you, but that's your mistake. As a general rule, when you're doing floating-point arithmetic, you will never get the mathematically correct answer, you will always get a limited-precision approximation of the mathematically correct answer.
The canonical SO answers to this sort of question are collected at Is floating point math broken?. You might want to read some of those answers for more enlightenment.
I am trying to understand what is the difference between the following:
printf("%f",4.567f);
printf("%f",4.567);
How does using the f suffix change/influence the output?
How does using the 'f' change/influence the output?
The f at the end of a floating point constant determines the type and can affect the value.
4.567 is a floating point constant with the type and precision of double. A double can typically represent exactly about 2^64 different values. 4.567 is not one of them.*1 The closest alternatives typically are exactly
4.56700000000000017053025658242404460906982421875 // best
4.56699999999999928235183688229881227016448974609375 // next best double
4.567f is a floating point constant with the type and precision of float. A float can typically represent exactly about 2^32 different values. 4.567 is not one of them. The closest alternatives typically are exactly
4.566999912261962890625 // best
4.56700038909912109375 // next best float
When passed to printf() as part of the ... arguments, a float is converted to a double with the same value.
So the question becomes what is the expected difference in printing?
printf("%f",4.56700000000000017053025658242404460906982421875);
printf("%f",4.566999912261962890625);
Since the default number of digits after the decimal point to print for "%f" is 6, the output for both rounds to:
4.567000
To see a difference, print with more precision or try 4.567e10, 4.567e10f.
45670000000.000000 // double
45669998592.000000 // float
Your output may differ slightly due to quality-of-implementation issues.
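A minimal sketch that makes the difference visible (the digits shown in the comments are typical IEEE-754 results and may vary by implementation):

#include <stdio.h>

int main(void)
{
    printf("%.25f\n", 4.567f);     /* float constant, promoted to double: 4.5669999122619628906250000 */
    printf("%.25f\n", 4.567);      /* double constant:                    4.5670000000000001705302566 */
    printf("%f\n", 4.567e10f);     /* typically 45669998592.000000 */
    printf("%f\n", 4.567e10);      /* typically 45670000000.000000 */
    return 0;
}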
*1 C supports many floating point encodings. A common one is binary64. Thus typical floating-point values are encoded as a sign * binary fraction * 2^exponent. Even simple decimal values like 0.1 cannot be represented exactly as such.
I have been asked a very simple question in the book to write the output of the following program -
#include <stdio.h>

int main()
{
    float i = 1.1;
    while (i == 1.1)
    {
        printf("%f\n", i);
        i = i - 0.1;
    }
    return 0;
}
I have already read that floating point numbers can be used as loop counters but that it is not advisable. Now, when I run this program with gcc, I get no output, even though the logic looks correct and the value of i should be printed at least once. I tried printing the value of i separately and it gave me 1.100000, so I do not understand why the value is not being printed.
In most C implementations, using IEEE-754 binary floating-point, what happens in your program is:
The source text 1.1 is converted to a double. Since binary floating-point does not represent this value exactly, the result is the nearest representable value, 1.100000000000000088817841970012523233890533447265625.
The definition float i=1.1; converts the value to float. Since float has less precision than double, the result is 1.10000002384185791015625.
In the comparison i==1.1, the float 1.10000002384185791015625 is converted to double (which does not change its value) and compared to 1.100000000000000088817841970012523233890533447265625. Since they are unequal, the result is false.
The quantity 11/10 cannot be represented exactly in binary floating-point, and it has different approximations as double and as float.
The constant 1.1 in the source code is the double approximation of 11/10. Since i is of type float, it ends up containing the float approximation of 1.1.
Write while (i==1.1f) or declare i as double and your program will work.
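A minimal sketch of the first fix (comparing against the float constant 1.1f), under which the loop body runs exactly once:

#include <stdio.h>

int main(void)
{
    float i = 1.1f;        /* the single-precision approximation of 1.1 */
    while (i == 1.1f)      /* both sides hold the same float value, so this is true once */
    {
        printf("%f\n", i); /* prints 1.100000 */
        i = i - 0.1f;      /* afterwards i is roughly 1.0, so the loop ends */
    }
    return 0;
}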
Comparing floating point numbers: [1]
Floating point math is not exact. Simple values like 0.2 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations can change the result. Different compilers and CPU architectures store temporary results at different precision, so results will differ depending on the details of your environment. If you do a calculation and then compare the results against some expected value it is highly unlikely that you will get exactly the result you intended.
In other words, if you do a calculation and then do this comparison:
if (result == expectedResult)
then it is unlikely that the comparison will be true. If the comparison is true then it is probably unstable – tiny changes in the input values, compiler, or CPU may change the result and make the comparison be false.
In short:
1.1 can't be represented exactly as a binary floating point number. This is like the representation of 10/3 in decimal, which is 3.333333333…
I would suggest you to Read the article What Every Computer Scientist Should Know About Floating-Point Arithmetic.
[1] For the experts who are encouraging beginner programmers to use == in floating point comparison
It is because i is not quite exactly 1.1.
If you are going to test a floating point value, you should do something along the lines of while (fabs(i - 1.1) < SOME_DELTA), where SOME_DELTA is the threshold at which equality is considered good enough.
Read: https://softwareengineering.stackexchange.com/questions/101163/what-causes-floating-point-rounding-errors
I'm trying to get the user to input a number between 0.00001 and 1.00000, edges not included, into a float variable. I can assume that the user isn't typing more than 5 digits after the dot.
now, here is what I have written:
printf("Enter required Leibniz gap.(Between 0.00001 to 1.00000)\n");
scanf("%f", &gap);
while ((gap < 0.00002) || (gap > 0.99999))
{
printf("Enter required Leibniz gap.(Between 0.00001 to 1.00000)\n");
scanf("%f", &gap);
}
Now, when I type the smallest number possible, 0.00002, I'm getting stuck in the while loop.
When I ran the debugger I saw that 0.00002 is stored in the float variable with this value: 1.99999995e-005.
Can anybody clarify what I am doing wrong? Why isn't 0.00002 meeting the conditions? And what is this "1.99999995e-005" thing?
The problem here is that you are using a float variable (gap), but you are comparing it with a double constant (0.00002). The constant is double because floating-point constants in C are double unless otherwise specified.
An underlying issue is that the number 0.00002 is not representable in either float or double. (It's not representable at all in binary floating point because its binary expansion is infinitely long, like the decimal expansion of ⅓.) So when you write 0.00002 in a program, the C compiler substitutes it with a double value which is very close to 0.00002. Similarly, when scanf reads the number 0.00002 into a float variable, it substitutes a float value which is very close to 0.00002. Since double numbers have more bits than floats, the double value is closer to 0.00002 than the float value.
When you compare two floating point values with different precision, the compiler converts the value with less precision into exactly the same value with more precision. (The set of values representable as double is a superset of the set of values representable as float, so it is always possible to find a double whose value is the same as the value of a float.) And that's what happens when gap < 0.00002 is executed: gap is converted to the double of the same value, and that is compared with the double (close to) 0.00002. Since both of these values are actually slightly less than 0.00002, and the double is closer, the float is less than the double.
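A small sketch of that effect (assuming IEEE-754 and nearest-value conversions, as described above): both values land slightly below 0.00002, with the float lower, so the < test succeeds.

#include <stdio.h>

int main(void)
{
    float gap = 0.00002f;            /* approximately 1.99999995e-05 */
    printf("%.17g\n", gap);          /* shows the value after promotion to double */
    if (gap < 0.00002)               /* float gap is widened to double, then compared */
        printf("gap compares less than the double constant 0.00002\n");
    return 0;
}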
You can solve this problem in a couple of ways. First, you can avoid the conversion, either by making gap a double and changing the scanf format to %lf, or by comparing gap to a float:
while (gap < 0.00002F || gap > 0.99999F) {
But that's not really correct, for a couple of reasons. First, there is actually no guarantee that the floating point conversion done by the C compiler is the same as the conversion done by the standard library (scanf), and the standard allows the compiler to use "either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner." (It doesn't specify in detail which value scanf produces either, but recommends that it be the nearest representable value.) As it happens, gcc and glibc (the C compiler and standard library used on Linux) both produce the nearest representable value, but other implementations don't.
Anyway, according to your error message, you want the value to be between 0.00001 and 1.00000. So your test should be precisely that:
while (gap <= 0.00001F || gap >= 1.0000F) { ...
(assuming you keep gap as a float.)
Any of the above solutions will work. Personally, I'd make gap a double in order to make the comparison more intuitive, and also change the comparison to compare against 0.00001 and 1.0000.
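A sketch of that preferred variant (not the asker's exact code): gap as a double, read with %lf, and the test written directly against the stated bounds, both excluded.

#include <stdio.h>

int main(void)
{
    double gap = 0.0;
    printf("Enter required Leibniz gap. (Between 0.00001 and 1.00000)\n");
    scanf("%lf", &gap);
    while (gap <= 0.00001 || gap >= 1.00000)
    {
        printf("Enter required Leibniz gap. (Between 0.00001 and 1.00000)\n");
        scanf("%lf", &gap);
    }
    return 0;
}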
By the way, the E-05 suffix means "times ten to the power of -5" (the E stands for Exponent). You'll see that a lot; it's a standard way of writing floating point constants.
floats are not capable of storing exact values for every possible number (there are infinitely many numbers between 0 and 1, so that is impossible). Assigning 0.00002 to a float will store a different but really close number because of how the representation works, which is what you are experiencing. Precision decreases as the number grows.
So you can't directly compare two close floats and have healthy results.
More information on floating points can be found on this Wikipedia page.
What you could do is emulate fixed point math. Have an int n = 100000; to represent 1.00000 internally (1000 -> 0.001 and such) and do calculations accordingly or use a fixed point math library.
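A sketch of that fixed-point idea (the scale factor 100000 and the variable names are illustrative choices): the gap is held as an integer count of 0.00001 units, so the range check is an exact integer comparison.

#include <stdio.h>

int main(void)
{
    double input;
    if (scanf("%lf", &input) != 1)
        return 1;

    long units = (long)(input * 100000.0 + 0.5);   /* 0.00002 -> 2, 1.00000 -> 100000 */

    if (units > 1 && units < 100000)               /* edges excluded, compared exactly */
        printf("gap accepted: %ld units of 0.00001\n", units);
    else
        printf("gap out of range\n");
    return 0;
}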
The fraction part of a single-precision floating-point number can represent values from -2 to 2 - 2^-23, with a smallest quantization step of 2^-23. So if some value cannot be represented with such a step, it is represented by the nearest value according to IEEE 754 rounding rules:
0.00002 * 32768 = 0.655360043       // floating point exponent is chosen
0.655360043 / (2^-23) = 5497558.5   // not an integer multiplier of the quantization step, so the
5497558 * (2^-23) = 0.655359983     // nearest value is chosen
5497559 * (2^-23) = 0.655360103     // from these two variants
The first variant equals 1.999969797×10⁻⁵ in decimal format and the second one equals 1.999999948×10⁻⁵ (just to compare: if we chose 5497560 we would get 2.000000677×10⁻⁵). So the second variant is chosen as the result, and its value is not equal to 0.00002.
The total precision of a floating-point number depends on the exponent value as well (which for normal single-precision numbers ranges from -126 to 127): it can be computed by multiplying the quantization step of the fraction part by the exponent's scale. In the case of 0.00002 the total precision is (2^-23)×(2^-15) = 3.6×(10^-12). It means that if we add to 0.00002 a value smaller than half of this, 0.00002 remains the same. In general it means that the meaningful digits of a floating-point number span from 1×2^exponent down to 2^-23×2^exponent.
That is why a very popular approach is to compare two floating point numbers using some epsilon value which is greater than the quantization step.
Like some of the comments said, due to how floating point numbers are represented, you will see errors like this.
A solution to this is to convert the test to
gap + 1e-8 < 0.00002
This gives you a small tolerance window, enough to let most of the cases you want pass and most of the ones you don't want fail.