This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 8 years ago.
today happened to me a strange thing, when I try to compile and execute the output of this code isn't what I expected. Here is the code that simply add floating values to an array of float and then print it out.
The simple code:
int main(){
float r[10];
int z;
int i=34;
for(z=0;z<10;z++){
i=z*z*z;
r[z]=i;
r[z]=r[z]+0.634;
printf("%f\n",r[z]);
}
}
the output:
0.634000
1.634000
8.634000
27.634001
64.634003
125.634003
216.634003
343.634003
512.633972
729.633972
note that from the 27 appears numbers after the .634 that should not be there. Anyone know why this happened? It's an event caused by floating point approximation?..
P.S I have a linux debian system, 64 bit
thanks all
A number maybe represented in the following form:
[sign] [mantissa] * 2[exponent]
So there will be rounding or relative errors when the space is less in memory.
From wiki:
Single-precision floating-point format is a computer number format that occupies 4 bytes (32 bits) in computer memory and represents a wide dynamic range of values by using a floating point.
The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
This gives from 6 to 9 significant decimal digits precision (if a
decimal string with at most 6 significant decimal is converted to IEEE
754 single precision and then converted back to the same number of
significant decimal, then the final string should match the original;
and if an IEEE 754 single precision is converted to a decimal string
with at least 9 significant decimal and then converted back to single,
then the final number must match the original [4]).
Edit (Edward's comment): Larger (more bits) floating point representations allow for greater precision.
Yes, this is a floating point approximation error or Round-off error. Floating point numbers representation uses quantization to represent a large range of numbers, so it only represent steps and round all the in-between numbers to the nearest step. This cause error if the wanted number is not one of these steps.
In addition to the other useful answers, it can be illustrative to print more digits than the default:
int main(){
float r[10];
int z;
int i=34;
for(z=0;z<10;z++){
i=z*z*z;
r[z]=i;
r[z]=r[z]+0.634;
printf("%.30f\n",r[z]);
}
}
gives
0.634000003337860107421875000000
1.633999943733215332031250000000
8.633999824523925781250000000000
27.634000778198242187500000000000
64.634002685546875000000000000000
125.634002685546875000000000000000
216.634002685546875000000000000000
343.634002685546875000000000000000
512.633972167968750000000000000000
729.633972167968750000000000000000
In particular, note that 0.634 isn't actually "0.634", but instead is the closest number representable by a float.
"float" has only about six digit precision, so it isn't unexpected that you get errors that large.
If you used "double", you would have about 15 digits precision. You would have an error, but you would get for example 125.634000000000003 and not 125.634003.
So you will always get rounding errors and your results will not be quite what you expect, but by using double the effect will be minimal. Warning: If you do things like adding 125 + 0.634 and then subtract 125, the result will (most likely) not be 0.634. No matter whether you use float or double. But with double, the result will be very, very close to 0.634.
In principle, given the choice of float and double, you should never use float, unless you have a very, very good reason.
Related
This question already has answers here:
Why IEEE754 single-precision float has only 7 digit precision?
(2 answers)
Closed 1 year ago.
Why does C print float values after the decimal point different from the input value?
Following is the code.
CODE:
#include <stdio.h>
#include<math.h>
void main()
{
float num=2118850.132000;
printf("num:%f",num);
}
OUTPUT:
num:2118850.250000
This should have printed 2118850.132000, But instead it is changing the digits after the decimal to .250000. Why is it happening so?
Also, what can one do to avoid this?
Please guide me.
Your computer uses binary floating point internally. Type float has 24 bits of precision, which translates to approximately 7 decimal digits of precision.
Your number, 2118850.132, has 10 decimal digits of precision. So right away we can see that it probably won't be possible to represent this number exactly as a float.
Furthermore, due to the properties of binary numbers, no decimal fraction that ends in 1, 2, 3, 4, 6, 7, 8, or 9 (that is, numbers like 0.1 or 0.2 or 0.132) can be exactly represented in binary. So those numbers are always going to experience some conversion or roundoff error.
When you enter the number 2118850.132 as a float, it is converted internally into the binary fraction 1000000101010011000010.01. That's equivalent to the decimal fraction 2118850.25. So that's why the .132 seems to get converted to 0.25.
As I mentioned, float has only 24 bits of precision. You'll notice that 1000000101010011000010.01 is exactly 24 bits long. So we can't, for example, get closer to your original number by using something like 1000000101010011000010.001, which would be equivalent to 2118850.125, which would be closer to your 2118850.132. No, the next lower 24-bit fraction is 1000000101010011000010.00 which is equivalent to 2118850.00, and the next higher one is 1000000101010011000010.10 which is equivalent to 2118850.50, and both of those are farther away from your 2118850.132. So 2118850.25 is as close as you can get with a float.
If you used type double you could get closer. Type double has 53 bits of precision, which translates to approximately 16 decimal digits. But you still have the problem that .132 ends in 2 and so can never be exactly represented in binary. As type double, your number would be represented internally as the binary number 1000000101010011000010.0010000111001010110000001000010 (note 53 bits), which is equivalent to 2118850.132000000216066837310791015625, which is much closer to your 2118850.132, but is still not exact. (Also notice that 2118850.132000000216066837310791015625 begins to diverge from your 2118850.1320000000 after 16 digits.)
So how do you avoid this? At one level, you can't. It's a fundamental limitation of finite-precision floating-point numbers that they cannot represent all real numbers with perfect accuracy. Also, the fact that computers typically use binary floating-point internally means that they can almost never represent "exact-looking" decimal fractions like .132 exactly.
There are two things you can do:
If you need more than about 7 digits worth of precision, definitely use type double, don't try to use type float.
If you believe your data is accurate to three places past the decimal, print it out using %.3f. If you take 2118850.132 as a double, and printf it using %.3f, you'll get 2118850.132, like you want. (But if you printed it with %.12f, you'd get the misleading 2118850.132000000216.)
This will work if you use double instead of float:
#include <stdio.h>
#include<math.h>
void main()
{
double num=2118850.132000;
printf("num:%f",num);
}
This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 3 years ago.
I'm getting confused with floating point number in c programming
#include <stdio.h>
int main()
{
float a = 255.167715;
printf("%f", a);
return 0;
} // it print value of 255.167709
Why it produce value like so? To be honest can you tell how it actually works in C programming?
In binary, 255.167715 is approximately 11111111.001010101110111101011110110010000000110001110…2. In your C implementation, most likely, the source code 255.167715 is converted to 11111111.0010101011101111010111101100100000001100011102, which is 255.16771499999998695784597657620906829833984375, because 255.167715 is a double constant, and that is the closest value representable in your implementation’s double type to the decimal number 255.167715, because the double type has only 53-bit significands. (A significand is the fraction portion of a floating-point number. There is also a sign and an exponent portion.)
Then, for float a = 255.167715;, this double value is converted to float. Since the float type has only 24-bit significands, the result is 11111111.00101010111011112, which is 255.1677093505859375 in decimal.
When you print this with the default formatting of %f, six digits after the decimal place are used, so it prints “255.167709”.
The short answer is that there is no such single-precision floating-point number as 255.167715. This is true for any computer using IEEE 754 floating-point formats; the situation is not unique to C. As a single-precision floating-point number, the closest value is 255.167709.
Single-precision floating point gives you the equivalent of 7 or so decimal digits of precision. And as you can see, the input you gave and the output you got agreed to seven places, namely 255.1677. Any digits past that are essentially meaningless.
For much more about the sometimes-surprising aspects of floating-point math, see "Is floating point math broken?".
This question already has answers here:
Why does floating-point arithmetic not give exact results when adding decimal fractions?
(31 answers)
Closed 7 years ago.
A checker(of type double) array is obtained from a gsl_vector in the following way.
for (i=0; i<M; i++)
{
checker[i] = (double)gsl_vector_get(check, i);
printf(" %f", checker[i]);
}
Array checker has [ 3.000000 -3.000000 -11.000000 -5.000000 ] elements after the above operation (from the output of the above program).
I am facing a weird problem now.
for (i=0; i<M; i++)
{
printf("checker: %f\n", checker[i]);
if(checker[0] == 3.00)
{
printf("Inside If: %f\n", checker[i]);
}
}
The above code outputs
checker: 3.000000
checker: -3.000000
checker: -11.000000
checker: -5.000000
As seen, the if loop inside for is not executed. What could be the problem?
Edit: The above problem is gone when I directly copied [ 3.000000 -3.000000 -11.000000 -5.000000 ] into the checker Array instead of the gsl_vector_get(check,i). Check value comes from dgmev function where a matrix and a vector are multiplied.
Thanks
A floating point number maybe represented in the following form:
[sign] [mantissa] * 2[exponent]
So there will be rounding or relative errors when the space is less in memory.
From wiki:
Single-precision floating-point format is a computer number format that occupies 4 bytes (32 bits) in computer memory and represents a wide dynamic range of values by using a floating point.
The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
This gives from 6 to 9 significant decimal digits precision (if a
decimal string with at most 6 significant decimal is converted to IEEE
754 single precision and then converted back to the same number of
significant decimal, then the final string should match the original;
and if an IEEE 754 single precision is converted to a decimal string
with at least 9 significant decimal and then converted back to single,
then the final number must match the original [4]).
Larger (more bits) floating point representations allow for greater precision.
Floating point math is not exact. Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations can change the result. A must read:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident. Invalid operation covers the situations listed in TABLE D-3, and any comparison that involves a NaN.
This question already has answers here:
strange output in comparison of float with float literal
(8 answers)
Closed 7 years ago.
I was reading this. It contains following C program.
#include<stdio.h>
int main()
{
float x = 0.1;
if (x == 0.1)
printf("IF");
else if (x == 0.1f)
printf("ELSE IF");
else
printf("ELSE");
}
The article says that
The output of above program is “ELSE IF” which means the expression “x
== 0.1″ returns false and expression “x == 0.1f” returns true.
But I tried it on different compilers & getting different outputs:
Here is the outputs on various IDEs.
1) Orwell Dev C++: ELSE
2) Code Blocks 13.12: ELSE IF but it gives following warnings during compilation.
warning: comparing floating point with == or != is unsafe.
Why this comparison is unsafe?
3) Ideone.com: ELSE IF (see run: http://ideone.com/VOE3E0)
4) TDM GCC 32 bit: ELSE IF
5) MSVS 2010: ELSE IF but compiles with warning
Warning 1 warning C4305: 'initializing' : truncation from 'double' to
'float'
What is exactly happening here? What's wrong with the program? Is it implementation defined behavior occurring?
Please help me.
A floating point number maybe represented in the following form:
[sign] [mantissa] * 2[exponent]
So there will be rounding or relative errors when the space is less in memory.
From wiki:
Single-precision floating-point format is a computer number format that occupies 4 bytes (32 bits) in computer memory and represents a wide dynamic range of values by using a floating point.
The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
This gives from 6 to 9 significant decimal digits precision (if a
decimal string with at most 6 significant decimal is converted to IEEE
754 single precision and then converted back to the same number of
significant decimal, then the final string should match the original;
and if an IEEE 754 single precision is converted to a decimal string
with at least 9 significant decimal and then converted back to single,
then the final number must match the original [4]).
Larger (more bits) floating point representations allow for greater precision.
Floating point math is not exact. Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations can change the result. A must read:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident. Invalid operation covers the situations listed in TABLE D-3, and any comparison that involves a NaN.
float a=67107842,b=512;
float c=a/b;
printf("%lf\n",c);
Why is c 131070.000000 instead of the correct value 131070.00390625?
Your compiler's float type is probably using the 32-bit IEEE 754 single-precision format.
67107842 is a 26-bit binary number:
11111111111111110000000010
The single-precision format represents most numbers as 1.x multipled by some (positive or negative) power of two, where 23 bits are stored after the binary place, with the leading 1. being implied (very small numbers are an exception).
But 67107842 would require 24 bits after the binary place (to be represented as 1.111111111111111000000001 multipled by 225). As there is only room to store 23 bits, the final 1 gets lost. So it is the value in a that is wrong in this case, not the division - a actually contains 67107840 (11111111111111110000000000), which is exactly 131070 * 512.
You can see this if you print a as well:
printf("%lf %lf %lf\n", a, b, c);
gives
67107840.000000 512.000000 131070.000000
Try changing a and c to be type "double", rather than float. That will give you better precision / accuracy. (Floats have about 6 or so significant digits; doubles have more than twice that.)
A float typically uses 32bit IEEE-754 single precision representation, and is good for only approximately 6 significant decimal figures. A double is good for 15, and where supported an 80 bit long double gets to 20 significant figures.
Note that on some compilers there is no distinction between double and long double, or even no support for long double at all.
One solution is to use an arbitrary-precision numeric library, or to use a decimal-floating point library rather then the built-in binary floating point support. Decimal floating point is not intrinsically more precise (though often such libraries support larger, more precise types), but will not show up the artefacts that occur when displaying a decimal representation of a binary floating point value. Decimal floating point is also likely to be much slower, since it is not typically implemented in hardware.