float - c data types - c

int main() {
float a = 20000000;
float b = 1;
float c = a+b;
if (c==a) { printf("equal"); }
else { printf("not equal");}
return 0;
}
when I run this it says "equal".
but when I change the value of a to 2000000 (one zero less) the answer is no.
why ?

Typically, a float has a precision of 24 bits. The number 20000001 = 0x1312d01 needs 25 bits to be represented exactly, so it must be rounded. The normal rounding mode for values exactly half-way between two representable values is rounding to last-bit-zero, hence 20000001 is rounded to 20000000 as a float.
2000001 = 0x1e8481 needs fewer than 24 bits to be represented (21), so there is no rounding needed for that.

Floating point numbers are very often approximate values instead of exact values. You can read all about it here:
http://en.wikipedia.org/wiki/Floating_point

If you declare two variable as of type float and then even you put those two value same. You would get unpredictable result if you compare for equality. For more details google for IEEE standard (IEEE 754) for representing floating point number.
wikipedia article

Precision of float.
What range of numbers can be represented in a 16-, 32- and 64-bit IEEE-754 systems?
http://en.wikipedia.org/wiki/Floating_point
http://en.wikipedia.org/wiki/Single-precision_floating-point_format

Related

Nonintuitive result of the assignment of a double precision number to an int variable in C

Could someone give me an explanation why I get two different
numbers, resp. 14 and 15, as an output from the following code?
#include <stdio.h>
int main()
{
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
printf("%d %d",b,c); // 14 15, why?
return 0;
}
I expect to get 15 in both cases but it seems I'm missing some fundamentals of the language.
I am not sure if it's relevant but I was doing the test in CodeBlocks. However, if I type the same lines of code in some on-line compiler ( this one for example) I get an answer of 15 for the two printed variables.
... why I get two different numbers ...
Aside from the usual float-point issues, the computation paths to b and c are arrived in different ways. c is calculated by first saving the value as double a.
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.
Except for assignment and cast (which remove all extra range and precision), ...
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the
type;
1 evaluate operations and constants of type float and double to the
range and precision of the double type, evaluate long double
operations and constants to the range and precision of the long double
type;
2 evaluate all operations and constants to the range and precision of the
long double type.
C11dr §5.2.4.2.2 9
OP reported 2
By saving the quotient in double a = (Vmax-Vmin)/step;, precision is forced to double whereas int b = (Vmax-Vmin)/step; could compute as long double.
This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double versus remaining a long double. One as 15 (or just above), and the other just under 15. int truncation amplifies this difference to 15 and 14.
On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.
Conversion to int from a floating-point number is severe with numbers near a whole number. Often better to round() or lround(). The best solution is situation dependent.
This is indeed an interesting question, here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits mantissa plus one implicit bit. For details on the representation, see the wikipedia article.
Ok, so you first define some variables:
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
The respective values in binary will be
Vmax = 10.111001100110011001100110011001100110011001100110011
Vmin = 1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010
If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.
Now you do some math on these numbers. The first operation, the subtraction, results in the precise result:
10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
1.1000000000000000000000000000000000000000000000000000
Then you divide by step, which has been rounded up by your compiler:
1.1000000000000000000000000000000000000000000000000000
/ .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111
Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.
So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):
1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110
However, if you first store the result in a variable of type double, rounding takes place:
1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded: 1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111
And this is precisely the result you got.
The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011...[2] . (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.
And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.
The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.
See also question 14.4a in the C FAQ list.
An equivalent problem is analyzed in analysis of C programs for FLT_EVAL_METHOD==2.
If FLT_EVAL_METHOD==2:
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
computes b by evaluating a long double expression then truncating it to a int, whereas for c it's evaluating from long double, truncating it to double and then to int.
So both values are not obtained with the same process, and this may lead to different results because floating types does not provides usual exact arithmetic.

why there is a different types of output for different float comparisons

I have the following code
Case 1
float a=1.7;
if(a==1.7)
{
printf("Equal\n");
}
else if(a<1.7)
{
printf("less\n");
}
else{
printf("Greater\n");
}
Output
Greater
Case 2
float a=1.3;
if(a==1.3)
{
printf("Equal\n");
}
else if(a<1.3)
{
printf("less\n");
}
else{
printf("Greater\n");
}
Output
Less
Why the outputs are not as expected?
Why the outputs are different in both cases?
Can anyone explain how the values are stored and compared here
Thanks in advance
a combination of things -
internally C uses double; float is a storage format.
1.3 is an infinitely repeating decimal when expressed in binary
IEEE floating-point rounds the mantissa when storing it. 1.3 stored as a float gets
rounded down, and 1.7 gets rounded up (by a tiny bit)
the float a when read will be converted to a double by adding extra digits of
least significant bits set to 0's. Values that were rounded down are less than they
were. Values that were rounded up, even though they end in all zeros, are greater.
and when comparing 1.xyzxyz000000 to 1.xyzxyzxyzxyz where not all of the xyz is 0,
the one ending in zeros will be less.
Edit: missed the ieee rounding aspect earlier
For the curious, here is the actual value stored for 1.3 and 1.7 as 32-bit floats:
1.299999952316284179687500000000
1.700000047683715820312500000000
and as 64-bit floats:
1.30000000000000004440892098500626161694526672363281250000000000000000
1.69999999999999995559107901499373838305473327636718750000000000000000
When you assign a float a literal and then compare the result to the same literal, you should see that the values are equal. The trouble is, in your code you are not comparing a float to float: it is a float to double comparison.
Constants 1.7 and 1.3 are of type double. When you compare them to a float, the value of float gets expanded to double before the comparison. This is the stage at which the values become non-equal: unless the constant is represented exactly (neither 1.7 nor 1.3 can be represented exactly in a floating point number) there will be a conversion error.
You can address this by comparing to a float constant: if you cast your constants to float before comparison, the equality checks would succeed:
float a=1.7;
if(a==(float)1.7) {
printf("Equal\n");
} else if(a<(float)1.7) {
printf("less\n");
} else {
printf("Greater\n");
}
Demo (prints Equals).
In checking case it will consider an integer. so it will roundup the floating value to
integer.
1.7 roundup as 2.
1.3 roundup as 1.
so this is the reason for that answer.

Float point numbers are not equal, despite being the same? [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Identical float values comparing as inequal [duplicate]
(4 answers)
Closed 3 years ago.
The program below outputs This No. is not same. Why does it do this when both numbers are the same?
void main() {
float f = 2.7;
if(f == 2.7) {
printf("This No. is same");
} else {
printf("This No. is not same");
}
}
Why does it do this when both numbers are the same?
The numbers are not the same value.
double can represent exactly typically about 264 different values.
float can represent exactly typically about 232 different values.
2.7 is not one of the values - some approximations are made given the binary nature of floating-point encoding versus the decimal text 2.7
The compiler converts 2.7 to the nearest representable double or
2.70000000000000017763568394002504646778106689453125
given the typical binary64 representation of double.
The next best double is
2.699999999999999733546474089962430298328399658203125.
Knowing the exact value beyond 17 significant digits has reduced usefulness.
When the value is assigned to a float, it becomes the nearest representable float or
2.7000000476837158203125.
2.70000000000000017763568394002504646778106689453125 does not equal 2.7000000476837158203125.
Should a compiler use a float/double which represents 2.7 exactly like with decimal32/decimal64, code would work as OP expected. double representation using an underlying decimal format is rare. Far more often the underlying double representation is base 2 and these conversion artifacts need to be considered when programming.
Had the code been float f = 2.5;, the float and double value, using a binary or decimal underlying format, would have made if (f == 2.5) true. The same value as a high precision double is representable exactly as a low precision float.
(Assuming binary32/binary64 floating point)
double has 53 bits of significance and float has 24. The key is that if the number as a double has its least significant (53-24) bits set to 0, when converted to float, it will have the same numeric value as a float or double. Numbers like 1, 2.5 and 2.7000000476837158203125 fulfill that. (range, sub-normal, and NaN issues ignored here.)
This is a reason why exact floating point comparisons are typically only done in select situations.
Check this program:
#include<stdio.h>
int main()
{
float x = 0.1;
printf("%zu %zu %zu\n", sizeof(x), sizeof(0.1), sizeof(0.1f));
return 0;
}
Output is 4 8 4.
The values used in an expression are considered as double (double precision floating point format) unless a f is specified at the end. So the expression “x==0.1″ has a double on right side and float which are stored in a single precision floating point format on left side.
In such situations float is promoted to double .
The double precision format uses uses more bits for precision than single precision format.
In your case add 2.7f to get expected result.
The literal 2.7 in f == 2.7 is converted to double that's why 2.7 is not equal to f.

How to set precision of a float

Can someone explain me how to choose the precision of a float with a C function?
Examples:
theFatFunction(0.666666666, 3) returns 0.667
theFatFunction(0.111111111, 3) returns 0.111
You can't do that, since precision is determined by the data type (i.e. float or double or long double). If you want to round it for printing purposes, you can use the proper format specifiers in printf(), i.e. printf("%0.3f\n", 0.666666666).
You can't. Precision depends entirely on the data type. You've got float and double and that's it.
Floats have a static, fixed precision. You can't change it. What you can sometimes do, is round the number.
See this page, and consider to scale yourself by powers of 10. Note that not all numbers are exactly representable as floats, either.
Most systems follow IEEE-754 floating point standard which defines several floating point types.
On these systems, usually float is the IEEE-754 binary32 single precision type: it has 24-bit of precision. double is the binary64 double precision type; it has 53-bit of precision. The precision in bit numbers is defined by the IEEE-754 standard and cannot be changed.
When you print values of floating point types using functions of the fprintf family (e.g., printf), the precision is defined as the maximum number of significant digits and is by default set to 6 digits. You can change the default precision with a . followed by a decimal number in the conversion specification. For example:
printf("%.10f\n", 4.0 * atan(1.0)); // prints 3.1415926536
whereas
printf("%f\n", 4.0 * atan(1.0)); // prints 3.141593
It might be roughly the following steps:
Add 0.666666666 with 0.0005 (we get 0.667166666)
Multiply by 1000 (we get 667.166666)
Shift the number to an int (we get 667)
Shift it back to float (we get 667.0)
Divide by 1000 (we get 0.667)
Thank you.
Precision is determined by the data type (i.e. float or double or long double).
If you want to round it for printing purposes, you can use the proper format specifiers in printf(), i.e.
printf("%0.3f\n", 0.666666666) //will print 0.667 in c
Now if you want to round it for calculating purposes you have to first multiply the float by 10^number of digits then typecast to int , do the calculation and then again typecast to float and divide by same power of 10
float f=0.66666;
f *= 1000; // 666.660
int i = (int)f; // 666
i = 2*i; // 1332
f = i; // 1332
f /= 1000; // 1.332
printf("%f",f); //1.332000

Why does the addition of two float numbers is incorrect in C?

I have a problem with the addition of two float numbers.
Code below:
float a = 30000.0f;
float b = 4499722832.0f;
printf("%f\n", a+b);
Why the output result is 450002816.000000? (The correct one should be 450002832.)
Float are not represented exactly in C - see http://en.wikipedia.org/wiki/Floating_point#IEEE_754:_floating_point_in_modern_computers and http://en.wikipedia.org/wiki/Single_precision, so calculations with float can only give an approximate result.
This is especially apparent for larger values, since the possible difference can be represented as a percentage of the value. In case of adding/subtracting two values, you get the worse precision of both (and of the result).
Floating-point values cannot represent all integer values.
Remember that single-precision floating-point numbers only have 24 (or 23, depending on how you count) bits of precision (i.e. significant figures). So as values get larger, you begin to lose low-end precision, which is why the result of your calculation isn't quite "correct".
From wikipedia
Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
So your number doesn't actually fit in float. You can use double instead.

Resources