C floating point zero comparison

Will the following code, with nothing in between the lines, always produce a value of true for the boolean b?
double d = 0.0;
bool b = (d == 0.0);
I'm using g++ version 4.8.1.

Assuming IEEE-754 (and probably most floating point representations), this is correct as 0.0 is representable exactly in all IEEE-754 formats.
Now if we take another literal that is not representable exactly in IEEE-754 binary formats, like 0.1:
double d = 0.1;
bool b = (d == 0.1);
This may result in a false value for b!
The implementation has the right, for example, to use double precision for d but a greater precision for the comparison with the literal.
(C99, 5.2.4.2.2p8) "Except for assignment and cast (which remove all extra range and precision), the values of operations with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type."

Related

Comparison of double and float - Implicit casting

Assume we have the variables double d and float f in the C programming language.
As far as I understand, the expression d == (float) d will not be true for all double values, since when we cast to float we truncate it and hence lose precision.
On the other hand, f == (double) f should be true for all float values (except for NaN, because NaN != NaN), since we aren't losing anything (just extending the mantissa with zeros).
I have read that when comparing a float to a double the float will implicitly be cast to a double: https://en.cppreference.com/w/c/language/conversion#Usual_arithmetic_conversions, is this implicit casting correct for all values (including infinity and NaN)?
I am aware that this is a pretty straightforward question; I have played with it for a while, but it would be great if someone could confirm this. The first part is already answered in other posts, but I haven't found answers for the second part of the question.
In a C implementation that conforms to the C standard, f == (double) f evaluates to true for all float values of f other than NaNs. (For a NaN, f == f is false.) This is true because, in f == (double) f, the left operand is a float, so it is automatically converted to double, and the expression is then equivalent to (double) f == (double) f, and so is inherently true.
The C standard allows implementations to evaluate floating-point expressions with more precision than the nominal types of the operands. However, excess precision would have no effect on cast operators (which are required to discard excess precision) or the == operator. So (double) f == (double) f is not affected by this, and its computed value is the same as its mathematical value.
You might be interested in the result of f == (float) (double) f. In this, since both operands of == have type float, there is no automatic conversion to double. You could ask whether the cast conversion to double introduces some change, and then converting back to float could produce a different value. It cannot.
To see that it cannot, consider if f is infinity. Then (double) f is infinity, and so is (float) (double) f, so the result is a comparison of infinity to infinity, which evaluates to true. (This also holds for negative infinity.) If f is not infinity or a NaN, it is a finite value.
Per C 2018 6.2.5 10, “The set of values of the type float is a subset of the set of values of the type double;…” Therefore, every value representable in float is representable in double, so the conversion to double does not change the value, and neither does the conversion back to float. Therefore, f == (float) (double) f evaluates to true for all float values of f other than NaN.
Note that while you cannot determine whether two NaNs are identical using ==, you could compare the bytes in their representations using memcmp. In this case, conversion to double and back to float is not required to preserve any information in the NaN object other than that it is a NaN; any payload information may be lost.
Yes. Casting to a greater precision will not produce incorrect values, and it is true that comparing a float to a double will implicitly cast the float into a double.

Why is the statement "f == (float)(double)f;" wrong?

I have recently taken a lecture on Systems Programming, and my professor told me that f == (float)(double) f is wrong, which I cannot understand.
I know that double type loses its data when converted to float, but I believe the loss happens only if the stored number in double type cannot be expressed in float type.
Shouldn't it be true, the same way x == (int)(double)x is true?
The picture is the way I'm understanding it.
I'm sorry that I didn't make my question clear.
The question is not about the declaration, but about the double type conversion.
I hope you don't lose your precious time because of my fault.
Assuming IEC 60559, the result of f == (float)(double) f depends on the type of f.
Further assuming f is a float, then there's nothing "wrong" about the expression - it will evaluate to true (unless f held NaN, in which case the expression will evaluate to false).
On the other hand, x == (int)(double)x (assuming x is a int) is (potentially) problematic, since a double precision IEC 60559 floating point value only has 53 bits for the significand1, which cannot represent all possible values of an int if it uses more than 53 bits for its value on your platform (admittedly rare). So it will evaluate to true on platforms where ints are 32-bit (using 31 bits for the value), and might evaluate to false on platforms where ints are 64-bit (using 63 bits for the value) (depending on the value).
Relevant quotes from the C standard (6.3.1.4 and 6.3.1.5) :
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged.
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
When a value of real floating type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged.
1 a double precision IEC 60559 floating point value consists of 1 bit for the sign, 11 bits for the exponent, and 53 bits for the significand (of which 1 is implied and not stored) - totaling 64 (stored) bits.
Taking the question as posed in the title literally,
Why is the statement “f == (float)(double)f;” wrong?
the statement is "wrong" not in any way related to the representation of floating point values but because it is trivially optimized away by any compiler and thus you might as well have saved the electrons used to store it. It is exactly equivalent to the statement
1;
or, if you like, to the statement (from the original question)
x == (int)(double)x;
(which has exactly the same effect as that in the title, regardless of the available precision of the types int, float, and double, i.e. none whatsoever).
Programming being somewhat concerned with precision you should perhaps take note of the difference between a statement and an expression. An expression has a value which might be true or false or something else, but when you add a semicolon (as you did in the question) it becomes a statement (as you called it in the question) and in the absence of side effects the compiler is free to throw it away.
NaNs are retained through float => double => float, but they do not compare equal to themselves.
#include <math.h>
#include <stdio.h>
int main(void) {
float f = HUGE_VALF;
printf("%d\n", f == (float)(double) f);
f = NAN;
printf("%d\n", f == (float)(double) f);
printf("%d\n", f == f);
}
Prints
1
0
0

Dodging the inaccuracy of a floating point number

I totally understand the problems associated with floating points, but I have seen a very interesting behavior that I can't explain.
float x = 1028.25478;
long int y = 102825478;
float z = y/(float)100000.0;
printf("x = %f ", x);
printf("z = %f",z);
The output is:
x = 1028.254761 z = 1028.254780
Now if floating numbers failed to represent that specific random value (1028.25478) when I assigned that to variable x. Why isn't it the same in case of variable z?
P.S. I'm using pellesC IDE to test the code (C11 compiler).
I am pretty sure that what happens here is that the latter floating point variable is elided and instead kept in a double-precision register, and then passed as-is as an argument to printf. The compiler then believes it is safe to pass this number at double precision after default argument promotions.
I managed to produce a similar result using GCC 7.2.0, with these switches:
-Wall -Werror -ffast-math -m32 -funsafe-math-optimizations -fexcess-precision=fast -O3
The output is
x = 1028.254761 z = 1028.254800
The number is slightly different there^.
The description for -fexcess-precision=fast says:
-fexcess-precision=style
This option allows further control over excess precision on
machines where floating-point operations occur in a format with
more precision or range than the IEEE standard and interchange
floating-point types. By default, -fexcess-precision=fast is in
effect; this means that operations may be carried out in a wider
precision than the types specified in the source if that would
result in faster code, and it is unpredictable when rounding to
the types specified in the source code takes place. When
compiling C, if -fexcess-precision=standard is specified then
excess precision follows the rules specified in ISO C99; in
particular, both casts and assignments cause values to be rounded
to their semantic types (whereas -ffloat-store only affects
assignments). This option [-fexcess-precision=standard] is enabled by default for C if a
strict conformance option such as -std=c99 is used. -ffast-math
enables -fexcess-precision=fast by default regardless of whether
a strict conformance option is used.
This behaviour isn't C11-compliant
Restricting this to IEEE754 strict floating point, the answers should be the same.
1028.25478 is actually 1028.2547607421875. That accounts for x.
In the evaluation of y / (float)100000.0;, y is converted to float by the usual arithmetic conversions. The closest float to 102825478 is 102825480. IEEE754 requires returning the best result of a division, which should be 1028.2547607421875 (the value of z): the closest float to 1028.25480.
So my answer is at odds with your observed behaviour. I put that down to your compiler not implementing floating point strictly; or perhaps not implementing IEEE754.
The code acts as if z were a double and y/(float)100000.0 were y/100000.0.
float x = 1028.25478;
long int y = 102825478;
double z = y/100000.0;
// output
x = 1028.254761 z = 1028.254780
An important consideration is FLT_EVAL_METHOD. This allows select floating point code to evaluate at higher precision.
#include <float.h>
#include <stdio.h>

int main(void) {
    printf("FLT_EVAL_METHOD %d\n", FLT_EVAL_METHOD);
}
Except for assignment and cast ..., the values yielded by operators with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type. The use of evaluation formats is characterized by the implementation-defined value of FLT_EVAL_METHOD.
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the type;
1 evaluate operations and constants of type float and double to the range and precision of the double type; evaluate long double operations and constants to the range and precision of the long double type;
2 evaluate all operations and constants to the range and precision of the long double type.
Yet this should not apply here: with float z = y/(float)100000.0;, z should lose all higher precision on the assignment.
I agree with @Antti Haapala that the code is using a speed optimization that has less adherence to the expected rules of floating point math.

C IEEE-Floats inf equal inf

In C, on an implementation with IEEE-754 floats, when I compare two floating point numbers which are NaN, the comparison returns 0 or "false". But why do two floating point numbers which are both inf count as equal?
This program prints "equal: ..." (at least under Linux AMD64 with gcc), and in my opinion it should print "different: ...".
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
volatile double a = 1e200; //use volatile to suppress compiler warnings
volatile double b = 3e200;
volatile double c = 1e200;
double resA = a * c; //resA and resB should be inf
double resB = b * c;
if (resA == resB)
{
printf("equal: %e * %e = %e = %e = %e * %e\n",a,c,resA,resB,b,c);
}
else
{
printf("different: %e * %e = %e != %e = %e * %e\n", a, c, resA, resB, b, c);
}
return EXIT_SUCCESS;
}
Another example of why I think inf should not equal inf: the sets of natural numbers and rational numbers are both infinite, but they are not the same.
So why is inf == inf?
Infinities compare equal because that's what the standard says. From section 5.11 Details of comparison predicates:
Infinite operands of the same sign shall compare equal.
inf==inf for the same reason that almost all floating point numbers compare equal to themselves: Because they're equal. They contain the same sign, exponent, and mantissa.
You might be thinking of how NaN != NaN. But that's a relatively unimportant consequence of a much more important invariant: NaN != x for any x. As the name implies, NaN is not any number at all, and hence cannot compare equal to anything, because the comparison in question is a numeric one (hence why -0 == +0).
It would certainly make some amount of sense to have inf compare unequal to other infs, since in a mathematical context they're almost certainly unequal. But keep in mind that floating point equality is not the same thing as absolute mathematical equality; 0.1f * 10.0f != 1.0f, and 1e100f + 1.0f == 1e100f. Just as floating point numbers gradually underflow into denormals without compromising as-good-as-possible equality, so they overflow into infinity without compromising as-good-as-possible equality.
If you want inf != inf, you can emulate it: 1e400 == 3e400 evaluates to true, but 1e400 - 3e400 == 0 evaluates to false, because the result of +inf + -inf is NaN. (Arguably you could say it should evaluate to 0, but that would serve nobody's interest.)
Background
In C, according to the IEEE 754 binary floating point standard (so, if you use a float or a double) you're going to get an exact value that can be compared exactly with another variable of the same type. Well, this is true unless your computations produce a value that lies outside the range that can be represented (i.e., overflow).
Why is Infinity == Infinity
resA and resB
The IEEE-754 standard defines infinity and negative infinity to compare greater than and less than, respectively, every other value that may be represented according to the standard, except for NaN, which is neither less than, equal to, nor greater than any floating point value (even itself). Positive infinity is the bit pattern 0 11111111111 0000000000000000000000000000000000000000000000000000 and negative infinity is 1 11111111111 0000000000000000000000000000000000000000000000000000. Take note that infinity and its negative have explicit definitions in their sign, exponent, and mantissa bits.
So, resA and resB are infinity and since infinity is explicitly defined and reproducible, resA==resB. I'm fairly certain this is how isinf() is implemented.
Why is NaN != NaN
However, NaN is not explicitly defined. A NaN value has an arbitrary sign bit, exponent bits of all 1s (just like infinity and its negative), and any set of non-zero fraction bits (Source). So how would you tell one NaN from another, if their fraction bits are arbitrary anyway? The standard doesn't try, and simply returns false when two floating point values of this structure are compared to one another.
More Explanation
Because infinity is an explicitly defined value (Source, GNU C Manual):
Infinities propagate through calculations as one would expect
2 + ∞ = ∞
4 ÷ ∞ = 0
arctan (∞) = π/2.
However, NaN may or may not propagate quietly through computations. When it does, it is a QNaN (quiet NaN, most significant fraction bit set) and all computations involving it result in NaN. When it doesn't, it is an SNaN (signalling NaN, most significant fraction bit clear) and computations involving it raise an exception.
There are many arithmetic systems. Some of them, including the ones normally covered in high school mathematics, such as the real numbers, do not have infinity as a number. Others have a single infinity, for example the projectively extended real line. Others, such as the IEEE floating point arithmetic under discussion, and the extended real line, have both positive and negative infinity.
IEEE754 arithmetic is different from real number arithmetic in many ways, but is a useful approximation for many purposes.
There is logic to the different treatment of NaNs and infinities. It is entirely reasonable to say that positive infinity is greater than negative infinity and any finite number. It would not be reasonable to say anything similar about the square root of -1.

What does the numerical literal 0.e0f mean?

I am currently trying to debug an uninitialized memory error. I have now come across the numerical literal 0.e0f in the OpenBlas source code (which is what the debugger is currently at) what does that mean?
The context is this:
if ((alpha_r == 0.e0f) && (alpha_i == 0.e0f)) return;
The 0.e0f evaluates to 0 apparently.
Floating-point literals have two syntaxes. The first one consists of the following parts:
nonempty sequence of decimal digits containing a decimal point character (defines the significand)
(optional) e or E followed by an optional minus or plus sign and a nonempty sequence of decimal digits (defines the exponent)
(optional) a suffix type specifier: l, f, L, or F
The second one consists of the following parts:
nonempty sequence of decimal digits (defines the significand)
e or E followed by an optional minus or plus sign and a nonempty sequence of decimal digits (defines the exponent)
(optional) a suffix type specifier: l, f, L, or F
The suffix type specifier determines the actual type of the floating-point literal:
(no suffix) defines double
f or F defines float
l or L defines long double
f is the float suffix.
eX is an exponent: 10 to the power of X. For example, e5 means 10^5, which is 100000, and e-3 means 10^-3, which is 0.001.
Combining the two
1.23e-3f --> 1.23 x 10 ^ -3 = 0.00123
By expanding your example it is
0.e0f --> 0.0 x 10 ^ 0 = 0.0 (in floating point format)
PS: A useful programming practice. Never ever (see PS2) compare two floating point numbers for equality.
There are some values that cannot be represented exactly by floating point types, only approximated.
Like this example
0.3 + 0.6 = 0.89999999999999991 != 0.9
Instead use:
#include <float.h>
#include <math.h>

float a;
float b;
/* ... */
if (fabsf(a - b) < FLT_EPSILON)
FLT_EPSILON is the smallest floating point value such that 1.0f + FLT_EPSILON != 1.0f (about 1.19e-7 for IEEE-754 single precision). (Adapted from @AlterMann)
abs is the integer absolute value function; the floating point version in the standard library is fabs (fabsf for float). It can be found in other libraries under different names too.
PS 2: OK... "Never ever" was a little strong. I didn't mean that this piece of code is wrong, either as an algorithm or in its syntax. The question itself is fairly simple, but this warning may help new programmers who end up here.
As stated in the comments, this code example is appropriate for its check: comparing against 0 is a valid operation, because current standards for floating point representation guarantee that 0 can be expressed exactly.
It's zero in scientific notation, as single-precision floating point, with a redundant decimal mark. It's the same thing as 0e0.
That's a 0 of type float.
See 6.4.4.2 in the C99 Standard ( http://port70.net/~nsz/c/c99/n1256.html#6.4.4.2 )
In my opinion a plain 0 would be better irrespective of the type of alpha_r and alpha_i
if ((alpha_r == 0) && (alpha_i == 0)) return;
