In C, on a implementation with IEEE-754 floats, when I compare two floating point numbers which are NaN, it return 0 or "false". But why do two floating point numbers which both are inf count as equal?
This Program prints "equal: ..." (at least under Linux AMD64 with gcc) and in my opinion it should print "different: ...".
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
volatile double a = 1e200; //use volatile to suppress compiler warnings
volatile double b = 3e200;
volatile double c = 1e200;
double resA = a * c; //resA and resB should by inf
double resB = b * c;
if (resA == resB)
{
printf("equal: %e * %e = %e = %e = %e * %e\n",a,c,resA,resB,b,c);
}
else
{
printf("different: %e * %e = %e != %e = %e * %e\n", a, c, resA, resB, b, c);
}
return EXIT_SUCCESS;
}
A other example, why I think inf is not the same as inf, is: the numbers of natural numbers and rational numbers, both are infinite but not the same.
So why is inf == inf?
Infinities compare equal because that's what the standard says. From section 5.11 Details of comparison predicates:
Infinite operands of the same sign shall compare equal.
inf==inf for the same reason that almost all floating point numbers compare equal to themselves: Because they're equal. They contain the same sign, exponent, and mantissa.
You might be thinking of how NaN != NaN. But that's a relatively unimportant consequence of a much more important invariant: NaN != x for any x. As the name implies, NaN is not any number at all, and hence cannot compare equal to anything, because the comparison in question is a numeric one (hence why -0 == +0).
It would certainly make some amount of sense to have inf compare unequal to other infs, since in a mathematical context they're almost certainly unequal. But keep in mind that floating point equality is not the same thing as absolute mathematical equality; 0.1f * 10.0f != 1.0f, and 1e100f + 1.0f == 1e100f. Just as floating point numbers gradually underflow into denormals without compromising as-good-as-possible equality, so they overflow into infinity without compromising as-good-as-possible equality.
If you want inf != inf, you can emulate it: 1e400 == 3e400 evaluates to true, but 1e400 - 3e400 == 0 evaluates to false, because the result of +inf + -inf is NaN. (Arguably you could say it should evaluate to 0, but that would serve nobody's interest.)
Background
In C, according to the IEEE 754 binary floating point standard (so, if you use a float or a double) you're going to get an exact value that can be compared exactly with another variable of the same type. Well, this is true unless your computations result in a value that lies outside the range of integers that can be represented (i.e., overflow).
Why is Infinity == Infinity
resA and resB
The IEEE-754 standard tailored the values of infinity and negative infinity to be greater than or less than, respectively, all other values that may be represented according to the standard (<= INFINITY == 0 11111111111 0000000000000000000000000000000000000000000000000000 and >= -INFINITY == 1 11111111111 0000000000000000000000000000000000000000000000000000), except for NaN, which is neither less than, equal to, or greater than any floating point value (even itself). Take note that infinity and it's negative have explicit definitions in their sign, exponent, and mantissa bits.
So, resA and resB are infinity and since infinity is explicitly defined and reproducible, resA==resB. I'm fairly certain this is how isinf() is implemented.
Why is NaN != NaN
However, NaN is not explicitly defined. A NaN value has a sign bit of 0, exponent bits of all 1s (just like infinity and it's negative), and any set of non-zero fraction bits (Source). So, how would you tell one NaN from another, if their fraction bits are arbitrary anyways? Well, the standard doesn't assume that and simply returns false when two floating point values of this structure are compared to one another.
More Explanation
Because infinity is an explicitly defined value (Source, GNU C Manual):
Infinities propagate through calculations as one would expect
2 + ∞ = ∞
4 ÷ ∞ = 0
arctan (∞) = π/2.
However, NaN may or may not propagate through propagate through computations. When it does, it is a QNan (Quieting NaN, most significant fraction bit set) and all computations will result in NaN. When it doesn't, it is a SNan (Signalling NaN, most significant fraction bit not set) and all computations will result in an error.
There are many arithmetic systems. Some of them, including the ones normally covered in high school mathematics, such as the real numbers, do not have infinity as a number. Others have a single infinity, for example the projectively extended real line. Others, such as the IEEE floating point arithmetic under discussion, and the extended real line, have both positive and negative infinity.
IEEE754 arithmetic is different from real number arithmetic in many ways, but is a useful approximation for many purposes.
There is logic to the different treatment of NaNs and infinities. It is entirely reasonable to say that positive infinity is greater than negative infinity and any finite number. It would not be reasonable to say anything similar about the square root of -1.
Related
float f = 1.0;
while (f != 0.0) f = f / 2.0;
This loop runs 150 times using 32-bit precision. Why is that so? Is it getting rounded to zero?
In common C implementations, the IEEE-754 binary32 format is used for float. It is also called “single precision.” It is a binary based format where finite numbers are represented as ±f•2e, where f is a 24-bit binary numeral in [1, 2) and e is an integer in [−126, 127].
In this format, 1 is represented as +1.000000000000000000000002•20. Dividing that by 2 yields ½, which is represented as +1.000000000000000000000002•2−1. Dividing that by 2 yields +1.000000000000000000000002•2−2, then +1.000000000000000000000002•2−3, and so on until we reach +1.000000000000000000000002•2−126.
When that is divided by two, the mathematical result is +1.000000000000000000000002•2−127, but −127 is below the normal exponent range, [−126, 127]. Instead, the significand becomes denormalized; 2−127 is represented by +0.100000000000000000000002•2−126. Dividing that by 2 yields +0.010000000000000000000002•2−126, then +0.001000000000000000000002•2−126, +0.000100000000000000000002•2−126, and so on until we get to +0.000000000000000000000012•2−126.
At this point, we have done 149 divisions by 2; +0.000000000000000000000012•2−126 is 2−149.
When the next division is performed, the result would be 2−150, but that is not representable in this format. Even with the lowest non-zero significand, 0.000000000000000000000012, and the lowest exponent, −126, we cannot get to 2−150. The next lower representable number is +0.000000000000000000000002•2−126, which equals 0.
So, the real-number-arithmetic result of the division would be 2−150, but we cannot represent that in this format. The two nearest representable numbers are +0.000000000000000000000012•2−126 just above it and +0.000000000000000000000002•2−126 just below it. They are equally near 2−150. The default rounding method is to take the nearest representable number and, in case of ties, to take the number with the even low digit. So +0.000000000000000000000002•2−126 wins the tie, and that is produced as the result for the 150th division.
What happens is simply that your system has only a limited number of bits available for a variable, and hence limited precision; even though, mathematically, you can halve a number (!= 0) indefinitely without ever reaching zero, in a computer implementation that has a limited precision for a float variable, that variable will inevitably, at some stage, become indistinguishable from zero. The more bits your system uses, the more precision it has and the later this will happen, but at some stage it will.
Since I suppose this is meant to be C, I just implemented it in C (with a counter counting each iteration), and indeed it ran for 150 rounds until the loop ended. I also implemented it with a double, where it ran for 1075 iterations. Keep in mind, however, that the C standard does not define the exact precision of a float variable. In most implementations it's 32 bits for a float and 64 for a double. With a long double, I get 16,446 iterations.
Currently I'm learning about floating point exceptions. I'm writing a loop with a function. In that function a value is calculated that equals 0.5. As the loop proceeds, the input value gets divided by 10.
The loop:
for(i = 0; i < e; i++)
{
xf /= 10.0; // force increasingly smaller values for x
float_testk (xf, i);
}
The function:
void float_testk(float x, int i)
{
float result;
feclearexcept(FE_ALL_EXCEPT); // clear all pending exceptions
result = (1 - cosf(x)) / (x * x);
if(fetestexcept(FE_UNDERFLOW) != 0)
fprintf(stderr,"Underflow occurred in double_testk!\n");
if(fetestexcept(FE_OVERFLOW) != 0)
fprintf(stderr,"Overflow occurred in double_testk!\n");
if(fetestexcept(FE_INVALID) != 0)
fprintf(stderr,"Invalid exception occurred in double_testk!\n");
printf("Iteration %3d, float result for x=%.8f : %f\n",i,x,result);
}
The first few iterations the output is around 0.5 and later it becomes 0 due CC. After a while this is the output of the program:
Iteration 18, float result for x=0.00000000 : 0.000000
Underflow occurred in double_testk!
Iteration 19, float result for x=0.00000000 : 0.000000
Underflow occurred in double_testk!
Iteration 20, float result for x=0.00000000 : 0.000000
Underflow occurred in double_testk!
Iteration 21, float result for x=0.00000000 : 0.000000
Underflow occurred in double_testk!
Invalid exception occurred in double_testk!
Iteration 22, float result for x=0.00000000 : -nan
Underflow occurred in double_testk!
Invalid exception occurred in double_testk!
I want to know what happens at the transition from underflow to NaN. Because underflow means that the number is too small to be stored in the memory.
But if the number is already too small, what is the goal of NaN?
Because underflow means that the number is too small to be stored in the memory.
Not quite; in floating-point, underflow means a result is below the range at which numbers can be represented with full precision. The result may still be fairly accurate.
As long as x is at least 2−75, x * x produces a non-zero result. It may be in the subnormal part of the floating-point domain, where the precision is declining, but the real-number result of x•x is big enough to round to 2−149 or greater. Then, for these small x, (1 - cosf(x)) / (x * x) evaluates to zero divided by a non-zero value, so the result is zero.
When x is less than 2−75, then x * x produces zero, because the real-number result of x•x is so small that, in floating-point arithmetic, it is rounded to zero. Then (1 - cosf(x)) / (x * x) evaluates to zero divided by zero, so the result is a NaN. This is what happens in your iteration 22.
(2−149 is the smallest positive value representable in IEEE-754 binary32, which your C implementation likely uses for float. Real-number results between that and 2−150 will round up to 2−149. Lower results will round down to 0. Assuming the rounding mode is round-to-nearest, ties-to-even.)
NaN is a concept defined in IEEE 754 standard for floating-point arithmetic, not being a number is not the same as negative infinity or positive infinity, NaN is used for arithmetic values that cannot be represented, not because they are too small or too large but simply because they don't exist. Examples:
1/0 = ∞ //too large
log (0) = -∞ //too small
sqrt (-1) = NaN //is not a number, can't be calculated
IEEE 754 floating point numbers can represent positive or negative infinity, and NaN (not a number). These three values arise from calculations whose result is undefined or cannot be represented accurately. You can also deliberately set a floating-point variable to any of them, which is sometimes useful. Some examples of calculations that produce infinity or NaN:
The goal of these flags you are using is to be compiant with the mentioned standard. It specifies five arithmetic exceptions that are to be recorded in the status flags:
FE_INEXACT: Inexact result, rounding was necessary to store the result of an earlier floating-point operation.
Set if the rounded (and returned) value is different from the mathematically exact result of the operation.
FE_UNDERFLOW: The result of an earlier floating-point operation was subnormal with a loss of precision.
Set if the rounded value is tiny (as specified in IEEE 754) and inexact (or maybe limited to if it has denormalization loss, as per the 1984 version of IEEE 754), returning a subnormal value including the zeros.
FE_OVERFLOW: The result of an earlier floating-point operation was too large to be representable.
Set if the absolute value of the rounded value is too large to be represented. An infinity or maximal finite value is returned, depending on which rounding is used.
FE_DIVBYZERO: Pole error occurred in an earlier floating-point operation.
Set if the result is infinite given finite operands, returning an infinity, either +∞ or −∞.
FE_INVALID: Domain error occurred in an earlier floating-point operation.
Set if a real-valued result cannot be returned e.g. sqrt(−1) or 0/0, returning a quiet NaN.*
*Concept of quiet NaN:
Quiet NaNs, or qNaNs, do not raise any additional exceptions as they propagate through most operations. The exceptions are where the NaN cannot simply be passed through unchanged to the output, such as in format conversions or certain comparison operations.
Sources:
Floating point environment
Floating-point arithmetic
20.5.2 Infinity and NaN
NaN
The problem: given a floating point constant expression, can we write a macro that evaluates to a constant expression whose value is a power of two equal to the most significant place of the significand? Equivalently, this is just the greatest power of two less than or equal to the input in magnitude.
For the purposes of this question we can ignore:
Near-overflow or near-underflow values (they can be handled with finitely many applications of ?: to rescale).
Negative inputs (they can be handled likewise).
Non-Annex-F-conforming implementations (can't really do anything useful in floating point with them).
Weirdness around excess precision (float_t and double_t can be used with FLT_EVAL_METHOD and other float.h macros to handle it safely).
So it suffices to solve the problem for positive values bounded away from infinity and the denormal range.
Note that this problem is equivalent to finding the "epsilon" for a specific value, that is, nextafter(x,INF)-x (or the equivalent in float or long double), with the result just scaled by DBL_EPSILON (or equivalent for the type). Solutions that find that are perfectly acceptable if they're simpler.
I have a proposed solution I'm posting as a self-answer, but I'm not sure if it's correct.
If you can assume IEEE 754 binary64 format and semantics (and in particular that arithmetic operations are correctly rounded), and a round-ties-to-even rounding mode, then it's a nice fact that for any not-too-small not-too-large positive finite double value x, the next representable value up from x is always given by x / 0x1.fffffffffffffp-1 (where 0x1.fffffffffffffp-1 is just 1.0 - 0.5 * DBL_EPSILON spelled out as a hex literal).
So we can get the most significant bit that you ask for simply from:
(x / 0x1.fffffffffffffp-1 - x) * 0x1.0p+52
And of course there are analogous results for float, assuming IEEE 754 binary32 format and semantics.
In fact, the only normal positive value that this fails for is DBL_MAX, where the result of the division overflows to infinity.
To show that the division trick works, it's enough to prove it for x in the range 1.0 <= x < 2.0; it's easy to show that for any x in this range, the value of x / 0x1.fffffffffffffp-1 - x (where / represents mathematical division in this case) lies in the half-open interval (2^-53, 2^52], and it follows that under round-ties-to-even (or in fact any round-to-nearest rounding mode), x / 0x1.fffffffffffffp-1 rounds up to the next representable value.
Similarly, under the same assumptions, x * 0x1.fffffffffffffp-1 is always the next representable value down from x.
Here is code for finding the ULP. It was inspired by algorithm 3.5 in Accurate floating-Point Summation by Siegfriend M. Rump, Takeshi Ogita, and Shin’ichi Oishi (which calculates 2⌈log2 |p|⌉):
double ULP(double q)
{
// SmallestPositive is the smallest positive floating-point number.
static const double SmallestPositive = DBL_EPSILON * DBL_MIN;
/* Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
something in [.75 ULP, 1.5 ULP) (even with rounding).
*/
static const double Scale = 0.75 * DBL_EPSILON;
q = fabs(q);
// Handle denormals, and get the lowest normal exponent as a bonus.
if (q < 2*DBL_MIN)
return SmallestPositive;
/* Subtract from q something more than .5 ULP but less than 1.5 ULP. That
must produce q - 1 ULP. Then subtract that from q, and we get 1 ULP.
The significand 1 is of particular interest. We subtract .75 ULP from
q, which is midway between the greatest two floating-point numbers less
than q. Since we round to even, the lesser one is selected, which is
less than q by 1 ULP of q, although 2 ULP of itself.
*/
return q - (q - q * Scale);
}
The fabs and if can be replaced with ?:.
For reference, the 2⌈log2 |p|⌉ algorithm is:
q = p / FLT_EPSILON
L = |(q+p) - q|
if L = 0
L = |p|
For the sake of example, assume the type is float and let x be the input. Initially I will write this as a sequence of statements for readability, but they can be translated directly into macros that produce constant expressions.
float y = x*(1+FLT_EPSILON)-x;
if (y/FLT_EPSILON > x) y/=2;
If we could ensure rounding-down, the initial value of y should be exactly what we want. However, if the top two bits of x are 1 and any lower bits are set, or if we hit a rounds-to-even case, x*(1+FLT_EPSILON) could exceed x by 2 units in the last place instead of just 1. I don't believe any other cases are possible, and I believe the second line accounts fully for this one.
Written as macros:
#define PRE_ULP(x) ((x)*(1+FLT_EPSILON)-(x))
#define ULP(x) ((PRE_ULP(x)/FLT_EPSILON>(x) ? PRE_ULP(x)/2 : PRE_ULP(x))
#define MSB_VAL(x) (ULP(x)/FLT_EPSILON)
I have a program example:
int main()
{
double x;
x=-0.000000;
if(x<0)
{
printf("x is less");
}
else
{
printf("x is greater");
}
}
Why does the control goes in the first statement - x is less . What is -0.000000?
IEEE 754 defines a standard floating point numbers, which is very commonly used. You can see its structure here:
Finite numbers, which may be either base 2 (binary) or base 10
(decimal). Each finite number is described by three integers: s = a
sign (zero or one), c = a significand (or 'coefficient'), q = an
exponent. The numerical value of a finite number is
(−1)^s × c × bq
where b is the base (2 or 10). For example, if the sign is 1
(indicating negative), the significand is 12345, the exponent is −3,
and the base is 10, then the value of the number is −12.345.
So if the fraction is 0, and the sign is 0, you have +0.0.
And if the fraction is 0, and the sign is 1, you have -0.0.
The numbers have the same value, but they differ in the positive/negative check. This means, for instance, that if:
x = +0.0;
y = -0.0;
Then you should see:
(x -y) == 0
However, for x, the OP's code would go with "x is greater", while for y, it would go with "x is less".
Edit: Artur's answer and Jeffrey Sax's comment to this answer clarify that the difference in the test for x < 0 in the OP's question is actually a compiler optimization, and that actually the test for x < 0 for both positive and negative 0 should always be false.
Negative zero is still zero, so +0 == -0 and -0 < +0 is false. They are two representations of the same value. There are only a few operations for which it makes a difference:
1 / -0 = -infinity, while 1 / +0 = +infinity.
sqrt(-0) = -0, while sqrt(+0) = +0
Negative zero can be created in a few different ways:
Dividing a positive number by -infinity, or a negative number by +infinity.
An operation that produces an underflow on a negative number.
This may seem rather obscure, but there is a good reason for this, mainly to do with making mathematical expressions involving complex numbers consistent. For example, note that the identity 1/√(-z)==-1/√z is not correct unless you define the square root as I did above.
If you want to know more details, try and find William Kahan's Branch Cuts for Complex Elementary Functions, or Much Ado About Nothing's Sign Bit in The State of the Art in Numerical Analysis (1987).
Nathan is right but there is one issue though. Usually most of float/double operations are performed by coprocessor. However some compilers try to be clever and instead of letting coprocessor do the comparison (it treats -0.0 and +0.0 the same as 0.0) just assume that since your x variable has minus sign it means that it should be treated as negative and optimize your code.
If you would be able to see how assembly output looks like - I bet you'll only see call to:
printf("x is less");
So it is optimization stuff (bad optimization).
BTW - VC 2008 produces correct output here regardless of optimization level set.
For example - VC optimizes (at full/max optimization level) the code leaving this only:
printf("x is grater");
I like my compiler more every day ;-)
In the sqrt function of most languages (though here I'm mostly interested in C and Haskell), are there any guarantees that the square root of a perfect square will be returned exactly? For example, if I do sqrt(81.0) == 9.0, is that safe or is there a chance that sqrt will return 8.999999998 or 9.00000003?
If numerical precision is not guaranteed, what would be the preferred way to check that a number is a perfect square? Take the square root, get the floor and the ceiling and make sure they square back to the original number?
Thank you!
In IEEE 754 floating-point, if the double-precision value x is the square of a nonnegative representable number y (i.e. y*y == x and the computation of y*y does not involve any rounding, overflow, or underflow), then sqrt(x) will return y.
This is all because sqrt is required to be correctly-rounded by the IEEE 754 standard. That is, sqrt(x), for any x, will be the closest double to the actual square root of x. That sqrt works for perfect squares is a simple corollary of this fact.
If you want to check whether a floating-point number is a perfect square, here's the simplest code I can think of:
int issquare(double d) {
if (signbit(d)) return false;
feclearexcept(FE_INEXACT);
double dd = sqrt(d);
asm volatile("" : "+x"(dd));
return !fetestexcept(FE_INEXACT);
}
I need the empty asm volatile block that depends on dd because otherwise your compiler might be clever and "optimise" away the calculation of dd.
I used a couple of weird functions from fenv.h, namely feclearexcept and fetestexcept. It's probably a good idea to look at their man pages.
Another strategy that you might be able to make work is to compute the square root, check whether it has set bits in the low 26 bits of the mantissa, and complain if it does. I try this approach below.
And I needed to check whether d is zero because otherwise it can return true for -0.0.
EDIT: Eric Postpischil suggested that hacking around with the mantissa might be better. Given that the above issquare doesn't work in another popular compiler, clang, I tend to agree. I think the following code works:
int _issquare2(double d) {
if (signbit(d)) return 0;
int foo;
double s = sqrt(d);
double a = frexp(s, &foo);
frexp(d, &foo);
if (foo & 1) {
return (a + 33554432.0) - 33554432.0 == a && s*s == d;
} else {
return (a + 67108864.0) - 67108864.0 == a;
}
}
Adding and subtracting 67108864.0 from a has the effect of wiping the low 26 bits of the mantissa. We will get a back exactly when those bits were clear in the first place.
According to this paper, which discusses proving the correctness of IEEE floating-point square root:
The IEEE-754 Standard for Binary Floating-Point
Arithmetic [1] requires that the result of a divide or square
root operation be calculated as if in infinite precision, and
then rounded to one of the two nearest floating-point
numbers of the specified precision that surround the
infinitely precise result
Since a perfect square that can be represented exactly in floating-point is an integer and its square root is an integer that can be precisely represented, the square root of a perfect square should always be exactly correct.
Of course, there's no guarantee that your code will execute with a conforming IEEE floating-point library.
#tmyklebu perfectly answered the question. As a complement, let's see a possibly less efficient alternative for testing perfect square of fractions without asm directive.
Let's suppose we have an IEEE 754 compliant sqrt which rounds the result correctly.
Let's suppose exceptional values (Inf/Nan) and zeros (+/-) are already handled.
Let's decompose sqrt(x) into I*2^m where I is an odd integer.
And where I spans n bits: 1+2^(n-1) <= I < 2^n.
If n > 1+floor(p/2) where p is floating point precision (e.g. p=53 and n>27 in double precision)
Then 2^(2n-2) < I^2 < 2^2n.
As I is odd, I^2 is odd too and thus spans over > p bits.
Thus I is not the exact square root of any representable floating point with this precision.
But given I^2<2^p, could we say that x was a perfect square?
The answer is obviously no. A taylor expansion would give
sqrt(I^2+e)=I*(1+e/2I - e^2/4I^2 + O(e^3/I^3))
Thus, for e=ulp(I^2) up to sqrt(ulp(I^2)) the square root is correctly rounded to rsqrt(I^2+e)=I... (round to nearest even or truncate or floor mode).
Thus we would have to assert that sqrt(x)*sqrt(x) == x.
But above test is not sufficient, for example, assuming IEEE 754 double precision, sqrt(1.0e200)*sqrt(1.0e200)=1.0e200, where 1.0e200 is exactly 99999999999999996973312221251036165947450327545502362648241750950346848435554075534196338404706251868027512415973882408182135734368278484639385041047239877871023591066789981811181813306167128854888448 whose first prime factor is 2^613, hardly a perfect square of any fraction...
So we can combine both tests:
#include <float.h>
bool is_perfect_square(double x) {
return sqrt(x)*sqrt(x) == x
&& squared_significand_fits_in_precision(sqrt(x));
}
bool squared_significand_fits_in_precision(double x) {
double scaled=scalb( x , DBL_MANT_DIG/2-ilogb(x));
return scaled == floor(scaled)
&& (scalb(scaled,-1)==floor(scalb(scaled,-1)) /* scaled is even */
|| scaled < scalb( sqrt((double) FLT_RADIX) , DBL_MANT_DIG/2 + 1));
}
EDIT:
If we want to restrict to the case of integers, we can also check that floor(sqrt(x))==sqrt(x) or use dirty bit hacks in squared_significand_fits_in_precision...
Instead of doing sqrt(81.0) == 9.0, try 9.0*9.0 == 81.0. This will always work as long as the square is within the limits of the floating point magnitude.
Edit: I was probably unclear about what I meant by "floating point magnitude". What I mean is to keep the number within the range of integer values that can be held without precision loss, less than 2**53 for a IEEE double. I also expected that there would be a separate operation to make sure the square root was an integer.
double root = floor(sqrt(x) + 0.5); /* rounded result to nearest integer */
if (root*root == x && x < 9007199254740992.0)
/* it's a perfect square */