sqrt, perfect squares and floating point errors - c

In the sqrt function of most languages (though here I'm mostly interested in C and Haskell), are there any guarantees that the square root of a perfect square will be returned exactly? For example, if I do sqrt(81.0) == 9.0, is that safe or is there a chance that sqrt will return 8.999999998 or 9.00000003?
If numerical precision is not guaranteed, what would be the preferred way to check that a number is a perfect square? Take the square root, get the floor and the ceiling and make sure they square back to the original number?
Thank you!

In IEEE 754 floating-point, if the double-precision value x is the square of a nonnegative representable number y (i.e. y*y == x and the computation of y*y does not involve any rounding, overflow, or underflow), then sqrt(x) will return y.
This is all because sqrt is required to be correctly-rounded by the IEEE 754 standard. That is, sqrt(x), for any x, will be the closest double to the actual square root of x. That sqrt works for perfect squares is a simple corollary of this fact.
If you want to check whether a floating-point number is a perfect square, here's the simplest code I can think of:
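#include <fenv.h>
#include <math.h>
#include <stdbool.h>

int issquare(double d) {
    if (signbit(d)) return false;
    feclearexcept(FE_INEXACT);
    double dd = sqrt(d);
    /* Empty asm that "uses" dd, so the sqrt call cannot be optimised away. */
    asm volatile("" : "+x"(dd));
    return !fetestexcept(FE_INEXACT);
}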
I need the empty asm volatile block that depends on dd because otherwise your compiler might be clever and "optimise" away the calculation of dd.
I used a couple of weird functions from fenv.h, namely feclearexcept and fetestexcept. It's probably a good idea to look at their man pages.
Another strategy that you might be able to make work is to compute the square root, check whether it has set bits in the low 26 bits of the mantissa, and complain if it does. I try this approach below.
I also needed the signbit check because otherwise the function can return true for -0.0.
EDIT: Eric Postpischil suggested that hacking around with the mantissa might be better. Given that the above issquare doesn't work in another popular compiler, clang, I tend to agree. I think the following code works:
int _issquare2(double d) {
if (signbit(d)) return 0;
int foo;
double s = sqrt(d);
double a = frexp(s, &foo);
frexp(d, &foo);
if (foo & 1) {
return (a + 33554432.0) - 33554432.0 == a && s*s == d;
} else {
return (a + 67108864.0) - 67108864.0 == a;
}
}
Adding and subtracting 67108864.0 from a has the effect of wiping the low 26 bits of the mantissa. We will get a back exactly when those bits were clear in the first place.
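The add-and-subtract trick can be tried in isolation. This is a small sketch, assuming IEEE 754 doubles, round-to-nearest, and a value a in [0.5, 1) as returned by frexp; survives_add_subtract is a hypothetical helper name.

```c
/* Returns 1 if adding and then subtracting 2^26 gives a back unchanged.
   For a in [0.5, 1) the sum lies in [2^26, 2^27), where doubles are
   2^-26 apart, so every bit of a below 2^-26 is rounded away. */
int survives_add_subtract(double a) {
    /* volatile forces the intermediate sum to be stored as a double */
    volatile double shifted = a + 67108864.0;
    return (shifted - 67108864.0) == a;
}
```

A value such as 0.75 comes back unchanged, while 0.5 + 2^-40 has a low bit set and does not.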

According to this paper, which discusses proving the correctness of IEEE floating-point square root:
The IEEE-754 Standard for Binary Floating-Point Arithmetic [1] requires that the result of a divide or square root operation be calculated as if in infinite precision, and then rounded to one of the two nearest floating-point numbers of the specified precision that surround the infinitely precise result
Since a perfect square that can be represented exactly in floating-point is an integer and its square root is an integer that can be precisely represented, the square root of a perfect square should always be exactly correct.
Of course, there's no guarantee that your code will execute with a conforming IEEE floating-point library.

@tmyklebu perfectly answered the question. As a complement, let's see a possibly less efficient alternative for testing whether a fraction is a perfect square, without an asm directive.
Let's suppose we have an IEEE 754 compliant sqrt which rounds the result correctly.
Let's suppose exceptional values (Inf/Nan) and zeros (+/-) are already handled.
Let's decompose sqrt(x) into I*2^m where I is an odd integer.
And where I spans n bits: 1+2^(n-1) <= I < 2^n.
If n > 1+floor(p/2) where p is floating point precision (e.g. p=53 and n>27 in double precision)
Then 2^(2n-2) < I^2 < 2^2n.
As I is odd, I^2 is odd too and thus spans over > p bits.
Thus I is not the exact square root of any representable floating point with this precision.
But given I^2<2^p, could we say that x was a perfect square?
The answer is obviously no. A Taylor expansion would give
sqrt(I^2+e)=I*(1+e/2I - e^2/4I^2 + O(e^3/I^3))
Thus, for e=ulp(I^2) up to sqrt(ulp(I^2)) the square root is correctly rounded to rsqrt(I^2+e)=I... (round to nearest even or truncate or floor mode).
Thus we would have to assert that sqrt(x)*sqrt(x) == x.
But above test is not sufficient, for example, assuming IEEE 754 double precision, sqrt(1.0e200)*sqrt(1.0e200)=1.0e200, where 1.0e200 is exactly 99999999999999996973312221251036165947450327545502362648241750950346848435554075534196338404706251868027512415973882408182135734368278484639385041047239877871023591066789981811181813306167128854888448 whose first prime factor is 2^613, hardly a perfect square of any fraction...
So we can combine both tests:
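#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Declared first because is_perfect_square calls it before its definition. */
bool squared_significand_fits_in_precision(double x);

bool is_perfect_square(double x) {
    return sqrt(x)*sqrt(x) == x
        && squared_significand_fits_in_precision(sqrt(x));
}
bool squared_significand_fits_in_precision(double x) {
    /* scalbn is the standard C99 counterpart of the obsolete scalb */
    double scaled = scalbn(x, DBL_MANT_DIG/2 - ilogb(x));
    return scaled == floor(scaled)
        && (scalbn(scaled, -1) == floor(scalbn(scaled, -1)) /* scaled is even */
            || scaled < scalbn(sqrt((double) FLT_RADIX), DBL_MANT_DIG/2 + 1));
}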
EDIT:
If we want to restrict to the case of integers, we can also check that floor(sqrt(x))==sqrt(x) or use dirty bit hacks in squared_significand_fits_in_precision...

Instead of doing sqrt(81.0) == 9.0, try 9.0*9.0 == 81.0. This will always work as long as the square is within the limits of the floating point magnitude.
Edit: I was probably unclear about what I meant by "floating point magnitude". What I mean is to keep the number within the range of integer values that can be held without precision loss, less than 2**53 for an IEEE double. I also expected that there would be a separate operation to make sure the square root was an integer.
double root = floor(sqrt(x) + 0.5); /* rounded result to nearest integer */
if (root*root == x && x < 9007199254740992.0)
/* it's a perfect square */

Related

Is there a correct constant-expression, in terms of a float, for its msb?

The problem: given a floating point constant expression, can we write a macro that evaluates to a constant expression whose value is a power of two equal to the most significant place of the significand? Equivalently, this is just the greatest power of two less than or equal to the input in magnitude.
For the purposes of this question we can ignore:
Near-overflow or near-underflow values (they can be handled with finitely many applications of ?: to rescale).
Negative inputs (they can be handled likewise).
Non-Annex-F-conforming implementations (can't really do anything useful in floating point with them).
Weirdness around excess precision (float_t and double_t can be used with FLT_EVAL_METHOD and other float.h macros to handle it safely).
So it suffices to solve the problem for positive values bounded away from infinity and the denormal range.
Note that this problem is equivalent to finding the "epsilon" for a specific value, that is, nextafter(x,INF)-x (or the equivalent in float or long double), with the result just scaled by DBL_EPSILON (or equivalent for the type). Solutions that find that are perfectly acceptable if they're simpler.
I have a proposed solution I'm posting as a self-answer, but I'm not sure if it's correct.
If you can assume IEEE 754 binary64 format and semantics (and in particular that arithmetic operations are correctly rounded), and a round-ties-to-even rounding mode, then it's a nice fact that for any not-too-small not-too-large positive finite double value x, the next representable value up from x is always given by x / 0x1.fffffffffffffp-1 (where 0x1.fffffffffffffp-1 is just 1.0 - 0.5 * DBL_EPSILON spelled out as a hex literal).
So we can get the most significant bit that you ask for simply from:
(x / 0x1.fffffffffffffp-1 - x) * 0x1.0p+52
And of course there are analogous results for float, assuming IEEE 754 binary32 format and semantics.
In fact, the only normal positive value that this fails for is DBL_MAX, where the result of the division overflows to infinity.
To show that the division trick works, it's enough to prove it for x in the range 1.0 <= x < 2.0; it's easy to show that for any x in this range, the value of x / 0x1.fffffffffffffp-1 - x (where / represents mathematical division in this case) lies in the half-open interval (2^-53, 2^-52]. Adjacent doubles in this range are 2^-52 apart, so the exact quotient is more than half a step and at most one full step above x, and it follows that under round-ties-to-even (or in fact any round-to-nearest rounding mode), x / 0x1.fffffffffffffp-1 rounds up to the next representable value.
Similarly, under the same assumptions, x * 0x1.fffffffffffffp-1 is always the next representable value down from x.
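The division and multiplication tricks can be written down directly. This is a sketch under the same assumptions as the answer (IEEE 754 binary64, round-to-nearest, x positive, normal, and not DBL_MAX); the macro and function names are invented for illustration.

```c
/* 1 - DBL_EPSILON/2, the largest double below 1.0 */
#define JUST_BELOW_ONE 0x1.fffffffffffffp-1

/* Next representable value above x. */
double next_up_by_division(double x) {
    return x / JUST_BELOW_ONE;
}

/* Next representable value below x. */
double next_down_by_multiplication(double x) {
    return x * JUST_BELOW_ONE;
}

/* Greatest power of two less than or equal to x:
   the gap to the next double, scaled up by 2^52. */
double msb_value(double x) {
    return (x / JUST_BELOW_ONE - x) * 0x1.0p+52;
}
```

For example, msb_value(1000.0) yields 512.0, since 1000 lies in [512, 1024).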
Here is code for finding the ULP. It was inspired by algorithm 3.5 in Accurate Floating-Point Summation by Siegfried M. Rump, Takeshi Ogita, and Shin’ichi Oishi (which calculates 2^⌈log2 |p|⌉):
double ULP(double q)
{
// SmallestPositive is the smallest positive floating-point number.
static const double SmallestPositive = DBL_EPSILON * DBL_MIN;
/* Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
something in [.75 ULP, 1.5 ULP) (even with rounding).
*/
static const double Scale = 0.75 * DBL_EPSILON;
q = fabs(q);
// Handle denormals, and get the lowest normal exponent as a bonus.
if (q < 2*DBL_MIN)
return SmallestPositive;
/* Subtract from q something more than .5 ULP but less than 1.5 ULP. That
must produce q - 1 ULP. Then subtract that from q, and we get 1 ULP.
The significand 1 is of particular interest. We subtract .75 ULP from
q, which is midway between the greatest two floating-point numbers less
than q. Since we round to even, the lesser one is selected, which is
less than q by 1 ULP of q, although 2 ULP of itself.
*/
return q - (q - q * Scale);
}
The fabs and if can be replaced with ?:.
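One possible way to do that replacement is sketched below. ABS_D and ULP_EXPR are hypothetical names, the macro evaluates its argument several times, and it assumes IEEE 754 doubles with round-to-nearest-even, as the function above does.

```c
#include <float.h>

/* Absolute value without calling fabs. */
#define ABS_D(q) ((q) < 0 ? -(q) : (q))

/* Same computation as the ULP function, written as a single expression.
   DBL_EPSILON * DBL_MIN is the smallest positive double;
   0.75 * DBL_EPSILON is the .75 ULP scale factor. */
#define ULP_EXPR(q) \
    (ABS_D(q) < 2 * DBL_MIN \
        ? DBL_EPSILON * DBL_MIN \
        : ABS_D(q) - (ABS_D(q) - ABS_D(q) * (0.75 * DBL_EPSILON)))

/* With a constant argument the macro can initialise a static object. */
static const double ulp_of_one = ULP_EXPR(1.0);
```

Because the argument is repeated, it should be free of side effects; something like ULP_EXPR(x++) would misbehave.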
For reference, the 2^⌈log2 |p|⌉ algorithm is:
q = p / FLT_EPSILON
L = |(q+p) - q|
if L = 0
L = |p|
For the sake of example, assume the type is float and let x be the input. Initially I will write this as a sequence of statements for readability, but they can be translated directly into macros that produce constant expressions.
float y = x*(1+FLT_EPSILON)-x;
if (y/FLT_EPSILON > x) y/=2;
If we could ensure rounding-down, the initial value of y should be exactly what we want. However, if the top two bits of x are 1 and any lower bits are set, or if we hit a rounds-to-even case, x*(1+FLT_EPSILON) could exceed x by 2 units in the last place instead of just 1. I don't believe any other cases are possible, and I believe the second line accounts fully for this one.
Written as macros:
#define PRE_ULP(x) ((x)*(1+FLT_EPSILON)-(x))
#define ULP(x) (PRE_ULP(x)/FLT_EPSILON>(x) ? PRE_ULP(x)/2 : PRE_ULP(x))
#define MSB_VAL(x) (ULP(x)/FLT_EPSILON)

C IEEE-Floats inf equal inf

In C, on an implementation with IEEE-754 floats, when I compare two floating point numbers which are NaN, it returns 0 or "false". But why do two floating point numbers which both are inf count as equal?
This program prints "equal: ..." (at least under Linux AMD64 with gcc) and in my opinion it should print "different: ...".
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
volatile double a = 1e200; //use volatile to suppress compiler warnings
volatile double b = 3e200;
volatile double c = 1e200;
double resA = a * c; //resA and resB should be inf
double resB = b * c;
if (resA == resB)
{
printf("equal: %e * %e = %e = %e = %e * %e\n",a,c,resA,resB,b,c);
}
else
{
printf("different: %e * %e = %e != %e = %e * %e\n", a, c, resA, resB, b, c);
}
return EXIT_SUCCESS;
}
Another example of why I think inf is not the same as inf: the number of natural numbers and the number of real numbers are both infinite, but not the same.
So why is inf == inf?
Infinities compare equal because that's what the standard says. From section 5.11 Details of comparison predicates:
Infinite operands of the same sign shall compare equal.
inf==inf for the same reason that almost all floating point numbers compare equal to themselves: Because they're equal. They contain the same sign, exponent, and mantissa.
You might be thinking of how NaN != NaN. But that's a relatively unimportant consequence of a much more important invariant: NaN != x for any x. As the name implies, NaN is not any number at all, and hence cannot compare equal to anything, because the comparison in question is a numeric one (hence why -0 == +0).
It would certainly make some amount of sense to have inf compare unequal to other infs, since in a mathematical context they're almost certainly unequal. But keep in mind that floating point equality is not the same thing as absolute mathematical equality; 0.1 + 0.2 != 0.3, and 1e100 + 1.0 == 1e100. Just as floating point numbers gradually underflow into denormals without compromising as-good-as-possible equality, so they overflow into infinity without compromising as-good-as-possible equality.
If you want inf != inf, you can emulate it: 1e400 == 3e400 evaluates to true, but 1e400 - 3e400 == 0 evaluates to false, because the result of +inf + -inf is NaN. (Arguably you could say it should evaluate to 0, but that would serve nobody's interest.)
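The behaviour described in this answer can be reproduced with a few helpers. This is a sketch assuming IEEE 754 doubles; the function names are invented here, and the volatile copies only serve to make the overflow happen at run time.

```c
#include <math.h>

/* Multiplies at run time, so overflow yields inf without a compile-time warning. */
double product(double a, double b) {
    volatile double va = a, vb = b;
    return va * vb;
}

/* Plain floating-point equality. */
int compare_equal(double a, double b) {
    return a == b;
}

/* Stricter notion of equality: the difference must be exactly zero.
   For two infinities of the same sign the difference is NaN, so this is 0. */
int difference_is_zero(double a, double b) {
    return a - b == 0.0;
}

/* Returns 1 if a - b is NaN, as happens for inf - inf. */
int difference_is_nan(double a, double b) {
    return isnan(a - b);
}
```

With these, product(1e200, 1e200) and product(3e200, 1e200) compare equal, yet difference_is_zero reports them as different.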
Background
In C, according to the IEEE 754 binary floating point standard (so, if you use a float or a double) you're going to get an exact value that can be compared exactly with another variable of the same type. Well, this is true unless your computations result in a value that lies outside the range of integers that can be represented (i.e., overflow).
Why is Infinity == Infinity
resA and resB
The IEEE-754 standard tailored the values of infinity and negative infinity to be greater than or less than, respectively, all other values that may be represented according to the standard (<= INFINITY == 0 11111111111 0000000000000000000000000000000000000000000000000000 and >= -INFINITY == 1 11111111111 0000000000000000000000000000000000000000000000000000), except for NaN, which is neither less than, equal to, nor greater than any floating point value (even itself). Take note that infinity and its negative have explicit definitions in their sign, exponent, and mantissa bits.
So, resA and resB are infinity and since infinity is explicitly defined and reproducible, resA==resB. I'm fairly certain this is how isinf() is implemented.
Why is NaN != NaN
However, NaN is not explicitly defined. A NaN value has a sign bit of either value, exponent bits of all 1s (just like infinity and its negative), and any set of non-zero fraction bits (Source). So, how would you tell one NaN from another, if their fraction bits are arbitrary anyway? Well, the standard doesn't try to, and a comparison simply returns false when two floating point values of this structure are compared to one another.
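The bit layouts described above can be inspected directly. The following sketch assumes a 64-bit IEEE 754 double; all function names are made up for illustration.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Copies the 64 bits of a double into an integer (assumes binary64). */
uint64_t double_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* The 11 exponent bits, shifted down to the low end. */
unsigned exponent_field(double d) {
    return (unsigned)((double_bits(d) >> 52) & 0x7FF);
}

/* The 52 fraction bits. */
uint64_t fraction_field(double d) {
    return double_bits(d) & 0xFFFFFFFFFFFFFULL;
}

/* Infinity: exponent all ones, fraction zero. */
int is_infinity_pattern(double d) {
    return exponent_field(d) == 0x7FF && fraction_field(d) == 0;
}

/* NaN: exponent all ones, fraction nonzero. */
int is_nan_pattern(double d) {
    return exponent_field(d) == 0x7FF && fraction_field(d) != 0;
}
```

Passing INFINITY and NAN from math.h shows the single fixed pattern for infinity versus the many possible patterns for NaN.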
More Explanation
Because infinity is an explicitly defined value (Source, GNU C Manual):
Infinities propagate through calculations as one would expect
2 + ∞ = ∞
4 ÷ ∞ = 0
arctan (∞) = π/2.
However, a NaN may or may not propagate quietly through computations. When it does, it is a QNaN (quiet NaN, most significant fraction bit set) and computations on it result in NaN. When it doesn't, it is an SNaN (signaling NaN, most significant fraction bit not set) and computations on it raise the invalid-operation exception.
There are many arithmetic systems. Some of them, including the ones normally covered in high school mathematics, such as the real numbers, do not have infinity as a number. Others have a single infinity, for example the projectively extended real line. Others, such as the IEEE floating point arithmetic under discussion, and the extended real line, have both positive and negative infinity.
IEEE754 arithmetic is different from real number arithmetic in many ways, but is a useful approximation for many purposes.
There is logic to the different treatment of NaNs and infinities. It is entirely reasonable to say that positive infinity is greater than negative infinity and any finite number. It would not be reasonable to say anything similar about the square root of -1.

Double value as a negative zero

I have a program example:
int main()
{
double x;
x=-0.000000;
if(x<0)
{
printf("x is less");
}
else
{
printf("x is greater");
}
}
Why does control go into the first branch, "x is less"? What is -0.000000?
IEEE 754 defines a standard for floating-point numbers, which is very commonly used. You can see its structure here:
Finite numbers, which may be either base 2 (binary) or base 10
(decimal). Each finite number is described by three integers: s = a
sign (zero or one), c = a significand (or 'coefficient'), q = an
exponent. The numerical value of a finite number is
(−1)^s × c × b^q
where b is the base (2 or 10). For example, if the sign is 1
(indicating negative), the significand is 12345, the exponent is −3,
and the base is 10, then the value of the number is −12.345.
So if the fraction is 0, and the sign is 0, you have +0.0.
And if the fraction is 0, and the sign is 1, you have -0.0.
The numbers have the same value, but they differ in the positive/negative check. This means, for instance, that if:
x = +0.0;
y = -0.0;
Then you should see:
(x -y) == 0
However, for x, the OP's code would go with "x is greater", while for y, it would go with "x is less".
Edit: Artur's answer and Jeffrey Sax's comment to this answer clarify that the difference in the test for x < 0 in the OP's question is actually a compiler optimization, and that actually the test for x < 0 for both positive and negative 0 should always be false.
Negative zero is still zero, so +0 == -0 and -0 < +0 is false. They are two representations of the same value. There are only a few operations for which it makes a difference:
1 / -0 = -infinity, while 1 / +0 = +infinity.
sqrt(-0) = -0, while sqrt(+0) = +0
Negative zero can be created in a few different ways:
Dividing a positive number by -infinity, or a negative number by +infinity.
An operation that produces an underflow on a negative number.
This may seem rather obscure, but there is a good reason for this, mainly to do with making mathematical expressions involving complex numbers consistent. For example, note that the identity 1/√(-z)==-1/√z is not correct unless you define the square root as I did above.
If you want to know more details, try and find William Kahan's Branch Cuts for Complex Elementary Functions, or Much Ado About Nothing's Sign Bit in The State of the Art in Numerical Analysis (1987).
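The properties listed in this answer are easy to check. Here is a small sketch, assuming IEEE 754 arithmetic with default (non-trapping) exception handling; the helper names are invented, and the volatile copies keep the compiler from folding the expressions.

```c
#include <math.h>

/* -0.0 compares equal to +0.0 and is not less than it. */
int zeros_compare_equal(void) {
    volatile double nz = -0.0;
    return nz == 0.0 && !(nz < 0.0);
}

/* Returns 1 if the sign bit of z is set, which distinguishes -0.0 from +0.0. */
int sign_of_zero(double z) {
    return signbit(z) != 0;
}

/* 1/z evaluated at run time; the sign of a zero picks the sign of the infinity. */
double reciprocal(double z) {
    volatile double vz = z;
    return 1.0 / vz;
}

/* A negative number divided by +infinity produces -0.0. */
double negative_over_infinity(void) {
    volatile double inf = INFINITY;
    return -1.0 / inf;
}
```

So the comparison x < 0 is false for both zeros, and only signbit or a division reveals which zero is present.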
Nathan is right, but there is one issue. Usually most float/double operations are performed by the coprocessor. However, some compilers try to be clever: instead of letting the coprocessor do the comparison (it treats -0.0 and +0.0 the same as 0.0), they assume that since your x variable has a minus sign it should be treated as negative, and optimize your code accordingly.
If you could see what the assembly output looks like, I bet you'd only see a call to:
printf("x is less");
So it is optimization stuff (bad optimization).
BTW - VC 2008 produces correct output here regardless of optimization level set.
For example - VC optimizes (at full/max optimization level) the code leaving this only:
printf("x is greater");
I like my compiler more every day ;-)

How do I specify in which direction to round the average of two floats that differ by the LSB of their significand?

I'm working on a Nelder-Mead optimization routine in C that involves taking the average of two floats. In rare (but perfectly reproducible) circumstances, the two floats, say x and y, differ only by the least significant bit of their significand. When the average is taken, rounding errors imply that the result will be either x or y.
I'd like to specify that rounding should always be towards the second float. That is, I cannot simply specify that rounding should be towards zero, or infinity, because I do not know in advance whether x will be larger than y.
(How) can I do that?
I don't think there's a hardware rounding mode for that. You have to write your own function, then,
double average(double x, double y) {
double a = 0.5*(x+y);
return (a == x) ? y : a;
}
You could recognize the special case and pick the value you would like to return.
The values that are of interest are:
When the values have the same sign and exponent and differ only by one in the mantissa.
When the values have the same sign, the exponents differ by one, and the one with the larger exponent has a mantissa of 0 and the other a mantissa filled with ones.
In fact, if you are using IEEE-754 numbers (which you probably are) you can perform both tests at once (after checking for things like Zero, Inf, and Nan):
if ( repr1 + 1 == repr2
|| repr2 + 1 == repr1)
....
The reason for this is that the exponent is placed right next to the mantissa, and if the mantissa is all ones, the add will continue up into the exponent field.
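The repr1/repr2 test can be made concrete as follows. This is a sketch assuming 64-bit IEEE 754 doubles and inputs that are finite, nonzero, and of the same sign; bits_of and differ_by_one_ulp are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets the bits of a double as an unsigned integer. */
static uint64_t bits_of(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Returns 1 if a and b are neighbouring doubles.
   Works across an exponent boundary because a carry out of an
   all-ones mantissa increments the adjacent exponent field. */
int differ_by_one_ulp(double a, double b) {
    uint64_t ra = bits_of(a), rb = bits_of(b);
    return ra + 1 == rb || rb + 1 == ra;
}
```

For instance, 1.0 and the largest double below 1.0 have different exponents, yet their representations differ by exactly one.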
However, talking about this, I would suggest another strategy. Instead of simply returning the second number, you could check the second least significant bit and decide whether to round up or down. That way the rounding errors would be evenly distributed.

Comparing floating point numbers in C

I've got a double that prints as 0.000000 and I'm trying to compare it to 0.0f, unsuccessfully. Why is there a difference here? What's the most reliable way to determine if your double is zero?
To determine whether it's close enough to zero that it will print as 0.000000 to six decimal places, something like:
fabs(d) < 0.0000005
Dealing with small inaccuracies in floating-point calculations can get quite complicated in general, though.
If you want a better idea what value you've got, try printing with %g instead of %f.
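To see the difference between "prints as 0.000000" and "is zero", something like the following sketch may help. The helper names are made up; it relies only on snprintf, strcmp, and fabs from the standard library.

```c
#include <math.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if d is shown as zero when printed with %f (six decimals). */
int prints_as_zero(double d) {
    char buf[64];
    snprintf(buf, sizeof buf, "%f", d); /* output is truncated safely if too long */
    return strcmp(buf, "0.000000") == 0 || strcmp(buf, "-0.000000") == 0;
}

/* The threshold test suggested above. */
int near_zero(double d) {
    return fabs(d) < 0.0000005;
}
```

A value such as 1e-9 prints as 0.000000 and passes near_zero, but it still compares unequal to 0.0f.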
You can do a range check, like -0.00001 <= x && x <= 0.00001 (in C the two comparisons must be joined with &&; they cannot be chained).
This is a fundamental problem with floating point arithmetic on modern computers. They are by nature imprecise, and cannot be reliably compared. For example, the language ML explicitly disallows equality comparison on real types because it was considered too unsafe. See also the excellent (if a bit long and mathematically oriented) paper by David Goldberg on this topic.
Edit: tl;dr: you might be doing it wrong.
Also, one often overlooked feature of floating point numbers is the denormalized numbers.
These are numbers which have the minimal exponent, yet don't fit in the 0.5-1 range.
Those numbers are smaller than FLT_MIN for float, and DBL_MIN for double.
A common mistake when using a threshold is to compare two values directly, or to use FLT_MIN/DBL_MIN as the limit.
For example, this would lead to an illogical result (if you don't know about denormals):
bool areDifferent(float a, float b) {
if (a == b) return false; // Or also: if ((a - b) == FLT_MIN)
return true;
}
// What is the output of areDifferent(val, val + FLT_MIN * 0.5f) ?
// true, not false, even if adding half the "minimum value".
Denormals also usually imply a performance loss in computation.
Yet you cannot simply disable them, or else code like this could still produce a DIVIDE BY ZERO floating point exception (if enabled):
float getInverse(float a, float b) {
if (a != b)
return 1.0f / (a-b); // With denormals disabled, a != b can be true while (a - b) is a denormal; it will be flushed to 0 and raise the exception
return FLT_MAX;
}
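The existence of values below FLT_MIN can be demonstrated with a short sketch. It assumes IEEE 754 floats with gradual underflow enabled (no flush-to-zero mode); is_subnormal and difference are hypothetical helper names.

```c
#include <float.h>
#include <math.h>

/* Returns 1 if f is a denormal (subnormal) number. */
int is_subnormal(float f) {
    return fpclassify(f) == FP_SUBNORMAL;
}

/* Subtracts at run time; volatile keeps the compiler from folding it away. */
float difference(float a, float b) {
    volatile float va = a, vb = b;
    return va - vb;
}
```

Here difference(1.5f * FLT_MIN, FLT_MIN) is nonzero but smaller than FLT_MIN, which is why a threshold of FLT_MIN does not separate "equal" from "different".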
