Double value as a negative zero - c

I have a program example:
#include <stdio.h>

int main(void)
{
    double x;
    x = -0.000000;
    if (x < 0)
    {
        printf("x is less");
    }
    else
    {
        printf("x is greater");
    }
    return 0;
}
Why does control go to the first branch ("x is less")? What is -0.000000?

IEEE 754 defines a standard for floating-point numbers, which is very commonly used. Its structure is described as follows:
Finite numbers, which may be either base 2 (binary) or base 10
(decimal). Each finite number is described by three integers: s = a
sign (zero or one), c = a significand (or 'coefficient'), q = an
exponent. The numerical value of a finite number is
(−1)^s × c × b^q
where b is the base (2 or 10). For example, if the sign is 1
(indicating negative), the significand is 12345, the exponent is −3,
and the base is 10, then the value of the number is −12.345.
So if the fraction is 0, and the sign is 0, you have +0.0.
And if the fraction is 0, and the sign is 1, you have -0.0.
The numbers have the same value, but they differ in the positive/negative check. This means, for instance, that if:
x = +0.0;
y = -0.0;
Then you should see:
(x -y) == 0
However, for x, the OP's code would go with "x is greater", while for y, it would go with "x is less".
Edit: Artur's answer and Jeffrey Sax's comment to this answer clarify that the difference in the test for x < 0 in the OP's question is actually a compiler optimization, and that actually the test for x < 0 for both positive and negative 0 should always be false.

Negative zero is still zero, so +0 == -0 and -0 < +0 is false. They are two representations of the same value. There are only a few operations for which it makes a difference:
1 / -0 = -infinity, while 1 / +0 = +infinity.
sqrt(-0) = -0, while sqrt(+0) = +0
Negative zero can be created in a few different ways:
Dividing a positive number by -infinity, or a negative number by +infinity.
An operation that produces an underflow on a negative number.
This may seem rather obscure, but there is a good reason for this, mainly to do with making mathematical expressions involving complex numbers consistent. For example, note that the identity 1/√(-z)==-1/√z is not correct unless you define the square root as I did above.
If you want to know more details, try and find William Kahan's Branch Cuts for Complex Elementary Functions, or Much Ado About Nothing's Sign Bit in The State of the Art in Numerical Analysis (1987).
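The points above can be checked with a minimal C sketch, assuming an IEEE 754 implementation. The helper names is_negative_zero and reciprocal are illustrative only and do not come from any library:

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true only for a zero whose sign bit is set.
   The comparison x == 0.0 accepts both zeros; signbit tells them apart. */
bool is_negative_zero(double x)
{
    return x == 0.0 && signbit(x);
}

/* Hypothetical helper: under IEEE 754 semantics, 1/x is +infinity
   for +0.0 and -infinity for -0.0. */
double reciprocal(double x)
{
    return 1.0 / x;
}
```

With these, is_negative_zero(-0.0) is true while -0.0 == 0.0 is also true and -0.0 < 0.0 is false; reciprocal(-0.0) is negative infinity.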

Nathan is right, but there is one issue. Most float/double operations are usually performed by the floating-point coprocessor. However, some compilers try to be clever: instead of letting the coprocessor do the comparison (which treats -0.0 and +0.0 as the same value), they assume that because your x variable has a minus sign it should be treated as negative, and optimize your code accordingly.
If you look at the assembly output, I bet you'll see only a call to:
printf("x is less");
So it is optimization stuff (bad optimization).
BTW - VC 2008 produces correct output here regardless of optimization level set.
For example - VC optimizes (at full/max optimization level) the code leaving this only:
printf("x is greater");
I like my compiler more every day ;-)

Related

Is there a correct constant-expression, in terms of a float, for its msb?

The problem: given a floating point constant expression, can we write a macro that evaluates to a constant expression whose value is a power of two equal to the most significant place of the significand? Equivalently, this is just the greatest power of two less than or equal to the input in magnitude.
For the purposes of this question we can ignore:
Near-overflow or near-underflow values (they can be handled with finitely many applications of ?: to rescale).
Negative inputs (they can be handled likewise).
Non-Annex-F-conforming implementations (can't really do anything useful in floating point with them).
Weirdness around excess precision (float_t and double_t can be used with FLT_EVAL_METHOD and other float.h macros to handle it safely).
So it suffices to solve the problem for positive values bounded away from infinity and the denormal range.
Note that this problem is equivalent to finding the "epsilon" for a specific value, that is, nextafter(x,INF)-x (or the equivalent in float or long double), with the result just scaled by DBL_EPSILON (or equivalent for the type). Solutions that find that are perfectly acceptable if they're simpler.
I have a proposed solution I'm posting as a self-answer, but I'm not sure if it's correct.
If you can assume IEEE 754 binary64 format and semantics (and in particular that arithmetic operations are correctly rounded), and a round-ties-to-even rounding mode, then it's a nice fact that for any not-too-small not-too-large positive finite double value x, the next representable value up from x is always given by x / 0x1.fffffffffffffp-1 (where 0x1.fffffffffffffp-1 is just 1.0 - 0.5 * DBL_EPSILON spelled out as a hex literal).
So we can get the most significant bit that you ask for simply from:
(x / 0x1.fffffffffffffp-1 - x) * 0x1.0p+52
And of course there are analogous results for float, assuming IEEE 754 binary32 format and semantics.
In fact, the only normal positive value that this fails for is DBL_MAX, where the result of the division overflows to infinity.
To show that the division trick works, it's enough to prove it for x in the range 1.0 <= x < 2.0; it's easy to show that for any x in this range, the value of x / 0x1.fffffffffffffp-1 - x (where / represents mathematical division in this case) lies in the half-open interval (2^-53, 2^-52], and it follows that under round-ties-to-even (or in fact any round-to-nearest rounding mode), x / 0x1.fffffffffffffp-1 rounds up to the next representable value.
Similarly, under the same assumptions, x * 0x1.fffffffffffffp-1 is always the next representable value down from x.
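As a hedged sketch of the trick described above, assuming IEEE 754 binary64 with correctly rounded round-to-nearest arithmetic and inputs away from the overflow and subnormal ranges. The names BELOW_ONE, next_up, next_down and msb_value are made up for this example:

```c
/* 1 - DBL_EPSILON/2 written as a hex literal: the largest double below 1.0. */
#define BELOW_ONE 0x1.fffffffffffffp-1

/* Next representable double above a positive, normal, non-huge x. */
double next_up(double x)
{
    return x / BELOW_ONE;
}

/* Next representable double below a positive, normal x. */
double next_down(double x)
{
    return x * BELOW_ONE;
}

/* Greatest power of two <= x: one ULP of x scaled up by 2^52. */
double msb_value(double x)
{
    return (x / BELOW_ONE - x) * 0x1.0p+52;
}
```

For instance, msb_value(5.0) gives 4.0 and msb_value(0.3) gives 0.25, since 5.0 lies in [4, 8) and 0.3 lies in [0.25, 0.5).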
Here is code for finding the ULP. It was inspired by algorithm 3.5 in Accurate Floating-Point Summation by Siegfried M. Rump, Takeshi Ogita, and Shin’ichi Oishi (which calculates 2^⌈log2 |p|⌉):
#include <float.h>
#include <math.h>

double ULP(double q)
{
    // SmallestPositive is the smallest positive floating-point number.
    static const double SmallestPositive = DBL_EPSILON * DBL_MIN;
    /* Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
       something in [.75 ULP, 1.5 ULP) (even with rounding).
    */
    static const double Scale = 0.75 * DBL_EPSILON;
    q = fabs(q);
    // Handle denormals, and get the lowest normal exponent as a bonus.
    if (q < 2*DBL_MIN)
        return SmallestPositive;
    /* Subtract from q something more than .5 ULP but less than 1.5 ULP. That
       must produce q - 1 ULP. Then subtract that from q, and we get 1 ULP.
       The significand 1 is of particular interest. We subtract .75 ULP from
       q, which is midway between the greatest two floating-point numbers less
       than q. Since we round to even, the lesser one is selected, which is
       less than q by 1 ULP of q, although 2 ULP of itself.
    */
    return q - (q - q * Scale);
}
The fabs and if can be replaced with ?:.
For reference, the 2^⌈log2 |p|⌉ algorithm is:
q = p / FLT_EPSILON
L = |(q+p) - q|
if L = 0
L = |p|
For the sake of example, assume the type is float and let x be the input. Initially I will write this as a sequence of statements for readability, but they can be translated directly into macros that produce constant expressions.
float y = x*(1+FLT_EPSILON)-x;
if (y/FLT_EPSILON > x) y/=2;
If we could ensure rounding-down, the initial value of y should be exactly what we want. However, if the top two bits of x are 1 and any lower bits are set, or if we hit a rounds-to-even case, x*(1+FLT_EPSILON) could exceed x by 2 units in the last place instead of just 1. I don't believe any other cases are possible, and I believe the second line accounts fully for this one.
Written as macros:
#define PRE_ULP(x) ((x)*(1+FLT_EPSILON)-(x))
#define ULP(x) (PRE_ULP(x)/FLT_EPSILON>(x) ? PRE_ULP(x)/2 : PRE_ULP(x))
#define MSB_VAL(x) (ULP(x)/FLT_EPSILON)
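One way to check that the macros really expand to constant expressions is to use them in file-scope initializers, which C requires to be constant. The following is only a sketch: it restates the macros with balanced parentheses, the variable names msb_of_5 and msb_of_7 are invented for the example, and it assumes float arithmetic is evaluated in float precision (FLT_EVAL_METHOD equal to 0):

```c
#include <float.h>

#define PRE_ULP(x) ((x)*(1+FLT_EPSILON)-(x))
#define ULP(x) (PRE_ULP(x)/FLT_EPSILON>(x) ? PRE_ULP(x)/2 : PRE_ULP(x))
#define MSB_VAL(x) (ULP(x)/FLT_EPSILON)

/* File-scope initializers must be constant expressions, so these lines
   compile only if the macros expand to constant expressions. */
static const float msb_of_5 = MSB_VAL(5.0f);  /* 5 lies in [4, 8) */
static const float msb_of_7 = MSB_VAL(7.0f);  /* exercises the halving branch */
```

Under the stated assumption both values should come out as 4.0f; for 7.0f the first step overshoots by two units in the last place and the conditional halves it.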

Prevent overflow/underflow in float division

I have two numbers:
FL_64 variable_number;
FL_64 constant_number;
The constant number is always the same, for example:
constant_number=(FL_64)0.0000176019966602325;
The variable number is given to me and I need to perform the division:
FL_64 result = variable_number/constant_number;
What would be the checks I need to do to variable_number in order to make sure the operation will not overflow / underflow before performing it?
Edit: FL_64 is just a typedef for double so FL_64 = double.
A Test For Overflow
Assume:
The C implementation uses IEEE-754 arithmetic with round-to-nearest-ties-to-even.
The magnitude of the divisor is at most 1, and the divisor is non-zero.
The divisor is positive.
The test and the proof below are written with the above assumptions for simplicity, but the general cases are easily handled:
If the divisor might be negative, use fabs(divisor) in place of divisor when calculating the limit shown below.
If the divisor is zero, there is no need to test for overflow, as it is already known an error (divide-by-zero) occurs.
If the magnitude exceeds 1, the division never creates a new overflow. Overflow occurs only if the dividend is already infinity (so a test would be isinf(candidate)). (With a divisor exceeding 1 in magnitude, the division could underflow. This answer does not discuss testing for underflow in that case.)
Note about notation: Expressions using non-code-format operators, such as x•y, represent exact mathematical expressions, without floating-point rounding. Expressions in code format, such as x*y, mean the computed results with floating-point rounding.
To detect overflow when dividing by divisor, we can use:
FL_64 limit = DBL_MAX * divisor;
if (-limit <= candidate && candidate <= limit)
// Overflow will not occur.
else
// Overflow will occur or candidate or divisor is NaN.
Proof:
limit will equal DBL_MAX multiplied by divisor and rounded to the nearest representable value. This is exactly DBL_MAX•divisor•(1+e) for some error e such that −2^−53 ≤ e ≤ 2^−53, by the properties of rounding to nearest plus the fact that no representable value for divisor can, when multiplied by DBL_MAX, produce a value below the normal range. (In the subnormal range, the relative error due to rounding could be greater than 2^−53. Since the product remains in the normal range, that does not occur.)
However, e = 2^−53 can occur only if the exact mathematical value of DBL_MAX•divisor falls exactly midway between two representable values, thus requiring it to have 54 significant bits (the bit that is ½ of the lowest position of the 53-bit significand of representable values is the 54th bit, counting from the leading bit). We know the significand of DBL_MAX is 1fffffffffffff₁₆ (53 bits). Multiplying it by odd numbers produces 1fffffffffffff₁₆ (when multiplied by 1), 5ffffffffffffd₁₆ (by 3), and 9ffffffffffffb₁₆ (by 5), and numbers with more significant bits when multiplied by greater odd numbers. Note that 5ffffffffffffd₁₆ has 55 significant bits. None of these has exactly 54 significant bits. When multiplied by even numbers, the product has trailing zeros, so the number of significant bits is the same as when multiplying by the odd number that results from dividing the even number by the greatest power of two that divides it. Therefore, no product of DBL_MAX is exactly midway between two representable values, so the error e is never exactly 2^−53. So −2^−53 < e < 2^−53.
So, limit = DBL_MAX•divisor•(1+e), where |e| < 2^−53. Therefore limit/divisor is DBL_MAX•(1+e). Since this result is less than ½ ULP from DBL_MAX, it never rounds up to infinity, so it never overflows. So dividing any candidate that is less than or equal to limit by divisor does not overflow.
Now we will consider candidates exceeding limit. As with the upper bound, e cannot equal −2^−53, for the same reason. Then the least e can be is −2^−53 + 2^−105, because the product of DBL_MAX and divisor has at most 106 significant bits, so any increase from the midpoint between two representable values must be by at least one part in 2^−105. Then, if limit < candidate, candidate is at least one part in 2^−52 greater than limit, since there are 53 bits in a significand. So DBL_MAX•divisor•(1−2^−53+2^−105)•(1+2^−52) < candidate. Then candidate/divisor is at least DBL_MAX•(1−2^−53+2^−105)•(1+2^−52), which is DBL_MAX•(1+2^−53+2^−157). This exceeds the midpoint between DBL_MAX and what would be the next representable value if the exponent range were unbounded, which is the basis for the IEEE-754 rounding criterion. Therefore, it rounds up to infinity, so overflow occurs.
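The overflow test above can be wrapped in a small function. This is a sketch under the same assumptions as the proof (IEEE-754 binary64, round-to-nearest-ties-to-even, divisor non-zero with magnitude at most 1); the function name division_cannot_overflow is hypothetical:

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Returns true when candidate / divisor is guaranteed not to overflow.
   Returns false when it would overflow or when either argument is NaN.
   Assumes 0 < |divisor| <= 1, as in the proof. */
bool division_cannot_overflow(double candidate, double divisor)
{
    /* fabs handles a negative divisor, as suggested for the general case. */
    double limit = DBL_MAX * fabs(divisor);
    return -limit <= candidate && candidate <= limit;
}
```

For the constant in the question, division_cannot_overflow(x, 0.0000176019966602325) accepts every x up to DBL_MAX * 0.0000176019966602325 in magnitude and rejects anything larger, including DBL_MAX itself.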
Underflow
Dividing by a number with magnitude less than one of course makes a number larger in magnitude, so it never underflows to zero. However, the IEEE-754 definition of underflow is that a non-zero result is tiny (in the subnormal range), either before or after rounding (whether to use before or after is implementation-defined). It is of course possible that dividing a subnormal number by a divisor less than one will produce a result still in the subnormal range. However, for this to happen, underflow must have occurred previously, to get the subnormal dividend in the first place. Therefore, underflow will never be introduced by a division by a number with magnitude less than one.
If one does wish to test for this underflow, one might proceed similarly to the test for overflow, by comparing the candidate to the minimum normal (or the greatest subnormal) multiplied by divisor, but I have not yet worked through the numerical properties.
Assuming FL_64 is something like a double you can get the maximum value which is named DBL_MAX from float.h
So you want to make sure that
DBL_MAX >= variable_number/constant_number
or equally
DBL_MAX * constant_number >= variable_number
In code that could be something like
if (constant_number > 0.0 && constant_number < 1.0)
{
if (DBL_MAX * constant_number >= variable_number)
{
// won't overflow
}
else
{
// will overflow
}
}
else
{
// add code for other ranges of constant_number
}
However, notice that floating point calculations are imprecise, so there may be corner cases where the above code will fail.
I'm going to attempt to answer the question you asked (instead of trying to answer a different "How to detect overflow or underflow that was not prevented" question that you didn't ask).
To prevent overflow and underflow for division during the design of software:
Determine the range of the numerator and find the values with the largest and smallest absolute magnitude
Determine the range of the divisor and find the values with the largest and smallest absolute magnitude
Make sure that the maximum representable value of the data type (e.g. FLT_MAX) multiplied by the smallest absolute magnitude of the range of divisors is larger than the largest absolute magnitude of the range of numerators (the worst case for overflow is the largest numerator divided by the smallest divisor).
Make sure that the minimum representable value of the data type (e.g. FLT_MIN) multiplied by the largest absolute magnitude of the range of divisors is smaller than the smallest absolute magnitude of the range of numerators (the worst case for underflow is the smallest numerator divided by the largest divisor).
Note that the last few steps may need to be repeated for each possible data type until you've found the "best" (smallest) data type that prevents overflow and underflow (e.g. you might check if float satisfies the last 2 steps and find that it doesn't, then check if double satisfies the last 2 steps and find that it does).
It's also possible that you find out that no data type is able to prevent overflow and underflow, and that you have to limit the range of values that could be used for numerator or divisor, or rearrange formulas (e.g. change a (c*a)/b into a (c/b)*a) or switch to a different representation ("double double", rational numbers, ...).
Also; be aware that this provides a guarantee that (for all combinations of values within your ranges) overflow and underflow will be prevented; but doesn't guarantee that the smallest data type will be chosen if there's some kind of relationship between the magnitudes of the numerators and divisors. For a simple example, if you're doing something like b = a*a+1; result = b/a; where the magnitude of the numerator depends on the magnitude of the divisor, then you'll never get the "largest numerator with smallest divisor" or "smallest numerator with largest divisor" cases and a smaller data type (that can't handle cases that won't exist) may be suitable.
Note that you can also do checks before each individual division. This tends to make performance worse (due to the branches/checks) while causing code duplication (e.g. providing alternative code that uses double for cases when float would've caused overflow or underflow); and can't work when the largest type supported isn't large enough (you end up with an } else { // Now what??? problem that can't be solved in a way that ensures values that should work do work because typically the only thing you can do is treat it as an error condition).
I don't know what standard your FL_64 adheres to, but if it's anything like IEEE 754, you'll want to watch out for
Not a Number
There might be a special NaN value. Under IEEE 754, comparing it for equality with anything, including itself, yields 0, so if (variable_number == variable_number) == 0, then that's what's going on. C99 provides the isnan macro in math.h to check for this, and implementations such as the GNU C Library offer further helpers.
Infinity
IEEE 754 also supports infinity (and negative infinity). This can be the result of an overflow, for instance. If variable_number is infinite and you divide it by constant_number, the result will probably be infinite again. As with NaN, the implementation usually supplies macros or functions to test for this, otherwise you could try dividing the number by something and see if it got any smaller.
Overflow
Since dividing the number by constant_number will make it bigger, the variable_number could overflow if it is already enormous. Check if it's not so big that this can happen. But depending on what your task is, the possibility of it being this large might already be excluded. The 64 bit floats in IEEE 754 go up to about 10^308. If your number overflows, it might turn into infinity.
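The NaN and infinity checks described above might look like this in C99. It is a minimal sketch: the helper name is_ordinary_number is invented here, while isinf is the standard macro from math.h:

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: reject NaN and infinities before dividing. */
bool is_ordinary_number(double v)
{
    if (v != v)      /* only NaN compares unequal to itself */
        return false;
    if (isinf(v))    /* true for both +infinity and -infinity */
        return false;
    return true;
}
```

The standard isfinite macro performs both checks at once; the two-step form is spelled out only to mirror the explanation.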
I personally don't know the FL_64 variable type, from the name I suppose it has a 64 bit representation, but is it signed or unsigned?
Anyway, I would see a potential problem only if the type is signed; otherwise both the quotient and remainder would be representable in the same number of bits.
In case of signed, you need to check the result sign:
FL_64 result = variable_number/constant_number;
if ((variable_number > 0 && constant_number > 0) || (variable_number < 0 && constant_number < 0)) {
if (result < 0) {
//OVER/UNDER FLOW
printf("over/under flow");
} else {
//NO OVER/UNDER FLOW
printf("no over/under flow");
}
} else {
if (result < 0) {
//NO OVER/UNDER FLOW
printf("no over/under flow");
} else {
//OVER/UNDER FLOW
printf("over/under flow");
}
}
Also other cases should be checked, like division by 0. But as you mentioned constant_number is always fixed and different from 0.
EDIT:
Ok, so there could be another way to check overflow by using the DBL_MAX value. Given the maximum representable double, you can multiply it by the constant_number and compute the maximum value for the variable_number. From the code snippet below, you can see that the first case does not cause overflow, while the second does (since the variable_number is a larger number compared to the test). From the console output in fact you can see that the first value result is higher than the second one, even if this should actually be the double of the previous one. So this case is an overflow case.
#include <stdio.h>
#include <float.h>
typedef double FL_64;
int main() {
FL_64 constant_number = (FL_64)0.0000176019966602325;
FL_64 test = DBL_MAX * constant_number;
FL_64 variable_number = test;
FL_64 result;
printf("MAX double value:\n%f\n\n", DBL_MAX);
printf("Variable Number value:\n%f\n\n", variable_number);
printf(variable_number > test ? "Overflow case\n\n" : "No overflow\n\n");
result = variable_number / constant_number;
printf("Result: %f\n\n", result);
variable_number *= 2;
printf("Variable Number value:\n%f\n\n", variable_number);
printf(variable_number > test ? "Overflow case\n\n" : "No overflow\n\n");
result = variable_number / constant_number;
printf("Result:\n%f\n\n", result);
return 0;
}
This is a case-specific solution, since you have a constant number. It will not work in the general case.

C IEEE-Floats inf equal inf

In C, on an implementation with IEEE-754 floats, when I compare two floating point numbers which are NaN, it returns 0 or "false". But why do two floating point numbers which both are inf count as equal?
This program prints "equal: ..." (at least under Linux AMD64 with gcc) and in my opinion it should print "different: ...".
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
volatile double a = 1e200; //use volatile to suppress compiler warnings
volatile double b = 3e200;
volatile double c = 1e200;
double resA = a * c; //resA and resB should be inf
double resB = b * c;
if (resA == resB)
{
printf("equal: %e * %e = %e = %e = %e * %e\n",a,c,resA,resB,b,c);
}
else
{
printf("different: %e * %e = %e != %e = %e * %e\n", a, c, resA, resB, b, c);
}
return EXIT_SUCCESS;
}
Another example of why I think inf is not the same as inf: the sets of natural numbers and real numbers are both infinite, but they are not the same size.
So why is inf == inf?
Infinities compare equal because that's what the standard says. From section 5.11 Details of comparison predicates:
Infinite operands of the same sign shall compare equal.
inf==inf for the same reason that almost all floating point numbers compare equal to themselves: Because they're equal. They contain the same sign, exponent, and mantissa.
You might be thinking of how NaN != NaN. But that's a relatively unimportant consequence of a much more important invariant: NaN != x for any x. As the name implies, NaN is not any number at all, and hence cannot compare equal to anything, because the comparison in question is a numeric one (hence why -0 == +0).
It would certainly make some amount of sense to have inf compare unequal to other infs, since in a mathematical context they're almost certainly unequal. But keep in mind that floating point equality is not the same thing as absolute mathematical equality; in double arithmetic 0.1 + 0.2 != 0.3, and 1e100 + 1.0 == 1e100. Just as floating point numbers gradually underflow into denormals without compromising as-good-as-possible equality, so they overflow into infinity without compromising as-good-as-possible equality.
If you want inf != inf, you can emulate it: 1e400 == 3e400 evaluates to true, but 1e400 - 3e400 == 0 evaluates to false, because the result of +inf + -inf is NaN. (Arguably you could say it should evaluate to 0, but that would serve nobody's interest.)
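A short sketch of both behaviours, assuming IEEE 754 doubles. The function names are made up for the example, and volatile is used so that each product is actually computed and rounded to infinity at run time before it is compared or subtracted:

```c
#include <math.h>
#include <stdbool.h>

/* Both products overflow to +infinity, and same-sign infinities compare equal. */
bool overflowed_products_compare_equal(void)
{
    volatile double a = 1e200, b = 3e200, c = 1e200;
    volatile double resA = a * c;
    volatile double resB = b * c;
    return resA == resB;
}

/* Subtracting the two infinities gives NaN, which is not equal to 0. */
bool difference_of_infinities_is_nan(void)
{
    volatile double a = 1e200, b = 3e200, c = 1e200;
    volatile double resA = a * c;
    volatile double resB = b * c;
    return isnan(resA - resB);
}
```

Both functions are expected to return true on such an implementation, which matches the "equal" output reported in the question.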
Background
In C, according to the IEEE 754 binary floating point standard (so, if you use a float or a double) you're going to get an exact value that can be compared exactly with another variable of the same type. Well, this is true unless your computations result in a value that lies outside the range of integers that can be represented (i.e., overflow).
Why is Infinity == Infinity
resA and resB
The IEEE-754 standard tailored the values of infinity and negative infinity to be greater than or less than, respectively, all other values that may be represented according to the standard (<= INFINITY == 0 11111111111 0000000000000000000000000000000000000000000000000000 and >= -INFINITY == 1 11111111111 0000000000000000000000000000000000000000000000000000), except for NaN, which is neither less than, equal to, nor greater than any floating point value (even itself). Take note that infinity and its negative have explicit definitions in their sign, exponent, and mantissa bits.
So, resA and resB are infinity and since infinity is explicitly defined and reproducible, resA==resB. I'm fairly certain this is how isinf() is implemented.
Why is NaN != NaN
However, NaN is not explicitly defined. A NaN value has a sign bit of either 0 or 1, exponent bits of all 1s (just like infinity and its negative), and any set of non-zero fraction bits (Source). So, how would you tell one NaN from another, if their fraction bits are arbitrary anyway? Well, the standard doesn't try to, and simply returns false when two floating point values of this structure are compared to one another.
More Explanation
Because infinity is an explicitly defined value (Source, GNU C Manual):
Infinities propagate through calculations as one would expect
2 + ∞ = ∞
4 ÷ ∞ = 0
arctan (∞) = π/2.
However, NaN may or may not propagate through computations. When it does, it is a QNaN (quiet NaN, most significant fraction bit set) and all computations will result in NaN. When it doesn't, it is an SNaN (signalling NaN, most significant fraction bit not set) and all computations will result in an error.
There are many arithmetic systems. Some of them, including the ones normally covered in high school mathematics, such as the real numbers, do not have infinity as a number. Others have a single infinity, for example the projectively extended real line. Others, such as the IEEE floating point arithmetic under discussion, and the extended real line, have both positive and negative infinity.
IEEE754 arithmetic is different from real number arithmetic in many ways, but is a useful approximation for many purposes.
There is logic to the different treatment of NaNs and infinities. It is entirely reasonable to say that positive infinity is greater than negative infinity and any finite number. It would not be reasonable to say anything similar about the square root of -1.

sqrt, perfect squares and floating point errors

In the sqrt function of most languages (though here I'm mostly interested in C and Haskell), are there any guarantees that the square root of a perfect square will be returned exactly? For example, if I do sqrt(81.0) == 9.0, is that safe or is there a chance that sqrt will return 8.999999998 or 9.00000003?
If numerical precision is not guaranteed, what would be the preferred way to check that a number is a perfect square? Take the square root, get the floor and the ceiling and make sure they square back to the original number?
Thank you!
In IEEE 754 floating-point, if the double-precision value x is the square of a nonnegative representable number y (i.e. y*y == x and the computation of y*y does not involve any rounding, overflow, or underflow), then sqrt(x) will return y.
This is all because sqrt is required to be correctly-rounded by the IEEE 754 standard. That is, sqrt(x), for any x, will be the closest double to the actual square root of x. That sqrt works for perfect squares is a simple corollary of this fact.
If you want to check whether a floating-point number is a perfect square, here's the simplest code I can think of:
int issquare(double d) {
if (signbit(d)) return 0;
feclearexcept(FE_INEXACT);
double dd = sqrt(d);
asm volatile("" : "+x"(dd));
return !fetestexcept(FE_INEXACT);
}
I need the empty asm volatile block that depends on dd because otherwise your compiler might be clever and "optimise" away the calculation of dd.
I used a couple of weird functions from fenv.h, namely feclearexcept and fetestexcept. It's probably a good idea to look at their man pages.
Another strategy that you might be able to make work is to compute the square root, check whether it has set bits in the low 26 bits of the mantissa, and complain if it does. I try this approach below.
And I needed to check whether d is zero because otherwise it can return true for -0.0.
EDIT: Eric Postpischil suggested that hacking around with the mantissa might be better. Given that the above issquare doesn't work in another popular compiler, clang, I tend to agree. I think the following code works:
int _issquare2(double d) {
if (signbit(d)) return 0;
int foo;
double s = sqrt(d);
double a = frexp(s, &foo);
frexp(d, &foo);
if (foo & 1) {
return (a + 33554432.0) - 33554432.0 == a && s*s == d;
} else {
return (a + 67108864.0) - 67108864.0 == a;
}
}
Adding and subtracting 67108864.0 from a has the effect of wiping the low 26 bits of the mantissa. We will get a back exactly when those bits were clear in the first place.
According to this paper, which discusses proving the correctness of IEEE floating-point square root:
The IEEE-754 Standard for Binary Floating-Point
Arithmetic [1] requires that the result of a divide or square
root operation be calculated as if in infinite precision, and
then rounded to one of the two nearest floating-point
numbers of the specified precision that surround the
infinitely precise result
Since a perfect square that can be represented exactly in floating-point is an integer and its square root is an integer that can be precisely represented, the square root of a perfect square should always be exactly correct.
Of course, there's no guarantee that your code will execute with a conforming IEEE floating-point library.
@tmyklebu perfectly answered the question. As a complement, let's see a possibly less efficient alternative for testing perfect squares of fractions without the asm directive.
Let's suppose we have an IEEE 754 compliant sqrt which rounds the result correctly.
Let's suppose exceptional values (Inf/Nan) and zeros (+/-) are already handled.
Let's decompose sqrt(x) into I*2^m where I is an odd integer.
And where I spans n bits: 1+2^(n-1) <= I < 2^n.
If n > 1+floor(p/2) where p is floating point precision (e.g. p=53 and n>27 in double precision)
Then 2^(2n-2) < I^2 < 2^2n.
As I is odd, I^2 is odd too and thus spans over > p bits.
Thus I is not the exact square root of any representable floating point with this precision.
But given I^2<2^p, could we say that x was a perfect square?
The answer is obviously no. A Taylor expansion would give
sqrt(I^2+e)=I*(1+e/2I - e^2/4I^2 + O(e^3/I^3))
Thus, for e=ulp(I^2) up to sqrt(ulp(I^2)) the square root is correctly rounded to rsqrt(I^2+e)=I... (round to nearest even or truncate or floor mode).
Thus we would have to assert that sqrt(x)*sqrt(x) == x.
But above test is not sufficient, for example, assuming IEEE 754 double precision, sqrt(1.0e200)*sqrt(1.0e200)=1.0e200, where 1.0e200 is exactly 99999999999999996973312221251036165947450327545502362648241750950346848435554075534196338404706251868027512415973882408182135734368278484639385041047239877871023591066789981811181813306167128854888448 whose first prime factor is 2^613, hardly a perfect square of any fraction...
So we can combine both tests:
#include <float.h>
#include <math.h>
#include <stdbool.h>

bool squared_significand_fits_in_precision(double x);

bool is_perfect_square(double x) {
    return sqrt(x)*sqrt(x) == x
        && squared_significand_fits_in_precision(sqrt(x));
}

bool squared_significand_fits_in_precision(double x) {
    double scaled = scalb(x, DBL_MANT_DIG/2 - ilogb(x));
    return scaled == floor(scaled)
        && (scalb(scaled, -1) == floor(scalb(scaled, -1)) /* scaled is even */
            || scaled < scalb(sqrt((double) FLT_RADIX), DBL_MANT_DIG/2 + 1));
}
EDIT:
If we want to restrict to the case of integers, we can also check that floor(sqrt(x))==sqrt(x) or use dirty bit hacks in squared_significand_fits_in_precision...
Instead of doing sqrt(81.0) == 9.0, try 9.0*9.0 == 81.0. This will always work as long as the square is within the limits of the floating point magnitude.
Edit: I was probably unclear about what I meant by "floating point magnitude". What I mean is to keep the number within the range of integer values that can be held without precision loss, less than 2**53 for an IEEE double. I also expected that there would be a separate operation to make sure the square root was an integer.
double root = floor(sqrt(x) + 0.5); /* rounded result to nearest integer */
if (root*root == x && x < 9007199254740992.0)
/* it's a perfect square */

How do I specify in which direction to round the average of two floats that differ by the LSB of their significand?

I'm working on a Nelder-Mead optimization routine in C that involves taking the average of two floats. In rare (but perfectly reproducible) circumstances, the two floats, say x and y, differ only by the least significant bit of their significand. When the average is taken, rounding errors imply that the result will be either x or y.
I'd like to specify that rounding should always be towards the second float. That is, I cannot simply specify that rounding should be towards zero, or infinity, because I do not know in advance whether x will be larger than y.
(How) can I do that?
I don't think there's a hardware rounding mode for that. You have to write your own function, then,
double average(double x, double y) {
double a = 0.5*(x+y);
return (a == x) ? y : a;
}
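To see the function above in action on two adjacent doubles, here is a self-contained sketch. It repeats average so that it compiles on its own; picks_second is a hypothetical helper, and round-to-nearest-even double arithmetic is assumed:

```c
#include <stdbool.h>

/* Average that falls back to the second argument when rounding
   would otherwise return the first. */
double average(double x, double y)
{
    double a = 0.5 * (x + y);
    return (a == x) ? y : a;
}

/* Hypothetical check: does the result land on the second argument? */
bool picks_second(double x, double y)
{
    return average(x, y) == y;
}
```

With x = 1.0 and y = 0x1.0000000000001p0 (the next double above 1.0), picks_second(x, y) and picks_second(y, x) should both be true, while average(1.0, 3.0) is the ordinary midpoint 2.0.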
You could recognize the special case and pick the value you would like to return.
The values that are of interest are:
When the values have the same sign and exponent and differ only by one in the mantissa.
When the values have the same sign, the exponents differ by one, and the one with the larger exponent has a mantissa of 0 and the other a mantissa filled with ones.
In fact, if you are using IEEE-754 numbers (which you probably are) you can perform both tests at once (after checking for things like Zero, Inf, and Nan):
if ( repr1 + 1 == repr2
|| repr2 + 1 == repr1)
....
The reason for this is that the exponent is placed right next to the mantissa, and if the mantissa is all ones, the add will continue up into the exponent field.
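A possible way to obtain repr1 and repr2 in C is to copy the bytes of each double into a 64-bit integer. The sketch below assumes IEEE-754 binary64 doubles, same-sign inputs, and that zeros, infinities and NaNs were filtered out beforehand; bits_of and differ_by_one_step are illustrative names:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Copy the object representation of a double into an integer.
   memcpy avoids the aliasing problems of a pointer cast. */
static uint64_t bits_of(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* True when a and b are neighbouring doubles. A carry out of an all-ones
   mantissa flows into the exponent field, so the same test also covers
   the case where the exponents differ by one. */
bool differ_by_one_step(double a, double b)
{
    uint64_t ra = bits_of(a), rb = bits_of(b);
    return ra + 1 == rb || rb + 1 == ra;
}
```

For example, differ_by_one_step(0x1.fffffffffffffp0, 2.0) is true even though the two values have different exponents.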
However, talking about this, I would suggest another strategy. Instead of simply returning the second number, you could check the second least significant bit and decide if you would like to round up or down. That way the rounding errors would be evenly distributed.

Resources